Fighting HTML form bots

My site has had a contact form since it launched a couple of years ago. To my great surprise, almost no bots visited and submitted the form with spam for the first few years. For the past few months, however, I have been getting at least one spam message per day, with the usual offerings of free sex and money. This is actually good news, since it gives me the opportunity to experiment with ways to fight the bots and see what is most effective.

classic approach - (re)Captcha

A common way to prevent bots from submitting your forms is to include a field in your form that only humans can enter correctly. Originally, 'captcha' fields were popular, where the user has to copy a hard-to-read text from an image into a form field. As the spam bots got smarter, the text became so hard to read that it started preventing humans from being able to submit forms.

More recently, Google's reCAPTCHA became popular. The idea is nice, but the tag line 'Easy on humans, hard on bots' is not entirely true. Clicking endless series of traffic lights and fire hydrants is required before Google feels you have put in enough free AI-training work and lets you move on to the form submission.

[image: Machine Learning Captcha]

All in all, not the ideal user experience, so time to find better ways to fend off the bots.

first try - the hidden field

An interesting idea I came across is to add another field to your form that only bots will fill out. To prevent humans from entering data, you simply hide the field from view while keeping it in the HTML form. One way to achieve this would be to use an input marked as hidden: <input type="hidden">. However, this seems to be too easy for bots to detect, so I went with a normal input, hidden by CSS:
  <label for="nohuman" class="nohuman">Hidden (not for humans)</label> 
  <input type="text" name="nohuman" id="nohuman" class="nohuman" />
Accompanied by this CSS to hide it from human view:
.nohuman { 
    display: none; 
}
Once this setup is in place, form submissions can be filtered relatively easily by discarding any submission in which the nohuman field is filled in.
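
As an illustration, the server-side filter can be as small as this Python sketch; the is_spam helper and the form_data dictionary are hypothetical stand-ins for whatever your form handler receives:

def is_spam(form_data: dict) -> bool:
    # Honeypot check: humans never see the 'nohuman' field,
    # so any non-empty value means a bot filled out the form.
    return bool(form_data.get("nohuman", "").strip())

print(is_spam({"name": "Alice", "message": "Hi!", "nohuman": ""}))     # False
print(is_spam({"name": "b0t", "message": "spam", "nohuman": "spam"}))  # True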

the result: After running this setup for a week, the form was submitted 12 times by spam bots: 6 times with and 6 times without the hidden field filled in. So it has some effect, but no cigar.

second attempt - different hiding

So it looks like the bots noticed the field was not shown. Perhaps a different way of hiding the field will help. Instead of not displaying the field at all, I move it away from the visible part of the screen:
.nohuman { 
    position: absolute; 
    left: -100px; 
}
To prevent humans from accidentally tabbing into it, I set the tabindex attribute of the hidden input to -1, which takes it out of the tab order.

the result: After running this for a few weeks, the results are basically the same as with the first attempt: some bots fill out the field, some ignore it, so there is no reliable way to filter on this.

third time lucky?

Okay, the non-intrusive attempts did not cut it. My next try is just asking the question out loud: "Are you a robot?"
<label for="turingtest">Are you a robot?</label> 
<input type="text" required="required" id="turingtest" name="turingtest" 
  pattern="[ ]*(no|NO|No|nO)[ ]*" title="Humans can say 'no' here"> 
This form field is marked as required, and the only answer allowed is 'no' (in any capitalization). In a browser, a user would be forced to answer correctly before the form can be submitted. The most desirable outcome would be that the robots honestly say yes and are unable to submit the form. More likely, they would say nothing or anything and just submit the form, ignoring the required attribute. That would also be fine, as it gives a way to filter. The most interesting outcome would be bots actually denying their robotness in order to submit the form.
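
Since bots are free to ignore client-side validation, the same check has to be repeated on the server. A minimal sketch in Python, reusing the pattern from the form above:

import re

# Same pattern as in the HTML form; the browser enforces it client-side,
# but bots can skip that, so the server checks again.
TURING_PATTERN = re.compile(r"[ ]*(no|NO|No|nO)[ ]*")

def is_human_answer(answer: str) -> bool:
    return TURING_PATTERN.fullmatch(answer) is not None

print(is_human_answer(" No "))  # True
print(is_human_answer(""))      # False: the required attribute was ignored
print(is_human_answer("yes"))   # False: an honest robot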

the result: Interestingly, the amount of spam decreased by some 30%. My guess is that some bots could not handle the pattern attribute and just gave up, although the drop could also be a coincidence. In the remaining submissions, the field mostly contained nothing or a random string. However, about 10% of the bots supplied the value 'no', even though the rest of the message did not look very human to me.

attempt 4: no means no

With the "Are you a robot" question I can filter out over 90% of the spam bots. Still, the smartest bots can somehow figure out how to supply the desired answer. In decreasing order of likelihood I would say one of these things happens:

  1. the bot parses the regular expression and uses the pattern to supply the right answer
  2. the bot recognizes the question, as it may be common in public HTML forms
  3. the bot parses the title attribute, which contains a hint for the right answer
  4. some human intervention is used to find the right answer
If the bots parse the regex, one plausible implementation would be to take the first option of any 'choice' group in the regex. If that is the case, adding a fake option first could be a way to filter out even the cleverest of bots. Let's try to trick them like this:

<label for="turingtest">Are you a robot?</label> 
<input type="text" required="required" id="turingtest" name="turingtest" 
  pattern="[ ]*(R2D2|no|NO|No|nO)[ ]*" title="Humans can say 'no' here"> 

the result: No luck here. In the weeks this test ran, no bot picked R2D2, and many kept answering no, so regex parsing is unlikely (or the bots are even smarter than I feared).

attempt 5: harder question

With the previous attempts failing because the bots somehow found a way to guess the answer, let's make the question harder. By asking about the fluffiness of animals, it is unlikely the bots will come prepared. The new question should rule out the scenarios where the bot either recognizes the question or parses the title attribute to get to the answer.
<label for="turingtest">Which type of animal is the fluffiest: turtle, rabbit or snake? *</label> 
<input type="text" required="required" id="turingtest" name="turingtest" 
  pattern="[ ]*(jellyfish|turtle|rabbit|snake)[ ]*" title="Pick the fluffiest animal"/> 

the result: We have a winner! In more than a month, not a single bot managed to supply the right answer. Actually, none of the answers even matched the regex pattern: the field was either left empty or contained some random characters. So choosing an original question and checking the answer seems to do the job.

conclusion

For a simple website, a 'somewhat hard' question can filter out bots that are casually passing by. This approach probably won't work on more interesting, high-traffic sites, as bots will be trained to recognize the question or will enlist the help of humans to crack the problem.

All in all, a static question is a great low-tech alternative to endless image-clicking and other captcha madness. I can imagine that rotating a set of questions, or making the question dynamic in other ways, could make this scheme even more effective when needed.
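
As a sketch of that rotation idea: keep a small pool of question/answer pairs, render a random one into the form, and remember which one was asked (for example in the server-side session) so the answer can be verified on submission. The question pool below is hypothetical:

import random

# Hypothetical pool of question/answer pairs.
QUESTIONS = [
    ("Which type of animal is the fluffiest: turtle, rabbit or snake?", "rabbit"),
    ("Which of these is a color: seven, green or Tuesday?", "green"),
    ("How many legs does a spider have: six, eight or ten?", "eight"),
]

def pick_question() -> tuple[int, str]:
    # Store the returned index (e.g. in the session) so the answer
    # can be checked when the form comes back.
    index = random.randrange(len(QUESTIONS))
    return index, QUESTIONS[index][0]

def answer_is_correct(index: int, answer: str) -> bool:
    return answer.strip().lower() == QUESTIONS[index][1]

index, question = pick_question()
print(question)
print(answer_is_correct(index, QUESTIONS[index][1]))  # True: the right answer
print(answer_is_correct(index, "R2D2"))               # False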

