How you’re helping digitise the world’s resources with reCAPTCHA!

You're probably familiar with reCAPTCHA, the Google-owned system that enables websites to determine whether you are a human, or a virtual bot with bad intentions.

The system looks a bit like this: 

Picture1.png

And depending on your luck, you either get a straightforward, obvious word, or you get one of these:

Picture2.png

Although there is no science around this, it's highly likely that a significant number of neck and back injuries are caused by confusing reCAPTCHA images that force users to get as close to the screen as possible to work out where one letter finishes and the next one begins.

At its worst, it is simply a matter of guessing and hoping that the next one will be easier. The next time you are almost climbing inside your screen trying to work out whether you are reading a 'j' or an 'i' with a weird tail, remind yourself of the service you are doing for mankind, or at the very least, Google.

You see, all those weird-looking words come from out-of-date manuscripts, out-of-print books, newspapers and rare documents whose text is often difficult to make out, and virtually illegible if you're a computer or an artificial intelligence. But how does the system know that the computer won't be able to read it? Because the most effective reading system in the world has already tried and failed.

Here's how it works: the system, presumably housed in one of Google's snazzy offices and in close proximity to a foosball table, scans a collection of physical documents, for example the archives of the New York Times, and comes across a word that it isn't certain about. Depending on the quality of the book or manuscript, this may be one word every six pages, or the entire book. An image of that word is then sent out as a reCAPTCHA, essentially getting you to do the work that the computer can't.
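For the technically curious, that first step can be sketched in a few lines of Python. This is only an illustration of the idea described above, not Google's actual code: the `OcrWord` type, the confidence scores and the `0.9` threshold are all made-up assumptions.

```python
from typing import NamedTuple


class OcrWord(NamedTuple):
    """One word as read by the scanner (hypothetical structure)."""
    text: str          # the scanner's best guess at the word
    confidence: float  # 0.0 (no idea) to 1.0 (certain); assumed scale


def words_needing_humans(words, threshold=0.9):
    """Return the words the scanner is not sure about.

    The threshold is an illustrative assumption, not a documented value.
    """
    return [w for w in words if w.confidence < threshold]


page = [
    OcrWord("the", 0.99),
    OcrWord("cre?t?on", 0.41),  # smudged word, low confidence
    OcrWord("of", 0.97),
]
uncertain = words_needing_humans(page)
```

Here only the smudged word falls below the threshold, so it is the one that would be handed to a human.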

The word appears on your screen, and you enter it as you see it. Of course, sometimes you are wrong and so are presented with a different word until you finally get it right. The system is happy to do this because, at this stage anyway, artificial intelligence isn't at a point where it can make educated assumptions, but you, as a human, are. Two attempts later you finally get the word right and gain access to the website. Meanwhile, your correct entry is making its way back to reCAPTCHA, which is deciding what the word says and adding it to the scanned book in the Google Books database.

Of course, this raises additional questions, such as how the system knows you got the word right if it doesn't know what the word is. Quite simply, it knows some of the letters and allows you enough flexibility to determine whether you are a human or not. For example, if the word is 'creation', and reCAPTCHA has determined that the first three letters are 'cre' and the final letter is 'n', it will accept answers that fit those letters, within a margin based on probability.
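As a rough sketch of that idea, here is how such a check might look in Python. Everything in it is an assumption made for illustration: the function name `is_plausible`, the idea of an approximate word length, and the tolerance of one letter are not taken from reCAPTCHA itself.

```python
def is_plausible(guess, prefix, suffix, approx_length, tolerance=1):
    """Check a typed answer against the letters the scanner did manage to read.

    prefix / suffix: letters the scanner is confident about (assumed inputs).
    approx_length:   the scanner's estimate of the word length (assumed).
    tolerance:       how many letters the length may be off by (assumed).
    """
    word = guess.strip().lower()
    if abs(len(word) - approx_length) > tolerance:
        return False
    if len(word) < len(prefix) + len(suffix):
        return False  # too short for prefix and suffix not to overlap
    return word.startswith(prefix) and word.endswith(suffix)


# The example from the text: 'cre' at the start, 'n' at the end, about 8 letters.
print(is_plausible("creation", "cre", "n", 8))   # fits the known letters
print(is_plausible("banana", "cre", "n", 8))     # does not fit
print(is_plausible("cremation", "cre", "n", 8))  # also fits, one letter longer
```

Note that a plausible-but-wrong answer such as 'cremation' passes this kind of check, so a check like this can only tell the system that a human probably typed it, not that the answer is right.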

That's fine, but what if you guess the word and it falls within acceptable parameters for the system, but you guessed wrong? Perhaps these important documents have sentences like, "I pulled the sheet off the painting and proudly showed off my cremation." The system has safeguards for this too, using checks and balances in the form of a statistical measurement process, which ensures that a word isn't accepted as read until it has been corroborated by a certain number of users. Of course, the system can accommodate this kind of rigorous process because of the sheer number of hits reCAPTCHA gets each day.
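The corroboration step can be pictured as a simple vote. The Python below is a minimal sketch under stated assumptions: the function name `accepted_reading`, the minimum of three matching votes and the 60% share are invented for the example, and the real statistical process is not described in any detail here.

```python
from collections import Counter


def accepted_reading(answers, min_votes=3, min_share=0.6):
    """Return the agreed reading of a word, or None if users don't agree yet.

    min_votes and min_share are illustrative thresholds, not real values.
    """
    normalized = [a.strip().lower() for a in answers if a.strip()]
    if not normalized:
        return None
    word, votes = Counter(normalized).most_common(1)[0]
    if votes >= min_votes and votes / len(normalized) >= min_share:
        return word
    return None  # keep showing the word to more people


early = accepted_reading(["creation", "cremation"])
later = accepted_reading(["creation", "creation", "cremation", "creation"])
```

With only two conflicting answers, nothing is accepted. Once three of four users agree on 'creation', the lone 'cremation' is outvoted.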

So the next time you are pressing your nose against the screen, struggling with a reCAPTCHA, remember the important work you're doing: ensuring the world's most important and irreplaceable documents aren't lost in the digital age.
