
A Curious Case of Captcha on EPF

A captcha is a small image containing a code (above) that is used to tell humans apart from machines. The code visible in the captcha image is supposed to make sense only to humans.

Since these images can only make sense to humans, captchas are often placed at login screens in order to stop bots or machines from automatically logging in or sending bulk requests in one go (which may sometimes lead to denial of service and server crashes).

Here's a screenshot of the login portal of the Employees' Provident Fund Organisation, India, which employees all over India use to access their passbooks and check on their provident fund:

Hmm. The captcha looks pretty clean. I wonder if clean captchas could be machine readable. Let's see how it works.

One look at the source code and you'll find something's not right.

'captcha.jpg?' + Math.random();

It's sort of an API call, with the Math.random() function deciding which captcha should come next.

Predicting Random Numbers

One thing that you might want to know about Math.random() is:

Math.random() does not provide cryptographically secure random numbers. Do not use them for anything related to security. Use the Web Crypto API instead, and more precisely the window.crypto.getRandomValues() method.
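The same split between a general-purpose generator and a cryptographically secure one exists in Python, so here is a small illustration of the warning above. This is only an analogy: Python's `random` module uses a different algorithm from any browser's Math.random(), and the function names `seeded_sequence` and `unpredictable_token` are made up for this sketch.

```python
import random
import secrets


def seeded_sequence(seed: int, count: int) -> list:
    """Return `count` floats from a general-purpose PRNG.

    The sequence is fully determined by the internal state (here, the seed),
    which is why such generators are unsuitable for anything security related.
    """
    rng = random.Random(seed)
    return [rng.random() for _ in range(count)]


def unpredictable_token(num_bytes: int = 16) -> str:
    """Return a hex token drawn from the operating system's secure source."""
    return secrets.token_hex(num_bytes)


if __name__ == "__main__":
    # Same state in, same "random" numbers out.
    print(seeded_sequence(1234, 3) == seeded_sequence(1234, 3))
    # A secure token has no seed an outsider could replay.
    print(unpredictable_token())
```

The first function gives the same numbers every time for the same seed, which is the property a predictor exploits. The second is the kind of source the quoted advice points to.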

Now, I don't have a background in cryptography, but that sounds like its output will be easier to predict, because it does not provide cryptographically secure random numbers.

A few minutes of googling and I came across the work of Daniel Simmons on how the random function actually works, and Douglas Goddard's code on GitHub that lets me predict the numbers the random function will produce.

Turns out, Math.random() uses the 'XorShift128+' algorithm to generate its random numbers, and the exact implementation depends on the browser you're using. That means the value that goes in place of Math.random() is generated on your side, in your browser, and it is browser dependent.
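To make the algorithm less abstract, here is a minimal Python sketch of one xorshift128+ state update, using the shift constants 23, 17 and 26. How the 128-bit state is turned into a number between 0 and 1 differs between browsers and versions, so the `to_unit_double` conversion below (top 52 bits of the first state word) is an assumption for illustration, not a claim about any particular engine. All function names are my own.

```python
MASK64 = (1 << 64) - 1  # keep every value inside 64 bits, like a uint64


def xorshift128plus_step(state0: int, state1: int) -> tuple:
    """Advance the 128-bit state (two 64-bit words) by one step."""
    s1, s0 = state0, state1
    s1 ^= (s1 << 23) & MASK64
    s1 ^= s1 >> 17
    s1 ^= s0
    s1 ^= s0 >> 26
    return s0, s1  # new state0 is the old state1, new state1 is the mix


def to_unit_double(state0: int) -> float:
    """Assumed conversion: use the top 52 bits as the fraction of a double."""
    return (state0 >> 12) / float(1 << 52)


def generate(state0: int, state1: int, count: int) -> list:
    """Produce `count` values in [0, 1) starting from the given state."""
    values = []
    for _ in range(count):
        state0, state1 = xorshift128plus_step(state0, state1)
        values.append(to_unit_double(state0))
    return values


if __name__ == "__main__":
    print(generate(0x1234567890ABCDEF, 0x0FEDCBA098765432, 3))
```

There is no secret ingredient in the update: it is only shifts and XORs on the previous state. Anyone who learns the two state words can run the generator forward and get every future value.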

The code I got from GitHub (which had to be converted to Python 3 because I don't run Python 2) was fortunately really flexible and covered the browser I used. Now I just needed 3 previously generated random numbers and I was good to go.

Getting the numbers was easy. A little Ctrl+Shift+I and I got:

When I ran the script (most of my time went into building Z3), I found that it was pretty accurate from the beginning.
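For a rough idea of what a Z3-based predictor does, here is a simplified sketch of my own; it is not the script from GitHub. It assumes a toy model where each observed number is the top 52 bits of the first state word after one xorshift128+ step, and where the numbers are seen in the order they were generated. Real browsers differ on both points, so treat the names `mantissa_of` and `recover_state` and the model itself as hypothetical. It needs the third-party z3-solver package.

```python
def mantissa_of(x: float) -> int:
    """Map a double in [0, 1) back to the 52-bit integer it is assumed to encode."""
    return int(x * (1 << 52))


def recover_state(observed):
    """Ask Z3 for a starting state that reproduces the observed outputs.

    Returns (state0, state1) before the first observed number, or None if
    the solver finds no state that fits the toy model.
    """
    # Imported here so the helper above works without Z3 installed.
    from z3 import BitVec, LShR, Solver, sat

    start0, start1 = BitVec("state0", 64), BitVec("state1", 64)
    solver = Solver()
    s0, s1 = start0, start1
    for value in observed:
        # Symbolic version of one xorshift128+ step (constants 23, 17, 26).
        mixed = s0
        s0 = s1
        mixed = mixed ^ (mixed << 23)
        mixed = mixed ^ LShR(mixed, 17)   # LShR is a logical (unsigned) shift
        mixed = mixed ^ s0
        mixed = mixed ^ LShR(s0, 26)
        s1 = mixed
        # Constrain the visible 52 bits to match what was observed.
        solver.add(LShR(s0, 12) == mantissa_of(value))
    if solver.check() != sat:
        return None
    model = solver.model()
    return (
        model.eval(start0, model_completion=True).as_long(),
        model.eval(start1, model_completion=True).as_long(),
    )
```

Once a state comes back, stepping the same update forward past the observed values gives the candidate future numbers. Under this toy model three observations are typically enough to pin the state down, which matches the three numbers the script asked for.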

Below are the random numbers generated for future captchas.

Reading Captcha Programmatically

So there was another search, and I found Tesseract, a pretty good open-source OCR engine that Google sponsored for years, which has a Python wrapper called pytesseract. Using that, I can read text off an image.

I downloaded a captcha image and set up Tesseract.

Now it was just a matter of one command.

pytesseract -l eng ~/Desktop/captcha.jpg

and… it was a perfect match.
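The same read can be done from inside a Python script, which is what you would need to chain it with anything else. Below is a minimal sketch, assuming Tesseract, pytesseract and Pillow are installed. The helper names `read_captcha` and `clean_captcha_text` are my own, and stripping everything except letters and digits is an assumption about what the captcha contains.

```python
import re


def clean_captcha_text(raw: str) -> str:
    """Drop whitespace and stray symbols that OCR output tends to carry."""
    return re.sub(r"[^A-Za-z0-9]", "", raw)


def read_captcha(image_path: str, lang: str = "eng") -> str:
    """Run Tesseract on an image file and return the cleaned-up text."""
    # Imported here so the cleaning helper works without the OCR stack.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        raw = pytesseract.image_to_string(image, lang=lang)
    return clean_captcha_text(raw)


if __name__ == "__main__":
    import os

    print(read_captcha(os.path.expanduser("~/Desktop/captcha.jpg")))
```

A clean, undistorted captcha like this one needs no preprocessing. Noisier images would usually need thresholding or denoising before OCR gets anywhere.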

Fun Fact: Captcha actually stands for Completely Automated Public Turing test to tell Computers and Humans Apart.

Using code already available on the internet, I was able to read the captcha programmatically, which defeats the purpose of a captcha entirely. I could also extend this script to send bulk login requests to EPF Org's servers. I wonder what would happen. Would they be able to handle the load?

I don't know how deep the repercussions go, but that captcha is not there because captchas are trending. It is a security practice and should be taken seriously, especially on critical websites like one that lets employees all over the country keep watch over their money.

Something to think about.