Mike Rooney

programming and philosophy

Cracking On-Screen Keyboards With Visual Keyloggers

A few financial sites, including HSBC and the US Treasury, have recently added an extra measure of security to their sites. Instead of simply requiring a username and password, they now present an on-screen keyboard and require you to “type” in a second password with your mouse:

The logic behind this is that if a user’s computer becomes compromised with a keylogger, the attacker could only obtain the username and primary password. The secondary password would remain uncompromised as it doesn’t involve keypresses. This didn’t seem too useful to me, however, so for my “Image Understanding” class I decided to see if it was possible to create a “visual keylogger” which could capture this secondary password. It wasn’t too difficult, and essentially demonstrated that the extra password was more inconvenience than security. Let me outline the basic process.

In order to do this, you need to be able to capture the contents of the screen at certain intervals. It seems like a fair assumption that if you (as the attacker of a compromised system) can capture keyboard input, you can also grab screenshots. The goal is to turn a sequence of these screenshots of someone typing with an on-screen keyboard into a single string output equivalent to the password typed.

First we want to record the position of the mouse at each shot. Normally this would be trivial: just ask the OS. In my case, however, I was writing this for an Image Understanding class and had to use the sequence of images as my sole input. As such, I used a basic template-matching approach to locate the mouse cursor by a few of its key features. This was surprisingly robust; however, asking the OS for the mouse position is an easier, even more robust, and more likely attack vector in real life.
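As a rough illustration of locating the cursor from pixels alone, here is a minimal sketch that slides a small cursor template over a grayscale frame and keeps the position with the lowest sum of absolute differences. The function name `locate_cursor`, the nested-list image representation, and the choice of this particular matching score are all assumptions for the example, not the code used in the class project.

```python
def locate_cursor(frame, template):
    """Return the (row, col) of the top-left corner where `template`
    best matches `frame`, scored by sum of absolute differences.

    Both arguments are grayscale images stored as lists of rows of
    integers (0-255). This representation is an assumption of the sketch.
    """
    frame_h, frame_w = len(frame), len(frame[0])
    tmpl_h, tmpl_w = len(template), len(template[0])
    best_pos, best_score = None, None
    for r in range(frame_h - tmpl_h + 1):
        for c in range(frame_w - tmpl_w + 1):
            # Lower score means the window looks more like the template.
            score = sum(
                abs(frame[r + i][c + j] - template[i][j])
                for i in range(tmpl_h)
                for j in range(tmpl_w)
            )
            if best_score is None or score < best_score:
                best_pos, best_score = (r, c), score
    return best_pos
```

A brute-force scan like this is slow on full-size screenshots; an image library's template-matching routine would be the practical choice, but the idea is the same.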

Now we need to figure out when the user clicked a key. Any keyboard used for a password purpose is going to give some form of feedback when a key is clicked, such as an asterisk in a password field, so the user knows if they have successfully clicked a key. The easiest way then to notice this is to subtract the color values of each screenshot from the previous one, giving you a new image with non-zero pixel values for each changed pixel. Among other things like cursor movement and web animations, the aforementioned asterisk feedback is going to be present in this image.
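The subtraction step might look something like the sketch below. It assumes grayscale frames stored as nested lists and assumes the attacker already knows roughly where the password field sits on screen; `diff_frames`, `region_changed`, and the `region` tuple are hypothetical names for illustration.

```python
def diff_frames(prev, curr):
    """Per-pixel absolute difference of two same-sized grayscale frames.
    Non-zero values mark pixels that changed between the two shots."""
    return [
        [abs(a - b) for a, b in zip(prev_row, curr_row)]
        for prev_row, curr_row in zip(prev, curr)
    ]


def region_changed(diff, region, min_pixels=1):
    """Return True if at least `min_pixels` pixels changed inside `region`.

    `region` is (top, left, bottom, right) with bottom/right exclusive.
    It stands in for the password field, whose location is assumed known.
    """
    top, left, bottom, right = region
    changed = sum(
        1
        for r in range(top, bottom)
        for c in range(left, right)
        if diff[r][c] != 0
    )
    return changed >= min_pixels
```

Restricting the check to the feedback area is one way to ignore cursor movement and page animations elsewhere in the difference image.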

For each new image then, subtract and look for this feedback. If it’s there, that’s a key press! Combine this with the position of the mouse and you know where the user clicked. Now it gets slightly tricky. You know where they clicked, but if you grab that section of the screen, you’ll get something like this:

because the mouse had to be over the key to click it. This is rather easily worked around, however, by going backwards in your mouse position cache until it is a certain threshold away from the clicked position, and grabbing the key image at that point.
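A minimal sketch of that backward search, assuming the mouse positions are cached as one (x, y) tuple per screenshot. The name `unobscured_frame_index` and the `min_distance` parameter are made up for the example.

```python
import math


def unobscured_frame_index(positions, click_index, min_distance):
    """Walk backwards from the click and return the index of the most
    recent frame in which the cursor was at least `min_distance` pixels
    away from where the click happened. Returns None if the cursor never
    got that far away in the cached history.

    `positions` holds one (x, y) tuple per captured frame.
    """
    click = positions[click_index]
    for i in range(click_index - 1, -1, -1):
        if math.dist(positions[i], click) >= min_distance:
            return i
    return None
```

The key image is then cropped from the screenshot at the returned index, at the location of the click.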

After the user enters the complete password, you are going to be left with an array of keyboard images. For any human, this is quite sufficient. For my class however, it was not, and it would not be for any large-scale operation where automation is desired. What we need to do is clean it up by throwing away any pixels under a certain darkness threshold, then cropping the result:
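The cleanup could be sketched as follows. This assumes a dark glyph on a lighter key, so only pixels darker than the threshold are kept; the threshold value of 128 and the function names are arbitrary choices for illustration.

```python
def binarize(image, threshold=128):
    """Map a grayscale image (rows of 0-255 ints) to 0/1 values:
    1 for pixels darker than `threshold` (assumed to be the glyph),
    0 for everything else."""
    return [[1 if px < threshold else 0 for px in row] for row in image]


def crop_to_content(binary):
    """Crop a 0/1 image to the bounding box of its 1-pixels.
    Returns an empty list if no pixel is set."""
    rows = [r for r, row in enumerate(binary) if any(row)]
    if not rows:
        return []
    cols = [c for c in range(len(binary[0])) if any(row[c] for row in binary)]
    return [
        row[cols[0]:cols[-1] + 1]
        for row in binary[rows[0]:rows[-1] + 1]
    ]
```

For example, a 4x4 key image with an L-shaped dark glyph in the middle reduces to a 2x2 bitmap containing only the glyph.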

Ta-da! Now we have something that any OCR (optical character recognition) algorithm should be able to chomp through in its sleep cycles. If you are writing for a specific keyboard, you can also just have an array of what each key looks like in binary form and compare to get the answer.
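For the known-keyboard case, the comparison can be as simple as counting mismatched pixels against each stored bitmap. The sketch below assumes every key has already been reduced to a cropped 0/1 bitmap; `match_key` and the `known_keys` dictionary are hypothetical names.

```python
def match_key(glyph, known_keys):
    """Return the label in `known_keys` whose bitmap differs from `glyph`
    in the fewest pixels, or None if no stored bitmap has the same size.

    `glyph` is a cropped 0/1 bitmap (list of rows); `known_keys` maps a
    label such as "A" to a bitmap of that key.
    """
    best_label, best_mismatch = None, None
    for label, bitmap in known_keys.items():
        # Only compare bitmaps with identical dimensions.
        if len(bitmap) != len(glyph) or any(
            len(a) != len(b) for a, b in zip(bitmap, glyph)
        ):
            continue
        mismatch = sum(
            x != y
            for row_a, row_b in zip(glyph, bitmap)
            for x, y in zip(row_a, row_b)
        )
        if best_mismatch is None or mismatch < best_mismatch:
            best_label, best_mismatch = label, mismatch
    return best_label
```

Picking the closest match rather than demanding an exact one leaves a little tolerance for stray pixels that survive the thresholding.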

And there you have it! With the combination of a few basic computer vision techniques, we can expand a keylogger to understand input from visual keyboards and render this security annoyance useless. A fun note is that the order/position of the keys is irrelevant. The US Treasury website uses an on-screen keyboard as well, but shuffles the keys on each attempt. As is hopefully obvious from this algorithm, there is no assumption of a keyboard layout; the keys could shuffle on every single click and it wouldn’t matter.


Anonymous: I am not quite sure how that relates to on-screen keyboards, could you clarify? I have never used an on-screen keyboard to log in to a Linux box.
And if the visual “text box” doesn’t actually change at all?

(Ex., when logging into Linux, the password: cursor never moves, so you can’t deduce A) when keys are pressed by vision and B) how long the password is).
ktzar: thanks for your comments but I am not sure how closely you read the post as both of those issues are addressed :) If the letter disappears when you hover or click, that wouldn’t matter as it goes back in time to find the letter since your cursor obscures it anyway. Scrambling on each click is also mentioned as something this algorithm wouldn’t care about!
What if the symbol within the button disappears just when you click on it? Or the symbols get scrambled after each click? Those are implementations I’ve seen ;)
A while ago I wrote about a possible approach that would make such attacks harder:


By having each key carry more than one type of data, capturing a single “password” would actually only get you one possible password in a larger family of possibilities.

It wouldn’t stand up to repeated captures, but if this technique was also used for the username and/or primary password fields (by also showing a coloured on-screen keyboard), it would take more effort to crack all the parts required. Not supercomputer effort, but perhaps enough to send the attacker off to softer targets.
Thanks for sharing! Because they can create money ‘out of thin air’ it is not a problem i can borrow a little more money of them :-)
(http://jessescrossroadscafe.blogspot.com/ for more info)