June 16, 2011

Character occurrence in passwords

Filed under: Computing, Geekery, R — Tags: , , , , , — csgillespie @ 12:52 pm

As everyone knows, it seems that Sony is taking a bit of a battering from hackers.  Thanks to Sony, numerous account and password details are now circulating on the internet. Recently, Troy Hunt carried out a brief analysis of the password structure. Here is a summary of his post:

  • There were around 40,000 passwords, of which 8000 would fail a simplistic dictionary attack;
  • Only 1% of passwords contained non-alphanumeric passwords;
  • 93% of passwords were between 6 and 10 characters.

In this post, we will investigate the remaining 32,000 passwords that passed the dictionary attack.

Distribution of characters

As Troy points out, the vast majority of passwords only contained a single type, i.e. all lower or all upper case. However, it turns out that things get even worst when we look at character frequency.

In the password database, there are a 78 unique characters. So if passwords were truly random, each character should occur with probability 1/78 = 0.013. However when we calculate the actual password occurrence, we see that it clearly isn’t random. The following figure shows the top 20 password characters, with the red line indicting 1/78.

Unsurprisingly, the vowels “e”, “a” and “o” are very popular, with the most popular numbers being 1,2, and 0 (in that order). No capital letters make it into the top twenty. We can also construct the cumulative probability plot for character occurrence. In the following figure, the red dots show the pattern we would expect if the passwords were truly random (link to a larger version of the plot):

Clearly, things aren’t as random as we would like.

Character order

Let’s now consider the order that the characters appear. To simplify things, consider only the eight character passwords. The most popular number to include in a password is “1”. If placement were random, then in passwords containing the number “1” we would expect to see the character evenly distributed. Instead, we get:

   ##Distribution of "1" over eight character passwords
   0.06 0.03 0.04 0.04 0.13 0.13 0.22 0.34

So in around of 84% of passwords that contain the number “1”, the number appears only in the second half of the password. Clearly, people like sticking a number “1” towards the end of their password.

We get a similar pattern with “2”:

   0.05 0.05 0.04 0.05 0.13 0.11 0.30 0.27

and with “!”

   #Small sample size here
   0.00 0.00 0.00 0.00 0.00 0.11 0.16 0.74

We see similar patterns with other alpha-numeric characters.

Number of characters needed to guess a password

Suppose we constructed all possible passwords using the first N most popular characters. How many passwords would that cover in our sample? The following figure shows proportion of passwords covered in our list using the first N characters:

To cover 50% of passwords in all list, we only need to use the first 27 characters. In fact, using only 20 characters covers around 25% of passwords, while using 31 characters covers 80% of passwords. Remember, these passwords passed the dictionary attack.


Typically when we calculate the probability of guessing a password, we assume that each character is equally likely to be chosen, i.e. the probability of choosing “e” is the same as choosing “Z”. This is clearly false. Also, since many systems now force people to have different character types in their password, it is too easy for users just to tack on a number as their final digit. I don’t want to go into how to efficiently explore “password space”, but it’s clear that a brute force search isn’t the way to go.

Personally, I’ve abandoned trying to remember passwords a long time ago, and just use a password manager. For example, my wordpress password is over 12 characters and consists of a completely random mixture of alphanumeric and special characters. Of course, you just need to make sure your password manager is secure….

May 25, 2011

Statistical podcast: random number seeds

Filed under: Computing, Geekery — Tags: , , , , — csgillespie @ 10:39 pm

One of the podcasts I listen to each week is Security Now! Typically, this podcast has little statistical content, as its main focus is computer security, but episode 301 looks at how to generate truly random numbers for seeding pseudo random number generators.

Generating truly random numbers to be used as a seed, turns out to be rather tricky. For example, in the Netscape browser, the random seed used by version 1.0 of the SSL protocol combined the time of day and the process number to seed its random number generator. However, it turns out that the process number is usually a small subset of all possible ids, and so is fairly easy to guess.

Recent advances indicate that we can get “almost true” randomness by taking multiple snap shorts of the processor counter. Since the counter covers around 3 billion numbers each second, we can use the counter to create a true random seed.

To find out more, listen to the podcast. The discussion on random seeds begins mid-way through the podcast.

January 14, 2011

Statistical podcast: Random and Pseudorandom

Filed under: Geekery, R — Tags: , , , , — csgillespie @ 10:01 am

This morning when I downloaded the latest version of In our time, I was pleased to see that this weeks topic was “Random and Pexkcd Random cartoonudorandom.” If you’re not familiar with “In our time”, then I can I definitely recommend the series. Each week three academics and Melvyn Bragg discuss a particular topic from history, science, philosophy, or religion. This weeks guests were Prof Marcus du Sautoy, Dr Colva Roney-Dougal and Prof Timothy Gowers. The discussion is aimed at the general public, but the phrase “dumbing down” certainly doesn’t apply! For example,  the introductory statement to the Mathematics episode is

Hello, Galileo wrote, “This grand book, the universe is written in the language of mathematics”. It was said before Galileo, and has been said since, and in the last decade of the 20th century, it’s being said again, most emphatically. So how important is maths in relation to other sciences at the end of the 20th century? What insight can it give us into the origins of life and the functioning of our brains? And what does it mean to say that mathematics has become more visual?


December 27, 2010

Problems with your Blackberry

Filed under: Geekery — Tags: , , , , — csgillespie @ 7:39 pm

I only caught part of the one Ronnie show this year, but it was a classic. The best sketch I watched involved fixing your Blackberry:

For non-UK people, orange is a mobile service provider. The Xbox 360 joke is brilliant!

December 8, 2010

LaTeX doodle pad

Filed under: Geekery, latex — Tags: , — csgillespie @ 8:59 pm

Every so often you’re typing away at a tex document and you have forgotten the latex command. You don’t have any latex books with you and using Google to find a latex symbol is a lesson in frustration. Don’t worry detexify is at hand. Just doodle your symbol in the box and up pops a few suggestions.

Here’s my attempt at drawing nleftarrow

Very useful!

December 7, 2010

DNS Spoofing

Filed under: Geekery — Tags: , , , — csgillespie @ 3:34 pm

Midway through 2008 a new (serious) internet vulnerability was discovered by Dan Kaminsky. Dan realised that there was a serious flaw in the way the Internet’s domain name system works. This flaw was critical, because it allowed the bad guys to redirect users to malicious sites without detection.

What is DNS

The Domain name system (DNS) converts easy to remember web addresses into their numerical IP counter parts. For example, rather than trying to remember, we just need to remember www.ncl.ac.uk. So when you type an URL into your address bar, you first ask your DNS resolver if it knows the IP address. If the resolver doesn’t know it will ask a DNS server for directions.

In the good old days, no-one worried about security and consequently the DNS protocol was lax. First, all queries to the DNS server were typically through the same port. Second, each request had a unique ID (from 1- 65,536), however, the ids were generated in a completely predicable manner, i.e. 1, 2, 3, ….65536. This meant that an attacker could respond to the resolver (with a fake IP address) before the DNS server got a chance.

The obvious solution to this is for each request to use a (pseudo) random port and id.

Checking your DNS address

All this is old news, and most DNS resolvers should have been updated – however a few haven’t. To test your whether your DNS resolver is doing something sensible, run the DNS spoofing test provided by grc. Essentially, the test makes a few thousand consecutive calls to your DNS resolver and stores the id and port number of the request.

The test takes a few minutes and is completely painless. At the end you get a few graphs showing your port and id numbers.  My ID distribution graph seems to indicate that the resolver that I’m using is issuing random query transaction IDs.

Create a free website or blog at WordPress.com.