# Why?

## June 16, 2011

### Character occurrence in passwords

Filed under: Computing, Geekery, R. Posted by csgillespie @ 12:52 pm

It seems that Sony is taking a bit of a battering from hackers. Thanks to Sony, numerous account and password details are now circulating on the internet. Recently, Troy Hunt carried out a brief analysis of the password structure. Here is a summary of his post:

• There were around 40,000 passwords, of which 8,000 would fail a simplistic dictionary attack;
• Only 1% of passwords contained non-alphanumeric characters;
• 93% of passwords were between 6 and 10 characters.

In this post, we will investigate the remaining 32,000 passwords that passed the dictionary attack.

## Distribution of characters

As Troy points out, the vast majority of passwords only contained a single character type, i.e. all lower case or all upper case. However, it turns out that things get even worse when we look at character frequency.

In the password database, there are 78 unique characters. So if passwords were truly random, each character should occur with probability 1/78 ≈ 0.013. However, when we calculate the actual character occurrence, we see that it clearly isn’t random. The following figure shows the top 20 password characters, with the red line indicating 1/78.

Unsurprisingly, the vowels “e”, “a” and “o” are very popular, with the most popular numbers being 1, 2 and 0 (in that order). No capital letters make it into the top twenty. We can also construct the cumulative probability plot for character occurrence. In the following figure, the red dots show the pattern we would expect if the passwords were truly random:

Clearly, things aren’t as random as we would like.
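The comparison above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used for the figures: the function names and the three-password toy list are made up, and the leaked data is not reproduced here. It ranks characters by observed proportion and builds the cumulative curve alongside the straight line we would expect if every character were equally likely.

```python
from collections import Counter

def char_frequencies(passwords):
    """Return (character, proportion) pairs, most common first."""
    counts = Counter("".join(passwords))
    total = sum(counts.values())
    return [(ch, n / total) for ch, n in counts.most_common()]

def cumulative(freqs):
    """Running total of the ranked proportions."""
    out, running = [], 0.0
    for _, p in freqs:
        running += p
        out.append(running)
    return out

# Hypothetical toy list for illustration only, NOT the leaked passwords
passwords = ["passw0rd1", "letmein12", "Qx7!vB2z"]

freqs = char_frequencies(passwords)
uniform = 1 / len(freqs)            # the "red line": 1 / number of unique characters
observed_cdf = cumulative(freqs)
uniform_cdf = [(i + 1) * uniform for i in range(len(freqs))]

for (ch, p), obs, exp in zip(freqs[:5], observed_cdf, uniform_cdf):
    print(f"{ch!r}: p={p:.3f}  cumulative={obs:.3f}  uniform={exp:.3f}")
```

Because the characters are ranked from most to least common, the observed cumulative curve can never fall below the uniform line; the gap between the two is what the plot makes visible.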

## Character order

Let’s now consider the order in which the characters appear. To simplify things, consider only the eight character passwords. The most popular number to include in a password is “1”. If placement were random, then in passwords containing the number “1” we would expect to see the character evenly distributed across the eight positions. Instead, we get:

```
## Distribution of "1" over eight character passwords
0.06 0.03 0.04 0.04 0.13 0.13 0.22 0.34
```

So in around 84% of passwords that contain the number “1”, the number appears only in the second half of the password. Clearly, people like sticking a number “1” towards the end of their password.

We get a similar pattern with “2”:

`   0.05 0.05 0.04 0.05 0.13 0.11 0.30 0.27`

and with “!”:

```
# Small sample size here
0.00 0.00 0.00 0.00 0.00 0.11 0.16 0.74
```

We see similar patterns with other alpha-numeric characters.
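A positional breakdown like the ones above could be computed along these lines. This is a minimal sketch under stated assumptions: `position_distribution` is a hypothetical helper, the passwords are a made-up toy list, and it counts every occurrence of the character, which may differ slightly from how the numbers in this post were tallied.

```python
def position_distribution(passwords, char, length=8):
    """Share of occurrences of `char` at each position,
    using only passwords of exactly `length` characters."""
    counts = [0] * length
    for pw in passwords:
        if len(pw) != length:
            continue  # keep only the fixed-length passwords
        for i, ch in enumerate(pw):
            if ch == char:
                counts[i] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * length
    return [c / total for c in counts]

# Hypothetical toy list for illustration only; "short1" is skipped (6 characters)
passwords = ["monkey12", "1passwrd", "dragon01", "abc1defg", "short1"]

dist = position_distribution(passwords, "1")
second_half = sum(dist[4:])
print(dist, second_half)
```

If placement were random, each of the eight entries would be close to 1/8 = 0.125 on a large enough sample.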

## Number of characters needed to guess a password

Suppose we constructed all possible passwords using the first N most popular characters. How many passwords would that cover in our sample? The following figure shows the proportion of passwords in our list covered using the first N characters:

To cover 50% of passwords in our list, we only need to use the first 27 characters. In fact, using only 20 characters covers around 25% of passwords, while using 31 characters covers 80% of passwords. Remember, these passwords passed the dictionary attack.
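One way to compute this coverage curve is sketched below. The names and the toy list are illustrative assumptions, not the original analysis code. The idea is that a password is covered by the top N characters exactly when its rarest character has rank N or better.

```python
from collections import Counter

def coverage_by_top_n(passwords):
    """For N = 1..(number of unique characters), return the proportion of
    passwords made up entirely of the N most popular characters."""
    counts = Counter("".join(passwords))
    ranked = [ch for ch, _ in counts.most_common()]
    rank = {ch: i + 1 for i, ch in enumerate(ranked)}
    # Rank of the rarest character in each (non-empty) password
    needed = [max(rank[ch] for ch in pw) for pw in passwords if pw]
    return [sum(n <= k for n in needed) / len(needed)
            for k in range(1, len(ranked) + 1)]

# Hypothetical toy list for illustration only
passwords = ["aaa", "aab", "abc"]

coverage = coverage_by_top_n(passwords)
print(coverage)
```

Here “a” is the most popular character, then “b”, then “c”, so one, two and three characters cover one, two and all three of the toy passwords respectively.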

## Summary

Typically when we calculate the probability of guessing a password, we assume that each character is equally likely to be chosen, i.e. the probability of choosing “e” is the same as choosing “Z”. This is clearly false. Also, since many systems now force people to have different character types in their password, it is too easy for users just to tack on a number as their final character. I don’t want to go into how to efficiently explore “password space”, but it’s clear that a naive brute force search that treats every character as equally likely isn’t the way to go.
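To get a rough feel for the numbers, here is a back-of-the-envelope comparison of search-space sizes for eight character passwords. It is purely illustrative: the 27-character figure above refers to the whole list rather than to eight character passwords specifically, and a search restricted to those characters can only find passwords built entirely from them.

```python
def search_space(alphabet_size, length):
    """Number of distinct strings of the given length over the alphabet."""
    return alphabet_size ** length

full = search_space(78, 8)      # all 78 characters seen in the data
reduced = search_space(27, 8)   # only the 27 most popular characters
ratio = full / reduced

print(f"full: {full:.3e}  reduced: {reduced:.3e}  ratio: {ratio:.0f}")
```

The reduced space is several thousand times smaller than the full one.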

Personally, I abandoned trying to remember passwords a long time ago, and just use a password manager. For example, my WordPress password is over 12 characters long and consists of a completely random mixture of alphanumeric and special characters. Of course, you just need to make sure your password manager is secure….

## 15 Comments »

1. Great analysis! I guess it could be used to build a pragmatically “optimised” password generator.

Comment by infominer — June 16, 2011 @ 3:58 pm

• I now just use a password manager to keep track of very long and random passwords. The number of sites that require a username and password seems to increase by one every day!

Comment by csgillespie — June 16, 2011 @ 8:22 pm

• More likely to be used in new password crack software.

Comment by Scott Miller — September 10, 2011 @ 5:23 pm

2. This is a really nice analysis. I like the use of cumulative probability plot and the fact that 25% of the (remaining) passwords consist only of the most popular 20 characters.

I guess we shouldn’t be surprised by the popularity of vowels and some consonants: passwords are often formed by concatenating words or phrases such as “Hip2B[].” Even though the complete password passes the dictionary test, it contains a substring which is in the dictionary. This implies that the frequency of letters in passwords probably correlates highly with the frequency of occurrence of letters in words. A neat plot would be “frequency in passwords” vs. “frequency in English language.”

Comment by Rick Wicklin — June 16, 2011 @ 5:19 pm

• Thanks for the comment. I suppose we just don’t know how hackers would attempt to crack passwords, i.e. have they ever taken any stats classes 😉

A quick search for “letter frequency” in the English language suggests there are minor differences between the letter distribution, but nothing too significant.

Comment by csgillespie — June 16, 2011 @ 8:18 pm

• That’s not true at all. We know that hackers will always use any data available to increase the effectiveness of their exploits. I guarantee that every time a list of passwords is released, someone uses the list to compile data for better password cracking programs.

Comment by Scott Miller — September 10, 2011 @ 5:23 pm

3. Nice analysis Colin, yet more conclusive evidence that password patterns are highly predictable.

Comment by Troy Hunt — June 17, 2011 @ 1:26 am

4. If only Sony had done the right thing and only stored hashes of the passwords rather than the passwords themselves, this analysis would not have even been possible!

Comment by Sean — June 17, 2011 @ 11:52 am

5. If we’re assuming that the attacker is trying to get one password, and not specifically my password, then I don’t have to outrun the bear… I just need to outrun the slowest person being chased by the bear. If that’s the case, it might make sense for me to restrict my random passwords to the letters, numbers, and characters which occur LEAST frequently. A random guesser will do no better, but a dictionary or optimized attacker will actually do WORSE against my password than against a typical password.

Comment by J.R. — June 17, 2011 @ 1:54 pm

6. Surprise! People don’t go through the hassle of memorizing a random string from a selection of 78 characters in order to register on a movie studio’s website.

Comment by matunos — June 18, 2011 @ 6:47 am

• If only it was that simple. The majority of the passwords came with associated email address, telephone number and home address. Besides, do you really think that people don’t reuse passwords?

Comment by csgillespie — June 20, 2011 @ 9:36 am

7. Which password manager do you use? Or which do you suggest?

Comment by Uj — June 19, 2011 @ 4:20 am

• I use LastPass. It works under multiple operating systems and browsers. I now have a unique long password for almost every site that I visit.

Comment by csgillespie — June 20, 2011 @ 9:34 am

8. I often use shared computers and need to type my password every single time I log-in to anything. I also experience pain while typing. I can’t help but notice that a lot of the least used characters are characters that are more painful and inconvenient to type (capital letters, symbols). Yeah, I type unusual characters all the time while I’m working, but for some reason, it really burns when it’s in a password that I myself chose!

Comment by daffadowndilly — June 21, 2011 @ 2:38 pm

9. You should send an article into 2600 about this.

Comment by Greg — June 27, 2011 @ 6:56 am