Why?

February 16, 2016

RANDU: The case of the bad RNG

Filed under: Computing, R — csgillespie @ 12:15 pm

The German Federal Office for Information Security (BSI) has established
criteria for quality random number generator (rng):

• A sequence of random numbers has a high probability of containing no identical consecutive elements.
• A sequence of numbers which is indistinguishable from “true” random numbers (tested using statistical tests).
• It should be impossible to calculate, or guess, from any given sub-sequence, any previous or future values in the sequence.
• It should be impossible, for all practical purposes, for an attacker to calculate, or guess the values used in the random number algorithm.

Points 3 and 4 are crucial for many applications. Every time you make a phone call, connect to a wireless access point, or pay using your credit card, random numbers are used.

Designing a good random number generator is hard, and as a general rule you should never try to design your own. R comes with many good-quality random number generators. The default generator is the Mersenne-Twister. This rng has a huge period of $2^{19937}-1$ (the number of random numbers generated before the sequence repeats).

Linear congruential generators

A linear congruential generator (lcg) is a relatively simple rng (popular in the 1960s and 70s). It has the simple form

$r_{i}=(ar_{i-1}+b) \textrm{ mod }m, \quad i=1, 2, \ldots, m$

where $r_0$ is the initial number, known as the seed, and $a, b, m$ are the multiplier, additive constant and modulus respectively. The parameters are all integers.

The modulo operation means that at most $m$ different numbers can be generated
before the sequence must repeat – namely the integers $0,1,2, \ldots, m-1$. The
actual number of generated numbers is $h \leq m$, called the period of
the generator.
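As a concrete sketch, an lcg is only a few lines of R. The parameter values below are toy choices picked purely for illustration, not a recommended generator:

```r
# Minimal linear congruential generator: r_i = (a * r_{i-1} + b) mod m.
# The parameters are deliberately tiny toy values, not a good choice.
lcg = function(n, seed, a, b, m) {
  r = numeric(n)
  r[1] = (a * seed + b) %% m
  for (i in 2:n)
    r[i] = (a * r[i - 1] + b) %% m
  r
}

# With m = 16 the sequence must repeat within 16 steps
lcg(5, seed = 1, a = 5, b = 3, m = 16)   # 8 11 10 5 12
```

Dividing the output by $m$ gives values on $[0, 1)$, which is how an lcg is usually used in practice.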

The key to random number generators is in setting the parameters.

RANDU

RANDU was a lcg with parameters $m=2^{31}$, $a=65539$ and $b=0$. Unfortunately, this is a spectacularly bad choice of parameters. On noting that $a=65539=2^{16}+3$, then

$r_{i+1} = a r_i = 65539 \times r_i = (2^{16}+3)r_i \;.$

So

$r_{i+2} = a\;r_{i+1} = (2^{16}+3) \times r_{i+1} = (2^{16}+3)^2 r_i \;.$

On expanding the square, we get

$r_{i+2} = (2^{32}+6\times 2^{16} + 9)r_i = [6 (2^{16}+3)-9]r_i = 6 r_{i+1} - 9 r_i \;.$

Note: all these calculations should be taken mod $2^{31}$. So there is a large correlation between three successive points!
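We can verify this recurrence numerically. The sketch below generates a RANDU sequence directly (doubles are exact here, since $65539 \times 2^{31} < 2^{53}$) and checks that every term satisfies $r_{i+2} = (6r_{i+1} - 9r_i) \bmod 2^{31}$:

```r
# RANDU: r_i = 65539 * r_{i-1} mod 2^31.
# Doubles are exact for these products, since 65539 * 2^31 < 2^53.
randu_seq = function(n, seed = 1) {
  r = numeric(n)
  r[1] = seed
  for (i in 2:n)
    r[i] = (65539 * r[i - 1]) %% 2^31
  r
}

r = randu_seq(1000)
i = 1:998
# Check the three-term recurrence derived above
all(r[i + 2] == (6 * r[i + 1] - 9 * r[i]) %% 2^31)   # TRUE
```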

If we compare RANDU to a standard rng (code in a gist), it's obvious that RANDU doesn't produce good random numbers. Plotting $x_i$, $x_{i-1}$ and $x_{i-2}$ in 3d shows the problem even more clearly: the triples fall on a small number of parallel planes.

Generating the graphics

The code is all in a gist and can be run via

 devtools::source_gist("https://gist.github.com/csgillespie/0ba4bbd8da0d1264b124") 

You can then get the 3d plot via

scatterplot3d::scatterplot3d(randu[,1], randu[,2], randu[,3], angle=154)
## Interactive version
threejs::scatterplot3js(randu[,1], randu[,2], randu[,3])

June 16, 2011

Filed under: Computing, Geekery, R — csgillespie @ 12:52 pm

As everyone knows, it seems that Sony is taking a bit of a battering from hackers.  Thanks to Sony, numerous account and password details are now circulating on the internet. Recently, Troy Hunt carried out a brief analysis of the password structure. Here is a summary of his post:

• There were around 40,000 passwords, of which 8000 would fail a simplistic dictionary attack;
• 93% of passwords were between 6 and 10 characters.

In this post, we will investigate the remaining 32,000 passwords that passed the dictionary attack.

Distribution of characters

As Troy points out, the vast majority of passwords contained only a single character type, i.e. all lower case or all upper case. However, it turns out that things get even worse when we look at character frequency.

In the password database, there are 78 unique characters. So if passwords were truly random, each character should occur with probability 1/78 = 0.013. However, when we calculate the actual character frequencies, we see that the passwords clearly aren't random. The following figure shows the top 20 password characters, with the red line indicating 1/78.

Unsurprisingly, the vowels “e”, “a” and “o” are very popular, with the most popular numbers being 1, 2 and 0 (in that order). No capital letters make it into the top twenty. We can also construct the cumulative probability plot for character occurrence. In the following figure, the red dots show the pattern we would expect if the passwords were truly random (link to a larger version of the plot):

Clearly, things aren’t as random as we would like.
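The frequency calculation itself is a one-liner in R. A minimal sketch, using a tiny made-up password vector since the leaked dataset isn't reproduced here:

```r
# Hypothetical stand-ins for the leaked password list
passwords = c("password1", "letmein22", "Secret!", "qwerty12")

# Split every password into single characters and tabulate frequencies
chars = unlist(strsplit(passwords, ""))
freq = sort(table(chars) / length(chars), decreasing = TRUE)

# Under true randomness each of the 78 characters would occur with
# probability 1/78; compare the observed frequencies to that line.
head(freq)
```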

Character order

Let’s now consider the order in which the characters appear. To simplify things, consider only the eight-character passwords. The most popular number to include in a password is “1”. If placement were random, then in passwords containing the number “1” we would expect to see the character evenly distributed across the eight positions. Instead, we get:

   ##Distribution of "1" over eight character passwords
0.06 0.03 0.04 0.04 0.13 0.13 0.22 0.34

So in around 84% of passwords that contain the number “1”, the number appears only in the second half of the password. Clearly, people like sticking a “1” towards the end of their password.

We get a similar pattern with “2”:

   0.05 0.05 0.04 0.05 0.13 0.11 0.30 0.27

and with “!”

   #Small sample size here
0.00 0.00 0.00 0.00 0.00 0.11 0.16 0.74

We see similar patterns with other alpha-numeric characters.
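The positional counts above are easy to reproduce. A sketch with a hypothetical handful of eight-character passwords (the variable names and data are mine, not from the original analysis):

```r
# Hypothetical eight-character passwords (stand-ins for the real list)
pwds = c("abcdef11", "1bcdefgh", "abc1efg1", "abcdefgh")

# Keep only the passwords that actually contain a "1"
with1 = pwds[grepl("1", pwds, fixed = TRUE)]

# Record the position of every "1", then convert counts to proportions
pos = unlist(gregexpr("1", with1, fixed = TRUE))
prop = tabulate(pos, nbins = 8) / length(pos)
round(prop, 2)
```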

Number of characters needed to guess a password

Suppose we constructed all possible passwords using the N most popular characters. How many passwords in our sample would that cover? The following figure shows the proportion of passwords in our list covered using the first N characters:

To cover 50% of the passwords in our list, we only need the first 27 characters. In fact, using only 20 characters covers around 25% of passwords, while using 31 characters covers 80% of passwords. Remember, these passwords passed the dictionary attack.
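The coverage curve can be computed by ranking characters by frequency and counting the passwords built entirely from the top N. A sketch on made-up data (the helper name coverage is mine):

```r
# Hypothetical password sample
pwds = c("eeeaao", "banana", "zQx9!k", "aeaeae")

# Rank characters by frequency across the whole sample
chars = unlist(strsplit(pwds, ""))
ranked = names(sort(table(chars), decreasing = TRUE))

# Proportion of passwords built entirely from the N most popular characters
coverage = function(N)
  mean(sapply(strsplit(pwds, ""), function(p) all(p %in% ranked[1:N])))

coverage(3)               # only the top-3 characters
coverage(length(ranked))  # by definition covers everything
```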

Summary

Typically when we calculate the probability of guessing a password, we assume that each character is equally likely to be chosen, i.e. the probability of choosing “e” is the same as choosing “Z”. This is clearly false. Also, since many systems now force people to include different character types in their password, it is too easy for users just to tack a number onto the end. I don’t want to go into how to efficiently explore “password space”, but it’s clear that a naive brute-force search isn’t the way to go.

Personally, I abandoned trying to remember passwords a long time ago, and just use a password manager. For example, my WordPress password is over 12 characters long and consists of a completely random mixture of alphanumeric and special characters. Of course, you just need to make sure your password manager is secure…

May 25, 2011

Statistical podcast: random number seeds

Filed under: Computing, Geekery — csgillespie @ 10:39 pm

One of the podcasts I listen to each week is Security Now! Typically, this podcast has little statistical content, as its main focus is computer security, but episode 301 looks at how to generate truly random numbers for seeding pseudo random number generators.

Generating truly random numbers to be used as a seed turns out to be rather tricky. For example, version 1.0 of the SSL protocol in the Netscape browser combined the time of day and the process number to seed its random number generator. However, it turns out that the process number usually comes from a small subset of all possible ids, and so is fairly easy to guess.

Recent advances indicate that we can get “almost true” randomness by taking multiple snapshots of the processor counter. Since the counter covers around 3 billion numbers each second, we can use it to create a truly random seed.

To find out more, listen to the podcast. The discussion on random seeds begins mid-way through the podcast.

January 12, 2011

Random variable generation (Pt 3 of 3)

Filed under: AMCMC, R — csgillespie @ 3:59 pm

Ratio-of-uniforms

This post is based on chapter 1.4.3 of Advanced Markov Chain Monte Carlo.  Previous posts on this book can be found via the  AMCMC tag.

The ratio-of-uniforms method was initially developed by Kinderman and Monahan (1977) and can be used to generate random numbers from many standard distributions. Essentially, we transform the random variable of interest and then use a rejection method.

The algorithm is as follows:

Repeat until a value is obtained from step 2.

1. Generate $(Y, Z)$ uniformly over $\mathcal D \supseteq \mathcal C_h^{(1)}$.
2. If $(Y, Z) \in \mathcal C_h^{(1)}$, return $X = Z/Y$ as the desired deviate.

The uniform region is

$\mathcal C_h^{(1)} = \left\{ (y,z): 0 \le y \le [h(z/y)]^{1/2}\right\}.$
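As a concrete sketch, here is a ratio-of-uniforms sampler for the standard normal (a simpler target than the Gamma example in AMCMC; the function name rou_norm is mine). With $h(x) = e^{-x^2/2}$, the region $\mathcal C_h^{(1)}$ fits inside the rectangle $0 \le y \le 1$, $|z| \le \sqrt{2/e}$, so we can take $\mathcal D$ to be that rectangle:

```r
# Ratio-of-uniforms for N(0, 1): h(x) = exp(-x^2 / 2).
# Accept (y, z) when y <= sqrt(h(z/y)), i.e. (z/y)^2 <= -4 * log(y).
rou_norm = function(n) {
  zmax = sqrt(2 / exp(1))          # bound on |z| over C_h
  out = numeric(n)
  for (i in seq_len(n)) {
    repeat {
      y = runif(1)                 # 0 <= y <= 1
      z = runif(1, -zmax, zmax)
      x = z / y
      if (x^2 <= -4 * log(y)) {    # (y, z) lies inside C_h
        out[i] = x
        break
      }
    }
  }
  out
}

set.seed(1)
x = rou_norm(10000)
c(mean(x), sd(x))   # should be close to 0 and 1
```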

In AMCMC the authors give some R code for generating random numbers from the Gamma distribution.

I was going to include some R code with this post, but I found this set of questions and solutions that cover most things. Another useful page is this online book.

Thoughts on Chapter 1

The first chapter is fairly standard. It briefly describes some results that should be background knowledge. However, I did spot a few typos in this chapter. In particular, when describing the acceptance-rejection method, the authors alternate between $g(x)$ and $h(x)$.

Another downside is that the R code for the ratio-of-uniforms is presented in an optimised form. For example, the authors use EXP1 = exp(1) as a global constant. I think a simplified version would have been better for illustration purposes.

This book review has been progressing at glacial speed. Therefore, in future, rather than going through section by section, I will just give an overview of each chapter.

December 2, 2010

Random variable generation (Pt 2 of 3)

Filed under: AMCMC, R — csgillespie @ 5:44 pm

Acceptance-rejection methods

This post is based on chapter 1.4 of Advanced Markov Chain Monte Carlo.

Another method of generating random variates from distributions is to use acceptance-rejection. Basically, to generate a random number from $f(x)$, we generate a RN from an envelope distribution $g(x)$, where $\sup_x f(x)/g(x) \le M < \infty$. The acceptance-rejection algorithm is as follows:

Repeat until we generate a value from step 2:

1. Generate $x$ from $g(x)$ and $U$ from $Unif(0, 1)$

2. If $U \le \frac{f(x)}{M g(x)}$, return $x$ (as a random deviate from $f(x)$).

Example: the standard normal distribution

This example illustrates how we generate $N(0, 1)$ RNs using the logistic distribution as an envelope distribution. First, note that

$\displaystyle f(x) = \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \quad \mbox{and} \quad g(x) = \frac{e^{-x/s}}{s(1+ e^{-x/s})^2}$

On setting $s=0.648$, we get $M = 1.081$. This method is fairly efficient and has an acceptance rate of

$\displaystyle r = \frac{1}{M}\frac{\int f(x) dx}{\int g(x) dx} = \frac{1}{M} = 0.925$

since both $f$ and $g$ are normalised densities.

R code

This example is straightforward to code:

myrnorm = function(M){
  while(TRUE){
    u = runif(1)
    x = rlogis(1, scale = 0.648)
    # Accept x with probability f(x) / (M * g(x))
    if(u < dnorm(x) / (M * dlogis(x, scale = 0.648)))
      return(x)
  }
}


To check the results, we could call myrnorm a few thousand times:

hist(replicate(10000, myrnorm(1.1)), freq=FALSE)
lines(seq(-3, 3, 0.01), dnorm(seq(-3, 3, 0.01)), col=2)


Example: the standard normal distribution with a squeeze

Suppose the density $f(x)$ is expensive to evaluate. In this scenario we can employ an easy-to-compute function $s(x)$, where $0 \le s(x) \le f(x)$. $s(x)$ is called a squeeze function. In this example, we’ll use a simple rectangular function, where $s(x) = 0.22$ for $-1 \le x \le 1$ (and zero elsewhere). This is shown in the following figure:

The modified algorithm is as follows:

Repeat until we generate a value from step 2:

1. Generate $x$ from $g(x)$ and $U$ from $Unif(0, 1)$

2. If $U \le \frac{s(x)}{M g(x)}$ or $U \le \frac{f(x)}{M g(x)}$, return $x$ (as a random deviate from $f(x)$).

Hence, when $U \le \frac{s(x)}{M g(x)}$ we don’t have to compute $f(x)$. Obviously, in this example $f(x)$ isn’t that difficult to compute.
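A sketch of the squeezed sampler, modifying the myrnorm function above (the function name is mine; note the rectangle $s(x)=0.22$ on $[-1,1]$ does lie below the normal density, since $\phi(1) \approx 0.242$):

```r
# Acceptance-rejection with a rectangular squeeze: s(x) = 0.22 on [-1, 1].
# When u <= s(x) / (M g(x)) we accept without evaluating dnorm(x) at all.
myrnorm_squeeze = function(M) {
  while (TRUE) {
    u = runif(1)
    x = rlogis(1, scale = 0.648)
    Mg = M * dlogis(x, scale = 0.648)
    if (abs(x) <= 1 && u <= 0.22 / Mg)   # cheap squeeze test
      return(x)
    if (u <= dnorm(x) / Mg)              # full test, only when needed
      return(x)
  }
}

set.seed(1)
x = replicate(10000, myrnorm_squeeze(1.081))
c(mean(x), sd(x))   # should be close to 0 and 1
```

Since $s(x) \le f(x)$, passing the squeeze test implies the full test would also pass, so the accepted sample still has density $f$.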

November 28, 2010

Random variable generation (Pt 1 of 3)

Filed under: AMCMC, R — csgillespie @ 7:35 pm

As I mentioned in a recent post, I’ve just received a copy of Advanced Markov Chain Monte Carlo Methods. Chapter 1.4 in the book (very quickly) covers random variable generation.

Inverse CDF Method

A standard algorithm for generating random numbers is the inverse cdf method. The continuous version of the algorithm is as follows:

1. Generate a uniform random variable $U$

2. Compute and return $X = F^{-1}(U)$

where $F^{-1}(\cdot)$ is the inverse of the CDF. Well known examples of this method are the exponential distribution and the Box-Muller transform.

Example: Logistic distribution

I teach this algorithm in one of my classes and I’m always on the look-out for new examples. Something that had escaped my notice is that it is easy to generate RNs from the logistic distribution using this technique. This distribution has CDF

$\displaystyle F(x; \mu, s) = \frac{1}{1 + \exp(-(x-\mu)/s)}$
and so we can generate a random number from the logistic distribution using the following formula:
$\displaystyle X = \mu + s \log\left(\frac{U}{1-U}\right)$

This is easily converted to R code:

myRLogistic = function(mu, s){
  u = runif(1)
  mu + s * log(u / (1 - u))
}
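Since R already exposes the logistic quantile function, we can sanity-check the formula directly: $\mu + s\log(U/(1-U))$ is exactly qlogis(U, mu, s). A vectorised sketch (the function name my_rlogistic is mine):

```r
# Vectorised inverse-CDF sampler for the logistic distribution
my_rlogistic = function(n, mu, s) {
  u = runif(n)
  mu + s * log(u / (1 - u))
}

# The transform agrees with R's built-in logistic quantile function
u = c(0.1, 0.25, 0.5, 0.9)
all.equal(2 + 3 * log(u / (1 - u)), qlogis(u, location = 2, scale = 3))  # TRUE
```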