# Why?

## February 16, 2016

### RANDU: The case of the bad RNG

Filed under: Computing, R — csgillespie @ 12:15 pm

The German Federal Office for Information Security (BSI) has established
criteria for a quality random number generator (RNG):

• A sequence of random numbers has a high probability of containing no identical consecutive elements.
• A sequence of numbers that is indistinguishable from ‘true’ random numbers (verified using statistical tests).
• It should be impossible to calculate, or guess, from any given sub-sequence, any previous or future values in the sequence.
• It should be impossible, for all practical purposes, for an attacker to calculate, or guess the values used in the random number algorithm.

Points 3 and 4 are crucial for many applications. Every time you make a
phone call, connect to a wireless access point, or pay with your credit card,
random numbers are used.

Designing a good random number generator is hard, and as a general rule you should never try to. R comes with many good-quality random number generators. The default generator is the Mersenne-Twister. This RNG has a huge period of $2^{19937}-1$ (the number of random numbers generated before the sequence repeats).
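You can check which generator your R session is using:

```r
# RNGkind() reports the current random number generators; the first
# element is the uniform generator (Mersenne-Twister by default).
RNGkind()[1]

set.seed(1)  # make the stream reproducible
runif(3)     # three uniform draws from the Mersenne-Twister
```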

## Linear congruential generators

A linear congruential generator (LCG) is a relatively simple RNG that was popular in the 60’s and 70’s. It has the simple form

$r_{i}=(ar_{i-1}+b) \textrm{ mod }m, \quad i=1, 2, \ldots, m$

where $r_0$ is the initial number, known as the seed, and $a, b, m$ are the multiplier, additive constant and modulus respectively. The parameters are all integers.

The modulo operation means that at most $m$ different numbers can be generated
before the sequence must repeat – namely the integers $0, 1, 2, \ldots, m-1$. The
actual number of distinct values generated, $h \leq m$, is called the period of
the generator.

The key to a good linear congruential generator is choosing the parameters well.
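As a minimal sketch (the function name `lcg` and the example below are mine, not from any particular library), an LCG takes only a few lines of R:

```r
# Linear congruential generator: r_i = (a * r_{i-1} + b) mod m.
# Returns n values scaled to [0, 1). All intermediate products must
# stay well below 2^53 so that a * x + b is exact in double precision.
lcg <- function(n, seed, a, b, m) {
  r <- numeric(n)
  x <- seed
  for (i in seq_len(n)) {
    x <- (a * x + b) %% m
    r[i] <- x
  }
  r / m
}

# Example: the Park-Miller "minimal standard" parameters
u <- lcg(5, seed = 1, a = 16807, b = 0, m = 2^31 - 1)
```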

### RANDU

RANDU was an LCG with parameters $m=2^{31}, a=65539$ and $b=0$. Unfortunately this is a spectacularly bad choice of
parameters. Noting that $a = 65539 = 2^{16}+3$, we have

$r_{i+1} = a r_i = 65539 \times r_i = (2^{16}+3)r_i \;.$

So

$r_{i+2} = a\;r_{i+1} = (2^{16}+3) \times r_{i+1} = (2^{16}+3)^2 r_i \;.$

On expanding the square, we get

$r_{i+2} = (2^{32}+6\times 2^{16} + 9)r_i = [6 (2^{16}+3)-9]r_i = 6 r_{i+1} - 9 r_i \;.$

Note: all these calculations are modulo $2^{31}$. So every term is an exact linear
combination of the two preceding terms – consecutive triples are strongly correlated!
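We can verify this relation numerically (a quick sketch; `randu_gen` is my own helper function):

```r
# RANDU: r_i = 65539 * r_{i-1} mod 2^31, with an odd seed.
# Products stay below 2^53, so double-precision arithmetic is exact.
randu_gen <- function(n, seed = 1) {
  m <- 2^31
  a <- 65539
  r <- numeric(n)
  x <- seed
  for (i in seq_len(n)) {
    x <- (a * x) %% m
    r[i] <- x
  }
  r
}

r <- randu_gen(1000)
# Every term is an exact linear combination of the previous two:
# r_{i+2} = 6 r_{i+1} - 9 r_i (mod 2^31)
all((6 * r[2:999] - 9 * r[1:998]) %% 2^31 == r[3:1000])  # TRUE
```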

If we compare RANDU to a standard RNG (code in a gist), it’s obvious that RANDU doesn’t produce good random numbers. Plotting $x_i$, $x_{i-1}$ and $x_{i-2}$ in 3D makes the problem even clearer: the points lie on a small number of parallel planes.

### Generating the graphics

The code is all in a gist and can be run via

```r
devtools::source_gist("https://gist.github.com/csgillespie/0ba4bbd8da0d1264b124")
```

You can then get the 3d plot via

```r
scatterplot3d::scatterplot3d(randu[, 1], randu[, 2], randu[, 3], angle = 154)

## Interactive version
threejs::scatterplot3js(randu[, 1], randu[, 2], randu[, 3])
```

## July 13, 2012

### Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays

Filed under: latex, Microarray, Publications, R — csgillespie @ 2:32 pm

A few years ago I was involved in analysing some time-course microarray data. Our biological collaborators were interested in how we analysed their data, so this led to the creation of a tutorial, which in turn led to a paper. When we submitted the paper, one of the referees “suggested” that we write the paper using Sweave; I had never used Sweave. At the time this was a massive pain and I regularly cursed the name of the anonymous referee. A few years later, I’ve just updated the code (due to a change in Bioconductor) and it was a breeze. A belated thanks to the referee.

In this latest update to the paper I’ve

• moved the paper to github;
• changed from Sweave to knitr;
• used RStudio instead of emacs.

You can find full details about the analysis on the associated GitHub page.

## July 12, 2011

### Reviewing a paper that uses GPUs

Filed under: Computing, Publications — csgillespie @ 1:53 pm

Graphical processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially interested in the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and the sparkle wore off. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

1. For speed comparisons, do the authors compare a GPU with a multi-core CPU? In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or OpenMP. Take off a factor of eight when comparing to a multi-core CPU.
2. Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older. So, take off a factor of two for each year of difference between a CPU and GPU.
3. I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled as single precision. Take off a factor of two for double precision.
4. When you use a GPU, you split the job in blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive speed impact on your problem. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs, it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, it now becomes very problem dependent. What could work for one model, could well be disastrous for another.
So in many GPU articles the speed comparisons could be reduced by a factor of 32!
Just to clarify, I’m not saying that GPUs have no future, rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.

## May 29, 2011

### Impact factors for statistics journals

Filed under: Publications — csgillespie @ 1:08 pm

The other day I came across “Nefarious numbers” by Douglas Arnold and Kristine Fowler on arXiv. This paper examines how impact factors can be easily and blatantly manipulated.

## What is an Impact Factor?

For a particular year, the impact factor of a journal is the average number of citations received in that year per paper published in the journal during the two preceding years. The impact factor has a number of glaring flaws:

• Impact factors vary across disciplines.
• The submission to publication process in a statistical journal can take up to a year.
• The impact factor is just a single statistic out of many possible measures.
• The underlying database contains many errors.
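As a toy example (the numbers below are made up purely for illustration), a journal whose 2006–07 papers received 150 citations in 2008, from 60 published papers, would have a 2008 impact factor of 2.5:

```r
# Hypothetical counts, purely for illustration
citations_2008 <- 150  # 2008 citations to papers published in 2006-07
papers_2006_07 <- 60   # papers published in 2006-07

impact_factor_2008 <- citations_2008 / papers_2006_07
impact_factor_2008  # 2.5
```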

## International Journal of Nonlinear Sciences and Numerical Simulation

The Australian Research Council (ARC) recently released an evaluation, listing quality ratings for over 20,000 peer-reviewed journals across various disciplines. This list was constructed through a review process involving academics, disciplinary bodies and learned academies. The outcome is that over 20,000 peer-reviewed journals are ranked A* to C, where

• A*: one of the best in its field or sub-field;
• A: very high quality;
• B: solid, though not outstanding reputation;
• C: does not meet the criteria of the higher tiers.

The ARC ranked the International Journal of Nonlinear Sciences and Numerical Simulation (IJNSNS) as a B. However, in 2008 this journal had an impact factor of 8.9 – more than double that of the next highest journal in the Applied Mathematics section. As the paper explains, the reason for the large impact factor is easy to see. In 2008, the three top-citing authors to IJNSNS were:

• Ji-Huan He, the journal’s Editor-in-Chief, who cited the journal 243 times within the two-year window;
• D. D. Ganji, a member of the editorial board, with 114 cites;
• Mohamed El Naschie, a regional editor, with 58 cites.

Comparing these numbers with other journals shows how extreme IJNSNS really is – the next highest impact factor is around 4. Arnold and Fowler also investigate the journals where the citations occur. These turn out to be IJNSNS itself or special issues of other journals edited by someone on the IJNSNS board.

## Impact Factors for Statistics Journals

The ARC statistics section contains around two hundred journals. Some of these are “traditional” statistics journals, such as JASA, the JRSS series and Biometrics. Other journals are more applied, such as Bioinformatics and Mathematical Biosciences. So in the following comparison, I just considered journals classed as “statistics” by the ISI Web of Knowledge. This leaves seventy-seven journals.

The following plot shows the two- and five-year impact factor for the seventy-seven statistical journals, grouped by the ARC rating. The red dots show the median impact factor for a particular grouping.

As would be expected, for the two-year IF there is little difference between the ARC ratings – although more than I expected. Once we calculate the five-year impact factors, the differences between ratings are clearer. Since many of the group C journals are new, a number of them don’t have a five-year impact factor.
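A sketch of how such a plot can be built in base R (the impact factors below are simulated stand-ins, not the real data):

```r
set.seed(1)
# Simulated stand-in five-year impact factors, grouped by ARC rating
ifs <- data.frame(
  rating = factor(rep(c("A*", "A", "B", "C"), each = 20),
                  levels = c("A*", "A", "B", "C")),
  if5 = c(rnorm(20, 2.5, 0.5), rnorm(20, 2.0, 0.5),
          rnorm(20, 1.5, 0.5), rnorm(20, 1.0, 0.5))
)

# Median impact factor per rating (the red dots in the plot)
meds <- tapply(ifs$if5, ifs$rating, median)

stripchart(if5 ~ rating, data = ifs, vertical = TRUE,
           method = "jitter", pch = 1,
           ylab = "Five-year impact factor")
points(seq_along(meds), meds, col = "red", pch = 19)
```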

## Outlying Statistical Journals

There are three journals that stand out from their particular groups:

• Statistical Science, a group A journal. Since this is mainly a review journal, it’s not really surprising that it has a high impact factor.
• The Journal of Statistical Software and the Stata Journal, group C journals. Since these are “statistical computing” journals, it isn’t that surprising that they have high impact factors.

## Should we use Impact Factors?

The best answer would be no! Just read the first page of  “Nefarious numbers” for a variety of reasons why we should dump impact factors. However, I suspect that impact factors will be forced on many of us, as a tool to quantify our research. Therefore, while we should try to fight against them, we should also keep an eye on them for evidence of people playing the system.

## March 9, 2011

### Workshop: Statistical Bioinformatics and Stochastic Systems Biology

Filed under: Uncategorized — csgillespie @ 11:53 am

The Third Biennial Newcastle Workshop on Statistical Bioinformatics and Stochastic Systems Biology will be held from 14.00 on Monday 13th to 16.30 on Tuesday 14th June 2011 at Newcastle University, UK.

Speakers will include the following

• Prof. David Balding, University College, London, UK.
• Prof. Arnoldo Frigessi, University of Oslo, Norway.
• Dr. Dirk Husmeier, Biomathematics and Statistics Scotland, UK.
• Dr. Michal Komorowski, Imperial College, London, UK.
• Dr. Pawel Paszek, University of Liverpool, UK.
• Prof. Magnus Rattray, University of Sheffield, UK.
• Dr. Guido Sanguinetti, University of Edinburgh, UK.
• Dr. Christopher Yau, University of Oxford, UK.

## Registration

There is no registration fee, but please do register, as the number of places may be limited. Register (free of charge) by sending your name and affiliation by email to mathstats-office@ncl.ac.uk. If possible, please register by 23rd May (to help with catering arrangements).

See the conference website for further details.

## January 28, 2011

### R books for undergraduate students

Filed under: R, Teaching — csgillespie @ 10:18 pm

In a recent post, I asked for suggestions for introductory R computing books. In particular, I was looking for books that:

• Assume no prior knowledge of programming.
• Assume very little knowledge of statistics. For example, no regression.
• Are cheap, since they are for undergraduate students.

Some of my cons aren’t really downsides as such. Rather, they just indicate that the books aren’t suitable for this particular audience. A prime example is “R in a Nutshell”.

I ended up recommending five books to the first year introductory R class.

## Recommended Books

• A first course in statistical programming with R (Braun & Murdoch)
  • Pros: I quite like this book (hence the reason I put it on my list). It has a nice collection of exercises, it “looks nice” and doesn’t assume knowledge of programming. It also doesn’t assume (or try to teach) any statistics.
  • Cons: When describing for loops and functions the examples aren’t very statistical. For example, it uses Fibonacci sequences in the while loop section and the sieve of Eratosthenes for if statements.
• An introduction to R (Venables & Smith)
  • Pros: Simple, short and to the point. Free copies available. Money from the book goes to the R project.
  • Cons: More an R reference guide than a textbook.
• A Beginner’s Guide to R (Zuur)
  • Pros: Assumes no prior knowledge. Proceeds through concepts slowly and carefully.
  • Cons: Proceeds through concepts very slowly and carefully.
• R in a Nutshell (Adler)
  • I completely agree with the recent review by Robin Wilson: “Very comprehensive and very useful, but not good for a beginner. Great book though – definitely has a place on my bookshelf.”
  • Pros: An excellent reference.
  • Cons: Only suitable for students with a previous computing background.
• Introduction to Scientific Programming and Simulation Using R (Jones, Maillardet and Robinson)
  • Pros: A nice book that teaches R programming. Similar to the Braun & Murdoch book.
  • Cons: A bit pricey in comparison to the other books.

## Books not being recommended

These books were mentioned in the comments of the previous post.

• The Basics of S-PLUS (Krause & Olson)
  • Most students struggle with R. Introducing a similar, but slightly different, language is too sadistic.
• Software for Data Analysis: Programming with R (Chambers)
  • Assumes some previous statistical knowledge.
• Bayesian Computation with R (Albert)
  • Not suitable for first-year students who haven’t taken any previous statistics courses.
• R Graphics (Paul Murrell)
  • I know graphics are important, but a whole book for an undergraduate student might be too much. I did toy with the idea of recommending this book, but I thought that five recommendations were more than sufficient.
• ggplot2 (Hadley Wickham)
  • Great book, but our students don’t encounter ggplot2 in their undergraduate course.

## Online Resources

• Introduction to Probability and Statistics by Kerns
  • Suitable for a combined R and statistics course, but I don’t really do much stats in this module.
• The R Programming wikibook (a work in progress)
  • Will give the students this link.
• Biological Data Analysis Using R by Rodney J. Dyer. Available under a CC licence.
  • A nice resource. Possibly a little big for this course (I know that this is very picky, but I had to draw the line somewhere). Will probably use it for future courses.
• Hadley Wickham’s devtools wiki (a work in progress)
  • Assumes a good working knowledge of R.
• The R Inferno by Patrick Burns
  • Good book, but too advanced for students who have never programmed before.
• Introduction to S programming
  • It’s in French – this may or may not be a good thing depending on your point of view 😉

## January 25, 2011

### CPU and GPU trends over time

Filed under: Computing, R — csgillespie @ 4:04 pm

GPUs seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could be leveraged for statistical computing. Recently Christian Robert arXived a paper with parallel computing firmly in mind. In two weeks’ time I’m giving an internal seminar on using GPUs for statistical computing. To start the talk, I wanted a few graphs showing CPU and GPU evolution over the last decade or so. This turned out to be trickier than I expected.

After spending an afternoon searching the internet (mainly Wikipedia), I came up with a few nice plots.

## Intel CPU clock speed

CPU clock speed for a single CPU has been fairly static over the last couple of years – hovering around 3.4GHz. Of course, we shouldn’t fall completely for the Megahertz myth, but one avenue of speed increase has been blocked:

## Computational power per die

Although single-CPU clock speeds have stalled, the computational power per die has still been increasing, due to the rise of multi-core machines:

## GPUs vs CPUs

When we compare GPUs with CPUs over the last decade in terms of floating-point operations per second (FLOPS), we see that GPUs appear to be far ahead of CPUs:

## Sources and data

• You can download the data files and R code used to generate the above graphs.
• If you find them useful, please drop me a line.
• I’ll probably write further posts on GPU computing, but these won’t go through the R-bloggers site (since it has little to do with R).
• Data for Figures 1 & 2 was obtained from “Is Parallel Programming Hard, And, If So, What Can You Do About It?”, which in turn took the data from Wikipedia.
• Data for Figure 3 was mainly from Wikipedia and the odd mailing list post.
• I believe these graphs show the correct general trend, but the actual numbers have been obtained from mailing lists, Wikipedia, etc. Use with care.

## January 12, 2011

### Random variable generation (Pt 3 of 3)

Filed under: AMCMC, R — csgillespie @ 3:59 pm

## Ratio-of-uniforms

This post is based on Section 1.4.3 of Advanced Markov Chain Monte Carlo. Previous posts on this book can be found via the AMCMC tag.

The ratio-of-uniforms method was initially developed by Kinderman and Monahan (1977) and can be used to generate random numbers from many standard distributions. Essentially we transform the random variable of interest, then use a rejection method.

The algorithm is as follows:

Repeat until a value is obtained from step 2.

1. Generate $(Y, Z)$ uniformly over $\mathcal D \supseteq \mathcal C_h^{(1)}$.
2. If $(Y, Z) \in \mathcal C_h^{(1)}$, return $X = Z/Y$ as the desired deviate.

The uniform region is

$\mathcal C_h^{(1)} = \left\{ (y,z): 0 \le y \le [h(z/y)]^{1/2}\right\}.$
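As a sketch of the method (this is my own minimal example for the standard normal, not the book’s Gamma code): for $h(x) = e^{-x^2/2}$ the region $\mathcal C_h^{(1)}$ fits inside the rectangle $[0, 1] \times [-\sqrt{2/e}, \sqrt{2/e}]$, so we can sample the rectangle uniformly and accept the pairs that fall in the region:

```r
# Ratio-of-uniforms sampler for the standard normal, h(x) = exp(-x^2/2).
# Accept (y, z) when y <= h(z/y)^{1/2}, i.e. z^2 <= -4 y^2 log(y).
rou_norm <- function(n) {
  b <- sqrt(2 / exp(1))  # bound on |z|
  out <- numeric(n)
  i <- 0
  while (i < n) {
    y <- runif(1)
    z <- runif(1, -b, b)
    if (z^2 <= -4 * y^2 * log(y)) {
      i <- i + 1
      out[i] <- z / y
    }
  }
  out
}

set.seed(1)
x <- rou_norm(1e4)
c(mean(x), sd(x))  # close to 0 and 1
```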

In AMCMC they give some R code for generating random numbers from the Gamma distribution.

I was going to include some R code with this post, but I found this set of questions and solutions that cover most things. Another useful page is this online book.

## Thoughts on the Chapter 1

The first chapter is fairly standard. It briefly describes some results that should be background knowledge. However, I did spot a few typos in this chapter. In particular, when describing the acceptance-rejection method, the authors alternate between $g(x)$ and $h(x)$.

Another downside is that the R code for the ratio-of-uniforms is presented in an optimised form. For example, the authors use EXP1 = exp(1) as a global constant. I think a simplified version would have been better for illustration purposes.

This book review has been progressing at a glacial pace. Therefore in future, rather than going through section by section, I will just give an overview of each chapter.

## December 8, 2010

### New paper: Survival analysis

Filed under: Publications, R — csgillespie @ 8:13 pm

Each year I try to carry out some statistical consultancy to gain experience in other areas of statistics and also to provide teaching examples. Last Christmas I was approached by a paediatric consultant from the RVI who wanted to carry out a prospective survival analysis. The consultant, Bruce Jaffray, had performed Nissen fundoplication surgery on 230 children. Many of the children had other medical conditions such as cerebral palsy or low BMI. He was interested in the factors that affected patients’ survival.

## The model

We fitted a standard Cox proportional hazards model. The following covariates were significant:

• gastrostomy
• cerebral palsy
• gender
• need for revision surgery. This was when the child had to return to hospital for more surgery.
• an interaction term between gastrostomy & cerebral palsy.

The interaction term was key to getting a good model fit. The figures (one of which is shown below) were constructed using ggplot2 and R. The referees actually commented on the good quality of the statistical work and the nice figures! Always nice to read. Unfortunately, there isn’t a nice survival-to-ggplot2 interface, so I had to write some rather hacky R code 😦
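A sketch of the model structure in R (the data and variable names below are simulated stand-ins, not the actual patient data):

```r
library(survival)

# Simulated stand-in data, roughly matching the study's size
set.seed(42)
n <- 230
d <- data.frame(
  time           = rexp(n, rate = 0.1),  # follow-up time
  status         = rbinom(n, 1, 0.3),    # 1 = event observed
  gastrostomy    = rbinom(n, 1, 0.4),
  cerebral_palsy = rbinom(n, 1, 0.4),
  gender         = rbinom(n, 1, 0.5),
  revision       = rbinom(n, 1, 0.2)
)

# Cox proportional hazards model; the * expands to both main effects
# plus the gastrostomy:cerebral_palsy interaction
fit <- coxph(Surv(time, status) ~ gastrostomy * cerebral_palsy +
               gender + revision, data = d)
summary(fit)
```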

## Results

The main finding of the paper was the negative effect of cerebral palsy and gastrostomy on survival. Unfortunately, if a child had a gastrostomy or had cerebral palsy then survival was dramatically reduced. The interaction effect was necessary, otherwise we would have predicted that no children with both a gastrostomy and cerebral palsy would survive.

#### Other results

• There was a rather strange and strong gender effect – male survival was greater than female.
• The revision covariate was also significant – children who needed their fundoplication redone had increased survival. At first glance this is strange – the operation had to be redone, yet this was good for survival. However, this was really a red herring. Essentially children who had their surgery redone had by definition survived a minimum amount of time. I think something a bit more sophisticated could have been done with this variable, but the numbers weren’t that large.
