Why?

July 12, 2011

Reviewing a paper that uses GPUs

Filed under: Computing, Publications | csgillespie @ 1:53 pm

Graphics processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially interested in the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and the sparkle wore off. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

  1. For speed comparisons, do the authors compare a GPU with a multi-core CPU? In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or OpenMP. Take off a factor of eight when comparing to a multi-core CPU.
  2. Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older. So, take off a factor of two for each year of difference between a CPU and GPU.
  3. I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled as single precision. Take off a factor of two for double precision.
  4. When you use a GPU, you split the job into blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive impact on speed. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, the best choice becomes very problem dependent. What works for one model could well be disastrous for another.
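
So in many GPU articles the reported speed-up could be reduced by a factor of 8 × 2 × 2 = 32 (multi-core CPU, a one-year age gap and double precision), before even considering how well the thread/block choice transfers to another problem!

Just to clarify, I’m not saying that GPUs have no future; rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.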

May 25, 2011

Statistical podcast: random number seeds

Filed under: Computing, Geekery | csgillespie @ 10:39 pm

One of the podcasts I listen to each week is Security Now! Typically, this podcast has little statistical content, as its main focus is computer security, but episode 301 looks at how to generate truly random numbers for seeding pseudo-random number generators.

Generating truly random numbers to use as a seed turns out to be rather tricky. For example, in the Netscape browser, version 1.0 of the SSL protocol combined the time of day and the process number to seed its random number generator. However, the process number usually falls within a small subset of all possible ids, and so is fairly easy to guess.

Recent advances indicate that we can get “almost true” randomness by taking multiple snapshots of the processor counter. Since the counter ticks through around 3 billion values each second, the exact value at any given snapshot is very hard to predict, so we can use the counter to create an almost truly random seed.

To find out more, listen to the podcast. The discussion on random seeds begins mid-way through the podcast.

May 12, 2011

Makefiles and Sweave

Filed under: Computing, latex, R | csgillespie @ 8:19 pm

A Makefile is a simple text file that controls the compilation of a target file. The key benefit of using a Makefile is that make uses file time stamps to determine whether a particular action is needed. In this post we discuss a simple Makefile that compiles a tex file containing a number of \include statements. The files referred to by the \include statements are Sweave files.

Suppose we have a master tex file called master.tex. In this file we have:

\include{chapter1}
\include{chapter2}
\include{chapter3}
....

where chapter1, chapter2 and chapter3 are generated from Sweave files (chapter1.Rnw, and so on). Ideally, when we compile master.tex, we only want to run Sweave if the time stamp of chapter1.tex is older than the time stamp of chapter1.Rnw. This conditional compiling is even more important when we have a number of Sweave files.

Meta-rules

To avoid duplication in a Makefile, it’s handy to use meta-rules. These rules specify how to convert from one file format to another. For example,

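.Rnw.tex:
	R CMD Sweave $<

is a meta-rule for converting an Rnw file to a tex file (make itself calls this a suffix rule). In the above meta-rule, $< is the name of the source file, e.g. chapter1.Rnw. Note that the command line must begin with a tab character, not spaces. Another helpful meta-rule is: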

.Rnw.R:
	R CMD Stangle $<

which converts an Rnw file to an R file. We will also have a meta-rule for converting from .tex to .pdf.

For meta-rules to work, we have to list all the file suffixes that we will convert between. This means we have to include the following line:

.SUFFIXES: .tex .pdf .Rnw .R
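
As an aside, the same two conversions can also be written as pattern rules, which do not need a .SUFFIXES line. This is only a minimal sketch and assumes GNU make; the rest of this post keeps the suffix-rule form.

```makefile
# Pattern-rule equivalents of the .Rnw.tex and .Rnw.R meta-rules.
# Assumption: GNU make. Each recipe line must start with a tab.

# chapter1.tex is rebuilt only if chapter1.Rnw is newer, or the .tex is missing
%.tex: %.Rnw
	R CMD Sweave $<

# Extract the R code from the same Rnw source
%.R: %.Rnw
	R CMD Stangle $<
```

The pattern form can be easier to read because the target and its source appear in the usual "target: prerequisite" order.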

Files to convert

Suppose we have a master tex file called master.tex and a sweave file chapter1.Rnw. This means we need to convert from:

  • master.tex to master.pdf
  • chapter1.Rnw to chapter1.tex
  • chapter1.Rnw to chapter1.R

Obviously, we don’t want to write down every file we need, especially if we have more than one Sweave file. Instead, we just want to state the master file and the Rnw files. There are a couple of ways of doing this; the following way combines flexibility and simplicity. We first define the master and Rnw files:


##Suppose we have three Sweave files with a single master file
MAIN = master
RNWINCLUDES = chapter1 chapter2 chapter3

Now we add the relevant file extensions:

TEX = $(RNWINCLUDES:=.tex)
RFILES = $(RNWINCLUDES:=.R)
RNWFILES = $(RNWINCLUDES:=.Rnw)

In the Makefile, whenever we use the $(TEX) variable, it is automatically expanded to

chapter1.tex chapter2.tex chapter3.tex

A similar rule applies to $(RFILES) and $(RNWFILES).
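
One way to check these expansions is to ask make to print them. The sketch below is self-contained; the target name show is hypothetical (it is not part of the Makefile in this post), and all three lists are derived from RNWINCLUDES.

```makefile
MAIN = master
RNWINCLUDES = chapter1 chapter2 chapter3

# Appending a suffix to every word in RNWINCLUDES
TEX = $(RNWINCLUDES:=.tex)
RFILES = $(RNWINCLUDES:=.R)
RNWFILES = $(RNWINCLUDES:=.Rnw)

# Hypothetical helper target: prints each list after make has expanded it.
# The recipe lines must start with a tab.
show:
	@echo "TEX      = $(TEX)"
	@echo "RFILES   = $(RFILES)"
	@echo "RNWFILES = $(RNWFILES)"
```

Running make show should list the three chapter files with the .tex, .R and .Rnw extensions respectively.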

Conversion rules

We now define the file conversion rules. When we build our pdf file we want to:

  • build each tex file from its Rnw file only if the Rnw file has changed or the tex file doesn’t exist;
  • build the pdf file only if master.tex or one of the Rnw files has changed, or the pdf file doesn’t exist.

We can accomplish this with the following rule:

$(MAIN).pdf: $(TEX) $(MAIN).tex

Typically, I also have dependencies on a graphics directory and a bibtex file:

$(MAIN).pdf: $(TEX) $(MAIN).tex refs.bib graphics/*.pdf

We also have a conversion rule to R files.

R: $(RFILES)

Cleaning up

We also use make to clean up after ourselves:

# master.tex is a hand-written source file, so it is not removed here
clean:
	rm -fv $(MAIN).pdf $(TEX) $(RFILES)
	rm -fv *.aux *.dvi *.log *.toc *.bak *~ *.blg *.bbl *.lot *.lof
	rm -fv *.nav *.snm *.out *.pyc \#*\# _region_* _tmp.* *.vrb
	rm -fv Rplots.pdf *.RData

The complete Makefile

In the Makefile below:

  • make all – creates master.pdf;
  • make clean – deletes all files created as part of the latex and sweave process;
  • make R – creates the R files from the Rnw files.

.SUFFIXES: .tex .pdf .Rnw .R

MAIN = master
RNWINCLUDES = chapter1 chapter2 chapter3
TEX = $(RNWINCLUDES:=.tex)
RFILES = $(RNWINCLUDES:=.R)
RNWFILES = $(RNWINCLUDES:=.Rnw)

# NB: recipe lines must start with a tab character, not spaces

all: $(MAIN).pdf

# Rebuild the pdf when master.tex or any generated chapter file changes
$(MAIN).pdf: $(TEX) $(MAIN).tex

R: $(RFILES)

view: all
	acroread $(MAIN).pdf &

.Rnw.R:
	R CMD Stangle $<

.Rnw.tex:
	R CMD Sweave $<

.tex.pdf:
	pdflatex $<
	bibtex $*
	pdflatex $<
	pdflatex $<

# master.tex is a hand-written source file, so it is not removed here
clean:
	rm -fv $(MAIN).pdf $(TEX) $(RFILES)
	rm -fv *.aux *.dvi *.log *.toc *.bak *~ *.blg *.bbl *.lot *.lof
	rm -fv *.nav *.snm *.out *.pyc \#*\# _region_* _tmp.* *.vrb
	rm -fv Rplots.pdf *.RData
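
One optional addition, which is not part of the Makefile above: the targets all, R, view and clean are commands rather than files, so they can be declared phony. This is a sketch that assumes GNU make, or another make that supports .PHONY.

```makefile
# Assumption: a make that supports .PHONY (e.g. GNU make).
# Without this line, a stray file named "clean" or "R" in the directory
# would make the corresponding target look up to date, and make would skip it.
.PHONY: all R view clean
```

The line can go anywhere in the Makefile; placing it just before the all target keeps it easy to find.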

Useful links

  • Jeromy Anglim’s post on Sweave and Make;
  • Ross Ihaka’s Makefile on Sweave;
  • Instead of using a Makefile, you could also use a shell script.

January 28, 2011

R books for undergraduate students

Filed under: R, Teaching | csgillespie @ 10:18 pm

In a recent post, I asked for suggestions for introductory R computing books. In particular, I was looking for books that:

  • Assume no prior knowledge of programming.
  • Assume very little knowledge of statistics. For example, no regression.
  • Are cheap, since they are for undergraduate students.

Some of my cons aren’t really downsides as such. Rather, they just indicate that the books aren’t suitable for this particular audience. A prime example is “R in a Nutshell”.

I ended up recommending five books to the first year introductory R class.

Recommended Books

  • A first course in statistical programming with R (Braun & Murdoch)
    • Pros: I quite like this book (hence the reason I put it on my list). It has a nice collection of exercises, it “looks nice” and doesn’t assume knowledge of programming. It also doesn’t assume (or try to teach) any statistics.
    • Cons: When describing for loops and functions the examples aren’t very statistical. For example, it uses Fibonacci sequences in the while loop section and the sieve of Eratosthenes for if statements.
  • An introduction to R (Venables & Smith)
    • Pros: Simple, short and to the point. Free copies available. Money from the book goes to the R project.
    • Cons: More an R reference guide than a textbook.
  • A Beginner’s Guide to R by Zuur.
    • Pros: Assumes no prior knowledge. Proceeds through concepts slowly and carefully.
    • Cons: Proceeds through concepts very slowly and carefully.
  • R in a Nutshell by Adler.
    • I completely agree with the recent review by Robin Wilson: “Very comprehensive and very useful, but not good for a beginner. Great book though – definitely has a place on my bookshelf.”
    • Pros: An excellent reference.
    • Cons: Only suitable for students with a previous computing background.
  • Introduction to Scientific Programming and Simulation Using R by Jones, Maillardet and Robinson.
    • Pros: A nice book that teaches R programming. Similar to the Braun & Murdoch book.
    • Cons: A bit pricey in comparison to the other books.

Books not being recommended

These books were mentioned in the comments of the previous post.

  • The Basics of S-PLUS by Krause & Olson.
    • Most students struggle with R. Introducing a similar, but slightly different language is too sadistic.
  • Software for Data Analysis: Programming with R by Chambers.
    • Assumes some previous statistical knowledge.
  • Bayesian Computation with R by Albert.
    • Not suitable for first year students who haven’t taken any previous statistics courses.
  • R Graphics by Paul Murrell
    • I know graphics are important, but a whole book for an undergraduate student might be too much. I did toy with the idea of recommending this book, but I thought that five recommendations were more than sufficient.
  • ggplot2 by Hadley Wickham.
    • Great book, but our students don’t encounter ggplot2 in their undergraduate course.

Online Resources

  • Introduction to Probability and Statistics by Kerns
    • Suitable for a combined R and statistics course. But I don’t really do much stats in this module.
  • The R Programming wikibook (a work in progress).
    • Will give the students this link.
  • Biological Data Analysis Using R by Rodney J. Dyer. Available under the CC license.
    • Nice resource. Possibly a little big for this course (I know that this is very picky, but I had to draw the line somewhere). Will probably use it for future courses.
  • Hadley Wickham’s devtools wiki (a work in progress).
    • Assumes a good working knowledge of R
  • The R Inferno by Patrick Burns
    • Good book, but too advanced for students who have never programmed before.
  • Introduction to S programming
    • It’s in French – this may or may not be a good thing depending on your point of view 😉
