Why?

January 28, 2011

R books for undergraduate students

Filed under: R, Teaching — Tags: , , , , , , — csgillespie @ 10:18 pm

In a recent post, I asked for suggestions for introductory R computing books. In particular, I was looking for books that:

  • Assume no prior knowledge of programming.
  • Assume very little knowledge of statistics. For example, no regression.
  • Are cheap, since they are for undergraduate students.

Some of my cons aren’t really downsides as such. Rather, they just indicate that the books aren’t suitable for this particular audience. A prime example is “R in a Nutshell”.

I ended up recommending five books to the first year introductory R class.

Recommended Books

  • A first course in statistical programming with R (Braun & Murdoch)
    • Pros: I quite like this book (hence the reason I put it on my list). It has a nice collection of exercises, it “looks nice” and doesn’t assume knowledge of programming. It also doesn’t assume (or try to teach) any statistics.
    • Cons: When describing for loops and functions the examples aren’t very statistical. For example, it uses Fibonacci sequences in the while loop section and the sieve of Eratosthenes for if statements.
  • An introduction to R (Venables & Smith)
    • Pros: Simple, short and to the point. Free copies available. Money from the book goes to the R project.
    • Cons: More a R reference guide than a textbook.
  • A Beginner´s Guide to R by Zuur.
    • Pros: Assumes no prior knowledge. Proceeds through concepts slowly and carefully.
    • Cons: Proceeds through concepts very slowly and carefully.
  • R in a Nutshell by Adler.
    • I completely agree with the recent review by Robin Wilson: “Very comprehensive and very useful, but not good for a beginner. Great book though – definitely has a place on my bookshelf.”
    • Pros: An excellent reference.
    • Cons: Only suitable for students with a previous computer background.
  • Introduction to Scientific Programming and Simulation Using R by Jones, Maillardet and Robinson.
    • Pros: A nice book that teaches R programming. Similar to the Braun & Murdoch book.
    • Cons: A bit pricey in comparison to the other books.

Books not being recommended

These books were mentioned in the comments of the previous post.

  • The Basics of S-PLUS by Krause & Olson.
    • Most students struggle with R. Introducing a similar, but slightly different language is too sadistic.
  • Software for Data Analysis: Programming with R by Chambers.
    • Assumed some previous statistical knowledge.
  • Bayesian Computation with R by Albert.
    • Not suitable for first year students who haven’t taken any previous statistics courses.
  • R Graphics by Paul Murrell
    • I know graphics are important, but a whole book for an undergraduate student might be too much. I did toy with the idea of recommending this book, but I thought that five recommendations were more than sufficient.
  • ggplot2 by Hadley Wickham.
    • Great book, but our students don’t encounter ggplot2 in their undergraduate course.

Online Resources

  • Introduction to Probability and Statistics by Kerns
    • Suitable for a combined R and statistics course. But I don’t really do much stats in this module.
  • The R Programming wikibook (a work in progress).
    • Will give the students this link.
  • Biological Data Analysis Using R by Rodney J. Dyer. Available under the CC license.
    • Nice resource. Possibly a little big for this course (I know that this is very picky, but I had to draw the line somewhere). Will probably use it for future courses.
  • Hadley Wickham’s devtools wiki (a work in progress).
    • Assumes a good working knowledge of R
  • The R Inferno by Patrick Burns
    • Good book, but too advanced for students who have never programmed before.
  • Introduction to S programming
    • It’s in french – this may or may not be a good thing depending on your point of view ;)

January 25, 2011

CPU and GPU trends over time

Filed under: Computing, R — Tags: , , , , , , — csgillespie @ 4:04 pm

GPUs seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could be leveraged for statistical computing. Recently Christian Robert arXived a paper with parallel computing firmly in mind. In two weeks time I’m giving an internal seminar on using GPUs for statistical computing. To start the talk, I wanted a few graphs that show CPU and GPU evolution over the last decade or so. This turned out to be trickier than I expected.

After spending an afternoon searching the internet (mainly Wikipedia), I came up with a few nice plots.

Intel CPU clock speed

CPU clock speed for a single cpu has been fairly static in the last couple of years  – hovering around 3.4Ghz. Of course, we shouldn’t fall completely into the Megahertz myth, but one avenue of speed increase has been blocked:

Computational power per die

Although single CPUs have been limited, due to the rise of multi-core machines,  the computational power per die has still been increasing

GPUs vs CPUs

When we compare GPUs with CPUs over the last decade in terms of Floating point operations (FLOPs), we see that GPUs appear to be far ahead of the CPUs

 

Sources and data

  • You can download the data files and R code used to generate the above graphs.
    • If you find them useful, please drop me a line.
    • I’ll probably write further posts on GPU computing, but these won’t go through the R-bloggers site (since it has little to do with R).
  • Data for Figures 1 & 2 was obtained from “Is Parallel Programming Hard, And, If So, What Can You Do About It?” This book got the data from Wikipedia
  • Data from Figure 3 was mainly from Wikipedia and the odd mailing list post.
  • I believe these graphs show the correct general trend, but the actual numbers have been obtained from mailing lists, Wikipedia, etc. Use with care.

January 15, 2011

Parsing and plotting time series data

Filed under: R — Tags: , , , , , — csgillespie @ 2:42 pm

This morning I came across a post which discusses the differences between scala, ruby and python when trying to analyse time series data. Essentially, there is a text file consisting of times in the format HH:MM and we want to get an idea of its distribution. Tom discusses how this would be a bit clunky in ruby and gives a solution in scala.

However, I think the data is just crying out to be “analysed” in R:

require(ggplot2)#Load the plotting package
times = c("17:05", "16:53", "16:29", ...)#would be loaded from a file
times = as.POSIXct(strptime(times, "%H:%M")) #convert to POSIXct format
qplot(times, fill=I('steelblue'), col=I('black'))#Plot with nice colours

Which gives

I definitely don’t want to get into any religious wars of R vs XYZ. I just wanted to point out that when analysing data, R does a really good job.

 

 

January 14, 2011

Statistical podcast: Random and Pseudorandom

Filed under: Geekery, R — Tags: , , , , — csgillespie @ 10:01 am

This morning when I downloaded the latest version of In our time, I was pleased to see that this weeks topic was “Random and Pexkcd Random cartoonudorandom.” If you’re not familiar with “In our time”, then I can I definitely recommend the series. Each week three academics and Melvyn Bragg discuss a particular topic from history, science, philosophy, or religion. This weeks guests were Prof Marcus du Sautoy, Dr Colva Roney-Dougal and Prof Timothy Gowers. The discussion is aimed at the general public, but the phrase “dumbing down” certainly doesn’t apply! For example,  the introductory statement to the Mathematics episode is

Hello, Galileo wrote, “This grand book, the universe is written in the language of mathematics”. It was said before Galileo, and has been said since, and in the last decade of the 20th century, it’s being said again, most emphatically. So how important is maths in relation to other sciences at the end of the 20th century? What insight can it give us into the origins of life and the functioning of our brains? And what does it mean to say that mathematics has become more visual?

 

January 13, 2011

Survival paper (update)

In a recent post, I discussed some  statistical consultancy I was involved with. I was quite proud of the nice ggplot2 graphics I had created. The graphs nicely summarised the main points of the paper:

I’ve just had the proofs from the journal, and next to the graphs there is the following note:

It is not usual BJS style to include 95 per cent confidence intervals in K-M
curves. Could you please re-draw Figs 1 & 2 omitting these and INCLUDING ALL
FOUR CURVES IN A SINGLE GRAPH. (If you wish to include 95% c.i., the data could
be produced in tabular form instead.)

They have a policy of not including CI on graphs? So instead of a single nice graphic, they now want a graph and a table with (at least) 9 rows and 5 columns?

January 12, 2011

Random variable generation (Pt 3 of 3)

Filed under: AMCMC, R — Tags: , , , , , , — csgillespie @ 3:59 pm

Ratio-of-uniforms

This post is based on chapter 1.4.3 of Advanced Markov Chain Monte Carlo.  Previous posts on this book can be found via the  AMCMC tag.

The ratio-of-uniforms was initially developed by Kinderman and Monahan (1977) and can be used for generating random numbers from many standard distributions. Essentially we transform the random variable of interest, then use a rejection method.

The algorithm is as follows:

Repeat until a value is obtained from step 2.

  1. Generate (Y, Z) uniformly over \mathcal D \supseteq \mathcal C_h^{(1)}.
  2. If (Y, Z) \in \mathcal C_h^{(1)}. return X = Z/Y as the desired deviate.

The uniform region is

\mathcal C_h^{(1)} = \left\{ (y,z): 0 \le y \le [h(z/y)]^{1/2}\right\}.

In AMCMC they give some R code for generate random numbers from the Gamma distribution.

I was going to include some R code with this post, but I found this set of questions and solutions that cover most things. Another useful page is this online book.

Thoughts on the Chapter 1

The first chapter is fairly standard. It briefly describes some results that should be background knowledge. However, I did spot a few a typos in this chapter. In particular when describing the acceptance-rejection method, the authors alternate between g(x) and h(x).

Another downside is that the R code for the ratio of uniforms is presented in an optimised version. For example, the authors use EXP1 = exp(1) as a global constant. I think for illustration purposes a simplified, more illustrative example would have been better.

This book review has been going with glacial speed. Therefore in future, rather than going through section by section, I will just give an overview of the chapter.

The Shocking Blue Green Theme. Create a free website or blog at WordPress.com.

Follow

Get every new post delivered to your Inbox.