July 13, 2012

Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays

Filed under: latex, Microarray, Publications, R - csgillespie @ 2:32 pm

A few years ago I was involved in analysing some time-course microarray data. Our biological collaborators were interested in how we analysed their data, so this led to the creation of a tutorial, which in turn led to a paper. When we submitted the paper, one of the referees “suggested” that we write the paper using Sweave; I had never used Sweave. At the time this was a massive pain and I regularly cursed the name of the anonymous referee. A few years later, I’ve just updated the code (due to a change in Bioconductor) and it was a breeze. A belated thanks to the referee.

In this latest update to the paper I’ve

  • moved the paper to github;
  • changed from Sweave to knitr;
  • used RStudio instead of emacs.

You can find full details of the analysis on the associated GitHub page.

July 12, 2011

Reviewing a paper that uses GPUs

Filed under: Computing, Publications - csgillespie @ 1:53 pm

Graphics processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially interested in the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and the sparkle was removed. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

  1. For speed comparisons, do the authors compare a GPU with a multi-core CPU? In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or OpenMP. Take off a factor of eight when comparing to a multi-core CPU.
  2. Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older. So, take off a factor of two for each year of difference between a CPU and GPU.
  3. I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled as single precision. Take off a factor of two for double precision.
  4. When you use a GPU, you split the job into blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive impact on speed. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs, it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, it now becomes very problem dependent. What works for one model could well be disastrous for another.

So in many GPU articles the reported speed-ups could be reduced by a factor of 32! That is 8 × 2 × 2 from the first three points, assuming the CPU is just one year older than the GPU.

Just to clarify, I’m not saying that GPUs have no future; rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.
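
The back-of-the-envelope arithmetic above can be written as a short Python sketch. The function name, its parameters and the reported 100-fold speed-up are all made up for illustration; only the three discount factors come from the list.

```python
def adjusted_speedup(reported, cores_factor=8.0, cpu_age_years=1,
                     single_precision=True):
    """Discount a reported GPU-vs-CPU speed-up using the rules of thumb above.

    All names and defaults are illustrative assumptions, not a standard.
    """
    discount = cores_factor               # single-core vs multi-core CPU
    discount *= 2.0 ** cpu_age_years      # factor of two per year the CPU is older
    if single_precision:
        discount *= 2.0                   # single vs double precision
    return reported / discount, discount


# Hypothetical paper claiming a 100-fold speed-up over a single-core CPU
# that is one year older than the GPU, using single precision.
speedup, discount = adjusted_speedup(100.0)
print(f"discount factor: {discount:g}, adjusted speed-up: {speedup:g}")
```

The fourth point (thread/block tuning) is deliberately left out, since it is problem dependent and has no single factor attached to it.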

May 29, 2011

Impact factors for statistics journals

Filed under: Publications - csgillespie @ 1:08 pm

The other day I came across “Nefarious numbers” by Douglas Arnold and Kristine Fowler on arXiv. This paper examines how impact factors can be easily and blatantly manipulated.

What is an Impact Factor?

For a particular year, the impact factor of a journal is the average number of citations received in that year per paper published in the journal during the two preceding years. The impact factor has a number of glaring flaws:

  • Impact factors vary across disciplines.
  • The submission to publication process in a statistical journal can take up to a year.
  • The impact factor is just a single statistic out of many possible measures.
  • The underlying database contains many errors.
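
The definition above is just a ratio, which can be sketched in a few lines of Python. The function name and the counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def impact_factor(citations_this_year, papers_prev_two_years):
    """Two-year impact factor: citations received this year to papers from
    the two preceding years, divided by the number of those papers."""
    if papers_prev_two_years <= 0:
        raise ValueError("need at least one paper in the two preceding years")
    return citations_this_year / papers_prev_two_years


# Hypothetical journal: 120 papers published in 2006-2007,
# cited 300 times during 2008.
if_2008 = impact_factor(300, 120)
print(if_2008)  # 2.5
```

Because the denominator is often small, a few hundred extra citations from a handful of authors can move the ratio a long way.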

International Journal of Nonlinear Sciences and Numerical Simulation

The Australian Research Council (ARC) recently released an evaluation, listing quality ratings for over 20,000 peer-reviewed journals across various disciplines. This list was constructed through a review process involving academics, disciplinary bodies and learned academies. Each journal is ranked from A* to C, where

  • A*: one of the best in its field or sub-field;
  • A: very high quality;
  • B: solid, though not outstanding reputation;
  • C: does not meet the criteria of the higher tiers.

The ARC ranked the International Journal of Nonlinear Sciences and Numerical Simulation (IJNSNS) as a B. However, in 2008 this journal had an impact factor of 8.9 – more than double the next highest journal in the Applied Mathematics section. As the paper explains, the reason for the large impact factor is easy to see. In 2008, the three top-citing authors to IJNSNS were:

  • Ji-Huan He, the journal’s Editor-in-Chief, who cited the journal 243 times within the two-year window;
  • D. D. Ganji, a member of the editorial board, with 114 cites;
  • Mohamed El Naschie, a regional editor, with 58 cites.

Comparing these numbers with other journals shows how extreme IJNSNS really is – the next highest impact factor is around 4. Arnold and Fowler also investigate the journals where the citations occur. These journals turn out to be IJNSNS itself or special issues of other journals edited by someone on the IJNSNS board.

Impact Factors for Statistics Journals

The ARC statistics section contains around two hundred journals. Some of these journals are “traditional” statistics journals, such as JASA, RSS, and Biometrics. Other journals are more applied, such as Bioinformatics and Mathematical Biosciences. So in the following comparison, I just considered journals classed as “statistics” by the ISI Web of Knowledge. This leaves seventy-seven journals.

The following plot shows the two- and five-year impact factors for the seventy-seven statistical journals, grouped by ARC rating. The red dots show the median impact factor for each group.

As would be expected, for the two-year IF there is very little difference between the ARC ratings – although more than I expected. Once we calculate the five-year impact factors, the differences between ratings are clearer. Since many of the group C journals are new, a number of them don’t have a five-year impact factor.

Outlying Statistical Journals

There are three journals that stand out from their particular groups:

  • Statistical Science, a group A journal. Since this is mainly a review journal, it’s really not surprising that it has a high impact factor.
  • Journal of Statistical Software and the Stata Journal, group C journals. Since these are “statistical computing” journals, it isn’t that surprising that they have high impact factors.

Should we use Impact Factors?

The best answer would be no! Just read the first page of “Nefarious numbers” for a variety of reasons why we should dump impact factors. However, I suspect that impact factors will be forced on many of us as a tool to quantify our research. Therefore, while we should try to fight against them, we should also keep an eye on them for evidence of people gaming the system.

January 13, 2011

Survival paper (update)

In a recent post, I discussed some statistical consultancy I was involved with. I was quite proud of the nice ggplot2 graphics I had created. The graphs nicely summarised the main points of the paper.

I’ve just had the proofs from the journal, and next to the graphs there is the following note:

It is not usual BJS style to include 95 per cent confidence intervals in K-M
curves. Could you please re-draw Figs 1 & 2 omitting these and INCLUDING ALL
FOUR CURVES IN A SINGLE GRAPH. (If you wish to include 95% c.i., the data could
be produced in tabular form instead.)

They have a policy of not including CI on graphs? So instead of a single nice graphic, they now want a graph and a table with (at least) 9 rows and 5 columns?

December 8, 2010

New paper: Survival analysis

Filed under: Publications, R - csgillespie @ 8:13 pm

Each year I try to carry out some statistical consultancy to give me experience in other areas of statistics and also to provide teaching examples. Last Christmas I was approached by a paediatric consultant from the RVI who wanted to carry out a prospective survival analysis. The consultant, Bruce Jaffray, had performed Nissen fundoplication surgery on 230 children. Many of the children had other medical conditions such as cerebral palsy or low BMI. He was interested in the factors that affected patients’ survival.

The model

We fitted a standard Cox proportional hazards model. The following covariates were significant:

  • gastrostomy
  • cerebral palsy
  • gender
  • need for revision surgery. This was when the child had to return to hospital for more surgery.
  • an interaction term between gastrostomy & cerebral palsy.

The interaction term was key to getting a good model fit. The figures (one of which is shown below) were constructed using ggplot2 and R. The referees actually commented on the (good) quality of the statistical work and the nice figures! Always nice to read. Unfortunately, there isn’t a nice survival-to-ggplot2 interface. I had to write some rather hacky R code 😦


The main finding of the paper was the negative effect of cerebral palsy and gastrostomy on survival. Unfortunately, if a child had a gastrostomy or had cerebral palsy then survival was dramatically reduced. The interaction effect was necessary, otherwise we would have predicted that all children with a gastrostomy and cerebral palsy wouldn’t survive.
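
To see why the interaction term matters, here is a minimal Python sketch of how hazard ratios combine in a Cox model with two binary covariates. The coefficients are invented for illustration and are not the fitted values from the paper; the negative sign on the interaction is an assumption that matches the description above, where the main effects alone overstate the combined risk.

```python
import math

# Hypothetical log hazard ratios (NOT the estimates from the paper).
BETA_GASTROSTOMY = 1.5
BETA_CEREBRAL_PALSY = 1.2
BETA_INTERACTION = -1.0   # assumed negative: dampens the combined effect


def hazard_ratio(gastrostomy, cerebral_palsy, interaction=True):
    """Hazard relative to a child with neither condition.

    Cox model: h(t | x) = h0(t) * exp(linear predictor), so the baseline
    hazard h0(t) cancels when taking the ratio.
    """
    lp = BETA_GASTROSTOMY * gastrostomy + BETA_CEREBRAL_PALSY * cerebral_palsy
    if interaction:
        lp += BETA_INTERACTION * gastrostomy * cerebral_palsy
    return math.exp(lp)


hr_both_additive = hazard_ratio(1, 1, interaction=False)
hr_both_interaction = hazard_ratio(1, 1, interaction=True)
print(f"both conditions, main effects only: {hr_both_additive:.2f}")
print(f"both conditions, with interaction:  {hr_both_interaction:.2f}")
```

Without the interaction the two main effects simply multiply, which is how a model can end up predicting that nobody with both conditions survives.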

Other results

  • There was a rather strange and strong gender effect – male survival was greater than female.
  • The revision covariate was also significant – children who needed their fundoplication redone had increased survival. At first glance this is strange – the operation had to be redone, yet this was good for survival. However, this was really a red herring. Essentially children who had their surgery redone had by definition survived a minimum amount of time. I think something a bit more sophisticated could have been done with this variable, but the numbers weren’t that large.
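
The red herring in the revision covariate can be reproduced with a toy simulation, sketched below in Python. Everything here is an assumption for illustration: survival and time-to-revision are drawn independently from the same exponential distribution, so revision has no real effect on survival, yet the “revised” group still looks better.

```python
import random


def simulate(n=100_000, seed=1):
    """Mean survival time for children who did / did not reach a revision.

    Toy set-up (assumed, not from the paper): death and revision times are
    independent, so any difference between groups is purely a selection effect.
    """
    rng = random.Random(seed)
    revised, not_revised = [], []
    for _ in range(n):
        death = rng.expovariate(1.0)      # time to death
        revision = rng.expovariate(1.0)   # time at which a revision would occur
        # A child is only ever recorded as "revised" if still alive at that time.
        (revised if revision < death else not_revised).append(death)
    return sum(revised) / len(revised), sum(not_revised) / len(not_revised)


mean_revised, mean_not_revised = simulate()
print(f"mean survival, revised:     {mean_revised:.2f}")
print(f"mean survival, not revised: {mean_not_revised:.2f}")
```

In this set-up the revised group should average roughly three times the survival of the other group. One more sophisticated option would be to treat revision as a time-dependent covariate, so that a child only counts as “revised” from the moment of the second operation.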

