August 16, 2011

High Performance Computing (useR! 2011)

Filed under: Conferences, R, useR! 2011 — csgillespie @ 1:28 pm

Willem Ligtenberg – GPU computing and R

Why GPU computing? The theoretical GFLOPS of a GPU is around three times greater than that of a CPU. Use GPUs for single instruction, multiple data (SIMD) problems. Initially GPUs were developed for texture problems; for example, a wall smashed into lots of pieces, with each core handling a single piece. CUDA and FireStream are brand specific, but OpenCL is an open standard for GPUs. In principle(!), you write code once and it works on multiple platforms and devices.

Current R-packages:

  • gputools: GPU implementations of standard statistical tools. CUDA only.
  • rgpu: dead.
  • cudaBayesreg: CUDA implementation of a Bayesian linear model.
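As a flavour of the gputools interface, here is a minimal sketch (assuming an NVIDIA card with CUDA set up; `gpuMatMult` is part of the gputools API, but check the package documentation for your version):

```r
# Sketch of gputools usage; requires an NVIDIA GPU and the CUDA toolkit.
library(gputools)

a <- matrix(rnorm(1000 * 1000), nrow = 1000)
b <- matrix(rnorm(1000 * 1000), nrow = 1000)

c_gpu <- gpuMatMult(a, b)   # matrix product computed on the GPU
c_cpu <- a %*% b            # standard single-threaded CPU version
all.equal(c_cpu, c_gpu)     # should agree to floating-point tolerance
```

The appeal is that the GPU call is a drop-in replacement for the usual operator, with no kernel code to write.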

ROpenCL is an R package that provides an R interface to the OpenCL library – like Rcpp, but for OpenCL. ROpenCL manages the memory for you (yeah!). A little over a week ago the OpenCL package was published on CRAN by Simon Urbanek; it is a really thin layer around OpenCL. ROpenCL should work out of the box on Ubuntu; not sure about Windows.

Pragnesh Patel – Deploying and benchmarking R on a large shared memory system

Single system image – Nautilus: 1024 cores, 4TB of global shared memory, 8 NVIDIA Tesla GPUs.

  • Shared memory: symmetric multiprocessor, uniform memory access; does not scale.
  • Physically distributed memory: multicomputer or cluster.
  • Distributed shared memory: non-uniform memory access (NUMA).

Need to measure the parallel overhead of parallel programs. Implicit parallelism: BLAS, pnmath. The programmer does not worry about task division or process communication, which improves programmer productivity.

pnmath (Luke Tierney): this package implements parallel versions of most of the non-RNG routines in R's math library. It uses OpenMP directives and requires only trivial changes to serial code. Results summary: pnmath is only worthwhile for some functions; for example, dnorm is slower in parallel. Weak scaling: algorithms that have heavy communication do not scale well.
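The "trivial changes" point is worth emphasising: loading pnmath transparently replaces the serial routines, so existing code runs in parallel unchanged. A rough timing sketch (pnmath is distributed from Luke Tierney's site rather than CRAN, and needs an OpenMP-capable compiler):

```r
# pnmath swaps in OpenMP versions of R's vectorised math functions
# at load time; no user code needs to change.
library(pnmath)

x <- runif(1e7)
system.time(qbeta(x, 2, 3))   # an expensive function: parallelism can pay off
system.time(dnorm(x))         # a cheap function: threading overhead can dominate
```

Comparing the same calls before and after loading the package shows exactly the effect reported in the talk: wins for expensive functions, losses for cheap ones.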

Lessons: data placement is important, performance depends on system architecture, operating system and compiler optimizations.

Reference: Implicit and Explicit Parallel computing in R by Luke Tierney.

K Doi – The CUtil package which enables GPU computation in R

Windows only

Features of CUtil:

  • Easy to use for Windows users.
  • Overrides common operators.
  • Small data transfer cost.
  • Support for double precision.

A Linux version is planned – not sure when. Developed under R 2.12.x and 2.13.x.

Implemented functions: a configuration function, standard matrix algebra routines, and memory transfer functions. Some RNGs are also available, e.g. Normal and LogNormal.

“Tiny benchmark” example: seems a lot faster – around 25 times. However, the example only uses a single CPU core as its test case.

M Quesada – Bayesian statistics and high performance computing with R

Desktop application OBANSoft has been developed. It has a modular design. Amongst other things, this allows integration with OpenMP, MPI, CUDA, etc.

  • Statistical library: Java + R.
  • Desktop: Java Swing
  • Parallelization: Parallel R

Uses a “Model-View-Controller” setup.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister, who makes excellent notes when she attends conferences.

July 12, 2011

Reviewing a paper that uses GPUs

Filed under: Computing, Publications — csgillespie @ 1:53 pm

Graphical processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially taken with the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and the sparkle wore off. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

  1. For speed comparisons, do the authors compare a GPU with a multi-core CPU? In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or OpenMP. Take off a factor of eight when comparing to a multi-core CPU.
  2. Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older. So, take off a factor of two for each year of difference between a CPU and GPU.
  3. I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled as single precision. Take off a factor of two for double precision.
  4. When you use a GPU, you split the job in blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive speed impact on your problem. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs, it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, it now becomes very problem dependent. What could work for one model, could well be disastrous for another.
So in many GPU articles the speed comparisons could be reduced by a factor of 32 (8 for multiple cores × 2 for CPU age × 2 for double precision)!
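Point 1 is the easiest for a reviewer to check, since a fair multi-core CPU baseline takes only a couple of lines of R. A sketch using the parallel package (on R before 2.14, the same `mclapply` lives in the multicore package; `sim` here is just a stand-in for whatever one stochastic simulation costs):

```r
# A fairer CPU baseline for a GPU benchmark: use every core,
# not a single-threaded loop.
library(parallel)

sim <- function(i) mean(rnorm(1e5))   # stand-in for one stochastic simulation

system.time(lapply(1:64, sim))        # single-core baseline
system.time(mclapply(1:64, sim,      # all-cores comparison
                     mc.cores = detectCores()))
```

If a paper's reported GPU speed-up shrinks by roughly the core count when measured against the second timing, the headline number was inflated.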
Just to clarify, I’m not saying that GPUs have no future, rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.

January 25, 2011

CPU and GPU trends over time

Filed under: Computing, R — csgillespie @ 4:04 pm

GPUs seem to be all the rage these days. At the last Bayesian Valencia meeting, Chris Holmes gave a nice talk on how GPUs could be leveraged for statistical computing. Recently Christian Robert arXived a paper with parallel computing firmly in mind. In two weeks time I’m giving an internal seminar on using GPUs for statistical computing. To start the talk, I wanted a few graphs that show CPU and GPU evolution over the last decade or so. This turned out to be trickier than I expected.

After spending an afternoon searching the internet (mainly Wikipedia), I came up with a few nice plots.

Intel CPU clock speed

CPU clock speed for a single CPU has been fairly static over the last couple of years – hovering around 3.4 GHz. Of course, we shouldn’t fall completely for the Megahertz myth, but one avenue of speed increase has been blocked:

Computational power per die

Although single CPUs have been limited, the computational power per die has still been increasing, due to the rise of multi-core machines:

GPUs vs CPUs

When we compare GPUs with CPUs over the last decade in terms of floating point operations per second (FLOPS), we see that GPUs appear to be far ahead of CPUs:


Sources and data

  • You can download the data files and R code used to generate the above graphs.
    • If you find them useful, please drop me a line.
    • I’ll probably write further posts on GPU computing, but these won’t go through the R-bloggers site (since it has little to do with R).
  • Data for Figures 1 & 2 was obtained from “Is Parallel Programming Hard, And, If So, What Can You Do About It?” This book got the data from Wikipedia
  • Data for Figure 3 was mainly from Wikipedia and the odd mailing list post.
  • I believe these graphs show the correct general trend, but the actual numbers have been obtained from mailing lists, Wikipedia, etc. Use with care.
