August 19, 2011

Development of R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 8:44 am

Michael Rutter – R for Ubuntu

Ubuntu 10.10 uses 2.10.1. Backports are newer versions of software for old releases. R backports are available CRAN (link).

Lauchpad is a website for users to develop and maintain software (Canonical). One of Launchpad’s services is the personal package archive (PPA). This allows users to upload .deb source files, allowing easy creation of multiple Ubuntu releases and arch’s.


Dirk creates source file -> Michael gets source file -> packages built on launchpad -> Post on CRAN using apt-mirror.

There’s also a PPA available. PPAs are easier to add to the user’s system. Ubuntu has about 75 r-cran packages available in the main repository. A PPA could build the packages if the .deb packages were available. Could we use cran2deb?

cran2deb:  (no longer works), since maintaining the (virtual) machines to build the packages is time-consuming. Use launchpad.

cran2deb4ubuntu (PPA):  Contains most of the packages and dependencies from CRAN – 1107 in total. All packages can be installed with: sudo apt-get install r-cran-foo

  • Exceptions: non-free licences, windows/mac, dependencies not available to Launchpad (CUDA);
  • Problems(?): Can only install r-cran-foo outside of current R session. Can we get install.packages("foo") to look for r-cran-foo first?
  • Benefits: automatic updates to packages and creating R instances in the cloud.
  • Issues: c2d4u only available for 11.04. Naming and building issues for future versions. Space limitations on Launchpad may limit previous versions.

Andrew Runnalls – The CXXR project

The CXXR is progressively re-engineering the fundamental parts of the R interpreter from C to C++. Started in 2007, current release shadows 2.12.1. The aim of the project is to make the R interpreter more accessible to developers and researchers.

  • Improve documentation;
  • Encapsulation;
  • Move to an object-oriented structure;
  • Express internal algorithms.


In CR, the C union is used to implement R object. This has a few disadvantages:

  • compiler doesn’t know which of the 23 types is at an address;
  • debugging at the C level is tricky
  • Adding a new type of R object means modifying a data definition at the heart of the interpreter

CXXR maps R objects to a particular C++ class.


  • Move program code relating to a datatype into one places
  • Use C++ public/protected/private mechanism
  • Allow developers to extend the class hierarchy.

Illustrative example: write a package to handle large integers

GNU MP library defines a C++ class mpz_class to represent an arbitrarily large integer, but not NA’s In CXXR, NA’s are added with a single line of C. Another line of code is used to create a vector of BigInts. It’s straightforward to add binary operations.

Subscripting in R

R is renowned for the power of its subscripting operations. In the CR interpreter, there are around 2000 C-language statements to implement these facilities. But this C code is locked up; no API and hard-wired around CR’s built-in data types. This is buried treasure.

CXXR makes an API available through its API. The API abstracts away from the type of elements and container. Result: adding subscripting operations is fairly simple.

Current problems: no serialization. No provision for BigIntVectors to be saved across sessions

Claudia Beleites: Google Summer of Code 2011

Open source software coding projects. Results can be used as part of thesis or article.

  • Student stipend: US$5000. Mentoring Organization: US $50;
  • Project topics: 7 GUI/images/visualisation, 4 optimization, 1 on High performance computing.
  • Aims: introduce students to the R developer community and push forward their project. roxygen and cran2deb were previous GSoC projects.
  • Communication channels: email, IM, skype, personal meetings.


  • Two mentors per student.  The two admins ping projects every now and again;
  • Time lines are based on US summer holidays;
  • Vanishing mentor and student.

Advice for Mentors:

  • Start to look early (January) for students. Look for a co-mentor;
  • Plan the time carefully;
  • Remember that coding time is also holiday time and students range from 1st year to PhD students.

August 18, 2011

Simon Urbanek – R Graphics: supercharged

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:50 pm

New features:

  • rasterImage() (R2.11)
    • bitmap raster drawing;
    • have maps as data backdrops.
  • Polygons with holes: polypath() -(R2.12)
  • At present there is no way to tell when to actually show the plot. For example: plot(x); lines(x). Should we display the plot after plot or after lines
    • Solution dev.hold() and dev.flush()
    • Better performance and useful for animations – (currently in R-dev).


Data size increases, but large RAM (>100GB) and CPU power is affordable. Visualization needs to keep up.

  • Currently rendering is slow. Solutions: OpenGL + GPUs.
  • Visualisation methods for large data
    • interactivity (divide and conquer, shift of focus): use iPlots eXtreme (very nice demo of iplots!)
    • sufficient statistics, aggregations, etc.


Note: lots of very nice demos, hence the lack of notes.

Kaleidoscope IIIb (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:00 pm

O. Mersmann – The microbenchmark package

Slides and code (link).

SURGEON GENERAL’s WARNING: Microbenchmarks can lead to a distorted view of reality and massive loss of productivity

For a higher-order benchmarking package check out the rbenchmark package on R (suggestion from the speaker).

Why do we need micro-benchmarking? A simple example showed that it is currently very difficult to benchmark 1+1 and f= function() NULL  using system.time. Microbenchmark has a very simple interface. Unlike system.time, MB measures the times of each individual function call. Produces summary statistics and plots.

How does microbenchmark() work?

  • Linux: clock_getttime(), gethrtime();
  • MAC: mach_timebase_info();
  • Windows: QueryPerformanceCounter(), QueryPerformanceFrequency()


  1. Precision of clock is unknown: clock could drift, timing might be zero, might observe discrete values;
  2. Clock only measures elapsed time. Some of this time may not actually be the R process.
Countermeasures to these problems include configurable CPU warm-up phase, configurable order of execution, warning if timings underflow. There are problems with MacOS X and Windows XP.

Planned features:

  • More plotting functions;
  • Possibly use OS API;
  • Better diagnostic messages;
  • Estimate clock granularity.

Paul Murrell – Vector image processing

Problem: convert a pretty pdf map into an interactive SVG document.
PDF -> R -> SVG
Discussion of recent improvements to the core R graphics engine and grImport. Using the svg would produce a static svg. However, the gridSVG produces an interactive SVG. Use grid.animate, .garnish, .hyperlink, .script to make the picture interactive.
Looks like a very nice package.

Big data (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 11:23 am

Unfortunatley, I missed the first and last talks.

My notes from a session on Thursday morning

J. Demmler – Challenges of working with a large database of routinely
collected health data

The SAIL data bank holds over 1.9 billion (anonymous) entries. To use the data for research, they need to ensure that proper data security is observed. For example, secure data transport. All analysis is done with a secure environment. Files are moved into the environment via an FTP client

Why R? No advanced SQL options available, so using DB2 allows loops. Also R is great for data pre-cleaning and is suitable for the heavy analysis. To connect to the SAIL database, they need to use the RODBC package. SQL queries are run from within R, however SQL scripts are kept in separate files since they are “reviewed”.

Lots of errors in data, e.g. units.

John Bryant – Demographic: classes and methods for data about populations

Existing data structures for population type data:

  • array: messy code;
  • data frames: not that natural for this type of code;
  • demography package: not really extensible.

Target audience for this new package: applied statisticians, social scientists. Not programmers. Core to this package is the Demographic class: S4 object, specialized array with associated meta data.

August 17, 2011

Programming (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 1:35 pm

Ray Brownrigg – Tips and Tricks for young R programmers


Calculate the distribution function of a bivariate Kolomogorov Smirnoff statistic. Essentially three loops. Basic exhaustive search is O(N^3). Fortran gives a single order of magnitude speed-up. Restructuring in R using a single loop is an order faster than fortran. Further improvements make the algorithm 3 times faster.


  • Resolution of pdf graphs: specify width and height to suit eventual size.
  • Local versions of standard functions: compare rank(x) with .Internal(rank(x, "min")).  Ditto with sort
  • Vectorisation
  • Curve: handy for finding errors

F Schuster – Software design patterns in R

In Java software design patterns are everywhere. What about R?

What is a design pattern?

A generalised, reusable and time test-test solution. Every pattern has a description of its general principle. A collection of patterns are organised into catalogues. Reusing proven concepts helps productivity.

R design pattern

  • Factory method pattern. e.g. plotting program calls a function to get a symbol. The factor method makes the program independent of how the symbols are created.
  • A function closure maintains the object state. You can have private functions within a closure.
  • Map pattern – apply function in R
  • Filter –
  • compose concept and chain of responsibility

Patrick Burns – Random input testing with R

Good talk, just found it hard to make notes. A closely related topic is fuzzy testing.


Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Kaleidoscope IIb (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 9:35 am

L Collingwood – RTextTools

RTextTools. A machine learning library for automated text classification. This package builds on previous packages such as tm and random forests. Use case: undergrad labels congressional bills but then quits. Using the previously labelled data, automatically classify the remaining documents. The speaker gave a nice overview of machine learning techniques, but I was familiar with them so didn’t bother making notes.


  1. Read data;
  2. Missed opps;
  3. Create Corpus;
  4. Train Models – SVM, SLDA, TREE, etc;
  5. Classify models;
  6. Analyze data.

Jason Waddel – The Role of R in Lab Automation

License: free as in free beer and speech!

Summary: a scientist repeats the same experiment multiple times. How can we automate analysis.

R service bus allows a scientist to email/upload data and the results are automatically generated.

High level view

Various inputs such as pop, xml, REST WS. Each input is added to the queue. A pool of R servers handles the job. A simple configuration file handles the set-up.


Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know!

Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 8:49 am

The RevoScaleR package isn’t open source, but it is free for academic users.

Collect and storing data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package is part of the revolution R Enterprise. This package provides data management and data analysis. Uses multiple cores and should scale.


What is scalability – from small in-memory data.frame to multi-terabyte data sets distributed across space and even time.  Key to solving this problem is being able to process more data than can fit into the memory at a single time. Data is processed in chunks.

Two main problems: capacity (memory problems) and speed (too slow). Most commonly used statistical software tools can’t handle large data. We still think in terms of “small data sets”.

High performance analytics = HPC + Data

  • HPC is CPU centric. Lot’s of processing on small amounts of data.
  • HPA is data centric. Less processing per amount of data. Needs efficient threading and data management. Key to this is data chunking
Revolutions approach this problem by having a set of R functions (written in C++). Try to keep things familiar. Analysis tools should work on small and large problems. The outputs should be standard R objects. Sample code for logistic regression looks very similar to standard R functions. To run the logistic function on a cluster, just change the “compute context” – a simple function call.
External memory applications allow automatic parallelisation. They split a job into tasks that operate on separate blocks data. Parallel algorithms split the task into separate jobs that can be run together – I think.


  • Initialization task: total = 0, count = 0;
  • Process data tasks: for each block of x, total =sum(x), count = length(x);
  • Update results: combine total and count;
  • Process results.


ScaleR can process data from a variety of formats. It uses it’s own optimized format (XDF) that is suitable for chunking. XDF format:

  • data is stored in blocks of rows
  • header is at the end
  • allows sequential reds
  • essentially unlimited in size
  • Efficient desk space usage.
Airline example: Results seem impressive and scale well. Compared to SAS it seems to do very well.

August 16, 2011

Jonathan Rougier – Nomograms for visualising relationships between three variables (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:19 pm


Example of Nomogram taken from wikipedia

Donkeys in Kenya. Tricky to find the weight of a donkey in the “field” – no pun intended! So using a few measurements,  estimate the weight. Other covariates include age. Standard practice is to fit:

\log(weight) = a + b \times \log(heartgirth) + c \times \log(Height)

for adult donkeys, and other slightly different models for young/old and ill donkeys. What can a statistician add:

  • Add in other factors;
  • Don’t (automatically) take logs of everything;
  • Fit interactions.
Box-Cox suggested that a square-transformation could be a good transformation. Full model has age, health, height and girth. Final model is:

weight = (-58.9 + 10.2 \times \log(heart) + 4.8 \times \log(height))^2

We want a simple way of using this model in the field. Use a monogram!

Digression on nomograms

Nomograms are visual tools for representing the relationship between three or more variables. Variations include:
  • curved scaled nomograms;
  • some others that I missed.
Lots of very nice nomograms from “The lost art of Nomograms”.

Back to donkeys

If we used a log transformation for weight rather than square root we get slightly higher weights for smaller/larger donkeys. Nomograms nicely highlight this.


Nomograms can be clearer and simpler, but don’t display predictive uncertainty.


Ulrike Gromping – Design of Experiments in R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 1:47 pm

Example: Car seat occupation:

Algorithm must decide whether airbag opens:

  • Must open for adult but not for small child or if the seat if empty;
  • a few others I missed.

Key questions are:

  • What type of design: 32 run regular fractional factorial;
  • Response measurement – depends on dummy position, so repeat for 3 different;
  • dummy places;
  • Precision – are 32 rounds enough.

Use frf2 function in Ulrike’s package to generate experimental design.

Principals of DoE. Initially developed by Fisher. Key principles are blocking, randomization, replication. Repeated measurements are NOT replications.

In this example, there is high measurement error variance. Repeats done directly in sequence. Need to decide between replications and repeats – these are not the same! Balanced factorial experiments provide intrinsic replication.

George Box advocated that when creating experimental designs, you should proceed sequentially. Smaller initial screening. This does not apply to computer experiments.

Experimental Design Task View

Started in Feb 2008 and currently contains 37 packages. Maintainers need help, please point out relevant packages or complain about unhelpful packages. Of the 37 packages, only 18 have a dependency relation to others.

As with many packages, FrF2 and DOE.base were developed (in 2008) because someone needed them.

Package dependencies

DoE.base -> FrF2 -> DoE.wrapper -> Gui interface (R commander plugin).

  • DoE.base: for full factorial with blocking and orthogonal arrays;
  • FrF2 2-level fractional factorials;
  • DoE.wrapper – unify syntax.
Packages have a common output structure based on the White book. An R commander plugin has been developed and released, but it’s still in the early stages.


Should the GUI give guidance with experimental design? This is tricky with such a flexible system. How can the software know the correct solution. Is there a middle ground? Solutions would be very work-intensive – who will do the work?

Call for activities

A lot is available, but there are still gaps in functionality. If you have the expertise, why not write a package?  Other possibilities are bug reports, suggestions for improvement, wishes, GUI beta testing, internationalization (not quite ready yet).

High Performance Computing (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , , , , — csgillespie @ 1:28 pm

Wilem Ligtenberg – GPU computing and R

Why GPU computing? Theoretical GFLOPs for a GPU is three times greater than a CPU. Use GPUs for same instruction multiple data problems (SIMD). Initially GPUs were developed for texture problems. For example, a wall smashed into lots of pieces. Each core handled a single piece. CUDA and FireStream are brand specific. However, OpenCL is an open standard for GPUs. In principle(!), write code that works on multiple platforms and code.

Current R-packages:

  • gputools: GPU implementation of standard statistical tools. CUDA only
  • rgpu: Dead.
  • cudaBayesreg: linear model from a bayes package.

ROpenCL is an R package that provides an R interface to the openCL library. Like Rcpp for OpenCL. ROpenCL manages the memory for you (yeah!). A little over a week ago the OpenCL package was published on CRAN by Simon Urbanek. The OpenCL is a really thin layer around Open CL. ROpenCL should work out of the box on Ubuntu, not sure of Windows.

Pragnesh Patel – Deploying and benchmarking R on a large shared memory system

Single system image – Nautilus: 1024 cores, 4TB of global shared memory, 8 NVIDIA Tesla GPUs.

  • Shared Memory: symmetric multiprocessor, uniform memory access, does not scale.
  • Physical distributed memory: multicomputer or cluster. Distributed shared
  • memory: Non-uniform memory access (NUMA).

Need to find parallel overhead on parallel programs. Implicit parallelism: BLAS, pnmath. Programmer does not worry about task division or process communication. Improves programmer productivity.

pnmath (Luke Tierney): this package implements parallel versions of most of the non-rng routines in the math library. Uses OpenMP directives. Requires only trivial changes to serial code. Results summary: pnmath is only worthwhile for some functions. For example, dnorm is slower in parallel. Weak scaling: algorithms that have heavy communication do not scale well.

Lessons: data placement is important, performance depends on system architecture, operating system and compiler optimizations.

Reference: Implicit and Explicit Parallel computing in R by Luke Tierney.

K Doi – The CUtil package which enables GPU computation in R

Windows only

Features of CUtil:

  • Easy to use for Windows users
  • overriding common operators;small data transfer cost;
  • support for double precision.

Works under Windows. A Linux version is planned – not sure when. Developed under R 2.12.X and 2.13.x.

Implemented functions: configuration function, standard matrix algebra routines, memory tranfer functions. Some RNG are also available, eg Normal, LogNormal.

“Tiny benchmark” example: Seems a lot faster around 25 times. However, the example only uses a single CPU as it’s test case.

M Quesada – Bayesian statistics and high performance computing with R

Desktop application OBANSoft has been developed. It has a modular design. Amongst other things, this allows integration with OpenMP, MPI, CUDA, etc.

  • Statistical library: Java + R.
  • Desktop: Java swing
  • Parallelization: Parallel R

Uses a “Model-View-Controller” setup.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister who makes excellent notes when she attend conferences.

Older Posts »

Create a free website or blog at WordPress.com.