August 17, 2011

Lee E. Edlefsen – Scalable Data Analysis in R (useR! 2011)

Filed under: Conferences, R, useR! 2011. Posted by csgillespie @ 8:49 am

The RevoScaleR package isn’t open source, but it is free for academic users.

Collecting and storing data has outpaced our ability to analyze it. Can R cope with this challenge? The RevoScaleR package is part of Revolution R Enterprise. It provides data management and data analysis functions, uses multiple cores and should scale.


What is scalability? It means handling anything from a small in-memory data.frame to multi-terabyte data sets distributed across space and even time. The key to solving this problem is being able to process more data than can fit into memory at one time, so the data is processed in chunks.
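
To make the chunking idea concrete, here is a minimal sketch in Python (not R, and not the RevoScaleR API). It streams rows from a CSV source a few at a time, so only one chunk is ever held in memory. The function names `read_chunks` and `chunked_max`, the `delay` column and the tiny in-memory "file" are all made up for illustration.

```python
import csv
import io
from itertools import islice


def read_chunks(lines, chunk_size):
    """Yield lists of at most chunk_size parsed rows from an iterable of CSV lines."""
    reader = csv.DictReader(lines)
    while True:
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            return
        yield chunk


def chunked_max(lines, column, chunk_size=2):
    """Return (row_count, maximum of column), holding one chunk in memory at a time."""
    rows_seen = 0
    largest = None
    for chunk in read_chunks(lines, chunk_size):
        values = [float(row[column]) for row in chunk]
        rows_seen += len(values)
        chunk_max = max(values)
        largest = chunk_max if largest is None else max(largest, chunk_max)
    return rows_seen, largest


# Stand-in for a file too large to load at once (hypothetical data).
demo = io.StringIO("delay\n5\n12\n3\n40\n7\n")
print(chunked_max(demo, "delay"))  # (5, 40.0)
```

Memory use depends on `chunk_size` rather than on the total number of rows, which is the property that matters once the data no longer fits in RAM.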

Two main problems: capacity (memory problems) and speed (too slow). Most commonly used statistical software tools can’t handle large data. We still think in terms of “small data sets”.

High performance analytics (HPA) = HPC + Data

  • HPC is CPU centric: lots of processing on small amounts of data.
  • HPA is data centric: less processing per amount of data. It needs efficient threading and data management, and the key to this is data chunking.

Revolution's approach to this problem is a set of R functions (written in C++) that try to keep things familiar. Analysis tools should work on both small and large problems, and the outputs should be standard R objects. Sample code for logistic regression looks very similar to the standard R functions. To run the logistic regression on a cluster, just change the “compute context”, which is a simple function call.

External memory applications allow automatic parallelisation: they split a job into tasks that operate on separate blocks of data. Parallel algorithms split the task into separate jobs that can be run together, I think.


  • Initialization task: total = 0, count = 0;
  • Process data tasks: for each block of x, compute total = sum(x) and count = length(x);
  • Update results: combine the totals and counts from all blocks;
  • Process results, e.g. mean = total / count.
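
The four tasks above can be sketched roughly as follows, again in Python rather than R. This is an illustration of the pattern under my own assumptions, not RevoScaleR's implementation: the function names are invented, the final step is assumed to be a mean, and a thread pool stands in for whatever cores or cluster nodes would really run the per-block tasks.

```python
from concurrent.futures import ThreadPoolExecutor


def initialize():
    """Initialization task: start with empty running results."""
    return {"total": 0.0, "count": 0}


def process_block(block):
    """Process data task: summarise one block independently of the others."""
    return {"total": float(sum(block)), "count": len(block)}


def update(results, partial):
    """Update results: fold one block's summary into the running results."""
    return {
        "total": results["total"] + partial["total"],
        "count": results["count"] + partial["count"],
    }


def finalize(results):
    """Process results: turn the combined summary into the final answer."""
    return results["total"] / results["count"] if results["count"] else float("nan")


def chunked_mean(blocks, max_workers=4):
    results = initialize()
    # Blocks are independent, so the process-data tasks can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for partial in pool.map(process_block, blocks):
            results = update(results, partial)
    return finalize(results)


blocks = [[1, 2, 3], [4, 5], [6]]  # hypothetical data, already split into blocks
print(chunked_mean(blocks))  # 3.5
```

Because each block is reduced to a small summary (a total and a count) before anything is combined, the update step never needs to see the raw data. That is what lets the same code run on one core or many.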


ScaleR can process data from a variety of formats. It uses its own optimized format (XDF), which is suitable for chunking. XDF format:

  • data is stored in blocks of rows
  • header is at the end
  • allows sequential reads
  • essentially unlimited in size
  • efficient disk space usage.
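
As a rough illustration of why a layout like this suits chunking, here is a toy Python file format with blocks of rows first and the header last. It is emphatically not the real XDF layout, which the talk did not spell out: the JSON encoding, the 8-byte trailer and all function names are my own assumptions.

```python
import io
import json
import struct


def write_blocks(stream, blocks):
    """Write each block of rows, then a header describing them, then the header size."""
    index = []
    for rows in blocks:
        payload = json.dumps(rows).encode("utf-8")
        index.append({"offset": stream.tell(), "size": len(payload), "rows": len(rows)})
        stream.write(payload)
    header = json.dumps({"blocks": index}).encode("utf-8")
    stream.write(header)
    # Fixed-width trailer so a reader can find the header from the end of the file.
    stream.write(struct.pack("<Q", len(header)))


def read_header(stream):
    """Locate and parse the header by reading backwards from the end."""
    stream.seek(-8, io.SEEK_END)
    (header_size,) = struct.unpack("<Q", stream.read(8))
    stream.seek(-(8 + header_size), io.SEEK_END)
    return json.loads(stream.read(header_size).decode("utf-8"))


def iter_blocks(stream):
    """Yield one block of rows at a time, in the order they were written."""
    for entry in read_header(stream)["blocks"]:
        stream.seek(entry["offset"])
        yield json.loads(stream.read(entry["size"]).decode("utf-8"))


buffer = io.BytesIO()  # stands in for a file on disk
write_blocks(buffer, [[[1, 2.5], [2, 3.5]], [[3, 4.5]]])
print(read_header(buffer)["blocks"][1]["rows"])  # 1
print(list(iter_blocks(buffer)))
```

In this toy version the writer never has to go back and patch the start of the file, since block offsets are only recorded once all blocks are written. Whether that is the actual reason XDF keeps its header at the end is a guess on my part.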

Airline example: Results seem impressive and scale well. Compared to SAS it seems to do very well.


  1. The RevoScaleR package isn’t open source, but it is free for academic users. You can download it as part of Revolution R from http://www.revolutionanalytics.com/downloads/free-academic.php

    Comment by David Smith — August 18, 2011 @ 9:24 am

  2. Thanks for the comment. I’ve updated the post.

    Comment by csgillespie — August 18, 2011 @ 11:15 am
