February 16, 2016

RANDU: The case of the bad RNG

Filed under: Computing, R — Tags: , , — csgillespie @ 12:15 pm

The German Federal Office for Information Security (BSI) has established
criteria for quality random number generator (rng):

  • A sequence of random numbers has a high probability of containing no identical consecutive elements.
  • A sequence of numbers which is indistinguishable from true random’ numbers (tested using statistical tests.
  • It should be impossible to calculate, or guess, from any given sub-sequence, any previous or future values in the sequence.
  • It should be impossible, for all practical purposes, for an attacker to calculate, or guess the values used in the random number algorithm.

Points 3 and 4 are crucial for many applications. Everytime you make a
phone call, contact to a wireless point, pay using your credit card random
numbers are used.

Designing a good random number generator is hard and as a general rule you should never try to. R comes with many good quality random generators. The default generator is the Mersenne-Twister. This rng has a huge period of 2^{19937}-1 (how many random numbers are generated before we have a repeat).

Linear congruential generators

A linear congruential generator (lcg) is a relatively simple rng (popular in the 60’s and 70’s). It has a simple form of

r_{i}=(ar_{i-1}+b) \textrm{ mod }m, \quad i=1, 2, \ldots, m

where $latexr_0$ is the initial number, known as the seed, and \(a,b,m\) are the multiplier, additive constant and modulo respectively. The parameters are all integers.

The modulo operation means that at most m different numbers can be generated
before the sequence must repeat – namely the integers 0,1,2, \ldots, m-1. The
actual number of generated numbers is h \leq m, called the period of
the generator.

The key to random number generators is in setting the parameters.


RANDU was a lcg with parameters m=2^{31}, a=65539 and b=0. Unfortunately this is a spectacularly bad choice of
parameters. On noting that a=65,539=2^{16}+3, then

r_{i+1} = a r_i = 65539 \times r_i = (2^{16}+3)r_i \;.


r_{i+2} = a\;r_{i+1} = (2^{16}+3) \times r_{i+1} = (2^{16}+3)^2 r_i \;.

On expanding the square, we get

r_{i+2} = (2^{32}+6\times 2^{16} + 9)r_i = [6 (2^{16}+3)-9]r_i = 6 r_{i+1} - 9 r_i \;.

Note: all these calculations should be to the mod 2^{31}. So there is a large
correlation between the three points!

If compare randu to a standard rng (code in a gist)


It’s obvious that randu doesn’t produce good random numbers. Plotting  x_i, x_{i-1} and x_{i-2} in 3d


Generating the graphics

The code is all in a gist and can be run via


You can then get the 3d plot via

scatterplot3d::scatterplot3d(randu[,1], randu[,2], randu[,3],
## Interactive version
threejs::scatterplot3js(randu[,1], randu[,2], randu[,3])


February 15, 2016

Shiny benchmarks

Filed under: Computing, R — Tags: , — csgillespie @ 5:49 pm

A couple of months ago, the first version of benchmarkme was released. Around 140 machines have now been benchmarked.

From the fastest (an Apple i7) which ran the tests in around 10 seconds, to the slowest (an Atom(TM) CPU N450 @ 1.66GHz) which took 420 seconds! Other interesting statistics:

  • Around 6% of people ran BLAS optimised versions of R;
  • No-one (except for machines that I used) ran a byte compiled version of the package.

I intend to write to a blog post or two on BLAS and byte compiling, but for the meantime you can investigate the results via the new shiny interface. The package is still only available on github and can be installed via:

## Update the package
install.packages(c("drat", "httr", "Matrix", "shiny"))
install.packages("benchmarkme", type="source")

You then load the package in the usual way

## View past results
## shine() # Needs shiny
## get_datatable_past() # Needs DT

To benchmark your system, use

## This will take somewhere between 0.5 and 5 minutes
## Increase runs if you have a higher spec machine
res = benchmark_std(runs=3)

You can then compare your results other users

## shine(res)
## get_datatable(res)

and upload your results

## You can control exactly what is uploaded. See details below.

This function returns a unique identifier that will allow you to identify your results from the public data sets.

December 1, 2015

Crowd sourced benchmarks

Filed under: Computing, R — Tags: , — csgillespie @ 10:25 am

When discussing how to speed up slow R code, my first question is what is your computer spec? It always surprises me when complex biological experiments, costing a significant amount of money, are analysed using a six year old laptop. A new desktop machine costs around £1000 and that money would be saved within a month in user time. Typically the more the RAM you have, the larger the dataset you can handle. However it’s not so obvious of the benefit of upgrading the processor.

To quantify the impact of the CPU on an analysis, I’ve create a simple benchmarking package. The aim of this package is to provide a set of benchmarks routines and data from past runs. You can then compare your machine, with other CPUs. The package currently isn’t on CRAN, but you can install it via my drat repository

install.packages(c("drat", "httr", "Matrix"))
install.packages("benchmarkme", type="source")

You can load the package in the usual way, and view past results via


to get


Currently around forty machines have been benchmarked. To benchmark and compare your own system just run

## On slower machines, reduce runs.
res = benchmark_std(runs=3)



The final step is to upload your benchmarks

## You can control exactly what is uploaded. See the help page

The current record is held by a Intel(R) Core(TM) i7-4712MQ CPU.

July 1, 2015

useR 2015: Computational

Filed under: R, useR 2015 — csgillespie @ 12:19 pm

These are my initial notes from useR 2015. I will/may revise when I have time.

Computational Performance; Chair: Dirk Eddelbuettel

Running R+Hadoop using Docker Containers (E. James Harner)


  • Big data architectures:
    • HDFS/Hadoop: software framework for distributed storage and distributed processing
    • Tachyon/Spark: uses in-memory

Rc2 server (R cloud computing)

  • Has an editor & output panel. Interactive collaboration (Demo)
  • highly scalable
  • 4-tier architecture: client, app server, compute cloud (JSON over BSD sockets for R),
    databases (pgSQL & couchdb)

RC2 Client

  • Sharable project and workspaces
  • Graphs are written to files and moved to the database as blobs
  • Security: A 3 value token is used for auto-logins


Rc2 is an accessible IDE for students and data scientist to allow real time collaboration. It also acts as a front end to Hadoop and Spark clusters.

Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen (Matt P. Dziubinski)


  • Parametric model: We want to estimate a parameter by maximizing an objective function
  • No closed formed expressions, so we need to numerically optimize


  • Derivative free: does not rely on knowledge of the objective function
  • Gradient-based: needs the gradient of the objective function
    • Steepest ascent, newton
    • Often exhibit superior convergence rates
    • But getting the gradient can be tricky, e.g. finite difference methods

Algorithmic diffentiation

  • Essentially use the chain rule
  • Need to recode the objective function in Cpp using Rcpp

Improving computational performance with algorithm engineering (Kirill Müller)

Application: activity based microsimulation models

Weighted sampling without replacement

  • Random sample: sample.int
  • Common framework: Subdivide an interval according to probabilities
    • If sampling without replacement, remove sub-interval
  • R uses trivial algorithm with update in O(n)
    • Heap-like data structure
  • Alternative approaches:
    • Rejection sampling
    • One-pass sampling (Efraimidis and Spirakis, 2006)

Statistical matching (data fusion)

  • Use Gower's distance to compare distribution
    • works with interval, ordinal and nominal variables

Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!

useR 2015: Networks

Filed under: R, useR 2015 — csgillespie @ 9:39 am

These are my initial notes from useR 2015. Will revise when I have time.

fbRads: Analyzing and managing Facebook ads from R (Gergely Daroczi)

Modern advertising

Google/Amazon/Facebook use our information

Ad platforms: Google: RAdwords, facebook likes: fbRads. You can use the facebook API to get information from facebook. Get hashes of email address, not the actual address. In the last few years, the API has changed.


  1. Grab useR's email addresses from CRAN and R-help mailing list.
  2. Create a facebook app with API to get a token.
  3. Create a custom audience.
  4. Create lookalike audiences: get facebook users who are similar to my target list.
  5. Define audience, ad and budget.
  6. Upload an image and description.
  7. Run A/B testing.

The performance metrics API is still being developed.

Web scraping with R – A fast track overview. (Peter Meißner)

There are a number of R packages for web-scraping.

Two problems:

  • Download: protocols/procedures, i.e. HTTP, cookies, POST, GET
  • extraction: parsing/extraction/cleansing, i.e. XML, JSON, html into R

Reading text from the Web

The simplest solution is to use readLines, then use some regular expressions (either with base R or stringr or …).

Reading HTML/XML

Use rvest and use xml_structure to view the structure of the XML scheme. To extract text, we need to use XPath (still using rvest). Within rvest there are a number of convenience functions, e.g. html_table to get a list of tables.


Use jsonlite to translate JSON to a data frame.

HTML forms/HTTP methods

Use httr and rvest packages.

Overcoming the Javascript Barrier

Use RSelenium for browser automation


  • Don't use Windows for web scraping. Use Linux (or if you must, a Mac)
  • Start with stringr, rvest, jsonlite
  • Need to learn regular expression, file manipulation
  • Before scraping, look for the download button

multiplex: Analysis of Multiple Social Networks with Algebra (Antonio Rivero Ostoic)


  • multiplex is a package designed to perform algebraic analyses of multiple networks (but isn't limited to algebra)
  • The function zbind creates multivariate network data from arrays
  • perm manipulates network data

Two-mode networks are represented in a Galois framework. This makes analysis easier(?)

What's new in igraph and networks (Gabor Csardi)


igraph is the premier R package for the analysis of network data and it went through major restructuring recently and has changed a lot since last time it was featured at useR! in 2008. This talk introduces the new/updated features of igraph: – Simplified ways of graph manipulation. – New methods community detection. – New layouts for graph visualization. – New statistical methods: graphlets, embeddings, graph matching, cohesive blocks, etc. – How to use igraph graphs with new visualization tools: DiagrammeR, D3, etc.

The igraph package deals mainly with infrastructure. It's actually a C library, with an R and python interface.

What's new: [ and [[

The [ operator makes the graph behave like an adjacency matrix. For example, to check if an edge exists, use air["BOS", "SFO"]. Can also use it to manipulate the network, e.g. to add or remove edges.

The [[ can be used to get all adjacent vertices

What's new: consistent function names and manipulators

  • make_*, sample_*, cluster_, layout_*, graph_from_*
  • manipulators: make_ and sample_
  • Pipe friendly syntax
  • Easier connection to other packages, e.g. networkD3

Current work

  • Better connection to other packages
  • Inference
  • Infrastructure cleanup

Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!

useR 2015: Romain Francois: My R adventures

Filed under: R, useR 2015 — Tags: , — csgillespie @ 8:05 am

Using R since 2002 and has been working on Rcpp, Rcpp11, Rcpp14 and dplyr
internals. Worked on a number of big projects.

  • 2005 he set up the R Graph Gallery
  • 2009 worked on rJava
  • 2010 Rcpp
  • 2013 dplyr

Key themes are Performance and usabililty

rJava 0.7-*

Creating objects was messy

d <-jnew("java/lang/Double", 42
.jcal(d, "D", "doubleValue)

rJava 0.8-*

d <- new(J("java/lang/Double"), 42)

Also much easier to import java packages.


Suppose you have

double add(double a, double b){ return a+b;}

and you want to use it in R. This used to be a lot of work. Before Rcpp, you
used the R/C Api, i.e. use SEXP. A lot of work and boilerplate. With Rcpp the
number of characters needed to translate the simple function above went from 250
to 50. Around 66% of CRAN packages depend (on some way) on Rcpp.

RcppParallel (tbb: thread building blocks)

The package makes it much easier to run things in parallel. Amazingly, a simple
parallel version of sqrt is faster than sqrt

dplyr (everyone knows what it does)

Uses hybrid evaluation. Looking to bring RcppParallel in the (near?) future.

Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!

April 1, 2015

Standardising Function Names in R

Filed under: R — csgillespie @ 12:01 am

The renamer Package

Tired of the disparate naming systems in R? Then this is the package for you.

Installing the package

The package is located in my drat. To install

install.packages("renamer", repos="http://csgillespie.github.io/drat", type="source")

or if you have drat installed

install.packages("renamer", type="source")

The source is available on my github page

Example: The CamelCaseR

If have an unnatural fear of underscores, that prevents the use of ggplot2, then you are saved

## Assumes you have ggplot2 installed


ggplot(cars) + geomPoint(aes(speed, dist))

Example: The UnderscoreR

If you’ve want to try the excellent `drat` package, but you find the lack of underscores
unnerving, you too are saved

install.packages("drat", repos="http://eddelbuettel.github.io/drat")

The examples on the drat homepage become


Future directions

Now that the problem with the R naming system has been solved, the next logical step is
to remove differences *between* languages

  • lisp` converter – randomly add brackets to all functions
  • python – insert spaces/tabs on every line
  • FORTRAN – all code starts on the eighth column

July 13, 2012

Analysing time course microarray data using Bioconductor: a case study using yeast2 Affymetrix arrays

Filed under: latex, Microarray, Publications, R — Tags: , , — csgillespie @ 2:32 pm

A few years ago I was involved in analysing some time-course microarray data. Our biological collaborators were interested in how we analysed their data, so this lead to a creation of tutorial, which in turn lead to a paper. When we submitted the paper, one the referees “suggested” that we write the paper using Sweave; I had never used Sweave. At the time this was a massive pain and I regularly cursed the name of the anonymous referee.  A few years later, I’ve just updated code (due to a change in Bioconductor) and it was a breeze. A belated thanks to the referee.

In this latest update to the paper I’ve

  • moved the paper to github;
  • changed from Sweave to knitr;
  • used RStudio instead of emacs.

You can find details full details about analysis on the associated github page.

September 17, 2011

UK R Courses – 2012

Filed under: Conferences, R, Teaching — Tags: , , — csgillespie @ 1:01 pm

The School of Mathematics & Statistics at Newcastle University (UK), are again running some R courses. In January, 2012, we will run:

The courses aren’t aimed at teaching statistics, rather they aim to go through the fundemental concepts of R programming. Further information is available at the course website. If you have any questions, feel free to contact me: colin.gillespie@newcastle.ac.uk


Bespoke courses are also on request.

August 19, 2011

Development of R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 8:44 am

Michael Rutter – R for Ubuntu

Ubuntu 10.10 uses 2.10.1. Backports are newer versions of software for old releases. R backports are available CRAN (link).

Lauchpad is a website for users to develop and maintain software (Canonical). One of Launchpad’s services is the personal package archive (PPA). This allows users to upload .deb source files, allowing easy creation of multiple Ubuntu releases and arch’s.


Dirk creates source file -> Michael gets source file -> packages built on launchpad -> Post on CRAN using apt-mirror.

There’s also a PPA available. PPAs are easier to add to the user’s system. Ubuntu has about 75 r-cran packages available in the main repository. A PPA could build the packages if the .deb packages were available. Could we use cran2deb?

cran2deb:  (no longer works), since maintaining the (virtual) machines to build the packages is time-consuming. Use launchpad.

cran2deb4ubuntu (PPA):  Contains most of the packages and dependencies from CRAN – 1107 in total. All packages can be installed with: sudo apt-get install r-cran-foo

  • Exceptions: non-free licences, windows/mac, dependencies not available to Launchpad (CUDA);
  • Problems(?): Can only install r-cran-foo outside of current R session. Can we get install.packages("foo") to look for r-cran-foo first?
  • Benefits: automatic updates to packages and creating R instances in the cloud.
  • Issues: c2d4u only available for 11.04. Naming and building issues for future versions. Space limitations on Launchpad may limit previous versions.

Andrew Runnalls – The CXXR project

The CXXR is progressively re-engineering the fundamental parts of the R interpreter from C to C++. Started in 2007, current release shadows 2.12.1. The aim of the project is to make the R interpreter more accessible to developers and researchers.

  • Improve documentation;
  • Encapsulation;
  • Move to an object-oriented structure;
  • Express internal algorithms.


In CR, the C union is used to implement R object. This has a few disadvantages:

  • compiler doesn’t know which of the 23 types is at an address;
  • debugging at the C level is tricky
  • Adding a new type of R object means modifying a data definition at the heart of the interpreter

CXXR maps R objects to a particular C++ class.


  • Move program code relating to a datatype into one places
  • Use C++ public/protected/private mechanism
  • Allow developers to extend the class hierarchy.

Illustrative example: write a package to handle large integers

GNU MP library defines a C++ class mpz_class to represent an arbitrarily large integer, but not NA’s In CXXR, NA’s are added with a single line of C. Another line of code is used to create a vector of BigInts. It’s straightforward to add binary operations.

Subscripting in R

R is renowned for the power of its subscripting operations. In the CR interpreter, there are around 2000 C-language statements to implement these facilities. But this C code is locked up; no API and hard-wired around CR’s built-in data types. This is buried treasure.

CXXR makes an API available through its API. The API abstracts away from the type of elements and container. Result: adding subscripting operations is fairly simple.

Current problems: no serialization. No provision for BigIntVectors to be saved across sessions

Claudia Beleites: Google Summer of Code 2011

Open source software coding projects. Results can be used as part of thesis or article.

  • Student stipend: US$5000. Mentoring Organization: US $50;
  • Project topics: 7 GUI/images/visualisation, 4 optimization, 1 on High performance computing.
  • Aims: introduce students to the R developer community and push forward their project. roxygen and cran2deb were previous GSoC projects.
  • Communication channels: email, IM, skype, personal meetings.


  • Two mentors per student.  The two admins ping projects every now and again;
  • Time lines are based on US summer holidays;
  • Vanishing mentor and student.

Advice for Mentors:

  • Start to look early (January) for students. Look for a co-mentor;
  • Plan the time carefully;
  • Remember that coding time is also holiday time and students range from 1st year to PhD students.
« Newer PostsOlder Posts »

Blog at WordPress.com.