July 1, 2015

useR 2015: Computational

Filed under: R, useR 2015 — csgillespie @ 12:19 pm

These are my initial notes from useR 2015. I will/may revise when I have time.

Computational Performance; Chair: Dirk Eddelbuettel

Running R+Hadoop using Docker Containers (E. James Harner)


  • Big data architectures:
    • HDFS/Hadoop: software framework for distributed storage and distributed processing
    • Tachyon/Spark: uses in-memory storage and processing

Rc2 server (R cloud computing)

  • Has an editor & output panel. Interactive collaboration (Demo)
  • highly scalable
  • 4-tier architecture: client, app server, compute cloud (JSON over BSD sockets for R),
    databases (pgSQL & couchdb)

RC2 Client

  • Sharable project and workspaces
  • Graphs are written to files and moved to the database as blobs
  • Security: A 3 value token is used for auto-logins


Rc2 is an accessible IDE for students and data scientists that allows real-time collaboration. It also acts as a front end to Hadoop and Spark clusters.

Algorithmic Differentiation for Extremum Estimation: An Introduction Using RcppEigen (Matt P. Dziubinski)


  • Parametric model: We want to estimate a parameter by maximizing an objective function
  • No closed-form expressions, so we need to optimize numerically


  • Derivative free: does not rely on knowledge of the objective function
  • Gradient-based: needs the gradient of the objective function
    • Steepest ascent, Newton
    • Often exhibit superior convergence rates
    • But getting the gradient can be tricky, e.g. finite difference methods
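
The trade-off between the approaches above can be sketched with base R's optim(). This is only an illustration: the objective (a normal log-likelihood on simulated data) and all object names are assumptions, not taken from the talk, and the sketch minimises the negative log-likelihood rather than maximising.

```r
set.seed(1)
x <- rnorm(200, mean = 3, sd = 2)

# Average negative log-likelihood of N(mu, sigma^2), with p = c(mu, log(sigma)).
# Additive constants are dropped because they do not affect the optimum.
nll <- function(p) {
  s2 <- exp(2 * p[2])
  p[2] + mean((x - p[1])^2) / (2 * s2)
}

# Exact gradient of nll with respect to (mu, log(sigma))
nll_grad <- function(p) {
  s2 <- exp(2 * p[2])
  c(-mean(x - p[1]) / s2,
    1 - mean((x - p[1])^2) / s2)
}

start <- c(0, 0)
fit_nm   <- optim(start, nll)                                  # derivative-free (Nelder-Mead)
fit_fd   <- optim(start, nll, method = "BFGS")                 # gradient by finite differences
fit_grad <- optim(start, nll, gr = nll_grad, method = "BFGS")  # exact gradient supplied

# Compare the estimates and the number of function/gradient evaluations
rbind(nelder_mead = fit_nm$par, bfgs_fd = fit_fd$par, bfgs_exact = fit_grad$par)
rbind(nelder_mead = fit_nm$counts, bfgs_fd = fit_fd$counts, bfgs_exact = fit_grad$counts)
```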

Algorithmic differentiation

  • Essentially use the chain rule
  • Need to recode the objective function in C++ using Rcpp
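
To make the chain-rule idea concrete, here is a minimal forward-mode sketch written in plain R with dual numbers. It is a swapped-in illustration, not the speaker's C++/RcppEigen approach; the dual class and helper names are hypothetical, and only a handful of operators are covered.

```r
# A dual number carries a value and the derivative of that value
dual <- function(val, der = 0) structure(list(val = val, der = der), class = "dual")
as_dual <- function(x) if (inherits(x, "dual")) x else dual(x)

# Binary arithmetic: each rule is the chain/product/quotient rule
Ops.dual <- function(e1, e2) {
  a <- as_dual(e1)
  b <- as_dual(e2)
  switch(.Generic,
    "+" = dual(a$val + b$val, a$der + b$der),
    "-" = dual(a$val - b$val, a$der - b$der),
    "*" = dual(a$val * b$val, a$der * b$val + a$val * b$der),
    "/" = dual(a$val / b$val, (a$der * b$val - a$val * b$der) / b$val^2),
    stop("operator not implemented in this sketch: ", .Generic))
}

# A few elementary functions
Math.dual <- function(x, ...) {
  switch(.Generic,
    exp = dual(exp(x$val), exp(x$val) * x$der),
    log = dual(log(x$val), x$der / x$val),
    sin = dual(sin(x$val), cos(x$val) * x$der),
    stop("function not implemented in this sketch: ", .Generic))
}

# Differentiate f at x = 1 by seeding the derivative with 1
f <- function(x) x * exp(x) + 2 * x
out <- f(dual(1, 1))
out$val   # f(1)
out$der   # f'(1), computed without finite differences
```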

Improving computational performance with algorithm engineering (Kirill Müller)

Application: activity based microsimulation models

Weighted sampling without replacement

  • Random sample: sample.int
  • Common framework: Subdivide an interval according to probabilities
    • If sampling without replacement, remove sub-interval
  • R uses a trivial algorithm with an O(n) update
    • Heap-like data structure
  • Alternative approaches:
    • Rejection sampling
    • One-pass sampling (Efraimidis and Spirakis, 2006)
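
A rough sketch of the one-pass idea of Efraimidis and Spirakis: give each item a random key that depends on its weight and keep the items with the largest keys. The function name is hypothetical, and the keys are computed on the log scale (log(u) / w instead of u^(1/w)), which gives the same ordering.

```r
# Weighted sampling without replacement in one pass over the weights
sample_weighted_es <- function(n, size, prob) {
  stopifnot(length(prob) == n, size <= n, all(prob > 0))
  keys <- log(runif(n)) / prob            # same ordering as runif(n)^(1 / prob)
  order(keys, decreasing = TRUE)[seq_len(size)]
}

set.seed(42)
w <- c(0.5, 0.2, 0.1, 0.1, 0.1)
idx <- sample_weighted_es(5, 3, prob = w)
idx

# Base R call for the same task, for comparison
sample.int(5, 3, prob = w)
```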

Statistical matching (data fusion)

  • Use Gower's distance to compare distributions
    • works with interval, ordinal and nominal variables
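
As a small illustration of why Gower's distance copes with mixed variable types, the sketch below handles numeric and nominal columns only (ordinal variables are left out); the function and the toy data frame are hypothetical. The cluster package's daisy() with metric = "gower" provides a full implementation.

```r
# Gower distance between rows a and b of a data frame:
# numeric columns contribute |difference| / range, nominal columns 0 or 1
gower_dist <- function(a, b, data) {
  d <- vapply(names(data), function(v) {
    col <- data[[v]]
    if (is.numeric(col)) {
      abs(col[a] - col[b]) / diff(range(col))
    } else {
      as.numeric(col[a] != col[b])
    }
  }, numeric(1))
  mean(d)
}

people <- data.frame(age = c(20, 30, 60),
                     sex = factor(c("f", "m", "f")))
d12 <- gower_dist(1, 2, people)
d13 <- gower_dist(1, 3, people)
```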

Please note that the notes/talks section of this post is merely my notes on the
presentation. I may have made mistakes: these notes are not guaranteed to be
correct. Unless explicitly stated, they represent neither my opinions nor the
opinions of my employers. Any errors you can assume to be mine and not the
speaker’s. I’m happy to correct any errors you may spot – just let me know!

useR 2015: Networks

Filed under: R, useR 2015 — csgillespie @ 9:39 am

These are my initial notes from useR 2015. Will revise when I have time.

fbRads: Analyzing and managing Facebook ads from R (Gergely Daroczi)

Modern advertising

Google/Amazon/Facebook use our information

Ad platforms and their R packages: Google (RAdwords) and Facebook (fbRads). You can use the Facebook API to get information from Facebook. Email addresses are handled as hashes, not as the actual addresses. The API has changed over the last few years.


  1. Grab useRs' email addresses from CRAN and the R-help mailing list.
  2. Create a Facebook app with API to get a token.
  3. Create a custom audience.
  4. Create lookalike audiences: get Facebook users who are similar to my target list.
  5. Define audience, ad and budget.
  6. Upload an image and description.
  7. Run A/B testing.

The performance metrics API is still being developed.

Web scraping with R – A fast track overview. (Peter Meißner)

There are a number of R packages for web-scraping.

Two problems:

  • Download: protocols/procedures, i.e. HTTP, cookies, POST, GET
  • Extraction: parsing/extraction/cleansing, i.e. XML, JSON, HTML into R

Reading text from the Web

The simplest solution is to use readLines, then use some regular expressions (either with base R or stringr or …).
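
A minimal base R sketch of this approach. To keep it runnable offline, an in-memory page stands in for a real URL; the page content and the email pattern are illustrative assumptions.

```r
# Stand-in for readLines("http://..."): a small page held in memory
page <- c("<html><body>",
          "<p>Contact: alice@example.org</p>",
          "<p>Or: bob@example.com</p>",
          "</body></html>")
con <- textConnection(page)
lines <- readLines(con)
close(con)

# Pull out everything that looks like an email address
pattern <- "[[:alnum:]._-]+@[[:alnum:].-]+"
emails <- unlist(regmatches(lines, gregexpr(pattern, lines)))
emails
```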

Reading HTML/XML

Use rvest, and use xml_structure to view the structure of the XML document. To extract text, we need to use XPath (still using rvest). Within rvest there are a number of convenience functions, e.g. html_table to get a list of tables.
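
A short sketch of these steps, assuming a recent version of rvest (and xml2, which it depends on) is installed. The HTML snippet is made up so that no network access is needed.

```r
library(rvest)

html <- paste0(
  "<html><body><h1>Results</h1>",
  "<table><tr><th>name</th><th>score</th></tr>",
  "<tr><td>a</td><td>1</td></tr>",
  "<tr><td>b</td><td>2</td></tr></table>",
  "</body></html>")
doc <- read_html(html)

# Inspect the tree, then extract pieces with XPath
xml2::xml_structure(doc)
heading <- html_text(html_nodes(doc, xpath = "//h1"))

# Convenience function: every <table> as a data frame
tables <- html_table(doc)
tables[[1]]
```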


Use jsonlite to translate JSON to a data frame.
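
For instance, assuming jsonlite is installed, an array of JSON records (made up here) becomes a data frame in one call:

```r
library(jsonlite)

json <- '[{"name": "a", "score": 1}, {"name": "b", "score": 2}]'
df <- fromJSON(json)   # array of objects -> data frame, one row per object
str(df)
```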

HTML forms/HTTP methods

Use httr and rvest packages.

Overcoming the Javascript Barrier

Use RSelenium for browser automation


  • Don't use Windows for web scraping. Use Linux (or if you must, a Mac)
  • Start with stringr, rvest, jsonlite
  • Need to learn regular expressions and file manipulation
  • Before scraping, look for the download button

multiplex: Analysis of Multiple Social Networks with Algebra (Antonio Rivero Ostoic)


  • multiplex is a package designed to perform algebraic analyses of multiple networks (but isn't limited to algebra)
  • The function zbind creates multivariate network data from arrays
  • perm manipulates network data

Two-mode networks are represented in a Galois framework. This makes analysis easier(?)

What's new in igraph and networks (Gabor Csardi)


igraph is the premier R package for the analysis of network data. It recently went through a major restructuring and has changed a lot since it was last featured at useR! in 2008. This talk introduces the new/updated features of igraph:

  • Simplified ways of graph manipulation.
  • New methods for community detection.
  • New layouts for graph visualization.
  • New statistical methods: graphlets, embeddings, graph matching, cohesive blocks, etc.
  • How to use igraph graphs with new visualization tools: DiagrammeR, D3, etc.

The igraph package deals mainly with infrastructure. It's actually a C library, with R and Python interfaces.

What's new: [ and [[

The [ operator makes the graph behave like an adjacency matrix. For example, to check if an edge exists, use air["BOS", "SFO"]. Can also use it to manipulate the network, e.g. to add or remove edges.

The [[ can be used to get all adjacent vertices
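
A small sketch of both operators, assuming igraph (version 1.0 or later) is installed. The air graph here is a toy network built for the example, not the data set used in the talk.

```r
library(igraph)

# Toy undirected network of airports
air <- graph_from_literal(BOS - SFO, SFO - LAX, BOS - JFK)

air["BOS", "SFO"]    # 1 if the edge exists, 0 otherwise
air["BOS", "LAX"]
air[["BOS"]]         # all vertices adjacent to BOS

# The same operator manipulates the network
air["BOS", "LAX"] <- TRUE    # add an edge
air["BOS", "JFK"] <- FALSE   # remove an edge
```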

What's new: consistent function names and manipulators

  • make_*, sample_*, cluster_*, layout_*, graph_from_*
  • manipulators: make_ and sample_
  • Pipe friendly syntax
  • Easier connection to other packages, e.g. networkD3
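
The naming scheme and pipe-friendly style might look like this in practice. This is a sketch assuming igraph 1.0 or later and R 4.1 or later; the native pipe |> is used here as a stand-in for whichever pipe the talk showed.

```r
library(igraph)

# Constructors, samplers, clustering and layouts share consistent prefixes
g <- make_ring(10) |>
  add_edges(c(1, 6)) |>
  set_vertex_attr("name", value = letters[1:10])

g_random <- sample_gnp(20, 0.2)
comm     <- cluster_walktrap(g)
coords   <- layout_in_circle(g)

# The make_() manipulator combines a constructor with modifiers
r <- make_(ring(10), with_vertex_(name = LETTERS[1:10]))
```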

Current work

  • Better connection to other packages
  • Inference
  • Infrastructure cleanup


useR 2015: Romain Francois: My R adventures

Filed under: R, useR 2015 — Tags: , — csgillespie @ 8:05 am

Romain has been using R since 2002 and has been working on Rcpp, Rcpp11, Rcpp14 and dplyr
internals. He has worked on a number of big projects.

  • 2005 he set up the R Graph Gallery
  • 2009 worked on rJava
  • 2010 Rcpp
  • 2013 dplyr

Key themes are performance and usability

rJava 0.7-*

Creating objects was messy

d <- .jnew("java/lang/Double", 42)
.jcall(d, "D", "doubleValue")

rJava 0.8-*

d <- new(J("java/lang/Double"), 42)

Also much easier to import Java packages.


Suppose you have

double add(double a, double b){ return a+b;}

and you want to use it in R. This used to be a lot of work. Before Rcpp, you
used the R/C API, i.e. SEXP objects, which meant a lot of work and boilerplate. With Rcpp the
number of characters needed to translate the simple function above went from 250
to 50. Around 66% of CRAN packages depend (in some way) on Rcpp.
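
With Rcpp installed and a working C++ compiler, the function above can be compiled and called from R in a couple of lines. This is a sketch using Rcpp's cppFunction(), not necessarily the exact code shown in the talk.

```r
library(Rcpp)

# Compile the C++ function and expose it as an R function called add()
cppFunction("double add(double a, double b) { return a + b; }")

add(1, 2)
```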

RcppParallel (tbb: thread building blocks)

The package makes it much easier to run things in parallel. Amazingly, a simple
parallel version of sqrt is faster than sqrt.

dplyr (everyone knows what it does)

Uses hybrid evaluation. Looking to bring RcppParallel in the (near?) future.


September 17, 2011

UK R Courses – 2012

Filed under: Conferences, R, Teaching — Tags: , , — csgillespie @ 1:01 pm

The School of Mathematics & Statistics at Newcastle University (UK) is again running some R courses. In January 2012, we will run:

The courses aren’t aimed at teaching statistics; rather, they aim to go through the fundamental concepts of R programming. Further information is available at the course website. If you have any questions, feel free to contact me: colin.gillespie@newcastle.ac.uk


Bespoke courses are also available on request.

August 19, 2011

Development of R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 8:44 am

Michael Rutter – R for Ubuntu

Ubuntu 10.10 uses R 2.10.1. Backports are newer versions of software for old releases. R backports are available from CRAN (link).

Launchpad is a website for users to develop and maintain software (Canonical). One of Launchpad’s services is the personal package archive (PPA). This allows users to upload .deb source files, allowing easy creation of packages for multiple Ubuntu releases and architectures.


Dirk creates source file -> Michael gets source file -> packages built on Launchpad -> Post on CRAN using apt-mirror.

There’s also a PPA available. PPAs are easier to add to the user’s system. Ubuntu has about 75 r-cran packages available in the main repository. A PPA could build the packages if the .deb packages were available. Could we use cran2deb?

cran2deb: no longer works, since maintaining the (virtual) machines to build the packages is time-consuming. Use Launchpad instead.

cran2deb4ubuntu (PPA): Contains most of the packages and dependencies from CRAN – 1107 in total. All packages can be installed with: sudo apt-get install r-cran-foo

  • Exceptions: non-free licences, Windows/Mac, dependencies not available to Launchpad (CUDA);
  • Problems(?): Can only install r-cran-foo outside of current R session. Can we get install.packages("foo") to look for r-cran-foo first?
  • Benefits: automatic updates to packages and creating R instances in the cloud.
  • Issues: c2d4u only available for 11.04. Naming and building issues for future versions. Space limitations on Launchpad may limit previous versions.

Andrew Runnalls – The CXXR project

The CXXR project is progressively re-engineering the fundamental parts of the R interpreter from C to C++. Started in 2007, the current release shadows R 2.12.1. The aim of the project is to make the R interpreter more accessible to developers and researchers.

  • Improve documentation;
  • Encapsulation;
  • Move to an object-oriented structure;
  • Express internal algorithms.


In CR (the standard C-based R interpreter), a C union is used to implement R objects. This has a few disadvantages:

  • compiler doesn’t know which of the 23 types is at an address;
  • debugging at the C level is tricky
  • Adding a new type of R object means modifying a data definition at the heart of the interpreter

CXXR maps R objects to a particular C++ class.


  • Move program code relating to a datatype into one place
  • Use C++ public/protected/private mechanism
  • Allow developers to extend the class hierarchy.

Illustrative example: write a package to handle large integers

The GNU MP library defines a C++ class mpz_class to represent an arbitrarily large integer, but not NAs. In CXXR, NAs are added with a single line of code. Another line of code is used to create a vector of BigInts. It’s straightforward to add binary operations.

Subscripting in R

R is renowned for the power of its subscripting operations. In the CR interpreter, there are around 2000 C-language statements to implement these facilities. But this C code is locked up; no API and hard-wired around CR’s built-in data types. This is buried treasure.

CXXR makes this functionality available through an API. The API abstracts away from the type of the elements and of the container. Result: adding subscripting operations is fairly simple.

Current problems: no serialization. There is no provision for BigIntVectors to be saved across sessions.

Claudia Beleites: Google Summer of Code 2011

Open source software coding projects. Results can be used as part of thesis or article.

  • Student stipend: US$5000. Mentoring organization: US$50;
  • Project topics: 7 GUI/images/visualisation, 4 optimization, 1 on High performance computing.
  • Aims: introduce students to the R developer community and push forward their project. roxygen and cran2deb were previous GSoC projects.
  • Communication channels: email, IM, skype, personal meetings.


  • Two mentors per student.  The two admins ping projects every now and again;
  • Time lines are based on US summer holidays;
  • Vanishing mentor and student.

Advice for Mentors:

  • Start to look early (January) for students. Look for a co-mentor;
  • Plan the time carefully;
  • Remember that coding time is also holiday time and students range from 1st year to PhD students.

August 18, 2011

Simon Urbanek – R Graphics: supercharged

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:50 pm

New features:

  • rasterImage() (R 2.11)
    • bitmap raster drawing;
    • have maps as data backdrops.
  • Polygons with holes: polypath() (R 2.12)
  • At present there is no way to tell when to actually show the plot. For example: plot(x); lines(x). Should we display the plot after plot or after lines?
    • Solution: dev.hold() and dev.flush()
    • Better performance and useful for animations (currently in R-devel).
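
The hold/flush idea can be sketched as below. The example draws to a null PDF device so that it runs non-interactively; on such a device holding has no visible effect, whereas on a supporting screen device the plot would appear only once, after dev.flush().

```r
x <- cumsum(rnorm(100))

pdf(NULL)                 # null device: nothing is written to disk
level <- dev.hold()       # ask the device to stop refreshing
plot(x, type = "n")       # set up the axes
lines(x)                  # add the data
dev.flush()               # release the hold: show everything at once
invisible(dev.off())
```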


Data sizes increase, but large RAM (>100GB) and CPU power are affordable. Visualization needs to keep up.

  • Currently rendering is slow. Solutions: OpenGL + GPUs.
  • Visualisation methods for large data
    • interactivity (divide and conquer, shift of focus): use iPlots eXtreme (very nice demo of iplots!)
    • sufficient statistics, aggregations, etc.


Note: lots of very nice demos, hence the lack of notes.

Kaleidoscope IIIb (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:00 pm

O. Mersmann – The microbenchmark package

Slides and code (link).

SURGEON GENERAL’s WARNING: Microbenchmarks can lead to a distorted view of reality and massive loss of productivity

For a higher-order benchmarking package check out the rbenchmark package on CRAN (suggestion from the speaker).

Why do we need micro-benchmarking? A simple example showed that it is currently very difficult to benchmark 1+1 and f = function() NULL using system.time. microbenchmark has a very simple interface. Unlike system.time, microbenchmark measures the time of each individual function call. It produces summary statistics and plots.
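
A short sketch of that comparison, assuming the microbenchmark package is installed. Timings vary by machine, so no output is shown.

```r
library(microbenchmark)

f <- function() NULL

# system.time() is too coarse for a single cheap call
system.time(f())

# microbenchmark times every individual evaluation
res <- microbenchmark(1 + 1, f(), times = 1000L)
summary(res)      # per-expression summary statistics
# boxplot(res)    # one of the plotting methods
```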

How does microbenchmark() work?

  • Linux: clock_gettime(), gethrtime();
  • Mac: mach_timebase_info();
  • Windows: QueryPerformanceCounter(), QueryPerformanceFrequency()


  1. Precision of the clock is unknown: the clock could drift, timings might be zero, and we might observe discrete values;
  2. The clock only measures elapsed time. Some of this time may not actually be spent in the R process.

Countermeasures to these problems include a configurable CPU warm-up phase, a configurable order of execution, and a warning if timings underflow. There are problems with Mac OS X and Windows XP.

Planned features:

  • More plotting functions;
  • Possibly use OS API;
  • Better diagnostic messages;
  • Estimate clock granularity.

Paul Murrell – Vector image processing

Problem: convert a pretty pdf map into an interactive SVG document.
PDF -> R -> SVG
Discussion of recent improvements to the core R graphics engine and grImport. Using the svg() device would produce a static SVG. However, the gridSVG package produces an interactive SVG. Use grid.animate, grid.garnish, grid.hyperlink and grid.script to make the picture interactive.
Looks like a very nice package.

Big data (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 11:23 am

Unfortunately, I missed the first and last talks.

My notes from a session on Thursday morning

J. Demmler – Challenges of working with a large database of routinely collected health data

The SAIL data bank holds over 1.9 billion (anonymous) entries. To use the data for research, they need to ensure that proper data security is observed, for example secure data transport. All analysis is done within a secure environment. Files are moved into the environment via an FTP client.

Why R? No advanced SQL options are available in DB2, so using R allows loops. R is also great for data pre-cleaning and is suitable for the heavy analysis. To connect to the SAIL database, they use the RODBC package. SQL queries are run from within R; however, SQL scripts are kept in separate files since they are “reviewed”.
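
One way to keep reviewed SQL in separate files while running it from R is sketched below. The helper name, file contents and DSN are hypothetical; the RODBC calls are left as comments because they need a live database, and a stand-in query function is used so the sketch runs on its own.

```r
# Read a SQL script from disk and hand it to a query function
run_sql_file <- function(path, run_query) {
  sql <- paste(readLines(path), collapse = "\n")
  run_query(sql)
}

# With RODBC this might look like (assumed DSN name "SAIL"):
# ch  <- RODBC::odbcConnect("SAIL")
# res <- run_sql_file("cohort.sql", function(q) RODBC::sqlQuery(ch, q))
# RODBC::odbcClose(ch)

# Self-contained demonstration: the "query function" just echoes the SQL
sql_file <- tempfile(fileext = ".sql")
writeLines(c("SELECT id, weight_kg", "FROM patients"), sql_file)
echo <- run_sql_file(sql_file, function(q) q)
cat(echo, "\n")
```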

Lots of errors in data, e.g. units.

John Bryant – Demographic: classes and methods for data about populations

Existing data structures for population type data:

  • array: messy code;
  • data frames: not that natural for this type of code;
  • demography package: not really extensible.

Target audience for this new package: applied statisticians, social scientists. Not programmers. Core to this package is the Demographic class: S4 object, specialized array with associated meta data.
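
The idea of an S4 class that is an array plus metadata can be sketched as follows. The class name DemoCounts, the dimtypes slot and the toy data are hypothetical and are not the package's actual Demographic class.

```r
library(methods)

# An array that also records what each dimension represents
setClass("DemoCounts",
  contains = "array",
  representation(dimtypes = "character"),
  validity = function(object) {
    if (length(object@dimtypes) != length(dim(object)))
      "need exactly one dimtype per dimension"
    else
      TRUE
  })

counts <- array(1:8, dim = c(2, 2, 2),
                dimnames = list(age  = c("0-4", "5-9"),
                                sex  = c("f", "m"),
                                year = c("2010", "2011")))
pop <- new("DemoCounts", counts, dimtypes = c("age", "sex", "time"))

# An accessor generic keeps users away from the @ operator
setGeneric("dimtypes", function(x) standardGeneric("dimtypes"))
setMethod("dimtypes", "DemoCounts", function(x) x@dimtypes)
dimtypes(pop)
```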

August 17, 2011

Programming (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 1:35 pm

Ray Brownrigg – Tips and Tricks for young R programmers


Calculate the distribution function of a bivariate Kolmogorov-Smirnov statistic. Essentially three loops. A basic exhaustive search is O(N^3). Fortran gives a single order of magnitude speed-up. Restructuring in R using a single loop is an order of magnitude faster than Fortran. Further improvements make the algorithm 3 times faster.


  • Resolution of pdf graphs: specify width and height to suit eventual size.
  • Local versions of standard functions: compare rank(x) with .Internal(rank(x, "min")).  Ditto with sort
  • Vectorisation
  • Curve: handy for finding errors

F Schuster – Software design patterns in R

In Java software design patterns are everywhere. What about R?

What is a design pattern?

A generalised, reusable and time-tested solution. Every pattern has a description of its general principle. Collections of patterns are organised into catalogues. Reusing proven concepts helps productivity.

R design pattern

  • Factory method pattern, e.g. a plotting program calls a function to get a symbol. The factory method makes the program independent of how the symbols are created.
  • A function closure maintains the object state. You can have private functions within a closure.
  • Map pattern – apply function in R
  • Filter –
  • compose concept and chain of responsibility
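
Several of the patterns above can be sketched in a few lines of base R. All function names here are made up for illustration.

```r
# Factory method: callers ask for a symbol by name, not by plotting code
symbol_factory <- function(kind) {
  switch(kind, circle = 1, triangle = 2, cross = 4,
         stop("unknown symbol: ", kind))
}

# Closure: 'count' is private state shared by the returned functions
make_counter <- function() {
  count <- 0
  list(increment = function() { count <<- count + 1; invisible(count) },
       value     = function() count)
}
counter <- make_counter()
counter$increment()
counter$increment()

# Map and Filter
squares <- Map(function(x) x^2, 1:5)
evens   <- Filter(function(x) x %% 2 == 0, 1:10)

# Compose: apply a chain of functions from left to right
compose <- function(...) {
  fs <- list(...)
  function(x) Reduce(function(acc, f) f(acc), fs, x)
}
inc_then_double <- compose(function(x) x + 1, function(x) x * 2)
inc_then_double(3)
```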

Patrick Burns – Random input testing with R

Good talk, I just found it hard to make notes. A closely related topic is fuzz testing.
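
A rough sketch of the general idea, not taken from the talk: generate many random inputs, run the function under test, and check a property of the result. The helper names are hypothetical.

```r
# Run 'fun' on random inputs until 'check' fails or the trials run out
random_test <- function(fun, check, gen, trials = 200) {
  for (i in seq_len(trials)) {
    input <- gen()
    if (!check(input, fun(input))) return(list(ok = FALSE, input = input))
  }
  list(ok = TRUE, input = NULL)
}

# Random integer vectors of random length, sometimes containing NA
gen <- function() {
  n <- sample(0:20, 1)
  sample(c(-5:5, NA), n, replace = TRUE)
}

set.seed(2011)

# Property that should hold: sort() returns the non-missing values in order
res_good <- random_test(sort,
  function(input, out) !is.unsorted(out) && length(out) == sum(!is.na(input)),
  gen)

# A wrong assumption (sort() keeps the length) is exposed by inputs with NA
res_bad <- random_test(sort,
  function(input, out) length(out) == length(input),
  gen)
res_bad$input
```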



Kaleidoscope IIb (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 9:35 am

L Collingwood – RTextTools

RTextTools. A machine learning library for automated text classification. This package builds on previous packages such as tm and random forests. Use case: undergrad labels congressional bills but then quits. Using the previously labelled data, automatically classify the remaining documents. The speaker gave a nice overview of machine learning techniques, but I was familiar with them so didn’t bother making notes.


  1. Read data;
  2. Missed opps;
  3. Create Corpus;
  4. Train Models – SVM, SLDA, TREE, etc;
  5. Classify models;
  6. Analyze data.

Jason Waddel – The Role of R in Lab Automation

License: free as in free beer and speech!

Summary: a scientist repeats the same experiment multiple times. How can we automate the analysis?

R service bus allows a scientist to email/upload data and the results are automatically generated.

High level view

Various inputs such as POP, XML, REST WS. Each input is added to the queue. A pool of R servers handles the jobs. A simple configuration file handles the set-up.


