Why?

January 25, 2017

Training courses: R, Stan and Scala

Filed under: R — csgillespie @ 1:34 pm

Over the next few months we’re running a number of R, Stan and Scala courses around the UK.

February
  • Mon 13 – Introduction to R (London)
  • Tue 14 – Programming with R (London)
  • Wed 15 – Advanced Graphics with R (London)
  • Thu 16 (2-day course) – Predictive Analysis (London)
March
  • Tue Mar 21 – R for Big Data (Edinburgh)
  • Mon Mar 27 – Survival/Churn Analysis with R (London)
  • Tue Mar 28 – Building an R Package (London)
  • Wed Mar 29 – Automated Reporting (first steps towards Shiny) (London)
  • Thu Mar 30 – Interactive Graphics with Shiny (London)
  • Fri Mar 31 – Spatial Data Analysis with R (London)
April
  • Mon Apr 03 – Introduction to R (Newcastle)
  • Tue Apr 04 – Statistical Modelling with R (Newcastle)
  • Wed Apr 05 – Programming with R (Newcastle)
  • Thu Apr 06 – Efficient R Programming (Newcastle)
  • Fri Apr 07 – Advanced Graphics with R (Newcastle)
  • Mon Apr 10 (2-day course) – Introduction to Bayesian Inference using RStan (London)
May
  • Tue 16th (3-day course) – Scala for Statistical Computing and Data Science (London)
June
  • Thu Jun 08 (2-day course) – Advanced R Programming (Newcastle)
  • Mon Jun 26 – Introduction to R (London)
  • Tue Jun 27 – Statistical Modelling with R (London)
  • Wed Jun 28 – Programming with R (London)
  • Thu Jun 29 (2-day course) – Advanced R Programming (London)

See the website for course descriptions. Any questions, feel free to contact me: colin@jumpingrivers.com

On-site courses are available on request.

January 22, 2017

Input/output benchmarks

Filed under: Computing, R — Tags: — csgillespie @ 5:08 pm

To quantify the impact of the CPU on an analysis, I created the package benchmarkme. The idea is simple. If everyone runs the same R script, we can easily compare machines.

One of the benchmarks in the package compares read/write speeds: we write a large CSV file (using write.csv) and read it back in using read.csv.
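
As a rough illustration of the idea, here is a minimal base R sketch that times one write and one read of a CSV file. It is not the package's implementation: the helper name time_csv_io() and the matrix dimensions are made up for this example.

```r
## Illustrative sketch only: time_csv_io() is a hypothetical helper,
## not a function from benchmarkme.
time_csv_io = function(n_row = 1e4, n_col = 10, tmpdir = tempdir()) {
  ## Random numeric matrix to write out
  m = matrix(rnorm(n_row * n_col), nrow = n_row, ncol = n_col)
  fname = tempfile(tmpdir = tmpdir, fileext = ".csv")
  ## Remove the file once the function has finished
  on.exit(unlink(fname))

  ## Elapsed (wall-clock) time for the write and for the read
  write_time = system.time(write.csv(m, fname, row.names = FALSE))[["elapsed"]]
  read_time = system.time(read.csv(fname))[["elapsed"]]

  c(write = write_time, read = read_time,
    size_mb = file.size(fname) / 1024^2)
}

timings = time_csv_io()
timings
```

A single run like this is noisy, which is presumably why a benchmark repeats the exercise several times.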

The package is on CRAN and can be installed in the usual way

install.packages("benchmarkme")

Running

library(benchmarkme)
## If your computer is relatively slow, remove 200 from below
res = benchmark_io(runs = 3, size = c(5, 50, 200))
## Upload your data set
upload_results(res)

creates three matrices of size 5MB, 50MB and 200MB, writes the associated CSV files to the directory

Sys.getenv("TMPDIR")

and then reads the data sets back into R. The object res contains the timings, which can be compared to other users via

plot(res)

[Figure: benchmark results for writing a 5MB file]

The above graph plots the current benchmarking results for writing a 5MB file (my machine is relatively fast).

Shiny

You can also compare your results using the Shiny interface. Simply create a results bundle

create_bundle(res, filename = "results.rds")

and upload to the webpage.

Network drives

Often the dataset we wish to access is on a network drive. Unfortunately, network drives can be slow. The benchmark_io function has a tmpdir argument that lets us change the directory and estimate the impact of the network drive

res_net = benchmark_io(runs = 3, size = c(5, 50, 200),
                       tmpdir = "path_to_dir")
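
To show the kind of comparison this enables, here is a small base R sketch that times the same write/read round trip in several directories. It does not use benchmarkme: compare_dirs() is a hypothetical helper, and the second directory below is only a stand-in for a real network path.

```r
## Illustrative sketch only: compare_dirs() is a hypothetical helper,
## not part of benchmarkme.
compare_dirs = function(dirs, n_row = 1e4, n_col = 10) {
  ## Use the same data for every directory so timings are comparable
  m = matrix(rnorm(n_row * n_col), nrow = n_row, ncol = n_col)
  elapsed = vapply(dirs, function(d) {
    fname = tempfile(tmpdir = d, fileext = ".csv")
    on.exit(unlink(fname))
    ## Time one write followed by one read
    system.time({
      write.csv(m, fname, row.names = FALSE)
      read.csv(fname)
    })[["elapsed"]]
  }, numeric(1))
  data.frame(dir = dirs, elapsed = unname(elapsed))
}

## Two temporary directories stand in for a local and a network path;
## replace the second with your own mount point.
local_dir = tempdir()
other_dir = file.path(tempdir(), "pretend_network")
dir.create(other_dir, showWarnings = FALSE)

cmp = compare_dirs(c(local_dir, other_dir))
cmp
```

With a genuine network path in place of other_dir, the difference between the two elapsed values gives a first estimate of the cost of the network drive.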

 

January 9, 2017

benchmarkme Update

Filed under: Computing, R — csgillespie @ 8:36 pm

When discussing how to speed up slow R code, my first question is: what is your computer spec? It always surprises me when complex biological experiments, costing a significant amount of money, are analysed using a six-year-old laptop. A new desktop machine costs around £1000, and that money would be saved within a month in user time. Typically, the more RAM you have, the larger the dataset you can handle. However, the benefit of upgrading the processor is less obvious.

To quantify the impact of the CPU on an analysis, I created the package benchmarkme. The aim of this package is to provide a set of benchmark routines and data from past runs. You can then compare your machine with other CPUs.

The package is now on CRAN and can be installed in the usual way

install.packages("benchmarkme")

The benchmark_std() function assesses numerical operations such as loops and matrix operations. It comprises three separate benchmarks: prog, matrix_fun, and matrix_cal. If you have less than 3GB of RAM (run `get_ram()` to find out how much is available on your system), then you should kill any memory-hungry applications, e.g. Firefox, and set `runs = 1` as an argument.
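
To give a flavour of what "loops and matrix operations" means here, the base R sketch below times one explicit loop and one matrix cross-product. These are my own toy tasks, not the routines that benchmark_std() actually runs, and the sizes are arbitrary.

```r
## Illustrative sketch only: these are not the benchmarks used by
## benchmark_std().
n = 200

## A "programming" style task: an explicit loop
loop_time = system.time({
  total = 0
  for (i in seq_len(1e5)) total = total + sqrt(i)
})[["elapsed"]]

## A "matrix calculation" style task: a cross-product
m = matrix(rnorm(n * n), nrow = n)
mat_time = system.time(crossprod(m))[["elapsed"]]

c(loop = loop_time, matrix = mat_time)
```

Timings this small are dominated by noise; a real benchmark uses larger problems and repeats them.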

To benchmark your system, use

library("benchmarkme")
res = benchmark_std(runs = 3)

You can compare your results to other users via

plot(res)


My laptop is ranked around 50 out of 300. However, relative to the fastest processor, there’s not much difference.

Finally, upload your results for the benefit of other users

## You can control exactly what is uploaded. See details below.
upload_results(res)

Shiny

You can also compare your results using the Shiny interface. Simply create a results bundle

create_bundle(res, filename = "results.rds")

and upload to the webpage.

What’s uploaded

Two objects are uploaded:

1. Your benchmarks from benchmark_std or benchmark_io;
2. A summary of your system information (get_sys_details()).

The get_sys_details() function returns:

– Sys.info();
– get_platform_info();
– get_r_version();
– get_ram();
– get_cpu();
– get_byte_compiler();
– get_linear_algebra();
– installed.packages();
– Sys.getlocale();
– The `benchmarkme` version number;
– Unique ID – used to extract results;
– The current date.

The function Sys.info() does include the user and node names. In the public release of the data, this information will be removed. If you don’t wish to upload certain information, just set the corresponding argument to FALSE, e.g.

upload_results(res, args = list(sys_info=FALSE))
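
If you want to see what Sys.info() holds on your own machine before deciding, you can inspect it locally in base R. The sketch below drops the fields I would consider identifying; that choice is my assumption for illustration and not necessarily the exact set removed from the public data.

```r
## Illustrative sketch only: the fields dropped here are an assumption,
## not necessarily those removed from the public data.
info = Sys.info()

## Fields that identify the machine or the person using it
private_fields = c("nodename", "login", "user", "effective_user")
public_info = info[!(names(info) %in% private_fields)]

names(public_info)
```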
