# Why?

## August 16, 2011

### Jonathan Rougier – Nomograms for visualising relationships between three variables (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 2:19 pm

### Background:

Example of Nomogram taken from wikipedia

Donkeys in Kenya. Tricky to find the weight of a donkey in the “field” – no pun intended! So using a few measurements,  estimate the weight. Other covariates include age. Standard practice is to fit:

$\log(weight) = a + b \times \log(heartgirth) + c \times \log(Height)$

for adult donkeys, and other slightly different models for young/old and ill donkeys. What can a statistician add:

• Don’t (automatically) take logs of everything;
• Fit interactions.
Box-Cox suggested that a square-transformation could be a good transformation. Full model has age, health, height and girth. Final model is:

$weight = (-58.9 + 10.2 \times \log(heart) + 4.8 \times \log(height))^2$

We want a simple way of using this model in the field. Use a monogram!

### Digression on nomograms

Nomograms are visual tools for representing the relationship between three or more variables. Variations include:
• curved scaled nomograms;
• some others that I missed.
Lots of very nice nomograms from “The lost art of Nomograms”.

### Back to donkeys

If we used a log transformation for weight rather than square root we get slightly higher weights for smaller/larger donkeys. Nomograms nicely highlight this.

### Summary

Nomograms can be clearer and simpler, but don’t display predictive uncertainty.

### Ulrike Gromping – Design of Experiments in R (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 1:47 pm

### Example: Car seat occupation:

Algorithm must decide whether airbag opens:

• Must open for adult but not for small child or if the seat if empty;
• a few others I missed.

Key questions are:

• What type of design: 32 run regular fractional factorial;
• Response measurement – depends on dummy position, so repeat for 3 different;
• dummy places;
• Precision – are 32 rounds enough.

Use frf2 function in Ulrike’s package to generate experimental design.

Principals of DoE. Initially developed by Fisher. Key principles are blocking, randomization, replication. Repeated measurements are NOT replications.

In this example, there is high measurement error variance. Repeats done directly in sequence. Need to decide between replications and repeats – these are not the same! Balanced factorial experiments provide intrinsic replication.

George Box advocated that when creating experimental designs, you should proceed sequentially. Smaller initial screening. This does not apply to computer experiments.

Started in Feb 2008 and currently contains 37 packages. Maintainers need help, please point out relevant packages or complain about unhelpful packages. Of the 37 packages, only 18 have a dependency relation to others.

As with many packages, FrF2 and DOE.base were developed (in 2008) because someone needed them.

### Package dependencies

DoE.base -> FrF2 -> DoE.wrapper -> Gui interface (R commander plugin).

• DoE.base: for full factorial with blocking and orthogonal arrays;
• FrF2 2-level fractional factorials;
• DoE.wrapper – unify syntax.
Packages have a common output structure based on the White book. An R commander plugin has been developed and released, but it’s still in the early stages.

### Guidance

Should the GUI give guidance with experimental design? This is tricky with such a flexible system. How can the software know the correct solution. Is there a middle ground? Solutions would be very work-intensive – who will do the work?

### Call for activities

A lot is available, but there are still gaps in functionality. If you have the expertise, why not write a package?  Other possibilities are bug reports, suggestions for improvement, wishes, GUI beta testing, internationalization (not quite ready yet).

### High Performance Computing (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , , , , — csgillespie @ 1:28 pm

## Wilem Ligtenberg – GPU computing and R

Why GPU computing? Theoretical GFLOPs for a GPU is three times greater than a CPU. Use GPUs for same instruction multiple data problems (SIMD). Initially GPUs were developed for texture problems. For example, a wall smashed into lots of pieces. Each core handled a single piece. CUDA and FireStream are brand specific. However, OpenCL is an open standard for GPUs. In principle(!), write code that works on multiple platforms and code.

Current R-packages:

• gputools: GPU implementation of standard statistical tools. CUDA only
• cudaBayesreg: linear model from a bayes package.

ROpenCL is an R package that provides an R interface to the openCL library. Like Rcpp for OpenCL. ROpenCL manages the memory for you (yeah!). A little over a week ago the OpenCL package was published on CRAN by Simon Urbanek. The OpenCL is a really thin layer around Open CL. ROpenCL should work out of the box on Ubuntu, not sure of Windows.

## Pragnesh Patel – Deploying and benchmarking R on a large shared memory system

Single system image – Nautilus: 1024 cores, 4TB of global shared memory, 8 NVIDIA Tesla GPUs.

• Shared Memory: symmetric multiprocessor, uniform memory access, does not scale.
• Physical distributed memory: multicomputer or cluster. Distributed shared
• memory: Non-uniform memory access (NUMA).

Need to find parallel overhead on parallel programs. Implicit parallelism: BLAS, pnmath. Programmer does not worry about task division or process communication. Improves programmer productivity.

pnmath (Luke Tierney): this package implements parallel versions of most of the non-rng routines in the math library. Uses OpenMP directives. Requires only trivial changes to serial code. Results summary: pnmath is only worthwhile for some functions. For example, dnorm is slower in parallel. Weak scaling: algorithms that have heavy communication do not scale well.

Lessons: data placement is important, performance depends on system architecture, operating system and compiler optimizations.

Reference: Implicit and Explicit Parallel computing in R by Luke Tierney.

## K Doi – The CUtil package which enables GPU computation in R

Windows only

Features of CUtil:

• Easy to use for Windows users
• overriding common operators;small data transfer cost;
• support for double precision.

Works under Windows. A Linux version is planned – not sure when. Developed under R 2.12.X and 2.13.x.

Implemented functions: configuration function, standard matrix algebra routines, memory tranfer functions. Some RNG are also available, eg Normal, LogNormal.

“Tiny benchmark” example: Seems a lot faster around 25 times. However, the example only uses a single CPU as it’s test case.

## M Quesada – Bayesian statistics and high performance computing with R

Desktop application OBANSoft has been developed. It has a modular design. Amongst other things, this allows integration with OpenMP, MPI, CUDA, etc.

• Statistical library: Java + R.
• Desktop: Java swing
• Parallelization: Parallel R

Uses a “Model-View-Controller” setup.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister who makes excellent notes when she attend conferences.

### Kaleidoscope Ic (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 9:53 am

These are my rough notes on the Kaleidoscope Ic session.

## David Smith – The R Ecosystem (useR! 2011)

David Smith works for Revolution Analytics. Quick overview of the R project – useR, r-journal, and r-forge. Social media starting to play a part in R – Google+, twitter, stackoverflow, and the traditional R mailing list. The developer community is fairly large in github ~1100 repositories. Crantastic! gives a “rating” from the community. Also the local R user groups and video Rchive.

Companies are gradually recognising that R can be useful – Google, facebook, lloyd’s, Thomas Cook, bing, trulia, and techcrunch. A number of companies are based around R. A crude estimate is that there are 1.2M R users in industry and around 1M academia.

A lot of growth in jobs with R – see indeed.com for numbers. Slides are available online at some point.

## James Harner – Rc2: R collaboration in the cloud

Rc2 is a web 2.0 front-end to R which is cloud based, scalable, collaborative, cross platform, touch optimized. Allows users to collaborate over the internet without concern for data becoming out of sync. A 4-tier architecture: client, application (ruby and Java), database: (postresql database for user information, Apache Cassandra for R files/code/data).

The client has a text editor, command line, voice chat, text output. Clients communicate asynchronously using websockets.  Graphics in the browser can be closed, zoomed out and walked through as a slide show.

Security is taken seriously. A 3-value token used for autologin, which:

• disables an account if someone attempts to hijack a session
• limits session to a single browser
• another point that I missed :(
Used a lot in teaching. Particularly useful for distance based learning. Walking through an example, communicating by voice, pass control to a student, take control. A full content management system is planned for the future.

## J.J. Allaire – RStudio: Integrated Development Environment for R

RStudio just released in March. It’s a new open source project, focused on coding productivity. Support for Windows, Mac, Linux, or via the web. RStudio is a coding tool not an R GUI. Intelligent code completion facilities, for example “tab” completion, which gives suggestions and help files. Context sensitive help. Can also do intelligent rename. Also runs blocks or previous selections.

Nice use of the History file for browsing. Easy to search. Puts timestamps next to the commands. In the pipeline is code navigation. Type the function name, and it searches the function within a file.

R encourages good software practices: sweave, sharing/documenting code and version control. In the pipeline is to add version control, debugging tools and project management to RStudio.

RStudio is a company. Looking to build a company around offering education and help – RStudio is open source! In 10 years it will be almost impossible to justify NOT using open source software in science.

Personal note: I think I may use this for teaching.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister who makes excellent notes when she attend conferences.

### B. Ripley – The R development process (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , — csgillespie @ 8:48 am

There are my notes on the User2011 invited talk. Brian Ripley has been a member of R core since 1998

## The R Development Process – A insideR’s view

R Timeline:

• JCGS paper submitted in 1995;
• 1997: CRAN(Mar), Core team(Aug), CVS (Sept);
• R 1.0.0 Feb 2000 – 2.8MB. Many people don’t take 0.X.X seriously;
• R 2.0.0 Oct 2004, 10MB (actually 1.10.0);
• R 2.14.0. Oct 2011, ext 22MB;
• Roughly 4000 repo commits per year.

In the future, 2.15.0 scheduled for Mar 2012. R 3.0.0 has been discussed for a few years, but keeping legacy support could be tricky – there are currently around 3200 packages. So no plans for 3.0.0 in the near future. R-core has 20 members, but several are inactive and only a handful are actively developing R (there are other valuable contributions). There are currently 80 successful submissions per week.

### CRAN

CRAN is around 70GB with 1.9 GB for the current source packages. 10000 packages for Christmas 2016. Submission process is handled almost entirely by Kurt Hornik. It is very time-consuming to check packages – there are 110 packages submitted each week. In 2004, CRAN was replaced by “repos”. However, few public repositories have emerged. Binary packages are kept for two versions.

### The R Development Process

The R CORE team meets in person only every couple of years. R Core have total control over R. A rough criterion of membership is:

when it was more work to have someone out than in

Normal day to day business is by email as members are over a variety of time zones. The R foundation is the legally constituted body, with R-core (voting) members plus a small number other people.

### Getting features into R

R was principally develpoped for the benefit of the core team. Only they have votes.

Most of what you see in R is there because core members wanted it for research, teaching, support for other projects, or to develop R itself. For example, the lm package is their because of a 1998 course in regression. Since almost all core R members are  mathematician, they decided to build very general solutions rather than specific solutions.

If  a core member accepts a contribution they are commiting themselves/R-core to supporting that feature for many years. R-core have regretted accepting some (even small) contributions. So most new features should go into a package not into the core.

### Timescales

• Short: psnice, lis.dirs(recursive=FALSE).
• Year or two: Internationalization.

### Portability

Trying to phase out bash, sh and Make files for ease of use, maintainance, and performance. The parser for Rd2 was written in bison, but all the conversion scripts are in R. Also Fortran is becoming a problem since neither Apple nor Microsoft support it in their SDK. Legacies of R’s 32 bit beginnings is that there is only a single integer type. Longer integers have boon on the horizon for years, but still seems tricky. Could be in 3.0.0

### Performance

For a long time, performance issues could be solved by waiting six months for a new computer. However, this isn’t true any more. Rather, we have multiple cores. New package parallel to support multi-core processors in the next version of R.

### The future

R is heavily dependent on a small group of altruistic people who can feel that their contributions are not treated with respect. People have lives outside R, and circumstances and health do changes.

Other future developments are low-level support for threading, GUIs, vector types, replace library() with use() and moving to a yearly release schedule.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister who makes excellent notes when she attend conferences.

## August 15, 2011

### Paul Murrell – Introduction to Grid graphics (useR! 2011)

Filed under: Conferences, R, useR! 2011 — Tags: , , , , , — csgillespie @ 8:06 pm

Typically, I’m very bad at taking notes in conference. This time around, I intend to make notes for each some of the talks I attend at this year’s useR! 2011 conference. Below are my notes that I made during this afternoon’s tutorial. Note: these are just notes I made and aren’t meant to be a full introduction to Grid graphics. If you are interested further in grid graphics I recommend that you visit Paul’s website (look at his talks near the bottom of the page) or buy his book.

## Introduction to Grid Graphics: Paul Murrell

Grid provides tools to draw and arrange basic shapes. It is a very low level graphics package. Grid provides functions that allow the plot canvas to be accessed programmatically. Viewports create a context for drawing. The basic shapes available are polygons, curves, raster images, and data symbols.

ggplot2 and lattice use grid graphics.

### Shapes

The grid coordinate systems consist of a value and one of four systems:

• npc: Normalised parent coordinates;
• native: relative to current x-, y-scale;
• in or cm;
• lines: lines of text.

Every basic shape has a gp argument. Similar specifications to base graphics, i.e. col, lwd, lty, cex, fill, etc.

### Viewports

A viewport describes a rectangular region on the page. Viewports have both a location and size. The viewport function creates a rectangular region on the page. Drawing occurs via the push viewport function:

pushViewport(vp)
grid.rect(gp=gpar(fill="grey"))

Crucially, viewports are relative to their current viewports. For example: pushViewport(vp); pushViewport(vp) is a viewport within a viewport. To move back up the viewport tree, use: popViewport(#no_of_viewports). However, this “destroys” the viewport. Usually better to use upViewport and downViewport.

### Layouts

A layout divides viewports into rows and columns. Use upViewport/downViewport to avoid destroying existing viewports.

### Grobs

A grob is a grapical object that contains a description of the shape. Grobs have names. grid.ls() lists grobs, while grid.edit() accesses and edits the grob.

### ggplot2 and lattice

ggplot2 can be manipulated via grid. However, this can be tricky. lattice plots are a bit easier to manipulate.

The last few tutorials I’ve attended at conferences have been a waste of time – poorly prepared presenters who didn’t know their audience. However, this tutorial was excellent. Paul kept the audience with him through “simple” exercises. Glad I attended.

Please note that the notes/talks section of this post is merely my notes on the presentation. I may have made mistakes: these notes are not guaranteed to be correct. Unless explicitly stated, they represent neither my opinions nor the opinions of my employers. Any errors you can assume to be mine and not the speaker’s. I’m happy to correct any errors you may spot – just let me know! The above paragraph was stolen from Allyson Lister who makes excellent notes when she attend conferences.

## July 12, 2011

### Reviewing a paper that uses GPUs

Filed under: Computing, Publications — Tags: , , , , , — csgillespie @ 1:53 pm

Graphical processing units (GPUs) are all the rage these days. Most journal issues would be incomplete if at least one article didn’t mention the word “GPUs”. Like any good geek, I was initially interested with the idea of using GPUs for statistical computing. However, last summer I messed about with GPUs and  the sparkle was removed. After looking at a number of papers, it strikes me that reviewers are forgetting to ask basic questions when reviewing GPU papers.

1. For speed comparisons, do the authors compare a GPU with a multi-core CPU. In many papers, the comparison is with a single-core CPU. If a programmer can use CUDA, they can certainly code in pthreads or openMP. Take off a factor of eight when comparing to a multi-core CPU.
2. Since a GPU has (usually) been bought specifically for the purpose of the article, the CPU can be a few years older. So, take off a factor of two for each year of difference between a CPU and GPU.
3. I like programming with doubles. I don’t really want to think about single precision and all the difficulties that entails. However, many CUDA programs are compiled as single precision. Take off a factor of two for double precision.
4. When you use a GPU, you split the job in blocks of threads. The number of threads in each block depends on the type of problem under consideration and can have a massive speed impact on your problem. If your problem is something like matrix multiplication, where each thread multiplies two elements, then after a few test runs, it’s straightforward to come up with an optimal thread/block ratio. However, if each thread is a stochastic simulation, it now becomes very problem dependent. What could work for one model, could well be disastrous for another.
So in many GPU articles the speed comparisons could be reduced by a factor of 32!
Just to clarify, I’m not saying that GPUs have no future, rather, there has been some mis-selling of their potential usefulness in the (statistical) literature.

## June 16, 2011

Filed under: Computing, Geekery, R — Tags: , , , , , — csgillespie @ 12:52 pm

As everyone knows, it seems that Sony is taking a bit of a battering from hackers.  Thanks to Sony, numerous account and password details are now circulating on the internet. Recently, Troy Hunt carried out a brief analysis of the password structure. Here is a summary of his post:

• There were around 40,000 passwords, of which 8000 would fail a simplistic dictionary attack;
• 93% of passwords were between 6 and 10 characters.

In this post, we will investigate the remaining 32,000 passwords that passed the dictionary attack.

## Distribution of characters

As Troy points out, the vast majority of passwords only contained a single type, i.e. all lower or all upper case. However, it turns out that things get even worst when we look at character frequency.

In the password database, there are a 78 unique characters. So if passwords were truly random, each character should occur with probability 1/78 = 0.013. However when we calculate the actual password occurrence, we see that it clearly isn’t random. The following figure shows the top 20 password characters, with the red line indicting 1/78.

Unsurprisingly, the vowels “e”, “a” and “o” are very popular, with the most popular numbers being 1,2, and 0 (in that order). No capital letters make it into the top twenty. We can also construct the cumulative probability plot for character occurrence. In the following figure, the red dots show the pattern we would expect if the passwords were truly random (link to a larger version of the plot):

Clearly, things aren’t as random as we would like.

## Character order

Let’s now consider the order that the characters appear. To simplify things, consider only the eight character passwords. The most popular number to include in a password is “1”. If placement were random, then in passwords containing the number “1” we would expect to see the character evenly distributed. Instead, we get:

##Distribution of "1" over eight character passwords
0.06 0.03 0.04 0.04 0.13 0.13 0.22 0.34

So in around of 84% of passwords that contain the number “1”, the number appears only in the second half of the password. Clearly, people like sticking a number “1” towards the end of their password.

We get a similar pattern with “2”:

0.05 0.05 0.04 0.05 0.13 0.11 0.30 0.27

and with “!”

#Small sample size here
0.00 0.00 0.00 0.00 0.00 0.11 0.16 0.74

We see similar patterns with other alpha-numeric characters.

## Number of characters needed to guess a password

Suppose we constructed all possible passwords using the first N most popular characters. How many passwords would that cover in our sample? The following figure shows proportion of passwords covered in our list using the first N characters:

To cover 50% of passwords in all list, we only need to use the first 27 characters. In fact, using only 20 characters covers around 25% of passwords, while using 31 characters covers 80% of passwords. Remember, these passwords passed the dictionary attack.

## Summary

Typically when we calculate the probability of guessing a password, we assume that each character is equally likely to be chosen, i.e. the probability of choosing “e” is the same as choosing “Z”. This is clearly false. Also, since many systems now force people to have different character types in their password, it is too easy for users just to tack on a number as their final digit. I don’t want to go into how to efficiently explore “password space”, but it’s clear that a brute force search isn’t the way to go.

Personally, I’ve abandoned trying to remember passwords a long time ago, and just use a password manager. For example, my wordpress password is over 12 characters and consists of a completely random mixture of alphanumeric and special characters. Of course, you just need to make sure your password manager is secure….

## May 29, 2011

### Impact factors for statistics journals

Filed under: Publications — Tags: , , , — csgillespie @ 1:08 pm

The other day I came across “Nefarious numbers” by Douglas Arnold and Kristine Fowler in arXiv. This paper examines how impact factors can be easily and blatantly manipulated.

## What is an Impact Factor

For a particular year, the impact factor of a journal is the average number of citations received per paper, published in that journal during the two preceding years. The impact factor as a number of glaring flaws:

• Impact factors vary across disciplines.
• The submission to publication process in a statistical journal can take up to a year.
• The impact factor is just a single statistic out of many possible measures.
• The underlying database contains many errors.

## International Journal of Nonlinear Sciences and Numerical Simulation

The Australian Research Council (ARC) recently released an evaluation, listing quality ratings for over 20,000 peer-reviewed journals across various disciplines. This list was constructed through a review process involving academics, disciplinary bodies and learned academies. The outcome is that over 20,000 peer-journals are ranked A* to C, where

• A*: one of the best in its field or sub-field;
• A: very high quality;
• B: solid, though not outstanding reputation;
• C: does not meet the criteria of the higher tiers.

The ARC ranked the international journal of nonlinear sciences and numerical simulation (IJNSNS) as a B. However, in 2008 this journal had an impact factor of 8.9 – more than double the next highest journal in the Applied Mathematics section. As the paper explains, the reason for the large impact factor is easy to see. In 2008, the top-citing three authors to IJNSNS were:

• Ji-Huan He, the journal’s Editor-in-Chief, who cited, within a the two-year window, 243 times;
• D. D. Ganji, a member of the editorial board, with 114 cites;
• Mohamed El Naschie, a regional editor, with 58 cites.

Comparing these numbers with other journals, shows how extreme IJNSNS really is – the next highest impact factor is around 4. Arnold and Fowler also investigate journals where the citations occurs. These journals turn out to be IJNSNS itself or special issues of other journals edited by someone on the IJNSNS board.

## Impact Factors for Statistics Journals

The ARC statistics section contains around two hundred journals. Some of these journals are “traditional” statistics journals, such as JASA, RSS, and biometrics. Other journals are more applied, such as Bioinformatics and Mathematical Biosciences. So in the following comparison, I just considered journals classed as “statistics” by the ISI Web of Knowledge. This leaves seventy-seven journals.

The following plot shows the two- and five-year impact factor for the seventy-seven statistical journals, grouped by the ARC rating. The red dots show the median impact factor for a particular grouping.

As would be expected, for the two-year IF there is very little difference between the ARC ratings – although more than I expected. Once we calculate the five-year impact factors,  the difference between ratings are clearer. Since many of the group C journals are new, a number of them don’t have five-year impact factor.

## Outlying Statistical Journals

There are three journals that stand out from their particular groups:

• Statistical Science, a group A journal. Since this is mainly a review journal, so it’s really not surprising that this has a high impact factor.
• Journal of Statistical and the Stata journal, group C journals. Since these are “statistical computing” journals, it isn’t that surprising that they have high impact

## Should we use Impact Factors

The best answer would be no! Just read the first page of  “Nefarious numbers” for a variety of reasons why we should dump impact factors. However, I suspect that impact factors will be forced on many of us, as a tool to quantify our research. Therefore, while we should try to fight against them, we should also keep an eye on them for evidence of people playing the system.

## May 25, 2011

### Statistical podcast: random number seeds

Filed under: Computing, Geekery — Tags: , , , , — csgillespie @ 10:39 pm

One of the podcasts I listen to each week is Security Now! Typically, this podcast has little statistical content, as its main focus is computer security, but episode 301 looks at how to generate truly random numbers for seeding pseudo random number generators.

Generating truly random numbers to be used as a seed, turns out to be rather tricky. For example, in the Netscape browser, the random seed used by version 1.0 of the SSL protocol combined the time of day and the process number to seed its random number generator. However, it turns out that the process number is usually a small subset of all possible ids, and so is fairly easy to guess.

Recent advances indicate that we can get “almost true” randomness by taking multiple snap shorts of the processor counter. Since the counter covers around 3 billion numbers each second, we can use the counter to create a true random seed.

To find out more, listen to the podcast. The discussion on random seeds begins mid-way through the podcast.