Why?

November 28, 2010

Random variable generation (Pt 1 of 3)

Filed under: AMCMC, R — csgillespie @ 7:35 pm

As I mentioned in a recent post, I’ve just received a copy of Advanced Markov Chain Monte Carlo Methods. Section 1.4 of the book (very quickly) covers random variable generation.

Inverse CDF Method

A standard algorithm for generating random numbers is the inverse CDF method. The continuous version of the algorithm is as follows:

1. Generate a uniform random variable U

2. Compute and return X = F^{-1}(U)

where F^{-1}(\cdot) is the inverse of the CDF. Well-known examples of this method are the exponential distribution and the Box-Muller transform.
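As a concrete sketch (my addition, not from the book), the exponential distribution with rate \lambda has F(x) = 1 - \exp(-\lambda x), so F^{-1}(u) = -\log(1-u)/\lambda; and since 1 - U is itself uniform, -\log(U)/\lambda works just as well:

```r
# Inverse CDF sketch for an Exponential(rate) distribution.
# F(x) = 1 - exp(-rate * x)  =>  F^{-1}(u) = -log(1 - u) / rate,
# and since 1 - U is also Uniform(0, 1), we can use -log(u) / rate.
rate = 2
set.seed(1)
u = runif(100000)
x = -log(u) / rate
mean(x)  # close to the true mean 1/rate = 0.5
```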

Example: Logistic distribution

I teach this algorithm in one of my classes and I’m always on the look-out for new examples. Something that had escaped my notice is that it is easy to generate random numbers from the Logistic distribution using this technique. This distribution has CDF

\displaystyle F(x; \mu, s) = \frac{1}{1 + \exp(-(x-\mu)/s)}
and so we can generate a random number from the logistic distribution using the following formula:
\displaystyle X = \mu + s \log\left(\frac{U}{1-U}\right)

This is easily converted to R code:

myRLogistic = function(mu, s) {
  u = runif(1)
  return(mu + s * log(u / (1 - u)))
}
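As a quick sanity check (my addition, not from the post), we can compare the empirical CDF of draws from myRLogistic with R’s built-in logistic CDF, plogis:

```r
myRLogistic = function(mu, s) {
  u = runif(1)
  return(mu + s * log(u / (1 - u)))
}

# Draw many samples and check the empirical CDF at x = mu against
# the built-in logistic CDF: plogis(mu, mu, s) = 0.5
set.seed(42)
samples = replicate(100000, myRLogistic(mu = 2, s = 1))
mean(samples <= 2)  # should be close to 0.5
```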

November 27, 2010

Advanced Markov Chain Monte Carlo Methods (AMCMC)

Filed under: AMCMC, R — csgillespie @ 4:55 pm

I’ve just received my copy of Advanced Markov Chain Monte Carlo Methods, by Liang, Liu, & Carroll. Although my PhD didn’t really involve any Bayesian methodology (and my undergrad was devoid of any Bayesian influence), I’ve found that the sort of problems I’m now tackling in systems biology demand a Bayesian/MCMC approach. There are a number of introductory books on MCMC, but not that many on advanced techniques.

The book suggests that it could serve as a textbook or reference guide for a one-semester statistics graduate course. I’m not entirely convinced that it would make a good textbook, but as a reference it looks very promising. A word of warning: if you’re not familiar with MCMC, then you should try Dani Gamerman’s MCMC book first. Some later chapters look particularly interesting, including topics on adaptive proposals, population-based MCMC methods and dynamic weightings.

I intend to work through the book (well, at least a few of the chapters) and post my results and code as I go. That’s the plan, anyway.


November 23, 2010

R Style Guide

Filed under: R — csgillespie @ 3:51 pm

Each year I have the pleasure (actually it’s quite fun) of teaching R programming to first-year mathematics and statistics students. The vast majority of these students have no experience of programming, yet think they are good with computers because they use Facebook!

Debugging students' R scripts

The class has around 100 students, and there are eight practicals. In some of these practicals the students have to submit code. Although the code is “marked” by a script, this only detects whether the code gives the correct answer. Therefore, I have to go through a lot of R functions by hand to find the bugs.

  • The first year the course ran, I had no style guide.
    • Result: spaghetti R code.
  • The second year, I asked the students to indent their code.
    • In fact, during practicals I refused to debug any R code that hadn’t been indented.
    • Result: nicer-looking and more correct code.
  • This year I intend to introduce an R style guide based loosely on Google’s and Hadley’s guides.
    • One point that’s in my guide and not (and shouldn’t be) in the above style guides is that all functions must have one and only one return statement. I tend to follow the single-return rule for the majority of my R functions, but do, on occasion, break it. The bible of code styling, Code Complete, recommends that you use returns judiciously.

R Style Guide

This style guide is intended to be very light touch: it gives students the basics of good programming style, rather than being a guide for submitting packages to CRAN.

File names

File names should end in .R and, of course, be meaningful. Files should be stored in a meaningful directory – not your Desktop!

GOOD: predict_ad_revenue.R
BAD: foo.R

Variable & Function Names

Variable names should be lowercase. Use _ to separate words within a name. Strive for concise but meaningful names (this is not easy!)

GOOD: no_of_rolls
BAD: noOfRolls, free

Function names begin with a capital letter and are written in CamelCase.

GOOD: CalculateAvgClicks
BAD: calculate_avg_clicks, calculateAvgClicks

If possible, make function names verbs.

Curly Braces

An opening curly brace should never go on its own line; a closing curly brace should always go on its own line.

GOOD:
if (x == 5) {
  y = 10
}
RtnX = function(x) {
  return(x)
}
BAD:
RtnX = function(x)
{
  return(x)
}

Functions

Functions must have a single return statement, just before the final brace.

GOOD:
IsNegative = function(x) {
  if (x < 0) {
    is_neg = TRUE
  } else {
    is_neg = FALSE
  }
  return(is_neg)
}
BAD:
IsNegative = function(x) {
  if (x < 0) {
    return(TRUE)
  } else {
    return(FALSE)
  }
}

Of course, the above function could and should be simplified to
is_neg = (x < 0)

Commenting guidelines

Comment your code. Entire commented lines should begin with # and one space. Comments should explain the why, not the what.

What’s missing

I decided against putting in a section on “spacing”, i.e. placing spaces around all binary operators (=, +, -, etc.). I think spacing may be taking style a bit too far for a first-year course.

Comments welcome!

November 16, 2010

Assignment operators in R: ‘=’ vs. ‘<-’

Filed under: R — csgillespie @ 7:33 pm

In R, you can use both ‘=’ and ‘<-’ as assignment operators. So what’s the difference between them, and which one should you use?

What’s the difference?

The main difference between the two assignment operators is scope. It’s easiest to see the difference with an example:

##Delete x (if it exists)
> rm(x)
> mean(x=1:10) #[1] 5.5
> x #Error: object 'x' not found

Here x is declared within the function’s scope, so it doesn’t exist in the user workspace. Now let’s run the same piece of code using the <- operator:

> mean(x <- 1:10)# [1] 5.5
> x # [1] 1 2 3 4 5 6 7 8 9 10

This time the x variable is declared within the user workspace.

When does the assignment take place?

In the code above, you may be tempted to think that we “assign 1:10 to x, then calculate the mean”. This would be true for languages such as C, but it isn’t true in R. Consider the following function:

> a <- 1
> f <- function(a) return(TRUE)
> f(a <- a + 1); a
[1] TRUE
[1] 1

Notice that the value of a hasn’t changed! In R, the value of a will only change if we need to evaluate the argument in the function. This can lead to unpredictable behaviour:

> f <- function(a) if(runif(1)>0.5) TRUE else a
> f(a <- a + 1); a
[1] 2
[1] 2
> f(a <- a + 1); a
[1] TRUE
[1] 2
> f(a <- a + 1); a
[1] 3
[1] 3
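Because the example above involves runif, the output changes from run to run. A deterministic sketch (my own, not from the post) of the same lazy-evaluation behaviour uses force():

```r
a = 1

# The argument x is never used, so the promise a <- a + 1 is never evaluated
f = function(x) TRUE
f(a <- a + 1)
a  # still 1

# force(x) evaluates the argument, so the assignment now takes place
g = function(x) { force(x); TRUE }
g(a <- a + 1)
a  # now 2
```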

Which one should I use?

Well, there’s quite a strong following for the “<-” operator:

  • The Google R style guide prohibits the use of “=” for assignment.
  • Hadley Wickham’s style guide recommends “<-”
  • If you want your code to be compatible with S-plus you should use “<-”
    • Update: following a comment from David Smith below, it seems that S-plus now accepts “=”.
  • I believe that the general R community recommends using “<-” – see, for example, this link in the mailing list.

However, I almost always use the “=” operator, for the following reasons:

  • The other languages I program in (Python, C and occasionally JavaScript) use the “=” operator.
  • It’s quicker to type “=” than “<-”.
  • Typically, when I declare a variable, I only want it to exist in the current workspace.
  • Since I have the pleasure of teaching undergraduates their first course in programming, using “=” avoids misleading expressions like if (x[1]<-2).
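The last point deserves spelling out: without spaces, x[1]<-2 parses as the assignment x[1] <- 2, not as the comparison x[1] < -2. A small sketch:

```r
x = c(5, 3)
# Looks like a comparison with -2, but R parses "<-" as assignment:
if (x[1]<-2) cat("branch taken\n")
x[1]  # x[1] has silently become 2

# With spaces, the intent is unambiguous:
x = c(5, 3)
if (x[1] < -2) cat("this is never printed\n")
x[1]  # still 5
```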

Also, Introducing Monte Carlo Methods with R, by Robert and Casella, recommends using “=”.

If I’m missing something or you disagree, please leave a comment – I would be very interested.


November 13, 2010

Updated version of the DSMTS

Filed under: Uncategorized — csgillespie @ 8:20 pm

Stochastic simulation is a very important tool for Systems Biology modelling. However, it is difficult to check the correctness of a stochastic simulator, since any two realizations from a single model will typically be different. Recognising this, we developed the discrete stochastic model test suite (DSMTS). The test suite contains four core models, and each model has a number of variants. We restricted our attention to models which admit direct analytic or numerical analysis. There are not many of these, and they are all fairly simple.

Today I updated the test suite with a new version. Only minor changes were made to the test suite (thanks to Sean Mauch for pointing them out). Here’s a list of changes:

  • Page 2: Undefined section number in link.
  • Page 4: Changed dstms to dsmts.
  • Page 14: Inserted a space between parameter and Mu.
  • Page 14: Inserted a space in “dsmts-002-01,except.”
  • Page 23: Inserted a space between parameter and Mu.
  • In dsmts-003-07.xml changed the model id to Dimerisation07.

November 6, 2010

Installing R packages

Filed under: R — csgillespie @ 9:00 am

Part of the reason R has become so popular is the vast array of packages available at the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown exponentially!

This is a short post giving steps on how to actually install R packages. Let’s suppose you want to install the ggplot2 package. Well nothing could be easier. We just fire up an R shell and type:

> install.packages("ggplot2")

In theory the package should just install, however:

  • if you are using Linux and don’t have root access, this command may not work (if you use Ubuntu, then this isn’t a problem);
  • if you have limited space in /home/, then you may want to install the package in a non-default directory;
  • you will be asked to select your local mirror, i.e. which server to use to download the package;
  • if you are behind a proxy, then you may run into problems.

Installing packages in a different directory

First, you need to designate a directory where you will store the downloaded packages. On my machine, I use the directory /data/Rpackages/. After creating the package directory, we install a package with the command:

> install.packages("ggplot2", lib="/data/Rpackages/")
> library(ggplot2, lib.loc="/data/Rpackages/")

It’s a bit of a pain having to type /data/Rpackages/ all the time. To avoid this burden, we create a file .Renviron in our home area, and add the line R_LIBS=/data/Rpackages/ to it. This means that whenever you start R, the directory /data/Rpackages/ is added to the list of places to look for R packages, and so:

> install.packages("ggplot2")
> library(ggplot2)

just works!

Setting the repository

Every time you install an R package, you are asked which repository R should use. To set the repository and avoid having to specify it at every package install, simply:

  • Create a file .Rprofile in your home area.
  • Add the following piece of code to it:


cat(".Rprofile: Setting UK repository\n")
r = getOption("repos") # hard code the UK repo for CRAN
r["CRAN"] = "http://cran.uk.r-project.org"
options(repos = r)
rm(r)

Or as Hadley pointed out, this could be condensed to:

cat(".Rprofile: Setting UK repository\n")
options(repos = c(CRAN = "http://cran.uk.r-project.org/"))

but this would delete any other repositories you have set – say, the R-Forge repository (thanks to Kevin for pointing this out).
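One way (my suggestion, not from the post) to get the best of both worlds, preserving any other repositories while not leaving a stray r object in the workspace, is to wrap the code in local():

```r
# In ~/.Rprofile: update only the CRAN entry, leaving any other
# repositories (e.g. r-forge) alone; local() keeps the temporary
# variable r out of the global environment
local({
  r = getOption("repos")
  r["CRAN"] = "http://cran.uk.r-project.org"
  options(repos = r)
})
```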

Obviously, you don’t need the cat statement, but I find it a helpful reminder that I’ve set this default – especially if you have a laptop and do a lot of travelling.

I found this tip in a Stack Overflow answer.

Setting your http proxy

This section is an update and follows a comment from Sean Carmody below.

If you are behind an HTTP proxy, then one method of setting the proxy in R is to use the Sys.setenv command, viz.

Sys.setenv(http_proxy="http://myproxy.example.com:8080")


November 1, 2010

An analysis of the Stackoverflow Beta sites

Filed under: R — csgillespie @ 5:26 pm

In the last six months or so, the behemoth of Q&A sites, Stack Overflow, decided to change tack and launch a number of other, non-programming, sites. To launch a site in the Stack Overflow family, a proposal has to spend time gathering followers in Area 51. Once a site has gained a critical mass, a new Stack Exchange (SE) site is born.

At present there are twenty-one SE beta sites. Being rather bored this weekend, I decided to see how these sites are similar or different. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the Stack Overflow API to download some summary statistics for each site:

library(rjson)
#List of current SO beta sites
sites = c("stats", "math", "programmers", "webapps", "cooking",
          "gamedev", "webmasters", "electronics", "tex", "unix",
          "photo", "english", "cstheory", "ui", "apple", "wordpress",
          "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)

#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));

#Go through each site and download summary statistics
for(i in 1:length(sites)){
  stack_url=paste("http://api.",
                   sites[i],
                   ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                   sep="")
  z =  gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  cat(sites[i],"\n")
}

For each of the twenty-one sites, we now have information on the:

  • number of questions;
  • number of votes;
  • number of users;
  • number of views.

An easy “starter for ten” in terms of analysis is to run a quick principal components analysis:

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

#Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)

plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1",ylab="PC 2", pch=NA,
     main="PCA analysis of  Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)

This gives a plot of the sites on the first two principal components. Main features:

  • Most sites are similar, with the big exception of programmers and possibly webapps.
  • programmers is different due to its large number of votes – it has twice as many votes as the next highest site.
  • webapps (and math) are different due to their large number of questions.

Some more details

In case anyone is interested, the weightings you get from the PCA are:

#PC1 is a simple average
> round(PC.cor$rotation, 2)
        PC1   PC2   PC3   PC4
votes  0.54 -0.31 -0.17 -0.77
users  0.51  0.18  0.83  0.10
views  0.50 -0.53 -0.27  0.63
qs     0.44  0.77 -0.45  0.10
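To see how much of the variation each component captures, summary() on the prcomp object reports the proportion of variance for each PC. Since reproducing the real numbers requires the API calls above, here is a sketch with synthetic stand-in data (the rpois means are made up, purely for illustration):

```r
# Synthetic stand-in for the downloaded site statistics (21 sites)
set.seed(1)
n = 21
df = data.frame(votes = rpois(n, 5000),
                users = rpois(n, 2000),
                views = rpois(n, 800),
                qs    = rpois(n, 1500))

PC.cor = prcomp(df, scale = TRUE)
# With scaled variables, the total variance equals the number of columns,
# and the "Proportion of Variance" row gives each PC's share of it
summary(PC.cor)
```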
