Why?

December 27, 2010

Problems with your Blackberry

Filed under: Geekery — Tags: , , , , — csgillespie @ 7:39 pm

I only caught part of the one Ronnie show this year, but it was a classic. The best sketch I watched involved fixing your Blackberry:

For non-UK people, orange is a mobile service provider. The Xbox 360 joke is brilliant!

December 21, 2010

R programming books

Filed under: R — Tags: , , , — csgillespie @ 3:05 pm

My sabbatical is rapidly coming to an end, and I have to start thinking more and more about teaching. Glancing over my module description for the introductory computational statistics course I teach, I noticed that it’s a bit light on recommend/background reading. In fact it has only two books:

• A first course in statistical programming with R (Braun & Murdoch)
• Pros: I quite like this book (hence the reason I put it on my list). It has a nice collection of exercises, it “looks nice” and doesn’t assume knowledge of programming. It also doesn’t assume (or try to teach) any statistics.
• Cons: When describing for loops and functions the examples aren’t very statistical. For example, it uses Fibonacci sequences in the while loop section and the sieve of Eratosthenes for if statements.
• An introduction to R (Venables & Smith)
• Pros: Simple, short and to the point. Free copies available. Money from the book goes to the R project.
• Cons: More a R reference guide than a textbook.

What other good R books could I recommend? In particular, I’m looking for books that:

• Assume no prior knowledge of programming.
• Assume very little knowledge of statistics. For example, no regression.
• Doesn’t try to teach statistics. So no “R with ….” type books.
• Are cheap!

Suggestions  welcome (needed!)

December 17, 2010

The strangest monte-carlo simulation ever

Filed under: Uncategorized — Tags: , , , — csgillespie @ 11:08 pm

While flicking through the television channels tonight, I had the misfortune of catching the last five minutes of the Deadliest Warrior. At this point you will asking yourself two questions:

• What is this programme about?
• Why did I watch the last five minutes?

From what I can gather in five minutes of viewing, the programme pits two historical warriors against each other. Why did I watch it? Well this week it was William Wallace vs Shaka zulu. At this point my in-built Scottish pride kicked in and I had to watch.

What followed was all slightly silly. However, right at the end they ran 1000 computer simulations to determine the “deadliest warrior.” Wallace triumphed in over 60% of battles!

I suppose that’s five minutes of my life never to be regained😦

You can get a preview of the series from amazon

December 14, 2010

Logical operators in R

Filed under: R — Tags: , , , — csgillespie @ 4:38 pm

A logical operator

In R, the operators “|” and “&” indicate the logical operations OR and AND. For example, to test if x equals 1 and y equals 2 we do the following:

 > x = 1; y = 2 > (x == 1) & (y == 2) [1] TRUE

However, if you are used to programming in C you may be tempted to write
 #Gives the same answer as above (in this example...) > (x == 1) && (y == 2) [1] TRUE 
At this point you could be lulled into a false sense of security and believe that they could be used interchangeably. Big mistake.

Let’s consider another example, this time a vector comparison:
 > z = 1:6 > (z > 2) & (z < 5) [1] FALSE FALSE TRUE TRUE FALSE FALSE > z[(z>2) & (z<5)] [1] 3 4 
but the double “&&” gives
 > (z > 2) && (z < 5) [1] FALSE > z[(z > 2) && (z < 5)] integer(0)#Probably not what you want 
It’s all gone a bit pear shaped! In fact it could have been worse:
 > (z > 2) && (z < 5) [1] TRUE > z[(z > 0) && (z < 5)] [1] 1 2 3 4 5 6 
Now you’ve the wrong answer and something that would be very tricky to spot. This is because R recylces the TRUE variable.

What’s the difference?

Well from the R help page:

“The longer form evaluates left to right examining only the first element of each vector”

where the longer form refers to “&&”.  So
 > (z > 2) && (z < 5) [1] FALSE 
is equivalent to:
 > (z[1] > 2) & (z[1] < 5) [1] FALSE 
The same concept applies to the OR operator, “|”.

As the commentators point out below, another key difference is for the longer form

“Evaluation proceeds only until the result is determined”

This concept is highlighted in the following example:
 > f = function(){cat("My name is f\n");return(TRUE)} > g = function(){cat("My name is g\n");return(FALSE)} > f() | g() My name is f My name is g [1] TRUE > f() || g() My name is f [1] TRUE 
This has two benefits:

• Evaluation will be faster. In the above example, the function g isn’t evaluated (thanks to Andrew Robson and NotMe)
• Also, you can use the double variety to check a property of a data structure before carrying on with your analysis, i.e. all(!is.na(x)) && mean(x) > 0 (thanks to Pat Burns for this tip)

December 8, 2010

Filed under: Geekery, latex — Tags: , — csgillespie @ 8:59 pm

Every so often you’re typing away at a tex document and you have forgotten the latex command. You don’t have any latex books with you and using Google to find a latex symbol is a lesson in frustration. Don’t worry detexify is at hand. Just doodle your symbol in the box and up pops a few suggestions.

Here’s my attempt at drawing nleftarrow

Very useful!

New paper: Survival analysis

Filed under: Publications, R — Tags: , , , , , , — csgillespie @ 8:13 pm

Each year I try to carry out some statistical consultancy to give me experience in other areas of statistics and also to provide teaching examples. Last Christmas I was approached by a paediatric consultant from the RVI who wanted to carry out prospective survival analysis. The consultant, Bruce  Jaffray, had performed Nissen fundoplication surgery on 230 children. Many of the children had other medical conditions such as cerebral palsy or low BMI. He was interested in the factors that affected patients’ survival.

The model

We fitted a standard cox proportional hazards model. The following covariates were significant:

• gastrostomy
• cerebral palsy
• gender
• need for revision surgery. This was when the child had to return to hospital for more surgery.
• an interaction term between gastrostomy & cerebral palsy.

The interaction term was key to getting a good model fit. The figures (one of which is shown below) were constructed using ggplot2 and R. The referees actually commented on the (good) quality statistical work and nice figures! Always nice to read. Unfortunately, there isn’t a nice survival to ggplot2 interface. I had to write some rather hacky R code😦

Results

The main finding of the paper was the negative effect of cerebral palsy and gastrostomy on survival. Unfortunately, if a child had a gastronomy or had cerebral palsy then survival was dramatically reduced. The interaction effect was necessary, otherwise we would have predicted that all children with a gastronomy and cerebral palsy wouldn’t survive.

Other results

• There was a rather strange and strong gender effect – male survival was greater than female.
• The revision covariate was also significant – children who needed their fundoplication redone had increased survival. At first glance this is strange – the operation had to be redone, yet this was good for survival. However, this was really a red herring. Essentially children who had their surgery redone had by definition survived a minimum amount of time. I think something a bit more sophisticated could have been done with this variable, but the numbers weren’t that large.

References:

December 7, 2010

DNS Spoofing

Filed under: Geekery — Tags: , , , — csgillespie @ 3:34 pm

Midway through 2008 a new (serious) internet vulnerability was discovered by Dan Kaminsky. Dan realised that there was a serious flaw in the way the Internet’s domain name system works. This flaw was critical, because it allowed the bad guys to redirect users to malicious sites without detection.

What is DNS

The Domain name system (DNS) converts easy to remember web addresses into their numerical IP counter parts. For example, rather than trying to remember 128.240.233.197, we just need to remember www.ncl.ac.uk. So when you type an URL into your address bar, you first ask your DNS resolver if it knows the IP address. If the resolver doesn’t know it will ask a DNS server for directions.

In the good old days, no-one worried about security and consequently the DNS protocol was lax. First, all queries to the DNS server were typically through the same port. Second, each request had a unique ID (from 1- 65,536), however, the ids were generated in a completely predicable manner, i.e. 1, 2, 3, ….65536. This meant that an attacker could respond to the resolver (with a fake IP address) before the DNS server got a chance.

The obvious solution to this is for each request to use a (pseudo) random port and id.

All this is old news, and most DNS resolvers should have been updated – however a few haven’t. To test your whether your DNS resolver is doing something sensible, run the DNS spoofing test provided by grc. Essentially, the test makes a few thousand consecutive calls to your DNS resolver and stores the id and port number of the request.

The test takes a few minutes and is completely painless. At the end you get a few graphs showing your port and id numbers.  My ID distribution graph seems to indicate that the resolver that I’m using is issuing random query transaction IDs.

December 2, 2010

Random variable generation (Pt 2 of 3)

Filed under: AMCMC, R — Tags: , , , , — csgillespie @ 5:44 pm

Acceptance-rejection methods

This post is based on chapter 1.4 of Advanced Markov Chain Monte Carlo.

Another method of generating random variates from distributions is to use acceptance-rejection methods. Basically to generate a random number from $f(x)$, we generate a RN from an envelope distribution $g(x)$, where $\sup f(x)/g(x) \le M \le \infty$.  The acceptance-rejection algorithm is as follows:

Repeat until we generate a value from step 2:

1. Generate $x$ from $g(x)$ and $U$ from $Unif(0, 1)$

2. If $U \le \frac{f(x)}{M g(x)}$, return $x$ (as a random deviate from $f(x)$).

Example: the standard normal distribution

This example illustrates how we generate $N(0, 1)$ RNs using the logistic distribution as an envelope distribution. First, note that

$\displaystyle f(x) = \frac{1}{2\pi} e^{-x^2/2} \quad \mbox{and} \quad g(x) = \frac{e^{-x/s}}{s(1+ e^{-x/s})^2}$

On setting $s=0.648$, we get $M = 1.081$. This method is fairly efficient and has an acceptance rate of

$\displaystyle r = \frac{1}{M}\frac{\int f(x) dx}{\int g(x) dx} = \frac{1}{M} = 0.925$

since both $f$ and $g$ are normalised densities.

R code

This example is straightforward to code:

myrnorm = function(M){
while(1){
u = runif(1); x = rlogis(1, scale = 0.648)
if(u < dnorm(x)/M/dlogis(x, scale = 0.648))
return(x)
}
}


To check the results, we could call myrnorm a few thousand times:

hist(replicate(10000, myrnorm(1.1)), freq=FALSE)
lines(seq(-3, 3, 0.01), dnorm(seq(-3, 3, 0.01)), col=2)


Example: the standard normal distribution with a squeeze

Suppose the density $f(x)$ is expensive to evaluate. In this scenario we can employ an easy to compute function $s(x)$, where $0 \le s(x) \le g(x)$ . $s(x)$ is called a squeeze function. In this example, we’ll use a simple rectangular function, where $s(x) = 0.22$ for $x=-1, \ldots, 1$. This is shown in the following figure:

The modified algorithm is as follows:

Repeat until we generate a value from step 2:

1. Generate $x$ from $g(x)$ and $U$ from $Unif(0, 1)$

2. If $U \le \frac{s(x)}{M g(x)}$ or $U \le \frac{f(x)}{M g(x)}$, return $x$ (as a random deviate from $f(x)$).

Hence, when $U \le \frac{s(x)}{M g(x)}$ we don’t have to compute $f(x)$. Obviously, in this example $f(x)$ isn’t that difficult to compute.