Why?

March 26, 2011

“An R package” or “A R package”

Filed under: R — csgillespie @ 5:31 pm

I’m currently writing some lecture notes on R and I used the phrase “a R package” without thinking. Since the word following the article began with a consonant, I automatically went for “a” instead of “an”. The problem is that “R” is pronounced “ar”, which starts with a vowel sound, so “a R package” grates on the listener. The correct rule is to use “an” when the word following the article starts with a vowel sound.

A quick Google search suggests other people mess up too:

  • the correct phrase “an R package” – around 600,000 hits;
  • the incorrect phrase “a R package” – around 150,000 hits.

Who would have thought that people could be wrong on the Internet…

March 23, 2011

Graphical Display of R Package Dependencies

Filed under: R — csgillespie @ 2:28 pm

In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While CRAN helpfully displays all the R packages that are available, it doesn’t (I don’t think) give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.

General idea

  1. Scrape the package names from the main CRAN package web-site.
  2. For each package, scrape the associated web-page and retrieve its dependencies.

For example, ADaCGH has a large number of packages under the “DEPENDS” section.

Pre-processing

To make life easier, I made a few simplifications to the data:

  • any dependencies on R, MASS, stats, methods and utils were removed when plotting;
  • I removed any Bioconductor and Omegahat packages;
  • version numbers in the DEPENDS section were ignored.

It should be stressed that I’m only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both “ggplot2” and “plyr”. Since “ggplot2” depends on “plyr”, the package author may only list “ggplot2”, so the dependency on “plyr” would not be counted.
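The simplifications above can be sketched on a raw DEPENDS string. This is only an illustration: `parse_depends` and its `ignore` argument are hypothetical names, not part of the script at the end of this post, and it assumes the field is a comma-separated list with optional version constraints in brackets. Bioconductor and Omegahat packages can't be recognised from the name alone, so they are not handled here.

```r
# Hypothetical helper: turn a raw DEPENDS string into a vector of package names
parse_depends = function(depends,
                         ignore = c("R", "MASS", "stats", "methods", "utils")) {
  if (is.na(depends) || !nzchar(depends)) return(character(0))
  # Drop version constraints such as "(>= 2.10.0)"
  no_versions = gsub("\\([^)]*\\)", "", depends)
  # Split on commas and trim surrounding white space
  pkgs = trimws(strsplit(no_versions, ",")[[1]])
  pkgs = pkgs[nzchar(pkgs)]
  # Remove the packages we chose to ignore
  setdiff(pkgs, ignore)
}

parse_depends("R (>= 2.10.0), lattice, MASS, survival (>= 2.35-8)")
```

For the made-up string above, only "lattice" and "survival" survive.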

Results

The top six packages based on the DEPENDS section are:

  • lattice – 165 times
  • survival – 107
  • mvtnorm – 103
  • tcltk – 76
  • graphics – 76
  • grid – 60

You could argue that I should remove “graphics” by the same arbitrary criterion I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages). The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay highlighting a few key packages.

In fact, the top 40 packages account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods, etc. were removed.
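Counts and cumulative shares like these can be computed from an edge list with `table` and `cumsum`. The sketch below uses a tiny made-up `dep_df` (the package names `pkgA` to `pkgD` are placeholders, not real CRAN data), so the numbers it produces have nothing to do with the figures quoted above.

```r
# Toy edge list, made up for illustration (not the real CRAN data)
dep_df = data.frame(
  from = c("pkgA", "pkgB", "pkgC", "pkgC", "pkgD"),
  to   = c("lattice", "lattice", "lattice", "survival", "mvtnorm"),
  stringsAsFactors = FALSE
)

# Number of times each package appears in another package's DEPENDS
dep_counts = sort(table(dep_df$to), decreasing = TRUE)

# Cumulative share of all dependencies covered by the top k packages
cum_share = cumsum(as.vector(dep_counts)) / sum(dep_counts)

head(dep_counts)

# Smallest k such that the top k packages cover at least half of the dependencies
top_k = which(cum_share >= 0.5)[1]
top_k
```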

I also constructed a graphical network using Cytoscape. However, it’s quite large (~2MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.

R Details

  • To scrape the web-pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing HTML, and should use a proper HTML parser, but
    • the web-pages were all well formed since they were generated from the package DESCRIPTION file
    • I needed practice with regular expressions
    • the R code is at the end of this post
  • You can download a csv file of the edge list from here
library("stringr")

####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")

  cran_web = paste(readLines(url), collapse="")

  if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
    return()

  ## Get the table
  hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web)

  ## Clean the td & tr tags
  hrefs = gsub('</td></tr>.*',"", hrefs)

  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove versions
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*)", "", hrefs)

  ## Remove Bioconductor
  hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
       "", hrefs)

  ## Remove Omegahat
  hrefs =
  gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
  "", hrefs)

  ## Get dependencies
  depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>",  "\\1", hrefs)

  ##Unlist and remove white space
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = as.vector(sapply(depends_on, str_trim))
  depends_on = depends_on[sapply(depends_on, nchar)>0]
  return(depends_on)
}

####################
## Main page
####################
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")

main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)

depends_on =
  gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
       "\\1 ", main_table)

cran_packages = unlist(strsplit(depends_on, " "))
from = vector("character", 10000)
to = vector("character", 10000)

j = 1
for(i in seq_along(cran_packages)) {
  dependencies = getDependencies(cran_packages[i])
  cat(i, ":", dependencies, "\n")
  if(!is.null(dependencies) &&
     length(dependencies) > 0) {
    l = length(dependencies) - 1
    from[j:(j+l)] = cran_packages[i]
    to[j:(j+l)] = dependencies
    j = j + l + 1
  }
}

dep_df = data.frame(from=from, to=to)
## j points at the next free slot, so the last filled row is j - 1
dep_df = dep_df[seq_len(j - 1),]
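
For comparison, here is a sketch of a different route to the same kind of edge list that avoids scraping HTML. It is not the method used above: it assumes a package database shaped like the matrix returned by `available.packages()`, i.e. a character matrix with "Package" and "Depends" columns. The function name `edges_from_db` and the toy database are made up for illustration, and only the dependency on R itself is dropped.

```r
# Build a from/to edge list from a package database matrix.
# `db` is assumed to be a character matrix with "Package" and "Depends" columns.
edges_from_db = function(db) {
  # Strip version constraints, then split each Depends field on commas
  deps = strsplit(gsub("\\([^)]*\\)", "", db[, "Depends"]), ",")
  deps = lapply(deps, function(x) {
    x = trimws(x)
    # Drop missing and empty entries, and the dependency on R itself
    x[!is.na(x) & nzchar(x) & x != "R"]
  })
  data.frame(
    from = rep(unname(db[, "Package"]), lengths(deps)),
    to = as.character(unlist(deps, use.names = FALSE)),
    stringsAsFactors = FALSE
  )
}

# Toy database, made up for illustration
toy_db = cbind(
  Package = c("pkgA", "pkgB", "pkgC"),
  Depends = c("R (>= 2.10), lattice", NA, "pkgA, survival (>= 2.35)")
)
edges = edges_from_db(toy_db)
edges

# With network access, a real database could be fetched with:
# db = available.packages()
```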

November 6, 2010

Installing R packages

Filed under: R — csgillespie @ 9:00 am

Part of the reason R has become so popular is the vast array of packages available at the CRAN and Bioconductor repositories. In the last few years, the number of packages has grown exponentially!

This is a short post giving steps on how to actually install R packages. Let’s suppose you want to install the ggplot2 package. Well nothing could be easier. We just fire up an R shell and type:

> install.packages("ggplot2")

In theory the package should just install, however:

  • if you are using Linux and don’t have root access, this command may not work. If you use Ubuntu then this isn’t a problem.
  • if you have limited space in /home/ then you may want to install the package in a non-default directory.
  • you will be asked to select your local mirror, i.e. which server should you use to download the package.
  • if you have a proxy, then you may run into problems.

Installing packages in a different directory

First, you need to designate a directory where you will store the downloaded packages. On my machine, I use the directory /data/Rpackages/. After creating a package directory, to install and load a package we use the commands:

> install.packages("ggplot2", lib="/data/Rpackages/")
> library(ggplot2, lib.loc="/data/Rpackages/")

It’s a bit of a pain having to type /data/Rpackages/ all the time. To avoid this burden, we create a file .Renviron in our home area, and add the line R_LIBS=/data/Rpackages/ to it. This means that whenever you start R, the directory /data/Rpackages/ is added to the list of places to look for R packages and so:

> install.packages("ggplot2")
> library(ggplot2)

just works!
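
To check which directories R is actually searching, or to add one for the current session only, `.libPaths()` can be used. The sketch below is an illustration rather than a replacement for the .Renviron approach: it uses a scratch directory under `tempdir()` as a stand-in for /data/Rpackages/, so it can be run without touching a real library.

```r
# A scratch directory stands in for /data/Rpackages/ (assumption for this sketch)
lib_dir = file.path(tempdir(), "Rpackages")
dir.create(lib_dir, showWarnings = FALSE)

# Put the directory at the front of the library search path for this session
.libPaths(c(lib_dir, .libPaths()))

# The first entry is the default install location for install.packages()
.libPaths()[1]
```

Unlike the R_LIBS line in .Renviron, this change is lost when the R session ends.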

Setting the repository

Every time you install an R package, you are asked which repository R should use. To set the repository and avoid having to specify this at every package install, simply:

  • create a file .Rprofile in your home area.
  • Add the following piece of code to it:


cat(".Rprofile: Setting UK repository\n")
r = getOption("repos") # hard code the UK repo for CRAN
r["CRAN"] = "http://cran.uk.r-project.org"
options(repos = r)
rm(r)

Or as Hadley pointed out, this could be condensed to:

cat(".Rprofile: Setting UK repository\n")
options(repos = c(CRAN = "http://cran.uk.r-project.org/"))

but you would delete any other repositories you have set – say the R-Forge repository (thanks to Kevin for pointing this out).

Obviously, you don’t need the cat statement, but I find it helpful to remind me that I’ve set this default, especially if you have a laptop and you do a lot of travelling.

I found this tip in a Stack Overflow answer.

Setting your http proxy

This section is an update and follows a comment from Sean Carmody below.

If you have an HTTP proxy, then one method of setting the proxy in R is to use the Sys.setenv command, viz.

Sys.setenv(http_proxy="http://myproxy.example.com:8080")
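
A quick way to confirm the setting took effect, and to undo it, is to read the variable back with `Sys.getenv` and clear it with `Sys.unsetenv`. The proxy address is the same placeholder as above, not a real server.

```r
# Placeholder proxy address, as in the example above
Sys.setenv(http_proxy = "http://myproxy.example.com:8080")

# Read the value back to confirm it is set for this session
Sys.getenv("http_proxy")

# Remove the setting again, e.g. when moving to a network without a proxy
Sys.unsetenv("http_proxy")
```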

