Why?

March 26, 2011

“An R package” or “A R package”

Filed under: R. Posted by csgillespie @ 5:31 pm

I’m currently writing some lecture notes on R and I used the phrase “a R package” without thinking. Since the word following the article began with a consonant letter, I automatically went for “a” instead of “an”. The problem is that “R” is pronounced “ar”, which starts with a vowel sound, so “a R package” grates on the listener. The correct rule is to use “an” when the word following the article starts with a vowel sound.

A quick Google search suggests other people mess up too:

  • the correct phrase “an R package” – around 600,000 hits;
  • the incorrect phrase “a R package” – around 150,000 hits.

Who would have thought that people could be wrong on the Internet…

March 23, 2011

Graphical Display of R Package Dependencies

Filed under: R. Posted by csgillespie @ 2:28 pm

In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While CRAN helpfully displays all the R packages that are available, it doesn’t (I think) give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.

General idea

  1. Scrape the package names from the main CRAN package web-site.
  2. For each package, scrape the associated web-page and retrieve its dependencies.

For example, ADaCGH has a large number of packages under the “DEPENDS” section.

Pre-processing

To make life easier, I made a few simplifications to the data:

  • any dependencies on R, MASS, stats, methods and utils were removed when plotting;
  • I removed any Bioconductor and Omegahat packages;
  • version numbers in the DEPENDS section were ignored.
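
As a rough illustration of the simplifications above, here is a minimal sketch that cleans a single Depends string. The function name `clean_depends` and the example string are made up for illustration, and it works on the text of a Depends field rather than on the CRAN HTML used in this post. It does not attempt the Bioconductor/Omegahat filtering, since that information is not in the field itself.

```r
# Hypothetical helper: tidy one "Depends" field from a DESCRIPTION file.
clean_depends <- function(depends,
                          ignore = c("R", "MASS", "stats", "methods", "utils")) {
  if (is.na(depends) || !nzchar(depends)) return(character(0))
  pkgs <- strsplit(depends, ",", fixed = TRUE)[[1]]
  pkgs <- gsub("\\([^)]*\\)", "", pkgs)  # drop version constraints such as "(>= 2.10)"
  pkgs <- trimws(pkgs)                   # remove surrounding white space
  pkgs <- pkgs[nzchar(pkgs)]             # drop empty entries
  setdiff(pkgs, ignore)                  # remove the packages we chose to ignore
}

clean_depends("R (>= 2.10), ggplot2, plyr (>= 1.0), MASS")
```

For this input only “ggplot2” and “plyr” survive the clean-up.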

It should be stressed that I’m only picking up what is listed in the DEPENDS section. For example, suppose a package depends on both “ggplot2” and “plyr”. Since “ggplot2” depends on “plyr”, the package author may only list “ggplot2”, in which case the dependency on “plyr” is not counted.
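
To make the indirect-dependency point concrete, here is a small sketch on toy data. The package `mypkg` and the `direct` list are hypothetical, and this is not part of the analysis in this post, which only counts what is listed directly.

```r
# Toy DEPENDS data (hypothetical): each name maps to the packages it lists directly.
direct <- list(
  mypkg   = c("ggplot2"),
  ggplot2 = c("plyr"),
  plyr    = character(0)
)

# Follow the DEPENDS links until no new packages turn up.
all_depends <- function(pkg, direct) {
  seen <- character(0)
  todo <- direct[[pkg]]
  while (length(todo) > 0) {
    current <- todo[1]
    todo <- todo[-1]
    if (!(current %in% seen)) {
      seen <- c(seen, current)
      todo <- c(todo, direct[[current]])
    }
  }
  seen
}

direct[["mypkg"]]             # what the author listed
all_depends("mypkg", direct)  # what is actually needed
```

Here `mypkg` lists only “ggplot2”, but following the links also picks up “plyr”.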

Results

The top six packages based on the DEPENDS section are:

  • lattice – 165 times
  • survival – 107
  • mvtnorm – 103
  • tcltk – 76
  • graphics – 76
  • grid – 60

You could argue that I should remove “graphics” by the same arbitrary criterion I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages). The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay highlighting a few key packages.

In fact, the top 40 packages account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods, etc. were removed.
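
Counts and cumulative shares of this kind can be computed from an edge list with `table()` and `cumsum()`. The sketch below uses a tiny made-up edge list (the package names in `from` are placeholders), not the real CRAN data, so the numbers are purely illustrative.

```r
# Toy edge list (hypothetical data): "from" depends on "to".
dep_df <- data.frame(
  from = c("a", "b", "c", "d", "e", "f"),
  to   = c("lattice", "lattice", "lattice", "survival", "survival", "grid"),
  stringsAsFactors = FALSE
)

# How often each package appears in another package's DEPENDS section.
counts <- sort(table(dep_df$to), decreasing = TRUE)

# Cumulative share of all dependencies covered by the top-k packages.
cum_share <- cumsum(as.vector(counts)) / sum(counts)
names(cum_share) <- names(counts)

head(counts)
cum_share
```

With the real data, the value of `cum_share` at position 40 would be the share covered by the top 40 packages.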

I also constructed a graphical network using Cytoscape. However, it’s quite large (~2MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.

R Details

  • To scrape the web-pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing HTML, and should use a proper HTML parser, but
    • the web-pages were all well formed since they were generated from the package DESCRIPTION file;
    • I needed practice with regular expressions.
  • The R code is at the end of this post.
  • You can download a csv file of the edge list from here.
require("stringr")

####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")

  cran_web = paste(readLines(url), collapse="")

  if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
    return()

  ## Get the table
  hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web)

  ## Clean the td & tr tags
  hrefs = gsub('</td></tr>.*',"", hrefs)

  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove versions
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*\\)", "", hrefs)

  ## Remove Bioconductor
  hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
       "", hrefs)

  ## Remove Omegahat
  hrefs =
  gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
  "", hrefs)

  ## Get dependencies
  depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>",  "\\1", hrefs)

  ##Unlist and remove white space
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = as.vector(sapply(depends_on, str_trim))
  depends_on = depends_on[sapply(depends_on, nchar)>0]
  return(depends_on)
}

###########
#Main Page
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")

main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)

depends_on =
  gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
       "\\1 ", main_table)

cran_packages = unlist(strsplit(depends_on, " "))
from = vector("character", 10000)
to = vector("character", 10000)

j = 1
for(i in seq_along(cran_packages)) {
  dependencies = getDependencies(cran_packages[i])
  cat(i, ":", dependencies, "\n")
  if(!is.null(dependencies) &&
     length(dependencies) > 0) {
    l = length(dependencies) - 1
    from[j:(j+l)] = cran_packages[i]
    to[j:(j+l)] = dependencies
    j = j + l + 1
  }
}

dep_df = data.frame(from=from, to=to)
## j points at the next free slot, so the last filled row is j - 1
dep_df = dep_df[seq_len(j - 1),]
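
As an aside, an edge list like `dep_df` can also be built without scraping HTML. The matrix returned by `utils::available.packages()` has “Package” and “Depends” columns. The sketch below is not the code used for this post. It uses a small made-up matrix `db` in place of the real one so that it runs offline, and the helper `row_edges` is hypothetical.

```r
# Toy stand-in for the character matrix returned by utils::available.packages();
# only the "Package" and "Depends" columns are used here.
db <- cbind(
  Package = c("pkgA", "pkgB", "pkgC"),
  Depends = c("R (>= 2.10), lattice, survival", NA, "lattice (>= 0.17)")
)

# Turn one row into (from, to) pairs, ignoring versions and the dependency on R.
row_edges <- function(pkg, depends) {
  if (is.na(depends)) return(NULL)
  to <- trimws(gsub("\\([^)]*\\)", "", strsplit(depends, ",", fixed = TRUE)[[1]]))
  to <- to[nzchar(to) & to != "R"]
  if (length(to) == 0) return(NULL)
  data.frame(from = pkg, to = to, stringsAsFactors = FALSE)
}

# Packages with no usable dependencies return NULL and are dropped by rbind.
edges <- do.call(rbind, Map(row_edges, db[, "Package"], db[, "Depends"]))
rownames(edges) <- NULL
edges
```

Building the data frame row by row in this way avoids having to guess the size of the `from` and `to` vectors in advance.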

March 9, 2011

Workshop: Statistical Bioinformatics and Stochastic Systems Biology

Filed under: Uncategorized. Posted by csgillespie @ 11:53 am

The Third Biennial Newcastle Workshop on Statistical Bioinformatics and Stochastic Systems Biology will be held from 14.00 on Monday 13th to 16.30 on Tuesday 14th June 2011 at Newcastle University, UK.

Speakers will include the following:

  • Prof. David Balding, University College, London, UK.
  • Prof. Arnoldo Frigessi, University of Oslo, Norway.
  • Dr. Dirk Husmeier, Biomathematics and Statistics Scotland, UK.
  • Dr. Michal Komorowski, Imperial College, London, UK.
  • Dr. Pawel Paszek, University of Liverpool, UK.
  • Prof. Magnus Rattray, University of Sheffield, UK.
  • Dr. Guido Sanguinetti, University of Edinburgh, UK.
  • Dr. Christopher Yau, University of Oxford, UK.

Registration

There is no registration fee but please do register, as the number of places may be limited. Register by sending your name and affiliation by email to mathstats-office@ncl.ac.uk. If possible, please register by 23rd May (to help with catering arrangements).

See the conference website for further details.

 
