Why?

March 23, 2011

Graphical Display of R Package Dependencies

Filed under: R. Posted by csgillespie @ 2:28 pm

In some work that I am currently involved in, we have to decide which GUI engine we should use. As an obvious starter, we decided to have a look at what other people are using in their packages. While CRAN helpfully displays all the R packages that are available, it doesn’t (I don’t think) give a nice summary of the package dependencies. After clicking on a few dozen packages and examining their dependencies, I decided that a quick script was in order.

General idea

  1. Scrape the package names from the main CRAN package web-site.
  2. For each package, scrape the associated web-page and retrieve its dependencies.

For example, ADaCGH has a large number of packages under the “DEPENDS” section.

Pre-processing

To make life easier, I made a few simplifications to the data:

  • any dependencies on R, MASS, stats, methods and utils were removed when plotting;
  • I removed any Bioconductor and Omegahat packages;
  • version numbers in the DEPENDS section were ignored.
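
The simplifications above can be sketched on a plain DEPENDS string. This is only an illustration: the helper name `clean_depends` and the sample input are made up, and it works on plain text rather than on the scraped HTML, so it does not cover the Bioconductor and Omegahat filtering.

```r
## Packages that are dropped before plotting (taken from the list above)
ignored = c("R", "MASS", "stats", "methods", "utils")

## Hypothetical helper: clean one raw, comma-separated DEPENDS string
clean_depends = function(depends_field, ignore = ignored) {
  ## Split the field into individual entries
  deps = strsplit(depends_field, ",", fixed = TRUE)[[1]]
  ## Drop version constraints such as "(>= 2.10)"
  deps = gsub("\\([^)]*\\)", "", deps)
  ## Trim surrounding white space
  deps = trimws(deps)
  ## Remove empty entries and the ignored packages
  deps[nchar(deps) > 0 & !(deps %in% ignore)]
}

cleaned = clean_depends("R (>= 2.10), MASS, ggplot2 (>= 0.8.9), plyr")
print(cleaned)
```

Keeping the ignored packages in a single vector makes it easy to experiment, for example by adding "graphics" to the list.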

It should be stressed that I’m only picking up what is listed in the DEPENDS section, so indirect dependencies can be missed. For example, suppose a package depends on both “ggplot2” and “plyr”. Since “ggplot2” depends on “plyr”, the package author may only list “ggplot2”.
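
To make the difference between listed and indirect dependencies concrete, here is a small sketch on a toy edge list. The package name `mypkg`, the helper `all_dependencies` and the data are all made up for illustration; the analysis in this post only uses the direct edges.

```r
## Toy edge list (illustrative data, not scraped from CRAN)
edges = data.frame(from = c("mypkg", "ggplot2"),
                   to   = c("ggplot2", "plyr"),
                   stringsAsFactors = FALSE)

## Hypothetical helper: follow the DEPENDS edges until nothing new turns up
all_dependencies = function(pkg, edges) {
  found = character(0)
  queue = pkg
  while (length(queue) > 0) {
    ## Packages listed directly by anything in the current queue
    direct = edges$to[edges$from %in% queue]
    ## Only visit packages we have not seen before (this also stops cycles)
    queue = setdiff(direct, found)
    found = union(found, queue)
  }
  found
}

direct_only = edges$to[edges$from == "mypkg"]   # what the DEPENDS section lists
everything  = all_dependencies("mypkg", edges)  # listed plus indirect
```

Here `direct_only` contains just "ggplot2", while `everything` also picks up "plyr".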

Results

The top six packages based on the DEPENDS section are:

  • lattice – 165 times
  • survival – 107
  • mvtnorm – 103
  • tcltk – 76
  • graphics – 76
  • grid – 60

You could argue that I should remove “graphics” by the same arbitrary criterion I used when removing “MASS”. The total number of packages that are referred to in the DEPENDS section is just over 782 (out of a possible 3000 packages). The following graph plots the package name against the number of times it appears in the DEPENDS section of another package. There is a clear exponential decay, highlighting a few key packages.

In fact, the top 40 packages account for 50% of all dependencies, and that’s after the dependencies on R, utils, methods, etc. were removed.
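
The counts and the cumulative share can be computed from an edge list with a few lines of base R. The sketch below uses a made-up data frame `toy_df` with the same `from` and `to` columns as the edge list built by the script in this post; the numbers are not the real CRAN figures.

```r
## Toy edge list standing in for the real data (made-up values)
toy_df = data.frame(
  from = c("a", "b", "c", "d", "e", "f"),
  to   = c("lattice", "lattice", "lattice", "survival", "survival", "grid"),
  stringsAsFactors = FALSE)

## Count how often each package appears as a dependency
counts = sort(table(toy_df$to), decreasing = TRUE)

## Cumulative share of all dependency edges covered by the top k packages
cum_share = cumsum(as.vector(counts)) / sum(counts)
names(cum_share) = names(counts)

## Smallest k such that the top k packages cover at least half of the edges
k_half = which(cum_share >= 0.5)[1]
```

Plotting `as.vector(counts)` against its rank should give a picture similar in spirit to the graph described above.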

I also constructed a graphical network using Cytoscape. However, it’s quite large (~2 MB). You can download the network separately. To construct this network, I only used packages that had three or more dependencies. There were a dozen or so smaller graphs that I pruned.
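
A rough sketch of the filtering step, again on made-up data. It assumes that “three or more dependencies” means a package lists at least three entries in its own DEPENDS section; if the threshold was instead applied to how often a package is depended upon, the count would be taken over the `to` column. The file name is a temporary placeholder.

```r
## Toy edge list (illustrative data only)
toy_df = data.frame(
  from = c("pkgA", "pkgA", "pkgA", "pkgB", "pkgC", "pkgC"),
  to   = c("lattice", "grid", "survival", "lattice", "grid", "mvtnorm"),
  stringsAsFactors = FALSE)

## Number of listed dependencies per package
n_deps = table(toy_df$from)

## Keep edges whose source package lists three or more dependencies
keep = names(n_deps)[n_deps >= 3]
network_df = toy_df[toy_df$from %in% keep, ]

## Write a plain edge list that a network tool can import as a table
out_file = tempfile(fileext = ".csv")
write.csv(network_df, out_file, row.names = FALSE, quote = FALSE)
```

The pruning of the small disconnected graphs is not shown here.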

R Details

  • To scrape the web-pages I used regular expressions. Yes, I know you shouldn’t use regular expressions for parsing HTML, and should use a proper HTML parser, but
    • the web-pages were all well formed since they were generated from the package DESCRIPTION file
    • I needed practice with regular expressions
    • the R code is at the end of this post
  • You can download a CSV file of the edge list from here
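
To show the regular-expression idea without hitting the network, here is a minimal sketch on a made-up HTML snippet that imitates the layout of a package page at the time of writing. It is a variation on the approach in the full script: it extracts the link targets with `regmatches()` rather than rewriting the whole string with `gsub()`.

```r
## Made-up snippet imitating a package page table (not real CRAN output)
html = paste0(
  "<tr><td valign=top>Depends:</td><td>",
  "R (&ge; 2.10), ",
  "<a href=\"../ggplot2/index.html\">ggplot2</a>, ",
  "<a href=\"../plyr/index.html\">plyr</a> (&ge; 1.0)",
  "</td></tr><tr><td valign=top>License:</td><td>GPL</td></tr>")

## Keep only the contents of the Depends cell
cell = sub(".*<td valign=top>Depends:</td><td>", "", html)
cell = sub("</td></tr>.*", "", cell)

## Find each relative package link, then pull the package name out of it
link_pattern = "\\.\\./([0-9A-Za-z.]+)/index\\.html"
links = regmatches(cell, gregexpr(link_pattern, cell))[[1]]
pkgs = sub(link_pattern, "\\1", links)
```

Because only linked packages are matched, the bare "R" entry and the version constraints drop out without any extra clean-up.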

library("stringr")

####################
## Get dependencies
####################
getDependencies = function(pkg_name) {
  url_st = "http://cran.r-project.org/web/packages"
  url_end = "index.html"
  url = paste(url_st, pkg_name, url_end, sep="/")

  cran_web = paste(readLines(url), collapse="")

  if(regexpr("<td valign=top>Depends:</td><td>", cran_web) == -1)
    return()

  ## Get the table
  hrefs = gsub('(.*<td valign=top>Depends:</td><td>)',"", cran_web)

  ## Clean the td & tr tags
  hrefs = gsub('</td></tr>.*',"", hrefs)

  ## Remove R from dependencies
  hrefs = gsub('R .*?<',"<", hrefs)
  ## Remove versions
  hrefs = gsub("\\(&[ge; 0-9\\.\\-]*\\)", "", hrefs)

  ## Remove Bioconductor
  hrefs = gsub("<a href=\"http://www.bioconductor.org/packages/release/[a-z]*/html/([a-zA-Z0-9\\.]*)\"><span class=\"BioC\">([A-Za-z0-9\\.]*)</span></a>",
       "", hrefs)

  ## Remove Omegahat
  hrefs =
  gsub("<a href=\"http://www.omegahat.org/[A-Za-z0-9]*\"><span class=\"Ohat\">[A-Za-z0-9]*</span></a>",
  "", hrefs)

  ## Get dependencies
  depends_on = gsub("<a href=\"../([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a>",  "\\1", hrefs)

  ## Split on commas, trim white space and drop empty entries
  depends_on = strsplit(depends_on, ",")[[1]]
  depends_on = str_trim(depends_on)
  depends_on = depends_on[nchar(depends_on) > 0]
  return(depends_on)
}

############
## Main page
############
url = "http://cran.r-project.org/web/packages/"
cran_web_page = paste(readLines(url), collapse="")

main_table = gsub('.*<table summary="Available CRAN packages.">(.*)</table>.*', "\\1", cran_web_page)
main_table = gsub('<tr id="available-packages-[A-Z]"/>', "", main_table)

depends_on =
  gsub('<tr valign="top"><td><a href=\"../../web/packages/([0-9A-Za-z\\.]*)/index.html\">[0-9A-Za-z\\.]*</a></td><td>.*?</td></tr>',
       "\\1 ", main_table)

cran_packages = unlist(strsplit(depends_on, " "))
from = vector("character", 10000)
to = vector("character", 10000)

j = 1
for(i in seq_along(cran_packages)) {
  dependencies = getDependencies(cran_packages[i])
  cat(i, ":", dependencies, "\n")
  if(!is.null(dependencies) &&
     length(dependencies) > 0) {
    l = length(dependencies) - 1
    from[j:(j+l)] = cran_packages[i]
    to[j:(j+l)] = dependencies
    j = j + l + 1
  }
}

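## j points at the next free slot, so only the first j - 1 rows hold edges
dep_df = data.frame(from=from, to=to, stringsAsFactors=FALSE)
dep_df = dep_df[seq_len(j - 1),]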