November 1, 2010

An analysis of the Stackoverflow Beta sites

Filed under: R — Tags: , — csgillespie @ 5:26 pm

In the last six months or so, the behemoth of Q & A sites stackoverflow, decided to change tack and launch a number of other non-computing-language sites. To launch a site in the stackoverflow family, sites have to spend time gathering followers in Area51. Once a site has gained a critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how these sites are similar/different. For a first pass, I did something rather basic, but useful none the less.

First, we need to use the stackoverflow api to download some summary statistics for each site:

#List of current SO beta sites
sites = c("stats", "math","programmers", "webapps", "cooking",
             "gamedev", "webmasters", "electronics", "tex", "unix",
             "photo", "english", "cstheory",  "ui", "apple", "wordpress",
             "rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)

#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));

#Go through each site and download summary statistics
for(i in 1:length(sites)){
  z =  gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day

For each of the twenty-one sites, we now have information on the:

  • number of questions;
  • number of votes;
  • number of users;
  • number of views.

An easy “starter for ten” in terms of analysis, is to do some quick principle components:

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)

#Calculate the PCs
PC.cor = prcomp(df, scale=TRUE)
scores.cor = predict(PC.cor)

plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1",ylab="PC 2", pch=NA,
     main="PCA analysis of  Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)

This gives the following plot:
Main features:

  • Most sites are similar with the big except of programming and possibly webapps.
  • Programming is different due to the large number of votes. They have twice as many votes as next highest site.
  • webapps (and math) are different due to the large number of questions.

Some more details

In case anyone is interested, the weightings  you get from the PCA are:

#PC1 is a simple average
> round(PC.cor$rotation, 2)
PC1     PC2     PC3     PC4
votes   0.54   -0.31   -0.17   -0.77
users   0.51    0.18    0.83    0.10
views   0.50   -0.53   -0.27    0.63
qs      0.44    0.77   -0.45    0.10

1 Comment »

  1. Great post! I was wondering how to interface R with the Stack Overflow API. Thanks.

    Comment by Jeromy Anglim — November 3, 2010 @ 2:06 am

RSS feed for comments on this post. TrackBack URI

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

Create a free website or blog at WordPress.com.