Over the last six months or so, Stack Overflow, the behemoth of Q&A sites, has changed tack and launched a number of sites that are not about programming languages. To join the Stack Overflow family, a proposed site first has to spend time gathering followers in Area51. Once a proposal gains critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the Stack Exchange API to download some summary statistics for each site:

    library(rjson)

    # List of current SO beta sites
    sites = c("stats", "math", "programmers", "webapps", "cooking",
              "gamedev", "webmasters", "electronics", "tex", "unix",
              "photo", "english", "cstheory", "ui", "apple",
              "wordpress", "rpg", "gis", "diy", "bicycles", "android")
    sites = sort(sites)

    # Create empty vectors to store the downloaded data
    qs = numeric(length(sites))
    votes = numeric(length(sites))
    users = numeric(length(sites))
    views = numeric(length(sites))

    # Go through each site and download summary statistics
    for(i in seq_along(sites)){
      stack_url = paste("http://api.", sites[i],
                        ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                        sep="")
      # The response is gzip-compressed, so wrap the connection in gzcon()
      z = gzcon(url(stack_url))
      y = readLines(z)
      sum_stats = fromJSON(paste(y, collapse=""))
      qs[i] = sum_stats$statistics[[1]]$total_questions
      votes[i] = sum_stats$statistics[[1]]$total_votes
      users[i] = sum_stats$statistics[[1]]$total_users
      views[i] = sum_stats$statistics[[1]]$views_per_day
      close(z)
      cat(sites[i], "\n")
    }

For each of the twenty-one sites, we now have information on the:

- number of questions;
- number of votes;
- number of users;
- number of views per day.
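
The extraction step in the loop can also be written as a small helper that takes one parsed response and returns a one-row data frame. This is only a minimal sketch: `extract_stats` and the hand-built `fake_response` are hypothetical names, and the list shape simply mirrors the fields the loop above reads, so it can be tried without any network access.

```r
# Hypothetical helper: pull the four summary fields out of one parsed
# API response (the nested list that fromJSON() returns in the loop above).
extract_stats <- function(site, sum_stats) {
  s <- sum_stats$statistics[[1]]
  data.frame(
    site  = site,
    qs    = s$total_questions,
    votes = s$total_votes,
    users = s$total_users,
    views = s$views_per_day,
    stringsAsFactors = FALSE
  )
}

# Made-up response with the same shape, for illustration only
fake_response <- list(statistics = list(list(
  total_questions = 100, total_votes = 900,
  total_users = 50, views_per_day = 12.5
)))

one_row <- extract_stats("example", fake_response)
print(one_row)
```

Rows for several sites could then be stacked with `do.call(rbind, ...)`, which keeps the site name next to its numbers rather than relying on four parallel vectors staying in step.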

An easy “starter for ten” in terms of analysis is a quick principal components analysis (PCA):

    # Put all the data into a data.frame
    df = data.frame(votes, users, views, qs)

    # Calculate the PCs
    PC.cor = prcomp(df, scale=TRUE)
    scores.cor = predict(PC.cor)

    # Empty plot (pch=NA), then label each point with its site name
    plot(scores.cor[,1], scores.cor[,2],
         xlab="PC 1", ylab="PC 2", pch=NA,
         main="PCA analysis of Beta SO sites")
    text(scores.cor[,1], scores.cor[,2], labels=sites)
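
For anyone wondering what scaling does inside `prcomp`: putting every column on unit variance makes the analysis equivalent to an eigen-decomposition of the correlation matrix. The sketch below checks this on simulated data; the numbers are random stand-ins, not the downloaded statistics, and the names `sim`, `pc` and `eig` are just for illustration.

```r
set.seed(1)

# Simulated stand-in for the 21 x 4 table of site statistics
n <- 21
base <- rexp(n)
sim <- data.frame(votes = base * 900 + rnorm(n, sd = 50),
                  users = base * 300 + rnorm(n, sd = 40),
                  views = base * 100 + rnorm(n, sd = 30),
                  qs    = base * 200 + rnorm(n, sd = 60))

pc  <- prcomp(sim, scale. = TRUE)
eig <- eigen(cor(sim))

# Variances of the PCs equal the eigenvalues of the correlation matrix
print(all.equal(pc$sdev^2, eig$values))

# Loadings match the eigenvectors up to the sign of each column
print(all.equal(abs(unname(pc$rotation)), abs(unname(eig$vectors))))

# With no new data, predict() returns the scores stored in pc$x
print(all.equal(predict(pc), pc$x))
```

Scaling matters here because the four statistics live on very different ranges; without it, whichever column has the largest raw variance would dominate the first component.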

This gives the following plot:

Main features:

- Most sites are similar, with the big exception of programmers and possibly webapps.
- Programmers is different due to its large number of votes: it has twice as many votes as the next-highest site.
- webapps (and math) are different due to their large number of questions.

## Some more details

In case anyone is interested, the weightings you get from the PCA are:

    #PC1 is a simple average
    > round(PC.cor$rotation, 2)
           PC1   PC2   PC3   PC4
    votes 0.54 -0.31 -0.17 -0.77
    users 0.51  0.18  0.83  0.10
    views 0.50 -0.53 -0.27  0.63
    qs    0.44  0.77 -0.45  0.10
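
To see why near-equal PC1 weights amount to an average, note that each score is the standardised data multiplied by the weights. A minimal sketch on simulated data (random stand-ins built around a made-up shared "size" factor, not the site statistics) compares the PC1 score with the plain row mean of the standardised columns:

```r
set.seed(2)

n <- 21
size <- rexp(n)                     # shared "site size" factor (made up)
sim <- data.frame(votes = size + rnorm(n, sd = 0.2),
                  users = size + rnorm(n, sd = 0.2),
                  views = size + rnorm(n, sd = 0.2),
                  qs    = size + rnorm(n, sd = 0.2))

pc <- prcomp(sim, scale. = TRUE)
z  <- scale(sim)                    # centre and scale each column

# Scores are the standardised data times the weight (rotation) matrix
scores_by_hand <- z %*% pc$rotation
print(all.equal(unname(scores_by_hand), unname(pc$x)))

# When the PC1 weights are roughly equal, PC1 tracks the simple average
simple_avg <- rowMeans(z)
print(abs(cor(pc$x[, 1], simple_avg)))
```

The absolute value is taken because the sign of a principal component is arbitrary, so PC1 may come out as the negative of the average.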
