In the last six months or so, Stack Overflow, the behemoth of Q&A sites, decided to change tack and launch a number of sites that are not about programming languages. To join the Stack Overflow family, a proposed site first has to spend time gathering followers in Area 51. Once a proposal has gained a critical mass, a new StackExchange (SE) site is born.

At present there are around twenty-one SE beta sites. Being rather bored this weekend, I decided to see how similar or different these sites are. For a first pass, I did something rather basic, but useful nonetheless.

First, we need to use the StackExchange API to download some summary statistics for each site:

library(rjson)
#List of current SO beta sites
sites = c("stats", "math","programmers", "webapps", "cooking",
"gamedev", "webmasters", "electronics", "tex", "unix",
"photo", "english", "cstheory", "ui", "apple", "wordpress",
"rpg", "gis", "diy", "bicycles", "android")
sites = sort(sites)
#Create empty vectors to store the downloaded data
qs = numeric(length(sites)); votes = numeric(length(sites));
users = numeric(length(sites)); views = numeric(length(sites));
#Go through each site and download summary statistics
for(i in seq_along(sites)){
  #Build the stats URL for this site
  stack_url = paste("http://api.",
                    sites[i],
                    ".stackexchange.com/1.0/stats?key=wF07PVY0Mk2hva6r9UZDyA",
                    sep="")
  #The response is gzip-compressed, so wrap the connection in gzcon
  z = gzcon(url(stack_url))
  y = readLines(z)
  sum_stats = fromJSON(paste(y, collapse=""))
  qs[i] = sum_stats$statistics[[1]]$total_questions
  votes[i] = sum_stats$statistics[[1]]$total_votes
  users[i] = sum_stats$statistics[[1]]$total_users
  views[i] = sum_stats$statistics[[1]]$views_per_day
  close(z)
  #Print the site name as a progress indicator
  cat(sites[i],"\n")
}
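
The loop above repeats the same `sum_stats$statistics[[1]]` lookup four times. As a minimal sketch, the fields could be pulled out by one small helper, assuming the parsed response has the shape the loop relies on. The name `extract_stats` and the numbers in `fake` are hypothetical, made up purely for illustration:

```r
# Hypothetical helper: pull the four fields used above out of a parsed
# stats response (the nested list that fromJSON returns).
extract_stats = function(parsed) {
  s = parsed$statistics[[1]]
  c(qs = s$total_questions, votes = s$total_votes,
    users = s$total_users, views = s$views_per_day)
}

# Made-up values that mimic the structure assumed by the loop; not real data.
fake = list(statistics = list(list(total_questions = 100, total_votes = 500,
                                   total_users = 50, views_per_day = 75.5)))
res = extract_stats(fake)
print(res)
```

Returning a named vector means the per-site results could be stacked into a matrix with something like `t(sapply(parsed_list, extract_stats))`, rather than filling four separate vectors by index.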

For each of the twenty-one sites, we now have information on the:

- number of questions;
- number of votes;
- number of users;
- number of views per day.

An easy “starter for ten” in terms of analysis is to run a quick principal components analysis (PCA):

#Put all the data into a data.frame
df = data.frame(votes, users, views, qs)
#Calculate the PCs
PC.cor = prcomp(df, scale.=TRUE)
scores.cor = predict(PC.cor)
plot(scores.cor[,1], scores.cor[,2],
     xlab="PC 1", ylab="PC 2", pch=NA,
     main="PCA analysis of Beta SO sites")
text(scores.cor[,1], scores.cor[,2], labels=sites)

This gives the following plot:

Main features:

- Most sites are similar, with the big exception of programmers and possibly webapps.
- programmers is different due to its large number of votes: it has twice as many votes as the next highest site.
- webapps (and math) are different due to their large number of questions.
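
When reading a plot of the first two components, it helps to know how much of the total variance they capture. The following is a sketch on made-up toy data (the `toy` data frame is an assumption standing in for the downloaded statistics, so the numbers it prints say nothing about the SE sites):

```r
set.seed(1)
# Toy stand-in for the site data: four positively correlated columns (made up)
base = rnorm(21)
toy = data.frame(votes = base + rnorm(21, sd = 0.3),
                 users = base + rnorm(21, sd = 0.3),
                 views = base + rnorm(21, sd = 0.3),
                 qs    = base + rnorm(21, sd = 0.6))
pc = prcomp(toy, scale. = TRUE)

# Proportion of variance explained by each component
prop_var = pc$sdev^2 / sum(pc$sdev^2)
print(round(prop_var, 2))

# With scaling, the component variances are the eigenvalues
# of the correlation matrix
ev = eigen(cor(toy))$values
print(all.equal(pc$sdev^2, ev))
```

Applying the same `sdev^2 / sum(sdev^2)` calculation to the fitted object from the real data would show how faithful the two-dimensional picture is.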

## Some more details

In case anyone is interested, the weightings you get from the PCA are:

    #PC1 is a simple average
    > round(PC.cor$rotation, 2)
           PC1   PC2   PC3   PC4
    votes 0.54 -0.31 -0.17 -0.77
    users 0.51  0.18  0.83  0.10
    views 0.50 -0.53 -0.27  0.63
    qs    0.44  0.77 -0.45  0.10
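
The weightings are what turn the raw statistics into the plotted scores: each score is a weighted sum of a site's centred and scaled columns. A small sketch of that relationship, using made-up toy data rather than the downloaded statistics (`toy` and `manual_scores` are names introduced here for illustration):

```r
set.seed(2)
# Made-up data with the same column names as the site data frame
toy = data.frame(votes = rnorm(21), users = rnorm(21),
                 views = rnorm(21), qs = rnorm(21))
pc = prcomp(toy, scale. = TRUE)

# Centre and scale each column, then multiply by the weightings
manual_scores = scale(toy) %*% pc$rotation

# The result should match the scores that predict() returns
print(all.equal(unname(manual_scores), unname(predict(pc))))
```

Read this way, a first column of roughly equal positive weights means PC1 behaves like an average of the four standardised statistics.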