Welcome to Social Network Analysis
02 Nov 2015
This is the first in a series of lectures and tutorials that I've prepared for COMM-645 at the USC Annenberg School of Communication.
These talks are meant to be introductory, self-standing lectures that tackle the basics of social network analysis. I hope you enjoy them.
Welcome to COMM645! Today we are going to be demonstrating some of the capabilities of R and social network analysis. Don't worry about running this code at this time; just follow along and watch how the script, the console window and the environment interact.
Base R runs from the command line with text commands. For this class we will be using RStudio, a program that sits on top of R and makes it easier to use. Let's take a quick tour of how RStudio works so you can understand this demo.
My window has four panes. The source pane is where you write code. It is basically a text editor like Notepad. Code you write in the source pane does not run automatically, so you can tweak it or change it at your own pace. Once you are ready to run your code, you can send a line to R by placing your cursor on it and pressing the Run button, or CTRL/Command-Enter.
Running code sends it from the source window to the console. The console is where your code is actually run, and any results will be displayed there. You can also type code straight into the console and run it with the Enter key. You can only type one command at a time in the console, so it is generally best to write most of your commands in the source window and use the console directly only when you are mucking around or doing calculations you don't need to reproduce.
Next up is the environment/history window. The environment shows you all of the variables or objects that you have created in R. Pretty much anything you want can be stored as an object, from a single digit or letter to a massive network with millions of people in it. Each object is assigned a name which can then be referenced in your code at a later point. As an example let's assign the number 2 to the name "two" and watch what happens.
two<-2
1+two
## [1] 3
rm(two)
As you can see, two appeared in the environment. Typing two (without quotation marks) into any piece of R code will substitute the number 2 in its place. The rm command is short for remove; it deletes the object from the environment, so rerunning 1+two now produces an error.
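To make the create, use, remove cycle a bit more concrete, here is a minimal sketch in base R. The names `result` and `msg` are introduced only for illustration; `exists()` and `tryCatch()` are base R functions that let us inspect the environment and capture the error without stopping the script.

```r
# Assign the number 2 to the name "two"
two <- 2
exists("two")        # TRUE: the object is now in the environment

# Use the object in a calculation and keep the answer
result <- 1 + two
print(result)

# Remove the object again
rm(two)
exists("two")        # FALSE: the name no longer points at anything

# Referencing a removed object raises an error;
# tryCatch() captures the error message instead of halting
msg <- tryCatch(1 + two, error = function(e) conditionMessage(e))
print(msg)
```

Watching the environment pane while stepping through these lines shows the objects appearing and disappearing in the same order.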
Beside the environment there is a history tab, which will show you all of the commands that you have typed in your session.
Finally there is the utility window, which should have several tabs on top such as files, plots, packages and help. In order: files is a browser that lets you browse data on your computer and set the "working directory", the directory where R reads data and other files from.
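As a rough sketch of what the working directory means in practice, the base R calls below report the current directory and check whether a file is visible from it. The file name is the one used later in this demo; the names `wd` and `has_file` and the path in the commented setwd() line are placeholders of mine, not part of the course materials.

```r
# Where is R currently looking for files?
wd <- getwd()
print(wd)

# List the first few files R can see in that directory
head(list.files(wd))

# Check whether the network file used later in this demo is there
has_file <- file.exists(file.path(wd, "lemis.graphml"))
print(has_file)

# To point R at a different folder, pass its path to setwd().
# The path below is a placeholder, not a real location:
# setwd("path/to/your/course/folder")
```

If `has_file` comes back FALSE, reading the file by its bare name will fail until the working directory is changed or the file is moved.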
Plots is a generic area that will display graphs or networks.
Help is a handy window for looking up R commands. You can search it directly, or run code with ?? in front of a term to search the help pages for it.
??lm
Finally there is the package tab which takes a bit more explaining.
Packages
R is a statistical programming language that is built from the ground up for managing, plotting and examining data. Base R has a lot of basic functionality such as handling data-sets, basic calculations and popular statistics such as regression or chi-squared tests.
data(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
demoLM<-lm(Sepal.Length~Petal.Length+Petal.Width, data=iris)
summary(demoLM)
##
## Call:
## lm(formula = Sepal.Length ~ Petal.Length + Petal.Width, data = iris)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18534 -0.29838 -0.02763 0.28925 1.02320
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.19058 0.09705 43.181 < 2e-16 ***
## Petal.Length 0.54178 0.06928 7.820 9.41e-13 ***
## Petal.Width -0.31955 0.16045 -1.992 0.0483 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4031 on 147 degrees of freedom
## Multiple R-squared: 0.7663, Adjusted R-squared: 0.7631
## F-statistic: 241 on 2 and 147 DF, p-value: < 2.2e-16
In this class we are especially interested in social network analysis, which isn't supported out of the box. Therefore we have to extend R with packages, chunks of code that extend or add abilities to R just like an app will extend your phone.
There are thousands of R packages (a full list can be seen here), and to install one all you do is call install.packages like so.
install.packages('igraph')
install.packages('statnet')
After installing a package, you load it and make it ready to use with the library command.
library(igraph)
##
## Attaching package: 'igraph'
##
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
##
## The following object is masked from 'package:base':
##
## union
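Installing only has to happen once per computer, while loading happens in every session. One way to keep a script safe to rerun is to check what is already installed first. The sketch below uses base R's `requireNamespace()`; the variable names `wanted`, `installed` and `to_install` are hypothetical, and the actual install call is left commented out so nothing is downloaded by accident.

```r
# Packages this class relies on
wanted <- c("igraph", "statnet")

# TRUE for each package that is already installed on this machine
installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
print(installed)

# Keep only the packages that still need installing
to_install <- wanted[!installed]

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("Everything is installed; load with library() as usual.")
}
```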
Now everything is ready to demonstrate R and social network analysis. Once again, don't worry about following along; just sit back and watch how the various moving parts interact.
SNA Demo
The first thing we are going to do is load the igraph package. This package, alongside the sna package, is going to be the main part of R that we will use in this class. We've already called the igraph library in the previous section, so let's move along and read a network into R.
In this case we will be looking at a graph drawn from the musical Les Misérables. Each node represents a character in the play, and an edge signifies that two characters are on stage at the same time. We'll be reading the graph from the GraphML format, which is a specialized file format for holding network data.
lemis<-read.graph('lemis.graphml', format='graphml')
lemis
## IGRAPH D-W- 77 254 --
## + attr: label (v/c), r (v/n), g (v/n), b (v/n), x (v/n), y (v/n),
## | size (v/n), id (v/c), Edge Label (e/c), weight (e/n), Edge Id
## | (e/c)
## + edges:
## [1] 2-> 1 3-> 1 4-> 1 4-> 3 5-> 1 6-> 1 7-> 1 8-> 1 9-> 1 10-> 1
## [11] 12-> 1 12-> 3 12-> 4 12->11 13->12 14->12 15->12 16->12 18->17 19->17
## [21] 19->18 20->17 20->18 20->19 21->17 21->18 21->19 21->20 22->17 22->18
## [31] 22->19 22->20 22->21 23->17 23->18 23->19 23->20 23->21 23->22 24->12
## [41] 24->13 24->17 24->18 24->19 24->20 24->21 24->22 24->23 25->12 25->24
## [51] 26->12 26->24 26->25 27->12 27->17 27->25 27->26 28->12 28->24 28->25
## + ... omitted several edges
So we can see that "lemis" is a network with 77 nodes and 254 edges. If we want to determine the degree of each node (that is, the number of edges attached to it), we simply pass one command.
deg<-degree(lemis)
deg
## [1] 10 1 3 3 1 1 1 1 1 1 1 36 2 1 1 1 9 7 7 7 7 7 7
## [24] 15 11 16 11 17 4 8 2 4 1 2 6 6 6 6 6 3 1 11 3 3 2 1
## [47] 1 2 22 7 2 7 2 1 4 19 2 11 15 11 9 11 13 12 13 12 10 1 10
## [70] 10 10 9 3 2 2 7 7
This gives us a list of numbers, one for each character in order, showing how many other characters they share a scene with. To make it more readable we can attach the names to the degree list as well.
names(deg)<-V(lemis)$label
sort(deg, decreasing=TRUE)
## Valjean Gavroche Marius Javert
## 36 22 19 17
## Thenardier Fantine Enjolras Courfeyrac
## 16 15 15 13
## Bossuet Bahorel Joly MmeThenardier
## 13 12 12 11
## Cosette Eponine Mabeuf Combeferre
## 11 11 11 11
## Feuilly Myriel Grantaire Gueulemer
## 11 10 10 10
## Babet Claquesous Tholomyes Prouvaire
## 10 10 9 9
## Montparnasse Bamatabois Listolier Fameuil
## 9 8 7 7
## Blacheville Favourite Dahlia Zephine
## 7 7 7 7
## Gillenormand MlleGillenormand Brujon MmeHucheloup
## 7 7 7 7
## Judge Champmathieu Brevet Chenildieu
## 6 6 6 6
## Cochepaille Fauchelevent Simplice LtGillenormand
## 6 4 4 4
## MlleBaptistine MmeMagloire Pontmercy Anzelma
## 3 3 3 3
## Woman2 Toussaint Marguerite Perpetue
## 3 3 2 2
## Woman1 MotherInnocent MmeBurgon Magnon
## 2 2 2 2
## MmePontmercy BaronessT Child1 Child2
## 2 2 2 2
## Napoleon CountessDeLo Geborand Champtercier
## 1 1 1 1
## Cravatte Count OldMan Labarre
## 1 1 1 1
## MmeDeR Isabeau Gervais Scaufflaire
## 1 1 1 1
## Boulatruelle Gribier Jondrette MlleVaubois
## 1 1 1 1
## MotherPlutarch
## 1
So we see that Valjean shares scenes with the most characters, as expected for those of you familiar with the story.
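If you are wondering what degree is actually counting, here is a small sketch in base R that does not need igraph at all. The four-person network is made up for illustration, and `edges` and `deg_toy` are hypothetical names. Each row of the edge list is one connection, so a person's degree is simply the number of rows they appear in.

```r
# A made-up network: each row is one edge between two people
edges <- data.frame(
  from = c("A", "A", "A", "B"),
  to   = c("B", "C", "D", "C"),
  stringsAsFactors = FALSE
)

# Degree = how many times each person shows up at either end of an edge
deg_toy <- table(c(edges$from, edges$to))

# Most connected first: A has 3 edges, B and C have 2, D has 1
sort(deg_toy, decreasing = TRUE)

# Every edge has two ends, so the degrees always sum to twice the edge count
sum(deg_toy) == 2 * nrow(edges)
```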
Networks can also generate "centrality metrics" which are expressions that attempt to capture if certain members of the network are more important/significant than others. A great example is the Kevin Bacon game, where you pick any actor and see if you can get to Kevin Bacon in 6 hops. In network terms he has high closeness centrality, that is to say it is easy to get from Kevin Bacon's spot in a given movie star network to any other part. For our Les Misérables data we can find the Kevin Bacon of the play with the following command.
close<-closeness(lemis, mode='all')
names(close)<-V(lemis)$label
sort(close, decreasing=TRUE)
## Gavroche Valjean Montparnasse Javert
## 0.004366812 0.004255319 0.004065041 0.004032258
## Gueulemer Thenardier Claquesous Babet
## 0.003952569 0.003846154 0.003831418 0.003759398
## Mabeuf Bamatabois Bossuet Toussaint
## 0.003703704 0.003610108 0.003597122 0.003584229
## MmeHucheloup MmeThenardier Eponine Grantaire
## 0.003546099 0.003533569 0.003508772 0.003508772
## Cosette Marguerite Brujon Fantine
## 0.003484321 0.003401361 0.003401361 0.003389831
## Enjolras Prouvaire Marius Woman1
## 0.003355705 0.003355705 0.003300330 0.003289474
## Woman2 MotherInnocent Pontmercy Labarre
## 0.003246753 0.003246753 0.003236246 0.003225806
## MmeDeR Isabeau Gervais Scaufflaire
## 0.003225806 0.003225806 0.003225806 0.003225806
## LtGillenormand Simplice Fauchelevent Feuilly
## 0.003184713 0.003154574 0.003125000 0.003105590
## Bahorel Tholomyes Brevet Chenildieu
## 0.003086420 0.003067485 0.003067485 0.003067485
## Cochepaille Gillenormand Boulatruelle MlleGillenormand
## 0.003067485 0.003067485 0.002985075 0.002976190
## Joly Anzelma Magnon Courfeyrac
## 0.002949853 0.002915452 0.002906977 0.002906977
## BaronessT Combeferre MmePontmercy Perpetue
## 0.002881844 0.002801120 0.002777778 0.002710027
## MmeBurgon Child1 Child2 Judge
## 0.002666667 0.002645503 0.002645503 0.002506266
## Champmathieu MlleVaubois Jondrette MlleBaptistine
## 0.002506266 0.002433090 0.002222222 0.002173913
## MmeMagloire Gribier MotherPlutarch Listolier
## 0.002173913 0.002127660 0.002020202 0.002008032
## Fameuil Blacheville Zephine Dahlia
## 0.002008032 0.002004008 0.001912046 0.001908397
## Favourite Myriel Napoleon CountessDeLo
## 0.001904762 0.001851852 0.001626016 0.001626016
## Geborand Champtercier Cravatte OldMan
## 0.001626016 0.001626016 0.001626016 0.001626016
## Count
## 0.001449275
Here we can see that while Valjean has the most connections, Gavroche has a higher closeness centrality, so you can get to more parts of the network faster if you start with him.
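To see where those small closeness numbers come from, here is a sketch in base R on a made-up chain of four people. It assumes the usual definition, closeness as one divided by the sum of shortest-path distances to everyone else, and an unweighted network (igraph also takes edge weights into account when a graph has them, which this sketch does not). The helper names `neighbours` and `hops_from` are hypothetical.

```r
# Made-up network: a chain A - B - C - D (undirected, unweighted)
edges <- data.frame(from = c("A", "B", "C"), to = c("B", "C", "D"),
                    stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$from, edges$to)))

# Everyone directly connected to v, ignoring edge direction
neighbours <- function(v) {
  unique(c(edges$to[edges$from == v], edges$from[edges$to == v]))
}

# Breadth-first search: number of hops from `start` to every node
hops_from <- function(start) {
  hops <- setNames(rep(NA_integer_, length(nodes)), nodes)
  hops[start] <- 0L
  frontier <- start
  while (length(frontier) > 0) {
    nxt <- character(0)
    for (v in frontier) {
      for (w in neighbours(v)) {
        if (is.na(hops[w])) {        # first time we reach w
          hops[w] <- hops[v] + 1L
          nxt <- c(nxt, w)
        }
      }
    }
    frontier <- nxt
  }
  hops
}

# Closeness = 1 / (total hops needed to reach everyone else)
close_toy <- vapply(nodes, function(v) 1 / sum(hops_from(v)), numeric(1))
round(close_toy, 3)
```

The two people in the middle of the chain (B and C) need 4 hops in total to reach everyone, so they score 1/4, while the two at the ends need 6 hops and score 1/6. With 77 characters the totals are much larger, which is why the values above look so tiny.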
Similarly, betweenness centrality captures how many shortest paths between any two given characters flow through a specific part of the network. In other words, which character is the bridge that connects otherwise disconnected groups to each other? I'm sure everyone has a friend (or is someone) who brings otherwise unconnected people together at a party or a get-together. In network terms these folks have high betweenness centrality.
btw<-betweenness(lemis, directed=FALSE)
names(btw)<-V(lemis)$label
sort(btw, decreasing=TRUE)
## Valjean Gavroche Javert Myriel
## 1293.6140693 812.6849387 551.1907287 504.0000000
## Thenardier Fantine Mabeuf Bamatabois
## 367.0057359 325.9865440 253.0330087 227.4785714
## Cosette Marius Tholomyes Montparnasse
## 212.8580447 205.5187229 187.5952381 141.8520022
## MmeThenardier Claquesous Grantaire MlleGillenormand
## 129.8511905 120.6288059 102.9285714 102.8803030
## MmeHucheloup Bossuet Fauchelevent MmeBurgon
## 97.0595238 79.0863095 75.0000000 75.0000000
## Gueulemer Gillenormand Eponine Pontmercy
## 72.3816198 68.9583333 66.2004690 64.3137446
## LtGillenormand Babet Toussaint Marguerite
## 43.0125000 41.6359848 32.5222222 25.6547619
## Bahorel Magnon MmePontmercy BaronessT
## 21.1654762 13.8833333 13.5000000 10.9744048
## Feuilly Enjolras Brujon Simplice
## 10.9321429 7.8682900 4.6929293 3.6166667
## Courfeyrac Anzelma Joly Napoleon
## 1.8214286 0.7694805 0.5000000 0.0000000
## MlleBaptistine MmeMagloire CountessDeLo Geborand
## 0.0000000 0.0000000 0.0000000 0.0000000
## Champtercier Cravatte Count OldMan
## 0.0000000 0.0000000 0.0000000 0.0000000
## Labarre MmeDeR Isabeau Gervais
## 0.0000000 0.0000000 0.0000000 0.0000000
## Listolier Fameuil Blacheville Favourite
## 0.0000000 0.0000000 0.0000000 0.0000000
## Dahlia Zephine Perpetue Scaufflaire
## 0.0000000 0.0000000 0.0000000 0.0000000
## Woman1 Judge Champmathieu Brevet
## 0.0000000 0.0000000 0.0000000 0.0000000
## Chenildieu Cochepaille Boulatruelle Woman2
## 0.0000000 0.0000000 0.0000000 0.0000000
## MotherInnocent Gribier Jondrette MlleVaubois
## 0.0000000 0.0000000 0.0000000 0.0000000
## Combeferre Prouvaire MotherPlutarch Child1
## 0.0000000 0.0000000 0.0000000 0.0000000
## Child2
## 0.0000000
Valjean wins out again as the biggest bridge between various other parts of the story.
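For the curious, here is a rough base R sketch of what betweenness is counting, using a made-up network where one person (C) links two pairs of friends. This is a brute-force illustration for a tiny, connected, unweighted network, not how igraph computes it internally. It relies on the fact that the k-th power of the adjacency matrix counts walks of length k, so the first power at which a pair becomes reachable gives both their distance and their number of shortest paths. All object names are hypothetical.

```r
# Made-up network: friends A-B and friends D-E, both linked through C
edges <- data.frame(from = c("A", "A", "B", "C", "C", "D"),
                    to   = c("B", "C", "C", "D", "E", "E"),
                    stringsAsFactors = FALSE)
nodes <- sort(unique(c(edges$from, edges$to)))
n <- length(nodes)

# Symmetric adjacency matrix (undirected network)
A <- matrix(0, n, n, dimnames = list(nodes, nodes))
A[cbind(edges$from, edges$to)] <- 1
A[cbind(edges$to, edges$from)] <- 1

# d[a, b] = shortest distance, sigma[a, b] = number of shortest paths
d <- matrix(Inf, n, n, dimnames = list(nodes, nodes))
diag(d) <- 0
sigma <- diag(n)
P <- diag(n)
for (k in seq_len(n - 1)) {
  P <- P %*% A                      # P[a, b] = number of walks of length k
  new <- is.infinite(d) & P > 0     # pairs reached for the first time
  d[new] <- k
  sigma[new] <- P[new]
}

# For each node v, add up the share of shortest a-b paths that pass through v
btw_toy <- setNames(numeric(n), nodes)
for (v in seq_len(n)) for (a in seq_len(n)) for (b in seq_len(n)) {
  if (a < b && a != v && b != v && d[a, v] + d[v, b] == d[a, b]) {
    btw_toy[v] <- btw_toy[v] + sigma[a, v] * sigma[v, b] / sigma[a, b]
  }
}
btw_toy
```

C sits on the only shortest path for four pairs (A-D, A-E, B-D, B-E) and scores 4; everyone else scores 0, just like the long tail of zeros in the Les Misérables output.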
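Finally, let's plot the network. First we are going to scale each node by degree: the more connections a node has, the bigger it is. Next we are going to shrink the arrowheads, color the edges light gray, scale each edge's width by its weight, and position the nodes with the Fruchterman-Reingold layout.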
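# Scale each node by its degree
V(lemis)$size <- degree(lemis)*0.6
# Small arrowheads, light gray edges, width scaled by edge weight
E(lemis)$arrow.size <- .2
E(lemis)$color <- "gray80"
E(lemis)$width <- 1+E(lemis)$weight/12
# Compute a Fruchterman-Reingold layout and plot with it
l <- layout.fruchterman.reingold(lemis)
plot(lemis,
     vertex.label.cex=0.75,
     vertex.label.color="black",
     vertex.label.family="Helvetica",
     layout=l)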
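One thing worth knowing about this kind of plot: the Fruchterman-Reingold layout starts from random positions, so the picture can look different every time you run it. A common way to get the same picture twice is to set R's random seed right before computing the layout. The sketch below assumes igraph 1.0 or later, where `layout_with_fr()` is the newer name for the same layout and `make_ring()` builds a simple test graph; the seed value and the toy graph are arbitrary choices of mine.

```r
library(igraph)

# A made-up test graph: 10 nodes joined in a ring
g <- make_ring(10)

# Fix the random seed, then compute the layout
set.seed(645)
l1 <- layout_with_fr(g)

# Same seed again, same starting positions
set.seed(645)
l2 <- layout_with_fr(g)

# Compare the two sets of coordinates
identical(l1, l2)

# Reuse the stored coordinates so the plot does not move between runs
plot(g, layout = l1)
```

Storing the layout in an object, as the demo does with `l`, has the same benefit: replotting with different colors or sizes keeps every node in place.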
In summary, today we've learned a bit about R and how to navigate around it, seen what packages are, and demonstrated that with fewer than 20 lines of code you can load network data, calculate powerful and informative network statistics, and produce visualizations. Next time we will go over the same territory, but with you following along on your computers. So make sure that you've followed the R installation guide on Blackboard before then.