21.3 Understand common graph metrics - Video Tutorials & Practice Problems
Video duration: 10m
When working with networks, there are a number of measures that are often very important, such as average path length, diameter, and farthest nodes. So we're gonna take a look at a number of these measurements that can be done very easily in R. First up, let's look at average path length. There is a function, average.path.length. You feed in the network, flights, and there we have it: the average length is 1.4947. You could also look at the diameter of the network, and here you see the diameter is two. You can also go ahead and find out what nodes are farthest from others with farthest.nodes. For pretty much every network measurement you want to do, the function is the name of the measurement.
Now, we just looked at farthest nodes. Well, I wanna see which nodes those are. Sure, I see right now it's one, two, two, but what exactly does that mean? Let's say V(flights) subsetted by farthest.nodes(flights). And we forgot to close off the square bracket. That is a common thing you'll see a lot in R if you've got closing brackets, and one of the most frustrating things about debugging is trying to find where those brackets are. We run that and we see that Austin and Boston, with Boston listed twice, are the farthest nodes, and again this is based on what nodes are connected to other ones.
We could also find the largest cliques. So let's look at that: largest.cliques(flights). And we get a warning message. That's okay, it's just a warning message. In R there are both errors and warning messages. And yes, in RStudio they both appear red, so they both seem like, oh no, something horrible is happening, but it's really okay. It just tells you there is a little warning. No need to get worried. Now it shows us a number of largest cliques. Let's say we just wanna find out what vertices are in the first largest clique. So we will say V(flights), that'll give us the vertices. We'll subset it by largest.cliques(flights).
We will just take the first one. Again we get the warning message, but it's okay, because it's just saying that largest.cliques is ignoring the directionality of the graph. Don't worry too much about it. But we can see here the first largest clique is San Francisco, San Diego, Las Vegas, DCA, and Dallas. I'm gonna copy and paste this and see the second one. Here it happens to be San Francisco, San Diego, Vegas, Seattle, and Washington DC.
Another measurement is the transitivity. In this case we've got 0.52. You could also find the degree of each of the nodes by saying degree(flights), and this shows us how many edges are attached to each node. For instance, Austin has 12 edges, Vegas has 32, Portland has 30, San Francisco has 38, Philadelphia 14, Newark 14. It's just a way to see how connected each node is. It could be helpful to see this as a histogram, so just say hist(degree(flights)). In our plot window we can see the histogram for the connectedness of all the vertices.
When we're flying, we often want the shortest journey. That could mean the least amount of time or the fewest stops. So using the shortest paths, we can see how to get from any one node to another in the smallest number of hops. So we'll say shortest.paths(flights). This is gonna print out a lot of information on the screen. It prints out a big matrix. And if we scroll to the top, this shows us how many segments it takes to go from any city in the rows to any city in the columns. So for instance, to go from Austin to Chicago is two segments. That's the same as saying a one-stop flight. Okay, go from Austin to LA, it's one segment. That's a non-stop flight. So here the smaller the number the better. The zeros mean you go from one city to itself, so we'll ignore those. And the number shows you the number of segments. One means it's a direct flight, two means there's one stop, three means there'll be two stops.
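The measures covered up to this point can be tried on a small made-up network. This is only a sketch: the airports and routes below are invented and are not the flights data from the video, and the graph is undirected so the clique warning does not appear. It also assumes a recent igraph release, where the dotted names used in the video are spelled with underscores (mean_distance, farthest_vertices, largest_cliques, distances).

```r
library(igraph)

# Hypothetical toy network: these airports and routes are invented for illustration
routes <- data.frame(
  from = c("AUS", "LAX", "LAX", "SFO", "SFO"),
  to   = c("LAX", "SFO", "BOS", "BOS", "SEA")
)
toy <- graph_from_data_frame(routes, directed = FALSE)

mean_distance(toy)   # average path length; 1.6 for this graph
diameter(toy)        # longest shortest path; 3 for this graph

# farthest_vertices() returns the endpoint vertices and their distance separately
far <- farthest_vertices(toy)
as_ids(far$vertices)
far$distance

# Each element of the returned list is one largest clique
cliques <- largest_cliques(toy)
as_ids(cliques[[1]])

transitivity(toy)    # global clustering coefficient
degree(toy)          # number of edges attached to each vertex

# Matrix of hop counts between every pair of airports
hops <- distances(toy)
hops["AUS", "SEA"]   # 3 segments, i.e. two stops
```

In this toy graph the only triangle is LAX, SFO, BOS, so there is a single largest clique, and hist(degree(toy)) would draw the degree histogram in the same way as in the video.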
And this matrix is symmetric because going from Austin to LA should take the same number of stops as going from LA to Austin. Now this is a table. I'm usually a fan of visualizing instead of printing out numbers. So let's make a heat map. We'll say heatmap(shortest.paths(flights)), and I'll zoom in, and this shows us how to get from any one city to another city.
We might want to color code our graph according to different metrics of the network. So for instance, we might want any vertices where the degree is at least 30 to be green, and where the degree is at most 14 to be red. Let's code that into the graph. So V(flights), we subset it by saying where degree(flights) >= 30. We set the color attribute to green and we get an error here. The error here is again that we didn't close off the subsetting properly: we did it with a curly brace as opposed to a square bracket. This is a very common thing you'll see. Write it properly, and it works. We'll say V(flights) such that degree(flights) <= 14. Close off the square brackets, choose the color attribute. Let's say this is red. If we plot it now, we can see that our graph has been modified to show us information. This is very important when you're trying to display information to the end user.
We can also go ahead and change the edge width so it's based on the time between airports. Use plot(flights, edge.width=E(flights)$Time). So we see here that it made a massive mess of our graph. That is because base graphics uses the values literally, so we need to scale them manually. I'm going to copy this line, paste it in, and let's say divide it by 100. Now it's scaled a little better. Maybe we should divide it by 150, maybe 200. This little give and take, this little parametrization, is part of building a good network graph and is actually very important to the process.
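As a rough sketch of the coloring and edge-width steps, here is the same idea on an invented network. The airports, routes, and Time values (in minutes) are hypothetical, and the degree thresholds are scaled down from the 30 and 14 used in the video so that they make sense for such a small graph.

```r
library(igraph)

# Hypothetical toy network; routes and flight times (minutes) are invented
routes <- data.frame(
  from = c("AUS", "LAX", "LAX", "SFO", "SFO"),
  to   = c("LAX", "SFO", "BOS", "BOS", "SEA"),
  Time = c(190, 80, 330, 340, 120)
)
toy <- graph_from_data_frame(routes, directed = FALSE)

# Color vertices by degree; note the square brackets around the condition
V(toy)$color <- "lightgray"                  # default for everything in between
V(toy)[degree(toy) >= 3]$color <- "green"    # well-connected airports
V(toy)[degree(toy) <= 1]$color <- "red"      # sparsely connected airports

# plot() uses edge widths literally, so rescale the times before drawing
width <- E(toy)$Time / 100

# Only draw when running interactively, so the script also works in batch mode
if (interactive()) {
  plot(toy, edge.width = width)
}
```

The divisor of 100 is just a starting point; as in the video, it usually takes a few tries to find a scale that reads well.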
While we're at it though, instead of having to specify the weights every time, we could actually build them right into the graph. So first let's say flights3 <- flights. We'll save it as a new variable. Then we'll set weight, which is sort of a reserved attribute, to be the time, so that we don't have to do it every single time ourselves. So we can say E(flights3)$weight <- E(flights)$Time. This way, instead of having to specify it manually each time, it can be picked up automatically. So when we do the shortest paths, they will be weighted by the time. Let's say heatmap(shortest.paths(flights3)). This graph will be different from the previous heatmap we saw. It's not just looking at the number of segments between airports, it's taking into account the amount of time it takes to get between them. So this all of a sudden can give you a very different perspective on which flights to take. And if we actually look at that matrix of shortest paths of flights3, it no longer shows the number of segments. It actually shows you the amount of time it'll take you. So all of a sudden, a route with fewer segments might not be as efficient as one with more segments but less time. These weights are incredibly valuable for analyzing your network.
So these are just some of the many metrics you can use to analyze your network. igraph has built in just about every measurement you could think of. It's just a matter of discovering what's out there.
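To make the effect of the weight attribute concrete, here is a minimal sketch on three invented airports. The routes and Time values are hypothetical, chosen so that the non-stop route is slower than the one-stop route, and distances() is assumed as the current igraph spelling of shortest.paths().

```r
library(igraph)

# Hypothetical routes with invented flight times in minutes
routes <- data.frame(
  from = c("AUS", "AUS", "ORD"),
  to   = c("BOS", "ORD", "BOS"),
  Time = c(300, 100, 120)
)
toy <- graph_from_data_frame(routes, directed = FALSE)

# No weight attribute yet, so this counts segments
hops <- distances(toy)

# Copy the graph and store Time in the reserved edge attribute "weight"
toy_w <- toy
E(toy_w)$weight <- E(toy_w)$Time

# distances() now picks up the weights automatically
minutes <- distances(toy_w)

hops["AUS", "BOS"]      # 1 segment (non-stop)
minutes["AUS", "BOS"]   # 220 minutes, via ORD rather than the 300-minute non-stop

# Only draw when running interactively
if (interactive()) {
  heatmap(minutes)
}
```

Here the hop count favors the non-stop route, while the weighted result prefers the connection through ORD, which is the change in perspective described above.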