19.3 Perform hierarchical clustering
Another popular form of clustering is hierarchical clustering. This essentially builds a tree: at the top of the tree there is just one big cluster, and as you go down the tree it makes splits, so now you have two clusters. Each of those can split into two more partitions, and so on, until at the bottom every observation is its own cluster. The important part is deciding where to make the cut-off and what size to make the clusters.

To see an example, we will stick with the wine data. We will say wineH gets hclust. Unlike the other partitioning algorithms, hclust doesn't take the raw data set; it takes a distance matrix, meaning that for every row of the data set you compute its distance from every other row. There are a number of different types of distances you can use; we will use Euclidean distance, which is the straight-line distance between two points, extended to higher dimensions. So we say d equals dist of wineTrain, and that's all it takes. Now we plot wineH and we see a dendrogram. Let's blow this up, since it can be hard to see. At the top there is just one big cluster, and it keeps dividing until you get to the bottom, where every observation is accounted for. Where you cut the dendrogram determines how many clusters you get and how big they are.

We gave it data that was all numeric. If you wanted to give it categorical data, you could, but you would need to build the distance matrix a different way. For that you would use a function called daisy, which computes distance matrices on categorical data or on a mix of numeric and categorical data. For our wine data, though, we're good to go.

In addition to the different distance calculations, there are a few different methods for building the dendrogram, called linkages, which control how one cluster connects to another. Let's take a look at some of those. We call hclust on the distance of wineTrain with method equals single, closing off the parenthesis for the distance function. We copy this line, since the calls are all the same except for one part, and do complete linkage, average linkage, and centroid linkage. Now we can plot each of these individually: we plot wineH1 with no labels, because they get too crowded, and give it a title telling us which linkage we used. With single linkage we get one dendrogram shape, and copying the plot for the complete, average, and centroid fits gives yet more shapes. It's all the same idea at the bottom, but the different shapes make a big difference: one starts further to the right at the top than the others, and centroid linkage starts further to the right still. The choices we make for these linkages have a big impact on how the graph ends up looking in the end.
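As a rough sketch of those first few commands, assuming wineTrain is the all-numeric wine data used earlier in the course, the basic workflow looks something like this:

    # hclust() does not work on the raw data; it needs a distance matrix.
    # dist() defaults to Euclidean distance between every pair of rows.
    d <- dist(wineTrain)

    # Build the hierarchical clustering and draw the dendrogram
    wineH <- hclust(d)
    plot(wineH)

    # For categorical or mixed data, daisy() from the cluster package can
    # build the dissimilarity matrix instead of dist(); mixedData here is a
    # hypothetical data.frame with factor columns.
    # library(cluster)
    # dMixed <- daisy(mixedData)
    # hMixed <- hclust(dMixed)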
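And a sketch of the linkage comparison, again assuming wineTrain; the 2-by-2 plot layout and the titles are just one way to view the four dendrograms side by side:

    d <- dist(wineTrain)

    # The method argument controls the linkage, i.e. how the distance between
    # two clusters is measured when they are merged
    wineH1 <- hclust(d, method = "single")
    wineH2 <- hclust(d, method = "complete")
    wineH3 <- hclust(d, method = "average")
    wineH4 <- hclust(d, method = "centroid")

    # labels = FALSE keeps the leaves from getting too crowded to read
    par(mfrow = c(2, 2))
    plot(wineH1, labels = FALSE, main = "Single Linkage")
    plot(wineH2, labels = FALSE, main = "Complete Linkage")
    plot(wineH3, labels = FALSE, main = "Average Linkage")
    plot(wineH4, labels = FALSE, main = "Centroid Linkage")
    par(mfrow = c(1, 1))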
Let's go back to our very first dendrogram and see how we can split it into different clusters. We plot wineH to draw its dendrogram, and then we split it into three clusters with rect.hclust. We feed it the clustering and tell it how many clusters we want, and it automatically figures out how high up the dendrogram it needs to draw the boxes to come up with three clusters. Zooming in, we can see that to get three clusters we need to cut pretty high up in the splits. If we wanted 13 clusters instead, in blue, we would draw it again and just say 13. This really illustrates that the lower you cut, the more clusters you have, and the higher you cut, the fewer clusters you have.

Alternatively, if you don't know how many clusters you want but you do know how high you want to cut, there is an option for that too. Once again we plot wineH, then call rect.hclust on wineH with height equal to 200 and a red border, and while we're at it we do the same thing in blue, this time with a height of 800. Running all of this code, we see that cutting at 800 gives two clusters, while cutting at 200 gives seven. There are lots of different ways to cut up this tree, and how you cut it makes a difference in your end results.

Hierarchical clustering is a nice alternative to k-means. It's an agglomerative method where the definition of a cluster isn't clear-cut: there are clusters within clusters within clusters, and how you want to separate them is really a matter of preference.
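Roughly, the code for cutting the tree looks like this, assuming the wineH fit from above; rect.hclust draws its boxes either for a requested number of clusters (k) or at a requested height (h):

    wineH <- hclust(dist(wineTrain))

    # Cut by number of clusters: 3 clusters in red, 13 in blue
    plot(wineH)
    rect.hclust(wineH, k = 3, border = "red")
    rect.hclust(wineH, k = 13, border = "blue")

    # Cut by height instead: 200 in red, 800 in blue
    plot(wineH)
    rect.hclust(wineH, h = 200, border = "red")
    rect.hclust(wineH, h = 800, border = "blue")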