19.1 Partition data with k-means
Machine learning generally falls into two categories: supervised learning and unsupervised learning. Supervised learning is similar to regression, where you train on known outcomes and make predictions from them. In unsupervised learning there is no response variable, so there is no way of knowing whether the training is "right"; you are simply trying to make sense of the data's structure. One of the most popular forms of unsupervised learning is clustering, and within clustering one of the most popular methods is k-means, where you divide the dataset into k distinct subsets.

To illustrate this, we're going to use some data from the UC Irvine Machine Learning Repository. The dataset is all about wine and is available at http://archive.ics.uci.edu/ml/datasets/Wine. We save the location as a string and read in the data with read.table; it is a CSV, so it has a header and the separator is a comma. It doesn't contain any text, so we won't bother with stringsAsFactors = FALSE. Looking at the head of the data, we see all sorts of measurements about the wine.

Next we make a training set so we can actually test our results. This is unsupervised learning, but in this case we happen to know the cultivar, so we can compare our clustering with reality. We build wineTrain by selecting, with which and names, every column of wine except the Cultivar column.

Now we are ready to run k-means. First, though, we need to set a random seed: k-means uses random starts to decide how to begin partitioning the data, so if we want the results to be reproducible we call set.seed(278613). One thing about k-means is that you must specify how many clusters to use. Since we know ahead of time that this data has three different cultivars, it makes sense to use three clusters, so we fit kmeans with x equal to wineTrain and centers equal to three, storing the result as wineK3. Printing it shows a lot of information, but the main parts are the centers of each cluster, which observation went to which cluster, and some diagnostic information.

About those centers: the data has 13 columns, so it lives in a 13-dimensional space, and each cluster center is a point in that space with one coordinate per column. That is why there are three rows of centers, one per cluster, each with a coordinate for every column. The clustering vector tells us that the first row is in the second cluster, the second row is in the second cluster, and so on, until the fifth row, which is in the first cluster.

Visualizing this is hard because we are in many dimensions, but it's worth a shot. The useful package has a plot method that takes the result of a k-means clustering along with the original data and plots it in two dimensions. It takes those 13 columns and uses multidimensional scaling, via the cmdscale function, to project them onto the two dimensions that explain most of the variation, with the points color-coded according to the k-means clustering.
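Put together, the steps narrated above look roughly like the following. This is a minimal sketch, assuming the useful package is installed and that wineUrl points at a comma-separated copy of the wine data with a header row and a Cultivar column; the URL mentioned in the video is the repository's description page, so the exact file location may differ.

    # Sketch of the workflow described above. wineUrl is assumed to point at a
    # CSV copy of the UCI wine data with a header and a Cultivar column.
    library(useful)

    wineUrl <- "http://archive.ics.uci.edu/ml/datasets/Wine"  # repository page; substitute the direct CSV link
    wine <- read.table(wineUrl, header = TRUE, sep = ",")
    head(wine)

    # Drop the known class label so the clustering stays unsupervised
    wineTrain <- wine[, which(names(wine) != "Cultivar")]

    # k-means uses random starting centers, so set a seed for reproducibility
    set.seed(278613)
    wineK3 <- kmeans(x = wineTrain, centers = 3)
    wineK3   # centers, cluster assignments and diagnostics

    # useful's plot method for kmeans objects projects the 13 columns into two
    # dimensions with multidimensional scaling (cmdscale) and colors by cluster
    plot(wineK3, data = wineTrain)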
We can see here that the clustering agrees pretty well with the principal components; the clusters look like their own discrete blobs. If you don't provide starting points, k-means randomly picks centers in the data and uses those to start the algorithm, so it is often a good idea to restart the algorithm a number of times and let it run again and again until it converges on a result. So we once again call set.seed with 278613, the same seed as before, and this time fit wineK3N25: the same kmeans call on wineTrain with three centers, but now with 25 restarts. Checking the cluster sizes for the run without restarts, wineK3$size, and then for wineK3N25, we see they are exactly the same. Either we got lucky, or this happens to be a dataset where the restarts weren't strictly necessary.

One drawback of k-means clustering is that you need to tell it how many clusters to look for, and deciding on that number is a problem in its own right. There are two methods I like: one is Hartigan's rule and the other is the gap statistic, the gap statistic being the better known of the two. Let's first look at Hartigan's rule. It essentially compares the within-cluster sum of squares for a clustering with k clusters against one with k+1 clusters and tells you whether adding another cluster is worthwhile. The useful package has a function called FitKMeans that does this for us: it fits a series of k-means clusterings with two clusters, then three, four, five, and so on, computes Hartigan's rule for each, and tells us whether the additional cluster is worth using. We set max.clusters to 20, so it doesn't go past 20, the number of restarts to 25, and the seed once again to 278613. It iterates through, and when it's done we can view the result: a nice data frame saying whether we should add the next cluster. According to this it looks like we should use 12 or 13 clusters. If we prefer a visual, there is a PlotHartigan function that makes a graph showing at what point we should stop adding clusters; it looks like at 14 we stop adding, so we keep 13 clusters.

Using the clustering we already have, let's see how it compares with reality; see the sketch after this paragraph. We build a table of wine$Cultivar against wineK3N25$cluster and get a confusion matrix telling us what we got right and what we got wrong. Rather than trying to interpret the raw numbers, let's plot it, which makes it a little easier to see: we call plot on that same table, give it a title noting that it is a confusion matrix, and set xlab to Cultivar and ylab to Cluster. The plot gives us a visual representation of the confusion matrix: one block is the portion we said was cluster one that really is cultivar one, the next is what we said was cluster one but is actually cultivar two, then cluster one versus cultivar three, then cluster two versus cultivar one, and so forth. You can see we didn't do the best job: what is actually cultivar two ends up split between clusters one and three. This wasn't the best clustering, but it serves as a quick example.
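A sketch of these steps, assuming the wine and wineTrain objects built above and the useful package; the object name wineBest and the plot title are assumptions for illustration.

    # Run k-means again with 25 random restarts, then compare cluster sizes
    set.seed(278613)
    wineK3N25 <- kmeans(wineTrain, centers = 3, nstart = 25)
    wineK3$size
    wineK3N25$size

    # Hartigan's rule via useful::FitKMeans: fit clusterings over a range of k
    # and check whether each additional cluster is justified
    wineBest <- FitKMeans(wineTrain, max.clusters = 20, nstart = 25, seed = 278613)
    wineBest
    PlotHartigan(wineBest)

    # Compare the three-cluster solution with the known cultivars
    table(wine$Cultivar, wineK3N25$cluster)
    plot(table(wine$Cultivar, wineK3N25$cluster),
         main = "Confusion Matrix for Wine Clustering",  # assumed title
         xlab = "Cultivar", ylab = "Cluster")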
The other method for deciding how many clusters to use is the gap statistic, which lives in the cluster package, and it does take a while to run. Loading the package gives a little warning message about an object called twins, but we should be okay. We create an object called theGap by calling clusGap: you feed it your data, a function to cluster with (we'll use pam, which we learned in another lesson and which works well here), and the maximum number of clusters, which we set to 20. We run this and wait for it to finish; it performs a bootstrap for each clustering, so it takes a bit of time. When it's finished, we convert the result into a data frame so we can easily view and plot it. Looking at the head, it shows the gap between the within-cluster dispersion we actually get and what we would expect to get based on a bootstrap resampling.

This is easier viewed than read, so we build a ggplot from gapDF with the x axis simply 1 through nrow(gapDF), then a blue line and points for logW, the observed value, and a green line and points for E.logW, the expected value, with an x label of number of clusters. The blue line is the within-cluster dispersion for a given number of clusters, and the green line is what we expect based on the bootstrapped samples; we want the number of clusters where the gap between them is smallest.

We can see this more directly by plotting the gap statistic itself. We build another ggplot, starting out the same way with gapDF and x equal to 1 through nrow(gapDF), then a red line and points for the gap itself, plus error bars with geom_errorbar whose ymin is gap minus SE.sim, the standard error from the simulations, and whose ymax is gap plus SE.sim, also in red. This shows, for a given number of clusters, the gap between what we observed and what we expected. While you might be tempted to go for the number of clusters that gives the minimum gap, for simplicity you want the number of clusters whose gap is as far from that minimum as possible while still being within one standard deviation of it. The point at four clusters is still within one standard deviation of the minimum, so four is the number of clusters we would choose. A sketch of this whole sequence follows below.

K-means provides a great way of partitioning data. It requires some foresight in choosing how many clusters you want, and a number of restarts, but it partitions the data so that each cluster is discrete and separate from the others.
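A sketch of the gap statistic steps, assuming wineTrain from above and the cluster and ggplot2 packages; the object names theGap and gapDF follow the narration.

    library(cluster)
    library(ggplot2)

    # Gap statistic: compare observed within-cluster dispersion to what a
    # bootstrap reference distribution would produce, clustering with pam
    theGap <- clusGap(wineTrain, FUNcluster = pam, K.max = 20)
    gapDF <- as.data.frame(theGap$Tab)
    head(gapDF)

    # Observed dispersion (logW, blue) versus expected (E.logW, green)
    ggplot(gapDF, aes(x = 1:nrow(gapDF))) +
        geom_line(aes(y = logW), color = "blue") +
        geom_point(aes(y = logW), color = "blue") +
        geom_line(aes(y = E.logW), color = "green") +
        geom_point(aes(y = E.logW), color = "green") +
        labs(x = "Number of Clusters")

    # The gap itself, with error bars of one simulation standard error
    ggplot(gapDF, aes(x = 1:nrow(gapDF))) +
        geom_line(aes(y = gap), color = "red") +
        geom_point(aes(y = gap), color = "red") +
        geom_errorbar(aes(ymin = gap - SE.sim, ymax = gap + SE.sim), color = "red") +
        labs(x = "Number of Clusters", y = "Gap")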