18.1 Understand ACF and PACF
Video duration: 7m
<v Voiceover>Data that are time-dependent</v> need to be treated with special care. When an observation at a given time is affected by an observation at an earlier time, there is autocorrelation. That is, the independence of the observations is violated. This violation of independence needs to be accounted for, and that is where special time series methods, such as ARIMA, come into play. This often occurs in financial data. It occurs in many other types of data too, such as the level of a gasoline tank at a gas station, but most often people associate time series with financial data, so we will look at World Bank data as an example.

There is an open API to get to World Bank data. You get it by doing require(WDI). After that is loaded, we can go ahead and pull some data. We will pull data for US, CA, GB, CN, and so on. These are all the official two-letter international codes for these countries. We will also pull some indicators about GDP, particularly overall GDP and GDP per capita, and we will store them in their own vector. So when we do the pull from the World Bank, we say gdp gets WDI, and we say country is equal to our countries. This will pull data for all of them. Then we say indicator is equal to our indicators. And lastly, we specify a start and end year of 1960 and 2011. It is important to get the indicator names right, and also to get your variable names correct. That is why it is great to use auto-completion when possible.

Now it is pulling down the data. Because the data come across a little messy, we will just give them better names. For these names, we will use iso2c, Country, Year, PerCapGDP, and GDP. Okay, so now let's take a look at our data. We have all sorts of information: the short name, the long name, the year, per capita GDP, and overall GDP. So we probably want to plot this to see what's going on. So we load up ggplot2, and we say ggplot of gdp.
The aes is going to be Year on the x-axis and PerCapGDP on the y-axis, and notice I didn't say x and y. The arguments are being matched positionally. It's a little dangerous doing it that way, but we can do it. I will set Country to color and also to linetype. And I will say geom_line, and I will make the y-axis continuous with scale_y_continuous and give it the label format of dollar. We run this, and we see that dollar does not exist. That's because dollar is a function that comes in the scales package. And even though ggplot2 depends on scales, scales is not attached automatically, so it needs to be loaded separately. And now we have a nice plot showing how GDP per capita has changed over the years.

Clearly there is some sort of time dependence in this data. So to examine this further, we will look at the per capita GDP of just the United States. To do that, we come here and we say us gets, and first we will grab just the United States data. So we do gdp$PerCapGDP, and we will limit it to the rows where the country is the United States. And we make sure we convert it to a time series with ts, so we can use all the special functions for a time series. We'll say it starts in the first year of data, and it ends in the max year of data. If we look at it now, it's a special type of data format that contains all the information we are looking for. If we plot this now just using base plotting, and we will give it a nice label, we can see how base R plots a time series object. Not quite as pretty as ggplot2, but it will do for us for now.

A big part of the analysis of time series is looking at the autocorrelation function and the partial autocorrelation function. The autocorrelation function is good for determining the moving average order of a time series. The partial autocorrelation function is good for determining the autoregressive portion of a time series. We'll learn more about autoregression and moving average in a little bit, so let's first look at the ACF and PACF. Let's clear the screen, and type in acf of us.
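The conversion to a time series and the acf call just described can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the gdp data frame here is a simulated, upward-trending stand-in (the real values come from the World Bank pull and will differ), the column names Country, Year, and PerCapGDP follow the names given earlier, and the ggplot2 step is left out so the sketch runs without extra packages.

```r
# Hypothetical stand-in for the World Bank data: a simulated annual series
# with an upward trend. These numbers are NOT real GDP figures.
set.seed(1)
years <- 1960:2011
gdp <- data.frame(
  Country   = "United States",
  Year      = years,
  PerCapGDP = 3000 + 900 * seq_along(years) +
    cumsum(rnorm(length(years), sd = 300))
)

# Keep only the United States rows and convert them to an annual time series
# that starts in the first year of data and ends in the last.
us <- ts(gdp$PerCapGDP[gdp$Country == "United States"],
         start = min(gdp$Year), end = max(gdp$Year))

# Base R has a plot method for ts objects.
plot(us, ylab = "Per Capita GDP", xlab = "Year")

# acf() draws the plot by default; plot = FALSE returns the numbers instead.
us_acf <- acf(us, plot = FALSE)
us_acf$acf[1:4]   # correlations at lags 0, 1, 2, 3 (lag 0 is always 1)

acf(us)           # the plot discussed next
```

Because the data are annual, each lag here is one year rather than one day.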
It brings up this plot. What this plot shows is the correlation between the values of the data set and lagged versions of itself. Basically, today's data, and how it's correlated to yesterday's data. Today's versus two days ago, today's versus three days ago, and so on. With our yearly data, each lag is one year. If there were no linear dependence, there would just be the correlation between today and itself at lag zero, which is always one, and all the rest of these would be non-significant. Here that is not the case, and this indicates we have some sort of trend that we need to deal with.

A similar chart is the PACF. What this chart shows is the correlation between today and a previous time, with all the linear dependence of the in-between times taken out. So in other words, it's today's correlation with two days ago, having taken into account the correlation between today and yesterday, and between yesterday and two days ago. It takes out the middle correlations. I'll expand this so it's a little easier to see.

Both of these charts are very useful for showing information about a time series. In these charts, any spike that extends beyond the blue lines, above or below, indicates a significant correlation. This chart is showing no significant correlations, so it is entirely possible that there are not any AR components to this. There might still be, you can't judge just by the graph, but it's one possible interpretation. Being able to look at and use the ACF and the PACF is a very important part of working with time series data. They can be very informative as to what type of time series you're looking at and what type of linear dependence is involved.
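To make the reading of these charts concrete, here is a minimal base R sketch on a simulated series, not on the GDP data. It assumes a hypothetical AR(1) process with coefficient 0.7, pulls the ACF and PACF values out as numbers, and compares them with the approximate 95% bounds that R draws as the blue lines, qnorm(0.975) / sqrt(n).

```r
# Simulated AR(1) series, x_t = 0.7 * x_{t-1} + e_t. This is an illustrative
# assumption, not the GDP series from the video.
set.seed(42)
n <- 200
x <- arima.sim(model = list(ar = 0.7), n = n)

# acf() includes lag 0 (always 1), so drop it; pacf() starts at lag 1.
acf_vals  <- acf(x, plot = FALSE)$acf[-1]
pacf_vals <- pacf(x, plot = FALSE)$acf

# Approximate 95% significance bounds, the blue lines in the default plots.
bound <- qnorm(0.975) / sqrt(n)

# Lags whose spike extends beyond the bounds in either direction.
sig_acf  <- which(abs(acf_vals)  > bound)
sig_pacf <- which(abs(pacf_vals) > bound)

# At lag 1 there are no in-between times to remove, so the partial
# autocorrelation equals the ordinary autocorrelation.
all.equal(acf_vals[1], pacf_vals[1])
```

For an autoregressive series like this one, the ACF typically tails off gradually while the PACF drops off after the AR order, which is why the PACF is the chart to consult for the autoregressive portion. As noted above, the graph alone is not conclusive.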