2: The Basic Building Blocks in R
2.3 Understand the different data types
<v Voiceover>Variables</v> in R can take on a number of different types, whether they are numbers, or dates, or text. Unlike some other languages, particularly C++, R is dynamically typed. That means at one minute a variable can be a number and the next minute it could be text, and then the next minute back to numbers. So, it is a very flexible language and many things can be done with it. So, let's go ahead and look at some of the data that R can store.
The first type is numeric data such as the number two. So, x gets two. We look at this and we see that it's two. Now, this looks like an integer, but it's actually stored as a numeric type, and the way you can check that is by saying, class of x, and we see that it is, indeed, numeric. Another check that can be done is the function, is.numeric of x, and we see TRUE. It is numeric. Now, numeric data is different than integer data, but it's a very subtle difference, and, most likely, in computations it won't matter too much.
But let's say, for some reason, we want to assign the integer five to the variable, i. We could do i gets five-L. The L means it's an integer. So, let's run that. Looking at it, it looks identical to the output from x, except for the fact that it's five and not two. But, if we were to check the class of i, we would see that it is, indeed, an integer. We can always do is.integer of i, and, again, it's TRUE. What gets interesting is if we check if it's a numeric. So, is.numeric of i returns TRUE as well. That's because an integer is a subset of a numeric.
Now, let's say we have an integer such as four-L. We check the class of four-L. We see that it is indeed an integer. Then let's say we multiply four-L times 2.8. We get back a nice decimal result. That's because R automatically promoted the integer to a decimal. When we have one integer and one decimal being multiplied against each other, the integer has to be promoted.
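The checks described so far can be sketched in R as follows. This is a minimal sketch rather than the exact session from the video; the variable names x and i follow the narration, and the expected values in the comments are what a standard R session prints.

```r
# A numeric value: it looks like an integer but is stored as numeric
x <- 2
x               # 2
class(x)        # "numeric"
is.numeric(x)   # TRUE

# The L suffix asks R for a true integer
i <- 5L
i               # 5
class(i)        # "integer"
is.integer(i)   # TRUE
is.numeric(i)   # TRUE, since integers also count as numeric

# Mixing an integer with a decimal promotes the integer
class(4L)        # "integer"
4L * 2.8         # 11.2
class(4L * 2.8)  # "numeric"
```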
Similarly, let's say we have one integer being divided by another integer. For instance, five-L divided by two-L. That results in 2.5, which we can confirm is indeed a numeric. That's because an integer simply cannot hold a decimal part, so R returns a numeric.
When most people think about data, they don't necessarily think about text data. They think about numbers, something they learned in their statistics class, but text data is very important in the world of data science. For instance, let's say we want to assign the word data to the variable x. This is done by saying x gets data. Notice, data is in quotes; either double-quotes or single-quotes will work. Running this, we see x is data. That's all well and good, and that's called a character type. If we were to check the type, class of x, it's a character. Character is R's way of saying text data or string data.
Now there's another type of character data called a factor. And we'll get more into factors a little bit later, but they are very important. So, let's say that we were to say, y gets factor of data. And, remember, data is in quotes. When we look at that, we see that it spits out data. This time, it doesn't have double-quotes around it like it did before. It's just by itself, and below it, it says Levels: data. Don't worry so much about that, we will indeed get to that in a little bit.
A helper function for when we are working with character data is nchar. It's for when you want to check how many individual characters are stored in text. So, if we run nchar of x we see four, because the word data has four letters in it. Likewise, nchar of the word directly input, hello, comes out with five characters. Now it gets interesting when instead of putting in text data, you put in a number, such as nchar of three. That comes back as one. Putting in nchar of 452 comes back as three. So, it automatically converts these numbers into characters and then gives you the length of that new character.
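The integer division, character, factor, and nchar steps above might look like this in R. It is a minimal sketch; x and y are the names used in the narration, and the commented results are what a standard R session returns.

```r
# Dividing two integers yields a numeric result
5L / 2L          # 2.5
class(5L / 2L)   # "numeric"

# Character data; single or double quotes both work
x <- "data"
x                # "data"
class(x)         # "character"

# A factor built from the same text prints without quotes, plus its levels
y <- factor("data")
y                # data, followed by: Levels: data
class(y)         # "factor"

# nchar counts the characters in a string
nchar(x)         # 4
nchar("hello")   # 5

# Numbers are converted to character before counting
nchar(3)         # 1
nchar(452)       # 3
```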
Now, using nchar on a factor has different results. Remember, y is a factor. It gives an error, because nchar does not work on factors.
Another data type that can sometimes be difficult to work with is dates and times. So, let's look for a second at how R handles dates. Let's create a variable called date1, as a date. So, literally type as.Date. Let's say June 28th, 2012. The way you input that is the four-digit year, dash, the two-digit month, dash, the two-digit day, as a character. If we run this, we see date1 is 2012, June 28th. So far, so good. And, if we check the class of this date, it's a Date type. Here's what's pretty cool. Let's say we do as.numeric of date1. We get 15519. That's because June 28th, 2012 is 15,519 days after the Unix Epoch, which is January 1st, 1970. All dates in R are stored as the number of days since that Unix Epoch.
Sometimes, however, you don't just want a date, you want a date-time. Fortunately, there is a data type for that. So we're going to assign date2 to be as.POSIXct, and we're gonna build it in a similar fashion, 2012, dash, 06, dash, 28, and we're going to put in the time, 17, colon, 42. Remember, that's all as a string. And when we look at it, we get 2012, 28th day of June, 17:42, Eastern Daylight Time, which is the time zone of the machine running R. The class of this variable is a POSIXct, which is a specialized version of POSIXt. Now, for date-times there are two types. There's the POSIXct and the POSIXlt. POSIXlts are much more difficult to work with, and I highly recommend sticking with POSIXct. Now, as we saw before, a date is stored as a number of days since the Unix Epoch. In a similar fashion, date-times are stored as the number of seconds since the Unix Epoch. So, we could do as.numeric of date2, and we see how long it's been since then.
The last major data type in R is logical, that is TRUE or FALSE. They're typed in R as all capital letters, TRUE and FALSE. It's important that they're all capitalized.
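The factor error and the date handling described above can be sketched like this. The names date1 and date2 come from the narration. Two things are assumptions added for illustration and are not in the video: the tryCatch wrapper, used so the script keeps running after the error, and date2_utc with tz = "UTC", used so the seconds count is the same on every machine.

```r
# nchar on a factor raises an error; tryCatch captures its message
y <- factor("data")
nchar_result <- tryCatch(nchar(y), error = function(e) conditionMessage(e))
nchar_result                 # the error message, as text
nchar(as.character(y))       # 4, after converting the factor to character

# A date, entered as "YYYY-MM-DD"
date1 <- as.Date("2012-06-28")
date1                        # "2012-06-28"
class(date1)                 # "Date"
as.numeric(date1)            # 15519 days since 1970-01-01

# A date-time in the machine's local time zone
date2 <- as.POSIXct("2012-06-28 17:42")
date2
class(date2)                 # "POSIXct" "POSIXt"
as.numeric(date2)            # seconds since the epoch; depends on time zone

# Assumption for reproducibility: the same clock time pinned to UTC
date2_utc <- as.POSIXct("2012-06-28 17:42", tz = "UTC")
as.numeric(date2_utc)        # 1340905320
```

Pinning the time zone is one way to make date-time arithmetic repeatable across machines; without it, the numeric value shifts with the local offset.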
Now, TRUE is recognized as the number one, and FALSE is recognized as the number zero. So, if we were to type in TRUE times five, we expect to get back five. Typing in FALSE times five, we expect to get zero. That's because, again, TRUE is stored as the number one, and FALSE is stored as the number zero. Now it's possible to assign these TRUEs and FALSEs to variables, such as k gets TRUE. We look at k, it comes up as TRUE. And, if we look at class of k, we get logical. So, it is indeed able to be stored as a variable, and the check for logical is is.logical of k. That's TRUE.
Now R has programmed in a couple of shortcuts which can be dangerous to use, because people fall into bad habits sometimes. The letter, T, comes up as TRUE. Now that might seem like a way to save a little bit of typing. The problem is T is easily re-assignable. So, if we were to do T gets seven, T is no longer TRUE. In fact, if we look at the class of T, it is now numeric. So, this can cause a lot of trouble when you're programming if you're using these shortcuts like T for TRUE and F for FALSE. One gets re-assigned somewhere, even by accident, and now all of a sudden your program is gonna go crazy. So, it's highly recommended not to use these shortcuts.
Now logicals play a very important role in programming. They can result from checking equality or checking whether something's less than something else. For instance, if we were to check two is equal to three, and notice the way we check equality in R is a double equal sign, it returns FALSE because two does not equal three. If we wanted to check two is not equal to three, we do that using two, exclamation mark, equal sign, three, and that should return TRUE, because we are asking whether two is not equal to three. Yes, that is the case. Likewise we can check two is less than three. That's TRUE. Or, two is less than or equal to three. Once again, TRUE. Checking the opposite, two is greater than three, comes up FALSE, as expected.
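The logical behavior just described can be sketched as follows. The variable k follows the narration. The rm(T) call at the end of the shortcut demonstration is an addition for illustration: it removes the re-assigned T so that the built-in value is visible again.

```r
# TRUE behaves as 1 and FALSE as 0 in arithmetic
TRUE * 5        # 5
FALSE * 5       # 0

# Logicals can be stored in variables
k <- TRUE
k               # TRUE
class(k)        # "logical"
is.logical(k)   # TRUE

# The T shortcut is an ordinary variable and can be overwritten
T               # TRUE
T <- 7
class(T)        # "numeric", so T no longer means TRUE
rm(T)           # added cleanup: drop our T so the built-in one is found again
T               # TRUE

# Comparisons return logicals
2 == 3          # FALSE
2 != 3          # TRUE
2 < 3           # TRUE
2 <= 3          # TRUE
2 > 3           # FALSE
```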
It's also possible to check whether character data are equal to each other. For instance, I want to see if the word, data, is equal to the word, stats. Of course they're not. But what's really cool is you could check, is data less than stats? And, it is indeed TRUE, because character data is compared alphabetically and d comes before s.
So there are a number of different data types to use in R, and they cover all sorts of different types of data, whether it's numeric data represented as numerics and integers, character data represented as characters and factors, time-based data represented as dates or date-times, or logical data, which is the logical type.
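As a recap, the character comparison and the class of each type covered in this lesson can be sketched together. This is a minimal illustration, not the session from the video. String ordering in R follows the collation rules of the current locale, which for these two words is plain alphabetical order.

```r
# Comparing character data
"data" == "stats"   # FALSE
"data" < "stats"    # TRUE, since "d" sorts before "s"

# One example of each type from this lesson and its class
class(2)                               # "numeric"
class(5L)                              # "integer"
class("data")                          # "character"
class(factor("data"))                  # "factor"
class(as.Date("2012-06-28"))           # "Date"
class(as.POSIXct("2012-06-28 17:42"))  # "POSIXct" "POSIXt"
class(TRUE)                            # "logical"
```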