11 Categorical variables

A variable describing e.g. fuel consumption (or the height of a research patient) on a continuous scale is a continuous variable. Contrary to this, the hypothetical variable hypertension, which was introduced in chapter 9 and received either the value 0 or 1, is a categorical variable. More specifically, such dichotomous or binary variables, receiving most often either 0 or 1 as their values, are called dummy variables.

It is generally wise to tell R which of your variables are categorical variables so that R doesn’t mistakenly treat them as continuous variables in any analyses. In R, categorical variables should be saved as factor type. This is very easy.

You can view your data by typing the following command in the Console:

car_dataset5

Even a quick glance at the output (only 4 lines are shown below) will reveal you that gear is meant to be a categorical variable in these data. The other variables are clearly continuous variables.

##                   fuel_cons_eu horsepower_eu  mass_eu gear
## Mazda RX4             11.20000      111.5279 1188.432    1
## Mazda RX4 Wag         11.20000      111.5279 1304.100    1
## Hornet Sportabout     12.57754      177.4308 1560.384    0
## Duster 360            16.44755      248.4031 1619.352    0

11.1 as.factor command

Let’s change the variable gear to factor, i.e. turn it into a categorical variable. We will save it over itself - we will not create a new dataset. There are a few easy ways to do this.

option: With Base-R, the following would be done:

car_dataset5$gear <- as.factor(car_dataset5$gear)

Note that in the expression car_dataset5$gear we indicate with the dollar sign $ that we want to change only the variable gear from the dataset car_dataset5 into factor.

option is to do the same thing using dplyr:

car_dataset5 <- car_dataset5 %>% mutate(gear = as.factor(gear))

Run either of the methods mentioned above. If you now look at your data again by typing the following command in the Console, you will not really see any difference from before (only 4 lines shown here):

car_dataset5

##                   fuel_cons_eu horsepower_eu  mass_eu gear
## Mazda RX4             11.20000      111.5279 1188.432    1
## Mazda RX4 Wag         11.20000      111.5279 1304.100    1
## Hornet Sportabout     12.57754      177.4308 1560.384    0
## Duster 360            16.44755      248.4031 1619.352    0

Indeed, it is problematic that you can’t see in that output if R knows whether the variables are continuous or categorical.

tibble comes to the rescue. tibble is an improved version of the Base-R’s “dataframe” (in the previous examples, “car_dataset”, “car_dataset2” etc. were in Base-R format). At this point, tou don’t need to understand anything about data frames or why tibble is a better format - just take my word for it.

11.2 tibble

Let’s change our dataset “car_dataset5” into a tibble. Let’s save the new data set as “car_dataset6”.

At the same time, we are making another important correction. So far, the rows in our dataset have been named after car brands, but naming rows in general is not recommended today. When we switch to the tibble format, line names are destroyed and replaced by line numbers. Car brands (or analogously: “code identifiers of participants”) will appear in the first column, which is a better practice.

It works like this - and notice that the command is nowadays as_tibble and not as.tibble:

car_dataset6 <- as_tibble(car_dataset5, rownames = "Car Brands")

After that, when you type the following command into the console, you can easily see that R indeed knows that the variable gear is categorical (or, in R’s terms, a factor):

car_dataset6

Below you can see 4 lines of data (see also the image below):

## # A tibble: 4 x 5
##   `Car Brands`      fuel_cons_eu horsepower_eu mass_eu gear 
##   <chr>                    <dbl>         <dbl>   <dbl> <fct>
## 1 Mazda RX4                 11.2          112.   1188. 1    
## 2 Mazda RX4 Wag             11.2          112.   1304. 1    
## 3 Hornet Sportabout         12.6          177.   1560. 0    
## 4 Duster 360                16.4          248.   1619. 0

R guide by Ville Langén is licensed under Attribution-ShareAlike 4.0 International