SOC830 Lab 4: Recoding Variables

The fourth lab session covers the following:

  • How to see explanations of R codes in RStudio
  • How to import an RDS file
  • How to recode variables
  • How to compute variables
  • How to import other formats of datasets
  • How to export datasets as other formats

We will use three packages for this lab. Load them using the following code:

library(dplyr)
library(sjlabelled)
library(sjmisc) 

How to see explanations of R codes in RStudio

Even R experts do not know every R function. When they come across a new one, they often rely on the R documentation that explains it. An advantage of using RStudio is that you can easily access this documentation. Click on the “Help” tab and type the name of the function you want to learn about in the search box at the top right (see Figure 1). RStudio will then show the R documentation for that function.

Figure 1: First Look of RStudio

The documentation provides not only detailed information on R functions but also example code. Reading it is not easy, especially for novices, but you will become more familiar with it by the end of this course.
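The same documentation can also be opened from the console rather than through the Help tab. The following is a minimal sketch that uses base R's mean() purely as an example topic; any function name can be substituted.

```r
# Open the help page for a function (the same page the Help tab shows)
?mean
help("mean")      # equivalent to ?mean

# Run the examples listed at the bottom of the help page
example(mean)
```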

How to import an RDS file

You saved the AuSSA subsample dataset in RDS format in lab 3. In this lab, we will import that saved data file into R. Use ‘data name <- readRDS("file-name.rds")’ to import it. You will then see the data listed in the Environment tab.

mydata <- readRDS("mydata.rds")
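To see how saving and importing fit together, here is a small round-trip sketch. The toy data frame and the temporary file are made up for illustration only; in the lab itself you read the file you saved in lab 3.

```r
# Hypothetical toy data; your own dataset comes from lab 3
toy <- data.frame(id = 1:3, age = c(66, 72, 59))

# Save to a temporary .rds file, then read it back
path <- tempfile(fileext = ".rds")
saveRDS(toy, file = path)
toy_back <- readRDS(path)

# Check that the object read back is the same as the one saved
identical(toy, toy_back)
```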

Recoding variables

Once you start analysing survey data, you will spend most of your time recoding variables. Researchers rarely use variables exactly as they are provided. Instead, they often customise the values of variables to suit their needs. This lab introduces this critical part of data management.

Creating a Variable of Age Groups

Suppose that we want to investigate how different age groups hold different political attitudes. However, the age variable in mydata is not suitable for this purpose. Thus, we need to make a new age-group variable from the age variable. Table 1 shows the recoding scheme for this task.

Table 1: Recoding Scheme of Age

Old variable (age)    New variable (age_r)
Values                Values    Labels
0 - 19                1         10s
20 - 29               2         20s
30 - 39               3         30s
40 - 49               4         40s
50 - 59               5         50s
60 - 69               6         60s
70 - 79               7         70s
80 - 89               8         80s
90 or more            9         90s

To recode variables, use ‘data name <- rec(data name, variable name, rec = "recoding scheme", append = TRUE)’. ‘append = TRUE’ means that R will append the newly recoded variable to the dataset. For example, the following code recodes the age variable in mydata and generates a new age-group variable named age_r.

mydata <- rec(mydata, age, rec = "min:19 = 1; 20:29 = 2; 30:39 = 3; 40:49 = 4;
              50:59 = 5; 60:69 = 6; 70:79 = 7; 80:89 = 8; 90:max = 9", 
              append = TRUE)

Let me explain more about the “recoding scheme” in the above code. ‘a:b’ means all values from a to b. For example, ‘min:19’ means all the numbers from the minimum value to 19. In the recoding scheme, we specify how the values of the old variable are converted into the values of the new variable. The left side of each equals sign (=) holds the values of the old variable, and the right side holds the value of the new variable. For example, ‘min:19 = 1’ means that all the values from the minimum value to 19 will be converted into 1. A semicolon (;) separates the individual recoding rules.
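The syntax is easier to see on a handful of values. The sketch below applies rec() to a short, made-up vector of ages with an illustrative three-group scheme (not the nine-group scheme of Table 1), and then places old and new values side by side.

```r
library(sjmisc)

# Hypothetical ages chosen to sit on the boundaries of the scheme
age <- c(15, 19, 20, 29, 30, 64, 95)

# Illustrative three-group scheme: up to 29, 30 to 64, 65 and over
age_3 <- rec(age, rec = "min:29 = 1; 30:64 = 2; 65:max = 3")

# Put old and new values side by side to verify the boundaries
data.frame(age, age_3)
```

Checking a few boundary values like this is a quick way to catch a mistyped range before recoding the real dataset.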

Running this code creates a new variable named “age_r”. Next, we assign a variable label and value labels to this new variable by running the following code:

mydata$age_r <- set_label(mydata$age_r, label = "Age Category")
mydata$age_r <- set_labels(mydata$age_r, labels = c("10s" = 1,
                                                    "20s" = 2,
                                                    "30s" = 3,
                                                    "40s" = 4,
                                                    "50s" = 5,
                                                    "60s" = 6,
                                                    "70s" = 7,
                                                    "80s" = 8,
                                                    "90s" = 9))

Then, let us check the new variable by making a frequency table of it.

frq(mydata$age_r)
## 
## # Age Category (x) <numeric> 
## # total N=30  valid N=30  mean=4.77  sd=1.92
##  
##  val label frq raw.prc valid.prc cum.prc
##    1   10s   2    6.67      6.67    6.67
##    2   20s   1    3.33      3.33   10.00
##    3   30s   5   16.67     16.67   26.67
##    4   40s   5   16.67     16.67   43.33
##    5   50s   6   20.00     20.00   63.33
##    6   60s   6   20.00     20.00   83.33
##    7   70s   3   10.00     10.00   93.33
##    8   80s   1    3.33      3.33   96.67
##    9   90s   1    3.33      3.33  100.00
##   NA    NA   0    0.00        NA      NA
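Besides a frequency table, the labels themselves can be read back directly. This is a minimal sketch on a made-up vector, assuming the get_label() and get_labels() functions from sjlabelled; the label texts are placeholders.

```r
library(sjlabelled)

# Hypothetical recoded values covering every category
x <- c(1, 2, 3, 3, 1)

x <- set_label(x, label = "3-category example")
x <- set_labels(x, labels = c("Low" = 1, "Mid" = 2, "High" = 3))

get_label(x)    # the variable label
get_labels(x)   # the value labels
```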

Creating a new political orientation variable

Suppose that we want to use a political orientation variable that consists of three categories: left, center, and right. The polorient variable in mydata does not fit this purpose well because it records more detailed information than we need. However, we can make a new variable that serves our purpose by recoding polorient. Table 2 shows the recoding scheme for this task.

Table 2: Recoding Scheme of Political Orientation

Old variable (polorient)    New variable (polorient_r)
Values    Labels            Values    Labels
1         Far left          1         Left
2         Left              1         Left
3         Center            2         Center
4         Right             3         Right
5         Far right         3         Right

The following code recodes the polorient variable in mydata and generates a new political orientation variable named polorient_r.

mydata <- rec(mydata, polorient, rec = "1:2 = 1; 3 = 2; 4:5 = 3", append = TRUE)

Then, we will assign the variable label and value labels to polorient_r.

mydata$polorient_r <- set_label(mydata$polorient_r, 
                                label = "3-category Political Orientation")
mydata$polorient_r <- set_labels(mydata$polorient_r, labels = c("Left" = 1,
                                                                "Center" = 2,
                                                                "Right" = 3))

The final step is to check the recoded variable.

frq(mydata$polorient_r)
## 
## # 3-category Political Orientation (x) <numeric> 
## # total N=30  valid N=30  mean=1.90  sd=0.99
##  
##  val  label frq raw.prc valid.prc cum.prc
##    1   Left  16   53.33     53.33   53.33
##    2 Center   1    3.33      3.33   56.67
##    3  Right  13   43.33     43.33  100.00
##   NA     NA   0    0.00        NA      NA
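When several old categories are collapsed into one, a cross-tabulation of old against new values is a useful extra check. The sketch below uses a made-up vector of five-point responses rather than mydata.

```r
library(sjmisc)

# Hypothetical five-point responses
polorient <- c(1, 2, 3, 4, 5, 3, 1)
polorient_3 <- rec(polorient, rec = "1:2 = 1; 3 = 2; 4:5 = 3")

# Rows are old values, columns are new values;
# each row should have counts in exactly one column
table(old = polorient, new = polorient_3)
```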

Creating a new social class variable

Suppose that we want to make a new variable which consists of three classes: lower, middle and upper class. We can make this variable by recoding class. Table 3 shows the recoding scheme for this task.

Table 3: Recoding Scheme of Social Class

Old variable (class)            New variable (class_r)
Values    Labels                Values    Labels
1         Lower class           1         Lower class
2         Working class         1         Lower class
3         Lower middle class    2         Middle class
4         Middle class          2         Middle class
5         Upper middle class    2         Middle class
6         Upper class           3         Upper class

Using the following code, recode class into class_r.

mydata <- rec(mydata, class, rec = "1:2 = 1; 3:5 = 2; 6 = 3", append = TRUE)

Then, we will assign the variable label and value labels to class_r.

mydata$class_r <- set_label(mydata$class_r, 
                            label = "3-category Social Class")
mydata$class_r <- set_labels(mydata$class_r, labels = c("Lower class" = 1,
                                                        "Middle class" = 2,
                                                        "Upper class" = 3))

Finally, check the recoded variable.

frq(mydata$class_r)
## 
## # 3-category Social Class (x) <numeric> 
## # total N=30  valid N=30  mean=1.87  sd=0.43
##  
##  val        label frq raw.prc valid.prc cum.prc
##    1  Lower class   5   16.67     16.67   16.67
##    2 Middle class  24   80.00     80.00   96.67
##    3  Upper class   1    3.33      3.33  100.00
##   NA           NA   0    0.00        NA      NA

Renaming variables

The names of recoded variables are automatically assigned as ‘variable name_r’. However, you may want to change variable names. In this case, use ‘data name <- var_rename(data name, current name = "new name", ...)’. For example, the following code changes age_r, polorient_r and class_r into age_gr, polorient_3 and class_3, respectively.

mydata <- var_rename(mydata, age_r = "age_gr", polorient_r = "polorient_3", 
                     class_r = "class_3")
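As a quick illustration, the following sketch renames two columns of a made-up data frame and prints the resulting names; the toy data frame is hypothetical.

```r
library(sjmisc)

# Hypothetical data frame with two recoded variables
toy <- data.frame(age_r = c(1, 2), class_r = c(3, 1))

# current name on the left, new name (in quotes) on the right
toy <- var_rename(toy, age_r = "age_gr", class_r = "class_3")
names(toy)
```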

Computing variables

Another way to make a new variable is to compute variables. It is useful especially when the relationship between old and new variables can be expressed in mathematical equations.

For example, let us make a birth year variable. The relationship between birth year and age is \(\text{birth year} = 2019 - \text{age}\). Using this equation, we can create the birth year variable with ‘data name <- data name %>% mutate(new variable name = equation)’. The following code also assigns the variable label.

mydata <- mydata %>%
  mutate(b_year = 2019 - age)
mydata$b_year <- set_label(mydata$b_year, label = "Year of Birth")
mydata$b_year
##  [1] 1953 1947 1960 1999 1951 1943 1958 1929 1955 1980 1962 1972 1963 1968
## [15] 1985 2001 2001 1989 1954 1984 1975 1979 1962 1979 1960 1937 1975 1989
## [29] 1942 1959
## attr(,"label")
## [1] "Year of Birth"
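The arithmetic can be checked by hand on a few cases. In this sketch the ages are made up; inside mutate(), columns of the data can be referred to by their bare names.

```r
library(dplyr)

# Hypothetical ages
toy <- data.frame(age = c(66, 72, 20))

toy <- toy %>%
  mutate(b_year = 2019 - age)

# 2019 - 66 = 1953, 2019 - 72 = 1947, 2019 - 20 = 1999
toy$b_year
```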

Removing variables

If you want to remove unnecessary variables from the data, use the ‘remove_var()’ function. The general form is ‘data name <- data name %>% remove_var(variable name)’. You can remove multiple variables at once with ‘remove_var(var 1, var 2, var 3, var 4, ...)’. The following code removes b_year.

mydata <- mydata %>%
  remove_var(b_year)

In this code, ‘%>%’ is called a pipe. Think of it simply as “then”. Thus, the code means: take mydata, then remove the variable b_year; the result becomes the new mydata.
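To make the “then” reading concrete, the sketch below writes the same operation with and without the pipe on a made-up data frame and confirms that both give the same result.

```r
library(dplyr)
library(sjmisc)

# Hypothetical data frame
toy <- data.frame(id = 1:2, age = c(66, 72), b_year = c(1953, 1947))

piped  <- toy %>% remove_var(b_year)   # "take toy, then remove b_year"
nested <- remove_var(toy, b_year)      # the same call written without the pipe

identical(piped, nested)
names(piped)
```

The pipe passes whatever is on its left as the first argument of the function on its right, which is why the two forms are interchangeable.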

Making datasets compact (Optional)

I recommend converting your dataset into a tibble (a variant of the data frame) because it makes life a little easier, especially when you work with large datasets (for more information, see the tibble documentation). You can do this with ‘data name <- as_tibble(data name)’. For example, we will convert mydata into a tibble using the following code:

mydata <- as_tibble(mydata)
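The difference between the two formats shows up mainly when printing. A minimal sketch on a made-up data frame, using as_tibble() as made available by dplyr:

```r
library(dplyr)

# Hypothetical data frame
toy <- data.frame(id = 1:3, age = c(66, 72, 59))
toy_tbl <- as_tibble(toy)

class(toy)       # "data.frame"
class(toy_tbl)   # "tbl_df" "tbl" "data.frame"
toy_tbl          # printing shows the dimensions and each column's type
```

A tibble prints only the first rows and the columns that fit on screen, which is what makes it convenient for large datasets.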

Then, save your dataset again. This time I use a file name different from what I used in lab 3 so that I can keep all the dataset files I have worked on so far.

saveRDS(mydata, file = "mydata-2.rds")

How to import other formats of datasets

Public datasets are often not provided in an R-native format. Normally, they are offered in either SPSS or STATA format. To import SPSS-format datasets (.sav), use the ‘read_spss()’ function. Download the example SPSS dataset (spss-example.sav), put it in your working folder, and run the following code:

spss <- read_spss("spss-example.sav")
spss <- as_tibble(spss)

To import STATA-format datasets (.dta), use the ‘read_stata()’ function. Download the example STATA dataset (stata-example.dta), put it in your working folder, and run the following code:

stata <- read_stata("stata-example.dta")
stata <- as_tibble(stata)
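If the example files are not at hand, the import step can still be tried on a file you create yourself. The sketch below is only an illustration under stated assumptions: it assumes the haven package is installed, uses haven::write_sav() to write a made-up data frame to a temporary .sav file, and then imports it with read_spss().

```r
library(sjlabelled)

# Hypothetical toy data with a variable label on one column
toy <- data.frame(id = 1:3, age = c(66, 72, 59))
toy$age <- set_label(toy$age, label = "Age in years")

# Write an SPSS file with haven (assumed to be installed), then import it
path <- tempfile(fileext = ".sav")
haven::write_sav(toy, path)
spss_toy <- read_spss(path)

names(spss_toy)
get_label(spss_toy$age)   # inspect the variable label after import
```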

The R code you have written so far looks like this:

################################################################################
# Title: Lab 4
# Course: SOC830 & SOCI702 & SOCX830
# Date: 25/03/2019
################################################################################

# Load packages
library(dplyr)
library(sjlabelled)
library(sjmisc)

# Import an RDS dataset
mydata <- readRDS("mydata.rds")

# Recode variables
## Create age groups
mydata <- rec(mydata, age, rec = "min:19 = 1; 20:29 = 2; 30:39 = 3; 40:49 = 4;
              50:59 = 5; 60:69 = 6; 70:79 = 7; 80:89 = 8; 90:max = 9", 
              append = TRUE)
mydata$age_r <- set_label(mydata$age_r, label = "Age Category")
mydata$age_r <- set_labels(mydata$age_r, labels = c("10s" = 1,
                                                    "20s" = 2,
                                                    "30s" = 3,
                                                    "40s" = 4,
                                                    "50s" = 5,
                                                    "60s" = 6,
                                                    "70s" = 7,
                                                    "80s" = 8,
                                                    "90s" = 9))
frq(mydata$age_r)

## Create 3-category political orientation
mydata <- rec(mydata, polorient, rec = "1:2 = 1; 3 = 2; 4:5 = 3", append = TRUE)
mydata$polorient_r <- set_label(mydata$polorient_r, 
                                label = "3-category Political Orientation")
mydata$polorient_r <- set_labels(mydata$polorient_r, labels = c("Left" = 1,
                                                                "Center" = 2,
                                                                "Right" = 3))
frq(mydata$polorient_r)

## Create 3-category social class
mydata <- rec(mydata, class, rec = "1:2 = 1; 3:5 = 2; 6 = 3", append = TRUE)
mydata$class_r <- set_label(mydata$class_r, 
                            label = "3-category Social Class")
mydata$class_r <- set_labels(mydata$class_r, labels = c("Lower class" = 1,
                                                        "Middle class" = 2,
                                                        "Upper class" = 3))
frq(mydata$class_r)

# Rename variables
mydata <- var_rename(mydata, age_r = "age_gr", polorient_r = "polorient_3", 
                     class_r = "class_3")

# Compute variables
mydata <- mydata %>%
  mutate(b_year = 2019 - age)
mydata$b_year <- set_label(mydata$b_year, label = "Year of Birth")
mydata$b_year

# Remove variables
mydata <- mydata %>%
  remove_var(b_year)

# Make the dataset compact
mydata <- as_tibble(mydata)

# Save the data file
saveRDS(mydata, file = "mydata-2.rds")

# Import other formats of datasets
## Import SPSS datasets
spss <- read_spss("spss-example.sav")
spss <- as_tibble(spss)

## Import STATA datasets
stata <- read_stata("stata-example.dta")
stata <- as_tibble(stata)

Last updated on 24 May, 2019 by Dr Hang Young Lee (hangyoung.lee@mq.edu.au)