SOC830 Lab 11: Multiple Regression

The final lab introduces 1) how to conduct a multiple regression model, 2).how to use dummy variables in the regression model, and 3) how to create regression tables using R. We use four packages in this lab. Load them using the following code:

library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjPlot)

Also, run the following code. Otherwise, you will see scientific notations (e.g., 2e-16) instead of numbers in the R output.

options(digits=5, scipen=15)

This lab uses the 2009 AuSSa dataset. Download the data file in the course website(iLearn) and put it in your working directory. Then, run the following code:

aus2009 <-readRDS("aussa2009.rds")

Using the 2009 AuSSa dataset, this lab investigates gender income gaps in Australia. For this investigation, we will use six variables: gender (sex), education (educyrs), employment status (wrkst), supervisor status (wrksup), working hours per week (wrkhrs) and annual income (rinc). We also add a variable of identification numbers (id) to this dataset. We create a new dataset consisting of only complete cases which have no missing values on the six variables of our choice. The following code performs this task. If you cannot understand the code, review Creating a Dataset of Complete Cases.

income <- aus2009 %>%
  select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)

income <- income %>%
  mutate(mark = 0)

income <- income %>%
  mutate(mark = replace(mark, complete.cases(income), 1))

income <- income %>%
  filter(mark == 1)

Multiple Regression Analysis

Running regression models is quite simple. We use ‘lm()’ again but add more independent variables. Thus, ‘model name <- lm(dependent ~ independent 1 + independent 2 + independent 3 + …, data = data name)’ conducts multiple regression analysis. The following code estimates a regression model in which education (educyrs) and working hours per week (wrkhrs) explain the annual income(rinc).

model <- lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)
## 
## Call:
## lm(formula = rinc ~ educyrs + wrkhrs, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -81451 -16559  -2982  17717  88216 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) -12938.1     4758.5   -2.72              0.0067 ** 
## educyrs       2668.0      271.5    9.83 <0.0000000000000002 ***
## wrkhrs         737.8       67.2   10.99 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25600 on 732 degrees of freedom
## Multiple R-squared:  0.235,  Adjusted R-squared:  0.232 
## F-statistic:  112 on 2 and 732 DF,  p-value: <0.0000000000000002

In the output, the intercept is -12,938.1, which means that the annual income will be - $12,938.1 if people have 0 years of schooling and do not work. The coefficient of educyrs is 2,668, which means that for each additional one-year increase in years of schooling, annual income will increase by $2,668 while controlling for the effect of working hours per week. This positive effect of education is statistically significant at α = .05 because the p-value (<0.0000000000000002) is much less than .05. The coefficient of wrkhrs is 737.8, which means that for each additional one-hour increase in working hours per week, annual income will increase by $737.8 while controlling for the effect of education. This positive effect of working hours is statistically significant at α = .05 because the p-value (<0.0000000000000002) is much less than .05. Thus, the regression equation of this model is \(rinc = 2,688 × educyrs + 737.8 × wrkhrs – 12,938.1\). The coefficient of determination (Multiple R-squared) is 0.235, which means that 23.5% of the total variance in annual incomes is explained by the two independent variables.

Dummy Variable Analysis

Creating a Dummy Variable.

A dummy variable is a binary variable used in regression analysis that represents subgroups (or categories) of the sample in your study. And it is used to estimate the effect of nominal variables. Let’s make a dummy variable of gender. The following code shows two categories of gender (sex): male and female.

frq(income$sex)
## 
## # R: Sex (x) <numeric> 
## # total N=735  valid N=735  mean=1.53  sd=0.50
##  
##  val  label frq raw.prc valid.prc cum.prc
##    1   Male 346   47.07     47.07   47.07
##    2 Female 389   52.93     52.93  100.00
##   NA     NA   0    0.00        NA      NA

Since we want to investigate how much less women earn compared to men, males should be the reference category which takes 0. Consequently, females should take 1. Thus, let’s recode sex in such a way. In a nutshell, the reference category must take 0 values in dummy variables.

income <- rec(income, sex, rec = "1=0;2=1", append = TRUE)

Then, we rename a newly recoded variable as female and set a variable and value label.

income <- var_rename(income, sex_r = female)
income$female <- set_label(income$female, label = "Gender")
income$female <- set_labels(income$female, labels = c("Male" = 0, "Female" = 1))

The most important step is to change this dummy variable as a factor. Otherwise, R will not treat it as a dummy variable in regression analysis.

income$female <- to_factor(income$female)

In some cases, you may want to use women instead of men as your reference category. In this case, you can easily change a reference category by using the ‘ref_lvl(variable, lvl = value for the reference category)’ function. Make sure that the variable used for ‘ref_lvl()’ should be a variable that is converted to a factor. The following code makes a new variable (male) in which 0 is female, and 1 is male.

income$male <- ref_lvl(income$female, lvl = 1)

The following code makes a dummy variable of employment status (wrkst). wrkst has ten categories, which is too many to analyse them. When you look at the frequency table of wrkst, value 4 to 10 can be collapsed into one employment status, ‘not in labour force”. We will use this category as a reference category.

frq(aus2009$wrkst)
## 
## # R: Current employment status (x) <numeric> 
## # total N=1525  valid N=1488  mean=4.14  sd=3.17
##  
##  val                                label frq raw.prc valid.prc cum.prc
##    1         Employed-full time, main job 579   37.97     38.91   38.91
##    2         Employed-part time, main job 206   13.51     13.84   52.76
##    4                Helping family member  25    1.64      1.68   54.44
##    5                           Unemployed  23    1.51      1.55   55.98
##    6 Student, school, vocational training  44    2.89      2.96   58.94
##    7                              Retired 336   22.03     22.58   81.52
##    8         Housewife, -man, home duties 171   11.21     11.49   93.01
##    9                 Permanently disabled  71    4.66      4.77   97.78
##   10            Other,not in labour force  33    2.16      2.22  100.00
##   NA                                   NA  37    2.43        NA      NA
income <- rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r <- set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r <- set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
                                                        "Full-time" = 1,
                                                        "Part-time" = 2))
income$wrkst_r <- to_factor(income$wrkst_r)

The following code makes a dummy variable of supervisor status (wrksup). Supervisees are treated as a reference category.

frq(income$wrksup)
## 
## # R: Supervises others at work (x) <numeric> 
## # total N=735  valid N=735  mean=1.49  sd=0.50
##  
##  val label frq raw.prc valid.prc cum.prc
##    1   Yes 374   50.88     50.88   50.88
##    2    No 361   49.12     49.12  100.00
##   NA    NA   0    0.00        NA      NA
income <- rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r <- set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r <- set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
                                                          "Supervisor" = 1))
income$wrksup_r <- to_factor(income$wrksup_r)

Assigning Variable Labels

We also assign variable labels to all the other variables of the income dataset.

income$educyrs <- set_label(income$educyrs, label = "Education")
income$wrkhrs <- set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc <- set_label(income$rinc, label = "Annual income")

Dummy Variable Regression Analysis

First, we estimate the annual income only by gender, which shows gender difference in the annual income.

model.4.1 <- lm(rinc ~ female, data = income)
summary(model.4.1)
## 
## Call:
## lm(formula = rinc ~ female, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65099 -22495    -99  19105  58105 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    65099       1485   43.83 <0.0000000000000002 ***
## female1       -19204       2042   -9.41 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple R-squared:  0.108,  Adjusted R-squared:  0.106 
## F-statistic: 88.5 on 1 and 733 DF,  p-value: <0.0000000000000002

In the output, ‘female1’ indicates ‘being a female (when the value of female is 1). Thus, the coefficient of ‘female1’ means that women earn $19,204 less than men. And this female disadvantage is statistically significant (the p-value is less than .001).

For the comparison, let’s use male instead of female in regression analysis. Note that ‘female’ is the reference category in the male variable.

model.4.1.1 <- lm(rinc ~ male, data = income)
summary(model.4.1.1)
## 
## Call:
## lm(formula = rinc ~ male, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -65099 -22495    -99  19105  58105 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)    45895       1401   32.76 <0.0000000000000002 ***
## male1          19204       2042    9.41 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple R-squared:  0.108,  Adjusted R-squared:  0.106 
## F-statistic: 88.5 on 1 and 733 DF,  p-value: <0.0000000000000002

In the output, look at the coefficient of ‘male1’. It is 19,204. Note that it is a POSITIVE number. It means that men earn $19,204 MORE than women. This example shows well that regression coefficients change when different categories are set as a reference category and that the interpretation of coefficients for dummy variables should be made in relation to assigned reference categories.

The second model estimates female disadvantage in the annual income while controlling for education.

model.4.2 <- lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)
## 
## Call:
## lm(formula = rinc ~ female + educyrs, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -63729 -17900  -2893  19989  59271 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    24743       4136    5.98         0.0000000034 ***
## female1       -19715       1908  -10.33 < 0.0000000000000002 ***
## educyrs         2836        274   10.36 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 25800 on 732 degrees of freedom
## Multiple R-squared:  0.222,  Adjusted R-squared:  0.22 
## F-statistic:  104 on 2 and 732 DF,  p-value: <0.0000000000000002

After controlling education, the coefficient of female drops to -19,715, which means that women earn $19,715 less than men who have the same level of education.

The third model estimates female disadvantage in the annual income while controlling for education and working hours per week.

model.4.3 <- lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)
## 
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -73473 -15943  -2091  17612  75984 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)    361.3     4917.4    0.07                 0.94    
## female1     -14496.9     1929.2   -7.51  0.00000000000016727 ***
## educyrs       2743.1      262.0   10.47 < 0.0000000000000002 ***
## wrkhrs         571.0       68.5    8.34  0.00000000000000036 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24700 on 731 degrees of freedom
## Multiple R-squared:  0.289,  Adjusted R-squared:  0.287 
## F-statistic: 99.3 on 3 and 731 DF,  p-value: <0.0000000000000002

After controlling education and working hours per week, the coefficient of female reduced to -14,496, which means that women earn $14,496 less than men if they have the same level of education and work as much as men do. This reduction implies that women work less than men, which in turn decreases their annual income.

The fourth model adds another dummy variable, employment status (wrkst_r).

model.4.4 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)
## 
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -69347 -15734  -1928  16941  76334 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -18471.8     7654.2   -2.41              0.01606 *  
## female1     -12781.4     1889.4   -6.76       0.000000000027 ***
## educyrs       2551.5      255.7    9.98 < 0.0000000000000002 ***
## wrkhrs         309.9       90.7    3.42              0.00067 ***
## wrkst_r1     35211.0     5827.7    6.04       0.000000002427 ***
## wrkst_r2     22627.1     6079.8    3.72              0.00021 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24000 on 729 degrees of freedom
## Multiple R-squared:  0.333,  Adjusted R-squared:  0.328 
## F-statistic: 72.8 on 5 and 729 DF,  p-value: <0.0000000000000002

The output shows two coefficients of employment status (wrkst_r). In wrkst_r, 1 means “Full-time employment”, and 2 means “Part-time employment”. Thus, the coefficient of ‘wrkst_r1’ is for “Full-time employment and that of ‘wrkst_r2’ is for “Part-time employment”. The interpretation is that full-time workers earn $35,211 more, and part-time workers earn $22,627 more than those who are not in the labour force (the reference category) while controlling for gender, education and working hours per week. After controlling employment status as well as education and working hours per week, women earn $12,781 less than men. The gender gap is reduced a lot compared to that in the second model, which implies that employment status accounts substantially for female disadvantage in the annual income. Women are less likely to have stable jobs than men, which decrease their annual income.

The fifth model adds a dummy variable of supervisor status, wrksup_r.

model.4.5 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)
## 
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, 
##     data = income)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -71303 -14958  -1293  16018  77518 
## 
## Coefficients:
##             Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)   -17601       7563   -2.33              0.02023 *  
## female1       -12462       1868   -6.67        0.00000000005 ***
## educyrs         2367        256    9.24 < 0.0000000000000002 ***
## wrkhrs           238         91    2.62              0.00904 ** 
## wrkst_r1       35738       5758    6.21        0.00000000091 ***
## wrkst_r2       22764       6005    3.79              0.00016 ***
## wrksup_r1       7992       1824    4.38        0.00001346487 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 23700 on 728 degrees of freedom
## Multiple R-squared:  0.35,   Adjusted R-squared:  0.345 
## F-statistic: 65.4 on 6 and 728 DF,  p-value: <0.0000000000000002

The coefficient of ‘wrksup_r1’ (which means those who supervise others at work) is 7,992. This means that supervisors earn $7,992 more than supervisees while controlling for gender, education, working hours per week and employment status. The coefficient of ‘female1’ is a little bit changed, implying that supervisor status does not contribute much to female disadvantage in the annual income. The regression equation of this model is \(rinc =-12,462 × female + 2,367 × educyrs + 238 × wrkhrs + 35,738 × wrkst_r1 + 22,764 × wrkst_r2 +7,992 × wrksup_r1 – 17,601\). R-square (Multiple R-squared) is 0.35, which means that 35% of the total variance in the annual income is explained by the five variables (gender, education, working hours, employment status, and supervisor status).

Creating a Regression Table

We have so far analysed female disadvantage in the annual income using several regression models. How can we present all these regression models in a just single table? Using the ‘tab_model()’ function would be the easiest way. The following code combines the five regression models we have estimated so far and creates a single regression table.

tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
          pred.labels = c("Intercept", "Female (ref. = Male)", 
                          "Education", "Hours worked per week", 
                          "Full-time (ref. = Not in labour force)", 
                          "Part-time (ref. = Not in labour force)", 
                          "Supervisor (ref. = Supervisee)"),
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
          title = "Gender Effects on Annual Income",
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "asterisk")
Gender Effects on Annual Income
  Model 1 Model 2 Model 3 Model 4 Model 5
Predictors Estimates Estimates Estimates Estimates Estimates
Intercept 65099.191 ***
(1485.239)
24743.249 ***
(4135.493)
361.329
(4917.426)
-18471.756 *
(7654.244)
-17601.103 *
(7563.033)
Female (ref. = Male) -19204.178 ***
(2041.574)
-19714.886 ***
(1908.532)
-14496.910 ***
(1929.191)
-12781.385 ***
(1889.367)
-12461.921 ***
(1867.632)
Education 2835.734 ***
(273.737)
2743.087 ***
(261.986)
2551.511 ***
(255.720)
2366.974 ***
(256.071)
Hours worked per week 571.010 ***
(68.454)
309.875 ***
(90.654)
238.257 **
(91.022)
Full-time (ref. = Not in labour force) 35211.005 ***
(5827.722)
35737.917 ***
(5757.544)
Part-time (ref. = Not in labour force) 22627.062 ***
(6079.790)
22764.274 ***
(6005.349)
Supervisor (ref. = Supervisee) 7992.262 ***
(1823.718)
Observations 735 735 735 735 735
R2 / R2 adjusted 0.108 / 0.106 0.222 / 0.220 0.289 / 0.287 0.333 / 0.328 0.350 / 0.345
  • p<0.05   ** p<0.01   *** p<0.001

In the code, you must list the name of regression models that will be included in a regression table. ‘digits = value’ specifies the decimal places for numbers in the table. ‘pred.labels = c()’ sets the labels of independent variables displayed in the first column of the table. ‘dv.labels = c()’ sets the labels of models displayed in the first row of the table. ‘title = “”’ sets the title of the table. ‘show.se = TRUE’ displays the standard errors of coefficients, ‘show.ci = FALSE’ does not show the confidence intervals of coefficients, and ‘collapse.se = TRUE’ shows standard errors along with coefficients in the same column. ‘p.style = “asterisk”’ displays the p-value in the format of stars.

The R codes you have written so far look like:

################################################################################
# Lab 11: Multiple Regression
# 3/06/2019
# SOC830, SOCI702, SOCX830
################################################################################
# Load packages
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjPlot)

# Avoid scientific notation
options(digits=5, scipen=15) 

# Load dataset
aus2009 <- readRDS("aussa2009.RDS")

# Create a dataset of complete cases
income <- aus2009 %>%
  select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)

income <- income %>%
  mutate(mark = 0)

income <- income %>%
  mutate(mark = replace(mark, complete.cases(income), 1))

income <- income %>%
  filter(mark == 1)

# Multiple regression
model <- lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)

# Make dummy variables
## Gender
frq(income$sex)

income <- rec(income, sex, rec = "1=0;2=1", append = TRUE)

income <- var_rename(income, sex_r = female) # rename `sex_r` as `female`
income$female <- set_label(income$female, label = "Gender")
income$female <- set_labels(income$female, labels = c("Male" = 0, "Female" = 1))

income$female <- to_factor(income$female) # convert into a factor for dummy

income$male <- ref_lvl(income$female, lvl = 1) # set females as a reference

## Current employment status
frq(aus2009$wrkst)
income <- rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r <- set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r <- set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
                                                        "Full-time" = 1,
                                                        "Part-time" = 2))
income$wrkst_r <- to_factor(income$wrkst_r)

## Supervisor
frq(income$wrksup)
income <- rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r <- set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r <- set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
                                                          "Supervisor" = 1))
income$wrksup_r <- to_factor(income$wrksup_r)

## Assign Variable Name
income$educyrs <- set_label(income$educyrs, label = "Education")
income$wrkhrs <- set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc <- set_label(income$rinc, label = "Annual income")

# Dummy variable regression analysis
model.4.1 <- lm(rinc ~ female, data = income) # male as the reference
summary(model.4.1)

model.4.1.1 <- lm(rinc ~ male, data = income) # female as the reference
summary(model.4.1.1)

model.4.2 <- lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)

model.4.3 <- lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)

model.4.4 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)

model.4.5 <- lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)

# Make a regression table
tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
          pred.labels = c("Intercept", "Female (ref. = Male)", 
            "Education", "Hours worked per week", 
            "Full-time (ref. = Not in labour force)", 
            "Part-time (ref. = Not in labour force)", 
            "Supervisor (ref. = Supervisee)"),
          dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
          title = "Gender Effects on Annual Income",
          show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "asterisk")
Last updated on 01 June, 2019 by Dr Hang Young Lee(hangyoung.lee@mq.edu.au)