The final lab introduces 1) how to conduct a multiple regression model, 2).how to use dummy variables in the regression model, and 3) how to create regression tables using R. We use four packages in this lab. Load them using the following code:
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjPlot)
Also, run the following code. Otherwise, you will see scientific notations (e.g., 2e16) instead of numbers in the R output.
options(digits=5, scipen=15)
This lab uses the 2009 AuSSa dataset. Download the data file in the course website(iLearn) and put it in your working directory. Then, run the following code:
aus2009 <readRDS("aussa2009.rds")
Using the 2009 AuSSa dataset, this lab investigates gender income gaps in Australia. For this investigation, we will use six variables: gender (sex), education (educyrs), employment status (wrkst), supervisor status (wrksup), working hours per week (wrkhrs) and annual income (rinc). We also add a variable of identification numbers (id) to this dataset. We create a new dataset consisting of only complete cases which have no missing values on the six variables of our choice. The following code performs this task. If you cannot understand the code, review Creating a Dataset of Complete Cases.
income < aus2009 %>%
select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)
income < income %>%
mutate(mark = 0)
income < income %>%
mutate(mark = replace(mark, complete.cases(income), 1))
income < income %>%
filter(mark == 1)
Multiple Regression Analysis
Running regression models is quite simple. We use ‘lm()
’ again but add more independent variables. Thus, ‘model name < lm(dependent ~ independent 1 + independent 2 + independent 3 + …, data = data name)
’ conducts multiple regression analysis. The following code estimates a regression model in which education (educyrs) and working hours per week (wrkhrs) explain the annual income(rinc).
model < lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)
##
## Call:
## lm(formula = rinc ~ educyrs + wrkhrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 81451 16559 2982 17717 88216
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 12938.1 4758.5 2.72 0.0067 **
## educyrs 2668.0 271.5 9.83 <0.0000000000000002 ***
## wrkhrs 737.8 67.2 10.99 <0.0000000000000002 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25600 on 732 degrees of freedom
## Multiple Rsquared: 0.235, Adjusted Rsquared: 0.232
## Fstatistic: 112 on 2 and 732 DF, pvalue: <0.0000000000000002
In the output, the intercept is 12,938.1, which means that the annual income will be  $12,938.1 if people have 0 years of schooling and do not work. The coefficient of educyrs is 2,668, which means that for each additional oneyear increase in years of schooling, annual income will increase by $2,668 while controlling for the effect of working hours per week. This positive effect of education is statistically significant at α = .05 because the pvalue (<0.0000000000000002) is much less than .05. The coefficient of wrkhrs is 737.8, which means that for each additional onehour increase in working hours per week, annual income will increase by $737.8 while controlling for the effect of education. This positive effect of working hours is statistically significant at α = .05 because the pvalue (<0.0000000000000002) is much less than .05. Thus, the regression equation of this model is \(rinc = 2,688 × educyrs + 737.8 × wrkhrs – 12,938.1\). The coefficient of determination (Multiple Rsquared) is 0.235, which means that 23.5% of the total variance in annual incomes is explained by the two independent variables.
Dummy Variable Analysis
Creating a Dummy Variable.
A dummy variable is a binary variable used in regression analysis that represents subgroups (or categories) of the sample in your study. And it is used to estimate the effect of nominal variables. Let’s make a dummy variable of gender. The following code shows two categories of gender (sex): male and female.
frq(income$sex)
##
## # R: Sex (x) <numeric>
## # total N=735 valid N=735 mean=1.53 sd=0.50
##
## val label frq raw.prc valid.prc cum.prc
## 1 Male 346 47.07 47.07 47.07
## 2 Female 389 52.93 52.93 100.00
## NA NA 0 0.00 NA NA
Since we want to investigate how much less women earn compared to men, males should be the reference category which takes 0. Consequently, females should take 1. Thus, let’s recode sex in such a way. In a nutshell, the reference category must take 0 values in dummy variables.
income < rec(income, sex, rec = "1=0;2=1", append = TRUE)
Then, we rename a newly recoded variable as female and set a variable and value label.
income < var_rename(income, sex_r = female)
income$female < set_label(income$female, label = "Gender")
income$female < set_labels(income$female, labels = c("Male" = 0, "Female" = 1))
The most important step is to change this dummy variable as a factor. Otherwise, R will not treat it as a dummy variable in regression analysis.
income$female < to_factor(income$female)
In some cases, you may want to use women instead of men as your reference category. In this case, you can easily change a reference category by using the ‘ref_lvl(variable, lvl = value for the reference category)
’ function. Make sure that the variable used for ‘ref_lvl()
’ should be a variable that is converted to a factor. The following code makes a new variable (male) in which 0 is female, and 1 is male.
income$male < ref_lvl(income$female, lvl = 1)
The following code makes a dummy variable of employment status (wrkst). wrkst has ten categories, which is too many to analyse them. When you look at the frequency table of wrkst, value 4 to 10 can be collapsed into one employment status, ‘not in labour force”. We will use this category as a reference category.
frq(aus2009$wrkst)
##
## # R: Current employment status (x) <numeric>
## # total N=1525 valid N=1488 mean=4.14 sd=3.17
##
## val label frq raw.prc valid.prc cum.prc
## 1 Employedfull time, main job 579 37.97 38.91 38.91
## 2 Employedpart time, main job 206 13.51 13.84 52.76
## 4 Helping family member 25 1.64 1.68 54.44
## 5 Unemployed 23 1.51 1.55 55.98
## 6 Student, school, vocational training 44 2.89 2.96 58.94
## 7 Retired 336 22.03 22.58 81.52
## 8 Housewife, man, home duties 171 11.21 11.49 93.01
## 9 Permanently disabled 71 4.66 4.77 97.78
## 10 Other,not in labour force 33 2.16 2.22 100.00
## NA NA 37 2.43 NA NA
income < rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r < set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r < set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
"Fulltime" = 1,
"Parttime" = 2))
income$wrkst_r < to_factor(income$wrkst_r)
The following code makes a dummy variable of supervisor status (wrksup). Supervisees are treated as a reference category.
frq(income$wrksup)
##
## # R: Supervises others at work (x) <numeric>
## # total N=735 valid N=735 mean=1.49 sd=0.50
##
## val label frq raw.prc valid.prc cum.prc
## 1 Yes 374 50.88 50.88 50.88
## 2 No 361 49.12 49.12 100.00
## NA NA 0 0.00 NA NA
income < rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r < set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r < set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
"Supervisor" = 1))
income$wrksup_r < to_factor(income$wrksup_r)
Assigning Variable Labels
We also assign variable labels to all the other variables of the income dataset.
income$educyrs < set_label(income$educyrs, label = "Education")
income$wrkhrs < set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc < set_label(income$rinc, label = "Annual income")
Dummy Variable Regression Analysis
First, we estimate the annual income only by gender, which shows gender difference in the annual income.
model.4.1 < lm(rinc ~ female, data = income)
summary(model.4.1)
##
## Call:
## lm(formula = rinc ~ female, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 65099 22495 99 19105 58105
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 65099 1485 43.83 <0.0000000000000002 ***
## female1 19204 2042 9.41 <0.0000000000000002 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple Rsquared: 0.108, Adjusted Rsquared: 0.106
## Fstatistic: 88.5 on 1 and 733 DF, pvalue: <0.0000000000000002
In the output, ‘female1’ indicates ‘being a female (when the value of female is 1). Thus, the coefficient of ‘female1’ means that women earn $19,204 less than men. And this female disadvantage is statistically significant (the pvalue is less than .001).
For the comparison, let’s use male instead of female in regression analysis. Note that ‘female’ is the reference category in the male variable.
model.4.1.1 < lm(rinc ~ male, data = income)
summary(model.4.1.1)
##
## Call:
## lm(formula = rinc ~ male, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 65099 22495 99 19105 58105
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 45895 1401 32.76 <0.0000000000000002 ***
## male1 19204 2042 9.41 <0.0000000000000002 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27600 on 733 degrees of freedom
## Multiple Rsquared: 0.108, Adjusted Rsquared: 0.106
## Fstatistic: 88.5 on 1 and 733 DF, pvalue: <0.0000000000000002
In the output, look at the coefficient of ‘male1’. It is 19,204. Note that it is a POSITIVE number. It means that men earn $19,204 MORE than women. This example shows well that regression coefficients change when different categories are set as a reference category and that the interpretation of coefficients for dummy variables should be made in relation to assigned reference categories.
The second model estimates female disadvantage in the annual income while controlling for education.
model.4.2 < lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)
##
## Call:
## lm(formula = rinc ~ female + educyrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 63729 17900 2893 19989 59271
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 24743 4136 5.98 0.0000000034 ***
## female1 19715 1908 10.33 < 0.0000000000000002 ***
## educyrs 2836 274 10.36 < 0.0000000000000002 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 25800 on 732 degrees of freedom
## Multiple Rsquared: 0.222, Adjusted Rsquared: 0.22
## Fstatistic: 104 on 2 and 732 DF, pvalue: <0.0000000000000002
After controlling education, the coefficient of female drops to 19,715, which means that women earn $19,715 less than men who have the same level of education.
The third model estimates female disadvantage in the annual income while controlling for education and working hours per week.
model.4.3 < lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 73473 15943 2091 17612 75984
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 361.3 4917.4 0.07 0.94
## female1 14496.9 1929.2 7.51 0.00000000000016727 ***
## educyrs 2743.1 262.0 10.47 < 0.0000000000000002 ***
## wrkhrs 571.0 68.5 8.34 0.00000000000000036 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24700 on 731 degrees of freedom
## Multiple Rsquared: 0.289, Adjusted Rsquared: 0.287
## Fstatistic: 99.3 on 3 and 731 DF, pvalue: <0.0000000000000002
After controlling education and working hours per week, the coefficient of female reduced to 14,496, which means that women earn $14,496 less than men if they have the same level of education and work as much as men do. This reduction implies that women work less than men, which in turn decreases their annual income.
The fourth model adds another dummy variable, employment status (wrkst_r).
model.4.4 < lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 69347 15734 1928 16941 76334
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 18471.8 7654.2 2.41 0.01606 *
## female1 12781.4 1889.4 6.76 0.000000000027 ***
## educyrs 2551.5 255.7 9.98 < 0.0000000000000002 ***
## wrkhrs 309.9 90.7 3.42 0.00067 ***
## wrkst_r1 35211.0 5827.7 6.04 0.000000002427 ***
## wrkst_r2 22627.1 6079.8 3.72 0.00021 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24000 on 729 degrees of freedom
## Multiple Rsquared: 0.333, Adjusted Rsquared: 0.328
## Fstatistic: 72.8 on 5 and 729 DF, pvalue: <0.0000000000000002
The output shows two coefficients of employment status (wrkst_r). In wrkst_r, 1 means “Fulltime employment”, and 2 means “Parttime employment”. Thus, the coefficient of ‘wrkst_r1’ is for “Fulltime employment and that of ‘wrkst_r2’ is for “Parttime employment”. The interpretation is that fulltime workers earn $35,211 more, and parttime workers earn $22,627 more than those who are not in the labour force (the reference category) while controlling for gender, education and working hours per week. After controlling employment status as well as education and working hours per week, women earn $12,781 less than men. The gender gap is reduced a lot compared to that in the second model, which implies that employment status accounts substantially for female disadvantage in the annual income. Women are less likely to have stable jobs than men, which decrease their annual income.
The fifth model adds a dummy variable of supervisor status, wrksup_r.
model.4.5 < lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)
##
## Call:
## lm(formula = rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r,
## data = income)
##
## Residuals:
## Min 1Q Median 3Q Max
## 71303 14958 1293 16018 77518
##
## Coefficients:
## Estimate Std. Error t value Pr(>t)
## (Intercept) 17601 7563 2.33 0.02023 *
## female1 12462 1868 6.67 0.00000000005 ***
## educyrs 2367 256 9.24 < 0.0000000000000002 ***
## wrkhrs 238 91 2.62 0.00904 **
## wrkst_r1 35738 5758 6.21 0.00000000091 ***
## wrkst_r2 22764 6005 3.79 0.00016 ***
## wrksup_r1 7992 1824 4.38 0.00001346487 ***
## 
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 23700 on 728 degrees of freedom
## Multiple Rsquared: 0.35, Adjusted Rsquared: 0.345
## Fstatistic: 65.4 on 6 and 728 DF, pvalue: <0.0000000000000002
The coefficient of ‘wrksup_r1’ (which means those who supervise others at work) is 7,992. This means that supervisors earn $7,992 more than supervisees while controlling for gender, education, working hours per week and employment status. The coefficient of ‘female1’ is a little bit changed, implying that supervisor status does not contribute much to female disadvantage in the annual income. The regression equation of this model is \(rinc =12,462 × female + 2,367 × educyrs + 238 × wrkhrs + 35,738 × wrkst_r1 + 22,764 × wrkst_r2 +7,992 × wrksup_r1 – 17,601\). Rsquare (Multiple Rsquared) is 0.35, which means that 35% of the total variance in the annual income is explained by the five variables (gender, education, working hours, employment status, and supervisor status).
Creating a Regression Table
We have so far analysed female disadvantage in the annual income using several regression models. How can we present all these regression models in a just single table? Using the ‘tab_model()
’ function would be the easiest way. The following code combines the five regression models we have estimated so far and creates a single regression table.
tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
pred.labels = c("Intercept", "Female (ref. = Male)",
"Education", "Hours worked per week",
"Fulltime (ref. = Not in labour force)",
"Parttime (ref. = Not in labour force)",
"Supervisor (ref. = Supervisee)"),
dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
title = "Gender Effects on Annual Income",
show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "asterisk")
Model 1  Model 2  Model 3  Model 4  Model 5  

Predictors  Estimates  Estimates  Estimates  Estimates  Estimates 
Intercept 
65099.191 ^{***} (1485.239) 
24743.249 ^{***} (4135.493) 
361.329 ^{} (4917.426) 
18471.756 ^{*} (7654.244) 
17601.103 ^{*} (7563.033) 
Female (ref. = Male) 
19204.178 ^{***} (2041.574) 
19714.886 ^{***} (1908.532) 
14496.910 ^{***} (1929.191) 
12781.385 ^{***} (1889.367) 
12461.921 ^{***} (1867.632) 
Education 
2835.734 ^{***} (273.737) 
2743.087 ^{***} (261.986) 
2551.511 ^{***} (255.720) 
2366.974 ^{***} (256.071) 

Hours worked per week 
571.010 ^{***} (68.454) 
309.875 ^{***} (90.654) 
238.257 ^{**} (91.022) 

Fulltime (ref. = Not in labour force) 
35211.005 ^{***} (5827.722) 
35737.917 ^{***} (5757.544) 

Parttime (ref. = Not in labour force) 
22627.062 ^{***} (6079.790) 
22764.274 ^{***} (6005.349) 

Supervisor (ref. = Supervisee) 
7992.262 ^{***} (1823.718) 

Observations  735  735  735  735  735 
R^{2} / R^{2} adjusted  0.108 / 0.106  0.222 / 0.220  0.289 / 0.287  0.333 / 0.328  0.350 / 0.345 

In the code, you must list the name of regression models that will be included in a regression table. ‘digits = value
’ specifies the decimal places for numbers in the table. ‘pred.labels = c()
’ sets the labels of independent variables displayed in the first column of the table. ‘dv.labels = c()
’ sets the labels of models displayed in the first row of the table. ‘title = “”
’ sets the title of the table. ‘show.se = TRUE
’ displays the standard errors of coefficients, ‘show.ci = FALSE
’ does not show the confidence intervals of coefficients, and ‘collapse.se = TRUE
’ shows standard errors along with coefficients in the same column. ‘p.style = “asterisk”
’ displays the pvalue in the format of stars.
The R codes you have written so far look like:
################################################################################
# Lab 11: Multiple Regression
# 3/06/2019
# SOC830, SOCI702, SOCX830
################################################################################
# Load packages
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjPlot)
# Avoid scientific notation
options(digits=5, scipen=15)
# Load dataset
aus2009 < readRDS("aussa2009.RDS")
# Create a dataset of complete cases
income < aus2009 %>%
select(id, sex, educyrs, wrkst, wrksup, wrkhrs, rinc)
income < income %>%
mutate(mark = 0)
income < income %>%
mutate(mark = replace(mark, complete.cases(income), 1))
income < income %>%
filter(mark == 1)
# Multiple regression
model < lm(rinc ~ educyrs + wrkhrs, data = income)
summary(model)
# Make dummy variables
## Gender
frq(income$sex)
income < rec(income, sex, rec = "1=0;2=1", append = TRUE)
income < var_rename(income, sex_r = female) # rename `sex_r` as `female`
income$female < set_label(income$female, label = "Gender")
income$female < set_labels(income$female, labels = c("Male" = 0, "Female" = 1))
income$female < to_factor(income$female) # convert into a factor for dummy
income$male < ref_lvl(income$female, lvl = 1) # set females as a reference
## Current employment status
frq(aus2009$wrkst)
income < rec(income, wrkst, rec = "1=1; 2=2; 4:10=0", append = TRUE)
income$wrkst_r < set_label(income$wrkst_r, label = "Current employment status")
income$wrkst_r < set_labels(income$wrkst_r, labels = c("Not in labour force" = 0,
"Fulltime" = 1,
"Parttime" = 2))
income$wrkst_r < to_factor(income$wrkst_r)
## Supervisor
frq(income$wrksup)
income < rec(income, wrksup, rec = "1=1; 2=0", append = TRUE)
income$wrksup_r < set_label(income$wrksup_r, label = "Supervisor jobs")
income$wrksup_r < set_labels(income$wrksup_r, labels = c("Suvervisee" = 0,
"Supervisor" = 1))
income$wrksup_r < to_factor(income$wrksup_r)
## Assign Variable Name
income$educyrs < set_label(income$educyrs, label = "Education")
income$wrkhrs < set_label(income$wrkhrs, label = "Hours worked per week")
income$rinc < set_label(income$rinc, label = "Annual income")
# Dummy variable regression analysis
model.4.1 < lm(rinc ~ female, data = income) # male as the reference
summary(model.4.1)
model.4.1.1 < lm(rinc ~ male, data = income) # female as the reference
summary(model.4.1.1)
model.4.2 < lm(rinc ~ female + educyrs, data = income)
summary(model.4.2)
model.4.3 < lm(rinc ~ female + educyrs + wrkhrs, data = income)
summary(model.4.3)
model.4.4 < lm(rinc ~ female + educyrs + wrkhrs + wrkst_r, data = income)
summary(model.4.4)
model.4.5 < lm(rinc ~ female + educyrs + wrkhrs + wrkst_r + wrksup_r, data = income)
summary(model.4.5)
# Make a regression table
tab_model(model.4.1, model.4.2, model.4.3, model.4.4, model.4.5, digits = 3,
pred.labels = c("Intercept", "Female (ref. = Male)",
"Education", "Hours worked per week",
"Fulltime (ref. = Not in labour force)",
"Parttime (ref. = Not in labour force)",
"Supervisor (ref. = Supervisee)"),
dv.labels = c("Model 1", "Model 2", "Model 3", "Model 4", "Model 5"),
title = "Gender Effects on Annual Income",
show.se = TRUE, show.ci = FALSE, collapse.se = TRUE, p.style = "asterisk")