Learning Objectives: Lesson 8.1 + 8.2 
By the end of lessons 8.1 + 8.2, students should be able to: (1) understand the theory behind; (2) run the R code for; and (3) interpret the R output for:

0. Code to run to set up your computer.
# Update Packages
update.packages(ask = FALSE, repos='https://cran.csiro.au/', dependencies = TRUE)
# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(sjPlot)
# Turn off scientific notation
options(digits=3, scipen=8)
# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
if ("data.frame" %in% class(x)) { RStudioView(x[1:500, ]) } else { RStudioView(x) }
}
# Datasets
# Example 1: Crime Dataset
lga <- readRDS(url("https://methods101.com/data/nswlgacrimeclean.RDS"))
# Example 2: AuSSA Dataset
aus2012 <- readRDS(url("https://mqsociology.github.io/learnr/soci832/aussa2012.RDS"))
# Example 3: Australian Electoral Survey
aes_full <- readRDS(gzcon(url("https://mqsociology.github.io/learnr/soci832/aes_full.rds")))
# Example 4: AES 2013, reduced
elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))
1. Linear Regression
So far we have looked at the relationship between two variables with analytical techniques such as correlations and comparisons of means.
We have also looked at the relationship between three or more variables with analytical techniques like factor analysis and measures of the reliability of scales (like Cronbach’s Alpha). These allowed us to see whether a set of three or more variables is ‘moving together’, that is, whether the variables are measuring some common, unobserved variable.
However, in much of social science, what we are concerned with is explaining an outcome. In most social science papers, we try to explain (or predict) an outcome (dependent variable), and report the impact of many different potential causal variables (independent variables).
A linear regression is one model, and probably the most popular model, for exploring the impact of multiple independent variables on a single dependent variable.
A linear regression has a number of fundamental characteristics, including:
- there is one outcome (dependent) variable;
- the outcome variable is continuous or interval (i.e. not a binary or categorical variable);
- the predictor (independent) variables are either binary (0/1) or continuous (or interval); they are not categorical; and
- the model of the outcome variable is a linear combination of the independent variables, which means that the equation for a linear regression is:
\(\begin{aligned} y = &b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + c + e \\ \\ \text{Where:} \\ y=&\text{ dependent variable (outcome variables)} \\ x_1 , x_2 , \text{ ... }x_n = &\text{ independent variables (predictor variables)} \\ b_1 , b_2 , \text{ ... }b_n = &\text{ slope of relationship between }y\text{ and }x_1 , x_2 , \text{ ... }x_n \\ c = &\text{ intercept (i.e. value of }y \text{ when }x_1 , x_2 , \text{ ... }x_n=0 \text{)} \\ e = &\text{ error (i.e. that part of }y \text{ not explained by model)} \\ \end{aligned}\)
1.1 A Silly Example: Predicting Visits to the Rest Room
Let’s use a silly example, because it is memorable.
Say we are interested in knowing how many times a patron of a bar, pub, or night club visits the rest room in an evening.
And let’s also suppose that we have some ethical way to measure this and collect the data on 100 people who are in this busy bar on a Saturday night.
In this case, our dependent variable is ‘visits to the rest room’, and it might vary between 0 and 10 for our 100 people.
And the question we ask ourselves is: what causes or predicts the number of visits?
Well, we might think that there are two important variables:
- the number of hours the person has been in the bar; and
- the number of drinks the person has had at the bar.
To express this as a line we could say that, on average:
\(\begin{aligned} \text{ rest_room_visits } = &b_1 \text{ hours_at_bar } + b_2 \text{ number_of_drinks } + c\\ \\ \text{Where:} \\ b_1 = &\text{ slope of relationship between hours_at_bar and rest_room_visits } \\ b_2 = &\text{ slope of relationship between number_of_drinks and rest_room_visits} \\ c = &\text{ the constant or y-intercept, which is the number of rest_room_visits (on average) } \\ &\text{ for someone with zero for the two other variables, i.e. they just arrived at the bar } \\ &\text{ (hours_at_bar = 0) and have not had a drink yet (number_of_drinks = 0) } \\ \end{aligned}\)
Now, if we want to work out what \(b_1\), \(b_2\), and \(c\) are, we would first need to get a whole lot of data. Well, we are lucky, because we have collected data from 100 people on a Saturday night at a local Irish pub.
The data looks like this, with four columns:
| Patron | rest_room_visits | hours_at_bar | number_of_drinks |
|--------|------------------|--------------|------------------|
| 01     | 3                | 1            | 4                |
| 02     | 1                | 0.5          | 1                |
| 03     | 1                | 2            | 1                |
| 04     | 6                | 4            | 4                |
| 05     | 9                | 5            | 6                |
| 06     | 2                | 3            | 2                |
| 07     | 1                | 0            | 1                |
| 08     | 0                | 2            | 0                |
| 09     | 15               | 9            | 13               |
| 10     | 4                | 2            | 3                |

… AND SO ON FOR 100 people …
We can ignore the first column (patron), which is just an identifying column.
Now we could, if we wanted to, plot each person on a chart, where the y-axis is visits, and the x-axis is one of the predictor/independent variables (e.g. hours).
If we did this we would have a very messy graph, but there would be a general tendency for visits to go up as the number of hours at the pub went up, and as the number of drinks went up.
In a linear regression, the statistical software calculates the ‘line of best fit’, meaning the line that is as close as possible to all persons (all the dots on our graph).
This line of best fit can then be described with the three coefficients discussed earlier: \(b_1\), \(b_2\), and \(c\).
\(b_1\) and \(b_2\) are ‘slopes’, which you would have learnt about in Year 8 mathematics. Remember the equation \(y = bx + c\)? Well, \(b\) is the slope and \(c\) is the intercept, and it is the same in linear regression models.
Exactly how statistical packages calculate the line of best fit is not our concern, and it involves some complex maths. What matters to us as social scientists is that, so long as basic assumptions about the nature of the data are true, and the statistical package is sound, these models can estimate the relationship between various predictors (independent variables) and an outcome (dependent variable) we care about.
So let’s say we did analyse our data, and R told us the values for the coefficients were:
- \(b_1 = 0.5\)
- \(b_2 = 0.8\)
- \(c = 0.2\)
What does this mean?
It means that, on average:
- for each hour a person spends at the bar, they will visit the rest room 0.5 times;
- for each drink they have, they will visit the rest room 0.8 times; and
- someone who stays for zero hours and has no drinks will visit the rest room 0.2 times.
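One way to make these coefficients concrete is to simulate data that obeys exactly this equation and confirm that lm() recovers them. This is only a sketch with invented data; no real bar patrons were measured, and the variable names simply follow the equation above:

```r
# Simulate 100 hypothetical patrons whose visits follow the model above
set.seed(42)
hours_at_bar     <- runif(100, min = 0, max = 8)   # time spent at the bar
number_of_drinks <- rpois(100, lambda = 3)         # drinks consumed
rest_room_visits <- 0.5 * hours_at_bar +           # b1 = 0.5
                    0.8 * number_of_drinks +       # b2 = 0.8
                    0.2 +                          # c  = 0.2
                    rnorm(100, sd = 0.5)           # e: random error
silly_model <- lm(rest_room_visits ~ hours_at_bar + number_of_drinks)
coef(silly_model)  # estimates should land close to 0.2, 0.5 and 0.8
```

Because the error term is random, the estimated coefficients will be close to, but not exactly, the true values.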
Exercise: Make your own mental model of ‘visits to the rest room’ 
We will now do an exercise on the white board where we discuss and come up with our own ‘best guess’ of a linear model of visits to the rest room. Questions:

2. Linear Regression Example: Political Knowledge
Let’s run a model in R ourselves now, and see how it is done with a real dataset: the dataset from McAllister 2016.
Let’s start with a very simple model of political knowledge, looking at the impact of being 18-24 years of age (or not) on political knowledge.
Now remember, our political knowledge scale goes from 0 to 10, and each one point represents one quiz question the person got right (out of a total of 10 questions). The independent variable we are going to use is a binary variable (also called a dummy variable) which is equal to 1 if the person is aged 18-24 years, and zero otherwise.
To run a linear regression model, we need to run two lines of code. The first runs the model, with the command ‘lm()’. Inside the brackets we put the dependent variable, then a tilde (~), and then the independent variable/s (if we have multiple independent variables, we separate them with a plus (+) sign). We put the name of the dataset at the end of the set of arguments.
NOTE: In this case, we are using piping with %>%. The meaning of this piping is explained after the hash symbols on each line.
We then send the output of this line to a summary command to give us a very useful summary of the model.
When you are ready, run the following command:
elect_2013 %>% # Take this data frame and PUT it...
stats::lm(pol_knowledge ~ age_18_24, data = .) %>% # Into this model. Data frame will appear where the full stop is (.)
base::summary() # and then we put the output of lm() into summary()
##
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24, data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -4.855 -2.541  0.145  2.145  6.459
##
## Coefficients:
##             Estimate Std. Error t value       Pr(>|t|)
## (Intercept)   4.8545     0.0479  101.31        < 2e-16 ***
## age_18_24    -1.3140     0.1996   -6.58 0.000000000052 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.89 on 3849 degrees of freedom
## (104 observations deleted due to missingness)
## Multiple R-squared: 0.0111, Adjusted R-squared: 0.0109
## F-statistic: 43.3 on 1 and 3849 DF, p-value: 0.000000000052
Now this code might look confusing, but with some practice reading it, it will become easy.
Remember the rule “First look at the pvalue”.
However, for models (e.g. regression models like this), we aren’t that interested in the p-value for the model overall. What you should look at is the p-value for each coefficient. In R these p-values are listed under the heading “Pr(>|t|)”, which stands for the probability of observing a t value at least this large (in absolute terms) if the true coefficient were zero.
Anyway, so we look at the two numbers under “Pr(>|t|)”, and they are both very small numbers, which means that p < 0.05, and so they are significant and we can interpret the coefficients.
What do the coefficients say? Well, let’s start with the first, which is just called “(Intercept)”. This is the \(c\) we discussed in the earlier section, as in the value of y when all the x’s are equal to zero.
In this case, the intercept has a value of 4.85, which means that people who have a zero for the 18-24 year old variable (i.e. everyone else) have a mean score on the political knowledge quiz of 4.85.
And now let’s look at the second coefficient, for the first variable, which is -1.31. This means persons who are 18-24 years of age get, on average, 1.31 fewer questions correct in the quiz than older persons.
So, in essence, this model says that people aged 25 and over get on average 4.85 quiz questions right, and 18-24 year olds get 3.54 questions right.
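You can check this arithmetic yourself: with a single binary (dummy) predictor, the intercept is just the mean of the group coded 0, and the coefficient is the difference between the two group means. A minimal sketch with made-up toy numbers (not the AES data):

```r
# Toy data: 'young' is a dummy variable, 'score' a quiz score
toy <- data.frame(young = c(1, 1, 0, 0, 0, 0),
                  score = c(3, 4, 5, 6, 4, 5))
m <- lm(score ~ young, data = toy)
coef(m)["(Intercept)"]  # mean score of the young == 0 group: 5
coef(m)["young"]        # difference in group means: 3.5 - 5 = -1.5
```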
3. R-squared and Adjusted R-squared
What else can this model tell us? Well, towards the bottom of the output you can see that it gives a value for the R-squared and the Adjusted R-squared. We were introduced to the notion of R-squared in the last set of notes (Week 2, Part 2: correlation), because R-squared is literally the square of Pearson’s r. We learnt there that R-squared is a measure of effect size, and that it represents the proportion (percentage) of variation in the dependent variable that MIGHT be able to be explained by the independent variable/s.
We also said that, roughly speaking, we can look at R-squared and say that:
- a small effect size is 0.01-0.04 (i.e. 1%-4%);
- a medium effect size is 0.09-0.16 (i.e. 9%-16%); and
- a large effect size is 0.25+ (i.e. 25%+).
So given this, what does the output for our regression model say? It says that the Multiple R-squared is 0.01. So basically the difference between 18-24 year olds and the rest of the population explains about 1% of the variation in political knowledge, at best!
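With a single predictor you can verify the ‘literally the square of Pearson’s r’ claim directly, again with made-up toy numbers rather than the AES data:

```r
# For a one-predictor model, Multiple R-squared equals Pearson's r squared
toy <- data.frame(x = c(1, 1, 0, 0, 0, 0),
                  y = c(3, 4, 5, 6, 4, 5))
fit <- lm(y ~ x, data = toy)
summary(fit)$r.squared  # Multiple R-squared
cor(toy$x, toy$y)^2     # the same number: Pearson's r, squared
```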
3.1 Example: A more complex model
So let’s now make a more complex model of political engagement. I will just write the model out below. You should understand its meaning:
elect_2013 %>%
stats::lm(pol_knowledge ~ age_18_24
+ internet_skills
+ female
+ tertiary_ed
+ income_quintiles
+ aust_born
+ interest_pol, data = .) %>%
base::summary()
##
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills +
## female + tertiary_ed + income_quintiles + aust_born + interest_pol,
## data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -7.052 -1.819  0.014  1.749  8.047
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        7.2572     0.1675   43.34  < 2e-16 ***
## age_18_24         -0.7244     0.1876   -3.86  0.00011 ***
## internet_skills    0.1172     0.0358    3.27  0.00107 **
## female            -0.7485     0.0835   -8.97  < 2e-16 ***
## tertiary_ed        0.8749     0.0988    8.85  < 2e-16 ***
## income_quintiles   0.1304     0.0334    3.90 0.000098 ***
## aust_born          0.0431     0.0952    0.45  0.65064
## interest_pol      -1.5622     0.0537  -29.11  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 3423 degrees of freedom
## (524 observations deleted due to missingness)
## Multiple R-squared: 0.293, Adjusted R-squared: 0.291
## F-statistic: 202 on 7 and 3423 DF, p-value: <2e-16
If you run this model and look at the output you will see that it is a much better model, with nearly a third of all variation in political knowledge explainable by the model itself. One question you might have is: “Which of these variables is most important?”
This is essentially an effect size question. There are two ways of answering it: the first uses R-squared; the second uses standardised regression coefficients.
We can measure the effect size of a particular variable by looking at the change in the R-squared when we run the model with and without the variable.
For example, let’s run exactly the same model as model 2, but drop interest in politics:
elect_2013 %>%
stats::lm(pol_knowledge ~ age_18_24
+ internet_skills
+ female
+ tertiary_ed
+ income_quintiles
+ aust_born, data = .) %>%
base::summary()
##
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills +
## female + tertiary_ed + income_quintiles + aust_born, data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -6.898 -2.134  0.091  2.145  6.669
##
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)
## (Intercept)        4.2579     0.1472   28.94  < 2e-16 ***
## age_18_24         -1.5296     0.2071   -7.39  1.9e-13 ***
## internet_skills    0.1680     0.0399    4.22  2.6e-05 ***
## female            -1.0899     0.0921  -11.83  < 2e-16 ***
## tertiary_ed        1.1527     0.1097   10.51  < 2e-16 ***
## income_quintiles   0.1630     0.0372    4.38  1.2e-05 ***
## aust_born          0.1016     0.1059    0.96     0.34
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.7 on 3435 degrees of freedom
## (513 observations deleted due to missingness)
## Multiple R-squared: 0.118, Adjusted R-squared: 0.116
## F-statistic: 76.5 on 6 and 3435 DF, p-value: <2e-16
What is the R-squared (or actually the Adjusted R-squared, which is a better measure)? It is around 0.11, meaning that the explanatory power of the model has dropped from explaining around 30% of the variation in political knowledge to only explaining approximately 10%.
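This drop-one comparison can also be done programmatically by extracting adj.r.squared from each model’s summary. Here is a sketch using R’s built-in mtcars dataset, so it runs without downloading anything; the same pattern applies to elect_2013:

```r
# Full model and a reduced model that drops one predictor (hp)
full    <- lm(mpg ~ wt + hp + am, data = mtcars)
reduced <- lm(mpg ~ wt + am,      data = mtcars)
# The fall in adjusted R-squared when hp is dropped is a rough measure
# of hp's unique contribution to explained variation
summary(full)$adj.r.squared - summary(reduced)$adj.r.squared
```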
We can also use a second method to measure effect size: standardized regression coefficients.
4. Standardised Coefficients: Beta
The other method commonly used to measure effect size, and to compare effect sizes across independent variables in a regression model, is ‘standardised regression coefficients’.
At some level these are quite simple to understand. To estimate standardised coefficients, all of the variables are standardised (also called normalised) so that their mean is zero and their standard deviation is 1, and then they are put into the regression model.
The effect of this is to make the coefficients all measured in the same units: each coefficient represents the number of standard deviations of change in y (the dependent variable) predicted/caused by a one standard deviation change in each x (each independent variable).
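This definition can be checked by hand: a standardised beta is what you get from running lm() on scale()-transformed variables, or equivalently the raw slope multiplied by sd(x)/sd(y). A sketch with the built-in mtcars data (stand-ins for our AES variables):

```r
raw <- lm(mpg ~ wt, data = mtcars)                 # raw (unstandardised) slope
std <- lm(scale(mpg) ~ scale(wt), data = mtcars)   # both variables standardised
coef(std)[2]                                       # the standardised beta
coef(raw)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg)   # same value, computed by hand
```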
Let’s estimate the standardised regression coefficients, which are also called ‘standardised betas’ or just ‘betas’, for model 2.
To do this we need to install and then load the package “lm.beta”.
if(!require(lm.beta)) {install.packages("lm.beta", repos='https://cran.csiro.au/', dependencies=TRUE)}
base::library(lm.beta)
And then the command is simple: we just run lm.beta() before we run summary() (and after we have run lm()).
elect_2013 %>%
stats::lm(pol_knowledge ~ age_18_24
+ internet_skills
+ female
+ tertiary_ed
+ income_quintiles
+ aust_born
+ interest_pol, data = .) %>%
lm.beta::lm.beta() %>%
base::summary()
##
## Call:
## stats::lm(formula = pol_knowledge ~ age_18_24 + internet_skills +
## female + tertiary_ed + income_quintiles + aust_born + interest_pol,
## data = .)
##
## Residuals:
##    Min     1Q Median     3Q    Max
## -7.052 -1.819  0.014  1.749  8.047
##
## Coefficients:
##                  Estimate Standardized Std. Error t value Pr(>|t|)
## (Intercept)       7.25724      0.00000    0.16747   43.34  < 2e-16 ***
## age_18_24        -0.72437     -0.05817    0.18756   -3.86  0.00011 ***
## internet_skills   0.11722      0.05497    0.03581    3.27  0.00107 **
## female           -0.74846     -0.13042    0.08348   -8.97  < 2e-16 ***
## tertiary_ed       0.87490      0.14094    0.09883    8.85  < 2e-16 ***
## income_quintiles  0.13042      0.06266    0.03345    3.90 0.000098 ***
## aust_born         0.04310      0.00655    0.09517    0.45  0.65064
## interest_pol     -1.56222     -0.43175    0.05367  -29.11  < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.42 on 3423 degrees of freedom
## (524 observations deleted due to missingness)
## Multiple R-squared: 0.293, Adjusted R-squared: 0.291
## F-statistic: 202 on 7 and 3423 DF, p-value: <2e-16
stats::sd(elect_2013$age, na.rm = TRUE)
## [1] 17
stats::sd(elect_2013$pol_knowledge, na.rm = TRUE)
## [1] 2.94
This output is identical to the summary of a normal linear regression model, except that it has added the standardised beta coefficients as the second column of numbers (under the heading ‘Standardized’).
If we look at these, we can see that the largest beta (in absolute terms) is for interest in politics, at -0.43. When assessing effect size, we can ignore the sign (+/-), and a beta of 0.4 is very large.
When thinking about effect size and beta, a rough rule of thumb is:
- small = ~0.05
- medium = ~0.1-0.2
- large = ~0.3+
Reading the betas, we can see that the three variables in our model with the largest impact on political knowledge are, in rank order:
- Interest in politics
- Tertiary education
- Gender