Week 5: Bivariate Analysis, Scales & Indices, and Dimension Reduction
Learning Objectives
By the end of this class, students should be able to (1) define, (2) know when to use, (3) interpret R output for, and (4) with the assistance of methods101.com and Google, run the R commands for the following types of statistical analysis:

SOCI832: Lesson 5.2: Comparison of Means
0. How do I get my computer set up for today’s class?
# Install Packages
if(!require(dplyr)) {install.packages("dplyr", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjlabelled)) {install.packages("sjlabelled", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjmisc)) {install.packages("sjmisc", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjstats)) {install.packages("sjstats", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(sjPlot)) {install.packages("sjPlot", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(summarytools)) {install.packages("summarytools", repos='https://cran.csiro.au/', dependencies=TRUE)}
if(!require(ggplot2)) {install.packages("ggplot2", repos='https://cran.csiro.au/', dependencies= TRUE)}
if(!require(ggthemes)) {install.packages("ggthemes", repos='https://cran.csiro.au/', dependencies= TRUE)}
# Load packages into memory
library(dplyr)
library(sjlabelled)
library(sjmisc)
library(sjstats)
library(sjPlot)
library(summarytools)
library(ggplot2)
library(ggthemes)
# Turn off scientific notation
options(digits=5, scipen=15)
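# Stop View from overloading memory with large datasets
RStudioView <- View
View <- function(x) {
  # Show only the first 500 rows of a data frame; head() is safe when there are fewer rows
  if (is.data.frame(x)) { RStudioView(head(x, 500)) } else { RStudioView(x) }
}
# Load the elect_2013 dataset from methods101.com
elect_2013 <- read.csv(url("https://methods101.com/data/elect_2013.csv"))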
2. How do I compare the means of two variables?
We will learn two basic ways to compare the difference in means of two groups: one where the groups are independent (e.g. height of men and women), and one where they are paired (e.g. height of the same people at age 15 and age 15.5 years).
2.1 How do I compare two different groups? Independent samples t-test
Let’s say we want to compare the political knowledge of men and women in our dataset. We want to ask whether the mean for men and the mean for women are different. The command to test this is ‘t.test’.
Below is the t.test command we run. Below that is the output in the R console.
t.test(elect_2013$pol_knowledge ~ elect_2013$female)
##
## Welch Two Sample t-test
##
## data: elect_2013$pol_knowledge by elect_2013$female
## t = 11.6, df = 3839, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.88917 1.24939
## sample estimates:
## mean in group 0 mean in group 1
## 5.3283 4.2590
Now I know this looks like a mess, but it is actually not too difficult to understand.
Remember the first rule of reading statistical output: look at the p-value.
If the p-value is above 0.05, then there is normally no need to interpret anything else because the test is not significant.
WARNING: As mentioned, there is a complication to this instruction, because some statistical commands give multiple p-values: one for each variable, and one for the model overall. This instruction is about evaluating the p-value for each individual variable.
So if we follow this rule and look at the p-value first, what does it say?
The p-value in this case is “p-value < 2.2e-16” (if you haven’t turned off scientific notation) or “p-value <0.0000000000000002”. What does that mean? Is that less than 0.05? Yes! So the test tells us the difference in means is highly statistically significant. If there were really no difference between the two groups, we would be very unlikely to see a difference this large by random chance.
So what is the next step for interpreting this output?
Let’s look at the last three lines. They say:
## sample estimates:
## mean in group 0 mean in group 1
## 5.328302 4.259021
This is telling us that the mean of the group with value “0” is 5.33, and the mean for the group with value “1” is 4.26. But what are groups 0 and 1? Well, we need to look at our data. The means are measured in “pol_knowledge” units, and the variable for gender is coded 1 = female and 0 = male. So this tells us that the mean political knowledge for men in our sample is 5.3, and for women it is 4.3.
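One way to check which group is which is to compute the group means directly and compare them with the ‘sample estimates’ line. The sketch below uses a small made-up data frame (toy, with invented values) rather than elect_2013, so the numbers are illustrative only; the column names simply mirror the ones used above.

```r
# Toy data: invented values, NOT the real elect_2013 data
toy <- data.frame(
  pol_knowledge = c(6, 5, 7, 4, 6, 3, 5, 4, 2, 5),
  female        = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

# How many cases fall in each group?
group_counts <- table(toy$female)
print(group_counts)

# Mean of pol_knowledge within each value of female
group_means <- tapply(toy$pol_knowledge, toy$female, mean, na.rm = TRUE)
print(group_means)

# The same two means are stored in the t.test result as 'estimate'
toy_test <- t.test(pol_knowledge ~ female, data = toy)
print(toy_test$estimate)
```

If the two sets of means match, you know that ‘group 0’ and ‘group 1’ in the output are simply the two values of the grouping variable.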
We could stop interpreting our data here, but there is another useful part of the output to interpret. Look at these two lines:
## 95 percent confidence interval:
## 0.8891745 1.2493867
This tells us that the ‘difference of means’ between men and women has a 95% confidence interval of 0.89 to 1.25. In other words, we can be 95% confident that the TRUE difference between men and women, the population parameter, lies between 0.89 and 1.25.
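The three pieces of output discussed above can also be pulled out of the result by name, which is handy when you want to report them. This is a minimal sketch on made-up data (toy is hypothetical, not elect_2013); it relies on t.test returning a list whose elements include p.value, estimate and conf.int.

```r
# Toy data: invented values, NOT the real elect_2013 data
toy <- data.frame(
  pol_knowledge = c(6, 5, 7, 4, 6, 3, 5, 4, 2, 5),
  female        = c(0, 0, 0, 0, 0, 1, 1, 1, 1, 1)
)

toy_test <- t.test(pol_knowledge ~ female, data = toy)

# 1. The p-value
print(toy_test$p.value)

# 2. The two group means
print(toy_test$estimate)

# 3. The 95% confidence interval for the difference in means
ci <- toy_test$conf.int
print(ci)

# The interval is centred on the difference between the two sample means
diff_means <- unname(toy_test$estimate[1] - toy_test$estimate[2])
midpoint   <- mean(ci)
print(c(difference = diff_means, midpoint = midpoint))
```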
2.2 How do I compare two sets of data on the same set of cases? Paired (or dependent) samples t-test
The second type of comparison of means we are going to run is the paired test. In a paired test, the two variables are measured on the same units of analysis.
We need a different test here because when the same units of analysis are used for both variables, the two variables are dependent on each other. They are not independent samples, and so the statistical test changes to account for this.
In the next example, we are going to compare participants’ average score for ‘following the election on TV’ vs ‘following the election in the newspaper’.
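One way to see how the test changes: a paired t-test is equivalent to a one-sample t-test on the within-case differences, so it only asks whether the average difference is zero. Here is a minimal sketch on made-up scores (before and after are hypothetical vectors, not variables in elect_2013):

```r
# Invented scores for the same six people measured twice
before <- c(2, 3, 1, 4, 2, 3)
after  <- c(3, 3, 2, 5, 2, 4)

# Paired t-test on the two measurements
paired_test <- t.test(before, after, paired = TRUE)

# One-sample t-test on the differences (is their mean 0?)
diff_test <- t.test(before - after, mu = 0)

# Both approaches give the same t statistic and p-value
print(c(paired = unname(paired_test$statistic),
        one_sample = unname(diff_test$statistic)))
print(c(paired = paired_test$p.value,
        one_sample = diff_test$p.value))

# An independent-samples test on the same numbers ignores the pairing,
# so it gives a different p-value
independent_test <- t.test(before, after)
print(independent_test$p.value)
```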
t.test(elect_2013$election_tv,
elect_2013$election_newspaper,
paired = TRUE)
##
## Paired t-test
##
## data: elect_2013$election_tv and elect_2013$election_newspaper
## t = -26.6, df = 3883, p-value <0.0000000000000002
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.46002 -0.39683
## sample estimates:
## mean of the differences
## -0.42842
This can be read in the same way as the previous t-test, except that in this case the last line reports the difference in means, not the two means.
Intuitively we know that this means people followed the election more in the newspaper than on TV, but we can double-check by running mean() on each variable:
mean(elect_2013$election_tv, na.rm = TRUE)
## [1] 2.008
mean(elect_2013$election_newspaper, na.rm = TRUE)
## [1] 2.4337
And you can see that what we thought is true: people have an average score of 2.01 for election_tv and 2.43 for election_newspaper.
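The gap between these two means (about 0.426) is close to, but not exactly, the 0.428 ‘mean of the differences’ reported by the paired test. One likely reason (an assumption, since it depends on where the missing values fall in elect_2013) is that mean(..., na.rm = TRUE) drops missing values separately for each variable, while the paired test only uses cases that have both values. Here is a sketch with made-up vectors tv and paper (hypothetical, not the real data):

```r
# Invented scores with some missing values (NA)
tv    <- c(2, 1, NA, 3, 2, 1)
paper <- c(3, 2, 5, NA, 2, 3)

# Difference of the two separate means: each mean drops its own NAs
diff_of_means <- mean(tv, na.rm = TRUE) - mean(paper, na.rm = TRUE)

# Mean of the differences: only cases with BOTH values contribute
complete      <- complete.cases(tv, paper)
mean_of_diffs <- mean(tv[complete] - paper[complete])

# The paired t-test reports the second quantity
paired_test <- t.test(tv, paper, paired = TRUE)

print(diff_of_means)
print(mean_of_diffs)
print(unname(paired_test$estimate))
```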