2.2 Concept Sheet

Concepts

Descriptive statistics
Inferential statistics
Population
Sample
Statistical significance
P-value
Correlation Coefficient
Pearson’s correlation
T-test
Linear regression
Coefficient
R-squared

Definitions

Descriptive statistics: Statistics which summarise a sample, but which do not attempt to draw conclusions (inferences) about a larger population. For example, the mean, the median, and the standard deviation of a variable in a sample are all descriptive statistics.
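
As a minimal sketch of the descriptive statistics named above, the following uses Python's standard `statistics` module. The `ages` list is made-up illustrative data, not taken from any real survey.

```python
import statistics

# Hypothetical sample: ages of eight survey respondents (illustrative values only).
ages = [23, 35, 41, 29, 52, 35, 47, 38]

mean_age = statistics.mean(ages)      # arithmetic mean
median_age = statistics.median(ages)  # middle value of the sorted data
sd_age = statistics.stdev(ages)       # sample standard deviation (divides by n - 1)

print(mean_age, median_age, round(sd_age, 2))
```

These numbers only describe the eight respondents; nothing here makes a claim about a wider population.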

Inferential statistics: Statistics which aim to draw conclusions about a population from the analysis of a sample drawn from that population. Inferential statistics ask “What do we know about the true population parameter given the information in the sample?” They are usually reported with a p-value, a significance test, or a confidence interval. These tell us how compatible the sample is with the null hypothesis (typically that the population parameter is zero, i.e. that nothing is happening) and how precisely the parameter has been estimated.

Population: A large, unseen group which you draw inferences about. A sample is drawn from a population.

Sample: A set of cases (individuals) generally assumed to be drawn as a true random sample from the population. A sample is used to draw inferences about the population. For example, a sample of 1,000 voters surveyed by phone can be used to draw inferences about the population of Australian voters.

Population parameter: A true value or characteristic of a population, for example, the mean age of Australian voters, or the proportion of voters who would give their two-party-preferred vote to the Liberal Party. The population parameter is the number you are trying to estimate with inferential statistics.

Sample statistic: A value or characteristic measured on your sample. The sample statistic is an estimate of the true population parameter. For example, the mean age of your sample of 1,000 voters, or the proportion of your sample that would give their two-party-preferred vote to the Liberal Party.

Null hypothesis: Roughly speaking, the hypothesis that “nothing is happening”. Most of the time the null hypothesis is that the population parameter is zero.

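P-value/statistical significance: Roughly speaking, the chance of seeing a result at least as extreme as the one in our sample if the null hypothesis were true (for example, if the population parameter really were zero). This is expressed as a number between 0 and 1. The conventional threshold is p < 0.05, which means there is a less than 5% chance of seeing a result this extreme when the null hypothesis is true. A result below this threshold is called statistically significant.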
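
One way to make the p-value concrete is an exact binomial test, sketched below using only the standard library. The scenario (100 coin flips, 60 heads, a null hypothesis that the coin is fair) is a hypothetical chosen for illustration. The code adds up the probability, under the null hypothesis, of every outcome at least as far from the expected count as the one observed.

```python
from math import comb

# Hypothetical example: 100 coin flips, 60 heads. Null hypothesis: the coin is fair.
n_flips = 100
observed_heads = 60
p_null = 0.5

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n trials with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# "At least as extreme" here means at least as far from the expected 50 heads,
# in either direction (a two-sided test).
expected = n_flips * p_null
observed_distance = abs(observed_heads - expected)

p_value = sum(
    binomial_pmf(k, n_flips, p_null)
    for k in range(n_flips + 1)
    if abs(k - expected) >= observed_distance
)

print(f"p-value = {p_value:.4f}")
```

With these made-up numbers the p-value comes out slightly above 0.05, so the result would not be called statistically significant at the conventional threshold.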

Confidence interval: This is the range within which we expect (normally with 95% confidence) the population parameter to lie.
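
A rough sketch of a 95% confidence interval for a proportion, echoing the 1,000-voter example used earlier. The count of 520 is invented for illustration, and the multiplier 1.96 assumes the normal approximation, which is reasonable for a sample this large but is an assumption rather than something this sheet specifies.

```python
from math import sqrt

# Hypothetical poll: 520 of 1,000 sampled voters prefer a given party.
n = 1000
successes = 520

p_hat = successes / n                      # sample statistic
std_error = sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
z = 1.96                                   # assumed: normal approximation for 95%

lower = p_hat - z * std_error
upper = p_hat + z * std_error

print(f"estimate = {p_hat:.3f}, 95% CI = ({lower:.3f}, {upper:.3f})")
```

The interval here runs from roughly 0.489 to 0.551. Because it includes 0.5, this hypothetical sample would not let us rule out an even split in the population.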

Correlation coefficient: A number between -1 and 1 that indicates the strength and direction of the relationship between two variables, with 1 indicating that they completely covary in a positive direction, -1 indicating that they completely covary in opposite directions, and 0 indicating no linear relationship between them. (A correlation of 0 does not by itself show that the variables are statistically independent, since they may still be related in a non-linear way.)

Pearson’s correlation coefficient: The most commonly used correlation coefficient. It is a standardised measure of how much two variables co-vary. The formal equation for Pearson’s correlation coefficient (also called ‘r’) is:

\(\begin{aligned} r = &\frac{\sum_{i=1}^N (x_i - \bar x)(y_i - \bar y)}{(N-1)s_x s_y} \\ \\ \text{Where:} \\ \bar x = &\text{ mean of x} \\ \bar y = &\text{ mean of y} \\ N = &\text{ number of observations} \\ s_x = &\text{ standard deviation of x} \\ s_y = &\text{ standard deviation of y} \\ \end{aligned}\)
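
The formula can be transcribed almost term by term into code. The sketch below is one possible implementation; the function name `pearson_r` and the data are hypothetical, and the standard deviations are sample standard deviations (dividing by N - 1), matching the (N - 1) in the denominator.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson's r: sum of cross-products divided by (N - 1) * s_x * s_y."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations (divide by N - 1).
    s_x = sqrt(sum((xi - mean_x) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - mean_y) ** 2 for yi in y) / (n - 1))
    # Sum of the products of deviations from each mean.
    cross_products = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    return cross_products / ((n - 1) * s_x * s_y)

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = pearson_r(x, y)
print(round(r, 4))
```

For this made-up data r is about 0.77, a fairly strong positive relationship.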

T-test: A test used to compare the mean values of a dependent variable in two groups.

Independent sample t-test: Used when you are comparing the same variable on two different groups of individuals (or cases). For example, comparing the scores on a test of two different classes.

Paired sample t-test: Used when you are comparing two variables on the same set of individuals (or cases). For example, when you are comparing the quiz results of the same class in Week 1 and Week 2.
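
The difference between the two tests shows up clearly when the same numbers are treated both ways. The sketch below computes only the t statistics, using the standard library. The scores are invented, the function names are hypothetical, and the independent-sample version assumes equal variances in the two groups (the pooled-variance form), which this sheet does not specify.

```python
from math import sqrt
from statistics import mean, stdev, variance

def paired_t(before, after):
    """t statistic for a paired sample t-test (same cases measured twice)."""
    diffs = [a - b for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))

def independent_t(group_a, group_b):
    """t statistic for an independent sample t-test.

    Assumption: equal variances in both groups (pooled-variance form).
    """
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    std_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_b) - mean(group_a)) / std_error

# Hypothetical quiz scores.
scores_1 = [6, 7, 5, 8, 6]
scores_2 = [7, 9, 6, 8, 8]

# Same class in Week 1 and Week 2: paired.
t_paired = paired_t(scores_1, scores_2)
# Two different classes: independent.
t_independent = independent_t(scores_1, scores_2)

print(round(t_paired, 3), round(t_independent, 3))
```

The paired statistic is larger because pairing removes the variation between individuals. Turning either statistic into a p-value requires the t distribution; in practice a library routine such as SciPy's `scipy.stats.ttest_rel` or `scipy.stats.ttest_ind` would normally be used for that.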

Linear regression: A model for predicting the value of one variable from the value of other variables. With a single predictor, the model uses the equation of a straight line (y = bx + c), where y is the outcome variable, x is the predictor variable, b is the slope of the relationship between x and y, and c is the intercept (the value of y when x is 0). With several predictors the model becomes:

\(\begin{aligned} y = &b_1 x_1 + b_2 x_2 + b_3 x_3 + \dots + b_n x_n + b_0 + e \\ \\ \text{Where:} \\ y=&\text{ dependent variable (outcome variable)} \\ x_1 , x_2 , \text{ ... }x_n = &\text{ independent variables (predictor variables)} \\ b_1 , b_2 , \text{ ... }b_n = &\text{ slope of relationship between }y\text{ and }x_1 , x_2 , \text{ ... }x_n \\ b_0 = &\text{ intercept (i.e. value of }y \text{ when all }x=0 \text{)} \\ e = &\text{ error term (the part of }y \text{ not explained by the predictors)} \\ \end{aligned}\)
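
For the single-predictor case, the slope and intercept can be estimated by ordinary least squares with a few lines of arithmetic. This is a sketch under the assumption of one predictor only; the function name `fit_line` and the data are hypothetical. Models with several predictors are normally fitted with a statistics package rather than by hand.

```python
def fit_line(x, y):
    """Least-squares slope and intercept for a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Slope: covariation of x and y divided by the variation in x.
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    # Intercept: the fitted line passes through (mean_x, mean_y).
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b1, b0 = fit_line(x, y)
predicted_at_6 = b1 * 6 + b0  # prediction for a new x value

print(round(b1, 3), round(b0, 3), round(predicted_at_6, 3))
```

Here the fitted line is y = 0.6x + 2.2, so each one-unit increase in x is associated with a 0.6 increase in the predicted y.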

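R-squared or \(R^2\): In a regression model, this is the proportion of the variance in the dependent variable (the outcome) accounted for by the independent variables in the model. It ranges from 0 (the model explains none of the variance) to 1 (the model explains all of it).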
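
A small sketch of how this proportion can be computed for a single-predictor model: one minus the ratio of the unexplained (residual) variation to the total variation in the outcome. The data and the function name `r_squared` are hypothetical.

```python
def r_squared(x, y):
    """R-squared for a least-squares line fitted to a single predictor."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sum(
        (xi - mean_x) ** 2 for xi in x
    )
    intercept = mean_y - slope * mean_x
    predictions = [slope * xi + intercept for xi in x]
    # Residual sum of squares: variation the model fails to explain.
    ss_residual = sum((yi - pi) ** 2 for yi, pi in zip(y, predictions))
    # Total sum of squares: all variation in y around its mean.
    ss_total = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_residual / ss_total

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r2 = r_squared(x, y)
print(round(r2, 3))
```

For this made-up data the model accounts for about 60% of the variance in y. With a single predictor, this value equals the square of Pearson's r between x and y.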

Regression coefficient (b): The change in the dependent variable associated with a one-unit increase in the independent variable, holding any other independent variables in the model constant.

Standardised beta values: Regression coefficients for a model in which the independent and dependent variables are all standardised (centred on zero, with a standard deviation of 1). A standardised beta coefficient is interpreted as “the number of standard deviations by which the dependent variable changes for a one standard deviation increase in the independent variable.”
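
To illustrate the difference between an unstandardised coefficient and a standardised beta, the sketch below fits the same single-predictor model twice: once on the raw values and once after converting both variables to z-scores. The data and helper names are hypothetical, and only the single-predictor case is covered.

```python
from statistics import mean, stdev

def standardise(values):
    """Convert values to z-scores: mean 0, sample standard deviation 1."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def slope(x, y):
    """Least-squares slope for a single predictor."""
    mean_x, mean_y = mean(x), mean(y)
    numerator = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    denominator = sum((xi - mean_x) ** 2 for xi in x)
    return numerator / denominator

# Hypothetical data for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b = slope(x, y)                               # in the original units of x and y
beta = slope(standardise(x), standardise(y))  # in standard deviation units

print(round(b, 3), round(beta, 3))
```

The same beta can be obtained without refitting by rescaling the raw coefficient: b multiplied by the standard deviation of x and divided by the standard deviation of y. With one predictor, beta also equals Pearson's r.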