SSCI202 Workshop 10: Correlation

This week, we will use the NSW Crime dataset. This workshop introduces 1) how to compute correlation coefficients and 2) how to create scatterplots.


Suppose that we want to examine the relationship between two continuous variables, for example whether and how robbery rates are correlated with unemployment rates. Computing a correlation coefficient is a simple way to do this.

To create a correlation table, go to Analyze > Correlate > Bivariate (see <Figure 1>).

<Figure 1>

Figure 1: <Figure 1>

In the Bivariate Correlations dialog box, 1) move the variables of interest (in this case, robbery and unemploy) to the Variables: box. 2) Select Pearson under Correlation Coefficients. 3) To mark statistically significant correlation coefficients with asterisks (*) in the output, tick “Flag significant correlations” (see <Figure 2>). 4) Click OK.

<Figure 2>

Figure 2: <Figure 2>

<Figure 3> shows the correlation output. If the number of cases (N) differs across cells, this is because each correlation is calculated from all the cases that have valid values on both variables; cases with a missing value on either variable are excluded from the analysis. In <Figure 3>, Pearson’s r is .261 and its associated p-value is .014, which suggests that robbery and unemployment rates are weakly and positively correlated, and that this association is statistically significant at the .05 level.

<Figure 3>

Figure 3: <Figure 3>
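For readers who want to see what happens behind the dialog box, the calculation can be sketched in Python. This is only an illustration, not how SPSS is implemented: the function name pearson_r and the data values are made up for this example, and missing values are represented by None. It keeps only the cases with valid values on both variables, then computes Pearson’s r from the deviations around the means.

```python
import math

def pearson_r(xs, ys):
    """Return (r, n) using only cases where both values are present."""
    # Keep only cases with valid (non-missing) values on both variables.
    pairs = [(x, y) for x, y in zip(xs, ys) if x is not None and y is not None]
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    # Sum of cross-products and sums of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / math.sqrt(sxx * syy), n

# Made-up values for illustration only (not taken from the NSW Crime dataset).
unemploy = [4.0, 5.5, None, 6.1, 7.2, 3.8]
robbery = [20.0, 31.0, 25.0, None, 40.0, 18.0]

r, n = pearson_r(unemploy, robbery)
print(f"r = {r:.3f}, based on n = {n} cases")
```

Two of the six toy cases have a missing value on one of the variables, so the coefficient is based on the remaining four cases. This mirrors why N in the SPSS output can be smaller than the number of rows in the dataset.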

You can also compute correlation coefficients for more than two variables at once. Think about other variables that may be correlated with robbery and unemployment, such as the percentage of residents who rent their dwelling (pctrent), income inequality measured by the Gini coefficient (giniinc), median house prices (housmedprice), and the median age of residents (medage). Simply add all the variables of interest to the Variables list (see <Figure 4>).

Note: The Gini coefficient measures the level of economic inequality. It ranges from 0 (perfect equality) to 1 (perfect inequality). An LGA in which every resident has the same income would have an income Gini coefficient of 0, which indicates perfect equality. An LGA in which one resident earns all the income, while all the others earn nothing, would have an income Gini coefficient of 1, which indicates perfect inequality. Thus, higher income Gini coefficients mean higher levels of income inequality.
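The two extreme cases in the note can be checked with a short Python sketch. It uses one common definition of the Gini coefficient (the mean absolute difference between all pairs of incomes, divided by twice the mean income); the dataset's giniinc variable may have been computed differently, and the incomes below are made up.

```python
def gini(incomes):
    """Gini coefficient: mean absolute difference divided by twice the mean."""
    n = len(incomes)
    mean = sum(incomes) / n
    # Sum of absolute income differences over every ordered pair of residents.
    total_diff = sum(abs(a - b) for a in incomes for b in incomes)
    return total_diff / (2 * n * n * mean)

# Hypothetical incomes for a four-resident area.
equal = [50_000, 50_000, 50_000, 50_000]   # everyone earns the same
one_takes_all = [0, 0, 0, 200_000]         # one resident earns everything

print(gini(equal))
print(gini(one_takes_all))
```

Under this formula, the equal-income area scores 0. With only n residents, the "one resident earns everything" case scores (n - 1) / n rather than exactly 1, so the four-resident example gives 0.75; the value approaches 1 as the population grows.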

<Figure 4>

Figure 4: <Figure 4>

When you compute correlation coefficients among more than two variables, the way missing values are handled can make a big difference to your output. “Pairwise deletion” uses, for each pair of variables, all the cases that have valid values on that pair. For example, the correlation coefficient between robbery and unemployment rates is computed using the cases with valid values on both of those variables, whereas the correlation coefficient between robbery rates and the median house price is computed using the cases with valid values on those two variables. As a result, each correlation coefficient may be based on a different set of cases, which makes it difficult to compare the coefficients directly.

“Listwise deletion” avoids this problem. It computes every correlation coefficient from the same set of cases by dropping any case that has a missing value on any of the selected variables. The disadvantage is that you may lose more cases than you would with pairwise deletion. In this workshop, we will use listwise deletion because it is a preferred method in most social science research. 1) Click Options, 2) tick “Exclude cases listwise”, and then 3) press Continue (see <Figure 4>). Then, back in the Bivariate Correlations dialog box, click OK at the bottom.
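The difference between the two methods can be seen by counting the cases each one keeps. The following Python sketch uses six made-up records (None marks a missing value); the helper names complete and pairwise_n are hypothetical and chosen only for this illustration.

```python
# Toy records with made-up values; None marks a missing value.
cases = [
    {"robbery": 20.0, "unemploy": 4.0, "housmedprice": 650.0},
    {"robbery": 31.0, "unemploy": 5.5, "housmedprice": None},
    {"robbery": 25.0, "unemploy": None, "housmedprice": 480.0},
    {"robbery": 40.0, "unemploy": 7.2, "housmedprice": 390.0},
    {"robbery": 22.0, "unemploy": 6.1, "housmedprice": None},
    {"robbery": 18.0, "unemploy": 3.8, "housmedprice": 700.0},
]
variables = ["robbery", "unemploy", "housmedprice"]

def complete(case, names):
    """True if the case has a valid value on every named variable."""
    return all(case[name] is not None for name in names)

# Listwise deletion: keep only cases that are complete on ALL selected variables.
listwise = [c for c in cases if complete(c, variables)]

def pairwise_n(a, b):
    """Number of cases with valid values on both a and b."""
    return sum(1 for c in cases if complete(c, [a, b]))

# Pairwise deletion: each pair of variables keeps its own set of cases.
pairwise_counts = {
    (a, b): pairwise_n(a, b)
    for i, a in enumerate(variables)
    for b in variables[i + 1:]
}

print("listwise n:", len(listwise))
for pair, count in pairwise_counts.items():
    print("pairwise n for", pair, ":", count)
```

In this toy data, each pair of variables is based on a different number of cases under pairwise deletion, while listwise deletion uses the same, smaller set of cases for every pair.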

You will see a correlation matrix like the one in <Figure 5>, which shows the correlation coefficients among all the variables you selected. Each cell shows the Pearson correlation coefficient between the corresponding row and column variables, together with its p-value. For instance, the correlation coefficient between robbery and unemployment rates is .210, and its p-value is .073. These values differ from those in <Figure 3> because listwise deletion bases the calculation on a different set of cases.

<Figure 5>

Figure 5: <Figure 5>


Scatterplots are an effective way to visualise bivariate relationships. To create a scatterplot, go to Graphs > Legacy Dialogs > Scatter/Dot (see <Figure 6>).

<Figure 6>

Figure 6: <Figure 6>

First, we will make a scatterplot of two variables (robbery and unemploy). Select “Simple Scatter” in the Scatter/Dot dialog box (see <Figure 7>).

<Figure 7>

Figure 7: <Figure 7>

In the Simple Scatterplot dialog box, 1) move your dependent variable (robbery) to the Y Axis box, 2) move your independent variable (unemploy) to the X Axis box, and 3) click OK (see <Figure 8>).

<Figure 8>

Figure 8: <Figure 8>

Then, you will see the scatterplot between robbery and unemploy as in <Figure 9>.

<Figure 9>

Figure 9: <Figure 9>

You can also create scatterplots for several variables at once, so that you can examine all of their bivariate relationships together. To do this, choose “Matrix Scatter” in the Scatter/Dot dialog box (see <Figure 10>), and move all the variables of interest to the Matrix Variables box (see <Figure 11>). This generates a matrix of scatterplots like the one in <Figure 12>.

<Figure 10>

Figure 10: <Figure 10>

<Figure 11>

Figure 11: <Figure 11>

<Figure 12>

Figure 12: <Figure 12>

Workshop Activity 10: Correlations

  1. Generate a scatterplot for the following pairs of variables. Which scatterplot suggests the strongest association?

    A. Gini coefficient of total income (giniinc) and unemployment rate (unemploy)

    B. Gini coefficient of total income (giniinc) and % of residents who rent dwelling (pctrent)

    C. Gini coefficient of total income (giniinc) and the median sale price of houses (housmedprice)

    D. Gini coefficient of total income (giniinc) and the median age (medage)

  2. Generate a scatterplot matrix with all the variables used in Q1 (giniinc, unemploy, pctrent, housmedprice, and medage). Which variables show a negative association with the median sale price of houses?

  3. We now return to examining how various characteristics of LGAs are associated with crime rates, but we shift our focus from robbery to sexual offences (sexoff). We continue to use the same set of independent variables that we examined for robbery rates.

    A. Create a scatterplot using sexoff and unemploy, and compare the output with the one for robbery (Figure 9). Which crime rate is more strongly associated with unemployment rates: robbery or sexual offences?

    B. Construct a correlation matrix using sexoff, unemploy, pctrent, giniinc, housmedprice, and medage. Which variables are significantly correlated with sexoff? Describe how they are correlated with sexoff (interpret the output in terms of both the direction and the strength of the correlation).

    C. Now compare your answer to the question above (Q3-B) with <Figure 5>, which reports the results for robbery rates. Do you think that the same independent variables affect robbery and sexual offence rates in the same way?

Note: External students should post their answers to these three questions on iLearn. This activity will contribute to your workshop participation marks.

Last updated on 24 October, 2019 by Dr Hang Young Lee