SSCI202 Workshop 11: Regression Analysis

In workshop 11, we will learn the regression analysis, a powerful statistical model that enables you to examine the effect of independent variables on the dependent variable. We continue using the NSW Crime Dataset.

Preparation

Selecting Cases of Complete Observations

In most cases, your dataset contains cases with many missing values. Missing values can reduce the statistical power of analyses and produce biased estimates, which could lead to invalid conclusions. There are several ways to deal with missing values, but this workshop introduces only the method of “Listwise deletion” (also known as complete-case analysis) which is the most frequently used in handling missing data. You learned this method in workshop 10 when you created the correlation matrix. But this workshop introduces a more general way that can be used for all kinds of statistical analyses. You will learn how to select all the cases which have no missing values on the variables of your choice so that you can use these selected cases for any of your future analysis. Listwise deletion removes any case from the dataset if the case has a missing value on any variable of your choice. Suppose that we are investigating the relationships across six variables: sexual offence rates (sexoff), the median age of residents (medage), unemployment rates (unemploy), median house prices (housmedprice), regions of LGAs(region2), and urban or rural LGAs (urban).

The first step of this analysis is to select all the LGAs that have no missing values on these six variables so that we can investigate all combinations of the relationships using the same set of LGAs. Go to Data > Select Cases (See <Figure 1>.

<Figure 1>

Figure 1: <Figure 1>

In the box of Select Cases, 1) choose “If condition is satisfied”. 2) In the section of Output, choose “Filter out unselected cases”, which will exclude unselected cases from the analysis. Then, 3) click the “If” button just below “If condition is satisfied” in the section of Select (See <Figure 2>.

<Figure 2>

Figure 2: <Figure 2>

In the box of Select Cases: If, type the following expression (See <Figure 3>).

NOT (SYSMIS(sexoff)) AND NOT (SYSMIS(medage)) AND NOT (SYSMIS(unemploy)) AND NOT (SYSMIS(housmedprice)) AND NOT (SYSMIS(region2)) AND NOT (SYSMIS(urban))
<Figure 3>

Figure 3: <Figure 3>

SYSMIS(variable) finds cases that have missing values on the variable. Thus, NOT SYSMIS(variable) finds cases that do not have missing values. AND signifies that all the listed conditions must be met. Combined together, the above expression will select cases that have no missing values on any of the six variables.

Click Continue at the bottom. Then, you will be back to the previous box. Click OK at the bottom. Make sure all the analysis after this point will be done only for the selected cases. Thus, if you want to analyse all the cases again, you need to deselect cases (See Deselecting cases.

Changing the Scale of Variables

housmedprice has a huge range (47500 – 2800000), which makes it a little bit complicated to interpret regression outputs. Therefore, reducing the range would be helpful. An easy way to achieve this goal is to divide the value by numbers. In this example, we divide it by 10,000.

Go to Transform > Compute Variable. In the box of Compute Variable, 1) type “housmedprice2”, which is a new variable name, in the section of Target Variable:, 2) click the icon of “Type & Label”, which will show the box of Compute Variable: Type and Label, 3) in the box, choose “Label”, type “Median Sale Price of Houses ($10,000)”, and click Continue, 4) type housmedprice/10000 in the section of Numeric Expression:, and 5) click OK at the bottom (see <Figure 4>).

<Figure 4>

Figure 4: <Figure 4>

Simple Linear Regression

In this workshop, we will investigate what kinds of community characteristics influence sexual offence rates(sexoff). We start with a very simple version of regression model in which only the dependent variable is regressed on only one independent variable.

First, we will estimate the effect of the median sale price of houses (housmedprice2) on the sexual offence rate (sexoff). To conduct a linear regression analysis, go to Analyze > Regression > Linear (see <Figure 5>) .

<Figure 5>

Figure 5: <Figure 5>

In the popped-up box of Linear Regression, 1) move sexoff to the section of Dependent, 2) move housmedprice2 to the section of Independent(s) and 3) click OK (see <Figure 6>).

<Figure 6>

Figure 6: <Figure 6>

You will see the regression output as in <Figure 7>.

<Figure 7>

Figure 7: <Figure 7>

First, look at the table of Model Summary. You will see “R Square”. R Square (the coefficient of determination) tells you the proportion of the variance in the dependent variable explained by independent variables. In the table, R Square is .272, which means that 27.2% of the total variation in the sexual offence rate are explained by the median sale price of houses.

Second, look at the table of Coefficients. The intercept is 267.818. The coefficient (slope) of housmedprice is -919, meaning that for each additional $10,000 in median sale price of houses, sexual offence rates will decrease by .919. Therefore, the regression equation estimated by this model is \(sexoff = 267.818 -.919 × housmedprice2\).

Third, the table also provides the p-value for the significance test of the coefficient. The p-value for the coefficient of housmedprice2 is .000, indicating that the negative effect of housmedprice2 on sexoff is statistically significant at α = .05 (Note: We normally don’t pay attention to the p-value for intercepts).

Then, we will estimate the sexual offence rate by other two independent variables, respectively. In the box of Linear Regression (in <Figure 6>), move medage and unemploy to the section of Independent(s), respectively. You will get the regression output for medage (See <Figure 8>) and for unemploy (see <Figure 9>).

<Figure 8>

Figure 8: <Figure 8>

<Figure 9>

Figure 9: <Figure 9>

Multiple Regression

In this section, we will estimate the sexual offence rate by the three independent variables (housmedprice2, medage, unemploy) simultaneously. To conduct this multiple regression analysis, move all the three variables to the section of Independent(s) in the box of Linear Regression (See <Figure 10>)

<Figure 10>

Figure 10: <Figure 10>

You will see the regression output as in <Figure 11>.

<Figure 11>

Figure 11: <Figure 11>

Again, focus on the table of Model Summary and Coefficient.

  • R Square is .305, which means that 30.5% of the total variance in the sexual offence rate are explained by the three variables.

  • The coefficient of housmedprice2 is -.976, meaning that for every additional $10,000 in median sale price of house, sexual offence rates will decrease by .976, after controlling the median age of residents and unemployment rates. The p-value for this coefficent is .000, and thus the negative effect of housmedprice2 is statistically significant at α = .05.

  • The coefficient of medage is -3.540, meaning that for each additional one year increase in the median age of respondents, sexual offence rates will decrease by 3.540, after controlling the median sale price of houses and unemployment rates. The p-value for this coefficent is .068, and thus the negative effect of medage is NOT statistically significant at α = .05.

  • The coefficient of unemploy is 7.886, meaning that for each additional 1% increase in the unemployment rate, sexual offence rates will increase by 7.886, after controlling the median sale price of houses and the median age of residents. The p-value for this coefficent is .176, and thus the positive effect of unemploy is NOT statistically significant at α = .05.

  • The regression equation estimated by this model is \(sexoff = 371.667 -.976 × housmedprice2 -3.540 × medage +7.886 × unemploy\).

Dummy Variable Analysis

Creating a Dummy Variable.

A dummy variable is a binary variable used in the regression analysis that represents subgroups (or categories) of the sample in your study. And it is used to estimate the effect of categorical (especially, nominal) variables. Suppose that we want to estimate the effect of regions on the sexual offence rate. More specifically, we want to compare this effect across three regions: Greater Metropolitan Sydney, Sydney Surrounds, and Rural and Regional Areas. Since this variable of regions is not a continuous variable (actually, it is a nominal variable), a set of dummy variables is necessary for estimating this effect.

Let’s make a set of dummy variables for the region. First, we need to recode region2 into region3, following the recoding scheme of <Table 1>.

Table 1: Recoding scheme for new region variable
Old variable(region2)
New variable(region3)
Values Labels Values Labels
1 Greater Metropolitan Sydney 1 Greater Metropolitan Sydney
2 Sydney Surrounds 2 Sydney Surrounds
3 Mid North Coast 3 Rural and Regional Areas
4 Murray
5 Murrumbidgee
6 Hunter
7 Illawarra
8 Richmond-Tweed
9 Canberra Region
10 Northern
11 Central West
12 North Western
13 Far west

After making region3, we need to make two dummy variables which represent “Sydney Surrounds” and “Rural and Regional Areas”, respectively (Note that when the nominal variable that you want to estimate the effect has a N number of categories, you need to make a N-1 number of dummy variables). The omitted category in region3 (Greater Metropolitan Sydney) will serve as the reference category in estimating the effect of each region. Make a dummy variable for “Sydney Surrounds” (region3_d2), following the recoding scheme of <Table 2>. Note that values in a dummy variable take either 0 ( = not belong to the category) or 1 ( = belong to the category). Thus, 1 in the dummy variable for “Sydney Surrounds” indicates for an LGA in Sydney Surrounds.

Table 2: Recoding scheme for a dummy variable for Sydney Surrounds
Old variable(region3)
New variable(region3_d2)
Values Labels Values Labels
1 Greater Metropolitan Sydney 0 Else
2 Sydney Surrounds 1 Sydney Surrounds
3 Rural and Regional Areas 0 Else

Also, make a dummy variable for “Rural and Regional Areas” (region3_d3), following the recoding scheme of <Table 3>.

Table 3: Recoding scheme for a dummy variable for Rural and Regional Areas
Old variable(region3)
New variable(region3_d3)
Values Labels Values Labels
1 Greater Metropolitan Sydney 0 Else
2 Sydney Surrounds 0 Else
3 Rural and Regional Areas 1 Rural and Regional Areas

Dummy Variable Regression Analysis

After making dummy variables, it is easy to estimate the effect of dummy variables. Move a set of dummy variables (in this example, region3_d2 and region3_d3) to the section of Independent(s) in the box of Linear Regression.

You will see the regression output as in <Figure 12>.

<Figure 12>

Figure 12: <Figure 12>

Notable results are: * R Square is .247, which means that 24.7% of the total variance in the sexual offence rate are explained by the set of two dummy variables of regions.

  • The coefficient of “Sydney Surrounds” is 32.652, meaning that LGAs in Sydney Surrounds have, on average, higher sexual offence rates than those in Greater Metropolitan Sydney by 32.652. However, this difference is not statistically significant at α = .05 (because the p-value (.554) is greater than α). Thus, we confirm that sexual offence rates do not differ significantly between Great Metropolitan Sydney and Sydney Surrounds.

  • The coefficient of “Rural and Regional Areas” is 117.408, meaning that LGAs in Rural and Regional Areas have, on average, higher sexual offence rates than those in Greater Metropolitan Sydney by 117.408. And this difference is statistically significant at α = .05 (because the p-value (.000) is much less than α).

Then, we will examine whether these findings are consistent even after controlling community characteristics. In the box of Linear Regression, move region3_d2, region3_d3, housmedprice2, medage and unemploy to the section of Independent(s).

You will see the regression output as in <Figure 13>.

<Figure 12>

Figure 13: <Figure 12>

As seen in <Figure 13>, the findings in <Figure 12> are consistent even after controlling the median price of houses, the median age of residents, and unempoyment rates. * The coefficient of “Sydney Surrounds” is 25.663, but this coefficient is not statistically significant.

  • By contrast, the coefficient of “Rural and Regional Areas” is 110.669, meaning that LGAs in Rural and Regional Areas have, on average, higher sexual offence rates than those in Greater Metropolitan Sydney by 110.669 after controlling the median price of houses, the median age of residents, and unempoyment rates. And this difference is statistically significant at α = .05.

  • R Square is .363, which means that 36.3% of the total variance in the sexual offence rate are explained by all the five variables.

  • The regression equation estimated by this model is \(sexoff = 367.978 + 2.5663 × region3\_d2 +110.669 × region3\_d3 -.393 × housmedprice2 -6.599 × medage +10.410 × unemploy\).

Workshop Activity 11: Regression Analysis

  1. This time we will conduct the regression analysis using a different set of cases. Deselect cases first (See Deselecting cases. Then, select all the LGAs that have no missing values on the five variables: robbery, medage, giniinc, meaninc and urban. We will use this set of cases throughout Workshop Activity 11.


  1. Change the scale of meaninc by dividing it by 1,000 so that the unit of meainc is $1,000. The new variable name is meaninc2, and its label is “The Average Annual Income of Residents($1,000)”.


  1. Have SPSS estimate a regression model predicting robbery rates(robbery) by the average income of residents (meaninc2), income inequality (giniinc), and the median age of residents (medage), simultaneously. Based on the SPSS output, how would you interpret the regression coefficient (slope) of meaninc2 and giniinc? Do you think the effect of meaninc2 and giniinc are statistically significant at α = .05?


  1. Investigate whether urban LGAs have higher or lower robbery rates than rural LGAs.urban is a dummy variable in which urban LGAs are coded as 1, and rural LGAs are coded as 0. Use this varialbe for the analysis (You need not recode this variable). Also, you need to control the average income of residents (meaninc2), income inequality (giniinc), and the median age of residents (medage). This means that the regression model predicts robbery rates(robbery) by urban, meaninc2, giniinc, and medage. What is the regression coefficient (slope) for urban? How would you interpret it? Do you think the difference in robbery rates between urban and rural LGAs is statistically significant at α =.01?

Note: External students should post their answers to Q.3 and Q.4 on the iLearn. This activity will contribute to your workshop participation marks.


Last updated on 05 November, 2019 by Dr Hang Young Lee(hangyoung.lee@mq.edu.au)