- Selecting cases
- Categorising continuous variables
- Exercise: Recoding variables using cut points
- How to make a new variable by combining the existing two variables
- Workshop Activity 6: Exploring Crime Rates in NSW
This workshop continues using the Data for Crime Rates in Local Government Areas in New South Wales. Open the data again in SPSS.
Sometimes we want to carry out an analysis for a specific group (e.g., young people, women, unemployed, etc.) rather than for all cases in the sample. If you want to do such an analysis, you need to define a subset of interest by specifying selection criteria and then selecting cases that meet the requirements. This procedure is called “subsetting”, and SPSS provides a way for subsetting. Suppose that we want to analyse crime rates only in urban LGAs of NSW. The first step is to select only urban LGAs from the entire LGAs.
Go to Data > Select Cases.
You will see the box of Select Cases.
Let’s take a look at the box. The section of Select provides several methods for subsetting. The default, All cases, includes all cases for the analysis. However, we will use the next option, so 1) choose If condition is satisfied. 2) In the section of Output, choose Filter out unselected cases, which will exclude unselected cases from the analysis. Then, 3) click If button just below “If condition is satisfied” in the section of Select.
You will see the box of “Select Cases: If” in which you can set criteria for selecting cases. The selection criteria should be expressed using basic SPSS operators. <Table 1> shows basic SPSS operators and how these can be used for selecting cases.
|=||Equal||urban=1: if urban equals to 1|
|~=||Not equal||urban~=1: if urban does not equal to 1|
|<||Less than||age<30: if age is less than 30|
|<=||Less than or equal to||age<=30: if age is less than or equal to 30|
|>||Greater than||age>30: if age is greater than 30|
|>=||Greater than or equal to||age>=30: if age is greater than or equal to 30|
|&||AND: all the conditions hold||age>30 & age<40: if age is greater than 30 AND less than 40|
||||OR: either one of the conditions holds||age<20 | age>80: if age is less than 20 OR greater than 80|
In the box of Select Cases: If, 1) type “urban=1” in the box at the top. urban has two values: 0 means ‘rural’ and 1 mean ‘urban’. Thus, “urban=1” enforces SPSS to select only urban LGAs. 2) Click Continue at the bottom. Then, you will be back to the previous box. 3) Click OK at the bottom.
Go to Data View. You will see a new variable titled “filter_$” at the rightmost column (see <Figure 4>). The variable has a value 1 if the case is selected in the subset and 0 otherwise. But you don’t need to do anything with this filtering variable, so ignore this variable.
Note that any analysis you conduct after this point will be done ONLY for the selected subset. For example, let’s make a frequency table for urban. As seen in <Figure 5>, only urban LGAs are included in the analysis.
When you want to stop using the subset you defined in the previous section, you need to turn off subsetting.
Go to Data > Select Cases again. In the box of Select Cases, click Reset at the bottom. You will see “All cases” chosen in the section of Select (see <Figure 6>). Click OK at the bottom. After then, the analysis will be done for all cases.
Categorising continuous variables
Sometimes we want to transform a continuous variable into a categorical one. Using an income variable, we may want to classify people into several income groups such as high-income, middle-income, and low-income groups. How can we do this kind of categorisation using SPSS? You may remember that recoding variables can do this job (see Creating a new variable of age groups).
Suppose that we want to classify LGAs into two categories depending on the average income of households: “High 50% LGAs” with the average household income being above the median and “Low 50% LGAs” with the average household income being below the median. This is called median-splitting by which all cases will be split into two groups with the equal size of cases. Of course, there are other ways to categorise income groups such as using more refined percentiles (e.g. quantiles, quintiles) or even tax-brackets. But median-splitting is the simplest and easiest method and can be used for preliminary analysis. In this workshop, we will use the median as a cut point to categorise LGAs into two income groups.
Obtaining values that split cases
This task requires you to split all the LGAs into two groups depending on their values on the average household income variable (the variable name is meaninc). Also, we want each group to have the equal size of cases so that “High income LGA” group has 50% of all cases and “Low income LGA” group has another 50% of all cases. The first step of this procedure is to calculate the cut point by which all the cases are split into half.
- Go to Analyze > Descriptive Statistics > Frequencies.
- Choose a variable by which cases will be split (in this example, it is meaninc)
- Click Statistics in the box of Frequencies.
- Tick “Cut points for ⸏⸏⸏ equal groups” under the section of Percentile Values. Type the number of groups you want to make (in this example, it is 2).
- Click Continue at the bottom. You will be back to the previous box.
- Click OK at the bottom.
The output shows the cut point, which splits all the cases by half (see <Figure 8>). As you may notice, 52,265 is the median value for meaninc. The next step is to make a new variable using this cut point.
Recoding variables using cut points
This section will use the same steps as in Recoding variables.
First, 1) Go to Transform > Recode into Different Variables. In the box of Recode into Different Variables, 2) choose meaninc in the left pane and move it to the right pane. 3) Type meaninc2 as a Name of Output Variable and “mean income – two groups” as a Label. 4) Click Change. 5) Click Old and New Values. In the box of Recode into Different Variables: Old and New Values, 6) choose “Range, LOWEST through value” and type the cut point that we calculated in the previous section, which is 52,265. Then, 7) type 1 in the section of New Value. This tells SPSS to recode all the LGAs with mean income below 52,265 into 1. 8) Click Add.
Next, 9)Choose “Range, value through HIGHEST” and type the cut point (52,265). Then, 10) type 2 in the section of New Value. All the LGAs with mean income higher than 52,265 will be recoded into 2. 11) Click Add. 12) Choose “System- or user-missing” in the section of Old Value, “system-missing” in the section of New Value. 13) Click Continue at the bottom. You will be back to the previous box. 14) Click OK at the bottom.
Go to Variable View. You will see a new variable named meaninc2 at the bottommost row. 15) Assign an appropriate level of measurement (Nominal) and value labels (1 = “Bottom 50%”; 2 = “Top 50%”).
Make a frequency table of this newly generated variable to check whether the variable is created correctly.
As seen in the <Figure 11>, almost the same number of cases are included in each group (“Bottom 50%” includes a case with the median value and thus have one more case than “Top 50%”.).
Exercise: Recoding variables using cut points
In the same way you made meaninc2, make a new variable (unemploy2) that splits all LGAs into two groups: top and bottom 50% with respect to unemployment rates (unemploy). We will use unemploy2 in the next section.
How to make a new variable by combining the existing two variables
This section introduces a more complicated recoding scheme which uses two variables simultaneously. Suppose that we want to classify LGAs with reference to both meaninc2 and unemploy2 as in <Table 2>.
|1||1||1||low income, low unemployment|
|1||2||2||low income, high unemployment|
|2||1||3||high income, low unemployment|
|2||2||4||high income, high unemployment|
There may be several ways to create this new variable, incunemp4. Using Compute might be one way. But this section makes it using Recode in a more complicated way.
First of all, 1) Go to Transform > Recode into Different Variables. 2) In the box of Recode into Different Variables, choose unemploy2 in the left pane and move it to the right pane. 3) Type incunemp4 as a Name of Output Variable and “4 groups by income and unemployment” as Label. 4) Click Change. 5) Click If.
We first specify the recoding scheme of employ2 but ONLY FOR Low income LGAs (meaninc2=1). So, we select cases having a value 1 in meaninc2. In the box of Recode into Different Variables: If Cases, 6) choose “Include if case satisfies condition”. 7) Type “meaninc2=1” in the white box, which will select only low-income LGAs. 8) Click Continue.
You will be back to the previous box. 9) Click “Old and New Values” and type 1 for Old Value and 1 for New Value. 10) Click Add. After then, 11) type 2 for Old Value and 2 for New Value. 12) Click Add. 13) Choose “System- or user-missing” in the section of Old Value, “system-missing” in the section of New Value. Click Add. 14) Click Continue at the bottom. You will be back to the previous box. 15) Click OK at the bottom.
What you have done so far is recoding employ2 only for low-income LGAs (meaninc2=1). Go to Data View. You will see that high income LGAs (meaninc2=2) have missing values in the new variable, incunempl4 (see <Figure 15>). Now is the time to recode employ2 for high-income LGAs.
Next, 16) Go to Transform > Recode into Different Variables again. You will see the previous setting. We will use the same setting mostly but change the setting in the section of IF. 17) Click If again. In the box of Recode into Different Variables: If Cases, 18) change the expression into “meaninc2=2”. 19) Click Continue. You will be back to the previous box. Check the condition of If has been changed (see <Figure 15>).
And 20) Click Old and New Values. In the box of Recode into Different Variables: Old and New Values, 21) click each previous old and new value and then click Remove. Now, you are ready to input a new coding scheme for high-income LGAs. Now, we will assign a value 3 and 4 in incunemp4 (see <Table 2>). 22) Type 1 for Old Value and 3 for New Value. Click Add. 23) type 2 for Old Value and 4 for New Value. 24) Choose “System- or user-missing” in the section of Old Value, “system-missing” in the section of New Value. Click Add. 25) Click Continue at the bottom. You will be back to the previous box. 26) Click OK at the bottom.
Sort cases by meaninc2 and unemploy2 (Choose ascending orders for both variables). Then, you will be able to check whether the new variable is created correctly.
Workshop Activity 6: Exploring Crime Rates in NSW
WARNING: After completing Q3, you need to deselect cases before you save the file. Otherwise, your future analysis on this dataset will analyse only 30 LGAs in the Greater Metropolitan Sydney.
Note: External students should post their answers to these three questions on the iLearn. This activity will contribute to your workshop participation marks.