# SOC830 Lab 5: Univariate Statistics

The fifth lab session covers the following:

• How to make a frequency table
• How to compute the mode
• How to calculate descriptive statistics
• How to visualise the distribution

We will use four packages for this lab. Load them using the following code:

library(sjlabelled)
library(sjmisc)
library(sjPlot)
library(summarytools)

Note that the loading order of packages is essential. If you do not follow the suggested loading order, you will not get the same result as in the all below examples.

Also, you may see some warning messages when you load these packages. Please ignore them. They will not affect your analytic outcomes. The issue is mostly from the fact that you did not update R or necessary packages. Nonetheless, it is always recommended to update R and R packages.

# Import the 2009 AuSSa dataset.

This lab uses the 2009 AuSSa dataset. You can download the file of this dataset in the course website(iLearn). Download the data file and put it in your working directory. Then, run the following code:

aus2009 <-readRDS("aussa2009.rds")

The dataset is loaded as aus2009.

# How to make a frquency table

I briefly introduced how to make a frequency table in Lab 4. ‘frq(data name$variable name)’ generates a frequecy table. For instance, the following code produces a frequency table of sex in aus2009: frq(aus2009$sex) 
##
## # R: Sex (x) <numeric>
## # total N=1525  valid N=1494  mean=1.57  sd=0.50
##
##  val  label frq raw.prc valid.prc cum.prc
##    1   Male 647   42.43     43.31   43.31
##    2 Female 847   55.54     56.69  100.00
##   NA     NA  31    2.03        NA      NA

In the outcome, NA means missing values (item non-responses). frq is frequencies, raw.prc denotes raw percentages (based on the number of total cases), valid.prc indicates valid percentages (based on the number of valid cases), and cum.prc means cumulative percentages.

Alternatively, you can make a frequency table using ‘freq()’ from summarytools package. I prefer to use ‘freq()’ because this function provides more detailed information.

freq(aus2009$sex)  ## Frequencies ## aus2009$sex
## Label: R: Sex
## Type: Numeric
##
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##           1    647     43.31          43.31     42.43          42.43
##           2    847     56.69         100.00     55.54          97.97
##        <NA>     31                               2.03         100.00
##       Total   1525    100.00         100.00    100.00         100.00

The problem of this output is that value labels are not shown in the frequency table. This is because summarytools package does not support the value labels assigned by sjlabelled package. To show value labels, we need to add one more function: ‘to_label()’.

freq(to_label(aus2009$sex)) ## Frequencies ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## ------------ ------ --------- -------------- --------- -------------- ## Male 647 43.31 43.31 42.43 42.43 ## Female 847 56.69 100.00 55.54 97.97 ## <NA> 31 2.03 100.00 ## Total 1525 100.00 100.00 100.00 100.00 Now, you can see the labels of all values in the frequency table. In the table, Freq is frequencies, % Valid indicates valid percentages (based on the number of valid cases), and % Valid Cum. means cumulative percentages based on valid cases, % Total denotes raw percentages (based on the number of total cases), and % Total Cum. is cumulative percentages based on total cases. Also, this frequency table provides the number of total cases at the bottom row. The next codes will create frequency tables of marital and richcol. freq(to_label(aus2009$marital))
## Frequencies
##
##                                                                Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ------------------------------------------------------------ ------ --------- -------------- --------- --------------
##                                                      Married    926     62.11          62.11     60.72          60.72
##                                                      Widowed     81      5.43          67.54      5.31          66.03
##                                                     Divorced    129      8.65          76.19      8.46          74.49
##       Separated (married but sep./not living w legal spouse)     30      2.01          78.20      1.97          76.46
##                                        Never married, single    325     21.80         100.00     21.31          97.77
##                                                         <NA>     34                               2.23         100.00
##                                                        Total   1525    100.00         100.00    100.00         100.00
freq(to_label(aus2009$richcol)) ## Frequencies ## ## Freq % Valid % Valid Cum. % Total % Total Cum. ## -------------------------------- ------ --------- -------------- --------- -------------- ## Strongly agree 133 8.91 8.91 8.72 8.72 ## Agree 378 25.34 34.25 24.79 33.51 ## Neither agree nor disagree 219 14.68 48.93 14.36 47.87 ## Disagree 553 37.06 85.99 36.26 84.13 ## Strongly disagree 209 14.01 100.00 13.70 97.84 ## <NA> 33 2.16 100.00 ## Total 1525 100.00 100.00 100.00 100.00 ## How to make a frequency table of age groups The frequency table of continuous variables is not easy to read because there are so many values in them. Consequently, researchers often transform continuous variables into ordinal ones and then create the frequency table. Let’s try to make the frequency table of respondents’ age. We will first recode age into a variable of age groups (you did this in Lab 4) and then make the frequency table of age groups. aus2009 <- rec(aus2009, age, rec = "min:19 = 1; 20:29 = 2; 30:39 = 3; 40:49 = 4; 50:59 = 5; 60:69 = 6; 70:79 = 7; 80:89 = 8; 90:max = 9", append = TRUE) aus2009$age_r <- set_label(aus2009$age_r, label = "Age Category") aus2009$age_r <- set_labels(aus2009$age_r, labels = c ("10s" = 1, "20s" = 2, "30s" = 3, "40s" = 4, "50s" = 5, "60s" = 6, "70s" = 7, "80s" = 8, "90s" = 9)) freq(to_label(aus2009$age_r))
## Frequencies
##
##               Freq   % Valid   % Valid Cum.   % Total   % Total Cum.
## ----------- ------ --------- -------------- --------- --------------
##         10s     31      2.09           2.09      2.03           2.03
##         20s    134      9.03          11.12      8.79          10.82
##         30s    182     12.26          23.38     11.93          22.75
##         40s    251     16.91          40.30     16.46          39.21
##         50s    338     22.78          63.07     22.16          61.38
##         60s    302     20.35          83.42     19.80          81.18
##         70s    176     11.86          95.28     11.54          92.72
##         80s     67      4.51          99.80      4.39          97.11
##         90s      3      0.20         100.00      0.20          97.31
##        <NA>     41                               2.69         100.00
##       Total   1525    100.00         100.00    100.00         100.00

# How to compute the mode

R does not have a built-in function to compute a mode of a variable. So, I made a function called ‘getmode()’ which lets you calculate the mode. First, run the following code:

getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}

You will see getmode in the tab of Environment. This is the custom function you just added. Thus, every time you want to compute the mode, you need to run the above code first. Let’s caculate the mode of sex, martial, richcol, age and age_r. Again, we will use ’to_label() function for nominal and ordinal variables to see value labels instead of values.

getmode(to_label(aus2009$sex)) ##  Female ## Levels: Male Female getmode(to_label(aus2009$marital))
##  Married
## 5 Levels: Married Widowed ... Never married, single
getmode(to_label(aus2009$richcol)) ##  Disagree ## 5 Levels: Strongly agree Agree Neither agree nor disagree ... Strongly disagree getmode(to_label(aus2009$age))
##  62
getmode(to_label(aus2009$age_r)) ##  50s ## Levels: 10s 20s 30s 40s 50s 60s 70s 80s 90s The results show that ‘Female’ is the mode for sex, ‘Married’ for marital, ‘Disagree’ for richcol, 62 for age, and ‘50s’ for age_r. # How to compute descriptive statistics descr()’ from summarytools package shows various descriptive statistics. The notable statistics include: 1. Mean 2. Std.Dev. (Standard Deviation) 3. Min (Minimum Value) 4. Q1 (25th percentile) 5. Median (50th percentile) 6. Q3 (75th percentile) 7. Max (Maximum Value) 8. IQR (Inter-quartile Range) 9. Skewness 10. N.Valid (Number of Valid Cases) 11. % Valid (Percentage of Valid Cases) The following code shows descriptive statistics of richcol(ordinal variable): descr(aus2009$richcol)
## Descriptive Statistics
## aus2009$richcol ## Label: Q2c Getting ahead: In [Rs country] only the rich can afford the costs of attending university. ## N: 1525 ## ## richcol ## ----------------- --------- ## Mean 3.22 ## Std.Dev 1.22 ## Min 1.00 ## Q1 2.00 ## Median 4.00 ## Q3 4.00 ## Max 5.00 ## MAD 1.48 ## IQR 2.00 ## CV 0.38 ## Skewness -0.26 ## SE.Skewness 0.06 ## Kurtosis -1.09 ## N.Valid 1492.00 ## Pct.Valid 97.84 Please note that incase of ordinal variables, you need to be careful in interpreting the results. For instance, the median value of richcol is 4, which means the median response is “Disagree”. The following code will show descriptive statistics of age(continuous variable): descr(aus2009$age)
## Descriptive Statistics
## aus2009$age ## N: 1525 ## ## age ## ----------------- --------- ## Mean 52.53 ## Std.Dev 16.78 ## Min 17.00 ## Q1 41.00 ## Median 54.00 ## Q3 64.00 ## Max 93.00 ## MAD 17.79 ## IQR 23.00 ## CV 0.32 ## Skewness -0.18 ## SE.Skewness 0.06 ## Kurtosis -0.68 ## N.Valid 1484.00 ## Pct.Valid 97.31 # How to visualise the distribution There are many packages to visualise your data in R. In our course I will use sjPlot package because it supports the labels assigned by sjlabelled package and is easy to use. If you want to learn more advanced graphics of R, I recommend using ggplot2 package. One good reliable online resource can be found here. ## How to make bar graphs Bar graphs provide appropriate ways to present nominal or ordinal variables graphically. ‘plot_frq(data name$variable name, type = "bar")’ makes a bar graph of the specified variable. For example, the following code will make a bar graph of sex in aus2009:

plot_frq(aus2009$sex, type = "bar") If you want to add the title of a figure, add ‘title = "Any title"’ to the code. The following code will add “Gender distribution” as the title of the figure: plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution") The title of X-axis(R: Sex) is somewhat bizarre. The title of X-axis can be changed by adding ‘axis.title = "Any title"’. I change it into “Gender”. Thus, the code is:

plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution", axis.title = "Gender") If you want to remove percentages of each category at the top of bars, add ‘show.prc = FALSE’ to the code. plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution",
show.prc = FALSE) If you want to remove frequencies of each category at the top of bars, add ‘show.n = FALSE’ to the code.

plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution", show.n = FALSE) Next, let’s make a bar graph of another variable, marital. plot_frq(aus2009$marital, type = "bar", title = "Marital Status",
axis.title = "Marital Status", show.n = FALSE) You may notice that one axis label is too long, which makes it difficult to read other labels. So, we will change this long label into a more brief one. To change the axis label, ‘axis.labels’ option will be used.

sjp.stackfrq(aus2009$marital, title = "Marital Status", axis.labels = "Marital Status", legend.labels = c("Married", "Widowed", "Divorced", "Separated", "Never Married or Single")) Note that I changed the label of axis and legend using ‘axis.labels’ and ‘legend.labels’ option. ## How to make histograms Histograms are used with continuous variables which have many scores. Histograms look like bar graphs, except that the sides of the “bars” touch to form a continuous series. To make histograms, use ‘plot_frq(data name$variable name, type = "hist")’. The following code creates a histogram of age.

plot_frq(aus2009$age, type = "hist", title = "Distribution of Age", axis.title = "Age") In this graph, I would like to set the distance between axis labels equal to 10. So, I add ‘grid.breaks = 10’ to the code. Thus, the code is: plot_frq(aus2009$age, type = "hist", title = "Distribution of Age", axis.title = "Age",
grid.breaks = 10) The R codes you have written so far look like:

################################################################################
# Week6: Univariate Analysis
# 01/04/2019
# SOC830, SOCI702, SOCX830
################################################################################

library(sjlabelled)
library(sjmisc)
library(sjPlot)
library(summarytools)

# Frequency table
frq(aus2009$sex) freq(aus2009$sex)
freq(to_label(aus2009$sex)) freq(to_label(aus2009$marital))
freq(to_label(aus2009$richcol)) #freq(to_label(aus2009$age))

# Make a frequency table of age gropus
aus2009 <- rec(aus2009, age, rec = "min:19 = 1; 20:29 = 2; 30:39 = 3; 40:49 = 4;
50:59 = 5; 60:69 = 6; 70:79 = 7; 80:89 = 8; 90:max = 9",
append = TRUE)
aus2009$age_r <- set_label(aus2009$age_r, label = "Age Category")
aus2009$age_r <- set_labels(aus2009$age_r, labels = c ("10s" = 1,
"20s" = 2,
"30s" = 3,
"40s" = 4,
"50s" = 5,
"60s" = 6,
"70s" = 7,
"80s" = 8,
"90s" = 9))
freq(to_label(aus2009$age_r)) # Mode getmode <- function(v) { uniqv <- unique(v) uniqv[which.max(tabulate(match(v, uniqv)))] } getmode(to_label(aus2009$sex))
getmode(to_label(aus2009$marital)) getmode(to_label(aus2009$richcol))
getmode(to_label(aus2009$age)) getmode(to_label(aus2009$age_r))

# Descriptive statistics
descr(aus2009$richcol) descr(aus2009$age)

# Visualise the distribution
## Bar graph
plot_frq(aus2009$sex, type = "bar") plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution") # change the title of figure
plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution", axis.title = "Gender") # change the title of axis plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution",
show.prc = FALSE) # do not show percentages
plot_frq(aus2009$sex, type = "bar", title = "Gender Distribution", show.n = FALSE) # do not show frequencies plot_frq(aus2009$marital, type = "bar", title = "Marital Status",
axis.title = "Marital Status", show.n = FALSE)
plot_frq(aus2009$marital, type = "bar", title = "Marital Status", axis.title = "Marital Status", axis.labels = c("Married", "Widowed", "Divorced", "Separated", "Never Married or Single"), show.n = FALSE) # change the labels of axis ## Stacked bar graph for Likert scales sjp.stackfrq(aus2009$marital, title = "Marital Status",
axis.labels = "Marital Status",
legend.labels =  c("Married", "Widowed", "Divorced",
"Separated", "Never Married or Single"))

## Histogram
plot_frq(aus2009$age, type = "hist", title = "Distribution of Age", axis.title = "Age") plot_frq(aus2009$age, type = "hist", title = "Distribution of Age", axis.title = "Age",
grid.breaks = 10) # grid.breks:sets the distance between breaks for the axis
Last updated on 05 May, 2019 by Dr Hang Young Lee(hangyoung.lee@mq.edu.au)