SOCI832: Overview: Cleaning Data

Week 4: Cleaning Data

Learning Objectives

By the end of this class, students should be able to do the following (with the assistance of, and the ability to refer constantly to, the internet (e.g. Google Search) and R Help):

  • run set up commands in R
  • download an article, its data, and the codebooks for the data from internet sources
  • import data
  • generate a codebook for their R data
  • compare (1) an article, (2) official codebooks, and (3) the real data/codebooks, and write out a list of changes that need to be made to each variable to clean up the data
  • write the code for this clean up in R
  • generate a descriptive statistics table of variables
  • use piping - the %>% operator
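
As a rough illustration of the last objective, the sketch below compares a nested function call with the same calculation written using the pipe. It assumes the magrittr package is installed (dplyr and the tidyverse provide the same operator), and the vector `ages` is made-up example data, not part of any class dataset.

```r
# Assumption: the magrittr package is installed; it provides the %>% pipe.
library(magrittr)

# Hypothetical toy data: five ages, one of them missing
ages <- c(23, 35, NA, 41, 29)

# Nested version: has to be read from the inside out
nested <- round(mean(na.omit(ages)), 1)

# Piped version: read left to right, one step per line.
# x %>% f(y) is evaluated as f(x, y).
piped <- ages %>%
  na.omit() %>%
  mean() %>%
  round(1)

# Both versions perform the same calculation
identical(nested, piped)
```

The piped version does nothing new; it only rearranges the code so that each cleaning step sits on its own line, which tends to be easier to read and debug.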

Structure of Class

The entire class will be a practical. There will not be a lecture. Instead, you will work on your project data (or a practice dataset) for the entire class, following the steps in this guide to complete the exercise for Week 4 (below).


By this stage you should have chosen your paper and dataset for the main project for this class.

  • Step 0: Choose the five most important variables from your proposed paper to focus on today.
  • Step 1: Set up your R.
  • Step 2: Download your data and the official codebook.
  • Step 3: Import your data into R and create your own codebook.
  • Step 4: Compare the article, the official codebook, your codebook, and your R data frame. Based on this comparison, write yourself instructions (in Google Sheets, Google Docs, on paper, or similar) about what changes need to be made to each variable. Specify what needs to happen for every value of all five of your variables.
  • Step 5: Write and run the code to clean your five variables
  • Step 6: Generate a descriptive statistics table for your five variables
  • Step 7: Cut and paste your code and your output tables and figures into the class Google Doc here.
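
The steps above could be sketched roughly as follows, using only base R. Everything in this example is hypothetical: the tiny dataset, the variable names (`age`, `sex`, `trust`), and the assumed coding rules (-9 and 99 mean "missing"; 1 = Male, 2 = Female) are stand-ins for whatever your own article and official codebook specify.

```r
# Step 3 (import): hypothetical raw data standing in for a downloaded file.
# In practice you would use read.csv("your_file.csv") or a similar function.
raw_text <- "id,age,sex,trust
1,34,1,4
2,-9,2,5
3,51,2,99
4,27,1,2"
survey <- read.csv(text = raw_text)

# Step 3 (codebook): a simple home-made codebook, one row per variable
codebook <- data.frame(
  variable = names(survey),
  class    = sapply(survey, class),
  n_unique = sapply(survey, function(x) length(unique(x))),
  min      = sapply(survey, min),
  max      = sapply(survey, max)
)
rownames(codebook) <- NULL

# Step 5 (clean): apply the written instructions from Step 4.
# Assumed rules: -9 and 99 are missing-value codes; sex is 1 = Male, 2 = Female.
clean <- survey
clean$age[clean$age == -9] <- NA
clean$trust[clean$trust == 99] <- NA
clean$sex <- factor(clean$sex, levels = c(1, 2), labels = c("Male", "Female"))

# Step 6 (describe): descriptive statistics for the numeric variables
num_vars <- c("age", "trust")
desc <- data.frame(
  variable = num_vars,
  n    = sapply(clean[num_vars], function(x) sum(!is.na(x))),
  mean = sapply(clean[num_vars], mean, na.rm = TRUE),
  sd   = sapply(clean[num_vars], sd, na.rm = TRUE)
)
rownames(desc) <- NULL

print(codebook)
print(desc)
table(clean$sex)
```

Keeping the untouched `survey` data frame alongside the cleaned copy makes it easy to check each recode against the original values before moving on to the next variable.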

Last updated on 19 August, 2019 by Dr Nicholas Harrigan (