1 Introduction

1.1 Logistics


1.1.1 Using html_document rather than beamer_presentation output for the next two lectures

  • What this means is that the lecture won’t have “slides”; rather, I’ll just scroll down
  • Should be easier for you to “knit” the entire lecture [Try it!] because it doesn’t rely on LaTeX
  • To see a formatted version of the lecture, open lecture6.html, or knit and follow along in that version
  • You will be able to run code chunks within the .Rmd file as usual
  • This is an experiment to see which you prefer; it will inform revisions the next time I teach the class
  • For Lectures 8, 9, and 10 we will go back to beamer_presentation output

1.1.2 Mid-term course survey

  • A link to the mid-term course survey will be posted at the beginning of the lecture 6 problem set. It is short. Please complete it!
  • The focus is on how the class is working for you: what is going well and how the class can be improved

1.1.3 Reading to do before next class

  • Work through the slides from lecture 8 that we don’t get to in class
    • [REQUIRED] slides from section “Tidying data”
    • [OPTIONAL] slides from section “Missing data”
  • [REQUIRED] R developer blog on Pivoting
  • [OPTIONAL] GW chapter 12 (tidy data)
    • Lecture covers this material pretty closely
    • The chapter’s functions for tidying data, gather() and spread(), are limited and outdated
  • [OPTIONAL] Wickham, H. (2014). Tidy Data. Journal of Statistical Software, 59(10), 1-23. doi:10.18637/jss.v059.i10
    • This is the journal article that introduced the data concepts covered in GW chapter 12 and led to the creation of the packages related to tidying data
    • Link to article here: LINK

1.1.4 Libraries we will use today

Functions we will use today are new and still under development, so you need to install the development version of tidyr

#install.packages("devtools")
#devtools::install_github("tidyverse/tidyr")

You must run this R code chunk (install packages first if you don’t have them)

library(tidyverse)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.2.1          ✔ purrr   0.3.2     
#> ✔ tibble  2.1.3          ✔ dplyr   0.8.3     
#> ✔ tidyr   1.0.0.9000     ✔ stringr 1.4.0     
#> ✔ readr   1.3.1          ✔ forcats 0.4.0
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag()    masks stats::lag()
library(haven)
library(labelled)

1.2 Lecture overview

Creating analysis datasets often requires changing the organizational structure of the data

Examples:

  • You want analysis dataset to have one obs per student, but your data has one obs per student-course
  • You want analysis dataset to have one obs per institution, but enrollment data has one obs per institution-enrollment level

Two common ways to change organizational structure of data

  1. Use group_by to perform calculations separately within groups and then use summarise to create an object with one observation per group. Examples:
    • Creating objects containing summary statistics that are basis for tables and graphs (focus of lecture 4)
    • Creating student-transcript level GPA variable from student-transcript-course level data (focus of lecture 5 problem set)
  2. Reshape your data – called tidying in the R tidyverse world – by transforming columns (variables) into rows (observations) and vice-versa
    • Our topic for today
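To make approach 1 concrete, here is a minimal sketch using made-up student-course data (the object and variable names are hypothetical, not from the problem set):

```r
library(tidyverse)

# hypothetical student-course level data: one row per student-course
student_course <- tibble(
  student_id = c(1, 1, 2, 2, 2),
  course     = c("math", "english", "math", "english", "history"),
  grade      = c(4.0, 3.0, 2.0, 3.0, 4.0)
)

# approach 1: collapse to one row per student, creating a GPA variable
student_course %>%
  group_by(student_id) %>%
  summarise(gpa = mean(grade))
```

The result has one observation per student: the unit of analysis changed from student-course to student.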

This lecture is about changing the organizational structure of your data by transforming untidy data into tidy data.

  • Working with tidy data has many benefits, one of them is that all the packages in the tidyverse are designed to work with tidy data.
  • We will perform data tidying using functions from the tidyr package, which is a package within tidyverse.

Show index and example datasets in tidyr package

help(package="tidyr")

# note that example datasets table1, table2, etc. are listed in the index alongside functions
table1
tidyr::table1 # same thing

df1 <- table1 # create an object from the dataset table1
str(df1)

table2
table3

2 Data “structure” vs data “concepts”

Before we define “tidy data”, we will spend significant time defining and discussing some core terms/concepts about datasets.

This discussion draws from the 2014 article Tidy Data by Hadley Wickham.

  • Wickham (2014) distinguishes between “data structure” and “data concepts”
    • (Wickham actually uses the term “data semantics”, but I don’t know what the word “semantics” means, so I’ll use the term “data concepts”)

2.1 Dataset structure

Dataset structure refers to the “physical layout” of a dataset

  • Typically, datasets are “rectangular tables made up of rows and columns” (emphasis added).
  • a cell is the intersection of one column and one row (think cells in Microsoft Excel)

There are many alternative data structures to present the same underlying data

  • Below are two representations of the same underlying data that have different data structures (rows and columns transposed)
#create structure a: treatment as columns, names as rows
structure_a <- tibble(
  name   = c("John Smith","Jane Doe","Mary Johnson"),
  treatmenta    = c(NA, 16, 3),
  treatmentb = c(2, 11, 1)
)

#create structure b: treatment as rows, names as columns
structure_b <- tibble(
  treatment   = c("treatmenta","treatmentb"),
  John_Smith    = c(NA, 2),
  Jane_Doe = c(16,11),
  Mary_Johnson = c(3,1)
)

structure_a
#> # A tibble: 3 x 3
#>   name         treatmenta treatmentb
#>   <chr>             <dbl>      <dbl>
#> 1 John Smith           NA          2
#> 2 Jane Doe             16         11
#> 3 Mary Johnson          3          1
structure_b
#> # A tibble: 2 x 4
#>   treatment  John_Smith Jane_Doe Mary_Johnson
#>   <chr>           <dbl>    <dbl>        <dbl>
#> 1 treatmenta         NA       16            3
#> 2 treatmentb          2       11            1

2.1.1 Unit of analysis

unit of analysis [my term, not Wickham’s]:

  • What each row represents in a dataset (referring to physical layout of dataset).

Examples of different units of analysis:

  • if each row represents a student, you have student level data
  • if each row represents a student-course, you have student-course level data
  • if each row represents an organization-year, you have organization-year level data
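A minimal sketch of the first two cases, using made-up data:

```r
library(tidyverse)

# student level data: each row represents one student
students <- tibble(
  student_id = c(1, 2),
  name       = c("Ana", "Ben")
)

# student-course level data: each row represents one student-course
student_course <- tibble(
  student_id = c(1, 1, 2),
  course     = c("math", "english", "math")
)
```

Same students, different unit of analysis: students has two rows, student_course has three.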

Questions:

  • What does each row represent in the data frame object structure_a?
structure_a
#> # A tibble: 3 x 3
#>   name         treatmenta treatmentb
#>   <chr>             <dbl>      <dbl>
#> 1 John Smith           NA          2
#> 2 Jane Doe             16         11
#> 3 Mary Johnson          3          1
  • What does each row represent in the data frame object structure_b?
structure_b
#> # A tibble: 2 x 4
#>   treatment  John_Smith Jane_Doe Mary_Johnson
#>   <chr>           <dbl>    <dbl>        <dbl>
#> 1 treatmenta         NA       16            3
#> 2 treatmentb          2       11            1
  • What does each row represent in the data frame object ipeds_hc_temp?
    • Below we load data on 12-month enrollment headcount for 2015-16 academic year from the Integrated Postsecondary Education Data System (IPEDS)
ipeds_hc_temp <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL) %>% 
  select(unitid,lstudy,efytotlt) %>% arrange(unitid,lstudy)

ipeds_hc_temp
#> # A tibble: 15,726 x 3
#>    unitid                lstudy efytotlt
#>     <dbl>             <dbl+lbl>    <dbl>
#>  1 100654   1 [Undergraduate]       4865
#>  2 100654   3 [Graduate]            1292
#>  3 100654 999 [Generated total]     6157
#>  4 100663   1 [Undergraduate]      13440
#>  5 100663   3 [Graduate]            8114
#>  6 100663 999 [Generated total]    21554
#>  7 100690   1 [Undergraduate]        415
#>  8 100690   3 [Graduate]             415
#>  9 100690 999 [Generated total]      830
#> 10 100706   1 [Undergraduate]       6994
#> # … with 15,716 more rows
#show variable labels and value labels
ipeds_hc_temp %>% var_label()
#> $unitid
#> [1] "Unique identification number of the institution"
#> 
#> $lstudy
#> [1] "Original level of study on survey form"
#> 
#> $efytotlt
#> [1] "Grand total"
ipeds_hc_temp %>% val_labels()
#> $unitid
#> NULL
#> 
#> $lstudy
#>   Undergraduate        Graduate Generated total 
#>               1               3             999 
#> 
#> $efytotlt
#> NULL

#print a few obs
ipeds_hc_temp %>% head(n=10)
#> # A tibble: 10 x 3
#>    unitid                lstudy efytotlt
#>     <dbl>             <dbl+lbl>    <dbl>
#>  1 100654   1 [Undergraduate]       4865
#>  2 100654   3 [Graduate]            1292
#>  3 100654 999 [Generated total]     6157
#>  4 100663   1 [Undergraduate]      13440
#>  5 100663   3 [Graduate]            8114
#>  6 100663 999 [Generated total]    21554
#>  7 100690   1 [Undergraduate]        415
#>  8 100690   3 [Graduate]             415
#>  9 100690 999 [Generated total]      830
#> 10 100706   1 [Undergraduate]       6994

#print a few obs, with value labels rather than variable values
ipeds_hc_temp %>% head(n=10) %>% as_factor()
#> # A tibble: 10 x 3
#>    unitid lstudy          efytotlt
#>     <dbl> <fct>              <dbl>
#>  1 100654 Undergraduate       4865
#>  2 100654 Graduate            1292
#>  3 100654 Generated total     6157
#>  4 100663 Undergraduate      13440
#>  5 100663 Graduate            8114
#>  6 100663 Generated total    21554
#>  7 100690 Undergraduate        415
#>  8 100690 Graduate             415
#>  9 100690 Generated total      830
#> 10 100706 Undergraduate       6994

2.1.2 Which variable(s) uniquely identify rows in a data frame

Identifying which combination of variables uniquely identifies rows in a data frame helps you identify the “unit of analysis” and understand the “structure” of your dataset

  • Said differently: for each value of this variable (or combination of variables), there is only one row
  • Very important for reshaping/tidying data (this week) and very important for joining/merging data frames (next week)
  • Sometimes a codebook will explicitly tell you which vars uniquely identify rows
  • Sometimes you have to figure this out through investigation
    • Focus on ID variables and categorical variables that identify what each row represents, not continuous numeric variables like total_enrollment or family_income

Task: Let’s try to identify the variable(s) that uniquely identify rows in ipeds_hc_temp

  • Multiple ways of doing this
  • I’ll give you some code for one approach; just try to follow along and understand
names(ipeds_hc_temp)
#> [1] "unitid"   "lstudy"   "efytotlt"
ipeds_hc_temp %>% head(n=10)
#> # A tibble: 10 x 3
#>    unitid                lstudy efytotlt
#>     <dbl>             <dbl+lbl>    <dbl>
#>  1 100654   1 [Undergraduate]       4865
#>  2 100654   3 [Graduate]            1292
#>  3 100654 999 [Generated total]     6157
#>  4 100663   1 [Undergraduate]      13440
#>  5 100663   3 [Graduate]            8114
#>  6 100663 999 [Generated total]    21554
#>  7 100690   1 [Undergraduate]        415
#>  8 100690   3 [Graduate]             415
#>  9 100690 999 [Generated total]      830
#> 10 100706   1 [Undergraduate]       6994


First, let’s investigate whether the ID variable unitid uniquely identifies rows in the data frame ipeds_hc_temp

  • I’ll annotate code for each step
ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
  group_by(unitid) %>% # group by unitid
  summarise(n_per_group=n()) %>% # create measure of number of obs per group
  ungroup %>% # ungroup (otherwise the frequency table [next step] is created separately for each group, i.e., a separate frequency table for each value of unitid)
  count(n_per_group) # frequency of number of observations per group
#> # A tibble: 2 x 2
#>   n_per_group     n
#>         <int> <int>
#> 1           2  5127
#> 2           3  1824

What does the above output tell us?

  • There are 5,127 values of unitid that have 2 rows for that value of unitid
  • There are 1,824 values of unitid that have 3 rows for that value of unitid
  • Note: 2*5127 + 3*1824 = 15,726, which is the number of observations in ipeds_hc_temp
  • Conclusion: the variable unitid does not uniquely identify rows in the data frame ipeds_hc_temp


Second, let’s investigate whether the combination of unitid and lstudy uniquely identifies rows in the data frame ipeds_hc_temp

ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
  group_by(unitid,lstudy) %>% # group by unitid and lstudy
  summarise(n_per_group=n()) %>% # create measure of number of obs per group
  ungroup %>% # ungroup (otherwise the frequency table [next step] is created separately for each group, i.e., a separate frequency table for each combination of unitid and lstudy)
  count(n_per_group) # frequency of number of observations per group
#> # A tibble: 1 x 2
#>   n_per_group     n
#>         <int> <int>
#> 1           1 15726

What does the above output tell us?

  • There is 1 row for each unique combination of unitid and lstudy
  • Conclusion: the variables unitid and lstudy uniquely identify rows in the data frame ipeds_hc_temp
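A more compact way to run the same check (not the approach above, but equivalent logic): count rows per combination and keep any combination that appears more than once. An empty result means the combination uniquely identifies rows.

```r
library(tidyverse)

# count rows per unitid-lstudy combination; zero rows returned means
# unitid and lstudy together uniquely identify rows
ipeds_hc_temp %>%
  count(unitid, lstudy) %>%
  filter(n > 1)
```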

What happens if we don’t ungroup? (After summarise(), the result is still grouped by unitid, so count() creates a separate frequency table within each value of unitid.)

ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
  group_by(unitid,lstudy) %>% # group by unitid and lstudy
  summarise(n_per_group=n()) %>% # create measure of number of obs per group
  count(n_per_group) # frequency of number of observations per group
#> # A tibble: 6,951 x 3
#>    unitid n_per_group     n
#>     <dbl>       <int> <int>
#>  1 100654           1     3
#>  2 100663           1     3
#>  3 100690           1     3
#>  4 100706           1     3
#>  5 100724           1     3
#>  6 100751           1     3
#>  7 100760           1     2
#>  8 100812           1     2
#>  9 100830           1     3
#> 10 100858           1     3
#> # … with 6,941 more rows

2.2 Dataset concepts (“semantics”)

Think of these dataset concepts as being distinct from dataset structure (rows and columns)


The difference between data structure and data concepts:

  • Data structure refers to the physical layout of the data (i.e., what the rows and columns in a dataset actually represent)
  • Data concepts – which were introduced by Wickham (2014) – refer to how the data should be structured

2.2.1 Values, variables, and observations

Wickham (2014, p. 3): “A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation.”

Wickham (2014) definitions of dataset concepts: values, variables, and observations [I recommend writing these definitions down on a separate sheet of paper]:

  • value: A single element within some data structure (e.g., vector, list), usually a number or a character string.
    • e.g. the value of the variable enrollment for one organization in a dataset where each observation represents a postsecondary education institution
  • variables: “A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units”
    • e.g., the variable enrollment is a vector that contains total enrollment for each organization in the dataset
  • observations: “An observation contains all values measured on the same unit (like a person, or a day)…across attributes”
    • e.g., the values of each variable for one organization in a dataset where each observation represents a postsecondary education institution

Example of a dataset that satisfies Wickham definitions

#> # A tibble: 10 x 4
#>    unitid ug_enroll grad_enroll all_enroll
#>     <dbl>     <dbl>       <dbl>      <dbl>
#>  1 100654      4865        1292       6157
#>  2 100663     13440        8114      21554
#>  3 100690       415         415        830
#>  4 100706      6994        2212       9206
#>  5 100724      5373         728       6101
#>  6 100751     35199        6308      41507
#>  7 100760      2379          NA       2379
#>  8 100812      4124          NA       4124
#>  9 100830      5072         924       5996
#> 10 100858     22953        6284      29237


Remove object

rm(ipeds_hc_temp)

2.2.2 Tables and observational unit

A particular project or data collection (e.g., a longitudinal survey tracking student achievement from high school through college) may require several data sources:

  • data on student characteristics (e.g., name, date of birth) with one observation per student
  • data on the high school the student attends with one observation per school
  • transcript data with one observation per student-course, etc.

Each of these data sources may use a different “level of observation,” and each requires a different “table” (i.e., data frame):

  • “Table” is the term database people use for what we would refer to as a “dataset” or a “data frame”
  • We will talk more about “tables” next week, when we cover joining/merging data frames
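A sketch of what these separate tables might look like (hypothetical data; real survey files would have many more columns):

```r
library(tidyverse)

# one observation per student
students <- tibble(
  student_id = c(1, 2),
  dob        = c("2001-05-02", "2002-11-17"),
  school_id  = c(10, 10)
)

# one observation per school
schools <- tibble(
  school_id = 10,
  name      = "Central High"
)

# one observation per student-course
transcripts <- tibble(
  student_id = c(1, 1, 2),
  course     = c("algebra", "biology", "algebra"),
  grade      = c("A", "B", "B")
)
```

Each table has a different observational level, so forcing them into a single data frame would duplicate or scatter values.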


Observational unit/observational level [Wickham’s term, not mine]

“In a given analysis, there may be multiple levels of observations. For example, in a trial of new allergy medication we might have three observational types: demographic data collected from each person (age, sex, race), medical data collected from each person on each day (number of sneezes, redness of eyes), and meteorological data collected on each day (temperature, pollen count)” (Wickham 2014, p. 4)


So Wickham defines observational unit/observational level as what each row should represent in an appropriately structured data frame

2.2.3 Summary/parting thoughts

These data concepts (e.g., value, variable, observation) seem easy on first glance, but can feel slippery on closer inspection.

In particular, when you are confronted with a particular dataset, it can sometimes be unclear what the variables/observations/unit of analysis are (data structure) and what they should be (data concepts).

3 Defining tidy vs. untidy data

3.1 Rules of tidy data (defining tidy data)

Wickham chapter 12: "There are three interrelated rules which make a dataset tidy:

  1. Each variable must have its own column
  2. Each observation must have its own row
  3. Each value must have its own cell"
  4. [Additional rule from Wickham (2014)] Each type of observational unit forms a table

I recommend writing these rules down on a separate sheet of paper

“These three rules are interrelated because it’s impossible to only satisfy two of the three” (Wickham chapter 12)


Visual representation of the three rules of tidy data:

Here is an example of tidy data:

#help(package="tidyr")
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

Question:

  • What does each observation (data concept) represent in the above dataset?
  • What does each row (data structure) represent in the above dataset?

3.2 Untidy data

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

While all tidy datasets are organized the same way, untidy datasets can have several different organizational structures


Here is an example of an untidy version of the same data shown in table1

#Untidy data, table 2
#help(package="tidyr")
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

Other examples of untidy data (output omitted):

table3

#tables 4a and 4b put the information from table1 in two separate data frames
table4a
table4b

table5

3.3 Diagnosing untidy data

The first step in transforming untidy data to tidy data is diagnosing which principles of tidy data have been violated.

Recall the three principles of tidy data:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

Let’s diagnose the problems with table2 by answering these questions

  1. Does each variable have its own column?
    • if not, how does the dataset violate this principle?
    • what should the variables be?
  2. Does each observation have its own row?
    • if not, how does the dataset violate this principle?
    • what does each row actually represent?
    • what should each row represent?
  3. Does each value have its own cell?
    • if not, how does the dataset violate this principle?

Printout of table2

table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

Answers to these questions:

  1. Does each variable have its own column? No
    • if not, how does the dataset violate this principle? “cases” and “population” should be two different variables because they are different attributes, but in table2 these two attributes are recorded in the column “type” and the associated value for each type is recorded in the column “count”
    • what should the variables be? country, year, cases, population
  2. Does each observation have its own row? No
    • if not, how does the dataset violate this principle? There is one row for each country-year-type. But the values of type (cases, population) represent attributes of a unit, which should be represented by distinct variables rather than rows. So table2 has two rows per observation when it should have one row per observation
    • what does each row actually represent? country-year-type
    • what should each row represent? country-year
  3. Does each value have its own cell? Yes.
    • if not, how does the dataset violate this principle?
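As a preview of the fix (pivoting functions are covered later), table2 can be tidied with tidyr::pivot_wider(), which turns the values of type into their own columns:

```r
library(tidyverse)

# spread the values of "type" (cases, population) into separate columns,
# yielding one row per country-year
table2 %>%
  pivot_wider(names_from = type, values_from = count)
```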

3.4 Student exercise:

For each of the following datasets – table1, table3, table4a, and table5 – answer the following questions:

  1. Does each variable have its own column?
    • if not, how does the dataset violate this principle?
    • what should the variables be?
  2. Does each observation have its own row?
    • if not, how does the dataset violate this principle?
    • what does each row actually represent?
    • what should each row represent?
  3. Does each value have its own cell?
    • if not, how does the dataset violate this principle?

We’ll give you ~15 minutes for this


ANSWERS TO STUDENT EXERCISE BELOW:

3.4.1 table1

table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
  1. Does each variable have its own column? YES
  2. Does each observation have its own row? YES
    • what does each row actually represent? COUNTRY-YEAR
    • what should each row represent? COUNTRY-YEAR
  3. Does each value have its own cell? YES

3.4.2 table3

table3
#> # A tibble: 6 x 3
#>   country      year rate             
#>   <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
  1. Does each variable have its own column? No
    • if not, how does the dataset violate this principle? The column rate contains two variables, cases and population
    • what should the variables be? country, year, cases, population
  2. Does each observation have its own row? Yes
    • what does each row actually represent? country-year
    • what should each row represent? country-year
  3. Does each value have its own cell? No
    • if not, how does the dataset violate this principle? In the rate column, each cell contains two values, a value for cases and a value for population
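As a preview of the fix, the rate column in table3 can be split into its two variables with tidyr::separate(); convert = TRUE turns the resulting character columns back into numbers:

```r
library(tidyverse)

# split "cases/population" into two columns at the "/"
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```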

3.4.3 table4a

table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
  1. Does each variable have its own column? No
    • if not, how does the dataset violate this principle? The variable cases is spread over two columns and the variable year is also spread over two columns
    • what should the variables be? country, year, cases
  2. Does each observation have its own row? No
    • if not, how does the dataset violate this principle? There are two country-year observations on each row
    • what does each row actually represent? country
    • what should each row represent? country-year
  3. Does each value have its own cell? Yes
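As a preview of the fix, table4a can be tidied with tidyr::pivot_longer(), which stacks the year columns into rows:

```r
library(tidyverse)

# move the 1999 and 2000 columns into a "year" variable,
# yielding one row per country-year
table4a %>%
  pivot_longer(cols = c(`1999`, `2000`), names_to = "year", values_to = "cases")
```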

3.4.4 table4b [not required to answer]

table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#>   <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583

Answers are pretty much the same as for table4a, except table4b shows data for population rather than cases

3.4.5 table5

table5
#> # A tibble: 6 x 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583
  1. Does each variable have its own column? No
    • if not, how does the dataset violate this principle? Two problems. First, the single variable year is spread across two columns, century and year. Second, the rate column contains the two variables cases and population
    • what should the variables be? country, year, cases, population
  2. Does each observation have its own row? Yes
    • what does each row actually represent? country-year
    • what should each row represent? country-year
  3. Does each value have its own cell? No
    • if not, how does the dataset violate this principle? Two problems. First, each value of year is spread across two cells. Second, each cell of the rate column contains two values, one for cases and one for population
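As a preview of the fix, tidying table5 takes two steps: tidyr::unite() pastes century and year back into a single year column, then tidyr::separate() splits rate:

```r
library(tidyverse)

table5 %>%
  unite("year", century, year, sep = "") %>%  # "19" + "99" -> "1999"
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)
```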

3.5 Revisiting definitions of variables and observations

It is worthwhile to revisit Wickham’s (2014) definitions of variables and observations in light of what we now know about tidy vs. untidy data.

Whenever I work with datasets I tend to think of observations as being synonymous with rows and variables as being synonymous with columns. But if we use Wickham’s definitions of observations and variables, this is not true.

  • Wickham definition of variables: “A variable contains all values that measure the same underlying attribute (like height, temperature, duration) across units”
    • In a tidy dataset, variables and columns are the same thing.
    • In an untidy dataset, variables and columns are not the same thing; a single variable (i.e., attribute) may be represented by two columns
      • For example, in table2 the attribute population was represented by two columns, type and count
  • Wickham definition of observations: “An observation contains all values measured on the same unit (like a person, or a day)…across attributes”
    • In a tidy dataset, an observation is the same thing as a row
    • In an untidy dataset, the values for a particular unit may be spread across multiple rows.
      • For example, in table1 and table2 Wickham would think of country-year as the proper observational unit. table1 has one row per country-year (tidy) but table2 has two rows per country-year (untidy).


“Observational unit” (data concept) vs. “unit of analysis” (my term; data structure)

  • observational unit. Wickham tends to think of observational unit as what each observation should represent (e.g., a person, a person-year) in a tidy dataset
  • unit of analysis. By contrast (right or wrong), I tend to think of “unit of analysis” as what each row of data actually represents (e.g., country-year-type)


Takeaways from this discussion of formal data concepts and tidy vs. untidy data

  • In everyday usage, the terms variable and observation refer to data structure, with:
    • variables=columns
    • observations=rows
  • In Wickham’s definitions, the terms variable and observation refer to data concepts, with:
    • A variable containing all values of one attribute across all units
    • An observation containing the values of all attributes for one unit
  • Can think of Wickham’s definitions of variables and observations as belonging only to tidy datasets.
    • Based on Wickham’s definition, variables=columns and observations=rows only if the dataset is tidy
    • Can think of Wickham’s definitions of variables and observations as what should be.


In the real world, we encounter many untidy datasets. We can still equate variables with columns and observations with rows.

  • Just be mindful that you are using the “everyday” definitions of variables and observations rather than the Wickham definitions.

3.6 Why tidy data

Why should you create tidy datasets before conducting analyses?

  1. If you have a consistent organizational structure for analysis datasets, it is easier to learn tools for analyzing data

  2. Tidy datasets are optimal for R
  • Base R functions and tidyverse functions are designed to work with vectors of values
  • In a tidy dataset, each column is the vector of values for a given variable, as shown below
str(table1)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    6 obs. of  4 variables:
#>  $ country   : chr  "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
#>  $ year      : int  1999 2000 1999 2000 1999 2000
#>  $ cases     : int  745 2666 37737 80488 212258 213766
#>  $ population: int  19987071 20595360 172006362 174504898 1272915272 1280428583
str(table1$country)
#>  chr [1:6] "Afghanistan" "Afghanistan" "Brazil" "Brazil" "China" ...
str(table1$cases)
#>  int [1:6] 745 2666 37737 80488 212258 213766


Example:

  • how would you calculate a rate variable (= cases/population*10000) for table1 (tidy) and table2 (untidy)?
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table1 %>% 
  mutate(rate = cases / population * 10000)
#> # A tibble: 6 x 5
#>   country      year  cases population  rate
#>   <chr>       <int>  <int>      <int> <dbl>
#> 1 Afghanistan  1999    745   19987071 0.373
#> 2 Afghanistan  2000   2666   20595360 1.29 
#> 3 Brazil       1999  37737  172006362 2.19 
#> 4 Brazil       2000  80488  174504898 4.61 
#> 5 China        1999 212258 1272915272 1.67 
#> 6 China        2000 213766 1280428583 1.67

#Just looking at table2, obvious that calculating rate is much harder
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

3.6.1 Caveat: tidy data is not always best

Datasets are often stored in an untidy structure rather than a tidy structure when the untidy structure has a smaller file size than the tidy structure

  • smaller file size leads to faster processing time, which is very important for web applications (e.g., Facebook) and data visualizations (e.g., interactive maps)
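A minimal sketch of why, using hypothetical base-R data: when data are sparse, a tidy (wide) structure must store an NA for every missing cell, while a key-value (untidy) structure stores only the observed values.

```r
set.seed(1)
n <- 1000
wide <- data.frame(id = 1:n)
for (v in paste0("var", 1:50)) {
  x <- rnorm(n)
  x[sample(n, size = 0.95 * n)] <- NA  # make 95% of each column missing
  wide[[v]] <- x
}

# Key-value version of the same data: keep only the non-missing cells
long <- do.call(rbind, lapply(paste0("var", 1:50), function(v) {
  keep <- !is.na(wide[[v]])
  data.frame(id = wide$id[keep], key = v, value = wide[[v]][keep])
}))

object.size(wide)  # stores every NA explicitly
object.size(long)  # stores only observed values; much smaller here
```

The sparser the data, the bigger the gap, which is one reason downloaded data often arrive in a long key-value structure.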

3.7 Legacy concepts: “Long” vs. “wide” data

Prior to Wickham (2014) and the creation of the “tidyverse,” the concepts of “tidy”/“untidy” (adjective) data and “tidying” (verb) did not exist.

Instead, researchers would “reshape” (verb) data from “long” (adjective) to “wide” (adjective) and vice-versa

Think of “wide” and “long” as alternative presentations of the exact same data.

  • “wide” data are represented with fewer rows and more columns
  • “long” data are represented with more rows and fewer columns

Example: Table 204.10 of the Digest of Education Statistics shows change over time in the number and percentage of K-12 students on free/reduced lunch LINK

  • Wide form display of data:
#load("data/nces_digest/nces_digest_table_204_10.RData")
load(url("https://github.com/ozanj/rclass/raw/master/data/nces_digest/nces_digest_table_204_10.RData"))
table204_10
#> # A tibble: 51 x 13
#>    state tot_2000 tot_2010 tot_2011 tot_2012 frl_2000 frl_2010 frl_2011
#>    <chr>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>    <dbl>
#>  1 Alab…   728351   730427   731556   740475   335143   402386  420447 
#>  2 Alas…   105333   132104   131166   131483    32468    50701   53238 
#>  3 Ariz…   877696  1067210  1024454   990378   274277   482044  511885 
#>  4 Arka…   449959   482114   483114   486157   205058   291608  294324 
#>  5 Cali…  6050753  6169427  6202862  6178788  2820611  3335885 3353964.
#>  6 Colo…   724349   842864   853610   863121   195148   336426  348896 
#>  7 Conn…   562179   552919   543883   549295   143030   190554  194339 
#>  8 Dela…   114676   128342   128470   127791    37766    61564   62774 
#>  9 Dist…    68380    71263    72329    75411    47839    52027   45199 
#> 10 Flor…  2434755  2641555  2668037  2691881  1079009  1479519 1535670 
#> # … with 41 more rows, and 5 more variables: frl_2012 <dbl>,
#> #   p_frl_2000 <dbl>, p_frl_2010 <dbl>, p_frl_2011 <dbl>, p_frl_2012 <dbl>

Task: Reshape table204_10 from wide to long

gather() from G&W can’t deal with complex variable names and can’t reshape multiple columns at once (we will discuss this shortly):

total <- table204_10 %>%
  select(state,tot_2000,tot_2010,tot_2011,tot_2012) #subset and assign to new object
names(total)<-c("state","2000","2010","2011","2012") #change names for year "tot_2010" -> "2010"

total_long <- total %>% 
  gather(`2000`,`2010`,`2011`,`2012`,key=year,value=total_students) %>% 
  arrange(state, year) #arrange by state and year

head(total_long, n=10) #view 10 obs
#> # A tibble: 10 x 3
#>    state                         year  total_students
#>    <chr>                         <chr>          <dbl>
#>  1 Alabama ....................  2000          728351
#>  2 Alabama ....................  2010          730427
#>  3 Alabama ....................  2011          731556
#>  4 Alabama ....................  2012          740475
#>  5 Alaska ..................     2000          105333
#>  6 Alaska ..................     2010          132104
#>  7 Alaska ..................     2011          131166
#>  8 Alaska ..................     2012          131483
#>  9 Arizona ..................... 2000          877696
#> 10 Arizona ..................... 2010         1067210

pivot_longer() can do the same task, but it can also handle reshaping multiple columns and complex patterns at the same time (although you won’t be able to handle complicated patterns until you learn regular expressions)


total %>%
  pivot_longer(
    cols = (c("2000","2010", "2011", "2012")),
    names_to = "year",
    values_to = "total_students"
  )
#> # A tibble: 204 x 3
#>    state                         year  total_students
#>    <chr>                         <chr>          <dbl>
#>  1 Alabama ....................  2000          728351
#>  2 Alabama ....................  2010          730427
#>  3 Alabama ....................  2011          731556
#>  4 Alabama ....................  2012          740475
#>  5 Alaska ..................     2000          105333
#>  6 Alaska ..................     2010          132104
#>  7 Alaska ..................     2011          131166
#>  8 Alaska ..................     2012          131483
#>  9 Arizona ..................... 2000          877696
#> 10 Arizona ..................... 2010         1067210
#> # … with 194 more rows

Note that the concepts “wide” vs. “long” are relative rather than absolute

  • For example, comparing table4a and table1: table4a is wide and table1 is long
table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
  • But comparing table1 and table2: table1 is wide and table2 is long
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

4 Tidying data

Steps in tidying untidy data:

  1. Diagnose the problem
    • e.g., Which principles of tidy data are violated and how are they violated? What should the unit of analysis be? What should the variables and observations be?
  2. Sketch out what the tidy data should look like [on piece of scrap paper is best!]
  3. Transform untidy to tidy

4.1 Common causes of untidy data

Tidy data can only have one organizational structure.

However, untidy data can have several different organizational structures, and thus several different causes. It is important to identify the most common causes of untidy data so that we can develop solutions for each one.

The two most common and most important causes of untidy data:

  1. Some column names are not names of variables, but values of a variable (e.g., table4a, table4b), which results in:
    • a single variable spread (e.g., population) over multiple columns (e.g., 1999, 2000)
    • a single row contains multiple observations
table4b
#> # A tibble: 3 x 3
#>   country         `1999`     `2000`
#>   <chr>            <int>      <int>
#> 1 Afghanistan   19987071   20595360
#> 2 Brazil       172006362  174504898
#> 3 China       1272915272 1280428583
  2. An observation is scattered across multiple rows (e.g., table2), such that:
    • One column identifies variable type (e.g., type) and another column contains the values for each variable (e.g., count)
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

Other common causes of untidy data (less important and/or less common)

  1. Individual cells in a column contain data from two variables (e.g., the rate column in table3)
table3
#> # A tibble: 6 x 3
#>   country      year rate             
#>   <chr>       <int> <chr>            
#> 1 Afghanistan  1999 745/19987071     
#> 2 Afghanistan  2000 2666/20595360    
#> 3 Brazil       1999 37737/172006362  
#> 4 Brazil       2000 80488/174504898  
#> 5 China        1999 212258/1272915272
#> 6 China        2000 213766/1280428583
  2. Values from a single variable are separated into two columns (e.g., in table5, values of the 4-digit year variable are separated into two columns, century and 2-digit year)
table5
#> # A tibble: 6 x 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583
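Neither of these last two problems is a reshaping problem; tidyr provides separate() and unite() for them. A quick sketch (assumes the tidyverse is loaded and uses the example tables that ship with tidyr):

```r
# Cells holding two variables: split rate into cases and population at "/"
# (convert = TRUE turns the resulting character columns into integers)
table3 %>%
  separate(rate, into = c("cases", "population"), sep = "/", convert = TRUE)

# One variable split across two columns: paste century and year back
# together into a single 4-digit year
table5 %>%
  unite("year", century, year, sep = "")
```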

4.2 Tidying data: reshape “wide” to “long” (gathering)

As discussed above, the first common reason for untidy data is that some of the column names are not names of variables, but values of a variable (e.g., table4a, table4b)

table4a
#> # A tibble: 3 x 3
#>   country     `1999` `2000`
#>   <chr>        <int>  <int>
#> 1 Afghanistan    745   2666
#> 2 Brazil       37737  80488
#> 3 China       212258 213766

The solution to this problem is to transform the untidy columns (which represent variable values) into rows

  • In the tidyverse, this process is called “gathering”
  • Outside the tidyverse (and in future updated tidyverse), this process is called “reshaping” from “wide” to “long”

In the above example, we want to transform table4a into something that looks like this:

table1 %>% select(country, year, cases)
#> # A tibble: 6 x 3
#>   country      year  cases
#>   <chr>       <int>  <int>
#> 1 Afghanistan  1999    745
#> 2 Afghanistan  2000   2666
#> 3 Brazil       1999  37737
#> 4 Brazil       2000  80488
#> 5 China        1999 212258
#> 6 China        2000 213766

We could use gather(), but we are going to get you in the habit of using pivot_longer(), which can handle everything from simple to complex reshaping

?pivot_longer

Gathering requires knowing three parameters:

  1. cols : names of the set of columns that represent values, not variables, in your untidy data
    • These are existing columns of the untidy data
    • in table4a these are the columns 1999 and 2000
  2. names_to : variable name(s) you will assign to columns you are gathering from the untidy data
    • This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
    • For the table4a example, we’ll call this variable year because the values of this variable will be years
    • Said differently: the variable(s) you will create whose values will be the column names from the untidy data.
    • In using gather() this is the “key” variable
  3. values_to : The name of the variable that will contain values in the tidy dataset you create and whose values are spread across multiple columns in the untidy dataset
    • This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
    • in the table4a example, we’ll call the “value variable” cases because the values refer to number of cases
table4a %>%
  pivot_longer(
    cols = (c("1999", "2000")),
    names_to = "year",
    values_to = "cases"
  )
#> # A tibble: 6 x 3
#>   country     year   cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766


#giving different names to the key variable and the value variable is fine
table4a %>%
  pivot_longer(
    cols = (c("1999", "2000")),
    names_to = "yr",
    values_to = "cases"
  )
#> # A tibble: 6 x 3
#>   country     yr     cases
#>   <chr>       <chr>  <int>
#> 1 Afghanistan 1999     745
#> 2 Afghanistan 2000    2666
#> 3 Brazil      1999   37737
#> 4 Brazil      2000   80488
#> 5 China       1999  212258
#> 6 China       2000  213766
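One detail worth noting: because the new year column is built from column names, it comes out as character (`<chr>` above). A common follow-up step, sketched here, is to convert it yourself:

```r
# year arrives as character because it came from column names;
# convert it to integer after pivoting
table4a %>%
  pivot_longer(
    cols = c("1999", "2000"),
    names_to = "year",
    values_to = "cases"
  ) %>%
  mutate(year = as.integer(year))
```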

4.2.1 pivot_longer can deal with more complex data patterns

?pivot_longer

Example 1: Complex variable names. The enrollment variable names in table204_10 contain more information than we need to transform from untidy to tidy data (this is why we renamed the variables before the gather() example above). We’re going to use the names_prefix argument to match and remove the part of each variable name that is not useful for reshaping wide to long.

Here we are still using the three core parameters: cols, names_to, values_to


total1 <- table204_10 %>%
  select(state,tot_2000,tot_2010,tot_2011,tot_2012)

total1 %>%
  pivot_longer(
    cols = c("tot_2000","tot_2010","tot_2011","tot_2012"),
    names_to = "year",
    names_prefix = ("tot_"),
    values_to = "total_students"
  )
#> # A tibble: 204 x 3
#>    state                         year  total_students
#>    <chr>                         <chr>          <dbl>
#>  1 Alabama ....................  2000          728351
#>  2 Alabama ....................  2010          730427
#>  3 Alabama ....................  2011          731556
#>  4 Alabama ....................  2012          740475
#>  5 Alaska ..................     2000          105333
#>  6 Alaska ..................     2010          132104
#>  7 Alaska ..................     2011          131166
#>  8 Alaska ..................     2012          131483
#>  9 Arizona ..................... 2000          877696
#> 10 Arizona ..................... 2010         1067210
#> # … with 194 more rows

#another way to indicate cols
total1 %>%
  pivot_longer(
    cols = starts_with("tot_"),
    names_to = "year",
    names_prefix = ("tot_"),
    values_to = "total_students"
  )
#> # A tibble: 204 x 3
#>    state                         year  total_students
#>    <chr>                         <chr>          <dbl>
#>  1 Alabama ....................  2000          728351
#>  2 Alabama ....................  2010          730427
#>  3 Alabama ....................  2011          731556
#>  4 Alabama ....................  2012          740475
#>  5 Alaska ..................     2000          105333
#>  6 Alaska ..................     2010          132104
#>  7 Alaska ..................     2011          131166
#>  8 Alaska ..................     2012          131483
#>  9 Arizona ..................... 2000          877696
#> 10 Arizona ..................... 2010         1067210
#> # … with 194 more rows

4.2.2 pivot_longer can reshape multiple columns at once

Example 2: Complex variable names and multiple variables need to be reshaped. In this case, we have two pieces of information for each state-year observation: total enrollment and students on free/reduced lunch. So not only do we need to reshape the data from wide to long so our observational-level is state-year, we also need to save total enrollment and students on free/reduced lunch into separate columns.

  • Note the special name .value and the names_sep argument. By specifying that all variables (besides state) are separated by an underscore, we tell pivot_longer that the first part of each variable name, before the separator, specifies the “value” being measured and will become the new variables in the output

total2 <- table204_10 %>%
  select(state,tot_2000,tot_2010,tot_2011,tot_2012, frl_2000,frl_2010,frl_2011,frl_2012)

#multiple variables (when patterns are consistent)
total2 %>%
  pivot_longer(
    -state, 
    names_to = c(".value", "year"), # special name .value
    names_sep = "_"
  )
#> # A tibble: 204 x 4
#>    state                         year      tot    frl
#>    <chr>                         <chr>   <dbl>  <dbl>
#>  1 Alabama ....................  2000   728351 335143
#>  2 Alabama ....................  2010   730427 402386
#>  3 Alabama ....................  2011   731556 420447
#>  4 Alabama ....................  2012   740475 429604
#>  5 Alaska ..................     2000   105333  32468
#>  6 Alaska ..................     2010   132104  50701
#>  7 Alaska ..................     2011   131166  53238
#>  8 Alaska ..................     2012   131483  53082
#>  9 Arizona ..................... 2000   877696 274277
#> 10 Arizona ..................... 2010  1067210 482044
#> # … with 194 more rows

#what happens if we switch order of .value and year? 
total2 %>%
  pivot_longer(
    -state, 
    names_to = c("year", ".value"), # special name .value
    names_sep = "_"
  )
#> # A tibble: 102 x 6
#>    state                         year   `2000`  `2010`   `2011`  `2012`
#>    <chr>                         <chr>   <dbl>   <dbl>    <dbl>   <dbl>
#>  1 Alabama ....................  tot    728351  730427  731556   740475
#>  2 Alabama ....................  frl    335143  402386  420447   429604
#>  3 Alaska ..................     tot    105333  132104  131166   131483
#>  4 Alaska ..................     frl     32468   50701   53238    53082
#>  5 Arizona ..................... tot    877696 1067210 1024454   990378
#>  6 Arizona ..................... frl    274277  482044  511885   514193
#>  7 Arkansas ..................   tot    449959  482114  483114   486157
#>  8 Arkansas ..................   frl    205058  291608  294324   298573
#>  9 California .................  tot   6050753 6169427 6202862  6178788
#> 10 California .................  frl   2820611 3335885 3353964. 3478407
#> # … with 92 more rows

4.2.3 Student exercise: Real-world example of wide to long

[In the task presented below, it is fine to just work through the solution I have created, or to try it on your own before you look at the solution]

The following dataset is drawn from Table 204.10 of the NCES Digest for Education Statistics.

  • The table shows change over time in the number and percentage of K-12 students on free/reduced lunch for selected years.
  • LINK to website with data
#Let's take a look at the data (we read in the data in the wide vs long section)
glimpse(table204_10)
#> Observations: 51
#> Variables: 13
#> $ state      <chr> "Alabama ....................", "Alaska .............…
#> $ tot_2000   <dbl> 728351, 105333, 877696, 449959, 6050753, 724349, 5621…
#> $ tot_2010   <dbl> 730427, 132104, 1067210, 482114, 6169427, 842864, 552…
#> $ tot_2011   <dbl> 731556, 131166, 1024454, 483114, 6202862, 853610, 543…
#> $ tot_2012   <dbl> 740475, 131483, 990378, 486157, 6178788, 863121, 5492…
#> $ frl_2000   <dbl> 335143, 32468, 274277, 205058, 2820611, 195148, 14303…
#> $ frl_2010   <dbl> 402386, 50701, 482044, 291608, 3335885, 336426, 19055…
#> $ frl_2011   <dbl> 420447.00, 53238.00, 511885.00, 294324.00, 3353963.72…
#> $ frl_2012   <dbl> 429604, 53082, 514193, 298573, 3478407, 358876, 20108…
#> $ p_frl_2000 <dbl> 46.01394, 30.82415, 31.24966, 45.57260, 46.61587, 26.…
#> $ p_frl_2010 <dbl> 55.08915, 38.37961, 45.16862, 60.48528, 54.07123, 39.…
#> $ p_frl_2011 <dbl> 57.47298, 40.58826, 49.96662, 60.92227, 54.07123, 40.…
#> $ p_frl_2012 <dbl> 58.01735, 40.37176, 51.91886, 61.41493, 56.29594, 41.…

#Create smaller version of data frame for purpose of student exercise
total<- table204_10 %>% 
  select(state,p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012)
head(total)
#> # A tibble: 6 x 5
#>   state                         p_frl_2000 p_frl_2010 p_frl_2011 p_frl_2012
#>   <chr>                              <dbl>      <dbl>      <dbl>      <dbl>
#> 1 Alabama ....................        46.0       55.1       57.5       58.0
#> 2 Alaska ..................           30.8       38.4       40.6       40.4
#> 3 Arizona .....................       31.2       45.2       50.0       51.9
#> 4 Arkansas ..................         45.6       60.5       60.9       61.4
#> 5 California .................        46.6       54.1       54.1       56.3
#> 6 Colorado ....................       26.9       39.9       40.9       41.6

Task (using the data frame total):

  1. Diagnose the problem with the data frame total
  2. Sketch out what the tidy data should look like
  3. Transform untidy to tidy. hint: use names_prefix

Solution to student exercise

  1. Diagnose the problem with the data frame total
    • Column names p_frl_2000, p_frl_2010, etc. are not variables; rather they refer to values of the variable year
    • Currently each observation represents a state with separate number of students on FRL variables for each year.
    • Each observation should be a state-year, with only one variable for FRL
  2. Sketch out what the tidy data should look like
  3. Transform untidy to tidy
    • cols : names of the set of columns that represent values, not variables, in your untidy data
      • p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012
    • names_to : variable name you will assign to columns you are gathering from the untidy data
      • This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
      • In this example, it’s year
    • values_to: The name of the variable that will contain values in the tidy dataset you create and whose values are spread across multiple columns in the untidy dataset
      • This var doesn’t yet exist in untidy data, but will be a variable name in the tidy data
      • In this example, the value variable is pct_frl
total %>%
  pivot_longer(
    cols = starts_with("p_frl_"),
    names_to = "year",
    names_prefix = ("p_frl_"),
    values_to = "pct_frl"
  )
#> # A tibble: 204 x 3
#>    state                         year  pct_frl
#>    <chr>                         <chr>   <dbl>
#>  1 Alabama ....................  2000     46.0
#>  2 Alabama ....................  2010     55.1
#>  3 Alabama ....................  2011     57.5
#>  4 Alabama ....................  2012     58.0
#>  5 Alaska ..................     2000     30.8
#>  6 Alaska ..................     2010     38.4
#>  7 Alaska ..................     2011     40.6
#>  8 Alaska ..................     2012     40.4
#>  9 Arizona ..................... 2000     31.2
#> 10 Arizona ..................... 2010     45.2
#> # … with 194 more rows

4.3 Tidying data: “long” to “wide” (spreading)

The second important and common cause of untidy data:

  • is when an observation is scattered across multiple rows, where one column identifies the variable type and another column contains the values for each variable.
  • table2 is an example of this sort of problem
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583

As my R guru Ben Skinner says, this sort of data structure is very common “in the wild”

  • often the data you download have this structure
  • it is up to you to tidy before analyses

The solution to this problem is to transform the untidy rows (which represent different variables) into columns

  • In the tidyverse, this process is called “spreading” (we spread observations across multiple columns); spreading is the opposite of gathering
  • Outside the tidyverse (and in future updated tidyverse), this process is called “reshaping” from “long” to “wide”

Goal: we want to transform table2 so it looks like table1

table2
table1

We reshape long to wide by using the pivot_wider() function from the tidyr package.

?pivot_wider

It is helpful to look at table2 while we introduce the pivot_wider() function

table2 %>% head(n=8)
#> # A tibble: 8 x 4
#>   country      year type           count
#>   <chr>       <int> <chr>          <int>
#> 1 Afghanistan  1999 cases            745
#> 2 Afghanistan  1999 population  19987071
#> 3 Afghanistan  2000 cases           2666
#> 4 Afghanistan  2000 population  20595360
#> 5 Brazil       1999 cases          37737
#> 6 Brazil       1999 population 172006362
#> 7 Brazil       2000 cases          80488
#> 8 Brazil       2000 population 174504898

Spreading requires knowing two parameters, which are the core parameters of pivot_wider():

  1. names_from. Column name(s) in the untidy data whose values will become variable names in the tidy data
    • this column name exists as a variable in the untidy data
    • in table2 this is the type column; the values of type, cases and population, will become variable names in the tidy data
  2. values_from. Column name(s) in untidy data that contains values for the new variables that will be created in the tidy data
    • this is a variable name that exists in the untidy data
    • in table2 the value column is count; the values of count will become the values of the new variables cases and population in the tidy data
table2
#> # A tibble: 12 x 4
#>    country      year type            count
#>    <chr>       <int> <chr>           <int>
#>  1 Afghanistan  1999 cases             745
#>  2 Afghanistan  1999 population   19987071
#>  3 Afghanistan  2000 cases            2666
#>  4 Afghanistan  2000 population   20595360
#>  5 Brazil       1999 cases           37737
#>  6 Brazil       1999 population  172006362
#>  7 Brazil       2000 cases           80488
#>  8 Brazil       2000 population  174504898
#>  9 China        1999 cases          212258
#> 10 China        1999 population 1272915272
#> 11 China        2000 cases          213766
#> 12 China        2000 population 1280428583
table2 %>%
    pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583
#compare to table 1
table1
#> # A tibble: 6 x 4
#>   country      year  cases population
#>   <chr>       <int>  <int>      <int>
#> 1 Afghanistan  1999    745   19987071
#> 2 Afghanistan  2000   2666   20595360
#> 3 Brazil       1999  37737  172006362
#> 4 Brazil       2000  80488  174504898
#> 5 China        1999 212258 1272915272
#> 6 China        2000 213766 1280428583

4.3.1 Student exercise: real-world example of reshaping long to wide

[In the task presented below, it is fine to just work through the solution I have created, or to try it on your own before you look at the solution]


The Integrated Postsecondary Education Data System (IPEDS) collects data on colleges and universities

  • Below we load IPEDS data on 12-month enrollment headcount for 2015-16 academic year
#load these libraries if you haven't already
#library(haven)
#library(labelled)
ipeds_hc_all <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL)

Create smaller version of dataset

#ipeds_hc <- ipeds_hc_all %>% select(instnm,unitid,lstudy,efytotlt,efytotlm,efytotlw)
ipeds_hc <- ipeds_hc_all %>% select(instnm,unitid,lstudy,efytotlt)

Get to know data

head(ipeds_hc)
#> # A tibble: 6 x 4
#>   instnm                              unitid                lstudy efytotlt
#>   <chr>                                <dbl>             <dbl+lbl>    <dbl>
#> 1 Alabama A & M University            100654 999 [Generated total]     6157
#> 2 Alabama A & M University            100654   1 [Undergraduate]       4865
#> 3 Alabama A & M University            100654   3 [Graduate]            1292
#> 4 University of Alabama at Birmingham 100663 999 [Generated total]    21554
#> 5 University of Alabama at Birmingham 100663   1 [Undergraduate]      13440
#> 6 University of Alabama at Birmingham 100663   3 [Graduate]            8114
str(ipeds_hc)
#> Classes 'tbl_df', 'tbl' and 'data.frame':    15726 obs. of  4 variables:
#>  $ instnm  : chr  "Alabama A & M University" "Alabama A & M University" "Alabama A & M University" "University of Alabama at Birmingham" ...
#>   ..- attr(*, "label")= chr "Institution (entity) name"
#>   ..- attr(*, "format.stata")= chr "%91s"
#>  $ unitid  : num  100654 100654 100654 100663 100663 ...
#>   ..- attr(*, "label")= chr "Unique identification number of the institution"
#>   ..- attr(*, "format.stata")= chr "%12.0g"
#>  $ lstudy  : 'haven_labelled' num  999 1 3 999 1 3 999 1 3 999 ...
#>   ..- attr(*, "label")= chr "Original level of study on survey form"
#>   ..- attr(*, "format.stata")= chr "%15.0g"
#>   ..- attr(*, "labels")= Named num  1 3 999
#>   .. ..- attr(*, "names")= chr  "Undergraduate" "Graduate" "Generated total"
#>  $ efytotlt: num  6157 4865 1292 21554 13440 ...
#>   ..- attr(*, "label")= chr "Grand total"
#>   ..- attr(*, "format.stata")= chr "%12.0g"
#>  - attr(*, "label")= chr "dct_s2015_is"

#Variable labels
ipeds_hc %>% var_label()
#> $instnm
#> [1] "Institution (entity) name"
#> 
#> $unitid
#> [1] "Unique identification number of the institution"
#> 
#> $lstudy
#> [1] "Original level of study on survey form"
#> 
#> $efytotlt
#> [1] "Grand total"

#Only the variable lstudy has value labels
ipeds_hc %>% select(lstudy) %>% val_labels()
#> $lstudy
#>   Undergraduate        Graduate Generated total 
#>               1               3             999

Student Task:

  1. Diagnose the problem with the data frame ipeds_hc (why is it untidy?)
  2. Sketch out what the tidy data should look like
  3. Transform untidy to tidy


Solution to student task:

1. Diagnose the problem with the data frame. First, let’s investigate the data

ipeds_hc <- ipeds_hc %>% arrange(unitid, lstudy)
head(ipeds_hc, n=20)
#> # A tibble: 20 x 4
#>    instnm                              unitid               lstudy efytotlt
#>    <chr>                                <dbl>            <dbl+lbl>    <dbl>
#>  1 Alabama A & M University            100654   1 [Undergraduate]      4865
#>  2 Alabama A & M University            100654   3 [Graduate]           1292
#>  3 Alabama A & M University            100654 999 [Generated tota…     6157
#>  4 University of Alabama at Birmingham 100663   1 [Undergraduate]     13440
#>  5 University of Alabama at Birmingham 100663   3 [Graduate]           8114
#>  6 University of Alabama at Birmingham 100663 999 [Generated tota…    21554
#>  7 Amridge University                  100690   1 [Undergraduate]       415
#>  8 Amridge University                  100690   3 [Graduate]            415
#>  9 Amridge University                  100690 999 [Generated tota…      830
#> 10 University of Alabama in Huntsville 100706   1 [Undergraduate]      6994
#> 11 University of Alabama in Huntsville 100706   3 [Graduate]           2212
#> 12 University of Alabama in Huntsville 100706 999 [Generated tota…     9206
#> 13 Alabama State University            100724   1 [Undergraduate]      5373
#> 14 Alabama State University            100724   3 [Graduate]            728
#> 15 Alabama State University            100724 999 [Generated tota…     6101
#> 16 The University of Alabama           100751   1 [Undergraduate]     35199
#> 17 The University of Alabama           100751   3 [Graduate]           6308
#> 18 The University of Alabama           100751 999 [Generated tota…    41507
#> 19 Central Alabama Community College   100760   1 [Undergraduate]      2379
#> 20 Central Alabama Community College   100760 999 [Generated tota…     2379

#code to investigate what each observation represents
  #I'll break this code down next week when we talk about joining data frames
ipeds_hc %>% group_by(unitid,lstudy) %>% # group_by our candidate
  mutate(n_per_id=n()) %>% # calculate number of obs per group
  ungroup() %>% # ungroup the data
  count(n_per_id==1) # count "true that only one obs per group"
#> # A tibble: 1 x 2
#>   `n_per_id == 1`     n
#>   <lgl>           <int>
#> 1 TRUE            15726
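As an aside, the same check can be written more compactly with count(), which is shorthand for group_by() + summarise(n = n()). A minimal sketch on a toy tibble (the data and column names here are invented for illustration, mimicking unitid and lstudy):

```r
library(dplyr)

# toy data: unitid 2 has a duplicated (unitid, lstudy) combination
toy <- tibble(
  unitid = c(1, 1, 2, 2, 2),
  lstudy = c(1, 3, 1, 3, 3)
)

# count rows per combination, then tally how often each count occurs
res <- toy %>%
  count(unitid, lstudy, name = "n_per_id") %>%
  count(n_per_id)
res
```

If every (unitid, lstudy) combination were unique, the result would have a single row with n_per_id equal to 1.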

Summary of problems with the data frame:

  • In the untidy data frame, each row represents college-level_of_study
    • there are separate rows for each value of level of study (undergraduate, graduate, generated total)
    • so three rows for each college
  • the values of the column lstudy represent different attributes (undergraduate, graduate, generated total)
    • each of these attributes should be its own variable

2. Sketch out what the tidy data should look like (sketch out on your own)

  • What tidy data should look like:
    • Each observation (row) should be a college
    • There should be separate variables for each level of study, with each variable containing enrollment for that level of study

3. Transform untidy to tidy

  • names_from. Column name in the untidy data whose values will become variable names in the tidy data
    • this variable name exists in the untidy data
    • in ipeds_hc the key column is lstudy; the values of lstudy will become variable names in the tidy data
  • values_from. Column name in untidy data that contains values for the new variables that will be created in the tidy data
    • this is a varname that exists in the untidy data
    • in ipeds_hc the value column is efytotlt; the values of efytotlt will become the values of the new variables in the tidy data
ipeds_hc %>% 
    pivot_wider(names_from = lstudy, values_from = efytotlt) #notice it uses the underlying data not labels
#> # A tibble: 6,951 x 5
#>    instnm                              unitid   `1`   `3` `999`
#>    <chr>                                <dbl> <dbl> <dbl> <dbl>
#>  1 Alabama A & M University            100654  4865  1292  6157
#>  2 University of Alabama at Birmingham 100663 13440  8114 21554
#>  3 Amridge University                  100690   415   415   830
#>  4 University of Alabama in Huntsville 100706  6994  2212  9206
#>  5 Alabama State University            100724  5373   728  6101
#>  6 The University of Alabama           100751 35199  6308 41507
#>  7 Central Alabama Community College   100760  2379    NA  2379
#>  8 Athens State University             100812  4124    NA  4124
#>  9 Auburn University at Montgomery     100830  5072   924  5996
#> 10 Auburn University                   100858 22953  6284 29237
#> # … with 6,941 more rows

Alternative solution

It is helpful to create a character version of the variable lstudy prior to pivoting wider

ipeds_hc %>% select(lstudy) %>% val_labels()
#> $lstudy
#>   Undergraduate        Graduate Generated total 
#>               1               3             999
str(ipeds_hc$lstudy)
#>  'haven_labelled' num [1:15726] 1 3 999 1 3 999 1 3 999 1 ...
#>  - attr(*, "label")= chr "Original level of study on survey form"
#>  - attr(*, "labels")= Named num [1:3] 1 3 999
#>   ..- attr(*, "names")= chr [1:3] "Undergraduate" "Graduate" "Generated total"

ipeds_hcv2 <- ipeds_hc %>% 
  mutate(level = recode(as.integer(lstudy),
    `1` = "ug",
    `3` = "grad",
    `999` = "all")
  ) %>% select(-lstudy) # drop variable lstudy

head(ipeds_hcv2)
#> # A tibble: 6 x 4
#>   instnm                              unitid efytotlt level
#>   <chr>                                <dbl>    <dbl> <chr>
#> 1 Alabama A & M University            100654     4865 ug   
#> 2 Alabama A & M University            100654     1292 grad 
#> 3 Alabama A & M University            100654     6157 all  
#> 4 University of Alabama at Birmingham 100663    13440 ug   
#> 5 University of Alabama at Birmingham 100663     8114 grad 
#> 6 University of Alabama at Birmingham 100663    21554 all

ipeds_hcv2 %>% select(instnm,unitid,level,efytotlt) %>%
    pivot_wider(names_from = level, values_from = efytotlt) #nicer!
#> # A tibble: 6,951 x 5
#>    instnm                              unitid    ug  grad   all
#>    <chr>                                <dbl> <dbl> <dbl> <dbl>
#>  1 Alabama A & M University            100654  4865  1292  6157
#>  2 University of Alabama at Birmingham 100663 13440  8114 21554
#>  3 Amridge University                  100690   415   415   830
#>  4 University of Alabama in Huntsville 100706  6994  2212  9206
#>  5 Alabama State University            100724  5373   728  6101
#>  6 The University of Alabama           100751 35199  6308 41507
#>  7 Central Alabama Community College   100760  2379    NA  2379
#>  8 Athens State University             100812  4124    NA  4124
#>  9 Auburn University at Montgomery     100830  5072   924  5996
#> 10 Auburn University                   100858 22953  6284 29237
#> # … with 6,941 more rows
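Another option (not shown above): the haven package's as_factor() converts a labelled vector to a factor using its value labels, so the value labels become readable column names without writing out a recode(). A sketch with a hand-built labelled vector whose labels mimic lstudy (the toy data and names are invented):

```r
library(haven)
library(dplyr)
library(tidyr)

# hand-built labelled vector mimicking lstudy
lstudy <- labelled(
  c(1, 3, 999, 1, 999),
  labels = c("Undergraduate" = 1, "Graduate" = 3, "Generated total" = 999)
)
toy <- tibble(
  unitid   = c(10, 10, 10, 20, 20),
  lstudy   = lstudy,
  efytotlt = c(40, 10, 50, 25, 25)
)

res <- toy %>%
  mutate(level = as_factor(lstudy)) %>%  # value labels become factor levels
  select(-lstudy) %>%
  pivot_wider(names_from = level, values_from = efytotlt)
res
```

The resulting columns are named Undergraduate, Graduate, and Generated total; unitid 20 has no graduate row, so its Graduate value is NA.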

4.3.2 Pivoting wider with multiple value variables

What if we want to pivot wider a dataset that contains multiple value variables?

Create dataset that has total enrollment, enrollment “men”, and enrollment “women” [IPEDS terms]

ipeds_hc_multi_val <- ipeds_hc_all %>%
  select(unitid,lstudy,efytotlt,efytotlm,efytotlw)

ipeds_hc_multi_val <- ipeds_hc_multi_val %>% 
  mutate(level = recode(as.integer(lstudy),
    `1` = "ug",
    `3` = "grad",
    `999` = "all")
  ) %>% select(-lstudy) # drop variable lstudy

head(ipeds_hc_multi_val)
#> # A tibble: 6 x 5
#>   unitid efytotlt efytotlm efytotlw level
#>    <dbl>    <dbl>    <dbl>    <dbl> <chr>
#> 1 100654     6157     2646     3511 all  
#> 2 100654     4865     2258     2607 ug   
#> 3 100654     1292      388      904 grad 
#> 4 100663    21554     8383    13171 all  
#> 5 100663    13440     5552     7888 ug   
#> 6 100663     8114     2831     5283 grad

ipeds_hc_multi_val  %>% 
    pivot_wider(names_from = level, values_from = c(efytotlt, efytotlm, efytotlw))
#> # A tibble: 6,951 x 10
#>    unitid efytotlt_all efytotlt_ug efytotlt_grad efytotlm_all efytotlm_ug
#>     <dbl>        <dbl>       <dbl>         <dbl>        <dbl>       <dbl>
#>  1 100654         6157        4865          1292         2646        2258
#>  2 100663        21554       13440          8114         8383        5552
#>  3 100690          830         415           415          337         177
#>  4 100706         9206        6994          2212         5194        3983
#>  5 100724         6101        5373           728         2321        2105
#>  6 100751        41507       35199          6308        18526       15946
#>  7 100760         2379        2379            NA         1027        1027
#>  8 100812         4124        4124            NA         1478        1478
#>  9 100830         5996        5072           924         2158        1847
#> 10 100858        29237       22953          6284        14893       11871
#> # … with 6,941 more rows, and 4 more variables: efytotlm_grad <dbl>,
#> #   efytotlw_all <dbl>, efytotlw_ug <dbl>, efytotlw_grad <dbl>
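By default pivot_wider() glues the value-column name and the key value together with an underscore (e.g., efytotlt_ug); the names_sep argument controls that separator. A toy sketch (data and names invented):

```r
library(tidyr)
library(dplyr)

toy <- tibble(
  id    = c(1, 1, 2, 2),
  level = c("ug", "grad", "ug", "grad"),
  tot   = c(40, 10, 25, 5),
  men   = c(15, 4, 12, 2)
)

res <- toy %>%
  pivot_wider(
    names_from  = level,
    values_from = c(tot, men),
    names_sep   = "."       # "tot.ug" instead of the default "tot_ug"
  )
res
```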

5 Missing values

5.1 Explicit and implicit missing values

This section deals with missing values and tidying data.

But first, it is helpful to create a new version of the IPEDS enrollment dataset as follows:

  • keeps observations for for-profit colleges
  • keeps the following enrollment variables:
    • total enrollment
    • enrollment of students who identify as “Black or African American”
ipeds_hc_na <- ipeds_hc_all %>% filter(sector %in% c(3,6,9)) %>% #keep only for-profit colleges
  select(instnm,unitid,lstudy,efytotlt,efybkaam) %>% 
  mutate(level = recode(as.integer(lstudy), # create recoded version of lstudy
    `1` = "ug",
    `3` = "grad",
    `999` = "all")
  ) %>% select(instnm,unitid,level,efytotlt,efybkaam) %>% 
  arrange(unitid,desc(level))

Now, let’s print some rows

  • There is one row for each college-level_of_study
  • Some colleges have three rows of data (ug, grad, all)
  • Colleges that don’t have any undergraduates or don’t have any graduate students only have two rows of data
ipeds_hc_na
#> # A tibble: 6,265 x 5
#>    instnm                               unitid level efytotlt efybkaam
#>    <chr>                                 <dbl> <chr>    <dbl>    <dbl>
#>  1 South University-Montgomery          101116 ug         777      122
#>  2 South University-Montgomery          101116 grad       218       40
#>  3 South University-Montgomery          101116 all        995      162
#>  4 New Beginning College of Cosmetology 101277 ug         132        0
#>  5 New Beginning College of Cosmetology 101277 all        132        0
#>  6 Herzing University-Birmingham        101365 ug         675       73
#>  7 Herzing University-Birmingham        101365 grad        15        5
#>  8 Herzing University-Birmingham        101365 all        690       78
#>  9 Prince Institute-Southeast           101958 ug          34        0
#> 10 Prince Institute-Southeast           101958 all         34        0
#> # … with 6,255 more rows

Now let’s create new versions of the enrollment variables that replace 0 with NA

ipeds_hc_na <- ipeds_hc_na %>% 
  mutate(
    efytotltv2 = ifelse(efytotlt == 0, NA, efytotlt),
    efybkaamv2 = ifelse(efybkaam == 0, NA, efybkaam)
  ) %>% select(instnm,unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)

ipeds_hc_na %>% select(unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)
#> # A tibble: 6,265 x 6
#>    unitid level efytotlt efytotltv2 efybkaam efybkaamv2
#>     <dbl> <chr>    <dbl>      <dbl>    <dbl>      <dbl>
#>  1 101116 ug         777        777      122        122
#>  2 101116 grad       218        218       40         40
#>  3 101116 all        995        995      162        162
#>  4 101277 ug         132        132        0         NA
#>  5 101277 all        132        132        0         NA
#>  6 101365 ug         675        675       73         73
#>  7 101365 grad        15         15        5          5
#>  8 101365 all        690        690       78         78
#>  9 101958 ug          34         34        0         NA
#> 10 101958 all         34         34        0         NA
#> # … with 6,255 more rows
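As an aside, dplyr's na_if() is purpose-built for this kind of replacement: na_if(x, 0) returns x with every 0 converted to NA, so the ifelse() above could be written more compactly. A minimal sketch on a toy column (data invented):

```r
library(dplyr)

toy <- tibble(efybkaam = c(122, 0, 73, 0))

# na_if(x, y) replaces values of x equal to y with NA
res <- toy %>% mutate(efybkaamv2 = na_if(efybkaam, 0))
res
```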

Create dataset that drops the original enrollment variables, keeps enrollment vars that replace 0 with NA

ipeds_hc_nav2 <- ipeds_hc_na %>% select(-efytotlt,-efybkaam)

Now we can introduce the concepts of explicit and implicit missing values

There are two types of missing values:

  • Explicit missing values: the variable has the value NA for a particular row
  • Implicit missing values: the row is simply not present in the data

Let’s print data for the first two colleges

ipeds_hc_nav2 %>% head(n=5)
#> # A tibble: 5 x 5
#>   instnm                               unitid level efytotltv2 efybkaamv2
#>   <chr>                                 <dbl> <chr>      <dbl>      <dbl>
#> 1 South University-Montgomery          101116 ug           777        122
#> 2 South University-Montgomery          101116 grad         218         40
#> 3 South University-Montgomery          101116 all          995        162
#> 4 New Beginning College of Cosmetology 101277 ug           132         NA
#> 5 New Beginning College of Cosmetology 101277 all          132         NA

South University-Montgomery has three rows:

  • variable efytotltv2 has 0 explicit missing values and 0 implicit missing values
  • variable efybkaamv2 has 0 explicit missing values and 0 implicit missing values

New Beginning College of Cosmetology has two rows (because they have no graduate students):

  • variable efytotltv2 has 0 explicit missing values and 1 implicit missing values (no row for grad students)
  • variable efybkaamv2 has 2 explicit missing values and 1 implicit missing values (no row for grad students)
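Counting explicit missing values is a one-liner: sum(is.na(x)). For example, tallying explicit NAs per variable in a toy version of the data (values and names invented to mimic the two colleges above):

```r
library(dplyr)

toy <- tibble(
  efytotltv2 = c(777, 218, 995, 132, 132),
  efybkaamv2 = c(122, 40, 162, NA, NA)
)

# count explicit NAs in each enrollment variable
res <- toy %>% summarise(
  na_tot = sum(is.na(efytotltv2)),
  na_bk  = sum(is.na(efybkaamv2))
)
res
```

Note that this only counts explicit missing values; implicit missing values are rows that do not exist, so is.na() never sees them.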

Let’s look at another dataset called stocks, which shows stock return for each year and quarter for some hypothetical company.

  • Wickham uses this dataset to introduce explicit and implicit missing values
stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
stocks # note: this data is already tidy
#> # A tibble: 7 x 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4  NA   
#> 5  2016     2   0.92
#> 6  2016     3   0.17
#> 7  2016     4   2.66

The variable return has:

  • 1 explicit missing value in year==2015 and qtr==4
  • 1 implicit missing value in year==2016 and qtr==1; this row of data simply does not exist
stocks %>% 
  complete(year, qtr)
#> # A tibble: 8 x 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4  NA   
#> 5  2016     1  NA   
#> 6  2016     2   0.92
#> 7  2016     3   0.17
#> 8  2016     4   2.66

5.2 Making implicit missing values explicit

An implicit missing value is the result of a row not existing. If you want to make an implicit missing value explicit, then make the nonexistent row exist.

The complete() function within the tidyr package turns implicit missing values into explicit missing values

  • Basically, you specify the object (i.e., data frame) and a list of variables;
  • complete() returns an object that has all unique combinations of those variables, including those not found in the original data frame
  • I’ll skip additional options and complications
stocks
#> # A tibble: 7 x 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4  NA   
#> 5  2016     2   0.92
#> 6  2016     3   0.17
#> 7  2016     4   2.66
stocks %>% complete(year, qtr)
#> # A tibble: 8 x 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4  NA   
#> 5  2016     1  NA   
#> 6  2016     2   0.92
#> 7  2016     3   0.17
#> 8  2016     4   2.66

Note that we now have a row for year==2016 and qtr==1 that has an explicit missing value for return

Let’s apply complete() to our IPEDS dataset

ipeds_hc_nav2
#> # A tibble: 6,265 x 5
#>    instnm                               unitid level efytotltv2 efybkaamv2
#>    <chr>                                 <dbl> <chr>      <dbl>      <dbl>
#>  1 South University-Montgomery          101116 ug           777        122
#>  2 South University-Montgomery          101116 grad         218         40
#>  3 South University-Montgomery          101116 all          995        162
#>  4 New Beginning College of Cosmetology 101277 ug           132         NA
#>  5 New Beginning College of Cosmetology 101277 all          132         NA
#>  6 Herzing University-Birmingham        101365 ug           675         73
#>  7 Herzing University-Birmingham        101365 grad          15          5
#>  8 Herzing University-Birmingham        101365 all          690         78
#>  9 Prince Institute-Southeast           101958 ug            34         NA
#> 10 Prince Institute-Southeast           101958 all           34         NA
#> # … with 6,255 more rows

ipeds_complete <- ipeds_hc_nav2 %>% select(unitid,level,efytotltv2,efybkaamv2) %>%
  complete(unitid, level)

ipeds_complete
#> # A tibble: 9,063 x 4
#>    unitid level efytotltv2 efybkaamv2
#>     <dbl> <chr>      <dbl>      <dbl>
#>  1 101116 all          995        162
#>  2 101116 grad         218         40
#>  3 101116 ug           777        122
#>  4 101277 all          132         NA
#>  5 101277 grad          NA         NA
#>  6 101277 ug           132         NA
#>  7 101365 all          690         78
#>  8 101365 grad          15          5
#>  9 101365 ug           675         73
#> 10 101958 all           34         NA
#> # … with 9,053 more rows

#Confirm that the "complete" dataset always has three observations per unitid
ipeds_complete %>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
#> # A tibble: 1 x 2
#>       n    nn
#>   <int> <int>
#> 1     3  3021

#Note that previous dataset did not
ipeds_hc_nav2 %>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
#> # A tibble: 2 x 2
#>       n    nn
#>   <int> <int>
#> 1     2  2798
#> 2     3   223

Should you make implicit missing values explicit?

  • No clear-cut answer; depends on many context-specific things about your project
  • The important thing is to be aware of the presence of implicit missing values (both in the “input” datasets you read in and the datasets you create from these inputs) and to be purposeful about how you deal with implicit missing values
    • This is the sort of thing sloppy researchers forget/ignore
  • My recommendation for the stage of creating analysis datasets from input data:
    • If you feel unsure about making implicit values explicit, then I recommend making them explicit
    • This forces you to be more fully aware of patterns of missing data; helps you avoid careless mistakes down the road
    • After making implicit missing values explicit, you can drop these rows once you are sure you don’t need them
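For that last step, tidyr's drop_na() is one way to drop such rows: it removes rows that have an NA in any of the named columns (or in any column, if none are named). A sketch on a toy tibble (data and names invented):

```r
library(tidyr)
library(dplyr)

toy <- tibble(
  unitid = c(1, 1, 1, 2, 2, 2),
  level  = c("ug", "grad", "all", "ug", "grad", "all"),
  enroll = c(132, NA, 132, 777, 218, 995)
)

res <- toy %>% drop_na(enroll)  # drops the one row whose enroll is NA
res
```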

5.3 Spreading and explicit/implicit missing values

We use spread() to transform rows into columns; outside the tidyverse, referred to as reshaping from “long” to “wide”

Let’s look at two datasets that have similar structure. Which one of these is in need of tidying?

stocks
#> # A tibble: 7 x 3
#>    year   qtr return
#>   <dbl> <dbl>  <dbl>
#> 1  2015     1   1.88
#> 2  2015     2   0.59
#> 3  2015     3   0.35
#> 4  2015     4  NA   
#> 5  2016     2   0.92
#> 6  2016     3   0.17
#> 7  2016     4   2.66

ipeds_hc_nav2 %>% select (instnm,unitid,level,efytotltv2) %>% head(n=5)
#> # A tibble: 5 x 4
#>   instnm                               unitid level efytotltv2
#>   <chr>                                 <dbl> <chr>      <dbl>
#> 1 South University-Montgomery          101116 ug           777
#> 2 South University-Montgomery          101116 grad         218
#> 3 South University-Montgomery          101116 all          995
#> 4 New Beginning College of Cosmetology 101277 ug           132
#> 5 New Beginning College of Cosmetology 101277 all          132

Let’s use spread() to tidy the IPEDS dataset, focusing on the total enrollment variable which has implicit missing values but no explicit missing values

ipeds_hc_nav2 %>% select (instnm,unitid,level,efytotltv2) %>%
  spread(key = level, value = efytotltv2) %>% 
  arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 5
#>   instnm                               unitid   all  grad    ug
#>   <chr>                                 <dbl> <dbl> <dbl> <dbl>
#> 1 South University-Montgomery          101116   995   218   777
#> 2 New Beginning College of Cosmetology 101277   132    NA   132
#> 3 Herzing University-Birmingham        101365   690    15   675
#> 4 Prince Institute-Southeast           101958    34    NA    34
#> 5 Charter College                      102845  3996    20  3976

The resulting dataset has explicit missing values (i.e., NAs) for rows that had implicit missing values in the input data.

Let’s use spread() on the IPEDS dataset again, this time focusing on the Black enrollment variable, which has both explicit and implicit missing values

ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>% head(n=5)
#> # A tibble: 5 x 4
#>   instnm                               unitid level efybkaamv2
#>   <chr>                                 <dbl> <chr>      <dbl>
#> 1 South University-Montgomery          101116 ug           122
#> 2 South University-Montgomery          101116 grad          40
#> 3 South University-Montgomery          101116 all          162
#> 4 New Beginning College of Cosmetology 101277 ug            NA
#> 5 New Beginning College of Cosmetology 101277 all           NA

ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>%
  spread(key = level, value = efybkaamv2) %>% 
  arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 5
#>   instnm                               unitid   all  grad    ug
#>   <chr>                                 <dbl> <dbl> <dbl> <dbl>
#> 1 South University-Montgomery          101116   162    40   122
#> 2 New Beginning College of Cosmetology 101277    NA    NA    NA
#> 3 Herzing University-Birmingham        101365    78     5    73
#> 4 Prince Institute-Southeast           101958    NA    NA    NA
#> 5 Charter College                      102845   102     1   101

The resulting dataset has explicit missing values (i.e., NAs) for rows that had implicit missing values in the input data and for rows that had explicit missing values in the input data.

What if we spread a dataset that has explicit missing values and no implicit missing values?

  • Resulting dataset has explicit missing values (i.e., NAs) for rows that had explicit missing values in the input data.
ipeds_complete %>% select(unitid,level,efybkaamv2) %>%
  spread(key = level, value = efybkaamv2) %>% 
  arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 4
#>   unitid   all  grad    ug
#>    <dbl> <dbl> <dbl> <dbl>
#> 1 101116   162    40   122
#> 2 101277    NA    NA    NA
#> 3 101365    78     5    73
#> 4 101958    NA    NA    NA
#> 5 102845   102     1   101

Takeaways about spread() and missing values

  • Explicit missing values in the input data become explicit missing values in the resulting dataset
  • Implicit missing values in the input data become explicit missing values in the resulting dataset

Therefore, no need to make implicit missing values explicit (using complete()) prior to spreading
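The stocks data makes this concrete: pivoting quarters into columns surfaces the implicit 2016 Q1 value as an explicit NA, with no call to complete() needed. A sketch using pivot_wider(), the newer equivalent of spread():

```r
library(tidyr)
library(dplyr)

stocks <- tibble(
  year   = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
  qtr    = c(   1,    2,    3,    4,    2,    3,    4),
  return = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

# the implicit 2016 Q1 value becomes an explicit NA in column `1`
res <- stocks %>% pivot_wider(names_from = qtr, values_from = return)
res
```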

5.4 Gathering and explicit/implicit missing values [SKIP]

We use gather() to transform columns into rows; outside the tidyverse, referred to as reshaping from “wide” to “long”

Let’s create a dataset in need of gathering

  • start w/ IPEDS dataset that has enrollment for Black students
  • create a fictitious 2017 enrollment variable equal to 2016 enrollment + 20
  • rename the enrollment vars 2016 and 2017
ipeds_gather <- ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>%
  mutate(
    efybkaamv2_2017= ifelse(!is.na(efybkaamv2),efybkaamv2+20,20)
  ) %>%
  rename("2016"=efybkaamv2,"2017"=efybkaamv2_2017)

ipeds_gather %>% head(n=10)
#> # A tibble: 10 x 5
#>    instnm                               unitid level `2016` `2017`
#>    <chr>                                 <dbl> <chr>  <dbl>  <dbl>
#>  1 South University-Montgomery          101116 ug       122    142
#>  2 South University-Montgomery          101116 grad      40     60
#>  3 South University-Montgomery          101116 all      162    182
#>  4 New Beginning College of Cosmetology 101277 ug        NA     20
#>  5 New Beginning College of Cosmetology 101277 all       NA     20
#>  6 Herzing University-Birmingham        101365 ug        73     93
#>  7 Herzing University-Birmingham        101365 grad       5     25
#>  8 Herzing University-Birmingham        101365 all       78     98
#>  9 Prince Institute-Southeast           101958 ug        NA     20
#> 10 Prince Institute-Southeast           101958 all       NA     20

  • The variable 2016 has both implicit and explicit missing values
  • The variable 2017 has implicit but no explicit missing values

Let’s use gather() to transform the columns 2016 and 2017 into rows

ipeds_gatherv2 <- ipeds_gather %>% 
  gather(`2016`,`2017`,key = year, value = efybkaam) %>%
  arrange(unitid,desc(level),year) 

ipeds_gatherv2 %>% head(n=10)
#> # A tibble: 10 x 5
#>    instnm                               unitid level year  efybkaam
#>    <chr>                                 <dbl> <chr> <chr>    <dbl>
#>  1 South University-Montgomery          101116 ug    2016       122
#>  2 South University-Montgomery          101116 ug    2017       142
#>  3 South University-Montgomery          101116 grad  2016        40
#>  4 South University-Montgomery          101116 grad  2017        60
#>  5 South University-Montgomery          101116 all   2016       162
#>  6 South University-Montgomery          101116 all   2017       182
#>  7 New Beginning College of Cosmetology 101277 ug    2016        NA
#>  8 New Beginning College of Cosmetology 101277 ug    2017        20
#>  9 New Beginning College of Cosmetology 101277 all   2016        NA
#> 10 New Beginning College of Cosmetology 101277 all   2017        20

Before looking at missing values, let’s investigate data structure

#number of rows after gathering is exactly 2X number of rows in input dataset
nrow(ipeds_gather)
#> [1] 6265
nrow(ipeds_gatherv2)
#> [1] 12530
nrow(ipeds_gatherv2)==nrow(ipeds_gather)*2
#> [1] TRUE

#How many observations for each combination of unitid and level? 
  #always 2: one for 2016 and one for 2017
ipeds_gatherv2 %>% group_by(unitid,level) %>% 
  summarise(n_per_unitid_level=n()) %>% ungroup %>% count(n_per_unitid_level)
#> # A tibble: 1 x 2
#>   n_per_unitid_level     n
#>                <int> <int>
#> 1                  2  6265

#How many observations for each unitid? 
  # 4 observations for colleges that had two rows in the input dataset
  # 6 observations for colleges that had three rows in the input dataset
ipeds_gatherv2 %>% group_by(unitid) %>% 
  summarise(n_per_unitid=n()) %>% ungroup %>% count(n_per_unitid)
#> # A tibble: 2 x 2
#>   n_per_unitid     n
#>          <int> <int>
#> 1            4  2798
#> 2            6   223

Let’s compare the data before and after gathering for one college

#before gathering
ipeds_gather %>% filter(unitid==101277)
#> # A tibble: 2 x 5
#>   instnm                               unitid level `2016` `2017`
#>   <chr>                                 <dbl> <chr>  <dbl>  <dbl>
#> 1 New Beginning College of Cosmetology 101277 ug        NA     20
#> 2 New Beginning College of Cosmetology 101277 all       NA     20
#after gathering
ipeds_gatherv2 %>% filter(unitid==101277)
#> # A tibble: 4 x 5
#>   instnm                               unitid level year  efybkaam
#>   <chr>                                 <dbl> <chr> <chr>    <dbl>
#> 1 New Beginning College of Cosmetology 101277 ug    2016        NA
#> 2 New Beginning College of Cosmetology 101277 ug    2017        20
#> 3 New Beginning College of Cosmetology 101277 all   2016        NA
#> 4 New Beginning College of Cosmetology 101277 all   2017        20

Takeaways about gathering and explicit/implicit missing values

  • Explicit missing values from the input dataset become explicit missing values in the resulting dataset after gathering
  • Implicit missing values from the input dataset remain implicit missing values in the resulting dataset after gathering. Why:
    • In the input dataset, no row existed for these implicit missing values
    • For each existing row in the input dataset, gather() creates new rows
    • gather() does nothing to rows that do not exist in the input dataset

Therefore, if you want implicit missing values to be explicit after gathering, then prior to gathering you should use complete() to make them explicit
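A toy sketch of that order of operations, using pivot_longer() (the newer equivalent of gather()); the data and names are invented, mimicking the structure of ipeds_gather:

```r
library(tidyr)
library(dplyr)

# toy input: unitid 2 has no "grad" row (an implicit missing value)
toy <- tibble(
  unitid = c(1, 1, 2),
  level  = c("ug", "grad", "ug"),
  `2016` = c(122, 40, NA),
  `2017` = c(142, 60, 20)
)

res <- toy %>%
  complete(unitid, level) %>%  # make the missing grad row explicit first
  pivot_longer(c(`2016`, `2017`), names_to = "year", values_to = "efybkaam")
res
```

After complete(), unitid 2 gains a grad row filled with NAs, so the long data has explicit NA rows for it in both years.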

6 Tidying data: separating and uniting [SKIP]

table5
#> # A tibble: 6 x 4
#>   country     century year  rate             
#>   <chr>       <chr>   <chr> <chr>            
#> 1 Afghanistan 19      99    745/19987071     
#> 2 Afghanistan 20      00    2666/20595360    
#> 3 Brazil      19      99    37737/172006362  
#> 4 Brazil      20      00    80488/174504898  
#> 5 China       19      99    212258/1272915272
#> 6 China       20      00    213766/1280428583

  • “Separating” is for dealing with the problem in the rate column
  • “Uniting” is for dealing with the problem in the century and year columns

Read about this in section 12.4 of Wickham text, but no homework questions on separating and uniting
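A quick sketch of what those two functions do, on a toy copy of table5’s structure so the chunk is self-contained:

```r
library(tidyr)
library(dplyr)

# toy copy of the first two rows of table5
toy <- tibble(
  country = c("Afghanistan", "Afghanistan"),
  century = c("19", "20"),
  year    = c("99", "00"),
  rate    = c("745/19987071", "2666/20595360")
)

res <- toy %>%
  separate(rate, into = c("cases", "population"), sep = "/") %>%  # one column -> two
  unite(year4, century, year, sep = "")                           # two columns -> one
res
```

separate() splits rate at the "/" into cases and population; unite() pastes century and year into a single four-digit year column (here named year4).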