html_document
rather than beamer_presentation
output for the next two lectureslecture6.html
or knit and follow along in that versionbeamer_presentation
withgather()
and spread()
Functions we will use today are new and still under development, so you need to install the development version of tidyr
#install.packages("devtools")
#devtools::install_github("tidyverse/tidyr")
You must run this R code chunk (install packages first if you don’t have them)
library(tidyverse)
#> ── Attaching packages ────────────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
#> ✔ ggplot2 3.2.1 ✔ purrr 0.3.2
#> ✔ tibble 2.1.3 ✔ dplyr 0.8.3
#> ✔ tidyr 1.0.0.9000 ✔ stringr 1.4.0
#> ✔ readr 1.3.1 ✔ forcats 0.4.0
#> ── Conflicts ───────────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
#> ✖ dplyr::filter() masks stats::filter()
#> ✖ dplyr::lag() masks stats::lag()
library(haven)
library(labelled)
Creating analysis datasets often require changing the organizational structure of data
Examples:
Two common ways to change organizational structure of data
group_by
to perform calculations separately within groups and then use summarise
to create an object with one observation per group. Examples:
This lecture is about changing the organizational structure of your data by transforming untidy data into tidy data.
Show index and example datasets in tidyr package
help(package="tidyr")
# note that example datasets table1, table2, etc. are listed in the index alongside functions
table1
tidyr::table1 # same same
df1 <- table1 # create an object from the "function" table1
str(df1)
table2
table3
Before we define “tidy data”, we will spend significant time defining and discussing some core terms/concepts about datasets.
This discussion draws from the 2014 article Tidy Data by Hadley Wickham.
Dataset structure refers to the “physical layout” of a dataset
There are many alternative data structures to present the same underlying data
#create structure a: treatment as columns, names as rows
structure_a <- tibble(
name = c("John Smith","Jane Doe","Mary Johnson"),
treatmenta = c(NA, 16, 3),
treatmentb = c(2, 11, 1)
)
#create structure b: treatment as rows, names as columns
structure_b <- tibble(
treatment = c("treatmenta","treatmentb"),
John_Smith = c(NA, 2),
Jane_Doe = c(16,11),
Mary_Johnson = c(3,1)
)
structure_a
#> # A tibble: 3 x 3
#> name treatmenta treatmentb
#> <chr> <dbl> <dbl>
#> 1 John Smith NA 2
#> 2 Jane Doe 16 11
#> 3 Mary Johnson 3 1
structure_b
#> # A tibble: 2 x 4
#> treatment John_Smith Jane_Doe Mary_Johnson
#> <chr> <dbl> <dbl> <dbl>
#> 1 treatmenta NA 16 3
#> 2 treatmentb 2 11 1
unit of analysis [my term, not Wickham’s]:
Examples of different units of analysis:
Questions:
structure_a
?structure_a
#> # A tibble: 3 x 3
#> name treatmenta treatmentb
#> <chr> <dbl> <dbl>
#> 1 John Smith NA 2
#> 2 Jane Doe 16 11
#> 3 Mary Johnson 3 1
structure_b
?structure_b
#> # A tibble: 2 x 4
#> treatment John_Smith Jane_Doe Mary_Johnson
#> <chr> <dbl> <dbl> <dbl>
#> 1 treatmenta NA 16 3
#> 2 treatmentb 2 11 1
ipeds_hc_temp
?
ipeds_hc_temp <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL) %>%
select(unitid,lstudy,efytotlt) %>% arrange(unitid,lstudy)
ipeds_hc_temp
#> # A tibble: 15,726 x 3
#> unitid lstudy efytotlt
#> <dbl> <dbl+lbl> <dbl>
#> 1 100654 1 [Undergraduate] 4865
#> 2 100654 3 [Graduate] 1292
#> 3 100654 999 [Generated total] 6157
#> 4 100663 1 [Undergraduate] 13440
#> 5 100663 3 [Graduate] 8114
#> 6 100663 999 [Generated total] 21554
#> 7 100690 1 [Undergraduate] 415
#> 8 100690 3 [Graduate] 415
#> 9 100690 999 [Generated total] 830
#> 10 100706 1 [Undergraduate] 6994
#> # … with 15,716 more rows
#show variable labels and value labels
ipeds_hc_temp %>% var_label()
#> $unitid
#> [1] "Unique identification number of the institution"
#>
#> $lstudy
#> [1] "Original level of study on survey form"
#>
#> $efytotlt
#> [1] "Grand total"
ipeds_hc_temp %>% val_labels()
#> $unitid
#> NULL
#>
#> $lstudy
#> Undergraduate Graduate Generated total
#> 1 3 999
#>
#> $efytotlt
#> NULL
#print a few obs
ipeds_hc_temp %>% head(n=10)
#> # A tibble: 10 x 3
#> unitid lstudy efytotlt
#> <dbl> <dbl+lbl> <dbl>
#> 1 100654 1 [Undergraduate] 4865
#> 2 100654 3 [Graduate] 1292
#> 3 100654 999 [Generated total] 6157
#> 4 100663 1 [Undergraduate] 13440
#> 5 100663 3 [Graduate] 8114
#> 6 100663 999 [Generated total] 21554
#> 7 100690 1 [Undergraduate] 415
#> 8 100690 3 [Graduate] 415
#> 9 100690 999 [Generated total] 830
#> 10 100706 1 [Undergraduate] 6994
#print a few obs, with value labels rather than variable values
ipeds_hc_temp %>% head(n=10) %>% as_factor()
#> # A tibble: 10 x 3
#> unitid lstudy efytotlt
#> <dbl> <fct> <dbl>
#> 1 100654 Undergraduate 4865
#> 2 100654 Graduate 1292
#> 3 100654 Generated total 6157
#> 4 100663 Undergraduate 13440
#> 5 100663 Graduate 8114
#> 6 100663 Generated total 21554
#> 7 100690 Undergraduate 415
#> 8 100690 Graduate 415
#> 9 100690 Generated total 830
#> 10 100706 Undergraduate 6994
Identifying which combination of variables uniquely identifies rows in a data frame helps you identify the “unit of analysis” and understand the “structure” of your dataset
total_enrollment
or family_income
Task: Let’s try to identify the variable(s) that uniquely identify rows in ipeds_hc_temp
names(ipeds_hc_temp)
#> [1] "unitid" "lstudy" "efytotlt"
ipeds_hc_temp %>% head(n=10)
#> # A tibble: 10 x 3
#> unitid lstudy efytotlt
#> <dbl> <dbl+lbl> <dbl>
#> 1 100654 1 [Undergraduate] 4865
#> 2 100654 3 [Graduate] 1292
#> 3 100654 999 [Generated total] 6157
#> 4 100663 1 [Undergraduate] 13440
#> 5 100663 3 [Graduate] 8114
#> 6 100663 999 [Generated total] 21554
#> 7 100690 1 [Undergraduate] 415
#> 8 100690 3 [Graduate] 415
#> 9 100690 999 [Generated total] 830
#> 10 100706 1 [Undergraduate] 6994
First, Let’s investigate whether the ID variable unitid
uniquely identifies rows in data frame ipeds_hc_temp
ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
group_by(unitid) %>% # group by unitid
summarise(n_per_group=n()) %>% # create measure of number of obs per group
ungroup %>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
count(n_per_group) # frequency of number of observations per group
#> # A tibble: 2 x 2
#> n_per_group n
#> <int> <int>
#> 1 2 5127
#> 2 3 1824
What does above output tell us?
unitid
that have 2 rows for that value of unitid
unitid
that have 3 rows for that value of unitid
2*5127+3*1824==
15,726 which is the number of observations in ipeds_hc_temp
unitid
does not uniquely identify rows in the data frame ipeds_hc_temp
Second, Let’s investigate whether the comination of unitid
and lstudy
uniquely identifies rows in data frame ipeds_hc_temp
ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
group_by(unitid,lstudy) %>% # group by unitid and lstudy
summarise(n_per_group=n()) %>% # create measure of number of obs per group
ungroup %>% # ungroup (otherwise frequency table [next step] created) separately for each group (i.e., separate frequency table for each value of unitid)
count(n_per_group) # frequency of number of observations per group
#> # A tibble: 1 x 2
#> n_per_group n
#> <int> <int>
#> 1 1 15726
What does above output tell us?
1
row each unique combination of unitid
and lstudy
unitid
and lstudy
uniquely identify rows in the data frame ipeds_hc_temp
What happens if we didn’t ungroup?
ipeds_hc_temp %>% # start with data frame object ipeds_hc_temp
group_by(unitid,lstudy) %>% # group by unitid and lstudy
summarise(n_per_group=n()) %>% # create measure of number of obs per group
count(n_per_group) # frequency of number of observations per group
#> # A tibble: 6,951 x 3
#> unitid n_per_group n
#> <dbl> <int> <int>
#> 1 100654 1 3
#> 2 100663 1 3
#> 3 100690 1 3
#> 4 100706 1 3
#> 5 100724 1 3
#> 6 100751 1 3
#> 7 100760 1 2
#> 8 100812 1 2
#> 9 100830 1 3
#> 10 100858 1 3
#> # … with 6,941 more rows
Think of these dataset concepts as being distinct from dataset structure (rows and columns)
The difference between data structure and data concepts:
Wickham (2014, p. 3): “A dataset is a collection of values, usually either numbers (if quantitative) or strings (if qualitative). Values are organized in two ways. Every value belongs to a variable and an observation.”
Wickham (2014) definitions of dataset concepts: values, variables, and observations [I Recommend writing these definitions down on a separate sheet of paper]:
enrollment
for one organization in a dataset where each observation represents a postsecondary education institutionenrollment
is a vector that contains total enrollment for each organization in the datasetExample of a dataset that satisfies Wickham definitions
#> # A tibble: 10 x 4
#> unitid ug_enroll grad_enroll all_enroll
#> <dbl> <dbl> <dbl> <dbl>
#> 1 100654 4865 1292 6157
#> 2 100663 13440 8114 21554
#> 3 100690 415 415 830
#> 4 100706 6994 2212 9206
#> 5 100724 5373 728 6101
#> 6 100751 35199 6308 41507
#> 7 100760 2379 NA 2379
#> 8 100812 4124 NA 4124
#> 9 100830 5072 924 5996
#> 10 100858 22953 6284 29237
Remove object
rm(ipeds_hc_temp)
A particular project or data collection (e.g., longitudinal survey tracking student achievement from high school through college), may require several data sources:
Each of these data sources may use a different “level of observation” and each requiring a different “table” (i.e., dataframe):
Observational unit/observational level [Wickham’s term, not mine]
“In a given analysis, there may be multiple levels of observations. For example, in a trial of new allergy medication we might have three observational types: demographic data collected from each person (age, sex, race), medical data collected from each person on each day (number of sneezes, redness of eyes), and meteorological data collected on each day (temperature,pollen count) (Wickham 2014, p. 4)”
So Wickham defines observational unit/observational level as what each row should represent in an appropriately structured data frame
These data concepts (e.g., value, variable, observation) seem easy on first glance, but can feel slippery on closer inspection.
In particular, when you confronted with a particular dataset, sometimes it can feel confusing what the variables/observations/unit of analysis are (data structure) and what they should be (data concepts).
Wickham chapter 12: "There are three interrelated rules which make a dataset tidy:
I Recommend writing these rules down on a separate sheet of paper
“These three rules are interrelated because it’s impossible to only satisfy two of the three” (Wickham chapter 12)
Visual representation of the three rules of tidy data:
Here is an example of tidy data:
#help(package="tidyr")
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
Question:
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
While all tidy datasets are organized the same way, untidy datasets can have several different organizational structures
Here is an example of an untidy version of the same data shown in table1
#Untidy data, table 2
#help(package="tidyr")
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
Other examples of untidy data (output omitted):
table3
#tables 4a and 4b put the information from table1 in two separate data frames
table4a
table4b
table5
The first step in transforming untidy data to tidy data is diagnosing which principles of tidy data have been violated.
Recall the three principles of tidy data:
Let’s diagnose the problems with table2
by answering these questions
Printout of table2
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
Answers to these questions:
table2
these two attributes are recorded in the column “type” and the associated value for each type is recorded in the column “count”table2
has two rows per observation but it should have one row per observationFor each of the following datasets – table1
, table3
, table4a
, and table5
– answer the following questions:
We’ll give you ~15 minutes for this
ANSWERS TO STUDENT EXERCISE BELOW:
table1
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table3
table3
#> # A tibble: 6 x 3
#> country year rate
#> <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
rate
contains two variables, cases
and population
rate
column, each cell contains two values, a value for cases
and a value for population
table4a
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
cases
is spread over two columns and the variable year
is also spread over two columnstable4b
[not required to answer]table4b
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
Answers pretty much the same as table4a
, except table4b
shows data for population
rather than `cases’
table5
table5
#> # A tibble: 6 x 4
#> country century year rate
#> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
year' is spread across two columns
centuryand
year. Second, the
ratecolumn contains the two variables
casesand
population`year' is spread across two cells. Second, the each cell of the
ratecolumn contains two values, one for
casesand one for
population`Worthwhile to revisit Wickham’s (2014) defintions of variables, and observations in light of what we now know about tidy vs. untidy data.
Whenever I work with datasets I tend to think of observations as being synonomous with rows and variables as being synonomous with columns. But if we use Wickham’s definitions of observations and variables, this is not true.
table2
the attribute population
was represented by two columns, type
and count
table1
and table2
Wickham would think of country-year
as the proper observational unit. table1
has one row per country-year
(tidy) but table2
has two rows per country-year
(untidy).
“Observational unit” (data concept) vs. “unit of analysis” (my term; data structure)
Takeways from this discussion of formal data concepts and tidy vs. untidy data
In real world, we encounter many untidy datasets. We can still equate variables with columns and rows with observations.
Why should you create tidy datasets before conducting analyses?
If you have a consistent organizational structure for analysis datasets, easier to learn tools for analyzing data
str(table1)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 6 obs. of 4 variables:
#> $ country : chr "Afghanistan" "Afghanistan" "Brazil" "Brazil" ...
#> $ year : int 1999 2000 1999 2000 1999 2000
#> $ cases : int 745 2666 37737 80488 212258 213766
#> $ population: int 19987071 20595360 172006362 174504898 1272915272 1280428583
str(table1$country)
#> chr [1:6] "Afghanistan" "Afghanistan" "Brazil" "Brazil" "China" ...
str(table1$cases)
#> int [1:6] 745 2666 37737 80488 212258 213766
Example:
rate
variable (=cases/population*10000
) for table1
(tidy) and table2
(untidy)table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table1 %>%
mutate(rate = cases / population * 10000)
#> # A tibble: 6 x 5
#> country year cases population rate
#> <chr> <int> <int> <int> <dbl>
#> 1 Afghanistan 1999 745 19987071 0.373
#> 2 Afghanistan 2000 2666 20595360 1.29
#> 3 Brazil 1999 37737 172006362 2.19
#> 4 Brazil 2000 80488 174504898 4.61
#> 5 China 1999 212258 1272915272 1.67
#> 6 China 2000 213766 1280428583 1.67
#Just looking at table2, obvious that calculating rate is much harder
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
Datasets are often stored in an untidy structure rather than a tidy structure when the untidy structure has a smaller file size than the tidy structure
Prior to Wickham (2014) and the creation of the “tidyverse,” the concepts of “tidy”/“untidy” (adjective) data and “tidying” (verb) did not exist.
Instead, researchers would “reshape” (verb) data from “long” (adjective) to “wide” (adjective) and vice-versa
Think “wide” and “long” as alternative presentations of the exact same data.ß
“wide” data represented with fewer rows and more columns
“long” data represented with more rows and fewer columns
Example, Table 204.10 of the Digest for Education Statistics shows changer over time in the number and percentage of K-12 students on free/reduced lunch LINK
#load("data/nces_digest/nces_digest_table_204_10.RData")
load(url("https://github.com/ozanj/rclass/raw/master/data/nces_digest/nces_digest_table_204_10.RData"))
table204_10
#> # A tibble: 51 x 13
#> state tot_2000 tot_2010 tot_2011 tot_2012 frl_2000 frl_2010 frl_2011
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 Alab… 728351 730427 731556 740475 335143 402386 420447
#> 2 Alas… 105333 132104 131166 131483 32468 50701 53238
#> 3 Ariz… 877696 1067210 1024454 990378 274277 482044 511885
#> 4 Arka… 449959 482114 483114 486157 205058 291608 294324
#> 5 Cali… 6050753 6169427 6202862 6178788 2820611 3335885 3353964.
#> 6 Colo… 724349 842864 853610 863121 195148 336426 348896
#> 7 Conn… 562179 552919 543883 549295 143030 190554 194339
#> 8 Dela… 114676 128342 128470 127791 37766 61564 62774
#> 9 Dist… 68380 71263 72329 75411 47839 52027 45199
#> 10 Flor… 2434755 2641555 2668037 2691881 1079009 1479519 1535670
#> # … with 41 more rows, and 5 more variables: frl_2012 <dbl>,
#> # p_frl_2000 <dbl>, p_frl_2010 <dbl>, p_frl_2011 <dbl>, p_frl_2012 <dbl>
Task: Reshape table204_10
from wide to long
gather()
from G&W can’t deal with complex variable names and can’t reshape multiple columns (will discuss this shortly):
total <- table204_10 %>%
select(state,tot_2000,tot_2010,tot_2011,tot_2012) #subset and assign to new object
names(total)<-c("state","2000","2010","2011","2012") #change names for year "tot_2010" -> "2010"
total_long <- total %>%
gather(`2000`,`2010`,`2011`,`2012`,key=year,value=total_students) %>%
arrange(state, year) #arrange by state and year
head(total_long, n=10) #view 10 obs
#> # A tibble: 10 x 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama .................... 2000 728351
#> 2 Alabama .................... 2010 730427
#> 3 Alabama .................... 2011 731556
#> 4 Alabama .................... 2012 740475
#> 5 Alaska .................. 2000 105333
#> 6 Alaska .................. 2010 132104
#> 7 Alaska .................. 2011 131166
#> 8 Alaska .................. 2012 131483
#> 9 Arizona ..................... 2000 877696
#> 10 Arizona ..................... 2010 1067210
pivot_longer
can do the same task, but it can handle reshaping multiple columns and complex patterns at the same time (although, you won’t be able handle complicated patterns until you learn regular expressions)
total %>%
pivot_longer(
cols = (c("2000","2010", "2011", "2012")),
names_to = "year",
values_to = "total_students"
)
#> # A tibble: 204 x 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama .................... 2000 728351
#> 2 Alabama .................... 2010 730427
#> 3 Alabama .................... 2011 731556
#> 4 Alabama .................... 2012 740475
#> 5 Alaska .................. 2000 105333
#> 6 Alaska .................. 2010 132104
#> 7 Alaska .................. 2011 131166
#> 8 Alaska .................. 2012 131483
#> 9 Arizona ..................... 2000 877696
#> 10 Arizona ..................... 2010 1067210
#> # … with 194 more rows
Note that the concepts “wide” vs. “long” are relative rather than absolute
table4a
and table1
: table4a
is wide and table1
is longtable4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table1
and table2
: table1
is wide and table2
is longtable1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
Steps in tidying untidy data:
Tidy data can only have one organizational structure.
However, untidy data can have several different organizational structures. In turn, several causes of untidy data exist. Important to identify the most common causes of untidy data, so that we can develop solutions for each common cause.
The two most common and most important causes of untidy data
table4a
, table4b
), which results in:
population
) over multiple columns (e.g., 1999
, 2000
)table4b
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 19987071 20595360
#> 2 Brazil 172006362 174504898
#> 3 China 1272915272 1280428583
table2
), such that:
type
) and another column contains the values for each variable (e.g., count
)table2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
Other common causes of untidy data (less important and/or less common)
rate
column in table3
)table3
#> # A tibble: 6 x 3
#> country year rate
#> <chr> <int> <chr>
#> 1 Afghanistan 1999 745/19987071
#> 2 Afghanistan 2000 2666/20595360
#> 3 Brazil 1999 37737/172006362
#> 4 Brazil 2000 80488/174504898
#> 5 China 1999 212258/1272915272
#> 6 China 2000 213766/1280428583
table5
, values of the 4-digit year
variable are separated into the two columns, century
and 2-digit year
)table5
#> # A tibble: 6 x 4
#> country century year rate
#> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
As discussed above, the first common reason for untidy data is that some of the column names are not names of variables, but values of a variable (e.g., table4a
, table4b
)
table4a
#> # A tibble: 3 x 3
#> country `1999` `2000`
#> <chr> <int> <int>
#> 1 Afghanistan 745 2666
#> 2 Brazil 37737 80488
#> 3 China 212258 213766
The solution to this problem is to transform the untidy columns (which represent variable values) into rows
In the above example, we want to transform table4a
into something that looks like this:
table1 %>% select(country, year, cases)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <int> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
We could use gather()
but going to get you in the habit of using pivot_longer
that can handle simple to complex reshaping
?pivot_longer
Gathering requires knowing three parameters:
1999
and 2000
table4a
example, we’ll call this variable year
because the values of this variable will be yearsgather()
this is the “key” variabletable4a
example, we’ll call the “value variable” cases
because the values refer to number of casestable4a %>%
pivot_longer(
cols = (c("1999", "2000")),
names_to = "year",
values_to = "cases"
)
#> # A tibble: 6 x 3
#> country year cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
#giving different names to the key variable and the value variable is fine
table4a %>%
pivot_longer(
cols = (c("1999", "2000")),
names_to = "yr",
values_to = "cases"
)
#> # A tibble: 6 x 3
#> country yr cases
#> <chr> <chr> <int>
#> 1 Afghanistan 1999 745
#> 2 Afghanistan 2000 2666
#> 3 Brazil 1999 37737
#> 4 Brazil 2000 80488
#> 5 China 1999 212258
#> 6 China 2000 213766
pivot_longer
can deal with more complex data patterns?pivot_longer
Example 1: Complex variable names. Enrollment variables in table204_10 have more information than we really need to transform from untidy to tidy data (this is why we renamed the variable names before the gather()
example above). We’re going to use the names_prefix()
argument to match and remove the part of the variable names that are not useful for reshaping wide to long.
Here we are still using the three core parameters: cols
, names_to
, values_to()
total1 <- table204_10 %>%
select(state,tot_2000,tot_2010,tot_2011,tot_2012)
total1 %>%
pivot_longer(
cols = c("tot_2000","tot_2010","tot_2011","tot_2012"),
names_to = "year",
names_prefix = ("tot_"),
values_to = "total_students"
)
#> # A tibble: 204 x 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama .................... 2000 728351
#> 2 Alabama .................... 2010 730427
#> 3 Alabama .................... 2011 731556
#> 4 Alabama .................... 2012 740475
#> 5 Alaska .................. 2000 105333
#> 6 Alaska .................. 2010 132104
#> 7 Alaska .................. 2011 131166
#> 8 Alaska .................. 2012 131483
#> 9 Arizona ..................... 2000 877696
#> 10 Arizona ..................... 2010 1067210
#> # … with 194 more rows
#another way to indicate cols
total1 %>%
pivot_longer(
cols = starts_with("tot_"),
names_to = "year",
names_prefix = ("tot_"),
values_to = "total_students"
)
#> # A tibble: 204 x 3
#> state year total_students
#> <chr> <chr> <dbl>
#> 1 Alabama .................... 2000 728351
#> 2 Alabama .................... 2010 730427
#> 3 Alabama .................... 2011 731556
#> 4 Alabama .................... 2012 740475
#> 5 Alaska .................. 2000 105333
#> 6 Alaska .................. 2010 132104
#> 7 Alaska .................. 2011 131166
#> 8 Alaska .................. 2012 131483
#> 9 Arizona ..................... 2000 877696
#> 10 Arizona ..................... 2010 1067210
#> # … with 194 more rows
pivot_longer
can deal with more complex multiple columnsExample 2: Complex variable names and multiple variables need to be reshaped. In this case, we have two pieces of information for each state-year observation: total enrollment and students on free/reduced lunch. So not only do we need to reshape the data from wide to long so our observational-level is state-year, we also need to save total enrollment and students on free/reduced lunch into separate columns.
-Note the special name .value
and names_sep()
argument. By specifying all variables (besides state) are seperated by an underscore this tells pivot_longer that that first part of the variable name before the seperator specify the “value” being measured and will be the new variables in the output
total2 <- table204_10 %>%
select(state,tot_2000,tot_2010,tot_2011,tot_2012, frl_2000,frl_2010,frl_2011,frl_2012)
#multiple variables (when patterns are consistent)
total2 %>%
pivot_longer(
-state,
names_to = c(".value", "year"), # special name .value
names_sep = "_"
)
#> # A tibble: 204 x 4
#> state year tot frl
#> <chr> <chr> <dbl> <dbl>
#> 1 Alabama .................... 2000 728351 335143
#> 2 Alabama .................... 2010 730427 402386
#> 3 Alabama .................... 2011 731556 420447
#> 4 Alabama .................... 2012 740475 429604
#> 5 Alaska .................. 2000 105333 32468
#> 6 Alaska .................. 2010 132104 50701
#> 7 Alaska .................. 2011 131166 53238
#> 8 Alaska .................. 2012 131483 53082
#> 9 Arizona ..................... 2000 877696 274277
#> 10 Arizona ..................... 2010 1067210 482044
#> # … with 194 more rows
#what happens if we switch order of .value and year?
total2 %>%
pivot_longer(
-state,
names_to = c("year", ".value"), # special name .value
names_sep = "_"
)
#> # A tibble: 102 x 6
#> state year `2000` `2010` `2011` `2012`
#> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama .................... tot 728351 730427 731556 740475
#> 2 Alabama .................... frl 335143 402386 420447 429604
#> 3 Alaska .................. tot 105333 132104 131166 131483
#> 4 Alaska .................. frl 32468 50701 53238 53082
#> 5 Arizona ..................... tot 877696 1067210 1024454 990378
#> 6 Arizona ..................... frl 274277 482044 511885 514193
#> 7 Arkansas .................. tot 449959 482114 483114 486157
#> 8 Arkansas .................. frl 205058 291608 294324 298573
#> 9 California ................. tot 6050753 6169427 6202862 6178788
#> 10 California ................. frl 2820611 3335885 3353964. 3478407
#> # … with 92 more rows
[In the task I present below, fine to just work through solution I have created or try doing on your own before you look at solution]
The following dataset is drawn from Table 204.10 of the NCES Digest for Education Statistics.
#Let's take a look at the data (we read in the data in the wide vs long section)
glimpse(table204_10)
#> Observations: 51
#> Variables: 13
#> $ state <chr> "Alabama ....................", "Alaska .............…
#> $ tot_2000 <dbl> 728351, 105333, 877696, 449959, 6050753, 724349, 5621…
#> $ tot_2010 <dbl> 730427, 132104, 1067210, 482114, 6169427, 842864, 552…
#> $ tot_2011 <dbl> 731556, 131166, 1024454, 483114, 6202862, 853610, 543…
#> $ tot_2012 <dbl> 740475, 131483, 990378, 486157, 6178788, 863121, 5492…
#> $ frl_2000 <dbl> 335143, 32468, 274277, 205058, 2820611, 195148, 14303…
#> $ frl_2010 <dbl> 402386, 50701, 482044, 291608, 3335885, 336426, 19055…
#> $ frl_2011 <dbl> 420447.00, 53238.00, 511885.00, 294324.00, 3353963.72…
#> $ frl_2012 <dbl> 429604, 53082, 514193, 298573, 3478407, 358876, 20108…
#> $ p_frl_2000 <dbl> 46.01394, 30.82415, 31.24966, 45.57260, 46.61587, 26.…
#> $ p_frl_2010 <dbl> 55.08915, 38.37961, 45.16862, 60.48528, 54.07123, 39.…
#> $ p_frl_2011 <dbl> 57.47298, 40.58826, 49.96662, 60.92227, 54.07123, 40.…
#> $ p_frl_2012 <dbl> 58.01735, 40.37176, 51.91886, 61.41493, 56.29594, 41.…
#Create smaller version of data frame for purpose of student exercise
total<- table204_10 %>%
select(state,p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012)
head(total)
#> # A tibble: 6 x 5
#> state p_frl_2000 p_frl_2010 p_frl_2011 p_frl_2012
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama .................... 46.0 55.1 57.5 58.0
#> 2 Alaska .................. 30.8 38.4 40.6 40.4
#> 3 Arizona ..................... 31.2 45.2 50.0 51.9
#> 4 Arkansas .................. 45.6 60.5 60.9 61.4
#> 5 California ................. 46.6 54.1 54.1 56.3
#> 6 Colorado .................... 26.9 39.9 40.9 41.6
Task (using the data frame total
):
total
names_prefix
Solution to student exercise
total
p_frl_2000
, p_frl_2010
, etc. are not variables; rather they refer to values of the variable year
p_frl_2000,p_frl_2010,p_frl_2011,p_frl_2012
total %>%
pivot_longer(
cols = starts_with("p_frl_"),
names_to = "year",
names_prefix = ("p_frl_"),
values_to = "pct_frl"
)
#> # A tibble: 204 x 3
#> state year pct_frl
#> <chr> <chr> <dbl>
#> 1 Alabama .................... 2000 46.0
#> 2 Alabama .................... 2010 55.1
#> 3 Alabama .................... 2011 57.5
#> 4 Alabama .................... 2012 58.0
#> 5 Alaska .................. 2000 30.8
#> 6 Alaska .................. 2010 38.4
#> 7 Alaska .................. 2011 40.6
#> 8 Alaska .................. 2012 40.4
#> 9 Arizona ..................... 2000 31.2
#> 10 Arizona ..................... 2010 45.2
#> # … with 194 more rows
The second important and common cause of untidy data:
table2
is an example of this sort of problemtable2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
As my R guru Ben Skinner says, this sort of data structure is very common “in the wild”
The solution to this problem is to transform the untidy rows (which represent different variables) into columns
Goal: we want to transform table2
so it looks like table1
table2
table1
We reshape long to wide by using the pivot_wider()
function from thetidyr
package.
?pivot_wider
Helpful to look at table2
while we introduce pivot_wider()
function
table2 %>% head(n=8)
#> # A tibble: 8 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
Spreading requires knowing two parameters, which are the parameters of spread()
:
table2
this is the type
column; the values of type
, cases
and population
, will become variable names in the tidy datatable2
the value column is count
; the values of count
will become the values of the new variables cases
and population
in the tidy datatable2
#> # A tibble: 12 x 4
#> country year type count
#> <chr> <int> <chr> <int>
#> 1 Afghanistan 1999 cases 745
#> 2 Afghanistan 1999 population 19987071
#> 3 Afghanistan 2000 cases 2666
#> 4 Afghanistan 2000 population 20595360
#> 5 Brazil 1999 cases 37737
#> 6 Brazil 1999 population 172006362
#> 7 Brazil 2000 cases 80488
#> 8 Brazil 2000 population 174504898
#> 9 China 1999 cases 212258
#> 10 China 1999 population 1272915272
#> 11 China 2000 cases 213766
#> 12 China 2000 population 1280428583
table2 %>%
pivot_wider(names_from = type, values_from = count)
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
#compare to table 1
table1
#> # A tibble: 6 x 4
#> country year cases population
#> <chr> <int> <int> <int>
#> 1 Afghanistan 1999 745 19987071
#> 2 Afghanistan 2000 2666 20595360
#> 3 Brazil 1999 37737 172006362
#> 4 Brazil 2000 80488 174504898
#> 5 China 1999 212258 1272915272
#> 6 China 2000 213766 1280428583
[In the task I present below, fine to just work through solution I have created or try doing on your own before you look at solution]
The Integrated Postsecondary Education Data System (IPEDS) collects data on colleges and universities
#load these libraries if you haven't already
#library(haven)
#library(labelled)
ipeds_hc_all <- read_dta(file="https://github.com/ozanj/rclass/raw/master/data/ipeds/effy/ey15-16_hc.dta", encoding=NULL)
Create smaller version of dataset
#ipeds_hc <- ipeds_hc_all %>% select(instnm,unitid,lstudy,efytotlt,efytotlm,efytotlw)
ipeds_hc <- ipeds_hc_all %>% select(instnm,unitid,lstudy,efytotlt)
Get to know data
head(ipeds_hc)
#> # A tibble: 6 x 4
#> instnm unitid lstudy efytotlt
#> <chr> <dbl> <dbl+lbl> <dbl>
#> 1 Alabama A & M University 100654 999 [Generated total] 6157
#> 2 Alabama A & M University 100654 1 [Undergraduate] 4865
#> 3 Alabama A & M University 100654 3 [Graduate] 1292
#> 4 University of Alabama at Birmingham 100663 999 [Generated total] 21554
#> 5 University of Alabama at Birmingham 100663 1 [Undergraduate] 13440
#> 6 University of Alabama at Birmingham 100663 3 [Graduate] 8114
str(ipeds_hc)
#> Classes 'tbl_df', 'tbl' and 'data.frame': 15726 obs. of 4 variables:
#> $ instnm : chr "Alabama A & M University" "Alabama A & M University" "Alabama A & M University" "University of Alabama at Birmingham" ...
#> ..- attr(*, "label")= chr "Institution (entity) name"
#> ..- attr(*, "format.stata")= chr "%91s"
#> $ unitid : num 100654 100654 100654 100663 100663 ...
#> ..- attr(*, "label")= chr "Unique identification number of the institution"
#> ..- attr(*, "format.stata")= chr "%12.0g"
#> $ lstudy : 'haven_labelled' num 999 1 3 999 1 3 999 1 3 999 ...
#> ..- attr(*, "label")= chr "Original level of study on survey form"
#> ..- attr(*, "format.stata")= chr "%15.0g"
#> ..- attr(*, "labels")= Named num 1 3 999
#> .. ..- attr(*, "names")= chr "Undergraduate" "Graduate" "Generated total"
#> $ efytotlt: num 6157 4865 1292 21554 13440 ...
#> ..- attr(*, "label")= chr "Grand total"
#> ..- attr(*, "format.stata")= chr "%12.0g"
#> - attr(*, "label")= chr "dct_s2015_is"
#Variable labels
ipeds_hc %>% var_label()
#> $instnm
#> [1] "Institution (entity) name"
#>
#> $unitid
#> [1] "Unique identification number of the institution"
#>
#> $lstudy
#> [1] "Original level of study on survey form"
#>
#> $efytotlt
#> [1] "Grand total"
#Only the variable lstudy has value labels
ipeds_hc %>% select(lstudy) %>% val_labels()
#> $lstudy
#> Undergraduate Graduate Generated total
#> 1 3 999
Student Task:
ipeds_hc
(why is it untidy?)
Solution to student task:
1. Diagnose the problem with the data frame - First, let’s investigate the data
ipeds_hc <- ipeds_hc %>% arrange(unitid, lstudy)
head(ipeds_hc, n=20)
#> # A tibble: 20 x 4
#> instnm unitid lstudy efytotlt
#> <chr> <dbl> <dbl+lbl> <dbl>
#> 1 Alabama A & M University 100654 1 [Undergraduate] 4865
#> 2 Alabama A & M University 100654 3 [Graduate] 1292
#> 3 Alabama A & M University 100654 999 [Generated tota… 6157
#> 4 University of Alabama at Birmingham 100663 1 [Undergraduate] 13440
#> 5 University of Alabama at Birmingham 100663 3 [Graduate] 8114
#> 6 University of Alabama at Birmingham 100663 999 [Generated tota… 21554
#> 7 Amridge University 100690 1 [Undergraduate] 415
#> 8 Amridge University 100690 3 [Graduate] 415
#> 9 Amridge University 100690 999 [Generated tota… 830
#> 10 University of Alabama in Huntsville 100706 1 [Undergraduate] 6994
#> 11 University of Alabama in Huntsville 100706 3 [Graduate] 2212
#> 12 University of Alabama in Huntsville 100706 999 [Generated tota… 9206
#> 13 Alabama State University 100724 1 [Undergraduate] 5373
#> 14 Alabama State University 100724 3 [Graduate] 728
#> 15 Alabama State University 100724 999 [Generated tota… 6101
#> 16 The University of Alabama 100751 1 [Undergraduate] 35199
#> 17 The University of Alabama 100751 3 [Graduate] 6308
#> 18 The University of Alabama 100751 999 [Generated tota… 41507
#> 19 Central Alabama Community College 100760 1 [Undergraduate] 2379
#> 20 Central Alabama Community College 100760 999 [Generated tota… 2379
#code to investigate what each observation represents
#I'll break this code down next week when we talk about joining data frames
ipeds_hc %>% group_by(unitid,lstudy) %>% # group_by our candidate
mutate(n_per_id=n()) %>% # calculate number of obs per group
ungroup() %>% # ungroup the data
count(n_per_id==1) # count "true that only one obs per group"
#> # A tibble: 1 x 2
#> `n_per_id == 1` n
#> <lgl> <int>
#> 1 TRUE 15726
Summary of problems with the data frame:
undergraduate
, graduate
, generated total
)lstudy
represent different attributes (undergraduate
, graduate
, generated total
)
2. Sketch out what the tidy data should look like (sketch out on your own)
3. Transform untidy to tidy
ipeds_hc
the key column is lstudy
; the values of lstudy
, will become variable names in the tidy dataipeds_hc
the value column is efytotlt
; the values of efytotlt
will become the values of the new variables in the tidy dataipeds_hc %>%
pivot_wider(names_from = lstudy, values_from = efytotlt) #notice it uses the underlying data not labels
#> # A tibble: 6,951 x 5
#> instnm unitid `1` `3` `999`
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama A & M University 100654 4865 1292 6157
#> 2 University of Alabama at Birmingham 100663 13440 8114 21554
#> 3 Amridge University 100690 415 415 830
#> 4 University of Alabama in Huntsville 100706 6994 2212 9206
#> 5 Alabama State University 100724 5373 728 6101
#> 6 The University of Alabama 100751 35199 6308 41507
#> 7 Central Alabama Community College 100760 2379 NA 2379
#> 8 Athens State University 100812 4124 NA 4124
#> 9 Auburn University at Montgomery 100830 5072 924 5996
#> 10 Auburn University 100858 22953 6284 29237
#> # … with 6,941 more rows
Alternative solution
Helpful to create a character version of variable lstudy prior to spreading
ipeds_hc %>% select(lstudy) %>% val_labels()
#> $lstudy
#> Undergraduate Graduate Generated total
#> 1 3 999
str(ipeds_hc$lstudy)
#> 'haven_labelled' num [1:15726] 1 3 999 1 3 999 1 3 999 1 ...
#> - attr(*, "label")= chr "Original level of study on survey form"
#> - attr(*, "labels")= Named num [1:3] 1 3 999
#> ..- attr(*, "names")= chr [1:3] "Undergraduate" "Graduate" "Generated total"
ipeds_hcv2 <- ipeds_hc %>%
mutate(level = recode(as.integer(lstudy),
`1` = "ug",
`3` = "grad",
`999` = "all")
) %>% select(-lstudy) # drop variable lstudy
head(ipeds_hcv2)
#> # A tibble: 6 x 4
#> instnm unitid efytotlt level
#> <chr> <dbl> <dbl> <chr>
#> 1 Alabama A & M University 100654 4865 ug
#> 2 Alabama A & M University 100654 1292 grad
#> 3 Alabama A & M University 100654 6157 all
#> 4 University of Alabama at Birmingham 100663 13440 ug
#> 5 University of Alabama at Birmingham 100663 8114 grad
#> 6 University of Alabama at Birmingham 100663 21554 all
ipeds_hcv2 %>% select(instnm,unitid,level,efytotlt) %>%
pivot_wider(names_from = level, values_from = efytotlt) #nicer!
#> # A tibble: 6,951 x 5
#> instnm unitid ug grad all
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 Alabama A & M University 100654 4865 1292 6157
#> 2 University of Alabama at Birmingham 100663 13440 8114 21554
#> 3 Amridge University 100690 415 415 830
#> 4 University of Alabama in Huntsville 100706 6994 2212 9206
#> 5 Alabama State University 100724 5373 728 6101
#> 6 The University of Alabama 100751 35199 6308 41507
#> 7 Central Alabama Community College 100760 2379 NA 2379
#> 8 Athens State University 100812 4124 NA 4124
#> 9 Auburn University at Montgomery 100830 5072 924 5996
#> 10 Auburn University 100858 22953 6284 29237
#> # … with 6,941 more rows
What if we want to spread a dataset that contains multiple value variables?
Create dataset that has total enrollment, enrollment “men”, and enrollment “women” [IPEDS terms]
ipeds_hc_multi_val <- ipeds_hc_all %>%
select(unitid,lstudy,efytotlt,efytotlm,efytotlw)
ipeds_hc_multi_val <- ipeds_hc_multi_val %>%
mutate(level = recode(as.integer(lstudy),
`1` = "ug",
`3` = "grad",
`999` = "all")
) %>% select(-lstudy) # drop variable lstudy
head(ipeds_hc_multi_val)
#> # A tibble: 6 x 5
#> unitid efytotlt efytotlm efytotlw level
#> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 100654 6157 2646 3511 all
#> 2 100654 4865 2258 2607 ug
#> 3 100654 1292 388 904 grad
#> 4 100663 21554 8383 13171 all
#> 5 100663 13440 5552 7888 ug
#> 6 100663 8114 2831 5283 grad
ipeds_hc_multi_val %>%
pivot_wider(names_from = level, values_from = c(efytotlt, efytotlm, efytotlw))
#> # A tibble: 6,951 x 10
#> unitid efytotlt_all efytotlt_ug efytotlt_grad efytotlm_all efytotlm_ug
#> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 100654 6157 4865 1292 2646 2258
#> 2 100663 21554 13440 8114 8383 5552
#> 3 100690 830 415 415 337 177
#> 4 100706 9206 6994 2212 5194 3983
#> 5 100724 6101 5373 728 2321 2105
#> 6 100751 41507 35199 6308 18526 15946
#> 7 100760 2379 2379 NA 1027 1027
#> 8 100812 4124 4124 NA 1478 1478
#> 9 100830 5996 5072 924 2158 1847
#> 10 100858 29237 22953 6284 14893 11871
#> # … with 6,941 more rows, and 4 more variables: efytotlm_grad <dbl>,
#> # efytotlw_all <dbl>, efytotlw_ug <dbl>, efytotlw_grad <dbl>
This section deals with missing variables and tidying data.
But first, it is helpful to create a new version of the IPEDS enrollment dataset as follows:
ipeds_hc_na <- ipeds_hc_all %>% filter(sector %in% c(3,6,9)) %>% #keep only for-profit colleges
select(instnm,unitid,lstudy,efytotlt,efybkaam) %>%
mutate(level = recode(as.integer(lstudy), # create recoded version of lstudy
`1` = "ug",
`3` = "grad",
`999` = "all")
) %>% select(instnm,unitid,level,efytotlt,efybkaam) %>%
arrange(unitid,desc(level))
Now, let’s print some rows
ipeds_hc_na
#> # A tibble: 6,265 x 5
#> instnm unitid level efytotlt efybkaam
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 0
#> 5 New Beginning College of Cosmetology 101277 all 132 0
#> 6 Herzing University-Birmingham 101365 ug 675 73
#> 7 Herzing University-Birmingham 101365 grad 15 5
#> 8 Herzing University-Birmingham 101365 all 690 78
#> 9 Prince Institute-Southeast 101958 ug 34 0
#> 10 Prince Institute-Southeast 101958 all 34 0
#> # … with 6,255 more rows
Now let’s create new versions of the enrollment variables, that replace 0
with NA
ipeds_hc_na <- ipeds_hc_na %>%
mutate(
efytotltv2 = ifelse(efytotlt == 0, NA, efytotlt),
efybkaamv2 = ifelse(efybkaam == 0, NA, efybkaam)
) %>% select(instnm,unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)
ipeds_hc_na %>% select(unitid,level,efytotlt,efytotltv2,efybkaam,efybkaamv2)
#> # A tibble: 6,265 x 6
#> unitid level efytotlt efytotltv2 efybkaam efybkaamv2
#> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 101116 ug 777 777 122 122
#> 2 101116 grad 218 218 40 40
#> 3 101116 all 995 995 162 162
#> 4 101277 ug 132 132 0 NA
#> 5 101277 all 132 132 0 NA
#> 6 101365 ug 675 675 73 73
#> 7 101365 grad 15 15 5 5
#> 8 101365 all 690 690 78 78
#> 9 101958 ug 34 34 0 NA
#> 10 101958 all 34 34 0 NA
#> # … with 6,255 more rows
Create dataset that drops the original enrollment variables, keeps enrollment vars that replace 0
with NA
ipeds_hc_nav2 <- ipeds_hc_na %>% select(-efytotlt,-efybkaam)
Now we can introduce the concepts of explicit and implicit missing values
There are two types of missing values:
NA
for a parcitular rowLet’s print data for the first two colleges
ipeds_hc_nav2 %>% head(, n=5)
#> # A tibble: 5 x 5
#> instnm unitid level efytotltv2 efybkaamv2
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 NA
#> 5 New Beginning College of Cosmetology 101277 all 132 NA
South University-Montgomery
has three rows:
efytotltv2
has 0
explicit missing values and 0
implicit missing valuesefybkaamv2
has 0
explicit missing values and 0
implicit missing valuesNew Beginning College of Cosmetology
has two rows (because they have no graduate students):
efytotltv2
has 0
explicit missing values and 1
implicit missing values (no row for grad students)efybkaamv2
has 2
explicit missing values and 1
implicit missing values (no row for grad students)Let’s look at another dataset called stocks
, which shows stock return
for each year and quarter for some hypothetical company.
stocks <- tibble(
year = c(2015, 2015, 2015, 2015, 2016, 2016, 2016),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
return = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
stocks # note: this data is already tidy
#> # A tibble: 7 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 2 0.92
#> 6 2016 3 0.17
#> 7 2016 4 2.66
The variable return
has:
1
explicit missing value in year==2015
and qtr==4
1
implicit missing value in year==2016
and qtr==1
; this row of data simply does not existstocks %>%
complete(year, qtr)
#> # A tibble: 8 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 1 NA
#> 6 2016 2 0.92
#> 7 2016 3 0.17
#> 8 2016 4 2.66
An Implicit missing value is the result of a row not existing. If you want to make an an implicit missing value explicit, then make the non-existant row exist.
The complete()
function within the tidyr
package turns implicit missing values into explicit missing values
complete()
returns an object that has all unique combinations of those variables, including those not found in the original data framestocks
#> # A tibble: 7 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 2 0.92
#> 6 2016 3 0.17
#> 7 2016 4 2.66
stocks %>% complete(year, qtr)
#> # A tibble: 8 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 1 NA
#> 6 2016 2 0.92
#> 7 2016 3 0.17
#> 8 2016 4 2.66
Note that we now have a row for year==2016
and qtr==1
that has an explicit missing value for return
Let’s apply complete()
to our IPEDS dataset
ipeds_hc_nav2
#> # A tibble: 6,265 x 5
#> instnm unitid level efytotltv2 efybkaamv2
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 777 122
#> 2 South University-Montgomery 101116 grad 218 40
#> 3 South University-Montgomery 101116 all 995 162
#> 4 New Beginning College of Cosmetology 101277 ug 132 NA
#> 5 New Beginning College of Cosmetology 101277 all 132 NA
#> 6 Herzing University-Birmingham 101365 ug 675 73
#> 7 Herzing University-Birmingham 101365 grad 15 5
#> 8 Herzing University-Birmingham 101365 all 690 78
#> 9 Prince Institute-Southeast 101958 ug 34 NA
#> 10 Prince Institute-Southeast 101958 all 34 NA
#> # … with 6,255 more rows
ipeds_complete <- ipeds_hc_nav2 %>% select(unitid,level,efytotltv2,efybkaamv2) %>%
complete(unitid, level)
ipeds_complete
#> # A tibble: 9,063 x 4
#> unitid level efytotltv2 efybkaamv2
#> <dbl> <chr> <dbl> <dbl>
#> 1 101116 all 995 162
#> 2 101116 grad 218 40
#> 3 101116 ug 777 122
#> 4 101277 all 132 NA
#> 5 101277 grad NA NA
#> 6 101277 ug 132 NA
#> 7 101365 all 690 78
#> 8 101365 grad 15 5
#> 9 101365 ug 675 73
#> 10 101958 all 34 NA
#> # … with 9,053 more rows
#Confirm that the "complete" dataset always has three observations per unitid
ipeds_complete %>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
#> # A tibble: 1 x 2
#> n nn
#> <int> <int>
#> 1 3 3021
#Note that previous dataset did not
ipeds_hc_nav2 %>% group_by(unitid) %>% summarise(n=n()) %>% count(n)
#> # A tibble: 2 x 2
#> n nn
#> <int> <int>
#> 1 2 2798
#> 2 3 223
Should you make implicit missing values explicit?
We use spread()
to transform rows into columns; outside the tidyverse, referred to as reshaping from “long” to “wide”
Let’s look at two datasets that have similar structure. Which one of these is in need of tidying?
stocks
#> # A tibble: 7 x 3
#> year qtr return
#> <dbl> <dbl> <dbl>
#> 1 2015 1 1.88
#> 2 2015 2 0.59
#> 3 2015 3 0.35
#> 4 2015 4 NA
#> 5 2016 2 0.92
#> 6 2016 3 0.17
#> 7 2016 4 2.66
ipeds_hc_nav2 %>% select (instnm,unitid,level,efytotltv2) %>% head(n=5)
#> # A tibble: 5 x 4
#> instnm unitid level efytotltv2
#> <chr> <dbl> <chr> <dbl>
#> 1 South University-Montgomery 101116 ug 777
#> 2 South University-Montgomery 101116 grad 218
#> 3 South University-Montgomery 101116 all 995
#> 4 New Beginning College of Cosmetology 101277 ug 132
#> 5 New Beginning College of Cosmetology 101277 all 132
Let’s use spread()
to tidy the IPEDS dataset, focusing on the total enrollment variable which has implicit missing values but no explicit missing values
ipeds_hc_nav2 %>% select (instnm,unitid,level,efytotltv2) %>%
spread(key = level, value = efytotltv2) %>%
arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 5
#> instnm unitid all grad ug
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 South University-Montgomery 101116 995 218 777
#> 2 New Beginning College of Cosmetology 101277 132 NA 132
#> 3 Herzing University-Birmingham 101365 690 15 675
#> 4 Prince Institute-Southeast 101958 34 NA 34
#> 5 Charter College 102845 3996 20 3976
The resulting dataset has explicint missing values (i.e,. NAs
) for rows that had implicit missing values in the input data.
Let’s use spread()
the IPEDS dataset again, this time focusing on the Black enrollment variable which has both explicit and implicit missing values
ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>% head(n=5)
#> # A tibble: 5 x 4
#> instnm unitid level efybkaamv2
#> <chr> <dbl> <chr> <dbl>
#> 1 South University-Montgomery 101116 ug 122
#> 2 South University-Montgomery 101116 grad 40
#> 3 South University-Montgomery 101116 all 162
#> 4 New Beginning College of Cosmetology 101277 ug NA
#> 5 New Beginning College of Cosmetology 101277 all NA
ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>%
spread(key = level, value = efybkaamv2) %>%
arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 5
#> instnm unitid all grad ug
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 South University-Montgomery 101116 162 40 122
#> 2 New Beginning College of Cosmetology 101277 NA NA NA
#> 3 Herzing University-Birmingham 101365 78 5 73
#> 4 Prince Institute-Southeast 101958 NA NA NA
#> 5 Charter College 102845 102 1 101
The resulting dataset has explicit missing values (i.e,. NAs
) for rows that had implicit missing values in the input data and for rows that has explicit missing values in the data.
What if we spread a dataset that has explicit missing values and no implicit missing values?
NAs
) for rows that were explicit missing values in the input data.ipeds_complete %>% select(unitid,level,efybkaamv2) %>%
spread(key = level, value = efybkaamv2) %>%
arrange(unitid) %>% head(n=5)
#> # A tibble: 5 x 4
#> unitid all grad ug
#> <dbl> <dbl> <dbl> <dbl>
#> 1 101116 162 40 122
#> 2 101277 NA NA NA
#> 3 101365 78 5 73
#> 4 101958 NA NA NA
#> 5 102845 102 1 101
Takeways about spread()
and missing values
Therefore, no need to make implicit missing values explicit (using complete()
) prior to spreading
We use gather()
to transform columns into rows; outside the tidyverse, referred to as reshaping from “wide” to “long”
Let’s create a dataset in need of gathering
2016
and 2017
ipeds_gather <- ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2)
ipeds_gather <- ipeds_hc_nav2 %>% select (instnm,unitid,level,efybkaamv2) %>%
mutate(
efybkaamv2_2017= ifelse(!is.na(efybkaamv2),efybkaamv2+20,20)
) %>%
rename("2016"=efybkaamv2,"2017"=efybkaamv2_2017)
ipeds_gather %>% head(n=10)
#> # A tibble: 10 x 5
#> instnm unitid level `2016` `2017`
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 South University-Montgomery 101116 ug 122 142
#> 2 South University-Montgomery 101116 grad 40 60
#> 3 South University-Montgomery 101116 all 162 182
#> 4 New Beginning College of Cosmetology 101277 ug NA 20
#> 5 New Beginning College of Cosmetology 101277 all NA 20
#> 6 Herzing University-Birmingham 101365 ug 73 93
#> 7 Herzing University-Birmingham 101365 grad 5 25
#> 8 Herzing University-Birmingham 101365 all 78 98
#> 9 Prince Institute-Southeast 101958 ug NA 20
#> 10 Prince Institute-Southeast 101958 all NA 20
2016
has both implicit and explicit missing values2017
has implicit but no explicit missing valuesLet’s use gather()
to transform the columns 2016
and 2017
into rows
ipeds_gatherv2 <- ipeds_gather %>%
gather(`2016`,`2017`,key = year, value = efybkaam) %>%
arrange(unitid,desc(level),year)
ipeds_gatherv2 %>% head(n=10)
#> # A tibble: 10 x 5
#> instnm unitid level year efybkaam
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 South University-Montgomery 101116 ug 2016 122
#> 2 South University-Montgomery 101116 ug 2017 142
#> 3 South University-Montgomery 101116 grad 2016 40
#> 4 South University-Montgomery 101116 grad 2017 60
#> 5 South University-Montgomery 101116 all 2016 162
#> 6 South University-Montgomery 101116 all 2017 182
#> 7 New Beginning College of Cosmetology 101277 ug 2016 NA
#> 8 New Beginning College of Cosmetology 101277 ug 2017 20
#> 9 New Beginning College of Cosmetology 101277 all 2016 NA
#> 10 New Beginning College of Cosmetology 101277 all 2017 20
Before looking at missing values, let’s investigate data structure
#number of rows after gathering is exactly 2X number of rows in input dataset
nrow(ipeds_gather)
#> [1] 6265
nrow(ipeds_gatherv2)
#> [1] 12530
nrow(ipeds_gatherv2)==nrow(ipeds_gather)*2
#> [1] TRUE
#How many observations for each combination of unitid and level?
#always 2: one for 2016 and one for 2017
ipeds_gatherv2 %>% group_by(unitid,level) %>%
summarise(n_per_unitid_level=n()) %>% ungroup %>% count(n_per_unitid_level)
#> # A tibble: 1 x 2
#> n_per_unitid_level n
#> <int> <int>
#> 1 2 6265
#How many observations for each unitid?
# 4 observations for colleges that had two rows in the input dataset
# 6 observations for colleges that had three rows in the input dataset
ipeds_gatherv2 %>% group_by(unitid) %>%
summarise(n_per_unitid=n()) %>% ungroup %>% count(n_per_unitid)
#> # A tibble: 2 x 2
#> n_per_unitid n
#> <int> <int>
#> 1 4 2798
#> 2 6 223
Let’s compare the data before and after gathering for one college
#before gathering
ipeds_gather %>% filter(unitid==101277)
#> # A tibble: 2 x 5
#> instnm unitid level `2016` `2017`
#> <chr> <dbl> <chr> <dbl> <dbl>
#> 1 New Beginning College of Cosmetology 101277 ug NA 20
#> 2 New Beginning College of Cosmetology 101277 all NA 20
#after gathering
ipeds_gatherv2 %>% filter(unitid==101277)
#> # A tibble: 4 x 5
#> instnm unitid level year efybkaam
#> <chr> <dbl> <chr> <chr> <dbl>
#> 1 New Beginning College of Cosmetology 101277 ug 2016 NA
#> 2 New Beginning College of Cosmetology 101277 ug 2017 20
#> 3 New Beginning College of Cosmetology 101277 all 2016 NA
#> 4 New Beginning College of Cosmetology 101277 all 2017 20
Takeaways about gathering and explicit/implicit missing values
gather()
creates new rowsgather()
does nothing to rows that do not exist in the input datasetTherefore, if you want to make implicit values explicit after gathering, then prior to gathering you should use complete()
to make missing values explicit
table5
#> # A tibble: 6 x 4
#> country century year rate
#> <chr> <chr> <chr> <chr>
#> 1 Afghanistan 19 99 745/19987071
#> 2 Afghanistan 20 00 2666/20595360
#> 3 Brazil 19 99 37737/172006362
#> 4 Brazil 20 00 80488/174504898
#> 5 China 19 99 212258/1272915272
#> 6 China 20 00 213766/1280428583
rate
columncentury
and year
columnsRead about this in section 12.4 of Wickham text, but no homework questions on separating and uniting