Overview of mutating joins
Following Wickham, we’ll explain joins by creating hypothetical tables x
and y
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
x
#> # A tibble: 3 x 2
#> key val_x
#> <dbl> <chr>
#> 1 1 x1
#> 2 2 x2
#> 3 3 x3
y
#> # A tibble: 3 x 2
#> key val_y
#> <dbl> <chr>
#> 1 1 y1
#> 2 2 y2
#> 3 4 y3
A join “is a way of connecting each row in x
to zero, one, or more rows in y
”
Observations in table x
matched to observations in table y
using a “key” variable
- A key is a variable (or combination of variables) that exist in both tables and uniquely identifies observations in at least one of the two tables
Two tables x
and y
can be “joined” when the primary key for table x
can be found in table y
; in other words, when table y
contains the foreign key, which uniquely identifies observations in table x
- e.g., use
id
to join nls_stu
and nls_tran
because id
is the primary key for nls_stu
(i.e., uniquely identifies obs in nls_stu
) and id
can be found in nls_tran
There are four types of joins between tables x
and y
:
- inner join: keep all observations that appear in both table
x
and table y
- left join: keep all observations in
x
(regardless of whether these obs appear in y
)
- right join: keep all observations in
y
(regardless of whether these obs appear in x
)
- full join: keep all observations that appear in
x
or in y
The last three joins – left, right, full – keep observations that appear in at least one table and are collectively referred to as outer joins
The following Venn diagram – copied from Grolemund and Wickham Chapter 13 – is useful for developing an initial understanding of the four join types
We will join tables x
and y
using the join()
command from dplyr
package. join()
is a general command, which has more specific commands for each type of join:
inner_join()
left_join()
right_join()
full_join()
Note that all of these join commands result in an object that contains all the variables from x
and all the variables from y
- So if you want resulting object to contain a subset of variables from
x
and y
, then prior to the join, you should eliminate unwanted variables from x
and/or y
How we’ll teach joins
- I’ll spend the most amount of time on inner joins, e.g., moving from simpler to more complicated joins.
- I’ll spend less time on outer joins because most of the stuff from inner joins will apply to outer joins too
- Note: all of the cool, multi-colored visual representations of joins are copied directly from Grolemund and Wickham, Chapter 12)
- Practical recommendation:
- Investigate each data frame before joining.
- get a “feel” for the data frame
- figure out which combination of variables uniquely identify observations
- Often helpful to sketch out data structure of one or both dataframes on scratch-paper and to sketch out data structure you want to see after mergin
Inner joins
inner joins keep all observations that appear in both table x
and table y
- More correctly, an inner join mathes observations from two tables “whenever their keys are equal”
- If there are multiple matches between
x
and y
, all combination of the matches are returned.
- e.g., if object
x
has one row where the variable key==1
and object y
has two rows where the variable key==1
, the resulting object will contain two rows where the variable key==1
Visual representation of x
and y
:
- the colored column in each dataset is the “key” variable. The key variable(s) match rows between the tables.
Below is a visual representation of an inner join.
- Matches in a join (rows common to both
x
and y
) are indicated with dots. “The number of dots=the number of matches=the number of rows in the output”
The basic synatx in R: inner_join(x, y, by ="keyvar")
- where
x
and y
are names of tables to join
by
specifies the name of the key variable or the combination of variables that form the key
Practical example:
- let’s try an inner join of the two datasets
nls_stu
and nls_stu_pets
I recommend these general steps when merging two datasets
- Identify
key
variable for join()
command by investigating the data structure of each dataset. Do stuff like this:
- which variables uniquely identify obs (i.e., what is the “key” in each table)
- note: not all tables have keys
- Once you identify the primary key for one of the tables, make sure that variable (or combination of variables) exists in the other table
- Identify key variables you will use to join the two tables
- Join datasets
- Assess/investigate quality of join
- This is basically exploratory data analysis for the purpose of data quality
- e.g., for obs that don’t “match”, investigate why (what are the patterns)
- We talk about these investigations in more detail below in section on filtering joins and section on join problems
Task: inner join of the two datasets nls_stu
and nls_stu_pets
one-to-one join vs. one-to-many join
Note: Wickham refers to the concepts in this section as “duplicate keys”
General rule rule of thumb for joining two tables
- key variable must uniquely identify observations in at least one of the tables you are joining
Depending on whether the key variable uniquely identifies observations in table x
and/or table y
you will have:
- one-to-one join: key variable uniquely identifies observations in table
x
and uniquely identifies observations in table y
- The join between
nls_stu
and nls_stu_pets
was a one-to-one join; the variable id
uniquely identifies observations in both tables
- In the relational database world one-to-one joins are rare and are considered special cases of one-to-many or many-to-one joins
- Why? if tables can be joined via one-to-one join, then they should already be part of the same table.
- one-to-many join: key variable uniquely identifies observatiosn in table
x
and does not uniquely identify observations in table y
- each observation from table
x
may match to multiple observations from table `y’
- e.g.,
inner_join(nls_stu, nls_trans, by = "id")
- e.g. one observation (student) in
nls_stu
has many observations in nls_trans
(transcripts)
- many-to-one join: key variable does not uniquely identify observations in table
x
and does uniquely identify observations in table y
- each observation from table
y
may match to multiple observations from table `x’
- e.g.,
inner_join(nls_trans, nls_trans, by = "id")
- e.g., many observations (transcripts) in
nls_trans
have one observation in nls_stu
(student)
- many-to-many join: key variable does not uniquely identify observations in table
x
and does not uniquely identify observations in table y
Many-to-one merge using fictitious tables x
and y
#create new versions of table x and table y
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
2, "x3",
1, "x4"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2"
)
x
#> # A tibble: 4 x 2
#> key val_x
#> <dbl> <chr>
#> 1 1 x1
#> 2 2 x2
#> 3 2 x3
#> 4 1 x4
y
#> # A tibble: 2 x 2
#> key val_y
#> <dbl> <chr>
#> 1 1 y1
#> 2 2 y2
Step 1: Investigate the two tables
- Note that
key
does not uniquely identify observations in x
but does uniquely identify observations in y
Visual representation of merge
Step 2: “join” the two tables
Student-task:
- conduct a one-to-many inner join of the two datasets
nls_stu
and nls_trans
Fine to try doing it without looking at solutions, or just work through solutions below
Solution to student task: steps
- Invesigate data structure
- Join variables
- Assess/investigate quality of join
#Investigate data
#we know id is primary key for nls_stu, investigate primary key for nls_tran
#id does not uniquely identify obs in nls_tran
nls_tran %>% group_by(id) %>% summarise(n_per_key=n()) %>% ungroup %>%
filter(n_per_key>1) %>% count(n_per_key)
#> # A tibble: 6 x 2
#> n_per_key n
#> <int> <int>
#> 1 2 4558
#> 2 3 1681
#> 3 4 415
#> 4 5 71
#> 5 6 6
#> 6 7 3
#id and transnum uniquely identify obs
nls_tran %>% group_by(id,transnum) %>% summarise(n_per_key=n()) %>% ungroup %>%
filter(n_per_key>1) %>% count(n_per_key)
#> # A tibble: 0 x 2
#> # ... with 2 variables: n_per_key <int>, n <int>
#id uniquely identifies obs in nls_stu and id is available in nls_tran, so use id as key
#merge
nls_stu_tran <- nls_stu %>% inner_join(nls_tran, by = "id") #error warning on var labels will populate
#> Warning: Column `id` has different attributes on LHS and RHS of join
#print some obs and selected variables of merged data frame
nls_stu_tran %>% select(id,transnum,bysex,findisp,trnsflag) %>% as_factor()
#> # A tibble: 24,253 x 5
#> id transnum bysex findisp trnsflag
#> <dbl> <dbl> <fct> <fct> <fct>
#> 1 18 1 2. female 6. NO RESPONSE FROM SCHO~ 0. DUMMY TRANSCRIPT
#> 2 67 1 1. male 1. TRANSCRIPT RECEIVED 1. REQUESTED & RECEI~
#> 3 83 1 2. female 3. STUDENT NEVER ATTENDED 0. DUMMY TRANSCRIPT
#> 4 315 1 1. male 1. TRANSCRIPT RECEIVED 1. REQUESTED & RECEI~
#> 5 315 2 1. male 1. TRANSCRIPT RECEIVED 1. REQUESTED & RECEI~
#> 6 414 1 1. male 3. STUDENT NEVER ATTENDED 0. DUMMY TRANSCRIPT
#> 7 430 1 1. male 4. SCHOOL LOST OR DESTRO~ 0. DUMMY TRANSCRIPT
#> 8 802 1 2. female 1. TRANSCRIPT RECEIVED 1. REQUESTED & RECEI~
#> 9 802 2 2. female 2. SCHOOL REFUSED 0. DUMMY TRANSCRIPT
#> 10 836 1 1. male 4. SCHOOL LOST OR DESTRO~ 0. DUMMY TRANSCRIPT
#> # ... with 24,243 more rows
#investigate results of merge
names(nls_stu_tran)
#> [1] "id" "schcode" "bysex" "csex" "crace" "cbirthm"
#> [7] "cbirthd" "cbirthyr" "transnum" "findisp" "trnsflag" "terms"
#> [13] "fice" "state" "cofcon" "instype" "itype"
nrow(nls_stu)
#> [1] 22652
nrow(nls_tran)
#> [1] 24253
nrow(nls_stu_tran)
#> [1] 24253
#Below sections show how to investigate quality of merge in more detail
Defining the key columns
Thus far, tables have been joined by a single “key” variable using this syntax:
inner_join(x,y, by = "keyvar")
Often, multiple variables form the “key”. Specify this using this syntax:
inner_join(x,y, by = c("keyvar1","keyvar2","..."))
Practical example:
- perform an inner join of
nls_tran
and nls_term
#Investigate
nls_tran %>% group_by(id,transnum) %>% summarise(n_per_key=n()) %>% ungroup %>%
filter(n_per_key>1) %>% count(n_per_key)
#> # A tibble: 0 x 2
#> # ... with 2 variables: n_per_key <int>, n <int>
nls_term %>% group_by(id,transnum,termnum) %>% summarise(n_per_key=n()) %>% ungroup %>%
filter(n_per_key>1) %>% count(n_per_key)
#> # A tibble: 0 x 2
#> # ... with 2 variables: n_per_key <int>, n <int>
#merge
nls_tran_term <- nls_tran %>% inner_join(nls_term, by = c("id","transnum"))
#print a few obs and selected vars from merged data frame
nls_tran_term %>% select(id,transnum,termnum,transnum,findisp,courses,termtype) %>% as_factor()
#> # A tibble: 120,807 x 6
#> id transnum termnum findisp courses termtype
#> <dbl> <dbl> <dbl> <fct> <dbl> <fct>
#> 1 67 1 1 1. TRANSCRIPT RECEIVED 3 4. quarter
#> 2 67 1 2 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 3 67 1 3 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 4 67 1 4 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 5 67 1 5 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 6 67 1 6 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 7 67 1 7 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 8 67 1 8 1. TRANSCRIPT RECEIVED 2 4. quarter
#> 9 315 1 1 1. TRANSCRIPT RECEIVED 4 2. semester
#> 10 315 1 2 1. TRANSCRIPT RECEIVED 3 4. quarter
#> # ... with 120,797 more rows
#investigate
nrow(nls_tran)
#> [1] 24253
nrow(nls_term)
#> [1] 120885
nrow(nls_tran_term)
#> [1] 120807
#appears that some observations from nls_term did not merge with nls_trans
#we should investigate this further [below]
Sometimes a key variable in one table has a different variable name in the other table. You can specify that the variables to be matched from one table to another as follows:
inner_join(x,y, by = c("keyvarx" = "keyvary"))
Practical example:
- perform inner join between
nls_stu
and nls_tran
:
Same syntax can be used when key is formed from multiple variables
- show using merge of
nls_tran
and nls_term
Outer joins
Thus far we have focused on “inner joins”
- keep all observations that appear in both table
x
and table y
“outer joins” keep observations that appear in at least one table. There are three types of outer joins:
- left join: keep all observations in
x
(regardless of whether these obs appear in y
)
- This is the most common type of join when creating analysis datasets because we often start w/ a dataframe that has all units we want in our analysis and left join ensures that we keep all of these units
- right join: keep all observations in
y
(regardless of whether these obs appear in x
)
- full join: keep all observations that appear in
x
or in y
Description of the four join types from R help file
The syntax for the outer join commands is identical to inner joins, so once you understand inner joins, outer joins are not difficult.
inner_join()
- return all rows from x where there are matching values in y, and all columns from x and y. If there are multiple matches between x and y, all combination of the matches are returned.
left_join()
- return all rows from x, and all columns from x and y. Rows in x with no match in y will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
right_join()
- return all rows from y, and all columns from x and y. Rows in y with no match in x will have NA values in the new columns. If there are multiple matches between x and y, all combinations of the matches are returned.
full_join()
- return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
Visual representation of outer joins
This figures are copied straight from Wickham chapter 12
Venn diagram of joins
We want to perform outer joins on these two tables
Visual representation of outer joins
“These joins work by adding an additional “virtual” observation to each table. This observation has a key that always matches (if no other key matches), and a value filled with NA."
Practicing outer joins
The left-join is the most commonly used outer join in social science research (more common than inner join too).
Why is this? Often, we start with some dataset x
(e.g., nls_stu
) and we want to add variables from dataset y
- Usually, we want to keep observations from
x
regardless of whether they match with y
- Usually uninterested in observations from
y
that did not match with x
Student task (try doing yourself or just follow along):
- start with
nls_stu_pets
- perform a left join with
nls_stu
and save the object
- Then perform a left join with
nls_tran