Data Wrangling in R with dplyr and tidyr

Joining Data

Learn how to combine data from different sources using dplyr join functions

Course Sections

Why Join Data?

In data analysis, it’s common to work with multiple data sources that need to be combined for analysis. R provides functions to join data frames based on common columns, allowing you to create more comprehensive datasets for analysis.

What we’ll learn in this section

Goals In this section, we will:
    • Understand different types of joins: inner, left, right, and full
    • Use dplyr functions to join data frames
    • Apply set operations: union, intersection, and set difference

These techniques will enable you to merge and manipulate data from multiple sources, giving you a more complete picture for your analysis. Let’s dive into the details and see how these operations work in practice.

Types of Joins

There are several types of joins that can be performed on data frames:

TypeDescription
Inner JoinReturns rows with matching values in both data frames
Left JoinReturns all rows from the left data frame and matched rows from the right data frame
Right JoinReturns all rows from the right data frame and matched rows from the left data frame
Full JoinReturns all rows when there is a match in either data frame

Let’s start by creating some sample data to demonstrate the different types of joins.

 students <- data.frame(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)
 
activities <- data.frame(
  id = c(1, 2, 3, 5),
  game = c("Basketball", "Hockey", "Tennis", "Volleyball")
)
 
 
books <- data.frame(
  id = c(1, 2, 3, 5),
  book = c("Harry Potter", "The Hobbit", "The Alchemist", "Charlie and the Chocolate Factory")
) 

students

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

activities

idgame
1Basketball
2Hockey
3Tennis
5Volleyball

books

idbook
1Harry Potter
2The Hobbit
3The Alchemist
5Charlie and the Chocolate Factory

In the examples that follow, we’ll use these data frames to demonstrate the different types of joins.

Inner Join

An inner join returns only the rows that have matching values in both data frames.

Syntax
     inner_join(data_frame1, data_frame2, by = "column_name")
     
    # or
     
    data_frame1 |> 
      inner_join(data_frame2, by = "column_name") 
Example

    Join students and activities data frames by id:

     students |> 
      inner_join(activities, by = "id") 

    Output:

     # A tibble: 3 × 6
         id name  section study  play game      
      <dbl> <chr> <chr>   <dbl> <dbl> <chr>     
    1     1 Alia  A           2     5 Basketball
    2     2 Bala  B           8     5 Hockey    
    3     3 Cara  A          NA    10 Tennis     

    In this case, the resulting data frame will only contain rows with matching id values in both data frames.

Exercise 4.1: Inner Join

Join the students and books data frames by id using an inner join.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idbook
1Harry Potter
2The Hobbit
3The Alchemist
5Charlie and the Chocolate Factory

Left Join

A left join returns all rows from the left data frame and the matched rows from the right data frame.

Syntax
     left_join(data_frame1, data_frame2, by = "column_name")
     
    # or
     
    data_frame1 |> 
      left_join(data_frame2, by = "column_name") 
Example

    Left join students and activities data frames by id:

     students |> 
      left_join(activities, by = "id") 

    Output:

     # A tibble: 4 × 6
         id name  section study  play game      
      <dbl> <chr> <chr>   <dbl> <dbl> <chr>     
    1     1 Alia  A           2     5 Basketball
    2     2 Bala  B           8     5 Hockey    
    3     3 Cara  A          NA    10 Tennis    
    4     4 Dana  B           4    10 NA         

    In this case, the resulting data frame will contain all rows from the students data frame and the matched rows from the activities data frame.

Exercise 4.2: Left Join

Join the students and books data frames by id using a left join.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idbook
1Harry Potter
2The Hobbit
3The Alchemist
5Charlie and the Chocolate Factory

Right Join

A right join returns all rows from the right data frame and the matched rows from the left data frame.

Syntax
     right_join(data_frame1, data_frame2, by = "column_name")
     
    # or
     
    data_frame1 |> 
      right_join(data_frame2, by = "column_name") 
Example

    Right join students and activities data frames by id:

     students |> 
      right_join(activities, by = "id") 

    Output:

     # A tibble: 4 × 6
         id name  section study  play game      
      <dbl> <chr> <chr>   <dbl> <dbl> <chr>     
    1     1 Alia  A           2     5 Basketball
    2     2 Bala  B           8     5 Hockey    
    3     3 Cara  A          NA    10 Tennis    
    4     5 NA    NA         NA    NA Volleyball 

    In this case, the resulting data frame will contain all rows from the activities data frame and the matched rows from the students data frame.

Exercise 4.3: Right Join

Join the students and books data frames by id using a right join.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idbook
1Harry Potter
2The Hobbit
3The Alchemist
5Charlie and the Chocolate Factory

Full Join

A full join returns all rows when there is a match in either the left or right data frame.

Syntax
     full_join(data_frame1, data_frame2, by = "column_name")
     
    # or
     
    data_frame1 |> 
      full_join(data_frame2, by = "column_name") 
Example

    Full join students and activities data frames by id:

     students |> 
      full_join(activities, by = "id") 

    Output:

     # A tibble: 5 × 6
         id name  section study  play game      
      <dbl> <chr> <chr>   <dbl> <dbl> <chr>     
    1     1 Alia  A           2     5 Basketball
    2     2 Bala  B           8     5 Hockey    
    3     3 Cara  A          NA    10 Tennis    
    4     4 Dana  B           4    10 NA        
    5     5 NA    NA         NA    NA Volleyball 

    In this case, the resulting data frame will contain all rows from both the students and activities data frames.

Exercise 4.4: Full Join

Join the students and books data frames by id using a full join.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idbook
1Harry Potter
2The Hobbit
3The Alchemist
5Charlie and the Chocolate Factory

Union

The union() function is used to combine two data frames by stacking them on top of each other.

Syntax
     union(data_frame1, data_frame2) 
idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idnamesectionstudyplay
5EllaA65
6FionaB710
7GinaA85
8HanaB910
Example

    For union to work, we need to have the same columns in both data frames. We’ll use the two students data frames shown above to demonstrate the union operation.

     students <- tibble(
      id = 1:4,
      name = c("Alia", "Bala", "Cara", "Dana"),
      section = c("A", "B", "A", "B"),
      study = c(2, 8, NA, 4),
      play = c(5, 5, 10, 10)
    )
     
    students2 <- tibble(
      id = 5:8,
      name = c("Ella", "Fiona", "Gina", "Hana"),
      section = c("A", "B", "A", "B"),
      study = c(6, 7, 8, 9),
      play = c(5, 10, 5, 10)
    )
     
    union(students, students2) 

    Output:

     # A tibble: 8 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     2 Bala  B           8     5
    3     3 Cara  A          NA    10
    4     4 Dana  B           4    10
    5     5 Ella  A           6     5
    6     6 Fiona B           7    10
    7     7 Gina  A           8     5
    8     8 Hana  B           9    10 
namesport
AliceTennis
BobFootball
CharlieBasketball
namesport
BobFootball
DavidTennis
EveVolleyball
Exercise 4.5: Union Operation

Using the team_a and team_b data frames shown above, find all unique players using the union operation.

Intersection

The intersect() function is used to find the common rows between two data frames.

Syntax
     intersect(data_frame1, data_frame2) 
idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idnamesectionstudyplay
1AliaA25
3CaraANA10
Example

    To demonstrate the intersection operation, we’ll use the students data frame and a subset containing only the rows with section “A”.

     students <- tibble(
      id = 1:4,
      name = c("Alia", "Bala", "Cara", "Dana"),
      section = c("A", "B", "A", "B"),
      study = c(2, 8, NA, 4),
      play = c(5, 5, 10, 10)
    )
     
    section_a <- students |> 
      filter(section == "A")
     
    intersect(students, section_a) 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     3 Cara  A          NA    10 
namesport
AliceTennis
BobFootball
CharlieBasketball
namesport
BobFootball
DavidTennis
EveVolleyball
Exercise 4.6: Intersection Operation

Using the team_a and team_b data frames shown above, find the players that are in both teams using the intersection operation.

Set Difference

The setdiff() function is used to find the rows that are in the first data frame but not in the second data frame.

Syntax
     setdiff(data_frame1, data_frame2) 
idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
idnamesectionstudyplay
1AliaA25
3CaraANA10
Example

    Using the students data frame and the section_a data frame from the previous example, we can find the rows that are in the students data frame but not in the section_a data frame using setdiff() :

     students <- tibble(
      id = 1:4,
      name = c("Alia", "Bala", "Cara", "Dana"),
      section = c("A", "B", "A", "B"),
      study = c(2, 8, NA, 4),
      play = c(5, 5, 10, 10)
    )
     
    section_a <- students |> 
      filter(section == "A")
     
    setdiff(students, section_a) 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     2 Bala  B           8     5
    2     4 Dana  B           4    10 
namesport
AliceTennis
BobFootball
CharlieBasketball
namesport
BobFootball
DavidTennis
EveVolleyball
Exercise 4.7: Set Difference Operation

Using the team_a and team_b data frames shown above, find the players that are in team_a but not in team_b using the set difference operation.

Review

Let’s review what we’ve learned:

Quiz

    Loading...

    Loading...

    Loading...

Summary In this section, you've learned how to:
    • Perform different types of joins: inner, left, right, and full
    • Use dplyr functions to join data frames
    • Apply set operations: union, intersection, and set difference

In the next section, we’ll learn how to reshape data using tidyr functions.