In data analysis, it’s common to work with multiple data sources that need to be combined for analysis. R provides functions to join data frames based on common columns, allowing you to create more comprehensive datasets for analysis.
What we’ll learn in this section
GoalsIn this section, we will:
Understand different types of joins: inner, left, right, and full
Use dplyr functions to join data frames
Apply set operations: union, intersection, and set difference
These techniques will enable you to merge and manipulate data from multiple sources, giving you a more complete picture for your analysis. Let’s dive into the details and see how these operations work in practice.
Types of Joins
There are several types of joins that can be performed on data frames:
Type
Description
Inner Join
Returns rows with matching values in both data frames
Left Join
Returns all rows from the left data frame and matched rows from the right data frame
Right Join
Returns all rows from the right data frame and matched rows from the left data frame
Full Join
Returns all rows when there is a match in either data frame
Let’s start by creating some sample data to demonstrate the different types of joins.
students
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
activities
id
game
1
Basketball
2
Hockey
3
Tennis
5
Volleyball
books
id
book
1
Harry Potter
2
The Hobbit
3
The Alchemist
5
Charlie and the Chocolate Factory
In the examples that follow, we’ll use these data frames to demonstrate the different types of joins.
Inner Join
An inner join returns only the rows that have matching values in both data frames.
Syntax
Example
Join students and activities data frames by id:
Output:
In this case, the resulting data frame will only contain rows with matching id values in both data frames.
Exercise 4.1: Inner Join
Join the students and books data frames by id using an inner join.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
book
1
Harry Potter
2
The Hobbit
3
The Alchemist
5
Charlie and the Chocolate Factory
Output:
Left Join
A left join returns all rows from the left data frame and the matched rows from the right data frame.
Syntax
Example
Left join students and activities data frames by id:
Output:
In this case, the resulting data frame will contain all rows from the students data frame and the matched rows from the activities data frame.
Exercise 4.2: Left Join
Join the students and books data frames by id using a left join.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
book
1
Harry Potter
2
The Hobbit
3
The Alchemist
5
Charlie and the Chocolate Factory
Output:
Right Join
A right join returns all rows from the right data frame and the matched rows from the left data frame.
Syntax
Example
Right join students and activities data frames by id:
Output:
In this case, the resulting data frame will contain all rows from the activities data frame and the matched rows from the students data frame.
Exercise 4.3: Right Join
Join the students and books data frames by id using a right join.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
book
1
Harry Potter
2
The Hobbit
3
The Alchemist
5
Charlie and the Chocolate Factory
Output:
Full Join
A full join returns all rows when there is a match in either the left or right data frame.
Syntax
Example
Full join students and activities data frames by id:
Output:
In this case, the resulting data frame will contain all rows from both the students and activities data frames.
Exercise 4.4: Full Join
Join the students and books data frames by id using a full join.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
book
1
Harry Potter
2
The Hobbit
3
The Alchemist
5
Charlie and the Chocolate Factory
Output:
Union
The union() function is used to combine two data frames by stacking them on top of each other.
Syntax
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
name
section
study
play
5
Ella
A
6
5
6
Fiona
B
7
10
7
Gina
A
8
5
8
Hana
B
9
10
Example
For union to work, we need to have the same columns in both data frames.
We’ll use the two students data frames shown above to demonstrate the union operation.
Output:
name
sport
Alice
Tennis
Bob
Football
Charlie
Basketball
name
sport
Bob
Football
David
Tennis
Eve
Volleyball
Exercise 4.5: Union Operation
Using the team_a and team_b data frames shown above, find all unique players using the union operation.
Intersection
The intersect() function is used to find the common rows between two data frames.
Syntax
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
name
section
study
play
1
Alia
A
2
5
3
Cara
A
NA
10
Example
To demonstrate the intersection operation, we’ll use the students data frame and a subset containing only the rows with section “A”.
Output:
name
sport
Alice
Tennis
Bob
Football
Charlie
Basketball
name
sport
Bob
Football
David
Tennis
Eve
Volleyball
Exercise 4.6: Intersection Operation
Using the team_a and team_b data frames shown above, find the players that are in both teams using the intersection operation.
Set Difference
The setdiff() function is used to find the rows that are in the first data frame but not in the second data frame.
Syntax
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
id
name
section
study
play
1
Alia
A
2
5
3
Cara
A
NA
10
Example
Using the students data frame and the section_a data frame from the previous example, we can find the rows that are in the students data frame but not in the section_a data frame using setdiff() :
Output:
name
sport
Alice
Tennis
Bob
Football
Charlie
Basketball
name
sport
Bob
Football
David
Tennis
Eve
Volleyball
Exercise 4.7: Set Difference Operation
Using the team_a and team_b data frames shown above, find the players that are in team_a but not in team_b using the set difference operation.
Review
Let’s review what we’ve learned:
Quiz
Loading...
Loading...
Loading...
SummaryIn this section, you've learned how to:
Perform different types of joins: inner, left, right, and full
Use dplyr functions to join data frames
Apply set operations: union, intersection, and set difference
In the next section, we’ll learn how to reshape data using tidyr functions.