Datasets often contain a large amount of data, and it can be challenging to work with all of it at once. Filtering is a way to reduce the size of the dataset by selecting only the rows that meet specific criteria.
Why do we need to filter data?
There are several reasons why we might want to filter data. Some reasons include:
To focus on a specific subset of the data.
To remove rows that are not relevant to our analysis.
To remove missing or invalid data.
To prepare the data for further analysis.
What we’ll learn in this section
GoalsIn this section, we'll learn how to:
Filter rows using the filter function in dplyr.
Filter rows based on a single condition.
Filter rows based on multiple conditions.
Filter rows using logical operators.
Basic Syntax for Filtering
To use the filter function in dplyr, we use the following syntax:
We specify the dataset we want to filter, followed by the |> operator, and then the filter function. Inside the filter function, we specify the condition that we want to filter on.
The condition can be any logical expression that evaluates to TRUE or FALSE . Rows where the condition is TRUE are kept, and rows where the condition is FALSE are removed.
For example, here is an example condition:
This code will keep only the rows where the value in the column column_name is greater than 10.
If we need to filter rows that are equal to a specific value, we use the == operator. It has two equal signs, not one. For a value less than, we use < , for greater than, we use > , for less than or equal to, we use <= , and for greater than or equal to, we use >= .
We can also use != to filter rows that are not equal to a specific value.
Operator
Description
==
Equal to
!=
Not equal to
>
Greater than
<
Less than
>=
Greater than or equal to
<=
Less than or equal to
Remember that you have to use two equal signs == for comparison, not one equal sign = .
Filtering with a Single Condition
Example
Let’s say we want to filter the flowers dataset to keep only the rows where the season is “Summer”.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
To filter the dataset, we can use the following code:
name
height
season
sunlight
growth
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
This code will return only the flowers that bloom in the summer. Since the column has character values, we use quotes around the value we are comparing it to. Use "Summer" with quotes. Don’t use Summer without quotes.
Exercise 4.1
Try filtering the flowers dataset to keep only the rows where growth is “fast”. Remember to use == for comparison, not = .
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
Filtering with Multiple Conditions
You can also filter rows based on multiple conditions. To do this, you can use the & operator to combine conditions.
Syntax for when both conditions must be met
You can use a comma , or the & operator to filter rows that meet both conditions.
or
Example
For example, let’s filter the flowers dataset to keep only the rows where season is “Spring” and sunlight is greater than 8.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
or
This code will return only the flowers that bloom in spring and receive more than 8 hours of sunlight.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Exercise 4.2
Try filtering the flowers dataset to keep only the rows where season is “Summer” and growth is “fast”.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
Syntax for when at least one condition must be met
You can use the | operator to filter rows that meet either of two conditions.
This code will return rows that satisfy either condition1 or condition2 .
Example
Let’s filter the flowers dataset to keep only the rows where season is “Spring” or growth is “slow”.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
This code will return only the flowers that bloom in spring or have slow growth.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Peony
90
Spring
7.2
slow
Exercise 4.3
Run this code to keep only the rows where height is greater than 70 or growth is “slow”.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
Exercise 4.4
Type the code to filter the flowers dataset to keep only the rows where the height is less than 91 or the sunlight is greater than 8.5.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
Filtering rows that do not meet a condition
You can use the ! operator to filter rows that do not meet a condition.
Example
Let’s filter the flowers dataset to keep only the rows where season is not “Summer”.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Rose
150
Summer
6.4
slow
Zinnia
60
Summer
8.7
fast
Peony
90
Spring
7.2
slow
You can also write it as:
This code will return only the flowers that do not bloom in summer.
name
height
season
sunlight
growth
Poppy
75
Spring
8.3
fast
Peony
90
Spring
7.2
slow
Exercise 4.5
Filter the flowers dataset to keep only the rows where growth is not “slow”.