One way to approach data wrangling in R is to work with rows. This involves ordering rows, keeping only unique rows, filtering rows based on conditions, and subsetting rows based on their positions or values.
In this section, we’ll learn how to perform these operations using the dplyr package in R.
What we’ll learn in this section
GoalsIn this section, we will learn how to:
Order rows using the arrange() function
Keep only unique rows using the distinct() function
Filter rows based on single and multiple conditions using the filter() function
Use logical operators in filtering
Subset rows based on their positions or values using the slice() family of functions
Example Dataset
We’ll create and use this sample dataset. It contains 5 columns and 4 rows.
students <- data.frame( id = 1:4, name = c("Alia", "Bala", "Cara", "Dana"), section = c("A", "B", "A", "B"), study = c(2, 8, NA, 4), play = c(5, 5, 10, 10))
This is how the dataset looks like:
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
Arrange Rows
The arrange() function is used to sort rows based on one or more columns.
Syntax
By default, the arrange() function sorts in ascending order.
dataset |> arrange(column_name)
To sort in descending order, use the desc() function.
dataset |> arrange(desc(column_name))
To arrange by multiple columns, list the column names in the arrange() function.
dataset |> arrange(column_name1, column_name2)
Let’s try this with some examples.
Example
To sort the students data frame by the section column in descending order, we can:
students |> arrange(desc(section))
Output:
# A tibble: 4 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 2 Bala B 8 52 4 Dana B 4 103 1 Alia A 2 54 3 Cara A NA 10
To sort the data frame first by the play column in ascending order and then by the study column in descending order, we can:
students |> arrange(play, desc(study))
Output:
# A tibble: 4 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 2 Bala B 8 52 1 Alia A 2 53 4 Dana B 4 104 3 Cara A NA 10
Exercise 2.1
Arrange the students data frame by the section column in descending order.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> arrange(desc(section))
Output:
# A tibble: 4 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 2 Bala B 8 52 4 Dana B 4 103 1 Alia A 2 54 3 Cara A NA 10
Exercise 2.2
Sort the students data frame by the section column in descending order and id in descending order.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> arrange(desc(section), desc(id))
Output:
# A tibble: 4 x 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 4 Dana B 4 102 2 Bala B 8 53 3 Cara A NA 104 1 Alia A 2 5
Keep Only Unique Rows
The distinct() function is used to keep only unique/distinct rows from a data frame.
Syntax
To keep distinct rows based on all columns:
dataset |> distinct()
To keep distinct rows based on specific columns:
dataset |> distinct(column_name1, column_name2)
The .keep_all = TRUE argument keeps all columns in the result.
To keep only unique rows based on the section column:
students |> distinct(section)
Output:
# A tibble: 2 × 1 section <chr>1 A 2 B
Exercise 3.1
Keep only unique rows based on the play column.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> distinct(play)
Output:
# A tibble: 2 × 1 play <dbl>1 52 10
Filter Rows
The filter() function is used to subset rows based on column values.
Syntax
To filter rows based on a single condition:
dataset |> filter(condition)
Example
To filter rows where study is greater than 3:
students |> filter(study > 3)
Output:
# A tibble: 2 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 2 Bala B 8 52 4 Dana B 4 10
Exercise 4.1
Filter the students data frame to show only rows where id is less than 3.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> filter(id < 3)
Output:
# A tibble: 2 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 1 Alia A 2 52 2 Bala B 8 5
To filter rows with two or more conditions, we can write conditional statements using logical operators. The common logical operators are & (and), | (or), and ! (not).
To filter rows where section is “A” or play is greater than 5:
students |> filter(section == "A" | play > 5)
Output:
# A tibble: 3 × 5
id name section study play
<int> <chr> <chr> <dbl> <dbl>
1 1 Alia A 2 5
2 3 Cara A NA 10
3 4 Dana B 4 10
Exercise 5.1
Filter the students data frame to show rows where section is “B” and study is not NA.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> filter(section == "B" & !is.na(study))
Output:
# A tibble: 2 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 2 Bala B 8 52 4 Dana B 4 10
Subset Rows Based on Positions or Values
The slice() family of functions is used to subset rows based on their positions or values.
Syntax
To select rows by position:
dataset |> slice(row_numbers)
To select top N rows:
dataset |> slice_head(n = N)
To select bottom N rows:
dataset |> slice_tail(n = N)
To select random N rows:
set.seed(seed_value) # this can be any integerdataset |> slice_sample(n = N)
Example
To select the first two rows:
students |> slice(1:2)
Output:
# A tibble: 2 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 1 Alia A 2 52 2 Bala B 8 5
Example
To select the last two rows:
students |> slice_tail(2)
Output:
# A tibble: 2 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 3 Cara A NA 102 4 Dana B 4 10
Exercise 6.1
Select the first three rows from the students data frame.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
students |> slice_head(n = 3)# orstudents |> slice(1:3)
Output:
# A tibble: 3 × 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 1 Alia A 2 52 2 Bala B 8 53 3 Cara A NA 10
Exercise 6.2
Select a sample of two rows from the students data frame. Add set.seed(123) before the command to ensure reproducibility.
id
name
section
study
play
1
Alia
A
2
5
2
Bala
B
8
5
3
Cara
A
NA
10
4
Dana
B
4
10
set.seed(123)students |> slice_sample(n = 2)
Output:
# A tibble: 2 x 5 id name section study play <int> <chr> <chr> <dbl> <dbl>1 3 Cara A NA 102 4 Dana B 4 10
Review
In this section, we learned how different dplyr functions can be used to manipulate rows in R. Let’s review with a quiz and summarize what we’ve learned so far.
Quiz
Loading...
Loading...
Loading...
SummaryIn this section, we learned how to:
Order rows using the arrange() function
Keep only unique rows using the distinct() function
Filter rows based on single and multiple conditions using the filter() function
Use logical operators in filtering
Subset rows based on their positions or values using the slice() family of functions
In the next section, we’ll learn how to apply column operations in R using dplyr functions.