Data Wrangling in R with dplyr and tidyr

Working with Rows

Learn how to manipulate rows in R using dplyr functions

Course Sections

Row Operations

One way to approach data wrangling in R is to work with rows. This involves ordering rows, keeping only unique rows, filtering rows based on conditions, and subsetting rows based on their positions or values.

In this section, we’ll learn how to perform these operations using the dplyr package in R.

What we’ll learn in this section

Goals In this section, we will learn how to:
    • Order rows using the arrange() function
    • Keep only unique rows using the distinct() function
    • Filter rows based on single and multiple conditions using the filter() function
    • Use logical operators in filtering
    • Subset rows based on their positions or values using the slice() family of functions

Example Dataset

We’ll create and use this sample dataset. It contains 5 columns and 4 rows.

 students <- data.frame(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
) 

This is how the dataset looks like:

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Arrange Rows

The arrange() function is used to sort rows based on one or more columns.

Syntax

    By default, the arrange() function sorts in ascending order.

     dataset |> 
      arrange(column_name) 

    To sort in descending order, use the desc() function.

     dataset |> 
      arrange(desc(column_name)) 

    To arrange by multiple columns, list the column names in the arrange() function.

     dataset |> 
      arrange(column_name1, column_name2) 

Let’s try this with some examples.

Example

    To sort the students data frame by the section column in descending order, we can:

     students |> 
      arrange(desc(section)) 

    Output:

     # A tibble: 4 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     2 Bala  B           8     5
    2     4 Dana  B           4    10
    3     1 Alia  A           2     5
    4     3 Cara  A          NA    10 

    To sort the data frame first by the play column in ascending order and then by the study column in descending order, we can:

     students |> 
      arrange(play, desc(study)) 

    Output:

     # A tibble: 4 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     2 Bala  B           8     5
    2     1 Alia  A           2     5
    3     4 Dana  B           4    10
    4     3 Cara  A          NA    10 
Exercise 2.1

Arrange the students data frame by the section column in descending order.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
Exercise 2.2

Sort the students data frame by the section column in descending order and id in descending order.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Keep Only Unique Rows

The distinct() function is used to keep only unique/distinct rows from a data frame.

Syntax

    To keep distinct rows based on all columns:

     dataset |> 
      distinct() 

    To keep distinct rows based on specific columns:

     dataset |> 
      distinct(column_name1, column_name2) 

    The .keep_all = TRUE argument keeps all columns in the result.

     dataset |> 
      distinct(column_name1, column_name2, .keep_all = TRUE) 
Example

    To keep only unique rows based on the section column:

     students |> 
      distinct(section) 

    Output:

     # A tibble: 2 × 1
      section
      <chr>  
    1 A      
    2 B       
Exercise 3.1

Keep only unique rows based on the play column.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Filter Rows

The filter() function is used to subset rows based on column values.

Syntax

    To filter rows based on a single condition:

     dataset |> 
      filter(condition) 
Example

    To filter rows where study is greater than 3:

     students |> 
      filter(study > 3) 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     2 Bala  B           8     5
    2     4 Dana  B           4    10 
Exercise 4.1

Filter the students data frame to show only rows where id is less than 3.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

To filter rows with two or more conditions, we can write conditional statements using logical operators. The common logical operators are & (and), | (or), and ! (not).

Syntax

    Common logical operators:

    • & (and)
    • | (or)
    • ! (not)

    Example usage:

     dataset |> 
      filter(condition1 & condition2)
     
    dataset |>
      filter(condition1 | condition2)
     
    dataset |> 
      filter(!condition) 
Example

    To filter rows where section is “A” or play is greater than 5:

     students |> 
      filter(section == "A" | play > 5) 

    Output:

     # A tibble: 3 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     3 Cara  A          NA    10
    3     4 Dana  B           4    10
     
Exercise 5.1

Filter the students data frame to show rows where section is “B” and study is not NA.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Subset Rows Based on Positions or Values

The slice() family of functions is used to subset rows based on their positions or values.

Syntax

    To select rows by position:

     dataset |> 
      slice(row_numbers) 

    To select top N rows:

     dataset |> 
      slice_head(n = N) 

    To select bottom N rows:

     dataset |> 
      slice_tail(n = N) 

    To select random N rows:

     set.seed(seed_value) # this can be any integer
    dataset |> 
      slice_sample(n = N) 
Example

    To select the first two rows:

     students |> 
      slice(1:2) 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     2 Bala  B           8     5 
Example

    To select the last two rows:

     students |> 
      slice_tail(2) 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     3 Cara  A          NA    10
    2     4 Dana  B           4    10 
Exercise 6.1

Select the first three rows from the students data frame.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410
Exercise 6.2

Select a sample of two rows from the students data frame. Add set.seed(123) before the command to ensure reproducibility.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Review

In this section, we learned how different dplyr functions can be used to manipulate rows in R. Let’s review with a quiz and summarize what we’ve learned so far.

Quiz

    Loading...

    Loading...

    Loading...

Summary In this section, we learned how to:
  • Order rows using the arrange() function
  • Keep only unique rows using the distinct() function
  • Filter rows based on single and multiple conditions using the filter() function
  • Use logical operators in filtering
  • Subset rows based on their positions or values using the slice() family of functions

In the next section, we’ll learn how to apply column operations in R using dplyr functions.