Data Wrangling in R with dplyr and tidyr

Working with Groups

Learn how to group and summarize data in R using dplyr functions

Course Sections

Working with Groups

Another crucial aspect of data wrangling in R is working with grouped data. This involves aggregating data, performing calculations within groups, and applying operations to subsets of your data based on specific criteria.

What we’ll learn in this section

In this section, we’ll explore how to effectively work with grouped data using the dplyr package in R.

Goals In this section, we'll learn how to:
    • Group data using group_by(), ungroup(), and rowwise()
    • Perform per-operation grouping using .by/by
    • Summarize grouped data using summarise(), reframe(), count(), and tally()
    • Understand the difference between adding counts with add_count() and add_tally()
    • Apply group-wise operations using mutate() and filter()

Grouping Data

The group_by() function is used to group your data by one or more variables.

Syntax
     dataset |> 
      group_by(column1) 

    You can group by multiple columns:

     dataset |> 
      group_by(column1, column2) 

When you group data using group_by() without any function after that, you won’t notice any difference. The grouping is still applied. Only when you use a function after grouping, you will notice the effect of grouping.

Example

    Group students by section:

     students |> 
      group_by(section)  

    Output:

     # A tibble: 4 × 5
    # Groups:   section [2]
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     2 Bala  B           8     5
    3     3 Cara  A          NA    10
    4     4 Dana  B           4    10 

    In the output, you can see the message Groups: section [2] . This indicates that the data is grouped by the section column.

    If you use a function after grouping, you will see the effect of grouping. For example, calculating the mean study time by section.

     students |> 
      group_by(section) |> 
      summarise(mean_study = mean(study, na.rm = TRUE)) 

    Output:

     # A tibble: 2 × 2
      section mean_study
      <chr>        <dbl>
    1 A                2
    2 B                6 
Exercise 3.1

Group the students by section and calculate the mean play time for each section.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Ungrouping Data

The ungroup() function removes grouping from your data. It is a common mistake to forget to ungroup our data after grouping, which can lead to unexpected results.

Example

    Group by section, then ungroup:

     students |> 
      group_by(section) |>
      # do something here
      ungroup() 

Calculating Values for Each Row

The rowwise() function groups data frame by rows, allowing operations to be performed row by row.

Example

    Calculate total time for each student:

     students |> 
      rowwise() |>
      mutate(total_time = sum(across(c(study, play)), na.rm = TRUE)) |>
      ungroup() 

    Output:

     # A tibble: 4 × 6
         id name  section study  play total_time
      <int> <chr> <chr>   <dbl> <dbl>      <dbl>
    1     1 Alia  A           2     5          7
    2     2 Bala  B           8     5         13
    3     3 Cara  A          NA    10         10
    4     4 Dana  B           4    10         14 

    In this case, the total time is calculated for each row. For example, the total time for Alia is 2 (study) + 5 (play) = 7.

    If we do not use rowwise(), the sum would be calculated for each column separately and this would not be the total time for each student.

    For example, if we use the following code:

     students |> 
      mutate(total_time = sum(across(c(study, play)), na.rm = TRUE)) 

    We get the following output:

     # A tibble: 4 × 6
         id name  section study  play total_time
      <int> <chr> <chr>   <dbl> <dbl>      <dbl>
    1     1 Alia  A           2     5         44
    2     2 Bala  B           8     5         44
    3     3 Cara  A          NA    10         44
    4     4 Dana  B           4    10         44 

    In this case, the number 44 is the sum of study column (2 + 8 + 4 = 14) and play column (5 + 5 + 10 + 10 = 30). It is not the sum of each row. To avoid this, we must use rowwise() to calculate the sum for each row.

Exercise 3.2

Calculate the average of study and play time for each student.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Group-wise Operations

After grouping your data, dplyr allows you to perform powerful operations on these groups. This lets you apply calculations to each group separately, enabling complex data analysis and transformation. We’ll cover two key combinations:

group_by() + mutate() : Create or modify columns based on group calculations.

group_by() + filter() : Filter rows using conditions evaluated within each group.

These techniques will help you manipulate grouped data effectively and gain valuable insights. Let’s explore how they work with some practical examples.

group_by() + mutate()

This combination allows you to create or modify columns based on group-level calculations.

Example

    Add column with mean study time for each section:

     students |> 
      group_by(section) |>
      mutate(mean_study_section = mean(study, na.rm = TRUE)) |>
      ungroup() 

    Output:

     # A tibble: 4 × 6
         id name  section study  play mean_study_section
      <int> <chr> <chr>   <dbl> <dbl>              <dbl>
    1     1 Alia  A           2     5                  2
    2     2 Bala  B           8     5                  6
    3     3 Cara  A          NA    10                  2
    4     4 Dana  B           4    10                  6 
Exercise 3.3

Group the students by section and add a column with the total study time for each section.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

group_by() + filter()

This combination allows you to filter rows based on group-level conditions.

Example

    Keep only students with play time less than 10 in each section:

     students |> 
      group_by(section) |>
      filter(play < 10) |>
      ungroup() 

    Output:

     # A tibble: 2 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     2 Bala  B           8     5 
Exercise 3.4

Group the students by section and filter out students who have a study time greater than 5 hours in each section.

idnamesectionstudyplay
1AliaA25
2BalaB85
3CaraANA10
4DanaB410

Review

We’ve learned how to work with groups. Let’s review what we’ve covered in this section:

Quiz

    Loading...

    Loading...

    Loading...

Summary In this section, you've learned how to:
    • Group data using group_by() , ungroup() , and rowwise()
    • Summarize grouped data using summarise()
    • Apply group-wise operations using mutate() and filter()

In the next section, we’ll learn how to join data frames in R using dplyr functions.