A Beginner's Introduction to R Through the Tidyverse

Summarizing & Grouping Data

Learn how to summarize and group data using dplyr.

Why Summarize Data?

Summarizing data is an essential step in data analysis. When we have a large dataset, it can be challenging to make sense of the data without summarizing it. We can use basic statistics such as mean, median, minimum, maximum, and count to summarize the data.

Why Group Data?

Grouping data is another important step in data analysis. Grouping allows us to divide the data into subsets based on one or more variables. We can then perform operations on each group separately. This is useful when we want to compare different groups or calculate group-specific statistics.

What you’ll learn in this section

In this section, we'll learn how to:
  • Summarize data using the summarize function in dplyr.
  • Understand the split-apply-combine strategy.
  • Group data using the group_by function.

Summarizing Data

The summarize function in dplyr is used to calculate summary statistics on a dataset. You can also spell it as summarise if you prefer British English spelling.

Syntax

 data |>
  summarize(new_column = summary_function(column))

Here, new_column is the name of the column to create in the result, and summary_function(column) stands for any function that reduces a column to a single value, such as mean(height).

Functions for Summarizing Data

Here are some common functions used for summarizing data:

  • mean() : Calculate the mean (average) of a column.
  • median() : Calculate the median of a column.
  • min() : Calculate the minimum value of a column.
  • max() : Calculate the maximum value of a column.
  • sum() : Calculate the sum of a column.
  • n() : Count the number of rows (within each group, if the data is grouped). It takes no arguments.
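
Several of these functions can be combined in a single summarize call, each producing one column of the result. The following is a minimal sketch, assuming the flowers data is rebuilt by hand as a plain data frame; the column names n_flowers, min_height, median_height, max_height, and total_height are illustrative choices, not part of the course material.

```r
library(dplyr)

# Assumption: the flowers data from this section, rebuilt by hand
# so that the example runs on its own.
flowers <- data.frame(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Each argument to summarize() becomes one column in a one-row result.
height_summary <- flowers |>
  summarize(
    n_flowers     = n(),            # number of rows
    min_height    = min(height),
    median_height = median(height),
    max_height    = max(height),
    total_height  = sum(height)
  )

height_summary
```

Because the data is not grouped here, the result has a single row describing all four flowers.
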
Example

    We’ll continue to use the flowers data for our examples and exercises.

    name    height  season  sunlight  growth
    Poppy       75  Spring       8.3  fast
    Rose       150  Summer       6.4  slow
    Zinnia      60  Summer       8.7  fast
    Peony       90  Spring       7.2  slow

    Let’s summarize the flowers dataset to get the average height of all flowers.

    flowers |>
      summarize(avg_height = mean(height))

    Output

      avg_height
    1      93.75

Exercise 5.1

Try summarizing the flowers dataset to get the average sunlight.

Grouping data using the split-apply-combine strategy

The split-apply-combine strategy is a common approach to data manipulation. It involves three steps:

  1. Split: Divide the data into groups based on one or more variables.
  2. Apply: Apply a function (e.g., summarization) to each group.
  3. Combine: Combine the results back into a single dataset.
Example

    Let’s say we want to calculate the average height of flowers for each season. We can carry out the split-apply-combine strategy by hand to achieve this.

    Split: Split the data by season.

    # Keep only the rows that belong to each season
    spring_flowers <- flowers |>
      filter(season == "Spring")

    summer_flowers <- flowers |>
      filter(season == "Summer")

    Apply: Calculate the average height for each group.

    # Reduce each subset to a single average height
    spring_avg_height <- spring_flowers |>
      summarize(avg_height = mean(height))

    summer_avg_height <- summer_flowers |>
      summarize(avg_height = mean(height))

    Combine: Combine the results into a single dataset.

     rbind(spring_avg_height, summer_avg_height) 

    Output

       avg_height
    1      82.5
    2     105.0 
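
Note that the combined result above has no column saying which row belongs to which season; that information survives only in the order of the rows. One possible way to keep the labels, sketched here under the assumption that the flowers data is rebuilt by hand, is to use dplyr's bind_rows with named inputs and its .id argument instead of rbind. The object name combined is illustrative.

```r
library(dplyr)

# Assumption: the flowers data from this section, rebuilt by hand
# so that the example runs on its own.
flowers <- data.frame(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Split and apply, as in the manual example
spring_avg_height <- flowers |>
  filter(season == "Spring") |>
  summarize(avg_height = mean(height))

summer_avg_height <- flowers |>
  filter(season == "Summer") |>
  summarize(avg_height = mean(height))

# Combine: the argument names become the values of a new "season" column
combined <- bind_rows(
  Spring = spring_avg_height,
  Summer = summer_avg_height,
  .id = "season"
)

combined
```

The result has one row per season with both the label and the average, which is the same shape that group_by and summarize produce.
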

Grouping data with group_by

The group_by function marks the rows of a dataset as belonging to groups based on one or more variables. It does not change the data itself; instead, functions applied afterward, such as summarize, operate on each group separately.

Syntax

The basic syntax for grouping data with dplyr is:

 data |>
  group_by(group_column) |>
  summarize(new_column = summary_function(column))

Example

    Let’s use the same example of calculating the average height of flowers for each season using the group_by function.

     flowers |> 
      group_by(season) |> 
      summarize(avg_height = mean(height)) 

    Output

       season avg_height
    1 Spring      82.5
    2 Summer     105.0 
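
A grouped summarize can also compute several statistics at once, giving one row per group and one column per statistic. This is a minimal sketch, assuming the flowers data is rebuilt by hand; the column names n_flowers, shortest, and tallest are illustrative.

```r
library(dplyr)

# Assumption: the flowers data from this section, rebuilt by hand
# so that the example runs on its own.
flowers <- data.frame(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

season_summary <- flowers |>
  group_by(season) |>
  summarize(
    n_flowers = n(),          # rows in each season
    shortest  = min(height),  # smallest height in each season
    tallest   = max(height)   # largest height in each season
  )

season_summary
```

Here n() counts the rows in each season rather than in the whole dataset, because the data was grouped first.

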
Exercise 5.2

Use group_by to calculate the maximum sunlight for each season using the flowers dataset.

In this section, we used group_by along with summarize to group and summarize data. Other functions also work on grouped data. For example, filter applied after group_by evaluates its condition within each group.

Using group_by and summarize together is a powerful way to analyze data in R. It carries out the same split-apply-combine strategy as the manual filter, summarize, and rbind steps shown earlier, but it is more concise and easier to read, and it does not need a separate block of code for every group.
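
To illustrate filter on grouped data, the sketch below keeps the tallest flower of each season. It assumes the flowers data is rebuilt by hand, and the object name tallest_per_season is illustrative.

```r
library(dplyr)

# Assumption: the flowers data from this section, rebuilt by hand
# so that the example runs on its own.
flowers <- data.frame(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

tallest_per_season <- flowers |>
  group_by(season) |>
  # max(height) is computed separately for Spring and for Summer
  filter(height == max(height)) |>
  # drop the grouping so later steps treat the data as a whole again
  ungroup()

tallest_per_season
```

Unlike summarize, filter keeps whole rows, so the result still contains every column of the original data for Rose (Summer) and Peony (Spring).
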

Playground

This is a playground! Just play around with the code and see what happens.

Try grouping the flowers dataset by growth and summarizing the average height for each group.

Or try grouping the dataset by growth and season and summarizing the average sunlight for each group.

Review

You've learned how to:
  • Summarize data using the summarize function in dplyr.
  • Group data using the group_by function.
  • Understand the split-apply-combine strategy.