Summarizing & Grouping Data | The School of Data

A Beginner's Introduction to R Through the Tidyverse: Summarizing & Grouping Data

Why Summarize Data?

Summarizing data is an essential step in data analysis. When we have a large dataset, it can be challenging to make sense of the data without summarizing it. We can use basic statistics such as mean, median, minimum, maximum, and count to summarize the data.

Why Group Data?

Grouping data is another important step in data analysis. Grouping allows us to divide the data into subsets based on one or more variables. We can then perform operations on each group separately. This is useful when we want to compare different groups or calculate group-specific statistics.

What you’ll learn in this section

In this section, we'll learn how to:

Summarize data using the summarize function in dplyr.
Understand the split-apply-combine strategy.
Group data using the group_by function.

Summarizing Data

The summarize function in dplyr is used to calculate summary statistics on a dataset. You can also spell it as summarise if you prefer British English spelling.

Syntax

 data |> 
  summarize(new_column = function(column))

Here, new_column is the name of the new column you want to create, and function(column) is the function you want to apply to the column to calculate the summary statistic.

Functions for Summarizing Data

Here are some common functions used for summarizing data:

mean() : Calculate the mean (average) of a column.
median() : Calculate the median of a column.
min() : Calculate the minimum value of a column.
max() : Calculate the maximum value of a column.
sum() : Calculate the sum of a column.
n() : Count the number of rows in the current group (or in the whole dataset if it is not grouped).

Example

We’ll continue to use the flowers data for our examples and exercises.

name	height	season	sunlight	growth
Poppy	75	Spring	8.3	fast
Rose	150	Summer	6.4	slow
Zinnia	60	Summer	8.7	fast
Peony	90	Spring	7.2	slow

 flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)
 
flowers

Let’s summarize the flowers dataset to get the average height of all the flowers.

library(dplyr)  # provides summarize(), group_by(), and filter()

flowers |> 
  summarize(avg_height = mean(height))

avg_height
93.75
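summarize can also compute several statistics in one call, with one argument per new column. The following is a minimal sketch of that idea using the functions listed above; the column names (median_height, n_flowers, and so on) are illustrative choices, not names from the course.

```r
library(dplyr)

# Same flowers data as in the example above
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# Each argument becomes one column in the one-row result.
# The column names on the left are illustrative.
height_summary <- flowers |>
  summarize(
    median_height = median(height),
    min_height = min(height),
    max_height = max(height),
    total_height = sum(height),
    n_flowers = n()
  )

height_summary
```

Because the data is not grouped here, the result has a single row and n() counts all four flowers.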

Exercise 5.1

Try summarizing the flowers dataset to get the average sunlight.

Grouping data using the split-apply-combine strategy

The split-apply-combine strategy is a common approach to data manipulation. It involves three steps:

Split: Divide the data into groups based on one or more variables.
Apply: Apply a function (e.g., summarization) to each group.
Combine: Combine the results back into a single dataset.

Example

Let’s say we want to calculate the average height of flowers for each season. We can use the split-apply-combine strategy to achieve this.

Split: Split the data by season.

 spring_flowers <- flowers |>
  filter(season == "Spring") 
 
summer_flowers <- flowers |>
  filter(season == "Summer")

Apply: Calculate the average height for each group.

 spring_avg_height <- spring_flowers |> 
  summarize(avg_height = mean(height))
 
summer_avg_height <- summer_flowers |>
  summarize(avg_height = mean(height))

Combine: Combine the results into a single dataset.

 rbind(spring_avg_height, summer_avg_height)

Output

   avg_height
1      82.5
2     105.0
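The same three steps can also be sketched with base R alone, which may help show that split-apply-combine is a general strategy rather than something specific to dplyr. This is only an illustration under the assumption that plain vectors are enough; split() and sapply() are base R functions, and the variable names are hypothetical.

```r
# Same flowers data as in the example above
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# Split: one vector of heights per season
heights_by_season <- split(flowers$height, flowers$season)

# Apply and combine: sapply() runs mean() on each piece and
# collects the results into a single named vector
avg_by_season <- sapply(heights_by_season, mean)

avg_by_season
```

Unlike the rbind result above, the named vector keeps track of which average belongs to which season.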

Grouping data with group_by

The group_by function is used to split the data into groups based on one or more variables. After grouping, you can apply summarization or other operations to each group.

Syntax

The basic syntax for grouping data with dplyr is:

 data |> 
  group_by(column) |> 
  summarize(new_column = function(column))

Example

Let’s repeat the earlier example, calculating the average height of flowers for each season, this time with the group_by function.

 flowers |> 
  group_by(season) |> 
  summarize(avg_height = mean(height))

Output

   season avg_height
1 Spring      82.5
2 Summer     105.0
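Grouping combines naturally with several summary functions at once. As a hedged sketch, the call below computes the average height, the tallest height, and the number of flowers per season; the column names tallest and n_flowers are illustrative.

```r
library(dplyr)

# Same flowers data as in the example above
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# One row per season; n() counts the rows in each group
season_summary <- flowers |>
  group_by(season) |>
  summarize(
    avg_height = mean(height),
    tallest = max(height),
    n_flowers = n()
  )

season_summary
```

Here n() returns 2 for each season because it counts rows within the current group, not in the whole dataset.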

Exercise 5.2

Use group_by to calculate the maximum sunlight for each season using the flowers dataset.

In this section, we used group_by together with summarize to group and summarize data. Other functions also work on grouped data: for example, filter can be applied within each group. Compared to carrying out the split, apply, and combine steps by hand, group_by followed by summarize performs the same strategy in a single pipeline that is more concise and easier to read.
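To illustrate filtering within groups, here is a minimal sketch that keeps the tallest flower of each season. The comparison height == max(height) is evaluated separately inside each group; the variable name tallest_per_season is hypothetical, and ungroup() is added so that later steps are not affected by the grouping.

```r
library(dplyr)

# Same flowers data as in the example above
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# max(height) is computed per season, so one flower per season survives
tallest_per_season <- flowers |>
  group_by(season) |>
  filter(height == max(height)) |>
  ungroup()

tallest_per_season
```

Without group_by, the same filter would keep only Rose, the tallest flower overall; with grouping, it keeps Rose for Summer and Peony for Spring.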

Playground

This is a playground! Just play around with the code and see what happens.

Try grouping the flowers dataset by growth and summarizing the average height for each group.

Or try grouping the dataset by growth and season and summarizing the average sunlight for each group.

Review

You've learned how to:

Summarize data using the summarize function in dplyr.
Group data using the group_by function.
Understand the split-apply-combine strategy.
