Why Summarize Data?
Summarizing data is an essential step in data analysis. When we have a large dataset, it can be challenging to make sense of the data without summarizing it. We can use basic statistics such as mean, median, minimum, maximum, and count to summarize the data.
Why Group Data?
Grouping data is another important step in data analysis. Grouping allows us to divide the data into subsets based on one or more variables. We can then perform operations on each group separately. This is useful when we want to compare different groups or calculate group-specific statistics.
What you’ll learn in this section
Goals
In this section, we'll learn how to:
Summarize data using the summarize function in dplyr.
Understand the split-apply-combine strategy.
Group data using the group_by function.
Summarizing Data
The summarize function in dplyr calculates summary statistics for a dataset, collapsing it to a single row (or to one row per group when the data is grouped). You can also spell it summarise if you prefer British English spelling.
Syntax
data |>
  summarize(new_column = function(column))
Here, new_column is the name of the new column you want to create, and function(column) is the function you want to apply to the column to calculate the summary statistic.
Functions for Summarizing Data
Here are some common functions used for summarizing data:
mean(): Calculate the mean (average) of a column.
median(): Calculate the median of a column.
min(): Calculate the minimum value of a column.
max(): Calculate the maximum value of a column.
sum(): Calculate the sum of a column.
n(): Count the number of rows in the dataset (or in each group, when the data is grouped). It takes no arguments.
Example
We’ll continue to use the flowers data for our examples and exercises.
name    height  season  sunlight  growth
Poppy   75      Spring  8.3       fast
Rose    150     Summer  6.4       slow
Zinnia  60      Summer  8.7       fast
Peony   90      Spring  7.2       slow
Code for creating the flowers dataset:
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)
flowers
Let’s summarize the flowers dataset to get the average height of all flowers. No grouping is involved yet, so the result is a single row.
flowers |>
  summarize(avg_height = mean(height))
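The summary functions listed earlier can also be combined in a single summarize call, one argument per output column. The following is a minimal sketch, assuming dplyr is installed; the output column names (n_flowers, avg_height, and so on) are illustrative choices, not part of the original lesson.

```r
library(dplyr)

# Same flowers data as used throughout this section
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# Each named argument becomes one column of the one-row result.
# The column names are hypothetical; pick whatever reads well.
height_summary <- flowers |>
  summarize(
    n_flowers = n(),                 # number of rows
    avg_height = mean(height),       # (75 + 150 + 60 + 90) / 4
    median_height = median(height),
    min_height = min(height),
    max_height = max(height),
    total_height = sum(height)
  )

height_summary
```

For these four flowers the result is one row with 4 flowers, a mean height of 93.75, a median of 82.5, a minimum of 60, a maximum of 150, and a total of 375.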
Exercise 5.1
Try summarizing the flowers dataset to get the average sunlight.
Solution:
flowers |>
  summarize(avg_sunlight = mean(sunlight))
Grouping data using the split-apply-combine strategy
The split-apply-combine strategy is a common approach to data manipulation. It involves three steps:
Split: Divide the data into groups based on one or more variables.
Apply: Apply a function (e.g., summarization) to each group.
Combine: Combine the results back into a single dataset.
Example
Let’s say we want to calculate the average height of flowers for each season. We can use the split-apply-combine strategy to achieve this.
Split: Split the data by season.
spring_flowers <- flowers |>
  filter(season == "Spring")
summer_flowers <- flowers |>
  filter(season == "Summer")
Apply: Calculate the average height for each group.
spring_avg_height <- spring_flowers |>
  summarize(avg_height = mean(height))
summer_avg_height <- summer_flowers |>
  summarize(avg_height = mean(height))
Combine: Combine the results into a single dataset.
rbind(spring_avg_height, summer_avg_height)
Output
  avg_height
1       82.5
2      105.0
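The manual result above has no season column, so it is not obvious which row belongs to which season. As a hedged sketch of one way around this, the same three steps can be written with base R's split() and sapply(), which keep the group names. The variable names here (heights_by_season, avg_by_season, season_avg) are illustrative and not from the original lesson.

```r
# Same flowers data as used throughout this section
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# Split: a named list with one vector of heights per season
heights_by_season <- split(flowers$height, flowers$season)

# Apply: the mean of each vector, returned as a named numeric vector
avg_by_season <- sapply(heights_by_season, mean)

# Combine: a data frame that keeps the season label next to each mean
season_avg <- data.frame(
  season = names(avg_by_season),
  avg_height = unname(avg_by_season)
)

season_avg
```

This gives 82.5 for Spring and 105 for Summer, the same values as the manual approach, but with each row labeled.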
Grouping data with group_by
The group_by function is used to split the data into groups based on one or more variables. After grouping, you can apply summarization or other operations to each group.
Syntax
The basic syntax for grouping data with dplyr is:
data |>
  group_by(column) |>
  summarize(new_column = function(column))
Example
Let’s use the same example of calculating the average height of flowers for each season, this time using the group_by function.
flowers |>
  group_by(season) |>
  summarize(avg_height = mean(height))
Output
  season avg_height
1 Spring       82.5
2 Summer      105.0
Exercise 5.2
Use group_by to calculate the maximum sunlight for each season using the flowers dataset.
Solution:
flowers |>
  group_by(season) |>
  summarize(max_sunlight = max(sunlight))
In this section, we used group_by along with summarize to group and summarize data. Other functions also work with group_by. For example, the filter function can filter rows within each group.
Using group_by and summarize together is a powerful way to analyze data in R. Together they carry out the split-apply-combine strategy in a single pipeline, which is more concise and easier to read than splitting with filter and combining with rbind by hand.
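To illustrate filtering within groups, here is a minimal sketch, assuming dplyr is installed. It keeps the tallest flower in each season; the result name tallest_per_season is a hypothetical choice made for this example.

```r
library(dplyr)

# Same flowers data as used throughout this section
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# Inside a grouped filter, max(height) is evaluated separately for each
# season, so each row is compared with the tallest flower of its own season.
tallest_per_season <- flowers |>
  group_by(season) |>
  filter(height == max(height)) |>
  ungroup()  # remove the grouping so later steps are not affected by it

tallest_per_season
```

With this data, the rows kept are Rose (Summer, 150) and Peony (Spring, 90). Without group_by, the same filter would keep only Rose, the tallest flower overall.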
Playground
This is a playground! Just play around with the code and see what happens.
Try grouping the flowers dataset by growth and summarizing the average height for each group.
Or try grouping the dataset by growth and season and summarizing the average sunlight for each group.
Solution:
flowers |>
  group_by(growth) |>
  summarize(avg_height = mean(height))
flowers |>
  group_by(growth, season) |>
  summarize(avg_sunlight = mean(sunlight))
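When grouping by more than one variable, summarize by default removes only the last grouping level and prints a message about how the result is grouped. The sketch below, assuming dplyr 1.0.0 or later, uses the .groups argument to return an ungrouped result instead; the name sunlight_by_group is illustrative.

```r
library(dplyr)

# Same flowers data as used throughout this section
flowers <- data.frame(
  name = c("Poppy", "Rose", "Zinnia", "Peony"),
  height = c(75, 150, 60, 90),
  season = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth = c("fast", "slow", "fast", "slow")
)

# .groups = "drop" removes all grouping from the result, which also
# silences the informational message about the remaining groups.
sunlight_by_group <- flowers |>
  group_by(growth, season) |>
  summarize(avg_sunlight = mean(sunlight), .groups = "drop")

sunlight_by_group
```

Each growth and season combination contains exactly one flower in this dataset, so the result has four rows and each average equals that flower's own sunlight value.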
Review
You've learned how to:
Summarize data using the summarize function in dplyr.
Group data using the group_by function.
Understand the split-apply-combine strategy.