Summarizing data is an essential step in data analysis. When we have a large dataset, it can be challenging to make sense of the data without summarizing it. We can use basic statistics such as mean, median, minimum, maximum, and count to summarize the data.
Why Group Data?
Grouping data is another important step in data analysis. Grouping allows us to divide the data into subsets based on one or more variables. We can then perform operations on each group separately. This is useful when we want to compare different groups or calculate group-specific statistics.
What you’ll learn in this section
Goals
In this section, we'll learn how to:
Summarize data using the summarize function in dplyr.
Understand the split-apply-combine strategy.
Group data using the group_by function.
Summarizing Data
The summarize function in dplyr is used to calculate summary statistics on a dataset. You can also spell it as summarise if you prefer British English spelling.
Syntax
Here, new_column is the name of the new column you want to create, and function(column) is the function you want to apply to the column to calculate the summary statistic.
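A minimal sketch of the shape described above, assuming the standard dplyr call form summarize(data, new_column = function(column)). The data frame scores and its column points are hypothetical names invented for this illustration; they are not part of the flowers data.

```r
library(dplyr)

# Hypothetical data, used only to illustrate the call shape
scores <- tibble(points = c(10, 20, 30))

# new_column is avg_points; function(column) is mean(points)
result <- summarize(scores, avg_points = mean(points))

print(result)
```

The result is a one-row table whose only column is the new column named on the left-hand side of the equals sign.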
Functions for Summarizing Data
Here are some common functions used for summarizing data:
mean() : Calculate the mean (average) of a column.
median() : Calculate the median of a column.
min() : Calculate the minimum value of a column.
max() : Calculate the maximum value of a column.
sum() : Calculate the sum of a column.
n() : Count the number of rows in the dataset, or in each group when the data is grouped.
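Several of the functions listed above can be combined in a single summarize call. The sketch below assumes the flowers data can be rebuilt by hand as a tibble; the output column names (avg_height, median_height, and so on) are illustrative choices, not names required by dplyr.

```r
library(dplyr)

# The flowers data used in this section, rebuilt by hand for the sketch
flowers <- tibble(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Each argument creates one column in the one-row result
summary_stats <- flowers %>%
  summarize(
    avg_height    = mean(height),
    median_height = median(height),
    min_height    = min(height),
    max_height    = max(height),
    total_height  = sum(height),
    n_flowers     = n()
  )

print(summary_stats)
```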
Example
We’ll continue to use the flowers data for our examples and exercises.
name    height  season  sunlight  growth
Poppy   75      Spring  8.3       fast
Rose    150     Summer  6.4       slow
Zinnia  60      Summer  8.7       fast
Peony   90      Spring  7.2       slow
Let’s summarize the flowers dataset to get the average height of all the flowers.
avg_height
93.75
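One way to reproduce the single overall average shown above is sketched here. It assumes the flowers data is rebuilt by hand as a tibble and uses the %>% pipe that dplyr exports; the original code for this example is not shown, so treat this as an illustration and not as the exact code behind the result.

```r
library(dplyr)

# The flowers data used in this section, rebuilt by hand for the sketch
flowers <- tibble(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Without group_by, summarize collapses the whole table to one row
overall <- flowers %>%
  summarize(avg_height = mean(height))

print(overall)  # avg_height is (75 + 150 + 60 + 90) / 4 = 93.75
```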
Exercise 5.1
Try summarizing the flowers dataset to get the average sunlight.
Grouping data using the split-apply-combine strategy
The split-apply-combine strategy is a common approach to data manipulation. It involves three steps:
Split: Divide the data into groups based on one or more variables.
Apply: Apply a function (e.g., summarization) to each group.
Combine: Combine the results back into a single dataset.
Example
Let’s say we want to calculate the average height of flowers for each season. We can use the split-apply-combine strategy to achieve this.
Split: Split the data by season.
Apply: Calculate the average height for each group.
Combine: Combine the results into a single dataset.
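The three steps above can be carried out by hand, one at a time. The sketch below is one possible manual approach using dplyr's filter, summarize, and bind_rows; the intermediate names (spring, summer, spring_avg, summer_avg) are illustrative, and the original code for this example is not shown.

```r
library(dplyr)

# The flowers data used in this section, rebuilt by hand for the sketch
flowers <- tibble(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Split: one subset per season
spring <- flowers %>% filter(season == "Spring")
summer <- flowers %>% filter(season == "Summer")

# Apply: summarize each subset separately, labeling the row by hand
spring_avg <- spring %>% summarize(season = "Spring", avg_height = mean(height))
summer_avg <- summer %>% summarize(season = "Summer", avg_height = mean(height))

# Combine: stack the one-row results into a single table
avg_by_hand <- bind_rows(spring_avg, summer_avg)

print(avg_by_hand)
```

Note that every new season would need its own filter and summarize lines, which is the repetition that group_by removes.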
Output
Grouping data with group_by
The group_by function is used to split the data into groups based on one or more variables. After grouping, you can apply summarization or other operations to each group.
Syntax
The basic syntax for grouping data with dplyr is:
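The syntax block is not shown here, so the following is only a sketch of the usual pattern: pass the data to group_by with one or more grouping columns, then to summarize. The data frame scores and its columns team and points are hypothetical names invented for this illustration.

```r
library(dplyr)

# Hypothetical data, used only to illustrate the call shape
scores <- tibble(
  team   = c("a", "a", "b"),
  points = c(1, 2, 5)
)

# group_by marks the groups; summarize then returns one row per group
totals <- scores %>%
  group_by(team) %>%
  summarize(total_points = sum(points))

print(totals)
```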
Example
Let’s use the same example of calculating the average height of flowers for each season using the group_by function.
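A sketch of how this example might look, assuming the flowers data is rebuilt by hand as a tibble. The column name avg_height follows the earlier summary example; the original code is not shown.

```r
library(dplyr)

# The flowers data used in this section, rebuilt by hand for the sketch
flowers <- tibble(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# Group by season, then compute one average per group
avg_by_season <- flowers %>%
  group_by(season) %>%
  summarize(avg_height = mean(height))

print(avg_by_season)
```

With this data, Spring averages (75 + 90) / 2 = 82.5 and Summer averages (150 + 60) / 2 = 105.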
Output
Exercise 5.2
Use group_by to calculate the maximum sunlight for each season using the flowers dataset.
In this section, we used group_by along with summarize to group and summarize data. Using them together is a powerful way to analyze data in R: group_by and summarize carry out the split-apply-combine strategy for us, which is more concise and easier to read than performing each step by hand.
We can also use other functions with group_by to perform operations on grouped data. For example, we can use the filter function to filter data within each group.
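The point about using filter within each group can be sketched as follows. This example keeps the tallest flower in each season; the choice of condition and the name tallest_per_season are illustrative assumptions, not something taken from the section.

```r
library(dplyr)

# The flowers data used in this section, rebuilt by hand for the sketch
flowers <- tibble(
  name     = c("Poppy", "Rose", "Zinnia", "Peony"),
  height   = c(75, 150, 60, 90),
  season   = c("Spring", "Summer", "Summer", "Spring"),
  sunlight = c(8.3, 6.4, 8.7, 7.2),
  growth   = c("fast", "slow", "fast", "slow")
)

# On grouped data, max(height) is evaluated separately for each season
tallest_per_season <- flowers %>%
  group_by(season) %>%
  filter(height == max(height)) %>%
  ungroup()  # drop the grouping so later steps see a plain table

print(tallest_per_season)
```

Unlike summarize, filter keeps whole rows, so every original column is still present in the result.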
Playground
This is a playground! Just play around with the code and see what happens.
Try grouping the flowers dataset by growth and summarizing the average height for each group.
Or try grouping the dataset by growth and season and summarizing the average sunlight for each group.
Review
You've learned how to:
Summarize data using the summarize function in dplyr.