Basic Questions
Let’s start our exploration by answering these questions:
1. What are the names of the ice cream flavors?
2. How many flavors are there in the dataset?
3. What are the categories of ice cream in the dataset?
Before we answer these questions, let’s translate them into data tasks:
Translating Questions into Data Tasks
1. What are the names of the ice cream flavors?
To answer this question, we need to list the unique values in the name column of the ice_cream dataset. We can use the distinct() function to achieve this.
In other words, we can ask ourselves: “What are the unique values in the name column of the ice_cream dataset?”
2. How many flavors are there in the dataset?
To answer this question, we need to count the number of unique values in the name column of the ice_cream dataset. We can use the distinct() function to list the unique values and the count() function to count the resulting rows.
In other words, we can ask ourselves: “How many unique values are there in the name column of the ice_cream dataset?”
3. What are the categories of ice cream in the dataset?
To answer this question, we need to list the unique values in the category column of the ice_cream dataset. We can use the distinct() function to achieve this.
In other words, we can ask ourselves: “What are the unique values in the category column of the ice_cream dataset?”
By translating our questions into data tasks, we can make it easier to understand what we need to do with the data.
Get Distinct Values
The distinct() function from dplyr (part of the tidyverse) keeps only the unique rows of a data frame. When you pass it a column name, it returns a data frame containing just that column’s unique values.
The syntax for the distinct() function is:
dataset |>
  distinct(column_name)
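To make the behaviour of distinct() concrete, here is a minimal sketch on a small made-up table. The scoops data frame and its values are hypothetical stand-ins (not the tutorial's ice_cream data), chosen so that one row repeats. It also shows two variations that are easy to miss: passing several columns, and .keep_all = TRUE.

```r
library(dplyr)

# Hypothetical stand-in table (not the tutorial's ice_cream data);
# the first and third rows are identical on purpose.
scoops <- data.frame(
  name     = c("Vanilla", "Mango", "Vanilla", "Blueberry"),
  category = c("Classic", "Fruit", "Classic", "Fruit")
)

# Unique values of one column: the result is a one-column data frame
unique_names <- scoops |> distinct(name)

# Unique combinations of two columns
unique_pairs <- scoops |> distinct(name, category)

# Unique categories, keeping the other columns of the first matching row
first_per_category <- scoops |> distinct(category, .keep_all = TRUE)

print(unique_names)
print(unique_pairs)
print(first_per_category)
```

Without .keep_all = TRUE, distinct(category) would drop the name column entirely, which is usually what you want when you only need the list of values.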
Exercise 1.1: List Ice Cream Flavors
Use the distinct() function to list the names of the ice cream flavors:
Solution:
ice_cream |>
  distinct(name)
Output:
# A tibble: 6 × 1
  name
  <chr>
1 Vanilla
2 Chocolate
3 Strawberry
4 Mango
5 Cookie Dough
6 Blueberry
Count Unique Values
The count() function from dplyr counts rows. Called with no arguments, it returns a one-row data frame whose n column holds the number of rows, so applying it after distinct() gives the number of unique values.
The syntax for counting unique values is:
dataset |>
  distinct(column_name) |>
  count()
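The difference between counting rows and counting unique values is easiest to see on data with duplicates. The sketch below uses a hypothetical orders table (an assumption for illustration, not part of the tutorial's datasets) and also shows n_distinct(), a dplyr helper that gives the same answer in one step.

```r
library(dplyr)

# Hypothetical stand-in table in which flavor names repeat
orders <- data.frame(
  name = c("Vanilla", "Mango", "Vanilla", "Blueberry", "Mango")
)

# count() on its own counts rows, duplicates included
n_rows <- orders |> count()

# distinct() first, then count(): the number of unique names
n_unique <- orders |> distinct(name) |> count()

# n_distinct() inside summarise() gives the same number in one step
n_unique_alt <- orders |> summarise(n = n_distinct(name))

print(n_rows)
print(n_unique)
print(n_unique_alt)
```

When every name appears only once, as in the flavor list above, all three approaches return the same number.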
Exercise 1.2: Count Ice Cream Flavors
Use the distinct() and count() functions to count the number of unique flavors:
Solution:
ice_cream |>
  distinct(name) |>
  count()
Output:
# A tibble: 1 × 1
      n
  <int>
1     6
List Unique Categories
Exercise 1.3: List Ice Cream Categories
Use the distinct() function to list the categories of ice cream:
Solution:
ice_cream |>
  distinct(category)
Output:
# A tibble: 3 × 1
  category
  <chr>
1 Classic
2 Fruit
3 Specialty
Let’s take this a step further and count the number of flavors in each category.
Count Flavors by Category
Exercise 1.4: Count Flavors by Category
Use the group_by() and summarise() functions to count how many flavors are in each category:
Solution:
ice_cream |>
  group_by(category) |>
  summarise(count = n())
Output:
# A tibble: 3 × 2
  category  count
  <chr>     <int>
1 Classic       2
2 Fruit         3
3 Specialty     1
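dplyr also offers a shorthand for this pattern: count(category) groups, tallies, and ungroups in a single call. The sketch below compares both forms on a hypothetical flavors table. Its category assignments for Vanilla, Chocolate, and Strawberry are assumptions chosen to reproduce the group sizes shown above, not values read from the real dataset.

```r
library(dplyr)

# Hypothetical stand-in table; category assignments are assumed
flavors <- data.frame(
  name     = c("Vanilla", "Chocolate", "Strawberry",
               "Mango", "Cookie Dough", "Blueberry"),
  category = c("Classic", "Classic", "Fruit",
               "Fruit", "Specialty", "Fruit")
)

# Long form, as in the exercise
by_group <- flavors |>
  group_by(category) |>
  summarise(count = n())

# Short form: name sets the count column's name, sort = TRUE puts the
# largest group first
by_count <- flavors |>
  count(category, name = "count", sort = TRUE)

print(by_group)
print(by_count)
```

The long form is worth knowing because summarise() can compute other statistics alongside n(); the short form is handy when a tally is all you need.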
Writing it up
Let’s write up a small report based on our findings:
Report: Ice Cream Flavors Analysis
1. Ice Cream Flavors
The ice cream flavors in the dataset are Vanilla, Chocolate, Strawberry, Mango, Cookie Dough, and Blueberry.
2. Number of Flavors
There are 6 unique ice cream flavors in the dataset.
3. Ice Cream Categories
The ice cream flavors are categorized into Classic, Fruit, and Specialty. The Classic category has 2 flavors, the Fruit category has 3 flavors, and the Specialty category has 1 flavor.
Next Steps
In the next section, we’ll continue exploring the ice_cream dataset and answer questions like “What is the highest-rated ice cream flavor?” and “What is the average price of ice cream in each category?”
What we’ll learn in this section
In this section, we’ll pull some quick insights out of our dataset as a first pass at exploration. Recording the key facts in a short report gives us a firmer understanding of the data before we analyze it further.
Highest & Lowest Values
One way to get a sense of the data is to find its highest and lowest values. They give a quick overview of the data and make good initial insights to share.
Example: Lowest-rated Ice Cream
What is the lowest-rated ice cream?
Let’s translate this into a technical question: we need to find the lowest rating and its corresponding ice cream name.
To find the lowest rating, we first select the rating column and then apply the min function.
ice_cream |>
  select(rating) |>
  min()
But this gives us only the value. To find the row, we can use the filter function.
ice_cream |>
  filter(rating == min(rating))
We can also use the slice_min function to slice the row with the minimum rating value.
ice_cream |> slice_min(rating)
Output:
# A tibble: 1 × 5
     id name      category rating price
  <int> <chr>     <chr>     <dbl> <dbl>
1     6 Blueberry Fruit       3.8     3
You can also add this information inline by saving these values to variables and then referencing them in the text.
rating_lowest <- ice_cream |> slice_min(rating)
paste0("The lowest-rated ice cream is ", rating_lowest$name,
       " with a rating of ", rating_lowest$rating, ".")
This will display the following output:
[1] "The lowest-rated ice cream is Blueberry with a rating of 3.8."
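One detail to keep in mind: slice_min() and slice_max() keep ties by default (with_ties = TRUE). If two flavors shared the lowest rating, rating_lowest would hold two rows and paste0() would return two sentences. The sketch below illustrates this on a hypothetical rated table (made-up values, not the tutorial's data) and shows two ways to handle it.

```r
library(dplyr)

# Hypothetical stand-in table in which two flavors share the lowest rating
rated <- data.frame(
  name   = c("Vanilla", "Mango", "Blueberry"),
  rating = c(3.8, 4.7, 3.8)
)

# Default behaviour: every row tied for the minimum is kept
lowest_all <- rated |> slice_min(rating)

# with_ties = FALSE keeps exactly one row
lowest_one <- rated |> slice_min(rating, with_ties = FALSE)

# Collapsing the names yields a single sentence even when there are ties
sentence <- paste0(
  "Lowest-rated: ", paste(lowest_all$name, collapse = " and "),
  " (rating ", lowest_all$rating[1], ")."
)

print(lowest_all)
print(lowest_one)
print(sentence)
```

Which option fits depends on the report: dropping ties gives a tidy one-liner, while collapsing the names avoids hiding a flavor that is equally low-rated.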
You can try this on your own:
Exercise 2.1: Highest Rated Ice Cream
Find the highest-rated ice cream in the dataset.
Solution:
rating_highest <- ice_cream |> slice_max(rating)
rating_highest
paste0("The highest-rated ice cream is ", rating_highest$name,
       " with a rating of ", rating_highest$rating, ".")
Output:
# A tibble: 1 × 5
     id name  category rating price
  <int> <chr> <chr>     <dbl> <dbl>
1     4 Mango Fruit       4.7  3.25
The highest-rated ice cream is Mango with a rating of 4.7.
Exercise 2.2: Highest Priced Ice Cream
Find the highest-priced ice cream in the dataset.
Solution:
price_highest <- ice_cream |> slice_max(price)
price_highest
paste0("The highest-priced ice cream is ", price_highest$name,
       " with a price of $", price_highest$price, ".")
Output:
# A tibble: 1 × 5
     id name         category  rating price
  <int> <chr>        <chr>      <dbl> <dbl>
1     5 Cookie Dough Specialty      4   3.5
The highest-priced ice cream is Cookie Dough with a price of $3.5.
Exercise 2.3: Least expensive ice cream
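One possible solution mirrors Exercise 2.2, swapping slice_max() for slice_min(). The sketch below runs on ice_cream_demo, a hypothetical stand-in table with placeholder prices, so its result says nothing about which flavor is cheapest in the real ice_cream data. Using sprintf() here is a design choice (not something from the exercises above) to print the price with two decimals.

```r
library(dplyr)

# Hypothetical stand-in for ice_cream; prices are placeholders
ice_cream_demo <- data.frame(
  name  = c("Vanilla", "Mango", "Cookie Dough"),
  price = c(2.5, 3.25, 3.5)
)

# Row (or rows, if tied) with the lowest price
price_lowest <- ice_cream_demo |> slice_min(price)
print(price_lowest)

# %s inserts the name, %.2f formats the price with two decimals
sentence <- sprintf(
  "The least expensive ice cream is %s with a price of $%.2f.",
  price_lowest$name, price_lowest$price
)
print(sentence)
```

To apply it to the exercise, replace ice_cream_demo with ice_cream.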
Average Values
What is the category with the highest average price?
The category with the highest average price is Specialty at $3.5.
ice_cream |>
  group_by(category) |>
  summarise(avg_price = mean(price)) |>
  slice_max(avg_price)
Monthly Sales Dataset
The best-performing region is North with average sales of 301.
monthly_sales |> group_by(region) |> summarise(avg_sales = mean(sales)) |> slice_max(avg_sales)
The lowest-performing month is January 2021 with total sales of 1800.
monthly_sales |> group_by(date) |> summarise(total_sales = sum(sales)) |> slice_min(total_sales)
The highest-performing month is December 2021 with total sales of 3600.
monthly_sales |> group_by(date) |> summarise(total_sales = sum(sales)) |> slice_max(total_sales)
The flavor with the highest average sales is Flavor 6 with average sales of 310.
monthly_sales |> group_by(flavor_id) |> summarise(avg_sales = mean(sales)) |> slice_max(avg_sales)
These one-line summaries provide quick insights into our datasets and demonstrate how to use various dplyr functions to extract meaningful information.
Statistical Summaries
Now, let’s dive deeper into statistical summaries: numerical measures such as the minimum, maximum, mean, and standard deviation that describe the main characteristics of a column.
Ice Cream Dataset
Example: Summarize Prices
Use the summarise() function to calculate summary statistics for the price column:
ice_cream |>
  summarise(
    min_price = min(price),
    max_price = max(price),
    range_price = max(price) - min(price),
    avg_price = mean(price),
    sd_price = sd(price)
  )
Output:
# A tibble: 1 × 5
  min_price max_price range_price avg_price sd_price
      <dbl>     <dbl>       <dbl>     <dbl>    <dbl>
1       2.5       3.5           1     3.083    0.379
This gives us a summary of the price column, showing the minimum ($2.5), maximum ($3.5), range ($1), average ($3.083), and standard deviation ($0.379) of the prices.
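These functions return NA as soon as the column contains a missing value. The datasets here do not appear to contain any, but real data often does, so the sketch below shows the usual remedy, na.rm = TRUE, on a hypothetical prices table with one NA (made-up values for illustration only).

```r
library(dplyr)

# Hypothetical prices with one missing value
prices <- data.frame(price = c(2.5, 3.0, NA, 3.5))

# Without na.rm, a single NA makes each statistic NA
with_na <- prices |>
  summarise(avg_price = mean(price), max_price = max(price))

# na.rm = TRUE drops missing values before each computation;
# counting the NAs alongside makes the omission visible
without_na <- prices |>
  summarise(
    n_missing = sum(is.na(price)),
    avg_price = mean(price, na.rm = TRUE),
    max_price = max(price, na.rm = TRUE)
  )

print(with_na)
print(without_na)
```

Reporting n_missing next to the statistics is a small habit that keeps readers from assuming the average covers every row.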
Exercise 2.1: Summarize Ratings
Use the summarise() function to calculate summary statistics for the rating column:
Solution:
ice_cream |>
  summarise(
    min_rating = min(rating),
    max_rating = max(rating),
    range_rating = max(rating) - min(rating),
    avg_rating = mean(rating),
    sd_rating = sd(rating)
  )
Output:
# A tibble: 1 × 5
  min_rating max_rating range_rating avg_rating sd_rating
       <dbl>      <dbl>        <dbl>      <dbl>     <dbl>
1        3.8        4.7          0.9       4.18     0.316
Monthly Sales Dataset
Exercise 2.2: Monthly Sales Summary Statistics
Calculate the minimum, maximum, range, average, and standard deviation of sales:
Solution:
monthly_sales |>
  summarise(
    min_sales = min(sales),
    max_sales = max(sales),
    range_sales = max(sales) - min(sales),
    avg_sales = mean(sales),
    sd_sales = sd(sales)
  )
Output:
  min_sales max_sales range_sales avg_sales sd_sales
1       100       500         400  300.1042 115.4701
Analysis by Group
Now let’s analyze our data by different groups.
Ice Cream Dataset
Exercise 2.3: Average Price by Category
Calculate the average price of ice cream for each category:
Solution:
ice_cream |>
  group_by(category) |>
  summarise(avg_price = mean(price))
Output:
# A tibble: 3 × 2
  category  avg_price
  <chr>         <dbl>
1 Classic        2.75
2 Fruit          3.17
3 Specialty      3.5
Monthly Sales Dataset
Exercise 2.4: Average Sales by Region
Calculate the average sales for each region:
Solution:
monthly_sales |>
  group_by(region) |>
  summarise(avg_sales = mean(sales))
Output:
# A tibble: 2 × 2
  region avg_sales
  <chr>      <dbl>
1 North       301.
2 South       299.
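A single summarise() call can return several statistics per group, which helps when an average alone is hard to interpret. The sketch below is a minimal illustration on sales_demo, a hypothetical stand-in for monthly_sales with placeholder numbers. It adds the record count and total next to the average, then sorts the groups with arrange().

```r
library(dplyr)

# Hypothetical stand-in for monthly_sales; the numbers are placeholders
sales_demo <- data.frame(
  region = c("North", "North", "South", "South", "South"),
  sales  = c(100, 500, 200, 300, 430)
)

region_summary <- sales_demo |>
  group_by(region) |>
  summarise(
    n_records   = n(),          # rows in each group
    avg_sales   = mean(sales),
    total_sales = sum(sales)
  ) |>
  arrange(desc(avg_sales))      # highest average first

print(region_summary)
```

Including n_records is useful because an average based on two rows deserves less weight than one based on twenty.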
Detecting Outliers
Outliers are data points that are significantly different from other observations. Let’s learn how to detect outliers in our datasets.
Ice Cream Dataset
Exercise 2.5: Detect Price Outliers
Identify ice cream flavors with prices that are more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile:
Solution:
# Calculate Q1, Q3, and the interquartile range
Q1 <- quantile(ice_cream$price, 0.25)
Q3 <- quantile(ice_cream$price, 0.75)
iqr_price <- Q3 - Q1  # named iqr_price so it does not mask the built-in IQR() function
# Identify outliers
ice_cream |>
  filter(price < (Q1 - 1.5 * iqr_price) | price > (Q3 + 1.5 * iqr_price))
Output:
# A tibble: 0 × 5
# … with 5 variables: id <int>, name <chr>, category <chr>, rating <dbl>, price <dbl>
In this case, there are no price outliers in the ice cream dataset based on the IQR method.
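Because the result above is empty, it can be hard to tell whether the filter works. The sketch below applies the same IQR rule to a hypothetical demo table (made-up prices, with one deliberately extreme value) so that a row is flagged. It uses R's built-in IQR() function and passes names = FALSE to quantile() to get plain numbers.

```r
library(dplyr)

# Hypothetical prices; the last one is deliberately extreme
demo <- data.frame(
  name  = c("A", "B", "C", "D", "E"),
  price = c(2.5, 3.0, 3.0, 3.5, 9.0)
)

# Quartiles and the 1.5 * IQR margin, computed once up front
q1     <- quantile(demo$price, 0.25, names = FALSE)
q3     <- quantile(demo$price, 0.75, names = FALSE)
margin <- 1.5 * IQR(demo$price)

# Keep rows that fall outside [q1 - margin, q3 + margin]
outliers <- demo |>
  filter(price < q1 - margin | price > q3 + margin)

print(c(lower = q1 - margin, upper = q3 + margin))
print(outliers)
```

Printing the lower and upper cutoffs alongside the result makes it easy to check by eye why a row was or was not flagged.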
Monthly Sales Dataset
Exercise 2.6: Detect Sales Outliers
Identify months with sales that are more than 2 standard deviations away from the mean:
Solution:
# Calculate mean and standard deviation
sales_mean <- mean(monthly_sales$sales)
sales_sd <- sd(monthly_sales$sales)
# Identify outliers
monthly_sales |>
  filter(sales < (sales_mean - 2 * sales_sd) | sales > (sales_mean + 2 * sales_sd))
Output:
  flavor_id       date region sales
1         3 2021-01-01  North   100
2         2 2021-02-01  North   500
3         5 2021-03-01  North   100
4         1 2021-04-01  North   500
5         4 2021-05-01  North   100
6         3 2021-07-01  North   500
...
This filter keeps the sales records that lie more than 2 standard deviations from the mean. As a sanity check, compare the rows with the cutoffs: using the mean (about 300.1) and standard deviation (about 115.5) from Exercise 2.2, the cutoffs are roughly 69.2 and 531.0, so sales of 100 and 500 fall inside them and would not be flagged. If your output disagrees with your cutoffs like this, recompute sales_mean and sales_sd before trusting the result.
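An equivalent way to express the 2-standard-deviation rule is to compute a z-score for each row and filter on its absolute value. The sketch below is a minimal illustration on demo_sales, a hypothetical table with made-up values (the last one deliberately extreme), not the monthly_sales data.

```r
library(dplyr)

# Hypothetical sales; the last value is deliberately extreme
demo_sales <- data.frame(
  month = 1:8,
  sales = c(290, 300, 310, 295, 305, 300, 298, 900)
)

flagged <- demo_sales |>
  # z = how many standard deviations a value lies from the mean
  mutate(z = (sales - mean(sales)) / sd(sales)) |>
  filter(abs(z) > 2)

print(flagged)
```

Keeping the z column in the output shows how far each flagged record sits from the mean, which the plain filter on cutoffs does not. Note that an extreme value also inflates the mean and standard deviation it is compared against, so this rule can miss outliers in very small samples.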
Conclusion
In this comprehensive section, we’ve explored a wide range of data analysis techniques using R. We started with quick one-line summaries to extract key insights from our datasets,