Let’s start our exploration by answering these questions:
1. What are the names of the ice cream flavours?
2. How many flavours are there in the dataset?
3. What are the categories of ice cream in the dataset?
Before we answer these questions, let’s translate them into data tasks:
Translating Questions into Data Tasks
1. What are the names of the ice cream flavours?
To answer this question, we need to list the unique values in the name column of the ice_cream dataset. We can use the distinct() function to achieve this.
In other words, we can ask ourselves: “What are the unique values in the name column of the ice_cream dataset?”
2. How many flavours are there in the dataset?
To answer this question, we need to count the number of unique values in the name column of the ice_cream dataset. We can use the distinct() function to list unique values and the count() function to count the number of unique values.
In other words, we can ask ourselves: “How many unique values are there in the name column of the ice_cream dataset?”
3. What are the categories of ice cream in the dataset?
To answer this question, we need to list the unique values in the category column of the ice_cream dataset. We can use the distinct() function to achieve this.
In other words, we can ask ourselves: “What are the unique values in the category column of the ice_cream dataset?”
By translating our questions into data tasks, we can make it easier to understand what we need to do with the data.
Get Distinct Values
The distinct() function from dplyr (part of the tidyverse) returns the unique rows of a data frame. When we pass it a column name, it returns a data frame containing only the unique values of that column.
The syntax for the distinct() function is:
dataset |> distinct(column_name)
Exercise 1.1: List Ice Cream Flavors
Use the distinct() function to list the names of the ice cream flavors:
ice_cream |> distinct(name)
Output:
# A tibble: 6 × 1
  name
  <chr>
1 Vanilla
2 Chocolate
3 Strawberry
4 Mango
5 Cookie Dough
6 Blueberry
Count Unique Values
The count() function from dplyr counts the rows of a data frame, optionally within groups. Applied to the result of distinct(), it returns the number of unique values as a one-row data frame with a single column n.
The syntax for the count() function is:
dataset |> distinct(column_name) |> count()
Exercise 1.2: Count Ice Cream Flavors
Use the distinct() and count() functions to count the number of unique flavors:
ice_cream |> distinct(name) |> count()
Output:
# A tibble: 1 × 1
      n
  <int>
1     6
Count unique categories
Exercise 1.3: List Ice Cream Categories
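Use the distinct() function to list the categories of ice cream. The output here also counts the flavors in each category, which takes count() rather than distinct() alone:
Output:
# A tibble: 3 × 2
  category  count
  <chr>     <int>
1 Classic       2
2 Fruit         3
3 Specialty     1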
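The code for this exercise is not shown, so here is a minimal sketch of one way to produce a per-category count. It assumes a stand-in ice_cream tibble: only the Mango, Cookie Dough, and Blueberry rows are taken from output shown in this tutorial, while the ratings and prices of the other three flavors, and their category assignments, are placeholders chosen to match the counts above.

```r
library(dplyr)

# Stand-in for ice_cream (assumption): rows 4-6 come from the tutorial's
# printed output; rows 1-3 use placeholder ratings, prices and categories.
ice_cream <- tibble::tribble(
  ~id, ~name,          ~category,   ~rating, ~price,
  1L,  "Vanilla",      "Classic",   4.5,     2.50,
  2L,  "Chocolate",    "Classic",   4.6,     2.75,
  3L,  "Strawberry",   "Fruit",     4.2,     3.00,
  4L,  "Mango",        "Fruit",     4.7,     3.25,
  5L,  "Cookie Dough", "Specialty", 4.0,     3.50,
  6L,  "Blueberry",    "Fruit",     3.8,     3.00
)

# distinct() alone lists the categories, one row each
ice_cream |> distinct(category)

# count() groups by category and stores the number of rows in a column
# named "count" instead of the default "n"
category_counts <- ice_cream |> count(category, name = "count")
category_counts
```

Passing a column to count() is shorthand for grouping by that column and counting the rows in each group, so no separate distinct() step is needed here.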
Writing it up
Let’s write up a small report based on our findings:
Report: Ice Cream Flavors Analysis
1. Ice Cream Flavors
The ice cream flavors in the dataset are Vanilla, Chocolate, Strawberry, Mango, Cookie Dough, and Blueberry.
2. Number of Flavors
There are 6 unique ice cream flavors in the dataset.
3. Ice Cream Categories
The ice cream flavors are categorized into Classic, Fruit, and Specialty. The Classic category has 2 flavors, the Fruit category has 3 flavors, and the Specialty category has 1 flavor.
Next Steps
In the next section, we’ll continue exploring the ice_cream dataset and answer questions like “What is the highest-rated ice cream flavor?” and “What is the average price of ice cream in each category?”
What we’ll learn in this section
In this section, we’ll pull some quick insights out of our dataset as a first round of exploration. Adding these key facts to our report gives us a firmer understanding of the data.
Highest & Lowest Values
One way to get a sense of the data is to find its highest and lowest values. They give a quick overview of the data and are easy to share as initial insights.
Example: Lowest-rated Ice Cream
What is the lowest-rated ice cream?
Let’s translate this question into a data task: we need to find the lowest rating and its corresponding ice cream name.
To find the lowest rating, we first select the rating column and then apply the min() function.
ice_cream |> select(rating) |> min()
But this gives us only the value. To find the row, we can use the filter function.
ice_cream |> filter(rating == min(rating))
We can also use slice_min() to keep only the row with the minimum rating value.
ice_cream |> slice_min(rating)
# A tibble: 1 × 5
     id name      category rating price
  <int> <chr>     <chr>     <dbl> <dbl>
1     6 Blueberry Fruit       3.8     3
You can also add this information inline by saving these values to variables and then referencing them in the text.
rating_lowest <- ice_cream |> slice_min(rating)
paste0("The lowest-rated ice cream is ", rating_lowest$name, " with a rating of ", rating_lowest$rating, ".")
This will display the following output:
[1] "The lowest-rated ice cream is Blueberry with a rating of 3.8."
You can try this on your own:
Exercise 2.1: Highest Rated Ice Cream
Find the highest-rated ice cream in the dataset.
rating_highest <- ice_cream |> slice_max(rating)
rating_highest
paste0("The highest-rated ice cream is ", rating_highest$name, " with a rating of ", rating_highest$rating, ".")
Output:
# A tibble: 1 × 5
     id name  category rating price
  <int> <chr> <chr>     <dbl> <dbl>
1     4 Mango Fruit       4.7  3.25
The highest-rated ice cream is Mango with a rating of 4.7.
Exercise 2.2: Highest priced ice cream
price_highest <- ice_cream |> slice_max(price)
price_highest
paste0("The highest-priced ice cream is ", price_highest$name, " with a price of $", price_highest$price, ".")
Output:
# A tibble: 1 × 5
     id name         category  rating price
  <int> <chr>        <chr>      <dbl> <dbl>
1     5 Cookie Dough Specialty      4   3.5
The highest-priced ice cream is Cookie Dough with a price of $3.5.
Exercise 2.3: Least expensive ice cream
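This exercise follows the same pattern as the previous two, with slice_min() applied to the price column. The sketch below is one possible solution run against a stand-in ice_cream tibble. Only the Mango, Cookie Dough, and Blueberry rows come from output shown in this tutorial; the other rows are placeholders, so the flavor it reports is illustrative and not a result from the real dataset.

```r
library(dplyr)

# Stand-in for ice_cream (assumption): rows 4-6 come from the tutorial's
# printed output; rows 1-3 use placeholder ratings, prices and categories.
ice_cream <- tibble::tribble(
  ~id, ~name,          ~category,   ~rating, ~price,
  1L,  "Vanilla",      "Classic",   4.5,     2.50,
  2L,  "Chocolate",    "Classic",   4.6,     2.75,
  3L,  "Strawberry",   "Fruit",     4.2,     3.00,
  4L,  "Mango",        "Fruit",     4.7,     3.25,
  5L,  "Cookie Dough", "Specialty", 4.0,     3.50,
  6L,  "Blueberry",    "Fruit",     3.8,     3.00
)

# Keep the row (or rows, if several tie) with the lowest price
price_lowest <- ice_cream |> slice_min(price)
price_lowest

# Build the sentence for the report from the saved row
paste0("The least expensive ice cream is ", price_lowest$name,
       " with a price of $", price_lowest$price, ".")
```

Note that slice_min() keeps ties by default, so the sentence would be repeated once per flavor if several shared the lowest price.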
Average Values
What is the category with the highest average price?
The category with the highest average price is Specialty at $3.5.
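A minimal sketch of how this answer could be computed, assuming a stand-in ice_cream tibble in which only the Mango, Cookie Dough, and Blueberry rows come from output shown in this tutorial and the remaining rows are placeholders. The column name avg_price is also an illustrative choice.

```r
library(dplyr)

# Stand-in for ice_cream (assumption): rows 4-6 come from the tutorial's
# printed output; rows 1-3 use placeholder ratings, prices and categories.
ice_cream <- tibble::tribble(
  ~id, ~name,          ~category,   ~rating, ~price,
  1L,  "Vanilla",      "Classic",   4.5,     2.50,
  2L,  "Chocolate",    "Classic",   4.6,     2.75,
  3L,  "Strawberry",   "Fruit",     4.2,     3.00,
  4L,  "Mango",        "Fruit",     4.7,     3.25,
  5L,  "Cookie Dough", "Specialty", 4.0,     3.50,
  6L,  "Blueberry",    "Fruit",     3.8,     3.00
)

# Average price within each category: one row per category
avg_price_by_category <- ice_cream |>
  group_by(category) |>
  summarise(avg_price = mean(price))

# Keep the category whose average price is highest
top_category <- avg_price_by_category |> slice_max(avg_price)

paste0("The category with the highest average price is ",
       top_category$category, " at $", top_category$avg_price, ".")
```

Because Cookie Dough is the only Specialty flavor and carries the highest price, the Specialty average equals that single price.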
These one-line summaries provide quick insights into our datasets and demonstrate how to use various dplyr functions to extract meaningful information.
Statistical Summaries
Now, let’s dive deeper into statistical summaries. These are numerical measures that give a description of the columns or variables in a dataset, helping to describe the main characteristics of the data.
Ice Cream Dataset
Example: Summarize Prices
Use the summarise() function to calculate summary statistics for the price column:
This gives us a summary of the price column, showing the minimum ($2.5), maximum ($3.5), range ($1), average ($3.083), and standard deviation ($0.379) of the prices.
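The summary described above could be produced with a call along these lines. This is a sketch run on a stand-in ice_cream tibble: only the Mango, Cookie Dough, and Blueberry rows come from output shown in this tutorial, so the minimum, maximum, and range agree with the figures quoted, while the mean and standard deviation depend on the placeholder rows and will differ. The summary column names are illustrative.

```r
library(dplyr)

# Stand-in for ice_cream (assumption): rows 4-6 come from the tutorial's
# printed output; rows 1-3 use placeholder ratings, prices and categories.
ice_cream <- tibble::tribble(
  ~id, ~name,          ~category,   ~rating, ~price,
  1L,  "Vanilla",      "Classic",   4.5,     2.50,
  2L,  "Chocolate",    "Classic",   4.6,     2.75,
  3L,  "Strawberry",   "Fruit",     4.2,     3.00,
  4L,  "Mango",        "Fruit",     4.7,     3.25,
  5L,  "Cookie Dough", "Specialty", 4.0,     3.50,
  6L,  "Blueberry",    "Fruit",     3.8,     3.00
)

# Collapse the price column into a single row of summary statistics
price_summary <- ice_cream |>
  summarise(
    min_price   = min(price),
    max_price   = max(price),
    range_price = max(price) - min(price),
    avg_price   = mean(price),
    sd_price    = sd(price)
  )
price_summary
```

Swapping rating in for price gives the equivalent summary for the rating column.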
Exercise 2.4: Summarize Ratings
Use the summarise() function to calculate summary statistics for the rating column:
# A tibble: 2 × 2
  region avg_sales
  <chr>      <dbl>
1 North       301.
2 South       299.
Detecting Outliers
Outliers are data points that are significantly different from other observations. Let’s learn how to detect outliers in our datasets.
Ice Cream Dataset
Exercise 2.5: Detect Price Outliers
Identify ice cream flavors with prices that are more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile:
# A tibble: 0 × 5
# … with 5 variables: id <int>, name <chr>, category <chr>, rating <dbl>, price <dbl>
In this case, there are no price outliers in the ice cream dataset based on the IQR method.
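The IQR rule from this exercise can be sketched as follows, using base R's quantile() and IQR() inside filter(). The ice_cream tibble here is a stand-in: only the Mango, Cookie Dough, and Blueberry rows come from output shown in this tutorial, and the remaining rows are placeholders, so this illustrates the technique and does not reproduce the original data.

```r
library(dplyr)

# Stand-in for ice_cream (assumption): rows 4-6 come from the tutorial's
# printed output; rows 1-3 use placeholder ratings, prices and categories.
ice_cream <- tibble::tribble(
  ~id, ~name,          ~category,   ~rating, ~price,
  1L,  "Vanilla",      "Classic",   4.5,     2.50,
  2L,  "Chocolate",    "Classic",   4.6,     2.75,
  3L,  "Strawberry",   "Fruit",     4.2,     3.00,
  4L,  "Mango",        "Fruit",     4.7,     3.25,
  5L,  "Cookie Dough", "Specialty", 4.0,     3.50,
  6L,  "Blueberry",    "Fruit",     3.8,     3.00
)

# Quartiles and interquartile range of the price column
q1        <- quantile(ice_cream$price, 0.25)
q3        <- quantile(ice_cream$price, 0.75)
price_iqr <- IQR(ice_cream$price)

# Keep rows below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR
price_outliers <- ice_cream |>
  filter(price < q1 - 1.5 * price_iqr | price > q3 + 1.5 * price_iqr)
price_outliers
```

With these stand-in prices the fences fall at 2.25 and 3.75, so every price lies inside them and the filter returns zero rows.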
Monthly Sales Dataset
Exercise 2.6: Detect Sales Outliers
Identify months with sales that are more than 2 standard deviations away from the mean:
# Calculate mean and standard deviation
sales_mean <- mean(monthly_sales$sales)
sales_sd <- sd(monthly_sales$sales)
# Identify outliers
monthly_sales |>
  filter(sales < (sales_mean - 2 * sales_sd) | sales > (sales_mean + 2 * sales_sd))
Output:
  flavor_id date       region sales
1         3 2021-01-01 North    100
2         2 2021-02-01 North    500
3         5 2021-03-01 North    100
4         1 2021-04-01 North    500
5         4 2021-05-01 North    100
6         3 2021-07-01 North    500
...
This output shows the sales records that are considered outliers based on being more than 2 standard deviations away from the mean.
Conclusion
In this comprehensive section, we’ve explored a wide range of data analysis techniques using R. We started with quick one-line summaries to extract key insights from our datasets,