Quick Insights | The School of Data

Exploratory Data Analysis with R: Quick Insights

Basic Questions

Let’s start our exploration by answering these questions:

1. What are the names of the ice cream flavours?
2. How many flavours are there in the dataset?
3. What are the categories of ice cream in the dataset?

Before we answer these questions, let’s translate them into data tasks:

Translating Questions into Data Tasks

1. What are the names of the ice cream flavours?

To answer this question, we need to list the unique values in the name column of the ice_cream dataset. We can use the distinct() function to achieve this.

In other words, we can ask ourselves: “What are the unique values in the name column of the ice_cream dataset?”

2. How many flavours are there in the dataset?

To answer this question, we need to count the number of unique values in the name column of the ice_cream dataset. We can use the distinct() function to list unique values and the count() function to count the number of unique values.

In other words, we can ask ourselves: “How many unique values are there in the name column of the ice_cream dataset?”

3. What are the categories of ice cream in the dataset?

To answer this question, we need to list the unique values in the category column of the ice_cream dataset. We can use the distinct() function to achieve this.

In other words, we can ask ourselves: “What are the unique values in the category column of the ice_cream dataset?”

By translating our questions into data tasks, we can make it easier to understand what we need to do with the data.

Get Distinct Values

The distinct() function from dplyr (part of the tidyverse) keeps only the unique rows of a data frame. When you name one or more columns, it returns a data frame with the unique combinations of values in those columns.

The syntax for the distinct() function is:

 dataset |>
  distinct(column_name)

Exercise 1.1: List Ice Cream Flavors

Use the distinct() function to list the names of the ice cream flavors:
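One possible solution, sketched on a small stand-in data frame. The name ice_cream_demo and its values are made up for illustration and are not the course dataset; the repeated Vanilla row is there to show that distinct() drops duplicates.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset); "Vanilla" appears twice.
ice_cream_demo <- data.frame(
  name     = c("Vanilla", "Chocolate", "Vanilla", "Mango"),
  category = c("Classic", "Classic", "Classic", "Fruit")
)

# Keep one row per unique value of the name column.
flavors <- ice_cream_demo |>
  distinct(name)

flavors
```

On the real data, the same pipeline would start from ice_cream instead of ice_cream_demo.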

Count Unique Values

The count() function from dplyr counts rows. Called with no arguments, it returns a one-row data frame whose n column holds the number of rows, so running it after distinct() gives the number of unique values.

The pattern for counting unique values is:

 dataset |>
  distinct(column_name) |>
  count()

Exercise 1.2: Count Ice Cream Flavors

Use the distinct() and count() functions to count the number of unique flavors:
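A minimal sketch of one way to do this, again on a made-up stand-in data frame (ice_cream_demo is hypothetical). It also shows n_distinct(), a dplyr helper that gives the same number in a single step.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset); "Vanilla" appears twice.
ice_cream_demo <- data.frame(
  name = c("Vanilla", "Chocolate", "Vanilla", "Mango")
)

# distinct() leaves one row per flavor; count() then counts those rows.
n_flavors <- ice_cream_demo |>
  distinct(name) |>
  count()

# n_distinct() counts the unique values of a column directly.
n_flavors_alt <- ice_cream_demo |>
  summarise(n = n_distinct(name))

n_flavors$n
n_flavors_alt$n
```

Both approaches return a one-row data frame with the count in a column called n.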

Count unique categories

Exercise 1.3: List Ice Cream Categories

Use the distinct() function to list the categories of ice cream:
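One possible solution, sketched on hypothetical stand-in data (ice_cream_demo and its values are assumptions). The extra pull() step is optional: it turns the one-column result into a plain character vector, which is convenient when the categories need to be written into report text.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset).
ice_cream_demo <- data.frame(
  name     = c("Vanilla", "Chocolate", "Strawberry", "Mango"),
  category = c("Classic", "Classic", "Fruit", "Fruit")
)

# One row per unique category.
categories <- ice_cream_demo |>
  distinct(category)

# pull() extracts the column as a character vector.
category_names <- categories |>
  pull(category)

# Join the names into a single string for use in a sentence.
paste(category_names, collapse = ", ")
```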

Let’s take this a step further and count the number of flavors in each category.

Count Flavors by Category

Exercise 1.4: Count Flavors by Category

Use the group_by() and summarise() functions to count how many flavors are in each category:

  
ice_cream |>
  group_by(category) |>
  summarise(count = n())

Output:

 # A tibble: 3 × 2
  category  count
  <chr>     <int>
1 Classic       2
2 Fruit         3
3 Specialty     1
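As a side note, dplyr's count() can produce the same per-group table in one call. The sketch below compares the two approaches on a hypothetical stand-in data frame (ice_cream_demo and its values are made up); the name argument only renames the default n column so the results line up.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset).
ice_cream_demo <- data.frame(
  name     = c("Vanilla", "Strawberry", "Chocolate", "Mango", "Blueberry"),
  category = c("Classic", "Fruit", "Classic", "Fruit", "Fruit")
)

# Approach used above: group, then count the rows in each group.
by_group <- ice_cream_demo |>
  group_by(category) |>
  summarise(count = n())

# Shortcut: count() groups and counts in a single step.
by_count <- ice_cream_demo |>
  count(category, name = "count")

by_group
by_count
```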

Writing it up

Let’s write up a small report based on our findings:

Report: Ice Cream Flavors Analysis

1. Ice Cream Flavors

The ice cream flavors in the dataset are Vanilla, Chocolate, Strawberry, Mango, Cookie Dough, and Blueberry.

2. Number of Flavors

There are 6 unique ice cream flavors in the dataset.

3. Ice Cream Categories

The ice cream flavors are categorized into Classic, Fruit, and Specialty. The Classic category has 2 flavors, the Fruit category has 3 flavors, and the Specialty category has 1 flavor.

Next Steps

In the next section, we’ll continue exploring the ice_cream dataset and answer questions like “What is the highest-rated ice cream flavor?” and “What is the average price of ice cream in each category?”

What we’ll learn in this section

In this section, we’ll pull some quick insights out of our datasets: highest and lowest values, averages, statistical summaries, comparisons by group, and outliers. Adding these key figures to our report gives us a firmer understanding of the data.

Highest & Lowest Values

One way to get a sense of the data is to find its highest and lowest values. They give a quick overview of the data and are easy to share as initial insights.

Example: Lowest-rated Ice Cream

What is the lowest-rated ice cream?

Let’s translate this question into a data task: we need to find the lowest rating and its corresponding ice cream name.

To find the lowest rating, we first select the rating column and then apply the min() function.

 ice_cream |> 
  select(rating) |>
  min()

But this gives us only the value. To find the row, we can use the filter function.

 ice_cream |>
  filter(rating == min(rating))

We can also use slice_min() to keep the row with the lowest rating. By default, it returns every row that ties for the minimum.

  ice_cream |> slice_min(rating)

 # A tibble: 1 × 5
     id name         category rating price
  <int> <chr>        <chr>     <dbl> <dbl>
1     6 Blueberry    Fruit       3.8   3
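The tie behaviour of slice_min() is worth checking before writing a sentence about "the" lowest-rated flavor. The sketch below uses a made-up data frame (ratings_demo and its values are hypothetical) in which two flavors share the lowest rating.

```r
library(dplyr)

# Hypothetical data with a tie for the lowest rating (not the course dataset).
ratings_demo <- data.frame(
  name   = c("Vanilla", "Mango", "Blueberry", "Pistachio"),
  rating = c(4.2, 4.7, 3.8, 3.8)
)

# Default (with_ties = TRUE): every row that shares the minimum is kept.
tied <- ratings_demo |>
  slice_min(rating)

# with_ties = FALSE keeps exactly one row, even when there is a tie.
single <- ratings_demo |>
  slice_min(rating, with_ties = FALSE)

nrow(tied)
nrow(single)
```

If the result has more than one row, a report sentence built with paste0() will produce one sentence per row, so it helps to decide up front how ties should be handled.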

You can also add this information inline by saving these values to variables and then referencing them in the text.

 rating_lowest <- ice_cream |> slice_min(rating)
 
paste0("The lowest-rated ice cream is ", rating_lowest$name, 
       " with a rating of ", rating_lowest$rating, ".")

This will display the following output:

 [1] "The lowest-rated ice cream is Blueberry with a rating of 3.8."

You can try this on your own:

Exercise 2.1: Highest Rated Ice Cream

Find the highest-rated ice cream in the dataset.

rating_highest <- ice_cream |> slice_max(rating)

# Print the row, then build the sentence
rating_highest

paste0("The highest-rated ice cream is ", rating_highest$name, 
       " with a rating of ", rating_highest$rating, ".")

Output:

 # A tibble: 1 × 5
     id name     category rating price
  <int> <chr>    <chr>     <dbl> <dbl>
1     4 Mango    Fruit       4.7  3.25

The highest-rated ice cream is Mango with a rating of 4.7.

Exercise 2.2: Highest priced ice cream

 price_highest <- ice_cream |> slice_max(price)
 
price_highest 
 
paste0("The highest-priced ice cream is ", price_highest$name, 
       " with a price of $", price_highest$price, ".")

Output:

 # A tibble: 1 × 5
     id name         category rating price
  <int> <chr>        <chr>     <dbl> <dbl>
1     5 Cookie Dough Specialty     4    3.5
 
The highest-priced ice cream is Cookie Dough with a price of $3.5.

Exercise 2.3: Least expensive ice cream
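One way to approach this exercise mirrors Exercise 2.2, with slice_min() in place of slice_max(). The sketch below runs on a hypothetical stand-in data frame (prices_demo and its values are assumptions) and uses sprintf() to print the price with two decimal places.

```r
library(dplyr)

# Hypothetical stand-in data (not the course dataset).
prices_demo <- data.frame(
  name  = c("Vanilla", "Mango", "Cookie Dough"),
  price = c(2.5, 3.25, 3.5)
)

# Keep the row with the lowest price.
price_lowest <- prices_demo |>
  slice_min(price)

# sprintf("%.2f", ...) formats the number with two decimals, e.g. 2.5 -> "2.50".
price_sentence <- paste0(
  "The least expensive ice cream is ", price_lowest$name,
  " with a price of $", sprintf("%.2f", price_lowest$price), "."
)

price_sentence
```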

Average Values

What is the category with the highest average price?

The category with the highest average price is Specialty at $3.5.

 ice_cream |> 
  group_by(category) |> 
  summarise(avg_price = mean(price)) |> 
  slice_max(avg_price)

Monthly Sales Dataset

The best-performing region is North with average sales of 301.

 monthly_sales |> group_by(region) |> summarise(avg_sales = mean(sales)) |> slice_max(avg_sales)

The lowest-performing month is January 2021 with total sales of 1800.

 monthly_sales |> group_by(date) |> summarise(total_sales = sum(sales)) |> slice_min(total_sales)

The highest-performing month is December 2021 with total sales of 3600.

 monthly_sales |> group_by(date) |> summarise(total_sales = sum(sales)) |> slice_max(total_sales)

The flavor with the highest average sales is Flavor 6 with average sales of 310.

 monthly_sales |> group_by(flavor_id) |> summarise(avg_sales = mean(sales)) |> slice_max(avg_sales)

These one-line summaries provide quick insights into our datasets and demonstrate how to use various dplyr functions to extract meaningful information.
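To try the same pattern without the course files, here is a sketch on a tiny made-up sales table. The data frame sales_demo and all of its values are hypothetical; only the column names (flavor_id, date, region, sales) follow the monthly_sales examples above.

```r
library(dplyr)

# Hypothetical stand-in for monthly_sales (values are made up).
sales_demo <- data.frame(
  flavor_id = c(1, 2, 1, 2),
  date      = as.Date(c("2021-01-01", "2021-01-01", "2021-02-01", "2021-02-01")),
  region    = c("North", "South", "North", "South"),
  sales     = c(100, 200, 300, 500)
)

# Region with the highest average sales.
best_region <- sales_demo |>
  group_by(region) |>
  summarise(avg_sales = mean(sales)) |>
  slice_max(avg_sales)

# Month with the lowest total sales.
worst_month <- sales_demo |>
  group_by(date) |>
  summarise(total_sales = sum(sales)) |>
  slice_min(total_sales)

paste0("The best-performing region is ", best_region$region,
       " with average sales of ", best_region$avg_sales, ".")

paste0("The lowest-performing month is ", format(worst_month$date, "%Y-%m"),
       " with total sales of ", worst_month$total_sales, ".")
```

Saving each result to a variable, as in the ice cream examples, keeps the sentence in sync with the data if the numbers change.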

Statistical Summaries

Now, let’s dive deeper into statistical summaries: numerical measures such as the minimum, maximum, range, mean, and standard deviation that describe the main characteristics of a column.

Ice Cream Dataset

Example: Summarize Prices

Use the summarise() function to calculate summary statistics for the price column:

 ice_cream |>
  summarise(
    min_price = min(price),
    max_price = max(price),
    range_price = max(price) - min(price),
    avg_price = mean(price),
    sd_price = sd(price)
  )

Output:

 # A tibble: 1 × 5
  min_price max_price range_price avg_price sd_price
      <dbl>     <dbl>       <dbl>     <dbl>    <dbl>
1       2.5       3.5         1       3.083    0.379

This gives us a summary of the price column, showing the minimum ($2.5), maximum ($3.5), range ($1), average ($3.083), and standard deviation ($0.379) of the prices.
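To make the standard deviation less of a black box, the sketch below recomputes it by hand on a short made-up vector (the prices values are hypothetical). R's sd() returns the sample standard deviation, which divides the sum of squared deviations by n - 1 rather than n.

```r
# Hypothetical prices (not the course dataset).
prices <- c(2.5, 3.0, 3.5, 4.0)

n <- length(prices)
price_mean  <- mean(prices)
price_range <- max(prices) - min(prices)

# Sample standard deviation: squared deviations from the mean,
# summed, divided by (n - 1), then square-rooted.
sd_by_hand <- sqrt(sum((prices - price_mean)^2) / (n - 1))

# Compare with the built-in function.
all.equal(sd_by_hand, sd(prices))
```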

Exercise 2.1: Summarize Ratings

Use the summarise() function to calculate summary statistics for the rating column:

 ice_cream |>
  summarise(
    min_rating = min(rating),
    max_rating = max(rating),
    range_rating = max(rating) - min(rating),
    avg_rating = mean(rating),
    sd_rating = sd(rating)
  )

Output:

 # A tibble: 1 × 5
  min_rating max_rating range_rating avg_rating sd_rating
       <dbl>      <dbl>        <dbl>      <dbl>     <dbl>
1        3.8        4.7          0.9       4.18     0.316

Monthly Sales Dataset

Exercise 2.2: Monthly Sales Summary Statistics

Calculate the minimum, maximum, range, average, and standard deviation of sales:

 monthly_sales |>
  summarise(
    min_sales = min(sales),
    max_sales = max(sales),
    range_sales = max(sales) - min(sales),
    avg_sales = mean(sales),
    sd_sales = sd(sales)
  )

Output:

   min_sales max_sales range_sales avg_sales  sd_sales
1       100       500         400  300.1042  115.4701

Analysis by Group

Now let’s analyze our data by different groups.

Ice Cream Dataset

Exercise 2.3: Average Price by Category

Calculate the average price of ice cream for each category:

 ice_cream |>
  group_by(category) |>
  summarise(avg_price = mean(price))

Output:

 # A tibble: 3 × 2
  category  avg_price
  <chr>         <dbl>
1 Classic        2.75
2 Fruit          3.17
3 Specialty      3.5

Monthly Sales Dataset

Exercise 2.4: Average Sales by Region

Calculate the average sales for each region:

 monthly_sales |>
  group_by(region) |>
  summarise(avg_sales = mean(sales))

Output:

 # A tibble: 2 × 2
  region avg_sales
  <chr>      <dbl>
1 North      301. 
2 South      299.

Detecting Outliers

Outliers are data points that are significantly different from other observations. Let’s learn how to detect outliers in our datasets.

Ice Cream Dataset

Exercise 2.5: Detect Price Outliers

Identify ice cream flavors with prices that are more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile:

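# Calculate Q1, Q3, and the interquartile range.
# The result is stored as price_iqr so it does not mask R's built-in IQR() function.
Q1 <- quantile(ice_cream$price, 0.25)
Q3 <- quantile(ice_cream$price, 0.75)
price_iqr <- Q3 - Q1
 
# Identify outliers
ice_cream |>
  filter(price < (Q1 - 1.5 * price_iqr) | price > (Q3 + 1.5 * price_iqr))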

Output:

 # A tibble: 0 × 5
# … with 5 variables: id <int>, name <chr>, category <chr>, rating <dbl>, price <dbl>

In this case, there are no price outliers in the ice cream dataset based on the IQR method.
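To see the IQR rule flag something, here is a sketch on made-up prices that include one deliberately extreme value. The data frame prices_demo, the "Truffle" flavor, and all values are hypothetical.

```r
library(dplyr)

# Hypothetical prices with one extreme value (not the course dataset).
prices_demo <- data.frame(
  name  = c("Vanilla", "Chocolate", "Strawberry", "Mango", "Cookie Dough", "Truffle"),
  price = c(2.75, 3.00, 3.25, 3.25, 3.50, 9.00)
)

# Quartiles and interquartile range; unname() drops the "25%" / "75%" labels.
q1 <- unname(quantile(prices_demo$price, 0.25))
q3 <- unname(quantile(prices_demo$price, 0.75))
price_iqr <- q3 - q1  # same value as IQR(prices_demo$price)

# Fences: anything outside this interval counts as an outlier.
lower_fence <- q1 - 1.5 * price_iqr
upper_fence <- q3 + 1.5 * price_iqr

price_outliers <- prices_demo |>
  filter(price < lower_fence | price > upper_fence)

price_outliers
```

Storing the fences in named variables makes it easy to report the cutoffs alongside the flagged rows.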

Monthly Sales Dataset

Exercise 2.6: Detect Sales Outliers

Identify sales records that are more than 2 standard deviations away from the mean:

 # Calculate mean and standard deviation
sales_mean <- mean(monthly_sales$sales)
sales_sd <- sd(monthly_sales$sales)
 
# Identify outliers
monthly_sales |>
  filter(sales < (sales_mean - 2 * sales_sd) | sales > (sales_mean + 2 * sales_sd))

With the summary statistics computed earlier (mean ≈ 300.10, standard deviation ≈ 115.47), the cutoffs are about 300.10 - 2 × 115.47 ≈ 69.16 and 300.10 + 2 × 115.47 ≈ 531.04. Sales range from 100 to 500, so no record falls outside these cutoffs and the filter returns zero rows.

Records with sales of 100 or 500 are the lowest and highest values in the data, but they are not outliers under the 2-standard-deviation rule.
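The same rule can be written with an explicit z-score column, which makes it easy to see how far each record sits from the mean. The sketch below uses a made-up series (sales_demo and its values are hypothetical) with one unusually high month.

```r
library(dplyr)

# Hypothetical monthly sales with one unusually high value (not the course dataset).
sales_demo <- data.frame(
  date  = seq(as.Date("2021-01-01"), by = "month", length.out = 7),
  sales = c(100, 110, 90, 105, 95, 100, 400)
)

# z-score: number of standard deviations a value lies from the mean.
sales_flagged <- sales_demo |>
  mutate(z_score = (sales - mean(sales)) / sd(sales)) |>
  filter(abs(z_score) > 2)

sales_flagged
```

Filtering on abs(z_score) > 2 is equivalent to the two-sided comparison against mean ± 2 standard deviations used above.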

Conclusion

In this section, we explored several ways to get quick insights from data in R. We started with quick one-line summaries to extract key insights from our datasets, then moved on to statistical summaries, analysis by group, and outlier detection.
