Exploratory Data Analysis with R

Quick Insights

Learn how to look at data to generate initial insights

Basic Questions

Let’s start our exploration by answering these questions:

  1. What are the names of the ice cream flavours?
  2. How many flavours are there in the dataset?
  3. What are the categories of ice cream in the dataset?

Before we answer these questions, let’s translate them into data tasks:

Translating Questions into Data Tasks

1. What are the names of the ice cream flavours?

To answer this question, we need to list the unique values in the name column of the ice_cream dataset. We can use the distinct() function to achieve this.

In other words, we can ask ourselves: “What are the unique values in the name column of the ice_cream dataset?”

2. How many flavours are there in the dataset?

To answer this question, we need to count the number of unique values in the name column of the ice_cream dataset. We can use the distinct() function to list the unique values and then the count() function to count the resulting rows.

In other words, we can ask ourselves: “How many unique values are there in the name column of the ice_cream dataset?”

3. What are the categories of ice cream in the dataset?

To answer this question, we need to list the unique values in the category column of the ice_cream dataset. We can use the distinct() function to achieve this.

In other words, we can ask ourselves: “What are the unique values in the category column of the ice_cream dataset?”

By translating our questions into data tasks, we can make it easier to understand what we need to do with the data.

Get Distinct Values

The distinct() function from dplyr (part of the tidyverse) returns the unique rows of a data frame. When you pass one or more column names, it returns a data frame containing only the unique combinations of those columns.

The syntax for the distinct() function is:

 dataset |>
  distinct(column_name) 
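
To make the syntax concrete, here is a minimal sketch on a small made-up data frame. The `desserts` data and its values are hypothetical and are not the course's ice_cream dataset. It also shows the `.keep_all = TRUE` argument, which keeps the other columns of the first row found for each unique value.

```r
library(dplyr)

# Hypothetical toy data, not the course's ice_cream dataset
desserts <- data.frame(
  name     = c("Lemon Sorbet", "Lemon Sorbet", "Pistachio", "Hazelnut"),
  category = c("Fruit", "Fruit", "Nut", "Nut")
)

# One row per unique name; all other columns are dropped
unique_names <- desserts |>
  distinct(name)

# One row per unique name, keeping the other columns of the first match
first_rows <- desserts |>
  distinct(name, .keep_all = TRUE)

print(unique_names)
print(first_rows)
```
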
Exercise 1.1: List Ice Cream Flavors

Use the distinct() function to list the names of the ice cream flavors:

Count Unique Values

The count() function from dplyr counts the rows of a data frame (or the rows in each group, if you pass column names). Applied after distinct(), it returns a one-row data frame whose column n holds the number of unique values.

The syntax for the count() function is:

 dataset |>
  distinct(column_name) |>
  count() 
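
As a rough illustration, the sketch below counts unique names in a made-up data frame (`desserts` is hypothetical, not the course's ice_cream dataset). It also shows `n_distinct()` inside `summarise()`, which is another dplyr way to reach the same number and may read more directly.

```r
library(dplyr)

# Hypothetical toy data, not the course's ice_cream dataset
desserts <- data.frame(
  name = c("Lemon Sorbet", "Lemon Sorbet", "Pistachio", "Hazelnut")
)

# distinct() keeps one row per name, count() then counts those rows
n_via_count <- desserts |>
  distinct(name) |>
  count()

# Alternative: count the unique values directly
n_via_summarise <- desserts |>
  summarise(n_names = n_distinct(name))

print(n_via_count)
print(n_via_summarise)
```
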
Exercise 1.2: Count Ice Cream Flavors

Use the distinct() and count() functions to count the number of unique flavors:

List Unique Categories

Exercise 1.3: List Ice Cream Categories

Use the distinct() function to list the categories of ice cream:

Let’s take this a step further and count the number of flavors in each category.

Count Flavors by Category

Exercise 1.4: Count Flavors by Category

Use the group_by() and summarise() functions to count how many flavors are in each category:
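
The general pattern can be sketched on made-up data before you try it on ice_cream. The `desserts` data frame and the column name `n_flavors` are hypothetical. Note that `n()` counts rows, so this assumes each flavor appears on exactly one row.

```r
library(dplyr)

# Hypothetical toy data, not the course's ice_cream dataset
desserts <- data.frame(
  name     = c("Lemon Sorbet", "Raspberry Sorbet", "Pistachio", "Hazelnut", "Almond"),
  category = c("Fruit", "Fruit", "Nut", "Nut", "Nut")
)

# group_by() + summarise(): n() counts the rows in each group
by_category <- desserts |>
  group_by(category) |>
  summarise(n_flavors = n())

# count() is a shorthand for the same grouped row count (result column is n)
by_category_short <- desserts |>
  count(category)

print(by_category)
```

If a flavor could appear on several rows, `n_distinct(name)` in place of `n()` would count unique flavors rather than rows.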

Writing it up

Let’s write up a small report based on our findings:

Report: Ice Cream Flavors Analysis

1. Ice Cream Flavors

The ice cream flavors in the dataset are Vanilla, Chocolate, Strawberry, Mango, Cookie Dough, and Blueberry.

2. Number of Flavors

There are 6 unique ice cream flavors in the dataset.

3. Ice Cream Categories

The ice cream flavors are categorized into Classic, Fruit, and Specialty. The Classic category has 2 flavors, the Fruit category has 3 flavors, and the Specialty category has 1 flavor.

Next Steps

In the next section, we’ll continue exploring the ice_cream dataset and answer questions like “What is the highest-rated ice cream flavor?” and “What is the average price of ice cream in each category?”

What we’ll learn in this section

In this section, we’ll pull some quick insights from our dataset as a first, basic exploration. Adding this key information to our report gives us a more confident understanding of our data.

Highest & Lowest Values

One way to get a sense of the data is to find its highest and lowest values. They give us an overview of the data and prepare us to share quick initial insights.

Example: Lowest-rated Ice Cream

    What is the lowest-rated ice cream?

    Let’s translate this question into a technical question. First, we need to find the lowest rating and its corresponding ice cream name.

    To find the lowest rating, we first select the rating column and then apply the min() function.

     ice_cream |> 
      select(rating) |>
      min() 

    But this gives us only the value. To find the row, we can use the filter function.

     ice_cream |>
      filter(rating == min(rating)) 

    We can also use slice_min() to select the row with the minimum rating value.

      ice_cream |> slice_min(rating) 
     # A tibble: 1 × 5
         id name         category rating price
      <int> <chr>        <chr>     <dbl> <dbl>
    1     6 Blueberry    Fruit       3.8   3 
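
One detail worth knowing: by default, `slice_min()` and `slice_max()` keep every row tied for the extreme value, so the result can have more than one row. A small sketch on made-up ratings (`ratings_demo` is hypothetical, not the course's data) shows the difference when `with_ties = FALSE` is set.

```r
library(dplyr)

# Hypothetical toy data with two flavors tied for the lowest rating
ratings_demo <- data.frame(
  name   = c("Lemon Sorbet", "Pistachio", "Hazelnut"),
  rating = c(3.9, 3.9, 4.6)
)

# Default behaviour: all rows tied for the minimum are returned
tied <- ratings_demo |>
  slice_min(rating)

# Ask for exactly one row, ignoring ties
single <- ratings_demo |>
  slice_min(rating, n = 1, with_ties = FALSE)

print(tied)
print(single)
```
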

    You can also add this information inline by saving these values to variables and then referencing them in the text.

     rating_lowest <- ice_cream |> slice_min(rating)

     paste0("The lowest-rated ice cream is ", rating_lowest$name,
            " with a rating of ", rating_lowest$rating, ".")

    This will display the following output:

     [1] "The lowest-rated ice cream is Blueberry with a rating of 3.8." 

    You can try this on your own:

Exercise 2.1: Highest Rated Ice Cream

Find the highest-rated ice cream in the dataset.

Exercise 2.2: Highest-Priced Ice Cream

Exercise 2.3: Least Expensive Ice Cream

Average Values

What is the category with the highest average price?

The category with the highest average price is Specialty at $3.5.

 ice_cream |> 
  group_by(category) |> 
  summarise(avg_price = mean(price)) |> 
  slice_max(avg_price) 
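
If you want to see how all the groups compare rather than only the top one, a possible variation is to sort the summary with `arrange(desc(...))`. The sketch below uses a made-up `prices_demo` data frame (hypothetical values, not the course's ice_cream dataset).

```r
library(dplyr)

# Hypothetical toy data, not the course's ice_cream dataset
prices_demo <- data.frame(
  category = c("Fruit", "Fruit", "Nut", "Nut"),
  price    = c(2.5, 3.0, 3.5, 4.0)
)

# Average price per category, sorted from highest to lowest
ranking <- prices_demo |>
  group_by(category) |>
  summarise(avg_price = mean(price)) |>
  arrange(desc(avg_price))

print(ranking)
```
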

Monthly Sales Dataset

  1. The best-performing region is North with average sales of 301.

      monthly_sales |>
        group_by(region) |>
        summarise(avg_sales = mean(sales)) |>
        slice_max(avg_sales)

  2. The lowest-performing month is January 2021 with total sales of 1800.

      monthly_sales |>
        group_by(date) |>
        summarise(total_sales = sum(sales)) |>
        slice_min(total_sales)

  3. The highest-performing month is December 2021 with total sales of 3600.

      monthly_sales |>
        group_by(date) |>
        summarise(total_sales = sum(sales)) |>
        slice_max(total_sales)

  4. The flavor with the highest average sales is Flavor 6 with average sales of 310.

      monthly_sales |>
        group_by(flavor_id) |>
        summarise(avg_sales = mean(sales)) |>
        slice_max(avg_sales)

These one-line summaries provide quick insights into our datasets and demonstrate how to use various dplyr functions to extract meaningful information.

Statistical Summaries

Now, let’s dive deeper into statistical summaries. These are numerical measures, such as the minimum, maximum, mean, and standard deviation, that describe the main characteristics of a column (variable) in a dataset.

Ice Cream Dataset

Example: Summarize Prices

    Use the summarise() function to calculate summary statistics for the price column:

     ice_cream |>
      summarise(
        min_price = min(price),
        max_price = max(price),
        range_price = max(price) - min(price),
        avg_price = mean(price),
        sd_price = sd(price)
      ) 

    Output:

     # A tibble: 1 × 5
      min_price max_price range_price avg_price sd_price
          <dbl>     <dbl>       <dbl>     <dbl>    <dbl>
    1       2.5       3.5         1       3.083    0.379 

    This gives us a summary of the price column, showing the minimum ($2.5), maximum ($3.5), range ($1), average ($3.083), and standard deviation ($0.379) of the prices.
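
    To see what the standard deviation figure represents, the sketch below recomputes it by hand on a made-up vector (the `prices` values are hypothetical). R's sd() returns the sample standard deviation, which divides by n - 1 rather than n.

```r
# Hypothetical prices, chosen so the arithmetic is easy to follow
prices <- c(2, 4, 4, 4, 5, 5, 7, 9)

n <- length(prices)

# Sample standard deviation: square root of the summed squared
# deviations from the mean, divided by (n - 1)
sd_by_hand <- sqrt(sum((prices - mean(prices))^2) / (n - 1))

# Built-in equivalent
sd_builtin <- sd(prices)

print(c(by_hand = sd_by_hand, builtin = sd_builtin))
```
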

Exercise 2.1: Summarize Ratings

Use the summarise() function to calculate summary statistics for the rating column:

Monthly Sales Dataset

Exercise 2.2: Monthly Sales Summary Statistics

Calculate the minimum, maximum, range, average, and standard deviation of sales:

Analysis by Group

Now let’s analyze our data by different groups.

Ice Cream Dataset

Exercise 2.3: Average Price by Category

Calculate the average price of ice cream for each category:

Monthly Sales Dataset

Exercise 2.4: Average Sales by Region

Calculate the average sales for each region:

Detecting Outliers

Outliers are data points that are significantly different from other observations. Let’s learn how to detect outliers in our datasets.

Ice Cream Dataset

Exercise 2.5: Detect Price Outliers

Identify ice cream flavors with prices that are more than 1.5 times the interquartile range (IQR) above the third quartile or below the first quartile:
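
The 1.5 × IQR rule can be sketched as follows on made-up data. The `prices_demo` data frame and the helper column names (`q1`, `q3`, `iqr`, `lower`, `upper`) are hypothetical choices for illustration. The quartiles come from base R's quantile() with its default method.

```r
library(dplyr)

# Hypothetical toy data with one unusually expensive item
prices_demo <- data.frame(
  name  = c("A", "B", "C", "D", "E", "F", "G"),
  price = c(2.5, 2.75, 3, 3, 3.25, 3.5, 6)
)

price_outliers <- prices_demo |>
  mutate(
    q1    = quantile(price, 0.25),   # first quartile
    q3    = quantile(price, 0.75),   # third quartile
    iqr   = q3 - q1,                 # interquartile range
    lower = q1 - 1.5 * iqr,          # lower fence
    upper = q3 + 1.5 * iqr           # upper fence
  ) |>
  filter(price < lower | price > upper)

print(price_outliers)
```

Keeping the fences as columns makes it easy to check why a row was flagged.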

Monthly Sales Dataset

Exercise 2.6: Detect Sales Outliers

Identify months with sales that are more than 2 standard deviations away from the mean:
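
A minimal sketch of the 2-standard-deviation rule, on a made-up `sales_demo` data frame with one row per month (hypothetical values and column names, not the course's monthly_sales dataset): compute each value's distance from the mean in units of standard deviation, then keep the rows where that distance exceeds 2.

```r
library(dplyr)

# Hypothetical toy data: one sales total per month, with a spike in December
sales_demo <- data.frame(
  month = month.abb,
  sales = c(200, 210, 190, 205, 195, 200, 210, 190, 205, 195, 200, 400)
)

sales_outliers <- sales_demo |>
  mutate(
    # z-score: how many standard deviations a value is from the mean
    z_score = (sales - mean(sales)) / sd(sales)
  ) |>
  filter(abs(z_score) > 2)

print(sales_outliers)
```

With real data that has several rows per month, you would first total the sales per month with group_by() and summarise() before applying this rule.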

Conclusion

In this comprehensive section, we’ve explored a wide range of data analysis techniques using R. We started with quick one-line summaries to extract key insights from our datasets,