Data Wrangling in R with dplyr and tidyr

Working with Columns

Learn how to manipulate columns in R using dplyr functions

Column Operations

We often need to work with columns in a data frame to analyze and manipulate data. In this section, we’ll learn how to efficiently manipulate columns in R using dplyr functions.

The dplyr package provides a set of functions that make it easy to work with columns in a data frame. We can select specific columns, rename columns, create new columns, and perform operations on multiple columns using dplyr functions.

What we’ll learn in this section

Goals: In this section, we'll learn how to:
    • Get a quick overview of your data frame
    • Select and rename columns using select() and rename()
    • Create and modify columns using mutate()
    • Perform operations on multiple columns using across()
    • Relocate columns within a data frame

Getting a Glimpse of Your Data

The glimpse() function provides a compact view of your data frame: the number of rows and columns, then one line per column showing its name, its type, and its first few values.

Syntax
     glimpse(dataset) 
Example

    Get a glimpse of the students data frame:

     glimpse(students) 

    Output:

     Rows: 4
    Columns: 5
    $ id      <int> 1, 2, 3, 4
    $ name    <chr> "Alia", "Bala", "Cara", "Dana"
    $ section <chr> "A", "B", "A", "B"
    $ study   <dbl> 2, 8, NA, 4
    $ play    <dbl> 5, 5, 10, 10 
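The section never shows how students is created, so the following is a minimal sketch that rebuilds it with tibble(), copying the values from the glimpse() output above. The construction itself is an assumption; the course may load the data from a file instead.

```r
library(dplyr)

# Recreate the students data frame used throughout this section.
# The values are copied from the glimpse() output; how the course
# actually loads the data is not shown, so this is an assumption.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

glimpse(students)
```

Running this first lets you reproduce every example that follows.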

Selecting Columns

The select() function is used to choose specific columns from a data frame.

Syntax
     dataset |> 
      select(column1, column2, ...) 

    You can also use helper functions like starts_with() , ends_with() , contains() , etc.

Example

    The students data frame contains the following values:

    id  name  section  study  play
    1   Alia  A        2      5
    2   Bala  B        8      5
    3   Cara  A        NA     10
    4   Dana  B        4      10

    To select only the name and section columns, we can use:

     students |> 
      select(name, section) 

    Output:

     # A tibble: 4 × 2
      name  section
      <chr> <chr>  
    1 Alia  A      
    2 Bala  B      
    3 Cara  A      
    4 Dana  B       
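The helper functions mentioned in the syntax note select columns by matching their names. As a rough sketch, assuming the students data frame is rebuilt as below, they could be used like this:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Columns whose names start with "s": section and study
starts_s <- students |> select(starts_with("s"))
names(starts_s)

# Columns whose names contain the letter "a": name and play
has_a <- students |> select(contains("a"))
names(has_a)

# Columns whose names end with "e": name
ends_e <- students |> select(ends_with("e"))
names(ends_e)
```

Helpers match on column names only; they never look at the values stored in the columns.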
Exercise 1.1

Select the study and play columns from the students data frame.

Renaming Columns

The rename() function is used to change column names.

Syntax
     dataset |> 
      rename(new_name = old_name) 
Example

    Rename ‘study’ to ‘study_hours’:

     students |> 
      rename(study_hours = study) 

    Output:

     # A tibble: 4 × 5
         id name  section study_hours  play
      <int> <chr> <chr>        <dbl> <dbl>
    1     1 Alia  A                2     5
    2     2 Bala  B                8     5
    3     3 Cara  A               NA    10
    4     4 Dana  B                4    10 
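select() also accepts the new_name = old_name form, which can be handy when you want to rename and drop columns in one step. The sketch below contrasts the two, assuming the students data frame is rebuilt as shown:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# rename() keeps every column and changes only the name.
renamed <- students |>
  rename(study_hours = study)
names(renamed)

# select() with new_name = old_name renames the column
# and keeps only the columns that are listed.
selected <- students |>
  select(name, study_hours = study)
names(selected)
```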
Exercise 1.2

Rename ‘play’ to ‘leisure_time’ and ‘section’ to ‘class’.

Creating and Modifying Columns

The mutate() function is used to create new columns or modify existing ones.

Syntax
     dataset |> 
      mutate(new_column = expression) 
Example

    Add a new column ‘city’ as “Oxford” for all students:

     students |> 
      mutate(city = "Oxford") 

    Output:

     # A tibble: 4 × 6
         id name  section study  play city  
      <int> <chr> <chr>   <dbl> <dbl> <chr> 
    1     1 Alia  A           2     5 Oxford
    2     2 Bala  B           8     5 Oxford
    3     3 Cara  A          NA    10 Oxford
    4     4 Dana  B           4    10 Oxford 
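The example above adds a constant column. mutate() can also compute from existing columns, and it overwrites a column when the name on the left-hand side already exists. A minimal sketch, where total_hours is a hypothetical column name chosen for illustration:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

result <- students |>
  mutate(
    total_hours = study + play,    # new column; NA wherever study is NA
    section     = tolower(section) # same name, so section is overwritten
  )

result
```

Note that Cara's total_hours is NA, because any arithmetic involving NA returns NA.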

We can also perform operations on existing columns to create new ones. For example, we can use case_when() to create a new column based on conditions.

Syntax
     dataset |> 
      mutate(new_column = case_when(
        condition1 ~ value1,
        condition2 ~ value2,
        TRUE ~ default_value
      )) 
Example

    Create a new column ‘study_category’ that categorizes students based on their study hours:

     students |> 
      mutate(study_category = case_when(
        study < 3 ~ "Low",
        study <= 6 ~ "Medium",
        study > 6 ~ "High",
        TRUE ~ "Unknown"
      )) 

    Output:

     # A tibble: 4 × 6
         id name  section study  play study_category
      <int> <chr> <chr>   <dbl> <dbl> <chr>         
    1     1 Alia  A           2     5 Low           
    2     2 Bala  B           8     5 High          
    3     3 Cara  A          NA    10 Unknown       
    4     4 Dana  B           4    10 Medium         
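case_when() checks the conditions in order and uses the first one that is TRUE. A comparison with NA returns NA rather than TRUE, which is why Cara fell through to the TRUE ~ "Unknown" line above. One possible alternative, sketched here, handles missing values explicitly and uses the .default argument (available in dplyr 1.1.0 and later) for the fallback:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

categorised <- students |>
  mutate(study_category = case_when(
    is.na(study) ~ "Unknown", # handle missing values first
    study < 3    ~ "Low",
    study <= 6   ~ "Medium",  # only reached when study >= 3
    .default     = "High"     # everything else; needs dplyr >= 1.1.0
  ))

categorised
```

Because earlier conditions take priority, the second and third lines do not need to repeat the lower bounds.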
Exercise 1.3

Create a new column ‘play_category’ that categorizes students based on their play hours: “Low” if play < 3, “Medium” if play is between 3 and 6, and “High” if play > 6.

Column-wise Operations with across()

The across() function allows us to apply the same function(s) to multiple columns.

Syntax
     dataset |> 
      mutate(across(columns, function)) 

    To select several columns by name, we wrap them in c() .

     dataset |> 
      mutate(across(c(column1, column2), function)) 

    We can also wrap columns with where() to apply the function to columns that meet a specific condition.

     dataset |> 
      mutate(across(where(is.numeric), function)) 

    To write a custom function inline, we start it with a tilde ( ~ ), which turns the expression that follows into a function. On many keyboard layouts, the ~ key is in the top left corner, below the Escape key.

     dataset |> 
      mutate(across(where(is.numeric), ~ expression)) 

    Later, we’ll show you an example of how it works.

Example

    Convert all character columns to uppercase:

     students |> 
      mutate(across(c(name, section), toupper)) 

    Output:

     # A tibble: 4 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 ALIA  A           2     5
    2     2 BALA  B           8     5
    3     3 CARA  A          NA    10
    4     4 DANA  B           4    10 
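The example lists the character columns by name. Since the description says "all character columns", the same result can be reached by selecting on type with where(is.character). A small sketch comparing the two, assuming the students data frame is rebuilt as below:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Select the columns by name.
by_name <- students |>
  mutate(across(c(name, section), toupper))

# Select the columns by type instead.
by_type <- students |>
  mutate(across(where(is.character), toupper))

# Compare the two results.
all.equal(by_name, by_type)
```

Selecting by type keeps working if more character columns are added to the data later.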
Example

    Convert all numeric columns except id from hours to minutes by multiplying by 60. Because id is numeric too, we first drop it with select(-id) so that it is not multiplied:

     students |> 
      select(-id) |> 
      mutate(across(where(is.numeric), ~. * 60)) 

    Notice the syntax ~. * 60 . The ~ starts an inline function, and the . stands for the column currently being processed, that is, each numeric column selected by where(is.numeric) in turn.

    In other words: “multiply each selected column . by 60.”

    Output:

     # A tibble: 4 × 4
      name  section study  play
      <chr> <chr>   <dbl> <dbl>
    1 Alia  A         120   300
    2 Bala  B         480   300
    3 Cara  A          NA   600
    4 Dana  B         240   600 
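The formula shorthand is one of several ways to pass a function to across(). The sketch below shows what should be equivalent spellings (.x is an alias for . inside the formula, and \(x) is R's anonymous function syntax from R 4.1), plus the .names argument for keeping the original columns. The name to_minutes and the "_minutes" suffix are hypothetical choices for illustration.

```r
library(dplyr)

# Assumed reconstruction of the students data frame, without id.
no_id <- tibble(
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Formula shorthand, using .x instead of .
formula_style <- no_id |>
  mutate(across(where(is.numeric), ~ .x * 60))

# Anonymous function (R >= 4.1)
lambda_style <- no_id |>
  mutate(across(where(is.numeric), \(x) x * 60))

# Named helper function (hypothetical name)
to_minutes <- function(x) x * 60
named_style <- no_id |>
  mutate(across(where(is.numeric), to_minutes))

# Keep the original columns and add new ones with a suffix.
with_suffix <- no_id |>
  mutate(across(where(is.numeric), \(x) x * 60, .names = "{.col}_minutes"))
names(with_suffix)
```

Without .names, across() overwrites the selected columns; with it, the results are added as new columns.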
Example

    The tidyr package provides replace_na() , which replaces missing values with a value we specify. We can use it with across() to replace NA values with 0 in all numeric columns.

     students |> 
      mutate(across(where(is.numeric), ~replace_na(., 0))) 

    Output:

     # A tibble: 4 × 5
         id name  section study  play
      <int> <chr> <chr>   <dbl> <dbl>
    1     1 Alia  A           2     5
    2     2 Bala  B           8     5
    3     3 Cara  A           0    10
    4     4 Dana  B           4    10 
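When only one column needs filling, across() is not required. As a sketch, replace_na() can be applied to a single column inside mutate(), or to the whole data frame with a named list of replacements:

```r
library(dplyr)
library(tidyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Vector form: fill one column inside mutate().
filled_col <- students |>
  mutate(study = replace_na(study, 0))

# Data frame form: a named list maps each column to its replacement.
filled_df <- students |>
  replace_na(list(study = 0))

filled_df
```

The across() version is more convenient when many columns share the same replacement value.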
Exercise 1.4

Use across() to replace NA values with 5 in all numeric columns.

Relocating Columns

The relocate() function is used to change the position of columns in a data frame.

Syntax

    By default, relocate() moves the specified column to the first position.

     dataset |> 
      relocate(column_to_move) 

    To place a column immediately before or after another column, use the .before or .after argument:

     dataset |> 
      relocate(column_to_move, .before = column_name) 

    or

     dataset |> 
      relocate(column_to_move, .after = column_name) 
Example

    Move ‘section’ before ‘name’:

     students |> 
      relocate(section, .before = name) 

    Output:

     # A tibble: 4 × 5
         id section name  study  play
      <int> <chr>   <chr> <dbl> <dbl>
    1     1 A       Alia      2     5
    2     2 B       Bala      8     5
    3     3 A       Cara     NA    10
    4     4 B       Dana      4    10 
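relocate() can move several columns at once, and the .before and .after arguments accept selection helpers such as last_col(). A brief sketch, assuming the students data frame is rebuilt as below:

```r
library(dplyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Move two columns together so they sit right after id.
hours_first <- students |>
  relocate(study, play, .after = id)
names(hours_first)
# "id" "study" "play" "name" "section"

# Move id to the end of the data frame.
id_last <- students |>
  relocate(id, .after = last_col())
names(id_last)
# "name" "section" "study" "play" "id"
```

relocate() changes only the column order; no columns are added or removed.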
Exercise 1.5

Relocate ‘play’ to be the first column in the data frame.

Review

We’ve covered the essential dplyr functions for working with columns in R. Let’s review and summarize what we’ve learned.

Summary: In this section, we learned how to:
    • Get a quick overview of your data frame using glimpse()
    • Select and rename columns using select() and rename()
    • Create and modify columns using mutate()
    • Perform operations on multiple columns using across()
    • Relocate columns within a data frame using relocate()

These functions are essential for data manipulation and analysis in R using the dplyr package, allowing you to efficiently work with columns in your datasets.
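Because each of these functions takes a data frame and returns a data frame, they can be chained in a single pipeline. The following sketch combines them; the column names study_hours, play_hours, and total_hours are hypothetical, chosen only for this recap.

```r
library(dplyr)
library(tidyr)

# Assumed reconstruction of the students data frame from this section.
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

result <- students |>
  select(-id) |>                                             # drop id
  rename(study_hours = study, play_hours = play) |>          # clearer names
  mutate(across(where(is.numeric), \(x) replace_na(x, 0))) |> # fill NAs
  mutate(total_hours = study_hours + play_hours) |>          # derived column
  relocate(total_hours, .after = section)                    # reorder

glimpse(result)
```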

In the next section, we’ll learn how to work with groups.