We often need to work with columns in a data frame to analyze and manipulate data. In this section, we’ll learn how to efficiently manipulate columns in R using dplyr functions.
The dplyr package provides a set of functions that make it easy to work with columns in a data frame. We can select specific columns, rename columns, create new columns, and perform operations on multiple columns using dplyr functions.
What we’ll learn in this section
Goals
In this section, we'll learn how to:
Get a quick overview of your data frame
Select and rename columns using select() and rename()
Create and modify columns using mutate()
Perform operations on multiple columns using across()
Relocate columns within a data frame
Getting a Glimpse of Your Data
The glimpse() function provides a compact view of your data frame, showing the number of rows and columns, each column's type, and its first few values.
Syntax
glimpse(dataset)
Example
Get a glimpse of the students data frame:
glimpse(students)
Output:
Rows: 4
Columns: 5
$ id      <int> 1, 2, 3, 4
$ name    <chr> "Alia", "Bala", "Cara", "Dana"
$ section <chr> "A", "B", "A", "B"
$ study   <dbl> 2, 8, NA, 4
$ play    <dbl> 5, 5, 10, 10
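The examples in this section all operate on a students data frame. As a minimal sketch, assuming dplyr is installed, it can be re-created from the values shown in the glimpse() output above (the construction below is a reconstruction, not necessarily how the original data was loaded):

```r
library(dplyr)

# Reconstruction of the students data from the glimpse() output (an assumption)
students <- tibble(
  id      = 1:4,                               # integer column
  name    = c("Alia", "Bala", "Cara", "Dana"), # character column
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),                    # double column with one missing value
  play    = c(5, 5, 10, 10)
)

# Print the compact overview: dimensions, column types, first values
glimpse(students)
```

Running this first makes the remaining examples reproducible in a fresh R session.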
Selecting Columns
The select() function is used to choose specific columns from a data frame.
Syntax
dataset |> select(column1, column2, ...)
You can also use helper functions such as starts_with(), ends_with(), and contains(), among others, to select columns by matching their names.
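A short sketch of how these helpers behave, using the students data re-created from the glimpse() output (the result variable names are illustrative only):

```r
library(dplyr)

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Columns whose names start with "s": section and study
by_prefix <- students |> select(starts_with("s"))

# Columns whose names end with "y": study and play
by_suffix <- students |> select(ends_with("y"))

# Columns whose names contain "a": name and play
by_match <- students |> select(contains("a"))

names(by_prefix)
names(by_suffix)
names(by_match)
```

Each helper matches on column names, not on the values stored in the columns.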
Example
The students data frame looks like this:

id  name  section  study  play
1   Alia  A        2      5
2   Bala  B        8      5
3   Cara  A        NA     10
4   Dana  B        4      10

To select only the name and section columns, we can use:
students |> select(name, section)
Output:
# A tibble: 4 × 2
  name  section
  <chr> <chr>
1 Alia  A
2 Bala  B
3 Cara  A
4 Dana  B
Exercise 1.1
Select the study and play columns from the students data frame.
students |> select(study, play)
Output:
# A tibble: 4 × 2
  study  play
  <dbl> <dbl>
1     2     5
2     8     5
3    NA    10
4     4    10
Renaming Columns
The rename() function is used to change column names.
Syntax
dataset |> rename(new_name = old_name)
Example
Rename ‘study’ to ‘study_hours’:
students |> rename(study_hours = study)
Output:
# A tibble: 4 × 5
     id name  section study_hours  play
  <int> <chr> <chr>         <dbl> <dbl>
1     1 Alia  A                 2     5
2     2 Bala  B                 8     5
3     3 Cara  A                NA    10
4     4 Dana  B                 4    10
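rename() keeps every column. When only a few columns are needed, select() accepts the same new_name = old_name form and renames while it selects. A small sketch, with the students data re-created from the glimpse() output and an illustrative result name:

```r
library(dplyr)

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Keep only two columns and rename both in the same step
renamed_subset <- students |>
  select(student = name, study_hours = study)

names(renamed_subset)
```

Unlike rename(), this drops every column that is not listed.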
Exercise 1.2
Rename ‘play’ to ‘leisure_time’ and ‘section’ to ‘class’.
students |> rename(leisure_time = play, class = section)
Output:
# A tibble: 4 × 5
     id name  class study leisure_time
  <int> <chr> <chr> <dbl>        <dbl>
1     1 Alia  A         2            5
2     2 Bala  B         8            5
3     3 Cara  A        NA           10
4     4 Dana  B         4           10
Creating and Modifying Columns
The mutate() function is used to create new columns or modify existing ones.
Syntax
dataset |> mutate(new_column = expression)
Example
Add a new column ‘city’ as “Oxford” for all students:
students |> mutate(city = "Oxford")
Output:
# A tibble: 4 × 6
     id name  section study  play city
  <int> <chr> <chr>   <dbl> <dbl> <chr>
1     1 Alia  A           2     5 Oxford
2     2 Bala  B           8     5 Oxford
3     3 Cara  A          NA    10 Oxford
4     4 Dana  B           4    10 Oxford
We can also perform operations on existing columns to create new ones. For example, we can use case_when() to create a new column based on conditions.
Create a new column ‘study_category’ that categorizes students based on their study hours:
students |>
  mutate(study_category = case_when(
    study < 3 ~ "Low",
    study <= 6 ~ "Medium",
    study > 6 ~ "High",
    TRUE ~ "Unknown"
  ))
Output:
# A tibble: 4 × 6
     id name  section study  play study_category
  <int> <chr> <chr>   <dbl> <dbl> <chr>
1     1 Alia  A           2     5 Low
2     2 Bala  B           8     5 High
3     3 Cara  A          NA    10 Unknown
4     4 Dana  B           4    10 Medium
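case_when() checks its conditions in order and uses the first one that is TRUE, so study <= 6 only applies to rows that were not already matched by study < 3. A comparison with NA is neither TRUE nor FALSE, which is why Cara falls through to the fallback. The sketch below shows the same logic with the .default argument, which assumes dplyr 1.1.0 or later; the data is re-created from the glimpse() output and the result name is illustrative:

```r
library(dplyr)

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

categorised <- students |>
  mutate(study_category = case_when(
    study < 3  ~ "Low",     # checked first
    study <= 6 ~ "Medium",  # only reached when study >= 3
    study > 6  ~ "High",
    .default = "Unknown"    # used when no condition is TRUE, e.g. study is NA
  ))

categorised$study_category
```

On older dplyr versions, the TRUE ~ "Unknown" form used above serves the same purpose.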
Exercise 1.3
Create a new column ‘play_category’ that categorizes students based on their play hours:
“Low” if play < 3, “Medium” if play is between 3 and 6, and “High” if play > 6.
students |>
  mutate(play_category = case_when(
    play < 3 ~ "Low",
    play <= 6 ~ "Medium",
    play > 6 ~ "High",
    TRUE ~ "Unknown"
  ))
Output:
# A tibble: 4 × 6
     id name  section study  play play_category
  <int> <chr> <chr>   <dbl> <dbl> <chr>
1     1 Alia  A           2     5 Medium
2     2 Bala  B           8     5 Medium
3     3 Cara  A          NA    10 High
4     4 Dana  B           4    10 High
Column-wise Operations with across()
The across() function allows us to apply the same function(s) to multiple columns.
Syntax
dataset |> mutate(across(columns, function))
To select several columns by name, wrap them in c(). For example, to convert the name and section columns to upper case:
students |> mutate(across(c(name, section), toupper))
Output:
# A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 ALIA  A           2     5
2     2 BALA  B           8     5
3     3 CARA  A          NA    10
4     4 DANA  B           4    10
Example
Convert the numeric columns to minutes by multiplying by 60. The id column is also numeric, so we drop it first with select(-id) to keep it from being converted:
students |> select(-id) |> mutate(across(where(is.numeric), ~. * 60))
Notice the syntax ~. * 60. The ~ creates a short anonymous function, and the . stands for the column being operated on, that is, each column in the selection (here, the numeric columns selected by where(is.numeric)).
In other words, you can read it as “multiply each column . in the selection by 60.”
Output:
# A tibble: 4 × 4
  name  section study  play
  <chr> <chr>   <dbl> <dbl>
1 Alia  A         120   300
2 Bala  B         480   300
3 Cara  A          NA   600
4 Dana  B         240   600
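The ~. * 60 shorthand is one of several ways to write the function passed to across(). As a sketch, the same conversion can be written with .x or with an anonymous function (the \(x) form assumes R 4.1 or later, like the |> pipe). Naming the columns explicitly with c() also keeps id in the result instead of dropping it. The data is re-created from the glimpse() output and the result names are illustrative:

```r
library(dplyr)

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Formula shorthand: .x refers to the current column
with_formula <- students |>
  mutate(across(c(study, play), ~ .x * 60))

# Anonymous function: x is the current column
with_lambda <- students |>
  mutate(across(c(study, play), \(x) x * 60))

with_lambda$study  # hours converted to minutes; the NA stays NA
with_lambda$id     # untouched, because id was not selected
```

Both forms produce the same columns; which one to use is largely a matter of taste.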
Example
The tidyr package provides replace_na(), which replaces missing values with a specified value. We can use it with across() to replace NA values with 0 in all numeric columns.
students |> mutate(across(where(is.numeric), ~replace_na(., 0)))
Output:
# A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 Alia  A           2     5
2     2 Bala  B           8     5
3     3 Cara  A           0    10
4     4 Dana  B           4    10
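Because replace_na() comes from tidyr rather than dplyr, tidyr has to be loaded as well. A self-contained sketch, assuming both packages are installed, with the data re-created from the glimpse() output and an illustrative result name:

```r
library(dplyr)
library(tidyr)  # provides replace_na()

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Replace missing values with 0 in every numeric column
filled <- students |>
  mutate(across(where(is.numeric), \(x) replace_na(x, 0)))

filled$study
```

Only study contains a missing value here, so it is the only column that changes.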
Exercise 1.4
Use across() to replace NA values with 5 in all numeric columns.
students |> mutate(across(where(is.numeric), ~replace_na(., 5)))
Output:
# A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 Alia  A           2     5
2     2 Bala  B           8     5
3     3 Cara  A           5    10
4     4 Dana  B           4    10
Relocating Columns
The relocate() function is used to change the position of columns in a data frame.
Syntax
By default, relocate() moves the specified column to the first position.
dataset |> relocate(column_to_move)
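To move a column before or after a specific column, use the .before or .after argument. For example, moving section so that it sits between id and name gives:
# A tibble: 4 × 5
     id section name  study  play
  <int> <chr>   <chr> <dbl> <dbl>
1     1 A       Alia      2     5
2     2 B       Bala      8     5
3     3 A       Cara     NA    10
4     4 B       Dana      4    10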
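The call that produced the output above is not shown. As a sketch, either of the following calls would place section between id and name, so treat them as plausible reconstructions and not as the original example. The data is re-created from the glimpse() output and the result names are illustrative:

```r
library(dplyr)

# students re-created from the values shown earlier (an assumption)
students <- tibble(
  id      = 1:4,
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),
  play    = c(5, 5, 10, 10)
)

# Move section so that it comes immediately before name
moved_before <- students |> relocate(section, .before = name)

# Equivalent here: move section so that it comes immediately after id
moved_after <- students |> relocate(section, .after = id)

names(moved_before)
names(moved_after)
```

Only one of .before and .after can be supplied in a single relocate() call.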
Exercise 1.5
Relocate ‘play’ to be the first column in the data frame.
students |> relocate(play)
Output:
# A tibble: 4 × 5
   play    id name  section study
  <dbl> <int> <chr> <chr>   <dbl>
1     5     1 Alia  A           2
2     5     2 Bala  B           8
3    10     3 Cara  A          NA
4    10     4 Dana  B           4
Review
We’ve covered the essential dplyr functions for working with columns in R. Let’s review and summarize what we’ve learned.
Summary
In this section, we learned how to:
Get a quick overview of your data frame using glimpse()
Select and rename columns using select() and rename()
Create and modify columns using mutate()
Perform operations on multiple columns using across()
Relocate columns within a data frame using relocate()
These functions are essential for data manipulation and analysis in R using the dplyr package, allowing you to efficiently work with columns in your datasets.
In the next section, we’ll learn how to work with groups.