Working with Columns | The School of Data

Data Wrangling in R with dplyr and tidyr: Working with Columns

Column Operations

Analyzing data usually means working with the columns of a data frame. The dplyr package provides functions to select, rename, create, modify, and reorder columns, and to apply the same operation to several columns at once. This section covers each of these in turn.

What we’ll learn in this section

Goals

In this section, we'll learn how to:

Get a quick overview of your data frame
Select and rename columns using select() and rename()
Create and modify columns using mutate()
Perform operations on multiple columns using across()
Relocate columns within a data frame

Getting a Glimpse of Your Data

The glimpse() function provides a compact, transposed view of your data frame: each column is listed on its own line with its type and its first few values.

Syntax

 glimpse(dataset)

Example

Get a glimpse of the students data frame:

 glimpse(students)

Output:

 Rows: 4
Columns: 5
$ id      <int> 1, 2, 3, 4
$ name    <chr> "Alia", "Bala", "Cara", "Dana"
$ section <chr> "A", "B", "A", "B"
$ study   <dbl> 2, 8, NA, 4
$ play    <dbl> 5, 5, 10, 10
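The page uses students throughout without showing how it is created. A minimal sketch that rebuilds it from the values printed above; the tibble() call itself is an assumption, not the course's own setup code.

```r
library(dplyr)  # also re-exports tibble() and glimpse()

# Values copied from the glimpse() output above.
# How the course actually builds this data frame is not shown (assumption).
students <- tibble(
  id      = 1:4,                               # integer column
  name    = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study   = c(2, 8, NA, 4),                    # hours; one missing value
  play    = c(5, 5, 10, 10)                    # hours
)

glimpse(students)
```

With this object in place, the remaining examples on this page can be run as written.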

Selecting Columns

The select() function is used to choose specific columns from a data frame.

Syntax

 dataset |> 
  select(column1, column2, ...)

You can also use helper functions such as starts_with(), ends_with(), and contains() to select columns by matching their names.
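As a small sketch of the helpers, here is how they behave on the column names of the students data frame. The data frame is rebuilt here from the values shown on this page so the block runs on its own; the variable names s_cols, y_cols, and a_cols are only for illustration.

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# Columns whose names start with "s": section, study
s_cols <- students |> select(starts_with("s"))

# Columns whose names end with "y": study, play
y_cols <- students |> select(ends_with("y"))

# Columns whose names contain "a": name, play
a_cols <- students |> select(contains("a"))

names(s_cols)
names(y_cols)
names(a_cols)
```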

Example

The example below selects only the name and section columns from the students data frame, which is shown first:

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10

 students |> 
  select(name, section)

Output:

 # A tibble: 4 × 2
  name  section
  <chr> <chr>  
1 Alia  A      
2 Bala  B      
3 Cara  A      
4 Dana  B

Exercise 1.1

Select the study and play columns from the students data frame.

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10
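No solution is printed for this exercise. One possible answer, following the same pattern as the example above (the students data frame is rebuilt from the values shown so the block runs on its own):

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# Keep only the two columns named in the exercise
study_play <- students |>
  select(study, play)

study_play
```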

Renaming Columns

The rename() function is used to change column names.

Syntax

 dataset |> 
  rename(new_name = old_name)

Example

Rename ‘study’ to ‘study_hours’:

 students |> 
  rename(study_hours = study)

Output:

 # A tibble: 4 × 5
     id name  section study_hours  play
  <int> <chr> <chr>        <dbl> <dbl>
1     1 Alia  A                2     5
2     2 Bala  B                8     5
3     3 Cara  A               NA    10
4     4 Dana  B                4    10
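select() accepts the same new_name = old_name form, with one difference: rename() keeps every column, while select() keeps only the columns you list. A short sketch of the contrast; the new name student is made up for illustration.

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# rename(): all five columns are kept, one gets a new name
renamed <- students |>
  rename(study_hours = study)

# select(): only the listed columns are kept, renamed on the way
# ("student" is a hypothetical name chosen for this sketch)
picked <- students |>
  select(student = name, study_hours = study)

names(renamed)
names(picked)
```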

Exercise 1.2

Rename ‘play’ to ‘leisure_time’ and ‘section’ to ‘class’.

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10

 students |> 
  rename(leisure_time = play, class = section)

Output:

 # A tibble: 4 × 5
     id name  class study leisure_time
  <int> <chr> <chr> <dbl>        <dbl>
1     1 Alia  A         2            5
2     2 Bala  B         8            5
3     3 Cara  A        NA           10
4     4 Dana  B         4           10

Creating and Modifying Columns

The mutate() function is used to create new columns or modify existing ones.

Syntax

 dataset |> 
  mutate(new_column = expression)

Example

Add a new column ‘city’ as “Oxford” for all students:

 students |> 
  mutate(city = "Oxford")

Output:

 # A tibble: 4 × 6
     id name  section study  play city  
  <int> <chr> <chr>   <dbl> <dbl> <chr> 
1     1 Alia  A           2     5 Oxford
2     2 Bala  B           8     5 Oxford
3     3 Cara  A          NA    10 Oxford
4     4 Dana  B           4    10 Oxford

We can also perform operations on existing columns to create new ones. For example, we can use case_when() to create a new column based on conditions.

Syntax

 dataset |> 
  mutate(new_column = case_when(
    condition1 ~ value1,
    condition2 ~ value2,
    TRUE ~ default_value
  ))
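case_when() checks the conditions from top to bottom and uses the first one that is TRUE, so the order of the conditions matters. A comparison with NA is neither TRUE nor FALSE, so missing values fall through to the final TRUE line. A minimal sketch on a plain vector (hours is a made-up example vector):

```r
library(dplyr)

# Hypothetical input: three values and one missing value
hours <- c(2, 4, 8, NA)

categories <- case_when(
  hours < 3  ~ "Low",      # 2 matches here and stops, although 2 <= 6 is also TRUE
  hours <= 6 ~ "Medium",   # only reached by values that are not < 3
  hours > 6  ~ "High",
  TRUE       ~ "Unknown"   # catches NA, which matched none of the above
)

categories
```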

Example

Create a new column ‘study_category’ that categorizes students based on their study hours:

 students |> 
  mutate(study_category = case_when(
    study < 3 ~ "Low",
    study <= 6 ~ "Medium",
    study > 6 ~ "High",
    TRUE ~ "Unknown"
  ))

Output:

 # A tibble: 4 × 6
     id name  section study  play study_category
  <int> <chr> <chr>   <dbl> <dbl> <chr>         
1     1 Alia  A           2     5 Low           
2     2 Bala  B           8     5 High          
3     3 Cara  A          NA    10 Unknown       
4     4 Dana  B           4    10 Medium
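mutate() also accepts plain arithmetic on existing columns, and reusing an existing column name overwrites that column instead of adding a new one. A short sketch of both cases; the column name total_hours is made up for illustration.

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# Create a new column from two existing ones.
# Any sum involving NA is NA, so Cara's total is missing.
with_total <- students |>
  mutate(total_hours = study + play)

# Modify an existing column in place: study hours become study minutes
in_minutes <- students |>
  mutate(study = study * 60)

with_total
in_minutes
```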

Exercise 1.3

Create a new column ‘play_category’ that categorizes students based on their play hours: “Low” if play < 3, “Medium” if play is between 3 and 6, and “High” if play > 6.

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10

 students |> 
  mutate(play_category = case_when(
    play < 3 ~ "Low",
    play <= 6 ~ "Medium",
    play > 6 ~ "High",
    TRUE ~ "Unknown"
  ))

Output:

 # A tibble: 4 × 6
     id name  section study  play play_category
  <int> <chr> <chr>   <dbl> <dbl> <chr>        
1     1 Alia  A           2     5 Medium       
2     2 Bala  B           8     5 Medium       
3     3 Cara  A          NA    10 High         
4     4 Dana  B           4    10 High

Column-wise Operations with across()

The across() function allows us to apply the same function(s) to multiple columns.

Syntax

 dataset |> 
  mutate(across(columns, function))

To select several columns by name, we wrap them in c().

 dataset |> 
  mutate(across(c(column1, column2), function))

We can also use where() with a test function such as is.numeric to apply the function to every column that meets a condition.

 dataset |> 
  mutate(across(where(is.numeric), function))

To define a custom function inline, we use the formula shorthand, which starts with a tilde (~). On many keyboards, the tilde key is in the top-left corner, below the Escape key.

 dataset |> 
  mutate(across(where(is.numeric), ~function))

The examples below show how this works.

Example

Convert the character columns name and section to uppercase:

 students |> 
  mutate(across(c(name, section), toupper))

Output:

 # A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 ALIA  A           2     5
2     2 BALA  B           8     5
3     3 CARA  A          NA    10
4     4 DANA  B           4    10

Example

Convert study and play from hours to minutes by multiplying by 60. Dropping the id column first keeps it out of the numeric selection:

 students |> 
  select(-id) |> 
  mutate(across(where(is.numeric), ~. * 60))

Notice the syntax ~. * 60. The ~ starts an inline function, and the . stands for the column being operated on, that is, each column in the selection in turn (here, the numeric columns matched by where(is.numeric)).

In other words: “multiply each selected column by 60.”

Output:

 # A tibble: 4 × 4
  name  section study  play
  <chr> <chr>   <dbl> <dbl>
1 Alia  A         120   300
2 Bala  B         480   300
3 Cara  A          NA   600
4 Dana  B         240   600
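The example above removes id from the result. If id should stay in the data frame, one option (an alternative sketch, not the approach used on this page) is to exclude it inside across() by combining where(is.numeric) with !id. The sketch also uses R's anonymous-function shorthand \(x), which does the same job as the ~ . formula and, like the |> pipe, needs R 4.1 or later.

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# Select numeric columns, but not id; multiply each selected column by 60.
# \(x) x * 60 is equivalent to the formula ~ . * 60 used above.
minutes <- students |>
  mutate(across(where(is.numeric) & !id, \(x) x * 60))

minutes
```

Here id is left untouched while study and play are converted.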

Example

The tidyr package provides replace_na(), which replaces missing values with a value you specify. We can use it with across() to replace NA values with 0 in all numeric columns.

 students |> 
  mutate(across(where(is.numeric), ~replace_na(., 0)))

Output:

 # A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 Alia  A           2     5
2     2 Bala  B           8     5
3     3 Cara  A           0    10
4     4 Dana  B           4    10

Exercise 1.4

Use across() to replace NA values with 5 in all numeric columns.

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10

 students |> 
  mutate(across(where(is.numeric), ~replace_na(., 5)))

Output:

 # A tibble: 4 × 5
     id name  section study  play
  <int> <chr> <chr>   <dbl> <dbl>
1     1 Alia  A           2     5
2     2 Bala  B           8     5
3     3 Cara  A           5    10
4     4 Dana  B           4    10

Relocating Columns

The relocate() function is used to change the position of columns in a data frame.

Syntax

By default, relocate() moves the specified column to the first position.

 dataset |> 
  relocate(column_to_move)

To place a column before or after a specific column, use the .before or .after argument:

 dataset |> 
  relocate(column_to_move, .before = column_name)

 dataset |> 
  relocate(column_to_move, .after = column_name)

Example

Move ‘section’ before ‘name’:

 students |> 
  relocate(section, .before = name)

Output:

 # A tibble: 4 × 5
     id section name  study  play
  <int> <chr>   <chr> <dbl> <dbl>
1     1 A       Alia      2     5
2     2 B       Bala      8     5
3     3 A       Cara     NA    10
4     4 B       Dana      4    10
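The .after argument and moving several columns at once work the same way. A short sketch of three variations on the example above; last_col() is a selection helper that refers to the last column.

```r
library(dplyr)

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

# Move play directly after id: id, play, name, section, study
after_id <- students |>
  relocate(play, .after = id)

# Move two columns together, keeping their given order:
# id, study, play, name, section
both_moved <- students |>
  relocate(study, play, .before = name)

# Move id to the end: name, section, study, play, id
id_last <- students |>
  relocate(id, .after = last_col())

names(after_id)
names(both_moved)
names(id_last)
```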

Exercise 1.5

Relocate ‘play’ to be the first column in the data frame.

id	name	section	study	play
1	Alia	A	2	5
2	Bala	B	8	5
3	Cara	A	NA	10
4	Dana	B	4	10

 students |> 
  relocate(play)

Output:

 # A tibble: 4 × 5
   play    id name  section study
  <dbl> <int> <chr> <chr>   <dbl>
1     5     1 Alia  A           2
2     5     2 Bala  B           8
3    10     3 Cara  A          NA
4    10     4 Dana  B           4

Review

We’ve covered the essential dplyr functions for working with columns in R. Let’s review and summarize what we’ve learned.

Summary

In this section, we learned how to:

Get a quick overview of your data frame using glimpse()
Select and rename columns using select() and rename()
Create and modify columns using mutate()
Perform operations on multiple columns using across()
Relocate columns within a data frame using relocate()

These functions are essential for data manipulation and analysis in R using the dplyr package, allowing you to efficiently work with columns in your datasets.
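As a recap, the functions from this section can be chained in a single pipeline. This is only an illustrative sketch: the new column names play_hours and total_hours are made up, and the order of the steps is one choice among many.

```r
library(dplyr)
library(tidyr)  # for replace_na()

# Rebuilt from the values shown on this page (assumed construction)
students <- tibble(
  id = 1:4,
  name = c("Alia", "Bala", "Cara", "Dana"),
  section = c("A", "B", "A", "B"),
  study = c(2, 8, NA, 4),
  play = c(5, 5, 10, 10)
)

result <- students |>
  select(-id) |>                                              # drop id
  rename(study_hours = study, play_hours = play) |>           # clearer names
  mutate(across(where(is.numeric), \(x) replace_na(x, 0))) |> # NA becomes 0
  mutate(total_hours = study_hours + play_hours) |>           # new column
  relocate(total_hours, .after = name)                        # reorder

result
```

Because the missing study value is replaced with 0 before the sum, every student gets a total here; whether treating a missing value as 0 is appropriate depends on the analysis.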

In the next section, we’ll learn how to work with groups.
