- Published on
- • 13 min read
Data Wrangling Cheat Sheet
- Authors
- Name
- Earl Duncan
- Socials
- Preface
- Aim
- R Libraries
- Renaming
- Rename multiple columns
- Rename column using string as column name
- Mutating Values
- Replace NAs, NaN, or Inf
- Insert values for missing combinations
- Mutate across multiple columns as new columns
- Mutate across referring to column name
- Convert all of one data type to another data type
- Mutate column using string as column name
- Reshaping data frames
- Pivot longer with multiple variables in 1 column name
- Split the values in one column into multiple rows
- Split one column into two columns
- Convert named vector to tibble and vice versa
- Other wrangling
- String as column name in dplyr functions
- Grouping by negation
- Sum across multiple columns
- Filenames and filepaths
- Shortcuts and speed hacks
- Converting age to 5-year age groups
- Relabelling factor levels without converting to string
Originally released on RPubs, 30 July, 2024.
Preface
Aim
This guide serves as a cheat sheet for wrangling tasks that I find myself often doing but don’t always remember the exact way to execute the code to achieve the result I want. pivot_longer
is a good example. I use it fairly frequently, and it’s straightforward when you only want to convert a number of columns into a single “name” and “value” pair of new columns. As soon as the names_pattern
argument is used, especially when regex is involved, things get complicated quickly. This guide contains worked examples which demonstrate different ways these common wrangling tasks might be performed, and provides ready-to-use code via a simple copy-paste.
R Libraries
The R code in this document relies on two main R libraries.
library(dplyr)
library(tidyr)
Renaming
Rename multiple columns
rename
can be used to rename one or several columns by passing a named vector containing the new and old variable names. But that is often too tedious when you need to rename many columns. If the new column names have a common structure, e.g. they are all being converted to uppercase, or they are all being prepended with new_
, then rename_with
is the function to use.
E.g. converting column names to lowercase, and replacing periods with underscores from names (all columns by default)
head(iris, 2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
iris %>%
rename_with(~ tolower(gsub("\\.", "_", .x)), .cols = everything())
## sepal_length sepal_width petal_length petal_width species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
E.g. Add prefix to all but first column
iris %>%
rename_with(~ paste0("var_", .x), .cols = -1)
## Sepal.Length var_Sepal.Width var_Petal.Length var_Petal.Width var_Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
Note: .cols
can take column names (quoted or unquoted, negated or not), indices (negated or not), and select
helper functions like starts_with
, contains
, etc.
Rename column using string as column name
E.g.
var <- "new_name"
dat %>%
rename("{new_name}" := old_name)
Alternatively:
dat %>%
rename(!!sym(new_name) := old_name)
See also mutating and other uses using string as a column name
Mutating Values
Replace NAs, NaN, or Inf
E.g. Replace NAs
in a specific column with zero
dat %>%
mutate(var = replace_na(var, 0)) # tidyr solution
mutate(var = replace(var, is.na(var), 0)) # more general base R solution
Note: replace_na()
will not work if the variable is a factor, and the replacement is not already a level for your factor (see here)
The replace_na()
function is very specific and limited. For multiple variables, and replacing other special values, use across()
:
dat %>%
mutate(across(c(var1, var2), ~ replace(., is.na(.), 0))) %>%
mutate(across(c(var3, var4), ~ replace(., is.nan(.), 0))) %>%
mutate(across(var5, ~ replace(., is.infinite(.), 999)))
Insert values for missing combinations
Example data set:
cancer <- tibble(
sex = c("Male", "Female", "Female"),
cancer = c("Prostate", "Ovarian", "Ovarian"),
agegrp = c("40-50", "40-50", "51-60"),
count = c(20, 16, 23)
)
cancer
## # A tibble: 3 × 4
## sex cancer agegrp count
## <chr> <chr> <chr> <dbl>
## 1 Male Prostate 40-50 20
## 2 Female Ovarian 40-50 16
## 3 Female Ovarian 51-60 23
E.g. Add missing agegrp for each sex-cancer combo
cancer %>%
complete(nesting(sex, cancer), agegrp)
## # A tibble: 4 × 4
## sex cancer agegrp count
## <chr> <chr> <chr> <dbl>
## 1 Female Ovarian 40-50 16
## 2 Female Ovarian 51-60 23
## 3 Male Prostate 40-50 20
## 4 Male Prostate 51-60 NA
Add missing agegrp for each sex-cancer combo, imputing 0 for missing counts
cancer %>%
complete(nesting(sex, cancer), agegrp, fill = list(count = 0))
## # A tibble: 4 × 4
## sex cancer agegrp count
## <chr> <chr> <chr> <dbl>
## 1 Female Ovarian 40-50 16
## 2 Female Ovarian 51-60 23
## 3 Male Prostate 40-50 20
## 4 Male Prostate 51-60 0
Add missing cancers for each sex-agegrp combo, imputing 0 for missing counts
cancer %>%
complete(nesting(sex, agegrp), cancer, fill = list(count = 0))
## # A tibble: 6 × 4
## sex agegrp cancer count
## <chr> <chr> <chr> <dbl>
## 1 Female 40-50 Ovarian 16
## 2 Female 40-50 Prostate 0
## 3 Female 51-60 Ovarian 23
## 4 Female 51-60 Prostate 0
## 5 Male 40-50 Ovarian 0
## 6 Male 40-50 Prostate 20
Add all missing combinations
cancer %>%
complete(sex, agegrp, cancer, fill = list(count = 0))
## # A tibble: 8 × 4
## sex agegrp cancer count
## <chr> <chr> <chr> <dbl>
## 1 Female 40-50 Ovarian 16
## 2 Female 40-50 Prostate 0
## 3 Female 51-60 Ovarian 23
## 4 Female 51-60 Prostate 0
## 5 Male 40-50 Ovarian 0
## 6 Male 40-50 Prostate 20
## 7 Male 51-60 Ovarian 0
## 8 Male 51-60 Prostate 0
Mutate across multiple columns as new columns
dat %>%
mutate(across(starts_with("var"), ~ .x * 2, .names = "new_{col}"))
E.g.
iris %>%
mutate(across(starts_with("Sepal"), ~ .x * 2, .names = "new_{col}")) %>%
head(2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species new_Sepal.Length
## 1 5.1 3.5 1.4 0.2 setosa 10.2
## 2 4.9 3.0 1.4 0.2 setosa 9.8
## new_Sepal.Width
## 1 7
## 2 6
Mutate across referring to column name
cur_column
gives the name of the current column. E.g.
dat <- tibble(
condition = c(rep("asthma", 2), rep("obesity", 2)),
asthma = c(TRUE, TRUE, FALSE, TRUE),
obesity = c(TRUE, FALSE, TRUE, TRUE)
)
dat %>%
mutate(across(c("asthma", "obesity"), ~ ifelse(condition == cur_column(), NA, .x)))
## # A tibble: 4 × 3
## condition asthma obesity
## <chr> <lgl> <lgl>
## 1 asthma NA TRUE
## 2 asthma NA FALSE
## 3 obesity FALSE NA
## 4 obesity TRUE NA
Convert all of one data type to another data type
E.g. convert all character-valued columns to factors, and all numeric columns to integers
dat %>%
mutate(across(where(is.character), as.factor)) %>%
mutate(across(where(is.double), as.integer))
Mutate column using string as column name
To create a new column:
var <- "new_col"
dat %>%
mutate("{var}" := old_col * 100)
# or
dat %>%
mutate(!!sym(var) := old_col * 100)
To apply the mutation to the same column:
dat %>%
mutate("{var}" := get(var) * 100)0)
# or
dat %>%
mutate(!!sym(var) := !!sym(var) * 100)
Note: var
can be a vector. Alternatively, if var
is a list, replace !!sym()
with !!!syms()
.
See also renaming and other uses using string as a column name
Reshaping data frames
Pivot longer with multiple variables in 1 column name
E.g. Pivot on these columns using the word before the first underscore as “name” and the word after as the values (new columns)
dat %>%
pivot_longer(
c(A_long, B_long, A_lat, B_lat, A_location, B_location),
names_to = c("name", ".value"),
names_pattern = "(.*)_(.*)" # or names_sep = "_"
)
E.g. Pivot on these columns using the word before the first underscore as “name” and the remainder as the values (new columns)
dat %>%
pivot_longer(
c(A_long, B_long, A_lat, B_lat, A_location_name, B_location_name),
names_to = c("name", ".value"),
names_pattern = "([^_])_(.*)" # Everything before first underscore + remainder
)
E.g. Pivot all columns except id.1
and id.2
using last character as “name” and the word before the period as the values (new columns)
dat %>%
pivot_longer(
-c(id.1, id.2),
names_to = c(".value", "name"),
names_pattern = "(.*).(.$)" # or names_sep = "\\."
)
E.g. splitting one column into three using first two delimiters:
dat <- tibble(
id = 1:3,
male.age.5 = 4,
male.age.10 = 5.5,
`male.age.15.and.over` = 6,
female.age.5 = 3.5,
female.age.10 = 9,
`female.age.15.and.over` = 7
)
Solution 1:
dat %>%
pivot_longer(
-id,
names_to = c("sex", "age", ".value"),
names_pattern = "([^\\.]*).(age).(.*)" # or names_pattern = "([^\\.]*).([^\\.]*).(.*)"
)
## # A tibble: 6 × 6
## id sex age `5` `10` `15.and.over`
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 male age 4 5.5 6
## 2 1 female age 3.5 9 7
## 3 2 male age 4 5.5 6
## 4 2 female age 3.5 9 7
## 5 3 male age 4 5.5 6
## 6 3 female age 3.5 9 7
Solution 2:
dat %>%
rename_with(~gsub(".and.over", "+", .x)) %>%
pivot_longer(
-id,
names_to = c("sex", "age", ".value"),
names_sep = "\\." # or either of the above two names_pattern solutions
)
## # A tibble: 6 × 6
## id sex age `5` `10` `15+`
## <int> <chr> <chr> <dbl> <dbl> <dbl>
## 1 1 male age 4 5.5 6
## 2 1 female age 3.5 9 7
## 3 2 male age 4 5.5 6
## 4 2 female age 3.5 9 7
## 5 3 male age 4 5.5 6
## 6 3 female age 3.5 9 7
Split the values in one column into multiple rows
E.g. splitting values in a column based on a delimiter
tibble(col_1 = LETTERS[1:3]) %>%
mutate(year = paste(c(2021:2025), collapse = ", ")) %>%
mutate(year = strsplit(year, ", ")) %>%
unnest(year)
## # A tibble: 15 × 2
## col_1 year
## <chr> <chr>
## 1 A 2021
## 2 A 2022
## 3 A 2023
## 4 A 2024
## 5 A 2025
## 6 B 2021
## 7 B 2022
## 8 B 2023
## 9 B 2024
## 10 B 2025
## 11 C 2021
## 12 C 2022
## 13 C 2023
## 14 C 2024
## 15 C 2025
Split one column into two columns
E.g. splitting a column based on the first space only
tibble(car = rownames(mtcars)) %>%
slice(1:5) %>%
separate(col = car, into = c("manufacturer", "make"), sep = "^\\S*\\K\\s+")
## # A tibble: 5 × 2
## manufacturer make
## <chr> <chr>
## 1 Mazda RX4
## 2 Mazda RX4 Wag
## 3 Datsun 710
## 4 Hornet 4 Drive
## 5 Hornet Sportabout
Convert named vector to tibble and vice versa
to: | named vector | one-row df | two-column df |
---|---|---|---|
named vector |
|
| |
one-row df |
|
| |
two-column df |
|
|
Note: when converting a two-column tibble to named vector, pull(dat, values, names)
is preferred over
library(janitor) # For row_to_names()
dat %>%
t() %>%
row_to_names(row_number = 1) %>% # Use first row as column names
as_tibble()
Other wrangling
String as column name in dplyr functions
vars <- c("foo", "bar")
# The following are equivalent:
dat %>%
group_by(across(all_of(vars))) # or
group_by(foo, bar) # or
group_by(across(all_of(c("foo", var)))) # or
group_by(foo, !!sym(var)) # or
group_by(!!!syms(vars))
This also applies to arrange()
. For select()
or relocate()
, drop across()
, e.g. select(all_of(vars))
.
var = "foo"
# The following are equivalent:
dat %>%
rename(new_foo = !!sym(var)) # or
rename(!!sym(glue("new_{var}")) := !!sym(var)) # or
rename_with(~ paste0("new_", .x), .cols = !!sym(var))
For use of !!
, sym()
, and !!sym()
, see here. For use of !!!
, see here
See also renaming and mutating using string as a column name
Grouping by negation
Put the negation inside across()
, not before it. E.g. to group by all columns except “foo” and “bar”:
dat %>%
group_by(across(-c(foo, bar)))
# or
vars <- c("foo", "bar")
dat %>%
group_by(across(-vars))
A practical example is when you want to sum a column of counts by all groups:
dat %>%
group_by(across(-n)) %>%
summarise(n = sum(n))
Sum across multiple columns
Summing across multiple columns:
dat %>%
group_by(var1) %>%
mutate(sum = rowSums(across(starts_with("value_"))))
Compare to summing across multiple rows:
dat %>%
group_by(var1) %>%
summarise(across(starts_with("value_"), ~ sum(., na.rm = TRUE)))
Note the differences in speed:
dat %>%
mutate(
# pop = pop.1 + pop.2 + pop.3 + ... + pop.100 # Hard-coded, but fast
# pop = apply(dat %>% select(starts_with("pop.")), 1, sum) # VERY slow
pop = select(., starts_with("pop.")) %>% rowSums(na.rm = TRUE) # FAST
)
Filenames and filepaths
List all files in a particular folder, with optional regex string match (as a vector)
files <- list.files(
"/data/my_folder",
pattern = "a*z\\.xlsx",
full.names = TRUE
)
Split vector of filepaths into directory and file name (with extension)
tibble(full_name = files) %>%
mutate(
filepath = dirname(full_name),
file = basename(full_name)
)
Note: dirname
and basename
are both base R functions.
Extract file extension from filename:
tools::file_ext(filename)
Shortcuts and speed hacks
Converting age to 5-year age groups
E.g.
dat <- tibble(age = 2:20)
dat %>%
mutate(
agegrp = paste(floor(age /5) * 5L, (floor(age /5) + 1L) * 5L - 1L, sep = "-"),
agegrp = ifelse(agegrp == "20-24", "20+", agegrp)
)
## # A tibble: 19 × 2
## age agegrp
## <int> <chr>
## 1 2 0-4
## 2 3 0-4
## 3 4 0-4
## 4 5 5-9
## 5 6 5-9
## 6 7 5-9
## 7 8 5-9
## 8 9 5-9
## 9 10 10-14
## 10 11 10-14
## 11 12 10-14
## 12 13 10-14
## 13 14 10-14
## 14 15 15-19
## 15 16 15-19
## 16 17 15-19
## 17 18 15-19
## 18 19 15-19
## 19 20 20+
Relabelling factor levels without converting to string
This requires the forcats
library. E.g.
library(forcats)
tibble(
age_band = c("0-4", "5-9", "10-14", "15-19", "20-24", "25+")
) %>%
mutate(
age_band2 = fct_collapse( # Combine levels
age_band,
"0-9" = c("0-4", "5-9"),
"10-19" = c("10-14", "15-19")
),
# Relabel all levels with function
age_band2 = fct_relabel(age_band2, ~ sub("-", ".", .x)),
# Relabel specific levels manually
age_band2 = fct_recode(age_band2, "0.09" = "0.9"),
) %>%
select(age_band, age_band2) %>%
distinct()
## # A tibble: 6 × 2
## age_band age_band2
## <chr> <fct>
## 1 0-4 0.09
## 2 5-9 0.09
## 3 10-14 10.19
## 4 15-19 10.19
## 5 20-24 20.24
## 6 25+ 25+