```
library(dplyr)
coalesce(c(1, NA, 3, NA), c(2, 2, 4, 4))
```

`[1] 1 2 3 4`

R

code

tidyverse

dplyr

Author

Dave Rosenman

Published

May 16, 2024

The dplyr `coalesce`

function is incredibly useful and similar to the SQL `COALESCE`

function. Given a set of vectors, it finds and keeps the first non-`NA`

value at each position. For example, the following code returns the vector c(1, 2, 3, 4).

- At position 1, the first non-
`NA`

value is 1 (from the first vector). - At position 2, the first non-
`NA`

value is 2 (from the second vector, because the value at position 2 in the first vector is NA). - At position 3, the first non-
`NA`

value is 3 (from the first vector). - Finally, at position 4, the first non-
`NA`

value is 4 (from the second vector, because the value at position 4 in the first vector is NA).

The `coalesce`

function is not limited to two vectors. You can use as many vectors as you’d like.

The vectors must be of equal length or length 1. Vectors of length 1 will be recycled. The following:

Is equivalent to:

I most often use `coalesce`

to replace all `NA`

values in a vector with a single value. For example, the following code replaces all `NA`

values with 0:

Here are two alternative ways to do the same thing.

`[1] 1 0 0 5 6`

Of the three options I have shown so far, I prefer `coalesce`

and `replace_na`

. (`coalesce`

is a more general version of `replace_na`

; `replace_na`

takes a vector and a single value to replace the `NA`

values in that vector with). If you want to go outside of base R and the tidyverse, `data.table::fcoalesce`

is a much faster version of `dplyr::coalesce`

.

Let’s compare the speeds!

```
library(microbenchmark)
library(data.table)
set.seed(11)
x <- 1:10^7
x[sample(1:10^7, size = 10^6, replace = FALSE)] <- NA
microbenchmark(
fcoalesce(x, 0L),
coalesce(x, 0L),
replace_na(x, 0L),
ifelse(is.na(x), 0L, x),
times = 5
)
```

```
Unit: milliseconds
expr min lq mean median uq max
fcoalesce(x, 0L) 14.4582 16.1549 18.47224 18.5650 19.1356 24.0475
coalesce(x, 0L) 285.4586 289.1238 318.36890 292.7322 297.1620 427.3679
replace_na(x, 0L) 37.2726 41.0626 51.48104 41.4943 52.0923 85.4834
ifelse(is.na(x), 0L, x) 168.9397 178.8357 191.00360 180.9929 186.2458 240.0039
neval
5
5
5
5
```

`data.table:fcoalesce`

is the winner in terms of speed, followed by `replace_na`

. Of the four methods above, `coalesce`

and `fcoalesce`

are the most general, since they are not limited to replacing all `NA`

values with a single value.

Is there a base R equivalent to `dplyr::coalesce`

? No. But we can easily create one using just base R code.

To think about how we would do that, let’s start with two vectors:

How could we get the same results using the `ifelse`

function? It’s simple. We return the value in `y`

when the value in `x`

is `NA`

.

That’s simple enough. But what if we want to use three vectors?

We can start with our code from the case where we used two vectors.

When both `x`

and `y`

are `NA`

(when `ifelse(is.na(x), y, x)`

gives us `NA`

), we want to use what is in z. Otherwise, we want to keep the results from step 1 above.

That worked! But what if we want to generalize this to any number of input vectors? We can use the base R function `Reduce`

. For our case where we used `x`

, `y`

, and `z`

, we could do:

`Reduce(f, list(x, y, z))`

, where `f`

is a function of two variables, is the equivalent of `f(f(x, y), z)`

. And `Reduce(f, list(x, y, z, a))`

is equivalent to `f(f(f(x, y), z), a)`

. The Reduce function is used to iteratively apply a function to elements of a list, reducing it to a single value. It takes a function with two parameters and applies it to the first two elements of the list, then applies the same function to the result and the next element, and so on, until all elements are combined into a single value.

To use `Reduce`

to mimic `coalesce(x, y, z, ...)`

, we need to apply the logic `f <- function(x, y) { ifelse(is.na(x), y, x)}`

over and over starting from left to right. In other words, for three vectors x, y, and z, we need to do:

Which is equivalent to

So a very simple base R function equivalent to the `coalesce`

function is:

Let’s see if it produces identical results to `dplyr::coalesce`

:

```
set.seed(11)
x <- 1:10^7
y <- 1:10^7
x[sample(1:10^7, size = 10^6, replace = FALSE)] <- NA
y[sample(1:10^7, size = 10^6, replace = FALSE)] <- NA
z <- 1L
dplyr_result <- coalesce(x, y, z)
base_r_result <- coalesce_base_r(x, y, z)
identical(dplyr_result, base_r_result)
```

`[1] TRUE`

We get identical results!

Let’s compare the speed:

```
Unit: milliseconds
expr min lq mean median uq max
coalesce(x, y, z) 299.6434 321.6311 355.9083 357.0767 385.4815 415.709
coalesce_base_r(x, y, z) 349.0094 406.5152 400.0490 412.2628 413.9706 418.487
neval
5
5
```

Our base R version of `coalesce`

is almost identical in speed dplyr’s!

But our function contains some flaws. dplyr’s `coalesce`

function forces the vectors passed to it to either be of the same length or be of length 1. If we try:

```
Error in `coalesce()`:
! Can't recycle `..1` (size 5) to match `..2` (size 2).
```

We get an error, since the first vector has length 5 and the second has length 2.

Here’s a better base R version of `coalesce`

:

```
coalesce_base <- function(...) {
args <- list(...)
# Check for NULL, zero-length vectors, and collect lengths
lengths <- sapply(args, function(x) {
if (is.null(x) || length(x) == 0) {
stop("Arguments must not be NULL or zero-length vectors")
}
length(x)
})
# Determine the maximum length
max_length <- max(lengths)
# Check if lengths are consistent.
# Only allow vectors of length equal to max length or length of 1
if (any(lengths != max_length & lengths != 1)) {
stop("All arguments must have the same length,
except for vectors of length 1 which can be recycled")
}
# Use Reduce with ifelse to coalesce
Reduce(function(x, y) ifelse(is.na(x), y, x), args)
}
# Example usage:
v1 <- c(NA, 2, NA, 4, NA)
v2 <- c(1, NA, 3, NA, NA)
v3 <- 0
coalesce_base(v1, v2, v3)
```

`[1] 1 2 3 4 0`

Again, let’s compare the speed of our function to `dplyr::coalesce`

and `data.table::fcoalesce`

.

```
Unit: milliseconds
expr min lq mean median uq
fcoalesce(x, y, z) 16.9233 17.5952 19.12082 17.6570 19.5384
coalesce(x, y, z) 304.2171 321.0251 343.79622 355.3758 358.6747
coalesce_base_r(x, y, z) 344.0853 357.0263 383.64736 386.5437 388.1294
max neval
23.8902 5
379.6884 5
442.4521 5
```

`data.table::fcoalesce`

is the clear winner when it comes to speed! Our function is almost identical in speed to `dplyr::coalesce`

!