Reproducibility of parallel tasks in R

Updated

Introduction: Reproducibility in R

Reproducibility is important. More important than ever.

However, making a project reproducible is not as trivial as it sounds. It involves quite some effort which goes beyond “doing it programmatically”:

For the first two points, efforts like the Zenodo project, containerization (containerit), and makefile-based approaches (drake, workflowr) provide helpful resources.

Despite all of these efforts, there is another point that needs to be added to the list: Ensuring that parallel stochastic computations are reproducible.

The idea of this post originated from this Stackoverflow question. In particular by the answer of Ralf Stubner which got me started thinking about all the differences between parallel modes in R and how complicated it is for users to get an overview of all the available choices.

Today parallelization is used in many studies. Especially when it comes to modeling (machine learning / statistical), parallelization is practically indispensable. There is a variety of choices in R to go parallel. Speaking from my practical experience over the last years, the most widely known (and used?) approach is foreach with the help of the %dopar% operator. Probably because it is based on the good old for()-loop which is usually taught in intro lessons of any programming language. Next comes the “multicore” aka “apply” mode in R, with functions lapply() (and friends) and mclappy().

When it comes to reproducibility of stochastic processes, most people are aware that one needs to use set.seed(). While this holds true for sequential processes, it does not for parallel ones. The default RNG kind used in R (“Mersenne-Twister”) does not work with parallel backends. When reading ?set.seed(), I see people often closing the page quickly after having taken a first look. Why? There is a lot of information in this help page - however, most of it is also mildly complicated and very theoretical. Also, there is only one sentence which relates to parallelization:

This is not particularly interesting of itself, but provides the basis for the multiple streams used in package parallel.

This sentence belongs to the description of the "L'Ecuyer-CMRG" RNG kind. When only reading this sentence, there is usually still confusion about what action needs to be taken (if at all).

As you might already suspect at this point, action needs to be taken to make parallel streams reproducible. And unfortunately, this actions differs based on the backend / package used. To clear some of this confusion, I’ve put together a few minimal reproducible examples to show how to make stochastic processes reproducible for various parallel backends:

The examples of this post relate to the parallelMap and future packages. The first one is being used in the (retired) mlr package while future provides a generic approach to parallelization in R used by many packages (among other also by the mlr3 package (shameless self-promotion :o).

You will see that the solutions to ensure reproducibility differ for each backend / package. This is really unfortunate and requires users to change code for every backend. Unfortunately there is no other way to accomplish this right now (and possibly in the future). Maybe this post can help serving as a reference for parallel reproducibility in the R community.

The parallelMap package

Good to know: In all following examples set.seed(123456, "Mersenne-Twister") is explicitly used instead of just set.seed(123456). This is only for showcases purposes to recall that "Mersenne-Twister" is the default RNG kind used in R. Hence, set.seed(123456, "Mersenne-Twister") and set.seed(123456) are effectively the same when being called in a fresh R session.

Multicore backend

Let`s take a look at the first example. When using the “multicore” mode and the standard RNG kind, results are not reproducible.

Not reproducible

library("parallelMap")

suppressMessages({
  set.seed(123456, "Mersenne-Twister") # R default
  parallelStartMulticore(cpus = 2)
  r1 <- parallelMap(runif, rep(3, 2))
  parallelStop()

  set.seed(123456, kind = "Mersenne-Twister") # R default
  parallelStartMulticore(cpus = 2)
  r2 <- parallelMap(runif, rep(3, 2))
  parallelStop()
})
all.equal(r1, r2)
## [1] TRUE

Reproducible

To ensure reproducibility for this mode, one needs to use the "L'Ecuyer-CMRG" RNG kind. This holds true for using mclapply() directly as well. The parallelMap packages essentially just wraps mclapply() in its parallelStartMulticore() function.

library("parallelMap")

suppressMessages({
  set.seed(123456, "L'Ecuyer-CMRG")
  parallelStartMulticore(cpus = 2)
  r3 <- parallelMap(runif, rep(3, 2))
  parallelStop()

  set.seed(123456, "L'Ecuyer-CMRG")
  parallelStartMulticore(cpus = 2)
  r4 <- parallelMap(runif, rep(3, 2))
  parallelStop()
})
all.equal(r3, r4)
## [1] TRUE

Socket backend

For a cluster-based approach, the approach differs. First, the non-reproducible way again.

Not reproducible

library("parallelMap")

suppressMessages({
  parallelStartSocket(cpus = 2)
  set.seed(123456, "Mersenne-Twister") # R default
  r5 <- parallelMap(runif, rep(3, 2))
  parallelStop()

  parallelStartSocket(cpus = 2)
  set.seed(123456, "Mersenne-Twister") # R default
  r6 <- parallelMap(runif, rep(3, 2))
  parallelStop()
})
all.equal(r5, r6)
## [1] "Component 1: Mean relative difference: 1.004628"  "Component 2: Mean relative difference: 0.4810356"

Reproducible

Instead of simply using the "L'Ecuyer-CMRG" RNG kind directly, one needs to use the function clusterSetRNGStream() as shown below. Behind the scenes, this function also uses the "L'Ecuyer-CMRG" RNG kind.

I cannot tell why the R-core guys decided to go this route instead of using the same approach as for the “multicore” mode and handling things behind the scenes. This would have prevented a lot of confusion and (possibly) non-reproducible results in the past (and future).

Important

The clusterSetRNGStream() call needs to come after the “PSOCK” cluster was initialized (here done within parallelStartSocket(), otherwise by calling makeCluster()).

library("parallelMap")

suppressMessages({
  parallelStartSocket(cpus = 2)
  parallel::clusterSetRNGStream(iseed = 123456)
  r7 <- parallelMap(runif, rep(3, 2))
  parallelStop()

  parallelStartSocket(cpus = 2)
  parallel::clusterSetRNGStream(iseed = 123456)
  r8 <- parallelMap(runif, rep(3, 2))
  parallelStop()
})
all.equal(r7, r8)
## [1] TRUE

If the order is reversed, clusterSetRNGStream() will error:

library("parallelMap")

suppressMessages({
  parallel::clusterSetRNGStream(iseed = 123456)
  parallelStartSocket(cpus = 2)
  r7b <- parallelMap(runif, rep(3, 2))
  parallelStop()

  parallel::clusterSetRNGStream(iseed = 123456)
  parallelStartSocket(cpus = 2)
  r8b <- parallelMap(runif, rep(3, 2))
  parallelStop()
})
## Error in defaultCluster(cl): no cluster 'cl' supplied and none is registered
all.equal(r7b, r8b)
## Error in all.equal(r7b, r8b): object 'r7b' not found

The future package

As of the time writing this post, the future package comes with support for several asynchronous backends:

The following modes are derivatives from the multisession/socket mode tailored to be used on HPC and other remote machines:

The callr backend is entirely different as it does not make use of forking (multicore) or cluster-based approaches. See the package’s README for a detailed comparison.

The following examples show how you can make parallelization relying on the future package and its friends reproducible.

Multicore backend

Not reproducible

library("future.apply")

# register parallel backend
plan(multicore, workers = 2)

set.seed(123456, "Mersenne-Twister") # R default
r9 <- future_lapply(rep(3, 2), runif)
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-2') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
set.seed(123456, kind = "Mersenne-Twister") # R default
r10 <- future_lapply(rep(3, 2), runif)
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-2') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
all.equal(r9, r10)
## [1] TRUE

Reproducible

In contrast to the parallelMap approach, the multicore backend uses the "L'Ecuyer-CMRG" RNG kind already internally. See ?future_lapply(). There are two ways to activate it:

The future package internally accounts for it by using “L’Ecuyer-CMRG” RNG stream when using one of the options above. See section “Reproducible random number generation (RNG)” in ?future.apply::future_lapply() for more information.

Note that one could also use the furrr package with its function future_map() as equivalent substitute.

(i)

library("future.apply")

# register parallel backend
plan(multicore, workers = 2)

r11a <- future_lapply(rep(3, 2), runif, future.seed = 123456)

r12a <- future_lapply(rep(3, 2), runif, future.seed = 123456)

all.equal(r11a, r12a)
## [1] TRUE

(ii)

library("future.apply")

# register parallel backend
plan(multicore, workers = 2)

set.seed(123)
r11b <- future_lapply(rep(3, 2), runif, future.seed = TRUE)

set.seed(123)
r12b <- future_lapply(rep(3, 2), runif, future.seed = TRUE)

all.equal(r11b, r12b)
## [1] TRUE

Socket backend

The following example uses the doFuture backend to parallelize a foreach loop.

Not reproducible

library("doFuture")

# register parallel backend
registerDoFuture()
plan(multisession, workers = 2)

set.seed(123456, "Mersenne-Twister") # R default
r13 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-2') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
set.seed(123456, "Mersenne-Twister") # R default
r14 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-2') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
all.equal(r13, r14)
## [1] "Component 1: Mean relative difference: 0.4372837" "Component 2: Mean relative difference: 0.4759783"

When reading the RNG section of ?doFuture::doFuture, one discovers that the doFuture does not support reproducible parallel streams. Instead, the doRNG package is suggested for reproducible parallel foreach calls.

When using doFuture::registerDoFuture(), setting the “L’Ecuyer-CMRG” has no effect.

library("doFuture")

# register parallel backend
registerDoFuture()
plan(multisession, workers = 2)

set.seed(123, "L'Ecuyer-CMRG")
r15 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-2') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
set.seed(123, "L'Ecuyer-CMRG")
r16 = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-1') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the foreach() iterations ('doFuture-2') unexpectedly generated random numbers without declaring
## so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To fix this,
## use '%dorng%' from the 'doRNG' package instead of '%dopar%'. This ensures that proper, parallel-safe random numbers are produced
## via the L'Ecuyer-CMRG method. To disable this check, set option 'future.rng.onMisuse' to "ignore".
all.equal(r15, r16)
## [1] "Component 1: Mean relative difference: 0.8583278" "Component 2: Mean relative difference: 0.5375348"

Reproducible

Instead, one needs to add a call to doRNG::registerDoRNG() to initialize the parallel RNG stream.

library("doFuture")
library("doRNG")

# register parallel backend
registerDoFuture()
plan(multisession, workers = 2)

registerDoRNG(123)
r15a = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}

registerDoRNG(123)
r16a = foreach(i = rep(3, 2)) %dopar% {
  runif(i)
}
all.equal(r15a, r16a)
## [1] TRUE

callr backend

The “callr” backend luckily functions in the same way as the “multicore” one from the future package.

Not reproducible

library("future.callr")
library("future.apply")

# register parallel backend
plan(callr, workers = 2)

set.seed(123456, kind = "Mersenne-Twister") # R default
r17 <- future_lapply(rep(3, 2), runif)
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-2') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
set.seed(123456, kind = "Mersenne-Twister") # R default
r18 <- future_lapply(rep(3, 2), runif)
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-1') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
## Warning: UNRELIABLE VALUE: One of the 'future.apply' iterations ('future_lapply-2') unexpectedly generated random numbers without
## declaring so. There is a risk that those random numbers are not statistically sound and the overall results might be invalid. To
## fix this, specify 'future.seed=TRUE'. This ensures that proper, parallel-safe random numbers are produced via the L'Ecuyer-CMRG
## method. To disable this check, use 'future.seed = NULL', or set option 'future.rng.onMisuse' to "ignore".
all.equal(r17, r18)
## [1] "Component 1: Mean relative difference: 0.3862001" "Component 2: Mean relative difference: 0.8865892"

Reproducible

library("future.callr")
library("future.apply")

# register parallel backend
plan(callr, workers = 2)

set.seed(123456, "L'Ecuyer-CMRG")
r19 <- future_lapply(rep(3, 2), runif, future.seed = TRUE)

set.seed(123456, "L'Ecuyer-CMRG")
r20 <- future_lapply(rep(3, 2), runif, future.seed = TRUE)

all.equal(r19, r20)
## [1] TRUE

Summary

With all the examples shown above, it should be clear now that care needs to be taken to achieve reproducibility in R in parallel scenarios. Approaches differ not only by mode but also by packages interfacing the modes.

Sometimes package authors might take care of this internally but more often not. In the end it is your own responsibility to check your code for its reproducibility.

This post does not compare the strong and weak parts of all parallel modes shown. If you are curious about this and wonder which backend you should go with in the first place, I can recommend the overview vignette of the future package. However, do not expect that this document will tell you “the” best mode. There is no such thing (currently). ;)