A tricky pitfall in explicit parallelism

Today I ran into some rather unexpected behaviour when exporting variables to a snow cluster in R.

Consider what would happen if the following code were built into a package "foo".

a.func <- function(chunk){
 rep(paste(chunk, collapse = ""), times = my.var)
}
b.func <- function(){
  library(parallel)
  cl <- makeCluster(rep("localhost", times = 2), type = "SOCK")
  my.var <- 13
  clusterExport(cl, "my.var")
  chunks <- list(rep("what is", times = 2), rep("wrong here?", times = 3))
  clusterApply(cl, chunks, a.func)
}

Now, let another program run this:

library(foo)
b.func()

What would happen? Rather unexpectedly, you will get the error:

Error in get(name, envir = envir) : object my.var not found

Which function is failing? It is clusterExport(). It does not help that my.var is defined on the line just above it:

  my.var <- 13
  clusterExport(cl, "my.var")

Why does clusterExport() fail here?

By default, clusterExport() looks for variables only in the global environment. But my.var is not there: it exists "only" in the local environment created by the call to b.func().

This is not a bug. It is the documented behaviour of clusterExport(); the man page says:
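
The lookup can be reproduced without any cluster. The sketch below assumes that clusterExport() fetches each value with get(name, envir = envir), as the error message above suggests. The names lookup(), caller() and local.only.var are hypothetical and exist only for this illustration.

```r
# Hypothetical helper: mimics the lookup assumed to happen on the master,
# with the same default environment as clusterExport().
lookup <- function(name, envir = .GlobalEnv) get(name, envir = envir)

caller <- function() {
  local.only.var <- 13  # lives only in the frame of this call

  # Default lookup starts at the global environment and fails.
  from.global <- tryCatch(lookup("local.only.var"),
                          error = function(e) conditionMessage(e))

  # Passing the current frame makes the variable visible.
  from.frame <- lookup("local.only.var", envir = environment())

  list(from.global = from.global, from.frame = from.frame)
}

res <- caller()
print(res)
```

The first element holds the error message from the failed global lookup, the second the value found in the function's own frame.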

'clusterExport' assigns the values on the master of the variables named in 'list' to variables of the same names in the global environments of each node. The environment on the master from which variables are exported defaults to the global environment.

But is it the behaviour that the user will expect? I say no. I was surprised to find that clusterExport() did not search the current environment for the variables it is asked to export.

To get the behaviour I intended, I had to either modify the calling code and define my.var there or explicitly give clusterExport() the argument "envir = NULL". The former was illogical in the context so I went for the latter:

a.func <- function(chunk){
 rep(paste(chunk, collapse = ""), times = my.var)
}
b.func <- function(){
  library(parallel)
  cl <- makeCluster(rep("localhost", times = 2), type = "SOCK")
  my.var <- 13
  clusterExport(cl, "my.var", envir = NULL)
  chunks <- list(rep("what is", times = 2), rep("wrong here?", times = 3))
  clusterApply(cl, chunks, a.func)
}
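
A variant of the same fix that names the calling frame explicitly is envir = environment(). As far as I can tell, recent versions of R reject NULL as an environment, so this spelling may be the safer one. The sketch below is an assumption-laden rewrite, not the packaged code: it uses the parallel package with a local two-worker cluster from makeCluster(2), and it runs outside a package, so both functions live in the global environment.

```r
library(parallel)

a.func <- function(chunk) {
  # my.var is not defined here; it is found in the worker's global environment
  rep(paste(chunk, collapse = ""), times = my.var)
}

b.func <- function() {
  cl <- makeCluster(2)            # assumption: two local workers
  on.exit(stopCluster(cl))        # shut the workers down on exit
  my.var <- 13
  # Export from the frame of this call instead of the global environment.
  clusterExport(cl, "my.var", envir = environment())
  chunks <- list(rep("what is", times = 2), rep("wrong here?", times = 3))
  clusterApply(cl, chunks, a.func)
}

res <- b.func()
str(res)
```

The call returns a list with one character vector per chunk, each repeated my.var times.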

I suggest that clusterExport() be modified to search the current evaluation environment too. To avoid changing the behaviour of existing code, the global environment should be searched first. Then all currently working code will work exactly as before, while new code can benefit from the new feature.

The main point of such a change is that debugging clusters is pretty time-consuming, and it would be a good thing to save future users that hassle. Also, not having to write
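
Until something like that exists, the proposed search order can be approximated with a small wrapper. This is only a sketch of the idea: findExportEnv() and clusterExportSearch() are hypothetical names, not part of snow or parallel, and the wrapper simply delegates to the existing clusterExport().

```r
library(parallel)

# Hypothetical helper: pick the environment to export from.
# The global environment wins if the name is visible from there,
# so code that works today keeps its behaviour.
findExportEnv <- function(name, envir) {
  if (exists(name, envir = .GlobalEnv)) .GlobalEnv else envir
}

# Hypothetical wrapper around clusterExport(): falls back to the
# caller's frame when the name is not visible from the global environment.
clusterExportSearch <- function(cl, varlist, envir = parent.frame()) {
  force(envir)  # resolve the caller's frame before doing anything else
  for (name in varlist) {
    clusterExport(cl, name, envir = findExportEnv(name, envir))
  }
  invisible(NULL)
}
```

Inside b.func(), the call would then read clusterExportSearch(cl, "my.var"), with no envir argument needed.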

envir = NULL

would be nice.

Last modified: October 12, 2017