Reproducible research: smart caching, parallelism and pitfalls

Problem: Research is reproducible if and only if the researcher has a documented way to arrive at the results from the data. Since the advent of cheap computers, researchers can make their research more reproducible than ever, because if the researcher can use a computer program to go from data to results, the program itself is the documentation of the process: others can verify the results by inspecting the program, re-run it on the same data, or re-run it on other data. This is all good, so what is the problem to be solved? When doing research I often work with parts of data that arrive at different times, and the problem is a technical one: how to code the program so that it can be re-run efficiently, re-computing only what is affected when new or changed data arrive?

My tool of choice for literate programming, knitr, has advanced features that let you build a system which automatically invalidates the cache when the underlying data have changed.

Smart caching

knitr has an option called cache.extra which takes a custom R expression. At run time knitr evaluates the expression and stores the value as a condition for the cache to be valid. Every time knitr processes the document, the expression is evaluated again, and if the new value is not identical to the stored one, the cache is considered invalid and the chunk is re-run. By embedding list.files() in a call to file.info() you can turn knitr into a make-like utility (the idea was first published here: https://github.com/yihui/knitr/issues/238).

library(knitr)
# Invalidate the cache whenever any file under my.path changes on disk.
my.path <- "/home/Datasets/"
opts_chunk$set(results = 'markup', cache = TRUE, echo = TRUE, warning = TRUE,
               cache.extra = file.info(paste(my.path, list.files(my.path, recursive = TRUE),
                                             sep = ""))$mtime,
               autodep = TRUE, tidy.opts = list(width.cutoff = 100))

If you have source files in sub-directories of my.path, do not remove the recursive = TRUE argument to list.files().

In the example above, the cache.extra option is set globally, so all subsequent chunks will be re-run whenever a file is added, changed, or removed under the directory defined by my.path.
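You can also set cache.extra in a single chunk's header, so that only that chunk is tied to a particular file. A minimal sketch, assuming an R Markdown document and a hypothetical data file survey.csv:

```{r load-survey, cache=TRUE, cache.extra=file.info("/home/Datasets/survey.csv")$mtime}
# Re-run only when survey.csv changes; the other chunks keep their caches.
survey <- read.csv("/home/Datasets/survey.csv")
```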

Parallelism

For the time being, knitr does not automatically run independent chunks in parallel (the design of knitr makes implicit parallelism prone to race conditions; see https://github.com/yihui/knitr/issues/744#issuecomment-39404306). So, in order to minimize computation time on multicore processors, you have to use explicit parallelism.

This is one area where I have noticed that the reproducibility requirement makes the analysis process a bit inflexible. The problem is that you cannot change part of a chunk without getting the whole chunk re-evaluated, and you cannot get two (or more) chunks to run in parallel. Consider, for example, the following two chunks.

# Each entry pairs a model formula with the name of the data subset to fit it to.
my.models <- list(
  c("deprived ~ 1 + (1|hid) + (sex | country/religion)", "my.subset.no.sing"),
  c("deprived ~ 1 + (1|hid) + (sex | country:religion)", "my.subset"),
  c("deprived ~ 1 + (1|hid) + (sex | country/religion)", "my.subset.only.mixed"),
  c("deprived ~ 1 + (1|hid) + (sex | country:religion)", "my.subset.only.mixed")
)

The next chunk fits the models to the data in parallel and saves the resulting objects to files. We do not want the resulting HUGE objects in the global environment, so each worker returns NULL and saves its fitted object to disk for later processing in another chunk.

library(parallel)  # mclapply() now lives in parallel; the multicore package is deprecated
library(lme4)

mclapply(seq_along(my.models), function(i) {
  my.fit <- glmer(formula = as.formula(my.models[[i]][1]),
                  data = get(my.models[[i]][2]),
                  family = binomial(link = "logit"))
  save(my.fit, file = paste("Religion.fit.", i, sep = ""))
  NULL  # keep the large fitted object out of the returned list
})

Now, if you change one of the models, the whole chunk's cache is invalidated, so all the unchanged models are re-run as well. The same happens if you remove one of the models. So, if you find that you do not need one of your models, you have to leave it there, or the remaining models will be re-run.
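One way to soften this restriction is to cache each fit to its own file, named after the model specification rather than its position in the list; unchanged models are then simply skipped. A sketch, assuming the digest package is available for hashing:

library(parallel)
library(lme4)
library(digest)  # assumption: used here only to hash each model specification

mclapply(seq_along(my.models), function(i) {
  spec <- my.models[[i]]
  # Key the file on the content of the specification, not on the index i,
  # so adding or removing other models does not trigger a refit.
  fit.file <- paste("Religion.fit.", digest(spec), sep = "")
  if (!file.exists(fit.file)) {
    my.fit <- glmer(formula = as.formula(spec[1]), data = get(spec[2]),
                    family = binomial(link = "logit"))
    save(my.fit, file = fit.file)
  }
  NULL
})

The chunk itself is still re-evaluated whenever my.models changes, but the expensive glmer() calls are skipped for every specification whose fit file already exists on disk.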

Avoiding meaningless invalidation of the cache

For the same reason, a seemingly innocent change of the global options - the opts_chunk$set() used in the first chunk - say from "warning=TRUE" to "warning=FALSE", would cause all chunks to be re-run. If the total computation time for all chunks is, say, 12 hours, such a change will effectively stop you from getting a compiled version of the document for the next 12 hours. Don't do that when you are near a deadline for submitting your article... Learn what global options you need at the beginning of the work process, and stick to them!
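If you need a different option for just one chunk, a safer pattern is to override it locally in that chunk's header, since that invalidates only that chunk's cache. A minimal sketch, assuming an R Markdown document and a hypothetical chunk name:

```{r model-summary, warning=FALSE}
# Overriding warning=FALSE here changes only this chunk's cache key;
# the expensive cached chunks above are left untouched.
load("Religion.fit.1")
summary(my.fit)
```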
