knitr uses lazy evaluation of its cache, which is why compilation of documents that present tables or graphs based on analysis of Big data can be blisteringly fast. If you want to know how much slower your documents would compile without this feature, try putting
opts_chunk$set(cache=TRUE, cache.lazy=FALSE)
in the first chunk of your document.
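For example, a setup chunk along these lines would do (the chunk label and the include=FALSE option are just suggestions, not something the comparison requires):

<<setup, include=FALSE>>=
## Keep caching on, but load cached objects eagerly instead of lazily,
## so compilation times with and without lazy loading can be compared.
library(knitr)
opts_chunk$set(cache = TRUE, cache.lazy = FALSE)
@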
When a chunk that includes explicit parallelism - e.g. a call to parallel::mclapply()
- is executed, the current R process is forked. If it were not for the lazy evaluation of the cache, the current R process would have all previously created objects in its environment, and the forks would occupy A LOT of RAM for no good reason at all.
But since knitr
does use lazy evaluation of the cache, the forks do not have to be bloated. They can be, though, if you are not careful, and that is why I wrote this article: to show how you can avoid bloated forks.
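To see why a fork can become bloated, consider this toy illustration (the object and the numbers are made up and have nothing to do with the analysis below): everything in the parent process is visible inside every forked worker.

<<fork.demo, eval=FALSE>>=
## A large object in the parent R process ...
big.obj <- rnorm(1e8)   # roughly 800 MB
parallel::mclapply(1:4, function(i) {
  ## ... is inherited by every forked worker, so with several workers and
  ## several such objects the combined memory footprint grows quickly.
  mean(big.obj)
}, mc.cores = 4)
@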
Consider this scenario: after a series of manipulations of a Big data object, a calculation that can be done in parallel becomes possible. This particular calculation relies only on a subset of the variables in the Big data object, so there is no need to have the whole Big data object in the forked environments.
First, let us consider the naïve approach, where my.df
is the Big data object and my.dt
is the subset needed for the calculations made in parallel.
<<naïve>>=
library(data.table)
my.dt <- data.table(my.df[which(my.df$Hushållets.inkomst > 0), c(1, 2, 10, 11, 26)],
                    key = c('År', 'SAMS.Område', 'Hushållets.inkomst'))
good.households <- sapply(unique(my.dt$År), function(år) {
  parallel::mclapply(unique(my.dt[År == år, SAMS.Område]), function(sams) {
    foo <- my.dt[År == år & SAMS.Område == sams,
                 .(indhus = sum(Individens.inkomst),
                   my.n = length(Individens.inkomst),
                   identifier = paste(år, sams, Hushållets.inkomst, sep = ".")),
                 by = Hushållets.inkomst]
    foo$identifier[which(foo$Hushållets.inkomst >= foo$indhus)]
  }, mc.cores = 4, mc.preschedule = FALSE)
})
@
The process forked by parallel::mclapply()
will contain a copy of my.df
, because my.df is in the global environment when mclapply
is invoked.
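If you want to see what the forks will inherit, a check along these lines can be run just before the parallel chunk (this snippet is only an illustration, not part of the workflow itself):

<<inspect.globalenv, eval=FALSE>>=
## List the objects in the global environment together with their
## approximate sizes; all of them are visible to the forked workers.
sapply(ls(globalenv()),
       function(x) format(object.size(get(x, envir = globalenv())), units = "MB"))
@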
The enhanced version goes like this: one chunk to create the subset, another to run calculations in parallel.
<<make.dt>>=
## Refer to (i.e. use) my.df only in this chunk, not in the chunk that runs
## in parallel below.
library(data.table)
my.dt <- data.table(my.df[which(my.df$Hushållets.inkomst > 0), c(1, 2, 10, 11, 26)],
                    key = c('År', 'SAMS.Område', 'Hushållets.inkomst'))
@

<<good.households>>=
## This chunk can be executed in parallel without having to load my.df into RAM.
library(data.table)
good.households <- sapply(unique(my.dt$År), function(år) {
  parallel::mclapply(unique(my.dt[År == år, SAMS.Område]), function(sams) {
    foo <- my.dt[År == år & SAMS.Område == sams,
                 .(indhus = sum(Individens.inkomst),
                   my.n = length(Individens.inkomst),
                   identifier = paste(år, sams, Hushållets.inkomst, sep = ".")),
                 by = Hushållets.inkomst]
    foo$identifier[which(foo$Hushållets.inkomst >= foo$indhus)]
  }, mc.cores = 4, mc.preschedule = FALSE)
})
@
But there is more: let knitr
evaluate all chunks up to and including make.dt
before you add the chunk good.households
, because otherwise my.df
will still be in the global environment when the evaluation of good.households
starts. To get the slimmed forks, we want knitr
to cache my.df
but not to load that cached copy when good.households
is about to start, and this can be achieved by running knitr
on a copy of the document where good.households
is either absent or deactivated. After that initial run, put the chunk good.households
into the document (or activate it if you deactivated it).
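As a sketch of that two-pass workflow (the file names are hypothetical, and setting eval=FALSE on good.households is just one way of deactivating it):

<<two.pass, eval=FALSE>>=
## First pass: knit a copy in which good.households is absent or carries
## eval=FALSE, so that my.df (and my.dt) end up in the cache.
knitr::knit("analysis-without-good-households.Rnw")
## Second pass: knit the full document; my.dt is loaded lazily from the
## cache, while my.df stays out of the global environment and the forks.
knitr::knit("analysis.Rnw")
@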