Avoid sapply

In 2012, Michael Barton published a post on the functional programming nature of R: http://www.bioinformaticszen.com/post/simple-functional-programming-in-r/

Barton's post has its merits, but two things in it deserve better solutions. Or, put more positively: reading the post the other day made me think about two (unrelated) things, and here is what I learned.

R is slow at function invocation

The first thing is what I usually think of as "R is slow at function invocation". Even if slow function invocation is not really what is at play here or in similar situations, minimizing the number of function invocations is very good general advice.

Barton's example was about creating a binary version of a string vector, which more or less looked like this:

married <- sample(c("Yes", "No", NA), 1E7, replace = TRUE)

I intentionally made this vector rather long, since the point is clearer for long vectors.

Barton's suggestion was:

system.time(married.binary.1 <- sapply(married, function(x) switch(x, Yes = 1, No = 0, NA)))
   user  system elapsed
 44.087   0.188  44.281

The problem here is that the anonymous function is invoked once for every element of the vector, ten million times in this case. For long vectors, the function invocation time matters. A better solution is ifelse(), which takes the whole vector as its argument rather than one element at a time.

system.time(married.binary.2 <- ifelse(married == "Yes", 1, 0))
   user  system elapsed
  5.421   0.220   5.640
identical(as.numeric(married.binary.1), married.binary.2)
[1] TRUE

Here we save about 39 seconds of execution time by avoiding sapply(). For more elaborate ways to avoid processing a vector element by element, see "For and the use of vectors in R".
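When there are more than two labels to translate, nested ifelse() calls get unwieldy. One vectorized alternative, sketched here under my own assumptions (the names codes and married.binary.3 are made up for illustration, and the vector is shorter than above to keep it quick), is to index a named lookup vector with the whole character vector:

```r
set.seed(42)
married <- sample(c("Yes", "No", NA), 1e5, replace = TRUE)

# Hypothetical lookup vector: names are the original labels,
# values are the numeric codes.
codes <- c(Yes = 1, No = 0)

# Indexing by name works on the whole vector at once.
# NA and any label missing from `codes` come out as NA.
married.binary.3 <- unname(codes[married])

# For this data the result agrees with the ifelse() version.
stopifnot(identical(married.binary.3, ifelse(married == "Yes", 1, 0)))
```

Note one difference in behaviour: ifelse(married == "Yes", 1, 0) maps every non-missing label other than "Yes" to 0, whereas the lookup returns NA for labels it does not know.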

Treat data as immutable

The second thing in Barton's post is that he modifies data. "What's the problem, officer?", you might ask. Firstly, treating data as immutable removes the risk that the wrong parts of the data are modified, be it the wrong positions in a vector or even a different object than you intended. Secondly, if you modify data, you introduce state into your program, and then you have to keep track of what has been done.

It is not hard to treat data as immutable; it is just a matter of replacing a few often-used idioms.

Don't modify, create new variables instead

Barton stored the result of the conversion in the same variable, like this:

married <- sapply(married, function(x) switch(x, Yes = 1, No = 0, NA))

Simply find a new name for the result; it is not hard.

married.binary <- ifelse(married == "Yes", 1, 0)

At times (though not very often) it might be tempting to delete a really huge object that you no longer have any use for, in order to save memory. You can get the same effect without deleting anything by placing the code that creates and uses the huge object in a function. When that function returns, the huge object is no longer reachable, so R's garbage collector can reclaim its memory, and you have successfully treated data as immutable.

Are these merely cosmetic concerns? I don't think so. I think that if we program in a way that makes our goal just happen, without us having to ask for it, then we use the language in a better way than if we have to explicitly ask for our goals to happen, i.e. by issuing rm().
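The idea of letting function scope take care of a huge temporary object could look roughly like this. It is only a sketch: summarise.big, big and big.summary are hypothetical names, and rnorm() stands in for whatever really produces the huge object.

```r
# Hypothetical helper: the large vector `big` exists only inside this call.
summarise.big <- function(n) {
  big <- rnorm(n)                    # large temporary object
  c(mean = mean(big), sd = sd(big))  # only the small summary is returned
}

# The caller only ever sees the small result; no rm() is needed.
big.summary <- summarise.big(1e6)
```

After the call, nothing in the workspace refers to the large vector, so its memory can be reclaimed at the next garbage collection (not necessarily the instant the function returns).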

Create new objects with one-liners only

Many times I have initialized a vector, perhaps filled it with NA, and then added the contents:

a <- round(rnorm(1E2), 1)
## Bad version follows
my.results.1 <- rep(NA, length(a))
my.results.1[which(a > 0)] <- 1
my.results.1[which(a <= 0)] <- 0

my.results.1 is modified twice after it has been created, which really should be avoided. There are good functions that do a better job with simple one-liners. ifelse() is one, recode() in the car package is another.

## Good version follows
library(car)
my.results.2 <- recode(a, recodes = "lo:0 = 0; 0:hi = 1")

Once you learn to use tools like recode() and ifelse(), then treating data as immutable becomes the natural thing to do. I learned about ifelse() just recently, and I would love to learn about alternatives to recode() and ifelse(), and more generally about how to treat your data as immutable.

identical(my.results.1, my.results.2)
[1] TRUE
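For this particular example, base R alone also offers one-liners. The following is a small sketch, not part of the comparison above: my.results.3 and my.results.4 are names I made up, and set.seed() is only there to make the run repeatable.

```r
set.seed(1)
a <- round(rnorm(1e2), 1)

# One-liner with ifelse(): 1 where a > 0, otherwise 0.
my.results.3 <- ifelse(a > 0, 1, 0)

# One-liner by coercing the logical comparison: TRUE -> 1, FALSE -> 0.
my.results.4 <- as.numeric(a > 0)

# Both are created in a single step and never modified afterwards.
stopifnot(identical(my.results.3, my.results.4))
```

The coercion trick only works when the target codes happen to be 1 and 0; ifelse() and recode() remain the more general tools.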

For now, I cannot really explain what it is in the functional programming paradigm that appeals to me; it's still a bit blurry. But getting rid of state feels very nice, as if a task (keeping track of state) is abstracted away into nothing.

Last modified: October 12, 2017