This article will help the programmers experienced in Java, Python or PERL to program R in a way that will make use of R:s strength and avoid its weaknesses. This article was published in Software Developers Journal vol 4, 2012.
While R has many datatypes, the vector holds a special place in R programming. A vector is a list of elements of the same type, where the type can be logical, numerical or made out of strings. A vector in R is like a list in PERL, but with a few important differences:
Vectors in R can have undefined elements, which is represented by NA
as in "Not available", which as we will see, is a useful feature.
Assignment can be done in R either by a single equation sign "="
or by the left arrow "<-"
, in this article I will use "<-"
. To create a vector, the function c()
is helpful since it concatenates all its arguments to a vector. Dots are allowed in variable names, and I use them as substitute for spaces which are not allowed.
a.logical.vector <- c(TRUE, TRUE, NA, FALSE) a.string.vector <- c("tree" "cow" "running", NA, "") a.numerical.vector <- c(1, 0, 4.5667, 2, NA, 5)
There are many other ways to create vectors, for example the ":" operator and the seq()
function. The colon operator generates a sequence, that is a numerical vector, eg.
another.vector <- 4:9
print(another.vector)
[1] 4 5 6 7 8 9
The [1]
in the output above is not part of another.vector
, its indicating that another.vector
has only one dimension.
The reasons that the vector holds a special place in R programming are two. Firstly, functions generally accepts vectors as input as well as single values. To understand this, it might help to conceive of single values as vectors with length 1. Secondly, function invocation is slow, but execution is fast. Giving functions vectors as input helps keeping the code elegant, but it also makes execution faster. Elegant and fast code is the promise, giving up some uses of for
is the price. But once you understand how it works, I think you will find that those uses of for
was rather clumpsy anyway.
The for
loop in R takes two parameters, a name of the temporary variable of the loop and a vector of values that will be assigned to the temporary variable. (Actually, this is exactly what the foreach construct does in PERL.)
When a counter is needed, the colon operator is handy, as in the following example:
for(i in 1:4){ print(i*i) } [1] 1 [1] 4 [1] 9 [1] 16
This is a perfect example of the kind of for
use that you have to give up, and replace with the much more elegant and fast snippet:
i <- 1:4
print(i*i)
[1] 1 4 9 16
What happens in the second version here is that the multiplicator operator *
operates on the vectors (1, 2, 3, 4), (1, 2, 3, 4), and returns (1, 4, 9, 16) which then print()
, prints out. Note that both *
and print()
are only called once in the second version, but in the first version they are both called one time for each iteration of the loop, which is indicated of the four [1]
printed by print()
. Now this was a very short example, so we cannot expect any differences in execution time, but the advantage when it comes to compact code is there. To generalize, the method with which for
was substituted in the second version was a vectorized operator/function, i.e. a function or operator that accept a vector as input. Most built-in functions and operators in R are vectorized, and using for
to apply a built-in function on a set of values is, not only clumpsy, but also slower.
The second use of for
to be avoided is conditional processing of items according to a simple condition, where simple means "not dependent on the item to be evaluated". For example, let us imagine that you need to single out all names that includes a certain character, let's say "r", from a vector of names. Having the character "r", or not, is a simple rule, the meaning of the rule does not depend on the item that the rule is to be applied on.
In PERL something like the following might get the task done:
my @names = ("Alice", "Bob", "Caesar", "Demeter"); foreach (@names) { print if $_ =~ /r/ }
To solve the task in R, we need to understand indexing of vectors. Consider the following vector of names:
my.names <- c("Alice", "Bob", "Caesar", "Demeter")
To get a new vector consisting of element 1 and 4 in my.names
, you simply use my.names[c(1, 4)]
.
my.names[c(1, 4)]
This feature can be used to print (or otherwise process) only the elements that fulfills a simple condition, as shown below.
my.names <- c("Alice", "Bob", "Caesar", "Demeter") print(my.names[grep("r", my.names)]) [1] "Caesar" "Demeter"
As you can see from the number of [1]
printed, print()
was only called once, as was grep()
. The PERL solution relied on one list, mentioned twice, and the invisible $_
variable, which was implicitly used eight times - once for print and once for the pattern matching operator ~=
for each element. The R solution required my.vector
to be explicitly mentioned three times: at creation, as argument to grep()
and as argument to print()
. However, my.vector
never changed value(s), it remained constant unlike the invisible variable $_
in the PERL solution which took on different values for each iteration of the foreach
loop.
As previously stated, function invokation is expensive in R but execution is fast, so let us compare the computation time needed to solve the "find the names with 'r' in them" task using for
and indexing respectively. Remeber that for
in R works like foreach
in PERL, but requires a name for the temporary variable, here I will use "the.name". To measure execution time, the built-in function system.time()
can be used. To get enough elements in the vector for the differences in computation time to be measurable, I use two lists of names from http://www.listofnames.info, and sample, with replacement, 100.000 names from that list, which gives a list of 100.000 names. The number of unique names is 4.219.
unique.names <- c( readLines(url('http://www.listofnames.info/download/male_names.txt')), readLines(url('http://www.listofnames.info/download/female.txt')) ) my.names <- unique.names[sample(length(unique.names), 100000, replace = TRUE)]
Since printing many names will take quite some time, and seldom is needed in real world problems, we will save the names in a new vector instead. In the first version, this vector needs to be instantiated before the for
loop begins.
save.results.here <- vector() system.time( for(the.name in my.names){ if(length(grep("r", the.name)) > 0){ save.results.here <- c(save.results.here, the.name) } } ) user system elapsed 50.728 1.216 61.600 print(length(save.results.here)) [1] 41565
In other words, in a vector of 100.000 names, R found the letter "r" in 41.565 of them and the computation time was 50 seconds. If you wonder about the length() in the if-statement, it is needed since when there is no match, grep()
returns a zero length vector, rather than the value zero.
Now what about the indexed version?
system.time( save.results.here <- my.names[grep("r", my.names)] ) user system elapsed 0.116 0.000 0.125 print(length(save.results.here)) [1] 41565
Now, that took only 0.116 seconds! To understand the huge difference in computation time needed by these two solutions to the problem, consider that in the for
solution, two functions was invoked 100.000 times: grep()
and c()
and in R function invocation is expensive. In the indexed version, grep()
was only invoked once.
This far, we have seen two cases where for
loops can and should be avoided: vectorized functions, and conditioning on a simple rule. But what about problems that require conditioning on complex rules? I will show you how to avoid for
in conditioning on two kinds of complex rules:
As an example of rules of the first kind, consider the following task: From a long list of names, pick the names which have more vowels than consonants. When we create a test for this rule we must know the number of contants in the word in question. That does not mean that we need to iterate the list of names, for each word first count the number of consonants in the word and then count the number of vowels in that word - we can count the number of consonants in all words in one operation first, and count the numbers of vowels in a second operation, and in a third operation we can pick the elements which fulfills the condition (having more vowels than consonants). It is helpful to imagine the operation as adding columns to a table. We start with a table of one column, the names. The results of the first operation is stored as the second column, and the second operation adds another column. The third operation select the rows in which the value of the third column is greater than the value of the second column, and uses this selection as index for the vector of names.
To get the number of consonants in a word I use the function gregexpr(), which returns the positions of all matches. For the string "Tyrone" it would, if I ask for consonants only, return the vector (1, 3, 5), which is the postions of "T", "r" and "n" respectively. I am just interested in getting the length of this vector. When gregexp()
is given a vector of length one, it returns a single vector which could be used as input to the length()
function. But we will put a vector of names into gregexp()
, not a single name, and we will therefore get a list of vectors as output. To record only the length of these vectors we need to use the function sapply()
. As sapply()
is a very commonly used function in R, it is worthwhile to explain the how it works.
sapply()
takes a vector as input and feeds each item in this vector to a function. As output sapply()
returns a vector of whatever values the function in question returned. Here is an example, which records the number of characters for each element of a string vector. The function nchar()
returns the number of characters in a string.
short.list.of.names <- c("Alice", "Bob", "Caesar", "Demeter") my.lengths <- sapply(short.list.of.names, nchar) print(my.lengths) Alice Bob Caesar Demeter 5 3 5 7
Some readers might ask "if functions in R are vectorized does this not hold for nchar()
as well?" Yes it does, and in a way this is a rather bad example of the power of sapply()
, since the following would have worked equally well (actually, it would have worked even better).
print(nchar(short.list.of.names)) [1] 5 3 5 7
sapply()
makes better use on lists of vectors, but these are slightly more complex to illustrate, which is why I started with a more simple example. Here is an example of how sapply()
is used to count the length of the vectors in a list of vectors:
one.vector <- c(1, NA, 2) another.vector <- c("Alice", "Bob", "Caesar", "Demeter") a.third.vector <- c(TRUE, FALSE, FALSE, NA, FALSE, FALSE) my.list <- list(one.vector, another.vector, a.third.vector) my.lengths <- sapply(my.list, length) print(my.lengths) [1] 3 4 6
We are now ready to take on the task of finding the names with more vowels than consontants. And let's time the computation time required too.
system.time({ nr.of.consonants <- sapply(gregexpr('[b|c|d|f|g|h|j|k|l|m|n|p|q|r|s|t|v|w|x|z]', my.names, ignore.case = TRUE), length) nr.of.vowels <- sapply(gregexpr('[a|e|i|o|u|y]', my.names, ignore.case = TRUE), length) these.names <- my.names[which(nr.of.vowels > nr.of.consonants)] print(length(these.names)) } ) [1] 20052 user system elapsed 11.977 0.228 14.882
On my system, R took 12 seconds to find 20.052 names with more vowels than consonants out of a list of 100.000 names. While we can imagine the method used in this example as a table with new columns added to it, there is no need to actually build a table structure - all the three vectors in use are of the same length and the greater than operator ">"
understands that it should compare the two vectors nr.of.consonants and nr.of.vowels element by element since it is a vectorized operator. The function which()
returns a vector of the positions that evaluated to TRUE. If you want to check that R really has found the right names, we can inspect the first 10 names returned, with the function head()
.
head(these.names, 10) [1] "giana" "Isaac" "ileane" "xenia" "Joey" "cayla" "ailee" "augusta" "livia" "daria"
The last strategy for nice and fast R programming concerns conditioning on rules that depend on the relation between elements in the vector. I will give two examples of this. First, consider the task of picking the elements which occur at least n-times in the vector. The relation in question would then be identity, which elements are identical
to at least n elements in the vector.
While R has a built-in function table()
which indeed would be helpful in getting this particular task done, I would like to show a general way of handling this class of problems, that is, problems not confined to the relation of identity.
Before I go into the details of the solution, I would like to take the opportunity to discuss a mind-set that I used to have when PERL was my main programming language. I call it the row-oriented mind set. You have a some kind of tabular data, orginating either from a file where each line can be seen as row in a table, or from a database query. You process the data one row at a time until all rows are dealt with. This means that you will have at least as many function calls as there are rows in your data, often two function calls or more per row.
Since it is expensive to invoke functions in R, the row oriented mind set works very bad in R, as I have been trying to show in this article. Instead you need a column oriented mind set in order to be efficient with R.
From within the row oriented mind set, the problem of finding elements with at least n copies in a vector naturally seeks a solution in the line of "for each unique value, make a test which reveals the number of copies of that value". That solution requires as many function invokations as there as unique values.
When you are in the column oriented mind set, you naturally seek a solution in the lines of "create a column of all unique values, this column represents the first match of all unique values. And then add another column of the same length where the values that have at least two matches in the vector is represented, the values that only occured once in the vector will have NA in the second column. Add a third column for the values that have at least three occurences in the vector. And so on. This solution will take the form a matrix with as many columns as there are matches for the value(s) that has most occurences in the vector. This particular solution can be problematic if there are extremely many occurances of a single item, the memory requirements can be a problem if the number of unique values are big. An improved solution within the column-based mind set would be to store a list of vectors rather than a matrix, since such a list can have vectors of different length, and for each iteration we would only need to store the values that have this or more occurences.
Our particular task gave us information beforehand on how many columns a column based solution would need, we were given n as part of the task, and we could take that information into account when we decided to go for a matrix based or a list of vectors based solution. If n is small, then there will be no memory problems with the matrix solution, if n is big, say 1.000, we would anticipate memory problems with a matrix of 1.000 colmns and go for the list of vectors instead.
If we were to build a generic table function, we would not know beforehand the number of columns needed in the matrix solution, so it would be safer to go with the list of vectors. Here I will give a solution based on list of vectors. It turns out there is not much to gain by limiting the solution to n occurances, we still need the foundations of what would easily be turned into a generic table function, we would only hard-code a threshold when the function would stop looking for further occurances, which seems rather ugly and not reusable - so why not build a generic table function then?
Two new functions are introduced in this example match()
and na.omit()
. match()
takes two vectors as input and returns a vector of the same length as the first vector. The contents of returned vector is the position of the first match when the value of each of the elements in the first vector is search for in the second vector. When there is no match, NA is returned instead of a position, but here we are not interested in which elements of the first vector that did not occur in the second, so we remove them by na.omit()
.
the.unique.values <- unique(my.names) not.matched.yet <- my.names list.of.vectors.of.pointers <- list() system.time({ while(length(not.matched.yet) > 0){ ## First, find out which values are still left in not.matched.yet ## use the %in% operator which return true or false. there.are.more.of.these <- the.unique.values %in% not.matched.yet ## Record the matches (add a vector to the list of vectors) list.of.vectors.of.pointers <- c(list.of.vectors.of.pointers, list(there.are.more.of.these)) ## Now, when we have accounted for these occurences, we would like to remove ## one occurency in not.matched.yet, we find the positions of the elments to ## be removed with the match() function. ## Unlike the %in% operator, the match() function returns the position of the ## first match. remove.these <- na.omit(match(the.unique.values[there.are.more.of.these], not.matched.yet)) ## Remove the elements that are now matched not.matched.yet <- not.matched.yet[-remove.these] } ## Almost done, now build the final table by processing each column. ## But first, setup a vector of appropriate length to hold the results the.result <- vector(length = length(the.unique.values)) for(no.of.occurances in 1:length(list.of.vectors.of.pointers)){ ## most elements of the.result will be overwritten many times, but that's ## still cheap the.result[list.of.vectors.of.pointers[[no.of.occurances]]] <- no.of.occurances } }) user system elapsed 1.416 0.004 1.455
This generic structure can be used for any conditioning which depends on relations between elements in the vector. To solve the task of picking out the element of at least n occurences, we could substitute while
loop with a for
loop from 1 to n.
The same strategy can be used when relations are between different vectors. For a truly recursive problem - which is the second and last example on conditioning on complex rules - consider a vector of unique names, and a second vector representing the name of the parent (NA if the parent is not in the set). Since a node can have several children, some elements in "name of parent" are not unique. Here are the first five lines in the set.
name name.of.parent [1,] "Austin" "Kory" [2,] "Nigel" "Samual" [3,] "Ismael" "Frankie" [4,] "Reinaldo" "Douglass" [5,] "Frankie" "Delmer"
Now, consider the task: pick out the names in the set which has grand children nodes eight generations away. Within a row-based mind set, one would follow each name until its chain of parents was eight, or the chain was terminated by NA, in case we would discard the chain and all the names in it. From a column based mind set, we use one function call to get all parents of the first order parents, and then a second function call to get all parents of the second order parents, and so on until we reached eight generations back. This time we know that the answer will require only nine vectors, and we can setup a matrix in which to store the results, instead of a list of vectors. The function cbind()
is used to combine vectors of equal length to a table.
## create a column for each generation (first order parents, second order ## parents, and so on) until the eighth order. Since we already have the ## first order given, we need to add seven columns. the.results <- cbind(name, name.of.parent) current.parents <- name.of.parent for(i in 1:7){ new.parents <- name.of.parent[match(current.parents, name)] ## Record the parents (add a vector to the list of vectors) the.results <- cbind(the.results, new.parents) ## Next iteration, look for the parents of these parents current.parents <- new.parents } ## all names in the last column, column no 9, called "1st gen", fulfill the condition the.answer <- unique(na.omit(the.results[,9])) ## colnames(the.results) <- c("9th gen","8th gen","7th gen","6th gen","5th gen", "4th gen","3rd gen","2nd gen","1st gen")
9th gen 8th gen 7th gen 6th gen 5th gen 4th gen 3rd gen 2nd gen 1st gen [1,] "Omar" "maudie" "Alexis" "nari" "Freeman" NA NA NA NA [2,] "Avery" "glory" "rosabelle" "shena" "beryl" "jessa" NA NA NA [3,] "tessie" "Young" "lynna" NA NA NA NA NA NA [4,] "corly" "ilysa" NA NA NA NA NA NA NA [5,] "ilene" NA NA NA NA NA NA NA NA [6,] "mollie" "abagael" "shauna" "Elijah" "iolande" "concordia" "carlin" "shena" "beryl" [7,] "dacy" "daisey" "Eusebio" "Filiberto" "carma" "karylin" "Gordon" "carol" "eadie" [8,] "lynn" "Moises" "raine" "Antonia" "doretta" "elisha" "Guillermo" "debbie" "minnnie" [9,] "dosi" "michaelina" "ella" "Vern" "tersina" "debbie" "minnnie" "jessa" NA [10,] "emmeline" "guinna" "goldarina" "carmelina" "fidela" "Jermaine" "carol" "eadie" "jessa"
The is nothing wrong with the for
loop, on the contrary you may well use it in R. However, care must be taken to avoid to use it to loop through elements, one at a time. To be efficient in R, you must process many elements in a single operation (function call). In this article I have explained three general strategies to process many elements in a single operation:
If you use these strategies, you can minimize the number of function invocations, which will make your code execute fast in R.