(Or Why tapply()
can be infinitely faster than a (bad) for()
loop)
Real world problem: For each individual in a dataset, calculate the sum of - some variable - for the other members in the household of that individual.
The functions to use: tapply()
, merge()
tapply
will return, for each household, the sum of - some variable - for all members of that household.merge
can be used to project these results onto each individual (one sum -> all members)set.seed = 2 my.df <- data.frame(income=sample.int(5, 10, TRUE), household.id=(LETTERS[sample.int(3, 10, TRUE)])) my.df <- my.df[order(my.df$household.id),]
This creates a data frame with the following contents
my.df income household.id 3 2 A 4 5 A 6 5 A 7 3 A 2 2 B 9 4 B 1 1 C 5 4 C 8 5 C 10 5 C
Now, we can start with tapply()
, which will return this:
tapply(my.df$income, my.df$household.id, sum) A B C 15 6 15
Save these results in a vector.
my.results <- tapply(my.df$income, my.df$household.id, sum)
To be able to use merge()
, we need to make a data frame out of this vector, and store the names as household.id
.
my.temp.df <- data.frame(household.income = my.results,
household.id = names(my.results))
my.temp.df household.income household.id A 15 A B 6 B C 15 C
Now, we can project back the results with merge()
.
my.df <- merge(my.df, my.temp.df)
Inspect the results:
my.df household.id income household.income 1 A 2 15 2 A 5 15 3 A 5 15 4 A 3 15 5 B 2 6 6 B 4 6 7 C 1 15 8 C 4 15 9 C 5 15 10 C 5 15
Looks good, now calculate the difference, and remove the column household.income
, which was only of use for the calculation of income.of.others
.
my.df$income.of.others <- my.df$household.income - my.df$income my.df <- my.df[-grep("household.income", names(my.df))]
my.df household.id income income.of.others 1 A 2 13 2 A 5 10 3 A 5 10 4 A 3 12 5 B 2 4 6 B 4 2 7 C 1 14 8 C 4 11 9 C 5 10 10 C 5 10
The following snippet gives the same result, but is hopelessly inefficient.
my.df$income.of.others <- 0 for(i in 1:nrow(my.df)){ for(j in which(my.df$household.id == my.df$household.id[i])){ if(j != i){ my.df$income.of.others[i] <- my.df$income.of.others[i] + my.df$income[j] } } }
## Calculate differences in computation time set.seed = 2 number.of.cases = 10000 number.of.groups = 3800 my.df <- data.frame(income=sample.int(5, number.of.cases, TRUE), household.id=rep(1:number.of.groups, ceiling(number.of.cases/number.of.groups))[1:number.of.cases]) my.df.bak <- my.df my.for.solution <- function(){ my.df$income.of.others <- 0 for(i in 1:nrow(my.df)){ for(j in which(my.df$household.id == my.df$household.id[i])){ if(j != i){ my.df$income.of.others[i] <- my.df$income.of.others[i] + my.df$income[j] } } } return(my.df) } my.tapply.solution <- function(){ my.results <- tapply(my.df$income, my.df$household.id, sum) my.temp.df <- data.frame(household.income = my.results, household.id = names(my.results)) my.df <- merge(my.df, my.temp.df) return(my.df) } system.time(my.for.solution()) user system elapsed 25.081 0.264 29.509 system.time(my.tapply.solution()) user system elapsed 0.124 0.000 0.156