Sum by a grouping factor

(Or Why tapply() can be infinitely faster than a (bad) for() loop)

Real world problem: For each individual in a dataset, calculate the sum of - some variable - for the other members in the household of that individual.

The functions to use: tapply(), merge()

  1. tapply will return, for each household, the sum of - some variable - for all members of that household.
  2. merge can be used to project these results onto each individual (one sum -> all members)
  3. then subtract the individual contribution from the group sum to get the sum of the other members.
set.seed = 2
my.df <- data.frame(income=sample.int(5, 10, TRUE),
                    household.id=(LETTERS[sample.int(3, 10, TRUE)]))
my.df <- my.df[order(my.df$household.id),]

This creates a data frame with the following contents

my.df
   income household.id
3       2            A
4       5            A
6       5            A
7       3            A
2       2            B
9       4            B
1       1            C
5       4            C
8       5            C
10      5            C

Now, we can start with tapply(), which will return this:

tapply(my.df$income, my.df$household.id, sum)
 A  B  C
15  6 15

Save these results in a vector.

my.results <- tapply(my.df$income, my.df$household.id, sum)

To be able to use merge(), we need to make a data frame out of this vector, and store the names as household.id.

my.temp.df <- data.frame(household.income = my.results,
                         household.id = names(my.results))
my.temp.df
  household.income household.id
A               15            A
B                6            B
C               15            C

Now, we can project back the results with merge().

my.df <- merge(my.df, my.temp.df)

Inspect the results:

my.df
   household.id income household.income
1             A      2               15
2             A      5               15
3             A      5               15
4             A      3               15
5             B      2                6
6             B      4                6
7             C      1               15
8             C      4               15
9             C      5               15
10            C      5               15

Looks good, now calculate the difference, and remove the column household.income, which was only of use for the calculation of income.of.others.

my.df$income.of.others <- my.df$household.income - my.df$income
my.df <- my.df[-grep("household.income", names(my.df))]
my.df
   household.id income income.of.others
1             A      2               13
2             A      5               10
3             A      5               10
4             A      3               12
5             B      2                4
6             B      4                2
7             C      1               14
8             C      4               11
9             C      5               10
10            C      5               10

The following snippet gives the same result, but is hopelessly inefficient.

my.df$income.of.others <- 0
for(i in 1:nrow(my.df)){
  for(j in which(my.df$household.id == my.df$household.id[i])){
    if(j != i){
    my.df$income.of.others[i] <- my.df$income.of.others[i] + my.df$income[j]
    }
  }
}
## Calculate differences in computation time
set.seed = 2
number.of.cases = 10000
number.of.groups = 3800
my.df <- data.frame(income=sample.int(5, number.of.cases, TRUE),
                  household.id=rep(1:number.of.groups,
                  ceiling(number.of.cases/number.of.groups))[1:number.of.cases])
my.df.bak <- my.df

my.for.solution <- function(){
my.df$income.of.others <- 0
for(i in 1:nrow(my.df)){
  for(j in which(my.df$household.id == my.df$household.id[i])){
    if(j != i){
    my.df$income.of.others[i] <- my.df$income.of.others[i] + my.df$income[j]
    }
  }
}
return(my.df)
}

my.tapply.solution <- function(){
  my.results <- tapply(my.df$income, my.df$household.id, sum)
  my.temp.df <- data.frame(household.income = my.results,
                           household.id = names(my.results))
  my.df <- merge(my.df, my.temp.df)
return(my.df)
}

system.time(my.for.solution())
   user  system elapsed
 25.081   0.264  29.509

system.time(my.tapply.solution())
   user  system elapsed
  0.124   0.000   0.156

comments powered by Disqus


Back to the index

Blog roll

R-bloggers, Debian Weekly
Valid XHTML 1.0 Strict [Valid RSS] Valid CSS! Emacs Muse Last modified: oktober 17, 2019