At the price of slightly increased memory usage, vectorizing away loops makes R really fast at data manipulation tasks.
Problem: You have a data.frame with 100,000 records (rows) of time series data. Each record holds measurements of a number of variables, recorded at a certain point in time. Records with the same id represent measurements of the same individual. Each record shares its id with zero, one, two or three other records. Your task is to single out the records of individuals for which there are three or more records.
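To make the task concrete, here is a tiny made-up data set of the same shape. The column names `time` and `value` and all the numbers are hypothetical; only the `id` column matters for the problem.

```r
# Hypothetical toy data: 10 records, 4 individuals
toy <- data.frame(
  id    = c(11, 12, 11, 13, 11, 12, 14, 14, 14, 14),
  time  = c(1, 1, 2, 1, 3, 2, 1, 2, 3, 4),
  value = c(5.1, 4.8, 5.3, 6.0, 5.2, 4.9, 7.1, 7.0, 7.3, 7.2)
)

# Individual 11 has three records and individual 14 has four, so these are
# the rows we want to single out; individuals 12 and 13 have too few records.
wanted.rows <- c(1, 3, 5, 7, 8, 9, 10)
toy[wanted.rows, ]
```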
For pedagogical reasons, here is a solution that relies on loops:
    library(Hmisc)

    my.get.cases.with.at.least.x.measurements <- function(x, id) {
      ## position of the first record of each individual
      indices.of.first.records <- which(!duplicated(id))
      ## for each individual, scan the whole id vector and count the matches
      number.of.measurements.per.ind <- sapply(seq_along(indices.of.first.records), function(i) {
        length(which(id == id[indices.of.first.records[i]]))
      })
      ## positions (within indices.of.first.records) of individuals with at least x records
      which(number.of.measurements.per.ind >= x)
    }

    system.time(these.indices.have.at.least.x.measurements <- my.get.cases.with.at.least.x.measurements(3, id))
       user  system elapsed
     24.158   0.172  66.728

That is 24 seconds for 3,000 cases. Now let's try a vectorized version. The strategy is to use match() to find the first match of each id, then set the id of those records to NA, use match() again to get the second occurrence, and so on. Then, start with a fresh copy of the vector of ids and go backwards, this time using %in% instead of match(), setting all records that share an id with four records to NA, and so on. It takes considerably more lines of code, but the execution time becomes negligible.
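The first half of that strategy can be sketched roughly as follows. This is not the original implementation, only a minimal illustration of the match()-and-set-to-NA idea: the function name `count.records.per.id` is made up, ids that are NA are assumed to be ignorable, and the second ("backwards", %in%-based) phase is replaced here by a plain lookup that expands the per-individual counts back to one count per record.

```r
count.records.per.id <- function(id) {
  remaining  <- id                          # working copy; counted records become NA
  unique.ids <- unique(id[!is.na(id)])      # one entry per individual
  counts     <- integer(length(unique.ids))
  repeat {
    ## first remaining occurrence of every individual, found in one call
    pos   <- match(unique.ids, remaining)
    found <- !is.na(pos)
    if (!any(found)) break
    counts[found] <- counts[found] + 1L
    remaining[pos[found]] <- NA             # strike out the records just counted
  }
  ## expand to one count per record (NA for records whose id is NA)
  counts[match(id, unique.ids)]
}

id <- c(1, 2, 1, 3, 1, 2, 4, 4, 4, 4)
records.per.id <- count.records.per.id(id)
which(records.per.id >= 3)                  # rows of individuals with >= 3 records
```

The loop runs once per occurrence level (at most four times in the problem above) instead of once per individual, which is where the speed-up comes from.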
[ more should come here ]
    ## Calculate the number of children per parent, given the positions of all
    ## records that represent children in the relevant category of children.
    ## Requires parents.id and rdata in the current environment.
    calculate.number.of.children <- function(vector.of.positions.of.children) {
      ## select only the first occurrence of each child (remember: this pointer
      ## is relative to vector.of.positions.of.children!)
      pos.unique.children <- which(!duplicated(rdata$RB030[vector.of.positions.of.children]))
      ## RB220 is the father's id, RB230 is the mother's id
      parents.id.from.children <- sort(c(rdata$RB220[vector.of.positions.of.children[pos.unique.children]],
                                         rdata$RB230[vector.of.positions.of.children[pos.unique.children]]))
      number.of.children <- vector("integer", length = length(parents.id))
      temp <- which(parents.id %in% parents.id.from.children)
      ## instead of working with (many) short lists and looping over all parents (thousands),
      ## BE FAST by using (a few) long lists and looping over number of children (about 10, at most).
      while (length(temp) > 0) {
        number.of.children[temp] <- number.of.children[temp] + 1L
        ## strike out one child entry per parent that was just counted
        set.these.to.na <- match(parents.id[temp], parents.id.from.children)
        parents.id.from.children[set.these.to.na] <- NA
        temp <- which(parents.id %in% parents.id.from.children)
      }
      return(number.of.children)
    }

    ## children over 16 (and below 21)
    pos.children.over.16 <- which((rdata$RB010 - rdata$RB080 < 21) &
                                  !(rdata$RB220 == -2 & rdata$RB230 == -2) &
                                  (rdata$RB245 != 4))
    my.df.full$number.of.children.above.16 <- calculate.number.of.children(pos.children.over.16)

    ## children below 16
    pos.children.below.16 <- which(rdata$RB245 == 4)
    my.df.full$number.of.children.below.16 <- calculate.number.of.children(pos.children.below.16)

    ## total number of children (reliable, eliminates duplicates that change RB245 from 4 to 3)
    my.df.full$total.number.of.children <- calculate.number.of.children(sort(c(pos.children.below.16, pos.children.over.16)))
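The counting loop inside the function can be tried out in isolation. The sketch below is a stand-alone illustration, not part of the original code: `count.occurrences`, `parents` and `pool` are hypothetical names, the ids are made up, and `parents` is assumed to contain no NA (an NA there would keep matching the struck-out entries and the loop would never end). The result is cross-checked against table().

```r
## How many times does each element of ids occur in pool?
count.occurrences <- function(ids, pool) {
  counts <- integer(length(ids))
  hit <- which(ids %in% pool)
  while (length(hit) > 0) {
    counts[hit] <- counts[hit] + 1L
    pool[match(ids[hit], pool)] <- NA   # strike out one occurrence per matching id
    hit <- which(ids %in% pool)
  }
  counts
}

parents <- c(101, 102, 103, 104)        # made-up parent ids
pool    <- c(101, 101, 103, 101, 103)   # made-up parent ids collected from child records

children.per.parent <- count.occurrences(parents, pool)
children.per.parent                     # 3 0 2 0

## same result from table(), as a sanity check
via.table <- as.vector(table(factor(pool, levels = parents)))
```

The while loop runs three times here, once per child of the parent with the most children, no matter how many parents there are.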