Complex use of match() in R

How do you find the "corresponding" items in two vectors when both vectors contain duplicated values, while ignoring any duplicate in vector b that is not duplicated in vector a?

Think of the elements as identifiers that describe cases, where vector b is part of some larger structure holding other information that should be merged with a.

a <- c("foo", "bar", "bar", "bal")
b <- c("faa", "foo", "boo", "bar", "bar", "bad", "bal", "bal", "baz")

The correct vector of corresponding positions is 2, 4, 5, 7 or 2, 4, 5, 8. Note that "bal" appears only once in a but twice in b (at positions 7 and 8), so both solutions are correct.

match() and %in% are insufficient here:

> match(a, b)
[1] 2 4 4 7
> which(b %in% a)
[1] 2 4 5 7 8
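
To make "correct" concrete, here is a minimal sketch of a check for a candidate vector of positions. The helper name is.valid.match is hypothetical and not part of base R; it only encodes the requirements described above: one position per element of a, no position in b used twice, and matching values.

```r
a <- c("foo", "bar", "bar", "bal")
b <- c("faa", "foo", "boo", "bar", "bar", "bad", "bal", "bal", "baz")

## hypothetical helper, written for illustration only
is.valid.match <- function(pos, a, b) {
    length(pos) == length(a) &&   # one position per element of a
        !anyNA(pos) &&            # every element of a found a partner
        !anyDuplicated(pos) &&    # no position in b is used twice
        all(b[pos] == a)          # each position holds the right value
}

is.valid.match(c(2, 4, 5, 7), a, b)      # TRUE
is.valid.match(c(2, 4, 5, 8), a, b)      # TRUE
is.valid.match(match(a, b), a, b)        # FALSE: position 4 is used twice
is.valid.match(which(b %in% a), a, b)    # FALSE: five positions for four elements
```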

1. Make working copies of a and b.
2. Use match() to find matching items, then set them to NA in both working copies.
3. Repeat until every element of a has been assigned a position in b.

Implementation

a <- c("foo", "bar", "bar", "bal")
b <- c("faa", "foo", "boo", "bar", "bar", "bad", "bal", "bal", "baz")

tmp.a <- a
tmp.b <- b
my.result <- rep(NA, times=length(tmp.a))
while(anyNA(my.result)){
    ## This condition assumes that every element in a has at least one
    ## corresponding element in b, which may be perfectly fine,
    ## e.g. if a is derived from b in the first place.
    remove.these.from.b <- unique(match(na.omit(tmp.a), tmp.b))
    remove.these.from.a <- match(tmp.b[remove.these.from.b], tmp.a)
    tmp.b[remove.these.from.b] <- NA
    tmp.a[remove.these.from.a] <- NA
    my.result[remove.these.from.a] <- remove.these.from.b ## store the matches
}
> my.result
[1] 2 4 5 7
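
As a possible alternative to the loop, the same positions can be obtained in one vectorized step by numbering the repeats of each value and matching on the combined key. This is a sketch of a different technique, not the method used above; the helper occurrence and the names key.a, key.b and my.result2 are made up for illustration.

```r
a <- c("foo", "bar", "bar", "bal")
b <- c("faa", "foo", "boo", "bar", "bar", "bad", "bal", "bal", "baz")

## hypothetical helper: number the repeats of each value as 1, 2, 3, ...
occurrence <- function(x) ave(seq_along(x), x, FUN = seq_along)

## "bar", "bar" becomes "bar 1", "bar 2", so every key is unique
key.a <- paste(a, occurrence(a))
key.b <- paste(b, occurrence(b))

## the n-th "bar" in a is matched to the n-th "bar" in b
my.result2 <- match(key.a, key.b)
my.result2
## [1] 2 4 5 7
```

Unlike the loop, this sketch does not assume that every element of a has a partner in b: an element without one simply gets NA in the result.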

Last modified: October 12, 2017