find-small-distances

This text will explain two things, firstly how to find items in a vector which differ less than a treshold, secondly how to find coordinates from two vectors that refers to locations closer to each other than a treshold

Finding "pairs" in an unordered vector that differ less than a treshold

Consider the vector N, that holds 10 items

N
 [1] 6410272 6404200 6407008 6393624 6401948 6403341 6397321 6412530 6385294
[10] 6407825

Which items are closer than treshold x to a another item in the vector? Let's define x to 1.500.

x <- 1500
order(N)[which(diff(N[order(N)]) < x)]
[1] 5 6 3

So item 5, 6 and 3 differ by no more than 1.500 to at least one other item in the vector N.

Which items constitute the pairs, in which 5, 6, 3 are parts?

sapply(which(diff(N[order(N)]) < x), function (y) {order(N)[c(y, y+1)]})
     [,1] [,2] [,3]
[1,]    5    6    3
[2,]    6    2   10

So item 5 happens to be a pair with item 6, which is also in a pair with item 2, lastly, item 3 is in a pair with item 10.

If you need the pairs ordered:

d <- sapply(which(diff(N[order(N)]) < x), function (y) {order(N)[c(y, y+1)]})
e <- sapply(seq(1, length(d), 2), function (y) {sort(d[c(y,y+1)])})
e
     [,1] [,2] [,3]
[1,]    5    2    3
[2,]    6    6   10

For each pair, what is the difference?

d <- sapply(which(diff(N[order(N)]) < x), function (y) {order(N)[c(y, y+1)]})
abs(diff(matrix(N[sapply(seq(1, length(d), 2), function (y) {sort(d[c(y,y+1)])})], 2)))
     [,1] [,2] [,3]
[1,] 1393  859  817

Finding pairs of pairs in two unordered vectors with a total difference less than a treshold

Suppose you have two unordered vectors with coordinates, N for values on the north-south axis, and E for values on the east-west axis. How do you find pairs of coordinate pairs that are closer than x to each other? I met with a real-world application of this problem when I investigated arson, where geographical distance was very important.

E
 [1] 1268851 1268200 1267224 1274877 1271267 1270656 1269688 1275535 1278333
 10] 1259585

We will use the matrix of indexes, e, as a means to get the differences in the other vector. The eventuality that there exists pairs in E, other than the ones found i N is irrelevant since those pairs differ more than threshold x in N.

Here are the values in E of the pairs from N

matrix(E[e], 2)
        [,1]    [,2]    [,3]
[1,] 1271267 1268200 1267224
[2,] 1270656 1270656 1259585

Calculate the difference in E

abs(diff(matrix(E[e], 2)))

The same thing for N

abs(diff(matrix(N[e], 2)))

Add the resulting vectors

abs(diff(matrix(E[e], 2))) + abs(diff(matrix(N[e], 2)))

Which pairs have a total sum of difference less than x = 3000 ?

x = 3000
which(abs(diff(matrix(E[e], 2))) + abs(diff(matrix(N[e], 2))) < y)
[1] 1

Only the first pair.

Everything in one block

x = 1 d <- sapply(which(diff(N[order(N)]) < x), function (y) {order(N)[c(y, y+1)]}) e <- sapply(seq(1, length(d), 2), function (y) {sort(d[c(y,y+1)])}) f <- which(abs(diff(matrix(E[e], 2))) + abs(diff(matrix(N[e], 2))) < x) matrix(bränder$Adress[e[,f]], ncol = 2, byrow = T)

comments powered by Disqus

Back to the index

Blog roll

R-bloggers, Debian Weekly

Last modified: oktober 17, 2019