Finding simular strings

Problem: match elements in two set of strings, where the elements that are to be matched might differ in a number of ways.

Description of the problem

I have a relation database and a set of files that I need to match. One field in one of the tables of the database is supposed to match the name of the corresponding file, however the files were named in a not quite systematic way, and the file names must not be changed (since the files are sources in another process which rely on their names).

Here is an sample of elements from the actual file names:

file.names.clean[sample.int(length(file.names.clean), 100)]
  [1] "VT05-6110-03"        "VT09-2611-041"       "VT08-2611-207"       "HT04-2611-70"        "HT06-2480-23b"       "HT10-1120-20"
  [7] "VT08-2611-062"       "HT05-2611-047"       "HT06-1080-04"        "HT04-2611-12"        "HT06-1080-07"        "HT06-1190-24b"
 [13] "VT10-2611-073"       "VT07-2611-180"       "HT07-2611-210"       "HT06-2611-42"        "HT09-2611-030"       "VT07-2480-07b"
 [19] "HT10-1120-04"        "VT09-2611-055"       "VT10-2480-001"       "VT05-1030-04"        "HT06-1190-26"        "HT10-2611-233"
 [25] "VT07-2490-03"        "VT07-2611-148"       "VT07-2480-16"        "VT05-2611-40"        "VT06-1190-04"        "VT10-2611-032"
 [31] "VT08-2611-002"       "HT06-2611-226"       "HT06-2611-201"       "HT04-2480-05"        "HT07-2490-05"        "HT05-2480-005"
 [37] "HT07-2611-074"       "HT09-2611-007"       "HT08-2611-048"       "HT09-1120-19"        "VT10-1120-1"         "HT06-1020-04"
 [43] "HT10-2611-211"       "VT05-2611-45b"       "HT10-2432-01"        "HT06-1080-13"        "VT05-1080-02"        "VT05-6110-02"
 [49] "HT09-1120-01"        "HT08-6030-06"        "HT06-2611-100b"      "HT07-2611-206"       "VT08-2480-03"        "HT08-2611-203"
 [55] "VT09-1190-11_B"      "HT10-2611-227"       "HT10-2611-230"       "VT06-2611-17"        "VT05-6110-07"        "HT09-2490-04"
 [61] "HT04-2611-054"       "HT07-2611-020"       "VT05-6110-04"        "HT06-2611-100"       "VT08-2611-058"       "HT04-2611-65"
 [67] "HT10-2480-18"        "VT08-1130-02"        "HT08-1030-05"        "HT08-1160-201"       "HT10-2611-229"       "HT10-2940-06"
 [73] "HT08-2611-031"       "HT08-1190-01"        "HT05-2611-045"       "HT08-2611-090"       "HT09-2480-05"        "VT09-248-01_Konstig"
 [79] "HT06-2432-04"        "HT07-2440-02b"       "VT08-1199-6"         "VT10-2611-102"       "VT10-2450-06"        "VT05-2611-78"
 [85] "HT07-2611-073"       "HT08-2611-030"       "HT08-2450-06"        "VT07-1420-04"        "HT09-1160-03"        "HT05-2611-109"
 [91] "HT06-2611-217"       "VT06-2611-30"        "VT07-2611-114"       "HT08-2480-05"        "VT05-2611-12b"       "VT07-1420-01"
 [97] "HT06-2611-213"       "HT06-2611-220"       "HT06-2611-04"        "VT05-2611-12"

There are three parts of the name, each part separated with a dash. The first two parts are unproblematic, but in the third part the number of intial zeros, if any, it not defined. So, we see "VT10-1120-1" as well as "VT07-1420-01" and "HT05-2480-005", and in the other set, the name for "VT07-1420-01" could well be "VT07-1420-1" or "VT07-1420-001". Also, only the names in the other set can be changed, the actual file names must remain the same.

In addition, there is no guarantee that a particular element exists in both sets.

The name of variable holding the other set of names is rapportnr

Wanted output

The final output wanted is a copy rapportnr that has the elements adapted to the version of each particular name that exists in file.names.clean. Elements in rapportnr that does not exist in file.names.clean should be empty.

Method

After an initial matching, build a vector of not yet matched items. This vector holds indices in rapportnr which have not yet been matched, and which might be missing from file.names.clean.

Create a subset of file.names.clean that includes only not yet matched elements. And a vector of the subset of elements of rapportnr which is not yet matched. Thus reduce set on which the comming operations are applied on to only work on cases not already dealt with.

Now, build a working copy of the not yet matched subset of rapportnr but with the names changed according to a rule. Match this vector against the subset of file.names.clean, and for the positions in this working copy of rapportnr that match the working file.names.clean

Since the sets we work with now are short than the original set, we must never use the positions returned, but always the positions we get when we in a later step match the values returned against the original, full-length, vector. So, for each string-transforming rule we apply, we must match two things:

At the indices returned from the latter, we should place the values of the transformed strings.

Repeat this process for each string-transforming rule.

Step one, match without applying any string-transformation.

Create the output vector final.output <- vector(mode = "character", length = length(rapportnr)) single out the easy cases, which match without any transformation at all indices.of.matching.elements <- which(rapportnr %in% file.names.clean == TRUE) matching.elements <- match(rapportnr, file.names.clean) matching.elements has a 1:1 relation to rapportnr. NA in position X of matching.elements means that the value at position X have no match in file.names.clean value Y in position X matching elements means, that the value at position X of rapportnr matches the value of position Y of file.names.clean

Save the easy cases at the right place in the output vector final.output[is.na(matching.elements) == F] <- rapportnr[is.na(matching.elements) == F] remove the easy cases from rapportnr and file.names.clean Since matching.elements has a 1:1 relation to rapportnr, we can use NA not NA of the positions. rapportnr <- rapportnr[is.na(matching.elements)] when removing cases from file.names.clean, use the values not the positions! file.names.clean <- file.names.clean[-na.omit(matching.elements)]

Start applying string-transforming rules on the elements in rapportnr. Use a new working copy for each rule. new.strings <-

comments powered by Disqus


Back to the index

Blog roll

R-bloggers, Debian Weekly
Valid XHTML 1.0 Strict [Valid RSS] Valid CSS! Emacs Muse Last modified: oktober 17, 2019