Problem: You run a web service which people use to look up words in your database and you want to offer your users good suggestions for their next query.
Strategy: Record "sessions", series of queries performed by the same user, and use that data to predict what words future users will want to query based on their current query.
Below you will find
To illustrate the concept, I will use the english words of the first seven letters of the greek alphabet as the database to be queried.
In order to refer to words by pointer, it is handy to have a table with all words that exists in the database (If you intend to store the results of the analysis as a list of lists of words, then such a table is not strictly necessary, but it is a good idea anyway).
the.words <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta") (all.words <- data.frame(index = 1:length(the.words), word = the.words)) index word 1 alpha 2 beta 3 gamma 4 delta 5 epsilon 6 zeta 7 eta
Code to construct a sample data set of "sessions", in session "A" a user queried the words "epsilon" and "delta".
set.seed(100) session.lengths <- sample(1:7, size=5, replace=TRUE) (sessions <- data.frame(query.id=1:sum(session.lengths), session.id = rep(LETTERS[1:length(session.lengths)], times = session.lengths), word=unlist(sapply(session.lengths, function(x) { sample(all.words$word, size = x, replace = FALSE) })), stringsAsFactors = FALSE)) query.id session.id word 1 A epsilon 2 A delta 3 B zeta 4 B gamma 5 B alpha 6 C zeta 7 C eta 8 C gamma 9 D beta 10 D gamma 11 D epsilon 12 D eta 13 D zeta 14 E epsilon 15 E zeta 16 E alpha 17 E gamma
the.results <- sapply(unique(sessions$word), function(W) { the.sessions <- with(sessions, session.id[which(word == W)]) Ws <- unlist(sapply(the.sessions, function(session){ with(sessions, word[which(session.id == session)]) })) remove.self <- which(Ws %in% W) if(length(remove.self) > 0){ Ws <- Ws[-remove.self] } temp.table <- sort(table(as.character(Ws)), decreasing = TRUE) if(length(temp.table) > 5){ temp.table <- temp.table[1:5] } return(names(temp.table)) }) names(the.results) <- unique(sessions$word)
This will result in the following list of suggestions
(the.results) $epsilon [1] "gamma" "zeta" "alpha" "beta" "delta" $delta [1] "epsilon" $zeta [1] "gamma" "alpha" "epsilon" "eta" "beta" $gamma [1] "zeta" "alpha" "epsilon" "eta" "beta" $alpha [1] "gamma" "zeta" "epsilon" $eta [1] "gamma" "zeta" "beta" "epsilon" $beta [1] "epsilon" "eta" "gamma" "zeta"
The list is to be read like this: If a user queries the database for "alpha", then suggest "gamma", "zeta" and "epsilon", in that order, as possible next queries.
If we wanted to store these data in a compact way, we should store only pointers to words, not the words themselves. Storing the words would, however, save a SQL-lookup when the user looks up a word, so perhaps it is wiser to store the words as words, since that would keep response times lower.