Guess Next Query

Problem: You run a web service which people use to look up words in your database and you want to offer your users good suggestions for their next query.

Strategy: Record "sessions", series of queries performed by the same user, and use that data to predict what words future users will want to query based on their current query.

Below you will find

Requirements

To illustrate the concept, I will use the english words of the first seven letters of the greek alphabet as the database to be queried.

1. A table with all words present in the database

In order to refer to words by pointer, it is handy to have a table with all words that exists in the database (If you intend to store the results of the analysis as a list of lists of words, then such a table is not strictly necessary, but it is a good idea anyway).

the.words <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta")
(all.words <- data.frame(index = 1:length(the.words), word = the.words))
  index    word
     1   alpha
     2    beta
     3   gamma
     4   delta
     5 epsilon
     6    zeta
     7     eta

2. A data set with (many) recorded "sessions"

Code to construct a sample data set of "sessions", in session "A" a user queried the words "epsilon" and "delta".

set.seed(100)
session.lengths <- sample(1:7, size=5, replace=TRUE)
(sessions <- data.frame(query.id=1:sum(session.lengths),
 session.id = rep(LETTERS[1:length(session.lengths)], times = session.lengths),
 word=unlist(sapply(session.lengths, function(x) {
        sample(all.words$word, size = x, replace = FALSE)
       })),
stringsAsFactors = FALSE))
   query.id session.id   word
        1          A   epsilon
        2          A     delta
        3          B      zeta
        4          B     gamma
        5          B     alpha
        6          C      zeta
        7          C       eta
        8          C     gamma
        9          D      beta
       10          D     gamma
       11          D   epsilon
       12          D       eta
       13          D      zeta
       14          E   epsilon
       15          E      zeta
       16          E     alpha
       17          E     gamma

Analysis

the.results <- sapply(unique(sessions$word), function(W) {
  the.sessions <- with(sessions, session.id[which(word == W)])
  Ws <- unlist(sapply(the.sessions, function(session){
    with(sessions, word[which(session.id == session)])
  }))
  remove.self <- which(Ws %in% W)
  if(length(remove.self) > 0){
    Ws <- Ws[-remove.self]
  }
  temp.table <- sort(table(as.character(Ws)), decreasing = TRUE)
  if(length(temp.table) > 5){
    temp.table <- temp.table[1:5]
  }
  return(names(temp.table))
})
names(the.results) <- unique(sessions$word)

Results

This will result in the following list of suggestions

(the.results)
$epsilon
[1] "gamma" "zeta"  "alpha" "beta"  "delta"

$delta
[1] "epsilon"

$zeta
[1] "gamma"   "alpha"   "epsilon" "eta"     "beta"

$gamma
[1] "zeta"    "alpha"   "epsilon" "eta"     "beta"

$alpha
[1] "gamma"   "zeta"    "epsilon"

$eta
[1] "gamma"   "zeta"    "beta"    "epsilon"

$beta
[1] "epsilon" "eta"     "gamma"   "zeta"

The list is to be read like this: If a user queries the database for "alpha", then suggest "gamma", "zeta" and "epsilon", in that order, as possible next queries.

If we wanted to store these data in a compact way, we should store only pointers to words, not the words themselves. Storing the words would, however, save a SQL-lookup when the user looks up a word, so perhaps it is wiser to store the words as words, since that would keep response times lower.


Back to the index

Blog roll

R-bloggers, Debian Weekly
Valid XHTML 1.0 Strict [Valid RSS] Valid CSS! Emacs Muse Last modified: oktober 12, 2017