Guess Next Query

Problem: You run a web service which people use to look up words in your database and you want to offer your users good suggestions for their next query.

Strategy: Record "sessions", series of queries performed by the same user, and use that data to predict what words future users will want to query based on their current query.

Below you will find

a list of requirements for the predictions,
two sample data sets.
the code, in R, needed to make and display the predictions.
the result of the code when applied to the data set.
R code needed to replicate the construction of the sample data set.

Requirements

To illustrate the concept, I will use the english words of the first seven letters of the greek alphabet as the database to be queried.

1. A table with all words present in the database

In order to refer to words by pointer, it is handy to have a table with all words that exists in the database (If you intend to store the results of the analysis as a list of lists of words, then such a table is not strictly necessary, but it is a good idea anyway).

the.words <- c("alpha", "beta", "gamma", "delta", "epsilon", "zeta", "eta")
(all.words <- data.frame(index = 1:length(the.words), word = the.words))
  index    word
     1   alpha
     2    beta
     3   gamma
     4   delta
     5 epsilon
     6    zeta
     7     eta

2. A data set with (many) recorded "sessions"

Code to construct a sample data set of "sessions", in session "A" a user queried the words "epsilon" and "delta".

set.seed(100)
session.lengths <- sample(1:7, size=5, replace=TRUE)
(sessions <- data.frame(query.id=1:sum(session.lengths),
 session.id = rep(LETTERS[1:length(session.lengths)], times = session.lengths),
 word=unlist(sapply(session.lengths, function(x) {
        sample(all.words$word, size = x, replace = FALSE)
       })),
stringsAsFactors = FALSE))
   query.id session.id   word
        1          A   epsilon
        2          A     delta
        3          B      zeta
        4          B     gamma
        5          B     alpha
        6          C      zeta
        7          C       eta
        8          C     gamma
        9          D      beta
       10          D     gamma
       11          D   epsilon
       12          D       eta
       13          D      zeta
       14          E   epsilon
       15          E      zeta
       16          E     alpha
       17          E     gamma

Analysis

For each word W that has been queried:
- get all the words Ws that appear in any of the session(s) where W was queried.
- order Ws according to frequency and return up to five words starting with the word with highest frequency.

the.results <- sapply(unique(sessions$word), function(W) {
  the.sessions <- with(sessions, session.id[which(word == W)])
  Ws <- unlist(sapply(the.sessions, function(session){
    with(sessions, word[which(session.id == session)])
  }))
  remove.self <- which(Ws %in% W)
  if(length(remove.self) > 0){
    Ws <- Ws[-remove.self]
  }
  temp.table <- sort(table(as.character(Ws)), decreasing = TRUE)
  if(length(temp.table) > 5){
    temp.table <- temp.table[1:5]
  }
  return(names(temp.table))
})
names(the.results) <- unique(sessions$word)

Results

This will result in the following list of suggestions

(the.results)
$epsilon
[1] "gamma" "zeta"  "alpha" "beta"  "delta"

$delta
[1] "epsilon"

$zeta
[1] "gamma"   "alpha"   "epsilon" "eta"     "beta"

$gamma
[1] "zeta"    "alpha"   "epsilon" "eta"     "beta"

$alpha
[1] "gamma"   "zeta"    "epsilon"

$eta
[1] "gamma"   "zeta"    "beta"    "epsilon"

$beta
[1] "epsilon" "eta"     "gamma"   "zeta"

The list is to be read like this: If a user queries the database for "alpha", then suggest "gamma", "zeta" and "epsilon", in that order, as possible next queries.

If we wanted to store these data in a compact way, we should store only pointers to words, not the words themselves. Storing the words would, however, save a SQL-lookup when the user looks up a word, so perhaps it is wiser to store the words as words, since that would keep response times lower.

comments powered by Disqus

Back to the index

Blog roll

R-bloggers, Debian Weekly

Last modified: oktober 12, 2017