Online LDA

Problem to solve: run LDA on Big Data.

Solution: iteratively update the model with new chunks of data, instead of trying to fit the whole data set on each iteration. This has the advantage that when there is new data, the current (old) model can be updated using only the new data.

Three programs offer online or streaming LDA: Gensim, river, and Vowpal Wabbit.

Unfortunately, none of them appears to be able to use a GPU.

Gensim

Installation

On Debian stable this installs gensim 3.8.

sudo python3 -m pip install --upgrade gensim

Input format

Gensim supports documents spread over several files or collected in a single file: "Gensim only requires that a corpus must be able to return one document vector at a time".

from smart_open import open  # for transparently opening remote files

class MyCorpus(object):
    def __iter__(self):
        for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace;
            # `dictionary` is the gensim Dictionary constructed further down
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)

You can iterate over such a corpus to create a dictionary which excludes very rare terms and stop words:

from six import iteritems
from gensim import corpora

stoplist = set('for a of the and to in'.split())  # example stop word list, replace as needed
# collect statistics about all tokens
dictionary = corpora.Dictionary(line.lower().split() for line in open('https://radimrehurek.com/gensim/mycorpus.txt'))
# remove stop words and words that appear only once
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)

river

Installation

sudo python3 -m pip install git+https://github.com/online-ml/river --upgrade

vowpal_wabbit

Installation

Based on https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Dependencies

git clone --recursive git@github.com:VowpalWabbit/vowpal_wabbit.git
cd vowpal_wabbit
sudo apt-get install libboost-dev libboost-thread-dev libboost-program-options-dev libboost-system-dev libboost-math-dev libboost-test-dev zlib1g-dev cmake g++
wget https://github.com/google/flatbuffers/archive/v1.12.0.tar.gz
tar -xzf v1.12.0.tar.gz
cd flatbuffers-1.12.0
mkdir build_dir
cd build_dir
cmake -G "Unix Makefiles" -DFLATBUFFERS_BUILD_TESTS=Off -DFLATBUFFERS_INSTALL=On -DCMAKE_BUILD_TYPE=Release -DFLATBUFFERS_BUILD_FLATHASH=Off ..
sudo make install
sudo pip3 install six
sudo apt-get install libboost-python-dev
sudo apt-get install default-jdk
cd ../..
make

This creates a binary under build/; try it with

build/vowpalwabbit/vw --help | less

Input format

According to https://mlwave.com/tutorial-online-lda-with-vowpal-wabbit/, the input format is a pipe followed by the words, optionally with counts:

| this is a document about sports
| here:1 another:2 two:2 documents:2 about:2 sports:2
| the:17 epic:5 of:18 gilgamesh:6 [...]

To feed vw data in this format from a MySQL/MariaDB database I have written two scripts. The first is a shell script that efficiently (in parallel) queries the database for text, formats each result set into bags of words, and runs vowpal wabbit.

#!/bin/bash

number_of_topics=100
theoretical_max_number_of_documents=`mysql --column-names=off -B --database=flashback -e 'select count(*) from original_contributions_randomized;'`
number_of_partitions=8192
## Let the learning rate for the next iteration be the current learning rate * (1-decreasing_factor)
decreasing_factor=0.00
lr=0.001

sudo rm -rf /tmp/textdata.raw

for ((i=0;i<$number_of_partitions;i++)); do
    ## Extract one partition of data
    mysql --database=flashback -e "select REGEXP_REPLACE(content, '[[:punct:]]|[[:space:]]', ' ') from original_contributions_randomized partition (p${i}) into outfile '/tmp/textdata.raw';"
    ## Remove special characters
    sed 's/\xc2\x80\|\xc2\x94\|\xc2\x96//g' < /tmp/textdata.raw > half-clean.txt
    sudo rm /tmp/textdata.raw
    ## remove stopwords and numbers, stem, store the hash for any unseen words so that the final model can be expressed using words
    /home/hans/annex/text/src/r/text-cleaning.R half-clean.txt clean-${i}.txt global_dict.RData
    rm half-clean.txt
    ## Run/update the model
    if [[ ! -a weights ]]; then
        ## First run
        resume_part=""
    else
        ## update
        resume_part="--save_resume -i weights"
    fi
    while [[ -a vw.lock ]]; do
        sleep 5;
    done
    ## Fork a vw instance
    lr=`bc -l <<< "${lr}*(1-${decreasing_factor})"`
    (touch vw.lock
     echo "vw -d clean-${i}.txt --lda $number_of_topics --lda_D $theoretical_max_number_of_documents --learning_rate=$lr $resume_part -f weights --readable_model lda.model.vw"
    /home/hans/src/vowpal_wabbit/build/vowpalwabbit/vw -d clean-${i}.txt --lda $number_of_topics --lda_D $theoretical_max_number_of_documents --learning_rate=$lr $resume_part -f weights --readable_model lda.model.vw
    rm vw.lock) &
done

The second script is text-cleaning.R

#!/usr/bin/Rscript

## usage text-cleaning.R source.txt cleaned_source.txt global_dictionary.RData
args <- commandArgs(trailingOnly=TRUE)

lemmatize=TRUE
# lemmatize=FALSE

load(file = "~/annex/projekt/echo-chambers/my.stop.words.RData") # my.stop.words
library(quanteda)
my.corpus.q <- corpus(read.table(args[1], sep = "\n", stringsAsFactors = FALSE)$V1)
# Handle emoticons
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "❤", "hjärta", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "👍", "tummeupp", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😂😂😂", "😂", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😂", "grin", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😡😡😡", "😡", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😡", "feelsbadman", valuetype = "fixed",
  case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😈", "devilgrin", valuetype = "fixed",
                              case_insensitive = FALSE)

## remove numbers, punctuations and separators
washed.q <- tokens(my.corpus.q, remove_numbers = TRUE,
                               remove_punct = TRUE, remove_separators = TRUE)
lower.q <- tokens_tolower(washed.q)

if(lemmatize){
    load(file = "~/annex/data/Avslutat-Eget/stemming.RData") # tab
    ## Lemmatize
    lemmatized.q <- tokens_replace(tokens(lower.q), pattern=tab$term,
                                   replacement=tab$stem, valuetype = "fixed")

    ## Remove stop words, custom swedish list, and default english
    stopwords.removed.q.1 <- tokens_select(lemmatized.q, pattern = c(my.stop.words,
                                                                     stopwords(language="en")), selection = 'remove')
} else {
    ## Remove stop words, custom swedish list, and default english
    stopwords.removed.q.1 <- tokens_select(lower.q, pattern = c(my.stop.words,
                                                                     stopwords(language="en")), selection = 'remove')
}

## Remove documents that have fewer than four words after the removal of stopwords.
final <- stopwords.removed.q.1[which(sapply(stopwords.removed.q.1, length) > 3)]
# final.docs <- sapply(final, paste, collapse = " ")
# cat(final.docs, file = args[2], sep = "\n", row.names=FALSE, col.names=FALSE)

bag.of.words.1 <- sapply(final, table)
bag.of.words.2 <- sapply(bag.of.words.1, function(x) { paste(names(x), x, sep = ":") })

dictionary.temp <- unique(unlist(sapply(bag.of.words.1, names)))
## add new terms to the global dictionary

if(file.exists(args[3])){
    load(args[3])
    dictionary <- c(dictionary, dictionary.temp[dictionary.temp %in% dictionary == FALSE])
    save(dictionary, file = args[3])
} else {
    dictionary <- dictionary.temp
    save(dictionary, file = args[3])
}
cat(as.character(sapply(sapply(bag.of.words.2, paste, collapse = " "), function(x) { paste("|", x) })), file = args[2], sep = "\n")

This program depends on two external data sets: a character vector of stop words (~/annex/projekt/echo-chambers/my.stop.words.RData) and a stemming table (technically, a data.frame with the two columns "stem" and "term") in ~/annex/data/Avslutat-Eget/stemming.RData. In my case these cover Swedish, but they are easily replaced, as sketched below.
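
This is a minimal sketch, with made-up Swedish placeholder entries, of what replacement objects could look like; the object names (my.stop.words and tab) are the ones the script loads, the contents are only examples.

## Hypothetical replacement data sets; the words below are placeholders only.
my.stop.words <- c("och", "att", "det", "som", "en")   ## character vector of stop words
save(my.stop.words, file = "my.stop.words.RData")
## Stemming table with the columns "term" and "stem", mapping term -> stem.
tab <- data.frame(term = c("bilar", "bilen", "hundarna"),
                  stem = c("bil", "bil", "hund"),
                  stringsAsFactors = FALSE)
save(tab, file = "stemming.RData")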

Training

Run the training

bash ~/annex/text/src/sh/lda-flashback.sh 0.001 &> lda.log

Visualise the progress of training

Show the average loss for the held-out documents in the last batch as a function of the number of documents seen by training, up to batch 1295.

grep -i "average loss" lda.log > loss.tab
for ((i=1;i<=1295;++i)); do wc clean-$i.txt >> length.txt; done

Then in R (some documents were missing, or the average loss was n.a., above batch 925, so restrict the plot to the first 925 batches):

foo <- read.table("loss.tab", na.strings="n.a.")
my.x <- cumsum(read.table("length.txt")$V1)
plot(my.x[1:925], foo$V4[1:925], ylab = "average loss for the last seen batch", xlab = "number of documents seen", type = "b", pch = 20)

As can be inferred from the graph, training for this particular data set (and with this number of topics) converges to a loss of about 9.3 after 1.000.000 seen documents (this was with a fixed learning rate of 0.001, as set in the script above).

Topics per term

Extracting topics per term is rather complex, since vw only stores a number for the hash of each term. The size of the hash space is determined by the run-time parameter -b (18 by default): it is 2 to the power of the -b parameter, and 2^18 = 262.144, so there is room for 262.144 words before there are hash collisions. The total number of terms seen by training is about 2.4 million, but since each batch is representative, the most common terms should appear before there are hash collisions. To recover the words, we can feed the first 262.144 terms (or is it 262.143?) through the model in prediction/testing mode (training is inactivated), with a special parameter that saves the hash (or the number?) in a human-readable format. There are two possibilities, --audit_regressor or --invert_hash, which the vw documentation describes as follows:

--invert_hash is similar to --readable_model, but the model is output in a more human readable format with feature names followed by weights, instead of hash indexes and weights. Note that running vw with --invert_hash is much slower and needs much more memory. Feature names are not stored in the cache files (so if -c is on and the cache file exists and you want to use --invert_hash, either delete the cache or use -k to do it automatically). For multi-pass learning (where -c is necessary), it is recommended to first train the model without --invert_hash and then do another run with no learning (-t) which will just read the previously created binary model (-i my.model) and store it in human-readable format (--invert_hash my.invert_hash).

--audit_regressor mode works like --invert_hash but is designed to have much smaller RAM usage overhead. To use it you shall perform two steps. Firstly, train your model as usual and save your regressor with -f. Secondly, test your model against the same dataset that was used for training with the additional --audit_regressor result_file option in the command line. Technically, this loads the regressor and prints out feature details when it's encountered in the dataset for the first time. Thus, the second step may be used on any dataset that contains the same features. It cannot process features that have hash collisions - the first one encountered will be printed out and the others ignored. If your model isn't too big you may prefer to use --invert_hash or the vw-varinfo script for the same purpose.

To inactivate training, use

 -t [ --testonly ]                Ignore label information and just test

The text-cleaning script saves a global dictionary, a character vector of all terms seen during training. To create a vw-formatted document with the first 262.000 words, run this snippet in R:

load("global_dict.RData")
cat(paste0("| ", dictionary[1:262000], "\n"), file = "dictionary.vw", sep = "")

and then invoke vw so that it disables training and produces predictions of the topics for each term, i.e. p(topic|term):

## ~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d dictionary.vw --audit_regressor map_terms_to_indices.txt
~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d dictionary.vw -p dictionary.predictions

However, these loadings only indicate the topic probability given the term; they say nothing about which terms are the most likely given the topic, so very uncommon words can show up among the top terms here. To get the terms per topic, the word frequency is needed.

(Also, hash collisions started with term 262.145 and there were about 2.400.000 terms; I am not sure whether any of the colliding terms, which of course are unusual, could also make it into the top list.)

The full word frequency is probably costly to compute, but since the least common words are uninteresting anyway, we can take a sample of the data and use the word frequency of that sample without loss.

Here is a word-counting R script:

#!/usr/bin/Rscript

## usage word-counter.R 1 200 words.RData
args <- commandArgs(trailingOnly=TRUE)

library(data.table)

inner.f <- function(i){
    chunk <- read.table(paste0("clean-", i, ".txt"), sep = "\n", stringsAsFactors=F)
    list.of.terms <- sapply(chunk, strsplit, " ")
    list.of.terms.1 <- sapply(list.of.terms, strsplit, ":")
    bar <- unlist(sapply(list.of.terms.1, function(x) { x[-1] }), recursive=FALSE, use.names=FALSE)
    baz <- table(unlist(sapply(bar, function(x) { rep(x[1], times=x[2]) })))
    ## store only the words which appear at least twice in this batch (5.000 posts).
    final <- baz[which(baz > 1)]
    total.number.of.words <- sum(baz)
    final.df <- data.table(names(final), as.numeric(final))
    colnames(final.df) <- c("term", paste0("frequency", i))
    setkey(final.df, term)
    if(file.exists(args[3])){
        load(args[3]) ## words
        foo <- merge(words, final.df, all=T)
        words <- data.table(term = foo$term, frequency = rowSums(foo[, 2:3], na.rm=TRUE))
        print(paste("at chunk", i, "number of words:", nrow(words)))
        setkey(words, term)
    } else {
        words <- final.df
        print(paste("at chunk", i, "number of words:", nrow(words)))
    }
    save(words, file = args[3])
    stopifnot(nrow(words) < 262144)
}

for(x in args[1]:args[2]){
    inner.f(x)
}

Run this in a shell to gather word frequencies from the first 200 chunks (about 1.000.000 posts, since there are about 5.000 posts per chunk).

~/annex/text/src/r/word-counter.R 1 200 words.RData

Estimate proportions with

library(data.table)
load("words.RData")
words[, proportion := frequency/sum(frequency)]
save(words, file = "proportions.RData")

There are different algorithms for deriving the terms that characterize the topics. The relevance score has been shown to work better than simply listing the words that are most probable given the topic, p(term|topic). The relevance score is defined differently in http://qpleple.com/word-relevance/ and in LDAvis: A method for visualizing and interpreting topics. This is how the former defines it:

Relevance(term|topic) = p(term|topic) / exp(H_term)

where H_term is the entropy of the topic distribution given the term:

H_term = -Sum_over_topics( p(topic|term) * log(p(topic|term)) )
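
A minimal R sketch of this first definition, assuming that p.terms.given.topic and p.topic.given.term are both terms-by-topics matrices, like the ones derived from the readable model further down:

## Entropy of the topic distribution for each term (one value per row/term);
## a small constant avoids log(0) for terms with zero weight in some topic.
H.term <- -rowSums(p.topic.given.term * log(p.topic.given.term + 1e-12))
## Relevance(term|topic) = p(term|topic) / exp(H_term); exp(H.term) has one
## value per term and is recycled over the rows of the matrix.
relevance.former <- p.terms.given.topic / exp(H.term)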

The latter has another definition:

Relevance of term to topic given lambda is

lambda * log(phi(k,w)) + (1 - lambda) * log (phi(k,w)/p(w))

phi(k,w) and p(w) are what is needed: phi(k,w) is the probability of the term given the topic, and p(w) is the probability of the term in the whole corpus. Good values of lambda are around 0.6.

From https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md

-p prediction.dat stores the predictions (topic weights for each doc) in the file prediction.dat
--readable_model topics.dat stores the word-topic weights in the file topics.dat

What is a word-topic weight? Since it is larger than 1 it cannot be a probability, but if we normalise each row of the --readable_model output so that it sums to one, do we then get the probability p(topic|term)? I guess so.

But how do we get the probability of a term given a topic? Is it enough to normalise the columns instead? Let's try that, with the readable model loaded into a data frame foo (one row per term, topic weights in columns 3 to 102):

p.terms.given.topic <- apply(foo[, 3:102], 2, function(x) { x/sum(x) })
p.topic.given.term <- t(apply(foo[, 3:102], 1, function(x) { x/sum(x) }))  ## apply over rows returns topics x terms, so transpose back
total.p <- sum(foo[, 3:102])
p.term <- rowSums(foo[, 3:102])/total.p
## Try normalising the full matrix to sum to 1
t.sum <- sum(foo[, 3:102])
## look only at the first 20.000 words
bar <- apply(foo[1:20000, 3:102], 2, function(x) { x/t.sum })
## first normalise the rows, then normalise the columns
row.sums <- rowSums(foo[, 3:ncol(foo)])
row.weights <- row.sums/sum(row.sums)
weighted.by.row.weight <- foo[, 3:ncol(foo)] * row.weights
## normalise columns so that they sum to one.
colnorm.weighted.by.row.weight <- apply(weighted.by.row.weight, 2, function(x) { x / sum(x) })
p.terms.given.topic <- colnorm.weighted.by.row.weight
total.p <- sum(foo[, 3:ncol(foo)])
p.term <- rowSums(foo[, 3:ncol(foo)])/total.p

p.term should, roughly, coincide with the empirical word frequency I calculated manually above.

The relevance definition used in LDAvis needs only p.terms.given.topic and p.term, so start with that one.

relevance(for term t, in topic k, given lambda l) = lambda * log (p.terms.given.topic) + (1-lambda) * log (p.terms.given.topic / p.term)

lambda <- 0.6
relevance <- lambda * log(p.terms.given.topic) + (1-lambda) * log(p.terms.given.topic/p.term)

Print the results

## terms is assumed to be the character vector of term names, in the same order as the rows of foo
topic.loadings.indices <- apply(relevance, 2, function(x) { head(order(x, decreasing=TRUE), n = 20) })
topic.loading.terms <- apply(topic.loadings.indices, 2, function(x) { terms[x] })
topic.loading.terms

Not good, so the "weights" are not useful for deriving p(term|topic), but we can gather that empirically by inspecting the predictions.

foo <- read.table("predictions.txt", sep = " ")

Distribution of topics

Which are the most common topics in the corpus?

Distribution of topics (predictions)

To get a predicted mix of topics in 100 batches of documents (in the files clean-<j>.txt), run this shell snippet:

for j in {1..100}; do ~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d clean-$j.txt -p predictions-$j.txt; done
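
To answer the question of which topics are most common, here is a rough R sketch that averages these per-document predictions into a corpus-wide topic distribution. It assumes that each row of predictions-<j>.txt holds one document's topic weights, with one column per topic; check the file layout before relying on it.

pred.files <- paste0("predictions-", 1:100, ".txt")
all.preds <- as.matrix(do.call(rbind, lapply(pred.files, read.table, sep = " ")))
## Normalise each document's weights to proportions, then average over documents.
doc.props <- all.preds / rowSums(all.preds)
topic.distribution <- colMeans(doc.props)
head(sort(topic.distribution, decreasing = TRUE), n = 10)  ## the ten most common topics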

Predictions

I guess there is a variance-covariance matrix as well, not only a p(topic|term) but a p(topic|(term1 & term2)) and p(topic|(term1 & term3)). Actually no.

It seems the only way to get p(term|topic) is to generate predictions of topic mixtures for the documents, then sum over them, count f(term|topic) and divide by n_words(topic), as sketched below.
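
A hedged sketch of that idea for a single batch, assuming that the lines of clean-<j>.txt and predictions-<j>.txt correspond one to one and that each prediction row is that document's topic mixture:

docs <- read.table("clean-1.txt", sep = "\n", stringsAsFactors = FALSE)$V1
preds <- as.matrix(read.table("predictions-1.txt", sep = " "))
preds <- preds / rowSums(preds)              ## per-document topic proportions
n.topics <- ncol(preds)
## Soft counts: every occurrence of a term is credited to each topic in
## proportion to the topic mixture of the document it occurs in.
soft.counts <- list()
for (d in seq_along(docs)) {
    tokens <- strsplit(sub("^\\| *", "", docs[d]), " ")[[1]]
    for (tok in strsplit(tokens, ":")) {
        w <- tok[1]
        n <- as.numeric(tok[2])
        if (is.null(soft.counts[[w]])) soft.counts[[w]] <- numeric(n.topics)
        soft.counts[[w]] <- soft.counts[[w]] + n * preds[d, ]
    }
}
term.topic.counts <- do.call(rbind, soft.counts)   ## terms x topics matrix of soft counts
p.term.given.topic <- sweep(term.topic.counts, 2, colSums(term.topic.counts), "/")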

See also vw-lda, a wrapper for vw --lda.
