Problem to solve: run LDA on Big Data.
Solution: iteratively update the model with new chunks of data, instead of trying to fit the whole data set on each iteration. This has the advantage that when there is new data, the current (old) model can be updated using only the new data.
Three programs offer online or streaming LDA: gensim, river, and Vowpal Wabbit. Unfortunately, none of them appear to be able to use a GPU.
On Debian stable, the following installs gensim 3.8:
sudo python3 -m pip install --upgrade gensim
Gensim supports documents spread over several files or collected in a single file: "Gensim only requires that a corpus must be able to return one document vector at a time".
from smart_open import open  # for transparently opening remote files

class MyCorpus(object):
    def __iter__(self):
        for line in open('https://radimrehurek.com/gensim/mycorpus.txt'):
            # assume there's one document per line, tokens separated by whitespace
            yield dictionary.doc2bow(line.lower().split())

corpus_memory_friendly = MyCorpus()

for vector in corpus_memory_friendly:  # load one vector into memory at a time
    print(vector)
You can iterate over such a corpus to create a dictionary that excludes very rare terms:
from six import iteritems
from gensim import corpora

# collect statistics about all tokens
dictionary = corpora.Dictionary(
    line.lower().split()
    for line in open('https://radimrehurek.com/gensim/mycorpus.txt')
)
# remove stop words and words that appear only once
# (`stoplist` is assumed to be defined as in the gensim tutorial)
stop_ids = [
    dictionary.token2id[stopword]
    for stopword in stoplist
    if stopword in dictionary.token2id
]
once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
dictionary.compactify()  # remove gaps in id sequence after words that were removed
print(dictionary)
River can be installed straight from GitHub:

sudo python3 -m pip install git+https://github.com/online-ml/river --upgrade
To build Vowpal Wabbit from source (based on https://github.com/VowpalWabbit/vowpal_wabbit/wiki/Dependencies):
git clone --recursive git@github.com:VowpalWabbit/vowpal_wabbit.git
cd vowpal_wabbit
sudo apt-get install libboost-dev libboost-thread-dev libboost-program-options-dev libboost-system-dev libboost-math-dev libboost-test-dev zlib1g-dev cmake g++
wget https://github.com/google/flatbuffers/archive/v1.12.0.tar.gz
tar -xzf v1.12.0.tar.gz
cd flatbuffers-1.12.0
mkdir build_dir
cd build_dir
cmake -G "Unix Makefiles" -DFLATBUFFERS_BUILD_TESTS=Off -DFLATBUFFERS_INSTALL=On -DCMAKE_BUILD_TYPE=Release -DFLATBUFFERS_BUILD_FLATHASH=Off ..
sudo make install
sudo pip3 install six
sudo apt-get install libboost-python-dev
sudo apt-get install default-jdk
cd ../..
make
This creates a binary in build; try it with
build/vowpalwabbit/vw --help | less
According to https://mlwave.com/tutorial-online-lda-with-vowpal-wabbit/, the input format is one document per line: a pipe followed by the words, optionally with counts, for example
| this is a document about sports
| here:1 another:2 two:2 documents:2 about:2 sports:2
| the:17 epic:5 of:18 gilgamesh:6 [...]
To feed vw data in this format from a MySQL/MariaDB database I have written two scripts. The first is a shell script that efficiently (in parallel) queries the database for text, formats each result set into bags of words, and runs Vowpal Wabbit.
#!/bin/bash

number_of_topics=100
theoretical_max_number_of_documents=`mysql --column-names=off -B --database=flashback -e 'select count(*) from original_contributions_randomized;'`
number_of_partitions=8192

## Let the learning rate for the next iteration be the current learning rate * (1-decreasing_factor)
decreasing_factor=0.00
lr=0.001

sudo rm -rf /tmp/textdata.raw

for ((i=0;i<$number_of_partitions;i++)); do
    ## Extract one partition of data
    mysql --database=flashback -e "select REGEXP_REPLACE(content, '[[:punct:]]|[[:space:]]', ' ') from original_contributions_randomized partition (p${i}) into outfile '/tmp/textdata.raw';"

    ## Remove special characters
    sed 's/\xc2\x80\|\xc2\x94\|\xc2\x96//g' < /tmp/textdata.raw > half-clean.txt
    sudo rm /tmp/textdata.raw

    ## Remove stopwords and numbers, stem, and record any unseen words in the
    ## global dictionary so that the final model can be expressed using words
    /home/hans/annex/text/src/r/text-cleaning.R half-clean.txt clean-${i}.txt global_dict.RData
    rm half-clean.txt

    ## Run/update the model
    if [[ ! -a weights ]]; then
        ## First run
        resume_part=""
    else
        ## Update
        resume_part="--save_resume -i weights"
    fi

    while [[ -a vw.lock ]]; do sleep 5; done

    ## Fork a vw instance
    lr=`bc -l <<< "${lr}*(1-${decreasing_factor})"`
    (touch vw.lock
     echo "vw -d clean-${i}.txt --lda $number_of_topics --lda_D $theoretical_max_number_of_documents --learning_rate=$lr $resume_part -f weights --readable_model lda.model.vw"
     /home/hans/src/vowpal_wabbit/build/vowpalwabbit/vw -d clean-${i}.txt --lda $number_of_topics --lda_D $theoretical_max_number_of_documents --learning_rate=$lr $resume_part -f weights --readable_model lda.model.vw
     rm vw.lock) &
done
The second script is text-cleaning.R
#!/usr/bin/Rscript
## usage: text-cleaning.R source.txt cleaned_source.txt global_dictionary.RData

args <- commandArgs(trailingOnly=TRUE)

lemmatize=TRUE
# lemmatize=FALSE

load(file = "~/annex/projekt/echo-chambers/my.stop.words.RData") # my.stop.words

library(quanteda)

my.corpus.q <- corpus(read.table(args[1], sep = "\n", stringsAsFactors = FALSE)$V1)

# Handle emoticons
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "❤", "hjärta", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "👍", "tummeupp", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😂😂😂", "😂", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😂", "grin", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😡😡😡", "😡", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😡", "feelsbadman", valuetype = "fixed", case_insensitive = FALSE)
my.corpus.q <- tokens_replace(tokens(my.corpus.q), "😈", "devilgrin", valuetype = "fixed", case_insensitive = FALSE)

## Remove numbers, punctuation and separators
washed.q <- tokens(my.corpus.q, remove_numbers = TRUE, remove_punct = TRUE, remove_separators = TRUE)
lower.q <- tokens_tolower(washed.q)

if(lemmatize){
    load(file = "~/annex/data/Avslutat-Eget/stemming.RData") # tab
    ## Lemmatize
    lemmatized.q <- tokens_replace(tokens(lower.q), pattern=tab$term, replacement=tab$stem, valuetype = "fixed")
    ## Remove stop words: custom Swedish list plus the default English list
    stopwords.removed.q.1 <- tokens_select(lemmatized.q, pattern = c(my.stop.words, stopwords(language="en")), selection = 'remove')
} else {
    ## Remove stop words: custom Swedish list plus the default English list
    stopwords.removed.q.1 <- tokens_select(lower.q, pattern = c(my.stop.words, stopwords(language="en")), selection = 'remove')
}

## Remove documents that have fewer than four words after the removal of stopwords.
final <- stopwords.removed.q.1[which(sapply(stopwords.removed.q.1, length) > 3)]

# final.docs <- sapply(final, paste, collapse = " ")
# cat(final.docs, file = args[2], sep = "\n")

bag.of.words.1 <- sapply(final, table)
bag.of.words.2 <- sapply(bag.of.words.1, function(x) { paste(names(x), x, sep = ":") })

dictionary.temp <- unique(unlist(sapply(bag.of.words.1, names)))

## Add new terms to the global dictionary
if(file.exists(args[3])){
    load(args[3])
    dictionary <- c(dictionary, dictionary.temp[dictionary.temp %in% dictionary == FALSE])
    save(dictionary, file = args[3])
} else {
    dictionary <- dictionary.temp
    save(dictionary, file = args[3])
}

cat(as.character(sapply(sapply(bag.of.words.2, paste, collapse = " "), function(x) { paste("|", x) })), file = args[2], sep = "\n")
This program depends on two external data sets: a character vector of stop words (~/annex/projekt/echo-chambers/my.stop.words.RData) and a stemming table (technically, a data.frame with the two columns "stem" and "term", ~/annex/data/Avslutat-Eget/stemming.RData). In my case these cover Swedish, but they are easily replaced; a sketch of what replacements could look like follows below.
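As a rough illustration (the file paths are the ones hard-coded in the script above; the Snowball stop word list and the toy stemming table are assumptions of mine, not the data I actually used):

library(quanteda)

## A replacement stop word vector: here simply the Snowball Swedish list
my.stop.words <- stopwords(language = "sv")
save(my.stop.words, file = "~/annex/projekt/echo-chambers/my.stop.words.RData")

## A replacement stemming table: a data.frame mapping inflected forms ("term")
## to their stem/lemma ("stem"); in practice this would come from a full lexicon
tab <- data.frame(term = c("bilen", "bilar", "bilarna"),
                  stem = c("bil", "bil", "bil"),
                  stringsAsFactors = FALSE)
save(tab, file = "~/annex/data/Avslutat-Eget/stemming.RData")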
Run the whole pipeline and log its output:

bash ~/annex/text/src/sh/lda-flashback.sh 0.001 &> lda.log
Show the average loss for the held-out documents in the last batch as a function of the number of documents seen during training, up to batch 1295.
grep -i "average loss" lda.log > loss.tab for ((i=1;i<=1295;++i)); do wc clean-$i.txt >> length.txt; doneIn R, there where some document missing or average loss na at above 925, so restrict to 925.
foo <- read.table("loss.tab", na.strings="n.a.") my.x <- cumsum(read.table("length.txt")$V1) plot(my.x[1:925], foo$V4[1:925], ylab = "average loss for the last seen batch", xlab = "number of documents seen", type = "b", pch = 20)
As can be inferred from the graph, training for this particular data set (and with this number of topics) converges to a loss of about 9.3 at 1.000.000 seen documents (this was with a fixed learning rate of 0.001, as seen in the script above).
Extracting topics per term is rather complex, since vw only stores numbers indexed by the hash of each term, and the hash space is determined by a run-time parameter '-b', by default 18. The size of the hash space is 2 to the power of the -b parameter, and 2^18 = 262.144, so there is room for 262.144 words before there are hash collisions. The total number of terms seen by training is about 2.4 million. Since each batch is representative, the most common terms should appear before there are hash collisions, and to recover the words we can feed the first 262.144 terms (or is it 262.143?) through the model in prediction/testing mode (training inactivated) with a special parameter that saves the hash (or the number?) in a human-readable format. There are two possibilities, `--audit` or `--invert_hash` (see the vw documentation quoted below).
--invert_hash is similar to --readable_model, but the model is output in a more human readable format with feature names followed by weights, instead of hash indexes and weights. Note that running vw with --invert_hash is much slower and needs much more memory. Feature names are not stored in the cache files (so if -c is on and the cache file exists and you want to use --invert_hash, either delete the cache or use -k to do it automatically). For multi-pass learning (where -c is necessary), it is recommended to first train the model without --invert_hash and then do another run with no learning (-t) which will just read the previously created binary model (-i my.model) and store it in human-readable format (--invert_hash my.invert_hash). --audit_regressor mode works like --invert_hash but is designed to have much smaller RAM usage overhead. To use it you shall perform two steps. Firstly, train your model as usual and save your regressor with -f. Secondly, test your model against the same dataset that was used for training with the additional --audit_regressor result_file option in the command line. Technically, this loads the regressor and prints out feature details when it's encountered in the dataset for the first time. Thus, the second step may be used on any dataset that contains the same features. It cannot process features that have hash collisions - the first one encountered will be printed out and the others ignored. If your model isn't too big you may prefer to use --invert_hash or the vw-varinfo script for the same purpose.
To inactivate training, use
-t [ --testonly ] Ignore label information and just test
The text-cleaning script saves a global dictionary, a character vector of all terms shown to training. To create a vw-formatted document with the first 262.000 words, run this snippet in R
load("global_dict.RData") cat(paste0("| ", dictionary[1:262000], "\n"), file = "dictionary.vw", sep = "")
and then invoke vw so that it inactivates training and produces predictions of the topics for each term, i.e. p(topic|term):
## ~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d dictionary.vw --audit_regressor map_terms_to_indices.txt
~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d dictionary.vw -p dictionary.predictions
However, these loadings only indicate the topic probability given the term; they say nothing about which terms are the most likely given the topic, so very uncommon words can show up among the top terms here. To get the terms per topic, the word frequency is needed.
(Also, hash collisions start with term 262.145 and there were about 2.400.000 terms in total; I'm not sure whether any of the colliding terms, which of course are unusual, could also make it into the top list.)
The full word frequency is probably costly to compute, but since the least common words are uninteresting anyway, we can take a sample of the data and use the word frequency of that sample without any real loss.
Here is a word counting R script
#!/usr/bin/Rscript
## usage: word-counter.R 1 200 words.RData

args <- commandArgs(trailingOnly=TRUE)

library(data.table)

inner.f <- function(i){
    chunk <- read.table(paste0("clean-", i, ".txt"), sep = "\n", stringsAsFactors=F)
    list.of.terms <- sapply(chunk, strsplit, " ")
    list.of.terms.1 <- sapply(list.of.terms, strsplit, ":")
    bar <- unlist(sapply(list.of.terms.1, function(x) { x[-1] }), recursive=FALSE, use.names=FALSE)
    baz <- table(unlist(sapply(bar, function(x) { rep(x[1], times=x[2]) })))

    ## Store only the words which appear at least twice in this batch (5.000 posts).
    final <- baz[which(baz > 1)]
    total.number.of.words <- sum(baz)

    final.df <- data.table(names(final), as.numeric(final))
    colnames(final.df) <- c("term", paste0("frequency", i))
    setkey(final.df, term)

    if(file.exists(args[3])){
        load(args[3]) ## words
        foo <- merge(words, final.df, all=T)
        words <- data.table(term = foo$term, frequency = rowSums(foo[, 2:3], na.rm=TRUE))
        print(paste("at chunk", i, "number of words:", nrow(words)))
        setkey(words, term)
    } else {
        words <- final.df
        print(paste("at chunk", i, "number of words:", nrow(words)))
    }
    save(words, file = args[3])
    stopifnot(nrow(words) < 262144)
}

for(x in args[1]:args[2]){
    inner.f(x)
}
Run this in a shell to gather word frequencies from the first 200 chunks (about 1.000.000 posts, since there are about 5.000 posts per chunk).
~/annex/text/src/r/word-counter.R 1 200 words.RData
Estimate proportions with
library(data.table)
load("words.RData")
words[, proportion := frequency/sum(frequency)]
save(words, file = "proportions.RData")
There are different algorithms for deriving the terms that characterise a topic. Relevance scoring has been shown to work better than just listing the words that are most probable given the topic, p(term|topic). Relevance is defined differently in http://qpleple.com/word-relevance/ and in "LDAvis: A method for visualizing and interpreting topics". This is how the former defines it:
Relevance(term|topic) = p(term|topic)/exp(Hterm)
Hterm is the entropy of the topic distribution given term.
Hterm = -Sum over topics of p(topic|term)*log(p(topic|term))
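A minimal R sketch of this first definition, assuming a terms-by-topics matrix of raw topic weights called topic.weights (a hypothetical name; the readable-model columns used further down would do) together with the corresponding p.terms.given.topic matrix:

## p(topic|term): normalise each term's row of weights to sum to one
p.topic.given.term <- topic.weights / rowSums(topic.weights)
## Entropy of the topic distribution given each term
## (terms with a zero weight for some topic would need a small epsilon added)
H.term <- -rowSums(p.topic.given.term * log(p.topic.given.term))
## Relevance according to the first definition
relevance.entropy <- p.terms.given.topic / exp(H.term)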
The LDAvis paper has another definition:
Relevance of term to topic given lambda is
lambda * log(phi(k,w)) + (1 - lambda) * log (phi(k,w)/p(w))
phi(k,w) and p(w) are what is needed: phi(k,w) is the probability of term w given topic k, and p(w) is the probability of the term in the whole corpus. Good values of lambda are around 0.6.
From https://github.com/columbia-applied-data-science/rosetta/blob/master/examples/vw_helpers.md
-p prediction.dat stores the predictions (topic weights for each doc) in the file prediction.dat
--readable_model topics.dat stores the word-topic weights in the file topics.dat
What is a word-topic weight? Since it is larger than 1 it can't be a probability, but if we normalise each row of the `readable_model` output so that it sums to one, do we then get the probability p(topic|term)? I guess so.
But how do we get the probability of a term given a topic? Is it enough to normalise the columns of foo instead? Let's try that (here foo holds the readable model loaded into R as a data frame, with one row per term and the 100 topic weights in columns 3 to 102).
p.terms.given.topic <- apply(foo[, 3:102], 2, function(x) { x/sum(x) })
## note: apply over rows (MARGIN = 1) returns its result transposed, i.e. topics x terms
p.topic.given.term <- apply(foo[, 3:102], 1, function(x) { x/sum(x) })
total.p <- sum(foo[, 3:102])
p.term <- rowSums(foo[, 3:102])/total.p
## Try normalising the full matrix to sum to 1
t.sum <- sum(foo[, 3:102])
## Look only at the first 20.000 words
bar <- apply(foo[1:20000, 3:102], 2, function(x) { x/t.sum })
## First normalise the rows, then normalise the columns
row.sums <- rowSums(foo[, 3:ncol(foo)])
row.weights <- row.sums/sum(row.sums)
weighted.by.row.weight <- foo[, 3:ncol(foo)] * row.weights
## Normalise columns so that they sum to one
colnorm.weighted.by.row.weight <- apply(weighted.by.row.weight, 2, function(x) { x / sum(x) })
p.terms.given.topic <- colnorm.weighted.by.row.weight

total.p <- sum(foo[, 3:ncol(foo)])
p.term <- rowSums(foo[, 3:ncol(foo)])/total.p
p.term should, roughly, coincide with the empirical word frequencies calculated above with the word-counter script; a quick check is sketched below.
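A rough check of that, assuming a character vector terms that gives the term for each row of foo (as used in the printing step below) and the proportions.RData produced by the word-counter step:

load("proportions.RData")   # words: term, frequency, proportion
idx <- match(terms, words$term)
## correlation between model-implied and empirical term proportions
cor(p.term, words$proportion[idx], use = "complete.obs")
plot(words$proportion[idx], p.term, pch = 20,
     xlab = "empirical term proportion", ylab = "model-implied p.term")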
The relevance definition used in LDAvis uses only p.terms.given.topic and p.term, so start with that one.
relevance(for term t, in topic k, given lambda l) = lambda * log (p.terms.given.topic) + (1-lambda) * log (p.terms.given.topic / p.term)
lambda <- 0.6
relevance <- lambda * log(p.terms.given.topic) + (1-lambda) * log(p.terms.given.topic/p.term)
Print the results
topic.loadings.indices <- apply(relevance, 2, function(x) { head(order(x, decreasing=TRUE), n = 20) })
topic.loading.terms <- apply(topic.loadings.indices, 2, function(x) { terms[x] })
topic.loading.terms
Not good. So the "weights" are not useful for deriving p(term|topic), but we can estimate it empirically by inspecting the predictions.
foo <- read.table("predictions.txt", sep = " ")
Which are the most common topics in the corpus?
To get a predicted mix of topics for 100 batches of documents (in the files clean-<j>.txt), run this shell snippet:
for j in {1..100}; do ~/src/vowpal_wabbit/build/vowpalwabbit/vw -t -i weights -d clean-$j.txt -p predictions-$j.txt; done
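To answer the question about the most common topics, the per-batch predictions can then be aggregated in R. A minimal sketch, assuming each row of predictions-<j>.txt holds one document's (unnormalised) topic weights, one column per topic:

## Stack the predictions for all 100 batches
pred <- do.call(rbind, lapply(1:100, function(j) {
    as.matrix(read.table(paste0("predictions-", j, ".txt")))
}))
## Normalise each document into a topic mixture and average over documents
theta <- pred / rowSums(pred)
topic.share <- colMeans(theta)
head(sort(topic.share, decreasing = TRUE), n = 10)   # the ten most common topics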
I guess there is a variance-covariance matrix as well, not only a p(topic|term) but a p(topic|(term1 & term2)) and p(topic|(term1 & term3)). Actually no.
It seems the only way to get p(term|topic) is to generate predictions of the topic mixtures for the documents, then sum over them to get counts f(term, topic) and divide by n_words(topic); a sketch of that accumulation follows below.
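A sketch of that accumulation in R, assuming the clean-<j>.txt / predictions-<j>.txt pairs produced above and the global dictionary from the cleaning step (the parsing of the "| term:count ..." lines reflects the format written by text-cleaning.R):

## Spread each document's word counts over topics according to its predicted
## topic mixture, accumulate a term x topic count matrix, then normalise each
## topic column so that it sums to one.
load("global_dict.RData")   # dictionary
n.topics <- 100
counts <- matrix(0, nrow = length(dictionary), ncol = n.topics,
                 dimnames = list(dictionary, NULL))
for (j in 1:100) {
    docs  <- readLines(paste0("clean-", j, ".txt"))
    theta <- as.matrix(read.table(paste0("predictions-", j, ".txt")))
    theta <- theta / rowSums(theta)
    for (d in seq_along(docs)) {
        pairs <- strsplit(strsplit(sub("^\\| ?", "", docs[d]), " ")[[1]], ":")
        trm   <- sapply(pairs, `[`, 1)
        cnt   <- as.numeric(sapply(pairs, `[`, 2))
        keep  <- trm %in% dictionary
        counts[trm[keep], ] <- counts[trm[keep], ] + outer(cnt[keep], theta[d, ])
    }
}
p.term.given.topic.emp <- sweep(counts, 2, colSums(counts), "/")

The resulting p.term.given.topic.emp matrix can then be plugged into the relevance formula above in place of the weight-based estimate.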
vw-lda
a wrapper for vw --lda