From the original repository:
"Biterm Topic Model (BTM) is a word co-occurrence based topic model that learns topics by modeling word-word co-occurrences patterns (e.g., biterms). (In constrast, LDA and PLSA are word-document co-occurrence topic models, since they model word-document co-occurrences.)
A biterm consists of two words co-occurring in the same context, for example, in the same short text window. Unlike LDA models the word occurrences, BTM models the biterm occurrences in a corpus. In generation procedure, a biterm is generated by drawn two words independently from a same topic. In other words, the distribution of a biterm b=(wi,wj) is defined as:
P(b) = \sum_k{P(wi|z)*P(wj|z)*P(z)}
.With Gibbs sampling algorithm, we can learn topics by estimate P(w|k) and P(z).
More detail can be referred to the following paper:
Xiaohui Yan, Jiafeng Guo, Yanyan Lan, Xueqi Cheng. A Biterm Topic Model For Short Text. WWW2013."
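As a quick illustration of this mixture, the following sketch computes P(b) for a single biterm from toy values of P(z) and P(w|z); the arrays and numbers are invented for demonstration and are not part of the repository.

```python
# Toy illustration of P(b) = sum_z P(w_i|z) * P(w_j|z) * P(z).
# All numbers below are made up for demonstration only.
p_z = [0.5, 0.3, 0.2]                      # P(z) for K = 3 topics

# P(w|z): one word distribution per topic over a 4-word vocabulary
p_w_given_z = [
    [0.40, 0.30, 0.20, 0.10],              # topic 0
    [0.10, 0.10, 0.40, 0.40],              # topic 1
    [0.25, 0.25, 0.25, 0.25],              # topic 2
]

def p_biterm(w_i, w_j):
    """Probability of the biterm (w_i, w_j) under the topic mixture."""
    return sum(p_z[z] * p_w_given_z[z][w_i] * p_w_given_z[z][w_j]
               for z in range(len(p_z)))

print(p_biterm(0, 2))  # P(b) for the biterm made of word IDs 0 and 2
```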
This fork aims to provide interfaces to the authors' original code base that are better suited to real-world applications, while making minimal modifications.
Run `make` in the repository's root.

Run `script/train.py DOCUMENTS MODEL` in the repository's root, where `DOCUMENTS` is a file with one document (consisting of space-separated tokens) per line and `MODEL` is the directory to create the model in.
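For illustration, a `DOCUMENTS` file in this format could be produced with a few lines of Python; the raw texts, the lower-casing and the whitespace tokenization below are assumptions used only to show the expected one-document-per-line layout.

```python
# Minimal sketch: write a DOCUMENTS file with one space-separated document per line.
raw_texts = [
    "Short texts are hard for standard topic models",
    "BTM models word co-occurrence patterns instead",
]

with open("documents.txt", "w", encoding="utf-8") as f:
    for text in raw_texts:
        tokens = text.lower().split()      # naive tokenization, for demonstration only
        f.write(" ".join(tokens) + "\n")
```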
Training parameters can be set as follows:
- `--num-topics K` or `-k K` to set the number of topics to learn to `K`; this will default to `K=20`.
- `--alpha ALPHA` or `-a ALPHA` to set the alpha parameter as given by the paper; this will default to `ALPHA=K/50`.
- `--beta BETA` or `-b BETA` to set the beta parameter as given by the paper; this will default to `BETA=5`.
- `--num-iterations N_IT` or `-n N_IT` to set the number of training iterations; this will default to `N_IT=5`.
- `--save-steps SAVE_STEPS` or `-s SAVE_STEPS` to set the number of iterations after which the model is saved; this will default to `SAVE_STEPS=500`.
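As an example of a full training call, the script could be launched as follows (shown via Python's `subprocess` for consistency with the other snippets; the interpreter, file names and parameter values are arbitrary and only illustrate the options listed above):

```python
import subprocess

# Example only: train a 50-topic model on documents.txt, saving into model/,
# with 1000 iterations and a checkpoint every 100 iterations.
# Adjust the interpreter ("python3") if the script expects a different one.
subprocess.run(
    [
        "python3", "script/train.py", "documents.txt", "model",
        "--num-topics", "50",
        "--num-iterations", "1000",
        "--save-steps", "100",
    ],
    check=True,
)
```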
After training, the directory `MODEL` will contain
- a file `vocab.txt` with lines `ID TOKEN` that encodes the documents' tokens as integer IDs
- a file `topics.csv` with tab-separated `topic, prob_topic, top_words`, where `topic` is a topic's ID `z` (in `[0..K-1]`), `prob_topic` is `P(z)`, and `top_words` is a comma-separated list of at most 10 tokens `w` with the highest values of `P(w|z)`, i.e. the topic's highest-probability tokens
- a directory `vectors/` that holds the actual model data, i.e. the values for `P(z)` and `P(w|z)` needed for topic inference
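To give an idea of how the described layout of `topics.csv` can be consumed, here is a minimal sketch that prints each topic's prior and top words. It relies only on the tab-separated columns listed above; the model path is hypothetical, and the sketch assumes plain data rows (the text does not say whether the file carries a header).

```python
import csv

# Sketch: read MODEL/topics.csv as described above (tab-separated:
# topic, prob_topic, top_words) and print a short summary per topic.
with open("model/topics.csv", newline="", encoding="utf-8") as f:
    for topic, prob_topic, top_words in csv.reader(f, delimiter="\t"):
        print(f"topic {topic}  P(z)={prob_topic}  top words: {top_words}")
```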
This fork provides a Python class `BTMInferrer` in `script/infer.py` with an interface for fast topic inference on single documents, which can easily be reimplemented analogously in other programming languages. Here, an instance `i` of `BTMInferrer` can be initialized with the model's directory (see section Topic learning). A single document's topic vector can then be inferred by calling `i.infer(document)`, which will return a list of `K` values of type `float` that represents the `K`-dimensional vector `P(z|d)`.
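A usage sketch following this description might look as follows; the exact constructor signature and the expected form of `document` (here a plain string of space-separated tokens) are assumptions based on the text above rather than a verified part of the API.

```python
from infer import BTMInferrer  # assuming script/ is on the Python path

# Sketch only: initialize with the model directory created during training
# and infer the topic distribution P(z|d) of one document.
i = BTMInferrer("model")
document = "short text about topic models"
p_z_given_d = i.infer(document)            # list of K floats

best_topic = max(range(len(p_z_given_d)), key=p_z_given_d.__getitem__)
print(f"most probable topic: {best_topic} (P={p_z_given_d[best_topic]:.3f})")
```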
- the existing `Makefile` was revised for efficiency and to separate the build from the sources
- the existing scripts were recreated to increase efficiency, adaptability and ease of use
- the existing `C++` code was formatted according to the LLVM Coding Standards, and dynamic inference (through `stdin`/`stdout`) was added while making minimal changes and retaining all previous functionality
- the original project's sample data has been removed to decrease the repository's size (once GitHub prunes expired refs)