What is latent Dirichlet allocation?

Published: 22 April 2024
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

 

Latent Dirichlet allocation is a topic modeling technique for uncovering the central topics and their distributions across a set of documents.

Latent Dirichlet allocation (LDA)—not to be confused with linear discriminant analysis in machine learning—is a Bayesian approach to topic modeling. Simply put, LDA is a conditional, probabilistic form of topic modeling.

Topic modeling is a natural language processing (NLP) technique that applies unsupervised learning on large text datasets in order to produce a summary set of terms derived from those documents. These terms are meant to represent the collection’s overall primary set of topics. In this way, topic models aim to uncover the latent topics or themes characterizing a collection of documents.1

Users can readily generate LDA topic models in Python using scikit-learn (sklearn), the Natural Language Toolkit (link resides outside ibm.com) (NLTK) and gensim (link resides outside ibm.com). The topicmodels (link resides outside ibm.com) and lda (link resides outside ibm.com) packages in R also contain functions for generating LDA topic models.
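
For instance, a minimal sketch of fitting an LDA model with gensim might look like the following; the toy documents, topic count and number of training passes are placeholder values, not recommendations.

```python
# Minimal gensim LDA sketch; the documents and parameters below are illustrative only.
from gensim import corpora
from gensim.models import LdaModel

# Pre-tokenized toy documents (real pipelines would tokenize and preprocess first)
documents = [
    ["border", "policy", "visa", "immigration", "asylum"],
    ["telescope", "orbit", "space", "launch", "comet"],
]

dictionary = corpora.Dictionary(documents)                 # map each token to an integer id
corpus = [dictionary.doc2bow(doc) for doc in documents]    # bag-of-words counts per document

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

# Print each topic as a weighted list of keywords
for topic_id, keywords in lda.print_topics(num_words=5):
    print(topic_id, keywords)
```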

Text generation assumption

LDA is a probabilistic model. This means the LDA algorithm generates topics, classifying words and documents among these different topics, according to probability distributions. But why is this an appropriate method for uncovering hidden thematic structures in text data? Essentially, LDA assumes documents have been generated through random sampling of pre-document topics, and it attempts to reverse engineer this sampling.

LDA’s imagined generative text process begins with pre-document topics. Each topic is a fixed vocabulary of words, in which each word has a probability of belonging to that topic. Note that words are assigned probabilities rather than a discrete category to account for potential plurality of meaning and use. Then, to create a text document, the machine randomly generates a distribution over topics—for example, eighty-five percent politics, ten percent finance, five percent history. For each word in this document, the machine randomly selects one of the chosen topics and then randomly emits a word from that topic. Note that LDA does not itself generate documents this way. Rather, this is the process by which LDA assumes documents are generated. LDA itself attempts to reverse this process to uncover the overarching and hidden topics pervading a corpus of documents.
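
As a toy illustration of this assumed generative process (not of LDA inference itself), the sketch below draws a topic mixture from a Dirichlet distribution and then samples each word by first picking a topic and then picking a word from that topic; the topics, vocabulary and proportions are invented for the example.

```python
# Toy simulation of the generative story LDA assumes; all values are invented.
import numpy as np

rng = np.random.default_rng(0)

vocab = ["election", "vote", "market", "stock", "empire", "war"]
# Each row is a topic: a probability distribution over the whole vocabulary.
topics = np.array([
    [0.45, 0.45, 0.02, 0.02, 0.03, 0.03],   # a "politics" topic
    [0.02, 0.02, 0.48, 0.44, 0.02, 0.02],   # a "finance" topic
    [0.03, 0.03, 0.02, 0.02, 0.45, 0.45],   # a "history" topic
])

# Step 1: draw the document's topic mixture (roughly 85/10/5 in expectation here).
theta = rng.dirichlet(alpha=[8.5, 1.0, 0.5])

# Step 2: for each word position, pick a topic, then emit a word from that topic.
document = []
for _ in range(10):
    z = rng.choice(len(topics), p=theta)        # choose a topic for this position
    w = rng.choice(len(vocab), p=topics[z])     # emit a word from the chosen topic
    document.append(vocab[w])

print(theta)
print(document)
```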

This figure illustrates the assumed text generation process underlying LDA:

At the bottom of this image is a random assortment of topics, each comprising a distribution over words. These topics exist for every document in the collection. To generate a given document (in this example, Shakespeare’s Othello), the machine generates a distribution of topics, which is the graph on the far right. This distribution specifies the topics (and the proportion of each) that will make up the document. For each word in the document, the machine randomly selects one of the chosen topics (the column of colored dots), from which it in turn randomly chooses a word (the highlighted terms).2 Note this visualization is only an illustrative example, not the result of any actual topic modeling or text generation. As mentioned, however, LDA assumes every document is generated by this general process, which LDA attempts to reverse engineer.

How latent Dirichlet allocation works

Much like other topic modeling approaches, such as latent semantic analysis (LSA), LDA treats each document in a corpus as a bag of words. This means LDA ignores word order and context. Instead, it focuses on how often words occur and co-occur within each individual document.3 Word frequencies and co-occurrences are modeled in the document-term matrix. Here, documents are rows and individual words are columns, or vice-versa. Each cell’s value indicates the frequency with which a given word appears in each document.4
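
As a brief sketch, such a document-term matrix can be built with scikit-learn’s CountVectorizer; the two documents below are placeholders.

```python
# Build a bag-of-words document-term matrix with scikit-learn; documents are illustrative.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the border policy changed immigration rules at the border",
    "the telescope photographed a distant galaxy in deep space",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)        # rows = documents, columns = words

print(vectorizer.get_feature_names_out())   # the column (word) labels
print(dtm.toarray())                        # each cell = a word's frequency in a document
```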

Topic generation

Using the document-term matrix, the LDA algorithm generates topic distributions (that is, lists of keywords with their respective probabilities) according to word frequencies and co-occurrences. By tracking co-occurrence, LDA assumes that words which frequently occur together likely belong to similar topics. The algorithm then assigns document-topic distributions based on the clusters of words that appear in the given document.5

For example, say we generate an LDA model for a collection of news articles that produces the following partial output:

Topic 1            Topic 2
border (0.40)      space (0.60)
…                  …

Document 1: Topic 1: 0.95, Topic 2: 0.05
…

Here, we have two topics that might be described as immigration (Topic 1) and astronomy (Topic 2). The scores attached to each word are the probability of that keyword appearing in its given topic. The probabilities attached to each document describe that document’s mixture of topics, given the distribution and co-occurrence of words from each topic within that document. For example, the table’s first row lists border under Topic 1 with a 40% probability and space under Topic 2 with a 60% probability. These percentages indicate the probability of their respective terms occurring in that topic across the whole corpus. The first document row reads Document 1: Topic 1: 0.95, Topic 2: 0.05. This means that, based on the occurrence of words in Document 1, the model projects Document 1 as being 95% derived from Topic 1 and 5% derived from Topic 2. In other words, our hypothetical LDA model assumes these are the topics, and these the proportions, used to generate the document.
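
As a hedged sketch, the two kinds of distributions in this example (topic-word and document-topic) can be inspected with scikit-learn’s LatentDirichletAllocation; the documents below are invented, so the fitted numbers will not reproduce the table exactly.

```python
# Inspect topic-word and document-topic distributions with scikit-learn; toy data only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "new border rules tighten immigration checks at the border",
    "the probe left earth orbit and headed into deep space",
]

vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
doc_topic = lda.fit_transform(dtm)          # each row is a document's topic mixture

# Topic-word probabilities: normalize each topic's word counts to sum to 1.
topic_word = lda.components_ / lda.components_.sum(axis=1, keepdims=True)

terms = vectorizer.get_feature_names_out()
for t, dist in enumerate(topic_word):
    top = dist.argsort()[::-1][:3]
    print(f"Topic {t + 1}:", [(terms[i], round(dist[i], 2)) for i in top])

for d, mixture in enumerate(doc_topic):
    print(f"Document {d + 1}:", [round(p, 2) for p in mixture])
```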

Of course, polysemous words in particular create problems for such discrete categorizations—for example, alien may refer to a human immigrant or an extra-terrestrial creature. In such cases, how does the LDA algorithm determine to which topic the word (and by extension, the document) belongs?

Gibbs formula

When assigning topics to words, the LDA algorithm uses what is known as Gibbs sampling. The Gibbs sampling formula is:
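
One common way to write the collapsed Gibbs sampling update for a single word’s topic assignment, consistent with the two ratios described below (a standard reconstruction, not a verbatim formula from this article), is:

$$
P(z_i = t \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\; \frac{n_{d,t}^{-i} + \alpha}{n_{d}^{-i} + T\alpha} \cdot \frac{n_{t,w_i}^{-i} + \beta}{n_{t}^{-i} + V\beta}
$$

Here, $z_i$ is the topic assignment of word $w_i$ in document $d$; $n_{d,t}$ counts the words in document $d$ assigned to topic $t$; $n_{t,w_i}$ counts how often word $w_i$ is assigned to topic $t$ across the corpus; $n_d$ and $n_t$ are the corresponding totals; $T$ is the number of topics; $V$ is the vocabulary size; $\alpha$ and $\beta$ are the Dirichlet hyperparameters; and the superscript $-i$ indicates that the word currently being sampled is excluded from the counts.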

Understanding this equation’s exact operations and hyperparameters requires foundational knowledge in statistics and Markov Chain Monte Carlo techniques (the latter often employed in reinforcement learning). Nevertheless, we can summarize the equation’s principal components:

  • The first ratio expresses the probability of topic t in document d. The algorithm calculates this probability according to the number of words in document d that belong to topic t. This essentially asks: how prevalent is topic t in document d?
  • The second ratio expresses the probability of word w belonging to topic t. The algorithm calculates this probability by enumerating the occurrences of w in t over all word-tokens in t. This asks: with what frequency does word w appear in topic t throughout the rest of the corpus?

Note that Gibbs sampling is an iterative process. That is, a word is not sampled once, assigned a topic, and tossed aside. Rather, Gibbs sampling passes each word through multiple iterations, updating topic-word probabilities in light of one another.6
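
To make the iteration concrete, below is a compact, illustrative sketch of collapsed Gibbs sampling in NumPy; the toy corpus, hyperparameter values and number of sweeps are assumptions, and production libraries implement this far more efficiently.

```python
# Illustrative collapsed Gibbs sampler for LDA; all data and settings are toy values.
import numpy as np

rng = np.random.default_rng(0)

docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 1, 4, 4]]   # documents as lists of word ids
V, T = 5, 2                                         # vocabulary size, number of topics
alpha, beta = 0.1, 0.01                             # Dirichlet hyperparameters

ndt = np.zeros((len(docs), T))   # words in document d assigned to topic t
ntw = np.zeros((T, V))           # assignments of word w to topic t across the corpus
nt = np.zeros(T)                 # total words assigned to topic t
z = []                           # current topic assignment of every token

# Random initialization of topic assignments
for d, doc in enumerate(docs):
    z.append([])
    for w in doc:
        t = rng.integers(T)
        z[d].append(t)
        ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

# Gibbs sweeps: repeatedly resample every token's topic given all other assignments
for _ in range(200):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]                                  # remove the current assignment
            ndt[d, t] -= 1; ntw[t, w] -= 1; nt[t] -= 1
            # Product of the two ratios: topic prevalence in the document
            # times the word's prevalence in the topic
            p = (ndt[d] + alpha) * (ntw[:, w] + beta) / (nt + V * beta)
            t = rng.choice(T, p=p / p.sum())             # sample a new topic
            z[d][i] = t
            ndt[d, t] += 1; ntw[t, w] += 1; nt[t] += 1

print(ndt / ndt.sum(axis=1, keepdims=True))   # estimated document-topic mixtures
```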

Text classification

LDA’s distinguishing feature is each document’s probabilistic distribution over a shared set of topics. As with other topic modeling approaches, these topics and their distribution in each document are not known in advance. Topic modeling is unsupervised; there is no manually labeled data guiding the analysis. By uncovering hidden topical structures, LDA ultimately annotates documents across a collection. These annotations (that is, the uncovered topics) can be used to classify texts. In this way, LDA aids information retrieval systems by automating the annotation and organization of large collections of documents.7
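
As one hedged sketch of this idea, a document’s topic mixture can serve as a compact feature vector for a downstream classifier; the documents, labels and pipeline below are purely illustrative.

```python
# Use LDA document-topic mixtures as features for text classification; toy data only.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = [
    "visa rules and border checkpoints were tightened",
    "asylum applications rose sharply at the border",
    "the rover landed safely on the lunar surface",
    "astronomers tracked the comet through deep space",
]
labels = ["immigration", "immigration", "astronomy", "astronomy"]

classifier = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=2, random_state=0),  # document-topic features
    LogisticRegression(),
)
classifier.fit(docs, labels)

print(classifier.predict(["a new telescope scans the night sky"]))
```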

Optimization

As with many text mining techniques in data science, text preprocessing vastly improves LDA model results. Stopword removal removes common, semantically irrelevant words. Stemming or lemmatization helps reduce morphological variants into base forms, thereby conflating variations of a word the machine may otherwise separate, for example, dance, dancing, dancer, and so on.
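
A minimal preprocessing sketch with NLTK might look like the following; the sample sentence is invented, and the exact resource downloads required can vary by NLTK version.

```python
# Stopword removal and lemmatization with NLTK; the sentence and output are illustrative.
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt")       # tokenizer models (some versions also need "punkt_tab")
nltk.download("stopwords")   # stopword lists
nltk.download("wordnet")     # lemmatizer data

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

text = "The dancers were dancing while the critics watched the dance"
tokens = [
    lemmatizer.lemmatize(token)
    for token in word_tokenize(text.lower())
    if token.isalpha() and token not in stop_words
]
print(tokens)   # for example: ['dancer', 'dancing', 'critic', 'watched', 'dance']
```

Note that lemmatization without part-of-speech tags leaves some verb forms (such as "dancing") untouched; stemming collapses such variants more aggressively, at the cost of producing non-words.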

Because LDA models are probabilistic and not deterministic, model optimization can be a difficult process. There is no set number of topics that produces the best results, nor does there seem to exist any guiding principle, such as a set number of topics suitable for a given number of documents. Indeed, fine-tuning model parameters to produce interpretable and coherent topics involves a great deal of trial and error.

Evaluation metrics

What tools or metrics are available to help a user fine-tune an LDA model? For those new to topic modeling, it can be frustrating to learn that there is no single performance metric, or even collection of metrics, widely adopted in the literature.

Qualitative. Believe it or not, qualitative evaluation is not uncommon, particularly in real-world applications. This often involves examining each topic’s top five or ten keywords and evaluating how interpretable those topics are to human users.8 This sort of “eyeballing,” so to speak, requires a significant amount of expert domain knowledge and familiarity with the documents under consideration.9

Coherence. Topic coherence is one popular quantitative method for evaluating generated topics. A topic coherence score measures how often a given topic’s most probable words co-occur in the same documents throughout the corpus. More specifically, it compares the co-occurrence frequency of each pair of a topic’s top n words against each individual word’s frequency across the corpus. This aims to quantify how coherent a given topic is. A model’s overall coherence score averages the coherence scores of its individual topics, signifying the average topic coherence within the model. Per its name, coherence evaluates models solely according to how cohesive their topics are. Topics must also maintain a degree of exclusivity, however, for which there is currently no quantitative measure.10
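
As a hedged example, gensim’s CoherenceModel can compute such scores for a fitted model; the sketch below sweeps a few candidate topic counts over a toy corpus, with the "c_v" measure chosen arbitrarily among the coherence variants gensim supports.

```python
# Score topic coherence with gensim for several candidate topic counts; toy corpus only.
from gensim import corpora
from gensim.models import LdaModel
from gensim.models.coherencemodel import CoherenceModel

texts = [
    ["border", "visa", "immigration", "policy"],
    ["border", "asylum", "immigration", "checkpoint"],
    ["space", "orbit", "telescope", "launch"],
    ["space", "comet", "telescope", "astronomy"],
]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

for k in (2, 3, 4):
    lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=k,
                   passes=10, random_state=42)
    coherence = CoherenceModel(model=lda, texts=texts, dictionary=dictionary,
                               coherence="c_v").get_coherence()
    print(k, coherence)   # overall score = average coherence across the k topics
```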

Recent research shows that quantitative metrics, such as coherence score, are unreliable for topic model evaluation. This is, in part, due to ambiguity in the professed evaluative goal of interpretability—what makes a model and its results interpretable?11 Moreover, automated metrics designed for older systems may not extrapolate well to newer systems. This issue is compounded by the lack of transparency in many published experiments, which prevents generalization of evaluation methods to other datasets or domains.12 Research has recently turned to artificial intelligence applications, notably large language models (LLMs), as a means of designing and evaluating LDA models for a specific research objective.13 While this shows promising results, further research is necessary.

Related resources

What is topic modeling?

Topic models are an unsupervised NLP method for summarizing text data through word groups. They assist in text classification and information retrieval tasks.

Train an LDA topic model for text analysis in Python

Train and fine-tune an LDA topic model with Python's NLTK and Gensim.

IBM TechXchange: Topic modeling

Explore topic modelling and how it works in this IBM Community article.

Footnotes

1 Daniel Jurafsky and James Martin, Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 3rd edition, 2023, https://web.stanford.edu/~jurafsky/slp3/ (link resides outside ibm.com). Jay Alammar and Maarten Grootendorst, Hands-On Large Language Models, O’Reilly, 2024.

2 David Blei, “Probabilistic Topic Models,” Communications of the ACM, Vol. 55, No. 4, 2012, pp. 77-84. Zhiyuan Chen and Bing Liu, “Topic Models for NLP Applications,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.

3 Matthew Jockers, Text Analysis with R for Students of Literature, Springer, 2014.

4 Cole Howard, Hobson Lane, and Hannes Hapke, Natural Language Processing in Action, Manning Publications, 2019. Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana, Practical Natural Language Processing, O’Reilly, 2020.

5 Sowmya Vajjala, Bodhisattwa Majumder, Anuj Gupta, Harshit Surana, Practical Natural Language Processing, O’Reilly, 2020. David Blei, Andrew Ng, and Michael Jordan, “Latent Dirichlet Allocation,” Journal of Machine Learning Research, Vol. 3, 2003, pp. 993-1022.

6 Zhiyuan Chen and Bing Liu, “Topic Models for NLP Applications,” Encyclopedia of Machine Learning and Data Mining, Springer, 2017.

7 David Blei, “Probabilistic Topic Models,” Communications of the ACM, Vol. 55, No. 4, 2012, pp. 77-84.

8 Chandler Camille May, “Topic Modeling in Theory and Practice,” Dissertation, Johns Hopkins University, 2022.

9 Matthew Gillings and Andrew Hardie, “The interpretation of topic models for scholarly analysis: An evaluation and critique of current practice,” Digital Scholarship in the Humanities, Vol. 38, No. 2, 2023, pp. 530-543, https://academic.oup.com/dsh/article-abstract/38/2/530/6957052 (link resides outside ibm.com).

10 Chandler Camille May, “Topic Modeling in Theory and Practice,” Dissertation, Johns Hopkins University, 2022. https://aclanthology.org/D11-1024/ (link resides outside ibm.com).

11 Zachary Lipton, “The Mythos of Model Interpretability: In machine learning, the concept of interpretability is both important and slippery,” Queue, Vol. 16, No. 3, pp. 31-57, https://dl.acm.org/doi/10.1145/3236386.3241340 (link resides outside ibm.com). Caitlin Doogan and Wray Buntine, “Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures,” Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 3824-3848, https://aclanthology.org/2021.naacl-main.300 (link resides outside ibm.com).

12 Alexander Hoyle, Pranav Goel, Andrew Hian-Cheong, Denis Peskov, Jordan Boyd-Graber, Philip Resnik, “Is Automated Topic Model Evaluation Broken? The Incoherence of Coherence,” Advances in Neural Information Processing Systems, 2021, pp. 2018-2033, https://proceedings.neurips.cc/paper_files/paper/2021/hash/0f83556a305d789b1d71815e8ea4f4b0-Abstract.html (link resides outside ibm.com). Caitlin Doogan and Wray Buntine, “Topic Model or Topic Twaddle? Re-evaluating Semantic Interpretability Measures,” Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2021, pp. 3824-3848, https://aclanthology.org/2021.naacl-main.300 (link resides outside ibm.com).

13 Dominik Stammbach, Vilém Zouhar, Alexander Hoyle, Mrinmaya Sachan, and Elliott Ash, “Revisiting Automated Topic Model Evaluation with Large Language Models,” Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023, pp. 9348-9357, https://aclanthology.org/2023.emnlp-main.581/ (link resides outside ibm.com).