What Is Stemming?

Published: 29 November 2023
Contributors: Jacob Murel Ph.D., Eda Kavlakoglu

Stemming is one of several text normalization techniques that convert raw text data into a readable format for natural language processing tasks.

Stemming is a text preprocessing technique in natural language processing (NLP). Specifically, it is the process of reducing the inflected forms of a word to one so-called “stem,” or root form, also known as a “lemma” in linguistics.¹ It is one of two primary methods (the other being lemmatization) that reduce inflectional variants within a text dataset to one morphological lexeme. In doing so, stemming aims to improve text processing in machine learning and information retrieval systems.

Why use stemming?

Machines, from search-and-find functions to deep learning models, process language largely according to form, and many researchers argue computers cannot understand meaning in language.² While some debate this latter point, it is nevertheless the case that machine learning models need to be trained to recognize different words as morphological variants of one base word. For instance, in search engines or library catalogs, users may submit a query with one word (for example, investing) but expect results that use any inflected word form (for example, invest, investment, investments, etc.). By reducing derivational word forms to one stem word, stemming helps information retrieval systems equate morphologically related words.³
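The retrieval scenario above can be sketched in a few lines of Python. This is a minimal illustration, not a production stemmer: the suffix list, the toy_stem and build_index helpers, and the sample documents are all hypothetical and chosen only to make the investing example work.

```python
from collections import defaultdict

# Hypothetical suffix list for illustration; longer suffixes are tried first.
SUFFIXES = ("ments", "ment", "ing", "s")

def toy_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(docs):
    """Map each stem to the ids of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            index[toy_stem(token)].add(doc_id)
    return index

# Made-up documents; only d1 to d3 mention a form of "invest".
docs = {
    "d1": "how to invest wisely",
    "d2": "long term investments",
    "d3": "an investment guide",
    "d4": "gardening tips",
}
index = build_index(docs)

# The query is stemmed the same way as the documents before lookup.
hits = sorted(index[toy_stem("investing")])
print(hits)  # ['d1', 'd2', 'd3']
```

Because both the documents and the query pass through the same stemmer, a search for investing matches documents that never contain that exact word form.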

For many text mining tasks including text classification, clustering, indexing, and more, stemming helps improve accuracy by shrinking the dimensionality of the data that machine learning algorithms process and by grouping words according to concept. This reduction in dimensionality can improve the accuracy and precision of statistical NLP models, such as topic models and word embeddings.⁴ Stemming can thereby improve accuracy when carrying out various NLP tasks, such as sentiment analysis or part-of-speech tagging. In this way, stemming serves as an important step in developing large language models.
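A small sketch can make the dimensionality point concrete. It assumes a hypothetical three-suffix stripper (toy_stem) and a made-up token list; the only claim is that distinct surface forms collapse into fewer features after stemming.

```python
from collections import Counter

SUFFIXES = ("ing", "ed", "s")  # hypothetical, illustration only

def toy_stem(word):
    """Remove one listed suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

tokens = "paint paints painted painting look looks looked looking".split()

raw_vocab = sorted(set(tokens))             # one feature per surface form
stem_counts = Counter(toy_stem(t) for t in tokens)
stemmed_vocab = sorted(stem_counts)         # one feature per stem

print(len(raw_vocab), len(stemmed_vocab))   # 8 2
print(stemmed_vocab)                        # ['look', 'paint']
```

Eight vocabulary entries become two, and each stem now carries the combined count of its variants.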

How stemming works

Stemming is one stage in a text mining pipeline that converts raw text data into a structured format for machine processing. Stemming essentially strips affixes from words, leaving only the base form.⁵ This amounts to removing characters from the end of word tokens. Beyond this basic similarity, however, stemming algorithms vary widely.

Types of stemming algorithms

To explore differences between stemming algorithm operations, we can process this line from Shakespeare’s A Midsummer Night’s Dream: “Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind.” Before stemming, users must tokenize the raw text data. The built-in tokenizer of Python’s Natural Language Toolkit (NLTK) outputs the quoted text as:

Tokenized: ['Love', 'looks', 'not', 'with', 'the', 'eyes', 'but', 'with', 'the', 'mind', ',', 'and', 'therefore', 'is', 'winged', 'Cupid', 'painted', 'blind', '.']
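For readers without NLTK at hand, the same token list can be approximated with a regular expression. This is only a stand-in: simple_tokenize is a hypothetical helper, and NLTK's tokenizer treats contractions, abbreviations, and other cases differently.

```python
import re

text = ("Love looks not with the eyes but with the mind, "
        "and therefore is winged Cupid painted blind.")

def simple_tokenize(s):
    """Return runs of word characters and single punctuation marks as tokens."""
    return re.findall(r"\w+|[^\w\s]", s)

tokens = simple_tokenize(text)
print("Tokenized:", tokens)
```

For this sentence the result matches the 19 tokens listed above, with the comma and the period as separate tokens.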

By running the tokenized output through multiple stemmers, we can observe how stemming algorithms differ.

Lovins stemmer

The Lovins stemmer is the first published stemming algorithm. Essentially, it functions as a heavily parametrized find-and-replace function. It compares every input token against a list of common suffixes, with each suffix conditioned by one of 29 rules. If one of the list’s suffixes is found in a token, and removing the suffix does not violate any of the associated suffix’s conditions, the algorithm removes that suffix from the token. The stemmed token is then run through another set of rules, correcting for common malformations in stemmed roots, such as double letters (for example, hopping becomes hopp becomes hop).⁶
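The two phases described above (conditioned suffix removal, then recoding) can be sketched as follows. The ENDINGS table and the DOUBLES set are tiny illustrative stand-ins, not the ending list or transformation rules from Lovins' paper, and lovins_sketch is a hypothetical name.

```python
# Illustrative (suffix, minimum remaining stem length) pairs.
# NOT Lovins' published ending list, only a small stand-in for it.
ENDINGS = [("ing", 3), ("ed", 3), ("es", 2), ("s", 3), ("e", 0)]
ENDINGS.sort(key=lambda rule: len(rule[0]), reverse=True)  # longest match first

DOUBLES = "bdglmnprst"  # assumed set of final letters to undouble

def strip_ending(token):
    """Remove the longest listed suffix whose condition the stem satisfies."""
    for suffix, min_stem in ENDINGS:
        stem = token[: len(token) - len(suffix)]
        if token.endswith(suffix) and len(stem) >= min_stem:
            return stem
    return token

def recode(stem):
    """Repair one common malformation: a doubled final consonant."""
    if len(stem) >= 2 and stem[-1] == stem[-2] and stem[-1] in DOUBLES:
        return stem[:-1]
    return stem

def lovins_sketch(token):
    return recode(strip_ending(token))

print(lovins_sketch("hopping"))  # hop
print(lovins_sketch("painted"))  # paint
print(lovins_sketch("the"))      # th
```

Even this reduced version shows the behavior discussed in this section: hopping is repaired to hop by the recoding phase, while an unrestricted -e rule turns the into th.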

This code uses the Python stemming library⁷ to stem the tokenized Shakespeare quotation:

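# Requires the third-party stemming package and NLTK's tokenizer data
from stemming.lovins import stem
from nltk.tokenize import word_tokenize

text = "Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind."
words = word_tokenize(text)  # split the sentence into word and punctuation tokens
stemmed_words = [stem(word) for word in words]  # stem each token independently
print("Stemmed:", stemmed_words)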

The code outputs:

Stemmed: ['Lov', 'look', 'not', 'with', 'th', 'ey', 'but', 'with', 'th', 'mind', ',', 'and', 'therefor', 'is', 'wing', 'Cupid', 'paint', 'blind', '.']

The output shows how the Lovins stemmer correctly turns conjugations and tenses to base forms (for example, painted becomes paint) while eliminating pluralization (for example, eyes loses its plural ending). But the Lovins stemming algorithm also returns a number of ill-formed stems, such as lov, th, and ey. These malformed root words result from removing too many characters. As is often the case in machine learning, such errors help reveal underlying processes.

When compared against the Lovins stemmer’s list of suffixes, the longest suffix fitting both love and the is the single-character -e. The only condition attached to this suffix is “No restrictions on stem,” meaning the stemmer may remove -e no matter the remaining stem’s length. Unfortunately, neither of the stems lov or th contains any of the characteristics the Lovins algorithm uses to identify malformed words, such as double letters or irregular plurals.⁸

When such malformed stems escape the algorithm, the Lovins stemmer can reduce semantically unrelated words to the same stem: for example, the, these, and this all reduce to th. Of course, these three words are all demonstratives, and so share a grammatical function. But other demonstratives, such as that and those, do not reduce to th. This means the Lovins-generated stems do not properly represent word groups.

Porter stemmer

Compared to the Lovins stemmer, the Porter stemmer takes a more mathematical approach. Essentially, this stemmer classifies every character in a given token as either a consonant (c) or vowel (v), grouping consecutive consonants as C and consecutive vowels as V. The stemmer thus represents every word token as a combination of consonant and vowel groups. Once enumerated this way, the stemmer runs each word token through a list of rules that specify ending characters to remove according to the number of vowel-consonant groups in a token.⁹ Because English itself follows general but not absolute lexical rules, the Porter stemmer algorithm’s systematic criterion for determining suffix removal can return errors.
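The consonant-vowel grouping can be sketched directly. The helper names below (is_consonant, cv_pattern, measure) are hypothetical; the treatment of y, which counts as a vowel only when it follows a consonant, follows the definition in Porter's paper.

```python
VOWELS = set("aeiou")

def is_consonant(word, i):
    """a, e, i, o, u are vowels; y is a vowel only when it follows a consonant."""
    ch = word[i]
    if ch in VOWELS:
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel
        return i == 0 or not is_consonant(word, i - 1)
    return True

def cv_pattern(word):
    """Collapse runs of consonants to C and runs of vowels to V."""
    word = word.lower()
    pattern = []
    for i in range(len(word)):
        symbol = "C" if is_consonant(word, i) else "V"
        if not pattern or pattern[-1] != symbol:
            pattern.append(symbol)
    return "".join(pattern)

def measure(word):
    """Number of vowel-consonant (VC) groups in the collapsed pattern."""
    return cv_pattern(word).count("VC")

print(cv_pattern("therefore"), measure("therefore"))  # CVCVCVCV 3
print(cv_pattern("tree"), measure("tree"))            # CV 0
```

The count of VC groups is the quantity the stemmer's rules test before removing an ending.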

Python NLTK contains a built-in Porter stemmer function. This code deploys the Porter stemming algorithm on the tokenized Shakespeare quotation:

from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

porter_stemmer = PorterStemmer()
text = "Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind."
words = word_tokenize(text)  # tokenize before stemming
stemmed_words = [porter_stemmer.stem(word) for word in words]  # stem each token
print("Stemmed:", stemmed_words)

This code returns:

Stemmed: ['love', 'look', 'not', 'with', 'the', 'eye', 'but', 'with', 'the', 'mind', ',', 'and', 'therefor', 'is', 'wing', 'cupid', 'paint', 'blind', '.']

As with Lovins, Porter correctly changes verb conjugations and noun pluralizations. While lacking Lovins’ other malformed stems (for example, love to lov), the Porter stemming algorithm nevertheless erroneously removes -e from the end of therefore.

Per the Porter stemmer’s consonant-vowel grouping method, therefore is represented as CVCVCVCV, or C(VC)³V, with the exponent signifying repetitions of vowel-consonant groups.

One of the algorithm’s final steps states that, if a word has not undergone any stemming and has an exponent value greater than 1, -e is removed from the word’s ending (if present). Therefore’s exponent value equals 3, and it contains none of the suffixes listed in the algorithm’s other conditions.¹⁰ Thus, therefore becomes therefor.
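The -e removal rule can be sketched in isolation. This is a single step, not the full Porter algorithm, and the helper names are hypothetical. The second clause (a measure of exactly 1 where the stem does not end consonant-vowel-consonant) comes from Porter's paper rather than from the description above, and input is assumed to be lowercase.

```python
import re

def letter_kinds(stem):
    """Label each letter C or V; y is a vowel only after a consonant."""
    kinds = ""
    for ch in stem.lower():
        is_vowel = ch in "aeiou" or (ch == "y" and kinds.endswith("C"))
        kinds += "V" if is_vowel else "C"
    return kinds

def measure(stem):
    """Count vowel-consonant groups (the exponent discussed above)."""
    return len(re.findall(r"V+C+", letter_kinds(stem)))

def ends_cvc(stem):
    """True if the stem ends consonant-vowel-consonant and the last is not w, x, or y."""
    return letter_kinds(stem).endswith("CVC") and stem[-1].lower() not in "wxy"

def remove_final_e(word):
    """Drop a final -e when the remaining stem is long enough by Porter's measure."""
    if not word.endswith("e"):
        return word
    stem = word[:-1]
    m = measure(stem)
    if m > 1 or (m == 1 and not ends_cvc(stem)):
        return stem
    return word

print(remove_final_e("therefore"))  # therefor
print(remove_final_e("love"))       # love
```

The stem therefor has a measure of 3, so the -e goes. The stem lov has a measure of 1 and ends consonant-vowel-consonant, so love keeps its -e, which is why Porter avoids the lov stem that Lovins produces.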

Admittedly, this is the Porter stemmer’s only error on this passage, perhaps testifying to why it is the most widely adopted stemming algorithm. Indeed, the Porter stemmer has served as a foundation for subsequent stemming algorithms.

Snowball stemmer

The Snowball stemmer is an updated version of the Porter stemmer. While it aims to enforce a more robust set of rules for determining suffix removal, it nevertheless remains prone to many of the same errors. As with the Porter stemmer, Python NLTK contains a built-in Snowball stemmer function:

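from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import word_tokenize

# ignore_stopwords=True leaves stoplist words unstemmed
# (this option needs the NLTK stopwords corpus to be downloaded)
stemmer = SnowballStemmer("english", ignore_stopwords=True)
text = "Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind."
words = word_tokenize(text)
stemmed_words = [stemmer.stem(word) for word in words]
print("Stemmed:", stemmed_words)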

This produces the same output for the Shakespeare text as the Porter stemmer, incorrectly reducing therefore to therefor:

Stemmed: ['love', 'look', 'not', 'with', 'the', 'eye', 'but', 'with', 'the', 'mind', ',', 'and', 'therefor', 'is', 'wing', 'cupid', 'paint', 'blind', '.']

The Snowball stemmer differs from Porter in two main ways. First, while the Lovins and Porter stemmers only stem English words, the Snowball stemmer can stem texts in a number of other languages, including Roman script languages such as Dutch, German, and French, as well as Russian, which is written in Cyrillic. Second, the Snowball stemmer, when implemented via the Python NLTK library, can ignore stopwords. Stopwords are a non-universal collection of words that are removed from a dataset during preprocessing. The Snowball stemmer’s predefined stoplist contains words that lack a direct conceptual definition and serve a more grammatical than semantic function. The English stoplist includes the, a, being, and the like.¹¹
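The effect of ignoring stopwords can be sketched without NLTK. The three-word STOPLIST reuses the examples named above, while toy_stem and stem_token are hypothetical helpers that only mimic the idea of an ignore-stopwords switch; they are not the Snowball algorithm.

```python
# Three stopwords named in the text; real stoplists are much longer.
STOPLIST = {"the", "a", "being"}

def toy_stem(word):
    """Hypothetical suffix stripper, used only to have something to skip."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word

def stem_token(word, ignore_stopwords=False):
    """Return stopwords untouched when ignore_stopwords is set; stem everything else."""
    word = word.lower()
    if ignore_stopwords and word in STOPLIST:
        return word
    return toy_stem(word)

print(stem_token("being"))                         # be
print(stem_token("being", ignore_stopwords=True))  # being
print(stem_token("painted", ignore_stopwords=True))  # paint
```

With the switch on, function words pass through unchanged and only content words are reduced.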

Lancaster stemmer

Many sources describe the Lancaster stemmer—also known as the Paice stemmer—as the most aggressive of English language stemmers. The Lancaster stemmer contains a list of over 100 rules that dictate which ending character strings, if present, to replace with other strings, if any. The stemmer iterates through each word token, checking it against all the rules. If the token’s ending string matches that of a rule, the algorithm enacts the rule’s described operation and then runs the new, transformed word through all of the rules again. The stemmer iterates through all of the rules until a given token passes them all without being transformed.¹²

NLTK provides a Lancaster stemmer through its LancasterStemmer class. The example here instead uses the Paice/Husk implementation in the stemming library:¹³

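# Requires the third-party stemming package and NLTK's tokenizer data
from stemming.paicehusk import stem
from nltk.tokenize import word_tokenize

text = "Love looks not with the eyes but with the mind, and therefore is winged Cupid painted blind."
words = word_tokenize(text)  # tokenize before stemming
stemmed_words = [stem(word) for word in words]  # apply the Paice/Husk rules to each token
print("Stemmed:", stemmed_words)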

The code stems the tokenized Shakespeare passage as:

Stemmed: ['Lov', 'look', 'not', 'with', 'the', 'ey', 'but', 'with', 'the', 'mind', ',', 'and', 'theref', '', 'wing', 'Cupid', 'paint', 'blind', '.']

Clearly, the Lancaster stemmer’s iterative approach is the most aggressive of the stemmers, as shown with theref. First, the Lancaster stemmer has the rule “e1>”. This rule removes the single-character -e with no replacement. After the algorithm strips -e from therefore, it runs the new therefor through each rule. The newly transformed word fits the rule “ro2>”. This rule removes the two-character suffix -or with no replacement. The resulting stem theref fits none of the algorithm’s other rules and so is returned as the stemmed base. Unlike Lovins, the Lancaster algorithm has no means of accounting for malformed words.
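The iteration just described can be sketched with only the two rules quoted above. The rule format is read here as a reversed ending, the number of characters to remove, an optional replacement string, and a final ">" meaning "keep going" or "." meaning "stop"; parse_rule and lancaster_sketch are hypothetical names, and the sketch omits any further checks in the published algorithm, so results on other words may differ from a full implementation.

```python
import re

# reversed ending, characters to remove, optional replacement, continue/stop flag
RULE_RE = re.compile(r"^([a-z]+)(\d)([a-z]*)([>.])$")

def parse_rule(rule):
    """Turn a rule string such as 'ro2>' into (ending, remove_count, append, keep_going)."""
    ending, count, append, flag = RULE_RE.match(rule).groups()
    return ending[::-1], int(count), append, flag == ">"

def lancaster_sketch(word, rules=("e1>", "ro2>")):
    """Apply rules repeatedly until no rule changes the word."""
    parsed = [parse_rule(rule) for rule in rules]
    changed = True
    while changed:
        changed = False
        for ending, count, append, keep_going in parsed:
            if word.endswith(ending) and len(word) > count:
                word = word[: len(word) - count] + append
                if not keep_going:
                    return word
                changed = True
                break  # restart from the first rule with the new word
    return word

print(parse_rule("ro2>"))             # ('or', 2, '', True)
print(lancaster_sketch("therefore"))  # theref
```

Each pass restarts from the first rule, which is why a word can lose several endings in a row: therefore becomes therefor, then theref, and only then passes every rule unchanged.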

Limitations of stemming

Language support

There are many English stemmers, as well as stemmers for other Roman script languages. More recently, research has turned towards developing and evaluating stemming algorithms for non-Roman script languages. Arabic, in particular, can be challenging due to its complex morphology and orthographic variations. A handful of studies compare the efficacy of different Arabic stemmers in relation to tasks such as classification.¹⁴ Additionally, researchers investigate stemming’s accuracy in improving information retrieval tasks in Tamil¹⁵ and Sanskrit.¹⁶

Over-stemming and under-stemming

While research evidences stemming’s role in improving NLP task accuracy, users need to watch for two primary issues. Over-stemming occurs when two semantically distinct words are reduced to the same root, and so conflated. Under-stemming occurs when two semantically related words are not reduced to the same root.¹⁷ An example of over-stemming is the Lancaster stemmer’s reduction of wander to wand, two semantically distinct terms in English. Neither the Porter nor the Lovins stemmer alters wander at all, however. An example of under-stemming is the Porter stemmer’s leaving knavish as knavish and knave as knave, although the two words share the same semantic root. By comparison, the Lovins stemmer reduces both words to knav.
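Both error types can be checked mechanically once stemmer output is compared against a hand-labeled grouping. In this sketch the stems are copied from the examples above rather than computed, the concept labels are hypothetical, and the assumption that the Lancaster stemmer leaves wand as wand is ours.

```python
from itertools import combinations

def stemming_errors(stems, concepts):
    """Return (over_stemmed_pairs, under_stemmed_pairs) for a word-to-stem mapping."""
    over, under = [], []
    for a, b in combinations(sorted(stems), 2):
        same_stem = stems[a] == stems[b]
        same_concept = concepts[a] == concepts[b]
        if same_stem and not same_concept:
            over.append((a, b))    # distinct meanings conflated
        elif same_concept and not same_stem:
            under.append((a, b))   # related words kept apart
    return over, under

# Hypothetical hand-made labels: which words belong to the same concept.
concepts = {"wander": "WANDER", "wand": "WAND", "knave": "KNAVE", "knavish": "KNAVE"}

# Stems as reported in the text (wand -> wand is an assumption).
lancaster = {"wander": "wand", "wand": "wand"}
porter = {"knave": "knave", "knavish": "knavish"}
lovins = {"knave": "knav", "knavish": "knav"}

print(stemming_errors(lancaster, concepts))  # ([('wand', 'wander')], [])
print(stemming_errors(porter, concepts))     # ([], [('knave', 'knavish')])
print(stemming_errors(lovins, concepts))     # ([], [])
```

The same comparison scales to a larger labeled word list when evaluating which stemmer suits a given corpus.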

Base formation

Though having similar uses and objectives, stemming and lemmatization differ in small but key ways. Literature often describes stemming as more heuristic, essentially stripping common suffixes from words to produce a root word. Lemmatization, by comparison, conducts a more detailed morphological analysis of different words to determine a dictionary base form, removing not only suffixes, but prefixes as well. While stemming is quicker and more readily implemented, many developers of deep learning tools may prefer lemmatization given its more nuanced stripping process.
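The contrast between heuristic stripping and dictionary lookup can be sketched with two toy functions. Both are hypothetical: LEMMA_TABLE stands in for the morphological analysis a real lemmatizer performs, and toy_stem is a two-suffix stripper rather than any published stemmer.

```python
# Hypothetical lookup table standing in for a real morphological dictionary.
LEMMA_TABLE = {"mice": "mouse", "studies": "study"}

def toy_stem(word):
    """Heuristic: strip a listed suffix without consulting any dictionary."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def toy_lemmatize(word):
    """Dictionary-based: return the listed base form, or the word unchanged."""
    return LEMMA_TABLE.get(word, word)

for word in ("mice", "studies"):
    print(word, toy_stem(word), toy_lemmatize(word))
# mice mice mouse
# studies studi study
```

The stripper is fast and needs no resources, but it misses the irregular plural mice and returns the non-word studi, while the lookup returns dictionary forms at the cost of needing the dictionary.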

Footnotes

¹ Ruslan Mitkov, Oxford Handbook of Computational Linguistics, 2nd edition, Oxford University Press, 2014.

² Emily Bender and Alexander Koller, “Climbing towards NLU: On Meaning, Form, and Understanding in the Age of Data,” Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 5185-5198, https://aclanthology.org/2020.acl-main.463/ (link resides outside ibm.com)

³ Steven Bird, Ewan Klein, and Edward Loper, Natural Language Processing with Python, O’Reilly, 2009.

⁴ Gary Miner, Dursun Delen, John Elder, Andrew Fast, Thomas Hill, and Robert A. Nisbet, Practical Text Mining and Statistical Analysis for Non-Structured Text Data Applications, Academic Press, 2012.

⁵ Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press, 1999.

⁶ Julie Beth Lovins, "Development of a stemming algorithm," Mechanical Translation and Computational Linguistics, Vol. 11, Nos. 1 and 2, 1968, pp. 22-31, https://aclanthology.org/www.mt-archive.info/MT-1968-Lovins.pdf (link resides outside ibm.com)

⁷ https://pypi.org/project/stemming/1.0/ (link resides outside ibm.com)

⁸ Julie Beth Lovins, "Development of a stemming algorithm," Mechanical Translation and Computational Linguistics, Vol. 11, Nos. 1 and 2, 1968, pp. 22-31, https://aclanthology.org/www.mt-archive.info/MT-1968-Lovins.pdf (link resides outside ibm.com)

⁹ Martin Porter, "An algorithm for suffix stripping", Program: electronic library and information systems, Vol. 14, No. 3, 1980, pp. 130-137, https://www.emerald.com/insight/content/doi/10.1108/eb046814/full/html (link resides outside ibm.com)

¹⁰ Martin Porter, "An algorithm for suffix stripping", Program: electronic library and information systems, Vol. 14, No. 3, 1980, pp. 130-137, https://www.emerald.com/insight/content/doi/10.1108/eb046814/full/html (link resides outside ibm.com)

¹¹ Martin Porter, “Snowball: A language for stemming algorithms,” 2001, https://snowballstem.org/texts/introduction.html (link resides outside ibm.com)

¹² Chris Paice, “Another stemmer,” ACM SIGIR Forum, Vol. 24, No. 3, 1990, pp. 56–61, https://dl.acm.org/doi/10.1145/101306.101310 (link resides outside ibm.com)

¹³ https://pypi.org/project/stemming/1.0/ (link resides outside ibm.com)

¹⁴ Y. A. Alhaj, J. Xiang, D. Zhao, M. A. A. Al-Qaness, M. Abd Elaziz and A. Dahou, "A Study of the Effects of Stemming Strategies on Arabic Document Classification," IEEE Access, Vol. 7, pp. 32664-32671, https://ieeexplore.ieee.org/document/8664087 (link resides outside ibm.com). Janneke van der Zwaan, Maksim Abdul Latif, Dafne van Kuppevelt, Melle Lyklema, Christian Lange, "Are you sure your tool does what it is supposed to do? Validating Arabic root extraction," Digital Scholarship in the Humanities, Vol. 36, 2021, pp. 137–150, https://academic.oup.com/dsh/article/36/Supplement_1/i137/5545478?login=false (link resides outside ibm.com)

¹⁵ Ratnavel Rajalakshmi, Srivarshan Selvaraj, Faerie Mattins, Pavitra Vasudevan, Anand Kumar, "HOTTEST: Hate and Offensive content identification in Tamil using Transformers and Enhanced Stemming," Computer Speech & Language, Vol. 78, 2023, https://www.sciencedirect.com/science/article/abs/pii/S0885230822000870?via%3Dihub (link resides outside ibm.com)

¹⁶ Siba Sankar Sahu and Sukomal Pal, "Building a text retrieval system for the Sanskrit language: Exploring indexing, stemming, and searching issues," Computer Speech & Language, Vol. 81, 2023, https://www.sciencedirect.com/science/article/abs/pii/S0885230823000372?via%3Dihub (link resides outside ibm.com)

¹⁷ Chris Paice, “Stemming,” Encyclopedia of Database Systems, Springer, 2020.