Text Preprocessing in NLP (Tokenization, Stemming, and Lemmatization)

Before machines can understand language, raw text must first be converted into a cleaner and more structured form. Human language is rich, ambiguous, inconsistent, and full of variation. The same idea may appear in uppercase or lowercase, singular or plural, formal or informal style, and with punctuation, spelling noise, or unnecessary words. If raw text is given directly to machine learning systems, useful patterns may remain hidden behind this inconsistency.

This is why text preprocessing is one of the most important stages in Natural Language Processing (NLP). It transforms unstructured text into a usable representation for downstream tasks such as sentiment analysis, spam filtering, document classification, search engines, chatbots, recommendation systems, and language models.

Among the foundational techniques in preprocessing are tokenization, stemming, and lemmatization. These methods help break text into units, reduce word variation, and improve consistency for modeling.

This article explains these techniques practically, with examples in Python and guidance on when to use each one.

Why Does Raw Text Need Preprocessing?

Consider the words connect, connected, connecting, and connection. To humans, these words are clearly related. To a machine, they may initially appear as separate, unrelated tokens. Similarly, "Great product!" and "great product" usually carry the same meaning despite differences in capitalization and punctuation.

Without preprocessing, vocabulary becomes unnecessarily large, sparse, and noisy. This can reduce model quality, increase memory usage, and make learning slower. Preprocessing helps standardize text so algorithms focus on meaning rather than superficial variation.
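To see this concretely, here is a small pure-Python sketch (the example documents are invented for illustration) showing how even simple normalization collapses surface variants and shrinks the vocabulary:

```python
import string

docs = ["Great product!", "great product", "GREAT Product!!"]

def normalize(text):
    # Lowercase, then strip punctuation so surface variants collapse
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))

# Vocabulary before and after normalization
raw_vocab = {tok for doc in docs for tok in doc.split()}
clean_vocab = {tok for doc in docs for tok in normalize(doc).split()}

print(len(raw_vocab), "->", len(clean_vocab))  # 6 -> 2
```

Three documents that say the same thing produce six distinct raw tokens but only two after normalization.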

What is Tokenization?

Tokenization is the process of breaking text into smaller units called tokens. A token may be a word, subword, sentence, or character depending on the task. Word tokenization is one of the most common starting points in classical NLP pipelines.

These tokens then become units for counting, embedding, classification, or further processing.
Example: Tokenization
import nltk
from nltk.tokenize import word_tokenize

# Download tokenizer data on first use
# (newer NLTK versions may require 'punkt_tab')
nltk.download('punkt')

# Input text
text = "The movie was surprisingly good."

# Tokenize text
tokens = word_tokenize(text)

# Output
print(tokens)
Output:
['The', 'movie', 'was', 'surprisingly', 'good', '.']
Notice that punctuation may also appear as a token depending on the tokenizer.

Word tokenization splits text into words. Sentence tokenization splits paragraphs into sentences. Character tokenization splits into individual letters. Modern transformer models often use subword tokenization, where uncommon words are broken into meaningful parts.

For example, "unhappiness" may become "un", "happy", and a suffix unit depending on tokenizer design.
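As a rough illustration of how such a split can work, here is a minimal greedy longest-match-first tokenizer in the spirit of WordPiece. The toy vocabulary and the "##" continuation prefix follow BERT-style conventions; real subword tokenizers learn their vocabulary from data, which this sketch omits:

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split (WordPiece-style sketch)."""
    tokens, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # mark word-internal pieces
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return ["[UNK]"]  # no piece matched: unknown token
        tokens.append(piece)
        start = end
    return tokens

# Toy vocabulary, invented for illustration
vocab = {"un", "##happi", "##ness", "happy"}
print(wordpiece_tokenize("unhappiness", vocab))  # ['un', '##happi', '##ness']
```

The key idea is that rare words fall back to smaller known pieces instead of becoming out-of-vocabulary tokens.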

What is Stemming?

Stemming reduces words to a simpler root form by removing common prefixes or suffixes such as -ing, -ed, or -s. It follows rule-based shortcuts rather than understanding grammar or actual word meaning. As a result, the output may sometimes be an incomplete or non-dictionary word, but it helps group similar words together for text analysis.
Python Example: Stemming
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["playing", "played", "studies", "connection"]
for word in words:
    print(word, "->", stemmer.stem(word))
Output:
playing -> play
played -> play
studies -> studi
connection -> connect
Stemming is fast and useful when exact grammar is less important than reducing vocabulary size.
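NLTK ships more than one stemmer, each with slightly different rules. As a quick comparison sketch, the Porter stemmer and the Snowball stemmer (a refined successor, sometimes called Porter2) can disagree on words like "fairly":

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

# Compare the two stemmers side by side
for word in ["playing", "studies", "fairly"]:
    print(word, "| Porter:", porter.stem(word), "| Snowball:", snowball.stem(word))
```

For most words the outputs agree, but Porter leaves "fairly" as "fairli" while Snowball reduces it to "fair", so it can be worth trying both on your own data.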

What is Lemmatization?

Lemmatization reduces words to their meaningful base dictionary form called a lemma. Unlike stemming, it uses vocabulary knowledge and often part-of-speech context. This usually produces cleaner and more linguistically valid outputs.

Lemmatization is generally slower than stemming but more accurate.
Python Example: Lemmatization
import nltk
from nltk.stem import WordNetLemmatizer

# Download WordNet data on first use
nltk.download('wordnet')

# Create lemmatizer
lemmatizer = WordNetLemmatizer()

# Input words
words = ["running", "cars", "better"]

# Lemmatize words
for word in words:
    print(word, "->", lemmatizer.lemmatize(word))
Output:
running -> running
cars -> car
better -> better
Lemmatization gives better results when it knows the word's part of speech (POS). Without a POS tag, NLTK's WordNetLemmatizer treats every word as a noun, which is why "running" and "better" were left unchanged above.
Python Example: Lemmatization with POS Tagging
import nltk
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet

# Download tagger and WordNet data on first use
nltk.download('wordnet')
nltk.download('averaged_perceptron_tagger')

lemmatizer = WordNetLemmatizer()


# Convert POS tags to WordNet format
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN


words = ["running", "cars", "better"]

# POS tagging
tagged_words = pos_tag(words)

# Lemmatization
for word, tag in tagged_words:
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(word, "->", lemma)
Output:
running -> run
cars -> car
better -> well
Stemming is faster, simpler, and useful for search indexing or quick text classification where approximate roots are acceptable. Lemmatization is more precise and better when language quality matters, such as chatbots, analytics, semantic search, or advanced NLP pipelines.

If speed is critical, stemming may be enough. If meaning is critical, lemmatization is usually preferred.
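To show what the speed-oriented choice looks like end to end, here is a minimal sketch of a stemming-based preprocessing pipeline. The `preprocess` helper and its regex are illustrative choices, not a standard API:

```python
import re
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    # Lowercase, keep only alphabetic tokens, then stem each one
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stemmer.stem(tok) for tok in tokens]

print(preprocess("The movies were surprisingly good!"))
```

Note the trade-off in the output: "movies" becomes the non-dictionary stem "movi", which is fine for indexing or classification but would look wrong in user-facing text, where lemmatization is the better fit.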

Conclusion

Beyond tokenization, stemming, and lemmatization, text pipelines often also include lowercasing, punctuation removal, stopword removal, spelling normalization, emoji handling, contraction expansion, and whitespace cleanup.
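A compact sketch of a few of these extra steps, using a tiny made-up stopword list for illustration (NLTK provides a fuller list via `nltk.corpus.stopwords`):

```python
import re

# Tiny illustrative stopword set; real pipelines use a much longer list
STOPWORDS = {"the", "a", "is", "was"}

def clean(text):
    text = text.lower()                        # lowercasing
    text = re.sub(r"[^\w\s]", "", text)        # punctuation removal
    text = re.sub(r"\s+", " ", text).strip()   # collapse extra whitespace
    return [tok for tok in text.split() if tok not in STOPWORDS]

print(clean("The  movie was GREAT!!"))  # ['movie', 'great']
```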

Text preprocessing transforms messy human language into structured signals machines can learn from. Tokenization breaks text into usable units, stemming reduces words through heuristic roots, and lemmatization restores meaningful base forms.

Though modern NLP has evolved rapidly, these foundations remain essential knowledge. Understanding how text is cleaned and standardized often matters as much as choosing the model itself.
Nagesh Chauhan
Principal Engineer | Java · Spring Boot · Python · Microservices · AI/ML

Principal Engineer with 14+ years of experience in designing scalable systems using Java, Spring Boot, and Python. Specialized in microservices architecture, system design, and machine learning.
