Text preprocessing is one of the most important stages in Natural Language Processing (NLP). It transforms unstructured text into a usable representation for downstream tasks such as sentiment analysis, spam filtering, document classification, search engines, chatbots, recommendation systems, and language models.
Among the foundational techniques in preprocessing are tokenization, stemming, and lemmatization. These methods help break text into units, reduce word variation, and improve consistency for modeling.
This article explains these techniques practically, with examples in Python and guidance on when to use each one.
Consider the words connect, connected, connecting, and connection. To humans, these words are clearly related. To a machine, they may initially appear as separate, unrelated tokens. Similarly, "Great product!" and "great product" should usually be treated the same despite differences in capitalization and punctuation.
Without preprocessing, vocabulary becomes unnecessarily large, sparse, and noisy. This can reduce model quality, increase memory usage, and make learning slower. Preprocessing helps standardize text so algorithms focus on meaning rather than superficial variation.
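As a minimal sketch of that standardization (the helper name normalize is ours, and real pipelines vary in which steps they apply), the snippet below lowercases text and strips punctuation so that "Great product!" and "great product" collapse to the same form:
import string
# Lowercase and strip punctuation so surface variants collapse together
def normalize(text):
    text = text.lower()
    return text.translate(str.maketrans("", "", string.punctuation))
print(normalize("Great product!"))
print(normalize("great product"))
Output:
great product
great product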
What is Tokenization?
Tokenization is the process of breaking text into smaller units called tokens. A token may be a word, subword, sentence, or character depending on the task. Word tokenization is one of the most common starting points in classical NLP pipelines. These tokens then become units for counting, embedding, classification, or further processing.
Example: Tokenization
from nltk.tokenize import word_tokenize
# Run nltk.download('punkt') once if the tokenizer data is missing
# Input text
text = "The movie was surprisingly good."
# Tokenize text into words
tokens = word_tokenize(text)
# Print the resulting tokens
print(tokens)
Output:
['The', 'movie', 'was', 'surprisingly', 'good', '.']
Notice that punctuation may also appear as a token depending on the tokenizer.
Word tokenization splits text into words. Sentence tokenization splits paragraphs into sentences. Character tokenization splits into individual letters. Modern transformer models often use subword tokenization, where uncommon words are broken into meaningful parts.
For example, "unhappiness" may become "un", "happy", and a suffix unit depending on tokenizer design.
What is Stemming?
Stemming reduces words to a simpler root form by removing common prefixes or suffixes such as -ing, -ed, or -s. It follows rule-based shortcuts rather than understanding grammar or actual word meaning. As a result, the output may sometimes be an incomplete or non-dictionary word, but it helps group similar words together for text analysis.
Python Example: Stemming
from nltk.stem import PorterStemmer
# Create a Porter stemmer instance
stemmer = PorterStemmer()
# Input words
words = ["playing", "played", "studies", "connection"]
# Stem each word and show the mapping
for word in words:
    print(word, "->", stemmer.stem(word))
Output:
playing -> play
played -> play
studies -> studi
connection -> connect
Stemming is fast and useful when exact grammar is less important than reducing vocabulary size.
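One caveat worth knowing: because stemming is purely rule-based, it can also collapse unrelated words to the same stem. A classic example (worth verifying on your own vocabulary) is that the Porter stemmer maps both "universe" and "university" to the same root:
from nltk.stem import PorterStemmer
stemmer = PorterStemmer()
# Unrelated words can end up sharing a stem (over-stemming)
print(stemmer.stem("universe"))
print(stemmer.stem("university"))
Output:
univers
univers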
What is Lemmatization?
Lemmatization reduces words to their meaningful base dictionary form, called a lemma. Unlike stemming, it uses vocabulary knowledge and often part-of-speech context. This usually produces cleaner and more linguistically valid outputs. Lemmatization is generally slower than stemming but more accurate.
Python Example: Lemmatization
from nltk.stem import WordNetLemmatizer
# Run nltk.download('wordnet') once if the WordNet data is missing
# Create lemmatizer
lemmatizer = WordNetLemmatizer()
# Input words
words = ["running", "cars", "better"]
# Lemmatize each word (defaults to noun POS)
for word in words:
    print(word, "->", lemmatizer.lemmatize(word))
Output:
running -> running
cars -> car
better -> better
Lemmatization gives better results when it knows the part of speech (POS) of the word. Without a POS tag, NLTK assumes the word is a noun, which can produce wrong outputs.
from nltk.stem import WordNetLemmatizer
from nltk import pos_tag
from nltk.corpus import wordnet
# Run nltk.download('wordnet') and nltk.download('averaged_perceptron_tagger') once if needed
lemmatizer = WordNetLemmatizer()
# Convert Penn Treebank POS tags to WordNet format
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # Default to noun
words = ["running", "cars", "better"]
# POS tagging
tagged_words = pos_tag(words)
# Lemmatize using the mapped POS tag
for word, tag in tagged_words:
    lemma = lemmatizer.lemmatize(word, get_wordnet_pos(tag))
    print(word, "->", lemma)
Output:
running -> run
cars -> car
better -> well
Stemming is faster, simpler, and useful for search indexing or quick text classification where approximate roots are acceptable. Lemmatization is more precise and better when language quality matters, such as chatbots, analytics, semantic search, or advanced NLP pipelines.
If speed is critical, stemming may be enough. If meaning is critical, lemmatization is usually preferred.
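A quick side-by-side run makes the trade-off concrete. In this sketch (the word list is ours, and pos="v" is passed explicitly for illustration), the stemmer produces heuristic roots while the lemmatizer returns dictionary forms:
from nltk.stem import PorterStemmer, WordNetLemmatizer
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
# Compare heuristic stems with dictionary lemmas (treated as verbs)
for word in ["studies", "running", "leaves"]:
    print(word, "| stem:", stemmer.stem(word), "| lemma:", lemmatizer.lemmatize(word, pos="v"))
Output:
studies | stem: studi | lemma: study
running | stem: run | lemma: run
leaves | stem: leav | lemma: leave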
Conclusion
Text preprocessing transforms messy human language into structured signals machines can learn from. Tokenization breaks text into usable units, stemming reduces words through heuristic roots, and lemmatization restores meaningful base forms. Real pipelines often also include lowercasing, punctuation removal, stopword removal, spelling normalization, emoji handling, contraction expansion, and whitespace cleanup.
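As a rough end-to-end illustration (the helper name preprocess and the step order are our choices, not a standard), a tiny pipeline combining several of these steps might look like this:
import string
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
# Run nltk.download('punkt') and nltk.download('stopwords') once if needed
def preprocess(text):
    # Lowercase, strip punctuation, tokenize, then drop stopwords
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = word_tokenize(text)
    stop_words = set(stopwords.words("english"))
    return [t for t in tokens if t not in stop_words]
print(preprocess("The movie was surprisingly good!"))
Output:
['movie', 'surprisingly', 'good']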
Though modern NLP has evolved rapidly, these foundations remain essential knowledge. Understanding how text is cleaned and standardized often matters as much as choosing the model itself.