📚 Curiosity Chronicles

❯

❯

Nlp Data Preparation 1

Nlp Data Preparation 1

Nov 21, 20232 min read

data-preparation/text-cleaning

Text Cleaning

Text Cleaning is a crucial step in data preparation, aka data wrangling, the goal here is to produce ‘clean text’ that machines can analyze error free.
Clean text is human language rearranged into a format that machine models can understand.

How its done:

Text cleaning is performed by eliminating stop-words, removing Unicode words, and simplifying complex words to their root form. Few techniques used are below:

Normalizing Text
- Removing everything that would confuse a computer model
  - Lower Case Conversion
  - [[#|### Stop-word Removal]]
  - Unicode, Emoji, Emoticon, punctuation removal
  - Number Unification
  - [[#|### Removing Contractions]]
  - [[#|### Stemming]]
  - [[#|### Lemmatization]]
  - [[#|### Spelling replacement]]

Few other techniques that are performed to give context to the text are

Part of Speech (POS) Tagging
Named entity recognition (NER)

Explanations:

Stop-word Removal

Removing words that don’t directly apply to interpretation. Example: ‘the’, ‘is’, ‘that’, ‘a’, etc.,

Stemming

Stemming is a process that stems or removes last few characters from a word to find the root word. Example: - boating - boat - Caring - car (which is wrong) - history/historical - histori (which is wrong again)
Stemming is used in case of large dataset where performance is an issue.
Stemming algorithms:
- Porter’s Stemmer
- Lancaster stemming algorithm

Lemmatization

Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.
Lemmatization is computationally expensive since it involves look-up tables and what not.
Lemmatization algorithms depend on the availability of part-of-speech information on the input words because different normalization rules might need to be applied whether the word is a noun, verb, adjective, or other.

Stemming vs Lemmatization:

Stemming has its application in Sentiment Analysis while Lemmatization has its application in Chatbots, human-answering.

Number Unification

Standardizing numbers to one format (roman, text, letters)

Spelling replacement

Replacing spelling of words to one particular type (UK or US) Example: Color - Colour

Removing Contractions

Converting contracted text to their long forms Example : won’t - will not

References:

https://monkeylearn.com/blog/text-cleaning/
https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp/
https://libguides.library.usyd.edu.au/text_data_mining/cleaning
https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/
https://nirajbhoi.medium.com/stemming-vs-lemmatization-in-nlp-efc280d4e845
https://www.baeldung.com/cs/stemming-vs-lemmatization

Graph View

Text Cleaning
How its done:
Explanations:
References:

Created with Quartz v4.4.0 © 2025

RSS
Email
LinkedIn
GitHub