Text Cleaning
- Text Cleaning is a crucial step in data preparation, aka data wrangling, the goal here is to produce âclean textâ that machines can analyze error free.
- Clean text is human language rearranged into a format that machine models can understand.
How its done:
Text cleaning is performed by eliminating stop-words, removing Unicode words, and simplifying complex words to their root form. Few techniques used are below:
- Normalizing Text
- Removing everything that would confuse a computer model
- Lower Case Conversion
- [[#|### Stop-word Removal]]
- Unicode, Emoji, Emoticon, punctuation removal
- Number Unification
- [[#|### Removing Contractions]]
- [[#|### Stemming]]
- [[#|### Lemmatization]]
- [[#|### Spelling replacement]]
- Removing everything that would confuse a computer model
Few other techniques that are performed to give context to the text are
- Part of Speech (POS) Tagging
- Named entity recognition (NER)
Explanations:
Stop-word Removal
- Removing words that donât directly apply to interpretation. Example: âtheâ, âisâ, âthatâ, âaâ, etc.,
Stemming
- Stemming is a process that stems or removes last few characters from a word to find the root word. Example: - boating - boat - Caring - car (which is wrong) - history/historical - histori (which is wrong again)
- Stemming is used in case of large dataset where performance is an issue.
- Stemming algorithms:
- Porterâs Stemmer
- Lancaster stemming algorithm
Lemmatization
- Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.
- Lemmatization is computationally expensive since it involves look-up tables and what not.
- Lemmatization algorithms depend on the availability of part-of-speech information on the input words because different normalization rules might need to be applied whether the word is a noun, verb, adjective, or other.
Stemming vs Lemmatization:
- Stemming has its application in Sentiment Analysis while Lemmatization has its application in Chatbots, human-answering.
Number Unification
- Standardizing numbers to one format (roman, text, letters)
Spelling replacement
- Replacing spelling of words to one particular type (UK or US) Example: Color - Colour
Removing Contractions
- Converting contracted text to their long forms Example : wonât - will not
References:
- https://monkeylearn.com/blog/text-cleaning/
- https://www.analyticsvidhya.com/blog/2022/01/text-cleaning-methods-in-nlp/
- https://libguides.library.usyd.edu.au/text_data_mining/cleaning
- https://www.analyticsvidhya.com/blog/2022/06/stemming-vs-lemmatization-in-nlp-must-know-differences/
- https://nirajbhoi.medium.com/stemming-vs-lemmatization-in-nlp-efc280d4e845
- https://www.baeldung.com/cs/stemming-vs-lemmatization