Text Cleaning

  • Text Cleaning is a crucial step in data preparation, aka data wrangling, the goal here is to produce ‘clean text’ that machines can analyze error free.
  • Clean text is human language rearranged into a format that machine models can understand.

How its done:

Text cleaning is performed by eliminating stop-words, removing Unicode words, and simplifying complex words to their root form. Few techniques used are below:

  • Normalizing Text
    • Removing everything that would confuse a computer model
      • Lower Case Conversion
      • [[#|### Stop-word Removal]]
      • Unicode, Emoji, Emoticon, punctuation removal
      • Number Unification
      • [[#|### Removing Contractions]]
      • [[#|### Stemming]]
      • [[#|### Lemmatization]]
      • [[#|### Spelling replacement]]

Few other techniques that are performed to give context to the text are

  • Part of Speech (POS) Tagging
  • Named entity recognition (NER)

Explanations:

Stop-word Removal

  • Removing words that don’t directly apply to interpretation. Example: ‘the’, ‘is’, ‘that’, ‘a’, etc.,

Stemming

  • Stemming is a process that stems or removes last few characters from a word to find the root word. Example: - boating - boat - Caring - car (which is wrong) - history/historical - histori (which is wrong again)
  • Stemming is used in case of large dataset where performance is an issue.
  • Stemming algorithms:
    • Porter’s Stemmer
    • Lancaster stemming algorithm

Lemmatization

  • Lemmatization considers the context and converts the word to its meaningful base form, which is called Lemma.
  • Lemmatization is computationally expensive since it involves look-up tables and what not.
  • Lemmatization algorithms depend on the availability of part-of-speech information on the input words because different normalization rules might need to be applied whether the word is a noun, verb, adjective, or other.

Stemming vs Lemmatization:

|

  • Stemming has its application in Sentiment Analysis while Lemmatization has its application in Chatbots, human-answering.

Number Unification

  • Standardizing numbers to one format (roman, text, letters)

Spelling replacement

  • Replacing spelling of words to one particular type (UK or US) Example: Color - Colour

Removing Contractions

  • Converting contracted text to their long forms Example : won’t - will not

References: