A simple algorithm to clean text and create a feature set for any NLP classification problem

Ishan Mehta
4 min read · Jul 14, 2022

In this article, I’ll introduce a straightforward algorithm for text cleaning and feature set creation applicable to any NLP classification problem. The algorithm utilizes vocabulary libraries from Python’s NLTK and spaCy packages.

Algorithm:

  1. Remove all punctuation
  2. Tokenize the text based on spaces
  3. Convert all tokens to lowercase
  4. Eliminate stop words from the token list
  5. Check if the token or its stem is present in the vocabulary, a modal verb, or part of the predefined word list intended for the feature set. Save the stem of the token.
  • Remove all punctuation

from string import punctuation

punctuations = []
for p in punctuation:
    # Keep ' since it could be part of a contracted modal verb like can't or won't
    if p != "'":
        punctuations.append(p)

text = "Before I cancel this Netflix subscription, any good shows to watch?"

text = "".join([char for char in text if char not in punctuations])
# text is "Before I cancel this Netflix subscription any good shows to watch" (the , and ? are removed)
  • Tokenize the text based on spaces

import re

tokens = re.split(r'\s+', text)
# tokens is ['Before', 'I', 'cancel', 'this', 'Netflix', 'subscription', 'any', 'good', 'shows', 'to', 'watch']
  • Lowercase all the tokens

tokens = [t.lower() for t in tokens]
# tokens is ['before', 'i', 'cancel', 'this', 'netflix', 'subscription', 'any', 'good', 'shows', 'to', 'watch']
  • Remove the stop words from the tokens

What are stopwords?

Stop words are a set of commonly used words in any language. For example, in English, “the”, “is”, and “and” would easily qualify as stop words. In NLP and text mining applications, stop words are removed to eliminate unimportant words, allowing applications to focus on the important words instead.

stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'be', 'been', 'being', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'now', '#']

tokens = [t for t in tokens if t not in stopwords]
# tokens is ['cancel', 'netflix', 'subscription', 'good', 'shows', 'watch']
  • Check whether the token or its stem is in the vocabulary, is a modal verb, or is in the list of words you intend to have in the feature set, and save the stem of the token

What are modal verbs?

Modal verbs show possibility, intent, ability, or necessity. Because they’re a type of auxiliary verb (helper verb), they’re used together with the main verb of the sentence. Common examples include can, should, and must.
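Keeping the apostrophe during punctuation removal (step 1) is what lets contracted modals like “can’t” and “won’t” survive as single tokens and match the modal verb list. A quick illustration (the sentence here is just an example):

```python
from string import punctuation

# Keep ' so contracted modals such as "can't" stay intact
punctuations = [p for p in punctuation if p != "'"]

modal_verbs = ["can", "can't", "could", "couldn't", "will", "won't"]

text = "I can't watch, but I will!"
text = "".join(ch for ch in text if ch not in punctuations)
tokens = [t.lower() for t in text.split()]
kept = [t for t in tokens if t in modal_verbs]
print(kept)  # ["can't", 'will']
```

Had the apostrophe been stripped, "can't" would have become "cant" and missed the modal check.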

What is the stem of a word?

Stemming is the process of reducing a word to its stem or root. For example, the stem of ‘subscription’ is ‘subscript’, the stem of ‘canceling’ is ‘cancel’, and ‘won’t’ is left unchanged as ‘won’t’.
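You can verify these stems directly with NLTK’s PorterStemmer, the stemmer used in the rest of this article:

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
# Print the Porter stem of a few sample words
for word in ["subscription", "canceling", "shows", "easily", "won't"]:
    print(word, "->", stemmer.stem(word))
```

Note that Porter stems are not always dictionary words (‘subscript’, ‘easili’), which is why the algorithm checks both the token and its stem against the vocabulary.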

#modal verbs
modal_verbs = ["can", "can't", "could", "couldn't", "did", "didn't", "may", "might", "must", "mustn't", "shall", "shan't", "should", "shouldn't", "will", "won't", "would", "wouldn't"]

#words not in the vocab which are valid
valid_words = ["netflix"]

#stemmer (vocab_1 and vocab_2 are defined in the complete listing below)
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
tokens = [stemmer.stem(word) for word in tokens
          if word in vocab_1 or word in vocab_2 or
          word in valid_words or
          word in modal_verbs or
          stemmer.stem(word) in vocab_1 or
          stemmer.stem(word) in vocab_2]
# tokens is ['cancel', 'netflix', 'subscript', 'good', 'show', 'watch']

Complete algorithm:

import nltk
from nltk.stem import PorterStemmer
import re
import spacy
from string import punctuation


#punctuations
punctuations = []
for p in punctuation:
    if p != "'":
        punctuations.append(p)

#stopwords
stopwords = ['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those','be', 'been', 'being', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against','between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 'now', '#']

#corpus of words: vocab_1 and vocab_2
vocab_1 = nltk.corpus.words.words()

nlp = spacy.load("en_core_web_sm")
vocab_2 = set(nlp.vocab.strings)

#modal_verbs
modal_verbs = ["can", "can't", "could", "couldn't", "did", "didn't", "may", "might", "must", "mustn't", "shall", "shan't", "should", "shouldn't", "will", "won't", "would", "wouldn't"]

#words not in the vocab which are valid
valid_words = ["netflix"]

#stemmer
stemmer = PorterStemmer()

def clean_text(text):
    text = "".join([char for char in text if char not in punctuations])

    tokens = re.split(r'\s+', text)
    tokens = [t.lower() for t in tokens]
    tokens = [t for t in tokens if t not in stopwords]

    tokens = [stemmer.stem(word) for word in tokens
              if word in vocab_1 or word in vocab_2 or
              word in valid_words or
              word in modal_verbs or
              stemmer.stem(word) in vocab_1 or
              stemmer.stem(word) in vocab_2]

    return tokens

print(clean_text("Before I cancel this Netflix subscription, any good shows to watch?"))
# result is ['cancel', 'netflix', 'subscript', 'good', 'show', 'watch']
print(clean_text("So if i renew my 1 month netflix subscription now i will easily be able to watch Elite s5 before it expires"))
# result is ['renew', '1', 'month', 'netflix', 'subscript', 'will', 'easili', 'abl', 'watch', 'elit', 'expir']
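From here, the cleaned token lists can be turned into an actual feature matrix. A minimal bag-of-words sketch using only the standard library (the two documents are the example outputs above; in practice you would fit the vocabulary on your full training corpus):

```python
from collections import Counter

docs = [
    ['cancel', 'netflix', 'subscript', 'good', 'show', 'watch'],
    ['renew', '1', 'month', 'netflix', 'subscript', 'will',
     'easili', 'abl', 'watch', 'elit', 'expir'],
]

# Build a fixed vocabulary from all cleaned documents
vocab = sorted(set(t for doc in docs for t in doc))

# One count vector per document, aligned to the vocabulary
features = [[Counter(doc)[term] for term in vocab] for doc in docs]
print(vocab)
print(features)
```

Each row of `features` is then ready to feed into any standard classifier.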

clean_text.py: https://github.com/ishanmehta17/nlp_clean_text_algo/blob/main/clean_text.py

This is a simple algorithm to clean text and create a feature set for any NLP classification problem.

Happy Reading!
