Using CountVectorizer for NLP feature extraction

Published: 10 November 2023

Introduction

When approaching a natural language processing (NLP) use case, such as text classification, you will usually conduct some data preprocessing and feature extraction tasks before feeding your data into machine learning and deep learning algorithms.

The bag-of-words model is commonly used to extract features from text data. More specifically, it generates a frequency count of each word in a given text. CountVectorizer, a class in the Python library scikit-learn, can help us compute the count of unique words across a number of texts with ease. To see an example of how this class is used within data science, check out this spam classification tutorial, which uses the Naive Bayes classifier.


What is CountVectorizer?

CountVectorizer is a class in scikit-learn that transforms a collection of text documents into a numerical matrix of word or token counts. The class also has a number of parameters that assist with text preprocessing, such as stop word removal, word count thresholds (i.e., maximums and minimums), vocabulary limits, n-gram creation and more. In this article, we’ll walk through how to use scikit-learn’s CountVectorizer to prepare data for use with a classifier.
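As a quick illustration, here is a minimal sketch on two toy sentences (our own examples, not from this article’s dataset), showing one row of counts per document:

from sklearn.feature_extraction.text import CountVectorizer

# Two toy documents for illustration
corpus = ["the cat sat on the mat", "the dog sat"]
vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names_out())
# ['cat' 'dog' 'mat' 'on' 'sat' 'the']
print(counts.toarray())
# [[1 0 1 1 1 2]
#  [0 1 0 0 1 1]]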

How to use CountVectorizer

Let's explore how to use this class in a code editor to facilitate preprocessing for NLP use cases.

Installing and importing relevant libraries
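If any of these libraries are missing from your environment, a typical way to install them is with pip (the exact command depends on your setup):

pip install numpy pandas matplotlib nltk scikit-learn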

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import nltk
from sklearn.feature_extraction.text import CountVectorizer

Load the data

We will be using a dataset from the UCI Machine Learning Repository to walk through how to use CountVectorizer() to generate a sparse matrix of term frequencies.

data = pd.read_csv("~/Documents/Code/SMSSpamCollection.csv")
data.head()
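As an aside, the raw SMSSpamCollection file distributed by UCI is tab-separated with no header row. If you are starting from that raw file rather than a prepared CSV, one way to load it (the column names label and text here are our own choice) is:

data = pd.read_csv("SMSSpamCollection", sep="\t", header=None, names=["label", "text"])  # raw UCI file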

Create a basic sparse matrix

To create a sparse matrix from the dataset, pass the text column of the DataFrame to the vectorizer. By default, CountVectorizer converts your text to lowercase and uses UTF-8 encoding.

vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(data.text)
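fit_transform returns a SciPy sparse matrix with one row per text and one column per unique word, which you can inspect before converting it to anything denser:

print(type(matrix))   # a scipy.sparse matrix (CSR format)
print(matrix.shape)   # (number of texts, size of vocabulary)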

Visualize as a dataframe

df = pd.DataFrame(data=matrix.toarray(), columns=vectorizer.get_feature_names_out())
df

Each row represents an individual text from the dataset, each column represents a unique word, and each cell holds that word’s count. Note that toarray() converts the sparse matrix to a dense one, which can be memory-intensive for large corpora.

Extract feature names

vectorizer.get_feature_names_out()

This returns the unique words in your corpus, ordered to match their column positions in the sparse matrix.

Get the indices of each feature name

vectorizer.vocabulary_

Please note that this does not return frequency counts; instead, it maps each word to its column index in the matrix.
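For example, you can look up a word’s column index and sum that column to get its total count across the corpus (the word “free” below is just an illustrative guess at a term in this dataset; looking up a word not in the vocabulary raises a KeyError):

idx = vectorizer.vocabulary_["free"]   # column index of "free" (hypothetical example word)
print(matrix[:, idx].sum())            # total occurrences of "free" across all texts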

Refine your matrix with parameters

When you have a small dataset, the size of the matrix isn’t much of an issue, but as the vocabulary size increases, you might want to consider different methods to limit its size to the most relevant words across texts.

Remove stop words

Stop words typically have little significance and do not add much value in classification tasks. These include words such as “the”, “or” and “is”. To remove them, set the stop_words parameter, which filters these words out of the vocabulary.

vectorizer = CountVectorizer(stop_words='english')

Note that scikit-learn’s built-in stop word list only covers English. For other languages, you can supply your own list, for example from NLTK. To see which languages NLTK provides stop word lists for, run the following code:

from nltk.corpus import stopwords  # requires nltk.download('stopwords') on first use
print(stopwords.fileids())
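As a sketch of that approach (our suggestion, not part of scikit-learn’s built-in support), you could pass one of NLTK’s lists directly:

vectorizer = CountVectorizer(stop_words=stopwords.words('french'))  # custom, non-English stop word list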

Set maximum and minimum count thresholds

You can also set thresholds to remove words from the matrix that appear too frequently or remove words that rarely appear.

vectorizer = CountVectorizer(max_df=0.80, min_df=0.20)

This code sample removes any word that appears in more than 80% of the texts (max_df=0.80) or in fewer than 20% of them (min_df=0.20).

Alternatively, if you want to cap the size of the vocabulary, you can keep only the most frequently used words with max_features.

vectorizer = CountVectorizer(max_features=50)
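With this setting, the fitted vocabulary is capped at the 50 most frequent words, which you can confirm after refitting:

matrix = vectorizer.fit_transform(data.text)
print(len(vectorizer.get_feature_names_out()))  # at most 50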

Creating n-grams

By default, CountVectorizer will tokenize text data into unigrams, or 1-grams. However, depending on your dataset, you might want to pull in more context and extend the n-gram range to return bigrams (2-grams) or trigrams (3-grams).

vectorizer = CountVectorizer(ngram_range=(2, 2))

With this n-gram range, the vectorizer counts only two-word sequences, called bigrams. To count unigrams and bigrams together, use ngram_range=(1, 2).
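To see what bigram features look like, here is a minimal sketch on two toy sentences (again, our own examples):

vectorizer = CountVectorizer(ngram_range=(2, 2))
counts = vectorizer.fit_transform(["free entry now", "call now"])
print(vectorizer.get_feature_names_out())
# ['call now' 'entry now' 'free entry']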

 
