An Introduction to NLP Count Vectorization and TF-IDF (Part 1)
Let's explore this basic, yet important, NLP tool together.
Count Vectorization is a useful way to convert text content (e.g., strings) into numerical features that machine learning algorithms can understand. Each document occupies a row and each unique word (token) occupies a column, converting a corpus of text documents into a matrix of token occurrences.
Each entry of the matrix records the number of occurrences (i.e. frequency) of a specific word/token in a document.
Consider a corpus of 1,000 documents that collectively have 4,000 tokens. The matrix will have 1,000 rows and 4,000 columns.
Because a corpus contains many tokens but any single document uses only a small subset of them, the word count matrix is wide (many one-hot-style features) and sparse, with most entries equal to zero.
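The sparsity is easy to verify by hand. The following sketch builds the count matrix with plain Python (no library) for the same kind of toy corpus and counts the zero entries; even at this tiny scale, over half the matrix is zeros, and the fraction grows quickly as the vocabulary does:

```python
# Build a word-count matrix by hand to see how sparse it is.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: every unique token across the corpus, in sorted order
vocab = sorted({tok for doc in corpus for tok in doc.split()})

# One row per document, one column per token
matrix = [[doc.split().count(tok) for tok in vocab] for doc in corpus]

zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocab)
print(f"{zeros}/{total} entries are zero")  # → 17/30 entries are zero
```

This is why implementations such as scikit-learn return the counts as a SciPy sparse matrix rather than a dense array: storing the zeros explicitly would waste nearly all of the memory.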
This process of tokenization, counting, and normalization is called the Bag of Words or "Bag of n-grams" representation.