
An Introduction to NLP Count Vectorization and TF-IDF (Part 1)

Machine Learning Quick Reads
3 min read · Mar 14, 2021


Let's explore this basic, yet important, NLP tool together.


Count Vectorization is a useful way to convert text content (e.g., strings) into numerical features that machine learning algorithms can understand. Each document is placed on a row and each individual word (token) on a column. This converts a corpus of text documents into a matrix that records token occurrences.

Each entry of the matrix records the number of occurrences (i.e. frequency) of a specific word/token in a document.

Consider a corpus of 1,000 documents that collectively contain 4,000 distinct tokens. The count matrix will have 1,000 rows and 4,000 columns.
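To make this concrete, here is a minimal sketch using scikit-learn's CountVectorizer (an assumption, since this article hasn't named a specific library); a tiny three-document corpus is turned into a three-row count matrix:

from sklearn.feature_extraction.text import CountVectorizer

# A toy corpus: each string is one document (one row of the matrix).
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(corpus)        # rows = documents, columns = tokens

# The learned vocabulary, i.e. the columns of the matrix.
# (Older scikit-learn versions call this get_feature_names() instead.)
print(vectorizer.get_feature_names_out())

print(X.shape)       # (3 documents, size of the vocabulary)
print(X.toarray())   # token counts per document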


Because a corpus contains many tokens while any single document typically uses only a small subset of them, the resulting word-count matrix is wide (many one-hot-like features), sparse, and mostly filled with zeros.
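Continuing the sketch above, the count matrix X comes back as a SciPy sparse matrix, which stores only the non-zero counts, and we can check what fraction of its entries are zero:

# X is the sparse count matrix from the previous snippet.
print(type(X))   # a scipy.sparse CSR matrix

# Fraction of zero entries, a rough measure of how sparse the matrix is.
n_rows, n_cols = X.shape
sparsity = 1.0 - X.nnz / (n_rows * n_cols)
print(f"sparsity: {sparsity:.2%}")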

This process of tokenization, counting, and normalization is called the Bag of Words or “Bag of…
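As a minimal, forward-looking sketch of that normalization step, and of the TF-IDF in this article's title, scikit-learn's TfidfTransformer can re-weight the raw counts from the snippets above (this is an illustrative assumption, not necessarily the exact approach covered in the rest of the article):

from sklearn.feature_extraction.text import TfidfTransformer

# Re-weight the raw counts from the sketch above: tokens that appear in many
# documents are down-weighted, and each row is L2-normalized by default.
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X)

print(X_tfidf.toarray().round(2))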

Written by Machine Learning Quick Reads

Lead Author: Yaokun Lin, Actuary | ML Practitioner | Apply Tomorrow's Technology to Solve Today's Problems
