An Introduction to NLP Count Vectorization and TF-IDF (Part 1)
Let's explore this basic, yet important, NLP tool together.
Count Vectorization is a useful way to convert text content (e.g., strings) into numerical features that machine learning algorithms can understand. Each document occupies a row and each unique word (token) occupies a column, converting a corpus of text documents into a matrix of token occurrences.
Each entry of the matrix records the number of occurrences (i.e. frequency) of a specific word/token in a document.
Consider a corpus of 1,000 documents that collectively have 4,000 tokens. The matrix will have 1,000 rows and 4,000 columns.
Because a corpus contains many tokens but any single document uses only a small subset of them, the word count matrix is wide (many one-hot-style features) and sparse, with most entries equal to zero.
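The sparsity is easy to verify by hand. The following sketch builds the count matrix with plain Python (no library) for the same kind of toy corpus and counts the zero entries; even at this tiny scale, over half the matrix is zeros, and the fraction grows quickly as the vocabulary does:

```python
# Build a word-count matrix by hand to see how sparse it is.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "cats and dogs",
]

# Vocabulary: every unique token across the corpus, in sorted order
vocab = sorted({tok for doc in corpus for tok in doc.split()})

# One row per document, one column per token
matrix = [[doc.split().count(tok) for tok in vocab] for doc in corpus]

zeros = sum(row.count(0) for row in matrix)
total = len(matrix) * len(vocab)
print(f"{zeros}/{total} entries are zero")  # → 17/30 entries are zero
```

This is why implementations such as scikit-learn return the counts as a SciPy sparse matrix rather than a dense array: storing the zeros explicitly would waste nearly all of the memory.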
This process of tokenization, counting, and normalization is called the Bag of Words or "Bag of n-grams" representation.