Tokenizing a corpus with spaCy
Tokenization is the process of dividing a corpus into its basic meaningful units, called tokens. These are most often words and punctuation marks, but tokens are not limited to words. In the example below we build a simple tokenizer with spaCy.
import spacy
# Load the small English pipeline; calling it on a text runs its tokenizer
nlp = spacy.load("en_core_web_sm")
sample_text = "Jenna is an excellent programmer"
[str(token) for token in nlp(sample_text)]
The output is:
['Jenna', 'is', 'an', 'excellent', 'programmer']
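Since tokenization is not limited to words, punctuation marks also become tokens in their own right. Here is a minimal follow-up sketch that reuses the nlp pipeline loaded above; the sentence is our own illustrative example, not part of the original corpus:
# Punctuation such as ',' and '?' is split into separate tokens
doc = nlp("Wait, is Jenna really that good?")
[str(token) for token in doc]
Running this yields:
['Wait', ',', 'is', 'Jenna', 'really', 'that', 'good', '?']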