Tokenizing a corpus with spaCy
Tokenization is the process of dividing a corpus into its basic meaningful units, called tokens. These are most often words and punctuation marks, but tokens are not limited to words. In the example below we build a simple tokenizer with spaCy.
import spacy
# Load the small English pipeline; calling it on a text runs its tokenizer
nlp = spacy.load("en_core_web_sm")
sample_text = "Jenna is an excellent programmer"
[str(token) for token in nlp(sample_text)]
The output is:
['Jenna', 'is', 'an', 'excellent', 'programmer']
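Since tokenization is not limited to words, punctuation marks also become tokens in their own right. Here is a minimal follow-up sketch that reuses the nlp pipeline loaded above; the sentence is our own illustrative example, not part of the original corpus:
# Punctuation such as ',' and '?' is split into separate tokens
doc = nlp("Wait, is Jenna really that good?")
[str(token) for token in doc]
Running this yields:
['Wait', ',', 'is', 'Jenna', 'really', 'that', 'good', '?']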