Lemmatizing a corpus

Lemmatization is the process of reducing a word to its lemma, the base (dictionary) form of the word. This is useful because it reduces sparsity: inflected forms that share the same base, such as "working" and "works", are counted as one term rather than several. In the example below, we lemmatize a sample corpus with spaCy.

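import spacy

# Load the small English pipeline; it must be installed first, e.g. with
# python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

sample_text = "John enjoys working on data analysis and in particular working on complex visualizations"

# Pair each token's text with its lemma
[" ==> ".join([token.text, token.lemma_]) for token in nlp(sample_text)]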

The output is (exact lemmas can vary with the spaCy and model version):

['John ==> John',
'enjoys ==> enjoy',
'working ==> work',
'on ==> on',
'data ==> datum',
'analysis ==> analysis',
'and ==> and',
'in ==> in',
'particular ==> particular',
'working ==> work',
'on ==> on',
'complex ==> complex',
'visualizations ==> visualization']
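
To make the sparsity point concrete, here is a minimal sketch that compares the number of distinct surface forms with the number of distinct lemmas in a short sentence. It deliberately does not call spaCy: TOY_LEMMAS and toy_lemmatize are hypothetical names invented for this illustration, and the lookup table is a hand-written stand-in for a real lemmatizer, so treat it as a demonstration of the idea rather than of how spaCy works internally.

```python
from collections import Counter

# Hypothetical, hand-written lookup table used only for illustration.
# A real lemmatizer (such as spaCy's) relies on much larger rules and tables.
TOY_LEMMAS = {
    "enjoys": "enjoy",
    "enjoyed": "enjoy",
    "working": "work",
    "works": "work",
    "worked": "work",
    "visualizations": "visualization",
}


def toy_lemmatize(token):
    """Return the lemma from the toy table, or the lowercased token if unknown."""
    word = token.lower()
    return TOY_LEMMAS.get(word, word)


tokens = "John enjoys working and Mary enjoyed the work she worked on".split()
lemmas = [toy_lemmatize(t) for t in tokens]

# Count distinct terms before and after lemmatization
surface_counts = Counter(t.lower() for t in tokens)
lemma_counts = Counter(lemmas)

print(len(surface_counts), "distinct surface forms")  # 11
print(len(lemma_counts), "distinct lemmas")           # 8
print(lemma_counts["work"], lemma_counts["enjoy"])    # 3 2
```

After lemmatization, "working", "work" and "worked" collapse into a single term, so the vocabulary shrinks from 11 entries to 8 while the counts for "work" and "enjoy" become more informative. With spaCy, the same comparison could be made by counting token.lemma_ values instead of using the toy table.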