Tokenizing Tweets

Tokenizing is the process of dividing a corpus into its basic meaningful units. These are often words, but they can also be hashtags, emojis, and so on. Tweets are particularly interesting in that hashtags, emoticons, and other such tokens carry specific meanings. In the example below, we look at tokenization specifically for tweets.

from nltk.tokenize import TweetTokenizer

# TweetTokenizer is aware of Twitter conventions such as hashtags and emoticons.
tokenizer = TweetTokenizer()

tweet = "Places I love to visit the city walls of #Dubrovnik #endlessroaming #travel #Croatia"

tokenizer.tokenize(tweet)

The output is:

['Places', 'I', 'love', 'to', 'visit', 'the', 'city', 'walls', 'of', '#Dubrovnik', '#endlessroaming', '#travel', '#Croatia']
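TweetTokenizer also accepts a few Twitter-oriented options. The snippet below is a small illustration on a made-up tweet: strip_handles drops @-mentions and reduce_len shortens exaggerated character runs, while the emoticon survives as a single token:

from nltk.tokenize import TweetTokenizer

# strip_handles drops @-mentions; reduce_len shortens runs of three or
# more repeated characters (e.g. "soooooo" becomes "sooo").
tokenizer = TweetTokenizer(strip_handles=True, reduce_len=True)
tokenizer.tokenize("@traveller the walls were soooooo beautiful :-) #Dubrovnik")

This should return:

['the', 'walls', 'were', 'sooo', 'beautiful', ':-)', '#Dubrovnik']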

As these examples show, NLTK's TweetTokenizer takes care of Twitter-specific tokenization. This is often not the case with other tokenizers, such as spaCy's. For demonstration, here is an example:

import spacy

# spaCy's default English tokenizer treats '#' as a prefix and splits it off.
nlp = spacy.load("en_core_web_sm")

[str(token) for token in nlp(tweet)]

The output is:

['Places', 'I', 'love', 'to', 'visit', 'the', 'city', 'walls', 'of', '#', 'Dubrovnik', '#', 'endlessroaming', '#', 'travel', '#', 'Croatia']

Notice that in the latter example, the '#' sign and the tag text end up as separate tokens.
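If you need spaCy to keep hashtags intact, one option is to merge each '#' with the token that follows it after tokenization. This is a minimal sketch using spaCy's retokenizer, not the only approach (you could also customize the tokenizer rules themselves); it assumes every '#' in the text is immediately followed by the tag word:

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp(tweet)

# Merge every '#' with the token immediately following it, so that
# '#' + 'Dubrovnik' becomes the single token '#Dubrovnik'.
with doc.retokenize() as retokenizer:
    for i in range(len(doc) - 1):
        if doc[i].text == "#":
            retokenizer.merge(doc[i : i + 2])

[str(token) for token in doc]

The result should now match the TweetTokenizer output above, with each hashtag as a single token.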