Information Visualization · FH Potsdam · Summer 2020

Tutorial 5: Text processing

In this tutorial we explore textual data. We will extract and visualize common words, filter them by type, and search for words in their document context. For your orientation: the content of this tutorial falls into the larger family of methods called Natural Language Processing (NLP for short). Now without any further words: let's do this!

🛒 1. Prepare

In addition to Pandas and Altair, we import the Natural Language Toolkit (nltk for short):

In [ ]:
import pandas as pd
import altair as alt
import nltk # ← new

This time, we also need to download a bunch of datasets that are required for various text processing steps. To get this out of the way, we do it all in one go. This might take a few minutes.

In [ ]:
nltk.download('punkt') # necessary for tokenization
nltk.download('wordnet') # necessary for lemmatization
nltk.download('stopwords') # necessary for removal of stop words
nltk.download('averaged_perceptron_tagger') # necessary for POS tagging
nltk.download('maxent_ne_chunker') # necessary for entity extraction
nltk.download('words') # word list used by the entity chunker

# another library for text analysis
!pip install SpaCy 

# and a small English language model
!python -m spacy download en_core_web_sm 

# this tutorial will feature a wordcloud
!pip install wordcloud 
[… nltk_data and pip output trimmed: punkt, wordnet, stopwords, averaged_perceptron_tagger, maxent_ne_chunker and words are already up-to-date; SpaCy (2.2.4), en_core_web_sm (2.2.5) and wordcloud (1.5.0) are already installed …]
✔ Download and installation successful
You can now load the model via spacy.load('en_core_web_sm')

String cleaning

Before we visualize any text, we need to do some normalization. Let's take the short story The Seventh Sally or How Trurl's Own Perfection Led to No Good (1965) from The Cyberiad by Stanisław Lem as an example dataset. We will first load the text by URL:

In [ ]:
import requests # to load the data by URL
r = requests.get('http://infovis.fh-potsdam.de/tutorials/data/story.txt')
r.encoding = "utf-8" # ensure correct encoding

story = r.text

# display first 500 characters
print(story[:500]+"…")
The seventh Sally or how Trurl's own perfection led to no good
By Stanisław Lem, 1965.
Translated by Michael Kandel, 1974.

The Universe is infinite but bounded, and therefore a beam of light, in whatever direction it may travel, will after billions of centuries return -  if powerful enough - to the point of its departure; and it is no different with rumor, that flies about from star to star and makes the rounds of every planet. One day Trurl heard distant reports of two mighty constructor-benef…

At this point the entire story is loaded into one long string.

✏️ Determine the character length of the story (with or without spaces, however you wish):

In [ ]:
 

Tokenization

… is the process of turning a text into chunks that we can more easily work with. Typically tokenization refers to the separation and extraction of words. This process largely relies on the use of spaces and punctuation marks. The latter are typically included as tokens themselves.
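
By the way, NLTK can also chunk a text into sentences rather than words. A quick sketch with sent_tokenize(), applied to the story we loaded above:

In [ ]:
# split the story into sentences and peek at the first three
sentences = nltk.sent_tokenize(story)
sentences[:3]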

Suppose we have a sentence such as this one:

In [ ]:
sentence = "There were plenty of towns, rivers, mountains, forests, and brooks."

… with nltk's word_tokenize() we can extract all tokens into a neat list:

In [ ]:
words = nltk.word_tokenize(sentence)
words
Out[ ]:
['There',
 'were',
 'plenty',
 'of',
 'towns',
 ',',
 'rivers',
 ',',
 'mountains',
 ',',
 'forests',
 ',',
 'and',
 'brooks',
 '.']

✏️ This looks good already. Let's do this for the entire story:

In [ ]:

As you can see, we also get the punctuation marks. We can avoid these with a different kind of tokenizer (e.g., the RegexpTokenizer) or by simply removing non-letter strings with Python's isalpha() method. But be mindful: this removes any token containing anything other than letters.
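
For reference, the RegexpTokenizer route could look roughly like this (a minimal sketch; the pattern \w+ keeps runs of letters, digits and underscores, so punctuation never becomes a token):

In [ ]:
from nltk.tokenize import RegexpTokenizer

# tokenize by matching word characters only
tokenizer = RegexpTokenizer(r"\w+")
tokenizer.tokenize(sentence)

The cell below takes the second route and simply filters the existing tokens with isalpha():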

In [ ]:
# no punctuation, numbers or contractions
onlywords = [word for word in words if word.isalpha()]

onlywords[0:20]
Out[ ]:
['There',
 'were',
 'plenty',
 'of',
 'towns',
 'rivers',
 'mountains',
 'forests',
 'and',
 'brooks']

✏️ Find out the number of words in the story:

In [ ]:
 

Note the list comprehension extracting only words that contain letters. The way it is written makes it both comprehensible and compact. I mean… read it out, it's almost like poetry!

onlywords = [word for word in words if word.isalpha()]

The first word in the comprehension (the expression before the for) can also be altered, for example, to change the words' capitalization. In fact, another typical normalization step is turning all words into their lowercase versions.

✏️ Change the code below to turn all words into their .lower()-case version:

In [ ]:
words = [word for word in words if word.isalpha()]

Stemming & lemmatizing

Words are often inflected to indicate plural, tense, case, etc. To get at the word stem or lemma, you can apply stemming or lemmatization. The stemmer operates on a relatively robust but simplistic rule set. In contrast, lemmatization is more reliable at linking different word variants to the same dictionary entry of a word, a.k.a. its lemma, but it is computationally more expensive.

Let's compare them both:

In [ ]:
from nltk.stem import PorterStemmer as stemmer
from nltk.stem import WordNetLemmatizer as lemmatizer
from nltk.corpus import wordnet # for robust lemmatization

word = "drove"

print(stemmer().stem(word))
print(lemmatizer().lemmatize(word, pos = wordnet.VERB))
drove
drive
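
If we had left out the pos argument, the WordNetLemmatizer would have fallen back to its default, nouns, and "drove" would have come back unchanged:

In [ ]:
# without a pos argument the lemmatizer assumes a noun ('n'),
# so the verb form "drove" is not reduced to "drive"
print(lemmatizer().lemmatize(word))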

To lemmatize we needed to indicate the word type via the second parameter pos. But how do we know that it's a verb?

Part-of-speech tagging

Words assume specific roles in sentences. POS tagging identifies these roles as parts of speech, which roughly translate to word categories such as verbs, nouns, adjectives, etc.

To do POS tagging with NLTK, we need to first run the tokenization. So let's revisit the sentence from above:

In [ ]:
# to save us some typing, we import these, so we can call them directly
from nltk import word_tokenize, pos_tag

sentence = "There were plenty of towns, rivers, mountains, forests, and brooks."

# first we tokenize then we pos_tag
sentence = pos_tag(word_tokenize(sentence))

sentence
Out[ ]:
[('There', 'EX'),
 ('were', 'VBD'),
 ('plenty', 'NN'),
 ('of', 'IN'),
 ('towns', 'NNS'),
 (',', ','),
 ('rivers', 'NNS'),
 (',', ','),
 ('mountains', 'NNS'),
 (',', ','),
 ('forests', 'NNS'),
 (',', ','),
 ('and', 'CC'),
 ('brooks', 'NNS'),
 ('.', '.')]

This gives us a list of tuples, each of which contains the token again, plus the part of speech encoded in a tag. There is actually a good overview of the POS tags with brief definitions and examples on Stack Overflow.
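
If you prefer to look up tags without leaving the notebook, NLTK also bundles the tag definitions; a small sketch (it relies on the additional 'tagsets' resource, which has its own download):

In [ ]:
# the 'tagsets' resource contains the Penn Treebank tag definitions
nltk.download('tagsets')

# print the definition and examples for a single tag, e.g. plural nouns
nltk.help.upenn_tagset('NNS')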

Again, let's now do this for the entire story! But with a special twist: we want to catch all the verbs.

In [ ]:
# same as above: first tokenize, then pos_tag
pos = pos_tag(word_tokenize(story))

# to keep things short & sweet, we define a function for lemmatizing verbs
def lemmatize_verb (word):
  return lemmatizer().lemmatize(word.lower(), pos = wordnet.VERB)

# remember this form? the condition matches verbs, whose POS tag starts with a V
verbs = [lemmatize_verb(word[0]) for word in pos if word[1][0]=="V"]

# let's look at the first 50 verbs
print(verbs[:50])
['lead', 'translate', 'be', 'bound', 'travel', 'return', 'be', 'star', 'make', 'hear', 'accomplish', 'have', 'run', 'explain', 'be', 'have', 'circumnavigate', 'have', 'say', 'be', 'doubt', 'let', 'recall', 'be', 'undertake', 'keep', 'be', 'receive', 'pay', 'be', 'head', 'be', 'have', 'fly', 'pass', 'have', 'obtain', 'come', 'be', 'run', 'wave', 'astonish', 'concern', 'land', 'be', 'approach', 'clang', 'clank', 'introduce', 'have']

Woot! We just extracted all verbs from a story and normalized them… Pretty cool.

✏️ Repeat the last step for nouns below! Hint: you'll also need a specific lemmatize_noun():

In [ ]:

📄 2. Process

Now we can turn a text into its components and distinguish these tokens as different word types. Let's proceed by extracting entities, removing irrelevant words, and finding the most frequent words.

Extract entity types

Apart from identifying word types, we can distinguish between different entities that are mentioned in a text, such as persons, places, organizations, etc. This kind of text processing is also referred to as Named Entity Recognition (NER).

For this step, we stray from NLTK and use spaCy's statistical models for the English language. So first we import spaCy and load the English language model:

In [ ]:
import spacy
nlp = spacy.load("en_core_web_sm")

Let's take a recent New York Times article on the uneven spread of the coronavirus, which has already been prepared as a plain text file for the purpose of this tutorial. So as before, we load the text into a string, called article:

In [ ]:
# retrieve plain text article
r = requests.get('http://infovis.fh-potsdam.de/tutorials/data/article.txt')
r.encoding = "utf-8"
article = r.text

# carry out NLP processing
doc = nlp(article)

# get the text and entity label of each recognized entity in the article
entities = [ (e.text, e.label_) for e in doc.ents if e.text ]

# see first 20 entities
entities[0:20]
Out[ ]:
[('Hannah Beech', 'PERSON'),
 ('Alissa J. Rubin', 'PERSON'),
 ('Anatoly Kurmanaev', 'PERSON'),
 ('Ruth Maclean', 'PERSON'),
 ('11:28 a.m.', 'TIME'),
 ('Iran', 'GPE'),
 ('Iraq', 'GPE'),
 ('fewer than 100', 'CARDINAL'),
 ('The Dominican Republic', 'GPE'),
 ('nearly 7,600', 'CARDINAL'),
 ('Haiti', 'GPE'),
 ('about 85', 'CARDINAL'),
 ('Indonesia', 'GPE'),
 ('thousands', 'CARDINAL'),
 ('Malaysia', 'GPE'),
 ('about 100', 'CARDINAL'),
 ('earth', 'LOC'),
 ('New York', 'GPE'),
 ('Paris', 'GPE'),
 ('London', 'GPE')]

Now all tokens that have been recognized as particular entities are extracted and associated with an entity type. Have a look at spaCy's overview of NER tags to understand what they refer to.
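
By the way, spaCy can explain its own labels via spacy.explain(), which saves a trip to the documentation:

In [ ]:
# short descriptions of entity labels, straight from spaCy
print(spacy.explain("GPE"))
print(spacy.explain("CARDINAL"))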

✏️ Try the above with the short story and compare the results:

In [ ]:

Remove stop words

The opposite of particularly interesting entities are so-called stop words. These are very common, short function words such as "the", "is", "or", and "at". In text processing it can be useful to remove these frequent words in order to focus on the terms that are more specific to a given document.

NLTK actually already includes stop words for several languages, including English:

In [ ]:
from nltk.corpus import stopwords as stop

stopwords = stop.words("english")

print(stopwords)
['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', 'too', 'very', 's', 't', 'can', 'will', 'just', 'don', "don't", 'should', "should've", 'now', 'd', 'll', 'm', 'o', 're', 've', 'y', 'ain', 'aren', "aren't", 'couldn', "couldn't", 'didn', "didn't", 'doesn', "doesn't", 'hadn', "hadn't", 'hasn', "hasn't", 'haven', "haven't", 'isn', "isn't", 'ma', 'mightn', "mightn't", 'mustn', "mustn't", 'needn', "needn't", 'shan', "shan't", 'shouldn', "shouldn't", 'wasn', "wasn't", 'weren', "weren't", 'won', "won't", 'wouldn', "wouldn't"]

✏️ Check out the stop words of other languages
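
In case you wonder which languages are covered, the corpus can list them for you:

In [ ]:
# the languages for which NLTK provides stop word lists
print(stop.fileids())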

As a next step we remove the stop words from the short story to focus on those words that carry meaning:

In [ ]:
tokens = nltk.word_tokenize(story.lower())

# let's focus on those tokens that contain only letters
words = [word for word in tokens if word.isalpha()]

# again a list comprehension: keep only the words that are not stop words
without_stopwords = [word for word in words if word not in stopwords]

print(without_stopwords[:50])
['seventh', 'sally', 'trurl', 'perfection', 'led', 'good', 'stanisław', 'lem', 'translated', 'michael', 'kandel', 'universe', 'infinite', 'bounded', 'therefore', 'beam', 'light', 'whatever', 'direction', 'may', 'travel', 'billions', 'centuries', 'return', 'powerful', 'enough', 'point', 'departure', 'different', 'rumor', 'flies', 'star', 'star', 'makes', 'rounds', 'every', 'planet', 'one', 'day', 'trurl', 'heard', 'distant', 'reports', 'two', 'mighty', 'wise', 'accomplished', 'equal', 'news', 'ran']

✏️ What if you're really into stop words? Change above cell to remove all words that are not stop words!

Pack a bag of words

A common representation of text documents is the bag-of-words model, which treats a given text simply as a collection of its words, disregarding sentence and document structure. Typically, a bag-of-words representation is combined with the frequency of each word in the document.

In [ ]:
tokens = word_tokenize(story.lower())
words = [word for word in tokens if word.isalpha()]

# bag of words as a dictionary data type
bow = {}

# we count the occurrences of each word and save it
for word in words:
  bow[word] = words.count(word)

# for later use, we create a sorted list of word-frequency tuples
words_frequency = sorted(bow.items(), key=lambda x: x[1], reverse=True)

print(words_frequency[0:100])
[('the', 169), ('of', 131), ('and', 127), ('a', 92), ('to', 63), ('that', 58), ('in', 49), ('you', 37), ('trurl', 33), ('it', 31), ('his', 30), ('they', 26), ('he', 25), ('was', 25), ('with', 24), ('for', 23), ('not', 22), ('as', 21), ('by', 19), ('but', 19), ('this', 19), ('do', 19), ('no', 18), ('all', 17), ('i', 17), ('had', 16), ('kingdom', 16), ('have', 16), ('when', 15), ('is', 14), ('one', 13), ('klapaucius', 13), ('him', 13), ('were', 13), ('an', 13), ('what', 13), ('or', 12), ('would', 12), ('box', 12), ('so', 11), ('are', 11), ('excelsius', 11), ('there', 11), ('who', 10), ('which', 10), ('into', 10), ('on', 10), ('king', 10), ('be', 10), ('how', 9), ('only', 9), ('their', 9), ('way', 9), ('if', 8), ('from', 8), ('nothing', 8), ('at', 8), ('subjects', 8), ('like', 8), ('these', 7), ('well', 7), ('our', 7), ('your', 7), ('about', 6), ('planet', 6), ('two', 6), ('those', 6), ('great', 6), ('monarch', 6), ('up', 6), ('also', 6), ('could', 6), ('though', 6), ('said', 6), ('know', 6), ('electrons', 6), ('after', 5), ('its', 5), ('even', 5), ('very', 5), ('without', 5), ('some', 5), ('such', 5), ('over', 5), ('now', 5), ('first', 5), ('death', 5), ('see', 5), ('too', 5), ('out', 5), ('model', 5), ('them', 5), ('doll', 5), ('understand', 5), ('enough', 4), ('space', 4), ('time', 4), ('ship', 4), ('through', 4), ('full', 4)]
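
A side note on the counting: words.count() re-scans the whole list for every word, which gets slow for longer texts. Python's collections.Counter does the same job in a single pass; a minimal sketch:

In [ ]:
from collections import Counter

# count all words in one pass; most_common() returns (word, count) tuples
# sorted by frequency, just like words_frequency above
bow = Counter(words)
words_frequency = bow.most_common()

print(words_frequency[0:10])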

✏️ What are the most frequent meaningful words? Change above code cell so that the stop words are removed!

🥗 3. Present

Now let's turn all these words into visualizations!

Word cloud

For text visualization, one technique has received a lot of attention, despite its limited perceptual and analytical qualities. The word cloud (a.k.a. tag cloud) emerged in the golden age of Web 2.0 (the 2000s) and probably succeeded due to its simplicity in terms of interpretation and implementation: the more frequent a word, the larger the font size. Altair itself does not support word clouds, so we resort to a dedicated wordcloud generator and use matplotlib to render the image. The wordcloud library is extra convenient: it just takes the raw text as input:

In [ ]:
from wordcloud import WordCloud
import matplotlib.pyplot as plt

text = story
wc = WordCloud(width=500, height=500, background_color="white").generate(text)

# display the generated image:
my_dpi = 72
plt.figure(figsize = (500/my_dpi, 500/my_dpi), dpi=my_dpi)
plt.imshow(wc, interpolation="bilinear")
plt.axis("off")
plt.show()

✏️ The word cloud library actually gives a lot of options for customization. Change the colors, fonts, sizes, maybe keep the stopwords?

Common words

We shall move on to more precise representations of text. For this we will revisit an arguably mundane, but quite effective visualization technique: we draw a barchart of the most frequent words (excluding the stop words, if you have done the pencil activity in the section on packing a bag of words).

In [ ]:
# first we create a dataframe from the word frequencies
df = pd.DataFrame(words_frequency, columns=['word', 'count'])

# we want to focus just on the top 20 words
df_top = df[:20]

# draw horizontal barchart 
alt.Chart(df_top).mark_bar().encode(
  x = 'count:Q',
  y = 'word:N'
)
Out[ ]:

✏️ Sort the words on the y-axis according to the counts plotted on the x-axis! Hint: here is an example.

All words by type

Through POS tagging we are able to identify the different word types, such as nouns, verbs, adjectives, adverbs, and several others. So let's do exactly this and distinguish between these common word types for the article:

In [ ]:
# first we extract all words and their types (a.k.a. parts-of-speech or POS)
pos = pos_tag(word_tokenize(article))

# we will be collecting words and types in lists of the same length
words = []
types = []

# iterate over all entries in the pos list (generated above)
for p in pos:
  # get the word and turn it into lowercase
  word = p[0].lower()
  # get the word's type
  tag = p[1]

  # for this analysis we remove entries that contain punctuation or numbers
  # and we also ignore the stopwords (sorry: the, and, or, etc!)
  if word.isalpha() and word not in stopwords:
    # first we add this word to the words list
    words.append(word)
    # then we add its word type to the types list, based on the 1st letter of the POS tag
    # note that we access letters in a string, like entries in a list
    if   (tag[0]=="J"): types.append("Adjective")
    elif (tag[0]=="N"): types.append("Noun")
    elif (tag[0]=="R"): types.append("Adverb")
    elif (tag[0]=="V"): types.append("Verb")
    # there are many more word types, we simply subsume them under 'other'
    else: types.append("Other")

✏️ This is a good point to check what we generated. Take a look at the two lists we created:

In [ ]:
 
In [ ]:
 

With this information, we can now create two coordinated charts: one representing the frequency of the different word types and the other displaying the frequency of all words (given the current selection). But first things first: somebody get us a DataFrame quick!

In [ ]:
# with the two lists of the same length, we create a dataframe with a dictionary,
# of which the keys will become the column labels
df = pd.DataFrame({"word": words, "type": types })

# along the type column, we want to support a filter selection
selection = alt.selection(type="multi", fields=['type'])

# we create a composite chart consisting of two sub-charts
# the base holds it together and acts as the concierge taking care of the data
base = alt.Chart(df)

# this shows the types, note that we rely on Altair's aggregation prowess
chart1 = base.mark_bar().encode(
  x = alt.X('type:N'),
  y = alt.Y('count()'),
  # when a bar is selected, the others are displayed with reduced opacity
  opacity=alt.condition(selection, alt.value(1), alt.value(.25)),
).add_selection(selection)

# this chart reacts to the selection made in the left/above chart
chart2 = base.mark_bar(width=5).encode(
  x = alt.X('word:N'),
  y = alt.Y('count()'),
).transform_filter(selection)

chart1 | chart2
Out[ ]:

✏️ Sort the bars so that the most frequent types and words are on the left and we don't have to scroll to see them!

Keyword in context

Last but not least, it can be quite gratifying to see words in their original context. KWIC is a tried and tested method just for that purpose. Let's build one from scratch!

In [ ]:
import re # regular expressions, we will need them to search through the text
# the following we need, to display a text input field and make it interactive
import ipywidgets as widgets
from IPython.display import display, clear_output

# we replace all line breaks with spaces, so they do not mess up the display (you'll see)
text = story.replace("\n", " ")

# create a search box …
search_box = widgets.Text(placeholder='Enter search term', description='Search:')
# … and make it appear
display(search_box)

# this function is triggered when a search query is entered
def f(sender):
  # we get the query's text value
  query = search_box.value

  # this is the window of characters displayed on each side of the match
  span = 40 - int(len(query)/2)

  # for subsequent queries, we clear the output
  clear_output(wait=True)
  # which also removes the search box, so we bring it back
  display(search_box)

  # when the query is too short, we do not proceed and warn the user/reader
  if (len(query)<2): 
    print("\nPlease enter a longer query\n")
    return

  # and find all the start positions of matches in the text
  starts = [m.start() for m in re.finditer(query, text)]

  # if there are no matches, we also tell the user/reader
  if (len(starts)==0): 
    print("\nSorry, but there are no matches for your query\n")
    return

  # we go through all the start positions
  for start in starts:
    # determine the end position, based on the query's length
    end = start+len(query)

    # we get the string left and right of the match
    # rjust returns a right-justified string in case there are fewer characters than span to the left of the match
    left = text[max(0, start-span):start].rjust(span)
    match = text[start:end]
    right = text[end:end+span]

    # we print left and right context with the actual match in the middle
    print(left+match+right)

# the function f is linked to the search box's on_submit event
search_box.on_submit(f)

Only available in an active notebook on Colab or in Jupyter: now give the KWIC search a go and look for the tragic protagonist of the short story!

✏️ Currently the search is case-sensitive. What would it take to make the query case-insensitive?

Visualize entities

Remember that we also extracted entities such as people, places, organizations, etc. from the story and article (see 2. Process > Extract entity types)? The last step of the tutorial is reserved for you to visualize them.

✏️ Take inspiration from above examples to visualize the entities contained in the story and/or article:

In [ ]: