For this project, you'll create a "word cloud" from a text by writing a script. This script needs to process the text, remove punctuation, ignore case, skip words that are not made up entirely of alphabetic characters, count the word frequencies, and ignore uninteresting or irrelevant words. The output of the calculate_frequencies function is a dictionary. The wordcloud module will then generate the image from your dictionary.
For the input text of your script, you will need to provide a file that contains text only. For the text itself, you can copy and paste the contents of a website you like, or you can use a site like Project Gutenberg to find books that are available online. You could see what word clouds you can get from famous books, like a Shakespeare play or a novel by Jane Austen. Save this as a .txt file somewhere on your computer.
Now you will need to upload your input file here so that your script will be able to process it. To do the upload, you will need an uploader widget. Run the following cell to perform all the installs and imports for your word cloud script and uploader widget. It may take a minute for all of this to run and there will be a lot of output messages, but be patient. Once you see the final line of output, the code is done executing and you can continue with the rest of the instructions for this notebook.
Answers
WordCloud from Book Text - Python
Introduction
There exists a wordcloud module for Python, which can be installed with the pip package manager.
We will also be using the matplotlib module for plotting and the re module for removing some special characters with Regex.
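If these modules are not already installed, a single pip command in a terminal takes care of them (assuming the standard package names wordcloud, matplotlib, beautifulsoup4 and requests):
pip install wordcloud matplotlib beautifulsoup4 requests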
Loading the Book Text onto the Program
The problem statement asks us to get a whole book's text. We could save the text in a plain .txt file and load it. However, here I fetch the book Pride and Prejudice by Jane Austen directly from the Project Gutenberg site.
The code uses the requests module to get the HTML page and the bs4 (Beautiful Soup) module to parse through it and extract the text.
If you open the HTML page of the book and inspect it, you will find that all the chapters sit within div elements with the class "chapter". This makes it easy to get the text.
You could directly load the file and remove the requests and bs4 modules.
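For reference, here is a minimal sketch of that simpler approach, assuming the book has been saved locally as a file named pride_and_prejudice.txt (the filename is just an example):
# Read the whole book from a local plain-text file instead of the web
with open("pride_and_prejudice.txt", "r", encoding="utf-8") as book_file:
    book_text = book_file.read()
# Keep the same list structure the rest of the script expects
chapter_text = [book_text]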
Processing the text
We first split the text into tokens. We simply use Python's built-in split() function on strings. By default, it splits the string on whitespace and returns a list.
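For example, a short sentence splits like this:
sentence = "It is a truth universally acknowledged"
print(sentence.split())
# ['It', 'is', 'a', 'truth', 'universally', 'acknowledged']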
So, we get a list of tokens. We run through all the tokens and make them lowercase, and then remove apostrophes and all non-alphanumeric characters with Regex.
The pattern 's* matches an apostrophe followed by zero or more instances of the letter s. So, words like Brainly's, Princess', etc. are taken care of.
We use the re.sub() function, which substitutes a pattern with a replacement in a given string. We substitute this apostrophe pattern with an empty string.
Next, we remove all non-alphanumeric characters in the same way. The special character class in Regex that matches a non-word character is \W. We substitute it with an empty string on all the tokens.
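To illustrate both substitutions, here is a small standalone sketch (the sample words are made up, not taken from the book):
import re

samples = ["Elizabeth's", "princess'", "night,", "acknowledged."]
for token in samples:
    token = token.lower()
    token = re.sub(r"'s*", "", token)  # strip apostrophe plus optional s
    token = re.sub(r"\W", "", token)   # strip remaining non-alphanumeric characters
    print(token)
# Prints: elizabeth, princess, night, acknowledged (one per line)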
Counting Frequencies
The wordcloud module doesn't strictly need to be passed a dictionary; it can count the frequencies itself and generate a word cloud from raw text. However, for the sake of the question, we count them ourselves.
The calculate_frequencies function is pretty simple. We first create an empty dictionary and loop through all the tokens. If a token is encountered for the first time (i.e. it is not yet in the dictionary), it is assigned a value of 1; otherwise we just increment its value by 1.
Also, we pass in stopwords to skip while calculating frequencies. The wordcloud module already provides a standard set of STOPWORDS. We add two custom stopwords: "mr" and "mrs".
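As a side note, the same counting could be done with collections.Counter from the standard library; a tiny sketch, not what the script below uses:
from collections import Counter

tokens = ["mr", "darcy", "darcy", "elizabeth"]
stopwords = {"mr", "mrs"}
word_frequencies = Counter(word for word in tokens if word and word not in stopwords)
print(word_frequencies)  # Counter({'darcy': 2, 'elizabeth': 1})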
Generating WordCloud
The syntax is pretty much standard. We set the parameters on the WordCloud object and call its generate_from_frequencies() function with our dictionary.
Finally, we plot it with matplotlib with the axes turned off.
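If you also want to keep the image rather than only display it, the WordCloud object can write it straight to a PNG file (an optional extra; the filename is just an example):
# Optional: save the rendered cloud to disk
wordcloud.to_file("pride_and_prejudice_wordcloud.png")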
Now, I wrote the code originally in a Jupyter Notebook. However, it is easily exportable to a simple Python file. The full code is provided below.
WordCloud.py
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup # For parsing webpages
import requests # To get webpages
import re # To remove patterns with Regex
# Use a Project Gutenberg Link
# Here, we take the book Pride and Prejudice by Jane Austen
gutenberg_ebook_url = "insert url here"
webpage = requests.get(gutenberg_ebook_url)
soup = BeautifulSoup(webpage.text, 'html.parser')
# Find all chapters in the HTML page
chapter_text = []
for chapter in soup.find_all('div', 'chapter'):
    chapter_text.append(chapter.get_text())
# Split text into tokens
tokens = []
# Each chapter is one string; split() gives a list of words, so we extend the tokens list
for chapter in chapter_text:
    tokens.extend(chapter.split())
# Process the tokens
for i in range(len(tokens)):
    # Convert everything to lowercase
    tokens[i] = tokens[i].lower()
    # Remove apostrophes [The s is optional as evident from the Regex]
    tokens[i] = re.sub(r"'s*", "", tokens[i])
    # Remove any non-alphanumeric characters
    tokens[i] = re.sub(r"\W", "", tokens[i])
# Calculate the frequencies of words while avoiding stopwords
def calculate_frequencies(text_tokens, stopwords):
    freq = dict()
    for word in text_tokens:
        # Skip stopwords and any empty strings left over after stripping punctuation
        if word in stopwords or not word:
            continue
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq
# Set the Stopwords
stopwords = set(STOPWORDS) # Standard Stopwords in WordCloud
stopwords.add("mr")
stopwords.add("mrs")
# Get the word frequencies
word_frequencies = calculate_frequencies(tokens, stopwords)
# Create the Word Cloud
wordcloud = WordCloud(width = 1000, height = 1000,
                      stopwords = stopwords,
                      min_font_size = 10).generate_from_frequencies(word_frequencies)
# Plot the Word Cloud
plt.figure(figsize = (10, 10), facecolor=None)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()