For this project, you'll create a "word cloud" from a text by writing a script. This script needs to process the text, remove punctuation, ignore case, skip words that are not made up entirely of alphabetic characters, count the word frequencies, and ignore uninteresting or irrelevant words. The output of the calculate_frequencies function is a dictionary. The wordcloud module will then generate the image from your dictionary.
For the input text of your script, you will need to provide a file that contains text only. For the text itself, you can copy and paste the contents of a website you like, or you can use a site like Project Gutenberg to find books that are available online. You could see what word clouds you can get from famous books, like a Shakespeare play or a novel by Jane Austen. Save this as a .txt file somewhere on your computer.
Now you will need to upload your input file here so that your script will be able to process it. To do the upload, you will need an uploader widget. Run the following cell to perform all the installs and imports for your word cloud script and uploader widget. It may take a minute for all of this to run and there will be a lot of output messages, but be patient. Once you see the final line of output, the code is done executing and you can continue with the rest of the instructions for this notebook.
Answers
WordCloud from Book Text - Python
Introduction
There exists a wordcloud module for Python, which can be installed with the pip package manager.
We will also be using the matplotlib module for plotting and the re module for removing some special characters with Regex.
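If these modules are not already installed, a single pip command in a terminal takes care of them (assuming the standard package names wordcloud, matplotlib, beautifulsoup4 and requests):
pip install wordcloud matplotlib beautifulsoup4 requests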
Loading the Book Text onto the Program
The problem statement asks us to get a whole book's text. We could save the text in a plain .txt file and load it. However, here I fetch the book Pride and Prejudice by Jane Austen directly from the Project Gutenberg site.
The code uses the requests module to get the HTML page and the bs4 (Beautiful Soup) module to parse through it and extract the text.
If you open the HTML page of the book and inspect it, you will find that all the chapters sit within div elements with the class "chapter". This makes it easy to get the text.
You could directly load the file and remove the requests and bs4 modules.
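For reference, here is a minimal sketch of that simpler approach, assuming the book has been saved locally as a file named pride_and_prejudice.txt (the filename is just an example):
# Read the whole book from a local plain-text file instead of the web
with open("pride_and_prejudice.txt", "r", encoding="utf-8") as book_file:
    book_text = book_file.read()
# Keep the same list structure the rest of the script expects
chapter_text = [book_text]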
Processing the text
We first split the text into tokens. We simply use Python's built-in split() function on strings. By default, it splits the string on whitespace and returns a list.
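For example, a short sentence splits like this:
sentence = "It is a truth universally acknowledged"
print(sentence.split())
# ['It', 'is', 'a', 'truth', 'universally', 'acknowledged']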
So, we get a list of tokens. We run through all the tokens and make them lowercase, and then remove apostrophes and all non-alphanumeric characters with Regex.
The pattern 's* matches an apostrophe followed by zero or more instances of the letter s. So, words like Brainly's, Princess', etc. are taken care of.
We use the re.sub() function, which substitutes a pattern with a replacement in a given string. We substitute this apostrophe pattern with an empty string.
Next, we remove all non-alphanumeric characters in the same way. The special character class in Regex that matches a non-word character is \W. We substitute it with an empty string on all the tokens.
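To illustrate both substitutions, here is a small standalone sketch (the sample words are made up, not taken from the book):
import re

samples = ["Elizabeth's", "princess'", "night,", "acknowledged."]
for token in samples:
    token = token.lower()
    token = re.sub(r"'s*", "", token)  # strip apostrophe plus optional s
    token = re.sub(r"\W", "", token)   # strip remaining non-alphanumeric characters
    print(token)
# Prints: elizabeth, princess, night, acknowledged (one per line)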
Counting Frequencies
The wordcloud module doesn't strictly need to be passed a dictionary; it can count the frequencies itself and generate a word cloud from raw text. However, for the sake of the question, we count them ourselves.
The calculate_frequencies function is pretty simple. We first create an empty dictionary and loop through all the tokens. If a token is encountered for the first time (i.e. it is not yet in the dictionary), it is assigned a value of 1; otherwise we just increment its value by 1.
Also, we pass in stopwords to skip while calculating frequencies. The wordcloud module already provides a standard set of STOPWORDS. We add two custom stopwords: "mr" and "mrs".
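As a side note, the same counting could be done with collections.Counter from the standard library; a tiny sketch, not what the script below uses:
from collections import Counter

tokens = ["mr", "darcy", "darcy", "elizabeth"]
stopwords = {"mr", "mrs"}
word_frequencies = Counter(word for word in tokens if word and word not in stopwords)
print(word_frequencies)  # Counter({'darcy': 2, 'elizabeth': 1})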
Generating WordCloud
The syntax is pretty much standard. We set the parameters on the WordCloud object and call its generate_from_frequencies() function with our dictionary.
Finally, we plot it with matplotlib with the axes turned off.
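If you also want to keep the image rather than only display it, the WordCloud object can write it straight to a PNG file (an optional extra; the filename is just an example):
# Optional: save the rendered cloud to disk
wordcloud.to_file("pride_and_prejudice_wordcloud.png")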
Now, I wrote the code originally in a Jupyter Notebook. However, it is easily exportable to a simple Python file. The full code is provided below.
WordCloud.py
from wordcloud import WordCloud, STOPWORDS
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup # For parsing webpages
import requests # To get webpages
import re # To remove patterns with Regex
# Use a Project Gutenberg Link
# Here, we take the book Pride and Prejudice by Jane Austen
gutenberg_ebook_url = "insert url here"
webpage = requests.get(gutenberg_ebook_url)
soup = BeautifulSoup(webpage.text, 'html.parser')
# Find all chapters in the HTML page
chapter_text = []
for chapter in soup.find_all('div', 'chapter'):
    chapter_text.append(chapter.get_text())
# Split text into tokens
tokens = []
# Each chapter is one string; split() gives a list of words, so we extend the tokens list
for chapter in chapter_text:
    tokens.extend(chapter.split())
# Process the tokens
for i in range(len(tokens)):
    # Convert everything to lowercase
    tokens[i] = tokens[i].lower()
    # Remove apostrophes [The s is optional as evident from the Regex]
    tokens[i] = re.sub(r"'s*", "", tokens[i])
    # Remove any non-alphanumeric characters
    tokens[i] = re.sub(r"\W", "", tokens[i])
# Calculate the frequencies of words while avoiding stopwords
def calculate_frequencies(text_tokens, stopwords):
    freq = dict()
    for word in text_tokens:
        # Skip stopwords and any empty strings left over after stripping punctuation
        if word in stopwords or not word:
            continue
        if word in freq:
            freq[word] += 1
        else:
            freq[word] = 1
    return freq
# Set the Stopwords
stopwords = set(STOPWORDS) # Standard Stopwords in WordCloud
stopwords.add("mr")
stopwords.add("mrs")
# Get the word frequencies
word_frequencies = calculate_frequencies(tokens, stopwords)
# Create the Word Cloud
wordcloud = WordCloud(width = 1000, height = 1000,
                      stopwords = stopwords,
                      min_font_size = 10).generate_from_frequencies(word_frequencies)
# Plot the Word Cloud
plt.figure(figsize = (10, 10), facecolor=None)
plt.imshow(wordcloud)
plt.axis('off')
plt.show()