Computer Science, asked by BendingReality, 4 months ago

Take a sample of ten phishing e-mails (or any text file) and find most commonly occurring word(s)



Answers

Answered by QGP

Word Frequency - Python

Getting the data

The first step is getting a text file with some content. If you have a source to get content, well and good. Here, I use the text file of The Adventures of Sherlock Holmes obtained from Project Gutenberg.

Pre-processing

Now, we have a text file. Open the file and have a look. A good approach now would be to convert everything to lowercase and remove all the special characters.

Converting to lowercase is easy. We just use the str.lower() method.

Removing the special characters takes more work. We could use the str.replace() method, but it replaces only one substring per call, and we have a lot of replacements to do.

We use the power of Regular Expressions, via the re module. The function re.sub(pattern, repl, string) replaces every occurrence of a RegEx pattern in string with repl.
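A minimal illustration of re.sub, using a digit character class purely for demonstration:

```python
import re

# Every digit matched by the class [0-9] is replaced with "#"
masked = re.sub(r"[0-9]", "#", "agent 007")
print(masked)  # agent ###
```

The same one-call replacement works for any set of characters, which is exactly what we need here.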

We write a single pattern containing all the special characters. We replace them with a space (not an empty string, because then adjacent words could get joined together).

The special characters replaced: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~“”‘’—

In the pattern, each of these characters is escaped with a backslash. The RegEx pattern looks like this: r"[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\“\”\‘\’\—]"

This pattern uses a character class: the square brackets [] match any single character from the set listed inside them. So all the special characters are matched by this single pattern.


Apostrophes were among the characters replaced, so possessives like "holmes's" leave a dangling letter s behind. We remove these with another pattern: r"\ss\s". The \s special sequence matches any whitespace character, so the pattern matches a whitespace, followed by the letter s, followed by another whitespace.
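The two substitutions in sequence, on a toy string (a small sketch of the cleanup step):

```python
import re

text = "it's elementary, watson."
# Replace special characters (here just ' , and .) with spaces
text = re.sub(r"[\'\,\.]", " ", text)
# text is now "it s elementary  watson " -- a dangling "s" remains
text = re.sub(r"\ss\s", " ", text)
print(text.split())  # ['it', 'elementary', 'watson']
```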

Splitting the content and counting word frequency

Now, we split everything using whitespaces as delimiters. After that, we declare a dictionary, and run through the word list to store the word and its count in the dictionary.

We use the dict.get(key, default) method, which checks whether the key exists in the dictionary. If it does, it returns the stored value; otherwise it returns the default. This is useful, because we can now write our word-counting code in one line.
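The behaviour of dict.get in isolation (a toy example):

```python
counts = {"holmes": 2}
# Existing key: returns the stored value
print(counts.get("holmes", 0))  # 2
# Missing key: returns the default instead of raising KeyError
print(counts.get("watson", 0))  # 0
```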

words = dict()

for word in wordlist:

   words[word] = words.get(word, 0) + 1

Sorting by word count

Finally, we sort the dictionary. This is a bit involved. We use the sorted() function.

We first get the dictionary's entries as (key, value) tuples with dict.items().

Then we pass a custom key function, lambda x: x[1], which returns the second element of each tuple (the dictionary value, i.e. the word count).

And we set the reverse parameter to True, to sort in descending order. The sorted() function returns a list of tuples.

The sorting code thus looks like this:

words = {k: v for k, v in sorted(words.items(), key=lambda x: x[1], reverse=True)}

words.items() supplies the (word, count) tuples, the lambda function is passed to the key parameter, and the reverse parameter is set to True.

This gives a list of tuples, sorted as we wanted. We rebuild a dictionary from them with a dict comprehension; since Python 3.7, dicts preserve insertion order, so the sorted order is kept.

After that we can just display the words and their counts.


The Code

import re

# Set the filename

filename = "the-adventures-of-sherlock-holmes.txt"

# Open the file and read the content

with open(filename, "r", encoding="utf-8") as f:

   content = f.read()

# Convert everything to lowercase

content = content.lower()

# Remove special characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~“”‘’—

# All characters are escaped with a backslash and replaced with a space

content = re.sub(r"[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\“\”\‘\’\—]", " ", content)

# Replace the residual letter s left after apostrophes with a space

content = re.sub(r"\ss\s", " ", content)

# Split the content at spaces

wordlist = content.split()

# Declare a dictionary

words = dict()

# Count the occurrence of each word

for word in wordlist:

   words[word] = words.get(word, 0) + 1

# Sort the dictionary by count of words

words = {k: v for k, v in sorted(words.items(), key=lambda x: x[1], reverse=True)}

# Print the 100 most used words  

i = 0

for word in words:

   print(f"{word} - {words[word]}")

   i += 1

   if i == 100:

       break

Answered by guptaankita5252

Explanation:


# Python program to find the n most frequent words from a data set

from collections import Counter

def most_common_words(data_set, n=10):

    # split() returns a list of all the words in the string
    split_it = data_set.split()

    # Pass the split_it list to an instance of the Counter class
    count = Counter(split_it)

    # most_common(n) returns the n most frequently encountered
    # words and their respective counts
    most_occur = count.most_common(n)

    print(most_occur)
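The Counter approach above can be exercised end to end with a small made-up sample (the sample string below is illustrative, not real phishing text; the helper returns rather than prints so the result can be inspected):

```python
from collections import Counter

def most_common_words(data_set, n=10):
    # Split into words and let Counter tally them
    count = Counter(data_set.split())
    # Ties are ordered by first encounter in the input
    return count.most_common(n)

sample = "verify your account now your account is locked verify now"
print(most_common_words(sample, 3))
# [('verify', 2), ('your', 2), ('account', 2)]
```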
