Take a sample of ten phishing e-mails (or any text file) and find the most commonly occurring word(s).
Answers
Word Frequency - Python
Getting the data
The first step is getting a text file with some content. If you have a source to get content, well and good. Here, I use the text file of The Adventures of Sherlock Holmes obtained from Project Gutenberg.
Pre-processing
Now, we have a text file. Open the file and have a look. A good approach now would be to convert everything to lowercase and remove all the special characters.
Converting to lowercase is easy. We just use the lower() string method.
Removing the special characters takes more work. We could use the replace() string method, but each call replaces only one substring, and we have a lot of replacements to do.
Instead, we use the power of regular expressions, with the re module. The re.sub(pattern, repl, string) function replaces the occurrences of a RegEx pattern with repl in string.
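As a rough illustration of why a single regular expression is more convenient than chained replacements, the sketch below cleans a short made-up string both ways. The sample text and variable names are assumptions for illustration only.

```python
import re

text = "hello, world! (again)"

# Chained str.replace(): one call is needed per character to remove.
chained = (
    text.replace(",", " ")
        .replace("!", " ")
        .replace("(", " ")
        .replace(")", " ")
)

# re.sub(pattern, repl, string): one call covers every character in the class.
with_regex = re.sub(r"[,!()]", " ", text)

print(chained == with_regex)
print(with_regex.split())
```

Both approaches give the same result here, but the regular expression grows by one character per extra symbol, while the chained version grows by one call.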
We write a single pattern, containing all the special characters we have. We would replace them with a space (and not an empty string, because then there exists a possibility of words getting adjoined).
The special characters replaced: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~“”‘’—
In the pattern, we escape each of these with a backslash inside the square brackets. The RegEx pattern looks like this: r"[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\“\”\‘\’\—]"
This pattern uses a character class, written inside square brackets [], which matches any single character from the list inside it. So, all the special characters will be matched with this single pattern.
We replace them with a space, and not an empty string, because otherwise words might be accidentally joined when the special characters are removed.
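To see why the replacement is a space rather than an empty string, here is a minimal sketch on a made-up sentence, using a much shorter character class than the full pattern above.

```python
import re

sample = "a well-known case;the end"
pattern = r"[\-\;]"  # shortened character class, for illustration only

# Replacing with an empty string glues neighbouring words together.
joined = re.sub(pattern, "", sample)

# Replacing with a space keeps the words apart.
spaced = re.sub(pattern, " ", sample)

print(joined.split())
print(spaced.split())
```

With the empty string, "well-known" becomes "wellknown" and "case;the" becomes "casethe", which would be counted as new words. With a space, each word is counted separately.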
Now, the text contains apostrophes, as in possessives. Once an apostrophe is replaced with a space, an individual letter s is left dangling. We remove these with another pattern, r"\ss\s". The \s special sequence matches any whitespace character, so the pattern matches a whitespace character, followed by the letter s, followed by another whitespace character.
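The effect of the two substitutions on possessives can be sketched on a short made-up phrase. Only the apostrophe is stripped here, to keep the example small.

```python
import re

content = "holmes's pipe and watson's notes"

# Step 1: the apostrophe becomes a space, leaving a lone "s" behind.
content = re.sub(r"[\']", " ", content)
print(content)

# Step 2: whitespace + "s" + whitespace collapses to a single space.
content = re.sub(r"\ss\s", " ", content)
print(content.split())
```

Note that the pattern needs whitespace on both sides, so a lone s at the very end of the text, with nothing after it, would not be matched.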
Splitting the content and counting word frequency
Now, we split everything using whitespaces as delimiters. After that, we declare a dictionary, and run through the word list to store the word and its count in the dictionary.
We use the dictionary's get() method, which checks whether the key exists in the dictionary. If it does, get() returns its value; otherwise it returns a default value, which we set to 0. This is useful, because we can now write our word counting code in one line.
words = dict()
for word in wordlist:
    words[word] = words.get(word, 0) + 1
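A small worked run of the counting loop, on a made-up sentence standing in for the cleaned text, might look like this:

```python
# Made-up, already-cleaned text (an assumption for illustration).
wordlist = "verify your account now or your account will be closed".split()

words = dict()
for word in wordlist:
    # get() returns 0 the first time a word is seen, so no KeyError is raised.
    words[word] = words.get(word, 0) + 1

print(words["your"], words["account"], words["now"])
```

Here "your" and "account" each appear twice and "now" once, so the counts printed are 2, 2 and 1.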
Sorting by word count
Finally, we sort the dictionary. This is a bit involved. We use the sorted() function.
We first split the dictionary into (key, value) tuples with words.items().
Then we pass a custom function to the key parameter, lambda x: x[1], which just returns the second element in the tuple (which is the dictionary value).
And we set the reverse parameter to True, to sort in descending order. The sorted() function returns a list of tuples.
The sorting code thus looks like this:
words = {k: v for k, v in sorted(words.items(), key=lambda x: x[1], reverse=True)}
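Here, words.items() splits the dictionary into tuples, the lambda function is passed to the key parameter, and the reverse parameter is set to True.
This returns a list of tuples, sorted as we wanted. We then rebuild a dictionary from them with a dictionary comprehension. This works because dictionaries preserve insertion order in Python 3.7 and later, so the rebuilt dictionary keeps the sorted order.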
After that we can just display the words and their counts.
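The sorting step can be tried on a tiny hand-made dictionary first. The counts below are invented for illustration.

```python
words = {"the": 3, "holmes": 5, "a": 1}

# sorted() returns a list of (word, count) tuples, highest count first.
pairs = sorted(words.items(), key=lambda x: x[1], reverse=True)
print(pairs)  # [('holmes', 5), ('the', 3), ('a', 1)]

# Rebuilding a dict from the sorted pairs keeps them in that order.
ordered = {k: v for k, v in pairs}
print(list(ordered))  # ['holmes', 'the', 'a']
```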
The Code
import re
# Set the filename
filename = "the-adventures-of-sherlock-holmes.txt"
# Open the file and read the content
with open(filename, "r", encoding="utf-8") as f:
    content = f.read()
# Convert everything to lowercase
content = content.lower()
# Remove special characters: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~“”‘’—
# All characters are escaped with a backslash and replaced with a space
content = re.sub(r"[\!\"\#\$\%\&\'\(\)\*\+\,\-\.\/\:\;\<\=\>\?\@\[\\\]\^\_\`\{\|\}\~\“\”\‘\’\—]", " ", content)
# Replace the residual letter s left after apostrophes with a space
content = re.sub(r"\ss\s", " ", content)
# Split the content at spaces
wordlist = content.split()
# Declare a dictionary
words = dict()
# Count the occurrences of each word
for word in wordlist:
    words[word] = words.get(word, 0) + 1
# Sort the dictionary by count of words
words = {k: v for k, v in sorted(words.items(), key=lambda x: x[1], reverse=True)}
# Print the 100 most used words
i = 0
for word in words:
    print(f"{word} - {words[word]}")
    i += 1
    if i == 100:
        break
Explanation:
# Python program to find the n most frequent words
# in a data set
from collections import Counter
# data_set = "phished emails"
def most_common_word(data_set, n=10):
    # split() returns a list of all the words in the string
    split_it = data_set.split()
    # Pass the split_it list to an instance of the Counter class
    count = Counter(split_it)
    # most_common(n) returns the n most frequently encountered
    # words and their respective counts
    most_occur = count.most_common(n)
    print(most_occur)
    return most_occur
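As a usage sketch of the Counter approach, the snippet below runs it on three made-up lines standing in for phishing e-mails. The sample text is an assumption, not real data. The tokenizing step uses re.findall() rather than the split() call above, so that punctuation and capital letters do not produce separate entries for the same word.

```python
import re
from collections import Counter

# Made-up sample text standing in for real phishing e-mails (an assumption).
emails = [
    "Urgent: verify your account now!",
    "Your account has been suspended. Verify your password.",
    "Click here to verify your account details.",
]

# Join the e-mails, lowercase them, and keep only runs of letters and apostrophes.
tokens = re.findall(r"[a-z']+", " ".join(emails).lower())

# most_common(3) returns the three most frequent words with their counts.
top_three = Counter(tokens).most_common(3)
print(top_three)
```

In this sample, "your" appears four times, and "verify" and "account" three times each.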