The volume of unstructured text in existence is growing dramatically, and Spark is an excellent tool
for analyzing this type of data. In this assignment, we will write code that calculates the most
common words in the Complete Works of William Shakespeare, retrieved from Project Gutenberg.
The same approach could be scaled up to find the most common words in Wikipedia.
During this assignment we will cover:
• Part 1: Creating a base RDD and pair RDDs
• Part 2: Counting with pair RDDs
• Part 3: Finding unique words and a mean value
• Part 4: Applying word count to a file
For reference, you can look up the details of the relevant methods in:
• Spark's Python API
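The core pipeline behind Parts 1–3 can be sketched in plain Python, with comments noting the corresponding PySpark transformations (`flatMap`, `map`, `reduceByKey`). The tokenizer and names below are illustrative assumptions for this sketch, not the assignment's required API:

```python
import re
from collections import Counter


def tokenize(line):
    # Lowercase the line and split on runs of non-letters,
    # dropping empty tokens (punctuation, digits, whitespace).
    return [w for w in re.split(r"[^a-z]+", line.lower()) if w]


def word_count(lines):
    # In Spark, this whole function would roughly be:
    #   counts = (rdd.flatMap(tokenize)            # one token per element
    #                .map(lambda w: (w, 1))        # pair RDD of (word, 1)
    #                .reduceByKey(lambda a, b: a + b))
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


lines = ["To be or not to be", "that is the question"]
counts = word_count(lines)

# Part 3: unique words and the mean occurrences per unique word.
unique_words = len(counts)
mean_occurrences = sum(counts.values()) / unique_words
```

For Part 4, the same `word_count` logic would be applied to an RDD built with `sc.textFile(...)` instead of an in-memory list, and the top words retrieved with `takeOrdered`.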