The wordcount_spark.py
program we wrote earlier finds the word that is used the most times in the input text. It does this by performing a sum reduction
with the add
operator. Your job is to modify this program to use a different kind of reduction that counts the number of distinct words in
the input text.
Call your new program distinct_spark.py
and commit it to the repository you used for Assignment 3.
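One possible approach, sketched below, is to map each word to a one-element set and reduce with set union instead of add. The reduction functions are plain Python, so they can be tested locally; the Spark pipeline itself (commented out, since it needs a cluster and an input file, both placeholders here) has exactly the same shape.

```python
from functools import reduce

def to_set(word):
    # Map each word to a one-element set...
    return {word}

def union(a, b):
    # ...then reduce with set union instead of add.
    return a | b

def distinct_count_local(words):
    # Local equivalent of the Spark map/reduce pipeline, useful for testing.
    return len(reduce(union, map(to_set, words), set()))

# On a cluster the pipeline is identical in shape ("input.txt" is a placeholder):
# from pyspark import SparkContext
# sc = SparkContext(appName="DistinctWords")
# words = sc.textFile("input.txt").flatMap(lambda line: line.split())
# print(len(words.map(to_set).reduce(union)))
```

This is only one way to do it; the point is that swapping the reduction operator changes what the same pipeline computes.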
We saw how to use the SparkContext.parallelize
method to create a distributed dataset (RDD) containing all the numbers from 0 to 1,000,000. Use this
same method to create an RDD containing the numbers from 1 to 1000. The RDD class has a handy method called
fold, which aggregates all the elements of the dataset
using a zero value and a function supplied as arguments. Use this method to create a program that
calculates the product of all the numbers from 1 to 1000 and prints the result.
Call your new program product_spark.py
and commit it to the repository you used for Assignment 3.
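A sketch of the idea: fold with multiplication needs 1 as its zero value, because Spark applies the zero value once per partition and it must therefore be the identity for the operator. The fold itself can be mimicked locally with functools.reduce; the Spark version (commented out, app name is a placeholder) follows the same pattern.

```python
from operator import mul
from functools import reduce

def product_local(nums):
    # Fold with zero value 1 (the identity for multiplication) and mul.
    return reduce(mul, nums, 1)

# Spark sketch of product_spark.py:
# from pyspark import SparkContext
# sc = SparkContext(appName="Product")
# nums = sc.parallelize(range(1, 1001))
# print(nums.fold(1, mul))
```

Note that the product of 1 to 1000 is an enormous integer; Python's arbitrary-precision integers handle it without overflow.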
There is nothing to stop you combining the map
operation with the fold
operation. You can even apply map
more than once in order
to generate more complex mappings. For bonus marks, see if you can work out how to use map
and fold
to calculate the average of
the square roots of all the numbers from 1 to 1000, i.e., the sum of the square roots of all the numbers divided by 1000.
Call your new program squareroot_spark.py
and commit it to the repository you used for Assignment 3.
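One way this combination could look: map each number to its square root, fold the results with add starting from 0.0, then divide by the count. The local version below uses the same map-then-fold shape; the Spark version is commented out since it assumes a running SparkContext.

```python
from math import sqrt
from operator import add
from functools import reduce

def avg_sqrt_local(n):
    # Map each number to its square root, fold with add, then divide by n.
    total = reduce(add, map(sqrt, range(1, n + 1)), 0.0)
    return total / n

# Spark sketch of squareroot_spark.py:
# from pyspark import SparkContext
# sc = SparkContext(appName="SquareRoot")
# nums = sc.parallelize(range(1, 1001))
# print(nums.map(sqrt).fold(0.0, add) / 1000)
```

Here 0.0 is the identity for addition, so it is a safe zero value even though fold applies it once per partition.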