Advanced Python for Data Science Assignment 13

  1. The wordcount_spark.py program we wrote earlier finds the word that occurs the most times in the input text. It does this by performing a sum reduction using the add operator. Your job is to modify this program to use a different kind of reduction in order to count the number of distinct words in the input text (one possible approach is sketched below).

    Call your new program distinct_spark.py and commit it to the repository you used for Assignment 3.
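
    One possible sketch, assuming the input file path is passed as the first command-line argument (the exact input handling in wordcount_spark.py may differ): pair every word with a dummy value, reduce by key so each distinct word survives exactly once, and count the survivors.

    ```python
    # distinct_spark.py - a hedged sketch; adapt the input handling to
    # match wordcount_spark.py (the file path argument is an assumption).
    import sys
    from pyspark import SparkContext

    sc = SparkContext(appName="DistinctWords")

    # Split each line of the input text into words.
    words = sc.textFile(sys.argv[1]).flatMap(lambda line: line.split())

    # Pair each word with a dummy value, then reduce by key so that each
    # distinct word is kept exactly once; counting the resulting pairs
    # gives the number of distinct words.
    num_distinct = (words.map(lambda w: (w, 1))
                         .reduceByKey(lambda a, b: a)
                         .count())

    print("Number of distinct words:", num_distinct)

    sc.stop()
    ```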

  2. We saw how to use the SparkContext.parallelize method to create a distributed dataset (RDD) containing all the numbers from 0 to 1,000,000. Use this same method to create an RDD containing the numbers from 1 to 1000. The RDD class has a handy method called fold which aggregates all the elements of the dataset using an initial "zero" value and a combining function, both supplied as arguments. Use this method to create a program that calculates the product of all the numbers from 1 to 1000 and prints the result (see the sketch below).

    Call your new program product_spark.py and commit it to the repository you used for Assignment 3.
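
    A minimal sketch (the appName is an assumption). The key point is that fold needs the identity element for the operation, which for multiplication is 1; Spark folds the zero value into each partition's partial result as well as the final one, so it must be a true identity.

    ```python
    # product_spark.py - a minimal sketch; the SparkContext setup used
    # in the course may differ.
    from operator import mul
    from pyspark import SparkContext

    sc = SparkContext(appName="Product")

    # range's upper bound is exclusive, so this covers 1..1000.
    numbers = sc.parallelize(range(1, 1001))

    # fold(zeroValue, op): for a product the zero value must be the
    # multiplicative identity 1, because Spark also applies it to each
    # partition's partial result.
    product = numbers.fold(1, mul)

    # 1000! is an enormous integer, but Python integers are unbounded.
    print(product)

    sc.stop()
    ```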

  3. There is nothing to stop you from combining the map operation with the fold operation. You can even apply map more than once in order to generate more complex mappings. For bonus marks, see if you can work out how to use map and fold to calculate the average of the square roots of all the numbers from 1 to 1000, i.e., the sum of the square roots of all the numbers divided by 1000 (see the sketch below).

    Call your new program squareroot_spark.py and commit it to the repository you used for Assignment 3.
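
    One hedged sketch of the map-then-fold approach (the appName is an assumption): map each number to its square root, fold with addition (whose identity is 0) to get the sum, then divide by 1000.

    ```python
    # squareroot_spark.py - a sketch under the assumptions above.
    from math import sqrt
    from operator import add
    from pyspark import SparkContext

    sc = SparkContext(appName="SqrtAverage")

    n = 1000
    numbers = sc.parallelize(range(1, n + 1))

    # Map each number to its square root, then sum with fold; the zero
    # value for addition is 0.
    total = numbers.map(sqrt).fold(0, add)

    # The average is the sum of the square roots divided by n.
    print(total / n)

    sc.stop()
    ```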