Introduction
|
Spark is a general-purpose cluster-computing framework.
It supports the MapReduce programming model but provides additional functionality, such as keeping data sets in memory between operations.
This makes it especially useful for iterative workloads such as machine learning and optimization.
|
MapReduce Primer
|
MapReduce is a software framework for processing large data sets in a distributed fashion.
A data set is mapped into a collection of (key, value) pairs.
The (key, value) pairs can then be manipulated (e.g. by sorting them).
The result is a reduction over all pairs that share the same key, as sketched below.
|
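To make the three phases concrete, here is a minimal single-machine sketch of the model in plain Python. The word-count task, the phase functions, and the input documents are illustrative assumptions, not part of any particular framework.

    from collections import defaultdict

    def map_phase(documents):
        # Map: turn each document into (key, value) pairs, here (word, 1).
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def shuffle_phase(pairs):
        # Manipulate: group the values of all pairs that share the same key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: combine the values for each key into a single result.
        return {key: sum(values) for key, values in groups.items()}

    docs = ["spark supports mapreduce", "mapreduce processes pairs"]
    print(reduce_phase(shuffle_phase(map_phase(docs))))
    # {'spark': 1, 'supports': 1, 'mapreduce': 2, 'processes': 1, 'pairs': 1}

A real MapReduce framework distributes these phases across many machines; the structure of the computation is the same.
|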
Introduction to Spark
|
Spark defines an API for distributed computing based on resilient distributed datasets (RDDs), collections partitioned across the cluster.
A driver program coordinates the overall computation.
An executor is a process that runs computations and stores data.
The driver sends application code to the executors, and then sends them tasks to run, as the sketch below illustrates.
|
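As an illustration, here is a minimal driver program using Spark's Python API (PySpark). The application name, master URL, and input path are assumptions made for the sake of the example.

    from pyspark import SparkConf, SparkContext

    # The driver program: it builds the computation and coordinates execution.
    conf = SparkConf().setAppName("WordCount").setMaster("local[*]")  # assumed local master
    sc = SparkContext(conf=conf)

    counts = (
        sc.textFile("input.txt")                 # assumed input path; read as a distributed data set
          .flatMap(lambda line: line.split())    # map each line into words
          .map(lambda word: (word, 1))           # emit (key, value) pairs
          .reduceByKey(lambda a, b: a + b)       # reduce over all pairs with the same key
    )

    # collect() triggers the tasks that the driver sends to the executors.
    print(counts.collect())
    sc.stop()

Note that the driver only describes the computation; the executors perform the map and reduce work on their partitions of the data.
|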