Introduction
|
Spark is a general-purpose cluster-computing framework.
It supports the MapReduce programming model but provides additional functionality, such as keeping data sets in memory between operations.
This makes it especially useful for iterative workloads such as machine learning and optimization.
|
MapReduce Primer
|
MapReduce is a software framework for processing large data sets in a distributed fashion.
A data set is mapped into a collection of (key, value) pairs.
The (key, value) pairs can then be manipulated (e.g. by sorting them).
The result is a reduction over all pairs that share the same key, as sketched below.
|
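To make the three phases concrete, here is a minimal single-machine sketch of the model in plain Python. The word-count task, the phase functions, and the input documents are illustrative assumptions, not part of any particular framework.

    from collections import defaultdict

    def map_phase(documents):
        # Map: turn each document into (key, value) pairs, here (word, 1).
        for doc in documents:
            for word in doc.split():
                yield (word, 1)

    def shuffle_phase(pairs):
        # Manipulate: group the values of all pairs that share the same key.
        groups = defaultdict(list)
        for key, value in pairs:
            groups[key].append(value)
        return groups

    def reduce_phase(groups):
        # Reduce: combine the values for each key into a single result.
        return {key: sum(values) for key, values in groups.items()}

    docs = ["spark supports mapreduce", "mapreduce processes pairs"]
    print(reduce_phase(shuffle_phase(map_phase(docs))))
    # {'spark': 1, 'supports': 1, 'mapreduce': 2, 'processes': 1, 'pairs': 1}

A real MapReduce framework distributes these phases across many machines; the structure of the computation is the same.
|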
Introduction to Spark
|
Spark defines an API for distributed computing based on resilient distributed datasets (RDDs), collections partitioned across the cluster.
A driver program coordinates the overall computation.
An executor is a process that runs computations and stores data.
The driver sends application code to the executors, and then sends them tasks to run, as the sketch below illustrates.
|
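As an illustration, here is a minimal driver program using Spark's Python API (PySpark). The application name, master URL, and input path are assumptions made for the sake of the example.

    from pyspark import SparkConf, SparkContext

    # The driver program: it builds the computation and coordinates execution.
    conf = SparkConf().setAppName("WordCount").setMaster("local[*]")  # assumed local master
    sc = SparkContext(conf=conf)

    counts = (
        sc.textFile("input.txt")                 # assumed input path; read as a distributed data set
          .flatMap(lambda line: line.split())    # map each line into words
          .map(lambda word: (word, 1))           # emit (key, value) pairs
          .reduceByKey(lambda a, b: a + b)       # reduce over all pairs with the same key
    )

    # collect() triggers the tasks that the driver sends to the executors.
    print(counts.collect())
    sc.stop()

Note that the driver only describes the computation; the executors perform the map and reduce work on their partitions of the data.
|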