BigData with PySpark: Glossary

Key Points

Introduction
  • Spark is a general-purpose cluster-computing framework.

  • Supports the MapReduce model but provides a much richer set of distributed operations.

  • Particularly well suited to iterative workloads such as machine learning and optimization.

MapReduce Primer
  • MapReduce is a software framework for processing large data sets in a distributed fashion.

  • A data set is mapped into a collection of (key, value) pairs.

  • The (key, value) pairs can be manipulated between the map and reduce steps (e.g. sorted by key).

  • The result is a reduction over all pairs with the same key (a minimal PySpark sketch follows this list).
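
The classic word count exercises all three steps. The sketch below assumes a local PySpark installation; the application name and input lines are made-up illustrations, not part of these notes.

    # Word count: the canonical MapReduce example, written with PySpark.
    from pyspark import SparkContext

    sc = SparkContext("local[*]", "WordCountSketch")  # hypothetical app name

    lines = sc.parallelize(["big data with spark", "spark supports mapreduce"])

    counts = (
        lines.flatMap(lambda line: line.split())   # map each line to words
             .map(lambda word: (word, 1))          # emit (key, value) pairs
             .reduceByKey(lambda a, b: a + b)      # reduce over pairs with the same key
    )

    print(counts.collect())  # e.g. [('spark', 2), ('big', 1), ...]
    sc.stop()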

Introduction to Spark
  • Spark defines an API for distributed computing built on resilient distributed datasets (RDDs).

  • A driver program coordinates the overall computation.

  • An executor is a process that runs computations and stores data.

  • Application code is sent to the executors.

  • Tasks are sent to the executors to run (see the sketch after this list).
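
As a minimal sketch of this division of labour, assuming a local master (where the executors are simulated by threads inside a single process) and a hypothetical application name:

    # The driver program is this script; Spark ships the lambda below to executors.
    from pyspark import SparkContext

    sc = SparkContext("local[2]", "DriverExecutorSketch")  # hypothetical app name

    rdd = sc.parallelize(range(100), numSlices=4)  # 4 partitions -> 4 tasks

    # Each task applies the mapped function to one partition on an executor;
    # the driver coordinates the tasks and collects the final result.
    total = rdd.map(lambda x: x * x).sum()

    print(total)  # 328350
    sc.stop()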

Glossary

  • Driver program: the process that coordinates the overall computation of a Spark application.

  • Executor: a process that runs computations and stores data for an application.

  • Task: a unit of work sent to an executor to run.

  • MapReduce: a framework for processing large data sets in a distributed fashion by mapping them to (key, value) pairs and reducing over all pairs with the same key.

  • RDD (resilient distributed dataset): Spark's core abstraction, a collection of elements partitioned across the cluster that can be operated on in parallel.