Overview
Teaching: 10 min
Exercises: 0 min
Questions
What is Spark and what is it used for?
Objectives
Understand the principles behind Spark.
Understand the terminology used by Spark.
Any distributed computing framework needs to solve two problems: how to distribute data and how to distribute computation.
One such framework is Apache Hadoop. Hadoop uses the Hadoop Distributed File System (HDFS) to solve the distributed data problem and the MapReduce programming paradigm to distribute computation effectively.
Apache Spark is a general-purpose cluster computing framework that provides efficient in-memory computation for large data sets by distributing work across multiple computers. Spark can run on top of the Hadoop framework or standalone.
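As a concrete starting point, here is a minimal PySpark sketch (not part of the lesson's own code) that starts Spark in local mode; the application name and the `local[*]` master URL are illustrative choices.

```python
from pyspark.sql import SparkSession

# Start Spark in local mode, using all available CPU cores on this machine.
# On a cluster, the master URL would point at the cluster manager instead.
spark = SparkSession.builder \
    .master("local[*]") \
    .appName("spark-overview-example") \
    .getOrCreate()
```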
Spark offers a functional programming API in multiple languages that provides many more operators than map and reduce, built on a distributed data abstraction called resilient distributed datasets (RDDs).
An RDD is a programming abstraction representing a read-only collection of objects partitioned across the machines of a cluster. RDDs are fault tolerant and are accessed via parallel operations.
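A small illustrative sketch of these ideas in PySpark (assuming the `spark` session created above): an RDD is built from a local collection, split into partitions, and processed with parallel operations beyond plain map and reduce.

```python
# Distribute a local collection across 4 partitions.
rdd = spark.sparkContext.parallelize(range(1, 11), numSlices=4)

# filter and map run in parallel across partitions; reduce combines the results.
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
print(squares_of_evens.reduce(lambda a, b: a + b))  # 4 + 16 + 36 + 64 + 100 = 220
```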
Because RDDs can be cached in memory, Spark is extremely effective for iterative applications, where the same data is reused throughout the course of an algorithm. Most machine learning and optimization algorithms are iterative, which makes Spark an extremely effective tool for data science. Additionally, because Spark is so fast, it can be used interactively from a command-line prompt, much like the Python read-eval-print loop (REPL).
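To illustrate caching, here is a toy sketch (the iteration itself is invented purely for demonstration): cache() keeps the RDD in memory, so each pass of the loop re-reads cached data rather than recomputing it from scratch.

```python
# Cache a dataset that an iterative routine will scan repeatedly.
values = spark.sparkContext.parallelize(range(1, 1001)).cache()

threshold = 0
for _ in range(5):
    # Capture the current threshold as a default argument so each pass
    # ships the value it was computed with to the workers.
    above = values.filter(lambda x, t=threshold: x > t)
    threshold = int(above.mean())
```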
The Spark library itself contains many of the components that have found their way into most Big Data applications, including support for SQL-like querying of big data, machine learning and graph algorithms, and even live streaming data.
The core components of Apache Spark are:

Spark Core: the execution engine and the RDD API.
Spark SQL: structured data processing with SQL-like queries.
MLlib: scalable machine learning algorithms.
GraphX: graph processing and graph-parallel computation.
Spark Streaming: processing of live data streams.
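For example, Spark SQL lets you query distributed data with SQL. The sketch below uses made-up data and the `spark` session from earlier.

```python
# Build a small DataFrame and query it with SQL (illustrative data).
df = spark.createDataFrame(
    [("alice", 34), ("bob", 45), ("carol", 29)],
    ["name", "age"],
)
df.createOrReplaceTempView("people")
spark.sql("SELECT name FROM people WHERE age > 30").show()
```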
Because these components meet many Big Data requirements as well as the algorithmic and computational needs of many data science tasks, Spark has been growing rapidly in popularity. Spark also provides APIs in Scala, Java, and Python, meeting the needs of many different groups and allowing more data scientists to easily adopt it as their Big Data solution.
Key Points
Spark is a general-purpose cluster computing framework.
Supports MapReduce but provides additional functionality.
Very useful for machine learning and optimization.