Download Apache Spark from the project's downloads page at https://spark.apache.org/downloads.html. Make sure you have at least version 2.1.1.
On Linux or macOS, uncompress and untar the archive, then move it to a known location such as /home/<user>/spark or /Users/<user>/spark. We’ll refer to this location as _path_to_spark_ below.
Set your path with a command like the following:
export PATH=$PATH:_path_to_spark_/bin
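This change lasts only for the current shell session; to make it permanent, add the same line to your shell startup file (for example, ~/.bash_profile or ~/.bashrc, depending on your shell).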
On Windows, uncompress and untar the archive (you may need WinZip or another utility for this), then move it to a known location such as C:\Users\<user>\spark. Again, we’ll refer to this location as _path_to_spark_ below.
Download the winutils.exe program and place it in C:\winutils\bin. (Spark depends on Hadoop libraries that need this helper to run on Windows.)
Set the following variables:
set SPARK_PATH=_path_to_spark_
set PATH=%PATH%;%SPARK_PATH%\bin;C:\winutils\bin
set HADOOP_HOME=C:\winutils
set _JAVA_OPTIONS="-Xmx512M"
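Note that set only affects the current Command Prompt session; to persist these variables, set them in the System Properties environment-variables dialog (or with the setx command) instead.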
Check that Spark is installed correctly by running the command:
run-example SparkPi 10
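This runs one of Spark's bundled example programs, which estimates the value of π. Amid the log output you should see a line that begins with something like:
Pi is roughly 3.14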
To run Spark interactively in a Python interpreter, use pyspark:
pyspark --master local[2]
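Here local[2] is a Spark master URL telling Spark to run locally with two worker threads; you can raise that number to match your machine's cores, or use local[*] to use all of them.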
PySpark will automatically create a SparkContext for you to work with using the local Spark configuration. You can check that Spark is loaded using the following command:
print(sc)
This should display output like:
<pyspark.context.SparkContext object at 0x10b47fbd0>
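To confirm that the context can actually run a job, you can try a small computation using the sc object that pyspark created for you (a minimal sanity check, not part of the official install steps):
sc.parallelize(range(1000)).sum()
This distributes the numbers 0 through 999 across the local worker threads and adds them up, returning 499500.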
Once you are running pyspark, you can open the Spark UI by pointing your browser at http://localhost:4040/.
If you are seeing lots of INFO and WARNING messages, you can reduce Spark's verbose output as follows. Copy _path_to_spark_/conf/log4j.properties.template to _path_to_spark_/conf/log4j.properties (_path_to_spark_\conf on Windows), then edit the file and change the line:
log4j.rootCategory=INFO, console
to
log4j.rootCategory=ERROR, console
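Alternatively, if you only want to quiet a single interactive session, you can change the level from within PySpark itself using the SparkContext's setLogLevel method:
sc.setLogLevel("ERROR")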