Running Spark on Local Machine

Apache Spark is a fast and general-purpose cluster computing system. To get the most out of it, Spark should run on a distributed computing system. However, one might not always have access to a distributed system. Especially for learning purposes, one might want to run Spark on one's own computer. This is actually a very easy task, and there are a handful of ways to do it. I will show what I have done to run Spark on my laptop.

The first step is to download Spark from this link (in my case I put it in the home directory). Then unzip it from the command line, or by right-clicking on the downloaded archive. The following figure shows my unzipped folder, from which I will run Spark.

Running Spark from the command line

Now, we can easily run Spark from the command line. We need the location of the unzipped folder. In my case, the location (which can be copied from the address bar using CTRL + L) is /home/shanto/spark-2.2.1-bin-hadoop2.7. Next I need to set a couple of environment variables (SPARK_HOME and PYSPARK_PYTHON). After doing this, I can run Spark by typing ${SPARK_HOME}/bin/pyspark. The lines I used in my terminal are given below:

shanto@shanto:~$ export SPARK_HOME=/home/shanto/spark-2.2.1-bin-hadoop2.7
shanto@shanto:~$ export PYSPARK_PYTHON=python3
shanto@shanto:~$ ${SPARK_HOME}/bin/pyspark

Now, running Spark code is easy as well. I just need to write the following line in the command line (sparkcode.py is the file where I have written a few lines of Spark code):

${SPARK_HOME}/bin/spark-submit sparkcode.py
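
The contents of my sparkcode.py are not shown in this post, but a minimal script along these lines would work with spark-submit (the file name, app name, and computation here are just illustrative, not the actual code I ran):

# sparkcode.py -- a minimal, illustrative Spark script
from pyspark.sql import SparkSession

# create (or reuse) a SparkSession, the entry point for Spark 2.x
spark = SparkSession.builder.appName("SparkCodeExample").getOrCreate()

# count the even numbers in 0..99 using an RDD
rdd = spark.sparkContext.parallelize(range(100))
print("Even numbers:", rdd.filter(lambda x: x % 2 == 0).count())

spark.stop()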

Okay. So everything works fine so far. However, if we close the terminal and write ${SPARK_HOME}/bin/pyspark in a new terminal, it will not work, because the variables we set are gone. So do I need to set the variables every time I open a new terminal? That would be tedious.

To make it permanent you need to edit the .bashrc file in your home directory. Let’s check if the file is there.

shanto@shanto:/$ cd ~
shanto@shanto:~$ ls -a

The file is a hidden file in your home directory, as shown in the figure below.

Now, we need to open .bashrc in any editor (I am opening it in nano) and add the lines as shown in the following piece of code.

shanto@shanto:~$ sudo nano .bashrc
password for shanto: 

Add these lines at the end of the file, then save and exit (CTRL + X, then Y to confirm):

export SPARK_HOME=/home/shanto/spark-2.2.1-bin-hadoop2.7
export PYSPARK_PYTHON=python3
export PATH=$SPARK_HOME/bin:$PATH

We are done! After opening a new terminal (or running source ~/.bashrc so the changes take effect), we can just type pyspark at the command line to start Spark, as shown in the following figure.

While the job is running, we can access the web frontend at http://localhost:4040/.
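
For example, once the pyspark prompt appears, a quick sanity check might look like the line below. In the PySpark shell, sc (the SparkContext) is already created for you; the snippet itself is just an assumed example, not from my original session:

# sum the integers 1..100 across the cluster (here, local threads); should return 5050
sc.parallelize(range(1, 101)).reduce(lambda a, b: a + b)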

Running Spark in a Jupyter notebook

The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text [1]. It is also very popular among data scientists. I use Jupyter a lot, especially for small projects and for work where I need to add explanations and graphs for visualization. As I also use Spark for different projects, I need to run Spark from my Jupyter notebook. There are a few ways of doing that. The simplest way is to install the package findspark.

I already have Jupyter installed on my laptop. Now I just need to install findspark using the following command in the command line. When called without arguments, findspark uses the SPARK_HOME environment variable, so the previous step where we set the value of SPARK_HOME is a prerequisite.

shanto@shanto:~$ sudo pip3 install findspark
[sudo] password for shanto: 
The directory '/home/shanto/.cache/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/home/shanto/.cache/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting findspark
  Downloading findspark-1.1.0-py2.py3-none-any.whl
Installing collected packages: findspark
Successfully installed findspark-1.1.0

Now we can start the Jupyter notebook by typing jupyter notebook on the command line.

shanto@shanto:~$ jupyter notebook

It will open the Jupyter notebook in a browser. We can then use the following code to initialize findspark and import pyspark, after which we can run Spark and do anything we want.

import findspark
findspark.init()

import pyspark

This is what I am running as test code, to check that Spark really works.
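
Since the test code appears only in the figure, here is a sketch of an equivalent check (the DataFrame contents and app name are made up for illustration):

# assumes findspark.init() has already been run in the cells above
from pyspark.sql import SparkSession

# start a local SparkSession inside the notebook
spark = SparkSession.builder.master("local[*]").appName("JupyterTest").getOrCreate()

# build a tiny DataFrame and show it to confirm Spark is working
df = spark.createDataFrame([(1, "spark"), (2, "jupyter")], ["id", "word"])
df.show()

spark.stop()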

So, we are all set. Let's have fun with Spark.
