Simply Install: Spark (Cluster Mode)

Sriram Baskaran · Published in Insight · Jun 3, 2019 · 7 min read

Simply Install is a series of blogs covering installation instructions for simple tools related to data engineering. This post covers the basic steps to install and configure Apache Spark (a popular distributed computing framework) as a cluster. To keep the instructions consistent, everything was tested on Ubuntu 18.04 running on AWS EC2 instances. AWS also has a managed service called EMR that allows easy deployment of Spark.

Key takeaways

  • Basic cluster setup and installation of Spark
  • How to configure communication between nodes
  • How to manually add new workers (EC2 instances) into the cluster.

A few basics

Before we jump into installing Spark, let us define the terminology that we will use in this post. We will not cover advanced concepts such as tuning Spark to suit the needs of a given job. If you already know these basics, you can skip ahead to the next section.

Apache Spark is a distributed computing framework that builds on the Map-Reduce paradigm to process data in parallel across many machines. As a cluster, Spark follows a centralized architecture.

Centralized systems are systems that use a client/server architecture where one or more client nodes are directly connected to a central server. More info here.

Our setup will use one master node (an EC2 instance) and three worker nodes. We will use the master to run the Driver Program and deploy Spark in standalone mode using the default cluster manager.

Master: A master node is an EC2 instance. It handles resource allocation for the multiple jobs running on the Spark cluster. The master in Spark serves the following purposes:

  • Identifies the resources (CPU time, memory) needed to run a job when it is submitted and requests them from the cluster manager.
  • Provides those resources (CPU time, memory) to the Driver Program that initiated the job, in the form of executors.
  • Facilitates communication between the different workers.

Driver Program: The driver program does most of the grunt work for ONE job. It breaks the job or program (generally written in Python, Scala or Java) into a Directed Acyclic Graph (DAG), downloads dependencies, assigns tasks to executors, and coordinates communication between the workers so they can share data during shuffles. Each job has a separate Driver Program with its own SparkContext and SparkSession.

A DAG is a mathematical construct commonly used in computer science to represent the flow of control in a program and its dependencies. A DAG consists of vertices and edges. In Spark, each task is a vertex of the graph, and edges are added based on the dependencies between tasks. If there are cyclic dependencies, the DAG creation fails.

Workers: Each worker is an EC2 instance. They run the Executors. A worker can run executors of multiple jobs.

Executors: Executors are assigned the tasks of a particular job and run on a worker to complete them. Each job has its own set of executors. Executors run inside a Java virtual machine (JVM) on the worker and belong to exactly one job.

Now that we have a quick understanding of what Spark is, let's put it into practice. A common image that you will see when people explain Spark is given below. However, it doesn't fully capture the physical components.

Here is a more detailed picture of what our setup will look like at the EC2 level and how you will interact with Spark and run your jobs on it.

Step 0: Pre-requisite setup

Each node in our cluster (one master + three workers) should have Java installed. We will use apt to install OpenJDK 8. First, update the package index.

$ sudo apt update
$ sudo apt install openjdk-8-jre-headless
....
....
Processing triggers for libc-bin (2.27-3ubuntu1) ...
Processing triggers for ureadahead (0.100.0-20) ...
Processing triggers for systemd (237-3ubuntu10.12) ...
Processing triggers for ca-certificates (20180409) ...
Updating certificates in /etc/ssl/certs...
0 added, 0 removed; done.
Running hooks in /etc/ca-certificates/update.d...
done.
done.
$ java -version

We also need to install Scala on all of our machines.

$ sudo apt install scala
....
....
Setting up scala-parser-combinators (1.0.3-3) ...
Setting up libhawtjni-runtime-java (1.15-2) ...
Setting up scala-library (2.11.12-4~18.04) ...
Setting up scala-xml (1.0.3-3) ...
Setting up libjansi-native-java (1.7-1) ...
Setting up libjansi-java (1.16-1) ...
Setting up libjline2-java (2.14.6-1) ...
Setting up scala (2.11.12-4~18.04) ...
update-alternatives: using /usr/share/scala-2.11/bin/scala to provide /usr/bin/scala (scala) in auto mode
$ scala -version
Scala code runner version 2.11.12 -- Copyright 2002-2017, LAMP/EPFL

Please make note of the Scala version here.

Step 0.5: Setting up Keyless SSH

In order to allow communication within the Spark cluster, we need to set up keyless SSH login from the master to the workers. This allows the master to coordinate the executors running on the different workers. If your master is not able to communicate with your workers, check your security groups and your keyless SSH setup to resolve the problem.

Install openssh-server and openssh-client on the master only.

$ sudo apt install openssh-server openssh-client

Create an RSA key pair using the following command. We will name the files id_rsa and id_rsa.pub.

$ cd ~/.ssh
~/.ssh: $ ssh-keygen -t rsa -P ""
Generating public/private rsa key pair.
Enter file in which to save the key: id_rsa
Your identification has been saved in id_rsa.
Your public key has been saved in id_rsa.pub.

We will have to manually copy the contents of the id_rsa.pub file into the ~/.ssh/authorized_keys file on each worker. The entire contents should be on one line. It starts with ssh-rsa and ends with ubuntu@some_ip, as shown below.

$ cat ~/.ssh/id_rsa.pub
ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQDUqJ4Tres45eA12CmDXvI+kGiRWM31PT.... ubuntu@10.0.0.9
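
If you would rather do this from the master itself, the command below is a minimal sketch that appends the public key to a worker's authorized_keys file over SSH. It assumes you can still reach the worker with your original AWS key pair (aws-key.pem here is a placeholder) and uses a worker's private IP.

$ cat ~/.ssh/id_rsa.pub | ssh -i ~/aws-key.pem ubuntu@<worker-private-ip> \
    "mkdir -p ~/.ssh && cat >> ~/.ssh/authorized_keys && chmod 600 ~/.ssh/authorized_keys"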

Once copied to all the workers, the master can SSH into each worker without a password (this is required for Spark to function).

You can test whether this works by trying to SSH from the master into a worker using the following command:

$ ssh -i ~/.ssh/id_rsa ubuntu@10.0.0.13

Step 1: Installing Spark

On each machine (the master and all workers), install Spark using the following commands. You can pick a different version by visiting the archive here.

$ wget https://archive.apache.org/dist/spark/spark-2.4.3/spark-2.4.3-bin-hadoop2.7.tgz

Extract the files, move them to /usr/local/spark, and add spark/bin to the PATH variable.

$ tar xvf spark-2.4.3-bin-hadoop2.7.tgz
$ sudo mv spark-2.4.3-bin-hadoop2.7/ /usr/local/spark
$ vi ~/.profile
# add this line at the end of ~/.profile
export PATH=/usr/local/spark/bin:$PATH
$ source ~/.profile
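
A quick way to confirm the PATH change took effect is to print the Spark version using the spark-submit script, which ships with the distribution:

$ spark-submit --version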

Step 2: Configuring Master to keep track of its workers

Now that we have Spark installed on all our machines, we need to let the master instance know about the different workers that are available. We will be using the standalone cluster manager for demonstration purposes. You can also use something like YARN or Mesos to manage the cluster. Spark has detailed notes on the different cluster managers that you can use.

Definition: A cluster manager is an agent that allocates the resources requested by the master across the workers. The cluster manager then reports the allocated resources back to the master, which assigns them to a particular driver program.

We need to modify the /usr/local/spark/conf/spark-env.sh file to provide the Java location and the master node's private IP. These values are added to the environment when running Spark jobs.

# contents of conf/spark-env.sh
export SPARK_MASTER_HOST=<master-private-ip>
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
# For PySpark use
export PYSPARK_PYTHON=python3

Note: If spark-env.sh doesn't exist, copy spark-env.sh.template and rename it.
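
If you are not sure what JAVA_HOME should be on your instance, one way to find it is to resolve the java binary to its real location and strip the trailing bin/java (the path shown above matches a default OpenJDK 8 install on Ubuntu 18.04, but verify it on your own machines):

$ readlink -f $(which java)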

We also need to list the IPs of all the machines where a worker should be started. Open the /usr/local/spark/conf/slaves file and add the following.

# contents of conf/slaves
<worker-private-ip1>
<worker-private-ip2>
<worker-private-ip3>

Note: If you updated your /etc/hosts file with hostnames for each IP, you can use those here too. See how to set up hostnames in Ubuntu here.
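
For reference, such an /etc/hosts file might look like the sketch below; the hostnames are hypothetical and the private IPs are placeholders for your own:

# /etc/hosts on every node (hostnames and IPs are illustrative)
10.0.0.9    spark-master
10.0.0.13   spark-worker1
10.0.0.14   spark-worker2
10.0.0.15   spark-worker3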

Oversubscription: This is the practice of advertising more resources than are physically available, so the cluster manager is under the illusion of having more cores to hand out. It keeps the CPU working on tasks rather than sitting idle during clock cycles. You can enable it by setting SPARK_WORKER_CORES in spark-env.sh to a value higher than the actual number of cores.

Example: If each of our three workers has 2 cores, we have a total of 6 cores in the cluster. We can set SPARK_WORKER_CORES on each worker to 3 times its physical core count to allow for oversubscription, i.e.,

export SPARK_WORKER_CORES=6

Check out other resources and options to tune here.
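
As a sketch of what further tuning can look like, spark-env.sh also accepts the worker options below (the values shown are illustrative, not recommendations):

# conf/spark-env.sh -- other commonly tuned worker options
export SPARK_WORKER_MEMORY=4g      # total memory a worker can give out to executors
export SPARK_WORKER_INSTANCES=1    # number of worker processes per machine
export SPARK_WORKER_PORT=7078      # fix the worker port instead of picking a random one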

Test if it's working

Spark provides scripts that start all the master and worker processes and set up the master-worker configuration. They run as background processes, so you can exit the terminal.

# run this on the master node
$ sh /usr/local/spark/sbin/start-all.sh

When you want to stop the cluster, you can run sbin/stop-all.sh.
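
To verify that the cluster actually came up, check the running processes on each node: the master should show a spark.deploy.master.Master process and every worker a spark.deploy.worker.Worker process.

$ ps aux | grep spark.deploy

You can also open http://<master-public-ip>:8080 in a browser; the standalone master's web UI (port 8080 by default) lists all registered workers.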

Note: Before you start a cluster, spend some time thinking about the security of your Spark cluster and which ports you should open to access the different parts of the application.
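
With the cluster up, you can submit a job from the master with spark-submit, pointing --master at the standalone master on its default port 7077. The example below runs the SparkPi program bundled with the distribution; the exact jar name may differ if you downloaded a different Spark version.

$ spark-submit \
    --master spark://<master-private-ip>:7077 \
    --class org.apache.spark.examples.SparkPi \
    /usr/local/spark/examples/jars/spark-examples_2.11-2.4.3.jar 100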

Adding additional worker nodes into the cluster

It may seem intimidating, but Spark makes adding a new worker to the cluster (to handle larger workloads) one of the easiest operations there is. This is purely because of how Spark is set up internally and how the master communicates with the workers.

Note: Horizontal scaling can be detrimental, as it increases the amount of data shuffled across the network.

  • Install Java on the machine (Step 0).
  • Set up keyless SSH from the master into the machine by copying the public key onto it (Step 0.5).
  • Install Spark on the machine (Step 1).
  • Update the /usr/local/spark/conf/slaves file on the master to add the new worker.
  • Restart everything using sbin/start-all.sh, or attach the new worker on its own as shown below.
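
If you would rather not restart the whole cluster, a single worker can also be attached on its own by running the start-slave.sh script (renamed start-worker.sh in newer Spark releases) on the new machine and pointing it at the running master:

# run on the new worker node; 7077 is the standalone master's default port
$ sh /usr/local/spark/sbin/start-slave.sh spark://<master-private-ip>:7077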

This setup installs Spark on a cluster of Ubuntu machines with one master and three workers. To set Spark up, it is important to understand how Spark functions in a cluster, how it manages different jobs, and how it allocates resources to each of them.

Interested in transitioning to a career in Data Engineering or DevOps Engineering? Find out more about the Insight Fellows Program, and sign up for program updates.
