This blog has moved to a new address: http://www.trongkhoanguyen.com.
1. Spark deployment modes
Spark provides several modes for running an application:
– Local mode: runs the application locally in a single JVM. This is really useful for debugging: we can step through our code line by line with an IDE (I use IntelliJ).
– Standalone mode: we can easily deploy a standalone cluster with very few steps and configuration settings, and then play around with it.
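The choice between these modes comes down to the master setting passed to Spark. A minimal sketch (the host name is a placeholder):

```
# conf/spark-defaults.conf (or pass the same value with --master to spark-submit)
spark.master    local[*]                    # local mode: one JVM, all available cores
# spark.master  spark://master-host:7077    # standalone mode: connect to a standalone cluster
```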
In general, the process of running an application on Spark is as follows:
– First, we write a Spark application that creates a SparkContext to initialize the DriverProgram, which then registers the application with the ClusterManager. The DriverProgram, with its pre-configured parameters, next asks the ClusterManager for resources: a list of Executors. Note that different Applications run in different Executors (in other words, Tasks from different Applications run in different JVMs).
– The ClusterManager could be: StandaloneCluster, Mesos or YARN.
– A WorkerNode is a node machine that hosts one or more Workers.
– A Worker is launched within a single JVM (one process) and has access to a configured amount of the node machine's resources, e.g. the number of cores and the amount of RAM per Worker. A Worker can spawn and own multiple Executors.
– An Executor is also launched within a single JVM (one process), is created by its Worker, and is in charge of running multiple Tasks in a ThreadPool.
– Using the SparkContext (sc), our application creates some RDDs, applies transformations to them, and then triggers an action. At that point, the SparkContext immediately submits a Job to the DAGScheduler in the Driver. The DAGScheduler applies some optimizations here; one notable step is splitting the DAG (Directed Acyclic Graph) into Stages (sets of Tasks) by analyzing the dependency types between RDDs (see the previous post here – pipelining narrow dependencies). The DAGScheduler thus decides what to run: the Stages, each containing one Task per partition.
– Tasks are then scheduled onto the acquired Executors by the driver-side TaskScheduler, which respects resource and locality constraints (it decides where each Task runs).
– For each Task, the lineage DAG of the corresponding Stage is serialized together with the closures (functions) of its transformations, then sent to and executed on the scheduled Executors.
– The Executors run the Tasks in multiple threads and finally send the results back to the Driver for aggregation.
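The stage-splitting step above can be sketched in plain Python. This is a toy model, not Spark's actual DAGScheduler, and the class and function names here are made up for illustration: the idea is that a Stage is a maximal group of RDDs connected only by narrow dependencies, and every wide (shuffle) dependency starts a new Stage.

```python
# Toy model of DAG-to-Stage splitting (not Spark's real implementation).
class RDD:
    def __init__(self, name, parents=()):
        self.name = name
        self.parents = parents  # list of (parent RDD, "narrow" or "wide")

def split_into_stages(final_rdd):
    """Walk back from the final RDD; cut a new stage at every wide dependency."""
    stages = []
    boundaries = [final_rdd]     # each boundary RDD is the last RDD of some stage
    seen = set()
    while boundaries:
        root = boundaries.pop()
        if root.name in seen:
            continue
        stage, todo = set(), [root]
        while todo:              # pipeline narrow dependencies into one stage
            rdd = todo.pop()
            if rdd.name in stage:
                continue
            stage.add(rdd.name)
            seen.add(rdd.name)
            for parent, dep in rdd.parents:
                if dep == "narrow":
                    todo.append(parent)
                else:            # wide dependency: the parent ends another stage
                    boundaries.append(parent)
        stages.append(stage)
    return stages

# Example lineage: textFile -> map --(wide)--> reduceByKey -> filter
a = RDD("textFile")
b = RDD("map", [(a, "narrow")])
c = RDD("reduceByKey", [(b, "wide")])
d = RDD("filter", [(c, "narrow")])

stages = split_into_stages(d)
print(stages)  # two stages: {filter, reduceByKey} and then {map, textFile}
```

In real Spark this happens on the Driver; for each resulting Stage, one Task per partition is then created and handed to the TaskScheduler.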
2. Notable characteristics
2.1. Compared with Hadoop MapReduce
Hadoop MapReduce runs each Task in its own process, and when a Task completes, that process goes away (unless JVM reuse is enabled). Tasks can also run concurrently in threads, but only by using advanced techniques such as the MultithreadedMapper class.
In Spark, by contrast, many Tasks will by default run concurrently in multiple threads within a single process (an Executor). This process lives as long as the Application does, even when no Jobs are running.
In fact, we can summarize the difference in execution model between them:
– Each MapReduce Executor is short-lived and expected to run only one large task.
– Each Spark Executor is long-running and expected to run many small tasks.
– Thus, be careful when porting source code from MapReduce to Spark, because Spark requires thread-safe execution.
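The thread-safety caveat can be demonstrated without Spark at all. In the sketch below (plain Python; the stdlib thread pool is only a stand-in for a Spark Executor), tasks that would each own a private process in MapReduce instead share one process as threads, so shared mutable state must be synchronized:

```python
import threading
from concurrent.futures import ThreadPoolExecutor

# Stand-in for one Executor: 4 task slots in a single process.
# In MapReduce, each task gets its own process, so module-level state like
# `counter` would be private to it; here the four tasks share it.
counter = 0
lock = threading.Lock()

def task(n):
    global counter
    for _ in range(n):
        with lock:          # required once tasks are threads in one process
            counter += 1

with ThreadPoolExecutor(max_workers=4) as pool:
    for _ in range(4):
        pool.submit(task, 10_000)
# the pool's context manager waits for all submitted tasks to finish

print(counter)  # 40000: correct only because the updates are synchronized
```

Without the lock, the unsynchronized increments can interleave and lose updates, which is exactly the class of bug that per-process MapReduce code never had to worry about.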
2.2. So, what are the benefits of the implementation in Spark?
– It isolates applications from one another: each runs in its own JVMs, and task scheduling is isolated too, since each Driver schedules its own Tasks.
– Launching a new process costs much more than launching a new thread: creating a thread is roughly 10–100 times faster than creating a process. Additionally, running many small Tasks rather than a few big Tasks reduces the impact of data skew.
– It is well suited to interactive applications, which need quick computations.
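The thread-versus-process startup gap is easy to spot-check with stdlib Python. The numbers vary by OS and Python build, and launching a fresh interpreter is only a rough analogue of launching a new Executor JVM, so treat this as an illustration, not a benchmark:

```python
import subprocess
import sys
import threading
import time

def noop():
    pass

def time_threads(n):
    """Create, start and join n short-lived threads."""
    start = time.perf_counter()
    for _ in range(n):
        t = threading.Thread(target=noop)
        t.start()
        t.join()
    return time.perf_counter() - start

def time_processes(n):
    """Launch and wait for n short-lived Python processes."""
    start = time.perf_counter()
    for _ in range(n):
        subprocess.run([sys.executable, "-c", "pass"], check=True)
    return time.perf_counter() - start

n = 10
t_thread = time_threads(n)
t_proc = time_processes(n)
print(f"{n} threads: {t_thread:.4f}s, {n} processes: {t_proc:.4f}s")
```

On a typical machine the thread loop finishes orders of magnitude sooner, which is why reusing a long-lived Executor process for many small thread-based Tasks is cheap.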
Besides these gains, however, there are also adverse impacts:
– Applications cannot share data (mostly RDDs held by a SparkContext) without writing it to external storage.
– Resource allocation can be inefficient: for its entire lifetime, an Application holds the full, fixed amount of resources that it requested from the ClusterManager and was granted, even while it runs no Tasks.
– The dynamic resource allocation feature, introduced in Spark 1.2, addresses this issue: if a subset of the resources allocated to an application becomes idle, it can be returned to the cluster's pool of resources and acquired by other applications.
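A sketch of the relevant settings (the property names are real Spark configuration keys, but the values are only illustrative; note that dynamic allocation also requires the external shuffle service, so that shuffle files written by removed executors remain reachable):

```
# conf/spark-defaults.conf — dynamic resource allocation (Spark 1.2+)
spark.dynamicAllocation.enabled              true
spark.shuffle.service.enabled                true    # external shuffle service, required
spark.dynamicAllocation.minExecutors         1
spark.dynamicAllocation.maxExecutors         10
spark.dynamicAllocation.executorIdleTimeout  60s     # release executors idle this long
```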
3. Time to do some configuration
3.1. Configure the Workers
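For a standalone cluster, the per-machine Worker resources discussed above are typically set in conf/spark-env.sh on each WorkerNode. A minimal sketch with example values:

```
# conf/spark-env.sh on each WorkerNode
SPARK_WORKER_CORES=4        # cores each Worker may hand out to its Executors
SPARK_WORKER_MEMORY=8g      # total memory each Worker may hand out
SPARK_WORKER_INSTANCES=1    # number of Worker processes per machine
```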
3.2. Configure the Application
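On the application side, the resources an Application requests from the ClusterManager are usually set through SparkConf in code or passed to spark-submit; again, the values below are only examples:

```
# passed via --conf to spark-submit, or set on the SparkConf in code
spark.executor.memory   2g    # heap size of each Executor JVM
spark.cores.max         8     # total cores the app may claim (standalone mode)
```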