[Source code analysis] Narrow dependency and wide dependency implementation in Spark


Files: Dependency.scala
As mentioned in the previous post about the different types of RDD dependencies, today I’m going to dive deeper into their implementation.

Types of dependencies

As you can see from the class diagram, dependencies are divided into two types: narrow and shuffle (wide).

– For NarrowDependency, we have 3 concrete classes: OneToOneDependency, RangeDependency and PruneDependency:
+ OneToOneDependency: one partition of the parent RDD is used by exactly one partition of the child RDD. So the method getParents(partitionId: Int), which returns the parent partitions for a given child partition, simply returns that same partitionId.
+ RangeDependency: used for the union operation, where a range of child partitions depends on a corresponding range of parent partitions.
+ PruneDependency: used by PartitionPruningRDD; the child RDD contains only a subset of the parent’s partitions.
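The three subclasses can be sketched without the surrounding RDD machinery. The class and method names below mirror Dependency.scala, but the simplified constructors (taking partition counts and indices instead of RDDs) are my own, for illustration only:

```scala
// Simplified sketch of the NarrowDependency hierarchy in Dependency.scala.
// getParents(partitionId) answers: which parent partitions does child
// partition `partitionId` read from?
abstract class NarrowDep {
  def getParents(partitionId: Int): Seq[Int]
}

// One-to-one: child partition i reads exactly parent partition i.
class OneToOneDep extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] = List(partitionId)
}

// Range (used by union): child partitions [outStart, outStart + length)
// map back to parent partitions [inStart, inStart + length).
class RangeDep(inStart: Int, outStart: Int, length: Int) extends NarrowDep {
  def getParents(partitionId: Int): Seq[Int] =
    if (partitionId >= outStart && partitionId < outStart + length)
      List(partitionId - outStart + inStart)
    else Nil
}

// Prune: the child keeps only the parent partitions that pass a filter;
// child partition i maps to the i-th surviving parent partition.
class PruneDep(numParentPartitions: Int, keep: Int => Boolean) extends NarrowDep {
  private val kept = (0 until numParentPartitions).filter(keep)
  def getParents(partitionId: Int): Seq[Int] = List(kept(partitionId))
}
```

For example, a union of a 2-partition RDD with a 3-partition RDD would carry two range dependencies, RangeDep(0, 0, 2) and RangeDep(0, 2, 3), so child partition 3 maps back to partition 1 of the second parent.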

Note that NarrowDependency declares the method getParents(partitionId: Int) so that the DAGScheduler can put it to good use when determining the preferred location to compute a task: if getParents, called along a chain of RDDs, keeps returning the same partition, then the transformations on those RDDs can be computed locally in a single task.
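To illustrate the idea (the helper below is hypothetical, not Spark code), we can compose the getParents functions of consecutive narrow dependencies, child to parent; when every stage maps partition i back to i alone, the whole chain collapses into one local task:

```scala
// Hypothetical helper (not part of Spark): walk a chain of narrow
// dependencies child-to-parent and report which ancestor partitions
// child partition `partitionId` ultimately reads.
def ancestors(chain: Seq[Int => Seq[Int]], partitionId: Int): Seq[Int] =
  chain.foldLeft(Seq(partitionId)) { (parts, getParents) =>
    parts.flatMap(getParents).distinct
  }

// Two one-to-one stages (e.g. a map followed by a filter): partition 4
// only ever depends on partition 4, so both stages can be pipelined
// into a single task on the node holding that partition.
val oneToOne: Int => Seq[Int] = i => List(i)
```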

In Spark, here is the list of transformations that create narrow dependencies:
* filter, map, mapValues, flatMap, flatMapValues.
* glom, pipe, zipWithIndex, cartesian, union.
* mapPartitionsWithInputSplit, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext.
* sample, randomSplit.

ShuffleDependency: only pair RDDs (RDD[(K, V)]) have this kind of dependency, and it involves a shuffle step. Therefore it needs a partitioner (to decide which partition each intermediate value goes to), a serializer (to serialize objects so they can be sent over the network), a shuffle handle to manage the shuffle step, …
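The partitioner’s role can be shown with a minimal stand-in for Spark’s HashPartitioner (simplified here; the real one also special-cases arrays and delegates to a utility for the non-negative modulo):

```scala
// Simplified stand-in for Spark's HashPartitioner: route each key to a
// reduce-side partition by its hash code, kept non-negative so the
// result is always a valid index in [0, numPartitions).
class SimpleHashPartitioner(numPartitions: Int) {
  def getPartition(key: Any): Int = {
    val mod = key.hashCode % numPartitions
    if (mod < 0) mod + numPartitions else mod
  }
}

// During the shuffle write, every (key, value) record in a map task is
// bucketed by getPartition(key), so all values for a key land in the
// same reduce partition.
val part = new SimpleHashPartitioner(4)
val buckets = Seq(("a", 1), ("b", 2), ("a", 3))
  .groupBy { case (k, _) => part.getPartition(k) }
```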

Here is the list of transformations that create wide dependencies:
* sortByKey, combineByKey, partitionBy.
* repartitionAndSortWithinPartitions.

Special cases:
* coalesce: may or may not perform a shuffle, depending on the given parameters.
* subtractByKey, cogroup: depending on how their inputs are partitioned, the dependency can be narrow or wide.
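For the shuffle-free path of coalesce, the idea can be sketched as grouping contiguous parent partitions into the fewer child partitions. The function below is a hypothetical simplification of CoalescedRDD’s partition grouping (the real implementation also tries to respect data locality):

```scala
// Hypothetical sketch: shrink `from` parent partitions into `to` child
// partitions by slicing the parent index range evenly. Each child
// depends on a fixed, contiguous set of parents -- a narrow dependency,
// no shuffle. With shuffle = true, coalesce instead routes records
// through a ShuffleDependency.
def coalesceGroups(from: Int, to: Int): Seq[Seq[Int]] =
  (0 until to).map { i =>
    ((i * from) / to until ((i + 1) * from) / to).toSeq
  }
```

Each parent partition appears in exactly one group, which is why the dependency stays narrow and no data needs to cross the network.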
