This blog has been moved to new address: http://www.trongkhoanguyen.com.
As mentioned about different types of dependencies of RDDs in previous post, today I’m going to dive more about its implementation.
As you can see from the class diagram, dependency is divided into two types: narrow and shuffle (wide).
– For NarrowDependency, we have 3 concrete classes: OneToOneDependency, RangeDependency and PruneDependency:
+ OneToOneDependency: one partition of the parent RDD is used by one partition of the child RDD. So, method getParents(partitionID: Int) which gets the parent partitions for a child partition just returns that partitionID.
+ RangeDependency: is useful for Union operation where a set of child partitions depend on a set of parent partitions.
+ PruneDependency: used by PartitionPruningRDD: the child RDD contains a subset of partitions of the parents.
Note that NarrowDependency provides the method getParents(partitionID: Int) so that the DAGScheduler can put into good use in determining the preferred location to compute the task: if the getParents method that being called on RDDs returns the same value (same partition) then the transformations on RDDs should be computed locally.
In Spark, here are the list of transformations that are narrow dependencies:
* filter, map, mapValues, flatMap, flatMapValues.
* glom, pipe, zipWithIndex, cartesian, union.
* mapPartitionsWithInputSplit, mapPartitions, mapPartitionsWithIndex, mapPartitionsWithContext.
* sample, randomSplit.
– ShuffleDependency :only PairRDDs with <Key, Value> have this kind of dependency and they involve the shuffle step. Therefore, we need a partitioner (to decide where to put intermediate values), a serializer (to serialize object so that object can be sent through network), a shuffleHandler to handle shuffle step, …
Here are the list of transformations that are wide dependencies:
* sortByKey, combineByKey, partitionBy.
* coalesce: could perform shuffle or not depending on given parameters.
* subtractByKey, cogroup: depend on its partitions’ location, could be narrow or wide dependency.