Tag Archives: Spark-core

Understand the shuffle component in spark-core

This blog has moved to a new address: http://www.trongkhoanguyen.com. Shuffle is one of the most expensive operations and directly affects job performance. Even though Spark avoids shuffling wherever it can, some operations require a shuffle … Continue reading

Posted in Spark | 1 Comment

Understand the scheduler component in spark-core

1. Introduction: Today, let’s look at what really happens behind the scenes after we submit a Spark job to the cluster. I promise there will be many interesting things … Continue reading

Posted in Architecture, Spark | 4 Comments

Apache Spark 1.3 architecture – module spark-core

After spending significant time reading the source code of the spark-core project, I can sketch the architecture, showing the relationships and the flow (messages passed) between the important components in this … Continue reading

Posted in Architecture, Spark | 2 Comments

[Source code analysis] Narrow dependency and wide dependency implementation in Spark

Files: Dependency.scala. Having covered the different types of RDD dependencies in the previous post, today I’m going to dive deeper into their implementation. As you can see from the class diagram, dependency is divided … Continue reading

Posted in Source code analysis, Spark | 1 Comment

Understand the storage module in spark-core

The storage module in Spark provides the data access service for applications, including: – Reading and storing data from various sources: HDFS, local disk, RAM, or even fetching blocks of data from … Continue reading

Posted in Spark | 2 Comments

Understand RDD operations: transformations and actions

As we already know, each RDD has two sets of parallel operations: transformations and actions. Today, let’s look at some of them. 1. Transformation. Transformation: map(f: T => U); Result: returns a MappedRDD[U] by … Continue reading

Posted in Spark | 4 Comments