Author Archives: trongkhoanguyenblog

Understand the shuffle component in spark-core

This blog has moved to a new address: http://www.trongkhoanguyen.com. Shuffle is one of the most expensive operations affecting the performance of a job. Even though Spark tries to avoid shuffling as much as it can, some operations require shuffle … Continue reading

Posted in Spark | 1 Comment

Understand the scheduler component in spark-core

1. Introduction. Today, let’s look at what really happens behind the scenes after we submit a Spark job to the cluster. I promise you that there will be many interesting things … Continue reading

Posted in Architecture, Spark | 4 Comments

Apache Spark modules and their dependencies

As you can see, the spark-core module is the foundation for all the others. It provides the implementation of the Spark computing engine: rdd, scheduler, deploy, executor, storage, shuffle, … Module spark-sql including spark-hive … Continue reading

Posted in Architecture, Spark | Leave a comment

Apache Spark 1.3 architecture – module spark-core

After spending significant time reading the source code of the spark-core project, I can sketch the architecture, showing the relationships and the flow (messages passed) between important components in this … Continue reading

Posted in Architecture, Spark | 2 Comments

Understand the Spark deployment modes

1. Spark deployment modes. Spark provides several modes for running an application: – Local mode: runs the application locally. This is really useful for debugging; we can step through our code line by line with … Continue reading

Posted in Spark | 1 Comment

[Source code analysis] Narrow dependency and wide dependency implementation in Spark

Files: Dependency.scala. Having covered the different types of RDD dependencies in a previous post, today I’m going to dive deeper into their implementation. As you can see from the class diagram, dependency is divided … Continue reading

Posted in Source code analysis, Spark | 1 Comment

Top reasons to shift to Spark

– Fast: in memory (100x faster) or on disk (2-10x faster). See the Daytona GraySort contest and official result – Usability: rich APIs (Scala, Java, Python), concise, interactive shell – Well designed, unified: Spark is a … Continue reading

Posted in Spark | Leave a comment

Understand the storage module in spark-core

The storage module in Spark provides the data access service for applications, including: – Reading and storing data from various sources: HDFS, local disk, RAM, or even fetching blocks of data from … Continue reading

Posted in Spark | 2 Comments

Scala quick reference

Remembering the syntax of many programming languages is somewhat painful for me. As I have not graduated yet, memorizing many programming languages for many courses at school is sometimes really hard. Thus, it’s better for me to … Continue reading

Posted in Scala | Leave a comment

Introduction to SparkSQL

With Spark and the core RDD API, we can do almost everything with datasets. Developers define the steps for retrieving the data by applying functional transformations on RDDs. They are also the ones … Continue reading

Posted in Spark | Leave a comment