Question: How can two applications share RDDs?

This blog has been moved to a new address: http://www.trongkhoanguyen.com.

Problem:

Application isolation in Spark's current architecture makes it impossible to share data (mostly RDDs) between different applications without writing it to external storage. Refer to my post on Spark deployment modes to understand how an application acquires resources in a cluster.

Solutions:

– Externalize the RDDs: save the RDDs (serialized) to a distributed file system so that the other application can read them back, e.g. HDFS (disk) or Tachyon (in-memory speed). A minimal sketch is given after the feature list below.
– Use SparkJobServer: the key idea is to build a central server that runs jobs either in a shared SparkContext (the application no longer creates the SparkContext; the SparkJobServer does, so RDDs become sharable) or in an independent SparkContext.
Features of SparkJobServer:

  • “Spark as a Service”: Simple REST interface for all aspects of job, context management
  • Supports sub-second low-latency jobs via long-running job contexts
  • Start and stop job contexts for RDD sharing and low-latency jobs; change resources on restart
  • Kill running jobs via stop context
  • Separate jar uploading step for faster job startup
  • Asynchronous and synchronous job API. Synchronous API is great for low latency jobs!
  • Works with Standalone Spark as well as Mesos and yarn-client
  • Job and jar info is persisted via a pluggable DAO interface
  • Named RDDs to cache and retrieve RDDs by name, improving RDD sharing and reuse among jobs.
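
For the first option (externalizing RDDs), here is a minimal sketch, assuming both applications can reach the same HDFS path (the path and data below are only illustrations) and that sc is each application's own SparkContext:

// Application A: write the RDD to shared storage using Java serialization
val rdd = sc.parallelize(1 to 1000000).map(i => (i, i * i))
rdd.saveAsObjectFile("hdfs://namenode:9000/shared/squares")

// Application B: rebuild an equivalent RDD from the same path
val shared = sc.objectFile[(Int, Int)]("hdfs://namenode:9000/shared/squares")
println(shared.count())

Note that Application B does not get the same cached RDD back; it gets a new RDD rebuilt from the externalized data, which is exactly the overhead a shared SparkContext in SparkJobServer avoids.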
Posted in Spark | Leave a comment

Understand RDD operations: transformations and actions

This blog has been moved to a new address: http://www.trongkhoanguyen.com.

As we’ve already known that each RDD has 2 sets of parallel operations: transformation and action. Today, let’s understand some of them.

1. Transformation

Each transformation below is listed together with the result it returns.

map(f: T => U): return a MappedRDD[U] by applying the function f to each element.
flatMap(f: T => TraversableOnce[U]): return a new FlatMappedRDD[U] by first applying a function to all elements and then flattening the results.
For example, if we want to print all the words in a large text file, we can use the following code:

// create an RDD from a text file on HDFS
// (sc is the SparkContext, e.g. the one created by spark-shell)
val lines = sc.textFile("hdfs://input.txt")

// words is an RDD containing every word in the input.txt file
val words = lines.flatMap(line => line.split(" "))

// print the words (an action: collect brings them back to the driver)
words.collect().foreach(println)
filter(f: T => Boolean): return a FilteredRDD[T] containing the elements for which f returns true.
mapPartitions(f: Iterator[T] => Iterator[U]): return a new MapPartitionsRDD[U] by applying a function to each partition.
sample(withReplacement, fraction, seed): return a new PartitionwiseSampledRDD[T] which is a sampled subset.
union(other: RDD[T]): return a new UnionRDD[T], the union with another RDD.
intersection(other: RDD[T]): return a new RDD[T], the intersection with another RDD.
distinct(): return a new RDD[T] containing the distinct elements.
groupByKey(): called on a (K, V) RDD, return a new RDD[(K, Iterable[V])].
reduceByKey(f: (V, V) => V): called on a (K, V) RDD, return a new RDD[(K, V)] by aggregating the values of each key with f, e.g. reduceByKey(_ + _).
sortByKey([ascending]): called on a (K, V) RDD where K implements Ordered, return a new RDD[(K, V)] sorted by K.
join(other: RDD[(K, W)]): called on a (K, V) RDD, return a new RDD[(K, (V, W))] by joining the two by key.
cogroup(other: RDD[(K, W)]): called on a (K, V) RDD, return a new RDD[(K, (Iterable[V], Iterable[W]))] such that for each key k present in either this or other, we get a tuple with the list of values for that key in this as well as in other.
cartesian(other: RDD[U]): return a new RDD[(T, U)], the Cartesian product with another RDD.
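
To see several of these transformations working together, here is a minimal word-count sketch, assuming sc is an existing SparkContext (e.g. the one created by spark-shell) and the input path is only an illustration:

// word count: flatMap + map + reduceByKey
val lines = sc.textFile("hdfs://input.txt")
val counts = lines
  .flatMap(line => line.split(" "))  // one element per word
  .map(word => (word, 1))            // pair RDD: (word, 1)
  .reduceByKey(_ + _)                // sum the counts for each key
counts.collect().foreach(println)    // action: bring the results to the driver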

Continue reading

Posted in Spark | Tagged , | 4 Comments

Developer Certification for Apache Spark

This blog has been moved to a new address: http://www.trongkhoanguyen.com.

Interesting, huh? It seems that Spark holds a lot of promise for the foreseeable future.
http://www.oreilly.com/data/sparkcert.html?cmp=ex-strata-na-lp-na_apache_spark_certification
https://databricks.com/certified-on-spark

Posted in Spark | Tagged | Leave a comment

A gentle introduction to Apache Spark

1. What is Apache Spark?

– An open-source and powerful data processing engine.
– It may complement (or even replace) its pioneering counterpart, Hadoop, in the future thanks to much better performance.

“Spark enables applications in Hadoop clusters to run up to 100x faster in memory, and 10x faster even when running on disk.”

Why is it better? Keep reading …

2. What are the most important criteria for a data processing engine? And what are the differences between Hadoop and Spark?

– Scalability (work distribution): it can work with large data.
– Fault tolerance: it can recover from failures on its own.

Spark leverages distributed memory, which the previous engine did not. That's the first big difference!
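
To make the "distributed memory" point concrete, here is a minimal sketch (the path and filters are only illustrations; sc is an existing SparkContext) that caches an RDD so later actions reuse the in-memory copy instead of re-reading from disk:

// cache a filtered RDD in cluster memory
val logs = sc.textFile("hdfs://some/logs")
val errors = logs.filter(line => line.contains("ERROR"))
errors.cache()                 // keep this RDD in memory across actions

println(errors.count())        // first action: reads from disk and fills the cache
println(errors.filter(_.contains("timeout")).count())  // second action: served from memory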
Continue reading

Posted in Hadoop, Spark | Tagged , | 1 Comment

How to install Apache Spark 1.2.1 in Standalone cluster on Ubuntu

This blog has been moved to a new address: http://www.trongkhoanguyen.com.

In this tutorial, you will be able to:
– Install the latest version of Apache Spark, 1.2.1, in Standalone mode on Ubuntu from scratch.
– Configure a small cluster comprising 1 Master node and 1 Worker node (you can easily add a 2nd Worker node if needed).
Live video is provided here:

Prerequisite:
– A fresh copy of Ubuntu 14.04 [Get here], installed on VMware.

Steps:
1. Install Java SDK 6
2. Install Scala 2.10.4
3. Install SSH Remote Access
4. Install Hadoop 2.4 (optional )
5. Install Spark
6. Test (a quick sanity check is sketched right after this list)
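
Jumping ahead a bit: once the cluster is up (step 6), a quick sanity check typed into spark-shell could look like the minimal sketch below (sc is the SparkContext that spark-shell creates for you):

// distribute a small dataset and sum it across the cluster
val data = sc.parallelize(1 to 1000)
println(data.reduce(_ + _))   // should print 500500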

1. Install Java SDK 6

# optional - remove OpenJDK if you installed it
sudo apt-get purge openjdk*

# install the Oracle Java SDK 6
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java6-installer

# set the JAVA_HOME environment variable in /etc/environment
sudo nano /etc/environment
# add the following line to the file:
#   JAVA_HOME=/usr/lib/jvm/java-6-oracle/

# force the shell to reload /etc/environment
source /etc/environment

Continue reading

Posted in Spark | Tagged , | 3 Comments

Spark reference materials

This blog has been moved to a new address: http://www.trongkhoanguyen.com.

When starting out on the journey of learning something new, learning resources and references are the first things we have to search for. The goal of this post is to list all the essential materials I've found that facilitate learning Spark.

1. Learning materials

– Apache Spark started as a research project at UC Berkeley in the AMPLab, so we should have a look at their first paper: “Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing” (2011). There you will find the paper, the slides, and the presentation video.

– Get your hands dirty by following the official Spark screencasts to learn how to install Spark and run some examples: a great introduction.

– Books: I've always thought that learning from books is one of the most effective ways, thanks to their well-structured content and the ease of getting an overview and following the steps. As Spark is still a young and promising project (first released on October 15, 2012), at the time of writing (11-2014) there are only 2 available books:
– Learning Spark, by Holden Karau, Andy Konwinski, Patrick Wendell and Matei Zaharia, O'Reilly [O'Reilly Link]
– Fast Data Processing with Spark [Link]: I didn't find this book good for beginners.

Continue reading

Posted in Spark | Tagged , | Leave a comment

Hello newcomer!

As is the custom! 😀

class Program
{
	public static void Main()
	{
		System.Console.WriteLine("Hello blog!");
	}
}
Posted in For fun | Tagged | Leave a comment