This blog has been moved to new address: http://www.trongkhoanguyen.com.
With Spark and RDD core API, we can do almost everything with datasets. Developers define steps of how to retrieve the data by applying functional transformations on RDDs. They are also the guy who try to optimize his code.
With SparkSQL (the replacement of Shark), we can even play further with data: specifying what data we want to retrieve on partitioned collection of tuples schema (table, columns, data types). SparkSQL will do the optimization for us.
So, what SparkSQL actually does?
Transforms SQL queries into optimal RDD transformations and actions.
– interactive query
– can be mixed with machine learning
– support Hive query
Module SparkSQL contains:
– Catalyst: query parser & query optimizer
– Hive support
Catalyst provides a execution planning framework
– SQL parser & analyzer
– Logical operator & general expression
– Logical optimizer
– Transform operator tree
Parse -> Analyze -> Logical Plan -> Optimize -> Physical Plan -> Execute
Catalyst SQL core
2. Query optimization with simple rules transform:
A simple query example:
FROM (SELECT id, name
FROM People) p
WHERE p.id = 1
Perform optimizations on plan
3. Optimization rules in Spark-SQL
|NullPropagation||1+null => null
count(null) = 0
|BooleanSimplification:||false AND $right => false
true AND $right => $right
true OR $right => true
false OR $right => $right
if(true, $then, $else) => $then
|SimplifyFilters||remove trivially filters:
Filter(true, child) => child
Filter(false, child)=> empty
|CombineFilters||merges 2 filters:
Filter($fc, Filter($nc, child)) => Filter(AND($fc, $nc), child)
|PushPredicateThroughProject||pushes Filter operators through project operator:
Filter(‘i==1’, Project(i, j, child)) => Project(i, j, Filter(‘i==1’, child))
|PushPredicateThroughJoin||pushes Filter operators through join operator:
Filter(‘left.i’.att == 1, Join(left, right)) => Join(Filter(‘i==1’, left), right)
|ColumnPruning||Eliminates the reading of unused columns:
Join(left, right, leftSemi, “left.id”.attr == “right.id”.attr)=> Join(left, Project(‘id, right’), leftSemi)
4. Increasing performance by generating bytecode at run-time:
– We know that we are dealing with schema data (with rows and columns), therefore, we will have to evaluate a huge number of expressions like: summing columns for each row (eg: tableA[row_i, column_i] + tableA[row_i, column_j]), subtracting rows,…
– The problem is the abstract implementation of Java for such expressions – generics: can reduce the number of code but disregard performance:
– A simple call on 2 integers: a + b will cost: boxing cost, garbage collection cost and virtual functions called (to base class) cost.
– By generating byte code at run-time, we no longer have to worry about boxing cost and virtual function called. Garbage collection is also reduced.