Apache Spark
• Fast and general-purpose cluster computing system
• Born at UC Berkeley around 2009
• Open sourced in 2010
• Interoperable with Hadoop and included in all the major distributions
• Provides high-level APIs in Scala, Java, and Python
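The "high-level API" the slides refer to can be seen in a classic word count. This is a minimal sketch in local mode; the app name and input lines are illustrative, and spark-core must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context purely for illustration
val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
val sc = new SparkContext(conf)

val counts = sc.parallelize(Seq("spark is fast", "spark is general purpose"))
  .flatMap(_.split(" "))    // split lines into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum the counts per word
  .collectAsMap()

sc.stop()
```

In a spark-shell session the `sc` variable is already provided, so only the three transformation lines are needed.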
Spark SQL
• Spark module for structured data processing
• The most popular Spark module in the ecosystem
• Use SQLContext to perform operations
• SQL Queries
• DataFrame API
• Dataset API
• White Paper
• http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
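The entry points listed above can be sketched side by side. This example uses the Spark 1.6-era `SQLContext`; the table and column names are illustrative, and spark-sql must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(("alice", 34), ("bob", 29))).toDF("name", "age")
df.registerTempTable("people")  // deprecated in 2.0 in favor of createOrReplaceTempView

// 1. SQL queries
val bySql = sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()

// 2. DataFrame API: same result, expressed programmatically
val byApi = df.filter($"age" > 30).select("name").collect()
```

Both forms build the same logical plan, so they optimize and execute identically.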
SQLContext
• Used to Create DataFrames
• Implementations
• SQLContext
• HiveContext
• An instance of the Spark SQL execution engine that integrates with
data stored in Hive
• SQLContext
• The Spark shell automatically creates a SQLContext as the sqlContext
variable
• Documentation
• https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
• As of Spark 2.0, use SparkSession instead
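The Spark 2.0 entry point mentioned above can be sketched as follows; the app name is illustrative, and local mode is used only for demonstration.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession subsumes both SQLContext and HiveContext in Spark 2.0+
val spark = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")      // local mode for illustration
  // .enableHiveSupport()  // opt-in Hive integration, replacing HiveContext
  .getOrCreate()

// Legacy code can still obtain a SQLContext from the session:
val sqlContext = spark.sqlContext

val n = spark.range(3).count()
```

In the Spark 2.x shell a ready-made session is bound to the `spark` variable, just as `sqlContext` was in 1.x.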
DataFrame API
• One use of Spark SQL is to execute SQL queries written using either basic
SQL syntax or HiveQL
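Beyond SQL strings, the DataFrame API expresses the same operations as method calls. A minimal sketch, with illustrative column names, might look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("df-api").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("key", "value")

// groupBy/agg/orderBy build the same logical plan a SQL GROUP BY would
val totals = sales.groupBy("key")
  .agg(sum("value").as("total"))
  .orderBy("key")
  .collect()
```

Because both paths produce the same logical plan, choosing between SQL text and the DataFrame API is a matter of style, not performance.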
Dataset API
• Dataset is a new interface added in Spark 1.6 that combines the benefits of
RDDs with those of Spark SQL’s optimized execution engine
• Use the SQLContext
• sqlContext.read.{OPERATION}
• DataFrame is simply a type alias of Dataset[Row]
• The unified Dataset API can be used both in Scala and Java.
• Python does not yet have support for the Dataset API
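A typed Dataset can be sketched with a case class; the `Person` type and its fields are illustrative. Where `DataFrame` is `Dataset[Row]`, a `Dataset[Person]` adds compile-time typing on top of the same engine.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq(Person("alice", 34), Person("bob", 29)))

// RDD-style lambdas, but planned and executed by Spark SQL's engine
val names = ds.filter(_.age > 30).map(_.name).collect()
```

The lambdas here are type-checked against `Person` at compile time, which is the RDD-like benefit the slide refers to.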
Catalyst Optimizer
• Applied to Spark SQL and DataFrame API
• Extensible Optimizer
• Automatically finds the most efficient plan to execute the data operations in the
user’s program
Databricks, Catalyst Optimizer Workflow
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
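Catalyst's work can be observed directly through `explain(true)`, which prints the parsed, analyzed, optimized, and physical plans. A small sketch (column names illustrative): two filters written separately are fused into one predicate in the optimized plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 5, 20).toDF("id")

// Written as two separate filters...
val q = df.filter($"id" > 0).filter($"id" < 10)

// ...but the optimized plan shows a single combined Filter node
q.explain(true)

val kept = q.collect().map(_.getInt(0))
```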
Project Tungsten
• Applied to Spark SQL, DataFrame API, Dataset API and ML
• Effort to improve Spark’s CPU and memory usage with three techniques
• Memory management and binary processing
• Leveraging application semantics to manage memory explicitly and
eliminate the overhead of the JVM object model and garbage collection
• Cache-aware computation
• Algorithms and data structures to exploit memory hierarchy
• Code generation
• Using code generation to exploit modern compilers and CPUs
• Closer to Bare Metal
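The code-generation technique above is visible in the physical plan: in Spark 2.x, operators fused by whole-stage code generation are marked with `*` (or shown under a WholeStageCodegen node) in `explain()` output. A small sketch, with an illustrative app name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tungsten-demo").master("local[*]").getOrCreate()

val q = spark.range(1000).selectExpr("sum(id) AS total")

// Look for WholeStageCodegen / '*'-prefixed operators in this output
q.explain()

val total = q.collect()(0).getLong(0)
```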
Certification
• Certification Organizations
• O’Reilly and Databricks
• http://www.oreilly.com/data/sparkcert.html
• Additional steps to prepare
• Work with Apache Spark
• Research Apache Spark Modules
• Spark SQL, Spark Streaming, MLlib, GraphX
• Read the RDD and other white papers
• Read O’Reilly’s “Learning Spark” book