Apache Spark
• Fast and general-purpose cluster computing system
• Born at UC Berkeley around 2009
• Open sourced in 2010
• Interoperable with Hadoop and included in all the major distributions
• Provides high-level APIs in Scala, Java, and Python
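The "high-level API" the slides refer to can be seen in a classic word count. This is a minimal sketch in local mode; the app name and input lines are illustrative, and spark-core must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Local-mode context purely for illustration
val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
val sc = new SparkContext(conf)

val counts = sc.parallelize(Seq("spark is fast", "spark is general purpose"))
  .flatMap(_.split(" "))    // split lines into words
  .map(word => (word, 1))   // pair each word with a count of 1
  .reduceByKey(_ + _)       // sum the counts per word
  .collectAsMap()

sc.stop()
```

In a spark-shell session the `sc` variable is already provided, so only the three transformation lines are needed.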
Spark SQL
• Spark module for structured data processing
• The most popular Spark module in the ecosystem
• Use SQLContext to perform operations
• SQL Queries
• DataFrame API
• Dataset API
• White Paper
• http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf
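The entry points listed above can be sketched side by side. This example uses the Spark 1.6-era `SQLContext`; the table and column names are illustrative, and spark-sql must be on the classpath.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val sc = new SparkContext(new SparkConf().setAppName("sql-demo").setMaster("local[*]"))
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq(("alice", 34), ("bob", 29))).toDF("name", "age")
df.registerTempTable("people")  // deprecated in 2.0 in favor of createOrReplaceTempView

// 1. SQL queries
val bySql = sqlContext.sql("SELECT name FROM people WHERE age > 30").collect()

// 2. DataFrame API: same result, expressed programmatically
val byApi = df.filter($"age" > 30).select("name").collect()
```

Both forms build the same logical plan, so they optimize and execute identically.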
SQLContext
• Used to Create DataFrames
• Implementations
• SQLContext
• HiveContext
• An instance of the Spark SQL execution engine that integrates with
data stored in Hive
• SQLContext
• The Spark shell automatically creates a SQLContext as the sqlContext
variable
• Documentation
• https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.SQLContext
• As of Spark 2.0, use SparkSession instead
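The Spark 2.0 entry point mentioned above can be sketched as follows; the app name is illustrative, and local mode is used only for demonstration.

```scala
import org.apache.spark.sql.SparkSession

// SparkSession subsumes both SQLContext and HiveContext in Spark 2.0+
val spark = SparkSession.builder()
  .appName("session-demo")
  .master("local[*]")      // local mode for illustration
  // .enableHiveSupport()  // opt-in Hive integration, replacing HiveContext
  .getOrCreate()

// Legacy code can still obtain a SQLContext from the session:
val sqlContext = spark.sqlContext

val n = spark.range(3).count()
```

In the Spark 2.x shell a ready-made session is bound to the `spark` variable, just as `sqlContext` was in 1.x.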
DataFrame API
• One use of Spark SQL is to execute SQL queries written using either basic
SQL syntax or HiveQL
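Beyond SQL strings, the DataFrame API expresses the same operations as method calls. A minimal sketch, with illustrative column names, might look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().appName("df-api").master("local[*]").getOrCreate()
import spark.implicits._

val sales = Seq(("a", 1), ("a", 2), ("b", 5)).toDF("key", "value")

// groupBy/agg/orderBy build the same logical plan a SQL GROUP BY would
val totals = sales.groupBy("key")
  .agg(sum("value").as("total"))
  .orderBy("key")
  .collect()
```

Because both paths produce the same logical plan, choosing between SQL text and the DataFrame API is a matter of style, not performance.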
Dataset API
• Dataset is a new interface added in Spark 1.6 that combines the benefits of
RDDs with those of Spark SQL’s optimized execution engine
• Use the SQLContext
• sqlContext.read.{OPERATION}
• DataFrame is simply a type alias of Dataset[Row]
• The unified Dataset API can be used both in Scala and Java.
• Python does not yet have support for the Dataset API
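A typed Dataset can be sketched with a case class; the `Person` type and its fields are illustrative. Where `DataFrame` is `Dataset[Row]`, a `Dataset[Person]` adds compile-time typing on top of the same engine.

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().appName("ds-demo").master("local[*]").getOrCreate()
import spark.implicits._

val ds = spark.createDataset(Seq(Person("alice", 34), Person("bob", 29)))

// RDD-style lambdas, but planned and executed by Spark SQL's engine
val names = ds.filter(_.age > 30).map(_.name).collect()
```

The lambdas here are type-checked against `Person` at compile time, which is the RDD-like benefit the slide refers to.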
Catalyst Optimizer
• Applied to Spark SQL and DataFrame API
• Extensible Optimizer
• Automatically finds the most efficient plan to execute the data operations in the
user’s program
Databricks, Catalyst Optimizer Workflow
https://databricks.com/blog/2015/04/13/deep-dive-into-spark-sqls-catalyst-optimizer.html
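Catalyst's work can be observed directly through `explain(true)`, which prints the parsed, analyzed, optimized, and physical plans. A small sketch (column names illustrative): two filters written separately are fused into one predicate in the optimized plan.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("catalyst-demo").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(1, 5, 20).toDF("id")

// Written as two separate filters...
val q = df.filter($"id" > 0).filter($"id" < 10)

// ...but the optimized plan shows a single combined Filter node
q.explain(true)

val kept = q.collect().map(_.getInt(0))
```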
Project Tungsten
• Applied to Spark SQL, DataFrame API, Dataset API and ML
• Effort to improve Spark’s CPU and memory usage with three techniques
• Memory management and binary processing
• Leveraging application semantics to manage memory explicitly and
eliminate the overhead of the JVM object model and garbage collection
• Cache-aware computation
• Algorithms and data structures to exploit memory hierarchy
• Code generation
• Using code generation to exploit modern compilers and CPUs
• Closer to Bare Metal
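The code-generation technique above is visible in the physical plan: in Spark 2.x, operators fused by whole-stage code generation are marked with `*` (or shown under a WholeStageCodegen node) in `explain()` output. A small sketch, with an illustrative app name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("tungsten-demo").master("local[*]").getOrCreate()

val q = spark.range(1000).selectExpr("sum(id) AS total")

// Look for WholeStageCodegen / '*'-prefixed operators in this output
q.explain()

val total = q.collect()(0).getLong(0)
```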
Certification
• Certification Organizations
• O’Reilly and Databricks
• http://www.oreilly.com/data/sparkcert.html
• Additional steps to prepare
• Work with Apache Spark
• Research Apache Spark Modules
• Spark SQL, Spark Streaming, MLlib, GraphX
• Read the RDD and other white papers
• Read O’Reilly’s “Learning Spark” book