Spark SQL: An Informal Talk
Cheng Hao 
Oct 25, 2014 
Copyright © 2014 Intel Corporation.
Agenda
• Spark SQL Overview
• Catalyst in Depth
• SQL Core API Introduction
• vs. Shark & Hive-on-Spark
• Our Contributions
• Useful Materials
Spark SQL Overview 
Spark SQL in Spark
[Diagram: the Spark stack. Spark SQL, Spark Streaming (real-time), GraphX (graph, alpha), and MLlib (machine learning) sit on top of Spark Core.]
• Spark SQL was first released in Spark 1.0 (May 2014)
• Initially committed by Michael Armbrust & Reynold Xin from Databricks
Spark SQL Component Stack (User Perspective)
• Hive-like interface (JDBC service / CLI)
• SQL API support (LINQ-like)
• Both the HiveQL and the simple SQL dialects are supported
• DDL is 100% compatible with the Hive Metastore
• The HiveQL dialect aims to be 100% compatible with Hive DML
• The simple SQL dialect is currently weak in functionality, but easy to extend
[Diagram: component stack. User applications and data analysts use the CLI, the JDBC service, or the SQL API; these feed the HiveQL and simple SQL dialects, then Catalyst, then the Spark execution operators on Spark Core. The Hive Metastore and a simple catalog provide the metadata.]
Spark SQL Architecture
[Diagram: Spark SQL architecture, with Catalyst sitting between the frontend and the backend. Diagram by Michael Armbrust @ Databricks.]
Catalyst in Depth 
Understand Some Terminology
• Logical and physical query plans
  • Both are trees representing query evaluation
  • Internal nodes are operators over the data
  • The logical plan is higher-level and algebraic
  • The physical plan is lower-level and operational
• Logical plan operators
  • Correspond to query language constructs
  • Conceptually describe what operation needs to be performed
• Physical plan operators
  • Correspond to implemented access methods
  • Physically implement the operation described by the logical operators
[Diagram: SQL Text → (Parsing) → Unresolved Logical Plan → (Binding & Analyzing) → Logical Plan → (Optimizing) → Optimized Logical Plan → (Query Planning) → Physical Plan]
Examples
We execute the following commands in the Spark SQL CLI.
• CREATE TABLE T (key STRING, value STRING)
• EXPLAIN EXTENDED
  SELECT
    a.key * (2 + 3), b.value
  FROM T a JOIN T b
  ON a.key=b.key AND a.key>3
== Parsed Logical Plan ==
Project [('a.key * (2 + 3)) AS c_0#24,'b.value]
 Join Inner, Some((('a.key = 'b.key) && ('a.key > 3)))
  UnresolvedRelation None, T, Some(a)
  UnresolvedRelation None, T, Some(b)
== Analyzed Logical Plan ==
Project [(CAST(key#27, DoubleType) * CAST((2 + 3), DoubleType)) AS c_0#24,value#30]
 Join Inner, Some(((key#27 = key#29) && (CAST(key#27, DoubleType) > CAST(3, DoubleType))))
  MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)
== Optimized Logical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 Join Inner, Some((key#27 = key#29))
  Project [key#27]
   Filter (CAST(key#27, DoubleType) > 3.0)
    MetastoreRelation default, T, Some(a)
  MetastoreRelation default, T, Some(b)
== Physical Plan ==
Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
 BroadcastHashJoin [key#27], [key#29], BuildLeft
  Filter (CAST(key#27, DoubleType) > 3.0)
   HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
  HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
Catalyst Overview 
• Catalyst is essentially an extensible framework for analyzing and optimizing logical plans and expressions.
• Core Elements: 
• Tree Node API 
• Expression Optimization 
• Data Type & Schema 
• Row API 
• Logical Plan (Unresolved) Binding & Analyzing (Rules) 
• Logical Plan (Resolved) Optimizing (Rules) 
• SPI (Service Provider Interface) 
• FunctionRegistry 
• Schema Catalog 
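The tree node API and rule-based rewriting listed above can be illustrated with a minimal sketch in plain Scala. The names here (Expr, Lit, Attr, Add, Mul, transformUp, ToyRules) are toy stand-ins chosen for this illustration, not Spark's actual Catalyst classes; the idea shown is only that a rule is a partial function applied bottom-up over an immutable tree.

```scala
// Toy expression tree; hypothetical names, not Spark's Catalyst classes.
sealed trait Expr {
  def children: Seq[Expr]
  def withChildren(newChildren: Seq[Expr]): Expr

  // Bottom-up rewrite: transform the children first, then try the rule on this node.
  def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
    val rewritten = withChildren(children.map(_.transformUp(rule)))
    rule.applyOrElse(rewritten, (e: Expr) => e)
  }
}

case class Lit(value: Int) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(newChildren: Seq[Expr]): Expr = this
}

case class Attr(name: String) extends Expr {
  def children: Seq[Expr] = Nil
  def withChildren(newChildren: Seq[Expr]): Expr = this
}

case class Add(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withChildren(newChildren: Seq[Expr]): Expr = Add(newChildren(0), newChildren(1))
}

case class Mul(left: Expr, right: Expr) extends Expr {
  def children: Seq[Expr] = Seq(left, right)
  def withChildren(newChildren: Seq[Expr]): Expr = Mul(newChildren(0), newChildren(1))
}

object ToyRules {
  // A rule is just a partial function over tree nodes; nodes it does not match are left alone.
  val constantFolding: PartialFunction[Expr, Expr] = {
    case Add(Lit(a), Lit(b)) => Lit(a + b)
    case Mul(Lit(a), Lit(b)) => Lit(a * b)
  }
}
```

Applied to Mul(Attr("key"), Add(Lit(2), Lit(3))), the rule folds the literal sum and leaves the attribute untouched, which is the same kind of simplification that turns (2 + 3) into 5.0 in the optimized plan of the earlier EXPLAIN example.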
Data Type & Schema
• Primitive types
  • StringType, FloatType, IntegerType, ByteType, ShortType, DoubleType, LongType, BinaryType, BooleanType, DecimalType, TimestampType, DateType, Varchar (not fully supported yet), Char (not fully supported yet)
• Complex types
  • ArrayType
    • ArrayType(elementType: DataType)
  • StructType (the relation schema)
    • StructField(name: String, dataType: DataType)
    • StructType(fields: Seq[StructField])
  • MapType
    • MapType(keyType: DataType, valueType: DataType)
  • UnionType (not supported yet)
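As a rough illustration of how the complex types nest, here is a self-contained sketch. It redefines a few toy types with the constructor shapes shown on the slide; these are local stand-ins, not imports from Spark, and the describe helper is a hypothetical function added only for this example.

```scala
// Toy data type hierarchy mirroring the constructor shapes above (not Spark's own classes).
sealed trait DataType
case object StringType extends DataType
case object IntegerType extends DataType
case class ArrayType(elementType: DataType) extends DataType
case class MapType(keyType: DataType, valueType: DataType) extends DataType
case class StructField(name: String, dataType: DataType)
case class StructType(fields: Seq[StructField]) extends DataType

object ToySchema {
  // Hypothetical helper: render a (possibly nested) type as a short string.
  def describe(t: DataType): String = t match {
    case StringType     => "string"
    case IntegerType    => "int"
    case ArrayType(e)   => s"array<${describe(e)}>"
    case MapType(k, v)  => s"map<${describe(k)},${describe(v)}>"
    case StructType(fs) =>
      fs.map(f => s"${f.name}:${describe(f.dataType)}").mkString("struct<", ",", ">")
  }
}
```

A relation schema is then simply a StructType whose fields may themselves be arrays, maps, or further structs.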
Row API
trait Row extends Seq[Any] with Serializable
{
  def apply(i: Int): Any
  def isNullAt(i: Int): Boolean
  def getInt(i: Int): Int
  def getLong(i: Int): Long
  def getDouble(i: Int): Double
  def getFloat(i: Int): Float
  def getBoolean(i: Int): Boolean
  def getShort(i: Int): Short
  def getByte(i: Int): Byte
  def getString(i: Int): String
  def getAs[T](i: Int): T
}
• Row is the key data structure, widely used both inside Spark SQL and in its external API.
• "def getAs[T]" is used for non-primitive data types.
• Field values are represented as native language data types.
• Field types are represented by the DataType classes described on the previous slide.
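The accessor pattern can be sketched without Spark on the classpath. ToyRow below is a hypothetical, minimal stand-in that mirrors a few of the accessors from the trait; it is not the real Row implementation, and render is an illustrative helper.

```scala
// Hypothetical minimal row mirroring a few accessors of the Row trait (not Spark's class).
case class ToyRow(values: Seq[Any]) {
  def isNullAt(i: Int): Boolean = values(i) == null
  def getInt(i: Int): Int = values(i).asInstanceOf[Int]
  def getString(i: Int): String = values(i).asInstanceOf[String]
  // Used for non-primitive fields such as arrays or maps.
  def getAs[T](i: Int): T = values(i).asInstanceOf[T]
}

object ToyRowDemo {
  // Check for null before calling a typed getter, since the typed getters cannot represent null.
  def render(row: ToyRow): String = {
    val key  = if (row.isNullAt(0)) "null" else row.getInt(0).toString
    val tags = if (row.isNullAt(1)) "null" else row.getAs[Seq[String]](1).mkString("|")
    s"$key,$tags"
  }
}
```

The null check before each typed getter is the same pattern the deck's later code example uses when printing query results.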
Logical Plan Binding & Analyzing
• Essentially about data binding & semantic analysis
• Example rules
  • Bind attributes and relations to concrete data
    • ResolveReferences, ResolveRelation
  • Expression analysis
    • Data type coercion (PropagateTypes, PromoteString, BooleanCasts, Division, etc.)
    • Bind UDFs (ResolveFunctions)
  • Evict / expand the analysis logical plan operators
    • StarExpansion, EliminateAnalysisOperators
  • Implicit semantic supplements
    • Add sort expressions to the child projection list (ResolveSortReferences)
    • Convert a projection into an aggregation if the projection contains an aggregate function (GlobalAggregates)
    • UnresolvedHavingClauseAttributes
  • Semantic checking
    • Unresolved functions, relations, attributes (CheckResolution)
    • Illegal expressions in the projection of an aggregation (CheckAggregation)
  • ….
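To make the binding step concrete, the following is a minimal sketch in plain Scala, assuming a hypothetical in-memory catalog. None of these names (ToyAnalyzer, Unresolved, Resolved, StrT, IntT) come from Spark; the sketch only shows the idea of replacing name-only attributes with typed, positioned ones and then checking what is left unresolved.

```scala
// Toy types and attributes; hypothetical names, not Catalyst's classes.
sealed trait DType
case object IntT extends DType
case object StrT extends DType

sealed trait Attribute
case class Unresolved(name: String) extends Attribute
case class Resolved(name: String, dataType: DType, ordinal: Int) extends Attribute

object ToyAnalyzer {
  // Hypothetical catalog: table name -> ordered (column name, type) pairs.
  val catalog: Map[String, Seq[(String, DType)]] =
    Map("T" -> Seq("key" -> StrT, "value" -> StrT))

  // Binding: look each unresolved name up in the table's schema.
  def resolve(table: String, attrs: Seq[Attribute]): Seq[Attribute] = {
    val schema = catalog.getOrElse(table, Nil)
    attrs.map {
      case Unresolved(n) =>
        schema.indexWhere(_._1 == n) match {
          case -1 => Unresolved(n) // stays unresolved; reported by the check below
          case i  => Resolved(n, schema(i)._2, i)
        }
      case alreadyResolved => alreadyResolved
    }
  }

  // Semantic check: anything still unresolved after binding is an error candidate.
  def unresolvedNames(attrs: Seq[Attribute]): Seq[String] =
    attrs.collect { case Unresolved(n) => n }
}
```

Splitting resolution from checking loosely mirrors the slide's separation between the binding rules and the semantic checking rules.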
Logical Plan Optimizing
• Simplify the logical plan tree based on relational / logical algebra and common sense (rule based)
• Example rules
  • Expression optimization
    • NullPropagation, ConstantFolding, SimplifyFilters, SimplifyCasts, OptimizeIn, etc.
  • Filter pushdown
    • UnionPushdown, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning
  • Combine operators
    • CombineFilters, CombineLimits
• Concrete examples
  • IsNull('a + null) => IsNull(null) => Literal(true)
  • SELECT a.key, b.key FROM a JOIN b ON a.key=b.key AND b.key>10 =>
    SELECT a.key, b.key FROM a JOIN (SELECT key FROM b WHERE key>10) b ON a.key=b.key
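The join example above can be sketched as a single rewrite rule over a toy logical plan. This is a simplified illustration under stated assumptions: the classes (Relation, Filter, Join, ColEq, ColGt) and the pushThroughJoin function are hypothetical, they handle only a join of two distinct base relations, and they are not Spark's PushPredicateThroughJoin implementation.

```scala
// Toy predicates that know which tables they reference (hypothetical names).
sealed trait Pred { def tables: Set[String] }
case class ColEq(leftTable: String, rightTable: String, column: String) extends Pred {
  def tables: Set[String] = Set(leftTable, rightTable)
}
case class ColGt(table: String, column: String, bound: Int) extends Pred {
  def tables: Set[String] = Set(table)
}

// Toy logical plan nodes.
sealed trait Plan
case class Relation(name: String) extends Plan
case class Filter(conditions: Seq[Pred], child: Plan) extends Plan
case class Join(left: Plan, right: Plan, conditions: Seq[Pred]) extends Plan

object ToyPushdown {
  // Wrap a plan in a Filter only when there is something to filter on.
  private def filtered(conds: Seq[Pred], plan: Plan): Plan =
    if (conds.isEmpty) plan else Filter(conds, plan)

  // Move join conjuncts that mention only one input relation below the join.
  def pushThroughJoin(plan: Plan): Plan = plan match {
    case Join(l @ Relation(ln), r @ Relation(rn), conds) =>
      val (leftOnly, rest)  = conds.partition(_.tables == Set(ln))
      val (rightOnly, both) = rest.partition(_.tables == Set(rn))
      Join(filtered(leftOnly, l), filtered(rightOnly, r), both)
    case other => other
  }
}
```

For the query above, the conjunct on b.key alone moves into a Filter over b, while the equality that references both sides stays as the join condition.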
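Spark SQL Dialects
[Diagram: the two built-in contexts share Catalyst. HiveContext pairs the Hive Parser (Hive AST) and the Hive Catalog with the Hive+Spark Planner; SQLContext pairs the SQL Parser and the DSL API with the Simple Catalog and the Spark Planner. Both go through Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Execution Operators, with the parsers as the frontend, Catalyst in the middle, and the planners and operators as the backend.]
[Diagram: building a new tool follows the same pattern. XXXContext = Frontend (XX Parser / API) + Catalyst + SPI (XXX Catalog) + Backend (XXX Planner).]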
Spark Plan (Physical Plan)
• SparkPlan: root class of the Spark plan operators (the physical plan operators for Spark)
• Spark plan operators
  • Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash, etc.
  • Aggregate: Aggregate
  • BasicOperators: Distinct, Except, Filter, Limit, Project, Sort, Union, etc.
  • Shuffle: AddExchange, Exchange
  • Commands: CacheTableCommand, DescribeCommand, ExplainCommand, etc.
  • ..
• Spark Strategy (SparkPlanner)
  • Maps the optimized logical plan to a Spark plan
abstract class SparkPlan {
  def children: Seq[SparkPlan]
  /** Specifies how data is partitioned across different nodes in the cluster. */
  def outputPartitioning: Partitioning = UnknownPartitioning(0)
  /** Specifies any partition requirements on the input data for this operator. */
  def requiredChildDistribution: Seq[Distribution] =
    Seq.fill(children.size)(UnspecifiedDistribution)
  def execute(): RDD[Row]
}
[Diagram: Optimized Logical Plan → Spark Plan → RDD → Spark Execution]
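The operator-tree idea behind SparkPlan can be sketched in plain Scala by replacing RDD[Row] with an in-memory Seq[Seq[Any]]. ToyPlan, ScanExec, FilterExec, and ProjectExec are hypothetical names for this illustration only; partitioning and distribution are deliberately left out.

```scala
// Toy physical plan: each operator pulls rows from its children when executed.
// Rows are plain Seq[Any] and results are in-memory Seqs, standing in for RDD[Row].
abstract class ToyPlan {
  def children: Seq[ToyPlan]
  def execute(): Seq[Seq[Any]]
}

// Leaf operator: returns the rows it was given.
case class ScanExec(rows: Seq[Seq[Any]]) extends ToyPlan {
  def children: Seq[ToyPlan] = Nil
  def execute(): Seq[Seq[Any]] = rows
}

// Keeps only the rows that satisfy the predicate.
case class FilterExec(predicate: Seq[Any] => Boolean, child: ToyPlan) extends ToyPlan {
  def children: Seq[ToyPlan] = Seq(child)
  def execute(): Seq[Seq[Any]] = child.execute().filter(predicate)
}

// Keeps only the columns at the given positions, in the given order.
case class ProjectExec(columns: Seq[Int], child: ToyPlan) extends ToyPlan {
  def children: Seq[ToyPlan] = Seq(child)
  def execute(): Seq[Seq[Any]] = child.execute().map(row => columns.map(i => row(i)))
}
```

Calling execute() on the root operator recursively executes its children, which is the sense in which a physical plan is an executable tree rather than a description of the query.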
Case Study for Catalyst in Depth
• StreamSQL
  • Reuses the HiveContext, but with a different frontend / backend
  • Frontend: a slight modification of the HiveParser
  • Backend: a customized query planner that generates the physical plan based on Spark DStream
  • JIRA: https://issues.apache.org/jira/browse/SPARK-1363
  • Source: https://github.com/thunderain-project/StreamSQL
• SQL 92 support
  • Reuses the HiveContext, but with a different frontend
  • Frontend: a modified HiveParser & HiveQL translator
  • https://github.com/intel-hadoop/spark/tree/panthera
• Pig on Spark POC
  • Modifies the SQLContext
  • Provides a PigParser to translate the Pig script into a Catalyst unresolved logical plan
  • https://github.com/databricks/pig-on-spark
SQL Core API Introduction 
SchemaRDD
• What's SchemaRDD?
class SchemaRDD(
    @transient val sqlContext: SQLContext,
    @transient val baseLogicalPlan: LogicalPlan)
  extends RDD[Row](sqlContext.sparkContext, Nil)
• Spark SQL Core API (in Scala)
  • Create a SchemaRDD instance from
    • Plain SQL text: def sql(sqlText: String)
    • An existing logical plan: def logicalPlanToSparkQuery(plan: LogicalPlan)
    • A Spark RDD: def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A])
    • A Spark RDD with a schema: def applySchema(rowRDD: RDD[Row], schema: StructType)
    • Commonly used file formats (JSON, Parquet, etc.): def parquetFile(path: String)
  • SQL DSL
    • select, where, join, orderBy, limit, groupBy, unionAll, etc.
  • Data sink
    • Persist the data with the specified storage level: def persist(newLevel: StorageLevel)
    • Save the data as a Parquet file: def saveAsParquetFile(path: String)
    • Register the data as a temporary table: def registerTempTable(tableName: String)
    • Insert the data into an existing table: def insertInto(tableName: String, overwrite: Boolean)
    • ….
• Java API / Python API supported
Conceptual State Transition Diagram
[Diagram: state transitions among RDD, SchemaRDD, and Unresolved Logical Plan. SQL text, files, and tables enter through the SQL API; results go to files, memory, etc.]
* Unresolved Logical Plan → RDD (Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plan → Spark RDD)
Code Example
sbt/sbt hive/console
// A HiveContext is created by default and its members are imported, so its methods can be called directly.
// Create a Hive table and load data into it.
sql("CREATE TABLE IF NOT EXISTS kv_text(key INT, value STRING)")
sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv_text")

// Create a normal RDD of case class instances.
case class KV(key: Int, value: String)
val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i")))

// kvRdd is implicitly converted into a SchemaRDD; register the filtered result as a temporary table.
kvRdd.where('key >= 1).where('key <= 5).registerTempTable("kv_rdd")
// Load a JSON file and register it as a temporary table.
jsonFile("/tmp/file2.json").registerTempTable("kv_json")

// No join conditions are given, so the three tables are cross joined.
val result = sql("SELECT a.key, b.value, c.key from kv_text a join kv_rdd b join kv_json c")
result.collect().foreach { row =>
  // Check for null before calling a typed getter.
  val f0 = if (row.isNullAt(0)) "null" else row.getInt(0)
  val f1 = if (row.isNullAt(1)) "null" else row.getString(1)
  val f2 = if (row.isNullAt(2)) "null" else row.getInt(2)
  println(s"result:$f0, $f1, $f2")
}
vs. Shark & Hive
• Background of Shark / Hive-on-Spark / Spark SQL
  • Shark was the first SQL-on-Spark product, based on earlier versions of Hive (with a rewritten query planner that generates Spark RDD-based physical plans). Shark is now retired and has been replaced by Spark SQL.
  • Hive-on-Spark is a query planner extension of Hive; it focuses on the Spark planner and the implementation of Spark RDD-based physical operators. Spark users automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.
  • Spark SQL is a new SQL engine on Spark, developed from scratch.
• Functionality
  • From a data analyst's perspective, Spark SQL supports almost all of the functionality that Hive provides.
  • SQL API on the Spark shell vs. Pig Latin.
  • Spark SQL is an extensible, flexible framework for developers (based on Catalyst); new extensions are easy to integrate.
• Implementation philosophy of Spark SQL (simple & natural)
  • Makes heavy use of Scala features (pattern matching, implicit conversions, partial functions, etc.)
  • A large number of small, simple rules bind, analyze, and optimize the logical plan and expression tree, and also generate the physical plan.
  • In-memory computing & maximizing memory usage (cache-related SQL API & commands).
  • Spark SQL benefits a lot from Hive by reusing its components (HiveQL parser, Metastore, SerDe, StorageHandler, etc.)
• Stability
  • Hive is the de facto standard for SQL on big data so far. It has proven to be a productive tool over a couple of years of practice, and many corner cases are covered by its continuous enhancements.
  • Spark SQL has just started its journey (~0.5 year); more time is needed to prove and improve it.
Our Contributions
• 60+ PRs in total, 50+ merged into Spark SQL
• Features
  • Add SerDe support for CTAS (PR2570)
  • Support grouping sets (PR1567)
  • Support EXTENDED for EXPLAIN (PR1982)
  • Cross join support in HiveQL (PR2124)
  • Add support for left semi join (PR837)
  • Add Date type support (PR2344)
  • Add Timestamp type support (PR275)
  • Add support for the RLike & Like expressions (PR224)
  • ..
• Performance enhancements / improvements
  • Avoid table creation during logical plan analysis for CTAS (PR1846)
  • Extract the join keys from the join condition (PR1190)
  • Reduce expression tree object creation for aggregation functions (min/max) (PR2113)
  • Push down the join filter & predicates for outer joins (PR1015)
  • Constant folding for expression optimization (PR482)
  • Fix a performance issue in data type casting (PR679)
  • Do not limit the argument type for Hive simple UDFs (PR2506)
  • Use GenericUDFUtils.ConversionHelper for simple UDF type conversions (PR2407)
  • Selecting null from a table would throw a MatchError (PR2396)
  • Type coercion should allow every type to have a null value (PR2246)
  • ….
• Bug fixes
  • ….
Useful Materials
• References
  • http://spark-summit.org/wp-content/uploads/2013/10/J-Michael-Armburst-catalyst-spark-summit-dec-2013.pptx
  • http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf
  • https://www.youtube.com/watch?v=GQSNJAzxOr8
  • http://www.slideshare.net/ueshin/20140908-spark-sql-catalyst?qid=3bb8abf4-3d8d-433f-9397-c24c5256841d
  • https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
  • http://web.stanford.edu/class/cs346/qpnotes.html
  • http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
  • http://codex.cs.yale.edu/avi/db-book/db6/slide-dir/PDF-dir/ch13.pdf
  • https://courses.cs.washington.edu/courses/cse444/12sp/lectures/
  • http://www.cs.uiuc.edu/class/sp06/cs411/lectures.html
• User mailing list
  • user@spark.apache.org
• Dev mailing list
  • dev@spark.apache.org
• JIRA
  • https://issues.apache.org/jira/browse/SPARK/component/12322623
• Developer docs
  • https://spark.apache.org/docs/latest/sql-programming-guide.html
• GitHub
  • https://github.com/apache/spark/tree/master/sql
Spark SQL In Depth www.syedacademy.comSyed Hadoop
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonMiklos Christine
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldDatabricks
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Michael Rys
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Databricks
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Databricks
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Paulo Gutierrez
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteChris Baynes
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)Paul Chao
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionChetan Khatri
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Julian Hyde
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Michael Rys
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustSpark Summit
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesDatabricks
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFramesSpark Summit
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Fabian Hueske
 

Similar to Spark sql meetup (20)

Jump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and DatabricksJump Start into Apache® Spark™ and Databricks
Jump Start into Apache® Spark™ and Databricks
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Apache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster ComputingApache Spark: Lightning Fast Cluster Computing
Apache Spark: Lightning Fast Cluster Computing
 
Spark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.comSpark SQL In Depth www.syedacademy.com
Spark SQL In Depth www.syedacademy.com
 
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in PythonThe Nitty Gritty of Advanced Analytics Using Apache Spark in Python
The Nitty Gritty of Advanced Analytics Using Apache Spark in Python
 
20170126 big data processing
20170126 big data processing20170126 big data processing
20170126 big data processing
 
Composable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and WeldComposable Parallel Processing in Apache Spark and Weld
Composable Parallel Processing in Apache Spark and Weld
 
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
Bringing the Power and Familiarity of .NET, C# and F# to Big Data Processing ...
 
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
Structuring Apache Spark 2.0: SQL, DataFrames, Datasets And Streaming - by Mi...
 
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
Deep Dive into Spark SQL with Advanced Performance Tuning with Xiao Li & Wenc...
 
Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要Spark + AI Summit 2020 イベント概要
Spark + AI Summit 2020 イベント概要
 
Fast federated SQL with Apache Calcite
Fast federated SQL with Apache CalciteFast federated SQL with Apache Calcite
Fast federated SQL with Apache Calcite
 
AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)AI與大數據數據處理 Spark實戰(20171216)
AI與大數據數據處理 Spark實戰(20171216)
 
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in productionScalaTo July 2019 - No more struggles with Apache Spark workloads in production
ScalaTo July 2019 - No more struggles with Apache Spark workloads in production
 
Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14Cost-based query optimization in Apache Hive 0.14
Cost-based query optimization in Apache Hive 0.14
 
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
Building data pipelines for modern data warehouse with Apache® Spark™ and .NE...
 
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael ArmbrustStructuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
Structuring Spark: DataFrames, Datasets, and Streaming by Michael Armbrust
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Building a modern Application with DataFrames
Building a modern Application with DataFramesBuilding a modern Application with DataFrames
Building a modern Application with DataFrames
 
Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.Taking a look under the hood of Apache Flink's relational APIs.
Taking a look under the hood of Apache Flink's relational APIs.
 

More from Michael Zhang

廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐Michael Zhang
 
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureHKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureMichael Zhang
 
2014 GITC 帶上數據去創業 talkingdata—高铎
 2014 GITC 帶上數據去創業 talkingdata—高铎 2014 GITC 帶上數據去創業 talkingdata—高铎
2014 GITC 帶上數據去創業 talkingdata—高铎Michael Zhang
 
Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket LinxiaofengMichael Zhang
 
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
2014 Hpocon 李志刚   1号店 - puppet在1号店的实践2014 Hpocon 李志刚   1号店 - puppet在1号店的实践
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践Michael Zhang
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven opsMichael Zhang
 
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用Michael Zhang
 
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控Michael Zhang
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性Michael Zhang
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试Michael Zhang
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_reportMichael Zhang
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and HadoopMichael Zhang
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Michael Zhang
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Michael Zhang
 
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Michael Zhang
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Michael Zhang
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Michael Zhang
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Michael Zhang
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyMichael Zhang
 
Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Michael Zhang
 

More from Michael Zhang (20)

廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐廣告系統在Docker/Mesos上的可靠性實踐
廣告系統在Docker/Mesos上的可靠性實踐
 
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier ArchitectureHKIX Upgrade to 100Gbps-Based Two-Tier Architecture
HKIX Upgrade to 100Gbps-Based Two-Tier Architecture
 
2014 GITC 帶上數據去創業 talkingdata—高铎
 2014 GITC 帶上數據去創業 talkingdata—高铎 2014 GITC 帶上數據去創業 talkingdata—高铎
2014 GITC 帶上數據去創業 talkingdata—高铎
 
Fastsocket Linxiaofeng
Fastsocket LinxiaofengFastsocket Linxiaofeng
Fastsocket Linxiaofeng
 
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
2014 Hpocon 李志刚   1号店 - puppet在1号店的实践2014 Hpocon 李志刚   1号店 - puppet在1号店的实践
2014 Hpocon 李志刚 1号店 - puppet在1号店的实践
 
2014 Hpocon 姚仁捷 唯品会 - data driven ops
2014 Hpocon 姚仁捷   唯品会 - data driven ops2014 Hpocon 姚仁捷   唯品会 - data driven ops
2014 Hpocon 姚仁捷 唯品会 - data driven ops
 
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用2014 Hpocon 高驰涛   云智慧 - apm在高性能架构中的应用
2014 Hpocon 高驰涛 云智慧 - apm在高性能架构中的应用
 
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控2014 Hpocon 黄慧攀   upyun - 平台架构的服务监控
2014 Hpocon 黄慧攀 upyun - 平台架构的服务监控
 
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性2014 Hpocon 吴磊   ucloud - 由点到面 提升公有云服务可用性
2014 Hpocon 吴磊 ucloud - 由点到面 提升公有云服务可用性
 
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试2014 Hpocon 周辉   大众点评 - 大众点评混合开发模式下的加速尝试
2014 Hpocon 周辉 大众点评 - 大众点评混合开发模式下的加速尝试
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_report
 
The Data Center and Hadoop
The Data Center and HadoopThe Data Center and Hadoop
The Data Center and Hadoop
 
Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.Hadoop Hardware @Twitter: Size does matter.
Hadoop Hardware @Twitter: Size does matter.
 
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
Q con shanghai2013-[ben lavender]-[long-distance relationships with robots]
 
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]Q con shanghai2013-[刘海锋]-[京东文件系统简介]
Q con shanghai2013-[刘海锋]-[京东文件系统简介]
 
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
Q con shanghai2013-[韩军]-[超大型电商系统架构解密]
 
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
Q con shanghai2013-[jains krums]-[real-time-delivery-archiecture]
 
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
Q con shanghai2013-[黄舒泉]-[intel it openstack practice]
 
Q con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodologyQ con shanghai2013-罗婷-performance methodology
Q con shanghai2013-罗婷-performance methodology
 
Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践Q con shanghai2013-赵永明-ats与cdn实践
Q con shanghai2013-赵永明-ats与cdn实践
 

Recently uploaded

Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Lucknow
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Sonam Pathan
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一z xss
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Roomishabajaj13
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作ys8omjxb
 
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130  Available With RoomVIP Kolkata Call Girl Kestopur 👉 8250192130  Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Roomdivyansh0kumar0
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhimiss dipika
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Paul Calvano
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书zdzoqco
 
Complet Documnetation for Smart Assistant Application for Disabled Person
Complet Documnetation   for Smart Assistant Application for Disabled PersonComplet Documnetation   for Smart Assistant Application for Disabled Person
Complet Documnetation for Smart Assistant Application for Disabled Personfurqan222004
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)Christopher H Felton
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkataanamikaraghav4
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationLinaWolf1
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITMgdsc13
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一Fs
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Roomdivyansh0kumar0
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一Fs
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMartaLoveguard
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一Fs
 

Recently uploaded (20)

Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja VipCall Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
Call Girls Service Adil Nagar 7001305949 Need escorts Service Pooja Vip
 
Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170Call Girls Near The Suryaa Hotel New Delhi 9873777170
Call Girls Near The Suryaa Hotel New Delhi 9873777170
 
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
办理(UofR毕业证书)罗切斯特大学毕业证成绩单原版一比一
 
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With RoomVIP Kolkata Call Girl Salt Lake 👉 8250192130  Available With Room
VIP Kolkata Call Girl Salt Lake 👉 8250192130 Available With Room
 
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
Potsdam FH学位证,波茨坦应用技术大学毕业证书1:1制作
 
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130  Available With RoomVIP Kolkata Call Girl Kestopur 👉 8250192130  Available With Room
VIP Kolkata Call Girl Kestopur 👉 8250192130 Available With Room
 
Contact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New DelhiContact Rya Baby for Call Girls New Delhi
Contact Rya Baby for Call Girls New Delhi
 
Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24Font Performance - NYC WebPerf Meetup April '24
Font Performance - NYC WebPerf Meetup April '24
 
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
办理多伦多大学毕业证成绩单|购买加拿大UTSG文凭证书
 
Complet Documnetation for Smart Assistant Application for Disabled Person
Complet Documnetation   for Smart Assistant Application for Disabled PersonComplet Documnetation   for Smart Assistant Application for Disabled Person
Complet Documnetation for Smart Assistant Application for Disabled Person
 
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
A Good Girl's Guide to Murder (A Good Girl's Guide to Murder, #1)
 
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls KolkataVIP Call Girls Kolkata Ananya 🤌  8250192130 🚀 Vip Call Girls Kolkata
VIP Call Girls Kolkata Ananya 🤌 8250192130 🚀 Vip Call Girls Kolkata
 
PHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 DocumentationPHP-based rendering of TYPO3 Documentation
PHP-based rendering of TYPO3 Documentation
 
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Serviceyoung call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
young call girls in Uttam Nagar🔝 9953056974 🔝 Delhi escort Service
 
Git and Github workshop GDSC MLRITM
Git and Github  workshop GDSC MLRITMGit and Github  workshop GDSC MLRITM
Git and Github workshop GDSC MLRITM
 
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
定制(Lincoln毕业证书)新西兰林肯大学毕业证成绩单原版一比一
 
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130  Available With RoomVIP Kolkata Call Girl Alambazar 👉 8250192130  Available With Room
VIP Kolkata Call Girl Alambazar 👉 8250192130 Available With Room
 
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
定制(Management毕业证书)新加坡管理大学毕业证成绩单原版一比一
 
Magic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptxMagic exist by Marta Loveguard - presentation.pptx
Magic exist by Marta Loveguard - presentation.pptx
 
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
定制(UAL学位证)英国伦敦艺术大学毕业证成绩单原版一比一
 

Spark sql meetup

  • 1. Spark SQL: An Informal Talk. Cheng Hao, Oct 25, 2014. Copyright © 2014 Intel Corporation.
  • 2. Agenda
      • Spark SQL Overview
      • Catalyst in Depth
      • SQL Core API Introduction
      • vs. Shark & Hive-on-Spark
      • Our Contributions
      • Useful Materials
  • 3. Spark SQL Overview
  • 4. Spark SQL in Spark
      [Diagram labels: Spark SQL, Spark Streaming (real-time), GraphX (graph, alpha), MLlib (machine learning), Spark Core]
      • Spark SQL was first released in Spark 1.0 (May 2014).
      • Initially committed by Michael Armbrust & Reynold Xin from Databricks.
  • 5. Spark SQL Component Stack (User Perspective)
      • Hive-like interface (JDBC service / CLI)
      • SQL API support (LINQ-like)
      • Both the Hive QL and Simple SQL dialects are supported
      • DDL is 100% compatible with the Hive Metastore
      • Hive QL aims to be 100% compatible with Hive DML
      • The Simple SQL dialect is currently very limited in functionality, but easy to extend
      [Diagram labels: Data Analyst, User Application; CLI, JDBC Service, SQL API; Hive QL, Simple SQL; Catalyst; Spark Execution Operators; Spark Core; Hive Metastore, Simple Catalog]
  • 6. Spark SQL Architecture
      [Diagram labels: Frontend, Catalyst, Backend. Figure by Michael Armbrust @ Databricks]
  • 7. Catalyst in Depth
  • 8. Understand Some Terminology
      • Logical and physical query plans
          • Both are trees representing query evaluation
          • Internal nodes are operators over the data
          • The logical plan is higher-level and algebraic
          • The physical plan is lower-level and operational
      • Logical plan operators
          • Correspond to query language constructs
          • Conceptually describe what operation needs to be performed
      • Physical plan operators
          • Correspond to implemented access methods
          • Physically implement the operation described by the logical operators
      [Pipeline: SQL Text -> (Parsing) -> Unresolved Logical Plan -> (Binding & Analyzing) -> Logical Plan -> (Optimizing) -> Optimized Logical Plan -> (Query Planning) -> Physical Plan]
  • 9. Examples
      We execute the following commands in the Spark SQL CLI:
      • CREATE TABLE T (key STRING, value STRING)
      • EXPLAIN EXTENDED SELECT a.key * (2 + 3), b.value FROM T a JOIN T b ON a.key=b.key AND a.key>3
  • 10. Understand Some Terminology (EXPLAIN EXTENDED output)
      == Parsed Logical Plan ==
      Project [('a.key * (2 + 3)) AS c_0#24,'b.value]
       Join Inner, Some((('a.key = 'b.key) && ('a.key > 3)))
        UnresolvedRelation None, T, Some(a)
        UnresolvedRelation None, T, Some(b)

      == Analyzed Logical Plan ==
      Project [(CAST(key#27, DoubleType) * CAST((2 + 3), DoubleType)) AS c_0#24,value#30]
       Join Inner, Some(((key#27 = key#29) && (CAST(key#27, DoubleType) > CAST(3, DoubleType))))
        MetastoreRelation default, T, Some(a)
        MetastoreRelation default, T, Some(b)

      == Optimized Logical Plan ==
      Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
       Join Inner, Some((key#27 = key#29))
        Project [key#27]
         Filter (CAST(key#27, DoubleType) > 3.0)
          MetastoreRelation default, T, Some(a)
        MetastoreRelation default, T, Some(b)

      == Physical Plan ==
      Project [(CAST(key#27, DoubleType) * 5.0) AS c_0#24,value#30]
       BroadcastHashJoin [key#27], [key#29], BuildLeft
        Filter (CAST(key#27, DoubleType) > 3.0)
         HiveTableScan [key#27], (MetastoreRelation default, T, Some(a)), None
        HiveTableScan [key#29,value#30], (MetastoreRelation default, T, Some(b)), None
  • 11. Catalyst Overview
      • Catalyst is essentially an extensible framework for analyzing and optimizing logical plans and expressions.
      • Core elements:
          • Tree Node API
          • Expression optimization
          • Data Type & Schema
          • Row API
          • Logical Plan (Unresolved) -> Binding & Analyzing (rules)
          • Logical Plan (Resolved) -> Optimizing (rules)
      • SPI (Service Provider Interface)
          • FunctionRegistry
          • Schema Catalog
  • 12. Data Type & Schema
      • Primitive types
          • StringType, FloatType, IntegerType, ByteType, ShortType, DoubleType, LongType, BinaryType, BooleanType, DecimalType, TimestampType, DateType, Varchar (not fully supported yet), Char (not fully supported yet)
      • Complex types
          • ArrayType
              • ArrayType(elementType: DataType)
          • StructType
              • StructField(name: String, dataType: DataType)
              • StructType(fields: Seq[StructField])
          • MapType
              • MapType(keyType: DataType, valueType: DataType)
          • UnionType (not supported yet)
      [Diagram labels: Relation, Schema]
  • 13. Row API
      trait Row extends Seq[Any] with Serializable {
        def apply(i: Int): Any
        def isNullAt(i: Int): Boolean
        def getInt(i: Int): Int
        def getLong(i: Int): Long
        def getDouble(i: Int): Double
        def getFloat(i: Int): Float
        def getBoolean(i: Int): Boolean
        def getShort(i: Int): Short
        def getByte(i: Int): Byte
        def getString(i: Int): String
        def getAs[T](i: Int): T
      }
      • Row is the key data structure, used widely both inside and outside Spark SQL.
      • "def getAs[T]" is used for non-primitive data types.
      • A field value is represented as a native language data type.
      • A field type is represented as a DataType, as described on the previous slide.
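The positional accessors above can be illustrated with a small stand-in. This is a minimal sketch, not Spark's implementation: `ToyRow` and the sample values are hypothetical, and only a few of the accessors from the slide are modeled. It shows that primitive getters are plain casts by position, while `getAs[T]` covers complex values such as arrays.

```scala
// Toy stand-in for the Row trait on the slide (hypothetical; not Spark's class).
object RowSketch {
  final case class ToyRow(values: Seq[Any]) {
    def apply(i: Int): Any = values(i)
    def isNullAt(i: Int): Boolean = values(i) == null
    // Primitive accessors cast the positional value to the native type.
    def getInt(i: Int): Int = values(i).asInstanceOf[Int]
    def getString(i: Int): String = values(i).asInstanceOf[String]
    // getAs is the escape hatch for non-primitive (array, map, struct) fields.
    def getAs[T](i: Int): T = values(i).asInstanceOf[T]
  }

  // Sample row: an integer, a string, an array field, and a NULL.
  val sample: ToyRow = ToyRow(Seq(1, "spark", Seq("a", "b"), null))

  def main(args: Array[String]): Unit = {
    println(sample.getInt(0))
    println(sample.getAs[Seq[String]](2))
    println(sample.isNullAt(3))
  }
}
```

The caller is expected to know the field types from the schema; a wrong accessor fails at run time with a cast error rather than at compile time.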
  • 14. Logical Plan Binding & Analyzing
      • Essentially about data binding & semantic analysis
      • Example rules
          • Bind attributes and relations to concrete data
              • ResolveReferences, ResolveRelation
          • Expression analysis
              • Data type coercion (PropagateTypes, PromoteString, BooleanCasts, Division, etc.)
              • Bind UDFs (ResolveFunctions)
          • Remove / expand the analysis-only logical plan operators
              • StarExpansion, EliminateAnalysisOperators
          • Implicit semantic supplements
              • Add sort expressions into the child projection list (ResolveSortReferences)
              • Convert a projection into an aggregation if the projection contains an aggregate function (GlobalAggregates)
              • UnresolvedHavingClauseAttributes
          • Semantic checking
              • Unresolved functions, relations, attributes (CheckResolution)
              • Illegal expressions in the projection of an aggregation (CheckAggregation)
          • …
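The binding step can be sketched on a toy plan. The example below is an illustration only: the case classes, the `Catalog` type alias, and the two functions are simplified, hypothetical stand-ins named after the rules on the slide, not Catalyst's actual classes. It resolves a relation against a catalog, then binds an attribute name to a typed column of that relation.

```scala
// Toy model of relation and attribute binding (hypothetical; not Catalyst's classes).
object AnalyzerSketch {
  sealed trait DataType
  case object StringType extends DataType
  case object IntegerType extends DataType

  sealed trait Expr
  case class UnresolvedAttribute(name: String) extends Expr
  case class AttributeReference(name: String, dataType: DataType, id: Int) extends Expr

  sealed trait Plan
  case class UnresolvedRelation(table: String) extends Plan
  case class Relation(table: String, output: Seq[AttributeReference]) extends Plan
  case class Project(exprs: Seq[Expr], child: Plan) extends Plan

  // Assumed catalog shape: table name -> ordered (column name, type) pairs.
  type Catalog = Map[String, Seq[(String, DataType)]]

  // Replace each unresolved relation with one whose output columns are known.
  def resolveRelations(plan: Plan, catalog: Catalog): Plan = plan match {
    case UnresolvedRelation(t) =>
      val cols = catalog.getOrElse(t, sys.error(s"Table not found: $t"))
      Relation(t, cols.zipWithIndex.map { case ((n, dt), i) => AttributeReference(n, dt, i) })
    case Project(es, child) => Project(es, resolveRelations(child, catalog))
    case other => other
  }

  // Bind attribute names in a projection to the child's output columns.
  def resolveReferences(plan: Plan): Plan = plan match {
    case Project(es, child: Relation) =>
      val resolved = es.map {
        case UnresolvedAttribute(n) =>
          child.output.find(_.name == n).getOrElse(sys.error(s"Unresolved attribute: $n"))
        case e => e
      }
      Project(resolved, child)
    case other => other
  }

  def analyze(plan: Plan, catalog: Catalog): Plan =
    resolveReferences(resolveRelations(plan, catalog))
}
```

In this sketch a missing table or column raises an error immediately, which folds the semantic-checking step into resolution; the slide lists these as separate rules.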
  • 15. Logical Plan Optimizing
      • Simplify the logical plan tree based on relational / logical algebra and common sense (rule based)
      • Example rules
          • Expression optimization
              • NullPropagation, ConstantFolding, SimplifyFilters, SimplifyCasts, OptimizeIn, etc.
          • Filter pushdown
              • UnionPushdown, PushPredicateThroughProject, PushPredicateThroughJoin, ColumnPruning
          • Combine operators
              • CombineFilters, CombineLimits
      • Concrete examples
          • IsNull('a + null) => IsNull(null) => Literal(true)
          • SELECT a.key, b.key FROM a JOIN b ON a.key=b.key AND b.key>10 => SELECT a.key, b.key FROM a JOIN (SELECT key FROM b WHERE key>10) b ON a.key=b.key
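The first concrete example, IsNull('a + null) => IsNull(null) => Literal(true), can be reproduced on a toy expression tree. This is a minimal sketch under stated assumptions: the expression classes, the `Rule` alias, `transformUp`, and `optimize` are hypothetical simplifications that borrow the rule names from the slide, not Catalyst's real API. Rules are partial functions applied bottom-up until the tree stops changing.

```scala
// Toy model of rule-based expression optimization (hypothetical; not Catalyst's classes).
object ExpressionRulesSketch {
  sealed trait Expr
  case class Literal(value: Option[Int]) extends Expr // None models SQL NULL
  case class BoolLiteral(value: Boolean) extends Expr
  case class Attribute(name: String) extends Expr
  case class Add(left: Expr, right: Expr) extends Expr
  case class IsNull(child: Expr) extends Expr

  type Rule = PartialFunction[Expr, Expr]

  // NULL propagation: arithmetic with a NULL operand is NULL; IsNull of a literal is known.
  val nullPropagation: Rule = {
    case Add(Literal(None), _) | Add(_, Literal(None)) => Literal(None)
    case IsNull(Literal(None)) => BoolLiteral(true)
    case IsNull(Literal(Some(_))) => BoolLiteral(false)
  }

  // Constant folding: evaluate an addition whose operands are both non-NULL literals.
  val constantFolding: Rule = {
    case Add(Literal(Some(a)), Literal(Some(b))) => Literal(Some(a + b))
  }

  // Rewrite the children first, then try the rule at this node (bottom-up).
  def transformUp(e: Expr, rule: Rule): Expr = {
    val withNewChildren = e match {
      case Add(l, r) => Add(transformUp(l, rule), transformUp(r, rule))
      case IsNull(c) => IsNull(transformUp(c, rule))
      case leaf => leaf
    }
    rule.applyOrElse(withNewChildren, (x: Expr) => x)
  }

  // Apply every rule in turn, repeating until a fixed point is reached.
  def optimize(e: Expr, rules: Seq[Rule] = Seq(nullPropagation, constantFolding)): Expr = {
    val next = rules.foldLeft(e)((acc, r) => transformUp(acc, r))
    if (next == e) e else optimize(next, rules)
  }

  def main(args: Array[String]): Unit = {
    println(optimize(IsNull(Add(Attribute("a"), Literal(None)))))
    println(optimize(Add(Literal(Some(2)), Literal(Some(3)))))
  }
}
```

Expressing each rule as a partial function keeps rules independent of tree traversal, so adding a rule means adding one more pattern match rather than changing the optimizer loop.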
  • 16. Spark SQL Dialects
      [Diagram, stages: Frontend, Catalyst, Backend]
      • HiveContext: Hive Parser, Hive AST, Hive Catalog, Hive+Spark Planner
      • SQLContext: SQL Parser, DSL API, Simple Catalog, Spark Planner
      • Shared: Unresolved Logical Plan, Logical Plan, Optimized Logical Plan, Execution Operators
      • XXXContext: XX Parser / API, XXX Catalog, XXX Planner
      • Frontend + Catalyst + SPI + Backend || Tool
  • 17. Spark Plan (Physical Plan)
    • SparkPlan: root class of the Spark plan operators (physical plan operators for Spark)
        abstract class SparkPlan {
          def children: Seq[SparkPlan]
          /** Specifies how data is partitioned across different nodes in the cluster. */
          def outputPartitioning: Partitioning = UnknownPartitioning(0)
          /** Specifies any partition requirements on the input data for this operator. */
          def requiredChildDistribution: Seq[Distribution] =
            Seq.fill(children.size)(UnspecifiedDistribution)
          def execute(): RDD[Row]
        }
    • Spark Plan Operators
        • Joins: BroadcastHashJoin, CartesianProduct, HashOuterJoin, LeftSemiJoinHash etc.
        • Aggregate: Aggregate
        • BasicOperators: Distinct, Except, Filter, Limit, Project, Sort, Union etc.
        • Shuffle: AddExchange, Exchange
        • Commands: CacheTableCommand, DescribeCommand, ExplainCommand etc.
        • ..
    • Spark Strategy (SparkPlanner)
        • Maps the optimized logical plan to a Spark plan
    • Optimized Logical Plan → Spark Plan → RDD → Spark Execution
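The mapping from logical operators to executable physical operators can be sketched as follows. This is a toy model under stated assumptions: `PhysicalPlan`, `ScanExec`, `FilterExec`, `LimitExec` and `plan` are hypothetical names, and a plain `Iterator` stands in for the `RDD[Row]` that the real `SparkPlan.execute()` returns.

```scala
// Toy model of physical planning; none of these types are Spark's own.
object PlannerSketch {
  type Row = Seq[Any]

  // Logical operators describe WHAT to compute.
  sealed trait LogicalPlan
  final case class Scan(rows: Seq[Row]) extends LogicalPlan
  final case class Filter(condition: Row => Boolean, child: LogicalPlan) extends LogicalPlan
  final case class Limit(n: Int, child: LogicalPlan) extends LogicalPlan

  // Physical operators describe HOW to compute it.
  // An Iterator stands in for the RDD[Row] of the real SparkPlan.
  trait PhysicalPlan {
    def children: Seq[PhysicalPlan]
    def execute(): Iterator[Row]
  }
  final case class ScanExec(rows: Seq[Row]) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Nil
    def execute(): Iterator[Row] = rows.iterator
  }
  final case class FilterExec(condition: Row => Boolean, child: PhysicalPlan) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Seq(child)
    def execute(): Iterator[Row] = child.execute().filter(condition)
  }
  final case class LimitExec(n: Int, child: PhysicalPlan) extends PhysicalPlan {
    def children: Seq[PhysicalPlan] = Seq(child)
    def execute(): Iterator[Row] = child.execute().take(n)
  }

  // A single "strategy": map each logical operator to one physical operator.
  def plan(logical: LogicalPlan): PhysicalPlan = logical match {
    case Scan(rows)          => ScanExec(rows)
    case Filter(cond, child) => FilterExec(cond, plan(child))
    case Limit(n, child)     => LimitExec(n, plan(child))
  }

  val logical: LogicalPlan =
    Limit(2, Filter(row => row(0).asInstanceOf[Int] > 1, Scan(Seq(Seq(1), Seq(2), Seq(3), Seq(4)))))
  val result: List[Row] = plan(logical).execute().toList
}
```

A real planner may choose among several physical operators for one logical operator (for example, different join implementations); this sketch keeps a one-to-one mapping for brevity.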
  • 18. Case Study for Catalyst in Depth
    • StreamSQL
        • Reuses the HiveContext, but with a different frontend / backend.
        • Frontend: a slight modification of the HiveParser.
        • Backend: a custom query planner that generates the physical plan based on Spark DStream.
        • JIRA: https://issues.apache.org/jira/browse/SPARK-1363
        • Source: https://github.com/thunderain-project/StreamSQL
    • SQL 92 Support
        • Reuses the HiveContext, but with a different frontend.
        • Frontend: a modified HiveParser & Hive QL translator.
        • https://github.com/intel-hadoop/spark/tree/panthera
    • Pig on Spark POC
        • Modifies the SQLContext.
        • Provides a PigParser to translate the Pig script into a Catalyst unresolved logical plan.
        • https://github.com/databricks/pig-on-spark
  • 19. SQL Core API Introduction
  • 20. SchemaRDD
    • What's SchemaRDD?
        • Spark SQL Core API (in Scala)
        class SchemaRDD(
            @transient val sqlContext: SQLContext,
            @transient val baseLogicalPlan: LogicalPlan)
          extends RDD[Row](sqlContext.sparkContext, Nil)
    • Create a SchemaRDD instance from
        • Plain SQL text: def sql(sqlText: String)
        • An existing logical plan: def logicalPlanToSparkQuery(plan: LogicalPlan)
        • A Spark RDD: def createSchemaRDD[A <: Product: TypeTag](rdd: RDD[A])
        • A Spark RDD with a schema: def applySchema(rowRDD: RDD[Row], schema: StructType)
        • A file in a frequently used format (json, parquet, etc.): def parquetFile(path: String)
    • SQL DSL
        • select, where, join, orderBy, limit, groupBy, unionAll, etc.
    • Data Sink
        • Persist the data with the specified storage level: def persist(newLevel: StorageLevel)
        • Save the data as a Parquet file: def saveAsParquetFile(path: String)
        • Register the data as a temporary table: def registerTempTable(tableName: String)
        • Insert the data into an existing table: def insertInto(tableName: String, overwrite: Boolean)
        • ….
    • Java API / Python API supported
  • 21. Conceptual State Transition Diagram
    [Diagram: an RDD, SQL text, a file or a table becomes a SchemaRDD (an unresolved logical plan) through the SQL API; the SchemaRDD is turned into an RDD (*) and written to a file, memory, etc.]
    * Unresolved Logical Plan → RDD (Unresolved Logical Plan → Logical Plan → Optimized Logical Plan → Physical Plan → Spark RDD)
  • 22. Code Example
        sbt/sbt hive/console
        // A HiveContext is created by default and its members are imported,
        // so we can call its methods directly.

        // Create a Hive table and load data into it.
        sql("CREATE TABLE IF NOT EXISTS kv_text(key INT, value STRING)")
        sql("LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE kv_text")

        // Create a normal RDD.
        case class KV(key: Int, value: String)
        val kvRdd = sparkContext.parallelize((1 to 100).map(i => KV(i, s"val_$i")))

        // kvRdd is implicitly converted into a SchemaRDD, filtered,
        // and registered as a temporary table.
        kvRdd.where('key >= 1).where('key <= 5).registerTempTable("kv_rdd")

        // Load a JSON file and register it as a temporary table.
        jsonFile("/tmp/file2.json").registerTempTable("kv_json")

        // Join the three tables (no join condition is given, as on the original slide).
        val result = sql("SELECT a.key, b.value, c.key from kv_text a join kv_rdd b join kv_json c")

        // Check each field for null before reading it with a typed getter.
        result.collect().foreach(row => {
          val f0 = if (row.isNullAt(0)) "null" else row.getInt(0)
          val f1 = if (row.isNullAt(1)) "null" else row.getString(1)
          val f2 = if (row.isNullAt(2)) "null" else row.getInt(2)
          println(s"result:$f0, $f1, $f2")
        })
  • 23. vs. Shark & Hive
  • 24.
    • Background of Shark / Hive-on-Spark / Spark SQL
        • Shark was the first SQL-on-Spark product, based on earlier versions of Hive (with a rewritten QueryPlanner that generates a Spark RDD-based physical plan); Shark is now retired and replaced by Spark SQL.
        • Hive-on-Spark is a QueryPlanner extension of Hive; it focuses on the SparkPlanner and the implementation of Spark RDD-based physical operators. Spark users automatically get the whole set of Hive's rich features, including any new features that Hive might introduce in the future.
        • Spark SQL is a new SQL engine on Spark, developed from scratch.
    • Functionality
        • From a data analyst's perspective, Spark SQL supports almost all of the functionality that Hive provides.
        • SQL API on the Spark shell vs. Pig Latin.
        • Spark SQL is an extensible / flexible framework for developers (based on Catalyst); new extensions are very easy to integrate.
    • Implementation Philosophy of Spark SQL (Simple & Natural)
        • Makes heavy use of Scala features (pattern matching, implicit conversions, partial functions etc.).
        • A large number of small, simple rules bind, analyze and optimize the logical plan & expression tree, and generate the physical plan.
        • In-memory computing & maximized memory usage (cache-related SQL API & commands).
        • Spark SQL benefits a lot from Hive by reusing its components (Hive QL Parser, Metastore, SerDe, StorageHandler etc.).
    • Stability
        • Hive is the de facto standard for SQL on big data so far; it has proven itself as a productive tool over several years in practice, and many corner cases are covered by its continuous enhancements.
        • Spark SQL has just started its journey (~0.5 year); we need more time to prove / improve it.
  • 25. Our Contributions
  • 26.
    • 60+ PRs in total, 50+ merged into Spark SQL
    • Features
        • Add serde support for CTAS (PR2570)
        • Support the Grouping Set (PR1567)
        • Support EXTENDED for EXPLAIN (PR1982)
        • Cross join support in HiveQL (PR2124)
        • Add support for left semi join (PR837)
        • Add Date type support (PR2344)
        • Add Timestamp type support (PR275)
        • Add Expression RLike & Like support (PR224)
        • ..
    • Performance Enhancement / Improvement
        • Avoid table creation in logical plan analyzing for CTAS (PR1846)
        • Extract the join keys from the join condition (PR1190)
        • Reduce the expression tree object creations for aggregation functions (min/max) (PR2113)
        • Push down the join filter & predicates for outer join (PR1015)
        • Constant folding for expression optimization (PR482)
        • Fix performance issue in data type casting (PR679)
        • Do not limit the argument type for Hive simple UDFs (PR2506)
        • Use GenericUDFUtils.ConversionHelper for simple UDF type conversions (PR2407)
        • Select null from table would throw a MatchError (PR2396)
        • Type coercion should support every type having a null value (PR2246)
        • ….
    • Bug Fixes
        • ….
  • 27. Useful Materials
  • 28.
    • References
        • http://spark-summit.org/wp-content/uploads/2013/10/J-Michael-Armburst-catalyst-spark-summit-dec-2013.pptx
        • http://spark-summit.org/wp-content/uploads/2014/07/Performing-Advanced-Analytics-on-Relational-Data-with-Spark-SQL-Michael-Armbrust.pdf
        • https://www.youtube.com/watch?v=GQSNJAzxOr8
        • http://www.slideshare.net/ueshin/20140908-spark-sql-catalyst?qid=3bb8abf4-3d8d-433f-9397-c24c5256841d
        • https://cwiki.apache.org/confluence/display/Hive/Hive+on+Spark
        • http://web.stanford.edu/class/cs346/qpnotes.html
        • http://www.cse.ohio-state.edu/hpcs/WWW/HTML/publications/papers/TR-11-7.pdf
        • http://codex.cs.yale.edu/avi/db-book/db6/slide-dir/PDF-dir/ch13.pdf
        • https://courses.cs.washington.edu/courses/cse444/12sp/lectures/
        • http://www.cs.uiuc.edu/class/sp06/cs411/lectures.html
    • User Mail List
        • user@spark.apache.org
    • Dev Mail List
        • dev@spark.apache.org
    • Jira
        • https://issues.apache.org/jira/browse/SPARK/component/12322623
    • DevDoc
        • https://spark.apache.org/docs/latest/sql-programming-guide.html
    • Github
        • https://github.com/apache/spark/tree/master/sql
  • 29. Notice and Disclaimers:
    • Intel, the Intel logo are trademarks of Intel Corporation in the U.S. and/or other countries. *Other names and brands may be claimed as the property of others. See Trademarks on intel.com for full list of Intel trademarks.
    • Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
    • Intel technologies may require enabled hardware, specific software, or services activation. Check with your system manufacturer or retailer.
    • No computer system can be absolutely secure. Intel does not assume any liability for lost or stolen data or systems or any damages resulting from such losses.
    • You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
    • No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
    • The products described may contain design defects or errors known as errata which may cause the product to deviate from publish.