1. Shark: SQL and Rich Analytics at Scale
November 26, 2012
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
AMPLab, EECS, UC Berkeley
Presented by Alexander Ivanichev
Authors: Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica
2. Agenda
• Problems with existing database solutions
• What is Shark?
• Shark architecture overview
• Extending Spark for SQL
• Shark performance and examples
3. Challenges in modern data analysis
Data size is growing:
• Processing has to scale out over large clusters
• Faults and stragglers complicate database design
Complexity of analysis is increasing:
• Massive ETL (e.g., web crawling)
• Machine learning, graph processing
• Leads to long-running jobs
4. Existing analysis solutions
MPP databases
• Vertica, SAP HANA, Teradata, Google Dremel, Google PowerDrill, Cloudera Impala...
• Fast!
• Generally not fault-tolerant; long-running queries become challenging as clusters scale up
• Lack rich analytics such as machine learning and graph algorithms
MapReduce
• Apache Hive, Google Tenzing, Turn Cheetah...
• Enables fine-grained fault tolerance, resource sharing, and scalability
• Expressive enough for machine learning algorithms
• High latency, so it is dismissed for interactive workloads
5. What's good about MapReduce?
A data warehouse on MapReduce (e.g., Hive):
• Scales out to thousands of nodes in a fault-tolerant manner
• Puts structure/schema onto HDFS data (schema-on-read)
• Compiles HiveQL queries into MapReduce jobs
• Flexible and extensible: supports UDFs, scripts, custom serializers, and storage formats
Data-parallel operations:
• Apply the same operations to a defined set of data
Fine-grained, deterministic tasks:
• Enable fault tolerance and straggler mitigation
But slow: 30+ seconds even for simple queries
6. Shark research
Shows that the MapReduce model can be extended to support SQL efficiently:
• Started from a powerful MR-like engine (Spark)
• Extended the engine in various ways
The artifact: Shark, a fast engine on top of MR:
• Performant SQL
• Complex analytics in the same engine
• Maintains MR benefits, e.g., fault tolerance
7. What is Shark?
A new data analysis system that is:
MapReduce-based
• Built on top of the RDD abstraction and the Spark execution engine
• Scales out and is fault-tolerant
Performant
• Speedups of up to 100x over Hive
• Supports low-latency, interactive queries through in-memory computation
Expressive and flexible
• Supports both SQL and complex analytics such as machine learning
• Compatible with Apache Hive (storage, UDFs, types, metadata, etc.)
8. Shark on top of the Spark engine
Fast MapReduce-like engine:
• In-memory storage for fast iterative computations
• General execution graphs
• Designed for low latency (~100 ms jobs)
Compatible with Hadoop storage APIs:
• Reads from and writes to any Hadoop-supported system, including HDFS, HBase, SequenceFiles, etc.
Growing open-source platform:
• 17+ companies contributing code
9. Shark architecture
[Architecture diagram: a client (CLI or JDBC) connects to the master process; its driver contains the SQL parser, query optimizer, physical plan execution, and cache manager, all running on top of Spark; the master also talks to the Hive metastore and HDFS.]
10. Shark architecture
[Cluster diagram: the master node runs the master process, the resource manager scheduler, the HDFS NameNode, and the metastore (system catalog); each slave node runs the Spark runtime (execution engine plus memstore), a resource manager daemon, and an HDFS DataNode.]
Shark can query an existing Hive warehouse without modification and return results much faster.
11. Shark architecture - overview
Spark:
• Supports partial DAG execution
• Optimizes the join algorithm at runtime
Features of Shark:
• Supports general computation
• Provides an in-memory storage abstraction (the RDD)
• Engine is optimized for low latency
12. Shark architecture - overview
RDD:
• Spark's main abstraction: the Resilient Distributed Dataset (RDD)
• A collection stored in an external storage system, or a dataset derived from another RDD
• Can contain arbitrary data types
Benefits of RDDs:
• Data returns at the speed of DRAM
• Lineage enables speedy recovery
• Immutability is a foundation for relational processing
Example RDD operators (used in the sketch below): groupByKey, join, mapPartitions, sortByKey, distinct, filter
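Below is a minimal, illustrative Scala sketch of these operators against the classic Spark RDD API. The input path, the tab-separated schema, and all names are hypothetical; it shows how derived RDDs form a lineage chain, and is not Shark-specific code.

import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

    // Base RDD backed by external storage (hypothetical path and format).
    val lines = sc.textFile("hdfs:///data/pageviews.tsv")

    // Each transformation yields a new immutable RDD and records lineage.
    val pairs = lines
      .map(_.split("\t"))
      .filter(_.length == 2)                 // filter: drop malformed rows
      .map(f => (f(0), f(1).toLong))         // (page, views)

    val grouped = pairs.groupByKey()         // groupByKey: all view counts per page
    val totals  = pairs.reduceByKey(_ + _)   // aggregate per key
    val sorted  = totals.sortByKey()         // sortByKey: order by page name
    val pages   = pairs.keys.distinct()      // distinct: unique page names

    // mapPartitions works on a whole partition at a time.
    val rowsPerPartition = sorted.mapPartitions(it => Iterator(it.size))

    // join: combine two pair RDDs on their keys.
    val joined = totals.join(grouped.mapValues(_.size))

    joined.cache()          // keep the derived dataset in DRAM for reuse
    println(joined.count()) // lost partitions would be rebuilt via lineage
    sc.stop()
  }
}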
13. Shark architecture - overview
Fault tolerance guarantees:
• Shark can tolerate the loss of any set of worker nodes
• Recovery is parallelized across the cluster
• The deterministic nature of RDDs also enables straggler mitigation
• Recovery works even in queries that combine SQL and machine learning UDFs
14. Executing SQL over RDDs
The process of executing a SQL query includes:
• Query parsing
• Logical plan generation
• Physical plan generation
The physical plan ultimately executes as operations on RDDs, as the sketch below illustrates.
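As a rough illustration (not Shark's actual planner code), here is how the physical plan for a simple query might bottom out in RDD operations. WikiRow and the query are hypothetical stand-ins modeled on the example table on the next slide.

import org.apache.spark.rdd.RDD

case class WikiRow(id: Long, title: String, text: String)

object SqlOverRdds {
  // Physical plan for: SELECT title FROM wiki WHERE text LIKE '%Berkeley%'
  // A scan feeds a filter, which feeds a projection; each operator is an
  // RDD transformation, so the query inherits RDD fault tolerance for free.
  def run(wikiRows: RDD[WikiRow]): RDD[String] =
    wikiRows
      .filter(_.text.contains("Berkeley")) // WHERE clause
      .map(_.title)                        // projection
}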
15. Analyzing data on Shark
CREATE EXTERNAL TABLE wiki
  (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://spark-data/wikipedia-sample/';

SELECT COUNT(*) FROM wiki_small WHERE TEXT LIKE '%Berkeley%';
16. Caching data in Shark
CREATE TABLE wiki_small_in_mem TBLPROPERTIES ("shark.cache" = "true") AS
SELECT * FROM wiki;

CREATE TABLE wiki_cached AS
SELECT * FROM wiki;

Each creates a table that is stored in the cluster's memory using RDD.cache().
17. Tuning the degree of parallelism
• Shark relies on Spark to infer the number of map tasks (automatically, based on input size)
• The number of reduce tasks needs to be specified by the user:
SET mapred.reduce.tasks=499;
• Too small a number causes out-of-memory errors on slaves
• It is usually OK to set a higher value, since the overhead of task launching is low in Spark
18. Extending Spark for SQL
• Partial DAG execution
• Columnar memory store
• Machine learning integration
• Hash-based shuffle vs. sort-based shuffle
• Data co-partitioning
• Partition pruning based on range statistics
• Distributed data loading
• Distributed sorting
• Better push-down of limits
19. Partial DAG Execution (PDE)
Partial DAG execution enables:
• Static query optimization
• Dynamic query optimization
• Modification of the plan based on runtime statistics
Examples of statistics (collected per partition; sketch below):
• Partition size and record count
• List of "heavy hitters"
• Approximate histogram
How would you optimize the following query?
SELECT * FROM table1 a JOIN table2 b ON a.key = b.key
WHERE my_udf(b.field1, b.field2) = true;
• Hard to estimate cardinality!
• Without cardinality estimation, a cost-based optimizer breaks down
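A minimal sketch, not Shark's implementation, of the kind of per-partition statistics PDE can gather at a materialization point; PartitionStats and the threshold parameter are hypothetical. The driver can inspect these statistics before committing to the next stage's plan.

import org.apache.spark.rdd.RDD

case class PartitionStats(recordCount: Long, heavyHitters: Map[String, Long])

object PdeStats {
  def collect(pairs: RDD[(String, Long)], threshold: Long): Array[PartitionStats] =
    pairs.mapPartitions { it =>
      var n = 0L
      val freq = scala.collection.mutable.Map.empty[String, Long]
      for ((key, _) <- it) {
        n += 1
        freq(key) = freq.getOrElse(key, 0L) + 1
      }
      // Keep only the keys frequent enough to skew a downstream join.
      Iterator(PartitionStats(n, freq.filter(_._2 >= threshold).toMap))
    }.collect() // small enough to bring to the driver for plan decisions
}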
20. PDE - Join optimization
Skew handling and degree of parallelism:
• Task scheduling overhead
[Diagram: the same join producing its result two ways - a map join, in which one table is shipped to the mappers of the other, versus a shuffle join, in which both tables are repartitioned on the join key; both strategies are sketched below.]
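Here is an illustrative Scala sketch of the two strategies using plain RDD operations; the table types are hypothetical. PDE's contribution is choosing between them at run time, once the statistics above reveal actual table sizes and skew.

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object JoinStrategies {
  // Shuffle join: both inputs are repartitioned by key across the cluster.
  def shuffleJoin(a: RDD[(Int, String)], b: RDD[(Int, String)]): RDD[(Int, (String, String))] =
    a.join(b)

  // Map (broadcast) join: the small side is copied to every task, so the
  // large side never moves. Wins when one table is small; loses badly otherwise.
  def mapJoin(sc: SparkContext,
              large: RDD[(Int, String)],
              small: RDD[(Int, String)]): RDD[(Int, (String, String))] = {
    val smallSide = sc.broadcast(small.collectAsMap())
    large.flatMap { case (k, v) =>
      smallSide.value.get(k).map(w => (k, (v, w)))
    }
  }
}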
21. Columnar memory store
• Simply caching records as JVM objects is insufficient
• Shark employs column-oriented storage: a partition of columns is one MapReduce "record" (sketch below)
• Benefits: compact representation, CPU-efficient compression, cache locality

Row storage:    Column storage:
1  John   4.1   1     2     3
2  Mike   3.5   John  Mike  Sally
3  Sally  6.4   4.1   3.5   6.4
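A minimal sketch of the idea (hypothetical layout, not Shark's actual column format): a whole partition is stored as one object holding a primitive array per column, instead of one JVM object per row. Because the partition, not the row, is the unit the engine handles, compression and serialization can operate on whole columns at once.

final case class ColumnarPartition(ids: Array[Int], names: Array[String], scores: Array[Double])

object ColumnarStore {
  // Transpose row tuples into the columnar form of one partition.
  def fromRows(rows: Seq[(Int, String, Double)]): ColumnarPartition =
    ColumnarPartition(
      rows.map(_._1).toArray,  // id column: packed ints, no per-row object headers
      rows.map(_._2).toArray,  // name column
      rows.map(_._3).toArray)  // score column: packed doubles, cache-friendly scans

  // A scan that touches one column reads only that column's array.
  def sumScores(p: ColumnarPartition): Double = p.scores.sum
}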
22. Machine learning support
Language integration:
• Shark allows queries to perform logistic regression over a user database
• Example: a data analysis pipeline that performs logistic regression over a database
Execution engine integration:
• A common abstraction allows machine learning computations and SQL queries to share workers and cached data
• Enables end-to-end fault tolerance
Shark treats machine learning as a first-class citizen
23. Integrating machine learning with SQL
def logRegress(points: RDD[Point]): Vector = {
  var w = Vector(D, _ => 2 * rand.nextDouble - 1)
  for (i <- 1 to ITERATIONS) {
    val gradient = points.map { p =>
      val denom = 1 + exp(-p.y * (w dot p.x))
      (1 / denom - 1) * p.y * p.x
    }.reduce(_ + _)
    w -= gradient
  }
  w
}

val users = sql2rdd("SELECT * FROM user u JOIN comment c ON c.uid = u.uid")
val features = users.mapRows { row =>
  new Vector(extractFeature1(row.getInt("age")),
             extractFeature2(row.getStr("country")), ...)
}
val trainedVector = logRegress(features.cache())

A unified system for query processing and machine learning.
24. Machine learning support - Performance
[Bar chart: Conviva warehouse queries (1.7 TB), runtime in seconds (0-100) for queries Q1-Q4, comparing Shark, Shark (disk), and Hive; the bar labels 1.1, 0.8, 0.7, and 1 mark Shark's in-memory runtimes.]
25. Machine learning workflow comparison
1. Select the data of interest from the warehouse using SQL
2. Extract features
3. Apply iterative algorithms: logistic regression, k-means clustering

Per-iteration runtime (seconds):

                  Logistic regression   K-means clustering
Hadoop (text)     110                   155
Hadoop (binary)   80                    120
Shark             0.96                  4.1
26. Machine learning support
Machine learning on 1B records, 10 features/record; per-iteration runtime (seconds):

              K-means   Logistic regression
Hadoop        120       80
Shark/Spark   4.1       0.96
27. Summary
• By using Spark as the execution engine and employing both novel and traditional database techniques, Shark bridges the gap between MapReduce and MPP databases.
• Users can conveniently mix the best parts of SQL and MapReduce-style programming and avoid moving data between systems.
• Shark provides fine-grained fault recovery across both types of operations.
• With in-memory computation, users can choose to load high-value data into Shark's memory store for fast analytics.
• Shark can answer queries up to 100x faster than Hive, run machine learning up to 100x faster than Hadoop MapReduce, and achieve performance comparable to MPP databases.
28. Thank you
https://spark.apache.org/sql/