Spark SQL versus Apache Drill: Different Tools with Different Rules

© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning

What is Drill?

A query engine that is…
• Columnar and vectorized
• Optimistic and pipelined
• Compiled at runtime
• Late binding
• Extensible

Table Can Be an Entire Directory Tree
// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
group by errorLevel;
// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;

Basic Process
[Figure: ZooKeeper coordinating three Drillbits, each with a distributed cache and each sitting on DFS/HBase storage; a query arrives at one Drillbit]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node

Stages of Query Planning
[Figure: SQL query → Parser → Logical Planner (heuristic and cost based) → Physical Planner (cost based) → Foreman → plan fragments sent to Drillbits]

Query Execution
[Figure: Drillbit components. Endpoints: RPC, JDBC, ODBC. Parsers: SQL, HiveQL, Pig. Foreman: LogicalPlan, Optimizer, PhysicalPlan, Scheduler. Operators and Distributed Cache. Storage interface: HDFS, HBase, Mongo, Cassandra]

Batches of Values
• Value vectors
– List of values, with same schema
– With the 4-value semantics for each value
• Shipped around in batches
– max 256k bytes in a batch
– max 64K rows in a batch
• RPC designed for multiple replies to a request
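A minimal sketch of the two batch limits above, in plain Python rather than Drill's own code. The function name `batches` and the way row size is estimated (summed field lengths) are assumptions made for illustration; a real engine would measure its vector buffers.

```python
from typing import Iterable, Iterator, List, Sequence

MAX_ROWS = 64 * 1024    # "max 64K rows in a batch"
MAX_BYTES = 256 * 1024  # "max 256k bytes in a batch"


def batches(rows: Iterable[Sequence[bytes]],
            max_rows: int = MAX_ROWS,
            max_bytes: int = MAX_BYTES) -> Iterator[List[Sequence[bytes]]]:
    """Group rows into batches that respect both the row and the byte limit.

    Row size is approximated as the summed length of its fields
    (an assumption for this sketch).
    """
    current: List[Sequence[bytes]] = []
    used = 0
    for row in rows:
        size = sum(len(field) for field in row)
        # Close the batch if adding this row would exceed either limit.
        if current and (len(current) >= max_rows or used + size > max_bytes):
            yield current
            current, used = [], 0
        current.append(row)
        used += size
    if current:
        yield current
```

Whichever limit is hit first closes the batch, so wide rows produce short batches and narrow rows produce long ones.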

Fixed Value Vectors

Vectorization
• Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
– Manually code algorithms
• Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
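The bitmap idea can be illustrated with a small sketch. This is not Drill's ValueVector implementation; the class `NullableIntVector` is hypothetical, and Python only shows the layout (values in one buffer, validity bits in another), not real SIMD execution.

```python
from array import array
from typing import List, Optional


class NullableIntVector:
    """Fixed-width int values plus a validity bitmap (one bit per value)."""

    def __init__(self, values: List[Optional[int]]) -> None:
        self.length = len(values)
        # Nulls are stored as 0 so arithmetic never has to inspect them.
        self.data = array("q", [0 if v is None else v for v in values])
        self.validity = 0  # a Python int used as an arbitrary-width bitmap
        for i, v in enumerate(values):
            if v is not None:
                self.validity |= 1 << i

    def is_null(self, i: int) -> bool:
        return not (self.validity >> i) & 1

    def null_count(self) -> int:
        # One population count over the bitmap instead of a per-row test.
        return self.length - bin(self.validity).count("1")

    def sum(self) -> int:
        # Branch-free: null slots hold 0 and contribute nothing.
        return sum(self.data)
```

Keeping nulls out of the value buffer is what lets the inner loop run without a conditional per record.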

Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler
From http://bit.ly/16Xk32x

Drill compiler
[Figure: CodeModel generates code → Janino compiles runtime byte-code → byte-code of the two classes (runtime and precompiled templates) is merged → loaded class]

Optimistic
[Chart: speed vs. check-pointing, scale 0 to 160, across cmd pipeline, small db, med db, large db, dw, compilation, hadoop; ranges from "No need to checkpoint" to "Checkpoint frequently", with Apache Drill marked]

Optimistic Execution
• Recovery code trivial
– Running instances discard the failed query’s intermediate state
• Pipelining possible
– Send results as soon as batch is large enough
– Requires barrier-less decomposition of query

Pipelining
• Record batches are pipelined between nodes
– ~256kB usually
• Unit of work for Drill
– Operators work on a batch
• Operator reconfiguration happens at batch boundaries
[Figure: record batches flowing between three Drillbits]
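Batch-at-a-time pipelining can be sketched with Python generators. The operator names (`scan`, `filter_op`, `count_by`) are invented for this example and do not correspond to Drill operators; the point is only that each stage forwards a batch as soon as it is ready, with no barrier between stages.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Batch = List[dict]


def scan(rows: Iterable[dict], batch_size: int) -> Iterator[Batch]:
    """Emit a record batch as soon as it is full."""
    batch: Batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def filter_op(batches: Iterable[Batch],
              keep: Callable[[dict], bool]) -> Iterator[Batch]:
    """Operator that handles one batch at a time and forwards it downstream."""
    for batch in batches:
        kept = [row for row in batch if keep(row)]
        if kept:
            yield kept


def count_by(batches: Iterable[Batch], key: str) -> Dict[str, int]:
    """Aggregate incrementally; no upstream stage is fully materialized."""
    counts: Dict[str, int] = {}
    for batch in batches:
        for row in batch:
            counts[row[key]] = counts.get(row[key], 0) + 1
    return counts
```

For example, `count_by(filter_op(scan(logs, 2), lambda r: r["errorLevel"] != "INFO"), "errorLevel")` counts error levels while only ever holding one small batch per stage.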

Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
– when data larger than memory
[Figure: Drillbit memory; overflow uses disk]

Cost-based Optimization
• Using Optiq, an extensible framework
– Pluggable rules, and cost model
• Rules for distributed plan generation
– Insert Exchange operator into physical plan
– Optiq enhanced to explore parallel query plans
• Pluggable cost model
– CPU, IO, memory, network cost (data locality)
– Storage engine features (HDFS vs HIVE vs HBase)
[Figure: query optimizer with pluggable rules and a pluggable cost model]

What is SparkSQL?

What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
– Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
– Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable objects
• Embedded in a real language!

In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly

Many Similarities
[Figure: Java, Scala, Python and the SQL parser all produce a LogicalPlan (for example group, filter, filter), which the Optimizer turns into a PhysicalPlan]

Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed cost models
• Most RDDs don’t have a zero-copy capability
• Spark inherits the JVM memory model, with very limited use of off-heap memory

scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
|  a|     b|   c|
+---+------+----+
|  3|[3, 2]| xyz|
|  7|  null| wxy|
|  7|    []|null|
+---+------+----+

scala> sqlContext.sql(
"select a, explode(b) b_v from json.`bug.json`"
).show
+---+---------+
|  a|      b_v|
+---+---------+
|  3|        3|
|  3|        2|
+---+---------+
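The second result has rows only for a = 3. A plain-Python sketch of that behaviour follows, assuming `bug.json` holds the same three records as `foo.json` (the slides do not say so). The function `explode_keep_empty` is a hypothetical outer variant added for contrast, not something shown in the deck.

```python
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, Optional[List[int]]]


def explode(rows: List[Row]) -> Iterator[Tuple[int, int]]:
    """One output row per array element; null or empty arrays yield nothing."""
    for a, b in rows:
        for value in b or []:
            yield (a, value)


def explode_keep_empty(rows: List[Row]) -> Iterator[Tuple[int, Optional[int]]]:
    """Hypothetical outer variant: null or empty arrays yield one row with None."""
    for a, b in rows:
        if not b:
            yield (a, None)
        else:
            for value in b:
                yield (a, value)


# Columns a and b from the first query's output (assumed input).
rows: List[Row] = [(3, [3, 2]), (7, None), (7, [])]
```

With the plain version both a = 7 records vanish from the result, which is easy to miss when the arrays are optional in the source JSON.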

First Synthesis
• Drill has a more nuanced optimizer, better code generation
– This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
– This leads to much less memory pressure
• Drill has much stricter memory life-cycle
– Query is done and gone, no need for big GCs even on big memory
• Drill is all about SQL execution

But …
• Spark can optimize across entire program
– This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
– This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
– RDDs can be cached, persisted or simply recomputed as necessary
• Spark is not all about SQL execution

The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
– Strong impersonation semantics
– Cascading rights via views
– Queries co-exist in a cluster and reserve only their momentary resource requirements
• Spark focuses heavily on fully integrated execution models
– Any Spark function works with (almost) any RDD
– Memory residency of RDDs is the highest goal

Drill security
➢ End-to-end security from BI tools to Hadoop
➢ Standards-based PAM authentication
➢ 2-level user impersonation
➢ Fine-grained row- and column-level access control with Drill views; no centralized security repository required

Granular security permissions through Drill views
Raw File (/raw/cards.csv)
Owner: Admins. Permission: Admins.
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711
Data Scientist View (/views/maskedcards.view.drill)
Not a physical data copy
Owner: Admins. Permission: Data Scientists.
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111
Business Analyst View
Owner: Admins. Permission: Business Analysts.
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Ownership Chaining
• Combine self-service exploration with data governance
Raw file (/raw/cards.csv): John (Owner)
Dave  San Jose  CA  1374-7914-3865-4817
John  Boulder   CO  1374-9735-1794-9711
Data Scientist view (/views/V_Scientist): John (Owner), Jane (Read)
Dave  San Jose  CA  1374-1111-1111-1111
John  Boulder   CO  1374-1111-1111-1111
Analyst view (/views/V_Analyst): Jane (Owner), Jack (Read)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
Ownership chaining and access path: RAWFILE ← V_Scientist ← V_Analyst
Jack queries the view V_Analyst
1. Does Jack have access to V_Analyst? -> YES
   Who is the owner of V_Analyst? -> Jane
   Drill accesses V_Analyst as Jane (impersonation hop 1)
2. Does Jane have access to V_Scientist? -> YES
   Who is the owner of V_Scientist? -> John
   Drill accesses V_Scientist as John (impersonation hop 2)
3. Does John have permissions on raw file? -> YES
   Who is the owner of raw file? -> John
   Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
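The walk from Jack to the raw file can be sketched as a loop over owners. This is an illustration of the idea only, not Drill's implementation: the `Resource` record, the `resolve` function and the `max_hops` parameter are names invented here.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class Resource:
    owner: str
    readers: Set[str]
    source: Optional[str] = None  # underlying view or file; None for a raw file


def resolve(catalog: Dict[str, Resource], user: str, name: str,
            max_hops: int) -> List[str]:
    """Return the identities used along the access path.

    Raises PermissionError if access is denied or the chain is too long.
    """
    identities = [user]
    current_user = user
    hops = 0
    while True:
        res = catalog[name]
        if current_user != res.owner and current_user not in res.readers:
            raise PermissionError(f"{current_user} cannot read {name}")
        if res.source is None:
            return identities  # reached the raw file
        if current_user != res.owner:
            # Switch identity to the view owner: one impersonation hop.
            hops += 1
            if hops > max_hops:
                raise PermissionError("ownership chain too long")
            current_user = res.owner
            identities.append(current_user)
        name = res.source


catalog = {
    "V_Analyst": Resource(owner="Jane", readers={"Jack"}, source="V_Scientist"),
    "V_Scientist": Resource(owner="John", readers={"Jane"}, source="RAWFILE"),
    "RAWFILE": Resource(owner="John", readers=set()),
}
```

With `max_hops=2`, `resolve(catalog, "Jack", "V_Analyst", 2)` walks Jack, then Jane, then John, matching the two hops on the slide; with `max_hops=1` the same query is rejected.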

But was that the right question?

Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
– compare to SQLContext
• Define Datasets as Drill data sources and sinks
– Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small

What does the Spark and Drill integration look like
Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines
[Figure: files on disk (DFS) and RDDs in memory]

Is unification valuable?

Example of Unification
[Figure: a universe of callers and towers producing cdr data]

Simple Session Protocol
• Calls started at random intervals
• During calls, reconnection is done periodically
• Many log events are buffered and sent to current tower during active state
[Figure: state diagram with states idle, connect and active; transition labels start, SETUP, HELLO, FAIL, TIMEOUT, CONNECT, END]

The Resulting Data
• Signal strength reports
– Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another because callers buffer diagnostic data
*Location isn’t quite location … poetic license applied for

What can we do with it?

Baby Steps
• What does signal propagation look like?
select x, y, signal from cdr_stream where tower = 3
• Plot results to get a map of signal strength around a tower

Baby Steps
• What does tower coverage look like?
select x, y from cdr_stream
where tower = 3 and event_type = 'CONNECT'
• Plot results to get a map of coverage area for a tower

What about anomaly detection?

Detecting Tower Loss
It’s important to know if traffic is stopped or delayed
because of a problem…
But events from towers come at irregular intervals
How long after the last event should you begin to worry?

Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
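Under the Poisson assumption the "how long should you wait" question has a closed form: gaps between events are exponentially distributed, so a gap longer than x occurs with probability exp(-rate · x). A small sketch, with the function name `interval_threshold` chosen here for illustration:

```python
import math


def interval_threshold(rate: float, quantile: float) -> float:
    """Gap length exceeded with probability (1 - quantile) under a Poisson process.

    Inter-arrival times are exponential with mean 1 / rate, so
    P(gap > x) = exp(-rate * x), which gives x = -ln(1 - quantile) / rate.
    """
    if rate <= 0 or not 0 < quantile < 1:
        raise ValueError("rate must be positive and quantile in (0, 1)")
    return -math.log1p(-quantile) / rate
```

For a tower that normally reports 2 events per minute, `interval_threshold(2.0, 0.999)` is about 3.45 minutes: a silence longer than that would happen by chance only once in a thousand gaps.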

Converting Event Times to Anomaly
[Chart: event intervals with the 99.9%-ile and 99.99%-ile thresholds marked]

But in the real world, event rates often change

Time Intervals Are Key to Modeling Sporadic Events
[Plot: interval between events, dt (min), from 0 to 8, against time t (days), from 0 to 4]

After Rate Correction
[Plot: rate-corrected interval, dt/rate, from 0 to 10, against time t (days), from 0 to 4, with the 99.9%-ile and 99.99%-ile thresholds marked]

Detecting Anomalies in Sporadic Events
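[Figure: incoming event times $t_i$ pass through a delay of $n$ events ($\Delta n$); a rate predictor fed by the rate history supplies $\lambda$; the scaled gap is $\delta = \lambda (t_i - t_{i-n})$; a t-digest supplies the 99.97%-ile threshold $t$; an alarm is raised when $\delta > t$]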
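A minimal sketch of the detector in the diagram, assuming the predicted rate is available as a function of time. The names `scaled_gaps`, `empirical_quantile` and `alarms` are hypothetical, and a nearest-rank quantile over stored values stands in for the t-digest.

```python
import math
from typing import Callable, List, Sequence


def scaled_gaps(times: Sequence[float], rate: Callable[[float], float],
                n: int = 1) -> List[float]:
    """delta_i = rate(t_i) * (t_i - t_{i-n}) for every i >= n."""
    return [rate(times[i]) * (times[i] - times[i - n])
            for i in range(n, len(times))]


def empirical_quantile(values: Sequence[float], q: float) -> float:
    """Nearest-rank quantile; a simple stand-in for a t-digest."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]


def alarms(times: Sequence[float], rate: Callable[[float], float],
           threshold: float, n: int = 1) -> List[float]:
    """Event times whose rate-corrected gap exceeds the threshold."""
    deltas = scaled_gaps(times, rate, n)
    return [t for t, d in zip(times[n:], deltas) if d > threshold]
```

Multiplying each gap by the predicted rate makes quiet periods and busy periods comparable, so one threshold (for example `empirical_quantile(history, 0.9997)` over past scaled gaps) can serve the whole day.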

Propagation Anomalies
• What happens when something shadows part of the coverage field?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)

Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter,
received signal is dominated by
noise. This makes subtraction of
average value a bad algorithm.

Other Issues
• Finding anomalies in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher
than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs

Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
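One way to read the last two bullets is as a per-cluster percentile model whose log-probabilities are summed. The sketch below takes that reading; the function names, the choice of the lower tail (a shadowed region reports weaker signal) and the add-one smoothing that keeps p away from zero are all assumptions, not details from the slides.

```python
import math
from bisect import bisect_right
from typing import Dict, Iterable, Sequence, Tuple


def tail_probability(reference: Sequence[float], value: float) -> float:
    """Smoothed fraction of reference signal strengths at or below `value`."""
    ordered = sorted(reference)
    at_or_below = bisect_right(ordered, value)
    # Add-one smoothing (assumption) so the probability is never 0 or 1.
    return (at_or_below + 1) / (len(ordered) + 2)


def anomaly_score(reference_by_cluster: Dict[int, Sequence[float]],
                  reports: Iterable[Tuple[int, float]]) -> float:
    """Sum of -log(p) over (cluster, signal) reports; larger is more anomalous."""
    return sum(-math.log(tail_probability(reference_by_cluster[c], s))
               for c, s in reports)
```

Because the score is a sum, several mildly weak reports from the same cluster add up to a strong signal, which is the trade-off against k mentioned above: more clusters give finer location but fewer reports per cluster.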

Coverage Areas

Just One Tower

Cluster Reports for That Tower

Cluster Reports for That Tower
[Figure: reports for the tower grouped into clusters numbered 1 to 9]

General Dataflow
[Figure: dataflow stages]
• Group by tower, filter data (SQL)
• k-means cluster (MLlib)
• Split data (SQL)
• Location model (Java)
• Mark cluster (MLlib)
• Rate detection per cluster

Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
– But important distinctions remain
• Projects can work together to share key technology
– Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark
• Systems can work together even more deeply
– DrillContext makes integration first class

e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)

Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams

Thank you for coming today!

…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com
