© 2014 MapR Technologies
Contact Information
Ted Dunning
Chief Applications Architect at MapR Technologies
Committer & PMC for Apache’s Drill, Zookeeper & others
VP of Incubator at Apache Foundation
Email tdunning@apache.org tdunning@maprtech.com
Twitter @ted_dunning
What is Drill?
A Query engine that has…
• Columnar/Vectorized
• Optimistic/pipelined
• Runtime compilation
• Late binding
• Extensible
Table Can Be an Entire Directory Tree
// On a file
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs/2014/Jan/part0001.parquet`
group by errorLevel;
// On the entire data collection: all years, all months
select errorLevel, count(*)
from dfs.logs.`/AppServerLogs`
group by errorLevel;
Basic Process
[Diagram: ZooKeeper coordinating three Drillbits, each with a distributed cache and each sitting on top of DFS/HBase; a query arrives at one Drillbit]
1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf)
2. Drillbit generates execution plan based on query optimization & locality
3. Fragments are farmed to individual nodes
4. Result is returned to driving node
Stages of Query Planning
[Diagram: SQL Query -> Parser -> Logical Planner (heuristic and cost based) -> Physical Planner (cost based) -> Foreman -> plan fragments sent to Drillbits]
Query Execution
[Diagram: Drillbit components. Parsers: SQL Parser, HiveQL Parser, Pig Parser. Endpoints: RPC Endpoint, JDBC Endpoint, ODBC Endpoint. Core: LogicalPlan, Optimizer, PhysicalPlan, Foreman, Scheduler, Operators, Distributed Cache. StorageInterface to HDFS, HBase, Mongo, Cassandra]
Batches of Values
• Value vectors
– List of values, with same schema
– With the 4-value semantics for each value
• Shipped around in batches
– max 256k bytes in a batch
– max 64K rows in a batch
• RPC designed for multiple replies to a request
Fixed Value Vectors
Vectorization
• Drill operates on more than one record at a time
– Word-sized manipulations
– SIMD instructions
• GCC, LLVM and JVM all do various optimizations automatically
– Manually code algorithms
• Logical Vectorization
– Bitmaps allow lightning fast null-checks
– Avoid branching to speed CPU pipeline
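As a rough illustration of the bitmap idea, the Python sketch below keeps values in a flat list and tracks nulls in a separate validity bitmap, so a null check is a bit test and the sum needs no per-value branch. The class name `NullableIntVector` and its methods are hypothetical and are not Drill's ValueVector API.

```python
class NullableIntVector:
    """Toy nullable column: flat values plus one validity bit per slot."""

    def __init__(self, items):
        # Null slots store 0 so arithmetic over the flat list stays safe.
        self.values = [0 if x is None else x for x in items]
        self.validity = 0
        for i, x in enumerate(items):
            if x is not None:
                self.validity |= 1 << i  # set bit i when slot i holds a value

    def is_null(self, i):
        # A null check is a shift and a mask on the bitmap.
        return ((self.validity >> i) & 1) == 0

    def null_count(self):
        # Count set bits instead of inspecting every value.
        return len(self.values) - bin(self.validity).count("1")

    def sum(self):
        # Multiply by the validity bit rather than branching on null.
        return sum(v * ((self.validity >> i) & 1)
                   for i, v in enumerate(self.values))


vector = NullableIntVector([3, None, 7, None])
print(vector.null_count(), vector.sum())
```

A real engine would pack the bitmap into machine words and process many slots per instruction; the sketch only shows the layout.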
Runtime Compilation is Faster
• JIT is smart, but more gains with runtime compilation
• Janino: Java-based Java compiler
From http://bit.ly/16Xk32x
Drill compiler
[Diagram: CodeModel generates code -> Janino compiles runtime byte-code; precompiled byte-code templates are supplied alongside it -> merge byte-code of the two classes -> loaded class]
Optimistic
[Chart: Speed vs. check-pointing. Categories: cmd pipeline, small db, med db, large db, dw, compilation, hadoop. The range runs from "No need to checkpoint" to "Checkpoint frequently", with Apache Drill marked on it]
Optimistic Execution
• Recovery code trivial
– Running instances discard the failed query’s intermediate state
• Pipelining possible
– Send results as soon as batch is large enough
– Requires barrier-less decomposition of query
Pipelining
• Record batches are pipelined between nodes
– ~256kB usually
• Unit of work for Drill
– Operators work on a batch
• Operator reconfiguration happens at batch boundaries
[Diagram: record batches flowing between three Drillbits]
Pipelining
• Random access: sort without copy or restructuring
• Avoids serialization/deserialization
• Off-heap (no GC woes when lots of memory)
• Read/write to disk
– when data larger than memory
[Diagram: Drillbit memory overflow uses disk]
Cost-based Optimization
• Using Optiq, an extensible framework
– Pluggable rules, and cost model
• Rules for distributed plan generation
– Insert Exchange operator into physical plan
– Optiq enhanced to explore parallel query plans
• Pluggable cost model
– CPU, IO, memory, network cost (data locality)
– Storage engine features (HDFS vs Hive vs HBase)
[Diagram: query optimizer with pluggable rules and a pluggable cost model]
What is SparkSQL?
What is Spark SQL
• Essentially syntactic sugar over a limited subset of Spark
• Inherits all the virtues (and vices) of Spark
– Lambdas can serve as UDFs (has subtle issues for performance)
• Inputs have to be loaded
– Perhaps lazily, not obvious when load actually happens
• Not designed as a streaming engine, requires more memory
• Some JSON support, but not so much for large or variable objects
• Embedded in a real language!
In More Detail
• A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs)
• SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs)
• Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly
Many Similarities
[Diagram: SQL Parser, Java, Scala and Python front ends feed a LogicalPlan; the Optimizer turns it into a PhysicalPlan made of operators such as group, filter, filter]
Important Differences
• Spark execution assumes RDDs are a complete representation, not a stream of row batches
• Input sources don’t inject optimization rules, nor expose detailed cost models
• Most RDDs don’t have a zero-copy capability
• Spark inherits JVM memory model, very limited use of off-heap
scala> sqlContext.sql("select * from json.`foo.json`").show
+---+------+----+
| a| b| c|
+---+------+----+
| 3|[3, 2]| xyz|
| 7| null| wxy|
| 7| []|null|
+---+------+----+
scala> sqlContext.sql(
"select a, explode(b) b_v from json.`bug.json`"
).show
+---+---------+
| a| b_v|
+---+---------+
| 3| 3|
| 3| 2|
+---+---------+
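Comparing the two outputs, `explode(b)` yields one row per array element, so rows whose `b` is null or empty produce nothing. The plain-Python model below reproduces that behavior; it assumes `bug.json` holds the same three rows shown for `foo.json` (the slides do not say so), and the function `explode` here is an illustrative stand-in, not Spark's implementation.

```python
def explode(rows, column, alias):
    """Emit one output row per element of rows[column]; null or empty arrays emit none."""
    out = []
    for row in rows:
        for item in row.get(column) or []:
            exploded = {k: v for k, v in row.items() if k != column}
            exploded[alias] = item
            out.append(exploded)
    return out


# Assumed input: the three rows displayed on the previous slide.
rows = [
    {"a": 3, "b": [3, 2], "c": "xyz"},
    {"a": 7, "b": None, "c": "wxy"},
    {"a": 7, "b": [], "c": None},
]

for r in explode(rows, "b", "b_v"):
    print(r["a"], r["b_v"])
```

Both rows with `a = 7` disappear from the result, which matches the two-row table above.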
First Synthesis
• Drill has a more nuanced optimizer, better code generation
– This often leads to ~2x speed advantage
• Drill has ValueVector and row batches
– This leads to much less memory pressure
• Drill has much stricter memory life-cycle
– Once a query is done its memory is gone, so no need for big GCs even with big memory
• Drill is all about SQL execution
But …
• Spark can optimize across entire program
– This often leads to ~2x speed advantage
• Spark has much more flexible memory structures
– This can lead to much less memory pressure
• Spark has much more flexible RDD life-cycle
– RDDs can be cached, persisted or simply recomputed as necessary
• Spark is not all about SQL execution
The Really Big Differences
• Drill focuses heavily on secure, multi-tenant access to data
– Strong impersonation semantics
– Cascading rights via views
– Queries co-exist in a cluster and reserve only their momentary resource requirements
• Spark focuses heavily on fully integrated execution models
– Any Spark function works with (almost) any RDD
– Memory residency of RDDs is the highest goal
Drill security
➢ End-to-end security from BI tools to Hadoop
➢ Standards-based PAM authentication
➢ Two-level user impersonation
➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required
Granular security permissions through Drill views
Raw File (/raw/cards.csv)
Owner: Admins; Permission: Admins
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist View (/views/maskedcards.view.drill), not a physical data copy
Owner: Admins; Permission: Data Scientists
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Business Analyst View
Owner: Admins; Permission: Business Analysts
Name  City      State
Dave  San Jose  CA
John  Boulder   CO
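To make "not a physical data copy" concrete, here is a small Python sketch in which each view is a lazy transformation over the raw rows: one masks the card number, the other drops the column. The function names and the masking rule (keep the first group, replace the rest with 1111) are assumptions inferred from the sample rows, not Drill view syntax.

```python
RAW_CARDS = [
    {"name": "Dave", "city": "San Jose", "state": "CA", "card": "1374-7914-3865-4817"},
    {"name": "John", "city": "Boulder", "state": "CO", "card": "1374-9735-1794-9711"},
]


def mask_card(card):
    """Keep the first digit group and overwrite the remaining groups."""
    first, *rest = card.split("-")
    return "-".join([first] + ["1111"] * len(rest))


def data_scientist_view(rows):
    # Generator: rows are masked on read, nothing is stored.
    for row in rows:
        yield {**row, "card": mask_card(row["card"])}


def business_analyst_view(rows):
    # Generator: the card column is projected away on read.
    for row in rows:
        yield {k: v for k, v in row.items() if k != "card"}


for row in data_scientist_view(RAW_CARDS):
    print(row["name"], row["card"])
```

Because the views are computed on read, the raw rows stay unchanged and only the permissions on each view decide who sees which shape of the data.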
Ownership Chaining
• Combine Self Service Exploration with Data Governance

Raw File (/raw/cards.csv): John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-7914-3865-4817
John  Boulder   CO     1374-9735-1794-9711

Data Scientist (/views/V_Scientist): Jane (Read), John (Owner)
Name  City      State  Credit Card #
Dave  San Jose  CA     1374-1111-1111-1111
John  Boulder   CO     1374-1111-1111-1111

Analyst (/views/V_Analyst): Jack (Read), Jane (Owner)
Name  City      State
Dave  San Jose  CA
John  Boulder   CO

Ownership chaining: V_Analyst -> V_Scientist -> RAWFILE
Access path: Jack queries the view V_Analyst
1. Does Jack have access to V_Analyst? -> YES
   Who is the owner of V_Analyst? -> Jane
   Drill accesses V_Analyst as Jane (Impersonation hop 1)
2. Does Jane have access to V_Scientist? -> YES
   Who is the owner of V_Scientist? -> John
   Drill accesses V_Scientist as John (Impersonation hop 2)
3. Does John have permissions on raw file? -> YES
   Who is the owner of raw file? -> John
   Drill accesses source file as John (no impersonation here)
*Ownership chain length (# hops) is configurable
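The access walk above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Drill's implementation: `SecuredObject`, `resolve_access` and the `max_hops` parameter are hypothetical names, and permissions are reduced to an owner plus a set of readers.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class SecuredObject:
    name: str
    owner: str
    readers: set = field(default_factory=set)
    source: Optional[str] = None  # underlying object; None marks a raw file


def resolve_access(catalog, user, name, max_hops):
    """Walk the ownership chain; return [(object, identity used to read it), ...]."""
    path, hops, current = [], 0, user
    while True:
        obj = catalog[name]
        if current != obj.owner and current not in obj.readers:
            raise PermissionError(f"{current} cannot read {obj.name}")
        if obj.source is None:
            path.append((obj.name, current))  # raw file: no impersonation
            return path
        if obj.owner != current:
            hops += 1  # the view is read as its owner: one impersonation hop
            if hops > max_hops:
                raise PermissionError(f"more than {max_hops} hops at {obj.name}")
        path.append((obj.name, obj.owner))
        current, name = obj.owner, obj.source


catalog = {
    "V_Analyst": SecuredObject("V_Analyst", "Jane", {"Jack"}, "V_Scientist"),
    "V_Scientist": SecuredObject("V_Scientist", "John", {"Jane"}, "/raw/cards.csv"),
    "/raw/cards.csv": SecuredObject("/raw/cards.csv", "John"),
}

print(resolve_access(catalog, "Jack", "V_Analyst", max_hops=2))
```

With `max_hops=2` Jack's query succeeds through Jane and then John; lowering the limit to 1 rejects it at V_Scientist, which is the effect of the configurable chain length.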
But was that the right question?
Unification is Feasible
• It is relatively easy to build a DrillContext in Spark
– compare to SqlContext
• Define Datasets as Drill data sources and sinks
– Drill runs at the same time as Spark
• Orchestrate transport of Spark data to/from Drill
• Cost of transport is remarkably small
What does the Spark and Drill integration look like
Features at a glance:
• Use Drill as an input to Spark
• Query Spark RDDs via Drill and create data pipelines
[Diagram: files on disk (DFS) and RDDs in memory, with data moving between them]
Is unification valuable?
Example of Unification
[Diagram: a universe of callers and towers producing cdr data]
Simple Session Protocol
• Calls started at random intervals
• During calls, reconnection is done periodically
• Many log events are buffered and sent to current tower during active state
[State diagram: states start, idle, connect and active; transitions labeled SETUP, HELLO, FAIL, TIMEOUT, CONNECT and END]
The Resulting Data
• Signal strength reports
– Tower, timestamp, rank, caller, caller location*, signal strength
• Tower log events: HELLO, FAIL, CONNECT, END
• Call end
• Note that data for one tower is often received by another due to caller buffering of diagnostic data
*Location isn’t quite location … poetic license applied for
What can we do with it?
Baby Steps
• What does signal propagation look like?
select x, y, signal from cdr_stream where tower = 3
• Plot results to get a map of signal strength around a tower
Baby Steps
• What does tower coverage look like?
select x, y from cdr_stream
where tower = 3 and event_type = 'CONNECT'
• Plot results to get a map of coverage area for a tower
What about anomaly detection?
Detecting Tower Loss
It’s important to know if traffic is stopped or delayed
because of a problem…
But events from towers come at irregular intervals
How long after the last event should you begin to worry?
Event Stream (timing)
• Events of various types arrive at irregular intervals
– we can assume Poisson distribution
• The key question is whether frequency has changed relative to
expected values
– This shows up as a change in interval
• Want alert as soon as possible
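Under the Poisson assumption, gaps between events are exponentially distributed, so "how long should you wait before worrying" has a closed form: a gap longer than -ln(1 - p) / rate occurs with probability 1 - p. The sketch below computes that threshold; the function names are illustrative and the rate is assumed known and constant.

```python
import math


def silence_threshold(rate, percentile):
    """Gap length that only a (1 - percentile) fraction of exponential gaps exceed."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be strictly between 0 and 1")
    return -math.log(1.0 - percentile) / rate


def gap_tail_probability(rate, gap):
    """Probability that a gap at least this long occurs by chance."""
    return math.exp(-rate * gap)


# Example: 2 events per minute on average, alert at the 99.9th percentile.
threshold = silence_threshold(2.0, 0.999)
print(round(threshold, 2), "minutes of silence before alerting")
```

Raising the percentile cuts false alarms but delays detection, which is the trade-off behind wanting the alert "as soon as possible".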
Converting Event Times to Anomaly
[Chart: event intervals with thresholds marked at the 99.9%-ile and 99.99%-ile]
But in the real world, event rates often change
Time Intervals Are Key to Modeling Sporadic Events
[Plot: interval between events, dt (min), against time, t (days)]
After Rate Correction
[Plot: rate-corrected interval, dt/rate, against time, t (days), with thresholds at the 99.9%-ile and 99.99%-ile]
Detecting Anomalies in Sporadic Events
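[Diagram: incoming event times t_i pass through a Δn stage that computes δ = λ (t_i − t_{i−n}); a rate predictor, fed by rate history, supplies λ; a t-digest supplies the 99.97%-ile threshold t; an alarm is raised when δ > t]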
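A minimal Python sketch of that flow follows. It keeps the last n + 1 event times, scales the elapsed time by the predicted rate, and compares the result with a threshold. The class and function names are hypothetical, the rate is passed in by the caller instead of coming from a real predictor, and `empirical_quantile` is a simple stand-in for a t-digest query.

```python
from collections import deque


def empirical_quantile(values, q):
    """Stand-in for a t-digest: exact quantile of a small in-memory sample."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


class SporadicEventDetector:
    def __init__(self, n, threshold):
        self.n = n
        self.threshold = threshold
        self.times = deque(maxlen=n + 1)  # holds t_{i-n} .. t_i once warm

    def observe(self, t, rate):
        """Record event time t; return (delta, alarm), or None during warm-up."""
        self.times.append(t)
        if len(self.times) <= self.n:
            return None
        # delta = rate * (t_i - t_{i-n}): elapsed time in units of expected events
        delta = rate * (self.times[-1] - self.times[0])
        return delta, delta > self.threshold


detector = SporadicEventDetector(n=2, threshold=5.0)
for t in [0.0, 1.0, 2.0, 10.0]:
    print(t, detector.observe(t, rate=1.0))
```

In practice the threshold would be read from the history of past δ values, for example `empirical_quantile(past_deltas, 0.9997)` in this sketch, and a timer would also be needed to catch a stream that stops entirely, since this version only evaluates when an event arrives.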
Propagation Anomalies
• What happens when something shadows part of the coverage field?
– Can happen in urban areas with a construction crane
• Can solve heuristically
– Subtract from reference image composed by long term averages
– Doesn’t deal well with weak signal regions and low S/N
• Can solve probabilistically
– Compute anomaly for each measurement, use mean of log(p)
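One way to read the probabilistic option: score each measurement by how unusual it is relative to the history at its location, then average the logs of those probabilities. The sketch below uses an empirical lower-tail probability with add-one smoothing; that choice, and the names `tail_probability` and `mean_log_p`, are assumptions for illustration, since the slides do not specify the probability model.

```python
import math
from bisect import bisect_right


def tail_probability(history, value):
    """Smoothed fraction of reference values at or below `value`.

    A shadowed location reports unusually weak signal, so a small
    lower-tail probability marks an anomaly. Smoothing keeps p above zero.
    """
    ordered = sorted(history)
    at_or_below = bisect_right(ordered, value)
    return (at_or_below + 1) / (len(ordered) + 2)


def mean_log_p(histories, measurements):
    """Average log-probability over (location, value) measurements."""
    logs = [math.log(tail_probability(histories[loc], value))
            for loc, value in measurements]
    return sum(logs) / len(logs)


histories = {"cell_a": [1, 2, 3, 4, 5, 6, 7, 8, 9]}
print(mean_log_p(histories, [("cell_a", 5)]))  # typical reading
print(mean_log_p(histories, [("cell_a", 0)]))  # unusually weak reading
```

Because each measurement is compared with its own location's history, weak-signal regions are judged against what is normal there, which is where the subtraction heuristic struggles.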
Variable Signal/Noise Makes Heuristic Tricky
Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
Other Issues
• Finding anomalies in coverage area is similarly tricky
• Coverage area is roughly where tower signal strength is higher than neighbors
• Except for fuzziness due to hand-off delays
• Except for bias due to large-scale caller motions
– Rush hour
– Event mobs
Simple Answer for Propagation Anomalies
• Cluster signal strength reports
• Cluster locations using k-means, large k
• Model report rate anomaly using discrete event models
• Model signal strength anomaly using percentile model
• Trade larger k against higher report rates, faster detection
• Overall anomaly is sum of individual log(p) anomalies
Coverage Areas
Just One Tower
Cluster Reports for That Tower
Cluster Reports for That Tower
[Figure: report clusters for the tower, labeled 1 through 9]
General Dataflow
[Diagram: dataflow stages: group by tower and filter data (SQL); k-means cluster (MLlib); split data (SQL); location model (Java); mark cluster (MLlib); rate detection per cluster]
Summary
• Drill and Spark provide healthy competition in Apache
• Over time, they have converged in many respects
– But important distinctions remain
• Projects can work together to share key technology
– Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark
• Systems can work together even more deeply
– DrillContext makes integration first class
e-book available courtesy of MapR
http://bit.ly/1jQ9QuL
A New Look at Anomaly Detection
by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
Read online mapr.com/6ebooks-read
Download pdfs mapr.com/6ebooks-pdf
6 Free ebooks
Streaming
Architecture
Ted Dunning &
Ellen Friedman
and MapR Streams
Thank you for coming today!
…helping you put data technology to work
● Find answers
● Ask technical questions
● Join on-demand training course discussions
● Follow release announcements
● Share and vote on product ideas
● Find Meetup and event listings
Connect with fellow Apache Hadoop and Spark professionals
community.mapr.com

More Related Content

What's hot

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
StampedeCon
 
Apache Drill
Apache DrillApache Drill
Apache Drill
Ted Dunning
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to RedisDvir Volk
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
Bryan Bende
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
HostedbyConfluent
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
DataWorks Summit
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
DataWorks Summit
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
Yoshinori Matsunobu
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkDatabricks
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
Databricks
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
Hortonworks
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
Flink Forward
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkDatabricks
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariDataWorks Summit
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
datamantra
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
Jurriaan Persyn
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
t3rmin4t0r
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
SANG WON PARK
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
DataWorks Summit/Hadoop Summit
 

What's hot (20)

Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Apache Drill
Apache DrillApache Drill
Apache Drill
 
Introduction to Redis
Introduction to RedisIntroduction to Redis
Introduction to Redis
 
Apache NiFi Record Processing
Apache NiFi Record ProcessingApache NiFi Record Processing
Apache NiFi Record Processing
 
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data TechBig Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
Big Data Kappa | Mark Senerth, The Walt Disney Company - DMED, Data Tech
 
Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Hive Does ACID
Hive Does ACIDHive Does ACID
Hive Does ACID
 
ORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big DataORC File - Optimizing Your Big Data
ORC File - Optimizing Your Big Data
 
RocksDB Performance and Reliability Practices
RocksDB Performance and Reliability PracticesRocksDB Performance and Reliability Practices
RocksDB Performance and Reliability Practices
 
End-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache SparkEnd-to-end Data Pipeline with Apache Spark
End-to-end Data Pipeline with Apache Spark
 
Making Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta LakeMaking Apache Spark Better with Delta Lake
Making Apache Spark Better with Delta Lake
 
Apache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query ProcessingApache Tez: Accelerating Hadoop Query Processing
Apache Tez: Accelerating Hadoop Query Processing
 
Building Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta LakeBuilding Reliable Lakehouses with Apache Flink and Delta Lake
Building Reliable Lakehouses with Apache Flink and Delta Lake
 
Simplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache SparkSimplifying Big Data Analytics with Apache Spark
Simplifying Big Data Analytics with Apache Spark
 
Managing 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with AmbariManaging 2000 Node Cluster with Ambari
Managing 2000 Node Cluster with Ambari
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Introduction to memcached
Introduction to memcachedIntroduction to memcached
Introduction to memcached
 
Hive+Tez: A performance deep dive
Hive+Tez: A performance deep diveHive+Tez: A performance deep dive
Hive+Tez: A performance deep dive
 
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
OLAP for Big Data (Druid vs Apache Kylin vs Apache Lens)
 
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storagehive HBase Metastore - Improving Hive with a Big Data Metadata Storage
hive HBase Metastore - Improving Hive with a Big Data Metadata Storage
 

Similar to Spark SQL versus Apache Drill: Different Tools with Different Rules

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
tshiran
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillTomer Shiran
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
DataWorks Summit/Hadoop Summit
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
Carol McDonald
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
MapR Technologies
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
Ted Dunning
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
Connor McDonald
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
Cloudera, Inc.
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
Vince Gonzalez
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
Carol McDonald
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
BigDataEverywhere
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Etu Solution
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
harryvanhaaren
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
MapR Technologies
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
DataStax Academy
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
DataStax Academy
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
Julien Anguenot
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
MapR Technologies
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
DataWorks Summit
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
mcsrivas
 

Similar to Spark SQL versus Apache Drill: Different Tools with Different Rules (20)

Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
Keys for Success from Streams to Queries
Keys for Success from Streams to QueriesKeys for Success from Streams to Queries
Keys for Success from Streams to Queries
 
Introduction to Spark
Introduction to SparkIntroduction to Spark
Introduction to Spark
 
Drill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is PossibleDrill into Drill – How Providing Flexibility and Performance is Possible
Drill into Drill – How Providing Flexibility and Performance is Possible
 
Real time-hadoop
Real time-hadoopReal time-hadoop
Real time-hadoop
 
Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2Whats new in Oracle Database 12c release 12.1.0.2
Whats new in Oracle Database 12c release 12.1.0.2
 
The Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache SparkThe Future of Hadoop: A deeper look at Apache Spark
The Future of Hadoop: A deeper look at Apache Spark
 
Cleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - SparkCleveland Hadoop Users Group - Spark
Cleveland Hadoop Users Group - Spark
 
Apache Spark Overview
Apache Spark OverviewApache Spark Overview
Apache Spark Overview
 
Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop Big Data Everywhere Chicago: SQL on Hadoop
Big Data Everywhere Chicago: SQL on Hadoop
 
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
Track B-3 解構大數據架構 - 大數據系統的伺服器與網路資源規劃
 
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. GrayOVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
OVS and DPDK - T.F. Herbert, K. Traynor, M. Gray
 
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise NetworksUsing Familiar BI Tools and Hadoop to Analyze Enterprise Networks
Using Familiar BI Tools and Hadoop to Analyze Enterprise Networks
 
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache CassandraCassandra Day SV 2014: Spark, Shark, and Apache Cassandra
Cassandra Day SV 2014: Spark, Shark, and Apache Cassandra
 
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE PlatformLarge Scale Data Analytics with Spark and Cassandra on the DSE Platform
Large Scale Data Analytics with Spark and Cassandra on the DSE Platform
 
Introduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developersIntroduction to Cassandra and CQL for Java developers
Introduction to Cassandra and CQL for Java developers
 
Dealing with an Upside Down Internet
Dealing with an Upside Down InternetDealing with an Upside Down Internet
Dealing with an Upside Down Internet
 
How the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside DownHow the Internet of Things are Turning the Internet Upside Down
How the Internet of Things are Turning the Internet Upside Down
 
Apache Drill - Why, What, How
Apache Drill - Why, What, HowApache Drill - Why, What, How
Apache Drill - Why, What, How
 

Spark SQL versus Apache Drill: Different Tools with Different Rules

  • 1. © 2014 MapR Technologies 1
  • 2. © 2014 MapR Technologies 2 Contact Information Ted Dunning Chief Applications Architect at MapR Technologies Committer & PMC for Apache’s Drill, Zookeeper & others VP of Incubator at Apache Foundation Email tdunning@apache.org tdunning@maprtech.com Twitter @ted_dunning
  • 3. © 2014 MapR Technologies 5 What is Drill?
  • 4. © 2014 MapR Technologies 6 A Query engine that has… • Columnar/Vectorized • Optimistic/pipelined • Runtime compilation • Late binding • Extensible
  • 5. © 2014 MapR Technologies 7 Table Can Be an Entire Directory Tree // On a file select errorLevel, count(*) from dfs.logs.`/AppServerLogs/2014/Janpart0001.parquet` group by errorLevel; // On the entire data collection: all years, all months select errorLevel, count(*) from dfs.logs.`/AppServerLogs` group by errorLevel;
  • 6. © 2014 MapR Technologies 8 Basic Process [Diagram: ZooKeeper; three Drillbits, each with a distributed cache, each over DFS/HBase] 1. Query comes to any Drillbit (JDBC, ODBC, CLI, protobuf) 2. Drillbit generates execution plan based on query optimization & locality 3. Fragments are farmed to individual nodes 4. Result is returned to driving node
  • 7. © 2014 MapR Technologies 9 Stages of Query Planning Parser Logical Planner Physical Planner Query Foreman Plan fragments sent to drill bits SQL Query Heuristic and cost based Cost based
  • 8. © 2014 MapR Technologies 10 Query Execution SQL Parser Optimizer Scheduler Pig Parser PhysicalPlan Mongo Cassandra HiveQL Parser RPC Endpoint Distributed Cache StorageInterface Operators Operators Foreman LogicalPlan HDFS HBase JDBC Endpoint ODBC Endpoint
  • 9. © 2014 MapR Technologies 11 Batches of Values • Value vectors – List of values, with same schema – With the 4-value semantics for each value • Shipped around in batches – max 256k bytes in a batch – max 64K rows in a batch • RPC designed for multiple replies to a request
  • 10. © 2014 MapR Technologies 12 Fixed Value Vectors
  • 11. © 2014 MapR Technologies 13 Vectorization • Drill operates on more than one record at a time – Word-sized manipulations – SIMD instructions • GCC, LLVM and JVM all do various optimizations automatically – Manually code algorithms • Logical Vectorization – Bitmaps allow lightning fast null-checks – Avoid branching to speed CPU pipeline
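The logical-vectorization bullet above can be illustrated with a small sketch. It assumes a hypothetical `NullableIntVector` that pairs a list of values with a validity bitmap held in a single Python integer. This is not Drill's ValueVector API, only an illustration of why a bitmap lets one operation answer a null check for many rows, and how multiplying by the validity bit avoids a branch per row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NullableIntVector:
    """Hypothetical column vector: values plus a validity bitmap (bit i set = row i is non-null)."""
    values: List[int]
    validity: int  # bit i corresponds to values[i]

    @classmethod
    def from_list(cls, items: List[Optional[int]]) -> "NullableIntVector":
        values, validity = [], 0
        for i, item in enumerate(items):
            if item is not None:
                validity |= 1 << i
                values.append(item)
            else:
                values.append(0)  # placeholder slot, never read as data
        return cls(values, validity)

    def null_count(self) -> int:
        # One popcount over the bitmap instead of a test per row.
        return len(self.values) - bin(self.validity).count("1")

    def has_nulls(self) -> bool:
        # Compare the whole bitmap against an all-ones mask in one step.
        all_valid = (1 << len(self.values)) - 1
        return self.validity != all_valid

    def sum_valid(self) -> int:
        # Multiply by the validity bit rather than branching on null.
        return sum(v * ((self.validity >> i) & 1) for i, v in enumerate(self.values))


vec = NullableIntVector.from_list([3, None, 7, 7, None])
```

Here `vec.null_count()` counts the two missing rows from the bitmap alone, without touching the values.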
  • 12. © 2014 MapR Technologies 14 Runtime Compilation is Faster • JIT is smart, but more gains with runtime compilation • Janino: Java-based Java compiler From http://bit.ly/16Xk32x
  • 13. © 2014 MapR Technologies 15 Drill compiler [Diagram: Loaded class; Merge byte-code of the two classes; Janino compiles runtime byte-code; CodeModel generates code; Precompiled byte-code templates]
  • 14. © 2014 MapR Technologies 16 Optimistic [Chart: Speed vs. check-pointing; workloads: cmd pipeline, small db, med db, large db, dw, compilation, hadoop; annotations: No need to checkpoint, Checkpoint frequently, Apache Drill]
  • 15. © 2014 MapR Technologies 17 Optimistic Execution • Recovery code trivial – Running instances discard the failed query’s intermediate state • Pipelining possible – Send results as soon as batch is large enough – Requires barrier-less decomposition of query
  • 16. © 2014 MapR Technologies 18 Pipelining • Record batches are pipelined between nodes – ~256kB usually • Unit of work for Drill – Operators work on a batch • Operator reconfiguration happens at batch boundaries [Diagram: three DrillBits]
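A minimal sketch of batch pipelining, using Python generators in place of Drill operators. The function names and the tiny batch size are assumptions made for illustration; the point is only that a downstream operator receives its first batch before the upstream scan has finished reading its input.

```python
from typing import Iterable, Iterator, List

MAX_ROWS_PER_BATCH = 4  # tiny for illustration; the slides cite up to 64K rows per batch


def scan(rows: Iterable[dict], batch_size: int = MAX_ROWS_PER_BATCH) -> Iterator[List[dict]]:
    """Emit a batch as soon as it is full instead of waiting for the whole input."""
    batch: List[dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch


def filter_errors(batches: Iterator[List[dict]]) -> Iterator[List[dict]]:
    """Operator that works on one batch at a time and forwards non-empty results."""
    for batch in batches:
        kept = [row for row in batch if row["errorLevel"] == "ERROR"]
        if kept:
            yield kept


rows = [{"id": i, "errorLevel": "ERROR" if i % 3 == 0 else "INFO"} for i in range(10)]
pipeline = filter_errors(scan(rows))
first = next(pipeline)  # available after only the first input batch has been read
```

Because both stages are lazy, no stage holds more than one batch at a time, which is the property the slides attribute to barrier-less decomposition.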
  • 17. © 2014 MapR Technologies 19 Pipelining • Random access: sort without copy or restructuring • Avoids serialization/deserialization • Off-heap (no GC woes when lots of memory) • Read/write to disk – when data larger than memory Drill Bit Memory overflow uses disk Disk
  • 18. © 2014 MapR Technologies 20 Cost-based Optimization • Using Optiq, an extensible framework – Pluggable rules, and cost model • Rules for distributed plan generation – Insert Exchange operator into physical plan – Optiq enhanced to explore parallel query plans • Pluggable cost model – CPU, IO, memory, network cost (data locality) – Storage engine features (HDFS vs HIVE vs HBase) Query Optimizer Pluggable rules Pluggable cost model
  • 19. © 2014 MapR Technologies 21 What is SparkSQL?
  • 20. © 2014 MapR Technologies 22 What is Spark SQL • Essentially syntactic sugar over a limited subset of Spark • Inherits all the virtues (and vices) of Spark – Lambdas can serve as UDFs (has subtle issues for performance) • Inputs have to be loaded – Perhaps lazily, not obvious when load actually happens • Not designed as a streaming engine, requires more memory • Some JSON support, but not so much for large or variable objects • Embedded in a real language!
  • 21. © 2014 MapR Technologies 23 In More Detail • A Spark program consists of a computation graph that consumes and produces so-called resilient distributed datasets (RDDs) • SparkSQL allows these computations to be defined using SQL (but needs schema definitions on the RDDs) • Conventional Spark programs and SparkSQL programs interoperate nearly seamlessly
  • 22. © 2014 MapR Technologies 24 Many Similarities SQL Parser Optimizer Java PhysicalPlan Scala LogicalPlan Python group filter filter
  • 23. © 2014 MapR Technologies 25 Important Differences • Spark execution assumes RDDs are a complete representation, not a stream of row batches • Input sources don’t inject optimization rules, nor expose detailed cost models • Most RDDs don’t have a zero-copy capability • Spark inherits the JVM memory model, with very limited use of off-heap memory
  • 24. © 2014 MapR Technologies 26 scala> sqlContext.sql("select * from json.`foo.json`").show +---+------+----+ | a| b| c| +---+------+----+ | 3|[3, 2]| xyz| | 7| null| wxy| | 7| []|null| +---+------+----+
  • 25. © 2014 MapR Technologies 27 scala> sqlContext.sql( "select a, explode(b) b_v from json.`bug.json`" ).show +---+---------+ | a| b_v| +---+---------+ | 3| 3| | 3| 2| +---+---------+
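Assuming the two queries above run over the same three rows, the second result is shorter because exploding an array column emits one row per element, so a null or empty array contributes nothing and its parent row disappears. The following plain-Python sketch mimics that behaviour without any Spark dependency. The `explode` helper and its `keep_empty` option are hypothetical names used only to show the inner and outer variants side by side.

```python
from typing import Iterator, List, Optional, Tuple

Row = Tuple[int, Optional[List[int]]]

# Columns a and b from the table shown on the earlier slide.
rows: List[Row] = [(3, [3, 2]), (7, None), (7, [])]


def explode(rows: List[Row], keep_empty: bool = False) -> Iterator[Tuple[int, Optional[int]]]:
    """Yield one (a, b_v) pair per array element.

    With keep_empty=False, null and empty arrays yield no rows at all,
    which is how the rows with a = 7 vanish from the result.
    """
    for a, b in rows:
        if b:
            for b_v in b:
                yield (a, b_v)
        elif keep_empty:
            yield (a, None)  # outer-style variant keeps the parent row


inner = list(explode(rows))
outer = list(explode(rows, keep_empty=True))
```

The `inner` list matches the two-row output on the slide, while `outer` retains both rows with a = 7 and a null element.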
  • 26. © 2014 MapR Technologies 28 First Synthesis • Drill has a more nuanced optimizer, better code generation – This often leads to ~2x speed advantage • Drill has ValueVector and row batches – This leads to much less memory pressure • Drill has much stricter memory life-cycle – Query is done and gone, no need for big GC’s even on big memory • Drill is all about SQL execution
  • 27. © 2014 MapR Technologies 29 But … • Spark can optimize across entire program – This often leads to ~2x speed advantage • Spark has much more flexible memory structures – This can lead to much less memory pressure • Spark has much more flexible RDD life-cycle – RDDs can be cached, persisted or simply recomputed as necessary • Spark is not all about SQL execution
  • 28. © 2014 MapR Technologies 30 The Really Big Differences • Drill focuses heavily on secure, multi-tenant access to data – Strong impersonation semantics – Cascading rights via views – Queries co-exist in a cluster and reserve only their momentary resource requirements • Spark focuses heavily on fully integrated execution models – Any Spark function works with (almost) any RDD – Memory residency of RDDs is the highest goal
  • 29. © 2014 MapR Technologies 31 Drill security ➢ End-to-end security from BI tools to Hadoop ➢ Standards-based PAM authentication ➢ Two-level user impersonation ➢ Fine-grained row and column level access control with Drill Views – no centralized security repository required
  • 30. © 2014 MapR Technologies 32 Granular security permissions through Drill views. Raw File (/raw/cards.csv), owner Admins, permission Admins: Name, City, State, Credit Card # (Dave, San Jose, CA, 1374-7914-3865-4817; John, Boulder, CO, 1374-9735-1794-9711). Data Scientist View (/views/maskedcards.view.drill), owner Admins, permission Data Scientists, not a physical data copy: Name, City, State, Credit Card # (Dave, San Jose, CA, 1374-1111-1111-1111; John, Boulder, CO, 1374-1111-1111-1111). Business Analyst View, owner Admins, permission Business Analysts: Name, City, State (Dave, San Jose, CA; John, Boulder, CO).
  • 31. © 2014 MapR Technologies 33 Ownership Chaining • Combine Self Service Exploration with Data Governance. Raw File (/raw/cards.csv), owner John: Name, City, State, Credit Card # (Dave, San Jose, CA, 1374-7914-3865-4817; John, Boulder, CO, 1374-9735-1794-9711). Data Scientist view (/views/V_Scientist), owner John, read access Jane: Name, City, State, Credit Card # (Dave, San Jose, CA, 1374-1111-1111-1111; John, Boulder, CO, 1374-1111-1111-1111). Analyst view (/views/V_Analyst), owner Jane, read access Jack: Name, City, State (Dave, San Jose, CA; John, Boulder, CO). Objects in the chain: RAWFILE, V_Scientist, V_Analyst (legend: ownership chaining, access path). Jack queries the view V_Analyst: (1) Does Jack have access to V_Analyst? -> YES. Who is the owner of V_Analyst? -> Jane. Drill accesses V_Analyst as Jane (impersonation hop 1). (2) Does Jane have access to V_Scientist? -> YES. Who is the owner of V_Scientist? -> John. Drill accesses V_Scientist as John (impersonation hop 2). (3) Does John have permissions on raw file? -> YES. Who is the owner of raw file? -> John. Drill accesses source file as John (no impersonation here). *Ownership chain length (# hops) is configurable
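The ownership-chaining walk-through above can be sketched as a small resolver. The `Resource` class, the `catalog` dictionary, and the `resolve` function are hypothetical and do not reflect Drill's implementation; they only replay the question-and-answer steps from the slide, including the configurable limit on impersonation hops.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Resource:
    owner: str
    readers: Set[str] = field(default_factory=set)
    source: Optional[str] = None  # name of the object this view reads from


# Objects and grants as drawn on the slide.
catalog: Dict[str, Resource] = {
    "V_Analyst": Resource(owner="Jane", readers={"Jack"}, source="V_Scientist"),
    "V_Scientist": Resource(owner="John", readers={"Jane"}, source="RAWFILE"),
    "RAWFILE": Resource(owner="John"),
}


def resolve(user: str, name: str, max_hops: int = 2) -> List[str]:
    """Walk the ownership chain and return the user whose access is checked at each level.

    Raises PermissionError if a level is unreadable or the chain needs
    more impersonation hops than max_hops allows.
    """
    checked: List[str] = []
    hops = 0
    current: Optional[str] = name
    while current is not None:
        res = catalog[current]
        if user != res.owner and user not in res.readers:
            raise PermissionError(f"{user} cannot read {current}")
        checked.append(user)
        if res.source is not None and user != res.owner:
            hops += 1  # switching to the view owner counts as one hop
            if hops > max_hops:
                raise PermissionError("ownership chain too long")
            user = res.owner
        current = res.source
    return checked
```

Calling `resolve("Jack", "V_Analyst")` checks Jack, then Jane, then John, mirroring the two impersonation hops on the slide, while lowering `max_hops` to 1 rejects the same query.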
  • 32. © 2014 MapR Technologies 34 But was that the right question?
  • 33. © 2014 MapR Technologies 35 Unification is Feasible • It is relatively easy to build a DrillContext in Spark – compare to SqlContext • Define Datasets as Drill data sources and sinks – Drill runs at the same time as Spark • Orchestrate transport of Spark data to/from Drill • Cost of transport is remarkably small
  • 34. © 2014 MapR Technologies 36 What does the Spark and Drill integration look like? Features at a glance: • Use Drill as an input to Spark • Query Spark RDDs via Drill and create data pipelines [Diagram: files on disk (DFS) and RDDs in memory]
  • 35. © 2014 MapR Technologies 37 Is unification valuable?
  • 36. © 2014 MapR Technologies 38 Example of Unification Callers Universe Towers cdr data
  • 37. © 2014 MapR Technologies 39 Simple Session Protocol • Calls started at random intervals • During calls, reconnection is done periodically • Many log events are buffered and sent to current tower during active state [State diagram: states start, idle, connect, active; transitions SETUP, HELLO, FAIL, TIME OUT, CONNECT, END]
  • 38. © 2014 MapR Technologies 40 The Resulting Data • Signal strength reports – Tower, timestamp, rank, caller, caller location*, signal strength • Tower log events: HELLO, FAIL, CONNECT, END • Call end • Note that data for one tower is often received by another due to caller buffering of diagnostic data *Location isn’t quite location … poetic license applied for
  • 39. © 2014 MapR Technologies 41 What can we do with it?
  • 40. © 2014 MapR Technologies 42 Baby Steps • What does signal propagation look like? select x, y, signal from cdr_stream where tower = 3 • Plot results to get a map of signal strength around a tower
  • 41. © 2014 MapR Technologies 43 Baby Steps • What does tower coverage look like? select x, y from cdr_stream where tower = 3 and event_type = 'CONNECT' • Plot results to get a map of coverage area for a tower
  • 42. © 2014 MapR Technologies 44 What about anomaly detection?
  • 43. © 2014 MapR Technologies 45 Detecting Tower Loss It’s important to know if traffic is stopped or delayed because of a problem… But events from towers come at irregular intervals How long after the last event should you begin to worry?
  • 44. © 2014 MapR Technologies 46 Event Stream (timing) • Events of various types arrive at irregular intervals – we can assume Poisson distribution • The key question is whether frequency has changed relative to expected values – This shows up as a change in interval • Want alert as soon as possible
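Under the Poisson assumption stated above, gaps between events follow an exponential distribution, so the probability of waiting longer than t at rate λ is exp(−λt). That gives a direct answer to "how long should you wait before worrying": the gap exceeded with probability 1 − p is −ln(1 − p)/λ. A short sketch, in which the function name and the rate of one event per minute are illustrative assumptions:

```python
import math


def worry_after(rate_per_min: float, percentile: float) -> float:
    """Gap length (minutes) exceeded with probability 1 - percentile under an exponential model."""
    if not 0.0 < percentile < 1.0:
        raise ValueError("percentile must be strictly between 0 and 1")
    return -math.log(1.0 - percentile) / rate_per_min


# Hypothetical tower producing one event per minute on average.
t_999 = worry_after(1.0, 0.999)    # about 6.9 minutes
t_9999 = worry_after(1.0, 0.9999)  # about 9.2 minutes
```

Raising the percentile from 99.9% to 99.99% adds only a little over two minutes of waiting at this rate, because the threshold grows with the logarithm of the false-alarm odds.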
  • 45. © 2014 MapR Technologies 47 Converting Event Times to Anomaly [Plot with 99.9%-ile and 99.99%-ile levels marked]
  • 46. © 2014 MapR Technologies 48 But in the real world, event rates often change
  • 47. © 2014 MapR Technologies 49 Time Intervals Are Key to Modeling Sporadic Events [Plot: dt (min) against t (days)]
  • 48. © 2014 MapR Technologies 50 Time Intervals Are Key to Modeling Sporadic Events [Plot: dt (min) against t (days)]
  • 49. © 2014 MapR Technologies 51 After Rate Correction [Plot: dt/rate against t (days), with 99.9%-ile and 99.99%-ile levels marked]
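  • 50. © 2014 MapR Technologies 52 Detecting Anomalies in Sporadic Events [Diagram: incoming events pass through a Δn stage; the rate predictor, fed by rate history, supplies λ; the scaled interval is δ = λ(t_i − t_{i−n}); a t-digest supplies the 99.97%-ile threshold t; alarm when δ > t]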
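A minimal sketch of the rate-corrected detector on this slide, under stated assumptions: the class and method names are hypothetical, the predicted rate λ is passed in by the caller rather than learned from rate history, and a fixed threshold stands in for the percentile that the slide takes from a t-digest. It computes the scaled gap since the n-th most recent event and raises an alarm when that gap exceeds the threshold.

```python
import math
from collections import deque
from typing import Deque


class SporadicEventDetector:
    """Flags event gaps that are too long once scaled by the expected rate."""

    def __init__(self, n: int, threshold: float) -> None:
        self.n = n                  # look back n events (the delta-n stage)
        self.threshold = threshold  # cutoff on the scaled gap
        self.times: Deque[float] = deque(maxlen=n)

    def observe(self, t: float) -> None:
        """Record the arrival time of one event."""
        self.times.append(t)

    def scaled_gap(self, now: float, rate: float) -> float:
        """Rate times the time since the n-th most recent event; 0.0 until n events are seen."""
        if len(self.times) < self.n:
            return 0.0
        return rate * (now - self.times[0])

    def is_anomalous(self, now: float, rate: float) -> bool:
        return self.scaled_gap(now, rate) > self.threshold


# With n = 1 the scaled gap is exponential with mean 1 under the Poisson model,
# so the 99.97th percentile is -ln(1 - 0.9997), roughly 8.11.
detector = SporadicEventDetector(n=1, threshold=-math.log(1.0 - 0.9997))
for t in (0.0, 1.0, 2.0):  # event times in minutes
    detector.observe(t)

busy_alarm = detector.is_anomalous(now=12.0, rate=1.0)   # 10 quiet minutes at 1 event/min
quiet_alarm = detector.is_anomalous(now=12.0, rate=0.1)  # same silence when events are rare
```

The same ten minutes of silence triggers an alarm at a rate of one event per minute but not at one event per ten minutes, which is the effect of the rate correction.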
  • 51. © 2014 MapR Technologies 53 Propagation Anomalies • What happens when something shadows part of the coverage field? – Can happen in urban areas with a construction crane • Can solve heuristically – Subtract from reference image composed by long term averages – Doesn’t deal well with weak signal regions and low S/N • Can solve probabilistically – Compute anomaly for each measurement, use mean of log(p)
  • 52. © 2014 MapR Technologies 54
  • 53. © 2014 MapR Technologies 55
  • 54. © 2014 MapR Technologies 56 Variable Signal/Noise Makes Heuristic Tricky Far from the transmitter, received signal is dominated by noise. This makes subtraction of average value a bad algorithm.
  • 55. © 2014 MapR Technologies 57 Other Issues • Finding anomalies in coverage area is similarly tricky • Coverage area is roughly where tower signal strength is higher than neighbors • Except for fuzziness due to hand-off delays • Except for bias due to large-scale caller motions – Rush hour – Event mobs
  • 56. © 2014 MapR Technologies 58 Simple Answer for Propagation Anomalies • Cluster signal strength reports • Cluster locations using k-means, large k • Model report rate anomaly using discrete event models • Model signal strength anomaly using percentile model • Trade larger k against higher report rates, faster detection • Overall anomaly is sum of individual log(p) anomalies
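The last bullet, combining per-cluster anomalies by summing log(p), can be sketched as follows. The function name, the sign convention (negative log, so a larger score means more anomalous), the floor on p, and the example probabilities are all assumptions made for illustration.

```python
import math
from typing import Iterable


def anomaly_score(p_values: Iterable[float], floor: float = 1e-12) -> float:
    """Sum of -log(p) over per-cluster tail probabilities; larger means more anomalous."""
    return sum(-math.log(max(p, floor)) for p in p_values)


# Hypothetical tail probabilities for three location clusters around one tower.
normal = anomaly_score([0.5, 0.4, 0.6])
shadowed = anomaly_score([0.5, 0.001, 0.002])  # two clusters report very unlikely values
```

Summing logs means two moderately unlikely clusters can outweigh one, so a shadow that covers several clusters stands out even when no single cluster is extreme.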
  • 57. © 2014 MapR Technologies 59 Coverage Areas
  • 58. © 2014 MapR Technologies 60 Just One Tower
  • 59. © 2014 MapR Technologies 61 Cluster Reports for That Tower
  • 60. © 2014 MapR Technologies 62 Cluster Reports for That Tower 1 2 3 4 5 6 7 8 9
  • 61. © 2014 MapR Technologies 63 General Dataflow Group by tower, filter data (SQL) k-means cluster (ML LIB) Split data (SQL) Location model (Java) Mark cluster (ML LIB) Rate detection per cluster
  • 62. © 2014 MapR Technologies 64 Summary • Drill and Spark provide healthy competition in Apache • Over time, they have converged in many respects – But important distinctions remain • Projects can work together to share key technology – Apache Arrow … started as off-shoot of Drill, now has >12 major projects as participants, including Spark • Systems can work together even more deeply – DrillContext makes integration first class
  • 63. © 2014 MapR Technologies 65 e-book available courtesy of MapR http://bit.ly/1jQ9QuL A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman © June 2014 (published by O’Reilly)
  • 64. © 2014 MapR Technologies 66 Read online mapr.com/6ebooks-read Download pdfs mapr.com/6ebooks-pdf 6 Free ebooks Streaming Architecture Ted Dunning & Ellen Friedman and MapR Streams
  • 65. © 2014 MapR Technologies 67 Thank you for coming today!
  • 66. © 2014 MapR Technologies 68 …helping you put data technology to work ● Find answers ● Ask technical questions ● Join on-demand training course discussions ● Follow release announcements ● Share and vote on product ideas ● Find Meetup and event listings Connect with fellow Apache Hadoop and Spark professionals community.mapr.com
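
The "Simple Answer for Propagation Anomalies" slide combines per-cluster anomalies by summing log(p). A minimal sketch of that combination step follows, assuming each signal strength report has already been assigned to a location cluster (the k-means step is not shown) and that each cluster keeps a sorted history of past signal strengths. The names `tail_p`, `anomaly_score`, and the sample dBm values are hypothetical, and the score is written as the sum of -log(p) so that larger means more anomalous.

```python
import bisect
import math


def tail_p(sorted_history, value):
    """Smoothed fraction of history at least as extreme as value (nearer tail)."""
    n = len(sorted_history)
    below = bisect.bisect_left(sorted_history, value)
    above = n - bisect.bisect_right(sorted_history, value)
    # Add-one smoothing keeps p strictly positive so log(p) is defined.
    return (min(below, above) + 1) / (n + 1)


def anomaly_score(models, reports):
    """Sum of -log(p) over (cluster_id, signal_strength) reports."""
    return sum(-math.log(tail_p(models[c], v)) for c, v in reports)


# Hypothetical per-cluster histories of signal strength in dBm (already sorted).
models = {
    "c1": list(range(-90, -60)),
    "c2": list(range(-110, -80)),
}

normal = [("c1", -75), ("c2", -95)]     # both reports near their cluster median
shadowed = [("c1", -100), ("c2", -95)]  # c1 is far weaker than anything seen

print(anomaly_score(models, normal), anomaly_score(models, shadowed))
```

An exact sorted history is used here only for brevity; a streaming quantile sketch such as t-digest could play the same role when the history is too large to keep.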

Editor's Notes

  1. ELLEN: set up
  2. Talk track: This is what it looks like when events, such as visits to a website, arrive at random times (people come when they want to) but the underlying average rate is constant; in other words, a fairly steady stream of traffic. This looks a lot like the first signal we talked about: a randomized but even signal. We can use t-digest on it to set thresholds, and everything works just grand. (Like the clicks of a Geiger counter measuring radioactivity.)
  3. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if we set it low enough for daytime, we get an alert every night, even though we expect longer intervals then. If we set it too high, we miss the real problems, when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model of the rate and multiply the modelled rate by the interval, which gives a number we can threshold accurately.
  4. Talk track: (Describe figure) Horizontal axis is days, with noon in the middle of each day. The faint shadow shows the underlying rate of events. The vertical axis is the time interval between events. Notice that when the rate of events is high, the time interval between events is small, but when the rate of events slows down, the time between events is much larger. Ellen: For this reason, we cannot set a simple threshold: if we set it low enough for daytime, we get an alert every night, even though we expect longer intervals then. If we set it too high, we miss the real problems, when traffic really is abnormally delayed or stopped altogether. What can you do to solve this? Ted: We build a model of the rate and multiply the modelled rate by the interval, which gives a number we can threshold accurately.
  5. Talk track: You need a rate predictor. Ellen: Sometimes simple is good enough.
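
The rate correction described in these notes (multiply the modelled rate by the interval, then threshold the result) can be sketched as follows. This is only an illustration under stated assumptions: `predicted_rate` is a made-up diurnal model peaking at noon, times are in minutes, the event history is simulated, and an exact empirical quantile stands in for the t-digest used in the talk.

```python
import math
import random

MINUTES_PER_DAY = 1440.0


def predicted_rate(t_min):
    """Hypothetical rate predictor: events per minute, peaking at noon."""
    phase = 2.0 * math.pi * (t_min / MINUTES_PER_DAY - 0.5)
    return 2.0 + 1.5 * math.cos(phase)


def rate_corrected(times, rate=predicted_rate, n=1):
    """Return rate(t_i) * (t_i - t_{i-n}) for each event i >= n."""
    return [rate(times[i]) * (times[i] - times[i - n])
            for i in range(n, len(times))]


def quantile(values, q):
    """Exact empirical quantile; a stand-in for a t-digest estimate."""
    ordered = sorted(values)
    index = min(len(ordered) - 1, int(q * len(ordered)))
    return ordered[index]


def is_anomalous(prev_time, now, limit, rate=predicted_rate):
    """Flag when the rate-corrected gap since the last event exceeds limit."""
    return rate(now) * (now - prev_time) > limit


# Simulate two days of events whose rate follows the assumed model
# (a rough simulation: each gap is drawn using the rate at its start).
rng = random.Random(0)
times, t = [], 0.0
while t < 2 * MINUTES_PER_DAY:
    t += rng.expovariate(predicted_rate(t))
    times.append(t)

# One threshold on the corrected intervals serves both day and night.
limit = quantile(rate_corrected(times), 0.999)

print(is_anomalous(715.0, 720.0, limit))  # 5-minute gap at noon
print(is_anomalous(0.0, 5.0, limit))      # 5-minute gap at midnight
```

If events really are Poisson with the predicted rate, the corrected interval is approximately exponentially distributed with mean 1, so its 99.9th percentile is near ln(1000) ≈ 6.9 regardless of the time of day. That is why a single threshold works after correction.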