Apache Drill: An Interactive, Ad-hoc Query System for Large-scale Datasets. Talk given by Michael Hausenblas, MapR Chief Data Engineer EMEA, at the Big Data User Group in Stuttgart, 2013-05-16.
1. Apache Drill: an interactive, ad-hoc query system for large-scale datasets
Michael Hausenblas, Chief Data Engineer EMEA, MapR
Big Data User Group Stuttgart, 2013-05-16
2. Which workloads do you encounter in your environment?
http://www.flickr.com/photos/kevinomara/2866648330/ (licensed under CC BY-NC-ND 2.0)
3. Batch processing
… for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing
→ for the batch layer in Lambda architecture
4. OLTP
… user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc.
→ for the serving layer in Lambda architecture
5. Stream processing
… in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.)
→ for the speed layer in Lambda architecture
6. Search/Information Retrieval
… retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
9. Use Case: Marketing Campaign
• Jane, a marketing analyst
• Determine target segments
• Data from different sources
10. Use Case: Logistics
• Supplier tracking and performance
• Queries
– Shipments from supplier ‘ACM’ in last 24h
– Shipments in region ‘US’ not from ‘ACM’
SUPPLIER_ID  NAME                 REGION
ACM          ACME Corp            US
GAL          GotALot Inc          US
BAP          Bits and Pieces Ltd  Europe
ZUP          Zu Pli               Asia
{
  "shipment": 100123,
  "supplier": "ACM",
  "timestamp": "2013-02-01",
  "description": "first delivery today"
},
{
  "shipment": 100124,
  "supplier": "BAP",
  "timestamp": "2013-02-02",
  "description": "hope you enjoy it"
}
…
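The two logistics queries can be sketched in plain Python to show what has to happen across the two sources: a lookup against the supplier table and a filter over the JSON shipment records. This is only an illustration of the query semantics, not how Drill evaluates them; the function names and the `now` parameter are assumptions made for the example, and the data is copied from the slide.

```python
from datetime import datetime, timedelta

# Supplier table from the slide (relational source).
SUPPLIERS = [
    {"supplier_id": "ACM", "name": "ACME Corp", "region": "US"},
    {"supplier_id": "GAL", "name": "GotALot Inc", "region": "US"},
    {"supplier_id": "BAP", "name": "Bits and Pieces Ltd", "region": "Europe"},
    {"supplier_id": "ZUP", "name": "Zu Pli", "region": "Asia"},
]

# Shipment records from the slide (JSON document source).
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_from(supplier_id, now, window=timedelta(hours=24)):
    """Shipments from one supplier timestamped within `window` before `now`.

    `now` is passed in explicitly (an assumption of this sketch) so the
    result is reproducible.
    """
    return [
        s for s in SHIPMENTS
        if s["supplier"] == supplier_id
        and timedelta(0) <= now - datetime.strptime(s["timestamp"], "%Y-%m-%d") <= window
    ]


def shipments_in_region_excluding(region, excluded_supplier):
    """Join shipments to suppliers, keep one region, drop one supplier."""
    region_of = {s["supplier_id"]: s["region"] for s in SUPPLIERS}
    return [
        s for s in SHIPMENTS
        if region_of.get(s["supplier"]) == region
        and s["supplier"] != excluded_supplier
    ]
```

With only the two sample shipments shown on the slide, the second query for region ‘US’ returns nothing, because the only US shipment comes from ‘ACM’.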
12. Requirements
• Support for different data sources
• Support for different query interfaces
• Low-latency/real-time
• Ad-hoc queries
• Scalable, reliable
14. Google’s Dremel
http://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google.
…
19. Apache Drill: key facts
• Inspired by Google’s Dremel
• Standard SQL 2003 support
• Pluggable data sources
• Nested data is a first-class citizen
• Schema is optional
• Community driven, open, 100s involved
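To make "nested data is a first-class citizen" and "schema is optional" concrete, here is a minimal sketch in Python of filtering nested records by a dotted path without declaring a schema first. It only illustrates the idea; `get_path` and `select` are hypothetical helper names, not Drill functions, and records that lack the field are simply skipped.

```python
def get_path(record, path):
    """Resolve a dotted path such as 'supplier.region' in a nested record.

    Returns None when any step is missing, so records of differing
    shapes can be queried without a declared schema.
    """
    current = record
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def select(records, path, predicate):
    """Keep records whose value at `path` exists and satisfies `predicate`."""
    result = []
    for record in records:
        value = get_path(record, path)
        if value is not None and predicate(value):
            result.append(record)
    return result
```

For example, `select(records, "supplier.region", lambda v: v == "US")` keeps only the records that carry a nested `supplier.region` equal to "US" and ignores records that have no `supplier` field at all.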
21. Principled Query Execution
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
• Source query: SQL 2003, DrQL, MongoQL, DSL (parser API)
• Optimizer: topology, CF, etc.
• Execution: scanner API
query: [
  {
    @id: "log",
    op: "sequence",
    do: [
      {
        op: "scan",
        source: "logs"
      },
      {
        op: "filter",
        condition: "x > 3"
      },
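The logical plan fragment above can be read as a small program: scan a source, then filter the rows. The following Python sketch interprets just the two steps visible on the slide. It is a toy stand-in, not Drill's reference interpreter; the plan is written as a Python dict, the condition grammar (`field operator number`) is an assumption of this sketch, and `run_sequence` and `parse_condition` are hypothetical names.

```python
import operator

# Comparison operators this sketch understands (an assumed, minimal set).
OPS = {">": operator.gt, "<": operator.lt, "==": operator.eq}


def parse_condition(text):
    """Turn a condition such as 'x > 3' into a row predicate."""
    field, op, literal = text.split()
    compare = OPS[op]
    threshold = float(literal)
    return lambda row: field in row and compare(row[field], threshold)


def run_sequence(plan, sources):
    """Run the steps of a 'sequence' operator one after another."""
    rows = []
    for step in plan["do"]:
        if step["op"] == "scan":
            rows = list(sources[step["source"]])
        elif step["op"] == "filter":
            keep = parse_condition(step["condition"])
            rows = [row for row in rows if keep(row)]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return rows


# Only the two steps shown on the slide; the rest of the plan is elided there.
PLAN = {
    "@id": "log",
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "logs"},
        {"op": "filter", "condition": "x > 3"},
    ],
}
```

Calling `run_sequence(PLAN, {"logs": rows})` returns the rows of the hypothetical `logs` source whose `x` value is greater than 3.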
22. Wire-level Architecture
• Each node: Drillbit, to maximize data locality
• Coordination, query planning, execution, etc. are distributed
• Any node can act as endpoint for a query (the foreman)
[Diagram: four nodes, each running a Drillbit next to a storage process]
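One way to picture "maximize data locality" is a scheduler that hands each data block to a Drillbit running on a node that already stores that block. The sketch below is a simple greedy assignment in Python under that assumption; it is not Drill's actual scheduling logic, and `assign_fragments`, the block ids, and the node names are all made up for illustration.

```python
def assign_fragments(block_locations, drillbits):
    """Assign each data block to a Drillbit, preferring local replicas.

    block_locations: dict mapping block id -> list of nodes holding a replica.
    drillbits: set of node names that run a Drillbit.
    When no replica is local to a Drillbit, fall back to the least-loaded one.
    """
    load = {node: 0 for node in drillbits}
    assignment = {}
    for block in sorted(block_locations):
        local = [node for node in block_locations[block] if node in load]
        candidates = local or sorted(load)
        # Pick the least-loaded candidate; break ties by node name.
        chosen = min(candidates, key=lambda node: (load[node], node))
        assignment[block] = chosen
        load[chosen] += 1
    return assignment
```

The fallback branch is what keeps a query runnable when data lives on a node without a Drillbit, at the cost of a remote read.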
23. Wire-level Architecture
• Curator/Zookeeper for ephemeral cluster membership info
• Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/Zk above four nodes, each running a Drillbit with a distributed cache next to a storage process]
24. Wire-level Architecture
• Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.
• Streaming data communication avoiding SerDe
[Diagram: Curator/Zk above four nodes, each running a Drillbit with a distributed cache next to a storage process]
25. Wire-level Architecture
Foreman turns into root of the multi-level execution tree; leaves activate their storage engine interface.
[Diagram: Curator/Zk and a tree of nodes]
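The multi-level execution tree can be sketched as a recursive aggregation: leaves read their local data and emit partial results, inner nodes merge them, and the root (the foreman) produces the final answer. The Python below computes an average this way. It is a single-process illustration of the idea only; the tree encoding (`rows` for leaves, `children` for inner nodes) and all function names are assumptions of this sketch.

```python
def leaf_partial(rows, field):
    """Leaf: read local rows and emit a partial (sum, count) for `field`."""
    values = [row[field] for row in rows if field in row]
    return (sum(values), len(values))


def merge(partials):
    """Inner node: combine the partial (sum, count) pairs of its children."""
    total = sum(p[0] for p in partials)
    count = sum(p[1] for p in partials)
    return (total, count)


def execute(tree, field):
    """Evaluate a tree of {"rows": [...]} leaves and {"children": [...]} nodes."""
    if "rows" in tree:
        return leaf_partial(tree["rows"], field)
    return merge([execute(child, field) for child in tree["children"]])


def average(tree, field):
    """Root: turn the merged partial result into the final average."""
    total, count = execute(tree, field)
    return total / count if count else None
```

Shipping (sum, count) pairs rather than raw rows is the point of the design: each level forwards a constant-size partial result, so the root never sees the full data set.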
26. Key features
• Full SQL – ANSI SQL 2003
• Nested Data as first class citizen
• Optional Schema
• Extensibility Points
…
27. Extensibility Points
• Source query → parser API
• Custom operators, UDF → logical plan
• Serving tree, CF, topology → physical plan/optimizer
• Data sources & formats → scanner API
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
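The "data sources & formats → scanner API" extension point can be pictured as a registry that maps a format name to a function producing records. The Python sketch below shows that pattern with two formats from the standard library. It does not reproduce Drill's scanner API; the `scanner` decorator, the `SCANNERS` registry, and the format names are hypothetical.

```python
import csv
import io
import json

# Registry of format name -> scanner function (hypothetical, for illustration).
SCANNERS = {}


def scanner(fmt):
    """Register a function that turns raw text in format `fmt` into records."""
    def register(func):
        SCANNERS[fmt] = func
        return func
    return register


@scanner("jsonl")
def scan_jsonl(text):
    """One JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


@scanner("csv")
def scan_csv(text):
    """Comma-separated text with a header row."""
    return list(csv.DictReader(io.StringIO(text)))


def scan(fmt, text):
    """Dispatch to the scanner registered for `fmt`."""
    if fmt not in SCANNERS:
        raise KeyError(f"no scanner registered for format: {fmt}")
    return SCANNERS[fmt](text)
```

Adding a new source then means registering one more function; the rest of the pipeline only ever sees lists of records.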
28. … and Hadoop?
• HDFS can be a data source
• Complementary use cases*
• … use Apache Drill
– Find record with specified condition
– Aggregation under dynamic conditions
• … use MapReduce
– Data mining with multiple iterations
– ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
31. Status
• Heavy development by multiple organizations
• Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
32. Status: May 2013
• Full SQL support (+JDBC)
• Physical plan
• In-memory compressed data interfaces
• Distributed execution
33. Status: May 2013
• HBase and MySQL storage engine
• WebUI client
34. Contributing
Contributions appreciated (not only code drops) …
• Test data & test queries
• Use case scenarios (textual/SQL queries)
• Documentation
• Further schedule
– Alpha Q2
– Beta Q3
35. Kudos to …
• Julian Hyde, Pentaho
• Lisen Mu, XingCloud
• Tim Chen, Microsoft
• Chris Merrick, RJMetrics
• David Alves, UT Austin
• Sree Vaadi, SSS/NGData
• Jacques Nadeau, MapR
• Ted Dunning, MapR
36. Engage!
• Follow @ApacheDrill on Twitter
• Sign up at mailing lists (user | dev): http://incubator.apache.org/drill/mailing-lists.html
• Standing G+ hangouts every Tuesday at 5pm GMT: http://j.mp/apache-drill-hangouts
• Keep an eye on http://drill-user.org/