Apache Drill
an interactive, ad-hoc query system for large-scale datasets
Michael Hausenblas, Chief Data Engineer EMEA, MapR
Big Data User Group Stuttgart, 2013-05-16
Which workloads do you encounter in your environment?
http://www.flickr.com/photos/kevinomara/2866648330/ licensed under CC BY-NC-ND 2.0
Batch processing
… for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing → for the batch layer in Lambda architecture
OLTP
… user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc. → for the serving layer in Lambda architecture
Stream processing
… in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.) → for the speed layer in Lambda architecture
Search/Information Retrieval
… retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0
But what about interactive ad-hoc query at scale?
Impala
Interactive Query (?)
low-latency
Use Case: Marketing Campaign
•  Jane, a marketing analyst
•  Determine target segments
•  Data from different sources
Use Case: Logistics
•  Supplier tracking and performance
•  Queries
– Shipments from supplier ‘ACM’ in last 24h
– Shipments in region ‘US’ not from ‘ACM’

SUPPLIER_ID   NAME                  REGION
ACM           ACME Corp             US
GAL           GotALot Inc           US
BAP           Bits and Pieces Ltd   Europe
ZUP           Zu Pli                Asia
{
"shipment": 100123,
"supplier": "ACM",
"timestamp": "2013-02-01",
"description": "first delivery today"
},
{
"shipment": 100124,
"supplier": "BAP",
"timestamp": "2013-02-02",
"description": "hope you enjoy it"
}
…
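To make the two logistics queries concrete, here is a minimal Python sketch that evaluates them over the supplier table and the two shipment records shown on the slide. It is plain in-memory Python, not Drill; the function names, the `now` reference time and the 24-hour `window` parameter are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Supplier table, copied from the slide.
SUPPLIERS = {
    "ACM": {"name": "ACME Corp", "region": "US"},
    "GAL": {"name": "GotALot Inc", "region": "US"},
    "BAP": {"name": "Bits and Pieces Ltd", "region": "Europe"},
    "ZUP": {"name": "Zu Pli", "region": "Asia"},
}

# Shipment records, copied from the slide (the slide elides the rest).
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
]


def shipments_from(supplier_id, now, window=timedelta(hours=24)):
    """Shipments from one supplier whose timestamp lies within `window` before `now`."""
    result = []
    for s in SHIPMENTS:
        ts = datetime.strptime(s["timestamp"], "%Y-%m-%d")
        if s["supplier"] == supplier_id and now - window <= ts <= now:
            result.append(s)
    return result


def shipments_in_region_excluding(region, excluded_supplier):
    """Join shipments to the supplier table on supplier id, then filter by region."""
    return [
        s for s in SHIPMENTS
        if SUPPLIERS[s["supplier"]]["region"] == region
        and s["supplier"] != excluded_supplier
    ]
```

The second query needs a join across two differently shaped sources (a table and JSON documents), which is the point of the use case. With only the two visible shipments, the ‘US’ not from ‘ACM’ query returns nothing, because the only other shipment comes from a European supplier.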
Use Case: Crime Detection
•  Online purchases
•  Fraud, bilking, etc.
•  Batch-generated overview
•  Modes
– Explorative
– Alerts
Requirements
•  Support for different data sources
•  Support for different query interfaces
•  Low-latency/real-time
•  Ad-hoc queries
•  Scalable, reliable
And now for something completely different …
Google’s Dremel
http://research.google.com/pubs/pub36632.html
Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
Dremel is a scalable, interactive ad-hoc
query system for analysis of read-only
nested data. By combining multi-level
execution trees and columnar data layout,
it is capable of running aggregation
queries over trillion-row tables in
seconds. The system scales to thousands of
CPUs and petabytes of data, and has
thousands of users at Google.
…
Google’s Dremel
multi-level execution trees
columnar data layout
Google’s Dremel
nested data + schema
column-striped representation
map nested data to tables
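As a rough illustration of what a column-striped representation of nested data means, the following Python sketch flattens nested records into dotted field paths and stores each path as its own column. It is deliberately simplified: it ignores repeated fields, and it does not implement the repetition and definition levels that the Dremel paper uses to reconstruct records losslessly. The sample records and function names are made up for this example.

```python
def flatten(record, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf field of a nested dict."""
    for key, value in record.items():
        path = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            yield from flatten(value, path)
        else:
            yield path, value


def to_columns(records):
    """Store each leaf path as its own list, one entry per record (None if absent)."""
    paths = []
    rows = []
    for record in records:
        row = dict(flatten(record))
        rows.append(row)
        for path in row:
            if path not in paths:
                paths.append(path)
    return {path: [row.get(path) for row in rows] for path in paths}


# Hypothetical sample data, not taken from the slides.
records = [
    {"id": 1, "name": {"first": "Ada", "last": "Lovelace"}},
    {"id": 2, "name": {"first": "Alan"}},
]
columns = to_columns(records)
```

The benefit for analytical queries is that an aggregation such as `sum(columns["id"])` only touches the one column it needs, instead of reading every full record.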
  
Google’s Dremel
experiments: datasets & query performance
Back to Apache Drill …
Apache Drill – key facts
•  Inspired by Google’s Dremel
•  Standard SQL 2003 support
•  Pluggable data sources
•  Nested data is a first-class citizen
•  Schema is optional
•  Community driven, open, 100’s involved
High-level Architecture
Principled Query Execution
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
Source query languages: SQL 2003, DrQL, MongoQL, DSL
Extension points: parser API, scanner API, Topology, CF, etc.
Logical plan example:
query: [
{
@id: "log",
op: "sequence",
do: [
{
op: "scan",
source: "logs"
},
{
op: "filter",
condition:
"x > 3"
},
Wire-level Architecture
•  Each node: Drillbit - maximize data locality
•  Co-ordination, query planning, execution, etc, are distributed
•  Any node can act as endpoint for a query (foreman)
[Diagram: four nodes, each with a storage process and a Drillbit]
Wire-level Architecture
•  Curator/ZooKeeper for ephemeral cluster membership info
•  Distributed cache (Hazelcast) for metadata, locality information, etc.
[Diagram: Curator/Zk above four nodes, each with a storage process, a Drillbit and a distributed cache]
Wire-level Architecture
•  Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc.
•  Streaming data communication avoiding SerDe
[Diagram: Curator/Zk above four nodes, each with a storage process, a Drillbit and a distributed cache]
Wire-level Architecture
Foreman turns into root of the multi-level execution tree, leaves activate their storage engine interface.
[Diagram: execution tree of nodes, with Curator/Zk alongside]
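The idea of a multi-level execution tree can be sketched in a few lines of Python: leaves scan their local data and return partial aggregates, inner nodes merge what their children return, and the root (the foreman) produces the final answer. This is a single-process toy under assumed names (`leaf_scan`, `combine`, `run_tree`) and invented sample rows, not Drill's actual execution engine.

```python
def leaf_scan(rows, predicate):
    """Leaf: scan local rows and return a partial aggregate (count, sum)."""
    values = [r["value"] for r in rows if predicate(r)]
    return len(values), sum(values)


def combine(partials):
    """Inner node or root: merge partial aggregates from the children."""
    count = sum(p[0] for p in partials)
    total = sum(p[1] for p in partials)
    return count, total


def run_tree(node, predicate):
    """Evaluate a tree of ("leaf", rows) and ("node", children) tuples bottom-up."""
    kind, payload = node
    if kind == "leaf":
        return leaf_scan(payload, predicate)
    return combine([run_tree(child, predicate) for child in payload])


# Hypothetical three-level tree: root -> two inner nodes -> three leaves.
tree = ("node", [
    ("node", [
        ("leaf", [{"value": 1}, {"value": 5}]),
        ("leaf", [{"value": 7}]),
    ]),
    ("node", [
        ("leaf", [{"value": 2}, {"value": 9}]),
    ]),
])

count, total = run_tree(tree, lambda r: r["value"] > 3)
average = total / count
```

Only small partial results travel up the tree, which is why an aggregate such as an average is carried as a (count, sum) pair rather than as per-leaf averages.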
  
Key features
•  Full SQL – ANSI SQL 2003
•  Nested Data as first class citizen
•  Optional Schema
•  Extensibility Points …
Extensibility Points
•  Source query → parser API
•  Custom operators, UDF → logical plan
•  Serving tree, CF, topology → physical plan/optimizer
•  Data sources & formats → scanner API
Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution
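One way to picture the "data sources & formats → scanner API" extension point is a small plug-in interface where each format supplies its own scanner and the engine only sees a stream of records. The Python sketch below is purely hypothetical: the `Scanner` class, the `SCANNERS` registry and `open_source` are invented for illustration and do not reflect Drill's real scanner API.

```python
import csv
import io
import json
from abc import ABC, abstractmethod


class Scanner(ABC):
    """Hypothetical scanner interface: turn one data source into a stream of records."""

    @abstractmethod
    def scan(self):
        """Yield records as dicts."""


class JsonLinesScanner(Scanner):
    """Reads one JSON object per line."""

    def __init__(self, text):
        self.text = text

    def scan(self):
        for line in self.text.splitlines():
            if line.strip():
                yield json.loads(line)


class CsvScanner(Scanner):
    """Reads CSV with a header row."""

    def __init__(self, text):
        self.text = text

    def scan(self):
        # DictReader yields every value as a string; no schema is applied here.
        yield from csv.DictReader(io.StringIO(self.text))


# Registry of available formats; adding a format means adding one entry.
SCANNERS = {"json": JsonLinesScanner, "csv": CsvScanner}


def open_source(fmt, text):
    """Look up the scanner registered for a format name."""
    return SCANNERS[fmt](text)
```

For example, `list(open_source("csv", "id,region\nACM,US\n").scan())` yields one record with string values, and the code that consumes the records does not need to know which format they came from.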
  
… and Hadoop?
•  HDFS can be a data source
•  Complementary use cases*
•  … use Apache Drill
–  Find record with specified condition
–  Aggregation under dynamic conditions
•  … use MapReduce
–  Data mining with multiple iterations
–  ETL
*) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
Basic Demo
https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
{
"id": "0001",
"type": "donut",
"ppu": 0.55,
"batters":
{
"batter":
[
{ "id": "1001", "type": "Regular" },
{ "id": "1002", "type": "Chocolate" },
…
data source: donuts.json
query:[ {
op:"sequence",
do:[
{
op: "scan",
ref: "donuts",
source: "local-logs",
selection: {data: "activity"}
},
{
op: "filter",
expr: "donuts.ppu < 2.00"
},
…
logical plan: simple_plan.json
result: out.json
{
"sales" : 700.0,
"typeCount" : 1,
"quantity" : 700,
"ppu" : 1.0
}
{
"sales" : 109.71,
"typeCount" : 2,
"quantity" : 159,
"ppu" : 0.69
}
{
"sales" : 184.25,
"typeCount" : 2,
"quantity" : 335,
"ppu" : 0.55
}
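To show how the visible part of the logical plan could be evaluated, here is a minimal Python sketch of an interpreter for just the `scan` and `filter` steps of a `sequence`. It is not the Drill reference interpreter and it does not reproduce out.json, since the remaining plan steps are elided on the slide. The expression parser only understands the form `<ref>.<field> <op> <number>`, the `selection` attribute is ignored, and the second sample record is invented so that the filter has something to drop.

```python
import operator

# Comparison operators supported by this toy expression parser.
OPS = {"<": operator.lt, ">": operator.gt, "==": operator.eq}


def parse_expr(expr):
    """Parse '<ref>.<field> <op> <number>' into (ref, field, compare_fn, literal)."""
    left, op, right = expr.split()
    ref, field = left.split(".", 1)
    return ref, field, OPS[op], float(right)


def run_sequence(steps, sources):
    """Run the steps of a 'sequence' in order, feeding each step's output to the next."""
    records = []
    for step in steps:
        if step["op"] == "scan":
            # Wrap each record under the scan's `ref` name so later steps can address it.
            records = [{step["ref"]: rec} for rec in sources[step["source"]]]
        elif step["op"] == "filter":
            ref, field, compare, literal = parse_expr(step["expr"])
            records = [r for r in records if compare(r[ref][field], literal)]
        else:
            raise ValueError(f"unsupported op: {step['op']}")
    return records


plan = [
    {"op": "scan", "ref": "donuts", "source": "local-logs"},
    {"op": "filter", "expr": "donuts.ppu < 2.00"},
]
sources = {
    "local-logs": [
        {"id": "0001", "type": "donut", "ppu": 0.55},  # from donuts.json
        {"id": "9999", "type": "donut", "ppu": 2.50},  # invented for this example
    ]
}
result = run_sequence(plan, sources)
```

Here `result` keeps only the record with `ppu` 0.55. Because every operator takes a list of records and returns a list of records, further operators can be added as extra branches without changing the loop.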
BE A PART OF IT!
Status
•  Heavy development by multiple organizations
•  Available
– Logical plan (ADSP)
– Reference interpreter
– Basic SQL parser
– Basic demo
Status
May 2013
•  Full SQL support (+JDBC)
•  Physical plan
•  In-memory compressed data interfaces
•  Distributed execution
Status
May 2013
•  HBase and MySQL storage engine
•  WebUI client
Contributing
Contributions appreciated (not only code drops) …
•  Test data & test queries
•  Use case scenarios (textual/SQL queries)
•  Documentation
•  Further schedule
–  Alpha Q2
–  Beta Q3
Kudos to …
•  Julian Hyde, Pentaho
•  Lisen Mu, XingCloud
•  Tim Chen, Microsoft
•  Chris Merrick, RJMetrics
•  David Alves, UT Austin
•  Sree Vaadi, SSS/NGData
•  Jacques Nadeau, MapR
•  Ted Dunning, MapR
Engage!
•  Follow @ApacheDrill on Twitter
•  Sign up at mailing lists (user | dev)
http://incubator.apache.org/drill/mailing-lists.html
•  Standing G+ hangouts every Tuesday at 5pm GMT
http://j.mp/apache-drill-hangouts
•  Keep an eye on http://drill-user.org/

More Related Content

What's hot

Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseCloudera, Inc.
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part IIArjen de Vries
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with SparkArjen de Vries
 
Publishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHubPublishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHubGlobus
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012Jimmy Lai
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Hortonworks
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Evert Lammerts
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionCloudera, Inc.
 
2011 mongo FR - scaling with mongodb
2011 mongo FR - scaling with mongodb2011 mongo FR - scaling with mongodb
2011 mongo FR - scaling with mongodbantoinegirbal
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Andrey Vykhodtsev
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivesiddharthboora
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010ragho
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Cloudera, Inc.
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in SearchAmund Tveit
 

What's hot (20)

Sf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBaseSf NoSQL MeetUp: Apache Hadoop and HBase
Sf NoSQL MeetUp: Apache Hadoop and HBase
 
Hive hcatalog
Hive hcatalogHive hcatalog
Hive hcatalog
 
An Introduction to Hadoop
An Introduction to HadoopAn Introduction to Hadoop
An Introduction to Hadoop
 
Bigdata processing with Spark - part II
Bigdata processing with Spark - part IIBigdata processing with Spark - part II
Bigdata processing with Spark - part II
 
Bigdata processing with Spark
Bigdata processing with SparkBigdata processing with Spark
Bigdata processing with Spark
 
Publishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHubPublishing and Serving Machine Learning Models with DLHub
Publishing and Serving Machine Learning Models with DLHub
 
MongoDB
MongoDBMongoDB
MongoDB
 
When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012When big data meet python @ COSCUP 2012
When big data meet python @ COSCUP 2012
 
Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012Future of HCatalog - Hadoop Summit 2012
Future of HCatalog - Hadoop Summit 2012
 
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010Introduction to SARA's Hadoop Hackathon - dec 7th 2010
Introduction to SARA's Hadoop Hackathon - dec 7th 2010
 
EclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An IntroductionEclipseCon Keynote: Apache Hadoop - An Introduction
EclipseCon Keynote: Apache Hadoop - An Introduction
 
2011 mongo FR - scaling with mongodb
2011 mongo FR - scaling with mongodb2011 mongo FR - scaling with mongodb
2011 mongo FR - scaling with mongodb
 
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
Big Data Essentials meetup @ IBM Ljubljana 23.06.2015
 
HADOOP
HADOOPHADOOP
HADOOP
 
report on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hivereport on aadhaar anlysis using bid data hadoop and hive
report on aadhaar anlysis using bid data hadoop and hive
 
Hive ICDE 2010
Hive ICDE 2010Hive ICDE 2010
Hive ICDE 2010
 
Understanding hdfs
Understanding hdfsUnderstanding hdfs
Understanding hdfs
 
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
Apache Hadoop an Introduction - Todd Lipcon - Gluecon 2010
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 
Mapreduce in Search
Mapreduce in SearchMapreduce in Search
Mapreduce in Search
 

Viewers also liked

Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillMapR Technologies
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingCloudera, Inc.
 
An introduction to Microservices
An introduction to MicroservicesAn introduction to Microservices
An introduction to MicroservicesCisco DevNet
 
Solving Problems with Graphs
Solving Problems with GraphsSolving Problems with Graphs
Solving Problems with GraphsMarko Rodriguez
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill WorkshopCharles Givre
 
Quantum Processes in Graph Computing
Quantum Processes in Graph ComputingQuantum Processes in Graph Computing
Quantum Processes in Graph ComputingMarko Rodriguez
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBasedave_revell
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drilltshiran
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraVictor Coustenoble
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Helena Edelson
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataMarko Rodriguez
 
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsNetflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsChris Saint-Amant
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsJonas Bonér
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelinprajods
 

Viewers also liked (18)

Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...Self-Service BI for big data applications using Apache Drill (Big Data Amster...
Self-Service BI for big data applications using Apache Drill (Big Data Amster...
 
Apache drill
Apache drillApache drill
Apache drill
 
Free Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache DrillFree Code Friday: Drill 101 - Basics of Apache Drill
Free Code Friday: Drill 101 - Basics of Apache Drill
 
Hadoop: Distributed Data Processing
Hadoop: Distributed Data ProcessingHadoop: Distributed Data Processing
Hadoop: Distributed Data Processing
 
An introduction to Microservices
An introduction to MicroservicesAn introduction to Microservices
An introduction to Microservices
 
Dremel Paper Review
Dremel Paper ReviewDremel Paper Review
Dremel Paper Review
 
High-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in HadoopHigh-Scale Entity Resolution in Hadoop
High-Scale Entity Resolution in Hadoop
 
Solving Problems with Graphs
Solving Problems with GraphsSolving Problems with Graphs
Solving Problems with Graphs
 
Apache Drill Workshop
Apache Drill WorkshopApache Drill Workshop
Apache Drill Workshop
 
Quantum Processes in Graph Computing
Quantum Processes in Graph ComputingQuantum Processes in Graph Computing
Quantum Processes in Graph Computing
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
Analyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache DrillAnalyzing Real-World Data with Apache Drill
Analyzing Real-World Data with Apache Drill
 
BI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache CassandraBI, Reporting and Analytics on Apache Cassandra
BI, Reporting and Analytics on Apache Cassandra
 
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
Streaming Big Data with Spark, Kafka, Cassandra, Akka & Scala (from webinar)
 
Titan: The Rise of Big Graph Data
Titan: The Rise of Big Graph DataTitan: The Rise of Big Graph Data
Titan: The Rise of Big Graph Data
 
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.jsNetflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
Netflix JavaScript Talks - Scaling A/B Testing on Netflix.com with Node.js
 
Scalability, Availability & Stability Patterns
Scalability, Availability & Stability PatternsScalability, Availability & Stability Patterns
Scalability, Availability & Stability Patterns
 
Big Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and ZeppelinBig Data visualization with Apache Spark and Zeppelin
Big Data visualization with Apache Spark and Zeppelin
 

Similar to Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill MapR Technologies
 
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Timothy Chen
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataRostislav Pashuto
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big DataOmnia Safaan
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewRiccardo Zamana
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of MetadataJim Dowling
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataRahul Jain
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016Mark Smith
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About Jesus Rodriguez
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009Ian Foster
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)tsliwowicz
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesCorley S.r.l.
 
20181215 introduction to graph databases
20181215   introduction to graph databases20181215   introduction to graph databases
20181215 introduction to graph databasesTimothy Findlay
 

Similar to Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets (20)

Apache Drill
Apache DrillApache Drill
Apache Drill
 
Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill Berlin Hadoop Get Together Apache Drill
Berlin Hadoop Get Together Apache Drill
 
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
 
Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...Large scale, interactive ad-hoc queries over different datastores with Apache...
Large scale, interactive ad-hoc queries over different datastores with Apache...
 
M7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal HausenblasM7 and Apache Drill, Micheal Hausenblas
M7 and Apache Drill, Micheal Hausenblas
 
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in ProductionTugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
Tugdual Grall - Real World Use Cases: Hadoop and NoSQL in Production
 
Aggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of dataAggregated queries with Druid on terrabytes and petabytes of data
Aggregated queries with Druid on terrabytes and petabytes of data
 
Drill at the Chicago Hug
Drill at the Chicago HugDrill at the Chicago Hug
Drill at the Chicago Hug
 
Inroduction to Big Data
Inroduction to Big DataInroduction to Big Data
Inroduction to Big Data
 
Data saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overviewData saturday malta - ADX Azure Data Explorer overview
Data saturday malta - ADX Azure Data Explorer overview
 
Data Science with the Help of Metadata
Data Science with the Help of MetadataData Science with the Help of Metadata
Data Science with the Help of Metadata
 
Emerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big DataEmerging technologies /frameworks in Big Data
Emerging technologies /frameworks in Big Data
 
Big data meet_up_08042016
Big data meet_up_08042016Big data meet_up_08042016
Big data meet_up_08042016
 
10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About 10 Big Data Technologies you Didn't Know About
10 Big Data Technologies you Didn't Know About
 
Apache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real TimeApache Eagle - Monitor Hadoop in Real Time
Apache Eagle - Monitor Hadoop in Real Time
 
Computing Outside The Box June 2009
Computing Outside The Box June 2009Computing Outside The Box June 2009
Computing Outside The Box June 2009
 
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)Taboola's experience with Apache Spark (presentation @ Reversim 2014)
Taboola's experience with Apache Spark (presentation @ Reversim 2014)
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Big data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting LanguagesBig data, just an introduction to Hadoop and Scripting Languages
Big data, just an introduction to Hadoop and Scripting Languages
 
20181215 introduction to graph databases
20181215   introduction to graph databases20181215   introduction to graph databases
20181215 introduction to graph databases
 

More from MapR Technologies

Converging your data landscape
Converging your data landscapeConverging your data landscape
Converging your data landscapeMapR Technologies
 
ML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationML Workshop 2: Machine Learning Model Comparison & Evaluation
ML Workshop 2: Machine Learning Model Comparison & EvaluationMapR Technologies
 
Self-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataSelf-Service Data Science for Leveraging ML & AI on All of Your Data
Self-Service Data Science for Leveraging ML & AI on All of Your DataMapR Technologies
 
Enabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureEnabling Real-Time Business with Change Data Capture
Enabling Real-Time Business with Change Data CaptureMapR Technologies
 
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...
Machine Learning for Chickens, Autonomous Driving and a 3-year-old Who Won’t ...MapR Technologies
 
Apache Drill: An Active, Ad-hoc Query System for large-scale Data Sets

  • 1. Apache Drill: an interactive, ad-hoc query system for large-scale datasets. Michael Hausenblas, Chief Data Engineer EMEA, MapR. Big Data User Group Stuttgart, 2013-05-16
  • 2. Which workloads do you encounter in your environment? http://www.flickr.com/photos/kevinomara/2866648330/ licensed under CC BY-NC-ND 2.0
  • 3. Batch processing … for recurring tasks such as large-scale data mining, ETL offloading/data-warehousing → for the batch layer in Lambda architecture
  • 4. OLTP … user-facing eCommerce transactions, real-time messaging at scale (FB), time-series processing, etc. → for the serving layer in Lambda architecture
  • 5. Stream processing … in order to handle stream sources such as social media feeds or sensor data (mobile phones, RFID, weather stations, etc.) → for the speed layer in Lambda architecture
  • 6. Search/Information Retrieval … retrieval of items from unstructured documents (plain text, etc.), semi-structured data formats (JSON, etc.), as well as data stores (MongoDB, CouchDB, etc.)
  • 7. http://www.flickr.com/photos/9479603@N02/4144121838/ licensed under CC BY-NC-ND 2.0 But what about interactive ad-hoc query at scale?
  • 8. Impala Interactive Query (?) low-latency
  • 9. Use Case: Marketing Campaign • Jane, a marketing analyst • Determine target segments • Data from different sources
  • 10. Use Case: Logistics • Supplier tracking and performance • Queries – Shipments from supplier ‘ACM’ in last 24h – Shipments in region ‘US’ not from ‘ACM’
      SUPPLIER_ID  NAME                 REGION
      ACM          ACME Corp            US
      GAL          GotALot Inc          US
      BAP          Bits and Pieces Ltd  Europe
      ZUP          Zu Pli               Asia
      { "shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01", "description": "first delivery today" },
      { "shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02", "description": "hope you enjoy it" } …
  • 11. Use Case: Crime Detection • Online purchases • Fraud, bilking, etc. • Batch-generated overview • Modes – Explorative – Alerts
  • 12. Requirements • Support for different data sources • Support for different query interfaces • Low-latency/real-time • Ad-hoc queries • Scalable, reliable
  • 13. And now for something completely different …
  • 14. Google’s Dremel http://research.google.com/pubs/pub36632.html Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339 “Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. …”
  • 15. Google’s Dremel • multi-level execution trees • columnar data layout
  • 16. Google’s Dremel • nested data + schema • column-striped representation • map nested data to tables
  • 18. Back to Apache Drill …
  • 19. Apache Drill: key facts • Inspired by Google’s Dremel • Standard SQL 2003 support • Pluggable data sources • Nested data is a first-class citizen • Schema is optional • Community driven, open, 100’s involved
  • 21. Principled Query Execution • Pipeline: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution • Diagram labels: SQL 2003, DrQL, MongoQL, DSL; parser API; scanner API; Topology, CF, etc. • Logical plan excerpt: query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: "logs" }, { op: "filter", condition: "x > 3" },
  • 22. Wire-level Architecture • Each node: Drillbit - maximize data locality • Co-ordination, query planning, execution, etc, are distributed • Any node can act as endpoint for a query (the foreman) [Diagram: four nodes, each running a storage process and a Drillbit]
  • 23. Wire-level Architecture • Curator/Zookeeper for ephemeral cluster membership info • Distributed cache (Hazelcast) for metadata, locality information, etc. [Diagram: Curator/Zk plus four nodes, each running a storage process, a Drillbit, and a distributed cache]
  • 24. Wire-level Architecture • Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc. • Streaming data communication avoiding SerDe [Diagram: Curator/Zk plus four nodes, each running a storage process, a Drillbit, and a distributed cache]
  • 25. Wire-level Architecture Foreman turns into root of the multi-level execution tree, leaves activate their storage engine interface. [Diagram: Curator/Zk and three nodes]
  • 26. Key features • Full SQL – ANSI SQL 2003 • Nested Data as first class citizen • Optional Schema • Extensibility Points …
  • 27. Extensibility Points • Source query → parser API • Custom operators, UDF → logical plan • Serving tree, CF, topology → physical plan/optimizer • Data sources & formats → scanner API [Diagram: Source Query → Parser → Logical Plan → Optimizer → Physical Plan → Execution]
  • 28. … and Hadoop? • HDFS can be a data source • Complementary use cases* • … use Apache Drill – Find record with specified condition – Aggregation under dynamic conditions • … use MapReduce – Data mining with multiple iterations – ETL *) https://cloud.google.com/files/BigQueryTechnicalWP.pdf
  • 29. Basic Demo https://cwiki.apache.org/confluence/display/DRILL/Demo+HowTo
      data source: donuts.json
      { "id": "0001", "type": "donut", "ppu": 0.55, "batters": { "batter": [ { "id": "1001", "type": "Regular" }, { "id": "1002", "type": "Chocolate" }, …
      logical plan: simple_plan.json
      query:[ { op:"sequence", do:[ { op: "scan", ref: "donuts", source: "local-logs", selection: {data: "activity"} }, { op: "filter", expr: "donuts.ppu < 2.00" }, …
      result: out.json
      { "sales" : 700.0, "typeCount" : 1, "quantity" : 700, "ppu" : 1.0 }
      { "sales" : 109.71, "typeCount" : 2, "quantity" : 159, "ppu" : 0.69 }
      { "sales" : 184.25, "typeCount" : 2, "quantity" : 335, "ppu" : 0.55 }
  • 30. BE A PART OF IT!
  • 31. Status • Heavy development by multiple organizations • Available – Logical plan (ADSP) – Reference interpreter – Basic SQL parser – Basic demo
  • 32. Status May 2013 • Full SQL support (+JDBC) • Physical plan • In-memory compressed data interfaces • Distributed execution
  • 33. Status May 2013 • HBase and MySQL storage engine • WebUI client
  • 34. Contributing Contributions appreciated (not only code drops) … • Test data & test queries • Use case scenarios (textual/SQL queries) • Documentation • Further schedule – Alpha Q2 – Beta Q3
  • 35. Kudos to … • Julian Hyde, Pentaho • Lisen Mu, XingCloud • Tim Chen, Microsoft • Chris Merrick, RJMetrics • David Alves, UT Austin • Sree Vaadi, SSS/NGData • Jacques Nadeau, MapR • Ted Dunning, MapR
  • 36. Engage! • Follow @ApacheDrill on Twitter • Sign up at mailing lists (user | dev) http://incubator.apache.org/drill/mailing-lists.html • Standing G+ hangouts every Tuesday at 5pm GMT http://j.mp/apache-drill-hangouts • Keep an eye on http://drill-user.org/
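The two logistics queries from slide 10 can be sketched in plain Python to show what the ad-hoc questions amount to: a filter on one source, and a join between a relational-style supplier table and nested JSON shipment records. This is only an illustration of the query semantics, not Drill code. The function names, the reference time `now`, and shipment 100125 are assumptions added here so that both queries return a row.

```python
from datetime import datetime, timedelta

# Supplier table as shown on slide 10.
SUPPLIERS = [
    {"supplier_id": "ACM", "name": "ACME Corp", "region": "US"},
    {"supplier_id": "GAL", "name": "GotALot Inc", "region": "US"},
    {"supplier_id": "BAP", "name": "Bits and Pieces Ltd", "region": "Europe"},
    {"supplier_id": "ZUP", "name": "Zu Pli", "region": "Asia"},
]

# The first two records are from slide 10; 100125 is a hypothetical record
# added for illustration only.
SHIPMENTS = [
    {"shipment": 100123, "supplier": "ACM", "timestamp": "2013-02-01",
     "description": "first delivery today"},
    {"shipment": 100124, "supplier": "BAP", "timestamp": "2013-02-02",
     "description": "hope you enjoy it"},
    {"shipment": 100125, "supplier": "GAL", "timestamp": "2013-02-02",
     "description": "hypothetical example record"},
]


def shipments_from_supplier_since(shipments, supplier_id, now, window):
    """Shipments from one supplier whose timestamp lies in [now - window, now]."""
    cutoff = now - window
    return [
        s for s in shipments
        if s["supplier"] == supplier_id
        and cutoff <= datetime.strptime(s["timestamp"], "%Y-%m-%d") <= now
    ]


def shipments_in_region_excluding(shipments, suppliers, region, excluded_id):
    """Join shipments to suppliers on the supplier ID, keep one region, drop one supplier."""
    region_of = {s["supplier_id"]: s["region"] for s in suppliers}
    return [
        s for s in shipments
        if region_of.get(s["supplier"]) == region and s["supplier"] != excluded_id
    ]


# Hypothetical "current time"; the slides do not specify one.
now = datetime(2013, 2, 1, 12, 0)
recent_acm = shipments_from_supplier_since(SHIPMENTS, "ACM", now, timedelta(hours=24))
us_not_acm = shipments_in_region_excluding(SHIPMENTS, SUPPLIERS, "US", "ACM")
```

The second query is the one that motivates the requirement for multiple data sources: the region lives in the supplier table, while the shipments arrive as JSON documents.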
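The logical plan shown on slides 21 and 29 is a sequence of operators (scan, then filter) applied to a data source. The following minimal sketch shows how such a plan could be interpreted. It is not Drill's reference interpreter: the operator functions, the in-memory `DATA_SOURCES` registry, the simplified `source` name, and the tiny expression grammar are all assumptions, and the second donut record is hypothetical.

```python
import operator
import re

# Hypothetical in-memory registry standing in for pluggable data sources.
# Record "0001" follows slide 29 (nested "batters" omitted); "0002" is invented.
DATA_SOURCES = {
    "donuts": [
        {"id": "0001", "type": "donut", "ppu": 0.55},
        {"id": "0002", "type": "donut", "ppu": 2.50},
    ]
}

# Simplified version of the plan on slide 29: scan the source, then filter.
PLAN = {
    "op": "sequence",
    "do": [
        {"op": "scan", "ref": "donuts", "source": "donuts"},
        {"op": "filter", "expr": "donuts.ppu < 2.00"},
    ],
}

_COMPARATORS = {"<": operator.lt, ">": operator.gt,
                "<=": operator.le, ">=": operator.ge}
# Accepts only expressions of the form "<ref>.<field> <cmp> <number>".
_EXPR = re.compile(r"^(\w+)\.(\w+)\s*(<=|>=|<|>)\s*([0-9.]+)$")


def scan(step, _rows):
    """Emit one row per source record, namespaced under the step's ref."""
    for record in DATA_SOURCES[step["source"]]:
        yield {step["ref"]: record}


def filter_rows(step, rows):
    """Keep the rows for which the comparison in step['expr'] holds."""
    match = _EXPR.match(step["expr"])
    if match is None:
        raise ValueError(f"unsupported expression: {step['expr']!r}")
    ref, field, symbol, literal = match.groups()
    compare, threshold = _COMPARATORS[symbol], float(literal)
    for row in rows:
        if compare(row[ref][field], threshold):
            yield row


OPERATORS = {"scan": scan, "filter": filter_rows}


def run(plan):
    """Chain the operators of a 'sequence' plan, feeding each the previous output."""
    if plan["op"] != "sequence":
        raise ValueError("only 'sequence' plans are supported in this sketch")
    rows = iter(())
    for step in plan["do"]:
        rows = OPERATORS[step["op"]](step, rows)
    return list(rows)


result = run(PLAN)
```

Because each operator is a generator over the previous one, rows stream through the sequence one at a time, which loosely mirrors the idea of a plan as a pipeline of operators rather than a series of materialized intermediate tables.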