HUG France - Apache Drill

Talk at HUG France on December 4, 2012, about the new Apache Drill project. Notably, this talk includes an introduction to the converging specification for the logical plan in Drill.

Published in: Technology, Education
Notes
  • No graphic changes. Note for bullet changes: Open source: community consensus, API. Available for all distributions.
  • Likely to support these. Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon the query. Pig as well.
    Pluggability: data format, query language.
    Something 6-9 months, alpha quality. Community driven; I can't speak for the project.
    MapR-FS gives better chunk size control. NFS support may make small test drivers easier. Unified namespace will allow multi-cluster access.
    Might even have a Drill component that autoformats data. Read-only model.
  • Protocol Buffers are the conceptual data model. Will support multiple data models. Will have to define a way to explain data format (filtering, fields, etc.). Schema-less will have a perf penalty. HBase will be one format.
  • Note: we have an already partially built execution engine
  • Example query that Drill should support. Need to talk more here about what Dremel does.
  • Be prepared for Apache questions: committer vs. committee vs. contributor. If can't answer a question, ask them to answer and contribute. Lisa: need landing page. References to paper and such at end.
  • Transcript of "HUG France - Apache Drill"

    1. Apache Drill
       ©MapR Technologies - Confidential
    2. My Background
       • Academia, startups
         – Aptex, MusicMatch, ID Analytics, Veoh
         – Big data since before big
       • Open source
         – Since the dark ages before the internet
         – Mahout, ZooKeeper, Drill
         – Bought the beer at first HUG
       • MapR
       • Founding member of Apache Drill
    3. MapR Technologies
       • The open enterprise-grade distribution for Hadoop
         – Easy, dependable and fast
         – Open source with standards-based extensions
       • MapR is deployed at thousands of companies
         – From small Internet startups to the world's largest enterprises
       • MapR customers analyze massive amounts of data:
         – Hundreds of billions of events daily
         – 90% of the world's Internet population monthly
         – $1 trillion in retail purchases annually
       • MapR has partnered with Google to provide Hadoop on Google Compute Engine
    4. Agenda
       • What?
         – What exactly does Drill do?
       • Why?
         – Why do we need Apache Drill?
       • Who?
         – Who is doing this?
       • How?
         – How does Drill work inside?
       • Conclusion
         – How can you help?
         – Where can you find out more?
    5. Apache Drill Overview
       • Drill overview
         – Low-latency interactive queries
         – Standard ANSI SQL support
       • Open source
         – Hundreds involved across US and Europe
         – Community consensus on API, functionality
       • PMC expects first version late this quarter
         – Several components already developed
    6. Big Data Processing – Hadoop
                           | Batch processing
       Query runtime       | Minutes to hours
       Data volume         | TBs to PBs
       Programming model   | MapReduce
       Users               | Developers
       Google project      | MapReduce
       Open source project | Hadoop MapReduce
    7. Big Data Processing – Hadoop and Storm
                           | Batch processing | Stream processing
       Query runtime       | Minutes to hours | Never-ending
       Data volume         | TBs to PBs       | Continuous stream
       Programming model   | MapReduce        | DAG (pre-programmed)
       Users               | Developers       | Developers
       Google project      | MapReduce        |
       Open source project | Hadoop MapReduce | Storm or Apache S4
    8. Big Data Processing – The missing part
                           | Batch processing | Interactive analysis | Stream processing
       Query runtime       | Minutes to hours |                      | Never-ending
       Data volume         | TBs to PBs       |                      | Continuous stream
       Programming model   | MapReduce        |                      | DAG (pre-programmed)
       Users               | Developers       |                      | Developers
       Google project      | MapReduce        |                      |
       Open source project | Hadoop MapReduce |                      | Storm and S4
    9. Big Data Processing – The missing part
                           | Batch processing | Interactive analysis    | Stream processing
       Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
       Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
       Programming model   | MapReduce        | Queries (ad hoc)        | DAG (pre-programmed)
       Users               | Developers       | Analysts and developers | Developers
       Google project      | MapReduce        |                         |
       Open source project | Hadoop MapReduce |                         | Storm and S4
    10. Big Data Processing
                            | Batch processing | Interactive analysis    | Stream processing
        Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
        Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
        Programming model   | MapReduce        | Queries                 | DAG
        Users               | Developers       | Analysts and developers | Developers
        Google project      | MapReduce        | Dremel                  |
        Open source project | Hadoop MapReduce |                         | Storm and S4
    11. Big Data Processing
                            | Batch processing | Interactive analysis    | Stream processing
        Query runtime       | Minutes to hours | Milliseconds to minutes | Never-ending
        Data volume         | TBs to PBs       | GBs to PBs              | Continuous stream
        Programming model   | MapReduce        | Queries                 | DAG
        Users               | Developers       | Analysts and developers | Developers
        Google project      | MapReduce        | Dremel                  |
        Open source project | Hadoop MapReduce |                         | Storm and S4
        Introducing Apache Drill
    12. Latency Matters
        • Ad-hoc analysis with interactive tools
        • Real-time dashboards
        • Event/trend detection and analysis
          – Network intrusions
          – Fraud
          – Failures
    13. Nested Query Languages
        • DrQL
          – SQL-like query language for nested data
          – Compatible with Google BigQuery/Dremel
            • BigQuery applications should work with Drill
          – Designed to support efficient column-based processing
            • No record assembly during query processing
        • Mongo Query Language
          – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
        • Other languages/programming models can plug in
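To make the Mongo-style query on slide 13 concrete, here is a minimal sketch of what `{$query: {x: 3, y: "abc"}, $orderby: {x: 1}}` asks for: equality filters followed by an ordering. The function name `run_mongo_style_query` and the sample documents are hypothetical, and only equality matching and simple ascending/descending sorts are handled. It is not Drill or MongoDB code.

```python
def run_mongo_style_query(docs, spec):
    """Keep docs matching every equality in spec["$query"], then sort by spec["$orderby"]."""
    criteria = spec.get("$query", {})
    matched = [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]
    # Sort by the last key first; Python's sort is stable, so the first key dominates.
    for field, direction in reversed(list(spec.get("$orderby", {}).items())):
        matched.sort(key=lambda d: d[field], reverse=(direction < 0))
    return matched


# Hypothetical sample documents; "z" only serves to tell them apart.
docs = [
    {"x": 3, "y": "abc", "z": 2},
    {"x": 3, "y": "xyz", "z": 1},
    {"x": 5, "y": "abc", "z": 0},
    {"x": 3, "y": "abc", "z": 1},
]
spec = {"$query": {"x": 3, "y": "abc"}, "$orderby": {"x": 1}}
result = run_mongo_style_query(docs, spec)
```

Because every match has `x == 3`, the ordering leaves the two matching documents in their input order.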
    14. Nested Data Model
        • The data model in Dremel is Protocol Buffers
          – Nested
          – Schema
        • Apache Drill is designed to support multiple data models
          – Schema: Protocol Buffers, Apache Avro, …
          – Schema-less: JSON, BSON, …
        • Flat records are supported as a special case of nested data
          – CSV, TSV, …
        JSON:
          { "name": "Srivas", "gender": "Male", "followers": 100 }
          { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }
        Avro IDL:
          enum Gender { MALE, FEMALE }
          record User {
            string name;
            Gender gender;
            long followers;
          }
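The contrast on slide 14 between schema-based and schema-less data can be illustrated with the two JSON records shown there. The sketch below parses them, compares each against the fields declared by the `User` record in the Avro IDL, and checks that both are flat (no nested values). The helper names `extra_fields` and `is_flat` are hypothetical.

```python
import json

# The two JSON records from the slide.
lines = [
    '{ "name": "Srivas", "gender": "Male", "followers": 100 }',
    '{ "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" }',
]
records = [json.loads(line) for line in lines]

# Fields declared by the User record in the Avro IDL on the slide.
schema_fields = {"name", "gender", "followers"}


def extra_fields(record):
    """Fields present in a schema-less record but absent from the schema."""
    return sorted(set(record) - schema_fields)


def is_flat(record):
    """A flat record has no nested objects or arrays: the special case of nested data."""
    return not any(isinstance(v, (dict, list)) for v in record.values())
```

The second record carries a `zip` field that the schema does not declare, which a schema-less reader has to discover at read time.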
    15. Extensibility
        • Nested query languages
          – Pluggable model
          – DrQL
          – Mongo Query Language
          – Cascading
        • Distributed execution engine
          – Extensible model (e.g., Dryad)
          – Low-latency
          – Fault tolerant
        • Nested data formats
          – Pluggable model
          – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
          – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
        • Scalable data sources
          – Pluggable model
          – Hadoop
          – HBase
    16. Design Principles
        Flexible
          • Pluggable query languages
          • Extensible execution engine
          • Pluggable data formats
            • Column-based and row-based
            • Schema and schema-less
          • Pluggable data sources
        Easy
          • Unzip and run
          • Zero configuration
          • Reverse DNS not needed
          • IP addresses can change
          • Clear and concise log messages
        Dependable
          • No SPOF
          • Instant recovery from crashes
        Fast
          • C/C++ core with Java support
            • Google C++ style guide
          • Min latency and max throughput (limited only by hardware)
    17. Apache Drill
    18. Architecture
        • Only the execution engine knows the physical attributes of the cluster
          – # nodes, hardware, file locations, …
        • Public interfaces enable extensibility
          – Developers can build parsers for new query languages
          – Developers can provide an execution plan directly
        • Each level of the plan has a human-readable representation
          – Facilitates debugging and unit testing
    19. Execution Engine Layers
        • Drill execution engine has two layers
          – Operator layer is serialization-aware
            • Processes individual records
          – Execution layer is not serialization-aware
            • Processes batches of records (blobs)
            • Responsible for communication, dependencies and fault tolerance
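The split described on slide 19 can be sketched as follows: an operator decodes a batch and works record by record, while the execution layer only moves opaque byte blobs around. JSON is used as the serialization purely for illustration, and all function names are hypothetical; the slides do not say which encoding Drill uses between layers.

```python
import json


def encode_batch(records):
    """Serialize a batch of records into an opaque blob (JSON is an assumption)."""
    return json.dumps(records).encode("utf-8")


def decode_batch(blob):
    return json.loads(blob.decode("utf-8"))


def operator_filter(blob, predicate):
    """Operator layer: serialization-aware, processes individual records."""
    return encode_batch([r for r in decode_batch(blob) if predicate(r)])


def execution_layer(blobs, operator):
    """Execution layer: routes blobs to an operator without inspecting their contents."""
    return [operator(blob) for blob in blobs]


blobs = [encode_batch([{"v": 1}, {"v": 5}]), encode_batch([{"v": 7}])]
out = execution_layer(blobs, lambda b: operator_filter(b, lambda r: r["v"] > 3))
```

In this sketch only `operator_filter` ever calls `decode_batch`; a real execution layer would also add the communication, dependency tracking and fault tolerance that the slide assigns to it.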
    20. DrQL Example
          SELECT DocId AS Id,
                 COUNT(Name.Language.Code) WITHIN Name AS Cnt,
                 Name.Url + ',' + Name.Language.Code AS Str
          FROM t
          WHERE REGEXP(Name.Url, '^http')
            AND DocId < 20;
        * Example from the Dremel paper
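As a rough reading of the DrQL example on slide 20, the sketch below evaluates the same logic over nested Python dictionaries: keep records with `DocId < 20`, keep only `Name` entries whose `Url` starts with `http`, count the language codes within each surviving `Name`, and build the concatenated strings. The function name `run_example_query` and the sample record are hypothetical, and the handling of records with no matching `Name` is a simplification rather than a statement of Dremel's exact semantics.

```python
import re


def run_example_query(records):
    out = []
    for rec in records:
        if rec["DocId"] >= 20:
            continue  # WHERE DocId < 20
        names = []
        for name in rec.get("Name", []):
            url = name.get("Url")
            if url is None or not re.search(r"^http", url):
                continue  # WHERE REGEXP(Name.Url, '^http')
            codes = [lang["Code"] for lang in name.get("Language", []) if "Code" in lang]
            names.append({
                "Cnt": len(codes),                        # COUNT(...) WITHIN Name
                "Str": [url + "," + c for c in codes],    # Name.Url + ',' + Code
            })
        out.append({"Id": rec["DocId"], "Name": names})
    return out


# Hypothetical nested record with the fields the query refers to.
sample = [{
    "DocId": 10,
    "Name": [
        {"Url": "http://A", "Language": [{"Code": "en-us"}, {"Code": "en"}]},
        {"Url": "http://B"},
        {"Language": [{"Code": "en-gb"}]},
    ],
}]
result = run_example_query(sample)
```

The third `Name` has no `Url`, so it is dropped; the second has a matching `Url` but no languages, so its count is zero.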
    21. Query Components
        • Query components:
          – SELECT
          – FROM
          – WHERE
          – GROUP BY
          – HAVING
          – (JOIN)
        • Key logical operators:
          – Scan
          – Filter
          – Aggregate
          – (Join)
    22. Logical Plan
        [Figure: logical plan diagram. Node labels: scan-json, filter, flatten, aggregate; annotations: exp1, exp2, "table-1"]
    23. Execution Plan
        [Figure: execution plan diagram. One copy of the plan with scan-json, filter, flatten, aggregate (annotations exp1, exp2, "table-1") and two further copies with scan-json, filter, flatten (annotations exp1, "table-1"); labels node1, node2, node3]
    24. Logical Plan Syntax
          {op: "sequence",
           do: [
             {op: "scan",
              source: "table-1.json",
              selection: "*"
             },
             {op: "filter",
              expr: <expr>
             },
             {op: "flatten",
              expr: <expr>,
              drop: "false"
             },
             {op: "aggregate",
              type: repeat,
              keys: [<name>,...],
              aggregations: [
                {ref: <name>, expr: <aggexpr> },...
              ]
             }
           ]
          }
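To show how a `sequence` plan like the one on slide 24 could be read, here is a small interpreter sketch over in-memory rows. It is not Drill's engine. The `<expr>` placeholders are replaced by assumptions: `filter` takes a Python callable, `flatten` takes the name of a repeated field, and each aggregation takes a callable over the rows of a group. The `selection`, `drop` and `type` attributes are ignored.

```python
from collections import defaultdict


def run_sequence(plan, tables):
    """Run the steps of a 'sequence' plan in order, feeding each step's rows to the next."""
    assert plan["op"] == "sequence"
    rows = []
    for step in plan["do"]:
        op = step["op"]
        if op == "scan":
            rows = list(tables[step["source"]])
        elif op == "filter":
            rows = [r for r in rows if step["expr"](r)]
        elif op == "flatten":
            # One output row per element of the repeated field.
            field = step["expr"]
            rows = [{**r, field: item} for r in rows for item in r[field]]
        elif op == "aggregate":
            groups = defaultdict(list)
            for r in rows:
                groups[tuple(r[k] for k in step["keys"])].append(r)
            rows = [
                {**dict(zip(step["keys"], key)),
                 **{a["ref"]: a["expr"](members) for a in step["aggregations"]}}
                for key, members in groups.items()
            ]
        else:
            raise ValueError(f"unknown op: {op}")
    return rows


# Hypothetical table contents and expressions.
tables = {"table-1.json": [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["a"]},
    {"id": 3, "tags": []},
]}
plan = {"op": "sequence", "do": [
    {"op": "scan", "source": "table-1.json", "selection": "*"},
    {"op": "filter", "expr": lambda r: r["id"] < 3},
    {"op": "flatten", "expr": "tags"},
    {"op": "aggregate", "keys": ["tags"],
     "aggregations": [{"ref": "cnt", "expr": len}]},
]}
result = run_sequence(plan, tables)
```

Each step consumes the previous step's output, which is what lets the plan be written as a flat list without explicit input references.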
    25. Representing a DAG
          { @id: 19, op: "aggregate",
            input: 18,
            type: <simple|running|repeat>,
            keys: [<name>,...],
            aggregations: [
              {ref: <name>, expr: <aggexpr> },...
            ]
          }
        [Figure: operator 18 feeding the aggregate operator 19; annotation: exp2]
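Slide 25 wires operators together by `@id` and `input` references. One way to turn such references into an execution order is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9+). Operators 17 and 18 and their `op` names are hypothetical; only operator 19 and its input 18 come from the slide.

```python
from graphlib import TopologicalSorter

# Operators listed in arbitrary order; "input" points at the producing operator.
ops = [
    {"@id": 19, "op": "aggregate", "input": 18},
    {"@id": 18, "op": "flatten", "input": 17},   # hypothetical upstream operator
    {"@id": 17, "op": "scan"},                    # hypothetical source operator
]


def execution_order(ops):
    """Return operator ids so that every operator comes after its input."""
    deps = {}
    for o in ops:
        inp = o.get("input")
        deps[o["@id"]] = set() if inp is None else {inp}
    return list(TopologicalSorter(deps).static_order())


order = execution_order(ops)
```

`TopologicalSorter` raises `CycleError` if the references form a cycle, which doubles as a check that the plan really is a DAG.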
    26. Multiple Inputs
          { @id: 25, op: "cogroup",
            groupings: [
              {ref: 23, expr: "id"},
              {ref: 24, expr: "id"}
            ]
          }
        [Figure: cogroup operator 25 with inputs 23 and 24, each labeled id]
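The `cogroup` operator on slide 26 takes two inputs, each with its own grouping expression. A minimal sketch of the idea: for every key value, collect the matching rows from each input side by side. Here `expr` is treated as a plain field name and the input rows are hypothetical; how Drill evaluates `expr` is not specified on the slide.

```python
from collections import defaultdict


def cogroup(inputs, groupings):
    """Group rows from several inputs by key; each key maps input ref -> list of rows."""
    grouped = defaultdict(lambda: {g["ref"]: [] for g in groupings})
    for g in groupings:
        for row in inputs[g["ref"]]:
            grouped[row[g["expr"]]][g["ref"]].append(row)
    return dict(grouped)


# Hypothetical outputs of operators 23 and 24.
inputs = {
    23: [{"id": 1, "a": "x"}, {"id": 2, "a": "y"}],
    24: [{"id": 1, "b": "p"}, {"id": 1, "b": "q"}, {"id": 3, "b": "r"}],
}
groupings = [{"ref": 23, "expr": "id"}, {"ref": 24, "expr": "id"}]
result = cogroup(inputs, groupings)
```

Keys that appear in only one input still get an entry, with an empty list for the other side, so joins of different kinds can be built on top of the grouped output.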
    27. Scan Operators
                               | Scan with schema                        | Scan without schema
        Operator output        | Protocol Buffers                        | JSON-like (MessagePack)
        Supported data formats | ColumnIO (column-based protobuf/Dremel) | JSON
                               | RecordIO (row-based protobuf)           | HBase
                               | CSV                                     |
        SELECT … FROM …        | ColumnIO(proto URI, data URI)           | Json(data URI)
                               | RecordIO(proto URI, data URI)           | HBase(table name)
        • Drill supports multiple data formats by having per-format scan operators
        • Queries involving multiple data formats/sources are supported
        • Fields and predicates can be pushed down into the scan operator
        • Scan operators may have adaptive side-effects (database cracking)
        • Produce ColumnIO from RecordIO
        • Google PowerDrill stores materialized expressions with the data
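The point on slide 27 about pushing fields and predicates into the scan operator can be sketched with a JSON scan that filters and projects while it reads, so later operators never see rejected records or unused fields. The function `scan_json` and its parameters are hypothetical and do not reflect Drill's actual scan interface.

```python
import json


def scan_json(lines, fields=None, predicate=None):
    """Yield records from JSON lines, applying the pushed-down predicate and projection."""
    for line in lines:
        record = json.loads(line)
        if predicate is not None and not predicate(record):
            continue  # predicate pushdown: drop the record at the source
        if fields is not None:
            # projection pushdown: keep only the requested fields
            record = {k: record[k] for k in fields if k in record}
        yield record


lines = [
    '{"name": "Srivas", "gender": "Male", "followers": 100}',
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}',
]
popular = list(scan_json(lines, fields=["name"],
                         predicate=lambda r: r["followers"] > 150))
```

With a column-based format the same interface would let the scan skip unused columns entirely instead of parsing and discarding them.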
    28. Design Principles
        Flexible
          • Pluggable query languages
          • Extensible execution engine
          • Pluggable data formats
            • Column-based and row-based
            • Schema and schema-less
          • Pluggable data sources
        Easy
          • Unzip and run
          • Zero configuration
          • Reverse DNS not needed
          • IP addresses can change
          • Clear and concise log messages
        Dependable
          • No SPOF
          • Instant recovery from crashes
        Fast
          • C/C++ core with Java support
            • Google C++ style guide
          • Min latency and max throughput (limited only by hardware)
    29. Hadoop Integration
        • Hadoop data sources
          – Hadoop FileSystem API (HDFS/MapR-FS)
          – HBase
        • Hadoop data formats
          – Apache Avro
          – RCFile
        • MapReduce-based tools to create column-based formats
        • Table registry in HCatalog
        • Run long-running services in YARN
    30. Get Involved!
        • Download these slides
          – http://www.mapr.com/company/events/hug-france-12-04-2012
        • Join the project
          – drill-dev-subscribe@incubator.apache.org
          – #apachedrill
        • Contact me:
          – tdunning@maprtech.com
          – tdunning@apache.org
          – ted.dunning@gmail.com
          – @ted_dunning
        • Join MapR
          – jobs@mapr.com