Gera Shegalov @PJUG, Jan 15, 2013

Apache Drill @ PJUG, Jan 15, 2013

Apache Drill is a new Apache Incubator project. Its goal is to provide a distributed system for interactive analysis of large-scale datasets. Inspired by Google's Dremel technology, it aims to process trillions of records in seconds. We will cover the goals of Apache Drill, its use cases, and how it relates to Hadoop, MongoDB and other large-scale distributed systems. We'll also talk about details of the architecture, points of extensibility, data flow and our first query languages (DrQL and SQL).


  1. Gera Shegalov @PJUG, Jan 15, 2013
  2. /home/gera: whoami
     ■ Saarland University
     ■ 1st intern in Immortal DB @ Microsoft Research
     ■ JMS, RDBMS HA @ Oracle
     ■ Hadoop MapReduce / Hadoop Core
     ■ Founding member of Apache Drill
  3. ■ Open enterprise-grade distribution for Hadoop
       ● Easy, dependable and fast
       ● Open source with standards-based extensions
     ■ MapR is deployed at 1000’s of companies
       ● From small Internet startups to Fortune 100
     ■ MapR customers analyze massive amounts of data:
       ● Hundreds of billions of events daily
       ● 90% of the world’s Internet population monthly
       ● $1 trillion in retail purchases annually
     ■ MapR in the Cloud:
       ● Partnered with Google: Hadoop on Google Compute Engine
       ● Partnered with Amazon: M3/M5 options for Elastic MapReduce
  4. Agenda
     ■ What?
       ● What exactly does Drill do?
     ■ Why?
       ● Why do we need Apache Drill?
     ■ Who?
       ● Who is doing this?
     ■ How?
       ● How does Drill work inside?
     ■ Conclusion
       ● How can you help?
       ● Where can you find out more?
  5. Apache Drill Overview
     ■ Drill overview
       ● Low latency interactive queries
       ● Standard ANSI SQL support
       ● Domain Specific Languages / Your own QL
     ■ Open-Source
       ● Apache Incubator
       ● 100’s involved across US and Europe
       ● Community consensus on API, functionality
  6. Big Data Processing

                           Batch processing    Interactive analysis      Stream processing
     Query runtime         Minutes to hours    Milliseconds to minutes   Never-ending
     Data volume           TBs to PBs          GBs to PBs                Continuous stream
     Programming model     MapReduce           Queries                   DAG
     Users                 Developers          Analysts and developers   Developers
     Google project        MapReduce           Dremel
     Open source project   Hadoop MapReduce    Apache Drill              Storm and S4
  7. Latency Matters
     ■ Ad-hoc analysis with interactive tools
     ■ Real-time dashboards
     ■ Event/trend detection and analysis
       ● Network intrusions
       ● Fraud
       ● Failures
  8. Nested Query Languages
     ■ DrQL
       ● SQL-like query language for nested data
       ● Compatible with Google BigQuery/Dremel
       ● BigQuery applications should work with Drill
       ● Designed to support efficient column-based processing
       ● No record assembly during query processing
     ■ Mongo Query Language
       ● {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
     ■ Other languages/programming models can plug in
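The Mongo-style find spec above ({$query: …, $orderby: …}) can be evaluated client-side in a few lines for intuition. This is a hypothetical sketch over in-memory dicts; the `mongo_find` helper and sample documents are invented here and are not part of Drill or MongoDB:

```python
# Hypothetical evaluator for a Mongo-style {$query, $orderby} find spec
# over a list of Python dicts (real MongoDB evaluates this server-side).
def mongo_find(docs, spec):
    query = spec.get("$query", {})
    # Keep only documents whose fields all equal the query's values.
    matches = [d for d in docs if all(d.get(k) == v for k, v in query.items())]
    # Apply sort keys last-to-first so the first key dominates; -1 = descending.
    for field, direction in reversed(list(spec.get("$orderby", {}).items())):
        matches.sort(key=lambda d: d[field], reverse=(direction == -1))
    return matches

docs = [{"x": 3, "y": "abc", "n": 2},
        {"x": 3, "y": "abc", "n": 1},
        {"x": 4, "y": "abc", "n": 3}]
res = mongo_find(docs, {"$query": {"x": 3, "y": "abc"}, "$orderby": {"n": 1}})
# res == [{"x": 3, "y": "abc", "n": 1}, {"x": 3, "y": "abc", "n": 2}]
```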
  9. Nested Data Model
     ■ The data model in Dremel is Protocol Buffers
       ● Nested
       ● Schema
     ■ Apache Drill is designed to support multiple data models
       ● Schema: Protocol Buffers, Apache Avro, …
       ● Schema-less: JSON, BSON, …
     ■ Flat records are supported as a special case of nested data
       ● CSV, TSV, …

     Avro IDL:
         enum Gender {
           MALE, FEMALE
         }

         record User {
           string name;
           Gender gender;
           long followers;
         }

     JSON:
         {
           "name": "Srivas",
           "gender": "Male",
           "followers": 100
         }
         {
           "name": "Raina",
           "gender": "Female",
           "followers": 200,
           "zip": "94305"
         }
 10. Extensibility
     ■ Nested query languages
       ● Pluggable model
       ● DrQL
       ● Mongo Query Language
       ● Cascading
     ■ Distributed execution engine
       ● Extensible model (e.g., Dryad)
       ● Low-latency
       ● Fault tolerant
     ■ Nested data formats
       ● Pluggable model
       ● Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
       ● Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
     ■ Scalable data sources
       ● Pluggable model
       ● Hadoop
       ● HBase
 11. Design Principles
     Flexible
       ● Pluggable query languages
       ● Extensible execution engine
       ● Pluggable data formats
         ● Column-based and row-based
         ● Schema and schema-less
       ● Pluggable data sources
       ● N(ot)O(nly) Hadoop
     Easy
       ● Unzip and run
       ● Zero configuration
       ● Reverse DNS not needed
       ● IP addresses can change
       ● Clear and concise log messages
     Dependable
       ● No SPOF
       ● Instant recovery from crashes
     Fast
       ● Minimum Java core
       ● C/C++ core with Java support
       ● Google C++ style guide
       ● Min latency and max throughput (limited only by hardware)
 12. Architecture
 13. Execution Engine
     ■ Operator layer is serialization-aware
       ● Processes individual records
     ■ Execution layer is not serialization-aware
       ● Processes batches of records (blobs/JSON trees)
       ● Responsible for communication, dependencies and fault tolerance
 14. DrQL Example

     local-logs = donuts.json:
         {
           "id": "0003",
           "type": "donut",
           "name": "Old Fashioned",
           "ppu": 0.55,
           "sales": 300,
           "batters": {
             "batter": [
               { "id": "1001", "type": "Regular" },
               { "id": "1002", "type": "Chocolate" }
             ]
           },
           "topping": [
             { "id": "5001", "type": "None" },
             { "id": "5002", "type": "Glazed" },
             { "id": "5003", "type": "Chocolate" },
             { "id": "5004", "type": "Maple" }
           ]
         }

     Query:
         SELECT
           ppu,
           typeCount = COUNT(*) OVER PARTITION BY ppu,
           quantity = SUM(sales) OVER PARTITION BY ppu,
           sales = SUM(ppu*sales) OVER PARTITION BY ppu
         FROM local-logs donuts
         WHERE donuts.ppu < 1.00
         ORDER BY donuts.ppu DESC;
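The `OVER PARTITION BY ppu` aggregates in the DrQL query keep every input record while attaching partition-wide counts and sums, rather than collapsing the partition to one row. A minimal Python sketch of that semantics, assuming a small in-memory list of flat records (the `window_partition_sums` helper and sample records are hypothetical):

```python
from collections import defaultdict

def window_partition_sums(records):
    """Per record, attach COUNT(*) and SUM(sales) over its ppu partition,
    mimicking DrQL's `... OVER PARTITION BY ppu`; then filter ppu < 1.00.
    No record is collapsed: each input row yields one output row."""
    counts = defaultdict(int)
    sums = defaultdict(float)
    for r in records:                    # first pass: per-partition aggregates
        counts[r["ppu"]] += 1
        sums[r["ppu"]] += r["sales"]
    return [                             # second pass: one output row per input row
        {"ppu": r["ppu"], "typeCount": counts[r["ppu"]], "quantity": sums[r["ppu"]]}
        for r in records
        if r["ppu"] < 1.00
    ]

rows = window_partition_sums([
    {"ppu": 0.55, "sales": 300},
    {"ppu": 0.55, "sales": 100},
    {"ppu": 1.25, "sales": 50},
])
# rows: both ppu=0.55 records survive, each carrying typeCount=2, quantity=400.0
```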
 15. Query Components
     ■ User Query (DrQL) components:
       ● SELECT
       ● FROM
       ● WHERE
       ● GROUP BY
       ● HAVING
       ● (JOIN)
     ■ Logical operators:
       ● Scan
       ● Filter
       ● Aggregate
       ● (Join)
 16. Logical Plan
 17. Logical Plan Syntax: Operators & Expressions
         query: [
           {
             op: "sequence",
             do: [
               {
                 op: "scan",
                 memo: "initial_scan",
                 ref: "donuts",
                 source: "local-logs",
                 selection: {data: "activity"}
               },
               {
                 op: "transform",
                 transforms: [
                   { ref: "donuts.quantity", expr: "donuts.sales" }
                 ]
               },
               {
                 op: "filter",
                 expr: "donuts.ppu < 1.00"
               },
               ...
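A `sequence` plan like the one on this slide reads as a pipeline: scan, then transform, then filter. The toy interpreter below is a hypothetical sketch for intuition only; it assumes in-memory sources and uses Python callables in place of DrQL expression strings, and none of its names come from Drill's reference implementation:

```python
# Hypothetical mini-interpreter for a "sequence" logical plan: each entry
# in `do` is applied in order to a stream of dict records.
def run_sequence(plan, sources):
    stream = []
    for op in plan["do"]:
        if op["op"] == "scan":
            # Simplified: read whole source into memory.
            stream = [dict(r) for r in sources[op["source"]]]
        elif op["op"] == "transform":
            # Attach computed fields to every record.
            for r in stream:
                for t in op["transforms"]:
                    r[t["ref"]] = t["expr"](r)
        elif op["op"] == "filter":
            stream = [r for r in stream if op["expr"](r)]
    return stream

plan = {"op": "sequence", "do": [
    {"op": "scan", "source": "local-logs"},
    {"op": "transform",
     "transforms": [{"ref": "quantity", "expr": lambda r: r["sales"]}]},
    {"op": "filter", "expr": lambda r: r["ppu"] < 1.00},
]}
out = run_sequence(plan, {"local-logs": [{"ppu": 0.55, "sales": 300},
                                         {"ppu": 1.25, "sales": 50}]})
# out == [{"ppu": 0.55, "sales": 300, "quantity": 300}]
```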
 18. Logical Streaming Example

     Input stream: 0 1 2 3 4

         { @id: <refnum>, op: "window-frame",
           input: <input>,
           keys: [ <name>, ... ],
           ref: <name>,
           before: 2,
           after: here
         }

     Output frames: [0] [0 1] [0 1 2] [1 2 3] [2 3 4]
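The frames shown on the slide ([0], [0 1], [0 1 2], [1 2 3], [2 3 4]) follow from `before: 2, after: here`: each frame covers the current record plus up to two predecessors. A sketch under that reading (the `window_frames` helper is hypothetical):

```python
def window_frames(stream, before):
    """Sliding window-frame with `after: here`: each output frame holds the
    current record plus up to `before` preceding records."""
    return [stream[max(0, i - before): i + 1] for i in range(len(stream))]

frames = window_frames([0, 1, 2, 3, 4], before=2)
# frames == [[0], [0, 1], [0, 1, 2], [1, 2, 3], [2, 3, 4]]
```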
 19. Representing a DAG
         { @id: 19, op: "aggregate",
           input: 18,
           type: <simple|running|repeat>,
           keys: [<name>, ...],
           aggregations: [
             {ref: <name>, expr: <aggexpr> }, ...
           ]
         }
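In this encoding, `input: 18` names the `@id` of an upstream operator, so resolving the references recovers the DAG's edges. A hypothetical sketch of that resolution (helper name and edge representation invented here):

```python
# Resolve `input` references to `@id`s into DAG edges (input -> consumer).
def build_dag(ops):
    edges = []
    for op in ops:
        inputs = op.get("input", [])
        if not isinstance(inputs, list):   # allow a single scalar reference
            inputs = [inputs]
        for src in inputs:
            edges.append((src, op["@id"]))
    return edges

edges = build_dag([
    {"@id": 18, "op": "scan"},
    {"@id": 19, "op": "aggregate", "input": 18},
])
# edges == [(18, 19)]
```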
 20. Multiple Inputs
         { @id: 25, op: "cogroup",
           groupings: [
             {ref: 23, expr: "id"},
             {ref: 24, expr: "id"}
           ]
         }
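Cogroup consumes several inputs (here the operators with `@id` 23 and 24) and groups them jointly by the key expression. A minimal sketch of that semantics over two in-memory record lists (the `cogroup` helper is hypothetical):

```python
from collections import defaultdict

def cogroup(left, right, key):
    """Group two inputs by `key`, yielding per distinct key a pair of lists:
    (matching left records, matching right records)."""
    groups = defaultdict(lambda: ([], []))
    for r in left:
        groups[r[key]][0].append(r)
    for r in right:
        groups[r[key]][1].append(r)
    return dict(groups)

g = cogroup([{"id": 1, "a": "x"}],
            [{"id": 1, "b": "y"}, {"id": 2, "b": "z"}],
            "id")
# g[1] pairs the matching records; g[2] has an empty left side.
```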
 21. Physical Scan Operators

                        Scan with schema                 Scan without schema
     Operator output    Protocol Buffers                 JSON-like (MessagePack)
     Supported          ColumnIO (column-based           JSON
     data formats       protobuf/Dremel)                 HBase
                        RecordIO (row-based protobuf)
                        CSV
     SELECT … FROM …    ColumnIO(proto URI, data URI)    Json(data URI)
                        RecordIO(proto URI, data URI)    HBase(table name)
 22. Hadoop Integration
     ■ Hadoop data sources
       ● Hadoop FileSystem API (HDFS/MapR-FS)
       ● HBase
     ■ Hadoop data formats
       ● Apache Avro
       ● RCFile
     ■ MapReduce-based tools to create column-based formats
     ■ Table registry in HCatalog
     ■ Run long-running services in YARN
 23. Where is Drill now?
     ■ API Definition
     ■ Reference Implementation for Logical Plan Interpreter
       ● 1:1 mapping logical/physical op
       ● Single JVM
     ■ Demo
 24. Contribute!
     ■ Participate in design discussions: JIRA, ML, Wiki, Google Doc!
     ■ Write a parser for your favorite QL / Domain-Specific Language
     ■ Write Storage Engine API implementations
       ● HDFS, HBase, relational, XML DB
     ■ Write Physical Operators
       ● scan-hbase, scan-cassandra, scan-mongo
       ● scan-jdbc, scan-odbc, scan-jms (browse topic/queue), scan-*
       ● Combined functionality operators: group-aggregate, ...
       ● sort-merge-join, hash-join, index-lookup-join
     ■ Etc.
 25. Thanks, Q&A
     ■ Download these slides
       ● http://www.mapr.com/company/events/pjug-1-15-2013
     ■ Join the project
       ● drill-dev-subscribe@incubator.apache.org
       ● #apachedrill
     ■ Contact me:
       ● gshegalov@maprtech.com
     ■ Join MapR
       ● jobs@mapr.com