Hug france-2012-12-04
 

Talk at Hug FR on December 4, 2012 about the new Apache Drill project. Notably, this talk includes an introduction to the converging specification for the logical plan in Drill.

  • No graphic changes. Note for bullet changes: Open Source -- Community consensus; API; Available for all Distributions --
  • Likely to support these. Could add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon query. Pig as well. Pluggability: data format, query language. Something 6-9 months alpha quality. Community driven, I can't speak for project. MapRFS gives better chunk size control. NFS support may make small test drivers easier. Unified namespace will allow multi-cluster access. Might even have Drill component that autoformats data. Read only model.
  • Protocol buffers are conceptual data model. Will support multiple data models. Will have to define a way to explain data format (filtering, fields, etc). Schema-less will have perf penalty. HBase will be one format.
  • Note: we have an already partially built execution engine.
  • Example query that Drill should support. Need to talk more here about what Dremel does.
  • Be prepared for Apache questions. Committer vs committee vs contributor. If can't answer question, ask them to answer and contribute. Lisa - Need landing page. References to paper and such at end.

Hug france-2012-12-04 Presentation Transcript

  • Apache Drill © MapR Technologies - Confidential
  • My Background
    • Startups
      – Aptex, MusicMatch, ID Analytics, Veoh
      – Big data since before big
    • Open source
      – since the dark ages before the internet
      – Mahout, Zookeeper, Drill
      – bought the beer at first HUG
    • MapR
    • Founding member of Apache Drill
  • MapR Technologies
    • The open enterprise-grade distribution for Hadoop
      – Easy, dependable and fast
      – Open source with standards-based extensions
    • MapR is deployed at 1000's of companies
      – From small Internet startups to the world's largest enterprises
    • MapR customers analyze massive amounts of data:
      – Hundreds of billions of events daily
      – 90% of the world's Internet population monthly
      – $1 trillion in retail purchases annually
    • MapR has partnered with Google to provide Hadoop on Google Compute Engine
  • Agenda What? – What exactly does Drill do? Why? – Why do we need Apache Drill? Who? – Who is doing this? How? – How does Drill work inside? Conclusion – How can you help? – Where can you find out more?©MapR Technologies - Confidential 4
  • Apache Drill Overview
    • Drill overview
      – Low latency interactive queries
      – Standard ANSI SQL support
    • Open-Source
      – 100's involved across US and Europe
      – Community consensus on API, functionality
    • PMC expects first version late this quarter
      – Several components already developed
  • Big Data Processing – Hadoop
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
  • Big Data Processing – Hadoop and Storm
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
    • Stream processing
      – Query runtime: never-ending
      – Data volume: continuous stream
      – Programming model: DAG (pre-programmed)
      – Users: developers
      – Open source project: Storm or Apache S4
  • Big Data Processing – The missing part
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
    • Interactive analysis
    • Stream processing
      – Query runtime: never-ending
      – Data volume: continuous stream
      – Programming model: DAG (pre-programmed)
      – Users: developers
      – Open source project: Storm and S4
  • Big Data Processing – The missing part
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
    • Interactive analysis
      – Query runtime: milliseconds to minutes
      – Data volume: GBs to PBs
      – Programming model: queries (ad hoc)
      – Users: analysts and developers
    • Stream processing
      – Query runtime: never-ending
      – Data volume: continuous stream
      – Programming model: DAG (pre-programmed)
      – Users: developers
      – Open source project: Storm and S4
  • Big Data Processing
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
    • Interactive analysis
      – Query runtime: milliseconds to minutes
      – Data volume: GBs to PBs
      – Programming model: queries
      – Users: analysts and developers
      – Google project: Dremel
    • Stream processing
      – Query runtime: never-ending
      – Data volume: continuous stream
      – Programming model: DAG
      – Users: developers
      – Open source project: Storm and S4
  • Big Data Processing
    • Batch processing
      – Query runtime: minutes to hours
      – Data volume: TBs to PBs
      – Programming model: MapReduce
      – Users: developers
      – Google project: MapReduce
      – Open source project: Hadoop MapReduce
    • Interactive analysis
      – Query runtime: milliseconds to minutes
      – Data volume: GBs to PBs
      – Programming model: queries
      – Users: analysts and developers
      – Google project: Dremel
    • Stream processing
      – Query runtime: never-ending
      – Data volume: continuous stream
      – Programming model: DAG
      – Users: developers
      – Open source project: Storm and S4
    • Introducing Apache Drill
  • Latency Matters
    • Ad-hoc analysis with interactive tools
    • Real-time dashboards
    • Event/trend detection and analysis
      – Network intrusions
      – Fraud
      – Failures
  • Nested Query Languages
    • DrQL
      – SQL-like query language for nested data
      – Compatible with Google BigQuery/Dremel
        • BigQuery applications should work with Drill
      – Designed to support efficient column-based processing
        • No record assembly during query processing
    • Mongo Query Language
      – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}
    • Other languages/programming models can plug in
  • Nested Data Model
    • The data model in Dremel is Protocol Buffers
      – Nested
      – Schema
    • Apache Drill is designed to support multiple data models
      – Schema: Protocol Buffers, Apache Avro, …
      – Schema-less: JSON, BSON, …
    • Flat records are supported as a special case of nested data
      – CSV, TSV, …
    • Avro IDL
        enum Gender {
          MALE, FEMALE
        }
        record User {
          string name;
          Gender gender;
          long followers;
        }
    • JSON
        {
          "name": "Srivas",
          "gender": "Male",
          "followers": 100
        }
        {
          "name": "Raina",
          "gender": "Female",
          "followers": 200,
          "zip": "94305"
        }
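The two JSON records on the slide above have different field sets, and a flat CSV row can be read as a nested record whose paths all have depth one. The following Python sketch illustrates both points. The helper names `field_paths` and `csv_row_to_record` are hypothetical and are not part of Drill.

```python
import json

# The two schema-less JSON records shown on the slide.
records = [
    json.loads('{"name": "Srivas", "gender": "Male", "followers": 100}'),
    json.loads('{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}'),
]


def field_paths(value, prefix=""):
    """Yield the dotted path of every leaf in a (possibly nested) record."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from field_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from field_paths(child, prefix)
    else:
        yield prefix


def csv_row_to_record(header, row):
    """Treat a flat row as a nested record whose paths all have depth one."""
    return dict(zip(header, row))


# Without a schema, the set of fields is only known after reading the data.
observed_fields = sorted({path for record in records for path in field_paths(record)})

# A flat CSV row, expressed in the same record form.
flat = csv_row_to_record(["name", "followers"], ["Srivas", "100"])
```

Here `observed_fields` collects `zip` even though only one record carries it, which is the kind of discovery a schema-less reader has to do at read time.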
  • Extensibility
    • Nested query languages
      – Pluggable model
      – DrQL
      – Mongo Query Language
      – Cascading
    • Distributed execution engine
      – Extensible model (e.g., Dryad)
      – Low-latency
      – Fault tolerant
    • Nested data formats
      – Pluggable model
      – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV)
      – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON)
    • Scalable data sources
      – Pluggable model
      – Hadoop
      – HBase
  • Design Principles
    • Flexible
      – Pluggable query languages
      – Extensible execution engine
      – Pluggable data formats
      – Column-based and row-based
      – Schema and schema-less
      – Pluggable data sources
    • Easy
      – Unzip and run
      – Zero configuration
      – Reverse DNS not needed
      – IP addresses can change
      – Clear and concise log messages
    • Dependable
      – No SPOF
      – Instant recovery from crashes
    • Fast
      – C/C++ core with Java support
      – Google C++ style guide
      – Min latency and max throughput (limited only by hardware)
  • Apache Drill
  • Architecture
    • Only the execution engine knows the physical attributes of the cluster
      – # nodes, hardware, file locations, …
    • Public interfaces enable extensibility
      – Developers can build parsers for new query languages
      – Developers can provide an execution plan directly
    • Each level of the plan has a human readable representation
      – Facilitates debugging and unit testing
  • Execution Engine Layers
    • Drill execution engine has two layers
      – Operator layer is serialization-aware
        • Processes individual records
      – Execution layer is not serialization-aware
        • Processes batches of records (blobs)
        • Responsible for communication, dependencies and fault tolerance
  • DrQL Example (from the Dremel paper)
      SELECT DocId AS Id,
        COUNT(Name.Language.Code) WITHIN Name AS Cnt,
        Name.Url + ',' + Name.Language.Code AS Str
      FROM t
      WHERE REGEXP(Name.Url, '^http') AND DocId < 20;
  • Query Components
    • Query components:
      – SELECT
      – FROM
      – WHERE
      – GROUP BY
      – HAVING
      – (JOIN)
    • Key logical operators:
      – Scan
      – Filter
      – Aggregate
      – (Join)
  • Logical Plan
    • Diagram: scan-json "table-1" → filter exp1 → flatten → aggregate exp2
  • Logical Plan Syntax
      {op: "sequence",
       do: [
         {op: "scan",
          source: "table-1.json",
          selection: "*"
         },
         {op: "filter",
          expr: <expr>
         },
         {op: "flatten",
          expr: <expr>,
          drop: "false"
         },
         {op: "aggregate",
          type: repeat,
          keys: [<name>,...],
          aggregations: [
            {ref: <name>, expr: <aggexpr> },...
          ]
         }
       ]
      }
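To make the "sequence" form above concrete, here is a minimal Python sketch of an interpreter that feeds each step's output into the next. It is not Drill's implementation. The slide leaves `<expr>` and `<aggexpr>` unspecified, so plain Python callables stand in for them, the `TABLES` dictionary is a hypothetical in-memory stand-in for `table-1.json`, and `flatten` is left out because its semantics are not spelled out on the slide.

```python
from collections import defaultdict

# Hypothetical in-memory stand-in for the file "table-1.json".
TABLES = {
    "table-1.json": [
        {"lang": "en", "clicks": 3},
        {"lang": "fr", "clicks": 5},
        {"lang": "en", "clicks": 4},
        {"lang": "fr", "clicks": 0},
    ]
}


def run_sequence(plan, tables):
    """Run an {op: "sequence", do: [...]} plan, passing rows from step to step."""
    if plan["op"] != "sequence":
        raise ValueError("expected a sequence plan")
    rows = []
    for step in plan["do"]:
        op = step["op"]
        if op == "scan":
            rows = list(tables[step["source"]])
        elif op == "filter":
            # Assumption: the expression is a Python callable returning a bool.
            rows = [row for row in rows if step["expr"](row)]
        elif op == "aggregate":
            # Group rows by the key fields, then apply each aggregation per group.
            groups = defaultdict(list)
            for row in rows:
                groups[tuple(row[k] for k in step["keys"])].append(row)
            rows = [
                {
                    **dict(zip(step["keys"], key)),
                    **{agg["ref"]: agg["expr"](members) for agg in step["aggregations"]},
                }
                for key, members in groups.items()
            ]
        else:
            raise ValueError(f"unsupported operator: {op}")
    return rows


plan = {
    "op": "sequence",
    "do": [
        {"op": "scan", "source": "table-1.json", "selection": "*"},
        {"op": "filter", "expr": lambda row: row["clicks"] > 0},
        {
            "op": "aggregate",
            "keys": ["lang"],
            "aggregations": [
                {"ref": "total", "expr": lambda rows: sum(r["clicks"] for r in rows)}
            ],
        },
    ],
}

result = run_sequence(plan, TABLES)
```

Because every step only sees the rows produced by the previous one, a sequence is simply a linear special case of the DAG form in which each operator has one implicit input.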
  • Representing a DAG
    • Diagram: node 18 → aggregate exp2 (node 19)
      {
        @id: 19,
        op: "aggregate",
        input: 18,
        type: <simple|running|repeat>,
        keys: [<name>,...],
        aggregations: [
          {ref: <name>, expr: <aggexpr> },...
        ]
      }
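In the DAG form, operators name their inputs by `@id`, so a plan can be evaluated by resolving those references recursively and caching each node's result. The Python sketch below illustrates that idea only. The `literal` source operator, the simplified count-per-key aggregate, and the function names are assumptions made for the example and do not come from Drill.

```python
def evaluate(nodes, target_id, run_op):
    """Evaluate node `target_id`, computing each upstream node at most once."""
    by_id = {node["@id"]: node for node in nodes}
    cache = {}

    def visit(node_id, path=()):
        if node_id in path:
            raise ValueError(f"cycle through node {node_id}")
        if node_id not in cache:
            node = by_id[node_id]
            inputs = node.get("input", [])
            if not isinstance(inputs, list):
                inputs = [inputs]  # a single input may be written as a bare id
            upstream = [visit(i, path + (node_id,)) for i in inputs]
            cache[node_id] = run_op(node, upstream)
        return cache[node_id]

    return visit(target_id)


def run_op(node, upstream):
    """Hypothetical operator implementations, just enough for the demo."""
    if node["op"] == "literal":  # assumed source operator, not from the slides
        return node["rows"]
    if node["op"] == "aggregate":  # simplified: count rows per first key
        counts = {}
        for row in upstream[0]:
            key = row[node["keys"][0]]
            counts[key] = counts.get(key, 0) + 1
        return counts
    raise ValueError(f"unsupported operator: {node['op']}")


# Nodes may be listed in any order; references are resolved by @id.
nodes = [
    {"@id": 19, "op": "aggregate", "input": 18, "keys": ["lang"]},
    {"@id": 18, "op": "literal",
     "rows": [{"lang": "en"}, {"lang": "fr"}, {"lang": "en"}]},
]

counts = evaluate(nodes, 19, run_op)
```

The cache matters once a node feeds more than one downstream operator, since it is then computed once and shared.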
  • Multiple Inputs
    • Diagram: nodes 23 and 24, each keyed on id, feed cogroup node 25
      {
        @id: 25,
        op: "cogroup",
        groupings: [
          {ref: 23, expr: "id"},
          {ref: 24, expr: "id"}
        ]
      }
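The slide does not define what `cogroup` produces. Assuming it has the usual meaning (as in Pig's COGROUP: for each key, one bag of matching rows per input), a minimal Python sketch could look like this. The sample data and the `by_id` key function are hypothetical.

```python
from collections import defaultdict


def cogroup(inputs, key_exprs):
    """Group several inputs by key; each key maps to one bag of rows per input."""
    grouped = defaultdict(lambda: [[] for _ in inputs])
    for position, (rows, key_of) in enumerate(zip(inputs, key_exprs)):
        for row in rows:
            grouped[key_of(row)][position].append(row)
    return dict(grouped)


# Hypothetical data standing in for the outputs of nodes 23 and 24.
users = [{"id": 1, "name": "Srivas"}, {"id": 2, "name": "Raina"}]
visits = [{"id": 1, "url": "/a"}, {"id": 1, "url": "/b"}, {"id": 3, "url": "/c"}]


def by_id(row):
    """Key expression corresponding to expr: "id" in the plan."""
    return row["id"]


result = cogroup([users, visits], [by_id, by_id])
```

Keys that appear in only one input keep an empty bag for the other, so a later operator can turn the result into an inner or outer join as needed.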
  • Scan Operators
    • Drill supports multiple data formats by having per-format scan operators
    • Queries involving multiple data formats/sources are supported
    • Fields and predicates can be pushed down into the scan operator
    • Scan operators may have adaptive side-effects (database cracking)
    • Produce ColumnIO from RecordIO
    • Google PowerDrill stores materialized expressions with the data
    • Scan with schema
      – Operator output: Protocol Buffers
      – Supported data formats: ColumnIO (column-based protobuf/Dremel), RecordIO (row-based protobuf), CSV
      – SELECT … FROM …: ColumnIO(proto URI, data URI), RecordIO(proto URI, data URI)
    • Scan without schema
      – Operator output: JSON-like (MessagePack)
      – Supported data formats: JSON, HBase
      – SELECT … FROM …: Json(data URI), HBase(table name)
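Pushing fields and predicates into the scan means the scan filters and projects while it reads, so later operators never see rows or fields they do not need. The Python sketch below shows the idea for a JSON source. The `scan_json` function and its parameters are hypothetical and are not Drill's API. Note that this toy version still decodes every record in full; a column-based format could skip unread columns entirely.

```python
import json


def scan_json(lines, fields=None, predicate=None):
    """Yield records from JSON lines, filtering and projecting during the read."""
    for line in lines:
        record = json.loads(line)
        # Pushed-down predicate: drop rows before they leave the scan.
        if predicate is not None and not predicate(record):
            continue
        # Pushed-down projection: keep only the requested fields that exist.
        if fields is not None:
            record = {name: record[name] for name in fields if name in record}
        yield record


# The two JSON records from the Nested Data Model slide, one per line.
lines = [
    '{"name": "Srivas", "gender": "Male", "followers": 100}',
    '{"name": "Raina", "gender": "Female", "followers": 200, "zip": "94305"}',
]

popular_names = list(
    scan_json(lines, fields=["name"], predicate=lambda r: r["followers"] > 150)
)
```

The predicate runs before the projection so that it can still refer to fields, such as `followers`, that are not part of the output.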
  • Design Principles
    • Flexible
      – Pluggable query languages
      – Extensible execution engine
      – Pluggable data formats
      – Column-based and row-based
      – Schema and schema-less
      – Pluggable data sources
    • Easy
      – Unzip and run
      – Zero configuration
      – Reverse DNS not needed
      – IP addresses can change
      – Clear and concise log messages
    • Dependable
      – No SPOF
      – Instant recovery from crashes
    • Fast
      – C/C++ core with Java support
      – Google C++ style guide
      – Min latency and max throughput (limited only by hardware)
  • Hadoop Integration
    • Hadoop data sources
      – Hadoop FileSystem API (HDFS/MapR-FS)
      – HBase
    • Hadoop data formats
      – Apache Avro
      – RCFile
    • MapReduce-based tools to create column-based formats
    • Table registry in HCatalog
    • Run long-running services in YARN
  • Get Involved!
    • Download these slides
      – http://www.mapr.com/company/events/hug-france-12-04-2012
    • Join the project
      – drill-dev-subscribe@incubator.apache.org
      – #apachedrill
    • Contact me:
      – tdunning@maprtech.com
      – tdunning@apache.org
      – ted.dunning@maprtech.com
      – @ted_dunning
    • Join MapR
      – jobs@mapr.com