Drill at the Chicago Hug


Published on

This is a project report given to the Chicago Hadoop Users' Group meeting in Chicago on September 19, 2012

Published in: Technology
  • Be the first to comment

  • Be the first to like this

No Downloads
Total views
On SlideShare
From Embeds
Number of Embeds
Embeds 0
No embeds

No notes for slide
  • Drill Remove schema requirementIn-situ for real since we’ll support multiple formatsNote: MR needed for big joins so to speak
  • DrillWill support nestedNo schema required
  • Load data into Drill (optional)Could just use as is in “row” formatMultiple query languagesPluggability very important
  • Likely to support theseCould add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon queryPig as wellPluggabilityData formatQuery languageSomething 6-9 months alpha qualityCommunity driven, I can’t speak for projectMapRFS gives better chunk size controlNFS support may make small test drivers easierUnified namespace will allow multi-cluster accessMight even have drill component that autoformats dataRead only model
  • Protocol buffers are conceptual data modelWill support multiple data modelsWill have to define a way to explain data format (filtering, fields, etc)Schema-less will have perf penaltyHbase will be one format
  • Example query that Drill should supportNeed to talk more here about what Dremel does
  • Note: we have an already partially built execution engine
  • Be prepared for Apache questionsCommitter vs committee vs contributorIf can’t answer question, ask them to answer and contributeLisa - Need landing pageReferences to paper and such at end
  • Drill at the Chicago Hug

    1. 1. Apache Drill Interactive Analysis of Large-Scale Datasets Ted Dunning Chief Application Architect, MapR
    2. 2. My Background • Startups – Aptex, MusicMatch, ID Analytics, Veoh – Big data since before big • Open source – since the dark ages before the internet – Mahout, Zookeeper, Drill – bought the beer at first HUG • MapR • Founding member of Apache Drill
    3. 3. MapR Technologies • The open enterprise-grade distribution for Hadoop – Easy, dependable and fast – Open source with standards-based extensions • MapR is deployed at 1000’s of companies – From small Internet startups to the world’s largest enterprises • MapR customers analyze massive amounts of data: – Hundreds of billions of events daily – 90% of the world’s Internet population monthly – $1 trillion in retail purchases annually • MapR has partnered with Google to provide Hadoop on Google Compute Engine
    4. 4. Latency Matters • Ad-hoc analysis with interactive tools • Real-time dashboards • Event/trend detection and analysis – Network intrusions – Fraud – Failures
    5. 5. Big Data Processing Batch processing Interactive analysis Stream processing Query runtime Minutes to hours Milliseconds to minutes Never-ending Data volume TBs to PBs GBs to PBs Continuous stream Programming model MapReduce Queries DAG Users Developers Analysts and developers Developers Google project MapReduce Dremel Open source project Hadoop MapReduce Storm and S4 Introducing Apache Drill…
    7. 7. Google Dremel • Interactive analysis of large-scale datasets – Trillion records at interactive speeds – Complementary to MapReduce – Used by thousands of Google employees – Paper published at VLDB 2010 • Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis • Model – Nested data model with schema • Most data at Google is stored/transferred in Protocol Buffers • Normalization (to relational) is prohibitive – SQL-like query language with nested data support • Implementation – Column-based storage and processing – In-situ data access (GFS and Bigtable) – Tree architecture as in Web search (and databases)
    8. 8. Google BigQuery • Hosted Dremel (Dremel as a Service) • CLI (bq) and Web UI • Import data from Google Cloud Storage or local files – Files must be in CSV format • Nested data not supported [yet] except built-in datasets – Schema definition required
    9. 9. APACHE DRILL
    10. 10. Architecture • Only the execution engine knows the physical attributes of the cluster – # nodes, hardware, file locations, … • Public interfaces enable extensibility – Developers can build parsers for new query languages – Developers can provide an execution plan directly • Each level of the plan has a human readable representation – Facilitates debugging and unit testing
    11. 11. Architecture (2)
    12. 12. Execution Engine Layers • Drill execution engine has two layers – Operator layer is serialization-aware • Processes individual records – Execution layer is not serialization-aware • Processes batches of records (blobs) • Responsible for communication, dependencies and fault tolerance
    13. 13. Data Flow
    14. 14. Nested Query Languages • DrQL – SQL-like query language for nested data – Compatible with Google BigQuery/Dremel • BigQuery applications should work with Drill – Designed to support efficient column-based processing • No record assembly during query processing • Mongo Query Language – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}} • Other languages/programming models can plug in
    15. 15. Nested Data Model • The data model in Dremel is Protocol Buffers – Nested – Schema • Apache Drill is designed to support multiple data models – Schema: Protocol Buffers, Apache Avro, … – Schema-less: JSON, BSON, … • Flat records are supported as a special case of nested data – CSV, TSV, … { "name": "Srivas", "gender": "Male", "followers": 100 } { "name": "Raina", "gender": "Female", "followers": 200, "zip": "94305" } enum Gender { MALE, FEMALE } record User { string name; Gender gender; long followers; } Avro IDL JSON
    16. 16. DrQL Example SELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name AS Cnt, Name.Url + ',' + Name.Language.Code AS Str FROM t WHERE REGEXP(Name.Url, '^http') AND DocId < 20; * Example from the Dremel paper
    17. 17. Query Components • Query components: – SELECT – FROM – WHERE – GROUP BY – HAVING – (JOIN) • Key logical operators: – Scan – Filter – Aggregate – (Join)
    18. 18. Extensibility • Nested query languages – Pluggable model – DrQL – Mongo Query Language – Cascading • Distributed execution engine – Extensible model (eg, Dryad) – Low-latency – Fault tolerant • Nested data formats – Pluggable model – Column-based (ColumnIO/Dremel, Trevni, RCFile) and row-based (RecordIO, Avro, JSON, CSV) – Schema (Protocol Buffers, Avro, CSV) and schema-less (JSON, BSON) • Scalable data sources – Pluggable model – Hadoop – HBase
    19. 19. Scan Operators Scan with schema Scan without schema Operator output Protocol Buffers JSON-like (MessagePack) Supported data formats ColumnIO (column-based protobuf/Dremel) RecordIO (row-based protobuf) CSV JSON HBase SELECT … FROM … ColumnIO(proto URI, data URI) RecordIO(proto URI, data URI) Json(data URI) HBase(table name) • Drill supports multiple data formats by having per-format scan operators • Queries involving multiple data formats/sources are supported • Fields and predicates can be pushed down into the scan operator • Scan operators may have adaptive side-effects (database cracking) • Produce ColumnIO from RecordIO • Google PowerDrill stores materialized expressions with the data
    20. 20. Design Principles Flexible • Pluggable query languages • Extensible execution engine • Pluggable data formats • Column-based and row-based • Schema and schema-less • Pluggable data sources Easy • Unzip and run • Zero configuration • Reverse DNS not needed • IP addresses can change • Clear and concise log messages Dependable • No SPOF • Instant recovery from crashes Fast • C/C++ core with Java support • Google C++ style guide • Min latency and max throughput (limited only by hardware)
    21. 21. Hadoop Integration • Hadoop data sources – Hadoop FileSystem API (HDFS/MapR-FS) – HBase • Hadoop data formats – Apache Avro – RCFile • MapReduce-based tools to create column-based formats • Table registry in HCatalog • Run long-running services in YARN
    22. 22. Get Involved! • Download (almost) these slides – http://www.mapr.com/company/events/bay-area-hug/9-19-2012 • Join the project – drill-dev-subscribe@incubator.apache.org #apachedrill • Contact me: – tdunning@maprtech.com – tdunning@apache.org – ted.dunning@maprtech.com – @ted_dunning • Join MapR – jobs@mapr.com