• Share
  • Email
  • Embed
  • Like
  • Save
  • Private Content
Drill lightning-london-big-data-10-01-2012
 

Drill lightning-london-big-data-10-01-2012

on

  • 1,795 views

A quick status report on Drill.

A quick status report on Drill.

Statistics

Views

Total Views
1,795
Views on SlideShare
1,789
Embed Views
6

Actions

Likes
0
Downloads
10
Comments
0

1 Embed 6

http://www.linkedin.com 6

Accessibility

Categories

Upload Details

Uploaded via as Microsoft PowerPoint

Usage Rights

© All Rights Reserved

Report content

Flagged as inappropriate Flag as inappropriate
Flag as inappropriate

Select your reason for flagging this presentation as inappropriate.

Cancel
  • Full Name Full Name Comment goes here.
    Are you sure you want to
    Your message goes here
    Processing…
Post Comment
Edit your comment
  • Drill Remove schema requirementIn-situ for real since we’ll support multiple formatsNote: MR needed for big joins so to speak
  • Likely to support theseCould add HiveQL and more as well. Could even be clever and support HiveQL to MR or Drill based upon queryPig as wellPluggabilityData formatQuery languageSomething 6-9 months alpha qualityCommunity driven, I can’t speak for projectMapRFS gives better chunk size controlNFS support may make small test drivers easierUnified namespace will allow multi-cluster accessMight even have drill component that autoformats dataRead only model
  • Example query that Drill should supportNeed to talk more here about what Dremel does

Drill lightning-london-big-data-10-01-2012 Drill lightning-london-big-data-10-01-2012 Presentation Transcript

  • Apache DrillInteractive Analysis of Large-Scale Datasets Ted Dunning Chief Application Architect, MapR
  • Big Data Processing Batch processing Interactive analysis Stream processingQuery runtime Minutes to hours Milliseconds to Never-ending minutesData volume TBs to PBs GBs to PBs Continuous streamProgramming MapReduce Queries DAGmodelUsers Developers Analysts and Developers developersGoogle project MapReduce DremelOpen source Hadoop Storm and S4project MapReduce Introducing Apache Drill…
  • GOOGLE DREMEL
  • Google Dremel• Interactive analysis of large-scale datasets – Trillion records at interactive speeds – Complementary to MapReduce – Used by thousands of Google employees – Paper published at VLDB 2010 • Authors: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis• Model – Nested data model with schema • Most data at Google is stored/transferred in Protocol Buffers • Normalization (to relational) is prohibitive – SQL-like query language with nested data support• Implementation – Column-based storage and processing – In-situ data access (GFS and Bigtable) – Tree architecture as in Web search (and databases)
  • APACHE DRILL
  • Architecture• Only the execution engine knows the physical attributes of the cluster – # nodes, hardware, file locations, …• Public interfaces enable extensibility – Developers can build parsers for new query languages – Developers can provide an execution plan directly• Each level of the plan has a human readable representation – Facilitates debugging and unit testing
  • Nested Query Languages• DrQL – SQL-like query language for nested data – Compatible with Google BigQuery/Dremel • BigQuery applications should work with Drill – Designed to support efficient column-based processing • No record assembly during query processing• Mongo Query Language – {$query: {x: 3, y: "abc"}, $orderby: {x: 1}}• Other languages/programming models can plug in
  • DrQL ExampleSELECT DocId AS Id, COUNT(Name.Language.Code) WITHIN Name ASCnt, Name.Url + , + Name.Language.Code ASStrFROM tWHERE REGEXP(Name.Url, ^http) AND DocId < 20; * Example from the Dremel paper
  • Design PrinciplesFlexible Easy• Pluggable query languages • Unzip and run• Extensible execution engine • Zero configuration• Pluggable data formats • Reverse DNS not needed • Column-based and row-based • IP addresses can change • Schema and schema-less • Clear and concise log messages• Pluggable data sourcesDependable Fast• No SPOF • C/C++ core with Java support• Instant recovery from crashes • Google C++ style guide • Min latency and max throughput (limited only by hardware)
  • Get Involved!• Download (a longer version of) these slides – http://www.mapr.com/company/events/bay-area-hug/9-19-2012• Join the project – drill-dev-subscribe@incubator.apache.org #apachedrill• Contact me: – tdunning@maprtech.com – tdunning@apache.org – ted.dunning@maprtech.com – @ted_dunning• Join MapR – jobs@mapr.com