Introduction to Apache Drill - Big Data Bellevue Meetup 20131023
Speaker notes
  • Greet
    First time talking in front of a meetup
  • - Search data pipeline, real-time ingestion of social data (FB/Twitter)
  • - Battle nations, Top grossing #3, Matchmaking, Online real-time PvP
  • - Open Source PaaS, Metrics and Usage data
  • - Halo's new big data pipeline; working on data ingestion with open-source tools like Kafka
  • Love open source, with enthusiastic people offering their time and energy
    Smart people and a great sense of community
  • - Also some less well known projects such as MMORPG game engine, etc.
  • And I found Drill!
  • Interactive / real-time is HOT
    Batch processing doesn’t serve all our needs anymore
  • - The TC article that originally led me to start contributing
  • - Drill’s Apache incubator proposal, outlining what Drill is trying to achieve
  • Pro: Can handle big data; MapReduce abstracts away all the distribution and management; flexible code for processing
    Con: Slow
    Hive query startup requires a long processing time.
  • Pro: Highly available solutions that handle large volumes of writes/reads
    Con: Not easy to do ad-hoc SQL-like queries
  • Stream processing is becoming very popular, and new projects are rising up, such as Samza, based on Kafka.
    Walmart Labs has a project called Muppet that they call "fast data".
    Con: Need to define a topology, and cannot do ad-hoc querying
  • AWS CloudSearch, Lucene: all of these technologies perform searches on the basis of an index they maintain, and therefore need to preprocess the data.
  • When querying multiple sources, most people engineer an ETL pipeline into a common DW, and query the DW for any cross-segment data.
    But ETL obviously implies a delay; we can do better.
  • Drill is not meant to replace MapReduce, but to supplement it
  • Two innovations: handle nested-data column style (column-striped representation) and multi-level execution trees
  • Repetition level (r) — at which repeated field in the field's path the value has repeated.
    Definition level (d) — how many fields in the path that could be undefined (because they are optional or repeated) are actually present.
    Only repeated fields increment the repetition level; only non-required fields increment the definition level.
    Required fields are always defined and do not need a definition level. Non-repeated fields do not need a repetition level.
    An optional field requires one extra bit: zero if it is NULL, one if it is defined.
    NULL values do not need to be stored, as the definition level captures this information.
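The repetition/definition-level encoding can be sketched for the simplest case — a single top-level repeated field. This is an illustrative toy (the function name and records are made up for this note), not Dremel's general nested algorithm:

```python
def encode_repeated(records, field):
    """Encode a top-level repeated field into (value, r, d) triples.

    r=0 marks the first value of a record, r=1 a continuation within the
    same record; d=1 means the value is present. A record with an empty
    list still emits one (None, 0, 0) entry, so the column stays aligned
    with record boundaries without storing the NULL itself.
    """
    out = []
    for rec in records:
        values = rec.get(field, [])
        if not values:
            out.append((None, 0, 0))
        else:
            for i, v in enumerate(values):
                out.append((v, 0 if i == 0 else 1, 1))
    return out

records = [{"tags": ["a", "b"]}, {"tags": []}, {"tags": ["c"]}]
print(encode_repeated(records, "tags"))
# → [('a', 0, 1), ('b', 1, 1), (None, 0, 0), ('c', 0, 1)]
```

Note how the record with no tags contributes only levels, no value — exactly the "NULL values do not need to be stored" point above.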
  • Source query — human-written (e.g. a DSL) or tool-written (e.g. ANSI-compliant SQL) query
    The source query is parsed and transformed to produce the logical plan
    Logical plan: a dataflow of what should logically be done
    Typically, the logical plan lives in memory in the form of Java objects, but it also has a textual form
    The logical plan is then transformed and optimized into the physical plan.
    The optimizer introduces parallel computation, taking topology into account
    The optimizer handles columnar data to improve processing speed
    The physical plan represents the actual structure of computation as it is done by the system
    How physical and exchange operators should be applied
    Assignment to particular nodes and cores + actual query execution per node
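The textual form of a logical plan, reconstructed from the fragment on slide 15 (operator and field names come from that snippet; the surrounding JSON layout is my guess):

```json
{
  "query": [
    { "@id": "log",
      "op": "sequence",
      "do": [
        { "op": "scan",   "source": "logs" },
        { "op": "filter", "condition": "x > 3" }
      ]
    }
  ]
}
```

The sequence pipes the scanned `logs` source through a filter; the optimizer then turns this language-agnostic dataflow into a physical plan.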
  • Drillbits per node, maximize data locality
    Co-ordination, query planning, optimization, scheduling, execution are distributed
    By default, Drillbits hold all roles, modules can optionally be disabled.
    Any node/Drillbit can act as the endpoint for a particular query.
  • Zookeeper maintains ephemeral cluster membership information only
    Small distributed cache utilizing embedded Hazelcast maintains information about individual queue depth, cached query plans, metadata, locality information, etc.
  • The originating Drillbit acts as the foreman and manages all execution for its particular query, scheduling based on priority, queue depth and locality information.
    Drillbit data communication is streaming and avoids any serialization/deserialization
  • Red: the originating Drillbit is the root of the multi-level execution tree, per query/job
    Leaves use their storage engine interface to scan the respective data source (DB, file, etc.)
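The multi-level execution tree can be sketched as a scatter-gather aggregation: leaves pre-aggregate their local partitions and intermediate levels merge partial results up to the foreman. All names here are illustrative, not Drill's actual API:

```python
def leaf_scan(partition):
    # A leaf Drillbit scans its local partition and pre-aggregates.
    return sum(partition)

def intermediate(partials):
    # An intermediate level merges partial aggregates from its children.
    return sum(partials)

def foreman(partitions, fanout=2):
    # The foreman fans the query out to the leaves, then gathers partial
    # results level by level until a single result remains at the root.
    partials = [leaf_scan(p) for p in partitions]
    while len(partials) > 1:
        partials = [intermediate(partials[i:i + fanout])
                    for i in range(0, len(partials), fanout)]
    return partials[0]

partitions = [[1, 2], [3, 4], [5], [6, 7, 8]]
print(foreman(partitions))  # → 36
```

In the real system each level runs on a different node and results stream between Drillbits without serialization, but the tree shape of the computation is the same.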
  • Handing over to Ted
  • Michael?
  • Transcript

    • 1. Introduction to Apache Drill Big Data Bellevue Meetup @tnachen Timothy Chen
    • 2. Motivation Key Facts Architecture Overview
    • 3. About me
    • 4. I Open Source
    • 5. Use Case: Marketing Campaign Jane, a marketing analyst Determine target segments Data from different sources
    • 6. Use Case: Crime Detection • • • • Online purchases Fraud, billing, etc. Batch-generated overview Modes – Explorative – Alerts
    • 7. Requirements • • • • • Support for different data sources Support for different query interfaces Low-latency/real-time Ad-hoc queries Scalable, reliable
    • 8. Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google. … “ “ Google’s Dremel Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis, Proc. of the 36th Int'l Conf on Very Large Data Bases (2010), pp. 330-339
    • 9. Google’s Dremel multi-level execution trees columnar data layout
    • 10. Google’s Dremel nested data + schema column-striped representation map nested data to tables
    • 11. Google’s Dremel experiments: datasets & query performance
    • 12. Apache Drill–key facts Inspired by Google’s Dremel Standard SQL 2003 support Plug-able data sources Nested data is a first-class citizen Schema is optional Community driven, open, 100’s involved
    • 13. High-level Architecture
    • 14. Principled Query Execution Source query—what we want to do (analyst friendly) Logical Plan— what we want to do (language agnostic, computer friendly) Physical Plan—how we want to do it (the best way we can tell) Execution Plan—where we want to do it
    • 15. Principled Query Execution Source Query SQL 2003 DrQL MongoQL DSL Parser parser API Logical Plan query: [ { @id: "log", op: "sequence", do: [ { op: "scan", source: “logs” }, { op: "filter", condition: "x > 3” }, Optimizer Topology CF etc. Physical Plan Execution scanner API
    • 16. Wire-level Architecture Each node: Drillbit - maximize data locality Co-ordination, query planning, execution, etc, are distributed Any node can act as endpoint for a query—foreman Drillbit Drillbit Drillbit Drillbit Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
    • 17. Wire-level Architecture Curator/Zookeeper for ephemeral cluster membership info Distributed cache (Hazelcast) for metadata, locality information, etc. Curator/Zk Curator/Zk Drillbit Drillbit Drillbit Drillbit Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
    • 18. Wire-level Architecture Originating Drillbit acts as foreman: manages query execution, scheduling, locality information, etc. Streaming data communication avoiding SerDe Curator/Zk Curator/Zk Drillbit Drillbit Drillbit Drillbit Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Distributed Distributed Cache Cache Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process Storage Storage Process Process node node node node
    • 19. Wire-level Architecture Foreman turns into root of the multi-level execution tree, leafs activate their storage engine interface. node Curator/Zk Curator/Zk node node
    • 20. On the shoulders of giants … Jackson for JSON SerDe for metadata Typesafe HOCON for configuration and module management Netty4 as core RPC engine, protobuf for communication Vanilla Java, LArray and Netty ByteBuf for off-heap large data structures Hazelcast for distributed cache Netflix Curator on top of Zookeeper for service registry Optiq for SQL parsing and cost optimization Parquet ( ORC Janino for expression compilation ASM for ByteCode manipulation Yammer Metrics for metrics Guava extensively Carrot HPC for primitive collections
    • 21. Key features Full SQL – ANSI SQL 2003 Nested Data as first class citizen Optional Schema Extensibility Points …
    • 22. Extensibility Points Source query  parser API Custom operators, UDF  logical plan Serving tree, CF, topology  physical plan/optimizer Data sources &formats  scanner API Source Query Parser Logical Plan Optimizer Physical Plan Execution
    • 23. User Interfaces API—DrillClient Encapsulates endpoint discovery Supports logical and physical plan submission, query cancellation, query status Supports streaming return results JDBC driver, converting JDBC into DrillClient communication. REST proxy for DrillClient
    • 24. User Interfaces
    • 25. Let’s get our hands dirty…
    • 26. Demo
      Install:
      $ wget
      $ tar -zxf apache-drill-1.0.0-m1-binary-release.tar.gz
      Preparation:
      $ export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.7.0_11.jdk/Contents/Home
      $ export DRILL_LOG_DIR=$PWD
      Usage:
      $ ./bin/ start
      $ ./bin/sqlline -u jdbc:drill:schema=parquet-local -n admin -p admin
    • 27. Useful Resources Getting Started guide Demo HowTo How to build/install Apache Drill on Ubuntu 13.04
    • 28. Be a part of it!
    • 29. Status Heavy development by multiple organizations (MapR, Pentaho, Microsoft, Thoughtworks, XingCloud, etc.) Currently more than 100k LOC Alpha available via
    • 30. Kudos to … Julian Hyde, Pentaho Lisen Mu, XingCloud Tim Chen, Microsoft Chris Merrick, RJMetrics David Alves, UT Austin Sree Vaadi, SSS Srihari Srinivasan, ThoughtWorks • • • • • • • • • Ben Becker, MapR Jacques Nadeau, MapR Ted Dunning, MapR Keys Botzum, MapR Jason Frantz Ellen Friedman Chris Wensel, Concurrent Gera Shegalov, Oracle Ryan Rawson, Ohm Data Alexandre Beche, CERN Jason Altekruse, MapR
    • 31. Contributing Contributions appreciated—not only code drops … Test data & test queries Use case scenarios (textual/SQL queries) Documentation
    • 32. Engage! Follow @ApacheDrill on Twitter Sign up at mailing lists (user | dev) Standing G+ hangouts every Tuesday at 18:00 CET Keep an eye on
    • 33. Twitter: @tnachen Email: