GBDC 2013-01-28


  1. Using real-time SQL2003 to query JSON on Hadoop with Apache Drill. January 28, 2013. Jacques Nadeau, Apache Drill Contributor @ MapR Technologies
  2. Me
     • Apache Drill and HBase contributor
     • Sponsored by MapR Technologies to lead Apache Drill contributions
       – Enterprise-grade, high-performance distribution for Hadoop
       – Open source plus standards-based extensions
       – Large number of Fortune 100 customers, and startups too
       – Free distribution for unlimited nodes
       – Partnered to provide the distribution on Google Compute Engine and Amazon Elastic MapReduce
  3. Jane works as an analyst at an ecommerce website. How does she figure out good targeting segments for the next marketing campaign? She has some ideas and lots of data: transaction information, user profiles, and access logs.
  4. Let’s try using existing options
     • Use Oracle
       – Write a flattening MongoDB query for export and generate a giant CSV. Work with the MapReduce team to build a MapReduce job that provides an export. Contact the DBA to import the data exports. Use Oracle SQL to determine the answers.
     • Use Hive
       – Pull up Hive. Start writing queries. Realize that the Hive/Mongo interconnector doesn’t support nested data. Realize that Hive doesn’t have a JDBC/ODBC storage handler. Query data from Oracle and copy it to Hadoop. Query the flattened Mongo data and copy it into Hadoop. Write a HiveQL query (see the sketch after this slide). Wait 30 minutes for a result. Repeat until the desired outcome. Try not to get frustrated along the way by the flattened Mongo data, the partial Oracle extraction, and the lack of major portions of SQL syntax.
     • Use a data virtualization solution
       – Write a SQL query against the virtualization interface. Realize that you still need to ETL the Mongo data since it isn’t natively supported. The query runs slowly since the virtualization solution doesn’t run locally against the Hadoop data and fails to effectively distribute your query.
     • Use MapReduce
       – Work with Engineering to define a specification of the needs. Use Sqoop to set up a regular ETL from Oracle. Write a custom MapReduce job to import the Mongo data.
       – Look at the output, realize a different analysis should be done, and repeat the cycle (or learn Java).
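     A minimal sketch of the kind of HiveQL Jane ends up writing once the Mongo user data has been flattened and copied into Hadoop. The table names (flattened_mongo_users, oracle_export_transactions) and columns (user_id, segments_csv, txn_id) are hypothetical, added only to illustrate the friction described above.

       -- Hypothetical HiveQL over flattened copies of the Mongo and Oracle data.
       -- LATERAL VIEW explode(split(...)) is the usual Hive workaround for an
       -- array field that arrives as a comma-delimited string after flattening.
       SELECT u.user_id,
              u.segment,
              COUNT(t.txn_id) AS purchases
       FROM (
              SELECT m.user_id, seg.segment
              FROM   flattened_mongo_users m
                     LATERAL VIEW explode(split(m.segments_csv, ',')) seg AS segment
            ) u
       JOIN   oracle_export_transactions t ON t.user_id = u.user_id
       GROUP  BY u.user_id, u.segment;

     Each iteration of this query still pays the MapReduce startup and batch cost the deck describes.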
  5. Why are things so hard?
     • Slow
       – Virtualization solutions don’t support data locality and pushdown
       – MapReduce sacrifices performance to support long-running jobs, recoverability, and ultimate flexibility
     • Old
       – Most systems assume flat data with well-defined static schemas
     • Hard
       – Queries must be written in multiple languages (does anybody know MongoQL, CQL, HiveQL, and SQL?)
       – Analysts often need custom development help
     • Error prone
       – ETL leads to data synchronization issues
       – Lack of query transparency leads to incorrect assumptions and bad business conclusions
     • Expensive
       – Commercial solutions are very expensive
       – They typically provide poor compatibility with newer NoSQL technologies
  6. Open Source Mantra: WWGD (What Would Google Do)?

              | Distributed File System | Datastore | Interactive analysis | Batch processing
       Google | GFS                     | BigTable  | Dremel               | MapReduce
       Hadoop | HDFS                    | HBase     | ?                    | MapReduce

     Build Apache Drill to provide a true open source solution to interactive analysis of Big Data.
  7. Apache Drill Overview
     • Drill overview
       – Low-latency interactive queries
       – Standard ANSI SQL2003 support
       – Domain-specific languages / your own QL
       – Inspired by, and compatible with, Google BigQuery/Dremel
       – Supports nested/hierarchical data formats (see the sketch after this slide)
       – Supports RDBMS, Hadoop, and NoSQL alike
     • Open source and flexible
       – Apache Incubator project
       – Hundreds involved across the US and Europe
       – Community consensus on API and functionality
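     To make the nested-data point concrete, here is a hypothetical sketch of what a SQL2003-style query over a nested JSON source could look like. The hdfs.`/logs/events.json` source name, the dotted field paths, and the column names are illustrative assumptions, not confirmed Drill syntax.

       -- Hypothetical: standard SQL over nested JSON, with no ETL and no
       -- up-front schema definition. Source path and field names are
       -- assumptions used only for illustration.
       SELECT e.`user`.address.city AS city,
              COUNT(*)              AS page_views
       FROM   hdfs.`/logs/events.json` e
       WHERE  e.event_type = 'page_view'
       GROUP  BY e.`user`.address.city;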
  8. Why do we need another tool? Queries span a latency spectrum:
     • Point queries: 0-100 ms (per-system interfaces)
     • Interactive queries: 100 ms – 3 minutes (Apache Drill)
     • Data analyst & reporting queries: 3 minutes – 20 minutes
     • Data mining and major ETL: 20 minutes – 20 hours (MapReduce, Hive and Pig)
  9. Why not improve Hive or Pig?
     • Different goals
     • SQL should be a first-class concern
     • MapReduce severely hampers the processing model and performance
       – Startup cost is high
       – Map:Reduce recoverability and barrier disadvantages
       – Job:Job recoverability and barrier disadvantages (chained jobs)
     • Need to build from an in-memory representation
       – Two canonical in-memory formats (row-based and columnar)
       – Support much larger memory sizes
       – Smaller memory footprint per record
       – Avoid serialization/deserialization and object-creation costs between nodes and operations
     • Performance of interactive queries is critical
       – Evaluation and operator code generation & compilation
     • First-class recognition of nested types without a metadata requirement
       – Schema discovery and a standard schema representation
     • Clear delineation between important stages
       – Support for multiple optimizers and researcher experimentation
  10. How does it work?
      • Drillbits run on each node to minimize network transfer
      • Queries can be fed to any Drillbit
      • Coordination, query planning, optimization, scheduling, and execution are distributed
      Example query from the slide:
        SELECT * FROM oracle.transactions,
                      mongo.users,
                      hdfs.events
        LIMIT 1
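      Extending the query shown on this slide, a hedged sketch of how Jane’s segmentation question might be phrased as a single cross-source statement. The join keys and columns (user_id, campaign_segment, amount, page) are hypothetical and only illustrate the idea of one query spanning Oracle, MongoDB, and HDFS.

        -- Hypothetical extension of the slide's query; column names are
        -- illustrative assumptions, not real schemas.
        SELECT u.campaign_segment,
               COUNT(DISTINCT t.user_id) AS buyers,
               SUM(t.amount)             AS revenue
        FROM   oracle.transactions t
        JOIN   mongo.users u ON u.user_id = t.user_id
        WHERE  t.user_id IN (SELECT e.user_id
                             FROM   hdfs.events e
                             WHERE  e.page = '/campaign')
        GROUP  BY u.campaign_segment
        ORDER  BY revenue DESC;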
  11. Flexibility with Strongly Defined Tiers and APIs
  12. Apache Drill currently in development
      • Heavy, active development by multiple supporting organizations
      • Available
        – Logical plan syntax and interpreter
        – Reference interpreter
      • In progress
        – SQL interpreter
        – Storage engine implementations for Accumulo, Cassandra, HBase, and HDFS file formats
  13. Conclusion & Questions
      • Put Apache Drill on your roadmap; we’ll make your life easier
      • Join the community
        – Code: http://github.com/apache/incubator-drill
        – Mailing list: drill-user@incubator.apache.org
        – Wiki: https://cwiki.apache.org/confluence/display/DRILL
      • Access this presentation: http://bit.ly/Wo6DLd
      • Contact me: jacques.drill@gmail.com
