Drilling on JSON



Variety is the spice of life, but it’s also the reality of big data. For this reason, JSON has become the lingua franca of data on the internet: for APIs, data exchange, data storage and data processing. In the business intelligence world, SQL is the language used to analyze data in other forms, hence the myriad of “SQL-on-Hadoop” projects. However, traditional SQL isn’t friendly to JSON, Parquet and similar formats, and ETL into flattened tables is costly and not real time.
Apache Drill unifies SQL with a variety of data forms on Hadoop. That enables interactive analytics using your favorite BI and visualization tools on your data simultaneously. In this talk, we’ll introduce Apache Drill and describe use cases.

- See more at: http://nosql2014.dataversity.net/sessionPop.cfm?confid=81&proposalid=6850#sthash.NhuLz6Dq.dpuf

Published in: Software

  • With other technologies you have to do this, then this, then this, …
  • TODO: Add Impala and Splunk logos
  • Need an example or analogy to explain self-describing data.
  • All SQL engines (traditional or SQL-on-Hadoop) view tables as spreadsheet-like data structures with rows and columns. All records have the same structure, and there is no support for nested data or repeating fields. Drill views tables conceptually as collections of JSON documents (with additional types). Each record can have a different structure (hence, schema-less). This is revolutionary and has never been done before.

    If you consider the four data models shown in the 2x2, all of them can be represented by the complex, schema-less model (JSON) because it is the most flexible. However, no other data model can be represented by the flat, fixed-schema model. Therefore, when using any SQL engine except Drill, the data has to be transformed before it can be queried.
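
The transformation step this note describes can be sketched in Python. This is an illustration only, not Drill code: the `flatten` helper and the dotted column names are assumptions made for the example, and the two records are the ones shown on slide 8. It shows roughly what a flat, fixed-schema engine needs before it can query such data: nested maps become dotted columns, arrays are serialized, and fields missing from a record are padded with `None`.

```python
import json

# The two self-describing records from the slides; each has a different structure.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]


def flatten(record, prefix=""):
    """Hypothetical helper: turn nested maps into dotted column names."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested maps, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{column}."))
        elif isinstance(value, list):
            # A flat table has no repeated fields, so serialize the array.
            flat[column] = json.dumps(value)
        else:
            flat[column] = value
    return flat


flat_rows = [flatten(r) for r in records]

# The fixed schema is the union of every column seen in any record.
columns = sorted({col for row in flat_rows for col in row})

# Pad each row with None so that all rows share the same columns.
table = [[row.get(col) for col in columns] for row in flat_rows]

print(columns)
for row in table:
    print(row)
```

Every new field that appears in the source data forces a change to `columns`, which is the kind of ETL maintenance the note argues a schema-less model avoids.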
  • Drilling on JSON

    1. Drilling on JSON. © 2014 MapR Technologies, #NoSQLNow, @apachedrill
    2. NoSQL
       We don't need no transaction
       We don't need no ACID control
       No schema in the tables
       No limit to the scale out
       DBA, leave them JSON alone
       Hey DBA, leave them JSON alone
       All in all it's just another data in the BASE
       All in all it’s just another shard into cloud.
       …With apologies to Roger Waters
    3. NoSQL Landscape
       Martin Fowler says: “aggregate-oriented”. What you're most likely to access as a unit.
       • Key Value Store: Couchbase, Riak, Citrusleaf, Redis, BerkeleyDB, Membrain, ...
       • Document: MongoDB, CouchDB, RavenDB, Couchbase, ...
       • Graph: OrientDB, DEX, Neo4j, GraphBase, ...
       • Wide Column: HBase, Hypertable, Cassandra, MapR-DB, ...
    4. Data landscape is changing
       New types of applications
       • Social, mobile, Web, “Internet of Things”, Cloud…
       • Iterative/Agile in nature
       • More users, more data
       New data models & data types
       • Flexible schema/schema-less
       • Rapidly changing
       • Semi-structured/nested data
       JSON
         {
           "data": [
             {
               "id": "X999_Y999",
               "from": { "name": "Tom Brady", "id": "X12" },
               "message": "Looking forward to 2014!",
               "actions": [
                 { "name": "Comment", "link": "http://www.facebook.com/X99/posts Y999" },
                 { "name": "Like", "link": "http://www.facebook.com/X99/posts Y999" }
               ],
               "type": "status",
               "created_time": "2013-08-02T21:27:44+0000",
               "updated_time": "2013-08-02T21:27:44+0000"
             }
           ]
         }
    5. APACHE DRILL
       • Pioneering Data Agility for Hadoop
       • Apache open source project
       • Scale-out execution engine for low-latency queries
       • Unified SQL-based API for analytics & operational applications
       40+ contributors
       150+ years of experience building databases and distributed systems
    6. Zero to Results in 2 Minutes (3 Commands)
       Install:
         $ tar xzf apache-drill.tar.gz
       Launch shell (embedded mode):
         $ apache-drill/bin/sqlline -u jdbc:drill:zk=local
       Query:
         0: jdbc:drill:zk=local> SELECT DISTINCT users.name as name, users.emails.work as email
                                 FROM dfs.logs.`/data/logs` logs, dfs.users.`/profiles.json` users
                                 WHERE logs.uid = users.id AND logs.errorLevel > 5;
         +------+-----------------+
         | name | email           |
         +------+-----------------+
         | john | john@gmail.com  |
         | jack | jack@yahoo.com  |
         | Ronn | ronn@mapr.com   |
         | Pat  | pat@hotmail.com |
         ...
    7. Drill Supports Schema Discovery On-The-Fly
       Schema Declared In Advance (SCHEMA ON WRITE, SCHEMA BEFORE READ)
       • Fixed schema
       • Leverage schema in centralized repository (Hive Metastore)
       Schema Discovered On-The-Fly (SCHEMA ON THE FLY)
       • Fixed schema, evolving schema or schema-less
       • Leverage schema in centralized repository or self-describing data
    8. Self-Describing Data is Ubiquitous
       Flat files in DFS
       • Complex data (Thrift, Avro, protobuf)
       • Columnar data (Parquet, ORC)
       • Loosely defined (JSON)
       • Traditional files (CSV, TSV)
       Data stored in NoSQL stores
       • Relational-like (rows, columns)
       • Sparse data (NoSQL maps)
       • Embedded blobs (JSON)
       • Document stores (nested objects)
       Example records:
         { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
         { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
    9. Drill’s Data Model is Flexible
       [2x2 diagram. Axes: Flat vs. Complex, and Fixed schema vs. Schema-less, with flexibility increasing along both. Formats plotted: HBase, JSON, BSON, CSV, TSV, Parquet, Avro.]
       RDBMS/SQL-on-Hadoop table:
         Name      Gender  Age
         Michael   M       6
         Jennifer  F       3
       Apache Drill table:
         { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
         { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
    10. Core Modules within a Drillbit
       [Diagram. Modules shown: RPC Endpoint, SQL Parser, Logical Plan, Optimizer, Physical Plan, Execution, Distributed Cache, Storage Plugins.]
       Storage plugins shown: DFS, HBase, Hive, MongoDB, CouchBase, Cassandra, RDBMS
    11. [Diagram; recoverable labels follow]
       • Hadoop: disk & storage (bits, bytes, blocks); processing in files (MapReduce, generic file formats); rows/columns in files (tables) (Hive, Pig, etc.); query (Impala, Tez, Hive)
       • NoSQL: MongoDB, HBase, Cassandra, Riak, Redis
       • RDBMS: highly structured data, OLTP, EDW
       • Languages: ANSI SQL, SQL++, R, etc.
       • Structure labels: no structure; semi-structured & self-describing; highly structured data
       • Cost labels: $1K/TB, $10K/TB, $100K – $200K/TB
       • Apache Drill
    12. NoSQL, NoETL
       Drill, Baby, Drill: Self-Service Data Exploration using Apache Drill
       Thursday, August 21st, 9:30 AM
       Apache Drill
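
To make the semantics of the query on slide 6 concrete, here is a rough Python emulation over in-memory records. It is not how Drill executes the query, and the sample `logs` and `users` records are invented for illustration; only the field names (`uid`, `errorLevel`, `id`, `name`, `emails.work`) come from the slide. The `query` function is likewise a hypothetical name.

```python
# Hypothetical sample data; only the field names come from the slide's query.
logs = [
    {"uid": 1, "errorLevel": 7},
    {"uid": 1, "errorLevel": 9},
    {"uid": 2, "errorLevel": 3},
    {"uid": 3, "errorLevel": 6},
]
users = [
    {"id": 1, "name": "john", "emails": {"work": "john@example.com"}},
    {"id": 2, "name": "jack", "emails": {"work": "jack@example.com"}},
    {"id": 3, "name": "pat", "emails": {"home": "pat@example.com"}},
]


def query(logs, users):
    """Emulate SELECT DISTINCT name, emails.work with the join and errorLevel filter."""
    users_by_id = {u["id"]: u for u in users}
    seen = set()
    rows = []
    for log in logs:
        # WHERE logs.errorLevel > 5
        if log["errorLevel"] <= 5:
            continue
        # WHERE logs.uid = users.id (inner join: skip logs without a matching user)
        user = users_by_id.get(log["uid"])
        if user is None:
            continue
        # users.emails.work is a nested field; in this sketch a missing field yields None.
        row = (user["name"], user.get("emails", {}).get("work"))
        # DISTINCT: keep only the first occurrence of each row.
        if row not in seen:
            seen.add(row)
            rows.append(row)
    return rows


result = query(logs, users)
for name, email in result:
    print(name, email)
```

The point of the sketch is that the nested `emails.work` field is addressed directly, with no flattening step before the join.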