Analyzing Real-World Data with Apache Drill

Learn how to analyze data with Apache Drill, the open source schema-free SQL engine.

Published in: Data & Analytics


  1. 1. Analyzing Real-World Data with Apache Drill © 2014 MapR Technologies
  2. 2. Data is doubling in size every two years
  3. 3. IDC estimates that in 2020, there will be 44 zettabytes of data in the world. [Chart: 1.8 zettabytes in 2011, 4.4 zettabytes in 2013, 44 zettabytes in 2020] Source: IDC Digital Universe
  4. 4. Unstructured data will account for more than 80% of the data collected by organizations. [Chart: total data stored, structured vs. unstructured data, 1980-2020] Source: Human-Computer Interaction & Knowledge Discovery in Complex Unstructured, Big Data
  5. 5. NoSchema Datastores are Capturing this Data
                  | Relational databases                   | "NoSchema" datastores
      Volume      | MBs-GBs                                | TBs-PBs
      Database    | Fixed schema; DBA controls structure   | Dynamic schema (schema-free); application controls structure
      Structure   | Structured                             | Structured, semi-structured and unstructured
      Development | Planned (release cycle = months-years) | Iterative (release cycle = days-weeks)
      [Timeline axis: 1980-2020]
  6. 6. SQL in the Big Data World
      Want:
      • SQL
      • BI (Tableau, MicroStrategy, etc.)
      • Low latency
      • Scalability
      Don't want:
      • Create and maintain schemas on:
        – HDFS (Parquet, JSON, etc.)
        – HBase
        – MongoDB
      • Transform or copy data
      We want SQL and BI support without compromising the flexibility and agility of NoSchema datastores
  7. 7. Apache Drill
      • Schema-free scale-out query engine for Hadoop and NoSQL
      • Point-and-query vs. schema-first
      • Low latency
      • Extreme ease of use
      • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
      40+ contributors
      150+ years of experience building databases and distributed systems
  8. 8. Evolution Towards Self-Service Data Exploration
                                       | Traditional BI w/ RDBMS | Self-Service BI w/ RDBMS | SQL-on-Hadoop | Self-Service Data Exploration
      Data Modeling and Transformation | IT-driven               | IT-driven                | IT-driven     | Not needed
      Data Visualization               | IT-driven               | Self-service             | Self-service  | Self-service
      [Callout: Zero-day analytics]
  9. 9. © 2014 MapR Technologies 9
  10. 10. Drill's Data Model is Flexible
      [Diagram: flexibility along two axes, fixed schema to schema-less and flat to complex. Formats shown: HBase, JSON, BSON, CSV, TSV, Parquet, Avro]
      RDBMS/SQL-on-Hadoop table:
      Name     | Gender | Age
      Michael  | M      | 6
      Jennifer | F      | 3
      Apache Drill table:
      { name: { first: Michael, last: Smith }, hobbies: [ski, soccer], district: Los Altos }
      { name: { first: Jennifer, last: Gates }, hobbies: [sing], preschool: CCLC }
  11. 11. Drill Supports Schema Discovery On-The-Fly
      Schema Declared In Advance (schema on write, schema before read):
      • Fixed schema
      • Leverage schema in centralized repository (Hive Metastore)
      Schema Discovered On-The-Fly (schema on the fly):
      • Fixed schema, evolving schema or schema-less
      • Leverage schema in centralized repository or self-describing data
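
As a rough illustration of what discovering a schema on the fly means for self-describing data, the Python sketch below walks the two JSON-style records from the data model slide and collects every field path it encounters, together with the value types seen at that path. The function name `discover_schema` and the dotted-path convention are assumptions made for this example. It is not Drill's implementation, which works on columnar record batches.

```python
from typing import Any, Dict, List, Set


def discover_schema(records: List[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Union the field paths seen across records, with the type names observed per path."""
    schema: Dict[str, Set[str]] = {}

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Nested maps contribute dotted paths such as "name.first".
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        else:
            # Leaves (including lists) are recorded by their Python type name.
            schema.setdefault(path, set()).add(type(value).__name__)

    for record in records:
        walk(record, "")
    return schema


# The two records shown on the data model slide.
records = [
    {"name": {"first": "Michael", "last": "Smith"},
     "hobbies": ["ski", "soccer"], "district": "Los Altos"},
    {"name": {"first": "Jennifer", "last": "Gates"},
     "hobbies": ["sing"], "preschool": "CCLC"},
]

schema = discover_schema(records)
for path in sorted(schema):
    print(path, sorted(schema[path]))
```

No schema was declared before reading: `district` and `preschool` each occur in only one record, yet both end up in the discovered set of fields.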
  12. 12. Native JSON
      JSON query with Drill:
      SELECT po_document.AllowPartialShipment FROM j_purchaseorder;
      JSON query with Oracle:
      SELECT json_value(po_document, '$.AllowPartialShipment' RETURNING NUMBER) FROM j_purchaseorder;
      Relational databases cannot provide true schema-free JSON support.
  13. 13. Architecture
  14. 14. High Level Architecture
      • Cluster of commodity servers
        – Daemon (drillbit) on each node
      • No dependency on other execution engines (MapReduce, Spark, Tez)
        – Better performance and manageability
      • ZooKeeper maintains ephemeral cluster membership information
        – drillbit uses ZooKeeper to find other drillbits in the cluster
        – Client uses ZooKeeper to find drillbits
      • Data processing unit is columnar record batches
        – Enables schema flexibility with negligible performance impact
  15. 15. Drill Maximizes Data Locality
      [Diagram: a ZooKeeper ensemble and a set of nodes, each running a drillbit next to a DataNode/RegionServer/mongod]
      Data Source      | Best Practice
      HDFS or MapR-FS  | drillbit on each DataNode
      HBase or MapR-DB | drillbit on each RegionServer
      MongoDB          | drillbit on each mongod node (when using replicas, run it on the replica node)
  16. 16. SELECT* Query Execution
      [Diagram: client (JDBC, ODBC, REST), ZooKeeper ensemble, drillbits]
      1. Find drillbits (once per session)
      2. Submit query to drillbit
      3. Create logical and physical execution plans
      4. Farm out execution of fragments to cluster (completely distributed execution)
      5. Return results to client
      * CTAS (CREATE TABLE AS SELECT) queries include steps 1-4
  17. 17. Core Modules within drillbit
      [Diagram: RPC Endpoint, SQL Parser, Optimizer, Logical Plan, Physical Plan, Execution, Distributed Cache, Storage Plugins (DFS, Hive, HBase, MongoDB)]
  18. 18. Example: Analyzing Real-World Data
  19. 19. Demo Plan
      1. Run Drill
      2. Configure DFS and MongoDB storage plugins
      3. Explore the data
        – Basics
        – Complex data
        – Views
  20. 20. Run Drill
  21. 21. Run Drill in Embedded Mode (sqlline)
      $ tar xf apache-drill-0.7.0.tar.gz
      $ cd apache-drill-0.7.0
      $ bin/sqlline -u jdbc:drill:zk=local
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/user.json` LIMIT 1;
      +---------------+------------+--------------+------------+------------+
      | yelping_since | votes | review_count | name | user_id |
      +---------------+------------+--------------+------------+------------+
      | 2012-02 | {"funny":1,"useful":5,"cool":0} | 6 | Lee | qtrmBGNqCvupHMHL_bKFgQ |
      You can now access the Web UI: http://localhost:8047
      • drillbit (Drill daemon) starts automatically in embedded mode
      • No ZooKeeper in embedded mode (hence zk=local)
      • Can't use BI clients (JDBC/ODBC) in embedded mode
  22. 22. Or Run Drill in Distributed Mode…
      • Make sure ZooKeeper (zkServer) is running:
        $ zkServer start
      • Define the Drill cluster name and ZooKeeper nodes in conf/drill-override.conf
      • Start drillbit:
        $ bin/drillbit.sh start
      • Access the Web UI: http://localhost:8047
      • Connect a client to the cluster (e.g., sqlline):
        $ bin/sqlline -u jdbc:drill:zk=localhost:2181
      • Clients (like sqlline) connect to ZooKeeper to discover the cluster nodes
      • If you have multiple Drill clusters registered in one ZooKeeper ensemble, specify the desired cluster in the JDBC connection string: jdbc:drill:zk=localhost:2181/drill/<clustername>
      • Not sure if ZooKeeper is running? Run telnet localhost 2181 and make sure it connects
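
The connection strings on the two preceding slides follow a simple pattern, which the Python sketch below assembles. The helper name `drill_jdbc_url` is hypothetical. The `zk=local` and `/drill/<clustername>` forms are taken from the slides; joining several ZooKeeper hosts with commas is an assumption that should be checked against the Drill documentation for the version in use.

```python
from typing import Optional, Sequence


def drill_jdbc_url(zk_hosts: Sequence[str], cluster_name: Optional[str] = None) -> str:
    """Build a Drill JDBC URL of the form jdbc:drill:zk=<hosts>[/drill/<clustername>]."""
    if not zk_hosts:
        raise ValueError("at least one ZooKeeper host (or 'local') is required")
    # Assumption: multiple ZooKeeper hosts are comma-separated.
    url = "jdbc:drill:zk=" + ",".join(zk_hosts)
    if cluster_name:
        # Needed only when several Drill clusters share one ZooKeeper ensemble.
        url += "/drill/" + cluster_name
    return url


print(drill_jdbc_url(["local"]))                        # embedded mode
print(drill_jdbc_url(["localhost:2181"]))               # single distributed cluster
print(drill_jdbc_url(["localhost:2181"], "mycluster"))  # "mycluster" is a placeholder name
```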
  23. 23. Configure Storage Plugins
  24. 24. Enable MongoDB Storage Plugin
  25. 25. Define Workspaces in the DFS Storage Plugin
  26. 26. Explore the Data: Basics
  27. 27. Inventory: DFS Files
      {
        "votes": {"funny": 0, "useful": 2, "cool": 1},
        "user_id": "Xqd0DzHaiyRqVH3WRG7hzg",
        "review_id": "15SdjuK7DmYqUAj6rjGowg",
        "stars": 5,
        "date": "2007-05-17",
        "text": "dr. goldberg offers everything ...",
        "type": "review",
        "business_id": "vcNAWiLM4dR7D2nwwJ7nCA"
      }
  28. 28. Inventory: MongoDB Collections
      $ mongo
      MongoDB shell version: 2.6.5
      > show databases;
      admin (empty)
      local 0.078GB
      yelp  0.453GB
      > use yelp
      > db.users.findOne()
      {
        "_id" : ObjectId("54566cdf3237149de181a92a"),
        "yelping_since" : "2012-02",
        "votes" : { "funny" : 1, "useful" : 5, "cool" : 0 },
        "review_count" : 6,
        "name" : "Lee",
        "user_id" : "qtrmBGNqCvupHMHL_bKFgQ",
        "friends" : [ ]
      }
  29. 29. Let's Go!
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` WHERE stars = 1 LIMIT 1;
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | votes | user_id | review_id | stars | date | text | type | business_id |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
      | {"funny":0,"useful":0,"cool":0} | Qrs3EICADUKNFoUq2iHStA | _ePLBPrkrf4bhyiKWEn4Qg | 1 | 2013-04-19 | I don't know what Dr. Goldberg was like before moving to Arizona, but let me tell you, STAY AWAY from this doctor and this office. | review | vcNAWiLM4dR7D2nwwJ7nCA |
      +------------+------------+------------+------------+------------+------------+------------+-------------+
  30. 30. Using Storage Plugins and Workspaces
      A table reference has three parts: storage plugin, workspace, and path relative to the workspace.
      > SELECT * FROM dfs.root.`/Users/tshiran/Development/demo/data/yelp/review.json` LIMIT 1;
      > SELECT * FROM dfs.demo.`yelp/review.json` LIMIT 1;
      > SELECT * FROM mongo.yelp.users LIMIT 1;
      > USE mongo.yelp;
      > SELECT * FROM users LIMIT 1;
      Storage Plugin | Workspace | Table
      dfs            | Path      | Path relative to workspace
      mongo          | Database  | Collection
      hive           | Database  | Table
      hbase          | Namespace | Table
  31. 31. Most Common User Names (MongoDB)
      > SELECT name, count(*) AS users
        FROM mongo.yelp.users
        GROUP BY name
        ORDER BY users DESC
        LIMIT 10;
      +------------+------------+
      | name       | users      |
      +------------+------------+
      | David      | 2453       |
      | John       | 2378       |
      | Michael    | 2322       |
      | Chris      | 2202       |
      | Mike       | 2037       |
      | Jennifer   | 1867       |
      | Jessica    | 1463       |
      | Jason      | 1457       |
      | Michelle   | 1439       |
      | Brian      | 1436       |
      +------------+------------+
  32. 32. Cities with the Most Businesses
      > SELECT state, city, count(*) AS businesses
        FROM dfs.demo.`/yelp/business.json`
        GROUP BY state, city
        ORDER BY businesses DESC
        LIMIT 10;
      +------------+------------+-------------+
      | state      | city       | businesses  |
      +------------+------------+-------------+
      | NV         | Las Vegas  | 12021       |
      | AZ         | Phoenix    | 7499        |
      | AZ         | Scottsdale | 3605        |
      | EDH        | Edinburgh  | 2804        |
      | AZ         | Mesa       | 2041        |
      | AZ         | Tempe      | 2025        |
      | NV         | Henderson  | 1914        |
      | AZ         | Chandler   | 1637        |
      | WI         | Madison    | 1630        |
      | AZ         | Glendale   | 1196        |
      +------------+------------+-------------+
  33. 33. Explore the Data: Complex Data
  34. 34. business.json (1)
      {
        "business_id": "4bEjOyTaDG24SY5TxsaUNQ",
        "full_address": "3655 Las Vegas Blvd S\nThe Strip\nLas Vegas, NV 89109",
        "hours": {
          "Monday": {"close": "23:00", "open": "07:00"},
          "Tuesday": {"close": "23:00", "open": "07:00"},
          "Friday": {"close": "00:00", "open": "07:00"},
          "Wednesday": {"close": "23:00", "open": "07:00"},
          "Thursday": {"close": "23:00", "open": "07:00"},
          "Sunday": {"close": "23:00", "open": "07:00"},
          "Saturday": {"close": "00:00", "open": "07:00"}
        },
        "open": true,
        "categories": ["Breakfast & Brunch", "Steakhouses", "French", "Restaurants"],
        "city": "Las Vegas",
        "review_count": 4084,
        "name": "Mon Ami Gabi",
        "neighborhoods": ["The Strip"],
        "longitude": -115.172588519464,
  35. 35. business.json (2)
        "state": "NV",
        "stars": 4.0,
        "attributes": {
          "Alcohol": "full_bar",
          "Noise Level": "average",
          "Has TV": false,
          "Attire": "casual",
          "Ambience": {
            "romantic": true, "intimate": false, "touristy": false, "hipster": false,
            "classy": true, "trendy": false, "casual": false
          },
          "Good For": {"dessert": false, "latenight": false, "lunch": false, "dinner": true, "breakfast": false, "brunch": false},
        }
      }
  36. 36. Which Places Are Open Right Now (22:00)?
      > SELECT name, b.hours
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Saturday.`open` < '22:00' AND
              b.hours.Saturday.`close` > '22:00'
        LIMIT 2;
      +------------+------------+
      | name | hours |
      +------------+------------+
      | Chang Jiang Chinese Kitchen | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"22:30","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"21:00","open":"16:00"},"Saturday":{"close":"22:30","open":"11:00"}} |
      | Grand China Restaurant | {"Tuesday":{"close":"22:00","open":"11:00"},"Friday":{"close":"23:00","open":"11:00"},"Monday":{"close":"22:00","open":"11:00"},"Wednesday":{"close":"22:00","open":"11:00"},"Thursday":{"close":"22:00","open":"11:00"},"Sunday":{"close":"22:00","open":"12:00"},"Saturday":{"close":"23:00","open":"11:00"}} |
      +------------+------------+
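
The query on this slide compares opening hours as strings. That works because zero-padded 24-hour times such as '07:00' and '22:00' sort lexicographically in chronological order, but it skips any business whose closing time wraps past midnight: the business.json sample lists Mon Ami Gabi as closing at "00:00" on Saturday, and '00:00' > '22:00' is false. The Python sketch below reproduces the slide's comparison and shows one possible wrap-aware variant. Both function names are hypothetical, and treating `close <= open` as "closes after midnight" is an assumption about the data, not something the slides state.

```python
def is_open_naive(open_time: str, close_time: str, now: str) -> bool:
    """Same predicate as the slide: open < now AND close > now, on 'HH:MM' strings."""
    return open_time < now and close_time > now


def is_open_wrap_aware(open_time: str, close_time: str, now: str) -> bool:
    """Variant that treats close <= open as closing after midnight (assumption)."""
    if close_time <= open_time:
        # Open from open_time until midnight, then from midnight until close_time.
        return now >= open_time or now < close_time
    return open_time <= now < close_time


# Saturday hours taken from the slides.
mon_ami_gabi = ("07:00", "00:00")
chang_jiang = ("11:00", "22:30")

for label, (opens, closes) in [("Mon Ami Gabi", mon_ami_gabi),
                               ("Chang Jiang Chinese Kitchen", chang_jiang)]:
    print(label,
          is_open_naive(opens, closes, "22:00"),
          is_open_wrap_aware(opens, closes, "22:00"))
```

In Drill SQL the same idea could be expressed with an extra OR branch in the WHERE clause; whether that is worthwhile depends on how the dataset encodes late closing times.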
  37. 37. It's 10pm in Vegas and I Want Good Hummus!
      > SELECT name, stars, b.hours.Friday, categories
        FROM dfs.demo.`yelp/business.json` b
        WHERE b.hours.Friday.`open` < '22:00' AND
              b.hours.Friday.`close` > '22:00' AND
              REPEATED_CONTAINS(categories, 'Mediterranean') AND
              city = 'Las Vegas'
        ORDER BY stars DESC
        LIMIT 2;
      +------------+------------+------------+------------+
      | name | stars | EXPR$2 | categories |
      +------------+------------+------------+------------+
      | Olives | 4.0 | {"close":"22:30","open":"11:00"} | ["Mediterranean","Restaurants"] |
      | Marrakech Moroccan Restaurant | 4.0 | {"close":"23:00","open":"17:30"} | ["Mediterranean","Middle Eastern","Moroccan","Restaurants"] |
      +------------+------------+------------+------------+
  38. 38. Flatten Repeated Values
      > SELECT name, categories
        FROM dfs.demo.`yelp/business.json`
        LIMIT 3;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | ["Doctors","Health & Medical"] |
      | Pine Cone Restaurant | ["Restaurants"] |
      | Deforest Family Restaurant | ["American (Traditional)","Restaurants"] |
      +------------+------------+
      > SELECT name, FLATTEN(categories) AS categories
        FROM dfs.demo.`yelp/business.json`
        LIMIT 5;
      +------------+------------+
      | name | categories |
      +------------+------------+
      | Eric Goldberg, MD | Doctors |
      | Eric Goldberg, MD | Health & Medical |
      | Pine Cone Restaurant | Restaurants |
      | Deforest Family Restaurant | American (Traditional) |
      | Deforest Family Restaurant | Restaurants |
      +------------+------------+
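
To make the effect of FLATTEN concrete, here is a minimal Python sketch that emulates it on the three rows shown on this slide: each element of the repeated field becomes its own output row, and the other columns are copied alongside it. The generator name `flatten` is a stand-in for illustration only and says nothing about how Drill executes the function internally. In this sketch a row with an empty array produces no output rows.

```python
from typing import Any, Dict, Iterable, Iterator


def flatten(rows: Iterable[Dict[str, Any]], column: str) -> Iterator[Dict[str, Any]]:
    """Emit one output row per element of rows[i][column], copying the other columns."""
    for row in rows:
        for element in row[column]:
            out = dict(row)        # shallow copy so the input row is left untouched
            out[column] = element  # replace the array with a single element
            yield out


# The three businesses shown on the slide.
businesses = [
    {"name": "Eric Goldberg, MD", "categories": ["Doctors", "Health & Medical"]},
    {"name": "Pine Cone Restaurant", "categories": ["Restaurants"]},
    {"name": "Deforest Family Restaurant",
     "categories": ["American (Traditional)", "Restaurants"]},
]

flattened = list(flatten(businesses, "categories"))
for row in flattened:
    print(row["name"], "|", row["categories"])
```

The three input rows expand to five output rows, matching the second result table on the slide.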
  39. 39. Most and Least Common Business Categories
      > SELECT category, count(*) AS businesses
        FROM (SELECT name, FLATTEN(categories) AS category
              FROM dfs.demo.`yelp/business.json`) c
        GROUP BY category
        ORDER BY businesses DESC;
      +------------+------------+
      | category | businesses |
      +------------+------------+
      | Restaurants | 14303 |
      …
      | Australian | 1 |
      | Boat Dealers | 1 |
      | Firewood | 1 |
      +------------+------------+
      715 rows selected (3.439 seconds)
      > SELECT name, categories
        FROM dfs.demo.`yelp/business.json`
        WHERE true and REPEATED_CONTAINS(categories, 'Australian');
      +------------+------------+
      | name | categories |
      +------------+------------+
      | The Australian AZ | ["Bars","Burgers","Nightlife","Australian","Sports Bars","Restaurants"] |
      +------------+------------+
  40. 40. Explore the Data: Views
  41. 41. Create a View for Name-Gender Mapping
      names.csv: [Screenshot of the file, with the first field labeled columns[0] and the fifth labeled columns[4]]
      > CREATE VIEW dfs.tmp.`names` AS
        SELECT columns[0] AS name, columns[4] AS gender
        FROM dfs.demo.`names.csv`;
      > USE dfs.tmp;
      > CREATE VIEW names1 AS
        SELECT columns[0] AS name, columns[4] AS gender
        FROM dfs.demo.`names.csv`;
      > SELECT * FROM dfs.tmp.names WHERE name = 'John';
      +------------+------------+
      | name | gender |
      +------------+------------+
      | John | Male |
      +------------+------------+
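
The view on this slide relies on Drill exposing each line of a headerless CSV file as a single array named `columns`. The Python sketch below mimics that shape and then applies the same projection as the view. The sample rows are hypothetical: only the name in position 0 and the gender in position 4 follow the slide, and the three middle fields are placeholders because the slides do not show what names.csv contains there. The helper names are likewise invented for this example.

```python
import csv
import io
from typing import Dict, List

# Hypothetical stand-in for names.csv; fields 1-3 are placeholders.
SAMPLE_CSV = "John,x1,x2,x3,Male\nJennifer,x1,x2,x3,Female\n"


def read_as_columns(text: str) -> List[Dict[str, List[str]]]:
    """Represent each CSV line as a row holding one array field called 'columns'."""
    return [{"columns": fields} for fields in csv.reader(io.StringIO(text))]


def names_view(rows: List[Dict[str, List[str]]]) -> List[Dict[str, str]]:
    """Same projection as the view: columns[0] AS name, columns[4] AS gender."""
    return [{"name": row["columns"][0], "gender": row["columns"][4]} for row in rows]


view = names_view(read_as_columns(SAMPLE_CSV))
print([row for row in view if row["name"] == "John"])
```

Giving the positional fields names once, in a view, keeps later queries readable and means the index positions live in a single place.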
  42. 42. Most Common Names (and their Genders) on Yelp
      > SELECT u.name, n.gender, count(*) AS number
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY u.name, n.gender
        ORDER BY number DESC
        LIMIT 10;
      +------------+------------+------------+
      | name       | gender     | number     |
      +------------+------------+------------+
      | David      | Male       | 2453       |
      | John       | Male       | 2378       |
      | Michael    | Male       | 2322       |
      | Chris      | Unknown    | 2202       |
      | Mike       | Male       | 2037       |
      | Jennifer   | Female     | 1867       |
      | Jessica    | Female     | 1463       |
      | Jason      | Male       | 1457       |
      | Michelle   | Female     | 1439       |
      | Brian      | Male       | 1436       |
      +------------+------------+------------+
  43. 43. Who Rates Higher – Men or Women?
      > SELECT n.gender, count(*) AS users, round(avg(average_stars), 2) stars
        FROM mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name
        GROUP BY n.gender;
      +------------+------------+------------+
      | gender     | users      | stars      |
      +------------+------------+------------+
      | Female     | 103684     | 3.77       |
      | Male       | 97430      | 3.696      |
      | Unknown    | 18409      | 3.727      |
      +------------+------------+------------+
  44. 44. Who Writes More – Men or Women?
      It takes a 3-way join to find out…
      > SELECT n.gender, round(avg(length(r.text))) AS review_length
        FROM dfs.demo.`yelp/review.json` r, mongo.yelp.users u, dfs.tmp.names n
        WHERE u.name = n.name AND r.user_id = u.user_id
        GROUP BY n.gender;
      +------------+---------------+
      | gender     | review_length |
      +------------+---------------+
      | Male       | 665           |
      | Female     | 730           |
      | Unknown    | 711           |
      +------------+---------------+
  45. 45. Drill Tweets (@ApacheDrill)
  46. 46. Thank You
      • Learn: incubator.apache.org/drill/
      • Download: incubator.apache.org/drill/download/
      • Ask questions: drill-user@incubator.apache.org
      • Contact me: tshiran@apache.org
  47. 47. Thank You
      Tomer Shiran, VP Product Management
      tshiran@mapr.com
      Social handles: @mapr, maprtech, MapRTechnologies, mapr-technologies
