Apache Drill in the toolbox

Presented at Tokyo Apache Drill Meetup Vol.3, Mar 22, 2016
http://drill.connpass.com/event/27414/

Published in: Software

  1. 1. Apache Drill in the toolbox. Naoki Takezoe (@takezoen), BizReach, Inc.
  2. 2. A lot of JSON in the world ● Configuration ● Data ● Log
  3. 3. We want to query or analyze them. How?
  4. 4. Solutions for searching JSON
  5. 5. We ♥ SQL
  6. 6. What is Apache Drill?
      Schema-free SQL Query Engine for Hadoop, NoSQL and Cloud Storage
      ● Storage
        ○ Classpath, Local file system / HDFS / S3, HBase, Hive, MongoDB, JDBC
      ● File format
        ○ JSON, Parquet, CSV / TSV / PSV
  7. 7. Let's begin!!
  8. 8. Installation
      1. Download and extract the Drill distribution
      2. cd apache-drill-1.6.0/bin
      3. ./drill-embedded
      Web console: http://localhost:8047/
  9. 9. Query local JSON files
      {"name": "suzuki", "dept": "sales"}
      {"name": "yamada", "dept": "development"}
      {"name": "sato", "dept": "development"}
      ...
      SELECT * FROM dfs.`/tmp/users.json` T1 WHERE T1.name = 'takezoe'
  10. 10. Access to RDB tables
      Configure the jdbc storage plugin at the web console:
      {
        "type": "jdbc",
        "driver": "org.h2.Driver",
        "url": "jdbc:h2:~/.gitbucket/data",
        "username": "sa",
        "password": "sa",
        "enabled": true
      }
  11. 11. Join JSON and RDB
      SELECT
        T1.`user`.name AS name,
        T2.MAIL_ADDRESS AS mail
      FROM dfs.`/tmp/users.json` T1
      INNER JOIN h2.DATA.PUBLIC.ACCOUNT T2
        ON T1.`user`.name = T2.USER_NAME
  12. 12. Connect to Drill via JDBC
      We can use any JDBC frontend or BI tool with Drill JDBC.
      Requires ZooKeeper.
  13. 13. Connect to Drill via JDBC
      Set up ZooKeeper
      $ tar xvzf zookeeper-3.4.8.tar.gz
      $ cd zookeeper-3.4.8
      $ mv conf/zoo_sample.cfg conf/zoo.cfg
      $ cd bin
      $ ./zkServer.sh start
      Run drillbit
      $ cd apache-drill-1.6.0/bin
      $ ./drillbit.sh start
  14. 14. Connect to Drill via JDBC
      ● JDBC Driver
        ○ DRILL_HOME/jars/jdbc-driver/drill-jdbc-all-1.6.0.jar
      ● Class
        ○ org.apache.drill.jdbc.Driver
      ● URL
        ○ jdbc:drill:drillbit=localhost
  15. 15. Handling nested JSON
  16. 16. Query nested JSON
      {"user": {"name": "suzuki", "dept": "sales"}}
      {"user": {"name": "yamada", "dept": "development"}}
      {"user": {"name": "sato", "dept": "development"}}
      ...
      SELECT
        T.`user`.name AS name,
        T.`user`.dept AS dept
      FROM dfs.`/tmp/users.json` T
      WHERE T.`user`.name = 'yamada';
      Extract JSON property as column
  17. 17. Expand nested JSON property to records
      {"user": {
        "name": "yamada",
        "experience": [
          {"lang": "Java"},
          {"lang": "Scala"}
        ]
      }}
      SELECT
        T2.name AS name,
        T2.experience.lang AS lang
      FROM (
        SELECT
          T1.`user`.name AS name,
          FLATTEN(T1.`user`.experience) AS experience
        FROM dfs.`/tmp/users.json` T1
      ) T2
      Expand nested array as individual table
  18. 18. In the case of jq
      $ cat users.json | jq '.user | select(.name == "yamada")'
      Nested JSON in Drill brings complexity. Maybe jq is better for simple queries?
  19. 19. Use cases
  20. 20. Action log ● Store action logs in a local file as JSON ● We can query them using Drill when necessary
  21. 21. Data warehouse ● Aggregate various data sources in Drill ● No data synchronization is needed
  22. 22. e.g. Access Elasticsearch through Hive ● elasticsearch-hadoop supports Hive ● Drill supports Hive
      http://takezoe.hatenablog.com/entry/20150524/p1
      Can we access Elasticsearch from Drill?
  23. 23. Conclusion
  24. 24. Conclusion Apache Drill is ● a good tool for querying various datasets ● easy to set up and user-friendly ● usable without upfront investment ● useful for small data, not only big data. Put Apache Drill into your toolbox!
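
Slide 8 starts embedded Drill with its web console on http://localhost:8047/. That port also serves Drill's REST API, so the query from slide 9 could be submitted from a script instead of the console. The following is a minimal sketch using only the Python standard library. The helper names `build_request` and `run_query` are made up for illustration, and it assumes an embedded Drill is running locally and that `/tmp/users.json` exists.

```python
import json
import urllib.request

# Assumption: embedded Drill from slide 8 is listening on its default web port.
DRILL_URL = "http://localhost:8047/query.json"

# The query shown on slide 9.
SLIDE_9_QUERY = "SELECT * FROM dfs.`/tmp/users.json` T1 WHERE T1.name = 'takezoe'"


def build_request(sql, url=DRILL_URL):
    """Build (but do not send) a POST request for Drill's REST query endpoint."""
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, url=DRILL_URL):
    """Send the query to Drill and return the "rows" list from the JSON response."""
    with urllib.request.urlopen(build_request(sql, url)) as resp:
        return json.load(resp)["rows"]
```

With Drill running, `run_query(SLIDE_9_QUERY)` should return one dictionary per matching record. Splitting request construction from sending keeps the sketch easy to check without a server.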
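
To make the FLATTEN step on slide 17 concrete, here is a plain-Python illustration of the result it produces: one output row per element of the nested `experience` array. This only mirrors the semantics on the slide's sample document; it is not how Drill implements FLATTEN, and the function name `flatten_experience` is hypothetical.

```python
import json

# Sample document from slide 17, written as a single JSON line.
USERS_JSON = (
    '{"user": {"name": "yamada", '
    '"experience": [{"lang": "Java"}, {"lang": "Scala"}]}}'
)


def flatten_experience(lines):
    """Yield one {"name", "lang"} row per element of user.experience."""
    for line in lines:
        user = json.loads(line)["user"]
        for exp in user.get("experience", []):
            yield {"name": user["name"], "lang": exp["lang"]}


rows = list(flatten_experience([USERS_JSON]))
print(rows)
# → [{'name': 'yamada', 'lang': 'Java'}, {'name': 'yamada', 'lang': 'Scala'}]
```

The inner loop plays the role of FLATTEN in the subquery, and the dictionary built per element corresponds to the outer SELECT that picks `name` and `experience.lang`.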
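
The action log use case on slide 20 only needs the application to append one JSON object per line to a local file, the same layout as the sample data on slide 9. A minimal sketch follows; the field names `ts`, `user`, and `action`, the helper `log_action`, and the example path are assumptions for illustration, not something the slides specify.

```python
import json
import time


def log_action(path, user, action, now=None):
    """Append one action record to `path` as a single JSON line."""
    record = {
        # Hypothetical fields: epoch seconds, user name, and action name.
        "ts": int(now if now is not None else time.time()),
        "user": user,
        "action": action,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
    return record
```

After calls such as `log_action("/tmp/actions.json", "yamada", "login")`, the file could be queried in the style of the earlier slides, for example ``SELECT T.`user`, COUNT(*) FROM dfs.`/tmp/actions.json` T GROUP BY T.`user` ``. The backticks around `user` follow the quoting used on slide 16.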
