Killing ETL with Apache Drill

The Extract-Transform-Load (ETL) process is one of the most time-consuming processes facing anyone who wishes to analyze data. Imagine if you could quickly, easily, and scalably merge and query data without having to spend hours on data prep. Well, you don't have to imagine it: you can, with Apache Drill. In this hands-on, interactive presentation, Mr. Givre will show you how to unleash the power of Apache Drill and explore your data without any kind of ETL process.

Killing ETL with Apache Drill

  1. 1. Killing ETL with Drill Charles S. Givre @cgivre cgivre@thedataist.com
  2. 2. The problems
  3. 3. We want SQL and BI support without compromising the flexibility and agility of schema-free datastores.
  4. 4. Data is not arranged in an optimal way for ad-hoc analysis
  5. 5. Data is not arranged in an optimal way for ad-hoc analysis ETL Data Warehouse
  6. 6. Analytics teams spend between 50% and 90% of their time preparing their data.
  7. 7. 76% of data scientists say this is the least enjoyable part of their job. http://visit.crowdflower.com/rs/416-ZBE-142/images/CrowdFlower_DataScienceReport_2016.pdf
  8. 8. The ETL Process consumes the most time and contributes almost no value to the end product.
  9. 9. ETL Data Warehouse
  10. 10. “Any sufficiently advanced technology is indistinguishable from magic” —Arthur C. Clarke
  11. 11. You just query the data… no schema
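  A minimal sketch of what that looks like in practice (the file path here is hypothetical): point Drill at a raw file and query it directly; there is no table definition, load step, or schema to register first.

      SELECT *
      FROM dfs.`/data/events.json`  -- raw JSON on disk, never loaded into anything
      LIMIT 10;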
  12. 12. Drill is NOT just SQL on Hadoop
  13. 13. Drill scales
  14. 14. Drill is open source Download Drill at: drill.apache.org
  15. 15. Why should you use Drill?
  16. 16. Why should you use Drill? Drill is easy to use
  17. 17. Drill is easy to use Drill uses standard ANSI SQL
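  Because the dialect is standard ANSI SQL, familiar constructs such as GROUP BY, HAVING, and ORDER BY work unchanged even against flat files. A sketch with a hypothetical CSV (Drill exposes headerless delimited files as the columns[] array, as in the demo later in this deck):

      SELECT columns[1] AS product,
             COUNT(*)   AS order_count
      FROM dfs.`/data/orders.csv`
      GROUP BY columns[1]
      HAVING COUNT(*) > 10
      ORDER BY order_count DESC;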
  18. 18. Drill is FAST!!
  19. 19. https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
  20. 20. https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
  21. 21. https://www.mapr.com/blog/comparing-sql-functions-and-performance-apache-spark-and-apache-drill
  22. 22. Quick Demo Thank you Jair Aguirre!!
  23. 23. Quick Demo seanlahman.com/baseball-archive/statistics
  24. 24. Quick Demo data = load '/user/cloudera/data/baseball_csv/Teams.csv' using PigStorage(','); filtered = filter data by ($0 == '1988'); tm_hr = foreach filtered generate (chararray) $40 as team, (int) $19 as hrs; ordered = order tm_hr by hrs desc; dump ordered; Execution Time: 1 minute, 38 seconds
  25. 25. Quick Demo SELECT columns[40], cast(columns[19] as int) AS HR FROM `baseball_csv/Teams.csv` WHERE columns[0] = '1988' ORDER BY HR desc; Execution Time: 0.232 seconds!!
  26. 26. Drill is Versatile
  27. 27. NoSQL, No Problem
  28. 28. NoSQL, No Problem https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json
  29. 29. NoSQL, No Problem https://raw.githubusercontent.com/mongodb/docs-assets/primer-dataset/primer-dataset.json SELECT t.address.zipcode AS zip, count(name) AS rests FROM `restaurants` t GROUP BY t.address.zipcode ORDER BY rests DESC LIMIT 10;
  30. 30. Querying Across Silos
  31. 31. Querying Across Silos Farmers Market Data Restaurant Data
  32. 32. Querying Across Silos SELECT t1.Borough, t1.markets, t2.rests, cast(t1.markets AS FLOAT)/ cast(t2.rests AS FLOAT) AS ratio FROM ( SELECT Borough, count(`Farmers Markets Name`) AS markets FROM `farmers_markets.csv` GROUP BY Borough ) t1 JOIN ( SELECT borough, count(name) AS rests FROM mongo.test.`restaurants` GROUP BY borough ) t2 ON t1.Borough=t2.borough ORDER BY ratio DESC;
  33. 33. Querying Across Silos Execution Time: 0.502 Seconds
  34. 34. To follow along, please download the files at: https://github.com/cgivre/drillworkshop
  35. 35. Querying Drill
  36. 36. Querying Drill SELECT DISTINCT management_role FROM cp.`employee.json`;
  37. 37. Querying Drill http://localhost:8047
  38. 38. Querying Drill SELECT * FROM cp.`employee.json` LIMIT 20
  39. 39. Querying Drill SELECT * FROM cp.`employee.json` LIMIT 20
  40. 40. Querying Drill SELECT <fields> FROM <table> WHERE <optional logical condition>
  41. 41. Querying Drill SELECT name, address, email FROM customerData WHERE age > 20
  42. 42. Querying Drill SELECT name, address, email FROM dfs.logs.`/data/customers.csv` WHERE age > 20
  43. 43. Querying Drill FROM dfs.logs.`/data/customers.csv` Storage Plugin Workspace Table
  44. 44. Querying Drill: supported storage plugins
  • cp: queries files in the Java classpath
  • dfs: file system; can connect to remote filesystems such as Hadoop
  • hbase: connects to HBase
  • hive: integrates Drill with the Apache Hive metastore
  • kudu: provides a connection to Apache Kudu
  • mongo: connects to MongoDB
  • RDBMS: provides a connection to relational databases such as MySQL, Postgres, Oracle and others
  • S3: provides a connection to an S3 cluster
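  To make the plugin list concrete, the following sketch reuses table references that appear on other slides in this deck and assumes the corresponding plugins (dfs with a logs workspace, mongo, and the bundled classpath files) are configured; only the FROM clause changes from one data source to the next:

      SELECT * FROM dfs.logs.`/data/customers.csv` LIMIT 5;  -- a file in a dfs workspace
      SELECT * FROM mongo.test.`restaurants` LIMIT 5;        -- a MongoDB collection
      SELECT * FROM cp.`employee.json` LIMIT 5;              -- a file on the Java classpath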
  45. 45. Problem: You have multiple log files which you would like to analyze
  46. 46. Problem: You have multiple log files which you would like to analyze • In the sample data files, there is a folder called ‘logs’ which contains the following structure:
  47. 47. SELECT * FROM dfs.drillworkshop.`logs/` LIMIT 10
  48. 48. SELECT * FROM dfs.drillworkshop.`logs/` LIMIT 10
  49. 49. dir<n> accesses the subdirectories
  50. 50. dir<n> accesses the subdirectories SELECT * FROM dfs.drilldata.`logs/` WHERE dir0 = '2013'
  51. 51. Directory Functions
  • MAXDIR(), MINDIR(): limit the query to the first or last directory
  • IMAXDIR(), IMINDIR(): limit the query to the first or last directory, in case-insensitive order
  Usage: WHERE dir<n> = MAXDIR('<plugin>.<workspace>', '<filename>')
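  As a sketch of how this can be used with the workshop's logs/ directories (assuming the year-named subdirectories seen in the earlier examples), the query below reads only the last directory by name without spelling it out:

      SELECT *
      FROM dfs.drillworkshop.`logs/`
      WHERE dir0 = MAXDIR('dfs.drillworkshop', 'logs');  -- last top-level subdirectory of logs/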
  52. 52. In Class Exercise: Find the total number of items sold by year and the total dollar sales in each year. HINT: Don’t forget to CAST() the fields to appropriate data types SELECT dir0 AS data_year, SUM( CAST( item_count AS INTEGER ) ) as total_items, SUM( CAST( amount_spent AS FLOAT ) ) as total_sales FROM dfs.drillworkshop.`logs/` GROUP BY dir0
  53. 53. Let’s look at JSON data
  54. 54. Let’s look at JSON data [ { "name": "Farley, Colette L.", "email": "iaculis@atarcu.ca", "DOB": "2011-08-14", "phone": "1-758-453-3833" }, { "name": "Kelley, Cherokee R.", "email": "ante.blandit@malesuadafringilla.edu", "DOB": "1992-09-01", "phone": "1-595-478-7825" } … ]
  55. 55. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  56. 56. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  57. 57. Let’s look at JSON data SELECT * FROM dfs.drillworkshop.`json/customers.json`
  58. 58. What about nested data?
  59. 59. Please open baltimore_salaries.json in a text editor
  60. 60. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  61. 61. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  62. 62. { "meta" : { "view" : { "id" : "nsfe-bg53", "name" : "Baltimore City Employee Salaries FY2015", "attribution" : "Mayor's Office", "averageRating" : 0, "category" : "City Government", … " "format" : { } }, }, "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", "55314.00", "53626.04" ] , [ 2, "31C7A2FE-60E6-4219-890B-AFF01C09EC65", 2, 1438255843, "393202", 1438255843, "393202", null, "Aaron,Petra L", "ASSISTANT STATE'S ATTORNEY", "A29045", "States Attorneys Office (045)", "2006-09-25T00:00:00", "74000.00", "73000.08" ]
  63. 63. "data" : [ [ 1, "66020CF9-8449-4464-AE61-B2292C7A0F2D", 1, 1438255843, "393202", 1438255843, “393202", null, "Aaron,Patricia G", "Facilities/Office Services II", "A03031", "OED-Employment Dev (031)", "1979-10-24T00:00:00", “55314.00", “53626.04" ]
  64. 64. Drill has a series of functions for nested data
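  Beyond FLATTEN, which the next slides cover in detail, Drill also has KVGEN, which turns a map with arbitrary keys into a list of key/value pairs that can then be flattened. A minimal sketch (the file name and the `attributes` map column here are hypothetical, not part of the workshop data):

      SELECT FLATTEN( KVGEN( t.attributes ) ) AS kv  -- one row per key/value pair in the map
      FROM dfs.drillworkshop.`json/nested_example.json` t;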
  65. 65. Let’s look at this data in Drill
  66. 66. Let’s look at this data in Drill SELECT * FROM dfs.drillworkshop.`baltimore_salaries.json`
  67. 67. Let’s look at this data in Drill SELECT * FROM dfs.drillworkshop.`baltimore_salaries.json`
  68. 68. Let’s look at this data in Drill SELECT data FROM dfs.drillworkshop.`baltimore_salaries.json`
  69. 69. FLATTEN( <json array> ) separates elements in a repeated field into individual records.
  70. 70. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  71. 71. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  72. 72. SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json`
  73. 73. SELECT raw_data[8] AS name … FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json` )
  74. 74. SELECT raw_data[8] AS name, raw_data[9] AS job_title FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`baltimore_salaries.json` )
  75. 75. SELECT raw_data[9] AS job_title, AVG( CAST( raw_data[13] AS DOUBLE ) ) AS avg_salary, COUNT( DISTINCT raw_data[8] ) AS person_count FROM ( SELECT FLATTEN( data ) AS raw_data FROM dfs.drillworkshop.`json/baltimore_salaries.json` ) GROUP BY raw_data[9] ORDER BY avg_salary DESC
  76. 76. Using the JSON file, recreate the earlier query to find the average salary by job title and how many people have each job title.
  77. 77. Log Files
  78. 78. Log Files
  • Drill does not natively support reading log files… yet
  • If you are NOT using Merlin, included in the GitHub repo are several .jar files. Please take a second and copy them to <drill directory>/jars/3rdparty
  79. 79. Log Files
  070823 21:00:32 1 Connect root@localhost on test1
  070823 21:00:48 1 Query show tables
  070823 21:00:56 1 Query select * from category
  070917 16:29:01 21 Query select * from location
  070917 16:29:12 21 Query select * from location where id = 1 LIMIT 1
  80. 80. log": { "type": "log", "extensions": [ "log" ], "fieldNames": [ "date", "time", "pid", "action", "query" ], "pattern": "(d{6})s(d{2}:d{2}:d{2})s+(d+)s(w+)s+(.+)" } }
  81. 81. SELECT * FROM dfs.drillworkshop.`log_files/mysql.log`
  82. 82. SELECT * FROM dfs.drillworkshop.`log_files/mysql.log`
  83. 83. HTTPD Log Files
  84. 84. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
  195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
  158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
  85. 85. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423
  195.154.46.135 - - [25/Oct/2015:04:11:25 +0100] "GET /linux/doing-pxe-without-dhcp-control HTTP/1.1" 200 24323 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  23.95.237.180 - - [25/Oct/2015:04:11:26 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  23.95.237.180 - - [25/Oct/2015:04:11:27 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 5.1; rv:35.0) Gecko/20100101 Firefox/35.0"
  158.222.5.157 - - [25/Oct/2015:04:24:31 +0100] "GET /join_form HTTP/1.0" 200 11114 "http://howto.basjes.nl/" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
  158.222.5.157 - - [25/Oct/2015:04:24:32 +0100] "POST /join_form HTTP/1.1" 302 9093 "http://howto.basjes.nl/join_form" "Mozilla/5.0 (Windows NT 6.3; WOW64; rv:34.0) Gecko/20100101 Firefox/34.0 AlexaToolbar/alxf-2.21"
  "httpd": { "type": "httpd", "logFormat": "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"", "timestampFormat": null },
  86. 86. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 SELECT * FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
  87. 87. HTTPD Log Files For “documentation”: https://issues.apache.org/jira/browse/DRILL-3423 SELECT * FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
  88. 88. HTTPD Log Files SELECT request_referer, parse_url( request_referer ) AS url_data FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
  89. 89. HTTPD Log Files SELECT request_referer, parse_url( request_referer ) AS url_data FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
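  Individual parts of the parsed URL can then be pulled out of the resulting map. A sketch, assuming the map returned by parse_url exposes a `host` key:

      SELECT t.url_data.`host` AS referer_host
      FROM (
        SELECT parse_url( request_referer ) AS url_data
        FROM dfs.drillworkshop.`data_files/log_files/small-server-log.httpd`
      ) AS t
      LIMIT 10;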
  90. 90. Networking Functions
  91. 91. Networking Functions
  • inet_aton( <ip> ): Converts an IPv4 address to an integer
  • inet_ntoa( <int> ): Converts an integer to an IPv4 address
  • is_private( <ip> ): Returns true if the IP is private
  • in_network( <ip>, <cidr> ): Returns true if the IP is in the CIDR block
  • getAddressCount( <cidr> ): Returns the number of IPs in a CIDR block
  • getBroadcastAddress( <cidr> ): Returns the broadcast address of a CIDR block
  • getNetmask( <cidr> ): Returns the netmask of a CIDR block
  • getLowAddress( <cidr> ): Returns the low IP of a CIDR block
  • getHighAddress( <cidr> ): Returns the high IP of a CIDR block
  • parse_user_agent( <ua_string> ): Returns a map of user agent information
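  A minimal sketch of these functions in use (both the file and the `client_ip` column below are hypothetical):

      SELECT client_ip,
             is_private( client_ip )                   AS is_private_ip,   -- true for RFC 1918 ranges
             in_network( client_ip, '195.154.0.0/16' ) AS in_cidr_block,   -- membership in a CIDR block
             inet_aton( client_ip )                    AS ip_as_int
      FROM dfs.`/data/connections.json`;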
  92. 92. PCAP Files
  93. 93. SELECT * FROM dfs.test.`dns-zone-transfer-ixfr.pcap`
  94. 94. SELECT * FROM dfs.test.`dns-zone-transfer-ixfr.pcap`
  95. 95. Connecting other Data Sources
  96. 96. Connecting other Data Sources
  97. 97. Connecting other Data Sources
  98. 98. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC
  99. 99. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC
  100. 100. Connecting other Data Sources SELECT teams.name, SUM( batting.HR ) as hr_total FROM batting INNER JOIN teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY batting.teamID ORDER BY hr_total DESC MySQL: 0.047 seconds
  101. 101. Connecting other Data Sources MySQL: 0.047 seconds Drill: 0.366 seconds SELECT teams.name, SUM( batting.HR ) as hr_total FROM mysql.stats.batting INNER JOIN mysql.stats.teams ON batting.teamID=teams.teamID WHERE batting.yearID = 1988 AND teams.yearID = 1988 GROUP BY teams.name ORDER BY hr_total DESC
  102. 102. Conclusion • Drill is easy to use • Drill scales • Drill is open source • Drill is versatile
  103. 103. Why aren’t you using Drill?
  104. 104. Thank you! Charles Givre @cgivre givre_charles@bah.com thedataist.com
