Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of

•

17 likes•8,072 views

Study after study shows that data preparation and other data janitorial work consume 50-90% of most data scientists’ time. Apache Drill is a very promising tool which can help address this. Drill works with many different forms of “self describing data” and allows analysts to run ad-hoc queries in ANSI SQL against that data. Unlike HIVE or other SQL on Hadoop tools, Drill is not a wrapper for Map-Reduce and can scale to clusters of up to 10k nodes.

Data & Analytics

@OPENDATASCI
Apache Drill & Zeppelin: Two Promising
Tools You’ve Never Heard Of
@OPENDATASCI
OPEN
DATA
SCIENCE
CONFERENCE
Charles Givre
@cgivre, givre_charles@bah.com
S A N F R A N C I S C O | 2 0 1 5

Charles Givre @cgivre
givre_charles@bah.com

@OPENDATASCI
Drill is the first schema free SQL Engine

@OPENDATASCI
Drill allows you to query self-describing data
wherever it is, using standard SQL.

@OPENDATASCI
+( )+ =
SELECT <fields>
FROM dfs.strata.`file1.csv.` f1
JOIN dfs.strata.`file2.json` f2 ON
f1.id = f2.id
WHERE <something is true>
GROUP BY f1.field1
ORDER BY f2.name DESC

$@OPENDATASCI { “employee_id":1, "full_name":"Sheri Nowmer”, “first_name":"Sheri", “last_name”:”Nowmer", … "management_role":"Senior Management” }$

@OPENDATASCI
SELECT management_role, COUNT( employee_id ) as roleCount
FROM cp.`employee.json`
GROUP BY management_role
ORDER BY roleCount DESC

$@OPENDATASCI import requests import pandas as pd import json url = "http://localhost:8047/query.json" employee_query = """SELECT management_role, COUNT( employee_id ) as roleCount FROM cp.`employee.json` GROUP BY management_role ORDER BY roleCount DESC""" data = {"queryType" : "SQL", "query": employee_query } data_json = json.dumps(data) headers = {'Content-type': ‘application/json'} response = requests.post(url, data=data_json, headers=headers) df = pd.DataFrame( response.json()['rows'] ) print( df.head() )$

@OPENDATASCI
management_role roleCount
0 Store Full Time Staf 764
1 Store Temp Staff 264
2 Store Management 100
3 Middle Management 17
4 Senior Management 10
http://bit.ly/1MJZ7QX : Accessing Drill via REST
http://bit.ly/1GWbNIq: Accessing Drill via ODBC

@OPENDATASCI
Drill
SQL-on-Hadoop (Hive, Impala,
etc.)
Use case Self-service, in-situ, SQL-based analytics Data warehouse offload
Data sources
Hadoop, NoSQL, cloud storage (including multiple
instances)
A single Hadoop cluster
Data model Schema-free JSON (like MongoDB) Relational
User
experience
Point-and-query Ingest data → define schemas →
query
Deployment
model
Standalone service or co-located with Hadoop or
NoSQL
Co-located with Hadoop
Data
management
Self-service IT-driven
SQL ANSI SQL SQL-like
1.0 availability Q2 2015 Q2 2013 or earlier
https://drill.apache.org/faq/

@OPENDATASCI
A web based notebook that enables
interactive data analytics

@OPENDATASCI
Zeppelin makes exploring data easy

@OPENDATASCI
val bankText = sc.textFile("/Users/cgivre/Downloads/bank/bank-full.csv")
case class Bank(age:Integer, job:String, marital : String, education : String, balance :
Integer)
val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!=""age"").map(
s=>Bank(s(0).toInt,
s(1).replaceAll(""", ""),
s(2).replaceAll(""", ""),
s(3).replaceAll(""", ""),
s(5).replaceAll(""", "").toInt
)
)
bank.toDF().registerTempTable("bank")

@OPENDATASCI
%sql
SELECT marital, COUNT( 1 ) AS statusCount
FROM bank
GROUP BY marital
ORDER BY statusCount DESC

@OPENDATASCI
%sql
SELECT education, age, AVG( balance) as avgBalance
FROM bank
WHERE age > 20 AND age < 35
GROUP BY age, education
ORDER BY age ASC

@OPENDATASCI
%sql
SELECT education, age, AVG( balance) as avgBalance
FROM bank
WHERE age > ${minimumAge=30} AND age < ${maximumAge=35}
GROUP BY age, education
ORDER BY age ASC

@OPENDATASCI
%sql
SELECT age, count(1) value
FROM bank
WHERE marital="${marital=single,single|divorced|married}"
GROUP BY age
ORDER BY age

@OPENDATASCI
https://zeppelin.incubator.apache.org/

@OPENDATASCI
Questions?
givre_charles@bah.com
@cgivre

What's hot

Spark SQL versus Apache Drill: Different Tools with Different RulesDataWorks Summit/Hadoop Summit

Analyzing Real-World Data with Apache Drilltshiran

Have your Cake and Eat it Too - Architecture for Batch and Real-time processingDataWorks Summit

Fast Data Analytics with Spark and PythonBenjamin Bengfort

SQOOP - RDBMS to HadoopSofian Hadiwijaya

Introduction to the Hadoop EcoSystemShivaji Dutta

Hadoop User Group - Status Apache DrillMapR Technologies

Apache HAWQ ArchitectureAlexey Grishchenko

Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...DataWorks Summit

Hadoop and rdbms with sqoop Guy Harrison

M7 and Apache Drill, Micheal HausenblasModern Data Stack France

SQL-on-Hadoop with Apache DrillMapR Technologies

Application architectures with Hadoop – Big Data TechCon 2014hadooparchbook

Hive on spark is blazing fast or is it finalHortonworks

Cloudera ImpalaScott Leberknight

The Future of Hadoop: A deeper look at Apache SparkCloudera, Inc.

Understanding the Value and Architecture of Apache DrillDataWorks Summit

Tachyon and Apache Sparkrhatr

Transformation Processing Smackdown; Spark vs Hive vs PigLester Martin

Apache drillJakub Pieprzyk

What's hot (20)

Spark SQL versus Apache Drill: Different Tools with Different Rules

Analyzing Real-World Data with Apache Drill

Have your Cake and Eat it Too - Architecture for Batch and Real-time processing

Fast Data Analytics with Spark and Python

SQOOP - RDBMS to Hadoop

Introduction to the Hadoop EcoSystem

Hadoop User Group - Status Apache Drill

Apache HAWQ Architecture

Introduction to Apache Amaterasu (Incubating): CD Framework For Your Big Data...

Hadoop and rdbms with sqoop

M7 and Apache Drill, Micheal Hausenblas

SQL-on-Hadoop with Apache Drill

Application architectures with Hadoop – Big Data TechCon 2014

Hive on spark is blazing fast or is it final

Cloudera Impala

The Future of Hadoop: A deeper look at Apache Spark

Understanding the Value and Architecture of Apache Drill

Tachyon and Apache Spark

Transformation Processing Smackdown; Spark vs Hive vs Pig

Apache drill

Viewers also liked

Killing ETL with Apache DrillCharles Givre

Data Exploration with Apache Drill: Day 1Charles Givre

Apache Drill WorkshopCharles Givre

Data Exploration with Apache Drill: Day 2Charles Givre

What Does Your Smart Car Know About You? Strata London 2016Charles Givre

Strata NYC 2015 What does your smart device know about you?Charles Givre

Merlin: The Ultimate Data Science EnvironmentCharles Givre

Apache Drill - Why, What, Howmcsrivas

Large scale, interactive ad-hoc queries over different datastores with Apache...jaxLondonConference

Drill into Drill – How Providing Flexibility and Performance is PossibleMapR Technologies

Introduction to Apache Drill - interactive query and analysis at scaleMapR Technologies

RAPIM 2011Bp Nafri

NarkobaBp Nafri

Km 65 tahun 2002Bp Nafri

PSCOBp Nafri

Apache Storm - Minando redes sociales y medios en tiempo realAndrés Mauricio Palacios

RAKORNIS 2010Bp Nafri

Pristine Advisers PresentationPattyBaronowski

ISPS CodeBp Nafri

KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPALBeny Jackson Maliota

Viewers also liked (20)

Killing ETL with Apache Drill

Data Exploration with Apache Drill: Day 1

Apache Drill Workshop

Data Exploration with Apache Drill: Day 2

What Does Your Smart Car Know About You? Strata London 2016

Strata NYC 2015 What does your smart device know about you?

Merlin: The Ultimate Data Science Environment

Apache Drill - Why, What, How

Large scale, interactive ad-hoc queries over different datastores with Apache...

Drill into Drill – How Providing Flexibility and Performance is Possible

Introduction to Apache Drill - interactive query and analysis at scale

RAPIM 2011

Narkoba

Km 65 tahun 2002

PSCO

Apache Storm - Minando redes sociales y medios en tiempo real

RAKORNIS 2010

Pristine Advisers Presentation

ISPS Code

KELAIKLAUTAN KAPAL DAN DOKUMENTASI KAPAL

Similar to Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of

Twitter with hadoop for oowGwen (Chen) Shapira

Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...Lace Lofranco

Uotm workshopRavi Patel

Sql on everything with drillJulien Le Dem

Introduction to Big Data Analytics on Apache HadoopAvkash Chauhan

Mongodb - drupal dev daysPierre Joye

New-Age Search through Apache SolrEdureka!

Self-Service BI for big data applications using Apache Drill (Big Data Amster...Dataconomy Media

Self-Service BI for big data applications using Apache Drill (Big Data Amster...Mats Uddenfeldt

Get started with Microsoft SQL PolybaseHenk van der Valk

From oracle to hadoop with Sqoop and other toolsGuy Harrison

A glimpse of test automation in hadoop ecosystem by Deepika AcharyQA or the Highway

Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31Timothy Spann

Azure Data Lake Intro (SQLBits 2016)Michael Rys

How to Use Apache Zeppelin with HWX HDBHortonworks

MongoDB and Azure DatabricksMongoDB

Hadoop and Big Data: RevealedSachin Holla

Redshift IntroductionDataKitchen

Ignite Your Big Data With a Spark!Progress

Oracle hadoop let them talk together !Laurent Leturgez

Similar to Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of (20)

Twitter with hadoop for oow

Microsoft Ignite AU 2017 - Orchestrating Big Data Pipelines with Azure Data F...

Uotm workshop

Sql on everything with drill

Introduction to Big Data Analytics on Apache Hadoop

Mongodb - drupal dev days

New-Age Search through Apache Solr

Self-Service BI for big data applications using Apache Drill (Big Data Amster...

Get started with Microsoft SQL Polybase

From oracle to hadoop with Sqoop and other tools

A glimpse of test automation in hadoop ecosystem by Deepika Achary

Apache Deep Learning 101 - ApacheCon Montreal 2018 v0.31

Azure Data Lake Intro (SQLBits 2016)

How to Use Apache Zeppelin with HWX HDB

MongoDB and Azure Databricks

Hadoop and Big Data: Revealed

Redshift Introduction

Ignite Your Big Data With a Spark!

Oracle hadoop let them talk together !

Recently uploaded

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...limedy534

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptxthyngster

9654467111 Call Girls In Munirka Hotel And Home ServiceSapana Sha

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSINGmarianagonzalez07

Heart Disease Classification Report: A Data Analysis ProjectBoston Institute of Analytics

GA4 Without Cookies [Measure Camp AMS]📊 Markus Baersch

INTERNSHIP ON PURBASHA COMPOSITE TEX LTDRafezzaman

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdfSocial Samosa

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsappssapnasaifi408

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样vhwb25kk

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...soniya singh

Customer Service Analytics - Make Sense of All Your Data.pptxEmmanuel Dauda

How we prevented account sharing with MFAAndrei Kaleshka

办理学位证纽约大学毕业证(NYU毕业证书）原版一比一fhwihughh

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort

Top 5 Best Data Analytics Courses In Queensdataanalyticsqueen03

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024thyngster

Call Girls in Saket 99530🔝 56974 Escort Service9953056974 Low Rate Call Girls In Saket, Delhi NCR

Generative AI for Social Good at Open Data Science East 2024Colleen Farrelly

RABBIT: A CLI tool for identifying bots based on their GitHub events.natarajan8993

Recently uploaded (20)

Effects of Smartphone Addiction on the Academic Performances of Grades 9 to 1...

EMERCE - 2024 - AMSTERDAM - CROSS-PLATFORM TRACKING WITH GOOGLE ANALYTICS.pptx

9654467111 Call Girls In Munirka Hotel And Home Service

2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING

Heart Disease Classification Report: A Data Analysis Project

GA4 Without Cookies [Measure Camp AMS]

INTERNSHIP ON PURBASHA COMPOSITE TEX LTD

Kantar AI Summit- Under Embargo till Wednesday, 24th April 2024, 4 PM, IST.pdf

Beautiful Sapna Vip Call Girls Hauz Khas 9711199012 Call /Whatsapps

1:1定制(UQ毕业证）昆士兰大学毕业证成绩单修改留信学历认证原版一模一样

High Class Call Girls Noida Sector 39 Aarushi 🔝8264348440🔝 Independent Escort...

Customer Service Analytics - Make Sense of All Your Data.pptx

How we prevented account sharing with MFA

办理学位证纽约大学毕业证(NYU毕业证书）原版一比一

Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)

Top 5 Best Data Analytics Courses In Queens

Consent & Privacy Signals on Google *Pixels* - MeasureCamp Amsterdam 2024

Call Girls in Saket 99530🔝 56974 Escort Service

Generative AI for Social Good at Open Data Science East 2024

RABBIT: A CLI tool for identifying bots based on their GitHub events.

Apache Drill and Zeppelin: Two Promising Tools You've Never Heard Of

1. @OPENDATASCI Apache Drill & Zeppelin: Two Promising Tools You’ve Never Heard Of @OPENDATASCI OPEN DATA SCIENCE CONFERENCE Charles Givre @cgivre, givre_charles@bah.com S A N F R A N C I S C O | 2 0 1 5

2. @OPENDATASCI Who am I?

3. Charles Givre @cgivre givre_charles@bah.com

4. @OPENDATASCI Apache Zeppelin

5. @OPENDATASCI

6. @OPENDATASCI Drill is the first schema free SQL Engine

7. @OPENDATASCI Drill allows you to query self-describing data wherever it is, using standard SQL.

8. @OPENDATASCI Drill allows you to query self-describing data wherever it is, using standard SQL.

9. @OPENDATASCI Drill allows you to query self-describing data wherever it is, using standard SQL.

10. @OPENDATASCI Open Source

11. @OPENDATASCI Flexible and Extensible

12. @OPENDATASCI Scale

13. @OPENDATASCI + =

14. @OPENDATASCI +( )+ =

15. @OPENDATASCI +( )+ = SELECT <fields> FROM dfs.strata.`file1.csv.` f1 JOIN dfs.strata.`file2.json` f2 ON f1.id = f2.id WHERE <something is true> GROUP BY f1.field1 ORDER BY f2.name DESC

16. @OPENDATASCI ETL

17. @OPENDATASCI

18. @OPENDATASCI

19. @OPENDATASCI

20. @OPENDATASCI { “employee_id":1, "full_name":"Sheri Nowmer”, “first_name":"Sheri", “last_name”:”Nowmer", … "management_role":"Senior Management” }

21. @OPENDATASCI SELECT management_role, COUNT( employee_id ) as roleCount FROM cp.`employee.json` GROUP BY management_role ORDER BY roleCount DESC

22. @OPENDATASCI

23. @OPENDATASCI

24. @OPENDATASCI

25. @OPENDATASCI import requests import pandas as pd import json url = "http://localhost:8047/query.json" employee_query = """SELECT management_role, COUNT( employee_id ) as roleCount FROM cp.`employee.json` GROUP BY management_role ORDER BY roleCount DESC""" data = {"queryType" : "SQL", "query": employee_query } data_json = json.dumps(data) headers = {'Content-type': ‘application/json'} response = requests.post(url, data=data_json, headers=headers) df = pd.DataFrame( response.json()['rows'] ) print( df.head() )

26. @OPENDATASCI management_role roleCount 0 Store Full Time Staf 764 1 Store Temp Staff 264 2 Store Management 100 3 Middle Management 17 4 Senior Management 10 http://bit.ly/1MJZ7QX : Accessing Drill via REST http://bit.ly/1GWbNIq: Accessing Drill via ODBC

27. @OPENDATASCI Performance

28. @OPENDATASCI

29. @OPENDATASCI Drill SQL-on-Hadoop (Hive, Impala, etc.) Use case Self-service, in-situ, SQL-based analytics Data warehouse offload Data sources Hadoop, NoSQL, cloud storage (including multiple instances) A single Hadoop cluster Data model Schema-free JSON (like MongoDB) Relational User experience Point-and-query Ingest data → define schemas → query Deployment model Standalone service or co-located with Hadoop or NoSQL Co-located with Hadoop Data management Self-service IT-driven SQL ANSI SQL SQL-like 1.0 availability Q2 2015 Q2 2013 or earlier https://drill.apache.org/faq/

30. @OPENDATASCI https://drill.apache.org

31. @OPENDATASCI Apache Zeppelin

32. @OPENDATASCI What is it?

33.

34. @OPENDATASCI A web based notebook that enables interactive data analytics

35. @OPENDATASCI Built in integration with

36. @OPENDATASCI

37. @OPENDATASCI Zeppelin makes exploring data easy

38. @OPENDATASCI

39. @OPENDATASCI val bankText = sc.textFile("/Users/cgivre/Downloads/bank/bank-full.csv") case class Bank(age:Integer, job:String, marital : String, education : String, balance : Integer) val bank = bankText.map(s=>s.split(";")).filter(s=>s(0)!=""age"").map( s=>Bank(s(0).toInt, s(1).replaceAll(""", ""), s(2).replaceAll(""", ""), s(3).replaceAll(""", ""), s(5).replaceAll(""", "").toInt ) ) bank.toDF().registerTempTable("bank")

40. @OPENDATASCI %sql SELECT marital, COUNT( 1 ) AS statusCount FROM bank GROUP BY marital ORDER BY statusCount DESC

41. @OPENDATASCI

42. @OPENDATASCI

43. @OPENDATASCI

44. @OPENDATASCI

45. @OPENDATASCI %sql SELECT education, age, AVG( balance) as avgBalance FROM bank WHERE age > 20 AND age < 35 GROUP BY age, education ORDER BY age ASC

46. @OPENDATASCI

47. @OPENDATASCI

48. @OPENDATASCI

49. @OPENDATASCI %sql SELECT education, age, AVG( balance) as avgBalance FROM bank WHERE age > ${minimumAge=30} AND age < ${maximumAge=35} GROUP BY age, education ORDER BY age ASC

50. @OPENDATASCI

51. @OPENDATASCI

52. @OPENDATASCI %sql SELECT age, count(1) value FROM bank WHERE marital="${marital=single,single|divorced|married}" GROUP BY age ORDER BY age

53. @OPENDATASCI

54. @OPENDATASCI

55. @OPENDATASCI https://zeppelin.incubator.apache.org/

56. @OPENDATASCI Questions? givre_charles@bah.com @cgivre