Time-series data analysis and persistence with Druid (IoT, clickstream analytics, ...)
Big Data Developers in Madrid
Raúl Marín
Solutions Engineering @ Hortonworks
May 10th, 2018
© Hortonworks Inc. 2011 – 2018. All Rights Reserved

Agenda
▪ What’s Druid?
▪ What does Druid look like?
▪ How is data modeled in Druid?
▪ How to query data on Druid
▪ Demo

What’s Druid? An overview
⬢ Development started in 2011 at Metamarkets; open-sourced in late 2012
⬢ Initial use case: interactive ad analytics
⬢ 150+ contributors
⬢ Main features:
– Column-oriented distributed data store
– Scales to petabytes of data and thousands of concurrent users
– Batch and real-time ingestion
– Sub-second response for arbitrary time-based slice-and-dice:
• Data partitioned by time
• Automatic data summarization (roll-up)
• Approximate algorithms (HyperLogLog, theta sketches)
• Automatic indexing on load
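
The "automatic data summarization" feature can be pictured as a roll-up at ingestion time: raw events that share a time bucket and the same dimension values collapse into a single pre-aggregated row. The following is only a minimal sketch of that idea in plain Python, not Druid code; the hourly granularity, the `rollup` helper and the `page`/`bytes` columns are assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime


def rollup(events, dimensions, metric):
    """Aggregate raw events into one row per (hour, dimension values)."""
    totals = defaultdict(lambda: {"count": 0, "sum": 0})
    for event in events:
        # Truncate the timestamp to the hour: the assumed roll-up granularity.
        ts = datetime.fromisoformat(event["timestamp"])
        bucket = ts.replace(minute=0, second=0, microsecond=0)
        key = (bucket.isoformat(),) + tuple(event[d] for d in dimensions)
        totals[key]["count"] += 1
        totals[key]["sum"] += event[metric]
    return dict(totals)


# Hypothetical clickstream events.
events = [
    {"timestamp": "2018-05-10T10:01:00", "page": "home", "bytes": 100},
    {"timestamp": "2018-05-10T10:20:00", "page": "home", "bytes": 50},
    {"timestamp": "2018-05-10T11:05:00", "page": "home", "bytes": 70},
]
summary = rollup(events, ["page"], "bytes")
```

Three raw events become two stored rows here. The trade-off is that individual events can no longer be recovered below the chosen granularity.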

Where does Druid shine?
⬢ Interactive and exploratory analytics on event data
⬢ Suitable for BI/OLAP workloads that demand interactive visualization of complex data streams:
– Real-time bidding events
– User activity streams
– Voice call logs
– Network traffic flows
– Firewall events
– Application performance metrics
⬢ Querying event data at large scale poses multiple challenges:
– Window joins are not guaranteed
– Events may be duplicated

High Level Druid Architecture
[Figure: architecture diagram. Labels: HDP; Streaming Data (Kafka, Storm, Spark, API (Twitter), etc.); Batch Data; Realtime Nodes; Handoff; Deep Storage (HDFS (HDP) or S3); Historical Nodes; Broker Node; Queries.]

Broker Nodes
⬢ Scatter each query across historical and realtime nodes
– (Clients issue queries to the broker, but the queries are processed elsewhere.)
⬢ Merge the results returned by the different nodes
⬢ Provide a (distributed) caching layer
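
The scatter and merge roles of the broker can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not Druid's implementation: `make_node` and `scatter_gather` are hypothetical helpers, each "node" is just a function over rows it holds locally, and caching is left out.

```python
from collections import Counter


def make_node(rows):
    """Hypothetical data node that answers queries over its local rows only."""
    def run(query):
        partial = Counter()
        for row in rows:
            if row["country"] == query["country"]:
                partial[row["page"]] += row["views"]
        return partial
    return run


def scatter_gather(nodes, query):
    """Send the query to every node, then merge the partial results."""
    merged = Counter()
    for node in nodes:
        merged.update(node(query))  # Counter.update adds counts together
    return dict(merged)


historical = make_node([
    {"country": "ES", "page": "home", "views": 10},
    {"country": "FR", "page": "home", "views": 7},
])
realtime = make_node([
    {"country": "ES", "page": "home", "views": 2},
    {"country": "ES", "page": "cart", "views": 1},
])
result = scatter_gather([historical, realtime], {"country": "ES"})
```

The broker itself holds no data in this sketch: it only fans the query out and combines what comes back.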

Realtime Nodes
⬢ Ingest streams of data
⬢ Support both push- and pull-based ingestion
⬢ Store data in a write-optimized structure
⬢ Periodically convert the write-optimized structure to read-optimized segments
⬢ Events are queryable as soon as they are ingested
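
One way to picture the write-optimized to read-optimized conversion is an append-only buffer that is periodically frozen into sorted, column-wise snapshots, with queries reading both. The class below is a toy model built on that assumption; `RealtimeStore` and its methods are hypothetical names and do not reflect Druid's internal data structures.

```python
class RealtimeStore:
    """Toy model: an append-only buffer plus immutable columnar snapshots."""

    def __init__(self):
        self.buffer = []    # write-optimized: rows appended as they arrive
        self.segments = []  # read-optimized: one dict of column tuples each

    def ingest(self, row):
        self.buffer.append(row)

    def persist(self):
        """Freeze buffered rows into a time-sorted, column-wise snapshot."""
        if not self.buffer:
            return
        rows = sorted(self.buffer, key=lambda r: r["timestamp"])
        columns = {name: tuple(r[name] for r in rows) for name in rows[0]}
        self.segments.append(columns)
        self.buffer = []

    def total(self, metric):
        """Queries see persisted snapshots and not-yet-persisted rows alike."""
        persisted = sum(sum(seg[metric]) for seg in self.segments)
        pending = sum(r[metric] for r in self.buffer)
        return persisted + pending


store = RealtimeStore()
store.ingest({"timestamp": "2018-05-10T10:00:02", "clicks": 2})
store.ingest({"timestamp": "2018-05-10T10:00:01", "clicks": 3})
store.persist()
store.ingest({"timestamp": "2018-05-10T10:00:03", "clicks": 5})
```

The last event is counted by `total` even though it has not been persisted yet, which mirrors the "queryable as soon as ingested" point.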

Historical Nodes
⬢ Shared-nothing architecture
⬢ Main workhorses of a Druid cluster
⬢ Load immutable, read-optimized segments
⬢ Respond to queries
⬢ Use memory-mapped files to load segments
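
Memory mapping lets a process address a file as if it were memory, so the operating system pages in only the parts that are actually read. The sketch below shows the general technique with Python's standard `mmap` module on a made-up fixed-width column file; the file layout and helper names are assumptions and have nothing to do with Druid's real segment format.

```python
import mmap
import os
import struct
import tempfile


def write_column(path, values):
    """Write integers as fixed-width little-endian 64-bit values."""
    with open(path, "wb") as f:
        for value in values:
            f.write(struct.pack("<q", value))


def read_value(path, index):
    """Memory-map the file and decode only the requested value."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return struct.unpack_from("<q", mapped, index * 8)[0]


# Hypothetical metric column stored in a temporary directory.
tmp_dir = tempfile.mkdtemp()
column_path = os.path.join(tmp_dir, "metric.col")
write_column(column_path, [10, 20, 30, 40])
third = read_value(column_path, 2)
```

Because the values are fixed-width, the offset of any row is a simple multiplication, and no other part of the file has to be read.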

Coordinator Nodes
⬢ Assign segments to historical nodes
⬢ Use an interval-based cost function to distribute segments
⬢ Keep query load uniform across historical nodes
⬢ Handle replication of data
⬢ Apply configurable rules to load/drop data
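
The load/drop rules can be thought of as an ordered list in which the first rule that matches a segment decides its fate. The sketch below models that with simplified stand-in rule objects; the `action`, `max_age_days` and `replicas` fields are hypothetical and are not Druid's actual rule JSON.

```python
from datetime import datetime, timedelta

# Simplified stand-in rules, evaluated top to bottom (not Druid's rule syntax).
RULES = [
    {"action": "load", "max_age_days": 30, "replicas": 2},  # keep the last 30 days
    {"action": "drop", "max_age_days": None},               # catch-all for the rest
]


def decide(segment_end, now, rules):
    """Return the first rule whose age limit covers the segment."""
    age = now - segment_end
    for rule in rules:
        limit = rule["max_age_days"]
        if limit is None or age <= timedelta(days=limit):
            return rule
    return None


now = datetime(2018, 5, 10)
recent = decide(datetime(2018, 5, 1), now, RULES)
old = decide(datetime(2018, 1, 1), now, RULES)
```

With these assumed rules, a segment ending on May 1st is loaded with two replicas, while one ending on January 1st falls through to the drop rule.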

Indexing service
⬢ A highly available, distributed service for indexing tasks
⬢ Indexing is performed by related components:
– Overlord
– Middle Managers
– Peons
⬢ The index definition is specified in a JSON file and submitted to the Overlord.
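
As a rough illustration of "a JSON definition submitted to the Overlord", the sketch below assembles a trimmed task definition and wraps it in an HTTP POST aimed at the Overlord's task endpoint. Several things are assumptions: the hostname is a placeholder, port 8090 is the usual default and may differ per deployment, the `bytes` column is invented, and the spec is deliberately incomplete (a real task also needs its input and tuning sections). The request is built but not sent.

```python
import json
import urllib.request

# Placeholder host; 8090 is assumed as the Overlord port.
OVERLORD_URL = "http://druid.overlord.hostname:8090/druid/indexer/v1/task"

# Trimmed, illustrative task definition: only the data schema part is shown.
task = {
    "type": "index",
    "spec": {
        "dataSchema": {
            "dataSource": "wikiticker",
            "granularitySpec": {
                "type": "uniform",
                "segmentGranularity": "DAY",
                "queryGranularity": "HOUR",
            },
            "metricsSpec": [
                {"type": "count", "name": "count"},
                {"type": "longSum", "name": "bytes", "fieldName": "bytes"},
            ],
        }
    },
}


def build_submission(task_definition):
    """Wrap a task definition in a POST request (not sent here)."""
    return urllib.request.Request(
        OVERLORD_URL,
        data=json.dumps(task_definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


submission = build_submission(task)
# Against a live cluster: urllib.request.urlopen(submission).read()
```

The segment and query granularities in the schema are where time partitioning and roll-up are declared.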

Data Storage in Druid
⬢ Data is organized in data sources, the top-level abstraction, equivalent to tables
⬢ Within a data source, data is stored in segments
⬢ Segments are partitioned by time and, optionally, by some dimensions
⬢ Segment size matters for avoiding resource contention (~GBs)
[Figure: a time axis divided into daily segments. Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday.]
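
Time partitioning pays off at query time: a query over a given interval only needs the segments whose time range overlaps it. The sketch below shows the idea with daily buckets in plain Python; `partition_by_day` and `segments_for_interval` are hypothetical helpers, and real segments also carry versioning and sharding details that are ignored here.

```python
from collections import defaultdict
from datetime import datetime


def partition_by_day(events):
    """Group events into one 'segment' per calendar day."""
    segments = defaultdict(list)
    for event in events:
        day = datetime.fromisoformat(event["timestamp"]).date().isoformat()
        segments[day].append(event)
    return dict(segments)


def segments_for_interval(segments, start_day, end_day):
    """Return the days that must be scanned for [start_day, end_day]."""
    # ISO dates sort lexicographically, so string comparison is enough here.
    return sorted(day for day in segments if start_day <= day <= end_day)


events = [
    {"timestamp": "2018-05-07T09:00:00", "value": 1},
    {"timestamp": "2018-05-08T12:30:00", "value": 2},
    {"timestamp": "2018-05-08T18:45:00", "value": 3},
    {"timestamp": "2018-05-11T08:15:00", "value": 4},
]
segments = partition_by_day(events)
to_scan = segments_for_interval(segments, "2018-05-08", "2018-05-09")
```

Only one of the three daily segments is touched for this two-day interval; the others are skipped without being read.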

Segment Data Structure
⬢ A segment contains:
– A timestamp column
– One or more dimension columns
– One or more metric columns
– Indexes to facilitate fast lookups and aggregations
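
A common way to index a dimension column is to dictionary-encode its values and keep one bitmap per value marking the rows where it occurs; filters then become bitmap lookups, and metrics are summed only for the matching rows. The sketch below illustrates that general technique with Python integers standing in for bitmaps. The column names and helpers are assumptions, and production systems use compressed bitmap formats rather than plain integers.

```python
def build_dimension_index(values):
    """Dictionary-encode a column and build one bitmap per distinct value."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    encoded = [dictionary[v] for v in values]
    bitmaps = {v: 0 for v in dictionary}
    for row, value in enumerate(values):
        bitmaps[value] |= 1 << row  # set the bit for this row number
    return dictionary, encoded, bitmaps


def rows_matching(bitmap):
    """List the row numbers whose bit is set."""
    return [row for row in range(bitmap.bit_length()) if (bitmap >> row) & 1]


# Hypothetical segment with one dimension column and one metric column.
pages = ["home", "cart", "home", "search"]
clicks = [3, 1, 4, 2]

dictionary, encoded, bitmaps = build_dimension_index(pages)
home_rows = rows_matching(bitmaps["home"])
home_clicks = sum(clicks[row] for row in home_rows)

# Filters on several values combine with bitwise OR / AND on the bitmaps.
home_or_cart = rows_matching(bitmaps["home"] | bitmaps["cart"])
```

The filter never scans the `pages` column itself: it reads the bitmap for "home" and jumps straight to rows 0 and 2 of the metric column.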

Typical queries and operators available on Druid
⬢ Time-based (time series)
⬢ Filters (search/select)
⬢ Group by
⬢ Top N: equivalent to a group by plus an ordering over one dimension (results are approximated for efficiency if there are more than 1000 dimension values)
⬢ Time boundary: earliest and latest data points of a data set
⬢ Granularity (roll-up)
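
To make the Top N query type concrete, the sketch below assembles a native topN query as a Python dictionary and serializes it to JSON. The `wikiticker` data source name comes from this deck, but the `page` dimension, the `count` metric, the interval and the `top_n_query` helper are assumptions chosen for illustration.

```python
import json


def top_n_query(datasource, dimension, metric, threshold, interval):
    """Build a native topN query: the top `threshold` values of one dimension."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "metric": metric,          # the aggregated value to rank by
        "threshold": threshold,    # how many rows to return
        "granularity": "all",      # one result bucket for the whole interval
        "aggregations": [
            {"type": "longSum", "name": metric, "fieldName": metric}
        ],
        "intervals": [interval],
    }


# Column names and interval are assumed, not taken from the deck.
query = top_n_query("wikiticker", "page", "count", 5, "2018-05-01/2018-05-08")
body = json.dumps(query, indent=2)
```

In SQL terms this roughly corresponds to grouping by `page`, summing `count`, ordering by that sum in descending order and keeping five rows.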

Queries and results are expressed in JSON and exchanged over an HTTP REST API
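
The JSON-over-HTTP exchange can be sketched as a POST of a query document to the broker followed by parsing of the JSON array that comes back. In the sketch below the hostname is the placeholder used elsewhere in this deck, the query columns are assumptions, and `sample_response` is a hand-written stand-in that only mimics the shape of a timeseries result; it is not real output. The request is built but not sent.

```python
import json
import urllib.request

# Placeholder broker address; /druid/v2/ is the native query endpoint.
BROKER_URL = "http://druid.broker.hostname:8082/druid/v2/"


def build_request(query):
    """Wrap a native query in a POST request (not sent here)."""
    return urllib.request.Request(
        BROKER_URL,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def counts_by_timestamp(response_text):
    """Map each result bucket's timestamp to its 'count' aggregation."""
    rows = json.loads(response_text)
    return {row["timestamp"]: row["result"]["count"] for row in rows}


query = {
    "queryType": "timeseries",
    "dataSource": "wikiticker",
    "granularity": "hour",
    "aggregations": [{"type": "count", "name": "count"}],
    "intervals": ["2018-05-10/2018-05-11"],
}
request = build_request(query)
# Against a live broker: urllib.request.urlopen(request).read()

# Hand-written placeholder response, for illustrating the parsing step only.
sample_response = '[{"timestamp": "2018-05-10T00:00:00.000Z", "result": {"count": 42}}]'
counts = counts_by_timestamp(sample_response)
```

Any HTTP client can play this role, which is why dashboards and BI tools can talk to the broker directly.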
Superset - BI Dashboarding fully integrated with Druid

Query Druid from Hive with SQL (BI tools integration)
⬢ Druid supports SQL natively (experimental feature), so Hive integration is preferred
⬢ Point Hive at the broker; the broker node endpoint is specified as a Hive configuration parameter:
SET hive.druid.broker.address.default=druid.broker.hostname:8082;
⬢ Simple CREATE EXTERNAL TABLE statement:
CREATE EXTERNAL TABLE druid_wikiticker                         -- Hive table name
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'  -- Hive storage handler classname
TBLPROPERTIES ("druid.datasource" = "wikiticker");             -- Druid data source name
⬢ Automatic Druid data schema discovery via a segment metadata query
DEMO

Time-series data analysis and persistence with Druid

  • 1. 1 © Hortonworks Inc. 2011 – 2018. All Rights Reserved Big Data Developers in Madrid > Time-series data analysis and persistence with Druid (IoT, clickstream analytics, ...) Raúl Marín Solutions Engineering @ Hortonworks May 10th, 2018
  • 2. 2 © Hortonworks Inc. 2011 – 2018. All Rights Reserved ▪ What’s Druid? ▪ What does Druid look like? ▪ How is data modeled in Druid? ▪ How to query data on Druid ▪ Demo Agenda
  • 3. 3 © Hortonworks Inc. 2011 – 2018. All Rights Reserved ▪ What’s Druid? ▪ What does Druid look like? ▪ How is data modeled in Druid? ▪ How to query data on Druid ▪ Demo Agenda
  • 4. 4 © Hortonworks Inc. 2011 – 2018. All Rights Reserved ⬢ Development starts in 2011 at Metamarkets, open-sourced in late 2012 ⬢ Initial use case: interactive ad-analytics ⬢ +150 contributors ⬢ Main features: – Columns-oriented distributed data store – Scalable to PBs & 1000s concurrent users – Batch & Real-time ingestion – Sub-second response for arbitrary time-based slice-and-dice: • Data partition by time dimension • Automatic data summarization • Approximate algorithms (hyperLogLog, theta) • Auto-indexing on load What’s Druid? An overview
  • 5. Where does Druid shine?
       ⬢ Interactive and exploratory analytics on event data
       ⬢ Suitable for BI/OLAP workloads that demand interactive visualization of complex data streams:
         – Real-time bidding events
         – User activity streams
         – Voice call logs
         – Network traffic flows
         – Firewall events
         – Application performance metrics
       ⬢ Querying event data at large scale poses multiple challenges:
         – Window joining not guaranteed
         – Potentially duplicated events
  • 7. High Level Druid Architecture
       [Diagram labels: Streaming Data (Kafka, Storm, Spark, API (Twitter), etc.), Batch Data, Realtime Nodes, Handoff, Deep Storage (HDFS (HDP) or S3), Historical Nodes, Broker Node, Queries, HDP]
  • 8. Broker Nodes
       ⬢ Scatter each query across historical and realtime nodes
         – Clients issue queries to this node, but queries are processed elsewhere
       ⬢ Merge the results from the different query nodes
       ⬢ (Distributed) caching layer
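The scatter and merge roles on slide 8 can be sketched in a few lines of Python. This is a conceptual illustration only: the "nodes" are plain lists of rows, the query is a per-value row count, and the function names are hypothetical rather than part of any Druid API.

```python
from collections import Counter


def query_node(segment_rows, dimension):
    """Stand-in for one data node: count rows per dimension value."""
    return Counter(row[dimension] for row in segment_rows)


def broker_query(nodes, dimension):
    """Scatter the query to every node, then merge the partial counts."""
    merged = Counter()
    for segment_rows in nodes:  # scatter (sequential here for simplicity)
        merged.update(query_node(segment_rows, dimension))  # gather and merge
    return dict(merged)


# Hypothetical data held by one historical node and one realtime node.
historical = [{"page": "home"}, {"page": "docs"}]
realtime = [{"page": "home"}]
result = broker_query([historical, realtime], "page")
```

The broker holds no data of its own in this sketch. It only combines partial results, which matches the slide's note that queries are processed elsewhere.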
  • 9. Realtime Nodes
       ⬢ Ingest streams of data
       ⬢ Support both push-based and pull-based ingestion
       ⬢ Store data in a write-optimized structure
       ⬢ Periodically convert the write-optimized structure into read-optimized segments
       ⬢ Events are queryable as soon as they are ingested
  • 10. Historical Nodes
       ⬢ Shared-nothing architecture
       ⬢ Main workhorses of a Druid cluster
       ⬢ Load immutable, read-optimized segments
       ⬢ Respond to queries
       ⬢ Use memory-mapped files to load segments
  • 11. Coordinator Nodes
       ⬢ Assign segments to historical nodes
       ⬢ Use an interval-based cost function to distribute segments
       ⬢ Make sure query load is uniform across historical nodes
       ⬢ Handle replication of data
       ⬢ Apply configurable rules to load/drop data
  • 12. Indexing service
       ⬢ A highly available, distributed service for indexing tasks
       ⬢ Indexing is performed by related components:
         – Overlord
         – Middle Managers
         – Peons
       ⬢ The index definition is specified in a JSON file and submitted to the Overlord
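To make the "JSON file submitted to the Overlord" step on slide 12 concrete, here is a hedged Python sketch that builds a batch index task. The field layout follows the native batch task format of Druid releases from around the time of this talk and may differ in other versions, so check it against the documentation for your release. The column names (`time`, `page`, `user`), the interval, and the file location are assumptions. The resulting JSON would be sent with an HTTP POST to the Overlord's `/druid/indexer/v1/task` endpoint.

```python
import json


def index_task(datasource, base_dir, file_filter):
    """Build a batch index task definition as a plain dict."""
    return {
        "type": "index",
        "spec": {
            "dataSchema": {
                "dataSource": datasource,
                "parser": {
                    "type": "string",
                    "parseSpec": {
                        "format": "json",
                        # Assumed column names, for illustration only.
                        "timestampSpec": {"column": "time", "format": "iso"},
                        "dimensionsSpec": {"dimensions": ["page", "user"]},
                    },
                },
                "metricsSpec": [{"type": "count", "name": "count"}],
                "granularitySpec": {
                    "type": "uniform",
                    "segmentGranularity": "day",   # one segment per day
                    "queryGranularity": "hour",    # summarize rows per hour
                    "intervals": ["2015-09-12/2015-09-13"],
                },
            },
            "ioConfig": {
                "type": "index",
                "firehose": {
                    "type": "local",
                    "baseDir": base_dir,
                    "filter": file_filter,
                },
            },
        },
    }


task = index_task("wikiticker", "/tmp/data", "*.json")
task_json = json.dumps(task, indent=2)  # body of the POST to the Overlord
```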
  • 14. Data Storage in Druid
       ⬢ Data is organized in Data Sources, the top-level abstraction, equivalent to tables
       ⬢ Within a Data Source, data is stored in Segments
       ⬢ Segments are partitioned by time and, optionally, by some dimensions
       ⬢ Segment size matters in order to avoid resource contention (segments are typically in the GB range)
       [Diagram: a time axis split into Segment 1: Monday, Segment 2: Tuesday, Segment 3: Wednesday, Segment 4: Thursday, Segment 5: Friday]
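One consequence of the time partitioning shown on slide 14 is that a query only needs the segments whose interval overlaps the queried time range. A small Python sketch of that pruning, assuming day-sized segments and half-open intervals (the helper names are hypothetical):

```python
from datetime import date, timedelta


def day_segments(start, days):
    """Return (start, end) half-open day intervals, one per segment."""
    return [
        (start + timedelta(days=d), start + timedelta(days=d + 1))
        for d in range(days)
    ]


def segments_for_query(segments, q_start, q_end):
    """Keep only segments whose interval overlaps [q_start, q_end)."""
    return [s for s in segments if s[0] < q_end and q_start < s[1]]


# Monday 2018-05-07 to Friday 2018-05-11, as in the slide's diagram.
week = day_segments(date(2018, 5, 7), 5)
# A query over Tuesday and Wednesday touches two of the five segments.
hits = segments_for_query(week, date(2018, 5, 8), date(2018, 5, 10))
```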
  • 15. Segment Data Structure
       ⬢ A segment contains:
         – A timestamp column
         – One or more dimension columns
         – One or more metric columns
         – Indexes to facilitate fast lookups and aggregations
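The slide does not say what the indexes look like. Druid's documentation describes dictionary-encoded string dimensions with one bitmap per distinct value, and the Python sketch below illustrates that idea using integers as bitmaps. It is a simplified model for intuition, not Druid's on-disk format, and the sample columns are made up.

```python
def build_dimension_index(values):
    """Dictionary-encode a dimension column and build one bitmap per value."""
    dictionary = {v: i for i, v in enumerate(sorted(set(values)))}
    encoded = [dictionary[v] for v in values]
    bitmaps = {v: 0 for v in dictionary}
    for row, v in enumerate(values):
        bitmaps[v] |= 1 << row  # bit `row` set means the row has this value
    return dictionary, encoded, bitmaps


def rows_matching(bitmap):
    """Decode a bitmap (a Python int) into the list of matching row numbers."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Two hypothetical dimension columns over the same four rows.
country = ["ES", "US", "ES", "FR"]
device = ["mobile", "mobile", "desktop", "mobile"]
_, _, country_idx = build_dimension_index(country)
_, _, device_idx = build_dimension_index(device)

# Filter country = 'ES' AND device = 'mobile' with a single bitwise AND.
matches = rows_matching(country_idx["ES"] & device_idx["mobile"])
```

Filters on several dimensions reduce to bitwise operations on the bitmaps, so only the matching rows of the metric columns need to be read for aggregation.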
  • 17. Typical queries and operators available on Druid
       ⬢ Time based (time-series)
       ⬢ Filters (search/select)
       ⬢ Group by
       ⬢ Top N: equivalent to a group by plus an ordering over one dimension (results are approximated for efficiency if there are more than 1000 dimension values)
       ⬢ Time boundary: earliest and latest data points of a data set
       ⬢ Granularity (roll up)
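As an illustration of the Top N query type listed on slide 17, the Python sketch below assembles a native `topN` query as a dict. The data source name `wikiticker` appears later in the deck; the dimension `page`, the metric field `added`, and the interval are assumptions about that sample data set.

```python
import json


def top_n_query(datasource, dimension, metric_field, threshold, interval):
    """Build a native Druid topN query ranking one dimension by a summed metric."""
    return {
        "queryType": "topN",
        "dataSource": datasource,
        "dimension": dimension,
        "metric": "total",          # rank by the aggregation named below
        "threshold": threshold,     # the N in Top N
        "granularity": "all",       # one result bucket for the whole interval
        "aggregations": [
            {"type": "longSum", "name": "total", "fieldName": metric_field}
        ],
        "intervals": [interval],
    }


query = top_n_query("wikiticker", "page", "added", 10, "2015-09-12/2015-09-13")
body = json.dumps(query, indent=2)
```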
  • 18. Queries and results are expressed in JSON (HTTP REST API)
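A sketch of what slide 18 describes, using only the Python standard library: a native `timeseries` query is serialized to JSON and POSTed to the broker's `/druid/v2/` endpoint. The host and port reuse the placeholder `druid.broker.hostname:8082` that appears later in the deck, and the data source and interval are assumptions. `run_query` is defined but not called, since it needs a live broker.

```python
import json
import urllib.request

# Placeholder broker address; replace with a real host before calling run_query.
BROKER_URL = "http://druid.broker.hostname:8082/druid/v2/"


def timeseries_query(datasource, interval, granularity="hour"):
    """Build a native timeseries query that counts rows per time bucket."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": [interval],
    }


def build_request(query, url=BROKER_URL):
    """Wrap the query in an HTTP POST request with a JSON body."""
    return urllib.request.Request(
        url,
        data=json.dumps(query).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(query, url=BROKER_URL):
    """Send the query to the broker and decode the JSON response."""
    with urllib.request.urlopen(build_request(query, url)) as response:
        return json.loads(response.read().decode("utf-8"))


request = build_request(timeseries_query("wikiticker", "2015-09-12/2015-09-13"))
```

For a timeseries query, the decoded response is a JSON array with one object per time bucket, each carrying a `timestamp` and a `result` map of aggregation values.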
  • 19. Superset: BI dashboarding fully integrated with Druid
  • 20. Query Druid from Hive with SQL (BI tools integration)
       ⬢ Druid supports SQL natively (experimental feature), but Hive integration is preferred
       ⬢ Point Hive to the broker:
         – SET hive.druid.broker.address.default=druid.broker.hostname:8082;
       ⬢ A simple CREATE EXTERNAL TABLE statement:
           CREATE EXTERNAL TABLE druid_wikiticker
           STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
           TBLPROPERTIES ("druid.datasource" = "wikiticker");
         – druid_wikiticker: Hive table name
         – org.apache.hadoop.hive.druid.DruidStorageHandler: Hive storage handler class name
         – wikiticker: Druid data source name
       ⬢ The broker node endpoint is specified as a Hive configuration parameter
       ⬢ Automatic Druid data schema discovery through a segment metadata query
  • 22. Demo