A Day in the Life of a Druid Architect
Benjamin Hopp
Senior Solutions Architect @ Imply
ben@imply.io
San Francisco Airport Marriott Waterfront
Real-Time Analytics at Scale
https://www.druidsummit.org/
What do I do?
Productionalization
Implementation
Recommendation
Education
Ask a lot of Questions
● What is the use-case?
○ Is it a good fit for Druid?
● Who are the stakeholders?
○ End users - running queries
○ Data Engineers - ingesting data
○ Cluster Administrators - managing services
● How are they using the cluster?
● Where is the data coming from?
● What are the issues or concerns?
● Where does Druid fit in the technology stack?
When to use Druid
Search platform
● Real-time ingestion
● Flexible schema
● Full text search
OLAP
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries database
● Optimized for time-based datasets
● Time-based functions
When NOT to use Druid
● OLTP
● Individual record update/delete
● Big join operations
Where Druid fits in
[Diagram labels: Data lakes, Message buses; pipeline stages: Raw data → Storage → Analyze → Application]
Cluster Evaluation
Druid Architecture
Pick your servers
Data Nodes
● Large-ish
● Scales with size of data and query volume
● Lots of cores, lots of memory, fast NVMe disk
Query Nodes
● Medium-ish
● Scales with concurrency and # of Data nodes
● Typically CPU bound
Master Nodes
● Small-ish Nodes
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and tasks
Configure for MAXIMUM PERFORMANCE
Data Nodes
● Enable Cache
● Heap/maxDirectMemory size
● druid.processing.buffer.sizeBytes
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Query Nodes
● Disable Caching
● Heap/maxDirectMemory size
● druid.broker.http.numConnections
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Master Nodes
● Heap Size
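The processing settings listed above interact with the direct memory size. A minimal sketch, assuming the commonly documented rule that direct memory should be at least `druid.processing.buffer.sizeBytes * (druid.processing.numMergeBuffers + druid.processing.numThreads + 1)`. The function name and the example node sizes are hypothetical, not taken from the slides.

```python
def min_direct_memory_bytes(buffer_size_bytes: int,
                            num_merge_buffers: int,
                            num_threads: int) -> int:
    """Lower bound on direct memory needed for Druid processing buffers.

    One buffer per processing thread, one per merge buffer, plus one spare.
    """
    return buffer_size_bytes * (num_merge_buffers + num_threads + 1)


# Hypothetical Data node: 500 MB buffers, 4 merge buffers, 15 processing threads.
required = min_direct_memory_bytes(500_000_000, 4, 15)
print(f"MaxDirectMemorySize should be at least {required / 1e9:.1f} GB")
```

Treat the result as a floor for the JVM direct memory setting, not a tuned value; the right numbers depend on the workload.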
Data Evaluation
Unified Console
Optimize segment size
Ideally 300-700 MB (~5 million rows)
To control segment size
● Alter segment granularity
● Specify partition spec
● Use Automatic Compaction
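One way to turn the 300-700 MB target into a row limit is to measure an existing segment and scale. This is a rough sketch under the assumption that bytes per row stay roughly constant within a datasource; the helper names and the observed segment figures are made up for illustration.

```python
TARGET_MIN_BYTES = 300 * 1024**2
TARGET_MAX_BYTES = 700 * 1024**2


def in_target_range(segment_bytes: int) -> bool:
    """True when a segment falls inside the 300-700 MB guideline."""
    return TARGET_MIN_BYTES <= segment_bytes <= TARGET_MAX_BYTES


def suggest_max_rows(segment_bytes: int, segment_rows: int,
                     target_bytes: int = 500 * 1024**2) -> int:
    """Scale the observed rows-per-byte ratio up or down to the target size."""
    return target_bytes * segment_rows // segment_bytes


# Hypothetical observation: a 150 MB segment holding 2,000,000 rows.
observed_bytes = 150 * 1024**2
observed_rows = 2_000_000
suggested = suggest_max_rows(observed_bytes, observed_rows)
print(in_target_range(observed_bytes), suggested)
```

The suggested value is a starting point for a row limit such as maxRowsPerSegment; verify the resulting segment sizes after the next ingestion or compaction run.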
Controlling Segment Size
● Number of Tasks - Keep to lowest number that supports max ingestion rate.
● Segment Granularity - Increase if only 1 file per segment and < 200MB
"segmentGranularity": "HOUR"
● Max Rows Per Segment - Increase if a single segment is < 200MB
"maxRowsPerSegment": 5000000
Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Task submitted to Overlord
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
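The task above can be submitted over HTTP. A minimal sketch, assuming the Overlord listens on `http://localhost:8090` (an assumption, adjust for your cluster) and accepts tasks at `/druid/indexer/v1/task`. The helper name is hypothetical, and the task body is the one from the slide; check the task reference for your Druid version, since the expected fields may differ.

```python
import json
import urllib.request


def build_compaction_request(overlord_url: str, datasource: str,
                             interval: str) -> urllib.request.Request:
    """Build (but do not send) a POST request carrying a compaction task."""
    task = {"type": "compact", "dataSource": datasource, "interval": interval}
    return urllib.request.Request(
        url=f"{overlord_url}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_compaction_request(
    "http://localhost:8090", "wikipedia", "2017-01-01/2018-01-01"
)
print(request.full_url)
# urllib.request.urlopen(request)  # uncomment to actually submit the task
```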
Rollup
● Pre-aggregation at ingestion time
● Saves space, better compression
● Query performance boost
Rollup
Raw input rows:
timestamp             page           city  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
2011-01-01T00:03:45Z  Justin Bieber  SF    25     37
2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
2011-01-01T00:06:33Z  Ke$ha          LA    30     45
2011-01-01T00:08:51Z  Ke$ha          LA    16     8
2011-01-01T00:09:17Z  Miley Cyrus    DC    75     10
2011-01-01T00:11:25Z  Miley Cyrus    DC    11     25
2011-01-01T00:23:30Z  Miley Cyrus    DC    22     12
2011-01-01T00:49:33Z  Miley Cyrus    DC    90     41
Rolled-up rows:
timestamp             page           city  count  sum_added  sum_deleted
2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61
2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53
2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88
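The rollup example above can be reproduced in a few lines. This is a sketch of the idea only, not Druid's ingestion code: rows are grouped by the hour-truncated timestamp plus every dimension, and the metrics are counted and summed. Timestamps are truncated by string slicing rather than parsed, and are copied as they appear on the slide.

```python
from collections import defaultdict

# (timestamp, page, city, added, deleted), as shown on the slide
raw_rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T00:03:45Z", "Justin Bieber", "SF", 25, 37),
    ("2011-01-01T00:05:62Z", "Justin Bieber", "SF", 15, 19),
    ("2011-01-01T00:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T00:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T00:09:17Z", "Miley Cyrus", "DC", 75, 10),
    ("2011-01-01T00:11:25Z", "Miley Cyrus", "DC", 11, 25),
    ("2011-01-01T00:23:30Z", "Miley Cyrus", "DC", 22, 12),
    ("2011-01-01T00:49:33Z", "Miley Cyrus", "DC", 90, 41),
]


def rollup_hourly(rows):
    """Group by (hour, page, city); keep count, sum_added, sum_deleted."""
    totals = defaultdict(lambda: [0, 0, 0])
    for timestamp, page, city, added, deleted in rows:
        hour = timestamp[:13] + ":00:00Z"  # keep "YYYY-MM-DDTHH", zero the rest
        aggregate = totals[(hour, page, city)]
        aggregate[0] += 1
        aggregate[1] += added
        aggregate[2] += deleted
    return {key: tuple(values) for key, values in totals.items()}


rolled = rollup_hourly(raw_rows)
for (hour, page, city), (count, sum_added, sum_deleted) in rolled.items():
    print(hour, page, city, count, sum_added, sum_deleted)
```

Nine raw rows become three stored rows, which is where the space saving and the query speedup come from.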
Summarize with data sketches
Raw input rows:
timestamp             page           userid  city  added  deleted
2011-01-01T00:01:35Z  Justin Bieber  user11  SF    10     5
2011-01-01T00:03:45Z  Justin Bieber  user22  SF    25     37
2011-01-01T00:05:62Z  Justin Bieber  user11  SF    15     19
2011-01-01T00:06:33Z  Ke$ha          user33  LA    30     45
2011-01-01T00:08:51Z  Ke$ha          user33  LA    16     8
2011-01-01T00:09:17Z  Miley Cyrus    user11  DC    75     10
2011-01-01T00:11:25Z  Miley Cyrus    user44  DC    11     25
2011-01-01T00:23:30Z  Miley Cyrus    user44  DC    22     12
2011-01-01T00:49:33Z  Miley Cyrus    user55  DC    90     41
Rolled-up rows:
timestamp             page           city  count  sum_added  sum_deleted  userid_sketch
2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61           sketch_obj
2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53           sketch_obj
2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88           sketch_obj
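Why store a sketch object per rolled-up row instead of a plain distinct count? Distinct counts cannot be added across rows, but sketches can be merged. The sketch below uses exact Python sets as a stand-in for a real sketch (an assumption for illustration; real sketches are approximate and use bounded memory).

```python
# (page, city, userid) taken from the raw rows on the slide
raw_rows = [
    ("Justin Bieber", "SF", "user11"),
    ("Justin Bieber", "SF", "user22"),
    ("Justin Bieber", "SF", "user11"),
    ("Ke$ha", "LA", "user33"),
    ("Ke$ha", "LA", "user33"),
    ("Miley Cyrus", "DC", "user11"),
    ("Miley Cyrus", "DC", "user44"),
    ("Miley Cyrus", "DC", "user44"),
    ("Miley Cyrus", "DC", "user55"),
]


def build_user_sets(rows):
    """One mergeable 'sketch' (here: an exact set) per rolled-up row."""
    sketches = {}
    for page, city, userid in rows:
        sketches.setdefault((page, city), set()).add(userid)
    return sketches


sketches = build_user_sets(raw_rows)
per_row_counts = {key: len(users) for key, users in sketches.items()}

# Adding per-row distinct counts double-counts user11 (seen in SF and DC).
naive_total = sum(per_row_counts.values())
# Merging the sketches first gives the true number of distinct users.
merged_total = len(set().union(*sketches.values()))
print(naive_total, merged_total)
```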
Choose column types carefully
String column
● Indexed
● Slower aggregation
● Slower grouping
Numeric column
● Not indexed
● Fast aggregation
● Fast grouping
Partitioning beyond time
● Druid always partitions by time
● Decide which dimension to partition on… next
● Partition by some dimension you often filter on
● Improves locality, compression, storage size, query performance
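The benefit of partitioning on a frequently filtered dimension can be illustrated with a toy model. This is a simplified sketch, not Druid's partitioning code: rows are sorted by one dimension and cut into equal-sized chunks, and a filter only needs the chunks whose value range can contain the filter value. The function names and the `city` values are illustrative.

```python
def range_partition(rows, dimension, num_partitions):
    """Sort rows by one dimension and cut them into contiguous chunks."""
    ordered = sorted(rows, key=lambda row: row[dimension])
    chunk = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + chunk] for i in range(0, len(ordered), chunk)]


def partitions_to_scan(partitions, dimension, value):
    """Indexes of partitions whose [min, max] range can contain `value`."""
    return [
        index
        for index, part in enumerate(partitions)
        if part[0][dimension] <= value <= part[-1][dimension]
    ]


cities = ["SF", "SF", "SF", "LA", "LA", "DC", "DC", "DC", "DC"]
rows = [{"city": city} for city in cities]
partitions = range_partition(rows, "city", 3)

sf_partitions = partitions_to_scan(partitions, "city", "SF")
dc_partitions = partitions_to_scan(partitions, "city", "DC")
print(sf_partitions, dc_partitions)
```

A filter on SF touches one of three partitions here. Similar values stored together also tend to compress better, which matches the locality and storage points above.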
Query Evaluation
Decisions based on data!
Use Druid SQL
● Easier to learn/more familiar
● Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy)
● There are some limitations - such as multi-value dimensions, not all aggregations are supported
Explain Plan
EXPLAIN PLAN FOR
SELECT channel, sum(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY sum(added) desc
LIMIT 3
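The statement above can be sent to the Druid SQL HTTP endpoint as a JSON body with a `query` field. A minimal sketch, assuming a Broker at `http://localhost:8082` (an assumption, adjust for your cluster); the helper name is hypothetical and the request is built but not sent.

```python
import json
import urllib.request

EXPLAIN_SQL = """EXPLAIN PLAN FOR
SELECT channel, sum(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY sum(added) desc
LIMIT 3"""


def build_sql_request(broker_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the Druid SQL endpoint."""
    return urllib.request.Request(
        url=f"{broker_url}/druid/v2/sql/",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


sql_request = build_sql_request("http://localhost:8082", EXPLAIN_SQL)
print(sql_request.full_url)
# with urllib.request.urlopen(sql_request) as response:  # uncomment to run it
#     print(response.read().decode("utf-8"))
```

The response describes the native query Druid would run, which shows whether the planner picked timeseries, topN, or groupBy.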
Pick your query carefully
● TimeBoundary - Returns min/max timestamp for given interval.
● Timeseries - When you don’t want to group by dimension
● TopN - When you want to group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy - Least performant/most flexible
● Scan - For returning streaming raw data
○ Perfect ordering not preserved
● Select - For returning paginated raw data
● Search - Returns dimensions that match text search
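The rules of thumb above can be written down as a small decision helper. This is a sketch of the slide's guidance only, not Druid's SQL planner; the function name and parameters are hypothetical.

```python
def pick_query_type(group_by_dimensions, raw_rows=False):
    """Suggest a native query type from the shape of the question.

    raw_rows=True means the caller wants unaggregated rows back.
    """
    if raw_rows:
        return "scan"
    if len(group_by_dimensions) == 0:
        return "timeseries"  # aggregate over time only
    if len(group_by_dimensions) == 1:
        return "topN"        # single dimension; approximate past 1000 values
    return "groupBy"         # most flexible, least performant


print(pick_query_type([]))
print(pick_query_type(["channel"]))
print(pick_query_type(["channel", "city"]))
print(pick_query_type([], raw_rows=True))
```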
Using Lookups
● Use lookups when you have dimensions that change, to avoid re-indexing data
● Lookups are key/value pairs stored on every node
● Loaded via file or JDBC connection to an external database
● Lookups are loaded into the Java heap, so large lookups need larger heaps
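The idea behind lookups can be shown with a plain dictionary. In this sketch the stored rows keep a stable key, and the display value is resolved at query time, so changing the mapping changes results without touching stored data. The column names, the mappings, and the fallback to the raw key for missing entries are illustrative choices, not Druid defaults.

```python
stored_rows = [
    {"country_code": "US", "added": 10},
    {"country_code": "DE", "added": 25},
    {"country_code": "US", "added": 15},
]


def sum_added_by_name(rows, lookup):
    """Group by the looked-up name; fall back to the raw key if unmapped."""
    totals = {}
    for row in rows:
        name = lookup.get(row["country_code"], row["country_code"])
        totals[name] = totals.get(name, 0) + row["added"]
    return totals


lookup_v1 = {"US": "United States", "DE": "Germany"}
lookup_v2 = {"US": "USA", "DE": "Germany"}  # mapping changed, data untouched

before = sum_added_by_name(stored_rows, lookup_v1)
after = sum_added_by_name(stored_rows, lookup_v2)
print(before)
print(after)
```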
Stay in touch
@druidio
https://imply.io
https://druid.apache.org/
Ben Hopp
Benjamin.hopp@imply.io
LinkedIn: benhopp
@implydata
Roadmap and community update
Ben Hopp
ben@imply.io
Apache Druid 0.17.0
Druid 0.17.0
Our first release as a top-level Apache project!
Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on brokers
○ Broker can now optionally merge query results in parallel using multiple threads.
Druid 0.17.0 Highlights
● ...and More!
○ Improved SQL-compatible null handling
○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table
○ Fast historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
Apache Druid 0.16.0
Druid 0.16.0
Over 350 new features from 50 contributors!
Released September 2019.
Druid 0.16.0 Highlights
● Native parallel batch shuffle
○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions
● Query vectorization phase one
○ Allows queries to be sped up by reducing the number of method calls
● Indexer process
○ An alternative to the MiddleManager + Peon task execution system which is easier to configure and deploy
● Improved web console
○ Kafka & Kinesis support!
○ Point-and-click reindexing
Druid 0.17.0
Our first release as a top-level Apache project!
Coming soon (really soon).
…and beyond!!
A selection of items planned for future 2020 Druid releases.
● SQL Joins
○ A multi-phase project to add full SQL Join support to Druid. Coming up first - sub-queries and lookups
● Windowed aggregations
○ For example, moving average and cumulative sum aggregations.
● Dynamic query prioritization & laning
○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking light ones.
● Extended query vectorization support
○ Richer support for query vectorization against more query types
Download
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started
Contribute
https://github.com/apache/druid
Stay in touch
Follow the Druid project on Twitter!
@druidio
Join the community!
http://druid.io/community
Free training hosted by Imply!
https://imply.io/druid-days

A Day in the Life of a Druid Implementor and Druid's Roadmap

  • 1. A Day in the life of a Druid Architect Benjamin Hopp Senior Solutions Architect @ Imply ben@imply.io
  • 2. San Francisco Airport Marriott Waterfront Real-Time Analytics at Scale https://www.druidsummit.org/
  • 4. What do I do? Productionalization Implementation Recommendation Education
  • 5. Ask a lot of Questions ● What is the use-case? ○ Is it a good fit for Druid? ● Who are the stakeholders? ○ End users - running queries ○ Data Engineers - ingesting data ○ Cluster Administrators - managing services ● How are they using the cluster? ● Where is the data coming from? ● What are the issues or concerns? ● Where does Druid fit in the technology stack?
  • 6. When to use Druid. Search platform: ● Real-time ingestion ● Flexible schema ● Full text search. OLAP: ● Batch ingestion ● Efficient storage ● Fast analytic queries. Timeseries database: ● Optimized for time-based datasets ● Time-based functions
  • 7. When NOT to use Druid: OLTP, individual record update/delete, big join operations
  • 8. Where Druid fits in Data lakes Message buses Raw data Storage Analyze Application
  • 11. Pick your servers. Data Nodes: ● Large-ish ● Scales with size of data and query volume ● Lots of cores, lots of memory, fast NVMe disk. Query Nodes: ● Medium-ish ● Scales with concurrency and # of Data nodes ● Typically CPU bound. Master Nodes: ● Small-ish Nodes ● Coordinator scales with # of segments ● Overlord scales with # of supervisors and tasks
  • 12. Configure for MAXIMUM PERFORMANCE. Data Nodes: ● Enable Cache ● Heap/maxDirectMemory size ● druid.processing.buffer.sizeBytes ● druid.processing.numMergeBuffers ● druid.processing.numThreads. Query Nodes: ● Disable Caching ● Heap/maxDirectMemory size ● druid.broker.http.numConnections ● druid.processing.numMergeBuffers ● druid.processing.numThreads. Master Nodes: ● Heap Size
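The processing settings on this slide constrain one another: Druid's configuration guidance sizes direct memory as at least (numThreads + numMergeBuffers + 1) × buffer.sizeBytes. The sketch below only does that arithmetic. The node values are made-up assumptions for illustration, not recommendations.

```python
def min_direct_memory_bytes(num_threads: int, num_merge_buffers: int,
                            buffer_size_bytes: int) -> int:
    """Lower bound for maxDirectMemory given the druid.processing.* settings.

    One buffer per processing thread, one per merge buffer, plus one spare.
    """
    return (num_threads + num_merge_buffers + 1) * buffer_size_bytes


# Hypothetical Data node: 15 processing threads, 4 merge buffers, 500 MB buffers.
required = min_direct_memory_bytes(
    num_threads=15, num_merge_buffers=4, buffer_size_bytes=500_000_000
)
print(f"maxDirectMemory should be at least {required / 1e9:.1f} GB")
```

Raising any one of the three settings without raising the direct memory limit is a common cause of a process failing at startup, so it helps to check them together.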
  • 15. Optimize segment size Ideally 300-700 MB (~ 5 million rows) To control segment size ● Alter segment granularity ● Specify partition spec ● Use Automatic Compaction
  • 16. Controlling Segment Size ● Number of Tasks - Keep to lowest number that supports max ingestion rate. ● Segment Granularity - Increase if only 1 file per segment and < 200MB "segmentGranularity": "HOUR" ● Max Rows Per Segment - Increase if a single segment is < 200MB "maxRowsPerSegment": 5000000
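The sizing guidance above can be turned into a quick back-of-the-envelope check. This is a minimal sketch: the average row size is an assumption you would measure on your own segments, and the 300-700 MB window is taken from the slide.

```python
def rows_for_target_size(avg_row_bytes: float, target_mb: float = 500.0) -> int:
    """Estimate a maxRowsPerSegment value that lands near target_mb."""
    return int(target_mb * 1_000_000 / avg_row_bytes)


def segment_size_ok(segment_mb: float, low_mb: float = 300.0,
                    high_mb: float = 700.0) -> bool:
    """True when a segment falls inside the recommended size window."""
    return low_mb <= segment_mb <= high_mb


# Hypothetical datasource whose stored rows average about 100 bytes each.
print(rows_for_target_size(avg_row_bytes=100))  # candidate maxRowsPerSegment
print(segment_size_ok(150))                     # too small: coarsen granularity or raise max rows
```

With 100-byte rows, a 500 MB target works out to the same 5 million rows the slide quotes; wider rows need a proportionally lower row limit.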
  • 17. Compaction ● Combines small segments into larger segments ● Useful for late-arriving data ● Task submitted to Overlord { "type" : "compact", "dataSource" : "wikipedia", "interval" : "2017-01-01/2018-01-01" }
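Submitting the compaction task from the slide might look like the sketch below. The Overlord address (`localhost:8090`) is an assumption, and the task body simply mirrors the slide; check the compaction spec layout for your Druid version before using it. The request is built but not sent, so the snippet runs without a cluster.

```python
import json
import urllib.request


def build_compaction_request(overlord_url: str, datasource: str,
                             interval: str) -> urllib.request.Request:
    """Build a POST to the Overlord task endpoint with a compact task body."""
    task = {"type": "compact", "dataSource": datasource, "interval": interval}
    return urllib.request.Request(
        url=f"{overlord_url}/druid/indexer/v1/task",
        data=json.dumps(task).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Assumed local Overlord; datasource and interval are the ones on the slide.
req = build_compaction_request(
    "http://localhost:8090", "wikipedia", "2017-01-01/2018-01-01"
)
print(req.full_url)
# urllib.request.urlopen(req)  # uncomment to submit against a live Overlord
```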
  • 18. Rollup ● Pre-aggregation at ingestion time ● Saves space, better compression ● Query performance boost
  • 19. Rollup
      Raw data:
      timestamp             page           city  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  SF    10     5
      2011-01-01T00:03:45Z  Justin Bieber  SF    25     37
      2011-01-01T00:05:62Z  Justin Bieber  SF    15     19
      2011-01-01T00:06:33Z  Ke$ha          LA    30     45
      2011-01-01T00:08:51Z  Ke$ha          LA    16     8
      2011-01-01T00:09:17Z  Miley Cyrus    DC    75     10
      2011-01-01T00:11:25Z  Miley Cyrus    DC    11     25
      2011-01-01T00:23:30Z  Miley Cyrus    DC    22     12
      2011-01-01T00:49:33Z  Miley Cyrus    DC    90     41
      After rollup:
      timestamp             page           city  count  sum_added  sum_deleted
      2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61
      2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53
      2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88
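The rollup on this slide can be reproduced in a few lines. This is an illustration of the idea, not Druid's ingestion code; truncating timestamps to the hour is an assumption, since the slide does not state the query granularity.

```python
from collections import defaultdict

# Raw events from the slide: (timestamp, page, city, added, deleted).
raw_rows = [
    ("2011-01-01T00:01:35Z", "Justin Bieber", "SF", 10, 5),
    ("2011-01-01T00:03:45Z", "Justin Bieber", "SF", 25, 37),
    ("2011-01-01T00:05:62Z", "Justin Bieber", "SF", 15, 19),
    ("2011-01-01T00:06:33Z", "Ke$ha", "LA", 30, 45),
    ("2011-01-01T00:08:51Z", "Ke$ha", "LA", 16, 8),
    ("2011-01-01T00:09:17Z", "Miley Cyrus", "DC", 75, 10),
    ("2011-01-01T00:11:25Z", "Miley Cyrus", "DC", 11, 25),
    ("2011-01-01T00:23:30Z", "Miley Cyrus", "DC", 22, 12),
    ("2011-01-01T00:49:33Z", "Miley Cyrus", "DC", 90, 41),
]


def rollup(rows):
    """Group by (truncated timestamp, page, city) and pre-aggregate metrics."""
    totals = defaultdict(lambda: [0, 0, 0])  # count, sum_added, sum_deleted
    for ts, page, city, added, deleted in rows:
        hour = ts[:13] + ":00:00Z"  # keep "YYYY-MM-DDTHH", zero the rest
        agg = totals[(hour, page, city)]
        agg[0] += 1
        agg[1] += added
        agg[2] += deleted
    return {key: tuple(agg) for key, agg in totals.items()}


rolled = rollup(raw_rows)
for key, (count, sum_added, sum_deleted) in rolled.items():
    print(*key, count, sum_added, sum_deleted)
```

Nine raw rows collapse to three stored rows here; the savings in practice depend on how many distinct dimension combinations occur per time bucket.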
  • 20. Summarize with data sketches
      Raw data:
      timestamp             page           userid  city  added  deleted
      2011-01-01T00:01:35Z  Justin Bieber  user11  SF    10     5
      2011-01-01T00:03:45Z  Justin Bieber  user22  SF    25     37
      2011-01-01T00:05:62Z  Justin Bieber  user11  SF    15     19
      2011-01-01T00:06:33Z  Ke$ha          user33  LA    30     45
      2011-01-01T00:08:51Z  Ke$ha          user33  LA    16     8
      2011-01-01T00:09:17Z  Miley Cyrus    user11  DC    75     10
      2011-01-01T00:11:25Z  Miley Cyrus    user44  DC    11     25
      2011-01-01T00:23:30Z  Miley Cyrus    user44  DC    22     12
      2011-01-01T00:49:33Z  Miley Cyrus    user55  DC    90     41
      After rollup:
      timestamp             page           city  count  sum_added  sum_deleted  userid_sketch
      2011-01-01T00:00:00Z  Justin Bieber  SF    3      50         61           sketch_obj
      2011-01-01T00:00:00Z  Ke$ha          LA    2      46         53           sketch_obj
      2011-01-01T00:00:00Z  Miley Cyrus    DC    4      198        88           sketch_obj
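Why store a sketch object per rolled-up row instead of a plain distinct count? Because distinct counts cannot be added together, while sketches can be merged. The sketch below uses an exact Python `set` as a stand-in for the sketch object; real data sketches are approximate and bounded in size, which a set is not.

```python
# (page, userid) pairs from the slide's raw table.
events = [
    ("Justin Bieber", "user11"), ("Justin Bieber", "user22"),
    ("Justin Bieber", "user11"), ("Ke$ha", "user33"), ("Ke$ha", "user33"),
    ("Miley Cyrus", "user11"), ("Miley Cyrus", "user44"),
    ("Miley Cyrus", "user44"), ("Miley Cyrus", "user55"),
]

# One mergeable "sketch" per rolled-up row (a set stands in for the sketch).
sketch_by_page = {}
for page, user in events:
    sketch_by_page.setdefault(page, set()).add(user)

distinct_per_page = {page: len(s) for page, s in sketch_by_page.items()}

# Adding per-row distinct counts double-counts users seen on several pages.
naive_total = sum(distinct_per_page.values())

# Merging the sketches first gives the true number of distinct users.
merged_total = len(set().union(*sketch_by_page.values()))

print(distinct_per_page, naive_total, merged_total)
```

`user11` appears under two pages, so summing the per-page counts overstates the total; merging before counting does not.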
  • 21. Choose column types carefully String column indexed fast aggregation fast grouping Numeric column indexed fast aggregation fast grouping
  • 22. Partitioning beyond time ● Druid always partitions by time ● Decide which dimension to partition on… next ● Partition by some dimension you often filter on ● Improves locality, compression, storage size, query performance
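To see why partitioning on a frequently filtered dimension improves locality, compare how many partitions a filter has to touch under two layouts. This is a toy model with made-up rows; Druid prunes segments using partition metadata rather than scanning contents as this sketch does.

```python
def range_partition(rows, key, num_partitions):
    """Sort rows by `key` and cut them into contiguous, equal-sized chunks."""
    ordered = sorted(rows, key=lambda r: r[key])
    size = -(-len(ordered) // num_partitions)  # ceiling division
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def partitions_touched(partitions, key, value):
    """Count partitions containing at least one row matching the filter."""
    return sum(1 for part in partitions if any(r[key] == value for r in part))


# Twelve hypothetical rows arriving interleaved across three cities.
rows = [{"city": c, "added": i} for i, c in enumerate(["SF", "LA", "DC"] * 4)]

arrival_order = [rows[i:i + 4] for i in range(0, len(rows), 4)]
by_city = range_partition(rows, "city", 3)

print(partitions_touched(arrival_order, "city", "LA"))  # every chunk has LA rows
print(partitions_touched(by_city, "city", "LA"))        # LA rows sit together
```

Grouping identical values together is also what helps compression: long runs of the same dimension value encode more compactly than interleaved ones.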
  • 25. Use Druid SQL ● Easier to learn/more familiar ● Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy) ● There are some limitations: for example, multi-value dimensions have limited support and not all aggregations are supported
  • 26. Explain Plan EXPLAIN PLAN FOR SELECT channel, sum(added) FROM wikipedia WHERE commentLength >= 50 GROUP BY channel ORDER BY sum(added) desc LIMIT 3
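The statement on this slide can be sent to Druid's SQL HTTP endpoint as a JSON body with a `query` field. The sketch below only builds the request; the Broker address (`localhost:8082`) is an assumption for a local setup.

```python
import json
import urllib.request

QUERY = (
    "SELECT channel, sum(added) FROM wikipedia "
    "WHERE commentLength >= 50 "
    "GROUP BY channel ORDER BY sum(added) desc LIMIT 3"
)


def build_sql_request(broker_url: str, sql: str) -> urllib.request.Request:
    """Build a POST to the Druid SQL endpoint carrying one SQL statement."""
    return urllib.request.Request(
        url=f"{broker_url}/druid/v2/sql/",
        data=json.dumps({"query": sql}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Prefixing the query with EXPLAIN PLAN FOR returns the plan instead of rows.
explain_req = build_sql_request("http://localhost:8082", "EXPLAIN PLAN FOR " + QUERY)
print(explain_req.full_url)
# urllib.request.urlopen(explain_req)  # uncomment against a live Broker
```

The returned plan shows which native query type the SQL layer selected, which is a quick way to confirm a query is not falling back to a more expensive type than expected.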
  • 27. Pick your query carefully ● TimeBoundary - Returns min/max timestamp for given interval. ● Timeseries - When you don’t want to group by dimension ● TopN - When you want to group by a single dimension ○ Approximate if > 1000 dimension values ● GroupBy - Least performant/most flexible ● Scan - For returning streaming raw data ○ Perfect ordering not preserved ● Select - For returning paginated raw data ● Search - Returns dimensions that match text search
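The rules of thumb on this slide can be written as a small decision function. It is a deliberate simplification and not how the Druid SQL planner decides; requiring a limit before choosing topN is an added assumption, reflecting that a topN query needs a threshold.

```python
from typing import Optional, Sequence


def pick_query_type(group_by: Sequence[str], limit: Optional[int] = None,
                    raw_rows: bool = False) -> str:
    """Suggest a native query type from the shape of the question."""
    if raw_rows:
        return "scan"        # streaming raw data, no aggregation
    if not group_by:
        return "timeseries"  # aggregate over time only
    if len(group_by) == 1 and limit is not None:
        return "topN"        # single dimension, ranked and limited
    return "groupBy"         # most flexible, least performant


print(pick_query_type([]))                      # timeseries
print(pick_query_type(["channel"], limit=3))    # topN
print(pick_query_type(["channel", "city"]))     # groupBy
print(pick_query_type([], raw_rows=True))       # scan
```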
  • 28. Using Lookups ● Use lookups when you have dimensions that change to avoid re-indexing data ● Lookups are key/value pairs stored on every node. ● Loaded via file or JDBC connection to external database ● Lookups are loaded into the Java heap, so large lookups need larger heaps
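Conceptually, a lookup is a key/value map applied at query time, which is why changing it does not require re-indexing. The sketch below models that with plain dictionaries; the country codes and names are invented for illustration and this is not Druid's lookup API.

```python
# Stored rows keep only the stable key (hypothetical data).
stored_rows = [
    {"country_code": "US", "added": 10},
    {"country_code": "DE", "added": 5},
    {"country_code": "US", "added": 7},
]


def sum_added_by_name(rows, lookup, default="unknown"):
    """Resolve each key through the lookup at query time, then aggregate."""
    totals = {}
    for row in rows:
        name = lookup.get(row["country_code"], default)
        totals[name] = totals.get(name, 0) + row["added"]
    return totals


lookup_v1 = {"US": "United States", "DE": "Germany"}
lookup_v2 = {"US": "USA", "DE": "Germany"}  # the display name changed

print(sum_added_by_name(stored_rows, lookup_v1))
print(sum_added_by_name(stored_rows, lookup_v2))  # same stored rows, new labels
```

Because the whole map has to be held in memory on each node, its size counts against the heap, as the last bullet on the slide notes.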
  • 29. Stay in touch @druidio https://imply.io https://druid.apache.org/ Ben Hopp Benjamin.hopp@imply.io LinkedIn: benhopp @implydata
  • 30. roadmap and community update Ben Hopp ben@imply.io
  • 32. Druid 0.17.0 Our first release as a top-level Apache project!
  • 33. Druid 0.17.0 Highlights ● Native batch - binary inputs & more ○ Supports binary formats such as ORC, Parquet, and Avro ○ Native batch tasks can now read from HDFS ○ Single-dimension range partitioning for parallel native batch ● Compaction improvements ○ Parallel index task split hints and parallel auto-compaction ○ Stateful auto-compaction ● Parallel query merge on brokers ○ Broker can now optionally merge query results in parallel using multiple threads.
  • 34. Druid 0.17.0 Highlights ● ...and More! ○ Improved SQL-compatible null handling ○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table ○ Fast historical start with deferred loading of segments until query time ○ New readiness and self-discovery resources ○ Task assignment based on MiddleManager categories ○ Security updates
  • 36. Druid 0.16.0 Over 350 new features from 50 contributors! Released September 2019.
  • 37. Druid 0.16.0 Highlights ● Native parallel batch shuffle ○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions ● Query vectorization phase one ○ Allows queries to be sped up by reducing the number of method calls ● Indexer process ○ An alternative to the MiddleManager + Peon task execution system which is easier to configure and deploy ● Improved web console ○ Kafka & Kinesis support! ○ Point-and-click reindexing
  • 38. Druid 0.17.0 Our first release as a top-level Apache project! Coming soon (really soon).
  • 42. …and beyond!! A selection of items planned for future 2020 Druid releases.
  • 43. …and beyond!! ● SQL Joins ○ A multi-phase project to add full SQL Join support to Druid. Coming up first - sub-queries and lookups ● Windowed aggregations ○ For example, moving average and cumulative sum aggregations. ● Dynamic query prioritization & laning ○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking light ones. ● Extended query vectorization support ○ Richer support for query vectorization against more query types
  • 44. Download Druid community site (new): https://druid.apache.org/ Imply distribution: https://imply.io/get-started
  • 46. Stay in touch @druidio Join the community! http://druid.io/community Free training hosted by Imply! https://imply.io/druid-days Follow the Druid project on Twitter!