SlideShare a Scribd company logo
A Day in the life of a Druid
Architect
Benjamin Hopp
Senior Solutions Architect @ Imply
ben@imply.io
San Francisco Airport Marriott Waterfront
Real-Time Analytics at Scale
https://www.druidsummit.org/
What do I do?
Productionalization
Implementation
Recommendation
Education
Ask a lot of Questions
● What is the use-case?
○ Is it a good fit for druid?
● Who are the stakeholders?
○ End users - running queries
○ Data Engineers - ingesting data
○ Cluster Administrators - managing services
● How are they using the cluster?
● Where is the data coming from?
● What are the issues or concerns?
● Where does druid fit in the technology stack?
When to use Druid
6
Search
platform
OLAP
● Real-time ingestion
● Flexible schema
● Full text search
● Batch ingestion
● Efficient storage
● Fast analytic queries
Timeseries
database
● Optimized for
time-based datasets
● Time-based functions
When NOT to use Druid
7
OLTP
Individual
record
update/delet
e
Big join
operations
Where Druid fits in
8
Data lakes
Message buses
Raw data Storage Analyze Application
Cluster Evaluation
Druid Architecture
Pick your servers
Data NodesD
● Large-ish
● Scales with size of data and query
volume
● Lots of cores, lots of memory, fast NVMe
disk
Query NodesQ
● Medium-ish
● Scales with concurrency and # of Data
nodes
● Typically CPU bound
Master NodesM
● Small-ish Nodes
● Coordinator scales with # of segments
● Overlord scales with # of supervisors and
tasks
Configure for MAXIMUM PERFORMANCE
Data NodesD
● Enable Cache
● Heap/maxDirectMemory size
● druid.processing.buffer.sizeBytes
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Query NodesQ
● Disable Caching
● Heap/maxDirectMemory size
● druid.broker.http.numConnections
● druid.processing.numMergeBuffers
● druid.processing.numThreads
Master NodesM ● Heap Size
Data Evaluation
Unified Console
Optimize segment size
Ideally 300 - 700 mb (~ 5 million rows)
To control segment size
● Alter segment granularity
● Specify partition spec
● Use Automatic Compaction
Controlling Segment Size
● Number of Tasks - Keep to lowest number that supports max
ingestion rate.
● Segment Granularity - Increase if only 1 file per segment and <
200MB
"segmentGranularity": "HOUR"
● Max Rows Per Segment - Increase if a single segment is <
200MB
"maxRowsPerSegment": 5000000
Compaction
● Combines small segments into larger segments
● Useful for late-arriving data
● Task submitted to Overlord
{
"type" : "compact",
"dataSource" : "wikipedia",
"interval" : "2017-01-01/2018-01-01"
}
Rollup
● Pre-aggregation at ingestion
time
● Saves space, better
compression
● Query performance boost
Rollup
timestamp page city count sum_added sum_deleted
2011-01-01T00:00:00Z
Justin
Bieber
SF 3 50 61
2011-01-01T00:00:00Z Ke$ha LA 2 46 53
2011-01-01T00:00:00Z Miley
Cyrus
DC 4 198 88
timestamp page city added deleted
2011-01-01T00:01:35Z
Justin
Bieber
SF 10 5
2011-01-01T00:03:45Z
Justin
Bieber
SF 25 37
2011-01-01T00:05:62Z
Justin
Bieber
SF 15 19
2011-01-01T00:06:33Z Ke$ha LA 30 45
2011-01-01T00:08:51Z Ke$ha LA 16 8
2011-01-01T00:09:17Z
Miley
Cyrus
DC 75 10
2011-01-01T00:11:25Z
Miley
Cyrus
DC 11 25
2011-01-01T00:23:30Z
Miley
Cyrus
DC 22 12
2011-01-01T00:49:33Z
Miley
Cyrus
DC 90 41
Summarize with data sketches
timestamp page city count
sum_
added
sum_
deleted userid_sketch
2011-01-01T00:00:00Z
Justin
Bieber
SF 3 50 61 sketch_obj
2011-01-01T00:00:00Z Ke$ha LA 2 46 53 sketch_obj
2011-01-01T00:00:00Z Miley
Cyrus
DC 4 198 88 sketch_obj
timestamp page userid city added deleted
2011-01-01T00:01:3
5Z
Justin
Bieber
user11 SF 10 5
2011-01-01T00:03:4
5Z
Justin
Bieber
user22 SF 25 37
2011-01-01T00:05:6
2Z
Justin
Bieber
user11 SF 15 19
2011-01-01T00:06:3
3Z
Ke$ha user33 LA 30 45
2011-01-01T00:08:5
1Z
Ke$ha user33 LA 16
8
2011-01-01T00:09:1
7Z
Miley
Cyrus
user11 DC 75 10
2011-01-01T00:11:2
5Z
Miley
Cyrus
user44 DC 11 25
2011-01-01T00:23:3
0Z
Miley
Cyrus
user44 DC 22 12
2011-01-01T00:49:3
3Z
Miley
Cyrus
user55 DC 90 41
Choose column types carefully
String column
indexed
fast aggregation
fast grouping
Numeric column
indexed
fast aggregation
fast grouping
Partitioning beyond time
● Druid always partitions by time
● Decide which dimension to
partition on… next
● Partition by some dimension you
often filter on
● Improves locality, compression,
storage size, query performance
Query Evaluation
Decisions based on data!
Use Druid SQL
● Easier to learn/more familiar
● Will attempt to make intelligent query type choices (timeseries
vs topN vs groupBy)
● There are some limitations - such as multi-value dimensions,
not all aggregations are supported
Explain Plan
EXPLAIN PLAN FOR
SELECT channel, sum(added)
FROM wikipedia
WHERE commentLength >= 50
GROUP BY channel
ORDER BY sum(added) desc
LIMIT 3
Pick your query carefully
● TimeBoundary - Returns min/max timestamp for given interval.
● Timeseries - When you don’t want to group by dimension
● TopN - When you want to group by a single dimension
○ Approximate if > 1000 dimension values
● GroupBy - Least performant/most flexible
● Scan - For returning streaming raw data
○ Perfect ordering not preserved
● Select - For returning paginated raw data
● Search - Returns dimensions that match text search
Using Lookups
● Use lookups when you have dimensions that change to avoid
re-indexing data
● Lookups are key/value pairs stored on every node.
● Loaded via file or JDBC connection to external database
● Lookups are loaded into the java heap size, so large lookups
need larger heaps
Stay in touch
29
@druidio
https://imply.io
https://druid.apache.org/
Ben Hopp
Benjamin.hopp@imply.io
LinkedIn: benhopp
@implydata
roadmap and community update
Ben Hopp
ben@imply.io
Apache Druid 0.17.0
Druid 0.17.0
Our first release as a top-level Apache project!
3
Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports non-binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on brokers
○ Broker can now optionally merge query results in parallel using multiple threads.
4
Druid 0.17.0 Highlights
● ...and More!
○ Improved SQL-compatible null handling
○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram
metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system
tables in a new sys.supervisors table
○ Fast historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
5
Apache Druid 0.16.0
Druid 0.16.0
Over 350 new features from 50 contributors!
Released September 2019.
7
Druid 0.16.0 Highlights
● Native parallel batch shuffle
○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions
● Query vectorization phase one
○ Allows queries to be sped up by reducing the number of method calls
● Indexer process
○ An alternative to the MiddleManager + Peon task execution system which is easier to
configure and deploy
● Improved web console
○ Kafka & Kinesis support!
○ Point-and-click reindexing
8
Druid 0.17.0
Our first release as a top-level Apache project!
Coming soon (really soon).
9
Druid 0.17.0 Highlights
● Native batch - binary inputs & more
○ Supports non-binary formats such as ORC, Parquet, and Avro
○ Native batch tasks can now read from HDFS
○ Single-dimension range partitioning for parallel native batch
● Compaction improvements
○ Parallel index task split hints and parallel auto-compaction
○ Stateful auto-compaction
● Parallel query merge on brokers
○ Broker can now optionally merge query results in parallel using multiple threads.
10
Druid 0.17.0 Highlights
● ...and More!
○ Improved SQL-compatible null handling
○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram
metric types
○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system
tables in a new sys.supervisors table
○ Fast historical start with deferred loading of segments until query time
○ New readiness and self-discovery resources
○ Task assignment based on MiddleManager categories
○ Security updates
11
…and beyond!!
…and beyond!!
A selection of items planned for future 2020 Druid releases.
13
…and beyond!!
● SQL Joins
○ A multi-phase project to add full SQL Join support to Druid. Coming up first -
sub-queries and lookups
● Windowed aggregations
○ For example, moving average and cumulative sum aggregations.
● Dynamic query prioritization & laning
○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking
light ones.
● Extended query vectorization support
○ Richer support for query vectorization against more query types
14
Download
Druid community site (new): https://druid.apache.org/
Imply distribution: https://imply.io/get-started
15
Contribute
16
https://github.com/apache/druid
Stay in touch
17
@druidio
Join the community!
http://druid.io/community
Free training hosted by Imply!
https://imply.io/druid-days
Follow the Druid project on Twitter!

More Related Content

What's hot

Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Spark Summit
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
Johan Gustavsson
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
Carl Steinbach
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
Sadayuki Furuhashi
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Amy W. Tang
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Spark Summit
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
Michael Spector
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
Taro L. Saito
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
Zekeriya Besiroglu
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Spark Summit
 
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Codemotion
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
DataWorks Summit/Hadoop Summit
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
Yahoo Developer Network
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
Imply
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Spark Summit
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
Seth Rosenblum
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
Modern Data Stack France
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
kbajda
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
Treasure Data, Inc.
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
Itai Yaffe
 

What's hot (20)

Insights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald NowlingInsights into Customer Behavior from Clickstream Data by Ronald Nowling
Insights into Customer Behavior from Clickstream Data by Ronald Nowling
 
tdtechtalk20160330johan
tdtechtalk20160330johantdtechtalk20160330johan
tdtechtalk20160330johan
 
The Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedInThe Past, Present, and Future of Hadoop at LinkedIn
The Past, Present, and Future of Hadoop at LinkedIn
 
Prestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for PrestoPrestogres, ODBC & JDBC connectivity for Presto
Prestogres, ODBC & JDBC connectivity for Presto
 
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedInBuilding a Real-Time Data Pipeline: Apache Kafka at LinkedIn
Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn
 
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
Powering Predictive Mapping at Scale with Spark, Kafka, and Elastic Search: S...
 
Real-time analytics with Druid at Appsflyer
Real-time analytics with Druid at AppsflyerReal-time analytics with Druid at Appsflyer
Real-time analytics with Druid at Appsflyer
 
Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015Presto @ Treasure Data - Presto Meetup Boston 2015
Presto @ Treasure Data - Presto Meetup Boston 2015
 
Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...Developing high frequency indicators using real time tick data on apache supe...
Developing high frequency indicators using real time tick data on apache supe...
 
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena EdelsonStreaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
Streaming Analytics with Spark, Kafka, Cassandra and Akka by Helena Edelson
 
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
Graph databases and the Panama Papers - Stefan Armbruster - Codemotion Milan ...
 
What's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and BeyondWhat's new in SQL on Hadoop and Beyond
What's new in SQL on Hadoop and Beyond
 
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using DruidJuly 2014 HUG : Pushing the limits of Realtime Analytics using Druid
July 2014 HUG : Pushing the limits of Realtime Analytics using Druid
 
August meetup - All about Apache Druid
August meetup - All about Apache Druid August meetup - All about Apache Druid
August meetup - All about Apache Druid
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Monitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadogMonitoring and scaling postgres at datadog
Monitoring and scaling postgres at datadog
 
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
HUG France Feb 2016 - Migration de données structurées entre Hadoop et RDBMS ...
 
Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016Presto at Hadoop Summit 2016
Presto at Hadoop Summit 2016
 
Building a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with RocanaBuilding a system for machine and event-oriented data with Rocana
Building a system for machine and event-oriented data with Rocana
 
Building an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using SparkBuilding an ETL pipeline for Elasticsearch using Spark
Building an ETL pipeline for Elasticsearch using Spark
 

Similar to A Day in the Life of a Druid Implementor and Druid's Roadmap

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
Databricks
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Ahmed Ossama
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
Corey Huinker
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
C4Media
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Dipti Borkar
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
Rajesh Muppalla
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in Cloud
Changshu Liu
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
Andrew Brust
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
markgrover
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
Omid Vahdaty
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
MongoDB
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
Ramesh Pabba - seeking new projects
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
HostedbyConfluent
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
Piotr Findeisen
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
Zhenxiao Luo
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
Blake Irvine
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at Uber
DataWorks Summit
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
Marcin Bielak
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
Zhenxiao Luo
 

Similar to A Day in the Life of a Druid Implementor and Druid's Roadmap (20)

Spark Meetup at Uber
Spark Meetup at UberSpark Meetup at Uber
Spark Meetup at Uber
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Cloud arch patterns
Cloud arch patternsCloud arch patterns
Cloud arch patterns
 
Data Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFixData Science in the Cloud @StitchFix
Data Science in the Cloud @StitchFix
 
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
Presto – Today and Beyond – The Open Source SQL Engine for Querying all Data...
 
Lambda architecture @ Indix
Lambda architecture @ IndixLambda architecture @ Indix
Lambda architecture @ Indix
 
Scaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in CloudScaling 100PB Data Warehouse in Cloud
Scaling 100PB Data Warehouse in Cloud
 
Big Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI ProsBig Data and NoSQL for Database and BI Pros
Big Data and NoSQL for Database and BI Pros
 
SQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for ImpalaSQL Engines for Hadoop - The case for Impala
SQL Engines for Hadoop - The case for Impala
 
AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned AWS Big Data Demystified #1: Big data architecture lessons learned
AWS Big Data Demystified #1: Big data architecture lessons learned
 
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
Lightning Talk: Why and How to Integrate MongoDB and NoSQL into Hadoop Big Da...
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Introduction to Hadoop Administration
Introduction to Hadoop AdministrationIntroduction to Hadoop Administration
Introduction to Hadoop Administration
 
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
Designing Apache Hudi for Incremental Processing With Vinoth Chandar and Etha...
 
Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020Presto @ Zalando - Big Data Tech Warsaw 2020
Presto @ Zalando - Big Data Tech Warsaw 2020
 
Presto@Uber
Presto@UberPresto@Uber
Presto@Uber
 
Netflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering MeetupNetflix Data Engineering @ Uber Engineering Meetup
Netflix Data Engineering @ Uber Engineering Meetup
 
Geospatial data platform at Uber
Geospatial data platform at UberGeospatial data platform at Uber
Geospatial data platform at Uber
 
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
IoT databases - review and challenges - IoT, Hardware & Robotics meetup - onl...
 
Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019Real time analytics on deep learning @ strata data 2019
Real time analytics on deep learning @ strata data 2019
 

More from Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
Itai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Itai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
Itai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
Itai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Itai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
Itai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
Itai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
Itai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Itai Yaffe
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
Itai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
Itai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
Itai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
Itai Yaffe
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
Itai Yaffe
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALE
Itai Yaffe
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data Israel
Itai Yaffe
 

More from Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Scalable Incremental Index for Druid
Scalable Incremental Index for DruidScalable Incremental Index for Druid
Scalable Incremental Index for Druid
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 
Serverless data processing built for internet SCALE
Serverless data processing built for internet SCALEServerless data processing built for internet SCALE
Serverless data processing built for internet SCALE
 
Ask me anything - Women in Big Data Israel
Ask me anything - Women in Big Data IsraelAsk me anything - Women in Big Data Israel
Ask me anything - Women in Big Data Israel
 

Recently uploaded

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
74nqk8xf
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
nuttdpt
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
zsjl4mimo
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
Timothy Spann
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
74nqk8xf
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
manishkhaire30
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
apvysm8
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Fernanda Palhano
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
jitskeb
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
rwarrenll
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
g4dpvqap0
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
Walaa Eldin Moustafa
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
soxrziqu
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
Social Samosa
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
v7oacc3l
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
Sachin Paul
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
Timothy Spann
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
Sm321
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
bopyb
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
Lars Albertsson
 

Recently uploaded (20)

一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
一比一原版(Coventry毕业证书)考文垂大学毕业证如何办理
 
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
一比一原版(UCSF文凭证书)旧金山分校毕业证如何办理
 
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
一比一原版(Harvard毕业证书)哈佛大学毕业证如何办理
 
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
06-04-2024 - NYC Tech Week - Discussion on Vector Databases, Unstructured Dat...
 
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
一比一原版(牛布毕业证书)牛津布鲁克斯大学毕业证如何办理
 
Learn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queriesLearn SQL from basic queries to Advance queries
Learn SQL from basic queries to Advance queries
 
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
办(uts毕业证书)悉尼科技大学毕业证学历证书原版一模一样
 
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdfUdemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
Udemy_2024_Global_Learning_Skills_Trends_Report (1).pdf
 
Experts live - Improving user adoption with AI
Experts live - Improving user adoption with AIExperts live - Improving user adoption with AI
Experts live - Improving user adoption with AI
 
My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.My burning issue is homelessness K.C.M.O.
My burning issue is homelessness K.C.M.O.
 
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
一比一原版(爱大毕业证书)爱丁堡大学毕业证如何办理
 
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data LakeViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake
 
University of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma TranscriptUniversity of New South Wales degree offer diploma Transcript
University of New South Wales degree offer diploma Transcript
 
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
4th Modern Marketing Reckoner by MMA Global India & Group M: 60+ experts on W...
 
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
在线办理(英国UCA毕业证书)创意艺术大学毕业证在读证明一模一样
 
Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......Palo Alto Cortex XDR presentation .......
Palo Alto Cortex XDR presentation .......
 
DSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelinesDSSML24_tspann_CodelessGenerativeAIPipelines
DSSML24_tspann_CodelessGenerativeAIPipelines
 
Challenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more importantChallenges of Nation Building-1.pptx with more important
Challenges of Nation Building-1.pptx with more important
 
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
一比一原版(GWU,GW文凭证书)乔治·华盛顿大学毕业证如何办理
 
End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024End-to-end pipeline agility - Berlin Buzzwords 2024
End-to-end pipeline agility - Berlin Buzzwords 2024
 

A Day in the Life of a Druid Implementor and Druid's Roadmap

  • 1. A Day in the life of a Druid Architect Benjamin Hopp Senior Solutions Architect @ Imply ben@imply.io
  • 2. San Francisco Airport Marriott Waterfront Real-Time Analytics at Scale https://www.druidsummit.org/
  • 3.
  • 4. What do I do? Productionalization Implementation Recommendation Education
  • 5. Ask a lot of Questions ● What is the use-case? ○ Is it a good fit for druid? ● Who are the stakeholders? ○ End users - running queries ○ Data Engineers - ingesting data ○ Cluster Administrators - managing services ● How are they using the cluster? ● Where is the data coming from? ● What are the issues or concerns? ● Where does druid fit in the technology stack?
  • 6. When to use Druid 6 Search platform OLAP ● Real-time ingestion ● Flexible schema ● Full text search ● Batch ingestion ● Efficient storage ● Fast analytic queries Timeseries database ● Optimized for time-based datasets ● Time-based functions
  • 7. When NOT to use Druid 7 OLTP Individual record update/delet e Big join operations
  • 8. Where Druid fits in 8 Data lakes Message buses Raw data Storage Analyze Application
  • 11. Pick your servers Data NodesD ● Large-ish ● Scales with size of data and query volume ● Lots of cores, lots of memory, fast NVMe disk Query NodesQ ● Medium-ish ● Scales with concurrency and # of Data nodes ● Typically CPU bound Master NodesM ● Small-ish Nodes ● Coordinator scales with # of segments ● Overlord scales with # of supervisors and tasks
  • 12. Configure for MAXIMUM PERFORMANCE Data NodesD ● Enable Cache ● Heap/maxDirectMemory size ● druid.processing.buffer.sizeBytes ● druid.processing.numMergeBuffers ● druid.processing.numThreads Query NodesQ ● Disable Caching ● Heap/maxDirectMemory size ● druid.broker.http.numConnections ● druid.processing.numMergeBuffers ● druid.processing.numThreads Master NodesM ● Heap Size
  • 15. Optimize segment size Ideally 300 - 700 mb (~ 5 million rows) To control segment size ● Alter segment granularity ● Specify partition spec ● Use Automatic Compaction
  • 16. Controlling Segment Size ● Number of Tasks - Keep to lowest number that supports max ingestion rate. ● Segment Granularity - Increase if only 1 file per segment and < 200MB "segmentGranularity": "HOUR" ● Max Rows Per Segment - Increase if a single segment is < 200MB "maxRowsPerSegment": 5000000
  • 17. Compaction ● Combines small segments into larger segments ● Useful for late-arriving data ● Task submitted to Overlord { "type" : "compact", "dataSource" : "wikipedia", "interval" : "2017-01-01/2018-01-01" }
  • 18. Rollup ● Pre-aggregation at ingestion time ● Saves space, better compression ● Query performance boost
  • 19. Rollup timestamp page city count sum_added sum_deleted 2011-01-01T00:00:00Z Justin Bieber SF 3 50 61 2011-01-01T00:00:00Z Ke$ha LA 2 46 53 2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88 timestamp page city added deleted 2011-01-01T00:01:35Z Justin Bieber SF 10 5 2011-01-01T00:03:45Z Justin Bieber SF 25 37 2011-01-01T00:05:62Z Justin Bieber SF 15 19 2011-01-01T00:06:33Z Ke$ha LA 30 45 2011-01-01T00:08:51Z Ke$ha LA 16 8 2011-01-01T00:09:17Z Miley Cyrus DC 75 10 2011-01-01T00:11:25Z Miley Cyrus DC 11 25 2011-01-01T00:23:30Z Miley Cyrus DC 22 12 2011-01-01T00:49:33Z Miley Cyrus DC 90 41
  • 20. Summarize with data sketches timestamp page city count sum_ added sum_ deleted userid_sketch 2011-01-01T00:00:00Z Justin Bieber SF 3 50 61 sketch_obj 2011-01-01T00:00:00Z Ke$ha LA 2 46 53 sketch_obj 2011-01-01T00:00:00Z Miley Cyrus DC 4 198 88 sketch_obj timestamp page userid city added deleted 2011-01-01T00:01:3 5Z Justin Bieber user11 SF 10 5 2011-01-01T00:03:4 5Z Justin Bieber user22 SF 25 37 2011-01-01T00:05:6 2Z Justin Bieber user11 SF 15 19 2011-01-01T00:06:3 3Z Ke$ha user33 LA 30 45 2011-01-01T00:08:5 1Z Ke$ha user33 LA 16 8 2011-01-01T00:09:1 7Z Miley Cyrus user11 DC 75 10 2011-01-01T00:11:2 5Z Miley Cyrus user44 DC 11 25 2011-01-01T00:23:3 0Z Miley Cyrus user44 DC 22 12 2011-01-01T00:49:3 3Z Miley Cyrus user55 DC 90 41
  • 21. Choose column types carefully String column indexed fast aggregation fast grouping Numeric column indexed fast aggregation fast grouping
  • 22. Partitioning beyond time ● Druid always partitions by time ● Decide which dimension to partition on… next ● Partition by some dimension you often filter on ● Improves locality, compression, storage size, query performance
  • 25. Use Druid SQL ● Easier to learn/more familiar ● Will attempt to make intelligent query type choices (timeseries vs topN vs groupBy) ● There are some limitations - such as multi-value dimensions, not all aggregations are supported
  • 26. Explain Plan EXPLAIN PLAN FOR SELECT channel, sum(added) FROM wikipedia WHERE commentLength >= 50 GROUP BY channel ORDER BY sum(added) desc LIMIT 3
  • 27. Pick your query carefully ● TimeBoundary - Returns min/max timestamp for given interval. ● Timeseries - When you don’t want to group by dimension ● TopN - When you want to group by a single dimension ○ Approximate if > 1000 dimension values ● GroupBy - Least performant/most flexible ● Scan - For returning streaming raw data ○ Perfect ordering not preserved ● Select - For returning paginated raw data ● Search - Returns dimensions that match text search
  • 28. Using Lookups ● Use lookups when you have dimensions that change to avoid re-indexing data ● Lookups are key/value pairs stored on every node. ● Loaded via file or JDBC connection to external database ● Lookups are loaded into the java heap size, so large lookups need larger heaps
  • 29. Stay in touch 29 @druidio https://imply.io https://druid.apache.org/ Ben Hopp Benjamin.hopp@imply.io LinkedIn: benhopp @implydata
  • 30. roadmap and community update Ben Hopp ben@imply.io
  • 32. Druid 0.17.0 Our first release as a top-level Apache project! 3
  • 33. Druid 0.17.0 Highlights ● Native batch - binary inputs & more ○ Supports non-binary formats such as ORC, Parquet, and Avro ○ Native batch tasks can now read from HDFS ○ Single-dimension range partitioning for parallel native batch ● Compaction improvements ○ Parallel index task split hints and parallel auto-compaction ○ Stateful auto-compaction ● Parallel query merge on brokers ○ Broker can now optionally merge query results in parallel using multiple threads. 4
  • 34. Druid 0.17.0 Highlights ● ...and More! ○ Improved SQL-compatible null handling ○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table ○ Fast historical start with deferred loading of segments until query time ○ New readiness and self-discovery resources ○ Task assignment based on MiddleManager categories ○ Security updates 5
  • 36. Druid 0.16.0 Over 350 new features from 50 contributors! Released September 2019. 7
  • 37. Druid 0.16.0 Highlights ● Native parallel batch shuffle ○ Two-phase shuffle system allows for ‘perfect rollup’ and partitioning on dimensions ● Query vectorization phase one ○ Allows queries to be sped up by reducing the number of method calls ● Indexer process ○ An alternative to the MiddleManager + Peon task execution system which is easier to configure and deploy ● Improved web console ○ Kafka & Kinesis support! ○ Point-and-click reindexing 8
  • 38. Druid 0.17.0 Our first release as a top-level Apache project! Coming soon (really soon). 9
  • 39. Druid 0.17.0 Highlights ● Native batch - binary inputs & more ○ Supports non-binary formats such as ORC, Parquet, and Avro ○ Native batch tasks can now read from HDFS ○ Single-dimension range partitioning for parallel native batch ● Compaction improvements ○ Parallel index task split hints and parallel auto-compaction ○ Stateful auto-compaction ● Parallel query merge on brokers ○ Broker can now optionally merge query results in parallel using multiple threads. 10
  • 40. Druid 0.17.0 Highlights ● ...and More! ○ Improved SQL-compatible null handling ○ New dropwizard emitter which supports counter, gauge, meter, timer and histogram metric types ○ Task supervisors (e.g. Kafka or Kinesis supervisors) are now recorded in the system tables in a new sys.supervisors table ○ Fast historical start with deferred loading of segments until query time ○ New readiness and self-discovery resources ○ Task assignment based on MiddleManager categories ○ Security updates 11
  • 42. …and beyond!! A selection of items planned for future 2020 Druid releases. 13
  • 43. …and beyond!! ● SQL Joins ○ A multi-phase project to add full SQL Join support to Druid. Coming up first - sub-queries and lookups ● Windowed aggregations ○ For example, moving average and cumulative sum aggregations. ● Dynamic query prioritization & laning ○ Mix ‘heavy’ and ‘light’ workloads in the same cluster without heavy workloads blocking light ones. ● Extended query vectorization support ○ Richer support for query vectorization against more query types 14
  • 44. Download Druid community site (new): https://druid.apache.org/ Imply distribution: https://imply.io/get-started 15
  • 46. Stay in touch 17 @druidio Join the community! http://druid.io/community Free training hosted by Imply! https://imply.io/druid-days Follow the Druid project on Twitter!