SlideShare a Scribd company logo
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Scalable Incremental Index
for Druid
Anastasia Braginsky, Liran Funaro, Eran Meir, Edward Bortnikov
Yahoo Research
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Just a bit about us
Verizon Media Group (VMG)
Creating what's next in content, advertising, and technology with brands like Yahoo, HuffPost, TechCrunch, etc.
Reaching 900MM consumers.
Very large private cloud for our data. Huge variety of workloads. Open source technologies operated at scale.
Yahoo Research
Working in many research areas to build the breakthroughs that power our products at a global scale.
AI/ML, Computational Advertising, Search, Content Recommendation, User Modeling, Systems, and more.
Scalable Systems Research Team
Part of Yahoo Research Lab, Haifa.
Working on novel algorithms and systems to make our products fast, scalable and reliable.
Steady flow of contributions to open source technologies: HBase, RocksDB, Phoenix, DataSketches, Druid.
2
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Just a bit about Druid at VMG
Used by dozens of products
Content analytics, Ad analytics, User analytics, Network and System analytics, ...
Hosting many petabytes of data
Deployed on Private and Public Cloud infrastructure
>1K hosts
Contributing to Druid code since Yahoo times
Example - integrated DataSketches for real-time aggregations over streaming data.
3
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
This Talk is About
Data ingestion in Druid
… and speeding it up with Oak
… a scalable off-heap concurrent key-value map in Java
4
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Data Model
5
Primary dimension Numeric or Sketch
(approximate aggregate)
Numeric or String
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Physical Storage and Serving
Workload characteristics
Data is mostly write-once, append-only.
Queries mostly focus on the time dimension first.
Data organization
Data is ordered by time (sequence of chunks, in the order of ingestion).
Chunks consist of segments, ordered internally by time.
Druid figures out which segments are required to serve a query.
6
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Druid Architecture
7
ingestion
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Real-Time Ingestion
Works with a variety of data feeds
Kafka, Kinesis, Spark, Local FS, …
Real work - Middle Manager (MM) processes
MM runs multiple indexing tasks (Peons). Task = JVM.
Experimental architecture: Indexer (replacing MM/Peons). Task = thread.
Indexing tasks
Generate data segments and flush them to deep storage (HDFS, S3, …)
Serve queries - ingested data is immediately queryable
8
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Performance Caveats
Segment size crucial for system performance
Queries are slow if segments are too big or too small.
Segment granularity is controlled by indexer task settings.
Current default size (in rows): 5M rows/segment
Current recommended size (in bytes): 300-700 MB/segment.
Druid automatically compacts (merges) small segments in the background.
We want relatively big segments
With sketches, 5M rows easily translates to 10-20 GB’s.
RAM is becoming cheaper (128/256 GB/host are common).
Storage is becoming faster, optimized for big transfers
Scanning many small files is inefficient.
Background compactions of many small files inefficient.
9
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Incremental Index (I2
)
Core data structure used by indexing tasks
Ordered map of data rows - point updates/point lookups/range queries.
Concurrent (thread safe) - allows parallel reads and writes.
Comes in two forms
Plain (collection of facts) or Rollup (on-the-fly aggregation).
Built on top of JDK ConcurrentSkipList (CSL)
Adapts the Druid row-oriented data model to the KV paradigm.
Key = Timestamp + {Dimension}*. Value = {Metric}*.
How far can it scale?
10
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
GC - the Blessing and the Curse
Java Garbage Collection (GC)
Manages objects allocated on the JVM heap.
Awesome for programmer experience (no memory management headache).
… but never really designed for Big Data applications.
GC Perils
Resource waste: steals CPU cycles + RAM headroom.
Poor SLA: GC pauses → software stalls → high tail latencies.
Experiment - GC impact on I2
10M rows → tens of millions of Java objects.
25% of data ingestion time spent on GC.
2x memory consumed by GC mechanisms to sustain reasonable performance.
11
Source: jelastic.com
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Scaling I2
Goal: eliminate the GC overhead in I2
Better scaling with data size (up to tens of GB’s per index).
More efficient use of RAM resources (low overhead).
How: GC-insensitive implementation
Replace the JDK skiplist with an ordered map with off-heap data storage.
… with identical (strong) concurrency guarantees.
… desirably, orders-of-magnitude less metadata objects.
Oak comes to help
Open source, off-heap ordered map by Yahoo.
https://github.com/yahoo/Oak
12
Source: clipartsign.com
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Oak, Introduced
Features
Concurrent ordered map with atomic semantics.
Managed-memory programming experience.
Unmanaged-memory performance.
Design principles
Data (keys and values) stored off-heap.
Metadata (internal search tree) stored on-heap.
Custom low-overhead internal GC.
API
Traditional JDK API (ConcurrentNavigableMap) for backward compatibility.
Zero-copy (ZC) API for the best performance - queries and in-place updates.
Stream API optimized for fast scans.
13
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Oak in a Nutshell
14
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Scaling with Parallelism (11M KV-pairs)
15
Put Get (ZC)
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Scaling with Parallelism (11M KV-pairs)
16
Ascending scan (ZC), 10K pairs/scan Descending scan (ZC), 10K pairs/scan
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
I2
-Oak
I2
implementation on top of OakMap
Configurable at system level (the legacy I2
is still a default).
Minor refactoring of the Druid code (I2
API abstraction).
Implemented as core part of Druid but could be an extension to reduce friction.
Details
Druid I2
schema mapped to OakMap keys and values.
Leverages the ZC API for queries and in-place aggregation.
Auxiliary data structures (e.g., string dictionaries) remain on-heap.
Sketch aggregators currently unsupported.
Project Status
Code complete. Component- and system-level benchmarks.
Community: Git issue. GA expected mid-2020.
17
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Druid Ingestion - Scaling with Data Size
18
Ingesting 1M to 7M tuples
Tuple size 1.25KB
30GB available RAM
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Druid Ingestion - Scaling with RAM
19
Ingesting 7M tuples
Tuple size 1.25KB
RAM scaling 25GB to 32GB
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Druid Ingestion - RAM overhead
20
Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited.
Summary
21
GC overhead is critical for real-time ingestion performance
Solution - off-heap incremental index
Implementation based on Oak.
Better scalability and RAM efficiency vs the legacy.
Contribution to Druid
Implementation under community review, expected GA mid-2020.

More Related Content

What's hot

[253] apache ni fi
[253] apache ni fi[253] apache ni fi
[253] apache ni fi
NAVER D2
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Cloudera, Inc.
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
Uri Laserson
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
Ian Foster
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Marco Garcia
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
James Chen
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
Bart Vandewoestyne
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
Patrick Nicolas
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
Venkata Naga Ravi
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6
longda feng
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
Dzung Nguyen
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
Edureka!
 
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...[db tech showcase Tokyo 2015] D25:The difference between logical and physical...
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...Insight Technology, Inc.
 
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
DataStax
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
Kishor Parkhe
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Douglas Bernardini
 
Beyond Hadoop and MapReduce
Beyond Hadoop and MapReduceBeyond Hadoop and MapReduce
Beyond Hadoop and MapReduce
Alexander Alten
 
Hopsfs 10x HDFS performance
Hopsfs 10x HDFS performanceHopsfs 10x HDFS performance
Hopsfs 10x HDFS performance
Jim Dowling
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weitingWei Ting Chen
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Cloudera, Inc.
 

What's hot (20)

[253] apache ni fi
[253] apache ni fi[253] apache ni fi
[253] apache ni fi
 
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
Hadoop World 2011: Hadoop and Netezza Deployment Models and Case Study - Kris...
 
Hadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciencesHadoop ecosystem for health/life sciences
Hadoop ecosystem for health/life sciences
 
Computing Outside The Box September 2009
Computing Outside The Box September 2009Computing Outside The Box September 2009
Computing Outside The Box September 2009
 
Construindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigDataConstruindo Data Lakes - Visão Prática com Hadoop e BigData
Construindo Data Lakes - Visão Prática com Hadoop e BigData
 
Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作Etu L2 Training - Hadoop 企業應用實作
Etu L2 Training - Hadoop 企業應用實作
 
Hadoop & Big Data benchmarking
Hadoop & Big Data benchmarkingHadoop & Big Data benchmarking
Hadoop & Big Data benchmarking
 
Open Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learningOpen Source Lambda Architecture for deep learning
Open Source Lambda Architecture for deep learning
 
Big Data Benchmarking
Big Data BenchmarkingBig Data Benchmarking
Big Data Benchmarking
 
Jstorm introduction-0.9.6
Jstorm introduction-0.9.6Jstorm introduction-0.9.6
Jstorm introduction-0.9.6
 
Big Data and Hadoop Introduction
 Big Data and Hadoop Introduction Big Data and Hadoop Introduction
Big Data and Hadoop Introduction
 
Hadoop Architecture and HDFS
Hadoop Architecture and HDFSHadoop Architecture and HDFS
Hadoop Architecture and HDFS
 
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...[db tech showcase Tokyo 2015] D25:The difference between logical and physical...
[db tech showcase Tokyo 2015] D25:The difference between logical and physical...
 
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
Using Approximate Data for Small, Insightful Analytics (Ben Kornmeier, Protec...
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapRHadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
Hadoop benchmark: Evaluating Cloudera, Hortonworks, and MapR
 
Beyond Hadoop and MapReduce
Beyond Hadoop and MapReduceBeyond Hadoop and MapReduce
Beyond Hadoop and MapReduce
 
Hopsfs 10x HDFS performance
Hopsfs 10x HDFS performanceHopsfs 10x HDFS performance
Hopsfs 10x HDFS performance
 
20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting20150704 benchmark and user experience in sahara weiting
20150704 benchmark and user experience in sahara weiting
 
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop EcosystemWhy Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
Why Apache Spark is the Heir to MapReduce in the Hadoop Ecosystem
 

Similar to Scalable Incremental Index for Druid

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Hazelcast
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
Leandro Totino Pereira
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Alluxio, Inc.
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Richard McDougall
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
SoftServe
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
KeithETD_CTO
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Bhupesh Bansal
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
Hadoop User Group
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big datasolarisyourep
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
xKinAnx
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
confluent
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant Dance
EMC
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
outstanding59
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldRichard McDougall
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
outstanding59
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
Trayan Iliev
 
Getting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionGetting Started with Splunk Breakout Session
Getting Started with Splunk Breakout Session
Splunk
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
Christopher Curtin
 
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use CaseOracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
Orgad Kimchi
 

Similar to Scalable Incremental Index for Druid (20)

Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of HadoopBig Data, Simple and Fast: Addressing the Shortcomings of Hadoop
Big Data, Simple and Fast: Addressing the Shortcomings of Hadoop
 
Real time analytics
Real time analyticsReal time analytics
Real time analytics
 
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & AlluxioUltra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
Ultra Fast Deep Learning in Hybrid Cloud Using Intel Analytics Zoo & Alluxio
 
Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013Is your cloud ready for Big Data? Strata NY 2013
Is your cloud ready for Big Data? Strata NY 2013
 
Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist Essential Data Engineering for Data Scientist
Essential Data Engineering for Data Scientist
 
EMC Isilon Database Converged deck
EMC Isilon Database Converged deckEMC Isilon Database Converged deck
EMC Isilon Database Converged deck
 
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
Voldemort & Hadoop @ Linkedin, Hadoop User Group Jan 2010
 
Hadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedInHadoop and Voldemort @ LinkedIn
Hadoop and Voldemort @ LinkedIn
 
kumarResume
kumarResumekumarResume
kumarResume
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
Scaling Security on 100s of Millions of Mobile Devices Using Apache Kafka® an...
 
Pivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant DancePivotal: Virtualize Big Data to Make the Elephant Dance
Pivotal: Virtualize Big Data to Make the Elephant Dance
 
App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)App cap2956v2-121001194956-phpapp01 (1)
App cap2956v2-121001194956-phpapp01 (1)
 
Inside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworldInside the Hadoop Machine @ VMworld
Inside the Hadoop Machine @ VMworld
 
App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)App Cap2956v2 121001194956 Phpapp01 (1)
App Cap2956v2 121001194956 Phpapp01 (1)
 
Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9Stream Processing with CompletableFuture and Flow in Java 9
Stream Processing with CompletableFuture and Flow in Java 9
 
Getting Started with Splunk Breakout Session
Getting Started with Splunk Breakout SessionGetting Started with Splunk Breakout Session
Getting Started with Splunk Breakout Session
 
UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015UnConference for Georgia Southern Computer Science March 31, 2015
UnConference for Georgia Southern Computer Science March 31, 2015
 
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use CaseOracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
Oracle Solaris 11 as a BIG Data Platform Apache Hadoop Use Case
 

More from Itai Yaffe

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
Itai Yaffe
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Itai Yaffe
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
Itai Yaffe
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
Itai Yaffe
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Itai Yaffe
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
Itai Yaffe
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
Itai Yaffe
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Itai Yaffe
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
Itai Yaffe
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
Itai Yaffe
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Itai Yaffe
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
Itai Yaffe
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Itai Yaffe
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
Itai Yaffe
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
Itai Yaffe
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
Itai Yaffe
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
Itai Yaffe
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
Itai Yaffe
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
Itai Yaffe
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
Itai Yaffe
 

More from Itai Yaffe (20)

Mastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data ProcessingMastering Partitioning for High-Volume Data Processing
Mastering Partitioning for High-Volume Data Processing
 
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse AutomationSolving Data Engineers Velocity - Wix's Data Warehouse Automation
Solving Data Engineers Velocity - Wix's Data Warehouse Automation
 
Lessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark ApplicationsLessons Learnt from Running Thousands of On-demand Spark Applications
Lessons Learnt from Running Thousands of On-demand Spark Applications
 
Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?Why do the majority of Data Science projects never make it to production?
Why do the majority of Data Science projects never make it to production?
 
Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"Planning a data solution - "By Failing to prepare, you are preparing to fail"
Planning a data solution - "By Failing to prepare, you are preparing to fail"
 
Evaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening NotesEvaluating Big Data & ML Solutions - Opening Notes
Evaluating Big Data & ML Solutions - Opening Notes
 
Big data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real timeBig data serving: Processing and inference at scale in real time
Big data serving: Processing and inference at scale in real time
 
Data Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management MonolithsData Lakes on Public Cloud: Breaking Data Management Monoliths
Data Lakes on Public Cloud: Breaking Data Management Monoliths
 
Unleashing the Power of your Data
Unleashing the Power of your DataUnleashing the Power of your Data
Unleashing the Power of your Data
 
Data Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening NotesData Lake on Public Cloud - Opening Notes
Data Lake on Public Cloud - Opening Notes
 
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
Airflow Summit 2020 - Migrating airflow based spark jobs to kubernetes - the ...
 
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and DruidDevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
DevTalks Reimagined 2020 - Funnel Analysis with Spark and Druid
 
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
Virtual Apache Druid Meetup: AIADA (Ask Itai and David Anything)
 
Introducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom ConnectorsIntroducing Kafka Connect and Implementing Custom Connectors
Introducing Kafka Connect and Implementing Custom Connectors
 
A Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's RoadmapA Day in the Life of a Druid Implementor and Druid's Roadmap
A Day in the Life of a Druid Implementor and Druid's Roadmap
 
Funnel Analysis with Spark and Druid
Funnel Analysis with Spark and DruidFunnel Analysis with Spark and Druid
Funnel Analysis with Spark and Druid
 
The benefits of running Spark on your own Docker
The benefits of running Spark on your own DockerThe benefits of running Spark on your own Docker
The benefits of running Spark on your own Docker
 
Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?Optimizing Spark-based data pipelines - are you up for it?
Optimizing Spark-based data pipelines - are you up for it?
 
Scheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructureScheduling big data workloads on serverless infrastructure
Scheduling big data workloads on serverless infrastructure
 
GraphQL API on a Serverless Environment
GraphQL API on a Serverless EnvironmentGraphQL API on a Serverless Environment
GraphQL API on a Serverless Environment
 

Recently uploaded

一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
yhkoc
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
ahzuo
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
balafet
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
Subhajit Sahu
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
pchutichetpong
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
NABLAS株式会社
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
ewymefz
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
Tiktokethiodaily
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
axoqas
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
slg6lamcq
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
Opendatabay
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
TravisMalana
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
mbawufebxi
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
ewymefz
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
ahzuo
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
enxupq
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
vcaxypu
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
AbhimanyuSinha9
 

Recently uploaded (20)

一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
一比一原版(CU毕业证)卡尔顿大学毕业证成绩单
 
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
一比一原版(UIUC毕业证)伊利诺伊大学|厄巴纳-香槟分校毕业证如何办理
 
Machine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptxMachine learning and optimization techniques for electrical drives.pptx
Machine learning and optimization techniques for electrical drives.pptx
 
Adjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTESAdjusting primitives for graph : SHORT REPORT / NOTES
Adjusting primitives for graph : SHORT REPORT / NOTES
 
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
Data Centers - Striving Within A Narrow Range - Research Report - MCG - May 2...
 
社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .社内勉強会資料_LLM Agents                              .
社内勉強会資料_LLM Agents                              .
 
一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单一比一原版(NYU毕业证)纽约大学毕业证成绩单
一比一原版(NYU毕业证)纽约大学毕业证成绩单
 
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
1.Seydhcuxhxyxhccuuxuxyxyxmisolids 2019.pptx
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
 
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
一比一原版(UniSA毕业证书)南澳大学毕业证如何办理
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
 
Opendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptxOpendatabay - Open Data Marketplace.pptx
Opendatabay - Open Data Marketplace.pptx
 
Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)Malana- Gimlet Market Analysis (Portfolio 2)
Malana- Gimlet Market Analysis (Portfolio 2)
 
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
一比一原版(Bradford毕业证书)布拉德福德大学毕业证如何办理
 
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
一比一原版(UofM毕业证)明尼苏达大学毕业证成绩单
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
 
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
一比一原版(CBU毕业证)卡普顿大学毕业证如何办理
 
一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单一比一原版(QU毕业证)皇后大学毕业证成绩单
一比一原版(QU毕业证)皇后大学毕业证成绩单
 
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
一比一原版(ArtEZ毕业证)ArtEZ艺术学院毕业证成绩单
 
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...Best best suvichar in gujarati english meaning of this sentence as Silk road ...
Best best suvichar in gujarati english meaning of this sentence as Silk road ...
 

Scalable Incremental Index for Druid

  • 1. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Scalable Incremental Index for Druid Anastasia Braginsky, Liran Funaro, Eran Meir, Edward Bortnikov Yahoo Research
  • 2. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Just a bit about us Verizon Media Group (VMG) Creating what's next in content, advertising, and technology with brands like Yahoo, HuffPost, TechCrunch, etc. Reaching 900MM consumers. Very large private cloud for our data. Huge variety of workloads. Open source technologies operated at scale. Yahoo Research Working in many research areas to build the breakthroughs that power our products at a global scale. AI/ML, Computational Advertising, Search, Content Recommendation, User Modeling, Systems, and more. Scalable Systems Research Team Part of Yahoo Research Lab, Haifa. Working on novel algorithms and systems to make our products fast, scalable and reliable. Steady flow of contributions to open source technologies: HBase, RocksDB, Phoenix, DataSketches, Druid. 2
  • 3. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Just a bit about Druid at VMG Used by dozens of products Content analytics, Ad analytics, User analytics, Network and System analytics, ... Hosting many petabytes of data Deployed on Private and Public Cloud infrastructure >1K hosts Contributing to Druid code since Yahoo times Example - integrated DataSketches for real-time aggregations over streaming data. 3
  • 4. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. This Talk is About Data ingestion in Druid … and speeding it up with Oak … a scalable off-heap concurrent key-value map in Java 4
  • 5. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Data Model 5 Primary dimension Numeric or Sketch (approximate aggregate) Numeric or String
  • 6. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Physical Storage and Serving Workload characteristics Data is mostly write-once, append-only. Queries mostly focus on the time dimension first. Data organization Data is ordered by time (sequence of chunks, in the order of ingestion). Chunks consist of segments, ordered internally by time. Druid figures out which segments are required to serve a query. 6
  • 7. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Druid Architecture 7 ingestion
  • 8. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Real-Time Ingestion Works with a variety of data feeds Kafka, Kinesis, Spark, Local FS, … Real work - Middle Manager (MM) processes MM runs multiple indexing tasks (Peons). Task = JVM. Experimental architecture: Indexer (replacing MM/Peons). Task = thread. Indexing tasks Generate data segments and flush them to deep storage (HDFS, S3, …) Serve queries - ingested data is immediately queryable 8
  • 9. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Performance Caveats Segment size crucial for system performance Queries are slow if segments are too big or too small. Segment granularity is controlled by indexer task settings. Current default size (in rows): 5M rows/segment Current recommended size (in bytes): 300-700 MB/segment. Druid automatically compacts (merges) small segments in the background. We want relatively big segments With sketches, 5M rows easily translates to 10-20 GB’s. RAM is becoming cheaper (128/256 GB/host are common). Storage is becoming faster, optimized for big transfers Scanning many small files is inefficient. Background compactions of many small files inefficient. 9
  • 10. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Incremental Index (I2 ) Core data structure used by indexing tasks Ordered map of data rows - point updates/point lookups/range queries. Concurrent (thread safe) - allows parallel reads and writes. Comes in two forms Plain (collection of facts) or Rollup (on-the-fly aggregation). Built on top of JDK ConcurrentSkipList (CSL) Adapts the Druid row-oriented data model to the KV paradigm. Key = Timestamp + {Dimension}*. Value = {Metric}*. How far can it scale? 10
  • 11. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. GC - the Blessing and the Curse Java Garbage Collection (GC) Manages objects allocated on the JVM heap. Awesome for programmer experience (no memory management headache). … but never really designed for Big Data applications. GC Perils Resource waste: steals CPU cycles + RAM headroom. Poor SLA: GC pauses → software stalls → high tail latencies. Experiment - GC impact on I2 10M rows → tens of millions of Java objects. 25% of data ingestion time spent on GC. 2x memory consumed by GC mechanisms to sustain reasonable performance. 11 Source: jelastic.com
  • 12. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Scaling I2 Goal: eliminate the GC overhead in I2 Better scaling with data size (up to tens of GB’s per index). More efficient use of RAM resources (low overhead). How: GC-insensitive implementation Replace the JDK skiplist with an ordered map with off-heap data storage. … with identical (strong) concurrency guarantees. … desirably, orders-of-magnitude less metadata objects. Oak comes to help Open source, off-heap ordered map by Yahoo. https://github.com/yahoo/Oak 12 Source: clipartsign.com
  • 13. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Oak, Introduced Features Concurrent ordered map with atomic semantics. Managed-memory programming experience. Unmanaged-memory performance. Design principles Data (keys and values) stored off-heap. Metadata (internal search tree) stored on-heap. Custom low-overhead internal GC. API Traditional JDK API (ConcurrentNavigableMap) for backward compatibility. Zero-copy (ZC) API for the best performance - queries and in-place updates. Stream API optimized for fast scans. 13
  • 14. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Oak in a Nutshell 14
  • 15. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Scaling with Parallelism (11M KV-pairs) 15 Put Get (ZC)
  • 16. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Scaling with Parallelism (11M KV-pairs) 16 Ascending scan (ZC), 10K pairs/scan Descending scan (ZC), 10K pairs/scan
  • 17. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. I2 -Oak I2 implementation on top of OakMap Configurable at system level (the legacy I2 is still a default). Minor refactoring of the Druid code (I2 API abstraction). Implemented as core part of Druid but could be an extension to reduce friction. Details Druid I2 schema mapped to OakMap keys and values. Leverages the ZC API for queries and in-place aggregation. Auxiliary data structures (e.g., string dictionaries) remain on-heap. Sketch aggregators currently unsupported. Project Status Code complete. Component- and system-level benchmarks. Community: Git issue. GA expected mid-2020. 17
  • 18. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Druid Ingestion - Scaling with Data Size 18 Ingesting 1M to 7M tuples Tuple size 1.25KB 30GB available RAM
  • 19. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Druid Ingestion - Scaling with RAM 19 Ingesting 7M tuples Tuple size 1.25KB RAM scaling 25GB to 32GB
  • 20. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Druid Ingestion - RAM overhead 20
  • 21. Verizon confidential and proprietary. Unauthorized disclosure, reproduction or other use prohibited. Summary 21 GC overhead is critical for real-time ingestion performance Solution - off-heap incremental index Implementation based on Oak. Better scalability and RAM efficiency vs the legacy. Contribution to Druid Implementation under community review, expected GA mid-2020.