6. Fyber - Druid requirements
■ Cube: 80+ dimensions and 20 metrics
■ Performance: query 3 months of data in 6 seconds (3 dimensions)
■ Size: 5 TB of raw data per day to index
8. Data Pipeline
■ Spark streaming from JSON to Parquet on S3
■ Spark batch (on K8s) to clean cardinality, pre-aggregate, and enrich the data
■ Partial data (materialized view)
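The batch pre-aggregation step can be sketched in plain Python (rather than Spark; the event and field names are invented for illustration):

```python
from collections import defaultdict

# Plain-Python sketch of the rollup the Spark batch job performs:
# collapse raw events into one row per unique dimension combination.
raw_events = [
    {"country": "IL", "os": "android", "impressions": 1, "revenue": 0.02},
    {"country": "IL", "os": "android", "impressions": 1, "revenue": 0.03},
    {"country": "US", "os": "ios",     "impressions": 1, "revenue": 0.05},
]

def pre_aggregate(events, dims=("country", "os"), metrics=("impressions", "revenue")):
    """Group events by their dimension values and sum the metrics."""
    rollup = defaultdict(lambda: dict.fromkeys(metrics, 0))
    for event in events:
        key = tuple(event[d] for d in dims)
        for m in metrics:
            rollup[key][m] += event[m]
    return dict(rollup)

print(pre_aggregate(raw_events))  # 3 raw rows collapse to 2 aggregated rows
```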
9. Hour→Day→Week→...
Motivation
The fewer segments you have, the fewer cores are used per query (one core per segment) → you can serve more concurrent users.
BUT if one core reads 700 MB of data while the other cores sit idle, that is also bad design → you need to find the right tuning.
Partitioning: data/segments should be split evenly (beware the long tail...)
By doing aggregation of aggregations we minimize data size and reduce the number of segments:
■ 1 hour: 10 segments of 200 MB
■ 1 day: 100 segments of 220 MB (~50% data reduction compared to 240 × 220 MB)
■ We have 900 cores (30 nodes, 32 cores each), so reading 9,000 segments is problematic
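The segment arithmetic above can be checked with a quick back-of-the-envelope calculation (the segment counts and sizes are the figures from the slides):

```python
# Back-of-the-envelope: rolling up hourly segments into daily segments.
HOURLY_SEGMENTS_PER_HOUR = 10
HOURLY_SEGMENT_MB = 200
DAILY_SEGMENTS = 100
DAILY_SEGMENT_MB = 220
HOURS_PER_DAY = 24

hourly_total_mb = HOURS_PER_DAY * HOURLY_SEGMENTS_PER_HOUR * HOURLY_SEGMENT_MB
daily_total_mb = DAILY_SEGMENTS * DAILY_SEGMENT_MB
reduction = 1 - daily_total_mb / hourly_total_mb

print(f"hourly: {HOURS_PER_DAY * HOURLY_SEGMENTS_PER_HOUR} segments, {hourly_total_mb} MB")
print(f"daily:  {DAILY_SEGMENTS} segments, {daily_total_mb} MB")
print(f"data reduction: {reduction:.0%}")  # roughly the ~50% quoted on the slide

# With one core per segment, a week of hourly data (7 * 240 = 1,680 segments)
# would need more than the 900-core cluster; daily rollup needs only 7 * 100 = 700.
```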
10. “Materialized Views”
Motivation
● Several small cubes in which the dimensions are correlated
○ Row correlation: assume the dimension is country (220 rows); the impact of
■ adding gender is 440 rows
■ adding the country phone prefix (+972 for Israel) adds no new rows
○ Business correlation, e.g. a device-detail cube (OS / Carrier)
● One large cube with all dimensions, used via filter queries rather than topN queries
● Use a cardinality (byRow) aggregator with a timeseries query to measure dimension correlation
● We modified the UI to handle the cube logic by querying the smallest cube that answers the user's dimensions
● Our rule of thumb: ~10M rows per small daily cube (most queries are on daily cubes)
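The "query the smallest cube that answers the user's dimensions" routing rule can be sketched as follows (the cube names, dimension sets, and row counts are invented for illustration):

```python
# Route a query to the smallest cube whose dimensions cover the request.
# Cube catalog is illustrative, not Fyber's actual cube list.
CUBES = {
    "device_detail": {"dims": {"os", "carrier", "country"}, "daily_rows": 2_000_000},
    "geo_gender":    {"dims": {"country", "gender"}, "daily_rows": 440},
    "full":          {"dims": {"os", "carrier", "country", "gender", "app"},
                      "daily_rows": 80_000_000},
}

def pick_cube(requested_dims):
    """Return the name of the smallest cube containing all requested dimensions."""
    candidates = [
        (meta["daily_rows"], name)
        for name, meta in CUBES.items()
        if requested_dims <= meta["dims"]
    ]
    if not candidates:
        raise ValueError(f"no cube covers {requested_dims}")
    return min(candidates)[1]

print(pick_cube({"country", "gender"}))  # smallest covering cube: geo_gender
print(pick_cube({"os", "app"}))          # only the large cube covers it: full
```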
11. Materialized Views
Cube sync: users can see misaligned data when querying the last day, so we need to manage Druid state (in MySQL)
12. Airflow
■ Scheduler
■ Recover from failure
■ UI
■ Each task monitors itself and auto-fixes if needed, including sending atomic alerts per DAG (since Airflow 1.10)
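The self-monitoring-with-auto-fix pattern can be sketched in plain Python (this is not Airflow's API; `run_with_autofix`, `fix`, and `alert` are hypothetical names):

```python
# Sketch: run a task, attempt an auto-fix between retries, and alert
# only once (atomically) if it still fails after the last attempt.
def run_with_autofix(task, fix=None, alert=print, retries=2):
    """Run `task`; on failure apply `fix` and retry, alerting on final failure."""
    for attempt in range(retries + 1):
        try:
            return task()
        except Exception as exc:
            if fix is not None:
                fix()  # attempt to repair the environment before retrying
            if attempt == retries:
                alert(f"task failed after {retries + 1} attempts: {exc}")
                raise
```

For example, a task that fails once and then succeeds returns normally after one retry, and no alert is sent.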
13. We collect Druid client usage metrics such as:
● average query time
● query time range (last 2 days, or last 3 months)
● popular dimensions
This lets us check whether we need to re-tune the number of segments or the cube separation.
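Aggregating that client usage might look like the following sketch (the log records and field names are invented):

```python
from collections import Counter
from statistics import mean

# Sketch of the client-usage aggregation; records are illustrative.
query_log = [
    {"duration_ms": 1200, "range": "last_2_days", "dims": ["country", "os"]},
    {"duration_ms": 5400, "range": "last_3_months", "dims": ["country"]},
    {"duration_ms": 900,  "range": "last_2_days", "dims": ["os", "carrier"]},
]

avg_ms = mean(q["duration_ms"] for q in query_log)
ranges = Counter(q["range"] for q in query_log)
popular_dims = Counter(d for q in query_log for d in q["dims"])

print(f"average query time: {avg_ms:.0f} ms")
print("time ranges:", ranges.most_common())
print("popular dimensions:", popular_dims.most_common(2))
```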
Should we move to Druid's new native ingestion instead of EMR?
Should we move to Druid's new materialized views?
We added anomaly detection on top of Druid (based on https://github.com/yahoo/egads)
Day after deployment