ULTIMATE JOURNEY TOWARDS
REALTIME DATA PLATFORM
2.5M / s
BORIS TROFIMOV @ SIGMA SOFTWARE
ABOUT ME
Leading DWH @ Oath
Major expertise: Big Data and Enterprise
Cofounder of Odessa JUG
Passionate follower of Scala
Associate professor at ONPU
WHERE IS BIG DATA?
API
BACK OFFICE
CUSTOMER
WEB PORTAL
MOBILE APPS
INTEGRATION
POINTS
INTRODUCING DATA PLATFORM
DOMAIN SERVICE
API
BACK OFFICE
CUSTOMER
WEB PORTAL
MOBILE APPS
MICROSERVICE
CORE DOMAIN SERVICES
INTEGRATION
POINTS
DOMAIN SERVICE
DOMAIN SERVICE
DOMAIN SERVICE
INTRODUCING DATA PLATFORM
DOMAIN SERVICE
API
BACK OFFICE
CUSTOMER
WEB PORTAL
MOBILE APPS
MICROSERVICE
INFRASTRUCTURE SERVICES
SERVICE
DISCOVERY
SHARED
CONFIG
DOMAIN
DEPENDENCY
MANAGEMENT
ACL
MANAGEMENT
MICROSERVICE
CORE DOMAIN SERVICES
INTEGRATION
POINTS
DOMAIN SERVICE
DOMAIN SERVICE
DOMAIN SERVICE
INTRODUCING DATA PLATFORM
BIG DATA ?
API
DATA PLATFORM
BACK OFFICE
CUSTOMER
WEB PORTAL
MOBILE APPS
MICROSERVICE
INFRASTRUCTURE SERVICES
SERVICE
DISCOVERY
SHARED
CONFIG
DOMAIN
DEPENDENCY
MANAGEMENT
ACL
MANAGEMENT
MICROSERVICE
CORE DOMAIN SERVICES
INTEGRATION
POINTS
DOMAIN SERVICE
DOMAIN SERVICE
DOMAIN SERVICE
DOMAIN SERVICE
INTRODUCING DATA PLATFORM
DATA PLATFORM
INTRODUCING DATA PLATFORM
DATA PLATFORM
3rd PARTY
PROVIDERS
PLATFORM
COMPONENTS
INTRODUCING DATA PLATFORM
events
DATA PLATFORM
INTRODUCING DATA PLATFORM
3rd PARTY
PROVIDERS
PLATFORM
COMPONENTS
REPORTING
ANALYTICS
Major mission: organize data
events
ZOOMING IN DATA PLATFORM
INGESTION
MODULE
REPORTING
SERVICE
WAREHOUSE
VALIDATION
ENRICHMENT
MODULE
RAW DATA
AGGREGATIONS
MODULE
FACTS
RAW DATA
DIMENSIONS
ANALYTICS
MODULE
CONFIGURATION
MODULE
DIMENSION
UPDATER
based on © https://blogs.msdn.microsoft.com/agile/2012/07/26/cqrs-journey-guidance-project-released/
7 JUNE
OUR DOMAIN
CORE
PLATFORM
DATA
PLATFORM
VIDEO PLAYERS
CONTENT OWNERS
END USERS
OUR DOMAIN
S3 Data Lake: 5 PB
Vertica: 500 TB
Rows/Table: 600 B
Events/Sec: 2.5 M
Files/Hour/Pipeline: 15 K
Data/Day: 25 TB
DATA LAKE PROCESSING
ORIGINAL PIPELINE
DATA PLATFORM
VERTICA S3 HADOOP NGINX
REPORTING
SERVICE
DATA LAG ~1h
UNITED LAMBDA PLATFORM
DATA PLATFORM
KAFKA SPARK MEMSQL
VERTICA S3 HADOOP NGINX
REPORTING
SERVICE
DATA LAG ~2m
DATA LAG ~1h
WHAT WAS GOOD
DATA DELIVERY TIME 2 min
FINE ON PROD SCALE @ THAT TIME -- 150K/s
PAINFUL SCALE UP TO 1M/s
ONE DOES NOT SIMPLY
GROW 20 TIMES
WHY WE NEEDED CHANGES
ROCKY SCALING
Adding or removing nodes in CDH YARN requires a YARN restart and downtime for apps
Tricky to build quick sandboxes
The latest MemSQL release at the time (5.x) could not operate a cluster with more than 80 nodes
Maximum supported rate was 1M events/s, while the business required 2.5M/s
ZERO TOLERANCE
Faulty EC2 nodes could make Spark or MemSQL get stuck for a while
Buggy HA: even one faulty node could break the entire MemSQL cluster, forcing us to recreate the database and lose data
PUSH approach to write data to Memsql
MONITORING & ALERTING
Find the most relevant metrics
Eliminate FALSE POSITIVE and FALSE NEGATIVE errors
ON A WAY TO 2.5 M / S
MIGRATING SPARK TO EMR
EASY CREATE, EASY DESTROY
• easy to … make bill cost a fortune
MULTIPLE EMR CLUSTERS
• Separating concerns and Isolation
• Better to run single application per EMR cluster
• Simplified auto-scaling rules
STATELESS EMR CLUSTERS
• Do not use local HDFS
CAUTION, EMR!
EASY TO ALLOCATE AND EASY TO LOSE EMR NODE
• Mostly concerns m4.4xl, the most popular instance type
LOSING MASTER NODE – LOSING ENTIRE CLUSTER
• Hard to build a reliable platform spanning multiple AZs [see Fleets model]
• Develop a one-step evacuation procedure to another EMR cluster
LACK OF A SPECIFIC INSTANCE TYPE
• Can be mitigated by fleets model
DEPLOYMENT DETAILS
MASTER
TASK TASK TASK
…
EMR CLUSTER [YARN]
TASK TASK TASK
…
DEPLOYMENT DETAILS
MASTER
S3
YARN
CONFIG
(zip)
TASK TASK TASK
…
EMR CLUSTER [YARN]
TASK TASK TASK
…
DEPLOYMENT DETAILS
SPARK BINARIES
MASTER
S3
YARN
CONFIG
(zip)
TASK TASK TASK
…
EMR CLUSTER [YARN]
DOCKER CONTAINER
DRIVER APP
TASK TASK TASK
…
DEPLOYMENT DETAILS
SPARK BINARIES
MASTER
S3
YARN
CONFIG
(zip)
TASK TASK TASK
…
EMR CLUSTER [YARN]
DOCKER CONTAINER
DRIVER APP
LOCAL
YARN
CONFIG
TASK TASK TASK
…
DEPLOYMENT DETAILS
SPARK BINARIES
MASTER
S3
YARN
CONFIG
(zip)
TASK TASK TASK
…
EMR CLUSTER [YARN]
DOCKER CONTAINER
DRIVER APP
TASK TASK TASK
…
LOCAL
YARN
CONFIG
RANCHER
CDH vs EMR
CDH | EMR
Cannot scale out/in on demand | Is able to scale out/in on demand
No extra cost (for community license) | Extra ~30% to EC2 costs; per-second billing (!)
Adding machines to CDH requires restarting Yarn | No Yarn restart
Easy configuration management via CM | Limited configuration available during EMR creation
Classic Yarn cluster | Ordinary Yarn under hood, imposes EMR-driven way to deploy apps
Single CDH per AZ | EMR cluster on demand as unit of clustering
MAKING SPARK WRITE FASTER
USING CUSTOM HADOOP COMMITTER
• FileOutputCommitter with the v2 algorithm, which avoids the extra file move in HDFS/S3
WRITE DATAFRAME TO HDFS FIRST
• Spark writes to HDFS directly into the partitioned folder and registers the new partition in Hive
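A minimal sketch of how the v2 commit algorithm might be switched on for a Spark job. The property `mapreduce.fileoutputcommitter.algorithm.version` is the standard Hadoop setting, passed through Spark with the `spark.hadoop.` prefix. The helper `to_submit_args` and the way the arguments are assembled are illustrative assumptions, not the deployment code behind these slides.

```python
from typing import Dict, List

# Hadoop setting that selects the FileOutputCommitter algorithm. With version 2,
# task output is moved to the final directory at task commit, so job commit
# does not perform a second round of renames.
COMMITTER_CONF: Dict[str, str] = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}


def to_submit_args(conf: Dict[str, str]) -> List[str]:
    """Render a conf mapping as `--conf key=value` pairs for spark-submit."""
    args: List[str] = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(COMMITTER_CONF)))
```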
WRITING FASTER – FILE FORMATS
MOST STABLE PERFORMANCE ON ORC UNCOMPRESSED
• Spark apps write raw data in ORC
• Presto reads ORC and writes aggregations in ORC
• Replication uses ORC to send the delta to Vertica
BEST PERFORMANCE WITH HDFS BLOCK SIZE AND STRIPE SIZE OF 64M
• Thanks to a strict 6-hour retention policy
ENABLING hive.orc.use-column-names=true
• Simplifies the Spark app: the dataframe is written as is, and Presto accesses columns by name
• Allows the dataframe schema and the database schema to evolve independently
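Why name-based column access decouples the writer from the table definition can be shown with a toy model. This is plain Python, not Presto or ORC code. The functions `read_by_position` and `read_by_name` and the column names are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional, Sequence


def read_by_position(row: Sequence[object], table_cols: List[str]) -> Dict[str, Optional[object]]:
    """Positional mapping: the i-th table column takes the i-th value in the file."""
    return {col: (row[i] if i < len(row) else None) for i, col in enumerate(table_cols)}


def read_by_name(file_cols: List[str], row: Sequence[object],
                 table_cols: List[str]) -> Dict[str, Optional[object]]:
    """Name-based mapping: each table column reads the file column with the same name."""
    by_name = dict(zip(file_cols, row))
    return {col: by_name.get(col) for col in table_cols}


# The writer emits columns in its own order; the table declares a different order.
file_cols = ["batch_id", "country", "views"]
row = (1528364400, "UA", 42)
table_cols = ["batch_id", "views", "country"]

if __name__ == "__main__":
    print(read_by_position(row, table_cols))         # values land in the wrong columns
    print(read_by_name(file_cols, row, table_cols))  # values follow their names
```

With positional mapping, any reordering or insertion on the writer side silently shifts values. With name-based mapping, both sides can change column order independently.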
SPARK PERFORMANCE
ONE EXECUTOR PER YARN NODE
• for better cpu and cache utilization, using 16 vcores (aligning to m4.4xl)
ALIGN RDD PARTITIONS TO VCORES
• Repartition data we read from Kafka [address if there is a skew in Kafka partitions]
SPLIT THE BATCH PROCESSING INTERVAL INTO RESPONSIBILITY ZONES
• Control each interval separately
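A small sketch of the partition arithmetic implied above, assuming one executor per YARN node and 16 vcores per executor as on the slide. The function name and the `waves` parameter (tasks per vcore per stage) are assumptions added for illustration.

```python
def target_partitions(nodes: int, vcores_per_executor: int = 16, waves: int = 1) -> int:
    """Partition count that gives every vcore exactly `waves` tasks per stage.

    Assumes one executor per node, so total vcores = nodes * vcores_per_executor.
    """
    if nodes <= 0 or vcores_per_executor <= 0 or waves <= 0:
        raise ValueError("nodes, vcores_per_executor and waves must be positive")
    return nodes * vcores_per_executor * waves


if __name__ == "__main__":
    # Example cluster size is made up; in a Spark job the result would be
    # the argument passed to repartition().
    print(target_partitions(nodes=10))
```

Repartitioning to a multiple of the total vcore count keeps all cores busy for the whole stage and smooths out skew between Kafka partitions, at the cost of one shuffle.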
BATCH INTERVAL: 1 minute
FETCH FROM KAFKA: 8 seconds
ENRICHMENT: 20 seconds
WRITE TO HIVE: 20 seconds
STUFF/OVERHEAD: 12 seconds
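The time budget above adds up to the full batch interval (8 + 20 + 20 + 12 = 60 seconds). A minimal sketch of checking measured stage durations against that budget follows. The stage keys and the function `over_budget` are hypothetical names, and the measured values in the example are invented.

```python
from typing import Dict

BATCH_INTERVAL_S = 60

# Per-zone budget taken from the slide, in seconds.
STAGE_BUDGET_S: Dict[str, int] = {
    "fetch_from_kafka": 8,
    "enrichment": 20,
    "write_to_hive": 20,
    "overhead": 12,
}

# The zones must fill the batch interval exactly.
assert sum(STAGE_BUDGET_S.values()) == BATCH_INTERVAL_S


def over_budget(measured: Dict[str, float]) -> Dict[str, float]:
    """Return the stages that exceeded their budget, with the excess in seconds."""
    return {
        stage: measured[stage] - budget
        for stage, budget in STAGE_BUDGET_S.items()
        if measured.get(stage, 0) > budget
    }


if __name__ == "__main__":
    sample = {"fetch_from_kafka": 7, "enrichment": 26, "write_to_hive": 20, "overhead": 10}
    print(over_budget(sample))
```

Tracking each zone separately shows which part of the job to tune when the total batch time starts approaching the interval.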
FRIENDLY REMINDER
DATA PLATFORM
KAFKA SPARK MEMSQL
NGINX
REPORTING
SERVICE
INTRODUCING PRESTO
DATA PLATFORM
KAFKA SPARK PRESTO
VERTICA NGINX
REPORTING
SERVICE
DATA LAG ~3m
UNDER HOOD
Aggregations and replications run every minute
Presto uses dimensions hosted externally, in MemSQL with real-time updates
VERTICA
NODE
REPORTING
SERVICE
SPARK, EMR
HDFS
NODE
NODE
COLLOCATED HDFS/PRESTO
PRESTO
REPLICATORS
JENKINS SCHEDULER
MEMSQL
FAULT TOLERANCE
EMR FLEETS MODEL
• New feature
• Lets you focus on cores instead of machines
• Allows provisioning nodes across multiple AZs
SPARK SPECULATION & BLACKLISTING
• Faulty nodes are a total disaster (c)
• Spark feature request to introduce a minimal speculation interval (conflicts with DirectCommitter)
FAULT TOLERANCE
EVENT/BATCH SOURCING
• Spark associates each micro-batch with a batch_id [timestamp]
• batch_id is a Hive partition column
• Aggregating and replicating only missed batches
• After a failure and restart, every component must auto-recover without data loss
SPARK
BATCH 1
HDFS/HIVE
RAW FACT TABLE
BATCH 2
PRESTO
HDFS/HIVE
AGGREGATION TABLE
REPLICATOR
VERTICA
AGGREGATION TABLE
BATCH 1 BATCH 2 BATCH 1
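The "only missed batches" rule can be sketched as a set difference between the batch_ids present upstream and those already present downstream. The function name and the sample batch_ids are hypothetical. How each component lists its batch_ids (for example, by reading Hive partitions or querying Vertica) is left out and is an assumption of this sketch.

```python
from typing import Iterable, List


def missed_batches(upstream: Iterable[int], downstream: Iterable[int]) -> List[int]:
    """batch_ids that exist upstream but have not reached downstream yet, oldest first."""
    return sorted(set(upstream) - set(downstream))


if __name__ == "__main__":
    raw_facts = [1, 2, 3]    # written by Spark
    aggregated = [1, 2]      # produced by Presto
    in_vertica = [1]         # copied by the replicator

    print(missed_batches(raw_facts, aggregated))   # work left for the aggregator
    print(missed_batches(aggregated, in_vertica))  # work left for the replicator
```

Because each stage derives its to-do list from stored state rather than from in-memory progress, a restarted component picks up exactly where it stopped.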
BACKPRESSURE ENABLED
DATA PLATFORM
KAFKA SPARK PRESTO
VERTICA NGINX REPORTING
SERVICE
SPARK STREAMING BACKPRESSURE
• MUST HAVE for variable rate
• Contributed a feature to Spark master: backpressure initial max rate for direct mode
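A minimal sketch of the Spark Streaming settings involved. The three property names are standard Spark configuration keys. The numeric values, the partition count, and the helper `max_records_per_batch` are assumptions made up for illustration.

```python
from typing import Dict

BACKPRESSURE_CONF: Dict[str, str] = {
    # Let Spark adapt the ingestion rate to the observed processing speed.
    "spark.streaming.backpressure.enabled": "true",
    # Rate used for the first batches, before any feedback exists (assumed value).
    "spark.streaming.backpressure.initialRate": "100000",
    # Hard cap in records per second per Kafka partition (assumed value).
    "spark.streaming.kafka.maxRatePerPartition": "20000",
}


def max_records_per_batch(conf: Dict[str, str], kafka_partitions: int,
                          batch_interval_s: int) -> int:
    """Upper bound on the records one batch can pull in direct mode."""
    per_partition = int(conf["spark.streaming.kafka.maxRatePerPartition"])
    return per_partition * kafka_partitions * batch_interval_s


if __name__ == "__main__":
    print(max_records_per_batch(BACKPRESSURE_CONF, kafka_partitions=128, batch_interval_s=60))
```

The cap matters most after an outage: without it, the first batch after a restart would try to read the entire backlog at once.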
HDFS VALVE
• HDFS between Spark and Presto
• Retention policy 12h
PULL WRITE
• Using Vertica's COPY query from HDFS lets Vertica read data at its own rate
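A sketch of what a per-batch pull might look like: the replicator only builds a COPY statement and Vertica fetches the files itself. The table name, the HDFS directory layout (`batch_id=<id>` partition folders), and the function name are assumptions. The exact COPY options should be checked against the Vertica version in use.

```python
def copy_statement(table: str, hdfs_dir: str, batch_id: int) -> str:
    """Build a Vertica COPY statement that reads one batch partition of ORC files."""
    path = f"{hdfs_dir.rstrip('/')}/batch_id={batch_id}/*"
    return f"COPY {table} FROM '{path}' ORC;"


if __name__ == "__main__":
    # Hypothetical table and path, for illustration only.
    print(copy_statement("agg_facts", "hdfs:///warehouse/agg_facts/", 1528364400))
```

With a pull, the database decides how fast to read, so a slow or busy Vertica node delays replication instead of failing writes upstream.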
KAFKA OUTAGES
• Lua writes events directly to Kafka
• Unsent events stored locally and sent to S3
• NiFi periodically resends that data back to Kafka
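The fallback path can be sketched as "try to send, spool on failure, replay later". The real pipeline uses Lua, local files, S3, and NiFi. The class below is a simplified in-memory stand-in with hypothetical names, and `send` is any callable that raises when Kafka is unreachable.

```python
from typing import Callable, List


class FallbackSender:
    """Sends events through `send`; keeps the ones that fail so they can be replayed."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send
        self.spool: List[bytes] = []  # stand-in for local storage / S3

    def publish(self, event: bytes) -> bool:
        """Try to deliver one event; spool it and return False if delivery fails."""
        try:
            self._send(event)
            return True
        except Exception:
            self.spool.append(event)
            return False

    def replay(self) -> int:
        """Resend spooled events; anything that fails again goes back to the spool."""
        pending, self.spool = self.spool, []
        return sum(1 for event in pending if self.publish(event))
```

Replayed events arrive later than live ones, so downstream consumers should not rely on arrival order matching event time.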
I AM A GOD
I HAVE NO IDEA
WHAT’S GOING ON
MONITORING FUNDAMENTALS
FUNDAMENTAL REALTIME METRICS
• IN RATE
• OUT RATE
• CURRENT LAG
• ERRORS RATE
• BATCH PROCESSING TIME
• PIPELINE LATENCY
SEPARATE APP INTRODUCED [aka BANDARLOG]
• Tracks offsets for Kafka, Hive/Presto, and Vertica
• Standalone application
• To be open-sourced soon
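Two of the metrics above, current lag and in/out rate, can be derived from offsets alone. This is a sketch of the arithmetic, not Bandarlog's implementation. The function names and the sample numbers are assumptions.

```python
from typing import Dict


def total_lag(latest: Dict[int, int], committed: Dict[int, int]) -> int:
    """Sum over partitions of (latest offset - committed offset), floored at zero."""
    return sum(max(0, end - committed.get(partition, 0)) for partition, end in latest.items())


def rate_per_second(prev_total: int, curr_total: int, seconds: float) -> float:
    """Average events per second between two samples of a cumulative counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (curr_total - prev_total) / seconds


if __name__ == "__main__":
    latest = {0: 100, 1: 250}     # newest offset per Kafka partition
    committed = {0: 90, 1: 250}   # offset the consumer has processed
    print(total_lag(latest, committed))
    print(rate_per_second(1_000, 151_000, 60))
```

Comparing the in rate with the out rate shows whether the lag is growing or shrinking before it becomes visible as pipeline latency.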
USING DATADOG
• Dashboards, monitors
DASHBOARD EXAMPLE [INGESTION]
DASHBOARD EXAMPLE [AGGREGATIONS]
WHAT WE HAVE ACHIEVED
SCALABLE PRODUCTION
• Ability to grow beyond 1M/s, up to 2.5M/s
STABLE PRODUCTION ENVIRONMENT
• fault tolerant components, easier to recover
LESS EXPENSIVE
• Smaller Spark cluster (-50%)
• Presto cluster is smaller than Memsql-driven one (30%)
SIMPLIFIED MAINTENANCE
• Auto recovery and scaling
• No overnight wake-ups
THANK YOU

More Related Content

What's hot

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
Hari Shreedharan
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
Knoldus Inc.
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
confluent
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
Sachin Aggarwal
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Provectus
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Summit
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
confluent
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Legacy Typesafe (now Lightbend)
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Guido Schmutz
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
Jack Gudenkauf
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
felixcss
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
DataWorks Summit/Hadoop Summit
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
Peerapat Asoktummarungsri
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
Hortonworks
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Databricks
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
Knoldus Inc.
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Databricks
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
P. Taylor Goetz
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Helena Edelson
 

What's hot (20)

Real Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark StreamingReal Time Data Processing Using Spark Streaming
Real Time Data Processing Using Spark Streaming
 
Meet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + KafkaMeet Up - Spark Stream Processing + Kafka
Meet Up - Spark Stream Processing + Kafka
 
So You Want to Write a Connector?
So You Want to Write a Connector? So You Want to Write a Connector?
So You Want to Write a Connector?
 
Apache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault ToleranceApache Spark Streaming: Architecture and Fault Tolerance
Apache Spark Streaming: Architecture and Fault Tolerance
 
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
Data Summer Conf 2018, “Building unified Batch and Stream processing pipeline...
 
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
Spark Streaming: Pushing the throughput limits by Francois Garillot and Gerar...
 
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
Cross the streams thanks to Kafka and Flink (Christophe Philemotte, Digazu) K...
 
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and DatabricksFour Things to Know About Reliable Spark Streaming with Typesafe and Databricks
Four Things to Know About Reliable Spark Streaming with Typesafe and Databricks
 
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms comparedApache Storm vs. Spark Streaming – two Stream Processing Platforms compared
Apache Storm vs. Spark Streaming – two Stream Processing Platforms compared
 
Spark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream ProcessingSpark Streaming & Kafka-The Future of Stream Processing
Spark Streaming & Kafka-The Future of Stream Processing
 
Sparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with SparkSparkly Notebook: Interactive Analysis and Visualization with Spark
Sparkly Notebook: Interactive Analysis and Visualization with Spark
 
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/TridentQuerying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
Querying the Internet of Things: Streaming SQL on Kafka/Samza and Storm/Trident
 
Data Pipeline with Kafka
Data Pipeline with KafkaData Pipeline with Kafka
Data Pipeline with Kafka
 
Hive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it finalHive on spark is blazing fast or is it final
Hive on spark is blazing fast or is it final
 
Streaming SQL
Streaming SQLStreaming SQL
Streaming SQL
 
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
Cassandra and SparkSQL: You Don't Need Functional Programming for Fun with Ru...
 
Introduction to Spark Streaming
Introduction to Spark StreamingIntroduction to Spark Streaming
Introduction to Spark Streaming
 
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim DowlingStructured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
Structured-Streaming-as-a-Service with Kafka, YARN, and Tooling with Jim Dowling
 
Apache storm vs. Spark Streaming
Apache storm vs. Spark StreamingApache storm vs. Spark Streaming
Apache storm vs. Spark Streaming
 
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
Leveraging Kafka for Big Data in Real Time Bidding, Analytics, ML & Campaign ...
 

Similar to Ultimate journey towards realtime data platform with 2.5M events per sec

Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020
b0ris_1
 
Cowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис ТрофімовCowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис Трофімов
Sigma Software
 
Cowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
Cowboy Dating with Big Data or DWH Evolution in Action, Борис ТрофимовCowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
Cowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
Sigma Software
 
osi-oss-dbs.pptx
osi-oss-dbs.pptxosi-oss-dbs.pptx
osi-oss-dbs.pptx
Shivji Kumar Jha
 
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Andrejs Prokopjevs
 
Amazon Aurora TechConnect
Amazon Aurora TechConnect Amazon Aurora TechConnect
Amazon Aurora TechConnect
LavanyaMurthy9
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Carol McDonald
 
VMworld 2015: Advanced SQL Server on vSphere
VMworld 2015: Advanced SQL Server on vSphereVMworld 2015: Advanced SQL Server on vSphere
VMworld 2015: Advanced SQL Server on vSphere
VMworld
 
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...
Flink Forward San Francisco 2018:  Dave Torok & Sameer Wadkar - "Embedding Fl...Flink Forward San Francisco 2018:  Dave Torok & Sameer Wadkar - "Embedding Fl...
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...
Flink Forward
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Dibyendu Bhattacharya
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
Yahoo Developer Network
 
How to Manage Scale-Out Environments with MariaDB MaxScale
How to Manage Scale-Out Environments with MariaDB MaxScaleHow to Manage Scale-Out Environments with MariaDB MaxScale
How to Manage Scale-Out Environments with MariaDB MaxScale
MariaDB plc
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
Databricks
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
Amazon Web Services
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Spark Summit
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
hadooparchbook
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
Spark Summit
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Landon Robinson
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Amazon Web Services
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
Jen Aman
 

Similar to Ultimate journey towards realtime data platform with 2.5M events per sec (20)

Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020Cowboy dating with big data TechDays at Lohika-2020
Cowboy dating with big data TechDays at Lohika-2020
 
Cowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис ТрофімовCowboy dating with big data, Борис Трофімов
Cowboy dating with big data, Борис Трофімов
 
Cowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
Cowboy Dating with Big Data or DWH Evolution in Action, Борис ТрофимовCowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
Cowboy Dating with Big Data or DWH Evolution in Action, Борис Трофимов
 
osi-oss-dbs.pptx
osi-oss-dbs.pptxosi-oss-dbs.pptx
osi-oss-dbs.pptx
 
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
Optimize DR and Cloning with Logical Hostnames in Oracle E-Business Suite (OA...
 
Amazon Aurora TechConnect
Amazon Aurora TechConnect Amazon Aurora TechConnect
Amazon Aurora TechConnect
 
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DBStructured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
Structured Streaming Data Pipeline Using Kafka, Spark, and MapR-DB
 
VMworld 2015: Advanced SQL Server on vSphere
VMworld 2015: Advanced SQL Server on vSphereVMworld 2015: Advanced SQL Server on vSphere
VMworld 2015: Advanced SQL Server on vSphere
 
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...
Flink Forward San Francisco 2018:  Dave Torok & Sameer Wadkar - "Embedding Fl...Flink Forward San Francisco 2018:  Dave Torok & Sameer Wadkar - "Embedding Fl...
Flink Forward San Francisco 2018: Dave Torok & Sameer Wadkar - "Embedding Fl...
 
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark StreamingNear Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
Near Real time Indexing Kafka Messages to Apache Blur using Spark Streaming
 
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
October 2016 HUG: Architecture of an Open Source RDBMS powered by HBase and ...
 
How to Manage Scale-Out Environments with MariaDB MaxScale
How to Manage Scale-Out Environments with MariaDB MaxScaleHow to Manage Scale-Out Environments with MariaDB MaxScale
How to Manage Scale-Out Environments with MariaDB MaxScale
 
Headaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous ApplicationsHeadaches and Breakthroughs in Building Continuous Applications
Headaches and Breakthroughs in Building Continuous Applications
 
Best Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWSBest Practices for Using Apache Spark on AWS
Best Practices for Using Apache Spark on AWS
 
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
Highlights and Challenges from Running Spark on Mesos in Production by Morri ...
 
What no one tells you about writing a streaming app
What no one tells you about writing a streaming appWhat no one tells you about writing a streaming app
What no one tells you about writing a streaming app
 
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
What No One Tells You About Writing a Streaming App: Spark Summit East talk b...
 
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
Spark + AI Summit 2019: Headaches and Breakthroughs in Building Continuous Ap...
 
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
Accelerating Application Development with Amazon Aurora (DAT312-R2) - AWS re:...
 
Spark Uber Development Kit
Spark Uber Development KitSpark Uber Development Kit
Spark Uber Development Kit
 

Ultimate journey towards realtime data platform with 2.5M events per sec

  • 1. ULTIMATE JOURNEY TOWARDS REALTIME DATA PLATFORM 2.5M / s BORIS TROFIMOV @ SIGMA SOFTWARE
  • 2. ABOUT ME: Leading DWH @ Oath; major expertise in Big Data and Enterprise; cofounder of Odessa JUG; passionate follower of Scala; associate professor at ONPU
  • 3. WHERE IS BIG DATA?
  • 4. API BACK OFFICE CUSTOMER WEB PORTAL MOBILE APPS INTEGRATION POINTS INTRODUCING DATA PLATFORM
  • 5. DOMAIN SERVICE API BACK OFFICE CUSTOMER WEB PORTAL MOBILE APPS MICROSERVICE CORE DOMAIN SERVICES INTEGRATION POINTS DOMAIN SERVICE DOMAIN SERVICE DOMAIN SERVICE INTRODUCING DATA PLATFORM
  • 6. DOMAIN SERVICE API BACK OFFICE CUSTOMER WEB PORTAL MOBILE APPS MICROSERVICE INFRASTRUCTURE SERVICES SERVICE DISCOVERY SHARED CONFIG DOMAIN DEPENDENCY MANAGEMENT ACL MANAGEMENT MICROSERVICE CORE DOMAIN SERVICES INTEGRATION POINTS DOMAIN SERVICE DOMAIN SERVICE DOMAIN SERVICE INTRODUCING DATA PLATFORM
  • 7. DOMAIN SERVICE API BACK OFFICE CUSTOMER WEB PORTAL MOBILE APPS MICROSERVICE INFRASTRUCTURE SERVICES SERVICE DISCOVERY SHARED CONFIG DOMAIN DEPENDENCY MANAGEMENT ACL MANAGEMENT MICROSERVICE CORE DOMAIN SERVICES INTEGRATION POINTS DOMAIN SERVICE DOMAIN SERVICE DOMAIN SERVICE INTRODUCING DATA PLATFORM BIG DATA ?
  • 8. API DATA PLATFORM BACK OFFICE CUSTOMER WEB PORTAL MOBILE APPS MICROSERVICE INFRASTRUCTURE SERVICES SERVICE DISCOVERY SHARED CONFIG DOMAIN DEPENDENCY MANAGEMENT ACL MANAGEMENT MICROSERVICE CORE DOMAIN SERVICES INTEGRATION POINTS DOMAIN SERVICE DOMAIN SERVICE DOMAIN SERVICE DOMAIN SERVICE INTRODUCING DATA PLATFORM
  • 11. DATA PLATFORM INTRODUCING DATA PLATFORM 3rd PARTY PROVIDERS PLATFORM COMPONENTS REPORTING ANALYTICS Major mission: organize data events
  • 12. ZOOMING IN DATA PLATFORM INGESTION MODULE REPORTING SERVICE WAREHOUSE VALIDATION ENRICHMENT MODULE RAW DATA AGGREGATIONS MODULE FACTS RAW DATA DIMENSIONS ANALYTICS MODULE CONFIGURATION MODULE DIMENSION UPDATER
  • 13. based on © https://blogs.msdn.microsoft.com/agile/2012/07/26/cqrs-journey-guidance-project-released/
  • 16. OUR DOMAIN (DATA LAKE, PROCESSING): S3 Data Lake 5 PB; Vertica 500 TB; Rows/Table 600 B; Events/Sec 2.5 M; Files/Hour/Pipeline 15 K; Data/Day 25 TB
  • 17. ORIGINAL PIPELINE: DATA PLATFORM (VERTICA, S3, HADOOP, NGINX), REPORTING SERVICE. DATA LAG ~1h
  • 18. UNITED LAMBDA PLATFORM: DATA PLATFORM (KAFKA, SPARK, MEMSQL, VERTICA, S3, HADOOP, NGINX), REPORTING SERVICE. DATA LAG ~2m and DATA LAG ~1h
  • 20. WHAT WAS GOOD: DATA DELIVERY TIME 2 min; FINE ON PROD SCALE @ THAT TIME (150K/s); PAINFUL SCALE UP TO 1M/s
  • 21. ONE DOES NOT SIMPLY GROW 20 TIMES
  • 22. WHY WE NEEDED CHANGES
      ROCKY SCALING
      • Adding/removing nodes to CDH YARN requires a YARN restart and downtime for apps
      • Tricky to build quick sandboxes
      • The latest MemSQL release (5.X) was not able to operate a cluster with > 80 nodes
      • Max supported rate was 1M events/s, while the business required 2.5M/s
      ZERO TOLERANCE
      • Faulty EC2 nodes could make Spark or MemSQL get stuck for a while
      • Buggy HA: even one faulty node could break the entire MemSQL cluster, forcing us to recreate the database and lose data
      • PUSH approach to writing data to MemSQL
      MONITORING & ALERTING
      • Find the most relevant metrics
      • Eliminate FALSE POSITIVE and FALSE NEGATIVE errors
  • 23. ON A WAY TO 2.5 M / S
  • 25. MIGRATING SPARK TO EMR
      EASY CREATE, EASY DESTROY
      • easy to … make the bill cost a fortune
      MULTIPLE EMR CLUSTERS
      • Separation of concerns and isolation
      • Better to run a single application per EMR cluster
      • Simplified auto-scaling rules
      STATELESS EMR CLUSTERS
      • Do not use local HDFS
  • 26. CAUTION, EMR!
      EASY TO ALLOCATE AND EASY TO LOSE AN EMR NODE
      • Mostly concerns m4.4xl as the most popular instance type
      LOSING MASTER NODE – LOSING ENTIRE CLUSTER
      • Hard to build a reliable platform involving multiple AZs [see Fleets model]
      • Develop a one-step evacuation procedure to another EMR cluster
      SHORTAGE OF A SPECIFIC INSTANCE TYPE
      • Can be mitigated by the fleets model
  • 27. DEPLOYMENT DETAILS MASTER TASK TASK TASK … EMR CLUSTER [YARN] TASK TASK TASK …
  • 28. DEPLOYMENT DETAILS MASTER S3 YARN CONFIG (zip) TASK TASK TASK … EMR CLUSTER [YARN] TASK TASK TASK …
  • 29. DEPLOYMENT DETAILS SPARK BINARIES MASTER S3 YARN CONFIG (zip) TASK TASK TASK … EMR CLUSTER [YARN] DOCKER CONTAINER DRIVER APP TASK TASK TASK …
  • 30. DEPLOYMENT DETAILS SPARK BINARIES MASTER S3 YARN CONFIG (zip) TASK TASK TASK … EMR CLUSTER [YARN] DOCKER CONTAINER DRIVER APP LOCAL YARN CONFIG TASK TASK TASK …
  • 31. DEPLOYMENT DETAILS SPARK BINARIES MASTER S3 YARN CONFIG (zip) TASK TASK TASK … EMR CLUSTER [YARN] DOCKER CONTAINER DRIVER APP TASK TASK TASK … LOCAL YARN CONFIG
  • 32. DEPLOYMENT DETAILS SPARK BINARIES MASTER S3 YARN CONFIG (zip) TASK TASK TASK … EMR CLUSTER [YARN] DOCKER CONTAINER DRIVER APP TASK TASK TASK … LOCAL YARN CONFIG RANCHER
  • 33. CDH vs EMR
      CDH: Cannot scale out/in on demand | EMR: Is able to scale out/in on demand
      CDH: No extra cost (for community license) | EMR: Extra ~30% to EC2 costs; per-second billing (!)
      CDH: Adding machines to CDH requires restarting YARN | EMR: No YARN restart
      CDH: Easy configuration management via CM | EMR: Limited configuration available during EMR creation
      CDH: Classic YARN cluster | EMR: Ordinary YARN under the hood, imposes an EMR-driven way to deploy apps
      CDH: Single CDH per AZ | EMR: EMR cluster on demand as the unit of clustering
  • 35. MAKING SPARK WRITE FASTER
      USING CUSTOM HADOOP COMMITTER
      • FileOutputCommitter with the V2 option, to avoid moving files in HDFS/S3
      WRITE DATAFRAME TO HDFS FIRST
      • Spark writes to HDFS directly into the partitioned folder and registers the new partition in Hive
  • 36. WRITING FASTER – FILE FORMATS
      MOST STABLE PERFORMANCE ON UNCOMPRESSED ORC
      • Spark apps write raw data in ORC
      • Presto reads ORC and writes aggregations in ORC
      • Replication uses ORC to send deltas to Vertica
      BEST PERFORMANCE WITH HDFS BLOCK SIZE AND STRIPE SIZE OF 64M
      • Thanks to the strict 6-hour retention policy
      ENABLING hive.orc.use-column-names=true
      • Simplifies the Spark app, allowing it to write the dataframe as is; Presto accesses columns by name
      • Allows the dataframe schema and the database schema to evolve independently
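
The committer setting from slide 35 can be supplied at submit time. The snippet below is a minimal sketch, assuming that the "V2 option" refers to the standard Hadoop property `mapreduce.fileoutputcommitter.algorithm.version`, which Spark forwards to Hadoop when it is prefixed with `spark.hadoop.`. The helper name `spark_conf_args` is hypothetical and does not come from the talk.

```python
# Sketch only: builds spark-submit "--conf" flags from a settings mapping.
# spark_conf_args is a hypothetical helper, not part of Spark.

def spark_conf_args(conf):
    """Turn a {key: value} mapping into a flat list of --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


# Assumption: the talk's "V2 option" is the v2 commit algorithm of
# FileOutputCommitter, selected through this Hadoop property.
COMMITTER_CONF = {
    "spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version": "2",
}
```

Calling `spark_conf_args(COMMITTER_CONF)` yields the two tokens to append to a `spark-submit` command line.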
  • 38. SPARK PERFORMANCE
      ONE EXECUTOR PER YARN NODE
      • For better CPU and cache utilization, using 16 vcores (aligned to m4.4xl)
      ALIGN RDD PARTITIONS TO VCORES
      • Repartition the data read from Kafka [also addresses skew in Kafka partitions, if any]
      SPLIT THE PROCESSING BATCH INTERVAL INTO RESPONSIBILITY ZONES
      • Control each interval separately
      • 1-minute batch: FETCH FROM KAFKA 8 seconds; ENRICHMENT 20 seconds; WRITE TO HIVE 20 seconds; STUFF/OVERHEAD 12 seconds
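
The responsibility zones on slide 38 add up to the 1-minute batch interval (8 + 20 + 20 + 12 = 60 seconds). The sketch below shows one way to check measured stage times against such a budget and to derive a partition count from executors and vcores. The class and function names are hypothetical, and the assumption of one partition per vcore is an illustration rather than the speaker's stated rule.

```python
from dataclasses import dataclass


@dataclass
class BatchBudget:
    """Per-stage time budget (seconds) for one micro-batch interval."""
    interval_s: int
    stages: dict  # stage name -> budgeted seconds

    def slack_s(self):
        """Seconds of the interval not assigned to any stage."""
        return self.interval_s - sum(self.stages.values())

    def over_budget(self, measured):
        """Return {stage: seconds over budget} for stages that ran long."""
        return {
            name: measured[name] - limit
            for name, limit in self.stages.items()
            if measured.get(name, 0) > limit
        }


def target_partitions(executors, vcores_per_executor=16):
    """Assumed rule: one RDD partition per vcore across all executors."""
    return executors * vcores_per_executor


# Budget taken from the slide; stage keys are illustrative names.
SLIDE_BUDGET = BatchBudget(
    interval_s=60,
    stages={"fetch_from_kafka": 8, "enrichment": 20,
            "write_to_hive": 20, "overhead": 12},
)
```

Tracking each zone separately makes it clear which stage is responsible when a batch overruns its interval.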
  • 40. FRIENDLY REMINDER DATA PLATFORM KAFKA SPARK MEMSQL NGINX REPORTING SERVICE
  • 41. INTRODUCING PRESTO DATA PLATFORM KAFKA SPARK PRESTO VERTICA NGINX REPORTING SERVICE DATA LAG ~3m
  • 42. UNDER THE HOOD
      • Aggregations and replications run every minute
      • Presto uses dimensions hosted externally, in MemSQL with realtime updates
      VERTICA NODE REPORTING SERVICE SPARK, EMR HDFS NODE NODE COLLOCATED HDFS/PRESTO PRESTO REPLICATORS JENKINS SCHEDULER MEMSQL
  • 44. FAULT TOLERANCE
      EMR FLEETS MODEL
      • New feature
      • Allows focusing on cores instead of machines
      • Allows provisioning nodes over multiple AZs
      SPARK SPECULATION & BLACKLISTING
      • Faulty nodes are a total disaster (c)
      • Spark feature request to introduce a minimal speculation interval (conflicts with the direct committer)
  • 45. FAULT TOLERANCE: EVENT/BATCH SOURCING
      • Spark associates each micro-batch with a batch_id [timestamp]
      • batch_id is a partition column in Hive
      • Only missed batches are aggregated and replicated
      • After a failure and restart, every component shall auto-recover without data loss
      SPARK BATCH 1 HDFS/HIVE RAW FACT TABLE BATCH 2 PRESTO HDFS/HIVE AGGREGATION TABLE REPLICATOR VERTICA AGGREGATION TABLE BATCH 1 BATCH 2 BATCH 1
  • 46. FAULT TOLERANCE: EVENT/BATCH SOURCING. SPARK BATCH 1 HDFS/HIVE RAW FACT TABLE BATCH 2 BATCH 3 PRESTO BATCH 1 HDFS/HIVE AGGREGATION TABLE BATCH 2 REPLICATOR VERTICA AGGREGATION TABLE BATCH 1
  • 47. FAULT TOLERANCE: EVENT/BATCH SOURCING. SPARK BATCH 1 HDFS/HIVE RAW FACT TABLE BATCH 2 PRESTO HDFS/HIVE AGGREGATION TABLE REPLICATOR VERTICA AGGREGATION TABLE BATCH 1 BATCH 2 BATCH 1 BATCH 3
  • 48. FAULT TOLERANCE: EVENT/BATCH SOURCING. SPARK BATCH 1 HDFS/HIVE RAW FACT TABLE BATCH 2 PRESTO HDFS/HIVE AGGREGATION TABLE REPLICATOR VERTICA AGGREGATION TABLE BATCH 3 BATCH 1 BATCH 2 BATCH 1 BATCH 3
  • 49. FAULT TOLERANCE: EVENT/BATCH SOURCING. SPARK BATCH 1 HDFS/HIVE RAW FACT TABLE BATCH 2 BATCH 3 PRESTO HDFS/HIVE AGGREGATION TABLE REPLICATOR VERTICA AGGREGATION TABLE BATCH 1 BATCH 2 BATCH 1 BATCH 3
  • 50. FAULT TOLERANCE: EVENT/BATCH SOURCING. SPARK BATCH 1 HDFS/HIVE RAW FACTS BATCH 2 BATCH 3 PRESTO HDFS/HIVE AGGREGATED FACTS REPLICATOR VERTICA AGGREGATED FACTS BATCH 1 BATCH 2 BATCH 1 BATCH 3 BATCH 2 BATCH 3
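
The "aggregating and replicating only missed batches" rule can be read as a set difference between the batch ids present upstream and those already present downstream. The sketch below illustrates that idea under stated assumptions: batch ids are comparable values such as timestamps, and `process` is a hypothetical callable standing in for a Presto aggregation or a replication step. It is not the speaker's implementation.

```python
def missed_batches(upstream_ids, downstream_ids):
    """Batch ids present upstream but not yet downstream, oldest first."""
    return sorted(set(upstream_ids) - set(downstream_ids))


def catch_up(upstream_ids, downstream_ids, process):
    """Process every missed batch and return the resulting downstream set.

    A batch is recorded as done only after process() returns, so a batch
    whose processing raises stays "missed" and is retried on the next run.
    """
    done = set(downstream_ids)
    for batch_id in missed_batches(upstream_ids, done):
        process(batch_id)
        done.add(batch_id)
    return done
```

Because each stage only compares ids with the stage before it, a restarted component can resume without any separate checkpoint, provided that reprocessing a batch is idempotent.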
  • 53. BACKPRESSURE ENABLED DATA PLATFORM KAFKA SPARK PRESTO VERTICA NGINX REPORTING SERVICE
  • 54. BACKPRESSURE ENABLED: SPARK STREAMING BACKPRESSURE
      • MUST HAVE for a variable event rate
      • FEATURE contributed to Spark master: initial max rate for backpressure in direct mode
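
To illustrate why an initial max rate matters, the sketch below picks the ingest rate for the next micro-batch from the throughput of the last completed one, and falls back to a configured initial rate while no batch has completed yet. This is a deliberately simplified, proportional-only illustration; Spark Streaming's own backpressure uses a PID-based rate estimator, and the function and parameter names here are hypothetical.

```python
def next_ingest_rate(history, initial_rate, min_rate=100.0):
    """Choose the records/second limit for the next micro-batch.

    history      -- list of (records, processing_seconds) for completed batches
    initial_rate -- limit used before any batch has completed
    min_rate     -- floor so the stream never stalls completely
    """
    if not history:
        # No feedback yet: without an initial cap the first batch could
        # pull the whole backlog at once.
        return initial_rate
    records, seconds = history[-1]
    if records <= 0 or seconds <= 0:
        return initial_rate
    # Ingest no faster than the job has shown it can process.
    return max(min_rate, records / seconds)
```

With a last batch of 1,200,000 records processed in 40 seconds, the next limit is 30,000 records per second.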
  • 55. BACKPRESSURE ENABLED: HDFS VALVE
      • HDFS sits between Spark and Presto
      • Retention policy 12h
  • 56. BACKPRESSURE ENABLED: PULL WRITE
      • Using Vertica’s COPY query from HDFS to let Vertica read data at its own rate
  • 57. BACKPRESSURE ENABLED: KAFKA OUTAGES
      • Lua writes events directly to Kafka
      • Unsent events are stored locally and sent to S3
      • NiFi periodically resends that data back to Kafka
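
The Kafka outage handling on slide 57 follows a spool-and-replay pattern: try the broker first, keep what could not be sent, and resend it later. The sketch below models that pattern in memory. `SpoolingSender` and its `send` callable are hypothetical; in the talk the roles are played by Lua in NGINX, local storage plus S3, and NiFi.

```python
class SpoolingSender:
    """Send events to a broker, spooling them when the broker is down."""

    def __init__(self, send):
        self._send = send   # callable(event); expected to raise on failure
        self.spool = []     # in-memory stand-in for local disk / S3

    def publish(self, event):
        """Try to send; on failure keep the event for later replay."""
        try:
            self._send(event)
            return True
        except Exception:
            self.spool.append(event)
            return False

    def replay(self):
        """Resend spooled events oldest first; stop at the first failure."""
        sent = 0
        while self.spool:
            try:
                self._send(self.spool[0])
            except Exception:
                break
            self.spool.pop(0)
            sent += 1
        return sent
```

Replayed events arrive after newer ones, so downstream consumers should rely on event timestamps rather than arrival order.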
  • 59. I AM A GOD I HAVE NO IDEA WHAT’S GOING ON
  • 60. MONITORING FUNDAMENTALS
      FUNDAMENTAL REALTIME METRICS
      • IN RATE
      • OUT RATE
      • CURRENT LAG
      • ERROR RATE
      • BATCH PROCESSING TIME
      • PIPELINE LATENCY
      SEPARATE APP INTRODUCED [aka BANDARLOG]
      • Tracks offsets for Kafka, Hive/Presto, and Vertica
      • Standalone application
      • To be open sourced soon
      USING DATADOG
      • Dashboards, monitors
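
Two of the metrics above, current lag and in/out rate, can be derived from offset snapshots alone. The sketch below shows the arithmetic, assuming offsets are kept as per-partition dictionaries. The function names are hypothetical and this is not Bandarlog's actual code.

```python
def total_lag(latest_offsets, committed_offsets):
    """Sum over partitions of (latest - committed), floored at zero."""
    return sum(
        max(0, latest - committed_offsets.get(partition, 0))
        for partition, latest in latest_offsets.items()
    )


def rate_per_s(prev_offsets, curr_offsets, elapsed_s):
    """Average events per second between two offset snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = sum(
        curr_offsets[partition] - prev_offsets.get(partition, 0)
        for partition in curr_offsets
    )
    return delta / elapsed_s
```

Applying `rate_per_s` to the producer-side offsets gives the in rate, and applying it to the consumer-side offsets gives the out rate.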
  • 64. WHAT WE HAVE ACHIEVED
      SCALABLE PRODUCTION
      • Ability to grow further, beyond 1M/s up to 2.5M/s
      STABLE PRODUCTION ENVIRONMENT
      • Fault-tolerant components, easier to recover
      LESS EXPENSIVE
      • Smaller Spark cluster (-50%)
      • Presto cluster is smaller than the MemSQL-driven one (30%)
      SIMPLIFIED MAINTENANCE
      • Auto recovery and scaling
      • No overnight wake-ups