Analyzing Petabyte Scale Financial
Data with Apache Pinot and Apache
Kafka
Xiaoman Dong @Stripe
Joey Pereira @Stripe
Agenda
● Tracking funds at Stripe
● Quick intro on Pinot
● Challenges: scale and latency
● Optimizations for a large table
Tracking funds at Stripe
Stripe is complicated
Tracking funds at Stripe
Ledger, the financial source of truth
● Unified data format for financial activity
● Exhaustively covers all activity
● Centralized observability
Tracking funds at Stripe
Modelling as state machines
Successful payment
Tracking funds at Stripe
Observability
Transaction-level investigation
● What action caused the transition
● Why it transitioned
● When it transitioned
● Looking at transitions across multiple systems and teams
Tracking funds at Stripe
Modelling as state machines
Incomplete states are balances
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Detection
(Chart: amount ($$) over the date of each state’s first transition)
Tracking funds at Stripe
Query patterns
● Look up one state transition
○ by ID or other properties
● Look up one state, inspect it
○ listing transitions with sorting, paging, and summaries
● Aggregate many states
This is easy... until we have:
● Hundreds of billions of rows
● States with hundreds of millions of transitions
● A need for fresh, real-time data
● Queries with sub-second latency, serving interactive UIs
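These access patterns can be sketched in miniature against an in-memory transition log (field names here are illustrative, not Stripe's schema; Pinot serves the same patterns as SQL):

```python
# Sketch of the three query patterns over an in-memory transition log.
# Field names (id, state, amount, ts) are hypothetical.
transitions = [
    {"id": "t1", "state": "s1", "amount": 100, "ts": 1},
    {"id": "t2", "state": "s1", "amount": -40, "ts": 2},
    {"id": "t3", "state": "s2", "amount": 75, "ts": 3},
]

# 1) Look up one state transition by ID.
def lookup(tid):
    return next(t for t in transitions if t["id"] == tid)

# 2) Inspect one state: list its transitions with sorting and paging.
def list_state(state, page=0, size=10):
    rows = sorted((t for t in transitions if t["state"] == state),
                  key=lambda t: t["ts"], reverse=True)
    return rows[page * size:(page + 1) * size]

# 3) Aggregate many states: total balance per state.
def balances():
    out = {}
    for t in transitions:
        out[t["state"]] = out.get(t["state"], 0) + t["amount"]
    return out

print(lookup("t2")["amount"])   # -40
print(balances())               # {'s1': 60, 's2': 75}
```

At toy scale all three are trivial; the rest of the talk is about keeping them sub-second at hundreds of billions of rows.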
World before Pinot
Tracking funds at Stripe
Two complicated systems
World with Pinot
Tracking funds at Stripe
● One system serving all use cases
● Simple and elegant
● No more multiple copies of the data
Quick intro on Pinot
Pinot Distributed Architecture
* (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
Our challenges
Query Latency
Data Freshness
Data Scale
Challenge #1:
Data Scale: the Largest Single Table in Pinot
One cluster to serve all major queries
Huge tables
● Each with hundreds of billions of rows
● 700 TB of storage on disk, after 2x replication
Pinot numbers
● Offline segments: ~60k segments per table
● Real-time table: 64 partitions
Hosted on AWS EC2 instances
● ~1000 small hosts (4000 vCPUs) with attached SSDs
● Instance config selected based on performance and cost
The largest Pinot table in the world!
Challenge #2:
Data Freshness: Kafka Ingestion
What Pinot + Kafka Brings
The Pinot broker provides a merged view of offline and real-time data
● Real-time Kafka ingestion delivers second-level data freshness
● The merged view lets us query the whole data set as one single table
Financial Data in Real Time (1/2)
Avoiding duplication is critical for financial systems
● A Flink deduplication job runs as the upstream
● The exactly-once Kafka sink is used in Flink
Exactly-once from Flink to Pinot
● Kafka transactional consumer enabled in Pinot
● Atomic update of the Kafka offset and the Pinot segment
● Result: 1:1 mapping from Flink output to Pinot rows
● No extra effort needed on our side
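A toy model of the atomic segment-plus-offset commit (not Pinot's actual commit protocol) shows why replay yields exactly-once rows: because the segment and its end offset are persisted in one atomic step, recovery resumes exactly where the last committed segment ended.

```python
# Toy model of committing a segment and its Kafka offset atomically.
# On recovery, consumption resumes from the stored offset, so each
# Kafka message maps to exactly one row (no gaps, no duplicates).
class SegmentStore:
    def __init__(self):
        self.committed = []      # list of (segment_rows, end_offset)

    def commit(self, rows, end_offset):
        # Single atomic append: segment and offset live or die together.
        self.committed.append((list(rows), end_offset))

    def resume_offset(self):
        return self.committed[-1][1] if self.committed else 0

def ingest(store, messages):
    # Consume from the last committed offset; buffer, then commit atomically.
    start = store.resume_offset()
    batch = messages[start:]
    if batch:
        store.commit(batch, start + len(batch))

store = SegmentStore()
log = ["m0", "m1", "m2"]
ingest(store, log)           # commits a segment covering offsets 0..3
ingest(store, log)           # nothing new: replay is a no-op
total = sum(len(rows) for rows, _ in store.committed)
print(total)                 # 3 rows for 3 messages: exactly-once
```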
Financial Data in Real Time (2/2)
● Alternative solution: deduplication within Pinot directly
○ Pinot’s real-time upsert feature is a nice option to explore
○ Sustained 200k+ QPS into a Pinot offline table in our experiments
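A minimal sketch of the keyed deduplication idea behind both the upstream Flink job and Pinot's upsert option, assuming each event carries a unique event_id (an assumption for illustration):

```python
# Keep only the first occurrence of each event_id, as a keyed
# deduplication step would. State here is an in-memory set; Flink keeps
# the equivalent in keyed state, Pinot upsert in its primary-key map.
def deduplicate(events):
    seen = set()
    out = []
    for e in events:
        if e["event_id"] not in seen:
            seen.add(e["event_id"])
            out.append(e)
    return out

events = [
    {"event_id": "a", "amount": 10},
    {"event_id": "b", "amount": 20},
    {"event_id": "a", "amount": 10},   # retry/duplicate
]
print([e["event_id"] for e in deduplicate(events)])  # ['a', 'b']
```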
Challenge #3:
Drive Down the Query Latency
Optimizations Applied (1/4)
● Partitioning - Hashing data across Pinot servers
○ The most powerful optimization tool in Pinot
○ Map partitions to servers: Pinot becomes a key-value store
Depending on query type, partitioning can improve query latency by 2x to 10x
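How partition-aware routing prunes servers, as a sketch (the hash below is illustrative; Pinot's partition functions, e.g. Murmur, are configurable):

```python
# With the partition column hashed consistently at ingest and query time,
# the broker only needs to ask the one server that owns the key's
# partition, instead of fanning out to every server.
NUM_PARTITIONS = 4
SERVERS = ["server-0", "server-1", "server-2", "server-3"]

def partition_of(key: str) -> int:
    # Illustrative stable hash; Pinot would use its configured function.
    return sum(key.encode()) % NUM_PARTITIONS

def server_for(key: str) -> str:
    return SERVERS[partition_of(key)]

# A key-value style lookup now touches 1 of 4 servers.
key = "account-42"
print(server_for(key))                     # server-0
assert server_for(key) == server_for(key)  # deterministic routing
```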
Optimizations Applied (2/4)
● Sorting - Organize data between segments
○ Sorting is powerful when done in the Spark ETL job; we can arrange
how the rows are divided into segments
○ Column min/max values can help avoid scanning segments
○ Grouping the same value into the same segment can reduce
storage cost and speed up pre-aggregations
In our production data set, sorting roughly improves aggregation query latency by 2x
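Min/max-based segment pruning can be sketched like this (the segment layout is hypothetical): once the ETL job sorts rows into segments, each segment's min/max covers a narrow range, so most segments are skipped without scanning.

```python
# Each segment keeps min/max metadata for a column; a filter query
# scans only segments whose [min, max] range intersects the predicate.
segments = [
    {"min": 0,   "max": 99,  "rows": list(range(0, 100))},
    {"min": 100, "max": 199, "rows": list(range(100, 200))},
    {"min": 200, "max": 299, "rows": list(range(200, 300))},
]

def query(lo, hi):
    scanned = 0
    hits = []
    for seg in segments:
        if seg["max"] < lo or seg["min"] > hi:
            continue                     # pruned by metadata, no scan
        scanned += 1
        hits.extend(v for v in seg["rows"] if lo <= v <= hi)
    return hits, scanned

hits, scanned = query(150, 160)
print(len(hits), scanned)   # 11 values found, only 1 of 3 segments scanned
```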
Optimization Applied (3/4)
● Bloom filter - Quickly prune out a Pinot segment
○ The best friend of key-value-style lookup queries
○ Works best when there are very few hits in the filter
○ Configurable in Pinot: control the false-positive rate or total size
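A toy Bloom filter for segment pruning (parameters here are illustrative; Pinot exposes the false-positive rate and size as table config):

```python
import hashlib

# Toy Bloom filter: K hash functions setting bits in an M-bit array.
# A segment whose filter reports "absent" can be skipped with certainty;
# "present" may be a false positive, so that segment is then scanned.
M, K = 1024, 3

def _bits(key: str):
    for i in range(K):
        h = hashlib.sha256(f"{i}:{key}".encode()).digest()
        yield int.from_bytes(h[:4], "big") % M

class Bloom:
    def __init__(self):
        self.bits = [False] * M

    def add(self, key):
        for b in _bits(key):
            self.bits[b] = True

    def might_contain(self, key):
        return all(self.bits[b] for b in _bits(key))

bloom = Bloom()
for key in ("txn-1", "txn-2", "txn-3"):
    bloom.add(key)

print(bloom.might_contain("txn-2"))       # True: must scan this segment
print(bloom.might_contain("txn-999"))     # very likely False: pruned
```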
Optimization Applied (4/4)
● Pre-aggregation by star-tree index
○ Pinot supports a specialized pre-aggregation called the “star-tree index”
○ Pre-aggregates several columns to avoid computation at query time
○ The star-tree index balances disk space against query time for
aggregations with multiple dimensions
Query latency improvement
(accounts with billions of transactions):
~30 seconds before vs. 300 milliseconds after
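A simplified model of the star-tree idea: pre-compute aggregates for every dimension combination, with '*' as a wildcard level, so multi-dimension aggregations become lookups instead of row scans (the actual index is a tree with configurable split order and thresholds; this sketch only captures the pre-aggregation effect).

```python
from itertools import product

# Simplified star-tree: pre-aggregate sum(amount) for every combination
# of (currency, type), where '*' stands for "any value" on a dimension.
rows = [
    ("usd", "charge", 100), ("usd", "refund", -30),
    ("eur", "charge", 50),  ("usd", "charge", 25),
]

cube = {}
for currency, typ, amount in rows:
    # Each row contributes to 4 keys: exact, and each '*' wildcard combo.
    for c, t in product((currency, "*"), (typ, "*")):
        cube[(c, t)] = cube.get((c, t), 0) + amount

# Query time: aggregations are O(1) lookups, no row scan.
print(cube[("usd", "*")])      # 95  = 100 - 30 + 25
print(cube[("*", "charge")])   # 175 = 100 + 50 + 25
print(cube[("*", "*")])        # 145 = total
```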
The Combined Power of Four Optimizations
● Together they can reduce query latency to sub-second for any large table
○ Works well for our hundreds of billions of rows
○ Most tables are small and need only some of these optimizations
● We chose the optimizations to speed up all 5 production queries
○ Some queries need only the bloom filter
○ Partitioning and sorting are applied for the critical queries
Optimizing real time ingestion (1/2)
Real-time ingestion needs extra care
With 3 days of real-time data in Pinot, we saw 2 to 3 seconds of added latency
● Pinot real-time segments are often very small
● The number of real-time servers is limited by the Kafka partition count
(max 64 servers in our case)
● Each real-time server ends up with many small segments
● Real-time servers show high I/O and high CPU during queries
Optimizing real time ingestion (2/2)
Latency back to sub-second after adopting tiered storage
● Tiered storage assigns segments to different storage hosts based on time
● Moves real-time segments onto dedicated servers as soon as possible
● Utilizes more servers to process queries over real-time segments
● Avoids query slowdown in Kafka consumers under back pressure
Production Query Latency Chart
Hundreds of billions of rows, ~700 TB of data,
all served at sub-second latency.
Financial Precision
● Precise numbers are critical for financial data processing
● Java BigDecimal is the answer for Pinot
● Pinot supports BigDecimal via BINARY columns (currently)
○ Computation (e.g., sum) is done by UDF-style scalar functions
○ The star-tree index can be applied to BigDecimal columns
○ Works for all our use cases
○ No significant performance penalty observed
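A quick illustration of why fixed-precision decimals matter, with Python's decimal standing in for Java's BigDecimal:

```python
from decimal import Decimal

# Binary floats accumulate representation error; decimals do not.
float_total = sum([0.1] * 10)
decimal_total = sum([Decimal("0.1")] * 10, Decimal("0"))

print(float_total)                     # 0.9999999999999999
print(decimal_total)                   # 1.0
print(float_total == 1.0)              # False
print(decimal_total == Decimal("1"))   # True
```

A cent lost to rounding is unacceptable in a ledger, which is why Pinot stores the exact BigDecimal bytes rather than a double.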
Conclusion
With Pinot and Kafka working together, we have created the largest Pinot
table in the world, representing financial funds-flow graphs.
● Hundreds of billions of edges
● Seconds of data freshness
● Precise financial number support
● Exactly-once Kafka semantics
● Sub-second query latency
Future Plans
● Reduce hardware cost by applying tiered storage to the offline table
○ Use HDD-based hosts for data that is months old
● Multi-region Pinot cluster
● Try out many of Pinot’s exciting new features
Thanks and Questions (We are hiring!)
(Backup Slides)
Summarizing
Tracking funds at Stripe
● Ledger models financial activity as state machines
● Transitions are immutable, append-only logs in Kafka
● Everything is transaction-level
● Incomplete states are represented by balances
● Two core use cases: transaction-level queries and aggregation analytics
● The previous system was unscalable and complex
● Pinot and Kafka work in synergy
Detect problems in hundreds of billions of rows (cont’d)
How do we detect issues in a graph of half a trillion nodes?
1) Sum all money in/out of each node, and focus only on the non-zero nodes
Now we have 20 million nodes with a non-zero sum; how do we analyze them?
2) Group by
a) Day of first transaction seen -- a time series
b) Sign of the sum (negative/positive flow)
c) Node properties such as type
We now have a time series plus fields we can slice and dice: an OLAP cube.
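The detection pipeline can be sketched end to end (the node records below are hypothetical):

```python
from collections import defaultdict

# Step 1: sum money in/out per node, keep only non-zero nodes.
# Step 2: group the survivors by (first-seen day, sign, node type).
flows = [
    ("n1", "merchant", "2021-05-01", 100),
    ("n1", "merchant", "2021-05-02", -100),   # n1 clears to zero
    ("n2", "bank",     "2021-05-01", 50),
    ("n3", "merchant", "2021-05-02", -20),
]

sums, first_day, node_type = defaultdict(int), {}, {}
for node, typ, day, amount in flows:
    sums[node] += amount
    first_day.setdefault(node, day)
    node_type[node] = typ

stuck = {n: s for n, s in sums.items() if s != 0}

cube = defaultdict(int)
for n, s in stuck.items():
    sign = "positive" if s > 0 else "negative"
    cube[(first_day[n], sign, node_type[n])] += s

print(stuck)        # {'n2': 50, 'n3': -20}
print(dict(cube))
```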
Modelling as state machines
Tracking funds at Stripe
(Diagram: transitions and state balances)
Modelling as state machines
Balances of incomplete payment
Tracking funds at Stripe
Modelling as state machines
Balances of successful payment
Tracking funds at Stripe
Observability
Aggregating state balances
Tracking funds at Stripe
Why is this challenging?
Tracking funds at Stripe
● Data volume: handling hundreds of billions of records
● Data freshness: real-time processing
● Query latency: making analytics usable for interactive internal UIs
● Achieving all three at once: difficult!
Modelling as state machines
Dozens and dozens of states
Tracking funds at Stripe
Double-Entry Bookkeeping
● Internal funds flow is represented by a directed graph
● Each graph edge is recorded as a double-entry bookkeeping entry
● Nodes in the graph are modeled as accounts
● Accounts should eventually have zero balances
Detect problems in hundreds of billions of rows
Money in/out of a graph node should sum to zero (“cleared”).
Stuck funds over time = revenue loss
● One card swipe could create 10+ nodes
● Hundreds of billions of unique nodes, and increasing
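The clearing invariant (money into and out of each node sums to zero) can be checked with a sketch, using hypothetical double-entry records where each edge debits one account and credits another:

```python
from collections import defaultdict

# Each funds-flow edge is a double entry: debit one account, credit
# another. The total across all accounts is always zero; an account
# that never clears back to zero indicates stuck funds.
edges = [
    ("customer", "stripe_clearing", 100),   # card swipe
    ("stripe_clearing", "merchant", 97),    # payout
    ("stripe_clearing", "fees", 3),         # fee
]

balances = defaultdict(int)
for debit, credit, amount in edges:
    balances[debit] -= amount
    balances[credit] += amount

assert sum(balances.values()) == 0          # double-entry invariant
cleared = [a for a, b in balances.items() if b == 0]
print(cleared)                              # ['stripe_clearing']
```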
Lessons Learned
● Metadata becomes heavy for huge tables
○ An O(n²) algorithm is not good when processing 60k segments
○ Avoid sending 1k+ segment names across 100+ servers
○ Metadata is important when aiming for sub-second latency
● Tail effect on p99/p95 latencies when we have 1000 servers
○ Occasional hiccups on a single server become high-probability events
and drag down p99/p95 query latency
○ Limit the set of servers queried to as few as possible
(partitioning, server grouping, etc.)
Clearing Time Series (Exploring)
Pinot Segment File Storage
Financial Data in Real Time (1/2)
● We have an upstream Flink deduplication job in place
● No duplication allowed
○ Pinot’s real-time primary key is a nice option to explore
○ Sustained 200k+ QPS into Pinot offline tables in our
deduplication experiments (after optimization)
○ An upstream Flink deduplication job may be the best choice
● Exactly-once consumption from Kafka to Pinot
○ Kafka transactional consumer enabled in Pinot
○ 1:1 mapping of Kafka messages to table rows
○ Critical for financial data processing
Table Design Optimization Iterations
● It takes 2 to 3 days for the Spark ETL job to
process the full data set
● Scale up only after the design is optimized
○ Shadow production queries
○ Rebuild the whole data set when needed
● General rule of thumb:
the fewer segments scanned, the better
Kafka Ingestion Optimization (2/2)
● Partitioning/sharding in real-time tables (experimented)
○ Needs a streaming job to shuffle the Kafka topic by key
○ Helps query performance for the real-time table
○ Worth adopting
● Merging small segments into large segments
○ Needs a cron-style job to do the work
○ Helps pruning and scanning
○ Not a bottleneck for us
Navigating Private Network Connectivity Options for Kafka Clusters
HostedbyConfluent
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
HostedbyConfluent
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
HostedbyConfluent
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
HostedbyConfluent
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
HostedbyConfluent
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
HostedbyConfluent
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
HostedbyConfluent
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
HostedbyConfluent
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
HostedbyConfluent
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
HostedbyConfluent
 

More from HostedbyConfluent (20)

Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Renaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit LondonRenaming a Kafka Topic | Kafka Summit London
Renaming a Kafka Topic | Kafka Summit London
 
Evolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at TrendyolEvolution of NRT Data Ingestion Pipeline at Trendyol
Evolution of NRT Data Ingestion Pipeline at Trendyol
 
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking TechniquesEnsuring Kafka Service Resilience: A Dive into Health-Checking Techniques
Ensuring Kafka Service Resilience: A Dive into Health-Checking Techniques
 
Exactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and KafkaExactly-once Stream Processing with Arroyo and Kafka
Exactly-once Stream Processing with Arroyo and Kafka
 
Fish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit LondonFish Plays Pokemon | Kafka Summit London
Fish Plays Pokemon | Kafka Summit London
 
Tiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit LondonTiered Storage 101 | Kafla Summit London
Tiered Storage 101 | Kafla Summit London
 
Building a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And WhyBuilding a Self-Service Stream Processing Portal: How And Why
Building a Self-Service Stream Processing Portal: How And Why
 
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
From the Trenches: Improving Kafka Connect Source Connector Ingestion from 7 ...
 
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
Future with Zero Down-Time: End-to-end Resiliency with Chaos Engineering and ...
 
Navigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka ClustersNavigating Private Network Connectivity Options for Kafka Clusters
Navigating Private Network Connectivity Options for Kafka Clusters
 
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data PlatformApache Flink: Building a Company-wide Self-service Streaming Data Platform
Apache Flink: Building a Company-wide Self-service Streaming Data Platform
 
Explaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy PubExplaining How Real-Time GenAI Works in a Noisy Pub
Explaining How Real-Time GenAI Works in a Noisy Pub
 
TL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit LondonTL;DR Kafka Metrics | Kafka Summit London
TL;DR Kafka Metrics | Kafka Summit London
 
A Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSLA Window Into Your Kafka Streams Tasks | KSL
A Window Into Your Kafka Streams Tasks | KSL
 
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing PerformanceMastering Kafka Producer Configs: A Guide to Optimizing Performance
Mastering Kafka Producer Configs: A Guide to Optimizing Performance
 
Data Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and BeyondData Contracts Management: Schema Registry and Beyond
Data Contracts Management: Schema Registry and Beyond
 
Code-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink AppsCode-First Approach: Crafting Efficient Flink Apps
Code-First Approach: Crafting Efficient Flink Apps
 
Debezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC EcosystemDebezium vs. the World: An Overview of the CDC Ecosystem
Debezium vs. the World: An Overview of the CDC Ecosystem
 
Beyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local DisksBeyond Tiered Storage: Serverless Kafka with No Local Disks
Beyond Tiered Storage: Serverless Kafka with No Local Disks
 

Recently uploaded

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
OnBoard
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
Cheryl Hung
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
James Anderson
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
Abida Shariff
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
Product School
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 

Recently uploaded (20)

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...Mission to Decommission: Importance of Decommissioning Products to Increase E...
Mission to Decommission: Importance of Decommissioning Products to Increase E...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Leading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdfLeading Change strategies and insights for effective change management pdf 1.pdf
Leading Change strategies and insights for effective change management pdf 1.pdf
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
 
Key Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdfKey Trends Shaping the Future of Infrastructure.pdf
Key Trends Shaping the Future of Infrastructure.pdf
 
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...
 
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
 
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptxIOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
IOS-PENTESTING-BEGINNERS-PRACTICAL-GUIDE-.pptx
 
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 
Essentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with ParametersEssentials of Automations: Optimizing FME Workflows with Parameters
Essentials of Automations: Optimizing FME Workflows with Parameters
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
 
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
De-mystifying Zero to One: Design Informed Techniques for Greenfield Innovati...
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 

Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka | Xiaoman Dong and Joey Pereira, Stripe

  • 1. Analyzing Petabyte Scale Financial Data with Apache Pinot and Apache Kafka Xiaoman Dong @Stripe Joey Pereira @Stripe
  • 2. Agenda ● Tracking funds at Stripe ● Quick intro on Pinot ● Challenges: scale and latency ● Optimizations for a large table
  • 6. Ledger, the financial source of truth ● Unified data format for financial activity ● Exhaustively covers all activity ● Centralized observability Tracking funds at Stripe
  • 7. Ledger, the financial source of truth ● Unified data format for financial activity ● Exhaustively covers all activity ● Centralized observability Tracking funds at Stripe
  • 8. Modelling as state machines Successful payment Tracking funds at Stripe
  • 9. Modelling as state machines Successful payment Tracking funds at Stripe
  • 10. Modelling as state machines Successful payment Tracking funds at Stripe
  • 11. Modelling as state machines Successful payment Tracking funds at Stripe
  • 12. Modelling as state machines Successful payment Tracking funds at Stripe
  • 13. ● What action caused the transition. ● Why it transitioned. ● When it transitioned. ● Looking at transitions across multiple systems and teams. Observability Transaction-level investigation Tracking funds at Stripe
  • 14. Modelling as state machines Incomplete states are balances Tracking funds at Stripe
  • 15. Modelling as state machines Incomplete states are balances Tracking funds at Stripe
  • 17. Observability Tracking funds at Stripe Detection Date of state’s first transition Amount ($$)
  • 18. ● Look up one state transition ○ by ID or other properties ● Look up one state, inspect it ○ listing transitions with sorting, paging, and summaries ● Aggregate many states Query patterns Tracking funds at Stripe
  • 19. ● Look up one state transition ○ by ID or other properties ● Look up one state, inspect it ○ listing transitions with sorting, paging, and summaries ● Aggregate many states This is easy... until we have: ● Hundreds of billions of rows ● States with hundreds of millions of transitions ● Need for fresh, real-time data ● Queries with sub-second latency, serving interactive UI Query patterns Tracking funds at Stripe
  • 20. World before Pinot Tracking funds at Stripe Two complicated systems
  • 21. World before Pinot Tracking funds at Stripe Two complicated systems
  • 22. World with Pinot Tracking funds at Stripe ● One system for serving all cases ● Simple and elegant ● No more multiple copies of data
  • 23. Quick intro on Pinot
  • 24. Pinot Distributed Architecture * (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
  • 25. Pinot Distributed Architecture * (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
  • 26. Pinot Distributed Architecture * (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
  • 27. Pinot Distributed Architecture * (courtesy of blog https://www.confluent.io/blog/real-time-analytics-with-kafka-and-pinot/ )
  • 29. Our challenges Query Latency Data Freshness Data Scale
  • 30. Challenge #1: Data Scale: the Largest Single Table in Pinot
  • 31. One cluster to serve all major queries Huge tables ● Each with more than hundreds of billions of rows ● 700TB storage on disk, after 2x replication Pinot numbers ● Offline segments: ~60k segments per table ● Real-time table: 64 partitions Hosted by AWS EC2 instances ● ~1000 small hosts (4000 vCPU) with attached SSD ● Instance config selected based on performance and cost
  • 32. One cluster to serve all major queries Huge tables ● Each with more than hundreds of billions of rows ● 700TB storage on disk, after 2x replication Pinot numbers ● Offline segments: ~60k segments per table ● Real-time table: 64 partitions Hosted by AWS EC2 instances ● ~1000 small hosts (4000 vCPU) with attached SSD ● Instance config selected based on performance and cost Largest Pinot table in the world!
  • 33. Challenge #2: Data freshness: Kafka Ingestion
  • 34. What Pinot + Kafka Brings Pinot broker provides a merged view of offline and real-time data ● Real-time Kafka ingestion comes with second-level data freshness ● The merged view lets us query the whole data set as one single table
  • 35. Financial Data in Real Time (1/2) Avoiding duplication is critical for financial systems ● A Flink deduplication job as upstream ● Exactly-once Kafka sink used in Flink Exactly-once from Flink to Pinot ● Kafka transactional consumer enabled in Pinot ● Atomic update of Kafka offset and Pinot segment ● Result: 1:1 mapping from Flink output to Pinot ● No extra effort needed on our side
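The deduplication idea can be sketched as a keyed filter. This is a toy illustration only; the names (`Transition`, `dedupe`) are hypothetical, and the real pipeline is a Flink job plus Kafka transactions, not an in-memory set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Transition:
    id: str       # globally unique transition ID
    amount: int   # amount in minor units (e.g., cents)

def dedupe(stream):
    """Emit each transition exactly once, keyed by its unique ID."""
    seen = set()
    for t in stream:
        if t.id not in seen:
            seen.add(t.id)
            yield t

# The same transition delivered twice collapses to a single row downstream.
events = [Transition("t1", 100), Transition("t1", 100), Transition("t2", -100)]
unique = list(dedupe(events))
```

With exactly-once delivery, every row Flink emits lands in Pinot exactly once, which is what makes the 1:1 mapping above possible.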
  • 36. Financial Data in Real Time (2/2) ● Alternative solution: deduplication within Pinot directly ○ Pinot’s real-time upsert feature is a nice option to explore ○ Sustained 200k+ QPS into Pinot offline tables in our experiments
  • 37. Challenge #3: Drive Down the Query Latency
  • 38. Optimizations Applied (1/4) ● Partitioning - Hashing data across Pinot servers ○ The most powerful optimization tool in Pinot ○ Map partitions to servers: Pinot becomes a key-value store
  • 39. Optimizations Applied (1/4) ● Partitioning - Hashing data across Pinot servers ○ The most powerful optimization tool in Pinot ○ Map partitions to servers: Pinot becomes a key-value store Depending on query type, partitioning can improve query latency by 2x ~ 10x
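The effect of partitioning can be illustrated with a toy routing sketch. The hash function and the partition-to-server mapping below are hypothetical stand-ins; Pinot's partition functions and segment assignment are configurable.

```python
import zlib

NUM_PARTITIONS = 64
SERVERS = [f"server-{i}" for i in range(8)]  # hypothetical hosts

def partition_for(key: str) -> int:
    # Stable hash of the lookup key (illustrative; Pinot's is configurable).
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

def server_for(partition: int) -> str:
    # With a fixed partition-to-server mapping, a point lookup touches
    # exactly one server instead of fanning out to the whole cluster.
    return SERVERS[partition % len(SERVERS)]

target = server_for(partition_for("txn_12345"))
```

This is why partitioned lookups behave like a key-value store: the broker prunes every server that cannot hold the key's partition.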
  • 40. Optimizations Applied (2/4) ● Sorting - Organize data between segments ○ Sorting is powerful when done in the Spark ETL job; we can arrange how rows are divided into segments ○ Column min/max values can help avoid scanning segments ○ Grouping the same value into the same segment can reduce storage cost and speed up pre-aggregations
  • 41. Optimizations Applied (2/4) ● Sorting - Organize data between segments ○ Sorting is powerful when done in the Spark ETL job; we can arrange how rows are divided into segments ○ Column min/max values can help avoid scanning segments ○ Grouping the same value into the same segment can reduce storage cost and speed up pre-aggregations In our production data set, sorting roughly improves aggregation query latency by 2x
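Min/max-based segment pruning, in miniature (the segment metadata below is hypothetical): because sorting gives each segment a narrow value range, segments whose range does not overlap the query filter can be skipped without scanning.

```python
segments = [  # hypothetical per-segment column metadata
    {"name": "seg_0", "min": 0,    "max": 999},
    {"name": "seg_1", "min": 1000, "max": 1999},
    {"name": "seg_2", "min": 2000, "max": 2999},
]

def prune(segments, lo, hi):
    """Keep only segments whose [min, max] range overlaps the query range."""
    return [s["name"] for s in segments if s["max"] >= lo and s["min"] <= hi]

# A range query over [1500, 1600] only needs to scan one segment.
to_scan = prune(segments, 1500, 1600)
```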
  • 42. Optimization Applied (3/4) ● Bloom filter - Quickly prune out a Pinot segment ○ Best friend of key-value style lookup queries ○ Works best when there are very few hits in the filter ○ Configurable in Pinot: control false positive rate or total size
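The pruning idea can be shown with a toy bloom filter. This is a simplified illustration only; Pinot's implementation, sizing, and hash functions differ.

```python
import hashlib

class Bloom:
    """Tiny bloom filter: no false negatives, tunable false positives."""

    def __init__(self, bits=1024, hashes=3):
        self.bits, self.hashes, self.array = bits, hashes, 0

    def _positions(self, key):
        for i in range(self.hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.bits

    def add(self, key):
        for p in self._positions(key):
            self.array |= 1 << p

    def might_contain(self, key):
        return all(self.array >> p & 1 for p in self._positions(key))

# One filter per segment over the key column: a lookup skips any segment
# whose filter definitely does not contain the key.
seg_filter = Bloom()
seg_filter.add("txn_a")
hit = seg_filter.might_contain("txn_a")     # present keys always pass
miss = seg_filter.might_contain("txn_zzz")  # almost certainly pruned
```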
  • 43. Optimization Applied (4/4) ● Pre-aggregation by star tree index ○ Pinot supports a specialized pre-aggregation called “star-tree index” ○ Pre-aggregates several columns to avoid computation during query ○ Star tree index balances between disk space and query time for aggregations with multiple dimensions
  • 44. Optimization Applied (4/4) ● Pre-aggregation by star tree index ○ Pinot supports a specialized pre-aggregation called “star-tree index” ○ Pre-aggregates several columns to avoid computation during query ○ Star tree index balances between disk space and query time for aggregations with multiple dimensions Query latency improvement (accounts with billion-level transactions): ~30 seconds vs. 300 milliseconds
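The star-tree idea, in miniature: sums are precomputed per dimension combination, with a wildcard rolling up each dimension, so an aggregation reads a few pre-aggregated rows instead of scanning raw records. The data and dimensions below are hypothetical, and Pinot's actual star-tree index is more sophisticated (configurable split order and size thresholds).

```python
from itertools import product

rows = [  # (country, currency, amount) -- hypothetical raw data
    ("US", "usd", 100), ("US", "eur", 50), ("CA", "usd", 30),
]

def build_star_tree(rows):
    """Pre-aggregate every dimension combination, with '*' as a rollup."""
    agg = {}
    for country, currency, amount in rows:
        for c1, c2 in product((country, "*"), (currency, "*")):
            agg[(c1, c2)] = agg.get((c1, c2), 0) + amount
    return agg

tree = build_star_tree(rows)
total_us = tree[("US", "*")]    # answered without touching raw rows
grand_total = tree[("*", "*")]
```

The trade-off is exactly the one named above: the materialized combinations cost disk space but make multi-dimension aggregations near-constant time.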
  • 45. The Combined Power of Four Optimizations ● They can reduce query latency to sub-second for any large table ○ Works well for our hundreds of billions of rows ○ Most tables are smaller, and only some of the optimizations are needed ● We chose the optimizations to speed up all 5 production queries ○ Some queries need only the bloom filter ○ Partitioning and sorting are applied for critical queries
  • 46. Real time ingestion needs extra care
  • 47. Optimizing real time ingestion (1/2) With 3 days of real-time data in Pinot, we saw 2~3 seconds of added latency ● Pinot real-time segments are often very small ● The number of real-time servers is limited by the Kafka partition count (max 64 servers in our case) ● Each real-time server ends up with many small segments ● Real-time servers have high I/O and high CPU during queries
  • 48. Optimizing real time ingestion (2/2) Latency back to sub-second after adopting tiered storage ● Tiered storage assigns segments to different storage hosts based on time ● Moves real-time segments onto dedicated servers as soon as possible ● Uses more servers to process queries over real-time segments ● Avoids query slowdowns on Kafka consumers under backpressure
  • 49. Production Query Latency Chart Hundreds of billions of rows, ~700 TB of data, all served with sub-second latency.
  • 50. Financial Precision ● Precise numbers are critical for financial data processing ● Java BigDecimal is the answer for Pinot ● Pinot supports BigDecimal via BINARY columns (currently) ○ Computation (e.g., sum) is done by UDF-style scalar functions ○ The star-tree index can be applied to BigDecimal columns ○ Works for all our use cases ○ No significant performance penalty observed
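A quick illustration of why exact decimal arithmetic matters for ledger balances, using Python's decimal module here as a stand-in for Java's BigDecimal:

```python
from decimal import Decimal

amounts = ["0.10"] * 3  # three ten-cent entries

# Binary floats accumulate rounding error...
float_sum = sum(float(a) for a in amounts)   # 0.30000000000000004
# ...while decimal arithmetic stays exact.
exact_sum = sum(Decimal(a) for a in amounts)

assert float_sum != 0.3
assert exact_sum == Decimal("0.30")
```

A drift of even one part in 10^16 is unacceptable when the invariant being checked is "this account sums to exactly zero".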
  • 51. With Pinot and Kafka working together, we have created the largest Pinot table in the world, representing financial funds-flow graphs. ● With hundreds of billions of edges ● Seconds of data freshness ● Support for precise financial numbers ● Exactly-once Kafka semantics ● Sub-second query latency Conclusion
  • 52. Future Plans ● Reduce hardware cost by applying tiered storage in offline table ○ Use HDD-based hosts for data months old ● Multi-region Pinot cluster ● Try out many of Pinot’s exciting new features
  • 53. Thanks and Questions (We are hiring!)
  • 55. ● Ledger models financial activity as state machines ● Transitions are immutable append-only logs in Kafka ● Everything is transaction-level ● Incomplete states are represented by balances. ● Two core use-cases: transaction-level queries, and aggregation analytics ● Current system is unscalable and complex Summarizing Tracking funds at Stripe
  • 56. Pinot and Kafka works in synergy
  • 57. Detect problems in hundreds of billions of rows (cont’d) How do we detect issues in a graph of half a trillion nodes? 1) Sum all money in/out of nodes, focusing only on non-zero nodes Now we have 20 million nodes with a non-zero sum; how do we analyze them? 2) Group by a) Day of first transaction seen -- time series b) Sign of sum (negative/positive flow) c) Node properties like type We now have a time series, and fields we can slice/dice: an OLAP cube
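The two-step detection flow above can be sketched as follows (node names, dates, and amounts are hypothetical):

```python
from collections import defaultdict

entries = [  # (node, date_first_seen, amount) -- hypothetical ledger edges
    ("n1", "2023-01-01", 100), ("n1", "2023-01-01", -100),  # cleared
    ("n2", "2023-01-02", 50),                                # stuck +50
    ("n3", "2023-01-02", -20),                               # stuck -20
]

# Step 1: sum money in/out per node; only non-zero balances matter.
balances, first_seen = defaultdict(int), {}
for node, date, amount in entries:
    balances[node] += amount
    first_seen.setdefault(node, date)

stuck = {n: b for n, b in balances.items() if b != 0}

# Step 2: group stuck nodes by first-seen date and sign of the balance,
# yielding a time series that can be sliced like an OLAP cube.
series = defaultdict(int)  # (date, sign) -> count of stuck nodes
for node, bal in stuck.items():
    series[(first_seen[node], "pos" if bal > 0 else "neg")] += 1
```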
  • 58. Modelling as state machines Tracking funds at Stripe Transitions State balances
  • 59. Modelling as state machines Tracking funds at Stripe Transitions State balances
  • 60. Modelling as state machines Tracking funds at Stripe Transitions State balances
  • 61. Modelling as state machines Balances of incomplete payment Tracking funds at Stripe
  • 62. Modelling as state machines Balances of successful payment Tracking funds at Stripe
  • 64. ● Data volume, handling hundreds of billions of records ● Data freshness, getting real-time processing ● Query latency, making analytics usable for interactive internal UIs ● Achieving all three at once: difficult! Why is this challenging? Tracking funds at Stripe
  • 65. Modelling as state machines Dozens and dozens of states Tracking funds at Stripe
  • 68. Double-Entry Bookkeeping ● Internal funds flow represented by a directed graph ● Record the graph edge as Double-Entry Bookkeeping ● Nodes in the graph are modeled as accounts ● Accounts should eventually have zero balances
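The double-entry invariant can be expressed directly in code (a toy sketch with hypothetical account names): every funds-flow edge debits one account and credits another, so funds are conserved globally and intermediate accounts clear to zero once a flow completes.

```python
from collections import defaultdict

def post_edge(ledger, src, dst, amount):
    """Record one funds-flow edge as a matching debit and credit."""
    ledger[src] -= amount
    ledger[dst] += amount

ledger = defaultdict(int)
post_edge(ledger, "customer_card", "acquirer", 100)
post_edge(ledger, "acquirer", "merchant_balance", 100)

# Funds are conserved globally...
assert sum(ledger.values()) == 0
# ...and the intermediate account has cleared to zero.
assert ledger["acquirer"] == 0
```

A non-zero balance in an account that should have cleared is exactly the "stuck funds" signal the detection queries look for.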
  • 69. Detect problems in hundreds of billions of rows Money in/out of graph nodes should sum to zero (“cleared”). Stuck funds over time = revenue loss ● One card swipe could create 10+ nodes ● Hundreds of billions of unique nodes and growing
  • 70.
  • 71. Lessons Learned ● Metadata becomes heavy for huge tables ○ An O(n²) algorithm is not acceptable when processing 60k segments ○ Avoid sending 1k+ segment names across 100+ servers ○ Metadata is important when aiming for sub-second latency ● Tail effect on p99/p95 latencies when we have 1000 servers ○ Occasional hiccups on a single server become high-probability events and drag down p99/p95 query latency ○ Keep the set of servers queried as small as possible (partitioning, server grouping, etc.)
  • 72. Clearing Time Series (Exploring)
  • 74. Financial Data in Real Time (1/2) ● We have an upstream Flink deduplication job in place ● No duplication allowed ○ Pinot’s real time primary key is a nice option to explore ○ Sustained 200k+ QPS into Pinot offline tables in our deduplication experiments (after optimization) ○ An upstream Flink deduplication job may be the best choice ● Exactly-once consumption from Kafka to Pinot ○ Kafka transactional consumer enabled in Pinot ○ 1:1 mapping of Kafka message to table rows ○ Critical for financial data processing
  • 75. Table Design Optimization Iterations ● It takes 2~3 days for the Spark ETL job to process the full data set ● Scale up only after the design is optimized ○ Shadow production queries ○ Rebuild the whole data set when needed ● General rule of thumb: the fewer segments scanned, the better
  • 76. Kafka Ingestion Optimization (2/2) ● Partitioning/sharding in real-time tables (experimented) ○ Needs a streaming job to shuffle the Kafka topic by key ○ Helps query performance for the real-time table ○ Worth adopting ● Merging small segments into large segments ○ Needs a cron-style job to do the work ○ Helps pruning and scanning ○ Not a bottleneck for us