"In today's fast-paced global E-Commerce industry, the amount of data generated by online shoppers is massive. To deliver real-time analytics, effective advertising campaigns and machine learning based personalized recommendations are crucial. However, building a reliable and scalable data pipeline to support this is a challenging task.
In this talk, we'll share how we tackled the challenge of building a fully managed, robust data pipeline using a combination of streaming analytics, batch processing, a data lake, and machine learning. Our platform, built on Google Cloud Platform and powered by Confluent Kafka, enables us to process 1.6 billion events per day.
We'll dive into the technical details of our architecture, tech stack, and data flow, including how we use:
• Kafka Streams Java applications, deployed on Kubernetes, to consume, deduplicate, transform, filter, and write data into the HBase NoSQL database for real-time analytics,
• Meta's Conversion APIs to push events for advertising campaigns,
• Google AI for personalized recommendations,
• Confluent sink connectors to push events to Google Cloud Storage and BigQuery,
• ksqlDB for bot filtering.
We'll also cover our observability, monitoring, and deployment practices.
But we don't want to just talk about our pipeline; we want to help you build one too. You'll leave our talk with practical insights and lessons learned from our experience, including tips on building a reliable, fault-tolerant, and scalable data pipeline, choosing the right tech stack, and ensuring end-to-end observability. Join us and learn how to take your data pipeline to the next level."
Scalable E-Commerce Data Pipelines with Kafka: Real-Time Analytics, Batch, ML, Data Lake, and Beyond
1. Scalable E-Commerce Data Pipelines with Kafka: Real-Time Analytics, Batch, ML, Data Lake, and Beyond
Aristatle Subramaniam, Lead Data Engineer, BigCommerce
Mahendra Kumar, VP, Data and Software Engineering, BigCommerce
3. Agenda
◿ Data Platform Architecture: Overview
◿ Data Lake: Ad-hoc analysis and ML
◿ Real-time Analytics & Insights: 1.6B events per day
◿ Personalized Product Recommendation: Improve conversion ratio and click-through rate
◿ Meta Conversion APIs: Run effective ad campaigns
◿ Challenges: Bot filtering
◿ Observability, Monitoring & Alerting: Charts
◿ Q&A
4. BigCommerce Data Pipeline Architecture
◿ Importance of Real-time Data Handling
◿ Processing Massive Volumes of Data
◿ Agility and Adaptability
◿ Data-Driven Decision-Making
6. Real-time Analytics Data Pipeline Architecture
◿ Shopper Click Events
◿ Data Collection with Filebeat
◿ Confluent Kafka for Reliable Event Handling
◿ Real-time Analytics Processing using Kafka Streams
◿ Deployment in Google Kubernetes Engine
◿ Data Storage with HBase
◿ Querying with Apache Phoenix
◿ Cloud SQL for Aggregated and Precomputed Results
◿ Python API-Powered Dashboards
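The abstract describes the stream-processing step as consuming, deduplicating, transforming, and filtering events before they are written to storage. The sketch below illustrates that sequence in plain Python rather than in the Kafka Streams Java applications the talk refers to. The class and function names, the event field names (`event_id`, `ts`, `type`, `store_id`), the set of accepted event types, and the 60-second window are all assumptions made for illustration, and the eviction logic assumes timestamps arrive in non-decreasing order.

```python
from collections import OrderedDict

# Hypothetical set of event types the pipeline keeps; not taken from the talk.
ACCEPTED_TYPES = {"visit", "product_page_view", "add_to_cart", "order"}


class WindowedDeduplicator:
    """Remembers event IDs for `window_seconds` and flags repeats.

    Assumes timestamps are non-decreasing, so insertion order in the
    OrderedDict matches timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._seen = OrderedDict()  # event_id -> timestamp when first seen

    def _evict(self, now):
        # Drop IDs whose first sighting is older than the window.
        while self._seen:
            _, first_seen = next(iter(self._seen.items()))
            if now - first_seen < self.window_seconds:
                break
            self._seen.popitem(last=False)

    def is_duplicate(self, event_id, timestamp):
        self._evict(timestamp)
        if event_id in self._seen:
            return True
        self._seen[event_id] = timestamp
        return False


def process(events, dedup):
    """Deduplicate, filter, and transform an iterable of event dicts."""
    for event in events:
        if dedup.is_duplicate(event["event_id"], event["ts"]):
            continue  # repeated delivery of the same event
        if event["type"] not in ACCEPTED_TYPES:
            continue  # event type not needed downstream
        # Transform: keep only the fields the analytics store needs.
        yield {"store_id": event["store_id"], "type": event["type"], "ts": event["ts"]}


sample_events = [
    {"event_id": "a", "ts": 0, "type": "visit", "store_id": 1},
    {"event_id": "a", "ts": 5, "type": "visit", "store_id": 1},       # duplicate
    {"event_id": "b", "ts": 10, "type": "heartbeat", "store_id": 1},  # filtered
    {"event_id": "a", "ts": 100, "type": "visit", "store_id": 1},     # outside window
]
processed = list(process(sample_events, WindowedDeduplicator(window_seconds=60)))
```

In a real stream processor the "seen" set would live in a fault-tolerant state store instead of process memory; the in-memory dictionary here only shows the windowing idea.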
10. Challenges
◿ Bot filtering:
● Bot impact on analytics - conversion ratios and purchase funnels.
● Rising bot traffic - 40% of product page views and shopper visit events.
● Challenges in data accuracy - bot traffic can lead to skewed results.
● JavaScript non-bot event - used for lookup.
◿ Bulk import of historical orders and catalog for onboarding a new store.
◿ Repeated order and cart events.
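One way to read the "JavaScript non-bot event" bullet is that a browser-fired event marks a session as human, and other events are kept only if their session appears in that lookup. The sketch below shows this idea as a simple two-pass filter in Python; it is not the ksqlDB logic used in the talk. The event type name `js_non_bot` and the `session_id` field are hypothetical.

```python
def split_bot_traffic(events):
    """Separate events into (kept, dropped) using a session lookup.

    A session counts as human if it produced at least one 'js_non_bot'
    event (hypothetical name for the JavaScript-fired marker event).
    """
    human_sessions = {e["session_id"] for e in events if e["type"] == "js_non_bot"}
    kept, dropped = [], []
    for e in events:
        if e["type"] == "js_non_bot":
            continue  # marker events only feed the lookup
        if e["session_id"] in human_sessions:
            kept.append(e)
        else:
            dropped.append(e)
    return kept, dropped


def bot_share(kept, dropped):
    """Fraction of non-marker events attributed to bots."""
    total = len(kept) + len(dropped)
    return len(dropped) / total if total else 0.0


sample = [
    {"session_id": "s1", "type": "js_non_bot"},
    {"session_id": "s1", "type": "product_page_view"},
    {"session_id": "s1", "type": "product_page_view"},
    {"session_id": "s1", "type": "visit"},
    {"session_id": "s2", "type": "product_page_view"},  # no marker event
    {"session_id": "s2", "type": "visit"},              # no marker event
]
kept, dropped = split_bot_traffic(sample)
share = bot_share(kept, dropped)
```

A streaming version would also have to decide how long to wait for the marker event, since it can arrive after the first page view of a session.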
20. Meta Conversion APIs
◿ Meta's Conversion API helps merchants effectively run advertising campaigns on customer audiences.
◿ Server-side event transmission APIs: visits, product page views, cart additions, and orders
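As a rough illustration of server-side event transmission, the sketch below maps the four event kinds named on the slide to a JSON request body. The payload field names (`event_name`, `event_time`, `event_id`, `action_source`, `user_data.em`, `custom_data`) and the SHA-256 hashing of a normalized email follow Meta's public Conversions API documentation as best understood here and should be checked against the current reference. The internal event type names and input fields are hypothetical, and no network call is made.

```python
import hashlib
import json

# Hypothetical internal event types mapped to Meta standard event names.
EVENT_NAME_MAP = {
    "visit": "PageView",
    "product_page_view": "ViewContent",
    "add_to_cart": "AddToCart",
    "order": "Purchase",
}


def hash_email(email):
    """Normalize (trim, lowercase) and SHA-256 hash an email address."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def to_conversion_event(event):
    """Convert one internal event dict into a server event payload."""
    payload = {
        "event_name": EVENT_NAME_MAP[event["type"]],
        "event_time": int(event["ts"]),
        "event_id": event["event_id"],
        "action_source": "website",
        "user_data": {"em": [hash_email(event["email"])]},
    }
    if event["type"] == "order":
        # Purchase events carry the order value and currency.
        payload["custom_data"] = {"currency": event["currency"], "value": event["total"]}
    return payload


def build_request_body(events):
    """Serialize a batch of events into a JSON body with a top-level 'data' list."""
    return json.dumps({"data": [to_conversion_event(e) for e in events]})


order_event = {
    "type": "order", "ts": 1700000000, "event_id": "o-1",
    "email": " Shopper@Example.com ", "currency": "USD", "total": 42.5,
}
body = build_request_body([order_event])
```

Hashing on the server keeps raw email addresses out of the outbound request, which is why the normalization step matters: the same shopper must always hash to the same value.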
22. Data Lake Architecture and Use Cases
◿ GCS Sink Connector to store and archive raw events
◿ Batch processing and ad hoc queries
◿ Training data for the machine learning model
◿ Insights - Rockstar Products, Most Abandoned Products, Best Customers, Repeat Purchase Rate
◿ Internal Analytics
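Two of the insights listed above, Repeat Purchase Rate and Most Abandoned Products, can be sketched as small batch computations over archived events. The definitions used here are assumptions, not the talk's: repeat purchase rate is taken as the share of purchasing customers with two or more orders, and a product counts as abandoned when a customer added it to a cart but never bought it. Field names are hypothetical.

```python
from collections import Counter


def repeat_purchase_rate(orders):
    """Share of purchasing customers who placed two or more orders."""
    orders_per_customer = Counter(o["customer_id"] for o in orders)
    if not orders_per_customer:
        return 0.0
    repeat_customers = sum(1 for n in orders_per_customer.values() if n >= 2)
    return repeat_customers / len(orders_per_customer)


def most_abandoned_products(cart_adds, orders, top_n=3):
    """Products most often added to a cart by a customer who never bought them."""
    purchased = {
        (o["customer_id"], product_id)
        for o in orders
        for product_id in o["product_ids"]
    }
    abandoned = Counter(
        c["product_id"]
        for c in cart_adds
        if (c["customer_id"], c["product_id"]) not in purchased
    )
    return abandoned.most_common(top_n)


sample_orders = [
    {"customer_id": "c1", "product_ids": ["p1"]},
    {"customer_id": "c1", "product_ids": ["p2"]},
    {"customer_id": "c2", "product_ids": ["p1"]},
]
sample_cart_adds = [
    {"customer_id": "c1", "product_id": "p1"},  # later purchased
    {"customer_id": "c2", "product_id": "p3"},
    {"customer_id": "c3", "product_id": "p3"},
    {"customer_id": "c3", "product_id": "p2"},
]
rate = repeat_purchase_rate(sample_orders)
abandoned = most_abandoned_products(sample_cart_adds, sample_orders)
```

At data-lake scale the same logic would more likely be expressed as SQL over the archived events; the Python version only makes the definitions explicit.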
24. AI-Powered Personalized Product Recommendation
◿ For shoppers: enable shoppers to easily discover the products they love
◿ For merchants: enable merchants to easily leverage AI to merchandise their products
◿ Boost shopper engagement and conversion rates
26. Others you may like model
◿ Business objectives
● Click-through rate
○ Product catalog
○ Product Page View events
● Conversion rate
○ Product catalog
○ Product Page View events
○ Added To Cart events
◿ Placement type
● Detail page view
◿ Training data used
● Full product catalog data
● Product Page View and Cart events: 3 months of data
● Real-time data
◿ Model tuning options
● Automatic
● Manually triggered
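The two business objectives for the "Others you may like" model, click-through rate and conversion rate, can be made concrete with a small calculation. The definitions below are illustrative assumptions rather than the ones the recommendation service uses internally: click-through rate is clicks divided by recommendation impressions, and conversion rate is the share of viewed (session, product) pairs that were also added to the cart. Field and event type names are hypothetical.

```python
def click_through_rate(impressions, clicks):
    """Clicks on recommended products divided by recommendation impressions."""
    return clicks / impressions if impressions else 0.0


def add_to_cart_conversion_rate(events):
    """Share of viewed (session, product) pairs that were also added to a cart."""
    viewed, carted = set(), set()
    for e in events:
        key = (e["session_id"], e["product_id"])
        if e["type"] == "product_page_view":
            viewed.add(key)
        elif e["type"] == "add_to_cart":
            carted.add(key)
    return len(viewed & carted) / len(viewed) if viewed else 0.0


sample = [
    {"session_id": "s1", "product_id": "p1", "type": "product_page_view"},
    {"session_id": "s1", "product_id": "p1", "type": "add_to_cart"},
    {"session_id": "s1", "product_id": "p2", "type": "product_page_view"},
    {"session_id": "s2", "product_id": "p1", "type": "product_page_view"},
    {"session_id": "s2", "product_id": "p3", "type": "product_page_view"},
]
ctr = click_through_rate(impressions=200, clicks=10)
cvr = add_to_cart_conversion_rate(sample)
```

This mirrors the data requirements on the slide: click-through rate needs only page view events, while conversion rate also needs add-to-cart events.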
27. Frequently bought together
◿ Business objectives
● Revenue per session
◿ Placement Type
● Add to cart
● Registry
◿ Datasets used:
● Purchase events
● Product catalog
◿ Training data used
● Full product catalog data
● Purchase events data for 1 year
● Real Time data
◿ Automated model build and deployment
◿ Provide secure, scalable, and performant APIs for serving product recommendations
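To show why purchase events and the product catalog are the inputs for a "frequently bought together" placement, here is a simple co-occurrence baseline. It is not the managed recommendation model used in the talk; it only counts how often two products appear in the same order. All names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def build_cooccurrence(orders):
    """Count, for every product, how often each other product shares an order with it."""
    pairs = defaultdict(Counter)
    for items in orders:
        # Deduplicate within an order so quantity does not inflate counts.
        for a, b in combinations(sorted(set(items)), 2):
            pairs[a][b] += 1
            pairs[b][a] += 1
    return pairs


def frequently_bought_together(pairs, product_id, top_n=2):
    """Return up to top_n products most often purchased with product_id."""
    counts = pairs.get(product_id, Counter())
    return [pid for pid, _ in counts.most_common(top_n)]


sample_orders = [
    ["phone", "case"],
    ["phone", "case"],
    ["phone", "charger"],
    ["case", "screen_protector"],
]
cooccurrence = build_cooccurrence(sample_orders)
suggestions = frequently_bought_together(cooccurrence, "phone")
```

A learned model can generalize to products with little purchase history, which a pure counting baseline like this cannot do.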