"How to survive Black Friday: preparing e-commerce for a peak season", Yurii Panaiotov.pptx

Let's get acquainted
➢ Solutions Architect at Silpo (E-commerce)
➢ In IT since 2008 (17 years of experience)
➢ Has experience in various roles (full-stack, backend, team lead, architect, CTO)
➢ Prints on a 3D printer for PrintArmy

Silpo E-commerce in Numbers
➢ 20+ billion UAH in revenue over 3 years
➢ 1+ million orders during this period
➢ 5+ million unique Guests
➢ Employees: 2K+
➢ IT specialists: 350+
➢ Teams: 35+
➢ Releases: 27K+
➢ Merge requests: 36K+

Today we’ll talk about…
Silpo E-commerce and Ecosystem (the diagram also shows: Offline Shops)
➢ Products: 14+
➢ Services and microservices: 400+
➢ RPS: 4k+
➢ Requests per month: 6B+
➢ Total Data Served per month: 30TB+

Ecosystem diagram: Courier platform, Picking platform, Merchant platform, SuperAPP, LOKO, SuperWEB
Core services: PIM, OMS, Payment Gateway, GEO services

Black Friday and New Year Rush: Foodtech at Full Throttle
➢ October to April: the High Season in FoodTech
➢ Black Friday — the Start of the Very-High Season in FoodTech
➢ Load increases 2x+ compared to regular periods
➢ Bad weather acts like a mini Black Friday
➢ Sessions: 11-13M+ /mo
➢ Sessions in high season: 20M+ /mo

Challenges
➢ Traffic Increase 2x+
➢ Maintain target SLA for response time (performance-critical)
➢ High load on integration processes – price & stock updates with every receipt, online and offline
➢ The cost of failure is enormous – measured in millions of UAH
➢ Efficient resource usage – no burning money on infrastructure

Our goals of performance testing
➢ Identify bottlenecks before peak load: database, APIs, third-party services.
➢ Evaluate scalability: how the system responds to increasing traffic.
➢ Find the “breakpoint”: where degradation begins.
➢ Measure key metrics: latency (P50, P95, P99), throughput (RPS), error rate, saturation (CPU, memory, IOPS).
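
As an illustration of the metrics listed above, here is a minimal sketch of how P50/P95/P99 latency, throughput and error rate could be derived from raw test samples. The function names and the nearest-rank percentile method are assumptions made for this example; the slides do not say which tool or method was used.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(latencies_ms, errors, duration_s):
    """latencies_ms holds one entry per request sent; errors is the number that failed."""
    total = len(latencies_ms)
    return {
        "p50_ms": percentile(latencies_ms, 50),
        "p95_ms": percentile(latencies_ms, 95),
        "p99_ms": percentile(latencies_ms, 99),
        "rps": total / duration_s,       # throughput
        "error_rate": errors / total,    # share of failed requests
    }
```

Averages hide the slow tail, which is why the sketch reports percentiles rather than a mean.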

Example: Performance Testing for Catalog + Cart + Checkout
Goal:
Validate how the system handles the full user flow — from product catalog to checkout — under medium to high traffic.
Scenario Simulated:
➢ Mixed flow: product catalog → add to cart → proceed to checkout
➢ Test Load: 500 RPS (simulated real user behavior) - 20k Guests online
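
To make the relation between the two load figures above concrete, here is a rough sketch. It assumes a closed-loop model in which every online Guest is active and issues requests at a steady pace; the slides do not state the model, and the drop-off probability is a hypothetical parameter, not a measured value.

```python
import random

FLOW = ["catalog", "add_to_cart", "checkout"]  # steps of the mixed flow from the slide

def pacing_seconds(guests_online, target_rps):
    """Average gap between two requests of one Guest, if all Guests share the load evenly."""
    return guests_online / target_rps

def session_steps(rng, continue_prob=0.5):
    """Steps of one simulated session; continue_prob is a hypothetical chance to go one step further."""
    steps = [FLOW[0]]
    for step in FLOW[1:]:
        if rng.random() >= continue_prob:
            break  # the Guest leaves the funnel here
        steps.append(step)
    return steps

# 20k Guests online at 500 RPS means roughly one request per Guest every 40 seconds.
print(pacing_seconds(20_000, 500))
print(session_steps(random.Random(42)))
```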

Findings & Bottlenecks:
➢ Already at 200 RPS, response time reached 1.2s
➢ First bottleneck: shopping-cart-service could not handle load

➢ After scaling to 15 replicas, response time improved to 700ms

➢ At this point, a CPU alert triggered for settings-service
➢ Adjusting CPU limits helped reduce response time to 420ms at 400 RPS

➢ Error rate from shopping-cart-service remained high
➢ Tracing revealed MongoDB as the source of latency and timeouts
➢ After scaling the MongoDB cluster (Mongo Atlas) → performance stabilized

Post-Test Action Items:
➢ Implement auto scaling for services
➢ Review and fine-tune CPU limits
➢ Review and tune cache TTL where applicable
➢ Pay attention to third-party dependencies

Before you start
Define clear goals:
➢ What are you testing? (e.g., checkout speed, API scalability)
➢ What metrics matter? (e.g., latency, throughput, error rate)
Know your baseline:
➢ Run a few light tests first to get current performance numbers.
➢ Helps compare results before vs. after optimization.
Simulate real user behavior:
➢ Create scenarios that simulate real user behavior (session-based: browse → add to cart → checkout).
➢ Don't forget about background processes that can affect the user's interaction with the system.
Test in production-like environment:
➢ Match replicas, services, DB, and configurations.
➢ Avoid testing on dev or staging with different specs.
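
The "know your baseline" step above can be sketched as a simple before/after comparison. This is an illustrative helper, not the team's tooling: the metric names and the 10% tolerance are assumptions, and it only handles metrics where a higher value is worse (latency, error rate) and the baseline is non-zero.

```python
def compare_to_baseline(baseline, current, tolerance=0.10):
    """Return metrics that got worse by more than `tolerance` (relative change)."""
    regressions = {}
    for name, base in baseline.items():
        change = (current[name] - base) / base
        if change > tolerance:
            regressions[name] = round(change, 3)
    return regressions

# Hypothetical numbers, for illustration only.
baseline = {"p95_ms": 400, "p99_ms": 900}
after_change = {"p95_ms": 420, "p99_ms": 1200}
print(compare_to_baseline(baseline, after_change))
```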

During the Test
Warm up the system:
➢ Ramp up gradually to avoid false bottlenecks from cold starts or caching.
Monitor everything:
➢ Track CPU, memory, DB, network, latency, RPS, error rate, timeouts.
➢ Use Grafana, Prometheus, Datadog, ELK, Jaeger, etc.
Watch for saturation points:
➢ CPU over 80%, DB connections maxing out, queue delays → red flags.
Include background processes:
➢ Cron jobs, data syncs, backups — they can skew results if not considered.
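
The red flags named above can be expressed as a small check over a monitoring snapshot. This is a sketch under assumptions: the `Snapshot` fields and the queue-delay threshold are hypothetical; only the 80% CPU figure comes from the slide. In practice such rules would normally live in the alerting system rather than in a script.

```python
from dataclasses import dataclass

@dataclass
class Snapshot:
    cpu_utilization: float     # 0.0 to 1.0
    db_connections_used: int
    db_connections_max: int
    queue_delay_ms: float

def red_flags(s, cpu_limit=0.80, queue_delay_limit_ms=500.0):
    """Return the names of saturation signals present in the snapshot."""
    flags = []
    if s.cpu_utilization > cpu_limit:
        flags.append("cpu")
    if s.db_connections_used >= s.db_connections_max:
        flags.append("db_connections")
    if s.queue_delay_ms > queue_delay_limit_ms:   # hypothetical threshold
        flags.append("queue_delay")
    return flags
```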

After the Test
Analyze metrics deeply:
➢ Focus on P95/P99 latency, error spikes, and slowest endpoints
Identify root causes:
➢ Use logs + tracing to pinpoint what’s causing slowness
Optimize step by step:
➢ Index DB tables
➢ Cache expensive calls
➢ Make heavy operations async
➢ Auto scaling, resource limits
Retest after every major fix:
➢ Small changes can have unexpected impact — revalidate
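
One of the optimizations above, caching expensive calls, can be sketched as a tiny in-process TTL cache. This is a minimal illustration, not the implementation used at Silpo: the class name and interface are assumptions, and a production setup would more likely rely on a shared cache.

```python
import time

class TTLCache:
    """Keeps a computed value for ttl_seconds, then recomputes it on the next request."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so the cache can be tested without sleeping
        self._entries = {}           # key -> (expires_at, value)

    def get_or_compute(self, key, compute):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # still fresh: skip the expensive call
        value = compute()
        self._entries[key] = (now + self._ttl, value)
        return value
```

The TTL is the tuning knob: a longer TTL takes more load off the backend but serves staler data, which matters for prices and stock.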

Application Hosting Volumes
➢ Kubernetes Clusters: 14 (AWS EKS) - Product-oriented Architecture, separate cluster for each product
➢ Stateless Services: ~400 - Service-oriented Architecture inside each product
➢ Nodes: ~160 (AWS EC2)
➢ Pods: ~4000

Application Auto Scaling: In-Depth Overview
Diagram flow: query metrics → scale pods → check pod schedulability → scale nodes → add/remove nodes
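
The "queries metrics, scales pods" step can be illustrated with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler. The slides do not name the components in the diagram, so treating the pod scaler as the HPA is an assumption, and the min/max bounds below are hypothetical defaults.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=15, tolerance=0.10):
    """desired = ceil(current_replicas * current_metric / target_metric), clamped to bounds."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas      # close enough to target: no scaling
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))

# 5 pods at 90% average CPU with a 60% target: scale out to 8 pods.
print(desired_replicas(5, 90, 60))
```

If the new pods do not fit on existing nodes they stay unschedulable until nodes are added, so node provisioning time is part of the total reaction time.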

Application Auto Scaling: Pitfalls and Tips

Compensatory Flows
➢ Identify the most critical flows in your business
➢ Define potential bottlenecks that could impact performance or availability
➢ Implement compensatory flows where feasible and meaningful
Example: Orders queue
Diagram labels: StoreFront, Shopping Cart Service, Order Service, ECOM; steps: Create Order, Save Order
On Black Friday 2024, the Order Service experienced a 1-hour outage, during which approximately 3,000 orders were queued, worth an estimated millions in revenue.
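
The orders queue idea can be sketched as follows. This is a simplified illustration under assumptions: all class and function names are hypothetical, and the in-memory deque stands in for whatever durable queue or broker is actually used, which the slides do not specify.

```python
from collections import deque

class OrderServiceDown(Exception):
    pass

class OrderSubmitter:
    """Tries to save an order; if the Order Service is down, queues it for later."""

    def __init__(self, save_order):
        self._save_order = save_order    # callable that may raise OrderServiceDown
        self._pending = deque()

    def create_order(self, order):
        try:
            self._save_order(order)
            return "saved"
        except OrderServiceDown:
            self._pending.append(order)  # compensatory flow: accept now, save later
            return "queued"

    def drain(self):
        """Replay queued orders in arrival order; stop at the first failure."""
        saved = 0
        while self._pending:
            try:
                self._save_order(self._pending[0])
            except OrderServiceDown:
                break
            self._pending.popleft()
            saved += 1
        return saved

    @property
    def pending(self):
        return len(self._pending)
```

The Guest still gets an accepted order during the outage, and nothing is lost as long as the queue itself is durable.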

Example: Fallback for the SMS provider
We use two providers to send SMS messages with our OTP codes. If the primary provider is unavailable, the system automatically falls back to the secondary one.
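
The fallback described above can be sketched like this. The provider interface, names and exception type are hypothetical; the slides only state that a secondary provider takes over when the primary is unavailable.

```python
class ProviderUnavailable(Exception):
    pass

def send_otp(phone, code, providers):
    """Try each (name, send) pair in order; return the name of the provider that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(phone, code)
            return name
        except ProviderUnavailable as exc:
            errors.append(f"{name}: {exc}")
    # Every provider failed: surface all reasons to the caller.
    raise ProviderUnavailable("; ".join(errors))
```

A real setup would likely add per-provider timeouts so that a slow primary does not delay the switch to the secondary.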

300 UAH promo code for
SilpoApp
