Let's get acquainted
➢ Solutions Architect at Silpo (E-commerce)
➢ In IT since 2008 (17 years of experience)
➢ Has experience in various roles (full-stack, backend,
team lead, architect, CTO)
➢ Prints on a 3D printer for PrintArmy
➢ 20+ billion UAH in revenue over 3 years
➢ 1+ million orders during this period
➢ 5+ million unique Guests
Silpo E-commerce in Numbers
➢ Employees: 2K+
➢ IT specialists: 350+
➢ Teams: 35+
➢ Releases: 27K+
➢ Merge requests: 36K+
Offline Shops
Silpo E-commerce and Ecosystem
Today we’ll talk about…
➢ Products: 14+
➢ Services and microservices: 400+
➢ RPS: 4k+
➢ Requests per month: 6B+
➢ Total Data Served per month: 30TB+
Courier platform, Picking platform, Merchant platform
SuperAPP, LOKO, SuperWEB
Core services: PIM, OMS
Payment Gateway
GEO services
High-level architecture
Black Friday and New Year Rush: Foodtech at Full Throttle
➢ October to April: the high season in FoodTech
➢ Black Friday: the start of the very high season in FoodTech
➢ Load increases 2x+ compared to regular periods
➢ Bad weather acts like a mini Black Friday
➢ Sessions: 11-13M+ /mo
➢ Sessions in high season: 20M+ /mo
Challenges
➢ Traffic Increase 2x+
➢ Maintain target SLA for response time (performance-critical)
➢ High load on integration processes – price and stock updates with every receipt (purchase), online and offline
➢ The cost of failure is enormous – measured in millions of UAH
➢ Efficient resource usage – no burning money on infrastructure
Performance testing
Our goals of performance testing
➢ Identify bottlenecks before peak load: database, APIs, third-party services.
➢ Evaluate scalability: how the system responds to increasing traffic.
➢ Find the “breakpoint”: where degradation begins.
➢ Measure key metrics: latency (P50, P95, P99), throughput (RPS), error rate, saturation (CPU, memory, IOPS).
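The "breakpoint" goal above can be sketched as a small helper that scans step-load results for the first load level where latency or error rate crosses a limit. This is a minimal sketch: the `LoadStep` type, the `find_breakpoint` name, the thresholds, and the sample numbers are all illustrative assumptions, not figures or tooling from the talk.

```python
from dataclasses import dataclass


@dataclass
class LoadStep:
    rps: int            # offered load during this step
    p95_ms: float       # measured P95 latency in milliseconds
    error_rate: float   # fraction of failed requests, 0.0 to 1.0


def find_breakpoint(steps, max_p95_ms=500.0, max_error_rate=0.01):
    """Return the lowest RPS that violates either limit, or None if none does."""
    for step in sorted(steps, key=lambda s: s.rps):
        if step.p95_ms > max_p95_ms or step.error_rate > max_error_rate:
            return step.rps
    return None


# Illustrative numbers only (assumed, not results from the talk).
steps = [
    LoadStep(rps=100, p95_ms=180.0, error_rate=0.000),
    LoadStep(rps=200, p95_ms=320.0, error_rate=0.002),
    LoadStep(rps=300, p95_ms=640.0, error_rate=0.004),
    LoadStep(rps=400, p95_ms=1500.0, error_rate=0.050),
]
print(find_breakpoint(steps))  # prints 300
```

The limits should come from the target SLA; the defaults here are placeholders.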
Example: Performance Testing for Catalog + Cart + Checkout
Goal:
Validate how the system handles the full user flow — from product catalog to checkout — under medium to high traffic.
Scenario Simulated:
➢ Mixed flow: product catalog → add to cart → proceed to checkout
➢ Test Load: 500 RPS (simulated real user behavior) - 20k Guests online
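As a rough sanity check on the test load above, the relationship between total RPS and the number of Guests online can be worked out directly. The sketch assumes the load is spread evenly across Guests, which real traffic rarely is; the function name is hypothetical.

```python
def seconds_between_requests(total_rps, guests_online):
    """Average pause between two requests from the same Guest,
    assuming the load is spread evenly across all Guests."""
    if total_rps <= 0:
        raise ValueError("total_rps must be positive")
    return guests_online / total_rps


# 500 RPS shared by 20,000 Guests: one request per Guest every 40 seconds.
print(seconds_between_requests(500, 20_000))  # prints 40.0
```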
Findings & Bottlenecks:
➢ Already at 200 RPS, response time reached 1.2s
➢ First bottleneck: shopping-cart-service could not handle load
➢ After scaling to 15 replicas, response time improved to 700ms
➢ At this point, a CPU alert triggered for settings-service
➢ Adjusting CPU limits helped reduce response time to 420ms at 400 RPS
➢ Error rate from shopping-cart-service remained high
➢ Tracing revealed MongoDB as the source of latency and timeouts
➢ After scaling the MongoDB cluster (Mongo Atlas) → performance stabilized
Post-Test Action Items:
➢ Implement auto scaling for services
➢ Review and fine-tune CPU limits
➢ Review and tune cache TTLs where applicable
➢ Pay attention to third-party dependencies
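To make the cache TTL action item concrete, here is a minimal sketch of a time-based cache whose TTL is a single tunable number. The `TTLCache` class and its injectable `clock` are assumptions for illustration; the talk does not say which cache the services actually use.

```python
import time


class TTLCache:
    """Tiny in-process cache; entries expire ttl_seconds after being loaded."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested without sleeping
        self._entries = {}       # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]      # still fresh: skip the expensive call
        value = loader()         # missing or expired: reload and restamp
        self._entries[key] = (now + self._ttl, value)
        return value
```

A longer TTL cuts load on the backing service at the cost of staler data, which is the trade-off a TTL review has to weigh for prices and stock.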
Before you start
Define clear goals:
➢ What are you testing? (e.g., checkout speed, API scalability)
➢ What metrics matter? (e.g., latency, throughput, error rate)
Know your baseline:
➢ Run a few light tests first to get current performance numbers.
➢ This helps compare results before and after optimization.
Simulate real user behavior:
➢ Create scenarios that simulate real user behavior (session-based: browse → add to cart → checkout).
➢ Don't forget about background processes that can affect the user's interaction with the system.
Test in production-like environment:
➢ Match replicas, services, DB, and configurations.
➢ Avoid testing on dev or staging with different specs.
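The before/after comparison against a baseline can be sketched as a percentage change per metric. The function name and the dictionary layout are assumptions; the two response times reuse the 1.2s and 700ms figures from the example earlier in the deck.

```python
def compare_to_baseline(baseline, current):
    """Percentage change per metric; negative means the value went down."""
    return {
        name: round((current[name] - baseline[name]) / baseline[name] * 100, 1)
        for name in baseline
        if name in current and baseline[name] != 0
    }


baseline = {"response_ms": 1200}   # before scaling shopping-cart-service
current = {"response_ms": 700}     # after scaling to 15 replicas
print(compare_to_baseline(baseline, current))  # prints {'response_ms': -41.7}
```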
During the Test
Warm up the system:
➢ Ramp up gradually to avoid false bottlenecks from cold starts or caching.
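A gradual ramp-up can be expressed as a simple step schedule that a load generator follows. This is a sketch under assumptions: the function name, the linear shape, and the step sizes are illustrative and not taken from the talk.

```python
def ramp_schedule(target_rps, steps, step_seconds):
    """Return (start_second, rps) pairs that climb linearly to target_rps."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    return [
        (i * step_seconds, round(target_rps * (i + 1) / steps))
        for i in range(steps)
    ]


print(ramp_schedule(500, 5, 60))
# prints [(0, 100), (60, 200), (120, 300), (180, 400), (240, 500)]
```

Holding each step long enough for caches to fill and new replicas to start keeps cold-start noise out of the measurements.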
Monitor everything:
➢ Track CPU, memory, DB, network, latency, RPS, error rate, timeouts.
➢ Use Grafana, Prometheus, Datadog, ELK, Jaeger, etc.
Watch for saturation points:
➢ CPU over 80%, DB connections maxing out, queue delays → red flags.
Include background processes:
➢ Cron jobs, data syncs, backups — they can skew results if not considered.
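The saturation red flags listed above can be checked mechanically against a metrics snapshot. In this sketch the 80% CPU limit comes from the slide, while the snapshot keys, the queue-delay limit, and the function name are hypothetical.

```python
def saturation_flags(snapshot, cpu_limit=0.80, queue_delay_limit_ms=100.0):
    """Return the names of the resources that look saturated in this snapshot."""
    flags = []
    if snapshot["cpu_utilization"] > cpu_limit:
        flags.append("cpu")
    if snapshot["db_connections_used"] >= snapshot["db_connections_max"]:
        flags.append("db_connections")
    if snapshot["queue_delay_ms"] > queue_delay_limit_ms:
        flags.append("queue_delay")
    return flags


# Hypothetical snapshot taken during a test run.
snapshot = {
    "cpu_utilization": 0.86,
    "db_connections_used": 100,
    "db_connections_max": 100,
    "queue_delay_ms": 40.0,
}
print(saturation_flags(snapshot))  # prints ['cpu', 'db_connections']
```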
After the Test
Analyze metrics deeply:
➢ Focus on P95/P99 latency, error spikes, and slowest endpoints
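For reference, P95 and P99 can be computed from raw latency samples with the nearest-rank method. This is one of several percentile definitions, and monitoring tools may interpolate differently; the function names are assumptions for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def latency_summary(samples_ms):
    """Summarize latency samples at the percentiles named in the slides."""
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


print(latency_summary(list(range(1, 101))))
# prints {'p50': 50, 'p95': 95, 'p99': 99}
```

Tail percentiles matter more than averages here because a small share of slow requests can hide behind a healthy mean.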
Identify root causes:
➢ Use logs + tracing to pinpoint what’s causing slowness
Optimize step by step:
➢ Index DB tables
➢ Cache expensive calls
➢ Make heavy operations async
➢ Configure auto scaling and resource limits
Retest after every major fix:
➢ Small changes can have unexpected impact — revalidate
Auto Scaling
Application Hosting Volumes
➢ Kubernetes Clusters: 14 (AWS EKS) - Product-oriented Architecture, separate cluster for each product
➢ Stateless Services: ~400 - Service-oriented Architecture inside each product
➢ Nodes: ~160 (AWS EC2)
➢ Pods: ~4000
Application Auto Scaling: In-Depth Overview
[Diagram: queries metrics → scales pods → checks pod schedulability → scales nodes → adds/removes nodes]
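The "queries metrics, scales pods" step can be illustrated with the replica formula documented for the Kubernetes Horizontal Pod Autoscaler: desired = ceil(current replicas × current metric / target metric). The slides do not name the autoscaler in use, so treating it as the HPA is an assumption, and the sketch leaves out the HPA's tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=15):
    """HPA-style replica count, clamped to the configured bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 5 pods averaging 90% CPU against a 60% target: scale out to 8.
print(desired_replicas(5, 90, 60))   # prints 8
# Load far above target is capped at max_replicas.
print(desired_replicas(10, 200, 50)) # prints 15
```

If the new pods do not fit on existing nodes, they stay unschedulable until the node-scaling step adds capacity, which is why the two loops need to be tuned together.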
Application Auto Scaling: Pitfalls and Tips
Compensatory Flows
➢ Identify the most critical flows in your business
➢ Define potential bottlenecks that could impact performance or availability
➢ Implement compensatory flows where feasible and meaningful
Example: Orders queue
[Diagram: StoreFront, Shopping Cart Service, Order Service, ECOM; steps: Create Order, Save Order]
On Black Friday 2024, the Order Service experienced a 1-hour outage, during which approximately 3,000 orders were queued, representing millions in estimated revenue.
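The orders-queue idea can be sketched as follows: try to save the order right away, park it when the downstream service is unavailable, and replay the backlog once it recovers. All names here are hypothetical, and the in-memory `deque` stands in for whatever durable queue a production setup would need; the talk does not describe the actual implementation.

```python
from collections import deque


class OrderServiceUnavailable(Exception):
    """Raised by the save callable when the downstream service is down."""


class OrderSubmitter:
    def __init__(self, save_order):
        self._save_order = save_order   # callable; may raise OrderServiceUnavailable
        self.pending = deque()          # illustration only: not durable

    def create_order(self, order):
        """Save immediately if possible; otherwise queue the order."""
        try:
            self._save_order(order)
            return "saved"
        except OrderServiceUnavailable:
            self.pending.append(order)
            return "queued"

    def drain(self):
        """Replay queued orders in arrival order; stop at the first failure."""
        saved = 0
        while self.pending:
            try:
                self._save_order(self.pending[0])
            except OrderServiceUnavailable:
                break
            self.pending.popleft()      # remove only after a successful save
            saved += 1
        return saved
```

The Guest gets an accepted order either way, so an outage delays processing instead of losing the sale.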
Example: Fallback SMS provider
We use two providers to send SMS messages with our OTP codes. If the primary provider is unavailable, the system automatically falls back to the secondary one.
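The fallback described above can be sketched as trying providers in priority order. The provider callables, the `SmsProviderError` type, and the message text are hypothetical placeholders, not the real provider SDKs.

```python
class SmsProviderError(Exception):
    """Raised when a provider cannot deliver the message."""


def send_otp(phone, code, providers):
    """Try each (name, send) pair in order; return the name that succeeded."""
    errors = []
    for name, send in providers:
        try:
            send(phone, f"Your code: {code}")
            return name
        except SmsProviderError as exc:
            errors.append(f"{name}: {exc}")   # remember why this one failed
    raise SmsProviderError("all providers failed: " + "; ".join(errors))


def primary(phone, text):
    # Hypothetical primary provider that is currently down.
    raise SmsProviderError("timeout")


def secondary(phone, text):
    # Hypothetical secondary provider that accepts the message.
    return None


print(send_otp("+380000000000", "1234",
               [("primary", primary), ("secondary", secondary)]))
# prints secondary
```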
Cheat Code
300 UAH promo code for SilpoApp

"How to survive Black Friday: preparing e-commerce for a peak season", Yurii Panaiotov.pptx
