3. Tackling Performance Bottlenecks in the Market Data Service
Problems
• App becomes slow or unresponsive during market open and close
• Users report frustration and difficulty executing trades
• Potential impact on user satisfaction, retention & market competitiveness
Goal
The primary objective is to ensure that market data is ingested and delivered to the front-end (FE) application efficiently, providing users with real-time, accurate information on stock prices, market trends, and related financial data
4. Assumptions
• Stock trading app with 10 million active users
• Peak usage occurs around market open (9:30 AM) and close (4:00 PM), with user activity increasing by ~100% during these windows
• Each user makes an average of 5 API calls per session (see the back-of-envelope load estimate after this list)
• Third-party data provider services are operating normally, and no competitor is reporting the same issue
• Market data service is hosted on a single server with limited resources
• App uses a centralized database for market data and user information
• The issue originates solely in the market data service; no other service is contributing
• Rate limiting for API access is already implemented for external users
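To make the scale concrete, a rough peak-load estimate can be derived from these assumptions. The sketch below is illustrative only: the sessions-per-user figure, the peak window length, and the share of traffic landing in that window are guesses added here, not numbers from this analysis.

```python
# Back-of-envelope peak load estimate. Only the first three constants
# come from the stated assumptions; the rest are illustrative guesses.
ACTIVE_USERS = 10_000_000        # stated: 10M active users
CALLS_PER_SESSION = 5            # stated: avg API calls per session
PEAK_MULTIPLIER = 2.0            # stated: ~100% activity increase at peak

SESSIONS_PER_USER_PER_DAY = 1    # assumption: one session per user per day
PEAK_WINDOW_SECONDS = 30 * 60    # assumption: 30-minute open/close window
PEAK_TRAFFIC_SHARE = 0.2         # assumption: 20% of daily calls hit that window

daily_calls = ACTIVE_USERS * SESSIONS_PER_USER_PER_DAY * CALLS_PER_SESSION
peak_calls = daily_calls * PEAK_TRAFFIC_SHARE * PEAK_MULTIPLIER
peak_rps = peak_calls / PEAK_WINDOW_SECONDS

print(f"Estimated peak load: ~{peak_rps:,.0f} requests/second")
# => roughly 11,000 req/s, far beyond what a single server can absorb
```

Even with conservative inputs, the estimate lands well above single-server capacity, which motivates the scaling and caching options that follow.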
5. Data Collection and Analysis
Quantitative:
• APM tools (bottleneck identification)
• App response times, error rates
• Resource utilization (CPU, memory, network)
• App logs and DB queries (technical issues)
Qualitative:
• User interviews and surveys (pain points)
• Engineer interviews (potential causes)
Key Metrics
Metric             | Baseline Value   | Peak Hour Value  | Peak Hour Degradation
App Startup Time   | 3.5 seconds      | 7.5 seconds      | 114.3%
Screen Load Time   | 3.0 seconds      | 5.0 seconds      | 66.7%
User Error Rate    | 1%               | 2.5%             | 150%
API Response Time  | 100 milliseconds | 223 milliseconds | 123%
DBMS Avg. Load     | 47%              | 89%              | 89.4%
IOPS               | 5,000            | 15,000           | 200%
Session Length     | 12 minutes       | 15 minutes       | 25%
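The degradation column is simply the relative change from baseline to peak; a minimal check reproducing it:

```python
# Reproduce the Peak Hour Degradation column from (baseline, peak) pairs.
metrics = {
    "App Startup Time (s)": (3.5, 7.5),
    "Screen Load Time (s)": (3.0, 5.0),
    "User Error Rate (%)": (1.0, 2.5),
    "API Response Time (ms)": (100, 223),
    "DBMS Avg. Load (%)": (47, 89),
    "IOPS": (5_000, 15_000),
    "Session Length (min)": (12, 15),
}
for name, (baseline, peak) in metrics.items():
    degradation = (peak - baseline) / baseline * 100
    print(f"{name}: {degradation:.1f}%")
```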
6. Current Architecture
[Diagram: the Market Data Backend Service fetches data (exchanges, instruments & quotes) from NYSE, NASDAQ, TSX, and a 3rd-party data broker via the Data Ingestion Service. Data flows through the Data Processing Service (normalization, cleansing, QC) into the Data Storage Service, backed by a centralized database. The Data Delivery Service and Communication Layer push real-time streaming updates to client applications via webhook. Peak-hour stress points marked on the diagram: increased data requests, increased DB I/O, and increased network congestion.]
Primary Issues
• Scalability: servers or databases reaching capacity
• Inefficient data processing: slow data parsing & aggregation
• Network bottlenecks: limited bandwidth, latency issues, API timeouts
7. Proposed Architecture
[Diagram: NYSE, NASDAQ, TSX, and the 3rd-party data broker feed a horizontally scaled Data Ingestion Service (worker instances/nodes behind an ALB), which publishes data to a data exchange; the Data Processing and Data Storage Services subscribe to the relevant Kafka topics, with the database sharded. The Data Delivery Service answers front-end data requests from an asynchronously updated Cache Service, fetches non-cached data from the DB and updates the cache, and requests data from the Ingestion Service when it is present in neither the DB nor the cache. Clients continue to receive real-time streaming updates via webhook.]
Solution Candidates
• Scheduled horizontal scaling
• Reduce # of requests to the backend
• Fetch data from the cache (see the cache-aside sketch below)
• Decouple data generation from data delivery
• DB optimizations (sharding)
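The "fetch data from the cache" candidate is the classic cache-aside pattern: serve hits from the cache, and on a miss fetch from the DB and write back. A minimal sketch using redis-py follows; the key format, TTL, and the fetch_quote_from_db helper are illustrative assumptions, not part of the original design.

```python
import json
import redis

# Cache-aside sketch for quote lookups. Key format, TTL, and the DB
# helper below are illustrative assumptions.
cache = redis.Redis(host="localhost", port=6379, decode_responses=True)
QUOTE_TTL_SECONDS = 2  # market quotes go stale quickly

def fetch_quote_from_db(symbol: str) -> dict:
    """Placeholder for the real (slow) database lookup."""
    return {"symbol": symbol, "last_price": 123.45}

def get_quote(symbol: str) -> dict:
    key = f"quote:{symbol}"
    cached = cache.get(key)
    if cached is not None:               # cache hit: skip the DB entirely
        return json.loads(cached)
    quote = fetch_quote_from_db(symbol)  # cache miss: go to the DB
    cache.set(key, json.dumps(quote), ex=QUOTE_TTL_SECONDS)  # write back with TTL
    return quote
```

A short TTL keeps quotes fresh while still absorbing the bulk of repeated reads during the open/close spike.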
8. Proposed Solutions
Recommendation | Description | Effort | Impact | Priority
Horizontal Scaling: increase server capacity by adding servers during peak hours | Low cost, quick implementation; improves peak-hour performance (see the scheduling sketch below) | Low–Medium | High | High
Reduce # of queries to backend: implement a web accelerator like Varnish | Low cost, moderate dev effort | Medium | High | High
Caching Market Data: store frequently accessed market data in Redis to reduce backend load | Moderate cost, requires dev effort; reduces DB load, improves retrieval speed | Medium | High | High
Optimizing Data Pipelines: decouple data ingestion & delivery pipelines with Kafka | Moderate cost, requires technical expertise | High | High | Medium
Database Indexing: create indexes on frequently queried columns such as stock symbol and LTP | Low cost, requires DB expertise | Low | Medium | Medium
Database Sharding: divide the DB into smaller, logical, horizontal partitions, moving out historical data | High cost, complex implementation | High | High | Low
(Recommendations run from short-term at the top to long-term at the bottom.)
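Scheduled horizontal scaling maps naturally onto cloud autoscaling schedules. A sketch using boto3 follows, assuming an AWS EC2 Auto Scaling group; the group name, capacities, and cron times are hypothetical.

```python
import boto3

# Sketch: scale the market-data fleet out before market open and back
# in after the close. Group name, capacities, and times are hypothetical;
# recurrence cron expressions are evaluated in UTC.
autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Scale out ahead of the 9:30 AM ET open.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="market-data-service",
    ScheduledActionName="scale-out-market-open",
    Recurrence="0 13 * * MON-FRI",
    MinSize=4,
    MaxSize=12,
    DesiredCapacity=8,
)

# Scale back in after the 4:00 PM ET close.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="market-data-service",
    ScheduledActionName="scale-in-after-close",
    Recurrence="30 21 * * MON-FRI",
    MinSize=2,
    MaxSize=4,
    DesiredCapacity=2,
)
```

Because the peaks are fixed to exchange hours, schedule-driven scaling avoids the lag of purely reactive (metric-triggered) autoscaling.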
9. Trade-offs & Implementation Details
Category | Trade-off | Potential Impact | Mitigation Strategies
Data Ingestion Volume vs. Processing Speed | Worker nodes might become overloaded during peak hours, leading to data processing delays and added latency | Reduced responsiveness; inaccurate or delayed data delivery | Scale worker nodes horizontally; apply load-balancing strategies
Real-time vs. Historical Data Granularity | Keeping large amounts of historical data in the cache raises storage costs (while speeding retrieval); fetching it on demand from the DB can add latency | Increased resource consumption; slower delivery of historical data | Apply retention policies so only relevant history is stored; cache frequently accessed data; optimize DB queries for retrieval speed
Messaging Queue Technology Choice | Choosing the wrong MQ technology can lead to performance limitations, scalability issues, and integration complexity with other system components | System bottlenecks; data delivery failures; increased dev & maintenance costs | Research and evaluate MQ options such as RabbitMQ for compatibility with the existing tech (see the sketch after this table)

Estimated Timeline: Short term 2–6 months | Long term 6–12 months
Team Involvement: The dev team leads the technical implementation; collaboration with DevOps and DBAs is crucial
Integration Challenges: Existing systems might need API integrations or data-format transformations to connect seamlessly with new components; thorough testing and validation are required
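Whatever queue is chosen, decoupling ingestion from delivery follows the same produce/consume shape. A sketch with confluent-kafka is below, since the proposed architecture names Kafka; the broker address, topic name, and payload shape are illustrative, and RabbitMQ would look structurally similar.

```python
import json
from confluent_kafka import Producer, Consumer

# Ingestion side: publish normalized quotes to a topic instead of
# writing straight through to the delivery path.
producer = Producer({"bootstrap.servers": "localhost:9092"})

def publish_quote(quote: dict) -> None:
    producer.produce("quotes", key=quote["symbol"], value=json.dumps(quote))
    producer.poll(0)  # serve delivery callbacks without blocking

# Delivery side: an independent consumer group reads at its own pace,
# so a slow delivery path no longer back-pressures ingestion.
consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "data-delivery-service",
    "auto.offset.reset": "latest",
})
consumer.subscribe(["quotes"])

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    quote = json.loads(msg.value())
    # ...push to cache / websocket fan-out here...
```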
10. The Way Forward
• Prioritize quick wins and low-risk improvements
  • Scaling out at pre-defined intervals, web accelerator, data caching
• Comprehensive monitoring of performance metrics & user feedback
  • Data ingestion rate, worker node CPU utilization, Redis cache hit rate, avg. response time, etc.
• Load testing with synthetic traffic to size peak demand (see the sketch after this list)
• Gather ongoing user and engineer feedback
• These recommendations will improve both peak and non-peak performance
• Machine learning-based load prediction factoring in external variables
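For the synthetic load test, one option is a minimal Locust script that replays the dominant read pattern; the endpoint paths, tickers, and task weights below are hypothetical placeholders.

```python
import random
from locust import HttpUser, task, between

SYMBOLS = ["AAPL", "MSFT", "GOOG", "AMZN", "TSLA"]  # sample tickers

class MarketDataUser(HttpUser):
    """Simulates a trader polling quotes during the open/close spike.
    Endpoint paths are hypothetical placeholders."""
    wait_time = between(0.5, 2.0)  # think time between calls

    @task(4)  # quote polling dominates traffic
    def get_quote(self):
        self.client.get(f"/api/v1/quotes/{random.choice(SYMBOLS)}")

    @task(1)
    def get_trends(self):
        self.client.get("/api/v1/market/trends")
```

Run with, e.g., `locust -f loadtest.py --host https://staging.example.com` and ramp users until the baseline metrics above begin to degrade; that knee point sizes the demand the fleet must absorb.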
Success KPIs
• 50% reduction in API response time, DBMS avg. load %, and IOPS
• 60% reduction in user error rate
• 50% reduction in app startup time and screen load time
• Improved user experience, reduced churn, and increased platform adoption