The number of internet users is increasing rapidly, and so is the number of mobile and web applications. Processing and analyzing user activity is one way to monitor these applications. Much of this user activity is captured by the mobile app as a structured log.
The problem we are trying to solve here is building and operating a processing backend that ingests activity data from millions of devices with availability and SLA guarantees.
This talk was presented at AWS Community Day Bengaluru 2019 by Kokilavani Kathiresan, Ravikumar Kota, and Shailja Agarwala of Intuit.
2. Introduction
- Containers, serverless, and microservice architectures change the way software is built
- The systems are more distributed, and more ephemeral
- No complex system is ever fully healthy
- Better resilience and fault tolerance are the goal
- Ease of debugging is a cornerstone of maintaining and evolving robust systems
3. Observability
- The internal state of the system should be inferable from its external outputs
- Reduce MTTD and MTTR
- Verify the health of the service proactively
- Know what is broken, and why
- Provides the all-important feedback that drives future iterations
5. Our Business Case
- Collect logs, traces, and metrics from mobile apps and web browsers
- Gain insight into the application
- Understand user behavior patterns
- Monitor application performance
6. Front-end Logging Service
- Exposes a REST endpoint
- Spring Boot application that accepts the compressed log message
- Decompresses and validates the payload
- Forwards it to the application's log destination (Splunk)
Requirements:
- 20,000 transactions per second
- 1 second latency
[Diagram: compressed, batched logs flow from the Internet to the Logging Service in the AWS account]
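The decompress-and-validate step can be sketched as follows. This is only an illustration in Python, not the Spring Boot code from the talk. It assumes gzip compression and a JSON array of events, neither of which the slides specify. The field names, the size cap, and `decode_batch` itself are hypothetical.

```python
import gzip
import json
import zlib

MAX_DECOMPRESSED_BYTES = 1_000_000  # hypothetical cap on the decompressed batch size
REQUIRED_FIELDS = ("timestamp", "message")  # hypothetical event schema


class InvalidPayload(ValueError):
    """Raised when a batch cannot be decompressed or fails validation."""


def decode_batch(body: bytes) -> list:
    """Decompress a gzip body and validate it as a JSON array of log events."""
    try:
        raw = gzip.decompress(body)
    except (OSError, EOFError, zlib.error) as exc:
        raise InvalidPayload("body is not valid gzip") from exc
    if len(raw) > MAX_DECOMPRESSED_BYTES:
        raise InvalidPayload("decompressed batch is too large")
    try:
        events = json.loads(raw)
    except ValueError as exc:  # covers JSON syntax errors and bad UTF-8
        raise InvalidPayload("body is not valid JSON") from exc
    if not isinstance(events, list):
        raise InvalidPayload("expected a JSON array of events")
    for event in events:
        if not isinstance(event, dict) or any(f not in event for f in REQUIRED_FIELDS):
            raise InvalidPayload("event is missing required fields")
    return events
```

Rejecting a bad batch with a single exception type lets the endpoint map every failure to one client-error response before anything is forwarded to the log destination.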
7. Latency Improvement
We split the service into two microservices.
Producer:
- Receives the request and validates the sender
- Accepts the payload
- Puts the data on the queue
Consumer:
- Polls the data from the queue
- Extracts the payload and validates the data
- Sends it to the log destination
[Diagram: Logging Service Producer -> SQS -> Logging Service Consumer]
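A minimal sketch of the producer/consumer split follows, with an in-process `queue.Queue` standing in for SQS so that it runs without AWS. The sender allow-list, the status codes, and the message shape are assumptions made for illustration and do not come from the talk.

```python
import json
import queue

ALLOWED_SENDERS = {"mobile-app", "web-app"}  # hypothetical sender allow-list


def produce(q, sender, payload):
    """Producer: validate the sender, accept the payload, put it on the queue."""
    if sender not in ALLOWED_SENDERS:
        return 403
    q.put(json.dumps({"sender": sender, "payload": payload}))
    return 202  # accepted; validation and forwarding happen asynchronously


def consume(q, forward):
    """Consumer: drain the queue, validate each message, forward the valid ones."""
    forwarded = 0
    while True:
        try:
            body = q.get_nowait()
        except queue.Empty:
            return forwarded
        payload = json.loads(body).get("payload")
        if isinstance(payload, str) and payload:  # hypothetical validation rule
            forward(payload)
            forwarded += 1
```

The producer returns as soon as the message is queued, so the caller's latency no longer depends on payload validation or on the log destination.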
9. Well Architected Framework
Five pillars:
- Operational excellence
- Security
- Reliability
- Performance efficiency
- Cost optimization
10. EC2 Setup
Producer:
- Compute intensive (c5.2xlarge)
- Number of instances: 3 to 20
Consumer:
- Memory intensive (m5.2xlarge)
- Number of instances: 3 to 20
Alarms:
- Based on JVM metrics sent to CloudWatch
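As an illustration of a JVM metric that an alarm could watch, the sketch below builds the request shape that CloudWatch's PutMetricData API accepts. The namespace, the metric name, and the choice of heap usage are assumptions, as the slides do not say which JVM metrics are published.

```python
def heap_usage_request(instance_id, used_bytes, max_bytes):
    """Build a PutMetricData-style request for JVM heap usage as a percentage."""
    percent = round(100.0 * used_bytes / max_bytes, 2)
    return {
        "Namespace": "LoggingService/JVM",  # hypothetical namespace
        "MetricData": [
            {
                "MetricName": "HeapUsedPercent",  # hypothetical metric name
                "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
                "Value": percent,
                "Unit": "Percent",
            }
        ],
    }
```

With boto3, a dict of this shape could be passed as `cloudwatch.put_metric_data(**request)`. An alarm on the metric would then compare it against a threshold chosen by the team.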
12. Route 53
- Expose the producer ELB through Route 53
- The Route 53 endpoint is hosted behind the Intuit API gateway
- Disaster recovery through multiple CNAME records across regions
17. Target Groups
- With auto scaling and load balancers involved, target groups route requests to EC2 instances and microservices
- Requests are sent to new targets as soon as registration is complete and the initial health check has passed
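The registration and health-check behavior can be modeled with a toy target group. This is a simplified in-memory simulation rather than the load balancer's actual implementation, and the round-robin selection is an assumption made for illustration.

```python
import itertools


class TargetGroup:
    """Toy model: a target receives traffic only after it passes a health check."""

    def __init__(self):
        self._healthy = {}  # target id -> True once a health check has passed
        self._counter = itertools.count()

    def register(self, target):
        self._healthy[target] = False  # registered, but not yet routable

    def report_health(self, target, healthy):
        if target in self._healthy:
            self._healthy[target] = healthy

    def route(self):
        """Pick the next healthy target in round-robin order."""
        eligible = sorted(t for t, ok in self._healthy.items() if ok)
        if not eligible:
            raise RuntimeError("no healthy targets")
        return eligible[next(self._counter) % len(eligible)]
```

A newly launched instance from a scale-out event therefore starts receiving requests only once `report_health` has marked it healthy.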
20. AMI Restack
Background:
- The Intuit compliance team applies security patches, and new baseline images are released every 2 weeks
- App teams must either use these AMIs or derive their own AMIs from those baseline images
- We automated this entire process using a CloudWatch rule and CodeBuild
22. CodeBuild Logs - Baking the Logging Service AMI
- Launch a new EC2 instance from the baseline AMI
- Copy the Chef recipes required to install software such as Java, along with the configuration required for the Splunk forwarder and log rotation
- Bake the logging service AMI
- Publish a CloudWatch event with the AMI ID
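The final step, publishing an event that carries the AMI ID, might look like the sketch below. It builds an entry in the shape that the CloudWatch Events PutEvents API expects. The source string, the detail type, and the detail field names are hypothetical, since the slides do not show the event.

```python
import json


def baked_ami_event(ami_id, baseline_ami_id):
    """Build a PutEvents-style entry announcing a newly baked AMI."""
    return {
        "Source": "custom.logging-service.ami-bake",  # hypothetical source name
        "DetailType": "Logging Service AMI Baked",  # hypothetical detail type
        # Detail must be a JSON string; these field names are assumptions.
        "Detail": json.dumps({"amiId": ami_id, "baselineAmiId": baseline_ami_id}),
    }
```

With boto3, such an entry could be sent as `events.put_events(Entries=[entry])`. A rule whose event pattern matches the source and detail type would then fire for each baked AMI.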
24. CloudWatch Rule on Baked AMI
- A CloudWatch rule is configured to trigger when a logging service AMI is baked
- Two targets are configured on this rule:
  - Lambda function: creates a new launch configuration with the new AMI and updates the ASG
  - CodePipeline: CD service that automates the steps to release the logging service
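A sketch of what the Lambda target could do is shown below, using the boto3 Auto Scaling calls `create_launch_configuration` and `update_auto_scaling_group`. The ASG name, the naming scheme, and the location of the AMI ID inside the event are assumptions. The talk does not show the function's code.

```python
import time

ASG_NAME = "logging-service-producer-asg"  # hypothetical ASG name
INSTANCE_TYPE = "c5.2xlarge"  # producer instance type from the EC2 setup slide


def roll_out_ami(autoscaling, ami_id, now=None):
    """Create a launch configuration for ami_id and point the ASG at it."""
    stamp = int(now if now is not None else time.time())
    lc_name = f"logging-service-{ami_id}-{stamp}"  # hypothetical naming scheme
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=lc_name,
        ImageId=ami_id,
        InstanceType=INSTANCE_TYPE,
    )
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName=ASG_NAME,
        LaunchConfigurationName=lc_name,
    )
    return lc_name


def handler(event, context):
    """Lambda entry point; assumes the AMI ID is at event['detail']['amiId']."""
    import boto3  # imported lazily so roll_out_ami can be exercised without AWS

    return roll_out_ami(boto3.client("autoscaling"), event["detail"]["amiId"])
```

Updating the launch configuration affects only instances launched afterwards. Replacing the running instances would be a separate step, for example in the release pipeline.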