This deck describes processing terabytes of data every day using AWS services. It covers incidents the team faced, such as Lambda timeouts when processing large files, IP-address starvation when scaling Lambdas, and missing-data incidents, along with lessons learned about using serverless architectures efficiently: splitting large tasks, separating subnets, and implementing custom retry logic. Best practices for development, releases, and incident response are also outlined.
3. Luciano Mammino, Cloud Architect
🐦 @loige
😸 github.com/lmammino
🌍 loige.co
Co-authored a book with @mariocasciaro, rated 4.7 out of 5 stars on Amazon.com
4. Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
5. AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
6. Cognito Detect
An on-premise solution (soon also for the cloud!)
● Analyses network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour-driven and host-centric
● Provides threat context and the most relevant attack details
8. Cognito Recall
A Vectra product for Incident Response
● Collects network metadata and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
9. Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer (up to ~3 TB per day per customer)
● Investigation tool: flexible data exploration
19. Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (max)
● We receive files every minute (each containing 1 minute of network traffic)
● During peak hours for the biggest customer, files can be too big to be processed within the timeout limit (see the splitting sketch below)
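One way out of this incident is to split an oversized file and fan the pieces out to separate invocations. The talk does not show code, so the sketch below is our own illustration: CHUNK_SIZE, process_chunk and the self-invocation payload shape are all assumptions.

import json
import os

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

CHUNK_SIZE = 512 * 1024 * 1024  # hypothetical: 512 MiB fits well within 15 min


def handler(event, context):
    if "Records" in event:  # original S3 "new file" notification
        s3_info = event["Records"][0]["s3"]
        bucket, key = s3_info["bucket"]["name"], s3_info["object"]["key"]
        size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if size <= CHUNK_SIZE:
            process_chunk(bucket, key, 0, size - 1)
            return
        # Too big for one invocation: fan out one asynchronous
        # self-invocation per byte range (real code must also align
        # ranges on record boundaries so no record is cut in half).
        for start in range(0, size, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, size) - 1
            lambda_client.invoke(
                FunctionName=os.environ["AWS_LAMBDA_FUNCTION_NAME"],
                InvocationType="Event",  # asynchronous
                Payload=json.dumps({"bucket": bucket, "key": key,
                                    "range": [start, end]}),
            )
    else:  # payload from a self-invocation: process just one chunk
        process_chunk(event["bucket"], event["key"], *event["range"])


def process_chunk(bucket, key, start, end):
    # Ranged GET: downloads and processes only this slice of the file
    body = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={start}-{end}")["Body"]
    for line in body.iter_lines():
        pass  # parse and ingest one network-metadata record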
24. Lambdas IP starvation incident
● Spinning up many Lambdas consumed all the available IPs in a subnet
● New Elasticsearch machines failed to get an IP
● Elasticsearch could not scale up
● Solution: separate the Elasticsearch and Lambda subnets
25. Lessons learned
● Every running Lambda inside a VPC uses an ENI (Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the number of running Lambdas, and you might run out of IPs in the subnet! (a monitoring sketch follows this list)
● Consider putting Lambdas in their own dedicated subnet
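The lesson above suggests an easy early-warning signal. Here is a minimal sketch (our assumption, not shown in the talk) that publishes each subnet's free-IP count as a custom CloudWatch metric so an alarm can fire before ENI allocation starts failing; the namespace and subnet ID are placeholders.

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def publish_free_ip_counts(subnet_ids):
    # AvailableIpAddressCount is reported directly by DescribeSubnets
    resp = ec2.describe_subnets(SubnetIds=subnet_ids)
    for subnet in resp["Subnets"]:
        cloudwatch.put_metric_data(
            Namespace="Custom/Networking",  # illustrative namespace
            MetricData=[{
                "MetricName": "AvailableIpAddressCount",
                "Dimensions": [
                    {"Name": "SubnetId", "Value": subnet["SubnetId"]}
                ],
                "Value": subnet["AvailableIpAddressCount"],
                "Unit": "Count",
            }],
        )


if __name__ == "__main__":
    # e.g. run on a schedule; the subnet ID is a placeholder
    publish_free_ip_counts(["subnet-0123456789abcdef0"])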
30. Why didn’t we receive the page?
● A new Lambda version triggered insertion failures
● Elasticsearch was rejecting inserts and logging errors
● Our log-reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/alerting failed
Resolution:
● Fixed the mismatching schema
● Scaled out the centralised logging system
31. Alerting on Lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your Lambdas
● Alert on queue size with CloudWatch! (alarm sketch below)
● Gain visibility on Lambda retries
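A minimal sketch of the "alert on queue size" idea: any message becoming visible on the DLQ trips a CloudWatch alarm. The talk shows no code, so the queue name, alarm name and SNS topic ARN below are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ingestion-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingestion-dlq"}],
    Statistic="Maximum",
    Period=60,             # evaluate every minute
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # any message triggers it
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pager-topic"],
)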
35. Fast retry at peak times
● Lambda retry logic is not configurable (loige.link/lambda-retry)
● Most events will be retried 2 times
● The time between retry attempts is not clearly defined (observed to be on the order of a few seconds)
● What if all retry attempts happen at peak times?
39. Fast retry at peak times
Processing in this range of time is likely to fail
40. Fast retry at peak times
If both retries land in the same peak window, the message will fail and go to the DLQ
41. Can we extend the retry period
in case of failure?
42. Extended retry period
We normally trigger our ingestion Lambda when a new file is stored in S3
43. Extended retry period
If the Lambda fails, the event is automatically retried, up to 2 times
44. Extended retry period
If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
45. Extended retry period
At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
46. Extended retry period
If the processing still fails, we can extend the VisibilityTimeout (event delay) and let the message be retried, up to 3 times (x3)
47. Extended retry period
If the processing still fails after those attempts, we eventually drop the message and alert for manual intervention (sketched below)
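Slides 45 to 47 can be condensed into a single handler. The code below is our illustrative reconstruction, not Vectra's actual code: the queue URL, delay, and helper names are assumptions, and we assume an SQS batch size of 1. On failure it extends the message's VisibilityTimeout so the next attempt lands off-peak; after three attempts it drops the message and pages a human.

import json

import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingestion-dlq"  # placeholder
BASE_DELAY = 15 * 60  # hypothetical: wait at least 15 minutes between attempts
MAX_ATTEMPTS = 3


def handler(event, context):
    record = event["Records"][0]  # SQS event from the DLQ (batch size 1)
    attempt = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        reprocess(json.loads(record["body"]))
    except Exception:
        if attempt >= MAX_ATTEMPTS:
            page_oncall(record)  # give up: alert for manual intervention
            return  # returning normally deletes (drops) the message
        # Push the re-delivery further out on each attempt,
        # so it lands past the peak window
        sqs.change_message_visibility(
            QueueUrl=DLQ_URL,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=BASE_DELAY * attempt,
        )
        raise  # fail the invocation so SQS re-delivers after the timeout


def reprocess(payload):
    pass  # re-run the ingestion step for this event


def page_oncall(record):
    pass  # e.g. publish to an SNS topic wired to PagerDuty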
48. Lessons learned
● We cannot always rely on the default retry logic
● SQS events + DLQ = custom SERVERLESS retry logic
● Now we only alert on custom metrics when we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
50. AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ Charged in 100 ms blocks (see the cost example below)
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
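To make the 100 ms blocks concrete, here is a back-of-the-envelope calculation. The rate and numbers are our assumptions based on public Lambda pricing at the time, not figures from the talk.

# Assumed public rate: ~$0.0000166667 per GB-second (per-request fee omitted)
GB_SECOND_RATE = 0.0000166667


def invocation_cost(memory_mb, duration_ms):
    billed_ms = -(-duration_ms // 100) * 100  # round UP to the next 100 ms block
    return (memory_mb / 1024) * (billed_ms / 1000) * GB_SECOND_RATE


# One file per minute, all day, on a 3008 MB function that runs 101 ms:
# 101 ms is billed as 200 ms, and time spent *waiting* on I/O is billed
# exactly like time spent computing ("you are paying for wait time").
per_day = 24 * 60 * invocation_cost(memory_mb=3008, duration_ms=101)
print(f"~${per_day:.4f} per customer per day")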
51. AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
52. AWS nuances
● Limits everywhere!
○ Soft vs hard limits (a query sketch follows below)
○ Take them into account in your design
● https://loige.link/aws-service-limits
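Limits can also be inspected programmatically. A small sketch, our addition assuming the AWS Service Quotas API: it lists Lambda quotas and uses the Adjustable flag to tell soft limits (raisable on request) from hard ones.

import boto3

quotas = boto3.client("service-quotas")

resp = quotas.list_service_quotas(ServiceCode="lambda", MaxResults=100)
for quota in resp["Quotas"]:
    kind = "soft" if quota["Adjustable"] else "hard"
    print(f'{quota["QuotaName"]}: {quota["Value"]} ({kind} limit)')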
54. Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
55. Pages
● A page is an alarm for the people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is automatically escalated
● If customer-facing (e.g. service not available), the customer is notified
56. Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
57. Incidents Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Review
TL;DR: NOT A BLAME GAME!
60. Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop Starting, Start Finishing”
61. Development best practices
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small units (cards)
○ High throughput
○ More predictability
● Bugs take priority over features!
62. Development best practices
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
64. Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”: one code base for all customers
● Feature flags (see the sketch below):
○ Special features
○ Soft releases
● Automated tests before release
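For the feature-flags bullet, a tiny illustrative sketch (our assumption; the talk shows no implementation) of per-customer flags on a single shared code base: enabling a special feature or soft-releasing to one customer becomes a configuration change, not a code fork.

import json
import os

# e.g. FEATURE_FLAGS='{"new_enrichment_pipeline": ["customer-a"]}'
FLAGS = json.loads(os.environ.get("FEATURE_FLAGS", "{}"))


def is_enabled(flag, customer_id):
    # A flag maps to the list of customers it is enabled for
    return customer_id in FLAGS.get(flag, [])


if is_enabled("new_enrichment_pipeline", "customer-a"):
    pass  # run the soft-released code path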
65. Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Allow your team space to fail
● Always review and strive for improvement
● Monitor/Instrument as much as you can
● Use managed services to reduce the operational overhead
(but learn their nuances)
66. We are hiring… Talk to us!
Thank you!
- loige.link/tera-inf -
67. Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic