Processing Terabytes of data every day
… and sleeping at night
@katavic_d - @loige
London, 04/07/2019
- loige.link/tera-inf -
Domagoj Katavic, Senior Software Engineer
🐦 @katavic_d
😸 github.com/dkatavic
Luciano Mammino, Cloud Architect
🐦 @loige
😸 github.com/lmammino
🌍 loige.co
4.7 out of 5 stars
on Amazon.com
With @mariocasciaro
Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
Cognito Detect
on premise solution
(soon also for the cloud!)
● Analyzing network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour driven & Host centric
● Provides threat context and most relevant
attack details
Cognito Recall
● Collects network metadata
and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
A Vectra product for Incident Response
Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer
(up to ~3 TB per day per customer)
● Investigation tool:
Flexible data exploration
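As a quick sanity check of the ingestion figures above, the per-day volume follows from the per-minute rate. This is a minimal sketch, assuming decimal units (1 TB = 1000 GB) and a rate sustained for the whole day, which real traffic will not do:

```python
MINUTES_PER_DAY = 24 * 60


def daily_volume_tb(gb_per_minute: float) -> float:
    """Convert a sustained ingestion rate in GB/min to TB/day (decimal units)."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


print(daily_volume_tb(2))  # 2.88, in line with the ~3 TB per day upper bound
```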
Our first iteration
Control plane
Centralised Logging & Metrics
Security
● Separate VPCs
● Strict Security Groups (whitelisting)
● Red, amber, green subnets
● Encryption at rest through AWS services
● Client Certificates + TLS
● Pentest
Let’s start the beta!
Warning: different timezones!
A cu m
Our ne *
*yeah, we actually look that cute when we sleep!
Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (max)
● We are receiving files every minute
(containing 1 minute of network traffic)
● During peak hours for the biggest customer, files
can be too big to be processed within timeout
limits
Splitter lambda
Message-aware splitting
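One way to read "message-aware splitting": cut a large input into chunks of bounded size without ever breaking a message in half, so each chunk can be handed to its own Lambda invocation. The sketch below assumes newline-delimited messages and a hypothetical `max_chunk_bytes` budget; the slides do not describe the actual file format or splitting rule.

```python
from typing import Iterable, Iterator, List


def split_messages(lines: Iterable[bytes], max_chunk_bytes: int) -> Iterator[List[bytes]]:
    """Group whole messages into chunks of at most max_chunk_bytes.

    A single message larger than the budget is emitted alone rather than cut.
    """
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        # Close the current chunk if this message would push it over budget.
        if chunk and size + len(line) > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


messages = [b"aaaa\n", b"bb\n", b"cccccc\n", b"d\n"]
chunks = list(split_messages(messages, max_chunk_bytes=8))
print(chunks)  # [[b'aaaa\n', b'bb\n'], [b'cccccc\n'], [b'd\n']]
```

Bounding the chunk size is what makes the per-invocation processing time predictable, whatever the size of the original file.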
Lessons learned
● Predictable data input for
predictable performance
● Data ingestion parallelization
(exploiting serverless
capabilities)
Lambdas IP starvation incident
● Spinning up many lambdas consumed
all the available IPs in a subnet
● Failure to get an IP for the new ES
machines
● ElasticSearch cannot scale up
● Solution: separate ElasticSearch and
Lambda subnets
Lessons learned
● Every running lambda inside a VPC uses an ENI
(Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the
number of running lambdas and you might run out of
IPs in the subnet!
● Consider putting lambdas in their dedicated subnet
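To get a feel for how quickly a subnet can run dry, the headroom can be estimated from the CIDR block. This is a rough sketch, assuming one ENI (and so one private IP) per running Lambda as described above; the CIDR, host count and concurrency figures are made up for illustration. AWS reserves 5 addresses in every subnet.

```python
import ipaddress

# AWS reserves 5 addresses in every subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of private IPs AWS can hand out in a subnet with this CIDR block."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def eni_headroom(cidr: str, other_hosts: int, concurrent_lambdas: int) -> int:
    """IPs left after other hosts and one ENI per running Lambda.

    A negative result means the subnet would run out of addresses.
    """
    return usable_ips(cidr) - other_hosts - concurrent_lambdas


print(usable_ips("10.0.1.0/24"))             # 251
print(eni_headroom("10.0.1.0/24", 20, 300))  # -69: a spike of 300 Lambdas starves the subnet
```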
Missing data incident
● New lambda version: triggered insertion failures
● ElasticSearch rejecting inserts and logging errors
● Our log reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/Alerting failed
Resolution:
● Fix mismatching schema
● Scaled out centralised logging system
Why didn’t we receive the page?
Alerting on lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your lambdas
● Alert on queue size with
CloudWatch!
● Visibility on Lambda retries
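The DLQ approach above could be wired up roughly as follows. This is a sketch, not the team's actual setup: the alarm name, the 5-minute period and the "anything in the queue is a problem" threshold are assumptions, while `put_metric_alarm` and the `AWS/SQS` metric `ApproximateNumberOfMessagesVisible` are standard CloudWatch pieces.

```python
def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> dict:
    """Build arguments for CloudWatch put_metric_alarm: fire when the DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 300,
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        # An idle queue may report no data points; do not page for that.
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],
    }


def create_dlq_alarm(cloudwatch, queue_name: str, sns_topic_arn: str) -> None:
    """Create the alarm; cloudwatch is expected to be a boto3 CloudWatch client."""
    cloudwatch.put_metric_alarm(**dlq_alarm_params(queue_name, sns_topic_arn))
```

With boto3 installed, this would be called as `create_dlq_alarm(boto3.client("cloudwatch"), queue_name, topic_arn)`, where the SNS topic is whatever forwards to the paging tool. Unlike log-based alerting, this keeps working when the logging pipeline itself is down.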
Fast retry at peak times
● Lambda retry logic is not configurable
loige.link/lambda-retry
● Most events will be retried 2 times
● Time between retry attempts is not clearly defined
(observed to be in the order of a few seconds)
● What if all retry attempts happen at peak times?
Fast retry at peak times
● Off-peak: processing in this range of time is likely to succeed
● Peak: processing in this range of time is likely to fail
● If the 1st retry and the 2nd retry fall in the same peak zone, the message will fail and go to the DLQ
Can we extend the retry period
in case of failure?
Extended retry period
1. We normally trigger our ingestion Lambda when a new file is stored in S3
2. If the Lambda fails, the event is automatically retried, up to 2 times
3. If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
4. At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
5. If the processing still fails, we can extend the VisibilityTimeout (event delay) (x3)
6. If the processing still fails, we eventually drop the message and alert for manual intervention
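The extended retry flow above can be sketched as a handler for a single SQS record. It is written under several assumptions that the slides do not confirm: the DLQ is consumed through a Lambda SQS trigger with a batch size of 1, the delay doubles on every attempt starting from a hypothetical 5 minutes, and "x3" means three attempts in total. `process`, `alert`, `sqs` and `queue_url` are injected placeholders; `sqs` is expected to be a boto3 SQS client.

```python
MAX_ATTEMPTS = 3          # the "x3" in the diagram, read here as 3 attempts in total
BASE_DELAY_SECONDS = 300  # hypothetical first delay


def next_delay(receive_count: int):
    """Delay in seconds before the next attempt, or None once attempts are used up."""
    if receive_count >= MAX_ATTEMPTS:
        return None
    return BASE_DELAY_SECONDS * 2 ** (receive_count - 1)


def handle_record(record: dict, process, sqs, queue_url: str, alert) -> str:
    """Process one SQS record from the DLQ with an extended retry period."""
    try:
        process(record["body"])
        return "done"
    except Exception as error:
        receive_count = int(record["attributes"]["ApproximateReceiveCount"])
        delay = next_delay(receive_count)
        if delay is None:
            # Give up: returning normally lets the message be removed from the queue.
            alert(record, error)
            return "dropped"
        # Hide the message for longer, so the next attempt can land outside the peak.
        sqs.change_message_visibility(
            QueueUrl=queue_url,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=delay,
        )
        raise  # failing the invocation keeps the message on the queue
```

The point of the design is that the delay is now under our control: a failure at peak time is retried minutes later rather than seconds later.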
Lessons learned
● Cannot always rely on the default retry logic
● SQS events + DLQ =
custom SERVERLESS retry logic
● Now we only alert on custom metrics when
we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ 100ms charging blocks
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
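The effect of charging blocks is easy to see with a little arithmetic. This is a minimal sketch, assuming the 100 ms blocks mentioned above and billing proportional to configured memory; the unit price is left as a parameter to be taken from the pricing page, not a figure from the talk.

```python
import math


def billed_ms(duration_ms: float, block_ms: int = 100) -> int:
    """Round a duration up to the next charging block."""
    return math.ceil(duration_ms / block_ms) * block_ms


def invocation_cost(duration_ms: float, memory_mb: int, price_per_gb_second: float) -> float:
    """Cost of one invocation: billed seconds x configured GB x unit price."""
    return billed_ms(duration_ms) / 1000 * (memory_mb / 1024) * price_per_gb_second


print(billed_ms(100))  # 100
print(billed_ms(101))  # 200: one extra millisecond costs a whole extra block
```

The same rounding applies to time spent waiting on I/O, which is why "paying for wait time" and expensive bugs go together.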
AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
AWS nuances
● Limits everywhere!
○ Soft vs hard limits
○ Take them into account in your design
● https://loige.link/aws-service-limits
Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
Pages
● A page is an alarm for people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS,
Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is
automatically escalated
● If the incident is customer facing (e.g. service not available), the customer is notified
Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
Incident Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Review
TL;DR: NOT A BLAME GAME!
Incident Retrospective
● Summary
● Events timeline
● Contributing Factors
● Remediation / Solution
● Actions for the future
● Transparency
Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop Starting, Start Finishing”
Development best practices
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small units (cards)
○ High throughput
○ More predictability
● Bugs take priority over features!
Development best practices
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”, one code base for all customers
● Feature flags:
○ Special features
○ Soft releases
● Automated tests before release
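With one code base for all customers, feature flags are what make per-customer differences possible. The sketch below shows one common shape for this and is not the team's actual mechanism; the flag names, customer names and the `"*"` convention are all hypothetical.

```python
from typing import Dict, Set

# Hypothetical flag table: flag name -> customers that have it switched on.
# "*" means the flag is on for everyone (the soft release is complete).
FLAGS: Dict[str, Set[str]] = {
    "new-enrichment": {"customer-a"},    # soft release to a single customer
    "custom-retention": {"customer-b"},  # special feature
    "fast-search": {"*"},
}


def is_enabled(flag: str, customer: str, flags: Dict[str, Set[str]] = FLAGS) -> bool:
    """Unknown flags are off, so new code paths stay dark until enabled."""
    enabled_for = flags.get(flag, set())
    return "*" in enabled_for or customer in enabled_for


print(is_enabled("new-enrichment", "customer-a"))  # True
print(is_enabled("new-enrichment", "customer-b"))  # False
```

Keeping the table in version control alongside the infrastructure code means flag changes go through the same review and release steps as everything else.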
Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Allow your team space to fail
● Always review and strive for improvement
● Monitor/Instrument as much as you can
● Use managed services to reduce the operational overhead
(but learn their nuances)
We are hiring …
Talk to us!
Thank you!
- loige.link/tera-inf -
Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic

Finding a lost song with Node.js and async iterators - EnterJS 2021
 
How to send gzipped requests with boto3
How to send gzipped requests with boto3How to send gzipped requests with boto3
How to send gzipped requests with boto3
 
Finding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iteratorsFinding a lost song with Node.js and async iterators
Finding a lost song with Node.js and async iterators
 

Recently uploaded

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxhariprasad279825
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 

Recently uploaded (20)

Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Artificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptxArtificial intelligence in cctv survelliance.pptx
Artificial intelligence in cctv survelliance.pptx
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 

Processing Terabytes of Data

  • 1. Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige London, 04/07/2019 - loige.link/tera-inf -
  • 2. Domagoj Katavic, Senior Software Engineer
      🐦 @katavic_d
      😸 github.com/dkatavic
  • 3. Luciano Mammino, Cloud Architect
      🐦 @loige
      😸 github.com/lmammino
      🌍 loige.co
      4.7 out of 5 stars on Amazon.com
      With @mariocasciaro
  • 4. Agenda
      ● The problem space
      ● Our first MVP & Beta period
      ● INCIDENTS! And lessons learned
      ● AWS Nuances
      ● Process to deal with incidents
      ● Development best practices
      ● Release process
  • 5. AI to detect and hunt for cyber attackers
      Cognito Platform
      ● Detect
      ● Recall
  • 6. Cognito Detect
      On-premise solution (soon also for the cloud!)
      ● Analyzes network traffic and logs
      ● Uses AI to deliver real-time attack visibility
      ● Behaviour-driven & host-centric
      ● Provides threat context and the most relevant attack details
  • 8. Cognito Recall
      A Vectra product for Incident Response
      ● Collects network metadata and stores it in “the cloud”
      ● Data is processed, enriched and standardised
      ● Data is made searchable
  • 9. Recall requirements
      ● Data isolation
      ● Ingestion speed: ~2 GB/min per customer (up to ~3 TB per day per customer)
      ● Investigation tool: flexible data exploration
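
The daily figure follows from the per-minute rate. A quick check of the arithmetic, assuming decimal units (1 TB = 1000 GB) and a rate sustained for the whole day; the constant names are illustrative only:

```python
# Back-of-the-envelope check of the ingestion figures quoted on the slide.
GB_PER_MINUTE = 2            # "~2GB/min" per customer, from the slide
MINUTES_PER_DAY = 60 * 24    # 1440 minutes

gb_per_day = GB_PER_MINUTE * MINUTES_PER_DAY   # 2880 GB
tb_per_day = gb_per_day / 1000                 # 2.88 TB, i.e. roughly 3 TB

print(f"{gb_per_day} GB/day = {tb_per_day:.2f} TB/day")
```

So "~3 TB per day" is the sustained peak rate rounded up, which is why the slide presents it as an upper bound.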
  • 12. Diagram labels: Control plane; Centralised Logging & Metrics
  • 13. Security
      ● Separate VPCs
      ● Strict Security Groups (whitelisting)
      ● Red, amber, green subnets
      ● Encryption at rest through AWS services
      ● Client Certificates + TLS
      ● Pentest
  • 14. Let’s start the beta!
  • 15. Warning: different timezones!
      A cu m
      Our ne *
      *yeah, we actually look that cute when we sleep!
  • 19. Lambda timeouts incident
      ● AWS Lambda timeout: 15 minutes (max)
      ● We receive a file every minute (each containing 1 minute of network traffic)
      ● During peak hours for the biggest customer, files can be too big to be processed within the timeout limit
  • 22. Lessons learned
      ● Predictable data input for predictable performance
      ● Data ingestion parallelization (exploiting serverless capabilities)
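
One way to read these two lessons together: cap the amount of data each worker receives, then fan the pieces out in parallel. The slides do not describe how the team implemented this, so the following is only a minimal sketch. The function names are hypothetical, and a local thread pool stands in for parallel Lambda invocations:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import islice
from typing import Iterable, Iterator, List


def split_into_chunks(records: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield lists of at most chunk_size records, so every worker gets a bounded input."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    iterator = iter(records)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk


def process_chunk(chunk: List[str]) -> int:
    """Stand-in for one worker invocation; returns the number of records handled."""
    return len(chunk)


def ingest(records: Iterable[str], chunk_size: int, max_workers: int = 4) -> int:
    """Process all chunks in parallel and return the total number of records handled."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return sum(pool.map(process_chunk, split_into_chunks(records, chunk_size)))


print(ingest((f"record-{i}" for i in range(10)), chunk_size=3))
```

With a bounded chunk size, the worst-case duration of a single invocation no longer depends on how large the incoming file is.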
  • 24. Lambdas IP starvation incident
      ● Spinning up many Lambdas consumed all the available IPs in a subnet
      ● The new ES machines failed to get an IP
      ● Elasticsearch could not scale up
      ● Solution: separate Elasticsearch and Lambda subnets
  • 25. Lessons learned
      ● Every running Lambda inside a VPC uses an ENI (Elastic Network Interface)
      ● Every ENI takes a private IP address
      ● Edge conditions or bugs might generate spikes in the number of running Lambdas, and you might run out of IPs in the subnet!
      ● Consider putting Lambdas in their own dedicated subnet
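
To get a feel for how quickly a subnet can be exhausted, here is a rough capacity check. It follows the slide's model of one ENI (one private IP) per running Lambda, and uses the fact that AWS reserves 5 addresses in every subnet. The CIDR block and the usage numbers are made-up examples:

```python
import ipaddress

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves the first four addresses and the last one


def usable_ips(cidr: str) -> int:
    """Number of private IPs that can actually be assigned inside the subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_PER_SUBNET, 0)


def can_absorb_spike(cidr: str, ips_in_use: int, concurrent_lambdas: int) -> bool:
    """True if the subnet still has room when each running Lambda takes one IP."""
    return ips_in_use + concurrent_lambdas <= usable_ips(cidr)


print(usable_ips("10.0.1.0/24"))                  # 256 - 5 = 251
print(can_absorb_spike("10.0.1.0/24", 50, 300))   # 350 > 251, so False
```

Under this model, a shared /24 subnet is exhausted by a spike of a few hundred concurrent Lambdas, which is the argument for giving Lambdas a subnet of their own.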
  • 30. ● A new Lambda version triggered insertion failures
      ● Elasticsearch rejected the inserts and logged errors
      ● Our log reporting agents got stuck (we DDoS’d ourselves!)
      ● Monitoring/alerting failed
      Resolution:
      ● Fixed the mismatching schema
      ● Scaled out the centralised logging system
      Why didn’t we receive the page?
  • 31. Alerting on Lambda failures
      Using logs:
      ● Best case: no logs
      ● Worst case: no logs (available)!
      A better approach:
      ● Attach a DLQ (Dead Letter Queue) to your Lambdas
      ● Alert on queue size with CloudWatch!
      ● Visibility on Lambda retries
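
A sketch of what "alert on queue size with CloudWatch" could look like. The function only builds the parameters for CloudWatch's PutMetricAlarm API, so it runs without AWS access. The queue name, SNS topic ARN and thresholds are assumptions chosen for illustration, not values from the talk:

```python
from typing import Any, Dict


def dlq_alarm_params(queue_name: str, sns_topic_arn: str) -> Dict[str, Any]:
    """Build PutMetricAlarm arguments that fire as soon as the DLQ holds a message."""
    return {
        "AlarmName": f"{queue_name}-not-empty",
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,                     # evaluate one-minute windows
        "EvaluationPeriods": 1,
        "Threshold": 0,
        "ComparisonOperator": "GreaterThanThreshold",
        "TreatMissingData": "notBreaching",
        "AlarmActions": [sns_topic_arn],  # e.g. a topic wired to the paging tool
    }


# Hypothetical queue and topic names.
params = dlq_alarm_params(
    "ingestion-dlq", "arn:aws:sns:eu-west-1:123456789012:on-call"
)
# With boto3 installed and credentials configured, this could be applied with:
#   boto3.client("cloudwatch").put_metric_alarm(**params)
print(params["AlarmName"])
```

Alerting on the queue rather than on logs keeps the alert independent of the logging pipeline.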
  • 35. Fast retry at peak times
      ● Lambda retry logic is not configurable (loige.link/lambda-retry)
      ● Most events will be retried 2 times
      ● The time between retry attempts is not clearly defined (observed to be in the order of a few seconds)
      ● What if all retry attempts happen at peak times?
  • 36. Fast retry at peak times
  • 37. Fast retry at peak times
      Processing in these ranges of time is likely to succeed
  • 38. Fast retry at peak times
  • 39. Fast retry at peak times
      Processing in this range of time is likely to fail
  • 40. Fast retry at peak times
      If the retries fall in the same zone, the message will fail and go to the DLQ
      (chart labels: 1st retry, 2nd retry)
  • 41. Can we extend the retry period in case of failure?
  • 42. Extended retry period
      We normally trigger our ingestion Lambda when a new file is stored in S3
  • 43. Extended retry period
      If the Lambda fails, the event is automatically retried, up to 2 times
  • 44. Extended retry period
      If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
  • 45. Extended retry period
      At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
  • 46. Extended retry period
      If the processing still fails, we can extend the VisibilityTimeout (event delay) (x3)
  • 47. Extended retry period
      If the processing still fails, we eventually drop the message and alert for manual intervention (x3)
  • 48. Lessons learned
      ● Cannot always rely on the default retry logic
      ● SQS events + DLQ = custom SERVERLESS retry logic
      ● Now we only alert on custom metrics when we are sure the event will fail (logic error)
      ● https://loige.link/async-lambda-retry
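
The custom retry flow described above can be sketched as a small decision function. This is not the authors' implementation: the doubling delay, the base delay of 5 minutes and the function names are assumptions. The limit of 3 attempts mirrors the "x3" on the slides, and `ApproximateReceiveCount` is the attribute SQS attaches to each record delivered to a Lambda:

```python
from typing import Any, Callable, Dict

MAX_CUSTOM_ATTEMPTS = 3                     # the "x3" on the slides
BASE_DELAY_SECONDS = 300                    # assumed first delay: 5 minutes
SQS_MAX_VISIBILITY_TIMEOUT = 12 * 60 * 60   # SQS caps the visibility timeout at 12 hours


def plan_retry(receive_count: int) -> Dict[str, Any]:
    """Decide what to do after a failed attempt, given how often SQS delivered the message."""
    if receive_count >= MAX_CUSTOM_ATTEMPTS:
        return {"action": "drop_and_alert"}
    # Double the delay on every attempt so retries move away from the peak.
    delay = BASE_DELAY_SECONDS * 2 ** (max(receive_count, 1) - 1)
    return {
        "action": "extend_visibility",
        "visibility_timeout": min(delay, SQS_MAX_VISIBILITY_TIMEOUT),
    }


def handle_record(record: Dict[str, Any], process: Callable[[str], None]) -> Dict[str, Any]:
    """Process one SQS record from the DLQ and report the follow-up action."""
    receive_count = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        process(record["body"])
    except Exception:
        return plan_retry(receive_count)
    return {"action": "done"}


def always_fails(body: str) -> None:
    raise RuntimeError("simulated processing failure")


record = {"body": "s3-event", "attributes": {"ApproximateReceiveCount": "1"}}
print(handle_record(record, always_fails))
```

In a real handler, the `extend_visibility` action would translate into an SQS `ChangeMessageVisibility` call with the computed timeout, and `drop_and_alert` into deleting the message and emitting a custom metric.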
  • 50. AWS nuances
      ● Serverless is generally cheap, but be careful!
        ○ You are paying for wait time
        ○ Bugs may be expensive
        ○ 100 ms charging blocks
      ● https://loige.link/lambda-pricing
      ● https://loige.link/serverless-costs-all-wrong
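
To see why wait time and the 100 ms charging blocks matter, here is a small cost model. It follows the billing granularity stated on the slide. The price per GB-second is a placeholder, not an actual AWS price, and the function names are illustrative:

```python
import math

BILLING_BLOCK_MS = 100  # charging granularity stated on the slide


def billed_duration_ms(duration_ms: float) -> int:
    """Round the real duration up to the next full billing block."""
    return math.ceil(duration_ms / BILLING_BLOCK_MS) * BILLING_BLOCK_MS


def invocation_cost(duration_ms: float, memory_mb: int, price_per_gb_second: float) -> float:
    """Cost of one invocation: billed seconds times configured memory in GB times unit price."""
    gb_seconds = (memory_mb / 1024) * (billed_duration_ms(duration_ms) / 1000)
    return gb_seconds * price_per_gb_second


PLACEHOLDER_PRICE = 0.00001  # assumed price per GB-second, for illustration only

print(billed_duration_ms(101))                         # billed as 200 ms
print(invocation_cost(101, 1024, PLACEHOLDER_PRICE))   # 0.2 GB-seconds at the placeholder price
```

A function that finishes in 101 ms is billed as if it ran for 200 ms, and time spent waiting on a slow downstream call is billed exactly like time spent computing.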
  • 51. AWS nuances
      ● Not every service/feature is available in every region or AZ
        ○ SQS FIFO :(
        ○ Not all AWS regions have 3 AZs
        ○ Not all instance types are available in every availability zone
      ● https://loige.link/aws-regional-services
  • 52. AWS nuances
      ● Limits everywhere!
        ○ Soft vs hard limits
        ○ Take them into account in your design
      ● https://loige.link/aws-service-limits
  • 54. Process: how to deal with incidents
      ● Page
      ● Engineers on call
      ● Incident Retrospective
      ● Actions
  • 55. Pages
      ● A page is an alarm for people on call (PagerDuty)
      ● Rotate ops & devs (share the pain)
      ● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.)
      ● When a page is received, it needs to be acknowledged or it is automatically escalated
      ● If the incident is customer-facing (e.g. service not available), the customer is notified
  • 56. Engineers on call
      1. Use operational handbook
      2. Might escalate to other engineers
      3. Find mitigation / remediation
      4. Update handbook
      5. Prepare for retrospective
  • 57. Incident Retrospective
      "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand."
      – Norm Kerth, Project Retrospectives: A Handbook for Team Review
      TL;DR: NOT A BLAMING GAME!
  • 58. Incident Retrospective
      ● Summary
      ● Events timeline
      ● Contributing Factors
      ● Remediation / Solution
      ● Actions for the future
      ● Transparency
  • 60. Development best practices
      ● Regular Retrospectives (not just for incidents)
        ○ What’s good
        ○ What’s bad
        ○ Actions to improve
      ● Kanban Board
        ○ All work visible
        ○ One card at a time
        ○ Work In Progress limit
        ○ “Stop Starting, Start Finishing”
  • 61. Development best practices
      ● Clear acceptance criteria
        ○ Collectively defined (3 amigos)
        ○ Make sure you know when a card is done
      ● Split the work into small units (cards)
        ○ High throughput
        ○ More predictability
      ● Bugs take priority over features!
  • 62. Development best practices
      ● Pair programming
        ○ Share the knowledge/responsibility
        ○ Improve team dynamics
        ○ Enforced by low WIP limit
      ● Quality over deadlines
      ● Don’t estimate without data
  • 64. Release process
      ● Infrastructure as code
        ○ Deterministic deployments
        ○ Infrastructure versioning using git
      ● No “snowflakes”, one code base for all customers
      ● Feature flags:
        ○ Special features
        ○ Soft releases
      ● Automated tests before release
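
Feature flags are what make "one code base for all customers" compatible with special features and soft releases. A minimal sketch of a per-customer flag lookup follows. The flag name, customer IDs and in-memory storage are hypothetical; the slides do not say how flags are stored or evaluated:

```python
from typing import Dict

# Defaults apply to every customer; overrides switch a feature on (or off)
# for a single customer without forking the code base.
DEFAULT_FLAGS: Dict[str, bool] = {"experimental_search": False}
CUSTOMER_OVERRIDES: Dict[str, Dict[str, bool]] = {
    "customer-a": {"experimental_search": True},
}


def is_enabled(flag: str, customer_id: str) -> bool:
    """Resolve a flag for a customer; unknown flags are treated as disabled."""
    overrides = CUSTOMER_OVERRIDES.get(customer_id, {})
    return overrides.get(flag, DEFAULT_FLAGS.get(flag, False))


print(is_enabled("experimental_search", "customer-a"))
print(is_enabled("experimental_search", "customer-b"))
```

A soft release then amounts to adding overrides customer by customer, and finally flipping the default.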
  • 65. Conclusion
      We are still waking up at night sometimes, but we are definitely sleeping a lot more and better!
      Takeaways:
      ● Have healthy and clear processes
      ● Allow your team space to fail
      ● Always review and strive for improvement
      ● Monitor/Instrument as much as you can
      ● Use managed services to reduce the operational overhead (but learn their nuances)
  • 66. We are hiring … Talk to us!
      Thank you!
      - loige.link/tera-inf -
  • 67. Credits
      Pictures from Unsplash
      Huge thanks for support and reviews to:
      ● All the Vectra team
      ● Yan Cui (@theburningmonk)
      ● Paul Dolan
      ● @gbinside
      ● @augeva
      ● @Podgeypoos79
      ● @PawrickMannion
      ● @micktwomey
      ● Vedran Jukic