Processing Terabytes of data every day
… and sleeping at night
@katavic_d - @loige
User Group
Belfast
06/02/2019
loige.link/tera-bel
Domagoj Katavic Senior Software Engineer
🐦 @katavic_d
😸 github.com/dkatavic
Luciano Mammino Cloud Architect
🐦 @loige
😸 github.com/lmammino
loige.co
4.7 out of 5 stars
on Amazon.com
With @mariocasciaro
Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
Cognito Detect
on premise solution
● Analyzing network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour driven, host centric
● Provides threat context and most relevant
attack details
Cognito Recall
● Collects network metadata
and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
A new Vectra product for Incident Response
Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer
(up to ~3 TB per day per customer)
● Investigation tool:
Flexible data exploration
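As a quick sanity check on the two ingestion figures above, the per-minute rate can be converted into a daily volume. This is only a back-of-the-envelope sketch; the function name is ours and it assumes decimal units (1 TB = 1000 GB).

```python
MINUTES_PER_DAY = 24 * 60


def daily_volume_tb(gb_per_minute: float) -> float:
    """Convert a sustained ingestion rate in GB/min into TB/day (1 TB = 1000 GB)."""
    return gb_per_minute * MINUTES_PER_DAY / 1000


# A sustained ~2 GB/min lands just under the ~3 TB/day quoted on the slide.
print(daily_volume_tb(2))  # 2.88
```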
Our first iteration
Control plane
Centralised Logging & Metrics
Security
● Separate VPCs
● Strict Security Groups (whitelisting)
● Red, amber, green subnets
● Encryption at rest through AWS services
● Client Certificates + TLS
● Pentest
Let’s start the beta!
Warning: different timezones!
[Picture labels not recoverable from the slide]
*yeah, we actually look that cute when we sleep!
Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (previously 5 minutes)
● We are receiving files every minute
(containing 1 minute of network traffic)
● During peak hours for the biggest customer, files
can be too big to be processed within timeout
limits
Splitter lambda
Message-aware splitting
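The idea of message-aware splitting can be sketched as follows. This is a minimal illustration, not the splitter from the talk: it assumes the incoming file is a sequence of newline-terminated records, and the function name and the byte limit are hypothetical.

```python
from typing import Iterable, Iterator, List


def split_on_message_boundaries(
    lines: Iterable[bytes], max_chunk_bytes: int
) -> Iterator[List[bytes]]:
    """Group whole records into chunks of at most max_chunk_bytes.

    A record is never cut in half. A single record larger than the limit
    is emitted as a chunk of its own.
    """
    chunk: List[bytes] = []
    size = 0
    for line in lines:
        # Close the current chunk before it would exceed the limit.
        if chunk and size + len(line) > max_chunk_bytes:
            yield chunk
            chunk, size = [], 0
        chunk.append(line)
        size += len(line)
    if chunk:
        yield chunk


# Three 10-byte records with a 20-byte limit give two chunks: [r1, r2] and [r3].
records = [b'{"id": 1}\n', b'{"id": 2}\n', b'{"id": 3}\n']
chunks = list(split_on_message_boundaries(records, max_chunk_bytes=20))
```

Each chunk can then be handed to its own ingestion Lambda, which keeps the input size per invocation predictable.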
Lessons learned
● Predictable data input for
predictable performance
● Data ingestion parallelization
(exploiting serverless
capabilities)
Lambdas IP starvation incident
● Spinning up many lambdas consumed
all the available IPs in a subnet
● Failure to get an IP for the new ES
machines
● ElasticSearch cannot scale up
● Solution: separate ElasticSearch and
Lambda subnets
Lessons learned
● Every running lambda inside a VPC uses an ENI
(Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the
number of running lambdas and you might run out of
IPs in the subnet!
● Consider putting lambdas in their dedicated subnet
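To get a feel for how quickly a subnet can run dry, here is a rough capacity check. It follows the model on this slide (one ENI, hence one private IP, per running lambda) and relies on the fact that AWS reserves 5 addresses in every subnet. The CIDR block and the host counts are made-up example values.

```python
import ipaddress

# AWS reserves the first four addresses and the last address of every subnet.
AWS_RESERVED_IPS_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of private IPs that can actually be assigned in a subnet."""
    network = ipaddress.ip_network(cidr)
    return max(network.num_addresses - AWS_RESERVED_IPS_PER_SUBNET, 0)


def ip_headroom(cidr: str, other_hosts: int, peak_lambda_enis: int) -> int:
    """IPs left at peak; a negative value means the subnet is starved."""
    return usable_ips(cidr) - other_hosts - peak_lambda_enis


print(usable_ips("10.0.1.0/24"))  # 251
# 20 ElasticSearch nodes plus a spike of 300 concurrent lambdas do not fit.
print(ip_headroom("10.0.1.0/24", other_hosts=20, peak_lambda_enis=300))  # -69
```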
Missing data incident
● New lambda version: triggered insertion failures
● ElasticSearch rejecting inserts and logging errors
● Our log reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/Alerting failed
Resolution:
● Fix mismatching schema
● Scaled out centralised logging system
Why didn’t we receive the page?
Alerting on lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your lambdas
● Alert on queue size with
CloudWatch!
● Visibility on Lambda retries
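The "alert on queue size" idea could look roughly like the sketch below. It only builds the parameters for a CloudWatch alarm on the standard SQS metric `ApproximateNumberOfMessagesVisible`. The queue name, the SNS topic ARN, the alarm name and the threshold are all hypothetical placeholders, not values from our setup.

```python
from typing import Any, Dict


def dlq_alarm_params(
    queue_name: str, sns_topic_arn: str, threshold: int = 0
) -> Dict[str, Any]:
    """Build CloudWatch alarm parameters that fire when the DLQ is not empty."""
    return {
        "AlarmName": f"{queue_name}-not-empty",  # hypothetical naming scheme
        "Namespace": "AWS/SQS",
        "MetricName": "ApproximateNumberOfMessagesVisible",
        "Dimensions": [{"Name": "QueueName", "Value": queue_name}],
        "Statistic": "Maximum",
        "Period": 60,  # evaluate every minute
        "EvaluationPeriods": 1,
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],  # e.g. a topic that pages the on-call
    }


params = dlq_alarm_params(
    "ingestion-dlq", "arn:aws:sns:eu-west-1:123456789012:on-call-pages"
)
```

With boto3 installed and credentials configured, these parameters can be passed to `boto3.client("cloudwatch").put_metric_alarm(**params)`.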
Fast retry at peak times
● Lambda retry logic is not configurable
loige.link/lambda-retry
● Most events will be retried 2 times
● Time between retry attempts is not clearly defined
(observed to be in the order of a few seconds)
● What if all retry attempts happen at peak times?
● Outside peak times, processing is likely to succeed
● During peak times, processing is likely to fail
● If the 1st retry and the 2nd retry both fall in the same peak zone, the message will fail and go to the DLQ
Can we extend the retry period
in case of failure?
Extended retry period
1. We normally trigger our ingestion Lambda when a new file is stored in S3
2. If the Lambda fails, the event is automatically retried, up to 2 times
3. If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
4. At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
5. If the processing still fails, we can extend the VisibilityTimeout (event delay) (x3)
6. If the processing still fails, we eventually drop the message and alert for manual intervention
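The decision taken each time a message coming from the DLQ fails again can be sketched as a small pure function. This is an illustration under stated assumptions, not our production code: the names, the base delay and the doubling backoff are hypothetical, and we read the "x3" on the slides as a cap of three attempts. The 43200 second ceiling is the SQS maximum visibility timeout (12 hours).

```python
from dataclasses import dataclass

MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper bound: 12 hours, in seconds


@dataclass(frozen=True)
class RetryDecision:
    action: str         # "retry_later" or "drop_and_alert"
    delay_seconds: int  # new VisibilityTimeout when action == "retry_later"


def decide_retry(
    receive_count: int, max_attempts: int = 3, base_delay: int = 300
) -> RetryDecision:
    """Decide what to do after processing a message from the DLQ failed.

    receive_count is the message's ApproximateReceiveCount (1 on first delivery).
    """
    if receive_count >= max_attempts:
        # Out of attempts: give up and page a human.
        return RetryDecision("drop_and_alert", 0)
    # Push the next attempt further away each time, hopefully past the peak.
    delay = min(base_delay * 2 ** (receive_count - 1), MAX_VISIBILITY_TIMEOUT)
    return RetryDecision("retry_later", delay)


print(decide_retry(1))  # RetryDecision(action='retry_later', delay_seconds=300)
print(decide_retry(3))  # RetryDecision(action='drop_and_alert', delay_seconds=0)
```

In a real handler, a "retry_later" decision would translate into a call to the SQS `change_message_visibility` API with the computed delay as `VisibilityTimeout`, while "drop_and_alert" would delete the message and emit the custom metric we alert on.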
Lessons learned
● Cannot always rely on the default retry logic
● SQS events + DLQ =
custom SERVERLESS retry logic
● Now we only alert on custom metrics when
we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ 100ms charging blocks
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
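The effect of the 100ms charging blocks is easy to see with a small calculation. The sketch below assumes the billing model described on this slide (execution time rounded up to the next 100ms block, charged per GB-second). The price passed in the example is only a placeholder; check the pricing page linked above for current values.

```python
import math


def billed_ms(duration_ms: float, block_ms: int = 100) -> int:
    """Round an execution time up to the next billing block."""
    return math.ceil(duration_ms / block_ms) * block_ms


def invocation_cost(
    duration_ms: float, memory_mb: int, price_per_gb_second: float
) -> float:
    """Compute cost of one invocation (request fee not included)."""
    gb_seconds = (billed_ms(duration_ms) / 1000) * (memory_mb / 1024)
    return gb_seconds * price_per_gb_second


print(billed_ms(100))  # 100
print(billed_ms(101))  # 200 -> one extra millisecond doubles the bill

# Placeholder price: time spent waiting on I/O is billed like any other time.
cost = invocation_cost(250, memory_mb=1024, price_per_gb_second=0.00001667)
```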
AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
AWS nuances
● Limits everywhere!
○ Soft vs hard limits
○ Take them into account in your design
● https://loige.link/aws-service-limits
Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
Pages
● A page is an alarm for people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is automatically escalated
● If the incident is customer facing (e.g. service not available), the customer is notified
Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
Incidents Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Reviews
TLDR; NOT A BLAMING GAME!
Incidents Retrospective
● Summary
● Events timeline
● Contributing Factors
● Remediation / Solution
● Actions for the future
● Transparency
Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop starting, start finishing”
Development best practices
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small cards
○ High throughput
○ More predictability
● Bugs take priority over features!
Development best practices
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”, one code base for all customers
● Feature flags:
○ Special features
○ Soft releases
● Automated tests before release
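One way to combine "one code base for all customers" with feature flags is sketched below. It is a hypothetical example, not the mechanism we use: an allow list covers special features for specific customers, and a deterministic percentage rollout covers soft releases. All names are made up for illustration.

```python
import hashlib
from typing import Collection


def rollout_bucket(customer_id: str, flag: str) -> int:
    """Map a customer to a stable bucket in the range 0..99 for a given flag."""
    digest = hashlib.sha256(f"{flag}:{customer_id}".encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(
    flag: str,
    customer_id: str,
    allow_list: Collection[str] = (),
    rollout_percent: int = 0,
) -> bool:
    """Special features use the allow list; soft releases use the percentage."""
    if customer_id in allow_list:
        return True
    return rollout_bucket(customer_id, flag) < rollout_percent


# Same code for everyone; only the flag configuration differs per customer.
print(is_enabled("new-parser", "customer-a", allow_list={"customer-a"}))  # True
print(is_enabled("new-parser", "customer-b", rollout_percent=0))  # False
```

Hashing the flag name together with the customer id keeps the result stable between deployments, so a customer does not flip in and out of a soft release.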
Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Allow your team space to fail
● Always review and strive for improvement
● Monitor/Instrument as much as you can
● Use managed services to reduce the operational overhead
(but learn their nuances)
We are hiring …
Talk to us!
Thank you!
loige.link/tera-bel
Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic

Processing TeraBytes of data every day and sleeping at night

  • 1. Processing Terabytes of data every day … and sleeping at night @katavic_d - @loige User Group Belfast 06/02/2019 loige.link/tera-bel
  • 2. Domagoj KatavicSenior Software Engineer 🐦 @katavic_d 😸 github.com/dkatavic
  • 3. Luciano Mammino Cloud Architect 🐦 @loige 😸 github.com/lmammino loige.co 4.7 out of 5 stars on Amazon.com With @mariocasciaro
  • 4. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 5. AI to detect and hunt for cyber attackers Cognito Platform ● Detect ● Recall @katavic_d - @loige
  • 6. Cognito Detect on premise solution ● Analyzing network traffic and logs ● Uses AI to deliver real-time attack visibility ● Behaviour driven Host centric ● Provides threat context and most relevant attack details @katavic_d - @loige
  • 8. Cognito Recall ● Collects network metadata and stores it in “the cloud” ● Data is processed, enriched and standardised ● Data is made searchable @katavic_d - @loige A new Vectra product for Incident Response
  • 9. Recall requirements ● Data isolation ● Ingestion speed: ~2GB/min x customer (up ~3TB x day per customer) ● Investigation tool: Flexible data exploration @katavic_d - @loige
  • 10. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 12. @katavic_d - @loige Control plane Centralised Logging & Metrics
  • 13. Security ● Separate VPCs ● Strict Security Groups (whitelisting) ● Red, amber, green subnets ● Encryption at rest through AWS services ● Client Certificates + TLS ● Pentest @katavic_d - @loige
  • 14. Let’s start the beta! @katavic_d - @loige
  • 15. Warning: different timezones! A cu m Our ne * @katavic_d - @loige *yeah, we actually look that cute when we sleep!
  • 16. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 19. Lambda timeouts incident ● AWS Lambda timeout: 5 minutes 15 minutes ● We are receiving files every minute (containing 1 minute of network traffic) ● During peak hours for the biggest customer, files can be too big to be processed within timeout limits @katavic_d - @loige
  • 22. Lessons learned ● Predictable data input for predictable performance ● Data ingestion parallelization (exploiting serverless capabilities) @katavic_d - @loige
  • 24. Lambdas IP starvation incident ● Spinning up many lambdas consumed all the available IPs in a subnet ● Failure to get an IP for the new ES machines ● ElasticSearch cannot scale up ● Solution: separate ElasticSearch and Lambda subnets @katavic_d - @loige GI IP!
  • 25. Lessons learned ● Every running lambda inside a VPC uses an ENI (Elastic Network Interface) ● Every ENI takes a private IP address ● Edge conditions or bugs might generate spikes in the number of running lambdas and you might run out of IPs in the subnet! ● Consider putting lambdas in their dedicated subnet @katavic_d - @loige
  • 30. ● New lambda version: triggered insertion failures ● ElasticSearch rejecting inserts and logging errors ● Our log reporting agents got stuck (we DDoS’d ourselves!) ● Monitoring/Alerting failed Resolution: ● Fix mismatching schema ● Scaled out centralised logging system Why didn’t we receive the page @katavic_d - @loige
  • 31. Alerting on lambda failures Using logs: ● Best case: no logs ● Worst case: no logs (available)! A better approach: ● Attach a DLQ to your lambdas ● Alert on queue size with CloudWatch! ● Visibility on Lambda retries @katavic_d - @loige
  • 35. Fast retry at peak times ● Lambda retry logic is not configurable loige.link/lambda-retry ● Most events will be retried 2 times ● Time between retry attempts is not clearly defined (observed to be on the order of a few seconds) ● What if all retry attempts happen at peak times? @katavic_d - @loige
  • 36. Fast retry at peak times @katavic_d - @loige
  • 37. Fast retry at peak times Processing in this range of time is likely to succeed @katavic_d - @loige
  • 38. Fast retry at peak times @katavic_d - @loige
  • 39. Fast retry at peak times Processing in this range of time is likely to fail @katavic_d - @loige
  • 40. Fast retry at peak times If retries are in the same zone, the message will fail and go to the DLQ 1st retry 2nd retry
  • 41. Can we extend the retry period in case of failure? @katavic_d - @loige
  • 42. @katavic_d - @loige Extended retry period We normally trigger our ingestion Lambda when a new file is stored in S3
  • 43. @katavic_d - @loige Extended retry period If the Lambda fails, the event is automatically retried, up to 2 times
  • 44. @katavic_d - @loige Extended retry period If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
  • 45. @katavic_d - @loige Extended retry period At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
  • 46. @katavic_d - @loige Extended retry period If the processing still fails, we can extend the VisibilityTimeout (event delay) x3
  • 47. @katavic_d - @loige Extended retry period If the processing still fails, we eventually drop the message and alert for manual intervention. x3
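The custom retry flow in the last few slides boils down to one decision per failed DLQ message: delay it and try again later, or give up and alert. A minimal sketch of that decision follows. The 3-attempt limit comes from the "x3" on the slides; the doubling schedule, the 5-minute base delay and all function names are assumptions. The only hard constraint is that SQS caps the visibility timeout at 12 hours.

```python
from typing import Tuple

MAX_VISIBILITY_TIMEOUT = 43200  # SQS upper bound: 12 hours, in seconds


def next_step(receive_count: int, max_attempts: int = 3, base_delay: int = 300) -> Tuple[str, int]:
    """Decide what to do with a DLQ message whose processing just failed.

    receive_count is the SQS ApproximateReceiveCount attribute (1 on first delivery).
    Returns (action, visibility_timeout_seconds).
    """
    if receive_count >= max_attempts:
        # Out of attempts: drop the message and ask for manual intervention.
        return ("drop_and_alert", 0)
    # Assumed schedule: double the delay on every attempt, so that retries
    # are pushed away from the peak window in which the first attempts failed.
    delay = min(base_delay * 2 ** (receive_count - 1), MAX_VISIBILITY_TIMEOUT)
    return ("retry_later", delay)


for attempt in (1, 2, 3):
    print(attempt, next_step(attempt))
```

For `retry_later`, the delay would be applied with the SQS `ChangeMessageVisibility` call (`change_message_visibility` in boto3), which hides the message from consumers until the timeout expires.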
  • 48. Lessons learned ● Cannot always rely on the default retry logic ● SQS events + DLQ = custom SERVERLESS retry logic ● Now we only alert on custom metrics when we are sure the event will fail (logic error) ● https://loige.link/async-lambda-retry @katavic_d - @loige
  • 49. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 50. AWS nuances ● Serverless is generally cheap, but be careful! ○ You are paying for wait time ○ Bugs may be expensive ○ 100ms charging blocks ● https://loige.link/lambda-pricing ● https://loige.link/serverless-costs-all-wrong @katavic_d - @loige
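The effect of 100ms charging blocks and of paying for wait time is easy to see with a little arithmetic. The sketch below only models the rounding described on the slide; the function names and sample durations are hypothetical, and no price is hard-coded because rates vary by region and over time.

```python
def billed_duration_ms(duration_ms: int, block_ms: int = 100) -> int:
    """Round a duration up to the next multiple of the billing block."""
    blocks = -(-duration_ms // block_ms)  # integer ceiling division
    return blocks * block_ms


def gb_seconds(duration_ms: int, memory_mb: int, block_ms: int = 100) -> float:
    """Billable GB-seconds for a single invocation."""
    return billed_duration_ms(duration_ms, block_ms) / 1000 * memory_mb / 1024


# Hypothetical invocation: 60 ms of compute plus 950 ms waiting on a slow
# downstream service. The wait is billed exactly like compute.
print(billed_duration_ms(60))        # compute only
print(billed_duration_ms(60 + 950))  # compute plus wait
print(gb_seconds(60 + 950, memory_mb=1024))
```

Multiplying the GB-seconds by the current per-GB-second rate and by the number of invocations gives the compute part of the bill, which is also why a bug that triggers runaway invocations gets expensive quickly.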
  • 51. AWS nuances ● Not every service/feature is available in every region or AZ ○ SQS FIFO :( ○ Not all AWS regions have 3 AZs ○ Not all instance types are available in every availability zone ● https://loige.link/aws-regional-services @katavic_d - @loige
  • 52. AWS nuances ● Limits everywhere! ○ Soft vs hard limits ○ Take them into account in your design ● https://loige.link/aws-service-limits @katavic_d - @loige
  • 53. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 54. Process How to deal with incidents ● Page ● Engineers on call ● Incident Retrospective ● Actions @katavic_d - @loige
  • 55. Pages ● A page is an alarm for people on call (PagerDuty) ● Rotate ops & devs (share the pain) ● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.) ● When a page is received, it needs to be acknowledged or it is automatically escalated ● If customer facing (e.g. service not available), the customer is notified @katavic_d - @loige
  • 56. Engineers on call 1. Use operational handbook 2. Might escalate to other engineers 3. Find mitigation / remediation 4. Update handbook 5. Prepare for retrospective @katavic_d - @loige
  • 57. Incidents Retrospective "Regardless of what we discover, we understand and truly believe that everyone did the best job they could, given what they knew at the time, their skills and abilities, the resources available, and the situation at hand." – Norm Kerth, Project Retrospectives: A Handbook for Team Reviews TL;DR: NOT A BLAMING GAME! @katavic_d - @loige
  • 58. Incidents Retrospective ● Summary ● Events timeline ● Contributing Factors ● Remediation / Solution ● Actions for the future ● Transparency @katavic_d - @loige
  • 59. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 60. Development best practices ● Regular Retrospectives (not just for incidents) ○ What’s good ○ What’s bad ○ Actions to improve ● Kanban Board ○ All work visible ○ One card at a time ○ Work In Progress limit ○ “Stop Starting, Start Finishing” @katavic_d - @loige
  • 61. Development best practices ● Clear acceptance criteria ○ Collectively defined (3 amigos) ○ Make sure you know when a card is done ● Split the work into small cards ○ High throughput ○ More predictability ● Bugs take priority over features! @katavic_d - @loige
  • 62. Development best practices ● Pair programming ○ Share the knowledge/responsibility ○ Improve team dynamics ○ Enforced by low WIP limit ● Quality over deadlines ● Don’t estimate without data @katavic_d - @loige
  • 63. Agenda ● The problem space ● Our first MVP & Beta period ● INCIDENTS! And lessons learned ● AWS Nuances ● Process to deal with incidents ● Development best practices ● Release process @katavic_d - @loige
  • 64. Release process ● Infrastructure as code ○ Deterministic deployments ○ Infrastructure versioning using git ● No “snowflakes”, one code base for all customers ● Feature flags: ○ Special features ○ Soft releases ● Automated tests before release @katavic_d - @loige
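"One code base for all customers" plus feature flags can be as simple as a table of defaults with per-customer overrides. The sketch below is a generic illustration: every flag name and customer ID is invented, and the talk does not describe how the flags are actually stored.

```python
from typing import Dict

# Hypothetical flags: defaults apply to everyone, overrides to single customers.
DEFAULT_FLAGS: Dict[str, bool] = {
    "new_search_ui": False,       # soft release: off unless overridden
    "extended_retention": False,  # special feature for selected customers
}

CUSTOMER_OVERRIDES: Dict[str, Dict[str, bool]] = {
    "customer-a": {"new_search_ui": True},
}


def is_enabled(flag: str, customer_id: str) -> bool:
    """Resolve a flag for a customer, falling back to the global default."""
    if flag not in DEFAULT_FLAGS:
        # Fail loudly on typos instead of silently returning False.
        raise KeyError(f"unknown feature flag: {flag}")
    return CUSTOMER_OVERRIDES.get(customer_id, {}).get(flag, DEFAULT_FLAGS[flag])


print(is_enabled("new_search_ui", "customer-a"))
print(is_enabled("new_search_ui", "customer-b"))
```

The same deployment serves both customers; only the flag table differs, so a soft release means editing an override rather than maintaining a customer-specific branch.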
  • 65. Conclusion We are still waking up at night sometimes, but we are definitely sleeping a lot more and better! Takeaways: ● Have healthy and clear processes ● Allow your team space to fail ● Always review and strive for improvement ● Monitor/Instrument as much as you can ● Use managed services to reduce the operational overhead (but learn their nuances) @katavic_d - @loige
  • 66. We are hiring … Talk to us! @katavic_d - @loige Thank you! loige.link/tera-bel
  • 67. Credits Pictures from Unsplash Huge thanks for support and reviews to: ● All the Vectra team ● Yan Cui (@theburningmonk) ● Paul Dolan ● @gbinside ● @augeva ● @Podgeypoos79 ● @PawrickMannion ● @micktwomey ● Vedran Jukic