This deck describes processing terabytes of data every day using AWS services. It covers incidents the team faced, such as Lambda timeouts when processing large files, IP-address starvation when scaling Lambdas, and missing-data incidents, along with lessons learned about using serverless architectures efficiently: splitting large tasks, separating subnets, and implementing custom retry logic. Best practices for development, releases, and incident response are also outlined.
3. Luciano Mammino, Cloud Architect
🐦 @loige
😸 github.com/lmammino
🌍 loige.co
Co-authored a book with @mariocasciaro, rated 4.7 out of 5 stars on Amazon.com
4. Agenda
● The problem space
● Our first MVP & Beta period
● INCIDENTS! And lessons learned
● AWS Nuances
● Process to deal with incidents
● Development best practices
● Release process
5. AI to detect and hunt for
cyber attackers
Cognito Platform
● Detect
● Recall
6. Cognito Detect
An on-premise solution (soon also for the cloud!)
● Analyses network traffic and logs
● Uses AI to deliver real-time attack visibility
● Behaviour-driven and host-centric
● Provides threat context and the most relevant attack details
8. Cognito Recall
A Vectra product for Incident Response
● Collects network metadata and stores it in “the cloud”
● Data is processed, enriched and standardised
● Data is made searchable
9. Recall requirements
● Data isolation
● Ingestion speed: ~2 GB/min per customer (up to ~3 TB per day per customer)
● Investigation tool: flexible data exploration
19. Lambda timeouts incident
● AWS Lambda timeout: 15 minutes (max)
● We receive files every minute (each containing 1 minute of network traffic)
● During peak hours for the biggest customer, files can be too big to be processed within the timeout limit (see the splitting sketch below)
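One way out of this incident is to split an oversized file and fan the pieces out to separate invocations. The talk does not show code, so the sketch below is our own illustration: CHUNK_SIZE, process_chunk and the self-invocation payload shape are all assumptions.

import json
import os

import boto3

s3 = boto3.client("s3")
lambda_client = boto3.client("lambda")

CHUNK_SIZE = 512 * 1024 * 1024  # hypothetical: 512 MiB fits well within 15 min


def handler(event, context):
    if "Records" in event:  # original S3 "new file" notification
        s3_info = event["Records"][0]["s3"]
        bucket, key = s3_info["bucket"]["name"], s3_info["object"]["key"]
        size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
        if size <= CHUNK_SIZE:
            process_chunk(bucket, key, 0, size - 1)
            return
        # Too big for one invocation: fan out one asynchronous
        # self-invocation per byte range (real code must also align
        # ranges on record boundaries so no record is cut in half).
        for start in range(0, size, CHUNK_SIZE):
            end = min(start + CHUNK_SIZE, size) - 1
            lambda_client.invoke(
                FunctionName=os.environ["AWS_LAMBDA_FUNCTION_NAME"],
                InvocationType="Event",  # asynchronous
                Payload=json.dumps({"bucket": bucket, "key": key,
                                    "range": [start, end]}),
            )
    else:  # payload from a self-invocation: process just one chunk
        process_chunk(event["bucket"], event["key"], *event["range"])


def process_chunk(bucket, key, start, end):
    # Ranged GET: downloads and processes only this slice of the file
    body = s3.get_object(Bucket=bucket, Key=key,
                         Range=f"bytes={start}-{end}")["Body"]
    for line in body.iter_lines():
        pass  # parse and ingest one network-metadata record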
24. Lambdas IP starvation incident
● Spinning up many Lambdas consumed all the available IPs in a subnet
● New Elasticsearch machines failed to get an IP
● Elasticsearch could not scale up
● Solution: separate the Elasticsearch and Lambda subnets
25. Lessons learned
● Every running Lambda inside a VPC uses an ENI (Elastic Network Interface)
● Every ENI takes a private IP address
● Edge conditions or bugs might generate spikes in the number of running Lambdas, and you might run out of IPs in the subnet! (a monitoring sketch follows this list)
● Consider putting Lambdas in their own dedicated subnet
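The lesson above suggests an easy early-warning signal. Here is a minimal sketch (our assumption, not shown in the talk) that publishes each subnet's free-IP count as a custom CloudWatch metric so an alarm can fire before ENI allocation starts failing; the namespace and subnet ID are placeholders.

import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")


def publish_free_ip_counts(subnet_ids):
    # AvailableIpAddressCount is reported directly by DescribeSubnets
    resp = ec2.describe_subnets(SubnetIds=subnet_ids)
    for subnet in resp["Subnets"]:
        cloudwatch.put_metric_data(
            Namespace="Custom/Networking",  # illustrative namespace
            MetricData=[{
                "MetricName": "AvailableIpAddressCount",
                "Dimensions": [
                    {"Name": "SubnetId", "Value": subnet["SubnetId"]}
                ],
                "Value": subnet["AvailableIpAddressCount"],
                "Unit": "Count",
            }],
        )


if __name__ == "__main__":
    # e.g. run on a schedule; the subnet ID is a placeholder
    publish_free_ip_counts(["subnet-0123456789abcdef0"])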
30. Why didn’t we receive the page?
● A new Lambda version triggered insertion failures
● Elasticsearch was rejecting inserts and logging errors
● Our log-reporting agents got stuck (we DDoS’d ourselves!)
● Monitoring/alerting failed
Resolution:
● Fixed the mismatching schema
● Scaled out the centralised logging system
31. Alerting on Lambda failures
Using logs:
● Best case: no logs
● Worst case: no logs (available)!
A better approach:
● Attach a DLQ to your Lambdas
● Alert on queue size with CloudWatch! (alarm sketch below)
● Gain visibility on Lambda retries
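A minimal sketch of the "alert on queue size" idea: any message becoming visible on the DLQ trips a CloudWatch alarm. The talk shows no code, so the queue name, alarm name and SNS topic ARN below are placeholders.

import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="ingestion-lambda-dlq-not-empty",
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "ingestion-dlq"}],
    Statistic="Maximum",
    Period=60,             # evaluate every minute
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",  # any message triggers it
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:pager-topic"],
)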
35. Fast retry at peak times
● Lambda retry logic is not configurable (loige.link/lambda-retry)
● Most events will be retried 2 times
● The time between retry attempts is not clearly defined (observed to be on the order of a few seconds)
● What if all retry attempts happen at peak times?
39. Fast retry at peak times
Processing in this range of time is likely to fail
40. Fast retry at peak times
If both retries land in the same peak window, the message will fail and go to the DLQ
41. Can we extend the retry period
in case of failure?
42. Extended retry period
We normally trigger our ingestion Lambda when a new file is stored in S3
43. Extended retry period
If the Lambda fails, the event is automatically retried, up to 2 times
44. Extended retry period
If the Lambda still fails, the event is copied to the Dead Letter Queue (DLQ)
45. Extended retry period
At this point our Lambda can receive an SQS event from the DLQ (custom retry logic)
46. Extended retry period
If the processing still fails, we can extend the VisibilityTimeout (event delay) and let the message be retried, up to 3 times (x3)
47. Extended retry period
If the processing still fails after those attempts, we eventually drop the message and alert for manual intervention (sketched below)
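Slides 45 to 47 can be condensed into a single handler. The code below is our illustrative reconstruction, not Vectra's actual code: the queue URL, delay, and helper names are assumptions, and we assume an SQS batch size of 1. On failure it extends the message's VisibilityTimeout so the next attempt lands off-peak; after three attempts it drops the message and pages a human.

import json

import boto3

sqs = boto3.client("sqs")
DLQ_URL = "https://sqs.eu-west-1.amazonaws.com/123456789012/ingestion-dlq"  # placeholder
BASE_DELAY = 15 * 60  # hypothetical: wait at least 15 minutes between attempts
MAX_ATTEMPTS = 3


def handler(event, context):
    record = event["Records"][0]  # SQS event from the DLQ (batch size 1)
    attempt = int(record["attributes"]["ApproximateReceiveCount"])
    try:
        reprocess(json.loads(record["body"]))
    except Exception:
        if attempt >= MAX_ATTEMPTS:
            page_oncall(record)  # give up: alert for manual intervention
            return  # returning normally deletes (drops) the message
        # Push the re-delivery further out on each attempt,
        # so it lands past the peak window
        sqs.change_message_visibility(
            QueueUrl=DLQ_URL,
            ReceiptHandle=record["receiptHandle"],
            VisibilityTimeout=BASE_DELAY * attempt,
        )
        raise  # fail the invocation so SQS re-delivers after the timeout


def reprocess(payload):
    pass  # re-run the ingestion step for this event


def page_oncall(record):
    pass  # e.g. publish to an SNS topic wired to PagerDuty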
48. Lessons learned
● We cannot always rely on the default retry logic
● SQS events + DLQ = custom SERVERLESS retry logic
● Now we only alert on custom metrics when we are sure the event will fail (logic error)
● https://loige.link/async-lambda-retry
50. AWS nuances
● Serverless is generally cheap, but be careful!
○ You are paying for wait time
○ Bugs may be expensive
○ Charged in 100 ms blocks (see the cost example below)
● https://loige.link/lambda-pricing
● https://loige.link/serverless-costs-all-wrong
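To make the 100 ms blocks concrete, here is a back-of-the-envelope calculation. The rate and numbers are our assumptions based on public Lambda pricing at the time, not figures from the talk.

# Assumed public rate: ~$0.0000166667 per GB-second (per-request fee omitted)
GB_SECOND_RATE = 0.0000166667


def invocation_cost(memory_mb, duration_ms):
    billed_ms = -(-duration_ms // 100) * 100  # round UP to the next 100 ms block
    return (memory_mb / 1024) * (billed_ms / 1000) * GB_SECOND_RATE


# One file per minute, all day, on a 3008 MB function that runs 101 ms:
# 101 ms is billed as 200 ms, and time spent *waiting* on I/O is billed
# exactly like time spent computing ("you are paying for wait time").
per_day = 24 * 60 * invocation_cost(memory_mb=3008, duration_ms=101)
print(f"~${per_day:.4f} per customer per day")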
51. AWS nuances
● Not every service/feature is available in every region or AZ
○ SQS FIFO :(
○ Not all AWS regions have 3 AZs
○ Not all instance types are available in every availability zone
● https://loige.link/aws-regional-services
52. AWS nuances
● Limits everywhere!
○ Soft vs hard limits (a query sketch follows below)
○ Take them into account in your design
● https://loige.link/aws-service-limits
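Limits can also be inspected programmatically. A small sketch, our addition assuming the AWS Service Quotas API: it lists Lambda quotas and uses the Adjustable flag to tell soft limits (raisable on request) from hard ones.

import boto3

quotas = boto3.client("service-quotas")

resp = quotas.list_service_quotas(ServiceCode="lambda", MaxResults=100)
for quota in resp["Quotas"]:
    kind = "soft" if quota["Adjustable"] else "hard"
    print(f'{quota["QuotaName"]}: {quota["Value"]} ({kind} limit)')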
54. Process
How to deal with incidents
● Page
● Engineers on call
● Incident Retrospective
● Actions
55. Pages
● A page is an alarm for the people on call (PagerDuty)
● Rotate ops & devs (share the pain)
● Generate pages from different sources (logs, CloudWatch, SNS, Grafana, etc.)
● When a page is received, it needs to be acknowledged or it is automatically escalated
● If customer-facing (e.g. service not available), the customer is notified
56. Engineers on call
1. Use operational handbook
2. Might escalate to other engineers
3. Find mitigation / remediation
4. Update handbook
5. Prepare for retrospective
57. Incidents Retrospective
"Regardless of what we discover, we understand and truly
believe that everyone did the best job they could, given
what they knew at the time, their skills and abilities, the
resources available, and the situation at hand."
– Norm Kerth, Project Retrospectives: A Handbook for Team Review
TL;DR: NOT A BLAME GAME!
60. Development best practices
● Regular Retrospectives (not just for incidents)
○ What’s good
○ What’s bad
○ Actions to improve
● Kanban Board
○ All work visible
○ One card at a time
○ Work In Progress limit
○ “Stop Starting, Start Finishing”
61. Development best practices
● Clear acceptance criteria
○ Collectively defined (3 amigos)
○ Make sure you know when a card is done
● Split the work into small units (cards)
○ High throughput
○ More predictability
● Bugs take priority over features!
62. Development best practices
● Pair programming
○ Share the knowledge/responsibility
○ Improve team dynamics
○ Enforced by low WIP limit
● Quality over deadlines
● Don’t estimate without data
64. Release process
● Infrastructure as code
○ Deterministic deployments
○ Infrastructure versioning using git
● No “snowflakes”: one code base for all customers
● Feature flags (see the sketch below):
○ Special features
○ Soft releases
● Automated tests before release
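For the feature-flags bullet, a tiny illustrative sketch (our assumption; the talk shows no implementation) of per-customer flags on a single shared code base: enabling a special feature or soft-releasing to one customer becomes a configuration change, not a code fork.

import json
import os

# e.g. FEATURE_FLAGS='{"new_enrichment_pipeline": ["customer-a"]}'
FLAGS = json.loads(os.environ.get("FEATURE_FLAGS", "{}"))


def is_enabled(flag, customer_id):
    # A flag maps to the list of customers it is enabled for
    return customer_id in FLAGS.get(flag, [])


if is_enabled("new_enrichment_pipeline", "customer-a"):
    pass  # run the soft-released code path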
65. Conclusion
We are still waking up at night sometimes,
but we are definitely sleeping a lot more and better!
Takeaways:
● Have healthy and clear processes
● Allow your team space to fail
● Always review and strive for improvement
● Monitor/Instrument as much as you can
● Use managed services to reduce the operational overhead
(but learn their nuances)
66. We are hiring… Talk to us!
Thank you!
- loige.link/tera-inf -
67. Credits
Pictures from Unsplash
Huge thanks for support and reviews to:
● All the Vectra team
● Yan Cui (@theburningmonk)
● Paul Dolan
● @gbinside
● @augeva
● @Podgeypoos79
● @PawrickMannion
● @micktwomey
● Vedran Jukic