(CMP403) AWS Lambda: Simplifying Big Data Workloads
AWS Lambda allows any Node.js app to be run at scale in a massively parallel environment with no up-front costs or planning. This session shows how to use Lambda to build dynamic analytic data flows that can be tuned as they execute, based on initial results, to provide real-time output streamed to web clients. This process enables a cost-effective and responsive user experience for ad hoc big data jobs and lets developers focus on how data is consumed and presented, instead of how it is obtained.

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Martin Holste, FireEye
October 2015
CMP403
AWS Lambda: Simplifying Big Data Workloads
What to Expect from the Session
This is a deep dive on general computing uses for AWS Lambda.
• You will understand what makes Lambda a big deal for big data.
• You will not learn about asynchronously triggered workloads (see related sessions for that).
• You will see interactive, data-driven user experiences that work with minimal ops overhead and at any scale.
Problem: Big data, little time
At FireEye, one of the ways we protect customers is by analyzing mountains of event data to find “evil.” Some of it we have online in indexes; some of it we have in cold storage on Amazon S3. We needed to take advantage of the rich history in our archived data without hurting our user experience.
Our app creates questions and finds answers
[Architecture diagram] Components: Lambda-driven search and analytics; EMR analytic output; EC2-based proprietary detection; EC2-based indexed search. Amazon EMR triggers investigations (questions); AWS Lambda provides context (answers).
What analysis are we doing?
Amazon EMR: scheduled jobs that process all data for anomaly detection:
• K-means
• Linear regression
• Geographic time-lining
AWS Lambda: free-form searching to drive ad hoc:
• Reports
• Visualizations
• Analytical statistics (clustering, correlation, linear regression, etc.)
Visualize search results analytically
User-defined analytics based on ad hoc features of the search result set draw attention to otherwise uninteresting facets of the data.
How big is our Big?
For an average customer: the average security event is about 3 KB, at 20k events/sec ≈ 60 MB/sec, which is about 5 TB/day. One week = 35 TB, or 12 billion events.
How long does this take?
A single process downloads, decompresses, greps, and processes at about 35k events/sec (105 MB/sec). To process a week of data:

Processes | Time
1         | ~4 days
10        | ~6 hours
100       | ~1 hour
1000      | ~5 minutes
10000     | seconds
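The scale-out arithmetic behind the table is simple enough to sketch directly. This is a rough estimate using the talk's own numbers (~12 billion events per week, ~35k events/sec per process), not an exact reproduction of the slide's figures:

```javascript
// Rough scaling estimate using the numbers from the talk:
// ~12 billion events in a week, ~35k events/sec per process.
const EVENTS_PER_WEEK = 12e9;
const EVENTS_PER_SEC_PER_PROCESS = 35e3;

function secondsToProcessWeek(processes) {
  return EVENTS_PER_WEEK / (EVENTS_PER_SEC_PER_PROCESS * processes);
}

for (const n of [1, 10, 100, 1000, 10000]) {
  console.log(`${n} processes: ~${Math.round(secondsToProcessWeek(n))} s`);
}
```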
Lambda FTW
What if you could spin up 10k processes in 100 ms? Standard map-reduce pattern without the startup time or hassle of map-reduce frameworks. Write your simple worker code, and let cascading Lambda functions handle the heavy lifting.
Lambda cascade
AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”
Code components
• Basic web app: handles the UI request, invokes cascade functions, streams results.
• Cascade function: invokes workers, aggregates and returns results; can be made recursive.
• Worker function: performs atomic work, returns results to the invoker.
Basic web app
var listStream = new S3KeyListStream(searchParams);
var lambdaStream = new LambdaStream(maxWorkers);
listStream
  .pipe(lambdaStream, { end: false })
  .pipe(serverSentStream)
  .pipe(httpResponse);
Basic web app key points
• Batched async execution within an async pipeline is very unintuitive.
• The trick is to use end: false and manually call end in pipeline code when all work is done.
• The pipeline will naturally queue up batches to stay under configured Lambda provisioning limits.
Lambda cascade function
// Chop our given list of keys up into batches
var batches = [];
var batch = [];
for (var i = 0, len = allKeys.length; i < len; i++) {
  batch.push(allKeys[i]);
  if (batch.length >= batchSize) {
    batches.push(batch.slice());
    batch = [];
  }
}
// Don't drop a trailing partial batch
if (batch.length) {
  batches.push(batch);
}
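The batching step can also be written more compactly with Array.prototype.slice, which sidesteps any trailing-batch bookkeeping. A sketch, not the talk's code:

```javascript
// Split a list of S3 keys into batches of at most batchSize keys each.
function makeBatches(allKeys, batchSize) {
  var batches = [];
  for (var i = 0; i < allKeys.length; i += batchSize) {
    batches.push(allKeys.slice(i, i + batchSize));
  }
  return batches;
}
```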
Lambda cascade function (continued)
// Invoke each batch in parallel, returning the
// aggregated result when all are finished.
async.map(batches, invoke, function (err, results) {
  if (err) {
    context.fail('async.map error: ' + err.toString());
    return;
  }
  context.succeed(results);
});
Lambda cascade function key points
• The nature of the data and workload will dictate the correct batch sizes to give a cascade function. You need to avoid running out of memory when aggregating results.
• 100:1 seems to work well: a good balance between low cascade overhead and manageable intermediate result size.
Worker function
var lineSplitter = eventstream.split();
lineSplitter.on('data', process).on('end', cb);
// Create our pipeline
s3.getObject({ Bucket: srcBucket, Key: srcKey })
  .createReadStream()
  .pipe(zlib.createGunzip())
  .pipe(lineSplitter);
Worker function key points
• Use the full 1.5 GB of memory.
• Download Amazon S3 keys concurrently; 5 seems to be the magic number for files in the 2-3 MB range.
• Use a faster decompression algorithm like LZ4 high-compression, which is up to 32x faster than zlib.
• Make sure warnings and failures percolate up with results.
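A minimal concurrency-limiter sketch for the "download 5 keys at a time" guidance, using plain promises rather than the async library; the names here (mapLimit, downloadKey) are hypothetical, not from the talk:

```javascript
// Run at most `limit` tasks at once; each task is a () => Promise.
// Results come back in the original order.
async function mapLimit(tasks, limit) {
  const results = new Array(tasks.length);
  let next = 0;
  async function worker() {
    while (next < tasks.length) {
      const i = next++; // single-threaded, so this index grab is race-free
      results[i] = await tasks[i]();
    }
  }
  const workers = Array.from(
    { length: Math.min(limit, tasks.length) },
    () => worker()
  );
  await Promise.all(workers);
  return results;
}

// Usage sketch with a hypothetical downloadKey(key) returning a Promise:
// const bodies = await mapLimit(keys.map((k) => () => downloadKey(k)), 5);
```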
Non–Amazon S3–sourced workloads
Lambda can source from anything:
• Amazon DynamoDB
• Amazon RDS
• Amazon Kinesis
• Amazon EC2 endpoints
• The Internet
Example Twitter App
How do my followers feel about _____?
1. Enter a keyword in the UI.
2. A Lambda worker executes for each follower.
3. Sentiment is reviewed (positive/negative/neutral).
4. Results are aggregated.
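Step 4, aggregating the per-follower results, can be sketched as a simple count over the labels from step 3 (the function name and label set here are assumptions for illustration):

```javascript
// Count per-follower sentiment labels into a single summary object.
// Labels are assumed to be 'positive' | 'negative' | 'neutral'.
function aggregateSentiment(labels) {
  const counts = { positive: 0, negative: 0, neutral: 0 };
  for (const label of labels) {
    counts[label] += 1;
  }
  return counts;
}
```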
Streaming Results
Progressive results
Thirty seconds is an eternity in UX time. Go beyond a progress bar: return streaming, progressive results. Show something meaningful in 3-5 seconds and the final result in 30. Graphically represent the updating data.
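One way to push progressive results to the browser is server-sent events, which matches the serverSentStream stage in the web app pipeline. Framing a partial result is a one-liner; this is a sketch assuming JSON payloads, and the function name is illustrative:

```javascript
// Frame a partial result object as a server-sent event (text/event-stream)
// message: a "data:" line followed by a blank line.
function toServerSentEvent(partialResult) {
  return 'data: ' + JSON.stringify(partialResult) + '\n\n';
}
```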
Mechanical sympathy
Visualizing the result stream as it matures communicates the magnitude of the work being performed and shows value.
Lambda Use Cases
Lambda is the future (and past)
It demonstrates the essence of AWS: capability through simplicity. These things are no longer needed:
• Servers
• Operating systems
• Networking
Dev effort focuses only on core competencies, not infrastructure.
Dev advantages
• If the code works once, it works at any scale.
• Unit and integration testing are easy (no cluster setup required).
• Any failures are due to faulty code or bad input, which are caught by good unit tests.
Beyond containers
• No patching; all upgrades are core competency updates.
• No instance monitoring, only app monitoring.
• Goes beyond containers: devs have an ultra-consistent environment.
Remember mainframes?
Mainframes (1970s) offered an attractive operating model but unattractive graphical capabilities. PCs (1990s) took over by bringing the compute to the people for a rich, graphical experience. Ubiquitous mobile broadband (2010s) centralizes the compute again, allowing the best of both worlds.
Related Sessions
ARC308 - The Serverless Company Using AWS Lambda: Streamlining Architecture with AWS
CMP301 - AWS Lambda: Event-Driven Code in the Cloud
Remember to complete your evaluations!
Thank you!
