(CMP403) AWS Lambda: Simplifying Big Data Workloads

AWS Lambda allows any Node.js app to be run at scale in a massively parallel environment with no up-front costs or planning. This session shows how to use Lambda to build dynamic analytic data flows that can be tuned as they execute, based on initial results, to provide real-time output streamed to web clients. This process enables a cost-effective and responsive user experience for ad hoc big data jobs and lets developers focus on how data is consumed and presented, instead of how it is obtained.

(CMP403) AWS Lambda: Simplifying Big Data Workloads

  1. 1. CMP403: AWS Lambda: Simplifying Big Data Workloads. Martin Holste, FireEye. October 2015. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. What to Expect from the Session This is a deep-dive on general computing uses for AWS Lambda. • You will understand what makes Lambda a big deal for big data. • You will not learn about asynchronously triggered workloads (see related sessions for that). • You will see interactive, data-driven user experiences that work with minimal ops overhead and at any scale.
  3. 3. Problem: Big data, little time At FireEye, one of the ways we protect customers is by analyzing mountains of event data to find “evil.” Some of it we have online in indexes, some of it we have in cold storage on Amazon S3. We needed to be able to take advantage of the rich history in our archived data without hurting our user experience.
  4. 4. Our app creates questions and finds answers. [Diagram labels: Lambda-driven search and analytics; EMR analytic output; EC2-based proprietary detection; Amazon EMR triggers investigations; EC2-based indexed search; AWS Lambda provides context; Questions; Answers]
  5. 5. What analysis are we doing? Amazon EMR: scheduled jobs that process all data for anomaly detection: • K-means • Linear regression • Geographic time-lining. AWS Lambda: free-form searching to drive ad hoc: • Reports • Visualizations • Analytical statistics (clustering, correlation, linear regression, etc.)
  6. 6. Visualize search results analytically User-defined analytics based on ad hoc features of the search result set draw attention to otherwise uninteresting facets of the data.
  7. 7. How big is our Big? For an average customer, the average security event is about 3 KB. At 20k events/sec that is roughly 60 MB/sec, or about 5 TB/day. One week is about 35 TB and 12 billion events.
  8. 8. How long does this take? A single process downloads, decompresses, greps, and processes at about 35k events/sec (105 MB/sec). To process a week of data (processes: time scale): 1: ~4 days; 10: ~6 hours; 100: ~1 hour; 1000: ~5 minutes; 10000: seconds. [Chart: time to process a week of data at 1, 10, 100, and 1000 processes]
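The time scale on slide 8 follows from the sizing on slide 7. A minimal sketch of that arithmetic, assuming perfectly linear scaling with no invocation or aggregation overhead (an assumption of this sketch, not a claim from the talk). The constant and function names are illustrative; only the rates come from the slides.

```javascript
// Rates taken from the slides.
const EVENTS_PER_SEC_INGEST = 20000;      // average customer event rate
const BYTES_PER_EVENT = 3000;             // about 3 KB per event
const EVENTS_PER_SEC_PER_PROCESS = 35000; // single-process throughput
const SECONDS_PER_WEEK = 7 * 24 * 60 * 60;

const weeklyEvents = EVENTS_PER_SEC_INGEST * SECONDS_PER_WEEK;
const weeklyBytes = weeklyEvents * BYTES_PER_EVENT;

// Assumption: throughput scales linearly with the number of processes.
function secondsToProcessWeek(processes) {
  return weeklyEvents / (EVENTS_PER_SEC_PER_PROCESS * processes);
}

console.log('events per week:', weeklyEvents);
console.log('TB per week:', (weeklyBytes / 1e12).toFixed(1));
for (const n of [1, 10, 100, 1000, 10000]) {
  console.log(n + ' processes: ' + Math.round(secondsToProcessWeek(n)) + ' s');
}
```

Under that assumption one process needs 345,600 seconds (4 days) and 1,000 processes need about 346 seconds. For 10 processes the same arithmetic gives about 9.6 hours, somewhat more than the rounded figure on the slide.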
  9. 9. Lambda FTW What if you could spin up 10k processes in 100 ms? Standard map-reduce pattern without the startup time or hassle of map-reduce frameworks. Write your simple worker code, and let cascading Lambda functions handle the heavy lifting.
  10. 10. Lambda cascade AWS Big Data blog: “Building Scalable and Responsive Big Data Interfaces with AWS Lambda”
  11. 11. Code components. Basic web app: handles the UI request, invokes cascade functions, streams results. Cascade function: invokes workers, aggregates and returns results; can be made recursive. Worker function: performs atomic work, returns results to the invoker.
  12. 12. Basic web app
          var listStream = new S3KeyListStream(searchParams);
          var lambdaStream = new LambdaStream(maxWorkers);
          listStream
            .pipe(lambdaStream, { end: false })
            .pipe(serverSentStream)
            .pipe(httpResponse);
  13. 13. Basic web app key points • Batched async execution within an async pipeline is very unintuitive. • The trick is to pipe with { end: false } and call end() manually in pipeline code once all work is done. • The pipeline naturally queues up batches to stay under the configured Lambda provisioning limits.
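One way to get bounded concurrency inside a stream pipeline is sketched below. `BoundedParallelStream` is a hypothetical stand-in for the talk's `LambdaStream`, whose source is not shown on the slides, and `task` stands in for a Lambda invocation that returns a promise. This sketch holds the stream open in `_flush` until every task has finished, which is a different mechanism from the `{ end: false }` approach on the slide.

```javascript
const { Transform } = require('stream');

// Hypothetical: runs one async task per chunk, at most maxWorkers at a time.
class BoundedParallelStream extends Transform {
  constructor(maxWorkers, task) {
    super({ objectMode: true });
    this.maxWorkers = maxWorkers;
    this.task = task;         // assumed shape: (chunk) => Promise<result>
    this.inFlight = 0;
    this.pendingWrite = null; // held _transform callback while saturated
    this.pendingFlush = null; // held _flush callback while tasks remain
  }

  _transform(chunk, _encoding, callback) {
    this.inFlight++;
    this.task(chunk)
      .then(
        (result) => this.push(result),
        (err) => this.push({ error: String(err) }) // failures travel with results
      )
      .then(() => this._taskDone());
    if (this.inFlight < this.maxWorkers) {
      callback(); // room for more: accept the next chunk now
    } else {
      this.pendingWrite = callback; // saturated: hold the callback as backpressure
    }
  }

  _taskDone() {
    this.inFlight--;
    if (this.pendingWrite) {
      const resume = this.pendingWrite;
      this.pendingWrite = null;
      resume();
    }
    if (this.inFlight === 0 && this.pendingFlush) {
      const finish = this.pendingFlush;
      this.pendingFlush = null;
      finish();
    }
  }

  // Called once all input is written; delay 'end' until the last task completes.
  _flush(callback) {
    if (this.inFlight === 0) callback();
    else this.pendingFlush = callback;
  }
}
```

Results are pushed downstream in completion order, not input order, which suits progressive display but means consumers should not rely on ordering.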
  14. 14. Lambda cascade function
          // Chop our given list of keys up into batches
          var batches = [];
          var batch = [];
          for (var i = 0, len = allKeys.length; i < len; i++){
            batch.push(allKeys[i]);
            if (batch.length >= batchSize){
              batches.push(batch.slice());
              batch = [];
            }
          }
          // Keep the final partial batch when the key count is not a multiple of batchSize
          if (batch.length > 0){
            batches.push(batch);
          }
  15. 15. Lambda cascade function (continued)
          // Invoke each batch in parallel, returning the aggregated result when all are finished.
          async.map(batches, invoke, function (err, results) {
            if (err) {
              context.fail('async.map error: ' + err.toString());
              return;
            }
            context.succeed(results);
          });
  16. 16. Lambda cascade function key points • The nature of the data and workload dictates the right batch size to give a cascade function; it must not run out of memory while aggregating results. • 100:1 seems to work well: a good balance between low cascade overhead and manageable intermediate result size.
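A minimal, in-process sketch of the cascade pattern, including the recursive variant mentioned on slide 11. The names `cascade`, `toBatches`, `fanOut`, and `worker` are hypothetical. Each recursive call stands in for what would be a separate Lambda invocation in the real system, and `Promise.all` is used here in place of the `async.map` call shown on the slides.

```javascript
// Split keys into consecutive batches, keeping the final partial batch.
function toBatches(keys, batchSize) {
  const batches = [];
  for (let i = 0; i < keys.length; i += batchSize) {
    batches.push(keys.slice(i, i + batchSize));
  }
  return batches;
}

// Fan out recursively until a level has at most `fanOut` keys,
// then run one worker per key and aggregate results on the way back up.
async function cascade(keys, fanOut, worker) {
  if (fanOut < 2) throw new Error('fanOut must be at least 2');
  if (keys.length <= fanOut) {
    return Promise.all(keys.map((key) => worker(key)));
  }
  const batchSize = Math.ceil(keys.length / fanOut);
  const nested = await Promise.all(
    toBatches(keys, batchSize).map((batch) => cascade(batch, fanOut, worker))
  );
  return nested.flat(); // merge child results, preserving key order
}
```

With a fan-out of 100, a list of 10,000 keys resolves in two levels: 100 cascade calls, each running 100 workers. Every level holds only its own children's results in memory, which is the constraint the key points above describe.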
  17. 17. Worker function
          // Split the decompressed stream into lines; process and cb are defined elsewhere in the worker
          var lineSplitter = eventstream.split();
          lineSplitter.on('data', process).on('end', cb);
          // Create our pipeline
          s3.getObject({ Bucket: srcBucket, Key: srcKey })
            .createReadStream()
            .pipe(zlib.createGunzip())
            .pipe(lineSplitter);
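The same download, decompress, and grep pipeline can be tried locally without S3. In this sketch an in-memory gzip buffer stands in for `s3.getObject(...).createReadStream()`, Node's built-in `readline` replaces the `eventstream.split()` line splitter, and `grepGzipStream` is a hypothetical name.

```javascript
const zlib = require('zlib');
const readline = require('readline');
const { Readable } = require('stream');

// Hypothetical worker body: gunzip a newline-delimited stream and
// collect the lines that contain `term`.
async function grepGzipStream(source, term) {
  const lines = readline.createInterface({
    input: source.pipe(zlib.createGunzip()),
    crlfDelay: Infinity, // treat \r\n as a single line break
  });
  const matches = [];
  for await (const line of lines) {
    if (line.includes(term)) matches.push(line);
  }
  return matches;
}

// Stand-in for an S3 object: a small gzipped log held in memory.
const sampleLog = zlib.gzipSync('login ok\nlogin failed\nlogout\n');
grepGzipStream(Readable.from([sampleLog]), 'login')
  .then((matches) => console.log(matches));
```

Because every stage is a stream, memory use depends on the size of the matches rather than the size of the decompressed object.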
  18. 18. Worker function key points • Use the full 1.5 GB of memory. • Download Amazon S3 keys concurrently. • 5 concurrent downloads seems to be the magic number for files in the 2-3 MB range. • Use a faster decompression algorithm like LZ4 high-compression, which is up to 32x faster than zlib. • Make sure warnings and failures percolate up with results.
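Two of these points, limited concurrent downloads and failures that percolate up with results, can be sketched with a small promise pool. `mapWithLimit` is a hypothetical helper, and `fn` stands in for whatever downloads and processes one S3 key.

```javascript
// Run `fn` over `items` with at most `limit` calls in flight.
// Failures are recorded next to successes instead of aborting the batch.
async function mapWithLimit(items, limit, fn) {
  const results = new Array(items.length);
  let next = 0;

  // Each runner repeatedly claims the next unprocessed index.
  async function runner() {
    while (next < items.length) {
      const i = next++;
      try {
        results[i] = { ok: true, value: await fn(items[i]) };
      } catch (err) {
        results[i] = { ok: false, error: String(err) };
      }
    }
  }

  const runnerCount = Math.min(limit, items.length);
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}
```

A worker could call `mapWithLimit(keys, 5, downloadAndGrep)`, where `downloadAndGrep` is an assumed function, and return the whole array so the cascade function sees both matches and errors.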
  19. 19. Non–Amazon S3–sourced workloads. Lambda can source from anything: Amazon DynamoDB, Amazon RDS, Amazon Kinesis, Amazon EC2 endpoints, the Internet.
  20. 20. Example Twitter App
  21. 21. How do my followers feel about _____? 1. Enter a keyword in the UI. 2. A Lambda worker executes for each follower. 3. Sentiment is reviewed (positive/negative/neutral). 4. Results are aggregated.
  22. 22. Streaming Results
  23. 23. Progressive results. Thirty seconds is an eternity in UX time. Go beyond a progress bar: return streaming, progressive results. Show something meaningful in 3-5 seconds and the final result in 30. Graphically represent the updating data.
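Progressive results can be delivered as Server-Sent Events, which the `serverSentStream` on slide 12 suggests. A minimal sketch, assuming each worker result carries a sentiment label as in the Twitter example. `toServerSentEvent` and `RunningTally` are hypothetical names; the frame layout follows the standard `event:` and `data:` text format.

```javascript
// Format one Server-Sent Events frame: event name, JSON payload, blank line.
function toServerSentEvent(eventName, payload) {
  return `event: ${eventName}\ndata: ${JSON.stringify(payload)}\n\n`;
}

// Keeps a running count so every partial result yields an updated snapshot.
class RunningTally {
  constructor(expected) {
    this.expected = expected; // number of worker results anticipated
    this.seen = 0;
    this.totals = { positive: 0, negative: 0, neutral: 0 };
  }

  // Record one result and return the frame to send to the client.
  add(sentiment) {
    if (Object.prototype.hasOwnProperty.call(this.totals, sentiment)) {
      this.totals[sentiment]++;
    }
    this.seen++;
    const done = this.seen >= this.expected;
    return toServerSentEvent(done ? 'final' : 'partial', {
      seen: this.seen,
      expected: this.expected,
      totals: { ...this.totals },
    });
  }
}
```

On the server, each returned frame would be written to a response served with `Content-Type: text/event-stream`. A browser `EventSource` can then listen for `partial` events to redraw a chart and for the `final` event to stop.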
  24. 24. Mechanical sympathy Visualizing the result stream as it matures communicates the magnitude of the work being performed and shows value.
  25. 25. Lambda Use Cases
  26. 26. Lambda is the future (and past) It demonstrates the essence of AWS: capability through simplicity. These things are no longer needed: • Servers • Operating systems • Networking Dev effort focuses only on core competencies, not infrastructure.
  27. 27. Dev advantages • If the code works once, it works at any scale. • Unit and integration testing are easy (no cluster setup required). • Any failures are due to faulty code or bad input, which are caught by good unit tests.
  28. 28. Beyond containers • No patching: all upgrades are core competency updates. • No instance monitoring, only app monitoring. • Goes beyond containers: devs get an ultra-consistent environment.
  29. 29. Remember mainframes? 1970s: Mainframes offer an attractive operating model but unattractive graphical capabilities. 1990s: PCs take over by bringing the compute to the people for a rich, graphical experience. 2010s: Ubiquitous mobile broadband centralizes the compute again by allowing the best of both worlds.
  30. 30. Related Sessions: ARC308 - The Serverless Company Using AWS Lambda: Streamlining Architecture with AWS; CMP301 - AWS Lambda: Event-Driven Code in the Cloud
  31. 31. Remember to complete your evaluations!
  32. 32. Thank you!
