
SMC303 Real-time Data Processing Using AWS Lambda


What if there were an easier way to perform big data analysis with less setup, instant scaling, and no servers to provision and manage? With serverless computing, you can perform real-time stream processing of multiple data types without needing to spin up servers or install software. Come learn how you can use AWS Lambda with Amazon Kinesis to analyze streaming data in real time and then store the results in a managed NoSQL database such as Amazon DynamoDB. You’ll learn tips and tricks for doing in-line processing, data manipulation, and even distributed MapReduce on large data sets.

Published in: Technology

  1. 1. SMC303 Real-Time Data Processing Using AWS Lambda. Tara E. Walker | Technical Evangelist | @taraw. April 2017. © 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
  2. 2. Agenda
     • What’s Serverless Real-Time Data Processing?
     • Serverless Processing of Real-Time Streaming Data
     • Streaming Data Processing Demo
     • Serverless Data Processing with Distributed Computing
     • Customer Story: Fannie Mae, Distributed Computing with Lambda
  3. 3. What’s Serverless Real-Time Data Processing?
  4. 4. AWS Lambda: efficient performance at scale. Easy to author, deploy, maintain, secure, and manage. Focus on business logic to build back-end services that perform at scale.
     • Bring your own code: stateless, event-driven code with native support for Node.js, Java, Python, and C#.
     • No infrastructure to manage: compute without managing infrastructure such as Amazon EC2 instances and Auto Scaling groups.
     • Cost-effective: automatically matches capacity to request rate. Compute is billed in 100 ms increments.
     • Triggered by events: direct sync and async API calls, AWS service integrations, and third-party triggers.
  5. 5. Lambda Event Sources, grouped on the slide as data stores, endpoints, configuration repositories, and event/message services: Amazon S3, Amazon DynamoDB, Amazon Kinesis, AWS CloudFormation, AWS CloudTrail, Amazon CloudWatch, Amazon Cognito, Amazon SNS, Amazon SES, cron events, AWS CodeCommit, Amazon API Gateway, Amazon Alexa, AWS IoT, AWS Step Functions, … more on the way!
  6. 6. Serverless Real-Time Data Processing Is...
     • Capture data streams (event source): IoT data, financial data, log data, clickstream data.
     • Process data streams (function): Node.js, Python, Java, C#.
     • Output data: database, cloud services.
     • No servers to provision or manage.
  7. 7. Lambda Real-Time Event Sources: How It Works
     • Asynchronous push model (Amazon S3, Amazon SNS): async invocation.
     • Synchronous push model (Amazon Alexa, AWS IoT): sync invocation.
     • Stream pull model (Amazon DynamoDB, Amazon Kinesis): Lambda polls the streams; sync invocation.
     • Push models: mapping owned by the event source; the event source invokes Lambda via the event source API; resource-based policy permissions; concurrent executions.
     • Pull model: mapping owned by Lambda; the Lambda function is invoked when new records are found on the stream; Lambda execution role policy permissions.
  8. 8. Serverless Processing of Real-Time Streaming Data
  9. 9. Amazon Kinesis
     • Real-time: collect real-time data streams and promptly respond to key business events and operational triggers. Real-time latencies.
     • Easy to use: focus on quickly launching data streaming applications instead of managing infrastructure.
     • Amazon Kinesis offering: managed services for streaming data ingestion and processing.
       • Amazon Kinesis Streams: build applications that process or analyze streaming data.
       • Amazon Kinesis Firehose: load massive volumes of streaming data into Amazon S3 and Amazon Redshift.
       • Amazon Kinesis Analytics: analyze data streams using SQL queries.
  10. 10. Processing Real-Time Streams: Lambda + Amazon Kinesis
     • Streaming data is sent to Amazon Kinesis and stored in shards.
     • Multiple Lambda functions can be triggered to process the same Amazon Kinesis stream for “fan out”.
     • Lambda can process data and store results, e.g. to DynamoDB or S3.
     • Lambda can aggregate data to services like Amazon Elasticsearch Service for analytics.
     • Lambda sends event data and function info to Amazon CloudWatch for capturing metrics and monitoring.
     • Diagram: Amazon Kinesis -> AWS Lambda -> Amazon CloudWatch, Amazon DynamoDB, Amazon Elasticsearch Service, Amazon S3.
  11. 11. Processing Streams: Set Up Amazon Kinesis Stream
     • Streams: made up of shards. Each shard ingests (accepts writes of) up to 1 MB/sec, emits (serves reads of) up to 2 MB/sec, and supports 5 read transactions/sec.
     • Data: all data is stored and is replayable for 24 hours. Keep the partition key distribution even to optimize parallel throughput. The partition key is used to distribute PUTs across shards, so choose a key with more groups than shards.
     • Best practice: determine an initial number of shards that covers expected maximum demand. Leverage the “Help me decide how many shards I need” option in the console, or use the formula for number of shards: max(incoming_write_bandwidth_in_KB / 1000, outgoing_read_bandwidth_in_KB / 2000).
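The shard formula on this slide can be sketched in a few lines of Python. This is a minimal illustration, not an official sizing tool: the function name and the choice to round up to a whole shard (with a floor of one) are assumptions added here.

```python
import math


def estimate_shard_count(write_kb_per_sec: float, read_kb_per_sec: float) -> int:
    """Apply the slide's formula: max(write_KB / 1000, read_KB / 2000)."""
    shards_for_writes = write_kb_per_sec / 1000  # each shard ingests ~1 MB/sec
    shards_for_reads = read_kb_per_sec / 2000    # each shard emits ~2 MB/sec
    # Rounding up and the minimum of one shard are assumptions of this sketch.
    return max(1, math.ceil(max(shards_for_writes, shards_for_reads)))


if __name__ == "__main__":
    # 4,500 KB/sec in and 6,000 KB/sec out: writes dominate (4.5 vs 3.0).
    print(estimate_shard_count(4500, 6000))
```

Sizing for the larger of the two ratios means neither the write limit nor the read limit of a shard becomes the bottleneck at the planned peak.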
  12. 12. Processing Streams: Create Lambda functions
     • Memory: CPU allocation is proportional to the memory configured. Increasing memory makes your code execute faster (if CPU bound) and allows larger record sizes to be processed.
     • Timeout: increasing the timeout allows for longer functions, but a longer wait in case of errors.
     • Permission model: the execution role defined for Lambda must have permission to access the stream.
     • Retries: with Amazon Kinesis, Lambda retries until the data expires (24 hours).
     • Best practice: write Lambda function code to be stateless. Instantiate AWS clients and database clients outside the scope of the function handler.
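The best practice above (stateless handler, clients created outside the handler) might look like the following sketch. `StubClient` is a hypothetical stand-in so the example runs without the AWS SDK; in a real function the module-level object would be an SDK or database client.

```python
INIT_COUNT = 0  # counts client constructions, for illustration only


class StubClient:
    """Hypothetical stand-in for an AWS or database client."""

    def __init__(self):
        global INIT_COUNT
        INIT_COUNT += 1
        self.sent = []

    def put(self, item):
        self.sent.append(item)


# Module scope: runs once when the execution environment starts and is
# reused by later invocations that land on the same environment.
CLIENT = StubClient()


def handler(event, context):
    # The handler keeps no state of its own; all inputs come from `event`.
    records = event.get("Records", [])
    for record in records:
        CLIENT.put(record)
    return {"processed": len(records)}
```

Calling `handler` repeatedly constructs the client only once, which is the point of moving the instantiation out of the handler body.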
  13. 13. Processing Streams: Configure Event Source
     • Amazon Kinesis is mapped as an event source in Lambda.
     • Batch size: the maximum number of records that Lambda will send to one invocation. It is not equivalent to the effective batch size, which is evaluated every 250 ms and calculated as MIN(records available, batch size, 6 MB). Increasing the batch size allows fewer Lambda function invocations with more data processed per invocation.
     • Best practices: set the starting position to “Trim Horizon” to read from the start of the stream (all retained data). Set it to “Latest” to read only the most recent data, starting just after the latest record in the shard.
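The effective batch size rule, MIN(records available, batch size, 6 MB), can be illustrated with a small calculation. This is a simplified model: it counts raw record bytes only, and the function name and the treatment of 6 MB as 6 × 1024 × 1024 bytes are assumptions of the sketch.

```python
MAX_PAYLOAD_BYTES = 6 * 1024 * 1024  # 6 MB payload cap quoted on the slide


def effective_batch(record_sizes, batch_size, max_bytes=MAX_PAYLOAD_BYTES):
    """Return how many of the available records fit in one invocation.

    record_sizes: sizes in bytes of the records currently available.
    batch_size:   the configured maximum number of records per invocation.
    """
    count = 0
    total = 0
    for size in record_sizes[:batch_size]:  # never more than batch_size
        if total + size > max_bytes:        # never more than the payload cap
            break
        total += size
        count += 1
    return count                            # never more than what is available
```

With 50 small records available and a batch size of 100, the effective batch is 50; with ten 1 MB records, the payload cap limits it to 6.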
  14. 14. Processing streams: How It Works
     • Polling: concurrent polling and processing per shard. Lambda polls every 250 ms if no records are found, and grabs as much data as possible in one GetRecords call (batch).
     • Batching: batches are passed to the Lambda invocation through function parameters. Batch size may impact duration if the Lambda function takes longer to process more records. Batches are sub-batched in memory to fit the invocation payload.
     • Synchronous invocation: batches are invoked as the synchronous RequestResponse type. Lambda honors Amazon Kinesis at-least-once semantics. Each shard blocks in order of synchronous invocation.
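Because delivery is at least once, the same batch can reach the function more than once. One way to tolerate that, sketched below, is to key each write on the record's sequence number so a redelivered record overwrites its earlier copy. The `store` dictionary stands in for a keyed database table, and `make_record` is a hypothetical helper that builds records shaped like the entries Lambda passes for Kinesis (base64 data under `kinesis.data`).

```python
import base64
import json


def process_batch(event, store):
    """Decode each Kinesis record and upsert it, keyed by sequence number."""
    for record in event["Records"]:
        kinesis = record["kinesis"]
        payload = json.loads(base64.b64decode(kinesis["data"]))
        # Same sequence number -> same key, so a replay does not duplicate.
        store[kinesis["sequenceNumber"]] = payload
    return len(event["Records"])


def make_record(sequence_number, payload, partition_key="demo"):
    """Hypothetical helper: build one record shaped like a Kinesis event entry."""
    data = base64.b64encode(json.dumps(payload).encode("utf-8")).decode("ascii")
    return {
        "kinesis": {
            "sequenceNumber": sequence_number,
            "partitionKey": partition_key,
            "data": data,
        }
    }
```

Processing the same event twice leaves the store with one entry per sequence number, which is the behavior an idempotent consumer needs.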
  15. 15. Processing streams: Tuning throughput
     • If the put / ingestion rate is greater than the theoretical throughput, your processing is at risk of falling behind.
     • Maximum theoretical throughput: # shards * 2 MB / Lambda function duration (s).
     • Effective theoretical throughput: # shards * batch size (MB) / Lambda function duration (s).
     • Diagram: source -> Amazon Kinesis shards -> Lambda functions -> destinations 1 and 2. Each poller polls a batch and waits for the response. Lambda will scale automatically; scale Amazon Kinesis by splitting or merging shards.
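The two throughput formulas translate directly into code. The sketch below uses hypothetical function names and treats all inputs as averages; it is a back-of-the-envelope check, not a capacity planner.

```python
def max_theoretical_throughput(shards: int, duration_s: float) -> float:
    """# shards * 2 MB / function duration (s), in MB per second."""
    return shards * 2.0 / duration_s


def effective_throughput(shards: int, batch_mb: float, duration_s: float) -> float:
    """# shards * batch size (MB) / function duration (s), in MB per second."""
    return shards * batch_mb / duration_s


def is_falling_behind(ingest_mb_per_s: float, shards: int,
                      batch_mb: float, duration_s: float) -> bool:
    """True when ingestion outpaces the effective processing rate."""
    return ingest_mb_per_s > effective_throughput(shards, batch_mb, duration_s)
```

For example, 4 shards with 0.5 MB batches and a 0.5 s function duration process about 4 MB/sec, so a 5 MB/sec ingestion rate would fall behind.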
  16. 16. Processing streams: Tuning Throughput w/ Retries
     • Retries: Lambda will retry on execution failures until the record expires. Throttles and errors impact duration and directly impact throughput.
     • Best practice: retry with exponential backoff of up to 60 s.
     • Effective theoretical throughput with retries: (# shards * batch size (MB)) / (function duration (s) * retries until expiry).
     • Diagram: source -> Amazon Kinesis shards -> Lambda functions -> destinations 1 and 2. A poller polls a batch, receives an error, receives an error again, then receives success. Lambda will scale automatically; scale Amazon Kinesis by splitting or merging shards.
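A capped exponential backoff schedule and the retry-adjusted throughput formula can be sketched as follows. The 1-second base delay and the doubling factor are assumptions of this example; only the 60-second cap comes from the slide. Production code would usually add random jitter to the delays.

```python
def backoff_delays(attempts: int, base_s: float = 1.0, cap_s: float = 60.0):
    """Delay before each retry: base * 2^i, capped at cap_s seconds."""
    return [min(cap_s, base_s * 2 ** i) for i in range(attempts)]


def throughput_with_retries(shards: int, batch_mb: float,
                            duration_s: float, retries: int) -> float:
    """(# shards * batch size (MB)) / (function duration (s) * retries)."""
    return (shards * batch_mb) / (duration_s * retries)


if __name__ == "__main__":
    print(backoff_delays(8))
    print(throughput_with_retries(4, 1.0, 0.5, 4))
```

The second function shows why retries matter: each extra attempt on a batch divides the throughput of that shard, because the shard blocks until the batch succeeds or expires.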
  17. 17. Processing streams: Common observations
     • Effective batch size may be less than configured during low throughput.
     • Effective batch size will increase during higher throughput.
     • Increased Lambda duration -> decreased number of invokes and GetRecords calls.
     • Too many consumers of your stream may compete for Amazon Kinesis read limits and induce ReadProvisionedThroughputExceeded errors and metrics.
  18. 18. Processing streams: Monitoring with CloudWatch
     • Monitoring Amazon Kinesis Streams: GetRecords (effective throughput); PutRecord (bytes, latency, records, etc.); GetRecords.IteratorAgeMilliseconds (how old your last processed records were).
     • Monitoring Lambda functions: Invocation count (number of times the function was invoked); Duration (execution/processing time); Error count (number of errors); Throttle count (number of times the function was throttled); Iterator Age (time elapsed between the final record of a batch being written to the stream and Lambda receiving the batch).
     • Debugging: review all metrics, make custom logs, view RAM consumed, search for log events. AWS X-Ray coming soon!
  19. 19. Streaming Data Processing Demo
  20. 20. Serverless Data Processing with Distributed Computing
  21. 21. Serverless Distributed Computing: Map-Reduce Model. Why Serverless Data Processing with Distributed Computing?
     • Remove difficult infrastructure management: cluster administration, complex configuration tools.
     • Enable simple, elastic, user-friendly distributed data processing: eliminate the complexity of state management, bring distributed computing power to the masses.
  22. 22. Serverless Distributed Computing: Map-Reduce Model. Why Serverless Data Processing with Distributed Computing?
     • Eliminate utilization concerns: makes code simpler by removing the complexities of multi-threaded processing to optimize server usage; a cost-effective option to run ad hoc MapReduce jobs.
     • Easier, automatic horizontal scaling: provides the ability to process scientific and analytics applications.
  23. 23. Serverless Distributed Computing: MapReduce. Diagram (steps 1-6): input bucket -> driver (job state) -> mapper functions (map phase) -> mapper output (S3 event source) -> coordinator -> reducer step 1 -> reducer output -> recursively create n’th reducer step -> final reducer (reduce phase) -> result.
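The flow in this diagram (map, then reduce in repeated steps until one result remains) can be imitated locally. The sketch below is a single-process word count, not the reference architecture itself: plain function calls stand in for Lambda invocations, a Python list stands in for the S3 buckets, and the `fan_in` parameter is an assumption of this example.

```python
from collections import Counter


def mapper(chunk: str) -> Counter:
    """Map phase: count the words in one input chunk."""
    return Counter(chunk.split())


def reducer(partials) -> Counter:
    """Reduce phase: merge a group of partial counts into one."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(chunks, fan_in: int = 2) -> Counter:
    """Driver and coordinator in one loop: map every chunk, then keep
    creating reducer steps until a single output remains."""
    if fan_in < 2:
        raise ValueError("fan_in must be at least 2")
    outputs = [mapper(chunk) for chunk in chunks]
    while len(outputs) > 1:
        outputs = [reducer(outputs[i:i + fan_in])
                   for i in range(0, len(outputs), fan_in)]
    return outputs[0] if outputs else Counter()
```

Each pass of the `while` loop corresponds to one reducer step in the diagram; the last pass plays the role of the final reducer.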
  24. 24. Serverless Distributed Computing: PyWren
     • PyWren prototype developed at the University of California, Berkeley.
     • Uses Python with AWS Lambda stateless functions for large-scale data analytics.
     • Achieved about 30-40 MB/s write and read performance per core to the S3 object store.
     • Scaled to 60-80 GB/s across 2,800 simultaneous functions.
  25. 25. Serverless Distributed Computing: Benchmark Using Amazon MapReduce Reference Architecture Framework with Lambda
     • Dataset queries: scan query (90 M rows, 6.36 GB of data); select query on page Rankings; aggregation query on UserVisits (775 M rows, ~127 GB of data).
     • Rankings: 90 million rows, 6.38 GB.
     • UserVisits: 775 million rows, 126.8 GB.
     • Documents: 136.9 GB.
  26. 26. Serverless Distributed Computing: Benchmark Using Amazon MapReduce Reference Architecture Framework with Lambda. A subset of the AMPLab benchmark was run to compare with other data processing frameworks. Performance benchmarks: execution time for each workload, in seconds.

     TECHNOLOGY               SCAN 1A   SCAN 1B   AGGREGATE 2A
     Amazon Redshift (HDD)    2.49      2.61      25.46
     Serverless MapReduce     39        47        200
     Impala - Disk - 1.2.3    12.015    12.015    113.72
     Impala - Mem - 1.2.3     2.17      3.01      84.35
     Shark - Disk - 0.8.1     6.6       7         151.4
     Shark - Mem - 0.8.1      1.7       1.8       83.7
     Hive - 0.12 YARN         50.49     59.93     730.62
     Tez - 0.2.0              28.22     36.35     377.48
  27. 27. Fannie Mae: Distributed Computing with Lambda
  28. 28. High Performance Computing Using AWS Lambda for Financial Modeling. Bin Lu, Fannie Mae. 4/18/2017. © 2017 Fannie Mae. Trademarks of Fannie Mae.
  29. 29. Fannie Mae Business. Fannie Mae is a leading source of financing for mortgage lenders:
     • Providing access to affordable mortgage financing in all market conditions.
     • Effectively managing and reducing risk to our business, taxpayers, and the housing finance system.
     In 2016, Fannie Mae provided $637B in liquidity to the mortgage market, enabling 1.1M home purchases, 1.4M refinancings, and 724K rental housing units.
  30. 30. Fannie Mae Financial Modeling. Financial modeling is a Monte Carlo simulation process to project future cash flows, which is used for managing mortgage risk on a daily basis:
     • Underwriting and valuation
     • Risk management
     • Financial reporting
     • Loss mitigation and loan removal
     ~10 quadrillion (10 × 10^15) cash flow projections each month in hundreds of economic scenarios.
  31. 31. Fannie Mae Financial Modeling Infrastructure. High performance computing (HPC) grids are the key infrastructure component for financial modeling at Fannie Mae. Fannie Mae’s existing HPC grid no longer meets our growing business needs:
     • It is 7 years old with limited computing capacity, limited IO capacity, limited storage, and a complex API.
     • It takes more than half a year to add incremental compute capacity and develop any new application.
     We are looking for a new HPC facility to react to the rapidly changing market:
     • Unlimited computing resources and unlimited storage.
     • Serverless infrastructure with a simple distributed computing API.
     • Efficient cost model.
  32. 32. Fannie Mae’s Journey to AWS Serverless HPC Service. In 2016, Fannie Mae began to work with AWS to build the first serverless HPC computing platform in the industry using the Lambda service. This is also the first pilot program for Fannie Mae to develop an AWS cloud-native application. Once the infrastructure is set up, we are able to develop a new application within a month and provision the compute resources within minutes. In March 2017, Fannie Mae successfully deployed the first financial modeling application to preproduction and ran it at 15,000 concurrent executions.
  33. 33. Fannie Mae’s Serverless HPC Performance
     Lambda service configuration:
     • Initial burst rate = 2,000, incremental rate = 100 per minute, throttle limit = 15,000.
     • Lambda ramps up automatically from 2,000 to 15,000 concurrent executions.
     Application result:
     • One simulation run of ~20 million mortgages takes 2 hours, more than 3 times faster than the existing process.
     • The performance does not degrade during the ramp-up period.
     • Lambda CPU efficiency is close to 100%. Actual elapsed time is consistent with the estimated elapsed time based on Lambda billing time.
     Chart: number of new Lambda invocations every 5 minutes; maximum concurrent Lambdas = 15,000.
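The ramp-up figures on this slide imply a simple concurrency curve. The sketch below assumes the ramp is linear (burst plus a fixed increment per minute, capped at the throttle limit); that linear shape and the function names are assumptions of this example, and the real service may ramp differently.

```python
import math


def concurrency_at(minute: float, burst: int = 2000,
                   step_per_min: int = 100, limit: int = 15000) -> float:
    """Available concurrency after `minute` minutes under a linear ramp."""
    return min(limit, burst + step_per_min * minute)


def minutes_to_limit(burst: int = 2000, step_per_min: int = 100,
                     limit: int = 15000) -> int:
    """Minutes needed for the ramp to reach the throttle limit."""
    return math.ceil(max(0, limit - burst) / step_per_min)


if __name__ == "__main__":
    print(concurrency_at(60))
    print(minutes_to_limit())
```

Under these assumptions the climb from 2,000 to 15,000 takes 13,000 / 100 = 130 minutes, which is of the same order as the 2-hour simulation run described on the slide.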
  34. 34. Simple Serverless HPC Reference Architecture. A map-reduce framework is used for simple parallel workloads:
     • The input file in the S3 input bucket is split using EC2 into n triggers, which are saved in the S3 event bucket.
     • Lambda automatically ramps up n concurrent executions and writes outputs to the S3 mapper bucket.
     • EC2 is used to aggregate the outputs and write the final result to the S3 reducer bucket.
     Diagram: Amazon S3 input -> Amazon EC2 splitter -> Amazon S3 event -> AWS Lambda mappers -> Amazon S3 mapper result -> Amazon EC2 reducer -> Amazon S3 reducer result.
  35. 35. Complex Serverless HPC Reference Architecture. Break down a complex workload into multiple simple ones: …
  36. 36. Benefit of Serverless HPC Service
     Cost effective:
     • Never pay for idle. The cost is based on actual vCPU usage, not elapsed time or maximum processing capacity of the infrastructure.
     • Performance improvement at zero cost: 1 Lambda x 15,000 hours = 15,000 Lambdas x 1 hour.
     Shorter time to market:
     • Ability to burst to the cloud immediately to access additional computing resources.
     • Ability to focus on business needs. No server to manage and no complex distributed computing code to write.
     Most complete data analytics platform:
     • Streamlined integration with big data platform and BI tools / data lake.
     • Business resiliency.
  37. 37. Considerations and Next Step
     Considerations:
     • Maximize S3 performance by distributing the key names to evenly distribute objects across the partitions.
     • Set up a separate AWS account for unlimited Lambda access / IP addresses.
     • Adopt a microservice architecture to migrate one business function/application at a time.
     • Integrate with the AWS big data analytics platform for accessing unlimited storage and state-of-the-art business analytical tools.
     Next step:
     • Production migration of the first application in Q2 2017.
     • Complete migration of primary loan performance modeling applications to AWS in early 2018.
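One common way to distribute S3 key names, sketched below, is to prepend a short hash of the key so that objects written in sequence do not share a leading prefix. The slide does not say which scheme Fannie Mae used; the hashed-prefix approach, the 4-character prefix length, and the function name are all assumptions of this example.

```python
import hashlib


def distributed_key(key: str, prefix_len: int = 4) -> str:
    """Prefix a key with the first hex characters of its SHA-256 digest."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return f"{digest[:prefix_len]}/{key}"


if __name__ == "__main__":
    for name in ("mapper/output-0001.json", "mapper/output-0002.json"):
        print(distributed_key(name))
```

The mapping is deterministic, so a reader that knows the original key can recompute the prefixed key without a lookup table.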
  38. 38. Real-time Data Processing with Lambda: Next Steps
  39. 39. Data Processing with AWS: Next steps
     • Learn more about AWS Serverless at
     • Explore the AWS Lambda Reference Architecture on GitHub:
       • Real-Time Streaming: streamprocessing
       • Distributed Computing Reference Architecture (serverless MapReduce)
  40. 40. Data Processing with AWS: Next steps
     • Create an Amazon Kinesis stream. Visit the Amazon Kinesis console and configure a stream to receive data, e.g. data from social media feeds.
     • Create and test a Lambda function to process streams from Amazon Kinesis by visiting the Lambda console. First 1M requests each month are on us!
     • Read the Developer Guide and try the Lambda and Amazon Kinesis tutorial: kinesis.html
     • Send questions, comments, and feedback to the AWS Lambda Forums.
  41. 41. Thank you! Tara E. Walker AWS Technical Evangelist @taraw