
AWS Lambda from the trenches

AWS Lambda has changed the way we deploy and run software, but this new serverless paradigm has created new challenges to old problems - how do you test a cloud-hosted function locally? How do you monitor it? What about logging and config management? And how do we start migrating from existing architectures?

In this talk Yan will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.

Published in: Technology

  1. 1. AWS LAMBDA from the TRENCHES: what you should know before you go to production
  2. 2. hi, I’m Yan Cui
  3. 3. AWS user since 2009
  4. 4. apr, 2016
  5. 5. hidden complexities and dependencies; low utilisation to leave room for traffic spikes; EC2 scaling is slow, so scale earlier; lots of cost for unused resources; up to 30 mins for deployment; deployment required downtime
  6. 6. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  7. 7. “what would good look like for us?”
  8. 8. DEPLOYMENTS SHOULD... be small; be fast; have zero downtime; have no lock-step
  9. 9. FEATURES SHOULD... be deployable independently; be loosely-coupled
  10. 10. WE WANT TO... minimise cost for unused resources; minimise ops effort; reduce tech mess; deliver visible improvements faster
  11. 11. nov, 2016
  12. 12. 170 Lambda functions in prod; 1.2 GB deployment packages in prod; 95% cost saving vs EC2; 15x no. of prod releases per month
  13. 13. time is a good fit
  14. 14. 1st function in prod! time is a good fit
  15. 15. ? time is a good fit 1st function in prod!
  16. 16. ALERTING CI / CD TESTING LOGGING MONITORING
  17. 17. 170 functions WOOF! ? ? time is a good fit 1st function in prod!
  18. 18. SECURITY DISTRIBUTED TRACING CONFIG MANAGEMENT
  19. 19. evolving the PLATFORM
  20. 20. rebuilt search
  21. 21. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch
  22. 22. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  23. 23. new analytics pipeline
  24. 24. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery
  25. 25. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery 1 developer, 2 days from design to production (his 1st serverless project)
  26. 26. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery “nothing ever got done this fast at Skype!” - Chris Twamley
  27. 27. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  28. 28. Rebuilt with Lambda
  29. 29. Rebuilt with Lambda
  30. 30. BigQuery
  31. 31. BigQuery
  32. 32. grapheneDB BigQuery
  33. 33. grapheneDB BigQuery
  34. 34. grapheneDB BigQuery
  35. 35. getting PRODUCTION READY
  36. 36. CHOOSE A FRAMEWORK DEPLOYMENT
  37. 37. http://serverless.com
  38. 38. https://github.com/awslabs/serverless-application-model
  39. 39. http://apex.run
  40. 40. https://apex.github.io/up
  41. 41. https://github.com/claudiajs/claudia
  42. 42. https://github.com/Miserlou/Zappa
  43. 43. http://gosparta.io/
  44. 44. TESTING
  45. 45. amzn.to/29Lxuzu
  46. 46. Level of Testing 1.Unit do our objects do the right thing? are they easy to work with?
  47. 47. Level of Testing 1.Unit 2.Integration does our code work against code we can’t change?
  48. 48. handler
  49. 49. handler test by invoking the handler
  50. 50. Level of Testing 1.Unit 2.Integration 3.Acceptance does the whole system work?
  51. 51. Level of Testing unit integration acceptance feedback confidence
  52. 52. “…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise. The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…” Don’t Mock Types You Can’t Change
  53. 53. “…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do… Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…” Don’t Mock Types You Can’t Change
  54. 54. Don’t Mock Types You Can’t Change Services
  55. 55. “…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code. An end-to-end test interacts with the system only from the outside: through its interface…” Testing End-to-End
  56. 56. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  57. 57. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input
  58. 58. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input Validate
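
The "test by invoking the handler" idea from the integration-testing slides can be sketched as below. This is a minimal illustration, not the talk's actual test code: the handler body, the `invoke` helper and the `q` query parameter are hypothetical. Only the callback-style handler signature and the API Gateway proxy fields (`queryStringParameters`, `statusCode`, `body`) are standard.

```javascript
// Hypothetical handler in the Node.js callback style: (event, context, callback).
// The search logic is a stand-in; a real integration test would let the handler
// talk to the actual downstream service instead of returning an empty result.
function handler(event, context, callback) {
  const query = (event.queryStringParameters || {}).q;
  if (!query) {
    return callback(null, { statusCode: 400, body: JSON.stringify({ error: "missing q" }) });
  }
  callback(null, { statusCode: 200, body: JSON.stringify({ query, results: [] }) });
}

// Hypothetical test helper: call the handler directly with a hand-built event
// and capture whatever it passes to the callback.
// It relies on this handler calling back synchronously.
function invoke(fn, event) {
  let captured;
  fn(event, { awsRequestId: "test-request" }, (err, res) => {
    captured = { err, res };
  });
  return captured;
}

const ok = invoke(handler, { queryStringParameters: { q: "lambda" } });
console.log(ok.res.statusCode, ok.res.body);
```

The same event can then be sent through API Gateway for the acceptance test, so both levels share test cases and differ only in how the function is reached.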
  59. 59. CI + CD PIPELINE
  60. 60. “the earlier you consider CI + CD, the more time you save in the long run” - me
  61. 61. “…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed… This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…” Testing End-to-End
  62. 62. “deployment scripts that only live on the CI box is a disaster waiting to happen” - me
  63. 63. Jenkins build config deploys and tests unit + integration tests deploy acceptance tests
  64. 64. if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION
          elif [ "$1" = "int-test" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE npm run int-$STAGE
          elif [ "$1" = "acceptance-test" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE npm run acceptance-$STAGE
          else
            usage
            exit 1
          fi
  65. 65. build.sh allows repeatable builds on both local & CI
  66. 66. Auto Auto Manual
  67. 67. LOGGING
  68. 68. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  69. 69. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (fields: UTC Timestamp, API Gateway Request Id, your log message)
  70. 70. function name date function version
  71. 71. LOG OVERLOAD
  72. 72. CENTRALISE LOGS
  73. 73. CENTRALISE LOGS MAKE THEM EASILY SEARCHABLE
  74. 74. Elasticsearch + Logstash + Kibana: the ELK stack
  75. 75. CloudWatch Logs
  76. 76. CloudWatch Logs AWS Lambda ELK stack
  77. 77. CloudWatch Events
  78. 78. http://bit.ly/2f3zxQG
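
A function that ships logs from CloudWatch Logs to a central store would first need to split each line into the three fields shown on the logging slides. The sketch below assumes the fields are separated by whitespace, as in the slide. `parseLogLine` is a hypothetical helper name, not part of any AWS SDK.

```javascript
// Split a Lambda log line into timestamp, request id and message.
// Assumes whitespace-separated fields, as shown on the slide.
function parseLogLine(line) {
  const match = /^(\S+)\s+(\S+)\s+([\s\S]*)$/.exec(line);
  if (!match) return null; // not in the expected three-field shape
  const [, timestamp, requestId, message] = match;
  return { timestamp, requestId, message };
}

const sample =
  "2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?";
console.log(parseLogLine(sample));
```

Keeping the request id as its own field is what makes the centralised logs searchable per invocation.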
  79. 79. DISTRIBUTED TRACING
  80. 80. “my followers didn’t receive my new post!” - a user
  81. 81. where could the problem be?
  82. 82. correlation IDs* * eg. request-id, user-id, yubl-id, etc.
  83. 83. ROLL YOUR OWN CLIENTS
  84. 84. kinesis client http client sns client
  85. 85. http://bit.ly/2k93hAj
  86. 86. ROLL YOUR OWN CLIENTS X-RAY
  87. 87. Amazon X-Ray
  88. 88. Amazon X-Ray
  89. 89. traces do not span over API Gateway
  90. 90. useful, but hampered by current limitations
  91. 91. http://bit.ly/2s9yxmA
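
One way to picture the correlation-ID approach with hand-rolled clients: capture the IDs from the incoming request, then have every outgoing client attach them. The sketch below is an assumption-laden illustration. The `x-correlation-` header prefix and both function names are hypothetical conventions, not something the slides specify.

```javascript
// Assumed naming convention for correlation headers (hypothetical).
const PREFIX = "x-correlation-";

// Collect correlation IDs from incoming HTTP headers. If the caller did not
// send a correlation id, start a new one from the current request id.
function captureCorrelationIds(headers, awsRequestId) {
  const ids = {};
  for (const [name, value] of Object.entries(headers || {})) {
    const key = name.toLowerCase();
    if (key.startsWith(PREFIX)) ids[key] = value;
  }
  if (!ids[PREFIX + "id"]) ids[PREFIX + "id"] = awsRequestId;
  return ids;
}

// What a custom HTTP client would do before sending: merge the captured IDs
// into the outgoing headers without mutating the caller's object.
function withCorrelationIds(headers, ids) {
  return Object.assign({}, headers, ids);
}

const ids = captureCorrelationIds({ "X-Correlation-User-Id": "user-42" }, "req-1");
console.log(withCorrelationIds({ "content-type": "application/json" }, ids));
```

Kinesis and SNS clients would follow the same pattern, carrying the IDs in the message payload or attributes instead of HTTP headers.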
  92. 92. MONITORING + ALERTING
  93. 93. “where do I install monitoring agents?”
  94. 94. you can’t
  95. 95. • invocation Count • error Count • latency • throttling • granular to the minute • support custom metrics
  96. 96. • same metrics as CW • better dashboard • support custom metrics https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
  97. 97. “how do I batch up and send logs in the background?”
  98. 98. you can’t (kinda)
  99. 99. console.log("hydrating yubls from db…");
          console.log("fetching user info from user-api");
          console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
          console.log("MONITORING|1489795335|8|count|yubls-served");
          (first two lines: logs; last two lines: metrics, in the form MONITORING|timestamp|metric value|metric type|metric name)
  100. 100. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
  101. 101. http://bit.ly/2gGredx
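
The metrics-as-log-lines format above can be produced and consumed with two small functions. This is a sketch under the assumption that the field order is exactly the one annotated on the slide. `formatMetric` and `parseMetric` are hypothetical names. Publishing the parsed metric to CloudWatch is left out.

```javascript
// Write side: the function logs a metric as a specially formatted line.
function formatMetric(name, type, value, timestampSeconds) {
  return ["MONITORING", timestampSeconds, value, type, name].join("|");
}

// Read side: the log-processing function picks out metric lines and ignores
// ordinary log messages by returning null for them.
function parseMetric(message) {
  const parts = message.split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
}

console.log(formatMetric("user-api-latency", "latency", 27.4, 1489795335));
console.log(parseMetric("MONITORING|1489795335|8|count|yubls-served"));
```

Because the function only writes to stdout, recording a metric adds no network call to the invocation itself.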
  102. 102. DASHBOARDS
  103. 103. DASHBOARDS SET ALARMS
  104. 104. DASHBOARDS SET ALARMS TRACK APP-LEVEL METRICS
  105. 105. Not Only CloudWatch
  106. 106. “you really don't want your monitoring system to fail at the same time as the system it monitors” - me
  107. 107. CONFIG MANAGEMENT
  108. 108. easily and quickly propagate config changes
  109. 109. CENTRALISED CONFIG SERVICE
  110. 110. config service goes here
  111. 111. EC2 parameter store
  112. 112. CENTRALISED CONFIG SERVICE CLIENT LIBRARY
  113. 113. http://bit.ly/2yLUjwd
  114. 114. sensitive data should be encrypted in-flight, and at rest (credentials, connection string, etc.)
  115. 115. role-based access
  116. 116. KMS
  117. 117. EC2 Parameter Store HTTPS role-based access encrypted in-flight
  118. 118. EC2 Parameter Store encrypt role-based access
  119. 119. EC2 Parameter Store encrypted at-rest
  120. 120. HTTPS role-based access EC2 Parameter Store encrypted in-flight
  121. 121. KMS FRAMEWORK PLUG-INS
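
A config client library typically caches values for a short time, so that changes still propagate quickly without calling the config service on every invocation. The sketch below shows only that caching layer. `createConfigClient`, `load` and the TTL value are hypothetical, and the real call to the parameter store would go inside `load`.

```javascript
// Wrap a loader with a time-based cache. `load(key)` is supplied by the caller;
// if it returns a promise, the promise itself is cached until the TTL expires.
// `now` is injectable so the expiry logic can be tested without waiting.
function createConfigClient(load, ttlMs, now = Date.now) {
  const cache = new Map(); // key -> { value, expiresAt }
  return {
    get(key) {
      const hit = cache.get(key);
      if (hit && hit.expiresAt > now()) return hit.value;
      const value = load(key);
      cache.set(key, { value, expiresAt: now() + ttlMs });
      return value;
    },
  };
}

// Usage with a stand-in loader; a real one would fetch and decrypt the value.
const config = createConfigClient((key) => `value-of-${key}`, 3 * 60 * 1000);
console.log(config.get("db-connection-string"));
```

A shorter TTL means faster propagation of config changes at the cost of more calls to the config service.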
  122. 122. PRO TIPS
  123. 123. SERVERLESS FRAMEWORK
  124. 124. max 75 GB total deployment package size* * limit is per AWS region
  125. 125. CLEAN UP OLD PACKAGES
  126. 126. Janitor Monkey
  127. 127. Janitor Lambda http://bit.ly/2xzVu4a
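
The core decision in a clean-up function is which deployed versions are safe to delete. The sketch below shows one plausible policy, which is an assumption and not necessarily what the linked Janitor Lambda does: keep `$LATEST`, keep anything an alias points to, keep the newest few, and delete the rest. Listing and deleting versions through the AWS SDK is not shown.

```javascript
// Decide which function versions to delete.
// versions: version strings as Lambda reports them, e.g. ["$LATEST", "1", "2"].
// aliasedVersions: versions referenced by an alias, which must be kept.
// keepLatest: how many of the newest numbered versions to keep regardless.
function versionsToDelete(versions, aliasedVersions, keepLatest) {
  const protectedVersions = new Set(aliasedVersions);
  const newestFirst = versions
    .filter((v) => v !== "$LATEST")
    .map(Number)
    .sort((a, b) => b - a);
  return newestFirst
    .slice(keepLatest)
    .map(String)
    .filter((v) => !protectedVersions.has(v));
}

console.log(versionsToDelete(["$LATEST", "1", "2", "3", "4", "5"], ["2"], 2));
```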
  128. 128. disable versionFunctions in
  129. 129. install Serverless framework as dev dependency at project level dev dependencies are excluded since 1.16.0
  130. 130. http://bit.ly/2vzBqhC
  131. 131. http://amzn.to/2vtUkDU
  132. 132. UNDERSTAND COLDSTARTS
  133. 133. Amazon X-Ray 1st invocation 2nd invocation cold start
  134. 134. source: http://bit.ly/2oBEbw2
  135. 135. http://bit.ly/2tb7bLJ
  136. 136. EMBRACE NODE.JS & PYTHON
  137. 137. http://bit.ly/2rtCCBz
  138. 138. C# http://bit.ly/2rtCCBz
  139. 139. Java http://bit.ly/2rtCCBz
  140. 140. NodeJs, Python http://bit.ly/2rtCCBz
  141. 141. what about type safety?
  142. 142. complexity ceiling of a Node.js app complexity
  143. 143. complexity ceiling of a Node.js app complexity referential transparency immutability as default type inference option types union types …
  144. 144. for managing complexity complexity ceiling of a Node.js app complexity referential transparency immutability as default type inference option types union types …
  145. 145. complexity ceiling of a Node.js app complexity complexity ceiling of a Node.js Lambda function
  146. 146. “if you can limit the complexity of your solution, maybe you won’t need the tools for managing that complexity.” - me
  147. 147. AVOID HARD ASSUMPTIONS ABOUT FUNCTION LIFETIME
  148. 148. USE STATE FOR OPTIMISATION
  149. 149. AVOID COLDSTARTS
  150. 150. CloudWatch Event AWS Lambda
  151. 151. CloudWatch Event AWS Lambda ping ping ping ping
  152. 152. CloudWatch Event AWS Lambda ping ping ping ping
  153. 153. CloudWatch Event AWS Lambda ping ping ping ping HEALTH CHECKS?
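
The last few slides combine into one small pattern: keep state at module level as an optimisation only, and let scheduled pings keep a container warm without running business logic. The sketch below is an illustration with hypothetical names. The one AWS-specific detail it relies on is that CloudWatch scheduled events carry `source: "aws.events"`.

```javascript
// Module-level state survives between invocations only while the container
// stays warm, so it is used as a cache and never as a source of truth.
let invocationCount = 0;
let expensiveClient = null;

// Lazily create the expensive resource once per container (stand-in object).
function getClient() {
  if (!expensiveClient) expensiveClient = { createdAt: Date.now() };
  return expensiveClient;
}

// Scheduled CloudWatch Events arrive with source "aws.events".
function isPing(event) {
  return Boolean(event) && event.source === "aws.events";
}

function handler(event, context, callback) {
  invocationCount += 1;
  const coldStart = invocationCount === 1;
  // Pings return early: they exist only to keep the container warm.
  if (isPing(event)) return callback(null, { ping: true, coldStart });
  const client = getClient();
  callback(null, { ping: false, coldStart, clientCreatedAt: client.createdAt });
}
```

The early return is also the natural place to hang a health check, as the last slide hints.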
  154. 154. max 5 mins execution time
  155. 155. USE RECURSION FOR LONG RUNNING TASKS
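
Recursion for long-running tasks can be sketched as: process items until the remaining execution time runs low, then hand the unfinished work to a fresh invocation. `context.getRemainingTimeInMillis()` is part of the Lambda Node.js context object. Everything else here is a hypothetical illustration: `createHandler`, `processItem`, the event shape and `invokeSelf`, which in practice could be an asynchronous invoke of the same function through the AWS SDK.

```javascript
// Build a handler that works through event.items and recurses before timing out.
// invokeSelf(payload) is injected so the recursion mechanism stays swappable.
function createHandler(processItem, invokeSelf, safetyMarginMs = 10000) {
  return function handler(event, context, callback) {
    const items = event.items || [];
    let position = event.position || 0;
    while (position < items.length) {
      if (context.getRemainingTimeInMillis() < safetyMarginMs) {
        // Not enough time left: pass the current position to the next invocation.
        invokeSelf({ items, position });
        return callback(null, { done: false, position });
      }
      processItem(items[position]);
      position += 1;
    }
    callback(null, { done: true, position });
  };
}
```

Passing the position in the payload keeps the function free of assumptions about its own lifetime, since each invocation starts from explicit state.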
  156. 156. CONSIDER PARTIAL FAILURES
  157. 157. “AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires…” http://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
  158. 158. should function fail on partial/any failures?
  159. 159. SNS Kinesis SQS after 3 attempts share processing logic events are processed in chronological order failed events are retried out of sequence
  160. 160. PROCESS SQS WITH RECURSIVE FUNCTIONS
  161. 161. http://bit.ly/2npomX6
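
For the partial-failure question, one option is to process every record in the batch, collect the ones that fail, and decide afterwards whether to fail the invocation or send the failures elsewhere for retry. The sketch below assumes JSON payloads. `processBatch` and `processOne` are hypothetical names. The `kinesis.data` base64 field is the standard shape of Kinesis records delivered to Lambda.

```javascript
// Process each Kinesis record independently so one bad record does not
// force the whole batch to be retried.
function processBatch(records, processOne) {
  const failed = [];
  let succeeded = 0;
  for (const record of records) {
    try {
      const json = Buffer.from(record.kinesis.data, "base64").toString("utf8");
      processOne(JSON.parse(json));
      succeeded += 1;
    } catch (error) {
      // Keep the original record so it can be retried or inspected later.
      failed.push({ record, error: error.message });
    }
  }
  return { succeeded, failed };
}
```

The caller can then forward `failed` to a retry queue, accepting that those events are retried out of sequence, or throw to make Lambda retry the whole batch in order.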
  162. 162. AVOID HOT KINESIS STREAMS
  163. 163. “Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second.” http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
  164. 164. “If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive.” http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
  165. 165. when no. of processors goes up…
  166. 166. ReadProvisionedThroughputExceeded can have too many Kinesis read operations…
  167. 167. ReadRecords.IteratorAge unpredictable spikes in read ‘latency’…
  168. 168. can kinda workaround…
  169. 169. http://bit.ly/2uv5LsH
  170. 170. clever but costly
  171. 171. “for subsystems that don’t have to be realtime, or are task-based (ie. order doesn’t matter), consider other triggers such as S3 or SNS.” - me
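
The arithmetic behind a hot stream is simple enough to write down. The limit of 5 read transactions per second per shard comes from the quoted Kinesis documentation. The polling rate of one read per second per subscribing function is an assumption made for illustration, and both function names are hypothetical.

```javascript
// Read transactions hitting each shard every second when several functions
// subscribe to the same stream and each polls every shard.
function readsPerShardPerSecond(subscriberCount, pollsPerSecond) {
  return subscriberCount * pollsPerSecond;
}

// Largest number of subscribers that stays within the per-shard read limit.
function maxSubscribers(readLimitPerShard, pollsPerSecond) {
  return Math.floor(readLimitPerShard / pollsPerSecond);
}

const SHARD_READ_LIMIT = 5; // from the quoted Kinesis limit
const ASSUMED_POLLS_PER_SECOND = 1; // assumption, not taken from the slides

console.log(maxSubscribers(SHARD_READ_LIMIT, ASSUMED_POLLS_PER_SECOND));
console.log(readsPerShardPerSecond(8, ASSUMED_POLLS_PER_SECOND) > SHARD_READ_LIMIT);
```

Under that assumption, adding shards does not help, because every subscriber still reads from every shard.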
  172. 172. @theburningmonk theburningmonk.com github.com/theburningmonk
  173. 173. sign up here: http://bit.ly/2xCwJEe
