Serverless in Production, an experience report (AWS UG South Wales)

AWS Lambda has changed the way we deploy and run software, but the serverless paradigm brings new challenges to old problems: how do you test a cloud-hosted function locally? How do you monitor it? What about logging and config management? And how do you start migrating from an existing architecture?

In this talk Yan and Scott will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.

Published in: Technology

  1. 1. Serverless in production: an experience report. What you should know before you go to production.
  2. 2. Yan Cui http://theburningmonk.com @theburningmonk Principal Engineer @
  4. 4. “Netflix for sports”; offices in London, Leeds, Katowice and Tokyo
  5. 5. available in Austria, Switzerland, Germany, Japan and Canada; Italy coming soon ;-)
  6. 6. available on 30+ platforms
  7. 7. ~500,000 concurrent viewers
  8. 8. “Netflix for sports”; offices in London, Leeds, Katowice and Tokyo. We’re hiring! Visit engineering.dazn.com to learn more. Follow @dazneng for updates about the engineering team.
  9. 9. apr, 2016
  10. 10. hey guys, vote on this post and I’ll announce a winner at 10PM tonight
  11. 11. 10PM traffic
  12. 12. 10PM traffic 70-100x
  13. 13. low utilisation to leave room for spikes; EC2 scaling is slow, so scale earlier
  14. 14. lots of $$$ for unused resources
  15. 15. up to 30 mins for deployment; deployment required downtime
  16. 16. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  17. 17. “what would good look like for us?”
  18. 18. DEPLOYMENTS SHOULD... be small; be fast; have zero downtime; have no lock-step
  19. 19. FEATURES SHOULD... be deployable independently; be loosely-coupled
  20. 20. WE WANT TO... minimise cost for unused resources; minimise ops effort; reduce tech mess; deliver visible improvements faster
  21. 21. nov, 2016
  22. 22. 170 Lambda functions in prod; 1.2 GB deployment packages in prod; 95% cost saving vs EC2; 15x no. of prod releases per month
  23. 23. [timeline diagram; axis: time] is a good fit
  24. 24. [timeline diagram; axis: time] 1st function in prod!; is a good fit
  25. 25. [timeline diagram; axis: time] ?; is a good fit; 1st function in prod!
  26. 26. ALERTING, CI / CD, TESTING, LOGGING, MONITORING
  27. 27. Principles: what is good? Practices: how to make it good? Tools: with what?
  28. 28. Principles outlast Tools
  29. 29. [timeline diagram; axis: time] 170 functions; ?; ?; is a good fit; 1st function in prod!
  30. 30. SECURITY, DISTRIBUTED TRACING, CONFIG MANAGEMENT
  31. 31. evolving the PLATFORM
  32. 32. rebuilt search
  33. 33. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch
  34. 34. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  35. 35. new analytics pipeline
  36. 36. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery
  37. 37. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery. 1 developer, 2 days from design to production (his 1st serverless project)
  38. 38. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery “nothing ever got done this fast at Skype!” - Chris Twamley
  39. 39. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  40. 40. Rebuilt with Lambda
  42. 42. BigQuery
  44. 44. grapheneDB BigQuery
  47. 47. getting PRODUCTION READY
  48. 48. choose a tried-and-tested deployment framework, don’t invent your own
  49. 49. http://serverless.com
  50. 50. https://github.com/awslabs/serverless-application-model
  51. 51. http://apex.run
  52. 52. https://apex.github.io/up
  53. 53. https://github.com/claudiajs/claudia
  54. 54. https://github.com/Miserlou/Zappa
  55. 55. http://gosparta.io/
  56. 56. TESTING
  57. 57. amzn.to/29Lxuzu
  58. 58. Level of Testing: 1. Unit: do our objects do the right thing? are they easy to work with?
  59. 59. Level of Testing: 1. Unit; 2. Integration: does our code work against code we can’t change?
  60. 60. handler
  61. 61. handler test by invoking the handler
  62. 62. Level of Testing: 1. Unit; 2. Integration; 3. Acceptance: does the whole system work?
  63. 63. Level of Testing: unit, integration, acceptance (diagram axes: feedback, confidence)
  64. 64. “…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise. The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…” (Don’t Mock Types You Can’t Change)
  65. 65. “…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do… Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…” (Don’t Mock Types You Can’t Change)
  66. 66. Don’t Mock Types (Services) You Can’t Change
  67. 67. “The serverless approach to testing is different and may actually be easier.” - Paul Johnston http://bit.ly/2t5viwK
  68. 68. Lambda, API Gateway, DynamoDB
  69. 69. Lambda, API Gateway, DynamoDB; Unit Tests
  70. 70. Lambda, API Gateway, DynamoDB; Unit Tests; Mock/Stub
  71. 71. what unit tests will not tell you… (Lambda, API Gateway, DynamoDB) is our request correct? is the request mapping set up correctly? are the API resources configured correctly? are we assuming the correct schema? is Lambda proxy configured correctly? is the IAM policy set up correctly? is the table created?
  72. 72. observation: most Lambda functions are simple and have a single purpose; the risk of shipping broken software has largely shifted to how they integrate with external services
  73. 73. optimize towards shipping working software, even if it means slowing down your feedback loop…
  74. 74. “…if a service can’t provide you with a relatively easy way to test the interface in reality, then you should consider using another one.” - Paul Johnston
  75. 75. “…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code. An end-to-end test interacts with the system only from the outside: through its interface…” (Testing End-to-End)
  76. 76. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  77. 77. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda; Test Input
  78. 78. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda; Test Input; Validate
  79. 79. integration tests exercise the system’s Integration with its external dependencies (diagram: my code)
  80. 80. acceptance tests exercise the system End-to-End from the outside (diagram: my code)
  81. 81. observation: integration tests differ from acceptance tests only in HOW the Lambda functions are invoked
  82. 82. CI + CD PIPELINE
  83. 83. “the earlier you consider CI/CD, the more time you save in the long run” - me
  84. 84. “…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed… This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…” (Testing End-to-End)
  85. 85. “deployment scripts that only live on the CI box are a disaster waiting to happen…” - me
  86. 86. Jenkins build config deploys and tests: unit + integration tests, deploy, acceptance tests
  87. 87. if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION
          elif [ "$1" = "int-test" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE npm run int-$STAGE
          elif [ "$1" = "acceptance-test" ] && [ $# -eq 4 ]; then
            STAGE=$2
            REGION=$3
            PROFILE=$4
            npm install
            AWS_PROFILE=$PROFILE npm run acceptance-$STAGE
          else
            usage
            exit 1
          fi
  88. 88. (same script as slide 87, annotated) install Serverless framework as dev dependency
  89. 89. (same script as slide 87, annotated) install Serverless framework as dev dependency; mitigate version conflicts
  90. 90. http://alistair.cockburn.us/Hexagonal+architecture
  91. 91. build.sh allows repeatable builds on both local & CI
  92. 92. Auto Auto Manual
  93. 93. LOGGING
  94. 94. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  95. 95. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (fields: UTC Timestamp, API Gateway Request Id, your log message)
  96. 96. “Logs are not easily searchable in CloudWatch Logs.” - me
  97. 97. CloudWatch Logs
  98. 98. AWS Lambda writes to stdout, which goes to CloudWatch Logs asynchronously; CloudWatch Logs invokes another AWS Lambda function, which ships the logs to any log aggregation service
  101. 101. CloudWatch Events
  102. 102. DISTRIBUTED TRACING
  103. 103. “my followers didn’t receive my new post!” - a user
  104. 104. where could the problem be?
  105. 105. correlation IDs* (* e.g. request-id, user-id, yubl-id, etc.)
  106. 106. wrap HTTP client & AWS SDK clients to forward captured correlation IDs
  107. 107. kinesis client, http client, sns client
  108. 108. use X-Ray for performance tracing
  109. 109. Amazon X-Ray
  111. 111. X-Ray traces do not span API Gateway or async event sources
  112. 112. MONITORING + ALERTING
  113. 113. no place to install agents/daemons
  114. 114. • invocation Count • error Count • latency • throttling • granular to the minute • support custom metrics
  115. 115. • same metrics as CW • better dashboard • support custom metrics https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
  116. 116. my code
  118. 118. my code internet internet press button something happens
  119. 119. the extra 10-20 ms spent sending custom metrics compound when you have microservices and multiple APIs are called within a single user event
  120. 120. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  121. 121. no more background processing, other than what the platform provides
  122. 122. logs:
            console.log("hydrating yubls from db…");
            console.log("fetching user info from user-api");
          metrics:
            console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
            console.log("MONITORING|1489795335|8|count|yubls-served");
          metric line format: MONITORING|timestamp|metric value|metric type|metric name
  123. 123. CloudWatch Logs invokes AWS Lambda, which sends logs to the ELK stack and metrics to CloudWatch
  124. 124. don’t forget to set up dashboards & CW alarms
  125. 125. CONFIG MANAGEMENT
  126. 126. design for easy & quick propagation of config changes
  127. 127. “Environment variables make it hard to share configurations across functions.” - me
  128. 128. “Environment variables make it hard to implement fine-grained access to sensitive info.” - me
  129. 129. config service goes here
  130. 130. SSM Parameter Store
  131. 131. sensitive data should be encrypted in-flight, and at-rest
  132. 132. enforce role-based access to sensitive configuration values
  133. 133. SSM Parameter Store: HTTPS, role-based access, encrypted in-flight
  134. 134. SSM Parameter Store: encrypt, role-based access
  135. 135. SSM Parameter Store: encrypted at-rest
  136. 136. SSM Parameter Store: HTTPS, role-based access, encrypted in-flight
  137. 137. invest in a robust client library
  138. 138. fetch & cache at cold-start
  139. 139. invalidate at interval & weak signals
  140. 140. max 75 GB total deployment package size* (* limit is per AWS region)
  141. 141. Janitor Monkey
  142. 142. Janitor Lambda http://bit.ly/2xzVu4a
  143. 143. disable versionFunctions in
  144. 144. install Serverless framework as dev dependency at project level; dev dependencies are excluded since 1.16.0
  145. 145. http://bit.ly/2vzBqhC
  146. 146. http://amzn.to/2vtUkDU
  147. 147. UNDERSTAND COLDSTARTS
  148. 148. Amazon X-Ray 1st invocation 2nd invocation cold start
  149. 149. http://bit.ly/2lNInES
  151. 151. http://bit.ly/2rtCCBz
  152. 152. C# http://bit.ly/2rtCCBz
  153. 153. Java http://bit.ly/2rtCCBz
  154. 154. NodeJs, Python http://bit.ly/2rtCCBz
  155. 155. EMBRACE Node.js, Python, or Golang
  156. 156. CloudWatch Event AWS Lambda
  157. 157. CloudWatch Event AWS Lambda ping ping ping ping
  159. 159. CloudWatch Event AWS Lambda ping ping ping ping HEALTH CHECKS?
  160. 160. https://github.com/FidelLimited/serverless-plugin-warmup
  161. 161. INEFFECTIVE when you have many concurrent executions
  163. 163. API Gateway and Kinesis; Authentication & authorisation (IAM, Cognito); Testing; Running & Debugging functions locally; Log aggregation; Monitoring & Alerting; X-Ray; Correlation IDs; CI/CD; Performance and Cost optimisation; Error Handling; Configuration management; VPC; Security; Leading practices (API Gateway, Kinesis, Lambda); Canary deployments. http://bit.ly/production-ready-serverless get 40% off with: ytcui
  164. 164. @theburningmonk theburningmonk.com github.com/theburningmonk
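
Slides 105 to 107 suggest capturing correlation IDs and forwarding them through wrapped HTTP and AWS SDK clients. A minimal Node.js sketch of that idea follows. The `x-correlation-` header prefix and the helper names `captureCorrelationIds` and `withCorrelationIds` are assumptions made for illustration; the slides do not specify them.

```javascript
// Hypothetical prefix used to recognise correlation IDs in incoming headers.
const PREFIX = 'x-correlation-';

// Collect correlation IDs from an API Gateway style event.
// Falls back to the Lambda request ID when the caller sent none.
function captureCorrelationIds(event, context) {
  const ids = {};
  const headers = (event && event.headers) || {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith(PREFIX)) {
      ids[key] = value;
    }
  }
  if (!ids[`${PREFIX}id`]) {
    ids[`${PREFIX}id`] = context.awsRequestId;
  }
  return ids;
}

// Wrap any "send request" function so that every outgoing request
// carries the captured IDs; headers set explicitly on the request win.
function withCorrelationIds(send, ids) {
  return (request) =>
    send({ ...request, headers: { ...ids, ...(request.headers || {}) } });
}

// Usage with a stand-in for a real HTTP client.
const ids = captureCorrelationIds(
  { headers: { 'X-Correlation-User-Id': 'user-42' } },
  { awsRequestId: 'req-123' }
);
const fakeSend = (request) => request; // echoes what would be sent
const send = withCorrelationIds(fakeSend, ids);
const sent = send({
  url: 'https://example.com/users',
  headers: { accept: 'application/json' },
});
console.log(sent.headers);
```

The same wrapper shape could be applied to Kinesis or SNS publishing by attaching the IDs to the record payload or message attributes instead of HTTP headers.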

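Slide 122 writes metrics to stdout as pipe-delimited log lines, and slide 123 has a Lambda function separate them from ordinary logs. A minimal sketch of that parsing step follows, assuming the five-field layout shown on the slide (MONITORING|timestamp|metric value|metric type|metric name). The function names `parseMetric` and `partition` and the returned object shape are illustrative assumptions.

```javascript
// Parse one log message; returns a metric object, or null for ordinary logs.
function parseMetric(message) {
  const parts = message.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') {
    return null;
  }
  const [, timestamp, value, type, name] = parts;
  const metric = {
    timestamp: Number(timestamp),
    value: Number(value),
    type,
    name,
  };
  // Treat lines with non-numeric fields as ordinary logs rather than metrics.
  if (Number.isNaN(metric.timestamp) || Number.isNaN(metric.value)) {
    return null;
  }
  return metric;
}

// Split a batch of log messages into metrics and plain logs.
function partition(messages) {
  const metrics = [];
  const logs = [];
  for (const message of messages) {
    const metric = parseMetric(message);
    if (metric) {
      metrics.push(metric);
    } else {
      logs.push(message);
    }
  }
  return { metrics, logs };
}

const result = partition([
  'fetching user info from user-api',
  'MONITORING|1489795335|27.4|latency|user-api-latency',
  'MONITORING|1489795335|8|count|yubls-served',
]);
console.log(result.metrics);
```

Because the metrics are extracted after the invocation has finished, the function itself pays only for a `console.log` call, which is the point slide 119 makes about avoiding extra latency.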
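
Slides 137 to 139 recommend a client library that fetches configuration at cold start, caches it, and invalidates the cache at an interval. A minimal sketch follows, assuming a caller-supplied async `fetchConfig` function (for example, one that reads from SSM Parameter Store). The names `createConfigClient`, `ttlMs` and `now` are hypothetical, and the "weak signals" invalidation from slide 139 is not covered.

```javascript
// Create a config client that caches values and refreshes them after ttlMs.
// fetchConfig: async function returning the latest config object.
// now: clock function, injectable so expiry can be exercised without waiting.
function createConfigClient(fetchConfig, ttlMs, now = Date.now) {
  let cached = null;
  let expiresAt = 0;

  return {
    async get() {
      if (cached === null || now() >= expiresAt) {
        try {
          cached = await fetchConfig();
          expiresAt = now() + ttlMs;
        } catch (err) {
          // Keep serving stale config if a refresh fails;
          // rethrow only when nothing has been cached yet.
          if (cached === null) {
            throw err;
          }
        }
      }
      return cached;
    },
  };
}

// Usage with a fake fetcher and a fake clock.
let fetches = 0;
let clock = 0;
const client = createConfigClient(
  async () => ({ version: ++fetches }),
  60000,
  () => clock
);

async function demo() {
  const first = await client.get();  // cold start: fetches
  const second = await client.get(); // within the interval: served from cache
  clock += 60000;                    // simulate the interval elapsing
  const third = await client.get();  // expired: fetches again
  return [first.version, second.version, third.version];
}

const versionsPromise = demo();
versionsPromise.then((versions) => console.log(versions));
```

Declaring the client outside the handler lets the cache survive between invocations of a warm container, so only cold starts and expired entries pay for a fetch.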