Serverless in production (O'Reilly Software Architecture)

AWS Lambda has changed the way we deploy and run software, but the serverless paradigm has created new challenges around old problems: How do you test a cloud-hosted function locally? How do you monitor it? What about logging and config management? And how do you start migrating from an existing architecture?

Yan Cui shares solutions to these challenges, drawing on his experience running Lambda in production and migrating from an existing monolithic architecture.

Published in: Technology

  1. 1. Serverless in production an experience report Yan Cui
  2. 2. What’s in this talk? • how to responsibly run a serverless architecture (aka. how to do ops in serverless) • testing, CI/CD • logging, distributed tracing, monitoring • config management, securing secrets • coldstarts • gotchas/limitations + workarounds/hacks
  3. 3. hi, I’m Yan Cui
  4. 4. hi, I’m Yan Cui AWS user since 2009
  5. 5. apr, 2016
  6. 6. Before • hidden complexities and dependencies • low utilisation to leave headroom for large spikes • EC2 scaling is slow, so scale earlier • paying for lots of unused resources • up to 30 mins to deploy • deployments required downtime
  7. 7. - Dan North “lead time to someone saying thank you is the only reputation metric that matters.”
  8. 8. “what would good look like for us?”
  9. 9. Deployments should… • be small • be fast • have zero downtime • require no lock-step
  10. 10. Features should… • be independently deployable • be loosely-coupled
  11. 11. We want to… • minimise cost of unused resources • minimise ops effort • reduce technical mess • deliver visible improvements to users faster
  12. 12. nov, 2016
  13. 13. 170 Lambda functions in prod 1.2 GB deployment packages in prod 95% cost saving vs EC2 15x no. of prod releases per month
  14. 14. time is a good fit
  15. 15. 1st function in prod! time is a good fit
  16. 16. ? time is a good fit 1st function in prod!
  17. 17. Principles (what is good?) Practices (how to make it good?) Tools (with what?)
  18. 18. Principles outlast Tools
  19. 19. ALERTING CI / CD TESTING LOGGING MONITORING
  20. 20. 170 functions WOOF! ? ? time is a good fit 1st function in prod!
  21. 21. CONFIG MANAGEMENT SECURITY DISTRIBUTED TRACING
  22. 22. evolving the platform
  23. 23. building a better search experience
  24. 24. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch
  25. 25. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  26. 26. building an analytics pipeline
  27. 27. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery
  28. 28. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery 1 developer, 2 days from design to production (his 1st serverless project)
  29. 29. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery “thank you, nothing ever got done this fast at Skype!”
  30. 30. - Dan North “lead time to someone saying thank you is the only reputation metric that matters.”
  31. 31. rebuilding the timeline feature
  32. 32. building better user recommendations
  33. 33. BigQuery
  34. 34. BigQuery
  35. 35. grapheneDB BigQuery
  36. 36. grapheneDB BigQuery
  37. 37. grapheneDB BigQuery
  38. 38. getting PRODUCTION READY
  39. 39. CHOOSE A FRAMEWORK DEPLOYMENT
  40. 40. http://serverless.com
  41. 41. https://github.com/awslabs/serverless-application-model
  42. 42. http://apex.run
  43. 43. https://apex.github.io/up
  44. 44. https://github.com/claudiajs/claudia
  45. 45. https://github.com/Miserlou/Zappa
  46. 46. http://gosparta.io/
  47. 47. TESTING
  48. 48. amzn.to/29Lxuzu
  49. 49. Level of Testing 1.Unit do our objects do the right thing? are they easy to work with?
  50. 50. 1.Unit 2.Integration does our code work against code we can’t change? Level of Testing
  51. 51. handler
  52. 52. handler test by invoking the handler
  53. 53. Level of Testing 1.Unit 2.Integration 3.Acceptance does the whole system work?
  54. 54. Level of Testing unit integration acceptance feedback confidence
  55. 55. “…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise. The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…” Don’t Mock Types You Can’t Change
  56. 56. “…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do… Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…” Don’t Mock Types You Can’t Change
  57. 57. Services Don’t Mock Types You Can’t Change
  58. 58. Paul Johnston The serverless approach to testing is different and may actually be easier. http://bit.ly/2t5viwK
  59. 59. Lambda API Gateway DynamoDB
  60. 60. Lambda API Gateway DynamoDB Unit Tests
  61. 61. Lambda API Gateway DynamoDB Unit Tests Mock/Stub
  62. 62. Lambda API Gateway DynamoDB what unit tests will not tell you… is our request correct? is the request mapping set up correctly? are the API resources configured correctly? are we assuming the correct schema? is the Lambda proxy configured correctly? is the IAM policy set up correctly? is the table created?
  63. 63. observation: most Lambda functions are simple and have a single purpose, so the risk of shipping broken software has largely shifted to how they integrate with external services
  64. 64. But it slows down my feedback loop… IT’S NOT ABOUT YOU!
  65. 65. me test your system, not (just) your code
  66. 66. API Gateway IOT Kinesis SNS ElastiCache CloudWatch DynamoDB IAM S3 Auth0 GrapheneDB SES Twilio Google BigQuery MongoLab CloudSearch APN GCM Lambda EC2
  67. 67. …if a service can’t provide you with a relatively easy way to test the interface in reality, then you should consider using another one. Paul Johnston
  68. 68. “…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code. An end-to-end test interacts with the system only from the outside: through its interface…” Testing End-to-End
  69. 69. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  70. 70. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input
  71. 71. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input Validate
  72. 72. integration tests exercise system’s Integration with its external dependencies
  73. 73. acceptance tests exercise system End-to-End from the outside
  74. 74. integration tests differ from acceptance tests only in HOW the Lambda functions are invoked observation
  75. 75. CI/CD PIPELINE
  76. 76. “…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed… This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…” Testing End-to-End
  77. 77. me Deployment scripts that only live on the CI box are a disaster waiting to happen.
  78. 78. Jenkins build config deploys and tests unit + integration tests deploy acceptance tests
  79. 79. if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then STAGE=$2 REGION=$3 PROFILE=$4 npm install AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION …
  80. 80. if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then STAGE=$2 REGION=$3 PROFILE=$4 npm install AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION … install serverless framework as dev dependency
  81. 81. can be run locally & on the CI box
  82. 82. auto auto manual
  83. 83. LOGGING
  84. 84. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  85. 85. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? UTC Timestamp API Gateway Request Id your log message
  86. 86. function name date function version
  87. 87. me Logs are not easily searchable in CloudWatch Logs.
  88. 88. LOG OVERLOAD
  89. 89. CENTRALISE LOGS
  90. 90. CENTRALISE LOGS MAKE THEM EASILY SEARCHABLE
  91. 91. + + the elk stack
  92. 92. CloudWatch Logs
  93. 93. CloudWatch Logs AWS Lambda ELK stack
  94. 94. CloudWatch Events
  95. 95. CloudWatch Logs
  96. 96. http://bit.ly/2f3zxQG
  97. 97. DISTRIBUTED TRACING
  98. 98. “my followers didn’t receive my new post!” - a user
  99. 99. where could the problem be?
  100. 100. correlation IDs* * eg. request-id, user-id, yubl-id, etc.
  101. 101. ROLL YOUR OWN CLIENTS
  102. 102. kinesis client http client sns client
  103. 103. http://bit.ly/2k93hAj [diagram: correlation IDs flowing between api-a, api-b and api-c over API Gateway, Kinesis and SNS. Each function captures x-correlation-id, x-correlation-xxx, User-Agent and Debug-Log-Enabled from the incoming event into global.CONTEXT, includes them in log.info(…), and forwards them as HTTP headers, as SNS MessageAttributes, or as data.__context on Kinesis records]
  104. 104. ROLL YOUR OWN CLIENTS X-RAY
  105. 105. Amazon X-Ray
  106. 106. Amazon X-Ray
  107. 107. traces do not span over API Gateway
  108. 108. MONITORING + ALERTING
  109. 109. “where do I install monitoring agents?”
  110. 110. you can’t
  111. 111. • invocation Count • error Count • latency • throttling • granular to the minute • support custom metrics
  112. 112. • invocation Count • error Count • latency • throttling • granular to the minute • support custom metrics
  113. 113. Why not IOPipe? • pervasive access to your entire application • adds latency for tracking
  114. 114. me The only “background” processing you get is what the platform provides out of the box.
  115. 115. “how do I batch up and send logs/metrics in the background?”
  116. 116. you can’t (kinda)
  117. 117. console.log("hydrating yubls from db…"); console.log("fetching user info from user-api"); console.log("MONITORING|1489795335|27.4|latency|user-api-latency"); console.log("MONITORING|1489795335|8|count|yubls-served"); logs: the first two lines. metrics: the last two lines, formatted as MONITORING|timestamp|metric value|metric type|metric name
  118. 118. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
  119. 119. CloudWatch Logs
  120. 120. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch memory used memory size billed duration
  121. 121. http://bit.ly/2gGredx
  122. 122. http://bit.ly/2goFZ8F
  123. 123. DASHBOARDS
  124. 124. DASHBOARDS SET ALARMS
  125. 125. DASHBOARDS SET ALARMS TRACK APP-LEVEL METRICS
  126. 126. Not Only CloudWatch
  127. 127. don’t put all your eggs in one basket aka. you don’t want your monitoring system to fail at the same time as the systems it monitors
  128. 128. CONFIG MANAGEMENT
  129. 129. Lambda
  130. 130. me Environment variables make it hard to share configurations across functions.
  131. 131. me Environment variables make it hard to implement fine-grained access to sensitive info.
  132. 132. http://bit.ly/2uQKABA
  133. 133. couples ability to deploy with access to sensitive data, which often don’t overlap in a large engineering team or in a regulated environment
  134. 134. CENTRALISED CONFIG SERVICE
  135. 135. config service goes here
  136. 136. Why not consul or etcd? • multiple EC2 instances in multi-AZ for HA • have to manage servers, patch OS, patch software, etc. • learning curve for configuring the service • learning curve for using the CLI tools
  137. 137. SSM Parameter Store
  138. 138. SSM Parameter Store HTTPS role-based access encrypted in-flight
  139. 139. SSM Parameter Store encrypt role-based access
  140. 140. SSM Parameter Store encrypted at-rest
  141. 141. HTTPS role-based access SSM Parameter Store encrypted in-flight
  142. 142. SSM Parameter Store decrypt role-based access
  143. 143. CENTRALISED CONFIG SERVICE CLIENT LIBRARY
  144. 144. Requirements for client library • standardise and encapsulate how you manage configs • supports client-side caching (fetch & cache at coldstart) • invalidate cache at interval • invalidate cache explicitly when staleness is detected
  145. 145. http://bit.ly/2yLUjwd
  146. 146. PRO TIPS
  147. 147. max 75 GB total deployment package size* * limit is per AWS region
  148. 148. Janitor Monkey
  149. 149. Janitor Lambda http://bit.ly/2xzVu4a
  150. 150. disable versionFunctions in
  151. 151. install Serverless framework as dev dependency at project level dev dependencies are excluded since 1.16.0
  152. 152. http://bit.ly/2vzBqhC
  153. 153. http://amzn.to/2vtUkDU
  154. 154. UNDERSTAND COLDSTARTS
  155. 155. Amazon X-Ray 1st invocation 2nd invocation cold start
  156. 156. source: http://bit.ly/2oBEbw2
  157. 157. http://bit.ly/2rtCCBz
  158. 158. http://bit.ly/2rtCCBz C#
  159. 159. http://bit.ly/2rtCCBz Java
  160. 160. http://bit.ly/2rtCCBz NodeJs, Python
  161. 161. me C# and Java experience ~100 times the cold start time of Python and also suffer from much higher standard deviation
  162. 162. me memory size improves cold start time linearly
  163. 163. AVOID COLDSTARTS
  164. 164. CloudWatch Event AWS Lambda
  165. 165. CloudWatch Event AWS Lambda ping ping ping ping
  166. 166. CloudWatch Event AWS Lambda ping ping ping ping
  167. 167. CloudWatch Event AWS Lambda ping ping ping ping HEALTH CHECKS?
  168. 168. AWS Lambda docs Take advantage of container re-use to improve the performance of your function. Make sure any externalized configuration or dependencies that your code retrieves are stored and referenced locally after initial execution. Limit the re-initialization of variables/objects on every invocation. Instead use static initialization/ constructor, global/static variables and singletons. Keep alive and reuse connections (HTTP, database, etc.) that were established during a previous invocation. http://amzn.to/2jzLmkb
  169. 169. max 5 mins execution time
  170. 170. http://bit.ly/2w6ItdI
  171. 171. CONSIDER PARTIAL FAILURES
  172. 172. AWS Lambda docs AWS Lambda polls your stream and invokes your Lambda function. Therefore, if a Lambda function fails, AWS Lambda attempts to process the erring batch of records until the time the data expires. http://amzn.to/2vs2lIg
  173. 173. vs processing halts until failed events are retried successfully/ expired from stream prioritize realtime-ness, retry failed events with best effort, then skip
  174. 174. SNS Kinesis SQS after 3 attempts share processing logic events are processed in chronological order failed events are retried out of sequence
  175. 175. PROCESS SQS WITH RECURSIVE FUNCTIONS
  176. 176. http://bit.ly/2npomX6
  177. 177. AVOID HOT KINESIS STREAMS
  178. 178. AWS Lambda docs Each shard can support up to 5 transactions per second for reads, up to a maximum total data read rate of 2 MB per second. http://amzn.to/2ubyaot
  179. 179. AWS Lambda docs If your stream has 100 active shards, there will be 100 Lambda functions running concurrently. Then, each Lambda function processes events on a shard in the order that they arrive. http://amzn.to/2ubyaot
  180. 180. when no. of processors goes up…
  181. 181. ReadProvisionedThroughputExceeded can have too many Kinesis read operations…
  182. 182. ReadRecords.IteratorAge unpredictable spikes in read ‘latency’…
  183. 183. can kinda workaround…
  184. 184. http://bit.ly/2uv5LsH
  185. 185. clever, but costly
  186. 186. new tool, new problems but they’re easier to deal with
  187. 187. @theburningmonk theburningmonk.com github.com/theburningmonk
  188. 188. http://bit.ly/2yQZj1H
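
Slides 100 to 103 describe capturing correlation IDs from the incoming event and forwarding them on every outgoing call. The following is a minimal sketch of that idea in plain JavaScript, not the speaker's actual client code. The function names are hypothetical, and falling back to a caller-supplied ID (for example the Lambda request ID) when no x-correlation-id header arrives is an assumption. The { DataType, StringValue } shape follows the SNS message attribute format.

```javascript
// Capture: keep x-correlation-* headers plus the two flags named on slide 103.
// `fallbackId` is used when the caller sent no x-correlation-id (assumption).
function captureContext(headers, fallbackId) {
  const context = {};
  for (const [key, value] of Object.entries(headers || {})) {
    const lower = key.toLowerCase();
    if (lower.startsWith("x-correlation-")) context[lower] = value;
    if (lower === "user-agent") context["User-Agent"] = value;
    if (lower === "debug-log-enabled") context["Debug-Log-Enabled"] = value;
  }
  if (!context["x-correlation-id"]) context["x-correlation-id"] = fallbackId;
  return context;
}

// Forward over SNS: one String message attribute per captured value.
function toSnsMessageAttributes(context) {
  const attributes = {};
  for (const [key, value] of Object.entries(context)) {
    attributes[key] = { DataType: "String", StringValue: String(value) };
  }
  return attributes;
}

// Forward over Kinesis: embed the context in the record payload,
// mirroring the data.__context field shown on slide 103.
function withKinesisContext(data, context) {
  return { ...data, __context: context };
}
```

For outgoing HTTP calls the captured context can be merged into the request headers as-is, since its keys are already header names.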
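
Slides 117 and 118 show metrics written as pipe-delimited log lines and later separated from ordinary logs by a Lambda function that reads CloudWatch Logs. A minimal sketch of that separation step might look like the following. The function names are hypothetical, and the field order (timestamp, value, type, name) is taken from the labels on slide 117.

```javascript
// Parses one line of the form
//   MONITORING|<timestamp>|<metric value>|<metric type>|<metric name>
// Returns null for anything else so the caller can treat it as a normal log.
function parseMetricLine(line) {
  const parts = line.trim().split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  if (timestamp === "" || value === "") return null;
  const parsedTimestamp = Number(timestamp);
  const parsedValue = Number(value);
  if (!Number.isFinite(parsedTimestamp) || !Number.isFinite(parsedValue)) return null;
  return { timestamp: parsedTimestamp, value: parsedValue, type, name };
}

// Splits a batch of log lines into plain logs and parsed metrics.
function splitLogsAndMetrics(lines) {
  const logs = [];
  const metrics = [];
  for (const line of lines) {
    const metric = parseMetricLine(line);
    if (metric) metrics.push(metric);
    else logs.push(line);
  }
  return { logs, metrics };
}
```

Where each half goes next (ELK for logs, CloudWatch for metrics on slide 118) is left to the caller.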
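
Slide 144 lists the requirements for a config client library: cache on cold start, expire the cache at an interval, and allow explicit invalidation. The sketch below illustrates those three behaviours only. It is not the library linked on slide 145; createConfigClient, fetchConfig, ttlMs and now are hypothetical names, and fetchConfig stands in for whatever call loads the values (for example from SSM Parameter Store).

```javascript
// Creates a client that caches the result of `fetchConfig` (an async function
// supplied by the caller) and refetches once `ttlMs` has elapsed.
// `now` is injectable so the expiry logic can be exercised without waiting.
function createConfigClient(fetchConfig, { ttlMs = 3 * 60 * 1000, now = Date.now } = {}) {
  let cached = null;
  let expiresAt = 0;

  async function get() {
    // The first call after a cold start always fetches; later calls reuse the cache.
    if (cached === null || now() >= expiresAt) {
      cached = await fetchConfig();
      expiresAt = now() + ttlMs;
    }
    return cached;
  }

  // Explicit invalidation, e.g. when a downstream call fails with stale credentials.
  function invalidate() {
    cached = null;
  }

  return { get, invalidate };
}
```

Creating the client outside the handler keeps the cache alive across invocations of a warm container. Concurrent first calls may each trigger a fetch; this sketch does not deduplicate them.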
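
Slides 173 and 174 contrast Kinesis's default behaviour (processing halts until the failed batch succeeds or expires) with a best-effort approach: retry each failed event a few times, then skip it and hand it off elsewhere. A minimal sketch of that approach follows, assuming three attempts as on slide 174. processBatch, handle and onGiveUp are hypothetical names; onGiveUp is where a caller might publish the record to SQS or SNS.

```javascript
// Processes records in order. Each record gets up to `maxAttempts` tries;
// records that still fail are passed to `onGiveUp` and skipped so the
// function returns normally and the stream keeps moving.
async function processBatch(records, handle, { maxAttempts = 3, onGiveUp = async () => {} } = {}) {
  const skipped = [];
  for (const record of records) {
    let succeeded = false;
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts && !succeeded; attempt++) {
      try {
        await handle(record);
        succeeded = true;
      } catch (err) {
        lastError = err;
      }
    }
    if (!succeeded) {
      skipped.push(record);
      await onGiveUp(record, lastError);
    }
  }
  return skipped;
}
```

The trade-off is the one stated on slide 174: events are still processed in order on the happy path, but anything retried through the fallback arrives out of sequence.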
