Serverless in production, an experience report (linuxing in london)

AWS Lambda has changed the way we deploy and run software, but this new serverless paradigm has created new challenges to old problems - how do you test a cloud-hosted function locally? How do you monitor them? What about logging and config management? And how do we start migrating from existing architectures?

In this talk Yan and Scott will discuss solutions to these challenges by drawing from real-world experience running Lambda in production and migrating from an existing monolithic architecture.

Published in: Technology

  1. 1. Serverless in production: an experience report. What you should know before you go to production
  2. 2. Yan Cui http://theburningmonk.com @theburningmonk AWS user since 2009
  3. 3. Yan Cui http://theburningmonk.com @theburningmonk
  4. 4. Scott Smethurst
  5. 5. apr, 2016
  6. 6. hey guys, vote on this post and I’ll announce a winner at 10PM tonight
  7. 7. 10PM traffic
  8. 8. 10PM traffic 70-100x
  9. 9. low utilisation to leave room for spikes EC2 scaling is slow, so scale earlier
  10. 10. lots of $$$ for unused resources
  11. 11. up to 30 mins for deployment deployment required downtime
  12. 12. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  13. 13. “what would good look like for us?”
  14. 14. DEPLOYMENTS SHOULD... be small; be fast; have zero downtime; have no lock-step
  15. 15. FEATURES SHOULD... be deployable independently be loosely-coupled
  16. 16. WE WANT TO... minimise cost for unused resources minimise ops effort reduce tech mess deliver visible improvements faster
  17. 17. nov, 2016
  18. 18. 170 Lambda functions in prod 1.2 GB deployment packages in prod 95% cost saving vs EC2 15x no. of prod releases per month
  19. 19. time is a good fit
  20. 20. 1st function in prod! time is a good fit
  21. 21. ? time is a good fit 1st function in prod!
  22. 22. ALERTING CI / CD TESTING LOGGING MONITORING
  23. 23. Principles (what is good?), Practices (how to make it good?), Tools (with what?)
  24. 24. Principles outlast Tools
  25. 25. 170 functions ? ? time is a good fit 1st function in prod!
  26. 26. SECURITY DISTRIBUTED TRACING CONFIG MANAGEMENT
  27. 27. evolving the PLATFORM
  28. 28. rebuilt search
  29. 29. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch
  30. 30. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  31. 31. new analytics pipeline
  32. 32. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery
  33. 33. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery 1 developer, 2 days from design to production (his 1st serverless project)
  34. 34. Legacy Monolith Amazon Kinesis Amazon Lambda Google BigQuery “nothing ever got done this fast at Skype!” - Chris Twamley
  35. 35. “lead time to someone saying thank you is the only reputation metric that matters.” - Dan North
  36. 36. Rebuilt with Lambda
  37. 37. Rebuilt with Lambda
  38. 38. BigQuery
  39. 39. BigQuery
  40. 40. grapheneDB BigQuery
  41. 41. grapheneDB BigQuery
  42. 42. grapheneDB BigQuery
  43. 43. getting PRODUCTION READY
  44. 44. choose a tried-and-tested deployment framework, don’t invent your own
  45. 45. http://serverless.com
  46. 46. https://github.com/awslabs/serverless-application-model
  47. 47. http://apex.run
  48. 48. https://apex.github.io/up
  49. 49. https://github.com/claudiajs/claudia
  50. 50. https://github.com/Miserlou/Zappa
  51. 51. http://gosparta.io/
  52. 52. TESTING
  53. 53. amzn.to/29Lxuzu
  54. 54. Level of Testing 1.Unit do our objects do the right thing? are they easy to work with?
  55. 55. Level of Testing 1.Unit 2.Integration does our code work against code we can’t change?
  56. 56. handler
  57. 57. handler test by invoking the handler
  58. 58. Level of Testing 1.Unit 2.Integration 3.Acceptance does the whole system work?
  59. 59. Level of Testing unit integration acceptance feedback confidence
  60. 60. “…We find that tests that mock external libraries often need to be complex to get the code into the right state for the functionality we need to exercise. The mess in such tests is telling us that the design isn’t right but, instead of fixing the problem by improving the code, we have to carry the extra complexity in both code and test…” Don’t Mock Types You Can’t Change
  61. 61. “…The second risk is that we have to be sure that the behaviour we stub or mock matches what the external library will actually do… Even if we get it right once, we have to make sure that the tests remain valid when we upgrade the libraries…” Don’t Mock Types You Can’t Change
  62. 62. Don’t Mock Services You Can’t Change (adapted from “Don’t Mock Types You Can’t Change”)
  63. 63. Paul Johnston The serverless approach to testing is different and may actually be easier. http://bit.ly/2t5viwK
  64. 64. Lambda, API Gateway, DynamoDB
  65. 65. Lambda, API Gateway, DynamoDB: Unit Tests
  66. 66. Lambda, API Gateway, DynamoDB: Unit Tests, Mock/Stub
  67. 67. what unit tests will not tell you… (Lambda, API Gateway, DynamoDB) is our request correct? is the request mapping set up correctly? are the API resources configured correctly? are we assuming the correct schema? is Lambda proxy configured correctly? is the IAM policy set up correctly? is the table created?
  68. 68. observation: most Lambda functions are simple and have a single purpose, so the risk of shipping broken software has largely shifted to how they integrate with external services
  69. 69. optimize towards shipping working software, even if it means slowing down your feedback loop…
  70. 70. “…Wherever possible, an acceptance test should exercise the system end-to-end without directly calling its internal code. An end-to-end test interacts with the system only from the outside: through its interface…” Testing End-to-End
  71. 71. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda
  72. 72. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input
  73. 73. Legacy Monolith Amazon Kinesis Amazon Lambda Amazon CloudSearch Amazon API Gateway Amazon Lambda Test Input Validate
  74. 74. integration tests exercise system’s Integration with its external dependencies my code
  75. 75. acceptance tests exercise system End-to-End from the outside my code
  76. 76. observation: integration tests differ from acceptance tests only in HOW the Lambda functions are invoked
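The observation on slide 76 can be sketched as a single test helper that invokes the same function in two ways. This is a minimal sketch, not the speakers' actual test code: the `handler`, the `/hello` path, the `TEST_MODE` and `TEST_ROOT_URL` environment variables, and the helper name `whenWeInvokeHello` are all hypothetical, and the HTTP branch assumes Node 18+ for the global `fetch`.

```javascript
// Hypothetical handler, returning a Lambda-proxy style response.
const handler = async (event) => {
  const name = (event.queryStringParameters || {}).name || "world";
  return { statusCode: 200, body: JSON.stringify({ message: `hello ${name}` }) };
};

// Integration test mode: call the handler function directly, in-process.
const viaHandler = async (query) => {
  const res = await handler({ queryStringParameters: query });
  return { statusCode: res.statusCode, body: JSON.parse(res.body) };
};

// Acceptance test mode: go through the deployed HTTP endpoint from the outside.
// TEST_ROOT_URL is a hypothetical variable holding the deployed API's base URL.
const viaHttp = async (query) => {
  const url = `${process.env.TEST_ROOT_URL}/hello?${new URLSearchParams(query)}`;
  const res = await fetch(url);
  return { statusCode: res.status, body: await res.json() };
};

// The test cases stay the same; only the invocation path changes.
const whenWeInvokeHello = (query, mode = process.env.TEST_MODE || "handler") =>
  mode === "http" ? viaHttp(query) : viaHandler(query);
```

With a helper like this, the same assertions can run before deployment (mode `handler`) and again after deployment (mode `http`).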
  77. 77. CI + CD PIPELINE
  78. 78. me the earlier you consider CI/CD the more time you save in the long run
  79. 79. “…We prefer to have the end-to-end tests exercise both the system and the process by which it’s built and deployed… This sounds like a lot of effort (it is), but has to be done anyway repeatedly during the software’s lifetime…” Testing End-to-End
  80. 80. me: deployment scripts that only live on the CI box are a disaster waiting to happen…
  81. 81. Jenkins build config deploys and tests unit + integration tests deploy acceptance tests
  82. 82.
        if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
          STAGE=$2
          REGION=$3
          PROFILE=$4

          npm install
          AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION
        elif [ "$1" = "int-test" ] && [ $# -eq 4 ]; then
          STAGE=$2
          REGION=$3
          PROFILE=$4

          npm install
          AWS_PROFILE=$PROFILE npm run int-$STAGE
        elif [ "$1" = "acceptance-test" ] && [ $# -eq 4 ]; then
          STAGE=$2
          REGION=$3
          PROFILE=$4

          npm install
          AWS_PROFILE=$PROFILE npm run acceptance-$STAGE
        else
          # usage is a function defined elsewhere in the script (not shown on the slide)
          usage
          exit 1
        fi
  83. 83. (same script as slide 82) install Serverless framework as dev dependency
  84. 84. (same script as slide 82) install Serverless framework as dev dependency; mitigate version conflicts
  85. 85. http://alistair.cockburn.us/Hexagonal+architecture
  86. 86. build.sh allows repeatable builds on both local & CI
  87. 87. Auto Auto Manual
  88. 88. LOGGING
  89. 89. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
  90. 90. 2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now? (fields, in order: UTC Timestamp, API Gateway Request Id, your log message)
  91. 91. Yan Logs are not easily searchable in CloudWatch Logs.
  92. 92. CloudWatch Logs
  93. 93. CloudWatch Logs AWS Lambda ELK stack
  94. 94.
  95. 95. CloudWatch Events
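The CloudWatch Logs → Lambda → ELK flow from the logging slides can be sketched as below. CloudWatch Logs delivers subscription data to Lambda as a base64-encoded, gzipped JSON document under `event.awslogs.data`; the rest is an assumption for illustration: the function names are hypothetical, the regular expression assumes the three-field line layout shown on slide 90, and the step that actually ships documents to the ELK stack is left out.

```javascript
const zlib = require("zlib");

// Decode the payload CloudWatch Logs sends to a subscribed Lambda function.
const decodeLogsEvent = (event) => {
  const compressed = Buffer.from(event.awslogs.data, "base64");
  return JSON.parse(zlib.gunzipSync(compressed).toString("utf8"));
};

// Split a log line into timestamp, request ID and message (layout from slide 90).
// Lines that do not match (e.g. platform-generated lines) are passed through as-is.
const LOG_LINE = /^(\d{4}-\d{2}-\d{2}T[\d:.]+Z)\s+([0-9a-f-]{36})\s+([\s\S]*)$/;
const parseLogMessage = (message) => {
  const match = LOG_LINE.exec(message.trim());
  if (!match) return { message: message.trim() };
  const [, timestamp, requestId, text] = match;
  return { timestamp, requestId, message: text };
};

// Turn one subscription event into searchable documents.
const toDocuments = (event) => {
  const { logGroup, logStream, logEvents } = decodeLogsEvent(event);
  return logEvents.map((e) => ({ logGroup, logStream, ...parseLogMessage(e.message) }));
};
```

A handler built on this would call `toDocuments(event)` and forward the result to whatever log store is in use.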
  96. 96. DISTRIBUTED TRACING
  97. 97. a user my followers didn’t receive my new post!
  98. 98. where could the problem be?
  99. 99. correlation IDs* * eg. request-id, user-id, yubl-id, etc.
  100. 100. wrap HTTP client & AWS SDK clients to forward captured correlation IDs
  101. 101. kinesis client http client sns client
  102. 102. use X-Ray for performance tracing
  103. 103. Amazon X-Ray
  104. 104. Amazon X-Ray
  105. 105. X-Ray traces do not span over API Gateway, or async event sources
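The correlation ID idea from slides 99 to 101 can be sketched as follows: capture IDs from the incoming request, then wrap outgoing clients so the IDs are forwarded automatically. This is a sketch under stated assumptions, not the speakers' library: the `x-correlation-` header prefix, the function names, and the shape of the `send` function are hypothetical. `context.awsRequestId` is the request ID that Lambda provides on the context object.

```javascript
// Hypothetical naming convention for correlation headers.
const PREFIX = "x-correlation-";

// One Lambda container handles one invocation at a time, so module-level
// state is used here to hold the IDs captured for the current invocation.
let current = {};

// Capture correlation IDs from an API Gateway event's headers.
// Falls back to the Lambda request ID when the caller sent none.
const captureCorrelationIds = (event, context) => {
  const ids = {};
  for (const [key, value] of Object.entries(event.headers || {})) {
    const lower = key.toLowerCase();
    if (lower.startsWith(PREFIX)) ids[lower] = value;
  }
  if (!ids[`${PREFIX}id`]) ids[`${PREFIX}id`] = context.awsRequestId;
  current = ids;
  return ids;
};

// Wrap any "send a request" function so captured IDs travel with the call.
// Headers set explicitly by the caller take precedence.
const withCorrelationIds = (send) => (request) =>
  send({ ...request, headers: { ...current, ...(request.headers || {}) } });
```

The same wrapping approach could be applied to Kinesis or SNS clients by placing the IDs in the record payload or message attributes instead of HTTP headers.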
  106. 106. MONITORING + ALERTING
  107. 107. no place to install agents/daemons
  108. 108. • invocation Count • error Count • latency • throttling • granular to the minute • support custom metrics
  109. 109. • same metrics as CW • better dashboard • support custom metrics https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
  110. 110. my code
  111. 111. my code
  112. 112. my code internet internet press button something happens
  113. 113. those extra 10-20ms for sending custom metrics would compound when you have microservices and multiple APIs are called within one slice of user event
  114. 114. Amazon found every 100ms of latency cost them 1% in sales. http://bit.ly/2EXPfbA
  115. 115. no more background processing, other than what the platform provides
  116. 116.
        console.log("hydrating yubls from db…");
        console.log("fetching user info from user-api");
        console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
        console.log("MONITORING|1489795335|8|count|yubls-served");
        (the first two lines are logs, the last two are metrics; metric format: MONITORING|timestamp|metric value|metric type|metric name)
  117. 117. CloudWatch Logs AWS Lambda ELK stack logs metrics CloudWatch
  118. 118. don’t forget to set up dashboards & CW alarms
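The metrics-as-log-lines format from slide 116 (`MONITORING|timestamp|metric value|metric type|metric name`) can be sketched with a formatter for the function side and a parser for the log-processing side. The function names are hypothetical, and treating the timestamp as epoch seconds is an assumption based on the sample value on the slide.

```javascript
// Function side: write a metric as a log line instead of calling a metrics API,
// so no extra latency is added to the invocation.
const formatMetric = (name, type, value, now = Date.now()) =>
  `MONITORING|${Math.floor(now / 1000)}|${value}|${type}|${name}`;

// Log-processing side: recognise metric lines and turn them into objects.
// Returns null for ordinary log messages.
const parseMetric = (message) => {
  const parts = message.trim().split("|");
  if (parts.length !== 5 || parts[0] !== "MONITORING") return null;
  const [, timestamp, value, type, name] = parts;
  return { timestamp: Number(timestamp), value: Number(value), type, name };
};
```

A function would call `console.log(formatMetric("user-api-latency", "latency", 27.4))`, and the log-shipping function would route any line for which `parseMetric` returns a value to the metrics backend rather than to the log store.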
  119. 119. CONFIG MANAGEMENT
  120. 120. design for easy & quick propagation of config changes
  121. 121. me Environment variables make it hard to share configurations across functions.
  122. 122. me Environment variables make it hard to implement fine-grained access to sensitive info.
  123. 123. config service goes here
  124. 124. SSM Parameter Store
  125. 125. sensitive data should be encrypted in-flight, and at-rest
  126. 126. enforce role-based access to sensitive configuration values
  127. 127. SSM Parameter Store HTTPS role-based access encrypted in-flight
  128. 128. SSM Parameter Store encrypt role-based access
  129. 129. SSM Parameter Store encrypted at-rest
  130. 130. HTTPS role-based access SSM Parameter Store encrypted in-flight
  131. 131. invest in a robust client library
  132. 132. fetch & cache at cold-start
  133. 133. invalidate at interval & weak signals
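The client library behaviour described on slides 131 to 133 (fetch and cache at cold start, invalidate at an interval or on weak signals) can be sketched as below. This is an illustration under assumptions rather than the speakers' library: `createConfigClient`, the injected `loadAll` function (which in practice might read from SSM Parameter Store), the default three-minute expiry, and the reading of "weak signals" as an explicit `invalidate()` call after a suspicious failure are all hypothetical.

```javascript
// loadAll: async function returning an object of config values (hypothetical;
// a real one might call SSM Parameter Store). now is injectable for testing.
const createConfigClient = (loadAll, { ttlMs = 3 * 60 * 1000, now = Date.now } = {}) => {
  let cache = null;
  let expiresAt = 0;

  const refresh = async () => {
    cache = await loadAll();
    expiresAt = now() + ttlMs;
  };

  return {
    // First call after a cold start loads the values; later calls use the
    // cache until it expires.
    get: async (key) => {
      if (cache === null || now() >= expiresAt) {
        try {
          await refresh();
        } catch (err) {
          // Keep serving stale values if a refresh fails; fail only when
          // nothing has ever been loaded.
          if (cache === null) throw err;
        }
      }
      return cache[key];
    },
    // Call this on a weak signal that the cached values may be out of date,
    // e.g. a downstream call rejected a cached credential.
    invalidate: () => {
      expiresAt = 0;
    },
  };
};
```

Because the cache lives in the closure, it survives across invocations on a warm container, so most invocations pay no extra latency for configuration.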
  134. 134. “DevOps is a set of practices intended to reduce the time between committing a change to a system and the change being placed into normal production, while ensuring high quality.” https://en.wikipedia.org/wiki/DevOps#Definitions_and_History
  135. 135. NoOps
  136. 136. Serverless ops is different…
  137. 137. remember the principles rethink the approach
  138. 138. API Gateway and Kinesis; Authentication & authorisation (IAM, Cognito); Testing; Running & Debugging functions locally; Log aggregation; Monitoring & Alerting; X-Ray; Correlation IDs; CI/CD; Performance and Cost optimisation; Error Handling; Configuration management; VPC; Security; Leading practices (API Gateway, Kinesis, Lambda); Canary deployments. http://bit.ly/production-ready-serverless (get 40% off with: ytcui)
  139. 139. @theburningmonk theburningmonk.com github.com/theburningmonk
