An outline of how Moneytree uses Amazon SWF to coordinate its backend aggregation workflow. Focuses on how to run a large-scale distributed system with a few developers while still sleeping at night.
2. Who Am I?
Ross Sharrott
Founder & CTO of Moneytree
American
10 Years in Japan (Feb 24!)
Previously Senior IT Manager
Love distributed architectures in the cloud
10. 1 Account / Many Statements
But we had a problem…
To determine a CC balance, we need information from multiple statements
We needed a post statement process
Download Data → Process Statements → Post Process Statements → Store Data + Additional Information
12. Queue Falls Down
I know…I’ll use a queue!
Queues are linear
Where are we in the process?
Logged in yet? Processing data?
What do you do when a job fails?
How do you relate jobs to one workflow?
13. Enter SWF
AWS Managed Service
Coordinates Workflows / Maintains history
Provides multiple queues called Task Lists
Handle decision points with Deciders
Perform tasks with Activity Workers
15. SWF World – A Restaurant
Decider – does nothing, makes decisions
Workflow Starter – takes orders
Activity Worker – makes food
Activity Worker – distributes food
SWF – maintains history, distributes tasks
16. Activity Worker
Very similar to any queue worker
Handles a specific task
Polls a Task List to get new info
Reports activity success or failure
Puts results in a DB or on S3, etc.
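The worker loop on this slide can be sketched roughly as follows. This is a minimal illustration, not Moneytree's code: `swf` is assumed to be a boto3 SWF client (or any object with the same three methods), and `handler`, `domain`, `task_list`, and `max_polls` are hypothetical names introduced here.

```python
import json


def run_activity_worker(swf, domain, task_list, handler, max_polls=1):
    """Poll one task list and report success or failure for each task.

    Assumption: `swf` behaves like boto3.client("swf"). `handler` is a
    hypothetical function that does the real work and returns something
    JSON-serializable (for example, a pointer to data stored on S3).
    """
    for _ in range(max_polls):
        task = swf.poll_for_activity_task(
            domain=domain, taskList={"name": task_list}
        )
        token = task.get("taskToken")
        if not token:
            continue  # the long poll returned without any work
        try:
            result = handler(json.loads(task.get("input") or "{}"))
            swf.respond_activity_task_completed(
                taskToken=token, result=json.dumps(result)
            )
        except Exception as exc:
            # Report the failure so the decider can choose what to do next.
            swf.respond_activity_task_failed(
                taskToken=token, reason=type(exc).__name__, details=str(exc)[:1000]
            )
```

Passing the client in keeps the loop testable without touching AWS; a production worker would presumably poll indefinitely rather than a fixed number of times.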
17. Workflow Decider
Uses workflow history to make decisions
Schedules tasks
Handles rescheduling failures & timeouts
Reacts to external events (Signals)
Reacts to completion events
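A decider can be written as a pure function from workflow history to a list of decisions. The sketch below assumes a linear three-step workflow; the step names, `MAX_FAILURES`, the timeout values, and the helper `count` are all hypothetical. The event and decision type strings follow the SWF API.

```python
# Hypothetical activity names; the slides do not give the real ones.
STEPS = ["aggregate", "process_statements", "post_process"]
MAX_FAILURES = 3  # assumed retry budget per workflow
FAILURE_EVENTS = ("ActivityTaskFailed", "ActivityTaskTimedOut")


def count(events, *types):
    """Count history events whose eventType is one of `types`."""
    return sum(1 for e in events if e["eventType"] in types)


def decide(events, task_list="default"):
    """Return the decisions for one decision task, given the full history."""
    scheduled = count(events, "ActivityTaskScheduled")
    completed = count(events, "ActivityTaskCompleted")
    failures = count(events, *FAILURE_EVENTS)

    if scheduled > completed + failures:
        return []  # an activity is still in flight (e.g. woken by a signal)
    if failures > MAX_FAILURES:
        return [{
            "decisionType": "FailWorkflowExecution",
            "failWorkflowExecutionDecisionAttributes": {
                "reason": "too many failures"
            },
        }]
    if completed >= len(STEPS):
        return [{"decisionType": "CompleteWorkflowExecution"}]

    step = STEPS[completed]  # retry the same step after a failure or timeout
    return [{
        "decisionType": "ScheduleActivityTask",
        "scheduleActivityTaskDecisionAttributes": {
            "activityType": {"name": step, "version": "1"},
            "activityId": f"{step}-{failures}",
            "taskList": {"name": task_list},
            "startToCloseTimeout": "120",  # illustrative values, in seconds
            "heartbeatTimeout": "30",
        },
    }]
```

In a real decider the returned list would be handed to `RespondDecisionTaskCompleted`; keeping `decide` free of I/O makes the branching easy to unit test.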
20. 1 Day of Work
Yesterday:
70,000 Workflows
Average Completion Time: 1 Minute
575,000 Decision Tasks
146,000 Statements Processed
70,000 Aggregation Tasks
70,000 Post Process Tasks
22. How To Sleep At Night
Make Workers Scalable
Avoid SWF API Throttling
Expect Failures
Measure Everything
23. Make Workers Scalable
Separate concerns into individual workers
Scale each worker process individually
Automate scaling your workers
Make workers idempotent
You can always try again
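One way to make a worker idempotent is to key its output on a deterministic identifier, so a retried task finds the earlier result instead of redoing or duplicating it. In this sketch a plain dict stands in for the DB or S3 bucket; `process_statement`, `workflow_id`, `statement_id`, and the line-counting "work" are hypothetical.

```python
def process_statement(store, workflow_id, statement_id, raw):
    """Process one statement at most once per (workflow, statement) pair.

    Assumption: `store` is any dict-like object; in production this would
    be a database table or an S3 prefix with the same key scheme.
    """
    key = f"{workflow_id}/{statement_id}"
    if key in store:
        return store[key]  # a retry: reuse the earlier result
    result = {"lines": len(raw.splitlines())}  # placeholder for real parsing
    store[key] = result
    return result
```

Because the key depends only on the inputs, running the task twice leaves the store in the same state as running it once.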
24. Avoid API Throttling
Don’t call GetWorkflowExecutionHistory
Stress test your implementation
Limits are by Region, not domain!
Get your limits raised
We hit limits on day 1
Use exponential retry
Have a circuit breaker
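Exponential retry and a circuit breaker can be combined along these lines. This is a generic sketch, not the implementation used at Moneytree: the class and function names, thresholds, and delays are assumptions, and the backoff uses "full jitter" (a random delay up to the exponential bound).

```python
import random
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, max_failures=5, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # stop calling for a while
            raise
        self.failures = 0
        return result


def retry_with_backoff(fn, attempts=5, base=0.5, cap=30.0, sleep=time.sleep):
    """Call fn, sleeping a random, exponentially bounded time between tries."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # do not retry while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

A caller might wrap each SWF request as `retry_with_backoff(lambda: breaker.call(request))`, so repeated throttling errors eventually stop all calls until the reset period has passed.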
25. Expect Failures
Cloud = Failures
Dyno / EC2 instance restarts
Network & Service outages
Don’t wait for failed processes
Use aggressive timeouts
Use heartbeats for long processes
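For a long-running activity, the worker can report a heartbeat every few items so that SWF times the task out quickly if the process dies. A rough sketch, assuming `swf` behaves like a boto3 SWF client; `process_with_heartbeat`, `work`, and `every` are hypothetical names.

```python
def process_with_heartbeat(swf, task_token, items, work, every=10):
    """Run work(item) for each item, sending a heartbeat every `every` items.

    Returns the number of items processed. Stops early if SWF reports that
    cancellation of the activity was requested.
    """
    done = 0
    for item in items:
        work(item)
        done += 1
        if done % every == 0:
            status = swf.record_activity_task_heartbeat(
                taskToken=task_token, details=f"{done} items done"
            )
            if status.get("cancelRequested"):
                break
    return done
```

The heartbeat interval should sit comfortably inside the heartbeat timeout set when the activity is scheduled; otherwise a healthy worker will be timed out.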
26. Monitor Everything
Use Performance Monitoring
10x increase in performance = 10x workers
New Relic & Cloudwatch
Centralize Logging
Cloud resources disappear with their logs
Papertrail / Logentries
Log Everything & Set Up Alerts
If you don’t log it, you can’t fix it
27. Sleep At Night
Make Workers Scalable
Avoid SWF API Throttling
Expect Failures
Measure Everything
28. Thank You!
Moneytree is hiring!
iOS Developers
API Developers / AWS Dev Ops
Technology Ninjas
Ross Sharrott Founder / CTO
rsharrott@moneytree.jp
@moneytreejp
Editor's Notes
Manager – does nothing, makes decisions
Waitress – takes orders
Cook – makes food
Hall Staff – delivers food
POS System – maintains history, distributes tasks
Long Poll SWF for new decisions. Monitors a single decision task list.
Top Level is simple
But…
We can fail to log in or need additional information
We can fail to process a statement
Decider to handle the Workflow
Data Aggregation Activity Worker
Statement Processing Activity Worker
Post Processing Activity Worker
Share Data via S3