11. Lessons Learned & Misc Facts
• On an average day: 120,000 messages per minute
• Cost of Lambdas: ~$5.00/day
• Cost of Kinesis (140 shards): ~$50.00/day
• There is a soft limit on the number of Lambda functions that can be executing at any given point in time. Be mindful of their execution time.
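The concurrency cap and execution time trade off directly: with a fixed number of concurrent slots, each invocation holds a slot for its whole duration, so the sustainable invocation rate is roughly the limit divided by the average runtime. A minimal sketch (the 1,000-execution limit is the AWS account default, used here only as an illustration):

```python
def max_sustainable_rps(concurrency_limit: int, avg_duration_s: float) -> float:
    """Rough ceiling on invocations/sec before throttling.

    With N concurrent slots and each invocation holding a slot for
    avg_duration_s seconds, at most N / avg_duration_s invocations
    can start per second (Little's law).
    """
    return concurrency_limit / avg_duration_s

# A default soft limit of 1,000 concurrent executions and a 2-second
# average runtime caps sustained throughput at 500 invocations/sec.
print(max_sustainable_rps(1000, 2.0))  # 500.0
```

Halving the average execution time doubles the throughput headroom under the same limit, which is why keeping Lambdas short matters here.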
• Kinesis is very opaque
• We append shard information to the messages on the fly to get a better understanding of how the streams are working.
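One lightweight way to do this (a sketch, not the talk's exact code): in a Lambda fed by Kinesis, each event record's `eventID` carries the shard ID, so it can be copied into the decoded message before forwarding. The message shape (`msg` key) is made up for illustration.

```python
import base64
import json

def annotate_with_shard(record: dict) -> dict:
    """Decode a Kinesis event record and tag it with its source shard.

    Kinesis -> Lambda event records carry the shard in eventID,
    formatted "shardId-000000000000:<sequence number>".
    """
    message = json.loads(base64.b64decode(record["kinesis"]["data"]))
    message["shard_id"] = record["eventID"].split(":", 1)[0]
    return message

# Example with a synthetic record:
record = {
    "eventID": "shardId-000000000042:49590338271490256608559692538361571",
    "kinesis": {"data": base64.b64encode(json.dumps({"msg": "hello"}).encode()).decode()},
}
print(annotate_with_shard(record))  # {'msg': 'hello', 'shard_id': 'shardId-000000000042'}
```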
• Instrument/measure everything!
• So many moving parts can make it harder to troubleshoot.
Editor's Notes
Redshift as the sole repository for “real time” and Analytics
Windows service (C#) running on EC2
Unzip CloudWatch messages
Discard CloudWatch Wrapper
Match start / stop messages
Upload to S3
Load from S3 to Redshift
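The first two steps above can be sketched as follows, assuming the documented CloudWatch Logs subscription payload (gzipped JSON whose wrapper fields like `logGroup` and `owner` surround a `logEvents` array); the sample log content is invented.

```python
import gzip
import json

def unwrap_cloudwatch(payload: bytes) -> list:
    """Gunzip a CloudWatch Logs payload and keep only the log messages,
    discarding the wrapper fields (logGroup, logStream, owner, ...)."""
    doc = json.loads(gzip.decompress(payload))
    return [event["message"] for event in doc.get("logEvents", [])]

# Example with a synthetic payload:
wrapped = {"logGroup": "/app/prod", "logStream": "i-123",
           "logEvents": [{"id": "1", "timestamp": 0, "message": "start req-7"},
                         {"id": "2", "timestamp": 5, "message": "stop req-7"}]}
payload = gzip.compress(json.dumps(wrapped).encode())
print(unwrap_cloudwatch(payload))  # ['start req-7', 'stop req-7']
```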
Problems
Trying to join compound messages on the fly in the Web UI was too slow
Could not process and load data fast enough; database kept falling behind.
Did not have backfill capabilities
Our problem was that we could not complete all the processing without falling behind.
The system would stay up to date for a few hours and then it would begin lagging and never recover.
Problem: we were trying to use one repository for all purposes.
We decided to implement a common approach for ingesting. Fast and Slow (EXPLAIN)
We’re just going to focus on and drill into the blue boxes for time’s sake
Windows services (C#) running on EC2:
Fast service:
Unzip CloudWatch messages
Discard CloudWatch Wrapper
Match start / stop messages. THIS WAS DONE IN MEMORY
Send messages to Elasticsearch
Elasticsearch as the repository for the fast path (web UI)
Elasticsearch query response time was fast enough for web UI. Yay!
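The in-memory matching step can be sketched as a dictionary of pending halves keyed by a correlation ID (the `req_id`/`kind` field names are assumptions for illustration). It also shows why the service ran hot: unmatched halves accumulate in memory.

```python
def make_matcher():
    """Pair start/stop messages in memory.

    Returns a feed() function: it buffers the first half of each pair
    and emits the joined (start, stop) tuple when the second arrives.
    If a counterpart never shows up, the pending dict grows without
    bound -- the memory pressure described on this slide.
    """
    pending = {}

    def feed(msg: dict):
        other = pending.pop(msg["req_id"], None)
        if other is None:
            pending[msg["req_id"]] = msg
            return None
        start, stop = (other, msg) if other["kind"] == "start" else (msg, other)
        return (start, stop)

    return feed

feed = make_matcher()
print(feed({"req_id": "7", "kind": "start", "t": 0}))  # None (buffered)
print(feed({"req_id": "7", "kind": "stop", "t": 5}))   # joined (start, stop) pair
```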
Problems
We were having incidents too often
Fast service ran very hot (high memory and CPU utilization)
Fast service was too sensitive to sudden increases in data flow (didn’t auto-scale)
Adding auto-scaling capabilities would have kept adding to code complexity, maintenance, etc.
Still did not have backfill capabilities
EXPLAIN EACH STEP, ONE AT A TIME
Use Lambdas to perform only one task at a time
Unzip and discard CloudWatch wrapper
Segregate and Pair
Send to S3
Use Kinesis Streams as the holding place between processing steps
Use S3 as the data lake for the entire pipeline
Windows service as the Elasticsearch Loader
Elasticsearch as the repository for the fast path (web UI)
Redshift as the repository for the slow path (data analytics/reporting)
Elasticsearch query response time was fast enough for web UI. Yay!
Using S3 as the data lake provided us with the means to backfill data if needed.
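Because every stage lands in S3, a backfill reduces to replaying the relevant key prefixes for a date range. A sketch under an assumed date-partitioned layout (`<base>YYYY/MM/DD/`, which is hypothetical, not the talk's actual scheme):

```python
from datetime import date, timedelta

def backfill_prefixes(start: date, end: date, base: str = "pipeline/paired/") -> list:
    """S3 key prefixes to replay for a backfill window, assuming objects
    are partitioned as <base>YYYY/MM/DD/ (layout is illustrative)."""
    days = (end - start).days + 1
    return [f"{base}{start + timedelta(d):%Y/%m/%d}/" for d in range(days)]

print(backfill_prefixes(date(2016, 3, 1), date(2016, 3, 3)))
# ['pipeline/paired/2016/03/01/', 'pipeline/paired/2016/03/02/', 'pipeline/paired/2016/03/03/']
```

Each prefix can then be listed and its objects re-sent through the loaders, which is what the raw-data lake makes possible.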
Problems
Lots of moving parts, harder to see exactly what’s going on.
Explain the limitation of one Lambda invocation per shard, and how a big batch would cause it to fall behind
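With one Lambda invocation polling each shard at a time, a shard drains at most one batch per invocation duration; a large, slow batch lowers that rate and the shard lags. A back-of-the-envelope check using the deck's own numbers (120,000 msgs/min over 140 shards; batch size and invocation duration are illustrative):

```python
def shard_keeps_up(incoming_rps: float, shards: int,
                   batch_size: int, invocation_s: float) -> bool:
    """With one concurrent Lambda invocation per shard, each shard drains
    at most batch_size / invocation_s records per second. The stream
    falls behind once the per-shard arrival rate exceeds that."""
    per_shard_in = incoming_rps / shards
    per_shard_out = batch_size / invocation_s
    return per_shard_out >= per_shard_in

# 120,000 msgs/min = 2,000 msgs/sec over 140 shards ~= 14.3 msgs/sec/shard.
print(shard_keeps_up(2000, 140, 100, 5.0))   # True: drains 20 rec/s vs ~14.3 arriving
print(shard_keeps_up(2000, 140, 100, 10.0))  # False: drains 10 rec/s, shard lags
```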