
Case Study: Stream Processing on AWS using Kappa Architecture


In the summer of 2016, XpertSea decided to migrate its operations to AWS and to build a data processing system able to scale to the extent of our ambitions. Come see how we built our platform, inspired by the Kappa Architecture, able to support connected devices located all around the globe and state-of-the-art machine learning algorithms.



  1. 1. Stream Processing on AWS using Kappa Architecture Joey Bolduc-Gilbert joey@xpertsea.com
  2. 2. 4.3B People depend on fish for key protein 50% Of all fish protein comes from farming 2x More fish than any other animal protein
  3. 3. Feed Conversion and Water Footprint: chart comparing animal proteins by water needed for 1 lb of meat (gallons: 1857, 756, 469, 3.8) and feed needed for 1 lb of meat (lbs: 6.8, 2.9, 1.7, 1.1)
  4. 4. -50% lost to disease and poor management for a typical shrimp farmer
  5. 5. Aquaculture technology gap today Manual sampling Visual inspection Non-digital records
  6. 6. Aqua Farming Analytics Are a Tool for Change
  7. 7. Collect data Ingest, store, process, serve data Consume data IoT Devices Data SaaS Annotate data, train models Machine Learning / AI The XpertSea Platform
  8. 8. Aquaculture Data Animals Water quality Genetics Ocean Production Transactions Location Equipment Feed Weather Diseases
  9. 9. What challenges are we facing? Extracting value out of that data ● Highly distributed ● Need for some near real-time metrics ● Large-scale aggregations (region- and industry-wide) ● Unreliable networks ● And many more!
  10. 10. The CAP theorem Consistency Availability Partition tolerance Pick 2?
  11. 11. The CAP theorem
  12. 12. What is a data system? Working around the CAP Theorem ● A simple equation: Query = Function(All data) ● A data system answers questions about a dataset ● Lots of complexity is caused by the mutability of data ● You obviously cannot process all your data from scratch for every query
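To make slide 12 concrete, here is a toy Python sketch of Query = Function(All data) over an append-only event log. The events and the query (latest average weight per pond) are invented for illustration; the point is that every answer is a pure function of the full, immutable history.

```python
# Toy illustration of Query = Function(All data) over an immutable event log.
# The events and the "latest average weight" query are made up for this sketch.

events = [
    {"type": "Sample", "pond": "A", "timestamp": 1, "avg_weight_g": 2.1},
    {"type": "Sample", "pond": "B", "timestamp": 2, "avg_weight_g": 1.8},
    {"type": "Sample", "pond": "A", "timestamp": 3, "avg_weight_g": 2.4},
]

def latest_avg_weight(all_events, pond):
    """A pure function of the whole dataset: no mutation, just a scan over history."""
    samples = [e for e in all_events if e["type"] == "Sample" and e["pond"] == pond]
    if not samples:
        return None
    return max(samples, key=lambda e: e["timestamp"])["avg_weight_g"]

print(latest_avg_weight(events, "A"))  # 2.4
```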
  13. 13. Lambda Architecture
  14. 14. Upsides & Downsides Strengths ● Immutability of the data lake ● More traceability ● Ensures you can make your system evolve quickly ● Designed to scale Weaknesses ● Complexity of maintaining two layers ● It doesn’t really beat CAP, it just reduces its complexity
  15. 15. Kappa Architecture
  16. 16. XpertSea’s Deepwater platform
  17. 17. The AWS Services we use ● S3 ● API Gateway ● SQS/SNS ● AWS Lambda ● ECS ● DynamoDB ● RDS ● CloudFormation ● CloudWatch ● IAM ● CloudFront ● Route 53 ● Cognito (soon) ● SES (soon)
  18. 18. The core system Data Ingestion API Gateway AWS Lambda Data lake (S3) Metadata Store (DynamoDB)
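As a rough sketch of slide 18, an ingestion Lambda behind API Gateway could look like the following. The handler, bucket name, table name, and envelope fields are assumptions for illustration, not the actual XpertSea code.

```python
# Minimal sketch of an ingestion Lambda: API Gateway -> Lambda -> S3 + DynamoDB.
# BUCKET, TABLE, and the envelope fields are hypothetical.
import json
import time
import uuid

import boto3

s3 = boto3.client("s3")
dynamodb = boto3.resource("dynamodb")

BUCKET = "example-data-lake"                 # hypothetical bucket name
TABLE = dynamodb.Table("event-metadata")     # hypothetical table name

def handler(event, context):
    """Triggered by API Gateway; writes the immutable payload to S3 and its metadata to DynamoDB."""
    body = json.loads(event["body"])
    event_id = str(uuid.uuid4())
    key = f"event/{event_id}/payload.json"

    # The payload itself goes to the data lake, never to be mutated afterwards.
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(body))

    # Only the metadata is copied to DynamoDB for fast lookups.
    TABLE.put_item(Item={
        "id": event_id,
        "timestamp": int(time.time()),
        "key": key,
        "type": body.get("type", "Unknown"),
        "schema": body.get("schema", ""),
    })

    return {"statusCode": 201, "body": json.dumps({"id": event_id})}
```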
  19. 19. The core system Data lake { "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa", "parent_id": "", "timestamp": 1517801441, "key": "event/0000d9f6-045d-438b-8058-4ee6447ba0fa/payload.json", "schema": "WorkflowV1.json", "type": "Workflow", "data": { "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa", "timestamp": 1517799978, "created_at": 1517801441, "created_by": "xyz" // .... } } ● Stored in JSON for simplicity ● Metadata is copied to DynamoDB ● JSON Schema to handle validation ● Remember, all data is immutable! A typical entry
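Since every entry carries a schema field, validation can be handled by a standard JSON Schema library before anything reaches the data lake. A minimal sketch, assuming the Python jsonschema package and a simplified stand-in for WorkflowV1.json:

```python
# Sketch of the JSON Schema validation step; the schema below is a simplified
# stand-in for WorkflowV1.json, which is not reproduced in the deck.
from jsonschema import ValidationError, validate

WORKFLOW_V1 = {
    "type": "object",
    "required": ["id", "timestamp", "created_at", "created_by"],
    "properties": {
        "id": {"type": "string"},
        "timestamp": {"type": "number"},
        "created_at": {"type": "number"},
        "created_by": {"type": "string"},
    },
}

def validate_entry(entry):
    """Reject malformed events before they ever reach the immutable data lake."""
    try:
        validate(instance=entry["data"], schema=WORKFLOW_V1)
        return True
    except ValidationError:
        return False
```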
  20. 20. The core system Data Processing ● Generally a preprocessor to read from the stream and compute the latest data; ● A number of processors to perform the hard work; ● Generally a producer to cache the results somewhere; ● Chain as many as you need! { "schema": "WorkflowV1.json", "data": { "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa", "timestamp": 1517799978, "created_at": 1517801441.1827073, "created_by": "portal", "computed_1": 10, "computed_2": 142, // .... }, "event_ids": [ "0000d9f6-045d-438b-8058-4ee6447ba0fa", "id1", "id2" ] } A typical message
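A bare-bones sketch of the preprocessor, processors, and producer chain from slide 20. The function names and computed fields are hypothetical; only the message shape mirrors the example on the slide.

```python
# Sketch of the processing chain: preprocessor -> processor(s) -> producer.

def preprocess(event_ids, load_event):
    """Read the relevant events from the stream/data lake and merge the latest state."""
    events = [load_event(eid) for eid in event_ids]
    latest = max(events, key=lambda e: e["timestamp"])
    return {"schema": latest["schema"], "data": dict(latest["data"]), "event_ids": event_ids}

def compute_metrics(message):
    """One of possibly many processors doing the hard work on the merged data."""
    message["data"]["computed_1"] = 10    # placeholder computation
    message["data"]["computed_2"] = 142   # placeholder computation
    return message

def produce(message, cache):
    """Cache the result somewhere (DynamoDB, RDS, ...) so the serving layer can read it."""
    cache[message["data"]["id"]] = message
    return message

# Chain as many steps as needed:
# produce(compute_metrics(preprocess(ids, load_event)), cache)
```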
  21. 21. The core system Data Processing (Serverless) AWS Lambda AWS Lambda AWS Lambda AWS Lambda AWS Lambda DynamoDB RDS
  22. 22. The core system Data Processing (Containers) Amazon SQS ECS + Docker Amazon SQS ECS + Docker DynamoDB RDS
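For the container-based variant on slide 22, a worker running on ECS would typically long-poll SQS. A minimal sketch, assuming boto3 and a hypothetical queue URL:

```python
# Long-running SQS consumer as it might run inside an ECS container.
# QUEUE_URL and process() are assumptions for illustration.
import json

import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/example-queue"  # hypothetical

def process(message_body):
    ...  # the actual processor logic would live here

def main():
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=10,
            WaitTimeSeconds=20,  # long polling keeps empty receives (and costs) down
        )
        for msg in resp.get("Messages", []):
            process(json.loads(msg["Body"]))
            # Delete only after successful processing, so failures are retried.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])

if __name__ == "__main__":
    main()
```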
  23. 23. AWS Lambda AWS Lambda DynamoDB RDS Amazon SQS ECS + Docker Of course we can mix and match!
  24. 24. The core system Data Processing AWS Lambda We use a publisher/subscriber model to notify dependencies (higher-level queries or aggregations): ● A new result is now ready to be used ● An old value was recomputed due to additional data This can trigger immediate recalculation, or be queued to be processed as part of a batch. SNS is the obvious choice for this! Amazon SNS AWS Lambda
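The publisher side of that pattern is a single SNS publish call. A sketch with boto3 and a hypothetical topic ARN; subscribers (other Lambdas, or queues feeding batch jobs) decide whether to recompute immediately or later:

```python
# Sketch of notifying downstream aggregations that a result is new or recomputed.
import json

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:result-updated"  # hypothetical

def notify_result(result_id, recomputed=False):
    """Tell dependencies that a result is ready, or that an old value was recomputed."""
    sns.publish(
        TopicArn=TOPIC_ARN,
        Message=json.dumps({"result_id": result_id, "recomputed": recomputed}),
        MessageAttributes={
            "recomputed": {"DataType": "String", "StringValue": str(recomputed)},
        },
    )
```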
  25. 25. The core system Serving layer (API) API Gateway AWS Lambda DynamoDB RDS Route 53 S3 bucket
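A serving-layer endpoint on slide 25 can be as small as a Lambda that returns a precomputed item from DynamoDB. The table name and path parameter below are assumptions for illustration:

```python
# Sketch of a serving-layer Lambda behind API Gateway reading precomputed results.
import json

import boto3

table = boto3.resource("dynamodb").Table("computed-results")  # hypothetical table

def handler(event, context):
    """GET /results/{id}: return the cached, precomputed view."""
    result_id = event["pathParameters"]["id"]
    item = table.get_item(Key={"id": result_id}).get("Item")
    if item is None:
        return {"statusCode": 404, "body": json.dumps({"error": "not found"})}
    return {"statusCode": 200, "body": json.dumps(item, default=str)}
```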
  26. 26. The core system Serving layer (Web App) S3 bucket Route 53 CloudFront Polymer
  27. 27. Scaling concerns How to scale? ● Multi-region to reduce latency ● New type of data? New pipeline ● Most of the system is serverless, or at least managed ● The serving data layer might need to move to DynamoDB in the future ● Keeping it in a relational DB for now to facilitate integration with our machine learning training framework
  28. 28. Some tools we use and love: Polymer, Python 3 with Troposphere
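As an example of the Python 3 + Troposphere workflow, here is a tiny template that declares a data-lake bucket in code; the logical names are made up, and the real stacks are of course much larger:

```python
# Small Troposphere example: generate a CloudFormation template from Python.
from troposphere import Output, Ref, Template
from troposphere.s3 import Bucket

template = Template()
template.set_description("Data lake bucket, generated with Troposphere")

data_lake = template.add_resource(Bucket("DataLakeBucket"))
template.add_output(Output("DataLakeBucketName", Value=Ref(data_lake)))

print(template.to_json())  # feed this JSON to CloudFormation
```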
  29. 29. Tools we choose not to use ● Amazon Kinesis and Kinesis Firehose (pricing) ● AWS IoT (useful for a large number of simple devices) ● CloudWatch for log processing (we like ELK stacks better) ● Cassandra/Hadoop (too complex for now)
  30. 30. Resources and references On the CAP Theorem: https://codahale.com/you-cant-sacrifice-partition-tolerance/ On Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html On Kappa Architecture: https://www.oreilly.com/ideas/questioning-the-lambda-architecture
  31. 31. Q & A
  32. 32. Thank you!
