In the summer of 2016, XpertSea decided to migrate its operations to AWS and to build a data processing system able to scale with our ambitions. Come see how we built our platform, inspired by the Kappa Architecture, to support connected devices located all around the globe and state-of-the-art machine learning algorithms.
8. The XpertSea Platform
● IoT Devices: collect data
● Data: ingest, store, process, serve data
● SaaS: consume data
● Machine Learning / AI: annotate data, train models
10. Which challenges are we facing?
Extracting value out of that data
● Highly distributed
● Need for some near real-time metrics
● Large scale aggregations (region and industry wide)
● Unreliable networks
● And many more!
13. What is a data system?
Working around the CAP Theorem
● Simple equation, Query = Function(All data)
● A data system answers questions about a dataset
● Lots of complexity caused by the mutability of data
● You obviously cannot process all your data from scratch for every query
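The equation Query = Function(All data) can be illustrated with a minimal Python sketch. The `Event` type, the `query` helper, and the sample values are hypothetical, introduced only for illustration; they are not the platform's actual data model.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)  # frozen: events cannot be mutated once created
class Event:
    device_id: str
    timestamp: int
    value: float


def query(function: Callable[[Iterable[Event]], T], all_data: Iterable[Event]) -> T:
    """Query = Function(All data): a query is a pure function of the dataset."""
    return function(all_data)


def average_value(events: Iterable[Event]) -> float:
    """Example query function: the mean of every recorded value."""
    values = [event.value for event in events]
    return sum(values) / len(values) if values else 0.0


# Hypothetical sample dataset (an immutable tuple of immutable events).
ALL_DATA = (
    Event("device-a", 1517799978, 10.0),
    Event("device-b", 1517799990, 14.0),
    Event("device-a", 1517800100, 12.0),
)

result = query(average_value, ALL_DATA)
print(result)
```

Here every query rescans the full dataset, which is exactly what does not scale; it is shown only to make the equation concrete.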
15. Upsides & Downsides
Strengths
● Immutability of the data lake
● More traceability
● Lets your system evolve quickly
● Designed to scale
Weaknesses
● Complexity of maintaining two layers
● It doesn’t really beat CAP, it just reduces its complexity
18. The AWS Services we use
● S3
● API Gateway
● SQS/SNS
● AWS Lambda
● ECS
● DynamoDB
● RDS
● CloudFormation
● CloudWatch
● IAM
● CloudFront
● Route 53
● Cognito (soon)
● SES (soon)
19. The core system
Data Ingestion
[Diagram: API Gateway, AWS Lambda, Data lake (S3), Metadata Store (DynamoDB)]
20. The core system
Data lake
● Stored in JSON for simplicity
● Metadata is copied to DynamoDB
● JSON Schema to handle validation
● Remember, all data is immutable!
A typical entry:
{
  "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
  "parent_id": "",
  "timestamp": 1517801441,
  "key": "event/0000d9f6-045d-438b-8058-4ee6447ba0fa/payload.json",
  "schema": "WorkflowV1.json",
  "type": "Workflow",
  "data": {
    "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "timestamp": 1517799978,
    "created_at": 1517801441,
    "created_by": "xyz"
    // ....
  }
}
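As a rough sketch of how such an entry could be written, the Python below builds the envelope and stores it write-once. Plain dictionaries stand in for S3 and DynamoDB, and `build_entry`, `ingest`, and the choice of metadata fields are hypothetical names and assumptions for illustration. JSON Schema validation is left out; a real pipeline would validate `data` against the named schema before storing it.

```python
import json
import time
import uuid


def build_entry(event_type, schema, data, parent_id=""):
    """Wrap a payload in an envelope shaped like the typical entry."""
    event_id = str(uuid.uuid4())
    return {
        "id": event_id,
        "parent_id": parent_id,
        "timestamp": int(time.time()),
        "key": f"event/{event_id}/payload.json",  # key layout taken from the slide
        "schema": schema,
        "type": event_type,
        "data": data,
    }


def ingest(entry, data_lake, metadata_store):
    """Store the entry once and copy its metadata to a second store.

    data_lake and metadata_store are plain dicts standing in for S3 and
    DynamoDB (an assumption made to keep the sketch self-contained).
    """
    key = entry["key"]
    if key in data_lake:
        # All data is immutable: never overwrite an existing entry.
        raise ValueError(f"entry already exists: {key}")
    data_lake[key] = json.dumps(entry)
    # Assumption: "metadata" means every envelope field except the payload.
    metadata_store[entry["id"]] = {k: v for k, v in entry.items() if k != "data"}
    return key


lake, metadata = {}, {}
entry = build_entry("Workflow", "WorkflowV1.json", {"created_by": "xyz"})
stored_key = ingest(entry, lake, metadata)
```

A correction is then ingested as a new entry, possibly pointing at the original through `parent_id`, rather than as an update in place. That use of `parent_id` is a guess based on the field name.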
21. The core system
Data Processing
● Generally a preprocessor to read from the stream and compute the latest data;
● A number of processors to perform the hard work;
● Generally a producer to cache the results somewhere;
● Chain as many as you need!
A typical message:
{
  "schema": "WorkflowV1.json",
  "data": {
    "id": "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "timestamp": 1517799978,
    "created_at": 1517801441.1827073,
    "created_by": "portal",
    "computed_1": 10,
    "computed_2": 142,
    // ....
  },
  "event_ids": [
    "0000d9f6-045d-438b-8058-4ee6447ba0fa",
    "id1",
    "id2"
  ]
}
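The preprocessor, processor, and producer chain can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the function names, the `count_a` and `count_b` fields, and the merge rule where later events override earlier ones. Only the message shape (`schema`, `data`, `event_ids`) follows the typical message.

```python
def preprocessor(events):
    """Fold the raw events into the latest data, tracking contributing event ids."""
    if not events:
        raise ValueError("at least one event is required")
    ordered = sorted(events, key=lambda e: e["timestamp"])
    data = {}
    for event in ordered:
        data.update(event["data"])  # assumption: later events override earlier ones
    return {
        "schema": ordered[-1]["schema"],
        "data": data,
        "event_ids": [e["id"] for e in ordered],
    }


def add_total(message):
    """Processor: derive computed_1 from two hypothetical raw fields."""
    data = dict(message["data"])
    data["computed_1"] = data["count_a"] + data["count_b"]
    return {**message, "data": data}


def add_double(message):
    """Processor: derive computed_2 from the previous processor's output."""
    data = dict(message["data"])
    data["computed_2"] = data["computed_1"] * 2
    return {**message, "data": data}


def make_producer(cache):
    """Producer: cache the final message, keyed by the id inside its data."""
    def producer(message):
        cache[message["data"]["id"]] = message
        return message
    return producer


def run_pipeline(events, stages):
    """Run the preprocessor, then chain as many stages as needed."""
    message = preprocessor(events)
    for stage in stages:
        message = stage(message)
    return message


cache = {}
events = [
    {"id": "id2", "timestamp": 2, "schema": "WorkflowV1.json",
     "data": {"id": "w1", "count_b": 6}},
    {"id": "id1", "timestamp": 1, "schema": "WorkflowV1.json",
     "data": {"id": "w1", "count_a": 4}},
]
message = run_pipeline(events, [add_total, add_double, make_producer(cache)])
```

Each stage returns a new message instead of mutating its input, so any stage can be rerun on the same events and produce the same result.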
22. The core system
Data Processing (Serverless)
[Diagram: a series of AWS Lambda functions, with DynamoDB and RDS]
23. The core system
Data Processing (Containers)
[Diagram: Amazon SQS, ECS + Docker, Amazon SQS, ECS + Docker, with DynamoDB and RDS]
25. The core system
Data Processing
We use a publisher/subscriber model to notify
dependencies (higher-level queries or aggregations):
● A new result is now ready to be used
● An old value was recomputed due to additional data
This can trigger immediate recalculation, or be queued to
be processed as part of a batch.
SNS is the obvious choice for this!
[Diagram: AWS Lambda, Amazon SNS, AWS Lambda]
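The notification pattern can be sketched with an in-memory stand-in for an SNS topic. The `Topic` class, the handler names, and the message fields (`kind`, `result_id`) are hypothetical; in the real system the topic would be an SNS topic and the subscribers would be Lambda functions or queues.

```python
from collections import deque


class Topic:
    """In-memory stand-in for a pub/sub topic such as an SNS topic."""

    def __init__(self, name):
        self.name = name
        self._subscribers = []

    def subscribe(self, handler):
        self._subscribers.append(handler)

    def publish(self, message):
        # Fan out: every subscriber receives every message.
        for handler in self._subscribers:
            handler(message)


recalculated = []      # results recomputed immediately
batch_queue = deque()  # notifications deferred to a later batch run


def recalculate_now(message):
    """Dependency that reacts immediately to a notification."""
    recalculated.append(message["result_id"])


def queue_for_batch(message):
    """Dependency that defers the work to a batch."""
    batch_queue.append(message)


results = Topic("workflow-results")
results.subscribe(recalculate_now)
results.subscribe(queue_for_batch)

# The two kinds of notification named in the bullets above.
results.publish({"kind": "new_result", "result_id": "r1"})
results.publish({"kind": "recomputed", "result_id": "r1"})
```

The publisher does not know which dependencies exist, so a new aggregation can subscribe without any change to the processors that feed it.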
28. Scaling concerns
How to scale?
● Multi-region to reduce latency
● New type of data? New pipeline
● Most of the system is serverless, or at least managed
● Serving data layer might need to move to DynamoDB in the future
● Keeping it in a relational DB for now to facilitate integration with our Machine Learning training framework
29. Some tools we use and love
Polymer
Python 3 with Troposphere
30. Tools we choose not to use
● Amazon Kinesis and Kinesis Firehose (pricing)
● AWS IoT (useful for a large number of simple devices)
● CloudWatch for log processing (we like ELK stacks better)
● Cassandra/Hadoop (too complex for now)
31. Resources and references
On the CAP Theorem: https://codahale.com/you-cant-sacrifice-partition-tolerance/
On Lambda Architecture: http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html
On Kappa Architecture: https://www.oreilly.com/ideas/questioning-the-lambda-architecture