Distributed Data Systems

How Do They Even?

About Me - Jared L Kerim
- Software Developer (Python)
- Mozilla Geolocation Cloud Services Team
- CTO at PressureNET

PressureNET (Shameless Plug)
- Gathers sensor data from smartphones
- Constant stream of data to servers
- API to retrieve data
- Visualization
- Analysis

The First Architecture
Sensors -> Web Servers -> MySQL -> API

The Problem: MySQL
- Slow lookups
- Takes a lot of disk space
- Cost (Large Relational DBs are expensive)
- Schema changes (become slow or impossible)

How Big is “Big”?
- PressureNET: 100 req/s, 1.5 billion records
- Analytics Systems: 5000 req/s, 100s of billions of records
- Ad Buying Service: 500k req/s, trillions of records
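
To get a feel for these rates, the short sketch below converts requests per second into requests per day. It is plain arithmetic on the figures above; the helper name requests_per_day is made up for illustration, and it says nothing about how many records each request carries.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def requests_per_day(requests_per_second):
    """Convert a sustained request rate into a daily request count."""
    return requests_per_second * SECONDS_PER_DAY


# Rates quoted on the slide, in requests per second.
rates = {
    "PressureNET": 100,
    "Analytics Systems": 5000,
    "Ad Buying Service": 500_000,
}

for name, rate in rates.items():
    print(f"{name}: {requests_per_day(rate):,} requests/day")
```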

The Question
What is ????
Sensors -> Web Servers -> ???? -> API

What do we want to accomplish?
- Receive and store large amounts of data
- Access it quickly
- Small fast lookups (visualization)
- Large batch computations (mapreduce)

Considerations
- Durability (we don’t want to lose data)
- Redundancy (expect failures!)
- Scalability (simple growth, no upper limit)

Durability
- Data in a durable store should be ‘safe’
- Don’t remove data from one durable data store until it is confirmed to be in another durable data store
- Durable data stores should have redundant backups (hot standbys)
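
The "confirm before you remove" rule can be sketched as follows. This is a minimal illustration, not a real storage client: plain dicts stand in for the two durable stores, and move_record is a hypothetical helper.

```python
def move_record(key, source, destination):
    """Copy one record, confirm the copy, and only then delete the original."""
    value = source[key]
    destination[key] = value              # 1. write to the second store
    if destination.get(key) != value:     # 2. confirm by reading it back
        raise RuntimeError(f"write of {key!r} not confirmed; source left intact")
    del source[key]                       # 3. only now remove the original


# Dicts stand in for two durable stores (an assumption for this sketch).
queue_store = {"reading-1": {"pressure_hpa": 1013.2}}
archive_store = {}

move_record("reading-1", queue_store, archive_store)
```

If the confirmation step fails, the record stays in the source store, so a crash or failed write can at worst leave a duplicate and never a loss.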

Redundancy
- Each stage of your system should have multiple copies
- If one copy goes down, another should take over
- Redundancy ensures availability
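
A rough sketch of "another copy takes over": try each replica in turn and use the first one that answers. The replicas here are ordinary Python functions invented for the example; in a real system this job usually belongs to a load balancer or client library.

```python
def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            errors.append(exc)  # remember the failure, move to the next copy
    raise ConnectionError(f"all {len(replicas)} replicas failed: {errors}")


# Hypothetical replicas: one that is down and one that is healthy.
def down_node(request):
    raise ConnectionError("node is down")


def healthy_node(request):
    return f"stored {request}"


result = call_with_failover([down_node, healthy_node], "reading-1")
```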

Scalability
- The rate of data intake can grow or spike
- Your system should be able to add more resources to handle that growth
- Requires that your workload be partitionable
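
One common way to make a workload partitionable is to hash a key and take it modulo the number of partitions, so each worker owns a slice of the data. The sketch below assumes records carry a device_id field (a name chosen for illustration). It uses hashlib rather than the built-in hash(), whose string hashes vary between Python processes.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a string key to a stable partition number in [0, num_partitions)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def split_by_partition(records, num_partitions):
    """Group records so that each partition can be handled independently."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(record["device_id"], num_partitions)
        partitions[index].append(record)
    return partitions


# Hypothetical sensor readings, keyed by device.
records = [{"device_id": f"phone-{i}", "pressure_hpa": 1000 + i} for i in range(20)]
partitions = split_by_partition(records, 4)
```

Note that plain modulo hashing reshuffles most keys when the partition count changes; schemes such as consistent hashing exist to reduce that, but they are outside this sketch.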

Proposed Architecture
Sensors -> Ingestors -> Queue -> Aggregator -> S3 / DynamoDB

We Are Not Alone
- This architecture is widely adopted
- Analytics
- Ad Serving/Views
- Log Analysis
- Sensor Data
- Game Events
- Video Events

Ingestors
- A redundant, scalable set of nodes which receive data over HTTP
- Can apply early validation and authentication
- Stateless, low latency
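
A stateless ingestor boils down to: authenticate, validate, enqueue, respond. The sketch below leaves out the HTTP server itself and shows only the handler logic. The API key set, the pressure_hpa field, and the status codes chosen are assumptions for illustration.

```python
import json
import queue

API_KEYS = {"demo-key"}  # hypothetical; a real system would look keys up elsewhere


def ingest(raw_body, api_key, out_queue):
    """Handle one request and return an HTTP-style status code.

    No state is kept between calls, so any number of ingestor
    nodes can run this side by side.
    """
    if api_key not in API_KEYS:
        return 401                      # early authentication
    try:
        reading = json.loads(raw_body)
    except ValueError:                  # covers malformed JSON
        return 400
    if not isinstance(reading, dict):
        return 400
    if not isinstance(reading.get("pressure_hpa"), (int, float)):
        return 400                      # early validation
    out_queue.put(reading)              # hand off to the queue and return fast
    return 202


in_flight = queue.Queue()
status = ingest('{"pressure_hpa": 1013.2}', "demo-key", in_flight)
```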

Queue
- A scalable, durable storage mechanism for data ‘in flight’
- Only holds data temporarily
- Typically preserves the order data was received in
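
The queue's contract (hold data temporarily, in arrival order, until a consumer confirms it) can be sketched in memory. AckQueue is a toy class invented for this example; it models ordering and acknowledgement only, not durability or scaling, which a real queue service provides.

```python
import itertools
from collections import OrderedDict


class AckQueue:
    """Toy FIFO queue that keeps a message until it is acknowledged."""

    def __init__(self):
        self._messages = OrderedDict()  # message id -> payload, in arrival order
        self._ids = itertools.count()

    def put(self, payload):
        msg_id = next(self._ids)
        self._messages[msg_id] = payload
        return msg_id

    def peek_batch(self, max_items):
        """Return up to max_items (id, payload) pairs, oldest first."""
        return list(itertools.islice(self._messages.items(), max_items))

    def ack(self, msg_id):
        """Drop a message once the consumer confirms it is stored elsewhere."""
        self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


q = AckQueue()
for payload in ["a", "b", "c"]:
    q.put(payload)

first_batch = q.peek_batch(2)  # oldest two messages, still held by the queue
q.ack(first_batch[0][0])       # confirm only the first one
```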

Aggregator
- A scalable, stateless set of workers which consume data from the queue
- Can process data in small batches
- Write raw or transformed data to persistent storage such as S3, databases, etc.
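
An aggregator worker's core loop can be sketched as "pull from the queue, group into small batches, write each batch out". Here the standard library queue.Queue stands in for the real queue, and write_batch is a hypothetical callback standing in for a write to S3 or a database.

```python
import queue


def drain_in_batches(source, write_batch, batch_size=100):
    """Write everything currently in the queue, batch_size records at a time.

    Returns the number of records written.
    """
    written = 0
    batch = []
    while True:
        try:
            batch.append(source.get_nowait())
        except queue.Empty:
            break                      # nothing left to consume right now
        if len(batch) >= batch_size:
            write_batch(batch)
            written += len(batch)
            batch = []                 # start a fresh batch
    if batch:                          # flush the final partial batch
        write_batch(batch)
        written += len(batch)
    return written


pending = queue.Queue()
for i in range(5):
    pending.put(i)

stored_batches = []                    # stands in for persistent storage
count = drain_in_batches(pending, stored_batches.append, batch_size=2)
```

Batching trades a little latency for far fewer writes. A production worker would also acknowledge messages only after the write succeeds, which this sketch does not model.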
