Distributed Data Systems
How Do They Even?
About Me - Jared L Kerim
- Software Developer (Python)
- Mozilla Geolocation Cloud Services Team
- CTO at PressureNET
PressureNET (Shameless Plug)
- Gathers sensor data from smartphones
- Constant stream of data to servers
- API to retrieve data
- Visualization
- Analysis
The First Architecture
Sensors -> Web Servers -> MySQL -> API
The Problem: MySQL
- Slow lookups
- Takes a lot of disk space
- Cost (Large Relational DBs are expensive)
- Schema changes (become slow or impossible)
How Big is “Big”?
- PressureNET 100 req/s, 1.5 billion records
- Analytics Systems 5000 req/s, 100s of billions of records
- Ad Buying Service 500k req/s, trillions of records
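To put the request rates above on a daily scale, here is a rough back-of-the-envelope sketch. It assumes each request carries exactly one record, which the slides do not state; `records_per_day` and `records_per_request` are illustrative names, not part of any system described here.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def records_per_day(requests_per_second, records_per_request=1):
    """Estimate daily record volume from a sustained request rate.

    records_per_request=1 is an assumption, not a figure from the slides.
    """
    return requests_per_second * records_per_request * SECONDS_PER_DAY

# Request rates taken from the slide above.
rates = {
    "PressureNET": 100,
    "Analytics Systems": 5_000,
    "Ad Buying Service": 500_000,
}

daily = {name: records_per_day(rate) for name, rate in rates.items()}

for name, count in daily.items():
    print(f"{name}: {count:,} records/day")
```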
The Question
What is ????
Sensors -> Web Servers -> ???? -> API
What do we want to accomplish?
- Receive and store large amounts of data
- Access it quickly
- Small fast lookups (visualization)
- Large batch computations (MapReduce)
Considerations
- Durability (we don’t want to lose data)
- Redundancy (expect failures!)
- Scalability (simple growth, no upper limit)
Durability
- Data in a durable store should be ‘safe’
- Don’t remove data from one durable data store until it is confirmed to be in another durable data store
- Durable data stores should have redundant backups (hot standbys)
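The "confirm before you remove" rule can be sketched as follows. This is a minimal illustration, not the talk's implementation: `InMemoryStore` is a hypothetical stand-in for a real durable store (a queue, S3, a database), and `move_record` is an illustrative name.

```python
class InMemoryStore:
    """Hypothetical stand-in for a durable store; real stores persist to disk."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        self._items[key] = value

    def get(self, key):
        return self._items[key]

    def contains(self, key):
        return key in self._items

    def delete(self, key):
        self._items.pop(key, None)


def move_record(key, source, destination):
    """Copy a record, confirm it landed, and only then delete the original."""
    destination.put(key, source.get(key))
    if not destination.contains(key):
        # The write was not confirmed: keep the source copy so we can retry.
        return False
    source.delete(key)
    return True


source = InMemoryStore()
destination = InMemoryStore()
source.put("reading-1", {"pressure_hpa": 1013.2})
moved = move_record("reading-1", source, destination)
```

If the process crashes between the confirmed write and the delete, the record exists in both stores, so a retry may write it twice. Duplicates are usually easier to handle downstream than lost data.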
Redundancy
- Each stage of your system should have multiple copies
- If one copy goes down, another should take over
- Redundancy ensures availability
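One way to picture "another should take over" is a caller that tries each copy in turn. The sketch below is illustrative only: `ReplicaDown`, `call_with_failover`, and the two replica functions are hypothetical names, and real systems usually delegate this to a load balancer or client library.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve the request (hypothetical)."""


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown:
            continue  # This copy is down; fall through to the next one.
    raise ReplicaDown(f"all {len(replicas)} replicas failed")


def broken_replica(request):
    raise ReplicaDown("primary is down")


def healthy_replica(request):
    return f"handled {request}"


response = call_with_failover([broken_replica, healthy_replica], "GET /readings")
```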
Scalability
- The rate of data intake can grow or spike
- Your system should be able to add more resources to handle that growth
- Requires that your workload be partitionable
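A partitionable workload is one where each record can be routed to a worker independently of the others. A common approach, shown here as a sketch, is to hash a key into a fixed number of partitions. The `device_id` field and the function names are assumptions for illustration.

```python
import hashlib


def partition_for(key, num_partitions):
    """Map a key to a partition index using a stable hash."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


def partition_records(records, num_partitions):
    """Group records by partition; each group can go to a separate worker."""
    partitions = [[] for _ in range(num_partitions)]
    for record in records:
        index = partition_for(record["device_id"], num_partitions)
        partitions[index].append(record)
    return partitions


records = [
    {"device_id": "phone-a", "pressure_hpa": 1012.0},
    {"device_id": "phone-b", "pressure_hpa": 1009.5},
    {"device_id": "phone-a", "pressure_hpa": 1012.3},
]
partitions = partition_records(records, num_partitions=4)
```

All readings from the same device land in the same partition. Note that changing `num_partitions` remaps most keys under this simple modulo scheme.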
Proposed Architecture
Sensors -> Ingestors -> Queue -> Aggregator -> S3 / DynamoDB
We Are Not Alone
- This architecture is widely adopted
- Analytics
- Ad Serving/Views
- Log Analysis
- Sensor Data
- Game Events
- Video Events
Ingestors
- A redundant, scalable set of nodes which receive data over HTTP
- Can apply early validation and authentication
- Stateless, low latency
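An ingestor's request handler might look roughly like the sketch below. It keeps no state between calls: it authenticates, validates, hands the record to the queue, and returns an HTTP status code. The API key set, the required fields, and the `handle_ingest` name are all assumptions made for illustration; the HTTP server itself is left out.

```python
import json

VALID_API_KEYS = {"demo-key"}  # hypothetical; real keys would live elsewhere
REQUIRED_FIELDS = ("device_id", "timestamp", "pressure_hpa")  # assumed schema


def handle_ingest(api_key, body, enqueue):
    """Authenticate and validate one request, then pass it to the queue.

    Returns an HTTP status code. `enqueue` is any callable that accepts
    the parsed reading.
    """
    if api_key not in VALID_API_KEYS:
        return 401  # Unauthorized
    try:
        reading = json.loads(body)
    except (TypeError, ValueError):
        return 400  # Body is not valid JSON
    if not isinstance(reading, dict):
        return 400
    if any(field not in reading for field in REQUIRED_FIELDS):
        return 400  # Missing a required field
    enqueue(reading)
    return 202  # Accepted for later processing


queue = []
status = handle_ingest(
    "demo-key",
    '{"device_id": "phone-a", "timestamp": 1, "pressure_hpa": 1013.2}',
    queue.append,
)
```

Because the handler holds no state, any number of identical ingestor nodes can sit behind a load balancer.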
Queue
- A scalable, durable storage mechanism for data ‘in flight’
- Only holds data temporarily
- Typically preserves the order data was received in
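The queue's contract can be sketched with a small in-memory model: messages come out in the order they went in, and a message stays in the queue until the consumer acknowledges it. `InFlightQueue` is a hypothetical class for illustration. It is not durable, and unlike most real queue services it does not hide messages that have been received but not yet acknowledged.

```python
import itertools
from collections import OrderedDict


class InFlightQueue:
    """In-memory model of an ordered queue with explicit acknowledgement."""

    def __init__(self):
        self._messages = OrderedDict()  # message_id -> body, in arrival order
        self._ids = itertools.count(1)

    def send(self, body):
        message_id = next(self._ids)
        self._messages[message_id] = body
        return message_id

    def receive(self, max_messages=10):
        """Return up to max_messages (id, body) pairs, oldest first."""
        return list(itertools.islice(self._messages.items(), max_messages))

    def ack(self, message_id):
        """Remove a message once the consumer has stored it elsewhere."""
        self._messages.pop(message_id, None)

    def __len__(self):
        return len(self._messages)


q = InFlightQueue()
for body in ("a", "b", "c"):
    q.send(body)

first_batch = q.receive(max_messages=2)
q.ack(first_batch[0][0])  # Only the first message is confirmed stored.
second_batch = q.receive(max_messages=2)
```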
Aggregator
- A scalable, stateless set of workers which consume data from the queue
- Can process data in small batches
- Write raw or transformed data to persistent storage such as S3, databases, etc.
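A worker loop in this style might be sketched as below: read a small batch, transform it, and write it out in one call. The hPa-to-kPa conversion is just an example transform, and `write_batch` stands in for whatever storage client is used (an S3 upload, a database insert); both are assumptions, not details from the talk.

```python
from itertools import islice


def batches(iterable, batch_size):
    """Yield consecutive lists of at most batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def aggregate(readings, batch_size, write_batch):
    """Transform readings in small batches and write each batch to storage.

    Returns the number of records written.
    """
    written = 0
    for batch in batches(readings, batch_size):
        # Example transform: add the pressure in kPa (1 hPa = 0.1 kPa).
        transformed = [
            {**reading, "pressure_kpa": reading["pressure_hpa"] / 10.0}
            for reading in batch
        ]
        write_batch(transformed)
        written += len(transformed)
    return written


stored_batches = []  # stand-in for persistent storage
readings = [{"device_id": f"phone-{i}", "pressure_hpa": 1000.0} for i in range(5)]
total = aggregate(readings, batch_size=2, write_batch=stored_batches.append)
```

Writing per batch rather than per record keeps the number of storage calls low while still bounding how much work is lost if a worker dies mid-batch.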