Distributed Data Systems

A brief overview of distributed data systems in the context of analytics ingestion data pipelines.

  1. Distributed Data Systems: How Do They Even?
  2. About Me
     - Jared L Kerim
     - Software Developer (Python)
     - Mozilla Geolocation Cloud Services Team
     - CTO at PressureNET
  3. PressureNET (Shameless Plug)
     - Gathers sensor data from smartphones
     - Constant stream of data to servers
     - API to retrieve data
     - Visualization
     - Analysis
  4. The First Architecture
     - Diagram components: Sensors, Web Servers, MySQL, API
  5. The Problem: MySQL
     - Slow lookups
     - Takes a lot of disk space
     - Cost (large relational DBs are expensive)
     - Schema changes (become slow or impossible)
  6. How Big is “Big”?
     - PressureNET: 100 req/s, 1.5 billion records
     - Analytics Systems: 5000 req/s, 100s of billions of records
     - Ad Buying Service: 500k req/s, trillions of records
  7. The Question: What is ????
     - Diagram components: Sensors, Web Servers, ????, API
  8. What do we want to accomplish?
     - Receive and store large amounts of data
     - Access it quickly
     - Small, fast lookups (visualization)
     - Large batch computations (MapReduce)
  9. Considerations
     - Durability (we don’t want to lose data)
     - Redundancy (expect failures!)
     - Scalability (simple growth, no upper limit)
  10. Durability
     - Data in a durable store should be ‘safe’
     - Don’t remove data from one durable data store until it is confirmed to be in another durable data store
     - Durable data stores should have redundant backups (hot standbys)
  11. Redundancy
     - Each stage of your system should have multiple copies
     - If one copy goes down, another should take over
     - Redundancy ensures availability
  12. Scalability
     - The rate of data intake can grow or spike
     - Your system should be able to add more resources to handle that growth
     - This requires that your workload is partitionable
  13. Proposed Architecture
     - Diagram components: Sensors, Ingestors, Queue, Aggregator, S3, DynamoDB
  14. We Are Not Alone
     - This architecture is widely adopted for:
       - Analytics
       - Ad Serving/Views
       - Log Analysis
       - Sensor Data
       - Game Events
       - Video Events
  15. Ingestors
     - A redundant, scalable set of nodes that receive data over HTTP
     - Can apply early validation and authentication
     - Stateless, low latency
  16. Queue
     - A scalable, durable storage mechanism for data ‘in flight’
     - Only holds data temporarily
     - Typically preserves the order in which data was received
  17. Aggregator
     - A scalable, stateless set of workers that consume data from the queue
     - Can process data in small batches
     - Write raw or transformed data to persistent storage such as S3 or a database
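
The ingestor, queue, and aggregator stages from slides 15 to 17 can be sketched in a few lines of Python. This is a minimal illustration only, not the architecture's actual code. `InMemoryQueue`, `ingest`, and `aggregate_once` are hypothetical names. The queue and the `store` dict are in-memory stand-ins for a real durable queue and for S3 or DynamoDB. The sketch follows the durability rule from slide 10: a batch is removed from the queue only after the write to the store is confirmed.

```python
from collections import OrderedDict
from itertools import count


class InMemoryQueue:
    """Hypothetical stand-in for a durable queue: messages stay until acked."""

    def __init__(self):
        self._messages = OrderedDict()  # message id -> payload, in arrival order
        self._ids = count(1)

    def put(self, payload):
        msg_id = next(self._ids)
        self._messages[msg_id] = payload
        return msg_id

    def peek_batch(self, size):
        """Return up to `size` (id, payload) pairs without removing them."""
        return list(self._messages.items())[:size]

    def ack(self, msg_ids):
        """Remove messages only after the consumer confirms they are stored."""
        for msg_id in msg_ids:
            self._messages.pop(msg_id, None)

    def __len__(self):
        return len(self._messages)


def ingest(queue, reading):
    """Stateless ingestor: validate early, then hand off to the queue."""
    if not isinstance(reading.get("pressure"), (int, float)):
        return False  # reject malformed input before it enters the pipeline
    queue.put(reading)
    return True


def aggregate_once(queue, store, batch_size=2):
    """Stateless aggregator step: read a small batch, persist it, then ack."""
    batch = queue.peek_batch(batch_size)
    if not batch:
        return 0
    ids = [msg_id for msg_id, _ in batch]
    key = f"batch-{ids[0]}-{ids[-1]}"
    store[key] = [payload for _, payload in batch]  # write to the stand-in store
    if key in store:  # confirm the write before removing from the queue
        queue.ack(ids)
    return len(batch)


queue, store = InMemoryQueue(), {}
readings = [
    {"pressure": 1013.2},
    {"pressure": "bad"},  # fails validation at the ingestor
    {"pressure": 1009.8},
    {"pressure": 1011.0},
]
accepted = [ingest(queue, r) for r in readings]

# Drain the queue in small batches until nothing is left.
while aggregate_once(queue, store):
    pass
```

Because neither `ingest` nor `aggregate_once` keeps state between calls, several copies of each could run side by side against a shared queue. That statelessness is what makes those stages redundant and scalable in the sense of slides 11 and 12.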
