The Secrets of Building Realtime Big Data Systems

The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.

Presented at POSSCON '11.

Published in: Technology


    1. The Secrets of Building Realtime Big Data Systems Nathan Marz @nathanmarz
    2. Who am I?
    3. Who am I?
    4. Who am I?
    5. Who am I? (Upcoming book)
    6. BackType • >30 TB of data • Process 100M messages / day • Serve 300 requests / sec • 100 to 200 machine cluster • 3 full-time employees, 2 interns
    7. Built on open-source: Thrift, Cascading, Scribe, ZeroMQ, ZooKeeper, Pallet
    8. What is a data system? [Diagram: raw data → View 1, View 2, View 3]
    9. What is a data system? [Diagram: Tweets → # Tweets / URL, Influence scores, Trending topics]
    10. Everything else (schemas, databases, indexing, etc.) is implementation
    11. Essential properties of a data system
    12. 1. Robust
    13. 1. Robust to machine failure
    14. 1. Robust to machine failure and human error
    15. 2. Low latency reads and updates
    16. 3. Scalable
    17. 4. General
    18. 5. Extensible
    19. 6. Allows ad-hoc analysis
    20. 7. Minimal maintenance
    21. 8. Debuggable
    22. Layered Architecture: Speed Layer, Batch Layer
    23. Let’s pretend temporarily that update latency doesn’t matter
    24. Let’s pretend it’s OK for a view to lag by a few hours
    25. Batch layer• Arbitrary computation• Horizontally scalable• High latency
    26. Batch layer: Not the end-all-be-all of batch computation, but the most general
    27. Hadoop [Diagram: input files in a distributed filesystem → MapReduce → output files in a distributed filesystem]
    28. Hadoop • Express your computation in terms of MapReduce • Get parallelism and scalability “for free”
    29. Batch layer • Store master copy of dataset • Master dataset is append-only
    30. Batch layer: view = fn(master dataset)
    31. Batch layer [Diagram: Master dataset → MapReduce → Batch View 1, Batch View 2, Batch View 3]
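
The idea on slides 30 and 31, that each batch view is a pure function of the master dataset, can be sketched in a few lines of Python. This is a single-process illustration only, not Hadoop code. The record layout (`tweet_id`, `url`) and the function name `tweets_per_url` are assumptions made for this example, modeled on the "# Tweets / URL" view from slide 9.

```python
from collections import Counter
from typing import Dict, Iterable, NamedTuple


class Tweet(NamedTuple):
    # Hypothetical record layout; the slides do not specify one.
    tweet_id: int
    url: str


def tweets_per_url(master_dataset: Iterable[Tweet]) -> Dict[str, int]:
    """Batch view: recomputed from scratch from the full master dataset."""
    # Map step: emit the URL of every tweet. Reduce step: count per URL.
    return dict(Counter(tweet.url for tweet in master_dataset))


master_dataset = [
    Tweet(1, "http://example.com/a"),
    Tweet(2, "http://example.com/b"),
    Tweet(3, "http://example.com/a"),
]

# view = fn(master dataset)
view = tweets_per_url(master_dataset)
print(view)
```

Because the function never mutates its input, rerunning it over the same master dataset always yields the same view, which is what makes recovery from a bad view a matter of recomputing it.
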
    32. Batch layer • In practice, it is too expensive to fully recompute each view to get updates • A production batch workflow adds the minimum amount of incrementalization necessary for performance
    33. Incremental batch layer [Diagram labels: New data, Append, All data, Query, Batch View maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
    34. Batch layer: Robust and fault-tolerant to both machine and human error. Low latency reads. Low latency updates. Scalable to increases in data or traffic. Extensible to support new features or related services. Generalizes to diverse types of data and requests. Allows ad hoc queries. Minimal maintenance. Debuggable: can trace how any value in the system came to be.
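
The incrementalization described on slides 32 and 33 can be illustrated with a counting view, where folding in only the new records gives the same answer as a full recomputation. This is a minimal sketch under assumptions: the names `count_urls` and `update_view` and the sample URLs are hypothetical, and a real workflow would run as batch jobs rather than in one process.

```python
from collections import Counter
from typing import Iterable


def count_urls(urls: Iterable[str]) -> Counter:
    """Full recomputation: count tweets per URL over a dataset."""
    return Counter(urls)


def update_view(batch_view: Counter, new_urls: Iterable[str]) -> Counter:
    """Incremental update: fold only the new records into the existing view."""
    return batch_view + Counter(new_urls)


all_data = ["a.com", "b.com", "a.com"]  # master dataset so far
new_data = ["b.com", "c.com"]           # records since the last batch run

batch_view = count_urls(all_data)
batch_view = update_view(batch_view, new_data)
all_data.extend(new_data)               # append-only: new data joins the master dataset

# The incremental result matches a full recomputation over all data.
assert batch_view == count_urls(all_data)
```

Counting is easy to incrementalize because counts add. Views that are not additive may still need a full recomputation, which is why the slide asks for only the minimum incrementalization necessary.
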
    35. Speed layer: Compensate for high latency of updates to batch layer
    36. Speed layer, key point: Only needs to compensate for data not yet absorbed in serving layer
    37. Speed layer, key point: Only needs to compensate for data not yet absorbed in serving layer. Hours of data instead of years of data
    38. Application-level queries [Diagram: Batch Layer Query and Speed Layer Query → Merge]
    39. Speed layer: Once data is absorbed into batch layer, can discard speed layer results
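
The query-time merge on slide 38 and the discard step on slide 39 can be sketched as follows. The in-memory dictionaries stand in for a batch view store and a read/write database, and the names `tweet_count` and `on_batch_run_complete` are hypothetical. The sketch also assumes that no new data arrives while the batch run is in progress; a real system would discard only the speed layer results the batch run actually covered.

```python
from typing import Dict

# Hypothetical precomputed views for a "tweets per URL" query.
batch_view: Dict[str, int] = {"a.com": 1000, "b.com": 250}  # data absorbed by the batch layer
speed_view: Dict[str, int] = {"a.com": 3, "c.com": 1}       # only data not yet absorbed


def tweet_count(url: str) -> int:
    """Application-level query: merge the batch and speed layer answers."""
    return batch_view.get(url, 0) + speed_view.get(url, 0)


def on_batch_run_complete(new_batch_view: Dict[str, int]) -> None:
    """Swap in the fresh batch view and discard the speed layer results."""
    batch_view.clear()
    batch_view.update(new_batch_view)
    speed_view.clear()


before = tweet_count("a.com")
on_batch_run_complete({"a.com": 1003, "b.com": 250, "c.com": 1})
after = tweet_count("a.com")
print(before, after)
```
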
    40. Speed layer • Message passing • Incremental algorithms • Read/Write databases (Riak, Cassandra, HBase, etc.)
    41. Speed layer: Significantly more complex than the batch layer
    42. Speed layer: But the batch layer eventually overrides the speed layer
    43. Speed layer: So that complexity is transient
    44. Flexibility in layered architecture • Do slow and accurate algorithm in batch layer • Do fast but approximate algorithm in speed layer • “Eventual accuracy”
    45. Data model: Every record is a single, discrete fact at a moment in time
    46. Data model • Alice lives in San Francisco as of time 12345 • Bob and Gary are friends as of time 13723 • Alice lives in New York as of time 19827
    47. Data model • Remember: master dataset is append-only • A person can have multiple location records • “Current location” is a view on this data: pick location with most recent timestamp
    48. Data model • Extremely useful having the full history for each entity: doing analytics, recovering from mistakes (like writing bad data)
    49. Data model [Diagram: graph of facts. Nodes: Alice, Bob, Tweet: 123, Tweet: 456. Edge labels: Reactor, Reaction, Property. Property values: Gender: female; Reshare: true; Content: Data is fun!; Content: RT @bob Data is fun!]
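
The fact-based data model from slides 45 to 47 can be illustrated with the location records given on slide 46. This is a minimal sketch: the `LocationFact` type and the `current_locations` function are names invented for this example, and only the location facts are modeled. "Current location" is derived as a view by picking, per person, the fact with the most recent timestamp.

```python
from typing import Dict, Iterable, NamedTuple


class LocationFact(NamedTuple):
    # One discrete fact: where a person lives as of a moment in time.
    person: str
    city: str
    timestamp: int


# Append-only master dataset: the older fact is kept, never overwritten.
master_dataset = [
    LocationFact("Alice", "San Francisco", 12345),
    LocationFact("Alice", "New York", 19827),
]


def current_locations(facts: Iterable[LocationFact]) -> Dict[str, str]:
    """View: for each person, the city from the most recent location fact."""
    latest: Dict[str, LocationFact] = {}
    for fact in facts:
        best = latest.get(fact.person)
        if best is None or fact.timestamp > best.timestamp:
            latest[fact.person] = fact
    return {person: fact.city for person, fact in latest.items()}


print(current_locations(master_dataset))
```

Because the full history stays in the master dataset, a bad record can be corrected by removing it and recomputing the view, rather than by trying to undo an in-place overwrite.
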
    50. Questions? Twitter: @nathanmarz • Email: nathan.marz@gmail.com • Web: http://nathanmarz.com
