Batch layer
• In practice, too expensive to fully recompute each view to get updates
• A production batch workflow adds minimum amount of incrementalization necessary for performance
Incremental batch layer: [Diagram with labels: New data, Append, All data, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3, Query]
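To make the trade-off concrete, here is a minimal sketch of a batch view maintained incrementally rather than recomputed from scratch. The pageview-count view, the record shape `(url, timestamp)`, and the function names are illustrative assumptions, not something the slides specify.

```python
from collections import Counter

def recompute_view(all_data):
    """Full recomputation: rebuild the view from every record in the dataset."""
    return Counter(url for url, _timestamp in all_data)

def update_view(view, new_data):
    """Incremental maintenance: fold only the newly appended records into the view."""
    updated = Counter(view)
    updated.update(url for url, _timestamp in new_data)
    return updated

# Hypothetical records: (url, timestamp)
all_data = [("/home", 1), ("/about", 2), ("/home", 3)]
view = recompute_view(all_data)

new_data = [("/home", 4), ("/pricing", 5)]
all_data.extend(new_data)           # new data is appended, never updated in place
view = update_view(view, new_data)  # touches 2 records instead of 5

# The incremental result should match what a full recompute would produce.
assert view == recompute_view(all_data)
```

A count is easy to maintain incrementally because new records only ever add to it. Views that are harder to update in place are where full recomputation stays attractive, which is presumably why the slide recommends only the minimum incrementalization.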
Batch layer
• Robust and fault-tolerant to both machine and human error
• Low latency reads
• Low latency updates
• Scalable to increases in data or traffic
• Extensible to support new features or related services
• Generalizes to diverse types of data and requests
• Allows ad hoc queries
• Minimal maintenance
• Debuggable: can trace how any value in the system came to be
Speed layer: Compensate for high latency of updates to batch layer
Speed layer
• Key point: Only needs to compensate for data not yet absorbed in serving layer
• Hours of data instead of years of data
Speed layer: Significantly more complex than the batch layer
Speed layer: But the batch layer eventually overrides the speed layer
Speed layer: So that complexity is transient
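The interplay between the two layers can be sketched as follows. This is a simplified illustration under stated assumptions: the pageview records, the `absorbed_up_to` cutoff, and all function names are hypothetical. A real speed layer would update its view incrementally as records arrive; it is derived by filtering here only to keep the sketch short.

```python
from collections import Counter

def batch_view(master, absorbed_up_to):
    """Counts over all records the batch layer has already absorbed."""
    return Counter(url for url, ts in master if ts <= absorbed_up_to)

def realtime_view(master, absorbed_up_to):
    """Counts only over records newer than the batch view."""
    return Counter(url for url, ts in master if ts > absorbed_up_to)

def query(url, batch, realtime):
    """Answer a query by merging the batch view with the realtime view."""
    return batch[url] + realtime[url]

# Hypothetical records: (url, timestamp)
master = [("/home", 1), ("/home", 2), ("/about", 3), ("/home", 4)]

# The last batch run absorbed everything up to timestamp 2,
# so the speed layer covers only timestamps 3 and 4.
batch, realtime = batch_view(master, 2), realtime_view(master, 2)
before = query("/home", batch, realtime)

# After the next batch run (up to timestamp 4), the batch view overrides
# the realtime view, whose state can be thrown away.
batch, realtime = batch_view(master, 4), realtime_view(master, 4)
after = query("/home", batch, realtime)

assert before == after == 3
assert len(realtime) == 0
```

The query result is the same before and after the batch run. Only the share of the answer supplied by the speed layer changes, and it shrinks to nothing once the batch layer catches up.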
Flexibility in layered architecture
• Do slow and accurate algorithm in batch layer
• Do fast but approximate algorithm in speed layer
• “Eventual accuracy”
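As one possible illustration of "eventual accuracy", the sketch below counts unique visitors exactly in the batch layer and approximately in the speed layer. The slides do not name a specific algorithm; linear counting over a fixed-size bitmap is chosen here purely as an example, and the visitor IDs and function names are assumptions.

```python
import hashlib
import math

def exact_uniques(items):
    """Batch layer: exact distinct count; memory grows with the number of distinct items."""
    return len(set(items))

def approx_uniques(items, num_bits=1024):
    """Speed layer: linear-counting estimate using a fixed-size bitmap."""
    bits = [False] * num_bits
    for item in items:
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        bits[int.from_bytes(digest[:8], "big") % num_bits] = True
    empty = bits.count(False)
    if empty == 0:
        return float(num_bits)  # bitmap saturated; the estimate is only a lower bound
    return -num_bits * math.log(empty / num_bits)

# Hypothetical stream: 1000 visits from 200 distinct users.
visitors = [f"user-{i % 200}" for i in range(1000)]

exact = exact_uniques(visitors)      # what the batch view would eventually report
estimate = approx_uniques(visitors)  # what the speed layer reports in the meantime
print(exact, round(estimate))
```

The approximate figure is served only until the next batch run absorbs the same data, after which the exact figure replaces it.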
Data model: Every record is a single, discrete fact at a moment in time
Data model
• Alice lives in San Francisco as of time 12345
• Bob and Gary are friends as of time 13723
• Alice lives in New York as of time 19827
Data model
• Remember: master dataset is append-only
• A person can have multiple location records
• “Current location” is a view on this data: pick location with most recent timestamp
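The "current location" view can be sketched in a few lines. The `LocationFact` type and the `current_location` function are hypothetical names introduced for illustration; the records are the Alice facts from the previous slide.

```python
from typing import NamedTuple, Optional

class LocationFact(NamedTuple):
    """One immutable fact: a person lived in a city as of a given time."""
    person: str
    city: str
    timestamp: int

# Append-only master dataset: the old San Francisco fact is never overwritten.
master_dataset = [
    LocationFact("Alice", "San Francisco", 12345),
    LocationFact("Alice", "New York", 19827),
]

def current_location(facts, person) -> Optional[str]:
    """View: the city from the person's most recent location fact, or None."""
    own = [f for f in facts if f.person == person]
    if not own:
        return None
    return max(own, key=lambda f: f.timestamp).city

print(current_location(master_dataset, "Alice"))  # New York
```

Because the view is derived from the facts, a move is recorded by appending a new fact rather than by updating an existing one.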
Data model
• Extremely useful having the full history for each entity
  • Doing analytics
  • Recovering from mistakes (like writing bad data)
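A rough sketch of both uses of the full history, under stated assumptions: the tuple layout `(person, city, timestamp)`, the "Atlantis" record, and the function names are hypothetical. Filtering out bad records and recomputing the view is one possible recovery path, not necessarily the one the author has in mind.

```python
# Hypothetical facts: (person, city, timestamp)
facts = [
    ("Alice", "San Francisco", 12345),
    ("Alice", "Atlantis", 15000),  # bad record written by mistake
    ("Alice", "New York", 19827),
]

def location_as_of(facts, person, as_of):
    """Analytics over history: where did the person live at a given time?"""
    candidates = [f for f in facts if f[0] == person and f[2] <= as_of]
    return max(candidates, key=lambda f: f[2])[1] if candidates else None

def without_bad(facts, is_bad):
    """Recovery: drop the bad records, after which any view can be recomputed."""
    return [f for f in facts if not is_bad(f)]

polluted = location_as_of(facts, "Alice", 16000)

cleaned = without_bad(facts, lambda f: f[1] == "Atlantis")
repaired = location_as_of(cleaned, "Alice", 16000)

print(polluted, "->", repaired)  # Atlantis -> San Francisco
```

Recovery is possible here because the earlier San Francisco fact was never overwritten. Had only the latest location been stored, the bad write would have destroyed the correct value.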
Data model: [Diagram: graph with nodes Alice, Bob, Tweet: 123, and Tweet: 456, connected by Reactor and Reaction edges, with properties Gender: female; Reshare: true; Content: "Data is fun!"; Content: "RT @bob Data is fun!"]