The Secrets of Building Realtime Big Data Systems

The architectural principles behind building systems that scale to vast amounts of data and operate on that data in realtime.

Presented at POSSCON '11.

The Secrets of Building Realtime Big Data Systems Presentation Transcript

  • 1. The Secrets of Building Realtime Big Data Systems Nathan Marz @nathanmarz
  • 2. Who am I?
  • 3. Who am I?
  • 4. Who am I?
  • 5. Who am I? (Upcoming book)
  • 6. BackType
      • >30 TB of data
      • Process 100M messages / day
      • Serve 300 requests / sec
      • 100 to 200 machine cluster
      • 3 full-time employees, 2 interns
  • 7. Built on open-source: Thrift, Cascading, Scribe, ZeroMQ, ZooKeeper, Pallet
  • 8. What is a data system? [Diagram: raw data alongside View 1, View 2, and View 3]
  • 9. What is a data system? [Diagram: Tweets alongside # Tweets / URL, Influence scores, and Trending topics]
  • 10. Everything else (schemas, databases, indexing, etc.) is implementation
  • 11. Essential properties of a data system
  • 12. 1. Robust
  • 13. 1. Robust to machine failure
  • 14. 1. Robust to machine failure and human error
  • 15. 2. Low latency reads and updates
  • 16. 3. Scalable
  • 17. 4. General
  • 18. 5. Extensible
  • 19. 6. Allows ad-hoc analysis
  • 20. 7. Minimal maintenance
  • 21. 8. Debuggable
  • 22. Layered Architecture Speed Layer Batch Layer
  • 23. Let’s pretend temporarily that update latency doesn’t matter
  • 24. Let’s pretend it’s OK for a view to lag by a few hours
  • 25. Batch layer
      • Arbitrary computation
      • Horizontally scalable
      • High latency
  • 26. Batch layer
    Not the end-all-be-all of batch computation, but the most general
  • 27. Hadoop [Diagram: input files in a distributed filesystem pass through MapReduce to become output files in a distributed filesystem]
  • 28. Hadoop
      • Express your computation in terms of MapReduce
      • Get parallelism and scalability “for free”
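To make "express your computation in terms of MapReduce" concrete, here is a minimal single-process sketch in Python. It does not use Hadoop's API; `run_mapreduce`, `map_tweet`, `reduce_counts`, and the sample records are all hypothetical names and data chosen for illustration. It computes a "tweets per URL" count, one of the example views named earlier in the deck.

```python
from collections import defaultdict

# Hypothetical records: (tweet_id, url) pairs, invented for illustration.
TWEETS = [
    (1, "http://a.example"),
    (2, "http://b.example"),
    (3, "http://a.example"),
]


def map_tweet(record):
    """Map phase: emit (url, 1) for each tweet."""
    _tweet_id, url = record
    yield url, 1


def reduce_counts(url, counts):
    """Reduce phase: sum the counts emitted for one URL."""
    return url, sum(counts)


def run_mapreduce(records, mapper, reducer):
    """Run map, group values by key (the shuffle), then reduce, in one process."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return dict(reducer(key, values) for key, values in groups.items())


tweets_per_url = run_mapreduce(TWEETS, map_tweet, reduce_counts)
```

Because the mapper looks at one record at a time and the reducer at one key at a time, a framework is free to spread both phases across machines, which is where the "for free" parallelism comes from.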
  • 29. Batch layer
      • Store master copy of dataset
      • Master dataset is append-only
  • 30. Batch layer
    view = fn(master dataset)
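The idea that a view is a function of the master dataset can be sketched in a few lines of Python. This is only an illustration under assumed names: `master_dataset`, `append_fact`, and `tweets_per_url_view` are hypothetical, and a real batch layer would hold the data in a distributed filesystem rather than an in-memory list.

```python
from collections import Counter

# Hypothetical append-only master dataset: one (timestamp, url) fact per tweet.
master_dataset = []


def append_fact(timestamp, url):
    """The only supported write: append a new fact. Nothing is updated or deleted."""
    master_dataset.append((timestamp, url))


def tweets_per_url_view(dataset):
    """A view is a pure function of the whole master dataset."""
    return Counter(url for _timestamp, url in dataset)


append_fact(100, "http://a.example")
append_fact(101, "http://b.example")
view = tweets_per_url_view(master_dataset)

append_fact(102, "http://a.example")
# The view is not patched in place; recomputing from the raw facts yields the new view.
view = tweets_per_url_view(master_dataset)
```

Since the view can always be rebuilt from the raw facts, a buggy view can be thrown away and recomputed once the function is fixed.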
  • 31. Batch layer [Diagram: MapReduce jobs compute Batch View 1, Batch View 2, and Batch View 3 from the master dataset]
  • 32. Batch layer
      • In practice, too expensive to fully recompute each view to get updates
      • A production batch workflow adds minimum amount of incrementalization necessary for performance
  • 33. Incremental batch layer [Diagram labels: New data, Append, All data, Query, Batch view maintenance workflow, Batch View 1, Batch View 2, Batch View 3]
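One way to read the incremental batch layer is sketched below in Python: new data is appended to the full dataset and also folded into the existing view, so the view does not have to be recomputed from scratch. This is an assumption about what the diagram shows, not the author's workflow; `full_recompute` and `incremental_update` are hypothetical names.

```python
from collections import Counter


def full_recompute(all_data):
    """Reference implementation: rebuild the view from every record."""
    return Counter(all_data)


def incremental_update(batch_view, all_data, new_data):
    """Append new data to the dataset and fold it into the existing view."""
    updated_view = batch_view + Counter(new_data)
    updated_data = all_data + new_data  # append-only: old records are untouched
    return updated_view, updated_data


all_data = ["http://a.example", "http://b.example"]
batch_view = full_recompute(all_data)

new_data = ["http://a.example", "http://c.example"]
batch_view, all_data = incremental_update(batch_view, all_data, new_data)
```

The full recomputation stays available as a check: for a correct incremental step, `batch_view` should equal `full_recompute(all_data)`.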
  • 34. Batch layer
      • Robust and fault-tolerant to both machine and human error.
      • Low latency reads.
      • Low latency updates.
      • Scalable to increases in data or traffic.
      • Extensible to support new features or related services.
      • Generalizes to diverse types of data and requests.
      • Allows ad hoc queries.
      • Minimal maintenance.
      • Debuggable: can trace how any value in the system came to be.
  • 35. Speed layer
    Compensate for high latency of updates to batch layer
  • 36. Speed layer
    Key point: Only needs to compensate for data not yet absorbed in serving layer
  • 37. Speed layer
    Key point: Only needs to compensate for data not yet absorbed in serving layer
    Hours of data instead of years of data
  • 38. Application-level queries [Diagram: a batch layer query and a speed layer query are merged]
  • 39. Speed layer
    Once data is absorbed into batch layer, can discard speed layer results
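The two slides above (merging batch and speed queries, then discarding speed results once absorbed) can be sketched together in Python. Everything here is an assumption made for illustration: the `SpeedLayer` class, the `query` function, the timestamp-based `batch_cutoff`, and the sample numbers. A real speed layer would sit on a read/write database rather than a Python list.

```python
from collections import Counter


class SpeedLayer:
    """Holds only events the batch layer has not absorbed yet (hypothetical design)."""

    def __init__(self):
        self.events = []  # list of (timestamp, url)

    def record(self, timestamp, url):
        self.events.append((timestamp, url))

    def counts_since(self, cutoff):
        """Count events strictly newer than the batch layer's cutoff."""
        return Counter(url for ts, url in self.events if ts > cutoff)

    def discard_through(self, cutoff):
        """Drop results that the batch layer now covers."""
        self.events = [(ts, url) for ts, url in self.events if ts > cutoff]


def query(url, batch_view, batch_cutoff, speed):
    """Application-level query: merge the batch answer with the speed answer."""
    return batch_view.get(url, 0) + speed.counts_since(batch_cutoff)[url]


A = "http://a.example"
B = "http://b.example"
speed = SpeedLayer()
batch_view, batch_cutoff = {A: 120}, 1000
speed.record(1005, A)
speed.record(1010, B)
before = query(A, batch_view, batch_cutoff, speed)

# A later batch run absorbs everything through timestamp 1010.
batch_view, batch_cutoff = {A: 121, B: 1}, 1010
speed.discard_through(batch_cutoff)
after = query(A, batch_view, batch_cutoff, speed)
```

The merged answer is the same before and after the batch run; only the share of it served by the speed layer shrinks, which is why the speed layer holds hours of data rather than years.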
  • 40. Speed layer
      • Message passing
      • Incremental algorithms
      • Read/Write databases
          • Riak
          • Cassandra
          • HBase
          • etc.
  • 41. Speed layer
    Significantly more complex than the batch layer
  • 42. Speed layer
    But the batch layer eventually overrides the speed layer
  • 43. Speed layer
    So that complexity is transient
  • 44. Flexibility in layered architecture
      • Do slow and accurate algorithm in batch layer
      • Do fast but approximate algorithm in speed layer
      • “Eventual accuracy”
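As one possible illustration of "slow and accurate" versus "fast but approximate", the Python sketch below counts unique visitors two ways. The slides do not name a specific algorithm; the k-minimum-values estimator used for the speed layer is a technique swapped in here as an assumption, and `exact_unique`, `approx_unique`, and the sample user IDs are hypothetical.

```python
import hashlib


def exact_unique(items):
    """Batch-layer style: exact, but must remember every distinct item."""
    return len(set(items))


def _hash01(item):
    """Map a string to a pseudo-uniform float in [0, 1)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


def approx_unique(items, k=256):
    """Speed-layer style: k-minimum-values estimate using at most k stored hashes."""
    kept = set()
    for item in items:
        h = _hash01(item)
        if h in kept:
            continue
        if len(kept) < k:
            kept.add(h)
        else:
            largest = max(kept)
            if h < largest:
                kept.remove(largest)
                kept.add(h)
    if len(kept) < k:
        return len(kept)  # fewer than k distinct hashes seen: the count is exact
    return int((k - 1) / max(kept))  # standard KMV estimate from the k-th smallest hash


visits = [f"user{i}" for i in range(5000)] * 2  # every user appears twice
exact = exact_unique(visits)
approx = approx_unique(visits)
```

The approximate count is served until the batch layer next runs; the exact count then replaces it, which is one way to read "eventual accuracy".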
  • 45. Data model
    Every record is a single, discrete fact at a moment in time
  • 46. Data model
      • Alice lives in San Francisco as of time 12345
      • Bob and Gary are friends as of time 13723
      • Alice lives in New York as of time 19827
  • 47. Data model
      • Remember: master dataset is append-only
      • A person can have multiple location records
      • “Current location” is a view on this data: pick location with most recent timestamp
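The "current location" view described above can be sketched in Python using the Alice facts from the previous slide. The tuple layout and the `current_locations` function are hypothetical, and the Chicago record is an invented extra fact added only to show that arrival order does not matter.

```python
# Location facts as (person, city, timestamp). The first two come from the slide;
# the Chicago fact is hypothetical and arrives late with an older timestamp.
location_facts = [
    ("Alice", "San Francisco", 12345),
    ("Alice", "New York", 19827),
    ("Alice", "Chicago", 15000),
]


def current_locations(facts):
    """View: for each person, pick the location with the most recent timestamp."""
    latest = {}
    for person, city, timestamp in facts:
        if person not in latest or timestamp > latest[person][1]:
            latest[person] = (city, timestamp)
    return {person: city for person, (city, _timestamp) in latest.items()}


current = current_locations(location_facts)
```

No fact is ever overwritten: the older San Francisco and Chicago records stay in the dataset, so the full location history remains available.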
  • 48. Data model
      • Extremely useful having the full history for each entity
          • Doing analytics
          • Recovering from mistakes (like writing bad data)
  • 49. Data model [Diagram: graph of facts. Nodes: Alice, Bob, Tweet: 123, Tweet: 456. Edges: Reaction, Reactor. Properties: Gender: female; Reshare: true; Content: "RT @bob Data is fun!"; Content: "Data is fun!"]
  • 50. Questions?
    Twitter: @nathanmarz
    Email: nathan.marz@gmail.com
    Web: http://nathanmarz.com