Handling Data in Mega Scale Systems


By Vineet Gupta, GM Software Engineer.

1. Intelligent People. Uncommon Ideas.
   Handling Data in Mega Scale Web Apps (lessons learnt @ Directi)
   Vineet Gupta | GM – Software Engineering | Directi
   http://vineetgupta.spaces.live.com
   Licensed under Creative Commons Attribution Sharealike Noncommercial
2. Outline
   - Characteristics
   - App Tier Scaling
   - Replication
   - Partitioning
   - Consistency
   - Normalization
   - Caching
   - Data Engine Types
3. Not Covering
   - Offline Processing (Batching / Queuing)
   - Distributed Processing – Map Reduce
   - Non-blocking IO
   - Fault Detection, Tolerance and Recovery
4. Outline (section divider: Characteristics)
5. How Big Does It Get: Digg
   - 22M+ users
   - Dozens of DB servers; dozens of web servers
   - Six specialized graph database servers to run the recommendations engine
   Source: http://highscalability.com/digg-architecture
6. How Big Does It Get: Technorati
   - 1 TB / day; 100M blogs indexed / day; 10B objects indexed / day
   - 0.5B photos and videos
   - Data doubles in 6 months; users double in 6 months
   Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
7. How Big Does It Get
   - 2 PB raw storage; 470M photos, 4-5 sizes each; 400k photos added / day
   - 35M photos in Squid cache (total); 2M photos in Squid RAM
   - 38k requests / sec to memcached; 4B queries / day
   Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
8. How Big Does It Get: eBay
   - Virtualized database spans 600 production instances residing in 100+ server clusters distributed over 8 datacenters
   - 2 PB of data; 26B SQL queries / day; 1B page views / day; 3B API calls / month
   - 15,000 app servers
   Source: http://highscalability.com/ebay-architecture/
9. How Big Does It Get: Google
   - 450,000 low-cost commodity servers in 2006; indexed 8B web pages in 2005
   - 200 GFS clusters (1 cluster = 1,000 to 5,000 machines); read/write throughput of 40 GB/sec across a cluster
   - MapReduce: 100k jobs / day, 20 PB of data processed / day, 10k MapReduce programs
   Source: http://highscalability.com/google-architecture/
10. Key Trends
    - Data size ~ PB; data growth ~ TB / day
    - Number of servers: 10s to 10,000; number of datacenters: 1 to 10
    - Queries: B+ / day
    - Specialized needs: more / other than RDBMS
11. Outline (section divider: App Tier Scaling)
12. Vertical Scaling (Scaling Up) (diagram: a single host running both the App Server and the DB Server, grown by adding CPUs and RAM)
13. Big Irons
    - Sunfire E20k: 36x 1.8 GHz processors, $450,000 to $2,500,000
    - PowerEdge SC1435: dual-core 1.8 GHz processor, around $1,500
14. Vertical Scaling (Scaling Up)
    Increasing the hardware resources on a host
    Pros:
    - Simple to implement
    - Fast turnaround time
    Cons:
    - Finite limit
    - Hardware does not scale linearly (diminishing returns for each incremental unit)
    - Requires downtime
    - Increases downtime impact
    - Incremental costs increase exponentially
15. Vertical Partitioning of Services (diagram: the App Server and the DB Server moved onto separate hosts)
16. Vertical Partitioning of Services
    Split services onto separate nodes; each node performs different tasks
    Pros:
    - Increases per-application availability
    - Task-based specialization, optimization and tuning possible
    - Reduces context switching
    - Simple to implement for out-of-band processes
    - No changes to the app required
    - Flexibility increases
    Cons:
    - Sub-optimal resource utilization
    - May not increase overall availability
    - Finite scalability
17. Horizontal Scaling of App Server (diagram: a Load Balancer in front of several identical Web Servers, all sharing one DB Server)
18. Horizontal Scaling of App Server
    - Add more nodes for the same service: identical, doing the same task
    - Load balancing: hardware balancers are faster; software balancers are more customizable
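The distribution step in slide 18 is simple to make concrete. As a minimal sketch (not tied to any particular balancer product), round-robin dispatch over identical nodes looks like this:

```python
from itertools import cycle

class RoundRobinBalancer:
    """Distributes requests evenly across identical app-server nodes."""

    def __init__(self, nodes):
        self._nodes = cycle(nodes)

    def pick(self):
        # Each call hands back the next node in rotation.
        return next(self._nodes)

balancer = RoundRobinBalancer(["web1", "web2", "web3"])
print([balancer.pick() for _ in range(4)])  # → ['web1', 'web2', 'web3', 'web1']
```

Real balancers add health checks and weighting, but the routing decision itself stays this cheap, which is why the tier scales by just adding nodes.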
19. The Problem: State (diagram: two users routed through the Load Balancer to different Web Servers; session state held on one server is missing on the others)
20. Sticky Sessions (diagram: the Load Balancer pins each user to a single Web Server)
    - Asymmetrical load distribution
    - Downtime: a node failure loses its sessions
21. Central Session Store (diagram: all Web Servers read and write sessions in a shared Session Store)
    - SPOF
    - Reads and writes generate network + disk IO
22. Clustered Sessions (diagram: Web Servers replicate session state to one another)
23. Clustered Sessions
    Pros:
    - No SPOF
    - Easier to set up
    - Fast reads
    Cons:
    - n x writes
    - Network IO increases with the number of nodes
    - Stale data (rare)
24. Sticky Sessions with Central Store (diagram: the Load Balancer pins users to servers, with sessions also kept in a central store)
25. More Session Management
    - No sessions: stuff state in a cookie and sign it! The cookie is sent with every request / response
    - Super-slim sessions: keep a small amount of frequently used data in the cookie; pull the rest from the DB (or central session store)
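The "stuff state in a cookie and sign it" idea from slide 25 can be sketched in a few lines. This is an illustrative example (the `SECRET` key and field names are hypothetical); the HMAC means clients can read the state they carry but cannot tamper with it:

```python
import base64
import hashlib
import hmac
import json

SECRET = b"server-side-secret"  # hypothetical key, shared by all web servers

def encode_session(state: dict) -> str:
    """Serialize session state and sign it, so it can live in a cookie."""
    payload = base64.urlsafe_b64encode(json.dumps(state, sort_keys=True).encode())
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return payload.decode() + "." + sig

def decode_session(cookie: str) -> dict:
    """Verify the signature before trusting anything in the cookie."""
    payload, sig = cookie.rsplit(".", 1)
    expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        raise ValueError("tampered cookie")
    return json.loads(base64.urlsafe_b64decode(payload))

cookie = encode_session({"user_id": 42, "theme": "dark"})
assert decode_session(cookie) == {"user_id": 42, "theme": "dark"}
```

Because any web server holding the secret can verify the cookie, no server-side session storage or stickiness is needed; the trade-off is that the state rides along on every request.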
26. Sessions: Recommendation
    - Bad: sticky sessions
    - Good: clustered sessions for a small number of nodes and/or small write volume; central sessions for a large number of nodes or large write volume
    - Great: no sessions!
27. App Tier Scaling: More
    - HTTP accelerators / reverse proxies: static content caching, redirect to a lighter HTTP server; async NIO on the user side, keep-alive connection pool
    - CDN: get closer to your user (Akamai, Limelight); IP anycasting; async NIO
28. Scaling a Web App
    - App layer: add more nodes and load balance! Avoid sticky sessions. Avoid sessions!!
    - Data store: tricky! Very tricky!!!
29. Outline (section divider: Replication)
30. Replication = Scaling by Duplication (diagram: the app layer over a single node holding tables T1-T4)
31. Replication = Scaling by Duplication (diagram: the app layer over five nodes, each holding a full copy of T1-T4)
    Each node has its own copy of data: a shared-nothing cluster
32. Replication
    - Read : Write = 4 : 1, so scale reads at the cost of writes!
    - Duplicate data: each node has its own copy
    - Master-Slave: writes are sent to one node and cascaded to the others
    - Multi-Master: writes can be sent to multiple nodes; can lead to deadlocks; requires conflict management
33. Master-Slave (diagram: the app layer writes to one Master, which replicates to four Slaves)
    - n x writes, async vs. sync
    - SPOF
    - With async replication, critical reads must go to the Master!
34. Multi-Master (diagram: the app layer writes to two Masters, which replicate to the Slaves)
    - n x writes, async vs. sync
    - No SPOF
    - Conflicts!
35. Replication Considerations
    Asynchronous:
    - Guaranteed, but out-of-band replication from Master to Slave
    - The Master updates its own DB and returns a response to the client; replication to the Slaves happens asynchronously
    - Faster response to the client, but Slave data is marginally behind the Master
    - Requires modifying the app to send critical reads and all writes to the Master, and to load-balance all other reads
    Synchronous:
    - Guaranteed, in-band replication from Master to Slave
    - The Master updates its own DB and confirms all Slaves have updated theirs before returning a response to the client
    - Slower response to the client, but the Slaves have the same data as the Master at all times
    - Requires modifying the app to send writes to the Master and load-balance all reads
36. Replication Considerations
    Replication at the RDBMS level:
    - Support may exist in the RDBMS or through a 3rd-party tool
    - Faster and more reliable
    - The app must send writes to the Master, reads to any DB, and critical reads to the Master
    Replication at the Driver / DAO level:
    - The driver / DAO layer ensures writes are performed on all connected DBs, reads are load-balanced, and critical reads go to a Master
    - In most cases RDBMS-agnostic
    - Slower, and in some cases less reliable
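The routing rule in slide 36, writes and critical reads to the master, everything else load-balanced over slaves, fits in a small DAO. This is a rough sketch under stated assumptions: the `Node` class is a hypothetical stand-in for real DB connections.

```python
import itertools

class Node:
    """Hypothetical stand-in for a real database connection."""
    def __init__(self, name):
        self.name = name

    def execute(self, stmt):
        return (self.name, stmt)

class ReplicatedDao:
    """Writes and critical reads hit the master; other reads round-robin over slaves."""
    def __init__(self, master, slaves):
        self.master = master
        self._slaves = itertools.cycle(slaves)

    def write(self, stmt):
        # All writes always go to the master.
        return self.master.execute(stmt)

    def read(self, stmt, critical=False):
        # Critical reads must not see replication lag, so they also hit the master.
        node = self.master if critical else next(self._slaves)
        return node.execute(stmt)

dao = ReplicatedDao(Node("master"), [Node("slave1"), Node("slave2")])
assert dao.write("INSERT ...")[0] == "master"
assert dao.read("SELECT ...")[0] == "slave1"
assert dao.read("SELECT ...", critical=True)[0] == "master"
```

Keeping this logic in one layer is what makes the approach RDBMS-agnostic: the application code above the DAO never needs to know which physical node answered.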
37. Diminishing Returns (diagram: every replica must apply every write, so as write volume grows, per-server capacity drops from 4 reads + 1 write, to 2 reads + 1 write, to 1 read + 1 write)
38. Outline (section divider: Partitioning)
39. Partitioning = Scaling by Division
    - Vertical partitioning: divide data by tables / columns; scale to as many boxes as there are tables or columns; finite
    - Horizontal partitioning: divide data by rows; scale to as many boxes as there are rows! Limitless scaling
40. Vertical Partitioning (diagram: the app layer over one node holding tables T1-T5)
    Note: a node here typically represents a shared-nothing cluster
41. Vertical Partitioning (diagram: tables T1-T5 spread across five separate nodes)
    - Facebook example: the user table and the posts table can live on separate nodes
    - Joins need to be done in code (why have them?)
42. Horizontal Partitioning (diagram: each node holds all tables T1-T5, but only a slice of the rows: first / second / third million rows)
43. Horizontal Partitioning Schemes
    - Value-based: split on the timestamp of posts, or on the first letter of the user name
    - Hash-based: use a hash function to determine the cluster
    - Lookup map: first come first served, or round robin
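The hash-based and lookup-map schemes from slide 43 can be sketched in a few lines (the shard names and the choice of MD5 are illustrative assumptions, not a recommendation):

```python
import hashlib

SHARDS = ["db0", "db1", "db2", "db3"]  # hypothetical cluster names

def shard_by_hash(user_id: str) -> str:
    """Hash-based: a stable mapping with no state, but resharding moves most keys."""
    h = int(hashlib.md5(user_id.encode()).hexdigest(), 16)
    return SHARDS[h % len(SHARDS)]

lookup = {}  # user_id -> shard; in practice persisted in a central store

def shard_by_lookup(user_id: str) -> str:
    """Lookup map: assign on first sight (round robin here); flexible, but the map itself must be stored and scaled."""
    if user_id not in lookup:
        lookup[user_id] = SHARDS[len(lookup) % len(SHARDS)]
    return lookup[user_id]

assert shard_by_hash("alice") == shard_by_hash("alice")  # deterministic
assert shard_by_lookup("u1") == "db0"
assert shard_by_lookup("u2") == "db1"
```

The trade-off is visible in the code: the hash scheme needs no bookkeeping but repins keys when `SHARDS` changes, while the lookup map lets you move individual users but adds a dependency on the map store.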
44. Outline (section divider: Consistency)
45. CAP Theorem
    Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
46. Transactions
    - Transactions make you feel alone: no one else manipulates the data when you are
    - Transactional serializability: the behavior is as if a serial order exists
    Source: http://blogs.msdn.com/pathelland/
47. Life in the “Now”
    - Transactions live in the “now” inside services
    - Time marches forward; transactions commit, advancing time; transactions see the committed transactions
    - A service’s biz-logic lives in the “now”
    Source: http://blogs.msdn.com/pathelland/
48. Sending Unlocked Data Isn’t “Now”
    - Messages contain unlocked data (assume no shared transactions); unlocked data may change, since unlocking it allows change
    - Messages are not from the “now”; they are from the past
    - There is no simultaneity at a distance! Similar to the speed of light: knowledge travels at the speed of light; by the time you see a distant object it may have changed; by the time you see a message, the data may have changed!
    - Services, transactions, and locks bound simultaneity: inside a transaction, things appear simultaneous (to others); simultaneity exists only inside a transaction, only inside a service!
    Source: http://blogs.msdn.com/pathelland/
54. Outside Data: a Blast from the Past
    - All data from distant stars is from the past: 10 light years away means 10-year-old knowledge; the sun may have blown up 5 minutes ago and we won’t know for 3 minutes more…
    - All data seen from a distant service is from the “past”: by the time you see it, it has been unlocked and may change
    - Each service has its own perspective: inside data is “now”, outside data is “past”; my inside is not your inside, my outside is not your outside
    - This is like going from Newtonian to Einsteinian physics: Newton’s time marched forward uniformly with instant knowledge, and classic distributed computing makes many systems look like one (RPC, 2-phase commit, remote method calls…); in Einstein’s world, everything is “relative” to one’s perspective, and today there is no attempt to blur the boundary
    Source: http://blogs.msdn.com/pathelland/
62. Versions and Distributed Systems
    - Can’t have “the same” data at many locations, unless it is a snapshot
    - Changing distributed data needs versions; each change creates a new snapshot…
    Source: http://blogs.msdn.com/pathelland/
63. Subjective Consistency
    - Given what I know here and now, make a decision; remember the versions of all the data used to make this decision; record the decision as being predicated on these versions
    - Other copies of the object may make divergent decisions: try to sort out conflicts within the family; if necessary, programmatically apologize; very rarely, whine and fuss for human help
    - Subjective consistency: given the information I have at hand, make a decision and act on it! Remember the information at hand!
    - Ambassadors had authority: back before radio, it could be months between communications with the king. Ambassadors would make treaties and much more; they had binding authority. The mess was sorted out later!
    Source: http://blogs.msdn.com/pathelland/
64. Eventual Consistency
    - Eventually, all the copies of the object share their changes: “I’ll show you mine if you show me yours!”
    - Now apply subjective consistency: “Given the information I have at hand, make a decision and act on it!” Everyone has the same information, so everyone comes to the same conclusion about the decisions to take…
    - Eventual consistency: given the same knowledge, produce the same result; everyone sharing their knowledge leads to the same result…
    - This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (decisions made) are all implied by this requirement
    Source: http://blogs.msdn.com/pathelland/
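The design requirement named on slide 64 (idempotent, commutative, associative operations) can be demonstrated with set union, the simplest merge that converges regardless of delivery order:

```python
def merge(a: frozenset, b: frozenset) -> frozenset:
    """Set union is idempotent, commutative and associative, so replicas
    that exchange state in any order converge on the same value."""
    return a | b

# Three replicas each observe one update:
r1 = frozenset({"post-1"})
r2 = frozenset({"post-2"})
r3 = frozenset({"post-3"})

# Apply the exchanges in two different orders; the result is identical:
left = merge(merge(r1, r2), r3)
right = merge(r3, merge(r2, r1))
assert left == right == frozenset({"post-1", "post-2", "post-3"})

# Idempotence: re-delivering a message changes nothing.
assert merge(left, r1) == left
```

Operations without these properties (say, "increment by 1" delivered twice) do not converge on their own, which is why the slide insists this is a requirement on the design, not something the infrastructure grants for free.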
66. Outline (section divider: Normalization)
67. Why Normalize?
    Classic problem with de-normalization: can’t update Sam’s phone # since there are many copies

    Emp # | Emp Name | Mgr # | Mgr Name | Emp Phone | Mgr Phone
    ------|----------|-------|----------|-----------|----------
    47    | Joe      | 13    | Sam      | 5-1234    | 6-9876
    18    | Sally    | 38    | Harry    | 3-3123    | 5-6782
    91    | Pete     | 13    | Sam      | 2-1112    | 6-9876
    66    | Mary     | 02    | Betty    | 5-7349    | 4-0101

    Normalization’s goal is eliminating update anomalies: each data item lives in one place and can be changed without “funny behavior”
    De-normalization is OK if you aren’t going to update!
    Source: http://blogs.msdn.com/pathelland/
68. Eliminate Joins (diagram)
69. Eliminate Joins
    - 6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data?
    - De-normalization removes joins, but increases data volume (disk is cheap and getting cheaper)
    - It can lead to inconsistent data if you are lazy, though in practice this is not really an issue
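The trade-off on slide 69 in miniature (the record layouts below are hypothetical): copying the author's name onto each post removes the join, at the cost of having to rewrite posts whenever the name changes.

```python
# Normalized: a post references its author; reading it needs a join / extra fetch.
users = {13: {"name": "Sam"}}
posts_norm = [{"id": 1, "author_id": 13, "text": "hello"}]

def render_normalized(post):
    # The extra lookup is the in-code equivalent of a join.
    return users[post["author_id"]]["name"] + ": " + post["text"]

# De-normalized: the author's name is copied onto the post; no join needed,
# but renaming Sam now means rewriting every one of his posts.
posts_denorm = [{"id": 1, "author_id": 13, "author_name": "Sam", "text": "hello"}]

def render_denormalized(post):
    return post["author_name"] + ": " + post["text"]

assert render_normalized(posts_norm[0]) == render_denormalized(posts_denorm[0])
```

With partitioned data the normalized version gets worse, not better: the `users` lookup may live on a different node entirely, which is exactly the slide's point about joins and partitioning.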
70. “Append-Only” Data
    Many kinds of computing are “append-only”
    - Lots of observations are made about the world: debits, credits, purchase orders, customer change requests, etc.
    - As time moves on, more observations are added
    - You can’t change the history, but you can add new observations
    Derived results may be calculated, e.g. an estimate of the “current” inventory (frequently inaccurate)
    Historic rollups are calculated, e.g. monthly bank statements
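The append-only pattern from slide 70 in miniature: observations are only ever appended, and "current" values are derived by folding over the history (names and the ledger shape are illustrative):

```python
# Observations are only ever appended; current stock is a derived result.
ledger = []

def observe(sku, delta):
    ledger.append((sku, delta))  # history is never rewritten

def current_inventory(sku):
    # The "current" value is always recomputable from the full history.
    return sum(d for s, d in ledger if s == sku)

observe("widget", +100)   # purchase order received
observe("widget", -3)     # sale
observe("widget", -1)     # breakage reported late
assert current_inventory("widget") == 96
```

Because the history is immutable, late-arriving facts (like the breakage above) are just more appends; the derived value self-corrects the next time it is computed.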
71. Databases and Transaction Logs
    - Transaction logs are the truth: high-performance, write-only, describing ALL the changes to the data
    - The database is the current opinion: it describes the latest value of the data as perceived by the application
    - The database is a caching of the transaction log! It is the subset of the latest committed values represented in the transaction log…
    Source: http://blogs.msdn.com/pathelland/
72. We Are Swimming in a Sea of Immutable Data
    Source: http://blogs.msdn.com/pathelland/
73. Outline (section divider: Caching)
74. Caching
    Makes scaling easier (cheaper)
    Core idea:
    - Read data from the persistent store into memory
    - Store it in a hash table
    - Read first from the cache; if it isn’t there, load from the persistent store
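The core idea on slide 74 is the cache-aside read path. A minimal sketch, with plain dicts standing in for the cache and the persistent store:

```python
store = {"user:1": "Alice"}   # stands in for the persistent store
cache = {}                    # in-memory hash table

def get(key):
    if key in cache:          # 1. read first from the cache
        return cache[key]
    value = store[key]        # 2. on a miss, load from the persistent store
    cache[key] = value        # 3. populate the cache for next time
    return value

assert get("user:1") == "Alice"   # first call misses and hits the store
assert "user:1" in cache          # now cached
assert get("user:1") == "Alice"   # second call is served from memory
```

Everything else in this section (write-through, write-back, sideline, memcached) is a variation on where and when steps 2 and 3 happen, and on how stale the cached copy is allowed to get.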
75. Write-thru Cache (diagram: the app server’s writes pass through the cache to the store)
76. Write-back Cache (diagram: the app server writes to the cache, which updates the store later)
77. Sideline Cache (diagram: the app server talks to the store directly and updates the cache on the side)
78. Memcached
79. How Does It Work
    An in-memory distributed hash table
    - A memcached instance manifests as a process (often on the same machine as the web server)
    - The memcached client maintains a hash table: which item is stored on which instance
    - The memcached server maintains a hash table: which item is stored in which memory location
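The client-side hash table from slide 79 can be sketched as follows. Real memcached clients typically use consistent hashing so that adding an instance remaps few keys; plain modulo is shown here only for clarity, and the addresses are hypothetical:

```python
import hashlib

instances = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

def instance_for(key: str) -> str:
    """The client, not the server, decides which memcached instance owns a key."""
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return instances[h % len(instances)]

# Every client sharing this instance list and hash agrees on the owner,
# so no coordination between the memcached servers is needed.
assert instance_for("session:42") == instance_for("session:42")
```

This split of responsibility, routing in the client and storage in the server, is what lets memcached servers stay completely independent of one another.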
80. Outline (section divider: Data Engine Types)
81. It’s Not All Relational!
    - Amazon: S3, SimpleDB, Dynamo
    - Google: App Engine Datastore, BigTable
    - Microsoft: SQL Data Services, Azure storage
    - Facebook: Cassandra
    - LinkedIn: Project Voldemort
    - Also: Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable
82. Tuplespaces
    Basic concepts: no tables (containers and entities); no schema (each tuple has its own set of properties)
    - Amazon SimpleDB: strings only
    - Microsoft Azure SQL Data Services: strings, blob, datetime, bool, int, double, etc.; no cross-container joins as of now
    - Google App Engine Datastore: strings, blob, datetime, bool, int, double, etc.
83. Key-Value Stores
    - Google BigTable: a sparse, distributed, multi-dimensional sorted map, indexed by row key, column key and timestamp; each value is an uninterpreted array of bytes
    - Amazon Dynamo: data partitioned and replicated using consistent hashing; decentralized replica sync protocol; consistency through versioning
    - Facebook Cassandra: used for Inbox search; open source
    - Scalaris: keys stored in lexicographical order; improved Paxos to provide ACID; memory resident, no persistence
84. In Summary
    - Real-life scaling requires trade-offs; there is no silver bullet
    - Need to learn new things, and need to un-learn
    - Balance!
85. QUESTIONS?
86. Intelligent People. Uncommon Ideas.
    Licensed under Creative Commons Attribution Sharealike Noncommercial