Handling Data in Mega Scale Web Systems


  1. Vineet Gupta | GM – Software Engineering | Directi | http://www.vineetgupta.com
     Intelligent People. Uncommon Ideas.
     Licensed under Creative Commons Attribution-ShareAlike-NonCommercial.
  2. Digg
     - 22M+ users
     - Dozens of DB servers
     - Dozens of web servers
     - Six specialized graph database servers to run the recommendation engine
     Source: http://highscalability.com/digg-architecture
  3. Technorati
     - 1 TB of data / day
     - 100M blogs indexed / day
     - 10B objects indexed / day
     - 0.5B photos and videos
     - Data doubles every 6 months
     - Users double every 6 months
     Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
  4. - 2 PB raw storage
     - 470M photos, 4–5 sizes each
     - 400k photos added / day
     - 35M photos in Squid cache (total)
     - 2M photos in Squid RAM
     - 38k requests / sec to Memcached
     - 4B queries / day
     Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
  5. eBay
     - Virtualized database spanning 600 production instances in 100+ server clusters distributed over 8 datacenters
     - 2 PB of data
     - 26B queries / day
     - 1B page views / day
     - 3B API calls / month
     - 15,000 app servers
     Source: http://highscalability.com/ebay-architecture/
  6. Google
     - 450,000 low-cost commodity servers in 2006
     - Indexed 8B web pages in 2005
     - 200 GFS clusters (1 cluster = 1,000–5,000 machines)
     - Read/write throughput of 40 GB/sec across a cluster
     - MapReduce:
       - 100k jobs / day
       - 20 PB of data processed / day
       - 10k MapReduce programs
     Source: http://highscalability.com/google-architecture/
  7. - Data size: ~PB
     - Data growth: ~TB / day
     - Number of servers: tens to 10,000
     - Number of datacenters: 1 to 10
     - Queries: billions+ / day
  8. [Diagram: a single host running the app server and DB server on shared CPU and RAM]
  9. Vertical scaling hardware: Sun Fire X4640 M2 (8 × 6-core 2.6 GHz, ~$27k–$170k) vs. Dell PowerEdge R200 (dual-core 2.8 GHz, ~$550)
  10. Vertical scaling: increasing the hardware resources on a host
      - Pros
        - Simple to implement
        - Fast turnaround time
      - Cons
        - Finite limit
        - Hardware does not scale linearly (diminishing returns per incremental unit)
        - Requires downtime
        - Increases the impact of downtime
        - Incremental costs rise exponentially
  11. [Diagram: the app layer talking to a single database holding tables T1–T4]
  12. [Diagram: the app layer fanned out to five nodes, each holding a full copy of tables T1–T4]
      - Each node has its own copy of the data
      - Shared-nothing cluster
  13. Replication
      - Read : write ratio is typically 4:1, so scale reads at the cost of writes!
      - Duplicated data: each node has its own copy
      - Master–slave: writes are sent to one node and cascaded to the others
      - Multi-master: writes can be sent to multiple nodes, which can lead to deadlocks and requires conflict management
  14. [Diagram: master–slave replication — the app layer writes to one master, which replicates to four slaves]
      - n× writes, asynchronous vs. synchronous
      - The master is a single point of failure (SPOF)
      - With async replication, critical reads must go to the master!
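The master–slave split above can be sketched as a routing rule in the app layer. This is a minimal illustration; the node names and the statement-prefix heuristic are assumptions for the sketch, not something the talk prescribes:

```python
import random

class MasterSlaveRouter:
    """Route writes (and critical reads) to the master, ordinary reads to slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = slaves

    def route(self, statement: str) -> str:
        # Writes must go to the master, which cascades them to the slaves.
        if statement.strip().upper().startswith(("INSERT", "UPDATE", "DELETE")):
            return self.master
        # Ordinary reads are spread across the slave pool.
        return random.choice(self.slaves)

router = MasterSlaveRouter("master-db", ["slave-1", "slave-2", "slave-3"])
```

Under asynchronous replication, a read-your-writes-sensitive query would also have to be sent to the master, since a slave may lag behind.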
  15. [Diagram: multi-master replication — the app layer writes to two masters, each replicating to slaves]
      - n× writes, asynchronous vs. synchronous
      - No SPOF
      - Conflicts! O(N²) or O(N³) resolution
  16. [Diagram: every write is replicated to every server, so per-server capacity degrades as replicas are added — 4 reads + 1 write, then 2 reads + 1 write, then 1 read + 1 write per server]
  17. Partitioning
      - Vertical partitioning
        - Divide data by tables / columns
        - Scales to as many boxes as there are tables or columns
        - Finite
      - Horizontal partitioning
        - Divide data by rows
        - Scales to as many boxes as there are rows!
        - Effectively limitless scaling
  18. [Diagram: the app layer talking to a single database holding tables T1–T5]
  19. Vertical partitioning: [Diagram: tables T1–T5 split across five separate nodes]
      - Facebook example: the user table and the posts table can live on separate nodes
      - Joins must be done in application code (so why have them?)
  20. Horizontal partitioning: [Diagram: tables T1–T5 present on three shards — the first million rows, the second million rows, the third million rows]
  21. Partitioning schemes
      - Value-based
        - Split on the timestamp of posts
        - Split on the first letter of the user name
      - Hash-based
        - Use a hash function to determine the cluster
      - Lookup map
        - First come, first served
        - Round robin
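The value-based and hash-based schemes above can be sketched in a few lines. The node names and the choice of MD5 are illustrative assumptions, not part of the talk:

```python
import hashlib

NODES = ["db-0", "db-1", "db-2", "db-3"]  # hypothetical four-node cluster

def hash_partition(key: str) -> str:
    """Hash-based: a stable hash of the key picks the node."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return NODES[digest % len(NODES)]

def value_partition(username: str) -> str:
    """Value-based: split on the first letter of the user name."""
    bucket = (ord(username[0].lower()) - ord("a")) * len(NODES) // 26
    return NODES[min(max(bucket, 0), len(NODES) - 1)]
```

Hash-based placement spreads keys evenly but makes range scans awkward; value-based placement keeps related rows together but can create hot spots (e.g. many user names starting with the same letter).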
  22. Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=
  23. - In distributed systems, much weaker forms of consistency are often acceptable, e.g. when:
        - There are only a few (or even one) possible writers of the data, and/or
        - The data is read-mostly (seldom modified), and/or
        - Stale data is acceptable
      - Eventual consistency: if no updates take place for a long time, all replicas will eventually become consistent
      - Implementation: need only ensure that updates eventually reach all of the replicated copies of the data
  24. Session consistency guarantees
      - Monotonic reads: if a process sees version x at time t, it will never see an older version at a later time
      - Monotonic writes: a write operation by a process on data item x is completed before any successive write operation on x by the same process
      - Read your writes: the effect of a write operation by a process on data item x will always be seen by a successive read operation on x by the same process
      - Writes follow reads: a write occurs on a copy of x that is at least as recent as the last copy read by the process
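Of these guarantees, monotonic reads is the easiest to sketch from the client side. This is a toy illustration under the assumption that each read returns a (version, value) pair; real systems track versions per object and per session:

```python
class MonotonicReader:
    """Client-side monotonic-reads session: remember the newest version
    seen so far and never surface an older one from a lagging replica."""

    def __init__(self):
        self.latest_version = -1
        self.latest_value = None

    def read(self, version: int, value):
        # A lagging replica answered with stale data: keep the remembered value.
        if version < self.latest_version:
            return self.latest_value
        self.latest_version, self.latest_value = version, value
        return value
```

A real implementation would instead re-issue the read against a sufficiently fresh replica, but the bookkeeping is the same: the session carries a high-water version mark.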
  25. - Many kinds of computing are "append-only"
        - Lots of observations are made about the world: debits, credits, purchase orders, customer change requests, etc.
        - As time moves on, more observations are added; you can't change the history, but you can add new observations
      - Derived results may be calculated, e.g. an estimate of the "current" inventory (frequently inaccurate)
      - Historic rollups are calculated, e.g. monthly bank statements
  27. - 5 joins for 1 query!
        - Do you think Facebook would do this?
        - And how would you do joins with partitioned data?
      - Denormalization removes joins
      - But it increases data volume — however, disk is cheap and getting cheaper
      - And it can lead to inconsistent data — but only if we do UPDATEs and DELETEs
  28. Normalization's goal is eliminating update anomalies
      - Data can be changed without "funny behavior"
      - Each data item lives in one place

      | Emp # | Emp Name | Mgr # | Mgr Name | Emp Phone | Mgr Phone |
      |-------|----------|-------|----------|-----------|-----------|
      | 47    | Joe      | 13    | Sam      | 5-1234    | 6-9876    |
      | 18    | Sally    | 38    | Harry    | 3-3123    | 5-6782    |
      | 91    | Pete     | 13    | Sam      | 2-1112    | 6-9876    |
      | 66    | Mary     | 02    | Betty    | 5-7349    | 4-0101    |

      Classic problem with denormalization: you can't update Sam's phone number in one place, since there are many copies. Denormalization is OK if you aren't going to update!
      Source: http://blogs.msdn.com/pathelland/
  29. - Partitioning for scaling; replication for availability
      - No ACID transactions
      - No JOINs
      - Immutable data: no cascaded UPDATEs and DELETEs
  31. - Partitioning, for read/write scaling
      - Replication, for availability
      - Versioning, for immutable data
      - Eventual consistency
      - Error detection and handling
  32. - Google: BigTable
      - Amazon: Dynamo
      - Facebook: Cassandra (BigTable + Dynamo)
      - LinkedIn: Voldemort (similar to Dynamo)
      - Many more
  33. Amazon
      - Tens of millions of customers served at peak times
      - Tens of thousands of servers
      - Both customers and servers distributed worldwide
  34. Dynamo — http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
      - Eventually consistent data store
      - Always writable
      - Decentralized
      - All nodes have the same responsibilities
  36. Partitioning in Dynamo
      - Similar to Chord
        - Each node gets an ID from the space of keys
        - Nodes are arranged in a ring
        - Data is stored on the first node clockwise of the data key's position
      - Replication: preference lists of N nodes following the associated node
  37. Virtual nodes
      - A problem with the Chord scheme: nodes are placed randomly on the ring, which leads to uneven data and load distribution
      - In Dynamo: "virtual" nodes
        - Each physical node has multiple virtual nodes; more powerful machines have more virtual nodes
        - Virtual nodes are distributed across the ring
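The ring-with-virtual-nodes placement described above can be sketched as follows. This is a toy construction under assumed names (the use of MD5 and the vnode-count default are illustrative); Dynamo's production implementation differs in detail:

```python
import bisect
import hashlib

def _position(name: str) -> int:
    """Map a key or virtual-node name onto the ring's key space."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Consistent-hash ring with virtual nodes, Dynamo-style."""

    def __init__(self, vnodes_per_node: int = 8):
        self.vnodes_per_node = vnodes_per_node
        self.ring = []  # sorted list of (position, physical node)

    def add_node(self, node: str, weight: int = 1) -> None:
        # More powerful machines get proportionally more virtual nodes.
        for i in range(self.vnodes_per_node * weight):
            self.ring.append((_position(f"{node}#vn{i}"), node))
        self.ring.sort()

    def node_for(self, key: str) -> str:
        # Data lives on the first virtual node clockwise of the key's position.
        idx = bisect.bisect(self.ring, (_position(key), ""))
        return self.ring[idx % len(self.ring)][1]
```

Adding or removing one physical node only remaps the keys owned by that node's virtual positions, and scattering many virtual nodes per machine evens out the load imbalance of random placement.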
  38. Versioning
      - Each update generates a new timestamp; vector clocks are used
      - Eventual consistency: multiple versions of the same object might coexist
      - Syntactic reconciliation: the system may be able to resolve conflicts automatically
      - Semantic reconciliation: conflict resolution is pushed to the application
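The vector-clock comparison that separates "supersedes" from "conflicts" can be sketched directly; a minimal version, representing a clock as a dict from node name to counter (the node names are made up for the example):

```python
def bump(clock: dict, node: str) -> dict:
    """Each update creates a new version: copy the clock and increment
    the coordinating node's counter."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a: dict, b: dict) -> bool:
    """True if version a supersedes version b (a has seen every event b has)."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def in_conflict(a: dict, b: dict) -> bool:
    """Concurrent versions: neither descends from the other. This is the
    case handed to syntactic or semantic reconciliation."""
    return not descends(a, b) and not descends(b, a)

v1 = bump({}, "sx")   # first write, coordinated by node sx
v2 = bump(v1, "sx")   # a later write through sx: supersedes v1
v3 = bump(v1, "sy")   # a concurrent write through sy: conflicts with v2
```

Syntactic reconciliation handles the `descends` case automatically (the older version is discarded); the `in_conflict` case is what gets pushed to the application.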
  40. Request handling
      - A request arrives at a node (the coordinator)
        - Ideally this is the node responsible for the particular key
        - Otherwise it forwards the request to the node responsible for that key, and that node becomes the coordinator
      - The first N healthy, distinct nodes following the key's position are considered for the request
      - The application defines:
        - N = number of replicas for each data item
        - R = number of nodes that must respond for a successful read
        - W = number of nodes that must respond for a successful write
      - R + W > N gives a quorum
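The quorum condition is a one-line check, but it captures why R + W > N matters: any set of R replicas must then overlap any set of W replicas, so a read always touches at least one replica holding the latest acknowledged write.

```python
def read_sees_latest_write(n: int, r: int, w: int) -> bool:
    """R + W > N: every R-node read set intersects every W-node write set
    among the N replicas, so no read can miss the latest committed write."""
    return r + w > n

assert read_sees_latest_write(3, 2, 2)       # overlap guaranteed
assert not read_sees_latest_write(3, 1, 1)   # low latency, but reads may be stale
```

Tuning R and W trades latency and availability against consistency, which is exactly the per-application flexibility the (N, R, W) configuration exposes.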
  41. Reads and writes
      - Writes
        - The coordinator generates a new vector clock and writes locally
        - It forwards the write to the N nodes; if W−1 respond, the write is successful
      - Reads
        - The coordinator forwards the read to the N nodes; if R−1 respond, the result is forwarded to the user
        - Only unique responses are forwarded
        - The user handles merging if multiple versions exist
  42. Handling failures
      - Sloppy quorum
        - Read/write operations are performed on the first N healthy nodes
        - Increases availability
      - Hinted handoff
        - If a node in the preference list is not available, the replica is sent to a node further down the list, with a hint containing the identity of the original node
        - The receiving node keeps checking for the original node; if it becomes available, the replica is transferred back to it
  43. Replica synchronization
      - Each node maintains a separate Merkle tree for each key range it hosts
      - Nodes exchange the roots of the trees for the key ranges they have in common
      - Divergent keys are determined quickly by comparing hashes down the tree
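The Merkle-tree comparison rests on one property: if two replicas' roots match, their key ranges are identical, so no data needs to be transferred. A toy construction of the root hash (the tree layout here is an assumption for illustration; the source does not specify Dynamo's exact scheme):

```python
import hashlib

def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(entries) -> bytes:
    """Root hash over an ordered list of key=value byte strings.
    Matching roots mean identical key ranges; a mismatch is narrowed
    down subtree by subtree until the divergent keys are found."""
    level = [_h(e) for e in entries]
    while len(level) > 1:
        if len(level) % 2:                     # duplicate the last node on odd levels
            level.append(level[-1])
        level = [_h(level[i] + level[i + 1])   # hash each pair into the parent level
                 for i in range(0, len(level), 2)]
    return level[0]
```

Comparing a single root hash replaces comparing the whole key range, and each level of mismatching subtrees halves the search space for the stale keys.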
  44. Membership and failure detection
      - Ring membership
        - Membership is explicit, to avoid rebalancing partition assignments
        - Background gossip builds a one-hop DHT
        - An external entity bootstraps the system, to avoid partitioned rings
      - Failure detection
        - If node A finds node B unreachable (while servicing a request), A uses other nodes to service requests and periodically checks B
        - A does not assume B has failed
        - There is no globally consistent view of failure (because ring membership is explicit)
  45. - (N, R, W) is application-configurable
      - Every node is aware of the data hosted by its peers
        - This requires gossiping the full routing table to other nodes
        - Scalability is limited by this to a few hundred nodes
        - A hierarchy may help overcome the limitation
  46. - Dynamo's typical (N, R, W) configuration is (3, 2, 2)
      - Some implementations vary (N, R, W)
        - An always-writable store might use W = 1 (shopping cart)
        - A product catalog might use R = 1 and W = N
      - The response requirement is 300 ms for any request (read or write)
  47. - Consistency vs. availability (fraction of reads returning a given number of versions):
        - 99.94% one version
        - 0.00057% two versions
        - 0.00047% three versions
        - 0.00009% four versions
      - Server-driven or client-driven coordination
        - Server-driven: uses load balancers; forwards requests to the desired set of nodes
        - Client-driven (about 50% faster): requires polling for Dynamo membership updates; the client is responsible for determining the appropriate nodes to send the request to
      - 99.9995% of requests receive a successful response (without timing out)
  52. - Enormous data (and high growth): traditional solutions don't work
      - Distributed databases: lots of interesting work happening
      - A great time for young programmers with problem-solving ability!
  54. Intelligent People. Uncommon Ideas. — Licensed under Creative Commons Attribution-ShareAlike-NonCommercial.