
Handling Data in Mega Scale Systems

Handling Data in Mega Scale Systems by Vineet Gupta, GM – Software Engineering, Directi.

  1. Intelligent People. Uncommon Ideas.
     Handling Data in Mega Scale Web Apps (lessons learnt @ Directi)
     Vineet Gupta | GM – Software Engineering | Directi
     http://vineetgupta.spaces.live.com
     Licensed under Creative Commons Attribution ShareAlike NonCommercial
  2. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  3. Not Covering:
     - Offline processing (batching / queuing)
     - Distributed processing (MapReduce)
     - Non-blocking IO
     - Fault detection, tolerance and recovery
  4. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  5. How Big Does it Get – Digg
     - 22M+ users
     - Dozens of DB servers, dozens of web servers
     - Six specialized graph database servers to run the recommendations engine
     Source: http://highscalability.com/digg-architecture
  6. How Big Does it Get – Technorati
     - 1 TB / day
     - 100M blogs indexed / day; 10B objects indexed / day
     - 0.5B photos and videos
     - Data doubles in 6 months; users double in 6 months
     Source: http://www.royans.net/arch/2007/10/25/scaling-technorati-100-million-blogs-indexed-everyday/
  7. How Big Does it Get
     - 2 PB raw storage
     - 470M photos, 4-5 sizes each; 400k photos added / day
     - 35M photos in Squid cache (total), 2M in Squid RAM
     - 38k requests / sec to Memcached
     - 4B queries / day
     Source: http://mysqldba.blogspot.com/2008/04/mysql-uc-2007-presentation-file.html
  8. How Big Does it Get – eBay
     - Virtualized database spanning 600 production instances in 100+ server clusters across 8 datacenters
     - 2 PB of data; 26B SQL queries / day
     - 1B page views / day; 3B API calls / month
     - 15,000 app servers
     Source: http://highscalability.com/ebay-architecture/
  9. How Big Does it Get – Google
     - 450,000 low-cost commodity servers in 2006
     - Indexed 8B web pages in 2005
     - 200 GFS clusters (1 cluster = 1,000-5,000 machines)
     - Read/write throughput of 40 GB / sec across a cluster
     - MapReduce: 100k jobs / day, 20 PB of data processed / day, 10k MapReduce programs
     Source: http://highscalability.com/google-architecture/
  10. Key Trends
      - Data size ~ PB; data growth ~ TB / day
      - Number of servers: 10s to 10,000s
      - Number of datacenters: 1 to 10
      - Queries: billions+ / day
      - Specialized needs – more than, or other than, an RDBMS
  11. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  12. Vertical Scaling (Scaling Up) – [diagram: a single host running both the App Server and the DB Server; scaled by adding CPUs and RAM to that host]
  13. Big Irons
      Sunfire E20k: 36 x 1.8 GHz processors, $450,000 - $2,500,000
      PowerEdge SC1435: dual-core 1.8 GHz processor, around $1,500
  14. Vertical Scaling (Scaling Up)
      Increasing the hardware resources on a host.
      Pros:
      - Simple to implement
      - Fast turnaround time
      Cons:
      - Finite limit
      - Hardware does not scale linearly (diminishing returns for each incremental unit)
      - Requires downtime, and increases downtime impact
      - Incremental costs increase exponentially
  15. Vertical Partitioning of Services – [diagram: the App Server and the DB Server moved onto separate hosts]
  16. Vertical Partitioning of Services
      Split services onto separate nodes; each node performs a different task.
      Pros:
      - Increases per-application availability
      - Task-based specialization, optimization and tuning possible
      - Reduces context switching
      - Simple to implement for out-of-band processes
      - No changes to the app required
      - Increases flexibility
      Cons:
      - Sub-optimal resource utilization
      - May not increase overall availability
      - Finite scalability
  17. Horizontal Scaling of App Server – [diagram: a Load Balancer in front of several identical Web Servers sharing one DB Server]
  18. Horizontal Scaling of App Server
      Add more nodes for the same service – identical nodes doing the same task.
      Load balancing: hardware balancers are faster; software balancers are more customizable. (A minimal balancer sketch follows.)
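     To make the two balancing styles concrete, here is a minimal sketch (names and server pool are illustrative, not from the deck) of the two simplest strategies a software balancer can use: plain round-robin, and an IP-hash variant that pins a client to one server – the "sticky" behaviour discussed on the next slides.

        import hashlib
        from itertools import cycle

        SERVERS = ["web1:8080", "web2:8080", "web3:8080"]  # hypothetical pool

        _rr = cycle(SERVERS)

        def pick_round_robin() -> str:
            # Each request goes to the next server in turn: even load, no stickiness.
            return next(_rr)

        def pick_sticky(client_ip: str) -> str:
            # Hash the client IP so the same client always lands on the same server.
            # This is what makes sessions "sticky" -- and what skews the load.
            digest = hashlib.md5(client_ip.encode()).hexdigest()
            return SERVERS[int(digest, 16) % len(SERVERS)]

        print(pick_round_robin())        # web1:8080
        print(pick_sticky("10.0.0.42"))  # always the same server for this IP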
  19. The Problem – State – [diagram: two users behind the Load Balancer; successive requests from the same user can land on different Web Servers, so in-memory state is lost]
  20. Sticky Sessions – [diagram: the Load Balancer pins each user to one Web Server]
      Cons: asymmetrical load distribution; sessions lost when a node goes down.
  21. Central Session Store – [diagram: every Web Server reads and writes sessions in one shared Session Store]
      Cons: the store is a SPOF; reads and writes generate network + disk IO.
  22. Clustered Sessions – [diagram: Web Servers replicate session state to one another]
  23. Clustered Sessions
      Pros: no SPOF; easier to set up; fast reads.
      Cons: n x writes; network IO increases with the number of nodes; stale data (rare).
  24. Sticky Sessions with Central Store – [diagram: the Load Balancer pins each user to a Web Server, with the central store as the fallback]
  25. More Session Management
      No sessions: stuff the state into a cookie and sign it! The cookie is sent with every request / response. (See the signed-cookie sketch below.)
      Super-slim sessions: keep a small amount of frequently used data in the cookie; pull the rest from the DB (or the central session store).
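     A minimal sketch of the "sign it!" idea, using an HMAC so the client cannot tamper with the state it carries (names are illustrative; a real app would also encrypt and expire the cookie):

        import base64, hashlib, hmac, json

        SECRET = b"server-side-secret"  # never sent to the client

        def make_cookie(state: dict) -> str:
            payload = base64.urlsafe_b64encode(json.dumps(state).encode()).decode()
            sig = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
            return f"{payload}.{sig}"

        def read_cookie(cookie: str) -> dict:
            payload, sig = cookie.rsplit(".", 1)
            expected = hmac.new(SECRET, payload.encode(), hashlib.sha256).hexdigest()
            if not hmac.compare_digest(sig, expected):
                raise ValueError("tampered cookie")
            return json.loads(base64.urlsafe_b64decode(payload))

        cookie = make_cookie({"user_id": 42, "cart_items": 3})
        print(read_cookie(cookie))  # {'user_id': 42, 'cart_items': 3}

     Any web server can validate the cookie with nothing but the shared secret, which is what makes the "no sessions" recommendation on the next slide possible.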
  26. Sessions – Recommendation
      Bad: sticky sessions.
      Good: clustered sessions for a small number of nodes and/or a small write volume; central sessions for a large number of nodes or a large write volume.
      Great: no sessions!
  27. App Tier Scaling – More
      - HTTP accelerators / reverse proxies: static content caching, hand-off to a lighter HTTP server, async NIO on the user side, keep-alive connection pools
      - CDN: get closer to your user (Akamai, Limelight); IP anycasting
      - Async NIO
  28. Scaling a Web App
      App layer: add more nodes and load balance! Avoid sticky sessions. Avoid sessions!!
      Data store: tricky! Very tricky!!!
  29. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  30. Replication = Scaling by Duplication – [diagram: the App Layer over a single node holding tables T1-T4]
  31. Replication = Scaling by Duplication – [diagram: the App Layer over five nodes, each holding a full copy of T1-T4]
      Each node has its own copy of the data – a shared-nothing cluster.
  32. Replication
      Read : write is typically around 4:1 – scale reads at the cost of writes!
      Duplicate the data: each node has its own copy.
      Master-slave: writes are sent to one node and cascaded to the others.
      Multi-master: writes can be sent to multiple nodes; can lead to deadlocks; requires conflict management.
  33. Master-Slave – [diagram: the App Layer writes to one Master, which replicates to four Slaves]
      n x writes – async vs. sync; the master is a SPOF; with async replication, send critical reads to the master!
  34. Multi-Master – [diagram: the App Layer writes to two Masters, which replicate to each other and to Slaves]
      n x writes – async vs. sync; no SPOF; conflicts!
  35. Replication Considerations
      Asynchronous:
      - Guaranteed, but out-of-band, replication from master to slave
      - The master updates its own DB and returns a response to the client; replication to the slaves happens asynchronously
      - Faster response to the client, but slave data is marginally behind the master
      - Requires modifying the app to send critical reads and all writes to the master, and to load balance all other reads
      Synchronous:
      - Guaranteed, in-band replication from master to slave
      - The master updates its own DB and confirms that all slaves have updated theirs before responding to the client
      - Slower response to the client, but slaves match the master at all times
      - Requires modifying the app to send writes to the master and load balance all reads
  36. Replication Considerations
      At the RDBMS level:
      - Support may exist in the RDBMS or through a 3rd-party tool
      - Faster and more reliable
      - The app must send writes to the master, reads to any DB, and critical reads to the master
      At the driver / DAO level (see the sketch below):
      - The driver / DAO layer ensures writes are performed on all connected DBs, reads are load balanced, and critical reads are sent to a master
      - In most cases RDBMS-agnostic
      - Slower, and in some cases less reliable
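     A minimal sketch of the driver / DAO-level routing just described, assuming a hypothetical connection object with execute() / query() methods (a real layer would add pooling, retries and failover):

        import random

        class RoutingDAO:
            """Writes and critical reads go to the master;
            ordinary reads are load balanced across the slaves."""

            def __init__(self, master, slaves):
                self.master = master
                self.slaves = slaves

            def write(self, sql, params=()):
                return self.master.execute(sql, params)

            def read(self, sql, params=(), critical=False):
                # With async replication, a read that must see the latest
                # committed write can only be served by the master.
                conn = self.master if critical else random.choice(self.slaves)
                return conn.query(sql, params)

     A read-your-own-writes page (showing a profile immediately after it was edited, say) would pass critical=True; everything else can tolerate marginally stale slaves.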
  37. Diminishing Returns – [diagram: per-server load as replicas are added: 4R + 1W on one server, 2R + 1W each on two, 1R + 1W each on four – reads spread out, but every replica still applies every write]
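     The arithmetic behind the picture, as a tiny worked example (the workload numbers are assumed): reads spread across replicas, writes never do, so the write share of each server's load keeps growing.

        READS, WRITES = 4, 1   # fixed workload: 4 reads and 1 write per unit time

        for n in (1, 2, 4):
            reads_per_server = READS / n   # reads are spread across replicas
            writes_per_server = WRITES     # but EVERY replica applies every write
            share = writes_per_server / (reads_per_server + writes_per_server)
            print(f"{n} server(s): {reads_per_server:g}R + {writes_per_server}W each; "
                  f"writes are {share:.0%} of the load")

        # 1 server: 4R + 1W (writes 20% of load)
        # 2 servers: 2R + 1W each (33%)
        # 4 servers: 1R + 1W each (50%)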
  38. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  39. Partitioning = Scaling by Division
      Vertical partitioning: divide the data by tables / columns; scale to as many boxes as there are tables or columns – finite.
      Horizontal partitioning: divide the data by rows; scale to as many boxes as there are rows – practically limitless scaling.
  40. Vertical Partitioning – [diagram: the App Layer over a single node holding tables T1-T5]
      Note: a node here typically represents a shared-nothing cluster.
  41. Vertical Partitioning – [diagram: tables T1-T5 split across five nodes, one table per node]
      At Facebook, the user table and the posts table can be on separate nodes. Joins then need to be done in code (why have them?).
  42. Horizontal Partitioning – [diagram: three nodes, each holding all of T1-T5 but only a slice of the rows – the first million rows, the second million, the third million]
  43. Horizontal Partitioning Schemes (see the sketch below)
      - Value based: split on the timestamp of posts, or on the first letter of the user name
      - Hash based: use a hash function to determine the cluster
      - Lookup map: first come first served, or round robin
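     A minimal sketch of the hash-based and lookup-map schemes, with made-up shard names: hash routing is stateless, while the lookup map trades an extra indirection for full control over placement.

        SHARDS = ["db0", "db1", "db2", "db3"]

        def shard_by_hash(user_id: int) -> str:
            # Hash based: stateless and uniform, but resharding moves most keys.
            return SHARDS[hash(user_id) % len(SHARDS)]

        lookup = {}  # user_id -> shard; kept in a central store in a real system

        def shard_by_lookup(user_id: int) -> str:
            # Lookup map, first come first served: assign shards round robin
            # on first sight, then remember the placement forever.
            if user_id not in lookup:
                lookup[user_id] = SHARDS[len(lookup) % len(SHARDS)]
            return lookup[user_id]

        print(shard_by_hash(12345))    # db1 (12345 % 4 == 1)
        print(shard_by_lookup(12345))  # db0 (first user ever seen)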
  44. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  45. CAP Theorem – [diagram: of Consistency, Availability and Partition tolerance, a distributed system can guarantee at most two]
      Source: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.20.1495
  46. Transactions
      Transactions make you feel alone: no one else manipulates the data while you are.
      Transactional serializability: the behavior is as if a serial order exists.
      Source: http://blogs.msdn.com/pathelland/
  47. Life in the "Now"
      Transactions live in the "now" inside services. Time marches forward; transactions commit, advancing time; each transaction sees the previously committed transactions. A service's biz-logic lives in the "now".
      Source: http://blogs.msdn.com/pathelland/
  48. Sending Unlocked Data Isn't "Now"
      Messages contain unlocked data. Assuming no shared transactions, unlocked data may change; unlocking it allows change. So messages are not from the "now" – they are from the past. There is no simultaneity at a distance!
      - Similar to the speed of light: knowledge travels at the speed of light
      - By the time you see a distant object, it may have changed
      - By the time you see a message, the data may have changed!
      Services, transactions, and locks bound simultaneity!
      - Inside a transaction, things appear simultaneous (to others)
      - Simultaneity exists only inside a transaction, only inside a service!
      Source: http://blogs.msdn.com/pathelland/
  49. Outside Data: a Blast from the Past
      All data from distant stars is from the past: 10 light years away means 10-year-old knowledge. The sun may have blown up 5 minutes ago – we won't know for 3 minutes more…
      All data seen from a distant service is likewise from the "past": by the time you see it, it has been unlocked and may have changed. Each service has its own perspective: inside data is "now", outside data is "past". My inside is not your inside; my outside is not your outside.
      This is like going from Newtonian to Einsteinian physics:
      - Newton's time marched forward uniformly, with instant knowledge – classic distributed computing makes many systems look like one (RPC, 2-phase commit, remote method calls…)
      - In Einstein's world, everything is "relative" to one's perspective – today, no attempt is made to blur the service boundary
      Source: http://blogs.msdn.com/pathelland/
  50. Versions and Distributed Systems
      You can't have "the same" data at many locations, unless it is a snapshot. Changing distributed data needs versions – each version creates a snapshot…
      Source: http://blogs.msdn.com/pathelland/
  51. Subjective Consistency
      Given what I know here and now, make a decision. Remember the versions of all the data used to make this decision, and record the decision as being predicated on those versions. Other copies of the object may make divergent decisions. Try to sort out conflicts within the family; if necessary, programmatically apologize; very rarely, whine and fuss for human help.
      Subjective consistency: given the information at hand, make a decision and act on it – and remember the information at hand!
      Ambassadors had authority: back before radio, it could be months between communications with the king, so ambassadors made treaties, and much more, with binding authority. The mess was sorted out later!
      Source: http://blogs.msdn.com/pathelland/
  52. Eventual Consistency
      Eventually, all the copies of the object share their changes: "I'll show you mine if you show me yours!" Now apply subjective consistency: given the information at hand, make a decision and act on it. Since everyone ends up with the same information, everyone comes to the same conclusion about the decisions to take.
      - Given the same knowledge, produce the same result!
      - Everyone sharing their knowledge leads to the same result…
      This is NOT magic; it is a design requirement! Idempotence, commutativity, and associativity of the operations (the decisions made) are all implied by this requirement. (A sketch follows.)
      Source: http://blogs.msdn.com/pathelland/
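     A minimal sketch of why those three properties matter, using the simplest data type that has them – a grow-only set whose merge is set union (illustrative, not from the deck). Replicas can exchange state in any order, any number of times, and still converge:

        def merge(a: set, b: set) -> set:
            # Union is idempotent (a | a == a), commutative (a | b == b | a)
            # and associative -- exactly what eventual consistency demands
            # of the operations.
            return a | b

        replica1 = {"post-1", "post-2"}
        replica2 = {"post-2", "post-3"}
        replica3 = {"post-4"}

        # Gossip in two different orders; duplicates and reordering are harmless.
        order_a = merge(merge(replica1, replica2), replica3)
        order_b = merge(replica3, merge(replica2, merge(replica1, replica1)))
        assert order_a == order_b == {"post-1", "post-2", "post-3", "post-4"}

     Deletions and overwrites need more machinery (versions, tombstones), which is why the previous slide insists on remembering the versions a decision was predicated on.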
  53. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  54. Why Normalize?
      The classic problem with de-normalization: you can't update Sam's phone number, since there are many copies.

      Emp #   Emp Name   Mgr #   Mgr Name   Emp Phone   Mgr Phone
      47      Joe        13      Sam        5-1234      6-9876
      18      Sally      38      Harry      3-3123      5-6782
      91      Pete       13      Sam        2-1112      6-9876
      66      Mary       02      Betty      5-7349      4-0101

      Normalization's goal is eliminating update anomalies: each data item lives in one place, so it can be changed without "funny behavior". De-normalization is OK if you aren't going to update!
      Source: http://blogs.msdn.com/pathelland/
  55. Eliminate Joins – [diagram: an example schema where one query has to join several normalized tables]
  56. Eliminate Joins
      6 joins for 1 query! Do you think FB would do this? And how would you do joins with partitioned data?
      De-normalization removes joins, but increases data volume (disk is cheap and getting cheaper) and can lead to inconsistent data if you are careless – in practice, not really an issue. (A sketch follows.)
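     A minimal sketch of the trade-off (illustrative structures, not from the deck): a de-normalized post carries a copy of the author's name, so rendering needs no join – the cost is a fan-out update when the name changes.

        users = {13: {"name": "Sam"}}
        posts = [
            {"id": 1, "author_id": 13, "author_name": "Sam", "text": "hello"},
            {"id": 2, "author_id": 13, "author_name": "Sam", "text": "again"},
        ]

        def render(post):
            # No join: everything needed to display the post is in the row itself.
            return f'{post["author_name"]}: {post["text"]}'

        def rename_user(user_id, new_name):
            # The update anomaly from "Why Normalize?", handled explicitly:
            # every duplicated copy must be rewritten.
            users[user_id]["name"] = new_name
            for p in posts:
                if p["author_id"] == user_id:
                    p["author_name"] = new_name

        rename_user(13, "Samuel")
        print(render(posts[0]))  # Samuel: hello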
  57. "Append-Only" Data
      Many kinds of computing are append-only. Lots of observations are made about the world – debits, credits, purchase orders, customer change requests, etc. As time moves on, more observations are added; you can't change the history, but you can add new observations.
      Derived results may be calculated: an estimate of the "current" inventory, frequently inaccurate. Historic rollups are calculated: monthly bank statements.
  58. Databases and Transaction Logs
      The transaction log is the truth: high-performance, write-only, describing ALL the changes to the data. The database is the current opinion: the latest value of the data as perceived by the application.
      The database is a caching of the transaction log! It is the subset of the latest committed values represented in the transaction log… (A sketch follows.)
      Source: http://blogs.msdn.com/pathelland/
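     A minimal sketch of both slides at once (the event shapes are made up): the append-only log is the truth, and the "database" is just a fold over it – replaying the log reproduces the current state.

        # Append-only log of observations; history is never rewritten.
        log = [
            ("receive", "widget", 10),
            ("ship",    "widget",  3),
            ("receive", "widget",  5),
        ]

        def replay(log):
            # The "database": the latest derived opinion, i.e. a cache of the log.
            inventory = {}
            for op, sku, qty in log:
                delta = qty if op == "receive" else -qty
                inventory[sku] = inventory.get(sku, 0) + delta
            return inventory

        print(replay(log))                 # {'widget': 12}
        log.append(("ship", "widget", 2))  # a new observation; old ones untouched
        print(replay(log))                 # {'widget': 10}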
  59. We Are Swimming in a Sea of Immutable Data – [image slide]
      Source: http://blogs.msdn.com/pathelland/
  60. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  61. Caching
      Makes scaling easier (cheaper). The core idea: read data from the persistent store into memory and keep it in a hash table; read from the cache first, and on a miss load from the persistent store. (A sketch follows.)
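     The core idea in a few lines, assuming a hypothetical load_from_db() for the persistent store – this is the "sideline" (cache-aside) pattern shown a few slides down, and the way memcached is typically used:

        cache = {}  # the in-memory hash table

        def load_from_db(key):
            # Stand-in for the real persistent store.
            return f"row-for-{key}"

        def get(key):
            if key in cache:             # hit: no trip to the store
                return cache[key]
            value = load_from_db(key)    # miss: read from the persistent store...
            cache[key] = value           # ...and remember it for next time
            return value

        get("user:42")  # miss -> loads and caches
        get("user:42")  # hit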
  62. Write-through Cache – [diagram: the App Server writes to the cache, and the cache updates the backing store synchronously]
  63. Write-back Cache – [diagram: the App Server writes to the cache, and the cache flushes to the backing store later, asynchronously]
  64. Sideline Cache – [diagram: the App Server reads and writes the store directly and keeps the cache up to date itself]
  65. Memcached – [title slide]
  66. How Does it Work
      An in-memory distributed hash table. A memcached instance manifests as a process (often on the same machine as the web server). The memcached client maintains a hash table recording which item is stored on which instance; each memcached server maintains a hash table recording which item is stored in which memory location. (A sketch of the client half follows.)
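     A minimal sketch of the client half of that two-level hashing – the first level runs client-side to pick a server (illustrative addresses; real clients typically use consistent hashing, the idea on the Dynamo slide below):

        import hashlib

        SERVERS = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]

        def server_for(key: str) -> str:
            # Level 1, in the client: which memcached instance owns this key?
            h = int(hashlib.md5(key.encode()).hexdigest(), 16)
            return SERVERS[h % len(SERVERS)]

        # Level 2 happens inside the chosen server: its own hash table maps
        # the key to the memory location holding the value.
        print(server_for("user:42:profile"))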
  67. Outline: Characteristics, App Tier Scaling, Replication, Partitioning, Consistency, Normalization, Caching, Data Engine Types
  68. It's Not All Relational!
      Amazon: S3, SimpleDB, Dynamo. Google: App Engine Datastore, BigTable. Microsoft: SQL Data Services, Azure storage. Facebook: Cassandra. LinkedIn: Project Voldemort.
      Also: Ringo, Scalaris, Kai, Dynomite, MemcacheDB, ThruDB, CouchDB, HBase, Hypertable.
  69. Tuplespaces
      Basic concepts: no tables (containers and entities instead); no schema – each tuple has its own set of properties.
      - Amazon SimpleDB: strings only
      - Microsoft Azure SQL Data Services: strings, blob, datetime, bool, int, double, etc.; no cross-container joins as of now
      - Google App Engine Datastore: strings, blob, datetime, bool, int, double, etc.
  70. Key-Value Stores (a consistent-hashing sketch follows)
      - Google BigTable: a sparse, distributed, multi-dimensional sorted map, indexed by row key, column key and timestamp; each value is an uninterpreted array of bytes
      - Amazon Dynamo: data partitioned and replicated using consistent hashing; decentralized replica synchronization protocol; consistency through versioning
      - Facebook Cassandra: used for Inbox search; open source
      - Scalaris: keys stored in lexicographical order; an improved Paxos provides ACID; memory-resident, no persistence
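     A minimal sketch of the consistent hashing Dynamo uses for partitioning, simplified to leave out virtual nodes and replication. Unlike the earlier modulo scheme, adding or removing a node only remaps the keys in one arc of the ring:

        import bisect
        import hashlib

        def _h(s: str) -> int:
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        class Ring:
            def __init__(self, nodes):
                # Place each node at a point on the hash ring.
                self._points = sorted((_h(n), n) for n in nodes)

            def node_for(self, key: str) -> str:
                # Walk clockwise from the key's position to the next node.
                hashes = [h for h, _ in self._points]
                i = bisect.bisect(hashes, _h(key)) % len(self._points)
                return self._points[i][1]

        ring = Ring(["nodeA", "nodeB", "nodeC"])
        print(ring.node_for("user:42"))
        # Adding "nodeD" would move only the keys that fall in nodeD's arc;
        # with hash(key) % N, almost every key would move.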
  71. In Summary
      Real-life scaling requires trade-offs; there is no silver bullet. You need to learn new things – and un-learn old ones. Balance!
  72. QUESTIONS?
  73. Intelligent People. Uncommon Ideas.
      Licensed under Creative Commons Attribution ShareAlike NonCommercial
