
Why and how of highly scalable web sites


A presentation I made on high scalability for websites/applications. Discusses why scalability is important, and provides an overview of scaling techniques for network, database and applications.

Published in: Technology


  1. Why and How of Highly S-c-a-l-a-b-l-e Web Sites
     Faizan Javed, Ph.D.
  2. Some stats to get you pumped up…
     Facebook:
       2007: 10m users → 2008: 60m → 2009: 120m → 2010: 500m+. 13 million queries per second.
       570 billion page views a month.
       1.2 million photos served per second.
     Amazon, in 1997:
       “Obidos”: 1 big DB, 1 big server. Not a whole lot of customers.
       Today: 100-150 services build a page. 60m+ unique users.
     Google: King of scalability.
       Started as a single-server research project in 1997.
       2005: indexed 8 billion pages. Now??
     YouTube: founded 2/2005.
       2006: 100m views per day.
       Now: 1 billion views per day.
  3. It’s not just the big guys…
     Zynga: social web gaming and Facebook apps.
       50m monthly users. 10m daily active users.
     Zynga’s FarmVille: fast and furious growth!
       1 million daily players after 4 days.
       10 million after 60 days.
       Currently 35 million+.
     .com:
       2009 stats: 30m users, 2 billion requests a month, 13,000 requests per second.
  4. Some definitions and comments…
     Scalable vs. high performance:
       Performance: blindingly fast for 1,000 users and 1 GB of data.
       Scalable: maintains that performance for up to 10 times the data and users.
     Scale up vs. scale out (hardware/architecture):
       Up: buy a bigger box! Easy, but cost doesn’t scale linearly.
       Out: add regular boxes! Cheap(er), but admin and load-balancing costs grow.
       How about hybrid systems? (Requires proper capacity planning.)
     Infrastructure/plumbing is important!
       Google’s PageRank algorithm was public almost immediately, but Google’s infrastructure stayed secret for far longer…
  5. How Web 2.0 is driving this
     The TechCrunch/Slashdot/Digg effect – rapid, unexpected customer demand/growth!
     Web 1.0 was mostly static pages:
       Push content to an RDBMS, cache it as HTML on the front end, round-robin the web servers, and voilà!
     Web 2.0’s social phenomena bring unique challenges…
       Comments, “liking”, content suggestions, earned virtual currency, reputation systems… all DYNAMIC and written back to the data store.
       Real-time social graphs (connectivity between people, places, and things).
     Facebook example:
       You are hip. You are popular. You have 500 friends. You log in…
       Facebook gathers the status of all 500 friends at the same time!
       500 requests, replies merged, services contacted, all in a reasonable amount of time!
  6. Technology adoption lifecycle
     Fast growth and rapid adoption occur here!
  7. “I AM GOING TO SCALE MY FOOT UP YOUR A**!”
     Ted Dziuba, the ‘anti-Arrington’:
     Rant 1: “Scalability is not your problem, getting people to give a sh*t is.”
       Discuss capacity planning – what do you need to scale to?
       “Every year, we take the busiest minute of the busiest hour of the busiest day and build capacity on that; we built our systems to (handle that load) and we went above and beyond that.” – Scott Gulbransen, Intuit spokesman
     Rant 2: “Saying ‘Rails doesn’t scale’ is like saying ‘my car doesn’t go infinitely fast.’”
       Don’t blame a single technology; chances are you are doing it wrong.
       PHP doesn’t “scale”? Java doesn’t “scale”? Both are used extensively by Google!
     Rant 3: Silicon Valley “machismo”:
       Yeah! I am gonna write a post about scalability and once it hits reddit everyone will know how hardcore I am!
  8. Why care about high scalability
     It’s not 1998 anymore:
       More and more Web 2.0 apps out there… more data to process.
       Raw lust for real-time data.
       Jack posts something… his friends expect it to “ping” and pop up on their screens…
     Web 3.0, the Semantic Web:
       The “Intelligent Web”: software (personal) agents that will make recommendations based on user (browsing) profiles.
     Cutting-edge web/software technology:
       Great advances being made in all areas of computer science…
       YouTube content-infringement detection is but one example.
     Start-ups and capacity planning:
       Good to be aware of what to scale to.
  9. Scaling: Hardware & Network
     Machine redundancy: master/slave, cold/warm/hot spares.
     Load balancing (for horizontal scaling):
       Hardware (Cisco routers – $$$), software (Pound, LVS).
       Layer 4 (TCP): sticky sessions (not needed in a REST model).
       Layer 7 (HTTP): mapping URLs to servers.
     Content caching:
       Reverse proxy: a load balancer that can cache static and dynamic content.
       CDN (content delivery network): Akamai, NetScaler, etc.
       Geographically dispersed caching of content to minimize network latency.
  10. Layer 4 (TCP) load balancing
     Round-robin algorithm: rotates among the listed servers.
     Least-connections algorithm: checks active connections and assigns the request to the server with the fewest (doesn’t overload servers that are handling slow queries).
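The two selection strategies above can be sketched in a few lines of Python. This is a toy model for illustration only, not any real balancer’s API; the `Balancer` class and server names are made up:

```python
class Balancer:
    """Toy layer-4 server selection: round robin vs. least connections."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.next = 0                           # round-robin cursor
        self.active = {s: 0 for s in servers}   # open connections per server

    def round_robin(self):
        # Rotate among the listed servers, regardless of their load.
        s = self.servers[self.next % len(self.servers)]
        self.next += 1
        return s

    def least_connections(self):
        # Pick the server currently handling the fewest connections,
        # so a box stuck on slow queries is not handed more work.
        return min(self.servers, key=lambda s: self.active[s])

    def open_conn(self, server):
        self.active[server] += 1

    def close_conn(self, server):
        self.active[server] -= 1
```

With two slow connections parked on one server, `least_connections()` steers new requests elsewhere, while `round_robin()` would keep sending it every Nth request.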
  11. Layer 7 (HTTP) load balancing
     Hash table: create an entry for each URL with the server to redirect it to.
     Simple indexing: apply a hash function to the URL; ensure a uniform distribution.
     Why do this? For serving large files a cache farm may be needed; layer 4 balancing will store duplicates on every cache, while layer 7 allows each object to exist only once on a cache server.
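The "simple indexing" idea above can be sketched as hashing the URL onto the list of cache servers, so every request for the same URL lands on the same cache and each object is stored only once. A minimal sketch (server names made up; md5 chosen only as a stable hash):

```python
import hashlib

def server_for(url, servers):
    # Stable hash of the URL mapped onto the server list; the same
    # URL always routes to the same cache server.
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return servers[int(digest, 16) % len(servers)]
```

One caveat worth knowing: with plain modulo, adding a cache server re-maps most URLs; a consistent-hashing scheme like Dynamo’s (covered later in the deck) keeps that reshuffling small.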
  12. Content Delivery Networks (CDNs)
     Performance golden rule:
       10-20%: time spent downloading the HTML document.
       80-90%: downloading the components in the page.
     Component servers closer to the user → fewer network hops → response times of many HTTP requests improved.
     An alternative to re-architecting the database and application.
     BUT… smaller companies/startups may not be able to afford commercial CDN services (Akamai, SAVVIS, etc.).
     Free CDN services: Globule, CoDeeN, CoralCDN.
  13. Scaling: Database/backend
     RDBMS such as MySQL, SQL Server, Oracle:
       Denormalize the database: reduce joins, create redundant data.
         Fixing data inconsistency is now the application’s job!
       Replication (to scale reads): master-slave, tree, master-master.
       Caching (to scale reads): a memcached cache layer in front of the database.
       Partitioning (to scale writes): clustering (vertical), federation/sharding (horizontal).
     What if your app does far more writes than RDBMS systems can handle?
  14. “NoSQL” movement: a data store based on key/value pairs
  15. Leading methodologies: Amazon Dynamo, Google BigTable
     Cassandra (Dynamo): used at Twitter, Facebook, Digg, Rackspace.
     Voldemort (Dynamo), CouchDB, MongoDB, HBase, etc.
  16. Brewer’s CAP theorem
     “…though it’s desirable to have Consistency, High Availability and Partition tolerance in every system, unfortunately no system can achieve all three at the same time.”
     Consistent: guarantees the state of the system at any time unless explicitly changed. Example 3 (a master-master setup) is not consistent.
     Available: examples 1 and 2 are not highly available; if a node goes down, there is total data loss.
     Partition tolerance: example 3 is partition tolerant but not consistent (the bank-ATM withdrawal example).
     BigTable: consistent + available.
     Dynamo: available + partition tolerant.
  17. Denormalizing the database – a simple example
     Query 1:
       SELECT product_name, order_date
       FROM orders INNER JOIN products USING (product_id)
       WHERE product_name LIKE 'A%' ORDER BY order_date DESC
     Scans the order_date index on the orders table, comparing product_name in the products table.
     Query 2:
       SELECT product_name, order_date
       FROM orders
       WHERE product_name LIKE 'A%' ORDER BY order_date DESC
     No join and a single index, but replicated data (product_name is copied into orders).
     Denormalization vs. normalization – which is better? For small n it doesn’t matter!
  18. Replication (scaling reads): Master-Slave
     The read/write ratio is generally 80/20 or 90/10.
     All writes are performed on the master.
     Each write goes to the master’s binary log and is transmitted to the slaves.
     Slaves are read-only.
     Provides more read power!
     A slave needs to be at least as powerful as the master, since it must replay every write the master performs.
     Every box has to perform every write, so this can’t scale writes.
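On the application side, master-slave replication boils down to read/write splitting: send every write to the master, spread reads across the slaves. A minimal sketch (the `ReplicatedPool` class and the crude SQL-verb check are illustrative assumptions, not any real driver’s API):

```python
import itertools

class ReplicatedPool:
    """Route writes to the master, round-robin reads over the slaves."""

    def __init__(self, master, slaves):
        self.master = master
        self.slaves = itertools.cycle(slaves)  # rotate over read replicas

    def connection_for(self, sql):
        # Naive classification by the leading SQL verb; a real pool
        # would be driven by transaction boundaries, not string parsing.
        verb = sql.lstrip().split()[0].upper()
        is_write = verb in ("INSERT", "UPDATE", "DELETE", "REPLACE")
        return self.master if is_write else next(self.slaves)
```

One subtlety the slide hints at: because slaves replay the binary log asynchronously, a read routed to a slave immediately after a write may see stale data.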
  19. Replication (scaling reads): Master-Master
     Provides high availability – each machine is a copy of the other.
     Writes are faster than with a single master.
     Problems with auto-incrementing IDs when records are inserted into both masters simultaneously.
       Solution: relax the reliance on IDs being sequential.
     Replication can be playing catch-up… consistency issues…
  20. Load balancing and replication
  21. Caching (scaling reads): Memcached
     function get_foo(int userid) {
         /* first try the cache */
         data = memcached_fetch("userrow:" + userid);
         if (!data) {
             /* not found: query the database */
             data = db_select("SELECT * FROM users WHERE userid = ?", userid);
             /* then store in the cache until the next get */
             memcached_add("userrow:" + userid, data);
         }
         return data;
     }
     Great for speeding up expensive fetch/read queries – e.g., a product details page on an e-commerce site.
     Not good for update/write-heavy workloads – each update causes a cache miss and a database call PLUS a cache update, impacting performance.
  22. Clustering (scaling writes)
     Vertical partitioning (easy but limited).
     Distribute tables across different clusters.
     Application logic needs to know where tables are located.
     Design: go through every query to check which tables it joins.
     Managing clusters is difficult.
     Increases SQL connections.
     Example: a large database with 6 tables:
       Cluster 1: tables 1 and 2
       Cluster 2: tables 3 and 4
       Cluster 3: tables 5 and 6
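The "application logic needs to know where tables are located" point above can be made concrete with a table-to-cluster map, plus the design check that a query must not join tables on different clusters. A sketch under the slide’s 6-table example (cluster names are made up):

```python
# Which cluster each table lives on, mirroring the slide's layout.
TABLE_TO_CLUSTER = {
    "table1": "cluster1", "table2": "cluster1",
    "table3": "cluster2", "table4": "cluster2",
    "table5": "cluster3", "table6": "cluster3",
}

def cluster_for(tables):
    """Return the cluster a query should run on, or fail if the query
    would need a cross-cluster join."""
    clusters = {TABLE_TO_CLUSTER[t] for t in tables}
    if len(clusters) > 1:
        raise ValueError("query joins tables on different clusters")
    return clusters.pop()
```

This is exactly the per-query audit the slide describes: every join is checked against the map, and cross-cluster joins are redesigned away.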
  23. Sharding (scaling writes)
     A “shared-nothing” partitioning scheme.
     Slice the data over multiple servers.
     E.g., a Users table:
       Odd user IDs on server 1.
       Even user IDs on server 2.
     Design: shard the data so that all related records reside on the same shard. Avoid cross-server joins!
     Referential integrity might need to be enforced in application code.
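The odd/even split above is a one-liner in routing code (connection strings here are made-up placeholders; the function name echoes the `GetDatabaseFor` helper on the next slide):

```python
def get_database_for(user_id):
    # Odd user IDs shard to server 1, even user IDs to server 2,
    # as on the slide. All of a user's related rows must use the
    # same function so they land on the same shard.
    return "dsn-server1" if user_id % 2 == 1 else "dsn-server2"
```

Note that modulo sharding makes adding a third server painful (most IDs re-map), which is one reason schemes like directory lookups or consistent hashing exist.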
  24. Code change due to good sharding…
     Before sharding:
       string connectionString = ConfigurationSettings.AppSettings["ConnectionInfo"];
       OdbcConnection conn = new OdbcConnection(connectionString);
       conn.Open();
       OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID = ?", conn);
       OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
       param.Value = customerId;
       OdbcDataReader reader = cmd.ExecuteReader();
     After sharding, only the connection-string lookup changes:
       string connectionString = GetDatabaseFor(customerId);
       OdbcConnection conn = new OdbcConnection(connectionString);
       conn.Open();
       OdbcCommand cmd = new OdbcCommand("SELECT Name, Address FROM Customers WHERE CustomerID = ?", conn);
       OdbcParameter param = cmd.Parameters.Add("@CustomerID", OdbcType.Int);
       param.Value = customerId;
       OdbcDataReader reader = cmd.ExecuteReader();
  25. Back to denormalization: an average social-network profile schema in great need of denormalization…
  26. DIGG case study: Denormalizing the database
     CREATE TABLE `Diggs` (
       `id` INT(11),
       `itemid` INT(11),
       `userid` INT(11),
       `digdate` DATETIME,
       PRIMARY KEY (`id`),
       KEY `user` (`userid`),
       KEY `item` (`itemid`)
     ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

     CREATE TABLE `Friends` (
       `id` INT(10) AUTO_INCREMENT,
       `userid` INT(10),
       `username` VARCHAR(15),
       `friendid` INT(10),
       `friendname` VARCHAR(15),
       `mutual` TINYINT(1),
       `date_created` DATETIME,
       PRIMARY KEY (`id`),
       UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
       KEY `Friend_friend` (`friendid`)
     ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

     The problem – the intersection of two sets:
       Users who dugg an item (millions of rows).
       Users who have befriended the digger (hundreds of millions of rows).
     Why Digg made the shift to Cassandra (a Dynamo/BigTable hybrid).
  27. DIGG case study: Denormalizing the database
     The JOIN is too slow in SQL, so do it in PHP:
       Query `Friends` for all my friends – 1.5 s with a cold cache.
       Query `Diggs` for any diggs of a specific item by a user in the set of friend user IDs – an enormous query, 14 s with a cold cache, that looks somewhat like:
     SELECT `digdate`, `id`
     FROM `Diggs`
     WHERE `userid` IN (59, 9006, 15989, 16045, 29183, 30220, 62511, 75212, 79006, … can balloon to hundreds of user IDs)
       AND `itemid` = 13084479
     ORDER BY `digdate` DESC, `id` DESC LIMIT 4;
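The "do the join in the application" step boils down to a set intersection in memory: one query fetches my friends’ IDs, another fetches the IDs of users who dugg the item, and the app intersects them. A sketch (in Python rather than Digg’s PHP; the sample IDs are taken from the query above):

```python
def friends_who_dugg(my_friend_ids, digger_ids):
    # Intersect the two ID sets in application code instead of
    # asking the database to perform the JOIN.
    return set(my_friend_ids) & set(digger_ids)
```

The intersection itself is cheap; the expense Digg hit was fetching and keeping the two (huge) input sets fresh, which is what pushed them toward a denormalized store like Cassandra.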
  28. Amazon Dynamo – a distributed storage system
     Traditional databases are hard to make redundant and parallel, and are a single point of failure.
     Two DB servers holding identical data are difficult to keep synchronized!
     Master/slave: the master has to take all the heat when writes are occurring!
     A huge issue for mega e-commerce sites.
     Adding more web servers doesn’t help… it’s the database that is the problem!
  29. Dynamo – a distributed storage system
     A ring of identical computers.
     Fault tolerance: data is redundant.
     An eventually consistent storage system:
       It is hard to make a distributed storage system both responsive and consistent… so redundancy is accomplished asynchronously.
     The partitioning algorithm (deciding which node stores an object, so the system can scale) is complex.
     Simple Put and Get interface:
       Put requires a key, a context, and the object; the context is used by Dynamo to validate requests.
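The "ring" and the partitioning step can be sketched with consistent hashing, which is the core of Dynamo’s placement scheme. This is a heavily simplified model (no virtual nodes, no replication to successor nodes, md5 chosen only as a stable hash): each node owns a point on the ring, and a key is stored on the first node clockwise from the key’s hash.

```python
import bisect
import hashlib

class Ring:
    """Toy consistent-hash ring: key -> first node clockwise."""

    def __init__(self, nodes):
        # Each node hashes to one point on the ring, kept sorted.
        self.points = sorted((self._hash(n), n) for n in nodes)

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        h = self._hash(key)
        ring_positions = [p for p, _ in self.points]
        # First node at or after the key's hash, wrapping past the top.
        i = bisect.bisect(ring_positions, h) % len(self.points)
        return self.points[i][1]
```

The payoff over plain modulo hashing: adding or removing a node only moves the keys in that node’s arc of the ring, instead of re-mapping nearly everything.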
  30. Google BigTable (open source: HBase)
     Applications store data rows in tables.
  31. A collection of rows is located by a (sortable) row key (and an optional timestamp).
  32. Columns may be sparse, and arbitrary in number.
  33. Column names take the form “<family>:<label>”.
  34. By default, only a single row at a time may be locked.
  35. Conceptual view.
     Google BigTable (open source: HBase) – physical storage view:
       Stored on a per-column-family basis.
       Empty cells of the conceptual view are not stored (requests for them return no value).
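The sparse data model described above can be sketched as a map from row keys to `"family:label"` columns; absent cells are simply never stored, and reading one returns nothing rather than a stored NULL. A minimal in-memory sketch (the `SparseTable` class is illustrative; the row/column values echo the web-page example from the BigTable paper):

```python
class SparseTable:
    """Toy sparse table: only written cells occupy storage."""

    def __init__(self):
        self.rows = {}  # row key -> {"family:label": value}

    def put(self, row_key, column, value):
        self.rows.setdefault(row_key, {})[column] = value

    def get(self, row_key, column):
        # An empty cell was never stored; the read just returns None.
        return self.rows.get(row_key, {}).get(column)
```

Because storage is keyed by column family, two rows can have completely disjoint sets of columns without wasting space on each other’s empty cells.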
  36. Scaling: Application
     Front-end “performance” enhancements:
       Minimize HTTP requests, use gzip compression, reduce DNS lookups, minify JavaScript and CSS, use a CDN, etc.
     “Special-purpose” computations:
       Crawled documents, web request logs, inverted indices, most frequent queries, etc.
       Enter MapReduce: a distributed processing framework by Google.
       Apache Hadoop – an open-source MapReduce, driven by Yahoo!
       “(key, value) pairs + a Map function and a Reduce function.”
       ** Powers 90%+ of Google’s jobs and apps internally.
       ** Recently supplanted by Google Percolator for Instant Search.
     Microsoft note: DryadLINQ, their answer to MapReduce.
  37. MapReduce/Hadoop
     For processing and generating large datasets on clusters.
     The input dataset is split into independent chunks.
     Operates on <key, value> pairs.
     Implicit parallelization: splitting and distributing the data, starting the maps and reduces, collecting the output.
     One master, multiple workers.
  38. MapReduce: WordCount example
     map(String key, String value):
       // key: document name
       // value: document contents
       for each word w in value:
         EmitIntermediate(w, "1");

     reduce(String key, Iterator values):
       // key: a word
       // values: a list of counts
       int result = 0;
       for each v in values:
         result += ParseInt(v);
       Emit(AsString(result));

     File 1: Hello World Bye World
     File 2: Hello Hadoop Goodbye Hadoop
     Output of the first map:  <Bye, 1> <Hello, 1> <World, 2>
     Output of the second map: <Goodbye, 1> <Hadoop, 2> <Hello, 1>
     Output of the job:
       <Bye, 1> <Goodbye, 1> <Hadoop, 2> <Hello, 2> <World, 2>
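The pseudocode above runs as plain Python in a few lines: map each document to `(word, 1)` pairs, then group by word and sum. This sketch runs sequentially in one process; the whole point of the real framework is that the map calls and the per-word reductions run in parallel across a cluster.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit an intermediate (word, 1) pair for every word in the document.
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    # Group the intermediate pairs by word and sum the counts.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

def word_count(docs):
    pairs = []
    for doc in docs:
        pairs.extend(map_phase(doc))
    return reduce_phase(pairs)
```

Running it on the two files from the slide reproduces the job output shown: Hello 2, World 2, Bye 1, Goodbye 1, Hadoop 2.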
  39. Word count on Amazon Elastic MapReduce
     DEMO
  40. Microsoft Dryad/DryadLINQ
     Microsoft’s answer to MapReduce.
     Dryad jobs are directed acyclic graphs (DAGs), vs. the fixed Map/Distribute/Sort/Reduce pipeline of MapReduce.
     MapReduce offers no fault tolerance between stages; big jobs can be more efficient with Dryad.
     Lets developers specify the data communication mechanisms of computations (TCP pipes, files, FIFOs, etc.).
     Allows an arbitrary number of inputs and outputs per computation (MapReduce is restricted to one of each).
  41. DryadLINQ = LINQ + Dryad
     Collection<T> collection;
     bool IsLegal(Key k);
     string Hash(Key k);
     var results = from c in collection
                   where IsLegal(c.key)
                   select new { Hash(c.key), c.value };
     (Diagram: the collection’s data is partitioned across machines, the C# query runs on each partition in parallel, and the pieces are merged into results.)
  42. Sawzall – parallel analysis of data
     An interpreted, procedural, domain-specific language for handling huge quantities of data.
     A type-safe scripting language that utilizes Google infrastructure.
     Used to process log data generated by Google servers.
     Suitable for the map phase of MapReduce. Widely used at Google.
     topwords: table top(3) of word: string weight count: int;
     fields: array of bytes = splitcsvline(input);
     w: string = string(fields[0]);
     c: int = int(string(fields[1]), 10);
     if (c != 0) {
       emit topwords <- w weight c;
     }
     Input:  abc,1  def,2  ghi,3  def,4  jkl,5
     Output: topwords[] = def, 6, 0
             topwords[] = jkl, 5, 0
             topwords[] = ghi, 3, 0
  43. Percolator – incremental distributed processing
     Near-instant processing of newly crawled web documents.
     Incremental processing vs. batch processing (MapReduce).
     Allows changes to the web index without rebuilding the entire index from scratch.
     Akin to database triggers: sits atop BigTable and can update the web index as new data arrives.
     “Fresher” web results, faster indexing and crawling.
  44. Summary of topics presented
     TBD