Dealing with web scale data

Google processes 400 petabytes of data every month and that was way back in 2007! With users generating massive amounts of data in social networking sites like Facebook and Twitter, and an increase in the use of sensor devices, the amount of data generated is only going to go up. Further, with the cost of hard-disks going down, and such data being made available to everyone, and with the advent of cloud computing, we now have the power to process such data ourselves.

What are the challenges of processing such massive amounts of data? With such data being available to every corporation, big or small, how does this change how we have been perceiving data? The talk takes you through some of the technologies used to tackle these challenges.

The talk has been tailored to suit students. It helps them relate to and appreciate the subjects they learn in their curriculum - data structures, programming languages, databases, operating systems, networking etc. At the same time, it describes some of the interesting work being done in the software industry in the areas of databases, data analysis, cloud computing etc.

  • 1. Dealing with Web Scale Data
    Gautham Pai, Founder, jnaapti
  • 2. A Few Guidelines
    Ask questions – be active
      What I cover depends on how active you are
    Learn concepts before technology
      You will be bombarded with several concepts, tools and technologies – just remember that you are learning to bridge concepts and technology.
      After this program, you should be comfortable dabbling with these concepts on your own – even reading things that are not covered today.
    http://jnaapti.com/
  • 3. The Different Vases
    Source: http://www.flickr.com/photos/bachmont/1382572541/
  • 4. The Different Vases
    Image labels: :( Not preferable, Ideal!, Sufficient
    Source: http://www.flickr.com/photos/bachmont/1382572541/
  • 5. Quick Poll
    How many of you are from a CS background?
    Knowledge of:
      Data Structures
      Algorithms
      Databases
    Have heard of:
      NoSQL
      Key-Value Stores
      Cloud Computing
      MapReduce
      Hadoop
  • 6. Part 0 – Setting the Context
  • 7. What is this talk about?
    2 themes in this talk:
      About data – how is it stored, how do we work with it
      About understanding technology via concepts learnt
  • 8. How much data are we talking about really?
    200 million Tweets per day – as of Jun 2011
    Wikipedia dump
      current revisions only - 31GB uncompressed
      entire history runs into multiple TBs uncompressed
    Common Crawl data – 10s of TBs
    Tumblr – adding 3TB of new data every day
    Google processes 25PB of data per day
    Facebook – 135+ billion messages a month
    Facebook – 130TB of logs generated per day
    Vestas - wind data - 18 to 24 petabytes of data to be processed
  • 9. We are dealing with a lot more data...
    Increase in the number of sensor devices
    Larger audience of users using our applications via the web and social networks results in increased data generation
    Cost of storage is falling – so we never discard any of the data
  • 10. What's in it for me?
    Scrabulous case study
      Built by 2 young chaps from Kolkata
      Both were in their early 20s when they built it
      One was still in college
      500,000 users daily – back in 2008, $25,000 in ad revenue per month
      Source: Wikipedia
    These days lots of apps are being built by college undergraduates. If they can do it, you can do it too!
  • 11. You have all it takes
    You have free access to a lot of the tools that big corporations use
    You have computing power available cheaply
    You have access to a lot of the data for free
  • 12. What do I need then?
    All you need is a little intelligence and a lot of perseverance and you are on your way!
  • 13. Questions to ask
    Ok, you have the resources
    You build a cool web application
    It is an overnight hit - can you handle it?
    What happens if the server has a disk crash?
    Can we prevent website outages on account of hardware failures?
    Image caption: Slashdot Effect
  • 14. Looking for answers
    What do technology companies like Google/Facebook/Twitter use to manage data?
      What challenges do they face in managing such huge volumes of data?
      How do they analyze such data?
    Image Source: http://opencompute.org/
  • 15. From concept to technology
    We learn quite a few subjects in Computer Science – data structures, algorithms, databases, networking, operating systems, graph theory, etc.
    Are we ever going to use this/need this as engineers?
    How do I use my knowledge of CS to understand the latest developments in the industry?
    Image Source: http://www.flickr.com/photos/nics_events/2223583947/
  • 16. From concept to technology
    This talk is about connecting concepts to real world examples
    Image Source: http://www.flickr.com/photos/nics_events/2223583947/
  • 17. A few snappy examples
    Analysis of question papers from various companies
    Analysis of image patterns in your photos and movie collections
    Analysis of your Facebook friends
      2nd degree connections
      Who is active at what time?
      Who talks about what?
  • 18. Part 1 – Dealing with Data
  • 19. What is this section all about?
    Before dealing with big-data problems, we first need to know how data is handled.
    This section tries to answer questions like:
      How is it that 0s and 1s are sufficient to do anything that a computer does?
      Why do we need data structures?
      Why do we need databases – why can't I just store all data as flat files?
  • 20. Computers – A Bit Processor
    Computers only understand bits
    They have a way to store and process these bits
    It is up to users to give the bits a “meaning”
    [Figure: a grid of 0s and 1s]
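As a small illustration of giving bits a "meaning", the sketch below reads the same four bytes in three different ways. The byte values are arbitrary and chosen only for this example; nothing here comes from the slides.

```python
import struct

# Four arbitrary bytes chosen for illustration (hypothetical data)
raw = bytes([0b01001010, 0b01001110, 0b01000001, 0b01000001])

# Interpretation 1: four ASCII characters
as_text = raw.decode("ascii")

# Interpretation 2: one big-endian 32-bit unsigned integer
as_uint = struct.unpack(">I", raw)[0]

# Interpretation 3: two big-endian 16-bit unsigned integers
as_shorts = struct.unpack(">HH", raw)

print(as_text, as_uint, as_shorts)
```

The bits never change; only the interpretation that the program applies to them does.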
  • 21. Data Structures
    A data structure is like a cast
    Pour your bits into it and a shape is created
    The shape helps us provide a meaning to the bits
    Image Source: http://www.flickr.com/photos/andrein/3020194734/
  • 22. Programming Languages
    The human mind does not understand bits. We need higher level constructs to process bits.
    This is where programming languages come in.
    They act as a bridge between what humans want to do and what machines understand.
    Image Source: http://www.flickr.com/photos/jurvetson/5872448596/
  • 23. Programming Languages
    Building blocks:
      Variables
      Types
      Operators
      Conditionals
      Looping
      Libraries
    Examples:
      a = 10; b = 20
      c = a + b
      if condition: do_this()
      for i in range(10): do_this()
      urllib.request.urlopen("http://yahoo.com/").read()
      [s.lower() for s in list_of_strings]
  • 24. Primitive Types
    Languages usually have two primitive types
      Numbers – Integers, Floats, Doubles etc
      Strings – A sequence of characters put together
    Why these two types? Why not just strings?
    Examples: bangalore, 123, 567.89, 0, -123, -567.89, 123
  • 25. Composite Types (or Collections)
    The world is complex
    We cannot model everything with only strings and numbers
    We need ways to put primitive values together to form more complex types
    Collections are a bag of values put together
    Bottom up v/s Top down
    Examples:
      Name → First Name + Last Name
      Phone No → (Country Code) Area Code + Subscriber Number
      Address → Door No + Street + City + State + Pin Code
      Composite of composites: Person → Name + Phone No + Address
      Group of People
  • 26. Collections – General Object Containers
    We can represent anything in the world using collections
    Collections can be mapped to bits
    Computers can interpret those bits
    As a matter of fact, this is what JSON allows you to do
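A minimal sketch of the idea above, using Python's standard `json` module: a person is modelled with nothing but maps, lists, strings and numbers, turned into bytes, and read back. All field names and values are made up for illustration.

```python
import json

# A hypothetical person record: a composite built from primitives
person = {
    "name": {"first": "Asha", "last": "Rao"},
    "phone": {"country_code": 91, "area_code": 80, "subscriber": "5550100"},
    "address": {
        "door_no": "12",
        "street": "MG Road",
        "city": "Bangalore",
        "state": "Karnataka",
        "pin_code": "560001",
    },
}

# A group of people is a list of such maps
group = [person]

encoded = json.dumps(group)            # collections -> text
as_bytes = encoded.encode("utf-8")     # text -> bytes, ready for disk or network
decoded = json.loads(as_bytes.decode("utf-8"))  # and back again

print(decoded[0]["name"]["first"])
```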
  • 27. Collections
    Three basic types of collections:
      Lists
      Sets
      Maps
  • 28. Collections – Lists
    Grocery shopping example
    Order of items matters
    Do items need to be of the same type?
    The key identifier is the position of the item in the list
    Operations on a list:
      add an item to the list
      remove an item from the list
      get an item from the list at a specific position
  • 29. Collections – Sets
    Items in a set are unique
    There is no definite order
    Operations on a set:
      Add items to the set
      Test if an item exists in the set
      Remove an item from the set
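The list and set operations named in the two slides above can be sketched with Python's built-in `list` and `set`. The grocery items are illustrative only.

```python
# Lists: order matters, items are identified by their position
groceries = ["toothpaste", "matchbox", "tomatoes"]
groceries.append("chips")       # add an item to the list
groceries.remove("matchbox")    # remove an item from the list
second = groceries[1]           # get the item at position 1 (positions start at 0)

# Sets: items are unique and have no definite order
seen = {"toothpaste", "chips"}
seen.add("chips")               # adding a duplicate changes nothing
seen.add("tomatoes")            # add a new item
has_chips = "chips" in seen     # test if an item exists in the set
seen.discard("toothpaste")      # remove an item from the set

print(groceries, second, sorted(seen), has_chips)
```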
  • 30. Collections – Maps
    Lots of maps in the real world
    Indices are not always integers in the real world
    We may want to identify properties of an item, using some name
    Examples:
      Grocery list:
        Toothpaste - 1, Rs. 54
        Matchbox - 10, Rs. 15
        Tomatoes - 1kg, Rs. 10
        Chips - 1, Rs. 15
      Dictionary of word definitions
      Phone book containing phone numbers
  • 31. Collections – Maps contd...
    Maps allow us to associate a key with a value
    The name that is used to identify the set of properties is called a key
    The properties identified are called the value
    Examples:
      Grocery list: item is the key, properties are values
      Dictionary as a map: keys are the words, values are the definitions
      Phone book as a map: keys are the names, values are the phone numbers
  • 32. Collections – Maps contd...
    Keys don't have a definite order
    Operations on a map:
      Put a key, value pair
      Get a value for a key
      Get me all the keys and I will look at them one by one
    Important:
      The analogy breaks here - don't get confused with the way a map works – keys don't have an order...
      Looking up keys, not values - you don't say get me the word whose definition is ...
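The map operations above can be sketched with a Python `dict`, using the grocery list from the slides (the field names `quantity` and `price_rs` are assumptions made for this example). One caveat: Python's `dict` happens to remember insertion order, which is more than the abstract map described here promises, so it is safer not to rely on it.

```python
# Grocery list as a map: the item name is the key, its properties are the value
grocery_list = {
    "Toothpaste": {"quantity": "1", "price_rs": 54},
    "Matchbox": {"quantity": "10", "price_rs": 15},
    "Tomatoes": {"quantity": "1kg", "price_rs": 10},
}

# Put a key, value pair
grocery_list["Chips"] = {"quantity": "1", "price_rs": 15}

# Get a value for a key
price = grocery_list["Toothpaste"]["price_rs"]

# Get all the keys and look at them one by one
total = sum(grocery_list[item]["price_rs"] for item in grocery_list)

# Looking up by value is not a map operation: we have to scan every entry
items_at_15 = sorted(k for k, v in grocery_list.items() if v["price_rs"] == 15)

print(price, total, items_at_15)
```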
  • 33. More composite types
    List of lists
    List of maps
    Map of maps
    ...
    Examples:
      List of people is a list of maps
      Mailboxes containing mails is a map of maps
  • 34. Interfaces and Implementation
    Lists implemented using Arrays or Linked Lists
    Maps implemented using Hashtables
  • 35. Hashtables
    Run the key through a magic function that gives you a number
    The number identifies a slot in an array
    The magic function is called a “hash function” - it is chosen such that there are minimal collisions and the most uniform distribution
    Image Source: Wikipedia
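A toy hash table can make the slot idea concrete. This is a teaching sketch under assumptions of my own (the class name `ToyHashTable`, eight slots, and collisions handled by keeping a small list per slot); it is not how any particular library implements its maps.

```python
class ToyHashTable:
    """A minimal hash table that chains colliding keys in per-slot lists."""

    def __init__(self, slots=8):
        # The "array": one bucket (a list of key/value pairs) per slot
        self.slots = [[] for _ in range(slots)]

    def _slot_for(self, key):
        # hash() is Python's built-in hash function; modulo maps it to a slot number
        return hash(key) % len(self.slots)

    def put(self, key, value):
        bucket = self.slots[self._slot_for(key)]
        for i, (existing_key, _) in enumerate(bucket):
            if existing_key == key:
                bucket[i] = (key, value)   # overwrite an existing key
                return
        bucket.append((key, value))        # new key (possibly a collision)

    def get(self, key, default=None):
        for existing_key, value in self.slots[self._slot_for(key)]:
            if existing_key == key:
                return value
        return default


phone_book = ToyHashTable()
phone_book.put("asha", "5550100")
phone_book.put("ravi", "5550101")
print(phone_book.get("asha"), phone_book.get("nobody"))
```

Only the bucket for one slot is searched on each lookup, which is why a well-distributed hash function keeps lookups fast.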
  • 36. Gmail – An Example
    What data structures do we use here?
      Mail
      Mailbox
      Person
      Label
    A mailbox has a list of mails
    A mail can be represented using a map
  • 37. Gmail – An Example
    What is the mailbox size? How much RAM does a system have?
    If all the data of the world could fit into the RAM of a single machine, we wouldn't have a lot of the problems we face
    Luckily, that's not the case!
    Properties of RAMs
      Are limited in their capacity
      Are volatile (data disappears on reboot)
      Max data in memory is 256GB
    Conclusion: We need the disk
  • 38. Hmm... Our First “Big” Data Problem
    Let us say the data is present as a huge 7 GB file on the disk.
    What is the amount of time it takes to read this file into memory?
    How do I measure disk speeds?
  • 39. Measuring Disk Read Speed
      $ date; cat a_very_large_file > /dev/null; date
      $ iotop
  • 40. Disk Read Speed
    We can get disk read speeds close to 80MB/s
    Let's round it off to 100MB/s
    Reading 7000MB would take 70 seconds
    Would you wait if Gmail took 70 seconds to fetch your mails?
    Remember, parallel read accesses and writes slow it down further.
    Hmm, ok, this doesn't work, we need something faster, solution?
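The same measurement and arithmetic can be sketched in Python. The function names are hypothetical, and the demo reads a small temporary file that was just written, so the operating system's cache will make the measured number far higher than a cold read of a real 7 GB file; treat it as a template, not a benchmark.

```python
import os
import tempfile
import time


def read_throughput_mb_per_s(path, chunk_size=1024 * 1024):
    """Read a file sequentially and return the speed in MB/s (1 MB = 10**6 bytes)."""
    total_bytes = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    return total_bytes / 1e6 / max(elapsed, 1e-9)


def seconds_to_read(size_mb, speed_mb_per_s):
    """Time needed to read size_mb megabytes at the given speed."""
    return size_mb / speed_mb_per_s


# Demo on a small throwaway file (4 MiB of random bytes)
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(4 * 1024 * 1024))
    demo_path = tmp.name
try:
    speed = read_throughput_mb_per_s(demo_path)
finally:
    os.remove(demo_path)

print(f"measured: {speed:.0f} MB/s")
print(f"7000 MB at 100 MB/s: {seconds_to_read(7000, 100):.0f} s")
```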
  • 41. How do we solve this?
    Imagine a world where there are no databases - you have a hard-disk and you are asked to solve this problem.
    We need to be able to read only the data we want as quickly as we can. How do we solve this?
  • 42. Solution
    Store data in fixed sized records and then have a way to jump to the starting location of a specific record
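A minimal sketch of fixed sized records, assuming a made-up layout of a 4-byte id followed by a 20-byte name. An in-memory `io.BytesIO` stands in for a file on disk; `seek` works the same way on a real file opened in binary mode.

```python
import io
import struct

# Hypothetical record layout: 4-byte unsigned id + 20-byte name (NUL padded)
RECORD = struct.Struct(">I20s")


def write_records(f, rows):
    """Append each (id, name) row as one fixed-size record."""
    for rec_id, name in rows:
        f.write(RECORD.pack(rec_id, name.encode("utf-8")))


def read_record(f, index):
    """Jump straight to record number `index` without reading the ones before it."""
    f.seek(index * RECORD.size)
    rec_id, raw_name = RECORD.unpack(f.read(RECORD.size))
    return rec_id, raw_name.rstrip(b"\x00").decode("utf-8")


store = io.BytesIO()  # stands in for a file on disk
write_records(store, [(1, "asha"), (2, "ravi"), (3, "meera")])

third = read_record(store, 2)
print(RECORD.size, third)
```

Because every record has the same size, the starting location of record `n` is simply `n * RECORD.size`, which is what makes the jump possible.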
  • 43. Relational Databases
    Relational databases are an abstraction of your file system to deal with “relational” data.
  • 44. A word about Abstraction
    Reading from a disk
      Instruct the hardware to move the read head to a specific location, now read the data
    Reading from a file
      Open the file, read it, close it
    Reading from a database
      Connect to the DB, query for data, close connection
    One of the skills you can pick up as an engineer is being able to define an operation at every level of abstraction
  • 45. Relational Database Design
    Define Entities and their Relationships
    Handling 1..1, 1..n and m..n relationships
    Perform normalization
    Take the entities and their relationships and come up with tables, fields, primary keys and foreign keys
    Define queries to add, update, fetch and delete data
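The design steps above can be tried out with Python's built-in `sqlite3` module. The schema below is an assumption made for illustration (a `person` table and a `mail` table in a 1..n relationship, loosely following the Gmail example); it is not taken from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")          # throwaway in-memory database
conn.execute("PRAGMA foreign_keys = ON")    # SQLite enforces foreign keys only when asked

# Entities become tables; the 1..n relationship becomes a foreign key
conn.executescript("""
CREATE TABLE person (
    id    INTEGER PRIMARY KEY,
    name  TEXT NOT NULL
);
CREATE TABLE mail (
    id         INTEGER PRIMARY KEY,
    person_id  INTEGER NOT NULL REFERENCES person(id),
    subject    TEXT NOT NULL
);
""")

# Add
conn.execute("INSERT INTO person (id, name) VALUES (?, ?)", (1, "asha"))
conn.executemany(
    "INSERT INTO mail (person_id, subject) VALUES (?, ?)",
    [(1, "hello"), (1, "invoice")],
)
# Update
conn.execute("UPDATE mail SET subject = ? WHERE subject = ?", ("hello again", "hello"))
# Delete
conn.execute("DELETE FROM mail WHERE subject = ?", ("invoice",))
# Fetch, joining the two tables through the foreign key
rows = conn.execute(
    "SELECT person.name, mail.subject "
    "FROM mail JOIN person ON person.id = mail.person_id"
).fetchall()

print(rows)
```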
  • 46. Mapping Design to Implementation
    Data is stored in tables (which map to entities)
    Tables contain records (rows) and fields (columns)
    Records are of fixed length
    Records are stored sequentially
  • 47. Relational Databases – Storage Structure
    Use hash-tables to point to records in the tables – so individual records can be retrieved without having to search the entire dataset.
    This process is called “indexing”.
    In theory you can have many such indexes.
    Foreign keys are also indexed to speed up the lookup.
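Following the hash-table picture on this slide, an index can be sketched as a map from a key to the position of its record. The records and function names are hypothetical, and real database engines often use other structures (such as the B+ trees listed on the next slide) for their indexes.

```python
# A tiny "table": records stored sequentially, in no particular key order
records = [
    {"id": 17, "name": "asha"},
    {"id": 42, "name": "ravi"},
    {"id": 8, "name": "meera"},
]


def find_by_scan(rows, rec_id):
    """Without an index: look at every record until the key matches."""
    for row in rows:
        if row["id"] == rec_id:
            return row
    return None


# The index: key -> position of the record in the table
index_by_id = {row["id"]: position for position, row in enumerate(records)}


def find_by_index(rows, index, rec_id):
    """With an index: one hash lookup, then jump straight to the record."""
    position = index.get(rec_id)
    return rows[position] if position is not None else None


print(find_by_scan(records, 42), find_by_index(records, index_by_id, 42))
```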
  • 48. Data Storage Structures
    Ordered/unordered flat files
    ISAM
    Heaps
    Hash buckets
    B+ Trees
  • 49. Part 2 – Dealing with Web Scale Data
  • 50. Web Application Design
    Client/Server
    Distributed computing
    Models/Views/Controllers (MVC)
  • 51. Client/Server Model
  • 52. Client/Server Model – Separate DB Layer
  • 53. Problem 1 – Too Many Requests
    What if a thousand users access my server at the same time?
    If the server can handle 200 such requests in parallel in one second, what if I have 400 requests per second?
      1st second → 400 requests arrive, 200 are handled, 200 are left over
      2nd second → 600 requests (200 are from the previous second)
    Results in server thrashing
    Solution: Load Balanced Setup
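The backlog arithmetic can be played forward a few more seconds with a short simulation. The function name is hypothetical and the model is deliberately simple: a fixed arrival rate, a fixed capacity, and no request ever giving up.

```python
def pending_at_each_second(arrivals_per_s, capacity_per_s, seconds):
    """Return how many requests are waiting to be served in each second."""
    pending = []
    carried_over = 0
    for _ in range(seconds):
        waiting = carried_over + arrivals_per_s       # new arrivals join the leftovers
        pending.append(waiting)
        carried_over = max(0, waiting - capacity_per_s)  # what could not be served
    return pending


# 400 requests per second against a server that can handle 200 per second
print(pending_at_each_second(400, 200, 4))
```

With these numbers the queue grows by 200 every second and never recovers, which is the situation the load balanced setup is meant to avoid.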
  • 54. Load Balancing
  • 55. Load Balancing
    Load balancing is a way of parallelizing processing across multiple machines
    The load balancer acts as a proxy that streams requests and responses between the client and the processing server.
    Eg: HAProxy
    Stateful and Stateless Architectures
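One simple way a load balancer can spread requests is round robin: hand each new request to the next server in turn. The sketch below shows only that selection policy, with made-up backend names; it does not proxy any traffic and is not a description of how HAProxy works internally.

```python
import itertools


class RoundRobinBalancer:
    """Pick backends in a fixed rotating order (one possible balancing policy)."""

    def __init__(self, backends):
        self._cycle = itertools.cycle(list(backends))

    def pick(self):
        # Each call returns the next backend, wrapping around at the end
        return next(self._cycle)


balancer = RoundRobinBalancer(["app-1", "app-2", "app-3"])  # hypothetical servers
assignments = [balancer.pick() for _ in range(5)]
print(assignments)
```

This policy only works cleanly when the processing servers are stateless, since consecutive requests from the same user may land on different machines.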
  • 56. Problem 2 – Even More Requests
    What if the Load Balancer itself becomes the bottleneck?
    Solution:
      Round Robin DNS
      Building multiple independent clusters
  • 57. Clustered Setup
  • 58. Problem 3
  • 59. Problem 3 – The Stateful Database
    A single database cannot handle all requests from all users.
    Unlike front-end servers, databases are not “stateless”
    If we are trying to only read information, it's fine, but if we are trying to write information, this is a problem.
  • 60. Scale Up v/s Scale Out
    Scale up means to add resources (CPUs or memory) to a single system in order to increase its processing capabilities
      Scale up has limitations in how much we can scale – but is easier to do
    Scale out means to add more nodes to a system
      Scale out can provide near-linear scalability, is less expensive, but is complex compared to scale-up
  • 61. Scale Up Solution to the DB Problem
  • 62. Scale Up Solution to the DB Problem
    Increase the system's capacity by adding more resources to the system – faster disks, more RAM, faster processors, more cores etc
    Introduce on-the-fly compression of data in the database
    Scale up is not scalable enough
  • 63. Scale Out Solutions to the DB Problem
  • 64. Scale Out Solutions to the DB Problem
  • 65. Scale Out Solutions to the DB Problem
    Until the virtualization revolution and until we reached the limits of hardware, we were looking at scale up solutions rather than scale out solutions
    Partition your data and put it on multiple systems – a subset of the rows in each system
    This is called Sharding
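A minimal sketch of sharding by key, assuming one common scheme: hash the row's key and take it modulo the number of shards. Four in-memory dictionaries stand in for four database machines, and all names are made up for illustration.

```python
import hashlib


def shard_for(key, num_shards):
    """Map a key to a shard number using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Four dictionaries stand in for four separate database servers
shards = [dict() for _ in range(4)]


def put_row(key, row):
    shards[shard_for(key, len(shards))][key] = row


def get_row(key):
    return shards[shard_for(key, len(shards))].get(key)


put_row("user-1", {"name": "asha"})
put_row("user-2", {"name": "ravi"})
print(shard_for("user-1", 4), get_row("user-1"))
```

Each machine holds only a subset of the rows, and any single-key read or write touches exactly one machine. Note that the shard number depends on `num_shards`, so changing the number of machines moves most keys.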
  • 66. Issues with Sharding
    No clear way of partitioning the data
    Maintaining ACID (Atomicity, Consistency, Isolation, Durability) properties is complex
    Joining data across machines is complex
    Re-sharding is complex
  • 67. Other Issues with Relational Databases
    Data could be unstructured/semi-structured
    Impedance mismatch (ORM issues)
    Sparse values are not handled well - results in wastage of storage (although some engines handle this today)
    Changes in schema are difficult
    Not all data requires ACID/transactional support
    Normalization results in more queries and that means more disk accesses - some apps can do without them
  • 68. The NoSQL Revolution
    The NoSQL revolution happened to solve the many issues faced with storing web-scale data in relational databases
    NoSQL stores, as the name suggests, don't use SQL to store and retrieve data
    Widely adopted in web applications these days, several solutions available
    Still in research – no clear winner and therefore difficult to choose among alternatives
  • 69. Advantages of NoSQL Stores
    They don't require fixed schemas
    Avoid joins
    Sharding (Scale out) is easier – some even do it automatically
    Many of the implementations replicate the data and thus avoid SPOFs (Single Points of Failure)
  • 70. Examples of NoSQL Stores
    MongoDB
    CouchDB
    Neo4J
    Cassandra
    BigTable
    ...
  • 71. Types of NoSQL Stores
    Key/Value
    Document Stores
    Graph Databases
    Object Databases
    RDF Databases
    ...
  • 72. NoSQL Storage Structures
    Distributed Hashtables
    Consistent Hashing
    Order-Preserving Partitioning
    B-tree
    COW B-tree
    Stratified B-tree
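Of the structures listed, consistent hashing is easy to sketch. Nodes are placed at many points on a ring of hash values and a key belongs to the first node point at or after its own hash, wrapping around at the end. The class name, node names and the choice of 100 points per node are assumptions for this example, not details of any particular store.

```python
import bisect
import hashlib


def ring_hash(value):
    """Stable 64-bit hash used to place both nodes and keys on the ring."""
    return int.from_bytes(hashlib.sha256(value.encode("utf-8")).digest()[:8], "big")


class HashRing:
    def __init__(self, nodes, replicas=100):
        # Each node appears at `replicas` points so keys spread out evenly
        self._ring = sorted(
            (ring_hash(f"{node}#{i}"), node) for node in nodes for i in range(replicas)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key):
        # First node point clockwise from the key's hash, wrapping around at the end
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._ring)
        return self._ring[idx][1]


keys = [f"user-{i}" for i in range(1000)]
before = HashRing(["db-1", "db-2", "db-3", "db-4"])
after = HashRing(["db-1", "db-2", "db-3", "db-4", "db-5"])

# Count the keys whose owner changes when a fifth node is added
moved = sum(1 for k in keys if before.node_for(k) != after.node_for(k))
print(f"{moved} of {len(keys)} keys moved")
```

When a node is added, the only keys that move are the ones the new node takes over; everything else stays where it was, which is what makes re-sharding less painful than with a plain modulo scheme.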
  • 73. Part 3 – Analyzing Web Scale Data
  • 74. Examples of Web Scale Data Analysis
    Distributed Grep - Look for a pattern in all the Tweets
    Inverted Index Building - This is what is used by search engines
    Sentiment Analysis
    Competition Analysis
    Log Analysis
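To make "inverted index" concrete, here is a single-machine sketch: for every word, remember which documents contain it. The documents and function names are made up, and a real search engine does far more (tokenization, ranking, distribution across machines).

```python
from collections import defaultdict


def build_inverted_index(documents):
    """Map each lowercase word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index


def search(index, word):
    """Return the sorted ids of documents containing the word."""
    return sorted(index.get(word.lower(), set()))


documents = {  # hypothetical tiny corpus
    1: "Big data is big",
    2: "data stores at web scale",
    3: "map and reduce",
}
index = build_inverted_index(documents)
print(search(index, "data"), search(index, "reduce"), search(index, "missing"))
```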
  • 75. Understanding the problem of Analysis
    Unlike in the case of retrieving data, in the case of analysis, we need to read through everything, but reads are slow on the disk.
    Let's do some simple math:
      1 Hard Disk read speed is 100MB/s
      100 Hard Disks read in parallel gives 10GB/s!
    Can we exploit this parallelism?
  • 76. The Coin Counting Example
    You have a sack full of coins, and you are asked to separate them into 1, 2, 5 and 10 Rs coins and tell how many of each are present.
    Now, let's say you have a few sacks full of coins and it will take you a lot of time to count them yourself – so you call a few other people to help you out.
    Now, let's say there are a few rooms full of coins (like in some large temples in India) – how will you count them?
  • 77. Coin Counting Problem – in depth
    You can't add more people to the same room – the room is already full.
    You can get a few more rooms, ask people to take some coins to the other room and then do the counting there, and come back with the coins and the final count.
    This will mean a lot of “traffic” in the corridor.
    So what's a better solution?
  • 78. A Possible Solution to the Coin Counting Problem
    Unload the coins in different rooms rather than in the same room.
    Then get workers in different rooms. With an increase in coins, increase the number of rooms and workers.
    Let the workers in each room work independently.
    This is how Map/Reduce frameworks work
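The coin counting solution maps neatly onto Python's `map` and `functools.reduce`. In this sketch each inner list is one "room" of coins (made-up data), the map step counts a room on its own, and the reduce step merges the per-room counts. Everything runs in one process here; a real Map/Reduce framework would run the map step on the machines that hold each room's data.

```python
from collections import Counter
from functools import reduce

# Each inner list is one room full of coins (denominations in Rs)
rooms = [
    [1, 2, 5, 10, 1, 1],
    [5, 5, 10],
    [2, 2, 1, 10, 10],
]


def count_room(coins):
    """Map step: a worker counts the coins in one room, independently of the others."""
    return Counter(coins)


def merge_counts(left, right):
    """Reduce step: combine two partial counts into one."""
    return left + right


per_room = list(map(count_room, rooms))
totals = reduce(merge_counts, per_room, Counter())

print(dict(sorted(totals.items())))
```

Only the small per-room counts travel between rooms, not the coins themselves, which is what removes the "traffic" in the corridor.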
  • 79. Traditional Parallel Processing
    Use of threads, sharing data, synchronization
    Results in Deadlocks, Livelocks, Starvation etc
    Handling failures is complex
    Parallel Programming is hard this way.
  • 80. Requirements from a parallel processing framework
    Higher level programming constructs – don't need to deal with sockets, threading, locking, sharing data etc
    Manage failures - if a task fails or a system breaks down, we want the framework to transparently manage it
    Recoverability - If a system fails, another system must be able to pick up its workload
    Replication – if a system fails, we don't lose data – the framework should replicate data in multiple nodes
    Scalability – Adding more compute nodes should help us increase the compute capacity
  • 81. Pulling data Or Pushing Computations?
    Pulling data for computation results in a bottleneck
    Every “database store” also has a “processor”.
    Instead of pulling the data for computation, can we think about pushing the computation out to where the data resides?
    The computation is small – maybe a few MB of object code – which is trivial compared to the data it works on
  • 82. MapReduce
    Concept introduced by Google in 2004
    Framework is inspired by map and reduce functions found in functional programming languages
    Hadoop is an open-source implementation of MapReduce
  • 83. MapReduce Frameworks
    Data is spread throughout machines before starting the task
    Computation is done in the nodes where data is stored
    Data is replicated in multiple machines to increase reliability
    Tasks are executed on multiple nodes just in case one of them is running slow
  • 84. Using the Common Crawl Data – A Case Study
    The dump is a few 10s of TBs in size
    Where/How do you download it?
    Answer: You don't need to download it
    Instead you push your computation to where the data exists, perform your computation and then only fetch results you are interested in!
  • 85. Recap
    My knowledge of computer science:
      Am I ever going to use this/need this as an engineer?
      How do I use this knowledge to understand the latest developments in software engineering?
    Hope you have an answer now!
  • 86. Parting Thoughts
    Technology changes very rapidly – don't expect to be spoon-fed
    Practise, Practise, Practise - Katas
    Concept before Technology
    Try out new things – even if they are not related to your project/curriculum
    Read and understand other people's code
    Read a lot, for example: http://highscalability.com/
  • 87. We at jnaapti conduct workshops and provide training on these technologies – contact us at http://jnaapti.com/ for more details
  • 88. For feedback/clarification email: gautham-at-jnaapti.com
  • 89. Thanks and...
  • 90. All the Best
  • 91. Happy Hacking
  • 92. Sources
    Twitter - http://blog.twitter.com/2011/03/numbers.html
    Tumblr - http://highscalability.com/blog/2012/2/13/tumblr-architecture-15-bill
    Facebook log data - http://www.facebook.com/note.php?note_id=409881258919
    Facebook messages - http://highscalability.com/blog/2010/11/16/facebooks-new-real-time
    Vestas - http://www-01.ibm.com/software/success/cssdb.nsf/cs/RMUE-8NMJQ