Large Scale Data Processing & Storage



Presented at International Conference on Adv. and Emerging Technologies, ICAET 2010.


  1. 1. Large Scale Data Processing and Storage. Ilayaraja Prabakaran, Product Engineer
  2. 2. Agenda: Introduction to the large data problem; MapReduce programming model; Web mining using MapReduce; MapReduce with Hadoop; Hadoop Distributed File System; Elastic MapReduce; Scalable storage architecture
  3. 3. Large Data! (slides 3 to 6: image-only slides illustrating the scale of data)
  7. 7. Internet 2009! Websites: 234 million - the number of websites by December 2009; 47 million - websites added in 2009. Social media: 126 million - the number of blogs on the Internet (as tracked by BlogPulse); 27.3 million - tweets on Twitter per day (November 2009); 350 million - people on Facebook.
  8. 8. Internet 2009! Images: 4 billion - photos hosted by Flickr (October 2009); 2.5 billion - photos uploaded each month to Facebook. Videos: 1 billion - the total number of videos YouTube serves in one day; 924 million - videos viewed per month on Hulu in the US (November 2009).
  9. 9. The good news is that “Big Data” is here. The bad news is that we are struggling to store and analyze it. Anyway, should you worry about it?
  10. 10. Three papers:
- Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung, The Google File System, 19th ACM Symposium on Operating Systems Principles, Lake George, NY, October 2003.
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, OSDI'04: Sixth Symposium on Operating System Design and Implementation, San Francisco, CA, December 2004.
- Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber, Bigtable: A Distributed Storage System for Structured Data, OSDI'06: Seventh Symposium on Operating System Design and Implementation, Seattle, WA, November 2006.
  11. 11. Open-source solutions for MapReduce, GFS, and BigTable
  12. 12. MapReduce Programming model for processing multi-terabyte data sets on hundreds of CPUs in parallel. MapReduce provides: - Automatic parallelization and distribution - Fault tolerance - I/O scheduling - Status and monitoring
  13. 13. Programming model
Input & Output: a set of key/value pairs. The programmer specifies two functions:
map(in_key, in_value) -> list(out_key, intermediate_value)
  - Processes an input key/value pair
  - Produces a set of intermediate pairs
reduce(out_key, list(intermediate_value)) -> list(out_key, out_value)
  - Combines all intermediate values for a particular key
  - Produces a set of merged output values (usually just one)
  14. 14. Execution
  15. 15. Parallel Execution
  16. 16. Example Thinking in MapReduce
  17. 17. Sam’s Mother believed “an apple a day keeps a doctor away”. (Figure: Mother, Sam, an apple. Ref. SALSA HPC Group at Community Grids Labs)
  18. 18. One day Sam thought of drinking the apple. He used a knife to cut the apple and a juicer to make juice.
  19. 19. Next day Sam applied his invention to all the fruits he could find in the fruit basket: a list of values (the fruits) is mapped into another list of values (the juices), which gets reduced into a single value (the mixed juice). The classical notion of map and reduce in functional programming.
  20. 20. 18 Years Later Sam got his first job at JuiceRUs for his talent in making juice. Wait! Now it's not just one basket but a whole container of fruits (large data), and the output is a list of values: they produce the different juice types separately. But Sam had just ONE knife and ONE juicer. NOT ENOUGH!!
  21. 21. Brave Sam implemented a parallel version of his innovation. The (fruit, juice) pairs are grouped by key; each input to a reduce is a key and a value-list (possibly a list of these, depending on the grouping/hashing mechanism), e.g. (apple, (juice, juice, ...)), which is reduced into a list of values.
  22. 22. Brave Sam implemented a parallel version of his innovation. A list of (key, value) pairs is mapped into another list of (key, value) pairs, which gets grouped by the key and reduced into a list of values: the idea of MapReduce in data-intensive computing.
  23. 23. Word Count
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, 1);

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(output_key, AsString(result));
  24. 24. Word Count: Example
Input: a rose is a rose is a rose ...
Map outputs:    (a,1) (rose,1) (is,1) (a,1) (rose,1) (is,1) ...
Grouped by key: a -> (1,1,1,1)   rose -> (1,1,1,1)   is -> (1,1,1)
Reduce outputs: (a,4) (rose,4) (is,3)
  25. 25. Demo Time Let's have some fun ☺
  26. 26. rediff uses MapReduce for... Web crawling and indexing. Web data mining: - Reverse web-link graph - n-gram database - Anchor text analysis. Mining usage logs: - Related queries - Search suggest - Query classification
  27. 27. Reverse Web-link Graph (diagram: for each key URL, the values are its inlinks, each a (fromUrl, anchor) pair, with anchors such as "news", "rediff news", "rediff headlines", ...)
  28. 28. Web Graph: MapReduce
map(String input_key, String input_value):
  // input_key: from-url
  // input_value: document contents
  for each outlink x in input_value:   // parsed data
    to-url = x.url                     // outgoing link
    anchor = x.anchor                  // click-able text
    from-url = input_key
    EmitIntermediate(to-url, (from-url, anchor));
  29. 29. Web Graph: MapReduce
reduce(String output_key, Iterator intermediate_values):
  // output_key: a to-url
  // intermediate_values: a list of InLinks, i.e. (from-url, anchor) pairs
  result = new InLinks()
  for each v in intermediate_values:
    result.add(v.url, v.anchor)
  Emit(output_key, result);
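The same job can also be written against the Hadoop Java API introduced later in the deck (slide 36). The sketch below is illustrative only, not the speaker's code: it assumes the input arrives as (from-url, document contents) Text pairs (for example via KeyValueTextInputFormat), and parseOutlinks is a hypothetical helper that extracts (url, anchor) pairs from a page.

    import java.io.IOException;
    import java.util.Collections;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class ReverseLinkGraph {

      // map: (from-url, document contents) -> (to-url, "from-url \t anchor")
      public static class LinkMapper extends MapReduceBase
          implements Mapper<Text, Text, Text, Text> {
        public void map(Text fromUrl, Text contents,
                        OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          for (String[] link : parseOutlinks(contents.toString())) {
            String toUrl = link[0];   // outgoing link
            String anchor = link[1];  // click-able text
            output.collect(new Text(toUrl),
                           new Text(fromUrl.toString() + "\t" + anchor));
          }
        }
      }

      // reduce: (to-url, list of "from-url \t anchor") -> (to-url, all inlinks)
      public static class InLinkReducer extends MapReduceBase
          implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text toUrl, Iterator<Text> values,
                           OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
          StringBuilder inlinks = new StringBuilder();
          while (values.hasNext()) {
            if (inlinks.length() > 0) inlinks.append(" | ");
            inlinks.append(values.next().toString());
          }
          output.collect(toUrl, new Text(inlinks.toString()));
        }
      }

      // Hypothetical helper: real code would parse <a href="..."> tags from the document.
      private static List<String[]> parseOutlinks(String html) {
        return Collections.emptyList();
      }
    }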
  30. 30. Navigational Search
  31. 31. Anchor text mining Input: Web Graph Output: ranked set of anchors.
  32. 32. Anchor text mining: MapReduce
map(key, value):
  // key: to-url; value: Inlinks
  for each inlink 'i' in value:
    for each n-gram 'ng' in i.anchor:
      score = calc_rank(ng)
      emit((to-url, ng), score)
  33. 33. Anchor text mining: MapReduce
reduce(key, values):
  // key: (to-url, ng) pair; values: an iterator over scores
  agg_score = 0
  for each score 's' in values:
    agg_score = agg_score + s
  emit((to-url, ng), agg_score)
  34. 34. Hadoop: Open-source implementation of MapReduce
  35. 35. Hadoop Created by Doug Cutting; originated for Apache Nutch. Why "Hadoop"? Doug Cutting: "The name my kid gave a stuffed yellow elephant. Short, relatively easy to spell and pronounce, meaningless, and not used elsewhere: those are my naming criteria. Kids are good at generating such."
  36. 36. Implementation Hadoop: MapReduce APIs; HDFS: storage.
Mapper interface: map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
Reducer interface: reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
Programmers only have to implement these methods, which makes life easier! The framework takes care of splitting the work, data flow, execution, handling failures, and so on.
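To make these interfaces concrete, here is a minimal word-count Mapper and Reducer sketch against the classic org.apache.hadoop.mapred API, the Java counterpart of the pseudocode on slide 23. It is a sketch under standard assumptions (TextInputFormat input of (byte offset, line) pairs), not the speaker's own code.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCount {

      // Mapper: (byte offset, line of text) -> (word, 1)
      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          StringTokenizer tokens = new StringTokenizer(value.toString());
          while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            output.collect(word, ONE);   // EmitIntermediate(w, 1)
          }
        }
      }

      // Reducer: (word, [1, 1, ...]) -> (word, count)
      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));   // Emit(word, count)
        }
      }
    }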
  37. 37. Data flow
  38. 38. Map
  39. 39. Reduce
  40. 40. Driver Method
  41. 41. Combiner Performs local aggregation of the intermediate outputs, cutting down the amount of data transferred from the Mapper to the Reducer. Example: Mapper 1 emits (a,1) (rose,1) (is,1) (a,1), which the combiner collapses to (a,2) (rose,1) (is,1); Mapper 2 emits (a,1) (rose,1) (is,1) (rose,1), collapsed to (a,1) (rose,2) (is,1); the reducers then produce (a,3) (rose,3) (is,2).
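Slide 40 showed the driver method only as a diagram. The hedged sketch below shows what a typical JobConf-based driver for the word-count classes above might look like, including the combiner configuration just described. Here the Reduce class doubles as the combiner, which works because summing counts is associative and commutative; class names follow the earlier sketch, not the speaker's code.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class);  // local aggregation (slide 41)
        conf.setReducerClass(WordCount.Reduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);   // submits the job and waits for completion
      }
    }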
  42. 42. Variations Identity Reducer - Zero reduce tasks - Examples: "Cleaning web link graph", "Populating HDFS from other data sources" - The map does the job and writes the output to HDFS. MapReduce Chain - For problems that are not solvable by a single map and reduce phase - A series of map and reduce jobs is defined - The output of the previous job goes as input to the next job (see the sketch below).
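A minimal sketch of chaining, assuming each stage is configured the same way as the driver above: the first job's output directory simply becomes the second job's input directory. The configureJobOne/configureJobTwo helpers are hypothetical placeholders for whatever mapper and reducer classes each stage needs.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class ChainedJobs {
      public static void main(String[] args) throws Exception {
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]);   // output of job 1, input of job 2
        Path output = new Path(args[2]);

        JobConf first = configureJobOne();       // hypothetical: set mapper/reducer for stage 1
        FileInputFormat.setInputPaths(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        JobClient.runJob(first);                 // blocks until stage 1 completes

        JobConf second = configureJobTwo();      // hypothetical: set mapper/reducer for stage 2
        FileInputFormat.setInputPaths(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        JobClient.runJob(second);
      }

      private static JobConf configureJobOne() { return new JobConf(ChainedJobs.class); }
      private static JobConf configureJobTwo() { return new JobConf(ChainedJobs.class); }
    }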
  43. 43. Streaming Allows you to write map/reduce in any programming language, e.g. Python, C++, Perl, Bash. I/O is represented textually: input is read from stdin and output is written to stdout as tab-separated key/value pairs (format: key \t value \n). Example:
$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
  -input myInputDirs \
  -output myOutputDir \
  -mapper myPythonMapper.py \
  -reducer myPythonReducer.py
  44. 44. Pipes API that provides strong coupling between C++ code and Hadoop. Improved performance over streaming. Key and value pairs are STL strings. APIs: getInputKey(), getInputValue(). Example:
bin/hadoop pipes -input inputPath -output outputPath -program path/to/pipes/program/executable
  45. 45. Hadoop Distributed File System (HDFS)
  46. 46. HDFS design principles Handling hardware failures Streaming data access Storing very large files Running on cluster of commodity hardware Simple coherency model Data locality Portability
  47. 47. HDFS Architecture
  48. 48. HDFS Operation (Read)
  49. 49. HDFS operation (Write)
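The read and write paths shown in these slides are what the client library performs under the hood; from application code they are hidden behind the FileSystem API. A minimal sketch, assuming the default Configuration picks up the cluster's NameNode settings and using an example path of my choosing:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsReadWrite {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();     // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write: the client asks the NameNode for target DataNodes, then streams blocks to them.
        Path path = new Path("/tmp/hello.txt");       // example path only
        FSDataOutputStream out = fs.create(path, true);  // true = overwrite if it exists
        out.writeUTF("hello hdfs");
        out.close();

        // Read: the client gets block locations from the NameNode, then reads from DataNodes.
        FSDataInputStream in = fs.open(path);
        System.out.println(in.readUTF());
        in.close();

        fs.close();
      }
    }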
  50. 50. HDFS Robustness Name node failure, Data node failure and network partitions Heartbeats and Re-replication Cluster Rebalancing Data Integrity: checksum Metadata disk failure: FsImage, Editlog Snapshots
  51. 51. Anatomy of a Hadoop MapReduce job run on HDFS
  52. 52. Map/Reduce Processes Launching Application - User application code - Submits a specific kind of Map/Reduce job. JobTracker - Handles all jobs - Makes all scheduling decisions. TaskTracker - Manager for all tasks on a given node. Task - Runs an individual map or reduce fragment - Forks from the TaskTracker.
  53. 53. Process Diagram
  54. 54. Job Control Flow Application launcher creates and submits job. JobTracker initializes job, creates FileSplits, and adds tasks to queue. TaskTrackers ask for a new map or reduce task every 10 seconds or when the previous task finishes. As tasks run, the TaskTracker reports status to the JobTracker every 10 seconds. Application launcher stops waiting when the job completes.
  55. 55. Hadoop Map/Reduce Job Admin.
  56. 56. Progress of reduce phase
  57. 57. HDFS
  58. 58. Hadoop Benchmarking
  59. 59. Jim Gray’s Sort Benchmark Started by Jim Gray at Microsoft in 1998; currently managed by 3 of the previous winners. Sorts varying numbers of 100-byte records (10-byte key, 90-byte value). Multiple variants: Minute Sort - the sort must finish within 60 seconds; Terabyte Sort - sort 10^12 bytes; Gray Sort - sort at least 10^14 bytes in a run taking at least 1 hour.
  60. 60. Hadoop won Terabyte Sort ☺ Hadoop won this in 2008, taking 209 seconds to complete on 910 nodes, with 1800 maps and 1800 reduces. Each node had 2 quad-core Xeons @ 2.0 GHz and 8 GB RAM.
  61. 61. Terabyte Sort Task Timeline
  62. 62. Further stats.
Bytes     Nodes   Maps     Reduces   Replication   Time
500 GB    1406    8000     2600      1             59 s
1 TB      1460    8000     2700      1             62 s
100 TB    3452    190000   10000     2             173 min
1000 TB   3658    80000    20000     2             975 min
  63. 63. Petabyte Sort Task Timeline
  64. 64. Notes on Petabyte Sort 80,000 maps and 20,000 reduces Each node ran 2 maps and 2 reduces at a time Tail of maps was 100 minutes Tail of reduces was 80 minutes - caused by one slow node Used speculative execution The “waste” tasks at the end are mostly speculative execution
  65. 65. Cloud Computing Elastic MapReduce
  66. 66. Impact of Cloud
  67. 67. Definition and Characteristics “A pool of highly scalable, abstracted infrastructure, capable of hosting end-customer applications, that is billed by consumption.” Characteristics: Dynamic computing infrastructure; Service-centric approach; Self-service based usage model; Minimally or self-managed platform; Consumption-based billing.
  68. 68. Amazon web services (AWS) Elastic Compute Cloud (EC2) Elastic MapReduce Simple Storage Service (S3) Elastic Block Storage Elastic Load Balancing Amazon CloudWatch
  69. 69. Elastic MapReduce (EMR) Automatically spins up a Hadoop implementation of the MapReduce framework on an EC2 cluster. Sub-divides data in a job flow into smaller chunks so that they can be processed in parallel (the “map” function), and recombines the processed data into the final result (the “reduce” function). S3 serves as the source and destination of the input and output data. Easy-to-use console for launching jobs with dynamic configuration.
  70. 70. BigTable
  71. 71. Motivation Lots of (semi-)structured data – URLs: • Contents, crawl metadata, links, anchors, pagerank, … – Per-user data: • User preference settings, recent queries/search results, … – Geographic locations: • Physical entities (shops, restaurants, etc.), roads, satellite image data, … Scale is large – Billions of URLs, many versions/page (~20K/version) – Hundreds of millions of users, thousands of queries/sec – 100 TB+ of satellite image data
  72. 72. Why not just use commercial DB? Scale is too large for most commercial databases Even if it weren’t, cost would be very high – Building internally means system can be applied across many projects for low incremental cost Low-level storage optimizations help performance significantly – Much harder to do when running on top of a database layer – Also fun and challenging to build large-scale systems ☺
  73. 73. Goals Want asynchronous processes to be continuously updating different pieces of data – Want access to most current data at any time Need to support – Very high read/write rates (millions of ops per second) – Efficient scans over all or interesting subsets of data Often want to examine data changes over time – E.g. Contents of a web page over multiple crawls
  74. 74. BigTable Distributed multi-level map – With an interesting data model Fault-tolerant, persistent Scalable – Thousands of servers – Terabytes of in-memory data – Petabytes of disk-based data – Millions of reads/writes per second, efficient scans Self-managing – Servers can be added/removed dynamically – Servers adjust to load imbalance
  75. 75. HBase / Hypertable Use a data model similar to BigTable: a sparse, distributed, persistent, multi-dimensional sorted map. The map is indexed by - row key - column key - timestamp
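Conceptually, this data model can be pictured as nested sorted maps: row key -> column key -> timestamp -> value. The toy sketch below is only an in-memory illustration of that indexing, with invented class and method names; it is not how BigTable, HBase, or Hypertable are implemented.

    import java.util.TreeMap;

    public class SortedMapModel {
      // row key -> column key -> timestamp -> cell value
      private final TreeMap<String, TreeMap<String, TreeMap<Long, byte[]>>> table = new TreeMap<>();

      public void put(String row, String column, long timestamp, byte[] value) {
        table.computeIfAbsent(row, r -> new TreeMap<>())
             .computeIfAbsent(column, c -> new TreeMap<>())
             .put(timestamp, value);
      }

      // Return the most recent version of a cell, or null if absent.
      public byte[] getLatest(String row, String column) {
        TreeMap<String, TreeMap<Long, byte[]>> columns = table.get(row);
        if (columns == null) return null;
        TreeMap<Long, byte[]> versions = columns.get(column);
        if (versions == null || versions.isEmpty()) return null;
        return versions.lastEntry().getValue();   // highest timestamp wins
      }

      public static void main(String[] args) {
        SortedMapModel t = new SortedMapModel();
        t.put("com.rediff/news", "anchor:", 10L, "rediff news".getBytes());
        t.put("com.rediff/news", "anchor:", 20L, "rediff headlines".getBytes());
        System.out.println(new String(t.getLatest("com.rediff/news", "anchor:")));
      }
    }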
  76. 76. Table: Visual representation
  77. 77. Table: Actual Representation
  78. 78. System Overview
  79. 79. Range Server Manages ranges of table data Caches updates in memory (CellCache) Periodically spills (compacts) cached updates to disk (CellStore)
  80. 80. Master Single Master (hot standbys) Directs meta operations – CREATE TABLE – DROP TABLE – ALTER TABLE Handles recovery of RangeServer Manages RangeServer Load Balancing Client data does not move through Master
  81. 81. Hyperspace Chubby equivalent – Distributed Lock Manager – Filesystem for storing small amounts of metadata – Highly available “Root of distributed data structures”
  82. 82. Optimizations Compression: CellStore blocks are compressed. Caching: block cache and query cache. Bloom filter: indicates when a key is not present. Access groups: minimize I/O by exploiting locality.
  83. 83. Q&A
  84. 84. Thanks Much !
  85. 85. References
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters
- SALSA HPC Group at Community Grids Labs
- orts_a_petabyte_in_162.html