Hadoop Overview (KDD 2011)

At KDD 2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) presented a tutorial on "Modeling with Hadoop". This is the first half of the tutorial.



  1. Modeling with Hadoop. Vijay K. Narayanan, Principal Scientist, Yahoo! Labs; Milind Bhandarkar, Chief Architect, Greenplum Labs, EMC2
  2. Session 1: Overview of Hadoop •  Motivation •  Hadoop •  Map-Reduce •  Distributed File System •  Next Generation MapReduce •  Q & A
  3. Session 2: Modeling with Hadoop •  Types of learning in MapReduce •  Algorithms in the MapReduce framework •  Data-parallel algorithms •  Sequential algorithms •  Challenges and enhancements
  4. Session 3: Hands-On Exercise •  Spin up a single-node Hadoop cluster in a virtual machine •  Write a regression trainer •  Train a model on a dataset
  5. Overview of Apache Hadoop
  6. Hadoop at Yahoo! (Some Statistics) •  40,000+ machines in 20+ clusters •  Largest cluster is 4,000 machines •  170 petabytes of storage •  1,000+ users •  1,000,000+ jobs/month
  7. Behind Every Click
  8. Who Uses Hadoop?
  9. Why Hadoop?
  10. Big Datasets ("Data-Rich Computing" theme proposal, J. Campbell et al., 2007)
  11. Cost Per Gigabyte (http://www.mkomo.com/cost-per-gigabyte)
  12. Storage Trends (graph by Adam Leventhal, ACM Queue, Dec 2009)
  13. Motivating Examples
  14. Yahoo! Search Assist
  15. Search Assist •  Insight: related concepts appear close together in a text corpus •  Input: web pages •  1 billion pages, 10 KB each •  10 TB of input data •  Output: List(word, List(related words))
  16. Search Assist (pseudocode):

      // Input: List(URL, Text)
      foreach URL in Input :
          Words = Tokenize(Text(URL));
          foreach word in Words :
              Insert (word, Next(word, Words)) in Pairs;
              Insert (word, Previous(word, Words)) in Pairs;
      // Result: Pairs = List(word, RelatedWord)

      Group Pairs by word;
      // Result: List(word, List(RelatedWords))

      foreach word in GroupedPairs :
          Count RelatedWords in GroupedPairs;
      // Result: List(word, List(RelatedWords, count))

      foreach word in CountedPairs :
          Sort Pairs(word, *) descending by count;
          Choose Top 5 Pairs;
      // Result: List(word, Top5(RelatedWords))
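The pseudocode above can be sketched as a small in-memory Python program. The toy two-page corpus and the `related_words` helper are illustrative only; the real input is a billion web pages processed with MapReduce:

```python
from collections import Counter, defaultdict

def related_words(pages, top_n=5):
    """Pair each word with its immediate neighbours, then keep the
    most frequent co-occurrences per word (toy, in-memory sketch)."""
    pairs = defaultdict(Counter)
    for text in pages:
        tokens = text.split()
        for i, word in enumerate(tokens):
            if i + 1 < len(tokens):
                pairs[word][tokens[i + 1]] += 1   # Next(word, Words)
            if i > 0:
                pairs[word][tokens[i - 1]] += 1   # Previous(word, Words)
    return {w: [rw for rw, _ in c.most_common(top_n)]
            for w, c in pairs.items()}

pages = ["apache hadoop mapreduce", "apache hadoop streaming"]
print(related_words(pages)["hadoop"])  # "apache" ranks first (count 2)
```

The grouping step (`pairs[word]`) is exactly what the shuffle phase provides for free at scale.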
  17. People You May Know
  18. People You May Know •  Insight: you might also know Joe Smith if a lot of folks you know know Joe Smith •  ... if you don't know Joe Smith already •  Numbers: •  100 MM users •  Average connections per user is 100
  19. People You May Know (pseudocode):

      // Input: List(UserName, List(Connections))
      foreach u in UserList :                // 100 MM
          foreach x in Connections(u) :      // 100
              foreach y in Connections(x) :  // 100
                  if (y not in Connections(u)) :
                      Count(u, y)++;         // 1 Trillion Iterations
          Sort (u, y) in descending order of Count(u, y);
          Choose Top 3 y;
          Store (u, {y0, y1, y2}) for serving;
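The triple loop above can be sketched in plain Python on a toy graph. The `people_you_may_know` helper and the example names are illustrative; the real input is 100 MM users, which is exactly why the slide that follows motivates distributing the work:

```python
from collections import Counter

def people_you_may_know(connections, top_n=3):
    """For each user, count friends-of-friends who are not already
    direct connections, and keep the top candidates."""
    suggestions = {}
    for u, friends in connections.items():
        counts = Counter()
        for x in friends:                       # each friend
            for y in connections.get(x, []):    # each friend-of-friend
                if y != u and y not in friends:
                    counts[y] += 1
        suggestions[u] = [p for p, _ in counts.most_common(top_n)]
    return suggestions

graph = {
    "alice": ["bob", "carol"],
    "bob":   ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave":  ["bob", "carol"],
}
print(people_you_may_know(graph)["alice"])  # dave, known by both bob and carol
```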
  20. Performance •  101 random accesses for each user •  Assume 1 ms per random access •  100 ms per user •  100 MM users •  100 days on a single machine
  21. MapReduce Paradigm
  22. Map Reduce •  Primitives in Lisp (and other functional languages), 1970s •  Google paper, 2004 •  http://labs.google.com/papers/mapreduce.html
  23. Map •  Output_List = Map (Input_List) •  Square (1, 2, 3, 4, 5, 6, 7, 8, 9, 10) = (1, 4, 9, 16, 25, 36, 49, 64, 81, 100)
  24. Reduce •  Output_Element = Reduce (Input_List) •  Sum (1, 4, 9, 16, 25, 36, 49, 64, 81, 100) = 385
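The two primitives on these slides map directly onto Python's built-ins, reproducing the Square and Sum examples:

```python
from functools import reduce

nums = list(range(1, 11))

# Map: apply a function to every element independently
squares = list(map(lambda x: x * x, nums))

# Reduce: fold a list down to a single value
total = reduce(lambda a, b: a + b, squares)

print(squares)  # [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
print(total)    # 385
```

Note the asymmetry the next slide exploits: each `map` call is independent (parallel), while `reduce` consumes the whole list (sequential unless the data is grouped into multiple lists).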
  25. Parallelism •  Map is inherently parallel •  Each list element processed independently •  Reduce is inherently sequential •  Unless processing multiple lists •  Grouping to produce multiple lists
  26. Search Assist Map: // Input: http://hadoop.apache.org Pairs = Tokenize_And_Pair ( Text ( Input ) ) Output = { (apache, hadoop) (hadoop, mapreduce) (hadoop, streaming) (hadoop, pig) (apache, pig) (hadoop, DFS) (streaming, commandline) (hadoop, java) (DFS, namenode) (datanode, block) (replication, default) ... }
  27. Search Assist Reduce: // Input: GroupedList (word, GroupedList(words)) CountedPairs = CountOccurrences (word, RelatedWords) Output = { (hadoop, apache, 7) (hadoop, DFS, 3) (hadoop, streaming, 4) (hadoop, mapreduce, 9) ... }
  28. Issues with Large Data •  Map parallelism: chunking input data •  Reduce parallelism: grouping related data •  Dealing with failures and load imbalance
  29. Apache Hadoop •  January 2006: subproject of Lucene •  January 2008: top-level Apache project •  Stable version: 0.20.203 •  Latest version: 0.22 (coming soon)
  30. Apache Hadoop •  Reliable, performant distributed file system •  MapReduce programming framework •  Ecosystem: HBase, Hive, Pig, Howl, Oozie, ZooKeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...
  31. Problem: Bandwidth to Data •  Scan 100 TB datasets on a 1000-node cluster •  Remote storage @ 10 MB/s = 165 mins •  Local storage @ 50-200 MB/s = 33-8 mins •  Moving computation is more efficient than moving data •  Need visibility into data placement
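The slide's back-of-envelope numbers can be checked in a few lines (decimal units assumed, 1 TB = 1,000,000 MB; the slide's 165 minutes is the same calculation with slightly different rounding):

```python
def scan_minutes(total_tb, nodes, mb_per_s):
    """Minutes for each node to scan its equal share of the dataset."""
    mb_per_node = total_tb * 1_000_000 / nodes
    return mb_per_node / mb_per_s / 60

# 100 TB across 1000 nodes, i.e. 100 GB per node
remote = scan_minutes(100, 1000, 10)       # remote storage, ~165 mins on the slide
slow_local = scan_minutes(100, 1000, 50)   # ~33 mins
fast_local = scan_minutes(100, 1000, 200)  # ~8 mins
print(round(remote), round(slow_local), round(fast_local))
```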
  32. Problem: Scaling Reliably •  Failure is not an option, it's a rule! •  1,000 nodes, MTBF < 1 day •  4,000 disks, 8,000 cores, 25 switches, 1,000 NICs, 2,000 DIMMs (16 TB RAM) •  Need a fault-tolerant store with reasonable availability guarantees •  Handle hardware faults transparently
  33. Hadoop Goals •  Scalable: petabytes (10^15 bytes) of data on thousands of nodes •  Economical: commodity components only •  Reliable •  Engineering reliability into every application is expensive
  34. Hadoop MapReduce
  35. Think MapReduce •  Record = (Key, Value) •  Key: Comparable, Serializable •  Value: Serializable •  Input, Map, Shuffle, Reduce, Output
  36. Seems Familiar?

      cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | \
          sort | uniq -c > ~/userlist
  37. Map •  Input: (Key1, Value1) •  Output: List(Key2, Value2) •  Projections, filtering, transformation
  38. Shuffle •  Input: List(Key2, Value2) •  Output: Sort(Partition(List(Key2, List(Value2)))) •  Provided by Hadoop
  39. Reduce •  Input: List(Key2, List(Value2)) •  Output: List(Key3, Value3) •  Aggregation
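The three phases above can be illustrated with a tiny in-memory word count. The function names are illustrative, and in real Hadoop the shuffle is provided by the framework rather than written by the user:

```python
from collections import defaultdict

def map_phase(records):
    # Map: (Key1, Value1) -> List(Key2, Value2)
    for _, line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    # Shuffle: Sort(Partition(...)) -- group values by key, sort by key.
    # Hadoop does this between map and reduce; spelled out here for clarity.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(grouped):
    # Reduce: (Key2, List(Value2)) -> (Key3, Value3), here aggregation by sum
    for key, values in grouped:
        yield (key, sum(values))

records = [(0, "hadoop mapreduce"), (1, "hadoop streaming")]
print(list(reduce_phase(shuffle_phase(map_phase(records)))))
# [('hadoop', 2), ('mapreduce', 1), ('streaming', 1)]
```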
  40. Hadoop Streaming •  Hadoop is written in Java •  Java MapReduce code is native •  What about non-Java programmers? •  Perl, Python, Shell, R •  grep, sed, awk, uniq as mappers/reducers •  Text input and output
  41. Hadoop Streaming •  Thin Java wrapper for Map/Reduce tasks •  Forks the actual mapper and reducer •  IPC via stdin, stdout, stderr •  Key.toString() \t Value.toString() \n •  Slower than Java programs •  Allows quick prototyping/debugging
  42. Hadoop Streaming:

      $ bin/hadoop jar hadoop-streaming.jar \
          -input in-files -output out-dir \
          -mapper mapper.sh -reducer reducer.sh

      # mapper.sh
      sed -e 's/ /\n/g' | grep .

      # reducer.sh
      uniq -c | awk '{print $2 "\t" $1}'
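The sed and uniq scripts on the slide could equally be written in Python. This is a sketch of streaming-style mapper and reducer logic; the function names are illustrative, and in a real job the framework feeds each side via stdin/stdout:

```python
from itertools import groupby

def mapper(lines):
    # Like the sed one-liner: emit one word per input line's tokens
    for line in lines:
        for word in line.split():
            yield word

def reducer(lines):
    # Like `uniq -c | awk ...`: count runs of identical keys; the
    # streaming framework guarantees reducer input is sorted by key
    for word, run in groupby(line.strip() for line in lines):
        yield f"{word}\t{sum(1 for _ in run)}"

# Simulate the framework's shuffle with sorted() on a toy input
words = sorted(mapper(["hadoop mapreduce", "hadoop streaming"]))
print("\n".join(reducer(words)))
```

Submitted with `-mapper mapper.py -reducer reducer.py` (plus `-file` to ship the scripts), this behaves like the shell version but is easier to extend and debug locally.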
  43. Hadoop Distributed File System (HDFS)
  44. HDFS •  Data is organized into files and directories •  Files are divided into uniform-sized blocks (default 128 MB) and distributed across cluster nodes •  HDFS exposes block placement so that computation can be migrated to data
  45. HDFS •  Blocks are replicated (default 3) to handle hardware failure •  Replication for performance and fault tolerance (rack-aware placement) •  HDFS keeps checksums of data for corruption detection and recovery
  46. HDFS •  Master-worker architecture •  Single NameNode •  Many (thousands of) DataNodes
  47. HDFS Master (NameNode) •  Manages the filesystem namespace •  File metadata (i.e., "inode") •  Mapping of inode to list of blocks + locations •  Authorization and authentication •  Checkpoint and journal namespace changes
  48. NameNode •  Mapping of DataNode to list of blocks •  Monitors DataNode health •  Replicates missing blocks •  Keeps ALL namespace in memory •  60M objects (files/blocks) in 16 GB
  49. DataNodes •  Handle block storage on multiple volumes and block integrity •  Clients access blocks directly from DataNodes •  Periodically send heartbeats and block reports to the NameNode •  Blocks are stored as the underlying OS's files
  50. HDFS Architecture
  51. Next Generation MapReduce
  52. MapReduce Today (Courtesy: Arun Murthy, Hortonworks)
  53. Why? •  Scalability limitations today •  Maximum cluster size: 4,000 nodes •  Maximum concurrent tasks: 40,000 •  JobTracker SPOF •  Fixed map and reduce containers (slots) •  Punishes pleasantly parallel apps
  54. Why? (contd.) •  MapReduce is not suitable for every application •  Fine-grained iterative applications •  HaLoop: Hadoop in a Loop •  Message-passing applications •  Graph processing
  55. Requirements •  Need a scalable cluster resource manager •  Separate scheduling from resource management •  Multi-lingual communication protocols
  56. Bottom Line •  @techmilind: "#mrng (MapReduce, Next Gen) is, in reality, #rmng (Resource Manager, Next Gen)" •  Expect different programming paradigms to be implemented •  Including MPI (soon)
  57. Architecture (Courtesy: Arun Murthy, Hortonworks)
  58. The New World •  Resource Manager •  Allocates resources (containers) to applications •  Node Manager •  Manages containers on nodes •  Application Master •  Specific to the paradigm, e.g., MapReduce application master, MPI application master, etc.
  59. Container •  In current terminology: a task slot •  A slice of the node's hardware resources •  # of cores, virtual memory, disk size, disk and network bandwidth, etc. •  Currently, only memory usage is sliced
  60. Questions?
