Hadoop has rapidly emerged as the preferred solution for big data analytics over unstructured data, and companies are seeking competitive advantage by finding effective ways of analyzing new sources of unstructured and machine-generated data. This session reviews the practices of performing analytics on unstructured data with Hadoop.

Big Data Analytics with Apache Hadoop
Milind Bhandarkar
@techmilind / @milindb
Data Computing Division

Speaker Profile
Milind Bhandarkar was the founding member of the team at Yahoo! that took Apache Hadoop from a 20-node prototype to a datacenter-scale production system, and has been contributing to and working with Hadoop since version 0.1.0. He started the Yahoo! Grid Solutions team focused on training, consulting, and supporting hundreds of new migrants to Hadoop. Parallel programming languages and paradigms have been his area of focus for over 20 years, and his area of specialization for his PhD (Computer Science) from the University of Illinois at Urbana-Champaign. He has worked at the Center for Development of Advanced Computing (C-DAC), the National Center for Supercomputing Applications (NCSA), the Center for Simulation of Advanced Rockets, Siebel Systems, PathScale Inc. (acquired by QLogic), Yahoo!, and LinkedIn. Currently, he is the Chief Scientist, Machine Learning Platforms at Greenplum, a division of EMC.

Outline
• Intro to Hadoop (10 mins)
• MapReduce (15 mins)
• Hadoop Examples (30 mins)
• Q & A

Apache Hadoop

Apache Hadoop
• January 2006: Subproject of Lucene
• January 2008: Top-level Apache project
• Stable Version: 1.0.3
• Latest Version: 2.0 alpha

Apache Hadoop
• Reliable, performant distributed file system
• MapReduce programming framework
• Ecosystem: HBase, Hive, Pig, Howl, Oozie, Zookeeper, Chukwa, Mahout, Cascading, Scribe, Cassandra, Hypertable, Voldemort, Azkaban, Sqoop, Flume, Avro ...

Problem: Bandwidth to Data
• Scan 100 TB datasets on a 1000-node cluster
  • Remote storage @ 10 MB/s = 165 mins
  • Local storage @ 50-200 MB/s = 33-8 mins
• Moving computation is more efficient than moving data
  • Need visibility into data placement

Problem: Scaling Reliably
• Failure is not an option, it's a rule!
  • 1000 nodes, MTBF < 1 day
  • 4000 disks, 8000 cores, 25 switches, 1000 NICs, 2000 DIMMs (16 TB RAM)
• Need a fault-tolerant store with reasonable availability guarantees
  • Handle hardware faults transparently

Hadoop Goals
• Scalable: Petabytes (10^15 bytes) of data on thousands of nodes
• Economical: Commodity components only
• Reliable
  • Engineering reliability into every application is expensive

Hadoop MapReduce

Think MapReduce
• Record = (Key, Value)
• Key: Comparable, Serializable
• Value: Serializable
• Input, Map, Shuffle, Reduce, Output

Seems Familiar?

cat /var/log/auth.log* | grep "session opened" | cut -d' ' -f10 | sort | uniq -c > ~/userlist

Map
• Input: (Key_1, Value_1)
• Output: List(Key_2, Value_2)
• Projections, Filtering, Transformation

Shuffle
• Input: List(Key_2, Value_2)
• Output: Sort(Partition(List(Key_2, List(Value_2))))
• Provided by Hadoop

Reduce
• Input: (Key_2, List(Value_2))
• Output: List(Key_3, Value_3)
• Aggregation

Hadoop Streaming
• Hadoop is written in Java
  • Java MapReduce code is "native"
• What about non-Java programmers?
  • Perl, Python, Shell, R
  • grep, sed, awk, uniq as Mappers/Reducers
• Text input and output

Hadoop Streaming
• Thin Java wrapper for Map & Reduce tasks
• Forks actual Mapper & Reducer
• IPC via stdin, stdout, stderr
• Key.toString() \t Value.toString() \n
• Slower than Java programs
  • Allows for quick prototyping / debugging

Hadoop Streaming

$ bin/hadoop jar hadoop-streaming.jar \
    -input in-files -output out-dir \
    -mapper mapper.sh -reducer reducer.sh

# mapper.sh
sed -e 's/ /\n/g' | grep .

# reducer.sh
uniq -c | awk '{print $2 "\t" $1}'

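The same mapper and reducer can be smoke-tested without a cluster: a plain `sort` stands in for Hadoop's shuffle phase. A minimal local run, with a made-up two-line input:

```shell
# Local simulation of the streaming word-count job above.
printf 'the quick fox\nthe lazy dog\n' > in.txt

sed -e 's/ /\n/g' in.txt | grep . |              # mapper.sh: one word per line
sort |                                           # stand-in for Hadoop's shuffle/sort
uniq -c | awk '{ print $2 "\t" $1 }' > out.txt   # reducer.sh: count per word
cat out.txt
```

This local-pipeline trick is a common way to debug streaming jobs before submitting them.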
Hadoop Examples

Example: Standard Deviation
• Takeaway: Changing the algorithm to suit the architecture yields the best implementation

Implementation 1
• Two MapReduce stages
• First stage computes Mean
• Second stage computes standard deviation

Stage 1: Compute Mean
• Map Input: (x_i for i = 1..N_m)
• Map Output: (N_m, Mean(x_{1..N_m}))
• Single Reducer
• Reduce Input: Group(Map Output)
• Reduce Output: Mean(x_{1..N})

Stage 2: Compute Standard Deviation
• Map Input: (x_i for i = 1..N_m) & Mean(x_{1..N})
• Map Output: Sum((x_i - Mean(x))^2) for i = 1..N_m
• Single Reducer
• Reduce Input: Group(Map Output) & N
• Reduce Output: σ

Standard Deviation
• Algebraically equivalent: σ^2 = Mean(x^2) - Mean(x)^2
• Be careful about numerical accuracy, though

Implementation 2
• Map Input: (x_i for i = 1..N_m)
• Map Output: (N_m, [Sum(x_{1..N_m}^2), Mean(x_{1..N_m})])
• Single Reducer
• Reduce Input: Group(Map Output)
• Reduce Output: σ

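The single-pass idea can be sketched in awk: each mapper would emit its partial count, sum, and sum of squares; here one pass accumulates all three and the END block plays the reducer. (As noted above, the final subtraction can lose precision when the mean is large relative to the spread.)

```shell
# One-pass standard deviation, mirroring Implementation 2,
# on a small made-up sample.
printf '2\n4\n4\n4\n5\n5\n7\n9\n' |
awk '{ n++; s += $1; sq += $1 * $1 }          # running count, sum, sum of squares
     END { mean = s / n
           printf "%.1f\n", sqrt(sq / n - mean * mean) }'
# prints: 2.0
```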
NGrams

Bigrams
• Input: A large text corpus
• Output: List(word_1, Top_K(word_2))
• Two Stages:
  • Generate all possible bigrams
  • Find most frequent K bigrams for each word

Bigrams: Stage 1 Map
• Generate all possible Bigrams
• Map Input: Large text corpus
• Map computation
  • In each sentence, for each "word_1 word_2"
  • Output (word_1, word_2), (word_2, word_1)
• Partition & Sort by (word_1, word_2)

pairs.pl

while (<STDIN>) {
  chomp;
  $_ =~ s/[^a-zA-Z]+/ /g;
  $_ =~ s/^\s+//g;
  $_ =~ s/\s+$//g;
  $_ =~ tr/A-Z/a-z/;
  my @words = split(/\s+/, $_);
  # emit each adjacent word pair in both orders
  for (my $i = 0; $i < $#words; ++$i) {
    print "$words[$i]:$words[$i+1]\n";
    print "$words[$i+1]:$words[$i]\n";
  }
}

Bigrams: Stage 1 Reduce
• Input: List(word_1, word_2) sorted and partitioned
• Output: List(word_1, [freq, word_2])
• Counting similar to the Unigrams example

count.pl

$_ = <STDIN>;
chomp;
my ($pw1, $pw2) = split(/:/, $_);
$count = 1;
while (<STDIN>) {
  chomp;
  my ($w1, $w2) = split(/:/, $_);
  if ($w1 eq $pw1 && $w2 eq $pw2) {
    $count++;
  } else {
    print "$pw1:$count:$pw2\n";
    $pw1 = $w1;
    $pw2 = $w2;
    $count = 1;
  }
}
print "$pw1:$count:$pw2\n";

Bigrams: Stage 2 Map
• Input: List(word_1, [freq, word_2])
• Output: List(word_1, [freq, word_2])
• Identity Mapper (/bin/cat)
• Partition by word_1
• Sort descending by (word_1, freq)

Bigrams: Stage 2 Reduce
• Input: List(word_1, [freq, word_2])
  • partitioned by word_1
  • sorted descending by (word_1, freq)
• Output: Top_K(List(word_1, [freq, word_2]))
• For each word, throw away everything after K records

firstN.pl

$N = 5;
$_ = <STDIN>;
chomp;
my ($pw1, $count, $pw2) = split(/:/, $_);
$idx = 1;
$out = "$pw1\t$pw2,$count;";
while (<STDIN>) {
  chomp;
  my ($w1, $c, $w2) = split(/:/, $_);
  if ($w1 eq $pw1) {
    if ($idx < $N) {
      $out .= "$w2,$c;";
      $idx++;
    }
  } else {
    print "$out\n";
    $pw1 = $w1;
    $idx = 1;
    $out = "$pw1\t$w2,$c;";
  }
}
print "$out\n";

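The whole two-stage bigram job can be rehearsed locally. This sketch replaces the Perl scripts with awk and uses `sort` for each shuffle; the three-line corpus and K=2 are made up for illustration:

```shell
# Local shell equivalent of the two-stage bigram job.
K=2
printf 'the cat sat\nthe cat ran\nthe dog sat\n' |
awk '{ for (i = 1; i < NF; i++) {            # stage 1 map: emit both orders
         print $i, $(i+1); print $(i+1), $i } }' |
sort | uniq -c |                             # stage 1 reduce: count each pair
awk '{ print $2, $1, $3 }' |                 # reorder to: word1 freq word2
sort -k1,1 -k2,2nr |                         # stage 2 shuffle: freq descending
awk -v k="$K" '{ if ($1 != prev) { prev = $1; n = 0 }   # stage 2 reduce: top K
                 if (n++ < k) print $1 "\t" $3 "," $2 }' > top.txt
cat top.txt
```

Each line of top.txt pairs a word with one of its K most frequent neighbors, e.g. "the" with "cat,2".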
Partitioner
• By default, evenly distributes keys
  • hashcode(key) % NumReducers
• Overriding the partitioner
  • Skew in map outputs
  • Restrictions on reduce outputs
  • All URLs in a domain together

Partitioner

// JobConf.setPartitionerClass(className)
public interface Partitioner<K, V> extends JobConfigurable {
  int getPartition(K key, V value, int maxPartitions);
}

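The default hashcode(key) % NumReducers rule can be illustrated in the shell. The hash below is a toy stand-in (not Java's String.hashCode), and the keys are made up; the point it shows is that every occurrence of a key lands on the same reducer:

```shell
# Toy illustration of hash partitioning across R=3 reducers.
printf 'apple\nbanana\ncherry\napple\n' |
awk -v R=3 '{ h = 0
              # simple rolling hash over the key characters, reduced mod R
              for (i = 1; i <= length($0); i++)
                h = (h * 31 + index("abcdefghijklmnopqrstuvwxyz", substr($0, i, 1))) % R
              print $0 " -> reducer " h }' > parts.txt
cat parts.txt
```

Both "apple" records are assigned the same reducer number, which is exactly the property the shuffle relies on.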
Fully Sorted Output
• By contract, the reducer gets input sorted on key
• Typically reducer output order is the same as input order
  • Each output file (part file) is sorted
• How to make sure that keys in part i are all less than keys in part i+1?

Fully Sorted Output
• Use a single reducer for small output
• Insight: Reducer input must be fully sorted
• Partitioner should provide fully sorted reduce input
• Sampling + histogram equalization

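A minimal sketch of the sampling-plus-range-partitioning idea, with made-up integer keys (sampling here is exhaustive for simplicity; a real job would sample a fraction of the input): pick R-1 cut points, route each key to its range partition, and concatenating the sorted part files then yields a total order.

```shell
# Range partitioning for fully sorted output, R = 4 partitions.
R=4
seq 1 100 | shuf > keys.txt                          # 100 unsorted integer keys
sort -n keys.txt |
awk -v r="$R" 'NR % 25 == 0 && ++c < r' > cuts.txt   # quantile cut points: 25, 50, 75

awk 'NR == FNR { cut[++n] = $1; next }               # first file: load cut points
     { p = n                                         # keys above the last cut -> part n
       for (i = 1; i <= n; i++) if ($1 <= cut[i]) { p = i - 1; break }
       print $1 > ("part-" p) }' cuts.txt keys.txt
```

Sorting each part file and concatenating part-0 through part-3 reproduces the fully sorted key sequence.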
Number of Maps
• Number of input splits
  • Number of HDFS blocks
• mapred.map.tasks
• Minimum split size (mapred.min.split.size)
• split_size = max(min(hdfs_block_size, data_size / #maps), min_split_size)

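Evaluating the formula for one assumed configuration (a 1024 MB input, 128 MB HDFS blocks, mapred.map.tasks = 4, 1 MB minimum split; all figures are illustrative):

```shell
# split_size = max(min(hdfs_block_size, data_size / #maps), min_split_size)
awk 'BEGIN { hdfs_block = 128; data = 1024; maps = 4; min_split = 1   # sizes in MB
             split_size = (data / maps < hdfs_block) ? data / maps : hdfs_block
             split_size = (split_size > min_split) ? split_size : min_split
             print split_size " MB per split -> " data / split_size " maps" }'
# prints: 128 MB per split -> 8 maps
```

Note the requested 4 maps is overridden: the block size caps the split at 128 MB, so the job actually runs 8 maps.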
Parameter Sweeps
• External program processes data based on command-line parameters
• ./prog -params="0.1,0.3" < in.dat > out.dat
• Objective: Run an instance of ./prog for each parameter combination
• Number of Mappers = Number of different parameter combinations

Parameter Sweeps
• Input file: params.txt
  • Each line contains one combination of parameters
• Input format is NLineInputFormat (N=1)
• Number of maps = Number of splits = Number of lines in params.txt

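A local stand-in for the sweep: one "map task" per line of params.txt. The `prog` function below is a hypothetical toy substitute for the external ./prog (it scales each input value by the first parameter); the parameter and input values are made up.

```shell
# One task per parameter combination, as NLineInputFormat (N=1) would launch.
printf '0.1,0.3\n0.5,0.9\n2.0,1.0\n' > params.txt
printf '10\n20\n' > in.dat

prog() {                          # hypothetical stand-in for ./prog
  awk -v s="${1%%,*}" '{ print $1 * s }'
}

while read -r params; do          # each line plays the role of one map task
  prog "$params" < in.dat > "out-$params.dat"
done < params.txt
cat out-0.5,0.9.dat               # prints 5 and 10 on separate lines
```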
Auxiliary Files
• -file auxFile.dat
• Job submitter adds the file to job.jar
• Unjarred on the TaskTracker
• Available to the task as $cwd/auxFile.dat
• Not suitable for large / frequently used files

Auxiliary Files
• Tasks need to access "side" files
  • Read-only dictionaries (such as for porn filtering)
  • Dynamically linked libraries
• Tasks themselves can fetch files from HDFS
  • Not always! (Hint: unresolved symbols)

Distributed Cache
• Specify "side" files via -cacheFile
• If a lot of such files are needed
  • Create a tar.gz archive
  • Upload to HDFS
  • Specify via -cacheArchive

Distributed Cache
• TaskTracker downloads these files "once"
• Untars archives
• Accessible in the task's $cwd before the task starts
• Cached across multiple tasks
• Cleaned up upon exit

Joining Multiple Datasets
• Datasets are streams of key-value pairs
• Could be split across multiple files in a single directory
• Join could be on Key, or any field in Value
• Join could be inner, outer, left outer, cross product, etc.
• Join is a natural Reduce operation

Example
• A = (id, name), B = (id, address)
• A is in /path/to/A/part-*
• B is in /path/to/B/part-*
• Select A.name, B.address where A.id == B.id

Map in Join
• Input: (Key_1, Value_1) from A or B
  • map.input.file indicates A or B
  • MAP_INPUT_FILE in Streaming
• Output: (Key_2, [Value_2, A|B])
  • Key_2 is the join key

Reduce in Join
• Input: Groups of [Value_2, A|B] for each Key_2
• Operation depends on the kind of join
  • Inner join checks if the key has values from both A & B
• Output: (Key_2, JoinFunction(Value_2, ...))

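The reduce-side inner join from the earlier example can be simulated in the shell: the "map" tags each record with its source relation, `sort` is the shuffle on id, and the final awk is the reduce. The two tiny datasets are made up:

```shell
# Reduce-side inner join: select A.name, B.address where A.id == B.id
printf '1\talice\n2\tbob\n' > A.txt              # A = (id, name)
printf '1\twonderland\n3\tnowhere\n' > B.txt     # B = (id, address)

{ awk '{ print $1 "\tA\t" $2 }' A.txt            # map: tag with source relation
  awk '{ print $1 "\tB\t" $2 }' B.txt; } |
sort -k1,1 -k2,2 |                               # shuffle: group by id, A before B
awk -F'\t' '$2 == "A" { name[$1] = $3 }          # reduce: emit only matched ids
            $2 == "B" && ($1 in name) { print name[$1] "\t" $3 }' > joined.txt
cat joined.txt
```

Only id 1 appears in both inputs, so the joined output is the single record (alice, wonderland).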
MR Join Performance
• Map input = Total of A & B
• Map output = Total of A & B
• Shuffle & Sort
• Reduce input = Total of A & B
• Reduce output = Size of joined dataset
• Filter and Project in Map

Join Special Cases
• Fragment-Replicate
  • 100 GB dataset with 100 MB dataset
• Equipartitioned Datasets
  • Identically keyed
  • Equal number of partitions
  • Each partition locally sorted

Fragment-Replicate
• Fragment the larger dataset
  • Specify as Map input
• Replicate the smaller dataset
  • Use Distributed Cache
• Map-only computation
  • No shuffle / sort

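A fragment-replicate sketch in awk, on made-up data: the small dataset is loaded into memory (the role the Distributed Cache plays for each map task) while the large one streams through, so no shuffle or sort is needed.

```shell
# Map-side join: small table in memory, large table streamed.
printf '1\twonderland\n3\tnowhere\n' > small_B.txt    # replicated side (id, address)
printf '1\talice\n2\tbob\n3\tcarol\n' > large_A.txt   # fragmented side (id, name)

awk -F'\t' 'NR == FNR { addr[$1] = $2; next }    # first file: build lookup table
            $1 in addr { print $2 "\t" addr[$1] }' small_B.txt large_A.txt > mapjoin.txt
cat mapjoin.txt
```

Ids 1 and 3 match, so the output is (alice, wonderland) and (carol, nowhere); bob is dropped by the inner join.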
Equipartitioned Join
• Available since Hadoop 0.16
• Datasets joined "before" input to mappers
• Input format: CompositeInputFormat
• mapred.join.expr
• Simpler to use in Java, but can be used in Streaming

Example

mapred.join.expr =
  inner (
    tbl (
      ....SequenceFileInputFormat.class,
      "hdfs://namenode:8020/path/to/data/A"
    ),
    tbl (
      ....SequenceFileInputFormat.class,
      "hdfs://namenode:8020/path/to/data/B"
    )
  )

Get Social
@EMCAcademics

Next Session: Technology Lecture Series: Classic D Center, on 16 Aug 2012

Questions?