Hadoop 101

Slide deck from BarCamp Chennai

Transcript

  • 1. Hadoop 101 - Mohit Soni, eBay Inc. (BarCamp Chennai - 5)
  • 2. About Me
      • I work as a Software Engineer at eBay
      • Worked on large-scale data processing with eBay Research Labs
  • 3. First Things First
  • 4. MapReduce
      • Inspired by the functional operations map and reduce
      • Functional operations do not modify data; they generate new data
      • Original data remains unmodified
  • 5. Functional Operations (Python code)

    Map:
        def sqr(n):
            return n * n

        list = [1, 2, 3, 4]
        map(sqr, list)     # -> [1, 4, 9, 16]

    Reduce:
        def add(i, j):
            return i + j

        list = [1, 2, 3, 4]
        reduce(add, list)  # -> 10

    MapReduce:
        def MapReduce(data, mapper, reducer):
            return reduce(reducer, map(mapper, data))

        MapReduce(list, sqr, add)  # -> 30
  • 6. (image-only slide)
  • 7. What is Hadoop?
      • Framework for large-scale data processing
      • Based on Google's MapReduce and GFS
      • An Apache Software Foundation project
      • Open Source!
      • Written in Java
      • Oh, btw
  • 8. Why Hadoop?
      • Need to process lots of data (petabyte scale)
      • Need to parallelize processing across a multitude of CPUs
      • Achieves the above while KeepIng Software Simple (KISS)
      • Gives scalability with low-cost commodity hardware
  • 9. Hadoop fans (Source: Hadoop Wiki)
  • 10. When to use (and not use) Hadoop?
    Hadoop is a good choice for:
      • Indexing data
      • Log analysis
      • Image manipulation
      • Sorting large-scale data
      • Data mining
    Hadoop is not a good choice:
      • For real-time processing
      • For processing-intensive tasks with little data
      • If you have Jaguar or RoadRunner (i.e. a supercomputer) in your stock
  • 11. HDFS – Overview
      • Hadoop Distributed File System
      • Based on Google's GFS (Google File System)
      • Write-once, read-many access model
      • Fault tolerant
      • Efficient for batch processing
  • 12. HDFS – Blocks
    (Diagram: input data split into Block 1, Block 2, Block 3)
      • HDFS splits input data into blocks
      • Block size in HDFS: 64/128 MB (configurable)
      • Block size on *nix filesystems: 4 KB
  • 13. HDFS – Replication
    (Diagram: Blocks 1, 2, and 3 replicated across nodes)
      • Blocks are replicated across nodes to handle hardware failure
      • Node failure is handled gracefully, without loss of data
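    A minimal sketch of how the per-file block size and replication factor surface in the HDFS FileSystem API (assuming a reachable HDFS cluster and the 0.20/1.x-era API used in the later slides; the path and values are illustrative, not from the deck):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FSDataOutputStream;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockSettingsDemo {
            public static void main(String[] args) throws Exception {
                // Picks up core-site.xml / hdfs-site.xml from the classpath
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path file = new Path("/demo/data.txt");   // illustrative path

                // Create a file with an explicit 128 MB block size and 3 replicas.
                // Overload: create(path, overwrite, bufferSize, replication, blockSize)
                FSDataOutputStream out =
                        fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024);
                out.writeUTF("hello HDFS");
                out.close();

                // Replication is per-file metadata and can be changed later.
                fs.setReplication(file, (short) 2);
            }
        }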
  • 14. HDFS – Architecture
    (Diagram: a Client, the NameNode, and a cluster of DataNodes)
  • 15. HDFS – NameNode
      • NameNode (Master)
        – Manages filesystem metadata
        – Manages replication of blocks
        – Manages read/write access to files
      • Metadata
        – List of files
        – List of blocks that constitute a file
        – List of DataNodes on which blocks reside, etc.
      • Single point of failure (candidate for spending $$)
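    To make that metadata concrete, a small sketch (same assumptions and illustrative path as the earlier sketch) that asks the NameNode which blocks make up a file and which DataNodes hold them:

        import java.util.Arrays;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.BlockLocation;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class BlockLocationsDemo {
            public static void main(String[] args) throws Exception {
                FileSystem fs = FileSystem.get(new Configuration());
                Path file = new Path("/demo/data.txt");   // illustrative path

                // The NameNode answers from its metadata: the blocks of the file
                // and the DataNodes on which each block resides.
                FileStatus status = fs.getFileStatus(file);
                BlockLocation[] blocks =
                        fs.getFileBlockLocations(status, 0, status.getLen());

                for (BlockLocation block : blocks) {
                    System.out.println("offset=" + block.getOffset()
                            + " length=" + block.getLength()
                            + " hosts=" + Arrays.toString(block.getHosts()));
                }
            }
        }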
  • 16. HDFS – DataNode
      • DataNode (Slave)
        – Contains the actual data
        – Manages data blocks
        – Informs the NameNode about the block IDs it stores
        – Clients read/write data blocks from/to DataNodes
        – Performs block replication as instructed by the NameNode
      • Block Replication
        – Supports various pluggable replication strategies
        – Clients read blocks from the nearest DataNode
      • Data Pipelining
        – The client writes a block to the first DataNode
        – The first DataNode forwards the data to the next DataNode in the pipeline
        – When the block is replicated across all replicas, the next block is chosen
  • 17. Hadoop - Architecture
    (Diagram: a User, the JobTracker with its TaskTrackers, and the NameNode with its DataNodes)
  • 18. Hadoop - Terminology
      • JobTracker (Master)
        – 1 JobTracker per cluster
        – Accepts job requests from users
        – Schedules Map and Reduce tasks for TaskTrackers
        – Monitors task and TaskTracker status
        – Re-executes tasks on failure
      • TaskTracker (Slave)
        – Multiple TaskTrackers in a cluster
        – Run Map and Reduce tasks
  • 19. MapReduce – Flow
    (Diagram: Input -> Map -> Shuffle + Sort -> Reduce -> Output; the input data feeds multiple Map tasks, whose output is shuffled and sorted before being aggregated by Reduce tasks into the output data)
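    What the "Shuffle + Sort" stage contributes to this flow can be seen in a minimal single-process sketch that mimics the flow for word counting by grouping map output by key before reducing (an illustration of the data flow only, not Hadoop's implementation; the input strings are made up):

        import java.util.ArrayList;
        import java.util.List;
        import java.util.Map;
        import java.util.TreeMap;

        public class LocalFlowDemo {
            public static void main(String[] args) {
                String[] inputSplits = { "a b a", "b c b" };   // stand-ins for input splits

                // Map phase: emit a (word, 1) pair for every token, split by split.
                List<String[]> mapOutput = new ArrayList<>();
                for (String split : inputSplits) {
                    for (String word : split.split("\\s+")) {
                        mapOutput.add(new String[] { word, "1" });
                    }
                }

                // Shuffle + sort: group all emitted values by key (TreeMap keeps keys sorted).
                Map<String, List<Integer>> grouped = new TreeMap<>();
                for (String[] pair : mapOutput) {
                    grouped.computeIfAbsent(pair[0], k -> new ArrayList<>())
                           .add(Integer.parseInt(pair[1]));
                }

                // Reduce phase: aggregate the grouped values for each key.
                for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
                    int sum = 0;
                    for (int v : entry.getValue()) sum += v;
                    System.out.println(entry.getKey() + "\t" + sum);   // a 2, b 3, c 1
                }
            }
        }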
  • 20. Word Count: Hadoop's Hello World
  • 21. Word Count Example
      • Input
        – Text files
      • Output
        – Single file containing (Word <TAB> Count)
      • Map Phase
        – Generates (Word, Count) pairs
        – [{a,1}, {b,1}, {a,1}], [{a,2}, {b,3}, {c,5}], [{a,3}, {b,1}, {c,1}]
      • Reduce Phase
        – For each word, calculates the aggregate
        – [{a,7}, {b,5}, {c,6}]
  • 22. Word Count – Mapper

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> out,
                        Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            // Emit a (word, 1) pair for every token in the line.
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                out.collect(word, one);
            }
        }
    }
  • 23. Word Count – Reducer

    import java.io.IOException;
    import java.util.Iterator;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class WordCountReducer extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {

        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> out,
                           Reporter reporter) throws IOException {
            // Sum all the counts emitted for this word.
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            out.collect(key, new IntWritable(sum));
        }
    }
  • 24. Word Count – Config

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class WordCountConfig {
        public static void main(String[] args) throws Exception {
            if (args.length != 2) {
                System.exit(1);
            }
            JobConf conf = new JobConf(WordCountConfig.class);
            conf.setJobName("Word Counter");

            // args[0] is the input directory, args[1] the output directory.
            FileInputFormat.addInputPath(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            // Output types of the job: (word, count) pairs.
            conf.setOutputKeyClass(Text.class);
            conf.setOutputValueClass(IntWritable.class);

            conf.setMapperClass(WordCountMapper.class);
            conf.setCombinerClass(WordCountReducer.class);
            conf.setReducerClass(WordCountReducer.class);

            conf.setInputFormat(TextInputFormat.class);
            conf.setOutputFormat(TextOutputFormat.class);

            JobClient.runJob(conf);
        }
    }
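    Assuming the three classes above are compiled and packaged into a jar (wordcount.jar here is a made-up name), the job would typically be launched with the standard Hadoop launcher, with the two command-line arguments becoming args[0] and args[1]:

        hadoop jar wordcount.jar WordCountConfig <hdfs-input-dir> <hdfs-output-dir>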
  • 25. Diving Deeper
      • http://hadoop.apache.org/
      • Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters"
      • Tom White, "Hadoop: The Definitive Guide", O'Reilly
      • Setting up a Single-Node Cluster: http://bit.ly/glNzs4
      • Setting up a Multi-Node Cluster: http://bit.ly/f5KqCP
  • 26. Catching Up
      • Follow me on Twitter: @mohitsoni
      • http://mohitsoni.com/
