Introduction to Hadoop
This is an introduction to Hadoop that I gave at the company where I work.
It is a general introduction to the Hadoop core: HDFS and MapReduce.

Transcript

  • 1. Introduction to Hadoop Ron Sher
  • 2. Agenda
      • Big data - big issues
      • Hadoop to the rescue
      • Storage - HDFS
      • Processing - MapReduce
      • Hadoop ecosystem
  • 3. Big Data - Big Issues
      • Volume, Velocity, Variability
      • Lots of data - logs, sensors, social, pictures, video, etc.
      • May not fit a single machine
      • Access to data is slow
      • Hardware may fail
      • Network errors happen
  • 4. Hadoop to the rescue
      • Distributed “operating system”
      • Scalable - many servers of commodity hardware with lots of cores and disks
      • Reliable - detect failures, redundant storage
      • Fault-tolerant - auto-retry, self-healing
      • Simple - use many servers as one really big computer
      • Suitable for batch processing (throughput over
  • 5. Storage - HDFS
      • Hadoop Distributed File System
      • Replicated (3 default) fixed size blocks (64MB default)
      • Runs on large clusters of commodity machines
      • Optimized for write once - read many throughput of large files
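The block and replication defaults on this slide lend themselves to a quick back-of-the-envelope calculation. The sketch below is only an illustration: the class and method names are hypothetical, and it hard-codes the slide's values (64 MB blocks, replication factor 3). Newer Hadoop releases default to a larger block size, so treat the constants as assumptions to adjust for your cluster.

```java
// Hypothetical helper, not part of Hadoop: estimates how a file is laid out in HDFS
// using the defaults quoted on the slide.
class HdfsBlockMath {
    static final long BLOCK_SIZE_BYTES = 64L * 1024 * 1024; // 64 MB, per the slide
    static final int REPLICATION = 3;                       // 3 replicas, per the slide

    /** Number of blocks the file is split into; the last block may be partial. */
    static long blockCount(long fileSizeBytes) {
        if (fileSizeBytes <= 0) {
            return 0;
        }
        // Ceiling division: a partial trailing block still counts as one block.
        return (fileSizeBytes + BLOCK_SIZE_BYTES - 1) / BLOCK_SIZE_BYTES;
    }

    /** Total number of block replicas stored across the cluster. */
    static long replicaCount(long fileSizeBytes) {
        return blockCount(fileSizeBytes) * REPLICATION;
    }

    /** Raw bytes consumed across the cluster: every byte is stored REPLICATION times. */
    static long rawStorageBytes(long fileSizeBytes) {
        return fileSizeBytes * REPLICATION;
    }

    public static void main(String[] args) {
        long oneGb = 1024L * 1024 * 1024;
        System.out.println(blockCount(oneGb));      // prints 16
        System.out.println(replicaCount(oneGb));    // prints 48
        System.out.println(rawStorageBytes(oneGb)); // prints 3221225472
    }
}
```

Under these assumptions a 1 GB file becomes 16 blocks, 48 block replicas, and 3 GB of raw disk usage, which is why replication has to be included in capacity planning.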
  • 6. HDFS Architecture http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png
  • 7. Useful HDFS commands
      • hdfs dfs -get <file name> - copy a file from HDFS to local
      • hdfs dfs -put <file name> [destination] - copy a file from local to HDFS in the specified destination
      • hdfs dfs -cat <file name> - print a file to stdout
      • hdfs dfs -ls <dir name> - show all files under the specified directory
      • hdfs dfs -mv <file name> <changed name> - rename a file
      • hdfs dfs -rm <file name> - remove a file
      • hdfs dfs -rmr <directory name> - remove a directory (deprecated in newer releases in favor of hdfs dfs -rm -r)
      • hdfs dfs -mkdir <dir name> - create a directory
  • 8. Processing - MapReduce
      • A distributed data processing model and execution environment that runs on large clusters of commodity machines
      • Responsible for running a job in parallel on many servers
      • Handles re-trying a task that fails, validating complete results
      • Computation moved to the data
  • 9-14. MapReduce Sample - Word Count (the slides build up one stage at a time)
      • input
          Ini Mini Miny
          Mo Mo Miny
          Ini Mo Mini
      • splitting
          split 1: Ini Mini Miny
          split 2: Mo Mo Miny
          split 3: Ini Mo Mini
      • mapping
          split 1: Ini, 1   Mini, 1   Miny, 1
          split 2: Mo, 1   Mo, 1   Miny, 1
          split 3: Ini, 1   Mo, 1   Mini, 1
      • shuffling
          Ini, 1   Ini, 1
          Mini, 1   Mini, 1
          Miny, 1   Miny, 1
          Mo, 1   Mo, 1   Mo, 1
      • reducing
          Ini, [1,1]
          Mini, [1,1]
          Miny, [1,1]
          Mo, [1,1,1]
      • final result
          Ini, 2
          Mini, 2
          Miny, 2
          Mo, 3
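The word count walkthrough on these slides can be reproduced in plain Java, without a cluster, to make the stages concrete. This is a minimal single-process sketch and not how Hadoop itself is implemented: the class `WordCountSketch` and its methods are hypothetical names, and it assumes each input line is one split.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the map, shuffle, and reduce stages.
class WordCountSketch {

    /** Mapping: emit a (word, 1) pair for every token in one split. */
    static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffling: group all emitted values by key, with keys in sorted order. */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reducing: sum the list of values collected for each key. */
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            int sum = 0;
            for (int value : entry.getValue()) {
                sum += value;
            }
            counts.put(entry.getKey(), sum);
        }
        return counts;
    }

    /** Runs all stages over the given splits and returns the final counts. */
    static Map<String, Integer> run(String... splits) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String split : splits) {
            pairs.addAll(map(split));
        }
        return reduce(shuffle(pairs));
    }

    public static void main(String[] args) {
        // Same input as the slides, one split per line.
        System.out.println(run("Ini Mini Miny", "Mo Mo Miny", "Ini Mo Mini"));
        // prints {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```

On a real cluster the `map` calls run in parallel on the machines that hold each split, and the shuffle moves data over the network. Here everything happens in one process, so only the data flow is illustrated.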
  • 15. How a MapReduce Job Runs in Hadoop (diagram: http://answers.oreilly.com/uploads/monthly_10_2009/post-118-125676084924_thumb.png)
  • 16. Monitoring MR jobs (machine:50030)
  • 20. Useful Commands
      • mapred job -kill <job id> - kill a running job
      • mapred job -status <job id> - show status of a job
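  • 22. Word Count Mapper
      import java.io.IOException;
      import java.util.StringTokenizer;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;

      // Nested inside the job's outer class; written against the
      // org.apache.hadoop.mapred API.
      public static class Map extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {

          // Reused across calls to avoid allocating an object per token.
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // key: byte offset of the line; value: the line itself.
          public void map(LongWritable key, Text value,
                          OutputCollector<Text, IntWritable> output,
                          Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  output.collect(word, one); // emit (word, 1)
              }
          }
      }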
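  • 23. Word Count Reducer
      import java.io.IOException;
      import java.util.Iterator;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapred.*;

      // Nested inside the job's outer class, next to the mapper.
      public static class Reduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {

          // Called once per word with all the counts emitted for it.
          public void reduce(Text key, Iterator<IntWritable> values,
                             OutputCollector<Text, IntWritable> output,
                             Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum)); // emit (word, total)
          }
      }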
  • 24. Hadoop Ecosystem
      • Hive - SQL-like language over big data using MR
      • HBase - distributed, column-oriented database
      • ZooKeeper - coordination service
      • Avro - cross-language serialization
      • Pig - language for exploring big data
      • Impala - SQL-like queries directly over HDFS
      • Sqoop - tool for moving data from DBs to HDFS
      • Mahout - machine learning and data mining library
  • 25. Some resources
      • Motivation about Hadoop and where it’s going: video and whitepaper
      • HDFS Architecture Guide
      • How MapReduce Works With Hadoop
      • HDFS shell commands
      • VM
      • MapReduce tutorial