Introduction to Hadoop
Ron Sher
This is an introduction to Hadoop that I gave at the company where I work.
It is a general introduction to the Hadoop core: HDFS and MapReduce.

Published in: Technology, Education
Transcript of "Introduction to hadoop"

  1. Introduction to Hadoop - Ron Sher
  2. Agenda
     • Big data - big issues
     • Hadoop to the rescue
     • Storage - HDFS
     • Processing - MapReduce
     • Hadoop ecosystem
  3. Big Data - Big Issues
     ● Volume, Velocity, Variability
     ● Lots of data - logs, sensors, social, pictures, video, etc.
     ● May not fit a single machine
     ● Access to data is slow
     ● Hardware may fail
     ● Network errors happen
  4. Hadoop to the rescue
     • Distributed “operating system”
     • Scalable - many servers of commodity hardware with lots of cores and disks
     • Reliable - detect failures, redundant storage
     • Fault-tolerant - auto-retry, self-healing
     • Simple - use many servers as one really big computer
     • Suitable for batch processing (throughput over
  5. Storage - HDFS
     • Hadoop Distributed File System
     • Replicated (3 default) fixed size blocks (64MB default)
     • Runs on large clusters of commodity machines
     • Optimized for write once - read many throughput of large files
  6. HDFS Architecture
     Figure: http://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/images/hdfsarchitecture.png
  7. Useful HDFS commands
     • hdfs dfs -get <file name> - copy a file from HDFS to the local file system
     • hdfs dfs -put <file name> [destination] - copy a file from the local file system to the specified destination in HDFS
     • hdfs dfs -cat <file name> - print a file to stdout
     • hdfs dfs -ls <dir name> - show all files under the specified directory
     • hdfs dfs -mv <file name> <changed name> - rename a file
     • hdfs dfs -rm <file name> - remove a file
     • hdfs dfs -rmr <directory name> - remove a directory (deprecated in later releases in favor of hdfs dfs -rm -r)
     • hdfs dfs -mkdir <dir name> - create a directory
  8. Processing - MapReduce
     • A distributed data processing model and execution environment that runs on large clusters of commodity machines
     • Responsible for running a job in parallel on many servers
     • Handles re-trying a task that fails, validating complete results
     • Computation moved to the data
  9-14. MapReduce Sample - Word Count (one diagram, built up stage by stage over six slides)
     input:
        Ini Mini Miny
        Mo Mo Miny
        Ini Mo Mini
     splitting (one line per split):
        Ini Mini Miny
        Mo Mo Miny
        Ini Mo Mini
     mapping (one row per split):
        Ini, 1   Mini, 1   Miny, 1
        Mo, 1    Mo, 1     Miny, 1
        Ini, 1   Mo, 1     Mini, 1
     shuffling (one row per key):
        Ini, 1   Ini, 1
        Mini, 1  Mini, 1
        Miny, 1  Miny, 1
        Mo, 1    Mo, 1     Mo, 1
     reducing:
        Ini, [1,1]
        Mini, [1,1]
        Miny, [1,1]
        Mo, [1,1,1]
     final result:
        Ini, 2
        Mini, 2
        Miny, 2
        Mo, 3
  15. How a MapReduce Job Runs in Hadoop
     Figure: http://answers.oreilly.com/uploads/monthly_10_2009/post-118-125676084924_thumb.png
  16-19. Monitoring MR jobs (machine:50030)
  20-21. Useful Commands
     • mapred job -kill <job id> - kill a running job
     • mapred job -status <job id> - show status of a job
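  22. Word Count Mapper

      // Nested inside the job class. Needs java.io.IOException, java.util.StringTokenizer,
      // org.apache.hadoop.io.{IntWritable, LongWritable, Text} and
      // org.apache.hadoop.mapred.{MapReduceBase, Mapper, OutputCollector, Reporter}.
      public static class Map extends MapReduceBase implements
              Mapper<LongWritable, Text, Text, IntWritable> {
          // Reused for every emitted pair to avoid allocating per token.
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          // key = byte offset of the line in the input, value = the line itself.
          public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
                  throws IOException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                  word.set(tokenizer.nextToken());
                  output.collect(word, one); // emit (word, 1)
              }
          }
      }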
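  23. Word Count Reducer

      // Nested inside the job class. Needs java.io.IOException, java.util.Iterator,
      // org.apache.hadoop.io.{IntWritable, Text} and
      // org.apache.hadoop.mapred.{MapReduceBase, OutputCollector, Reducer, Reporter}.
      public static class Reduce extends MapReduceBase implements
              Reducer<Text, IntWritable, Text, IntWritable> {
          // Called once per word with all the counts emitted for it.
          public void reduce(Text key, Iterator<IntWritable> values,
                  OutputCollector<Text, IntWritable> output, Reporter reporter)
                  throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                  sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum)); // emit (word, total)
          }
      }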
  24. Hadoop Ecosystem
     • Hive - SQL-like language over big data using MR
     • HBase - distributed, column-oriented database
     • ZooKeeper - coordination service
     • Avro - cross-language serialization
     • Pig - language for exploring big data
     • Impala - SQL-like queries directly over HDFS
     • Sqoop - tool for moving data from DBs to HDFS
     • Mahout - machine learning and data mining library
  25. Some resources
     • Motivation about Hadoop and where it’s going: video and whitepaper
     • HDFS Architecture Guide
     • How MapReduce Works With Hadoop
     • HDFS shell commands
     • VM
     • MapReduce tutorial
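
The defaults quoted on slide 5 (64MB blocks, replication factor 3) imply some simple storage arithmetic. The sketch below is plain Java with no Hadoop dependency. The class name HdfsBlockMath and the 200MB example file are illustrative assumptions, not part of the talk. It also assumes that a partially filled last block occupies only its actual bytes on disk.

```java
public class HdfsBlockMath {
    static final long MB = 1024L * 1024L;

    // Number of blocks a file is split into; the last block may be only partly filled.
    static long blockCount(long fileBytes, long blockBytes) {
        return (fileBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    // Total block replicas held across the cluster.
    static long replicaCount(long fileBytes, long blockBytes, int replication) {
        return blockCount(fileBytes, blockBytes) * replication;
    }

    // Raw bytes consumed across the cluster, assuming a partial block is not padded.
    static long rawBytes(long fileBytes, int replication) {
        return fileBytes * replication;
    }

    public static void main(String[] args) {
        long file = 200 * MB;   // hypothetical file size
        long block = 64 * MB;   // default block size from slide 5
        int replication = 3;    // default replication from slide 5

        System.out.println("blocks:   " + blockCount(file, block));
        System.out.println("replicas: " + replicaCount(file, block, replication));
        System.out.println("raw MB:   " + rawBytes(file, replication) / MB);
    }
}
```

Under these assumptions a 200MB file is split into 4 blocks, stored as 12 block replicas, and consumes 600MB of raw disk.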
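
The word count walk-through on slides 9-14 can be traced in a single process without a cluster. The sketch below is plain Java and does not use the Hadoop API. The class and method names (WordCountSketch, mapPhase, shuffle, reducePhase) are made up for illustration. A TreeMap stands in for the shuffle, which in a real job groups and sorts keys while moving data between machines.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {

    // Map phase: emit a (word, 1) pair for every whitespace-separated token of one split.
    static List<Map.Entry<String, Integer>> mapPhase(String split) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String token : split.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // Shuffle phase: group all emitted values by key; TreeMap keeps the keys sorted.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce phase: sum the values collected for one key.
    static int reducePhase(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    // Runs the three phases in order over the given splits.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map.Entry<String, Integer>> mapped = new ArrayList<>();
        for (String split : splits) {
            mapped.addAll(mapPhase(split));
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(mapped).entrySet()) {
            result.put(group.getKey(), reducePhase(group.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        // The three input lines from the slides, one per split.
        List<String> splits = Arrays.asList("Ini Mini Miny", "Mo Mo Miny", "Ini Mo Mini");
        System.out.println(wordCount(splits)); // prints {Ini=2, Mini=2, Miny=2, Mo=3}
    }
}
```

The mapPhase and reducePhase methods play the same roles as the Map and Reduce classes on slides 22 and 23. What the sketch leaves out is everything Hadoop adds around them: distributing splits to machines, retrying failed tasks, and moving the shuffled data over the network.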