This presentation provides information about Hadoop: what Hadoop is and how Hadoop overcomes the disadvantages of distributed systems. I have also included an example MapReduce program (word count).
2. Topics To Be Covered
Various types of computing
What is Hadoop?
Disadvantages of distributed systems and how Hadoop overcomes them
Hadoop architecture
HDFS (Hadoop Distributed File System)
MapReduce
Example MapReduce program (word count)
3. Today's Scenario of Data
• The New York Stock Exchange generates about 4-5 terabytes of data per day.
• Facebook hosts more than 240 billion photos, growing at 7 petabytes per month.
• Ancestry.com, the genealogy site, stores around 10 petabytes of data.
• The Internet Archive stores around 18.5 petabytes of data.
4. Distributed Computing
• Distributed computing is a model in which components located on networked computers communicate and coordinate their actions by passing messages.
7. Volunteer Computing
• Volunteer computing is a type of distributed computing in which computers donate their computing resources (such as processing power and storage) to "projects".
8. Cloud Computing
• Cloud computing is a type of Internet-based computing that provides shared computer processing resources and data to computers and other devices on demand.
10. What Causes the Problem in Distributed Systems?
The transfer speed of a disk is around 100 MB/s.
Consider a disk of 1 terabyte.
Time to read the disk = 1 TB / 100 MB/s = 10,000 seconds, around 3 hours.
Adding more capacity may not help, because:
• Network bandwidth becomes a bottleneck
• Processor limits have been reached
11. Issues Involved in Distributed Systems
Hardware problems
As we start using more hardware, the chances of failure become very high.
Combining data after analysis
While combining analysis results from one disk with another, a failure can cause data loss.
12. Hadoop
• Hadoop is a software framework for distributed storage and distributed processing.
• It is built from commodity hardware.
• Hadoop is designed with fundamental assumptions of hardware failure and large-scale data processing.
13. Hadoop
Doug Cutting and Michael J. Cafarella developed Hadoop in 2005.
Hadoop's approach to distributed systems: Hadoop provides a simplified programming model, which allows users to quickly write and test distributed systems. It efficiently and automatically distributes data and work across machines, in turn utilizing the underlying parallelism of CPU cores.
14. Advantages of Hadoop
• High scalability and availability
• Uses commodity (cheap!) hardware with little redundancy
• Fault tolerance
• Moves computation rather than data
18. HDFS Architecture
NameNode
It runs as the master server.
Role:
• Manages the file system namespace.
• Regulates clients' access to files.
• Executes file system operations such as renaming, closing, and opening files and directories.
19. HDFS Architecture
DataNode
These nodes run as slaves and manage the data storage of their systems.
Role:
• DataNodes perform read-write operations on the file systems, as per client requests.
• They also perform operations such as block creation, deletion, and replication according to the instructions of the NameNode.
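For concreteness, here is a minimal client-side sketch (not part of the original deck) using Hadoop's Java FileSystem API. Namespace operations such as listing and renaming are resolved by the NameNode, while the actual file bytes are served by DataNodes. The NameNode URI and paths below are placeholder assumptions.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; substitute your cluster's fs.defaultFS.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:9000"), conf);
    // Namespace operations handled by the NameNode: list and rename.
    for (FileStatus status : fs.listStatus(new Path("/user/demo"))) {
      System.out.println(status.getPath());
    }
    fs.rename(new Path("/user/demo/old.txt"), new Path("/user/demo/new.txt"));
    fs.close();
  }
}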
20. HDFS Architecture
Block
The minimum amount of data that HDFS can read or write is called a block. The default block size is 64 MB, but it can be increased as needed by changing the HDFS configuration.
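As an illustration of that configuration change, a client or job can also override the block size programmatically. A minimal sketch, assuming the Hadoop 2.x+ property key dfs.blocksize (older releases used dfs.block.size):

import org.apache.hadoop.conf.Configuration;

public class BlockSizeSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Files written with this configuration use 128 MB blocks instead of the default.
    conf.setLong("dfs.blocksize", 128L * 1024 * 1024);
    System.out.println("Block size: " + conf.getLong("dfs.blocksize", 0));
  }
}

The same property can be set cluster-wide in hdfs-site.xml.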
22. MapReduce Architecture
Map Function
The map phase processes the input data extracted from the splits. For each record parsed by the InputFormat, it invokes the user-provided map function, which emits a number of key/value pairs into an in-memory buffer.
Example
If the input is "bhaghubhai", the output will be:
b-2  g-1
h-3  u-1
a-2  i-1
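A minimal mapper sketch matching this character-count example (illustrative only; the deck's actual word-count mapper appears on slide 24):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Illustrative mapper: emits (character, 1) for every character of each input line,
// mirroring the "bhaghubhai" example above. The framework later sums the 1s per key.
public class CharCountMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private final Text ch = new Text();

  @Override
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    for (char c : value.toString().toCharArray()) {
      ch.set(String.valueOf(c));
      context.write(ch, one);  // e.g. ("b", 1), ("h", 1), ("a", 1), ...
    }
  }
}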
23. MapReduce Architecture
Reduce Function
The TaskTracker reads the region files remotely. It sorts the key/value pairs, and for each key it invokes the reduce function, which collects the key and its aggregated value into the output file (one per reducer node).
Example
If there are two input splits that both contain "bhaghubhai", their outputs are merged, so the result will be:
b-4  g-2
h-6  u-2
a-4  i-2
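To make the merge step concrete, here is a small plain-Java simulation (not Hadoop code) of how the framework groups and sums values by key across splits before reduce runs:

import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Simulates sort/shuffle: pairs from two map tasks (two splits of "bhaghubhai")
// are grouped by key, then each key's values are summed as a reducer would.
public class ShuffleSketch {
  public static void main(String[] args) {
    Map<String, List<Integer>> grouped = new TreeMap<>();
    for (String split : new String[] {"bhaghubhai", "bhaghubhai"}) {
      for (char c : split.toCharArray()) {
        grouped.computeIfAbsent(String.valueOf(c), k -> new ArrayList<>()).add(1);
      }
    }
    // Prints a-4, b-4, g-2, h-6, i-2, u-2, matching the merged output above.
    grouped.forEach((k, vs) ->
        System.out.println(k + "-" + vs.stream().mapToInt(Integer::intValue).sum()));
  }
}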
24. Word Count Example Program
Mapper class
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      // Split the line into whitespace-separated tokens and emit (word, 1) for each.
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }
25. Word Count Example Program
Reducer class
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      // Sum all the 1s emitted for this word.
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
26. Word Count Example Program
Driver class
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    // The reducer also serves as a combiner for local aggregation on map nodes.
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
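To run the job, the class is typically packaged into a jar and submitted through the hadoop launcher; the jar name and HDFS paths here are placeholders:

hadoop jar wordcount.jar WordCount /user/demo/input /user/demo/output

Note that the output directory must not already exist, or the job will fail at startup.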