Introduction to Hadoop Technology

HADOOP
TECHNOLOGY
Presented by:
Manish S. Borkar
Polytechnic 6th semester, IT branch,
Nagpur Polytechnic, Nagpur

Cluster of machines running Hadoop at Yahoo!

Processing vCards:
Example of a vCard
•BEGIN : VCARD
•N: Manish Borkar
•INSTT : Nagpur Polytechnic, Nagpur
•DESIG : Student
•EMAIL : manish.borkar74@gmail.com
•URL : http://www.facebook.com/oasisfoundation
•URL : http://www.twitter.com/manishborkar
•END:VCARD
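
As a rough illustration of what "processing" such a record might involve, the sketch below reads the simplified KEY : value layout shown above into dictionaries. It is not a standards-compliant vCard parser; the function name parse_vcards and the choice to store every field as a list (because URL repeats) are assumptions made for this example.

```python
def parse_vcards(text):
    """Parse simplified 'KEY : value' vCard-style records into dictionaries."""
    cards, current = [], None
    for raw in text.splitlines():
        # Drop slide bullets and surrounding whitespace.
        line = raw.strip().lstrip("•").strip()
        if not line or ":" not in line:
            continue
        # Split on the first colon only, so URLs keep their "http://" part.
        key, value = (part.strip() for part in line.split(":", 1))
        key = key.upper()
        if key == "BEGIN" and value.upper() == "VCARD":
            current = {}
        elif key == "END" and value.upper() == "VCARD":
            if current is not None:
                cards.append(current)
            current = None
        elif current is not None:
            # A field such as URL may repeat, so every field holds a list.
            current.setdefault(key, []).append(value)
    return cards


sample = """•BEGIN : VCARD
•N: Manish Borkar
•INSTT : Nagpur Polytechnic, Nagpur
•DESIG : Student
•URL : http://www.facebook.com/oasisfoundation
•URL : http://www.twitter.com/manishborkar
•END:VCARD"""

cards = parse_vcards(sample)
print(cards[0]["N"], len(cards[0]["URL"]))
```

Parsing one card this way is trivial; the difficulty the following slides describe comes from doing it for very large numbers of records.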

• 1 GB, 10 GB, 100 GB: a single machine eventually reaches its limits
• More investment in bigger hardware
• 10 TB, 100 TB: the limits are reached again
• Data from Facebook, Twitter, RFID readers, and sensors
• Structured and unstructured data

Here comes the solution:
Hadoop

• Hadoop Distributed File System (HDFS) – a distributed file
system that stores data on commodity machines, providing
very high aggregate bandwidth across the cluster.
•Hadoop YARN – a resource-management platform
responsible for managing compute resources in clusters
and using them for scheduling of users' applications.
• Hadoop MapReduce – a programming model for large
scale data processing.

• NameNode: The HDFS namespace is a hierarchy of
files and directories. Files and directories are represented on
the NameNode by inodes.
• DataNode: Each block replica on a DataNode is
represented by two files in the local native filesystem. The
first file contains the data itself and the second file records
the block's metadata.
• HDFS Client: User applications access the
filesystem using the HDFS client, a library that
exports the HDFS filesystem interface.
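
The division of roles above can be pictured with a small in-memory toy model. This is only a sketch, not the real HDFS API: the class names, the 4-byte block size, the replication factor of 2, the round-robin placement, and the simple sum-based checksum are all assumptions chosen to keep the example short.

```python
from dataclasses import dataclass, field

BLOCK_SIZE = 4    # bytes per block; tiny on purpose (assumption)
REPLICATION = 2   # replicas per block (assumption)


@dataclass
class DataNode:
    name: str
    # block_id -> (data, metadata), standing in for the two local files
    replicas: dict = field(default_factory=dict)

    def store(self, block_id, data):
        # Toy metadata; real HDFS keeps proper checksums, not a byte sum.
        metadata = {"length": len(data), "checksum": sum(data) % 256}
        self.replicas[block_id] = (data, metadata)


@dataclass
class NameNode:
    # path -> ordered list of block ids (a stand-in for an inode)
    inodes: dict = field(default_factory=dict)
    # block_id -> names of the DataNodes holding a replica
    locations: dict = field(default_factory=dict)


class Client:
    """Hypothetical client library that hides blocks from the application."""

    def __init__(self, namenode, datanodes):
        self.namenode = namenode
        self.datanodes = datanodes

    def write(self, path, data):
        block_ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            n = offset // BLOCK_SIZE
            block_id = f"{path}#{n}"
            # Round-robin placement of replicas across DataNodes (assumption).
            targets = [self.datanodes[(n + r) % len(self.datanodes)]
                       for r in range(REPLICATION)]
            for node in targets:
                node.store(block_id, data[offset:offset + BLOCK_SIZE])
            self.namenode.locations[block_id] = [t.name for t in targets]
            block_ids.append(block_id)
        self.namenode.inodes[path] = block_ids

    def read(self, path):
        by_name = {node.name: node for node in self.datanodes}
        out = b""
        for block_id in self.namenode.inodes[path]:
            first = self.namenode.locations[block_id][0]
            out += by_name[first].replicas[block_id][0]
        return out


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode()
client = Client(namenode, nodes)
client.write("/user/demo.txt", b"hello hadoop")
print(client.read("/user/demo.txt"))
```

The point of the sketch is the split of responsibilities: the NameNode holds only the namespace and block locations, the DataNodes hold the bytes and per-block metadata, and the application talks to neither directly.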

• MapReduce is a programming model, with an associated
implementation, for processing and generating large data sets
with a parallel, distributed algorithm on a cluster.
• A MapReduce job usually splits the input data set
into independent chunks, which are processed by the
map tasks in a completely parallel manner.
• A MapReduce job is a unit of work that the client wants
performed: it consists of the input data, the MapReduce program,
and configuration information. Hadoop runs the job by dividing it
into tasks, of which there are two types: map tasks and reduce tasks.
THE PROGRAMMING MODEL OF MAPREDUCE
• Map, written by the user, takes an input pair and produces a set of intermediate
key/value pairs. The MapReduce library groups together all intermediate values
associated with the same intermediate key I and passes them to the Reduce
function.

• The Reduce function, also written by the user, accepts an intermediate key I and a set of values
for that key. It merges together these values to form a possibly smaller set of values.
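
The two user-written functions and the grouping step in between can be written down as a minimal sequential sketch. The driver map_reduce, the sensor log format, and the sample readings are hypothetical and exist only to make the shape of the model concrete; a real MapReduce library would distribute this work.

```python
from itertools import groupby
from operator import itemgetter


def map_reduce(inputs, map_fn, reduce_fn):
    """Run map on every input pair, group by intermediate key, then reduce."""
    intermediate = []
    for key, value in inputs:
        intermediate.extend(map_fn(key, value))
    # Sorting brings equal keys together so groupby sees each key once.
    intermediate.sort(key=itemgetter(0))
    output = {}
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output[key] = reduce_fn(key, [value for _, value in group])
    return output


def map_fn(filename, contents):
    """Input pair: (file name, text of 'sensor_id reading' lines)."""
    for line in contents.splitlines():
        sensor, reading = line.split()
        yield sensor, float(reading)


def reduce_fn(sensor, readings):
    """Merge all readings for one sensor into a smaller set: the maximum."""
    return [max(readings)]


inputs = [
    ("a.log", "s1 20.5\ns2 18.0"),
    ("b.log", "s1 22.0\ns2 17.5"),
]
result = map_reduce(inputs, map_fn, reduce_fn)
print(result)
```

Swapping in a different map_fn and reduce_fn changes the computation without touching the driver, which is the main appeal of the model.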

• Pig
• Mahout
• Hive
• Avro
• Storm

• Amazon Web Services
• Apache Bigtop
• Cascading
• Cloudera
• Cloudspace
• Datameer

• The amount of data being stored around the globe
continues to rise, and so does the cost of the technologies
that enable the extraction of meaningful patterns from it.
As both the volume of data and the cost of handling it
increase, it becomes difficult for organizations to afford
to store and process such large amounts of data.
Hadoop is a strong choice for this growing need because it
is easy to work with and can store large volumes of data.

