A new way to store and analyze data




      Presented By :: Harsha Jain
         CSE – IV Year Student

www.powerpointpresentationon.blogspot.com
Topics Covered
•   What is Hadoop?
•   Why, Where, When?
•   Benefits of Hadoop
•   How Hadoop Works?
•   Hadoop Architecture
•   Hadoop Common
•   HDFS
•   Hadoop MapReduce
•   Installation & Execution
•   Demo of installation
•   Hadoop Community




What is Hadoop?
• Hadoop was created by Douglas Reed Cutting, who named Hadoop after
  his child’s stuffed elephant, to support the Lucene and Nutch
  search engine projects.
• Open-source project administered by the Apache Software Foundation.
• Hadoop consists of two key services:
a. Reliable data storage using the Hadoop Distributed File System (HDFS).
b. High-performance parallel data processing using a technique called
MapReduce.
• Hadoop runs large-scale, high-performance processing jobs in spite
  of system changes or failures.




Hadoop, Why?
 • Need to process 100TB datasets
 • On 1 node:
– scanning @ 50MB/s = 23 days
 • On 1000 node cluster:
– scanning @ 50MB/s = 33 min
 • Need an efficient, reliable, and usable framework
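The arithmetic behind these figures can be checked directly. A quick sketch, taking 1 TB as 10^12 bytes and 1 MB as 10^6 bytes:

```python
TB, MB = 10**12, 10**6

data = 100 * TB          # dataset size
rate = 50 * MB           # scan rate per node, in bytes/second

one_node_days = data / rate / 86400            # 86400 seconds in a day
cluster_minutes = data / (rate * 1000) / 60    # 1000 nodes, result in minutes

print(f"1 node: {one_node_days:.0f} days")         # ~23 days
print(f"1000 nodes: {cluster_minutes:.0f} min")    # ~33 min
```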




Where and When Hadoop

Where
• Batch data processing, not real-time / user facing (e.g.
  Document Analysis and Indexing, Web Graphs and Crawling)
• Highly parallel, data-intensive distributed applications
• Very large production deployments (GRID)

When
• Processing lots of unstructured data
• When your processing can easily be made parallel
• When running batch jobs is acceptable
• When you have access to lots of cheap hardware




Benefits of Hadoop
• Hadoop is designed to run on cheap commodity
  hardware
• It automatically handles data replication and node
  failure
• It does the hard work – you can focus on processing
  data
• Cost-saving, efficient, and reliable data
  processing




How Hadoop Works
• Hadoop implements a computational paradigm named
  Map/Reduce, where the application is divided into many small
  fragments of work, each of which may be executed or re-executed
  on any node in the cluster.
• In addition, it provides a distributed file system (HDFS) that
  stores data on the compute nodes, providing very high aggregate
  bandwidth across the cluster.
• Both Map/Reduce and the distributed file system are designed so
  that node failures are automatically handled by the framework.
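The fragment-of-work idea can be illustrated with a small single-machine sketch. Names like `process_fragment` and `run_with_retries` are illustrative, not Hadoop APIs: the driver splits the input into fragments, runs each one, and simply re-executes any fragment that fails, just as the framework reschedules a task when a node dies.

```python
attempts = {}

def process_fragment(idx, fragment):
    """Stand-in for one small unit of work (a map task in Hadoop)."""
    attempts[idx] = attempts.get(idx, 0) + 1
    if idx % 3 == 0 and attempts[idx] == 1:   # simulate a node failure on the first try
        raise RuntimeError(f"node running fragment {idx} failed")
    return sum(fragment)                      # the actual "work": summing numbers

def run_with_retries(fragments, max_attempts=3):
    """Execute every fragment, re-executing failed ones as the framework would."""
    results = []
    for idx, frag in enumerate(fragments):
        for attempt in range(max_attempts):
            try:
                results.append(process_fragment(idx, frag))
                break
            except RuntimeError:
                continue                      # reschedule, possibly on another node
    return results

data = list(range(100))
fragments = [data[i:i + 10] for i in range(0, 100, 10)]  # 10 small fragments of work
print(run_with_retries(fragments))
```

Fragments 0, 3, 6, and 9 fail once and are transparently retried; the caller only sees the complete set of results.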




Hadoop Architecture
       The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing.

Hadoop consists of:
 • Hadoop Common*: the common utilities that support the other
   Hadoop subprojects.
 • HDFS*: a distributed file system that provides high-throughput
   access to application data.
 • MapReduce*: a software framework for distributed processing of
   large data sets on compute clusters.
Hadoop is made up of a number of elements. At the bottom is the Hadoop Distributed
File System (HDFS), which stores files across the storage nodes in a Hadoop cluster.
Above HDFS is the MapReduce engine, which consists of JobTrackers and TaskTrackers.

* This presentation focuses primarily on Hadoop architecture and the related
subprojects.




Data Flow

[Diagram: Web servers and Scribe servers write logs to network storage; the
Hadoop cluster processes them, and the results are loaded into Oracle RAC
and MySQL.]
Hadoop Common
• Hadoop Common is a set of utilities that
  support the other Hadoop subprojects.
  Hadoop Common includes FileSystem,
  RPC, and serialization libraries.




HDFS
• Hadoop Distributed File System (HDFS) is
  the primary storage system used by
  Hadoop applications.
• HDFS creates multiple replicas of data
  blocks and distributes them on compute
  nodes throughout a cluster to enable
  reliable, extremely rapid computations.
• Replication and data locality are what make
  this reliability and speed possible
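A toy placement routine makes the replication idea concrete. This round-robin scheme is purely illustrative; real HDFS placement is rack-aware and more sophisticated:

```python
def place_replicas(num_blocks, datanodes, replication=3):
    """Assign each block to `replication` distinct datanodes, round-robin."""
    assert len(datanodes) >= replication, "need at least as many nodes as replicas"
    return {
        block: [datanodes[(block + r) % len(datanodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

nodes = ["node1", "node2", "node3", "node4"]
for block, replicas in place_replicas(5, nodes).items():
    print(f"block {block} -> {replicas}")
```

Every block ends up on three distinct nodes, so any single node can fail without losing data, and a computation can be scheduled on whichever replica is closest.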


HDFS Architecture

[HDFS architecture diagram omitted]
Hadoop MapReduce
 • The Map-Reduce programming model
– Framework for distributed processing of large data sets
– Pluggable user code runs in generic framework
 • Common design pattern in data processing
cat * | grep | sort | uniq -c | cat > file
input | map | shuffle | reduce | output
 • Natural for:
– Log processing
– Web search indexing
– Ad-hoc queries
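The generic-framework-plus-pluggable-user-code pattern can be sketched in a few lines of single-process Python. The function names are illustrative, not the Hadoop API:

```python
from collections import defaultdict

def map_reduce(inputs, mapper, reducer):
    """Single-process model of the input | map | shuffle | reduce | output pipeline."""
    # map: each input record yields zero or more (key, value) pairs
    intermediate = [pair for record in inputs for pair in mapper(record)]
    # shuffle: group all emitted values by key
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # reduce: one reducer call per key
    return {key: reducer(key, values) for key, values in sorted(groups.items())}

# pluggable user code: count log lines mentioning "error",
# mirroring the grep | sort | uniq -c shell pipeline
mapper = lambda line: [(line, 1)] if "error" in line else []
reducer = lambda key, values: sum(values)

logs = ["error: disk", "ok", "error: disk", "error: net"]
print(map_reduce(logs, mapper, reducer))  # {'error: disk': 2, 'error: net': 1}
```

The framework owns splitting, grouping, and scheduling; the user supplies only the mapper and reducer.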



MapReduce Implementation
1. Input files split (M splits)
2. Assign Master & Workers
3. Map tasks
4. Writing intermediate data to
   disk (R regions)
5. Intermediate data read &
   sort
6. Reduce tasks
7. Return




MapReduce Cluster Implementation

[Diagram: input files are divided into splits (split 0–split 4); M map tasks
produce intermediate files; R reduce tasks consume them and write the output
files (Output 0, Output 1).]

• Several map or reduce tasks can run on a single computer
• Each intermediate file is divided into R partitions by the partitioning
  function
• Each reduce task corresponds to one partition
Examples of MapReduce
                       Word Count

• Read text files and count how often words
  occur.
  o   The input is text files
  o   The output is a text file
        each line: word, tab, count
• Map: Produce pairs of (word, count)
• Reduce: For each word, sum up the
  counts.
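The word-count example above can be sketched end to end in Python. `wc_map` and `wc_reduce` are illustrative names, and the shuffle step is simulated with a dictionary:

```python
from collections import defaultdict

def wc_map(line):
    # Map: emit a (word, 1) pair for every word in the line
    return [(word, 1) for word in line.split()]

def wc_reduce(word, counts):
    # Reduce: sum all the counts emitted for this word
    return sum(counts)

lines = ["the quick brown fox", "the lazy dog", "the fox"]

groups = defaultdict(list)          # shuffle: group emitted pairs by word
for line in lines:
    for word, one in wc_map(line):
        groups[word].append(one)

for word in sorted(groups):         # output: each line is word, tab, count
    print(f"{word}\t{wc_reduce(word, groups[word])}")
```

For the three sample lines this prints "the" with count 3, "fox" with count 2, and the remaining words with count 1, in the word-tab-count format described above.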


Let's Go…

Installation:
• Requirements: Linux, Java 1.6, sshd, rsync
• Configure SSH for password-free authentication
• Unpack the Hadoop distribution
• Edit a few configuration files
• Format the DFS on the name node
• Start all the daemon processes

Execution:
• Compile your job into a JAR file
• Copy input data into HDFS
• Execute bin/hadoop jar with relevant args
• Monitor tasks via the Web interface (optional)
• Examine the output when the job is complete




Demo Video for installation

[Installation demo video omitted]
Hadoop Community

Hadoop Users
• Adobe
• Alibaba
• Amazon
• AOL
• Facebook
• Google
• IBM

Major Contributors
• Apache
• Cloudera
• Yahoo
References
• Apache Hadoop! (http://hadoop.apache.org )
• Hadoop on Wikipedia
  (http://en.wikipedia.org/wiki/Hadoop)
• Free Search by Doug Cutting
  (http://cutting.wordpress.com )
• Hadoop and Distributed Computing at Yahoo!
  (http://developer.yahoo.com/hadoop )
• Cloudera - Apache Hadoop for the Enterprise
  (http://www.cloudera.com )





Editor's Notes

  1. This is the architecture of our backend data warehousing system. This system provides important information on the usage of our website, including, but not limited to, the number of page views of each page and the number of active users in each country. We generate 3 TB of compressed log data every day. All of this data is stored and processed by the Hadoop cluster, which consists of over 600 machines. The summary of the log data is then copied to Oracle and MySQL databases to make it easy for people to access.