Hadoop Explained

Sunitha Raghurajan
Data…Data….Data….
• We live in a data world!
• Total Facebook users: 835,525,280 (March 31, 2012)
• The New York Stock Exchange generates about one terabyte of new trade
  data per day.
• Facebook hosts approximately 10 billion photos, taking up one petabyte
  of storage.

http://www.internetworldstats.com/facebook.htm
Data… is growing!

[Figure: growth of the digital universe]
From Gantz et al., “The Diverse and Exploding Digital Universe,” March 2008
(http://www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf).
Problem?
• How do we store and analyze all this data?
• A one-terabyte drive has a transfer speed of around 100 MB/s, so it takes
  more than two and a half hours to read all the data off the disk (worked
  out below). Writing is even slower.
• What if we had 100 drives, each holding one hundredth of the data, all
  reading in parallel?
• Reliability issues (hard drive failures)
• How do we combine the data from 100 drives?
• Existing tools are inadequate for processing large data sets.
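To make the disk-read numbers concrete, here is the back-of-the-envelope
arithmetic behind the claims above (taking 1 TB ≈ 1,000,000 MB):

    1,000,000 MB ÷ 100 MB/s = 10,000 s ≈ 2.8 hours   (one drive, sequential read)
    10,000 s ÷ 100 drives   =    100 s               (100 drives reading in parallel)

Spreading the data across many disks and reading them all at once is exactly
the idea Hadoop builds on.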
Why can’t we use an RDBMS?
• An RDBMS is good for point queries or updates, where the dataset has been
  indexed to deliver low-latency retrieval and update times for a relatively
  small amount of data. Reading a large fraction of the data, however, takes
  much longer.

[Diagram: CPU – Memory – Disk hierarchy]
Hadoop is the answer!
• Hadoop is an open source project licensed under the Apache v2 license:
  http://hadoop.apache.org/

• Used for processing large datasets in parallel on clusters of low-cost
  commodity machines.

• Hadoop is built on two main parts: a special file system called the
  Hadoop Distributed File System (HDFS) and the MapReduce framework.
Hadoop History
• Hadoop was created by Doug Cutting, who named it after his son's toy
  elephant.
• 2002-2004: Nutch, an open source, web-scale, crawler-based search engine
• 2004-2006: The Google File System and MapReduce papers are published;
  DFS and MapReduce implementations are added to Nutch
• 2006-2008: Yahoo! hires Doug Cutting
• On February 19, 2008, Yahoo! Inc. launched what it claimed was the
  world's largest Hadoop production application.
• The Yahoo! Search Webmap is a Hadoop application that runs on a Linux
  cluster with more than 10,000 cores and produces data that is used in
  every Yahoo! Web search query.
Who uses Hadoop?

Amazon        American Airlines
AOL           Apple
eBay          Federal Reserve Board of Governors
foursquare    Fox Interactive Media
Facebook      StumbleUpon
Gemvara       Hewlett-Packard
IBM           Microsoft
Twitter       NYTimes
Netflix       LinkedIn
Why Hadoop?
• Reliable: The software is fault tolerant; it expects and handles hardware
  and software failures
• Scalable: Designed for massive scale-out across processors, memory, and
  locally attached storage
• Distributed: Handles replication and offers a massively parallel
  programming model, MapReduce
What is MapReduce?

 – A programming model introduced by Google

 – A combination of the Map and Reduce steps with an associated
   implementation

 – Used for processing and generating large data sets
MapReduce Explained
• The basic idea is that you divide the job into two phases: a Map and a
  Reduce.
• Map takes the problem, splits it into sub-parts, and sends the sub-parts
  to different machines, so all the pieces run at the same time.
• Reduce takes the results from the sub-parts and combines them back
  together to get a single answer. A minimal code sketch of the model
  follows.
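To make the model concrete, here is a minimal, single-machine sketch of
MapReduce-style word counting in Python. It illustrates the programming model
only, not the Hadoop API; the function names and the in-memory shuffle are
simplifications for this example.

    from collections import defaultdict

    def map_fn(line):
        # Map: accept an input record, emit intermediate (key, value) pairs.
        for word in line.split():
            yield (word, 1)

    def reduce_fn(word, counts):
        # Reduce: accept a key plus all its intermediate values, emit one pair.
        return (word, sum(counts))

    def mapreduce(records):
        groups = defaultdict(list)
        for record in records:                 # the "map" phase
            for key, value in map_fn(record):
                groups[key].append(value)      # the "shuffle": group by key
        return [reduce_fn(k, vs) for k, vs in groups.items()]  # the "reduce" phase

    print(mapreduce(["the cat sat", "the cat ran"]))
    # [('the', 2), ('cat', 2), ('sat', 1), ('ran', 1)]

On a real cluster, each machine would run map_fn over its own split of the
input, and the grouped values would be shipped across the network to the
reducers instead of collected in one dictionary.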
Distributed Grep

                split data --> grep --> matches
  Very big      split data --> grep --> matches
  data          split data --> grep --> matches  --> cat --> all matches
                split data --> grep --> matches
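Expressed in the same sketch style as the word-count example, distributed
grep is a map that filters lines plus an almost trivial reduce that plays the
role of cat. The pattern and the helper names are illustrative, not part of
any Hadoop API.

    import re

    PATTERN = r"error"

    def grep_map(split):
        # Map: run over one split of the data, keep lines matching the pattern.
        return [line for line in split if re.search(PATTERN, line)]

    def grep_reduce(all_matches):
        # Reduce: concatenate matches from every split (the "cat" step above).
        return [line for matches in all_matches for line in matches]

    # Four splits of the "very big data"; each map could run on its own machine.
    splits = [["ok", "error: disk full"], ["ok"], ["error: timeout"], ["ok"]]
    print(grep_reduce([grep_map(s) for s in splits]))
    # ['error: disk full', 'error: timeout']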
MapReduce Architecture

[Figure: MapReduce architecture diagram]
How Map and Reduce Work Together

  Very big data --> MAP --> partitioning function --> REDUCE --> result

• Map:
  – Accepts an input key/value pair
  – Emits intermediate key/value pairs
• Reduce:
  – Accepts an intermediate key and the list of all values emitted for
    that key
  – Emits output key/value pairs
http://ayende.com/blog/4435/map-reduce-a-visual-explanation
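The partitioning function in the diagram decides which reducer receives each
intermediate key, so that every value for a given key lands on the same
machine. A minimal sketch, assuming hash partitioning (this mirrors the
behaviour of Hadoop's default HashPartitioner, though the function name and
num_reducers parameter here are illustrative):

    def partition(key, num_reducers):
        # Route an intermediate key to one of num_reducers reducers. The same
        # key always hashes to the same reducer, which is what guarantees a
        # reducer sees the complete list of values for each of its keys.
        return hash(key) % num_reducers

    # Every mapper's ('cat', 1) pair goes to the same reducer:
    print(partition("cat", 4) == partition("cat", 4))   # True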
RDBMS compared to MapReduce

               RDBMS                       MapReduce
  Data size    Gigabytes                   Petabytes
  Access       Interactive and batch       Batch
  Updates      Read and write many times   Write once, read many times
  Integrity    High                        Low
  Scaling      Nonlinear                   Linear
  Structure    Static schema               Dynamic schema
Hadoop Family

  Pig         A platform for manipulating large data sets      (scripting)
  Mahout      Machine learning algorithms                      (machine learning)
  HBase       Bigtable-like structured storage for HDFS        (non-relational database)
  Hive        Data warehouse system                            (non-relational database)
  HDFS        Distributes and replicates data among machines   (Hadoop common)
  MapReduce   Distributes and monitors tasks                   (Hadoop common)
  ZooKeeper   Distributed coordination service
When to use Hadoop?
•   Complex information processing is needed
•   Unstructured data needs to be turned into structured data
•   Queries can’t be reasonably expressed using SQL
•   Heavily recursive algorithms
•   Complex but parallelizable algorithms are needed, such as geo-spatial
    analysis or genome sequencing
•   Machine learning
•   Data sets are too large to fit into database RAM or disks, or require too
    many cores (tens of TB up to PB)
•   The data's value does not justify the expense of constant real-time
    availability; archives or special-interest data can be moved to Hadoop
    and remain available at lower cost
•   Results are not needed in real time
•   Fault tolerance is critical
•   Significant custom coding would be required to handle job scheduling

•   Reference: http://timoelliott.com/blog/2011/09/hadoop-big-data-and-enterprise-business-intelligence.html
Building Blocks of Hadoop
• Hadoop runs a set of daemons on different servers on the network:

  • NameNode: keeps the HDFS namespace and file-to-block metadata
  • DataNode: stores and serves the actual HDFS data blocks
  • Secondary NameNode: takes periodic checkpoints of the NameNode's metadata
  • JobTracker: accepts MapReduce jobs and schedules their tasks across the
    cluster
  • TaskTracker: runs individual map and reduce tasks on a worker node
Questions?
References
• Hadoop in Action, Chuck Lam
• Hadoop: The Definitive Guide, Tom White
• http://hadoop.apache.org/
