Hadoop

Presented by
NIKHIL P L
Apache Hadoop
• Developer(s): Apache Software Foundation
• Type: Distributed File System
• License: Apache License 2.0
• Written in: Java
• OS: Cross-platform
• Created by: Doug Cutting (2005)
• Inspired by: Google's MapReduce, GFS
Sub projects
• HDFS
– distributed, scalable, and portable file system
– stores large data sets
– copes with hardware failure
– runs on top of the existing OS file system
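
A minimal sketch of writing a file through the HDFS Java API (org.apache.hadoop.fs.FileSystem); the NameNode address and file path here are hypothetical, and in practice the address is picked up from core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // hypothetical NameNode address; normally read from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:9000");
        FileSystem fs = FileSystem.get(conf);

        // write a small file; HDFS splits large files into blocks behind the scenes
        Path file = new Path("/user/demo/hello.txt");
        try (FSDataOutputStream out = fs.create(file)) {
            out.writeUTF("Hello, HDFS!");
        }
        System.out.println("exists: " + fs.exists(file));
    }
}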

HDFS - Replication
• Data blocks are replicated across multiple nodes
• Allows for node failure without data loss
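
A minimal sketch of controlling the replication factor from the same Java API; dfs.replication and FileSystem.setReplication are standard HDFS knobs, while the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("dfs.replication", "3");  // default replica count for new files
        FileSystem fs = FileSystem.get(conf);
        // raise the replication factor of an existing (hypothetical) file to 3
        fs.setReplication(new Path("/user/demo/hello.txt"), (short) 3);
    }
}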

Sub projects .
• MapReduce
– technology from Google
– Hadoop's core data processing model
– Map and Reduce functions
– useful in a wide range of applications
• distributed pattern-based searching, distributed
sorting, web link-graph reversal, machine learning,
statistical machine translation
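
As a sketch of the Map and Reduce functions, the classic word-count pair written against the standard org.apache.hadoop.mapreduce API; the map step emits (word, 1) pairs and the reduce step sums them per word:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, 1) for every token in a line of input
public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context ctx)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(value.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            ctx.write(word, ONE);
        }
    }
}

// Reduce: sum the counts gathered for each word after the shuffle
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) sum += v.get();
        ctx.write(key, new IntWritable(sum));
    }
}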

MapReduce - Workflow
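
A minimal driver sketch of the workflow: input files are split, mapped, shuffled and sorted by key, then reduced into output files. The class names come from the word-count sketch above and the input/output paths are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCountMapper.class);     // map phase
        job.setCombinerClass(WordCountReducer.class);  // optional local pre-aggregation
        job.setReducerClass(WordCountReducer.class);   // reduce phase, after shuffle/sort
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path("/user/demo/in"));    // hypothetical
        FileOutputFormat.setOutputPath(job, new Path("/user/demo/out")); // hypothetical
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}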

Hadoop cluster (Terminology)

Types of Nodes
• HDFS nodes
– NameNode (Master)
– DataNode (Slaves)

• MapReduce nodes
– JobTracker (Master)
– TaskTracker (Slaves)
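
A minimal sketch of the NameNode's master role: clients ask it where a file's blocks live before reading from the DataNodes. FileSystem.getFileBlockLocations is the standard call; the file path is hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        FileStatus st = fs.getFileStatus(new Path("/user/demo/hello.txt")); // hypothetical
        // the NameNode answers with block -> DataNode placement metadata
        for (BlockLocation loc : fs.getFileBlockLocations(st, 0, st.getLen())) {
            System.out.println(loc); // offset, length, and hosting DataNodes
        }
    }
}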

Types of Nodes .

Sub projects ..
• Hive
– provides data summarization, query, and analysis
– initially developed by Facebook

• HBase
– open-source, non-relational, distributed database
– provides Google BigTable-like database capabilities
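
A minimal sketch of storing and reading one cell with the HBase Java client API; the table, column family, and values here are hypothetical:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table
            // store one cell: row "row1", column family "info", qualifier "name"
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
            table.put(put);
            // read the cell back
            Result r = table.get(new Get(Bytes.toBytes("row1")));
            System.out.println(Bytes.toString(
                    r.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"))));
        }
    }
}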

Sub projects …
• ZooKeeper
– distributed configuration service, synchronization
service, notification system, and naming registry for
large distributed systems (see the sketch after this list)

• Pig
– A language and compiler to generate Hadoop
programs
– Originally developed at Yahoo!
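
A minimal ZooKeeper sketch: publishing a piece of shared configuration under a znode and reading it back. The ensemble address, znode path, and payload are hypothetical:

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZkExample {
    public static void main(String[] args) throws Exception {
        // connect to a (hypothetical) ZooKeeper ensemble; 3 s session timeout, no watcher
        ZooKeeper zk = new ZooKeeper("zk1:2181", 3000, null);
        // publish shared configuration under a well-known znode path
        zk.create("/app/config", "v1".getBytes(),
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        System.out.println(new String(zk.getData("/app/config", false, null)));
        zk.close();
    }
}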

How does Hadoop work? .
• HDFS Works

How does Hadoop work? ..
• MapReduce Works

How does Hadoop work? …
• MapReduce Works

How does Hadoop work? ….
• Managing Hadoop Jobs

Applications
• Marketing analytics
• Machine learning (e.g., spam filters)
• Image processing
• Processing of XML messages

Yahoo!
• World's largest Hadoop production application
• ~20,000 machines running Hadoop

Facebook
• The largest Hadoop cluster in the world, with
100 PB of storage
• 1,200 machines with 8 cores each + 800
machines with 16 cores each
• 32 GB of RAM per machine
• 65 million files in HDFS
• 12 TB of compressed data added per day

Other Users

Thanks
