Intro to Apache Hadoop

Intro to Apache™ Hadoop®
A Brown Bag Session at EAI Technologies
by Sufi Nawaz

What is this Hadoop you speak of?
"Apache Hadoop is an open-
source software framework that
supports data-intensive
distributed applications, licensed
under the Apache v2 license. It
supports the running of
applications on large clusters of
commodity hardware."
- Wikipedia
Doug Cutting
(Creator)

More about Hadoop
● It is a highly scalable, fault-tolerant,
distributed compute and storage platform.
● Based on Google's GFS and MapReduce papers.
● Brings computation to data and not the other
way around.
● Created by Doug Cutting and Mike Cafarella
in 2005.
● Originally developed to support distribution
for the Nutch search engine project.

Why use Hadoop?
● Process lots of data - in petabytes even
● Distributed processing
● Uses simple programming models
● Scalable - add new nodes simply
● Cost effective - uses commodity hardware
● Flexible - Hadoop is schema-less and can
absorb any kind of data
● Fault tolerant - redistribution of failed jobs
and data recovery by data replication

When to use Hadoop, and when not?
Good for:
● Indexing Data
● Log Analysis
● Image Manipulation
● Sorting Large Scale Data
● Data Mining
Bad for:
● Real-time processing
● Processing-intensive tasks with little data

Hadoop Modules
- Hadoop Common
- Hadoop Distributed File System (HDFS)
- Hadoop YARN
- Hadoop MapReduce

Hadoop Distributed File System
(HDFS)

HDFS is the primary distributed storage
component used by applications under the
Apache Hadoop project.
HDFS can also serve as a stand-alone
distributed file system.

A single Namenode maintains the directory
tree and manages the namespace and client
access to files. It holds the metadata (the list of
files, blocks and datanodes) in memory.
Datanodes store and manage the data blocks
as local files on servers throughout the rest of
the cluster. They report to the Namenode with
heartbeats.
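
To make the division of labour concrete, here is a minimal Python sketch of a namenode that keeps its metadata in in-memory dictionaries and tracks datanodes by heartbeat. The class name `ToyNameNode`, its methods and the 30-second timeout are assumptions made for illustration, not HDFS APIs or defaults, and folding the block list into the heartbeat is a simplification.

```python
import time


class ToyNameNode:
    """Toy model: all metadata lives in ordinary in-memory dictionaries."""

    def __init__(self, heartbeat_timeout=30.0):
        self.file_blocks = {}      # file path -> ordered list of block ids
        self.block_locations = {}  # block id -> set of datanode ids
        self.last_heartbeat = {}   # datanode id -> time of last heartbeat
        self.heartbeat_timeout = heartbeat_timeout  # hypothetical value

    def add_file(self, path, block_ids):
        self.file_blocks[path] = list(block_ids)

    def heartbeat(self, datanode_id, block_ids, now=None):
        """Record that a datanode is alive and which blocks it reports."""
        now = time.time() if now is None else now
        self.last_heartbeat[datanode_id] = now
        for block_id in block_ids:
            self.block_locations.setdefault(block_id, set()).add(datanode_id)

    def live_datanodes(self, now=None):
        """Datanodes whose last heartbeat is within the timeout."""
        now = time.time() if now is None else now
        return sorted(d for d, t in self.last_heartbeat.items()
                      if now - t <= self.heartbeat_timeout)

    def locate(self, path, now=None):
        """Return (block id, live datanodes holding it) for each block."""
        live = set(self.live_datanodes(now))
        return [(b, sorted(self.block_locations.get(b, set()) & live))
                for b in self.file_blocks[path]]
```

A datanode that stops sending heartbeats simply drops out of the `locate` results, which is the point at which a real cluster would start re-replicating its blocks.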

What is HDFS bad for?
● Low-latency data access. HDFS trades
latency for higher data throughput.
● Lots of small files. The default block size is
64MB, and the namenode holds metadata for
every file and block in memory, so many small
files increase its memory requirements.
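
The small-files point can be checked with rough arithmetic. The sketch below counts one namenode object per file plus one per block. The 150 bytes per object is an assumed rule of thumb rather than a measured figure, and the function names are made up for this example.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64MB default block size, as stated above
BYTES_PER_OBJECT = 150         # ASSUMPTION: rough rule of thumb, not measured
MB = 1024 * 1024


def blocks_needed(file_size):
    """Round up to whole blocks; an empty file needs no block."""
    return (file_size + BLOCK_SIZE - 1) // BLOCK_SIZE


def namenode_bytes(file_sizes):
    """Estimate namenode memory: one object per file plus one per block."""
    objects = sum(1 + blocks_needed(size) for size in file_sizes)
    return objects * BYTES_PER_OBJECT


one_big_file = namenode_bytes([1024 * MB])   # 1GB stored as a single file
many_small_files = namenode_bytes([MB] * 1024)  # the same 1GB as 1MB files
```

Under these assumptions the single file costs 17 objects (1 file plus 16 blocks), while the 1024 small files cost 2048 objects, roughly 120 times as much namenode memory for the same amount of data.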
● Multiple writers and arbitrary modification.

Anatomy of write
● DFSOutputStream splits data into packets.
● It writes the packets into an internal data queue.
● DataStreamer consumes the data queue and asks the
namenode for a list of datanodes.
● Namenode gives a list of datanodes for the
pipeline.
● DFSOutputStream also maintains an internal queue
of packets waiting to be acknowledged.
Anatomy of read
● Namenode returns locations of blocks.
● Datanode list is sorted according to their proximity to the
client.
● FSDataInputStream wraps DFSInputStream, which
manages datanode and namenode I/O.
● Read is called repeatedly on the datanode until the end of
the block is reached.
● The stream then finds the datanode for the next data block.
● All of this happens transparently to the client.
● The client calls close after it finishes reading the data.
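
The read path above can be sketched as a loop over block locations that picks the closest datanode for each block. The function `read_file`, the dictionary layout and the numeric distances are assumptions for illustration. A real client works through stream classes rather than a single function call.

```python
def read_file(block_locations, datanodes, distance):
    """Read blocks in order, each from the closest datanode that holds it.

    block_locations: ordered list of (block id, [datanode ids]),
                     as a namenode might return it
    datanodes:       datanode id -> {block id: bytes}
    distance:        datanode id -> proximity to the client (smaller = closer)
    Returns the file contents and the datanode used for each block.
    """
    chunks, sources = [], []
    for block_id, holders in block_locations:
        closest = sorted(holders, key=lambda dn: distance[dn])[0]
        chunks.append(datanodes[closest][block_id])
        sources.append(closest)
    return b"".join(chunks), sources


# Hypothetical three-node cluster holding a two-block file.
datanodes = {
    "dn1": {"b1": b"hello "},
    "dn2": {"b1": b"hello ", "b2": b"world"},
    "dn3": {"b2": b"world"},
}
distance = {"dn1": 0, "dn2": 2, "dn3": 4}
locations = [("b1", ["dn2", "dn1"]), ("b2", ["dn3", "dn2"])]
data, sources = read_file(locations, datanodes, distance)
```

The caller only sees the returned bytes. Which datanode served which block stays an internal detail, matching the "transparent to the client" point.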

Accessibility
● DFS Shell
● DFS Admin
● Browser Interface
● Mountable HDFS

MapReduce
Main Components
● JobClient
● JobTracker
● TaskTracker

MapReduce
JobTracker (Master)
● Single JobTracker per cluster
● Schedules Map and Reduce tasks for TaskTrackers
● Monitors tasks and keeps track of TaskTracker status
● Re-executes tasks on failure
TaskTracker (Slave)
● Single TaskTracker per node (multiple in a cluster)
● Runs Map and Reduce tasks
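
The schedule, monitor and re-execute cycle can be illustrated with a toy scheduler. Everything here is an assumption made for the sketch: the function `run_job`, the round-robin assignment, the `max_attempts` limit and the tracker names are not taken from Hadoop's scheduler.

```python
from collections import deque


def run_job(tasks, trackers, max_attempts=3):
    """Assign tasks to trackers round-robin; re-queue a task when it fails.

    tasks:    task id -> zero-argument callable returning the task's result
    trackers: list of tracker names (one per node)
    Returns (results, attempts); attempts logs (task id, tracker, succeeded).
    """
    pending = deque((task_id, 1) for task_id in tasks)
    results, attempts = {}, []
    slot = 0
    while pending:
        task_id, attempt = pending.popleft()
        tracker = trackers[slot % len(trackers)]
        slot += 1
        try:
            results[task_id] = tasks[task_id]()
            attempts.append((task_id, tracker, True))
        except Exception:
            attempts.append((task_id, tracker, False))
            if attempt >= max_attempts:
                raise RuntimeError(f"task {task_id} failed {attempt} times")
            pending.append((task_id, attempt + 1))  # re-execute later
    return results, attempts
```

In this sketch a failed task goes to the back of the queue, so it is usually retried on a different tracker from the one where it failed.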

Who uses Hadoop?
● Yahoo!
○ Support research for Ad Systems and Web Search
● Facebook
○ 2 major clusters (1100 + 300 machines w/ 8 cores)
○ Heavy users of both streaming and Java APIs.
○ Have developed a FUSE implementation on HDFS.
● eBay
○ 532-node cluster (8 * 532 cores, 5.3PB)
● Hulu
○ 13-machine cluster (8 cores/machine, 4TB/machine)
○ Log storage and analysis
● Many more
○ http://wiki.apache.org/hadoop/PoweredBy

Where can I find resources?
● Hadoop Docs
○ http://hadoop.apache.org/docs/current/
● Mailing List:
○ http://hadoop.apache.org/mailing_lists.html
● White papers from Cloudera, Intel, Dell, etc.
● Hadoop in 20 Pages (http://blog.imaginea.com/hadoop-a-short-guide/)
● Yahoo! CDN Hadoop Tutorial
● Google Search Engine (!)

Some Additional Info
● Hadoop Streaming
○ Run MapReduce with any language supporting
standard I/O e.g. ruby, python.
● Hadoop Distributed Cache
○ Copies the specified files to every node that runs
tasks for a job, so tasks can read them locally.
● Hadoop Security
○ Secure Hadoop with Kerberos
● Hadoop Federation
○ Multiple independent NameNodes, each managing part
of the namespace, so the namespace can scale out.
● NameNode High Availability (HA)
○ Removes the NameNode as a Single Point of Failure
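
As a sketch of the Hadoop Streaming item above, a word count can be written as two small Python functions that work on lines of text. Streaming passes records over standard input and output as tab-separated key/value lines and sorts the mapper output by key before the reducer sees it. The function names `mapper` and `reducer` are choices made for this example.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' line for every word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(sorted_lines):
    """Sum the counts per word; the input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local dry run: sorted() stands in for the framework's shuffle and sort.
counts = list(reducer(sorted(mapper(["a b a\n", "b c\n"]))))
```

To run these under Streaming, each function would be wrapped in its own script that reads `sys.stdin` and prints every yielded line.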
