Introduction to Apache Hadoop

BACS 488 – February 6, 2012
Monfort College of Business
Christopher Pezza

Overview
 Data Storage and Analysis
 Comparison with other Systems
 HPC and Grid Computing
 Volunteer Computing
 History of Hadoop
 Analyzing Data with Hadoop
 Hadoop in the Enterprise
 The Collective Wisdom of the Valley

The Problem
 IDC estimates that the size of the
digital universe reached 1.8 zettabytes
by the end of 2011
◦ 1 zettabyte = 1,000 exabytes = 1M
petabytes
 Individual data footprints are growing
 Storing and analyzing datasets in the
petabyte range requires new and
innovative solutions

The Problem
 Storage capacities of hard drives have
increased but transfer rates have not
kept up
◦ Solution: read from multiple disks at once
 Hardware failure: the more disks in
use, the more likely one is to fail
 Most analysis tasks need to be able to
combine the data in some way.
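A rough back-of-the-envelope check of the "read from multiple disks at once" idea, sketched in Python. The drive size (1 TB) and transfer rate (100 MB/s) are round numbers assumed for illustration, not measurements.

```python
# Illustrative numbers only: a 1 TB drive read at 100 MB/s (assumed values).
DRIVE_BYTES = 10**12          # 1 TB
TRANSFER_BYTES_PER_S = 10**8  # 100 MB/s


def read_time_seconds(total_bytes, disks, rate=TRANSFER_BYTES_PER_S):
    """Time to read total_bytes when it is split evenly across `disks`
    drives that are all read at the same time."""
    return total_bytes / disks / rate


one_disk = read_time_seconds(DRIVE_BYTES, disks=1)
hundred_disks = read_time_seconds(DRIVE_BYTES, disks=100)

print(f"1 disk:    {one_disk / 3600:.1f} hours")
print(f"100 disks: {hundred_disks / 60:.1f} minutes")
```

Under these assumptions a full scan drops from hours to minutes, which is why spreading data across many disks matters more than buying a bigger one.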

What Hadoop provides:
 The ability to read and write data in
parallel to or from multiple disks
 Enables applications to work with
thousands of nodes and petabytes of
data.
 A reliable shared storage and analysis
system (HDFS and MapReduce)
 A free, open-source license (Apache
License 2.0)

MapReduce vs. RDBMS
 MapReduce Premise: the entire
dataset—or at least a good portion of
it—is processed for each query.
◦ Batch Query Processor
 Another Trend: Seek time is improving
more slowly than transfer rate
 MapReduce is good for analyzing the
whole dataset, whereas RDBMS is
good for point queries or updates.
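The seek-versus-transfer trade-off can be made concrete with a small calculation. This is a deliberately crude single-disk model: the seek time, transfer rate, record size, and record count are all assumed round numbers, and it ignores caching, indexes, and write costs.

```python
# All figures below are assumed, round numbers chosen for illustration.
SEEK_S = 0.010                 # 10 ms per random seek
TRANSFER_BYTES_PER_S = 10**8   # 100 MB/s sequential transfer
RECORD_BYTES = 1_000           # 1 KB per record
NUM_RECORDS = 10**9            # 1 TB dataset in total


def seek_update_seconds(fraction):
    """Touch a fraction of the records in place, paying one seek per record."""
    return fraction * NUM_RECORDS * SEEK_S


def streaming_pass_seconds():
    """Stream through the whole dataset once at the sequential transfer rate."""
    return NUM_RECORDS * RECORD_BYTES / TRANSFER_BYTES_PER_S


for fraction in (0.00001, 0.001, 0.01):
    print(f"touch {fraction:.3%} of records: "
          f"seeks {seek_update_seconds(fraction):>9.0f} s, "
          f"full streaming pass {streaming_pass_seconds():.0f} s")
```

In this model, seeking wins only when a tiny fraction of records is touched (the point-query case); once the fraction grows, a single streaming pass over everything is cheaper, which is the batch case MapReduce targets.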

MapReduce vs. RDBMS
                 Traditional RDBMS            MapReduce
Data Size        Gigabytes                    Petabytes
Access           Interactive and batch        Batch
Updates          Read and write many times    Write once, read many times
Structure        Static schema                Dynamic schema
Integrity        High                         Low
Scaling          Nonlinear                    Linear

• MapReduce suits applications where
the data is written once, and read
many times, whereas an RDBMS is
good for datasets that are continually
updated.

Data Structure
 Structured Data – data organized into
entities that have a defined format.
◦ Realm of RDBMS
 Semi-Structured Data – there may be
a schema, but it is often ignored; the
schema is used only as a guide to the
structure of the data.
 Unstructured Data – doesn’t have any
particular internal structure.
 MapReduce works well with semi-
structured and unstructured data.

More differences…
 Relational data is often normalized to
retain its integrity and remove
redundancy
 Normalization poses problems for
MapReduce
 MapReduce is a linearly scalable
programming model.
 Over time, the differences between
RDBMS and MapReduce are likely to
blur

HPC and Grid Computing
 The approach in HPC is to distribute the
work across a cluster of machines, which
access a shared filesystem, hosted by a
SAN.
◦ With very large datasets, network bandwidth
is the bottleneck and compute nodes become idle
 MapReduce tries to collocate the data
with the compute node, so data access
is fast since it is local.
◦ Works to conserve bandwidth by explicitly
modeling network topology.
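One way to picture "modeling network topology" is as a tree of data center, rack, and node, where the distance between two machines is the number of hops up to their closest common ancestor and back down. The sketch below follows that idea. The path strings and function names are made up for illustration, and this is not Hadoop's actual API.

```python
def distance(a, b):
    """Hops from a up to the closest common ancestor, plus hops down to b.

    Locations are hypothetical paths of the form "/datacenter/rack/node".
    """
    pa = a.strip("/").split("/")
    pb = b.strip("/").split("/")
    common = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        common += 1
    return (len(pa) - common) + (len(pb) - common)


def closest_replica(reader, replicas):
    """Pick the replica location with the smallest distance to the reader."""
    return min(replicas, key=lambda r: distance(reader, r))


reader = "/d1/r1/n1"
replicas = ["/d2/r3/n4", "/d1/r2/n3", "/d1/r1/n2"]
print(closest_replica(reader, replicas))
```

Here the reader prefers the copy on its own rack over copies on another rack or in another data center, which is how locality-aware placement keeps traffic off the scarce cross-rack links.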

Handling Partial Failure
 MapReduce – implementation detects
failed map or reduce tasks and
reschedules replacements on
machines that are healthy
 Shared-Nothing Architecture – tasks
have no dependence on one another
 By contrast, MPI programs have to
explicitly manage their own
checkpointing and recovery.
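A toy illustration of rescheduling failed tasks, assuming tasks are independent so that any one of them can simply be rerun elsewhere. The function names, the round-robin worker choice, and the attempt limit are all assumptions made for this sketch; a real scheduler is considerably more involved.

```python
def run_with_retries(tasks, workers, run_task, max_attempts=4):
    """Run each task, moving to another worker when an attempt fails.

    tasks:    list of task ids
    workers:  list of worker names, tried in round-robin order
    run_task: callable(task, worker) that raises RuntimeError on failure
    Returns a dict mapping task id -> worker that completed it.
    """
    completed = {}
    for i, task in enumerate(tasks):
        for attempt in range(max_attempts):
            worker = workers[(i + attempt) % len(workers)]
            try:
                run_task(task, worker)
            except RuntimeError:
                continue  # attempt failed: reschedule on the next worker
            completed[task] = worker
            break
        else:
            raise RuntimeError(f"task {task!r} failed {max_attempts} times")
    return completed


# Hypothetical cluster in which worker "w2" is unhealthy.
def flaky_run(task, worker):
    if worker == "w2":
        raise RuntimeError(f"{worker} is down")


result = run_with_retries(["map-0", "map-1", "map-2"],
                          ["w1", "w2", "w3"], flaky_run)
print(result)
```

Because no task depends on another, the retry touches only the failed task; the caller never has to write checkpointing or recovery logic itself.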

Why is MapReduce cool?
 Invented by engineers at Google as a
system for building production search
indexes because they found
themselves solving the same problem
over and over again.
 A wide range of algorithms can be expressed:
◦ Image Analysis
◦ Graph-based problems
◦ Machine Learning

Volunteer Computing
 SETI@home
 MapReduce is designed to run jobs that
last minutes or hours on trusted,
dedicated hardware running in a single
data center with very high aggregate
bandwidth interconnects.
 SETI@home runs a perpetual
computation on untrusted machines on
the Internet with highly variable
connection speeds and no data locality

History of Hadoop
 Created by Doug Cutting
 2002 – Apache Nutch, open source web
search engine
 2003 – Google publishes a paper describing
the architecture of their distributed filesystem,
GFS.
 2004 – Nutch Distributed Filesystem (NDFS)
 2004 – Google publishes a paper on
MapReduce
 2005 – Nutch MapReduce implementation
 2006 – Hadoop is created; Cutting joins
Yahoo!
 2008 – Yahoo! announces that its production
search index is generated by a 10,000-core
Hadoop cluster

Hadoop Projects
 Common
 Avro
 MapReduce
 HDFS
 Pig
 Hive
 HBase
 ZooKeeper
 Sqoop

Analyzing Data with Hadoop
 Case: NCDC Weather Data
◦ What’s the highest recorded global temp for each
year in the dataset?
 Express our query as a MapReduce job
 MapReduce breaks the processing into two
phases: Map and Reduce
 Input to our Map phase is raw NCDC data
 Map Function: Pull out the year and air
temperature AND filter out temps that are
missing, suspect or erroneous.
 Reducer Function: finding the max temp for
each year
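A minimal sketch of the map step in Python. It assumes a simplified "year,temperature,quality" line rather than the raw NCDC layout, and the missing-value sentinel, quality codes, and sample records are all hypothetical values chosen for illustration.

```python
MISSING = 9999                            # assumed sentinel for a missing reading
GOOD_QUALITY = {"0", "1", "4", "5", "9"}  # assumed codes for trustworthy readings


def map_record(line):
    """Yield (year, temperature) for one record, or nothing if it is filtered out.

    Assumes a simplified "year,temperature,quality" line, not the raw NCDC layout.
    """
    year, temp, quality = line.strip().split(",")
    temp = int(temp)
    if temp != MISSING and quality in GOOD_QUALITY:
        yield int(year), temp


# Made-up input records for illustration.
records = [
    "1950,0,1",
    "1950,22,1",
    "1950,9999,9",   # missing reading: dropped
    "1950,-11,1",
    "1949,111,1",
    "1949,300,2",    # suspect quality code: dropped
    "1949,78,1",
]
pairs = [pair for line in records for pair in map_record(line)]
print(pairs)
```

The map function does two jobs at once: it projects each record down to the (year, temperature) pair the query needs, and it discards readings that should not influence the maximum.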

MapReduce Example
 Map function extracts the year and
temp:
◦ (1950, 0), (1950, 22), (1950, -11), (1949,
111), (1949, 78)
 MapReduce sorts and groups the
data:
◦ (1949, [111,78])
◦ (1950, [0, 22, -11])
 Reduce function iterates through each
list and picks the maximum:
◦ (1949, 111)
◦ (1950, 22)
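The sort, group, and reduce steps can be simulated in memory with plain Python, using the map output shown on this slide. This is only a single-process sketch of the data flow; the function names are made up, and in Hadoop the framework performs the sorting and grouping across machines.

```python
from itertools import groupby
from operator import itemgetter

# Map output, copied from the slide above.
pairs = [(1950, 0), (1950, 22), (1950, -11), (1949, 111), (1949, 78)]


def shuffle(pairs):
    """Sort by key and group the values for each key, as the framework would."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [(year, [temp for _, temp in group])
            for year, group in groupby(ordered, key=itemgetter(0))]


def reduce_max(year, temps):
    """Reduce step: keep the highest temperature seen for the year."""
    return year, max(temps)


grouped = shuffle(pairs)
result = [reduce_max(year, temps) for year, temps in grouped]
print(grouped)
print(result)
```

The reducer only ever sees one year's readings at a time, so each year can be reduced on a different machine without coordination.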

Hadoop in the Enterprise
 Accelerate nightly batch business processes
 Storage of extremely high volumes of data
 Creation of automatic, redundant backups
 Improving the scalability of applications
 Use of Java for data processing instead of
SQL
 Producing JIT feeds for dashboards and BI
 Handling urgent, ad hoc requests for data
 Turning unstructured data into relational data
 Taking on tasks that require massive
parallelism
 Moving existing algorithms, code,
frameworks, and components to a highly
distributed computing environment

Hadoop in the News
 A TechCrunch article (July 2011) notes
that the open-source LAMP stack
transformed web startup economics 10
years ago
 It argues that Hadoop is now displacing
expensive proprietary solutions.
 Hadoop’s architecture of map-reducing
across a cluster of commodity nodes
is more flexible and cost-effective than
traditional data warehouses.
 3 areas of application in startups:
◦ Analysis of Customer Behavior
◦ Powering new user-facing features
◦ Enabling entire new lines of business

An interesting point to close on…
 From TechCrunch: “What is most
remarkable is how the startup world is
collectively creating this ecosystem:
Yahoo, Facebook, Twitter, LinkedIn, and
other companies are actively adding to
the tool chain. This illustrates a new
thesis or collective wisdom rising from
the valley: If a technology is not your
core value-add, it should be
open-sourced because then others can
improve it, and potential future
employees can learn it. This rising tide
has lifted all boats, and is just getting
started”

Training and Certifications
 Hortonworks – Believes that Apache
Hadoop will process half of the world’s
data within the next five years
◦ Hortonworks Data Platform – open source
distribution of Apache Hadoop
◦ Support, Training, Partner Enablement
programs designed to assist enterprises
and solution providers
 Hortonworks University

Extra Resources
 Running Hadoop on Ubuntu Linux
(Single-Node Cluster)
 Running Hadoop on Amazon EC2

Works Cited
 White, Tom (2011). Hadoop: The Definitive Guide. Sebastopol, CA: O’Reilly.
 TechCrunch (July 2011). “Hadoop and Startups: Where Open Source Meets Business Data”
 Wikipedia – Apache Hadoop
 Apache Hadoop Website
