1. Hadoop: Past, Present and Future
Chris Harris
Email : charris@hortonworks.com
Twitter : cj_harris5
© Hortonworks Inc. 2013
4. A Brief History of Apache Hadoop
[Timeline, 2004-2013]
2005: Yahoo! creates a team under E14 to work on Hadoop
2006: Apache project established
2008: Yahoo! begins to operate at scale
2012-2013: Hortonworks Data Platform delivers Enterprise Hadoop
5. Key Hadoop Data Types
1. Sentiment
Understand how your customers feel about your brand and
products – right now
2. Clickstream
Capture and analyze website visitors’ data trails and
optimize your website
3. Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines
4. Geographic
Analyze location-based data to manage operations where
they occur
5. Server Logs
Research logs to diagnose process failures and prevent
security breaches
6. Unstructured (text, video, pictures, etc.)
Understand patterns in files across millions of web pages,
emails, and documents
7. Hadoop 1
• Limited to roughly 4,000 nodes per cluster
• JobTracker work grows as O(number of tasks in the cluster)
• JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• MapReduce is the only type of job that can run
8. Hadoop 1 - Basics
MapReduce (computation framework) runs directly on HDFS (storage framework).
[Diagram: file blocks A, B, and C, each replicated three times across the DataNodes of the HDFS layer]
9. Hadoop 1 - Reading Files
[Diagram: the Hadoop client sends a read-file request to the NameNode, which returns the DataNodes, block IDs, etc.; the client then reads the blocks directly from DataNode/TaskTracker (DN | TT) machines across Rack1..RackN. DataNodes send heartbeats and block reports to the NameNode, and the Secondary NameNode (fsimage/edits) performs checkpoints.]
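To make the read path concrete, here is a minimal sketch of an HDFS read using the standard org.apache.hadoop.fs.FileSystem client API (the component that performs the NameNode/DataNode exchange above); the cluster URI and file path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster URI; normally picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for block locations, then the stream
    // reads the blocks directly from the DataNodes
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    }
  }
}
```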
10. Hadoop 1 - Writing Files
[Diagram: the client sends a write request to the NameNode, which returns the target DataNodes; the client writes blocks to the first DataNode, and replication pipelining forwards each block to the remaining replicas across Rack1..RackN. DataNodes send block reports to the NameNode, and the Secondary NameNode (fsimage/edits) performs checkpoints.]
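The write path looks similar from the client's side; a minimal sketch with the same FileSystem API (hypothetical path), where the returned stream drives the replication pipeline shown above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() asks the NameNode to allocate blocks; each block written
    // to the stream is pushed through the DataNode replication pipeline
    try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
      out.writeBytes("hello, hdfs\n");
    }
  }
}
```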
11. Hadoop 1 - Running Jobs
[Diagram: the Hadoop client submits a job to the JobTracker, which deploys map tasks to DataNode/TaskTracker (DN | TT) machines near the data; map output is shuffled to the reduce tasks, which write the result (part 0) back to HDFS across Rack1..RackN.]
12. Hadoop 1 - Security
[Diagram: users authenticate (authN/authZ) against LDAP/AD and obtain Kerberos tickets from the KDC; service requests pass through the firewall from a client node/spoke server into the Hadoop cluster, which can be extended with an encryption plugin.]
* the block token is for accessing data
* the delegation token is for running jobs
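A minimal sketch of how a client authenticates to a Kerberized cluster with the real org.apache.hadoop.security.UserGroupInformation API before fetching any tokens; the principal and keytab path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client libraries the cluster is Kerberized
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // Obtain a Kerberos ticket from the KDC using a keytab
    // (hypothetical principal and keytab path)
    UserGroupInformation.loginUserFromKeytab(
        "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
    // Subsequent FileSystem/JobClient calls acquire block tokens and
    // delegation tokens on this user's behalf
    System.out.println("Logged in as "
        + UserGroupInformation.getCurrentUser().getUserName());
  }
}
```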
13. Hadoop 1 - APIs
• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
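To show these classes working together, a minimal word-count sketch against this API; the input/output paths are hypothetical, and Hadoop 2 users would prefer Job.getInstance(conf, name) over the Hadoop 1 constructor used here:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in a line
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/in"));    // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/out")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```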
15. Hadoop 2
• Scales potentially up to 10,000 nodes per cluster
• Scheduling work grows as O(cluster size)
• Supports multiple namespaces for managing HDFS (federation)
• Efficient cluster utilization (YARN)
• Backward (and forward) compatible with MRv1 applications
• Any application can integrate with Hadoop
• Goes beyond Java
17. Hadoop 2 - Reading Files (w/ NN Federation)
[Diagram: the Hadoop client sends the read-file request to the federated NameNode (NN1/ns1 .. NN4/ns4) that owns the file's namespace; that NameNode returns the DataNodes, block IDs, etc., and the client reads the blocks from DataNode/NodeManager (DN | NM) machines across Rack1..RackN. Each NameNode is checkpointed by its own Secondary NameNode (fsimage/edits copy) or kept in sync (fs sync) with a Backup NameNode. DataNodes register and send heartbeats/block reports to every NameNode, and store blocks in per-namespace block pools, e.g. ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 and ns4 on dn4, dn5.]
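Clients usually hide federation behind a client-side mount table (ViewFileSystem), so one logical namespace spans the federated NameNodes. A minimal sketch using the real fs.viewfs.mounttable.* configuration keys; the cluster name, hosts, and mount points are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One logical namespace stitched over several NameNodes
    conf.set("fs.defaultFS", "viewfs://cluster");
    conf.set("fs.viewfs.mounttable.cluster.link./logs",
        "hdfs://nn1.example.com:8020/logs");        // served by ns1
    conf.set("fs.viewfs.mounttable.cluster.link./warehouse",
        "hdfs://nn2.example.com:8020/warehouse");   // served by ns2
    FileSystem fs = FileSystem.get(conf);
    // The mount table routes this path to the NameNode that owns it
    System.out.println(fs.exists(new Path("/logs/2013/10/01")));
  }
}
```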
18. Hadoop 2 - Writing Files
[Diagram: the client sends the write request to the NameNode that owns the target namespace (NN1/ns1 .. NN4/ns4); that NameNode returns the target DataNodes, and the client writes the blocks to DN | NM machines across Rack1..RackN with replication pipelining. Block reports go back to the owning NameNode, which is checkpointed by its Secondary NameNode or synced to a Backup NameNode (fsimage/edits copy, fs sync).]
19. Hadoop 2 - Running Jobs
[Diagram: each Hadoop client creates and submits its application (app1, app2) to the ResourceManager. Inside the ResourceManager, the ApplicationsManager (ASM) queues applications and the Scheduler partitions cluster resources. A per-application ApplicationMaster (AM1, AM2) runs on a NodeManager, negotiates containers with the Scheduler, and reports to the ASM; containers (C1.1..C1.4, C2.1..C2.3) run on NodeManagers across Rack1..RackN and send status reports back to their ApplicationMaster.]
20. Hadoop 2 - Security
[Diagram: JDBC clients, REST clients, and browsers (HUE) pass through a firewall into a DMZ hosting the Knox Gateway cluster, which authenticates against the KDC, LDAP/AD, and an enterprise/cloud SSO provider; a second firewall separates the DMZ from the Hadoop cluster, which adds native Hive/HBase encryption.]
21. Hadoop 2 - APIs
• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol
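Most applications talk to these protocols through the client-library wrappers. A minimal sketch with the real org.apache.hadoop.yarn.client.api.YarnClient, which speaks ApplicationClientProtocol to the ResourceManager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnHello {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());
    yarn.start();
    // Ask the ResourceManager for a new application id
    // (ApplicationClientProtocol under the hood)
    YarnClientApplication app = yarn.createApplication();
    ApplicationId id = app.getNewApplicationResponse().getApplicationId();
    System.out.println("Got application id: " + id);
    // A full application would now fill in an ApplicationSubmissionContext
    // (AM launch command and resources) and call yarn.submitApplication(...)
    yarn.stop();
  }
}
```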
23. Apache Tez
A New Hadoop Data Processing Framework
24. HDP: Enterprise Hadoop Distribution
[Diagram: the Hortonworks Data Platform (HDP) stack for Enterprise Hadoop. Core platform services: HDFS, YARN*, MapReduce*, Tez*; load & extract: NFS, WebHDFS, Flume, Sqoop, and others; data services: Hive & HCatalog, HBase, Pig; operational services: Ambari, Oozie, Falcon*; perimeter security: Knox*. Enterprise readiness: high availability, disaster recovery, rolling upgrades, security, and snapshots.]
• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability
25. Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
– A new incubation project in the ASF
• Who else is involved?
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop
26. Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end
of the queue between steps in the pipeline
27. Tez - Core Idea
A task with pluggable Input, Processor, and Output: Tez Task = <Input, Processor, Output>. A YARN ApplicationMaster runs a DAG of Tez tasks.
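A conceptual sketch of that triple; these interfaces are hypothetical and only illustrate the composition, not the actual org.apache.tez API:

```java
// Hypothetical interfaces illustrating the Tez task model; the real
// classes live under org.apache.tez and differ in detail.
interface Input<T> { Iterable<T> read() throws Exception; }
interface Output<T> { void write(T record) throws Exception; }
interface Processor<I, O> {
  // Pull records from the Input and push results to the Output
  void run(Input<I> in, Output<O> out) throws Exception;
}

// A Tez task is the triple <Input, Processor, Output>; swapping the ends
// turns the same processor into a map task (HDFS in, sorted out), a
// reduce task (shuffle in, HDFS out), or an intermediate reduce
// (shuffle in, sorted out), as the next slide shows.
final class TezTask<I, O> {
  private final Input<I> in;
  private final Processor<I, O> proc;
  private final Output<O> out;
  TezTask(Input<I> in, Processor<I, O> proc, Output<O> out) {
    this.in = in; this.proc = proc; this.out = out;
  }
  void run() throws Exception { proc.run(in, out); }
}
```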
28. Building Blocks for Tasks
[Diagram: task compositions built from the <Input, Processor, Output> triple.
MapReduce 'Map' task: HDFS Input -> Map Processor -> Sorted Output.
MapReduce 'Reduce' task: Shuffle Input -> Reduce Processor -> HDFS Output.
Special Pig/Hive 'Map' (Tez task): HDFS Input -> Map Processor -> Pipeline Sorter Output.
Special Pig/Hive 'Reduce' (Tez task): Shuffle Skip-merge Input -> Reduce Processor -> Sorted Output.
Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output.
In-memory Map (Tez task): HDFS Input -> Map Processor -> In-memory Sorted Output.]
29. Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: on MapReduce, this query runs as three chained jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier (a full write to and read from HDFS) between each pair; on Tez, it runs as a single job with no intermediate HDFS barriers.]
30. Tez on YARN: Going Beyond Batch
• Tez optimizes execution: a new runtime engine for more efficient data processing
• Always-On Tez Service: low-latency processing for all Hadoop data processing
32. Knox Initiative
Make Hadoop security simple
• Simplify Security: simplify security for both users and operators; provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster; make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured & scaled.
33. Knox: Make Hadoop Security Simple
[Diagram: a client makes {REST} calls to the Knox Gateway, which performs authentication & verification against a user store (KDC, AD, LDAP) and forwards the requests into the Hadoop cluster.]
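What a call through the gateway looks like from a REST client, as a minimal sketch using java.net.HttpURLConnection against the WebHDFS REST API; the gateway host, topology name ("default"), path, and credentials are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxListStatus {
  public static void main(String[] args) throws Exception {
    // Knox typically exposes WebHDFS under
    // https://<gateway>:8443/gateway/<topology>/webhdfs/v1/...
    URL url = new URL("https://knox.example.com:8443/gateway/default"
        + "/webhdfs/v1/data?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    // The gateway authenticates the call (e.g. HTTP Basic against LDAP)
    // and forwards it to the cluster; hypothetical credentials
    String auth = Base64.getEncoder()
        .encodeToString("analyst:secret".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) System.out.println(line);
    }
  }
}
```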
34. Knox: Next Generation of Hadoop Security
[Diagram: end users, websites, and online apps + analytics tools all reach a firewalled Hadoop cluster only through the Gateway.]
• All users see one endpoint
• All online systems see one endpoint: a RESTful service gateway
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly; enables systems admins, DB admins, security admins, and network admins
36. Data Lifecycle on Hadoop is Challenging
Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.
Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.
37. Falcon: One-stop Shop for Data Lifecycle
Apache Falcon provides a single interface to orchestrate the data lifecycle: it covers the data management needs above (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) by orchestrating the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs).
Sophisticated DLM is easily added to Hadoop applications.
38. Falcon At A Glance
[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]
> Falcon provides the key services data processing applications need.
> Complex data processing logic is handled by Falcon instead of hard-coded in apps.
> Faster development and higher quality for ETL, reporting, and other data processing apps on Hadoop.
39. Falcon Core Capabilities
• Core Functionality
– Pipeline processing
– Replication
– Retention
– Late data handling
• Automates
– Scheduling and retry
– Recording audit, lineage and metrics
• Operations and Management
– Monitoring, management, metering
– Alerts and notifications
– Multi Cluster Federation
• CLI and REST API
40. Falcon Example: Multi-Cluster Failover
[Diagram: on the primary Hadoop cluster, data flows from staged to cleansed to conformed to presented data, which feeds BI and analytics; Falcon replicates the staged and presented data to a failover Hadoop cluster.]
> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
41. Falcon Example: Retention Policies
[Diagram: staged data is retained 5 years, cleansed data 3 years, conformed data 3 years, and presented data keeps the last copy only.]
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
42. Falcon Example: Late Data Handling
[Diagram: online transaction data is pulled via Sqoop and web log data is pushed via FTP into a staging area; the process waits up to 4 hours for the FTP data to arrive before producing the combined dataset.]
> Processing waits until all data is available.
> Developers don't write complex data-handling rules within applications.
43. Multi-Cluster Management with Prism
> Prism is the part of Falcon that handles multi-cluster deployments.
> Key use cases: replication and data processing that span clusters.
45. Sandbox: A Guided Tour of HDP
• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Easy-to-use editors for Apache Pig and Hive
• Easily import data and create tables
• Browse and manage HDFS files
• Latest tutorials pushed directly to your Sandbox