Hadoop: Past, Present and Future
Chris Harris
Email: charris@hortonworks.com
Twitter: cj_harris5

© Hortonworks Inc. 2013
Past

A little history… it’s 2005

A Brief History of Apache Hadoop

[Timeline, 2004–2013]
•  2005: Yahoo! creates team under E14 to work on Hadoop
•  Apache project established
•  Yahoo! begins to operate at scale
•  Hortonworks Data Platform: Enterprise Hadoop
Key Hadoop Data Types
1.  Sentiment
Understand how your customers feel about your brand and
products – right now

2.  Clickstream
Capture and analyze website visitors’ data trails and
optimize your website

3.  Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines

4.  Geographic
Analyze location-based data to manage operations where
they occur

5.  Server Logs
Research logs to diagnose process failures and prevent
security breaches

6.  Unstructured (text, video, pictures, etc.)
Understand patterns in files across millions of web pages,
emails, and documents

Hadoop is NOT
•  An ESB
•  NoSQL
•  HPC
•  Relational
•  Real-time
•  The jack of all trades
Hadoop 1
•  Limited to roughly 4,000 nodes per cluster
•  Scheduling work grows as O(# of tasks in a cluster)
•  JobTracker bottleneck: one process handles resource management, job scheduling and monitoring
•  Only one namespace for managing HDFS
•  Map and Reduce slots are statically allocated
•  MapReduce is the only kind of job that can run
Hadoop 1 - Basics
MapReduce (computation framework) layered on top of HDFS (storage framework)

[Diagram: blocks A, B and C each replicated across DataNodes in the cluster]
Hadoop 1 - Reading Files
[Diagram] The Hadoop Client sends a "read file" request to the NameNode, which returns DataNodes, block ids, etc. The client then reads the blocks directly from DataNode/TaskTracker (DN | TT) machines spread across Rack1…RackN. DataNodes send heartbeats and block reports to the NameNode, while the Secondary NameNode (fsimage/edits) performs checkpoints.
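The read path above can be sketched in miniature. This is a hypothetical Python simulation of the protocol, not real HDFS client code; all class and method names here are illustrative: the NameNode holds only metadata mapping a file to block ids and DataNode locations, and the client fetches each block directly from a DataNode.

```python
# Hypothetical sketch of the HDFS read protocol (illustration only).

class NameNode:
    """Holds only metadata: path -> [(block_id, [replica DataNodes]), ...]."""
    def __init__(self):
        self.metadata = {}

    def get_block_locations(self, path):
        return self.metadata[path]

class DataNode:
    """Holds only block data, keyed by block id."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

def read_file(namenode, path):
    """Client: ask the NameNode for locations, then read blocks from DataNodes."""
    data = []
    for block_id, replicas in namenode.get_block_locations(path):
        # Read from the first replica (a real client prefers the closest one).
        data.append(replicas[0].blocks[block_id])
    return b"".join(data)

# Wire up a two-block file replicated across three DataNodes.
nn = NameNode()
dns = [DataNode(f"dn{i}") for i in range(3)]
dns[0].blocks["blk_1"] = b"hello "
dns[1].blocks["blk_1"] = b"hello "
dns[1].blocks["blk_2"] = b"world"
dns[2].blocks["blk_2"] = b"world"
nn.metadata["/user/data.txt"] = [("blk_1", [dns[0], dns[1]]),
                                 ("blk_2", [dns[1], dns[2]])]

content = read_file(nn, "/user/data.txt")
```

The point of the separation is visible even in the toy: the NameNode never touches block contents, so all bulk data transfer bypasses it.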
Hadoop 1 - Writing Files
[Diagram] The Hadoop Client sends a "request write" to the NameNode, which returns target DataNodes, etc. The client writes blocks to the first DataNode, and replication pipelining copies each block through the remaining DN | TT machines across Rack1…RackN. DataNodes send block reports to the NameNode; the Secondary NameNode (fsimage/edits) performs checkpoints.
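Replication pipelining can be sketched in a few lines. A hypothetical illustration (not HDFS code; `write_block` and the store layout are invented for the example): the client hands the block to the first DataNode only, and each DataNode stores a copy and forwards downstream.

```python
# Hypothetical sketch of HDFS replication pipelining (illustration only).

def write_block(block_id, data, pipeline, stores):
    """The client sends the block to the first DataNode; each DataNode
    stores it locally and then forwards it to the next in the pipeline."""
    def forward(i):
        if i == len(pipeline):
            return                      # end of the pipeline
        dn = pipeline[i]
        stores[dn][block_id] = data     # store locally...
        forward(i + 1)                  # ...then forward downstream

    forward(0)

stores = {"dn1": {}, "dn2": {}, "dn3": {}}
write_block("blk_42", b"payload", ["dn1", "dn2", "dn3"], stores)
```

The design choice this mirrors: the client's outbound bandwidth is spent once, and the cluster's internal network carries the second and third copies.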
Hadoop 1 - Running Jobs
[Diagram] The Hadoop Client submits a job to the JobTracker, which deploys it to TaskTrackers (DN | TT) across Rack1…RackN. Map tasks run close to their data, their output is shuffled to reduce tasks, and the reducers write the final output (part 0) back to HDFS.
Hadoop 1 - Security

[Diagram] Users authenticate (authN/authZ) against LDAP/AD and a KDC outside the firewall; service requests then reach the Hadoop cluster through a client node/spoke server, with an encryption plugin available at the boundary.
* the block token is for accessing data
* the delegate token is for running jobs
Hadoop 1 - APIs
•  org.apache.hadoop.mapreduce.Partitioner
•  org.apache.hadoop.mapreduce.Mapper
•  org.apache.hadoop.mapreduce.Reducer
•  org.apache.hadoop.mapreduce.Job
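These interfaces can be illustrated in miniature. The sketch below mimics the Mapper, Partitioner and Reducer roles for a word count in plain Python; real Hadoop jobs implement the org.apache.hadoop.mapreduce classes in Java, and `run_job` and the helper names here are hypothetical stand-ins for the framework.

```python
# Conceptual word count mirroring the Mapper/Partitioner/Reducer roles
# (plain Python illustration, not the Hadoop API).
from collections import defaultdict

def mapper(_offset, line):              # like Mapper.map(key, value, context)
    for word in line.split():
        yield word, 1

def partitioner(key, num_reducers):     # like Partitioner.getPartition(...)
    return hash(key) % num_reducers

def reducer(key, values):               # like Reducer.reduce(key, values, context)
    yield key, sum(values)

def run_job(lines, num_reducers=2):
    # Map phase; the "shuffle" groups values by key within each partition.
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for offset, line in enumerate(lines):
        for k, v in mapper(offset, line):
            partitions[partitioner(k, num_reducers)][k].append(v)
    # Reduce phase: each reducer processes one partition in key order.
    out = {}
    for part in partitions:
        for k, vs in sorted(part.items()):
            out.update(dict(reducer(k, vs)))
    return out

counts = run_job(["the quick fox", "the lazy dog"])
```

The partitioner decides which reducer sees a key, which is why all occurrences of the same word end up summed together.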
Present

Hadoop 2
•  Potentially up to 10,000 nodes per cluster
•  Scheduling is O(cluster size) rather than O(# of tasks)
•  Supports multiple namespaces for managing HDFS (federation)
•  Efficient cluster utilization (YARN)
•  MRv1 backward and forward compatible
•  Any application can integrate with Hadoop
•  Beyond Java
Hadoop 2 - Basics

Hadoop 2 - Reading Files
(w/ NN Federation)
[Diagram] With NameNode federation, the Hadoop Client issues the "read file" request to the NameNode owning the relevant namespace (NN1/ns1 … NN4/ns4). Each NameNode has either a Secondary NameNode (fsimage/edits copy, checkpoint) or a Backup NameNode (fs sync). The NameNode returns DataNodes, block ids, etc., and the client reads the blocks from DataNode/NodeManager (DN | NM) machines across Rack1…RackN. DataNodes register, heartbeat and send block reports, and hold one block pool per namespace (e.g. ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 and ns4 on dn4, dn5).
Hadoop 2 - Writing Files
[Diagram] The client sends "request write" to the NameNode for the target namespace (NN1/ns1 … NN4/ns4), which returns DataNodes, etc. The client writes blocks to DN | NM machines with replication pipelining across Rack1…RackN; DataNodes send block reports, and each NameNode is checkpointed by its Secondary NameNode (fsimage/edits copy) or kept in sync by a Backup NameNode.
Hadoop 2 - Running Jobs
[Diagram] Each Hadoop Client creates and submits an application (app1, app2) to the ResourceManager. Inside the ResourceManager, the ApplicationsManager (ASM) accepts submissions and the Scheduler partitions resources into queues. A per-application ApplicationMaster (AM1, AM2) negotiates containers (C1.1…C1.4, C2.1…C2.3) and reports to the ASM; NodeManagers across Rack1…RackN launch the containers and send status reports.
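The flow above can be sketched as a toy simulation. This is a hypothetical Python model of the YARN roles, not the YARN API (the real protocols are the org.apache.hadoop.yarn.api interfaces listed later); all class and method names are illustrative.

```python
# Hypothetical sketch of the YARN submit/negotiate/launch flow.

class NodeManager:
    def __init__(self, name):
        self.name = name

    def launch_container(self, app_id, task):
        # Pretend to launch a container running `task` for `app_id`.
        return f"{self.name}:{app_id}:{task}()"

class ApplicationMaster:
    def __init__(self, app_id, tasks):
        self.app_id = app_id
        self.tasks = tasks
        self.containers = []

    def negotiate(self, rm):
        # Ask the RM for one container per task, then have the chosen
        # NodeManagers launch them.
        for nm, task in zip(rm.allocate(len(self.tasks)), self.tasks):
            self.containers.append(nm.launch_container(self.app_id, task))

class ResourceManager:
    def __init__(self, node_managers):
        self.node_managers = node_managers
        self.apps = {}

    def allocate(self, n):
        # Toy scheduler: hand out containers round-robin across nodes.
        return [self.node_managers[i % len(self.node_managers)]
                for i in range(n)]

    def submit_application(self, app_id, tasks):
        # The RM starts an ApplicationMaster for the new application,
        # which then negotiates its own containers.
        am = ApplicationMaster(app_id, tasks)
        self.apps[app_id] = am
        am.negotiate(self)
        return am

rm = ResourceManager([NodeManager("nm1"), NodeManager("nm2")])
am = rm.submit_application("app1", ["map", "map", "reduce"])
```

The key structural change from Hadoop 1 is visible here: the ResourceManager only arbitrates resources, while per-application scheduling logic lives in each ApplicationMaster.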
Hadoop 2 - Security
[Diagram] JDBC clients, REST clients and browsers (HUE) connect from outside through a firewall to a Knox Gateway cluster in the DMZ, and through a second firewall to the Hadoop cluster. Knox integrates with the KDC, LDAP/AD and an enterprise/cloud SSO provider; native Hive/HBase encryption protects data inside the cluster.
Hadoop 2 - APIs
•  org.apache.hadoop.yarn.api.ApplicationClientProtocol
•  org.apache.hadoop.yarn.api.ApplicationMasterProtocol
•  org.apache.hadoop.yarn.api.ContainerManagementProtocol
Future

Apache Tez
A New Hadoop Data Processing Framework

HDP: Enterprise Hadoop Distribution
Hortonworks Data Platform (HDP): Enterprise Hadoop

[Diagram: HDP component stack]
•  OPERATIONAL SERVICES: Ambari, Falcon*, Oozie
•  DATA SERVICES: Flume, Sqoop, NFS and WebHDFS (load & extract); Pig, Hive & HCatalog, HBase; other
•  CORE: MapReduce*, Tez*, YARN*, HDFS
•  PLATFORM SERVICES: Knox*; enterprise readiness: high availability, disaster recovery, rolling upgrades, security and snapshots

•  The ONLY 100% open source and complete distribution
•  Enterprise grade, proven and tested at scale
•  Ecosystem endorsed to ensure interoperability
Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
– A new incubation project in the ASF

• Who else is involved?
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft

• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop
Moving Hadoop Beyond MapReduce
• Low-level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Eliminates task and job launch overhead
• Does not write intermediate output to HDFS
  – Much lighter disk and network usage
• New execution base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end of the queue between steps in the pipeline
Tez - Core Idea
A task with pluggable Input, Processor & Output:

  Input → Processor → Output

Tez Task = <Input, Processor, Output>

A YARN ApplicationMaster runs a DAG of Tez Tasks.
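The core idea can be sketched directly: a task is three pluggable pieces wired together. This is conceptual Python, not the Tez API; `tez_task` and the plug-in names are invented for the illustration.

```python
# Sketch of the <Input, Processor, Output> idea (conceptual, not Tez code).

def tez_task(input_fn, processor_fn, output_fn, source):
    """Run one task: pull records from the Input, transform each with the
    Processor, and hand every result to the Output."""
    sink = []
    for record in input_fn(source):
        for result in processor_fn(record):
            output_fn(sink, result)
    return sink

# Pluggable pieces for a toy "upper-case every line" task.
def line_input(lines):            # Input: yields records from the source
    yield from lines

def upper_processor(line):        # Processor: transforms one record
    yield line.upper()

def list_output(sink, result):    # Output: delivers results downstream
    sink.append(result)

out = tez_task(line_input, upper_processor, list_output, ["a b", "c"])
```

Because the three pieces only meet at the task boundary, swapping an HDFS input for a shuffle input (or a sorted output for a pipelined one) does not touch the processor, which is exactly what the building-block slide that follows exploits.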
Building Blocks for Tasks
[Diagram: tasks assembled from Input, Processor and Output blocks]
•  MapReduce ‘Map’ Task: HDFS Input → Map Processor → Sorted Output
•  Special Pig/Hive ‘Map’ (Tez Task): HDFS Input → Map Processor → Pipeline Sorter Output
•  MapReduce ‘Reduce’ Task: Shuffle Input → Reduce Processor → HDFS Output
•  Special Pig/Hive ‘Reduce’ (Tez Task): Shuffle Skip-merge Input → Reduce Processor → Sorted Output
•  Intermediate ‘Reduce’ for Map-Reduce-Reduce (Tez Task): Shuffle Input → Reduce Processor → Sorted Output
•  In-memory Map (Tez Task): HDFS Input → Map Processor → In-memory Sorted Output
Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVERAGE(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram]
Pig/Hive - MR: Job 1 → (I/O synchronization barrier) → Job 2 → (I/O synchronization barrier) → Job 3
Pig/Hive - Tez: a single job, with no I/O synchronization barriers
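The difference can be shown abstractly: in the MR plan each job materializes its full output (the I/O synchronization barrier) before the next job starts, while a Tez-style plan streams records through the whole stage chain as one job. A toy Python illustration with three made-up stages standing in for the query's join and aggregation steps:

```python
# Toy illustration of barriers vs. pipelining (not Pig/Hive code).

stages = [
    lambda xs: (x + 1 for x in xs),   # stands in for "Job 1"
    lambda xs: (x * 2 for x in xs),   # stands in for "Job 2"
    lambda xs: (x - 3 for x in xs),   # stands in for "Job 3"
]

def run_as_mr_jobs(data):
    """Each stage fully materializes its output (like writing intermediate
    results to HDFS) before the next stage may start."""
    for stage in stages:
        data = list(stage(data))      # I/O synchronization barrier
    return data

def run_as_tez_dag(data):
    """Stages are composed into one pipeline; records stream through the
    chain with no intermediate materialization."""
    out = data
    for stage in stages:
        out = stage(out)              # lazily chained generators
    return list(out)

mr = run_as_mr_jobs([1, 2, 3])
tez = run_as_tez_dag([1, 2, 3])
```

Both plans compute the same answer; what changes is where the intermediate data lives (disk between jobs vs. streamed between stages) and how many job launches are paid for.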
Tez on YARN: Going Beyond Batch

•  Tez Optimizes Execution: a new runtime engine for more efficient data processing
•  Always-On Tez Service: low-latency processing for all Hadoop data processing
Apache Knox
Secure Access to Hadoop

Knox Initiative

Make Hadoop security simple

•  Simplify Security: simplify security for both users and operators. Provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
•  Aggregate Access: deliver unified and centralized access to the Hadoop cluster. Make Hadoop feel like a single application to users.
•  Client Agility: ensure service users are abstracted from where services are located and how services are configured & scaled.
Knox: Make Hadoop Security Simple
[Diagram] A client makes {REST} calls to the Knox Gateway, which performs authentication & verification against a user store (KDC, AD, LDAP) before forwarding requests into the Hadoop cluster.
Knox: Next Generation of Hadoop Security
•  All end users (website, online apps and analytics tools) see one end-point
•  All online systems see one end-point RESTful service: the Gateway
•  Consistency across all interfaces and capabilities
•  The Hadoop cluster sits behind a firewall that no end users need to cross
•  More IT-friendly; enables systems admins, DB admins, security admins and network admins
Apache Falcon
Data Lifecycle Management for Hadoop

Data Lifecycle on Hadoop is Challenging

Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.

Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.
Falcon: One-stop Shop for Data Lifecycle
Apache Falcon provides for the data management needs (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) and orchestrates the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs).

Falcon provides a single interface to orchestrate the data lifecycle.
Sophisticated DLM is easily added to Hadoop applications.
Falcon At A Glance
[Diagram] Data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs. The service covers data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.

>  Falcon provides the key services data processing applications need.
>  Complex data processing logic is handled by Falcon instead of hard-coded in apps.
>  Faster development and higher quality for ETL, reporting and other data processing apps on Hadoop.
Falcon Core Capabilities
• Core Functionality
– Pipeline processing
– Replication
– Retention
– Late data handling

• Automates
– Scheduling and retry
– Recording audit, lineage and metrics

• Operations and Management
– Monitoring, management, metering
– Alerts and notifications
– Multi Cluster Federation

• CLI and REST API
Falcon Example: Multi-Cluster Failover
[Diagram] On the primary Hadoop cluster, staged data flows through cleansed, conformed and presented stages that feed BI and analytics. Falcon replicates the staged and presented data to a failover Hadoop cluster.

>  Falcon manages workflow, replication or both.
>  Enables business continuity without requiring full data reprocessing.
>  Failover clusters require less storage and CPU.
Falcon Example: Retention Policies

[Diagram]
•  Staged data: retain 5 years
•  Cleansed data: retain 3 years
•  Conformed data: retain 3 years
•  Presented data: retain last copy only

>  Sophisticated retention policies expressed in one place.
>  Simplify data retention for audit, compliance, or data re-processing.
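Age-based retention like the policies above can be sketched in a few lines. Falcon actually expresses these declaratively in feed specifications; the `RETENTION` table and `apply_retention` helper here are hypothetical, for illustration only.

```python
# Hypothetical sketch of stage-by-stage retention (illustration only).
from datetime import datetime, timedelta

RETENTION = {                      # stage -> maximum age; None = last copy only
    "staged":    timedelta(days=5 * 365),
    "cleansed":  timedelta(days=3 * 365),
    "conformed": timedelta(days=3 * 365),
    "presented": None,
}

def apply_retention(stage, instances, now):
    """instances: list of (timestamp, path); return those to keep."""
    policy = RETENTION[stage]
    ordered = sorted(instances)
    if policy is None:                           # keep only the latest copy
        return ordered[-1:]
    return [(ts, p) for ts, p in ordered if now - ts <= policy]

now = datetime(2013, 6, 1)
staged = [(datetime(2007, 1, 1), "/staged/2007"),
          (datetime(2012, 1, 1), "/staged/2012")]
kept = apply_retention("staged", staged, now)    # 2007 instance is evicted

last = apply_retention("presented",
                       [(datetime(2010, 1, 1), "/p/2010"),
                        (datetime(2013, 1, 1), "/p/2013")], now)
```

The value of centralizing this is that each stage's rule is stated once, instead of being re-implemented in every job that touches the data.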
Falcon Example: Late Data Handling
[Diagram] Online transaction data is pulled into the staging area via Sqoop; web log data is pushed via FTP. Falcon waits up to 4 hours for the FTP data to arrive before building the combined dataset.

>  Processing waits until all data is available.
>  Developers don't write complex data handling rules within applications.
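The late-data rule above amounts to: hold processing until both feeds arrive, with a cut-off for the slow one. A hypothetical sketch of that logic (Falcon declares this in the feed/process spec rather than in code; `combine_when_ready` is invented for the example):

```python
# Hypothetical sketch of late-data handling (illustration only).

def combine_when_ready(sqoop_feed, ftp_feed, hours_waited, max_wait_hours=4):
    """Return the combined dataset once both inputs are present, or give up
    on the FTP feed after the cut-off and process what is available."""
    if sqoop_feed is None:
        return None                          # transactions are mandatory
    if ftp_feed is None and hours_waited < max_wait_hours:
        return None                          # keep waiting for the web logs
    return list(sqoop_feed) + list(ftp_feed or [])

# Web logs arrive 2 hours late: the first check waits, the second combines,
# and a run that hits the 4-hour cut-off proceeds without the logs.
r1 = combine_when_ready(["txn1"], None, hours_waited=1)
r2 = combine_when_ready(["txn1"], ["log1"], hours_waited=2)
r3 = combine_when_ready(["txn1"], None, hours_waited=4)
```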
Multi Cluster Management with Prism

>  Prism is the part of Falcon that handles multi-cluster.
>  Key use cases: Replication and data processing that spans clusters.

Hortonworks Sandbox
Go from Zero to Big Data in 15 minutes

Sandbox: A Guided Tour of HDP
•  Tutorials and videos give a guided tour of HDP and Hadoop
•  Perfect for beginners or anyone learning more about Hadoop
•  Installs easily on your laptop or desktop
•  Easy-to-use editors for Apache Pig and Hive
•  Easily import data and create tables
•  Browse and manage HDFS files
•  Latest tutorials pushed directly to your Sandbox
THANK YOU!
Chris Harris
charris@hortonworks.com

Download Sandbox
hortonworks.com/sandbox