1. Hadoop: Past, Present and Future
Chris Harris
Email : charris@hortonworks.com
Twitter : cj_harris5
© Hortonworks Inc. 2013
4. A Brief History of Apache Hadoop
[Timeline, 2004-2013]
2005: Yahoo! creates a team under E14 to work on Hadoop
2006: Apache project established
2008: Yahoo! begins to operate at scale
2012-2013: Hortonworks Data Platform delivers Enterprise Hadoop
5. Key Hadoop Data Types
1. Sentiment
Understand how your customers feel about your brand and
products – right now
2. Clickstream
Capture and analyze website visitors’ data trails and
optimize your website
3. Sensor/Machine
Discover patterns in data streaming automatically from
remote sensors and machines
4. Geographic
Analyze location-based data to manage operations where
they occur
5. Server Logs
Research logs to diagnose process failures and prevent
security breaches
6. Unstructured (text, video, pictures, etc.)
Understand patterns in files across millions of web pages,
emails, and documents
7. Hadoop 1
• Limited to roughly 4,000 nodes per cluster
• JobTracker work grows as O(number of tasks in the cluster)
• JobTracker is a bottleneck: it handles resource management, job scheduling, and monitoring
• Only one namespace for managing HDFS
• Map and Reduce slots are static
• MapReduce is the only type of job that can run
8. Hadoop 1 - Basics
MapReduce (computation framework) runs directly on HDFS (storage framework).
[Diagram: file blocks A, B, and C, each replicated three times across the DataNodes of the HDFS layer]
9. Hadoop 1 - Reading Files
[Diagram: the Hadoop client sends a read-file request to the NameNode, which returns the DataNodes, block IDs, etc.; the client then reads the blocks directly from DataNode/TaskTracker (DN | TT) machines across Rack1..RackN. DataNodes send heartbeats and block reports to the NameNode, and the Secondary NameNode (fsimage/edits) performs checkpoints.]
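To make the read path concrete, here is a minimal sketch of an HDFS read using the standard org.apache.hadoop.fs.FileSystem client API (the component that performs the NameNode/DataNode exchange above); the cluster URI and file path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsRead {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hypothetical cluster URI; normally picked up from core-site.xml
    conf.set("fs.defaultFS", "hdfs://namenode.example.com:8020");
    FileSystem fs = FileSystem.get(conf);
    // open() asks the NameNode for block locations, then the stream
    // reads the blocks directly from the DataNodes
    try (FSDataInputStream in = fs.open(new Path("/data/sample.txt"))) {
      byte[] buf = new byte[4096];
      int n;
      while ((n = in.read(buf)) != -1) {
        System.out.write(buf, 0, n);
      }
    }
  }
}
```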
10. Hadoop 1 - Writing Files
[Diagram: the client sends a write request to the NameNode, which returns the target DataNodes; the client writes blocks to the first DataNode, and replication pipelining forwards each block to the remaining replicas across Rack1..RackN. DataNodes send block reports to the NameNode, and the Secondary NameNode (fsimage/edits) performs checkpoints.]
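The write path looks similar from the client's side; a minimal sketch with the same FileSystem API (hypothetical path), where the returned stream drives the replication pipeline shown above:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWrite {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // create() asks the NameNode to allocate blocks; each block written
    // to the stream is pushed through the DataNode replication pipeline
    try (FSDataOutputStream out = fs.create(new Path("/data/out.txt"))) {
      out.writeBytes("hello, hdfs\n");
    }
  }
}
```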
11. Hadoop 1 - Running Jobs
[Diagram: the Hadoop client submits a job to the JobTracker, which deploys map tasks to DataNode/TaskTracker (DN | TT) machines near the data; map output is shuffled to the reduce tasks, which write the result (part 0) back to HDFS across Rack1..RackN.]
12. Hadoop 1 - Security
[Diagram: users authenticate (authN/authZ) against LDAP/AD and obtain Kerberos tickets from the KDC; service requests pass through the firewall from a client node/spoke server into the Hadoop cluster, which can be extended with an encryption plugin.]
* the block token is for accessing data
* the delegation token is for running jobs
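A minimal sketch of how a client authenticates to a Kerberized cluster with the real org.apache.hadoop.security.UserGroupInformation API before fetching any tokens; the principal and keytab path are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;

public class KerberosLogin {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Tell the Hadoop client libraries the cluster is Kerberized
    conf.set("hadoop.security.authentication", "kerberos");
    UserGroupInformation.setConfiguration(conf);
    // Obtain a Kerberos ticket from the KDC using a keytab
    // (hypothetical principal and keytab path)
    UserGroupInformation.loginUserFromKeytab(
        "analyst@EXAMPLE.COM", "/etc/security/keytabs/analyst.keytab");
    // Subsequent FileSystem/JobClient calls acquire block tokens and
    // delegation tokens on this user's behalf
    System.out.println("Logged in as "
        + UserGroupInformation.getCurrentUser().getUserName());
  }
}
```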
13. Hadoop 1 - APIs
• org.apache.hadoop.mapreduce.Partitioner
• org.apache.hadoop.mapreduce.Mapper
• org.apache.hadoop.mapreduce.Reducer
• org.apache.hadoop.mapreduce.Job
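To show these classes working together, a minimal word-count sketch against this API; the input/output paths are hypothetical, and Hadoop 2 users would prefer Job.getInstance(conf, name) over the Hadoop 1 constructor used here:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  // Mapper: emit (word, 1) for every token in a line
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          ctx.write(word, ONE);
        }
      }
    }
  }

  // Reducer: sum the counts emitted for each word
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/in"));    // hypothetical
    FileOutputFormat.setOutputPath(job, new Path("/out")); // hypothetical
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```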
15. Hadoop 2
• Scales potentially up to 10,000 nodes per cluster
• Scheduling work grows as O(cluster size)
• Supports multiple namespaces for managing HDFS (federation)
• Efficient cluster utilization (YARN)
• Backward (and forward) compatible with MRv1 applications
• Any application can integrate with Hadoop
• Goes beyond Java
17. Hadoop 2 - Reading Files (w/ NN Federation)
[Diagram: the Hadoop client sends the read-file request to the federated NameNode (NN1/ns1 .. NN4/ns4) that owns the file's namespace; that NameNode returns the DataNodes, block IDs, etc., and the client reads the blocks from DataNode/NodeManager (DN | NM) machines across Rack1..RackN. Each NameNode is checkpointed by its own Secondary NameNode (fsimage/edits copy) or kept in sync (fs sync) with a Backup NameNode. DataNodes register and send heartbeats/block reports to every NameNode, and store blocks in per-namespace block pools, e.g. ns1 on dn1, dn2; ns2 on dn1, dn3; ns3 and ns4 on dn4, dn5.]
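Clients usually hide federation behind a client-side mount table (ViewFileSystem), so one logical namespace spans the federated NameNodes. A minimal sketch using the real fs.viewfs.mounttable.* configuration keys; the cluster name, hosts, and mount points are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FederatedClient {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // One logical namespace stitched over several NameNodes
    conf.set("fs.defaultFS", "viewfs://cluster");
    conf.set("fs.viewfs.mounttable.cluster.link./logs",
        "hdfs://nn1.example.com:8020/logs");        // served by ns1
    conf.set("fs.viewfs.mounttable.cluster.link./warehouse",
        "hdfs://nn2.example.com:8020/warehouse");   // served by ns2
    FileSystem fs = FileSystem.get(conf);
    // The mount table routes this path to the NameNode that owns it
    System.out.println(fs.exists(new Path("/logs/2013/10/01")));
  }
}
```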
18. Hadoop 2 - Writing Files
[Diagram: the client sends the write request to the NameNode that owns the target namespace (NN1/ns1 .. NN4/ns4); that NameNode returns the target DataNodes, and the client writes the blocks to DN | NM machines across Rack1..RackN with replication pipelining. Block reports go back to the owning NameNode, which is checkpointed by its Secondary NameNode or synced to a Backup NameNode (fsimage/edits copy, fs sync).]
19. Hadoop 2 - Running Jobs
[Diagram: each Hadoop client creates and submits its application (app1, app2) to the ResourceManager. Inside the ResourceManager, the ApplicationsManager (ASM) queues applications and the Scheduler partitions cluster resources. A per-application ApplicationMaster (AM1, AM2) runs on a NodeManager, negotiates containers with the Scheduler, and reports to the ASM; containers (C1.1..C1.4, C2.1..C2.3) run on NodeManagers across Rack1..RackN and send status reports back to their ApplicationMaster.]
20. Hadoop 2 - Security
[Diagram: JDBC clients, REST clients, and browsers (HUE) pass through a firewall into a DMZ hosting the Knox Gateway cluster, which authenticates against the KDC, LDAP/AD, and an enterprise/cloud SSO provider; a second firewall separates the DMZ from the Hadoop cluster, which adds native Hive/HBase encryption.]
21. Hadoop 2 - APIs
• org.apache.hadoop.yarn.api.ApplicationClientProtocol
• org.apache.hadoop.yarn.api.ApplicationMasterProtocol
• org.apache.hadoop.yarn.api.ContainerManagementProtocol
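Most applications talk to these protocols through the client-library wrappers. A minimal sketch with the real org.apache.hadoop.yarn.client.api.YarnClient, which speaks ApplicationClientProtocol to the ResourceManager:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.api.records.ApplicationId;
import org.apache.hadoop.yarn.client.api.YarnClient;
import org.apache.hadoop.yarn.client.api.YarnClientApplication;

public class YarnHello {
  public static void main(String[] args) throws Exception {
    YarnClient yarn = YarnClient.createYarnClient();
    yarn.init(new Configuration());
    yarn.start();
    // Ask the ResourceManager for a new application id
    // (ApplicationClientProtocol under the hood)
    YarnClientApplication app = yarn.createApplication();
    ApplicationId id = app.getNewApplicationResponse().getApplicationId();
    System.out.println("Got application id: " + id);
    // A full application would now fill in an ApplicationSubmissionContext
    // (AM launch command and resources) and call yarn.submitApplication(...)
    yarn.stop();
  }
}
```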
23. Apache Tez
A New Hadoop Data Processing Framework
24. HDP: Enterprise Hadoop Distribution
[Diagram: the Hortonworks Data Platform (HDP) stack for Enterprise Hadoop. Core platform services: HDFS, YARN*, MapReduce*, Tez*; load & extract: NFS, WebHDFS, Flume, Sqoop, and others; data services: Hive & HCatalog, HBase, Pig; operational services: Ambari, Oozie, Falcon*; perimeter security: Knox*. Enterprise readiness: high availability, disaster recovery, rolling upgrades, security, and snapshots.]
• The ONLY 100% open source and complete distribution
• Enterprise grade, proven and tested at scale
• Ecosystem endorsed to ensure interoperability
25. Tez (“Speed”)
• What is it?
– A data processing framework as an alternative to MapReduce
– A new incubation project in the ASF
• Who else is involved?
– 22 contributors: Hortonworks (13), Facebook, Twitter, Yahoo,
Microsoft
• Why does it matter?
– Widens the platform for Hadoop use cases
– Crucial to improving the performance of low-latency applications
– Core to the Stinger initiative
– Evidence of Hortonworks leading the community in the evolution
of Enterprise Hadoop
26. Moving Hadoop Beyond MapReduce
• Low level data-processing execution engine
• Built on YARN
• Enables pipelining of jobs
• Removes task and job launch times
• Does not write intermediate output to HDFS
– Much lighter disk and network usage
• New base for MapReduce, Hive, Pig, Cascading, etc.
• Hive and Pig jobs no longer need to move to the end
of the queue between steps in the pipeline
27. Tez - Core Idea
A task with pluggable Input, Processor, and Output: Tez Task = <Input, Processor, Output>. A YARN ApplicationMaster runs a DAG of Tez tasks.
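A conceptual sketch of that triple; these interfaces are hypothetical and only illustrate the composition, not the actual org.apache.tez API:

```java
// Hypothetical interfaces illustrating the Tez task model; the real
// classes live under org.apache.tez and differ in detail.
interface Input<T> { Iterable<T> read() throws Exception; }
interface Output<T> { void write(T record) throws Exception; }
interface Processor<I, O> {
  // Pull records from the Input and push results to the Output
  void run(Input<I> in, Output<O> out) throws Exception;
}

// A Tez task is the triple <Input, Processor, Output>; swapping the ends
// turns the same processor into a map task (HDFS in, sorted out), a
// reduce task (shuffle in, HDFS out), or an intermediate reduce
// (shuffle in, sorted out), as the next slide shows.
final class TezTask<I, O> {
  private final Input<I> in;
  private final Processor<I, O> proc;
  private final Output<O> out;
  TezTask(Input<I> in, Processor<I, O> proc, Output<O> out) {
    this.in = in; this.proc = proc; this.out = out;
  }
  void run() throws Exception { proc.run(in, out); }
}
```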
28. Building Blocks for Tasks
[Diagram: task compositions built from the <Input, Processor, Output> triple.
MapReduce 'Map' task: HDFS Input -> Map Processor -> Sorted Output.
MapReduce 'Reduce' task: Shuffle Input -> Reduce Processor -> HDFS Output.
Special Pig/Hive 'Map' (Tez task): HDFS Input -> Map Processor -> Pipeline Sorter Output.
Special Pig/Hive 'Reduce' (Tez task): Shuffle Skip-merge Input -> Reduce Processor -> Sorted Output.
Intermediate 'Reduce' for Map-Reduce-Reduce: Shuffle Input -> Reduce Processor -> Sorted Output.
In-memory Map (Tez task): HDFS Input -> Map Processor -> In-memory Sorted Output.]
29. Pig/Hive-MR versus Pig/Hive-Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[Diagram: on MapReduce, this query runs as three chained jobs (Job 1, Job 2, Job 3) with an I/O synchronization barrier (a full write to and read from HDFS) between each pair; on Tez, it runs as a single job with no intermediate HDFS barriers.]
30. Tez on YARN: Going Beyond Batch
• Tez optimizes execution: a new runtime engine for more efficient data processing
• Always-On Tez Service: low-latency processing for all Hadoop data processing
32. Knox Initiative
Make Hadoop security simple
• Simplify Security: simplify security for both users and operators; provide seamless access for users while securing the cluster at the perimeter, shielding the intricacies of the security implementation.
• Aggregate Access: deliver unified and centralized access to the Hadoop cluster; make Hadoop feel like a single application to users.
• Client Agility: ensure service users are abstracted from where services are located and how services are configured & scaled.
33. Knox: Make Hadoop Security Simple
[Diagram: a client makes {REST} calls to the Knox Gateway, which performs authentication & verification against a user store (KDC, AD, LDAP) and forwards the requests into the Hadoop cluster.]
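What a call through the gateway looks like from a REST client, as a minimal sketch using java.net.HttpURLConnection against the WebHDFS REST API; the gateway host, topology name ("default"), path, and credentials are hypothetical:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Base64;

public class KnoxListStatus {
  public static void main(String[] args) throws Exception {
    // Knox typically exposes WebHDFS under
    // https://<gateway>:8443/gateway/<topology>/webhdfs/v1/...
    URL url = new URL("https://knox.example.com:8443/gateway/default"
        + "/webhdfs/v1/data?op=LISTSTATUS");
    HttpURLConnection conn = (HttpURLConnection) url.openConnection();
    conn.setRequestMethod("GET");
    // The gateway authenticates the call (e.g. HTTP Basic against LDAP)
    // and forwards it to the cluster; hypothetical credentials
    String auth = Base64.getEncoder()
        .encodeToString("analyst:secret".getBytes("UTF-8"));
    conn.setRequestProperty("Authorization", "Basic " + auth);
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(conn.getInputStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) System.out.println(line);
    }
  }
}
```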
34. Knox: Next Generation of Hadoop Security
[Diagram: end users, websites, and online apps + analytics tools all reach a firewalled Hadoop cluster only through the Gateway.]
• All users see one endpoint
• All online systems see one endpoint: a RESTful service gateway
• Consistency across all interfaces and capabilities
• Firewalled cluster that no end users need to access
• More IT-friendly; enables systems admins, DB admins, security admins, and network admins
36. Data Lifecycle on Hadoop is Challenging
Data management needs: data processing, replication, retention, scheduling, reprocessing, multi-cluster management.
Tools: Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs.
Problem: a patchwork of tools complicates data lifecycle management.
Result: long development cycles and quality challenges.
37. Falcon: One-stop Shop for Data Lifecycle
Apache Falcon provides a single interface to orchestrate the data lifecycle: it covers the data management needs above (data processing, replication, retention, scheduling, reprocessing, multi-cluster management) by orchestrating the underlying tools (Oozie, Sqoop, DistCp, Flume, MapReduce, Hive and Pig jobs).
Sophisticated DLM is easily added to Hadoop applications.
38. Falcon At A Glance
[Diagram: data processing applications drive the Falcon Data Lifecycle Management Service through spec files or REST APIs; Falcon provides data import and replication, scheduling and coordination, data lifecycle policies, multi-cluster management, and SLA management.]
> Falcon provides the key services data processing applications need.
> Complex data processing logic is handled by Falcon instead of hard-coded in apps.
> Faster development and higher quality for ETL, reporting, and other data processing apps on Hadoop.
39. Falcon Core Capabilities
• Core Functionality
– Pipeline processing
– Replication
– Retention
– Late data handling
• Automates
– Scheduling and retry
– Recording audit, lineage and metrics
• Operations and Management
– Monitoring, management, metering
– Alerts and notifications
– Multi Cluster Federation
• CLI and REST API
40. Falcon Example: Multi-Cluster Failover
[Diagram: on the primary Hadoop cluster, data flows from staged to cleansed to conformed to presented data, which feeds BI and analytics; Falcon replicates the staged and presented data to a failover Hadoop cluster.]
> Falcon manages workflow, replication, or both.
> Enables business continuity without requiring full data reprocessing.
> Failover clusters require less storage and CPU.
41. Falcon Example: Retention Policies
[Diagram: staged data is retained 5 years, cleansed data 3 years, conformed data 3 years, and presented data keeps the last copy only.]
> Sophisticated retention policies expressed in one place.
> Simplify data retention for audit, compliance, or data re-processing.
42. Falcon Example: Late Data Handling
[Diagram: online transaction data is pulled via Sqoop and web log data is pushed via FTP into a staging area; the process waits up to 4 hours for the FTP data to arrive before producing the combined dataset.]
> Processing waits until all data is available.
> Developers don't write complex data-handling rules within applications.
43. Multi-Cluster Management with Prism
> Prism is the part of Falcon that handles multi-cluster deployments.
> Key use cases: replication and data processing that span clusters.
45. Sandbox: A Guided Tour of HDP
• Tutorials and videos give a guided tour of HDP and Hadoop
• Perfect for beginners or anyone learning more about Hadoop
• Installs easily on your laptop or desktop
• Easy-to-use editors for Apache Pig and Hive
• Easily import data and create tables
• Browse and manage HDFS files
• Latest tutorials pushed directly to your Sandbox