Hadoop - Looking to the Future By Arun Murthy

Hadoop – Looking to the Future
Arun C. Murthy
Hortonworks Co-Founder
@acmurthy

[Diagram, 2006: cluster nodes 1 … N running HDFS (Hadoop Distributed File System) and MapReduce; largely batch processing]
© Hortonworks Inc. 2011 – 2014. All Rights Reserved
Hadoop w/ MapReduce
Traditional Hadoop allowed early adopters to deal with data at scale; however…
• Single purpose clusters, specific data sets
• Primarily a batch system using MapReduce
• Difficult to natively integrate existing applications
• Limited enterprise capabilities: Operations, Security & Governance
In the beginning…

[Timeline diagram with markers 2006 and 2009: Hadoop w/ MapReduce; cluster nodes 1 … N running HDFS and MapReduce, largely batch processing]
MAPREDUCE-279
Common data, multiple applications
• Support multi-tenant cluster
• Batch, interactive & real-time use cases can leverage the most appropriate engine
Architectural Center
• Consistent security, governance & operations
• Ecosystem applications run natively in Hadoop
Apache Hadoop 2.0 & YARN
October 23, 2013
[Diagram: YARN: Data Operating System over HDFS on cluster nodes 1 … N, serving Batch, Interactive and Real-Time workloads]

Apache Tez: Flexible & More Efficient Execution Engine
Hadoop 1: Batch System w/ MapReduce as base
[Stack diagram, top to bottom: SQL (Apache Hive), Data Flow (Apache Pig), Others; Batch (MapReduce); HDFS on cluster nodes 1 … N]
Hadoop 2: Apache Tez supports both interactive & batch processing
[Stack diagram, top to bottom: SQL (Apache Hive), Data Flow (Apache Pig), Java Apps (Cascading), Others; Batch & Interactive (Apache Tez) and Batch (MapReduce); YARN: Data Operating System; HDFS on cluster nodes 1 … N]

[Diagram: Business Analytics and Custom Apps issue SQL to Apache Hive; Interactive SQL runs on Apache Tez alongside Legacy MapReduce and Other Engines & Workloads, over HDFS on cluster nodes 1 … N]
Apache Hive and the Power of YARN
Stinger Initiative
Next generation SQL based interactive query in Hadoop
Speed: Performance increased 100x for interactive & batch use cases
Scale: Queries from GBs, to TBs to PBs
SQL: Broadest range of SQL semantics
Apache Hive Community
1,672 Jira Tickets Closed
145 Developers
44 Companies
~390,000 Lines Of Code Added… (2x)
13 Months
[Chart: Hive 10, Hive 12, Hive 13; scale labeled from seconds to thousands of seconds]
Dramatically faster queries speed time to insight

[Diagram: SQL via Apache Hive; Interactive SQL runs on Apache Tez alongside Legacy MapReduce and Other Engines & Workloads, over HDFS on cluster nodes 1 … N]
Apache Hive – Interactive SQL in Hadoop
Stinger
Next generation SQL based interactive query in Hadoop
ORC: IO Improvements; efficient processing via complex pushdown
Tez: Powerful primitives for the SQL Planner
VQP: Efficient CPU utilization in Inner Loop
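The "pushdown" idea on this slide can be illustrated with a minimal sketch. It assumes a simplified, hypothetical layout in which column values are stored in fixed-size stripes that each carry min/max statistics. The `Stripe` class and the function names are illustrative only, not ORC's actual file format or API. The point is that a filter such as `value > threshold` can skip whole stripes without reading them.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Stripe:
    """Hypothetical block of column values with precomputed min/max statistics."""
    values: List[int]
    min_value: int
    max_value: int


def make_stripes(values: List[int], stripe_size: int) -> List[Stripe]:
    """Split values into fixed-size stripes and record min/max for each one."""
    stripes = []
    for start in range(0, len(values), stripe_size):
        chunk = values[start:start + stripe_size]
        stripes.append(Stripe(chunk, min(chunk), max(chunk)))
    return stripes


def scan_greater_than(stripes: List[Stripe], threshold: int) -> Tuple[List[int], int]:
    """Return (matching values, stripes actually read) for `value > threshold`."""
    matches: List[int] = []
    stripes_read = 0
    for stripe in stripes:
        if stripe.max_value <= threshold:
            continue  # statistics prove no row can match, so skip the read
        stripes_read += 1
        matches.extend(v for v in stripe.values if v > threshold)
    return matches, stripes_read


# Illustrative data: the values 1..100 in stripes of 10.
stripes = make_stripes(list(range(1, 101)), stripe_size=10)
matches, stripes_read = scan_greater_than(stripes, threshold=85)
print(f"read {stripes_read} of {len(stripes)} stripes, {len(matches)} rows matched")
```

With sorted or clustered data most stripes are eliminated by statistics alone, which is where the IO savings would come from in this simplified model.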

Sub-Second SQL with Hive LLAP
Stinger.Next
Sub-second SQL in Hadoop via Hive/LLAP
CBO: The “right” plan executed violently…
LLAP: Long-lived daemon for low-latency startup, caching & CPU efficiency via JIT
Metastore: Extensive stats & scalability
[Diagram: Sub-second SQL via Apache Hive on LLAP and Apache Tez, alongside Other Engines & Workloads, over HDFS on cluster nodes 1 … N]
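The slide pairs CBO with extensive Metastore statistics. As a rough sketch of why statistics matter to plan choice, the following assumes only per-table row counts are available. The table names, row counts and both heuristics are hypothetical and far simpler than what Hive's optimizer does.

```python
from typing import Dict, List, Tuple

# Hypothetical statistics of the kind a metastore might hold (row counts only).
TABLE_ROW_COUNTS: Dict[str, int] = {
    "orders": 5_000_000,
    "customers": 200_000,
    "regions": 50,
}


def choose_build_side(left: str, right: str, stats: Dict[str, int]) -> Tuple[str, str]:
    """Return (build, probe): build the hash table on the smaller input."""
    if stats[left] <= stats[right]:
        return left, right
    return right, left


def order_joins_by_size(tables: List[str], stats: Dict[str, int]) -> List[str]:
    """Greedy heuristic: join the smallest tables first to keep intermediates small."""
    return sorted(tables, key=lambda name: stats[name])


build, probe = choose_build_side("orders", "customers", TABLE_ROW_COUNTS)
join_order = order_joins_by_size(["orders", "customers", "regions"], TABLE_ROW_COUNTS)
print(f"build on {build}, probe with {probe}; join order: {join_order}")
```

Without statistics a planner has to guess, and a wrong guess (for example, building a hash table on the largest table) can dominate query time no matter how fast the execution engine is.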

Apache Slider For “Always-on” Services
“Slide” apps on YARN
Democratize access to storage (HDFS) and compute (YARN)
Ease management (Ambari) in addition to deployment
[Diagram: Real-Time services run via Slider on YARN: Data Operating System, over HDFS on cluster nodes 1 … N: NoSQL (Apache HBase), NoSQL (Apache Accumulo), Stream (Apache Storm), Others (ISV)]

© Hortonworks Inc. 2015. All Rights Reserved
Data Governance Initiative
Requirements
1. Hadoop must snap in to the existing frameworks and openly exchange metadata
2. Hadoop must address governance within its own stack of technologies
Engineers from a group of companies dedicated to meeting these requirements in the open
New Apache project proposal
[Diagram components: Knowledge Store; Audit Store (Ranger); Models; Type-System; Policy Rules; Taxonomies; Tag Based Policies; Data Lifecycle Management (Falcon); Real-time Tag-based Access Control (Ranger); REST API; Services: Search, Lineage, Exchange]
Healthcare: HIPAA, HL7
Financial: SOX, Dodd-Frank
Energy: PPDM
Retail: PCI, PII
Other: CWM
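Tag-based policies, as named on this slide, attach access rules to classifications such as PII or HIPAA instead of to individual tables or columns. The sketch below shows one possible evaluation rule under stated assumptions. `TagPolicy`, `can_read`, the group names and the deny-by-default behavior are all hypothetical and do not represent Ranger's API or policy model.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class TagPolicy:
    """Hypothetical policy: members of `allowed_groups` may read data carrying `tag`."""
    tag: str
    allowed_groups: FrozenSet[str]


def can_read(user_groups: Set[str], column_tags: Set[str],
             policies: Dict[str, TagPolicy]) -> bool:
    """Allow access only if every tag on the column has a policy the user satisfies."""
    for tag in column_tags:
        policy = policies.get(tag)
        if policy is None or not (user_groups & policy.allowed_groups):
            return False  # deny by default: unknown tag or no matching group
    return True


# Illustrative policies keyed by tag name.
policies = {
    "PII": TagPolicy("PII", frozenset({"compliance"})),
    "HIPAA": TagPolicy("HIPAA", frozenset({"compliance", "clinical"})),
}

print(can_read({"clinical"}, {"HIPAA"}, policies))
print(can_read({"clinical"}, {"HIPAA", "PII"}, policies))
```

Because the rule is keyed on tags, tagging a new column as PII would bring it under the existing policy without writing a column-specific rule, which is the appeal of the approach in this simplified model.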

Hadoop - Redefined
[Diagram: engines and services layered on shared storage across cluster nodes 1 … N]
HDFS (Storage Management)
YARN (Cluster Resource Management)
DGI (Data Governance & Metadata Management)
Batch: MR
Script: Pig (on Tez)
SQL: Hive (on Tez)
Java: Cascading (on Tez)
Others: Engines (on Tez)
NoSQL: HBase (on Slider)
Stream: Storm (on Slider)
NoSQL: Accumulo (on Slider)
Others: Engines (on Slider)
In-Memory: Spark
PaaS: Kubernetes
LASR, HPA
