HDP2.5 Updates

1 © Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Data Platform Updates
Yuta Imai, Hortonworks

Hortonworks Data Platform

Hortonworks Data Platform: Release Strategy
Extended Services: more frequent releases of Spark, Hive, Ambari and other Apache Data Access projects
Hadoop Core: longer release arcs for core Apache Hadoop components: HDFS, YARN and MapReduce
[Timeline: 2016 to 2017]

HORTONWORKS DATA PLATFORM
HDP 2.3 is Apache Hadoop; not “based on” Hadoop
Ongoing Innovation in Apache

Component groups: Data Mgmt, Data Access, Governance & Integration, Operations, Security

Component       HDP 2.0     HDP 2.1     HDP 2.2     HDP 2.3     HDP 2.4     HDP 2.5*
Released        Oct 2013    April 2014  Dec 2014    Oct 2015    Mar 2016    2H2016
Hadoop & YARN   2.2.0       2.4.0       2.6.0       2.7.1       2.7.1       2.7.3
Pig             0.12.0      0.12.1      0.14.0      0.15.0      0.15.0      0.16.0
Hive            0.12.0      0.13.0      0.14.0      1.2.1       1.2.1       1.2.1+ / 2.1***
Tez             -           0.4.0       0.5.2       0.7.0       0.7.0       0.7.0
Spark           -           -           1.2.1       1.4.1       1.6.0       1.6.2+ / 2.0**
HBase           0.96.1      0.98.0      0.98.4      1.1.2       1.1.2       1.1.2
Phoenix         -           4.0.0       4.2.0       4.4.0       4.4.0       4.7.0
Accumulo        -           1.5.1       1.6.1       1.7.0       1.7.0       1.7.0
Storm           -           0.9.1       0.9.3       0.10.0      0.10.0      1.0.1
Solr            -           4.7.2       4.10.2      5.2.1       5.2.1       5.5.1
Slider          -           -           0.60.0      0.80.0      0.80.0      0.91.0
Zeppelin        -           -           -           -           -           0.6.0
Falcon          -           0.5.0       0.6.0       0.6.1       0.6.1       0.10.0
Atlas           -           -           -           0.5.0       0.5.0       0.7.0
Sqoop           1.4.4       1.4.4       1.4.5       1.4.6       1.4.6       1.4.6
Flume           1.3.1       1.4.0       1.5.2       1.5.2       1.5.2       1.5.2
Kafka           -           -           0.8.1       0.8.2       0.9.0       0.10.0
Ambari          1.4.4       1.5.1       2.0.0       2.1.0       2.2.1       2.4.0
Cloudbreak      -           -           -           1.0.0       1.2.0       1.3.0
Zookeeper       3.4.5       3.4.5       3.4.6       3.4.6       3.4.6       3.4.6
Oozie           3.3.2       4.0.0       4.1.0       4.2.0       4.2.0       4.2.0
Knox            -           0.4.0       0.5.0       0.6.0       0.6.0       0.9.0
Ranger          -           -           0.4.0       0.5.0       0.5.0       0.6.0

* HDP 2.5 – Shows current Apache branches being used. Final component version subject to change based on Apache release process.
** Spark 1.6.2+ Spark 2.0 – HDP 2.5 supports installation of both Spark 1.6.2 and Spark 2.0. Spark 2.0 is Technical Preview within HDP 2.5.
*** Hive 2.1 is Technical Preview within HDP 2.5.

Hortonworks Data Platform 2.5 Key Highlights
•  Interactive Query in Seconds: Hive with LLAP (Technical Preview)
•  Enterprise Spark at Scale: Apache Zeppelin Notebook for Spark
•  Real-Time Applications: Storm and HBase/Phoenix
•  Streamlined Operations: Apache Ambari
•  Dynamic Security: Apache Atlas + Ranger Integration
•  Hortonworks Data Cloud (Technical Preview)
•  Hortonworks HDB (Apache HAWQ)

Interactive Query in Seconds
Hive with LLAP Technical Preview

LLAP

Hive 2 with LLAP Enables Interactive Query in Seconds
Developer Productivity: Interactive query in seconds
Ease of Use and Adoption: 100% compatible with Hive SQL
Enterprise Readiness: Linear scaling at terabyte volumes of data
Streamlined Operations: LLAP integration with Ambari, with automated dashboards

Why LLAP?
•  People like Hive
•  Disk->Mem is getting further away
–  Cloud Storage isn’t co-located
–  Disks are connected to the CPU via network
•  Security landscape is changing
–  Cells & Columns are the new security boundary, not files
–  Safely masking columns needs a process boundary
•  Concurrency, Performance & Scale are in conflict
–  Concurrency at 100k queries/hour
–  Latencies at 2-5 seconds/query
–  Petabyte scale warehouses (with terabytes of “hot” data)
[Diagram: a node running an LLAP process with a cache and query fragments, on top of HDFS]

What is LLAP?
•  Hybrid model combining daemons and containers
for fast, concurrent execution of analytical
workloads (e.g. Hive SQL queries)
•  Concurrent queries without specialized YARN queue setup
•  Multi-threaded execution of vectorized operator pipelines
•  Asynchronous IO and efficient in-memory caching
•  Relational view of the data available through the API
•  High performance scans, execution code pushdown
•  Centralized data security
[Diagram: a node running an LLAP process with a cache and query fragments, on top of HDFS]
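
The shared in-memory cache can be pictured with a minimal Python sketch. It assumes a plain LRU eviction policy and a hypothetical `read_from_storage` loader; both are illustrative stand-ins, not LLAP's actual cache policy or API. The only point is that a second query fragment asking for the same column chunk is served from memory instead of going back to storage.

```python
from collections import OrderedDict

class SharedChunkCache:
    """Toy cache shared by all query fragments on a node (LRU is an assumption)."""

    def __init__(self, capacity, loader):
        self.capacity = capacity
        self.loader = loader          # hypothetical function that reads from storage
        self.entries = OrderedDict()  # key -> cached chunk, least recently used first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        chunk = self.loader(key)
        self.entries[key] = chunk
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict the least recently used chunk
        return chunk

storage_reads = []  # records every trip to (hypothetical) deep storage

def read_from_storage(key):
    storage_reads.append(key)
    return f"data for {key}"

cache = SharedChunkCache(capacity=2, loader=read_from_storage)
cache.get(("sales", "price"))   # fragment of query A: miss, read from storage
cache.get(("sales", "price"))   # fragment of query B: hit, served from memory
cache.get(("sales", "qty"))     # miss
cache.get(("sales", "region"))  # miss, evicts ("sales", "price")
```

Table and column names here are made up for the example.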

Hive 2 with LLAP: Architecture Overview
[Diagram: ODBC / JDBC SQL queries enter through HiveServer2 (query endpoint). Query coordinators and LLAP daemons with query executors run in the YARN cluster. The LLAP daemons hold an in-memory cache (shared across all users) over deep storage: HDFS and compatible stores (S3, WASB, Isilon).]

MR vs Tez vs Tez+LLAP
[Diagram: task graphs (boxes labeled M, R and T) for the three execution models]
Map – Reduce: intermediate results in HDFS
Tez: optimized pipeline
Tez with LLAP: resident process on nodes, with an in-memory columnar cache
Map tasks read HDFS
So…
[Diagram sequence over three slides: (1) Tez with its AM; (2) Tez next to Tez with LLAP (auto); (3) Tez, Tez with LLAP (auto) and Tez with LLAP (all). Task boxes are labeled M, R and T.]

Hive 2 with LLAP: Preliminary Numbers
[Chart: Hive2.0 and LLAP: TPC-DS at 10 TB Scale, 18 Nodes. Series: Hive2.0-Tez and LLAP. Axis ticks from 0 to 80. Queries: q3 q7 q12 q13 q19 q21 q26 q27 q42 q43 q45 q52 q55 q60 q73 q84 q89 q91 q98]
Min query time: Query 55: 2.38s

ACID

Key Features: EDW Offload
•  ACID GA for Streaming and SQL:
–  50+ stabilization fixes.
–  Tested at multi-terabyte scale with simultaneous ingest, delete and query.
•  Better BI Tool Compatibility through Expanded OLAP Capabilities:
–  Multi partition-by, multi order-by.
–  Order by UDF/UDAF.
–  Null order specification (nulls first or nulls last).
•  Faster ETL with More Scalable Partition Loads:
–  2x faster dynamic partition loads.
•  Procedural Extensions (Tech Preview):
–  Procedural structures: loops, if/else.
–  Determine min/max partition values.
–  Copy data from external sources like FTP.
–  Simplifies ETL / data load processes.
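
The null order specification mentioned above can be illustrated with a small Python sketch. This models only the ordering semantics (nulls placed first or last, independent of sort direction); the `order_by` helper is hypothetical and is not how Hive implements it.

```python
from typing import List, Optional, Sequence

def order_by(values: Sequence[Optional[int]],
             descending: bool = False,
             nulls_first: bool = False) -> List[Optional[int]]:
    """Sort values, placing None (standing in for SQL NULL) first or last."""
    non_null = sorted((v for v in values if v is not None), reverse=descending)
    nulls = [v for v in values if v is None]
    return nulls + non_null if nulls_first else non_null + nulls

rows = [3, None, 1, 2, None]
asc_nulls_last = order_by(rows)                                       # like ORDER BY x ASC NULLS LAST
desc_nulls_first = order_by(rows, descending=True, nulls_first=True)  # like ORDER BY x DESC NULLS FIRST
```

With the sample rows, `asc_nulls_last` puts the two nulls after 1, 2, 3, and `desc_nulls_first` puts them before 3, 2, 1.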

HCatalog Stream Mutation API
[Diagram: a table in HDFS made up of buckets, with ORC files in each bucket]

SQL Compliance

Apache Hive: Journey to SQL:2011 Analytics
Legend: Existing; Projected: HDP 2.5; Projected: HDP 3.0
Track Hive SQL Complete: HIVE-13554

Data Types
•  Numeric: FLOAT/DOUBLE, DECIMAL, INT/TINYINT/SMALLINT/BIGINT, BOOLEAN
•  String: CHAR / VARCHAR, STRING, BINARY
•  Date, Time: DATE, TIMESTAMP, Interval Types
•  Complex Types: ARRAY, MAP, STRUCT, UNION

SQL Features
•  Core SQL Features
–  Date, Time and Arithmetical Functions
–  INNER, OUTER, CROSS and SEMI Joins
–  Derived Table Subqueries
–  Correlated + Uncorrelated Subqueries
–  UNION ALL
–  UDFs, UDAFs, UDTFs
–  Common Table Expressions
–  UNION DISTINCT
•  Advanced Analytics
–  OLAP and Windowing Functions
–  CUBE and Grouping Sets
•  Nested Data Analytics
–  Nested Data Traversal
–  Lateral Views
•  ACID Transactions
–  INSERT / UPDATE / DELETE

File Formats
•  Columnar: ORCFile, Parquet
•  Text: CSV, Logfile
•  Nested / Complex: Avro, JSON, XML
•  Custom Formats
•  Other Features: XPath Analytics

Futures
•  Procedural Extensions (PL/SQL)
•  Primary Key / Foreign Key
•  Non-Equijoin
•  Scalable Cross Product
•  Enhanced OLAP
•  ACID MERGE
•  Multi Subquery
•  Comparison to sub-select
•  INTERSECT and EXCEPT

Enterprise Spark at Scale

Apache Zeppelin GA: The Data Science Notebook
Web-based data science notebook
Interactive data ingestion and data exploration
Easy sharing and collaboration
Secure with single sign-on and encryption

Apache Spark 2.0 (Technical Preview)
Structuring Spark: DataFrames, Datasets and Streaming
Interactive data ingestion and data exploration
Easy sharing and collaboration
Secure with single sign-on and encryption

Dynamic Security Policies
Apache Atlas and Ranger Integration

Apache Atlas + Ranger - Powerful Together

Dynamic Masking and Row Level Filtering
Dept  SSN        CC No             Name      DOB        MRN         Policy ID
01    232323233  4539067047629850  John Doe  9/12/1969  8233054331  nj23j424
02    333287465  5391304868205600  Jane Doe  9/13/1969  3736885376  cadsd984

Ranger Policy Enforcement

Marketing group sees CC and SSN as masked values, and MRN is nullified:
Dept  SSN        CC No                MRN   Name
01    xxxxx3233  4539 xxxx xxxx xxxx  null  John Doe
02    xxxxx7465  5391 xxxx xxxx xxxx  null  Jane Doe

Dept employee only sees data specific to that department:
Dept  SSN        Name      MRN
01    232323233  John Doe  8233054331
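
The masking and row-filtering behaviour in this example can be sketched in plain Python. The function names (`mask_show_last4`, `mask_show_first4`, `marketing_view`, `department_view`) are hypothetical and exist only for this illustration; in HDP the policies are defined in Ranger and enforced at query time, not written as application code.

```python
RECORDS = [
    {"dept": "01", "ssn": "232323233", "cc_no": "4539067047629850",
     "name": "John Doe", "mrn": "8233054331"},
    {"dept": "02", "ssn": "333287465", "cc_no": "5391304868205600",
     "name": "Jane Doe", "mrn": "3736885376"},
]

def mask_show_last4(value):
    """Replace everything except the last four characters with 'x'."""
    return "x" * (len(value) - 4) + value[-4:]

def mask_show_first4(value):
    """Keep the first four characters and show the rest as 'xxxx' groups."""
    hidden_groups = ["xxxx"] * ((len(value) - 4) // 4)
    return " ".join([value[:4]] + hidden_groups)

def marketing_view(record):
    """Column masking: SSN and CC are masked, MRN is nullified."""
    return {
        "dept": record["dept"],
        "ssn": mask_show_last4(record["ssn"]),
        "cc_no": mask_show_first4(record["cc_no"]),
        "mrn": None,
        "name": record["name"],
    }

def department_view(records, dept):
    """Row-level filtering: only rows belonging to the caller's department."""
    return [
        {"dept": r["dept"], "ssn": r["ssn"], "name": r["name"], "mrn": r["mrn"]}
        for r in records
        if r["dept"] == dept
    ]

marketing_rows = [marketing_view(r) for r in RECORDS]
dept_01_rows = department_view(RECORDS, "01")
```

Run against the two sample records from the slide, this reproduces the masked marketing view and the single-row department view shown above.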

Expanded Native Connector: Dataset Lineage
[Diagram: Sqoop, Teradata Connector, Apache Kafka, Custom Activity Reporter, RDBMS and Metadata Repository]

Apache Atlas Enables Business Catalog for Ease of Use
•  Organize data assets along business terms
–  Authoritative: Hierarchical business taxonomy creation
–  Agile modeling: Model conceptual, logical and physical assets
–  Definition and assignment of tags like PII (Personally Identifiable Information)
•  Comprehensive features for compliance
–  Multiple user profiles including Data Steward and Business Analysts
–  Object auditing to track “who did it”
–  Metadata versioning to track “what did they do”
Key Benefits:
•  Easy way to create a business taxonomy
•  Useful for multiple user types including Data Steward and Business Analysts
•  Comprehensive features for compliance
Business Catalog: Model and explore metadata via the new Business Catalog in Apache Atlas
[Diagram label: Data Steward]

Real Time Applications powered by
Storm and HBase/Phoenix

What’s New in Storm
Developer Productivity: Sliding and tumbling windowing support
Developer Productivity: New connectors for search and NoSQL databases
Enterprise Readiness: Automatic back pressure
Streamlined Operations: Resource aware scheduling and Storm view for Ambari
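
The difference between tumbling and sliding windows can be shown with a short Python sketch. It assumes integer event timestamps and windows aligned to multiples of the slide interval; it is not Storm's windowing API, and the function names are hypothetical.

```python
def tumbling_windows(events, size):
    """Assign each (timestamp, value) event to exactly one window [start, start + size)."""
    windows = {}
    for ts, value in events:
        start = (ts // size) * size
        windows.setdefault(start, []).append(value)
    return dict(sorted(windows.items()))

def sliding_windows(events, size, slide):
    """Assign each event to every window [start, start + size) that covers it.

    Window starts are multiples of `slide`, so windows overlap when slide < size.
    """
    windows = {}
    for ts, value in events:
        # Smallest window start (multiple of slide, not below 0) that still covers ts.
        start = max(0, ((ts - size) // slide + 1) * slide)
        while start <= ts:
            windows.setdefault(start, []).append(value)
            start += slide
    return dict(sorted(windows.items()))

events = [(1, "a"), (2, "b"), (5, "c"), (7, "d")]
tumbling = tumbling_windows(events, size=4)
sliding = sliding_windows(events, size=4, slide=2)
```

In the tumbling case each event lands in one window; in the sliding case an event such as "b" at timestamp 2 appears in both the window starting at 0 and the one starting at 2.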

What’s New in HBase and Phoenix
Developer Productivity: Phoenix and Hive integration to run HBase queries in Hive
Enterprise Readiness: Incremental backup/restore
Enterprise Readiness: Performance boost for high-scale loads
Developer Productivity: Ad hoc analytics with connector to any ODBC BI tool

Streamlined Operations
Apache Ambari

Phase 1: Advanced Metrics Visualization & Dashboarding
[Diagram: Ambari, Ambari Metrics System, Grafana]
Goal: Quickly understand cluster health metrics and key performance indicators

⬢  Capabilities
–  Centralized dashboarding focusing on component health & performance
–  Ad-hoc graph creation
⬢  Pre-Built Dashboards
–  HDFS
–  YARN
–  HBase
⬢  Core Technologies
–  Ambari Metrics System
–  Grafana

Ambari now includes pre-built dashboards for visualizing the most important cluster health metrics.

Phase 2: Consolidated Cluster Activity Reporting
Goal: Quickly visualize and report on how business users and tenants are using the cluster: top 10 queues, users and most time-consuming jobs

⬢  Capabilities
–  Top K Activity Reporting
–  Chargeback
⬢  Services Covered
–  YARN
–  MapReduce
–  Hive/Tez
–  Spark
–  HDFS
⬢  Core Technologies
–  Hortonworks SmartSense
–  Apache Zeppelin
[Diagram: SmartSense, Ambari, Ambari Metrics System, Zeppelin]
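
Top K activity reporting amounts to aggregating usage per user or per queue and keeping the largest entries. The sketch below shows that idea in Python; the job records, field names and `top_k_by_usage` helper are invented for illustration and do not reflect the SmartSense data model.

```python
from collections import Counter

def top_k_by_usage(jobs, key, k):
    """Sum cpu_seconds per value of `key` and return the k largest as (name, total) pairs."""
    usage = Counter()
    for job in jobs:
        usage[job[key]] += job["cpu_seconds"]
    return usage.most_common(k)

# Hypothetical job history; every value here is made up.
jobs = [
    {"user": "alice", "queue": "etl", "cpu_seconds": 120},
    {"user": "bob", "queue": "adhoc", "cpu_seconds": 30},
    {"user": "alice", "queue": "etl", "cpu_seconds": 60},
    {"user": "carol", "queue": "adhoc", "cpu_seconds": 90},
]

top_users = top_k_by_usage(jobs, key="user", k=2)
top_queues = top_k_by_usage(jobs, key="queue", k=1)
```

The same per-tenant totals could feed a chargeback report by multiplying each total by a rate.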

Activity Explorer: Cluster Utilization Reporting

Preview: Streamlined Operations Investments
[Diagram: Solr, Ambari, Log Search]
Phase 3: Centralized & Contextual Log Search
Goal: When issues arise, be able to quickly find them across all HDP components

⬢  Capabilities
–  Rapid search of all HDP component logs
–  Search across time ranges, log levels, and for keywords
⬢  Core Technologies:
–  Apache Ambari
–  Apache Solr
–  Apache Ambari Log Search
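
Searching across time ranges, log levels and keywords can be pictured as three filters applied together. The Python sketch below shows only that filtering logic on an in-memory list; the entry fields and the `search_logs` function are assumptions for illustration, not the Ambari Log Search or Solr API.

```python
from datetime import datetime

def search_logs(entries, start, end, levels=None, keyword=None):
    """Return entries inside [start, end] that match the optional level set and keyword."""
    results = []
    for entry in entries:
        if not (start <= entry["time"] <= end):
            continue
        if levels is not None and entry["level"] not in levels:
            continue
        if keyword is not None and keyword.lower() not in entry["message"].lower():
            continue
        results.append(entry)
    return results

# Hypothetical log entries from different components.
LOG_ENTRIES = [
    {"time": datetime(2016, 9, 1, 10, 0), "component": "hdfs",
     "level": "INFO", "message": "Block report processed"},
    {"time": datetime(2016, 9, 1, 10, 5), "component": "yarn",
     "level": "ERROR", "message": "Container launch failed"},
    {"time": datetime(2016, 9, 1, 11, 30), "component": "hive",
     "level": "ERROR", "message": "Query failed: timeout"},
]

matches = search_logs(
    LOG_ENTRIES,
    start=datetime(2016, 9, 1, 10, 0),
    end=datetime(2016, 9, 1, 11, 0),
    levels={"ERROR"},
    keyword="failed",
)
```

Only the YARN entry satisfies all three filters here; the Hive error falls outside the time range.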

Hortonworks Data Cloud

Abstract: Governance and Security in Cloud
Today’s transportation marketplace is competitive and
quickly evolving. Often, unexpected regulations can pose a
serious risk to operations and the bottom line. With
Hortonworks Data Cloud (HDC), we’ll show how to gain
agility in adapting to new challenges that can turn problems
into opportunities.

•  Quickly provision a new analytic cloud environment
•  Classify and tag assets to find and understand your data
•  Security and audit services to meet compliance requirements

Learn More
http://hortonworks.github.io/hdp-aws/index.html

Hortonworks HDB
Powered by Apache HAWQ

What is HDB / Apache HAWQ ?
Hadoop-native SQL query engine and
advanced analytics MPP database
that offers high-performance
interactive ANSI SQL query execution
and machine learning for Data
Analysts & Data Scientists who want
to find insights from large/complex
datasets.

Hortonworks HDB Powered By Apache HAWQ
1.  Interactive query performance
•  Query performance in seconds
•  Compatible with any ANSI SQL compliant BI Tool
•  Larger number of concurrent users
2.  MADlib big data Machine Learning in SQL for data
scientists and data analysts
•  Classification e.g. predict loan default
•  Regression e.g. predict value of a sale
•  Clustering e.g. marketing campaign segmentation, …
3.  Data federation using HAWQ Extension Framework
•  SQL queries against other data sources
[Diagram: BI Tool X, BI Tool Y and BI Tool Z connect to Hortonworks HDB on HDP (Hortonworks Data Platform). SQL standards shown: SQL-89, SQL-92, SQL-2003]

HDB / HAWQ Advantages
Advanced Analytics Performance: Exceptional MPP performance, low latency, high scalability, ACID reliability, fault tolerance
Most Complete Language Compliance: Higher degree of SQL compatibility, SQL-92, 99, 2003, OLAP, leverage existing SQL skills
Best-in-class Query Optimizer: Maximize performance and do advanced queries with confidence
Elastic Architecture for Scalability: Scale-up/down or scale-in/out, expand/shrink clusters on the fly
Tightly integrated w/ MADlib Machine Learning: Advanced MPP analytics, data science at scale, directly on Hadoop data

New in HDF 2.0

New Features of HDF 2.0

•  Enterprise productivity via streamlined operations
– Ambari integration of Apache NiFi, Kafka, Storm
– Apache Ranger authorization
– Modernized, more intuitive UI
– Multi-tenancy of dataflows
•  170+ processors, 30% more than in Apache NiFi 1.0
•  Edge intelligence with Apache MiNiFi
•  Increased security options with Apache Kafka 0.10
•  10X streaming analytics performance, windowing and productivity tools with Apache Storm 1.0

Ambari Integration

Comprehensive Storm-Ambari Views

Multi-tenant Authorization

Read Permission

Multi-tenant Authorization

NO Read Permission (talk about levels, where you can assign permissions)

HDF 2.0 has 170+ Processors, 30% Increase from HDF 1.2

Hash
Extract
Merge
Duplicate
Scan
GeoEnrich
Replace
Convert Split
Translate
Route Content
Route Context
Route Text
Control Rate
Distribute Load
Generate Table Fetch
Jolt Transform JSON
Prioritized Delivery
Encrypt
Tail
Evaluate
Execute
HL7
FTP
UDP
XML
SFTP
HTTP
Syslog
Email
HTML
Image
AMQP
MQTT
All Apache project logos are trademarks of the ASF and the respective projects.
Fetch

Edge Intelligence with Apache MiNiFi

Key Features
•  Guaranteed delivery
•  Data buffering
–  Backpressure
–  Pressure release
•  Prioritized queuing
•  Flow specific QoS
–  Latency vs. throughput
–  Loss tolerance
•  Data provenance
•  Recovery / recording a rolling log of fine-grained history
•  Designed for extension
•  Small footprint (~40MB)
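
Backpressure and prioritized queuing from the feature list above can be combined in one small Python sketch: a bounded buffer that refuses new items when full (so the producer has to slow down) and always releases the most urgent item first. The `BoundedPriorityQueue` class is hypothetical and is not MiNiFi's implementation or configuration model.

```python
import heapq
import itertools

class BoundedPriorityQueue:
    """Bounded buffer: lower priority number means more urgent."""

    def __init__(self, max_size):
        self.max_size = max_size
        self._heap = []
        self._order = itertools.count()  # tie-breaker keeps FIFO order within a priority

    def offer(self, priority, item):
        """Try to enqueue; return False when full so the caller applies backpressure."""
        if len(self._heap) >= self.max_size:
            return False
        heapq.heappush(self._heap, (priority, next(self._order), item))
        return True

    def poll(self):
        """Remove and return the most urgent item, or None if the queue is empty."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]

queue = BoundedPriorityQueue(max_size=2)
accepted_bulk = queue.offer(5, "bulk sensor batch")
accepted_alert = queue.offer(1, "alert")
accepted_when_full = queue.offer(3, "status update")   # rejected: buffer is full
first_out = queue.poll()                                # frees one slot
accepted_after_poll = queue.offer(3, "status update")  # accepted now
remaining = [queue.poll(), queue.poll()]
```

Rejecting the offer instead of dropping or blocking is one possible design; a real agent might instead block the source or expire old data (pressure release).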

New Stream Processing Features HDF 2.0

Categories: Developer Productivity, Enterprise Readiness, Operational Simplicity
•  New Storm Connectors
•  Storm-Kafka Spout using new client APIs
•  Storm Distributed Log Search
•  Storm Dynamic Worker Profiling
•  Kafka Grafana Integration
•  Storm Grafana Integration
•  Improved Nimbus HA
•  Storm Automatic Back Pressure
•  Storm Distributed Cache
•  Storm Windowing and State Management
•  Storm Performance Improvements
•  Improved Kafka SASL
•  Storm Topology Event Inspector
•  Storm Resource Aware Scheduling
•  Storm Dynamic Log Levels
•  Pacemaker Storm Daemon
•  Kafka Rack Awareness

Thank You
