Keynote slides from Big Data Spain, November 2016. Offers thoughts on how the Hadoop ecosystem is growing and changing to support the enterprise, including Hive, Spark, NiFi, security and governance, streaming, and the cloud.
This document discusses new features in Apache Hive 2.0, including:
- The addition of procedural SQL (HPL/SQL), bringing capabilities like loops and branches.
- A new execution engine called LLAP that uses persistent daemons to enable sub-second queries by caching data in memory.
- The option to use HBase as the metastore to speed up query planning times for queries involving thousands of partitions.
- Improvements to Hive on Spark, the cost-based optimizer, and many bug fixes and performance enhancements.
The document discusses new features in Hive 2.0 including Hive LLAP (Live Long And Process) and Hive on ACID (Atomic, Consistent, Isolated, Durable). Hive LLAP introduces an in-memory caching mechanism that provides sub-second query performance for Hive. Hive on ACID allows for transactions on Hive tables including updates, deletes, and streaming ingestion while maintaining consistency and concurrency. The document provides overviews of how both features work and improvements they provide for analytics workloads on Hive.
ORC files were originally introduced in Hive, but have now migrated to an independent Apache project. This has sped up the development of ORC and simplified integrating ORC into other projects, such as Hadoop, Spark, Presto, and NiFi. There are also many new tools built on top of ORC, such as Hive’s ACID transactions and LLAP, which provides incredibly fast reads for your hot data. LLAP also provides strong security guarantees that allow each user to see only the rows and columns they have permission for.
This talk will discuss the details of the ORC and Parquet formats and the relevant tradeoffs. In particular, it will cover how to format your data and which options to use to maximize read performance, including when and how to use ORC’s schema evolution, bloom filters, and predicate push down. It will also show how to use the tools to translate ORC files into human-readable formats, such as JSON, and to display the rich metadata from the file, including the types in the file and the min, max, and count for each column.
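For a concrete feel for such tooling, here is a minimal Python sketch using the third-party pyorc package (not the Java orc-tools shown in the talk) to print an ORC file's schema and dump its rows as JSON; the file name is a placeholder.

```python
import json
import pyorc

# Read a local ORC file; "example.orc" is a placeholder path.
with open("example.orc", "rb") as f:
    reader = pyorc.Reader(f, struct_repr=pyorc.StructRepr.DICT)
    print(reader.schema)        # e.g. struct<name:string,age:int>
    for row in reader:          # rows come back as dicts
        print(json.dumps(row))  # one JSON object per row
```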
- Hive originally only supported updating partitions by overwriting entire files, which caused issues for concurrent readers and limited functionality like row-level updates.
- The need for ACID transactions in Hive arose from wanting to support updating data in near real-time as it arrives and making ad hoc data changes without complex workarounds.
- Hive's ACID implementation stores changes as delta files, uses the metastore to manage transactions and locks, and runs compactions to merge deltas into base files.
- There were initial issues around correctness, performance, usability and resilience, but many have been addressed with ongoing work focused on further improvements and new features like multi-statement transactions and better integration with LLAP.
Apache Hive is an Enterprise Data Warehouse built on top of Hadoop. Hive supports Insert/Update/Delete SQL statements with transactional semantics and read operations that run at snapshot isolation. This talk will describe the intended use cases, the architecture of the implementation, new features such as the SQL MERGE statement, and recent improvements. The talk will also cover the Streaming Ingest API, which allows writing batches of events into a Hive table without using SQL. This API is used by Apache NiFi, Storm, and Flume to stream data directly into Hive tables and make it visible to readers in near real time.
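As a hedged illustration of the MERGE statement mentioned above, the sketch below issues one through the third-party PyHive client; the host, tables, and columns are hypothetical, and the target table must be a transactional (ACID) table.

```python
from pyhive import hive  # third-party HiveServer2 client

# Connect to HiveServer2; host and database are placeholders.
conn = hive.Connection(host="hs2.example.com", port=10000, database="default")
cur = conn.cursor()

# Apply staged inserts, updates, and deletes in a single atomic statement.
cur.execute("""
    MERGE INTO customers AS t
    USING customer_updates AS s
    ON t.id = s.id
    WHEN MATCHED AND s.is_deleted = true THEN DELETE
    WHEN MATCHED THEN UPDATE SET email = s.email
    WHEN NOT MATCHED THEN INSERT VALUES (s.id, s.email, s.created_at)
""")
```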
Apache Zeppelin and Spark for Enterprise Data Science (Bikas Saha)
Apache Zeppelin and Spark are turning out to be useful tools in the toolkit of the modern data scientist when working on large scale datasets for machine learning. Zeppelin makes Big Data accessible with minimal effort using web browser based notebooks to interact with data in Hadoop. It enables data scientists to interactively explore and visualize their data and collaborate with others to develop models. Zeppelin has great integration with Apache Spark that delivers many machine learning algorithms out of the box to Zeppelin users as well as providing a fast engine to run custom machine learning on Big Data. The talk will describe the latest in Zeppelin and focus on how it has been made ready for the enterprise. With support for secure Hadoop clusters, LDAP/AD integration, user impersonation and session separation, Zeppelin can now be confidently used in secure and multi-tenant enterprise domains.
This document discusses new features in Apache Hive 2.0, including:
1) Adding procedural SQL capabilities through HPL/SQL for writing stored procedures.
2) Improving query performance through LLAP which uses persistent daemons and in-memory caching to enable sub-second queries.
3) Speeding up query planning by using HBase as the metastore instead of a relational database.
4) Enhancements to Hive on Spark such as dynamic partition pruning and vectorized operations.
5) Default use of the cost-based optimizer and continued improvements to statistics collection and estimation.
Apache Hadoop YARN is a modern resource-management platform that can host multiple data processing engines for various workloads like batch processing (MapReduce), interactive SQL (Hive, Tez), real-time processing (Storm), existing services and a wide variety of custom applications. These applications can all co-exist on YARN and share a single data center in a cost-effective manner with the platform worrying about resource management, isolation and multi-tenancy.
YARN is now adding support for services in a first-class manner. This talk will first cover the challenges of running services on YARN, and then move on to the changes that were made to the ResourceManager to support scheduling services on YARN (such as affinity and anti-affinity). The talk will then cover the changes made in the NodeManager and features such as container restart and container upgrades. It will also cover new additions to YARN like the new application manager (which will allow users to bring services workloads onto YARN by providing features such as container orchestration and management) and the DNS server that uses the YARN registry to enable service discovery.
Demand for cloud is through the roof. Cloud is turbocharging the Enterprise IT landscape with agility and flexibility, and discussions of cloud architecture now dominate Enterprise IT. Cloud enables many ephemeral, on-demand use cases, which is a game-changing opportunity for analytic workloads. But all of this comes with the challenge of running enterprise workloads in the cloud securely and with ease.
In this session, we will take you through Cloudbreak as a solution to simplify provisioning and managing enterprise workloads while providing an open and common experience for deploying workloads across clouds. We will discuss the challenges (and opportunities) to run enterprise workloads in the cloud and will go through how the latest from Cloudbreak enables enterprises to easily and securely run big data workloads. This includes deep-dive discussion on autoscaling, Ambari Blueprints, recipes, custom images, and enabling Kerberos -- which are all key capabilities for Enterprise deployments.
As a last topic we will discuss how we deployed and operate Cloudbreak as a Service internally which enables rapid cluster deployment for prototyping and testing purposes.
Speakers
Peter Darvasi, Cloudbreak Partner Engineer, Hortonworks
Richard Doktorics, Staff Engineer, Hortonworks
This document discusses running Oracle E-Business Suite on Oracle Cloud. It provides an overview of Oracle Cloud offerings including Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Software as a Service (SaaS). It outlines reasons for moving E-Business Suite to Oracle Cloud like enabling business agility, lowering costs and risks, and supporting growth. The document also covers solution details such as deployment choices, roadmap for automation, and use cases for transitioning to Oracle Cloud.
An Overview on Optimization in Apache Hive: Past, Present, Future (DataWorks Summit)
Apache Hive has been continuously evolving to support a broad range of use cases, bringing it beyond its batch processing roots to its current support for interactive queries with sub-second response times using LLAP. However, the development of its execution internals is not sufficient to guarantee efficient performance, since poorly optimized queries can create a bottleneck in the system. Hence, each release of Hive has included new features for its optimizer aimed to generate better plans and deliver improvements to query execution. In this talk, we present the development of the optimizer since its initial release. We describe its current state and how Hive leverages the latest Apache Calcite features to generate the most efficient execution plans. We show numbers demonstrating the improvements brought to Hive performance, and we discuss future directions for the next-generation Hive optimizer, which include an enhanced cost model, materialized views support, and complex query decorrelation.
Apache Falcon - Simplifying Managing Data Jobs on Hadoop (DataWorks Summit)
Apache Falcon is a data management platform on Hadoop that provides a holistic way to declaratively define and manage data pipelines and workflows. It allows users to specify feeds, processes, and clusters to orchestrate the flow of data across Hadoop clusters. Falcon handles scheduling, dependency management, replication, and data governance. The architecture uses Oozie to schedule workflows and notifications are sent through JMS. Case studies demonstrate how Falcon can be used for multi-cluster failover and distributed processing across data centers.
Apache Phoenix Query Server, PhoenixCon 2016 (Josh Elser)
This document discusses Apache Phoenix Query Server, which provides a client-server abstraction for Apache Phoenix using Apache Calcite's Avatica sub-project. It allows Phoenix to have thin clients by offloading computational resources to query servers running on Hadoop clusters. This enables non-Java clients through a standardized HTTP API. The query server implementation uses HTTP, Protocol Buffers for serialization, and common libraries like Jetty and Dropwizard Metrics. It aims to simplify Phoenix client development and improve performance and scalability.
Apache Ambari is used by thousands of Hadoop Operators to manage the deployment, lifecycle, and automation of DevOps for Hadoop ecosystem projects. The Ambari engineering team will talk about improvements being made to the automation, metrics, logging, upgrade, and other core frameworks within Ambari as the project is being re-imagined.
Starting out, Apache Ambari installed a handful of Apache Hadoop ecosystem projects, on a few operating systems, and helped with the most basic Hadoop operational tasks. Today, the product manages over 20 different services, runs on multiple major operating systems and versions, and automates many of the most challenging Hadoop operational tasks in the most secure customer environments.
As part of this talk, the engineering team will walk you through what we've learned, the challenges we've overcome, and how the Apache Ambari community has changed the product to handle them. The future is fast approaching, and with it comes new on-premise and cloud deployment architectures. See how Apache Ambari is being re-imagined to handle these new challenges.
Speakers
Paul Codding, Product Management Director, Hortonworks
Oliver Szabo, Senior Software Engineer, Hortonworks
The document discusses recent releases and major new features of HBase 2.0 and Phoenix 5.0. HBase 2.0 focuses on off-heap memory usage to improve performance, as well as new features like async client, region assignment improvements, and backup/restore capabilities. Phoenix 5.0 includes API cleanup, improved join processing using cost-based optimizations, enhanced index handling including failure recovery, and integration with Apache Kafka.
With its large install base in production, the Storm 1.x line has proven itself as a stable and reliable workhorse that scales well horizontally. Much has been learnt from evolving the 1.x line that we can now leverage to build the next generation execution engine. Under the STORM-2284 umbrella, we are working hard to bring you this new engine which is being redesigned at a fundamental level for Storm 2.0. The goal is to dramatically improve performance and enhance Storm's abilities without breaking compatibility.
This improved vertical scaling will help meet the needs of the growing user base by delivering more performance with less hardware.
In this talk, we will take an in-depth look at the existing and proposed designs for Storm's threading model and the messaging subsystem. We will also do a quick run-down of the major proposed improvements and share some early results from the work in progress.
Speaker
Roshan Naik, Senior MTS, Hortonworks
Building Data Pipelines for Solr with Apache NiFi (Bryan Bende)
This document provides an overview of using Apache NiFi to build data pipelines that index data into Apache Solr. It introduces NiFi and its capabilities for data routing, transformation and monitoring. It describes how Solr accepts data through different update handlers like XML, JSON and CSV. It demonstrates how NiFi processors can be used to stream data to Solr via these update handlers. Example use cases are presented for indexing tweets, commands, logs and databases into Solr collections. Future enhancements are discussed like parsing documents and distributing commands across a Solr cluster.
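To make the update-handler idea concrete, here is a minimal Python sketch that posts JSON documents to Solr's update endpoint, the same kind of API the NiFi processors stream to; the Solr URL and collection name are placeholders.

```python
import requests

# Two toy documents destined for a hypothetical "tweets" collection.
docs = [
    {"id": "1", "text": "hello solr"},
    {"id": "2", "text": "hello nifi"},
]

# Solr's JSON update handler accepts an array of documents at /update.
resp = requests.post(
    "http://localhost:8983/solr/tweets/update?commit=true",
    json=docs,
)
resp.raise_for_status()
print(resp.json())  # status 0 indicates the update was accepted
```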
AIOUG HA Day Oct 2015: GoldenGate - High Availability Day 2015 (AIOUG Hyderabad Chapter)
This document provides an overview of Oracle GoldenGate and discusses its key components and topologies. It begins with background information about the presenter and then covers topics such as Oracle GoldenGate's supported platforms, common topologies used with Oracle GoldenGate including unidirectional data integration and high availability, and the benefits it provides such as zero downtime upgrades and live reporting. It also discusses Oracle GoldenGate's components including the extract, replicat, trail files, and pump. Finally, it touches on performance tuning techniques for Oracle GoldenGate including adjusting TCP buffer sizes and using checkpoints.
Apache Hive is a rapidly evolving project which continues to enjoy great adoption in the big data ecosystem. As Hive continues to grow its support for analytics, reporting, and interactive query, the community is hard at work improving it along many different dimensions and use cases. This talk will provide an overview of the latest and greatest features and optimizations which have landed in the project over the last year. Materialized views, the extension of ACID semantics to non-ORC data, and workload management are some noteworthy new features.
We will discuss optimizations which provide major performance gains, including significantly improved performance for ACID tables. The talk will also provide a glimpse of what is expected to come in the near future.
Apache Phoenix’s relational database view over Apache HBase delivers a powerful tool which enables users and developers to quickly and efficiently access their data using SQL. However, Phoenix only provides a Java client, in the form of a JDBC driver, which limits Phoenix access to JVM-based applications. The Phoenix QueryServer is a standalone service which provides the building blocks to use Phoenix from any language, not just those running in a JVM. This talk will serve as a general purpose introduction to the Phoenix QueryServer and how it complements existing Apache Phoenix applications. Topics covered will range from design and architecture of the technology to deployment strategies of the QueryServer in production environments. We will also include explorations of the new use cases enabled by this technology like integrations with non-JVM based languages (Ruby, Python or .NET) and the high-level abstractions made possible by these basic language integrations.
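As one example of the non-JVM integrations described above, this hedged sketch uses the third-party phoenixdb Python driver, which speaks the QueryServer's HTTP/Avatica protocol; the endpoint and table names are placeholders.

```python
import phoenixdb  # third-party DB-API driver for Phoenix Query Server

# The QueryServer listens on HTTP (port 8765 by default).
conn = phoenixdb.connect("http://queryserver.example.com:8765/", autocommit=True)
cur = conn.cursor()

cur.execute("CREATE TABLE IF NOT EXISTS users (id INTEGER PRIMARY KEY, name VARCHAR)")
cur.execute("UPSERT INTO users VALUES (?, ?)", (1, "admin"))
cur.execute("SELECT id, name FROM users")
print(cur.fetchall())
```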
Apache HBase Internals You Hoped You Never Needed to Understand (Josh Elser)
Covers numerous internal features, concepts, and implementations of Apache HBase. The focus will be driven from an operational standpoint, investigating each component enough to understand its role in Apache HBase and the generic problems that each are trying to solve. Topics will range from HBase’s RPC system to the new Procedure v2 framework, to filesystem and ZooKeeper use, to backup and replication features, to region assignment and row locks. Each topic will be covered at a high-level, attempting to distill the often complicated details down to the most salient information.
The document discusses new features in Apache Ambari 2.2, including Express Upgrade which allows automated upgrades of Hadoop clusters with downtime but potentially faster completion compared to Rolling Upgrade. Other new features include exporting metric graph data, user-selectable timezones and time ranges for dashboards, automatic logout of inactive Ambari web users, and saving Kerberos admin credentials for easier cluster changes.
Aman Sharma: Oracle 12c RAC - High Availability Day 2015 (AIOUG Hyderabad Chapter)
This document discusses new features in Oracle RAC and ASM in Oracle Database 12c. It introduces Flex Clusters, which use a hub-and-spoke topology to improve scalability over traditional RAC clusters. Leaf nodes run application workloads and connect to hub nodes, which run databases and ASM. Server pools can now manage both hub and leaf nodes to isolate workloads. Other new features include shared Grid Naming Service (GNS) configurations, policy-based cluster administration using server categorization and policies, and Multitenant databases with RAC.
Every development shop is unique, and sometimes that uniqueness can hinder using tools. SQL Developer and Data Modeler have multiple mechanisms that allow for customizations. These customizations can range from simple to complex and can help tailor the tooling to any environment. Some are as simple as a colored warning to remind the user what is production vs. development. Some could auto-generate code by walking over a data model. The most complex can change anything at all in the tool. Ever think of a command that should be in SQL*Plus scripting? Want to auto-generate table APIs?
This document discusses adding ACID transaction support to Hive to allow for updates, deletes, and inserts of rows. It describes how transactions will be implemented using delta files stored in HDFS and a transaction manager using the metastore database. The new features will initially support auto-commit transactions with snapshot isolation in Hive 0.13, with explicit transaction commands like BEGIN, COMMIT, and ROLLBACK to follow in a later release. Streaming ingest of data is also supported using a new interface for small batch writes and commits. Initial limitations include support only for bucketed, unsorted tables.
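A minimal sketch of the kind of table this design targets, assuming the Hive 0.13/0.14-era requirements (ORC storage, bucketing, and the transactional table property); the connection details and table are hypothetical.

```python
from pyhive import hive  # third-party HiveServer2 client; host is a placeholder

cur = hive.Connection(host="hs2.example.com", port=10000).cursor()

# ACID tables initially had to be bucketed and stored as ORC, with the
# transactional property set; updates and deletes land in delta files
# that compaction later merges into base files.
cur.execute("""
    CREATE TABLE page_views (
        user_id BIGINT,
        url     STRING
    )
    CLUSTERED BY (user_id) INTO 8 BUCKETS
    STORED AS ORC
    TBLPROPERTIES ('transactional' = 'true')
""")
cur.execute("DELETE FROM page_views WHERE user_id = 42")  # written as a delta file
```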
This document provides an introduction to Hive, including:
- What Hive is and why it is used to run SQL queries on Hadoop data as MapReduce jobs.
- Hive's logical table/physical location/data format architecture.
- An overview of Hive's architecture and metastore configuration.
- A comparison of Hive's schema-on-read approach versus traditional databases' schema-on-write.
- Descriptions of Hive's data types and table types, including managed and external tables.
Hive & HBase for Transaction Processing, Hadoop Summit EU Apr 2015 (alanfgates)
The document discusses using Hive, HBase, Phoenix, and Calcite to build a single data store for both analytics and transaction processing. It describes some recent improvements to Hive like LLAP (Live Long and Process) that aim to achieve sub-second query response times, as well as using HBase as the Hive metastore to improve performance.
Keynote from Apache Big Data EU. This introduces training that we are doing at Hortonworks to help our employees understand and work well as part of the Apache Software Foundation.
Apache Spark Usage in the Open Source Ecosystem (Databricks)
Apache Spark is an active member of the broad open source community beyond the Apache Foundation. Every day thousands of users combine capabilities of Spark with other open source software to get their job done. This is not by chance. Spark has been designed to behave well with existing ecosystems. For example, PySpark is designed to work well with Pandas, Numpy and other python packages. In this talk we will present an analysis of libraries and open source tools that are commonly used along with Spark in JVM, Python and R ecosystems. Our quantitative results are based on usage of thousands of Spark users. We will show the Spark Summit attendees what the rest of their community finds useful to complement the power of Spark and what parts of Spark API is used in conjunction with most popular open source libraries.
Hive Analytic Workloads, Hadoop Summit San Jose 2014 (alanfgates)
- Hive has undergone significant development over the past few years focused on improving performance, scale, and SQL support. Major releases include 0.11, 0.12, and 0.13.
- The 0.13 release focuses on performance improvements like Hive on Tez and vectorized processing to improve query performance by 100x, as well as security features like SQL standard authorization.
- Ongoing work is focused on further SQL support, ACID compliance, and optimizations to the optimizer.
Hive 0.14 adds ACID transactional support which allows for inserting, updating, and deleting rows in Hive tables. It uses a new transaction manager and lock manager to provide snapshot isolation across DML statements. Data is stored in HDFS in a layout of base files and transactional delta files which are compacted periodically. This allows Hive to support use cases beyond batch loads such as streaming data ingest and updating dimension tables.
The document provides an overview of machine learning concepts and techniques using Apache Spark. It discusses supervised and unsupervised learning methods like classification, regression, clustering and collaborative filtering. Specific algorithms like k-means clustering, decision trees and random forests are explained. It also introduces Apache Spark MLlib and how to build machine learning pipelines and models with Spark ML APIs.
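To illustrate the pipeline concept mentioned here, a minimal PySpark sketch that assembles numeric columns into a feature vector and fits a classifier; the toy data and column names are hypothetical.

```python
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

# Toy training data: two numeric features and a binary label.
train = spark.createDataFrame(
    [(0.0, 1.1, 0), (2.0, 1.0, 1), (2.1, 1.3, 1), (0.1, 1.2, 0)],
    ["f1", "f2", "label"],
)

# Stage 1 builds the feature vector, stage 2 fits the model.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("f1", "f2", "prediction").show()
```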
This document discusses best practices for using PySpark. It covers:
- Core concepts of PySpark including RDDs and the execution model. Functions are serialized and sent to worker nodes using pickle.
- Recommended project structure with modules for data I/O, feature engineering, and modeling.
- Writing testable, serializable code with static methods and avoiding non-serializable objects like database connections (see the sketch after this list).
- Tips for testing like unit testing functions and integration testing the full workflow.
- Best practices for running jobs like configuring the Python environment, managing dependencies, and logging to debug issues.
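Here is a hedged sketch of the testable-code recommendation from the list above: keep transformations in pure functions of DataFrames so a unit test can exercise them with a local session; the function and column names are hypothetical.

```python
from pyspark.sql import DataFrame, SparkSession
from pyspark.sql import functions as F

def add_session_length(df: DataFrame) -> DataFrame:
    """Pure transformation: no hidden state, nothing non-serializable captured."""
    return df.withColumn("session_length", F.col("end_ts") - F.col("start_ts"))

if __name__ == "__main__":
    # Unit-test style check with a local session and tiny in-memory data.
    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.createDataFrame([(100, 160)], ["start_ts", "end_ts"])
    assert add_session_length(df).first()["session_length"] == 60
```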
Hive Training -- Motivations and Real World Use Cases (nzhang)
Hive is an open source data warehouse system based on Hadoop, a MapReduce implementation.
This presentation introduces the motivations for developing Hive and how Hive is used in real-world situations, particularly at Facebook.
In this one day workshop, we will introduce Spark at a high level context. Spark is fundamentally different than writing MapReduce jobs so no prior Hadoop experience is needed. You will learn how to interact with Spark on the command line and conduct rapid in-memory data analyses. We will then work on writing Spark applications to perform large cluster-based analyses including SQL-like aggregations, machine learning applications, and graph algorithms. The course will be conducted in Python using PySpark.
Python and Bigdata - An Introduction to Spark (PySpark) (hiteshnd)
This document provides an introduction to Spark and PySpark for processing big data. It discusses what Spark is, how it differs from MapReduce by using in-memory caching for iterative queries. Spark operations on Resilient Distributed Datasets (RDDs) include transformations like map, filter, and actions that trigger computation. Spark can be used for streaming, machine learning using MLlib, and processing large datasets faster than MapReduce. The document provides examples of using PySpark on network logs and detecting good vs bad tweets in real-time.
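In the spirit of the network-log example, a tiny PySpark sketch showing lazy transformations and an action that triggers computation; the log path and line layout are hypothetical.

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "log-sketch")

# Transformations are lazy: nothing executes until an action runs.
lines = sc.textFile("access.log")                   # placeholder path
errors = lines.filter(lambda line: "ERROR" in line)
fields = errors.map(lambda line: line.split()[-1])  # hypothetical line layout

print(errors.count())  # action: triggers the whole computation
print(fields.take(5))  # another action: first five extracted fields
```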
Achieving Real-time Ingestion and Analysis of Security Events through Kafka a... (Kevin Mao)
Strata Hadoop World 2017 San Jose
Today’s enterprise architectures are often composed of a myriad of heterogeneous devices. Bring-your-own-device policies, vendor diversification, and the transition to the cloud all contribute to a sprawling infrastructure, the complexity and scale of which can only be addressed by using modern distributed data processing systems.
Kevin Mao outlines the system that Capital One has built to collect, clean, and analyze the security-related events occurring within its digital infrastructure. Raw data from each component is collected and preprocessed using Apache NiFi flows. This raw data is then written into an Apache Kafka cluster, which serves as the primary communications backbone of the platform. The raw data is parsed, cleaned, and enriched in real time via Apache Metron and Apache Storm and ingested into ElasticSearch, allowing operations teams to detect and monitor events as they occur. The refined data is also transformed into the Apache ORC data format and stored in Amazon S3, allowing data scientists to perform long-term, batch-based analysis.
Kevin discusses the challenges involved with architecting and implementing this system, such as data quality, performance tuning, and the impact of additional financial regulations relating to data governance, and shares the results of these efforts and the value that the data platform brings to Capital One.
Architecting a Next Generation Data Platform (hadooparchbook)
This document discusses a presentation on architecting Hadoop application architectures for a next-generation data platform. It provides an overview of the presentation topics, including a case study on using Hadoop for an Internet of Things and entity 360 application. It introduces the key components of the proposed high-level architecture, including ingesting streaming and batch data using Kafka and Flume, stream processing with Kafka Streams, and storage in Hadoop.
RISELab: Enabling Intelligent Real-Time Decisions, keynote by Ion Stoica (Spark Summit)
A long-standing grand challenge in computing is to enable machines to act autonomously and intelligently: to rapidly and repeatedly take appropriate actions based on information in the world around them. To address this challenge, at UC Berkeley we are starting a new five year effort that focuses on the development of data-intensive systems that provide Real-Time Intelligence with Secure Execution (RISE). Following in the footsteps of AMPLab, RISELab is an interdisciplinary effort bringing together researchers across AI, robotics, security, and data systems. In this talk I’ll present our research vision and then discuss some of the applications that will be enabled by RISE technologies.
New Directions in pySpark for Time Series Analysis: Spark Summit East talk by... (Spark Summit)
The document provides an overview of using PySpark for time series analysis. It discusses that time series data can come from sources like IOT feeds, sensor data, and economic indicators. Time series analysis in PySpark allows for windowed aggregations and temporal joins on massive time series datasets that can be both wide and narrow. While basic analytics are possible in PySpark, libraries like Flint provide additional functions specialized for time series analysis on large datasets in a distributed environment. The document encourages attendees to speak with the author after the talk to see a time series analysis library in PySpark demonstrated.
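To make the windowed-aggregation point concrete, here is a hedged PySpark sketch computing a rolling average per series with built-in window functions (not the Flint library the talk refers to); the data and column names are hypothetical.

```python
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
ts = spark.createDataFrame(
    [("s1", 1, 10.0), ("s1", 2, 12.0), ("s1", 3, 11.0), ("s2", 1, 5.0)],
    ["series", "t", "value"],
)

# Rolling mean over the current row and the two preceding rows, per series.
w = Window.partitionBy("series").orderBy("t").rowsBetween(-2, 0)
ts.withColumn("rolling_avg", F.avg("value").over(w)).show()
```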
This document summarizes Hortonworks' Data Cloud, which allows users to launch and manage Hadoop clusters on cloud platforms like AWS for different workloads. It discusses the architecture, which uses services like Cloudbreak to deploy HDP clusters and stores data in scalable storage like S3 and metadata in databases. It also covers improving enterprise capabilities around storage, governance, reliability, and fault tolerance when running Hadoop on cloud infrastructure.
Using Apache Hadoop and related technologies as a data warehouse has been an area of interest since the early days of Hadoop. In recent years Hive has made great strides towards enabling data warehousing by expanding its SQL coverage, adding transactions, and enabling sub-second queries with LLAP. But data warehousing requires more than a full powered SQL engine. Security, governance, data movement, workload management, monitoring, and user tools are required as well. These functions are being addressed by other Apache projects such as Ranger, Atlas, Falcon, Ambari, and Zeppelin. This talk will examine how these projects can be assembled to build a data warehousing solution. It will also discuss features and performance work going on in Hive and the other projects that will enable more data warehousing use cases. These include use cases like data ingestion using merge, support for OLAP cubing queries via Hive’s integration with Druid, expanded SQL coverage, replication of data between data warehouses, advanced access control options, data discovery, and user tools to manage, monitor, and query the warehouse.
Achieving Mega-Scale Business Intelligence Through Speed of Thought Analytics... (VMware Tanzu)
SpringOne Platform 2016
Speaker: Ian Fyfe; Director, Product Marketing, Hortonworks
Apache Hadoop is the most powerful and popular platform for ingesting, storing and processing enormous amounts of “big data”. However, due to its original roots as a batch processing system, doing interactive business analytics with Hadoop has historically suffered from slow response times, or forced business analysts to extract data summaries out of Hadoop into separate data marts. This talk will discuss the different options for implementing speed-of-thought business analytics and machine learning tools directly on top of Hadoop including Apache Hive on Tez, Apache Hive on LLAP, Apache HAWQ and Apache MADlib.
Hadoop & cloud storage: object store integration in production (Chris Nauroth)
Today's typical Apache Hadoop deployments use HDFS for persistent, fault-tolerant storage of big data files. However, recent emerging architectural patterns increasingly rely on cloud object storage such as S3, Azure Blob Store, GCS, which are designed for cost-efficiency, scalability and geographic distribution. Hadoop supports pluggable file system implementations to enable integration with these systems for use cases such as off-site backup or even complex multi-step ETL, but applications may encounter unique challenges related to eventual consistency, performance and differences in semantics compared to HDFS. This session explores those challenges and presents recent work to address them in a comprehensive effort spanning multiple Hadoop ecosystem components, including the Object Store FileSystem connector, Hive, Tez and ORC. Our goal is to improve correctness, performance, security and operations for users that choose to integrate Hadoop with Cloud Storage. We use S3 and S3A connector as case study.
Future of Data New Jersey - HDF 3.0 Deep Dive (Aldrin Piri)
This document provides an overview and agenda for an HDF 3.0 Deep Dive presentation. It discusses new features in HDF 3.0 like record-based processing using a record reader/writer and QueryRecord processor. It also covers the latest efforts in the Apache NiFi community like component versioning and introducing a registry to enable capabilities like CI/CD, flow migration, and auditing of flows. The presentation demonstrates record processing in NiFi and concludes by discussing the evolution of Apache NiFi and its ecosystem.
This document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows Hadoop applications to work with both HDFS and cloud storage transparently. Recent enhancements to the S3A file system connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries with S3A compared to earlier versions. Upcoming work on output committers, object store abstraction, and consistency are outlined.
The document discusses Hadoop integration with cloud storage. It describes the Hadoop-compatible file system architecture, which allows applications to work with different storage systems transparently. Recent enhancements to the S3A connector for Amazon S3 are discussed, including performance improvements and support for encryption. Benchmark results show significant performance gains for Hive queries running on S3A compared to earlier versions. Upcoming work on consistency, output committers, and abstraction layers is outlined to further improve object store integration.
The document discusses key considerations for running Hadoop in the cloud. It notes that running Hadoop in the cloud provides unlimited elastic scale, ephemeral and long-running workloads, no upfront hardware costs, and IT and business agility. It outlines some of the major cloud Hadoop solutions according to a Forrester Wave report and discusses architectural considerations like shared data and storage, on-demand ephemeral workloads, elastic resource management, and shared metadata, security, and governance.
Hortonworks and Platfora in Financial Services - Webinar (Hortonworks)
Big Data Analytics is transforming how banks and financial institutions unlock insights, make more meaningful decisions, and manage risk. Join this webinar to see how you can gain a clear understanding of the customer journey by leveraging Platfora to interactively analyze the mass of raw data that is stored in your Hortonworks Data Platform. Our experts will highlight use cases, including customer analytics and security analytics.
Speakers: Mark Lochbihler, Partner Solutions Engineer at Hortonworks, and Bob Welshmer, Technical Director at Platfora
Hortonworks Data Platform 2.2 includes Apache HBase for fast NoSQL data access. In this 30-minute webinar, we discussed HBase innovations that are included in HDP 2.2, including: support for Apache Slider; Apache HBase high availability (HA); block cache compression; and wire-level encryption.
SQL on Hadoop: Batch, Interactive and Beyond.
Public Presentation showing history and where Hortonworks is looking to go with 100% Open Source Technology.
Apache Hive, Apache SparkSQL, Apache Phoenix, and Apache Druid
This document provides an overview of the past, present, and future of Apache Hadoop YARN. It discusses how YARN has evolved from Apache Hadoop 2.6/2.7 to now support 2.8 with features like dynamic resource configuration, container resizing, and Docker support. Upcoming work includes support for arbitrary resource types, federation of multiple YARN clusters, and a new ResourceManager UI. The future of YARN scheduling may include distributed scheduling, intra-queue preemption, and scheduling based on actual resource usage.
Hadoop Present - Open Enterprise Hadoop (Yifeng Jiang)
The document is a presentation on enterprise Hadoop given by Yifeng Jiang, a Solutions Engineer at Hortonworks. The presentation covers updates to Hadoop Core including HDFS and YARN, data access technologies like Hive, Spark and stream processing, security features in Hadoop, and Hadoop management with Apache Ambari.
Dancing Elephants - Efficiently Working with Object Stores from Apache Spark ... (DataWorks Summit)
This document discusses challenges and solutions for using object storage with Apache Spark and Hive. It covers:
- Eventual consistency issues in object storage and lack of atomic operations
- Improving performance of object storage connectors through caching, optimized metadata operations, and consistency guarantees
- Techniques like S3Guard and committers that address consistency and correctness problems with output commits in object storage (a configuration sketch follows this list)
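As a configuration sketch of the committer technique above, the following PySpark snippet routes output commits through an S3A committer rather than the rename-based default; it assumes the spark-hadoop-cloud module is on the classpath, and the bucket name is a placeholder.

```python
from pyspark.sql import SparkSession

# Use the S3A "directory" staging committer to avoid slow, non-atomic
# rename-based commits against object storage.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)

spark.range(1000).write.mode("overwrite").parquet("s3a://my-bucket/out/")
```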
This document summarizes Hortonworks' Hadoop distribution called Hortonworks Data Platform (HDP). It discusses how HDP provides a comprehensive data management platform built around Apache Hadoop and YARN. HDP includes tools for storage, processing, security, operations and accessing data through batch, interactive and real-time methods. The document also outlines new capabilities in HDP 2.2 like improved engines for SQL, Spark and streaming and expanded deployment options.
This document provides an overview of Hadoop and its ecosystem. It discusses the evolution of Hadoop from version 1 which focused on batch processing using MapReduce, to version 2 which introduced YARN for distributed resource management and supported additional data processing engines beyond MapReduce. It also describes key Hadoop services like HDFS for distributed storage and the benefits of a Hadoop data platform for unlocking the value of large datasets.
The document discusses how Hadoop can be used for interactive and real-time data analysis. It notes that the amount of digital data is growing exponentially and will reach 40 zettabytes by 2020. Traditional data systems are struggling to manage this new data. Hadoop provides a solution by tying together inexpensive servers to act as one large computer for processing big data using various Apache projects for data access, governance, security and operations. Examples show how Hadoop can be used to analyze real-time streaming data from sensors on trucks to monitor routes, vehicles and drivers.
Data Governance in Apache Falcon - Hadoop Summit Brussels 2015 (Seetharam Venkatesh)
Apache Falcon is a data management platform that allows users to centrally manage data lifecycles across Hadoop clusters. It defines data entities like clusters, feeds, and processes to represent data pipelines. Falcon then automatically generates workflows to orchestrate the movement of data according to defined policies for replication, retention, and late data handling. It also provides data governance features like lineage tracing, auditing, and tagging. The latest version of Falcon includes new capabilities for disaster recovery mirroring and replication to cloud storage services.
Speaker: Alan Gates, Hortonworks Co-Founder
Title: 10 Years of Apache Hadoop and Beyond
Duration: 40 minutes
Abstract:
In 2006, Apache Hadoop had its first line of code committed to what has become a breakthrough technology. A decade later, we are witness to open source innovation that has literally changed the face of business. Hadoop and related technologies have become the enterprise data platform, fueled by a rich ecosystem capable of supporting any application, any data, anywhere. Join Hortonworks Co-Founder Alan Gates as he drills down into the current and future state of Hadoop and reviews community initiatives aimed at enabling the next wave of modern data applications that are well governed and easy to deploy on-premises and in the cloud.
Our Hadoop journey began in 2006 focused on executing batch MapReduce jobs on petabytes of data.
Yahoo’s decision to contribute Hadoop to the Apache Software Foundation was critical because a vibrant set of related technologies began to appear around Hadoop.
[NEXT]
Fast forward to 2011 and the concept of YARN began to emerge.
Its goal? Enable Hadoop to move from its batch-only roots and become a data platform capable of running batch, interactive, and real-time applications.
YARN further accelerated the innovation around Hadoop, with Spark, Kafka, Storm, and many other projects starting life as Apache Incubator proposals.
[NEXT]
I want to focus for a minute on one area of how Hadoop has developed. Apache Hive has participated in that move from batch to interactive, from ETL-only to enterprise-ready EDW.
Spark, meanwhile, has become the Swiss Army knife of big data: it can do streaming, SQL, ETL, and ML,
and is available from multiple languages (Python, Java, Scala).
So the enterprise has invested in integrating Hadoop into its data lake architecture.
Landing petabyte of data from streams, pipelines, data feeds into HDFS files, Hive and HBase tables, etc.
The question arises of how we can set up policies for these data sets that enable us to secure and govern access to them.
[NEXT]
The community has been hard at work on integrating Apache Atlas as a metadata catalog and Apache Ranger as the centralized security system to address this need.
The result is a tag-based authorization model driven by the metadata catalog (i.e. Atlas) with access and audit policies applied to those tags (via Ranger).
This enables a more flexible way to govern access to data and data sets than traditional role/group based access policies.
For example, as data pipelines land data, they can tag it on arrival, and the access policies set up for those tags apply immediately.
Moreover, Ranger has added the notions of time-based and location-based access policies, so users can do things like limit access to data that’s older than 90 days (for example) or limit access to data from certain geographies.
This provides important enterprise-focused capabilities that will help businesses deploy more modern data applications in a way where they have the confidence their data is secure and well-governed.
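As a rough illustration, here is a minimal sketch of creating such a tag-based policy programmatically. The Ranger host, credentials, service name, tag, and user are all hypothetical, and the endpoint path and JSON shape follow my understanding of Ranger's public v2 REST API; treat this as something to verify against your Ranger version's documentation, not a definitive recipe.

```python
# Hypothetical sketch: create a Ranger tag-based policy over REST.
# Host, credentials, service name, tag, and user are made up; the
# payload shape approximates Ranger's public v2 API and should be
# checked against your Ranger version.
import requests

RANGER_URL = "http://ranger.example.com:6080"   # hypothetical host
ADMIN_AUTH = ("admin", "admin-password")        # hypothetical credentials

policy = {
    "service": "cl1_tag",                        # a tag-based service (assumption)
    "name": "pii-select-analysts",
    "resources": {"tag": {"values": ["PII"], "isExcludes": False}},
    "policyItems": [{
        "users": ["analyst1"],
        # access type naming per tag-service conventions; verify
        "accesses": [{"type": "hive:select", "isAllowed": True}],
    }],
}

resp = requests.post(
    f"{RANGER_URL}/service/public/v2/api/policy",
    json=policy,
    auth=ADMIN_AUTH,
)
resp.raise_for_status()
print("Created policy id:", resp.json().get("id"))
```

With a policy like this in place, any data that a pipeline tags as PII is immediately covered, no matter which table or path it lands in.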
[NEXT]
TALK TRACK
People are no longer willing to wait until data is in the store before processing it
Hortonworks DataFlow is a platform for data in motion.
It is powered by Apache NiFi, Kafka, and Storm for dataflow management and stream processing.
MiNiFi/NiFi: create dynamic, configurable data pipelines.
Kafka supports adaptation to differing rates of data creation and delivery.
Storm provides real-time stream processing to create immediate insights at a massive scale.
There are scenarios where NiFi will provide all that you need – especially in situations that only require dataflow management – but you will notice the orange and blue horizontal triangles provide a continuum of capability from edge to core, indicating varying degrees of need for the different products.
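As a small illustration of the Kafka leg of such a pipeline, the sketch below publishes one sensor event using the kafka-python client. The broker address, topic name, and event fields are hypothetical; in an HDF deployment, NiFi would typically be the publisher and Storm the downstream consumer.

```python
# Minimal sketch: publish a sensor event to Kafka with kafka-python.
# Broker, topic, and payload fields are hypothetical.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="broker.example.com:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Emit a reading; Kafka buffers it, absorbing bursts so the stream
# processor can consume at its own rate.
producer.send("truck-sensor-events", {"truck_id": 42, "speed_mph": 61})
producer.flush()
```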
So after 10 years, the Hadoop ecosystem is available everywhere.
In the Data Center, within appliances, across public and private clouds.
This maximizes choice for people interested in getting started with Hadoop and deploying it at scale for transformational use cases.
[NEXT]
While there are a range of great choices in the market today, there’s more that we, as a community, can and should do to make Hadoop in the cloud better and first class.
I’ll spend the remainder of this talk on the key architectural considerations
Shared Data & Storage – the shared-data-lake is on cloud storage, it is not HDFS.
Also memory and local storage play a different role – that of caching
An important distinction in the cloud is On-Demand Ephemeral Workloads – this changes a number of things in fundamental ways.
Shared Metadata, Security, and Governance remain important but need to be adjusted in the face of ephemeral clusters.
And finally, I’ll touch on Elastic Resource Management
We need to shift our thinking away from cluster resource management and more towards SLA-driven workloads
[NEXT]
In the cloud, the shared DataLake is on cloud storage. It is not HDFS of a specific Hadoop cluster.
Note this is very different from a traditional on-premise cluster where each cluster has an internal shared store representing its internal DataLake.
Moreover, it’s desirable to have this shared data be accessible by all apps, not just Hadoop apps – Cloud Native and 3rd party
Good news: geo-distribution is built into cloud storage, and DR becomes simpler.
Cloud storage has two limitations:
eventual consistency, and an API that does not match the filesystem API expected by Hadoop and normal apps.
Addressing these two issues is a key area of ongoing investment. I encourage you to attend today’s breakout session by my fellow Hortonworkers that’s focused on this topic.
Cloud storage is designed for low cost and scale – unfortunately, performance is not its strong point because storage is segregated from compute.
Memory and local storage play a different role in the cloud – a cache to enhance performance.
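As a concrete, if minimal, illustration of reading shared data directly from cloud storage, here is a PySpark sketch. It assumes a Spark build with the hadoop-aws (s3a) connector on the classpath; the bucket, path, and credential values are hypothetical, while the fs.s3a.* properties are standard Hadoop s3a settings.

```python
# Minimal sketch: read shared tabular data from an s3a-backed data
# lake rather than cluster-local HDFS. Bucket/credentials are fake.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("cloud-data-lake-read")
    .config("spark.hadoop.fs.s3a.access.key", "YOUR_ACCESS_KEY")
    .config("spark.hadoop.fs.s3a.secret.key", "YOUR_SECRET_KEY")
    .getOrCreate()
)

# Any cluster (or non-Hadoop app with an s3 client) can read the same
# data, which is the point of putting the shared lake in cloud storage.
df = spark.read.parquet("s3a://shared-data-lake/sales/2016/")
df.groupBy("region").count().show()
```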
[NEXT]
With respect to caching, we need to consider both tabular data and non-tabular data.
For tabular data, LLAP comes to the rescue – it provides a tabular cache that's shared across jobs, apps, and engines such as Hive and Spark.
LLAP only caches the needed columns, so it's very efficient in its use of memory.
Further, data is stored in an internal serialized form to optimize compute.
The design center is anti-caching – keep it all in memory and spill to disk/SSD when memory is full.
LLAP currently provides read caching, but is being extended to support a write-through cache.
And LLAP addresses a key security gap for the Hadoop ecosystem: it provides a convenient place to enforce column-level and row-level access control
that works across all kinds of engines: Hive, Spark, Flink, or even old-fashioned MapReduce. Note this was not previously possible…
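As a sketch of what steering a session onto LLAP can look like from a client, the snippet below connects to a hypothetical HiveServer2 endpoint with PyHive and sets the Hive 2.x properties that route execution to the LLAP daemons. Exact property names and defaults vary by version and distribution, so verify against your cluster; the host and table are made up.

```python
# Sketch: route a Hive session's work to LLAP daemons via PyHive.
# Host, user, and table are hypothetical; the SET properties are the
# Hive 2.x knobs for LLAP execution, but check your distro's defaults.
from pyhive import hive

conn = hive.connect(host="hiveserver2.example.com", port=10000,
                    username="analyst1")
cur = conn.cursor()

# Send query fragments to the long-lived LLAP daemons so repeated
# scans are served from the shared in-memory columnar cache.
cur.execute("SET hive.execution.mode=llap")
cur.execute("SET hive.llap.execution.mode=all")

cur.execute("SELECT region, COUNT(*) FROM sales GROUP BY region")
print(cur.fetchall())
```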
From a non-tabular data perspective, HDFS can be used to cache cloud data – both a read cache and a write-through cache.
This essentially evolves HDFS to play a different role: a place to store intermediate data,
and also a finely-tuned caching layer between the applications and the cloud storage.
[NEXT]
Always-on multitenant clusters are important for a range of mission critical use cases.
However, bringing forth an ephemeral cluster to support a specific workload is game changing.
The agile nature of the cloud allows us to create prescriptive workload environments.
For someone interested in modeling and analyzing data sets,
- they simply want to interact with a PRE-TUNED environment optimized for the application.
- The complexities of configuring Spark, Hive and Hadoop need to be hidden under the hood.
Whether it’s data science, data warehouse, ETL, or other common workload types,
- provide pre-configured and pre-tuned compute environments
- Further, we need to be able to manage them in an ephemeral fashion.
The NET: deliver user experiences that are focused on business agility,
- rather than infinite configurability and cluster management.
[NEXT]
So far, I have covered shared data and storage, and how to optimize performance by caching.
Shared data fundamentally requires a shared approach to metadata, security and governance.
The metadata is not just the classic Hive metadata that describes the tabular data;
- it is also about storing and tracking the lineage and provenance of data,
- and about details related to data pipeline processing and job management.
Tabular data needs to be available to all applications so that SQL is an option regardless of where your data is
Also, as data is ingested and processed, metadata needs to be created and adjusted
Governing and securing the data remain critical, and its metadata needs to be managed across all workloads.
- The work done by projects such as Ranger and Atlas needs to be evolved to fit the cloud environment.
If we don't do this, then the cloud will not be adopted aggressively for enterprise use.
Getting back to the shared metadata – each ephemeral cluster cannot have its own private copy of the metadata.
In the cloud world, metadata must be centrally stored so it is used across all ephemeral clusters.
[NEXT]
Final area: resource management.
We need to up-level resource management.
- So far, YARN has focused on optimizing resources in the context of a cluster.
- The cloud is not about the cluster; it is about the workloads. And further, resources are elastic.
The scheduler needs to change its focus to managing resources in the context of a workload and meeting the workload's SLA.
- It may need to get extra resources from the cloud – getting the right resources to match the needs of the workload.
Sometimes adding compute power is not sufficient to meet an SLA, because latency/bandwidth to data may be the bottleneck
– e.g., spin up LLAP's memory in order to improve caching and hence meet the SLA.
Cloud offers another dimension – that of cost and budgets.
There are different costs tied to CPU, memory, and data-access bandwidth, so elasticity and spot-pricing tradeoffs should be factored in.
Resource management in this new dimension is important if you want to reap the benefits of low-cost cloud computing.
To summarize: the better one understands the nature of a workload, the better one can take advantage of elasticity and spot pricing.
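A back-of-the-envelope sketch makes the tradeoff concrete. All prices here are hypothetical, and the interruption overhead is a crude assumption; the point is only that knowing a workload's shape lets you put a price on elasticity.

```python
# Back-of-the-envelope spot-vs-on-demand comparison for one workload.
# All prices and the interruption-retry factor are hypothetical.
ON_DEMAND_PER_NODE_HR = 0.50   # hypothetical $/node-hour
SPOT_PER_NODE_HR = 0.15        # hypothetical $/node-hour
SPOT_INTERRUPT_OVERHEAD = 1.3  # assume ~30% rerun cost from interruptions

def workload_cost(nodes: int, hours: float, use_spot: bool) -> float:
    """Estimated cost of one ephemeral-cluster run of a workload."""
    if use_spot:
        return nodes * hours * SPOT_PER_NODE_HR * SPOT_INTERRUPT_OVERHEAD
    return nodes * hours * ON_DEMAND_PER_NODE_HR

# A nightly ETL job: 20 nodes for 3 hours. Spot wins easily here
# because the job is restartable; a tight-SLA interactive workload
# might not tolerate the interruption overhead.
print(f"on-demand: ${workload_cost(20, 3, use_spot=False):.2f}")
print(f"spot:      ${workload_cost(20, 3, use_spot=True):.2f}")
```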
CONCLUDE: While one could lift and shift Hadoop onto the cloud, I hope I have convinced you that we really need to evolve Hadoop to run first class in the cloud
and also to take advantage of unique cloud features such as elasticity.
We at Hortonworks have been working on this over the past few months, and Ram will show you a quick demo of the tech preview we are releasing this week.
[NEXT]
Today we have talked about evolving Hadoop to run well in the cloud.
At Hortonworks, we are focused on enabling a connected data architecture that seamlessly spans the cloud and data center.
This is illustrated on the screen.
It stresses two important points: the connectedness of the cloud and the on-premise infrastructure and data,
and the connectedness of data in motion and data at rest.
The Era of the Internet-of-Things demands that we manage the entire lifecycle of all data
- (data in motion and data at rest)
It’s about being able to collect and curate data across traditional silos so the various groups and lines of business can have a place where they can assemble a single view of data in order to drive deep historical insights.
It’s also about proactively managing data from its point of inception and securely acquiring and delivering it. Moreover, it’s not just about point-to-point delivery, but it’s also about enabling bi-directional data flows that can leverage both real-time and historical insights to help shape and prioritize the flow of data.
So in this diagram, for example, the upper-left edge could represent the connected car, whereas the lower-left edge can represent data from the manufacturing line. Having a connected data architecture that enables you to deal with all of this data unlocks the ability to figure out what manufacturing line issues may be causing operational issues in cars in the field, for example.
In this world of next-generation applications, I am excited about evolving the Hadoop ecosystem to enable these types of use cases and usage models.
[NEXT SLIDE]