Data Lifecycle in Hadoop

© Hortonworks, Inc. 2011-2018. All rights reserved. | Hortonworks confidential and proprietary information.
The Implacable Advance of Data - Data Lifecycle in Hadoop
Niru Anisetti, Principal Product Manager

Hortonworks Legal Disclaimer
This document may contain product/services features and technology directions that are under development, may be under development in the future
or may ultimately not be developed. Project capabilities are based on information that is publicly available within the Apache Software Foundation
project websites ("Apache"). Progress of the project capabilities can be tracked from inception to release through Apache; however, technical feasibility,
market demand, user feedback and the overarching Apache Software Foundation community development process can all affect timing and final delivery.
This document’s description of these features and technology directions does not represent a contractual commitment, promise or obligation from
Hortonworks to deliver these features in any generally available product. Product & Service features and technology directions are subject to change, and
must not be included in contracts, purchase orders, or sales agreements of any kind. Since this document contains an outline of general product
development plans, customers should not rely upon it when making a purchase decision. The security of your IT system is very important to us, but no
single product or service can make your IT systems completely secure or prevent all improper disclosures or access. Any effective security program
requires a layered approach, which will require various systems, products, and services, as well as operational policies and procedures. HORTONWORKS
DOES NOT WARRANT THAT ANY SERVICES, SYSTEMS OR PRODUCTS PREVENT ACCIDENTAL, ILLEGAL OR MALICIOUS CONDUCT OF ANY PARTY.
❑ This presentation may contain products, services, features, or technology directions that are under development, may be under development in
the future, or may ultimately not be under development or developed. The description herein of such products, services, features or technology
directions does not represent a contractual commitment, promise or obligation by Hortonworks to deliver them in any generally available product.
❑ Apache project timelines are based in part on information that is publicly available within the Apache Software Foundation (“Apache”) project
websites. Progress of Apache projects can be tracked through Apache announcements on these websites; however, technical feasibility, market
demand, user feedback, the overarching Apache community development process, and other factors can all affect timing and delivery of Apache
projects.

Big Data Trend
90% of the data in the world has been created in the last two years alone
Digital content is doubling every 18 months
Structured Data
- Database
- Data Warehouse
- ERPs
- CRMs
Unstructured Data
- Web blogs
- Social media
- Audio, Video
- Software file-systems
Source: Frost & Sullivan - World’s Top Global Mega Trends To 2025 and Implications to Business, Society and Cultures

The Hortonworks Opportunity
At the core of ~$1.9T in market opportunity over the next 5 years
- Cloud: ~$410 B
- Streaming: ~$1.65 B
- Data Science: ~$180 B
- Big Data: ~$210 B
- IoT: ~$1.1 T
Sources:
- IDC Worldwide Big Data and Analytics Software Forecast, 2017-2021, Forecasts Continuous/Streaming Analytics revenue to be $1.65B by 2021, July 2017
- Allied Market Research, Data Science Platform Market by Type and End User: Global Opportunity and Forecast, 2017-2023, Data Science Platform market size to reach $183.7B by 2023
- IDC Worldwide Semiannual Big Data and Analytics Spending Guide Update, Forecasts Big Data & Business Analytics revenues to be $210B by 2020, Press Release March 2017
- Gartner Worldwide Public Cloud Services Revenue, Forecasts Public Cloud Services Revenue to be $411.4B by 2020, Press Release October 2017
- IDC Worldwide Semiannual IoT Spending Guide Update, Forecasts Worldwide IoT Spending to be ~$1.1T by 2021, Press Release December 2017

DataPlane Service: Manage, Govern & Secure
DPS EXTENSIBLE SERVICES
- Data Lifecycle Manager (Oct, 2017)
- Data Steward Studio (Q2, 2018)
DPS PLATFORM
- Native Capabilities: Clusters & Data Sources, Shared Services
- Core Services: Extensibility, Metering, Telemetry
Data at Rest, Data in Motion

“Data Lifecycle Manager” (DLM) Service
⬢ Is a DPS add-on service that safeguards enterprise data
⬢ Manages the data lifecycle:
– Data Replication/Failback for Disaster Recovery
– Auto Tiering to Optimize Storage Cost & Performance
– Backup & Recover Critical Business Data
– Offline replication for large datasets
⬢ Maintains common metadata and security policies across data sources and hybrid environments
Figure (Disaster Recovery): Production Site, Disaster Recovery Site; Offsite Replication, Failback
Figure (Backup & Restore): Sunday through the following Sunday; Full Backup, Cumulative incremental backup, Accidental Deletion
Figure (Auto Tiering): Access to Data on Solid State Drive, Hard Drive, Archive, S3; Probability of Reuse (100% to 0%) against Time (0 days, 30 days, 90 days, Forever)
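The Backup & Restore figure pairs a weekly full backup with cumulative incremental backups and an accidental deletion mid-week. As a minimal sketch of how a restore could be planned under that scheme (not DLM code; the `Backup` type, the `restore_chain` function, and the day numbering are all hypothetical), a cumulative incremental holds every change since the last full backup, so only two backups are needed:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Backup:
    day: int   # hypothetical day index: 0 = first Sunday, 7 = next Sunday
    kind: str  # "full" or "cumulative"


def restore_chain(backups: List[Backup], incident_day: int) -> List[Backup]:
    """Pick the backups needed to restore the state just before incident_day.

    Assumption: each cumulative incremental contains all changes since the
    most recent full backup, so the chain is that full backup plus at most
    one cumulative incremental.
    """
    usable = [b for b in backups if b.day < incident_day]
    fulls = [b for b in usable if b.kind == "full"]
    if not fulls:
        raise ValueError("no full backup precedes the incident")
    last_full = max(fulls, key=lambda b: b.day)
    later = [b for b in usable
             if b.kind == "cumulative" and b.day > last_full.day]
    if not later:
        return [last_full]
    return [last_full, max(later, key=lambda b: b.day)]


# Full backup on both Sundays, cumulative incrementals Monday to Saturday.
schedule = ([Backup(0, "full")]
            + [Backup(d, "cumulative") for d in range(1, 7)]
            + [Backup(7, "full")])

# Accidental deletion on Thursday (day 4): restore Sunday's full backup,
# then Wednesday's cumulative incremental.
chain = restore_chain(schedule, incident_day=4)
```

The trade-off in this scheme is that each cumulative incremental grows during the week, in exchange for a restore that never needs more than two backups.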

DLM Features
• Incremental Hive replication & Hive metadata
• HDFS snapshot based replication between HDP clusters
• Ranger policy replication to Target cluster
• TDE & TLS support
• Support multiple keys/KMS
• Cloud storage replication (AWS)
• Active/standby behavior on DR site using Ranger
Available Now
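To make the idea behind snapshot based incremental replication concrete, here is a toy in-memory model. It is a sketch under stated assumptions, not how DLM or HDFS implements it: a snapshot is modeled as a plain dictionary from path to checksum, and `diff_snapshots` and `apply_diff` are hypothetical helper names.

```python
from typing import Dict, NamedTuple, Set

# Assumption: a snapshot is modeled as {path: content checksum}.
Snapshot = Dict[str, str]


class SnapshotDiff(NamedTuple):
    created: Set[str]
    modified: Set[str]
    deleted: Set[str]


def diff_snapshots(old: Snapshot, new: Snapshot) -> SnapshotDiff:
    """Compare two snapshots of the same directory tree."""
    created = set(new) - set(old)
    deleted = set(old) - set(new)
    modified = {p for p in set(old) & set(new) if old[p] != new[p]}
    return SnapshotDiff(created, modified, deleted)


def apply_diff(target: Snapshot, source: Snapshot,
               diff: SnapshotDiff) -> Snapshot:
    """Bring the target up to date by touching only the changed paths."""
    result = dict(target)
    for path in diff.deleted:
        result.pop(path, None)
    for path in diff.created | diff.modified:
        result[path] = source[path]
    return result


s1 = {"/data/a": "c1", "/data/b": "c2"}
s2 = {"/data/a": "c1", "/data/b": "c3", "/data/c": "c4"}

target = dict(s1)                       # target already holds snapshot s1
delta = diff_snapshots(s1, s2)          # only /data/b and /data/c changed
target = apply_diff(target, s2, delta)  # target now matches s2
```

The point of the sketch is that the unchanged file `/data/a` is never copied again; only the difference between two consecutive snapshots crosses the wire.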

Data Lifecycle Manager Architecture
Diagram components: Data Plane UI; HDP Distro REST; DLM Service; Plugin Manager; REST infrastructure; Job Manager; Alerts Manager; Configuration Manager; Security Infrastructure; Copy Services; HDFS; Hive; Ranger; Log Manager; DLM DB; Logs

DLM Deployment Model
⬢ DLM Deployment packages
– DLM App (Installed as part of DPS app)
– DLM Engine in HDP (Management Pack)
DLM Releases
⬢ DPS 1.0/DLM 1.0
– October, 2017
– HDP 2.6.3 (Cluster)
⬢ DPS 1.1/DLM 1.1
– May 2018
– HDP 2.6.5 (Cluster)
Diagram labels: DPS-DLM APP; On-Premise Data Center 1 (Cluster 1, Cluster 2, a DLM Engine on each); On-Premise Data Center 2 (Cluster 3, Cluster 4, DLM Engine); Public Cloud (Cluster 3, Operating Cluster)
Push based replication to Cloud
Pull based replication
DLM 1.1 User Flow
Cloud Replication and Encryption

DWS San Jose 2018 Summit DLM Demo scenarios
Diagram labels: Cluster-1, Source, Onprem, VPC, Cluster-3, IaaS/HDP, HMS, S3 Buckets
Demo Setup
• Data: NY Traffic Collision Data (partitioned by date/Boroughs)
• Size: ~2GB
• Interactive Application: Zeppelin & Shell
• Pre-setup: Bootstrap, Cloud credentials, and Ranger policies on Target
• DLM Policy schedule interval: 1 minute
Demo Scenarios
1. Onprem-Onprem HDFS: Snapshot based incremental replication
2. Onprem-Cloud HDFS: Replication of an HDFS folder to S3 bucket
Diagram labels: Onprem Hive replication setup; HDFS replication setup
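The demo policy runs on a 1-minute schedule. A minimal sketch of such a policy and its run times follows; the `ReplicationPolicy` class, its field names, the paths, the bucket name, and the start time are all hypothetical and do not reflect the DLM API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass(frozen=True)
class ReplicationPolicy:
    name: str
    source: str
    target: str
    interval: timedelta

    def next_runs(self, start: datetime, count: int) -> List[datetime]:
        """Times of the next `count` runs after `start`."""
        return [start + i * self.interval for i in range(1, count + 1)]


policy = ReplicationPolicy(
    name="hdfs-to-s3-demo",                     # hypothetical policy name
    source="hdfs://cluster-1/demo/collisions",  # hypothetical source path
    target="s3a://example-bucket/collisions",   # hypothetical bucket
    interval=timedelta(minutes=1),              # the demo's 1-minute interval
)

# Arbitrary start time, chosen only for illustration.
runs = policy.next_runs(datetime(2018, 1, 1, 0, 0), count=3)
```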

DLM Customer Use cases/Solutions
Pharmaceutical Industry
• Replicate 100+ TB data between on-prem and cloud storage locations
• Metadata along with security policy replication is critical
• GDPR compliance is required
• Tiering has to be supported to reduce overall TCO
Finance & Banking Industry
• Replicate PB+ data between various data centers
• Data has to be replicated along with metadata and security policies
• GDPR compliance is required
• Tiering has to be supported to reduce overall TCO
Employee services Industry
• Replicate corporate events data between Hybrid locations
• Build and fine-tune insights to prove ROI for each of the event-related algorithms

DLM 2.x: Tiering User Flow

What is Tiered Storage?
• Data with different characteristics is moved to various types of storage media to reduce total storage cost.
• Tiers are determined based on the performance and cost of the media.
• DLM enables customers to define tiers through the DLM interface.
• DLM tiering is achieved by intra-cluster and/or inter-cluster data movement.
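A tiering rule of this kind can be sketched as a lookup on data age. The example below is an illustration under stated assumptions, not a DLM policy: the 30-day and 90-day boundaries are borrowed from the Auto Tiering figure earlier in the deck, the tier names hot/warm/cold from the roadmap, and `choose_tier` is a hypothetical function.

```python
from datetime import date

# Assumption: upper age bound in days (exclusive) for each tier; anything
# older than the last bound falls through to "cold".
TIER_BOUNDS = [(30, "hot"), (90, "warm")]


def choose_tier(last_access: date, today: date) -> str:
    """Pick a storage tier from the number of days since last access."""
    age_days = (today - last_access).days
    if age_days < 0:
        raise ValueError("last_access is in the future")
    for bound, tier in TIER_BOUNDS:
        if age_days < bound:
            return tier
    return "cold"


today = date(2018, 6, 1)
recent = choose_tier(date(2018, 5, 22), today)   # 10 days old
older = choose_tier(date(2018, 4, 17), today)    # 45 days old
oldest = choose_tier(date(2017, 11, 13), today)  # 200 days old
```

In a deployment like the one the figure suggests, hot might map to solid state drives, warm to hard drives, and cold to archive storage, but that mapping is a guess and not stated in the slides.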

DLM 2018+ Roadmap

Timeline: 2H 2017, 1H 2018, 2H 2018*, 1H 2019*
DLM: 1.0, 1.1, 1.x, 2.0, 3.0
HDP: 2.6.3, 2.6.5, 2.6.x, 2.6.x/3.x, 3.x
Status: Released, Released, Planning, Planning-in-progress
PRODUCT THEMES
Replication
• On-prem/HDP Cloud replication
• Cloud Storage replication (S3)
• Encryption (TDE & TLS)
• Cloud Storage replication (ADLS & WASB)
• Atlas support (GDPR)
• GCS
• Hybrid/Multi-Cloud support (One-to-many)
• HBase
• Offline replication
• Kafka support
DR (Failback & Failover): N/A, N/A, N/A, Failback, Failover
Auto-Tiering: N/A, N/A, N/A, N/A, Policy based Tiering - hot/warm/cold data to reduce TCO
* Subject to change. Features are not committed.
Thank you
