Introduction to Hortonworks Data Cloud for AWS

© Hortonworks Inc. 2011 – 2016. All Rights Reserved
Hortonworks Data Cloud
Enterprise ready Hadoop on the cloud
蒋逸峰（しょういつほう／Yifeng Jiang）
Solutions Engineer, Hortonworks
@uprush
December 14, 2016

About Me
蒋逸峰 (しょういつほう / Yifeng Jiang)
• Solutions Engineer, Hortonworks
• Apache HBase book author
• I like hiking & running
• Twitter: @uprush

Hortonworks Data Platform (HDP)

What’s Missing?
• Ambari makes deploying HDP super easy, but...
– Getting to that point is not easy
– Cluster sizing
– HW purchase, setup in DC, network
– OS setup
• On average this takes three weeks, or even more

Introducing Hortonworks Data Cloud for AWS
• A new cloud product from Hortonworks
– Powered by Hortonworks Data Platform
• Offers Pay-As-You-Go (PAYG) pricing
• Delivered and sold via AWS Marketplace
• Handles most common big data use cases with Apache Hadoop, Spark, and Hive
– Choose from a set of prescriptive cluster types
• Focuses on ease of use and business agility
– Avoids infinite configurability and customization
• Optional Free Community Support **
** Enterprise Support option coming soon

DEMO

Architecture
[Diagram: deployment on Amazon Web Services. Cloud controller (aka Cloudbreak): Cloudbreak services, Cloudbreak DB, and connectors for AWS, GCE, Azure, and OpenStack. Cloudbreak Deployer. Access tools: Shell, REST API, Web UI. HDP Cluster: ETL / EDW, with a master group (Hive, Spark), Ambari, a slave group, and a blueprint. HDP Cluster: Analytics, with a master group (LLAP, Zeppelin), Ambari, a slave group, and a blueprint. Each cluster is labeled with S3AFileSystem.]

Hortonworks Data Cloud - Summary
• Launch and manage clusters by workload type
– ETL / EDW, Data science, Business analytics
• Use highly scalable, durable storage for data (S3) & metadata (RDS)
• Share data and metadata among multiple ephemeral clusters
• Scale up and down at the click of a button
• Secure clusters with IAM roles, security groups, etc.

Improving Enterprise Readiness

Enterprise Readiness
Improving enterprise readiness in the cloud
• Cloud storage
• Security and governance
• Reliability and fault tolerance

Matching Hadoop with the Cloud
Datacenter
• Data locality
• Consistent storage
• Single cluster administration
Cloud
• Scalable storage
• Customizability
• Cost effective compute
Matching the two
• Scalable storage with performance and consistency
• Customizability with ease of administration
• Cost effective compute with SLA policies

Cloud Storage access facts
[Diagram: Interaction models. Labels: Application, HDFS, Input, Output, tmp, Copy.]
• Cloud storage optimizes for scale
– S3 data is replicated for high-scale access and durability
• Data access is remote
– No data locality
– Costlier metadata operations (e.g. hadoop fs -mv is actually a copy and delete)
• Eventual consistency
– It takes time for the effect of a modification to propagate to all copies
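
A minimal sketch of why a move is costly on an object store, assuming a flat key namespace with no native rename. ToyObjectStore and rename_prefix are hypothetical names for illustration only; this is not the S3 API or the S3AFileSystem code.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (hypothetical)."""

    def __init__(self):
        self.objects = {}   # key -> bytes
        self.requests = 0   # number of simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def list(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))


def rename_prefix(store, src_prefix, dst_prefix):
    """Emulate a directory move: copy every object, then delete the source."""
    for key in store.list(src_prefix):
        new_key = dst_prefix + key[len(src_prefix):]
        store.copy(key, new_key)
        store.delete(key)


store = ToyObjectStore()
for i in range(3):
    store.put(f"tmp/part-{i}", b"data")

store.requests = 0                  # count only the calls made by the move
rename_prefix(store, "tmp/", "output/")
moved_requests = store.requests     # one list, plus a copy and a delete per object
print(moved_requests)
```

The request count grows with the number of objects under the prefix, whereas a rename on HDFS is a single metadata operation.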

Performance with Scalability
• General strategy: Optimize by workload types
• ETL workloads
– Typical pipeline: Bring in data => Transform => Repair partitions => Compute statistics
– Multiple metadata calls: Batched and issued in parallel for performance gains
• Distcp
– Optimized buffer management for transferring large files
– Randomize input to Distcp to avoid hot-spotting S3 nodes
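
The two ideas above, issuing metadata calls in parallel and randomizing the copy order, can be sketched roughly as follows. The function names, the partition names, and the body of fetch_partition_metadata are assumptions for illustration; the actual HDP and Distcp changes are not shown in this deck.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def fetch_partition_metadata(partition):
    """Hypothetical stand-in for one remote metadata call."""
    return {"partition": partition, "files": 1}


def fetch_all(partitions, max_workers=8):
    """Issue the metadata calls concurrently instead of one by one.

    pool.map returns results in the same order as the input.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_partition_metadata, partitions))


def shuffled_copy_list(paths, seed=None):
    """Return the paths in random order so that neighbouring keys
    are not all copied at the same time."""
    shuffled = list(paths)
    random.Random(seed).shuffle(shuffled)
    return shuffled


# Hypothetical partition names.
partitions = [f"dt=2016-12-{day:02d}" for day in range(1, 6)]
results = fetch_all(partitions)
order = shuffled_copy_list(partitions, seed=42)
print(len(results), order)
```

Threads suit this case because each call mostly waits on the network.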

Performance with Scalability
• Analytics workloads – ORC file related optimizations
– Support fast random access reads (both directions) by avoiding tearing down S3 HTTP connections
– Pass index information to compute tasks as part of split data to avoid recomputation
• Status: Available, but performance optimizations never stop :)
https://hortonworks.github.io/hdp-aws/s3-performance/index.html

Correctness with strong consistency
• A read that follows a write may not return correct results
– Issues for data pipelines, multi-stage jobs, etc.
• S3Guard project: Intermediate, consistent metadata store
• Write calls from S3AFileSystem update both S3 and the metadata store
• S3AFileSystem automatically tries to reconcile metadata between S3 and the metadata store on subsequent reads
– Inconsistencies are handled based on policy
• Status: In progress
https://issues.apache.org/jira/browse/HADOOP-13345
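
A toy sketch of the idea, assuming the simplest possible policy: on a listing, trust the metadata store for any key the object store does not show yet. LaggingStore and GuardedStore are hypothetical classes and do not reflect the S3Guard implementation tracked in the JIRA above.

```python
class LaggingStore:
    """Toy object store whose listing lags behind writes (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._visible = set()

    def put(self, key, data):
        self._data[key] = data          # stored, but not yet listed

    def settle(self):
        self._visible = set(self._data) # all writes become visible

    def list(self):
        return set(self._visible)


class GuardedStore:
    """Wraps the store and records every write in a consistent metadata set."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()

    def put(self, key, data):
        self.store.put(key, data)       # write to the object store
        self.metadata.add(key)          # and to the metadata store

    def list(self):
        # Assumed policy: the union of both views, so recent writes always show up.
        return sorted(self.store.list() | self.metadata)


raw = LaggingStore()
guarded = GuardedStore(raw)
guarded.put("out/part-0", b"x")

print(sorted(raw.list()))   # the lagging listing has not caught up yet
print(guarded.list())       # the guarded listing already includes the new key
```

A second job stage that lists through the guarded view sees the output of the first stage even before the object store listing catches up.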

Securing data access via IAM Roles
• Integration with cloud provider
• Provide an IAM role as instance profile for a cluster
• Attach policies for accessing S3 to the role
– E.g. Read-only access for BI cluster to specific buckets
• Status: Available
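
As an illustration of the read-only example above, the sketch below builds an IAM policy document that allows listing and reading specific buckets. The bucket name example-bi-data and the helper read_only_s3_policy are hypothetical; attaching the policy to the role used as the cluster's instance profile is a separate step not shown here.

```python
import json


def read_only_s3_policy(buckets):
    """Build an IAM policy document granting read-only access to the given buckets."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing applies to the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{b}" for b in buckets],
            },
            {
                # Reading applies to the objects inside the bucket.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{b}/*" for b in buckets],
            },
        ],
    }


policy = read_only_s3_policy(["example-bi-data"])  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Because no write or delete actions are granted, a cluster running under this role can query the data but cannot modify it.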

Data Security in Hadoop
Apache Ranger
• Fine grained, role-based access policies to data
– Table/column level ACL
• Audit access information
• Row level filtering
• Dynamic data masking
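
To make row level filtering and dynamic data masking concrete, here is a toy sketch of their effect on a query result. It is not the Apache Ranger API; apply_policy, the sample rows, and the masking string are assumptions for illustration.

```python
def apply_policy(rows, row_filter, masked_columns):
    """Drop rows the user may not see and mask sensitive columns in the rest."""
    result = []
    for row in rows:
        if not row_filter(row):
            continue  # row level filtering: the row is not returned at all
        result.append({
            col: ("****" if col in masked_columns else value)
            for col, value in row.items()
        })  # dynamic masking: the value is hidden at read time, stored data is unchanged
    return result


# Hypothetical sample data.
rows = [
    {"region": "JP", "customer": "Sato", "card": "1111-2222"},
    {"region": "US", "customer": "Smith", "card": "3333-4444"},
]

# Hypothetical policy: this user sees only region JP, with card numbers masked.
visible = apply_policy(rows, lambda r: r["region"] == "JP", {"card"})
print(visible)
```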

Data Governance in Hadoop
Apache Atlas
• Auto discover & index metadata
• Tag data
• Track data lineage

Data governance technical architecture – On Premise
[Diagram: On Premise HDP Cluster. Ranger Admin (Policy), Atlas Admin (Metadata), a governed HDP component (e.g. Hive) with a Ranger plugin and an Atlas plugin, LDAP / AD, and a Data Steward.]

Data Governance in the Cloud:
Ease of administration with flexibility
• No longer a single compute cluster generating / accessing data
• Data & metadata are still single and shared
• Evolve Atlas and Ranger to be data lake centric rather than cluster centric
– Shared long running Admin components
– Ephemeral plugins on compute clusters
• Status: Available as a Tech Preview
https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md

Shared Ranger / Atlas admin services
Available in Tech Preview in Hortonworks Data Cloud
[Diagram: an ETL-EDW cluster and a Data Analytics cluster, each running a governed HDP component (e.g. Hive) with a Ranger plugin and an Atlas plugin. Shared Enterprise Services: Ranger Admin (Policy), Atlas Admin (Metadata). Also shown: LDAP / AD, Cloud Controller, and a Data Steward.]
HDP Cloud Compute nodes on AWS
• Regular EC2 instances
• Can attach EBS volumes or ephemeral storage disks
• Grouped according to functionality / access requirements
• Opportunistic provisioning – spot instances (work in progress)
[Diagram: HDP Cluster with a Master Group, Group #1, Group #2, a gateway node running Ambari, and the Cloud Controller.]

Reliability with cost benefits
• HDP host instances could become unhealthy
– Unreliable underlying infrastructure
– Spot instances are transient, dependent on bid price
– SLA impact for workloads
• Automatically replace unhealthy nodes
– No costs incurred if a node is not functional
– Replace unhealthy instances to maintain a desired capacity
• Status: Work in progress

Auto-recovery of slave nodes
• Use Ambari to detect unhealthy status & notify Cloudbreak
• Decommission and terminate unhealthy instances
• Provision new instances and add to cluster
[Diagram: HDP Cluster with a Master Group, Group #1, Group #2, a gateway node running Ambari, and the Cloud Controller.]
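
The recovery steps on this slide can be sketched as one pass of a control loop. This is an illustration under stated assumptions: recover, is_healthy, and provision are hypothetical names, and the real interaction between Ambari and Cloudbreak is not shown in the deck.

```python
import itertools


def recover(nodes, desired, is_healthy, provision):
    """One recovery pass: drop unhealthy nodes, then refill to the desired capacity.

    Returns (new_node_list, removed_nodes).
    """
    removed = [n for n in nodes if not is_healthy(n)]   # decommission and terminate
    kept = [n for n in nodes if is_healthy(n)]
    while len(kept) < desired:
        kept.append(provision())                        # provision and add to cluster
    return kept, removed


# Hypothetical health report, as a monitoring service might deliver it.
status = {"worker-a": True, "worker-b": False, "worker-c": True}
counter = itertools.count(1)

nodes, removed = recover(
    list(status),
    desired=3,
    is_healthy=lambda n: status.get(n, True),
    provision=lambda: f"replacement-{next(counter)}",
)
print(nodes, removed)
```

Keeping detection (is_healthy) and provisioning (provision) as injected functions mirrors the split between the component that detects unhealthy nodes and the one that replaces them.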

Summary

Our Connected Data Platform Solutions
Hortonworks: Powering the Future of Data
(Every business is a data business; master the value of data via an open approach)
Modern Data Applications (CyberSecurity, IoT, Partners, Custom, etc.)
Connected Data Platforms (Manage all data: data-at-rest, data-in-motion, data center & cloud)
Training | Consulting | Community Connection | Partnerworks
Data Center Solutions: HDP, HDF, Syncsort, AtScale, Pivotal HDB, others
Cloud Solutions: Hortonworks Data Cloud for AWS, Azure HDInsight, Rackspace, Accenture, others
Enterprise Subscription: SmartSense operational services, 24x7 support, maintenance, etc.

http://hortonworks.com/info/aws-marketplace-credits-signup/

THANK YOU
