
Introduction to Hortonworks Data Cloud for AWS


Published in: Software

  1. Hortonworks Data Cloud: Enterprise-ready Hadoop on the cloud. 蒋 逸峰 (しょう いつほう / Yifeng Jiang), Solutions Engineer, Hortonworks, @uprush. December 14, 2016. © Hortonworks Inc. 2011 – 2016. All Rights Reserved.
  2. About Me: 蒋 逸峰 (しょう いつほう / Yifeng Jiang) • Solutions Engineer, Hortonworks • Apache HBase book author • I like hiking & running • Twitter: @uprush
  3. Hortonworks Data Platform (HDP)
  4. What's Missing? • Ambari makes deploying HDP super easy, but getting to that point is not easy: – Cluster sizing – Hardware purchase, setup in the data center, networking – OS setup • On average this takes three weeks or even more
  6. Introducing Hortonworks Data Cloud for AWS • A new cloud product from Hortonworks – Powered by Hortonworks Data Platform • Offers Pay-As-You-Go (PAYG) pricing • Delivered and sold via AWS Marketplace • Handles most common big data use cases with Apache Hadoop, Spark, and Hive – Choose from a set of prescriptive cluster types • Focuses on ease of use and business agility – Avoids infinite configurability and customization • Optional free community support (Enterprise Support option coming soon)
  7. DEMO
  8. Architecture (diagram labels): Amazon Web Services; Cloudbreak Services: cloud controller (aka Cloudbreak), Cloudbreak DB, connectors for AWS, GCE, Azure, and OpenStack; Cloudbreak Deployer; access tools: Shell, REST API, Web UI; HDP Cluster: ETL / EDW, with a master group (Hive, Spark, Ambari), a slave group, a blueprint, and S3AFileSystem; HDP Cluster: Analytics, with a master group (LLAP, Zeppelin, Ambari), a slave group, a blueprint, and S3AFileSystem
  9. Hortonworks Data Cloud - Summary • Launch and manage clusters by workload type – ETL / EDW, data science, business analytics • Use highly scalable, durable storage for data (S3) and metadata (RDS) • Share data and metadata among multiple ephemeral clusters • Scale up and down at the click of a button • Secure clusters with IAM roles, security groups, etc.
  10. Improving Enterprise Readiness
  11. Enterprise Readiness: improving enterprise readiness in the cloud • Cloud storage • Security and governance • Reliability and fault tolerance
  12. Matching Hadoop with the Cloud. Datacenter: • Data locality • Consistent storage • Single cluster administration. Cloud: • Scalable storage • Customizability • Cost-effective compute. Datacenter plus cloud: • Scalable storage with performance and consistency • Customizability with ease of administration • Cost-effective compute with SLA policies
  13. Cloud storage access facts. Interaction models (diagram labels): Application, HDFS, Input, Output, tmp, Copy • Cloud storage optimizes for scale – S3 data is replicated for high-scale access and durability • Data access is remote – Data locality is lost – Metadata operations are costlier (e.g. hadoop fs -mv is actually a copy and delete) • Eventual consistency – It takes time for the effect of modification operations to propagate to all copies
  14. Performance with Scalability • General strategy: optimize by workload type • ETL workloads – Typical pipeline: bring in data => transform => repair partitions => compute statistics – Multiple metadata calls are batched and issued in parallel for performance gains • DistCp – Optimized buffer management for transferring large files – Randomized input to DistCp to avoid hot-spotting S3 nodes
  15. Performance with Scalability • Analytics workloads – ORC file related optimizations – Support fast random access reads (both directions) by avoiding tearing down S3 HTTP connections – Pass index information to compute tasks as part of split data to avoid re-computation • Status: available, but performance optimizations never stop. https://hortonworks.github.io/hdp-aws/s3-performance/index.html
  16. Correctness with strong consistency • A write operation followed by a read may not return correct results – This is an issue for data pipelines, multi-stage jobs, etc. • S3Guard project: an intermediate, consistent metadata store • Write calls from S3AFileSystem update both S3 and the metadata store • S3AFileSystem automatically tries to reconcile metadata between S3 and the metadata store on subsequent reads – Inconsistencies are handled based on policy • Status: in progress. https://issues.apache.org/jira/browse/HADOOP-13345
  17. Securing data access via IAM Roles • Integration with the cloud provider • Provide an IAM role as the instance profile for a cluster • Attach policies for accessing S3 to the role – E.g. read-only access for a BI cluster to specific buckets • Status: available
  18. Data Security in Hadoop: Apache Ranger • Fine-grained, role-based access policies to data – Table/column-level ACLs • Audit access information • Row-level filtering • Dynamic data masking
  19. Data Governance in Hadoop: Apache Atlas • Auto-discover and index metadata • Tag data • Track data lineage
  20. Data governance technical architecture – On Premise (diagram labels): on-premise HDP cluster; Ranger Admin (policy); Atlas Admin (metadata); governed HDP component (e.g. Hive) with Ranger plugin and Atlas plugin; LDAP / AD; data steward
  21. Data Governance in the Cloud: ease of administration with flexibility • There is no longer a single compute cluster generating / accessing data • Data and metadata are still single and shared • Evolve Atlas and Ranger to be data lake centric rather than cluster centric – Shared long-running Admin components – Ephemeral plugins on compute clusters • Status: available as a Tech Preview. https://github.com/hortonworks/hdc-cli/blob/master/shared_cluster.md
  22. Shared Ranger / Atlas admin services (available as a Tech Preview in Hortonworks Data Cloud). Diagram labels: ETL-EDW cluster and Data Analytics cluster, each with a governed HDP component (e.g. Hive), a Ranger plugin, and an Atlas plugin; shared enterprise services: Ranger Admin (policy), Atlas Admin (metadata); LDAP / AD; cloud controller; data steward
  23. HDP Cloud Compute nodes on AWS • Regular EC2 instances • Can attach EBS volumes or ephemeral storage disks • Grouped according to functionality / access requirements • Opportunistic provisioning with spot instances (work in progress). Diagram labels: HDP Cluster; Master Group; Group #1; Gateway node: Ambari; Master Group; Group #2; Cloud Controller
  24. HDP Cloud Compute nodes on AWS
  25. Reliability with cost benefits • HDP host instances could become unhealthy – Unreliable underlying infrastructure – Spot instances are transient and depend on the bid price – SLA impact for workloads • Automatically replace unhealthy nodes – No costs incurred if a node is not functional – Replace unhealthy instances to maintain a desired capacity • Status: work in progress
  26. Auto-recovery of slave nodes • Use Ambari to detect unhealthy status and notify Cloudbreak • Decommission and terminate unhealthy instances • Provision new instances and add them to the cluster. Diagram labels: HDP Cluster; Master Group; Group #1; Gateway node: Ambari; Master Group; Group #2; Cloud Controller
  27. Summary
  28. Our Connected Data Platform Solutions. Hortonworks: Powering the Future of Data (every business is a data business; master the value of data via an open approach) • Modern Data Applications (cybersecurity, IoT, partners, custom, etc.) • Connected Data Platforms (manage all data: data-at-rest, data-in-motion, data center and cloud) • Training | Consulting | Community Connection | Partnerworks • Data Center Solutions and Cloud Solutions: Hortonworks Data Cloud for AWS, Azure HDInsight, Rackspace, Accenture, others; HDP, HDF, Syncsort, AtScale, Pivotal HDB, others • Enterprise Subscription: SmartSense operational services, 24x7 support, maintenance, etc.
  29. http://hortonworks.com/info/aws-marketplace-credits-signup/
  30. THANK YOU
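
Slide 13 notes that a move such as hadoop fs -mv on cloud storage is actually a copy and delete. The sketch below illustrates why that makes a directory move cost grow with the number of objects. It is a toy model only: `ToyObjectStore`, its methods, and the `tmp/` and `output/` prefixes are hypothetical names made up for illustration, not an S3 client or any Hadoop API.

```python
class ToyObjectStore:
    """In-memory stand-in for a flat key/value object store (hypothetical)."""

    def __init__(self):
        self.objects = {}   # key -> bytes
        self.requests = 0   # number of simulated remote calls

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_prefix(self, src_prefix, dst_prefix):
        """Emulate a directory move: copy every object, then delete the original."""
        keys = [k for k in self.objects if k.startswith(src_prefix)]
        for key in keys:
            new_key = dst_prefix + key[len(src_prefix):]
            self.copy(key, new_key)
            self.delete(key)
        return len(keys)


store = ToyObjectStore()
for i in range(100):
    store.put(f"tmp/part-{i:05d}", b"rows")

store.requests = 0  # count only the calls made by the move itself
moved = store.rename_prefix("tmp/", "output/")
print(moved, store.requests)  # 100 objects moved with 200 simulated calls
```

In this model, moving 100 objects takes one copy and one delete per object, whereas a rename on a file system with a central namespace would be a single metadata update.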
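
Slide 16 describes writes that update both S3 and a consistent metadata store, with reads reconciling the two. The following is a minimal sketch of that idea under stated assumptions, not the S3Guard implementation: `LaggyStore` and `GuardedFileSystem` are hypothetical classes, listing lag is simulated explicitly, and the reconciliation policy assumed here is simply "take the union of both listings".

```python
class LaggyStore:
    """Toy object store whose listings lag behind writes (hypothetical model)."""

    def __init__(self):
        self._visible = {}   # keys that listings already return
        self._pending = {}   # keys written but not yet visible in listings

    def put(self, key, data):
        self._pending[key] = data

    def list(self, prefix):
        return sorted(k for k in self._visible if k.startswith(prefix))

    def settle(self):
        """Simulate the moment when all copies have caught up."""
        self._visible.update(self._pending)
        self._pending.clear()


class GuardedFileSystem:
    """Wraps the store and records every write in a consistent metadata set."""

    def __init__(self, store):
        self.store = store
        self.metadata = set()

    def write(self, key, data):
        # Update both the object store and the metadata store.
        self.store.put(key, data)
        self.metadata.add(key)

    def list(self, prefix):
        # Assumed policy: trust any key known to either source.
        from_store = set(self.store.list(prefix))
        from_meta = {k for k in self.metadata if k.startswith(prefix)}
        return sorted(from_store | from_meta)


store = LaggyStore()
fs = GuardedFileSystem(store)
fs.write("stage1/part-0", b"a")
fs.write("stage1/part-1", b"b")

raw_listing = store.list("stage1/")      # the lagging store sees nothing yet
guarded_listing = fs.list("stage1/")     # the guarded view sees both files
print(raw_listing, guarded_listing)
```

A second pipeline stage that lists `stage1/` through the guarded view would pick up both parts immediately, which is the failure case for multi-stage jobs that the slide calls out.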
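
Slide 17 mentions attaching a policy to the cluster's IAM role, for example read-only access for a BI cluster to specific buckets. As a hedged illustration, the snippet below builds such a policy document as plain JSON. The helper `read_only_s3_policy` and the bucket name `example-bi-data` are hypothetical; the deck does not show the actual policy used by Hortonworks Data Cloud.

```python
import json


def read_only_s3_policy(buckets):
    """Build an IAM policy document granting list and read access to the given buckets."""
    bucket_arns = [f"arn:aws:s3:::{name}" for name in buckets]
    object_arns = [f"arn:aws:s3:::{name}/*" for name in buckets]
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is a bucket-level action.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": bucket_arns,
            },
            {
                # Reading is an object-level action; no write actions are granted.
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": object_arns,
            },
        ],
    }


policy = read_only_s3_policy(["example-bi-data"])  # hypothetical bucket name
print(json.dumps(policy, indent=2))
```

Because the policy is attached to the role used as the instance profile, the cluster nodes do not need long-lived access keys in their configuration.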
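
Slides 25 and 26 outline auto-recovery: detect unhealthy nodes, decommission and terminate them, then provision replacements to maintain a desired capacity. The loop below is a rough sketch of that control flow only. `ToyCluster` and its methods are hypothetical and do not call Ambari, Cloudbreak, or AWS; health status is set by hand to stand in for the detection step.

```python
from dataclasses import dataclass, field
from itertools import count


@dataclass
class ToyCluster:
    desired_workers: int
    workers: dict = field(default_factory=dict)            # instance id -> healthy flag
    _ids: count = field(default_factory=lambda: count(1))  # id generator

    def provision(self):
        """Add one new healthy worker and return its id."""
        instance_id = f"worker-{next(self._ids)}"
        self.workers[instance_id] = True
        return instance_id

    def recover(self):
        """Remove unhealthy workers, then provision back up to the desired capacity."""
        unhealthy = [i for i, healthy in self.workers.items() if not healthy]
        for instance_id in unhealthy:
            del self.workers[instance_id]   # stands in for decommission + terminate
        replacements = []
        while len(self.workers) < self.desired_workers:
            replacements.append(self.provision())
        return unhealthy, replacements


cluster = ToyCluster(desired_workers=3)
for _ in range(3):
    cluster.provision()

cluster.workers["worker-2"] = False   # simulated health check failure
removed, added = cluster.recover()
print(removed, added)
```

Expressing recovery as "converge to a desired capacity" means the same loop also covers a spot instance that disappears outright, since the replacement step only looks at how many workers are missing.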
