Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS)

Hortonworks Data Platform 2.2 includes HDFS for data storage. In this 30-minute webinar, we discussed data storage innovations, including heterogeneous storage, encryption, and operational security enhancements.

Published in: Software

  1. Discover HDP 2.2: Data Storage Innovations in Hadoop Distributed File System (HDFS). Hortonworks. We do Hadoop. Page 1 © Hortonworks Inc. 2014
  2. Speakers • Rohit Bakhshi, Hortonworks Senior Product Manager & PM for Apache Hadoop & Apache Solr in Hortonworks Data Platform • Jitendra Pandey, Hortonworks Senior Architect for HDFS
  3. Agenda • Overview of HDFS • New HDFS innovations in HDP 2.2: heterogeneous storage, encryption, operational security enhancements • Q & A. We'll move quickly: • Attendee phone lines are muted • Text any questions to Jitendra using Webex chat • Questions will be answered at the end of the call • Unanswered questions and their answers will appear in an upcoming FAQ/blog post
  4. Big Data, Hadoop & Data Center Re-platforming • Business Drivers: from reactive analytics to proactive interactions; insights that drive competitive advantage & optimal returns • Financial Drivers: cost of data systems, as % of IT spend, continues to grow; cost advantages of commodity hardware & open source software • Technical Drivers: data is growing exponentially & existing systems are overwhelmed; predominantly driven by NEW types of data that can inform analytics. There is an inequitable balance between vendor and customer in the market
  5. Hadoop Value: New Types of Data • Clickstream: capture and analyze website visitors' data trails and optimize your website • Sensors: discover patterns in data streaming automatically from remote sensors and machines • Server Logs: research logs to diagnose process failures and prevent security breaches • Sentiment: understand how your customers feel about your brand and products, right now • Geographic: analyze location-based data to manage operations where they occur • Unstructured: understand patterns in files across millions of web pages, emails, and documents
  6. A Shift from Reactive to Proactive Interactions. HDP and Hadoop allow organizations to use data to shift interactions from reactive (post transaction) to proactive (pre decision) • Advertising: from mass branding to 1x1 targeting • Financial Services: from educated investing to automated algorithms • Healthcare: from mass treatment to designer medicine • Retail: from static branding to real-time personalization • Telco: from break then fix to repair before break
  7. Enterprise Goals for the Modern Data Architecture • Consolidate siloed data sets, structured and unstructured • Central data set on a single cluster • Multiple workloads across batch, interactive and real time • Central services for security, governance and operation • Preserve existing investment in current tools and platforms • Single view of the customer, product, supply chain [Diagram labels: Sources (existing systems: CRM, ERP, other; clickstream; web & social; geolocation; sensor & machine; server logs; unstructured); Data System (RDBMS, EDW, MPP; HDFS with YARN: Data Operating System running batch, interactive and real-time workloads); Applications (business analytics, custom applications, packaged applications)]
  8. YARN Transformed Hadoop & Opened a New Era. YARN: The Architectural Center of Hadoop • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases [Diagram labels: batch, interactive & real-time data access engines: Script (Pig), SQL (Hive), Java/Scala (Cascading), Stream (Storm), Search (Solr), NoSQL (HBase, Accumulo), In-Memory (Spark), Others (ISV engines); Tez; Slider; YARN: Data Operating System (Cluster Resource Management); HDFS (Hadoop Distributed File System)]
  9. YARN Extends Hadoop to Other Data Center Leaders. YARN: The Architectural Center of Hadoop • Common data platform, many applications • Support multi-tenant access & processing • Batch, interactive & real-time use cases • Supports 3rd-party ISV tools (e.g. SAS, Syncsort, Actian) • YARN Ready Applications: facilitates ongoing innovation and enterprise adoption via an ecosystem of new and existing "YARN Ready" solutions [Diagram: same data access stack as slide 8]
  10. Enterprise Hadoop: Central Set of Services. Enables Apache Hadoop to be an Enterprise Data Platform with centralized services for: • Governance: load data and manage according to policy • Operations: deploy and effectively manage the platform (provision, manage & monitor: Ambari, Zookeeper; scheduling: Oozie) • Security: provide a layered approach to security through Authentication, Authorization, Accounting, and Data Protection. Everything that plugs into Hadoop inherits these services [Diagram: data access stack of slide 8 framed by Governance, Security and Operations]
  11. Hortonworks Development Investment for the Enterprise: Vertical Integration with YARN and HDFS • Ensure engines can run reliably and respectfully in a YARN based cluster • Implement features throughout the stack to accommodate [Diagram: same enterprise Hadoop stack as slide 10]
  12. Hortonworks Development Investment for the Enterprise: Horizontal Integration for Enterprise Services • Ensure consistent enterprise services are applied across the entire Hadoop stack • Integrate with and extend existing data center solutions for these key requirements [Diagram: same enterprise Hadoop stack as slide 10]
  13. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2 • YARN is the architectural center of HDP: common data set across all applications; batch, interactive & real-time workloads; multi-tenant access & processing • Provides comprehensive enterprise capabilities: governance, security, operations • Enables broad ecosystem adoption: ISVs can plug directly into Hadoop • The widest range of deployment options: Linux & Windows; on premises & cloud [Diagram labels: Governance (Data Workflow, Lifecycle & Governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS); Data Access (engines as on slide 8, on YARN and HDFS); Security (Authentication, Authorization, Audit, Data Protection; Storage: HDFS, Resources: YARN, Access: Hive, Pipeline: Falcon, Cluster: Ranger, Cluster: Knox); Operations (Provision, Manage & Monitor: Ambari, Zookeeper; Scheduling: Oozie); Deployment Choice (Linux, Windows, On-Premises, Cloud)]
  14. HDP Delivers Enterprise Hadoop: Hortonworks Data Platform 2.2 • YARN is the architectural center of HDP: common data set across all applications; batch, interactive & real-time workloads; multi-tenant access & processing • Provides comprehensive enterprise capabilities: governance, security, operations • Enables broad ecosystem adoption: ISVs can plug directly into Hadoop • The widest range of deployment options: Linux & Windows; on premises & cloud [Diagram: same HDP 2.2 stack as slide 13]
  15. Overview of HDFS
  16. HDFS enables the Common Data Platform. HDFS: Storage Platform for Modern Data Architecture • Common data platform across multiple application workloads • Reliable • Scalable • Cost efficient [Diagram labels: data access engines (Pig, Hive, Cascading, Storm, Solr, HBase, Accumulo, Spark, ISV engines) with Tez and Slider on YARN: Data Operating System (Cluster Resource Management) over HDFS (Hadoop Distributed File System)]
  17. HDFS Innovations on HDP 2.2
  18. HDFS in HDP 2.2: What's New • Heterogeneous Storage (theme: Heterogeneous Storage): Archive and SSD tiers; Tech Preview: enable intermediate data to be stored in memory • Encryption (theme: Security): Tech Preview: Transparent Data Encryption • DataNode does not require root to start (theme: Security): HDFS services in a Kerberized cluster no longer need root to start
  19. New in HDP 2.2: Heterogeneous Storage
  20. Heterogeneous Storage • Before: DataNode is a single storage; storage is uniform, with Disk as the only storage type; storage types are hidden from the file system • New architecture: DataNode is a collection of storages; supports different types of storage: Disk, SSDs, Memory [Diagram: all disks as a single storage versus a collection of tiered storages; labels: S3, Swift, SAN, Filers]
  21. HDFS Storage Architecture - Now
  22. Storage Policies: Archival • Hot: all replicas on DISK • Warm: 1 replica on DISK, others on ARCHIVE • Cold: all replicas on ARCHIVE [Diagram: HDP cluster with DISK and ARCHIVE storages]
  23. Storage Policy: SSD • DataSet A: all replicas on SSD [Diagram: HDP cluster nodes, each with SSD and DISK storages; replicas of A placed on SSD]
  24. Store Intermediate Data in Memory (Tech Preview feature) • Application process writes the block to memory (memory tier, RAM_DISK) • The block is lazily persisted to disk • For data writes that need low latency and where the data can be regenerated
  25. New in HDP 2.2: Encryption
  26. HDFS Transparent Data Encryption • HDFS Encryption: Transparent Encryption in HDFS (HDFS-6134); designate a directory as an encryption zone, and all files in the zone are encrypted; depends on the Key Management Server • Key Management Server (HADOOP-10433): the custodian for all encryption keys in Hadoop; REST API for key CRUD operations • Key Provider API (HADOOP-10141): API that allows Hadoop code (NN, DN, DFS clients) to perform CRUD operations on key material
  27. HDFS Transparent Data Encryption [Diagram labels: Encrypted File (attributes: EDEK, IV); Encryption Zone (attributes: EZ key ID, version), HDFS-6134; Name Node with KeyProvider; HDFS Client with KeyProvider API (HADOOP-10141) and Crypto Stream (r/w with DEK); Key Management System (KMS), HADOOP-10433, holding DEKs and EZKs; EDEK and DEK exchanged between components] Acronyms • EZ: Encryption Zone (an HDFS directory) • EZK: Encryption Zone Key; master key associated with all files in an EZ • DEK: Data Encryption Key; unique key associated with each file; the EZ key is used to generate the DEK • EDEK: Encrypted DEK; the Name Node only has access to the encrypted DEK • IV: Initialization Vector
  28. New in HDP 2.2: Operational Security Enhancements
  29. DataNode does not require root. Enables organizations to run services without using root privileges. For Kerberized clusters: • DataNode no longer needs to run as the Linux root user when starting • DataNode no longer needs to bind to privileged ports • DataNode uses SASL when transferring blocks between HDFS clients and DataNodes
  30. Q & A
  31. Thank you! Learn more at: hortonworks.com/hadoop/hdfs/ Register for the remaining 4 Discover HDP 2.2 webinars: Hortonworks.com/webinars
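
The storage policies on slides 22 and 23 can be read as a rule that maps a policy and a replication factor to the storage type of each replica. The following is a minimal Python sketch of that reading only; it models the slide text, not the HDFS block placement code. The function name `replica_storage_types` and the `ALL_SSD` label for the slide 23 policy are illustrative choices that do not appear on the slides.

```python
from typing import Callable, Dict, List

# Placement rules as stated on slides 22 and 23 (illustrative model only).
# Each rule takes the replication factor and returns one storage type per replica.
POLICIES: Dict[str, Callable[[int], List[str]]] = {
    "HOT": lambda n: ["DISK"] * n,                       # all replicas on DISK
    "WARM": lambda n: ["DISK"] + ["ARCHIVE"] * (n - 1),  # 1 on DISK, others on ARCHIVE
    "COLD": lambda n: ["ARCHIVE"] * n,                   # all replicas on ARCHIVE
    "ALL_SSD": lambda n: ["SSD"] * n,                    # all replicas on SSD (slide 23)
}


def replica_storage_types(policy: str, replication: int = 3) -> List[str]:
    """Return the storage type chosen for each replica under a policy."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    try:
        rule = POLICIES[policy.upper()]
    except KeyError:
        raise ValueError(f"unknown storage policy: {policy}") from None
    return rule(replication)


if __name__ == "__main__":
    for name in POLICIES:
        print(name, replica_storage_types(name))
```

With the default replication factor of 3 assumed here, the Warm policy yields one DISK replica and two ARCHIVE replicas, which matches the "1 replica on DISK, others on ARCHIVE" wording on slide 22.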
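
The key hierarchy on slides 26 and 27 (EZK held by the key management service, a per-file DEK, and an EDEK stored with the file metadata) can be illustrated with a small Python sketch. Everything here is a hypothetical stand-in: `ToyKMS`, `write_file`, `read_file` and the metadata dictionary are invented for illustration, and the cipher is a toy SHA-256 keystream XOR used only so the example runs with the standard library. It is not secure and is not the cipher or API that HDFS or the Hadoop KMS uses.

```python
import hashlib
import os
from typing import Dict, Tuple


def _toy_cipher(key: bytes, iv: bytes, data: bytes) -> bytes:
    """Toy XOR keystream cipher (NOT secure); applying it twice decrypts."""
    out = bytearray()
    for offset in range(0, len(data), 32):
        counter = (offset // 32).to_bytes(8, "big")
        block = hashlib.sha256(key + iv + counter).digest()
        out.extend(b ^ k for b, k in zip(data[offset:offset + 32], block))
    return bytes(out)


class ToyKMS:
    """Hypothetical key service: encryption zone keys (EZKs) never leave it."""

    def __init__(self) -> None:
        self._ezks: Dict[str, bytes] = {}

    def create_zone_key(self, name: str) -> None:
        self._ezks[name] = os.urandom(32)

    def generate_edek(self, name: str) -> Tuple[bytes, bytes]:
        """Create a fresh DEK and return it only in encrypted form (EDEK) with its IV."""
        dek, iv = os.urandom(32), os.urandom(16)
        return _toy_cipher(self._ezks[name], iv, dek), iv

    def decrypt_edek(self, name: str, edek: bytes, iv: bytes) -> bytes:
        return _toy_cipher(self._ezks[name], iv, edek)


def write_file(kms: ToyKMS, meta: Dict[str, dict], path: str,
               zone_key: str, plaintext: bytes) -> bytes:
    """Store EDEK and IV as file metadata, then encrypt the data with the DEK."""
    edek, iv = kms.generate_edek(zone_key)
    meta[path] = {"ezk": zone_key, "edek": edek, "iv": iv}  # metadata holds only the EDEK
    dek = kms.decrypt_edek(zone_key, edek, iv)              # client obtains the DEK
    return _toy_cipher(dek, iv, plaintext)


def read_file(kms: ToyKMS, meta: Dict[str, dict], path: str,
              ciphertext: bytes) -> bytes:
    """Fetch EDEK and IV from metadata, recover the DEK, then decrypt the data."""
    info = meta[path]
    dek = kms.decrypt_edek(info["ezk"], info["edek"], info["iv"])
    return _toy_cipher(dek, info["iv"], ciphertext)
```

The point of the sketch is the separation of roles described in the acronym table: the metadata store sees only the EDEK and IV, the EZK stays inside the key service, and each file gets its own DEK.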
