SlideShare a Scribd company logo
HDFS What’s New and Future

Suresh Srinivas
suresh@hortonworks.com
@suresh_m_s




© Hortonworks Inc. 2013      Page 1
About Me
• Architect & Founder at Hortonworks
• Apache Hadoop committer and PMC member
• > 4.5 years working on HDFS




    Architecting the Future of Big Data
                                          Page 2
    © Hortonworks Inc. 2013
Agenda

• HDFS – What’s new
 – Federation
 – HA
 – Snapshots
 – Other features
• Future
 – Major Architectural Directions
 – Short term and long term features


    Architecting the Future of Big Data
                                          Page 3
    © Hortonworks Inc. 2013
We have been hard at work…
• Progress is being made in many areas
  – Scalability
  – Performance
  – Enterprise features
  – Ongoing operability improvements
  – Enhancements for other projects in the ecosystem
  – Expand Hadoop ecosystem to more platforms and use cases
• 2192 commits in Hadoop in the last year
  – Almost a million lines of changes
  – ~150 contributors
  – Lot of new contributors - ~80 with < 3 patches
• 350K lines of changes in HDFS and common

      Architecting the Future of Big Data
                                                              Page 4
      © Hortonworks Inc. 2013
Building on Rock-solid Foundation
• Original design choices - simple and robust
   – Storage: Rely in OS’s file system rather than use raw disk
   – Storage Fault Tolerance: multiple replicas, active monitoring
   – Single Namenode Master
• Reliability
   – Over 7 9’s of data reliability
   – Less than 0.38 failures across 25 clusters
• Operability
   – Small teams can manage large clusters
      • An operator per 3K node cluster
   – Fast Time to repair on node or disk failure
      • Minutes to an hour Vs. RAID array repairs taking many long hours
• Scalable - proven by large scale deployments not bits
  – > 100 PB storage, > 400 million files, > 4500 nodes in a single cluster
   – > 70 K nodes of HDFS in deployment and use


         Architecting the Future of Big Data
                                                                              Page 5
         © Hortonworks Inc. 2013
Federation
                                 NN-1                         NN-k                  NN-n


                 Namespace
                                                                                           Foreign
                                          NS1                        NS k                   NS n
                                                         ..                    ..
                                                         .                     .

                                                Pool 1            Pool k             Pool n
                 Block Storage




                                                                Block Pools




                                         DN 1                     DN 2                  DN m
                                               ..                     ..                    ..
                                                              Common Storage

• Block Storage as generic storage service
  – DNs store blocks in Block Pools for all the Namespace Volumes
• Multiple independent Namenodes and Namespace Volumes in a cluster
  – Scalability by adding more namenodes/namespaces
  – Isolation – separating applications to their own namespaces
  – Client side mount tables/ViewFS for integrated views

         Architecting the Future of Big Data
                                                                                                     Page 6
         © Hortonworks Inc. 2013
High Availability
• Support standby namenode and failover
 – Planned downtime
 – Unplanned downtime
• Release 1.1
 – Cold standby
 – Uses NFS as shared storage
 – Standard HA frameworks as failover controller
   • Linux HA and VMWare VSphere
 – Suitable for small clusters up to 500 nodes



     Architecting the Future of Big Data
                                                   Page 7
     © Hortonworks Inc. 2013
Hadoop Full Stack HA


                                                Slave Nodes of Hadoop Cluster


                                      jo           jo             jo   jo    jo
                                       b            b              b    b     b


 Apps
Running
Outside
                                                           Failover

                                        JT into Safemode

                         NN                                  JT             NN

                            Server                            Server         Server

                                           HA Cluster for Master Daemons

          Architecting the Future of Big Data
                                                                                      Page 8
          © Hortonworks Inc. 2013
High Availability – Release 2.0
• Supports manual and automatic failover
• Automatic failover with Failover Controller
  – Active NN election and failure detection using ZooKeeper
  – Periodic NN health check
  – Failover on NN failure
• Removed shared storage dependency
  – Quorum Journal Manager
    • 3 to 5 Journal Nodes for storing editlog
    • Edit must be written to quorum number of Journal Nodes



                 Available in Release 2.0.3-alpha

      Architecting the Future of Big Data
                                                               Page 9
      © Hortonworks Inc. 2013
ZK          ZK           ZK
                                 Heartbeat                                                Heartbeat


      FailoverController                                                                  FailoverController
            Active                                                                             Standby

                                 Cmds
                                                   JN         JN          JN



                                                        Shared NN state
                                        NN                                        NN
Monitor Health                                          through Quorum
of NN. OS, HW
                                       Active           of JournalNodes         Standby               Monitor Health
                                                                                                      of NN. OS, HW




    Block Reports to Active & Standby
    DN fencing: only obey commands
              from active
                                             DN        DN          DN           DN


                           Namenode HA has no external dependency
      Architecting the Future of Big Data
                                                                                                                       Page 10
      © Hortonworks Inc. 2013
Snapshots (HDFS-2802)
• Support for read-only COW snapshots
  – Design allows read-write snapshots
• Namenode only operation – no data copy made
  – Metadata in namenode - no complicated distributed mechanism
  – Datanodes have no knowledge
• Snapshot entire namespace or sub directories
  – Nested snapshots allowed
  – Managed by Admin
    • Users can take snapshots of directories they own
• Efficient
  – Instantaneous creation
  – Memory used is highly optimized
  – Does not affect regular HDFS operations

      Architecting the Future of Big Data
                                                              Page 11
      © Hortonworks Inc. 2013
Snapshot Design
                                                   ∆n    ∆n-1          ∆0




                                         Current        Sn      Sn-1   S0




• Based on Persistent Data Structures
  – Maintains changes in the diff list at the Inodes
     • Tracks creation, deletion, and modification
  – Snapshot state Sn = current - ∆n
• A large number of snapshots supported
  – State proportional to the changes between the snapshots
  – Supports millions of snapshots
       Architecting the Future of Big Data
                                                                            Page 12
       © Hortonworks Inc. 2013
Snapshot – APIs and CLIs
• All regular commands & APIs can be used with snapshot path
  – /<path>/.snapshot/<snapshot_name>/file.txt
• CLIs
  – Allow snapshots
     • dfsadmin –allowSnapshots <dir>
     • dfsadmin –disAllowSnapshots <dir>
  – Create/delete/rename snapshots
     • fs –createSnapshot<dir> [snapshot_name]
     • fs –deleteSnapshot<dir> <snapshot_name>
     • fs –renameSnapshot<dir> <old_name> <new_name>
  – Tool to print diff between snapshots
  – Admin tool to print all snapshottable directories and snapshots
• Status
  – Work almost complete – ready to be integrated to trunk
  – Additional work needed in integration to Ambari

         Architecting the Future of Big Data
                                                                      Page 13
         © Hortonworks Inc. 2013
Performance Improvements
• Many Improvements
  – SSE4.2 CRC32C – ~3x less CPU on read path
  – Read path improvements for fewer memory copies
  – Short-circuit read for 2-3x faster random reads (HBase workloads)
  – Unix domain socket based local reads (almost done)
    • Simpler to configure and generic for many applications
  – I/O improvements using posix_fadvise()
  – libhdfs improvements for zero copy reads
• Significant improvements - IO 2.5x to 5x faster
  – Lot of improvements back ported to release 1.x
    • Available in Apache release 1.1 and HDP 1.1




      Architecting the Future of Big Data
                                                                  Page 14
      © Hortonworks Inc. 2013
Other Features
• New append pipeline
• Protobuf, wire compatibility
  – Post 2.0 GA stronger wire compatibility in Apache Hadoop and HDP Releases
• Rolling upgrades
  – With relaxed version checks
• Improvements for other projects
  – Stale node to improve HBase MTTR
• Block placement enhancements
  – Better support for other topologies such as VMs and Cloud
• On the wire encryption
  – Both data and RPC
• Support for NFS gateway
  – Work in progress – available soon
• Expanding ecosystem, platforms and applicability
  – Native support for Windows

       Architecting the Future of Big Data
                                                                                Page 15
       © Hortonworks Inc. 2013
Enterprise Readiness
• Storage fault-tolerance – built into HDFS 
  – Over 7’9s of data reliability
• High Availability 
• Standard Interfaces 
  – WebHdfs(REST) & HTTPFS, Fuse, NFS, libwebhdfs and libhdfs
• Wire protocol compatibility 
• Rolling upgrades 
• Snapshots 
• Disaster Recovery 
  – Distcp for parallel and incremental copies across cluster
  – Apache Ambari and HDP for automated management


       Architecting the Future of Big Data
                                                                Page 16
       © Hortonworks Inc. 2013
HDFS Futures




Architecting the Future of Big Data
                                      Page 17
© Hortonworks Inc. 2011
Storage Abstraction
• Fundamental storage abstraction improvements
• Short Term
  – Heterogeneous storage
     • Support SSDs and disks for different storage categories
     • Match storage to different access patterns
     • Disk/storage addressing/locality and status collection
  – Block level APIs for apps that don’t need file system interface
  – Granular block placement policies
• Long Term
  – Explore support for objects/Key value store and APIs
  – Serving from Datanodes optimized based on file structure



      Architecting the Future of Big Data
                                                                      Page 18
      © Hortonworks Inc. 2013
Higher Scalability
• Even higher scalability of namespace
 – Only working set in Namenode memory
 – Namenode as container of namespaces
   • Support large number of namespaces
 – Explore new types of namespaces


• Further scale the block storage
 – Block management to Datanodes
 – Block collection/Mega block group abstraction



     Architecting the Future of Big Data
                                                   Page 19
     © Hortonworks Inc. 2013
High Availability
• Further enhancements to HA
 – Expand Full stack HA to include other dependent services
 – Support multiple standby nodes
 – Use standby for reads
 – Simplify management – eliminate special daemons for journals
    • Move Namenode metadata to HDFS




     Architecting the Future of Big Data
                                                                  Page 20
     © Hortonworks Inc. 2013
Q&A
• Myths and misinformation
 – Not reliable (was never true)
 – Namenode dies all state is lost (was never true)
 – Hard to operate
 – Slow and not performant
 – Namenode is a single point of failure
 – Needs shared NFS storage
 – Does not have point in time recovery
 – Does not support disaster recovery


                                  Thank You!
    Architecting the Future of Big Data
                                                      Page 21
    © Hortonworks Inc. 2013

More Related Content

What's hot

High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Futurekarl.barnes
 
Dell high density GPU solution
Dell high density GPU solutionDell high density GPU solution
Dell high density GPU solutionClayton Li
 
Setting up Storage Features in Windows Server 2012
Setting up Storage Features in Windows Server 2012Setting up Storage Features in Windows Server 2012
Setting up Storage Features in Windows Server 2012
Lai Yoong Seng
 
Simple layouts for ECKD and zfcp disk configurations on Linux on System z
Simple layouts for ECKD and zfcp disk configurations on Linux on System zSimple layouts for ECKD and zfcp disk configurations on Linux on System z
Simple layouts for ECKD and zfcp disk configurations on Linux on System z
IBM India Smarter Computing
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System z
IBM India Smarter Computing
 
Using multi tiered storage systems for storing both structured & unstructured...
Using multi tiered storage systems for storing both structured & unstructured...Using multi tiered storage systems for storing both structured & unstructured...
Using multi tiered storage systems for storing both structured & unstructured...
ORACLE USER GROUP ESTONIA
 
SANsymphony V
SANsymphony VSANsymphony V
SANsymphony V
TTEC
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Richard McDougall
 
Tandberg Data - Data Protection Solutions Guide
Tandberg Data  - Data Protection Solutions GuideTandberg Data  - Data Protection Solutions Guide
Tandberg Data - Data Protection Solutions Guide
Custemotion Unternehmensberatung UG (haftungsbeschränkt)
 
An Active and Hybrid Storage System for Data-intensive Applications
An Active and Hybrid Storage System for Data-intensive ApplicationsAn Active and Hybrid Storage System for Data-intensive Applications
An Active and Hybrid Storage System for Data-intensive Applications
Xiao Qin
 
Consolidating database servers with Lenovo ThinkServer RD630
Consolidating database servers with Lenovo ThinkServer RD630Consolidating database servers with Lenovo ThinkServer RD630
Consolidating database servers with Lenovo ThinkServer RD630
Principled Technologies
 
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
How an Enterprise Data Fabric (EDF) can improve resiliency and performanceHow an Enterprise Data Fabric (EDF) can improve resiliency and performance
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
gojkoadzic
 
Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13DaWane Wanek
 
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDsVirtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
Principled Technologies
 
Extending the lifecycle of your storage area network
Extending the lifecycle of your storage area networkExtending the lifecycle of your storage area network
Extending the lifecycle of your storage area networkInterop
 
Red Hat Enterprise Linux on IBM System z Performance Evaluation
Red Hat Enterprise Linux on IBM System z Performance EvaluationRed Hat Enterprise Linux on IBM System z Performance Evaluation
Red Hat Enterprise Linux on IBM System z Performance Evaluation
IBM India Smarter Computing
 
SCM Dashboard
SCM DashboardSCM Dashboard
SCM DashboardPerforce
 
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
Principled Technologies
 

What's hot (20)

High Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and FutureHigh Performance Computing Infrastructure: Past, Present, and Future
High Performance Computing Infrastructure: Past, Present, and Future
 
Dell high density GPU solution
Dell high density GPU solutionDell high density GPU solution
Dell high density GPU solution
 
D02 Evolution of the HADR tool
D02 Evolution of the HADR toolD02 Evolution of the HADR tool
D02 Evolution of the HADR tool
 
Setting up Storage Features in Windows Server 2012
Setting up Storage Features in Windows Server 2012Setting up Storage Features in Windows Server 2012
Setting up Storage Features in Windows Server 2012
 
Simple layouts for ECKD and zfcp disk configurations on Linux on System z
Simple layouts for ECKD and zfcp disk configurations on Linux on System zSimple layouts for ECKD and zfcp disk configurations on Linux on System z
Simple layouts for ECKD and zfcp disk configurations on Linux on System z
 
SLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System zSLES 11 SP2 PerformanceEvaluation for Linux on System z
SLES 11 SP2 PerformanceEvaluation for Linux on System z
 
Using multi tiered storage systems for storing both structured & unstructured...
Using multi tiered storage systems for storing both structured & unstructured...Using multi tiered storage systems for storing both structured & unstructured...
Using multi tiered storage systems for storing both structured & unstructured...
 
SANsymphony V
SANsymphony VSANsymphony V
SANsymphony V
 
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on DemandApachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
Apachecon Euro 2012: Elastic, Multi-tenant Hadoop on Demand
 
Tandberg Data - Data Protection Solutions Guide
Tandberg Data  - Data Protection Solutions GuideTandberg Data  - Data Protection Solutions Guide
Tandberg Data - Data Protection Solutions Guide
 
An Active and Hybrid Storage System for Data-intensive Applications
An Active and Hybrid Storage System for Data-intensive ApplicationsAn Active and Hybrid Storage System for Data-intensive Applications
An Active and Hybrid Storage System for Data-intensive Applications
 
Consolidating database servers with Lenovo ThinkServer RD630
Consolidating database servers with Lenovo ThinkServer RD630Consolidating database servers with Lenovo ThinkServer RD630
Consolidating database servers with Lenovo ThinkServer RD630
 
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
How an Enterprise Data Fabric (EDF) can improve resiliency and performanceHow an Enterprise Data Fabric (EDF) can improve resiliency and performance
How an Enterprise Data Fabric (EDF) can improve resiliency and performance
 
Edition based redefinition joords
Edition based redefinition joordsEdition based redefinition joords
Edition based redefinition joords
 
Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13Avnet & Rorke Data - Open Compute Summit '13
Avnet & Rorke Data - Open Compute Summit '13
 
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDsVirtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
Virtualized database performance with Dell PowerEdge PCIe Express Flash SSDs
 
Extending the lifecycle of your storage area network
Extending the lifecycle of your storage area networkExtending the lifecycle of your storage area network
Extending the lifecycle of your storage area network
 
Red Hat Enterprise Linux on IBM System z Performance Evaluation
Red Hat Enterprise Linux on IBM System z Performance EvaluationRed Hat Enterprise Linux on IBM System z Performance Evaluation
Red Hat Enterprise Linux on IBM System z Performance Evaluation
 
SCM Dashboard
SCM DashboardSCM Dashboard
SCM Dashboard
 
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
Dell Acceleration Appliance for Databases 2.0 and Microsoft SQL Server 2014: ...
 

Similar to HDFS - What's New and Future

Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and Future
Cloudera, Inc.
 
Hadoop Summit 2012 | HDFS High Availability
Hadoop Summit 2012 | HDFS High AvailabilityHadoop Summit 2012 | HDFS High Availability
Hadoop Summit 2012 | HDFS High Availability
Cloudera, Inc.
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
hdhappy001
 
Nn ha hadoop world.final
Nn ha hadoop world.finalNn ha hadoop world.final
Nn ha hadoop world.final
Hortonworks
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
Hortonworks
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
mcsrivas
 
SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloudaidanshribman
 
Presentation st9900 virtualization - emea - primary disk
Presentation   st9900 virtualization - emea - primary diskPresentation   st9900 virtualization - emea - primary disk
Presentation st9900 virtualization - emea - primary diskxKinAnx
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
Cloudera, Inc.
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
DataWorks Summit
 
Zoned Storage
Zoned StorageZoned Storage
Zoned Storage
singh.gurjeet
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
Richard McDougall
 
Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)
Steve Loughran
 
Hadoop: today and tomorrow
Hadoop: today and tomorrowHadoop: today and tomorrow
Hadoop: today and tomorrow
Steve Loughran
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
Eran Gampel
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
DataWorks Summit
 
HDFS NameNode HA in CDH4
HDFS NameNode HA in CDH4HDFS NameNode HA in CDH4
HDFS NameNode HA in CDH4Lee neal
 
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Cloudera, Inc.
 
21.10.09 Microsoft Event, Microsoft Presentation
21.10.09 Microsoft Event, Microsoft Presentation21.10.09 Microsoft Event, Microsoft Presentation
21.10.09 Microsoft Event, Microsoft Presentationdataplex systems limited
 

Similar to HDFS - What's New and Future (20)

Strata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and FutureStrata + Hadoop World 2012: HDFS: Now and Future
Strata + Hadoop World 2012: HDFS: Now and Future
 
Hadoop Summit 2012 | HDFS High Availability
Hadoop Summit 2012 | HDFS High AvailabilityHadoop Summit 2012 | HDFS High Availability
Hadoop Summit 2012 | HDFS High Availability
 
Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2Nicholas:hdfs what is new in hadoop 2
Nicholas:hdfs what is new in hadoop 2
 
Nn ha hadoop world.final
Nn ha hadoop world.finalNn ha hadoop world.final
Nn ha hadoop world.final
 
HDFS Namenode High Availability
HDFS Namenode High AvailabilityHDFS Namenode High Availability
HDFS Namenode High Availability
 
Design, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for HadoopDesign, Scale and Performance of MapR's Distribution for Hadoop
Design, Scale and Performance of MapR's Distribution for Hadoop
 
SAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego CloudSAP Virtualization Week 2012 - The Lego Cloud
SAP Virtualization Week 2012 - The Lego Cloud
 
Presentation st9900 virtualization - emea - primary disk
Presentation   st9900 virtualization - emea - primary diskPresentation   st9900 virtualization - emea - primary disk
Presentation st9900 virtualization - emea - primary disk
 
Hadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in CloudHadoop World 2011: Hadoop as a Service in Cloud
Hadoop World 2011: Hadoop as a Service in Cloud
 
HDFS- What is New and Future
HDFS- What is New and FutureHDFS- What is New and Future
HDFS- What is New and Future
 
Zoned Storage
Zoned StorageZoned Storage
Zoned Storage
 
Hadoop on VMware
Hadoop on VMwareHadoop on VMware
Hadoop on VMware
 
Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)Availability and Integrity in hadoop (Strata EU Edition)
Availability and Integrity in hadoop (Strata EU Edition)
 
Hadoop: today and tomorrow
Hadoop: today and tomorrowHadoop: today and tomorrow
Hadoop: today and tomorrow
 
HBase with MapR
HBase with MapRHBase with MapR
HBase with MapR
 
Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk Dragonflow Austin Summit Talk
Dragonflow Austin Summit Talk
 
Best Practices for Virtualizing Hadoop
Best Practices for Virtualizing HadoopBest Practices for Virtualizing Hadoop
Best Practices for Virtualizing Hadoop
 
HDFS NameNode HA in CDH4
HDFS NameNode HA in CDH4HDFS NameNode HA in CDH4
HDFS NameNode HA in CDH4
 
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
Strata + Hadoop World 2012: High Availability for the HDFS NameNode Phase 2
 
21.10.09 Microsoft Event, Microsoft Presentation
21.10.09 Microsoft Event, Microsoft Presentation21.10.09 Microsoft Event, Microsoft Presentation
21.10.09 Microsoft Event, Microsoft Presentation
 

More from DataWorks Summit

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
DataWorks Summit
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
DataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
DataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
DataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
DataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
DataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
DataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
DataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
DataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
DataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
DataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
DataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
DataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
DataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
DataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
DataWorks Summit
 

More from DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Recently uploaded

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
ControlCase
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
Matthew Sinclair
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
Neo4j
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
Safe Software
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
Neo4j
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
Adtran
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Paige Cruz
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
Quotidiano Piemontese
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
mikeeftimakis1
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems S.M.S.A.
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
BookNet Canada
 

Recently uploaded (20)

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
 
PCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase TeamPCI PIN Basics Webinar from the Controlcase Team
PCI PIN Basics Webinar from the Controlcase Team
 
20240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 202420240605 QFM017 Machine Intelligence Reading List May 2024
20240605 QFM017 Machine Intelligence Reading List May 2024
 
Monitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR EventsMonitoring Java Application Security with JDK Tools and JFR Events
Monitoring Java Application Security with JDK Tools and JFR Events
 
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024GraphSummit Singapore | The Art of the  Possible with Graph - Q2 2024
GraphSummit Singapore | The Art of the Possible with Graph - Q2 2024
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 
Essentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FMEEssentials of Automations: The Art of Triggers and Actions in FME
Essentials of Automations: The Art of Triggers and Actions in FME
 
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
GraphSummit Singapore | The Future of Agility: Supercharging Digital Transfor...
 
Pushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 daysPushing the limits of ePRTC: 100ns holdover for 100 days
Pushing the limits of ePRTC: 100ns holdover for 100 days
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdfObservability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
Observability Concepts EVERY Developer Should Know -- DeveloperWeek Europe.pdf
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
National Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practicesNational Security Agency - NSA mobile device best practices
National Security Agency - NSA mobile device best practices
 
Introduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - CybersecurityIntroduction to CHERI technology - Cybersecurity
Introduction to CHERI technology - Cybersecurity
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdfFIDO Alliance Osaka Seminar: Overview.pdf
FIDO Alliance Osaka Seminar: Overview.pdf
 
Uni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdfUni Systems Copilot event_05062024_C.Vlachos.pdf
Uni Systems Copilot event_05062024_C.Vlachos.pdf
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...Transcript: Selling digital books in 2024: Insights from industry leaders - T...
Transcript: Selling digital books in 2024: Insights from industry leaders - T...
 

HDFS - What's New and Future

  • 1. HDFS What’s New and Future Suresh Srinivas suresh@hortonworks.com @suresh_m_s © Hortonworks Inc. 2013 Page 1
  • 2. About Me • Architect & Founder at Hortonworks • Apache Hadoop committer and PMC member • > 4.5 years working on HDFS Architecting the Future of Big Data Page 2 © Hortonworks Inc. 2013
  • 3. Agenda • HDFS – What’s new – Federation – HA – Snapshots – Other features • Future – Major Architectural Directions – Short term and long term features Architecting the Future of Big Data Page 3 © Hortonworks Inc. 2013
  • 4. We have been hard at work… • Progress is being made in many areas – Scalability – Performance – Enterprise features – Ongoing operability improvements – Enhancements for other projects in the ecosystem – Expand Hadoop ecosystem to more platforms and use cases • 2192 commits in Hadoop in the last year – Almost a million lines of changes – ~150 contributors – Lot of new contributors - ~80 with < 3 patches • 350K lines of changes in HDFS and common Architecting the Future of Big Data Page 4 © Hortonworks Inc. 2013
  • 5. Building on Rock-solid Foundation • Original design choices - simple and robust – Storage: Rely in OS’s file system rather than use raw disk – Storage Fault Tolerance: multiple replicas, active monitoring – Single Namenode Master • Reliability – Over 7 9’s of data reliability – Less than 0.38 failures across 25 clusters • Operability – Small teams can manage large clusters • An operator per 3K node cluster – Fast Time to repair on node or disk failure • Minutes to an hour Vs. RAID array repairs taking many long hours • Scalable - proven by large scale deployments not bits – > 100 PB storage, > 400 million files, > 4500 nodes in a single cluster – > 70 K nodes of HDFS in deployment and use Architecting the Future of Big Data Page 5 © Hortonworks Inc. 2013
  • 6. Federation NN-1 NN-k NN-n Namespace Foreign NS1 NS k NS n .. .. . . Pool 1 Pool k Pool n Block Storage Block Pools DN 1 DN 2 DN m .. .. .. Common Storage • Block Storage as generic storage service – DNs store blocks in Block Pools for all the Namespace Volumes • Multiple independent Namenodes and Namespace Volumes in a cluster – Scalability by adding more namenodes/namespaces – Isolation – separating applications to their own namespaces – Client side mount tables/ViewFS for integrated views Architecting the Future of Big Data Page 6 © Hortonworks Inc. 2013
  • 7. High Availability • Support standby namenode and failover – Planned downtime – Unplanned downtime • Release 1.1 – Cold standby – Uses NFS as shared storage – Standard HA frameworks as failover controller • Linux HA and VMWare VSphere – Suitable for small clusters up to 500 nodes Architecting the Future of Big Data Page 7 © Hortonworks Inc. 2013
  • 8. Hadoop Full Stack HA Slave Nodes of Hadoop Cluster jo jo jo jo jo b b b b b Apps Running Outside Failover JT into Safemode NN JT NN Server Server Server HA Cluster for Master Daemons Architecting the Future of Big Data Page 8 © Hortonworks Inc. 2013
  • 9. High Availability – Release 2.0 • Supports manual and automatic failover • Automatic failover with Failover Controller – Active NN election and failure detection using ZooKeeper – Periodic NN health check – Failover on NN failure • Removed shared storage dependency – Quorum Journal Manager • 3 to 5 Journal Nodes for storing editlog • Edit must be written to quorum number of Journal Nodes Available in Release 2.0.3-alpha Architecting the Future of Big Data Page 9 © Hortonworks Inc. 2013
  • 10. ZK ZK ZK Heartbeat Heartbeat FailoverController FailoverController Active Standby Cmds JN JN JN Shared NN state NN NN Monitor Health through Quorum of NN. OS, HW Active of JournalNodes Standby Monitor Health of NN. OS, HW Block Reports to Active & Standby DN fencing: only obey commands from active DN DN DN DN Namenode HA has no external dependency Architecting the Future of Big Data Page 10 © Hortonworks Inc. 2013
  • 11. Snapshots (HDFS-2802) • Support for read-only COW snapshots – Design allows read-write snapshots • Namenode only operation – no data copy made – Metadata in namenode - no complicated distributed mechanism – Datanodes have no knowledge • Snapshot entire namespace or sub directories – Nested snapshots allowed – Managed by Admin • Users can take snapshots of directories they own • Efficient – Instantaneous creation – Memory used is highly optimized – Does not affect regular HDFS operations Architecting the Future of Big Data Page 11 © Hortonworks Inc. 2013
  • 12. Snapshot Design ∆n ∆n-1 ∆0 Current Sn Sn-1 S0 • Based on Persistent Data Structures – Maintains changes in the diff list at the Inodes • Tracks creation, deletion, and modification – Snapshot state Sn = current - ∆n • A large number of snapshots supported – State proportional to the changes between the snapshots – Supports millions of snapshots Architecting the Future of Big Data Page 12 © Hortonworks Inc. 2013
  • 13. Snapshot – APIs and CLIs • All regular commands & APIs can be used with snapshot path – /<path>/.snapshot/<snapshot_name>/file.txt • CLIs – Allow snapshots • dfsadmin –allowSnapshots <dir> • dfsadmin –disAllowSnapshots <dir> – Create/delete/rename snapshots • fs –createSnapshot<dir> [snapshot_name] • fs –deleteSnapshot<dir> <snapshot_name> • fs –renameSnapshot<dir> <old_name> <new_name> – Tool to print diff between snapshots – Admin tool to print all snapshottable directories and snapshots • Status – Work almost complete – ready to be integrated to trunk – Additional work needed in integration to Ambari Architecting the Future of Big Data Page 13 © Hortonworks Inc. 2013
  • 14. Performance Improvements • Many Improvements – SSE4.2 CRC32C – ~3x less CPU on read path – Read path improvements for fewer memory copies – Short-circuit read for 2-3x faster random reads (HBase workloads) – Unix domain socket based local reads (almost done) • Simpler to configure and generic for many applications – I/O improvements using posix_fadvise() – libhdfs improvements for zero copy reads • Significant improvements - IO 2.5x to 5x faster – Lot of improvements back ported to release 1.x • Available in Apache release 1.1 and HDP 1.1 Architecting the Future of Big Data Page 14 © Hortonworks Inc. 2013
  • 15. Other Features • New append pipeline • Protobuf, wire compatibility – Post 2.0 GA stronger wire compatibility in Apache Hadoop and HDP Releases • Rolling upgrades – With relaxed version checks • Improvements for other projects – Stale node to improve HBase MTTR • Block placement enhancements – Better support for other topologies such as VMs and Cloud • On the wire encryption – Both data and RPC • Support for NFS gateway – Work in progress – available soon • Expanding ecosystem, platforms and applicability – Native support for Windows Architecting the Future of Big Data Page 15 © Hortonworks Inc. 2013
  • 16. Enterprise Readiness • Storage fault-tolerance – built into HDFS  – Over 7’9s of data reliability • High Availability  • Standard Interfaces  – WebHdfs(REST) & HTTPFS, Fuse, NFS, libwebhdfs and libhdfs • Wire protocol compatibility  • Rolling upgrades  • Snapshots  • Disaster Recovery  – Distcp for parallel and incremental copies across cluster – Apache Ambari and HDP for automated management Architecting the Future of Big Data Page 16 © Hortonworks Inc. 2013
  • 17. HDFS Futures Architecting the Future of Big Data Page 17 © Hortonworks Inc. 2011
  • 18. Storage Abstraction • Fundamental storage abstraction improvements • Short Term – Heterogeneous storage • Support SSDs and disks for different storage categories • Match storage to different access patterns • Disk/storage addressing/locality and status collection – Block level APIs for apps that don’t need file system interface – Granular block placement policies • Long Term – Explore support for objects/Key value store and APIs – Serving from Datanodes optimized based on file structure Architecting the Future of Big Data Page 18 © Hortonworks Inc. 2013
  • 19. Higher Scalability • Even higher scalability of namespace – Only working set in Namenode memory – Namenode as container of namespaces • Support large number of namespaces – Explore new types of namespaces • Further scale the block storage – Block management to Datanodes – Block collection/Mega block group abstraction Architecting the Future of Big Data Page 19 © Hortonworks Inc. 2013
  • 20. High Availability • Further enhancements to HA – Expand Full stack HA to include other dependent services – Support multiple standby nodes – Use standby for reads – Simplify management – eliminate special daemons for journals • Move Namenode metadata to HDFS Architecting the Future of Big Data Page 20 © Hortonworks Inc. 2013
  • 21. Q&A • Myths and misinformation – Not reliable (was never true) – Namenode dies all state is lost (was never true) – Hard to operate – Slow and not performant – Namenode is a single point of failure – Needs shared NFS storage – Does not have point in time recovery – Does not support disaster recovery Thank You! Architecting the Future of Big Data Page 21 © Hortonworks Inc. 2013