The Giraffa File System




   Konstantin V. Shvachko
   Alto Storage Technologies

   September 19, 2012         Hadoop User Group

AltoStor
                                     Giraffa


      Giraffa is a distributed,
      highly available file system


      Utilizes features of
      HDFS and HBase


      New open source project
      in experimental stage


                                        Apache Hadoop

      A reliable, scalable, high performance distributed
      storage and computing system
      The Hadoop Distributed File System (HDFS)
        Reliable storage layer

      MapReduce – distributed computation framework
        Simple computational model

      Ecosystem of Big Data tools
        HBase, ZooKeeper, and others




                                     The Design Principles

      Linear scalability
        More nodes can do more work within the same time
        Applies to both data size and compute resources

      Reliability and Availability
        A drive lasts about 3 years, so its probability of failing on any given day is ~1/1000
        Several drives fail every day on a cluster with thousands of drives

      Move computation to data
        Minimize expensive data transfers

      Sequential data processing
        Avoid random reads. [Use HBase for random data access]
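The reliability bullet above is quick arithmetic; a sketch of the estimate (the 3-year lifetime and 3000-drive cluster size are the slide's illustrative figures):

```java
// Back-of-envelope drive-failure estimate; illustrative only.
public class FailureEstimate {
    // Probability a single drive fails on a given day, assuming a ~3-year lifetime.
    static double dailyFailureProb() {
        return 1.0 / (3 * 365); // ~1/1095, roughly 1/1000
    }

    // Expected drive failures per day in a cluster with nDrives drives.
    static double expectedDailyFailures(int nDrives) {
        return nDrives * dailyFailureProb();
    }

    public static void main(String[] args) {
        System.out.printf("p(fail today) = %.5f%n", dailyFailureProb());
        System.out.printf("3000-drive cluster: %.1f failures/day%n",
                expectedDailyFailures(3000));
    }
}
```

With thousands of drives the expectation is a few failures per day, which is why replication is mandatory rather than optional.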


                                                       Hadoop Cluster

      HDFS – a distributed file system
        NameNode – namespace and block management
        DataNodes – block replica container

      MapReduce – a framework for distributed computations
        JobTracker – job scheduling, resource management, lifecycle
        coordination
        TaskTracker – task execution module

                        NameNode                 JobTracker




                    TaskTracker    TaskTracker        TaskTracker

                     DataNode      DataNode            DataNode



                  Hadoop Distributed File System

      The namespace is a hierarchy of files and directories
        Files are divided into large blocks (128 MB)

      Namespace (metadata) is decoupled from data
        Fast namespace operations, not slowed down by data transfers
        Data streams directly from the source storage

      Single NameNode keeps entire namespace in RAM
      DataNodes store block replicas as files on local drives
        Blocks replicated on 3 DataNodes for redundancy & availability

      HDFS client – point of entry to HDFS
        Contacts NameNode for metadata
        Serves data to applications directly from DataNodes
                                           Scalability Limits

      Single-master architecture: a constraining resource
      Single NameNode limits linear performance growth
        A handful of “bad” clients can saturate NameNode

      Single point of failure: takes whole cluster out of service
      NameNode space limit
        ~100 million files and 200 million blocks fit in 64 GB of RAM
        Restricts storage capacity to about 20 PB
        Small file problem: block-to-file ratio is shrinking

      “HDFS Scalability: The limits to growth” USENIX ;login: 2010
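The numbers above follow from simple arithmetic; a sketch (the ~100 MB average block size is an assumption chosen to match the slide's 20 PB figure, since blocks are rarely full):

```java
// Rough NameNode capacity arithmetic implied by the slide's figures.
public class NamespaceCapacity {
    // Heap bytes per namespace object: 100M files + 200M blocks in 64 GB of RAM.
    static long bytesPerObject(long heapBytes, long objects) {
        return heapBytes / objects;
    }

    // Total data addressed by a given block count at an assumed average block size.
    static long addressableBytes(long blocks, long avgBlockSize) {
        return blocks * avgBlockSize;
    }

    public static void main(String[] args) {
        // ~229 bytes of heap per file/block object
        long perObject = bytesPerObject(64L << 30, 300_000_000L);
        // 200M blocks * ~100 MB average = 2e16 bytes = 20 PB
        long capacity = addressableBytes(200_000_000L, 100_000_000L);
        System.out.println(perObject + " bytes/object, " + capacity + " bytes addressable");
    }
}
```

The shrinking block-to-file ratio makes this worse: more small files means more heap consumed per petabyte actually stored.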


                                                            Node Count Visualization

  [Chart: cluster size (number of nodes) vs. resources per node (cores, disks,
  RAM). Data points: 2008 Yahoo!, 4000-node cluster; 2010 Facebook, 2000 nodes;
  2011 eBay, 1000 nodes; 2013, a cluster of 500 nodes.]
                      Horizontal to Vertical Scaling

      Horizontal scaling is limited by single-master architecture
      Natural growth of compute power and storage density
        Clusters composed of more dense & powerful servers

      Vertical scaling leads to cluster size shrinking
        Storage capacity, Compute power, and Cost remain constant

      Exponential Information Growth
        2006 Chevron accumulates 2 TB a day
        2012 Facebook ingests 500 TB a day




                                 Scalability for Hadoop 2.0

      HDFS Federation
        Independent NameNodes sharing a common pool of DataNodes
        Cluster is a family of volumes with shared block storage layer
        User sees volumes as isolated file systems
        ViewFS: the client-side mount table

      YARN: new MapReduce framework
        Dynamic partitioning of cluster resources: no fixed slots
        Separation of JobTracker functions
      1. Job scheduling and resource allocation: centralized
      2. Job monitoring and job life-cycle coordination: decentralized
           o Delegate coordination of different jobs to other nodes



                                  Namespace Partitioning

       Static: Federation
         Directory sub-trees are statically assigned to
         disjoint volumes
         Relocating sub-trees without copying is
         challenging
         Scale x10: billions of files
       Dynamic:
         Files, directory sub-trees can move automatically
         between nodes based on their utilization or load
         balancing requirements
         Files can be relocated without copying data blocks
          Scale x100: hundreds of billions of files
       Orthogonal independent approaches.
         Federation of distributed namespaces is possible


                                         Giraffa File System

      HDFS + HBase = Giraffa
        Goal: build from existing building blocks
        Minimize changes to existing components

  1. Store file & directory metadata in HBase table
        Dynamic table partitioning into regions
        Cached in RegionServer RAM for fast access

  2. Store file data in HDFS DataNodes: data streaming
  3. Block management
        Handle communication with DataNodes:
        heartbeat, blockReport, addBlock
        Perform block allocation, replication, and deletion

                                     Giraffa Requirements

      Availability – the primary goal
        Load balancing of metadata traffic
        Same data streaming speed to / from DataNodes
        Continuous Availability: No SPOF

      Cluster operability, management
        Cost of running a larger cluster is the same as for a smaller one

      More files & more data
                       HDFS          Federated HDFS    Giraffa
  Space                25 PB         120 PB            1 EB = 1000 PB
  Files + blocks       200 million   1 billion         100 billion
  Concurrent Clients   40,000        100,000           1 million


                                             HBase Overview

      Table: big, sparse, loosely structured
        Collection of rows, sorted by row keys
        Rows can have an arbitrary number of columns

      Dynamic Table partitioning!
        Table is split horizontally into Regions
        RegionServers serve regions to applications
        Columns are grouped into column families: a vertical partition of the table

      Distributed Cache:
        Regions are loaded in nodes’ RAM
        Real-time access to data




            HBase Architecture

  [Diagram: HBase architecture]
                                                      HBase API

      HBaseAdmin: administrative functions
        Create, delete, list tables
        Create, update, delete columns, column families
        Split, compact, flush

      HTable: access table data
        Result HTable.get(Get g) // get cells of a row
        void HTable.put(Put p) // update a row
        void HTable.delete(Delete d) // delete cells/row
        ResultScanner getScanner(family) // scan col family
        A variety of filters

      Coprocessors:
        Custom actions triggered by update events
        Like database triggers or stored procedures
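To make the call shapes concrete, here is a toy in-memory stand-in that mirrors the `HTable` methods listed above. This is purely illustrative, not the real `org.apache.hadoop.hbase.client` API (which talks to a running cluster):

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

// Toy table mirroring the HTable call shapes on the slide: get, put,
// delete, and a sorted range scan. Not HBase client code.
public class MiniTable {
    // rows sorted by row key, each row a map of column -> value
    private final TreeMap<String, Map<String, byte[]>> rows = new TreeMap<>();

    // Result HTable.get(Get g): fetch the cells of one row
    public Map<String, byte[]> get(String rowKey) {
        return rows.getOrDefault(rowKey, Collections.emptyMap());
    }

    // void HTable.put(Put p): update a row
    public void put(String rowKey, String column, byte[] value) {
        rows.computeIfAbsent(rowKey, k -> new HashMap<>()).put(column, value);
    }

    // void HTable.delete(Delete d): delete a row
    public void delete(String rowKey) {
        rows.remove(rowKey);
    }

    // ResultScanner getScanner(...): scan rows in a sorted key range
    public SortedMap<String, Map<String, byte[]>> scan(String from, String to) {
        return rows.subMap(from, to);
    }
}
```

The sorted range scan is the operation Giraffa leans on: because rows are ordered by key, listing a directory becomes a contiguous scan rather than many point lookups.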

                                         Building Blocks

      Giraffa clients
        Fetch file & block metadata from Namespace Service
        Exchange data with DataNodes
      Namespace Service
        HBase Table stores File metadata as rows
      Block Management
        Distributed collection of Giraffa block metadata
      Data Management
        DataNodes. Distributed collection of data blocks




                                                       Giraffa Architecture

  [Diagram: App → NamespaceAgent → HBase Namespace Service (Namespace Table:
  path, attrs, block[], DN[][]; Block Management Processor) → Block Management
  Layer (BM servers) → DataNodes]

  1. A Giraffa client gets files and blocks from HBase
  2. The Block Manager handles block operations
  3. Data streams to or from DataNodes
                                                                               Giraffa Client

      GiraffaFileSystem implements FileSystem
       fs.defaultFS = grfa:///
       fs.grfa.impl = o.a.giraffa.GiraffaFileSystem

      GiraffaClient extends DFSClient
       NamespaceAgent replaces NameNode RPC

  [Diagram: GiraffaFileSystem wraps GiraffaClient, which extends DFSClient;
  the NamespaceAgent connects to the Namespace service, while DFSClient
  streams to and from DataNodes]

                                               Namespace Table

      Single Table called “Namespace” stores
        Row Key = File ID
        File attributes:
           o Local name, owner, group, permissions, access-time,
             modification-time, block-size, replication, isDir, length
        List of blocks of a file
           o Persisted in the table
        List of block locations for each block
           o Not persisted, but discovered from the BlockManager
        Directory table
           o maps directory entry name to respective child row key
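A sketch of what one file row might carry, with fields named after the slide's attribute list. These names are illustrative; Giraffa's actual serialized schema may differ:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative shape of one row in the "Namespace" table; field names follow
// the attributes listed on the slide, not Giraffa's real on-disk schema.
public class FileRow {
    public final String rowKey;        // file ID (full path under the default key scheme)
    public final String localName, owner, group;
    public final short permissions;
    public final long accessTime, modificationTime, blockSize, length;
    public final short replication;
    public final boolean isDir;
    // List of the file's blocks: persisted in the table.
    public final List<Long> blocks = new ArrayList<>();
    // Block locations are NOT stored here; they are discovered from the BlockManager.

    public FileRow(String rowKey, String localName, String owner, String group,
                   short permissions, long blockSize, short replication, boolean isDir) {
        this.rowKey = rowKey;
        this.localName = localName;
        this.owner = owner;
        this.group = group;
        this.permissions = permissions;
        this.blockSize = blockSize;
        this.replication = replication;
        this.isDir = isDir;
        this.accessTime = this.modificationTime = System.currentTimeMillis();
        this.length = 0; // a newly created, empty file
    }
}
```

Keeping block locations out of the persisted row matches HDFS practice: locations are soft state that DataNodes re-report, so persisting them would only create staleness.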




                                                                   Namespace Service

  [Diagram: the HBase Namespace Service consists of Region Servers, each
  hosting several regions. Each Region Server runs an NS Processor (1) for
  namespace requests and a BM Processor (2) that talks to the Block
  Management Layer.]
                                                  Block Manager

      Maintains flat namespace of Giraffa block metadata
  1. Block management
        Block allocation, deletion, replication
  2. DataNode management
        Process DataNode block reports, heartbeats. Identify lost nodes
  3. Storage for the HBase table
        A small file system storing HFiles and the HLog

      BM Server is paired with a RegionServer on the same node
        A distributed cluster of BM Servers
        Mostly local communication between Region and BM Servers

      NameNode as an initial implementation of BMServer

                                  Data Management

      DataNodes store and report data blocks
      Blocks are files on local drives
      Data transfer to and from clients
      Internal data transfers
      Same as HDFS




                                        Row Key Design

      Row keys
        Identify files and directories as rows in the table
        Define sorting of rows in Namespace table
        And therefore Namespace partitioning
      Different row key definitions based on locality requirements
        The key definition is chosen during file system formatting
      Full-path-key is the default implementation
        Problem: Rename can move object to another region
      Row keys based on INode numbers


                                    Locality of Reference

      Files in the same directory – adjacent in the table
        Belong to the same region (most of the time)
        Efficient “ls”. Avoid jumping across regions
      Row keys define the sorting of files and directories in the table
      The tree-structured namespace is flattened into a linear array
      The ordered list of files is self-partitioned into regions
      How can tree locality be retained in the linearized structure?
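The adjacency claim can be sketched with a sorted map standing in for the region-partitioned table: an "ls" becomes one contiguous range scan over full-path keys. This is an illustration, not Giraffa code; note that the range also covers deeper descendants of the directory:

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixScan {
    // "ls dir" as a range scan over lexicographically sorted full-path keys.
    // '0' is the ASCII successor of '/', so ["/dir/", "/dir0") covers exactly
    // the keys that start with "/dir/" (including deeper descendants).
    public static SortedMap<String, String> ls(TreeMap<String, String> table, String dir) {
        String from = dir.endsWith("/") ? dir : dir + "/";
        String to = from.substring(0, from.length() - 1) + "0";
        return table.subMap(from, to);
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("/logs/a.log", "...");
        table.put("/logs/b.log", "...");
        table.put("/tmp/x", "...");
        // The two /logs entries come back together, from one key range.
        System.out.println(ls(table, "/logs").keySet());
    }
}
```

Because the keys of one directory form one contiguous range, they usually fall in one region, which is exactly the "efficient ls" property the slide claims.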




                                     Partitioning: Random

      Straightforward partitioning based on random hashing

  [Diagram: a namespace tree (root 1; children 2, 3, 4; leaves 15, 16) whose
  entries are hashed to values id1, id2, id3 and scattered across regions
  T1–T4, so related files land in unrelated regions.]
                                        Partitioning: Full Subtrees

      Partitioning based on lexicographic full-path ordering
        The default for Giraffa

  [Diagram: the same namespace tree partitioned into regions T1–T4 by sorted
  full paths, so each region holds a contiguous range of subtrees.]
               Partitioning: Fixed Neighborhood

      Partitioning based on fixed-depth neighborhoods

  [Diagram: the same namespace tree partitioned into regions T1–T4 so that
  each region holds a fixed-depth neighborhood: a directory together with
  its immediate children.]
                                              Atomic Rename

      Giraffa will implement atomic in-place rename
        No support for atomic file move from one directory to another
        Requires inode numbers as unique file IDs

      A move can then be implemented on application level
        Non-atomically move the file from the source directory to a
        temporary file in the target directory
        Atomically rename the temporary file to its original name
        On failure use simple 3-step recovery procedure

      Eventually implement atomic moves
        Paxos
        Simplified synchronization algorithms (ZAB)


                           3-Step Recovery Procedure

      Suppose a move of a file from srcDir to trgDir failed:
  1. If only the source file exists, then start the move over
  2. If only the target temporary file exists, then complete
     the move by renaming the temporary file to the original
     name
  3. If both the source and the temporary target file exist,
     then remove the source and rename the temporary file
           This step is non-atomic and may fail as well.
           In case of failure repeat the recovery procedure
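The three cases can be sketched against a toy namespace, with a set of paths standing in for the file system. This illustrates the procedure from the slide, not Giraffa code; the temporary name is hypothetical:

```java
import java.util.Set;

public class MoveRecovery {
    // Recovery for an interrupted move of src to dst, staged through the
    // temporary name tmp in the target directory.
    public static void recover(Set<String> fs, String src, String tmp, String dst) {
        boolean s = fs.contains(src), t = fs.contains(tmp);
        if (s && !t) {                 // 1. only the source exists: start the move over
            fs.remove(src); fs.add(tmp);   //    non-atomic move into the target directory
            fs.remove(tmp); fs.add(dst);   //    atomic in-place rename
        } else if (!s && t) {          // 2. only the temporary exists: finish the rename
            fs.remove(tmp); fs.add(dst);
        } else if (s && t) {           // 3. both exist: drop the source, then rename
            fs.remove(src);                //    non-atomic: on failure, rerun recover()
            fs.remove(tmp); fs.add(dst);
        }
        // If neither exists, the move already completed (dst holds the file).
    }
}
```

Each branch ends with the atomic in-place rename, so whichever intermediate state the failure left behind, rerunning the procedure converges to the file existing only at its destination.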




                                 New Giraffa Functionality

      Custom file attributes: user-defined file metadata
        Today hidden in complex file names or nested directories
           o /logs/2012/08/31/server-ip.log
        Or stored in ZooKeeper or even in stand-alone DBs
           o Involves synchronization
        Advanced scanning, grouping, and filtering
        An Amazon S3 API turns Giraffa into reliable storage on the cloud

      Versioning
        Based on HBase row versioning
        Restore objects deleted inadvertently
        Alternative approach for snapshots


                                                            Status

      The project is hosted on Apache Extras
      One node cluster running
      Row Key abstraction
      HBase implementation in separate package
        Other DBs or Key-Value stores can be plugged in

      Infrastructure: Eclipse, Findbugs, JavaDoc, Ivy, Jenkins, Wiki
      Server-side processing of FS requests via HBase endpoints
      Testing Giraffa with TestHDFSCLI
      Web UI. Multi-node cluster. Release…
           Thank You!




                                                 Related Work

      Ceph
        Metadata stored on OSD
        MDS cache metadata: Dynamic Partitioning

      Lustre
        Plans to release (2.4) distributed namespace
        Code ready

      Colossus (Google): S. Quinlan and J. Dean
        100 million files per metadata server
        Hundreds of servers

      VoldFS, CassandraFS, KTHFS (MySQL)
        Prototypes

      MapR distributed file system
                                                                 History

      (2008) Idea. Study of distributed systems
        AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
        Partitioning of the namespace: 4 types of partitioning

      (2009) Study on scalability limits
        NameNode optimization

      (2010) Design with Michael Stack
        Presentation at HDFS contributors meeting

      (2011) Plamen implements POC
      (2012) Rewrite open sourced as Apache Extras project
        http://code.google.com/a/apache-extras.org/p/giraffa/

                                                Etymology

      Giraffe. Latin: Giraffa camelopardalis
      Family      Giraffidae
      Genus       Giraffa
      Species     Giraffa camelopardalis

      Other languages
      Arabic      Zarafa
      Spanish     Jirafa
      Bulgarian   жирафа
      Italian     Giraffa

      Favorites of my daughter
           o As the Hadoop traditions require

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network
 

More from Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaDeveloping Mobile Apps for Performance - Swapnil Patel, Verizon Media
Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media
 
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...
 
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanAthenz & SPIFFE, Tatsuya Yano, Yahoo Japan
Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan
 
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...
 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Recently uploaded

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Victor Rentea
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusZilliz
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...apidays
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024The Digital Insurer
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfOverkill Security
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobeapidays
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024The Digital Insurer
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWERMadyBayot
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 

Recently uploaded (20)

AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
Modular Monolith - a Practical Alternative to Microservices @ Devoxx UK 2024
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
Apidays New York 2024 - Passkeys: Developing APIs to enable passwordless auth...
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024FWD Group - Insurer Innovation Award 2024
FWD Group - Insurer Innovation Award 2024
 
Cyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdfCyberprint. Dark Pink Apt Group [EN].pdf
Cyberprint. Dark Pink Apt Group [EN].pdf
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024Manulife - Insurer Transformation Award 2024
Manulife - Insurer Transformation Award 2024
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWEREMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
EMPOWERMENT TECHNOLOGY GRADE 11 QUARTER 2 REVIEWER
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 

Sep 2012 HUG: Giraffa File System to Grow Hadoop Bigger

  • 6. AltoStor Hadoop Distributed File System
       The namespace is a hierarchy of files and directories
       Files are divided into large blocks: 128 MB
       Namespace (metadata) is decoupled from data
         Fast namespace operations, not slowed down by data streaming
         Direct data streaming from the storage source
       A single NameNode keeps the entire namespace in RAM
       DataNodes store block replicas as files on local drives
         Blocks are replicated on 3 DataNodes for redundancy & availability
       HDFS client – the point of entry to HDFS
         Contacts the NameNode for metadata
         Serves data to applications directly from DataNodes
  • 7. AltoStor Scalability Limits
       Single-master architecture: a constraining resource
         A single NameNode limits linear performance growth
         A handful of “bad” clients can saturate the NameNode
         Single point of failure: takes the whole cluster out of service
       NameNode space limit
         100 million files and 200 million blocks with 64 GB RAM
         Restricts storage capacity to 20 PB
         Small file problem: the block-to-file ratio is shrinking
       “HDFS Scalability: The limits to growth” USENIX ;login: 2010
  • 8. AltoStor Node Count Visualization
       (chart: resources per node – cores, disks, RAM – against cluster size, the number of nodes)
         2008 Yahoo! 4000-node cluster
         2010 Facebook 2000 nodes
         2011 eBay 1000 nodes
         2013 Cluster of 500 nodes
  • 9. AltoStor Horizontal to Vertical Scaling
       Horizontal scaling is limited by the single-master architecture
       Natural growth of compute power and storage density
         Clusters composed of more dense & powerful servers
       Vertical scaling leads to cluster size shrinking
         Storage capacity, compute power, and cost remain constant
       Exponential information growth
         2006 Chevron accumulates 2 TB a day
         2012 Facebook ingests 500 TB a day
  • 10. AltoStor Scalability for Hadoop 2.0
       HDFS Federation
         Independent NameNodes sharing a common pool of DataNodes
         The cluster is a family of volumes with a shared block storage layer
         Users see volumes as isolated file systems
         ViewFS: the client-side mount table
       YARN: the new MapReduce framework
         Dynamic partitioning of cluster resources: no fixed slots
         Separation of JobTracker functions
           1. Job scheduling and resource allocation: centralized
           2. Job monitoring and job life-cycle coordination: decentralized
              Delegates coordination of different jobs to other nodes
  • 11. AltoStor Namespace Partitioning
       Static: Federation
         Directory sub-trees are statically assigned to disjoint volumes
         Relocating sub-trees without copying is challenging
         Scale x10: billions of files
       Dynamic:
         Files and directory sub-trees can move automatically between nodes
         based on their utilization or load-balancing requirements
         Files can be relocated without copying data blocks
         Scale x100: hundreds of billions of files
       Orthogonal, independent approaches; federation of distributed namespaces is possible
  • 12. AltoStor Giraffa File System
       HDFS + HBase = Giraffa
       Goal: build from existing building blocks
         Minimize changes to existing components
       1. Store file & directory metadata in an HBase table
            Dynamic table partitioning into regions
            Cached in RegionServer RAM for fast access
       2. Store file data in HDFS DataNodes: data streaming
       3. Block management
            Handle communication with DataNodes: heartbeat, blockReport, addBlock
            Perform block allocation, replication, and deletion
  • 13. AltoStor Giraffa Requirements
       Availability – the primary goal
         Load balancing of metadata traffic
         Same data streaming speed to / from DataNodes
         Continuous availability: no SPOF
       Cluster operability, management
         Cost of running a larger cluster same as a smaller one
       More files & more data:

                             HDFS          Federated HDFS   Giraffa
         Space               25 PB         120 PB           1 EB = 1000 PB
         Files + blocks      200 million   1 billion        100 billion
         Concurrent clients  40,000        100,000          1 million
  • 14. AltoStor HBase Overview
       Table: big, sparse, loosely structured
         A collection of rows, sorted by row keys
         Rows can have an arbitrary number of columns
       Dynamic table partitioning!
         The table is split horizontally into regions
         Region Servers serve regions to applications
       Columns are grouped into column families: a vertical partition of tables
       Distributed cache: regions are loaded into nodes’ RAM
         Real-time access to data
  • 15. AltoStor HBase Architecture 15
  • 16. AltoStor HBase API
       HBaseAdmin: administrative functions
         Create, delete, list tables
         Create, update, delete columns, column families
         Split, compact, flush
       HTable: access table data
         Result HTable.get(Get g)            // get cells of a row
         void HTable.put(Put p)              // update a row
         void HTable.delete(Delete d)        // delete cells / a row
         ResultScanner getScanner(family)    // scan a column family
       A variety of filters
       Coprocessors: custom actions triggered by update events
         Like database triggers or stored procedures
  • 17. AltoStor Building Blocks
       Giraffa clients
         Fetch file & block metadata from the Namespace Service
         Exchange data with DataNodes
       Namespace Service
         An HBase table stores file metadata as rows
       Block Management
         A distributed collection of Giraffa block metadata
       Data Management
         DataNodes: a distributed collection of data blocks
  • 18. AltoStor Giraffa Architecture
       (diagram: the HBase Namespace Table stores path, attrs, block[], DN[][];
        each region pairs a Namespace Processor with a Block Management Processor)
       1. The Giraffa client (via the NamespaceAgent) gets files and blocks from HBase
       2. The Block Management Layer (BM servers) handles block operations
       3. The client streams data to or from DataNodes
  • 19. AltoStor Giraffa Client
       GiraffaFileSystem implements FileSystem
         fs.defaultFS = grfa:///
         fs.grfa.impl = o.a.giraffa.GiraffaFileSystem
       GiraffaClient extends DFSClient
         NamespaceAgent replaces NameNode RPC
       (diagram: GiraffaFileSystem → GiraffaClient → DFSClient;
        the NamespaceAgent talks to the Namespace, DFSClient machinery to DataNodes)
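The two client-side settings above would be wired into a standard Hadoop client configuration roughly as follows (a sketch: the property names come from the slide; the expansion of the abbreviated class name `o.a.giraffa` to `org.apache.giraffa` is an assumption):

```xml
<!-- core-site.xml: point the Hadoop client at Giraffa (sketch) -->
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>grfa:///</value>
  </property>
  <property>
    <!-- maps the grfa:// URI scheme to the Giraffa FileSystem implementation -->
    <name>fs.grfa.impl</name>
    <value>org.apache.giraffa.GiraffaFileSystem</value>
  </property>
</configuration>
```

With this in place, an unmodified Hadoop application resolving `grfa:///` paths would be handed a GiraffaFileSystem instance instead of talking to a NameNode.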
  • 20. AltoStor Namespace Table
       A single table called “Namespace” stores:
         Row key = file ID
         File attributes: local name, owner, group, permissions, access time,
           modification time, block size, replication, isDir, length
         The list of blocks of a file
           Persisted in the table
         The list of block locations for each block
           Not persisted, but discovered from the BlockManager
         Directory table
           Maps a directory entry name to the respective child row key
  • 21. AltoStor Namespace Service
       (diagram: each HBase Region Server hosts regions with an NS Processor and
        a BM Processor; requests flow (1) to the NS Processor and (2) down to the
        Block Management Layer)
  • 22. AltoStor Block Manager
       Maintains a flat namespace of Giraffa block metadata
       1. Block management
            Block allocation, deletion, replication
       2. DataNode management
            Process DataNode block reports and heartbeats; identify lost nodes
       3. Storage for the HBase table
            A small file system to store HFiles and the HLog
       The BM Server is paired on the same node with a RegionServer
         A distributed cluster of BM Servers
         Mostly local communication between Region and BM Servers
         The NameNode serves as an initial implementation of the BM Server
  • 23. AltoStor Data Management
       DataNodes store and report data blocks; blocks are files on local drives
       Data transfer to and from clients
       Internal data transfers
       Same as HDFS
  • 24. AltoStor Row Key Design
       Row keys
         Identify files and directories as rows in the table
         Define the sorting of rows in the Namespace table,
         and therefore the namespace partitioning
       Different row key definitions based on locality requirements
         The key definition is chosen during file system formatting
       Full-path key is the default implementation
         Problem: a rename can move an object to another region
       Row keys based on INode numbers
  • 25. AltoStor Locality of Reference
       Files in the same directory are adjacent in the table
         They belong to the same region (most of the time)
         Efficient “ls”: avoids jumping across regions
       Row keys define the sorting of files and directories in the table
         The tree-structured namespace is flattened into a linear array
         The ordered list of files is self-partitioned into regions
       How to retain tree locality in the linearized structure?
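The full-path ordering described above can be illustrated with an ordinary sorted map (a sketch, not Giraffa code: `TreeMap` stands in for the sorted Namespace table, and the paths are made up):

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch: a sorted map stands in for the Namespace table, with the full
// path as the row key. Lexicographic order keeps the entries of one
// directory contiguous, so a directory listing is a single range scan.
public class FullPathKeys {
    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("/logs/2012/08/a.log", "file");
        table.put("/logs/2012/08/b.log", "file");
        table.put("/logs/2012/09/c.log", "file");
        table.put("/users/alice", "dir");

        // "ls /logs/2012/08" == one contiguous range scan over the keys
        SortedMap<String, String> listing =
            table.subMap("/logs/2012/08/", "/logs/2012/08" + '\uffff');
        System.out.println(listing.keySet());
        // the two entries of /logs/2012/08 come back adjacent
    }
}
```

This is also why the slide flags rename as the hard case: moving a file to another directory changes its key, which may land it in a different region.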
  • 26. AltoStor Partitioning: Random
       Straightforward partitioning based on random hashing
       (diagram: tree nodes 1–16 are scattered across regions T1–T4 by hashed ids)
  • 27. AltoStor Partitioning: Full Subtrees
       Partitioning based on lexicographic full-path ordering
       The default for Giraffa
       (diagram: each region T1–T4 holds a full subtree of the namespace)
  • 28. AltoStor Partitioning: Fixed Neighborhood
       Partitioning based on fixed-depth neighborhoods
       (diagram: nodes within a fixed depth of a common ancestor share a region)
  • 29. AltoStor Atomic Rename
       Giraffa will implement atomic in-place rename
         No support for atomic file moves from one directory to another
         Requires inode numbers as unique file IDs
       A move can then be implemented at the application level
         Non-atomically move the file from the source directory to a
         temporary file in the target directory
         Atomically rename the temporary file to its original name
         On failure, use a simple 3-step recovery procedure
       Eventually implement atomic moves
         PAXOS
         Simplified synchronization algorithms (ZAB)
  • 30. AltoStor 3-Step Recovery Procedure
       A move of a file from srcDir to trgDir failed:
       1. If only the source file exists, then start the move over
       2. If only the target temporary file exists, then complete the move by
          renaming the temporary file to the original name
       3. If both the source and the temporary target file exist, then remove
          the source and rename the temporary file
          This step is non-atomic and may fail as well; in case of failure,
          repeat the recovery procedure
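The three cases above can be sketched as plain code (a simulation, not Giraffa code: an in-memory set of paths stands in for the namespace, and the names `recover`, `rename`, and the paths are illustrative):

```java
import java.util.HashSet;
import java.util.Set;

// Sketch of the application-level move and its 3-step recovery, using an
// in-memory set of paths in place of the Giraffa namespace.
public class MoveRecovery {
    static final String SRC = "/srcDir/f";       // the file being moved
    static final String TMP = "/trgDir/.f.tmp";  // temporary file in the target dir
    static final String TRG = "/trgDir/f";       // final destination

    // Resume an interrupted move based on which files survived the failure.
    static void recover(Set<String> ns) {
        boolean src = ns.contains(SRC), tmp = ns.contains(TMP);
        if (src && tmp) {            // step 3: both exist
            ns.remove(SRC);          //   drop the source (non-atomic part) ...
            rename(ns, TMP, TRG);    //   ... then finish with the atomic rename
        } else if (tmp) {            // step 2: only the temporary target exists
            rename(ns, TMP, TRG);    //   complete the move
        } else if (src) {            // step 1: only the source exists -> start over
            rename(ns, SRC, TMP);    //   non-atomic copy into the target directory,
            rename(ns, TMP, TRG);    //   then the atomic in-place rename
        }
    }

    // Stand-in for an atomic in-place rename within one directory.
    static void rename(Set<String> ns, String from, String to) {
        ns.remove(from);
        ns.add(to);
    }

    public static void main(String[] args) {
        // A failure left both the source and the temporary file behind (case 3).
        Set<String> ns = new HashSet<>();
        ns.add(SRC);
        ns.add(TMP);
        recover(ns);
        System.out.println(ns.contains(TRG) && ns.size() == 1); // true
    }
}
```

Since step 3 itself contains a non-atomic remove-then-rename, a failure inside it simply re-enters `recover` with one of the three recognizable states, which is why the procedure can be repeated safely.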
  • 31. AltoStor New Giraffa Functionality
       Custom file attributes: user-defined file metadata
         Today hidden in complex file names or nested directories
           /logs/2012/08/31/server-ip.log
         Or stored in ZooKeeper or even stand-alone DBs
           Involves synchronization
         Enables advanced scanning, grouping, filtering
       An Amazon S3 API turns Giraffa into reliable storage on the cloud
       Versioning
         Based on HBase row versioning
         Restore objects deleted inadvertently
         An alternative approach to snapshots
  • 32. AltoStor Status
       We are on Apache Extras
       One-node cluster running
       Row key abstraction
         HBase implementation in a separate package
         Other DBs or key-value stores can be plugged in
       Infrastructure: Eclipse, FindBugs, JavaDoc, Ivy, Jenkins, Wiki
       Server-side processing of FS requests: HBase endpoints
       Testing Giraffa with TestHDFSCLI
       Web UI. Multi-node cluster. Release…
  • 33. AltoStor Thank You! 33
  • 34. AltoStor Related Work
       Ceph
         Metadata stored on OSDs
         MDSs cache metadata: dynamic partitioning
       Lustre
         Plans to release (2.4) a distributed namespace
         Code ready
       Colossus: from Google (S. Quinlan and J. Dean)
         100 million files per metadata server
         Hundreds of servers
       VoldFS, CassandraFS, KTHFS (MySQL)
         Prototypes
       MapR distributed file system
  • 35. AltoStor History
       (2008) Idea. Study of distributed systems:
         AFS, Lustre, Ceph, PVFS, GPFS, Farsite, …
         Partitioning of the namespace: 4 types of partitioning
       (2009) Study on scalability limits. NameNode optimization
       (2010) Design with Michael Stack.
         Presentation at an HDFS contributors meeting
       (2011) Plamen implements a POC
       (2012) Rewrite open sourced as an Apache Extras project
         http://code.google.com/a/apache-extras.org/p/giraffa/
  • 36. AltoStor Etymology
       Giraffe. Latin: Giraffa camelopardalis
         Family Giraffidae, genus Giraffa, species Giraffa camelopardalis
       Other languages:
         Arabic: Zarafa; Spanish: Jirafa; Bulgarian: жирафа; Italian: Giraffa
       A favorite of my daughter
         As the Hadoop traditions require