SlideShare a Scribd company logo
HBASE Backups



Pritam Damania
Software Engineer, Facebook
Aug 1, 2012
Agenda
 1   Introduction to HBASE and HDFS

 2   Introduction to HBASE Backups

 3   Facebook’s Backup solution

 4   Results

 5   Further Work
INTRODUCTION TO HDFS
What is HDFS ?


▪  Distributed   FileSystem
▪  Runs    on top of commodity hardware
▪  Scale   to Petabytes of data
▪  Tolerates   machine failures
HDFS Data Model

▪  Data    is logically organized into files and directories
▪  Files   are divided into uniform-sized blocks
▪  Blocks
        are distributed across the nodes of the cluster and are replicated
 to handle hardware failure
▪  HDFS     keeps checksums of data for corruption detection and recovery
▪  HDFS     exposes block placement so that computation can be migrated to
 data
HDFS Data Model (2)

            MetaInfo(Filename, replicationFactor, block-ids, …)
            /users/user1/data/part-0, repl:2, ids: {1,3}, …
            /users/user1/data/part-1, repl:3, ids: {2,4,5}, …


                                Block Storage

1                           2
                                          1       4         2     5
        2


                                           3                      4
            3         4
    5                                               5



                                                  7
HDFS Architecture
                                               Metadata (Name, #replicas, …):
                          Namenode                 /users/foo/data, 3, …
       Metadata ops
                                             Block ops
  Client               Metadata ops

Read              Datanodes                                 Datanodes
                                       Replication

                                                                     Blocks


                               Write                        Rack 2
            Rack 1
                            Client
                                                     8
INTRODUCTION TO HBASE
HBase in a nutshell

§  distributed, large-scale data store

§  can host very large tables, billions of rows x millions of columns

§  efficient at random reads/writes

§  open source project modeled after Google’s BigTable
HBase Data Model
•  An HBase table is:
 •    a sparse , three-dimensional array of cells, indexed by:
       RowKey, ColumnKey, Timestamp/Version

 •    sharded into regions along an ordered RowKey space

•  Within each region:
 •    Data is grouped into column families
 ▪    Sort order within each column family:

      •  Row Key (asc), Column Key (asc), Timestamp (desc)
HBase System Overview
                            Database Layer
   HBASE
       Master     Backup Master
   Region           Region             Region    ...
   Server           Server             Server

            Storage Layer                              Coordination Service

HDFS                                            Zookeeper Quorum
   Namenode         Secondary Namenode          ZK         ZK         ...
                                                Peer       Peer
Datanode Datanode       Datanode       ...
HBase Overview
HBASE Region Server
             ....
          Region #2
       Region #1
                    ....
                ColumnFamily #2
            ColumnFamily #1         Memstore
                             (in memory data structure)


               HFiles (in HDFS)                       flush



  Write Ahead Log ( in HDFS)
INTRODUCTION TO HBASE
       BACKUPS
Why Backups ?


▪  Data   Corruption
▪  Operational   error
▪  Hardware   failures
▪  Disaster
Hbase Backups – The Problem

▪  Need   a consistent, point in time backup
▪  Issues   :
 ▪    Live cluster, with traffic
 ▪    Data in MemStore
 ▪    Flushes and Compations in the background
 ▪    Regionserver death
 ▪    Regions moving
CURRENT OPTIONS – Export Table
▪  Pros   :
 ▪    Can export part or full table
 ▪    Map-Reduce job downloads data to output path provided
 ▪    Supports start time, end time and versions so could provide a
      consistent backup
 ▪    Can specify which Column Families to export
▪  Cons       :
 ▪    Only one table at a time
 ▪    Full scans and random reads
CURRENT OPTIONS - Copy Table
▪  Tool   to copy existing table to a intra/inter cluster
▪  Pros   :
 ▪    Another parallel replicated setup to switch
 ▪    Supports start time, end time, and versions
 ▪    Cluster being copied to could be in different setup
 ▪    Can specify which Column Families to export
▪  Cons       :
 ▪    Keep another HBASE cluster up and ready
 ▪    Full scans and random reads
Facebook’s Backup Solution
Backups V1
                Log(Put
                A)
  Application             Backup
                          Cluster
                Log(Put
                A)
        Put A             Dedup

    HBase                 Verify
Backups V1 – Pros and Cons
▪  Pros   :
 ▪    Simple solution
 ▪    Consistency in backup
 ▪    Point in time restore
 ▪    Verification of backups
▪  Cons       :
 ▪    Requires replay of large amount of transactions
 ▪    Requires double writes and deduplication
Backups V2

               Flush Region
 RegionServe   Get File List           Mapper
      r


     Flush                     Copy
                               Files


                      HDFS
                                                .regioninfo
Backups V2 – Tuning


▪  Locality   based mappers
▪  Use   in rack replication
▪  Increase   .Trash retention for HDFS
▪  Fault   tolerant
▪  Use   Backups V1 for point in time
Backups V2 – Restore



▪  Rewrite   backed up .regioninfo
▪  Move   backup copy in place
▪  Add   regions to .META using .regioninfo
Backups V2 – Pros and Cons
▪  Pros   :
 ▪    Faster restore
 ▪    Backup entire data in hours
 ▪    Consistency in backup
 ▪    Point in time restore
 ▪    Resilient to RS death, region moves
▪  Cons       :
 ▪    Affects production cluster
 ▪    Not scalable with data growth
Backups V2 – HDFS Improvements


▪  Overhead    of copying large files
▪  Use   locality of data
▪  HDFS    HFiles are immutable
▪  HDFS    blocks are immutable
▪  Hardlinks   at block level!
Fast Copy workflow
           Source                                                       Destination

B1    B2   ………………..                                         B1’   B2’    ……………………
                                                                         ….




FastCopy Client                           Add Block
                                       Create Destination
                                          Get Source              NameNode




                          Copy Block




     B1             B1’                    B1         B1’         B1                  B1’

     B2             B2’                    B2         B2’         B2                  B2’

      Date Node1                             Date Node2                 Date Node3
FastCopy – Pros and Cons

▪  Pros   :
 ▪    Extremely fast
 ▪    Lots of space saving
 ▪    Minimal impact to production cluster
▪  Cons       :
 ▪    NameNode not aware
 ▪    Hardlinks lost on datanode death
 ▪    Balancer not aware.
Operations


▪  Messages    Use Case :
 ▪    3 stage (same cluster, off cluster, off data center)
 ▪    Stage 1 : once/ day
 ▪    Stage 2 : once / 10 day
 ▪    Stage 3 : once / 10 day
 ▪    Retention based on capacity
Results
Backup Numbers

Example :
▪    40 TB table
▪    49 Mappers
▪    Normal Copy – 15 hours
▪    Fast Copy – 1.5 hours
Disk Savings - FastCopy



Disk
usage in
percent
Network Traffic - FastCopy
Further Work
Further Work

▪    Backup HLogs
▪    Point in time backups
▪    Namenode level Hard links


▪    Code and JIRAs :
     ▪    HBASE 4618
     ▪    HDFS code in github (https://github.com/facebook/hadoop-20)
Acknowledgements

▪    Madhuwanti Vaidya

▪    Ryan Thiessen

▪    Karthik Ranganathan

▪    Paul Tuckfield

▪    Kannan Muthukkaruppan

▪    Hairong Kuang

▪    Dhruba Borthakur

▪    Amitanand Aiyer

▪    Mikhail Bautin
Questions ?
(c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0

More Related Content

What's hot

HBase 2.0 cluster topology
HBase 2.0 cluster topologyHBase 2.0 cluster topology
HBase 2.0 cluster topology
Mikhail Antonov
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
Frederik Engelen
 
Ibm db2 10.5 for linux, unix, and windows getting started with db2 installa...
Ibm db2 10.5 for linux, unix, and windows   getting started with db2 installa...Ibm db2 10.5 for linux, unix, and windows   getting started with db2 installa...
Ibm db2 10.5 for linux, unix, and windows getting started with db2 installa...
bupbechanhgmail
 
A First Look at the DB2 10 DSNZPARM Changes
A First Look at the DB2 10 DSNZPARM ChangesA First Look at the DB2 10 DSNZPARM Changes
A First Look at the DB2 10 DSNZPARM Changes
Willie Favero
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
Takrim Ul Islam Laskar
 
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, HortonworksHadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Cloudera, Inc.
 
DB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple StandbyDB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple Standby
Dale McInnis
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
Delhi/NCR HUG
 
11 cool features in Defrag.nsf+ 11
11 cool features in Defrag.nsf+ 1111 cool features in Defrag.nsf+ 11
11 cool features in Defrag.nsf+ 11
aosborne
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenes
Nitin Khattar
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
Mahendran Ponnusamy
 
Dnssec
DnssecDnssec
Dnssec
guest3131f85
 
Storage infrastructure using HBase behind LINE messages
Storage infrastructure using HBase behind LINE messagesStorage infrastructure using HBase behind LINE messages
Storage infrastructure using HBase behind LINE messages
LINE Corporation (Tech Unit)
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
Benoit Perroud
 
INFLOW-2014-NVM-Compression
INFLOW-2014-NVM-CompressionINFLOW-2014-NVM-Compression
INFLOW-2014-NVM-Compression
Dhananjoy ( Joy ) Das
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
Hortonworks
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Zhe Zhang
 
Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
Hanborq Inc.
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
shrey mehrotra
 
JONSMITH10042016
JONSMITH10042016JONSMITH10042016
JONSMITH10042016
Jon Smith
 

What's hot (20)

HBase 2.0 cluster topology
HBase 2.0 cluster topologyHBase 2.0 cluster topology
HBase 2.0 cluster topology
 
A DBA’s guide to using TSA
A DBA’s guide to using TSAA DBA’s guide to using TSA
A DBA’s guide to using TSA
 
Ibm db2 10.5 for linux, unix, and windows getting started with db2 installa...
Ibm db2 10.5 for linux, unix, and windows   getting started with db2 installa...Ibm db2 10.5 for linux, unix, and windows   getting started with db2 installa...
Ibm db2 10.5 for linux, unix, and windows getting started with db2 installa...
 
A First Look at the DB2 10 DSNZPARM Changes
A First Look at the DB2 10 DSNZPARM ChangesA First Look at the DB2 10 DSNZPARM Changes
A First Look at the DB2 10 DSNZPARM Changes
 
Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)Big data- HDFS(2nd presentation)
Big data- HDFS(2nd presentation)
 
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, HortonworksHadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
Hadoop World 2011: HDFS Federation - Suresh Srinivas, Hortonworks
 
DB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple StandbyDB2 V 10 HADR Multiple Standby
DB2 V 10 HADR Multiple Standby
 
Hadoop Architecture
Hadoop ArchitectureHadoop Architecture
Hadoop Architecture
 
11 cool features in Defrag.nsf+ 11
11 cool features in Defrag.nsf+ 1111 cool features in Defrag.nsf+ 11
11 cool features in Defrag.nsf+ 11
 
Hadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenesHadoop Distributed File System(HDFS) : Behind the scenes
Hadoop Distributed File System(HDFS) : Behind the scenes
 
Understanding Hadoop
Understanding HadoopUnderstanding Hadoop
Understanding Hadoop
 
Dnssec
DnssecDnssec
Dnssec
 
Storage infrastructure using HBase behind LINE messages
Storage infrastructure using HBase behind LINE messagesStorage infrastructure using HBase behind LINE messages
Storage infrastructure using HBase behind LINE messages
 
Hadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment EvolutionHadoop Successes and Failures to Drive Deployment Evolution
Hadoop Successes and Failures to Drive Deployment Evolution
 
INFLOW-2014-NVM-Compression
INFLOW-2014-NVM-CompressionINFLOW-2014-NVM-Compression
INFLOW-2014-NVM-Compression
 
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and ScalabilityHDFS Futures: NameNode Federation for Improved Efficiency and Scalability
HDFS Futures: NameNode Federation for Improved Efficiency and Scalability
 
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure CodingLess is More: 2X Storage Efficiency with HDFS Erasure Coding
Less is More: 2X Storage Efficiency with HDFS Erasure Coding
 
Hadoop HDFS NameNode HA
Hadoop HDFS NameNode HAHadoop HDFS NameNode HA
Hadoop HDFS NameNode HA
 
Introduction to hadoop and hdfs
Introduction to hadoop and hdfsIntroduction to hadoop and hdfs
Introduction to hadoop and hdfs
 
JONSMITH10042016
JONSMITH10042016JONSMITH10042016
JONSMITH10042016
 

Similar to Facebook's HBase Backups - StampedeCon 2012

Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
强 王
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
yongboy
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)
baggioss
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
DataWorks Summit
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
Cloudera, Inc.
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
Michael Stack
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage Challenge
DataWorks Summit
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
Cloudera, Inc.
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
MariaDB plc
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
Yoshinori Matsunobu
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
mundlapudi
 
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
Ontico
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at Last
Ceph Community
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
bergwolf
 
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay RadiaApache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Yahoo Developer Network
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
VMware Tanzu
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
Chris Nauroth
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
Ovidiu Dimulescu
 

Similar to Facebook's HBase Backups - StampedeCon 2012 (20)

Facebook Messages & HBase
Facebook Messages & HBaseFacebook Messages & HBase
Facebook Messages & HBase
 
支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统支撑Facebook消息处理的h base存储系统
支撑Facebook消息处理的h base存储系统
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)[Hi c2011]building mission critical messaging system(guoqiang jerry)
[Hi c2011]building mission critical messaging system(guoqiang jerry)
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
 
Putting Wings on the Elephant
Putting Wings on the ElephantPutting Wings on the Elephant
Putting Wings on the Elephant
 
Hadoop Backup and Disaster Recovery
Hadoop Backup and Disaster RecoveryHadoop Backup and Disaster Recovery
Hadoop Backup and Disaster Recovery
 
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
HBaseConAsia2018 Track1-5: Improving HBase reliability at PInterest with geo ...
 
Facebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage ChallengeFacebook's Approach to Big Data Storage Challenge
Facebook's Approach to Big Data Storage Challenge
 
Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010Facebook - Jonthan Gray - Hadoop World 2010
Facebook - Jonthan Gray - Hadoop World 2010
 
Migrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at FacebookMigrating from InnoDB and HBase to MyRocks at Facebook
Migrating from InnoDB and HBase to MyRocks at Facebook
 
MyRocks introduction and production deployment
MyRocks introduction and production deploymentMyRocks introduction and production deployment
MyRocks introduction and production deployment
 
Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)Hadoop - Disk Fail In Place (DFIP)
Hadoop - Disk Fail In Place (DFIP)
 
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
Cистема распределенного, масштабируемого и высоконадежного хранения данных дл...
 
CephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at LastCephFS in Jewel: Stable at Last
CephFS in Jewel: Stable at Last
 
CLFS 2010
CLFS 2010CLFS 2010
CLFS 2010
 
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay RadiaApache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
Apache Hadoop India Summit 2011 Keynote talk "HDFS Federation" by Sanjay Radia
 
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
Hadoop - Just the Basics for Big Data Rookies (SpringOne2GX 2013)
 
Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4Hdfs 2016-hadoop-summit-san-jose-v4
Hdfs 2016-hadoop-summit-san-jose-v4
 
Hadoop, Taming Elephants
Hadoop, Taming ElephantsHadoop, Taming Elephants
Hadoop, Taming Elephants
 

More from StampedeCon

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
StampedeCon
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
StampedeCon
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
StampedeCon
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
StampedeCon
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
StampedeCon
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
StampedeCon
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
StampedeCon
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
StampedeCon
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
StampedeCon
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
StampedeCon
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
StampedeCon
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
StampedeCon
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
StampedeCon
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
StampedeCon
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
StampedeCon
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
StampedeCon
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
StampedeCon
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
StampedeCon
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
StampedeCon
 

More from StampedeCon (20)

Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
Why Should We Trust You-Interpretability of Deep Neural Networks - StampedeCo...
 
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
The Search for a New Visual Search Beyond Language - StampedeCon AI Summit 2017
 
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
Predicting Outcomes When Your Outcomes are Graphs - StampedeCon AI Summit 2017
 
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
Novel Semi-supervised Probabilistic ML Approach to SNP Variant Calling - Stam...
 
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
How to Talk about AI to Non-analaysts - Stampedecon AI Summit 2017
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017Foundations of Machine Learning - StampedeCon AI Summit 2017
Foundations of Machine Learning - StampedeCon AI Summit 2017
 
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
Don't Start from Scratch: Transfer Learning for Novel Computer Vision Problem...
 
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
Bringing the Whole Elephant Into View Can Cognitive Systems Bring Real Soluti...
 
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
Automated AI The Next Frontier in Analytics - StampedeCon AI Summit 2017
 
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017AI in the Enterprise: Past,  Present &  Future - StampedeCon AI Summit 2017
AI in the Enterprise: Past, Present & Future - StampedeCon AI Summit 2017
 
A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017A Different Data Science Approach - StampedeCon AI Summit 2017
A Different Data Science Approach - StampedeCon AI Summit 2017
 
Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017Graph in Customer 360 - StampedeCon Big Data Conference 2017
Graph in Customer 360 - StampedeCon Big Data Conference 2017
 
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
End-to-end Big Data Projects with Python - StampedeCon Big Data Conference 2017
 
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
Doing Big Data Using Amazon's Analogs - StampedeCon Big Data Conference 2017
 
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
Enabling New Business Capabilities with Cloud-based Streaming Data Architectu...
 
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
Big Data Meets IoT: Lessons From the Cloud on Polling, Collecting, and Analyz...
 
Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016Innovation in the Data Warehouse - StampedeCon 2016
Innovation in the Data Warehouse - StampedeCon 2016
 
Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016Creating a Data Driven Organization - StampedeCon 2016
Creating a Data Driven Organization - StampedeCon 2016
 
Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016Using The Internet of Things for Population Health Management - StampedeCon 2016
Using The Internet of Things for Population Health Management - StampedeCon 2016
 

Recently uploaded

Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
MichaelKnudsen27
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
Shinana2
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
alexjohnson7307
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
ssuserfac0301
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
HarisZaheer8
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
DanBrown980551
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
Brandon Minnick, MBA
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
SitimaJohn
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
Ivanti
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
Tomaz Bratanic
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
Pixlogix Infotech
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
saastr
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
Hiike
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Tosin Akinosho
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
Zilliz
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Tatiana Kojar
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
Zilliz
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
saastr
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Jeffrey Haguewood
 

Recently uploaded (20)

Nordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptxNordic Marketo Engage User Group_June 13_ 2024.pptx
Nordic Marketo Engage User Group_June 13_ 2024.pptx
 
dbms calicut university B. sc Cs 4th sem.pdf
dbms  calicut university B. sc Cs 4th sem.pdfdbms  calicut university B. sc Cs 4th sem.pdf
dbms calicut university B. sc Cs 4th sem.pdf
 
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
leewayhertz.com-AI in predictive maintenance Use cases technologies benefits ...
 
Taking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdfTaking AI to the Next Level in Manufacturing.pdf
Taking AI to the Next Level in Manufacturing.pdf
 
AWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptxAWS Cloud Cost Optimization Presentation.pptx
AWS Cloud Cost Optimization Presentation.pptx
 
5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides5th LF Energy Power Grid Model Meet-up Slides
5th LF Energy Power Grid Model Meet-up Slides
 
Choosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptxChoosing The Best AWS Service For Your Website + API.pptx
Choosing The Best AWS Service For Your Website + API.pptx
 
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptxOcean lotus Threat actors project by John Sitima 2024 (1).pptx
Ocean lotus Threat actors project by John Sitima 2024 (1).pptx
 
June Patch Tuesday
June Patch TuesdayJune Patch Tuesday
June Patch Tuesday
 
GraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracyGraphRAG for Life Science to increase LLM accuracy
GraphRAG for Life Science to increase LLM accuracy
 
Best 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERPBest 20 SEO Techniques To Improve Website Visibility In SERP
Best 20 SEO Techniques To Improve Website Visibility In SERP
 
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStrDeep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
Deep Dive: Getting Funded with Jason Jason Lemkin Founder & CEO @ SaaStr
 
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - HiikeSystem Design Case Study: Building a Scalable E-Commerce Platform - Hiike
System Design Case Study: Building a Scalable E-Commerce Platform - Hiike
 
Monitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdfMonitoring and Managing Anomaly Detection on OpenShift.pdf
Monitoring and Managing Anomaly Detection on OpenShift.pdf
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
Building Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and MilvusBuilding Production Ready Search Pipelines with Spark and Milvus
Building Production Ready Search Pipelines with Spark and Milvus
 
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
Skybuffer AI: Advanced Conversational and Generative AI Solution on SAP Busin...
 
Generating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and MilvusGenerating privacy-protected synthetic data using Secludy and Milvus
Generating privacy-protected synthetic data using Secludy and Milvus
 
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
Deep Dive: AI-Powered Marketing to Get More Leads and Customers with HyperGro...
 
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
Salesforce Integration for Bonterra Impact Management (fka Social Solutions A...
 

Facebook's HBase Backups - StampedeCon 2012

  • 1.
  • 2. HBASE Backups Pritam Damania Software Engineer, Facebook Aug 1, 2012
  • 3. Agenda 1 Introduction to HBASE and HDFS 2 Introduction to HBASE Backups 3 Facebook’s Backup solution 4 Results 5 Further Work
  • 5. What is HDFS ? ▪  Distributed FileSystem ▪  Runs on top of commodity hardware ▪  Scale to Petabytes of data ▪  Tolerates machine failures
  • 6. HDFS Data Model ▪  Data is logically organized into files and directories ▪  Files are divided into uniform-sized blocks ▪  Blocks are distributed across the nodes of the cluster and are replicated to handle hardware failure ▪  HDFS keeps checksums of data for corruption detection and recovery ▪  HDFS exposes block placement so that computation can be migrated to data
  • 7. HDFS Data Model (2) MetaInfo(Filename, replicationFactor, block-ids, …) /users/user1/data/part-0, repl:2, ids: {1,3}, … /users/user1/data/part-1, repl:3, ids: {2,4,5}, … Block Storage 1 2 1 4 2 5 2 3 4 3 4 5 5 7
  • 8. HDFS Architecture Metadata (Name, #replicas, …): Namenode /users/foo/data, 3, … Metadata ops Block ops Client Metadata ops Read Datanodes Datanodes Replication Blocks Write Rack 2 Rack 1 Client 8
  • 10. HBase in a nutshell §  distributed, large-scale data store §  can host very large tables, billions of rows x millions of columns §  efficient at random reads/writes §  open source project modeled after Google’s BigTable
  • 11. HBase Data Model •  An HBase table is: •  a sparse , three-dimensional array of cells, indexed by: RowKey, ColumnKey, Timestamp/Version •  sharded into regions along an ordered RowKey space •  Within each region: •  Data is grouped into column families ▪  Sort order within each column family: •  Row Key (asc), Column Key (asc), Timestamp (desc)
  • 12. HBase System Overview Database Layer HBASE Master Backup Master Region Region Region ... Server Server Server Storage Layer Coordination Service HDFS Zookeeper Quorum Namenode Secondary Namenode ZK ZK ... Peer Peer Datanode Datanode Datanode ...
  • 13. HBase Overview HBASE Region Server .... Region #2 Region #1 .... ColumnFamily #2 ColumnFamily #1 Memstore (in memory data structure) HFiles (in HDFS) flush Write Ahead Log ( in HDFS)
  • 15. Why Backups ? ▪  Data Corruption ▪  Operational error ▪  Hardware failures ▪  Disaster
  • 16. Hbase Backups – The Problem ▪  Need a consistent, point in time backup ▪  Issues : ▪  Live cluster, with traffic ▪  Data in MemStore ▪  Flushes and Compations in the background ▪  Regionserver death ▪  Regions moving
  • 17. CURRENT OPTIONS – Export Table ▪  Pros : ▪  Can export part or full table ▪  Map-Reduce job downloads data to output path provided ▪  Supports start time, end time and versions so could provide a consistent backup ▪  Can specify which Column Families to export ▪  Cons : ▪  Only one table at a time ▪  Full scans and random reads
  • 18. CURRENT OPTIONS - Copy Table ▪  Tool to copy existing table to a intra/inter cluster ▪  Pros : ▪  Another parallel replicated setup to switch ▪  Supports start time, end time, and versions ▪  Cluster being copied to could be in different setup ▪  Can specify which Column Families to export ▪  Cons : ▪  Keep another HBASE cluster up and ready ▪  Full scans and random reads
  • 20. Backups V1 Log(Put A) Application Backup Cluster Log(Put A) Put A Dedup HBase Verify
  • 21. Backups V1 – Pros and Cons ▪  Pros : ▪  Simple solution ▪  Consistency in backup ▪  Point in time restore ▪  Verification of backups ▪  Cons : ▪  Requires replay of large amount of transactions ▪  Requires double writes and deduplication
  • 22. Backups V2 Flush Region RegionServe Get File List Mapper r Flush Copy Files HDFS .regioninfo
  • 23. Backups V2 – Tuning ▪  Locality based mappers ▪  Use in rack replication ▪  Increase .Trash retention for HDFS ▪  Fault tolerant ▪  Use Backups V1 for point in time
  • 24. Backups V2 – Restore ▪  Rewrite backed up .regioninfo ▪  Move backup copy in place ▪  Add regions to .META using .regioninfo
  • 25. Backups V2 – Pros and Cons ▪  Pros : ▪  Faster restore ▪  Backup entire data in hours ▪  Consistency in backup ▪  Point in time restore ▪  Resilient to RS death, region moves ▪  Cons : ▪  Affects production cluster ▪  Not scalable with data growth
  • 26. Backups V2 – HDFS Improvements ▪  Overhead of copying large files ▪  Use locality of data ▪  HDFS HFiles are immutable ▪  HDFS blocks are immutable ▪  Hardlinks at block level!
  • 27. Fast Copy workflow Source Destination B1 B2 ……………….. B1’ B2’ …………………… …. FastCopy Client Add Block Create Destination Get Source NameNode Copy Block B1 B1’ B1 B1’ B1 B1’ B2 B2’ B2 B2’ B2 B2’ Date Node1 Date Node2 Date Node3
  • 28. FastCopy – Pros and Cons ▪  Pros : ▪  Extremely fast ▪  Lots of space saving ▪  Minimal impact to production cluster ▪  Cons : ▪  NameNode not aware ▪  Hardlinks lost on datanode death ▪  Balancer not aware.
  • 29. Operations ▪  Messages Use Case : ▪  3 stage (same cluster, off cluster, off data center) ▪  Stage 1 : once/ day ▪  Stage 2 : once / 10 day ▪  Stage 3 : once / 10 day ▪  Retention based on capacity
  • 31. Backup Numbers Example : ▪  40 TB table ▪  49 Mappers ▪  Normal Copy – 15 hours ▪  Fast Copy – 1.5 hours
  • 32. Disk Savings - FastCopy Disk usage in percent
  • 33. Network Traffic - FastCopy
  • 35. Further Work ▪  Backup HLogs ▪  Point in time backups ▪  Namenode level Hard links ▪  Code and JIRAs : ▪  HBASE 4618 ▪  HDFS code in github (https://github.com/facebook/hadoop-20)
  • 36. Acknowledgements ▪  Madhuwanti Vaidya ▪  Ryan Thiessen ▪  Karthik Ranganathan ▪  Paul Tuckfield ▪  Kannan Muthukkaruppan ▪  Hairong Kuang ▪  Dhruba Borthakur ▪  Amitanand Aiyer ▪  Mikhail Bautin
  • 38. (c) 2009 Facebook, Inc. or its licensors. "Facebook" is a registered trademark of Facebook, Inc.. All rights reserved. 1.0