How to get the MTTR below 1
minute and more
Devaraj Das
(ddas@hortonworks.com)
Nicolas Liochon
(nkeywal@gmail.com)
Outline
• What is this? Why are we talking about this
topic? Why does it matter?
• HBase Recovery – an overview
• HDFS issues
• Beyond MTTR (Performance post recovery)
• Conclusion / Future / Q & A
What is MTTR? Why is it important?
• Mean Time To Recovery -> Average time
required to repair a failed component (Courtesy:
Wikipedia)
• Enterprises want an MTTR of ZERO
– Data should always be available with no
degradation of perceived SLAs
– Practically hard to obtain but yeah it’s a goal
• Close to Zero-MTTR is especially important for
HBase
– Given it is used in near realtime systems
HBase Basics
• Strongly consistent
– Write ordered with reads
– Once written, the data will stay
• Built on top of HDFS
• When a machine fails, the cluster remains
available, and so does its data
• We’re only talking about the piece of data that
was handled by this machine
Write path
WAL – Write
Ahead Log
A write is
finished once
written on all
HDFS nodes
The client
communicates
with the region
servers
We’re in a distributed system
• You can’t distinguish a
slow server from a
dead server
• Everything, or nearly
everything, is based
on timeouts
• Smaller timeouts mean more false positives
• HBase works well with false positives, but
they always have a cost.
• The fewer the timeouts, the better
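The trade-off between timeout length and false positives can be illustrated with a minimal sketch. The pause durations below are hypothetical values for a healthy server (GC pauses, network hiccups), and `false_positives` is an illustrative helper, not an HBase or ZooKeeper function.

```python
def false_positives(pauses_s, timeout_s):
    """Count pauses of a *live* server that a timeout-based detector
    would wrongly treat as a failure (pause longer than the timeout)."""
    return sum(1 for pause in pauses_s if pause > timeout_s)

# Hypothetical pauses (in seconds) observed on a server that never died.
pauses = [0.2, 1.5, 4.0, 12.0, 45.0]

for timeout in (180, 30, 10, 3):
    print(f"timeout={timeout:>3}s -> {false_positives(pauses, timeout)} false positive(s)")
```

With these made-up numbers, a 180s timeout never misfires, while each shorter timeout turns one more ordinary pause into a spurious recovery.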
HBase components for recovery
Recovery in action
Recovery process
• Failure detection: ZooKeeper
heartbeats the servers and expires
the session when a server does not
reply
• Region assignment: the
master reallocates the regions
to the other servers
• Failure recovery: read the WAL
and rewrite the data
• The client drops the
connection to the dead server
and goes to the new one.
[Figure: recovery steps and the components involved. Labels: ZK (heartbeat); Client; Region Servers, DataNode (data recovery); Master, RS, ZK (region assignment)]
So….
• Detect the failure as fast as possible
• Reassign as fast as possible
• Read / rewrite the WAL as fast as possible
• That’s obvious
The obvious – failure detection
• Failure detection
– Set a ZooKeeper timeout to 30s instead of the old 180s
default.
– Beware of the GC, but lower values are possible.
– ZooKeeper detects the errors sooner than the configured
timeout
• 0.96
– HBase scripts clean the ZK node when the server is
kill -9'ed
• => Detection time becomes 0
– Can be used by any monitoring tool
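As a sketch of the 30s ZooKeeper timeout setting mentioned above: the snippet renders a Hadoop-style `<configuration>` fragment carrying `zookeeper.session.timeout` (in milliseconds), the property HBase reads from hbase-site.xml. The helper name `hadoop_site_xml` is hypothetical. Note that the ZooKeeper server also bounds the negotiated session timeout, so a server-side limit may need checking too.

```python
import xml.etree.ElementTree as ET

def hadoop_site_xml(props):
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in props.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")

# 30 seconds, expressed in milliseconds.
site = hadoop_site_xml({"zookeeper.session.timeout": 30_000})
print(site)
```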
The obvious – faster data recovery
• Not so obvious actually
• Already distributed since 0.92
– The larger the cluster the better.
• Completely rewritten in 0.96
– Recovery itself rewritten in 0.96
– Will be covered in the second part
The obvious – Faster assignment
• Faster assignment
– Just improving performance
• Parallelism
• Speed
– Globally ‘much’ faster
– Backported to 0.94
• Still possible to do better for a huge number of
regions.
• A few seconds for most cases
With this
• Detection: from 180s to 30s
• Data recovery: around 10s
• Reassignment: from tens of seconds to
seconds
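Adding up the three phases gives a rough recovery budget. The detection and data recovery figures come from the slide; the 5s reassignment value is an assumption standing in for "seconds".

```python
# Per-phase durations in seconds. "reassignment" is an assumed value:
# the slide only says "seconds".
phases_s = {"detection": 30, "data_recovery": 10, "reassignment": 5}

total_s = sum(phases_s.values())
for phase, seconds in phases_s.items():
    print(f"{phase:>14}: {seconds:3d}s ({seconds / total_s:.0%} of total)")
print(f"{'total':>14}: {total_s:3d}s, under one minute: {total_s < 60}")
```

Under these assumptions detection dominates the budget, which is why the later slides keep attacking the detection time.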
Do you think we’re better with this?
• Answer is NO
• Actually, yes but if and only if HDFS is fine
– But when you lose a regionserver, you’ve just lost
a datanode
DataNode crash is expensive!
• One replica of WAL edits is on the crashed DN
– 33% of the reads during the regionserver recovery
will go to it
• Many writes will go to it as well (the smaller
the cluster, the higher that probability)
• NameNode re-replicates the data (maybe TBs)
that was on this node to restore replica count
– NameNode does this work only after a good
timeout (10 minutes by default)
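A back-of-the-envelope sketch of why the crashed DataNode hurts more on small clusters. It assumes 3 replicas, a writer that keeps the first replica local, and the other two replicas drawn uniformly from the remaining DataNodes while the NameNode still considers the crashed node usable. Real HDFS placement is rack-aware, so these are only indicative numbers, and the function name is hypothetical.

```python
def p_dead_in_write_pipeline(n_datanodes, remote_replicas=2):
    """Probability that a new write pipeline includes the crashed DataNode,
    assuming remote replicas are picked uniformly among the other nodes."""
    others = n_datanodes - 1  # every node except the local writer
    if others < remote_replicas:
        raise ValueError("not enough DataNodes for the requested replication")
    return remote_replicas / others

# A read of a WAL block picks one of its 3 replicas; one sits on the dead node.
p_read_hits_dead = 1 / 3

for n in (5, 21, 101):
    print(f"{n:>3} DataNodes: {p_dead_in_write_pipeline(n):.0%} of writes touch the dead node")
print(f"reads hitting the dead replica: {p_read_hits_dead:.0%}")
```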
HDFS – Stale mode
• Live: as today, used for reads &
writes, using locality
• Stale (after 30 seconds, can be less): not used for writes, used as
last resort for reads
• Dead (after 10 minutes, don’t change this): as today, not used
• And actually, it’s better to do the HBase
recovery before HDFS replicates the TBs
of data of this node
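The three states can be sketched as a simple classification on the time since the last heartbeat, using the 30s and 10 minute thresholds from the slide. The function names are illustrative, not HDFS code. In HDFS itself this behaviour is believed to be driven by `dfs.namenode.stale.datanode.interval`, `dfs.namenode.avoid.read.stale.datanode` and `dfs.namenode.avoid.write.stale.datanode`; check the hdfs-default.xml of your version before relying on those names.

```python
from enum import Enum

class NodeState(Enum):
    LIVE = "live"
    STALE = "stale"
    DEAD = "dead"

STALE_AFTER_S = 30    # slide: 30 seconds, can be less
DEAD_AFTER_S = 600    # slide: 10 minutes, don't change this

def classify(age_s):
    """Map 'seconds since last heartbeat' to a node state."""
    if age_s >= DEAD_AFTER_S:
        return NodeState.DEAD
    if age_s >= STALE_AFTER_S:
        return NodeState.STALE
    return NodeState.LIVE

def order_for_read(replica_ages):
    """Drop dead replicas; keep the given (locality) order but push stale ones last."""
    usable = [n for n, age in replica_ages.items() if classify(age) is not NodeState.DEAD]
    # sorted() is stable, so live replicas keep their relative order.
    return sorted(usable, key=lambda n: classify(replica_ages[n]) is NodeState.STALE)

def targets_for_write(node_ages):
    """Only live nodes are eligible for new write pipelines."""
    return [n for n, age in node_ages.items() if classify(age) is NodeState.LIVE]

ages = {"dn1": 45, "dn2": 2, "dn3": 700}   # hypothetical heartbeat ages
print(order_for_read(ages), targets_for_write(ages))
```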
Results
• No more read/write HDFS errors during the
recovery
• Multiple failures are still possible
– Stale mode will still play its role
– And set dfs.timeout to 30s
– This limits the effect of two failures in a row. The
cost of the second failure is 30s if you were
unlucky
Are we done?
• We’re not bad
• But there is still something
The client
You left it waiting on the dead server
Here it is
The client
• You want the client to be patient
• Retrying when the system is already loaded is
not good.
• You want the client to learn about region
servers dying, and to be able to react
immediately.
• You want this to scale.
Solution
• The master notifies the client
– A cheap multicast message with the “dead servers”
list. Sent 5 times for safety.
– Off by default.
– On reception, the client immediately stops waiting on
the TCP connection. You can now enjoy a large
hbase.rpc.timeout
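A minimal sketch of the client-side reaction, assuming a hypothetical `ClientCalls` tracker rather than the real HBase client: in-flight calls are grouped by server, and a "dead servers" notification fails the affected calls at once instead of letting them wait for the RPC timeout. Because the message may arrive several times, handling it has to be idempotent.

```python
from collections import defaultdict
from concurrent.futures import Future

class DeadServerError(Exception):
    """Raised to callers whose target server was reported dead."""

class ClientCalls:
    def __init__(self):
        self._pending = defaultdict(list)  # server name -> in-flight futures

    def start_call(self, server):
        # A bare Future stands in for an RPC that is waiting on the network.
        future = Future()
        self._pending[server].append(future)
        return future

    def on_dead_servers(self, dead_servers):
        """Fail every in-flight call to a dead server; safe to call repeatedly."""
        for server in dead_servers:
            for future in self._pending.pop(server, []):
                if not future.done():
                    future.set_exception(DeadServerError(server))

calls = ClientCalls()
to_rs1 = calls.start_call("rs1")
to_rs2 = calls.start_call("rs2")
for _ in range(5):                 # the same notification, received 5 times
    calls.on_dead_servers(["rs1"])
print(type(to_rs1.exception(timeout=0)).__name__, to_rs2.done())
```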
Are we done?
• In a way, yes
– There are a lot of things around asynchronous
writes, reads during recovery
– Will be for another time, but there will be some
nice things in 0.96
• And a couple of them are presented in the
second part of this talk!
Faster recovery
• Previous algo
– Read the WAL files
– Write new HFiles
– Tell the region server it got new HFiles
• Puts pressure on namenode
– Remember: don’t put pressure on the namenode
• New algo:
– Read the WAL
– Write to the regionserver
– We’re done (have seen great improvements in our tests)
– TBD: Assign the WAL to a RegionServer local to a replica
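The difference between the two algorithms can be sketched on a toy WAL, modelled here as a list of (region, edit) pairs. `split_log` groups edits into per-region outputs (files that would have to be written to HDFS), while `replay_log` sends each edit straight to the server now hosting the region. All names and the `send` callback are hypothetical stand-ins, not HBase APIs.

```python
from collections import defaultdict

# Toy WAL: edits from several regions, interleaved in write order.
wal = [("region2", "edit1"), ("region1", "edit2"),
       ("region3", "edit1"), ("region1", "edit3")]

def split_log(wal_entries):
    """Previous approach: one output per region, preserving edit order."""
    per_region = defaultdict(list)
    for region, edit in wal_entries:
        per_region[region].append(edit)
    return dict(per_region)

def replay_log(wal_entries, region_location, send):
    """New approach: forward each edit to the region's new server, no intermediate files."""
    for region, edit in wal_entries:
        send(region_location[region], region, edit)

print(split_log(wal))

sent = []
location = {"region1": "rs1", "region2": "rs2", "region3": "rs3"}
replay_log(wal, location, lambda server, region, edit: sent.append((server, region, edit)))
print(sent)
```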
Distributed log split
[Figure: RegionServer0’s WAL files (WAL-file1 to WAL-file3) sit in HDFS, each interleaving edits from several regions, e.g. <region2:edit1><region1:edit2> … <region3:edit1>. RegionServer_x and RegionServer_y read them and write one split-log file per region (Splitlog-file-for-region1 to Splitlog-file-for-region3) back to HDFS; RegionServer1 to RegionServer3 then read those files.]
Distributed log replay
[Figure: RegionServer_x and RegionServer_y read RegionServer0’s WAL files (WAL-file1 to WAL-file3) from HDFS and replay the edits to RegionServer1 to RegionServer3, which write the recovered files for their regions (Recovered-file-for-region1 to Recovered-file-for-region3) to HDFS.]
Write during recovery
• Hey, you can write during the WAL replay
• Events stream: your new recovery time is the
failure detection time: max 30s, likely less!
MemStore flush
• Real life: some tables are updated at a given
moment then left alone
– With a non empty memstore
– More data to recover
• It’s now possible to guarantee that we don’t
have a MemStore with old data
• Improves real life MTTR
• Helps snapshots
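The idea of an age-based flush can be sketched as follows: any MemStore whose oldest unflushed edit exceeds a maximum age is selected for flushing, so little old data is left to recover from the WAL. The one hour threshold and the region names are assumptions for illustration only.

```python
def memstores_to_flush(oldest_edit_age_s, max_age_s=3600):
    """Return the regions whose oldest unflushed edit is at least max_age_s old.

    oldest_edit_age_s maps region name -> age in seconds of its oldest
    unflushed edit, or None when the MemStore is empty.
    """
    return [region for region, age in oldest_edit_age_s.items()
            if age is not None and age >= max_age_s]

# Hypothetical snapshot of MemStore ages on one region server.
ages = {"t1,r1": 7200, "t1,r2": 10, "t2,r1": None}
print(memstores_to_flush(ages))
```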
.META.
• .META.
– There is no -ROOT- in 0.95/0.96
– But .META. failures are critical
• A lot of small improvements
– The server now tells the client when a region has
moved (the client can avoid going to .META.)
• And a big one
– .META. WAL is managed separately to allow an
immediate recovery of META
– Together with the new MemStore flush, this ensures a quick
recovery
Data locality post recovery
• HBase performance depends on data-locality
• After a recovery, you’ve lost it
– Bad for performance
• Here come region groups
• Assign 3 favored RegionServers for every region
– Primary, Secondary, Tertiary
• On failures assign the region to one of the
Secondary or Tertiary depending on load
• The data-locality issue is minimized on failures
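The failover choice described above can be sketched as picking the less loaded of the surviving Secondary and Tertiary servers. The load metric (for example, number of regions hosted) and the function name are assumptions; what happens when both are down is left out of this sketch.

```python
def pick_failover_server(favored, load, dead):
    """Choose where to reopen a region after its primary server failed.

    favored: (primary, secondary, tertiary) server names for the region
    load:    server name -> load figure (assumed: regions currently hosted)
    dead:    set of servers known to be down
    """
    _primary, secondary, tertiary = favored
    candidates = [s for s in (secondary, tertiary) if s not in dead]
    if not candidates:
        return None  # no favored server left; fallback is out of scope here
    return min(candidates, key=lambda s: load.get(s, 0))

favored = ("rs1", "rs2", "rs3")
load = {"rs2": 120, "rs3": 80}
print(pick_failover_server(favored, load, dead={"rs1"}))
```

Since the region's blocks already have replicas on these servers, either choice keeps reads local.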
[Figure, default block placement: replicas of Block1, Block2 and Block3 are spread over DataNodes in Rack1, Rack2 and Rack3. RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster with one replica local to RegionServer1. In the post-failure panel, which shows RegionServer4, the labels read "Reads Blk1 and Blk2 remotely" and "Reads Blk3 remotely".]
[Figure, block placement with favored RegionServers: RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks. In the post-failure panel the label reads "No remote reads".]
Conclusion
• Our tests show that the recovery time has come
down from 10-15 minutes to less than 1 minute
– All the way from failure to recovery (and not just
recovery)
• Most of it is available in 0.96, some parts were
back-ported to 0.94.x
• Real life testing of the improvements in progress
– Pre-production deployments’ testing in progress
• Room for more improvements
– For example, asynchronous puts / gets
Q & A
Thanks!
• Devaraj Das
– ddas@hortonworks.com, @ddraj
• Nicolas Liochon
– nkeywal@gmail.com, @nkeywal

More Related Content

What's hot

HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
Cloudera, Inc.
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
HBaseCon
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseenissoz
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
Cloudera, Inc.
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
Cloudera, Inc.
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
HBaseCon
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
HBaseCon
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
Cloudera, Inc.
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
Cloudera, Inc.
 
HBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBaseHBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBase
HBaseCon
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
enissoz
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
DataWorks Summit/Hadoop Summit
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Scott Miao
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
dave_revell
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon
 

What's hot (19)

HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, SalesforceHBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
HBaseCon 2012 | Learning HBase Internals - Lars Hofhansl, Salesforce
 
HBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond PanelHBaseCon 2015: HBase 2.0 and Beyond Panel
HBaseCon 2015: HBase 2.0 and Beyond Panel
 
Tales from the Cloudera Field
Tales from the Cloudera FieldTales from the Cloudera Field
Tales from the Cloudera Field
 
HBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBaseHBase and HDFS: Understanding FileSystem Usage in HBase
HBase and HDFS: Understanding FileSystem Usage in HBase
 
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, PhotobucketHBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
HBaseCon 2012 | Solbase - Kyungseog Oh, Photobucket
 
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - ClouderaHBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
HBaseCon 2012 | Base Metrics: What They Mean to You - Cloudera
 
HBase: Extreme Makeover
HBase: Extreme MakeoverHBase: Extreme Makeover
HBase: Extreme Makeover
 
Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction Apache HBase, Accelerated: In-Memory Flush and Compaction
Apache HBase, Accelerated: In-Memory Flush and Compaction
 
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, ClouderaHBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
HBaseCon 2012 | HBase and HDFS: Past, Present, Future - Todd Lipcon, Cloudera
 
HBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at XiaomiHBaseCon 2015: HBase Operations at Xiaomi
HBaseCon 2015: HBase Operations at Xiaomi
 
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
HBaseCon 2013: Apache HBase, Meet Ops. Ops, Meet Apache HBase.
 
HBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBaseHBaseCon 2015: Multitenancy in HBase
HBaseCon 2015: Multitenancy in HBase
 
HBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environmentHBaseCon2017 Improving HBase availability in a multi tenant environment
HBaseCon2017 Improving HBase availability in a multi tenant environment
 
Meet HBase 1.0
Meet HBase 1.0Meet HBase 1.0
Meet HBase 1.0
 
HBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and CompactionHBase Accelerated: In-Memory Flush and Compaction
HBase Accelerated: In-Memory Flush and Compaction
 
Zero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter MigrationZero-downtime Hadoop/HBase Cross-datacenter Migration
Zero-downtime Hadoop/HBase Cross-datacenter Migration
 
Near-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBaseNear-realtime analytics with Kafka and HBase
Near-realtime analytics with Kafka and HBase
 
HBase Low Latency
HBase Low LatencyHBase Low Latency
HBase Low Latency
 
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in PinterestHBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
HBaseCon2017 Removable singularity: a story of HBase upgrade in Pinterest
 

Viewers also liked

HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
Cloudera, Inc.
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
Cloudera, Inc.
 
HBaseCon 2015: Meet HBase 1.0
HBaseCon 2015: Meet HBase 1.0HBaseCon 2015: Meet HBase 1.0
HBaseCon 2015: Meet HBase 1.0
HBaseCon
 
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
DataStax Academy
 
Unit 9 implementing the reliability strategy
Unit 9  implementing the reliability strategyUnit 9  implementing the reliability strategy
Unit 9 implementing the reliability strategy
Charlton Inao
 
10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability
Ricky Smith CMRP, CMRT
 
How to measure reliability
How to measure reliabilityHow to measure reliability
How to measure reliability 2
How to measure reliability 2How to measure reliability 2
Asset Reliability Begins With Your Operators
Asset Reliability Begins With Your OperatorsAsset Reliability Begins With Your Operators
Asset Reliability Begins With Your Operators
Ricky Smith CMRP, CMRT
 
Reliability - Availability
Reliability -  AvailabilityReliability -  Availability
Reliability - Availability
Tom Jacyszyn
 
Software Availability by Resiliency
Software Availability by ResiliencySoftware Availability by Resiliency
Software Availability by Resiliency
Reza Samei
 
The Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset ReliabilityThe Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset Reliability
Ricky Smith CMRP, CMRT
 
Draft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologiesDraft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologies
Accendo Reliability
 
Misuses of MTBF
Misuses of MTBFMisuses of MTBF
Misuses of MTBF
Accendo Reliability
 
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other EventsTracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Array Technologies, Inc.
 
Efficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangEfficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangASQ Reliability Division
 
Metastability,MTBF,synchronizer & synchronizer failure
Metastability,MTBF,synchronizer & synchronizer failureMetastability,MTBF,synchronizer & synchronizer failure
Metastability,MTBF,synchronizer & synchronizer failureprashant singh
 
Reliability Modeling Using Degradation Data - by Harry Guo
Reliability Modeling Using Degradation Data - by Harry GuoReliability Modeling Using Degradation Data - by Harry Guo
Reliability Modeling Using Degradation Data - by Harry GuoASQ Reliability Division
 
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
CA Technologies
 

Viewers also liked (20)

HBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at PinterestHBaseCon 2013: Apache HBase Operations at Pinterest
HBaseCon 2013: Apache HBase Operations at Pinterest
 
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big DataHBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
HBaseCon 2012 | You’ve got HBase! How AOL Mail Handles Big Data
 
HBaseCon 2015: Meet HBase 1.0
HBaseCon 2015: Meet HBase 1.0HBaseCon 2015: Meet HBase 1.0
HBaseCon 2015: Meet HBase 1.0
 
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
C* Summit 2013: Eventual Consistency != Hopeful Consistency by Christos Kalan...
 
Unit 9 implementing the reliability strategy
Unit 9  implementing the reliability strategyUnit 9  implementing the reliability strategy
Unit 9 implementing the reliability strategy
 
10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability10 Things an Operations Supervisor can do Today to Improve Reliability
10 Things an Operations Supervisor can do Today to Improve Reliability
 
How to measure reliability
How to measure reliabilityHow to measure reliability
How to measure reliability
 
How to measure reliability 2
How to measure reliability 2How to measure reliability 2
How to measure reliability 2
 
Asset Reliability Begins With Your Operators
Asset Reliability Begins With Your OperatorsAsset Reliability Begins With Your Operators
Asset Reliability Begins With Your Operators
 
Reliability - Availability
Reliability -  AvailabilityReliability -  Availability
Reliability - Availability
 
Software Availability by Resiliency
Software Availability by ResiliencySoftware Availability by Resiliency
Software Availability by Resiliency
 
The Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset ReliabilityThe Seven Deadly Sins in Measuring Asset Reliability
The Seven Deadly Sins in Measuring Asset Reliability
 
Draft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologiesDraft comparison of electronic reliability prediction methodologies
Draft comparison of electronic reliability prediction methodologies
 
Misuses of MTBF
Misuses of MTBFMisuses of MTBF
Misuses of MTBF
 
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other EventsTracker Lifetime Cost: MTBF, Lifetime and Other Events
Tracker Lifetime Cost: MTBF, Lifetime and Other Events
 
Efficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin YangEfficient Reliability Demonstration Tests - by Guangbin Yang
Efficient Reliability Demonstration Tests - by Guangbin Yang
 
Metastability,MTBF,synchronizer & synchronizer failure
Metastability,MTBF,synchronizer & synchronizer failureMetastability,MTBF,synchronizer & synchronizer failure
Metastability,MTBF,synchronizer & synchronizer failure
 
Overview and Basic Maintenance
Overview and Basic MaintenanceOverview and Basic Maintenance
Overview and Basic Maintenance
 
Reliability Modeling Using Degradation Data - by Harry Guo
Reliability Modeling Using Degradation Data - by Harry GuoReliability Modeling Using Degradation Data - by Harry Guo
Reliability Modeling Using Degradation Data - by Harry Guo
 
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
Technology Primer: Attain Faster MTTR through CA Application Performance Mana...
 

Similar to HBaseCon 2013: How to Get the MTTR Below 1 Minute and More

HBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minuteHBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minute
Hortonworks
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
ScyllaDB
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
Nick Dimiduk
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Ayon Sinha
 
HBASE by Nicolas Liochon - Meetup HUGFR du 22 Sept 2014
HBASE by  Nicolas Liochon - Meetup HUGFR du 22 Sept 2014HBASE by  Nicolas Liochon - Meetup HUGFR du 22 Sept 2014
HBASE by Nicolas Liochon - Meetup HUGFR du 22 Sept 2014
Modern Data Stack France
 
M6d cassandrapresentation
M6d cassandrapresentationM6d cassandrapresentation
M6d cassandrapresentationEdward Capriolo
 
Apache HBase Performance Tuning
Apache HBase Performance TuningApache HBase Performance Tuning
Apache HBase Performance Tuning
Lars Hofhansl
 
Introduction to Galera Cluster
Introduction to Galera ClusterIntroduction to Galera Cluster
Introduction to Galera Cluster
Codership Oy - Creators of Galera Cluster
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
MariaDB plc
 
Architectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop DistributionArchitectural Overview of MapR's Apache Hadoop Distribution
Architectural Overview of MapR's Apache Hadoop Distribution
mcsrivas
 
Choosing the right high availability strategy
Choosing the right high availability strategyChoosing the right high availability strategy
Choosing the right high availability strategy
MariaDB plc
 
HBase Operations and Best Practices
HBase Operations and Best PracticesHBase Operations and Best Practices
HBase Operations and Best Practices
Venu Anuganti
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
HDFS_architecture.ppt
HDFS_architecture.pptHDFS_architecture.ppt
HDFS_architecture.ppt
vijayapraba1
 
hbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecturehbaseconasia2017: Large scale data near-line loading method and architecture
hbaseconasia2017: Large scale data near-line loading method and architecture
HBaseCon
 
Kudu austin oct 2015.pptx
Kudu austin oct 2015.pptxKudu austin oct 2015.pptx
Kudu austin oct 2015.pptx
Felicia Haggarty
 
Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?Hadoop 3.0 - Revolution or evolution?
Hadoop 3.0 - Revolution or evolution?
Uwe Printz
 
Bigdata workshop february 2015
Bigdata workshop  february 2015 Bigdata workshop  february 2015
Bigdata workshop february 2015
clairvoyantllc
 
Maximizing performance via tuning and optimization
Maximizing performance via tuning and optimizationMaximizing performance via tuning and optimization
Maximizing performance via tuning and optimization
MariaDB plc
 
Maximizing performance via tuning and optimization
Maximizing performance via tuning and optimizationMaximizing performance via tuning and optimization
Maximizing performance via tuning and optimization
MariaDB plc
 

Similar to HBaseCon 2013: How to Get the MTTR Below 1 Minute and More (20)

HBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minuteHBase: How to get MTTR below 1 minute
HBase: How to get MTTR below 1 minute
 
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
Scylla Summit 2016: Outbrain Case Study - Lowering Latency While Doing 20X IO...
 
HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014HBase Low Latency, StrataNYC 2014
HBase Low Latency, StrataNYC 2014
 
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
Eventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and HadoopEventual Consistency @WalmartLabs with Kafka, Avro, SolrCloud and Hadoop
HBaseCon 2013: How to Get the MTTR Below 1 Minute and More

  • 1. How to get the MTTR below 1 minute and more Devaraj Das (ddas@hortonworks.com) Nicolas Liochon (nkeywal@gmail.com)
  • 2. Outline • What is this? Why are we talking about this topic? Why does it matter? … • HBase Recovery – an overview • HDFS issues • Beyond MTTR (Performance post recovery) • Conclusion / Future / Q & A
  • 3. What is MTTR? Why is it important? … • Mean Time To Recovery -> Average time required to repair a failed component (Courtesy: Wikipedia) • Enterprises want an MTTR of ZERO – Data should always be available with no degradation of perceived SLAs – Hard to achieve in practice, but it’s a goal • Close to Zero-MTTR is especially important for HBase – Given it is used in near-real-time systems
  • 4. HBase Basics • Strongly consistent – Writes ordered with reads – Once written, the data will stay • Built on top of HDFS • When a machine fails, the cluster remains available, and so does its data • We’re only talking about the piece of data that was handled by this machine
  • 5. Write path • WAL – Write Ahead Log • A write is finished once written on all HDFS nodes • The client communicates with the region servers
  • 6. We’re in a distributed system • You can’t distinguish a slow server from a dead server • Everything, or nearly everything, is based on timeouts • Smaller timeouts mean more false positives • HBase works well with false positives, but they always have a cost. • The fewer timeouts, the better
  • 9. Recovery process • Failure detection: ZooKeeper heartbeats the servers and expires a server's session when it does not reply • Region assignment: the master reallocates the regions to the other servers • Failure recovery: read the WAL and rewrite the data again • The client drops its connection to the dead server and goes to the new one. [Diagram labels: ZK Heartbeat; Client; Region Servers, DataNode; Data recovery; Master, RS, ZK; Region Assignment]
  • 10. So…. • Detect the failure as fast as possible • Reassign as fast as possible • Read / rewrite the WAL as fast as possible • That’s obvious
  • 11. The obvious – failure detection • Failure detection – Set the ZooKeeper timeout to 30s instead of the old 180s default. – Beware of the GC, but lower values are possible. – ZooKeeper detects the errors sooner than the configured timeout • 0.96 – HBase scripts clean the ZK node when the server is killed with kill -9 • => Detection time becomes 0 – Can be used by any monitoring tool
  • 12. The obvious – faster data recovery • Not so obvious actually • Already distributed since 0.92 – The larger the cluster the better. • Completely rewritten in 0.96 – Recovery itself rewritten in 0.96 – Will be covered in the second part
  • 13. The obvious – Faster assignment • Faster assignment – Just improving performance • Parallelism • Speed – Globally ‘much’ faster – Backported to 0.94 • Still possible to do better for a huge number of regions. • A few seconds for most cases
  • 14. With this • Detection: from 180s to 30s • Data recovery: around 10s • Reassignment: from tens of seconds to seconds
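To make the figures on slide 14 concrete, here is a rough budget calculation. It is only a sketch: it assumes the three phases run one after another, the function name is hypothetical, and the reassignment values (30 s before, 3 s after) are placeholders chosen to match "tens of seconds" and "seconds", not measurements from the talk.

```python
# Illustrative recovery-time budget using the figures quoted on the slide.
# The function name is hypothetical; this is not an HBase API.

def total_recovery_seconds(detection, data_recovery, reassignment):
    """Sum the three phases, assuming they run sequentially."""
    return detection + data_recovery + reassignment

# Assumed placeholders: 30 s and 3 s for reassignment. The slide gives no
# "before" figure for data recovery, so 10 s is reused on both sides.
before = total_recovery_seconds(detection=180, data_recovery=10, reassignment=30)
after = total_recovery_seconds(detection=30, data_recovery=10, reassignment=3)

print(f"before: {before}s, after: {after}s")
```

Under these assumptions, detection dominates the budget in both cases, which is why the ZooKeeper timeout is the first setting to look at.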
  • 15. Do you think we’re better with this? • The answer is NO • Actually, yes but if and only if HDFS is fine – But when you lose a regionserver, you’ve just lost a datanode
  • 16. DataNode crash is expensive! • One replica of WAL edits is on the crashed DN – 33% of the reads during the regionserver recovery will go to it • Many writes will go to it as well (the smaller the cluster, the higher that probability) • NameNode re-replicates the data (maybe TBs) that was on this node to restore replica count – NameNode does this work only after a good timeout (10 minutes by default)
  • 17. HDFS – Stale mode • Live: as today, used for reads & writes, using locality • Stale (after 30 seconds, can be less): not used for writes, used as last resort for reads • Dead (after 10 minutes, don’t change this): as today, not used. And actually, it’s better to do the HBase recovery before HDFS replicates the TBs of data of this node
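The three states on slide 17 can be sketched as a small replica-selection routine. This is an illustration of the idea only, not HDFS code: the `Replica` class, the function names and the heartbeat-age field are all assumptions, and the thresholds are simply the 30 seconds and 10 minutes quoted on the slide.

```python
from dataclasses import dataclass

STALE_AFTER_S = 30       # slide: "30 seconds, can be less"
DEAD_AFTER_S = 10 * 60   # slide: "10 minutes, don't change this"

@dataclass
class Replica:
    """Hypothetical view of one block replica and its DataNode's health."""
    node: str
    seconds_since_heartbeat: float
    is_local: bool = False

def state(replica):
    """Classify the DataNode as live, stale or dead from its heartbeat age."""
    if replica.seconds_since_heartbeat >= DEAD_AFTER_S:
        return "dead"
    if replica.seconds_since_heartbeat >= STALE_AFTER_S:
        return "stale"
    return "live"

def read_order(replicas):
    """Live replicas first (local ones leading), stale last, dead never."""
    usable = [r for r in replicas if state(r) != "dead"]
    return sorted(usable, key=lambda r: (state(r) == "stale", not r.is_local))

def write_targets(replicas):
    """Only live DataNodes are eligible for writes."""
    return [r for r in replicas if state(r) == "live"]

replicas = [
    Replica("dn1", 45, is_local=True),  # silent for 45 s: stale
    Replica("dn2", 2),
    Replica("dn3", 3),
]
print([r.node for r in read_order(replicas)])
print([r.node for r in write_targets(replicas)])
```

Note that the local replica on `dn1` is tried last for reads here even though locality would normally put it first, because a stale node is only a last resort.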
  • 18. Results • No more read/write HDFS errors during the recovery • Multiple failures are still possible – Stale mode will still play its role – And set dfs.timeout to 30s – This limits the effect of two failures in a row. The cost of the second failure is 30s if you were unlucky
  • 19. Are we done? • We’re not bad • But there is still something
  • 20. The client • You left it waiting on the dead server
  • 22. The client • You want the client to be patient • Retrying when the system is already loaded is not good. • You want the client to learn about region servers dying, and to be able to react immediately. • You want this to scale.
  • 23. Solution • The master notifies the client – A cheap multicast message with the “dead servers” list. Sent 5 times for safety. – Off by default. – On reception, the client immediately stops waiting on the TCP connection. You can now enjoy a large hbase.rpc.timeout
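The client-side reaction described on slide 23 might look roughly like the sketch below. It is not the HBase client implementation: the `PendingCalls` class and its methods are hypothetical, and the multicast transport is left out entirely. It only shows how a "dead servers" list can wake up waiting calls at once, and why receiving the same message several times is harmless.

```python
import threading

class PendingCalls:
    """Hypothetical registry of in-flight calls, keyed by region server."""

    def __init__(self):
        self._lock = threading.Lock()
        self._by_server = {}  # server name -> list of threading.Event

    def register(self, server):
        """Record a call to `server`; the caller waits on the returned event."""
        cancelled = threading.Event()
        with self._lock:
            self._by_server.setdefault(server, []).append(cancelled)
        return cancelled

    def on_dead_servers(self, dead_servers):
        """Handle a 'dead servers' notification; repeats are harmless."""
        woken = 0
        with self._lock:
            for server in dead_servers:
                for cancelled in self._by_server.pop(server, []):
                    cancelled.set()  # wakes the waiter immediately
                    woken += 1
        return woken

calls = PendingCalls()
to_rs1 = calls.register("rs1")
to_rs2 = calls.register("rs2")

first = calls.on_dead_servers(["rs1"])   # first copy of the message
second = calls.on_dead_servers(["rs1"])  # a repeated copy changes nothing
print(first, second, to_rs1.is_set(), to_rs2.is_set())
```

A caller would block with something like `to_rs1.wait(timeout=rpc_timeout)`, so a long RPC timeout no longer delays the reaction to a known-dead server.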
  • 24. Are we done? • In a way, yes – There are a lot of things around asynchronous writes, reads during recovery – Will be for another time, but there will be some nice things in 0.96 • And a couple of them are presented in the second part of this talk!
  • 25. Faster recovery • Previous algo – Read the WAL files – Write new HFiles – Tell the region server it got new HFiles • Puts pressure on namenode – Remember: don’t put pressure on the namenode • New algo: – Read the WAL – Write to the regionserver – We’re done (have seen great improvements in our tests) – TBD: Assign the WAL to a RegionServer local to a replica
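The "new algo" on slide 25 can be sketched as follows. This is a simplified model under stated assumptions, not the HBase 0.96 code: the edit tuples, the `region_to_server` map and the `send` callback are all hypothetical stand-ins. The point it illustrates is that edits go straight to the server now hosting each region, with no intermediate files created along the way.

```python
from collections import defaultdict

def replay_wal(wal_edits, region_to_server, send):
    """Group WAL edits by the server now hosting each region and replay them.

    wal_edits: iterable of (region, row, value) tuples in log order.
    region_to_server: dict mapping region -> newly assigned region server.
    send: callable(server, edits) that writes one batch to that server.
    Returns the number of edits sent to each server.
    """
    batches = defaultdict(list)
    for region, row, value in wal_edits:
        batches[region_to_server[region]].append((region, row, value))
    for server, edits in batches.items():
        send(server, edits)  # direct write: no new files, no NameNode calls
    return {server: len(edits) for server, edits in batches.items()}

# Hypothetical WAL content and post-failure assignment.
wal = [("r1", "a", 1), ("r2", "b", 2), ("r1", "c", 3)]
assignment = {"r1": "rs2", "r2": "rs3"}

received = defaultdict(list)
counts = replay_wal(wal, assignment, lambda s, e: received[s].extend(e))
print(counts)
```

Log order is preserved within each batch, which matters because later edits to a row must win over earlier ones.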
  • 28. Write during recovery • Hey, you can write during the WAL replay • Events stream: your new recovery time is the failure detection time: max 30s, likely less!
  • 29. MemStore flush • Real life: some tables are updated at a given moment then left alone – With a non-empty MemStore – More data to recover • It’s now possible to guarantee that we don’t have a MemStore with old data • Improves real life MTTR • Helps snapshots
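One way to picture the guarantee on slide 29 is a periodic check that flushes any MemStore whose oldest unflushed edit has passed an age limit. The sketch below is an assumption-laden illustration: the function name, the timestamp map and the 1800-second limit are all made up for the example, since the slide does not say how the guarantee is implemented or what limit is used.

```python
def memstores_to_flush(oldest_edit_time, now, max_age_s):
    """Return the regions whose oldest unflushed edit is older than max_age_s.

    oldest_edit_time: dict of region -> timestamp (seconds) of the oldest
                      edit still in memory, or None if the MemStore is empty.
    """
    return sorted(
        region
        for region, t in oldest_edit_time.items()
        if t is not None and now - t > max_age_s
    )

# Hypothetical state: r1 was written long ago and then left alone.
oldest = {"r1": 0, "r2": 3500, "r3": None}
due = memstores_to_flush(oldest, now=3600, max_age_s=1800)
print(due)
```

Flushing `r1` here means a later crash has no old edits to replay for that region, which is how the check shortens recovery.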
  • 30. .META. • .META. – There is no -ROOT- in 0.95/0.96 – But .META. failures are critical • A lot of small improvements – The server now tells the client when a region has moved (client can avoid going to meta) • And a big one – .META. WAL is managed separately to allow an immediate recovery of META – With the new MemStore flush, ensure a quick recovery
  • 31. Data locality post recovery • HBase performance depends on data-locality • After a recovery, you’ve lost it – Bad for performance • Here come region groups • Assign 3 favored RegionServers for every region – Primary, Secondary, Tertiary • On failures assign the region to one of the Secondary or Tertiary depending on load • The data-locality issue is minimized on failures
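The failover rule on slide 31 can be sketched as a small selection function. This is a minimal illustration, not the HBase balancer: the function name, the load metric (regions hosted per server) and the tie-breaking behaviour are assumptions.

```python
def pick_new_host(favored, dead, load):
    """Choose a replacement among a region's favored servers.

    favored: (primary, secondary, tertiary) server names for the region.
    dead: set of servers known to be down.
    load: dict of server -> number of regions currently hosted.
    Returns the least-loaded surviving favored server, or None if all are down.
    """
    candidates = [s for s in favored if s not in dead]
    if not candidates:
        return None
    # min() keeps the earliest candidate on ties, so the secondary wins
    # over the tertiary when both carry the same load.
    return min(candidates, key=lambda s: load.get(s, 0))

favored = ("rs1", "rs2", "rs3")
load = {"rs2": 40, "rs3": 25}
chosen = pick_new_host(favored, dead={"rs1"}, load=load)
print(chosen)
```

Because the region's blocks were written with replicas on its favored servers, whichever survivor is chosen already holds a local copy of the data.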
  • 32. [Diagram, two panels: replicas of Block1, Block2 and Block3 spread over DataNode/RegionServer machines on Rack1, Rack2 and Rack3.] RegionServer1 serves three regions, and their StoreFile blocks are scattered across the cluster with one replica local to RegionServer1. Panel labels: “Reads Blk1 and Blk2 remotely”, “Reads Blk3 remotely”.
  • 33. [Diagram, two panels: same racks and blocks as slide 32.] RegionServer1 serves three regions, and their StoreFile blocks are placed on specific machines on the other racks. Panel label: “No remote reads”.
  • 34. Conclusion • Our tests show that the recovery time has come down from 10-15 minutes to less than 1 minute – All the way from failure to recovery (and not just recovery) • Most of it is available in 0.96, some parts were back-ported to 0.94.x • Real life testing of the improvements in progress – Testing of pre-production deployments in progress • Room for more improvements – For example, asynchronous puts / gets
  • 35. Q & A Thanks! • Devaraj Das – ddas@hortonworks.com, @ddraj • Nicolas Liochon – nkeywal@gmail.com, @nkeywal

Editor's Notes

  1. Talk about MTTR in general, why it is important. In Cassandra, for example, in theory, the MTTR is 0 since the system could sacrifice consistency for MTTR (quorum reads). Some links - http://dbpedias.com/wiki/Oracle:Fast-Start_Time-Based_Recovery, http://sandeeptata.blogspot.com/2011/06/informal-availability-comparison.html
  2. Previously..
  3. Previously..