© Hortonworks Inc. 2014
Data-Center Replication with Apache Accumulo
Josh Elser
Member of Technical Staff
PMC, Apache Accumulo
June 12th, 2014
Page 1
Apache, Accumulo, Apache Accumulo, and ZooKeeper are trademarks of the Apache Software Foundation.
Justification
Shouldn’t Accumulo be able to handle failures?
Absolutely
• Accumulo is great at surviving failure conditions
• It always chooses the durable option by default
BUT
• Sometimes a widespread failure makes some set of resources unavailable for an extended time
Page 2
Justification
• Expect the worst from hardware failures
–Both in magnitude and frequency
• Administration problems
–Human error is inevitable
• A single data-center location may be limiting
–Geography: 150ms to send a packet from California to the Netherlands and back again
–Throughput: concurrent user access (1K, 10K, 100K?)
• Applications must also consider failure conditions
–They must maintain their service-level agreements (SLAs)
Page 3
Jeff Dean – “Numbers everyone should know”
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
Justification
Wait, failure doesn’t really happen at the data center level
Page 4
… … right?
Inside Google data centers – http://www.google.com/about/datacenters/gallery/images/_2000/IDI_014.jpg
Justification
• HVAC Issues
–Microsoft: http://blogs.office.com/2013/03/13/details-of-the-hotmail-outlook-com-outage-on-march-12th/
• “Acts of God”
–Hurricane Sandy (NYC): http://www.datacenterknowledge.com/archives/2012/10/30/major-flooding-nyc-data-centers/
–Thunderstorms (AWS, Netflix, Heroku, Pinterest, Instagram): http://www.datacenterknowledge.com/archives/2012/06/30/amazon-data-center-loses-power-during-storm/
• Squirrels
–Level(3): http://blog.level3.com/level-3-network/the-10-most-bizarre-and-annoying-causes-of-fiber-cuts/
Page 5
Justification
• “The Joys of Real Hardware” – Jeff Dean
• Typical first year with a new cluster:
~0.5 overheating events (~1-2 days)
~1 PDU failure (~6 hours)
~20 rack failures (1-6 hours)
~3 router failures (~1 hour)
• What happens when you suddenly lose a large percentage of the nodes in your cluster?
Page 6
Jeff Dean – “The Joys of Real Hardware”
http://static.googleusercontent.com/media/research.google.com/en/us/people/jeff/stanford-295-talk.pdf
Justification
And what about your software…
Page 7
Don’t forget about Grace Hopper: bugs will show up one way or another.
There are lots of great technical “post-mortems” involving software problems from companies like GitHub, Google, Dropbox, Cloudflare, and others.
Unmodified source: http://www.bugzilla.org/img/buggie.png
License: http://creativecommons.org/licenses/by-sa/2.0/
Implementation
• A framework for tracking data written to a table
• Interfaces to transmit data to another storage system
• Asynchronous replication, eventual consistency
• Configurable, cyclic replication graph (see the configuration sketch below)
• “Primary-push” system from primary to peers
• Survives prolonged peer outages
Page 8
(Slide graphic: NoSQL × RDBMS)
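As a rough illustration of the "configurable" bullet above, the configuration sketch below shows how a table on the primary might be wired up for replication through the standard Java client API. The property names (replication.name, replication.peer.*, table.replication, table.replication.target.*), the value format for the peer definition, the peer name "peer1", the instance/ZooKeeper/credential strings, and the peer table id are all assumptions based on the replication design proposed for 1.7.0, not a definitive recipe; check the released documentation before relying on them.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;

public class ConfigureReplicationSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the primary instance (instance name, ZooKeepers, user, and
    // password are placeholders).
    Connector primary = new ZooKeeperInstance("primaryInstance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Name this instance and define a peer. The property names and the
    // "class,instanceName,zookeepers" value format are assumptions from the
    // proposed 1.7.0 design.
    primary.instanceOperations().setProperty("replication.name", "primary");
    primary.instanceOperations().setProperty(
        "replication.peer.peer1",
        "org.apache.accumulo.tserver.replication.AccumuloReplicaSystem,"
            + "peerInstance,peerZk1:2181");

    // Enable replication on a table and point it at a table id on the peer
    // ("mytable" and the id "2" are placeholders).
    primary.tableOperations().setProperty("mytable", "table.replication", "true");
    primary.tableOperations().setProperty("mytable", "table.replication.target.peer1", "2");
  }
}
```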
Implementation
• Write-Ahead Logs (WALs) are the source of data for replication
–Tricky because WALs are per-TabletServer, not per-table
• Book-keeping is stored primarily in a new replication table (see the inspection sketch below)
• Some records are also required in the Accumulo metadata table
• Pluggable replication work assignment by the Master to TabletServers on the primary
–Default implementation uses ZooKeeper
Page 9
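To make the book-keeping concrete, the inspection sketch below shows how an administrator might peek at the replication records with an ordinary Scanner. The table name "accumulo.replication", the connection details, and the exact layout of the entries are assumptions about the 1.7.0 implementation; only the generic client API used here is standard.

```java
import java.util.Map.Entry;

import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.Scanner;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.data.Key;
import org.apache.accumulo.core.data.Value;
import org.apache.accumulo.core.security.Authorizations;

public class InspectReplicationTableSketch {
  public static void main(String[] args) throws Exception {
    Connector conn = new ZooKeeperInstance("primaryInstance", "zk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // "accumulo.replication" is the book-keeping table name assumed here;
    // the final name is determined by the released implementation.
    Scanner scanner = conn.createScanner("accumulo.replication", Authorizations.EMPTY);
    for (Entry<Key,Value> entry : scanner) {
      // Each entry ties a WAL file to a replication target and its progress.
      System.out.println(entry.getKey() + " -> " + entry.getValue());
    }
  }
}
```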
Implementation
• TabletServers track use of WALs
–Only for tables configured for replication
• Master reads these records
–Creates units of replication work (a unit = one file needing replication to a peer)
–Makes replication work units available
–Cleans up records which are fully replicated
• TabletServers acquire the replication work
–Read part of the file, replicate it, record progress, repeat (see the worker-loop sketch below)
–Try to replicate as much data as possible
• Garbage Collector manages unused WALs
–Records an update when a WAL is no longer referenced
–Only removes WALs when replication is complete
Page 10
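The worker-loop sketch below restates the "read part of the file, replicate, record progress, repeat" cycle in code. Every type and method in it (WalReader, Peer, Batch, ProgressStore) is hypothetical and exists only to illustrate the flow; it is not the Accumulo implementation.

```java
// Hypothetical sketch of the loop a TabletServer runs after acquiring a unit of
// replication work. None of these helper types exist in Accumulo.
public class ReplicationWorkerSketch {

  interface WalReader {
    /** Read up to maxEntries serialized mutations starting at the given offset. */
    Batch readBatch(long offset, int maxEntries);
  }

  interface Peer {
    /** Ship a batch to the peer; returns the number of mutations it applied. */
    long replicate(Batch batch);
  }

  static class Batch {
    long nextOffset;   // where the next read should resume
    boolean empty;     // true when the WAL has been fully consumed
  }

  interface ProgressStore {
    long load(String fileId);
    void save(String fileId, long offset);
    void markComplete(String fileId);
  }

  /** Replicate one WAL to one peer, persisting progress so work survives failures. */
  void replicateFile(WalReader wal, Peer peer, ProgressStore progress, String fileId) {
    long offset = progress.load(fileId);          // resume from the last recorded position
    while (true) {
      Batch batch = wal.readBatch(offset, 1000);  // read part of the file
      if (batch.empty) {
        break;                                    // nothing left to send
      }
      peer.replicate(batch);                      // replicate to the peer
      offset = batch.nextOffset;
      progress.save(fileId, offset);              // record progress, then repeat
    }
    progress.markComplete(fileId);                // lets the Master/GC clean up the record
  }
}
```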
Implementation
ReplicaSystem – interface for defining how data is replicated to a peer (see the interface sketch below)
• Runs inside a Primary tserver, does the “heavy lifting”
• AccumuloReplicaSystem – implementation that sends data to another Accumulo instance
–Primary tserver asks the Peer’s Master for a tserver
–Primary tserver sends serialized Mutations from the WAL for the given table to the Peer tserver
–Peer tserver applies the Mutations locally and reports the number of Mutations applied back to the Primary
Page 11
(Slide diagram: Primary → Peer)
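For orientation, the interface sketch below is a deliberately simplified, hypothetical version of the plug-point the slide describes. The actual ReplicaSystem interface in the Accumulo codebase has a different, richer signature; this only shows the shape of the contract: configure from a property value, replicate a chunk of a WAL to a peer, and report progress.

```java
// NOT the real org.apache.accumulo.* interface; an illustrative sketch only.
public interface ReplicaSystemSketch {

  /** Configure the implementation from the value of a replication.peer.* property. */
  void configure(String configuration);

  /**
   * Replicate data from the given WAL file to the named peer, starting at the
   * previously recorded position, and return the new position so that progress
   * survives restarts and prolonged peer outages.
   */
  long replicate(String walFile, long startOffset, String peerName, String remoteTableId);
}
```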
Implementation
• And it works!
• Will be introduced as part of Apache Accumulo 1.7.0
• Also included in the next release of HDP
• Currently (June 2014) waiting on review from other developers
Page 12
Future
• Replication to other types of systems
–Other NoSQL systems, RDBMSes, others?
• Support for replication of bulk imports
• Support for Conditional Mutations
–Maybe? Somehow?
• Consistency of table configurations
–Iterator/Combiner definitions
–Mismatches could have undesired consequences
– Example: the peer system is smaller than the primary and cannot hold as much data
– Need to set a shorter TTL/age-off on the peer than on the primary (see the age-off sketch below)
Page 13
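The age-off sketch below shows one way the shorter-TTL-on-the-peer scenario might be handled, by attaching the standard AgeOffFilter iterator to the replicated table on the peer. The connection details, table name, iterator priority, and 7-day TTL are placeholder values chosen for illustration.

```java
import org.apache.accumulo.core.client.Connector;
import org.apache.accumulo.core.client.IteratorSetting;
import org.apache.accumulo.core.client.ZooKeeperInstance;
import org.apache.accumulo.core.client.security.tokens.PasswordToken;
import org.apache.accumulo.core.iterators.user.AgeOffFilter;

public class PeerAgeOffSketch {
  public static void main(String[] args) throws Exception {
    // Connect to the *peer* instance (names and credentials are placeholders).
    Connector peer = new ZooKeeperInstance("peerInstance", "peerZk1:2181")
        .getConnector("root", new PasswordToken("secret"));

    // Age off data after 7 days on the peer; the primary could keep data longer.
    IteratorSetting ageOff = new IteratorSetting(25, "ageoff", AgeOffFilter.class);
    ageOff.addOption("ttl", Long.toString(7L * 24 * 60 * 60 * 1000)); // milliseconds

    // Attach at all scopes (scan, minor and major compaction) on the replicated table.
    peer.tableOperations().attachIterator("mytable", ageOff);
  }
}
```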
Email: jelser@hortonworks.com
Twitter: @je2451
Page 14
I owe a huge thank you to everyone who was a
part of this in one way or another.
JIRA: https://issues.apache.org/jira/browse/ACCUMULO-378

Editor's Notes

  • #2 Myself. What is replication in this context?
  • #3 High level - why is replication important? Availability
  • #4 Hardware failures suck. Unexpected configuration/ops problems. Scale past a single data center without having to worry about a single instance. Must satisfy SLAs. Cannot just accept unplanned downtime.
  • #7 Jeff Dean’s talk about what to expect in the first year of a 1K node cluster
  • #8 The software running in the application, database and operating system also might have bugs which cause unexpected unavailability.
  • #9 Describe the characteristics of the replication implementation: book-keeping of data written to tables; interfaces/implementations for replicating data from an Accumulo instance to another application; asynchronous; eventually consistent; push data from primary to peer; resilient to prolonged outages.
  • #10 BigTable basics – what is a write-ahead log? Write-ahead logs are used to track data that was written to a table. The WAL is the primary element in the bookkeeping system. Some data is written to the metadata table, most to the replication table. Leverage ZooKeeper for work assignment to tservers.
  • #11 Tservers track WALs used for local tables. The Master makes work entries for the WALs that tservers use and assigns replication work back to the tservers. The Master cleans up fully-replicated records. The GC closes WALs that are no longer referenced by tservers.
  • #12 ReplicaSystem interface – does the “heavy lifting”, runs inside of the tserver. AccumuloReplicaSystem implementation to replicate to an Accumulo instance: local tserver -> peer master; peer master replies with a peer tserver; local tserver -> peer tserver.
  • #13 Not just some academic adventure. Will be available in 1.7.0 and the next version of HDP.
  • #14 Replicate to RDBMS or NoSQL systems. Support bulk imports – not a huge priority because they are typically easy to replicate on your own. Conditional mutation support somehow – would have to change from async to sync or introduce conflict resolution. Problems with table configuration properties: some are universal and others are specific to an instance.