<ul><li>Single failures  and  double failures  in clusters require different kinds of testing and furthermore, different k...
<ul><li>Fault Avoidance : by small amounts of forethought and action here and there in the code, potentially large failure...
4. Basic Parallelism    if it isn’t locked, then it isn’t blocked ‏ ‏ <ul><li>An optimal RDBMS  will  use  S2PL  (strict t...
4. Basic Parallelism    if it isn’t locked, then it isn’t blocked ‏ <ul><li>An optimal RDBMS RM (resource manager) will su...
4. Basic Parallelism    if it isn’t locked, then it isn’t blocked ‏ <ul><li>Optimally, the RDBMS also supports RM-only tra...
4. Basic Parallelism    if it isn’t locked, then it isn’t blocked ‏ <ul><li>There is one version of the data in an optimal...
4. Basic Parallelism    if it isn’t locked, then it isn’t blocked ‏ <ul><li>This allows every application process to run f...
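The S2PL discipline described in the slides above (locks acquired as needed, held to commit, released only at transaction end) can be sketched as a toy lock manager. This is an illustration of the locking rules only, not the Nonstop implementation; all class and method names are invented.

```python
# Toy sketch of strict two-phase locking (S2PL): shared ("S") and
# exclusive ("X") row locks are acquired as needed and released only
# at commit or abort. Illustrative only; not a real lock manager.

class LockManager:
    def __init__(self):
        self.locks = {}          # row -> (mode, set of txn ids holding it)

    def acquire(self, txn, row, mode):
        held = self.locks.get(row)
        if held is None:
            self.locks[row] = (mode, {txn})
            return True
        held_mode, owners = held
        if mode == "S" and held_mode == "S":
            owners.add(txn)      # shared readers are compatible
            return True
        if owners == {txn}:      # lock upgrade by the sole owner
            self.locks[row] = ("X" if mode == "X" else held_mode, owners)
            return True
        return False             # conflict: caller must wait (block, not abort)

    def release_all(self, txn):  # the "strict" part: release only at txn end
        for row in list(self.locks):
            mode, owners = self.locks[row]
            owners.discard(txn)
            if not owners:
                del self.locks[row]

lm = LockManager()
assert lm.acquire("t1", "row7", "X")
assert not lm.acquire("t2", "row7", "S")   # blocked until t1 finishes
lm.release_all("t1")
assert lm.acquire("t2", "row7", "S")
```

Because every lock is held to the end of the transaction, no transaction can read or overwrite another's uncommitted work, which is what rules out the wormholes discussed later.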
5. Basic  Transparency   when? where? how? <ul><li>Gosling: distributed computing is not transparent to either failure or ...
5. Basic  Transparency   when? where? how? <ul><li>(1)  Occasionally, the seamless operations and functioning of the appli...
5. Basic  Transparency   when? where? how? <ul><li>Then the database disk writes are scheduled to go out between the every...
5. Basic  Transparency   when? where? how? <ul><li>The neat trick that the backup RM does to restore update locks,  which ...
5. Basic  Transparency   when? where? how? <ul><li>(3)  Finally, if there is a rare TM or total cluster crash, due to a do...
5. Basic  Transparency   when? where? how? <ul><li>So  when/where/how is this transparent  to the application ? (continued...
6.  Basic  Scalability <ul><li>The original clustered database view of scalability came from David DeWitt and Jim Gray’s 1...
6.  Basic  Scalability <ul><li>Scalability of database logging performance inside the Nonstop cluster and for disaster rec...
6.  Basic  Scalability <ul><li>The combination of group commit and WAL yields just short of “in-memory” RDBMS performance,...
6.  Basic  Scalability <ul><li>At commit time, the Nonstop library transaction service induces explicit RM log flushing on...
6.  Basic  Scalability <ul><li>The group commit write by the RDBMS commit coordinator is the one and only time in the syst...
6.  Basic  Scalability <ul><li>Ultimately, however, you can easily generate more joint-serialized database log record bloc...
6.  Basic  Scalability <ul><li>Let’s talk about how a swarm of RMs can be flushed for transaction commit (or abort): </li>...
6.  Basic  Scalability <ul><li>Before an RM can do anything to a transaction protected file, on behalf of a  client reques...
6.  Basic  Scalability <ul><li>As an aside, there will be at least one transaction flush in the cluster for commit or abor...
6.  Basic  Scalability <ul><li>When an RM does an update, insert or delete of some row in an SQL table or an index entry, ...
6.  Basic  Scalability <ul><li>After incrementing the VSN, the  </li></ul><ul><ul><li>RMID </li></ul></ul><ul><ul><li>VSN ...
6.  Basic  Scalability <ul><li>When the RM is finished working on behalf of this transactional client file system request ...
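The VSN/crosslink bookkeeping described above might be sketched as follows. The idea shown is only that each RM update bumps the RM's volume sequence number and records it on the transaction's crosslink list, so that at commit time only RMs with unflushed work need a flush; field and function names are assumptions, not the Nonstop data structures.

```python
# Illustrative sketch of per-RM volume sequence numbers (VSNs) and a
# transaction crosslink list: each update bumps the RM's VSN and records
# it against the transaction, so commit knows which RMs must flush.

class ResourceManager:
    def __init__(self, rmid):
        self.rmid = rmid
        self.vsn = 0             # volume sequence number, bumped per update
        self.flushed_vsn = 0     # highest VSN already forced to the log

    def update(self, txn, row, value):
        self.vsn += 1
        txn.crosslinks[self.rmid] = self.vsn   # remember our work for commit
        # ... a real RM would emit a log record carrying (RMID, VSN, TID) ...

class Transaction:
    def __init__(self, tid):
        self.tid = tid
        self.crosslinks = {}     # rmid -> last VSN this transaction dirtied

def rms_needing_flush(txn, rms):
    # An RM whose flushed VSN already covers the transaction's last VSN
    # there needs no further flush for this commit.
    return [rm for rm in rms
            if txn.crosslinks.get(rm.rmid, 0) > rm.flushed_vsn]

rm1, rm2 = ResourceManager("rm1"), ResourceManager("rm2")
t = Transaction("t42")
rm1.update(t, "row1", "x")
rm1.flushed_vsn = rm1.vsn        # rm1 happened to write its buffer already
rm2.update(t, "row2", "y")
assert [rm.rmid for rm in rms_needing_flush(t, [rm1, rm2])] == ["rm2"]
```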
6.  Basic  Scalability <ul><li>The RM should have (at least) two log write buffers, so that it can be filling one with log...
6.  Basic  Scalability <ul><li>The  RM writes out its log buffer  when it gets full enough (when things are busy), or when...
6.  Basic  Scalability <ul><li>This is not very scalable, so don’t under-configure the memory size of the RM buffer cache ...
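The two-buffer log writer described above can be sketched as a double buffer: records accumulate in the filling buffer while the other buffer is out on a disk write, so appenders never stall behind I/O. The class shape, capacity trigger, and names are illustrative assumptions.

```python
# Sketch of a double-buffered RM log writer: one buffer fills with new
# log records while the other is out on a disk write (illustrative).

class DoubleBufferedLog:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.filling = []        # buffer accepting new log records
        self.writing = None      # buffer currently out on I/O, if any
        self.disk = []           # what has actually reached stable storage

    def append(self, record):
        self.filling.append(record)
        if len(self.filling) >= self.capacity and self.writing is None:
            self.start_write()

    def start_write(self):
        # Swap buffers: the full one goes to disk, a fresh one fills.
        self.writing, self.filling = self.filling, []

    def io_complete(self):       # disk interrupt: the write finished
        self.disk.extend(self.writing)
        self.writing = None

log = DoubleBufferedLog(capacity=2)
log.append("r1"); log.append("r2")       # first buffer goes out to disk
log.append("r3")                         # appending continues meanwhile
log.io_complete()
assert log.disk == ["r1", "r2"] and log.filling == ["r3"]
```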
6.  Basic  Scalability <ul><li>As yet another aside, when the Nonstop software stack was ported to Windows NT clusters in ...
6.  Basic  Scalability <ul><li>Getting back to our scenario, out of the commit broadcast  a transaction flush packet arriv...
6.  Basic  Scalability <ul><li>Checking the  crosslinks  VSN values (continued): </li></ul><ul><ul><li>Zero: data was only...
6.  Basic  Scalability <ul><li>For every RM you find on the crosslink list for this TID that has a positive VSN (continued...
6.  Basic  Scalability <ul><li>Once the TID’s crosslink list is run through to the point that all the RMs are flushed and ...
6.  Basic  Scalability <ul><li>The TM  commit coordinator wakes up on a timer , which is set  tiny  if things are not busy...
6.  Basic  Scalability <ul><li>When the TM commit coordinator has awakened, the committed/aborting flushed packets from th...
6.  Basic  Scalability <ul><li>When the  last log partition that it sent flush requests to has replied OK , the TM commit ...
6.  Basic  Scalability <ul><li>Final notes on group commit (continued) </li></ul><ul><ul><li>That one buffered and forced ...
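The group-commit batching described in the preceding slides can be reduced to a small sketch: commit requests that arrive between timer ticks are covered by one buffered, forced log write. The class and method names are invented for illustration; the real coordinator's timer tuning and flush-reply tracking are omitted.

```python
# Toy sketch of group commit: the commit coordinator batches commit
# records between timer ticks and forces them with a single log write.

class GroupCommitCoordinator:
    def __init__(self):
        self.pending = []        # commit records waiting for the next force
        self.log = []            # each entry models one forced (batched) write
        self.forces = 0          # count of physical log writes

    def request_commit(self, tid):
        self.pending.append(tid)

    def timer_tick(self):
        # One buffered, forced write covers every transaction in the batch.
        if self.pending:
            self.log.append(list(self.pending))
            self.forces += 1
            done, self.pending = self.pending, []
            return done          # all of these can now be acknowledged
        return []

gc = GroupCommitCoordinator()
for tid in ("t1", "t2", "t3"):
    gc.request_commit(tid)
assert gc.timer_tick() == ["t1", "t2", "t3"]
assert gc.forces == 1            # three commits, one log force
```

This is why the forced group-commit write amortizes so well: the cost of the one physical write is shared by every transaction that arrived during the batching interval.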
6.  Basic  Scalability <ul><li>When transactions span the network to other clusters (or heterogeneously to other vendor sy...
6.  Basic  Scalability <ul><li>If transactions have locality of reference and only touch the local node with no lock confl...
7.  Basic  Availability   outage minutes -> zero <ul><li>What is availability? </li></ul><ul><li>On some systems it is def...
7.  Basic  Availability   outage minutes -> zero <ul><li>In an optimal RDBMS, availability is measured in terms of databas...
7.  Basic  Availability   outage minutes -> zero <ul><li>Tandem’s  Nonstop TMF  has had excellent fault tolerance  out-of-...
7.  Basic  Availability   outage minutes -> zero <ul><li>An optimal RDBMS should drastically exceed this level of fault to...
7.  Basic  Availability   outage minutes -> zero <ul><li>The 4 SQL online partition operations: reorganize, split, merge, ...
7.  Basic  Availability   outage minutes -> zero <ul><li>Much of this is possible, because Nonstop uses  logical keys inst...
7.  Basic  Availability   outage minutes -> zero <ul><li>However, RIDs disallow using SQL cursors against the base tables ...
7.  Basic  Availability   outage minutes -> zero <ul><li>The SQL ‘recovery in place’ approach to seamless operations allow...
7.  Basic  Availability   outage minutes -> zero <ul><li>IBM's mainframe DB2 uses a piece of special hardware called the C...
8. Application/Database Serialized Consistency the database must be serialized wherever it goes ‏ <ul><li>So what is datab...
<ul><li>C is for Consistency  seems like a circular definition, but actually it should probably be spelled ASID, because c...
<ul><li>A transaction history starts in the log with the first update log record for a transaction </li></ul><ul><li>Then ...
<ul><li>The big thing about  serialized  transaction histories in the log is that even though they are mostly interspersed...
<ul><li>You must be able to sort (using Jim Gray's sorting method in Gray/Reuter 7.5.8.1) the entire log by transaction ti...
<ul><li>Multiple version concurrency control ( MVCC ) does not enforce the well-formed part, because shared read locks are...
<ul><li>MVCC  databases can create wormholes by  write skew : for instance, if two concurrent transactions read two differ...
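The write-skew wormhole described above can be shown concretely. The on-call constraint is the textbook illustration (an assumed example, not from the slides): under snapshot isolation each transaction reads both rows from its snapshot, sees the constraint holding, and writes a disjoint row, so first-committer-wins detects no conflict, yet the result is impossible in any serial order.

```python
# Illustrative write-skew run under snapshot isolation: the invariant
# is "at least one doctor on call", and both transactions break it
# together even though each preserved it against its own snapshot.

db = {"alice_on_call": True, "bob_on_call": True}

snap1 = dict(db)                  # T1 takes its snapshot
snap2 = dict(db)                  # T2 takes its snapshot concurrently

# T1: if someone else is on call, Alice may go off call.
if snap1["bob_on_call"]:
    db["alice_on_call"] = False   # commits; wrote a row T2 never wrote

# T2: the same check, against its (now stale) snapshot.
if snap2["alice_on_call"]:
    db["bob_on_call"] = False     # also commits: write sets are disjoint

# The invariant is violated: a wormhole. Under S2PL, each transaction's
# shared read lock on the other's row would have blocked the skew.
assert db == {"alice_on_call": False, "bob_on_call": False}
```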
<ul><li>Mostly what  MVCC  database users end up doing is single-threading, by two kinds of means: </li></ul><ul><ul><li>S...
<ul><li>Mostly what  MVCC  database users end up doing is single-threading, by various means ( continued ): </li></ul><ul>...
<ul><li>What all these single-threading methods accomplish is to effectively convert an  MVCC  RDBMS model to a fragile ap...
<ul><li>Of the  MVCC  databases, only Microsoft SQL Server can completely pass the TPC-E benchmark, which checks for incon...
<ul><li>So, what does all this  S2PL  get you? Isn't it slow? </li></ul><ul><li>Nonstop demonstrated the converse in the Z...
<ul><li>Imagine 10, 20, 100 clusters of database side by side in a massive group. Now allow them to share transactions wit...
<ul><li>In an  MVCC  log, you have what is called “log serialization”, which is only serialized as far as the log is conce...
<ul><li>So no matter how many clusters are involved in sharing distributed transactions in the massive group, if all of th...
<ul><li>This also allows you to do replication of  many  cluster primary databases to many  cluster backup databases and a...
<ul><li>I is for Isolation  and should probably be spelled ACLD, because isolation in database really means locking (as se...
<ul><li>Hence, because of transactional isolation: </li></ul><ul><ul><li>Applications communicate with each other using si...
<ul><li>Finally,  D is for Durability  and means that once an optimal RDBMS transaction service says it’s done, it’s reall...
<ul><li>So, given that we have a safe copy of the database transaction history (wormhole-free) in the log and a periodic c...
<ul><li>RM restart after TM restart: </li></ul><ul><ul><li>The  transaction manager rebuilds its state  from the last run ...
9.  Recovery   putting it all back together again <ul><ul><li>These are the  transaction states , which are stored as reco...
<ul><ul><li>Transaction state record transitions in the log root : </li></ul></ul><ul><ul><ul><li>Active or nil ->   forgo...
<ul><ul><li>Transaction state record transitions in the log root   (continued): </li></ul></ul><ul><ul><ul><li>Prepared ->...
<ul><ul><li>Transaction state record transitions in the log root   (continued): </li></ul></ul><ul><ul><ul><li>Committed -...
<ul><ul><li>Actions to take depending on transaction states in the TM after the restart scan of the last two TM checkpoint...
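The transaction-state transitions above can be captured as a small table-driven state machine. The slides abbreviate the exact Nonstop state set, so the transitions below follow the standard presumed-abort pattern and should be read as a hedged sketch, not the actual TM table.

```python
# Hedged sketch of a transaction-state machine for restart processing;
# the transition set is the conventional presumed-abort shape, assumed
# here rather than taken verbatim from the Nonstop log-root records.

ALLOWED = {
    ("active", "prepared"),
    ("active", "aborting"),
    ("active", "committed"),
    ("prepared", "committed"),
    ("prepared", "aborting"),
    ("aborting", "aborted"),
    ("committed", "forgotten"),
}

def transition(state, new_state):
    # Restart must reassert the invariants: any transition not in the
    # table indicates a corrupted or misread state record.
    if (state, new_state) not in ALLOWED:
        raise ValueError(f"illegal transition {state} -> {new_state}")
    return new_state

s = transition("active", "prepared")
s = transition(s, "committed")
assert transition(s, "forgotten") == "forgotten"
```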
9.  Recovery   putting it all back together again <ul><ul><li>If there is no lock reinstatement  during crash recovery for...
9.  Recovery   putting it all back together again <ul><ul><li>If there is lock reinstatement , then the  database can be b...
9.  Recovery   putting it all back together again <ul><ul><li>If lock reinstatement is supported , the RM can be brought o...
9.  Recovery   putting it all back together again <ul><ul><li>If the current log pointers for the log root and the log par...
9.  Recovery   putting it all back together again <ul><li>RM archive recovery (if the file is trashed): </li></ul><ul><ul>...
Fundamentals Of Transaction Systems - Part 1: Causality banishes Acausality (Clustered Database)

see http://ValverdeComputing.Com for video

Published in: Technology, Business


  1. 1. Valverde Computing The Fundamentals of Transaction Systems Part 1: Causality banishes Acausality (Clustered Database) C.S. Johnson <cjohnson@member.fsf.org> video: http://ValverdeComputing.Com social: http://ValverdeComputing.Ning.Com 1- The Open Source/ Systems Mainframe Architecture
  2. 2. 1- Library = Low level communication, operating system drivers and state on Open Systems platforms Subsystems = Open Source components + middleware standards + Customer Application Cores EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools Optimal Cluster Software Architecture
  3. 3. 1- Library = Low level communication, operating system drivers and state on Open Systems platforms State Optimally includes a proprietary layer of low-level, C/C++ based drivers, yielding unparalleled transaction processing performance without the client having to deal with the underlying design architecture. These libraries provide a simple and unobtrusive, yet elegant and abstract data management interface for new applications. Libraries ESS, WAN, LAN, SAN drivers and management library Global serialization library XML log records library Buffered log I/O library XML log reading library Cluster logging library Recovery library XML chains resource manager Global Transaction (IDs, handles and types) library Data management library Transaction management library XML remote scripting API library Computer, Cluster and Network management library
  4. 4. 1- Subsystems = Open Source components + middleware standards + Customer Application Cores <ul><ul><li>The vast majority of optimal middleware and applications are then implemented on open source using cross-platform Java to access this open system interface, allowing unprecedented flexibility for customization and future expansion. </li></ul></ul>Middleware – Open Source Disaster Recovery interface XML remote scripting XML management console Service control manager Application servers Application feeders Application extractors Application reports Application human interface Database and Recovery management interface Computer, Cluster and Network management interface Application Core
  5. 5. 1- EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools Enterprise Application Integration Actional Control Broker Acxiom AbiliTec™ Fair Isaac Blaze Advisor Mercator Commerce Broker MicroStrategy DoubleClick Ensemble SAS Enterprise Miner ETL Tools SeeBeyond® TIBCO Trillium
  6. 6. 1- High Speed, Minimum Latency Network or SAN “B” Cluster Redundancy Architecture High Speed, Minimum Latency Network or SAN “A” * Elements can be viewed as computers in a cluster, or as clusters in a group Fibre Channel or SAN Based Enterprise Storage Network “B” Fibre Channel or SAN Based Enterprise Storage Network “A”
  7. 7. 4 Pillars (or Guardians or Demons) ‏ <ul><li>1. Causality banishes Acausality (Clustered Database) </li></ul><ul><ul><li>Importance of Serialization, Order </li></ul></ul><ul><ul><li>S2PL (strict two phase locking vs MVCC)‏ </li></ul></ul><ul><ul><ul><li>Write skew and wormholes </li></ul></ul></ul><ul><ul><li>Wittgenstein and the Tractatus Logico-Philosophicus </li></ul></ul><ul><ul><li>Mohandas Gandhi: Be the change you want to see. </li></ul></ul><ul><ul><li>Daisaku Ikeda: Ningen Kakumei (Principle of Human Revolution) </li></ul></ul>1-
  8. 8. <ul><li>2. Relativity shatters the Classical Delusion (Replicated Database) </li></ul><ul><ul><li>Real-time and distributed systems performance issues </li></ul></ul><ul><ul><li>Timestamp and clock issues </li></ul></ul><ul><ul><ul><li>Einstein: no such thing as the global “current moment” </li></ul></ul></ul><ul><ul><ul><li>Davies: no such thing as the local “current moment” (modern physics)‏ </li></ul></ul></ul><ul><ul><ul><li>GPS satellites were made with 13 digits of precision, they turned out to need every digit due to elliptical orbits and gravity-acceleration time-dilation variances. </li></ul></ul></ul>4 Pillars (or Guardians or Demons) ‏ 1-
  9. 9. <ul><li>3. Purity emerges from Impurity (Practical makes perfect) </li></ul><ul><ul><li>Algorithms need to work correctly and be optimized for time and resources </li></ul></ul><ul><ul><li>Occam: &quot;All other things being equal, the simplest solution is the best.&quot; </li></ul></ul><ul><ul><li>Einstein: “Make everything as simple as possible, but not simpler” </li></ul></ul><ul><ul><li>Roger Penrose: Objectivity of Plato's mathematical world = no simultaneous correct proof and disproof </li></ul></ul><ul><ul><li>Carl Hewitt's Paraconsistency is really just Inconsistency </li></ul></ul>4 Pillars (or Guardians or Demons) ‏ 1-
  10. 10. <ul><li>4. Certainty suppresses Uncertainty (Groups of Clusters) </li></ul><ul><ul><li>ΔXΔP ≥ ћ/2‏ </li></ul></ul><ul><ul><li>Failure, Takeover and Recovery: reasserting the invariants </li></ul></ul><ul><ul><li>Propagation of Error and Chaos will ultimately loom in all predictions, interpolations and extrapolations </li></ul></ul><ul><ul><li>Idempotence: transparent retryability </li></ul></ul>4 Pillars (or Guardians or Demons) ‏ 1-
  11. 11. Cluster Fundamentals <ul><li>1. Reliable Message Based System - serialized retries with duplicate removal </li></ul><ul><li>2. Data Integrity - data must be checked wherever it goes </li></ul><ul><li>3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment </li></ul><ul><li>4. Basic Parallelism - if it isn’t locked, then it isn’t blocked </li></ul><ul><li>5. Basic Transparency - when? where? how? </li></ul><ul><li>6. Basic Scalability </li></ul><ul><li>7. Basic Availability - outage minutes -> zero </li></ul><ul><li>8. Application/Database Serialized Consistency - the database must be serialized wherever it goes </li></ul><ul><li>9. Recovery - putting it all back together again </li></ul>1-
  12. 12. Cluster Fundamentals <ul><li>10. ACID and BASE - workflow makes this reaction safe </li></ul><ul><li>11. True Multi-Threading - shrinking the size of thread-instance state </li></ul><ul><li>12. Single System Image and Network Autonomy </li></ul><ul><li>13. Minimal Use of Special Hardware - servers need to be off-the-shelf </li></ul><ul><li>14. Maintainability and Supportability - H/W & S/W needs to be capable of basic on-line repair </li></ul><ul><li>15. Expansively Transparent - Parallelism and Scalability </li></ul><ul><li>16. Continuous Database - needs virtual commit by name </li></ul>1-
  13. 13. Cluster Fundamentals <ul><li>17. Reliable Disjoint Async Replication </li></ul><ul><li>18. Logical Redo and Volume Autonomy </li></ul><ul><li>19. Scalable Joint Replication </li></ul><ul><li>20. Bi-Directional Replication - Reliable, Scalable, Atomically Consistent </li></ul><ul><li>21. Openness (Glasnost) - Open systems, open source, free software </li></ul><ul><li>22. Restructuring (Perestroika) - Online application and schema maintenance </li></ul><ul><li>23. Reliable Software Telemetry - push streaming needs a many-to-many architecture </li></ul>1-
  14. 14. Cluster Fundamentals <ul><li>24. Publish and Subscribe </li></ul><ul><li>25. Ubiquitous Work Flow </li></ul><ul><li>26. Virtual Operating System </li></ul><ul><li>27. Scaling Inwards - Extreme Single Row Performance for Exchanges </li></ul><ul><li>28. Ad Hoc Aggregation - Institutional Query Transparency for Regulation </li></ul><ul><li>29. Reliable Multi-Lateral Trading - Regulated Fairness & Performance, Guaranteed Result </li></ul><ul><li>30. Semantic Data - Verity of Data Processing </li></ul><ul><li>31. Integration and Test Platform - Real-Time Transaction Database </li></ul><ul><li>32. Integrated Logistics </li></ul>1-
  15. 15. 1. Reliable Message-Based System serialized retries with duplicate removal <ul><li>Why are loosely-coupled clusters of computers such a great thing? </li></ul><ul><ul><li>Of course, the computers themselves can be tightly-coupled SMPs, so one does not preclude the other: by carefully architecting the software, we get to have the best of both: </li></ul></ul><ul><ul><li>(1) the shared-memory multi-core semi-automatic load balancing (with the help of packages like the Intel Threading Building Blocks) within a single unit of failure that is an SMP, but memory-update-contention limits scalability </li></ul></ul>1-  TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>
  16. 16. <ul><li>Why loosely-coupled clusters (continued)? </li></ul><ul><ul><li>(2) and the shared-nothing (Stonebraker) potential for fault tolerance through the separate units of failure connected by messages, that makes up a cluster of computers with extremely difficult load balancing, but with the possibility of limitless scalability and parallelism, with no shared memory access </li></ul></ul><ul><li>However, what are the limits of using cluster messages? </li></ul><ul><ul><li>Messages aren’t free, they have a cost: LAN messages cost ten times what messages between cores in a shared memory system do (250 vs. 2500 instructions)  </li></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-  TR-88.4 The Cost of Messages <http://www.hpl.hp.com/techreports/tandem/TR-88.4.html>
  17. 17. <ul><li>However, what are the limits of using cluster messages? </li></ul><ul><ul><li>The increased LAN cost comes from framing, checksumming, packet assembly/disassembly, standard protocols, and OS layering </li></ul></ul><ul><ul><li>Co-processors are defraying a lot of this cost (SMP and LAN) since that article, but the disparity remains </li></ul></ul><ul><ul><li>Inlining code does minimize the processor overhead, but there is still a response time hit (down to 100 ns for Nonstop ServerNet II, this is the hardware limit) </li></ul></ul><ul><ul><li>WAN costs require the abandonment of full transparency outside LAN clusters: so, no SQL partitioning across the WAN - it’s all client server, replication or workflow  </li></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-  TR-89.1 Transparency in its Place The Case Against Transparent Access to Geographically Distributed Data <http://www.hpl.hp.com/techreports/tandem/TR-89.1.html>
  18. 18. <ul><li>In an optimal RDBMS (a relational database management system), data spread over multiple clusters of computers implies the use of messages between software subsystems (Gray-Reuter 3.7.3.1) ‏ </li></ul><ul><li>Message senders can die and be restarted and send duplicate messages, which must be detected and dropped ( idempotence : in math, a = a x a ; in computers, multiple attempts yield a single result) ‏ </li></ul><ul><li>Receivers can die and be restarted, so non-replied-to messages must be resent on failure (reliable retry to a new primary) ‏ </li></ul><ul><li>Gaps in series may need to be detectable (sessions and sequencing, a solid MsgSys can help do this for you) ‏ </li></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
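The retry/duplicate/gap rules above can be sketched as receiver-side bookkeeping over per-session sequence numbers. All names here are illustrative, not the actual Nonstop MsgSys interfaces; the point is only that dropping duplicates while replying makes a sender's retry idempotent, and sequence gaps are detectable.

```python
# Hypothetical sketch of receiver-side duplicate removal and gap
# detection using per-session sequence numbers (illustrative names).

class ReliableReceiver:
    def __init__(self):
        self.last_seen = {}   # session_id -> highest sequence number applied

    def deliver(self, session_id, seqno, payload):
        """Return ('apply' | 'duplicate' | 'gap', payload)."""
        last = self.last_seen.get(session_id, 0)
        if seqno <= last:
            # Sender retried after a failure: drop the duplicate but
            # still reply, so the retry looks idempotent to the sender.
            return ("duplicate", None)
        if seqno > last + 1:
            # Missing messages in the series: the session layer resends.
            return ("gap", None)
        self.last_seen[session_id] = seqno
        return ("apply", payload)

rx = ReliableReceiver()
assert rx.deliver("s1", 1, "a")[0] == "apply"
assert rx.deliver("s1", 1, "a")[0] == "duplicate"   # retried send dropped
assert rx.deliver("s1", 3, "c")[0] == "gap"         # seq 2 went missing
```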
  19. 19. <ul><li>Library based components require (for performance reasons) </li></ul><ul><ul><li>Drivers and packet communications </li></ul></ul><ul><ul><li>kernel mode execution under driver dispatches </li></ul></ul><ul><ul><li>packet buffering and replies without dispatching target threads (called driver ricochet broadcasts) ‏ </li></ul></ul><ul><ul><li>global data with common access controls for kernel mode and thread mode, which allows </li></ul></ul><ul><ul><li>kernel mode flushing of RMs (RDBMS resource managers) with low-level lock-release </li></ul></ul><ul><ul><li>FIFO queuing of packet buffers into subsystem user mode threads (fibers) ‏ support stream programming </li></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
  20. 20. <ul><li>Basic HP (was Tandem) Nonstop clustering services include  : </li></ul><ul><ul><li>Cluster coldload and single processor reload </li></ul></ul><ul><ul><li>Processor Synchronization: </li></ul></ul><ul><ul><ul><li>I’m Alive Protocol : Heartbeats every second, all processors check every two seconds for receipt from every other processor, if one cannot communicate, send a poison pill message and declare it down, and cancel its messages, etc., unless … </li></ul></ul></ul><ul><ul><ul><li>Regroup Protocol : Two-round cluster messaging protocol to make sure the unhealthy processor is really down, and not just late for some and not others (split-brain), which gives recalcitrants a second chance </li></ul></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-  TR-90.5 Fault Tolerance in Tandem Computer Systems <http://www.hpl.hp.com/techreports/tandem/TR-90.5.html>
  21. 21. <ul><ul><li>Processor Synchronization (continued): </li></ul></ul><ul><ul><ul><li>Global Update Protocol (Glupdate) : Cluster information (process pair name directory, other items in the messaging destination table) are replicated in a time-limited, atomic and serial manner </li></ul></ul></ul><ul><ul><ul><li>Cluster Time Synchronization : clock adjustments are constantly maintained and relative clock error is kept track of: the Nonstop transaction service does not depend upon clock synchronization for commit or any other algorithmic purpose (that would defy relativity), so Nonstop only inserts timestamps for reference purposes </li></ul></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
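The I'm Alive checking above might be sketched as a heartbeat monitor. The timings come from the slide (heartbeats every second, checks every two seconds); the class shape is invented, and the Regroup second-chance rounds that guard against split-brain are deliberately omitted.

```python
# Minimal sketch of an "I'm Alive"-style heartbeat check. A suspected
# processor would still get the Regroup protocol's second chance before
# being declared down and sent a poison pill (not modeled here).

class HeartbeatMonitor:
    def __init__(self, peers, timeout=2.0):
        self.timeout = timeout
        self.last_beat = {p: 0.0 for p in peers}

    def beat(self, peer, now):
        self.last_beat[peer] = now   # heartbeat received from peer

    def suspects(self, now):
        # Peers silent for longer than the timeout are suspected down.
        return [p for p, t in self.last_beat.items()
                if now - t > self.timeout]

m = HeartbeatMonitor(["cpu0", "cpu1"])
m.beat("cpu0", 1.0); m.beat("cpu1", 1.0)
m.beat("cpu0", 2.0)                  # cpu1 misses its next heartbeat
assert m.suspects(3.5) == ["cpu1"]
```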
  22. 22. <ul><li>The basic clustering of the Nonstop message system is described in the (expired and now available) Nonstop patents from James Katzman, et al: </li></ul><ul><ul><li>4,817,091 Fault-tolerant multiprocessor system </li></ul></ul><ul><ul><li>4,807,116 Interprocessor communication </li></ul></ul><ul><ul><li>4,672,537 Data error detection and device controller failure detection in an input/output system </li></ul></ul><ul><ul><li>4,672,535 Multiprocessor system </li></ul></ul><ul><ul><li>4,639,864 Power interlock system and method for use with multiprocessor systems </li></ul></ul><ul><ul><li>4,484,275 Multiprocessor system </li></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
  23. 23. <ul><li>Nonstop patents from James Katzman, et al (continued): </li></ul><ul><ul><li>4,378,588 Buffer control for a data path system </li></ul></ul><ul><ul><li>4,365,295 Multiprocessor system </li></ul></ul><ul><ul><li>4,356,550 Multiprocessor system </li></ul></ul><ul><ul><li>4,228,496 Multiprocessor system </li></ul></ul><ul><li>And the (expired and now available) Glupdate patent from Richard Carr, et al, which has been reliable for over 30 years now, and has a much reduced message overhead versus Quorum Consensus+Thomas Write Rule+Lamport timestamps , while accomplishing more (for cluster size <= 25 or so): </li></ul><ul><ul><li>4,718,002 Method for multiprocessor communications </li></ul></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
  24. 24. <ul><li>A Nonstop system is a loosely-coupled (no shared memory) cluster (called a “network”) of clusters (called “nodes”) of processors, up to 4096. Each 16 processor node in the Expand network has node autonomy, and its own transaction service TM (transaction manager) capable of bringing the cluster's RDBMS up and down, one RM (resource manager) at a time, or all at once. </li></ul><ul><li>Fault tolerance at the subsystem and application level is accomplished by process pairs, which look like a single process to a client sending messages and later retrying after the primary half of the pair has gone down, and takeover by the backup has made a new primary. </li></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1-
  25. 25. <ul><li>Takeover is quite different from failover and restart ; IBM’s Parallel Sysplex does not do takeover: all nodes have transparent access to data, and applications that fail have to be restarted; there is a ‘Workload Manager’ to restart the apps, but even that does not completely recover the database (50-60% of IBM database applications are not transactional) </li></ul><ul><li>An IBM S390 Sysplex Cluster is a set of up to 32 16-way SMPs joined by ultra-fast interconnects and buses, with at least 2 coupling facility smart memory devices and 2 synchronized sysplex clocks (they are not used for processing commit, they do commit by log order, like Nonstop does), see their presentation: </li></ul>1. Reliable Message-Based System serialized retries with duplicate removal 1- <http://www.mvdirona.com/jrh/work/hpts2001/presentations/DB2%20390%20Availability.pdf>
  26. 26. 2. Data Integrity data must be checked wherever it goes <ul><li>Data corruption is an ever-present possibility through electronic noise (e.g. radon decay chain effects, cosmic radiation), physical defects (semiconductor doping flaws), and HW/SW design defects (stray pointers in code) ‏ </li></ul><ul><li>The statistics are that there are 3 undetected and uncorrected, but program-significant data corruptions per 1000 microprocessors per year (Horst, et al: Proc 23rd FT Computing Symposium 1993) ‏ </li></ul><ul><li>Disks, even when not in use , will corrupt data at a low rate and the mirrors need to be crawled and corrected in the background, and single data disk blocks recovered, non-mirrored disks with errors exceeding the 2-bit (or otherwise) encoding correction are a permanent problem </li></ul>1-
  27. 27. 2. Data Integrity data must be checked wherever it goes <ul><li>Memory must be error-checking and correcting (ECC), as in most computer systems. Many components have the potential to corrupt data, and this will become more problematic as components shrink </li></ul><ul><li>Higher integration levels for processors will cause sporadic internal resets from soft errors, which occur more frequently at higher altitude (Itaniums in Colorado reset 1/day vs. 1/week at sea level in 2001), and which can take a processor offline for half a minute ‏ </li></ul><ul><li>Optimal, reliable systems will support every one of the following: </li></ul><ul><ul><li>Lock-stepped microprocessors </li></ul></ul><ul><ul><li>Fail-fast protection of internal buses and drivers </li></ul></ul><ul><ul><li>End-to-end checksums on data sent to storage devices </li></ul></ul>1-
28. 2. Data Integrity data must be checked wherever it goes <ul><li>Log writing must use end-to-end checksums on blocks. This is because after a crash, we need to fix up the log to the last validly written block of log records from a log buffer, and we can’t tolerate garbage in the middle of a block due to power-loss partial writes (drive manufacturer dependencies) </li></ul><ul><li>During transaction restart after an RM (resource manager) or computer crash, or a full cluster TM (transaction manager) crash, log fixup searches for the last good block written (valid checksum) to the log mirrors, which becomes the new log tail </li></ul><ul><li>The fixup function reads from the mirrors until neither one has a good block at the end of the log, then rewrites all of the last log blocks on both mirrors (to scrub the errors on the short side) </li></ul>
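The fixup scan just described can be sketched as follows. This is an illustrative toy (the block layout, CRC choice, and function names are invented here, not the Nonstop implementation): every block carries an end-to-end checksum, and the scan walks both mirrors forward until neither has a good block, which becomes the new tail.

```python
import zlib

def make_block(payload: bytes) -> bytes:
    # End-to-end checksum: append a CRC so a torn (partial) write is detectable.
    return payload + zlib.crc32(payload).to_bytes(4, "big")

def block_valid(block) -> bool:
    if block is None or len(block) < 4:
        return False
    payload, crc = block[:-4], int.from_bytes(block[-4:], "big")
    return zlib.crc32(payload) == crc

def log_fixup(mirror_a: list, mirror_b: list) -> int:
    """Read forward through both mirrors until NEITHER has a good block:
    that point becomes the new log tail. The good copies are rewritten
    onto both mirrors, scrubbing the errors on the short side."""
    good = []
    while True:
        i = len(good)
        a = mirror_a[i] if i < len(mirror_a) else None
        b = mirror_b[i] if i < len(mirror_b) else None
        if block_valid(a):
            good.append(a)
        elif block_valid(b):
            good.append(b)
        else:
            break
    mirror_a[:] = good
    mirror_b[:] = good
    return len(good)   # index of the new log tail
```

A block that is good on either mirror survives on both after fixup, which is the "scrub the short side" step.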
29. 3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment <ul><li>James Gosling: distributed computing is not transparent to either failure or performance </li></ul><ul><li>Some errors are tolerable, and some operations returning errors can be retried with idempotence </li></ul><ul><li>Oddly enough, keeping things reliably running requires a cut-throat approach to critical subsystems that are experiencing anomalies, encountering garbage data, or even running abnormally </li></ul><ul><li>Fail-fast – going down quickly prevents the spread of invalid data, or even the effects of flawed algorithms or races we can’t handle (what if the corruption checks don’t catch something?) </li></ul>
30. <ul><li>Takeover (more transparent) is far superior to failover (failure and restart), if only because it enables the use of fail-fast techniques: they don’t hurt users as much </li></ul><ul><li>Failure Detection : assertion logic is interwoven throughout all critical library code and all critical subsystem code in reliable systems </li></ul><ul><li>To maintain the state machine invariants end-to-end, we must detect any violation of the invariants and then reinstate them by whatever means necessary </li></ul><ul><li>Bohr-bugs (synchronous: they hit repeatedly) and Heisen-bugs (asynchronous and racy) require different kinds of testing </li></ul>3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment
31. <ul><li>Single failures and double failures in clusters require different kinds of testing, and furthermore, different kinds of fault tolerance design to ‘transparently’ handle those failures </li></ul><ul><li>Fault Tolerance : when something goes wrong and a failure occurs, whether hardware or software, takeover mechanisms ensure the re-establishment of state machine invariants (a new state equivalent to the state before the failure) </li></ul><ul><li>In fault tolerant systems, when a piece of hardware fails, the fault tolerance of the software has to function correctly to mask the hardware failure </li></ul>3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment
32. <ul><li>Fault Avoidance : by small amounts of forethought and action here and there in the code, potentially large failures can be shrunk down in size to be handled invisibly: </li></ul><ul><ul><li>avoiding unnecessary transaction aborts by preparatory checkpointing of shared read locking state in the RM [resource manager] before a coordinated takeover </li></ul></ul><ul><ul><li>avoiding unnecessary RM crash recovery outages by detecting missing log writes and performing them in a timely way after a takeover </li></ul></ul><ul><li>Fault Containment : garbage pointers in the library kernel globals cause the outage of a computer in a cluster, but encountering the garbage pointers in a critical subsystem process environment may only require a process restart, if proper checkpoints have been made beforehand </li></ul>3. Reliability = fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment
33. 4. Basic Parallelism if it isn’t locked, then it isn’t blocked <ul><li>An optimal RDBMS will use S2PL (strict two-phase locking) for the transaction duration locks (there are 5 kinds of locks in the Nonstop RM: DP2) </li></ul><ul><li>An optimal RDBMS RM (resource manager) holds both the data and the locks, with no external distributed lock table or external buffer cache to fight over with interlopers: and that means that clients can queue properly </li></ul><ul><li>So, one client message connects the transactional application code to: </li></ul><ul><ul><li>the RM data </li></ul></ul><ul><ul><li>+ the RM client queue </li></ul></ul><ul><ul><li>+ the acquired RM locks </li></ul></ul><ul><ul><li>+ the tx state within the node TM (transaction manager) library globals underneath the RM subsystem </li></ul></ul><ul><ul><li>+ the potential failure takeover process for the RDBMS RM </li></ul></ul>
34. 4. Basic Parallelism if it isn’t locked, then it isn’t blocked <ul><li>An optimal RDBMS RM (resource manager) will support the use of a priority queue and priority inversion on that queue, so that mixed workloads can intermingle with little impact: </li></ul><ul><ul><li>A common problem with clusters is that a low-priority client, once dequeued and getting served at the RM, can block the access of a high-priority client that is newly queued </li></ul></ul><ul><ul><li>For short-duration requests this is ignorable, but low-priority table scans for queries blocking high-priority OLTP updates is not good for business </li></ul></ul><ul><ul><li>The solution is to execute the client function in a thread at the priority of the client (inversion), and to make low-priority scans (and the like) execute for a quantum and be interruptible by high-priority updates, see the paper : </li></ul></ul> TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>
35. 4. Basic Parallelism if it isn’t locked, then it isn’t blocked <ul><li>Optimally, the RDBMS also supports RM-only transactions, which are only active within one RM, and which do a transaction flush confined to that RM, so that one application message can send all the compound SQL statements and rowsets for several transactions, which will have microscopic response times, lock hold times, etc. (for instance, a hundred TPC-C transactions for one branch of the bank) … this will still allow you to do wide transactions and queries on that RM data at any time, see the 1999 HPTS position paper: </li></ul><http://research.microsoft.com/~gray/HPTS99/Papers/JohnsonCharlie.doc>
36. 4. Basic Parallelism if it isn’t locked, then it isn’t blocked <ul><li>There is one version of the data in an optimal RDBMS computing universe: applications must serialize, and they must interact with each other on the same data using transactions. This is in contrast to MVCC (versioning) databases (Oracle, SQL Server, Sybase, MySQL, Postgres), where transactional reads use snapshot isolation and are blind to concurrent updates: so only updates on primary keys block updates, and repeatable-read transactions (in the style of banking and finance transactions) don't provide proper isolation </li></ul><ul><li>An optimal S2PL (strict two-phase locking) system's update lock will block both updates and reads, an S2PL shared read lock will block updates, and both kinds of locks will only be released when the transaction stops changing the database, and for the update lock, after changes to the database are made durable at commit or abort time </li></ul>
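The blocking rules above condense into a minimal S2PL sketch (the class and mode names are invented for illustration; a real RM also has client queues, five lock kinds, and escalation):

```python
class LockManager:
    """Minimal S2PL sketch: shared (S) locks are mutually compatible,
    update/exclusive (X) locks conflict with everything, and a
    transaction holds every lock it acquires until commit or abort."""
    def __init__(self):
        self.locks = {}   # item -> list of (txn, mode)

    def request(self, txn, item, mode):
        holders = self.locks.setdefault(item, [])
        for holder, held_mode in holders:
            if holder != txn and (mode == "X" or held_mode == "X"):
                return "blocked"      # caller queues on the item
        holders.append((txn, mode))
        return "granted"

    def release_all(self, txn):
        # Called only at commit/abort time: strict two-phase locking.
        for holders in self.locks.values():
            holders[:] = [h for h in holders if h[0] != txn]
```

Readers share, an update lock excludes everyone else, and nothing is released before commit or abort: that strictness is what keeps the schedules serializable.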
37. 4. Basic Parallelism if it isn’t locked, then it isn’t blocked <ul><li>This allows every application process to run freely in parallel across the entire database, until it encounters a blocking lock on some record – hence there is no application locking schedule, high-level concurrency control, or massive sharding of the database required to single-thread the concurrency and allow the database to work correctly – so the system runs naturally in parallel at warp speed across the entire network of clusters of computers </li></ul><ul><li>In an optimal RDBMS, use of the read-only transaction commit optimization will further allow large sections of the database to be released at the beginning of the commit flush (which is the end of the database transformation from data that was read by the transaction, which is why you hold shared read locks) </li></ul>
38. 5. Basic Transparency when? where? how? <ul><li>Gosling: distributed computing is not transparent to either failure or performance (once again!) </li></ul><ul><li>Transparent / opaque to whom? </li></ul><ul><li>For an optimal RDBMS, at the kernel-mode library programming level, there is no clustering or failure transparency, and the task is to provide transparency (whenever possible) to the layers above </li></ul><ul><li>For the vast majority of hardware and software failures, even most double failures, an optimal RDBMS seamlessly rolls along without aborting transactions, so that the applications don’t need to worry about those failures … but here are the three major types of failures that applications will see: </li></ul>
39. 5. Basic Transparency when? where? how? <ul><li>(1) Occasionally, the seamless operation and functioning of the application is interrupted, as in the total loss (a very rare double failure) of the cluster computers containing a resource manager (RM) … since the TM (transaction manager) library doesn’t know which tx touched which RM after all the copies of the RM-tx state are lost, because of CAB-WAL-WDV </li></ul><ul><li>The Nonstop RM uses a variation of WAL, the Write Ahead Log protocol (Gray/Reuter 10.3.7.6): WAL guarantees that database blocks which get changed in the RM buffer cache must be written first to the log (serial writes are >= 10X faster, treating the log disk like a tape), before they ever get written to the database disk (random writes are slow) </li></ul>
40. 5. Basic Transparency when? where? how? <ul><li>Then the database disk writes are scheduled to go out between the every-five-minutes state checkpoints (what has changed since the last RM checkpoint) that the RM makes to the log – these writes are not demand based, so they can go out in leisurely fashion (until you get close to the next RM checkpoint time, then things get hectic) </li></ul><ul><li>The WAL variation that the Nonstop RM uses is the CAB-WAL-WDV protocol: </li></ul><ul><ul><li>CAB - Checkpoint Ahead Buffer: the RM first checkpoints the log buffer to the RM backup process; the backup uses a neat trick to fabricate all the update locks from the log records, and if the primary dies, the backup can take over and will do the log write again (idempotence) </li></ul></ul><ul><ul><li>WAL – Write Ahead Log: then write the database changes to the log </li></ul></ul><ul><ul><li>WDV – Write Data Volume: leisurely write back the dirty database buffer cache blocks </li></ul></ul>
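A toy model of the CAB-WAL-WDV ordering (all names here are invented; this is not DP2 code). The invariant it enforces: a dirty cache block reaches the data volume only after its log records have been checkpointed to the backup (CAB) and written to the log (WAL):

```python
class RM:
    """Sketch of CAB-WAL-WDV: checkpoint ahead buffer, write ahead log,
    write data volume, in that order."""
    def __init__(self):
        self.log_buffer = []      # undo/redo records not yet on disk
        self.backup_ckpt = []     # CAB: mirrored into the backup process
        self.log_disk = []        # WAL: the log partition
        self.data_disk = {}       # WDV: the database volume
        self.dirty = {}           # block id -> (value, VSN of last change)
        self.flushed_vsn = 0
        self.vsn = 0

    def update(self, block, value):
        self.vsn += 1
        self.log_buffer.append((self.vsn, block, value))
        self.dirty[block] = (value, self.vsn)

    def flush_log(self):
        self.backup_ckpt.extend(self.log_buffer)   # CAB first
        self.log_disk.extend(self.log_buffer)      # then WAL
        if self.log_buffer:
            self.flushed_vsn = self.log_buffer[-1][0]
        self.log_buffer.clear()

    def write_back(self, block):
        value, vsn = self.dirty[block]
        if vsn > self.flushed_vsn:
            self.flush_log()                       # WAL before WDV
        self.data_disk[block] = value              # WDV last
        del self.dirty[block]
```

The `write_back` guard is the whole point: the data volume never sees a change whose log records could still be lost.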
41. 5. Basic Transparency when? where? how? <ul><li>The neat trick that the backup RM does to restore update locks cannot restore the shared read locks that protect against the write skew and wormhole problems of MVCC databases; that is the very reason the TM library will have to abort all the transactions in the cluster to restore the state of the database, and this will require applications to resubmit any uncommitted updates (like the current mini-batch in NASDAQ SuperMontage does on Nonstop) </li></ul><ul><li>Note that the RM could checkpoint shared read locks, but that would be a constant pain suffered to alleviate an only occasional irritation </li></ul><ul><li>(2) If the application process that begins a transaction dies, the tx gets aborted and work must be resubmitted </li></ul>
42. 5. Basic Transparency when? where? how? <ul><li>(3) Finally, if there is a rare TM or total cluster crash, due to a double log disk failure or the failure to restart one of the cluster computers supporting the logging subsystem (like registry problems), then the optimal RDBMS transaction service must be restarted or a disaster recovery initiated (which is clearly not very transparent to applications) </li></ul><ul><li>So when/where/how is this transparent to the application? </li></ul><ul><ul><li>In the fact that the optimal RDBMS has no wormholes in it due to the write skew problems of snapshot isolation databases that employ MVCC </li></ul></ul><ul><ul><li>In the consistent isolation view of the database outside of the transaction that either commits or aborts </li></ul></ul>
43. 5. Basic Transparency when? where? how? <ul><li>So when/where/how is this transparent to the application? (continued) </li></ul><ul><ul><li>In the guarantee that if the transaction service says commit and the entire system takes a nosedive a nanosecond later, then the transaction data is there and it’s consistent </li></ul></ul><ul><ul><li>In the guarantee that if the transaction service says abort, then all transaction-protected work is undone completely before any transaction locks are released </li></ul></ul><ul><ul><li>In that the application needs to do nothing but use a transaction to guarantee all of that consistency </li></ul></ul>
44. 6. Basic Scalability <ul><li>The original clustered database view of scalability came from David DeWitt and Jim Gray’s 1990 paper on database parallelism : </li></ul><ul><ul><li>Speedup – when you can double the hardware and get the same work done in half the time </li></ul></ul><ul><ul><li>Scaleup – when you can double the hardware and get twice the work done in the same time </li></ul></ul><ul><li>Nowadays ‘scaling up’ means roughly what speedup meant, and ‘scaling out’ means roughly what scaleup meant, although I’ve noticed that different people mean drastically different things when using the modern phrasing: the DeWitt and Gray terms had very precise meanings, and a scalable system does both </li></ul> TR-90.9 Parallel Database Systems: The Future of Database Processing or a Passing Fad? <http://www.hpl.hp.com/techreports/tandem/TR-90.9.html>
45. 6. Basic Scalability <ul><li>Scalability of database logging performance inside the Nonstop cluster and for disaster recovery is accomplished by a three-phase commit flushing algorithm and the forced group commit write </li></ul><ul><li>The Nonstop RMs (called ‘DP2’) would not force-write database updates to the log (except in highly unusual circumstances); instead those updates would be streamed to the log partition’s (called ‘auxiliary audit trails’) input buffer, using asynchronous and multi-buffered writes </li></ul><ul><li>Nonstop uses the WAL (write ahead log) protocol so that writes only have to be scheduled to the resource manager database disk every five minutes or so (their disk checkpoints are called ‘control points’), for nearly “in-memory” update database performance for the resource manager disk </li></ul>
46. 6. Basic Scalability <ul><li>The combination of group commit and WAL yields just short of “in-memory” RDBMS performance, because of the Five Minute Rule : keep a data item in electronic memory if its access frequency is 5 minutes or higher; otherwise keep it in magnetic memory (Gray/Reuter 2.2.1.3). This rule was originally calculated for a 1KB page size; it still comes out to 5 minutes for a 64KB page size – and this gives us guidance as to what is about the right page size to use </li></ul>
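The rule comes from Gray and Putzolu's break-even formula, sketched here with hypothetical hardware prices (the numbers below are illustrative assumptions, chosen only to land near the five-minute mark, not the original 1987 figures):

```python
def break_even_seconds(pages_per_mb_ram: float,
                       accesses_per_sec_per_disk: float,
                       price_per_disk: float,
                       price_per_mb_ram: float) -> float:
    """Keep a page cached in RAM if it is re-referenced more often than
    this interval; otherwise it is cheaper to leave it on disk."""
    return (pages_per_mb_ram / accesses_per_sec_per_disk) * \
           (price_per_disk / price_per_mb_ram)

# Hypothetical numbers: 64KB pages (16 per MB), 100 IO/s drives,
# a $2000 drive, RAM at $1 per MB -> 320 seconds, roughly five minutes.
interval = break_even_seconds(16, 100, 2000, 1)
```

As RAM gets cheaper and drives get faster, the two terms drift in opposite directions, which is why the interval has stayed near five minutes across page sizes.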
47. 6. Basic Scalability <ul><li>At commit time, the Nonstop library transaction service induces explicit RM log flushing only when necessary, from the interrupt service level of the TM library (100 times cheaper than process message wakeups). In busy systems the RMs are stream-writing ahead continuously to the log, so that the transaction updates are almost always already flushed to the log when commit time comes (unless the transactions are tiny and unbuffered) </li></ul><ul><li>When flushes due to commit (and abort) are reported to the commit coordinator (for Nonstop, called the ‘TMF Tmp’) on a busy system, they are lumped together into a single and periodic forced write into the log, called a group commit </li></ul>
48. 6. Basic Scalability <ul><li>The group commit write by the RDBMS commit coordinator is the one and only time in the system that the transactional database application absolutely must wait for the disk to spin and the drive head to move, and it’s a shared experience (and thereby scalable for the cluster’s transaction service) </li></ul><ul><li>So, why is writing to one log disk faster than writing in parallel to a bunch of RM data volume disks? If there is no other write-head-moving activity for that disk, and if we write it sequentially using big buffers with effective disk sector management, then by treating a disk like a tape we get 20-100 times the writing throughput (Gray/Reuter 2.2.1.2) </li></ul>
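A back-of-the-envelope check on that 20-100x claim, using a hypothetical drive (the seek, rotation, media-rate, and block-size numbers below are assumptions, not measurements):

```python
def random_write_mb_per_s(block_kb: float, seek_ms: float,
                          half_rotation_ms: float,
                          media_mb_per_s: float) -> float:
    """Every random write pays a seek plus half a rotation before the transfer."""
    secs = (seek_ms + half_rotation_ms) / 1000.0 \
         + (block_kb / 1024.0) / media_mb_per_s
    return (block_kb / 1024.0) / secs

# Hypothetical drive: 8 ms seek, 4 ms half-rotation, 50 MB/s media rate.
# Sequential writing with big buffers runs at the media rate.
speedup = 50 / random_write_mb_per_s(16, 8, 4, 50)
```

With these numbers the speedup is roughly 40x; bigger blocks push it toward the low end of the 20-100x range, smaller blocks toward the high end.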
49. 6. Basic Scalability <ul><li>Ultimately, however, you can easily generate more joint-serialized database log record blocks than one log disk can receive, so the optimal RDBMS log is vertically partitioned N-1 ways (on Nonstop, called the ‘merged audit trail’) </li></ul><ul><li>But you still only force-write one group commit buffer to the log root (on Nonstop, the ‘master audit trail’) while streaming log blocks to the N-1 leaf log partitions </li></ul><ul><li>So, part of the configuration of an optimal RDBMS clustered transaction service is to assign RMs to log partitions. Reassigning RMs to log to different log partitions should not require the transaction service to be brought down, and needs to be performable online (several issues, too complex to discuss here) </li></ul>
50. 6. Basic Scalability <ul><li>Let’s talk about how a swarm of RMs can be flushed for transaction commit (or abort): </li></ul><ul><ul><li>Where each RM is flushing its log record contribution to a particular log partition (leaf) </li></ul></ul><ul><ul><li>And which log partition (leaf) is itself flushed during the group commit for the merged log (root) </li></ul></ul><ul><li>To ensure scalability (that means both speedup and scaleup ), all this flushing needs to occur: </li></ul><ul><ul><li>Without causing unnecessary forced writes from RMs through their log partition input buffers </li></ul></ul><ul><ul><li>And without causing unnecessary forced flushes for already-flushed or non-participating log partitions underneath the log root commit write </li></ul></ul>
51. 6. Basic Scalability <ul><li>Before an RM can do anything to a transaction-protected file on behalf of a client request, it needs to be doing so on behalf of a valid transaction: first, the TID (transaction identifier) from the message header, which was sent under the client’s invocation of the transactional file system, is used when the RM performs a bracketing Check-In call to the TM (transaction management) library to create a crosslink element between the RM and the TID in the TM globals, where these connections are tracked for cluster transaction flushing by the TM library at the correct time </li></ul><ul><li>The crosslink stores a VSN (Volume Sequence Number), which is initially set to infinity (binary ones, or hex FFFFs): and that means that transaction work is in progress (similar, but not quite the same as the term ‘LSN’ from Gray/Reuter 9.3.3; more on the VSN below) </li></ul>
52. 6. Basic Scalability <ul><li>As an aside, there will be at least one transaction flush in the cluster for commit or abort, and then potentially many flushes for successive undo attempts (so TM library transaction flush broadcasts also have sequence numbers), until the transaction is successfully backed out; the recovery of successive backout attempts can get extraordinarily worse, since each attempt can apply, as undo, all the undo of the original transaction and all the successive undo of the previous attempts … this problem is solved by chaining and avoiding redundant undo records in the log in the following patent by theoretician and expositor Jim Gray’s favorite practitioner, Franco Putzolu, et al : </li></ul> Method for providing recovery from a failure in a system utilizing distributed audit [log records] <http://www.google.com/patents?id=L_IWAAAAEBAJ&dq=5,832,203>
53. 6. Basic Scalability <ul><li>When an RM does an update, insert or delete of some row in an SQL table or an index entry, that item has to be contained in a cache block which was read from the disk and is now contained in the RM’s buffer cache: if it’s not in cache, it must be read into cache now, and that delay is the source of Jim Gray’s Five Minute Rule </li></ul><ul><li>Modifying that cache block atomically requires the increment of the 64-bit VSN counter for the RM: the VSN counts transactional database changes monotonically for this RM in the log partition that it streams changes to, such that {RMID (resource manager identity), VSN} pairs in that log partition’s history precisely measure the progress in flushing the log stream for this RM </li></ul>
54. 6. Basic Scalability <ul><li>After incrementing the VSN, the </li></ul><ul><ul><li>RMID </li></ul></ul><ul><ul><li>VSN </li></ul></ul><ul><ul><li>TID </li></ul></ul><ul><ul><li>And the previous state and the new state of the database item that was changed </li></ul></ul><ul><li>… these are logically described in an undo/redo log record, and that record is inserted at the end of the RM’s log write buffer: some RDBMS products separate the redo and undo logs, but since you need to read both to do RM crash recovery, and since you have already successfully scaled by partitioning the log, why complicate the log further? (We will discuss physical vs. logical redo later on) </li></ul>
55. 6. Basic Scalability <ul><li>When the RM is finished working on behalf of this transactional client file system request message, the RM calls the end-bracketing Check-Out call to the TM library with the VSN, which can have three kinds of values: </li></ul><ul><ul><li>Infinity: binary 111111s/hex FFFFFFs means that transaction work is in progress; a Check-Out call is expected soon </li></ul></ul><ul><ul><li>Zero: a transactional read of the data was done and will be replied to very soon (a shared read lock is held) </li></ul></ul><ul><ul><li>Positive and not Infinity: we changed some data and will reply very soon (an exclusive update lock is held) </li></ul></ul>
56. 6. Basic Scalability <ul><li>The RM should have (at least) two log write buffers, so that it can be filling one with log records from changes to the database, while asynchronously writing the other to the input buffer of the log (synchronous writes only happen under some very narrow circumstances related to something being down or wrongly configured) </li></ul><ul><li>The log needs to be able to handle multiple simultaneous input messages from RMs and also be able to place them in a ring buffer, because many RMs are multi-buffering writes to it, and it is possible that an individual RM’s buffer messages can be queued out of order, and you would rather put a message aside than cancel it to force a retry </li></ul>
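The put-aside behavior can be sketched like this (a toy model with invented names): per-RM sequence numbers detect out-of-order buffer messages, which are set aside and drained once the gap fills, instead of being cancelled to force a retry:

```python
class LogInput:
    """Sketch of the log partition's input side: each RM multi-buffers
    its writes, so one RM's messages can arrive out of order."""
    def __init__(self):
        self.next_seq = {}      # rm_id -> next expected buffer sequence
        self.put_aside = {}     # rm_id -> {seq: payload}, arrived early
        self.ring = []          # accepted buffers, in order per RM

    def receive(self, rm, seq, payload):
        expected = self.next_seq.setdefault(rm, 0)
        if seq != expected:
            # Early arrival: put it aside rather than forcing a retry.
            self.put_aside.setdefault(rm, {})[seq] = payload
            return
        self.ring.append((rm, seq, payload))
        self.next_seq[rm] = seq + 1
        # Drain any set-aside messages that are now in order.
        aside = self.put_aside.get(rm, {})
        while self.next_seq[rm] in aside:
            s = self.next_seq[rm]
            self.ring.append((rm, s, aside.pop(s)))
            self.next_seq[rm] = s + 1
```

The ring ends up holding each RM's buffers in sequence even when the network delivered them out of order.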
57. 6. Basic Scalability <ul><li>The RM writes out its log buffer when it gets full enough (when things are busy), or when a transaction flush broadcast requests that the buffer be written (mostly happens when things are not very busy), or when someone configured the buffer cache to be too small (this is because of WAL - write ahead log): after the log write is complete, the RM records the highest VSN written in the log write, and the resulting LPTR (log pointer: 64-bit counter of total blocks written to this log partition), into TM globals with a TM library call </li></ul><ul><li>When an RM runs out of memory, it tries to write back dirty cache blocks to the database disk, but this requires that other things be done first because of the CAB-WAL-WDV policy (checkpoint ahead buffer - write ahead log - write data volume, in that order) </li></ul>
58. 6. Basic Scalability <ul><li>This is not very scalable, so don’t under-configure the memory size of the RM buffer cache </li></ul><ul><li>Why? Because an out-of-memory RM must checkpoint the log write buffer to the backup RM (CAB), to be able to synchronously write the log buffer to the input buffer of the log (WAL), and then to be able to write back cache blocks whose log records have not been flushed to the log yet (WDV) </li></ul><ul><li>Stepping away from the issues of a sick data volume: after all the work for the transaction has been done, the user asks the file system to commit the work, and under that call the TM library does a cluster group commit broadcast, which is not much different from an abort broadcast (caused either by direct abort invocation or spontaneously from some failure or anomaly) </li></ul>
59. 6. Basic Scalability <ul><li>As yet another aside, when the Nonstop software stack was ported to Windows NT clusters in the late 1990s, an ultra-fast and ultra-reliable version of the network transaction flushing and two-phase commit broadcast was written to utilize UDP or TCP group broadcast service (multicast) on ethernet, which did unicasts to complete unreplied multicast messages, for incredible scalability, see the patent : </li></ul><ul><li>That port was very successful, and the boxed release was released twice to the Paris Stock Exchange, and then it was mysteriously pulled back by Compaq (the Nonstop mainframe people didn’t complain much) </li></ul> Transaction state broadcast method using a two-stage multicast in a multiple processor cluster <http://www.google.com/patents?id=pOEIAAAAEBAJ&dq=6,247,059>
60. 6. Basic Scalability <ul><li>Getting back to our scenario, out of the commit broadcast a transaction flush packet arrives in the computer’s packet service, which calls the TM library in kernel mode (all traps off, no virtual memory swaps, locked-down memory and code), and that call runs through the TID’s (transaction’s) list of crosslinks in TM library globals checking the VSN values: </li></ul><ul><ul><li>Infinity: binary 111111s/hex FFFFFFs means that transaction work is in progress, and this should not be happening for a commit flush (after ‘commit work’ has been called): call a fail-fast halt to save the database from corruption due to a ‘late RM checkin’ </li></ul></ul>
61. 6. Basic Scalability <ul><li>Checking the crosslink VSN values (continued): </li></ul><ul><ul><li>Zero: data was only read (a shared read lock is held); wake up the RM to release locks now, this RM is flushed </li></ul></ul><ul><ul><li>Positive and not Infinity: data was modified (an exclusive update lock is held); now you have some work to do </li></ul></ul><ul><li>For every RM you find on the crosslink list for this TID that has a positive VSN: </li></ul><ul><ul><li>If the crosslink VSN is less than or equal to the highest VSN written for this RM from log writing (it deposited that value in TM globals after completing the log write), then this RM is flushed for this TID; carry on to the next </li></ul></ul>
62. 6. Basic Scalability <ul><li>For every RM you find on the crosslink list for this TID that has a positive VSN (continued): </li></ul><ul><ul><li>If the crosslink VSN is greater than the highest VSN written for this RM from log writing, then wake up this RM to flush its log write buffers until the crosslink VSN is less than or equal to the highest VSN written for this RM, then make a TM library call to deposit this RM’s {LPID (log partition identity), LPTR (log pointer)} pair into the TM TID structure which holds the crosslink list </li></ul></ul>
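The crosslink scan described above condenses into a small sketch (the names are invented; the real TM library runs this at interrupt level over its globals):

```python
INFINITY = 0xFFFFFFFFFFFFFFFF   # 'work in progress' sentinel VSN

def scan_crosslinks(crosslinks: dict, highest_written: dict):
    """crosslinks: {rm_id: crosslink VSN for this TID};
    highest_written: {rm_id: highest VSN that RM has logged so far}.
    Returns (RMs to wake for lock release, RMs to wake for a flush);
    a late check-in triggers the fail-fast halt."""
    release, flush = [], []
    for rm, vsn in crosslinks.items():
        if vsn == INFINITY:
            raise SystemExit("late RM checkin: fail-fast halt")
        if vsn == 0:
            release.append(rm)                 # read-only: just drop locks
        elif vsn <= highest_written.get(rm, 0):
            pass                               # already flushed for this TID
        else:
            flush.append(rm)                   # wake RM to flush its buffers
    return release, flush
```

An RM whose crosslink VSN is already covered by its log writes costs nothing at commit time, which is what the continuous stream-writing buys.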
63. 6. Basic Scalability <ul><li>Once the TID’s crosslink list is run through to the point that all the RMs are flushed and have been awakened to release shared read locks, the TM library replies back to the commit broadcaster that this computer is flushed for this TID, with the concise list of {LPID, MAX LPTR} pairs flushed </li></ul><ul><li>Once the TM library that initiated the commit flush broadcast has gotten the replies back from all the computers in the cluster (someone may have aborted), the TID results (i.e., committed/aborting) and the concise list of {LPID, MAX LPTR} pairs for the whole cluster are sent to the computer containing the TM commit coordinator (for Nonstop, called the ‘TMP’) </li></ul>
64. 6. Basic Scalability <ul><li>The TM commit coordinator wakes up on a timer, which is set tiny if things are not busy in the cluster (not enough traffic to get anything out of group commit), medium if there is enough business (response time decreases when you go as a group at this point, like metering lights on the freeway) and shorter as business picks up (at peak rates, the commit timer should be as short as will allow maximum throughput and minimum response time)  </li></ul><ul><li>If you want microscopic response times, you would use RM-only transactions, which are focused on one RM and only flush that RM, not the whole cluster </li></ul> TR-88.1 Group Commit Timers and High-Volume Transaction Systems <http://www.hpl.hp.com/techreports/tandem/TR-88.1.html>
65. 6. Basic Scalability <ul><li>When the TM commit coordinator has awakened, the committed/aborting flushed packets from the cluster since the last wakeup are scooped up and lumped together into a group commit (and abort) </li></ul><ul><li>First, all the log partitions are sent a message to flush their log partition input buffers to their log disk, </li></ul><ul><ul><li>iff they were included in the joined and concise list of {LPID, MAX LPTR} pairs; otherwise they are not involved in any transaction in the current group committed/aborting list: when the log partition receives the message it will flush its log partition input buffer to disk: </li></ul></ul><ul><ul><ul><li>iff the MAX LPTR associated with this LPID is not already flushed to disk; otherwise it will reply OK (this LPID is flushed) </li></ul></ul></ul>
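That double "iff" filter can be sketched as follows (names invented for illustration): a partition is asked to flush only if it appears in the concise list, and it does a physical flush only if its on-disk LPTR is actually behind:

```python
def flush_partitions(on_disk_lptr: dict, concise_list: dict) -> set:
    """on_disk_lptr: {LPID: highest LPTR already flushed to that log disk};
    concise_list: the joined {LPID: MAX LPTR} pairs for this group.
    Returns the set of partitions that had to do a physical flush; all
    others (already flushed, or not in the list) just reply OK."""
    forced = set()
    for lpid, max_lptr in concise_list.items():
        if on_disk_lptr.get(lpid, 0) < max_lptr:
            on_disk_lptr[lpid] = max_lptr   # the physical input-buffer flush
            forced.add(lpid)
    return forced
```

In a busy system the streamed writes have usually overtaken the requested LPTRs already, so `forced` is often empty and the group commit pays only its single root write.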
66. 6. Basic Scalability <ul><li>When the last log partition that it sent flush requests to has replied OK, the TM commit coordinator will write the group of committed and aborting transaction state log records in the log write buffer, by doing a waited write to the log root, and when that acknowledgment comes back, the group commit is complete for all the transactions … note the following: </li></ul><ul><ul><li>In a busy system (when it counts) every forced write, except the single commit write for every group commit, could have already been accomplished through streaming by the time it was requested (it’s only that we have noticed this by good bookkeeping) </li></ul></ul>
67. 6. Basic Scalability <ul><li>Final notes on group commit (continued): </li></ul><ul><ul><li>That one buffered and forced write of transaction state log records to the log root by the TM commit coordinator is comparatively tiny, and all the transactions in the system piggyback together on that one timer-driven delay: it is shared and periodic, like a rapid heartbeat </li></ul></ul><ul><ul><li>If any VSN or log pointer information is lost by an RM or log partition takeover, we will have to force-flush the log partitions during commit, for a while </li></ul></ul><ul><ul><li>If you can’t stand that wait for the log to do all this flush coordination and a serial write, then use RM-only transactions </li></ul></ul>
68. 6. Basic Scalability <ul><li>When transactions span the network to other clusters (or heterogeneously to other vendor systems), then the commit coordinators on the two or more clusters do non-blocking three-phase commit to guarantee the joint commit or abort of the distributed transaction </li></ul><ul><li>Optimal RDBMS distributed commit performance will do 60% of the local maximum transaction rate across as many nodes as the customer needs. That “scaling out” (Gray’s scaleup ) is accomplished by a method called “Mother-May-I”, and is described in two Nonstop patents : </li></ul> Hybrid method for flushing transaction state in a fault-tolerant clustered database <http://www.google.com/patents?id=rUt4AAAAEBAJ&dq=7,028,219> Method for handling node failures and reloads in a fault tolerant clustered database supporting transaction registration and fault-in logic <http://www.google.com/patents?id=S-d3AAAAEBAJ&dq=6,990,608>
  69. 69. 6. Basic Scalability <ul><li>If transactions have locality of reference and only touch the local node with no lock conflicts (which means that no joint serialization is necessary), then an optimal RDBMS will do “scaling up” (Gray’s speedup ) to nearly the full 100% level </li></ul><ul><li>Based on the scalability of the cluster and the partitioned log, an optimal RDBMS replication service will consistently maintain the database on a remote cluster with only 1% DB overhead + 4% network messaging overhead on the primary cluster </li></ul><ul><li>The optimal RDBMS replication service will consume more of the remote cluster applying the updates (between 15% and 25%). </li></ul><ul><li>More than these listed values for overhead is a bug </li></ul>1-
  70. 7. Basic Availability outage minutes -> zero <ul><li>What is availability? </li></ul><ul><li>On some systems it is defined as the existence of a working Unix or Linux shell prompt </li></ul><ul><li>On some databases (Oracle) it has been quoted only on database software-produced outages, as though hardware and operating system-produced outages that are not tolerated by the database system are somehow not really happening to the customers </li></ul>
  71. 7. Basic Availability outage minutes -> zero <ul><li>In an optimal RDBMS, availability is measured in terms of database queuing: if you can begin a transaction, and queue up for a lock on any part of the database under that transaction with the likelihood of actually getting that lock and then accessing that data, then that data is considered available </li></ul><ul><li>If you can’t do all that on some part of the database, then that part of the database is actually unavailable </li></ul><ul><li>Availability, in Highleyman's “Breaking the Availability Barrier”, p. 32: Availability = MTBF/(MTBF + MTR), where any 'mean time between failures' will return an availability of 1 (eternally up) if the 'mean time to repair' is zero. </li></ul>
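Highleyman’s formula is easy to check numerically. A small sketch, using the figures quoted on the neighboring slides (12 years MTBF, 30-minute repair time) and the usual “count the nines” transform:

```python
import math

def availability(mtbf_hours, mtr_hours):
    """Highleyman: Availability = MTBF / (MTBF + MTR)."""
    return mtbf_hours / (mtbf_hours + mtr_hours)

def nines(a):
    """Leading nines of an availability figure: 0.99999 -> 5.0."""
    return -math.log10(1 - a)

# 12 years overall MTBF with a 30-minute repair time:
a = availability(12 * 365 * 24, 0.5)
print(round(nines(a), 1))   # → 5.3, i.e. roughly the "5½ nines" quoted
```

Note that as MTR approaches zero the availability approaches 1 regardless of MTBF, which is why the slides emphasize driving repair time down rather than only driving failures out.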
  72. 7. Basic Availability outage minutes -> zero <ul><li>Tandem’s Nonstop TMF has had excellent fault tolerance out-of-the-box for nearly 25 years with non-blocking three phase commit coordination between NonStop cluster nodes: when the Tmp process or CPU dies, the backup Tmp takes over with no perceptible outage or loss of state or any transactions being aborted, to the tune of 5½ nines of availability or 12 years overall MTBF (restarting after total system failure due to double failure makes a 30 minute repair time). Add RDF to make 7 nines in the British banking system (Mosher), or an astonishing 38 years overall MTBF (RDF repair time is consistently under 2 minutes) </li></ul><ul><li>IBM Parallel Sysplex, using mainframe DB2, says (2003) that they can do 50 years overall MTBF (open your wallet wide for IBM services, because that’s definitely not out-of-the-box) </li></ul>
  73. 7. Basic Availability outage minutes -> zero <ul><li>An optimal RDBMS should drastically exceed this level of fault tolerance, going to more than 1 million years overall MTBF and twelve nines (with a 30 second repair time), and this should not require expensive legacy services or onerously complex configuration: it should work that way out-of-the-box </li></ul><ul><li>Tandem Nonstop was the first to make all the common database operations seamlessly transparent online: SQL partition reorganize, split and merge, partition move to another disk, changing the log partition that an SQL partition logs to, SQL catalog changes (add, modify and delete table; add, modify and delete fields); all of these operations can be done mostly without changing applications and even without modifying query plans from the optimizer (many systems have hundreds of query plans stored and customers hate recompiling all that) </li></ul>
  74. 7. Basic Availability outage minutes -> zero <ul><li>The 4 SQL online partition operations (reorganize, split, merge, and move) are all done starting with a rollforward from an online (fuzzy) dump, applying changes (redo) from the log starting at dump time forward until nearly caught up, then slowing user updates at the end to catch up (avoiding infinite overtaking). You could call this ‘recovery in place’ </li></ul><ul><li>An optimal RDBMS will operate in such a seamless way, and also the transaction abort, RM recovery (data volume recovery) and media failure recovery (archive recovery) will all do their job without writing through to the database disks, which runs up to 100 times faster (only rebuilding RM disk buffer cache) </li></ul><ul><li>This guarantees minimal outage times (MTR) on the functions of the database which are subject to these more visible operational outages </li></ul>
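The ‘recovery in place’ shape of those online partition operations can be sketched in a few lines (an assumed simplification: the dump, redo stream, and live tail are modeled as plain key/value sequences):

```python
def reorganize_online(dump, redo_log, live_tail):
    """Sketch of 'recovery in place': restore a fuzzy online dump, roll
    forward redo from the log from dump time onward, then apply a short
    final catch-up while user updates are slowed (avoiding infinite
    overtaking), and switch over."""
    new_partition = dict(dump)            # restored fuzzy dump
    for key, value in redo_log:           # redo from dump time forward
        new_partition[key] = value
    for key, value in live_tail:          # final catch-up, updates slowed
        new_partition[key] = value
    return new_partition                  # ready to switch over

old  = {"a": 1, "b": 2}
redo = [("b", 3), ("c", 4)]
tail = [("a", 5)]
print(reorganize_online(old, redo, tail))   # → {'a': 5, 'b': 3, 'c': 4}
```

The point of the structure is that the old partition stays fully available for reads and writes the entire time; only the brief final catch-up touches user traffic at all.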
  75. 7. Basic Availability outage minutes -> zero <ul><li>Much of this is possible because Nonstop uses logical keys instead of record pointers (RIDs) to interconnect the btree pages/blocks … </li></ul><ul><li>IBM’s mainframe DB2 uses RIDs to connect the leaf levels in btrees, while an optimal RDBMS uses logical keys: this allows btrees to be moved without modification of the btree data, whereas IBM RIDs are only valid at that disk address and need to be remapped to move them. This causes holes in the implementation of utility functions for availability, known as the 'Halloween Problem' in the old days </li></ul><ul><li>They resolved Halloween by employing the Nonstop method (the two patents are nearly identical) for the SQL partition operations including the rollforward part and the infinite overtaking part </li></ul>
  76. 7. Basic Availability outage minutes -> zero <ul><li>However, RIDs disallow using SQL cursors against the base tables since the RIDs can be remapped asynchronously underneath the app. In DB2, cursors can only be used against snapshot isolated copies of the base tables. That RID infantile appendage has an impact on function </li></ul><ul><li>This is why an optimal RDBMS will split, merge, and move partitions, and reorganize the database on the fly without any availability outage, using archive recovery interfaces to pull the updates to the old partition from the log in real time and apply them to the new partition, and then switch over when the log tail is near, while SQL cursors are still active </li></ul>
  77. 7. Basic Availability outage minutes -> zero <ul><li>The SQL ‘recovery in place’ approach to seamless operations allows completely retry-able and transparent fault tolerance for RDBMS utility functions experiencing failures. (So you don’t have to dump the entire database before and after utilities run, as has historically been the case with Oracle) </li></ul><ul><li>However … if enough failures of the intolerable kind occur simultaneously … your database can become unavailable, or worse, unusable … so what, then? </li></ul><ul><li>The first thing in availability is to reduce your MTR (mean time to repair). In a transaction system that means to bounce back up quickly after a crash, and that means knowing what transactions and locks are outstanding in the cluster. </li></ul>
  78. 7. Basic Availability outage minutes -> zero <ul><li>IBM's mainframe DB2 uses a piece of special hardware called the CF (coupling facility), which acts as a smart memory to store shared buffers and locks. The CF is a mainframe processor running a special OS called CFCC. Then, each time cluster SMPs fail, the database elements needed for a quick restart are right there </li></ul><ul><li>Nonstop does not use special hardware in this way. Their innovation is to store their locks in the end of the log (in the last 5 minute RM checkpoint), and quickly reacquire those locks to allow RMs that have not failed to continue doing business, while those RMs requiring recovery get processed in some critical order; see this patent: </li></ul> Minimum latency reinstatement of database transaction locks <http://www.google.com/patents?id=9Lx6AAAAEBAJ&dq=7,100,076>
  79. 8. Application/Database Serialized Consistency the database must be serialized wherever it goes <ul><li>So what is database consistency? </li></ul><ul><li>It’s like pH: the higher the ACIDity, the stronger the database </li></ul><ul><li>The letters in ACID stand for Atomic, Consistent, Isolated, and Durable </li></ul><ul><li>A is for Atomicity and means all or nothing: the database everywhere must end up in a state whose visibility to the world outside the transaction is first the old state, then the new state (in the case of commit) or the state remains unchanged (in the case of abort) </li></ul>
  80. <ul><li>C is for Consistency, which seems like a circular definition, but actually it should probably be spelled ASID, because consistency in database work is really accomplished by serialization (as seen from below) </li></ul><ul><li>The database (on reputable systems) is really in the log; the RM disks are a convenient cache whose disk image is only rarely in a consistent state (only after a correctly completed shutdown of the optimal RDBMS transaction service, at which point the RM disk is fairly unusable) </li></ul><ul><li>Serialization in the log is defined by the exclusive existence of serialized transaction histories without wormholes, and no other kind </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  81. <ul><li>A transaction history starts in the log with the first update log record for a transaction </li></ul><ul><li>Then there’s a series of update records (btree block splits have multiple physical redo log records in a string) </li></ul><ul><li>Then there are one or more commit log records, xor one or more abort log records (either commit or abort, never both present) </li></ul><ul><li>Then, in the case of an abort, there are one or more undo log records </li></ul><ul><li>Finally, there are one or more forgotten log records to terminate the transaction history </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
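The record pattern above can be encoded as a quick validity check. This is a sketch with record types simplified to strings and record counts collapsed to presence/absence; the real log carries typed, multi-record sequences:

```python
def valid_history(records):
    """Check one transaction's log-record sequence against the pattern
    described above: updates, then commit XOR abort (never both), undo
    records only after an abort, terminated by forgotten records."""
    seen = set(records)
    if "commit" in seen and "abort" in seen:
        return False                      # commit xor abort: never both
    if "undo" in seen and "abort" not in seen:
        return False                      # undo only follows an abort
    return records[-1] == "forgotten"     # every history must terminate

print(valid_history(["update", "update", "commit", "forgotten"]))  # → True
print(valid_history(["update", "abort", "undo", "forgotten"]))     # → True
print(valid_history(["update", "commit", "abort", "forgotten"]))   # → False
```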
  82. <ul><li>The big thing about serialized transaction histories in the log is that even though they are mostly interspersed in order (concurrency), you must never have the case where a log record touching data for one transaction is historically interspersed with a log record touching that very same data for another transaction </li></ul><ul><li>This is called a wormhole (Gray/Reuter 7.5.8.1) and it occurs when a transaction is either not well-formed (using shared read locks and exclusive update locks) or is not two-phase (first acquiring and then releasing locks). “A transaction history is isolated if, and only if, it has no wormhole transactions.” - Jim Gray </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  83. <ul><li>You must be able to sort (using Jim Gray's sorting method in Gray/Reuter 7.5.8.1) the entire log by transaction timestamp to the same effect as the original log when replayed back into the database at recovery time: yielding the GOLD STANDARD of transaction systems: Wormhole-Free Transaction Histories </li></ul><ul><li>Strict two phase locking (S2PL) transaction concurrency is both well-formed and two-phase on purpose. </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
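The interspersal condition above can be tested mechanically. This is not Gray’s full sorting construction, just a sketch of the core test on a flattened log of (transaction, data item) records: a wormhole shows up when one transaction returns to an item that another transaction touched in between.

```python
def has_wormhole(log):
    """Detect interspersed same-item records from different transactions
    in a flattened log of (tx, item) pairs (sketch, not Gray's method)."""
    last_tx_for_item = {}
    finished = set()               # (item, tx) runs that have been broken
    for tx, item in log:
        prev = last_tx_for_item.get(item)
        if prev is not None and prev != tx:
            finished.add((item, prev))
            if (item, tx) in finished:
                return True        # tx came back after another tx intervened
        last_tx_for_item[item] = tx
    return False

# T1's records on item 'x' are interspersed with T2's -> wormhole
print(has_wormhole([("T1", "x"), ("T2", "x"), ("T1", "x")]))  # → True
# T1 finishes with 'x' before T2 touches it -> serialized, no wormhole
print(has_wormhole([("T1", "x"), ("T1", "x"), ("T2", "x")]))  # → False
```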
  84. <ul><li>Multiple version concurrency control (MVCC) does not enforce the well-formed part, because shared read locks are not used (even in repeatable read mode in most implementations I've seen) </li></ul><ul><li>MVCC databases (Oracle, SQL Server, Sybase, Postgres, MySQL) basically employ forms of snapshot isolation, which allow transactions to work on a private snapshot version of the DB, so most locking is unnecessary. Blocking only rarely occurs for concurrent transaction users, when a record is deleted or the primary key is updated. </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  85. <ul><li>MVCC databases can create wormholes by write skew: for instance, if two concurrent transactions read two different row.column values and then update each other's previously read row.column values. </li></ul><ul><li>You can only do one of three things when this kind of conflict occurs: 1. Block one tx (which MVCC can't do at all well) 2. Abort one tx (which some MVCCs do) or 3. Corrupt the database integrity (mostly this is what is done) </li></ul><ul><li>You know, there really isn’t any magic here </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
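The write-skew anomaly has a classic concrete form (the on-call doctors example, used here as an illustration matching the read-other/update-self pattern above). Under simulated snapshot isolation each transaction reads only its private snapshot, takes no shared read locks, and writes a different row than the one it checked, so an invariant spanning both rows is silently broken:

```python
def snapshot_write_skew(db):
    """Write skew under simulated snapshot isolation. Invariant: at
    least one doctor must remain on call. Each transaction checks the
    OTHER doctor in its private snapshot, sees them on call, and signs
    itself off -- no read locks, so neither sees the other's write."""
    snap1, snap2 = dict(db), dict(db)      # private snapshots, no blocking
    if snap1["bob"] == "on-call":          # T1: alice checks bob's row
        db["alice"] = "off"                # ...and updates her own
    if snap2["alice"] == "on-call":        # T2: bob checks alice's row
        db["bob"] = "off"                  # ...and updates his own
    return db

result = snapshot_write_skew({"alice": "on-call", "bob": "on-call"})
print(result)   # → {'alice': 'off', 'bob': 'off'}: invariant violated
```

Under S2PL, T1’s shared read lock on bob’s row would block T2’s exclusive update of it (and vice versa), forcing the two transactions to serialize instead of corrupting the invariant.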
  86. <ul><li>Mostly what MVCC database users end up doing is single-threading, by two kinds of means: </li></ul><ul><ul><li>Sharding: making many isolated databases, also called database federation (invented by Microsoft for the TPC-C benchmark in the late 1990’s)‏ </li></ul></ul><ul><ul><ul><li>Then, single-threading the shards by: </li></ul></ul></ul><ul><ul><ul><ul><li>Using single-threaded frameworks on dynamic languages (Ruby/Rails, Groovy/Grails, Python/Django, PHP/various) </li></ul></ul></ul></ul><ul><ul><ul><ul><li>Or by breaking up a hot-spot, like the bank branch balance in the TPC-C transactions, and making many, many single-threaded bank apps in accomplishing the benchmark </li></ul></ul></ul></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  87. <ul><li>Mostly what MVCC database users end up doing is single-threading, by various means (continued): </li></ul><ul><ul><li>Alternatively, creating a towering application stack (EAI: enterprise application integration) that: </li></ul></ul><ul><ul><ul><li>Filters every transaction and single-threads the consistency space </li></ul></ul></ul><ul><ul><ul><li>Queues and single-threads transactions that might conflict </li></ul></ul></ul><ul><ul><ul><li>Maintains a complicated and partially sharded, partially duplicated database schema </li></ul></ul></ul><ul><ul><ul><li>To prevent the possibility of write skew by making applications impossible to develop and maintain by end users, who become the ‘end-losers’ </li></ul></ul></ul><ul><ul><ul><li>That is necessary, because the RDBMS cannot protect itself from its applications: it is totally vulnerable to corruption by concurrent access </li></ul></ul></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  88. <ul><li>What all these single-threading methods accomplish is to effectively convert an MVCC RDBMS model to a fragile application SS2PL model: strong strict two phase locking, where even read locks are held until the transaction updates are flushed to disk </li></ul><ul><li>So, when a MySQL executive recently said that the era of the Jim Gray database was passing, it was only true for web 2.0, not the critical enterprise or critical computing, where the answers really matter, and the big money is at risk </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  89. <ul><li>Of the MVCC databases, only Microsoft SQL Server can completely pass the TPC-E benchmark, which checks for inconsistencies (so they are doing something cute): </li></ul><ul><ul><li>If you do aborts and raise the concurrency, you end up aborting more concurrent transactions: 2 of 3, then 3 of 4, etc. </li></ul></ul><ul><ul><li>If you try to catch inconsistencies on-the-fly and then single-thread their schedule … well, you should have blocked on shared-reads to begin with </li></ul></ul><ul><ul><li>There is no guarantee these detection methods work; using timestamps is racy, and only sharpens the guillotine blade. Once you start to get asymptotically close to correct, you and your users will start to trust your erroneous implementation </li></ul></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  90. <ul><li>So, what does all this S2PL get you? Isn't it slow? </li></ul><ul><li>Nonstop demonstrated the converse in the ZLE benchmarks on 15 nodes with 200K TPS of updates, with hundreds of queries constantly running against the base tables, and with trickle batch. It was their finest hour. You need to be smart and a good database designer, but it can be done. And without isolating fractured databases. </li></ul><ul><li>The first wonderful thing you get from S2PL is the magic of distributed database for free. </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  91. <ul><li>Imagine 10, 20, 100 clusters of database side by side in a massive group. Now allow them to share transactions with global tx IDs that have local tx IDs for each system. The RMs on the different nodes do not interact or scheme to serialize their updates in any way </li></ul><ul><li>Now go at the database concurrently with 1000 transactions and just try to make a wormhole: you can't do it </li></ul><ul><li>What's stopping you from corrupting the database? The RMs aren't coordinating, the commit coordination only orders the transaction state log records, not the RDBMS update records, so how is it protecting itself? </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  92. <ul><li>In an MVCC log, you have what is called “log serialization”, which is only serialized as far as the log is concerned, which is insufficient (which is why you have to single-thread the applications). </li></ul><ul><li>In an S2PL log you get “application serialization”, where the applications serialize their own updates, by waiting for shared read and exclusive update locks. The applications themselves are doing that magical, unseen coordination </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  93. <ul><li>So no matter how many clusters are involved in sharing distributed transactions in the massive group, if all of their logs are simply merged together (sorted by transaction timestamp) they should yield a joined log with total joint serialized histories for all the transactions anywhere in the entire contiguous computational universe of optimal RDBMS clusters </li></ul><ul><li>This is why you need to do three phase commit coordination between nodes: simply to guarantee that all the update records everywhere are actually on a log disk somewhere when we terminate or “Forget” the global transaction </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
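The merge step described above is an ordinary k-way merge of already-sorted streams. A sketch, modeling each cluster’s log as a timestamp-sorted list of records:

```python
import heapq

def merge_cluster_logs(*logs):
    """Merge per-cluster logs, each already sorted by transaction
    timestamp, into one jointly serialized history (k-way merge)."""
    return list(heapq.merge(*logs, key=lambda rec: rec[0]))

log_a = [(1, "T1 update"), (4, "T1 commit")]
log_b = [(2, "T2 update"), (3, "T2 commit")]
print(merge_cluster_logs(log_a, log_b))
# → [(1, 'T1 update'), (2, 'T2 update'), (3, 'T2 commit'), (4, 'T1 commit')]
```

The claim on the slide is that because every application already serialized itself through S2PL locks, this purely mechanical merge yields wormhole-free joint histories; no cross-cluster coordination of update records is needed, only the commit coordination of state records.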
  94. <ul><li>This also allows you to do replication of many cluster primary databases to many cluster backup databases and actually do a takeover and make that work, as well, to the last serialized transaction history commit. Only Nonstop RDF does this as of now, and that possibility only arises because of S2PL concurrency control </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  95. <ul><li>I is for Isolation and should probably be spelled ACLD, because isolation in a database really means locking (as seen from below) </li></ul><ul><li>For optimal RDBMS RM files, transaction duration locks are either exclusive update locks which block reads and updates, or shared read locks which block updates only (there are four other kinds of locks in a Nonstop system: held for session, message and operation duration) </li></ul><ul><li>Locks are only released after all of their associated transaction database work has ceased, and once the totality of database changes have hit the log disk: the locks are the fingers of the correctly ordered database in the log reaching out to the cache copy of the database and guaranteeing serialized behavior in the applications interacting through that cache copy of the database log </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
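The shared/exclusive compatibility rules just described fit in a minimal lock-table sketch (an assumed simplification: blocking is modeled as a refused acquire that the caller would retry, and only transaction-duration locks are shown):

```python
class LockManager:
    """Minimal shared/exclusive lock table: shared read locks block only
    updates; exclusive update locks block everything. Locks are held to
    transaction end (strict two-phase, well-formed)."""
    def __init__(self):
        self.table = {}    # item -> ("S", {txids}) or ("X", txid)

    def acquire(self, tx, item, mode):
        held = self.table.get(item)
        if held is None:
            self.table[item] = ("S", {tx}) if mode == "S" else ("X", tx)
            return True
        kind, owner = held
        if kind == "S" and mode == "S":
            owner.add(tx)              # shared readers are compatible
            return True
        return False                   # would block: caller must wait

    def release_all(self, tx):
        """Release only at transaction end (the 'strict' in S2PL)."""
        for item in list(self.table):
            kind, owner = self.table[item]
            if kind == "S":
                owner.discard(tx)
                if not owner:
                    del self.table[item]
            elif owner == tx:
                del self.table[item]

lm = LockManager()
print(lm.acquire("T1", "row7", "S"))   # → True
print(lm.acquire("T2", "row7", "S"))   # → True: reads share
print(lm.acquire("T3", "row7", "X"))   # → False: update blocks on readers
```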
  96. <ul><li>Hence, because of transactional isolation: </li></ul><ul><ul><li>Applications communicate with each other using simple relational propositional logic (Queries) ... </li></ul></ul><ul><ul><li>creating, changing or discarding truthful propositions (Rows) ... </li></ul></ul><ul><ul><li>in shared repositories of mutually agreed upon truths (Tables) … </li></ul></ul><ul><ul><li>through the database in complete, uninterrupted compound sentences (Transactions) </li></ul></ul><ul><ul><li>pausing in real-time, only to hear the complete, uninterrupted compound sentences of other concurrent applications (Locks) </li></ul></ul><ul><ul><li>otherwise, running at warp speed, unhindered by any other blockage to performance (Minimum Latency) </li></ul></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  97. <ul><li>Finally, D is for Durability and means that once an optimal RDBMS transaction service says it’s done, it’s really done </li></ul><ul><li>If the ENDTRANSACTION procedure call returns OK, and one nanosecond later the entire installation crashes, the data is there and it’s correct </li></ul><ul><li>For optimal RDBMS disaster recovery, that means that when it’s done on the database on the primary site, and after all the log records reach the remote site, it’s done there, too </li></ul>8. Application/Database Serialized Consistency the database must be serialized wherever it goes
  98. <ul><li>So, given that we have a safe copy of the database transaction history (wormhole-free) in the log and a periodic collection of archive dumps of the files, we can do database recovery, and there are two kinds: </li></ul><ul><ul><li>RM restart after TM restart (Gray/Reuter 11.4.2/TM, 11.4.3/RM or 11.4.6/Unified, 11.4.7-10/tricks): this is called ‘Volume Recovery’ on Nonstop; it is needed after a crash, because a clean shutdown pushing out all dirty cache blocks does not require recovery to start up the database volume </li></ul></ul><ul><ul><li>RM archive recovery (Gray/Reuter 11.5-6): this is called ‘File Recovery’ on Nonstop and is done after a media failure by restoring a fuzzy online dump and applying the log redo and then undoing incomplete transactions </li></ul></ul>9. Recovery putting it all back together again
  99. <ul><li>RM restart after TM restart: </li></ul><ul><ul><li>The transaction manager rebuilds its state from the last run by scanning the log root from the beginning of the penultimate TM checkpoint (last two): since each TM checkpoint contains all the transaction state records for transactions that did not generate log records since the last TM checkpoint, this guarantees building a complete snapshot list </li></ul></ul><ul><ul><li>As the state transitions are traversed for each transaction in the log, the state continually changes until the Forgotten state is reached, when the TM throws that transaction away </li></ul></ul>9. Recovery putting it all back together again
  100. 9. Recovery putting it all back together again <ul><ul><li>These are the transaction states, which are stored as records in the log root (the tense is crucial): </li></ul></ul><ul><ul><ul><li>Active state : only seen in the log if a working transaction gets caught by the periodic TM checkpoint </li></ul></ul></ul><ul><ul><ul><li>Prepared state : seen in a transaction which came in from a remote cluster (parent), and which is in the middle of a 2 or 3 phase commit </li></ul></ul></ul><ul><ul><ul><li>Committed state : seen in a transaction which has touched remote clusters (children), and which is at the end of a 2 or 3 phase commit </li></ul></ul></ul><ul><ul><ul><li>Aborting state : like I said </li></ul></ul></ul><ul><ul><ul><li>Forgotten state : this transaction is now durably going away </li></ul></ul></ul>
  101. <ul><ul><li>Transaction state record transitions in the log root: </li></ul></ul><ul><ul><ul><li>Active or nil -> forgotten (hurried, lockrelease after): local commit, neither parents nor children </li></ul></ul></ul><ul><ul><ul><li>Active or nil -> prepared (hurried, locks held): distributed commit, definitely having parents, maybe having children </li></ul></ul></ul><ul><ul><ul><li>Active or nil -> committed (hurried, lockrelease after): distributed commit, no parents, definitely having children </li></ul></ul></ul><ul><ul><ul><li>Active or nil -> aborting (hurried, locks held): maybe local or distributed </li></ul></ul></ul>9. Recovery putting it all back together again
  102. <ul><ul><li>Transaction state record transitions in the log root (continued): </li></ul></ul><ul><ul><ul><li>Prepared -> forgotten (hurried, lockrelease after): distributed commit, definitely having parents, but no children </li></ul></ul></ul><ul><ul><ul><li>Prepared -> committed (hurried, lockrelease after): distributed commit, definitely having both parents and children </li></ul></ul></ul><ul><ul><ul><li>Prepared -> aborting (hurried, locks held): distributed commit, definitely having parents, maybe having children </li></ul></ul></ul>9. Recovery putting it all back together again
  103. <ul><ul><li>Transaction state record transitions in the log root (continued): </li></ul></ul><ul><ul><ul><li>Committed -> forgotten : distributed commit, maybe having parents, definitely having children </li></ul></ul></ul><ul><ul><ul><li>Aborting -> aborting (hurried, locks held): try, try again to abort </li></ul></ul></ul><ul><ul><ul><li>Aborting -> forgotten (hurried, lockrelease after): maybe local or distributed </li></ul></ul></ul>9. Recovery putting it all back together again
  104. <ul><ul><li>Actions to take depending on transaction states in the TM after the restart scan of the last two TM checkpoints in the log root: </li></ul></ul><ul><ul><ul><li>Active : Locks must be held to abort this transaction </li></ul></ul></ul><ul><ul><ul><li>Prepared : request the commit/abort decision from the parent cluster; locks must be held on the RM so that the RM can be made available, in case the answer is abort, which will apply undo </li></ul></ul></ul><ul><ul><ul><li>Committed : notify all children clusters about Commit; after they have all responded, go forgotten and discard the transaction: no locks are held for the Committed state on the TM’s home cluster </li></ul></ul></ul><ul><ul><ul><li>Aborting : Locks must be held to abort this transaction </li></ul></ul></ul><ul><ul><ul><li>Forgotten : discard the transaction </li></ul></ul></ul>9. Recovery putting it all back together again
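The transition lists on the preceding slides form a small state machine, which can be captured as a table and checked mechanically (a sketch; `nil` stands for a transaction with no prior state record in the log root):

```python
# Allowed transaction-state transitions in the log root, as enumerated
# on the slides above ("nil" means no prior state record exists).
TRANSITIONS = {
    ("nil", "forgotten"),    ("active", "forgotten"),
    ("nil", "prepared"),     ("active", "prepared"),
    ("nil", "committed"),    ("active", "committed"),
    ("nil", "aborting"),     ("active", "aborting"),
    ("prepared", "forgotten"), ("prepared", "committed"),
    ("prepared", "aborting"),  ("committed", "forgotten"),
    ("aborting", "aborting"),  ("aborting", "forgotten"),
}

def legal_history(states):
    """Check that a transaction's state records form a legal chain of
    transitions, terminating in forgotten."""
    chain = ["nil"] + states
    ok = all((a, b) in TRANSITIONS for a, b in zip(chain, chain[1:]))
    return ok and states[-1] == "forgotten"

print(legal_history(["prepared", "committed", "forgotten"]))  # → True
print(legal_history(["committed", "prepared", "forgotten"]))  # → False
```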
  105. 9. Recovery putting it all back together again <ul><ul><li>If there is no lock reinstatement during crash recovery for the RMs, then the Prepared state transactions must be resolved by the TM by communicating with the parents of all Prepared transactions and committing or aborting them all, before bringing the database up </li></ul></ul><ul><ul><li>Lock reinstatement arises from RM periodic checkpointing, by appending to the RM log checkpoint all the locks that were not released or otherwise written to the log as redo records since the last RM log checkpoint, such that traversing the last two RM log checkpoints after a crash can rebuild the lock table (update locks only; these are sufficient for Prepared state management), from these and redo records </li></ul></ul>
  106. 9. Recovery putting it all back together again <ul><ul><li>If there is lock reinstatement, then the database can be brought online with Prepared transactions awaiting resolution and also by aborting the remaining (Aborting and Active state) transactions while online </li></ul></ul><ul><ul><li>The resource manager (RM) rebuilds its state by starting a redo scan of the log partition from the redo low water mark (retrieved from the TM) to the end of the log, applying all redo (original transaction forward work) into RM cache blocks for any transactions that were alive when the crash occurred: reinstating locks along the way, if that is supported </li></ul></ul>
  107. 9. Recovery putting it all back together again <ul><ul><li>If lock reinstatement is supported, the RM can be brought online at the end of the redo run; otherwise it can only be brought online after Prepared transactions are resolved and all the aborts are completed </li></ul></ul><ul><ul><li>The redo low water mark for an RM points to the earliest log partition write that has not been lazy-written to the data volume using the CAB(hurried)-WAL(hurried)-WDV(lazy) protocol </li></ul></ul><ul><ul><li>After the redo scan is complete, and using the list of Active and Aborting transactions remaining from the TM, the log partition is traversed from the end in reverse all the way to the undo low water mark, applying undo for the transactions until all of those transactions are completely undone: this can easily go beyond the last two RM checkpoints, and even beyond the redo low water mark - you hope all the undo is online and not on a T9840 StorageTek tape on a shelf somewhere </li></ul></ul>
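The redo-then-undo shape of RM restart can be sketched with before/after images (an assumed simplification: physical log records modeled as tuples, losers given as a set of transactions that were still alive at the crash):

```python
def rm_restart(log, losers):
    """Sketch of RM restart as described above: a forward redo scan from
    the low water mark replays all forward work into cache, then a
    reverse undo scan rolls back the 'loser' transactions that were
    alive when the crash occurred."""
    cache = {}
    for tx, key, before, after in log:             # redo: replay forward
        cache[key] = after
    for tx, key, before, after in reversed(log):   # undo losers in reverse
        if tx in losers:
            cache[key] = before
    return cache

# T1 never committed (loser); T2's work must survive.
log = [("T1", "a", 0, 1), ("T2", "b", 0, 2), ("T1", "a", 1, 3)]
print(rm_restart(log, losers={"T1"}))   # → {'a': 0, 'b': 2}
```

Redo first, then undo, is the standard order (as in Gray/Reuter): redo reconstructs the exact crash-time cache state, after which undoing the losers is a pure reverse traversal.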
  108. 9. Recovery putting it all back together again <ul><ul><li>If the current log pointers for the log root and the log partitions are periodically broadcast to the TM library on every computer in the cluster, then when any transaction is begun, the transaction undo low water mark can be sent back to the TM, and then that part of the log can be made to remain online and not go off to tape, to make aborting the transaction later (if it comes to that) easier, or possible </li></ul></ul><ul><ul><li>If lock reinstatement is supported, the database can be brought back online in 30 seconds or less, as opposed to 15 minutes or more to resolve Prepared transactions and complete all the aborts </li></ul></ul>
  109. 9. Recovery putting it all back together again <ul><li>RM archive recovery (if the file is trashed): </li></ul><ul><ul><li>To recover a file from the archive, first you restore the ‘fuzzy’ online dump from the most recent time before the target time you want the file consistency to be brought up to </li></ul></ul><ul><ul><li>After the dump is in place in the RM, archive recovery applies all the redo for that file from the RM’s log partition, starting at a point in the log partition immediately before the online dump was initiated, and proceeding up to the log pointer or timestamp of the target file that you want to recover to </li></ul></ul><ul><ul><li>Traversing the log backwards from that final location (which is usually the end), archive recovery then applies all the undo for the transactions that were incomplete at the target time, back to the undo low water mark : and that’s it </li></ul></ul>