Fundamentals Of Transaction Systems: Part One - Clusters

    Fundamentals Of Transaction Systems: Part One - Clusters - Presentation Transcript

    1. Valverde Computing The Fundamentals of Transaction Systems: Part 1: Clusters Charles Johnson <cjohnson@member.fsf.org> The Open Source/ Systems Mainframe Architecture
    2. Library = Low level communication, operating system drivers and state on Open Systems platforms Subsystems = Open Source components + middleware standards + Customer Application Cores EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools Optimal Cluster Software Architecture
    3. Library = Low level communication, operating system drivers and state on Open Systems platforms
      State Optimally includes a proprietary layer of low level, C/C++ based drivers, yielding unparalleled transaction processing performance without the client having to deal with the underlying design architecture. These libraries provide a simple and unobtrusive, yet elegant and abstract data management interface for new applications.
      Libraries
        • ESS, WAN, LAN, SAN drivers and management library
        • Global serialization library
        • XML log records library
        • Buffered log I/O library
        • XML log reading library
        • Cluster logging library
        • Recovery library
        • XML chains resource manager
        • Global Transaction (IDs, handles and types) library
        • Data management library
        • Transaction management library
        • XML remote scripting API library
        • Server, Cluster and Network management library
    4. Subsystems = Open Source components + middleware standards + Customer Application Cores
        • The vast majority of optimal middleware and applications are then implemented on open source using cross-platform Java to access this open system interface, allowing unprecedented flexibility for customization and future expansion.
      Middleware – Open Source Disaster Recovery interface XML remote scripting XML management console Service control manager Application servers Application feeders Application extractors Application reports Application human interface Database and Recovery management interface Server, Cluster and Network management interface Application Core
    5. EAI = commerce brokers, data integration & rules engines, enterprise mining, web analytics, ETL and data cleansing tools Enterprise Application Integration Actional Control Broker Acxiom AbiliTec™ Fair Isaac Blaze Advisor Mercator Commerce Broker MicroStrategy DoubleClick Ensemble SAS Enterprise Miner ETL Tools SeeBeyond® TIBCO Trillium
    6. Cluster Redundancy Architecture
      [Diagram labels: High Speed, Minimum Latency Network or SAN “A” and “B”; Fibre Channel or SAN Based Enterprise Storage Network “A” and “B”]
      * Elements can be viewed as servers in a cluster or clusters in a supercluster
    7. 4 Pillars (or Guardians or Demons)‏
      • 1. Causality or Acausality
        • Importance of Serialization, Order
        • S2PL (strict two-phase locking) vs MVCC
          • Write skew and wormholes
        • Wittgenstein and the Tractatus Logico-Philosophicus
      • 2. Relativity or Classical Delusion
        • Real-time and distributed systems performance issues
        • Timestamp and clock issues
          • Einstein: no such thing as the global “current moment”
          • Davies: no such thing as the local “current moment” (modern physics)‏
          • GPS satellites were built with 13 digits of precision and turned out to need every digit, owing to elliptical orbits and gravity-acceleration time dilation.
    8. 4 Pillars (or Guardians or Demons)‏
      • 3. Purity or Impurity
        • Algorithms need to work correctly and be optimized for time and resources
        • Occam: "All other things being equal, the simplest solution is the best."
        • Einstein: “Make everything as simple as possible, but not simpler”
        • Roger Penrose: Objectivity of Plato's mathematical world = no simultaneous correct proof and disproof
        • Carl Hewitt's Paraconsistency is really just Inconsistency
      • 4. Certainty (Philosophical) and Uncertainty (Δx Δp ≥ ħ/2)
        • Failure, Takeover and Recovery: reasserting the invariants
        • Propagation of Error and Chaos will ultimately loom in all predictions, interpolations and extrapolations
        • Idempotence: transparent retryability
    9. Server/Cluster Fundamentals
      • Reliable Message Based System (serialized retries with duplicate removal)‏
      • Data integrity (data must be checked wherever it goes)‏
      • Reliability (fault detection + fault tolerance + fault avoidance + proper fault containment)‏
      • Server/Cluster Parallelism (if it isn’t locked, it isn’t blocked)‏
      • Transparency (when/where/how)‏
      • Server/Cluster Scalability
      • Server/Cluster Availability (outage minutes -> zero)‏
      • Application/Database Serialized Consistency (the database must be serialized wherever it goes)‏
      • Network Server Autonomy
    10. Server/Cluster Fundamentals
      • 10. Minimal Use of Special Hardware (needs to be off-the-shelf)
      • 11. Maintainability and Supportability (H/W & S/W needs to be capable of on-line repair)
      • 12. Massive Parallelism and Scalability
      • 13. Massive Availability -> Continuous Database
      • 14. Grid Computing Operations
      • 15. Reliable Disjoint Replication
      • 16. Scalable Joint Replication
      • 17. Bi-Directional Replication (Reliable, Scalable)
      • 18. Openness (Glasnost) – open systems and open source
      • 19. Restructuring (Perestroika) – online application and schema maintenance
    11. 1. Reliable Message Based System (serialized retries with duplicate removal)‏
      • Data spread over multiple clusters implies messages between subsystems (Gray-Reuter 3.7.3.1)‏
      • Senders can die and be restarted and send duplicate messages, which must be detected and dropped (idempotence)‏
      • Receivers can die and be restarted, so messages must be resent on failure (retry)‏
      • Gaps in a message series may need to be detectable (sessions and sequencing; a solid MsgSys can do this for you and help with the points above)
      • Library based components require (for performance reasons)
        • Drivers and packet communications
        • kernel mode execution under driver dispatches
        • packet buffering and replies without dispatching target threads (called driver ricochet broadcasts)‏
        • kernel mode flushing of RMs with lock-release
        • FIFO queuing of packet buffers into subsystem user mode threads (fibers) to support stream programming
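The retry, duplicate-removal and gap-detection bullets above can be illustrated with a minimal sketch. It assumes each message carries a per-session sequence number; `Session`, `receive` and `send_with_retry` are hypothetical names for illustration, not part of any message system named in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Receiver-side state for one sender session (hypothetical structure)."""
    next_seq: int = 0                          # next sequence number expected
    delivered: list = field(default_factory=list)

    def receive(self, seq: int, payload: str) -> str:
        if seq < self.next_seq:
            return "duplicate"                 # resent by a restarted sender: drop it
        if seq > self.next_seq:
            return "gap"                       # something was lost: ask for a resend
        self.delivered.append(payload)         # in order: deliver exactly once
        self.next_seq += 1
        return "delivered"


def send_with_retry(session, seq, payload, lossy_attempts=0, max_attempts=5):
    """Resend until the receiver answers; the first `lossy_attempts` tries are lost."""
    for attempt in range(max_attempts):
        if attempt < lossy_attempts:
            continue                           # simulated loss: no reply, so retry
        return session.receive(seq, payload)
    raise TimeoutError("receiver unreachable")
```

Because the receiver drops anything below `next_seq`, a sender can retry blindly after a failure without the payload being applied twice, which is the idempotence the slide asks for.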
    12. 1. Reliable Message Based System (serialized retries with duplicate removal)‏ ...
      • Maintaining a clustered Nonstop message system is described in the (expired and now available) Tandem patents from James Katzman, et al:
        • 4,817,091 Fault-tolerant multiprocessor system
        • 4,807,116 Interprocessor communication
        • 4,672,537 Data error detection and device controller failure detection in an input/output system
        • 4,672,535 Multiprocessor system
        • 4,639,864 Power interlock system and method for use with multiprocessor systems
        • 4,484,275 Multiprocessor system
        • 4,378,588 Buffer control for a data path system
        • 4,365,295 Multiprocessor system
        • 4,356,550 Multiprocessor system
        • 4,228,496 Multiprocessor system
      • And the (expired and now available) Glupdate patent from Richard Carr, et al, which has proven reliable for 30 years now and has a much lower message overhead than Quorum Consensus, while accomplishing more:
        • 4,718,002 Method for multiprocessor communications
    13. 1. Reliable Message Based System (serialized retries with duplicate removal)‏ ...
      • A Tandem Nonstop system is a cluster of clusters of processors, up to 4096. Each 16 processor “node” or server in the Expand outer cluster has autonomy, and its own transaction service TM (transaction manager) bringing the server's RDBMS up and down, one RM (resource manager) at a time. Fault tolerance at the subsystem and application level is accomplished by process pairs, which look like a single process to a client sending messages and retrying after the primary half of the pair has gone down.
      • An IBM S390 Sysplex Cluster is a set of up to 32 16-way SMPs joined by fast interconnects and buses, with at least 2 coupling facility smart memory devices and 2 sysplex timers (not used for commit processing, they do commit by log order, like Tandem) <http://www.mvdirona.com/jrh/work/hpts2001/presentations/DB2%20390%20Availability.pdf>
    14. 2. Data Integrity (Data Must Be Checked Wherever It Goes)‏
      • Data corruption is an ever-present possibility through electronic noise (e.g. radon decay chain effects, cosmic radiation), physical defects (semiconductor doping flaws), and HW/SW design defects (stray pointers in code)‏
      • The statistics are that there are 3 undetected and uncorrected, but program-significant data corruptions per 1000 microprocessors per year (Horst, et al: Proc 23rd FT Computing Symposium 1993)‏
      • Disks, even when not in use, will corrupt data at a low rate and the mirrors must be crawled and corrected, and single data disk blocks recovered
      • Memory must be error checking and correcting (ECC), as in most computer systems. Many components have the potential to corrupt data, and this will get worse as components shrink
      • Higher integration levels for processors will cause sporadic internal resets, which occur more frequently at higher altitude (Colorado) and can take a processor offline for half a minute
    15. 2. Data Integrity (Data Must Be Checked Wherever It Goes) ...
      • Optimal, reliable systems will support every one of the following:
        • Lock-stepped microprocessors
        • Fail-fast protection of internal buses and drivers
        • End-to-end checksums on data sent to storage devices
      • Log writing must use end-to-end checksums on blocks. This is because after a crash, we need to fix up to the last valid written block of log records from a log buffer, and we can’t tolerate garbage in the block middle due to power-loss partial writes
      • During transaction restart after an RM (resource manager) crash or full server/cluster crash, log fixup then searches for the last good block written (valid checksum) to the log mirrors, which becomes the new log tail
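Log fixup by checksum can be sketched as follows. This is an illustration only: it assumes fixed-size blocks with a trailing CRC32 and a forward scan from a known-good starting point. The block layout and the names `make_block`, `block_is_valid` and `find_log_tail` are hypothetical, and a production log would likely use a stronger checksum and a richer block header.

```python
import struct
import zlib

BLOCK_PAYLOAD = 16  # tiny payload size, for illustration only


def make_block(payload: bytes) -> bytes:
    """Pad the payload and append a CRC32 so a reader can validate the whole block."""
    assert len(payload) <= BLOCK_PAYLOAD
    body = payload.ljust(BLOCK_PAYLOAD, b"\0")
    return body + struct.pack("<I", zlib.crc32(body))


def block_is_valid(block: bytes) -> bool:
    if len(block) != BLOCK_PAYLOAD + 4:
        return False
    body = block[:BLOCK_PAYLOAD]
    (crc,) = struct.unpack("<I", block[BLOCK_PAYLOAD:])
    return zlib.crc32(body) == crc


def find_log_tail(blocks: list) -> int:
    """Return how many leading blocks are valid; that count is the new log tail."""
    for i, block in enumerate(blocks):
        if not block_is_valid(block):
            return i                # first torn or garbage block ends the log
    return len(blocks)
```

A block damaged by a partial write fails its checksum, so the scan stops in front of it and nothing after that point is replayed.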
    16. 3. Reliability (fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment)‏
      • James Gosling: distributed computing is not transparent to either failure or performance
      • Some errors are tolerable and some operations returning errors can be retried with idempotence
      • Oddly enough, keeping things reliably running requires a cut-throat approach to critical subsystems that are experiencing anomalies, encountering garbage data, or even running abnormally
      • Fail-fast – going down quickly prevents the spread of invalid data or even the effects of flawed algorithms or races we can’t handle (what if the corruption checks don’t catch something?)‏
      • Failure Detection : assertion logic is interwoven throughout all critical library code and all subsystem code
      • To maintain the state machine invariants end-to-end we must detect any violation of the invariants and then reinstate them
      • Bohr-bugs and Heisen-bugs require different kinds of testing
      • Single Failures in clusters and Double Failures in clusters of clusters require different kinds of testing and furthermore, different kinds of fault tolerance design to ‘transparently’ handle those failures
    17. 3. Reliability (fail-fast + fault detection + fault tolerance + fault avoidance + proper fault containment) ...
      • Fault Tolerance : when something goes wrong and a failure occurs, whether hardware or software, takeover mechanisms ensure the re-establishment of state machine invariants (a new state which is equivalently identical to the state before failure)‏
      • In fault tolerant systems, when a piece of hardware is faulty, the fault tolerance of the software has to function correctly
      • Fault Avoidance : by small amounts of forethought and action here and there in the code, potentially large failures can be shrunk down in size to be handled invisibly, e.g.,
        • avoiding unnecessary transaction aborts by checkpointing transaction state in the RM [resource manager]
        • avoiding unnecessary RM crash recovery outages by detecting missing log writes and requesting them in a timely way
      • Fault Containment : garbage pointers in the library kernel mode underworld cause the outage of the server. Encountering the same problems in a process environment may only require a process restart, if proper checkpoints have been made beforehand
    18. 4. Server/Cluster Parallelism (If It Isn’t Locked, It Isn’t Blocked)‏
      • An optimal RDBMS will use S2PL (strict two phase locking) for the transaction duration locks (e.g., 5 kinds of locks in the Nonstop RM: DP2)‏
      • An optimal RDBMS RM holds both the data and the locks, with no external distributed lock table to fight over with interlopers, which supports the use of a queue and priority inversion of that queue
      • So, one message connects the transactional application code with the data + the acquired lock + the tx state within the node TM (transaction manager) underneath the DB RM process + any failure takeover process for the RDBMS RM
      • Optimally, the RDBMS also supports RM-only transactions, which are active within only one RM and do their transaction flush within that RM. One application message can then send all the compound SQL statements for several transactions (for instance, a hundred TPC-C transactions for one branch of the bank), each with microscopic response times and lock hold times. Wide transactions on that RM's data remain possible at any time, as in a 1999 HPTS position paper: <http://research.microsoft.com/~gray/HPTS99/Papers/JohnsonCharlie.doc>
    19. 4. Server/Cluster Parallelism (If It Isn’t Locked, It Isn’t Blocked) ...
      • There is one version of the data in an optimal computing universe: applications must serialize, interacting with each other on the same data using transactions. This is in contrast to versioning databases (Oracle, SQLServer, Sybase, MySQL, Postgres), where transactional reads use snapshot isolation and are blind to concurrent updates. There, only updates on primary keys block updates, and repeatable read transactions (in the style of banking and finance transactions) don't provide proper isolation
      • An optimal S2PL (strict two-phase locking) system's update lock will block both updates and reads, an S2PL shared read lock will block updates, and both kinds of locks will only be released when the transaction stops changing the database, and for the update lock, after changes to the database are made durable at commit or abort time
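The lock compatibility rules just described can be sketched with a toy lock table. It is a simplification under stated assumptions: only two modes (shared "S" and update/exclusive "X") rather than the five lock kinds mentioned for DP2, and a refused request returns `False` instead of queuing. `LockTable` and its methods are hypothetical names.

```python
class LockTable:
    """Toy S2PL lock table: locks are held until the transaction ends."""

    def __init__(self):
        self.holders = {}   # record key -> {transaction id: "S" or "X"}

    def acquire(self, tx, key, mode):
        held = self.holders.setdefault(key, {})
        others = {t: m for t, m in held.items() if t != tx}
        if mode == "X" and others:
            return False                 # update lock waits for every other holder
        if mode == "S" and "X" in others.values():
            return False                 # read waits behind another's update lock
        if held.get(tx) != "X":
            held[tx] = mode              # grant, or upgrade S -> X; never downgrade
        return True

    def release_all(self, tx):
        """Strictness: everything is released together, at commit or abort."""
        for held in self.holders.values():
            held.pop(tx, None)
```

Two readers share a record freely, while an updater has to wait until every other holder has finished, which is the "if it isn't locked, it isn't blocked" behavior in miniature.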
    20. 4. Server/Cluster Parallelism (If It Isn’t Locked, It Isn’t Blocked) ...
      • This allows every application process to run freely parallel until they encounter a blocking lock on some record – hence there is no application locking schedule or high level concurrency control or massive sharding of the database required to single thread the concurrency and allow the database to work correctly – so the system runs completely in parallel at warp speed across the entire cluster
      • In an optimal RDBMS, use of the read-only transaction commit optimization will further allow large sections of the database to be released at the beginning of the commit flush (which is the end of the database transformation from data that was read, which is why you hold shared read locks)‏
    21. 5. Transparency (When/where/how)‏
      • Gosling: distributed computing is not transparent to either failure or performance (once again!)
      • Transparent / opaque to whom?
      • For an optimal RDBMS, at the kernel mode library programming level, there is no clustering or failure transparency, and the task is to provide transparency (whenever possible) to the layers above
      • For the vast majority of hardware and software failures, even most double failures, an optimal RDBMS seamlessly rolls along without aborting transactions, so that the applications don’t need to worry about those failures
    22. 5. Transparency (When/where/how) ...
      • Occasionally, the seamless operations and functioning of the application are interrupted, as in the total loss (very rare double failure) of a resource manager, since the TM (transaction manager) library doesn’t know which tx touched which RM after all the copies of the RM-tx state are lost [A less than optimal method involves RMs checkpointing only log changes and then only having update locks in the backup after takeover, violating S2PL consistency and risking wormholes (wormholes will be discussed later)]
      • Then the TM library will have to abort all transactions to restore the state of the database, and this will require applications to resubmit any uncommitted updates (like the current mini-batch in NASDAQ SuperMontage does on Nonstop)‏
      • If the process that begins a transaction dies the tx gets aborted and work must be resubmitted
      • Finally, if there is an RM, TM or cluster service crash, due to a double log disk or unrestartable server failure of the log, then an optimal transaction service must be restarted or disaster recovery initiated (which is clearly not very transparent to applications)‏
    23. 5. Transparency (When/where/how) ...
      • So when/where/how is this transparent to the application?
        • In the fact that the optimal RDBMS has no wormholes in it due to the write skew problems of snapshot isolation databases that employ MVCC
        • In the consistent view of the database outside of a transaction that either commits or aborts
        • In the guarantee that if the transaction service says commit and the entire system takes a nosedive a nanosecond later, then the transaction data is there and it’s consistent
        • In the guarantee that if the transaction service says abort, then all transaction protected work is undone completely before any transaction locks are released
        • In that the application needs to do nothing, but use a transaction to guarantee all of that consistency
    24. 6. Server/Cluster Scalability
      • Scalability of database logging performance inside the Server/Cluster and for disaster recovery is accomplished by a three phase commit flushing algorithm and the forced group commit write by the commit coordinator in an optimal RDBMS TM library
      • In an optimal RDBMS, the RMs never force-write database updates to the log, instead those updates are streamed to the log partition’s input buffer
      • An optimal RDBMS uses the WAL (write ahead log) protocol so that writes only have to be scheduled to the resource manager database disk every five minutes or so (called disk checkpoints), for nearly “in-memory” update database performance for the resource manager disk
      • This yields just short of “in-memory” RDBMS performance, because of the five minute rule: Keep a data item in electronic memory if its access frequency is 5 minutes or higher; otherwise keep it in magnetic memory. (Gray/Reuter 2.2.1.3) This rule will apply as long as the price ratio between RAM/magnetic memory stays reasonably constant (when the ratio changes, only the ‘5’ will change)
    25. 6. Server/Cluster Scalability…
      • At commit time, the optimal RDBMS TM library transaction service induces explicit RM log flushing only when necessary, from the interrupt service level of the TM library (100 times cheaper than process message wakeups). In busy systems the RMs are stream-writing ahead continuously to the log, so that the transaction updates are almost always already flushed to the log when commit time comes (unless the transactions are tiny and unbuffered)‏
      • When flushes due to commit are reported to the RDBMS TM commit coordinator in a busy system, they are lumped together into a single and periodic forced write into the log (group commit)‏
      • The group commit write by the RDBMS commit coordinator is the one and only time in the system that the transactional database application absolutely must wait for the disk to spin and the drive head to move, and it’s a shared experience (and thereby scalable for the Server/Cluster’s transaction service)‏
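Group commit can be pictured with a small sketch in which many commit requests share one forced write. Everything here is hypothetical and simulated in memory: `GroupCommitLog` stands in for the commit coordinator, and `force` stands in for the single periodic forced write to the log.

```python
class GroupCommitLog:
    """Toy commit coordinator: many commit requests share one forced write."""

    def __init__(self):
        self.pending = []        # commit records waiting for the next forced write
        self.durable = []        # records already forced to the (simulated) log
        self.forced_writes = 0

    def request_commit(self, tx_id):
        self.pending.append(tx_id)           # buffered, not yet durable

    def force(self):
        """One periodic forced write covers every commit queued so far."""
        if not self.pending:
            return []
        batch, self.pending = self.pending, []
        self.durable.extend(batch)           # stands in for one sequential disk write
        self.forced_writes += 1
        return batch                         # these transactions may now be acknowledged
```

In this sketch a hundred queued commits cost one forced write rather than a hundred, which is why the wait is "a shared experience".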
    26. 6. Server/Cluster Scalability…
      • So, why is writing to one log disk faster than writing in parallel to a bunch of RM data volume disks? If there is no other disk writehead-moving activity for that disk, and if we write it sequentially using big buffers with effective disk sector management, then by treating a disk like a tape we get 20-100 times the writing throughput (Gray/Reuter 2.2.1.2)
      • Ultimately, however, you can easily generate more joint-serialized database log record blocks than one log disk can receive, so we partition the optimal RDBMS log N-1 ways. But you still only force write one commit buffer to the log root while streaming log blocks to the N-1 leaf log partitions
      • So, part of the configuration of an optimal RDBMS clustered transaction service is to assign RMs to log partitions. Reassigning RMs to different log partitions must not require the transaction service to be brought down, and needs to be performed online (several issues here)
    27. 6. Server/Cluster Scalability…
      • When optimal RDBMS transactions span the network to other Server/Clusters (or heterogeneously to other vendor systems), then the commit coordinators on the two or more nodes do non-blocking three phase commit to guarantee the joint commit or abort of the transaction
      • Optimal RDBMS distributed commit performance will do 60% of the local maximum transaction rate across as many nodes as the customer needs. That’s called “scaling out” and is accomplished by a method called “Mother-May-I”, which is described in two patents: <http://www.google.com/patents?id=rUt4AAAAEBAJ&dq=7,028,219> <http://www.google.com/patents?id=S-d3AAAAEBAJ&dq=6,990,608>
    28. 6. Server/Cluster Scalability…
      • If transactions have locality of reference and only touch the local node with no lock conflicts (which means that no joint serialization is necessary), then an optimal RDBMS will do “scaling up” to nearly the full 100% level
      • Based on the scalability of the Server/Cluster and the log, the optimal RDBMS replication service will consistently maintain the database on a remote Server/Cluster with only 1% DB overhead + 4% network messaging overhead on the primary site
      • The optimal RDBMS replication service will consume more of the remote site system applying the updates (between 15% and 25%).
      • Overhead above these listed values is a bug
    29. 7. Server/Cluster Availability (outage minutes -> zero)‏
      • What is availability?
      • On some systems it is defined as the existence of a working Linux shell prompt
      • On some databases (Oracle) it has been quoted only on database software-produced outages, as though hardware and operating system-produced outages that are not tolerated by the database system are somehow not really happening to the customers
      • In an optimal RDBMS, availability is measured in terms of database queuing: if you can begin a transaction, and queue up for a lock on any part of the database under that transaction with the likelihood of actually getting that lock and then accessing that data, then that data is considered available
      • If you can’t do all that on some part of the database, then that part of the database is considered unavailable
    30. 7. Server/Cluster Availability (outage minutes -> zero) ...
      • Availability, in Highleyman's “Breaking the Availability Barrier”, p. 32: Availability = MTBF/(MTBF + MTR) where any 'mean time before failure' will return an availability of 1 (eternally up), if the 'mean time to repair' is zero.
      • Tandem’s Nonstop TMF has had excellent fault tolerance for nearly 25 years with non-blocking three phase commit coordination between NonStop server-cluster nodes: when the Tmp process or cpu dies, the backup Tmp takes over with no perceptible outage or loss of state or any transactions being aborted, to the tune of 5½ nines of availability or 12 years overall MTBF (Highleyman et al). Add RDF to make 7 nines in the British banking system (Mosher), or an astonishing 600 years overall MTBF (both MTBFs assuming a 30 minute repair time for a disaster recovery takeover)‏
      • IBM Parallel Sysplex, using mainframe DB2 says (2003) that they can do 50 years overall MTBF (open your wallet wide for IBM services)‏
      • An optimal RDBMS should drastically exceed this level of fault tolerance, going to more than 1 million years overall MTBF and twelve nines (and this should not require expensive legacy services or onerously complex configuration)
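The relationship between nines, MTBF and repair time in the formula above can be checked with a short calculation. This is a sketch assuming the 30 minute repair time quoted on the slide and a 365.25-day year; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 24 * 365.25


def availability(mtbf_hours, mtr_hours):
    """Availability = MTBF / (MTBF + MTR)."""
    return mtbf_hours / (mtbf_hours + mtr_hours)


def nines(a):
    """Number of nines, e.g. 0.999 -> 3.0 (undefined for a == 1)."""
    return -math.log10(1.0 - a)


def mtbf_years_for(nines_target, mtr_hours):
    """MTBF needed to reach a given number of nines at a fixed repair time."""
    unavailability = 10.0 ** -nines_target
    # Rearranged from the formula: MTBF = MTR * A / (1 - A)
    return mtr_hours * (1.0 - unavailability) / unavailability / HOURS_PER_YEAR
```

With a half-hour repair time, seven nines works out to roughly 570 years of MTBF, in line with the "600 years" figure above, and a 12 year MTBF gives a little over five nines. Twelve nines at the same repair time needs an MTBF in the tens of millions of years.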
    31. 7. Server/Cluster Availability (outage minutes -> zero) ...
      • Tandem Nonstop was the first to make all database operations completely seamlessly transparent online: SQL partition reorganize, split and merge, partition move to another disk, changing the log partition that an SQL partition logs to, SQL catalog changes (add, modify and delete table, add, modify and delete fields); all of these operations can be done mostly without changing applications and even without modifying query plans from the optimizer (for which some systems have hundreds stored and customers hate recompiling all that)‏
      • The 4 SQL partition operations: reorganize, split, merge, and move are all done starting with a rollforward from an online (fuzzy) dump, applying changes from the log starting at dump time forward until nearly caught up, then slowing user updates at the end to catch up (avoiding infinite overtaking). Let’s call this ‘live recovery mode’
      • An optimal RDBMS will operate in such a seamless way, and also the transaction abort, RM recovery and media failure recovery (file recovery) will all do their job without writing through to the database disks, which runs up to 100 times faster (only rebuilding RM disk cache)‏
      • This guarantees minimal outage times (MTR) on the parts of the database which are subject to these more visible operational outages
      • All of this is possible, because Nonstop uses logical keys instead of record pointers (RIDS) to connect btree blocks …
    32. 7. Server/Cluster Availability (outage minutes -> zero) ...
      • IBM’s mainframe DB2 uses RIDs to identify records in btrees, while an optimal RDBMS uses logical keys: this allows btrees to be moved without modification of the btree data, whereas IBM RIDs are only valid at that disk address and need to be remapped to move them. This causes holes in the implementation of utility functions for availability, known as the 'Halloween Problem' in the old days
      • They resolved Halloween by employing the Tandem method (the two patents are nearly identical) for the SQL partition operations including the rollforward part and the infinite overtaking part
      • However, RIDs disallow using SQL cursors against the base tables since the RIDs can be remapped asynchronously underneath the app. In DB2, cursors can only be used against snapshot isolated copies of the base tables. That RID infantile appendage has an impact on function
      • This is why an optimal RDBMS will split, merge, and move partitions, and reorganize the database on the fly without any availability outage, using file recovery interfaces to pull the updates to the old partition from the log in real time and apply them to the new partition, and then switch over when the log tail is near, while SQL cursors are still active
    33. 7. Server/Cluster Availability (outage minutes -> zero) ...
      • The SQL ‘live recovery mode’ approach to seamless operations allows completely retry-able and transparent fault tolerance for RDBMS utility functions experiencing failures. (So you don’t have to dump the entire database before and after utilities run, as has historically been the case with Oracle)‏
      • However … if enough failures of the intolerable kind occur simultaneously … your database can become unavailable, or worse, unusable … so what, then?
      • The first thing in availability is to reduce your MTR (mean time to repair). In a transaction system that means to bounce back up quickly after a crash, and that means knowing what transactions and locks are outstanding in the server/cluster.
      • IBM's mainframe DB2 uses a piece of special hardware called the CF (coupling facility), which acts as a smart memory to store shared buffers and locks. The CF is a mainframe processor running a special OS called CFCC. Then, each time systems fail, the database elements needed for a quick restart are right there
    34. 7. Server/Cluster Availability (outage minutes -> zero) ...
      • Tandem does not use special hardware in this way. Their innovation is to store their locks in the end of the log, and quickly reacquire those to allow RMs that have not failed to continue doing business while those RMs requiring recovery get that in some critical order, see this patent: <http://www.google.com/patents?id=9Lx6AAAAEBAJ&dq=7,100,076>
      • If you are using enterprise storage, then another cluster could take over immediately to avoid a crash. It would, however, require extending the three phase commit to a fourth phase: another cluster's commit coordinator takes over the log and all of its partitions on enterprise storage, continues with the RMs that are still functioning, and recovers the crashed RMs for a seamless continuous database. That extension will theoretically take the transaction service to 12 nines of availability, see this patent: <http://www.google.com/patents?id=jB5_AAAAEBAJ&dq=7,168,001>
    35. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes)‏
      • So what is database consistency?
      • It’s like pH: the higher the ACIDity, the stronger the database
      • The letters in ACID stand for Atomic, Consistent, Isolated, and Durable
      • A is for Atomicity and means all or nothing: the database everywhere must end up in a state whose visibility to the world outside the transaction is first the old state, then the new state (in the case of commit) or the state remains unchanged (in the case of abort)‏
    36. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • C is for Consistency, which seems like a circular definition, but actually it should probably be spelled ASID, because consistency in database work is really accomplished by serialization (as seen from below)
      • The database (on reputable systems) is really in the log, the RM disks are a convenient cache whose disk image is only rarely in a consistent state (only after a correctly completed shutdown of the optimal RDBMS transaction service, at which point it’s fairly unusable)‏
      • Serialization in the log is defined by the exclusive existence of serialized transaction histories, and no other kind
    37. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • A transaction history starts in the log with the first update log record for a transaction
      • Then there’s a series of update records (btree block splits have multiple physical log records in a string)‏
      • Then there are one or more commit log records, xor one or more abort log records (either commit or abort, never are both present)‏
      • Then, in the case of an abort, there is a set of undo log records
      • Finally, there are one or more forgotten log records to terminate the transaction history
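The record sequence described on this slide can be sketched as a small checker. This is only an illustration under stated assumptions: the `Rec` enum and `is_well_formed_history` are hypothetical names, the input is the ordered list of record kinds for a single transaction, and the undo set after an abort is treated as "zero or more" records because the slide does not say how many there must be.

```python
from enum import Enum


class Rec(Enum):
    """Hypothetical record kinds, named after the bullets on this slide."""
    UPDATE = "update"
    COMMIT = "commit"
    ABORT = "abort"
    UNDO = "undo"
    FORGOTTEN = "forgotten"


def is_well_formed_history(records):
    """Check one transaction's records against:
    update+, then (commit+ xor abort+ followed by undo*), then forgotten+."""
    i, n = 0, len(records)

    def take(kind):
        # Consume a run of consecutive records of one kind; return its length.
        nonlocal i
        start = i
        while i < n and records[i] is kind:
            i += 1
        return i - start

    if take(Rec.UPDATE) == 0:        # history starts with an update record
        return False
    if take(Rec.COMMIT) == 0:        # no commit records, so it must be an abort
        if take(Rec.ABORT) == 0:
            return False
        take(Rec.UNDO)               # assumption: zero or more undo records
    if take(Rec.FORGOTTEN) == 0:     # history ends with forgotten record(s)
        return False
    return i == n                    # nothing may follow the forgotten records
```

A history that contains both commit and abort records, or that never reaches a forgotten record, is rejected by this check.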
    38. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • The big thing about serialized transaction histories in the log is that even though they are mostly interspersed in order (concurrency), you must never have the case where a log record touching data for one transaction is historically interspersed with a log record touching that same data for another transaction
      • This is called a wormhole (Gray Reuter 7.5.8.1) and it occurs when a transaction is either not well-formed (using shared read locks and exclusive update locks) or is not two-phase (first acquiring and then releasing locks). “A transaction history is isolated if, and only if, it has no wormhole transactions.” - Jim Gray
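One common way to look for wormholes is to build a dependency graph between transactions and check it for cycles: a transaction that ends up both before and after another one is a wormhole. The sketch below is a simplified illustration, not the procedure from Gray and Reuter's text. The function name `has_wormhole` and the `(tx, op, item)` record shape are assumptions made for the example.

```python
from collections import defaultdict


def has_wormhole(history):
    """history: list of (tx, op, item) in log order, where op is 'r' or 'w'.
    Returns True if some transaction is both before and after another one."""
    # Two records conflict if they touch the same item from different
    # transactions and at least one of them is a write.
    edges = defaultdict(set)
    for i, (t1, op1, item1) in enumerate(history):
        for t2, op2, item2 in history[i + 1:]:
            if t1 != t2 and item1 == item2 and "w" in (op1, op2):
                edges[t1].add(t2)        # t1 must come before t2

    WHITE, GREY, BLACK = 0, 1, 2         # unvisited, on the DFS stack, done
    color = defaultdict(int)

    def visit(tx):
        color[tx] = GREY
        for nxt in edges[tx]:
            if color[nxt] == GREY:       # back edge: a cycle
                return True
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[tx] = BLACK
        return False

    return any(color[tx] == WHITE and visit(tx) for tx in list(edges))
```

For example, `[("T1", "w", "x"), ("T2", "w", "x"), ("T2", "w", "y"), ("T1", "w", "y")]` puts T1 before T2 on `x` and T2 before T1 on `y`, so the check reports a wormhole. The same four writes issued one transaction at a time do not.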
    39. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • You must be able to sort the entire log by transaction timestamp (using Jim Gray's sorting method in 7.5.8.1) and, when the sorted log is replayed back into the database at recovery time, get the same effect as the original log. That yields the GOLD STANDARD of transaction systems: Wormhole-Free Transaction Histories
      • Strict two phase locking (S2PL) transaction concurrency is well-formed and two-phase on purpose.
      • Multiple version concurrency control (MVCC) does not enforce the well-formed part, because shared read locks are not used (even in repeatable read mode in most implementations I've seen)
    40. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • MVCC databases (Oracle, SQLServer, Sybase, Postgres, MySQL) basically employ forms of snapshot isolation, which allow transactions to work on a private snapshot of the DB, so most locking is unnecessary. Blocking typically only occurs when a record is deleted or the primary key is updated, for concurrent transaction users.
      • MVCC databases can create wormholes by write skew: two concurrent transactions reading two field values and then updating each other's previously read field values.
      • You can only do one of three things when conflict occurs: 1. Block one tx (which MVCC can't do at all well) 2. Abort one tx (which some MVCCs do) or 3. Corrupt the database integrity (mostly this occurs)
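The write skew case can be reproduced with a toy model of snapshot isolation. This is a minimal sketch, not the behavior of any particular product. `SnapshotDB`, `withdraw` and `run_write_skew` are hypothetical names, and the only conflict check assumed is "first committer wins" on the keys a transaction wrote.

```python
class SnapshotDB:
    """Toy snapshot-isolation store: each transaction reads a private copy."""

    def __init__(self, data):
        self.data = dict(data)
        self.version = {key: 0 for key in data}   # commit time of last write
        self.clock = 0

    def begin(self):
        return {"start": self.clock, "snapshot": dict(self.data), "writes": {}}

    def commit(self, tx):
        # First committer wins: abort only if a key this transaction WROTE
        # was committed by someone else after this transaction started.
        for key in tx["writes"]:
            if self.version[key] > tx["start"]:
                return False
        self.clock += 1
        for key, value in tx["writes"].items():
            self.data[key] = value
            self.version[key] = self.clock
        return True


def withdraw(tx, account, amount):
    snap = tx["snapshot"]
    # Application rule: the two balances together must stay >= 0.
    if snap["x"] + snap["y"] - amount >= 0:
        tx["writes"][account] = snap[account] - amount


def run_write_skew():
    db = SnapshotDB({"x": 50, "y": 50})
    t1, t2 = db.begin(), db.begin()
    withdraw(t1, "x", 100)    # reads x and y, updates x
    withdraw(t2, "y", 100)    # reads x and y, updates y
    return db.commit(t1), db.commit(t2), db.data
```

In this model both transactions commit, because their write sets do not overlap, and the balances end at -50 each: the rule that each transaction checked against its own snapshot is broken in the combined result. Under shared read locks, each withdrawal would have had to wait on the other's exclusive update lock.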
    41. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • Mostly what MVCC database users end up doing is single-threading, by various means:
        • Sharding: making many isolated databases, also called database federation (invented by Microsoft for the TPC-C benchmark in the late 1990s)
        • Single-threading the shards by:
          • Using single threaded frameworks on dynamic languages (Ruby/Rails, Groovy/Grails, Python/Django, PHP/various)
          • Employing a hot-spot, like the bank balance in the TPC-C transactions and making many, many single-threaded bank apps in accomplishing the benchmark
      • What all these single-threading methods accomplish is to effectively convert an MVCC model to a SS2PL model: strong strict two phase locking, where even read locks are held until the transaction updates are flushed to disk
    42. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • So, when a MySQL executive recently said that the era of the Jim Gray database was passing, it was only true for web 2.0, not the critical enterprise or critical computing, where the answers really matter, and the big money is at risk.
      • Of the MVCC databases, only Microsoft SQL Server can pass the TPC-E benchmark, which checks for wormholes.
      • If you do aborts and raise the concurrency, you end up aborting more and more of the concurrent transactions: 2 of 3, then 3 of 4, etc.
      • If you try to catch them on-the-fly and then single-thread their schedule … well, you should have blocked on shared-reads to begin with.
      • There is no guarantee these detecting methods work. Using timestamps is racy, and only sharpens the guillotine blade: once you start to get asymptotically close to correct, you and your users will start to trust your erroneous implementation.
    43. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • So, what does all this S2PL get you? Isn't it slow?
      • Tandem demonstrated the converse in the ZLE benchmarks on 15 nodes with 200K TPS of updates, with hundreds of queries constantly running against the base tables, and with trickle batch. It was their finest hour. You need to be smart and a good database designer, but it can be done. And without sharding and isolating fractured databases.
      • The first wonderful thing you get from S2PL is the magic of distributed database for free.
      • Imagine 10, 20, 100 Servers/Clusters of database side by side. Now allow them to share transactions with global tx IDs that have local tx IDs for each system. The RMs on the different nodes do not interact or scheme to serialize their updates in any way.
    44. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • Now go at the database concurrently with 1000 tx and just try to make a wormhole. You can't do it.
      • What's stopping you? The RMs aren't coordinating, the commit coordination only orders the transaction state log records, not the RDBMS update records, so how is this done?
      • In an MVCC log, you have what is called “log serialization”, which is only serialized as far as the log is concerned, which is insufficient. (Which is why you have to single-thread the applications).
      • In an S2PL log you get “application serialization”, where the applications serialize their own updates by waiting for shared read and exclusive update locks. The applications are doing that magical, missing coordination.
    45. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • So no matter how many Servers/Clusters are involved in sharing transactions, if all of their logs are merged together they should yield a joined log with total joint serialized histories for all the transactions anywhere in the entire contiguous computational universe of optimal RDBMS Servers/Clusters. (That is why we do three phase commit coordination between nodes: only to guarantee that all the update records everywhere are actually on a log disk somewhere when we terminate or “Forget” the global transaction.)
      • This also allows you to do replication of many Servers/Clusters to many Servers/Clusters and actually do a failover and make that work as well, to the last serialized transaction history commit. Only Tandem RDF does this as of now, and that possibility only arises because of S2PL concurrency control.
    46. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • I is for Isolation and should probably be spelled ACLD, because isolation in database work really means locking (as seen from below)
      • For optimal RDBMS RM files, transaction duration locks are either exclusive update locks which block reads and updates, or shared read locks which block updates only (there are four other kinds of locks in a Nonstop system: held for session, message and operation duration)
      • Locks are only released after all of their associated transaction database work has ceased, and once the totality of database changes have hit the log disk: the locks are the fingers of the correctly ordered database in the log reaching out to the cache copy of the database and guaranteeing serialized behavior in the applications interacting through that database
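The two transaction duration lock modes on this slide can be sketched as a small lock table. This is a simplified illustration, not the NonStop lock manager. `LockTable`, `try_lock` and `release_all` are hypothetical names, and a request that would have to wait simply returns `False` instead of blocking. Calling `release_all` stands in for the moment when the transaction's changes have reached the log disk.

```python
from collections import defaultdict


class LockTable:
    """Shared read ("S") and exclusive update ("X") locks held per transaction."""

    def __init__(self):
        self.holders = defaultdict(dict)   # item -> {tx: "S" or "X"}

    def try_lock(self, tx, item, mode):
        """Grant the lock and return True, or return False if tx must wait."""
        others = {t: m for t, m in self.holders[item].items() if t != tx}
        if mode == "X" and others:
            return False                   # X is blocked by any other holder
        if mode == "S" and "X" in others.values():
            return False                   # S is blocked only by another's X
        held = self.holders[item].get(tx)
        # Keep the stronger of the mode already held and the mode requested.
        self.holders[item][tx] = "X" if "X" in (held, mode) else "S"
        return True

    def release_all(self, tx):
        """Drop every lock of tx at once, after its work is safely logged."""
        for owners in self.holders.values():
            owners.pop(tx, None)
```

Two readers can share an item. An updater has to wait for both of them, and its own exclusive lock then holds off later readers until it calls `release_all`.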
    47. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • Hence, because of transactional isolation:
        • Applications communicate with each other using simple relational propositional logic (SQL Queries) ...
        • creating, changing or discarding truthful propositions (Records) ...
        • in shared repositories of mutually agreed upon truths (SQL tables) ...
        • through the database in complete, uninterrupted compound sentences (Transactions)
        • pausing in real-time, only to hear the complete, uninterrupted compound sentences of other concurrent applications (Locks)
        • otherwise, running at warp speed unhindered of any other blockage to performance (Minimum Latency)
    48. 8. Application/Database Serialized Consistency (the database must be serialized wherever it goes) ...
      • Finally, D is for Durability and means that once an optimal RDBMS transaction service says it’s done, it’s really done
      • If the ENDTRANSACTION procedure call returns OK, and one nanosecond later the entire installation crashes, the data is there and it’s correct
      • For optimal RDBMS disaster recovery, that means it’s done on the database on the primary site, and after all the log records reach the remote site, it’s done there, too
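The "done means done" rule usually comes down to forcing the commit record to stable storage before the caller is told OK. The sketch below shows that ordering with an ordinary file. It is an assumption-laden illustration, not the ENDTRANSACTION procedure itself: `end_transaction`, the log path and the text record format are all made up for the example.

```python
import os


def end_transaction(log_path, tx_id):
    """Append a commit record and force it to disk before reporting OK."""
    with open(log_path, "ab") as log:
        log.write(f"COMMIT {tx_id}\n".encode("utf-8"))
        log.flush()              # push Python's buffer down to the OS
        os.fsync(log.fileno())   # ask the OS to push it to the storage device
    return "OK"                  # only now may the caller be told it's done
```

If the return happened before the `fsync`, a crash one nanosecond later could lose a transaction the application already believes is committed.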

    + CharlesOneCharlesOne, 7 months ago

    Silicon Valley Codecamp #3 (11/08) - What makes mai more

    © All Rights Reserved
