
Fundamentals Of Transaction Systems - Part 2: Certainty suppresses Uncertainty (Groups of Clusters)


see http://ValverdeComputing.Com for video



  1. Valverde Computing: The Fundamentals of Transaction Systems, Part 2: Certainty suppresses Uncertainty (Groups of Clusters). C.S. Johnson <cjohnson@member.fsf.org>. Video: http://ValverdeComputing.Com, social: http://ValverdeComputing.Ning.Com. 2- The Open Source/Systems Mainframe Architecture
  2. 10. ACID and BASE: workflow makes this reaction safe
     - Eric Brewer (of UCB) has put forth a conjecture called CAP, which states:
       - You can only have 2 out of 3 of Consistency, Availability and Partition-tolerance for distributed database applications (note that these are not SQL partitions, but situations where the network is completely disconnected between network nodes)
       - Hence, if database applications are distributed, they must abandon the ACID model for the BASE model, where the database is Basically Available, Soft-state and Eventually consistent. There is no concrete description of what a BASE transaction system is, except in the negative: it doesn't qualify as an ACID transaction system
  3. 10. ACID and BASE: workflow makes this reaction safe
     - Brewer's CAP conjecture:
       - In Brewer's view, the computing world boils down to localized clusters where ACID applies, connected by an overarching inconsistent domain where BASE applies (and chaos reigns), since there will never be a global resolution of the consistency of everything mediated by any protocol
     - Between the overarching inconsistency and the local zones where ACID applies, you can effectively only do ACID work that you really care about the result of (like financials) in the small, restricted ACID zones, according to Brewer
  4. 10. ACID and BASE: workflow makes this reaction safe
     - Empirically this is a shaky assumption, because the British banking system, international stock, commodity and futures trading systems, funds transfer systems, etc., all move many trillions of dollars of ACID financial transactions on Nonstop clusters daily
     - The CAP conjecture, which disallows the simultaneous solution of Consistency, Availability and network Partition issues, does not seem to impede that continuing observation of availability
     - How can this be? Nancy Lynch mathematically proved that CAP is true!
  5. 10. ACID and BASE: workflow makes this reaction safe
     - One of the assumptions of CAP is two-phase commit, but many of these systems run on Nonstop, which uses full three-phase commit over ServerNet, and the British banks double down on that by using Nonstop RDF (more on RDF later) for takeover to a remote site, achieving seven nines of measured availability (Malcolm Mosher)
     - The other assumption of CAP is the use of commonly available telecom networks, but the British banks use private, duplexed networks (special hardware), or even what is called 'RDF triple contingency' (NASDAQ does this too): two geographically separated backup sites in communication with each other to resolve the differences at takeover time
  6. 10. ACID and BASE: workflow makes this reaction safe
     - Ultimately, what these very bright users are doing is widening and interconnecting the ACID zones, with the intent of driving back the chaos of BASE, and they could succeed on planet Earth
     - However, relativity guarantees that for interplanetary computing the chaos of BASE will ultimately prevail. So what should we do in space, or even on Earth where the ACID zone doesn't quite stretch (like the cloud, perhaps)?
     - My answer to that is workflow automation to deal with the overarching chaos of BASE: WFA relentlessly drives the ACID systems beneath the workflow system into a state acceptable to all parties, because they all signed onto a given workflow 'contract' (Andreas Reuter, Jim Gray's co-author, along with Helmut Wächter, invented a Contracts-model workflow system, which influenced many successor workflow systems)
  7. 10. ACID and BASE: workflow makes this reaction safe
     - Siemens implemented the Citibank Worldwide Funds Transfer System using a workflow system that captured all the paper transactions and human communications and drove all of the enterprise and departmental systems, showing full visibility of the state of every single workflow and compensating for every failure (an amazing embrace of everything they did, ignoring the current CDO/CDS fiasco)
     - Following the lead of Siemens and Citibank, workflow could then be used to knit together the cloud systems (Amazon EC2, IBM Blue Cloud, Microsoft Azure, Google App Engine, etc.) with the enterprise systems, the internet, P2P, home systems and social networks (more on workflow and integrating everything later)
  8. 11. True Multi-Threading: shrinking the size of thread-instance state
     - Multi-threading is an item of interest in the dynamic-language frameworks surrounding Web 2.0, because of the scalability costs of single-threaded web servers
     - The Merb and Rails frameworks are being merged on Ruby (Rails 3 = Rails 2 + Merb 1). Why?
     - The dynamic languages are single-threaded by default, as are most of the frameworks deployed on them: the only widely used multi-threaded open frameworks are Spring on Java threads and .NET on Windows
  9. 11. True Multi-Threading: shrinking the size of thread-instance state
     - The dynamic languages are single-threaded because the web databases (Oracle, SQL Server, Sybase, MySQL and Postgres) all use MVCC concurrency, which leaves those databases defenseless to prevent concurrent corruption without single-threading at the application level (write skew produces wormholes in snapshot isolation under concurrent transactional update, per Gray)
     - S2PL avoids write skew by blocking on shared read locks (supported by IBM DB2 and HP Nonstop), so that the application can just use transactions and access the database naturally: this is true multi-threading with concurrent, consistent access to data
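The write-skew anomaly the slide describes can be seen in a few lines. Below is a minimal, illustrative simulation (not a real database): under snapshot isolation, two transactions with disjoint write sets both commit and break a shared invariant, while a single mutex standing in for S2PL's blocking shared read locks serializes them.

```python
# Sketch: write skew under snapshot isolation vs. S2PL blocking.
# Illustrative simulation only; a lone threading.Lock stands in for
# S2PL's shared read / exclusive write lock protocol.
import threading

def run_snapshot_isolation():
    db = {"x": 1, "y": 1}            # invariant: x + y >= 1
    snap1 = dict(db)                 # T1 reads its private snapshot
    snap2 = dict(db)                 # T2 reads its private snapshot
    if snap1["x"] + snap1["y"] > 1:  # T1 thinks it is safe to zero x
        db["x"] = 0
    if snap2["x"] + snap2["y"] > 1:  # T2 thinks it is safe to zero y
        db["y"] = 0
    # Disjoint write sets -> both commit under first-committer-wins,
    # yet the invariant is now broken: x + y == 0 (a "wormhole").
    return db["x"] + db["y"]

def run_s2pl():
    db = {"x": 1, "y": 1}
    lock = threading.Lock()          # reads block concurrent writers
    def txn(write_key):
        with lock:                   # held until "commit"
            if db["x"] + db["y"] > 1:
                db[write_key] = 0
    t1 = threading.Thread(target=txn, args=("x",))
    t2 = threading.Thread(target=txn, args=("y",))
    t1.start(); t2.start(); t1.join(); t2.join()
    return db["x"] + db["y"]

print(run_snapshot_isolation())  # 0 -> invariant violated by write skew
print(run_s2pl())                # 1 -> serialized, invariant holds
```

Whichever S2PL transaction runs second sees the first one's write and declines to update, which is exactly the blocking behavior that lets the application "access the database naturally."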
  10. 11. True Multi-Threading: shrinking the size of thread-instance state
     - MVCC applications avoid wormholes by single-threading access to the database at the application level:
       - either by sharding (federated database): isolating the database shards, with one single database instance per shard and one queue from one web server (PHP, Ruby, Python, Groovy, etc.) per shard
       - or by EAI towering application stacks, which compartmentalize the database apps, turn the compartments into queues, and boxcar filtered business transactions from those queues against their targeted, vulnerable shards
  11. 11. True Multi-Threading: shrinking the size of thread-instance state
     - Either way with MVCC, you end up with:
       - separate, siloed web databases requiring a separate technology (memcached, etc.) to bring everything back together
       - or towering application stacks that can only be maintained by static, bureaucratic legacy development, because of their ultra-tight coupling to prevent concurrency and corruption of defenseless data
     - So, for the web databases it is pointless to multi-thread the sharded server languages and frameworks, because the thread-instance data will still have to include an entire single-threaded RDBMS
     - Thus, multi-threaded dynamic languages and frameworks will be completely hobbled in their scalability until an open source S2PL database that can be concurrently shared is finally used
  12. 11. True Multi-Threading: shrinking the size of thread-instance state
     - S2PL databases can take the same sharding approach to divide and conquer the database hotspots, with focused isolation using SQL rowsets, compound statements (compiled SQL only) and RM-localized transactions
     - In that way, S2PL can get the same performance as the web databases, but without isolating tiny islands of the database from all the rest of the database for application and query access, and without sacrificing the consistency and protection of data by the RDBMS
     - In an S2PL environment, query programs can build queries that see the entire database and can transactionally update any part of it with complete consistency, and that visibility is a crucial thing for regulators and corporate officers, if for no other reason
  13. 11. True Multi-Threading: shrinking the size of thread-instance state
     - Web 3.0 appears to require AJAX (or other) push streaming, with no page refreshes and without polling:
       - Javascript on various Reverse AJAX frameworks and servers
       - Icefaces (Java plugin) or JavaFX (needs another plugin to work) with Grizzly push streaming on the JVM using NIO, with Spring as a framework
       - Adobe Flash/Flex (ActionScript + MXML) on the Flash plugin, pushed from RTMP (over TCP) or RTMFP (over UDP) servers by Adobe AIR (Apollo) applications
       - MS Silverlight can push XML using the DOM interface, which requires using the .NET framework
     - There will be more of these push streaming methods and frameworks, because they make internet applications look like desktop applications, and that makes everything else look sadly 2.0/1.0
  14. 11. True Multi-Threading: shrinking the size of thread-instance state
     - Push streaming technology creates a many-to-many scalability problem: millions of users streaming from thousands of sources. The solution to this was invented a decade ago for fault tolerant clustered software telemetry systems; see the patent: Enhanced instrumentation software in fault tolerant systems <http://www.google.com/patents?id=cRIJAAAAEBAJ&dq=6,360,338>
     - All of the dynamic languages and single-threaded server frameworks are wrong for this: asynchronous push streaming from the server side needs a multi-threaded network of sources pushing into a collection stream connected to the client, with ports kept permanently open on both ends (which requires something like Java's NIO to scale up)
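The many-sources-into-one-collection-stream shape can be sketched briefly. This is an assumption-laden toy, using Python's asyncio (a single-threaded event loop standing in for an NIO-style multi-threaded server): several source tasks push into one queue that feeds a single always-open client connection, with no polling on the client side.

```python
# Sketch of many-to-one fan-in for push streaming: multiple source
# tasks push into a single collection stream held open toward one
# client. asyncio is a stand-in for a Java NIO-style server; the
# names (source, client) are invented for illustration.
import asyncio

async def source(name, out_queue, n):
    for i in range(n):
        # Each telemetry sample is pushed downstream as it occurs.
        await out_queue.put(f"{name}:{i}")

async def client(out_queue, expected):
    received = []
    for _ in range(expected):
        # The "port" stays open; the client just awaits pushes.
        received.append(await out_queue.get())
    return received

async def main():
    q = asyncio.Queue()
    sources = [source(f"src{i}", q, 3) for i in range(4)]
    consumer = asyncio.create_task(client(q, 12))
    await asyncio.gather(*sources)   # 4 sources x 3 samples each
    return await consumer

messages = asyncio.run(main())
print(len(messages))   # 12
```

A real deployment would replace the queue with sockets kept permanently open on both ends, which is the part that single-threaded request/response frameworks cannot scale.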
  15. 11. True Multi-Threading: shrinking the size of thread-instance state
     - This disruptive technology will bring on the demise of the simplistic Web 2.0 model that PHP made so popular
     - The Java-based multi-threading model used by Terracotta and its competitors may be more appropriate to push streaming in Web 3.0: Terracotta knits together a cluster of JVM instances so that the Java Transaction concurrency model can work seamlessly across a cluster with S2PL concurrency
     - S2PL employs exclusive update locks and shared read locks, which block updates until the reader transaction commits or aborts
     - This allows consistency across the cluster, but unfortunately requires (once again) that all accesses through any connection to an MVCC RDBMS be queued together and single-threaded, otherwise corruption of the underlying database will occur
  16. 11. True Multi-Threading: shrinking the size of thread-instance state
     - This creates a hotspot, which requires sharding, and there you go again fracturing the database mirror of reality: if you need to update data in two or more shards, you have to function-ship using disjoint transactions, which risks inconsistency, or risk corruption of the database, because it is MVCC and cannot defend itself against the applications
     - If Terracotta were deployed over an S2PL database, as many concurrent connections as desired could be opened simultaneously with no write-skew corruption, and if you came in from different connections with the same transaction ID, data could be shared correctly
  17. 12. Single System Image and Network Autonomy
     - What is network autonomy, and why do we need it for transactional clusters within a larger distributed database group of clusters (a massive database)?
     - The basic unit of network autonomy should be a single transaction system, running on a cluster of computers that goes up and down as a unit with the transaction service and the database RMs that are logging to the transaction system's partitioned log
     - A distributed database is composed of many of these transactional clusters, seamlessly integrated into a single view of the distributed data by a transactional file system that presents a single system image to the user on top of the multiple clusters of which it is operationally composed
  18. 12. Single System Image and Network Autonomy
     - In fact, network autonomy for individual clusters of computers is needed for the reliable operation of a massive database composed of a large group of those transactional clusters, according to the principle of divide and conquer:
       - by limiting the side effects and the risks of cascading disruption that arise from transactional cluster operations and outages
       - by rendering operational functions possible to program and schedule for transactional clusters in a larger context
       - you can make those transactional cluster operations within a larger group of clusters supporting a massive database simultaneously scalable, transparent and reliable
  19. 12. Single System Image and Network Autonomy
     - So, an autonomous, single transactional cluster must be capable of reliable interoperation as a piece of the distributed database service, while retaining complete unit independence
     - An autonomous transactional cluster must be manageable for growth without forcing changes to the distributed applications (transactional distributed file system, transparent distributed message system service, reliable global name service, speedy and reliable global notification service for death of processes and clusters, etc.)
     - An autonomous transactional cluster must be independent of other clusters, so that it can come up and stay up in any and all circumstances, no matter what happens to clusters around it
  20. 12. Single System Image and Network Autonomy
     - An autonomous transactional cluster must never be brought down by outages or anomalies on other transactional clusters (this requires a reliable network infrastructure at the group level, above the cluster level)
     - For example, an autonomous transactional cluster must not require precise clock synchronization to function correctly (especially in distributed or cluster commit processing), although human-readable clock displays should be synchronized in a trivial way, so that humans can evaluate anomalies in problem diagnosis and in validation and verification (V&V) of the transactional system software
  21. 12. Single System Image and Network Autonomy
     - One implication of this is that Lamport timestamps can be used in distributed algorithms, but not Welch timestamps, if network autonomy is to be preserved:
       - Lamport timestamps are basically a counter, and guarantee that, if they are propagated correctly to support the Thomas Write Rule, the age of distributed updates can be correctly compared to find the most recent value for any update, resolving simultaneous update conflicts (in a highly simplistic way)
       - Welch timestamps accomplish the same thing but implement a clock timestamp, which is synthetically synchronized (defying relativity); that specific clock synchronization operation can cause the entire server to block until the local clock time is greater than or equal to the received timestamp's clock value (IBM anecdote)
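The Lamport-counter-plus-Thomas-Write-Rule combination described above is compact enough to sketch. A minimal, illustrative register follows (class and method names are invented): each node keeps a counter, advances it past any timestamp it observes, and applies a received write only if its (counter, node_id) stamp is newer, silently discarding stale ones. No synchronized wall clock is needed, so no node ever blocks waiting for local time to catch up.

```python
# Sketch: Lamport counters with the Thomas Write Rule for
# last-writer-wins conflict resolution. Illustrative only; the
# (counter, node_id) tuple ordering breaks ties between nodes.

class LamportRegister:
    def __init__(self, node_id):
        self.clock = 0
        self.node_id = node_id
        self.value = None
        self.stamp = (0, node_id)

    def local_write(self, value):
        self.clock += 1
        self.stamp = (self.clock, self.node_id)
        self.value = value
        return self.stamp            # propagated with the update

    def receive(self, value, stamp):
        # Lamport rule: advance our counter past any timestamp seen.
        self.clock = max(self.clock, stamp[0]) + 1
        # Thomas Write Rule: apply only if newer; drop stale writes.
        if stamp > self.stamp:
            self.value, self.stamp = value, stamp

a, b = LamportRegister("A"), LamportRegister("B")
s1 = a.local_write("v1")
b.receive("v1", s1)
s2 = b.local_write("v2")      # later write on B
a.receive("v2", s2)
a.receive("stale", s1)        # a replayed old update is discarded
print(a.value)  # v2
```

Note the autonomy property: `receive` never waits on anything, whereas a Welch-style implementation might block until the local clock catches up with the received clock value.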
  22. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - Special hardware is by definition closed-source, single-source and proprietary: using open source software based on proprietary special hardware lands you in the legacy trap, regardless of your diligent efforts to avoid it
     - IBM has proprietary buses and high speed networks connecting a Sysplex cluster
     - IBM Sysplex clusters also have an ultra-smart memory called the coupling facility (CF), which is a mainframe computer in itself running a different OS (CFCC, CF control code), and which can run backed up in pairs for quasi-fault-tolerance
     - However, the state in CFs cannot be mirrored, and a CF failure requires a reload before processing can continue: this is a failover (blocked), not a takeover (transparent)
  23. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - The fault tolerant high performance and high availability of mainframe DB2 is dependent upon the CF, which holds the lock tables, shared buffers and some linked lists
     - Sysplex also has a highly synchronized Sysplex clock, which can also be paired for fault tolerance, but which is more for applications, since mainframe DB2 uses local order, not global time synchronization, for fault tolerant commit (IBM is too smart to defy relativity)
     - Tandem Nonstop uses off-the-shelf processors and hardware for almost everything, although the boards are proprietary, and that makes it legacy for hardware
     - In 1999 Nonstop (under Compaq) released the entire SQL/MX software stack on NT clusters (twice, to the Paris Stock Exchange), which proves that they don't need special hardware, with one exception: the clusters used ServerNet II as an interconnect
  24. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - ServerNet II was a proprietary network using wormhole routing (different wormholes from MVCC's), which had enhanced throughput and response time: it was a fabric switch like others, but the superior response time was due to the lightweight nature of the Nonstop message system
     - The hardware response time for ServerNet II was 100 ns (according to Bob Horst, the inventor), which is 100 feet of travel for light through vacuum, and that is the physical limit
     - Even though ServerNet II was a legacy play, it was built with 1 Gb Ethernet components, so it was sort of 'off-the-shelf' legacy
     - The 100+ ns response time for ServerNet II compares to 200 ns for Myrinet, Quadrics and Infiniband, which are commercially available fabric switches for PCI-bus based servers
  25. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - All of these (IMHO) are going to lose out to the 40 Gb and 100 Gb Ethernet switches: one from Force10 Networks is quoted at 300-400 ns, which means that off-the-shelf switched networking routers are closing in on the fabric switches and the physical limit in terms of throughput, and now response time
     - Force10 Networks is also working on making this Ethernet wide-area, so that you could get decent performance like this on a larger scale, presumably over private networks
     - If that works out, it means that an optimal RDBMS transaction system will require no special networking hardware, by using Ethernet and not fabric switches: the superior combination of throughput and response time that made Nonstop so scalable, for reasonable cost, but without the Nonstop legacy
  26. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - Back to what IBM does to make clusters ACID with special hardware: to construct a DIY open source 'coupling facility', one could use:
       - a network-connected multi-core running NetBSD
       - a non-volatile memory:
         - Flash: too slow, too much power, only 10^5 writes (because of the power surge on writes)
         - FeRAM: not as slow, lower power, 10^8 writes
         - MRAM: very fast, low power, 10^15 writes
  27. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - Even after many years to catch up, flash still has the price point over MRAM, so flash would be the non-volatile memory of choice for a DIY CF (for now)
     - SATA disk can do transfer rates of 300 MB/s (600 MB/s soon); flash does about 66 MB/s
     - Disk has an access time of 12 ms, flash 0.1 ms: of course, for serial writes (treating disk like a tape), disk access time overhead is heavily amortized, so log writes would still go to disk
     - However, checkpointing memory blocks to disk for fault tolerance involves random writes, which could now use a flash CF with its much lower access time
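The amortization argument checks out numerically. A back-of-envelope sketch using the slide's own figures (12 ms disk access, 300 MB/s disk transfer, 0.1 ms flash access, 66 MB/s flash transfer; the 1 MB and 4 KB write sizes are illustrative assumptions):

```python
# Effective throughput = bytes written / (access time + transfer time),
# using the slide's figures. Write sizes are illustrative assumptions.
def effective_mb_per_s(access_ms, transfer_mb_s, write_kb):
    mb = write_kb / 1024
    total_s = access_ms / 1000 + mb / transfer_mb_s
    return mb / total_s

# Random 4 KB checkpoint writes: one 12 ms seek per tiny write
# cripples disk, while flash barely notices its 0.1 ms access.
disk_random  = effective_mb_per_s(12, 300, 4)     # ~0.3 MB/s
flash_random = effective_mb_per_s(0.1, 66, 4)     # ~25 MB/s

# Serial log writes in large runs (disk treated like a tape):
# the seek is amortized and disk's raw transfer rate dominates.
disk_serial  = effective_mb_per_s(12, 300, 64 * 1024)  # ~284 MB/s

print(round(disk_random, 2), round(flash_random, 1), round(disk_serial))
```

This is why the slide keeps logs on disk (serial, seek-amortized) but moves random checkpoint traffic to a flash CF.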
  28. 13. Minimal Use of Special Hardware: servers need to be off-the-shelf
     - We should make the CF look like a disk, to avoid any intellectual property issues: writing/logging to a file for any kind of fault tolerance is 'prior art' from way back, and not susceptible to patent infringement
     - Add some scatter/gather features to our CF (also prior art in file-system disk access)
     - Presto! A free and open do-it-yourself coupling facility
     - Of course, it would be easier and cheaper to just use enterprise storage cache, if you already use an ESS
     - Whatever you do, you will need two copies for fault tolerance, and you will have to maintain them as replicas (high-level) or mirrors (low-level)
  29. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - Large software systems have four levels of discipline involved in their development:
       - Manufacturing and Logistics
       - Engineering
       - Research (Science and Mathematics)
       - Art
  30. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - This part of the discussion will focus on manufacturing, which in critical software systems means focusing on software releases:
       - There is extensive change and risk in major product releases (this is why some critical-computing customers avoid them)
       - There is partial change and lower risk in minor product releases (the customers avoiding the major releases above will jump on the second minor release, figuring it might not be half-baked)
       - There are focused, isolated changes and narrowed risk in limited subsystem releases (to a few customers)
       - There are focused, isolated changes in quick fixes (often 'under the table') to a select clientele, with undetermined and therefore heightened risk (which they are motivated to take by the problems fixed)
  31. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - Critical software systems sustain a mandatory requirement from the customers who need changes to continue operating those critical systems: reliable and periodic releases
     - Keeping the process of periodic releases flowing creates a dynamic tension in software organizations: retaining some modifications and kicking out others in a periodic release vehicle will change the shape of the software system itself and force it into the legacy mold (even for open source software)
     - Small, isolated changes are safe and tend to embark upon the release vehicle with impunity, while changes that have a web of dependencies across the subsystems or modules of the software system tend to be candidates for rejection, ultimately at the opening of content selection for the release
  32. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - This means that software systems like Linux, modularized such that most changes are isolated, tend to be the most maintainable
     - But Linux would be very hard-pressed to add the driver ecosystem that was developed by David Cutler et al. for Windows NT, which is the basis for the kernel-mode services there
     - After the merger of those two companies, a similar concern foiled the attempt to merge HP's stable, modularized enterprise Unix with Compaq's riskier but niftier Tru64 Unix: Tru64 supported a flexible kernel-mode infrastructure and was a much cooler platform for doing amazing things; however, staid and boring enterprise Unix won that fight
  33. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - The result of the manufacturing and release cycle for software is that the critical software systems most inflexible to real change tend to have the most reliable releases, and that is not a happy result: it transforms FOSS into legacy
     - It is a happy result for the dynamic languages, which allow rapid application change: Perl, PHP, Ruby, Python and (my favorite) Groovy on Java. But dynamic languages have their specific uses and serious limitations, which means that changing the shape and appearance of the web is far easier than changing the fundamentals of how the critical systems behind the web, which supply it with information flow, actually work
  34. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - Operating systems, DBMSs and frameworks are complex enough to force stratification of the shared libraries, formats, protocols and subsystem architecture, and this stratification forces software projects into a legacy mindset
     - A successful development project becomes like a temple or a religious institution: the software becomes more departmentalized and less mathematically pure, minimizing bruises between developers at the expense of a 'socialized' architecture that buffers the software from drastic change
     - Successful development teams become priesthoods, each jealously guarding its turf from stress and change: fighting off the difficult issues
  35. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - There is a yearning for smooth evolution with no bumps in the road, and a fear of revolution. Real evolution is not like that: real evolution is catastrophically punctuated equilibrium (Stephen Jay Gould), big leaps, not tiny steps
     - The legacy trap is an inevitable product of success, which is why, periodically, large projects should be forked and radically modified to breathe life back into them, however wrenching that is politically
     - The most difficult integration problems come from the software dependencies between the subsystems of critical systems that require multiple and complex changes to truly evolve
  36. 14. Maintainability and Supportability: H/W & S/W need to be capable of basic on-line repair
     - These interdependencies are viewed as less desirable for bread-and-butter periodic releases, especially interdependencies arising from changes to:
       - shared libraries
       - formats
       - protocols
       - persistent formats and protocols (in the database and the log; these are the worst)
     - A common method for managing change in formats, and even in protocols and libraries, is bundling them together as dialects: this puts versioning on a collection of interface items that can survive together cohesively for a time, while avoiding the necessity of developing a full-out repository scheme, which always has its own complexity, availability and chicken-and-egg problems
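The dialect idea above can be sketched as a tiny negotiation routine. All names and version sets below are invented for illustration: each dialect bundles specific format and protocol versions, and two peers pick the newest dialect they both support at configuration time, so a mismatch surfaces immediately rather than far downstream.

```python
# Sketch: bundling format/protocol versions into named 'dialects'
# and negotiating at configuration time. Dialect names and version
# sets are hypothetical; names sort by age here for simplicity.

DIALECTS = {
    "d1": {"record_fmt": 1, "log_fmt": 1, "wire_proto": 1},
    "d2": {"record_fmt": 2, "log_fmt": 1, "wire_proto": 2},
    "d3": {"record_fmt": 2, "log_fmt": 2, "wire_proto": 2},
}

def negotiate(ours, theirs):
    """Pick the newest dialect both peers support, or fail loudly."""
    common = set(ours) & set(theirs)
    if not common:
        # Surfacing this at configuration time is the whole point.
        raise ValueError("no common dialect: integration problem")
    return max(common)

# A new release (d1-d3) talking to an older peer (d1-d2) settles on d2.
print(negotiate({"d1", "d2", "d3"}, {"d1", "d2"}))  # d2
```

Versioning one dialect name instead of each library, format and protocol separately is what keeps the scheme lighter than a full repository.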
  37. 37. <ul><li>Then a software release module or subsystem can decide to support multiple versions of a dialect, or multiple dialects: so at configuration time you can easily tell whether or not the release will have integration problems, not at some point much further downstream </li></ul><ul><li>IBM publishes their systems and subsystems formats and protocols in a FAP document, so that you can know the dependencies in a software system before design is considered and then later at release time </li></ul><ul><li>However you provide for versioning in shared libraries, formats, protocols and the persistence of those building blocks, not having versioning already means that you have to release (a major product release) the versioning itself first, before you can take advantage of versioning to make releases easier later </li></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  38. 38. <ul><li>For persistent formats and protocols, releasing them can be a deadly decision for customers, unless you provide a migration and fallback mechanism: without fallback, customers can get trapped in a syndrome that does not work for their application, and was not revealed in testing, and only shows up weeks later, let’s call this ‘a deadly embrace’ , from which there is no way back </li></ul><ul><li>Some time ago, when Compaq Nonstop upgraded their database and log file formats to support 64 bit internal pointers, they used fallback to release it in a safe way: this was only possible, because they first released a baseline file format versioning release for all the software involved, then a fallback and finally, a migration release … Here’s the theory behind that triad … </li></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  39. 39. <ul><li>The baseline format versioning release contains the extra ability to deal with versions of formats, but with no change in function: there should be no side effects whatever to this </li></ul><ul><ul><ul><li>Of course, if you already had format versioning or even better, dialect support in the software, a baseline release would be unnecessary </li></ul></ul></ul><ul><ul><ul><li>The baseline release should have close to zero risk , and if that’s not true, then there’s a bug in the design and you need to work harder: this release needs to be perfect, because you really can’t fall back from the fallback to the fallback </li></ul></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  40. 40. <ul><li>The fallback release can functionally deal with the semantics of new files, when they are later created by the fully functional release of the software, but contains no other new function or other modifications that could break the user’s applications, and it creates no new files on its own </li></ul><ul><ul><ul><li>If something goes immediately wrong with the new formats and format code included in the fallback release, you can pop back to the baseline release with complete safety, because nothing has been changed that is persistent: therefore the fallback release has small risk </li></ul></ul></ul><ul><ul><ul><li>Once you decide to migrate, you cannot go back to the baseline release, because new persistent state will be present and break that old code, so you have to make sure that the fallback release is satisfactory on all counts before proceeding to migration </li></ul></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  41. 41. <ul><li>The migration release is fully functional and deals with the old and new formats according to a script, or under user commands, to replace the old formats with new ones, using shadow files in a transparent manner </li></ul><ul><ul><ul><li>If something goes wrong with the new function, you can pop back to the fallback release with maximum safety, because it will put the old function back in place with the mixture of new and old files: the migration release has the typical risk of a serious update, which is mitigated by the capability to fall back to the previous “safe” function that your application is used to </li></ul></ul></ul><ul><ul><ul><li>You can transit back and forth between the fallback release and the migration release or updates to it as many times as you please, until you get a migration release version that your application can stand to live with </li></ul></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
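The triad above can be sketched as a compatibility check (a toy model with invented version numbers and release names, not the actual Nonstop release machinery): fallback is safe exactly as long as nothing persistent exists that the target release cannot read.

```python
# Hypothetical sketch of the baseline/fallback/migration triad.
# Each release advertises which on-disk format versions it can READ
# and which it will WRITE; falling back to a release is safe only
# while every persisted format is one that release can read.

RELEASES = {
    "baseline":  {"reads": {1},    "writes": 1},  # adds version checks, no new formats
    "fallback":  {"reads": {1, 2}, "writes": 1},  # understands v2, still writes only v1
    "migration": {"reads": {1, 2}, "writes": 2},  # fully functional, creates v2 files
}

def can_fall_back(to_release: str, persisted_versions: set) -> bool:
    """Fallback is safe iff the target release can read every format on disk."""
    return persisted_versions <= RELEASES[to_release]["reads"]

# Before migration, only v1 data exists: falling back to baseline is safe.
assert can_fall_back("baseline", {1})
# After migration has written v2 files, baseline can no longer be used...
assert not can_fall_back("baseline", {1, 2})
# ...but the fallback release can still read everything, so it remains a refuge.
assert can_fall_back("fallback", {1, 2})
```

This makes the ordering constraint mechanical: the fallback release must be installed (and proven) before the migration release ever writes a v2 file.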
  42. 42. <ul><li>Distributed systems go down or become unavailable in a variety of ways, and all these syndromes need to be analyzable: </li></ul><ul><ul><li>Computer halts : a BSOD (blue screen of death), or a low level dead loop or a deadlock that prevents the cluster heartbeat (I’m Alive) from being transmitted, so that others declare you down after a regroup round and your second chance expires -> so, take a computer dump </li></ul></ul><ul><ul><li>Cluster freezes : asserting these needs to be a feature of the low-level function of a distributed system (enable/disable) -> the freeze allows you to inspect the state of all computers in the cluster, you can get all the evidence of what went wrong and take the appropriate dumps, the freeze is especially crucial for debugging racy distributed state machines </li></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  43. 43. <ul><li>Distributed systems evidence gathering (continued): </li></ul><ul><ul><li>Cluster low-level protocol hang : nothing halts, but no messages or work can be queued, and sometimes you can’t get the cluster’s attention (it may be in a dead loop, or it may be deadlocked on a resource) -> assert a freeze from the outside, and take appropriate dumps </li></ul></ul><ul><ul><li>Cluster subsystem failure : this doesn’t take down the computers possessing the subsystem (otherwise you’d trigger a cluster freeze) -> an odd case, the subsystem code needs to have assert functionality to halt, freeze, or just dump the subsystem and carry on; barring that, you must externally freeze the cluster and take appropriate dumps </li></ul></ul><ul><ul><li>Cluster subsystem protocol hang : new work gets queued to the subsystem, and nothing happens, but nothing dies -> freeze the cluster and take appropriate dumps </li></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  44. 44. <ul><li>When systems fail (especially distributed systems), solid pathology technique against the corpse(s) is crucial: interpreting a binary dump of a processor or system is a necessary talent to fall back on (I have crunched 10,000 or so binary dumps), but with a little planning, the system can trap useful information on the way down, to aid diagnosis of the problem(s): </li></ul><ul><ul><li>Halt code, traps, asserts and exceptions can be augmented to leave a standardized fingerprint including application stack and interrupt stack traces, critical globals and registers, and information like open files, file and message system operations in progress, etc.: stored in a couple of likely places in the system in a parsed and readable format, or in xml </li></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
  45. 45. <ul><li>Problem diagnosis (continued): </li></ul><ul><ul><li>Trace logs are also handy, but are harder to engineer for failures, because pushing out that last I/O (the crucial one) can be tricky in trap handling: you need a facility and a shared library call to do this right, plus a way to turn focused tracing on and off so as not to swamp the system with a trace-storm </li></ul></ul><ul><ul><li>Handling problem reports from customers can become an endless and traumatic ordeal if the only person who can diagnose problems is the developer … this is where fingerprints of problems become really useful: rediscovery is the automated or semi-automated process of matching fingerprints and determining that a new problem is one that is already known … that analysis made by someone other than the overwhelmed and single-threaded subsystem developer </li></ul></ul>14. Maintainability and Supportability H/W & S/W needs to be capable of basic on-line repair ‏ 2-
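A toy sketch of rediscovery (hypothetical halt codes, frame names, and problem IDs, not any vendor's support tooling): normalize the stack trace so addresses and offsets don't matter, then match the resulting fingerprint against the known-problem database.

```python
# Toy rediscovery sketch: match a crash "fingerprint" (halt code plus a
# normalized stack trace) against already-known problems, so that someone
# other than the single-threaded subsystem developer can triage duplicates.
def fingerprint(halt_code, stack):
    # Strip the per-dump offsets so the same bug matches across dumps.
    frames = tuple(frame.split("+")[0] for frame in stack)
    return (halt_code, frames)

# Known-problem database (hypothetical entries).
known_problems = {
    fingerprint(0xDEAD, ["DP2_WRITE+0x1f4", "CACHE_STEAL+0x88", "TXN_COMMIT+0x10"]): "CR-1042",
}

# A new dump with different offsets but the same failure shape:
new_dump = fingerprint(0xDEAD, ["DP2_WRITE+0x2a0", "CACHE_STEAL+0x88", "TXN_COMMIT+0x44"])
match = known_problems.get(new_dump)  # rediscovery finds "CR-1042" automatically
```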
  46. 46. 15. Expansively Transparent Parallelism and Scalability <ul><li>The transactional File System is the mechanism of transparency on any clustered system (like Nonstop)  : </li></ul><ul><li>The File System is a set of system library routines which have their own data segment but which run in the process environment of the application (client) program. These routines format and send to various Disk Processes messages requesting database services for files residing on their volumes. Through File System invocations, the application process becomes a requester (client) and the Disk Process a server in the requester-server model. In the case of ENSCRIBE, the application program invokes the File System explicitly -- calling such routines as OPEN, READ, WRITE, LOCKRECORD -- to perform key navigation and record-oriented I/O. </li></ul>2-  TR-88.10 High Performance SQL Through Low-Level System Integration, A. Borr, F. Putzolu <http://www.hpl.hp.com/techreports/tandem/TR-88.10.html>
  47. 47. <ul><li>In the case of SQL, the application program's SQL statements invoke the SQL Executor , a set of library routines which run in the application's process environment. The Executor invokes the File System on behalf of the application. Its field-oriented and possibly set-oriented File System calls implement the execution plan of the pre-compiled query. Certain aspects of the division of labor between the File System and the Disk Process are mandated by the distributed character of the Tandem architecture. Database files in a Tandem application are typically spread across multiple disk volumes attached to different processors within a node, or to different nodes within a cluster or network. Base files may have multiple secondary indices (implemented as separate key-sequenced files), and these may be located on arbitrary volumes. Base files and secondary indices may each be horizontally partitioned based on record key ranges into multiple fragments residing on a distributed set of disk volumes. </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
  48. 48. <ul><li>Thus, the file fragment managed by the Disk Process as a single B-tree may in fact be merely a single partition of an ENSCRIBE or SQL file, or a secondary index (or partition thereof) for an ENSCRIBE or SQL base file. The file or table is viewed as the sum of all its partitions and secondary indices only from the perspective of the SQL Executor or ENSCRIBE File System invoker. Such an architecture makes the File System the natural locale for the logic which, transparently to the caller, manages access to the appropriate partition based on record key ; or manages access to the base file record via a secondary key; or performs maintenance of secondary indices consistent with the update or delete of a base file record. </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
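The partition routing that the File System performs transparently can be illustrated with a minimal sketch (invented key ranges and volume names, not real catalog metadata): the caller presents a record key, and the routing layer alone decides which partition, volume, and possibly remote cluster is involved.

```python
# Sketch of transparent partition routing by record key, as the File
# System layer does for horizontally partitioned files.
import bisect

# Partitions are defined by their low key; each maps to a disk volume
# that may live on a different cpu, node, or cluster (names invented).
low_keys = ["A", "H", "P"]                  # sorted partition boundaries
volumes  = ["$DATA1", "$DATA7", "$REMOTE3"] # one volume per partition

def route(record_key: str) -> str:
    """Return the volume whose key range contains record_key."""
    i = bisect.bisect_right(low_keys, record_key) - 1
    return volumes[max(i, 0)]

# The caller never names a partition; the key alone decides the target.
assert route("Baker") == "$DATA1"
assert route("Jones") == "$DATA7"
assert route("Smith") == "$REMOTE3"
```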
  49. 49. <ul><li>The file system enables an extreme form of transparent parallelism and scalability: Nonstop SQL tables and their indexes can have hundreds of horizontal partitions, and these can be spread across many autonomous transactional clusters (Nonstop nodes) as distributed partitions … hence, the transactional file system gives the application transparent, and in the case of SQL, automatically transparent access to all of the data on all of the computers in the transactional clusters across the entire group of clusters </li></ul><ul><li>If you have hundreds of partitions, then for true parallelism on those partitions, you will need to be able to coordinate hundreds of simultaneous fast sorts, hash sorts, etc. </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
  50. 50. <ul><li>You will also need to marshal a lot of computing resources and temporary storage to accomplish all of that in parallel (more cores on more computers helps out on part of this) </li></ul><ul><li>Expansive transparency would then also require extremely high unit availability for any transactional cluster in the group, otherwise there would be little to be gained from the joined view of a continuously operating database across the massive enterprise, which sensible applications will require </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
  51. 51. <ul><li>Expansive transparency requires that application functionality survive the failure and recovery of outages that become omnipresent as the numbers of computers grows larger: something is always going down, coming back up, having online management functions performed, being moved or renamed, having its software upgraded, … etc. </li></ul><ul><li>In a large group of clusters of computers with wall-to-wall expansive transparency, there can be no operational outage windows : All operational and management functions must work online , at the worst causing an occasional slowdown </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
  52. 52. <ul><li>In a large group of clusters of computers with wall-to-wall expansive transparency, there must be performance that is broadly and consistently flat , because… </li></ul><ul><ul><li>Any disjoint parallelism makes for unpredictable scalability, because you won’t know where the application can add new load, or where the user can add new hardware that will retain balanced performance </li></ul></ul><ul><ul><li>Here’s how that works: (1) if one out of a thousand RMs is configured with too small a buffer cache by a typo in a script, that RM has to steal pages to change the database and insert a log record in the CAB-WAL buffer (2) once all the non-dirty pages are used, a dirty page must be taken (3) to write back to the data volume the CAB-WAL buffer must be checkpointed to the backup and then written to the log and waited for, then (4) the random write to the data volume can be performed and now you can steal the page and complete the operation </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
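The four-step page-steal sequence above can be sketched as follows (greatly simplified; the CAB checkpoint to the backup process is shown as a single opaque call, and the data structures are invented):

```python
# Simplified sketch of the page-steal path described above: a dirty page
# may be written back (and stolen) only after its log records have been
# checkpointed to the backup and forced to the log (write-ahead rule).
class Page:
    def __init__(self, lsn):
        self.lsn, self.dirty = lsn, True

class FakeLog:
    def __init__(self):
        self.flushed_lsn = 0
    def flush(self, upto):
        self.flushed_lsn = max(self.flushed_lsn, upto)  # force WAL and wait

def steal_page(cache, log, checkpoint_to_backup, write_data_volume):
    clean = next((p for p in cache if not p.dirty), None)
    if clean is not None:
        return clean                          # prefer a non-dirty page
    victim = min(cache, key=lambda p: p.lsn)  # (2) must take a dirty page
    checkpoint_to_backup()                    # (3) checkpoint the CAB to the backup...
    log.flush(upto=victim.lsn)                #     ...then write-ahead-log and wait
    write_data_volume(victim)                 # (4) now the data-volume write is legal
    victim.dirty = False
    return victim                             # safe to steal and reuse

cache = [Page(lsn=3), Page(lsn=8)]
log = FakeLog()
written = []
victim = steal_page(cache, log, checkpoint_to_backup=lambda: None,
                    write_data_volume=written.append)
# The WAL was forced past the victim's LSN before the data write happened.
assert log.flushed_lsn >= victim.lsn and not victim.dirty
```

An undersized cache forces this whole synchronous chain onto the transaction's critical path, which is exactly why one mistuned RM drags down the group.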
  53. 53. <ul><li>There must be flat performance (continued) </li></ul><ul><ul><li>The moral of the story is not about getting the scripts right, it is this: there is a distribution of performance metrics in any group of clusters of computers and the system performance is a function of the slowest disk , and there is always a slowest disk that needs attention </li></ul></ul><ul><ul><li>Any disjoint scalability makes for unpredictable parallelism, because you won’t know how to tune such a large system to recapture the excellent execution that made you happy you bought the system in the first place </li></ul></ul><ul><ul><li>Without building another scenario to make this point, there is always a worst-tuned part of the system dragging down the group of clusters of computers’ performance </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
  54. 54. <ul><li>There must be flat performance (continued) </li></ul><ul><ul><li>Parallelism and scalability then need to be every bit as flat across the group of clusters of computers as the transparency is wide: and they are not in any way independent of one another, because of expansive transparency </li></ul></ul><ul><ul><li>Expansive transparency across a group of clusters of computers means enhanced visibility of any irregularity: so you need first rate maintainability and performance tuning tools to make it easy to spot the anomalies </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
  55. 55. <ul><li>There are a number of features under the covers that allow Nonstop to have flat performance characteristics across enormous systems: </li></ul><ul><ul><li>(1) Previously, the CAB-WAL-WDV protocol for the RM was discussed: the checkpoint-ahead-buffer and write-ahead-log are hurried, because locks are held, and the log writing is enormously scalable </li></ul></ul><ul><ul><li>The write-data-volume is lazy, because it happens after the transaction work is finished as a housekeeping task for the RM, but housekeeping can get behind quickly and must be done before the next 5-minute checkpoint from the RM to the log </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
  56. 56. <ul><ul><li>The mechanism for that lazy writing is discussed in the same article quoted before, by Andrea Borr and Franco Putzolu (  ibid., TR-88.10 ): </li></ul></ul><ul><ul><li>Bulk I/O is also used for asynchronous write-behind . This mechanism uses idle time between Disk Process requests to write out strings of sequential blocks updated under a subset. By using its Subset Control Block (created as a result of the initial set-oriented FS-DP interaction), the Disk Process can keep track of strings of sequential blocks which are dirty (i.e. have been updated in cache). Once a string of dirty data blocks has aged to the point that the audit [log records] related to the blocks of the string has already been written to disk, then the string of dirty data blocks can be written to disk without violating write-ahead-log protocol [Gray]. The Disk Process then writes the string to disk using the minimal number of bulk I/Os. </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
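The aging rule in the quotation reduces to one comparison: a string of dirty blocks becomes writable once the log has been flushed past the highest LSN of any block in the string. A sketch, with invented data structures:

```python
# Sketch of the write-behind aging rule from TR-88.10: a string of
# sequential dirty blocks may go out in one bulk I/O once the audit
# (log records) covering it is already on disk -- no WAL violation.
def writable_strings(dirty_strings, log_flushed_lsn):
    """Each string is a list of (block_no, lsn) pairs; a string has aged
    out when every block's audit has reached the log on disk."""
    return [s for s in dirty_strings
            if max(lsn for _, lsn in s) <= log_flushed_lsn]

strings = [
    [(100, 5), (101, 6), (102, 7)],   # aged: audit through LSN 7 is on disk
    [(200, 9), (201, 12)],            # not yet: LSN 12 is still only buffered
]
ready = writable_strings(strings, log_flushed_lsn=10)
# Only the first string qualifies; it is written with a single bulk I/O.
```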
  57. 57. <ul><ul><li>(2) Message traffic can be reduced by shipping some SQL query function to the RMs (project, select, filtering, aggregate functions, etc.) </li></ul></ul><ul><ul><li>Message cost can be amortized by shipping data more efficiently: rowsets, compound statements make for fewer and bigger messages and tight loop execution in the RM  </li></ul></ul><ul><ul><li>(3) The RMs and the client software interaction is improved by the Mixed Workload Enhancement: </li></ul></ul><ul><ul><li>A common problem with clusters is that a low-priority client, once dequeued and getting served at the RM, can block the access of a high-priority client that is newly queued </li></ul></ul><ul><ul><li>For short duration requests, this is ignorable, but low priority table scans for queries blocking high priority OLTP updates is not good for business </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
  58. 58. <ul><ul><li>The solution is (a) to execute client function in a thread at the priority of the client (inversion) and (b) to make low priority scans (and the like) execute for a quantum and (c) be interruptible by high priority updates, see the paper  : </li></ul></ul><ul><ul><li>(4) A common method to avoid the deblocking/reblocking overhead for btree and even sequential file inserts is to do inserts at the end of the btree or sequential file: if you can continuously apply buffered multiple inserts/transaction at that btree/file position, then your performance goes way up </li></ul></ul><ul><ul><li>The problem is that it becomes a hotspot and then you run out of steam </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-  TR-90.8 Guardian 90: A Distributed Operating System Optimized Simultaneously for High-Performance OLTP, Parallelized Batch/Query and Mixed Workloads <http://www.hpl.hp.com/techreports/tandem/TR-90.8.html>
  59. 59. <ul><ul><li>The solution to this insert scalability problem is to partition the btree/sequential file, up to hundreds of partitions, which allows the application to pour the data into the system like a fire hose (don’t forget to partition the log sufficiently) (  ibid. last paper) </li></ul></ul><ul><ul><li>(5) If batch processing is broken up into SQL operations (select, project, scan, sort, join, aggregate, insert, update and delete), then many of those can be executed in parallel which gives a huge scale up (Gray’s speedup ) (  ibid.) </li></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
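A toy illustration of why partitioning removes the insert hotspot (hash placement is shown here for simplicity; the slides describe key-range partitions, and a real system would also partition the log):

```python
# Toy sketch: partitioned appends keep the cheap insert-at-end path
# while spreading the hotspot across many partitions.
NUM_PARTITIONS = 8                      # real systems: up to hundreds

partitions = [[] for _ in range(NUM_PARTITIONS)]

def insert(key, row):
    # Each partition sees only a fraction of the inserts, all still at
    # its own end, so the cheap append path is preserved everywhere.
    partitions[hash(key) % NUM_PARTITIONS].append(row)

for i in range(10_000):
    insert(f"order-{i}", {"id": i})

# All rows landed, and no single btree tail absorbed the whole fire hose.
assert sum(len(p) for p in partitions) == 10_000
assert max(len(p) for p in partitions) < 10_000
```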
  60. 60. <ul><ul><li>(6) Fastsort fires up as many subsorts as there is appropriate slack in the system for, and then partitions the file to be sorted amongst the subsorts and merges the results for excellent speedup parallelism (  ibid.) </li></ul></ul><ul><ul><li>(7) Nonstop has a facility called the ZLE , that brings all of the clustering features together to create a monster data store (Sprint, Target, Homeland Security): </li></ul></ul><ul><ul><ul><li>A ZLE data store can span a group (network) of more than 50 clusters (nodes) holding 4000 loosely coupled computers (cpus) in one single system image enterprise </li></ul></ul></ul><ul><ul><ul><li>EAI integration allows outside data sources to be poured into and batch extraction out of the live online system </li></ul></ul></ul>15. Expansively Transparent Parallelism and Scalability 2-
  61. 61. <ul><ul><li>ZLE (continued): </li></ul></ul><ul><ul><ul><li>Using the mixed workload enhancement the ZLE is running a constant and huge query load, while maintaining up to 200,000 business transactions (updates/inserts) per second, with trickle batch, see the articles: and see the patents: </li></ul></ul></ul>15. Expansively Transparent Parallelism and Scalability 2- <http://h71028.www7.hp.com/ERC/downloads/ZLEARCWP.pdf> <http://h71028.www7.hp.com/ERC/downloads/uploads_casestudy_ZLHOMSECWP.pdf> Enabling a zero latency enterprise <http://www.google.com/patents?id=SXcSAAAAEBAJ&dq=6,757,689> Framework, architecture, method and system for reducing latency of business operations of an enterprise <http://www.google.com/patents?id=1BwVAAAAEBAJ&dq=6,954,757>
  62. 62. <ul><li>An optimal RDBMS would have all these features combined into a facility with ‘minimum latency enhancement’ or MLE , with the addition of the features which are described in the following sets of slides on continuous database, for virtual commit by name, either in a datacenter or for geographic replication, supporting distributed database, scaling inwards, work flow based on publish and subscribe interfaces, etc. </li></ul>15. Expansively Transparent Parallelism and Scalability 2-
  63. 63. 16. Continuous Database needs virtual commit by name <ul><li>What is continuous database? </li></ul><ul><ul><li>High availability database is 5.5-7 nines of availability, without sacrificing data integrity, ACID, parallelism, transparency, scalability, and scalable replication for transparent disaster recovery </li></ul></ul><ul><ul><li>Continuous database has the joint hardware availability of two mirrored disk drives (IMHO), which is quoted at 12 nines of availability (Highleyman) , without sacrificing data integrity, ACID, parallelism, transparency, scalability, and scalable replication for transparent disaster recovery: one could also refer to continuous database as ‘massive availability’ </li></ul></ul>2-
  64. 64. 16. Continuous Database needs virtual commit by name <ul><li>To accomplish continuous database requires: </li></ul><ul><ul><li>Completely online software and hardware repair </li></ul></ul><ul><ul><li>Coordinated and uncoordinated takeover with complete application transparency, unbreakable and near-perfect integrity of database serialization </li></ul></ul><ul><ul><li>Vastly increased unit MTBF (mean time between failures) and especially the MTBF for the double failure cases causing transactional cluster failures – and for all components – everything must become ultra-reliable </li></ul></ul>2-
  65. 65. 16. Continuous Database needs virtual commit by name <ul><li>To accomplish continuous database requires (continued): </li></ul><ul><ul><li>Vastly reduced unit MTR (mean time to repair) and especially reducing the MTR for double failure outages is really beneficial: </li></ul></ul><ul><ul><ul><li>As an example, the double failure cluster outage of a Nonstop system has an availability quoted at 5 ½ nines or 0.999995 (Highleyman), because of the immobility of criticality one (system critical) process pairs </li></ul></ul></ul><ul><ul><ul><li>Avail = (MTBF / (MTBF + MTR)), solving for MTR gives MTR = MTBF ((1 / Avail) – 1), solving for MTBF gives MTBF = MTR * Avail / (1 - Avail) </li></ul></ul></ul>2-
  66. 66. 16. Continuous Database needs virtual commit by name <ul><li>To accomplish continuous database requires (continued): </li></ul><ul><ul><li>Vastly reduced MTR (continued): </li></ul></ul><ul><ul><ul><li>Plugging in Nonstop availability and the typical 30 minutes it takes to do the volume recoveries after a Nonstop cluster failure (MTR), you get a double failure MTBF of about 11 ½ years </li></ul></ul></ul><ul><ul><ul><li>What if we improved the MTR? Taking only 30 seconds to reload the cluster and get the first critical data volumes up (reinstating the locks from the log RM checkpoint and running backout outside of crash recovery) and then starting the transaction system up early to do business on the critical applications would give us an availability of 0.9999999 or 7 nines, which is a huge improvement over 5 ½ nines </li></ul></ul></ul>2-
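The arithmetic in the last two slides can be checked directly, using the formulas given above (Avail = MTBF / (MTBF + MTR), hence MTBF = MTR * Avail / (1 - Avail)):

```python
# Checking the availability arithmetic from the slides.
def mtbf_from(avail, mtr_minutes):
    # Rearranged from Avail = MTBF / (MTBF + MTR)
    return mtr_minutes * avail / (1.0 - avail)

def avail_from(mtbf_minutes, mtr_minutes):
    return mtbf_minutes / (mtbf_minutes + mtr_minutes)

# 5.5 nines with the typical 30-minute volume-recovery MTR:
avail_5_5_nines = 0.999995
mtbf = mtbf_from(avail_5_5_nines, mtr_minutes=30)
years = mtbf / (60 * 24 * 365)
# -> ~11.4 years between double-failure outages, as quoted

# Same failure rate, but a 30-second MTR (0.5 minutes):
improved = avail_from(mtbf, mtr_minutes=0.5)
# -> ~0.9999999, i.e. seven nines, from repair speed alone
```

Note what changed: the failures are no rarer, only the repair is faster, which is the whole argument for attacking MTR first.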
  67. 67. 16. Continuous Database needs virtual commit by name <ul><li>To accomplish continuous database requires (continued): </li></ul><ul><ul><li>Vastly reduced MTR (continued): </li></ul></ul><ul><ul><ul><li>Finally, an MTR of 30 seconds or less is considered below the threshold of noticeable in the online world: if you can get an MTR due to an occasional failure below 30 seconds, then that failure has been effectively masked </li></ul></ul></ul><ul><ul><li>Multi-path access to the enterprise storage network that can survive any potential multi-mode failure </li></ul></ul>2-
  68. 68. 16. Continuous Database needs virtual commit by name <ul><li>To accomplish continuous database requires (continued): </li></ul><ul><ul><li>Independent routing of the duplex communications network from the multi-path enterprise storage network – we mustn’t have a case where a single digging machine can cut through one cable trench and take it all down – overall physical path analysis is crucial </li></ul></ul><ul><ul><li>Minimal MTR on everything from components on up to major subsystems </li></ul></ul><ul><li>These assembled features could then (with a lot of software work) be made into a continuous database (with ‘massive availability’) </li></ul>2-
  69. 69. 16. Continuous Database needs virtual commit by name <ul><li>In a talk at Tandem Nonstop in the late 1980’s, Jim Gray explained the three ways of identification (addressing) in a network (from the least to the most pleasant): </li></ul><ul><ul><li>By path : (from an ancient 1983 Usenet post requesting information about BITNET): utzoo!linus!decvax!tektronix!ogcvax!omsvax!hplabs!sri-unix!JoeSventek@BRL-VGR.ARPA,j@lbl-csam </li></ul></ul><ul><ul><li>By location : ucbvax!lbl-csam!j </li></ul></ul><ul><ul><li>By name : Dr. Joe Sventek </li></ul></ul><ul><li>Jim Gray said that identifying servers and sending messages by name was the holy grail of distributed computing </li></ul>2-
  70. 70. 16. Continuous Database needs virtual commit by name <ul><li>As we will see later when we get to replication, there are no really great distributed database replication systems for disaster recovery: the very best (Nonstop RDF) must take over from all the nodes in a network to all the backup nodes at once: and that’s one big takeover !!! </li></ul><ul><li>This seems like a sledgehammer, when a ball peen hammer would do </li></ul><ul><li>The problem that rules this all or nothing strategy: where’s the parent ? </li></ul>2-
  71. 71. 16. Continuous Database needs virtual commit by name <ul><li>In multi-phase distributed commit coordination, after the prepare phase where everyone votes yes, there is a commit phase where the parent commit coordinator tells everyone about the commit/abort decision: when the parent commit coordinators go down, all the prepared children commit coordinators have to wait forever , holding locks so as to isolate the commit or abort of the database changes </li></ul><ul><li>This is why two-phase commit has such a bad reputation, and why three-phase commit, which synchronizes the commit decision with a backup commit coordinator, is so wonderful: the parent commit coordinator doesn’t disappear on you , and you can always find the backup parent after the primary parent goes down, but this solution does not work for disaster recovery, because the cluster holding both the primary and backup commit coordinators is simply gone </li></ul>2-
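A toy model of the blocking problem (not any product's actual protocol): a participant that has voted yes can release its locks only when some coordinator that knows the decision is reachable.

```python
# Toy illustration of why a prepared 2PC participant blocks: once it has
# voted yes, it must hold its locks until some coordinator -- primary or
# backup -- can deliver the commit/abort outcome.
def participant_after_prepare(reachable_coordinators):
    """Return the outcome, or 'BLOCKED (holding locks)' if no coordinator
    that knows the decision can be reached."""
    for coord in reachable_coordinators:
        if coord.get("decision") is not None:
            return coord["decision"]       # commit or abort: locks released
    return "BLOCKED (holding locks)"       # in-doubt: must wait, possibly forever

primary = {"decision": "commit"}
backup  = {"decision": "commit"}           # synchronized before the commit phase

# Two-phase commit, primary dies, no backup exists:
assert participant_after_prepare([]) == "BLOCKED (holding locks)"
# Three-phase style: the backup coordinator answers, so no blocking...
assert participant_after_prepare([backup]) == "commit"
# ...unless the cluster holding BOTH primary and backup is gone (disaster),
# which is exactly the gap virtual commit by name is meant to close.
```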
  72. 72. 16. Continuous Database needs virtual commit by name <ul><li>The currently available distributed and replicated disaster recovery product (Nonstop RDF), has a novel solution for all the prepared nodes that were sharing transactions with the old parent node(s) that died and must now find the parent(s) to determine commit or abort – they must commit suicide with the dead parent(s) </li></ul><ul><li>Then they can all be replaced as an entire family group, by a whole new set of backup systems that are uninfected by transactions that might look to coordinate commit with any parents on the dead side: only new children that can find all of their new parents remain , and as draconian as this solution appears to be, it does work </li></ul>2-
  73. 73. 16. Continuous Database needs virtual commit by name <ul><li>So you see network topologies for disjoint replication (where distributed transactions are not supported) that can be as complex as you want, but have a lot of application work to do after a single-node takeover … or you see network topologies for joint replication that are mirrored and mass-suicidal to guarantee no orphans </li></ul><ul><li>What is needed is virtual commit coordination by name , and it should be wired in to the DNA of the transaction system: physical cluster names and virtual cluster names should have equal status in commit, but virtual commit coordinators inhabit physical clusters (and possess physical commit coordinators ) and can move around, so that their transactional parents, siblings (replication) and children can always find them to coordinate commit and find out the past results of commit coordination </li></ul>2-
  74. 74. 16. Continuous Database needs virtual commit by name <ul><li>The Nonstop people architected this for clusters in the late 1980’s, but the legacy ball and chain kept them from making the requisite changes: it was called ‘Virtual Systems’ , and had a lot of transparency, maybe even too much transparency </li></ul><ul><li>The transparency of ‘Virtual Systems’ was accomplished by extending the scope of process pairs so that a virtual system contained nearly everything that a physical system had, and the transactional file system had to be able to discover the physical location of any component of a virtual system, at any time, any place where the virtual system had moved </li></ul>2-
  75. 75. 16. Continuous Database needs virtual commit by name <ul><li>In Nonstop, process pairs can have the same name, and carry on a server conversation with a client process and when one side of the server fails, the resend of the failed message to the old primary is routed to the new primary/old backup (this is called transparent retry) </li></ul><ul><li>For ‘Virtual Systems’ , this meant every element on the primary node had a potential match (and impedance mismatch) on the backup node, even printers (it was just simply overkill) </li></ul><ul><li>The right answer was to just do virtual commit, only for transactions, which desperately need a movable commit coordinator, and movable RMs (resource managers), and that’s all </li></ul>2-
  76. 76. 16. Continuous Database needs virtual commit by name <ul><li>That limited approach was architected, but never implemented, for a project by the transaction management (TMF) group at Nonstop, called RDF2 , such that multiple ‘RDF Groups’ (the ‘virtual’ moniker had lost its cachet) could simultaneously occupy a physical node and support replication to multiple target nodes: however, there was little transparency and the takeover notification method was thereby clunky, because they dialed back too far on systematizing the virtual transparency </li></ul><ul><li>The clunky way may or may not work ok, but I think we can do better, and handle this more systematically for an optimal RDBMS, by implementing virtual commit by name </li></ul>2-
  77. 77. 16. Continuous Database needs virtual commit by name <ul><li>An optimal RDBMS would accomplish virtual commit through reliable groups of clusters built on higher level glupdate/heartbeat/poison pill/regroup for massive clustering: </li></ul><ul><ul><li>Glupdate is the Nonstop method of atomically maintaining the state of a cluster of computers, with more solid reliability and messaging efficiency than the quorum consensus+Thomas write rule+Lamport timestamp type solutions or other networking state management protocols, but for smaller cluster size (say, <= 25), which is managed by a computer of the cluster called the cluster locker </li></ul></ul><ul><ul><li>When 1/sec heartbeat messages are missed between cluster computers, a poison pill gets sent to make sure the dead or autistic computer doesn’t belatedly pipe up with some distaff comment, unless a regroup two-round communication protocol revives his status </li></ul></ul>2-
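The heartbeat/regroup/poison-pill sequence can be sketched as a small decision function (the 1/sec heartbeat comes from the slide; the miss threshold is an illustrative assumption):

```python
# Toy sketch of heartbeat/regroup/poison pill: a computer that misses
# heartbeats gets a regroup second chance; only after that expires is
# the poison pill sent, so it cannot "belatedly pipe up" as a zombie.
HEARTBEAT_PERIOD = 1.0        # 1/sec "I'm Alive" messages (from the slide)
MISS_THRESHOLD = 2 * HEARTBEAT_PERIOD   # assumed: two missed beats = suspect

def judge(last_heartbeat, now, regroup_passed):
    missed = (now - last_heartbeat) > MISS_THRESHOLD
    if not missed:
        return "alive"
    if not regroup_passed:
        return "suspect: run regroup"   # the two-round second chance
    return "send poison pill"           # declared down for good

assert judge(0.0, 0.5, regroup_passed=False) == "alive"
assert judge(0.0, 3.0, regroup_passed=False) == "suspect: run regroup"
assert judge(0.0, 3.0, regroup_passed=True)  == "send poison pill"
```

At the group level, as the next slides explain, the "poison pill" for a whole cluster is instead a session teardown plus sequence-number bump.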
  78. 78. 16. Continuous Database needs virtual commit by name <ul><li>So, for higher glupdate on a group of clusters, there would be a group locker cluster, which atomically maintains the globally-named list of physically-located clusters in the order of succession ( group locker first, first successor after he dies, etc.) </li></ul><ul><li>The group locker cluster also would maintain the state of where named virtual commit coordinators (and their co-located RMs) are physically located, which cluster is currently primary for each virtual commit coordinator, and other minimal global state that doesn’t change too rapidly (secured in some rational way from operator error with a proper command interface) </li></ul>2-
  79. 79. 16. Continuous Database needs virtual commit by name <ul><li>During the group glupdate , as each cluster receives the transmitted group state change, that cluster will itself propagate the state change by a subordinate cluster glupdate (under the computer in that cluster who is the cluster locker ), so that every computer in every cluster in the group will see the results of the group glupdate, in the same way that every cluster glupdate has visibility </li></ul><ul><li>There are clearly differences between group and cluster glupdates: </li></ul><ul><ul><li>A poison pill is the cure for a single computer in a cluster that is hung, or has failed, or is simply not talking - and after the regroup second chance has been declared truly dead: because the granularity of failure of ‘computer’ is tolerable if a mistake is made </li></ul></ul>2-
  80. 80. 16. Continuous Database needs virtual commit by name <ul><li>Differences between group and cluster glupdates (continued): </li></ul><ul><ul><li>For the failure granularity of ‘cluster’ in a group, a poison pill is too extreme : so after regroup, the group locker will tear down the session (that’s the poison pill at this level), increment the session sequence number and start a new session (accompanied by a group glupdate) </li></ul></ul><ul><ul><li>To communicate across clusters, the cluster to cluster messaging layer (called ‘Expand’ on Nonstop) would then reject any message with an obsolete sequence number , which would require the rejoiner to get a new session from the group locker </li></ul></ul><ul><ul><li>If the rejoiner came in from a reboot of the cluster, he would rejoin in the group with the standard initialization and reintegration after a crash </li></ul></ul>2-
81. 16. Continuous Database needs virtual commit by name <ul><li>Differences between group and cluster glupdates (continued): </li></ul><ul><ul><li>If the rejoiner is reintegrating (and if this is allowed), then reintegration will occur subsystem by subsystem, after the group locker reintegrates the rejoiner by sending him the group state and receiving and transmitting his local piece of the group state (an example, a little later) </li></ul></ul><ul><li>In this way, the transactional file system, which is the implementer of transparency in the transaction system software, will have everything it needs to determine what state the overall group and the local cluster have, and remote cluster state is handled in the normal way, by message system error handling (as in Nonstop ‘Encompass’ over ‘Expand’) </li></ul>
82. 16. Continuous Database needs virtual commit by name <ul><li>Subsequently, when a named virtual transaction (naming the virtual commit coordinator) is distributed and then prepared to a physical child cluster and the virtual transaction parent’s physical cluster dies… as the surviving physical child cluster’s commit coordinator tries to find out the commit/abort result for the transaction, he will know where to send for the answer immediately after the virtual commit coordinator takeover group glupdate and subordinate cluster glupdate are completed on his cluster </li></ul><ul><li>Until then (briefly), he will get an error or two </li></ul>
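The surviving child's behavior above is essentially a lookup-with-retry against the coordinator directory that the takeover glupdate fills in. A minimal sketch, with a hypothetical directory and coordinator interface:

```python
def resolve_outcome(directory, coordinator_name, txn_id, attempts=3):
    """Sketch: a surviving child cluster's commit coordinator asks for
    the commit/abort result of a prepared virtual transaction. Until the
    takeover glupdate lands on this cluster, the named virtual commit
    coordinator has no known physical home and the lookup errors out;
    afterwards the directory names the new home and the outcome can be
    fetched. All names here are illustrative."""
    for _ in range(attempts):
        coordinator = directory.get(coordinator_name)
        if coordinator is not None:
            return coordinator.outcome(txn_id)  # "commit" or "abort"
    raise LookupError("coordinator location unknown (glupdate pending)")
```

The "error or two" in the slide corresponds to the `LookupError` path before the directory is updated.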
83. 16. Continuous Database needs virtual commit by name <ul><li>So, now that we can imagine doing virtual commit by name, how does that get us increased availability? </li></ul><ul><li>When clusters of computers are using enterprise storage (Fibre Channel, iSCSI, ESCON) instead of attached storage, then when a cluster goes down due to a double failure, another cluster could take over and execute using his enterprise storage, given that preparations are made in advance for this: if things are set up and correctly executed, that takeover can be transparent; one half of this we have now imagined, with virtual commit by name </li></ul>
84. 16. Continuous Database needs virtual commit by name <ul><li>If virtual commit coordination can be performed on disks in an enterprise storage (ESS) facility, then the six or more nines of availability for an optimal RDBMS transactional cluster in a group of clusters could become 12 or more nines of availability from the very same group of clusters, which now does virtual commit on a physical cluster and takeover of that cluster’s ESS database on a backup cluster executing with the same named virtual commit coordinator and its attached named resource managers (RMs) </li></ul><ul><li>HP Nonstop developed an architecture for this kind of ESS cluster takeover in the ISC (Indestructible Scalable Computer) </li></ul>
85. 16. Continuous Database needs virtual commit by name <ul><li>The ISC would, however, require extending the three-phase commit of the Nonstop TM (called TMF) to a fourth phase: another node’s TMF taking over the log and all of its partitions on enterprise storage, and then continuing with the RMs (DP2) that have not crashed, and quickly recovering crashed RMs with locks acquired from the last two log checkpoints for seamless continuous database </li></ul><ul><li>That extension could theoretically take the Nonstop transaction service to 12 nines of availability, see this patent: </li></ul> Transaction processing apparatus and method <http://www.google.com/patents?id=jB5_AAAAEBAJ&dq=7,168,001>
86. 16. Continuous Database needs virtual commit by name <ul><li>That patent (which bears my name as primary author) was completed after I left </li></ul><ul><li>It has two large errors; the first is the glaring omission of either virtual commit by name, or the RDF2 less-virtual style of transferring commit, without either of which (or some other solution) the invention is fairly worthless (that’s not a criticism, it’s just missing) </li></ul><ul><li>The second error is more critical, but only relates to the use of virtual commit in scalable joint replication, and that missing logic will be obvious in that section </li></ul><ul><li>At the center of the availability method is the LSG or log storage group: back in the scalability slides, the RMs were attached to the log partitions receiving their update log records, which were attached to the log root receiving the transaction state log records from the TM commit coordinator during group commit </li></ul>
87. 16. Continuous Database needs virtual commit by name <ul><li>Basic scalability was accomplished by only force writing the log root, while discerning through careful accounting of VSNs on the RM side, and log pointers on the log partition side, that mostly streaming the log partitions (where the vast majority of log writing occurs) is sufficient, except where we did see from VSNs and log pointers that we would need to force a flush of the log partition’s writing buffer at commit time (and if any information was lost by an RM or log partition failure/takeover, we would force flushing for a little while) </li></ul><ul><li>A log storage group (LSG) would then be a log partition, which receives log records from a set of attached RMs, all of which are contained on an ESS (enterprise storage system) </li></ul>
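The commit-time decision described above, stream normally versus force a flush, reduces to a simple predicate over the VSN accounting. A minimal sketch, with parameter names that are assumptions rather than the real TMF terms:

```python
def must_force_flush(rm_high_vsn, partition_durable_vsn, lossy_mode):
    """Sketch of the commit-time flush decision: the log root is always
    force written, but a log partition's write buffer is only forced when
    the RM-side volume sequence numbers (VSNs) show updates beyond what
    the partition's log pointers say is durable, or while in the
    temporary force-flushing mode entered after information was lost in
    an RM or log partition failure/takeover."""
    return lossy_mode or rm_high_vsn > partition_durable_vsn
```

In steady state the predicate is almost always false, which is what makes streaming the log partitions cheap.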
88. 16. Continuous Database needs virtual commit by name <ul><li>So the entire merged log for one transactional cluster on an ESS can be viewed as consisting of a log root and one or more LSGs: after the cluster fails, the LSGs could be dispersed to other clusters (that were attached to the same ESS, probably in the same datacenter) to load-balance the takeover of the cluster out to the larger group of clusters </li></ul><ul><li>After the cluster failure, the log root itself, containing the transaction state records, would be shared on the ESS in read-only mode, so that the LSG takeover clusters could determine the outcome of commit and abort, hold locks for the transactions in prepared state until commit is resolved, and abort the active and preparing transactions </li></ul>
89. 16. Continuous Database needs virtual commit by name <ul><li>The only thing that needs to be set up to allow takeover on the backup cluster after a total failure of the primary is to wed the notion of virtual commit by name with the notion of the LSG </li></ul><ul><li>This is done by assigning the primary cluster log partition of the LSG and all of its RMs to a named virtual commit coordinator, and also assigning that to the backup cluster </li></ul><ul><li>Then it becomes the job of a Takeover process on the backup cluster to manage the series of tasks that must be completed to do an ESS datacenter takeover, which is different from a potentially geographically dispersed replication takeover </li></ul><ul><li>All of that is missing from the patent and would need to be added to get twelve nines for HP Nonstop, so I will describe that in-between-RDF2-and-virtual-systems approach </li></ul>
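The setup step in the bullets above, binding an LSG and its RMs to a named virtual commit coordinator on both the primary and the backup cluster, can be sketched as follows. All structures and keys are hypothetical:

```python
def configure_lsg_takeover(lsg, coordinator_name, primary, backup):
    """Sketch of wedding virtual commit by name to the LSG: bind the
    LSG's log partition and all of its RMs to a named virtual commit
    coordinator, and register that same binding on both the primary and
    the backup cluster, so a Takeover process on the backup has what it
    needs to drive an ESS datacenter takeover. Illustrative names only."""
    binding = {
        "coordinator": coordinator_name,
        "log_partition": lsg["log_partition"],
        "rms": list(lsg["rms"]),
    }
    # The identical binding is known to both clusters in advance.
    primary.setdefault("coordinators", {})[coordinator_name] = binding
    backup.setdefault("coordinators", {})[coordinator_name] = binding
    return binding
```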
90. 16. Continuous Database needs virtual commit by name <ul><li>During the steady state or normal functioning of the ISC system for Nonstop, the physical TM (TMP) process pair on the virtual primary and the physical TM on the backup should be kept in sync by messages, as to the named virtual commit coordinator they share as siblings </li></ul><ul><li>There should be a similarity of this TM sibling protocol to the parent-child distributed commit protocol: at its outset, a named virtual transaction is effectively distributed in nature between the siblings </li></ul><ul><li>The difference between the parent-child and sibling relationships is that parent-child relationships for transactions that have not gone fully prepared cause an abort if the network sessions between the TMs get torn down: this should not occur for sibling TMs </li></ul>
91. 16. Continuous Database needs virtual commit by name <ul><li>This means that the backup TM sibling needs to catch up after a network partition, and might be a little behind at the point of disaster: this problem is solved by the primary TM sibling noticing the partitioning of virtual commit and setting a flag for this particular sibling relationship in successive group commit writes and TM checkpoints, later clearing that flag when the network is made whole and the immediacy of the TM sibling relationship is restored </li></ul><ul><li>After a failure of the primary TM sibling pair and takeover by the backup TM siblings, the last group commit write in the log will contain the flag for this relationship; if it indicates partitioning, then all transactions that can be aborted (those not prepared or committed) should be, by a TM crash recovery, otherwise this takeover could resemble a normal TM process pair takeover for the TM on a cluster </li></ul>
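The takeover decision in the second bullet is a two-way branch on the partitioning flag recorded in the last group commit write. A minimal sketch, not the real TMF logic:

```python
def takeover_action(last_group_commit_flags, sibling):
    """Sketch: on backup-sibling takeover, inspect the partitioning flag
    recorded for this sibling relationship in the last group commit
    write. If it indicates partitioning, the backup may be behind, so
    fall back to TM crash recovery (aborting everything not prepared or
    committed); otherwise the siblings were in sync and this resembles a
    normal TM process pair takeover. Names are illustrative."""
    if last_group_commit_flags.get(sibling, False):
        return "crash_recovery"
    return "process_pair_takeover"
```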
92. 16. Continuous Database needs virtual commit by name <ul><li>Similar flag handling is described for RM siblings in the patent; if they were in contact at the time of the takeover, a normal takeover is indicated in the LSG checkpoint buffer writes: so the RM protocol then becomes CAB-CANB-WAL-WDV: </li></ul><ul><ul><li>Checkpoint ahead buffer (to local backup RM) </li></ul></ul><ul><ul><li>Checkpoint ahead network backup (to remote sibling RM), which sibling may be partitioned off </li></ul></ul><ul><ul><li>Write ahead log, flagging the sibling partitioning </li></ul></ul><ul><ul><li>Write data volume, lazy as always (but not too lazy) </li></ul></ul><ul><li>On an RM sibling takeover, if the flag is set in the last WAL write, it’s a minimum latency crash recovery, otherwise it’s like a normal RM takeover on one cluster (a few seconds) </li></ul>
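The CAB-CANB-WAL-WDV ordering above can be sketched as a pipeline where each stage is a callable standing in for the real I/O; the interfaces are hypothetical:

```python
def rm_write_pipeline(record, local_backup, remote_sibling, log, volume):
    """Sketch of the CAB-CANB-WAL-WDV ordering for one RM update.
    The remote sibling stage may fail because that sibling is
    partitioned off; the write-ahead-log stage records that fact as a
    flag, so a later takeover knows whether a normal takeover or a
    minimum latency crash recovery is required."""
    local_backup(record)                      # CAB: checkpoint ahead buffer
    partitioned = not remote_sibling(record)  # CANB: network backup (may fail)
    log(record, sibling_partitioned=partitioned)  # WAL: flag the partitioning
    volume(record)                            # WDV: lazy data-volume write
    return partitioned
```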
93. 16. Continuous Database needs virtual commit by name <ul><li>Finally, it comes down to the Takeover process executing for the backup TM siblings trying to become primary and make the database available again, quickly </li></ul><ul><li>The group locker, after a second round of regroup declares a cluster offline/down, can automatically, or after a notification and receipt of an operator command, start the takeover for all clusters holding new primary named virtual commit coordinators due to the outage of the lost cluster </li></ul><ul><li>This sends a takeover message to the physical TMs on all clusters, who will now know where the new parents are for commit coordination: they knew theoretically before, but the takeover had not been initiated </li></ul>
94. 16. Continuous Database needs virtual commit by name <ul><li>Upon receipt of a takeover message, a physical TM, which has now become responsible as the primary named virtual commit coordinator, will write his log root to make the takeover durable </li></ul><ul><li>The lost cluster’s TM, if/when he comes back online, will discover that he lost the regroup vote, and inform his subsystems to stop doing any transaction database work (until the takeover is complete) </li></ul><ul><li>The new virtual commit coordinator TM primary will start a datacenter master to do the rest of the takeover tasks (as opposed to a replication master: this is a binary role and fixed state of a named virtual commit coordinator) </li></ul>
95. 16. Continuous Database needs virtual commit by name <ul><li>The datacenter master will immediately exercise a deep ESS functionality (different for every ESS) to grab control away from the old primary physical system (it may still be writing) for the log root disks and any LSGs (log partition disks and attached RM data volume disks) that are now the responsibility of this cluster (the old log root may have already been pulled away by another takeover cluster; if so, it is now in shared read mode) </li></ul><ul><li>The datacenter master accesses the log root from the old primary physical system on the ESS to create a list of transactions (much like the TM does on a crash restart) and then restarts the RMs, which can simply take over if up-to-date, or do minimum latency lock acquisition crash recovery on those RMs which lost connection </li></ul>
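The datacenter master's task order above, fence the disks, scan the log root, restart the RMs, can be sketched end to end. Every interface here is hypothetical; real ESS fencing is vendor-specific, as the slide notes:

```python
def datacenter_takeover(ess, log_root, lsgs):
    """Sketch of the datacenter master on an ESS takeover: first fence
    the failed primary away from the log root and LSG disks, then scan
    the log root for the transaction list (much like a TM crash restart),
    then restart each RM: an immediate takeover if it is up-to-date, or
    a minimum latency lock-acquisition crash recovery otherwise."""
    ess.fence(log_root)          # grab control away from the old primary
    for lsg in lsgs:
        ess.fence(lsg)           # log partition + attached data volumes
    txns = log_root.scan_transactions()
    for lsg in lsgs:
        for rm in lsg.rms:
            if rm.up_to_date:
                rm.takeover()
            else:
                rm.crash_recover(txns)  # reacquire locks, then open
    return txns
```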
96. 16. Continuous Database needs virtual commit by name <ul><li>As each named virtual commit coordinator comes back online and reports in (this is almost immediate, because the log root only contains TM log records, and only the last two TM checkpoints have to be traversed), the group locker flags that takeover, and then those transactions and transaction services are available </li></ul><ul><li>As each named virtual RM comes back online and reports in (this is immediate for the up-to-date RMs, and a couple of minutes for the rest, although sizable backout runs will be doing aborts on those), the group locker will flag their takeover and they are up for business </li></ul><ul><li>An optimal RDBMS would do this very same thing, in a slightly different way, to reduce the number of cluster failures requiring another cluster to take over </li></ul>
97. 16. Continuous Database needs virtual commit by name <ul><li>Nonstop ISC clusters have to take over if the bound process pair processors fail, because process pairs have to be located in specified processors and nowhere else, which forces their cluster fault tolerance to remain at five and a half nines, when it could be approaching seven nines (according to a talk given by Highleyman at Nonstop Software in 2001, I believe) </li></ul><ul><li>What’s interesting about this is that either (1) allowing floating backup cpus, or (2) allowing RM crash recovery to acquire locks from the last two RM checkpoints in the log partitions gets you seven nines of availability on a Nonstop system: if you had both, you would get nine nines of availability, and then the datacenter takeover would theoretically give you eighteen nines, were it not for the physical limit of two hard disks reducing that to twelve nines of availability </li></ul>
98. 16. Continuous Database needs virtual commit by name <ul><li>An optimal RDBMS would not have this limitation, and hence it could take over in any computer in the cluster, and retry takeover in another if the first attempt didn’t succeed: so with lock reinstatement, basic clusters would have nine nines, and groups of clusters with ESS and LSGs would have twelve nines (the limit for the hardware availability of a perfectly managed pair of disks), but replicated systems (disk quads) could have a theoretical limit of eighteen nines, and that’s a lot of nines </li></ul>
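The nines arithmetic running through these slides (nine plus nine giving eighteen, six plus six giving twelve) rests on an independence assumption: if two components fail independently with probabilities 10^-a and 10^-b, both are down at once with probability 10^-(a+b), so the nines add. A quick check of that arithmetic:

```python
import math

def combined_nines(nines_a, nines_b):
    """Back-of-the-envelope check of how nines combine under
    independence: the probability that two independently failing
    components are down simultaneously is the product of their failure
    probabilities, so the nines add. Real components share failure
    modes (power, operators, software), so treat this as an optimistic
    bound, as the slides' 'theoretical limit' wording suggests."""
    p_both_down = (10.0 ** -nines_a) * (10.0 ** -nines_b)
    return -math.log10(p_both_down)
```

For example, two independent nine-nines clusters yield eighteen nines, which is the theoretical limit quoted for disk quads.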
