Techno Arms Dealers & High-Frequency Traders

Today is the hottest time to be in financial markets: nanosecond response times, the ability to affect global markets in real time, and lucrative spot deals in dark pools are all the rage. For companies that do business in these times, it is a technical arms race worthy of a Reagan-era analogy.

With High-Frequency Trading firms locked into an effective “Space Race”, the challenges for these firms are now far-reaching, extending beyond traditional regulatory, compliance, and government boundaries.

Regulatory requirements still have to be met, yet serious fines for non-compliance, and even enforceable undertakings by third parties to halt trading activities on markets, are still outweighed by the potential upside for the combatant firms playing in the race.

Increasingly, the most marginal of technical errors can spell doom for market participants. In a market where risk is ever-present and often measured in millions of dollars, glitches are a regular occurrence, resulting in lost revenue, disappointed customers, and the fast destruction of once high-profile market leaders.
Recently, this was brought to the public’s attention by the spectacular failure of Knight Capital: in August 2012, erroneous trades were sent to the New York Stock Exchange, wiping out nearly 60% of the firm’s value in under one hour.

The firm’s catastrophe has forced a change of attitude among investors and corporate technology leadership, with a new focus on compliance controls and board-level accountability. Tiny lapses in controls are expensive mistakes that disrupt markets. Add the immense losses and liability suits that often trail such events, and the stakes are higher than ever to develop software in a controlled way and get it to market in the shortest time possible.

With regulatory changes imminent, the need for clearer, actionable reporting at all levels of technology organizations calls for a better approach than the traditional ones taken in the past.

The Landscape of Failure:

In the last two years alone, there have been numerous incidents of technology misconfiguration that led markets awry. Institutional investors aside, weaknesses in the mechanisms that govern software development for brokerage firms and markets have far-reaching and damaging consequences. From ill-prepared recovery protocols to poorly governed front, middle, and back offices, several noteworthy incidents in recent times have led to greater scrutiny of trading companies.

November 2012 – NYSE Euronext

A newly implemented market matching engine, UTP (Universal Trading Platform, the core trading platform employed by the NYSE), caused a day-long disruption and forced the Big Board operator to establish closing prices for more than 200 stocks by falling back to its old system, Super Display Book (sDBK).
Trading never resumed during the day for the 216 stocks affected. The exchange determined the official closing price for each affected security from a consolidated reading of last-sale prices instead of the auction system normally used to close stocks, and manual intervention was required to revalidate positions for venues and participants.
Overview of the Root Cause: Poor Testing/Quality Assurance/Release Management Failure.

2007 – 2010 London Stock Exchange (LSE): Multiple Outages & the Move to Linux

Over a four-year period, the London Stock Exchange earned a reputation as the most unreliable exchange in the market, as a raft of technology problems manifested in regular outages. In fact, the LSE ultimately had to move its entire operating stack to a new platform and institute a set of new, mature processes to achieve the kind of reliability it needed.

August 2012 – Knight Capital

In the span of 45 minutes, a little over four hundred million dollars was lost when an algorithmic trading program designed for testing environments was released into the firm’s production environment. The blunder led to a seventy-five percent dip in the stock price in a 30-minute period before attempts to salvage the situation could be initiated. The error involved HFT (High-Frequency Trading) in up to 140 stocks, and is just the latest in a string of such errors.

The Root Cause: Poor Configuration Management, Inconsistent Testing Approach, Poor Release Management

But How?

Most brokerages apply several layers of risk mitigation when developing and deploying software. I’ll give a high-level overview of a traditional approach in another post (I won’t go into the details of settlement, vetting, market matching, etc.). Trading firms are complex beasts, with multiple market participants, multiple exchanges, and a plethora of investment instruments to use, and going into detail on the actual technologies detracts from the message. What is apparent is that the process life-cycles used to achieve releases are governed by mechanisms from a different time and place, with varying, inconsistent controls that were not designed for rapid release schedules, leaving gaps in organizational capabilities that are open to failure.

The “New Old Ways” to Manage These Problems:

Application Lifecycle Management (ALM), a recent play, is typically a means of ensuring that software remains relevant. A vital aspect of the Software Development Life-Cycle (SDLC), ALM is an integral part of ensuring that firms overcome the challenges of developing top-notch software at a fast pace. The new-wave premise of ALM follows a design, build, run mentality and pushes the paradigm to encompass all activities in the development cycle under one roof, whereas previous approaches often stitched together different best-of-breed solutions.

The benefits of this for trading systems are clear. Greater visibility and consistency between tools means more bugs get fixed and, ultimately, fewer glitches. The unfortunate reality is that underlying configurations are still not maintained well under this approach, and a configuration error would not necessarily have been caught by the tools of traditional ALM vendors.

ITIL is a widely accepted approach to IT service management in these organizations. An ITIL-enabled process centers on what is called a Configuration Management Database (CMDB), which contains all information pertaining to an information system. It helps the organization identify and comprehend the relationships between system-level components and applications, and it is designed to track relationships between technology services and, at a micro level, items called CIs (Configuration Items). This process is known as configuration management, but as it typically lives in the operational part of the equation (Application Support, Infrastructure Operations & Service Management), it usually only gets invoked at a high level in the pre-production environments.
There is another discipline called Software Configuration Management, which has applicable components in both ITIL and ALM; however, the tools and processes rarely meet, as the disciplines are very much either software or infrastructure oriented.

The conceptual CMDB enables configuration items to be specified and controlled in a systematic and detailed manner, reducing configuration drift. As mentioned previously, problems with this approach manifest in the ITIL world, because the CMDB typically does not converge with the version control repositories used in the development life-cycle, and more often than not is not version controlled itself, leaving further inconsistencies.

Okay Okay We Get That, So What Went Wrong at Knight?

Basically, Knight accidentally released the simulation software it used to verify that its market-making software functioned properly into the NYSE’s live system.

Within Knight Capital’s development environments lived a software program called a “market simulator”, designed to send spread patterns of buy and sell orders to its counterpart market matching software, called RLP in this case. The trade executions were recorded and potentially used for performance validation prior to new releases of the market matching software. This is probably how they stress tested how well their new market-making software worked under load before deploying it to the live system connected to the NYSE.

Prior to August 1st, a number of teams would have progressively migrated software between environments for release into the “live” environment. Potentially, a manual step in the deployment pushed a copy of the simulation software into “live”.
Most companies do not employ baseline configuration tests in the later environment stages, so (probably at a later stage in the process) someone opted to add the program to the release package and deployed it.

This is exacerbated in large teams, and is simply an overhang of the fact that typically no one team owns the configuration state of both the applications and the operating systems/platforms they run on. The closest is usually the systems administration team, but as they have a production environment to manage, these “lesser” environments get sidelined in favor of more important problems.

Combined with the fact that very few tools actually focus on configuration testing, so people fall back on collections of scripts or home-brew solutions, it is easy to see where this went wrong.
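To make the idea of a baseline configuration test concrete, here is a minimal sketch in Python. It compares the contents of a release package against an approved baseline of file fingerprints and reports anything unexpected, missing, or modified. The file names and contents are hypothetical, chosen only for illustration; nothing here reflects Knight's actual release tooling.

```python
import hashlib


def fingerprint(content: bytes) -> str:
    """Return a SHA-256 hex digest identifying a file's exact content."""
    return hashlib.sha256(content).hexdigest()


def check_against_baseline(
    baseline: dict[str, str], package: dict[str, bytes]
) -> dict[str, list[str]]:
    """Compare a release package (name -> content) with a baseline (name -> digest)."""
    actual = {name: fingerprint(data) for name, data in package.items()}
    shared = set(actual) & set(baseline)
    return {
        "unexpected": sorted(set(actual) - set(baseline)),
        "missing": sorted(set(baseline) - set(actual)),
        "modified": sorted(n for n in shared if actual[n] != baseline[n]),
    }


# Hypothetical approved release contents (illustrative names only).
approved = {
    "market_maker.bin": b"market-making build 1.4",
    "venues.cfg": b"route=NYSE",
}
baseline = {name: fingerprint(data) for name, data in approved.items()}

# A candidate package into which a test-only program has slipped.
candidate = dict(approved)
candidate["market_simulator.bin"] = b"test-only order generator"

report = check_against_baseline(baseline, candidate)
print(report)
```

Run as a gate before each promotion between environments, a check of this kind would flag `market_simulator.bin` as unexpected before the package ever reached the live environment.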
The lack of a well-defined configuration baseline, and of a set of configuration tests that account for the differences between environments, is the likely cause of the problem (well, from an outsider’s perspective).

On the morning of August 1st, the release was successfully deployed, and the simulator inadvertently bundled with it was ready to do its job: exercise the market-making software.

This time, however, it was no longer in one of the test environments. It was executing live trades on the market, with real orders and real dollars.

For stocks where Knight was the only one running market-making software as an RLP, and the simulator was the only algo trading that crossed the bid/ask spread, we saw consistent buy and sell patterns of trade executions, all marked regular, all from the NYSE, and all occurring at prices just above the bid or just below the ask.

Examples include EXC and NOK. The simulator was functioning just as it did in the test environments, and Knight’s market-making software was intercepting these orders and executing them. Knight’s net loss on these trades was minor, because its own software sat on both sides of each trade, but it was generating a lot of wash sales.

For stocks where Knight was not the only market-maker, or where other algorithmic trading software was actively trading (and crossing the bid/ask spread), some or all of the orders sent by the simulator were executed by someone other than Knight, and Knight now had a position in the stock, meaning it could have been making or losing money. The patterns generated for these stocks depended greatly on the activity of the other players.

Because the simulator was indiscriminately buying at the ask and selling at the bid, and because bid/ask spreads were very wide during the open, we now understand why many stocks moved violently at that time.
The simulator was simply hitting the bid or offer, and the side it hit first determined whether the stock opened sharply up or down.
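To put rough numbers on the effect of crossing the spread: buying at the ask and selling at the bid gives up the full spread on every round trip whenever someone other than Knight takes the other side. The sketch below works through that arithmetic with made-up prices and volumes; they are assumptions for illustration, not Knight's actual figures.

```python
from decimal import Decimal


def round_trip_loss(bid: Decimal, ask: Decimal, shares: int) -> Decimal:
    """Loss from buying `shares` at the ask and selling them straight back at the bid."""
    if ask < bid:
        raise ValueError("ask must not be below bid")
    return (ask - bid) * shares


# Hypothetical quote with a deliberately wide 10-cent spread.
loss_per_trip = round_trip_loss(Decimal("10.00"), Decimal("10.10"), shares=100)

# Hypothetical activity: 1,000 such round trips over the morning.
total_loss = loss_per_trip * 1000

print(loss_per_trip)
print(total_loss)
```

Small per-trade losses like this compound quickly when orders are sent in waves across many stocks, which is why wide spreads at the open made the damage so much worse.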
Since the simulator didn’t think it was dealing with real dollars, it didn’t have to keep track of its net position. Its job was to send buy and sell orders in waves across pre-defined positions.

This explains why Knight didn’t know right away that it was losing a lot of money. They didn’t even know the simulator was running.

When they realized they had a problem, the first suspect was likely the new market-making software. We think the two periods when there was a sudden drop in trading (9:48 and 9:52 AM) are when they restarted the system. Once it came back, the simulator, being part of the package, fired up and continued trading positions.

Finally, just moments before a news release at 10 AM, someone found and killed the simulator.

We can fully appreciate the nightmare their team must have experienced that morning: a lack of visibility, inconsistent sources of what was actually running in production, and no reliable way to tell what the “successful” release had actually deployed.

Regulated Controls Against Flash Crashes

Knight Capital was once THE retail market-maker in the US; like those that came before it, its reputation has now been irreparably damaged. It’s prudent to note that the error was largely avoidable, had the relevant controls been put in place.

Several factors played into this scenario, namely:
- Poor configuration management
- A set of loose controls around the release management process within the firm
- A lack of visibility into the makeup of the changes that were being introduced into the market
- An inability to isolate the configurations that were deployed
- A lack of configuration testing
- A lack of operational acceptance testing

Automated Governance is the Way Forward

DevOps, a recent answer to the challenges of collaboration across the release cycle, stresses the seamless integration of software development and collaboration between IT teams, with a view towards enabling the rapid rollout of products via automated release mechanisms. It recognizes the existing gap between activities considered part of the development life-cycle and those characterized as operational activities. Historically, the separation of development and operations has manifested itself as a form of conflict, one that ultimately predisposes entire systems to errors, as the sheer number of frameworks developed to address the problem makes clear.

What’s currently lacking in each approach is a mechanism to gather systems knowledge in environments where skills and capabilities vary significantly between teams.

For orchestration and deployment, Puppet, Chef, BladeLogic, and Electric Cloud go a long way towards improving upon the existing configuration components of ALM models, but often neglect the interaction with ITIL. Puppet has been making strides in recent months with integrations into tools of this nature. Yet the existing suites of tools require specific knowledge of declarative domain-specific languages to enable a user to describe system resources and their state.
In the case of Puppet, discovering system information and compiling it into a usable format is possible, but it is a daunting task for a novice user in these fast-paced corporate environments.

Over time, heavily regulated environments governed by strict auditing requirements will need a validation mechanism that can be clearly maintained and used across the varying capability levels of an organization, so that configuration drift between environments is caught early and reported back.

Increasingly smart automation will be deployed to ensure that state is forcefully maintained through testing, recording, and safe auto-provisioning. This is a unique means of peer-based systems configuration, and a preventive measure that catches configuration errors before they affect running systems, that very few companies are experimenting with (aka configuration-aware systems).
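As a rough illustration of catching configuration drift between environments, the Python sketch below compares two snapshots of configuration state and reports every key whose value differs or is absent on one side. The environment names, keys, and values are all hypothetical; a real system would gather this state from the hosts themselves.

```python
from typing import Optional


def find_drift(
    reference: dict[str, str], target: dict[str, str]
) -> dict[str, tuple[Optional[str], Optional[str]]]:
    """Return {key: (reference_value, target_value)} for every differing key.

    A value of None means the key is absent from that environment.
    """
    drift = {}
    for key in sorted(set(reference) | set(target)):
        ref_value = reference.get(key)
        target_value = target.get(key)
        if ref_value != target_value:
            drift[key] = (ref_value, target_value)
    return drift


# Hypothetical snapshots of two environments (illustrative keys and values).
uat = {
    "app.version": "1.4",
    "os.kernel": "2.6.32",
    "simulator.enabled": "false",
}
production = {
    "app.version": "1.4",
    "os.kernel": "2.6.18",
    "simulator.enabled": "true",
}

drift = find_drift(uat, production)
for key, (expected, actual) in drift.items():
    print(f"DRIFT {key}: expected {expected!r}, found {actual!r}")
```

Reporting the result back on every run, rather than only at release time, is what lets drift be caught early instead of being discovered in production.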
Our own tool, ScriptRock, complements the existing workflow tools and offers the simplest way for Developers, Configuration Managers & Systems Administrators to gain real-time validation of configuration state. It enables the creation and running of configuration tests, collaborative configurations for teams, and the creation of detailed documents that act as reports to satisfy audit standards, with a robust community option coming in the next few months. Applying ScriptRock to these environments ensures fast process maturity for developing seamless system configurations and requires no new syntax or code; everything is available as a version-controlled test that can be executed under strict security contexts on the target system.

Governing Bodies

The Knight crisis is not an isolated event. However, it has been treated as a rallying call for greater visibility into the processes and compliance measures implemented within trading participants.

With the increasing complexity of trading algorithms, which are the backbone of trading procedures, the necessity of controls to govern these technology organizations is becoming more apparent each day.

Mary Schapiro, the outgoing Chair of the SEC, called for a review of the SEC’s automation review policies, which were put in place with exchanges after the 1987 market crash and require venues to notify the regulator of trading failures or security lapses. Portions of those policies will serve as the basis for the new rules.

The implementation of a powerful trading platform rests on many pillars. The remarkable effectiveness of these platforms has led firms to rely on legacy solutions to cope with the rapid release schedules they now face in order to stay recognized as leading systems. This comes at a cost: the increased pressure to deliver innovation has exposed these systems and, more importantly, the processes and tools that govern them, to the risk of failure when glitches occur.
As a result, the concepts outlined in DevOps are clearly necessary in order to continue delivering key features and key components of financial markets. Proper execution will help avert crises such as the Knight fiasco in future.
The comprehension and adoption of the various frameworks, the integration of IT automation, and clear governance of development and operational environments will go a long way toward ensuring that a fiasco such as the Knight crisis remains a problem of the past, never to be replicated. Unfortunately, we still have a long way to go on this journey.