Rv11

Runtime Analysis in the Cloud:Challenges and OpportunitiesWolfgang GrieskampSoftware Engineer, Google Corp

About Me < 2000: Formal Methods Research Europe (TU Berlin)2000-2006: Microsoft Research: Languages and Model-Based Testing Tools2007-2011: Microsoft Windows Interoperability Program: protocol testing and tools Since 4/2011: Google: Google+ platform and toolsDISCLAIMER: This talk does not necessarily represent Google’s opinion or direction.

Content of this talkGeneral blah blah about cloud computingMonitoring the CloudTesting the CloudUsing the Cloud for DevelopmentA formal method guy’s dream…Conclusion

What is Cloud Computing?From Wikipedia, the free encyclopediaCloud computing is the delivery of computing as a service rather than a product, whereby shared resources, software and information are provided to computers and other devices as a utility (like the electricity grid) over a network (typically the Internet).

What is Cloud Computing?From Wikipedia

Some Properties of Cloud ComputingIt is location independent

It has a unified access control model

It can scale-up or scale-down on demand

Its failsafehttp://www.economist.com/node/17797794?story_id=17797794

IAAS (Infrastructure As A Service)Basic building blocksStorageNetworkingComputationHomogenous, easy migration (based on VMs)Distributed Data Centers over Geographical ZonesPlayers: Amazon, GoGrid, Rackspace, Microsoft, GoogleEstimated revenue 2010: $1b (Source: Economist)

Platform As A Service (PAAS)Basic Building BlocksOperating System Frameworks and Development ToolsDeployment and Monitoring ToolsPlayers: Microsoft, Google, IBM, SAP, …Estimated Revenue 2010: $311m (Source: Economist)

Software As A Service (SAAS)Device and location independent applications, typically running in a browser (Email, Social, Retail, Enterprise Apps, etc.)Many different playersEstimated revenue 2010: $11.7b (Source: Economist)

Monitoring vs RV vs TestingWhat’s the difference?A (strict) take:Monitoring collects and presents information for human analysisRuntime verification collects and transforms information for automated analysis which ultimately leads to a verdictTesting does the above things in an isolated, staged or mocked, environment. In particular, stimuli from the environment are simulated. In practice, boundaries are not so clear. For this talk RV = Monitoring (adapting to Google conventions)

Anatomy of a Data CenterData Center AData Center B ……ControllerServerControllerServerServerServerServer…StorageStorageStorageStorageStorageNote: abstracted and simplified

Anatomy of a ServerData Center AData Center B ……ControllerServer (VM)ControllerServerControllerJobJobJobServerServerServerServer…MonitorMonitorMonitorStorageStorageStorageStorageLogsAlertNote: abstracted and simplified

Anatomy of a ServiceData Center AData Center B ……ControllerService (across Servers)JobJobServerControllerJobJobJobServerServerServerServer…JobStorageStorageStorageStorageStorageStorageNote: abstracted and simplified

Monitoring Types @ GoogleBlack Box MonitoringWhite Box MonitoringLog Analysis

Black Box MonitoringJobMonitorFrequently send requests and analyze the response Possible because server jobs are ‘stateless’ and always input enabledIf failure rate over a certain time interval exceeds a given ratio, raise an alert and page an engineerEngineers aim for minimizing paging and avoiding false positives

Black Box Monitoring: How its done @ GoogleThere are rule based languages for defining request/responses. Each rule:Synthesizes an HTTP requestAnalyzes the response using a regular expressionSpecifies frequency and allowed failure ratioRules are like tests: a simple trigger and a simple response analysis Monitors can be also custom codeJobMonitor

Black Box Monitoring: Issues?Is the ‘stateless’ hypothesis feasible?Nothing is really stateless -- state is passed as parameters in cookies, continuation tokens, etc.However, as these are health tests, state can be ignoredWhat is the relation to testing?In theory very similar, only that the environment is not mocked. In practice uses quite different frameworks/languages What about service/system level monitoring?Its only about one job. Doesn’t give failure root cause (it only measures a symptom)JobMonitor

ChallengeIntegrate Black-Box Monitoring and TestingJobMonitorBlack-box monitoring can be seen as particular way of executing tests end-to-end on the live product such that the impact on performance can be neglected.Frameworks which integrate design and execution of monitoring rules and test cases are promisingMainly an engineering challenge

JobJobChallengeSystem/Service Level Black-Box MonitoringMonitorMonitorNot commonly done Main purpose would be failure cause analysis and failure preventionSimple local monitoring already discovers failures Is there a strong point of doing it at runtime (vs. log analysis)?Only if real-time prevention and potentially repair is important Monitor

ChallengeProtocolcontract verificationJobMonitorAt Google, all communication between jobs happens via a single homogeneous RPC mechanism based on a message format definition language (called protocol buffers)Also all (= terra bytes) of data is stored in formats specified by protocol buffersOne could formulate data invariant and protocol sequencing contracts over protocol buffers and enforce them at runtime

White-Box MonitoringServer exports collection of probe points (variables)Memory, # RPCs, # Failures, etc.Monitor collects time series of those values and computes functions over themDashboards prepare information graphicallyMostly used for diagnosis by humansJobMonitor

White-Box Monitoring:How its done @ GoogleJobMonitorDeclarative language for time series computationsCollects samples from the server by memory scrapingMerging of similar data from multiple servers running the same jobRich support for diagram rendering in the browser

White-Box Monitoring:Issues?Design for monitorability/testability?Its already ubiquitous throughout, since software engineers are themselves on-call…Distributed collection/network load?Not really an issue because it’s sample basedRelation to testing?Same as with black-box – should be a common framework.Automatic root cause analysis and self-repair?Current systems mostly build for human analysis and repair. Self-repair would be a big thing.JobMonitor

ChallengeSelf-repairJobMonitorCloud system are homogenous and operate with redundancyMany VMs with exactly the same propertiesSelf-repair could identify and ‘drain’ faulty parts of the system, apply fallbacks, rollback software updates, etc.One major cause of cloud failures are outagesAnother major cause are software updates A semi-automated approach suggesting actions to a human would be already very usefulEver got paged at 2am in the morning?

ChallengeHybrid White-Box Monitoring / RVFoundationsJobMonitorThe data collected in white-box monitoring represents continuous and often stochastic functions over time. The triggers for discrete actions (like alerts) are thresholds over integrated values of those functions.Sounds like hybrid systems/automaton. Has anybody looked at it like this in RV community?

Log AnalysisJobLogsCollect data from each server’s run containing information like operation flow, exceptions, etc.Store data over a window of time (say for last 24h)Access data from various sources programmatically to analyze issues (post-mortem, performance, etc.)Allows for correlation of system/service wide information

Log Analysis:How its done @ GoogleJobLogsVery fined grained logging on job side; huge amounts of data collectedLogs are stored in Bigtable (Google’s large-scale storage solution)Logs are analyzed using parallel (cloud) computing, e.g. with Sawzall, a declarative language based on map-reduceLogs are most often used for failure cause analysis/debugging

Log Analysis:Issues?JobLogsAmount of data and accessibility?Not really an issue because of highly performing distributed file systemsFormat of the data?Logs are structured data (at Google, protocol buffers)Encryption?A big issue: if it can’t be decrypted, not much may be diagnosable. If it can be decrypted, the access to this now decrypted data needs to be restricted.

ChallengePrivacy and EncryptionJobMonitorData logged (or otherwise analyzed) during monitoring may contain encrypted proprietary informationA problem may not be diagnosable without decryptionDecrypted clear text data (in particular if logged) needs to be highly protectedAutomatic obfuscation and/or anonymization would be highly desirableA protocol may need to be designed for this in the first place

ChallengeIntegration TestingJobJobJobStorageTwo or more components are plugged together with a partially mocked environmentThese tests are usually very ‘flaky’ (unreliable) because:Difficulty to construct mocked component’s precise behavior (its more than a simple mock in a unit test)Difficulty to synthesize mocked component’s initial state (it may have a complex state)Potential solution: model-based testing

Model-Based Testing in a NutshellRequirementsFeedback AuthorModelFeedback Feedback GenerateTest SuiteIssueExpected Outputs (Test Oracle)Inputs(Test Sequences)VerdictControl Observe Feedback Implementation

Technical Document Testing Program of Windows: A Success Story for MBT222 protocols/technical documents tested22,847 pages studied and converted into requirements36,875 testable requirements identified and converted into test assertions69% tested using MBT31% tested using traditional test automation66,962 person days (250+ years)Hyderabad: 250 test engineersBeijing: 100 test engineers38

Comparison MBT vs TraditionalIn % of total effort per requirement, normalizing individual vendor performance

Vendor 2 modeled 85% of all test suites, performing relatively much better than Vendor 139Grieskamp et. al: Model-based quality assurance of protocol documentation: tools and methodology. Softw. Test., Verif. Reliab. 21(1): 55-71 (2011)

Exploiting the Cloud for Development

Idle Resources Peak demand problem: as with other utilities, the cloud must have capacity to deal with peak times: 7am, 7pm, etc.Huge amounts of idle computing resources available in the DCs outside of those peak timesLiterally hundreds of VMs may be available for a single engineer on a low-priority job baseGame changer for software development toolsUsing the Cloud for Dev @ GoogleDistributed/parallel buildEvery engineer can build all of Google’s code + third party open source code in a matter of minutes (sequential build would take days)Works by constructing the dependency graph than using map/reduce technologyDistributed/parallel testChanges on the code base are continuously tested against all dependent targets once submittedFailures can be tracked down very precisely to the given change which have introduced themCheck out http://google-engtools.blogspot.com/ for details

Consequences for Testing, Program Analysis, etc. Need to rethink the base assumptions of some of the existing approaches for testing and program analysis for massive coarse-grained parallelism:Early divide-and-conquer ideal, e.g. initial random seed than run-till-end and collect and compareTry different heuristics on the same problem; see which one winsTechniques like SMT and concolic execution can largely benefit from this

A formal method guy’s dream…

Rv11

More Related Content

What's hot

Similar to Rv11

Recently uploaded

Rv11

Editor's Notes