Do it in-production-seth_eliot_2013_03
 

  • Chronologically left to right. Experience is in software services.
    Testing Planet links:
    The future of software testing, Part Three – Cloud. July 10, 2012. http://www.thetestingplanet.com/2012/07/july-2012-issue-8/
    The future of software testing, Part Two – TestOps. http://www.thetestingplanet.com/2012/03/march-2012-issue-7/
    The future of software testing, Part One – Testing in production. The Testing Planet, November 2011.
    I also did a mind map: Testing in Production Mindmap, August 6, 2012, for Ministry of Testing (Software Testing Club). http://www.ministryoftesting.com/2012/08/mindmap-testing-in-production/
  • Good book – I recommend it.
    Is there something the assembled crowd here might be interested in measuring?
    CLICK: Yes, quality!
    So this is how I define testing.
    This INCLUDES classic pre-prod test case execution, and it will necessarily include more than classic test case execution.
  • Data Driven Decision Making (D3M) is about the first definition: measurement.
    Data Driven Validation is about the second definition: testing.
    This talk is about TiP, but TiP is only one form of Data-Driven Validation.
    CLICK: TiP leverages real users, because we cannot know what all users will do...
    CLICK: ...and actual production, because production is a dangerous and chaotic place...
    ...in a risk-mitigated way, to reduce uncertainty about the quality of your software.
  • Let’s dive in with an example.
    Ben was not someone I had followed. CLICK: show re-tweet.
    The tweet: being from MSFT, this caught my attention.
    Likely IE6... even MSFT is running away from IE6.
    Is it cost-effective to keep that XP environment around, with IE6?
    And how about every other OS and browser in the world? The matrix gets huge.
    Wouldn’t it be great to answer the question "What are your users actually using?" and understand how your product works with them?
  • Instead of a huge matrix, you can use production to get the data you need on end-to-end performance under real operating conditions.
    In this case: PLT (page load time) for Outlook.com (Hotmail at the time), from millions of actual users.
    Get data on every OS, browser, geographic location, or data center used, instead of testing a huge matrix in the lab.
    They identified and remedied performance bottlenecks.
    They use JSI (JavaScript Instrumentation).
    This is Big Data. **** Who’s heard of Big Data? (Transitions to the definition on the next slide.)
    -------------------
    Not just a PLT, but a round trip for everything; data you can’t get in a lab:
      Public internet
      Load balancers
      LAN switches
      Partner services
    This is (old) data from Hotmail (now Outlook.com). Based on this and similar measurements they identified and remedied performance bottlenecks, such as upstream bandwidth constraints, by using more caching and static images.
  • The previous example makes use of Big Data.
    So while not all of our Data-Driven Validation needs to be Big Data, it is worthwhile understanding what Big Data is.
    The 3 V’s: Volume, Velocity, Variety.
    A 4th V, Value: what’s the value? Efficient quality assessment.
    ------------------------------------------
    Ultimately it is about Big Insights. Again Hubbard: when you have high uncertainty, you need very little data to make an impactful reduction in it.
    3 V’s: http://radar.oreilly.com/2012/01/what-is-big-data.html
    Volume
      Cannot be handled by a conventional RDBMS.
      SQL Server maxes out at 16 TB.
      The entire web was 0.5 ZB (2009); probably about 1-2 ZB today.
      Richard Wray (2009-05-18). "Internet data heads for 500bn gigabytes". The Guardian. http://www.guardian.co.uk/business/2009/may/18/digital-content-expansion
    Velocity
      Everything’s instrumented.
      Speed of feedback is important.
      IBM, "The Road": could you cross a busy road with just a snapshot (not live data)? http://vimeo.com/20718357
      Batch vs. stream.
      Partial analysis: http://research.microsoft.com/apps/video/default.aspx?id=163222
    Variety
      Structured: DB.
      Unstructured: tweets.
      How about XML? One good rule of thumb is if the data structure (or lack thereof) is not sufficient for the processing task at hand, then it is unstructured.
  • I mentioned Twitter in the previous slide; here is how Twitter data can be used.
    This is an internal Microsoft tool. Public tools exist to do similar things.
    ******* Turn the tweet stream into actionable metrics.
    Sentiment is positive 2:1; there was a spike in certain topics around TechFest (MSFT R&D showcase).
    Can be used to find bugs too: version-over-version issues.
    -----------------------------
    The ambient data of the web / social can be used.
    Data sources: Twitter, blogs, news, forums, Facebook... but mostly Twitter.
    Sentiment, timeliness (TechFest was in March), quality signals.
    Bugs? Timeline = new version? Certain phrases?
    Notes:
      Sentiment: almost everything has a large neutral frequency. Positive greater than 2:1 over negative is good.
      SDK and Kinect for Windows had a boost in early March: Microsoft TechFest (R&D showcase).
      “Kinect Fusion” creates a detailed 3-D rendering of your environment.
      This technology may help you find bugs: certain phrases may indicate one, as may a rapid change in sentiment with a new release.
      Other technologies that mine Microsoft’s customer support data can also be used to find issues with a released product.
  • Data-Driven Validation is bigger than just TiP.
    There is lots of good Data-Driven Validation prior to production too.
    For any system of sufficient scale, only production looks like production.
    Data center pics: ideal (lab) versus reality (production).
  • You cannot find this bug pre-prod.
    Would you test walking directions between A and B for every combination in the world?
    It is trivial to find in production, with the right telemetry.
    Google can know when this happens in prod and report it.
    CLICK: Google knows that this route may be missing sidewalks!
    Remember: only in production do you find:
      The true diversity of real users and usage
      The true complexity of the production environment
    -----------------------------------------
    “Find this with a unit test…” – James Whittaker. http://www.youtube.com/watch?v=cqwXUTjcabs&feature=BF&list=PL1242F05D3EA83AB1&index=16
  • Let’s look at an example from Facebook.
    Facebook uses open source monitoring software like Ganglia.
    Using Hadoop, which we will talk about later, they developed an internally produced ODS: persistent and accurate.
      System metrics (CPU, memory, IO, network)
      Application metrics (web, DB, caches)
      Facebook metrics (usage, revenue)
    They claim to collect 5 million metrics. They are a bit dodgy about what this specifically means, but it is passive validation at scale.
    ------------------------------------
    Nagios: ping testing, ssh testing. This is active validation.
    Refs:
    Ganglia, ODS: Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ
    Picture: FB Prineville data center: http://www.facebook.com/prinevilleDataCenter/
  • But "5 million metrics" is a bit ambiguous. I understand it to mean the number of different metrics collected multiplied by the number of servers they are collected on.
    Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ
  • So how does Facebook use their 5 million metrics to assess quality?
    Let’s refer to a Quora answer and blog post from a FB engineer that discuss this.
    CLICK: How is FB like Gondor?
    Boromir: “Gondor has no king, Gondor needs no king.” “Facebook has no testers, Facebook needs no testers.”
    CLICK: What does FB actually do then? (Refer to slide.)
    So am I mocking this or promoting it as a valid practice?
    CLICK: Well, both really... it depends on your business requirements.
    The FB engineer in question said... (refer to slide).
    ****** FB uses TiP only; they just throw it in production.
    We’re all pretty familiar with FB’s “quality”. If your quality needs to be higher, then this approach does not work.
    -----------------------------------
    “A lot of cross talk between Dev and QA… it’s pretty slow… let’s get rid of it”
    “Our engineers write, debug, and test their own code”
    “We expose real traffic to these services”
    Engineers need to be there every step of the way:
      On the IRC channel when deploying
      Aggressively log and audit
    “5 million metrics” can find:
      Problems at scale
      Broken features for a significant percent of users
    Refs:
    Cook, Tom. A Day in the Life of Facebook Operations. Velocity 2010. [Online] June 2010. http://www.youtube.com/watch?v=T-Xr_PJdNmQ
    http://www.zdnet.com/blog/facebook/why-facebook-doesnt-have-or-need-testers/7191
    http://www.quora.com/Is-it-true-that-Facebook-has-no-testers - Evan Priestley, Facebook engineer from 2007-2011
  • Blue = Developer. Purple = Tester.
    We’ve seen FB just “throw it in production,” and that is part of their business decision. But most teams will not choose to do this.
    This is a simplified model of the test life cycle. I call this the BUFT model (Big Up-Front Testing). I presume this looks familiar to most of you.
    CLICK: So then maybe we add TiP. We still have BUFT, and now the testers have that much more to do!
    CLICK: So we need to adjust the model. This is just one possible way to do it:
      Devs take on more up-front testing: focus on functional and code quality at the COMPONENT level (Test can help with strategy).
      Test focuses on integrated service quality (Dev can help with implementation: testability in production).
    ****** Rule of thumb: you should not find bugs at one stage that could have been found in an earlier stage.
    ---------------------------
    Other notes:
      “Instrument Everything” is from FB: http://www.youtube.com/watch?v=T-Xr_PJdNmQ
      Metrics and optics give you access to the data stream.
      TDD is a better way to build in quality.
      No, do NOT just throw it in production. TiP should be part of a continuous test strategy.
      But you may want to reduce UFT (Up-Front Testing): from BUFT to UFT + TiP.
  • The examples I have shown thus far are types of Passive Validation.
    Passive Validation is very valuable; do not be fooled by the name.
    Another type of Data-Driven Validation is Active Validation.
    Synthetic transactions will be very familiar: test cases are synthetic transactions (Tx).
    Let’s look at some examples.
    -----------------------------------------------------------
    Passive Validation
      Looks a lot like what we would call monitoring.
      Operational intelligence, like availability and performance.
      Business intelligence (BI) tells us where the user is going. This is crucial knowledge for a quality strategy: we always have to make hard decisions on what to test, and this answers that.
      BI can also indicate bugs, for example if usage drops off when no user-facing change has been made.
    Active Validation
      This looks a lot like the testing we do today.
      Synthetic transactions = test cases.
      If we do this in production, testing → active monitoring.
      Availability = Is it there? = a successful Tx, regardless of result.
      Reliability = Does it work? = a Tx without error.
      Performance = How long does it take?
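The three Active Validation measures above can be sketched in a few lines. This is only an illustration under stated assumptions: the `TxResult` record, its field names, and the choice to compute reliability over completed transactions are hypothetical, not taken from the talk or from any Microsoft tool.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TxResult:
    """Outcome of one synthetic transaction (hypothetical record shape)."""
    completed: bool   # the service answered at all
    error: bool       # the answer was an error or otherwise wrong
    seconds: float    # elapsed time for the transaction


def summarize(results):
    """Turn repeated pass/fail runs into availability, reliability, performance."""
    completed = [r for r in results if r.completed]
    clean = [r for r in completed if not r.error]
    return {
        # Availability: is it there? Any completed Tx counts, regardless of result.
        "availability": len(completed) / len(results),
        # Reliability: does it work? Share of completed Tx that had no error
        # (using completed Tx as the denominator is an assumption).
        "reliability": len(clean) / len(completed) if completed else 0.0,
        # Performance: how long does it take? Median over completed Tx.
        "median_seconds": median(r.seconds for r in completed) if completed else None,
    }


runs = [
    TxResult(True, False, 0.2),
    TxResult(True, True, 0.4),
    TxResult(True, False, 0.3),
    TxResult(False, False, 0.0),   # timed out: the service was not there
]
print(summarize(runs))
```

Running the same synthetic transaction on a schedule and feeding the results through something like `summarize` is what turns a pass/fail test case into a trend that can be compared release over release.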
  • Visuals are from “Office Service Pulse”, a dashboard of many metrics from active validation.
    A specific example is Exchange Online, a hosted service that provides email, calendar, and contacts management.
    They wanted to re-use existing on-prem tests.
    They developed an execution framework running from Azure, Microsoft’s cloud platform.
    Availability = Is it there? = a successful Tx, regardless of result. Run repeatedly to turn pass/fail into availability/non-availability.
    Performance = How long does it take? Run repeatedly and time the Tx to trend it historically. Especially useful release over release.
    -----------------------------------------------
    “We quite simply had to figure out how to simultaneously test a server and a service. How do we pull our existing rich cache of test automation into the services space?”
    For server (on-prem):
      5000 machines in test labs
      70,000 automated test cases run multiple times a day on these machines
    Reuse and extend our existing infrastructure.
    Exchange will remain one codebase.
    We are one team and will not have a separate service engineering team or service operations team.
    Solution: TiP. Run tests from Azure. You get:
      Availability
      Performance
    Ref: Experiences of Test Automation; Dorothy Graham; Jan 2012; ISBN 0321754069; Chapter: “Moving to the Cloud: The Evolution of TiP, Continuous Regression Testing in Production”; Ken Johnston, Felix Deschamps
  • Another example, one that does not quite look like test cases: Operational Fault Injection.
    This is another type of Active Validation.
    It injects synthetic faults:
      To disrupt service operation
      To test system fault tolerance (assuming the system was designed to be fault tolerant!)
    Chaos Monkey: April 2011 example.
    Simian Army: June 2012 example.
    Amazon Game Day.
    --------------------------------------
    Other notes:
    Netflix is a streaming video service hosted on the Amazon AWS cloud. Available in both North and South America, the Caribbean, United Kingdom, Ireland, Sweden, Denmark, Norway, Finland.
    Chaos Monkey → Simian Army
      It started with their "Chaos Monkey", a script deployed to randomly kill instances and services within their production architecture. The name comes from the idea of unleashing a wild monkey with a weapon in your data center (or cloud region) to randomly shoot down instances and chew through cables.
      Then they took the concept further with other jobs with similar goals. Latency Monkey induces artificial delays, Conformity Monkey finds instances that don’t adhere to best practices and shuts them down, Janitor Monkey searches for unused resources and disposes of them.
      April 2011 outage: stayed up.
      June 2012 outage: Chaos Gorilla should have prepared them to survive an outage, but did not. Chaos Gorilla, the Simian Army member tasked with simulating the loss of an availability zone, was built for exactly this purpose. This outage highlighted the need for additional tools and use cases for both Chaos Gorilla and other parts of the Simian Army.
    Amazon Game Day
      Entire DC taken down.
      Announced in advance, but few services opt out.
      Service owners are alert, but mostly not worried: Amazon services are designed for this.
    Refs for Chaos Monkey:
      http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
      http://techblog.netflix.com/2011/07/netflix-simian-army.html
      June 2012 outage: http://techblog.netflix.com/2012/07/lessons-netflix-learned-from-aws-storm.html
    Refs for Amazon Game Day: there really aren’t any, but this post mentions it: http://devops.com/2011/03/08/
  • Continuing our theme of monkeys...
    Fault injection has some obvious risks, but even less intrusive synthetic transactions carry risks.
    The monkey story (below) illustrates some risks of synthetics on:
      Business metrics and reporting
      Changing “the shape” of production data
    -----------------------------------
    Other risks of synthetics:
      Service operation
      Direct user experience
      Partner services
      Security
      Cost
    http://thedailywtf.com/Articles/Ive-Got-The-Monkey-Now.aspx
    1999 was a big year for Harvard Business School Publishing. In the past few years, they had seen their business model – selling books, journals, articles, case studies, and so forth – transform from being entirely catalogue-based to largely web-based, and it had finally come time for a major re-launch of their website.
    HBSP’s new website was slick. On top of a fairly advanced search system, the re-designed site also featured community forums and a section called “Ideas @ Work”, which let users download audio broadcasts from influential business thinkers from around the world. And best of all, despite the rapid development schedule, scope creep, and all of the new bells and whistles, the new site actually worked. In the height of the dot-com era, not too many other sites could claim the same.
    One key contributor to the success of Harvard Business School Publishing’s new website was its extensive testing and QA. Analysts developed all sorts of test cases to cover virtually every aspect of the site. They worked closely with HBSP’s logistics department to make sure the tests – searching, fulfillment, account management, etc. – were run. And not just run, but run often. This aggressive testing strategy ensured that the site would function as intended for years to come.
    That is, until that one day in 2002. On that day, one of the test cases failed: the “Single Result Search.” The “Single Result Search” test case was part of a trio of cases designed to test the system’s search logic.
Like the “Zero Result Search” case, which had the tester enter a term like “asdfasdf” to produce no results, and the “Many Results Search” case, which had the tester enter a term like “management” to produce pages of results, the “Single Result Search” case had the tester enter a term – specifically, “monkey” – to verify that the system would return exactly one result. And for three years, “monkey” returned exactly one result: Who's Got the Monkey? (full article text) by William Oncken Jr. Written in 1974 Oncken’s article is for managers who “find themselves running out of time while their subordinates are running out of work.” As for the monkeys, they’re just an analogy for work, not who managers should outsource work to. Apparently, Oncken wasn’t that ahead of his time. In any case, on that day in 2002, the “monkey” search returned two results. The first, as expected, was Who's Got the Monkey?. The second result was something to the effect of Who’s Got The Monkey Now?, which was an update to HBSP’s run-away best seller, Oncken’s 1974 Who's Got the Monkey?. It seemed obvious: the “Single Result Search” test case just needed to be updated. But then they looked into the matter a bit further. As part of the aggressive testing strategy mentioned earlier, the HBSP logistics team would fill their down time by executing test cases. First they’d run through the “Zero Result Search” test, then the “Many Result Search” test, then the “Single Result Search”. Then they’d add that single result – Who’s Got the Monkey? – to their shopping cart, create an new account, submit the order, and then fulfill it. Of course, they didn’t actually fulfill it – everyone knew that orders for “Mr. Test Test” and “123 Test St.” were not to be filled. That is, everyone except the marketing department. When HBSP’s marketing department analyzed the sales trends, they noticed a rather interesting trend. Oncken’s 1974 Who's Got the Monkey? was a run-away best seller! 
And like any marketing department would, they took the story and ran. HBSP created pamphlets and other distillations of the paper. They even repackaged those little plastic cocktail monkeys as official “Who’s Got the Monkey monkeys”. And finally, sometime in 2002, the updated version of Who’s Got the Monkey? was posted to HBSP, which was then picked up by the searching system, which, in turn, caused the “Single Result Search” test case to fail. Of course, by this point, there was little anyone could do. The fictional success of Who’s Got the Monkey had already been widely publicized as reality. And with all the subsequent write-ups (many of which are still around to this day), it may have very well become a best-seller. Needless to say, HBSP has since changed their aggressive testing policy. Some details of the story have been redacted to protect the guilty. Thanks to the two anonymous sources working at HBSP for the inside scoop, and news archives for the rest.
  • More risks...
    Xbox story
    Amazon story
    Mitigations: data tagging, filtering, and clean-up
    Xbox
      Obviously a negative experience, as the user is confused and may think they have been charged (they were not).
      This was only a handful of users. Xbox has implemented clever mitigations, such as only using UUIDs outside the range used by valid Xbox users.
    Amazon
      This is a negative user experience because the user is trying to find actual items to purchase, an intent not served by exposing test data as shown in this example. The exposure of such data ironically creates a sense of "immaturity" or lack of quality. The poor experience becomes worse if a user purchases such an item. It may be reasonable to have such test data on the site transiently, but it should be removed after testing is complete.
    Mitigations include:
      Data tagging
      Data cleaning
      Data filtering
      Transaction stubbing
      Transaction pre-validation
      Transaction throttling
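Data tagging and data filtering, two of the mitigations listed above, can be sketched as follows. The `synthetic` field name and the order records are hypothetical; the point is only that a synthetic order carries a marker from the moment it is created, and that business reporting reads through a filter that drops marked records.

```python
SYNTHETIC_TAG = "synthetic"  # hypothetical field name, not from the talk


def tag_synthetic(order):
    """Data tagging: mark an order as test traffic when it is created."""
    return {**order, SYNTHETIC_TAG: True}


def real_orders(orders):
    """Data filtering: keep only orders that are not tagged as synthetic."""
    return [o for o in orders if not o.get(SYNTHETIC_TAG, False)]


def units_sold(orders, title):
    """A sales report that cannot be inflated by test purchases."""
    return sum(o["qty"] for o in real_orders(orders) if o["title"] == title)


orders = [
    {"title": "Who's Got the Monkey?", "qty": 1},                 # a real customer
    tag_synthetic({"title": "Who's Got the Monkey?", "qty": 1}),  # test run
    tag_synthetic({"title": "Who's Got the Monkey?", "qty": 1}),  # test run
]
print(units_sold(orders, "Who's Got the Monkey?"))
```

Data cleaning would go one step further and delete the tagged records after the test run, so that they cannot surface to users browsing the site.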
  • A quiz: A = Active, P = Passive.
    Answers can be subject to argument; there are gray areas.
  • To understand the power of TiP, it is illustrative to understand the power of... experimentation.
    Experimentation is a passive validation methodology:
      Try new things... in production
      Build on successes
      Cut your losses... before they get expensive
    A/B testing is... users assigned to one of multiple experiences and compared.
    DF and Beta are... a bit different: users opt in to trying a not-yet-released version.
    Both use Exposure Control, which limits who sees the new code. This mitigates risk by limiting exposure of new code.
    Controlled experimentation
    Un-controlled experimentation
    One way FB experiments: three concentric push phases
      p1 = internal release
      p2 = small external release
      p3 = full external release
    Ref: http://framethink.blogspot.com/2011/01/how-facebook-ships-code.html
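Exposure control for an A/B test is often done by hashing the user into a stable bucket, so the same user always sees the same experience and the exposed share can be widened step by step. The sketch below assumes that approach; the function names, the experiment label, and the 100-bucket split are illustrative and are not taken from the talk or from any of the companies mentioned.

```python
import hashlib


def bucket(user_id, experiment, buckets=100):
    """Map a user to a stable bucket in [0, buckets) for one experiment."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % buckets


def variant(user_id, experiment, exposure_percent):
    """Exposure control: only users in the first N buckets see the new code."""
    if bucket(user_id, experiment) < exposure_percent:
        return "treatment"
    return "control"


users = [f"user{i}" for i in range(10_000)]
exposed = sum(variant(u, "new-ui", 10) == "treatment" for u in users)
print(f"{exposed / len(users):.1%} of users see the new experience")
```

Raising `exposure_percent` from 1 to 10 to 100 only ever adds users to the treatment group, which mirrors the idea of concentric push phases: everyone exposed at a small percentage stays exposed as the release widens.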
  • 1% launches – Eric Schmidt.
    Slice and dice is about the data: design decisions and also service quality.
    Shadow launches:
      Status packets: billions of packets per day.
      Launched the service, but users could not see it.
    At Microsoft we used experimentation to assess how often decisions were good. The decision makers were experts.
    CLICK:
      1/3 achieved some degree of the desired goal.
      1/3 had no significant effect. This is an important result that many do not consider.
      1/3 had the opposite of the desired effect.
    Experimentation lets you quantify the good ones and weed out the bad ones.
    --------------------------------------
    1% launches: “…dice and slice in any way you can possibly fathom” – Eric Schmidt.
      Ref: How Google Fuels Its Idea Factory, BusinessWeek, April 29, 2008; http://www.businessweek.com/magazine/content/08_19/b4083054277984.htm
      Somewhat famously this is used for design decisions: “design philosophy was governed by data and data exclusively“ – Douglas Bowman, former Visual Design Lead. http://stopdesign.com/archive/2009/03/20/goodbye-google.html
      Slice and dice what? The data... it’s a data-driven decision.
    Shadow launches
      Ref: Seattle Conference on Scalability: Lessons In Building Scalable Systems, Reza Behforooz; http://video.google.com/videoplay?docid=6202268628085731280 @6:55
      Google Talk presence packets: ConnectedUsers x BuddylistSize x OnlineStateChanges = billions of packets per day.
      Everything was happening, but nothing was displayed to users.
    At Microsoft, an evaluation of decisions tested with experimentation found that only about 1/3 achieved the desired goal; the other 2/3 did not.
      Ref: http://blog.clicksnconversions.com/intuition-sucks-%e2%80%93-that%e2%80%99s-why-we-test/
  • Let’s look at an example of experimentation more directly tied to traditional software quality assessment.
    Netflix is a streaming video service hosted on the Amazon AWS cloud. Available in both North and South America, the Caribbean, United Kingdom, Ireland, Sweden, Denmark, Norway, Finland.
    1B API requests = Big Data.
    Blue is Vcurr; a smiley face means customer traffic is carried on that (virtual) server.
    Red is Vnext.
    [click] Netflix spins up Vnext in the cloud carrying no user traffic.
    [click] They then put one red/Vnext server live carrying user traffic and let it run to test code quality.
    [click] They then switch user traffic to the red/Vnext servers but keep the blue/Vcurr ones around while they run overnight and check for problems.
    [click] Finally, if all is well with Vnext, they release the Vcurr resources.
    Typical problem found: memory leak.
    Move all users to Vnext and let it bake: that is Big Data.
    Although not truly random and un-biased, there is still value here, especially for seeing large changes.
    http://perfcap.blogspot.com/2012/03/ops-devops-and-noops-at-netflix.html
    Joe Sondow, Building Cloud Tools for Netflix
      Slides: http://www.slideshare.net/joesondow/building-cloudtoolsfornetflix-9419504
      Talk: http://blip.tv/silicon-valley-cloud-computing-group/building-cloud-tools-for-netflix-5754984
  • Data science is becoming more important for testers to know (tester as data scientist).
    I am not going to spend a lot of time on basics like median, mean, standard deviation, or linear regression. I assume you know those, or you can look them up later.
    Here we will cover some of the more interesting TECHNIQUES and GOTCHAS, ones you won’t find in a beginner stats course.
    I won’t explain them here; I will illustrate them on the following slides.
    Plus the tools of Big Data.
  • This is one of the first computers I ever used.
    I have been working in software for 19 years.
    Survey the audience: get 5 answers (samples).
    This is to illustrate the rule of 5.
    The median is the point with equal population above and below it.
    The median years in software has a 93.75% chance of being between the min and max of the samples surveyed (among the 5 taken).
    ***** Power of small data sets ******
    Explain sample bias.
  • The median is the point with equal population above and below it.
    The median years in software has a 93.75% chance of being between the min and max of the 5 samples taken.
    Explain why the rule of 5 works. Explain sample bias.
    Why the rule of 5 works:
      A value has a 50% chance of being above the median, the same as the chance of heads on a coin flip.
      All 5 values above the median? Same as 5 heads in a row: 3.125%.
      Neither all 5 values above the median nor all 5 below it: 100 – (2 x 3.125) = 93.75%.
    Sample bias:
      1. Median years in software among testers.
      2. Median years in software among testers attending Test Bash. This would be the same as 1 if we knew Test Bash attendees were a representative sample.
      3. Median years in software among testers attending Test Bash who are willing to volunteer such info [self-selection bias].
    Modeling:
      This makes no assumption about the model. By definition a single observation has a 50% chance of being over or under the median.
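The coin-flip argument for the rule of 5 can be checked both exactly and by simulation. The sketch below is illustrative only: the population is made-up, deliberately skewed data (drawn from an exponential distribution) to show that the result does not depend on the shape of the distribution, and the function names are not from the talk.

```python
import random


def rule_of_five_probability(n=5):
    """Exact chance that the population median lies between the sample min and max."""
    # The only failures are "all n above the median" and "all n below it".
    return 1 - 2 * 0.5 ** n


def simulate(population, trials=20_000, n=5, seed=1):
    """Estimate the same probability by repeatedly drawing n random samples."""
    rng = random.Random(seed)
    ordered = sorted(population)
    mid = len(ordered) // 2
    # Even-sized population: the median falls between the two middle values.
    median = (ordered[mid - 1] + ordered[mid]) / 2
    hits = 0
    for _ in range(trials):
        sample = [rng.choice(population) for _ in range(n)]
        if min(sample) <= median <= max(sample):
            hits += 1
    return hits / trials


rng = random.Random(0)
# Hypothetical, skewed "years in software" values; no distribution is assumed.
population = [rng.expovariate(0.1) for _ in range(1000)]
print(rule_of_five_probability())
print(simulate(population))
```

The simulation only reproduces the 93.75% figure because each sample is drawn at random from the whole population; a biased or self-selected sample breaks the 50/50 coin flip that the argument depends on.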
  • Averaging is a form of lossy data compression... it destroys information!
    Take-away: you need to understand your population.
    The probability density function above contains samples from two distinct populations. For example, these could be different versions of the software, or different user populations: testers vs. real users, or different geographic regions.
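A small numeric example makes the "lossy compression" point concrete. The latencies below are invented for illustration: two populations, one fast and one slow, whose combined mean describes neither of them.

```python
from statistics import mean

# Hypothetical page load times in milliseconds for two distinct populations.
fast = [90, 95, 100, 105, 110]        # e.g. users near the data center
slow = [900, 950, 1000, 1050, 1100]   # e.g. users on another continent

combined = fast + slow
overall = mean(combined)

# Distance from the overall mean to the closest actual observation.
nearest = min(abs(x - overall) for x in combined)

print(f"mean of fast group: {mean(fast)} ms")
print(f"mean of slow group: {mean(slow)} ms")
print(f"combined mean: {overall} ms, nearest real sample is {nearest} ms away")
```

No user in this data set experienced anything close to the combined mean, which is why the population has to be understood (and split) before it is averaged.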
  • Same as the previous slide, just a more complex example: 5 distinct populations.
  • Averaging is a form of lossy data compression... it destroys information! Other stats can also be lossy.
    Take-away: you need to understand your data model.
    R^2 is the coefficient of determination. A value closer to 1 indicates that a regression line fits the data well.
    SD is the standard deviation. A low standard deviation indicates that the data points tend to be very close to the mean; a high standard deviation indicates that the data points are spread out over a large range of values.
    For normally distributed data, 1 SD covers 68.27% of the set; 2 SD, 95.45%; 3 SD, 99.73%.
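The standard deviation coverage percentages quoted above are a property of the normal distribution, not of data in general, which is one reason the data model matters. As a quick check, the share of a normal distribution within k standard deviations of the mean is erf(k / sqrt(2)):

```python
import math


def share_within(k):
    """Fraction of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


for k in (1, 2, 3):
    print(f"{k} SD: {share_within(k) * 100:.2f}%")
```

For data that is not normal, such as the multi-population examples on the earlier slides, these percentages do not apply and should not be assumed.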
  • Hadoop is a tool for processing large data sets.
    Processing = what you might do with a SQL SELECT: combine, sort, count.
    Imagine this data set of 9 chars is actually 10s of trillions of chars.
    First we need to store massive amounts of data:
      Distributed storage: HDFS = Hadoop Distributed File System.
      Break the file into pieces. Each piece is stored multiple times (3) for redundancy.
    Then we need to process massive amounts of data:
      Distributed computing.
      Map-Reduce and similar algorithms (Cosmos uses Dryad).
      Bring the compute to the data in its split-up form.
      Map-Reduce can operate on the pieces.
      The processing is MAP’ed to the smaller subsets.
      The output of these many operations is then re-combined (REDUCED) into a single answer.
      (Remembering the input is 10s of trillions) the output is a much smaller file than the input.
    ------------------------------------------
    Hadoop is part of a rich eco-system of tools:
      Hive: data warehouse for Hadoop; query the data using a SQL-like language called HiveQL. http://hive.apache.org/
      Pig: high-level language for expressing data analysis, with a compiler that produces sequences of Map-Reduce programs. http://pig.apache.org/
      Mahout: machine learning library. http://mahout.apache.org/
      Scribe: log aggregation.
    HDInsight is Hadoop running on Microsoft Azure.
    Ref:
      http://www.windowsazure.com/en-us/manage/services/hdinsight/
      http://hadoop.apache.org/
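The split, map, and reduce steps described above can be imitated on a single machine. This is a toy sketch, not Hadoop: the 9-character input string is made up (the slide's actual data set is not in the notes), the blocks live in one Python list rather than in HDFS, and the map step runs sequentially rather than on many servers.

```python
from collections import Counter
from functools import reduce


def split_into_blocks(data, block_size):
    """Storage step: break the input into fixed-size pieces."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def map_block(block):
    """Map step: count characters within one piece, independently of the others."""
    return Counter(block)


def reduce_counts(left, right):
    """Reduce step: merge two partial counts into one."""
    return left + right


def char_count(data, block_size=3):
    blocks = split_into_blocks(data, block_size)
    partials = [map_block(b) for b in blocks]       # could run on many machines
    return reduce(reduce_counts, partials, Counter())


print(char_count("ABCABCAAB"))  # hypothetical 9-char data set
```

Because each block is counted without looking at any other block, the map step is what can be moved to wherever the data is stored; only the small partial counts have to travel to the reduce step.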
  • Cosmos is similar to Hadoop. It is Microsoft-internal. The numbers are impressive.
    -----------------------------------------------
    Data drives search, advertising, and all of Microsoft:
      Web pages: links, text, titles, etc.
      Search logs: what people searched for, what they clicked, etc.
      IE logs: what sites people visit, the browsing order, etc.
      Advertising logs: what ads people click on, what was shown, etc.
      Social feeds from Twitter and Facebook
      Service telemetry: Office 365, Hotmail (not emails), MSN
    The picture is a modularized “container” of servers used in Microsoft data centers.
    Refs:
      “It stores hundreds of petabytes of data on tens of thousands of computers. Large scale batch processing using Dryad with a high-level language called SCOPE on top of it.”
      The Bing Big Data Platform, Ken Johnston; Big Data Innovation Summit 2013, Las Vegas: process 2 PB per day.
      http://research.microsoft.com/en-us/events/fs2011/helland_cosmos_big_data_and_big_challenges.pdf
  • These are Spartans from Halo.
    A headless Spartan cannot be killed. They should not exist, but they did.
    How did the Halo team find this bug and eliminate it?
    CLICK: HDInsight, Hadoop running on Azure. **** Hadoop as a service.
    CLICK: Halo had the data, but it was overwhelming.
    CLICK: Using the data and HDInsight they found the bug in production, and eliminated it.
    Headless Spartan: an unofficial mod, which can only be applied using a modified Xbox 360. Almost impossible to find pre-release. But in production they can find and eliminate it.
    CLICK: Here is a brief overview of what they did.
    From hundreds of low-wage, low-skilled testers → millions of free, low-skilled testers, highly skilled customers.
    CLICK: They also found less obvious bugs and cheats; it is all hidden in the data.
    Tell the Target story.
    CLICK: Reveal quote from Target statistician Andrew Pole.
    ---------------------------------------------
    Target story
      Target is a store in the US, like Tesco in the UK.
      http://www.forbes.com/sites/kashmirhill/2012/02/16/how-target-figured-out-a-teen-girl-was-pregnant-before-her-father-did/
      Target sent a teenage daughter baby supply coupons.
      Target apologized, then called weeks later to apologize again.
      The father admitted that, unbeknownst to him at the time, his daughter was pregnant.
      The following purchases may indicate a woman is pregnant with a boy:
        Cocoa butter lotion
        A large purse
        Zinc and magnesium supplements
        A bright blue rug
      “But even if you’re following the law, you can do things where people get queasy.” - Target statistician Andrew Pole
      “...started mixing in all these ads for things we knew pregnant women would never buy, so the baby ads looked random. We’d put an ad for a lawn mower next to diapers”
    Ref:
      http://www.microsoft.com/en-us/news/features/2012/oct12/10-31halo4.aspx
      http://www.microsoft.com/casestudies/Case_Study_Detail.aspx?CaseStudyID=710000002102
  • Another Big Data example is Microsoft Exchange Online.
    ***** They can predict 75% of availability issues ahead of time.
    Big Data from over 8000 servers, instrumented to collect 1000 metrics, processed by Cosmos.
    CLICK ***** Using ML they can *PREDICT* 75% of outages ahead of time.
    ---------------------------------------------------------
    PBs of data collected, such as:
      Availability
      Latency
      Errors
      Perf counters: CPU, memory, etc.
    Lots of servers: instrument them all and you get lots of data.
    PBs, how can we process all that? Cosmos and machine learning.
    It’s about fitting your data to a model. Think about simple linear regression, y = mx + b; it is like that but can get much more advanced.
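"Fitting your data to a model" in its simplest form is the linear regression y = mx + b mentioned above. The sketch below is an ordinary least-squares fit on made-up numbers; it is not the Exchange Online team's model, which the notes describe only as "much more advanced".

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


def predict(m, b, x):
    """Use the fitted model to estimate y at a new x."""
    return m * x + b


# Hypothetical telemetry: hour of observation vs. error count.
hours = [0, 1, 2, 3]
errors = [1, 3, 5, 7]

m, b = fit_line(hours, errors)
print(m, b, predict(m, b, 4))
```

Once a model has been fitted to past telemetry, prediction means evaluating it for a time that has not happened yet; real outage prediction would use many metrics and a far richer model than a single straight line.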
  • Testing is... a set of observations to reduce uncertainty about the quality of the product.
    We can use passive and/or active techniques to get those observations.
    In production is where we can find some of the best observations, using either Passive or Active Validation.
    We obtain data, which we use to calculate metrics, which are used to drive actions.
  • Contact Info
    Seth Eliot
    seth.eliot@microsoft.com
    Twitter: @sethliot
    Blog: http://bit.ly/seth_qa

Presentation Transcript

  • Seth Eliot Senior Knowledge Engineer, Test Excellence
  • Services and Cloud A/B Testing of Services Petabytes Processed About Seth
  • Measurement: A quantitatively expressed reduction of uncertainty …based on one or more observations. Testing: …about quality of a system under test…
  • Data Driven Decision Making Data Driven Validation TiP Real Users Production Environments
  • 500 Million measurements per month JSI JavaScript Instrumentation
  • “This process works for Facebook partly because Facebook does not, by and large, need to produce particularly high-quality software” Really?
  • …maybe less of this
  • Who's Got the Monkey? Who's Got the Monkey Now? Monkey
  • Provides insight into real usage Reproducible and well understood scenarios Covers a vast variety of environments Requires proper handling of Personally Identifiable Information (PII) May adversely alter production and production data
  • “To have a great idea, have a lot of them” -Thomas Edison
  • “…dice and slice in any way you can possibly fathom”
  • 1B API requests per day Canary Deployment
  • Science!
  • How many years have you worked in software?
  • Stores hundreds of petabytes on tens of thousands of computers
  • “We know we can't anticipate the 101 things that will go wrong. The only thing we can control is ensuring our team responds appropriately to those situations.” – Jerry Hook, Executive Producer, Halo. …Hundreds of thousands of requests per second
  • Availability (y) over time (x). Predict 75% of dips 24 hours ahead of time. Data, Machine Learning, Cosmos
  • Testing: A set of observations to reduce uncertainty about quality of a system under test. The best observations are often in production
  • Do It In Production Testing Where it Counts ?