Skinner Wasn't a Software Engineer


Published in: Technology, Business

From the Editor
Warren Harrison, Editor in Chief, Portland State University (warren.harrison@computer.org)
IEEE Software, May/June 2005, pp. 5-7. 0740-7459/05/$20.00 © 2005 IEEE

Most of us have heard of B.F. Skinner, the father of operant conditioning. If you recall, Skinner pioneered those psychology experiments that we often hear about where a rat is placed in a box and learns to press a lever a certain number of times in order to receive a pellet of food. Even before Skinner, Ivan Pavlov (of Pavlov's Dogs fame) studied behavioral conditioning in dogs. Every time he fed his dog, he rang a bell. After a while, his dog would start to salivate when he heard the bell whether food was present or not.

We won't be duped

Of course, the software engineering community would never fall for the work by Skinner, Pavlov, or the many other psychologists who've studied operant conditioning. After all, each of these studies only included a handful of animals. We all know that to prove something significant, you need a large number of subjects. Otherwise, how do you know you're not working with an extraordinary beagle or pigeon? Perhaps most rats just can't learn, and you just happened to pick the one genetic mutant that has a more developed thalamus.

Software engineering researchers and practitioners are all familiar with one of the main approaches to arguing the value of a new tool or technique: the comparative group experiment. Typically, this involves comparing the performance of two groups of subjects, one that employs the new tool or technique and one that does not. For instance, if we're investigating a new development paradigm, we might study the members of each group as they develop or modify a software application. Members of one group use the new development paradigm, while members of the other don't. Upon completing the task, we'd compare the two groups' average performance. Did members of one group complete the task more quickly or make fewer mistakes than the other?

These experiments are much more convincing if the groups consist of dozens of programmers rather than three or four. When we have very small group sizes, random occurrences can significantly affect the outcome. For instance, if we have only two programmers in a group and one of them is suffering from the flu, their group performance probably depends more on fever than on development paradigm. On the other hand, if the group has 100 programmers, the other 99 will mitigate the impact of a single programmer performing below his normal abilities.

But where do I get 100 programmers?

As you can imagine, finding 100 developers willing to participate in such an experiment is neither cheap nor easy. Even a modest experiment could cost tens of thousands of dollars. But even if a researcher has the money, where do they find that many programmers?

Resourceful academics often address this problem by pressing students into service. Students are cheap, plentiful, and easy to manage. Unfortunately, practitioners are understandably skeptical of results acquired from a study of 18-year-old college freshmen.

How long have I got?

Comparative group experiments tend to work best when outcomes are observable within a short period of time. This is because of both the cost as well as the ability to control the subjects. If your experiment takes an hour to carry out, you have some control over your subjects. You can ensure that they don't discuss the experiment among themselves, prevent them from using external reference material that might bias the results, keep close measurement of how much time they spend thinking about the problem, and so on. However, if the experiment takes several days, you really can't control for these factors short of sequestering the entire lot of them.

Consequently, many tasks are constructed to be exceedingly simple in order to squeeze into this artificial window of time. While this might be appropriate for some studies, many other problems require long-term study before obtaining meaningful results. For example, it's hard to see the results of a process improvement technique in an hour and a half.

Will field studies fix it?

A common alternative to the controlled group experiment is the field study, an analysis of the measurements taken in the process of developing a real software system. However, a field study's results can often be as misleading as academic studies involving students performing simple programming tasks.

There's still pressure to include a large number of data points. So, to achieve a large data set, we usually end up with a collection of projects developed using different processes and personnel and coming from different application areas. Often, we limit measures to phases in which data collection can be expedited. For instance, many industrial studies measure defects found during integration testing because data is usually easy to extract from a defect tracking system. On the other hand, because projects seldom formally track errors found during unit testing, we almost always omit these errors from field studies.

n = 1?

Clearly, striving for large data sets has drawbacks, regardless of the context. As I observed in a paper at the Workshop on Using Multi-disciplinary Approaches to Empirical Software Engineering Research in June 2000 ("N = 1: An Alternative for Software Engineering Research?"), it might be more useful to consider people and projects as individuals rather than as single data points in larger collections of data points. Other disciplines have encountered similar issues. Clinical psychology and psychiatry have addressed these problems through single-subject experimental design. To quote Skinner (in Operant Behavior: Areas of Research and Application, Appleton-Century-Crofts, 1966):

    Instead of studying a thousand rats for one hour each, or a hundred rats for ten hours each, the investigator is likely to study one rat for a thousand hours.

Like comparative group experiments, single-subject experiments involve a treatment (the new technology under study) as well as a measure of performance. However, unlike the group experiment, we study only one subject at a time. For software development studies, this might be a single developer, a single team, or a single project.

The single-subject experiment begins with a hypothesis. This is important because the hypothesis determines everything else that follows, from defining "the treatment" to deciding which performance measures to use.

We start our data collection with an initial period of observation, called the baseline. We record the performance measures before applying the treatment. Called the A Phase, this can last days, weeks, or months. The point of the A Phase is to provide a standard against which we can compare performance once we apply the treatment.

Once we establish a performance baseline, we apply the treatment; then, we measure the performance again. We call this the B Phase. For instance, if our treatment involves integrating a lint tool into our build procedure, we'll compare the programmer's error rate after introducing the tool to the A Phase error rate. Any changes we see in the programmer's performance are likely (though not necessarily) due to the treatment. (However, if some confounding factor exists, we're much more likely to recognize it with a single programmer than with a group of 30.)

We call this an A-B design. However, it would be premature to assume the improvement in error rate is due to the lint tool's introduction. Therefore, the single-subject experiment involves a third phase, in which we withdraw the treatment. For instance, we could remove access to the lint tool and again measure programmer error rate. If the reduction in error rate really were due to the lint tool, we'd expect the error rate to increase during this third phase.

Adding the third phase gives us the A-B-A withdrawal design, and we can imagine additional elaborations, such as A-B-A-B, A-B-A-B-A, and so on. These are quite effective at ensuring that the performance changes we observe are due to the treatment under study and not an unanticipated external event.

What about statistics?

It's difficult to substantiate with classical statistical analysis that the treatment is responsible for a performance change when we use the A-B-A design. Statistics used in comparative group experiments depend on representative values such as means and medians and measures of variability such as standard deviations. Obviously, if we're measuring a single subject's performance, then means, medians, and standard deviations don't make much sense.

To overcome these issues, we evaluate single-subject studies by determining whether the treatment's effects are noticeable. We call this the therapeutic criterion. To achieve this, a treatment must make an observable change in the subject's effectiveness that doesn't require elaborate statistical analysis.

Aren't single-subject experiments just case studies?

Case studies are popular for relating your experiences with introducing new software development technologies into an organization. A well-done case study shares some similarities with the A-B design. It obtains a performance baseline before starting, compares it to the performance observed at the study's end, and evaluates the outcome using the therapeutic criterion. Nevertheless, the typical case study differs from a single-subject experiment in a number of ways.

Usually, a case study is uncontrolled and establishes its hypothesis after the fact, whereas a single-subject experiment is hypothesis driven and controlled. Also, we can alternately introduce and remove the treatment in a single-subject experiment's withdrawal phase, giving us an idea of the treatment's significance.

Does anyone use single-subject experiments?

To date, single-subject experiments to study software development advances are almost nonexistent. Obviously, not every study lends itself to this approach: some treatments aren't easy to withdraw. However, single-subject experiments can be a powerful tool in both the researcher's and practitioner's toolkit.

Feedback welcome

I'd like to find out what you think. What does it take to convince you that a new technique or method is valuable? What does it take to get you to try out a new technique? Have you or anyone you know had any success in using single-subject experimentation? Please write me at warren.harrison@computer.org.
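
For readers who want to try the A-B-A withdrawal design on their own data, the following is a minimal sketch rather than anything prescribed by the editorial. It assumes weekly error counts for one programmer across the three phases. As a simple stand-in for the therapeutic criterion, it requires that the phases not overlap at all: every treatment-phase count must be lower than every count in both the baseline and the withdrawal phase. The function names, the non-overlap rule, and the numbers are all hypothetical illustrations.

```python
def phases_separate(higher, lower):
    """Return True if every value in `higher` exceeds every value in `lower`.

    This non-overlap rule is an assumed, deliberately crude stand-in for an
    "observable change"; it is not a definition taken from the editorial.
    """
    return min(higher) > max(lower)


def aba_supports_treatment(baseline, treatment, withdrawal):
    """Check an A-B-A record of error counts for a single subject.

    The treatment (for example, a lint tool in the build) is credited only if
    errors drop clearly when it is introduced AND rise clearly again when it
    is withdrawn.
    """
    improved = phases_separate(baseline, treatment)
    reverted = phases_separate(withdrawal, treatment)
    return improved and reverted


# Hypothetical weekly error counts for one programmer.
baseline = [14, 12, 15, 13]    # A phase: no lint tool
treatment = [7, 6, 8, 5]       # B phase: lint tool in the build
withdrawal = [13, 12, 14, 11]  # second A phase: lint tool removed

print(aba_supports_treatment(baseline, treatment, withdrawal))
```

If the error counts stayed low after the tool was removed, the check would fail, which is the point of the withdrawal phase: the improvement could then be due to something other than the treatment.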