Major Research and Education Activities



Empirical Studies of Distributed Student Software Testing Teams

Dolores M. Zage and Wayne M. Zage
Software Engineering Research Center
Ball State University
{dmzage, wmzage}

Abstract

This paper presents an empirical study analyzing the use of distributed testing teams and a global testing infrastructure to identify critical factors impacting the success of the testing process. The study collaborators were researchers and students at the University of Limerick, Ireland and at Ball State University, Indiana, USA. Three global testing experiments were designed, conducted and analyzed. The test case distribution indicated that multiple distributed teams increased the product-under-test functional testing coverage with surprisingly little duplication of test cases. The research highlighted how various global operational testing scenarios impact the software development process and product, and that globally distributed teams conducting operational testing on a product can lower product cost and/or improve the quality of the resulting product.

1. Introduction

There are as many economic benefits as there are problems in developing software in globally distributed locations. Researchers have noted benefits of global software development (GSD). Ebert and DeNeve [1] declare that a mix of skill sets and life experiences within a team can result in improved coordination among GSD team members. It has also been noted that when people from different backgrounds or cultures come together to work toward a common goal, the different levels of experience, technical knowledge and elementary understanding of a problem can lean in favor of innovation. Grinter and Herbsleb remarked that by developing across international borders companies have access to a new set of individuals that match their needs [2]. GSD in essence allows distributed teams to split up the tasks of a project and distribute them as separate jobs, which allows development decisions about each project task to be made with a degree of independence [3].

One of the most pressing problems in GSD is the absence of a globally distributed software development process, and a key area within GSD is software testing. This research analyzes test cases and test case coverage from three experiments with varying experimental designs in an effort to determine critical factors impacting the software testing process. Faculty at the University of Limerick (UL) in Ireland and at Ball State University (BSU) in Indiana integrated these experiments into their graduate computer science courses to create virtual testing team scenarios.

Virtual teams depend upon technology to create the virtual workspace that the teams use to communicate and collaborate. A tool, the Global Access Testing Environment (GATE), was developed at BSU to create this environment not only for the virtual testing teams, but also for the U.S. and Irish researchers observing, monitoring, and integrating various research designs to identify the key information required to support and enhance an effective infrastructure for globally distributed software testing.

2. Management of the Experimental Testing Process

The need for test management software was obvious, but the decision as to which software to use was not. The research required an environment that would meet the following criteria:

  • Accessible globally, 24 hours per day
  • Provides mechanisms for user rights and roles
  • Enforces consistency in the testing process and the data collected
  • Accommodates changes to the process and the data collected
  • Allows for manipulation of raw data
  • Provides means for team coordination and communication

After searching for and examining available test management and collaboration software, it was determined that to ensure unrestricted access to data and to have the flexibility to alter the data collected, a custom environment would need to be developed. Thus, work began at BSU on GATE, which consists of five components: User Management, Test Management, Defect Management, Change Management, and Collaboration and Communication Tools [5]. To maintain consistency in the data that GATE collects, each component required specific fields. The collection of the data in these required fields also achieves compliance with the following standards: ISO 9001, ISO/IEC 12207, SW-CMM™, and IEEE 829. GATE is an open-source application that provides a virtual testing team space. Within GATE are tools for information sharing, reporting, metrics, permanent retention of testing information, and team building. GATE has been used by researchers and students at UL and BSU to manage the testing process in these empirical studies.

3. The Setting for the Testing Experiments

Three global testing experiments were undertaken with teams of students from graduate-level computer science classes in the two universities to identify the essential information and infrastructure required to support effective testing in a globally distributed environment. Teams of graduate computer science students in Ireland and in the United States conducted testing on closed and open source software systems following several scenarios. On several occasions, Dolores Zage taught a graduate seminar entitled Global Software Testing and it was through these seminars that the experiments on global testing projects were conducted.

Additionally, software development, and in particular software testing, has become a globally sourced commodity and current indications are that this trend continues to grow. Our graduates will be engaged in GSD and both the UL and the BSU students were eager to have firsthand experience of working within globally distributed software teams.

It was difficult to identify a suitable product to be chosen for the system under test (SUT) for each experiment. For each of the SUTs chosen, it was required that the learning curve to become familiar with the product's domain be short. On the other hand, the product could not be so simple that student team coordination was unnecessary. The product's interface had to be web-based for easy access for each of the student teams. This enforced a common product platform, assuring that test case results had a uniform interpretation. For the second and third experiments, extra selection constraints were added because white-box strategies were added to the testing process. A unit testing environment to support white-box testing had to be available.

Each experiment required extensive preparation. For each SUT, suitable documentation and tutorials were required for the student testers. Much of the time, these were created by BSU researchers after selecting the system. A generic test plan was created for each SUT and, for the SUT used in the third experiment, a Software Requirements Specification (SRS) was reengineered. GATE required a tutorial and, based on student feedback and continuous updates in GATE's functionality, several iterations were created. Several versions of GATE tutorials were created through varying the mode of presentation. One tutorial was a step-by-step document with no direct interaction. A second one was a Flash-generated tutorial with interactive input. The tutorials were viewable from the web and downloadable to cater to the students' preferences and time frames. At the start of each experiment, a new testing environment had to be established before testing began. Preparations included, among several other activities, the initialization of GATE with the lead testers from each team, the installation of the SUT, tester and team email exchange, creation of a forum, inter- and intra-team chat capabilities, and the standardization of the reporting forms including the trouble reporting system and the video conferencing schedule. To educate our testing students on global software development, the testing technologies and the environment, testing seminar materials were created and delivered at both BSU and UL.

For each of the three experiments the functionality of the SUT was assessed and functional categories were cataloged. A desired distribution pattern of the test cases in each of the functional categories corresponded to
software size and functionality importance. The students' test cases were distributed over the system's functional categories and compared to the desired distribution. Each experiment also was analyzed for the effect of the various global testing scenarios on the software development process and product. These experiments also helped the researchers to establish "what works" and "what does not work" when implementing global testing and to provide knowledge and education on GSD to individual students in both locations.

4. Experiment 1: Non-interacting Teams testing rGrade

Our first SUT was a soon-to-be released Ball State educational software product entitled rGrade. rGrade is a web-based, rubric-driven grading and assessment environment. A rubric can be thought of as a two-dimensional grid, where each row describes one element of the program, problem or solution, and each column relates to a level of achievement. In computer science terms, a rubric is simply a 2-dimensional array. This experiment was conducted to determine the effect of non-interacting multiple teams on test cases. Displayed in Figure 1 is rGrade's Main Menu interface. In addition to its main focus, rGrade is a full grading environment where courses and assignments are stored and student grades are recorded.

Figure 1: rGrade's main menu

The class was divided into four testing teams. Testing lasted four weeks, with the first week used for the development of each team's test plan for this experiment. Each group worked independently and, other than some documentation and a short tutorial provided by BSU educational services, no other help was provided. Test cases were entered into GATE.

The functionality of rGrade was assessed and broken into eight functional categories. These categories are listed on the horizontal axis in Figure 2. The diagonal line illustrates a close to ideal distribution of test cases over these categories, where the category functionality is ordered by software size and importance. The intended outcome was that the distribution of test cases would correspond to the ideal, indicating testers did not focus on just one or two areas of the software. Note that the aggregation of test cases for all groups, as given by the histogram in Figure 2, approximates the distribution represented by the diagonal line. Note also that there was some coverage (by test cases) in all areas of functionality. As first reported in [4], of the 223 total test cases, 174 passed, 43 failed and 6 were blocked. When duplicated test cases (test cases that covered the exact same functionality) were removed, we found a total of 171 unique test cases (52 test cases were duplicates). The first column in each of the eight categories identifies all test cases and the second column excludes the duplicated test cases.

Figure 2: Distribution of all test cases by rGrade functionality

Whereas Figure 2 presented results using the aggregated test cases for all of the teams, Figure 3 displays each individual team's test case distribution based on the same eight functional categories. In Figure 3, it can be observed that no one team covered all areas of functionality with their test cases. Moreover, not all teams' test cases followed our desired pattern. Perhaps the more interesting observation is that the union of the test cases from the four teams (Figure 2) provides the desired coverage better than any single team (Figure 3).
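The outcome counts reported for Experiment 1 can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; only the counts (174 passed, 43 failed, 6 blocked, 171 unique) come from the text, while the variable and function names are illustrative assumptions and not part of GATE.

```python
# Outcome counts for Experiment 1, as reported in the text.
outcomes = {"passed": 174, "failed": 43, "blocked": 6}
total = sum(outcomes.values())  # all test cases entered by the four teams


def percent_shares(counts):
    """Return each outcome's share of all test cases, in percent."""
    n = sum(counts.values())
    return {name: 100.0 * count / n for name, count in counts.items()}


shares = percent_shares(outcomes)

# The text reports 171 unique test cases; the remainder are duplicates.
unique_cases = 171
duplicate_cases = total - unique_cases

for name, share in shares.items():
    print(f"{name}: {share:.1f}%")
print(f"duplicates: {duplicate_cases} of {total}")
```

Rounded to whole percentages, the shares are 78% passed, 19% failed and 3% blocked, and the duplicate count comes out as the 52 stated in the text.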
Figure 3: Distribution of teams' test cases by rGrade functionality

When considering the outcome of all of the teams' test cases (Figure 4), 78% of the test cases passed, while 19% failed and 3% were blocked. Consider now the percentages of unique and duplicate test cases that were passed, blocked and failed, as shown in Figure 5. Of the test cases that passed, 28% (50/174) were duplicated. This suggests that the obvious functionality that testers focused on had overlapping test cases that performed correctly. The blocked test cases show no duplicates. Finally, the failed test cases show virtually no duplicates (only 2), suggesting that the additional testing effort of multiple teams paid off. The duplicate test cases were in the rubrics functionality category.

Figure 4: Test case summary by outcome

The experiment provided insight into the effect of non-interacting multiple student teams on test cases and assessed GATE's effectiveness as a virtual team testing tool. Results indicate that multiple teams increased the testing coverage of the functionality of the SUT with surprisingly little duplication of test cases. The use of GATE provided each student test team with a framework in which to conduct testing and document results.

Figure 5: Test case summary by duplicate/unique and outcome

5. Experiment 2: Collaborating, Distributed Teams testing <jeXML>

Our second SUT was a system entitled <jeXML>, the web edition. The software <jeXML>, or just enough XML, includes the following capabilities: checks for a well-formed XML document, validates an XML document against a DTD, and generates a DTD based on an XML file. A twelve page tutorial on the basics
of XML and <jeXML> features was available to student testers. <jeXML> was approximately 2500 lines of C++.

The students were divided into eleven teams, five teams in Ireland and six teams in the U.S. Teams selected names, and team pairings were assigned alphabetically by team name, one Irish team with one American team (Teams 1-5, Table 1), with the remaining BSU team forming a control group. Similar to teams in experiment 1, the team size consisted of three to four students. For experiment 2, the testing effort was shared between a paired outsourced (UL) team and an in-house (BSU) team. The outsourced team performed black-box testing and the in-house team supported the testing effort with additional black-box and white-box strategies. In theory, the actual team communication required for this experiment was minimal, and only the handover of the black-box test cases recorded within GATE was important to the success of the experiment. The five-person all-BSU team (Team 6) copied the distributed team process by assigning three members to black-box testing and the remaining two students to white-box testing.

The testing timeline of the experiment began with black-box testing for the first three weeks. During the three-week time frame it was expected that the white-box testers were reviewing the actual code, reviewing the unit testing environment, CppUnit, and preparing test scripts. Unit testing would be done for those modules that implemented features with a use frequency of always or often. After the three weeks of black-box testing, the black-box cases would be reviewed by the BSU team to determine the level of statement coverage. The unit test cases would be executed to determine a total level of coverage. Testing would be augmented, if necessary, by unit and/or black-box test cases to achieve statement coverage of at least 80%. At the semester's end, each paired team was to submit the combined defect reports (outsourced and in-house), the white-box testing scripts, the white-box test cases accompanied by the testing resources and a summary report. The black-box and white-box test cases were recorded in GATE. For white-box testing, additional information, such as test drivers and their execution, required additional submission.

As was previously done for rGrade, four functional categories were identified for <jeXML>. These categories are listed on the horizontal axis of Figure 6. The black line illustrates the close to ideal distribution of test cases over these categories, based on category size and functionality importance. The first two categories form the basis of <jeXML> by checking the well-formedness of actual XML and DTD documents. The latter two categories use the functionality of the first two. As previously noted in the first experiment, when all test cases for <jeXML> are merged, they achieve a close to ideal distribution.

Figure 6. Distribution of all test cases by <jeXML> functionality

Figure 7 displays each individual team's test case distribution based on the same four functional categories. We see that no one team covered all areas of functionality with their test cases. Moreover, not all teams' test cases followed our desired pattern. Only team 1 achieved close to the ideal distribution. Again, the interesting outcome is that the union of test cases from all of the teams gives us the desired coverage better than any single team.

When considering the outcomes of all of the teams' <jeXML> black-box test cases (337), 97% of the test cases passed, 1% were blocked and 2% failed. Consider now the percentages of unique and duplicate test cases that were passed, blocked and failed. Of the test cases that passed, 5.8% (19/327) were duplicated. This suggests that the obvious functionality that testers focused on had overlapping test cases that performed correctly. The blocked test cases show no duplicates. Finally, the failed test cases had no duplicates, suggesting that the additional testing effort of multiple teams paid off. The results of this second experiment were consistent with the outcomes of the first experiment, namely that multiple distributed teams increased the testing coverage of the functionality of the SUT with surprisingly little duplication of test cases, and the union of the test cases from all the teams provides the desired coverage better than any single team.
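The comparison between a team's test case distribution and the desired distribution is presented graphically in the figures. One way to put a number on "how close" a distribution is to the desired one is the total variation distance, sketched below in Python. This is an assumption made for illustration, not the measure used in the study, and every count and weight in the example is invented rather than taken from the experiments.

```python
def normalize(counts):
    """Turn raw per-category counts into shares that sum to 1."""
    total = sum(counts)
    return [c / total for c in counts]


def total_variation(counts, desired_weights):
    """Half the sum of absolute differences between two distributions.

    0.0 means the shares match exactly; 1.0 means no overlap at all.
    """
    p = normalize(counts)
    q = normalize(desired_weights)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def union_counts(team_counts):
    """Add up the per-category counts of several teams."""
    return [sum(column) for column in zip(*team_counts)]


# Hypothetical data: four functional categories, two teams.
desired = [0.4, 0.3, 0.2, 0.1]   # invented weights, not from the paper
team_a = [8, 0, 2, 0]            # invented counts
team_b = [0, 6, 0, 2]            # invented counts

combined = union_counts([team_a, team_b])
for label, counts in (("team A", team_a), ("team B", team_b), ("union", combined)):
    print(label, round(total_variation(counts, desired), 3))
```

In this made-up example each team leaves two categories untested, so the merged counts sit closer to the desired weights than either team alone. The sketch only illustrates the kind of comparison the figures make and says nothing about the actual experimental data.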
Figure 7. Distribution of teams' test cases by <jeXML> functionality

5.1 Cooperation and Collaboration in Experiment 2: Environment Lessons Learned

Table 1 presents the combined summary report for the six teams in Experiment 2. The outsourced team 3 did not provide any test cases, and team 6 did not have an external partner, which accounts for the NA entries. All of the in-house teams submitted additional black-box test cases and all but one in-house team submitted very few if any unit-test cases. The original product executed through a console window, so a quick adaptation of the interface enabling web access was necessary. The adaptation took longer than anticipated, thereby reducing the time the black-box student teams could execute the actual tool and record the results of the test cases, a scenario probably approaching reality in practice. The white-box testing groups should not have been affected by the swap of the interface. Being novice testers, the students were not yet experienced enough to make adjustments. For example, experienced testers using the existing product documents would have created test cases and later executed them as the product became available. The unit testers should have proceeded as normal.

Table 1: Results from Experiment 2 (outsourced testing)

                            Team 1  Team 2  Team 3  Team 4  Team 5  Team 6
Outsourced
  Black-Box Test Cases          20      84       0      21      20      NA
  Defect Reports                 0       0       0       0       0      NA
In-house
  Black-Box Test Cases          30      21      42      33       0      66
  Statement Coverage           68%     60%      NG      NG     24%     67%
  Defect Reports                 4       0       0       0       0       3
  Unit Test Cases                0       0       4      48       5       0
  Defect Reports                 0       0       2      13       1       0
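The figures in Table 1 can be cross-checked against the targets and totals stated in the text. The sketch below, in Python, transcribes the black-box counts and statement coverage from the table; the names are illustrative assumptions, and the table's "NG" entries are represented as None because the text does not define them.

```python
COVERAGE_TARGET = 80  # percent; the statement-coverage goal stated for Experiment 2

# Statement coverage per in-house team, transcribed from Table 1.
# None stands for the table's "NG" entries.
statement_coverage = {
    "Team 1": 68, "Team 2": 60, "Team 3": None,
    "Team 4": None, "Team 5": 24, "Team 6": 67,
}

# Black-box test case counts, transcribed from Table 1 (Team 6 had no outsourced partner).
outsourced_black_box = {"Team 1": 20, "Team 2": 84, "Team 3": 0, "Team 4": 21, "Team 5": 20}
in_house_black_box = {"Team 1": 30, "Team 2": 21, "Team 3": 42,
                      "Team 4": 33, "Team 5": 0, "Team 6": 66}


def teams_meeting_target(coverage, target):
    """List the teams whose reported coverage reaches the target."""
    return sorted(team for team, value in coverage.items()
                  if value is not None and value >= target)


total_black_box = sum(outsourced_black_box.values()) + sum(in_house_black_box.values())

print("teams at or above target:", teams_meeting_target(statement_coverage, COVERAGE_TARGET))
print("black-box test cases in total:", total_black_box)
```

With the figures as transcribed, the black-box counts add up to the 337 test cases quoted in the text, and none of the reported coverage values reaches the 80% target.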
Even though the teams were paired, the amount of interaction between the outsourced and in-house teams feasibly could have been zero. Basically, GATE was the mediator where black-box test cases were deposited and results retrieved. GATE, on the other hand, did not work flawlessly. Experiment 2 required an XML document as input, serving as a resource for a test case. Many of the testers omitted submitting the document and GATE did not complain about the absence of the testing resource. Without the test resource, the test coverage could not be assessed for many of the teams. The next iteration of GATE attempted to address the issue of automatically supporting the testing process beyond merely the recording of test cases.

The impression of interaction or using the results of another team changed the work flow and mindset of the teams. A common thread among the paired teams was that there was little or no communication, which may have led to a rationale for inactivity. However, the initial testing effort did not require involvement. White-box testing can be difficult and the recording of white-box test cases within GATE was cumbersome. Many attachments were required for white-box testing and it was difficult to manage attachments. The next version of GATE addressed this weakness. The level of programming knowledge needed for testing <jeXML> also could have been a confounding factor. Even though the product was only 2500 lines of code, parts of <jeXML> were implemented with sophisticated scanning and parsing techniques and unique hashing algorithms. Unless a student had training in compiler theory, some of the code would require an extended learning curve to grasp.

This second experiment highlighted some weaknesses of GATE, namely, allowing incomplete and unusable test cases to be submitted and the absence of formal test plans for almost all the test groups. Another observation was the rare communication and cooperation between distributed teams. The lack of communication and cooperation was not entirely a direct consequence of the current version of GATE, since GATE's primary objective was to record, organize and evaluate test cases. However, GATE was envisioned to be a testing tool and also a common environment where on-site and off-site testers can work together testing multiple products. The current GATE provided a wiki as just one mechanism for inter-group communication. The GATE wiki was used for the posting of tutorials and common FAQs. In the next version of GATE, an integrated group email was available for the test team. To support varied schedules and time periods, the enhanced GATE included testers' schedules and a time-difference calculator.

To promote efficient communication and to ensure that resources are used effectively, there must be an understanding of the product, good team leadership and a supportive software tool. There was not a good understanding of the product. As a research project, <jeXML> did not have a Software Requirements Specification (SRS). The SRS is the unofficial contract for functional testing. When a single co-located team is assigned the task of testing a product that has no SRS, the team can hunt down documentation, converse and finally come to terms as to the extent of testing. When two teams not in the same location and assigned to the same product are faced with a similar situation, the old-fashioned way of ironing out the scope is not possible. In industry, some organizations impose a standard SRS containing a minimum level of information that would be required to initiate a project. Also imposed is a minimum documentation and handover requirement which provides each party an indication of what to expect from the other party. Within the updated GATE, an SRS is expected to begin the formal process of testing. Traceability between the SRS and the test cases is provided and mandatory.

When interviewed, no team had a designated test lead. This role was a "free floating" duty among the student testers as they began testing and identifying testing duties. GATE does include a user role of test lead which allows this user to select test cases from group members to form a test execution plan. The updated GATE supports leadership through the SRS and a detailed test plan. Individual GATE testers can be assigned by test plan and progress reports are plan- and user-based.

Experiments 1 and 2 indicated that multiple testing teams are effective in increasing the functionality coverage and are also efficient, as shown by the near absence of duplicate test cases. Multiple testing teams can relieve the external pressures for increased efficiency and the internal pressure for increased effectiveness. To further enhance the value of multiple testing teams, the added overhead due to managing multiple teams needs to be decreased. In addition, a supportive testing environment, which reduces a tester's stress by outlining a process enabling distributed teams to work effectively, is required.
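The missing-resource problem described above (test cases accepted without the XML document they depend on, and without a link to a requirement) is the kind of gap a submission check can catch. The following Python sketch shows one possible shape of such a check. It is not GATE's implementation: the record layout, the field names and the function are hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field


@dataclass
class SubmittedCase:
    """Hypothetical test case record; field names are illustrative only."""
    title: str
    steps: str
    expected_result: str
    srs_requirement_id: str = ""                    # traceability link to the SRS
    resources: list = field(default_factory=list)   # e.g. attached XML input files


def missing_items(case, needs_resource=True):
    """Return the names of required items that are absent from a submission."""
    problems = []
    for name in ("title", "steps", "expected_result", "srs_requirement_id"):
        if not getattr(case, name).strip():
            problems.append(name)
    if needs_resource and not case.resources:
        problems.append("resources")
    return problems


complete = SubmittedCase("Well-formed check", "Load file", "Accepted",
                         "SRS-1", ["sample.xml"])
incomplete = SubmittedCase("Well-formed check", "Load file", "Accepted")

print(missing_items(complete))    # nothing missing
print(missing_items(incomplete))  # lists the absent requirement link and resource
```

A tool that refuses or flags a submission whenever this list is non-empty would have surfaced the omitted XML documents at entry time rather than during coverage assessment.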
In experiment 1, GATE enhanced the performance and analysis of the testing process. In experiment 2, the outcome was reversed. Some of the test cases deposited by testers were incomplete and unusable. No sense of direction or final plan was observable, and no formal plan was visible. Recording of white-box test cases seemed unnatural and tester inexperience had more of an effect than anticipated. The original GATE provided an on-demand web service to global software developers and testers. It collected data that met or exceeded the published standards. Observing the difficulty even with a simple handover, an enhanced GATE's goals include the former goals plus support for the testing process by directing the process through the mechanism of a test plan. The envisioned test plan will ensure traceability and accountability. Tester inexperience will be aided by interactive step-by-step tutorials linked throughout GATE. GATE will be the gatekeeper for the completeness and consistency of testing input. For experiment 3, GATE was redesigned to incorporate a guided testing process and address some of the issues identified in experiment 2. New tutorials were also created.

6. Experiment 3: Independent Black-Box and White-Box Testing of WebCalendar

The third SUT selected, WebCalendar, is an open-source product. It is a PHP-based calendar application that can be configured as a single-user calendar. The application can also be configured as a multi-user calendar for groups of users and for the scheduling of events viewable by visitors. WebCalendar was selected because most student testers would have had some previous experience with a calendar, a stable version existed with a web interface and it was developed in PHP. PHP source code was important for this experiment because we had developed an extensive set of testing tools for PHP while developing GATE itself, and these tools would become part of the framework for white-box testing.

This experiment was conducted to determine the effect of dividing the testing process by type (black-box and white-box) using multiple teams on test cases and to evaluate the additional features of the testing process in GATE. The pool of available students to participate as student testers was shrinking, since many of the current students had already taken the seminar class. Thus experiment 3 consisted of three testing teams: one white-box testing team with four members at BSU and two black-box testing teams each with three members at UL. Testing spanned only two weeks, with the first week used for the development of each team's test plan for this experiment and the second week for actual testing. Before the actual testing time block, the BSU team was sharpening their PHP skills and creating additional testing tools to be used for this experiment. One of the members of the BSU team had previous PHP development and testing experience within another open source project.

Figure 8. Distribution of all test cases by WebCalendar functionality

The functionality of WebCalendar was assessed and three functional categories were identified. These categories are listed on the horizontal axis in Figure 8. The diagonal line illustrates the close to ideal distribution of test cases over these categories, based on category size and functionality importance. As before, we hoped that the distribution of test cases would correspond. Note that the aggregation of test cases for all groups, as given by the histogram in Figure 8, was aligned with the diagonal line. For experiment 3, all of the test cases passed and none were duplicated.

Figure 9 displays the individual teams' test case distribution based on the same three functional categories. No one team covered all areas of functionality with their test cases. Moreover, not all teams' test cases followed our desired pattern. Again, as in the previous two testing experiments (rGrade and <jeXML>), the interesting outcome is that the union of test cases from all of the teams gives us the desired coverage better than any single team.
Figure 9. Distribution of Teams' Test Cases by WebCalendar Functionality

Although the total number of test cases reported in this third experiment was small (partly due to the number of test teams), the outcome confirmed our previous findings that multiple teams increase the testing coverage of the functionality of the SUT with little duplication of test cases. Student testers kept testing journals and in this experiment communication among test teams emerged. This was also evident in the number of emails exchanged and the testing roles that each member assumed (lead tester, test case designer, auditor, etc.). Inserting communication and collaboration tools within the GATE framework had a definite advantage.

An unexpected side effect during experiment 3 was the high number of times each student launched the GATE tutorial. By incorporating additional testing guidance and tester roles, the user interface became more complex. We have designed a powerful system for a distributed testing process, but it became apparent that a simpler interface was required for the student audience. The interface must be intuitive and easy to learn in a short period of time and each testing task must be directly visible and accomplishable within a few steps. GATE's interface requires a redesign and similar tester roles must be merged to reduce complexity.

7. Conclusions

Developers know that testing is one of the most important phases in the development cycle. The product must overcome testing to become a stable trustworthy product. The tester must deliver a rigorous test plan and implement it. Performing testing in a GSD environment can add an extra layer of complexity and effort to the development process. In the experience with the student projects, GSD has some very promising aspects but also some daunting traps.

In our testing experimentation, no one team covered all areas of functionality with their test cases. Moreover, not all teams' test cases followed a desired testing pattern. The interesting point is that the union of test cases from all of the teams gave the desired coverage better than any single team. When considering all of the test cases for the three experiments, 89% of the test cases passed, while 9% failed and 2% were blocked. Of the test cases that passed, 10% were duplicated. This suggests that the obvious functionality that testers focused on had overlapping test cases that performed correctly. The blocked test cases show no duplicates. Finally, the failed test cases show virtually no duplicates (only 2), suggesting that the additional testing effort of multiple teams paid off.

The varied skills and background of the testers led to creating test cases that covered the functionality of the SUT with little overlap and with very little administration. In theory, the point of outsourcing is that it does not use in-house resources. Apart from the effort of packaging the executable, the operational profile and manuals, and installation of the trouble reporting software, the independent test teams did not require further resources. This addresses both the compatibility issues and the frequent complaint surfacing from "internal" groups that resources (time) are not available. Independent test teams provide an independent third-party perspective and the entire program is judged against the same standard with no personal
investment to protect. As such, multiple independent test teams may uncover defects that previous testing missed.

This project emphasized not only the evaluation of the final outcome of the testing process but also reviewed the technical needs for a successful distributed environment. Through GATE, we supported the various cooperative work forms with appropriate technology, with the objective of designing computer-based technologies for cooperative work settings within software testing. GATE was not perfect, and lessons were learned through the experimentation. Perhaps these lessons are the basis of the feedback on GATE which mimics the Goldilocks paradigm: the first version of GATE did not have enough support, and the second version contained too much support. Perhaps the "just right" environment is in the near future.

8. Acknowledgements

This research has been supported by the Science Foundation Ireland Investigator Programme, B4-STEP (Building a Bi-Directional Bridge Between Software ThEory and Practice), the Irish Software Engineering Research GSD for SME cluster project, and the National Science Foundation Grant EEC-0423930 (Reducing the Time to Product Stability Through Global Testing). The authors also thank their colleagues at UL for incorporating the software testing experiences into their classes.

9. References

[1] Ebert, C. and P. DeNeve (2001). "Surviving Global Software Development." IEEE Software 18(2): 62-69.

[2] Grinter, R. E., J. D. Herbsleb, et al. (1999). The Geography of Coordination: Dealing with Distance in R&D Work. International Conference on Supporting Group Work.

[3] Herbsleb, J. D. and R. E. Grinter (1999). Splitting the Organisation and Integrating the Code: Conway's Law Revisited. 21st International Conference on Software Engineering, Los Angeles, California, United States, IEEE Computer Press.

[4] Richardson, I., V. Casey, D. Zage, and W. Zage, "Global Software Development – the Challenges," Software Engineering Research Center TR-278, September 2005.

[5] Zage, D., W. Zage, and C. Wilburn, "Test Management and Process Support for Virtual Teams," Software Engineering Research Center TR-271, April 2005.