Performance Analysis of Leading Application Lifecycle Management Systems for Large Customer Data Environments


Paul Nelson
Director, Enterprise Systems Management, AppliedTrust, Inc.
paul@appliedtrust.com

Dr. Evi Nemeth
Associate Professor Attendant Rank Emeritus, University of Colorado at Boulder
Distinguished Engineer, AppliedTrust, Inc.
evi@appliedtrust.com

Tyler Bell
Engineer, AppliedTrust, Inc.
tyler@appliedtrust.com

AppliedTrust, Inc., 1033 Walnut St, Boulder, CO 80302, (303) 245-4545

Abstract

The performance of three leading application lifecycle management (ALM) systems (Rally by Rally Software, VersionOne by VersionOne, and JIRA+GreenHopper by Atlassian) was assessed to draw comparative performance observations when customer data exceeds a 500,000-artifact threshold. The focus of this performance testing was how each system handles a simulated “large” customer (i.e., a customer with half a million artifacts). A near-identical representative data set of 512,000 objects was constructed and populated in each system in order to simulate identical use cases as closely as possible. Timed browser testing was performed to gauge the performance of common usage scenarios, and comparisons were then made. Nine tests were performed based on measurable, single-operation events.

Rally emerged as the strongest performer based on the test results, leading outright in six of the nine that were compared. In one of these six tests, Rally tied with VersionOne from a scoring perspective in terms of relative performance (using the scoring system developed for comparisons), though it led from a raw measured-speed perspective. In one test not included in the six, Rally tied with JIRA+GreenHopper from a numeric perspective and within the bounds of the scoring model that was established. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but as noted, tied with Rally for one. JIRA+GreenHopper was almost an order of magnitude slower than peers when performing any test that involved its agile software development plug-in. All applications were able to complete the tests being performed (i.e., no tests failed outright). Based on the results, Rally and VersionOne, but not JIRA+GreenHopper, appear to be viable solutions for clients with a large number of artifacts.
1. Introduction

As the adoption of agile project management has accelerated over the last decade, so too has the use of tools supporting this methodology. This growth has resulted in the accumulation of artifacts (user stories, defects, tasks, and test cases) by customers in their ALM system of choice. The trend is for data stored in these systems to be retained indefinitely, as there is no compelling reason to remove it, and often, product generations are developed and improved over significant periods of time. In other cases, the size of specific customers and ongoing projects may result in very rapid accumulation of artifacts in relatively short periods of time. Anecdotal reports suggest that an artifact threshold exists around the 500,000 artifact point, and this paper seeks to test that observation.

This artifact scaling presents a challenge for ALM solution providers, as customers expect performance consistency in their ALM platform regardless of the volume of the underlying data. While it is certainly possible to architect ALM systems to address such challenges, there are anecdotal reports that some major platforms do not currently handle large projects in a sufficient manner from a performance perspective.

This paper presents the results of testing performed in August and September 2012, recording the performance of Rally Software, VersionOne, and JIRA+GreenHopper, and then drawing comparative conclusions between the three products. Atlassian’s ALM offering utilizes its JIRA product and extends it to support agile project management using the GreenHopper functionality extension (referred to in this paper as JIRA+GreenHopper). Rally Build 7396, VersionOne 12.2.2.3601, and JIRA 5.1 with GreenHopper 6 were the versions that were tested.

The tests measure the performance of single-user, single-operation events when an underlying customer data set made up of 500,000 objects is present. These tests are not intended to be used to draw conclusions regarding other possible scenarios of interest, such as load, concurrent users, or other tests not explicitly described.

The fundamental objective of the testing is to provide some level of quantitative comparison for user-based interaction with the three products, as opposed to system- or service-based interaction.

2. Data Set Construction

The use of ALM software and the variety of artifacts, custom fields, etc., will vary significantly between customers. As a result, there is not necessarily a “right way” to structure data for test purposes. More important is that fields contain content that is similarly structured to real data (e.g., text in freeform text fields, dates in date fields), and that each platform is populated with the same data. In some cases, product variations prevented this. Rally, for example, does not use the concept of an epic, but rather a hierarchical user story relationship, whereas VersionOne supports epics.

Actually creating data with unique content for all artifacts would be infeasible for testing purposes. To model real data, a structure was chosen for a customer instance based on 10 unique projects. Within each project, 40 epics or parent user stories were populated, and 80 user stories were created within each of those. Associated with each user story were 16 artifacts: 10 tasks, four defects, and two test cases.
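To make the arithmetic of this structure explicit, the short Python sketch below (the constant names are illustrative, not taken from the actual test tooling) reproduces the counts just described.

```python
# Symmetric data model described above: 10 projects, 40 epics (or parent stories)
# per project, 80 user stories per epic, and 16 child artifacts per story
# (10 tasks, 4 defects, 2 test cases).
PROJECTS = 10
EPICS_PER_PROJECT = 40
STORIES_PER_EPIC = 80
TASKS_PER_STORY, DEFECTS_PER_STORY, TESTS_PER_STORY = 10, 4, 2

stories = PROJECTS * EPICS_PER_PROJECT * STORIES_PER_EPIC                     # 32,000 user stories
children_per_story = TASKS_PER_STORY + DEFECTS_PER_STORY + TESTS_PER_STORY   # 16
core_artifacts = stories * children_per_story                                 # 512,000

print(f"{stories:,} user stories with {core_artifacts:,} child artifacts")
```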
In terms of core artifact types, the product of these counts is 16 * 80 * 40 * 10, or 512,000. All platforms suffered from difficulties related to data population. This manifested in a variety of ways, including imports “freezing,” data being truncated, or data being mismapped to incorrect fields. Every effort was made to ensure as much data consistency between data uploads as possible, but there were slight deviations from the expected norm. This was estimated to be no more than 5%, and where there was missing data, supplementary uploads were performed to move the total artifact count closer to the 512,000 target. In addition, tests were only performed on objects that met consistency checks (i.e., the same field data).

These symmetrical project data structures are not likely to be seen in real customer environments. The numbers of parent objects and child objects will also vary considerably. That being said, a standard form is required to allow population in three products and to enable attempts at some level of data consistency. Given that the structure is mirrored as closely as possible across each product, the performance variance should be indicative of observed behaviors in other customer environments regardless of the exact artifact distributions.

Custom fields are offered by all products, and so a number of fields were added and populated to simulate their use. Five custom fields were added to each story, task, defect, and test case; one was a Boolean true/false, two were numerical values, and two were short text fields.

The data populated followed the schema specified by each vendor’s documentation. We populated fields for ID, name, description, priority, and estimated cost and time to complete. The data consisted of dates and times, values from fixed lists (e.g., the priority field with each possible value used in turn), references to other objects (parent ID), and text generated by a lorem ipsum generator. This generator produces text containing real sentence and paragraph structures, but random strings as words. A number of paragraph size and content blocks were created, and their use was repeated in multiple objects. The description field of a story contained one or two paragraphs of this generated text. Tasks, defects, and tests used one or two sentences. If one story got two paragraphs, then the next story would get one paragraph, and so on in rotation. This data model was used for each system.

It is possible that one or more of the products may be able to optimize content retrieval with an effective indexing strategy, but this advantage is implementable in each product. Only JIRA+GreenHopper prompted the user to initiate indexing operations, and based on that prompted instruction, indexing was performed after data uploads were complete.

3. Data Population

Data was populated primarily by using the CSV import functionality offered by each system. This process varied in the operation sequence and chunking mechanism for uploads, but fundamentally was based on tailoring input files to match the input specifications and uploading a sequence of files. Out of necessity, files were uploaded in various-sized pieces related to input limits for each system. API calls and scripts were used to establish relationships between artifacts when the CSV input method did not support or retain these relationships. We encountered issues with each vendor’s product in importing such a large data set, which suggests that customers considering switching from one product to another should look carefully at the feasibility of loading their existing data. Some of our difficulty in loading data involved the fact that we wanted to measure comparable operations, and the underlying data structures made this sometimes easy, sometimes nearly impossible.
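Before turning to the product-specific import issues, the sketch below gives a rough, hypothetical illustration of the kind of generator described above: real sentence and paragraph structure, random strings as words, and one or two paragraphs rotated between successive stories. The function names and size ranges are assumptions for illustration, not the generator actually used in the study.

```python
import random
import string

rng = random.Random(42)  # fixed seed so the same content blocks can be reused for each system

def word():
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(rng.randint(3, 9)))

def sentence():
    return " ".join(word() for _ in range(rng.randint(6, 14))).capitalize() + "."

def paragraph():
    return " ".join(sentence() for _ in range(rng.randint(3, 6)))

def story_description(story_index):
    # Alternate one and two paragraphs between successive stories, as described above.
    count = 1 if story_index % 2 == 0 else 2
    return "\n\n".join(paragraph() for _ in range(count))

def child_description():
    # Tasks, defects, and test cases receive one or two sentences.
    return " ".join(sentence() for _ in range(rng.randint(1, 2)))
```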
4. JIRA+GreenHopper Data Population Issues

We had to create a ‘Test Case’ issue type in the JIRA+GreenHopper product and use what is known in the JIRA+GreenHopper community as a bug to keep track of the parent-child hierarchy of data objects. Once this was done, the data loaded quite smoothly using CSV files and its import facility until we reached the halfway point, when the import process slowed down considerably. Ultimately, the data import took two to three full days to complete.

5. Rally Data Population Issues

Rally limits the size of CSV files to 1000 lines and 2.097 MB. It also destroys the UserStory/SubStory hierarchy on import (though it presents it on export). These limitations led to a lengthy and tedious data population operation. Tasks could not be imported using the CSV technique. Instead, scripting was used to import tasks via Rally’s REST API interface. The script was built using Pyral, a library released by Rally for quick, easy access to its API from the Python scripting language. The total data import process took about a week to complete.

6. VersionOne Data Population Issues

VersionOne did not limit the CSV file size, but warned that importing more than 500 objects at a time could cause performance issues. This warning was absolutely true. During import, our VersionOne test system was totally unresponsive to user operations. CSV files of 5000 lines would lock it up for hours, making data population take over a week of 24-hour days.
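As noted in the Rally section above, tasks had to be loaded through Rally’s REST API with a script built on Pyral rather than through CSV. The sketch below shows roughly what such a loader can look like; the connection details and field names (in particular, linking a task to its parent story through a WorkProduct reference) are assumptions for illustration, not the script used for this data population.

```python
from pyral import Rally  # Pyral: Rally's Python toolkit for its REST API

# Connection details below are placeholders, not the instance used in the study.
rally = Rally("rally1.rallydev.com", "user@example.com", "password",
              workspace="Test Workspace", project="Project 01")

def add_tasks(story, count=10):
    """Create `count` tasks beneath an existing user story."""
    for i in range(1, count + 1):
        task = {
            "Name": f"{story.Name} - Task {i}",
            "Description": "Generated task description.",
            "Estimate": 4,
            "WorkProduct": story.ref,  # assumed: link the task to its parent story by reference
        }
        rally.create("Task", task)

# Walk the stories returned by a query and attach their tasks.
for story in rally.get("HierarchicalRequirement", fetch="Name,ObjectID"):
    add_tasks(story)
```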
7. Testing Methodology

A single test system was used to collect test data in order to limit bias introduced by different computers and browser instances. The test platform was a Dell Studio XPS 8100 running Microsoft Windows 7 Professional SP1 64-bit, and the browser used to perform testing was Mozilla Firefox v15.0.1. The Firebug add-on (v1.10.3) was used to collect test metrics. Timing data was recorded in a data collection spreadsheet constructed for this project. While results are expected to vary if using other software and version combinations, using a standardized collection model ensured a consistent, unbiased approach to gathering test data for this paper, and will allow legitimate comparisons to be made. It is expected that while the actual timing averages may differ, the comparisons will not.

At the time measurements were being taken, the measurement machine was the only user of our instance of the software products. All tests were performed using the same network and Internet connection, with no software updates or changes between tests. To ensure there were no large disparities between response times, an http-ping utility was used to measure roundtrip response times to the service URLs provided by each system. Averaged response times over 10 http-ping samples were all under 350 milliseconds and within 150 milliseconds of each other, suggesting connectivity and response are comparable for all systems. JIRA+GreenHopper had an average response time of 194 milliseconds, Rally 266, and VersionOne 343. All tests were performed during US MDT business hours (8 a.m. – 5:30 p.m.).

It is noted that running tests in a linear manner does introduce the possibility of performance variation due to connectivity variations between endpoints, though these variations would be expected under any end-user usage scenario and are difficult, if not impossible, to predict and measure.

Tests and data constructs were implemented in a manner to allow apples-to-apples comparison with as little bias and potential benefit to any product as possible. However, it should be noted that these are three different platforms, each with unique features. In cases where a feature exists on only one or two of the platforms, that element was not tested. The focus was on the collection of core tests described in the test definition table in the next section.

The time elapsed from the start of the first request until the end of the last request/response was used as the core time metric associated with a requested page load when possible. This data is captured with Firebug, and an example is illustrated below for a VersionOne test.

[Figure: Example of timing data collection for a VersionOne test.]

We encountered challenges timing pages that perform operations using asynchronous techniques to update or render data. Since we are interested in when the results of operations are visible to the user, timing only the asynchronous call that initiates the request provides little value from a testing perspective. In cases where no single time event could be used, timing was performed manually. This increased the error associated with the measurement, and this error is estimated to be roughly one second or less. Where manual measurements were made, this is indicated in the result analysis. A stopwatch with 0.1-second granularity was used for all manually timed tests, as were two people: one running the test with start/stop instruction and the other timing from those verbal cues.

It is acknowledged that regardless of the constraints imposed here to standardize data and tests for comparison purposes, there may be deviations from performance norms due to the use of simulated data, either efficiencies or inefficiencies. Bias may also be introduced in one or more products based on the testing methodology employed. While every effort was made to make tests fair and representative of legitimate use cases, it is recognized that results might vary if a different data set were used. Further, the testing has no control over localized performance issues affecting the hosted environments from which the services are provided. If testing results in minor variance between products, then arguably some of this variance could be due to factors outside of the actual application.

The enterprise trial versions were used to test each system. We have no data regarding how each service handles trial instances; it is possible that the trial instances differ from paid subscription instances, but based on our review and the trial process, there was no indication the trial version was in any way different. We assume that providers would not intentionally offer a performance-restricted instance for trial customers, given that their end goal would be to convert those trial customers to paying subscribers.
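The core page-load metric described above is the span from the start of the first request to the end of the last request/response, as captured by Firebug’s network timings. If such a capture is exported in HAR format (an assumption; the study recorded values into a spreadsheet by hand), that span could be computed as in the following sketch.

```python
import json
from datetime import datetime

def page_load_span_seconds(har_path):
    """Elapsed seconds from the start of the first request to the end of the
    last request/response recorded in a HAR network capture."""
    with open(har_path) as f:
        entries = json.load(f)["log"]["entries"]
    starts, ends = [], []
    for entry in entries:
        # "startedDateTime" is ISO 8601; "time" is the entry's total duration in milliseconds.
        stamp = entry["startedDateTime"].replace("Z", "+00:00")
        started = datetime.fromisoformat(stamp).timestamp()
        starts.append(started)
        ends.append(started + entry["time"] / 1000.0)
    return max(ends) - min(starts)
```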
Based on a per-instance calibration routine, the decision was made to repeat each test 10 times per platform. A comparison between a 10-test and a 50-test sample was performed for one test case (user story edit) per platform to ensure the standard deviation between respective tests was similar enough to warrant the use of a 10-test sample. In no case was the calibration standard deviation greater than one second. If the performance differences between applications are found to be of a similar order of magnitude (i.e., seconds), then the use of a 10-test sample per application should clearly be questioned. However, if the overriding observation is that each application performs within the same small performance range as the others, the nuances of sample size calculation are rendered insignificant.

A more in-depth sample sizing exercise could also be performed, and could realistically be performed per test. However, it is already recognized that there are numerous factors beyond the control of the tests, to the extent that further increasing the sample size would offer little value given the relatively consistent performance observed during calibration.

To help reduce as many bandwidth and geographic distance factors as possible, the client browser cache was not cleared between tests. This also better reflects real user interaction with the systems. A single pretest run for every test was performed to allow client-side object caching; so in fact, each test was executed 11 times, but only results 2 through 11 were analyzed. Based on the belief that the total artifact count is the root cause of scalability issues, allowing caching should eliminate some of the variation due to factors that cannot be controlled by the test.

The use of attachments was not tested. This was identified as more of a bandwidth and load test, as opposed to a test of the performance of the system in a scalability scenario.

8. Test Descriptions

Tests were constructed based on common uses of ALM systems. Timing data was separated into discrete operations when sequences of events were tested. These timings were compared individually, as opposed to in aggregate, in order to account for interface and workflow differences between products.

There may be tests and scenarios that could be of interest but were not captured, either because they were not reproducible in all products or because they were not identified as common operations. Also, it would be desirable in future tests to review the performance of logical relationships (complex links between iterations/sprints and other artifacts, for example). The core objective when selecting these tests was to enable comparison for similar operations between systems. The tests were defined as follows (test name, then description/purpose):

1. Refresh the backlog for a single project. The backlog page is important to both developers and managers; it is the heart of the systems. Based on variance in accessing the backlog, the most reliable mechanism to test was identified as a refresh of the backlog page. Views were configured to display 50 entries per page.
2. Switch backlog views between two projects. A developer working on two or more projects might frequently swap projects. Views were configured to display 50 entries per page.
3. Paging through backlog lists. With our large data sets, navigation of large tables can become a performance issue. Views were configured to display 50 entries per page.
4. Select and view a story from the backlog. Basic access to a story.
5. Select and view a task. Basic access to a task.
6. Select and view a defect/bug. Basic access to a defect or bug. (Note: JIRA+GreenHopper uses the term bug, while Rally and VersionOne use defect.)
7. Select and view a test. Basic access to a test case.
8. Create an iteration/sprint. Common management chore. (Note: This had to be manually timed for JIRA+GreenHopper, as measured time was about 0.3 seconds while elapsed time was 17 seconds.)
9. Move a story to an iteration/sprint. Common developer or manager chore. (Note: JIRA+GreenHopper and VersionOne use the term sprint, while Rally uses iteration.)
10. Convert a story to a defect/bug. Common developer chore. (Note: This operation is not applicable to Rally because of the inherent hierarchy between a story and its defects.)

9. Test Results

Each test was performed 1+10 times in sequence for each software system, and the mean and standard deviation were computed. The point estimates were then compared to find the fastest performing application. A +n (seconds) indicator was used to indicate the relative performance lag of the other applications from the fastest performing application for that test.

The test result summary table illustrates the relative performance for each test to allow observable comparisons per product and per test. In order to provide a measurement-based comparison, a scale was created to allow numerical comparison between products. There were no cases where the leader in a test performed badly (subjectively). As such, the leader in a test is given the “Very Good” rating, which corresponds to five points. The leading time is then used as a base for comparative scoring of competitors for that test, with each test score based on how many multiples it was of the fastest performer. The point legend table is illustrated below.

Time Multiple          Points
1.0x ≤ time < 1.5x     5
1.5x ≤ time < 2.5x     4
2.5x ≤ time < 3.5x     3
3.5x ≤ time < 4.5x     2
4.5x ≤ time            1
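The point legend translates each product’s mean time, expressed as a multiple of the fastest time for that test, into a 1–5 score. A direct transcription of that mapping:

```python
def points(mean_time, fastest_time):
    """Translate a product's mean time for one test into the 1-5 point scale
    defined by the legend above (time expressed as a multiple of the fastest)."""
    multiple = mean_time / fastest_time
    if multiple < 1.5:
        return 5  # Very Good
    if multiple < 2.5:
        return 4  # Good
    if multiple < 3.5:
        return 3  # Acceptable
    if multiple < 4.5:
        return 2  # Poor
    return 1      # Very Poor

# Example using the Test 1 means reported below (fastest was VersionOne at 3.14 s):
# points(5.53, 3.14) -> 4 and points(15.27, 3.14) -> 1
```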
Test Result Summary Table (Relative Performance Analysis)

Legend: Very Good (5), Good (4), Acceptable (3), Poor (2), Very Poor (1)
Tests 1–9: Backlog Refresh, Switch Backlog, Backlog Paging, View Story, View Task, View Defect, View Test, Create Sprint, Story → Sprint

System               Overall Test Rating Summary (out of 45)
Rally                43
VersionOne           32
JIRA+GreenHopper     18

It must be noted that the resulting means are point-estimate averages. For several reasons, we don’t suggest or use confidence intervals or test for significance. Based on the challenges associated with structuring common tests with different interfaces, different data structures, and no guarantee of connection quality, it is extraordinarily difficult to do so. In addition, because each test may have a different weight or relevance to each customer depending on their ALM process, the relevance of a test leader should be weighted according to the preference of the reader. That being said, these tests are intended to reflect the user experience.

To address some of the concerns associated with point estimates, analysis of high and low bounds based on one and two standard deviations was performed. If the high bound for the fastest test overlaps with the low bound for either of the slower performing application tests, the significance of the performance gain between those comparisons is questionable. The overlap suggests there will be cases where the slower (overlapping) application may perform faster than the application with the fastest response time. Statistical theory and the three-sigma rule suggest that when data is normally distributed, roughly 68% of observations should lie within one standard deviation of the mean (symmetrically distributed), and 95% should lie within two standard deviations. We graphically tested for normality using our calibration data and observed our data to be normally distributed. When there is no overlap between timings at two standard deviations, this implies it will be fairly rare for one of the typically slower performing applications to exceed the performance of the faster application (for that particular test). If there is no overlap at one or two standard deviations between the lower and upper bounds, the result is marked as “Significant.” If there is overlap in one or both cases, that result is flagged as “Insignificant.” Significance is assessed between the fastest performing application for the test and each of the other two applications. Therefore, the significance analysis is only populated for the application with the fastest point estimate, and the advantage is classed as insignificant if overlap with the closest performing peer implies the result is insignificant. All data values are in seconds.

Results from each test are analyzed separately below. The results of each test are shown both in table form with values and in bar graph form, and are also interpreted in the text below the corresponding table. Note that long bars in the comparison graphs are long response times, and therefore bad.
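The per-test analysis just described (discard the warm-up run, compute the mean and standard deviation of the remaining ten runs, form one- and two-standard-deviation ranges with the lower bound clamped at zero, and flag the fastest product’s lead at each level as significant only when its range does not overlap the corresponding ranges of the slower products) can be summarized in a short sketch; the function names are illustrative.

```python
from statistics import mean, stdev

def summarize(raw_runs):
    """raw_runs holds the 11 timings for one test; the first (cache warm-up)
    run is discarded and the remaining 10 are analyzed."""
    runs = raw_runs[1:]
    m, s = mean(runs), stdev(runs)
    return {
        "mean": m,
        "sd": s,
        "range_1sd": (max(0.0, m - s), m + s),          # lower bound clamped at zero
        "range_2sd": (max(0.0, m - 2 * s), m + 2 * s),
    }

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def verdict(fastest, others, key="range_2sd"):
    """'Significant' only when the fastest product's range at this level does
    not overlap the corresponding range of either slower product."""
    if any(overlaps(fastest[key], other[key]) for other in others):
        return "Insignificant"
    return "Significant"
```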
Test 1: Refresh Backlog Page for a Single Project

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper   15.27   1.38       +12.13      13.89 – 16.64   -              12.52 – 18.02   -
Rally               5.53   0.29        +2.39       5.24 – 5.81    -               4.95 – 6.10    -
VersionOne          3.14   0.25       Fastest      2.88 – 3.39    Significant     2.63 – 3.64    Significant

Interpretation: The data indicates that for this particular task, even when accounting for variance in performance, VersionOne performs fastest. Note that the advantage is relatively small when compared to Rally, though the Rally point estimate does lag by almost 2.4 seconds. Both VersionOne and Rally perform significantly better than JIRA+GreenHopper when executing this operation.

Best Performer: VersionOne
Test 2: Switch Backlog Views Between Two Projects

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper   13.84   0.83       +11.39      13.01 – 14.66   -              12.19 – 15.49   -
Rally               2.45   0.16       Fastest      2.29 – 2.60    Significant     2.13 – 2.76    Significant
VersionOne          2.94   0.07        +0.49       2.87 – 3.01    -               2.79 – 3.08    -

*To perform this operation on JIRA+GreenHopper, the user must navigate between two scrumboards and then load the data. Therefore, the timing numbers for JIRA+GreenHopper are the sum of two measurements. This introduces request overhead not present in the other two tests, yet the disparity suggests more than just simple transaction overhead is the cause of the delay. Furthermore, the resulting page was rendered frozen and was not usable for an additional 10 – 15 seconds. Users would probably include that additional delay before the page could be accessed in their impression of the user experience, but it was not included here.

Interpretation: The data indicates that Rally and VersionOne are significantly faster than JIRA+GreenHopper, even when considering the sum of two operations. Rally is faster than VersionOne, though marginally so. In terms of user interaction, the experience would be similar for the two products.

Best Performer: Rally
Test 3: Paging Through Backlog List

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap    2 SD Range      2 SD Overlap
JIRA+GreenHopper    1.53   0.66       Fastest      0.87 – 2.19    Insignificant    0.21 – 2.85    Insignificant
Rally               1.93   0.11        +0.40       1.81 – 2.04    -                1.70 – 2.15    -
VersionOne          3.45   0.29        +1.92       3.16 – 3.74    -                2.87 – 4.04    -

Interpretation: JIRA+GreenHopper had the fastest point-estimate mean, but the analysis suggests there is minimal (not significant) improvement over Rally, which was the second-fastest. The standard deviations suggest a wider performance variance for JIRA+GreenHopper, and so while the point estimate is better, the overall performance is likely to be comparable. The data indicates that VersionOne is significantly slower than the other two systems, and for very large data sets like those used in these tests, this makes scrolling through the data quite tedious.

Best Performer: JIRA+GreenHopper and Rally
Test 4: Selecting and Viewing a User Story From the Backlog

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper    3.49   0.99        +2.95       2.49 – 4.48    -               1.50 – 5.47    -
Rally               0.53   0.07       Fastest      0.46 – 0.60    Significant     0.40 – 0.67    Significant
VersionOne          1.90   0.30        +1.36       1.59 – 2.20    -               1.29 – 2.50    -

Interpretation: The data indicates that Rally is significantly faster than either JIRA+GreenHopper or VersionOne. While the result is significant, the one-second difference between Rally and VersionOne is not likely to have a significant impact on the user experience. Rally’s performance is also more consistent than the other two products (i.e., it has a much lower response standard deviation).

Best Performer: Rally
Test 5: Selecting and Viewing a Task

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper    1.36   0.17        +0.92       1.20 – 1.53    -               1.03 – 1.69    -
Rally               0.44   0.03       Fastest      0.42 – 0.47    Significant     0.39 – 0.50    Significant
VersionOne          1.46   0.16        +1.01       1.29 – 1.62    -               1.13 – 1.78    -

Interpretation: The data indicates that Rally is significantly (in the probabilistic sense) faster than either JIRA+GreenHopper or VersionOne by about one second, and also has a more consistent response time (with the lowest standard deviation). JIRA+GreenHopper and VersionOne showed similar performance. Overall, the result for all applications was qualitatively good.

Best Performer: Rally
Test 6: Selecting and Viewing a Test Case

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper    1.91   0.86        +1.37       1.05 – 2.77    -               0.19 – 3.64    -
Rally               0.54   0.13       Fastest      0.41 – 0.67    Significant     0.28 – 0.80    Insignificant
VersionOne          1.45   0.18        +0.91       1.27 – 1.62    -               1.09 – 1.80    -

Interpretation: The data indicates that, again, Rally is fastest in this task, though the speed differences are significant at the one standard deviation level, where there is no overlap in the respective timing ranges, but not at two standard deviations. Rally performed with the lowest point estimate and the lowest variance, suggesting a consistently better experience. VersionOne was second in terms of performance, followed by JIRA+GreenHopper.

Best Performer: Rally
Test 7: Selecting and Viewing a Defect/Bug

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range      2 SD Overlap
JIRA+GreenHopper    1.70   0.81        +1.02       0.88 – 2.51    -               0.07 – 3.32    -
Rally               0.68   0.05       Fastest      0.63 – 0.72    Significant     0.58 – 0.77    Insignificant
VersionOne          1.74   0.17        +1.06       1.56 – 1.91    -               1.39 – 2.08    -

Interpretation: The data indicates that Rally is faster by roughly one second based on the point-estimate mean when compared to the other two products, with the difference being significant at the one standard deviation level but not at two standard deviations. Variance in the results of the other products suggests they will perform similarly to Rally on some occasions, but not all. Rally’s performance was relatively consistent, as indicated by the very low standard deviation. Though the point estimates of JIRA+GreenHopper and VersionOne are very close, VersionOne’s performance is preferred based on its lower standard deviation. That being said, given that the point estimates are all below two seconds, there would be little to no perceptible difference between VersionOne and JIRA+GreenHopper from a user perspective.

Best Performer: Rally
Test 8: Add an Iteration/Sprint

System             Mean    Std. Dev.  Point Est.  1 SD Range       1 SD Overlap   2 SD Range       2 SD Overlap
JIRA+GreenHopper   17.76   0.60       +17.72      17.16 – 18.36    -              16.56 – 18.96    -
Rally               0.04   0.00       Fastest      0.04 – 0.05     Significant     0.03 – 0.05     Significant
VersionOne          1.36   0.10        +1.32       1.25 – 1.46     -               1.15 – 1.57     -

*Due to the disparity between Rally and JIRA+GreenHopper here, the graph appears to show no data for Rally. The graph resolution is simply insufficient to render the data clearly, given the large value generated by the JIRA+GreenHopper tests.

**The JIRA+GreenHopper data was manually measured due to inconsistencies in timing versus content rendering. Based on the requests, it appeared asynchronous page timings were completing when requests were submitted, and the eventual content updates and rendering were disconnected from the original request being tracked. While this increases the measurement error, it certainly would not account for a roughly 17-second disparity.

Interpretation: Rally is the fastest performer in this test, with the results being significant at both the one and two standard deviation levels. JIRA+GreenHopper is many times slower than both Rally and VersionOne.

Best Performer: Rally
Test 9: Move a Story to an Iteration/Sprint

System             Mean    Std. Dev.  Point Est.  1 SD Range      1 SD Overlap   2 SD Range       2 SD Overlap
JIRA+GreenHopper    9.80   6.88        +8.42       2.91 – 16.68   -              0.00* – 23.56    -
Rally               3.37   0.22        +1.99       3.15 – 3.59    -               2.94 – 3.80     -
VersionOne          1.38   0.36       Fastest      1.02 – 1.74    Significant     0.66 – 2.09     Insignificant

*The standard deviation range suggested a negative value, which is, of course, impossible. Therefore, 0.00 is provided.

Interpretation: The data indicates that VersionOne is fastest for this operation. The insignificant overlap at two standard deviations in this test is a result of the enormous standard deviation of the JIRA+GreenHopper tests.

Best Performer: VersionOne
Test 10: Convert a Story to a Defect/Bug

System             Mean    Std. Dev.  Point Est.  1 SD Range       1 SD Overlap   2 SD Range       2 SD Overlap
JIRA+GreenHopper   26.56   2.94       +24.87      23.62 – 29.50    -              20.68 – 32.44    -
Rally               1.69   0.25       Fastest      1.44 – 1.94     Significant     1.19 – 2.19     Significant
VersionOne          6.06   0.28        +4.36       5.77 – 6.34     -               5.49 – 6.62     -

*JIRA+GreenHopper required manual timing. See the interpretation below for an explanation.

Interpretation: This operation is an example of one in which the procedure in each system is completely different and perhaps not comparable in any reasonable way. In JIRA+GreenHopper, there are three operations involved (accessing the story, invoking the editor, and, after changing the type of issue, saving the changes and updating the database), and these had to be manually timed. In addition, the JIRA+GreenHopper page froze after the update for about 10 seconds while it updated the icon to the left of the new defect from a green story icon to a red defect icon. This extra 10 seconds was not included in the timing results, although perhaps it should have been. In Rally, defects are hierarchically below stories as one of a story’s attributes, and so a story cannot be converted to a defect, though defects can be promoted to stories. That is what we measured for Rally’s case. Finally, VersionOne has a menu option to do this task. The results, reported here just for interest and not defensible statistically, indicate that Rally is fastest at this class of operation, followed by VersionOne at +4 seconds and JIRA+GreenHopper at +24 seconds.

Best Performer: N/A – informational observations only.
10. Conclusions

Our testing was by no means exhaustive, but it was thorough enough to build a reasonably sized result set to enable comparison between applications. It fundamentally aimed to assess the performance of testable elements that are consistent between applications. We tried to choose simple, small tests that mapped well between the three systems and could be measured programmatically as opposed to manually (and succeeded in most cases, though some manual timing was required).

Rally was the strongest performer based on the test results, leading outright in six of the nine that were compared. In one of these six tests, Rally tied with VersionOne from a scoring perspective in terms of relative performance (using the scoring system developed for comparisons), though it led from a raw measured-speed perspective. In one test not included in the six, Rally tied with JIRA+GreenHopper from a numeric perspective and within the bounds of the scoring model that was established. VersionOne was the strongest performer in two of the nine tests, and exhibited very similar performance characteristics (generally within a 1 – 12 second margin) in many of the tests that Rally led. JIRA+GreenHopper did not lead any tests, but as noted, tied with Rally for one.

With the exception of backlog paging, JIRA+GreenHopper trailed in tests that leveraged agile development tools such as the scrumboard, which JIRA+GreenHopper implements with the GreenHopper plug-in. The GreenHopper overlay/add-on seemed unable to handle the large data sets effectively. When we tried to include a test of viewing the backlog for all projects, we were able to do so for Rally and VersionOne, but the JIRA+GreenHopper instance queried for over 12 hours without rendering the scrumboard and merged project backlog. Some object view operations resulted in second-best performance for JIRA+GreenHopper, but with the exception of viewing tasks, the variance associated with requests was extraordinarily high compared to Rally and VersionOne. The large variance will manifest to users as an inconsistent experience (in terms of response time) when performing the same operation.

Anecdotally, the performance of VersionOne compared to Rally was significantly degraded when import activity was taking place, to the extent that VersionOne becomes effectively unusable during import operations. Further testing could be performed to identify whether this is a CSV-limited import issue or whether it extends to programmatic API access as well. Given how many platforms utilize API access regularly, it would be interesting to explore this result further.

Both Rally and VersionOne appear to provide a reasonable user experience that should satisfy customers in most cases when the applications are utilizing large data sets with over 500,000 artifacts.

JIRA+GreenHopper is significantly disadvantaged from a performance perspective, and seems less suitable for customers with large artifact counts or with aggressive growth expectations. Factors such as user concurrency, variations in sprint structure, and numerous others have the potential to skew results in either direction, and it is difficult to predict how specific use cases may affect performance. These tests do, however, provide a reasonable comparative baseline, suggesting Rally has a slight performance advantage in general, followed closely by VersionOne.
References

A variety of references were used to help build and execute a performance testing methodology that would allow a reasonable, statistically supported comparison of the performance of the three ALM systems. In addition to the documentation available at the websites for each product, the following resources were used:

“Agile software development.” Wikipedia. Accessed Sept. 28, 2012 from http://en.wikipedia.org/wiki/Agile_software_development.

Beedle, Mike, et al. “Manifesto for Agile Software Development.” Accessed Sept. 28, 2012 from http://agilemanifesto.org.

Hewitt, Joe, et al. Firebug: Add-ons for Firefox. Mozilla. Accessed Sept. 28, 2012 from http://addons.mozilla.org/en-us/firefox/addon/firebug.

Honza. “Firebug Net Panel Timings.” Software is Hard. Accessed Sept. 28, 2012 from http://www.softwareishard.com/blog/firebug/firebug-net-panel-timings.

Peter. “Top Agile and Scrum Tools – Which One Is Best?” Agile Scout. Accessed Sept. 28, 2012 from http://agilescout.com/best-agile-scrum-tools.
