The dynamics of software evolution
                      EVOLUMONS 2011
             Research Seminar on Software Evolution

                           Université de Mons, Belgium
                               January 26th 2011

                                   Israel Herraiz
                           Universidad Alfonso X el Sabio
                                <isra@herraiz.org>
                                 <herraiz@uax.es>


http://www.uax.es http://herraiz.org
(c) 2011 Israel Herraiz
This work is licensed under the Creative Commons Attribution-Share Alike 3.0 license.
To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/
or send a letter to Creative Commons, 171 Second Street, Suite 300,
San Francisco, California, 94105, USA.

Get the full bibliographic references listed in these slides at
http://herraiz.org/stuff/evolumons_references_20110126.txt


Outline
     ●   The laws of software evolution
     ●   The nature of software evolution (for libre
         software)
     ●   How to accurately forecast software evolution.
         And why it works.
     ●   What's next?
     ●   And what did I learn during all these years of
         work?

The laws of software evolution




My background
     ●   Educated as a chemical and mechanical
         engineer
     ●   Wasted my time in the chemical industry. But I
         did (and do) love doing software!
               –   http://caflur.sf.net http://gpinch.sf.net
     ●   Involved in the open source community since
         around 2001, started a PhD in 2004 in the
         Libresoft research group
               –   http://libresoft.es

How it all started
     ●   Godfrey and Tu [GT00] [GT01] studied the evolution of the
         Linux kernel
     ●   They said that the laws of software evolution were not
         valid for Linux
               –   Laws of software evolution. What is that?
               –   It completely puzzled me
     ●   My supervisors and I wrote a paper on the topic [RAGBH05]
     ●   At the time, I thought it was just one more paper
     ●   It turned out to be our most cited paper
The topic background: Software evolution
     ●   How and why does software evolve?
     ●   Meir M. Lehman: laws of software evolution
     ●   “Program evolution. Processes of software change”,
         published in 1985

The laws in the seventies
     ●   Laws of Program Evolution Dynamics (1974) [Leh74] [Leh85b]

The evolution of the laws of software evolution
     [Figure: references in three groups, from earliest to latest:
      [Leh74] [Leh85b]; [Leh78] [Leh80] [Leh85c] [LB85];
      [Leh96] [LRW+97] [MFRP06]]

The laws in the present day (I – IV)

The laws in the present day (V – VIII)

Empirical studies of software evolution

     See “Empirical Studies of Open Source Evolution” by
     Juan Fernandez-Ramil, Angela Lozano, Michel Wermelinger, Andrea Capiluppi,
     in Tom Mens, Serge Demeyer (eds.), Software Evolution

Why the controversy about the laws of software evolution?
     ●   Fernandez-Ramil et al. found empirical validation in the
         literature for laws I, VI, VII (partially) and VIII (partially)
     ●   The most interesting part (for me)
               –   Statistical analysis of software projects and their
                   evolution, using time series analysis among other
                   techniques (suggested as early as 1974!) [Leh74] [Leh85b]
               –   “For maximum cost-effectiveness, management
                   consideration and judgement should include the entire
                   history of the project with the current state having the
                   strongest, but not exclusive, influence”
                   [Leh78] [Leh85c]

The nature of (libre) software evolution

The nature of (libre) software evolution
     ●   The goal is to develop a theoretical model for
         software evolution
     ●   Long-pursued goal
          ●   Lehman and Belady in 1971 [BL71] [LB85]
          ●   Woodside: progressive and anti-regressive work
              [Woo80] (included in [LB85])
          ●   Turski's models [Tur96] [Tur02]
               –   Growth is inversely proportional to complexity
               –   Complexity is proportional to the square of size
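
Taken together, the two Turski assumptions give a growth per release proportional to 1/size², so the system keeps growing but ever more slowly. A minimal numerical sketch, assuming a hypothetical constant `c` and one step per release (neither value comes from the slides):

```python
def turski_growth(initial_size, c, releases):
    """Iterate S[i+1] = S[i] + c / S[i]**2.

    Growth is inversely proportional to complexity, and complexity is
    taken as the square of the current size. The constant c is a
    made-up value for illustration.
    """
    if initial_size <= 0:
        raise ValueError("initial_size must be positive")
    sizes = [float(initial_size)]
    for _ in range(releases):
        current = sizes[-1]
        sizes.append(current + c / current ** 2)
    return sizes


if __name__ == "__main__":
    sizes = turski_growth(initial_size=100.0, c=1.0e6, releases=50)
    # The increments shrink as the system grows.
    print([round(s, 1) for s in sizes[:5]])
```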

More recent models
     ●   Self-Organized criticality [Wu06] [WHH07]
          ●   Power laws for the size of the system
          ●   Long range correlations in the time series of
              changes
     ●   Maintenance Guidance Model [CFR07]
          ●   Those functions that have suffered more changes in
              the past are more likely to be changed in the future
          ●   Assumptions:
               –   Distribution of accumulated changes is asymmetrical
               –   Developers prioritize changes using past number of
                   changes and complexity

Determinism and evolution
     ●   Self-organized criticality
          ●   This means that current events are influenced by
              very old events
          ●   Against Lehman's suggestions [Leh78] [Leh85c]
     ●   In my opinion, counterintuitive

Long range correlated processes
     [Figure sequence over three slides; the last one carries the label “Unreachable”]

Short range correlated
     [Figure sequence over four slides]

What is software evolution like?
     [Two figures offered as alternatives: one or the other?]

Autocorrelation coefficients
     [Figure: the series 1, 2, 3, 4, 5, ... is compared with a copy of
      itself shifted by one position to obtain r(1), by two positions
      to obtain r(2), and so on]

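
The diagram compares the series with lagged copies of itself. A minimal sketch of the usual sample estimator of r(k) follows, assuming the standard definition (covariance at lag k divided by the variance); the slides do not spell out the estimator, and the sample data are made up.

```python
def autocorrelation(series, k):
    """Sample autocorrelation coefficient r(k) of a list of numbers."""
    n = len(series)
    if not 0 <= k < n:
        raise ValueError("lag k must satisfy 0 <= k < len(series)")
    mean = sum(series) / n
    deviations = [x - mean for x in series]
    variance = sum(d * d for d in deviations)
    if variance == 0:
        raise ValueError("r(k) is undefined for a constant series")
    # Pair each observation with the one k positions later.
    covariance = sum(deviations[t] * deviations[t + k] for t in range(n - k))
    return covariance / variance


if __name__ == "__main__":
    changes = [3, 5, 4, 6, 8, 7, 9, 11, 10, 12]  # made-up numbers
    print([round(autocorrelation(changes, k), 3) for k in range(1, 4)])
```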
Autocorrelation coefficients
     [Figure: empty axes, r(k) from 0 to 1 against the lag k from 1 to 15]

Autocorrelation coefficients
     [Figure: r(k), from 0 to 1, against the lag k, from 1 to 15, with two curves]
     ●   Long range correlated: $r_k \sim k^{2d-1}$, $0 < d < 0.5$
     ●   Short range correlated (ARIMA process): $r_k \sim C^{1-k}$

Autocorrelation coefficients (logarithmic scale)
     [Figure: r(k) against the lag k in logarithmic scale, with two curves]
     ●   Long range correlated: $r_k \sim k^{2d-1}$, $0 < d < 0.5$
     ●   Short range correlated (ARIMA process): $r_k \sim A_i^{1-k}$

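
The two decay forms on the plot can be told apart by what stays constant along the lag axis: the ratio between consecutive coefficients for an exponential (short range) decay, and the slope in log-log scale for a power-law (long range) decay. A small sketch with exact values of both forms; the constants 1.5 and d = 0.3 are arbitrary choices for illustration, and the exponential is assumed to have a base greater than 1.

```python
import math


def power_law_acf(k, d):
    """Long range form: r(k) ~ k**(2d - 1), with 0 < d < 0.5."""
    return k ** (2 * d - 1)


def exponential_acf(k, c):
    """Short range form: r(k) ~ c**(1 - k), assuming c > 1."""
    return c ** (1 - k)


def consecutive_ratios(values):
    """r(k+1) / r(k); constant when the decay is exponential."""
    return [b / a for a, b in zip(values, values[1:])]


def log_log_slopes(lags, values):
    """Slope between consecutive points in log-log scale; constant for a power law."""
    points = list(zip(lags, values))
    return [
        (math.log(v2) - math.log(v1)) / (math.log(k2) - math.log(k1))
        for (k1, v1), (k2, v2) in zip(points, points[1:])
    ]


if __name__ == "__main__":
    lags = list(range(1, 16))
    long_range = [power_law_acf(k, d=0.3) for k in lags]
    short_range = [exponential_acf(k, c=1.5) for k in lags]
    print("log-log slopes, power law:", log_log_slopes(lags, long_range)[:3])
    print("ratios, exponential:", consecutive_ratios(short_range)[:3])
```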
Empirical study
     ●   3,821 software projects
               –   More than 3 developers
               –   More than 1 year of active history
               –   9,234,104 commits / 2,357,438 modification requests
               –   Projects registered between Nov. 1999 and Dec. 2004
               –   Datasets publicly available
     ●   See “Determinism and evolution”
               –   5th International Working Conference on
                   Mining Software Repositories (MSR 2008)
     ●   Data sources: FLOSSMole + CVSAnalY-SF

Methodology
     ●   Linear correlation to measure linearity
     ●   Distribution of the Pearson coefficients
     ●   Smoothing applied to the series before
         calculating the ACF
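
One way to read the linearity measure is as the Pearson coefficient between the lag and the autocorrelation coefficient at that lag. This is only an illustration: the slides do not say which scale was used on each axis, the helper name `acf_linearity` is hypothetical, and taking the absolute value (so that a perfectly straight decreasing profile scores 1) is an assumption.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of the same length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant list")
    return sxy / math.sqrt(sxx * syy)


def acf_linearity(acf_values):
    """Absolute Pearson coefficient between lags 1..n and r(1)..r(n)."""
    lags = list(range(1, len(acf_values) + 1))
    return abs(pearson(lags, acf_values))


if __name__ == "__main__":
    straight = [0.9, 0.8, 0.7, 0.6]        # made-up, perfectly linear profile
    curved = [0.5, 0.25, 0.125, 0.0625]    # made-up, exponential profile
    print(acf_linearity(straight), acf_linearity(curved))
```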




Results
     [Figure sequence over three slides; the last one labels one region
      “Long memory processes” and another “Short memory processes”]

Looking at the numbers
                    Quantile   Commits   MRs
                           0   0.3235    0.2886
                          20   0.7394    0.7248
                          40   0.8178    0.8036
                          60   0.8906    0.8705
                          80   0.9783    0.9464
                         100   0.9998    0.9998

     [Annotations on the table: “Long memory process” and “Short memory process”]

Implications for evolution
     ●   Short memory -> Yesterday's weather
         http://doi.ieeecomputersociety.org/10.1109/ICSM.2004.1357788
     ●   When making decisions, the current situation should have
         more influence than the distant past
          ●   As Lehman said in 1978

How to forecast software evolution




Background
     ●   Forecasting traditionally done using very simple
         statistical models
          ●   Regression
     ●   Lehman suggested in 1974 that Time Series
         Analysis was the best approach to study
         software evolution
     ●   Let's compare time series analysis against
         regression models
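
The regression side of the comparison can be sketched as an ordinary least-squares line fitted to the training period and extrapolated over the test period. The data below are invented, and the choice of a straight line in time is an assumption; the slides only say "regression".

```python
def fit_line(times, values):
    """Ordinary least-squares fit of values = intercept + slope * time."""
    n = len(times)
    if n != len(values) or n < 2:
        raise ValueError("need at least two (time, value) pairs")
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    if sxx == 0:
        raise ValueError("all time points are identical")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values)) / sxx
    intercept = mean_v - slope * mean_t
    return slope, intercept


def predict(slope, intercept, times):
    """Extrapolate the fitted line to new time points."""
    return [intercept + slope * t for t in times]


if __name__ == "__main__":
    train_times = [0, 1, 2, 3, 4]
    train_sizes = [100, 121, 139, 162, 180]  # made-up training data
    slope, intercept = fit_line(train_times, train_sizes)
    print(predict(slope, intercept, [5, 6]))  # forecast for the test period
```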


Case studies
     [Figure: timeline from 1993 to 2007 showing the training set and
      the test set for PostgreSQL, FreeBSD and NetBSD]
     [Figure: plot split into a training set and a test set]

Time Series Analysis
     [Flowchart]
     ●   Original time series data
     ●   ACF and PACF
     ●   Clear pattern?
          ●   No: kernel smoothing
          ●   Yes: p, d, q based on ACF / PACF
     ●   ARIMA model fitting
     ●   Predictions

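
The smoothing step could look like the following Gaussian kernel smoother (a weighted average around each position). The slides do not state the kernel or the bandwidth, so both are assumptions here, and the input series is made up.

```python
import math


def kernel_smooth(series, bandwidth=2.0):
    """Gaussian-kernel weighted average of the series around each position."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    positions = range(len(series))
    smoothed = []
    for t in positions:
        # Weight of each observation decays with its distance to position t.
        weights = [math.exp(-0.5 * ((t - s) / bandwidth) ** 2) for s in positions]
        total = sum(weights)
        smoothed.append(sum(w * x for w, x in zip(weights, series)) / total)
    return smoothed


if __name__ == "__main__":
    noisy = [0, 10, 0, 10, 0, 10, 0, 10]  # made-up, very noisy series
    print([round(x, 2) for x in kernel_smooth(noisy, bandwidth=2.0)])
```

A larger bandwidth averages over more neighbours and gives a smoother series, at the cost of blurring genuine changes.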
Parameters of the model
     [Figure]

Autocorrelation coefficients. No smoothing
     [Figure]

Autocorrelation coefficients. After smoothing
     [Figure]

Parameters of all the models
     ●   Time series ARIMA model
          ●   d = 1, q = 0, p = 6, 7 or 9
     ●   Regression model
          ●   r > 0.99
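
With d = 1 and q = 0, and under the conventional reading of p as the autoregressive order, the time series model is an autoregression of order p on the first differences of the series. The following self-contained least-squares sketch (no intercept) illustrates the idea; it is not the fitting procedure used for the case studies, and all function names are hypothetical.

```python
def solve(matrix, rhs):
    """Solve a small linear system by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    a = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("singular system: try a smaller p or a longer series")
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(col + 1, n):
            factor = a[row][col] / a[col][col]
            for j in range(col, n + 1):
                a[row][j] -= factor * a[col][j]
    solution = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(a[row][j] * solution[j] for j in range(row + 1, n))
        solution[row] = (a[row][n] - tail) / a[row][row]
    return solution


def fit_ar_on_differences(series, p):
    """Least-squares AR(p) coefficients for the first differences (d = 1, q = 0)."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    if len(diffs) <= p:
        raise ValueError("series too short for the requested p")
    # Each row holds the p previous differences; the target is the current one.
    rows = [[diffs[t - i] for i in range(1, p + 1)] for t in range(p, len(diffs))]
    targets = diffs[p:]
    # Normal equations: (X^T X) phi = X^T y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(rows, targets)) for i in range(p)]
    return solve(xtx, xty)


def forecast(series, coefficients, steps):
    """Predict future differences one by one and accumulate them onto the last value."""
    diffs = [b - a for a, b in zip(series, series[1:])]
    level = series[-1]
    predictions = []
    for _ in range(steps):
        next_diff = sum(c * diffs[-i] for i, c in enumerate(coefficients, start=1))
        diffs.append(next_diff)
        level += next_diff
        predictions.append(level)
    return predictions


if __name__ == "__main__":
    sizes = [100, 104, 109, 115, 118, 124, 131, 135, 142, 150, 155, 163]  # made up
    phi = fit_ar_on_differences(sizes, p=2)
    print(phi, forecast(sizes, phi, steps=3))
```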




What does the model look like?

     $$\nabla^d x_t \left(1 - \sum_{j=1}^{q} a_j B^j\right) = \varepsilon_t \left(1 - \sum_{i=1}^{p} b_i B^i\right)$$

     $$B^i x_t = x_{t-i}$$

     $$\nabla x_t = x_t - x_{t-1} = (1 - B)\, x_t$$

     $$\nabla^d x_t = (1 - B)^d x_t$$

     ●   $x_t$: predicted / actual values
     ●   Coefficients of the two sums (written here as $a_j$ and $b_i$)
     ●   $\varepsilon_t$: estimation errors
     ●   $p$, $d$, $q$: parameters of the model
     ●   Each of the two sums is a linear component

Results
     Time series (ARIMA) vs. regression: Mean Squared Relative Error

                        ARIMA   Regression
          FreeBSD        3.93        16.89
          NetBSD         1.80        15.94
          PostgreSQL     1.48         6.86
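
For reference, a mean squared relative error can be computed as below. This assumes the common definition (the mean of the squared errors relative to the actual values); the slides give neither the formula nor the scaling behind the figures in the table, and the numbers in the example are invented.

```python
def mean_squared_relative_error(actual, predicted):
    """Mean of ((predicted - actual) / actual)**2 over all observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty lists of the same length")
    if any(a == 0 for a in actual):
        raise ValueError("relative error is undefined when an actual value is 0")
    squared = [((p - a) / a) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared) / len(squared)


if __name__ == "__main__":
    actual = [100, 200]        # made-up test-set values
    predicted = [110, 180]     # made-up forecasts
    print(mean_squared_relative_error(actual, predicted))
```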




Conclusions
     ●   Time series analysis is more accurate than regression
         analysis for macroscopic predictions
     ●   Basic model. More components can be added:
          ●   Seasonality
          ●   Multi-variable, combining different factors

More results
     ●   OK, so you predicted last year... which is already the past
     ●   What about predicting the real future?
          ●   MSR Challenge 2007 winners
          ●   Goal: predicting the number of changes in Eclipse
              in the next three months
          ●   http://dx.doi.org/10.1109/MSR.2007.10

Why does this work?
     ●   Isn't it too accurate?
     ●   Why do you think this works?

What's next?




Further work
     ●   Write a paper about the controversy around the
         validation of the laws of software evolution
          ●   In progress
     ●   Write a paper about the short memory nature of
         evolution
          ●   Using Time Series Analysis to show it
          ●   And ARIMA as a forecasting tool
          ●   Extracting principles and guidelines for software
              project management

And what did I learn during all these years?

Things I appreciate my advisors did
     ●   Freedom of movement
     ●   Pressure to get my own funding
     ●   Unconditional support
     ●   Demanding and challenging environment
     ●   Opportunity to coordinate projects
     ●   And to participate in many meetings alone



Things that I did not know and I do now
     ●   Know-how about conferences and journals
     ●   English skills
     ●   Writing skills (papers and proposals)
     ●   Presentation skills
     ●   Self-motivation
               –   Brick walls are there for the rest of the people
               –   Experience is what you get when you don't get what
                   you want
               –   Never give up
               –   http://www.youtube.com/watch?v=ji5_MqicxSo

Take away
     ●   Laws of software evolution
     ●   Controversy
     ●   Short memory dynamics
     ●   ARIMA: accurate forecast
     ●   Statistical approach
     ●   Replicable study
     ●   Brick walls are a good thing
     ●   Keep working. Don't give up
