How did this get published?
Pitfalls in experimental evaluation of computing systems

José Nelson Amaral
University of Alberta
Edmonton, AB, Canada
Thing #1: Aggregation
Thing #2: Learning
Thing #3: Reproducibility
So, a computing scientist entered a store…
Original prices: server $3,000.00; iPod $200.00.
Scientist: "They want $2,700 for the server and $100 for the iPod. I will get both and pay only $2,240 altogether!"
Image credits:
http://archive.constantcontact.com/fs042/1101916237075/archive/1102594461324.html
http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots
So, a computing scientist entered a store…
Clerk: "Ma’am, you are $560 short."
Scientist: "But the average of 10% and 50% is 30%, and 70% of $3,200 is $2,240."
Image credit: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1
So, a computing scientist entered a store…
Clerk: "Ma’am, you cannot take the arithmetic average of percentages!"
Scientist: "But… I just came from a top CS conference in San Jose where they do it!"
The Problem with Averages
A Hypothetical Experiment
[Bar chart: Execution Time. x-axis: Benchmark; y-axis: Time (minutes).]
Benchmark    Baseline    Transformed
benchA           1            5
benchB           2           10
benchC          10            2
benchD          20            4
*With thanks to Iain Ireland
Speedup
Performance Comparison
[Bar chart: Speedup. x-axis: Benchmarks; y-axis: Speedup.]
benchA: 0.2    benchB: 0.2    benchC: 5    benchD: 5    Arith Avg: 2.6
The transformed system is, on average, 2.6 times faster than the baseline!
Normalized Time
[Bar chart: Normalized Time. x-axis: Benchmark; y-axis: Normalized Time.]
benchA: 5    benchB: 5    benchC: 0.2    benchD: 0.2    Arith Avg: 2.6
The transformed system is, on average, 2.6 times slower than the baseline!
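The contradiction between the two slides above can be reproduced in a few lines. This is a minimal sketch that assumes the execution times as read off the hypothetical experiment's chart; the variable names are illustrative and not part of the talk.

```python
from statistics import mean

# Execution times in minutes, as read off the hypothetical experiment's chart.
baseline = {"benchA": 1, "benchB": 2, "benchC": 10, "benchD": 20}
transformed = {"benchA": 5, "benchB": 10, "benchC": 2, "benchD": 4}

# Speedup is baseline / transformed; normalized time is transformed / baseline.
speedups = [baseline[b] / transformed[b] for b in baseline]
normalized_times = [transformed[b] / baseline[b] for b in baseline]

# Arithmetic mean of each set of ratios.
arith_speedup = mean(speedups)
arith_normalized_time = mean(normalized_times)

print(f"arithmetic mean of speedups:         {arith_speedup:.1f}")
print(f"arithmetic mean of normalized times: {arith_normalized_time:.1f}")
```

Both means come out at 2.6, so the same measurements appear to show the transformed system as 2.6 times faster and 2.6 times slower, depending only on which ratio was averaged.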
Latency × Throughput
• What matters is latency:
  [Figure: timeline from Start along a Time axis]
• What matters is throughput:
  [Figure: timeline from Start along a Time axis]
Aggregation for Latency: Geometric Mean
Speedup
[Bar chart: Speedup. x-axis: Benchmarks; y-axis: Speedup.]
benchA: 0.2    benchB: 0.2    benchC: 5    benchD: 5    Arith Avg: 2.6    Geo Mean: 1
The performance of the transformed system is, on average, the same as the baseline!
Normalized Time
[Bar chart: Normalized Time. x-axis: Benchmark; y-axis: Normalized Time.]
benchA: 5    benchB: 5    benchC: 0.2    benchD: 0.2    Arith Avg: 2.6    Geo Mean: 1.0
The performance of the transformed system is, on average, the same as the baseline!
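The geometric mean gives the same answer whichever ratio is chosen, because the geometric mean of reciprocals is the reciprocal of the geometric mean. A small sketch, assuming the four speedups shown in the chart:

```python
from statistics import geometric_mean

# Per-benchmark speedups from the chart (benchA, benchB, benchC, benchD).
speedups = [0.2, 0.2, 5.0, 5.0]
# Normalized time is the reciprocal of speedup.
normalized_times = [1 / s for s in speedups]

geo_speedup = geometric_mean(speedups)
geo_normalized_time = geometric_mean(normalized_times)

print(f"geometric mean of speedups:         {geo_speedup:.2f}")
print(f"geometric mean of normalized times: {geo_normalized_time:.2f}")
# The two summaries are reciprocals of each other, so they cannot disagree.
print(f"product of the two means:           {geo_speedup * geo_normalized_time:.2f}")
```

Here both means are 1, matching the slide's conclusion that, for latency, the transformed system performs on average the same as the baseline.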
Aggregation for Throughput
[Bar chart: Execution Time. x-axis: Benchmark; y-axis: Time (minutes).]
Benchmark    Baseline    Transformed
benchA           1            5
benchB           2           10
benchC          10            2
benchD          20            4
Arith Avg        8.25         5.25
The throughput of the transformed system is, on average, 1.6 times higher than the baseline.
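For throughput, the raw execution times are aggregated first and the ratio is taken afterwards. A minimal sketch, assuming the same per-benchmark times read off the chart:

```python
# Execution times in minutes for benchA..benchD, as read off the chart.
baseline = [1, 2, 10, 20]
transformed = [5, 10, 2, 4]

# Arithmetic averages of the raw times (not of normalized ratios).
avg_baseline = sum(baseline) / len(baseline)
avg_transformed = sum(transformed) / len(transformed)

# Ratio of total work time: how much sooner the whole suite finishes.
throughput_ratio = sum(baseline) / sum(transformed)

print(f"average baseline time:    {avg_baseline:.2f} min")
print(f"average transformed time: {avg_transformed:.2f} min")
print(f"throughput ratio:         {throughput_ratio:.2f}")
```

The ratio of the averages equals the ratio of the totals (33 minutes against 21 minutes), which rounds to the 1.6 reported on the slide.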
The Evidence
• A careful reader will find the use of arithmetic
  average to aggregate normalized numbers in
  many top CS conferences.
• Papers that have done that have appeared in:
   – LCTES 2011
   – PLDI 2012 (at least two papers)
   – CGO 2012
      • A paper where the use of the wrong average changed a
        negative conclusion into a positive one.
   – 2007 SPEC Workshop
      • A methodology paper by myself and a student that won the
        best paper award.
This is not a new observation…
Communications of the ACM, March 1986, pp. 218-221.
No need to dig dusty papers…
So, the computing scientist returns to the store…
Scientist: "Hello. I am just back from Beijing. Now I know that we should take the geometric average of percentages."
Clerk: "???"
So, a computing scientist entered a store…
Scientist: "Thus I should get √(10% × 50%) = 22.36% discount and pay 0.7764 × $3,200 = $2,484.48."
Clerk: "Sorry, Ma’am, we don’t average percentages…"
So, a computing scientist entered a store…
Clerk: "The original price is $3,200. You pay $2,700 + $100 = $2,800. If you want an aggregate summary, your discount is 400/3,200 = 12.5%."
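The three discount figures in the store story can be checked directly. The sketch below assumes the prices quoted in the dialogue ($3,000 and $200 originally, $2,700 and $100 on sale); the dictionary and variable names are illustrative.

```python
from math import sqrt
from statistics import mean

# Prices quoted in the store dialogue.
original = {"server": 3000.00, "iPod": 200.00}
sale = {"server": 2700.00, "iPod": 100.00}

# Per-item discounts: 10% on the server, 50% on the iPod.
discounts = [(original[i] - sale[i]) / original[i] for i in original]

# The scientist's two attempts: averaging the percentages themselves.
arithmetic_discount = mean(discounts)
geometric_discount = sqrt(discounts[0] * discounts[1])

# The clerk's summary: total money saved over total original price.
aggregate_discount = (sum(original.values()) - sum(sale.values())) / sum(original.values())

print(f"arithmetic mean of discounts: {arithmetic_discount:.2%}")
print(f"geometric mean of discounts:  {geometric_discount:.2%}")
print(f"aggregate discount:           {aggregate_discount:.2%}")
```

Only the aggregate figure corresponds to the $2,800 actually charged, because it weights each item by its price; both averages of percentages ignore that the server accounts for most of the bill.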
Thing #2: Learning
Disregard for methodology when using automated learning
Example: Evaluation of Feedback-Directed Optimization (FDO)
We have:
• Application Inputs
• Application Code
• Compiler
We want to measure the effectiveness of an FDO-based code transformation.
Image credit: http://www.orchardoo.com
[Diagram: Application Inputs → Training Set; Application Code → Compiler (-I) → Instrumented Code; Instrumented Code run on the Training Set → Profile]
[Diagram: Application Code + Profile → Compiler (-FDO) → Optimized Code; Optimized Code run on the Evaluation Set → Performance; the remaining Application Inputs form the Un-evaluated Set]
The FDO transformation produces code that is XX faster for this application.
The Evidence
• Many papers that use a single input for
  training and a single input for testing
  appeared in conferences (notably CGO).
• For instance, a paper that uses a single input
  for training and a single input for testing
  appears in:
  – ASPLOS 2004
[Diagram: Application Code + Profile → Compiler (-FDO) → Optimized Code; Optimized Code run on each input of the Evaluation Set → one Performance measurement per input]
[Diagram: Training Set → one Profile per training input → Compiler (-FDO) → Optimized Code, measured on the Evaluation Set]
Combined Profiling (Berube, ISPASS12)
Cross-Validated Evaluation (Berube, SPEC07)
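One common way to organize a cross-validated evaluation is leave-one-out: each input is held out for evaluation once, and the profile is collected from all the others. The sketch below illustrates that idea only; it is an assumption on my part and not necessarily the procedure of the cited papers. The functions `fake_collect_profile` and `fake_measure` are hypothetical stand-ins for running the instrumented and optimized binaries.

```python
def cross_validate(inputs, collect_profile, measure):
    """Leave-one-out: evaluate on each input with a profile built from the others."""
    results = {}
    for i, held_out in enumerate(inputs):
        # Training set is every input except the one held out for evaluation.
        training = list(inputs[:i]) + list(inputs[i + 1:])
        profile = collect_profile(training)
        results[held_out] = measure(profile, held_out)
    return results


# Hypothetical stand-ins so the sketch runs without a compiler or benchmark.
def fake_collect_profile(training_inputs):
    # A real version would run the instrumented code on each training input.
    return frozenset(training_inputs)


def fake_measure(profile, evaluation_input):
    # A real version would time the FDO-optimized code; here we only report
    # whether the evaluation input leaked into the training profile.
    return evaluation_input in profile


workload = ["in1", "in2", "in3", "in4", "in5"]  # hypothetical input names
leaks = cross_validate(workload, fake_collect_profile, fake_measure)
print(leaks)
```

Every input is evaluated exactly once and never with a profile that was trained on it, which yields one performance figure per input rather than a single number from a single train/test pair.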
Wrong Evaluation!
[Diagram: the Evaluation Set is used both to produce the Profile and to measure the Performance of the Optimized Code]
The Evidence
• For instance, a paper that incorrectly uses the
  same input for training and testing appeared
  in:
  – PLDI 2006
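An evaluation script can refuse to run when an input appears in both sets. This is a minimal sketch; `check_disjoint` and the input file names are hypothetical.

```python
def check_disjoint(training_inputs, evaluation_inputs):
    """Raise ValueError if any input is used for both training and evaluation."""
    overlap = set(training_inputs) & set(evaluation_inputs)
    if overlap:
        raise ValueError(
            f"inputs used for both training and evaluation: {sorted(overlap)}"
        )


# Distinct inputs: the check passes silently.
check_disjoint(["train.in"], ["ref.in"])

# Same input on both sides: the check rejects the experiment.
try:
    check_disjoint(["ref.in"], ["ref.in"])
    rejected = False
except ValueError as error:
    rejected = True
    print(error)
```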
Thing #3: Reproducibility
Expectation: When reproduced, an experimental evaluation should produce similar results.
Thing #3: Reproducibility
Issues
• Have the measurements been repeated a sufficient number of times to capture measurement variations?
• Availability of code, data, and precise description of experimental setup.
• Lack of incentives for reproducibility studies.
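For the first issue, a report can state the spread of repeated runs instead of a single number. The sketch below is one possible approach, not a prescription from the talk: it uses a normal approximation for the confidence interval (for a small number of repetitions a Student's t interval would be more appropriate), and `hypothetical_workload` stands in for the program under test.

```python
import time
from math import sqrt
from statistics import NormalDist, mean, stdev


def measure(workload, repetitions):
    """Run the workload several times and return the wall-clock time of each run."""
    samples = []
    for _ in range(repetitions):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    return samples


def summarize(samples, confidence=0.95):
    """Return (mean, sample standard deviation, confidence-interval half-width)."""
    # Normal approximation: z is about 1.96 for a 95% interval.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    spread = stdev(samples)
    half_width = z * spread / sqrt(len(samples))
    return mean(samples), spread, half_width


def hypothetical_workload():
    # Stand-in for the program being measured.
    sum(i * i for i in range(100_000))


samples = measure(hypothetical_workload, repetitions=10)
avg, spread, half_width = summarize(samples)
print(f"{avg:.6f} s ± {half_width:.6f} s (n={len(samples)}, stdev={spread:.6f} s)")
```

If the interval is wide relative to the effect being claimed, more repetitions or a more controlled setup are needed before drawing a conclusion.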
Thing #3: Reproducibility
Progress
• Program committees/reviewers starting to ask questions about reproducibility.
• Steps toward infrastructure to facilitate reproducibility.
SPEC Research Group
http://research.spec.org/
14 industrial organizations
20 universities or research institutes

• Performance Evaluation
• Benchmarks for New Areas
• Performance Evaluation Tools
• Evaluation Methodology
• Repository for Reproducibility
http://icpe2013.ipd.kit.edu/
Evaluate Collaboratory: http://evaluate.inf.usi.ch/
• Open Letter to PC Chairs
• Anti Patterns
• Evaluation in CS education
Parting Thoughts…
Creating a culture that enables full reproducibility seems daunting…
Initially we could aim for: a reasonable expectation by a reasonable reader that, if reproduced, the experimental evaluation would produce similar results.

Amaral lctes2012

  • 1. How did this get published? Pitfalls in experimental evaluation of computing systems José Nelson Amaral University of Alberta Edmonton, AB, Canada
  • 2. Thing #1: Aggregation. Thing #2: Learning. Thing #3: Reproducibility.
  • 3. So, a computing scientist entered a Store…. (Images: http://archive.constantcontact.com/fs042/1101916237075/archive/1102594461324.html and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 4. So, a computing scientist entered a Store…. Price tags: $3,000.00 and $200.00. "They want $2,700 for the server and $100 for the iPod. I will get both and pay only $2,240 altogether!" (Image: http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 5. So, a computing scientist entered a Store…. Price tags: $3,000.00 and $200.00. "Ma’am, you are $560 short." "But the average of 10% and 50% is 30%, and 70% of $3,200 is $2,240." (Images: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1 and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 6. So, a computing scientist entered a Store…. Price tags: $3,000.00 and $200.00. "Ma’am, you cannot take the arithmetic average of percentages!" "But… I just came from a top CS conference in San Jose where they do it!" (Images: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1 and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 7. The Problem with Averages
  • 8. A Hypothetical Experiment. [Bar chart "Execution Time": time in minutes for the Baseline and Transformed systems on benchA, benchB, benchC, and benchD, plus an Arith Avg group.] *With thanks to Iain Ireland
  • 10. Performance Comparison. [Bar chart "Speedup": one bar per benchmark (benchA to benchD), labeled 5, 5, 0.2, and 0.2, plus Arith Avg (2.6) and Geo Mean.] The transformed system is, on average, 2.6 times faster than the baseline!
  • 12. Normalized Time. [Bar chart "Normalized Time": one bar per benchmark (benchA to benchD), labeled 5, 5, 0.2, and 0.2, plus Arith Avg (2.6) and Geo Mean.] The transformed system is, on average, 2.6 times slower than the baseline!
  • 13. Latency × Throughput. What matters is latency: [timeline diagram, Start along a Time axis]. What matters is throughput: [timeline diagram, Start along a Time axis].
  • 14. Aggregation for Latency: Geometric Mean
  • 15. Speedup. [Bar chart "Speedup": one bar per benchmark (benchA to benchD), labeled 5, 5, 0.2, and 0.2, plus Arith Avg (2.6) and Geo Mean (1).] The performance of the transformed system is, on average, the same as the baseline!
  • 16. Normalized Time. [Bar chart "Normalized Time": one bar per benchmark (benchA to benchD), labeled 5, 5, 0.2, and 0.2, plus Arith Avg (2.6) and Geo Mean (1.0).] The performance of the transformed system is, on average, the same as the baseline!
  • 17. Aggregation for Throughput. [Bar chart "Execution Time": time in minutes for the Baseline and Transformed systems on benchA to benchD, with Arith Avg bars labeled 8.25 and 5.25.] The throughput of the transformed system is, on average, 1.6 times faster than the baseline.
  • 18. The Evidence
      • A careful reader will find the use of arithmetic average to aggregate normalized numbers in many top CS conferences.
      • Papers that have done that have appeared in:
          – LCTES 2011
          – PLDI 2012 (at least two papers)
          – CGO 2012
      • A paper where the use of the wrong average changed a negative conclusion into a positive one.
          – 2007 SPEC Workshop
      • A methodology paper by myself and a student that won the best paper award.
  • 19. This is not a new observation… Communications of the ACM, March 1986, pp. 218-221.
  • 20. No need to dig dusty papers…
  • 21. So, the computing scientist returns to the Store…. Price tags: $3,000.00 and $200.00. "Hello. I am just back from Beijing. Now I know that we should take the geometric average of percentages." "???" (Images: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1 and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 22. So, a computing scientist entered a Store…. Price tags: $3,000.00 and $200.00. "Thus I should get √(10% × 50%) = 22.36% discount and pay 0.7764 × $3,200 = $2,484.48." "Sorry Ma’am, we don’t average percentages…" (Images: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1 and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 23. So, a computing scientist entered a Store…. Price tags: $3,000.00 and $200.00. "The original price is $3,200. You pay $2,700 + $100 = $2,800. If you want an aggregate summary, your discount is 400/3,200 = 12.5%." (Images: http://www.businessinsider.com/10-ways-to-fix-googles-busted-android-app-market-2010-1?op=1 and http://bitchmagazine.org/post/beyond-the-panel-an-interview-with-danielle-corsetto-of-girls-with-slingshots)
  • 24. Thing #2: Learning. Disregard for methodology when using automated learning.
  • 25. Example: Evaluation of Feedback Directed Optimization (FDO)
  • 26. We have: Application Inputs, Application Code, and a Compiler. We want to measure the effectiveness of an FDO-based code transformation. (Image: http://www.orchardoo.com)
  • 27. [Diagram: Application Inputs form a Training Set; the Compiler, run with -I, turns the Application Code into Instrumented Code, which yields a Profile.] (Image: http://www.orchardoo.com)
  • 28. [Diagram: the Compiler, run with -FDO and the Profile, turns the Application Code into Optimized Code; Performance is measured on an Evaluation Set, and the remaining Application Inputs form an Un-evaluated Set.] "The FDO transformation produces code that is XX faster for this application." (Image: http://www.orchardoo.com)
  • 29. The Evidence
      • Many papers that use a single input for training and a single input for testing appeared in conferences (notably CGO).
      • For instance, a paper that uses a single input for training and a single input for testing appears in:
          – ASPLOS 2004
  • 30. [Diagram: the Compiler, run with -FDO and a Profile, turns the Application Code into Optimized Code; Performance is measured separately for several inputs of the Evaluation Set.] (Image: http://www.orchardoo.com)
  • 31. Training Set and Evaluation Set. Combined Profiling (Berube, ISPASS12). Cross-Validated Evaluation (Berube, SPEC07). [Diagram: several Profiles feed the Compiler, run with -FDO, which turns the Application Code into Optimized Code.] (Image: http://www.orchardoo.com)
  • 32. Wrong Evaluation! [Diagram: the Compiler, run with -FDO, turns the Application Code into Optimized Code; both Profile and Performance are attached to the Evaluation Set.] (Image: http://www.orchardoo.com)
  • 33. The Evidence
      • For instance, a paper that incorrectly uses the same input for training and testing appeared in:
          – PLDI 2006
  • 34. Thing #3: Reproducibility. Expectation: When reproduced, an experimental evaluation should produce similar results.
  • 35. Thing #3: Reproducibility. Issues: Have the measurements been repeated a sufficient number of times to capture measurement variations? Availability of code, data, and a precise description of the experimental setup. Lack of incentives for reproducibility studies.
  • 36. Thing #3: Reproducibility. Progress: Program committees and reviewers are starting to ask questions about reproducibility. Steps toward infrastructure to facilitate reproducibility.
  • 37. SPEC Research Group (http://research.spec.org/): 14 industrial organizations; 20 universities or research institutes.
  • 38. SPEC Research Group http://research.spec.org/ Performance Evaluation Benchmarks for New Areas Performance Evaluation Tools Evaluation Methodology Repository for Reproducibility http://icpe2013.ipd.kit.edu/
  • 39. Evaluate Collaboratory: http://evaluate.inf.usi.ch/. Open Letter to PC Chairs. Anti Patterns. Evaluation in CS education.
  • 40. Parting Thoughts…. Creating a culture that enables full reproducibility seems daunting… Initially we could aim for: Reasonable expectation by a reasonable reader that, if reproduced, the experimental evaluation would produce similar results.
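
The aggregation pitfall from slides 7 to 17 and the store example from slides 4 to 23 can be checked numerically. The sketch below is an illustration, not code from the talk. The per-benchmark times are an assumed reconstruction chosen to match the numbers that survive in the charts (speedups of 5 and 0.2, arithmetic averages of 8.25 and 5.25 minutes); which benchmark has which time is a guess. The helper names `arithmetic_mean` and `geometric_mean` are hypothetical.

```python
import math


def arithmetic_mean(values):
    """Plain arithmetic average of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


def geometric_mean(values):
    """Geometric mean, computed through logarithms; values must be positive."""
    values = list(values)
    return math.exp(sum(math.log(v) for v in values) / len(values))


# ASSUMPTION: execution times in minutes. The values are consistent with the
# slide charts, but the assignment to benchA..benchD is a guess.
baseline_min = {"benchA": 20.0, "benchB": 10.0, "benchC": 2.0, "benchD": 1.0}
transformed_min = {"benchA": 4.0, "benchB": 2.0, "benchC": 10.0, "benchD": 5.0}

# Two normalizations of the same measurements.
speedups = [baseline_min[b] / transformed_min[b] for b in baseline_min]
normalized_times = [transformed_min[b] / baseline_min[b] for b in baseline_min]

# The arithmetic mean gives about 2.6 for both, so the transformed system
# looks "2.6 times faster" and "2.6 times slower" at the same time.
arith_speedup = arithmetic_mean(speedups)
arith_normalized = arithmetic_mean(normalized_times)

# The geometric mean gives about 1.0 either way: no change on average.
geo_speedup = geometric_mean(speedups)
geo_normalized = geometric_mean(normalized_times)

# For throughput, aggregate the raw times first, then take one ratio.
total_time_ratio = sum(baseline_min.values()) / sum(transformed_min.values())

# Store example: 10% off a $3,000 server and 50% off a $200 iPod.
list_prices = {"server": 3000.0, "iPod": 200.0}
discounts = {"server": 0.10, "iPod": 0.50}

amount_due = sum(list_prices[i] * (1.0 - discounts[i]) for i in list_prices)
aggregate_discount = 1.0 - amount_due / sum(list_prices.values())
naive_discount = arithmetic_mean(discounts.values())

if __name__ == "__main__":
    print(f"arithmetic mean of speedups:         {arith_speedup:.2f}")
    print(f"arithmetic mean of normalized times: {arith_normalized:.2f}")
    print(f"geometric mean of speedups:          {geo_speedup:.2f}")
    print(f"geometric mean of normalized times:  {geo_normalized:.2f}")
    print(f"ratio of total execution times:      {total_time_ratio:.2f}")
    print(f"amount due: ${amount_due:,.2f}")
    print(f"aggregate discount: {aggregate_discount:.1%}")
    print(f"average of the two percentages: {naive_discount:.0%}")
```

The ratio of total times (33 / 21, roughly 1.6) matches the throughput claim on slide 17, and the aggregate discount of 12.5% matches slide 23. In both cases the sound summary comes from aggregating the raw quantities before dividing, not from averaging ratios.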