Benchmarking PyCon AU 2011 v0


A Python module for benchmarking

Published in: Technology, Business
  • Introduce self as a speaker. Name: Tennessee Leeuwenburg. Qualifications: Bachelor of Computer Science, Diploma of Philosophy, Masters of Business Administration. Manage a team of 2-3 developers.
  • History of benchmarking – “Fun Facts!” Image source: The Internet
  • This graph shows a situation where quality can be simply measured. The message of the graph is one of improvement compared to competitors, eventually reaching a rough parity.
  • I am trying here to introduce a general topic by reference to a specific example. What’s being demonstrated here is the principle of visualisation and the principle of measurement.
  • The yellow line is the normalised CPython performance. It doesn’t mean CPython took a constant amount of time in all tests. The blue graph shows the proportional time that PyPy took relative to CPython. The significance of that is not immediately apparent. It shows, however, that PyPy is basically faster across the board. It’s an impressive graphic, and a convincing piece of communication. Image Source:
  • This is a pretty high-level principle. It might not have much apparent value to a software engineer who is more interested in how to disassemble a problem into its parts. This isn’t so much about problem solving as monitoring symptoms. And that’s what you actually need to communicate to people to convince them they do or don’t have a problem to fix. Measurement and comparison can be used for problem diagnosis (and communication of such), but benchmarking is about high-level monitoring.
  • Okay, now I admit it’s a stretch to say that this graph actually caused PyPy to get faster, especially since I’m not a contributor to that project. The point is that PyPy are using this graph to (1) communicate the idea that they are on a path of progress and (2) understand their own level of performance. It provides a layer of truth to perceptions about progress and speed.
  • There are often many hidden factors which can influence a result. The benefits of benchmarking usually come with time and analysis, not for free from some graph. If you are lucky enough to work in an area with standard tests, you may be able to sidestep some of the confusion.
  • Now that we have taken a look at one example, it’s time to consider the issue more generally. Assuming still that we are measuring speed, what is the context of that measurement? Historical performance is one piece of context – it tells you about improvement, which is always important. However, benchmarking by configuration is important too. Maybe your test system is nice and fast, but your production environment has double the data size, and you have a bad exponential algorithm which only really bites you when peak load hits. Maybe your

    1. Benchmarking your applications: With lots of shiny pictures. Tennessee Leeuwenburg, 20 August 2011
    2. What is Benchmarking?
        • Originally (circa 1842) a mark cut into a stone by land surveyors to secure a "bench" (from 19th century land surveying jargon, meaning a type of bracket), to mount measuring equipment
        • Another claim is that the term benchmarking originated with cobblers measuring people’s feet
        • Benchmarking is important to anyone who wishes to ensure a process is consistent and repeatable.
        • Images are attributed in the PPT notes
    3. What is Benchmarking?
        • Evaluate or check (something) by comparison with a standard: "we are benchmarking our performance against external criteria".
        • Fundamentally, it’s evaluating performance through measurement
        Image by the author, fictional data
    4. PyPy: A Concrete Software Example
        • PyPy doesn’t use, they have a custom benchmark execution rig
        • They have concentrated on building a system for visualising performance, called “CodeSpeed”.
        • They have chosen to measure speed. They could have focused on memory, or network performance, or anything else that makes sense for their “business”.
        • However, speed is one of the most important aspects of a language, and one of the biggest reasons for someone not to choose PyPy instead of standard CPython
    5. PyPy Benchmark against CPython
        Image source:
    6. Benchmarking to Drive Development
        • “What gets measured gets done” – Peter Drucker
        • Benchmarking introduces performance into the feedback loop that drives our activity at work. Hide the information, and you hide the pressure. Publicise the information, and you increase the pressure.
        • It’s a tool for raising the profile of what you are measuring. It says performance is important.
        • To improve performance, first measure it. (Of course, it’s not the only way, but it helps)
        • This makes selecting your measurement important… measure something meaningful
    7. Performance over Time
        Image source:
    8. A word of warning…
    9. What to compare against?
        • Benchmarking over time (by revision or date).
            • The most normal kind of benchmarking is a historical comparison of past performance. This lets you understand what, if any, progress is being made in the application
        • Benchmarking by configuration
            • If an application has multiple configurations, especially if it has to run over a larger data set in production than in test, benchmarking those differences can be important
        • Benchmarking by hardware
            • Benchmarking by hardware has the obvious advantage that you can evaluate the impact of purchasing a hardware upgrade
        • If there is a direct competitor, and you have their code, you can benchmark your procedures against theirs. But this is unlikely.
        • Some applications may have standard trials and tests
    10. Benchmarking to Notice Problems
        • Benchmarking can also solve specific problems by bringing them to your attention
        • Most usefully, it can highlight when something bad goes in with a commit
        • This graph shows a timeline of commits
        • No need to worry about performance ahead of time
        • Something slow went in with revision 10
        • So go fix it!
        • Best of all, bounce the commit!
        Image by the author, fictional data
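The "something slow went in with revision 10" check can also be automated. The following is only a sketch under stated assumptions: the function name `find_regressions`, the 20% tolerance and the timings are all made up for illustration and are not part of the tool presented in this talk.

```python
def find_regressions(timings, tolerance=0.2):
    """Return revisions that ran more than `tolerance` slower than the revision before.

    `timings` is a list of (revision, seconds) pairs in commit order.
    """
    flagged = []
    for (_, prev_seconds), (revision, seconds) in zip(timings, timings[1:]):
        if seconds > prev_seconds * (1 + tolerance):
            flagged.append(revision)
    return flagged


# Fictional timeline: revision 10 introduces a slowdown that then persists.
history = [(7, 1.02), (8, 0.99), (9, 1.01), (10, 1.65), (11, 1.63)]
print(find_regressions(history))  # [10]
```

Comparing each revision only with its predecessor flags the commit that introduced the slowdown, not every slow revision after it.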
    11. What is benchmarking of software?
        • This question has a few parts, being:
            • What is measurable about software?
            • What should be the basis for comparison?
            • What standards for comparison exist?
        • Most software benchmarking is about speed. Why?
            • It’s easiest to measure
            • It’s important
            • Most people understand speed of execution and what it’s like for an application to be unresponsive for the user
            • It’s often easy to fix
        • But…
            • Memory, disk and networking?
            • User acceptance and use?
    12. You too can benchmark your Python code!
        • … this thing I wrote and would like to share
        • will collect all this data for you. It Just Works (YMMV).
        • is a tool which measures and reports on execution speed. It utilises the cProfile Python module to record statistics in a historical archive
        • has an integration module for CodeSpeed, a website for visualising performance.
        • Your manager will love it! (YMMV)
    13. Introducing
        • Easily available!
            • Easy_install Easy to install!
            • https:// / Grab the source!
        • Easy to follow tutorials!
        • Easy to use!
            • Simple syntax: simply decorate the function you would like profiled – no complex function execution required.
            • Or, integrate directly with py.test to use without any code modification at all
        • Test-driven benchmarking
            • Because benchmarking in operations will slow the app down
    14. I’m trying to avoid this…
        Image source:
    15. How to Use
        In [2]: import bench
        In [3]: import bench.benchmarker
        In [4]: from bench.benchmarker import benchmark
        In [5]: @benchmark()
           ...: def foo():
           ...:     for i in range(100):
           ...:         pass
           ...:
        In [6]: foo()
        In [7]: bench.benchmarker.print_stats()
        100 function calls in 0.005 CPU seconds
        Random listing order was used
        ncalls tottime percall cumtime percall filename:lineno(function)
        0 0.000 0.000 profile:0(profiler)
        100 0.005 0.000 0.005 0.000 <ipython console>:1(foo)
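For readers curious how a decorator like the one in the session above could sit on top of cProfile, here is a minimal sketch. It is not the module's actual implementation: the shared `_profiler` object and the `stats_report` helper are hypothetical names, and only the standard library `cProfile`, `pstats`, `functools` and `io` modules are used.

```python
import cProfile
import functools
import io
import pstats

# Hypothetical module-level profiler shared by every decorated function.
_profiler = cProfile.Profile()


def benchmark():
    """Return a decorator that profiles every call of the wrapped function."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # runcall() enables the profiler, calls func, then disables it again.
            # Nested calls between decorated functions are not handled here.
            return _profiler.runcall(func, *args, **kwargs)
        return wrapper
    return decorator


def stats_report():
    """Return the accumulated statistics as text, sorted by cumulative time."""
    stream = io.StringIO()
    pstats.Stats(_profiler, stream=stream).sort_stats("cumulative").print_stats()
    return stream.getvalue()


@benchmark()
def foo():
    total = 0
    for i in range(100):
        total += i
    return total


foo()
report = stats_report()
print(report)
```

Keeping one profiler for the whole process means the statistics from all decorated functions accumulate into a single report, which is roughly what a module-level `print_stats()` call suggests.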
    16. Creating a good historical archive
        • The key to this is really to maintain a good historical archive. This means a certain amount of integration with your tool chain, but if you are using py.test it’s easy.
        demo_project]$ find /tmp/bench_history
        /tmp/bench_history
        /tmp/bench_history/demonstration
        /tmp/bench_history/demonstration/Z400
        /tmp/bench_history/demonstration/Z400/full_tests
        /tmp/bench_history/demonstration/Z400/full_tests/2011
        /tmp/bench_history/demonstration/Z400/full_tests/2011/07
        /tmp/bench_history/demonstration/Z400/full_tests/2011/07/25
        /tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
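The listing above suggests a layout of root / project / machine / test suite / year / month / day / timestamped file. That reading of the path components is an assumption, as are the function and parameter names in this sketch, which simply rebuilds the same path from its parts.

```python
from datetime import datetime
from pathlib import PurePosixPath


def archive_path(root, project, machine, suite, when):
    """Build a dated .pstats path following the layout shown in the listing."""
    day_dir = PurePosixPath(
        root, project, machine, suite,
        when.strftime("%Y"), when.strftime("%m"), when.strftime("%d"),
    )
    return day_dir / (when.strftime("%Y_%m_%d_%H_%M") + ".pstats")


path = archive_path(
    "/tmp/bench_history", "demonstration", "Z400", "full_tests",
    datetime(2011, 7, 25, 6, 19),
)
print(path)
# /tmp/bench_history/demonstration/Z400/full_tests/2011/07/25/2011_07_25_06_19.pstats
```

A `cProfile.Profile` object can write its data to such a path with `dump_stats()`, and `pstats.Stats` can load the file again later for comparison.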
    17. Choosing what to pay attention to
        • One of the fundamental choices when benchmarking is what to watch. Nothing can automate this, although choosing the ten most expensive functions is probably not a bad first try. Options include:
            • Watching the most expensive functions
            • Watching the most common user operations
            • Hand-selecting a mix of “inner loop” type functions and “outer loop” type functions
            • “Critical path” functions that can’t execute in the background or be avoided
            • Crafting a watch list based on a specific objective or system component
    18. Figuring it out the first time
        • Before setting up the list of watched functions for the graph server, try opening the file in a spreadsheet. Benchmarker comes with a CSV export mode.
        Data taken from BoM application tests
        Cumulative % total   % total   Total time   # of calls   Function name
        21                   21        1217         33929        <_AFPSSup.ReferenceData_grid>
        36                   15        860          544          _getLandOrWaterRefData
        51                   14        839          10584        <compile>
        63                   12        662          11476        <_AFPSDB.Parm_saveParameter>
        69                   7         386          34386        <_AFPSSup.IFPClient_getReferenceData>
        75                   5         313          25257        <_AFPSSup.ReferenceData_pyGrid>
        80                   5         289          2356         <_AFPSSup.new_HistoSampler>
        84                   4         224          4746         shuffle
        86                   2         112          9864         <_AFPSSup.IFPClient_getTextData>
        87                   2         90           2299         <_AFPSSup.IFPClient_getParmList>
        89                   1         79           1345         <_AFPSSup.IFPClient_getReferenceInventory>
        90                   1         60           1173         <_AFPSSup.IFPClient_getTopoData>
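The CSV export mentioned on this slide belongs to the author's tool and is not reproduced here. As a rough stand-in, the sketch below shows one way to turn standard `pstats` data into spreadsheet-friendly CSV. The function `stats_to_csv`, its column choice and the `busy` workload are assumptions made for illustration.

```python
import cProfile
import csv
import io
import pstats


def stats_to_csv(stats, out):
    """Write one CSV row per profiled function, most expensive (by total time) first."""
    writer = csv.writer(out)
    writer.writerow(["function", "ncalls", "tottime", "cumtime"])
    # stats.stats maps (filename, lineno, name) -> (cc, nc, tottime, cumtime, callers)
    rows = sorted(stats.stats.items(), key=lambda item: item[1][2], reverse=True)
    for (filename, lineno, name), (cc, nc, tottime, cumtime, callers) in rows:
        writer.writerow([f"{filename}:{lineno}({name})", nc, f"{tottime:.6f}", f"{cumtime:.6f}"])


def busy():
    return sorted(range(50_000), reverse=True)


profiler = cProfile.Profile()
profiler.runcall(busy)

buffer = io.StringIO()
stats_to_csv(pstats.Stats(profiler), buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

Sorting by total time before export puts the likely candidates for a watch list at the top of the sheet.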
    19. Almost all the time is in one place
        • In the previous slide, based off our actual application at work, 90% of the time spent in the automated test was concentrated in just 12 functions.
        • The total number of functions measured was 6,763.
        • 90% of the time is spent in about 0.2% (roughly 2 in every 1,000) of the functions. Looking for where to improve speed is no mystery here!
        • The codebase is mostly Python… but the expensive operations are mostly in C. I guess this is a good thing!
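The arithmetic behind this slide is easy to reproduce. The sketch below uses made-up timings and a hypothetical helper, `smallest_covering_set`, to find the most expensive functions that together account for a target share of the run time. The last line checks the 12-out-of-6,763 figure quoted above.

```python
def smallest_covering_set(total_times, target_share=0.9):
    """Return the most expensive functions whose combined time reaches target_share."""
    grand_total = sum(total_times.values())
    ranked = sorted(total_times.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0.0
    for name, seconds in ranked:
        if running >= target_share * grand_total:
            break
        selected.append(name)
        running += seconds
    return selected


# Made-up timings in seconds, not the BoM data from the previous slide.
timings = {"grid": 60.0, "save": 25.0, "compile": 10.0, "parse": 3.0, "log": 2.0}
print(smallest_covering_set(timings, 0.9))  # ['grid', 'save', 'compile']

# Share of functions that held 90% of the time in the talk's example.
print(round(100 * 12 / 6763, 2))  # 0.18 (percent)
```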
    20. Version Control Integration
        • Version control integration is primitive, but available
        • py.test --bench_history --override-decorator --version_tag=0.4
        • Goals are to:
            • Clean up the syntax for this
            • Set up auto-sniffing of version tags
    21. Visualisation and Key Metrics
        • Integration with codespeed is in a decoupled module which only relies on the filesystem structure created by
        • Which means you can make use of on-the-desk to produce reports without the web interface
        • Or it means you can adjust your own benchmarking rig to produce compatible file output and easily integrate with codespeed
    22. Taking a look at the demo
        Image produced by the author. Data based on real execution of sort functions.
    23. Benchmarking 102
        • Controlling the environment
            • Run it on a box that isn’t doing anything else!
            • Distributed is solvable, but not done yet
        • Writing specific tests
            • Your tests may not be representative of program user experience, so you might want to write specific tests for benchmarking against
            • Execution time is data-dependent (e.g. large arrays). Make sure you have a consistent standard, and make sure you have a realistic standard
        • Measure the test, not the function
            • The function may get called by other top-level functions, so you need to pull that apart to understand the relationships
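One way to get a "consistent standard" for data-dependent timings is to generate the input from a fixed seed and repeat the measurement several times. This is a sketch using only the standard `random` and `timeit` modules. The seed, sizes and function names are arbitrary choices for illustration.

```python
import random
import timeit


def make_dataset(size, seed=2011):
    """Build the same pseudo-random input on every run so timings are comparable."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(size)]


def time_sort(size, repeats=5):
    """Return the best-of-`repeats` time, in seconds, for 10 sorts of a fixed dataset."""
    data = make_dataset(size)
    # sorted() returns a new list, so every run starts from the same unsorted input.
    runs = timeit.repeat(lambda: sorted(data), number=10, repeat=repeats)
    return min(runs)


small = time_sort(1_000)
large = time_sort(50_000)
print(f"1,000 items: {small:.4f}s   50,000 items: {large:.4f}s")
```

Taking the minimum of several repeats reduces the influence of whatever else the machine happens to be doing, which complements the advice to run on an otherwise idle box.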
    24. Benchmarking 102
        • Total Time vs Cumulative Time
            • Total time is where a three-deep loop iterates on a large array
            • Cumulative time is where you call that function with a large array… and wait
            • Total time is the CPU time in-function
            • Cumulative time accumulates the cost of called functions
        • Large per-call total time is bad.
            • It means a large operation.
            • Either increase its efficiency, or reduce the number of times it is called
        • Small per-call total time can be okay.
            • It means a small operation.
            • Efficiency is only important if it is called many times
            • But can you unroll the function to reduce call overhead?
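The difference between total and cumulative time shows up clearly with a thin wrapper around an expensive function. In this sketch `inner` and `outer` are made-up functions. The figures are read from the `stats` mapping that `pstats.Stats` exposes, assuming its usual layout of (call count, primitive call count, total time, cumulative time, callers).

```python
import cProfile
import pstats


def inner(values):
    # The real work happens here, so this function has a large total time.
    total = 0
    for v in values:
        total += v * v
    return total


def outer(values):
    # Does almost nothing itself, but its cumulative time includes inner().
    return inner(values)


profiler = cProfile.Profile()
profiler.runcall(outer, list(range(200_000)))
stats = pstats.Stats(profiler)

# Index the raw entries by function name: (cc, nc, tottime, cumtime, callers).
by_name = {key[2]: value for key, value in stats.stats.items()}
outer_tottime, outer_cumtime = by_name["outer"][2], by_name["outer"][3]
inner_tottime = by_name["inner"][2]

print(f"outer: tottime={outer_tottime:.4f} cumtime={outer_cumtime:.4f}")
print(f"inner: tottime={inner_tottime:.4f}")
```

Here `outer` should show a tiny total time but a cumulative time at least as large as `inner`'s total time, so the place to optimise is `inner`.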
    25. Future Directions (Bugs n Stuff)
        • (1) Needs a userbase larger than one
        • (1) Improved version control information (version sniffing)
        • (2) Needs to properly namespace functions
        • (2) The codespeed timeline is a bit broken (uses submission time, not data validity time – looks like a bug in codespeed)
        • (3) Expansion into memory, disk and network profiling
        • (3) Expansion into interactive benchmarking through usage analysis and dialog-based user queries
        • (3) Maybe create a benchmarker class to allow multiple instances? (I believe this is actually not as necessary as feedback would suggest)
    26. Acknowledgements
        • Thanks to
            • Ed Schofield, who got the Codespeed integration over the line
            • Miquel Torres, developer of Codespeed
            • Bureau of Meteorology, for allowing this work to progress as open source
    27. The Centre for Australian Weather and Climate Research
        A partnership between CSIRO and the Bureau of Meteorology
        Tennessee Leeuwenburg
        Phone: 03 9669 4310
        Work Email:
        Email:
        Web:
        Thank you