Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

Succeeding in academia despite doing good_software

3,486 views

Published on

Hacking academia for fun and profit

Thoughts on succeeding in academia despite doing good software

Keynote I gave at the Scipyconf Argentina 2014 conference

The advancement of science is a noble cause, and academia a fierce battlefield for tenure. Software is seen as a mere technicality, not worth a line on an academic CV. I claim that, on the opposite software, is the new medium of scientific method. I claim that succeeding in academia can be achieved not despite writing good software but via such an accomplishment. The key is to choose the right battles and to win them.

What is the emerging role of software in the scientific workflow? Which are the software challenges that can have impact? How to balance software quality assurance and the quick turn-around random-walk of research? What does "good design" mean for research software? What Python patterns can boost productivity and reuse in exploratory scientific computing?

I will try to answer these questions, based on my personal experience of growing up to become an academic Pythonista.

Published in: Software

Succeeding in academia despite doing good_software

  1. 1. Hacking academia for fun and profit Thoughts on succeeding in academia despite doing good software Varoquaux Ga¨el Strong is the power of the dark side
  2. 2. Hacking academia for fun and profit Thoughts on succeeding in academia despite doing good software Varoquaux Ga¨el Strong is the power of the dark side with Python
  3. 3. Hacking academia for fun and profit Thoughts on succeeding in academia despite doing good software Varoquaux Ga¨el Strong is the power of the dark side despite good software?
  4. 4. Publish or perish Want a career? G Varoquaux 2
  5. 5. Publish or perish Want a career? Broken value system Only Science matters Wrong incentives G Varoquaux 2
  6. 6. Publish or perish Want a career? Hack academia! G Varoquaux 2
  7. 7. Publish or perish Want a career? Hack academia! G Varoquaux 2
  8. 8. Publish or perish Want a career? Hack academia! Publishing scientific software matters [Pradal, Varoquaux, Langtangen, J Computational Science] G Varoquaux 2
  9. 9. [TL;DR]‹ Choose your battles keeping science in the target Win them software production ‹ Too Long, Didn’t Read G Varoquaux 3
  10. 10. Growing up as a geek scientist G Varoquaux 4
  11. 11. Growing up as a geek scientist I did a PhD in quantum physics G Varoquaux 5
  12. 12. Growing up as a geek scientist I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Best training ever for agile project managementG Varoquaux 6
  13. 13. Growing up as a geek scientist I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Computers were only one of the many moving parts Matlab Instrument control G Varoquaux 7
  14. 14. Growing up as a geek scientist I did a PhD in quantum physics Vacuum (leaks) Electronics (shorts) Lasers (mis-alignment) Computers were only one of the many moving parts Matlab Instrument controlShaped my vision of computing as a means to an end G Varoquaux 7
  15. 15. Success 2011 Tenured researcher in computer science G Varoquaux 8
  16. 16. Success 2011 Tenured researcher in computer science Today Growing team with data science rock stars G Varoquaux 8
  17. 17. Success 2011 Tenured researcher in computer science Today Growing team with data science rock stars How / why did I switch? Fernando Perez (IPython), Prabhu Ramachandran (Mayavi), Eric Jones (Enthought), Travis Oliphant (Numpy)... G Varoquaux 8
  18. 18. Success 2011 Tenured researcher in computer science Today Growing team with data science rock stars How / why did I switch? Fernando Perez (IPython), Prabhu Ramachandran (Mayavi), Eric Jones (Enthought), Travis Oliphant (Numpy)... Learning fast is more impor- tant than knowing something Bye bye physics G Varoquaux 8
  19. 19. And now... What I do nowadays: Machine learning to understand brain function Cognitive neuroscience: Link neural activity to thoughts and cognition G Varoquaux 9
  20. 20. And now... What I do nowadays: Machine learning to understand brain function Learn a bilateral link between brain activity and cognitive function G Varoquaux 9
  21. 21. Software is the new medium of scientific method Galileo’s notes G Varoquaux 10
  22. 22. The scientific method 1. Make conjectures 2. Derive prediction 3. Carry experiments 4. Confirm or infirm conjectures G Varoquaux 11
  23. 23. The scientific method 1. Make conjectures 2. Derive prediction 3. Carry experiments 4. Confirm or infirm conjectures Software is everywhere Data-mining Computational models Computer-controled experiments Data analysis G Varoquaux 11
  24. 24. The scientific method 1. Make conjectures 2. Derive prediction 3. Carry experiments 4. Confirm or infirm conjectures Code is often the very language in which predictions are expressed Models are now more complex than a simple formula or sentence G Varoquaux 11
  25. 25. Enabling falsification: reproducible science Replicating A 3rd party redoing the work Code and data made available Reproducing New analysis on different data / code coming to the same conclusion Reusing Applying the approach to a new problem Let us enable reusable research G Varoquaux 12
  26. 26. Enabling falsification: reproducible science Replicating A 3rd party redoing the work Code and data made available Reproducing New analysis on different data / code coming to the same conclusion Reusing Applying the approach to a new problem Let us enable reusable research Arguments for BSD license No strings attached Can tinker with it G Varoquaux 12
  27. 27. Reusable science ñ evidence accumulation Accumulation of scientific knowledge and learning formal representations G Varoquaux 13
  28. 28. Reusable science ñ evidence accumulation Accumulation of scientific knowledge and learning formal representations Akin to a review paper of the field But a mathematical model is more testable Machine learning: engineering knowledge from data G Varoquaux 13
  29. 29. Reusable science ñ evidence accumulation Accumulation of scientific knowledge and learning formal representations Akin to a review paper of the field But a mathematical model is more testable “A theory is a good theory if it satisfies two requirements: It must accurately describe a large class of observa- tions on the basis of a model that contains only a few arbitrary elements, and it must make definite predic- tions about the results of future observations.” Stephen Hawking, A Brief History of Time. G Varoquaux 13
  30. 30. The sweet spots across science and software G Varoquaux 14
  31. 31. The advancement of knowledge Imagine a circle that contains human knowledge Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  32. 32. The advancement of knowledge By the time you finish elementary school, you know a little Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  33. 33. The advancement of knowledge High school takes you a little bit further Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  34. 34. The advancement of knowledge With a bachelors degree, you gain a speciality Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  35. 35. The advancement of knowledge A master’s degree deepens this speciality Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  36. 36. The advancement of knowledge Research papers take you to the edge of human knowledge Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  37. 37. The advancement of knowledge Once you are at the boundary, you focus Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  38. 38. The advancement of knowledge You push at the boundary for a few years Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  39. 39. The advancement of knowledge And one day it yields Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  40. 40. The advancement of knowledge That dent you’ve made, is called a PhD Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  41. 41. The advancement of knowledge Of course, the world looks different to you now Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  42. 42. The advancement of knowledge But don’t forget the big picture PhD Courtesy of Matt Might, via Stefan van der Waalt G Varoquaux 15
  43. 43. The advancement of knowledge This is an optimistic view Biology Maths Computer science Physics Economy Literature History G Varoquaux 15
  44. 44. The advancement of knowledge This is an optimistic view Biology Maths Computer science Physics Economy Literature History I want to be there G Varoquaux 15
  45. 45. Translationnal computional science Computational science The use of computers and mathematical models to address scientific research G Varoquaux 16
  46. 46. Translationnal computional science Computational science The use of computers and mathematical models to address scientific research Translationnal science In medecine: bring bench science to medical practice G Varoquaux 16
  47. 47. Translationnal computional science Computational science The use of computers and mathematical models to address scientific research Translationnal science In medecine: bring bench science to medical practice Translational computational science? G Varoquaux 16
  48. 48. Pick a problem to work on Take the “easy” route There needs to be a market screeming for the software (in academia and in industry) Refine your vision Pull, not push Design driven be need G Varoquaux 17
  49. 49. Having an impact G Varoquaux 18
  50. 50. Having an impact G Varoquaux 18
  51. 51. Pick the right battles: viable projects Project idea A software implementing: i) machine learning and ii) neuroimaging and iii) a graphical user interface and iv) 3D plotting G Varoquaux 19
  52. 52. Pick the right battles: viable projects Project idea A software implementing: i) machine learning and ii) neuroimaging and iii) a graphical user interface and iv) 3D plotting G Varoquaux 19
  53. 53. Pick the right battles: viable projects Project idea A software implementing: i) machine learning and ii) neuroimaging and iii) a graphical user interface and iv) 3D plotting Define project scope and vision Break down projects by expertise Don’t solve hard problems Know the software landscape Don’t target markets that will not yield contributors Need a vision = elevator pitch G Varoquaux 19
  54. 54. Pick the right battles: viable projects Project idea A software implementing: i) machine learning and ii) neuroimaging and iii) a graphical user interface and iv) 3D plotting Define project scope and vision Break down projects by expertise Don’t solve hard problems Know the software landscape Don’t target markets that will not yield contributors Need a vision = elevator pitch Your research (PhD) probably does not qualify ñ need to cherry-pick contributions G Varoquaux 19
  55. 55. Open source and community development Code maintenance too expensive to be alone scikit-learn „ 300 email/month nipy „ 45 email/month joblib „ 45 email/month mayavi „ 30 email/month “Hey Gael, I take it you’re too busy. That’s okay, I spent a day trying to install XXX and I think I’ll succeed myself. Next time though please don’t ignore my emails, I really don’t like it. You can say, ‘sorry, I have no time to help you.’ Just don’t ignore.” G Varoquaux 20
  56. 56. Open source and community development Code maintenance too expensive to be alone scikit-learn „ 300 email/month nipy „ 45 email/month joblib „ 45 email/month mayavi „ 30 email/month Your “benefits” come from a fraction of the code Data loading? Maybe? Standard algorithms? Nah Share the common code... ...to avoid dying under code Code becomes less precious with time And somebody might contribute features G Varoquaux 20
  57. 57. Community development in scikit-learn Huge feature set: benefits of a large team Project growth: More than 200 contributors „ 12 core contributors 1 full-time INRIA programmer from the start Estimated cost of development: $ 6 millions COCOMO model, http://www.ohloh.net/p/scikit-learn G Varoquaux 21
  58. 58. Communities: many eyes makes code fast L. Buitinck, O. Grisel, A. Joly, G. Louppe, J. Nothman, P. Prettenhofer G Varoquaux 22
  59. 59. Having an impact You need a community G Varoquaux 23
  60. 60. What’s in a scientific-computing environment G Varoquaux 24
  61. 61. The scientific workflow agile Interaction... Ñ script... Ñ module... ý interaction again... Consolidation, progressively Low tech and short turn-around times G Varoquaux 25
  62. 62. Choose your weapons Python, what else? Interactive language Easy to read / write General purpose G Varoquaux 26
  63. 63. Choose your weapons Python, what else? Interactive language Easy to read / write General purpose Old virtual machine / compiler Younger languages promissing (Julia) but will they get adoption beyond science? G Varoquaux 26
  64. 64. Choose your weapons Python, what else? +Numpy arrays Shoe-horn your data in a numpy array, and you’ve won personnally disappointed that pandas drifted away G Varoquaux 26
  65. 65. Software architecture for science “Scriptability” is paramount In an application: MVC (model, view, controller) Model Numerical or data-processing core View Ouput: graphs, or files Must enable headless use Controller Input: dialogs, or an API Avoid input as files: not expressive Dialogs should never be far from the code Dialog generation: traits, IPython widgets Reactive programming: dialogs modify object, and the model updates Don’t own the main G Varoquaux 27
  66. 66. Software architecture for science “Scriptability” is paramount In an application: MVC (model, view, controller) Model Numerical or data-processing core View Ouput: graphs, or files Must enable headless use Controller Input: dialogs, or an API Avoid input as files: not expressive Dialogs should never be far from the code Dialog generation: traits, IPython widgets Reactive programming: dialogs modify object, and the model updates Don’t own the main In Mayavi: script generation for free G Varoquaux 27
  67. 67. Quality is free‹ ‹ This is a book, by Philip Crosby G Varoquaux 28
  68. 68. You need quality Quality will give you users Bugs give you bad rap Quality will give you developers Contribute to learn and improve Quality will make your developers happy People need to be proud of their work Do less, do better Goes against the grant-system incentive G Varoquaux 29
  69. 69. Quality: what & how Great documentation Simplify, but don’t dumb down Focus on what the user is trying to solve Great APIs Example-based development If something is hard to explain, rethink the concepts Limit the number of different concepts and objects Consistency, consistency, consistency Good numerics Write tests based on mathematical properties When a user finds an instability, write a new test G Varoquaux 30
  70. 70. Quality: what & how Great documentation Simplify, but don’t dumb down Focus on what the user is trying to solve Great APIs Example-based development If something is hard to explain, rethink the concepts Limit the number of different concepts and objects Consistency, consistency, consistency Good numerics Write tests based on mathematical properties When a user finds an instability, write a new test Quality enables reuse Beyond mere reproducibility G Varoquaux 30
  71. 71. Be productive G Varoquaux 31
  72. 72. Be productive “If you spend too much time thinking about a thing, you’ll never get it done.” — Bruce Lee G Varoquaux 31
  73. 73. Limited resources Limited resources are good Need success in the short term, not the long term The startup culture: fail fast Quickly identify non-viable projects The simpest solution that works is the best G Varoquaux 32
  74. 74. Short cycles, limited ambitions Keep coming back to your users Release early, release often G Varoquaux 33
  75. 75. Simplicity Complexity increase superlinearly [An Experiment on Unit Increase in Problem Complexity, Woodfield 1979] 25% increase in problem complexity ñ 100% increase in code complexity G Varoquaux 34
  76. 76. Simplicity Complexity increase superlinearly [An Experiment on Unit Increase in Problem Complexity, Woodfield 1979] 25% increase in problem complexity ñ 100% increase in code complexity The 80/20 rule 80% of the usecases can be solved with 20% of the lines of code Avoid feature creep G Varoquaux 34
  77. 77. Simplicity Complexity increase superlinearly [An Experiment on Unit Increase in Problem Complexity, Woodfield 1979] 25% increase in problem complexity ñ 100% increase in code complexity The 80/20 rule 80% of the usecases can be solved with 20% of the lines of code Avoid feature creep Use objects sparingly Don’t use classes for the sake of it G Varoquaux 34
  78. 78. Software engineering G Varoquaux 35
  79. 79. Software engineering good practices Version control Use git + github Unit testing If it’s not tested, it’s broken or soon will be. Make a package, with controlled dependencies and compilation ... G Varoquaux 36
  80. 80. Research ‰ production Need to adapt software-engineering principles Ó Good naming is free Use functions, not scripts Version control is very cheap Tests are more expensive... Considering if goals stabilizes Build chains are hard Go down the chain as your research progress You can think of shipping a software only if it was viable to go completely down the chain G Varoquaux 37
  81. 81. Things we did right (maybe) G Varoquaux 38
  82. 82. Mayavi: 3D visualization in Python Success factors Building upon VTK Great power Component model (UI) Internals open to the world ñ from interaction to scripting Limiting factors Building upon VTK A lot of complexity Codebase too complex and object-oriented (bound to VTK) Users of GUIs do not turn into developers Composition is an API killer G Varoquaux 39
  83. 83. Mayavi: 3D visualization in Python Success factors Building upon VTK Great power Component model (UI) Internals open to the world ñ from interaction to scripting Limiting factors Building upon VTK A lot of complexity Codebase too complex and object-oriented (bound to VTK) Users of GUIs do not turn into developers Composition is an API killer G Varoquaux 39
  84. 84. joblib: computational workflow patterns Parallel for loop >>> from joblib import Parallel, delayed >>> Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] On-demand dispatch to ease memory consumption Threading and processes backends G Varoquaux 40
  85. 85. joblib: computational workflow patterns Parallel for loop >>> from joblib import Parallel, delayed >>> Parallel(n jobs=2)(delayed(sqrt)(i**2) ... for i in range(8)) [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0] Memoize pattern mem = joblib.Memory(cachedir=’.’) g = mem.cache(f) b = g(a) # computes a using f c = g(a) # retrieves results from store G Varoquaux 40
  86. 86. joblib: computational workflow patterns Success factors Simplicity of use Patterns we really, really need (pull not push) G Varoquaux 41
  87. 87. joblib: computational workflow patterns Success factors Simplicity of use Patterns we really, really need (pull not push) Limiting factor Vision of the project unclear Positioning with regards to landscape unclear (IPython, where are you headed?) Tricky code inside G Varoquaux 41
  88. 88. scikit-learn: machine learning in Python Success factors Right project vision Machine learning without learning the machinery Black box that can be opened Right trade-off between ”just works” and versatility (think Apple vs Linux) We’re not going to solve all the problems for you I don’t solve hard problems Feature-engineering, domain-specific cases... Python is a programming language. Use it. Cover all the 80% usecases in one package G Varoquaux 42
  89. 89. scikit-learn: machine learning in Python Success factors Right project vision High-level programming - Optimize algorithmes, not for loops - Know perfectly Numpy and scipy All significant data should be in arrays Avoid memory copies, rely on blas/lapack - Use Cython, quad not C/C++ G Varoquaux 42
  90. 90. scikit-learn: machine learning in Python Success factors Right project vision High-level programming Good API design - separate data from operations 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0 0 3 8 7 8 7 9 4 7 9 7 9 2 7 0 1 7 9 0 7 5 2 7 0 1 5 7 8 9 4 0 7 1 7 4 6 1 2 4 7 9 7 5 4 9 7 0 7 1 8 7 1 7 8 8 7 1 3 6 5 3 4 9 0 4 9 5 1 9 0 7 4 7 5 4 2 6 5 3 5 8 0 9 8 4 8 7 2 1 5 4 6 3 4 9 0 8 4 9 0 3 4 5 6 7 3 2 4 5 6 1 4 7 8 9 5 7 1 8 7 7 4 5 6 2 0G Varoquaux 42
  91. 91. scikit-learn: machine learning in Python Success factors Right project vision High-level programming Good API design - separate data from operations - Object API exposes a data-processing language fit, predict, transform, score, partial fit - Instantiated without data but with all parameters G Varoquaux 42
  92. 92. scikit-learn: machine learning in Python Success factors Right project vision High-level programming Good API design Great community - Github + code review G Varoquaux 42
  93. 93. scikit-learn: machine learning in Python Success factors Right project vision High-level programming Good API design Great community Great documentation G Varoquaux 42
  94. 94. scikit-learn: machine learning in Python Success factors Right project vision High-level programming Good API design Great community Great documentation Limiting factors Tricky numerical code Our own success ñ huge volume G Varoquaux 42
  95. 95. @GaelVaroquaux Succeeding in academia despite doing good software 1 Game the system It’s about convincing a tenure committee Code must contribute to a scientific problem
  96. 96. @GaelVaroquaux Succeeding in academia despite doing good software 1 Game the system 2 Not all battles can be fought Make sure that there is a market Don’t solve hard problems Problems that matter for science and industry
  97. 97. @GaelVaroquaux Succeeding in academia despite doing good software 1 Game the system 2 Not all battles can be fought 3 Make good software That actually answers scientists needs With quality, software engineering Relying on a communauty Usability matters
  98. 98. @GaelVaroquaux Succeeding in academia despite doing good software 1 Game the system 2 Not all battles can be fought 3 Make good software Now go out, and code!

×