Clean code in Jupyter notebooks


  1. 1. @KNerush @Volodymyrk Clean Code In Jupyter notebooks, using Python 1 5th of July, 2016
  2. 2. @KNerush @Volodymyrk Volodymyr (Vlad) Kazantsev Head of Data @ product madness Product Manager MBA @LBS Graphics programming Writes code for money since 2002 Math degree 2 Kateryna (Katya) Nerush Mobile Dev @ Octopus Labs Dev Lead in Finance Data Engineer Web Developer Writes code for money since 2003 CS degree
  3. 3. @KNerush @Volodymyrk Why do we end up with messy ipy notebooks? 3 Coding Stats Business
  4. 4. @KNerush @Volodymyrk Who are Data Scientists, really? 4 Coding Stats Business “In a nutshell, coding is telling a computer to do something using a language it understands.” Data Science with Python
  5. 5. @KNerush @Volodymyrk It is not going to production anyway! 5
  6. 6. @KNerush @Volodymyrk “Any fool can write code that a computer can understand. Good programmers write code that humans can understand” - Kent Beck, 1999 6 WTF! How am I supposed to validate this?? Sorry, but how can I calculate 7 day retention?
  7. 7. @KNerush @Volodymyrk From Prototype to ... The Data Science Spiral 7 Ideas & Questions Data Analysis Insights Impact
  8. 8. @KNerush @Volodymyrk You do it for your own good.. 8 Re-run all AB tests analysis for the last months, by tomorrow Ideas & Questions Data Analysis Insights Impact
  9. 9. @KNerush @Volodymyrk Part 2 What can Data Scientists learn from Software Engineers? 9
  10. 10. @KNerush @Volodymyrk Robert C. Martin, a.k.a. “Uncle Bob” 10 https://cleancoders.com/
  11. 11. @KNerush @Volodymyrk “Clean Code” ? 11 Pleasingly graceful and stylish in appearance or manner Bjarne Stroustrup Inventor of C++ Clean code reads like well written prose Grady Booch creator of UML .. each routine turns out to be pretty much what you expected Ward Cunningham inventor of Wiki and XP
  12. 12. @KNerush @Volodymyrk One does not simply start writing clean code.. 12 First make it work, Then make it Right, Then make it fast and small Kent Beck co-inventor of XP and TDD Leave the campground cleaner than you found it - Run all the tests - Contains no duplicate code - Expresses all ideas... - Minimize classes and methods Ron Jeffries author of Extreme Programming Installed The Boy Scouts of America Applied to programming by Uncle Bob
  13. 13. @KNerush @Volodymyrk I'm not a great programmer; I'm just a good programmer with great habits. 13 Kent Beck
  14. 14. @KNerush @Volodymyrk “There are only two hard problems in Computer Science: cache invalidation and naming things" - Phil Karlton long_descriptive_names Avoid: x, i, stuff, do_blah() Pronounceable and Searchable revenue_per_payer vs. arpdpu Avoid encodings, abbreviations, prefixes, suffixes.. if possible bonus_points_on_iphone vs. cns_crm_dip Add meaningful context daily_revenue_per_payer Don’t be lazy. Spend time naming and renaming things. 14
  15. 15. @KNerush @Volodymyrk “each routine turns out to be pretty much what you expected” - Ward Cunningham Small Do one thing One Level of Abstraction Have only few arguments (one is the best) Less important in Python, with named arguments. 15
  16. 16. @KNerush @Volodymyrk Use good names Avoid obvious comments. Dead Commented-out Code ToDo, licenses, history, markup for documentation and other nonsense But there are exceptions.. “When you feel the need to write a comment, first try to refactor the code so that any comment becomes superfluous” Kent Beck 16
  17. 17. @KNerush @Volodymyrk // When I wrote this, only God and I understood what I was doing // Now, God only knows 17
  18. 18. @KNerush @Volodymyrk // sometimes I believe compiler ignores all my comments 18
  19. 19. @KNerush @Volodymyrk /** * Always returns true. */ public boolean isAvailable() { return false; } 19
  20. 20. @KNerush @Volodymyrk “Long functions is where classes are trying to hide” - Robert C. Martin 20 Small Do one thing SOLID, Design Patterns, etc.
  21. 21. @KNerush @Volodymyrk Code conventions The team should produce code in one style, as if it were written by one person Team conventions over language ones, over personal ones Automate style formatting 21
  22. 22. @KNerush @Volodymyrk Part 3 How to write Clean Code in Python? (e.g. this is not Java) 22
  23. 23. @KNerush @Volodymyrk ● Indentation ● Tabs or Spaces? ● Maximum Line Length ● Should a line break before or after a binary operator? ● Blank Lines ● Imports ● Comments ● Naming Conventions Example: PEP 8 -- Style Guide for Python Code 23 foo = long_function_name(var_one, var_two, var_three, var_four) foo = long_function_name(var_one, var_two, var_three, var_four) Good Bad https://www.python.org/dev/peps/pep-0008/
  24. 24. @KNerush @Volodymyrk Google Python Style Guide 24 https://google.github.io/styleguide/pyguide.html
  25. 25. @KNerush @Volodymyrk 25 My favourite! This is not Java or C++ Functions are first-class objects Duck-typing as an interface No setters/getters Itertools, zip, enumerate etc.
  26. 26. @KNerush @Volodymyrk Part 4 How to write Clean Python Code in Jupyter Notebook? 26
  27. 27. @KNerush @Volodymyrk Typical structure of the ipynb: 1. Imports 2. Get Data 3. Transform Data 4. Modelling 5. Visualisation 6. Making sense of the data 27
  28. 28. @KNerush @Volodymyrk How big should a notebook file be? 28
  29. 29. @KNerush @Volodymyrk How big should a notebook file be? Hypothesis - Data - Interpretation 29
  30. 30. @KNerush @Volodymyrk Keep your notebooks small! (4-10 cells each) 30
  31. 31. @KNerush @Volodymyrk Example: Tip 1: break a fat notebook into many small ones 31 1_data_preparation.ipynb df.to_pickle('clean_data_1.pkl') 2_linear_model.py df = pd.read_pickle('clean_data_1.pkl') 3_ensamble.py df = pd.read_pickle('clean_data_1.pkl')
  32. 32. @KNerush @Volodymyrk Tip 2: shared library Data access Common plotting functionality Report generation Misc. utils 32 acme_data_utils Data_access.py plotting.py setup.py tests/
  33. 33. @KNerush @Volodymyrk Tip 3: Don’t just be pythonic. Be IPythonic Don’t hide “secret sauce” inside imported module BAD: Good: 33
  34. 34. @KNerush @Volodymyrk Clean code reads like well written prose 34 Grady Booch
  35. 35. @KNerush @Volodymyrk Good jupyter notebook reads like well written prose 35
  36. 36. @KNerush @Volodymyrk How big should one Cell be? 36
  37. 37. @KNerush @Volodymyrk One “idea - execution - output” triplet per cell Import Cell: expected output is no import errors CMD+SHIFT+P 37 Tip 4: each cell should have one logical output
  38. 38. @KNerush @Volodymyrk Tip 5: write tests .. in jupyter notebooks 38 https://pypi.python.org/pypi/pytest-ipynb
  39. 39. @KNerush @Volodymyrk Tip 6: ..to the cloud 39
  40. 40. @KNerush @Volodymyrk Code Smells .. in ipynb - Cells can’t be executed in order (with runAll and Restart&RunAll) - Prototype (check ideas) code is mixed with “analysis” code - Debugging cells - Copy-paste cells - Duplicate code (in general) - Multiple notebooks that re-implement the same function 40
  41. 41. @KNerush @Volodymyrk Tip 7: Run notebook from another notebook! 41 analysis.ipynb
  42. 42. @KNerush @Volodymyrk Make Data Product from notebooks! 42
  43. 43. @KNerush @Volodymyrk Summary: How to organise a Jupyter project 1. A notebook should have one Hypothesis-Data-Interpretation loop 2. Make a multi-project utils library 3. A good jupyter notebook reads like well written prose 4. Each cell should have one and only one output 5. Write tests in notebooks 6. Deploy a shared Jupyter server 7. Try to keep code inside notebooks. Avoid refactoring to modules, if possible. 43
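
Tip 1 (slide 31) hands data from one small notebook to the next through a pickle file. The sketch below shows that hand-off using only the standard library `pickle` module, so it runs without extra packages; the slide itself uses pandas (`df.to_pickle` / `pd.read_pickle`), which plays the same role. The sample rows and the helper names `prepare_data`, `save_stage` and `load_stage` are hypothetical, chosen only for illustration.

```python
import pickle
import tempfile
from pathlib import Path


def prepare_data(raw_rows):
    """Stage 1 (think 1_data_preparation): drop rows with missing revenue."""
    return [row for row in raw_rows if row["revenue"] is not None]


def save_stage(data, path):
    """Persist the output of one stage so the next notebook can start from it."""
    with open(path, "wb") as handle:
        pickle.dump(data, handle)


def load_stage(path):
    """Load the output of a previous stage."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# A temporary directory stands in for the project folder.
work_dir = Path(tempfile.mkdtemp())
clean_path = work_dir / "clean_data_1.pkl"

# Hypothetical raw data; a real notebook would query a database or read files.
raw_rows = [
    {"user": "a", "revenue": 3.0},
    {"user": "b", "revenue": None},
    {"user": "c", "revenue": 5.0},
]
save_stage(prepare_data(raw_rows), clean_path)

# Stage 2 (think 2_linear_model) starts from the saved file, not from raw data.
clean_rows = load_stage(clean_path)
total_revenue = sum(row["revenue"] for row in clean_rows)
```

Each downstream notebook then begins with a single load call, which keeps it short and lets it be re-run without repeating the preparation step.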

Editor's Notes

  • Data Scientists come from various backgrounds.
    In my company, many came from the business “dark” side.
  • http://www.slideshare.net/ISchwarz23/clean-code-49797249
  • Part 2 of the talk is, of course, heavily inspired by the work of Robert C. Martin, his books, website and his absolutely wonderful video podcast. Shameless plagiarism alert!
  • So, what is Clean code anyway?

    “Pleasingly graceful and stylish in appearance or manner” - and this guy invented C++?! Graceful and stylish..

    “Clean code reads like well written prose” - Grady Booch, creator of UML. Interestingly, UML can’t be read like prose at all.. But maybe there is something in there after all..

    “.. each routine turns out to be pretty much what you expected” Ward Cunningham, inventor of Wiki.
    So, clean code should not “surprise”. I can definitely relate to that. How often do you open someone else’s code and go
    “Wow, what is that.. This is a very curious and inventive way.. But.. WHY??”

    So clean code should aim not to surprise. It should be even “boring” and “predictable”. And consistent.. But more on that later

  • First make it work,
    Then make it Right,
    Then make it fast and small

    These are the design rules of Kent Beck, creator and proponent of Test Driven Development.

    Another recipe for clean code comes from Ron Jeffries, a leading author on Agile, XP and good practices:

    Clean code is code that..
    Runs all the tests
    Contains no duplicate code
    Expresses all ideas...
    Minimizes classes and methods
    Robert C. Martin found a very successful metaphor for writing clean code sustainably. It is called the Boy Scout rule of development:
    Leave the campground cleaner than you found it
  • So let’s take a look at how to develop those great habits
  • Naming things..
    "There are only two hard problems in Computer Science: cache invalidation and naming things." - Phil Karlton, Principal Architect at Netscape
    Rule 1: long_descriptive_names. Don’t be afraid to type a long name. All modern IDEs have auto-complete. And if yours doesn’t, get a better one that does! Even Python Notebooks have autocomplete for variable and function names. x, i, stuff and do_blah() are real variable or function names that I have seen!
    Rule 2: a name should be easy to pronounce. arpdpu stands for .. Vova, what does this stand for again??
    Vova: “Average Revenue per Daily Paying User”
    Rule 3: avoid encodings and abbreviations. The only exception is where everyone in the organisation already knows that DAU stands for Daily Active Users. Avoid Hungarian notation and other nonsense. But use domain names.
    Rule 4: add relevant context. Of course, long names make it hard to keep lines within 79 characters. But I prefer to have longer lines (we all have UltraHD Retina monitors these days anyway) rather than shorter and obscure names.
    Rule 5: naming is hard. Very hard. So think hard about naming things. And don’t be afraid to refactor and rename if you find a better way to express the purpose of a variable, function or class.
    A good IDE should have refactor-rename functions. And even Python Notebook has “search and replace” these days.
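
As a small illustration of the naming rules in this note, the sketch below writes the same calculation twice, once with obscure names and once with descriptive ones. The input layout (a list of `(payer_id, amount)` tuples for one day) and the formula itself are assumptions made for the example, not something taken from the talk.

```python
# Hard to read: what are x and i, and what does the abbreviation mean?
def arpdpu(x):
    return sum(i[1] for i in x) / len({i[0] for i in x})


# The same calculation with pronounceable, searchable names and added context.
def daily_revenue_per_payer(payments):
    """Average revenue per paying user for one day.

    payments: list of (payer_id, amount) tuples (assumed layout).
    """
    total_revenue = sum(amount for _payer_id, amount in payments)
    unique_payers = {payer_id for payer_id, _amount in payments}
    return total_revenue / len(unique_payers)


# Hypothetical sample data: two payers, three payments.
payments_today = [("anna", 4.0), ("boris", 6.0), ("anna", 2.0)]
revenue_per_payer = daily_revenue_per_payer(payments_today)
```

Both functions return the same number; only the second can be read without asking a colleague what the abbreviation stands for.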

  • Functions..
    Rules of functions
    should be small
    should be smaller than that

    do one thing (also applies to classes)
    they should do it well
    do it only

    One Level of Abstraction:
    Don’t mix high-level policy and low-level details
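
A minimal sketch of these function rules, using the 7 day retention question from slide 6 as the subject. Parsing (a low-level detail) lives in one small function and the retention rule (high-level policy) in another. The `"user,day"` input format and this particular definition of retention are assumptions made for illustration only.

```python
def load_sessions(raw_lines):
    """Low-level detail: parse 'user,day' text lines into (user_id, day) tuples."""
    sessions = []
    for line in raw_lines:
        user_id, day = line.strip().split(",")
        sessions.append((user_id, int(day)))
    return sessions


def users_active_on(sessions, day):
    """Return the set of users with at least one session on the given day."""
    return {user_id for user_id, session_day in sessions if session_day == day}


def day_n_retention(sessions, n):
    """High-level policy: share of day-0 users who come back on day n."""
    cohort = users_active_on(sessions, 0)
    returned = cohort & users_active_on(sessions, n)
    return len(returned) / len(cohort)


# Hypothetical sample: users a and b start on day 0, only a returns on day 7.
raw_lines = ["a,0", "b,0", "a,7", "c,7"]
retention = day_n_retention(load_sessions(raw_lines), 7)
```

Because each function does one thing, a reviewer can validate the retention definition by reading `day_n_retention` alone, without wading through parsing code.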
  • But there are exceptions..
    Complex algorithms
    Technical notes and warnings
    Conventions and rules
  • And this is where I will take over...
  • Same environment
    Same color - reproducibility
  • Advice: reload() modules that are changing, so that RunAll produces the same result as Restart&RunAll
    There should be one and only one narrative (story line) per notebook
    Not the same as test-cells. We are going to talk about Tests and TDD later
    Copy-paste cells - my favourite
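
The reload() advice above can be sketched as follows. In Python 3 the function is `importlib.reload`. The example writes a throwaway module to a temporary directory, imports it, edits it, and reloads it; the module name `acme_plot_helpers` and its function `default_title` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Create a throwaway module on disk (hypothetical name and content).
module_dir = Path(tempfile.mkdtemp())
module_path = module_dir / "acme_plot_helpers.py"
module_path.write_text("def default_title():\n    return 'draft'\n")

# Make the directory importable and import the module once.
sys.path.insert(0, str(module_dir))
importlib.invalidate_caches()
acme_plot_helpers = importlib.import_module("acme_plot_helpers")
title_before = acme_plot_helpers.default_title()

# Someone edits the module while the notebook kernel keeps running.
module_path.write_text("def default_title():\n    return 'weekly revenue report'\n")

# Without a reload the kernel would still hold the old code in memory.
acme_plot_helpers = importlib.reload(acme_plot_helpers)
title_after = acme_plot_helpers.default_title()
```

In a notebook, the reload call would typically sit in the import cell. IPython also ships an `autoreload` extension (`%load_ext autoreload`, then `%autoreload 2`) that aims to do this automatically; whether it suits a given project is a judgment call.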