@KNerush @Volodymyrk
Volodymyr (Vlad) Kazantsev
Head of Data @ product madness
Product Manager
MBA @LBS
Graphics programming
Writes code for money since 2002
Math degree
Kateryna (Katya) Nerush
Mobile Dev @ Octopus Labs
Dev Lead in Finance
Data Engineer
Web Developer
Writes code for money since 2003
CS degree
Who are Data Scientists, really?
Coding / Stats / Business
“In a nutshell, coding is telling a computer to do something using a language it understands.”
Data Science with Python
“Any fool can write code that a computer can understand. Good programmers write code that humans can understand” - Martin Fowler, 1999
WTF! How am I supposed to validate this??
Sorry, but how can I calculate 7-day retention?
You do it for your own good..
Re-run all A/B test analyses for the last months, by tomorrow
Ideas & Questions
Data Analysis
Insights
Impact
“Clean Code” ?
“Pleasingly graceful and stylish in appearance or manner” - Bjarne Stroustrup, inventor of C++
“Clean code reads like well written prose” - Grady Booch, creator of UML
“.. each routine turns out to be pretty much what you expected” - Ward Cunningham, inventor of Wiki and XP
One does not simply start writing clean code..
First make it work,
Then make it right,
Then make it fast and small
Kent Beck, co-inventor of XP and TDD
Clean code:
- Runs all the tests
- Contains no duplicate code
- Expresses all ideas...
- Minimizes classes and methods
Ron Jeffries, author of Extreme Programming Installed
Leave the campground cleaner than you found it
The Boy Scouts of America, applied to programming by Uncle Bob
“There are only two hard problems in Computer Science:
cache invalidation and naming things" - Phil Karlton
long_descriptive_names
Avoid: x, i, stuff, do_blah()
Pronounceable and Searchable
revenue_per_payer vs. arpdpu
Avoid encodings, abbreviations, prefixes, suffixes.. if possible
bonus_points_on_iphone vs. cns_crm_dip
Add meaningful context
daily_revenue_per_payer
Don’t be lazy.
Spend time naming and renaming things.
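A small sketch of the naming rules in Python. The data and both function names are made up for illustration; only the naming style comes from the slide.

```python
# Hypothetical data: (user_id, amount_paid) pairs for a single day.
payments_today = [("u1", 4.99), ("u2", 0.99), ("u1", 9.99)]


# Hard to read: what are f, x, s and p?
def f(x):
    s = sum(i[1] for i in x)
    p = len({i[0] for i in x})
    return s / p


# Same logic with long, descriptive, searchable names.
def daily_revenue_per_payer(daily_payments):
    total_revenue = sum(amount for _, amount in daily_payments)
    paying_user_count = len({user_id for user_id, _ in daily_payments})
    return total_revenue / paying_user_count
```

Both functions return the same number, but only the second one can be found with a search and read aloud in a meeting.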
“each routine turns out to be pretty much what you
expected” - Ward Cunningham
Small
Do one thing
One Level of Abstraction
Have only a few arguments (one is best)
Less important in Python, with named arguments.
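One possible reading of these rules, as a sketch. The function names and the input format (amounts as strings with an optional "$") are assumptions made for the example.

```python
def parse_amount(raw_amount):
    """Low-level detail: turn a string like ' $4.99 ' into a float."""
    return float(raw_amount.strip().lstrip("$"))


def total_revenue(raw_amounts):
    """One step up: add up the parsed amounts."""
    return sum(parse_amount(raw) for raw in raw_amounts)


def format_revenue_report(raw_amounts, currency="USD"):
    """High-level policy only: no string parsing or arithmetic in here."""
    return f"Total revenue: {total_revenue(raw_amounts):.2f} {currency}"


# The named argument keeps the call readable even with a second parameter.
report = format_revenue_report(["$4.99", " 0.99 ", "$10"], currency="GBP")
```

Each function is small, does one thing, and stays on one level of abstraction, so each routine is pretty much what its name says.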
Use good names
Avoid obvious comments.
Dead, commented-out code
ToDo, licenses, history, markup for documentation and other nonsense
But there are exceptions..
“When you feel the need to write a comment, first try to refactor
the code so that any comment becomes superfluous” Kent Beck
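A sketch of what that refactoring can look like. The `user` dictionary and its keys are hypothetical, chosen only to make the example runnable.

```python
# Before: the comment has to explain what the condition means.
def bonus_before(user):
    # user paid at least once and was active in the last 7 days
    if user["payments"] > 0 and user["days_since_last_login"] <= 7:
        return 100
    return 0


# After: the function name carries the meaning, so the comment can go.
def is_active_payer(user):
    return user["payments"] > 0 and user["days_since_last_login"] <= 7


def bonus_after(user):
    return 100 if is_active_payer(user) else 0
```

The behaviour is unchanged; the explanation simply moved from a comment into a name that cannot drift out of date.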
Code conventions
A team should produce code in one style, as if it was written by one person
Team conventions over language conventions, over personal ones
Automate style formatting
Example: PEP 8 -- Style Guide for Python Code
● Indentation
● Tabs or Spaces?
● Maximum Line Length
● Should a line break before or after a binary operator?
● Blank Lines
● Imports
● Comments
● Naming Conventions
Good:
foo = long_function_name(var_one, var_two,
                         var_three, var_four)
Bad:
foo = long_function_name(var_one, var_two,
    var_three, var_four)
https://www.python.org/dev/peps/pep-0008/
My favourite !
This is not Java or C++
Functions are first-class objects
Duck-typing as an interface
No setters/getters
Itertools, zip, enumerate
etc.
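A short illustration of a few of these points. The revenue numbers and variable names are invented for the example.

```python
from itertools import accumulate

days = ["Mon", "Tue", "Wed"]
revenue = [120.0, 80.0, 100.0]

# Java/C++ habit: loop over indexes.
indexed_lines = []
for i in range(len(days)):
    indexed_lines.append(str(i + 1) + ". " + days[i] + ": " + str(revenue[i]))

# Pythonic: enumerate and zip give positions and pairs directly.
pythonic_lines = [
    f"{position}. {day}: {amount}"
    for position, (day, amount) in enumerate(zip(days, revenue), start=1)
]

# itertools.accumulate yields the running totals.
cumulative_revenue = list(accumulate(revenue))


# Functions are first-class objects: pass one in as the key.
def revenue_of(day_and_amount):
    return day_and_amount[1]


best_day, best_amount = max(zip(days, revenue), key=revenue_of)
```

Both loops build the same list; the second one has no index arithmetic to get wrong.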
Typical structure of the ipynb
1. Imports
2. Get Data
3. Transform Data
4. Modelling
5. Visualisation
6. Making sense of the data
One “idea - execution - output” triplet per cell
Import Cell: expected output is no import errors
CMD+SHIFT+P
Tip 4: each cell should have one logical output
Code Smells .. in ipynb
- Cells can’t be executed in order (with runAll and Restart&RunAll)
- Prototype (check ideas) code is mixed with “analysis” code
- Debugging cells
- Copy-paste cells
- Duplicate code (in general)
- Multiple notebooks that re-implement the same function
Summary: How to organise a Jupyter project
1. A notebook should have one Hypothesis-Data-Interpretation loop
2. Make a multi-project utils library
3. A good Jupyter notebook reads like well-written prose
4. Each cell should have one and only one output
5. Write tests in notebooks
6. Deploy a shared Jupyter server
7. Try to keep code inside notebooks. Avoid refactoring to modules, if possible.
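For point 5, a test cell can be as simple as a few asserts right below the function they check. This is only a sketch: the function name and the retention definition used here (share of installed users seen again on day 7) are assumptions for the example, not a definition from the talk.

```python
def seven_day_retention(installed_user_ids, active_on_day_7_user_ids):
    """Share of installed users who were active again on day 7."""
    installed = set(installed_user_ids)
    if not installed:
        return 0.0
    retained = installed & set(active_on_day_7_user_ids)
    return len(retained) / len(installed)


# Test cell: one logical output, and it fails loudly if the function breaks.
assert seven_day_retention([], []) == 0.0
assert seven_day_retention(["a", "b", "c", "d"], ["a", "c", "z"]) == 0.5
print("all retention tests passed")
```

Because the asserts run on every Restart&RunAll, a broken helper is caught before anyone reads the charts further down.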
Editor's Notes
Data Scientists are coming from various backgrounds.
In my company, many came from the business “dark” side.
Part 2 of the talk is, of course, heavily inspired by the work of Robert C. Martin: his books, his website and his absolutely wonderful video podcast. Shameless plagiarism alert!
So, what is Clean code anyway?
“Pleasingly graceful and stylish in appearance or manner” - and this guy invented C++?! Graceful and stylish..
“Clean code reads like well written prose” - Grady Booch, creator of UML. Interestingly, UML can’t be read like prose at all.. But maybe there is something in there after all..
“.. each routine turns out to be pretty much what you expected” - Ward Cunningham, inventor of Wiki.
So, clean code should not “surprise”. I can definitely relate to that. How often do you open someone else’s code and go
“Wow, what is that.. This is a very curious and inventive way.. But.. WHY??”
So clean code should aim not to surprise. It should even be “boring” and “predictable”. And consistent.. But more on that later
First make it work,
Then make it Right,
Then make it fast and small
These are the Design rules of Kent Beck, creator and proponent of Test Driven Development
Another recipe for making clean code from Ron Jeffries, leading book author on Agile, XP and good practices:
Clean code is code that..
Runs all the tests
Contains no duplicate code
Expresses all ideas...
Minimizes classes and methods
Robert C. Martin found a very successful metaphor for writing clean code sustainably.. It is called the Boy Scout rule of development:
Leave the campground cleaner than you found it
So let’s take a look at how to develop those great habits
Naming things..
"There are only two hard problems in Computer Science: cache invalidation and naming things." - Phil Karlton, Principal Architect at Netscape
Rule 1: long_descriptive_names. Don’t be afraid to type a long name. All modern IDEs have auto-complete. And if yours doesn’t, get a better one that does! Even Python Notebooks have autocomplete for variable and function names. x, i, stuff and do_blah() are real variable or function names that I have seen!
Rule 2: a name should be easy to pronounce. arpdpu stands for .. Vova, what does this stand for again??
Vova: “Average Revenue per Daily Paying User”
Rule 3: avoid encodings and abbreviations. The only exception may be where everyone in the organisation already knows the term, e.g. that DAU stands for Daily Active Users.. Avoid Hungarian notation and other nonsense. But use domain names.
Rule 4: add relevant context. Of course, long names make it hard to keep lines within 79 characters. But I prefer longer lines (we all have UltraHD Retina monitors these days anyway) to shorter, obscure names.
Rule 5: naming is hard. Very hard. So think hard about naming things. And don’t be afraid to refactor and rename if you find a better way to express the purpose of a variable, function or class.
A good IDE should have refactor-rename functions. And even Python Notebook has “search and replace” these days..
Functions..
Rules of functions
should be small
should be smaller than that
do one thing (also applies to classes)
they should do it well
do it only
One Level of Abstraction:
Don’t mix high-level policy and low-level details
But there are exceptions..
Complex algorithms
Technical notes and warnings
Conventions and rules
And this is where I will take over...
Same environment
Same color - reproducibility
Advice: reload() modules that are changing, so that RunAll produces the same result as Restart&RunAll
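In Python 3 the reload function lives in importlib. The sketch below writes a throwaway module to a temporary folder so it can run anywhere; the module name project_utils_demo and its bonus_points function are hypothetical stand-ins for your own utils module.

```python
import importlib
import pathlib
import sys
import tempfile

# Create a throwaway module on disk (stand-in for a utils file being edited).
module_dir = tempfile.mkdtemp()
sys.path.insert(0, module_dir)
module_path = pathlib.Path(module_dir) / "project_utils_demo.py"
module_path.write_text("def bonus_points():\n    return 100\n")

import project_utils_demo

points_before_edit = project_utils_demo.bonus_points()

# Simulate editing the file while the notebook kernel keeps running.
module_path.write_text("def bonus_points():\n    return 100 + 50\n")

# Without this call the kernel would keep using the old definition.
importlib.reload(project_utils_demo)
points_after_edit = project_utils_demo.bonus_points()
```

Putting the reload call in the import cell means RunAll picks up the edited module, just as a fresh Restart&RunAll would. IPython's autoreload extension is another option for the same problem.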
There should be one and only one narrative (story line) per notebook
Not the same as test-cells. We are going to talk about Tests and TDD later
Copy-paste cells - my favourite