Is Agile Data Science just two buzzwords put together? I argue that agile is a practical methodology that works well in the real world for all sorts of Analytics and Data Science workflows.
http://theinnovationenterprise.com/summits/digital-web-analytics-summit-london-2015/schedule
5. 5
volodymyrk
Heart of Vegas in (public) Numbers
* source: App Annie, 2nd of March
           Top Grossing Games US   Top Grossing Games AU
iPhone     29 (+1)                 1 (+1)
iPad       8 (+2)                  1 (-)
Android    16 (+2)                 1 (-)
Facebook: 5 (+1)
6. 6
Data Team
8 people; 4 in London
● Analytics: ad-hoc analytics and daily fires; dashboards
● Data Science: deep dive analysis; predictive analytics
● Data Engineering: ETL, Data Viz tools, R&D, DBA
13. 13
Agile Manifesto
Individuals and interactions over processes and tools
Working software over comprehensive documentation
Customer collaboration over contract negotiation
Responding to change over following a plan
* agilemanifesto.org
14. 14
Agile Data Science Manifesto
Individuals and interactions over processes and tools
Actionable insights over comprehensive reports
Customer collaboration over project negotiation
Responding to change over following a plan
15. 15
“If a building doesn’t encourage [collaboration], you’ll lose a lot of
innovation and the magic that’s sparked by serendipity” - Steve Jobs
Individuals and interactions over processes and tools
17. 17
Agile Principles
Iterative, incremental and evolutionary
- iterations, timeboxed estimates
Efficient and face-to-face communication
- no to tasks by email (with no face-to-face)
Very short feedback loop and adaptation cycle
- daily standups, pair analysis
Quality focus
- verifiable, reproducible findings
18. 18
Scrum-Ban in Data Science @ProductMadness
● Weekly cycle
● Daily standup meeting @10am
● ToDo/WIP/Waiting buckets are kept small
● Disruptions to weekly plan are expected
● On-demand planning
20. 20
Lesson 1: Agile methods in Data Science
1. Co-location matters; whiteboard next to your desk
2. Work with decision maker; share preliminary findings
3. Make a research plan; pivot early
4. Book “Findings” meeting before project start
5. MVP for Data Products
6. Do daily stand-ups!
22. 22
What is Agile Acceleration?
Waterfall VS. Scrum
Velocity = Units of Work / Time Interval
ΔVelocity = Acceleration * ΔTime
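As a worked illustration of the definitions on this slide, the sketch below computes velocity as units of work per time interval and acceleration as the change in velocity from one interval to the next. The weekly figures are made up for illustration and do not come from the talk.

```python
def velocities(units_of_work, interval_length=1.0):
    """Velocity per interval = units of work / time interval."""
    return [units / interval_length for units in units_of_work]


def accelerations(velocity_series, interval_length=1.0):
    """Acceleration = change in velocity / change in time, between consecutive intervals."""
    return [
        (later - earlier) / interval_length
        for earlier, later in zip(velocity_series, velocity_series[1:])
    ]


# Hypothetical units of work completed in four consecutive weeks.
weekly_units = [4, 6, 9, 13]

weekly_velocity = velocities(weekly_units)
weekly_acceleration = accelerations(weekly_velocity)

print("velocity per week:", weekly_velocity)
print("acceleration between weeks:", weekly_acceleration)
```

A team with positive acceleration gets faster every interval, which is the point of the comparison: constant velocity is fine, but compounding improvements matter more over time.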
23. 23
a = F / m
I run SQL, copy-paste data to Excel and send it by email
I created a deep neural network to predict high spenders
24. 24
Case Study: to Git or not to Git
Scripts (ruby, bash, python)
Python Apps
Python Modules
IPython Notebooks
Research Documents (word)
Presentations (powerpoint)
Spreadsheets (excel)
25. 25
Case Study: to Git or not to Git
Scripts (ruby, bash, python)
Python Apps
Python Modules
IPython Notebooks ?
Research Documents (word)
Slides (powerpoint)
Spreadsheets (excel)
26. 26
Case Study: to Git or not to Git
Scripts (ruby, bash, python)
Python Apps
Python Modules
IPython Notebooks
Research Documents (word)
Slides (powerpoint)
Spreadsheets (excel)
29. 29
Friction: Mini Case Studies
re.dash for self-service analytics
cloud-hosted Jupyter notebooks
30. 30
Lesson 2: find the lightest suitable tool
1. IPython notebooks: Dropbox over Git
2. Google Slides over PowerPoint
Google Slides over Email with images
3. Google Spreadsheets over Excel
4. Podio over Jira
5. Data Transformations in DWH in SQL over Hadoop
6. re.dash over SQL Workbench+csv export+excel
7. Hosted Jupyter over local python
33. 33
Scrum for Data Science?
Assumptions:
● Motivated ninjas
● Isolated and co-located team
● Clear direction
● You can estimate work
Reality:
● Unicorns are rare
● Constant interruption; 3 locations
● Lots of unknown-unknowns
● You can estimate very little
36. 36
Limit the number of Open Loops
[Figure: two sets of per-task progress bars, compared side by side]
Always prefer to have:
90% of tasks are 100% complete
over 100% of tasks are 90% complete
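The comparison on this slide can be made concrete with a small sketch. It assumes two hypothetical backlogs of ten tasks each (the numbers are invented, not the ones on the slide) and counts how many loops are closed versus still open for the same total effort.

```python
def closed_share(progress):
    """Fraction of tasks that are 100% complete (closed loops)."""
    return sum(1 for percent in progress if percent >= 100) / len(progress)


def open_loops(progress):
    """Number of tasks that were started but not finished."""
    return sum(1 for percent in progress if 0 < percent < 100)


# Hypothetical backlogs: percent complete for each of ten tasks.
finish_first = [100] * 9 + [0]   # 90% of tasks are 100% complete
spread_thin = [90] * 10          # 100% of tasks are 90% complete

for name, plan in [("finish_first", finish_first), ("spread_thin", spread_thin)]:
    print(
        name,
        "closed share:", closed_share(plan),
        "open loops:", open_loops(plan),
        "average progress:", sum(plan) / len(plan),
    )
```

Both backlogs show the same average progress, but only the first one has delivered anything, and it carries no open loops.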
37. 37
Lesson 3: Focus on Closing the Loop
1. Don’t build predictive models that you can’t act upon. Don’t analyse stuff that cannot help to make a decision.
2. The best way to deal with the Analytics Spiral is to avoid the spiral. Practise the Crack a Case and “what if” method.
3. Limit the number of “open loops”.
43. 43
Import all commonly used tools in one line.
All access and security is abstracted away.
Focus on SQL, not data access.
Formatting and publishing a .png in one line of code.
PyCharm has a great SQL editor.
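The code shown on the original slide is not part of this transcript. As a rough sketch of the idea of keeping data access out of the analyst's way, the helper below hides connection handling behind a single function. Everything here is an assumption made for illustration: the name `run_sql`, the `ANALYTICS_DB_PATH` environment variable, and the use of SQLite as a stand-in for whatever warehouse the team actually queries.

```python
import os
import sqlite3

# Hypothetical setting: the shared library reads the database location from
# the environment, so notebooks never contain paths or credentials.
DB_PATH_ENV = "ANALYTICS_DB_PATH"


def run_sql(sql, params=()):
    """Run a query and return the rows as a list of dicts.

    Opening and closing the connection stays inside the library;
    the caller only writes SQL.
    """
    db_path = os.environ.get(DB_PATH_ENV, ":memory:")
    connection = sqlite3.connect(db_path)
    connection.row_factory = sqlite3.Row  # rows become addressable by column name
    try:
        cursor = connection.execute(sql, params)
        return [dict(row) for row in cursor.fetchall()]
    finally:
        connection.close()


# The analyst's side of the contract: one line, only SQL.
rows = run_sql("SELECT 1 + 1 AS answer")
print(rows)
```

In a shared library, a function like this would sit next to the plotting and publishing helpers, so a notebook starts with a single import and goes straight to the query.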
44. 44
Lesson 4: Reproducibility
● Get rid of Windows and you get rid of Excel
● Notebooks (.ipynb) are always shared and versioned;
prefer simple cloud sharing to VCS
● Streamline data access functions
● Cache long-running code and queries
● Develop a common library
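One possible way to cache long-running queries is sketched below: results are pickled to disk under a hash of the query text, so re-running a notebook reuses the stored result instead of hitting the database again. The function name `cached`, the cache directory and the sample query are hypothetical, and the slow query is simulated with a plain function.

```python
import hashlib
import pickle
import tempfile
from pathlib import Path


def cached(key, compute, cache_dir):
    """Return the stored result for `key`, computing and storing it on first use."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    cache_file = cache_dir / f"{digest}.pkl"
    if cache_file.exists():
        with cache_file.open("rb") as handle:
            return pickle.load(handle)
    result = compute()
    with cache_file.open("wb") as handle:
        pickle.dump(result, handle)
    return result


# Demo with a fresh temporary cache directory and a simulated slow query.
demo_dir = tempfile.mkdtemp()
call_count = {"n": 0}


def slow_query():
    call_count["n"] += 1
    return [("2015-03-01", 42)]  # made-up result rows


query_text = "SELECT day, revenue FROM daily_revenue"  # hypothetical query
first = cached(query_text, slow_query, demo_dir)
second = cached(query_text, slow_query, demo_dir)  # served from disk

print(first == second, "query executed", call_count["n"], "time(s)")
```

Keying the cache on the query text means any edit to the SQL triggers a fresh run, while a stale result has to be cleared by deleting the cache file; a shared library could add an expiry time on top of this.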
46. 46
Summary
● Agile approach works well for Data Science
● Find the lightest suitable tool for a task
● Reproducibility is not negotiable
● Focus on closing the loop(s)