Presentation for the MancML on data readiness.
If you are considering starting to invest in data science, this is a helpful guide to understanding:
- what you need *before* you start looking for a data scientist
- the skill set and experience that you should be looking for when you do.
Are you ready for Data science? A 12 point test
1. “I've come up with a set of rules that describe our
reactions to technologies:
1. Anything that is in the world when you’re born is
normal and ordinary and is just a natural part of the
way the world works.
2. Anything that's invented between when you’re fifteen
and thirty-five is new and exciting and revolutionary
and you can probably get a career in it.
3. Anything invented after you're thirty-five is against
the natural order of things.”
― Douglas Adams, The Salmon of Doubt
2. Should you hire a data
science team?
Bertil Hatt
Head of Data science
RentalCars.com
4. Raise your hand if
your service has:
• Meaningful automated decision or recommendation
• Running in production without individual human control
• Self-learning and updating on a schedule
5. Keep your hand up
if your own team:
• Is dedicated to building models, not reports
• Has three or more full-time employees
• Includes at least two full-time modellers
8. Contents
• Hiring is cool but not always appropriate
• How to tell if it is the right time to grow
• If we have time: Examples of data not ready
9. The hype is real
And it is a good thing
Image: (c) Olga Tarkovskiy.
10. The hype is blurring the lines
And that is less of a good thing
11. What do we mean
by ‘data scientist’
• Analyst: uses data to answer ad hoc questions
• Uses data streams to build alerts, automated reports & dashboards
• Statistician: builds statistical models
• Builds self-updating models used to automate decisions
• ML researcher: imagines new types of models
14. The Joel test
Joel Spolsky’s test for
code-based projects:
• 12 Yes/No
• Easy to tell
• Easy to start
https://www.joelonsoftware.com/2000/08/09/the-joel-test-12-steps-to-better-code/
15. 12-point test for readiness to
implement data science
• Do your strategy, effort and product team
structure make sense? Are they aligned?
• Is your data easy to get at and up to date?
Are your analysts building the right datasets?
• Would a model have a place to sit?
16. Are your product &
goals clearly defined?
Is the company goal broken down into team metrics in a
mutually exclusive, completely exhaustive (MECE) way?
Can a new team member describe the current effort
& its impact on metrics at the end of their first week?
Are the metrics known by everyone, i.e. reported, audited
& challenged? Do developers know other teams’ key metrics?
…
17. Is your data easy to get?
Do you have an up-to-date metrics dictionary with a
detailed explanation of each number, incl. edge cases?
Are the reference analytical tables audited daily?
Totals add up with production, trends make sense, etc.
Are simple data requests answered within an hour? I.e.
can a question that fits in two lines be answered with a single query?
…
18. Extract Transform Load
Production database → Read replica → Analytics database (plus more services)
Normalised tables: references, events
Aggregated tables:
- Pageviews to sessions
- Customer first & last
- Daily metrics (all)
Denormalised individual tables:
- Per item with all information
- UTC, local time, DoW, ∂
- One per concept: order, campaign, action per day
These feed inferences & machine-learning training datasets,
reports, and automated audits & alerts.
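A minimal sketch of the “pageviews to sessions” aggregation step above, in Python with standard-library types only. The data, the function name and the 30-minute session gap are illustrative assumptions, not the deck’s actual ETL:

```python
from datetime import datetime, timedelta

# Hypothetical raw pageview log: (user_id, UTC timestamp).
pageviews = [
    ("u1", datetime(2018, 5, 1, 9, 0)),
    ("u1", datetime(2018, 5, 1, 9, 10)),
    ("u1", datetime(2018, 5, 1, 13, 0)),  # > 30 min gap: new session
    ("u2", datetime(2018, 5, 1, 9, 5)),
]

def sessionise(views, gap=timedelta(minutes=30)):
    """Group raw pageviews into per-user sessions (a classic ETL aggregation)."""
    sessions = {}  # user_id -> list of [start, end] session windows
    for user, ts in sorted(views, key=lambda v: (v[0], v[1])):
        user_sessions = sessions.setdefault(user, [])
        if user_sessions and ts - user_sessions[-1][1] <= gap:
            user_sessions[-1][1] = ts        # extend the current session
        else:
            user_sessions.append([ts, ts])   # open a new session
    return sessions

result = sessionise(pageviews)
print({u: len(s) for u, s in result.items()})  # {'u1': 2, 'u2': 1}
```

In a real pipeline this logic would run inside the ETL and write into the aggregated tables; the point is that it is a deterministic transform that can be tested and audited.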
19. Is your data up-to-date?
Do you use version control on the ETL?
Do you monitor your analysts’ queries for bad patterns?
Do you talk about improvements at a weekly review?
Are there fewer than three days between a product release
& the subsequent ETL update (tested, reviewed, pushed)?
Is the team handling the ETL aware of the product
schedule, including re-factorisation & service splits?
…
21. ML in production
Client ↔ Production server (JS) ↔ ML server (Python):
the production server sends features to the ML server,
gets a prediction back, and is backed by the production db.
recommendation_placeholder.py
import os
import pymssql

def get_smart_recommendation(user_input):
    # Placeholder: the input is ignored; always return five products.
    analytics_server = os.environ['ANALYTICS_URL']
    conn = pymssql.connect(analytics_server)
    cursor = conn.cursor()
    cursor.execute('SELECT TOP 5 product_id FROM sales;')
    five_suggested_product_list = []
    for row in cursor:
        five_suggested_product_list.append(row[0])
    return five_suggested_product_list
Placeholder: a naive model
behind a simple connection
22. Are you logging your
(placeholder) model?
Do you have estimates (from past A/B tests) of
how model quality impacts your metrics?
Do you log the user input, suggestion, user action & model
version? Are those logs processed & audited?
Do you have a placeholder machine learning environment
with a training server fed from the analytics pipeline?
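As a sketch of the logging the slide asks for, here is one way to record the user input, the suggestion, the user action and the model version as JSON lines, so the logs can be processed and audited later. The version tag, field names and in-memory buffer are illustrative assumptions:

```python
import io
import json
from datetime import datetime, timezone

MODEL_VERSION = "placeholder-0.1"  # hypothetical model version tag

def log_recommendation(user_input, suggestions, user_action, log_file):
    """Append one JSON line per served recommendation:
    input, suggestions, user action & model version."""
    record = {
        "ts": datetime.now(timezone.utc).isoformat(),
        "model_version": MODEL_VERSION,
        "input": user_input,
        "suggestions": suggestions,
        "action": user_action,
    }
    log_file.write(json.dumps(record) + "\n")

# Usage: log one served recommendation and the click that followed.
buf = io.StringIO()  # stands in for an append-only log file
log_recommendation({"page": "home"}, [101, 102, 103], {"clicked": 101}, buf)
print(json.loads(buf.getvalue())["model_version"])  # placeholder-0.1
```

Because every line carries the model version, a later A/B test can attribute each user action to the model that produced the suggestion.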
23. Questions?
1. Overall goal split MECE in team KPI?
2. Can newbies describe current impact?
3. Are the metrics known by everyone?
4. Up-to-date metrics dictionary?
5. Are analytical tables audited daily?
6. Are queries answered in an hour?
7. Version control on ETL?
8. Monitor analysts’ queries & feedback?
9. Under 3 days from launch to ETL update?
10. Estimate impact of model quality?
11. Do you log input, suggestion & user action?
12. Placeholder ML environment?
• Hiring is cool but not always appropriate
• How to tell if it is the right time to grow
• Product & goals clear?
• Data clear & updated?
• Model structure ready?
• Examples of product & data not ready
• Growth & fake users
• Customer service actions
• Local dates & D1 retention
24. • Hiring is cool but not always appropriate
• How to tell if it is the right time to grow
• Examples of product & data not ready
25. Four cases where
lack of clarity
led to bad models
Real cases of real models
gone wrong in production
26. Growth accounting
• Break down Active Users into inflow and outflow
• Late identification of fake accounts
• Predicting user retention:
• fake accounts are more active
• and more likely to be gone
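A sketch of that growth-accounting breakdown using sets of user ids (the user names and period labels are illustrative). It shows why mislabelled fake accounts distort the picture: they inflate both the inflow and the outflow:

```python
def growth_accounting(prev_active, curr_active, ever_seen):
    """Break this period's Active Users into inflow (new, resurrected)
    and outflow (churned), relative to the previous period."""
    new = curr_active - ever_seen                      # never seen before
    retained = curr_active & prev_active               # active both periods
    resurrected = (curr_active & ever_seen) - prev_active  # came back
    churned = prev_active - curr_active                # gone this period
    return {"new": new, "retained": retained,
            "resurrected": resurrected, "churned": churned}

prev = {"ann", "bob"}                 # active last period
curr = {"bob", "cara", "dan"}         # active this period
seen_before = {"ann", "bob", "dan"}   # active in any earlier period
print({k: sorted(v) for k, v in growth_accounting(prev, curr, seen_before).items()})
```

A fake account that signs up, looks very active, then disappears shows up as a large “new” inflow one period and a large “churned” outflow the next, which is exactly how it poisons a retention model.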
27. Customer service
• All agent-based operations need to be listed and logged
• Without an exhaustive list, user state might change
• To recommend a common response, you’d miss a step
28. Clear hierarchy of types
for recommendation
• Player/user actions, partners need a meaningful structure
• Classify food for recommendation:
• Restaurant chain: name or financial data?
• Food type: is Fusion a type of Asian?
• How to cold-start new restaurant, new customer?
29. Dates (local time zones)
• User daily rhythm is heavily dependent on local time
• Store timestamps in UTC, but also store the local action time of day
• D1 retention differs between Canada and New Zealand
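A sketch of why the local day matters for D1 retention. Fixed UTC offsets stand in for real time-zone rules (DST is ignored), and the timestamps are invented: the same two events count as day-1 retention in Canada but not in New Zealand:

```python
from datetime import datetime, timezone, timedelta

# A signup and a return visit, both stored in UTC.
signup = datetime(2018, 5, 1, 3, 0, tzinfo=timezone.utc)
revisit = datetime(2018, 5, 1, 5, 0, tzinfo=timezone.utc)

# Fixed offsets as stand-ins for real zone rules (DST ignored).
toronto = timezone(timedelta(hours=-4))   # Canada (EDT)
auckland = timezone(timedelta(hours=12))  # New Zealand (NZST)

def d1_retained(signup_ts, revisit_ts, tz):
    """D1 retention: did the user come back on the *local* day after signup?"""
    delta_days = (revisit_ts.astimezone(tz).date()
                  - signup_ts.astimezone(tz).date()).days
    return delta_days == 1

print(d1_retained(signup, revisit, toronto))   # True: crosses local midnight
print(d1_retained(signup, revisit, auckland))  # False: same local day
```

Computing the day boundary in UTC would give yet a third answer, which is why the slide insists on storing the local action time alongside the UTC timestamp.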
Editor's Notes
Sending emails, granting rebate but not to everyone
New user can receive something without human intervention
Discover new rules
Building reports is close to data science, but not same profile
Three: significant independent effort
Two modellers: do full time
Illustrative. Don’t focus on where we are
Job roles less clear. Not really a problem:
Novice can learn
Experts like the technicality
Before we go to explain the problem,I need to clarify
No clear objectives
No access to data
No implementation path
No resources to fix any of that
Joel Spolsky is a thought leader in how to make great software
There were complicated ways of estimating quality of environment
He wanted to simplify it to a short check list
Each can be set up by individual decisions; some expensive but none a head-scratcher
Few are dependent on each other
How much do you score on Joel’s test too?
Were you surprised by the score?
Do you give space for challenges to metrics?
None of those are about data science:
Because those will be defined later, when you have one DS
Happy to do those too
Ask those to the most junior analyst
Managers often too far to know the real struggle
Did steal some quite obviously from Joel
Monitor for joins, new tables, querying production table
Pretext to teach good, legible effective SQL
What most data scientists can handle:
- can you host a server?
With an example?
Notice that no matter what the input, five most common
Boilerplate that a coder can easily produce