What is data science?
How is it used in the industry?
DS methodology and life cycle
Who are the Data-team members?
Limitations and caveats
(**Google slides upload didn't go well)
3. Talk agenda
◎What is data science?
◎How is it used in the industry?
◎DS methodology and life cycle
◎Who are the Data-team members?
◎Limitations and caveats
7. Data Science (Wikipedia)
“An interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from data in various forms, both structured and unstructured”
8. Data Science vs. Statistics
◎ The term data scientist was originally coined by a statistician, trying to rebrand statisticians (Chien-Fu 1998)
◎ Statistics vs. DS - Data models vs. Algorithmic modeling (Leo Breiman 2001)
◎ Data Science = Aggr(`stats`, `advanced computing`, `hacking`, `business logic`, `math`, `domain knowledge`, `data analysis`)
9. Demystifying data science
◎ DS Purpose - achieving ‘Data Driven Decision
making’ (basing decisions on data with certain
confidence)
10. Buzzwords terminology
Data Science (DS)
The science of recognizing and utilizing patterns in data in order to
develop actionable insight and confidence for decisions.
Artificial Intelligence (AI)
Any technique which enables computers
to mimic human behaviour
Machine Learning (ML)
Subset of AI techniques which use
statistical methods to enable machines
to improve at tasks with experience
Deep learning (DL)
Subset of ML which, under certain
conditions, can model the data
with less human intervention
11. 2. How is it used in the industry?
Typical use cases and market overview
12. Why should I use DS in my business?
◎ Derive insights on business challenges
○ Sales
○ Pricing
○ Marketing
○ Churn
◎ Improve user experience
○ Faster
○ Personalized
○ Accurate
◎ Automate cumbersome routines involving human
labor
22. Methodology - preliminaries
Basic analytics
(no need for AI/ML)
Database instrumentation
and data structuring
◎ ** If needed, create a rule-based system of expert-defined thresholds as
the ‘AI’ backend and continue to gather data
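The rule-based fallback described above could look roughly like the sketch below. It is only an illustration: the field names (`days_since_last_login`, `support_tickets`), the threshold values and the class name are hypothetical and would in practice come from a domain expert. Each decision is logged together with its inputs, so that data keeps accumulating for a later ML model.

```python
from dataclasses import dataclass, field

# Hypothetical expert-defined thresholds (assumptions for illustration only).
MAX_INACTIVE_DAYS = 30
MAX_SUPPORT_TICKETS = 3


@dataclass
class RuleBasedChurnFlagger:
    """Flags at-risk customers with fixed rules and keeps every input for later training."""
    log: list = field(default_factory=list)

    def predict(self, customer: dict) -> bool:
        at_risk = (
            customer["days_since_last_login"] > MAX_INACTIVE_DAYS
            or customer["support_tickets"] > MAX_SUPPORT_TICKETS
        )
        # Keep gathering data: store the features together with the rule's decision.
        self.log.append({**customer, "flagged": at_risk})
        return at_risk


flagger = RuleBasedChurnFlagger()
active = {"days_since_last_login": 2, "support_tickets": 0}
inactive = {"days_since_last_login": 45, "support_tickets": 1}
decisions = [flagger.predict(active), flagger.predict(inactive)]
```

Once real outcomes are joined to the logged rows, the same log can serve as training data, and the rules become a baseline to beat.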
23. Methodology - CRISP-DM
* Cross-Industry Standard Process for
Data Mining (CRISP-DM)
24. Methodology - Business understanding
Problem definition & Business
understanding
◎ Define business targets and qualitative
success metrics
◎ Assess risks, costs, benefits, data
resources
◎ Project the problem onto data science subtasks
and identify the class of each problem
◎ Plan the project - estimate
requirements, timeline and budget
25. Methodology - Data understanding
Data understanding
◎ Refine initial data and enrich it if
needed
◎ Match data to business problem
◎ Describe and explore the data
○ Spot anomalies
○ Basic amounts and value types
◎ Verify data quality
○ Missing data
○ Collection errors/biases
anomalies?
outliers?
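A first pass over a single column, covering the "describe and explore" and "verify data quality" bullets above, might look like this sketch. The column values are made up, `None` is assumed to mark a missing record, and the 1.5 x IQR rule is just one common outlier heuristic, not something the talk prescribes.

```python
import statistics


def describe_column(values):
    """Summarize one column: count, missing count, median, and IQR-based outliers."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "missing": len(values) - len(present),
        "median": statistics.median(present),
        "outliers": [v for v in present if v < low or v > high],
    }


# Hypothetical order amounts; None marks a missing record.
order_amounts = [20, 22, 19, None, 21, 23, 500, 18, None, 20]
summary = describe_column(order_amounts)
```

A flagged value such as 500 is not necessarily an error; it is a prompt to ask how the data was collected.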
26. Methodology - Data Preparation
Data Preparation
◎ Clean data
○ Correct errors
○ Fill missing data
◎ Select right data
○ Representative
○ Data partitioning - train/test/hold-out
◎ Format data
◎ Beware of “leaks”
Source: KDNuggets Poll 2003
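One way to partition data while guarding against a common kind of leak is sketched below: all records of the same customer are kept in a single partition, so the model is never tested on a customer it has already seen. The `customer_id` key, the 60/20/20 fractions and the function name are assumptions for illustration; other leaks (for example features computed from future data) need different checks.

```python
import random


def split_by_group(records, group_key, fractions=(0.6, 0.2, 0.2), seed=0):
    """Split records into train/test/hold-out so that no group spans two partitions."""
    groups = sorted({r[group_key] for r in records})
    random.Random(seed).shuffle(groups)
    n_train = int(len(groups) * fractions[0])
    n_test = int(len(groups) * fractions[1])
    assignment = {}
    for i, g in enumerate(groups):
        if i < n_train:
            assignment[g] = "train"
        elif i < n_train + n_test:
            assignment[g] = "test"
        else:
            assignment[g] = "holdout"
    parts = {"train": [], "test": [], "holdout": []}
    for r in records:
        parts[assignment[r[group_key]]].append(r)
    return parts


# Hypothetical data: 10 customers with 3 purchase records each.
records = [{"customer_id": c, "purchase": p} for c in range(10) for p in range(3)]
parts = split_by_group(records, "customer_id")
```

The hold-out partition is meant to be touched only once, for the final assessment, after all tuning has been done on train and test.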
27. Methodology - Modeling
Modeling
◎ Build cost/risk target to optimize
◎ Understand model assumptions and
check data compatibility
◎ Build model and optimize parameters
◎ Generate test design
◎ Assess model on provided data
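The "cost/risk target" and "optimize parameters" bullets above can be illustrated with a very small sketch: a cost function that encodes the business trade-off, and a search over one parameter (a decision threshold) that minimizes it. The cost values, scores and candidate thresholds are hypothetical; a real project would optimize model parameters on held-out data in the same spirit.

```python
# Hypothetical business costs (assumptions): a missed churner costs more than a wasted offer.
COST_FALSE_NEGATIVE = 10.0
COST_FALSE_POSITIVE = 1.0


def total_cost(threshold, scores, labels):
    """Cost of acting on every instance whose score is at or above the threshold."""
    cost = 0.0
    for score, label in zip(scores, labels):
        predicted = score >= threshold
        if predicted and not label:
            cost += COST_FALSE_POSITIVE
        elif not predicted and label:
            cost += COST_FALSE_NEGATIVE
    return cost


def best_threshold(scores, labels, candidates):
    """Pick the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(t, scores, labels))


# Hypothetical validation scores and true labels (1 = churned).
scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
candidates = [0.2, 0.35, 0.5, 0.8]
chosen = best_threshold(scores, labels, candidates)
```

With these made-up costs the search prefers a low threshold, because tolerating a few false positives is cheaper than missing a single positive.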
28. Methodology - Evaluation
Evaluation
◎ Analyze model performance and
summarize results
○ New insights
○ A/B testing
○ Validation cases
◎ Error analysis
◎ Prediction interpretability
◎ Robustness and maintainability of model
◎ Business related performance -
cumulative response and lift curves
29. Methodology - Deployment
Deployment
◎ Integrate prototype into production
system
◎ Implement software features inspired by
the data-mining process
◎ Plan model maintenance and support
32. Unicorn fairytale
Data science actually comprises
multiple disciplines. Typically, a
single creature cannot manage the
engineering process, lead modeling
efforts, coordinate the product
roadmap, and articulate results to
stakeholders.
33. The magnificent data warriors
AI Analyst
★ Descriptive and conditional statistics
★ Error analysis
★ Finding sense in results and monitoring production model performance
★ Feature engineering and formalization of prior knowledge
★ Domain expertise
★ Validation
★ Excel, SQL, DB, R (Scripting), Statistics
34. The magnificent data warriors
Data Scientist
★ Machine learning and statistical analysis
★ Experiment design and research
★ Familiar with Big Data technologies
★ Dev foundations - Pipelines, testing, performance optimization
★ Storytelling and visualization
★ Feature engineering
★ Bias and leakages discovery
★ Generalization and overfitting
★ Python, R, Matlab, SQL, OOP, Spark, Pig, Hive
35. The magnificent data warriors
Data Engineer
★ Data orchestration and system architecture
★ Scaling with Big Data technologies
★ Database maintenance and data storage
★ Production processes - code deployment, optimization and testing
★ OOP and functional programming, Python/Java/Scala/Ruby/Clojure, Spark, Hadoop, Pig, Hive, DB & SQL, Jenkins, Luigi/Airflow
36. The magnificent data warriors
Manager
★ Setting goals
★ Tracking progress
★ Coordinating between team members
★ Strong understanding of data mining, evaluation metrics and statistics
★ Delivering results to stakeholders
★ Leader
38. Other options
◎ Let your developers carefully integrate licensed data science APIs for predictive modeling into the product
◎ Outsource DS tasks to a consulting company
40. Limitations
◎ No magic - when there is no predictive
information in data
◎ No 100%
◎ No hidden golden feature
◎ Results by tomorrow are impossible
◎ Tasks with subjective nature are hard
◎ Outdated data and outdated models
◎ Train and test data discrepancies
42. References
◎ Foster Provost and Tom Fawcett. 2013. Data Science for Business: What You Need to Know about Data Mining and Data-Analytic Thinking (1st ed.). O'Reilly Media, Inc.
◎ https://www.salesforce.com/quotable/articles/why-AI-will-be-your-new-best-friend-in-sales/
◎ https://hbr.org/2018/07/how-ai-is-changing-sales
◎ https://neilpatel.com/blog/how-uber-uses-data/
◎ https://www.marketingaiinstitute.com/blog/7-top-marketing-and-sales-companies-using-artificial-intelligence-and-machine-learning
◎ https://www.vccafe.com/2017/09/11/israels-machine-intelligence-startup-landscape-2017/
◎ https://www.slideshare.net/kuonen/a-statisticians-view-on-big-data-and-data-science
◎ https://www.datasciencecentral.com/profiles/blogs/10-most-popular-data-science-presentations-on-slideshare
◎ https://medium.com/high-alpha/how-to-build-a-great-data-science-team-d921fb41b5b1
◎ https://towardsdatascience.com/what-is-the-most-effective-way-to-structure-a-data-science-team-498041b88dae
◎ https://towardsdatascience.com/the-limits-of-data-science-b4e5faad20f4
Editor's Notes
In data modeling you assume that you know, in some sense, the form of the prediction function and the types of interactions between predictor variables. You then only need to find the optimal settings (parameters) for the model to fit the data.
In algorithmic modeling you treat the function as an unknown box and let an algorithm and the data find the prediction function and the relevant variables.
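The contrast between the two cultures can be sketched on toy data. Below, the "data model" assumes a straight line and only estimates its two parameters, while the "algorithmic model" (a k-nearest-neighbours average, chosen here purely as an example) assumes no functional form. The data is deliberately generated from y = x**2 so that the linear assumption is wrong; on truly linear data the data model would do at least as well.

```python
def fit_linear(xs, ys):
    """Data model: assume y = slope * x + intercept and estimate the two parameters."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_knn(xs, ys, k=2):
    """Algorithmic model: no assumed form; average the k nearest training points."""
    def predict(x):
        nearest = sorted(zip(xs, ys), key=lambda pair: abs(pair[0] - x))[:k]
        return sum(y for _, y in nearest) / k
    return predict


# Toy data generated from y = x**2, which violates the linear assumption.
xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]
linear, knn = fit_linear(xs, ys), fit_knn(xs, ys)
linear_error = abs(linear(3.5) - 3.5 ** 2)
knn_error = abs(knn(3.5) - 3.5 ** 2)
```

The linear fit is easy to interpret (one slope, one intercept) but misses the curvature; the neighbour-based predictor tracks the data more closely here but offers no compact description of it.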
Optimove’s Customer Marketing Cloud automatically schedules, executes and evaluates highly individualized marketing campaigns. helps marketers retarget ads only to website visitors most likely to make a purchase on the site.
Datorama’s - process of mapping new sources of marketing information to generate enhanced insights for decision-makers.
Predictive advertisement targeting
What’s predicted: which female customer will have a baby in coming months, which ad each customer is most likely to click
What’s done about it: suggest relevant offers for soon-to-be parents, display the best ad
Targeting direct marketing
What’s predicted: which customers will respond to marketing contact
What’s done about it: contact customers more likely to respond
Churn
What’s predicted: which customers will leave
What’s done about it: retention efforts targeting at-risk customers
Causal modeling
predictive modeling to target advertisements to consumers.
Was this because the advertisements influenced the consumers to purchase? Or did the predictive models simply do a good job of identifying consumers who would have purchased anyway?
Viral marketing
Recognize influencers and seed them with free products
they will cause an increase in the likelihood that the people they know will purchase the product.
Gong: “shining the light on their sales conversations.” It automatically records, transcribes and analyzes all “sales calls, demos, and meetings so sales teams can scale the effectiveness of their sales conversations.”
Conversica uses AI to automate “routine business conversations in a human way.” They sell an automated sales assistant that “engages, qualifies and follows-up with sales leads via human-like, two-way email conversations.” The idea is that salespeople can talk to the right people at the right time, while AI does the heavy lifting the rest of the time.
Demand forecasting (strawberry Pop-Tarts and beer before a hurricane; New York Times, 2004)
What’s predicted: products to be consumed before an event (such as hurricane)
What’s done about it: pricing, supply
Upselling and cross-selling
What’s predicted: which of your existing clients are more likely to buy a better version of what they currently own (up-sell). The net effect is an increase in revenue and a drop in marketing costs.
Leads
Predicting which leads are most likely to be converted into a deal, considering everything from geography, company size, and titles to engagement such as signing up for a trial or downloading a white paper.
Then - Uber was originally started as a black car-hailing service: UberCab, in San Francisco.
Now - closely monitors which features of the service are used most in order to analyze usage patterns. Predicts everything from the customer’s wait time to where drivers should place themselves (recommended via heatmap) in order to take advantage of the best fares and most passengers.
Dynamic pricing is similar to the pricing strategy used by hotels and flights for their weekend or holiday fares and rates – except Uber leverages predictive modeling in real-time based on traffic patterns, supply and demand.
AI - Brain inspired programming
ML - data driven optimization
Simple business questions -
User profile (age, gender, background etc.)
Who pays more and for what product
Iterative and very difficult step
Be able to tell what is unrealistic or ill defined
If data is good, be patient with vaguely defined problems
Do not economize on this phase
The earlier you discover issues with your data the better (yes, your data will have issues!)
Data understanding leads to domain understanding; it will pay off in the modelling phase
Do not trust data quality estimates provided by your customer
Verify, as far as you can, that your data is correct, complete, coherent, deduplicated, representative, independent, up-to-date, stationary
Investigate what sort of processing was applied to the raw data
Understand anomalies and outliers
Data understanding and preparation will usually consume half or more of your project time!
Examples
converting data to tabular format
Removing or inferring missing values,
converting data to different types.
Scaling and normalizing
Some data mining techniques are designed for symbolic and categorical data, while others handle only numeric values.
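The preparation examples above can be sketched in a few lines. The helpers below are illustrative only: median imputation, min-max scaling and one-hot encoding are common choices rather than the talk's prescribed methods, and the `ages` and `plans` columns are made up.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    median = statistics.median(v for v in values if v is not None)
    return [median if v is None else v for v in values]


def min_max_scale(values):
    """Rescale numeric values linearly to [0, 1] (assumes at least two distinct values)."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


def one_hot(values):
    """Turn a categorical column into numeric indicator columns, one per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


# Hypothetical raw columns.
ages = [20, None, 40, 30]
plans = ["basic", "pro", "basic", "free"]

ages_scaled = min_max_scale(impute_median(ages))
plans_encoded = one_hot(plans)
```

To avoid leaks, statistics such as the median and the min/max would normally be computed on the training partition only and then reused on test and hold-out data.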
Whenever possible, peek inside your model and review it with
a domain expert
• Assess feature importance
• Run your model on simulated data
Cumulative response curves plot the hit rate (true positive rate; y axis), i.e. the percentage of positives correctly classified, as a function of the percentage of the population that is targeted (x axis). You rank the instances by model score and check how the hit rate changes as the list grows. Conceptually, as we move down the list of instances ranked by the model, we target increasingly larger proportions of all the instances.
Intuitively, the lift of a classifier represents the advantage it provides over random
guessing. The lift is the degree to which it “pushes up” the positive instances in a list above the negative instances
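Both curves can be computed directly from a ranked list, as in the sketch below. The scores and labels are hypothetical, and lift is taken here as the cumulative hit rate divided by the fraction targeted, which is what random targeting would achieve on average.

```python
def cumulative_response(scores, labels):
    """Return (fraction targeted, fraction of positives captured) points, ranked by score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_positives = sum(labels)
    points, captured = [], 0
    for i, (_, label) in enumerate(ranked, start=1):
        captured += label
        points.append((i / len(ranked), captured / total_positives))
    return points


def lift(points):
    """Lift = hit rate of the model divided by the hit rate of random targeting."""
    return [(targeted, hit_rate / targeted) for targeted, hit_rate in points]


# Hypothetical model scores and true outcomes (1 = responded).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 1, 0, 1, 0, 0, 0, 1, 0, 0]
curve = cumulative_response(scores, labels)
lift_curve = lift(curve)
```

In this toy example, targeting the top 20% of the list captures half of all responders, a lift of 2.5 over random targeting; at 100% targeted the lift is always 1.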
Analysts monitor processes, evaluate data quality, and monitor production model performance. These steps seem relatively routine but when you realize the fact that a model is never “complete” and will always require some oversight then appointing an analyst to manage the process makes sense. This allows your more senior assets to focus on innovation instead of maintenance.
The Data Scientist then owns the modeling process. Generally, they take input parameters from product or other team leads in order to understand the model’s business objective. They then work to articulate requirements to the engineers and other stakeholders. Once these criteria have been defined, the process of building tests, models, and evaluating performance begins.
Data Engineers are responsible for building and maintaining the technical infrastructure required in order to do modeling, predictions, and analysis. The engineers create and maintain databases, machine learning pipelines, and production processes. Without properly stored data, modeling processes, and the ability to serve predictions in production, a Data Scientist is essentially useless.
As the data team and number of models grows, the need for a Data Science Manager appears. This person coordinates the quants, devs, and analysts as well as manages external demand of the data science team. The Data Science Manager essentially guides the process, allocates resources, and occasionally shields the team from ad hoc requests so they are able to achieve their primary objectives.
Ignoring methodology and overlooking phases lead to fragile insights and unreliable products