What does a typical data science project look like? Explore the current business analytics landscape: get past the jargon into actual business cases. Ania Wieczorek, co-founder of Bowery Analytics, talked about how data science is the newest hot trend in the world of business and what it really means. She took the audience through a real case and explained what the project lifecycle looks like from a business perspective. We also discussed the specific steps a typical data science project goes through, the outputs you will see, and the jargon being used.
9. AGENDA
Intro to data science
Case Study
Data Science Project Lifecycle
Data Science Projects vs. Traditional Projects
Resources
BOWERY ANALYTICS LLC
10. NOT COVERED
1. Data mining algorithms and techniques.
2. Big data technologies.
3. Data visualization technologies.
11. WHAT IS DATA SCIENCE?
An area of work concerned with the collection, preparation, analysis, visualization, management, and preservation of large collections of information.
(Jeffrey Stanton, Syracuse University School of Information)
12. WHY NOW?
90% of all data was created in the last two years.
2.5 million TB (2.5 quintillion bytes) of data is created every day.
13. HOW DID WE GET HERE? THE EVOLUTION OF BIG DATA
1.0 (1950-2000): Enterprise data warehouse. Static and low volume. Focus on operational efficiency.
2.0 (2000-2012): Google and Amazon (the Internet). High volume, high velocity. Data tools as a product ("people you may know", "products you like").
3.0 (2012-now): Data generated at every event (IoT). Cognitive analytics (Echo, Home). Every company offering data products.
The different eras of Big Data
16. THE PARK: MORTON ARBORETUM
Objective: Reduce membership churn for the park using predictive analytics.
Data: Explore the park's member profile and transaction data and make predictions.
Tools: Leverage data analytics tools and predictive algorithms.
18. DATA TRANSFORMATION
Multiple steps:
Clean the data (remove blanks and empty spaces, fix data formatting)
Add additional attributes to the data:
Binary flag: churn / no churn
Distance from Morton by ZIP code
Age group
Average income, leveraging census data
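A sketch of these transformation steps in pandas; the table, column names, and values are hypothetical, not the Arboretum's actual schema.

```python
import pandas as pd

# Hypothetical member table; columns are illustrative only.
members = pd.DataFrame({
    "member_id": [1, 2, 3, 4],
    "name": [" Ann ", "Bob", None, "Dee"],
    "birth_year": [1950, 1985, 1972, 2001],
    "renewed": [True, False, True, False],
})

# Clean: trim stray whitespace and drop rows missing a name.
members["name"] = members["name"].str.strip()
members = members.dropna(subset=["name"])

# Additional attribute: binary churn flag (1 = did not renew).
members["churn"] = (~members["renewed"]).astype(int)

# Additional attribute: age group via binning on birth year.
members["age_group"] = pd.cut(
    members["birth_year"],
    bins=[1900, 1960, 1990, 2025],
    labels=["senior", "middle", "young"],
)
```

Distance-by-ZIP and census income would join in the same way, via a lookup table keyed on ZIP code.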
23. SUPERVISED VS. UNSUPERVISED
1. Do our customers fit into distinct groups? (Unsupervised)
2. Can we identify which customers will quit after their contract expires? (Supervised)
24. MODELING – QUICK INTRO
•Estimate a numerical value: what is the probability of this customer leaving? (Supervised/Regression)
•Will the customer leave (yes/no)? (Supervised/Classification)
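Both framings are supervised; the difference is the form of the answer. A minimal scikit-learn sketch on synthetic data, where the same fitted classifier returns both a numeric probability estimate and a yes/no label:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Feature: visits in the last year; target: 1 = churned, 0 = stayed.
# Values are invented for illustration.
X = np.array([[1], [2], [3], [10], [12], [15]])
y = np.array([1, 1, 1, 0, 0, 0])

model = LogisticRegression().fit(X, y)

# Numeric estimate: probability that a member with 2 visits churns.
prob_leaving = model.predict_proba([[2]])[0, 1]

# Yes/no answer: the hard classification for the same member.
will_leave = model.predict([[2]])[0]
```

On this toy data a low-visit member gets a high churn probability and a "yes" label, while a frequent visitor gets the opposite.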
25. MODELING
•Dealing with missing values
•Dummy transformations (churn/no churn variables)
•Identify train and test data (60% and 40%)
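These three steps in a minimal pandas/scikit-learn sketch; the column names and values are invented:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Illustrative transactions table (hypothetical schema).
df = pd.DataFrame({
    "visits": [5, None, 12, 3, 8, None, 7, 2, 9, 4],
    "membership": ["basic", "family", "basic", "family", "basic",
                   "basic", "family", "basic", "family", "basic"],
    "churn": [1, 0, 0, 1, 0, 1, 0, 1, 0, 1],
})

# 1. Missing values: impute visits with the median.
df["visits"] = df["visits"].fillna(df["visits"].median())

# 2. Dummy transformation: one-hot encode the categorical column.
df = pd.get_dummies(df, columns=["membership"])

# 3. Train/test split, 60/40 as on the slide.
train, test = train_test_split(df, test_size=0.4, random_state=42)
```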
28. DATA SCIENCE APPLICATIONS
1. Customer attrition (eCommerce and subscriptions)
2. Predicting who is quitting (HR)
3. Reducing heavy-equipment downtime (oil and natural gas)
4. Image recognition and profiling (government)
5. Precision medicine (healthcare)
6. Topic modeling and document management (legal and contracts)
29. SOME DATA SCIENCE ALGORITHMS
Classification: How likely is the customer to respond to our campaign?
Regression (Estimation): How much will she use the service?
Similarity: Can we find customers similar to my best customers?
Clustering: Do my customers form natural groups?
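The clustering question ("do my customers form natural groups?") in a minimal scikit-learn sketch, with two synthetic groups planted so the result is easy to check:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customer features (visits per year, spend per visit);
# two well-separated groups are planted for illustration.
rng = np.random.default_rng(0)
casual = rng.normal([2, 10], 1.0, size=(20, 2))
loyal = rng.normal([20, 60], 1.0, size=(20, 2))
X = np.vstack([casual, loyal])

# K-means with k=2 recovers the two planted groups.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
labels = km.labels_
```

In a real project the number of groups is unknown, so k is usually chosen by inspecting several values rather than fixed in advance.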
30. CONT.
Co-Occurrence: "You might also like…"
Description (Profiling): What does normal behavior look like?
Causal Modeling: Why are my customers leaving?
31. TOOLS AND SKILLS
Business Understanding: Microsoft Visio, Excel (power user and macros), Microsoft Word, SQL.
Data Understanding: QlikView, Tableau, Power BI, ggplot using R, embedded web plots using Plotly and D3.js.
Data Preparation: statistics, R, Python, Scala.
Modeling, Evaluation, and Deployment (the three phases share one stack): R, Python, Scala, Azure ML, RStudio, IBM Bluemix, Google TensorFlow, AWS ML, Hadoop, the Apache Spark ecosystem.
32. TYPICAL JOB ROLES
Data Analysts and Visualization (e.g., Tableau Developer): analysts who are part of a data science team; Tableau dashboard developers.
Data Mining and Infrastructure Setup (AWS and Hadoop/NoSQL developer/administrator): set up Hadoop infrastructure; write Java code to run Hadoop jobs; administer AWS instances.
Data Scientist: data modeling using predictive analytics.
Data Science Manager: team lead; architect.
33. PREPARING FOR THE PROJECT
Do we have the data?
What kind of data do we have?
Do we have the team?
Do we have buy-in from the stakeholders?
34. GOOD DATA = GOOD MODELS
35. COMMON MISTAKES
•Treating a data science project like a software project (it is not one)
•Overestimating the significance of the data
•Picking the wrong vendor, team, and skills
36. RESOURCES AND TOOLS
1. An Introduction to Statistical Learning, by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani
2. Business Analytics for Managers, by Wolfgang Jank
3. An Introduction to Machine Learning, by Miroslav Kubat
4. Understanding Statistics Using R, by Randall Schumacker and Sara Tomek
5. www.rstudio.com, the primary development environment for R
6. Visual Studio Community Edition 2015 and above, which comes with an integrated R environment
We are going to look at each of these concerns and identify how they map to a project plan and the skills needed to execute it.
90% of all data was created in the last two years; 2.5 quintillion bytes of data is created every day.
1.0: The US Census was one of the first data warehouses built.
2.0: Google and Amazon pioneered the use of data products: Google with AdWords, and Amazon with "You may also like…" product recommendations.
3.0: Data products everywhere: Caterpillar selling usage data; GE launching operational products for the oil and natural gas industry to help reduce downtime, drawing on 50 million data variables from 10 million sensors installed on its machines. GE predicts the market for industrial data applications will be close to $220 billion by 2020.
https://sloanreview.mit.edu/case-study/ge-big-bet-on-data-and-analytics/
What kind of data do we have? This is a continuation of data exploration.
Top left: churn/no churn in relation to visits.
Top right: churn/no churn in relation to events attended.
Bottom left: churn/no churn in relation to distance from the park.
Bottom right: churn/no churn in relation to net amount spent in the park.
This is the first hypothesis we developed in a series.
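The same exploration step can be run as a table instead of plots: compare averages by churn status. The numbers and column names below are made up for illustration.

```python
import pandas as pd

# Hypothetical member data: churn flag plus two of the explored features.
df = pd.DataFrame({
    "churn": [1, 1, 0, 0, 0],
    "visits": [1, 2, 8, 10, 12],
    "distance_mi": [30, 25, 5, 8, 3],
})

# Mean visits and distance for churned (1) vs. retained (0) members.
summary = df.groupby("churn")[["visits", "distance_mi"]].mean()
```

A large gap between the two group means is what the plots make visible at a glance.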
Explain why we need supervised and unsupervised methods.
False positives: records that were actually negative but marked as positive. You lose money here by targeting the wrong people.
False negatives: records that were actually positive but marked as negative. You lose opportunity here by not targeting the right people.
You want both of these numbers to be very low.
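These two counts can be computed directly from actual and predicted labels; the labels below are made up for illustration.

```python
# 1 = churned (positive), 0 = stayed (negative); invented example labels.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 1, 0, 0, 1, 1, 0]

# False positive: actually negative, predicted positive (wasted targeting).
false_pos = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)

# False negative: actually positive, predicted negative (missed opportunity).
false_neg = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
```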
Business Understanding: What kind of problem are we trying to solve? Is it classification? Is it regression?
Data Understanding: What are the strengths and limitations of the data? Do we have enough historical data to accomplish what we need? Is the data reliable? We may have a lot of data, but is all of it actually reliable? We collected all the transaction data, but are we getting any reliable insights from it? For example, credit card data is labeled fraud vs. legitimate, and these labels might serve as targets. Medicare fraud is a distinct example: there are no specific labels identifying which charges are fraudulent. It is common to apply several data mining techniques and eventually combine them.
Data Preparation: Converting to tabular format, inferring missing values, type conversion; making sure the data looks good.
Modeling
Evaluation: Too many false alarms? What would be the cost of all the false alarms? Data scientists need to think about the comprehensibility of the model to stakeholders. Test and production environments: do final controlled experiments in live systems. Also, be wary of what kind of data is passed to the model. Did the data change after the model was built?
Deployment can be as simple as a set of rules or as complex as live fraud detection.
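A sketch of the "set of rules" end of that spectrum: the decision logic of a churn model exported as plain rules. The thresholds and function name are invented for illustration.

```python
def churn_risk(visits_last_year: int, distance_miles: float) -> str:
    """Flag a member for the retention campaign using two simple rules."""
    # Rule 1: very infrequent visitors are the highest churn risk.
    if visits_last_year < 3:
        return "high"
    # Rule 2: members living far from the park are a moderate risk.
    if distance_miles > 25:
        return "medium"
    return "low"
```

Rules like these are easy to deploy (they can even live in a spreadsheet or SQL view), at the cost of the accuracy a full model provides.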
#6 – Talk about the project we are working on right now at the financial client.