Demystifying Data Science
What does it mean in practice?
Jonathan Sedar

Principal Data Scientist

Applied AI Ltd

www.applied.ai

@applied_ai

@jonsedar
Applied AI is a Data Science Consultancy
We create a competitive advantage for financial services
companies through applied artificial intelligence
Know Your Customers
Develop Your Market
Manage Risk & Regulation
Innovate & Experiment
Streamline Operations
Embed Data Science
Demystifying Data Science
Motivations
A Maturity Model
An Ecosystem Model
Practical Examples & Advice
Data Science
$> DATA.SCIENCE()
Intelligently Learning
From Data
Extracting information from all
that Big Data you're collecting
.. and the small stuff too
Discovering correlations, inferring
patterns of behaviour ... and training
models to predict outcomes
Running the business more
effectively ... and systematising
insights and products
How wonderful for you
Learning from data is
nothing new
Most of our business
is doing it already
Trading & Quant Finance
Increase Revenue
Process Optimisation
Reduce Costs
Portfolio Risk Modelling
Manage Risk
Reserves & Stress Testing
Meet Compliance
Learning from data benefits
the whole business
Increase Revenue
tune risk profile

understand the competition
optimise business processes

improve customer retention
inform & adapt to regulatory change

demonstrate leadership
innovate product-market fit

increase customer base
Reduce Cost
Manage Risk Meet Compliance
Data Science
Maturity Model
Getting Started
• Identify new opportunities and
useful data sources

• Basic modelling

• Senior leaders help to define &
develop the business case
Sophisticated Analyses
• Hypothesis testing & data
discovery

• Advanced statistics & predictive
modelling

• Deliver immediate value, guide
strategy
Business Operations
• Create ‘data products’, reports,
new systems to embed change

• Replace legacy systems

• Build internal knowledge and skills
Full Capability Data Science
• Advanced data science is
supported throughout the organisation
and embedded in:

• Products & Services

• Senior Decision Making

• Business Administration
Getting Started
• Auto Insurer: “Help me price correctly”
• Extracted, cleaned, parsed data from messy
internal & external sources
• Lightweight multidimensional analysis of customer
base, including interactive dashboards
• Reports and strategic recommendations to board
level, proving the need for further analysis
Sophisticated Analyses
• Life & Pensions: “Help me model my customer churn
(a credit risk situation)”
• Sourced, cleaned, prepared internal & external data
• Created advanced time-to-event models using
Bayesian statistics
• Churn modelling output identified key risk groups &
potentially large new revenues and cost savings
Business Operations
• Asset Management Co: “Help me price real estate
at the optimal market price”
• Sourced, cleaned, prepared data, undertook initial
investigations and statistical modelling
• Created a price prediction “engine” within a
microservice API, now used within daily operations
• Accurate estimates and reduced manual effort
Full Capability Data Science
• The holy grail!
• A centre of excellence guiding:
• Products
• Decision Making
• Business Administration
Data Science
Ecosystem
Data Curation
• Making the right data available for
modelling and maintaining it well.
• Garbage-in-garbage-out
• Getting to ‘good data’ is subtle
• 80% of the process
Machine Learning
• Learning from data
• The empirical practice at the heart of
statistics.
• A machine (aka computer or model) is
trained on a dataset to predict values
• Predict or infer real-world behaviours.
Business Integration
• Conventional business analysis lives and
dies within spreadsheets & presentations
• Expensive dashboards require unstable
data pipelines.
• Huge data warehouses and "lakes" are so
complicated they're barely utilised.
• Business integration is hard, but critical
Three Stories of Data
Science in Practice
Data Curation
Curating external datasets to
better understand customers
Clustering
Introspection
Visualisation
We work mainly with insurance companies
They don’t have a reputation for being exciting
But from a data science point of view…
It’s quite interesting!
“Our term insurance policies are
lapsing before they become profitable”
We modelled lapse using survival analysis
(more of which later)
Along the way noticed something…
The churn rate was sky-high
in new estates
Geographic Effects
And Socioeconomic Effects
We could use these effects to:
Identify lapse-prone customers
More accurately price credit risk
Identify new markets
… we’re not the first people to
think of this
We can do it better and
cheaper ourselves
First: geocode the customer base
Get lat/long based on address
Used Nominatim (FOSS, based on PostGIS)
rather than Google, because …
Irish addresses are
pathological!
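A minimal sketch of the geocoding step, assuming a Nominatim search endpoint. The base URL below is the public OpenStreetMap instance, which is an assumption (the talk does not say where Nominatim was hosted), and the helper names `build_search_url` and `parse_first_hit`, the sample address and the sample response are all hypothetical.

```python
import json
from urllib.parse import urlencode

# Assumption: the public OpenStreetMap endpoint. A self-hosted Nominatim
# instance would expose the same /search path on a different host.
NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(address, country_code="ie"):
    """Build a free-text Nominatim search URL that returns at most one JSON hit."""
    params = {
        "q": address,
        "format": "json",
        "limit": 1,
        "countrycodes": country_code,
    }
    return NOMINATIM_SEARCH + "?" + urlencode(params)


def parse_first_hit(response_text):
    """Return (lat, lon) as floats from a Nominatim JSON response, or None."""
    hits = json.loads(response_text)
    if not hits:
        return None  # the address could not be geocoded
    return float(hits[0]["lat"]), float(hits[0]["lon"])


# Hypothetical response body, shaped like Nominatim's JSON output
# (coordinates arrive as strings).
sample = '[{"lat": "53.3498", "lon": "-6.2603", "display_name": "Dublin"}]'
print(build_search_url("1 Main Street, Dublin"))
print(parse_first_hit(sample))
```

The sketch deliberately stops short of sending the request. The public instance has a usage policy with strict rate limits, so geocoding a whole customer base is probably better suited to a self-hosted instance.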
Second: go shopping for
socioeconomic data
Irish census produced every 5 years
15 themes, 500+ features
Captures almost everything about daily life
Aggregated to ‘small areas’ approx 200 households
Census themes
Theme  Subject                 Theme  Subject
1      Sex, Age & Migration    9      Social Class
2      Ethnicity & Language    10     Education
3      Irish Language          11     Commuting
4      Families                12     Health
5      Private Households      13     Occupation
6      Housing                 14     Industries
7      Hospitals & Prisons     15     PC & Internet
8      Principal Status
We could do what Experian does,
and also:
We would own the code
We could integrate with any internal project
We could tune it to fit our needs
Let’s take a look at the data
Not a trivial task…
What we have is a really big matrix
18,488 rows x 767 columns
Data Compression
Visualisation
Clustering (unsupervised learning)
Data Compression
Singular Value Decomposition
Rotate and scale data into new frame of reference
Compress into fewer features while maintaining
information
Compressed 500+ columns into 100
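To make the compression idea concrete, here is a small standard-library sketch of a truncated SVD using power iteration with deflation. It is an illustration only and not the method used in the project. In practice a numerical library would do this work, and the function names and the tiny `data` matrix are hypothetical.

```python
import math
import random


def _matvec(a, v):
    """Multiply matrix a (a list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def _top_triplet(a, rng, iterations=500):
    """Power iteration on A^T A: (sigma, u, v) for the largest singular value."""
    n_rows, n_cols = len(a), len(a[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]  # random start vector
    for _ in range(iterations):
        av = _matvec(a, v)
        w = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # nothing left to extract
        v = [x / norm for x in w]
    av = _matvec(a, v)
    sigma = math.sqrt(sum(x * x for x in av))
    u = [x / sigma for x in av] if sigma > 0.0 else av
    return sigma, u, v


def truncated_svd(a, k, seed=0):
    """Return the k largest singular values and their right singular vectors."""
    rng = random.Random(seed)
    residual = [list(row) for row in a]
    sigmas, components = [], []
    for _ in range(k):
        sigma, u, v = _top_triplet(residual, rng)
        sigmas.append(sigma)
        components.append(v)
        # Deflate: subtract the rank-one piece sigma * u * v^T just found.
        residual = [
            [residual[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))
        ]
    return sigmas, components


def compress(a, components):
    """Project each row onto the retained components (fewer columns)."""
    return [[sum(x * c for x, c in zip(row, comp)) for comp in components] for row in a]


data = [[3.0, 0.0], [0.0, 4.0], [0.0, 0.0]]
sigmas, components = truncated_svd(data, k=2)
print([round(s, 6) for s in sigmas])
print(compress(data, components[:1]))  # each row reduced to a single feature
```

Keeping only the components with the largest singular values is what lets hundreds of columns be reduced to around 100 while retaining most of the variation.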
Data Visualisation
t-Distributed Stochastic Neighbor
Embedding (t-SNE)
Visualise 100D in 2D space
View natural clustering in the data
Clustering
Hierarchical Agglomerative Clustering
(Ward Clustering)
Progressively group nearby datapoints into larger clusters
Cut nested hierarchy of clusters to fit
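The merge-then-cut procedure can be sketched in plain Python as below. This is a naive illustration of Ward's criterion, far too slow for 18,000+ rows, and the function names and sample points are hypothetical. A library routine such as SciPy's `scipy.cluster.hierarchy.linkage` with `method="ward"` would be the usual choice.

```python
def _centroid(points, members):
    """Mean position of the points whose indices are in members."""
    dims = len(points[0])
    return [sum(points[m][d] for m in members) / len(members) for d in range(dims)]


def _ward_cost(points, a, b):
    """Increase in within-cluster sum of squares if clusters a and b merge."""
    ca, cb = _centroid(points, a), _centroid(points, b)
    dist_sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
    return len(a) * len(b) / (len(a) + len(b)) * dist_sq


def ward_clusters(points, n_clusters):
    """Merge the cheapest pair of clusters until n_clusters remain."""
    clusters = [[i] for i in range(len(points))]  # start with singletons
    while len(clusters) > n_clusters:
        _, i, j = min(
            (_ward_cost(points, clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(ward_clusters(points, n_clusters=2))
```

Stopping at `n_clusters` is equivalent to cutting the nested hierarchy at the level that leaves that many groups.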
Interpreting the Clusters
…carefully
Now we can place each
small area on a map
Using shapefiles and PostGIS
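Placing a geocoded customer into a small area is a point-in-polygon lookup. The project used PostGIS for this. The sketch below is only a standard-library stand-in using the ray-casting test, with hypothetical area ids and square polygons in place of real shapefile boundaries.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count how many polygon edges a ray going right from (x, y) crosses."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line through y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def assign_small_area(x, y, small_areas):
    """Return the id of the first area containing the point, or None."""
    for area_id, polygon in small_areas.items():
        if point_in_polygon(x, y, polygon):
            return area_id
    return None


# Hypothetical small areas: two adjacent squares.
small_areas = {
    "SA_1": [(0, 0), (4, 0), (4, 4), (0, 4)],
    "SA_2": [(4, 0), (8, 0), (8, 4), (4, 4)],
}
print(assign_small_area(2, 2, small_areas))
print(assign_small_area(6, 1, small_areas))
print(assign_small_area(9, 9, small_areas))
```

A spatial database adds indexing and handles projections and edge cases (holes, multi-part boundaries) that this sketch ignores.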
Dublin, Ireland
2011
Interactive dashboard showing
each Small Area (200 people),
plotted by location and cluster id
Data Curation
• A centralised, up-to-date, traceable,
documented repository for structured
text, tabular & image datasets
• Augment with public data to keep up
with competitors and gain an edge
• Update, maintain and optimise your
primary data sources to allow for high
risk/reward POC projects
Machine Learning
Learning from data to predict
outcomes and infer behaviours
Supervised (classification, regression)
Unsupervised (clustering, pattern matching)
Reinforcement (behavioural rewards)
Hot new area, thus word soup
artificial intelligence
machine intelligence
statistical modelling
robotic process automation
cognitive computing
deep learning
…
Statistics <3 Machine Learning
Example 1: time to event modelling
“What’s our projected customer churn
(and thus projected credit risk)?”
Supervised Regression
Basic idea: estimate this curve
Counts: Kaplan Meier
Parametric (or semi-parametric) models
Exponential, Weibull, Cox PH Regression etc
Time-varying coefficients
Piecewise, Aalen-Additive Regression etc
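The count-based estimate of the survival curve can be sketched in a few lines. This is a minimal Kaplan-Meier estimator, assuming right-censored data given as a duration per policy plus a flag for whether the lapse was actually observed. The function name and the toy data are hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(time, survival probability)] at each time with an observed event.

    durations: time on book for each policy
    observed:  True/1 if the policy lapsed at that time, False/0 if censored
    """
    event_times = sorted({t for t, e in zip(durations, observed) if e})
    survival = 1.0
    curve = []
    for t in event_times:
        at_risk = sum(1 for d in durations if d >= t)
        events = sum(1 for d, e in zip(durations, observed) if d == t and e)
        survival *= 1.0 - events / at_risk  # conditional survival past time t
        curve.append((t, survival))
    return curve


# Toy data: five policies, two of them censored (still in force when last seen).
durations = [1, 2, 2, 3, 4]
observed = [1, 1, 0, 1, 0]
for time, surv in kaplan_meier(durations, observed):
    print(time, round(surv, 3))
```

Censored policies still count in the at-risk denominator until they drop out of view, which is what distinguishes this from a simple proportion of lapsed policies.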
Sidenote: Bayesian Inference is perfect
for time-based regression
Treat observed values as a realisation of a
probability distribution
Big wins: capture prior knowledge, preserve
uncertainty, model introspection and inference
Create predictions with qualified
uncertainty: “credible regions”
Straightforward to extend models
e.g. time-varying effects
Straightforward to make models robust
e.g. outlier detection, mixture models
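As a small worked illustration of a prediction with qualified uncertainty, the sketch below uses the simplest possible survival model: a constant lapse hazard with a conjugate Gamma prior. This is an assumption chosen so the posterior has a closed form. It is not the model from the project, and the prior values and toy data are hypothetical.

```python
import random


def posterior_hazard(durations, observed, prior_shape=1.0, prior_rate=1.0):
    """Gamma posterior (shape, rate) for a constant lapse hazard.

    Exponential likelihood with right-censoring: each observed lapse adds one
    to the shape, and every policy adds its time on book to the rate.
    """
    n_events = sum(1 for e in observed if e)
    exposure = sum(durations)
    return prior_shape + n_events, prior_rate + exposure


def credible_interval(shape, rate, mass=0.95, n_draws=20000, seed=0):
    """Approximate central credible interval by sampling the Gamma posterior."""
    rng = random.Random(seed)
    # random.gammavariate takes (shape, scale), and scale = 1 / rate.
    draws = sorted(rng.gammavariate(shape, 1.0 / rate) for _ in range(n_draws))
    tail = (1.0 - mass) / 2.0
    lower = draws[int(tail * n_draws)]
    upper = draws[int((1.0 - tail) * n_draws) - 1]
    return lower, upper


durations = [1, 2, 2, 3, 4]
observed = [1, 1, 0, 1, 0]
shape, rate = posterior_hazard(durations, observed)
print("posterior mean hazard:", shape / rate)
print("95% credible interval:", credible_interval(shape, rate))
```

The prior encodes existing knowledge about lapse rates, and the interval carries the remaining uncertainty through to whatever decision uses the estimate. Richer models generally lose the closed form and need sampling-based inference instead.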
Example 2: topic modelling
“Can we learn the topics of conversation
in broker communications?”
Unsupervised Clustering
NLP upon business data sources
After careful cleaning, anonymisation, preprocessing
Find the ‘topics’ of conversation
Words that seem to co-occur
Use topics as a shortcut to categorise and
correlate documents to activity
Create the communications graph
Learn social & organisational structure
Design for interactive investigation
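The raw signal behind "words that seem to co-occur" can be shown with a simple document-level co-occurrence count. This is not a topic model, only the kind of counting that topic models build on, and the function name and sample messages are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(documents, stopwords=frozenset()):
    """Count how many documents each unordered pair of words appears in together."""
    counts = Counter()
    for text in documents:
        # Unique, sorted words so each pair is counted once per document
        # and always stored in alphabetical order.
        words = sorted({w for w in text.lower().split() if w not in stopwords})
        counts.update(combinations(words, 2))
    return counts


# Hypothetical, already cleaned and anonymised messages.
documents = [
    "claim payment delayed",
    "payment delayed again",
    "renewal quote",
]
counts = cooccurrence_counts(documents)
print(counts.most_common(3))
```

Pairs that recur across many documents hint at a shared topic. A full topic model goes further by grouping many such words and scoring each document against each group.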
Example 3: anomaly detection
“Can we spot fraudulent activity
in claims?”
Un / Supervised Learning
Supervised Learning: function estimation
Classification: Log. Reg, Neural / Deep Nets, Trees, Random Forests
Regression: Linear, Non-Linear, Time-Series
Unsupervised Learning: pattern finding
Clustering, distance measures, topologies
Feature engineering is critical
Understand the data shape, size, behaviours and the processes
that generated it
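On the unsupervised side, one simple distance-based check is the modified z-score, which measures how far each value sits from the median in units of the median absolute deviation (MAD). The sketch below is illustrative only. The 3.5 threshold is a commonly quoted rule of thumb, and the function name and claim amounts are hypothetical.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices whose modified z-score exceeds the threshold."""
    centre = median(values)
    mad = median([abs(v - centre) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    # 0.6745 rescales the MAD to be comparable with a standard deviation
    # when the data are roughly normal.
    return [
        i for i, v in enumerate(values)
        if abs(0.6745 * (v - centre) / mad) > threshold
    ]


# Hypothetical claim amounts with one suspicious value.
claims = [100, 102, 98, 101, 99, 500]
print(mad_outliers(claims))
```

Using the median rather than the mean keeps the baseline from being dragged towards the very values the check is trying to find. A flagged claim is a prompt for investigation, not proof of fraud.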
Machine Learning
• Sophisticated statistical techniques,
good software dev practices and
research-grade, open-source software
• Document and share knowledge to
become technical centre of excellence
• Validate, test, review & maintain your
data pipelines, software and models to
mitigate risk and allow for audit
Business Integration
How to integrate data science
into business activities?
Tooling
Open Source
Reproducibility and Documentation
Wider Communication
APIs and Integration
The Team
Data scientist skill set
Drew Conway’s (in)famous Venn Diagram
Not so different from a software
development team
Communicate
Iterate
and another thing…
The practice of data
science can offer powerful
insight and prediction…
… it’s only a model
Business Integration
• Clear path from model inference and
predictions to the extrapolation of
business actions and impacts
• Communicate results with non-technical
stakeholders via engaging dashboards
and visualisations
• Integrate an automated, live, on-demand
prediction service with business systems
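An on-demand prediction service can be as small as a function that turns a JSON request into a JSON response, wrapped in an HTTP handler. The sketch below uses only the standard library. The linear "model", its coefficients, the feature names and the handler name are all hypothetical stand-ins for a trained model and a real service framework.

```python
import json
from http.server import BaseHTTPRequestHandler

# Hypothetical coefficients standing in for a trained pricing model.
INTERCEPT = 50_000.0
COEFFICIENTS = {"floor_area_m2": 2_000.0, "bedrooms": 10_000.0}


def predict_price(features):
    """Apply the stand-in linear model to a dict of numeric features."""
    return INTERCEPT + sum(
        COEFFICIENTS[name] * float(features[name]) for name in COEFFICIENTS
    )


def handle_predict(body):
    """Return (status_code, json_text) for a JSON request body."""
    try:
        price = predict_price(json.loads(body))
    except (ValueError, KeyError, TypeError):
        expected = ", ".join(sorted(COEFFICIENTS))
        return 400, json.dumps({"error": "expected JSON with " + expected})
    return 200, json.dumps({"predicted_price": price})


class PredictHandler(BaseHTTPRequestHandler):
    """Minimal POST endpoint that delegates to handle_predict."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        status, payload = handle_predict(self.rfile.read(length).decode("utf-8"))
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload.encode("utf-8"))


print(handle_predict('{"floor_area_m2": 100, "bedrooms": 3}'))
print(handle_predict("not json"))
```

To try it locally, `http.server.HTTPServer(("127.0.0.1", 8000), PredictHandler).serve_forever()` would serve the endpoint. Keeping the prediction logic in a plain function makes it testable without any network.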
Using a “Data Science” approach:
- Motivations
- A Maturity Model
- An Ecosystem Model
- Practical Examples & Advice
Further reading
• Blogs with good technical articles, insights etc
• http://blog.applied.ai
• http://www.magesblog.com
• https://planet.scipy.org
• http://andrewgelman.com
• http://blog.kaggle.com
• Books / technical articles
• https://www.oreilly.com/ideas/what-is-hardcore-data-science-in-practice
• http://www.oreilly.com/data/free/ten-signs-of-data-science-maturity.csp
• Machine Learning for Hackers http://shop.oreilly.com/product/0636920018483.do
Thank you