Data Science:
Bridging the Gap Between Data Generation
and Data Comprehension
Dr Carsten Riggselsen
Principal Data Scientist
Pivotal
© Copyright 2015 Pivotal. All rights reserved.
Analyzing data is nothing new
“Their Data” “Our Data” “My Data”
“Data”
“The Data”
“Data (Big)”
“Data” vs. “Data-Driven”
Deploy analytic apps
and automation at scale
Store any type
and size of data
Discover insights
Create analytics algorithms
Data Science
Product Management
Product Design
Engineering
Continuous Improvement
Data Science
Isolated Data Science
“I don’t think (Big) Data is valuable, it’s hype – prove me wrong.”
“We do BI and stuff already. Data Science is hype – prove me wrong.”
“Mere” convenience through Apps
Automate mundane or tedious tasks
Present information at a glance in an app
User Interaction with the app
Consistency and unbiasedness
24-7 availability
Scalability
Platform independence
Easy Provisioning
Smart Apps – Data Science Powered
Combining/linking data sources/streams across areas and domains
There is an element of prediction involved based on accumulated data/info
Inferring (ab)normal patterns, e.g., profiling users, usage patterns
There is an element of root-cause identification involved
DS-Cheat-Sheet - Is it a SMART App?
☐  Can past knowledge potentially improve how to inform or act in the future?
☐  Is past knowledge based on data/info from different domains?
☐  Do you need to affect outcomes in real time?
☐  Are (ab)normal patterns to be inferred?
☐  Is the reason or cause for an action or a pattern unclear, yet important to know?
☐  Is the solution highly personalised?
☐  Is “crowdsourcing” knowledge (data/information) beneficial?
The Car Unlock Button – Press it!
“Siri or OK Google – unlock
my car…
UnnnLoooock my Caaaar…”
“OK – I will unlock your house”
SMART Unlock
Access to your Calendar/Agenda
Infer where/when you usually go by car
Awareness of Bank Holidays etc.
Knows where you parked your car
Knows where you are (GPS)
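How these signals might combine can be sketched in a few lines. This is only an illustration, not the logic of any real product: the function names, the 20 m radius, the 15-minute window and the rule that bank holidays suppress habitual trips are all assumptions made for the example.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius in metres


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle (haversine) distance between two GPS points, in metres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def should_unlock(user_pos, car_pos, now, calendar_trips, habitual_trips,
                  is_bank_holiday=False, radius_m=20.0,
                  window=timedelta(minutes=15)):
    """Hypothetical decision rule: unlock only if the user is near the parked
    car AND a trip is expected around now.

    calendar_trips: departure times taken from the calendar/agenda.
    habitual_trips: departure times inferred from past usage; these are
                    ignored on bank holidays (an assumption of this sketch).
    """
    near_car = distance_m(*user_pos, *car_pos) <= radius_m
    expected = list(calendar_trips)
    if not is_bank_holiday:
        expected += list(habitual_trips)
    trip_expected = any(abs(now - t) <= window for t in expected)
    return near_car and trip_expected
```

In a real app the habitual trips would themselves come from a model trained on accumulated usage data; here they are simply passed in as a list.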
The Car-Unlock Experience
Works, Efficient, Convenient, Smart
“I unlocked your car!”
Examples
Obstruction Duration Prediction
•  Predict duration of road incidents in London
•  Android app developed on top of the model
•  http://ds-demo-transport.cfapps.io
REALTIME DASHBOARD
Driving
Prediction
https://youtu.be/5gySgGWJMHA
Time to Delivery
•  Three subproblems
   –  Time to delivery estimate
   –  Time slot availability
   –  Courier scheduling
•  Courier scheduling and time to delivery estimate may have mutual feedback
Logistics Comp.
Telco: Protecting Minors - Age Prediction
Estimate age of the customer based
on their calling habits
Can distinguish minors with an accuracy of >80%
•  Call records from March-Aug 2014
•  Corresponds to ~3TB data
•  Attributes are
   •  Calling party ID
   •  Called party ID
   •  Date
   •  Time
   •  Duration at start/end
   •  Location
   •  Type of call and bearer
   •  TAC
   •  Data
CDR CRM Data
Feature                        Importance  Observation
Calls (holidays-schooltime)    0.08-0.06   Minors call less in school holidays
Average call length            0.07        Minors make shorter calls
Call timing (night-day)        0.07-0.03   Minors call more at nighttime
Number of phone uses           0.05        Minors use their phones less
Percentage of text use         0.05        Minors text less
Number of contacts             0.05        Minors less likely to have 1 contact
Percentage of calls to minors  0.04        Minors call other minors more
Percentage of voice use        0.04        Typical
Caller-Callee ratio            0.04        Minors receive more calls than they make
Fri/Sat/Thurs ratio            0.04-0.03   Minors call more at weekends
Number of locations            0.04        Minors more likely to have 2 locations
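A few of the features above can be derived from raw call records roughly as follows. This is a minimal sketch, not the pipeline used in the project: the `CallRecord` fields, the definition of "night" as 22:00 to 06:00 and the function name are assumptions, and the model that consumes these features is not shown.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class CallRecord:
    """Hypothetical, simplified call detail record."""
    caller: str
    callee: str
    hour: int          # hour of day, 0-23
    duration_s: float  # call length in seconds (0 for texts)
    kind: str          # "voice" or "text"


def features_for(caller, records):
    """Compute a handful of per-customer features from the records they initiated."""
    own = [r for r in records if r.caller == caller]
    if not own:
        raise ValueError(f"no records for caller {caller!r}")
    voice = [r for r in own if r.kind == "voice"]
    # Assumption: "night" means 22:00-06:00.
    night = [r for r in own if r.hour >= 22 or r.hour < 6]
    return {
        "avg_call_length_s": mean(r.duration_s for r in voice) if voice else 0.0,
        "night_share": len(night) / len(own),
        "text_share": (len(own) - len(voice)) / len(own),
        "n_contacts": float(len({r.callee for r in own})),
    }
```

At the stated scale (~3TB of records) such aggregations would run inside a distributed database or cluster rather than in plain Python; the logic per customer stays the same.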
Internal Transaction Fraud Detection
Beyond signatures
Beyond simple metrics for thresholding
Beyond manual engineering of rules
Monitor each and every entity in its environmental context
Mixture of Experts Metaphor
UserID and Data, Experts analyze, Overall vote is determined
Figure values: expert votes 2, 5, 3, 3; overall vote 3.25

$$S(id) = w_1 \cdot M_1(id) + \dots + w_j \cdot M_j(id) \quad \text{s.t.} \quad \sum_{i=1}^{j} w_i = 1$$

Each weight w_i is a measure of “importance” for model expert i. Initially uniform across all experts.
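The weighted vote can be sketched directly from the formula. The function names below are illustrative assumptions; only the formula, the sum-to-one constraint and the uniform initial weights come from the slide.

```python
def uniform_weights(n_experts):
    """Initial weights: every expert is equally important and the weights sum to 1."""
    if n_experts <= 0:
        raise ValueError("need at least one expert")
    return [1.0 / n_experts] * n_experts


def combined_score(expert_scores, weights):
    """S(id) = w_1 * M_1(id) + ... + w_j * M_j(id), subject to sum(w_i) = 1."""
    if len(expert_scores) != len(weights):
        raise ValueError("one weight per expert is required")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * m for w, m in zip(weights, expert_scores))


# The example numbers on the slide: four experts vote 2, 5, 3 and 3.
votes = [2, 5, 3, 3]
print(combined_score(votes, uniform_weights(len(votes))))  # 3.25
```

With uniform weights the combined score is simply the mean of the expert votes; how the weights are updated afterwards is not described on the slide.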
Anomalous User Behavior Comparison
Mean Anomaly Scores and Users

Region  Group  Transaction  SoD   Terminated  CDHDR Access  VPN Access  Cluster  Total  Users  Users
               Anomaly      Risk  Employees   Anomaly       Anomaly     Outlier  Score  (#)    (%)
Reg B   Red    0.6          0.6   0.1         0.2           0.1         0.6      2.3    26     0.3%
Reg B   Amber  0.4          0.5   0.1         0.1           0.1         0.6      1.7    73     0.8%
Reg B   Green  0.0          0.0   0.0         0.0           0.1         0.0      0.1    8,765  98.9%
Reg A   Red    0.1          -     -           1.0           0.4         0.9      2.4    1      0.01%
Reg A   Amber  0.4          0.2   0.0         0.1           0.2         0.7      1.7    25     0.4%
Reg A   Green  0.0          0.0   0.0         0.0           0.1         0.0      0.2    6,853  99.6%
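A total score of this kind can be turned into a Red/Amber/Green flag along the following lines. This is a sketch under stated assumptions: the slide gives no cut-off values, so the thresholds 2.0 and 1.0 are hypothetical, as are the function names and the handling of missing components as zero.

```python
def total_score(component_scores):
    """Sum the per-expert anomaly scores; missing components (None) count as 0."""
    return sum(s for s in component_scores.values() if s is not None)


def traffic_light(total, red_at=2.0, amber_at=1.0):
    """Map a total anomaly score to a risk group.

    The thresholds are hypothetical; the slide does not state how the
    groups were cut.
    """
    if total >= red_at:
        return "red"
    if total >= amber_at:
        return "amber"
    return "green"


# Hypothetical single user with component scores in the style of the table.
user = {
    "transaction_anomaly": 0.6,
    "sod_risk": 0.6,
    "terminated_employees": 0.1,
    "cdhdr_access_anomaly": 0.2,
    "vpn_access_anomaly": 0.1,
    "cluster_outlier": 0.6,
}
print(traffic_light(total_score(user)))  # red
```

Note that the table reports group means, so the figures there describe each group on average rather than any individual user.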
Conclusions
Add SMARTness to your app by leveraging data
Don’t think of Data Science in an isolated fashion
Move beyond POCs on Big Data
Start with a minimal viable product/solution
Get the right platform and resources in place
Collaborate and interact
Digital Transformation Forum
Disrupt or Be Disrupted
19 OCTOBER · BMW WELT EVENT CENTRE · MUNICH
