Data Science is the study of the extraction of knowledge from data. What if we extract partial or inaccurate knowledge? This illusion of knowledge would lead us to make wrong decisions, with sometimes disastrous consequences such as in the case of medical diagnosis, security or other life and death situations. In this talk we’ll present ideas on how to validate the extracted knowledge by its predictive power and not by its ability to explain the past. We’ll also discuss special techniques for predicting very rare high value events.
1. Richard BijjaniRichard Bijjani
JUMPING TO CONCLUSIONS
(Generating Improbable Insights)
Richard Robehr Bijjani, Ph.D.
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
2.
3. • IMAGE: Chart of controlled vs uncontrolled?
• How best to change behaviors
1/3 of all deaths globally are
from cardiovascular disease
SOURCE: WORLD HEALTH ORGANIZATION
4.
SOURCE: Mayo Clinic
#1 risk factor is high blood pressure
9. Quanttus is always on!
We capture more than 50 million data points and more than 400,000 vital sign measurements per person per day.
10.
Data Science @ Quanttus
Data Science ≡ Extraction of Actionable Knowledge
from Data
Actionable Knowledge
Better Decisions
Meaningful Insights
Knowledge is actionable iff it has predictive power
(not just an ability to explain the past)
11.
JUMPING TO CONCLUSIONS
Illusion of Knowledge
fatal
The greatest enemy of knowledge is not ignorance,
it is the illusion of knowledge.
-Stephen Hawking
13.
First a Joke!
A police officer approaches a man intently
searching the ground under a lamppost
• Policeman: What are you doing?
• Man: Looking for my car keys
The officer helps for a few minutes without
success
• Policeman: Are you certain you
dropped your keys near here?
• Man : No! I remember dropping them
across the street.
• Policeman (very irritated): Why are you looking for them here, then?
• Man : The light is much better here!
14.
Why Scientific Studies are so often Wrong
• Researchers tend to look for answers where the
looking is good, rather than where the answers are
likely to be hiding. David Freedman
• 15/45 of the most prominent studies published in the top medical journals were ultimately refuted.
• 2/3 of all medical studies are wrong.
• 9/10 of leading-edge studies (like those linking a disease to a specific gene) are wrong.
John Ioannidis, University of Ioannina
15.
10% to 20% of cases:
delayed, missed, and incorrect diagnosis
Garber et al., JAMA, 2005
16.
40,000+ patients in US ICUs
may die with a misdiagnosis annually
Winters et al., BMJ Quality & Safety, 2012
17.
50% of MDs are below-average
Vinod Khosla
18.
Are you Immune to the Streetlight Effect?
Researchers tend to look for answers where the looking is
good, rather than where the answers are likely to be hiding.
David Freedman
• Think of the data you are working with: is it the ideal data, or just the conveniently available data?
• When was the last time you worked with ideal
data?
• Have you ever?
19.
Expert Consensus
[Figure: seventeen experts’ estimates of the effect of screening on colon cancer deaths. Axis: proportion of colon cancer deaths prevented, 0% to 100%]
20.
Should you trust your Dr.?
• Depends.
• If your ailment is common, your Dr. will do a decent job.
• If you’re suffering from a relatively uncommon disease,
not so well.
“If you don’t find it often, you often don’t find it.”
Jeremy Wolfe
21.
Weak Link: Humans!
1. Signals with low predictive value are not very useful
1 in 1000 does not hold one’s attention for long
2. Attention directed at one thing is attention drawn away from something else
Lost research/testing/treatment opportunities
22.
Data Scientists’ Tools of Choice
• Some scientists use only techniques they feel comfortable
with
• Others latch on to new ones without fully understanding
them.
• Some just rely on available methods built into their software.
23.
The Nonsense Asymmetry Principle
The amount of energy needed to
refute ‘Nonsense’ is an order of
magnitude bigger than to produce it.
-Alberto Brandolini
24.
Data Scientific Method
• Validation can ONLY occur by measuring the predictive power of the insights, in addition to their ability to explain the past
• Data Science is Science
and hence follows the
Scientific Method
[Flowchart of the scientific method. Boxes: Ask Relevant Questions; Research, Gather Data; Validate Data; Construct Hypothesis; Design Experiment; Test Hypothesis; Analyze Results; Hypothesis is True / Hypothesis not Valid; Report Results]
25.
Ask the Important Question
• Deadly Virus
• Infects 1 in 1 million
• Diagnostic Test developed with 99.9% Sensitivity and
Specificity
• Treatment developed
• 99% Curative
• 1% Deadly side effect
• Question: Would you recommend Diagnosis?
• Would you recommend Treatment?
26.
Efficacy of Treatment
• Do Nothing:
• 300 People Infected
• 300 People will Die
300M test subjects
          Predicted Negative   Predicted Positive
Normal    299.7M               300,000
Infected  <1                   >299
• Diagnose and Treat:
• Infected Population:
• 296 Cured
• 3 + 1 Die
• Non-Infected Population
• 297,000 Unaffected (except for the scare)
• 3,000 Die
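The arithmetic behind this slide can be reproduced in a few lines. The sketch below uses only the slide's numbers (300M people, 1-in-1-million prevalence, 99.9% sensitivity and specificity, 1% fatal side effect) and assumes, as the slide does, that an untreated infection is always fatal. The variable names are illustrative.

```python
# Worked arithmetic for the screening example; inputs are the slide's numbers.
population = 300_000_000
prevalence = 1 / 1_000_000      # infects 1 in 1 million
sensitivity = 0.999             # P(test positive | infected)
specificity = 0.999             # P(test negative | not infected)
side_effect_death = 0.01        # treatment kills 1% of those treated

infected = population * prevalence
healthy = population - infected

true_pos = infected * sensitivity
false_neg = infected - true_pos
false_pos = healthy * (1 - specificity)
true_neg = healthy - false_pos

# Positive predictive value: chance that a positive result is a real infection.
ppv = true_pos / (true_pos + false_pos)

deaths_do_nothing = infected    # assumption from the slide: the virus is always fatal
deaths_if_treated = (
    false_neg                         # missed cases, left untreated
    + true_pos * side_effect_death    # infected, killed by the treatment
    + false_pos * side_effect_death   # healthy, killed by the treatment
)

print(f"PPV: {ppv:.4%}")
print(f"Deaths, do nothing: {deaths_do_nothing:.0f}")
print(f"Deaths, diagnose and treat all positives: {deaths_if_treated:.0f}")
```

With these inputs a positive result is a true infection only about 0.1% of the time, and treating every positive costs roughly ten times as many lives as doing nothing.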
27.
Good Practices
• Understand where the data comes from.
• Pre-process / Clean your data, but keep validated outliers.
• Own the tools and adapt them to your own requirements.
• Follow the scientific method.
• Analyze data to answer the question posed.
• Save a list of other interesting questions for later.
• Share your hypotheses with the team.
• Simple is better; at the very least, make sure it’s deployable.
• Test, Validate, re-test.
• Communicate results correctly and set the right expectations.
"If you torture the data long enough, it will
confess to anything." - Hal Varian
29.
Pitfalls of data mining
• The hope: data miners pore over large, diffuse sets
of raw data trying to discern patterns that would
otherwise go undetected.
• The dark side of data mining is to pick and choose from a large set of data to try to explain a small one.
• “Given enough time, enough attempts and enough imagination, almost any conclusion can be teased out of any set of data.”
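A small simulation makes the "enough attempts" point concrete. In this sketch both the target and all 1,000 candidate features are pure random noise (made-up data, with arbitrary sample and feature counts), yet searching across the features still turns up a seemingly strong correlation.

```python
import random
import statistics

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length lists."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    var_a = sum((x - ma) ** 2 for x in a)
    var_b = sum((y - mb) ** 2 for y in b)
    return cov / (var_a * var_b) ** 0.5

rng = random.Random(4)
n_samples, n_features = 20, 1000
target = [rng.gauss(0, 1) for _ in range(n_samples)]        # pure noise
features = [[rng.gauss(0, 1) for _ in range(n_samples)]     # also pure noise
            for _ in range(n_features)]

# "Torture the data": keep only the best-looking feature.
best_r = max(abs(pearson(f, target)) for f in features)
print(f"strongest correlation among {n_features} noise features: {best_r:.2f}")
```

The winning feature explains the past sample well and has no predictive power at all, which is why validation has to happen on data that played no part in the search.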
30.
Limitations of Common Data
Mining Techniques
• Automated feature selection methods cannot
apply to rare (or unforeseen) events
• Normal events are similar; rare events are by definition unique
• Accuracy measurements are not appropriate
• Real time detection of rare events is necessary,
but machine learning techniques construct
models based on the past
• If you haven’t yet seen it, you cannot detect it!
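Why accuracy is the wrong yardstick for rare events can be shown with a toy case. The data below is invented for illustration: 10 rare events in 10,000, scored against a "classifier" that never raises an alarm.

```python
# Illustrative only: 10,000 events, 10 of them rare (label 1).
labels = [1] * 10 + [0] * 9_990
predictions = [0] * len(labels)   # a "classifier" that never raises an alarm

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_pos = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_pos / sum(labels)   # fraction of rare events actually caught

print(f"accuracy = {accuracy:.3f}, recall = {recall:.3f}")
```

The do-nothing classifier scores 99.9% accuracy while catching none of the events that matter, so measures such as recall and false-alarm rate are more informative here.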
31.
What are rare/high-value events?
• Rare or Outliers
• Occur less than 1% of the time
• For large datasets, many samples exist. Balance could be achieved and traditional Data Mining Techniques could be applied
• Preferential sampling of rare class
• Under-Sampling of majority class
• Extremely Rare or Anomalies
• Statistical chance of detection is zero
• Most databases don’t ‘naturally’ contain any samples
• Properties of target samples are not known
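For the merely rare case, the two balancing options named on this slide can be sketched as follows. The function name `balance`, the 0.8% class ratio, and the data are all made up for illustration; a real pipeline would resample the training split only.

```python
import random

def balance(samples, labels, rare_label=1, seed=0):
    """Return (under_sampled, over_sampled) lists of (sample, label) pairs."""
    rng = random.Random(seed)
    rare = [(s, y) for s, y in zip(samples, labels) if y == rare_label]
    common = [(s, y) for s, y in zip(samples, labels) if y != rare_label]
    # Under-sampling: keep every rare event, draw an equal number of common ones.
    under = rare + rng.sample(common, len(rare))
    # Preferential sampling: keep every common event, redraw rare ones with replacement.
    over = common + rng.choices(rare, k=len(common))
    return under, over

samples = list(range(1000))
labels = [1 if i < 8 else 0 for i in samples]   # 0.8% rare class, made-up data
under, over = balance(samples, labels)
print(len(under), len(over))
```

Under-sampling throws away most of the majority class, while over-sampling repeats the same few rare events many times; neither helps in the extremely rare case, where there are no samples to draw from.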
33.
Anomalies vs. High Value Rare
Events
By definition, anomalies are the
exception, but not necessarily
rare and/or of high value.
• Anomaly? Yes
• Rare Event? No
37.
Why incompatible?
• No Quality
Control
Suicide Bomb Trainer in Iraq
Accidentally Blows Up His
Class
Terrorist ‘lab’
(redacted)
40.
Takeaway
• We are drowning in data, yet
starving for knowledge
• In the case of rare events, data may not be enough; the source of the data needs to be well understood
• To detect rare events: Sometimes
it’s just more effective to generate
heuristics
• Heuristics cannot predict, while
machine learning assumes the
future will resemble the past, and
extremely rare events are not part
of the past
• What to do?
41.
Outliers revisited
1. Retain outliers in the data set for analysis.
2. Exclude only those that are known to be due to defective measurements or transcription errors.
   • Need to understand the data’s origin to accurately separate rare events from measurement errors
3. Do not assume a normal distribution.
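One way to apply these rules is to drop a reading only when knowledge of the data's origin says it cannot be real, rather than because it sits far from the mean. The heart-rate readings and plausibility limits below are hypothetical values chosen for illustration.

```python
# Hypothetical heart-rate readings in beats per minute.
readings = [62, 64, 61, 0, 63, 178, 65, 999, 60]

# Assumed plausibility limits; in practice these come from knowing the sensor
# and the physiology, not from the data's own distribution.
PLAUSIBLE_MIN, PLAUSIBLE_MAX = 20, 300

errors = [r for r in readings if not PLAUSIBLE_MIN <= r <= PLAUSIBLE_MAX]
kept = [r for r in readings if PLAUSIBLE_MIN <= r <= PLAUSIBLE_MAX]

print("dropped as measurement errors:", errors)
print("kept, including the unusual 178:", kept)
```

A rule based on standard deviations would treat 178 the same way as 0 and 999, discarding exactly the kind of rare but genuine event the analysis is looking for.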
42.
Data Mining Techniques
• Supervised
• Pro: Human readable models
• Con: Requires labeled data
• Unsupervised
• Pro: Deviation detection, no labeling needed
• Con: Requires similarity measures, high false alarm rate (due to benign yet previously unseen data)
43.
Unsupervised Techniques
• Outlier datum defined as different from the rest of the
data
• Rare event: Same definition
• Detection Approaches
• Statistics based
• Distance Based
• Model Based
44.
Unsupervised: Statistics
• Data modeled using a stochastic distribution
• Advantages: no a priori knowledge required
• Disadvantages:
• Fails with high dimensions (curse of dimensionality)
• Does not identify patterns of rare events
• Sample implementations:
• Finite mixture schemes, e.g. SmartSifter, which uses a histogram density to represent the probability distribution
• Blocked Adaptive Computationally Efficient Outlier Nominator, BACON
• Probability Distributions
• Entropy Measures
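A minimal sketch of the statistics-based idea, assuming a one-dimensional feature and a plain fixed-width histogram as the density estimate. This is not SmartSifter or BACON; `histogram_scores` is an illustrative helper and the data is synthetic.

```python
import math
import random

def histogram_scores(values, bins=20):
    """Score each value by -log of its histogram-estimated probability."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # avoid zero width for constant data

    def bin_of(v):
        return min(int((v - lo) / width), bins - 1)

    counts = [0] * bins
    for v in values:
        counts[bin_of(v)] += 1
    n = len(values)
    return [-math.log(counts[bin_of(v)] / n) for v in values]

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(2000)] + [9.0]   # one planted oddity
scores = histogram_scores(data)
print(f"score of planted value: {scores[-1]:.2f}")
print(f"largest score in data:  {max(scores):.2f}")
```

The planted value lands alone in its bin and therefore gets the highest possible score. In many dimensions the bins become almost all empty, which is the curse of dimensionality listed among the disadvantages.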
45.
Unsupervised: Distance
• Distance computed between neighbors, and data points sorted
• Advantages: no a priori knowledge required
• Disadvantages:
• Not suitable for rare classes
• Sample implementations:
• k-Nearest Neighbor
• Mahalanobis Distance for skewed distributions
• Local Outlier Factor (LOF) for clusters of variable density (the average distance between points differs from cluster to cluster)
• Specialized Clustering, Canopy, FindOut
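The k-nearest-neighbor variant is the simplest of these to sketch: score each point by the distance to its k-th nearest neighbor, then sort. The brute-force helper `knn_scores`, the choice k=3, and the synthetic points are assumptions made for illustration.

```python
import math
import random

def knn_scores(points, k=3):
    """Outlier score = distance to the k-th nearest neighbor (brute force, O(n^2))."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores

rng = random.Random(1)
cluster = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
points = cluster + [(8.0, 8.0)]              # one planted far-away point
scores = knn_scores(points)

# Sort points from most to least isolated.
ranked = sorted(range(len(points)), key=scores.__getitem__, reverse=True)
print("most isolated point:", points[ranked[0]])
```

A single global distance threshold works poorly when clusters have different densities, which is the case LOF is meant to handle.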
46.
Unsupervised: Model
• Predict normal behavior via model
• Capture deviations
• Detection Approaches
• Neural Networks, 4 layers, input = output (the network is trained to reproduce its input)
• Unsupervised support vector machines (SVM)
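The predict-then-compare pattern can be illustrated with a much simpler model than the ones listed. The sketch below swaps in an ordinary least-squares line as the "model of normal behavior" (not a neural network or SVM) and measures how far new observations fall from its prediction. The data and the `deviation` helper are made up.

```python
import random
import statistics

rng = random.Random(2)
# Made-up "normal" behavior: y is roughly 2x + 1 plus small noise.
xs = [i / 10 for i in range(100)]
ys = [2 * x + 1 + rng.gauss(0, 0.1) for x in xs]

# Fit the model of normal behavior (ordinary least squares, closed form).
mx, my = statistics.fmean(xs), statistics.fmean(ys)
slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
intercept = my - slope * mx
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
residual_sd = statistics.pstdev(residuals)

def deviation(x, y):
    """How many residual standard deviations y lies from the model's prediction."""
    return abs(y - (slope * x + intercept)) / residual_sd

new_points = [(5.0, 11.05), (5.0, 14.0)]     # the second one breaks the pattern
for x, y in new_points:
    print((x, y), "deviation:", round(deviation(x, y), 1))
```

The same structure applies to richer models: train on normal behavior only, then treat a large reconstruction or prediction error as the deviation signal.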
47.
Supervised Techniques
• Classification methods typically not suitable:
• Problem: Lack of labels
• Possible Solution: Balance the class size
• Duplicate rare events or down-size normal events
• Generate anomalies inversely proportional to data density
• Synthetically generate over-sampled minority events (e.g. SMOTE)
• Classify regions as ‘positive’ without having enough data in them
• Shrink: look for presence of positive labels, not majority
• PN-rule: Find regions of high recall (Pd), then prune false positives, then classify (avoid over-fitting)
• Decision Tree methods: Ripple Down Rules, CREDOS, Boosting Classifiers, Random Forest
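The core step of SMOTE-style over-sampling is interpolation between a minority sample and one of its nearest minority neighbors. The sketch below is a simplified stand-in written from that description, not the reference SMOTE implementation; `smote_like`, k=3, and the four points are illustrative.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        t = rng.random()                     # position along the segment
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic

rare_events = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]   # made-up points
new_points = smote_like(rare_events, n_new=6)
print(new_points)
```

Every synthetic point lies on a segment between two observed rare events, so the method can only fill in regions already seen. It cannot invent an event type that has never been observed.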
48.
Cost Functions
• In any classification problem, one needs to
minimize the cost function
• Selecting an appropriate cost function is key
• Weighting is also important, not all data points are
created equal.
• Bayesian Thinking is necessary
• Temporal (time-series) Analysis requires different
approaches.
• Is current data ‘surprising’ based on historical data
created with the same underlying process?
• Opportunity for Insight, error, or rare event capture.
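The effect of the cost function on the decision can be shown with a threshold search. In this sketch a missed rare event is assumed to cost 100 times as much as a false alarm; the costs, scores, and labels are all invented for illustration.

```python
def total_cost(scores, labels, threshold, cost_fn=100.0, cost_fp=1.0):
    """Cost of raising an alarm whenever score >= threshold."""
    cost = 0.0
    for s, y in zip(scores, labels):
        alarm = s >= threshold
        if y == 1 and not alarm:
            cost += cost_fn      # missed rare event
        elif y == 0 and alarm:
            cost += cost_fp      # false alarm
    return cost

# Made-up classifier scores; label 1 marks the rare event.
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]
labels = [0,    0,    0,    0,    1,    0,    0,    1]

# Try each observed score as a threshold and keep the cheapest.
best = min(sorted(set(scores)), key=lambda t: total_cost(scores, labels, t))
print("cheapest threshold:", best, "cost:", total_cost(scores, labels, best))
print("cost at 0.5:", total_cost(scores, labels, 0.5))
```

With these weights the cheapest threshold is 0.35 at a cost of 2, while the conventional 0.5 cut-off costs 101 because it misses one rare event.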
49.
The Weak Link
No matter how good your automated system is, the final decision to act or not is often made by a human!
Present only relevant data to make the right decision
Actionable information
Present data in human readable format.
Visualization! Be creative, different.
50.
Detecting Extremely Rare Events
• Data Collection: Capture high SNR, representative data
• Pre-process: Clean the data of known noise and artifacts
• Feature Extraction: Reduce data to meaningful features with no loss of desired signal
• Classifier: Divide data into training/testing and use appropriate classifiers. Always use feature confidences
• Identify Outliers: Data that is not ‘normal’. Determine if physically appropriate or measurement error. Delete errors.
• Explain Data: Any insights? What does it mean? Sub-category classification
• Present Data: Visualization, UI, UX
51.
Conclusion
• Experiment, test. Iterate.
• Do your homework, learn the physical origin of your data.
• Pre-process data.
• Develop your own method. All methods have weaknesses and strengths; learn to combine them.
• Know your customer. Stay focused on the ‘Question’.
• Simplify. It needs to run in the real world.
• Avoid bias: biased samples give biased outcomes.
• Never use the test and validation data in the training phase, not even for scaling
purposes.
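The last point is easy to get wrong in practice. A minimal sketch of leak-free scaling, assuming a single made-up numeric feature and an 80/20 split: the mean and standard deviation are computed on the training split only and then reused unchanged on the test split.

```python
import random
import statistics

rng = random.Random(3)
values = [rng.gauss(50, 10) for _ in range(100)]   # made-up feature values
rng.shuffle(values)
train, test = values[:80], values[80:]

# Fit the scaler on the training split ONLY.
mean = statistics.fmean(train)
sd = statistics.pstdev(train)

def scale(xs):
    """Standardize with the training mean and standard deviation."""
    return [(x - mean) / sd for x in xs]

train_scaled = scale(train)
test_scaled = scale(test)    # reuses training statistics; test data never influences them

print(f"train mean after scaling: {statistics.fmean(train_scaled):.3f}")
print(f"test mean after scaling:  {statistics.fmean(test_scaled):.3f}")
```

The test mean after scaling will generally not be exactly zero, and that is the intended behavior: it reflects how the model will see genuinely new data.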