Jumping to Conclusions

Richard Bijjani
JUMPING TO CONCLUSIONS
(Generating Improbable Insights)
Richard Robehr Bijjani, Ph.D.
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci

• IMAGE: Chart of controlled vs uncontrolled?
• How best to change behaviors
1/3 of all deaths globally are
from cardiovascular disease
SOURCE: WORLD HEALTH ORGANIZATION

SOURCE: Mayo Clinic
#1 risk factor is high blood pressure

SOURCE: WORLD HEALTH ORGANIZATION
1,000,000,000

CONTINUOUS PASSIVE
CLINICALLY MEANINGFUL
BEHAVIORAL INSIGHTS

CONTEXTUALIZED CARDIOVASCULAR HEALTH

Quanttus is always on!
We capture > 50 million data
points and > 400,000 vital sign
measurements / person / day.

Data Science @ Quanttus
Data Science ≡ Extraction of Actionable Knowledge
from Data
Actionable Knowledge
→ Better Decisions
→ Meaningful Insights
Knowledge is actionable iff it has predictive power
(not just an ability to explain the past)

JUMPING TO CONCLUSIONS
Illusion of Knowledge
fatal
The greatest enemy of knowledge is not ignorance,
it is the illusion of knowledge.
-Stephen Hawking

Illusion of Knowledge
Courtesy of National Geographic

First a Joke!
A police officer approaches a man intently
searching the ground under a lamppost
• Policeman: What are you doing?
• Man: Looking for my car keys
The officer helps for a few minutes without
success
• Policeman: Are you certain you
dropped your keys near here?
• Man: No! I remember dropping them
across the street.
• Policeman (very irritated): Why are you
looking for them here, then?
• Man: The light is much better here!

Why Scientific Studies are so often Wrong
• Researchers tend to look for answers where the
looking is good, rather than where the answers are
likely to be hiding. David Freedman
• 15/45 of the most prominent studies published in the top medical
journals were ultimately refuted.
• 2/3 of all medical studies are wrong.
• 9/10 of leading-edge studies (like those linking a disease to a
specific gene) are wrong.
John Ioannidis, University of Ioannina

10% to 20% of cases:
delayed, missed, and incorrect diagnosis
Garber et al., JAMA, 2005

40,000+ patients in US ICUs
may die with a misdiagnosis annually
Winters et al., BMJ Quality & Safety, 2012

50% of MDs are below-average
Vinod Khosla

Are you Immune to the Streetlight Effect?
David Freedman
• Think of the data you are working with: is it the
ideal data, or just the conveniently available data?
• When was the last time you worked with ideal
data?
• Have you ever?

Expert Consensus
[Figure: seventeen experts’ estimates of the effect of screening on colon cancer deaths. Axis: proportion of colon cancer deaths prevented, 0% to 100%.]

Should you trust your Dr.?
• Depends.
• If your ailment is common, your Dr. will do a decent job.
• If you’re suffering from a relatively uncommon disease,
not so well.
“If you don’t find it often, you often don’t find it”,
Jeremy Wolfe

Weak Link, Humans!
1. Signals with low predictive value are not very useful
1 in 1000 does not hold one's attention for long
2. Attention directed at one thing is attention drawn away from something else
Lost research/testing/treatment opportunities

Data Scientists’ Tools of Choice
• Some scientists use only techniques they feel comfortable
with
• Others latch on to new ones without fully understanding
them.
• Some just rely on available methods built into their software.

The Nonsense Asymmetry Principle
The amount of energy needed to
refute ‘Nonsense’ is an order of
magnitude bigger than to produce it.
-Alberto Brandolini

Data Scientific Method
• Validation can ONLY occur by measuring the predictive power of the insights, in addition to their ability to explain the past
• Data Science is Science and hence follows the Scientific Method
Flowchart:
Ask Relevant Questions
Research. Gather Data. Validate Data.
Construct Hypothesis
Design Experiment. Test Hypothesis.
Analyze Results
Hypothesis is True / Hypothesis not Valid
Report Results

Ask the Important Question
• Deadly Virus
• Infects 1 in 1 million
• Diagnostic Test developed with 99.9% Sensitivity and
Specificity
• Treatment developed
• 99% Curative
• 1% Deadly side effect
• Question: Would you recommend Diagnosis?
• Would you recommend Treatment?

Efficacy of Treatment
• Do Nothing:
• 300 People Infected
• 300 People will Die
300M test subjects   Predicted Negative   Predicted Positive
Normal               299.7M               300,000
Infected             <1                   >299
• Diagnose and Treat:
• Infected Population:
• 296 Cured
• 3 + 1 Die
• Non-Infected Population
• 297,000 Unaffected (except for the scare)
• 3,000 Die
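The arithmetic behind this slide can be reproduced in a few lines. The sketch below assumes, as the slide's figures imply, a population of 300 million, that every positive test result is treated, and that untreated infections are fatal. The variable names are illustrative only.

```python
# Screening arithmetic for the hypothetical virus on the slide.
population = 300_000_000
prevalence = 1 / 1_000_000     # infects 1 in 1 million
sensitivity = 0.999            # P(test positive | infected)
specificity = 0.999            # P(test negative | not infected)
treatment_death_rate = 0.01    # deadly side effect in 1% of treated people

infected = population * prevalence
healthy = population - infected

true_pos = infected * sensitivity        # infected and flagged
false_neg = infected - true_pos          # infected but missed
false_pos = healthy * (1 - specificity)  # healthy but flagged
true_neg = healthy - false_pos

# Probability that a person who tests positive is actually infected.
ppv = true_pos / (true_pos + false_pos)

# Option 1: do nothing. Every infected person dies.
deaths_do_nothing = infected

# Option 2: diagnose and treat everyone who tests positive.
deaths_treated_infected = true_pos * treatment_death_rate
deaths_missed_infected = false_neg
deaths_treated_healthy = false_pos * treatment_death_rate
deaths_treat = (deaths_treated_infected
                + deaths_missed_infected
                + deaths_treated_healthy)

print(f"positive predictive value: {ppv:.4%}")
print(f"deaths if we do nothing:   {deaths_do_nothing:.0f}")
print(f"deaths if we test + treat: {deaths_treat:.0f}")
```

Even with a 99.9% accurate test, only about one positive result in a thousand is a true infection, so treating every positive costs roughly ten times as many lives as doing nothing.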

Good Practices
• Understand where the data comes from.
• Pre-process / clean your data, but keep validated outliers.
• Own the tools and adapt them to your own requirements.
• Follow the scientific method.
• Analyze data to answer the question posed.
• Save a list of other interesting questions for later.
• Share your hypotheses with the team.
• Simple is better, at least make sure it’s deployable.
• Test, Validate, re-test.
• Communicate results correctly and set the right expectations.
"If you torture the data long enough, it will
confess to anything." - Hal Varian

Pitfalls of data mining
• The hope: data miners pore over large, diffuse sets
of raw data trying to discern patterns that would
otherwise go undetected.
• The dark side of data mining is to pick and choose
from a large set of data to try to explain a small one.
• “Given enough time, enough attempts and enough
imagination, almost any conclusion can be teased
out of any set of data”
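A small simulation makes the "dark side" concrete. The sketch below is an illustration, not something from the talk: it generates a random target and 1,000 pure-noise features, then reports the strongest correlation found. The sample size, feature count, and function names are all assumptions chosen for the example.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs) ** 0.5
    spread_y = sum((y - mean_y) ** 2 for y in ys) ** 0.5
    return cov / (spread_x * spread_y)


def best_spurious_correlation(n_samples=20, n_features=1000, seed=0):
    """Search many noise features for the one that best 'explains' a noise target."""
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        feature = [rng.gauss(0, 1) for _ in range(n_samples)]
        best = max(best, abs(pearson(feature, target)))
    return best


strongest = best_spurious_correlation()
print(f"strongest |r| among 1000 noise features: {strongest:.2f}")
```

Nothing here is related to anything else by construction, yet searching enough candidates on a small sample turns up a correlation that would look convincing in isolation. Only a test on fresh data would expose it.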

Limitations of Common Data
Mining Techniques
• Automated feature selection methods cannot
apply to rare (or unforeseen) events
• Normal events are similar, rare events are by
definition unique
• Accuracy measurements are not appropriate
• Real time detection of rare events is necessary,
but machine learning techniques construct
models based on the past
• If you haven’t yet seen it, you cannot detect it!
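The point that accuracy is not an appropriate measure for rare events can be shown with a trivial "classifier" that always predicts normal. This is a minimal sketch on made-up labels (1 rare event in 1,000 samples); the helper name is hypothetical.

```python
def confusion(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 marks the rare event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


# One rare event among 1,000 samples.
y_true = [1] + [0] * 999
# A model that never raises an alarm.
y_pred = [0] * 1000

tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0

print(f"accuracy: {accuracy:.1%}, recall on rare class: {recall:.1%}")
```

The model scores 99.9% accuracy while detecting none of the events it exists to find, which is why recall, precision, or a cost-weighted measure is a better fit for this kind of problem.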

What are rare/high-value events?
• Rare or Outliers
• Occur less than 1% of the time
• For large datasets, many samples exist. Balance could be achieved and traditional Data Mining Techniques could be applied
• Preferential sampling of rare class
• Under-sampling of majority class
• Extremely Rare or Anomalies
• Statistical chance of detection is zero
• Most databases don’t ‘naturally’ contain any samples
• Properties of target samples are not known
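For the merely rare case, under-sampling the majority class is simple to sketch. The function below is a hypothetical helper, not a library API: it keeps every rare sample and draws a random subset of the common ones. The toy dataset (0.5% rare) and the 1:1 ratio are assumptions.

```python
import random


def undersample_majority(rows, label_of, rare_label, ratio=1.0, seed=0):
    """Keep all rare rows plus a random sample of the others.

    ratio is the number of majority rows kept per rare row.
    """
    rng = random.Random(seed)
    rare = [r for r in rows if label_of(r) == rare_label]
    common = [r for r in rows if label_of(r) != rare_label]
    keep = min(len(common), int(len(rare) * ratio))
    balanced = rare + rng.sample(common, keep)
    rng.shuffle(balanced)
    return balanced


# Toy data: (id, label) pairs, with label 1 on every 200th row (0.5% rare).
rows = [(i, 1 if i % 200 == 0 else 0) for i in range(10_000)]
balanced = undersample_majority(rows, label_of=lambda r: r[1], rare_label=1)

print(len(rows), "rows reduced to", len(balanced))
```

This only works when the dataset actually contains rare samples. For the extremely rare case described on the slide there is nothing to keep, so no amount of re-balancing helps.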

What are the costs of such
events?
Cost Functions not easily Defined

Anomalies vs. High Value Rare
Events
By definition, anomalies are the
exception, but not necessarily
rare and/or of high value.
• Anomaly? Yes
• Rare Event? No

Extremely Rare, High
Value Events
Case Study: Terrorism, specifically explosive detection

Finding Commercial
Explosives
Data Could be Collected
and/or simulated.
Allowing for rare class
augmentation

Finding Explosives
Data cannot be
Collected and/or
simulated.

Why incompatible?
• No Quality
Control
Suicide Bomb Trainer in Iraq
Accidentally Blows Up His
Class
Terrorist ‘lab’
(redacted)

The Quanttus Vision

Takeaway
• We are drowning in data, yet
starving for knowledge
• In case of rare events, data may not be enough; the source of the data needs to be well understood
• To detect rare events: Sometimes
it’s just more effective to generate
heuristics
• Heuristics cannot predict, while
machine learning assumes the
future will resemble the past, and
extremely rare events are not part
of the past
• What to do?

Outliers revisited
1. Retain outliers in data set for analysis.
2. Exclude only those that are known to be
due to defective measurements or
transcription errors
   a. Need to understand the data origin to accurately separate rare events from measurement errors
3. Do not assume normal distribution
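One way to act on these three rules is sketched below. It flags candidate outliers with a median-based score rather than a mean and standard deviation, then drops only the readings already known to be measurement errors. The modified z-score and its 3.5 cutoff are a common rule of thumb swapped in for illustration, not something the talk prescribes. The readings and the `known_errors` set are invented.

```python
import statistics


def flag_outliers(values, threshold=3.5):
    """Return indices whose median-based (MAD) score exceeds the threshold."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    if mad == 0:
        return []  # no spread to measure against
    return [i for i, v in enumerate(values)
            if abs(0.6745 * (v - med) / mad) > threshold]


# Hypothetical heart-rate readings; index 7 is a known sensor dropout.
readings = [72, 75, 71, 74, 73, 190, 70, 0, 76]
known_errors = {7}

flagged = flag_outliers(readings)

# Exclude only the readings known to be defective; keep every other outlier.
cleaned = [v for i, v in enumerate(readings) if i not in known_errors]

print("flagged indices:", flagged)
print("retained for analysis:", cleaned)
```

The reading of 190 is flagged but retained, because nothing says it is an error and it may be exactly the rare event of interest. Only the dropout is removed.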

Data Mining Techniques
• Supervised
• Pro: Human readable models
• Con: Requires labeled data
• Unsupervised
• Pro: Deviation detection, no labeling needed
• Con: Requires similarity measures, high false alarm rate (due to benign yet previously unseen data)

Unsupervised Techniques
• Outlier datum defined as different from the rest of the
data
• Rare event: Same definition
• Detection Approaches
• Statistics based
• Distance Based
• Model Based

Unsupervised: Statistics
• Data modeled using stochastic distribution
• Advantages: no a priori knowledge required
• Disadvantages:
• Fails with high dimensions (curse of dimensionality)
• Does not identify patterns of rare events
• Sample implementations:
• Finite Mixtures Schemes, e.g. SmartSifter. Use histogram density to
represent probability distribution
• Blocked Adaptive Computationally Efficient Outlier Nominator, BACON
• Probability Distributions
• Entropy Measures

Unsupervised: Distance
• Distance computed between neighbors, and data points sorted
• Advantages: no a priori knowledge required
• Disadvantages:
• Not suitable for rare classes
• Sample implementations:
• k-Nearest Neighbor
• Mahalanobis Distance for skewed distributions
• Local Outlier Factor (LOF) for variable density clusters (the average distance
between points differs from cluster to cluster)
• Specialized Clustering, Canopy, FindOut
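The simplest distance-based detector ranks each point by the distance to its k-th nearest neighbor. The sketch below is a brute-force version for small datasets. The grid of points and the choice of k = 3 are assumptions made for the example.

```python
import math


def knn_outlier_scores(points, k=3):
    """Score each point by the distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# A dense 5x5 grid of "normal" points plus one distant point.
points = [(float(x), float(y)) for x in range(5) for y in range(5)]
points.append((20.0, 20.0))

scores = knn_outlier_scores(points, k=3)
ranked = sorted(range(len(points)), key=lambda i: scores[i], reverse=True)

print("most isolated point:", points[ranked[0]])
```

Brute force costs O(n²) distance computations. A single global distance scale also misjudges clusters of different density, which is the problem LOF is designed to address.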

Unsupervised: Model
• Predict normal behavior via model
• Capture deviations
• Detection Approaches
• Neural Networks, 4 layers, input = output
• Unsupervised support vector machines SVM
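The model-based idea (learn normal behavior, then capture deviations) does not depend on the specific model. The sketch below substitutes a least-squares line for the neural network or SVM named on the slide, purely to keep the example short. The data and the tolerance are invented.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def deviations(xs, ys, slope, intercept, tolerance):
    """Indices where the observation differs from the prediction by more than tolerance."""
    return [i for i, (x, y) in enumerate(zip(xs, ys))
            if abs(y - (slope * x + intercept)) > tolerance]


# Learn "normal" behavior from history.
history_x = list(range(10))
history_y = [2 * x + 1 for x in history_x]
slope, intercept = fit_line(history_x, history_y)

# Score new observations against the learned model.
new_x = [10, 11, 12]
new_y = [21, 23, 40]
alerts = deviations(new_x, new_y, slope, intercept, tolerance=5)

print("deviating observations:", [(new_x[i], new_y[i]) for i in alerts])
```

An input = output neural network works the same way: the reconstruction error plays the role of the residual here.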

Supervised Techniques
• Classification methods typically not suitable:
• Problem: Lack of labels
• Possible Solution: Balance the class size
• Duplicate rare events or down-size normal
events
• Generate anomalies inversely proportional
with data density
• Synthetically generate over-sampled minority events (e.g. SMOTE)
• Classify regions as ‘positive’ without having
enough data in them
• Shrink: look for presence of positive labels,
not majority
• PN-rule: Find regions of high recall (Pd),
then prune false positives, then classify
(avoid over-fitting)
• Decision Tree methods: Ripple Down Rules,
CREDOS, Boosting Classifiers, Random
Forest
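The idea behind SMOTE-style over-sampling is to create new minority samples by interpolating between an existing minority sample and one of its nearest minority neighbors. The sketch below is a simplified illustration of that idea, not the reference algorithm or any library's implementation. The function name, k, and the sample points are assumptions.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points on segments between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest other minority points to the chosen base point.
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + t * (o - b) for b, o in zip(base, other)))
    return synthetic


minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.3), (1.1, 1.1)]
new_points = smote_like(minority, n_new=6)

for point in new_points:
    print(tuple(round(c, 3) for c in point))
```

Every synthetic point lies between two real minority samples, so the method can only fill in a region that is already observed. It cannot invent the properties of events that have never been seen.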

Cost Functions
• In any classification problem, one needs to
minimize the cost function
• Selecting an appropriate cost function is key
• Weighting is also important, not all data points are
created equal.
• Bayesian Thinking is necessary
• Temporal (time-series) Analysis requires different
approaches.
• Is current data ‘surprising’ based on historical data
created with the same underlying process?
• Opportunity for Insight, error, or rare event capture.

The Weak Link
No matter how good your automated system is, the final decision to act or not is often made by a human!
Present only the relevant data needed to make the right decision
Actionable information
Present data in a human readable format.
Visualization! Be creative, different.

Detecting Extremely Rare Events
Data Collection
• Capture high SNR representative data
Pre-process
• Clean the data from known noise and artifacts
Feature Extraction
• Reduce data to meaningful features with no loss of desired signal
Classifier
• Divide data into training/testing and use appropriate classifiers
• Always use feature confidences
Identify Outliers
• Data that is not ‘normal’
• Determine if physically appropriate or measurement error. Delete errors.
Explain Data
• Any insights? What does it mean, sub-category classification
Present Data
• Visualization, UI, UX
Conclusion
• Experiment, test. Iterate.
• Do your homework, learn the physical origin of your data.
• Pre-process data.
• Develop your own method. All methods have weaknesses and strengths; learn to combine them.
• Know your customer. Stay focused on the ‘Question’.
• Simplify. It needs to run in the real world.
• Avoid bias: Biased Samples → Biased Outcome
• Never use the test and validation data in the training phase, not even for scaling
purposes.
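The last point is easy to violate by accident, because scaling looks harmless. The sketch below shows the safe pattern, with hypothetical helper names and made-up numbers: the scaling parameters are computed from the training split only and then reused unchanged on the test split.

```python
import statistics


def fit_scaler(train_values):
    """Learn the mean and standard deviation from the training data only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)  # avoid division by zero


def apply_scaler(values, mean, std):
    """Standardize values with previously learned parameters."""
    return [(v - mean) / std for v in values]


train = [10.0, 12.0, 11.0, 13.0, 9.0]
test = [30.0, 11.0]  # includes a value far outside the training range

mean, std = fit_scaler(train)            # the test data is never seen here
scaled_train = apply_scaler(train, mean, std)
scaled_test = apply_scaler(test, mean, std)

print("training mean and std:", round(mean, 3), round(std, 3))
print("scaled test values:   ", [round(v, 2) for v in scaled_test])
```

Had the scaler been fitted on train and test together, the extreme test value would have shifted the mean and inflated the standard deviation, leaking information about the test set into training and making the unusual reading look less unusual.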

Thank you.
www.Quanttus.com
@Quanttus
www.facebook.com/Quanttus
