Richard Bijjani
JUMPING TO CONCLUSIONS
(Generating Improbable Insights)
Richard Robehr Bijjani, Ph.D.
OPEN DATA SCIENCE CONFERENCE
BOSTON 2015
@opendatasci
• IMAGE: Chart of controlled vs uncontrolled?
• How best to change behaviors
1/3 of all deaths globally are
from cardiovascular disease
SOURCE: WORLD HEALTH ORGANIZATION
SOURCE: Mayo Clinic
#1 risk factor is high blood pressure
SOURCE: WORLD HEALTH ORGANIZATION
1,000,000,000
CONTINUOUS PASSIVE
CLINICALLY MEANINGFUL
BEHAVIORAL INSIGHTS
CONTEXTUALIZED CARDIOVASCULAR HEALTH
CONTEXTUALIZED HUMAN HEALTH
Quanttus is always on!
We capture > 50 million data
points and > 400,000 vital sign
measurements / person / day.
Data Science @ Quanttus
Data Science ≡ Extraction of Actionable Knowledge
from Data
Actionable Knowledge
→ Better Decisions
→ Meaningful Insights
Knowledge is actionable iff it has predictive power
(not just an ability to explain the past)
JUMPING TO CONCLUSIONS
Illusion of Knowledge
fatal
The greatest enemy of knowledge is not ignorance,
it is the illusion of knowledge.
-Stephen Hawking
Illusion of Knowledge
Courtesy of National Geographic
First a Joke!
A police officer approaches a man intently
searching the ground under a lamppost
• Policeman: What are you doing?
• Man: Looking for my car keys
The officer helps for a few minutes without
success
• Policeman: Are you certain you
dropped your keys near here?
• Man: No! I remember dropping them
across the street.
• Policeman (very irritated): Why are
you looking for them here, then?
• Man: The light is much better here!
Why Scientific Studies are so often Wrong
• Researchers tend to look for answers where the
looking is good, rather than where the answers are
likely to be hiding. David Freedman
• 15/45 of the most prominent studies published in the top medical
journals were ultimately refuted.
• 2/3 of all medical studies are wrong.
• 9/10 of leading-edge studies (like those linking a disease to a
specific gene) are wrong.
John Ioannidis, University of Ioannina
10% to 20% of cases:
delayed, missed, and incorrect diagnosis
Garber et al., JAMA, 2005
40,000+ patients in US ICUs
may die with a misdiagnosis annually
Winters et al., BMJ Quality & Safety, 2012
50% of MDs are below-average
Vinod Khosla
Are you Immune to the Streetlight Effect?
• Think of the data you are working with: is it the
ideal data, or just the conveniently available data?
• When was the last time you worked with ideal
data?
• Have you ever?
Expert Consensus
[Figure: seventeen experts’ estimates of the effect of screening on colon cancer deaths. Axis: proportion of colon cancer deaths prevented, 0% to 100%.]
Should you trust your Dr.?
• Depends.
• If your ailment is common, your Dr. will do a decent job.
• If you’re suffering from a relatively uncommon disease,
not so well.
“If you don’t find it often, you often don’t find it”,
Jeremy Wolfe
Weak Link, Humans!
1. Signals with low predictive value
are not very useful
1 in 1000 does not hold one's attention for
long
2. Attention directed at one thing is
attention drawn away from
something else
Lost research/testing/treatment opportunities
Data Scientists’ Tools of Choice
• Some scientists use only techniques they feel comfortable
with
• Others latch on to new ones without fully understanding
them.
• Some just rely on available methods built into their software.
The Nonsense Asymmetry Principle
The amount of energy needed to
refute ‘Nonsense’ is an order of
magnitude bigger than to produce it.
-Alberto Brandolini
Data Scientific Method
• Validation can ONLY occur
by measuring the
predictive power of the
insights, in addition to their
ability to explain the past
• Data Science is Science
and hence follows the
Scientific Method
[Flowchart: Ask Relevant Questions; Research, Gather Data; Validate Data; Construct Hypothesis; Design Experiment, Test Hypothesis; Analyze Results; Hypothesis is True / Hypothesis not Valid; Report Results]
Ask the Important Question
• Deadly Virus
• Infects 1 in 1 million
• Diagnostic Test developed with 99.9% Sensitivity and
Specificity
• Treatment developed
• 99% Curative
• 1% Deadly side effect
• Question: Would you recommend Diagnosis?
• Would you recommend Treatment?
Efficacy of Treatment
• Do Nothing:
• 300 People Infected
• 300 People will Die
300M test subjects   Predicted Negative   Predicted Positive
Normal               299.7M               300,000
Infected             <1                   >299
• Diagnose and Treat:
• Infected Population:
• 296 Cured
• 3 + 1 Die
• Non-Infected Population
• 297,000 Unaffected (except for the scare)
• 3,000 Die
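The arithmetic behind the table and the two scenarios can be reproduced in a few lines. This is a minimal sketch, assuming a screened population of 300 million (the "300M test subjects" in the table), that every positive result is treated, that an untreated infection is always fatal, and that the 1% of treated people who are not cured are the ones who die from the side effect. The function name and its parameters are illustrative, not part of the original slides.

```python
def screen_and_treat(population=300_000_000, prevalence=1e-6,
                     sensitivity=0.999, specificity=0.999,
                     side_effect_death_rate=0.01):
    """Expected outcomes of testing everyone and treating every positive."""
    infected = population * prevalence            # about 300 people
    healthy = population - infected

    true_pos = infected * sensitivity             # infected and flagged
    false_neg = infected - true_pos               # infected but missed
    false_pos = healthy * (1 - specificity)       # healthy but flagged

    # Probability that a positive result is a real infection.
    ppv = true_pos / (true_pos + false_pos)

    # Assumption: untreated infections are always fatal.
    deaths_do_nothing = infected
    deaths_infected = true_pos * side_effect_death_rate + false_neg
    deaths_healthy = false_pos * side_effect_death_rate

    return {
        "ppv": ppv,
        "cured": true_pos * (1 - side_effect_death_rate),
        "deaths_do_nothing": deaths_do_nothing,
        "deaths_infected": deaths_infected,
        "deaths_healthy": deaths_healthy,
        "deaths_with_treatment": deaths_infected + deaths_healthy,
    }


result = screen_and_treat()
for name, value in result.items():
    print(f"{name}: {value:,.4f}")
```

Under these assumptions a positive result is correct only about 0.1% of the time, and treating every positive costs roughly ten times as many lives as doing nothing, even though the test and the treatment both look excellent in isolation.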
Good Practices
• Understand where the data comes from.
• Pre-process / Clean your data, but keep validated outliers.
• Own the tools and adapt them to your own requirements.
• Follow the Scientific Method.
• Analyze data to answer the question posed.
• Save a list of other interesting questions for later.
• Share your hypotheses with the team.
• Simple is better; at least make sure it’s deployable.
• Test, Validate, re-test.
• Communicate results correctly and set the right expectations.
"If you torture the data long enough, it will
confess to anything." - Hal Varian
Pitfalls of data mining
• The hope: data miners pore over large, diffuse sets
of raw data trying to discern patterns that would
otherwise go undetected.
• The dark side of data mining is to pick and choose
from a large set of data to try to explain a small one.
• “Given enough time, enough attempts and enough
imagination, almost any conclusion can be teased
out of any set of data.”
Limitations of Common Data
Mining Techniques
• Automated feature selection methods cannot
apply to rare (or unforeseen) events
• Normal events are similar, rare events are by
definition unique
• Accuracy measurements are not appropriate
• Real time detection of rare events is necessary,
but machine learning techniques construct
models based on the past
• If you haven’t seen it yet, you cannot detect it!
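The point about accuracy can be made concrete with a toy example. This is a minimal sketch, assuming a hypothetical dataset with one rare event in 1,000 samples and a "classifier" that never raises an alarm; the helper names are illustrative.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positive samples that were detected."""
    actual_pos = sum(1 for t in y_true if t == positive)
    found = sum(1 for t, p in zip(y_true, y_pred)
                if t == positive and p == positive)
    return found / actual_pos if actual_pos else 0.0


# Hypothetical data: 1 rare event among 1,000 samples.
y_true = [1] + [0] * 999
# A model that always predicts "normal".
y_pred = [0] * 1000

acc = accuracy(y_true, y_pred)
rec = recall(y_true, y_pred)
print(f"accuracy = {acc:.3f}, recall on rare class = {rec:.3f}")
```

The do-nothing model scores 99.9% accuracy while detecting none of the rare events, which is why recall, precision, or a cost-weighted measure is usually a better fit for rare-event problems.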
What are rare/high-value events?
• Rare or Outliers
• Occur less than 1% of the time
• For large datasets,
many samples exist.
Balance could be
achieved and
traditional Data
Mining Techniques
could be applied
• Preferential
sampling of rare class
• Under-Sampling of
majority class
• Extremely Rare or
Anomalies
• Statistical chance of
detection is zero
• Most databases don’t
‘naturally’ contain any
samples
• Properties of target
samples are not known
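For the merely rare case, under-sampling the majority class is one simple way to reach the balance mentioned above. This is a minimal sketch, assuming binary labels where 1 marks the rare class; the function name, the `ratio` parameter, and the toy data are illustrative.

```python
import random


def undersample_majority(samples, labels, rare_label=1, ratio=1.0, seed=0):
    """Keep every rare sample and a random subset of the majority class.

    ratio is the number of majority samples kept per rare sample.
    """
    rng = random.Random(seed)
    rare = [(x, y) for x, y in zip(samples, labels) if y == rare_label]
    common = [(x, y) for x, y in zip(samples, labels) if y != rare_label]

    keep = min(len(common), int(round(len(rare) * ratio)))
    balanced = rare + rng.sample(common, keep)
    rng.shuffle(balanced)

    xs = [x for x, _ in balanced]
    ys = [y for _, y in balanced]
    return xs, ys


# Hypothetical data: 10 rare samples among 1,000 (1%).
samples = list(range(1000))
labels = [1 if i % 100 == 0 else 0 for i in samples]

xs, ys = undersample_majority(samples, labels)
print(len(xs), sum(ys))
```

This only helps when the rare class has enough samples to begin with; it does nothing for the extremely rare case, where the database contains no examples at all.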
What are the costs of such
events?
Cost Functions not easily Defined
Anomalies vs. High Value Rare
Events
By definition, anomalies are the
exception, but not necessarily
rare and/or of high value.
• Anomaly? Yes
• Rare Event? No
Extremely Rare, High
Value Events
Case Study: Terrorism, specifically explosive detection
Finding Commercial
Explosives
Data could be collected
and/or simulated,
allowing for rare-class
augmentation.
Finding Explosives
Data cannot be
Collected and/or
simulated.
Why incompatible?
• No Quality
Control
Suicide Bomb Trainer in Iraq
Accidentally Blows Up His
Class
Terrorist ‘lab’
(redacted)
The Quanttus Vision
Takeaway
• We are drowning in data, yet
starving for knowledge
• In the case of rare events, data may not
be enough; the source of the data needs to
be well understood
• To detect rare events: Sometimes
it’s just more effective to generate
heuristics
• Heuristics cannot predict, while
machine learning assumes the
future will resemble the past, and
extremely rare events are not part
of the past
• What to do?
Outliers revisited
1. Retain outliers in data set for analysis.
2. Exclude only those that are known to be
due to defective measurements or
transcription errors
   a. Need to understand data origin to
   accurately separate rare events from
   measurement errors
3. Do not assume normal distribution
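One way to see why the normality assumption matters: a single extreme value inflates the mean and standard deviation so much that a 3-sigma rule can miss it, while a median-based score does not. This is a minimal sketch on made-up readings, using the modified z-score with the commonly used 3.5 cutoff; the function names and data are illustrative, and flagged points still need to be checked against the data's origin before anything is excluded.

```python
import statistics


def sigma_outliers(values, cutoff=3.0):
    """Flag values more than `cutoff` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return []
    return [v for v in values if abs(v - mean) / std > cutoff]


def mad_outliers(values, cutoff=3.5):
    """Flag values by modified z-score, based on median and MAD."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []
    return [v for v in values if 0.6745 * abs(v - med) / mad > cutoff]


# Hypothetical readings with one extreme value.
readings = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 55.0]

sigma_flagged = sigma_outliers(readings)
mad_flagged = mad_outliers(readings)
print("3-sigma rule flags:", sigma_flagged)
print("median/MAD flags:  ", mad_flagged)
```

Flagging is only the first step: whether 55.0 is a sensor fault to delete or a rare event to keep is a question about where the data came from, not about the statistics.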
Data Mining Techniques
• Supervised
• Pro: Human readable
models
• Con: Requires labeled
data
• Unsupervised
• Pro: Deviation
detection, no labeling
needed
• Con: Requires similarity
measures, high false
alarm (due to benign
yet previously unseen
data)
Unsupervised Techniques
• Outlier datum defined as different from the rest of the
data
• Rare event: Same definition
• Detection Approaches
• Statistics based
• Distance Based
• Model Based
Unsupervised: Statistics
• Data modeled using stochastic distribution
• Advantages: no a priori knowledge required
• Disadvantages:
• Fails with high dimensions (curse of dimensionality)
• Does not identify patterns of rare events
• Sample implementations:
• Finite Mixtures Schemes, e.g. SmartSifter. Use histogram density to
represent probability distribution
• Blocked Adaptive Computationally Efficient Outlier Nominator, BACON
• Probability Distributions
• Entropy Measures
Unsupervised: Distance
• Distance computed between neighbors, and data points sorted
• Advantages: no a priori knowledge required
• Disadvantages:
• Not suitable for rare classes
• Sample implementations:
• k-Nearest Neighbor
• Mahalanobis Distance for skewed distributions
• Local Outlier Factor (LOF) for variable-density clusters (the average distance
between points is different in different clusters)
• Specialized Clustering, Canopy, FindOut
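The simplest of the distance-based approaches listed here, scoring each point by the distance to its k-th nearest neighbor, can be sketched as follows. This is a brute-force illustration on made-up 2-D points, assuming Euclidean distance and Python 3.8+ for `math.dist`; the function name and the choice of k are illustrative.

```python
import math


def knn_outlier_scores(points, k=2):
    """Score each point by the distance to its k-th nearest neighbor.

    Larger scores mean the point sits farther from the rest of the data.
    Brute force, O(n^2): fine for a sketch, too slow for large datasets.
    """
    if len(points) <= k:
        raise ValueError("need more than k points")
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q)
                       for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


# Hypothetical data: a tight cluster plus one distant point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]

scores = knn_outlier_scores(points, k=2)
ranked = sorted(zip(scores, points), reverse=True)
print(ranked[0])  # the highest-scoring (most isolated) point
```

Sorting by score gives a ranking rather than a yes/no label, so a threshold still has to be chosen, and a benign but previously unseen point will score just as high as a true rare event.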
Unsupervised: Model
• Predict normal behavior via model
• Capture deviations
• Detection Approaches
• Neural Networks, 4 layers, input = output
• Unsupervised support vector machines (SVM)
Supervised Techniques
• Classification methods typically not suitable:
• Problem: Lack of labels
• Possible Solution: Balance the class size
• Duplicate rare events or down-size normal
events
• Generate anomalies inversely proportional
with data density
• Synthetically generate over-sampled
minority events (e.g. SMOTE)
• Classify regions as ‘positive’ without having
enough data in them
• Shrink: look for presence of positive labels,
not majority
• PN-rule: Find regions of high recall (Pd),
then prune false positives, then classify
(avoid over-fitting)
• Decision Tree methods: Ripple Down Rules,
CREDOS, Boosting Classifiers, Random
Forest
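The idea behind synthetic over-sampling can be sketched in a few lines. This is a simplified, SMOTE-style illustration rather than the reference algorithm: each new point is placed at a random position on the segment between a minority sample and one of its nearest minority neighbors. The function name, parameters, and data are illustrative, and Python 3.8+ is assumed for `math.dist`.

```python
import math
import random


def smote_like(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority points by neighbor interpolation."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbors of the chosen base point.
        neighbors = sorted((p for p in minority if p is not base),
                           key=lambda p: math.dist(base, p))[:k]
        other = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b)
                               for b, o in zip(base, other)))
    return synthetic


# Hypothetical minority-class samples in 2-D feature space.
minority = [(1.0, 1.0), (1.2, 0.9), (0.9, 1.1), (1.1, 1.3)]

new_points = smote_like(minority, n_new=5)
print(new_points)
```

Because every synthetic point lies between existing minority samples, this can only fill in regions the rare class already occupies; it cannot invent the properties of events that have never been observed.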
Cost Functions
• In any classification problem, one needs to
minimize the cost function
• Selecting an appropriate cost function is key
• Weighting is also important, not all data points are
created equal.
• Bayesian Thinking is necessary
• Temporal (time-series) Analysis requires different
approaches.
• Is current data ‘surprising’ based on historical data
created with the same underlying process?
• Opportunity for Insight, error, or rare event capture.
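A small worked example shows how the choice of cost function changes the decision. This is a minimal sketch, assuming only two error types with fixed, hypothetical costs (a false alarm and a missed event) and an estimated probability that the event is real; the function names and numbers are illustrative.

```python
def alarm_threshold(cost_false_alarm, cost_miss):
    """Probability above which raising an alarm has lower expected cost."""
    return cost_false_alarm / (cost_false_alarm + cost_miss)


def should_alarm(p_event, cost_false_alarm, cost_miss):
    """Alarm when the expected cost of staying silent exceeds that of alarming."""
    expected_cost_silent = p_event * cost_miss
    expected_cost_alarm = (1 - p_event) * cost_false_alarm
    return expected_cost_silent > expected_cost_alarm


# Hypothetical costs: a miss is 999 times worse than a false alarm.
threshold = alarm_threshold(cost_false_alarm=1, cost_miss=999)
print(f"alarm when P(event) > {threshold:.4f}")

decide_high = should_alarm(0.01, cost_false_alarm=1, cost_miss=999)
decide_low = should_alarm(0.0005, cost_false_alarm=1, cost_miss=999)
print(decide_high, decide_low)
```

With these costs the break-even probability is 0.1%, far below the 50% cutoff that an unweighted classifier implies, so the hard part in practice is agreeing on the costs rather than computing the threshold.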
The weak Link
No matter how good your
automated system is, the final
decision to act or not is
often made by a human!
Present only relevant data
to make the right decision
Actionable information
Present data in human
readable format.
Visualization! Be creative,
different.
Detecting Extremely Rare Events
Data Collection
• Capture high SNR representative data
Pre-process
• Clean the data from known noise and artifacts
Feature Extraction
• Reduce data to meaningful feature with no loss of desired signal
Classifier
• Divide data into training/testing and use appropriate classifiers
• Always use feature confidences
Identify Outliers
• Data that is not ‘normal’
• Determine if physically appropriate or measurement error. Delete errors.
Explain Data
• Any Insights? What does it mean, sub-category classification
Present Data
• Visualization, UI, UX
Conclusion
• Experiment, test. Iterate.
• Do your homework, learn the physical origin of your data.
• Pre-process data.
• Develop your own method; all methods have weaknesses and strengths, so learn to
combine them.
• Know your customer. Stay focused on the ‘Question’.
• Simplify. It needs to run in the real world.
• Avoid bias: Biased Samples → Biased Outcome
• Never use the test and validation data in the training phase, not even for scaling
purposes.
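The last point is easy to violate by accident, so a toy illustration may help. This is a minimal sketch on made-up numbers, assuming simple mean/standard-deviation scaling; the function names and data are illustrative.

```python
import statistics


def fit_scaler(train_values):
    """Learn scaling parameters from the training data only."""
    mean = statistics.mean(train_values)
    std = statistics.pstdev(train_values)
    return mean, (std if std > 0 else 1.0)


def apply_scaler(values, scaler):
    """Scale values with parameters learned elsewhere."""
    mean, std = scaler
    return [(v - mean) / std for v in values]


# Hypothetical data: ordinary training values, one extreme test value.
train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [100.0]

# Correct: parameters come from the training set alone.
scaler = fit_scaler(train)
test_scaled = apply_scaler(test, scaler)

# Leaky: the test value influences its own scaling.
leaky_scaler = fit_scaler(train + test)
test_scaled_leaky = apply_scaler(test, leaky_scaler)

print(f"train-only scaling: {test_scaled[0]:.2f}")
print(f"leaky scaling:      {test_scaled_leaky[0]:.2f}")
```

With train-only scaling the test value stands out at roughly 69 standard deviations; with the leaky scaler it shrinks to about 2, which is exactly the kind of rare event the rest of the pipeline is supposed to notice.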
Thank you.
www.Quanttus.com
@Quanttus
www.facebook.com/Quanttus

 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 

Recently uploaded (20)

Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 

Jumping to Conclusions

  • 1. Richard Bijjani. JUMPING TO CONCLUSIONS (Generating Improbable Insights). Richard Robehr Bijjani, Ph.D. Open Data Science Conference, Boston 2015. @opendatasci
  • 2.
  • 3. • IMAGE: Chart of controlled vs uncontrolled? • How best to change behaviors 1/3 of all deaths globally are from cardiovascular disease SOURCE: WORLD HEALTH ORGANIZATION
  • 4. • IMAGE: Chart of controlled vs uncontrolled? • How best to change behaviors SOURCE: Mayo Clinic #1 risk factor is high blood pressure
  • 5. SOURCE: WORLD HEALTH ORGANIZATION 1,000,000,000
  • 9. Quanttus is always on! We capture > 50 million data points and > 400,000 vital sign measurements / person / day.
  • 10. Data Science @ Quanttus. Data Science ≡ Extraction of Actionable Knowledge from Data. Actionable Knowledge: • Better Decisions • Meaningful Insights. Knowledge is actionable iff it has predictive power (not just an ability to explain the past).
  • 11. JUMPING TO CONCLUSIONS Illusion of Knowledge fatal “The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” -Stephen Hawking
  • 12. Illusion of Knowledge. Courtesy of National Geographic
  • 13. First a Joke! A police officer approaches a man intently searching the ground under a lamppost. • Policeman: What are you doing? • Man: Looking for my car keys. The officer helps for a few minutes without success. • Policeman: Are you certain you dropped your keys near here? • Man: No! I remember dropping them across the street. • Policeman (very irritated): Why are you looking for them here then? • Man: The light is much better here!
  • 14. Why Scientific Studies are so often Wrong • “Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding.” David Freedman • 15 of the 45 most prominent studies published in the top medical journals were ultimately refuted. • 2/3 of all medical studies are wrong. • 9/10 of leading-edge studies (like those linking a disease to a specific gene) are wrong. John Ioannidis, University of Ioannina
  • 15. Why Scientific Studies are so often Wrong. 10% to 20% of cases: delayed, missed, and incorrect diagnosis (Garber, et al., JAMA, 2005). “Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding.” David Freedman
  • 16. Why Scientific Studies are so often Wrong. 40,000+ patients in US ICUs may die with a misdiagnosis annually (Winters, et al., BMJ Quality & Safety, 2012). “Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding.” David Freedman
  • 17. Why Scientific Studies are so often Wrong. 50% of MDs are below-average (Vinod Khosla). “Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding.” David Freedman
  • 18. Why Scientific Studies are so often Wrong. Are you Immune to the Streetlight Effect? “Researchers tend to look for answers where the looking is good, rather than where the answers are likely to be hiding.” David Freedman • Think of the data you are working with: is it the ideal data, or just the conveniently available data? • When was the last time you worked with ideal data? • Have you ever?
  • 19. Expert Consensus: seventeen experts’ estimates of the effect of screening on colon cancer deaths. [Chart; axis: proportion of colon cancer deaths prevented, 0% to 100%]
  • 20. Should you trust your Dr.? • Depends. • If your ailment is common, your Dr. will do a decent job. • If you’re suffering from a relatively uncommon disease, not so well. “If you don’t find it often, you often don’t find it.” Jeremy Wolfe
  • 21. Weak Link, Humans! 1. Signals with low predictive values are not very useful: 1 in 1000 does not hold one’s attention for long. 2. Attention directed at one thing is attention drawn away from something else: lost research/testing/treatment opportunities.
  • 22. Data Scientists’ Tools of Choice • Some scientists use only techniques they feel comfortable with. • Others latch on to new ones without fully understanding them. • Some just rely on available methods built into their software.
  • 23. The Nonsense Asymmetry Principle: The amount of energy needed to refute ‘Nonsense’ is an order of magnitude bigger than to produce it. -Alberto Brandolini
  • 24. Data Scientific Method • Validation can ONLY occur by measuring the predictive power of the insights, in addition to their ability to explain the past • Data Science is Science and hence follows the Scientific Method. Flowchart boxes: Ask Relevant Questions; Report Results; Research, Gather Data; Analyze Results; Validate Data; Construct Hypothesis; Design Experiment; Test Hypothesis; Hypothesis is True; Hypothesis not Valid.
  • 25. Ask the Important Question • Deadly Virus • Infects 1 in 1 million • Diagnostic Test developed with 99.9% Sensitivity and Specificity • Treatment developed • 99% Curative • 1% Deadly side effect • Question: Would you recommend Diagnosis? • Would you recommend Treatment?
  • 26. Efficacy of Treatment (300M test subjects)
      • Do Nothing: 300 People Infected; 300 People will Die
      • Test results:
                     Predicted Negative   Predicted Positive
          Normal     299.7M               300,000
          Infected   <1                   >299
      • Diagnose and Treat:
          Infected Population: 296 Cured; 3 + 1 Die
          Non-Infected Population: 297,000 Unaffected (except for the scare); 3,000 Die
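The arithmetic on the Efficacy of Treatment slide can be reproduced with a short sketch. It assumes, as the slide does, that the untreated virus is always fatal and that every positive test result leads to treatment; the function and parameter names (`screening_outcomes`, `treatment_death_rate`, and so on) are illustrative and not from the talk.

```python
def screening_outcomes(population, prevalence, sensitivity, specificity,
                       cure_rate, treatment_death_rate):
    """Expected outcomes when everyone is tested and every positive is treated."""
    infected = population * prevalence
    healthy = population - infected

    true_pos = infected * sensitivity          # infected and flagged by the test
    false_neg = infected - true_pos            # infected but missed by the test
    false_pos = healthy * (1.0 - specificity)  # healthy but flagged by mistake

    return {
        # Assumption taken from the slide: every untreated infection is fatal.
        "deaths_if_nothing_done": infected,
        "cured": true_pos * cure_rate,
        "deaths_among_infected": true_pos * treatment_death_rate + false_neg,
        "deaths_among_healthy": false_pos * treatment_death_rate,
        "positive_predictive_value": true_pos / (true_pos + false_pos),
    }


result = screening_outcomes(
    population=300_000_000, prevalence=1e-6,
    sensitivity=0.999, specificity=0.999,
    cure_rate=0.99, treatment_death_rate=0.01,
)
deaths_if_treated = result["deaths_among_infected"] + result["deaths_among_healthy"]

for name, value in result.items():
    print(f"{name}: {value:,.4f}")
print(f"deaths_if_treated: {deaths_if_treated:,.1f}")
```

Under these assumptions, roughly one positive result in a thousand belongs to an infected person, so treating every positive costs about ten times as many lives as doing nothing.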
  • 27. Good Practices • Understand where the data comes from. • Pre-process / Clean your data, but keep validated outliers. • Own the tools and adapt them to your own requirements. • Follow the Scientific Method. • Analyze data to answer the Question posed. • Save a list of other interesting questions for later. • Share your hypotheses with the team. • Simple is better, at least make sure it’s deployable. • Test, Validate, re-test. • Communicate results correctly and set the right expectations. “If you torture the data long enough, it will confess to anything.” - Hal Varian
  • 29. Pitfalls of data mining • The hope: data miners pore over large, diffuse sets of raw data trying to discern patterns that would otherwise go undetected. • The dark side of data mining is to pick and choose from a large set of data to try to explain a small one. • “Given enough time, enough attempts and enough imagination, almost any conclusion can be teased out of any set of data.”
  • 30. Limitations of Common Data Mining Techniques • Automated feature selection methods cannot apply to rare (or unforeseen) events • Normal events are similar, rare events are by definition unique • Accuracy measurements are not appropriate • Real time detection of rare events is necessary, but machine learning techniques construct models based on the past • If you haven’t seen it yet, you cannot detect it!
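A small sketch of why accuracy measurements are not appropriate for rare events. The data is made up for illustration (10 rare events among 10,000 samples), and `evaluate` is a hypothetical helper, not something from the talk: a detector that never raises an alert still looks excellent on accuracy.

```python
def evaluate(y_true, y_pred):
    """Return (accuracy, recall, precision) for binary labels, 1 = rare event."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    accuracy = (tp + tn) / len(y_true)
    # Recall and precision are undefined (None) when their denominator is zero.
    recall = tp / (tp + fn) if (tp + fn) else None
    precision = tp / (tp + fp) if (tp + fp) else None
    return accuracy, recall, precision


# Illustrative data: 10 rare events hidden among 9,990 normal samples.
y_true = [1] * 10 + [0] * 9_990
always_normal = [0] * len(y_true)  # a "detector" that never alerts

accuracy, recall, precision = evaluate(y_true, always_normal)
print(f"accuracy={accuracy:.3f} recall={recall} precision={precision}")
```

The useless detector scores 99.9% accuracy while catching none of the rare events, which is why recall and precision (or a cost-weighted measure) are the more informative numbers here.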
  • 31. What are rare/high-value events? • Rare or Outliers • Occur less than 1% of the time • For large datasets, many samples exist. Balance could be achieved and traditional Data Mining Techniques could be applied • Preferential sampling of rare class • Under-sampling of majority class • Extremely Rare or Anomalies • Statistical chance of detection is zero • Most databases don’t ‘naturally’ contain any samples • Properties of target samples are not known
  • 32. What are the costs of such events? Cost Functions not easily Defined
  • 33. Anomalies vs. High Value Rare Events: By definition, anomalies are the exception, but not necessarily rare and/or of high value. • Anomaly? Yes • Rare Event? No
  • 34. Extremely Rare, High Value Events. Case Study: Terrorism, specifically explosive detection
  • 35. Finding Commercial Explosives: Data could be collected and/or simulated, allowing for rare class augmentation
  • 36. Finding Explosives: Data cannot be collected and/or simulated.
  • 37. Why incompatible? • No Quality Control • “Suicide Bomb Trainer in Iraq Accidentally Blows Up His Class” • Terrorist ‘lab’ (redacted)
  • 40. Takeaway • We are drowning in data, yet starving for knowledge • In case of rare events, data may not be enough; the source of the data needs to be well understood • To detect rare events, sometimes it’s just more effective to generate heuristics • Heuristics cannot predict, while machine learning assumes the future will resemble the past, and extremely rare events are not part of the past • What to do?
  • 41. Outliers revisited 1. Retain outliers in the data set for analysis. 2. Exclude only those that are known to be due to defective measurements or transcription errors (this requires understanding the data origin, to accurately separate rare events from measurement errors). 3. Do not assume a normal distribution.
  • 42. Data Mining Techniques • Supervised • Pro: Human readable models • Con: Requires labeled data • Unsupervised • Pro: Deviation detection, no labeling needed • Con: Requires similarity measures, high false-alarm rate (due to benign yet previously unseen data)
  • 43. Unsupervised Techniques • Outlier datum defined as different from the rest of the data • Rare event: Same definition • Detection Approaches • Statistics based • Distance based • Model based
  • 44. Unsupervised: Statistics • Data modeled using a stochastic distribution • Advantages: no a priori knowledge required • Disadvantages: • Fails with high dimensions (curse of dimensionality) • Does not identify patterns of rare events • Sample implementations: • Finite Mixture Schemes, e.g. SmartSifter. Use histogram density to represent probability distribution • Blocked Adaptive Computationally Efficient Outlier Nominators (BACON) • Probability Distributions • Entropy Measures
  • 45. Unsupervised: Distance • Distance computed between neighbors, and data points sorted • Advantages: no a priori knowledge required • Disadvantages: • Not suitable for rare classes • Sample implementations: • k-Nearest Neighbor • Mahalanobis Distance for skewed distributions • Local Outlier Factor (LOF) for variable density clusters (the average distance between points differs from cluster to cluster) • Specialized Clustering, Canopy, FindOut
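The distance-based approach (compute distances between neighbors, then sort the data points) might be sketched as follows. This is a minimal brute-force k-nearest-neighbor score on made-up 2-D points, not any of the named implementations; `knn_outlier_scores` and the choice `k=2` are assumptions for illustration.

```python
import math


def knn_outlier_scores(points, k):
    """Score each point by the distance to its k-th nearest neighbor.

    A larger score means the point sits farther from the rest of the data.
    Brute force, O(n^2): fine for a sketch, too slow for large datasets.
    """
    if not 1 <= k < len(points):
        raise ValueError("k must be between 1 and len(points) - 1")
    scores = []
    for i, p in enumerate(points):
        distances = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(distances[k - 1])
    return scores


# Illustrative data: a tight cluster near the origin plus one distant point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (0.5, 0.5), (10.0, 10.0)]
scores = knn_outlier_scores(points, k=2)

# Sort the points from most to least outlying.
ranked = sorted(zip(scores, points), reverse=True)
for score, point in ranked:
    print(f"{point}: {score:.2f}")
```

The point far from the cluster receives by far the largest score; no labels or prior knowledge are needed, but the score alone cannot say whether that point is a rare event or a measurement error.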
  • 46. Unsupervised: Model • Predict normal behavior via model • Capture deviations • Detection Approaches • Neural Networks, 4 layers, input = output • Unsupervised support vector machines (SVM)
  • 47. Supervised Techniques • Classification methods typically not suitable: • Problem: Lack of labels • Possible Solution: Balance the class size • Duplicate rare events or down-size normal events • Generate anomalies inversely proportional to data density • Synthetically generate over-sampled minority events (e.g. SMOTE) • Classify regions as ‘positive’ without having enough data in them • Shrink: look for presence of positive labels, not majority • PN-rule: Find regions of high recall (Pd), then prune false positives, then classify (avoid over-fitting) • Decision Tree methods: Ripple Down Rules, CREDOS, Boosting Classifiers, Random Forest
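The simplest way to balance the class size, duplicating rare events or down-sizing normal events, could look like the sketch below. It is plain random resampling, not SMOTE (which synthesizes new minority samples rather than copying existing ones); the function names and the toy dataset are hypothetical.

```python
import random


def oversample_rare(pairs, rare_label, rng):
    """Duplicate randomly chosen rare-class samples until the classes match."""
    rare = [p for p in pairs if p[1] == rare_label]
    common = [p for p in pairs if p[1] != rare_label]
    if not rare:
        raise ValueError("no rare-class samples to duplicate")
    extra = [rng.choice(rare) for _ in range(max(0, len(common) - len(rare)))]
    return common + rare + extra


def undersample_common(pairs, rare_label, rng):
    """Keep every rare-class sample and an equally sized random subset of the rest."""
    rare = [p for p in pairs if p[1] == rare_label]
    common = [p for p in pairs if p[1] != rare_label]
    kept = rng.sample(common, min(len(rare), len(common)))
    return rare + kept


# Illustrative dataset of (features, label) pairs: 95 normal, 5 rare.
pairs = [((float(i),), 0) for i in range(95)] + [((100.0 + i,), 1) for i in range(5)]
rng = random.Random(0)  # fixed seed so the resampling is repeatable

oversampled = oversample_rare(pairs, rare_label=1, rng=rng)
undersampled = undersample_common(pairs, rare_label=1, rng=rng)
print(len(oversampled), len(undersampled))
```

Duplication adds no new information about the rare class, and under-sampling discards normal data, so either step should be applied to the training split only and judged on untouched test data.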
  • 48. Cost Functions • In any classification problem, one needs to minimize the cost function • Selecting an appropriate cost function is key • Weighting is also important, not all data points are created equal. • Bayesian Thinking is necessary • Temporal (time-series) Analysis requires different approaches. • Is current data ‘surprising’ based on historical data created with the same underlying process? • Opportunity for Insight, error, or rare event capture.
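One way to read the point about cost functions and weighting is as an expected-cost decision rule: alert only when the expected cost of a miss exceeds the expected cost of a false alarm. The sketch below is one such rule under assumed, made-up costs; `expected_costs`, `should_alert`, and the numbers are illustrative, not from the talk.

```python
def expected_costs(p_event, cost_false_alarm, cost_miss):
    """Expected cost of alerting versus staying silent for a single case."""
    cost_if_alert = (1.0 - p_event) * cost_false_alarm  # pay only if it was benign
    cost_if_silent = p_event * cost_miss                # pay only if it was real
    return cost_if_alert, cost_if_silent


def should_alert(p_event, cost_false_alarm, cost_miss):
    """Alert when alerting is the cheaper action in expectation."""
    cost_if_alert, cost_if_silent = expected_costs(p_event, cost_false_alarm, cost_miss)
    return cost_if_alert < cost_if_silent


# Same 1% event probability, two different (hypothetical) miss costs.
high_stakes = should_alert(p_event=0.01, cost_false_alarm=1.0, cost_miss=1000.0)
low_stakes = should_alert(p_event=0.01, cost_false_alarm=1.0, cost_miss=10.0)
print(high_stakes, low_stakes)
```

With these costs the rule is equivalent to alerting when the event probability exceeds `cost_false_alarm / (cost_false_alarm + cost_miss)`, so a 1% probability justifies an alert only when a miss is much more expensive than a false alarm.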
  • 49. The Weak Link: No matter how good your automated system is, the final decision to act or not is often made by a human! Present only relevant data to make the right decision: actionable information. Present data in human readable format. Visualization! Be creative, different.
  • 50. Detecting Extremely Rare Events
      • Data Collection: Capture high SNR representative data
      • Pre-process: Clean the data from known noise and artifacts
      • Feature Extraction: Reduce data to meaningful features with no loss of desired signal
      • Classifier: Divide data into training/testing and use appropriate classifiers. Always use feature confidences
      • Identify Outliers: Data that is not ‘normal’. Determine if physically appropriate or measurement error. Delete errors.
      • Explain Data: Any Insights? What does it mean, sub-category classification
      • Present Data: Visualization, UI, UX
  • 51. Conclusion • Experiment, test. Iterate. • Do your homework, learn the physical origin of your data. • Pre-process data. • Develop your own method; all methods have weaknesses and strengths, learn to combine them. • Know your customer. Stay focused on the ‘Question’. • Simplify. It needs to run in the real world. • Avoid bias: Biased Samples → Biased Outcome • Never use the test and validation data in the training phase, not even for scaling purposes.
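The last point, keeping test data out of training even for scaling, can be illustrated with a minimal sketch: the scaling statistics are computed from the training split only and then reused unchanged on the test split. The helper names and the toy values are assumptions for illustration.

```python
import statistics


def fit_standardizer(train_values):
    """Compute mean and standard deviation from the training split only."""
    mean = statistics.fmean(train_values)
    std = statistics.pstdev(train_values)
    if std == 0:
        raise ValueError("training values are constant; cannot standardize")
    return mean, std


def standardize(values, mean, std):
    """Apply previously fitted statistics; never refit on the data being scaled."""
    return [(v - mean) / std for v in values]


train = [1.0, 2.0, 3.0, 4.0, 5.0]
test = [6.0, 100.0]  # held out; includes a value unlike anything in training

mean, std = fit_standardizer(train)             # statistics come from train only
train_scaled = standardize(train, mean, std)
test_scaled = standardize(test, mean, std)      # same statistics reused on test
print(mean, round(std, 4), [round(v, 2) for v in test_scaled])
```

Fitting the scaler on train and test together would let the extreme test value shift the mean and spread, leaking information about the held-out data into the model and hiding exactly the kind of outlier this talk is about.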
  • 52. Thank you. www.Quanttus.com @Quanttus www.facebook.com/Quanttus