A HYBRID APPROACH
TO MODEL
RANDOMNESS AND
FUZZINESS USING
DEMPSTER-SHAFER
METHOD
Taposh Roy & Austin Powell
July 2016
WHO ARE WE?
Taposh Roy, Lead, Big Data & Data Science, Kaiser Permanente, taposh.d.roy@kp.org
Austin Powell, Graduate Intern, Kaiser Permanente, austin.powell@kp.org
OUTLINE
• Motivation, philosophy, and some definitions
• Why the Dempster-Shafer method
• Math behind the Dempster-Shafer method
• Our analysis combining predictive models with Dempster-Shafer
• Review of our results and interpretation
MOTIVATION
THE WHY?
The majority of predictive analytics today is based on historical data. As data scientists, we hunt and gather historical data, train models using a variety of algorithms, test and validate them, and finally deploy them to score.
This, however, limits us to situations that were represented in our training data. Our predictions don't incorporate any external or environmental factors the model was not trained on.
PHILOSOPHY
"So far as the laws of mathematics refer to reality, they are not certain. And so far as they are certain, they do not refer to reality." (Albert Einstein)
"As complexity rises, precise statements lose meaning and meaningful statements lose precision." (Lotfi Zadeh)
BASIC DEFINITIONS
DEFINITION: RANDOMNESS
Box A contains 8 small and 8 big apples. You randomly pick an apple from the box. There is uncertainty about the size of the apple that will come out. The size of the apple can be described by a random variable that takes 2 different values.
There are 2 distinct possible outcomes, and there is no uncertainty after the apple is taken out and observed.
The probability of the selected apple being big or small is 50% each.
DEFINITION: FUZZINESS
Now, you have a box (Box B) that contains 16 apples with sizes ranging from very small to very large. You have one of these apples in hand. There is uncertainty in describing the size of the apple, but this time the uncertainty is not about a lack of knowledge of the outcome. It is about the lack of a boundary between small and large. You use a fuzzy membership function to define the membership value of the apple in the set "large" or "small". Everything is known except how to label the objects (apples), how to describe them, and where to draw the boundaries of the sets.
WHY DEMPSTER-SHAFER?
DEMPSTER-SHAFER – WHY?
• Dempster-Shafer theory has been used in AI and reliability modeling1.
• Dempster-Shafer theory is a foundational theory for uncertainty modeling and paved the way toward fuzzy-logic approaches.
• Dempster–Shafer theory (a.k.a. the theory of belief functions) is a generalization of the Bayesian theory of subjective probability.
• Whereas Bayesian theory requires probabilities for each question of interest, belief functions allow us to base degrees of belief (strength of evidence in favor of some proposition) for one question on probabilities for a related question.
1 http://www.gnedenko-forum.org/Journal/2007/03-042007/article19_32007.pdf
BRIEF HISTORY:
• 17th century roots.
• 1968 > Dempster developed means for combining degrees of belief derived from independent items of evidence.
• 1976 > Shafer developed the Dempster-Shafer theory.
• 1988-90 > The theory was criticized by researchers as confusing probabilities of truth with probabilities of provability2,3.
• 1998 > Kłopotek and Wierzchoń proposed to interpret the Dempster–Shafer theory in terms of statistics of decision tables.
• 2011 > Dr. Hemant Baruah, "The Theory of Fuzzy Sets: Beliefs and Realities".
• 2012 > Dr. Jøsang showed that Dempster's rule of combination is actually a method for fusing belief constraints.
• 2013 > Dr. John Yen, "Can Evidence Be Combined in the Dempster-Shafer Theory?"
2 http://sartemov.ws.gc.cuny.edu/files/2012/10/Artemov-Beklemishev.-Provability-logic.pdf
3 https://en.wikipedia.org/wiki/Provability_logic
BASIC EXAMPLES
EXAMPLE: BETTY
Dempster–Shafer theory has 2 components:
• Obtaining degrees of belief.
• Combining degrees of belief.
My friend Betty, whom I consider reliable with probability 0.9, tells me a limb fell on my car. This statement must be true if she is reliable.
So her testimony alone justifies a 0.9 degree of belief that a limb fell on my car and a zero (not 0.1) degree of belief that no limb fell on my car.
Note: A zero degree of belief here does not mean that I am sure no limb fell on my car. It merely means that, based on Betty's testimony, there is no reason to believe that no limb fell on my car.
The belief function is 0.9 and 0 together.
COMBINING BELIEFS
I have another friend, Sally. She, independently of Betty, tells me a limb fell on my car.
The event that Betty is reliable is independent of the event that Sally is reliable, so we multiply the probabilities of these events:
The probability that both are reliable is 0.9 x 0.9 = 0.81,
the probability that neither is reliable is 0.1 x 0.1 = 0.01, and
the probability that at least one is reliable is 1 - 0.01 = 0.99.
Since both said a limb fell on my car, at least one of them being reliable implies that a limb did fall on my car, and hence I may assign this event a degree of belief of 0.99.
Reference: http://www.glennshafer.com/assets/downloads/articles/article48.pdf
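A quick arithmetic check of this slide in Python (a minimal sketch; the 0.9 reliabilities are the slide's assumptions):

```python
p_betty = p_sally = 0.9  # each witness is independently reliable with probability 0.9

both = p_betty * p_sally                 # 0.81: both reliable
neither = (1 - p_betty) * (1 - p_sally)  # 0.01: neither reliable
at_least_one = 1 - neither               # 0.99: at least one reliable

# Both witnesses said a limb fell, so "at least one is reliable" implies it did.
print(f"Degree of belief that a limb fell: {at_least_one:.2f}")  # 0.99
```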
COMBINING BELIEFS: CONTRADICT
Now suppose Sally contradicts Betty: she tells me no limb fell on my car.
In this case both cannot be right, and hence both cannot be reliable; only one is reliable, or neither is reliable.
The prior probabilities:
Only Betty is reliable and Sally is not: 0.9 x 0.1 = 0.09
Only Sally is reliable and Betty is not: 0.1 x 0.9 = 0.09
Neither is reliable: 0.1 x 0.1 = 0.01
COMBINING BELIEFS: CONTRADICT (CONTINUED)
The posterior probabilities, conditioned on the impossible case (both reliable) being ruled out:
Only Betty is reliable while Sally is not, given that not both are reliable:
P(Betty is reliable, Sally is not) / P(not both reliable) = 0.09 / [1 - (0.9 x 0.9)] = 9/19
Only Sally is reliable while Betty is not, given that not both are reliable:
P(Sally is reliable, Betty is not) / P(not both reliable) = 0.09 / [1 - (0.9 x 0.9)] = 9/19
Neither of them is reliable, given that not both are reliable:
P(neither is reliable) / P(not both reliable) = (0.1 x 0.1) / [1 - (0.9 x 0.9)] = 1/19
Hence we have a 9/19 degree of belief that a limb fell on my car (Betty reliable) and a 9/19 degree of belief that no limb fell on my car (Sally reliable).
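The same renormalization written out as a quick check (mirrors the slide's arithmetic):

```python
k = 0.9 * 0.9  # probability mass on the eliminated case: both reliable

only_betty = (0.9 * 0.1) / (1 - k)  # 9/19: a limb fell on my car
only_sally = (0.1 * 0.9) / (1 - k)  # 9/19: no limb fell on my car
neither    = (0.1 * 0.1) / (1 - k)  # 1/19: no evidence either way

print(round(only_betty, 4), round(only_sally, 4), round(neither, 4))
# 0.4737 0.4737 0.0526
```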
BASIC DEMPSTER-SHAFER
We obtain degrees of belief for one question (Did a limb fall on my car?) from probabilities for another question (Is the witness reliable?)4.
4 http://www.glennshafer.com/assets/downloads/articles/article48.pdf
ALGORITHM
NEXT STEPS
• Quantify fuzziness in binary prediction
• Obtain improved probabilities for binary classification, based on a prior classification algorithm, using the Modified Dempster-Shafer approach
DATASETS TESTED
• Higgs Boson
• Titanic
• Cdystonia
• Parkinson's Telemonitoring
• Abalone
• Yeast
• Skin_Nonskin
• Contraceptive Method Choice
• Diabetic Retinopathy
• Mammographic Mass
• Breast Cancer Wisconsin (Diagnostic)
• Breast Cancer Wisconsin (Prognostic)
• Haberman's Survival
• Statlog (Heart)
• SPECT Heart
• SPECTF Heart
• Hepatitis
• Sponge
• Audiology
• Soybean
• Ecoli
• Post-Operative Patient
• Zoo
• EEG Eye State
• Wilt
• Bach
• Nursery
• Phishing Websites
• Pets
• Contraceptive Method
• First Order
• Spambase
• Hill Valley
• Satellite Image
• Agaricus Lepiota
• King Rook vs King Pawn
• Firm Teacher
• Fertility
• Dermatology
• Mice Protein
• LSVT Voice
• Thyroid
MODIFIED DS SUMMARY
Road forward
• Scale performance with data dimension
• Correct for highly imbalanced classes, which can produce a high class probability when there is none (high/low belief conflict)
• Test different prior algorithms
Positives
• Can significantly increase performance metrics, and with greater precision (tighter spread across CV folds)
• The algorithm is not a black box; it can be understood relatively easily
• Prior probabilities can come from algorithms other than logistic regression
JAR OF MARBLES
What is the majority color in the jar? Two people give their beliefs (mass values):

                    Person 1   Person 2
Majority is Red       70%        50%
Majority is Blue      20%        30%
Half Blue and Red     10%        20%

A Dempster-Shafer student could then say he/she believes:

  Red: 81%   Blue: 17%   Half-half: 2%
DEMPSTER'S RULE OF COMBINATION

$$(m_1 \oplus m_2)(A) = m_{1,2}(A) = \frac{1}{1-K} \sum_{B \cap C = A \neq \emptyset} m_1(B)\, m_2(C)$$

$$K = \sum_{B \cap C = \emptyset} m_1(B)\, m_2(C), \qquad K \neq 1$$

Combines multiple pieces of evidence (mass functions). The rule emphasizes the agreement between the mass functions and ignores their conflict, via the normalization factor 1 - K. K is the measure of conflict between the two mass sets, meaning 1 - K is the measure of agreement.
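A minimal Python sketch of this rule (our illustration, not the authors' implementation), representing each mass function as a dict from frozenset-valued hypotheses to masses. Run on the Betty/Sally contradiction from the earlier slides, it reproduces the 9/19, 9/19, 1/19 split:

```python
from itertools import product

def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.
    m1, m2: dicts mapping frozenset hypotheses to masses summing to 1."""
    combined = {}
    conflict = 0.0  # K: total mass landing on incompatible (disjoint) pairs
    for (b, mb), (c, mc) in product(m1.items(), m2.items()):
        a = b & c
        if a:
            combined[a] = combined.get(a, 0.0) + mb * mc
        else:
            conflict += mb * mc
    if conflict >= 1.0:
        raise ValueError("Total conflict (K = 1): the rule is undefined")
    return {a: v / (1.0 - conflict) for a, v in combined.items()}  # normalize by 1 - K

theta = frozenset({"limb", "no limb"})
betty = {frozenset({"limb"}): 0.9, theta: 0.1}     # Betty: a limb fell
sally = {frozenset({"no limb"}): 0.9, theta: 0.1}  # Sally: no limb fell

for hypothesis, mass in combine(betty, sally).items():
    print(set(hypothesis), round(mass, 4))
# {'limb'} 0.4737 (9/19), {'no limb'} 0.4737 (9/19), {'limb', 'no limb'} 0.0526 (1/19)
```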
DS NOTATION
Θ is the frame of discernment: the set of all possible outcomes.

The plausibility function of a class sums the mass values of all sets that intersect that class. It measures what is left over after subtracting the evidence against a belief:

$$pl(A) = \sum_{B \cap A \neq \emptyset} m(B)$$

The belief function of a class sums the mass values of all the non-empty subsets of that class: the strength of evidence in favor of the proposition:

$$bel(A) = \sum_{B \subseteq A} m(B)$$

$$m(A) \leq bel(A) \leq pl(A)$$
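Under the same frozenset representation as the sketch above, bel and pl are one-liners (hypothetical helper names, not from the slides):

```python
def bel(A, m):
    """Belief in A: total mass of all non-empty subsets of A."""
    return sum(v for b, v in m.items() if b and b <= A)

def pl(A, m):
    """Plausibility of A: total mass of all sets that intersect A."""
    return sum(v for b, v in m.items() if b & A)

# Using the combined Betty/Sally masses from the previous sketch:
m = {frozenset({"limb"}): 9/19, frozenset({"no limb"}): 9/19,
     frozenset({"limb", "no limb"}): 1/19}
A = frozenset({"limb"})
print(round(bel(A, m), 4), round(pl(A, m), 4))  # 0.4737 0.5263, so bel(A) <= pl(A)
```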
MODIFIED DS: OBTAINING POSTERIOR

$$C(A \mid E') = \frac{m(A \mid E')\,/\,P(A)}{\sum_{A \subset \Theta} m(A \mid E')\,/\,P(A)}$$

The Basic Certainty Value C(A | E') is a normalized ratio of a hypothesis subset's mass to its prior probability.
The evidence space E' is a set of mutually exclusive outcomes or possible values of an evidence source.
Given some evidence and prior probabilities for a class, we can obtain a posterior or updated mass value.
Reference: D. Fixsen and R. P. S. Mahler, "The modified Dempster-Shafer approach to classification."
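A direct transcription of the slide's formula into Python, summing over the hypotheses carried by the mass function (a sketch with hypothetical names; the masses and priors below are made-up):

```python
def basic_certainty(m, prior):
    """Basic Certainty Value: normalize each hypothesis's mass-to-prior ratio.
    m: dict hypothesis -> combined mass; prior: dict hypothesis -> prior P(A)."""
    ratios = {a: mass / prior[a] for a, mass in m.items()}
    total = sum(ratios.values())
    return {a: r / total for a, r in ratios.items()}

pos, neg = frozenset({"pos"}), frozenset({"neg"})
theta = pos | neg
m = {pos: 0.6, neg: 0.3, theta: 0.1}      # combined masses (hypothetical)
prior = {pos: 0.7, neg: 0.3, theta: 1.0}  # priors; P(theta) = 1 by definition

for a, v in basic_certainty(m, prior).items():
    print(sorted(a), round(v, 3))
# ['pos'] 0.438, ['neg'] 0.511, ['neg', 'pos'] 0.051
# Dividing by the prior boosts the mass of low-prior hypotheses.
```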
JAR OF MARBLES (WITH PRIOR INFORMATION)
What is the majority color in the jar?

                    Person 1   Person 2
Majority is Red       70%        50%
Majority is Blue      20%        30%
Half Blue and Red     10%        20%

Prior information: Red 60%, Blue 40%

A Dempster-Shafer student could then say he/she believes:

            Red     Blue    Half-half
DS          81%     17%     2%
Modified    84.9%   13.5%   1.6%
THE ALGORITHM – BUILD A MODEL
OBTAINING MASS VALUES
For each feature:
1. Partition the training data by class
2. Calculate the max/min values of each partition
3. Determine which max/min situation applies by comparing the ranges of the partitions (a sketch of one such scheme follows)
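The slides leave the three min/max "situations" to figures (see the appendix), so the following is a rough sketch of one plausible range-based mass assignment, our assumption rather than the authors' exact scheme: a value that falls inside only one class's training range supports that class, while a value in an overlap region (or outside every range) leaves the mass on the frame Θ:

```python
def feature_masses(x, ranges, theta, discount=0.1):
    """Assign a mass function for one feature value x, given per-class
    (min, max) ranges from the training partitions. Illustrative only."""
    supported = frozenset(c for c, (lo, hi) in ranges.items() if lo <= x <= hi)
    if not supported or supported == theta:
        return {theta: 1.0}  # out of range or fully ambiguous: total uncertainty
    # Keep a small discount on theta so no single feature is fully certain.
    return {supported: 1.0 - discount, theta: discount}

theta = frozenset({"spam", "ham"})
ranges = {"spam": (0.4, 3.0), "ham": (0.0, 1.2)}  # hypothetical per-class ranges
print(feature_masses(2.0, ranges, theta))  # only spam's range: mass 0.9 on {'spam'}
print(feature_masses(0.8, ranges, theta))  # overlap of both ranges: mass 1.0 on theta
```

Per-feature mass functions like these can then be fused with Dempster's rule (the combine sketch above) before the modified-DS posterior step.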
EXAMPLES
So far we have tested the DS algorithm on 43 datasets from:
• the UCI Machine Learning Repository
• the R package "datasets"
EMAIL SPAM
• Email spam comes from a diverse group of individuals; we are predicting on individuals' definitions of spam or not.
• 58 features: primarily different word and character frequencies
• 4601 observations
https://archive.ics.uci.edu/ml/datasets/Spambase

Median metrics based on 10-fold CV:

Metric            LR        LRDS
Accuracy          0.63587   0.63913
AUC               0.76317   0.76265
F1 Score          0.45753   0.45840
Precision         0.52277   0.52470
True Positive     0.39503   0.40151
False Positive    0.01245   0.01384
EMAIL SPAM [Figure: performance plots]
SPECT HEART
• Cardiac Single Photon Emission Computed Tomography images
• Predicting on normal or abnormal heart
• 23 features on partial human diagnoses of abnormality
• 267 observations
Source: https://archive.ics.uci.edu/ml/datasets/SPECT+Heart

Median metrics based on 10-fold CV:

Metric            LR        LRDS
Accuracy          0.55660   0.72641
AUC               0.89846   0.89846
F1 Score          0.52168   0.57441
Precision         0.52618   0.66678
True Positive     0.14074   0.50020
False Positive    0.01562   0.036458
SPECT HEART [Figures: performance plots]
REFERENCES
• Application of Dempster-Shafer in reliability modeling: http://www.gnedenko-forum.org/Journal/2007/03-042007/article19_32007.pdf
• Provability logic: http://sartemov.ws.gc.cuny.edu/files/2012/10/Artemov-Beklemishev.-Provability-logic.pdf
• Wikipedia, Provability logic: https://en.wikipedia.org/wiki/Provability_logic
• Dr. Shafer's article: http://www.glennshafer.com/assets/downloads/articles/article48.pdf
• Home page of Dr. Lotfi Zadeh: https://www2.eecs.berkeley.edu/Faculty/Homepages/zadeh.html
• L. A. Zadeh, "The concept of a linguistic variable and its application to approximate reasoning - I," Information Sciences, vol. 8, no. 3, pp. 199-249, July 1975.
• R. E. Bellman and L. A. Zadeh, "Decision making in a fuzzy environment," Management Science, vol. 17, no. 4, pp. B141-B164, Dec. 1970.
• L. A. Zadeh, "Fuzzy sets," Information and Control, vol. 8, no. 3, pp. 338-353, June 1965.
• Alexander Karlsson, "Evidence Combination in R," R Foundation for Statistical Computing, 2012.
• D. Fixsen and R. P. S. Mahler, "The modified Dempster-Shafer approach to classification," pp. 96-104, Jan. 1997.
• Jiwen Guan, Jasmina Pavlin, and Victor R. Lesser, "Combining Evidence in the Extended Dempster-Shafer Theory," pp. 163-178, Springer London, London, 1990.
• M. Lichman, UCI Machine Learning Repository, 2013.
• Simon Parsons, "Some qualitative approaches to applying the Dempster-Shafer theory," 1994.
QUESTIONS?
APPENDIX
DS NOTATION

$$\Theta = \{a, b\}$$
The set of all possibilities; a hypothesis is a subset of Θ.

$$2^\Theta = \{\emptyset, \{a\}, \{b\}, \Theta\}$$
The power set is the set of all possible subsets of Θ, including Θ itself and the empty set ∅.

$$m: 2^\Theta \to [0, 1]$$
Mass function: each element of the power set has a mass value assigned to it.

$$m(\emptyset) = 0$$
For the purposes of our algorithm, the mass of the empty set is zero.

$$\sum_{A \subseteq \Theta} m(A) = 1$$
The sum of the masses (for each row) is 1 (currently).
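This notation maps directly onto the frozenset representation used in the sketches above; for example, with Θ = {a, b}:

```python
from itertools import chain, combinations

theta = frozenset({"a", "b"})

# 2^Theta: every subset of theta, from the empty set up to theta itself.
power_set = [frozenset(s) for s in chain.from_iterable(
    combinations(sorted(theta), r) for r in range(len(theta) + 1))]
print(power_set)  # [frozenset(), {'a'}, {'b'}, {'a', 'b'}]

# A mass function over 2^Theta; m(empty set) = 0 is left implicit.
m = {frozenset({"a"}): 0.6, frozenset({"b"}): 0.1, theta: 0.3}
assert abs(sum(m.values()) - 1.0) < 1e-9  # masses sum to 1
```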
CLASSIFICATION PROCESS
[Figures: min/max range situations 1, 2, and 3]
Editor's Notes
  1. Hi, this is Taposh Roy and Austin Powell from Kaiser Permanente. This is part of the ongoing research work done by Austin.
  2. In this talk we will discuss the following: motivation for combining fuzziness and predictability; the math behind the Dempster-Shafer method; our analysis combining logistic regression with Dempster-Shafer; and a review of our results and interpretation.
  3. The majority of predictive analytics today is based on historical data. As data scientists, we hunt and gather historical data, train models using a variety of algorithms, test and validate them, and finally deploy them to score. This, however, limits us to situations that have historical information. Our predictions don't incorporate any external or environmental factors not already trained on.
  4. Can you guess who said this? This is Dr. Zadeh; his research has been pioneering in this area of belief functions. Before we go further, I am going to talk about a few concepts.
  5. Box A contains 8 small and 8 big apples. You randomly pick an apple from the box. There is uncertainty about the size of the apple that will come out. The size of the apple can be described by a random variable that takes 2 different values with 50% probability each. There are 2 distinct possible outcomes and there is no uncertainty after the apple is taken out and observed.
  6. You have a box (Box B) that contains 10 apples with different sizes ranging from very small to very large. You have one of these apples in hand. There is uncertainty in describing the size of the apple, but this time the uncertainty is not about the lack of knowledge of the outcome. It is about the lack of a boundary between small and large. You use a fuzzy membership function to define the membership value of the apple in the set "large" or "small". Everything is known except how to label the objects (apples), how to describe them, and where to draw the boundaries of the sets.
  7-13. AI researchers in the early 1980s, when they were trying to adapt probability theory to expert systems.
  14. We could get a probability confidence interval using something like a normality assumption for the true parameter. (1) The disadvantage is the usual issue associated with normality assumptions and sample size. (2) This is for the final performance metric, not a per-observation upper and lower bound.
  15. "To a greater precision" in the first point refers to the fact that, in the following plots, the boxplots have smaller bounds for LRDS.
  16. Person 1 and Person 2 probabilities should be introduced as mass values. Anyone who has ever done a probability problem has done a jar/urn problem with certainty. Note that for this example we are not considering a binary classification; in our case, we would only be considering Red and Green, for example.
  17. Plausibility: it also ranges from 0 to 1 and measures the extent to which evidence in favor of ~p leaves room for belief in p. We will address what a "measure" is.
  18. In addition to combining information from different features, it is possible to use the same Bayesian process to integrate information from other data sources, possibly as a way of weighting your masses. By way of example, say that we did logistic regression on the breast cancer dataset to obtain probabilities. We could then obtain probabilities from another binary prediction algorithm, such as Naive Bayes, and combine the two to obtain mass values. This equation is from the paper "A reasoning model based on an extended Dempster-Shafer Theory" by John Yen. The Basic Certainty Value of a hypothesis subset is the normalized ratio of the subset's mass to its prior probability.
  20. This figure assumes that the dataset of interest has only 4 features.
  21. In the real-world example, mass values would be someone’s opinion of the probability of a certain event.
  22. Storyline for the results: First, point out that no data cleaning, feature engineering, outlier removal, etc. was done, so the performance is all relative. Introduce the results with Titanic, since it is a well-known dataset. The results tend to show that LRDS does slightly better in general on Accuracy than AUC; however, this is the metric we are more interested in. So from an operational standpoint we have had good results.
  23. Precision: "how many selected items are relevant." Recall: "how many relevant items are selected."
  24. PR curves are better when looking at highly skewed datasets. An ROC curve dominates (as compared to other metrics) iff it dominates in PR curves. There are some disadvantages when it comes to optimizing, but PR was a performance indicator.
  27. Three main points of this slide: if there is a big difference across all the performance metrics, it is in LRDS's favor; there has not been a decrease in performance across our metrics for LRDS, with the exception of the false positive rate; and there is no difference in the PR curves, but a large difference in the performance metrics. Note that F1 for LRDS is actually greater; the difference is backwards.