Knowledge Assembly at Scale
with Semantic and Probabilistic Techniques
Szymon Klarman
Department of Computer Science
Brunel University London
Connected Data London 2016
Scientific publishing deluge
• 50 million papers published since 1665
• 2.5 million papers published last year
• publication output doubling every 9 years
Effects:
• narrowing of science
  - we cite a small pool of mostly recent papers
• fragmentation of expertise
  - nobody understands the “big picture” anymore
• quality of results affected
  - many experimental results are non-reproducible
Big Mechanism
Reading → Assembly → Explanation
The goal: develop AI technology for extracting large mechanistic (causal) models and
making them available to computational agents
Challenges
• ambiguity, vagueness, modality of natural language
• general quality and reliability of the sources
• the inaccuracy of the information extraction tools
• the typical “Vs” of big data: volume, variety, volatility, velocity
• inconsistent, inconclusive or non-reproducible results
• gaps, omissions, contextual assumptions
In vitro curcumin downregulated the expression of
Bcl-2, and Bcl-XL and upregulated the expression of
p53, Bax, Bak, PUMA, Noxa, and Bim at mRNA and protein
levels in prostate cancer cells [14].
Knowledge assembly
[Figure: pipeline from evidence to knowledge; step labels: extraction, reconciliation, filtering, aggregation, model formation. Example: “[…] A can associate with B […]” is assembled into <A binding B>]
Knowledge assembly is a process of reconstructing complex knowledge from contextually
asserted atomic statements and data fragments (evidence).
Probabilistic knowledge assembly
[Figure: loop with the labels extraction, evidence, assembly, (probabilistic) knowledge, probabilistic inference, learning, model updates, expert input]
Example annotations in the figure:
• “A can associate with B”
• extraction accuracy = 0.7
• published in: “Molecular Cancer”
• <A binding B> is supported to degree 0.7
• Evidence contradicts the model to degree 0.7
• <A binding B> is experimentally confirmed
In the Probabilistic Knowledge Assembly framework, evidence with all its contextual information
is part of the knowledge base, which enables a continuous update-assembly loop.
Linked data resources
• ontologies:
  - biomedical (GO, BioPAX, MI)
  - uncertainty (UNO)
  - information/document/provenance description (IAO, PROV-O, VoID, Dublin Core)
• (linked) open data via SPARQL endpoints and APIs:
  - PubMed
  - journal rankings (SCImago)
  - bioinformatics databases (UniProt, ChEBI, HGNC)
• unique identifiers:
  - biochemical entities
  - journals / articles
Knowledge graph: data model
[Figure: data model of the knowledge graph]
• node types: Statement, Event, Biochemical entity / Event, Molecular interaction, Article, Journal, Textual evidence, Truth value evidence, Uncertainty level
• relations: represents, is extracted from, published in, has participant, type, has evidence, has truth value, has uncertainty (of type X)
Knowledge graph: example
Source text: “[...] In addition, GRB2 can associate with GAB1 [...]”
[Figure: knowledge graph assembled from the source text, built up over several slides]
• statement_1 (type: Statement)
  - textual evidence: “In addition, GRB2 can associate with GAB1”
  - truth value: True
  - extraction prob: 0.8
  - extracted from: PMC123456 (type: Article), provenance prob: 0.7
  - represents: GRB2 binding GAB1
• statement_..99 (type: Statement)
  - textual evidence: “GRB2 does not interact directly with GAB1”
  - truth value: False
  - extractedFrom: PMC654321 (type: Article)
  - provenance prob, extraction prob: 0.6, 0.7
  - represents: GRB2 binding GAB1
• GRB2 binding GAB1 (type: Binding; Binding is a subclass of Event)
  - has participant A: GRB2_MOUSE (type: Protein)
  - has participant B: GAB1_MOUSE (type: Protein)
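As a rough illustration, the example graph can be written down as plain subject-predicate-object triples. The sketch below is not the framework's actual storage layer: the predicate strings simply mirror the figure labels rather than identifiers from any real vocabulary, the helper functions `objects` and `statements_about` are hypothetical, and the provenance probability is left out because the figure does not make clear which node carries it.

```python
# Hypothetical, simplified encoding of the example graph as
# (subject, predicate, object) triples. Predicate names mirror the
# figure labels; they are NOT identifiers from a real ontology.
TRIPLES = [
    ("statement_1", "type", "Statement"),
    ("statement_1", "represents", "GRB2 binding GAB1"),
    ("statement_1", "extracted from", "PMC123456"),
    ("statement_1", "textual evidence",
     "In addition, GRB2 can associate with GAB1"),
    ("statement_1", "truth value", True),
    ("statement_1", "extraction prob", 0.8),
    ("PMC123456", "type", "Article"),
    ("GRB2 binding GAB1", "type", "Binding"),
    ("Binding", "subclass of", "Event"),
    ("GRB2 binding GAB1", "has participant A", "GRB2_MOUSE"),
    ("GRB2 binding GAB1", "has participant B", "GAB1_MOUSE"),
    ("GRB2_MOUSE", "type", "Protein"),
    ("GAB1_MOUSE", "type", "Protein"),
]


def objects(subject, predicate, triples=TRIPLES):
    """Return all objects linked to `subject` via `predicate`."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def statements_about(event, triples=TRIPLES):
    """Return all statements that represent the given event."""
    return [s for s, p, o in triples if p == "represents" and o == event]


if __name__ == "__main__":
    for stmt in statements_about("GRB2 binding GAB1"):
        print(stmt, objects(stmt, "truth value"), objects(stmt, "extraction prob"))
```

Keeping the evidence (quote, source article, extraction score) next to the statement is what lets the event be re-assessed later when new statements arrive.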
So what can we really say about
the truth of events?
Support aggregation
event = <A binding B>
[Figure: chart on a 0 to 1 scale showing positive support, negative support and inconsistency for the evidence sets {s1}, {s1, s2}, {s1, s2, s3}]
Statement              Extraction accuracy   Provenance uncertainty
S1 = event is true     0.8                   0.7
S2 = event is false    0.8                   0.7
S3 = event is false    0.9                   0.6
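The slides do not spell out the aggregation formula, so the following is only one plausible sketch of how support could be aggregated over the three statements. It assumes (a) that a statement's weight is the product of its extraction accuracy and its provenance score, read here as a trust value, (b) that statements are independent and combine by noisy-OR, and (c) that inconsistency is the product of positive and negative support. All three are assumptions made for illustration, not the talk's actual model.

```python
from math import prod

# (asserted truth value, extraction accuracy, provenance score),
# taken from the statement table on the slide.
STATEMENTS = {
    "s1": (True, 0.8, 0.7),
    "s2": (False, 0.8, 0.7),
    "s3": (False, 0.9, 0.6),
}


def reliability(extraction, provenance):
    # ASSUMPTION: the two scores are independent, so they multiply.
    return extraction * provenance


def noisy_or(weights):
    # Probability that at least one independent piece of evidence holds.
    return 1.0 - prod(1.0 - w for w in weights)


def aggregate(names):
    """Aggregate positive/negative support over the named statements."""
    chosen = [STATEMENTS[n] for n in names]
    pos = noisy_or([reliability(e, p) for truth, e, p in chosen if truth])
    neg = noisy_or([reliability(e, p) for truth, e, p in chosen if not truth])
    # ASSUMPTION: inconsistency = both sides are supported at once.
    return {"positive": pos, "negative": neg, "inconsistency": pos * neg}


if __name__ == "__main__":
    for evidence in (["s1"], ["s1", "s2"], ["s1", "s2", "s3"]):
        print(evidence, aggregate(evidence))
```

Under these assumptions, positive support stays at 0.56 across all three evidence sets, while negative support and inconsistency grow as s2 and s3 are added.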
Total uncertainty aggregation
[Figure: Bayes-net-style diagram; node labels: Doc_1, Doc_2, Doc..., Stat_1, Stat_2, Stat..., Provenance uncertainty, Extraction accuracy, Textual uncertainty, Document part weight, Positive support, Negative support, Event likelihood]
A probabilistic model (~Bayes net) over linked data, expressed via probabilistic logic
programming (ProbLog).
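To give a feel for what a probabilistic logic program computes, here is a small pure-Python sketch of the possible-worlds semantics that ProbLog-style programs are based on: independent probabilistic facts, deterministic rules on top of them, and a query probability obtained by summing the weights of the worlds in which the query holds. It does not use the ProbLog library, and the fact names (`extraction_ok`, `source_ok`) and rules are hypothetical stand-ins, not the model from the talk.

```python
from itertools import product

# Probabilistic facts: name -> probability of being true.
# Names are hypothetical; the values reuse S1 and S2 from the slides.
FACTS = {
    "extraction_ok(s1)": 0.8,
    "source_ok(s1)": 0.7,
    "extraction_ok(s2)": 0.8,
    "source_ok(s2)": 0.7,
}
# Truth value each statement assigns to the event <A binding B>.
ASSERTS = {"s1": True, "s2": False}


def reliable(world, s):
    # Rule: reliable(S) :- extraction_ok(S), source_ok(S).
    return world[f"extraction_ok({s})"] and world[f"source_ok({s})"]


def supported(world):
    # Rule: the event is supported if a reliable statement asserts it.
    return any(reliable(world, s) for s, v in ASSERTS.items() if v)


def contradicted(world):
    # Rule: the event is contradicted if a reliable statement denies it.
    return any(reliable(world, s) for s, v in ASSERTS.items() if not v)


def probability(query):
    """Exact inference by enumerating every truth assignment to the facts."""
    names = list(FACTS)
    total = 0.0
    for values in product([True, False], repeat=len(names)):
        world = dict(zip(names, values))
        weight = 1.0
        for n in names:
            weight *= FACTS[n] if world[n] else 1.0 - FACTS[n]
        if query(world):
            total += weight
    return total


if __name__ == "__main__":
    print(probability(supported))
    print(probability(lambda w: supported(w) and not contradicted(w)))
```

Enumeration is exponential in the number of facts, so it only serves to illustrate the semantics; a real probabilistic logic programming engine uses far more efficient inference.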
Expert input
[Figure: diagram with the node labels Extraction Accuracy, Provenance Uncertainty, Total Uncertainty, Experimental Confirmation]
Experimental Confirmation   T     F     -
                            0.9   0.1   0.5

Molecule   Interaction           Gene           Total Uncertainty Before Experiment   Experimental Confirmation   Total Uncertainty After Experiment
curcumin   negative regulation   BCL2_MOUSE     0.3941                                TRUE                        0.7489
curcumin   positive regulation   P53_HUMAN      0.3924                                FALSE                       0.1569
curcumin   negative regulation   Q9H014_HUMAN   0.3929                                -                           0.3929
...        ...                   ...            ...                                   ...                         ...
Big Mechanism technology
We need technology for extracting and operationalizing Big Mechanisms:
ecosystems, brains, economic and social systems, etc.
The Probabilistic Knowledge Assembly framework offers:
• a generic solution for scalable and flexible knowledge assembly
• a uniform knowledge representation model and data access interface based on
standardized tools and technologies (particularly W3C standards)
• declarative formalisms that facilitate provenance tracking
• a continuous update-assembly loop for dynamic environments
(see: http://52.26.26.74/)
szymon.klarman@gmail.com
http://52.26.26.74/
Thank you!
