Workshop overview
• Y/our backgrounds and interests and what we want
• How does mining work and what can it do for YOU/Cochrane?
• Demonstration with emphasis on dictionaries.
• What would YOU like a system to do?
• Your dictionary/ies in action
• Advanced (chemistry, diagram mining)
• ANY early adopter can obtain our (Open) software and run it at
home for any resource (medical, agricultural, government, climate,
etc.). We will help you during the next 24 hours.
• All material CC BY.
Cochrane UK & Ireland
Symposium 2016,
Birmingham, UK, 2016-03-15
Let the Machine Help
with your
Systematic Reviews
Peter Murray-Rust [1,2]
Christopher Kittel [2]
[1] University of Cambridge
[2] TheContentMine
Simple, Universal,
Knowledge creation and re-use
The Right to Read is the Right to Mine (Peter Murray-Rust, 2011)
http://contentmine.org
Resources
• Europe PubMedCentral http://europepmc.org/
• ContentMine toolkit https://github.com/ContentMine/
• Wikidata:
https://www.wikidata.org/wiki/Wikidata:Main_Page
• Hypothes.is https://hypothes.is/ [1]
• Etherpad: http://pads.cottagelabs.com/p/cochrane2016
• Note: early adopters can obtain our (Open) software and
run it at home…
• [1] Not used in CochraneBham workshop
Europe PubMedCentral
CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
[Diagram: labels recovered from the platform overview figure]
• Sources: EPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; Publisher Sites (PLoS ONE, BMC, PeerJ… Nature, IEEE, Elsevier…)
• Discovery: catalogue, getpapers, query, Daily Crawl => URLs, DOIs
• crawl: quickscrape, with scrapers
• Formats: PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS
• norma: Normalizer, Structurer, Semantic Tagger => Text, Data, Figures
• Semantic ScholarlyHTML: abstract, methods, references, Captioned Figures (Fig. 1), HTML tables
• ami: search, Lookup, with queries, taggers and dictionaries => Facts
• CONTENT MINING COMMUNITY plugins: Chem, Phylo, Trials, Crystal, Plants
• Visualization and Analysis
• 30,000 pages/day
Dictionaries!
MINING with sections and dictionaries
[Diagram labels: two documents, each with sections abstract, methods, references, Captioned Figures (Fig. 1), HTML tables; Dict A; Dict B; Image Caption; Table Caption]
[W3C Annotation / https://hypothes.is/ ]
Disease Dictionary (ICD-10)
<dictionary title="disease">
<entry term="1p36 deletion syndrome"/>
<entry term="1q21.1 deletion syndrome"/>
<entry term="1q21.1 duplication syndrome"/>
<entry term="3-methylglutaconic aciduria"/>
<entry term="3mc syndrome"/>
<entry term="corpus luteum cyst"/>
<entry term="cortical blindness" />
SELECT DISTINCT ?thingLabel WHERE {
?thing wdt:P494 ?wd .
?thing wdt:P279 wd:Q12136 .
SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" }
}
wdt:P494 = ICD-10 (P494) identifier
wd:Q12136 = disease (Q12136) abnormal condition that
affects the body of an organism
Wikidata ontology for disease
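A dictionary in the format shown on this slide could be assembled from the labels such a query returns. The following is a minimal sketch, not ContentMine's own tooling: the helper name `build_dictionary` and the lower-casing, de-duplication and sorting are assumptions made for illustration; only the `<dictionary title=…>` / `<entry term=…/>` layout comes from the slides.

```python
import xml.etree.ElementTree as ET


def build_dictionary(title, labels):
    """Return dictionary XML (as a string) for a list of term labels.

    Hypothetical helper: lower-cases, de-duplicates and sorts the labels
    so the output resembles the slide excerpts.
    """
    root = ET.Element("dictionary", {"title": title})
    terms = sorted({label.strip().lower() for label in labels if label.strip()})
    for term in terms:
        ET.SubElement(root, "entry", {"term": term})
    return ET.tostring(root, encoding="unicode")


# Labels as they might come back in the ?thingLabel column of the query.
disease_xml = build_dictionary(
    "disease",
    ["Cortical blindness", "1p36 deletion syndrome", "cortical blindness"],
)
print(disease_xml)
```

Fetching the labels themselves (for example from the Wikidata query service) is left out here so that the sketch runs without network access.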
• ChEBI (chemicals at EBI)
ftp://ftp.ebi.ac.uk/pub/databases/chebi/Flat_file_tab_delimited/names_3star.tsv.gz
• combined with WIKIDATA: World Health Organisation International Nonproprietary Name
(P2275)
=> 4947 items in the dictionary (inn.xml)
DRUGS
<dictionary title="inn">
<entry term="(r)-fenfluramine"/>
<entry term="abacavir"/>
<entry term="abafungin"/>
<entry term="abafungina"/>
<entry term="abafungine"/>
<entry term="abafunginum"/>
<entry term="abamectin"/>
<entry term="abarelix"/>
<entry term="abatacept"/>
<dictionary title="funders">
<!-- from http://help.crossref.org/funder-registry with thanks -->
<entry id="http://dx.doi.org/10.13039/100001436"
term="1675 Foundation"/>
<entry id="http://dx.doi.org/10.13039/100004343"
term="3M"/>
<entry id="http://dx.doi.org/10.13039/501100005957"
term="8020 Promotion Foundation"/>
<entry id="http://dx.doi.org/10.13039/501100007139"
term="A Richer Life Foundation"/>
<entry id="http://dx.doi.org/10.13039/100006543"
term="A World Celiac Community Foundation"/>
<entry id="http://dx.doi.org/10.13039/100001962"
term="A-T Children's Project"/>
<entry id="http://dx.doi.org/10.13039/100008456"
term="A. Alfred Taubman Medical Research Institute"/>
11566 entries
Funders Dictionary
Dengue Mosquito
<dictionary name="genus">
<entry term="Aa"/>
<entry term="Aaaba"/>
<entry term="Aacanthocnema"/>
<entry term="Aaosphaeria"/>
<entry term="Aaptos"/>
<entry term="Aaptosyax"/>
<entry term="Aaroniella"/>
<entry term="Aaronsohnia"/>
<entry term="Abablemma"/>
Genera from NCBI TaxDump
<dictionary title="hgnc">
<entry term="A1BG" name="alpha-1-B glycoprotein"/>
<entry term="A1BG-AS1" name="A1BG antisense RNA 1"/>
<entry term="A1CF"
name="APOBEC1 complementation factor"/>
<entry term="A2M" name="alpha-2-macroglobulin"/>
<entry term="A2M-AS1"
name="A2M antisense RNA 1 (head to head)"/>
<entry term="A2ML1" name="alpha-2-macroglobulin-like 1"/>
<entry term="A2ML1-AS1" name="A2ML1 antisense RNA 1"/>
Human Genes (HGNC)
<entry term="Aaas"
name="achalasia, adrenocortical insufficiency, alacrimia"/>
<entry term="Aacs" name="acetoacetyl-CoA synthetase"/>
<entry term="Aadac"
name="arylacetamide deacetylase (esterase)"/>
<entry term="Aadacl2"
name="arylacetamide deacetylase-like 2"/>
<entry term="Aadacl3"
name="arylacetamide deacetylase-like 3"/>
<entry term="Aadat" name="aminoadipate aminotransferase"/>
<entry term="Aaed1"
name="AhpC/TSA antioxidant enzyme domain containing 1"/>
<entry term="Aagab"
name="alpha- and gamma-adaptin binding protein"/>
<entry term="Aak1" name="AP2 associated kinase 1"/>
<entry term="Aamdc"
name="adipogenesis associated Mth938 domain containing"/>
<entry term="Aamp"
name="angio-associated migratory protein"/>
Mouse genes (JAXson)
Ebola!
<dictionary title="tropicalVirus">
<entry term="ZIKV" name="Zika virus"/>
<entry term="Zika" name="Zika virus"/>
<entry term="DENV" name="Dengue virus"/>
<entry term="Dengue" name="Dengue virus"/>
<entry term="CHIKV" name="Chikungunya virus"/>
<entry term="Chikungunya" name="Chikungunya virus"/>
<entry term="WNV" name="West Nile virus"/>
<entry term="West Nile" name="West Nile virus"/>
<entry term="YFV" name="Yellow fever virus"/>
<entry term="Yellow fever" name="Yellow fever virus"/>
<entry term="HPV" name="Human papilloma virus"/>
<entry term="Human papilloma virus"
name="Human papilloma virus"/>
</dictionary>
Terms co-occurring with “Zika”
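To make "dictionary in action" concrete, here is a minimal sketch of matching a dictionary against text. It is an illustration under stated assumptions, not how ami is implemented: the function names `load_terms` and `count_matches`, the whole-word case-sensitive matching, and the sample sentence are all invented for this example. The XML is a shortened copy of the tropicalVirus dictionary.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

DICTIONARY_XML = """
<dictionary title="tropicalVirus">
  <entry term="ZIKV" name="Zika virus"/>
  <entry term="Zika" name="Zika virus"/>
  <entry term="DENV" name="Dengue virus"/>
  <entry term="Dengue" name="Dengue virus"/>
</dictionary>
"""


def load_terms(xml_text):
    """Map each entry's term to its name (or to the term itself if unnamed)."""
    root = ET.fromstring(xml_text)
    return {e.get("term"): e.get("name", e.get("term")) for e in root.iter("entry")}


def count_matches(text, terms):
    """Count whole-word, case-sensitive matches, grouped by entry name."""
    counts = Counter()
    for term, name in terms.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        counts[name] += len(re.findall(pattern, text))
    return counts


# Invented sample sentence, not taken from any paper.
sample = "ZIKV and DENV co-circulate; Zika infection can resemble Dengue."
hits = count_matches(sample, load_terms(DICTIONARY_XML))
print(hits)
```

Grouping by `name` means that the abbreviation and the full word count towards the same virus, which is the point of listing both as entries.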
<dictionary title="cochrane">
<entry term="Cochrane Library"/>
<entry term="Cochrane Reviews"/>
<entry
term="Cochrane Central Register of Controlled Trials"/>
<entry term="Cochrane"/>
<entry term="randomize"/>
<entry term="meta-analysis"/>
<entry term="Embase"/>
<entry term="MEDLINE"/>
<entry term="eligibility"/>
<entry term="exclusion"/>
<entry term="outcome"/>
<entry term="Review Manager"/>
<entry term="STATA"/>
<entry term="RCT"/>
</dictionary>
Terms lexically related to “meta-analysis”
Mining strategy
• Discover, negotiate permissions => bibliography
• Crawl / Scrape (download) documents AND
supplemental material
• Normalize. PDF => XML
• Index: facets => Facts and snippets (“entities”)
• Interpret/analyze entities => relationships,
aggregations (“Transformative”)
• Publish
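The "Index" step above turns matches into facts with snippets. A minimal sketch of that idea follows, assuming a fact is simply a matched term plus a few characters of context on either side. The function name `extract_snippets`, the field names `pre`/`exact`/`post`, and the sample sentence are assumptions for illustration.

```python
import re


def extract_snippets(text, term, width=30):
    """Return one dict per whole-word match, with `width` chars of context."""
    snippets = []
    for m in re.finditer(r"\b" + re.escape(term) + r"\b", text):
        snippets.append({
            "pre": text[max(0, m.start() - width):m.start()],
            "exact": m.group(0),
            "post": text[m.end():m.end() + width],
        })
    return snippets


# Invented sample sentence using two terms from the "cochrane" dictionary.
text = "We searched MEDLINE and Embase for trials; MEDLINE returned more records."
facts = extract_snippets(text, "MEDLINE", width=12)
for fact in facts:
    print(fact)
```

Keeping the surrounding context with each match lets a reviewer judge a hit without opening the full paper.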
Demo
PMR runs getpapers and ami
Chris runs Python visualization of drug co-occurrence
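The demo code itself is not reproduced in these slides. As a rough sketch of the counting that a co-occurrence visualization would need, the following counts how many papers mention each pair of terms. The function name `cooccurrence` and the per-paper term lists are hypothetical; the drug names are taken from the inn dictionary excerpt.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(papers):
    """Count how many papers mention each unordered pair of terms."""
    pairs = Counter()
    for terms in papers.values():
        # De-duplicate within a paper, then sort so (a, b) == (b, a).
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


# Hypothetical per-paper matches against the "inn" dictionary.
papers = {
    "paper1": ["abacavir", "abatacept", "abacavir"],
    "paper2": ["abacavir", "abatacept"],
    "paper3": ["abamectin", "abacavir"],
}
pairs = cooccurrence(papers)
for pair, n in pairs.most_common():
    print(pair, n)
```

The resulting pair counts could be fed to any plotting or graph library as edge weights.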
Systematic Reviews
Can we:
• eliminate true negatives automatically?
• extract data from formulaic language?
• mine diagrams?
• annotate existing sources?
• forward-reference clinical trials?
Polly has 20 seconds to read this paper…
…and 10,000 more
ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What has
happened in the last 6 years?
Search the whole scientific literature
For “2009-0100068-41”
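Searching for one known identifier is a literal string search. To find trial identifiers in general, a pattern per registry could be used. The sketch below is an assumption-laden illustration: it covers only three identifier shapes (NCT plus 8 digits, ISRCTN plus 8 digits, and the EudraCT YYYY-NNNNNN-CC form), the function name `find_trial_ids` is hypothetical, and the identifiers in the sample text are invented.

```python
import re

# Assumed registry identifier shapes (illustrative, not exhaustive).
TRIAL_ID_PATTERNS = {
    "ClinicalTrials.gov": re.compile(r"\bNCT\d{8}\b"),
    "ISRCTN": re.compile(r"\bISRCTN\d{8}\b"),
    "EudraCT": re.compile(r"\b\d{4}-\d{6}-\d{2}\b"),
}


def find_trial_ids(text):
    """Return sorted (registry, identifier) pairs found in the text."""
    found = set()
    for registry, pattern in TRIAL_ID_PATTERNS.items():
        for match in pattern.findall(text):
            found.add((registry, match))
    return sorted(found)


# Invented identifiers, used only to exercise the patterns.
sample = "Registered as NCT01234567 and EudraCT 2010-123456-12 (ISRCTN12345678)."
ids = find_trial_ids(sample)
print(ids)
```

Running this over every mined paper would give a trial-to-paper mapping of the kind the slide asks for, subject to the patterns actually matching how authors write the identifiers.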
Diagram Mining
[Plot: y-axis "Ln Bacterial load per fly" (ticks 11.5, 11.0, 10.5, 10.0, 9.5, 9.0, 6.5, 6.0); x-axis "Days post-infection" (ticks 0 1 2 3 4 5)]
Bitmap Image and Tesseract OCR
Cochrane workshop 2016