Trinity College Dublin, IE,
“virtual”, 2020-07-01
openBattery
Materials Knowledge for citizens in the time of climate change
Peter Murray-Rust
Dept of Chemistry, University of Cambridge, and contentmine.org
Images from ContentMine CC BY and Wikimedia CC BY-SA
pm286@cam.ac.uk
peter@contentmine.org
Automatic extraction of materials and properties reported in the scientific literature.
Are you interested in having machines automatically search and retrieve data? Free?
ContentMine is OpenLocked Non-Profit http://contentmine.org
The Right to Read is the Right to Mine
openVirus collaborators
Remko Popma,
Lezan Hawizy, Tim Voronov,
Andy Jackson,
Clyde Davies,
Thomas Shafee,
Priya JK, Kareena Singh,
Simon Worthington,
Open-battery
Matthew Dunstan
The world’s existential problems need
knowledge
2019* “Open Climate Knowledge” (OCK)
to mine scientific articles about climate change.
50-90% of all published science is PAYWALLED. The rest is very hard to
find…
*Simon Worthington and PMR
BUT COVID-19 hit …
Ebola in West Africa was forecast 35 years ago …
… was COVID-19 also forecast?
… or its mitigation suggested?
Delhi, IN
Priya and Kareena are 3rd year interns on openVirus
PMR
Gitanjali Yadav
Mining!
• Build scrapers for openly readable sources.
• Take user queries for scraping.
• Download raw content.
• Clean and semantify.
• Annotate with dictionaries.
• Analyze, display.
Scrape -> Clean -> Annotate -> Display
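The four stages above can be sketched in a few lines of Python. This is only an illustration of the Scrape -> Clean -> Annotate -> Display idea, not the ami or getpapers implementation: the function names are hypothetical, the "scraped" input is an in-memory list standing in for downloaded content, and the cleaner is a crude tag stripper rather than a real HTML parser.

```python
import html
import re
from collections import Counter


def clean(raw):
    """Clean stage: drop markup, decode entities, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)   # crude tag removal (assumption: simple HTML)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def annotate(text, terms):
    """Annotate stage: count whole-word, case-insensitive hits per dictionary term."""
    counts = Counter()
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        hits = len(re.findall(pattern, text, flags=re.IGNORECASE))
        if hits:
            counts[term] = hits
    return counts


def display(counts):
    """Display stage: one 'term<TAB>count' line per term, most frequent first."""
    return "\n".join(f"{term}\t{n}" for term, n in counts.most_common())


def run_pipeline(raw_docs, terms):
    """Run clean and annotate over already-scraped documents and total the hits."""
    total = Counter()
    for raw in raw_docs:
        total.update(annotate(clean(raw), terms))
    return total


if __name__ == "__main__":
    scraped = [  # stand-in for the Scrape stage
        "<p>The anode &amp; the cathode.</p>",
        "<p>A new  cathode   material</p>",
    ]
    print(display(run_pipeline(scraped, ["anode", "cathode"])))
```

Keeping each stage as a separate function mirrors the slide: any stage can be swapped for a real tool without touching the others.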
Open sources publish
Sources
https://ethos.bl.uk/Home.do
https://www.redalyc.org/
100,000 Theses
4,700,000 abstracts
50,000 preprints
https://doaj.org https://biorxiv.org
https://medrxiv.org
Mexico, Latin America
https://europepmc.org
And your archive?
“(virus OR viral) AND
epidemic”
45 hits
DOAJ
Directory of Open Access Journals
100,000 abstracts
Only 4.6 million
more to go 0.05%
20 GB total
Clyde Davies
Complete repo would yield
> 2000 articles
The power is with the READER
framework: ami + CProject data
scrapers: getpapers, Ferret, curl, scrapy
cleaners: PDFBox, Tidy/Jsoup, Grobid, etc.
transformers: xml2html, ami ocr, KNIME
dictionaries: ami dictionary
indexing and annotation: Solr, ami
Analysis and display: R, KNIME
openVirus Tools
scrape clean annotate display
Dictionaries
materials.xml
country.xml
Generous support from
<entry term="anode" wikipedia="anode" wikidata="Q181232" name="anode" description="el
<entry term="castep" wikidata="Q5008880" name="CASTEP" description="density functional
<entry term="cathode" wikipedia="cathode" wikidata="Q175233" name="cathode" descriptio
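As a rough illustration of how such entries can drive annotation, the sketch below loads a small dictionary and tags matching words in a sentence with their Wikidata IDs. The `term`, `name` and `wikidata` values are taken from the three entries above; the enclosing `<dictionary>` root element, the omission of the truncated `description` attributes, and the function names are assumptions made for the example, and this is not the ami code itself.

```python
import re
import xml.etree.ElementTree as ET

# Assumed wrapper element; only attributes visible on the slide are reproduced.
DICTIONARY_XML = """
<dictionary title="materials">
  <entry term="anode" wikipedia="anode" wikidata="Q181232" name="anode"/>
  <entry term="castep" wikidata="Q5008880" name="CASTEP"/>
  <entry term="cathode" wikipedia="cathode" wikidata="Q175233" name="cathode"/>
</dictionary>
"""


def load_dictionary(xml_text):
    """Map each lower-cased term to its Wikidata ID."""
    root = ET.fromstring(xml_text.strip())
    return {e.get("term").lower(): e.get("wikidata") for e in root.iter("entry")}


def annotate(text, dictionary):
    """Return (offset, matched text, Wikidata ID) for every whole-word hit."""
    found = []
    for term, qid in dictionary.items():
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            found.append((match.start(), match.group(0), qid))
    return sorted(found)


if __name__ == "__main__":
    materials = load_dictionary(DICTIONARY_XML)
    for hit in annotate("CASTEP was used to model the cathode.", materials):
        print(hit)
```

Linking each term to a Wikidata ID means a hit carries a language-independent identifier rather than just a string.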
getpapers -q "lithium-ion battery" -n
info: Searching using eupmc API
info: Found 3305 open access results
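For scripted use, the same call can be assembled in Python before being handed to a shell or to `subprocess`. The helper below is a hypothetical convenience wrapper, not part of getpapers; it only builds the argument list. The `-q` and `-n` flags are the ones shown above, while `-o` (output directory), `-k` (result limit) and `-x` (fulltext XML) are getpapers options as best recalled and should be checked against `getpapers --help`.

```python
import shlex


def getpapers_command(query, outdir=None, limit=None, xml=False, dry_run=False):
    """Build a getpapers argument list (hypothetical helper; does not run anything)."""
    args = ["getpapers", "-q", query]
    if outdir is not None:
        args += ["-o", outdir]      # output directory
    if limit is not None:
        args += ["-k", str(limit)]  # cap on the number of results
    if xml:
        args.append("-x")           # request fulltext XML
    if dry_run:
        args.append("-n")           # report the hit count without downloading
    return args


if __name__ == "__main__":
    cmd = getpapers_command("lithium-ion battery", dry_run=True)
    print(shlex.join(cmd))  # shell-quoted form of the command on the slide
```

Passing the list form to `subprocess.run` avoids shell-quoting problems with queries that contain spaces or boolean operators.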
Wikipedia
Wikidata
UniqueID
We need Wikimedia!
Dashboard of 200 articles searched with 5 dictionaries
Co-occurrence of words
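One simple way to compute such a co-occurrence table, assuming each article has already been reduced to the set of dictionary terms found in it, is to count how many articles contain each pair of terms. The function name and the toy input are illustrative only; the dashboard itself may compute this differently.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(doc_terms):
    """Count, for every unordered pair of terms, the documents containing both."""
    pairs = Counter()
    for terms in doc_terms:
        # Sorting gives each pair one canonical (alphabetical) key.
        for a, b in combinations(sorted(set(terms)), 2):
            pairs[(a, b)] += 1
    return pairs


if __name__ == "__main__":
    articles = [  # toy data: terms found per article
        {"anode", "cathode"},
        {"anode", "cathode", "castep"},
        {"castep"},
    ]
    for (a, b), n in sorted(cooccurrence(articles).items()):
        print(f"{a} + {b}: {n}")
```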
Extraction of data from diagrams
ami-image
ami-pixel
Extraction of data from diagrams
Publishers are a major problem
• http://github.com/petermr/openVirus
• pm286@cam.ac.uk
• https://github.com/the-grey-group/open-battery/
• Thanks to Matthew Dunstan
• These slides are on slideshare and the event is recorded on YouTube – address to
follow.

Automatic mining of data from materials science literature
