INYAS
India, 2021-08-13
Open Science Principles and Practice
Peter Murray-Rust [1,2]
[1]University of Cambridge
[2]TheContentMine
pm286 AT cam DOT ac DOT uk
Ayush Garg and Shweata N Hegde
Most images CEVOpen
Themes
• Collaboration
• Open Notebook Data Science
• Antagonism to Open
• Mining Open Access for OpenScience
• Ayush Garg and Shweata N Hegde
• CEVOpen: plant literature mining
• The future
TIGR2ESS 2019 Delhi
Food Security FEB 2019
Priya and Kareena interns
JUNE 2020
NIPGR: Developed Virtually over 6 weeks!
INYAS – KARYA-DBT
Delhi 2014
Gita Yadav + colleagues
CEVOpen 2021
https://www.youtube.com/watch?v=XiTngk-POm8&ab_channel=INYASYouTube
Aug 2020
2013
OpenVirus
http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology [1982]: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone. … medical
personnel in Liberian health centers […] may come across active cases and
thus be prepared to avoid [hospital-acquired] epidemics,”
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
https://www.youtube.com/watch?v=XiTngk-POm8&ab_channel=INYASYouTube
Aug 2020
8 miniprojects
Corpus (1000+ docs)
Dictionaries (30000+ terms)
3 complex topics:
- zoonosis (animal hosts)
- non-pharmaceutical (masks, social distancing, etc.)
- test and trace
5 core facets:
- country
- disease
- drug
- funder
- virus
https://github.com/petermr/openVirus/blob/master/Wikicite%20Presentations%20of%20presenters/Wikimedia_Hamadani_2.ipynb
Infectious Diseases (pre-COVID) in scientific literature
Influenza
Measles
Chikungunya
Countries
Ambreen Hamadani 2020
Cooccurrence
Cooccurrence of terms in an article (epidemic corpus)
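Co-occurrence here means that two dictionary terms appear in the same article. A minimal sketch of the counting step follows; it is not the OpenVirus code, and the function name, the sample articles and the term list are hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(articles, terms):
    """Count how often each pair of terms appears in the same article."""
    counts = Counter()
    for text in articles:
        lowered = text.lower()
        # Terms present in this article (simple case-insensitive substring match).
        present = sorted(t for t in terms if t.lower() in lowered)
        # Every unordered pair of present terms co-occurs once in this article.
        counts.update(combinations(present, 2))
    return counts


# Hypothetical mini-corpus and dictionary terms, for illustration only.
articles = [
    "Influenza surveillance in India and Brazil.",
    "Measles outbreaks reported in India.",
    "Chikungunya and influenza co-circulation in Brazil.",
]
terms = ["influenza", "measles", "chikungunya", "India", "Brazil"]

for pair, n in cooccurrence(articles, terms).most_common():
    print(pair, n)
```

Substring matching is the simplest possible matcher; a real dictionary search would need word boundaries and synonyms.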
Our Team
Dr Gitanjali Yadav
Credits: Ambreen Hamadani
Aishwarya Dharan, MSc (Bioinformatics), Central Univ. Punjab
Shweata N. Hegde, BSc Student, Regional Inst. of Edu., Mysore
Ayush Garg, GIIS Singapore
Anugrah S. R., BS-MS (Biological Sciences), IISER Bhopal
Mukul Bhambri, Undergrad, SRM University
Radhu Kantilal Ladani, MSc Bioinformatics, Rajiv Gandhi Institute, Pune
Vasant Kumar, M.Sc. Biotech, Himachal Pradesh University, Summerhill
Talha Hasan, MSc Toxicology, Jamia Hamdard University
Kanishka Parashar, MSc Biotech, Jamia Millia Islamia, Delhi
Chaitanya Sharma, B.Tech (Mechanical), Delhi Technological University
Sagar Jadhav, Postdoctoral Research Associate, NIPGR
Many other global volunteers!
Credits: Ambreen Hamadani
Open Science
(Human Genome Project)
… all DNA sequence data [should] be
released in publicly accessible databases
within twenty-four hours after generation
Classification
Annotation++
Decision Tree
Filtering+
Transcription
Pattern Finding
including astronomy, ecology, cell biology, humanities, and climate science.[5]
Citizen Science
Fukushima
Volunteers map the world’s radiation
Some of PM-R’s data
Safecast.org
And Wikimedia commons
WikiProject
COVID-19
• Laser cutters
• 3D printers
Co-working
Cambridge
Delhi
5 million Open Scientific articles (0.5 TB),
indexed by ContentMine. Disk
30 GBP Raspberry Pi3. 50 GBP
CC BY, PeterMR
Disk
Raspberry PI
Power
CONTAINERISATION!
Opposition to Open
All: 426,613
Open: 21,919
5% is Open to citizens
Is this article relevant to policy makers?
Most science is not Open
Knowledge Neocolonialism/Capitalism
• The result of [knowledge colonialism] is that [corporate]
capital is used for the exploitation rather than for the
development of the less developed parts of the
[knowledge] world.
• Investment, under [knowledge imperialism], increases,
rather than decreases, the gap between the rich and the
poor [scholars] of the world. The struggle against
[knowledge neocolonialism] […] is aimed at preventing the
financial power of the [megacorporations] being used in
such a way as to impoverish the less developed.
• (adapted by PMR from Kwame Nkrumah)
The Publisher-Academic complex[1]
[Diagram labels: Publishers, Academia, Taxpayer, Student, Researcher, Infrastructure, “The scholarly poor”; flows marked $$, MS, review, in-kind, Glory+?]
[1] The Military-Industrial-Academic complex (1961)
(Dwight D Eisenhower, US President)
@Senficon (Julia Reda): Text & Data mining in times of #copyright maximalism:
"Elsevier stopped me doing my research"
http://onsnetwork.org/chartgerink/2015/11/16/elsevier-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
Chris Hartgerink
The READER has been forgotten!
• PDFs (for blind humans and machines)
• Impossible to cut and paste
• Discovery is hugely time-consuming
• Total legal minefield
Open Access is not working
in the interests of the Global South
Adapting Ranganathan:
* Science is for Use
* Save the time of the Reader
(She wants to read a paper every
SECOND)
Why Open?
• Better
• Quicker
• Flexible
• Inclusive and collaborative
• Preservable
• Higher Quality
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
Open Notebook Science. [6]
Jean-Claude Bradley, [2006]
[7]
... there is a URL to a laboratory
notebook that is freely available
and indexed on common search
engines. It does not necessarily
have to look like a paper notebook
but it is essential that all of the
information available to the
researchers to make their
conclusions is equally available to
the rest of the world
Wikipedia CC BY-SA
OpenSourceMalaria (last week!)
Molecules tweeted as soon as they are tested
Immediate public discussion
One of our Open Notebooks
Open Notebook Philosophy*
real-time updates!
Try it!
https://github.com/petermr/CEVOpen
https://github.com/petermr/pygetpapers
https://github.com/petermr/pyami
This Photo by Unknown Author is licensed under CC BY
GITHUB
*https://en.wikipedia.org/wiki/Open-notebook_science
Updated 19-hours ago
Everything on the web!
Shweata Hegde
Our tools: Content Mining!
Scrape -> Clean -> Annotate -> Display
Open sources publish
Tools: pygetpapers, pyami, docanalysis
General! Works in ANY scientific discipline:
• Materials
• Climate
• Medicine
Collaborative projects
This Photo by Unknown Author is licensed under CC BY-ND
Work as a community
Individual tasks
No blame or formal competition
ISSUES: Completed, In-progress, Not-started
Two young scientists give their philosophies
https://www.youtube.com/watch?v=JlXEGyP8bks&ab_channel=TechTitans
Shweata N. Hegde
BSc Student
Regional Inst. Of Edu.
Mysore
Ayush Garg
GIIS Singapore
Lead developer CEVOpen Project Manager and Data Scientist
Mining!
• Build scrapers for Openly readable sources.
• Take user queries for scraping.
• Download raw content.
• Clean and semantify -> Semantic Web.
• Annotate with dictionaries.
• Analyze, display.
Scrape -> Clean -> Annotate -> Display
Open sources publish
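The Scrape -> Clean -> Annotate -> Display chain can be sketched as four small functions. This is an illustrative toy, not the pygetpapers or pyami implementation: every function name, the sample documents and the identifiers `part-1` and `part-2` are hypothetical (only `Q287` for wood comes from the dictionary entry in this deck), and the "scraper" simply filters an in-memory list instead of querying a repository.

```python
import html
import re


def scrape(sources, query):
    """Stand-in for a scraper: select raw documents that mention the query."""
    return [doc for doc in sources if query.lower() in doc.lower()]


def clean(raw):
    """Strip markup, decode entities and collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", raw)
    text = html.unescape(text)
    return re.sub(r"\s+", " ", text).strip()


def annotate(text, dictionary):
    """Return the dictionary terms found in the text, with their identifiers."""
    lowered = text.lower()
    return {term: ident for term, ident in dictionary.items() if term in lowered}


def display(text, hits):
    """Render one line per document: a snippet plus the matched terms."""
    return f"{text[:40]!r} -> {sorted(hits)}"


# Hypothetical inputs, for illustration only.
sources = [
    "<p>Essential oil from <i>Mentha</i> leaf &amp; stem</p>",
    "<p>Unrelated article about bridges</p>",
]
plant_parts = {"leaf": "part-1", "stem": "part-2", "wood": "Q287"}

for raw in scrape(sources, "oil"):
    text = clean(raw)
    print(display(text, annotate(text, plant_parts)))
```

Keeping each stage as a separate function means any one of them can be swapped (a real scraper, a better cleaner) without touching the others.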
CEVOpen Notebook
<entry description="fibrous material from trees or other plants"
name="wood" term="wood"
wikidataURL="http://www.wikidata.org/entity/Q287"
wikidataID="Q287"
wikipediaPage="https://en.wikipedia.org/wiki/Wood">
<synonym xml:lang="zh">木材</synonym>
<synonym xml:lang="de">Holz</synonym>
<synonym xml:lang="ur">لکڑی</synonym>
<synonym xml:lang="hi">लकड़ी</synonym>
<synonym xml:lang="ta">மரம்</synonym>
<synonym xml:lang="es">madera</synonym>
<synonym xml:lang="fr">bois</synonym>
<description xml:lang="de">faseriges Material von Bäumen oder anderen Pflanzen</description>
<description xml:lang="fr">matériau naturel</description>
<description xml:lang="es">material duro y fibroso obtenido de los árboles</description>
</entry>
Wikidata-based Multilingual Dictionary
Wikidata entry
Multilingual
Synonyms
In Unicode
Multilingual
Descriptions
In Unicode
Term (EN)
Generous support from
100 million entries
“plantParts”
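A dictionary entry of this shape can be read with Python's standard library. The sketch below is one way to turn an entry into a lookup from any synonym to its Wikidata ID; it is not the pyami loader, the function name `synonym_index` is hypothetical, and the embedded entry is a shortened copy of the `wood` example.

```python
import xml.etree.ElementTree as ET

# ElementTree expands the built-in xml: prefix to this namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# A shortened copy of the "wood" entry.
ENTRY = """
<entry name="wood" term="wood" wikidataID="Q287">
  <synonym xml:lang="de">Holz</synonym>
  <synonym xml:lang="hi">लकड़ी</synonym>
  <synonym xml:lang="fr">bois</synonym>
</entry>
"""


def synonym_index(entry_xml):
    """Map the English term and every synonym to (language, Wikidata ID)."""
    entry = ET.fromstring(entry_xml)
    wikidata_id = entry.get("wikidataID")
    index = {entry.get("term"): ("en", wikidata_id)}
    for syn in entry.findall("synonym"):
        index[syn.text] = (syn.get(XML_LANG), wikidata_id)
    return index


index = synonym_index(ENTRY)
print(index["Holz"])  # language code and Wikidata ID for the German synonym
```

Because every synonym resolves to the same Wikidata ID, text in any of the listed languages can be annotated against one shared concept.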
Open Access Sources
https://ethos.bl.uk/Home.do
https://www.redalyc.org/
100,000 Theses
4,700,000 abstracts
50,000 preprints
https://doaj.org https://biorxiv.org
https://medrxiv.org
Mexico, Latin America
https://europepmc.org
And your archive?
Europe PubMedCentral
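As an illustration of querying one of these sources, the sketch below builds a search URL for the Europe PMC REST service. It only constructs the URL and does not fetch anything. The endpoint, the `format` and `pageSize` parameters and the `OPEN_ACCESS:y` filter are given as understood from Europe PMC's public documentation and should be checked against the current version; the function name and the sample query are hypothetical.

```python
from urllib.parse import urlencode

# Europe PMC REST search endpoint (verify against the current documentation).
EUROPE_PMC_SEARCH = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def europe_pmc_search_url(query, page_size=25, open_access_only=True):
    """Build a search URL; fetching and parsing the JSON is left to the caller."""
    if open_access_only:
        # Restrict results to Open Access articles.
        query = f"({query}) AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return f"{EUROPE_PMC_SEARCH}?{urlencode(params)}"


# Hypothetical query, for illustration only.
print(europe_pmc_search_url("terpene synthase", page_size=10))
```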
Plants, genes, enzymes, chemicals
This Photo by Unknown Author is licensed under CC BY
location plants
100 papers
Using scientific literature to map
Invasive Species
Tulsi
This Photo by Unknown Author is licensed under CC BY-NC-ND
Kanishka Parashar 2021
What plants produce Carvone?
https://en.wikipedia.org/wiki/Carvone
Carvone in Wikidata
Also SPARQL endpoint
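The question "What plants produce Carvone?" can be put to the Wikidata SPARQL endpoint. The sketch below only builds the request URL. It assumes the compound can be found by its English label, which is fragile (Wikidata may hold separate items for different forms of a compound), and uses property P703 ("found in taxon"). The function name and the template are hypothetical.

```python
from urllib.parse import urlencode

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"

# P703 is the Wikidata property "found in taxon".
QUERY_TEMPLATE = """
SELECT ?taxon ?taxonLabel WHERE {{
  ?compound rdfs:label "{label}"@en ;
            wdt:P703 ?taxon .
  SERVICE wikibase:label {{ bd:serviceParam wikibase:language "en". }}
}}
"""


def taxa_query_url(compound_label):
    """Build a GET URL asking which taxa a compound is recorded as found in."""
    query = QUERY_TEMPLATE.format(label=compound_label)
    params = {"query": query, "format": "json"}
    return f"{WIKIDATA_SPARQL}?{urlencode(params)}"


print(taxa_query_url("carvone"))
```

Looking the compound up by its Wikidata ID instead of its label would be more robust once the ID is known.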
Citrus sinensis
+
Farnesyl pyrophosphate
https://www.uniprot.org/
https://pubchem.ncbi.nlm.nih.gov/
Terpene Synthase (TPS) Phytochemistry
Valencene
Valencene_synthase
Sagar Jadhav 2021
Q71MJ3 model
LINKS | PMCID | PLANT | KEYPOINTS | TPS | COMPOUND
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8224800/ | PMC8224800 | RICE | Sesquiterpene Plays Roles in Antixenosis | OsSTPS2 | https://europepmc.org/articles/PMC8224800/bin/plants-10-01049-s001.zip
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8148558/ | PMC8148558 | Mentha canadensis | Transcriptome Analysis and Monoterpenes | FIGURE, TEXT, TABLE
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8147058/ | PMC8147058 | Apple | Transcriptome and metabolite profiling | MdAFS1 | https://bmcplantbiol.biomedcentral.com/articles/
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7982956/ | PMC7982956 | C. sinensis | Engineered Orange for defense | AtTPS21 | FIGURE
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7968207/ | PMC7968207 | Zea mays | Transcriptomic and volatile | TPS 2, 3, 5, 23 | TABLE
Scoping through TPS Volatiles corpus
Sagar Jadhav 2021
Ambreen Hamadani 2020
Chaitanya Sharma 2021
Classification of Acknowledgement statements
[Confusion matrix: counts 41, 2, 1, 36; cell labels TP, TN, FN, FP]
Future developments: Machine learning
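The standard scores for a binary classifier follow directly from the four confusion-matrix counts. The sketch below shows the arithmetic. The slide text does not make clear which of the four counts (41, 2, 1, 36) belongs to which cell, so the assignment used here is an assumption made only to have numbers to work with, and the printed scores should not be read as the project's reported results.

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from the four confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


# ASSUMPTION: this mapping of the four counts to cells is one possible reading.
scores = metrics(tp=41, fp=2, fn=1, tn=36)
for name, value in scores.items():
    print(f"{name}: {value:.3f}")
```

Whatever the true cell assignment, the four counts sum to 80 classified statements.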
https://github.com/petermr/dictionary/blob/main/tps_data_availability/data_formats_tps.ipynb
Jupyter Notebook Analysis
Future Directions – Supplemental Data
Shweata N Hegde 2021
Future Directions – Federated Scraper?
https://ethos.bl.uk/Home.do
https://www.redalyc.org/
100,000 Theses
4,700,000 abstracts
50,000 preprints
https://doaj.org https://biorxiv.org
https://medrxiv.org
Mexico, Latin America
https://europepmc.org
And your archive?
Europe PubMedCentral
Future Directions – Universal Scraper?
https://ethos.bl.uk/Home.do
https://www.redalyc.org/
100,000 Theses
4,700,000 abstracts
50,000 preprints
https://doaj.org https://biorxiv.org
https://medrxiv.org
Mexico, Latin America
https://europepmc.org
Europe PubMedCentral
Future Directions – new Interns
• 5 INYAS/KARYA interns joining in the next few days!
• 2 interns at NIPGR for 3-6 months
• We’re looking for collaboration
Thanks!
NIPGR, INYAS/KARYA/DBT, Wikimedia
