ContentMining in Neuroscience
1. Open Mining of the Bioscience Literature
Peter Murray-Rust,
ContentMine.org and the University of Cambridge
UNAM, MX 2015-10-09
Millions of data points are hidden in the bioscience literature.
ContentMine has Open technology to liberate them automatically.
Using OpenNotebook approaches
The major problem is politico-legal
This is an exploratory talk, looking for ideas and projects
The future depends on young people
4. Some particularly relevant Fellows/Alumni and projects:
• Rufus Pollock: Open Knowledge Foundation
• Mark Surman: Mozilla
• Dan Whaley: Hypothes.is
• Daniel Lombrana-Gonzales: PyBossa/Crowdcrafting
Erin McKiernan, 2015 Flash Award
ContentMine and Peter Murray-Rust are funded by:
5. The Right to Read is the Right to Mine
http://contentmine.org
6. ContentMine Workshops and Hackdays
Open Science Brazil, 2014-08
Easily distributed software
Get started in 30 mins
Build application in a morning
Start simple: bagOfWords, Stemming, Regex, templates
8. Why do we publish science?
• Communicate our results
• Archival
• Get feedback from peers.
• Provide material that others can re-use.
• Priority and esteem.
10. http://www.nytimes.com/2015/04/08/opinion/yes-we-were-warned-about-ebola.html
[Liberian Ministry of Health] were stunned recently when we stumbled across
an article by European researchers in Annals of Virology [1982]: "The results
seem to indicate that Liberia has to be included in the Ebola virus endemic
zone." In the future, the authors asserted, "medical personnel in Liberian health
centers should be aware of the possibility that they may come across active
cases and thus be prepared to avoid nosocomial epidemics," referring to
hospital-acquired infection.
Adage in public health: "The road to inaction is paved with research
papers."
Bernice Dahn (chief medical officer of Liberia's Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
16. Output of scholarly publishing
586,364 Crossref DOIs 201,507 [1] per month
1.5 million (papers + supplemental data) / year [citation needed]*
each 3 mm thick
→ 4500 m high per year [2]
* Most is not publicly readable
[1] http://www.crossref.org/01company/crossref_indicators.html
[2] https://en.wikipedia.org/wiki/Mont_Blanc#/media/File:Mont_Blanc_depuis_Valmorel.jpg
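The stack-height figure on this slide follows from simple arithmetic, sketched here in Python. The function and constant names are illustrative, and the inputs are the slide's own rough estimates rather than measured values.

```python
# Inputs are the slide's own rough estimates, not measured values.
ITEMS_PER_YEAR = 1_500_000   # papers + supplemental data per year
THICKNESS_MM = 3             # assumed thickness of each printed item


def stack_height_m(items: int, thickness_mm: float) -> float:
    """Height in metres of a stack of `items` documents."""
    return items * thickness_mm / 1000


if __name__ == "__main__":
    print(stack_height_m(ITEMS_PER_YEAR, THICKNESS_MM))  # 4500.0
```

For scale, the slide's illustration is Mont Blanc, which is roughly 4800 m high.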
17. Scientific and Medical publication (STM)[+]
• World Citizens pay $450,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to "publish" [*]…
• … $10,000,000,000 from academic libraries …
• … to "publishers" who forbid access to 99.9% of citizens of the world …
• 85% of medical research is wasted (not published, badly conceived, duplicated, …) [Lancet 2009]
[+] Figures probably ±50%
[*] arXiv preprint server costs $7 USD per paper
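The per-article figures on this slide can be checked with a few divisions. The sketch below uses only the slide's numbers. Reading the $7000 "publish" cost as library spend divided by article count is an assumption, since the slide does not show its working.

```python
# All inputs are the slide's rough figures, which carry a wide error margin.
RESEARCH_SPEND_USD = 450_000_000_000   # paid by world citizens
ARTICLES = 1_500_000                   # articles produced
LIBRARY_SPEND_USD = 10_000_000_000     # paid by academic libraries
ARXIV_COST_PER_PAPER_USD = 7           # arXiv figure quoted on the slide


def per_article(total_usd: int, articles: int) -> float:
    """Average cost per article in USD."""
    return total_usd / articles


if __name__ == "__main__":
    create = per_article(RESEARCH_SPEND_USD, ARTICLES)
    publish = per_article(LIBRARY_SPEND_USD, ARTICLES)
    print(f"create:  ${create:,.0f} per article")
    print(f"publish: ${publish:,.0f} per article")
    print(f"publish cost vs arXiv: {publish / ARXIV_COST_PER_PAPER_USD:,.0f}x")
```

The creation cost comes out at exactly $300,000. The publishing cost comes out a little under $7000, which is consistent with the slide's rounded figure.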
19. ContentMine approaches
0. Open software, Open content, Open notebooks
1. Daily liberation of facts which are easy and widely useful.
– Species (Bacillus subtilis, Okapia johnstoni)
– Genes (BRCA1*, APOE)
– Chemicals (acetone, CH3OH)
– Identifiers (RRIDs, museum specimens)
2. CMunities of practice with bespoke tools:
– Clinical Trials
– Phylogenetic trees
– Systematic reviews
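A minimal sketch of how "easy" facts of these types might be pulled from text, using dictionary lookup plus one regex. This is not ContentMine's implementation. The dictionaries contain only the examples named on the slide, the RRID pattern is a guess at the identifier's shape, and the sample RRID in the usage note is made up.

```python
import re
from typing import List, Tuple

# Tiny illustrative dictionaries built only from the examples on the slide.
# A real run would load much larger vocabularies; these are placeholders.
DICTIONARIES = {
    "species": ["Bacillus subtilis", "Okapia johnstoni"],
    "gene": ["BRCA1", "APOE"],
    "chemical": ["acetone", "CH3OH"],
}

# Assumed RRID shape: "RRID:" followed by a registry prefix, an underscore
# and an accession. This pattern is a guess, not an official grammar.
RRID_PATTERN = re.compile(r"\bRRID:\s?[A-Za-z]+_[A-Za-z0-9_-]+")


def extract_facts(text: str) -> List[Tuple[str, str]]:
    """Return (fact_type, matched_text) pairs found in `text`."""
    facts = []
    for fact_type, terms in DICTIONARIES.items():
        for term in terms:
            # Word boundaries stop "APOE" matching inside a longer token.
            if re.search(r"\b" + re.escape(term) + r"\b", text):
                facts.append((fact_type, term))
    facts.extend(("identifier", m.group(0)) for m in RRID_PATTERN.finditer(text))
    return facts
```

Calling `extract_facts("Bacillus subtilis was grown in acetone; antibody RRID:AB_123456 was used.")` would yield one species, one chemical and one identifier. Dictionary lookup is chosen here because it is simple to inspect and extend, at the cost of missing any term not in the list.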
21. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
22. C) What's the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
28. "adult nonpregnant patients, aged ≥18 years",
"randomization sequence using a permuted block design with random
block sizes stratified by study center".
"blinding of the patients and caregivers is not possible".
"Investigators performing analysis are blinded for the intervention".
"Continuous normally distributed variables … mean and standard deviation,
counts (n) and percentages (%). … Student's t-test … or the Mann–Whitney U test
… Categorical … Chi-square test or Fisher's exact tests. Statistical significance is
considered to be at a P value <0.05 …"
Formulaic language in reporting clinical trials
29. Text-based plugins
• Bag of words (https://en.wikipedia.org/wiki/Bag-of-words_model)
• Tf–idf: term frequency, inverse document frequency (https://en.wikipedia.org/wiki/Tf%E2%80%93idf)
• Templates and regexes (regular expressions).
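The first two techniques on this slide fit in a few lines of standard-library Python. The sketch below is a generic illustration and not the ContentMine plugin code. The function names are assumptions, and the tf-idf weighting shown is one common unsmoothed variant among several.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lower-case the text, split on non-letters, and count each word."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def tf_idf(docs: list) -> list:
    """Return one {word: score} dict per document.

    tf = count / words in document; idf = log(N / documents containing word).
    This is one common variant; others add smoothing.
    """
    bags = [bag_of_words(d) for d in docs]
    n_docs = len(bags)
    # Iterating a Counter yields each distinct word once per document.
    doc_freq = Counter(word for bag in bags for word in bag)
    scores = []
    for bag in bags:
        total = sum(bag.values())
        scores.append({
            w: (c / total) * math.log(n_docs / doc_freq[w])
            for w, c in bag.items()
        })
    return scores
```

A word that occurs in every document gets an idf of log(1) = 0 and therefore a score of zero. This is why tf-idf surfaces the terms that distinguish one paper from the rest of a corpus.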
31. Regular Expressions for Systematic Reviews of Animal Tests
Preceding Text
Following Text
Extracted term
In 30 minutes 6 scientists (most were unfamiliar with regex) wrote 200 regexes for ARRIVE (NC3R guidelines)
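The three columns on this slide (preceding text, extracted term, following text) suggest a keyword-in-context view of each regex match. A minimal sketch of that idea follows. The pattern and sentence are invented for illustration and are not taken from the workshop's ARRIVE regexes.

```python
import re


def extract_with_context(pattern: str, text: str, width: int = 20) -> list:
    """Find each regex match and return it with its surrounding text.

    The first capture group of `pattern` is the extracted term; `width`
    characters on either side are kept as preceding/following context.
    """
    rows = []
    for m in re.finditer(pattern, text):
        start, end = m.span(1)
        rows.append({
            "preceding": text[max(0, start - width):start],
            "term": m.group(1),
            "following": text[end:end + width],
        })
    return rows


# Hypothetical pattern and sentence, invented for illustration only;
# neither comes from the workshop's actual ARRIVE regexes.
SPECIES_PATTERN = r"\b(mice|rats|rabbits)\b"
SENTENCE = "Twelve male rats were randomly assigned to two groups."

if __name__ == "__main__":
    for row in extract_with_context(SPECIES_PATTERN, SENTENCE):
        print(row)
```

Showing the context beside each hit lets a reviewer who is new to regex see quickly whether a pattern is matching what was intended.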
43. Traditional Research and Publication
[Diagram labels: "Lab" work; paper/thesis; write; rewrite; re-experiment; publish; ???; Validation??; DATA; output "belongs" to publisher]
45. Open Notebook Content Mining
• "No insider knowledge"
• Anyone can become involved
• All raw non-copyright material on Github
• Planning and discussion on Open Discourse
• All output (however imperfect) on Github CC0
• Immediate upload
• Inspired by Free/Libre/Open Source, Wikipedia, OpenStreetMap.
50. Automatic Open Notebook of computations
Everything is posted to Github before being analyzed
51. Bacillus subtilis [131238]*
Bacteroides fragilis [221817]
Brevibacillus brevis
Cyclobacterium marinum
Escherichia coli [25419]
Filobacillus milosensis
Flectobacillus major [15809775]
Flexibacter flexilis [15809789]
Formosa algae
Gelidibacter algens [16982233]
Halobacillus halophilus
Lentibacillus salicampi [18345921]
Octadecabacter arcticus
Psychroflexus torquis [16988834]
Pseudomonas aeruginosa [31856]
Sagittula stellata [16992371]
Salegentibacter salegens
Sphingobacterium spiritivorum
Terrabacter tumescens
• [Identifier in Wikidata]
• Missing = not found with Wikidata API
20 commonest organisms (in > 30 papers) in trees from IJSEM*
Half do not appear to be in Wikidata
Can the Wikipedia Scientists comment?
*Int. J. Syst. Evol. Microbiol.
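The slide does not say which Wikidata API call was used for this lookup. One plausible choice is the `wbsearchentities` action, sketched below. The sketch assumes the bracketed numbers are Wikidata item numbers with the leading "Q" dropped. The User-Agent string is a placeholder.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_search_url(name: str) -> str:
    """URL for a Wikidata `wbsearchentities` query on an organism name."""
    params = {
        "action": "wbsearchentities",
        "search": name,
        "language": "en",
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def first_entity_id(response: dict):
    """Return the first matching entity id (e.g. "Q25419"), or None if missing."""
    hits = response.get("search", [])
    return hits[0]["id"] if hits else None


def lookup(name: str):
    """Query the live API (needs network access); None means 'not found'."""
    request = urllib.request.Request(
        build_search_url(name),
        # Placeholder User-Agent; replace with one describing your own tool.
        headers={"User-Agent": "organism-lookup-sketch/0.1 (example)"},
    )
    with urllib.request.urlopen(request) as resp:
        return first_entity_id(json.load(resp))
```

A name search like this can miss items stored under a synonym or a different spelling, so a result of `None` means "not found by this query" and does not prove the organism is absent from Wikidata.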
70. Prof. Ian Hargreaves (2011): "David Cameron's
exam question": "Could it be true that laws
designed more than three centuries ago with the
express purpose of creating economic incentives
for innovation by protecting creators' rights are
today obstructing innovation and economic
growth?"
"Yes. We have found that the UK's intellectual
property framework, especially with regard to
copyright, is falling behind what is needed."
"Digital Opportunity" by Prof Ian Hargreaves - http://www.ipo.gov.uk/ipreview.htm. Licensed under CC BY 3.0 via Wikipedia - https://en.wikipedia.org/wiki/File:Digital_Opportunity.jpg#/media/File:Digital_Opportunity.jpg
Hi, I'm here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I'm going to stress the importance of data in a specific format and its utility for automated machine processing. Then I'm going to demonstrate AMI's architecture and the transformation of data as it flows through the process. I'm going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I'm going to introduce Andy's ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realms of validation and metabolism.
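To make the idea of pluggable visitors over SVG concrete, here is a minimal Python sketch. It is not AMI's actual architecture or API. The class names, the `visit` method and `run_visitors` are all hypothetical, chosen only to show how independent extractors can share one parsed SVG document.

```python
import xml.etree.ElementTree as ET
from abc import ABC, abstractmethod

SVG_NS = "{http://www.w3.org/2000/svg}"


class Visitor(ABC):
    """Hypothetical plug-in interface: one visitor per kind of data."""

    @abstractmethod
    def visit(self, svg_root: ET.Element) -> list:
        ...


class TextVisitor(Visitor):
    """Collects the character content of every SVG <text> element."""

    def visit(self, svg_root):
        return ["".join(el.itertext()) for el in svg_root.iter(SVG_NS + "text")]


class LineCountVisitor(Visitor):
    """Counts <line> elements, a stand-in for graphics-aware extraction."""

    def visit(self, svg_root):
        return [sum(1 for _ in svg_root.iter(SVG_NS + "line"))]


def run_visitors(svg_source: str, visitors: list) -> dict:
    """Parse the SVG once and hand the same tree to every visitor."""
    root = ET.fromstring(svg_source)
    return {type(v).__name__: v.visit(root) for v in visitors}
```

Passing a small SVG string with one `<text>` and one `<line>` element to `run_visitors(svg, [TextVisitor(), LineCountVisitor()])` returns one result list per visitor, keyed by class name. A domain-specific extractor, such as one for chemistry, would be added as another subclass without changing the others.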