SlideShare a Scribd company logo
ContentMine: Open data
And Social Machines
Peter Murray-Rust
,
Computation Lab, Univ of Chicago, 2014-11-12
ContentMine: We use machines to
liberate[1] 100 million facts /yr from
the scientific scholarly literature and
make them free for everyone
(WikiData)
WikiData and ContentMines are social
machines
There are no longer any technical
obstacles, only people.
[1] Friday workshop: build your own social machine: scraping XML,
Liberation Software
http://en.wikipedia.org/wiki/Tim_Berners-Lee
Everything in this presentation is ODOSOS
(Open Data, Open Standards, Open Source)
CC0, CC-BY, W3C etc., Apache2, etc.
Open = “Free to use, re-use and redistribute
http://contentmine.org
http://bitbucket.org/petermr
http://wwmm.ch.cam.ac.uk
A promise: I (Petermr) will never sell out to non-transparent organizations.
http://www.budapestopenaccessinitiative.org/read
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2003)
Scientific and Medical publication (STM)[+]
• World Citizens pay $400,000,000,000…
• … for research in 1,500,000 articles …
• … cost $300,000 each to create …
• … $7000 each to “publish” [*]…
• … $10,000,000,000 from academic libraries …
• … to “publishers” who forbid access to 99.9% of
citizens of the world …
[+] Figures probably +- 50 %
[*] arXiV preprint server costs $7 USD per paper
petermr: I believe in Wikipedia
• 2006 http://en.wikipedia.org/wiki/User:Petermr
• 2006 started Open Data (term unknown then!)
• 2009: “the bit of Wikipedia that I wrote is correct” [challenging the
idea of “WP is junk”]
• 2009: “Wikipedia is the digital library of this century”
• 2012: I alert WP that Springer has copyrighted > 1000 of our
images [Springergate]
• 2014: “For facts in maths, physical and biological sciences I trust
Wikipedia.” (Wikimania2014)
A meritocratic
critical
volunteer
community
Volunteer community in chemistry: Open Data/Source/Standards
4 Billion USD on human genome
yielded 800 Billion USD and 4 M job-years
Gloom Warning
…three problems—flawed design, non-
publication, and poor reporting—together
meant >85% of research funds were wasted, a
global total loss >100 billion USD per year.
[Lancet 2009]
[Even more] waste clearly occurs after
publication: from poor access, poor
dissemination, and poor uptake of the findings
of research. [PLOS Medicine 2014-05-27]
Bad publication wastes science
Publishers’ PDFs destroy science
PDFs do not contain words
or subscripts!
PDFs do not contain tables
and do not have columns
SVG is turned into JPEG because it’s easier to process
Elsevier wants to control Open Data
[asked by Michelle Brook]
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or
website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
CLOSED ACCESS MEANS PEOPLE DIE
CLOSED DATA MEANS PEOPLE DIE
Happiness Restored
The scientist’s amanuensis
• "The bane of my life is doing things I know computers could do
for me" (Dan Connolly, W3C)
Example: A semantic amanuensis could
• Give me a daily digest of mineralogy papers
• Extract all the crystal structures from them
• Compute physical properties with GULP and NWChem
• Compare the results statistically
• Preserve and distribute the complete operation
• Prepare the results for publication
The semantic web is having a personal amanuensis
Artificial Intelligence in science
In 1970 chess and chemistry were the sandboxes for AI. Some
approaches:
• Lookup (Knowledge)
• Natural Language Processing (NLP)
• Brute force calculation (inc. physical methods)
• Tree-pruning and heuristics
• Logic (cf. OWL-DL)
• Human-machine integration (crowdsourcing)
• Computer Vision
Domain-specific Turing test: Can a machine pass a first-year
chemistry exam?
The Semantic Web
"The Semantic Web is an extension of the
current web in which information is given well-
defined meaning, better enabling computers
and people to work in cooperation."
Tim Berners-Lee, James Hendler, Ora Lassila, The
Semantic Web, Scientific American, May 2001
CC-BY-SA Images from Wikipedia
“Which Rivers flow into the Rhine and are longer
than 50 kilometers?” or “Which Skyscrapers
in China have more than 50 floors and have
been constructed before the year 2000?”
Open Crystallography?
“Which countries where tropical diseases are
endemic have published structures of chiral
natural products?”
Linked Open data from Wikipedia
CC-BY-SA from Wikipedia
The Right to Read is the Right to Mine
http://contentmine.org
• Science can be read and understood by
human-machine Amanuensis-symbionts.
• Amanuenses are based on Wikipedia,
databases and software (e.g. ContentMine’s
AMI)
• The results are fed back into WP and WikiData
http://en.wikipedia.org/wiki/Symbiosishttp://en.wikipedia.org/wiki/Eric_Fenby
• Crawl scientific literature
(Open Bibliography)
• Scrape each scientific article
(ContentMine-quickscrape)
• Extract the facts (ContentMine-AMI)
• Index (Wikipedia)
• Republish (WikiData)
Machine Extraction of scientific facts
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
Linked Open Data – the world’s knowledge
very little physical science 
http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png
DBPedia
BIO
Comp
Lib
PDB
Ontologies
GOV
GOV.uk
Music,
Art
Literature
Social
Knowledge
bases
RDF
triples
Part of a COD RDF entry
The Semantic Web understands this
Mathematics Markup Language
Energy of c.c.p lattice of argon
4 pages clippedHuman-friendly
Machine-
friendly
Many editors and tools exist
We used MathWeaver
Automatic!
MathML
CML (Chemical Markup Language)
Human-friendly Machine-friendly
Automatic!
Innovation with Componentisation
Individual, manual,
unreusable, flaky
Commodity, standard,
reliable, re-usable
Non-semantic
data
Data extraction
difficult and
incomplete
Human
readers
Current scientific information flow
… is broken for data-rich science
PDF
Lineprinter output
Text files
Human input
Semantic network closes the loop
Data mined from
document
Data available for
e-science and re-
use
Computation
Measurement
Semantic
Authoring
Community
Analysis
The network grows autonomously
Machine-machine
Machine-human
Human-machine
Human-human
Humans and machines use different
languages
How a machine reads a chemical thesis
nodes are compounds; arrows are reactions
Human-machine symbionts can read science!
WP_Lion
WP_Aspergillus_oryzae
WP_Soybean
Facts Marked by “non-scientists” in ContentMine workshops
With Wikipedia everyone can be a scientist
“nuggets” in a scientific paper
quantity
units
Value ranges
Humans aren’t designed to mine this … 
chemical
project places
Parsing chemical sentences
A FACT, uncopyrightable, and representable by triples
http://wwmm.ch.cam.ac.uk/chemicaltagger
• Typical
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
But we can now
turn PDFs into
Science
We can’t turn a hamburger into a cow
UNITS
TICKS
QUANTITY
SCALE
TITLES
DATA!!
2000+ points
Dumb PDF
CSV
Semantic
Spectrum
2nd Derivative
Gaussian
Filter
Automatic
extraction
Takes < 1 second
Chemical Computer Vision
Raw Mobile photo; problems:
Shadows, contrast, noise, skew, clipping
Binarization (pixels = 0,1)
Irregular edges
Thinning: thick lines to 1-pixel
Chemical Optical Character Recognition
Small alphabet, clean typefaces, clear boundaries make
this relatively tractable. Problems are “I” “O” etc.
AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home
Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY:
AMI reads the complete diagram,
recognizes the paths and
generates the molecules. Then
she creates a stop-fram animation
showing how the 12 reactions
lead into each other
CLICK HERE FOR ANIMATION
(may be browser dependent)
AMI Demo
http://www.mdpi.com/2218-1989/2/1/39/pdf
https://bitbucket.org/AndyHowlett/ami2-poc
ami2-poc -i example
-v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor
May take time to start if not connected to web
Output:./target/output/reactionsexample/
SVG: ./page1annotated.svg
CML: image.g.1.4.svg.reaction0.cml
Avogadro
Viewer:
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type
Culture Collection
(http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 –
“Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada” .
((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),(
(((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n
215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187),
n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n
102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),(((
n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n1
60))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139
,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222)))
)))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,(
n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),((
n53,n131),n159)))))));
http://en.wikipedia.org/wiki/Digital_image_processing
http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
Open notebook science is the practice of
making the entire primary record of a research
project publicly available online as it is
recorded. (WP)
Jean-Claude Bradley was a chemist who
actively promoted Open Science in
chemistry,… He coined the term Open
Notebook Science. … A memorial
symposium was held July 14, 2014 at
Cambridge University, UK.[9]
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
Thanks
• Shuttleworth Foundation and Fellowship
• Contentmine.org: Michelle Brook, Jenny Molloy,
Ross Mounce, Richard Smith-Unna,
CottageLabs, Charles Oppenheim
• Open Knowledge Foundation Community
• Wikimedia Community
• Blue Obelisk Community
My/our Dream
• An Open Bibliography of science, updated
daily
• An interface for ContentMine to feed new
facts into WikiData
• Domain-specific enthusiasts to create and run
fact extraction and validation
• Wikipedia to become a C21 publisher of
reference science

More Related Content

What's hot

ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
petermurrayrust
 
The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)
petermurrayrust
 
Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Data
petermurrayrust
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Science
petermurrayrust
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Science
petermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
petermurrayrust
 
Ontologies in Physical Science
Ontologies in Physical ScienceOntologies in Physical Science
Ontologies in Physical Science
petermurrayrust
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
TheContentMine
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
petermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
TheContentMine
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly Collaboration
David De Roure
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
petermurrayrust
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
petermurrayrust
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
TheContentMine
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
petermurrayrust
 
Dagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesDagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesTim Clark
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
TheContentMine
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
petermurrayrust
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
petermurrayrust
 

What's hot (20)

ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)The Content Mine (presented at UKSG)
The Content Mine (presented at UKSG)
 
Copyright Reform and Open Data
Copyright Reform and Open DataCopyright Reform and Open Data
Copyright Reform and Open Data
 
Principles and practice of Open Science
Principles and practice of Open SciencePrinciples and practice of Open Science
Principles and practice of Open Science
 
Semantic Web in Physical Science
Semantic Web in Physical ScienceSemantic Web in Physical Science
Semantic Web in Physical Science
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
Ontologies in Physical Science
Ontologies in Physical ScienceOntologies in Physical Science
Ontologies in Physical Science
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literatureAutomatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Social Machines of Scholarly Collaboration
Social Machines of Scholarly CollaborationSocial Machines of Scholarly Collaboration
Social Machines of Scholarly Collaboration
 
Csvconf
CsvconfCsvconf
Csvconf
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
ContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and thesesContentMine: Liberating scholarship from Open publications and theses
ContentMine: Liberating scholarship from Open publications and theses
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
 
Dagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slidesDagstuhl "Future" sesssion intro slides
Dagstuhl "Future" sesssion intro slides
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
 
Scientific search for everyone
Scientific search for everyoneScientific search for everyone
Scientific search for everyone
 
The culture of researchData
The culture of researchDataThe culture of researchData
The culture of researchData
 

Viewers also liked

Cubrid - Open Source DBMS highly optimized for Web Applications
Cubrid - Open Source DBMS highly optimized for Web ApplicationsCubrid - Open Source DBMS highly optimized for Web Applications
Cubrid - Open Source DBMS highly optimized for Web ApplicationsAgora Group
 
6a sofia proyect 1 nuber 16
6a sofia proyect 1 nuber 166a sofia proyect 1 nuber 16
6a sofia proyect 1 nuber 16
juanmonteroingles
 
โอ้ มาริโอ้-เมาเร่อ
โอ้ มาริโอ้-เมาเร่อโอ้ มาริโอ้-เมาเร่อ
โอ้ มาริโอ้-เมาเร่อ
pop_NATTIDA
 
Les préférences partisanes des Français / Les éclairages #3
Les préférences partisanes des Français / Les éclairages #3Les préférences partisanes des Français / Les éclairages #3
Les préférences partisanes des Français / Les éclairages #3
contact Elabe
 
29.08.2015「里山資本主義」から「里海」へ 井上恭介
29.08.2015「里山資本主義」から「里海」へ 井上恭介29.08.2015「里山資本主義」から「里海」へ 井上恭介
29.08.2015「里山資本主義」から「里海」へ 井上恭介
unuiasouik
 
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナーいしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
unuiasouik
 
Dagger 2 Injeção de dependências no mundo Android
Dagger 2 Injeção de dependências no mundo AndroidDagger 2 Injeção de dependências no mundo Android
Dagger 2 Injeção de dependências no mundo Android
Clerton Leal
 
COMMON PROBLEMS FACING WITH TABLETS
COMMON PROBLEMS FACING WITH TABLETSCOMMON PROBLEMS FACING WITH TABLETS
COMMON PROBLEMS FACING WITH TABLETS
ASIT Education
 
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennes
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennesNotice du sondage ELABE BFMTV premier débat des Primaires citoyennes
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennes
contact Elabe
 

Viewers also liked (10)

Cubrid - Open Source DBMS highly optimized for Web Applications
Cubrid - Open Source DBMS highly optimized for Web ApplicationsCubrid - Open Source DBMS highly optimized for Web Applications
Cubrid - Open Source DBMS highly optimized for Web Applications
 
6a sofia proyect 1 nuber 16
6a sofia proyect 1 nuber 166a sofia proyect 1 nuber 16
6a sofia proyect 1 nuber 16
 
Campaign Plan
Campaign PlanCampaign Plan
Campaign Plan
 
โอ้ มาริโอ้-เมาเร่อ
โอ้ มาริโอ้-เมาเร่อโอ้ มาริโอ้-เมาเร่อ
โอ้ มาริโอ้-เมาเร่อ
 
Les préférences partisanes des Français / Les éclairages #3
Les préférences partisanes des Français / Les éclairages #3Les préférences partisanes des Français / Les éclairages #3
Les préférences partisanes des Français / Les éclairages #3
 
29.08.2015「里山資本主義」から「里海」へ 井上恭介
29.08.2015「里山資本主義」から「里海」へ 井上恭介29.08.2015「里山資本主義」から「里海」へ 井上恭介
29.08.2015「里山資本主義」から「里海」へ 井上恭介
 
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナーいしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
いしかわの里山・里海 と生物・文化多様性モデル@いしかわ大学連携インキュベータ地元資源活用セミナー
 
Dagger 2 Injeção de dependências no mundo Android
Dagger 2 Injeção de dependências no mundo AndroidDagger 2 Injeção de dependências no mundo Android
Dagger 2 Injeção de dependências no mundo Android
 
COMMON PROBLEMS FACING WITH TABLETS
COMMON PROBLEMS FACING WITH TABLETSCOMMON PROBLEMS FACING WITH TABLETS
COMMON PROBLEMS FACING WITH TABLETS
 
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennes
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennesNotice du sondage ELABE BFMTV premier débat des Primaires citoyennes
Notice du sondage ELABE BFMTV premier débat des Primaires citoyennes
 

Similar to ContentMine: Open Data and Social Machines

ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
TheContentMine
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
petermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
petermurrayrust
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
TheContentMine
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
petermurrayrust
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
petermurrayrust
 
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
Jisc
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
TheContentMine
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and Analytics
David De Roure
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
TheContentMine
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
petermurrayrust
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
TheContentMine
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
TheContentMine
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
LEARN Project
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
petermurrayrust
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
TheContentMine
 
Ethics of Automation
Ethics of AutomationEthics of Automation
Ethics of Automation
David De Roure
 
Science 2.0
Science 2.0Science 2.0
Science 2.0
Robert Cormia
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
petermurrayrust
 
Social Machines Paradigm
Social Machines ParadigmSocial Machines Paradigm
Social Machines Paradigm
David De Roure
 

Similar to ContentMine: Open Data and Social Machines (20)

ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
Climate Change and Human Migration
Climate Change and Human MigrationClimate Change and Human Migration
Climate Change and Human Migration
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
ContentMining for Synthetic Biology
ContentMining for Synthetic BiologyContentMining for Synthetic Biology
ContentMining for Synthetic Biology
 
Automatic Extraction of Science and Medicine from the scholarly literature
Automatic Extraction of Science and  Medicine from the scholarly literatureAutomatic Extraction of Science and  Medicine from the scholarly literature
Automatic Extraction of Science and Medicine from the scholarly literature
 
Paradise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to MineParadise Lost and The Right to Read is the Right to Mine
Paradise Lost and The Right to Read is the Right to Mine
 
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...The wider environment of open scholarship – Jisc and CNI conference 10 July ...
The wider environment of open scholarship – Jisc and CNI conference 10 July ...
 
ContentMining in Neuroscience
ContentMining in NeuroscienceContentMining in Neuroscience
ContentMining in Neuroscience
 
Emerging Forms of Data and Analytics
Emerging Forms of Data and AnalyticsEmerging Forms of Data and Analytics
Emerging Forms of Data and Analytics
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
ContentMining and Clinical Trials
ContentMining and Clinical TrialsContentMining and Clinical Trials
ContentMining and Clinical Trials
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
The Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-RustThe Culture of Research Data, by Peter Murray-Rust
The Culture of Research Data, by Peter Murray-Rust
 
Rapid biomedical search
Rapid biomedical search Rapid biomedical search
Rapid biomedical search
 
Content Mining for Machines and Humans
Content Mining for Machines and HumansContent Mining for Machines and Humans
Content Mining for Machines and Humans
 
Ethics of Automation
Ethics of AutomationEthics of Automation
Ethics of Automation
 
Science 2.0
Science 2.0Science 2.0
Science 2.0
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
Social Machines Paradigm
Social Machines ParadigmSocial Machines Paradigm
Social Machines Paradigm
 

More from TheContentMine

High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
TheContentMine
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
TheContentMine
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
TheContentMine
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
TheContentMine
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
TheContentMine
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
TheContentMine
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
TheContentMine
 
Content Mining of Science and Medicine
Content Mining of Science and MedicineContent Mining of Science and Medicine
Content Mining of Science and Medicine
TheContentMine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
TheContentMine
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
TheContentMine
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
TheContentMine
 
Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape? Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape?
TheContentMine
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
TheContentMine
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
TheContentMine
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
TheContentMine
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
TheContentMine
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
TheContentMine
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
TheContentMine
 
TheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine: Mining for Everyone
TheContentMine: Mining for Everyone
TheContentMine
 
Overview of Practical Content Mining
Overview of Practical Content Mining Overview of Practical Content Mining
Overview of Practical Content Mining
TheContentMine
 

More from TheContentMine (20)

High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
Content Mining of Science and Medicine
Content Mining of Science and MedicineContent Mining of Science and Medicine
Content Mining of Science and Medicine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
 
Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape? Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape?
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
 
TheContentMine: Mining for Everyone
TheContentMine: Mining for EveryoneTheContentMine: Mining for Everyone
TheContentMine: Mining for Everyone
 
Overview of Practical Content Mining
Overview of Practical Content Mining Overview of Practical Content Mining
Overview of Practical Content Mining
 

Recently uploaded

What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
moosaasad1975
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Sérgio Sacani
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
fafyfskhan251kmf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Studia Poinsotiana
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
Lokesh Patil
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
zeex60
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
frank0071
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
tonzsalvador2222
 

Recently uploaded (20)

What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.What is greenhouse gasses and how many gasses are there to affect the Earth.
What is greenhouse gasses and how many gasses are there to affect the Earth.
 
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdfDMARDs Pharmacolgy Pharm D 5th Semester.pdf
DMARDs Pharmacolgy Pharm D 5th Semester.pdf
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
Salas, V. (2024) "John of St. Thomas (Poinsot) on the Science of Sacred Theol...
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
Nutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technologyNutraceutical market, scope and growth: Herbal drug technology
Nutraceutical market, scope and growth: Herbal drug technology
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Introduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptxIntroduction to Mean Field Theory(MFT).pptx
Introduction to Mean Field Theory(MFT).pptx
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdfMudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
Mudde & Rovira Kaltwasser. - Populism - a very short introduction [2017].pdf
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Chapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisisChapter 12 - climate change and the energy crisis
Chapter 12 - climate change and the energy crisis
 

ContentMine: Open Data and Social Machines

  • 1. ContentMine: Open data And Social Machines Peter Murray-Rust , Computation Lab, Univ of Chicago, 2014-11-12
  • 2. ContentMine: We use machines to liberate[1] 100 million facts /yr from the scientific scholarly literature and make them free for everyone (WikiData) WikiData and ContentMines are social machines There are no longer any technical obstacles, only people. [1] Friday workshop: build your own social machine: scraping XML,
  • 4. http://en.wikipedia.org/wiki/Tim_Berners-Lee Everything in this presentation is ODOSOS (Open Data, Open Standards, Open Source) CC0, CC-BY, W3C etc., Apache2, etc. Open = “Free to use, re-use and redistribute http://contentmine.org http://bitbucket.org/petermr http://wwmm.ch.cam.ac.uk A promise: I (Petermr) will never sell out to non-transparent organizations.
  • 5. http://www.budapestopenaccessinitiative.org/read … an unprecedented public good. … … completely free and unrestricted access to [peer- reviewed literature] by all scientists, scholars, teachers, students, and other curious minds. … …Removing access barriers to this literature will accelerate research, enrich education, share the learning of the rich with the poor and the poor with the rich, make this literature as useful as it can be, and lay the foundation for uniting humanity in a common intellectual conversation and quest for knowledge. (Budapest Open Access Initiative, 2003)
  • 6. Scientific and Medical publication (STM)[+] • World Citizens pay $400,000,000,000… • … for research in 1,500,000 articles … • … cost $300,000 each to create … • … $7000 each to “publish” [*]… • … $10,000,000,000 from academic libraries … • … to “publishers” who forbid access to 99.9% of citizens of the world … [+] Figures probably +- 50 % [*] arXiV preprint server costs $7 USD per paper
  • 7. petermr: I believe in Wikipedia • 2006 http://en.wikipedia.org/wiki/User:Petermr • 2006 started Open Data (term unknown then!) • 2009: “the bit of Wikipedia that I wrote is correct” [challenging the idea of “WP is junk”] • 2009: “Wikipedia is the digital library of this century” • 2012: I alert WP that Springer has copyrighted > 1000 of our images [Springergate] • 2014: “For facts in maths, physical and biological sciences I trust Wikipedia.” (Wikimania2014)
  • 9. Volunteer community in chemistry: Open Data/Source/Standards
  • 10. 4 Billion USD on human genome yielded 800 Billion USD and 4 M job-years
  • 12. …three problems—flawed design, non- publication, and poor reporting—together meant >85% of research funds were wasted, a global total loss >100 billion USD per year. [Lancet 2009] [Even more] waste clearly occurs after publication: from poor access, poor dissemination, and poor uptake of the findings of research. [PLOS Medicine 2014-05-27] Bad publication wastes science
  • 13. Publishers’ PDFs destroy science PDFs do not contain words or subscripts! PDFs do not contain tables and do not have columns SVG is turned into JPEG because it’s easier to process
  • 14. Elsevier wants to control Open Data [asked by Michelle Brook]
  • 15. STM Publishers Licence 2012_03_15_Sample_Licence_Text_Data_Mining.pdf (Summary: PMR has NO rights) • [cannot publish to: ] “libraries, repositories, or archives” • [cannot] “Make the results of any TDM Output available on an externally facing server or website” • “Subscriber shall pay a […] fee” Heather Piwowar: “negotiating with publishers [made me physically ill]” WE WALKED OUT • Brit Library • JISC • RLUK • OKFN • … • Ross Mounce • PM-R Licences destroy Content Mining
  • 16. CLOSED ACCESS MEANS PEOPLE DIE CLOSED DATA MEANS PEOPLE DIE
  • 18. The scientist’s amanuensis • "The bane of my life is doing things I know computers could do for me" (Dan Connolly, W3C) Example: A semantic amanuensis could • Give me a daily digest of mineralogy papers • Extract all the crystal structures from them • Compute physical properties with GULP and NWChem • Compare the results statistically • Preserve and distribute the complete operation • Prepare the results for publication The semantic web is having a personal amanuensis
  • 19. Artificial Intelligence in science In 1970 chess and chemistry were the sandboxes for AI. Some approaches: • Lookup (Knowledge) • Natural Language Processing (NLP) • Brute force calculation (inc. physical methods) • Tree-pruning and heuristics • Logic (cf. OWL-DL) • Human-machine integration (crowdsourcing) • Computer Vision Domain-specific Turing test: Can a machine pass a first-year chemistry exam?
  • 20. The Semantic Web "The Semantic Web is an extension of the current web in which information is given well- defined meaning, better enabling computers and people to work in cooperation." Tim Berners-Lee, James Hendler, Ora Lassila, The Semantic Web, Scientific American, May 2001 CC-BY-SA Images from Wikipedia
  • 21. “Which Rivers flow into the Rhine and are longer than 50 kilometers?” or “Which Skyscrapers in China have more than 50 floors and have been constructed before the year 2000?” Open Crystallography? “Which countries where tropical diseases are endemic have published structures of chiral natural products?” Linked Open data from Wikipedia CC-BY-SA from Wikipedia
  • 22. The Right to Read is the Right to Mine http://contentmine.org
  • 23. • Science can be read and understood by human-machine Amanuensis-symbionts. • Amanuenses are based on Wikipedia, databases and software (e.g. ContentMine’s AMI) • The results are fed back into WP and WikiData http://en.wikipedia.org/wiki/Symbiosishttp://en.wikipedia.org/wiki/Eric_Fenby
  • 24. • Crawl scientific literature (Open Bibliography) • Scrape each scientific article (ContentMine-quickscrape) • Extract the facts (ContentMine-AMI) • Index (Wikipedia) • Republish (WikiData) Machine Extraction of scientific facts
  • 25. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers
  • 26. Linked Open Data – the world’s knowledge very little physical science  http://upload.wikimedia.org/wikipedia/commons/3/34/LOD_Cloud_Diagram_as_of_September_2011.png DBPedia BIO Comp Lib PDB Ontologies GOV GOV.uk Music, Art Literature Social Knowledge bases RDF triples
  • 27. Part of a COD RDF entry The Semantic Web understands this
  • 28. Mathematics Markup Language Energy of c.c.p lattice of argon 4 pages clippedHuman-friendly Machine- friendly Many editors and tools exist We used MathWeaver Automatic! MathML
  • 29. CML (Chemical Markup Language) Human-friendly Machine-friendly Automatic!
  • 30. Innovation with Componentisation Individual, manual, unreusable, flaky Commodity, standard, reliable, re-usable
  • 31. Non-semantic data Data extraction difficult and incomplete Human readers Current scientific information flow … is broken for data-rich science PDF Lineprinter output Text files Human input
  • 32. Semantic network closes the loop Data mined from document Data available for e-science and re- use Computation Measurement Semantic Authoring Community Analysis
  • 33. The network grows autonomously Machine-machine Machine-human Human-machine Human-human
  • 34. Humans and machines use different languages
  • 35. How a machine reads a chemical thesis nodes are compounds; arrows are reactions
  • 36. Human-machine symbionts can read science! WP_Lion WP_Aspergillus_oryzae WP_Soybean
  • 37. Facts Marked by “non-scientists” in ContentMine workshops With Wikipedia everyone can be a scientist
  • 38. “nuggets” in a scientific paper quantity units Value ranges Humans aren’t designed to mine this …  chemical project places
  • 39. Parsing chemical sentences A FACT, uncopyrightable, and representable by triples
  • 41. Open Content Mining of FACTs Machines can interpret chemical reactions We have done 500,000 patents. There are > 3,000,000 reactions/year. Added value > 1B Eur.
  • 42. But we can now turn PDFs into Science We can’t turn a hamburger into a cow
  • 45. Chemical Computer Vision Raw Mobile photo; problems: Shadows, contrast, noise, skew, clipping
  • 46. Binarization (pixels = 0,1) Irregular edges
  • 47. Thinning: thick lines to 1-pixel
  • 48. Chemical Optical Character Recognition Small alphabet, clean typefaces, clear boundaries make this relatively tractable. Problems are “I” “O” etc.
  • 49. AMI https://bitbucket.org/petermr/xhtml2stm/wiki/Home Example reaction scheme, taken from MDPI Metabolites 2012, 2, 100-133; page 8, CC-BY: AMI reads the complete diagram, recognizes the paths and generates the molecules. Then she creates a stop-fram animation showing how the 12 reactions lead into each other CLICK HERE FOR ANIMATION (may be browser dependent)
  • 50. AMI Demo http://www.mdpi.com/2218-1989/2/1/39/pdf https://bitbucket.org/AndyHowlett/ami2-poc ami2-poc -i example -v org.xmlcml.xhtml2stm.visitor.chem.ChemVisitor May take time to start if not connected to web Output:./target/output/reactionsexample/ SVG: ./page1annotated.svg CML: image.g.1.4.svg.reaction0.cml Avogadro Viewer:
  • 51. Bacterial WP_phylogenetic tree Our machines have read and interpreted 4300 in an hour with > 95% accuracy Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves) WP: Clostridium_butyricum Genbank ID American Type Culture Collection
  • 52. (http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0036933 – “Adaptive Evolution of HIV at HLA Epitopes Is Associated with Ethnicity in Canada” . ((n122,((n121,n205),((n39,(n84,((((n35,n98),n191),n22),n17))),((n10,n182),( (((n232,n76),n68),(n109,n30)),(n73,(n106,n58))))))),((((((n103,n86),(n218,(n 215,n157))),((n164,n143),((n190,((n108,n177),(n192,n220))),((n233,n187), n41)))),((((n59,n184),((n134,n200),(n137,(n212,((n92,n209),n29))))),(n88,(n 102,n161))),((((n70,n140),(n18,n188)),(n49,((n123,n132),(n219,n198)))),((( n37,(n65,n46)),(n135,(n11,(n113,n142)))),(n210,((n69,(n216,n36)),(n231,n1 60))))))),(((n107,n43),((n149,n199),n74)),(((n101,(n19,n54)),n96),(n7,((n139 ,n5),((n170,(n25,n75)),(n146,(n154,(n194,(((n14,n116),n112),(n126,n222))) )))))))),(((((n165,(n168,n128)),n129),((n114,n181),(n48,n118))),((n158,(n91,( n33,n213))),(n87,n235))),((n197,(n175,n117)),(n196,((n171,(n163,n227)),(( n53,n131),n159))))))); http://en.wikipedia.org/wiki/Digital_image_processing http://en.wikipedia.org/wiki/Newick_format http://en.wikipedia.org/wiki/Phylogenetics
  • 53. Open notebook science is the practice of making the entire primary record of a research project publicly available online as it is recorded. (WP) Jean-Claude Bradley was a chemist who actively promoted Open Science in chemistry,… He coined the term Open Notebook Science. … A memorial symposium was held July 14, 2014 at Cambridge University, UK.[9]
  • 54. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers
  • 55. Thanks • Shuttleworth Foundation and Fellowship • Contentmine.org: Michelle Brook, Jenny Molloy, Ross Mounce, Richard Smith-Unna, CottageLabs, Charles Oppenheim • Open Knowledge Foundation Community • Wikimedia Community • Blue Obelisk Community
  • 56. My/our Dream • An Open Bibliography of science, updated daily • An interface for ContentMine to feed new facts into WikiData • Domain-specific enthusiasts to create and run fact extraction and validation • Wikipedia to become a C21 publisher of reference science