SlideShare a Scribd company logo
TheContentMine: Mining for Everyone
Peter Murray-Rust
BL_Labs, London, 2014-11-27
The Right to Read is the Right to Mine
http://contentmine.org
ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease Ebola
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev,
Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• Sci DataCon , Delhi IN
• Univ of Chicago US
• OpenCon 2014, Wash DC. US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie
Gunter, April Clyburne-Sherin
Regular Expressions
(Easier than Crosswords or Sudoku)
Ebola Ebola
Mali (not
Malicious)
MaliW (end of word)
Bat or bat [Bb]at (alternatives)
bat or bats bats? (optional letter)
Bat or Bats or bat
or bats
[Bb]ats?
Sudden onset [Ss]uddens+onset (space/s)
Panthera leoor
Gorilla gorilla
[A-Z][a-z]+s+[a-z]+
(ranges of letters)
Ebola regex
• <compoundRegex title="ebola">
• <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
• <regex weight="1.0" fields="marburg">(Marburg)</regex>
• <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagics+fever)</regex>
• <regex weight="0.8" fields="sudden_onset">([Ss]uddens+onset)</regex>
• <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omitings+diarrho?ea)</regex>
• <regex weight="0.5" fields="guinea">(Guinea)</regex>
• <regex weight="0.5" fields="sierra_leone">(Sierras+Leone)</regex>
• <regex weight="0.5" fields="liberia">(Liberia)</regex>
• <regex weight="0.5" fields="mali">(Mali)W</regex>
• <regex weight="0.6" fields="contact_tracing">([Cc]ontacts+tracing)</regex>
• <regex weight="0.5" fields="bat">W([Bb]ats?W)</regex>
• <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
• <regex weight="0.5" fields="drc">(Democratic Republics*(s*of)?(s*the)?s*Congo)(DRC)</regex>
• <regex weight="0.6" fields="safe_burial">([Ss]afes+burials+practice?s)</regex>
• <regex weight="1.0" fields="etu">([Ee]bolas+treatments+units?)(ETU)</regex>
• </compoundRegex>
I
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
Results of Regex on Ebola
• <resultsList xmlns="http://www.xml-cml.org/ami">
• <results xmlns="">
• <source xmlns="http://www.xml-cml.org/ami"
• name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
• <result>
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
• lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
• <regex xmlns="" weight="1.0" fields="[ebola]">
• <pattern>(Ebola)</pattern>
• </regex>
• <hits xmlns="">
• <hit ebola="Ebola" />
• </hits>
• </regex>
• </result>
• <result>
• <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
• lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
• <regex xmlns="" weight="0.5" fields="[sierra_leone]">
• <pattern>(Sierras+Leone)</pattern>
• </regex>
• <hits xmlns="">
• <hit sierra_leone="Sierra Leone" />
• </hits>
• </regex>
• </result>
Demo of Content Mining
ChemicalTagger (Lezan Hawizy) a shallow,
domain-specific, semantic parser for un/natural
language.
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
WP: Clostridium_butyricum
Genbank ID
American Type
Culture Collection
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
Queues
Repos
Scientific
literature
Science
Plugins
Science
Volunteers
Collaboration with
Open Access Button

More Related Content

Viewers also liked

Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
petermurrayrust
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
petermurrayrust
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
petermurrayrust
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
petermurrayrust
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIH
petermurrayrust
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
petermurrayrust
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
petermurrayrust
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
petermurrayrust
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
TheContentMine
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
TheContentMine
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
TheContentMine
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for factsMining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
petermurrayrust
 

Viewers also liked (12)

Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literatureAutomatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literatureHigh throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSSOpen software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Content Mining at Wellcome Trust
Content Mining at Wellcome TrustContent Mining at Wellcome Trust
Content Mining at Wellcome Trust
 
High throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIHHigh throughput mining of the scholarly literature; talk at NIH
High throughput mining of the scholarly literature; talk at NIH
 
ContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UKContentMining for France and Europe; Lessons from 2 years in UK
ContentMining for France and Europe; Lessons from 2 years in UK
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
 
Asking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolismAsking the scientific literature to tell us about metabolism
Asking the scientific literature to tell us about metabolism
 
Content Mining of Science in Cambridge
Content Mining of Science in CambridgeContent Mining of Science in Cambridge
Content Mining of Science in Cambridge
 
Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)Can Computers understand the scientific literature (includes compscie material)
Can Computers understand the scientific literature (includes compscie material)
 
High throughput mining of the scholarly literature
High throughput mining of the scholarly literature High throughput mining of the scholarly literature
High throughput mining of the scholarly literature
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for factsMining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 

More from TheContentMine

Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
TheContentMine
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
TheContentMine
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
TheContentMine
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
TheContentMine
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
TheContentMine
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
TheContentMine
 
Content Mining of Science and Medicine
Content Mining of Science and MedicineContent Mining of Science and Medicine
Content Mining of Science and Medicine
TheContentMine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
TheContentMine
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
TheContentMine
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
TheContentMine
 
Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape? Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape?
TheContentMine
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
TheContentMine
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
TheContentMine
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
TheContentMine
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
TheContentMine
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
TheContentMine
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
TheContentMine
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
TheContentMine
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
TheContentMine
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
TheContentMine
 

More from TheContentMine (20)

Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature Amanuens.is HUmans and machines annotating scholarly literature
Amanuens.is HUmans and machines annotating scholarly literature
 
Open software and knowledge for MIOSS
Open software and knowledge for MIOSS Open software and knowledge for MIOSS
Open software and knowledge for MIOSS
 
Automatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the LiteratureAutomatic Extraction of Knowledge from the Literature
Automatic Extraction of Knowledge from the Literature
 
Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature Automatic Extraction of Knowledge from Biomedical literature
Automatic Extraction of Knowledge from Biomedical literature
 
Cochrane workshop 2016
Cochrane workshop 2016Cochrane workshop 2016
Cochrane workshop 2016
 
Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016 Liberating facts from the scientific literature - Jisc Digifest 2016
Liberating facts from the scientific literature - Jisc Digifest 2016
 
Content Mining of Science and Medicine
Content Mining of Science and MedicineContent Mining of Science and Medicine
Content Mining of Science and Medicine
 
ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika! ContentMine + EPMC: Finding Zika!
ContentMine + EPMC: Finding Zika!
 
The culture of researchData
The culture of researchData The culture of researchData
The culture of researchData
 
Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts Mining Scientific Diagrams for facts
Mining Scientific Diagrams for facts
 
Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape? Digital Scholarship: Enlightenment or Devastated Landscape?
Digital Scholarship: Enlightenment or Devastated Landscape?
 
Open Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics InstituteOpen Knowledge and University of Cambridge European Bioinformatics Institute
Open Knowledge and University of Cambridge European Bioinformatics Institute
 
OpenNotebookScience NOW!
OpenNotebookScience NOW!OpenNotebookScience NOW!
OpenNotebookScience NOW!
 
Making Theses USEFUL
Making Theses USEFULMaking Theses USEFUL
Making Theses USEFUL
 
Open Data and Open Science
Open Data and Open ScienceOpen Data and Open Science
Open Data and Open Science
 
ContentMine and WikiData
ContentMine and WikiDataContentMine and WikiData
ContentMine and WikiData
 
Mining Scientific Images
Mining Scientific ImagesMining Scientific Images
Mining Scientific Images
 
ContentMine: Open Data and Social Machines
ContentMine: Open Data and Social MachinesContentMine: Open Data and Social Machines
ContentMine: Open Data and Social Machines
 
Disruptive Communities and Technology
Disruptive Communities and TechnologyDisruptive Communities and Technology
Disruptive Communities and Technology
 
Embrace the Open Revolution
Embrace the Open RevolutionEmbrace the Open Revolution
Embrace the Open Revolution
 

Recently uploaded

Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
ossaicprecious19
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
pablovgd
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
DiyaBiswas10
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
Columbia Weather Systems
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
muralinath2
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
Richard Gill
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
Areesha Ahmad
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
Health Advances
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
Richard Gill
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
muralinath2
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
ssuserbfdca9
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
muralinath2
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
SAMIR PANDA
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
muralinath2
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
RenuJangid3
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
ChetanK57
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
kumarmathi863
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
IqrimaNabilatulhusni
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 

Recently uploaded (20)

Lab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerinLab report on liquid viscosity of glycerin
Lab report on liquid viscosity of glycerin
 
NuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final versionNuGOweek 2024 Ghent - programme - final version
NuGOweek 2024 Ghent - programme - final version
 
extra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdfextra-chromosomal-inheritance[1].pptx.pdfpdf
extra-chromosomal-inheritance[1].pptx.pdfpdf
 
Orion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWSOrion Air Quality Monitoring Systems - CWS
Orion Air Quality Monitoring Systems - CWS
 
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptxBody fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
Body fluids_tonicity_dehydration_hypovolemia_hypervolemia.pptx
 
Richard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlandsRichard's aventures in two entangled wonderlands
Richard's aventures in two entangled wonderlands
 
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of LipidsGBSN - Biochemistry (Unit 5) Chemistry of Lipids
GBSN - Biochemistry (Unit 5) Chemistry of Lipids
 
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...The ASGCT Annual Meeting was packed with exciting progress in the field advan...
The ASGCT Annual Meeting was packed with exciting progress in the field advan...
 
Richard's entangled aventures in wonderland
Richard's entangled aventures in wonderlandRichard's entangled aventures in wonderland
Richard's entangled aventures in wonderland
 
platelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptxplatelets_clotting_biogenesis.clot retractionpptx
platelets_clotting_biogenesis.clot retractionpptx
 
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
4. An Overview of Sugarcane White Leaf Disease in Vietnam.pdf
 
Hemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptxHemostasis_importance& clinical significance.pptx
Hemostasis_importance& clinical significance.pptx
 
Seminar of U.V. Spectroscopy by SAMIR PANDA
 Seminar of U.V. Spectroscopy by SAMIR PANDA Seminar of U.V. Spectroscopy by SAMIR PANDA
Seminar of U.V. Spectroscopy by SAMIR PANDA
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
Circulatory system_ Laplace law. Ohms law.reynaults law,baro-chemo-receptors-...
 
Leaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdfLeaf Initiation, Growth and Differentiation.pdf
Leaf Initiation, Growth and Differentiation.pdf
 
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATIONPRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
PRESENTATION ABOUT PRINCIPLE OF COSMATIC EVALUATION
 
Structures and textures of metamorphic rocks
Structures and textures of metamorphic rocksStructures and textures of metamorphic rocks
Structures and textures of metamorphic rocks
 
general properties of oerganologametal.ppt
general properties of oerganologametal.pptgeneral properties of oerganologametal.ppt
general properties of oerganologametal.ppt
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 

TheContentMine: Mining for Everyone

  • 1. TheContentMine: Mining for Everyone Peter Murray-Rust BL_Labs, London, 2014-11-27
  • 2. The Right to Read is the Right to Mine http://contentmine.org
  • 3. ContentMine • 1-2 year Shuttleworth Funding from 2014-03 • Free to everyone, Open Source, updated daily • Structured Text, and Image/Diagram Mining • Workshops for training and training trainers • Bottom-up community development – Bioscience (EuropePMC, BBSRC) – Disease Ebola – Astrophysics (Stray Toaster) – Chemistry (TSB, EBI, PennState - Citeseer) • We fight for Justice and Freedom
  • 4. ContentMine People • Jenny Molloy • Ross Mounce • Peter Murray-Rust + volunteers (Bioscience, disease) • Richard Smith-Unna + 20 quickscrape volunteers • Steph Unna • Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones) • Prof Charles Oppenheim • Karien Bezuidenhout (Shuttleworth) • Advisory Board RSN
  • 5. ContentMine Workshops (1-hour -> full day or more) 2014-May->Nov • Budapest/Shuttleworth • Leicester Univ • Electronic Theses and Dissertations • Austrian Science Fund AT • OKFest DE • Eur. Bioinformatics Institute • Open Science Rio de Janeiro BR • Sci DataCon , Delhi IN • Univ of Chicago US • OpenCon 2014, Wash DC. US Upcoming • JISC • LIBER • BL • Wellcome Trust • WHO
  • 6. Ebola Collaborators (Atlanta) Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
  • 7. Regular Expressions (Easier than Crosswords or Sudoku) Ebola Ebola Mali (not Malicious) MaliW (end of word) Bat or bat [Bb]at (alternatives) bat or bats bats? (optional letter) Bat or Bats or bat or bats [Bb]ats? Sudden onset [Ss]uddens+onset (space/s) Panthera leoor Gorilla gorilla [A-Z][a-z]+s+[a-z]+ (ranges of letters)
  • 8. Ebola regex • <compoundRegex title="ebola"> • <regex weight="1.0" fields="ebola" case="">(Ebola)</regex> • <regex weight="1.0" fields="marburg">(Marburg)</regex> • <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagics+fever)</regex> • <regex weight="0.8" fields="sudden_onset">([Ss]uddens+onset)</regex> • <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omitings+diarrho?ea)</regex> • <regex weight="0.5" fields="guinea">(Guinea)</regex> • <regex weight="0.5" fields="sierra_leone">(Sierras+Leone)</regex> • <regex weight="0.5" fields="liberia">(Liberia)</regex> • <regex weight="0.5" fields="mali">(Mali)W</regex> • <regex weight="0.6" fields="contact_tracing">([Cc]ontacts+tracing)</regex> • <regex weight="0.5" fields="bat">W([Bb]ats?W)</regex> • <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex> • <regex weight="0.5" fields="drc">(Democratic Republics*(s*of)?(s*the)?s*Congo)(DRC)</regex> • <regex weight="0.6" fields="safe_burial">([Ss]afes+burials+practice?s)</regex> • <regex weight="1.0" fields="etu">([Ee]bolas+treatments+units?)(ETU)</regex> • </compoundRegex> I 15 mins to create, 15 mins to install and test Or run online at CottageLabs
  • 9. Results of Regex on Ebola • <resultsList xmlns="http://www.xml-cml.org/ami"> • <results xmlns=""> • <source xmlns="http://www.xml-cml.org/ami" • name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" /> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7" • lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak "> • <regex xmlns="" weight="1.0" fields="[ebola]"> • <pattern>(Ebola)</pattern> • </regex> • <hits xmlns=""> • <hit ebola="Ebola" /> • </hits> • </regex> • </result> • <result> • <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9" • lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains "> • <regex xmlns="" weight="0.5" fields="[sierra_leone]"> • <pattern>(Sierras+Leone)</pattern> • </regex> • <hits xmlns=""> • <hit sierra_leone="Sierra Leone" /> • </hits> • </regex> • </result>
  • 10. Demo of Content Mining ChemicalTagger (Lezan Hawizy) a shallow, domain-specific, semantic parser for un/natural language.
  • 11. Bacterial WP_phylogenetic tree Our machines have read and interpreted 4300 in an hour with > 95% accuracy Trees From http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves) WP: Clostridium_butyricum Genbank ID American Type Culture Collection
  • 12. RSU: Richard Smith-Unna PMR: Peter Murray-Rust CL: CottageLabs Queues Repos Scientific literature Science Plugins Science Volunteers Collaboration with Open Access Button