Overview of Practical Content Mining
Peter Murray-Rust
JISC, London, 2014-12-01
What is Content Mining
• Mining Text, Tables and Lists, Diagrams, Images
• Born-digital documents
• High-throughput (millions of items/year)
• Formal and Informal Collaboration
• Role of UK
• Hands-on
• Everything is OPEN (OSI, CC-BY, CC0)
The Right to Read is the Right to Mine
http://contentmine.org
ContentMine
• 1-2 year Shuttleworth Funding from 2014-03
• Free to everyone, Open Source, updated daily
• Structured Text, and Image/Diagram Mining
• Workshops for training and training trainers
• Bottom-up community development
– Bioscience (EuropePMC, BBSRC)
– Disease Ebola
– Astrophysics (Stray Toaster)
– Chemistry (TSB, EBI, PennState - Citeseer)
• We fight for Justice and Freedom
ContentMine People
• Jenny Molloy
• Ross Mounce
• Peter Murray-Rust + volunteers (Bioscience, disease)
• Richard Smith-Unna + 20 quickscrape volunteers
• Steph Unna
• Cottage Labs (Mark MacGillivray, Emanuil Tolev,
Richard Jones)
• Prof Charles Oppenheim
• Karien Bezuidenhout (Shuttleworth)
• Advisory Board RSN
ContentMine Workshops
(1-hour -> full day or more)
2014-May->Nov
• Budapest/Shuttleworth
• Leicester Univ
• Electronic Theses and Dissertations
• Austrian Science Fund AT
• OKFest DE
• Eur. Bioinformatics Institute
• Open Science Rio de Janeiro BR
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming
• JISC
• LIBER
• BL
• Wellcome Trust
• WHO
Ebola Collaborators (Atlanta)
Roxanne Further Moore, Jessie
Gunter, April Clyburne-Sherin
Regular Expressions
(Easier than Crosswords or Sudoku)
Text to match                       Regex                    Note
Ebola                               Ebola
Mali (not Malicious)                Mali\W                   (end of word)
Bat or bat                          [Bb]at                   (alternatives)
bat or bats                         bats?                    (optional letter)
Bat or Bats or bat or bats          [Bb]ats?
Sudden onset                        [Ss]udden\s+onset        (space/s)
Panthera leo or Gorilla gorilla     [A-Z][a-z]+\s+[a-z]+     (ranges of letters)
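The patterns on this slide can be tried directly in Python's standard `re` module. This is a minimal sketch: the pattern strings are taken from the slide, while the dictionary keys, the `find_all` helper and the sample sentences in the usage note are illustrative assumptions.

```python
import re

# Patterns transcribed from the slide (backslashes written out).
# The dictionary keys are labels chosen for this sketch only.
PATTERNS = {
    "mali": r"Mali\W",                   # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                  # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset",  # one or more whitespace characters between words
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",   # Capitalised word followed by a lowercase word
}


def find_all(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)
```

For example, `find_all("mali", "Cases in Mali, not Malicious rumours")` matches `Mali,` but skips `Malicious`. The binomial pattern is deliberately loose: it also matches any capitalised word followed by a lowercase word, such as "The outbreak", so its hits would need filtering against a species list.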
Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
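To show how a file like this could be consumed, here is a minimal Python sketch that loads a three-rule subset of the compound regex and applies it to a line of text. It is not AMI's implementation: the slide does not say how AMI uses the `weight` attribute, so summing the weights of matching rules is an assumption, and `load_rules` and `score_line` are hypothetical helper names.

```python
import re
import xml.etree.ElementTree as ET

# A three-rule subset of the slide's compoundRegex, with backslashes written out.
COMPOUND_XML = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
</compoundRegex>
"""


def load_rules(xml_text):
    """Return a (field, weight, compiled pattern) tuple for each <regex> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]


def score_line(rules, line):
    """Assumed scoring: sum the weights of all rules that match the line."""
    matched = [(field, weight) for field, weight, pattern in rules if pattern.search(line)]
    total = sum(weight for _, weight in matched)
    return total, [field for field, _ in matched]
```

Calling `score_line(load_rules(COMPOUND_XML), "Sudden onset of Ebola in Sierra Leone")` reports all three fields as matched; a line with no Ebola-related terms scores zero.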
Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
        name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
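Output in this shape can be read back with Python's standard XML parser. The sketch below is illustrative only: the slide shows a truncated excerpt, so the sample here is shortened and closed with `</results>` and `</resultsList>` as an assumption to make it well-formed, and `collect_hits` is a hypothetical helper. Note that the outer `regex` elements are in the AMI namespace while `hits` and `hit` are not.

```python
import xml.etree.ElementTree as ET

AMI_NS = "http://www.xml-cml.org/ami"

# Shortened sample modelled on the slide; the closing tags are assumed.
SAMPLE = """
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
          lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <hits xmlns=""><hit ebola="Ebola" /></hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
          lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <hits xmlns=""><hit sierra_leone="Sierra Leone" /></hits>
      </regex>
    </result>
  </results>
</resultsList>
"""


def collect_hits(xml_text):
    """Return (line number, field, matched text) for every <hit> attribute."""
    root = ET.fromstring(xml_text.strip())
    rows = []
    # Only the outer, namespaced <regex> elements carry lineNumber.
    for regex_el in root.iter(f"{{{AMI_NS}}}regex"):
        line_number = int(regex_el.get("lineNumber"))
        for hit in regex_el.iter("hit"):
            for field, value in hit.attrib.items():
                rows.append((line_number, field, value))
    return rows
```

Rows produced this way can be counted per field or per source file to give a quick picture of which terms occur where.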
Demo of Content Mining
ChemicalTagger (Lezan Hawizy) a shallow,
domain-specific, semantic parser for un/natural
language.
Bacterial WP_phylogenetic tree
Our machines have read and interpreted 4300 in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
[Figure labels: WP: Clostridium_butyricum; GenBank ID; American Type Culture Collection]
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram labels: Queues; Repos; Scientific literature; Science Plugins; Science Volunteers]
Collaboration with
Open Access Button
AMI (extraction) architecture
[Diagram labels: PDF2SVG; Image analysis; SVG2XML; Regex, Species, Phylo, Chem; AMI; tables, sections, captioned diagrams]
Immediate Stakeholders
– Researchers (bio, EBI, chem, materials, astro)
– Funders (WT, FWF (Austria), RCUK)
– Libraries (repositories, theses)
– Service providers (EuropePMC)
– knowledge-based SMEs
– Library organisations (JISC, RLUK, LIBER, SPARC)
– Non-profits (Wikimedia, WHO, Mozilla)
Content production
• Scholarly articles
• Theses
• Repositories
• Grey scientific literature
• Grey politico-socio-legal literature
• Company output (reports, accounts, contracts)
(e.g. OpenOil)
STM Publishers Licence
2012_03_15_Sample_Licence_Text_Data_Mining.pdf
(Summary: PMR has NO rights)
• [cannot publish to: ] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or
website”
• “Subscriber shall pay a […] fee”
Heather Piwowar: “negotiating with publishers [made me physically ill]”
WE WALKED OUT
• Brit Library
• JISC
• RLUK
• OKFN
• …
• Ross Mounce
• PM-R
Licences destroy Content Mining
Challenges
• Active opposition from content “owners”
including serious lobbying and FUD
• Ignorance and apathy from universities;
inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology,
APIs
• And it’s objectively messy anyway
Technical problems
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted
to PNG or worse)
• No communal ontology for document
structure.
• HTML carries PublisherJunk and Javascript
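As an illustration of the last point, a first cleaning step for HTML is to drop script and style content before any mining. This is a minimal sketch using only Python's standard `html.parser`; the class and function names are assumptions for this example, and what counts as publisher junk beyond scripts and styles is site-specific and not handled here.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than zero while inside a skipped element
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the visible text of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The text returned by `visible_text` can then be passed line by line to regexes such as those shown earlier in the deck.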
Goals of Mining
• Classification of resources
• Entity extraction and indexing
• Aggregation within discipline
• Inter-disciplinary, e.g. biodiversity,
phytochemistry
• Repurposing (twitter, ePub, annotation)
• Semantification/intelligent documents
• Detection of error and fraud
What we need
• Inter/national commitment to infrastructure
• Common ontologies and APIs
• Development of community
• Go beyond academia; non-academic reward
system
