Overview of Practical Content Mining 
Peter Murray-Rust 
JISC, London, 2014-12-01
What is Content Mining?
• Mining Text, Tables and Lists, Diagrams, Images 
• Born-digital documents 
• High-throughput (millions of items/year) 
• Formal and Informal Collaboration 
• Role of UK 
• Hands-on 
• Everything is OPEN (OSI, CC-BY, CC0)
The Right to Read is the Right to Mine 
http://contentmine.org
ContentMine 
• 1-2 year Shuttleworth Funding from 2014-03 
• Free to everyone, Open Source, updated daily 
• Structured Text, and Image/Diagram Mining 
• Workshops for training and training trainers 
• Bottom-up community development 
– Bioscience (EuropePMC, BBSRC) 
– Disease (Ebola)
– Astrophysics (Stray Toaster) 
– Chemistry (TSB, EBI, PennState - Citeseer) 
• We fight for Justice and Freedom
ContentMine People 
• Jenny Molloy 
• Ross Mounce 
• Peter Murray-Rust + volunteers (Bioscience, disease) 
• Richard Smith-Unna + 20 quickscrape volunteers 
• Steph Unna 
• Cottage Labs (Mark MacGillivray, Emanuil Tolev, Richard Jones)
• Prof Charles Oppenheim 
• Karien Bezuidenhout (Shuttleworth) 
• Advisory Board RSN
ContentMine Workshops 
(from 1 hour to a full day or more)
May to November 2014
• Budapest/Shuttleworth 
• Leicester Univ 
• Electronic Theses and Dissertations 
• Austrian Science Fund AT 
• OKFest DE 
• Eur. Bioinformatics Institute 
• Open Science Rio de Janeiro BR 
• SciDataCon, Delhi IN
• Univ of Chicago US
• OpenCon 2014, Washington DC, US
Upcoming 
• JISC 
• LIBER 
• BL 
• Wellcome Trust 
• WHO
Ebola Collaborators (Atlanta) 
Roxanne Further Moore, Jessie Gunter, April Clyburne-Sherin
Regular Expressions
(Easier than Crosswords or Sudoku)
Text to match                      Regex                   Note
Ebola                              Ebola
Mali (not Malicious)               Mali\W                  (end of word)
Bat or bat                         [Bb]at                  (alternatives)
bat or bats                        bats?                   (optional letter)
Bat or Bats or bat or bats         [Bb]ats?
Sudden onset                       [Ss]udden\s+onset       (space/s)
Panthera leo or Gorilla gorilla    [A-Z][a-z]+\s+[a-z]+    (ranges of letters)
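The patterns above can be tried directly in most regex engines. The following is a minimal sketch using Python's standard `re` module (an assumption; the slides do not say which engine is used). The `PATTERNS` dictionary and the `matches` helper are illustrative names, not part of any ContentMine tool.

```python
import re

# Patterns from the slide, with the backslashes written out explicitly.
PATTERNS = {
    "ebola": r"Ebola",
    "mali": r"Mali\W",                    # "Mali" followed by a non-word character
    "bat": r"[Bb]ats?",                   # Bat, bat, Bats or bats
    "sudden_onset": r"[Ss]udden\s+onset", # one or more whitespace characters
    "binomial": r"[A-Z][a-z]+\s+[a-z]+",  # e.g. Panthera leo
}


def matches(name, text):
    """Return every non-overlapping match of the named pattern in text."""
    return re.findall(PATTERNS[name], text)


print(matches("mali", "Malicious rumours spread in Mali."))
print(matches("bat", "Bats roost here; one bat flew off."))
print(matches("sudden_onset", "Sudden  onset of fever"))
print(matches("binomial", "Panthera leo and Gorilla gorilla"))
```

Two caveats are worth noting. `Mali\W` needs a character after the word, so it misses "Mali" at the very end of a text; `\b` is a common alternative. The binomial pattern matches any capitalised word followed by a lowercase word, so it will also pick up phrases such as "Sudden onset".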
Ebola regex
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola" case="">(Ebola)</regex>
  <regex weight="1.0" fields="marburg">(Marburg)</regex>
  <regex weight="1.0" fields="hemorrhagic_fever">([Hh]a?emorrhagic\s+fever)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.6" fields="vomiting_diarrhoea">([Vv]omiting\s+diarrho?ea)</regex>
  <regex weight="0.5" fields="guinea">(Guinea)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="liberia">(Liberia)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
  <regex weight="0.6" fields="contact_tracing">([Cc]ontact\s+tracing)</regex>
  <regex weight="0.5" fields="bat">\W([Bb]ats?\W)</regex>
  <regex weight="0.5" fields="bushmeat">([Bb]ushmeat)</regex>
  <regex weight="0.5" fields="drc">(Democratic Republic\s*(\s*of)?(\s*the)?\s*Congo)|(DRC)</regex>
  <regex weight="0.6" fields="safe_burial">([Ss]afe\s+burial\s+practice?s)</regex>
  <regex weight="1.0" fields="etu">([Ee]bola\s+treatment\s+units?)|(ETU)</regex>
</compoundRegex>
15 mins to create, 15 mins to install and test
Or run online at CottageLabs
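To illustrate how a weighted compound regex of this kind might be applied, here is a rough sketch in Python. It is not the AMI implementation: the slides do not say how AMI combines the `weight` values, so the scoring rule below (weight multiplied by match count, summed over rules) is purely an assumption, as are the names `COMPOUND`, `load_rules` and `score`. The XML is a shortened copy of the slide's rules with backslashes written out.

```python
import re
import xml.etree.ElementTree as ET

# Shortened copy of the slide's compoundRegex (backslashes restored).
COMPOUND = r"""
<compoundRegex title="ebola">
  <regex weight="1.0" fields="ebola">(Ebola)</regex>
  <regex weight="0.8" fields="sudden_onset">([Ss]udden\s+onset)</regex>
  <regex weight="0.5" fields="sierra_leone">(Sierra\s+Leone)</regex>
  <regex weight="0.5" fields="mali">(Mali)\W</regex>
</compoundRegex>
"""


def load_rules(xml_text):
    """Return a (field, weight, compiled pattern) tuple per <regex> element."""
    root = ET.fromstring(xml_text.strip())
    return [
        (el.get("fields"), float(el.get("weight")), re.compile(el.text))
        for el in root.findall("regex")
    ]


def score(text, rules):
    """Assumed scoring rule: sum of weight * match count over all rules."""
    counts = {}
    total = 0.0
    for field, weight, pattern in rules:
        n = len(pattern.findall(text))
        if n:
            counts[field] = n
            total += weight * n
    return total, counts


rules = load_rules(COMPOUND)
print(score("Ebola cases in Sierra Leone and Mali. Ebola spreads.", rules))
```

A score of this kind could be used to rank documents by likely relevance to Ebola, with low-weight terms such as country names contributing less than disease-specific terms.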
Results of Regex on Ebola
<resultsList xmlns="http://www.xml-cml.org/ami">
  <results xmlns="">
    <source xmlns="http://www.xml-cml.org/ami"
      name="/Users/pm286/workspace/ami-core/./docs/ebola/text/14Nov.txt" />
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="7"
        lineValue=" There have been 14 413 reported Ebola cases in eight countries since the outbreak ">
        <regex xmlns="" weight="1.0" fields="[ebola]">
          <pattern>(Ebola)</pattern>
        </regex>
        <hits xmlns="">
          <hit ebola="Ebola" />
        </hits>
      </regex>
    </result>
    <result>
      <regex xmlns="http://www.xml-cml.org/ami" lineNumber="9"
        lineValue="HIGHLIGHTS Case incidence continues to increase in Sierra Leone, and transmission also remains ">
        <regex xmlns="" weight="0.5" fields="[sierra_leone]">
          <pattern>(Sierra\s+Leone)</pattern>
        </regex>
        <hits xmlns="">
          <hit sierra_leone="Sierra Leone" />
        </hits>
      </regex>
    </result>
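The results above record, for each matching line, its number, its text, the rule that fired and the hits. A line-by-line scan producing comparable records can be sketched in Python as follows. This is an illustration only, not AMI's code: the record keys are borrowed from the attribute names in the output above, while `RULES`, `scan_lines` and the choice of 1-based line numbering are assumptions.

```python
import re

# Two of the slide's rules, keyed by field name.
RULES = {
    "ebola": re.compile(r"(Ebola)"),
    "sierra_leone": re.compile(r"(Sierra\s+Leone)"),
}


def scan_lines(text, rules):
    """Yield one record per (line, rule) pair with at least one match."""
    for number, line in enumerate(text.splitlines(), start=1):
        for field, pattern in rules.items():
            hits = pattern.findall(line)
            if hits:
                yield {
                    "lineNumber": number,  # 1-based numbering is an assumption
                    "lineValue": line,
                    "field": field,
                    "hits": hits,
                }


sample = (
    "There have been 14 413 reported Ebola cases in eight countries\n"
    "Case incidence continues to increase in Sierra Leone"
)
for record in scan_lines(sample, RULES):
    print(record["lineNumber"], record["field"], record["hits"])
```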
Demo of Content Mining 
ChemicalTagger (Lezan Hawizy): a shallow, domain-specific, semantic parser for un/natural language.
Bacterial WP_phylogenetic tree
[Figure: phylogenetic tree with callouts for Genbank ID, American Type Culture Collection, and WP: Clostridium_butyricum]
Our machines have read and interpreted 4300 trees in an hour with > 95% accuracy
Trees from http://ijs.sgmjournals.org/ used under new UK legislation (Hargreaves)
RSU: Richard Smith-Unna
PMR: Peter Murray-Rust
CL: CottageLabs
[Diagram labels: Queues, Repos, Scientific literature, Science, Plugins, Science, Volunteers]
Collaboration with Open Access Button
AMI (extraction) architecture
[Diagram labels: PDF2SVG, SVG2XML, Image analysis, sections, tables, AMI, captioned diagrams, Regex, Species, Phylo, Chem]
Immediate Stakeholders 
– Researchers (bio, EBI, chem, materials, astro) 
– Funders: WT, FWF (Austria), RCUK
– Libraries (repositories, theses) 
– Service providers (EuropePMC) 
– knowledge-based SMEs 
– Library organisations (JISC, RLUK, LIBER, SPARC) 
– Non-profits (Wikimedia, WHO, Mozilla)
Content production 
• Scholarly articles 
• Theses 
• Repositories 
• Grey scientific literature 
• Grey politico-socio-legal literature 
• Company output (reports, accounts, contracts) (e.g. OpenOil)
Licences destroy Content Mining 
WE WALKED OUT 
• British Library
• JISC 
• RLUK 
• OKFN 
• … 
• Ross Mounce 
• PM-R 
STM Publishers Licence 
2012_03_15_Sample_Licence_Text_Data_Mining.pdf 
(Summary: PMR has NO rights) 
• [cannot publish to:] “libraries, repositories, or archives”
• [cannot] “Make the results of any TDM Output available on an externally facing server or website”
• “Subscriber shall pay a […] fee” 
Heather Piwowar: “negotiating with publishers [made me physically ill]”
Challenges 
• Active opposition from content “owners”, including serious lobbying and FUD
• Ignorance and apathy from universities; inappropriate reward system
• Sub-optimal technology of publishers
• Lack of common infrastructure, technology, APIs
• And it’s objectively messy anyway
Technical problems 
• PDF: lacks words, tables, diagrams
• Non-Unicode character sets (or worse)
• Graphics objects largely destroyed (converted to PNG or worse)
• No communal ontology for document structure
• HTML carries PublisherJunk and JavaScript
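As a small illustration of the last point, script and style content can be dropped before mining the text of an HTML page. The sketch below uses Python's standard `html.parser` module; the class `TextOnly` and function `visible_text` are hypothetical names, and the approach handles only `<script>` and `<style>` elements. Publisher-specific navigation, adverts and similar markup differ between sites and would need per-site rules that are not shown here.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.depth = 0   # greater than 0 while inside a skipped element
        self.parts = []  # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.parts.append(data.strip())


def visible_text(html):
    """Return the text of an HTML fragment without script or style content."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


print(visible_text("<p>Ebola in <b>Mali</b></p><script>var x = 1;</script>"))
```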
Goals of Mining 
• Classification of resources 
• Entity extraction and indexing 
• Aggregation within discipline 
• Inter-disciplinary, e.g. biodiversity, phytochemistry
• Repurposing (twitter, ePub, annotation) 
• Semantification/intelligent documents 
• Detection of error and fraud
What we need 
• Inter/national commitment to infrastructure 
• Common ontologies and APIs 
• Development of community 
• Go beyond academia; non-academic reward system
