Architecture of ContentMine Components


This is the evolving architecture of ContentMine. It includes an overview (slide #2) showing getpapers, quickscrape, norma and ami.

The key container is the CTree, and the architecture shows where components are added to it or transformed within it.
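
As a rough illustration of the "container" idea, the sketch below models a CTree as a per-paper directory to which components are added and from which new components are derived. This is not ContentMine code: the class, its methods, the directory name and the file names (`fulltext.html`, `scholarly.html`, `images/fig1.png`) are all assumptions made for the example, and the whitespace-collapsing step only stands in for a real normalization.

```python
from pathlib import Path
import tempfile


class CTree:
    """A per-paper directory that accumulates components (illustrative only)."""

    def __init__(self, root: Path):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def add(self, name: str, data: bytes) -> Path:
        """Add a component (for example a downloaded HTML file) to the tree."""
        target = self.root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_bytes(data)
        return target

    def transform(self, source: str, target: str, convert) -> Path:
        """Derive a new component from an existing one; the source is kept."""
        data = (self.root / source).read_bytes()
        return self.add(target, convert(data))

    def components(self) -> list[str]:
        """Relative paths of all files currently in the tree, sorted."""
        return sorted(
            p.relative_to(self.root).as_posix()
            for p in self.root.rglob("*")
            if p.is_file()
        )


def demo() -> list[str]:
    with tempfile.TemporaryDirectory() as tmp:
        tree = CTree(Path(tmp) / "paper_0001")           # hypothetical directory name
        tree.add("fulltext.html", b"<p>Raw   HTML</p>")  # hypothetical file name
        tree.add("images/fig1.png", b"\x89PNG")          # placeholder bytes, not a real image
        # Stand-in for a normalization step: collapse runs of whitespace.
        tree.transform("fulltext.html", "scholarly.html",
                       lambda raw: b" ".join(raw.split()))
        return tree.components()


if __name__ == "__main__":
    print(demo())
```

The point of the sketch is that each tool only reads from and writes to the same directory, so components can be added or transformed independently.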

These slides are date-stamped and may be out of date with respect to the code. Some diagrams are autogenerated from *.dot files.

Please use as the main source of up-to-date info. Feel free to ask questions, offer help, critique, etc.

All s/w is Open (BSD, Apache2)

Published in: Science

  1. 1. Architecture of TheContentMine. These slides are for enlightenment and presentations. Use for up-to-date info. Questions, comments and critiques welcome! All s/w is Open (BSD/Apache2). Some diagrams are autogenerated from *.dot files, which are located in the projects (mainly Norma and AMI).
  2. 2. Overview diagram (latest 20150908). Labels: catalogue; getpapers; query; Daily Crawl; EuPMC, arXiv, CORE, HAL, (UNIV repos); ToC services; PDF, HTML, DOC, ePUB, TeX, XML, PNG, EPS, CSV, XLS; URLs, DOIs; crawl; quickscrape; norma: Normalizer, Structurer, Semantic Tagger; Text, Data, Figures; ami; UNIV Repos; search; Lookup; CONTENT MINING; Chem, Phylo, Trials, Crystal, Plants; COMMUNITY plugins; Visualization and Analysis; PLoS ONE, BMC, PeerJ…; Nature, IEEE, Elsevier…; Publisher Sites; scrapers; queries; taggers; abstract, methods, references; Captioned Figures; Fig. 1; HTML tables; 30,000 pages/day; Semantic ScholarlyHTML; Facts.
  3. 3. Pipeline diagram (latest 20150908; limited in scope). Labels: quickscrape; Norma; Index & Transform; PDF, XML, URL, DOI, DOC, CSV; sHTML; Plugins; Sequences, Species; Bespoke Scrapers; XPath; Taggers; Per-Journal; Chemistry, Phylogenetics, Plants; AMI; BadHTML; OCR; Diagrams; CAT-alogue index; getpapers query; Titles + links; Daily Crawl/feed; EuPMC; JToCs.
  4. 4. Starting points for ingestion (getpapers/quickscrape/Norma) (20150908)
     • Search/Crawl/Feed -> PMCID, DOI, URL -> quickscrape -> CTree(PDF, HTML, XML, images/, meta) -> Norma -> CMDir(sHTML|TXT|SVG|image): good
     • PDF, XML, TXT, HTML -> Norma -> CTree(PDF, rawHTML, TXT, images/, meta?) -> NormaOCR|TXT2HTML -> CTree(sHTML, TXT, SVG): variable
  5. 5. Norma Conversions (20150908)
     • Paper -> Scanned -> TIFF (avoid)
     • PDF, TIFF, PNG -> Tesseract-N -> HTML, SVG: fast, variable
     • PDF -> PDF2SVG-N -> sHTML, SVG, images/: slow, accurate-ish
     • PDF -> PDF2TXT-N -> TXT: fast, variable
     • PDF -> PDF2Image-N -> PNG: fast, accurate
  6. 6. Norma End points
     • Norma -> CTree(OpenSHTML-SVG) -> everything?
     • Norma -> CTree(sHTML, sections) -> AMI -> (all text + species, chemText, sequences)
     • Norma -> CTree(TXT (unsectioned)) -> AMI -> bagOfWords, regex, IDs, species?
     • Norma -> CTree(PNG) -> AMI -> phylo, bar/xy-plots
     • Norma -> CTree(SVG) -> AMI -> phylo, bar/xy-plots, chemistry
  7. 7. Pre/early Norma toolchain: transforming PDF and PNG into higher-value components. (20150908; diagram autogenerated from *.dot graph)
  8. 8. getpapers/quickscrape/Norma workflow. (20150908; diagram autogenerated from *.dot graph)
  9. 9. getpapers/quickscrape/Norma: commonest uses. (20150908; diagram autogenerated from *.dot graph)
  10. 10. AMI: inputs and outputs for common plugins. (20150908; diagram autogenerated from *.dot graph)
  11. 11. Earlier diagrams. Probably significantly out of date, but may contain useful info.
  12. 12. NORMALIZE diagram. Labels: Norma (convert PDF, XML to sHTML; tag sections); Normalized Scientific Literature; AMI (Index, Transform, Extract, Search); PDF2SVG; XSL stylesheets; Taggers; normalization parameters; “Permanent” filestore; Temporary filestore; Extracted facts; indexes; Plugins; Regex.
  13. 13. PDF diagram. Labels: PDF (non-Unicode, pixel glyphs, no words, no structures); ScholarlyHTML; SVG; high-level graphics; PDF2SVG; characters; sentences; paras; tables; PNG; OCR; Tagged Sections; SVGBuilder; Captioned Figures; NORMA; XSLT1/2.
  14. 14. Raw HTML diagram. Labels: Raw HTML (not well-formed, bad character semantics); ScholarlyHTML (well-formed XHTML); PNG; Tagged Sections; Captioned Figures; Tables; Captioned Tables; XML; HtmlTidy; Jsoup; HtmlUnit; XSLT1/2; NORMA; per-journal stylesheets.
  15. 15. RSU: Richard Smith-Unna; PMR: Peter Murray-Rust; CL: CottageLabs. Labels: Queues; Repos; Scientific literature; Science Plugins; Science Volunteers; Collaboration with Open Access Button.
  16. 16. CANARY pipeline diagram. Labels: quickscrape; Crawl; Feed; Norma; Index & Transform; TXT, XML, URL, DOI, DOC, CSV, PDF; Scientific literature; Repositories; sHTML; Plugins; Regex; Sequences, Species; Bespoke Scrapers; XPath; Per-Journal; Taggers; Per-Journal Metadata; Chemistry, Phylogenetics, Farming; AMI; BadHTML; OCR; Diagrams; Open NORMA-lized Scientific Literature + Facts; CAT-alogue index.
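
The conversion and end-point lists on slides 5 and 6 can be read as two lookup tables, and chaining them gives a quick answer to "what could AMI extract from this input?". The following is only a sketch of that reading: the tool names, formats and notes are copied from the slides, while the data structures and the function names (`tools_for`, `reachable_outputs`) are assumptions introduced for illustration and do not call any ContentMine software.

```python
from typing import NamedTuple


class Conversion(NamedTuple):
    inputs: tuple   # formats the tool accepts
    tool: str       # tool name as written on slide 5
    outputs: tuple  # formats the tool produces
    note: str       # speed/accuracy remark from slide 5


# Rows copied from the "Norma Conversions" slide.
CONVERSIONS = [
    Conversion(("PDF", "TIFF", "PNG"), "Tesseract-N", ("HTML", "SVG"), "fast, variable"),
    Conversion(("PDF",), "PDF2SVG-N", ("sHTML", "SVG", "images/"), "slow, accurate-ish"),
    Conversion(("PDF",), "PDF2TXT-N", ("TXT",), "fast, variable"),
    Conversion(("PDF",), "PDF2Image-N", ("PNG",), "fast, accurate"),
]

# Rows copied from the "Norma End points" slide: CTree content -> AMI outputs.
AMI_OUTPUTS = {
    "sHTML": ("all text", "species", "chemText", "sequences"),
    "TXT": ("bagOfWords", "regex", "IDs", "species?"),
    "PNG": ("phylo", "bar/xy-plots"),
    "SVG": ("phylo", "bar/xy-plots", "chemistry"),
}


def tools_for(input_format: str) -> list:
    """Names of the conversion tools that accept input_format."""
    return [c.tool for c in CONVERSIONS if input_format in c.inputs]


def reachable_outputs(input_format: str) -> list:
    """AMI outputs reachable from input_format via a single conversion."""
    found = []
    for conv in CONVERSIONS:
        if input_format not in conv.inputs:
            continue
        for produced in conv.outputs:
            for out in AMI_OUTPUTS.get(produced, ()):
                if out not in found:
                    found.append(out)
    return found


if __name__ == "__main__":
    print(tools_for("PDF"))
    print(reachable_outputs("TIFF"))
```

Keeping the slide rows as plain data makes it easy to update the tables when the slides or the code change, without touching the lookup logic.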