GoldenGATE v3: The Plazi Markup System
 


Talk on Plazi Markup System at Artdatenbanken Workshop

    Presentation Transcript

    • The PLAZI Markup System
      • Donat Agosti, Terry Catapano, Robert “Bob” Morris, Guido Sautter
      • Universität Karlsruhe (TH), Research University, founded 1825
    • The PLAZI Markup System (overview diagram)
      • GoldenGATE Document Editor: document markup, external referencing
      • PLAZI Server: XML & PDF storage, treatment server
      • PLAZI Search Portal: search portal, TAPIR provider, RSS feed
      • External Data Sources: taxonomic data sources & web services
      • Data exchanged between components: Marked-Up Documents; Queries; Treatments, Detail Data, PDF Document; Handles; Links, Materials Citations; Taxon LSIDs, GeoData; New Taxon Names
    • The PLAZI Search Portal
      • Series of Java Servlets running in Apache Tomcat
      • Front-end for SRS Web Service
      • Linker plug-ins create hyperlinks to other web sites
      • HTML based search portal for humans
        • Search treatments & index data
        • Links submitting new search queries
        • Links to external data sources (e.g. HNS, GoogleMaps)
        • Links to PDF document & XML versions of treatments
      • XML document access in various XML schemas
      • TAPIR provider
        • Taxonomic names
        • Materials citations
      • RSS feed for new treatments
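The linker plug-ins listed above turn detail data into hyperlinks to other web sites. A minimal sketch of that idea follows, assuming a hypothetical `DetailLinker` interface (the portal's real plug-in API is not shown in the slides) and using the documented Google Maps URL format for a coordinate search:

```java
// Hypothetical plug-in contract; names are assumptions made for this sketch,
// not the search portal's actual API.
interface DetailLinker {
    String label();
    String href(double latitude, double longitude);
}

// Illustrative linker that points a collecting locality at Google Maps.
class MapDetailLinker implements DetailLinker {
    public String label() {
        return "Map";
    }

    public String href(double latitude, double longitude) {
        // "%2C" is the URL-encoded comma between latitude and longitude.
        return "https://www.google.com/maps/search/?api=1&query="
                + latitude + "%2C" + longitude;
    }

    // Renders the link as an HTML anchor, escaping '&' for use in an attribute.
    String toHtml(double latitude, double longitude) {
        return "<a href=\"" + href(latitude, longitude).replace("&", "&amp;")
                + "\">" + label() + "</a>";
    }
}
```

A portal could keep a list of such linkers and call each one while rendering a treatment, so that adding a new external site only requires adding one class.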
    • The PLAZI Search Portal (screenshot; example taxon: Probolomyrmex tani)
    • The PLAZI Server
      • GoldenGATE Search & Retrieval Server (SRS)
        • Extracts individual treatments from XML documents
        • Stores and indexes treatments
        • Based on independent, pluggable Indexers
          • Taxonomic names
          • Materials citations
          • Document metadata
          • Full text
        • Serves treatments or indexed details
      • DSpace
        • Stores PDF and XML documents
        • Issues Handles for documents
      [Diagram: Web Service in front of the SRS; four Indexers (TN, MC, MD, FT); storage in PostgreSQL and the File System; data labels: Document Management Data, Index Data, XML Documents]
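The pluggable-indexer design described on this slide can be sketched as follows. This is only an illustration: the `TreatmentIndexer` interface, the registry, and the `<taxonomicName>` element name are assumptions, since the slides do not show the actual SRS API or schema. A regular expression stands in for a real XML parser to keep the sketch short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical indexer contract: each indexer extracts one kind of detail.
interface TreatmentIndexer {
    String name();
    List<String> index(String treatmentXml);
}

// Collects the text of <taxonomicName> elements (element name is assumed).
class TaxonNameIndexer implements TreatmentIndexer {
    private static final Pattern NAME =
            Pattern.compile("<taxonomicName>(.*?)</taxonomicName>", Pattern.DOTALL);

    public String name() {
        return "TN";
    }

    public List<String> index(String treatmentXml) {
        List<String> names = new ArrayList<>();
        Matcher m = NAME.matcher(treatmentXml);
        while (m.find()) {
            names.add(m.group(1).trim());
        }
        return names;
    }
}

// Strips tags and lower-cases the remaining words for full-text search.
class FullTextIndexer implements TreatmentIndexer {
    public String name() {
        return "FT";
    }

    public List<String> index(String treatmentXml) {
        String text = treatmentXml.replaceAll("<[^>]+>", " ").trim().toLowerCase();
        List<String> tokens = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}

// Runs every registered indexer on a treatment; indexers stay independent.
class IndexerRegistry {
    private final List<TreatmentIndexer> indexers = new ArrayList<>();

    void register(TreatmentIndexer indexer) {
        indexers.add(indexer);
    }

    Map<String, List<String>> indexAll(String treatmentXml) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (TreatmentIndexer indexer : indexers) {
            result.put(indexer.name(), indexer.index(treatmentXml));
        }
        return result;
    }
}
```

Because the registry only sees the interface, an indexer for materials citations or document metadata could be added without touching the existing ones.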
    • GoldenGATE Server
      • Extract individual treatments
      • Transform to TaxonX
      • Transform to SPM
      • Forward updates & deletions
      • Upload or update document
      • Checkout document for further markup
      [Diagram: GoldenGATE Server running in Apache Tomcat with components DIO, DIO-SRS, EXC, RES, RES-DIO, SRS, UAA; databases DIO DB and SRS DB; GoldenGATE Editor (DIO-CON); eXist; SRS Search Portal; SRS TAPIR Service; OCR; Data Sources (GBIF, etc); Public (EOL, etc); Remote GoldenGATE Server (DIO, DIO-SRS, SRS, DIO DB, SRS DB)]
    • The PLAZI Markup Process
      • Process yields semantically enhanced documents:
        • Treatment as atomic unit of text, inner structure for details
        • Materials citations associated with taxa in machine-readable form
      • Sequence of steps engineered for maximum automation
      • First part (upper row) generic to enhancing OCR output
      • Only second part (lower row) specific to taxonomy
      [Diagram: process steps Scanning & OCR, Layout Artifacts, Paragraph Normalization, Structural Normalization, MODS, Taxon Name Markup, Treatment Markup, Treatment Structure, Location Markup; documents Printed document, HTML document, Normalized Document, Treatments & Structure; tools Abbyy FineReader, GoldenGATE Document Editor]
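The split between generic OCR clean-up steps and taxonomy-specific steps can be sketched as an ordered pipeline. The step implementations below are simplified stand-ins invented for this illustration (a page-number filter, a de-hyphenation rule, and a fixed-name tagger); they are not the GoldenGATE components, and the `<taxonomicName>` element name is an assumption.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// One named step; the flag records whether it is specific to taxonomy.
class MarkupStep {
    final String name;
    final boolean taxonomySpecific;
    final UnaryOperator<String> action;

    MarkupStep(String name, boolean taxonomySpecific, UnaryOperator<String> action) {
        this.name = name;
        this.taxonomySpecific = taxonomySpecific;
        this.action = action;
    }
}

// Applies the steps in the order in which they were added.
class MarkupPipeline {
    private final List<MarkupStep> steps = new ArrayList<>();

    MarkupPipeline add(MarkupStep step) {
        steps.add(step);
        return this;
    }

    String run(String document) {
        String current = document;
        for (MarkupStep step : steps) {
            current = step.action.apply(current);
        }
        return current;
    }

    // Hypothetical example configuration: two generic steps, one taxonomic step.
    static MarkupPipeline example() {
        return new MarkupPipeline()
                // Generic: drop lines that contain only a page number.
                .add(new MarkupStep("Layout artifacts", false,
                        d -> d.replaceAll("(?m)^\\s*\\d+\\s*\\n", "")))
                // Generic: rejoin words hyphenated at a line end, then unwrap lines.
                .add(new MarkupStep("Paragraph normalization", false,
                        d -> d.replaceAll("(\\p{L})-\\n(\\p{L})", "$1$2").replace('\n', ' ')))
                // Taxonomy-specific: wrap one known name in an (assumed) element.
                .add(new MarkupStep("Taxon name markup", true,
                        d -> d.replace("Probolomyrmex tani",
                                "<taxonomicName>Probolomyrmex tani</taxonomicName>")));
    }
}
```

Keeping the taxonomy-specific steps at the end means the first part of the pipeline could be reused for OCR output from other domains.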
    • The GoldenGATE Editor
      • Java-based editor for semi-automated document markup
      • Extensible through plug-in mechanism
      • Independent of specific XML schema
      • Element-level XML editing (XML syntax is generated)
      • Flexible display for clear view on all detail levels
      • Existing plug-ins provide broad spectrum of functionality:
        • NLP-based markup generation
          • Regular expressions, gazetteers, GATE JAPE
          • Homegrown and third-party NLP components
          • Import of data from external sources (e.g. LSIDs)
        • Specialized document views for correcting NLP results
        • Markup transformation & filtering
        • IO components for different data formats & storage locations (e.g. for uploading XML documents to PLAZI server)
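As an illustration of markup generation from regular expressions and gazetteers, the sketch below proposes binomial name candidates with a pattern and keeps only those whose genus appears in a word list. The class names, the pattern, and the stand-off `Annotation` type are assumptions for this example and do not reproduce any GoldenGATE plug-in.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Stand-off annotation: the editor works on elements, not on XML syntax.
class Annotation {
    final String type;
    final int start;
    final int end;
    final String text;

    Annotation(String type, int start, int end, String text) {
        this.type = type;
        this.start = start;
        this.end = end;
        this.text = text;
    }
}

class TaxonNameTagger {
    // Candidate binomial: capitalized genus followed by a lower-case epithet.
    private static final Pattern BINOMIAL =
            Pattern.compile("\\b([A-Z][a-z]+) ([a-z]{2,})\\b");

    private final Set<String> genusGazetteer;

    TaxonNameTagger(Set<String> genusGazetteer) {
        this.genusGazetteer = new HashSet<>(genusGazetteer);
    }

    // Keeps only the candidates whose genus is listed in the gazetteer.
    List<Annotation> tag(String text) {
        List<Annotation> annotations = new ArrayList<>();
        Matcher m = BINOMIAL.matcher(text);
        while (m.find()) {
            if (genusGazetteer.contains(m.group(1))) {
                annotations.add(new Annotation("taxonomicName", m.start(), m.end(), m.group()));
            }
        }
        return annotations;
    }
}
```

The pattern alone would also match ordinary phrases such as "The ant"; the gazetteer lookup filters those out, and anything that still slips through is what the specialized correction views are for.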
    • The GoldenGATE Editor
    • GoldenGATE Markup Wizard
      • Basically, a GoldenGATE Editor with another GUI
      • Higher degree of automation and user guidance
      • Configurable specifically for a given series of publications
      • Automatically decides how to proceed with markup
      • Highlights potential errors for user to correct
      • Prevents overlooking errors
        • Less effort for correction
        • No error propagation
      • Same editing functionality for corrections as in the Editor
      • Result: more efficient markup of well-structured publications
    • Community Markup
      • Community does interactive part of markup
      • Online interface, no local program
      • GoldenGATE Editor embedded in Server
      • Measures contribution
      • Shows who contributes most
      [Diagram: GoldenGATE Server running in Apache Tomcat with components CMS, DIO, DIO-SRS, EXC, DPS, SRS, USS, ECS, UAA; databases DIO DB and SRS DB; GoldenGATE Editor (DIO-CON); eXist; SRS Search Portal; SRS TAPIR Service; OCR; Data Sources (GBIF, etc); Public (EOL, etc); Markup Community Portal; Markup Volunteers; Public Ranking]
    • The External Data Sources
      • Hymenoptera Name Server (HNS)
        • Retrieve LSIDs for taxon names
        • Enter new taxon names in HNS database
      • Further LSID sources: ZooBank, Index Fungorum
      • GBIF pulls materials citations via TAPIR
      • EOL pulls treatments via TAPIR (to start soon)
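LSIDs retrieved from these sources follow the URN layout `urn:lsid:<authority>:<namespace>:<object>` with an optional trailing revision. A minimal parsing sketch is shown below; the `LsidParts` class is hypothetical and the example identifier used with it is a placeholder, not a real record from HNS, ZooBank, or Index Fungorum.

```java
// Hypothetical value class holding the components of one LSID.
class LsidParts {
    final String authority;
    final String namespace;
    final String objectId;
    final String revision; // null when the LSID carries no revision

    private LsidParts(String authority, String namespace, String objectId, String revision) {
        this.authority = authority;
        this.namespace = namespace;
        this.objectId = objectId;
        this.revision = revision;
    }

    // Splits "urn:lsid:authority:namespace:object[:revision]" into its parts.
    static LsidParts parse(String lsid) {
        String[] parts = lsid.split(":");
        boolean wellFormed = (parts.length == 5 || parts.length == 6)
                && parts[0].equalsIgnoreCase("urn")
                && parts[1].equalsIgnoreCase("lsid");
        if (wellFormed) {
            for (int i = 2; i < parts.length; i++) {
                if (parts[i].isEmpty()) {
                    wellFormed = false;
                }
            }
        }
        if (!wellFormed) {
            throw new IllegalArgumentException("Not a well-formed LSID: " + lsid);
        }
        String revision = (parts.length == 6) ? parts[5] : null;
        return new LsidParts(parts[2], parts[3], parts[4], revision);
    }
}
```

For example, `LsidParts.parse("urn:lsid:example.org:names:12345")` yields the authority `example.org`, the namespace `names`, and the object identifier `12345`.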
    • Thank you! Questions?
      • Donat Agosti: [email_address]
      • Terry Catapano: catapanoth@gmail.com
      • Robert “Bob” Morris: ram@cs.umb.edu
      • Guido Sautter: sautter@ipd.uka.de
      • PLAZI homepage: http://plazi.org
      • PLAZI search portal: http://plazi.org:8080/GgServer
      • PLAZI main web portal: http://plazi2.cs.umb.edu/GgServer
      • GoldenGATE homepage: http://idaho.ipd.uka.de/GoldenGATE
    • Outlook
      • Tighter integration of GoldenGATE editor with server
        • Load plug-ins from server, which makes distributing updates easier
        • Upload documents directly after OCR
        • Host documents at server throughout markup
          • Users can share markup work (experts do LSIDs, etc)
          • Treatments available in search portal as soon as they are marked up
        • Auto-distribute documents to different storage locations
        • Run automated markup generation on server side
        • Get corrections from community via online feedback forms
      • Other extensions of GoldenGATE editor
        • Simplified, more flexible plug-in architecture
        • Extensible user interface
    • The GoldenGATE Editor V3
      • Plug-in GUI extensions (hideable)
      • Simplified, more flexible architecture
      • Pre-OCR page images for correcting OCR errors
      • Document navigator for finding things more quickly