The Semantic Web meets the Code of Federal Regulations

Semantic Web and natural-language-processing techniques meet the Code of Federal Regulations. Presentation from CALICON12 by the Legal Information Institute. Work on definition extraction, linked data publishing, search enhancement, vocabulary discovery.
Joint presentation with Nuria Casellas.

Transcript

  • 1. THE CFR MEETS THE SEMANTIC WEB (with a little unnatural language processing thrown in)
  • 2. BACKGROUND: A TWO-PART HISTORY OF THE SEMANTIC WEB
      • SW is a maze of confusing buzzwords
      • Can be thought of in two parts
          • Pre-2005 (the “top-down” period)
          • Post-2005 (the “bottom-up” period)
  • 3. SW PRE-2005
      o A fascination with inferencing & top-down analysis
      o Staked out a lot of theoretical territory
      o Built basic standards:
          • RDF (statement encoding): saying things about things
          • OWL (modeling and inferencing): describing relationships between things -- that is, creating ontologies
  • 4. SW FROM 2005 TO NOW
      o SW now seen as a big heap of statements
      o Became more practical
          o SKOS (inexpensive conversion method/standard for metadata)
          o Linked Data (altruistic, like named anchors ca. 1992)
      o Could be seen -- from a library point of view -- as a new set of techniques for metadata management better suited to the Web
  • 5. THE SEMANTIC WEB AT THE LII
      • Tying legal information to the real world, not just itself
      • Applications like:
          o Improvements to existing finding aids
              - Table of Popular Names, Tables I and III
              - Finer-grained, more expressive PTOA
          o Search enhancement via term substitution and expansion
          o Publication of “regulated nouns” and definitions as Linked Data
      • Research-driven engineering as a practice/culture
  • 6. WHY USE THE SW TOOLSET?
      • Sometimes the whole thing looks like an illustration of the Two Fool Rule
      • Why RDF?
          o XML is more cumbersome and less expressive
          o RDF supports inferencing
          o RDF allows processing of partial information
      • Why SPARQL?
          o um, SPARQL is how you query RDF
  • 7. WHY USE SKOS?
      o it's a simple knowledge organization system
      o lightweight representation of things we need a lot:
          o thesauri
          o taxonomies
          o classification schemes
      o it might be a little too simple
  • 8. SKOS: DRIVING INTO A DITCH
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:skos="http://www.w3.org/2004/02/skos/core#">
        <skos:Concept rdf:about="http://www.my.com/#canals">
          <skos:definition>A feature type category for places such as the Erie Canal</skos:definition>
          <skos:prefLabel>canals</skos:prefLabel>
          <skos:altLabel>canal bends</skos:altLabel>
          <skos:altLabel>canalized streams</skos:altLabel>
          <skos:altLabel>ditch mouths</skos:altLabel>
          <skos:altLabel>ditches</skos:altLabel>
          <skos:altLabel>drainage canals</skos:altLabel>
          <skos:altLabel>drainage ditches</skos:altLabel>
          <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
          <skos:related rdf:resource="http://www.my.com/#channels"/>
          <skos:related rdf:resource="http://www.my.com/#locks"/>
          <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
          <skos:related rdf:resource="http://www.my.com/#tunnels"/>
          <skos:scopeNote>Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power</skos:scopeNote>
        </skos:Concept>
      </rdf:RDF>
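A minimal sketch, assuming Python and the standard-library ElementTree, of how a SKOS record like the one on slide 8 could be read to collect labels for search expansion. The trimmed record and the function name `labels_for` are illustrative assumptions, not LII code.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"

# Trimmed copy of the slide's record (assumption: same namespaces, fewer labels).
SKOS_XML = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
  </skos:Concept>
</rdf:RDF>"""


def labels_for(xml_text):
    """Map each concept URI to its preferred label, alternate labels, and broader URIs."""
    root = ET.fromstring(xml_text)
    concepts = {}
    for concept in root.iter(f"{{{SKOS}}}Concept"):
        uri = concept.get(f"{{{RDF}}}about")
        concepts[uri] = {
            "pref": concept.findtext(f"{{{SKOS}}}prefLabel"),
            "alt": [e.text for e in concept.findall(f"{{{SKOS}}}altLabel")],
            "broader": [
                e.get(f"{{{RDF}}}resource")
                for e in concept.findall(f"{{{SKOS}}}broader")
            ],
        }
    return concepts


canals = labels_for(SKOS_XML)["http://www.my.com/#canals"]
print(canals["pref"], canals["alt"])
```

The record files "ditches" as just another label for "canals", so a label-based expansion would treat the two as interchangeable; that flattening may be what the slide title is getting at.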
  • 9. DATA REUSE: DRUGBANK
      • Acetaminophen vs. Tylenol: CFR regulates by generic name
      • DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)
          o http://www.drugbank.ca/
          o Offered as Linked Data by Freie Universität Berlin
      • DrugBank associates brand names with their components
      • We offer component names as suggested search terms in Title 21 [*]
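The brand-to-generic substitution on slide 9 can be sketched roughly as follows. This is an illustration only: the lookup table is a hard-coded, hypothetical stand-in for the associations DrugBank publishes as Linked Data, and `suggest_terms` is an invented name.

```python
# Hypothetical brand-to-component table. In the approach the slide describes,
# these associations would come from DrugBank rather than a hard-coded dict.
BRAND_TO_COMPONENTS = {
    "tylenol": ["acetaminophen"],
}


def suggest_terms(query, table=BRAND_TO_COMPONENTS):
    """Return generic component names to suggest for brand names in the query."""
    suggestions = []
    for word in query.lower().split():
        # Strip trailing punctuation so "Tylenol," still matches the table key.
        for component in table.get(word.strip(".,;:?!"), []):
            if component not in suggestions:
                suggestions.append(component)
    return suggestions


print(suggest_terms("Tylenol labeling requirements"))
```

A query containing no known brand name simply returns an empty list, so the suggestion box can stay hidden.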
  • 10. CAN'T EVERYTHING BE DONE WITH RECYCLED DATA? UM, NO.
      • Some datasets suck, or don't exist yet
      • Conversion of existing resources is not painless
          o Many vocabularies rely on human interpretation
          o Many vocabularies are not rigorous enough for SKOS encoding (lotta bad SKOS out there)
  • 11. CURATION ISSUES FOR EXISTING DATASETS
      o Appropriateness, coverage, provenance
      o Same metadata quality issues as usual
      o Many systems of subject terms or identifiers not designed for wide exposure: the "on a horse" problem
      o We’re talking about curation of vocabularies and schemas as much as we are about curation of data.
  • 12. LII SW FEATURES
  • 13. EXTRACTED VOCABULARIES
      • The big idea: enhance CFR search via term expansion, suggestion, etc.
          - Reuse existing thesauri
          - Make a CFR-specific vocabulary by discovering how the CFR talks about itself
          - Use that knowledge to suggest better search terms
      • This is not simple phrase or n-gram matching like Google Suggest.
      • Rather, we discover how words within the CFR relate to each other and we structure them into a hierarchy of terms (SKOS)
  • 14. WHERE DO VOCABULARIES COME FROM?
      • Input: text elements in the CFR XML
      • Extraction and patterns:
          o Anaphora resolution (JavaRAP)
          o Natural Language Parser (Stanford Parser)
          o Hearst patterns
      • Output: SKOS (Jena)
  • 15. ANAPHORA RESOLUTION
      • John spent time in a Turkish prison. He is now the executive director of CALI.
      • Núria stole Sara’s chocolate and stuffed her face with it. (but whose face was it?)
      • When a sponsor conducting a nonclinical laboratory study intended to be submitted to or reviewed by the Food and Drug Administration utilizes the services of a consulting laboratory, contractor, or grantee to perform an analysis or other service, it shall notify the consulting laboratory, contractor, or grantee that the service is part of a nonclinical laboratory study that must be conducted in compliance with the provisions of this part.
  • 16. STANFORD PARSER
      Structured grammar trees & typed dependencies
      • Noun modifier: nn(product-10, chemical-9)
          • “product skos:narrower chemical_product”
      • Conjunctions: conj(doctor-7, practitioner-9)
          • “doctor skos:related practitioner”
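The two mappings on slide 16 can be sketched as a small translation step from typed-dependency strings to SKOS-style statements. This is a hedged illustration: it assumes the dependencies arrive as plain strings in the `rel(word-index, word-index)` form shown on the slide, and `dependency_to_statement` is an invented name rather than part of the Stanford Parser or the LII pipeline.

```python
import re

# Matches strings such as "nn(product-10, chemical-9)".
DEPENDENCY = re.compile(r"(\w+)\((\S+)-\d+,\s*(\S+)-\d+\)")


def dependency_to_statement(dep):
    """Turn one typed dependency into a (subject, predicate, object) tuple, or None."""
    match = DEPENDENCY.fullmatch(dep.strip())
    if match is None:
        return None
    relation, governor, dependent = match.groups()
    if relation == "nn":
        # Noun modifier: "chemical product" is a narrower kind of "product".
        return (governor, "skos:narrower", f"{dependent}_{governor}")
    if relation == "conj":
        # Conjoined nouns are treated as related terms.
        return (governor, "skos:related", dependent)
    # Other dependency types are ignored in this sketch.
    return None


for dep in ["nn(product-10, chemical-9)", "conj(doctor-7, practitioner-9)"]:
    print(dependency_to_statement(dep))
```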
  • 17. HEARST PATTERNS
      o Lexico-syntactic patterns that indicate hypernymic/hyponymic relations
      o NP (,)? (such as | like) (NP,)* (or | and) NP
      o Example: All vehicles like cars, trucks, and go-karts
      o PS:
          o hypernym == word for superset containing term
          o hyponym == more specific term
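A rough sketch of the "such as | like" pattern from slide 17, written as a plain regular expression. It makes one loud simplification: each noun phrase (NP) is approximated by a single, possibly hyphenated, word, whereas the deck's pipeline relies on a parser to find real noun phrases. The names `HEARST` and `hearst_pairs` are invented for the example.

```python
import re

# Assumption: one (possibly hyphenated) word stands in for a full noun phrase.
WORD = r"[A-Za-z][A-Za-z-]*"
HEARST = re.compile(
    rf"(?P<hypernym>{WORD}),? (?:such as|like) "
    rf"(?P<hyponyms>{WORD}(?:, {WORD})*,? (?:or|and) {WORD})"
)


def hearst_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as | like' pattern."""
    pairs = []
    for match in HEARST.finditer(sentence):
        # Split the list on ", " and on the final "and"/"or" (with optional comma).
        hyponyms = re.split(r",? (?:or|and) |, ", match.group("hyponyms"))
        pairs.extend((hyponym, match.group("hypernym")) for hyponym in hyponyms)
    return pairs


print(hearst_pairs("All vehicles like cars, trucks, and go-karts"))
```

Because "like" is also a verb, a word-level pattern of this kind will happily misread "I like cats and dogs"; that is one reason to run it over parsed noun phrases rather than raw text.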
  • 18. principal display panel
      parser understands “display” as a verb. oops.
  • 19. WHY IS THIS HARD?
      • Legal text is structurally complicated
          o Parser dies on long sentences, leading to incorrect extractions
      • Named entities ("Food, Drug, and Cosmetic Act") confuse the parser
          o Should be separately extracted/tagged
          o Parser should think of them as a single token, but doesn't
          o May need authority files for entities and acronyms, etc.
      • Corpus is huge (CFR == 96.5 million words)
          o Strains memory limits and computational resources
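One way to approach the named-entity problem on slide 19 is to rewrite known names as single tokens before the text reaches the parser. The sketch below assumes a tiny, hypothetical authority list and an underscore-joining convention; neither is described in the deck.

```python
# Hypothetical authority file: multi-word names that should reach the parser
# as one token. A real list would be far longer and separately curated.
AUTHORITY = [
    "Food, Drug, and Cosmetic Act",
    "Food and Drug Administration",
]


def protect_entities(text, names=AUTHORITY):
    """Replace each known name with an underscore-joined token, longest names first."""
    for name in sorted(names, key=len, reverse=True):
        token = name.replace(",", "").replace(" ", "_")
        text = text.replace(name, token)
    return text


print(protect_entities("as defined in the Food, Drug, and Cosmetic Act"))
```

Replacing the longest names first keeps a short name from clobbering part of a longer one that contains it.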
  • 20. DEFINITIONS: IMPROVING SEARCH AND PRESENTATION
      • The big idea: find all terms defined by the reg or statute, and do cool stuff with them, for example
          o linking terms in text to their definitions
          o pushing definitions to the top of results when the term is searched for
          o altering presentation so that a (legally) naive user understands the importance of definitions for, e.g., compliance
      • Of course, that also means figuring out what the scope of definitions is.... :(
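As an illustration of the first item on slide 20, linking terms in text to their definitions, here is a minimal sketch. The anchor naming scheme (`def-sponsor`) and the HTML output format are assumptions made for the example, and scope is ignored entirely.

```python
import re


def link_defined_terms(text, definitions):
    """Wrap each defined term found in text with a link to its definition anchor.

    `definitions` maps a defined term to a (hypothetical) anchor id.
    """
    lookup = {term.lower(): anchor for term, anchor in definitions.items()}
    if not lookup:
        return text
    # Longest terms first so a multi-word term wins over a word it contains.
    terms = sorted(lookup, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in terms) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: f'<a href="#{lookup[m.group(0).lower()]}">{m.group(0)}</a>',
        text,
    )


print(link_defined_terms("The sponsor shall notify the contractor.",
                         {"sponsor": "def-sponsor"}))
```

Exact matching like this misses inflected forms such as "sponsors"; the stemmer listed among the deck's tools is the kind of component that would help there.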
  • 21. WHERE DO THE DEFINITIONS COME FROM?
      • Input: heading elements in the CFR XML with the term "definition".
      • Using regular expressions, we extract
          o Defined term and definition text
          o Location of the definition (section of the CFR)
          o Scoping information: "For the purposes of this part"
      • Output: SKOS/RDF
          o defined term --> SKOS Vocabulary
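The regular-expression step on slide 21 might look something like the sketch below. It works on plain paragraph strings rather than the CFR XML, handles only the "X means ..." form, and uses invented names (`extract_definition`, the `intro` parameter, the `sec-example` location label); the deck does not show the actual expressions used.

```python
import re

# "Sponsor means: ..." -> term "Sponsor", text "...". Only the "means" form is handled.
DEFINITION = re.compile(
    r"^(?P<term>[A-Z][\w -]*?) means:?\s*(?P<text>.+)$", re.DOTALL
)
# Scoping phrases such as "For the purposes of this part".
SCOPE = re.compile(
    r"for (?:the )?purposes of this (?P<unit>part|subpart|section|chapter)",
    re.IGNORECASE,
)


def extract_definition(paragraph, location, intro=""):
    """Return term, definition text, location, and scope unit, or None if no match.

    `intro` is the section's introductory text, where scoping language often sits.
    """
    match = DEFINITION.match(paragraph.strip())
    if match is None:
        return None
    scope = SCOPE.search(intro + " " + paragraph)
    return {
        "term": match.group("term").strip(),
        "definition": match.group("text").strip(),
        "location": location,
        "scope": scope.group("unit").lower() if scope else None,
    }


print(extract_definition(
    "Sponsor means: (1) A person who initiates and supports a study;",
    location="sec-example",  # hypothetical section identifier
    intro="For the purposes of this part:",
))
```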
  • 22. DEFINITIONS: TOOLS
      • Python Natural Language Toolkit (NLTK)
      • ElementTree, XML parsing library
      • Snowball Stemmer Package
      • RDFlib, an RDF generation library
  • 23. WHY THIS IS HARD: FINDING DEFINITIONS
      o The text containing a definition can make it hard to extract.
          o Sponsor means:
          o (1) A person who initiates and supports, by provision of financial or other resources, a nonclinical laboratory study;
          o (2) A person who submits a nonclinical study to the Food and Drug Administration in support of an application for a research or marketing permit
      o Pattern identification/inconsistencies in sections that are not explicitly meant to be definitions (or, what does “means” mean?)
  • 24. WHY THIS IS HARD: SCOPING DEFINITIONS
      o Scoping not stated in text, implicit in structure
      o Complex scoping statements:
          - "The definitions and interpretations contained in section 201 of the act apply to those terms when used in this part".
          - "Any term not defined in this part shall have the definition set forth in section 102 of the Act (21 U.S.C. 802), except that certain terms used in part 1316 of this chapter are defined at the beginning of each subpart of that part".
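A first, partial step toward handling scoping statements like those on slide 24 is simply spotting the cross-references they contain. The sketch below is an assumption-laden illustration: the pattern covers only the phrasings quoted on the slide, and `scope_references` is an invented name.

```python
import re

# Covers only the phrasings quoted on the slide, e.g. "section 201 of the act"
# or "part 1316 of this chapter".
REFERENCE = re.compile(
    r"\b(?P<unit>section|part|subpart|chapter) (?P<number>\d+[A-Za-z]?) "
    r"of (?P<target>the act|this chapter|this part)",
    re.IGNORECASE,
)


def scope_references(statement):
    """List (unit, number, target) cross-references found in a scoping statement."""
    return [
        (m.group("unit").lower(), m.group("number"), m.group("target").lower())
        for m in REFERENCE.finditer(statement)
    ]


STATEMENT = (
    "Any term not defined in this part shall have the definition set forth in "
    "section 102 of the Act (21 U.S.C. 802), except that certain terms used in "
    "part 1316 of this chapter are defined at the beginning of each subpart of "
    "that part"
)
print(scope_references(STATEMENT))
```

Finding the references is the easy half; deciding which terms each reference governs, and where the exceptions apply, still depends on the document structure the slide calls implicit.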
  • 25. SO, WHAT CAN WE DO? [*]
  • 26. IMPROVEMENTS
      o Vocabulary: better extraction and quality
      o Definitions: retrieval and completeness
      o Obligations: false positives, identification of parts
      o Product Codes: semantic matching
  • 27. FUTURE WORK
      o RDF-ification, refinement, implementation:
          - Table III, PTOA, Popular Names
          - Agency structure
      o Data management and quality
      o Crowdsourcing
  • 28. RESOURCES: STANDARDS AND PRIMERS
      • RDF:
          o Primer: http://www.w3.org/TR/rdf-primer/
          o Advantages: http://www.w3.org/RDF/advantages.html
      • SKOS
          o http://www.w3.org/2004/02/skos/
  • 29. MORE RESOURCES
      • Linked Open Data:
          o General: http://linkeddata.org/
          o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/
          o Government Data: http://logd.tw.rpi.edu/
      • W3C Semantic Web resources:
          o http://www.w3.org/standards/semanticweb/
  • 30. EVEN MORE RESOURCES: RANTS AND RAVES
      • VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/
      • Mangy dogs: http://liicr.nl/JPcAb2
      • Legal Informatics blog: http://legalinformatics.wordpress.com/
      • Books on law and the SW: http://liicr.nl/MGRbkA
  • 31. US
      • Núria
          o nuria.casellas@liicornell.org
          o @ncasellas
          o http://nuriacasellas.blogspot.com
      • Tom
          o tom@liicornell.org
          o @trbruce
          o http://blog.law.cornell.edu/(tbruce | metasausage)