THE CFR MEETS THE
    SEMANTIC WEB
(with a little unnatural language processing thrown in)
BACKGROUND: A TWO-PART HISTORY OF
        THE SEMANTIC WEB

• SW is a maze of confusing buzzwords
• Can be thought of in two parts
  • Pre-2005 (the “top-down” period)
  • Post-2005 (the “bottom-up” period)
SW PRE-2005


o   A fascination with inferencing & top-down analysis

o   Staked out a lot of theoretical territory

o   Built basic standards:

           • RDF (statement encoding): saying things about things

           • OWL (modeling and inferencing): describing relationships
             between things -- that is, creating ontologies
SW FROM 2005 TO NOW

o   SW now seen as a big heap of statements

o   Became more practical

    o   SKOS (inexpensive conversion method/standard for metadata)

    o   Linked Data (altruistic, like named anchors ca. 1992)

o   Could be seen -- from a library point of view -- as a new set of
    techniques for metadata management better suited to the Web
THE SEMANTIC WEB AT THE LII
• Tying legal information to the real world, not just itself
• Applications like:
   o   Improvements to existing finding aids

          Table of Popular Names, Tables I and III

          Finer-grained, more expressive PTOA

   o   Search enhancement via term substitution and expansion

   o   Publication of “regulated nouns” and definitions as Linked Data

• Research-driven engineering as a practice/culture
WHY USE THE SW TOOLSET?
• Sometimes the whole thing looks like an illustration of the Two Fool
  Rule

• Why RDF?
  o   XML is more cumbersome and less expressive

  o   RDF supports inferencing

  o   RDF allows processing of partial information

• Why SPARQL?
  o   um, SPARQL is how you query RDF
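The "partial information" point can be sketched without any RDF library. This is a toy only: triples are plain tuples, the `ex:` identifiers and the `match` helper are made up for illustration, and the pattern matching merely imitates the kind of triple pattern a SPARQL query expresses. It is not SPARQL.

```python
# Illustrative only: triples as plain tuples, not a real RDF store.
# All ex: identifiers below are hypothetical.
TRIPLES = [
    ("ex:sponsor", "ex:label", "sponsor"),
    ("ex:sponsor", "ex:definedIn", "ex:somePart"),
    # We know a label for this term but not where it is defined.
    # The statement is still usable on its own.
    ("ex:testArticle", "ex:label", "test article"),
]


def match(triples, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


# "Which terms have a label?" works even though one term is incomplete.
labelled = [s for s, _, _ in match(TRIPLES, p="ex:label")]
# "Where are terms defined?" simply returns fewer rows.
defined = [(s, o) for s, _, o in match(TRIPLES, p="ex:definedIn")]
```

Nothing breaks when a statement is missing; the second query just has less to say.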
WHY USE SKOS?

o   it's a simple knowledge organization system

o   lightweight representation of things we need a lot:

    o   thesauri

    o   taxonomies

    o   classification schemes

o   it might be a little too simple
SKOS: DRIVING INTO A DITCH

<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">

  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:definition>A feature type category for places
     such as the Erie Canal</skos:definition>
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>canalized streams</skos:altLabel>
    <skos:altLabel>ditch mouths</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage canals</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
    <skos:related rdf:resource="http://www.my.com/#channels"/>
    <skos:related rdf:resource="http://www.my.com/#locks"/>
    <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
    <skos:related rdf:resource="http://www.my.com/#tunnels"/>
    <skos:scopeNote>Manmade waterway used by watercraft
     or for drainage, irrigation, mining, or water
     power</skos:scopeNote>
  </skos:Concept>

</rdf:RDF>
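A minimal sketch of how a record like the one above could feed search-term expansion, using only Python's standard `xml.etree.ElementTree`. The XML string is an abridged copy of the record, and `expansion_table` is a hypothetical helper, not LII code.

```python
import xml.etree.ElementTree as ET

SKOS_NS = "http://www.w3.org/2004/02/skos/core#"

# Abridged copy of the SKOS record shown above.
SKOS_XML = """<rdf:RDF
  xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
  xmlns:skos="http://www.w3.org/2004/02/skos/core#">
  <skos:Concept rdf:about="http://www.my.com/#canals">
    <skos:prefLabel>canals</skos:prefLabel>
    <skos:altLabel>canal bends</skos:altLabel>
    <skos:altLabel>ditches</skos:altLabel>
    <skos:altLabel>drainage ditches</skos:altLabel>
    <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
  </skos:Concept>
</rdf:RDF>"""


def expansion_table(xml_text):
    """Map each concept's prefLabel to the list of its altLabels."""
    root = ET.fromstring(xml_text)
    table = {}
    for concept in root.iter(f"{{{SKOS_NS}}}Concept"):
        pref = concept.findtext(f"{{{SKOS_NS}}}prefLabel")
        alts = [e.text for e in concept.findall(f"{{{SKOS_NS}}}altLabel")]
        table[pref] = alts
    return table


table = expansion_table(SKOS_XML)
print(table["canals"])
```

Treating every altLabel as an interchangeable search term is where the ditch comes in: a search for "canals" would also pull in "ditches".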
DATA REUSE: DRUGBANK
• Acetaminophen vs. Tylenol: CFR regulates by generic name
• DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/)
  o   http://www.drugbank.ca/

  o   Offered as Linked Data by Freie Universität Berlin

• DrugBank associates brand names with their components
• We offer component names as suggested search terms in Title 21
  [*]
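A rough sketch of the suggestion step, assuming the brand-to-component associations have already been pulled into a lookup table. The dictionary below is a hand-made stand-in with one entry; it does not reproduce DrugBank's data or schema, and `suggest_terms` is a hypothetical name.

```python
# Hypothetical, hand-made stand-in for brand-to-component data.
BRAND_COMPONENTS = {
    "tylenol": ["acetaminophen"],
}


def suggest_terms(query, brand_components=BRAND_COMPONENTS):
    """Return component names to suggest for any brand name in the query."""
    suggestions = []
    for word in query.lower().split():
        key = word.strip(".,;:?!")  # drop trailing punctuation
        for component in brand_components.get(key, []):
            if component not in suggestions:
                suggestions.append(component)
    return suggestions
```

A query such as "Tylenol labeling" would then come back with "acetaminophen" as a suggested term, which is the name the CFR actually uses.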
CAN'T EVERYTHING BE DONE WITH
         RECYCLED DATA? UM, NO.

• Some datasets suck, or don't exist yet
• Conversion of existing resources is not painless
  o   Many vocabularies rely on human interpretation

  o   Many vocabularies are not rigorous enough for SKOS encoding
      (lotta bad SKOS out there)
CURATION ISSUES FOR EXISTING DATASETS

o   Appropriateness, coverage, provenance

o   Same metadata quality issues as usual

o   Many systems of subject terms or identifiers not designed for wide
    exposure: the "on a horse" problem

o   We’re talking about curation of vocabularies and schemas as much as
    we are about curation of data.
LII SW FEATURES
EXTRACTED VOCABULARIES
• The big idea: enhance CFR search via term expansion, suggestion,
  etc.

        Reuse existing thesauri

        Make a CFR-specific vocabulary by discovering how the CFR
         talks about itself

        Use that knowledge to suggest better search terms

• This is not simple phrase or n-gram matching like Google Suggest.
• Rather, we discover how words within the CFR relate to each other
  and we structure them into a hierarchy of terms (SKOS)
WHERE DO VOCABULARIES COME FROM?


• Input: text elements in the CFR XML
• Extraction and patterns:
    o   Anaphora resolution (JavaRAP)

    o   Natural Language Parser (Stanford Parser)

    o   Hearst patterns

• Output: SKOS (Jena)
ANAPHORA RESOLUTION

• John spent time in a Turkish prison. He is now the executive
 director of CALI.

• Núria stole Sara’s chocolate and stuffed her face with it. (but
 whose face was it?)

• When a sponsor conducting a nonclinical laboratory study intended
 to be submitted to or reviewed by the Food and Drug Administration
 utilizes the services of a consulting laboratory, contractor, or grantee
 to perform an analysis or other service, it shall notify the consulting
 laboratory, contractor, or grantee that the service is part of a
 nonclinical laboratory study that must be conducted in compliance
 with the provisions of this part.
STANFORD PARSER


   Structured grammar trees & typed dependencies

• Noun modifier: nn(product-10, chemical-9)

         • “product skos:narrower chemical_product”


• Conjunctions: conj(doctor-7, practitioner-9)

         • “doctor skos:related practitioner”
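The two mappings above can be sketched as a small function over dependency strings in the `relation(head-N, dependent-N)` form shown on this slide. This is an assumption-laden toy: `dependency_to_skos` is a hypothetical helper, it handles only `nn` and `conj`, and it works on strings rather than on the parser's own output objects.

```python
import re

# Matches strings such as "nn(product-10, chemical-9)".
DEP_RE = re.compile(r"^(\w+)\(([^,\s]+)-\d+,\s*([^,\s]+)-\d+\)$")


def dependency_to_skos(dep):
    """Turn one typed-dependency string into a (subject, predicate, object)
    tuple, or return None if the relation is not handled here."""
    m = DEP_RE.match(dep.strip())
    if m is None:
        return None
    relation, head, dependent = m.groups()
    if relation == "nn":
        # Noun modifier: the modified noun is narrower than the bare head.
        return (head, "skos:narrower", f"{dependent}_{head}")
    if relation == "conj":
        # Conjoined nouns are treated as related terms.
        return (head, "skos:related", dependent)
    return None
```

Every other dependency type falls through to `None`, so unhandled relations are silently skipped rather than guessed at.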
HEARST PATTERNS
o   lexico-syntactic patterns that indicate hypernymic/hyponymic
    relations.

o   NP (,)? (such as | like) (NP,)* (or | and) NP

o   Example: All vehicles like cars, trucks, and go-karts

o   PS:

    o     hypernym == word for superset containing term

    o     hyponym == more specific term
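A minimal regex sketch of the pattern above, under one big simplifying assumption: every NP is a single word. A real pipeline would take noun phrases from the parser. The names `HEARST_RE` and `hearst_pairs` are made up for illustration, and the final "(or | and) NP" is treated as optional here so that a single hyponym also matches.

```python
import re

# Assumption: a "noun phrase" is one word, possibly hyphenated.
WORD = r"[A-Za-z][A-Za-z-]*"

HEARST_RE = re.compile(
    rf"\b(?P<hyper>{WORD}),?\s+(?:such as|like)\s+"
    rf"(?P<hypos>{WORD}(?:,\s*(?!(?:or|and)\b){WORD})*"
    rf"(?:,?\s*(?:or|and)\s+{WORD})?)"
)


def hearst_pairs(sentence):
    """Return (hypernym, hyponym) pairs found in the sentence."""
    pairs = []
    for m in HEARST_RE.finditer(sentence):
        hyper = m.group("hyper")
        for word in re.findall(WORD, m.group("hypos")):
            if word not in ("and", "or"):
                pairs.append((hyper, word))
    return pairs


print(hearst_pairs("All vehicles like cars, trucks, and go-karts"))
```

Because "like" is also a verb, a sentence such as "inspectors like forms" would produce a false pair; string patterns alone cannot tell the difference.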
“principal display panel”

parser understands “display” as a verb. oops.
WHY IS THIS HARD?
• Legal text is structurally complicated
   o Parser dies on long sentences, leading to incorrect extractions

• Named entities ("Food, Drug, and Cosmetic Act") confuse the parser
   o Should be separately extracted/tagged

   o Parser should think of them as a single token, but doesn't

   o   May need authority files for entities and acronyms, etc.

• Corpus is huge (CFR == 96.5 million words)
   o   Strains memory limits and computational resources
DEFINITIONS: IMPROVING SEARCH AND
            PRESENTATION
• The big idea: find all terms defined by the reg or statute, and do
  cool stuff with them, for example

  o   linking terms in text to their definitions

  o   pushing definitions to the top of results when the term is
      searched for

  o   altering presentation so that a (legally) naive user understands the
      importance of definitions for, e.g., compliance.

• Of course, that also means figuring out what the scope of definitions
  is.... :(
WHERE DO THE DEFINITIONS COME
                 FROM?
• Input: heading elements in the CFR XML with the term "definition".
• Using regular expressions, we extract
  o   Defined term and definition text

  o   Location of the definition (section of the CFR)

  o   Scoping information: "For the purposes of this part"

• Output: SKOS/RDF
  o   defined term --> SKOS Vocabulary
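A minimal sketch of the regular-expression step, assuming the simplest case: a paragraph that opens with "Term means ..." and a scoping phrase of the form quoted above. The patterns, function names, and the `location` value are illustrative assumptions, not the expressions actually used.

```python
import re

# "<Term> means <definition>" at the start of a paragraph (assumed form).
DEFINITION_RE = re.compile(
    r"^(?P<term>[A-Z][\w -]*?)\s+means:?\s*(?P<text>.*)$", re.DOTALL
)
# Scoping phrases such as "For the purposes of this part".
SCOPE_RE = re.compile(
    r"[Ff]or (?:the )?purposes of this (?P<unit>part|subpart|section|chapter)"
)


def extract_definition(paragraph, location):
    """Return the defined term, its text, and where it was found, or None."""
    m = DEFINITION_RE.match(paragraph.strip())
    if m is None:
        return None
    return {
        "term": m.group("term"),
        "definition": m.group("text").strip(),
        "location": location,  # caller supplies the CFR section identifier
    }


def extract_scope(text):
    """Return the structural unit a scoping phrase names, or None."""
    m = SCOPE_RE.search(text)
    return m.group("unit") if m else None
```

Anything that does not open with the "means" form returns `None`, which is exactly the class of definitions that makes full coverage hard.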
DEFINITIONS: TOOLS


• Python Natural Language Toolkit (NLTK)

• ElementTree, XML parsing library

• Snowball Stemmer Package

• RDFlib, an RDF generation library
WHY THIS IS HARD: FINDING
                    DEFINITIONS
o   Text containing definition can make it hard to extract.

    o   Sponsor means:

        o   (1) A person who initiates and supports, by provision of
            financial or other resources, a nonclinical laboratory study;

        o   (2) A person who submits a nonclinical study to the Food and
            Drug Administration in support of an application for a
            research or marketing permit

o   Pattern identification/inconsistencies in sections that are not
    explicitly meant to be definitions (or, what does “means” mean?)
WHY THIS IS HARD: SCOPING DEFINITIONS


o   Scoping not stated in text, implicit in structure

o   Complex scoping statements:

          "The definitions and interpretations contained in section 201 of the act apply to those
           terms when used in this part".

          "Any term not defined in this part shall have the definition set forth in section 102 of the
           Act (21 U.S.C. 802 ), except that certain terms used in part 1316 of this chapter are
           defined at the beginning of each subpart of that part".
SO, WHAT CAN WE DO? [*]
IMPROVEMENTS


o   Vocabulary: better extraction and quality

o   Definitions: retrieval and completeness

o   Obligations: false positives, identification of parts

o   Product Codes: semantic matching
FUTURE WORK


o   RDF-ification, refinement, implementation:

          Table III, PTOA, Popular Names

          Agency structure

o   Data management and quality

o   Crowdsourcing
RESOURCES: STANDARDS AND PRIMERS
• RDF:
  o   Primer: http://www.w3.org/TR/rdf-primer/

  o   Advantages: http://www.w3.org/RDF/advantages.html

• SKOS
  o   http://www.w3.org/2004/02/skos/
MORE RESOURCES

• Linked Open Data:
  o   General: http://linkeddata.org/

  o   Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/

  o   Government Data: http://logd.tw.rpi.edu/

• W3C Semantic Web resources:
  o   http://www.w3.org/standards/semanticweb/
EVEN MORE RESOURCES: RANTS AND
                 RAVES

• VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/

• Mangy dogs: http://liicr.nl/JPcAb2
• Legal Informatics blog: http://legalinformatics.wordpress.com/
• Books on law and the SW: http://liicr.nl/MGRbkA
US
• Núria
  o   nuria.casellas@liicornell.org

  o   @ncasellas

  o   http://nuriacasellas.blogspot.com

• Tom
  o   tom@liicornell.org

  o   @trbruce

  o   http://blog.law.cornell.edu/(tbruce | metasausage)

Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 

The Semantic Web meets the Code of Federal Regulations

  • 1. THE CFR MEETS THE SEMANTIC WEB (with a little unnatural language processing thrown in )
  • 2. BACKGROUND: A TWO-PART HISTORY OF THE SEMANTIC WEB • SW is a maze of confusing buzzwords • Can be thought of in two parts • Pre-2005 (the “top-down” period) • Post-2005 (the “bottom-up” period)
  • 3. SW PRE-2005 o A fascination with inferencing & top-down analysis o Staked out a lot of theoretical territory o Built basic standards: • RDF (statement encoding) : saying things about things • OWL (modeling and inferencing): describing relationships between things -- that is, creating ontologies
  • 4. SW FROM 2005 TO NOW o SW now seen as a big heap of statements o Became more practical o SKOS ( inexpensive conversion method/standard for metadata) o Linked Data ( altruistic, like named anchors ca. 1992 ) o Could be seen -- from a library point of view -- as a new set of techniques for metadata management better suited to the Web
  • 5. THE SEMANTIC WEB AT THE LII • Tying legal information to the real world, not just itself • Applications like: o Improvements to existing finding aids: Table of Popular Names, Tables I and III; finer-grained, more expressive PTOA o Search enhancement via term substitution and expansion o Publication of “regulated nouns” and definitions as Linked Data • Research-driven engineering as a practice/culture
  • 6. WHY USE THE SW TOOLSET? • Sometimes the whole thing looks like an illustration of the Two Fool Rule • Why RDF? o XML is more cumbersome and less expressive o RDF supports inferencing o RDF allows processing of partial information • Why SPARQL? o um, SPARQL is how you query RDF
  • 7. WHY USE SKOS? o it's a simple knowledge organization system o lightweight representation of things we need a lot: o thesauri o taxonomies o classification schemes o it might be a little too simple
  • 8. SKOS: DRIVING INTO A DITCH
      <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
               xmlns:skos="http://www.w3.org/2004/02/skos/core#">
        <skos:Concept rdf:about="http://www.my.com/#canals">
          <skos:definition>A feature type category for places such as the Erie Canal</skos:definition>
          <skos:prefLabel>canals</skos:prefLabel>
          <skos:altLabel>canal bends</skos:altLabel>
          <skos:altLabel>canalized streams</skos:altLabel>
          <skos:altLabel>ditch mouths</skos:altLabel>
          <skos:altLabel>ditches</skos:altLabel>
          <skos:altLabel>drainage canals</skos:altLabel>
          <skos:altLabel>drainage ditches</skos:altLabel>
          <skos:broader rdf:resource="http://www.my.com/#hydrographic%20structures"/>
          <skos:related rdf:resource="http://www.my.com/#channels"/>
          <skos:related rdf:resource="http://www.my.com/#locks"/>
          <skos:related rdf:resource="http://www.my.com/#transportation%20features"/>
          <skos:related rdf:resource="http://www.my.com/#tunnels"/>
          <skos:scopeNote>Manmade waterway used by watercraft or for drainage, irrigation, mining, or water power</skos:scopeNote>
        </skos:Concept>
      </rdf:RDF>
  • 9. DATA REUSE: DRUGBANK • Acetaminophen vs. Tylenol : CFR regulates by generic name • DrugBank (http://www4.wiwiss.fu-berlin.de/drugbank/) o http://www.drugbank.ca/ o Offered as Linked Data by Freie Universität Berlin • DrugBank associates brand names with their components • We offer component names as suggested search terms in Title 21 [*]
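
The brand-to-generic substitution described on this slide might be sketched roughly as follows. This is only an illustration: the `BRAND_TO_COMPONENTS` table and the `suggest_terms` function are hypothetical names, and the single entry is hard-coded from the slide's own example rather than pulled from DrugBank.

```python
# Hypothetical lookup table. In practice the brand-to-component
# associations would be loaded from DrugBank, not hard-coded.
BRAND_TO_COMPONENTS = {
    "tylenol": ["acetaminophen"],
}


def suggest_terms(query):
    """Return generic component names to suggest for brand names in a query."""
    suggestions = []
    for word in query.lower().split():
        for component in BRAND_TO_COMPONENTS.get(word, []):
            # Keep the first occurrence only, preserving order.
            if component not in suggestions:
                suggestions.append(component)
    return suggestions
```

A search for "Tylenol dosage" would then surface "acetaminophen" as a suggested term, which is the name the CFR actually regulates by.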
  • 10. CAN'T EVERYTHING BE DONE WITH RECYCLED DATA? UM, NO. • Some datasets suck, or don't exist yet • Conversion of existing resources is not painless o Many vocabularies rely on human interpretation o Many vocabularies are not rigorous enough for SKOS encoding (lotta bad SKOS out there)
  • 11. CURATION ISSUES FOR EXISTING DATASETS o Appropriateness, coverage, provenance o Same metadata quality issues as usual o Many systems of subject terms or identifiers not designed for wide exposure: the "on a horse" problem o We’re talking about curation of vocabularies and schemas as much as we are about curation of data.
  • 13. EXTRACTED VOCABULARIES • The big idea: enhance CFR search via term expansion, suggestion, etc.  Reuse existing thesauri  Make a CFR-specific vocabulary by discovering how the CFR talks about itself  Use that knowledge to suggest better search terms • This is not simple phrase or n-gram matching like Google Suggest. • Rather, we discover how words within the CFR relate to each other and we structure them into a hierarchy of terms (SKOS)
  • 14. WHERE DO VOCABULARIES COME FROM? • Input: text elements in the CFR XML • Extraction and patterns: o Anaphora resolution (JavaRAP) o Natural Language Parser (Stanford Parser) o Hearst patterns: o Output: SKOS (Jena)
  • 15. ANAPHORA RESOLUTION • John spent time in a Turkish prison. He is now the executive director of CALI. • Núria stole Sara’s chocolate and stuffed her face with it. (but whose face was it?) • When a sponsor conducting a nonclinical laboratory study intended to be submitted to or reviewed by the Food and Drug Administration utilizes the services of a consulting laboratory, contractor, or grantee to perform an analysis or other service, it shall notify the consulting laboratory, contractor, or grantee that the service is part of a nonclinical laboratory study that must be conducted in compliance with the provisions of this part.
  • 16. STANFORD PARSER • Structured grammar trees & typed dependencies • Noun modifier: nn(product-10, chemical-9) • “product skos:narrower chemical_product” • Conjunctions: conj(doctor-7, practitioner-9) • “doctor skos:related practitioner”
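
One way to picture the mapping from typed dependencies to SKOS statements is the sketch below. It assumes the dependencies are already available as strings in the notation shown on the slide; the `dependency_to_skos` function and its regular expression are illustrative and not the LII's actual code.

```python
import re

# Matches strings such as "nn(product-10, chemical-9)":
# relation name, then two "word-position" arguments.
DEP_RE = re.compile(r"(\w+)\(([^,\s]+)-\d+,\s*([^,\s]+)-\d+\)")


def dependency_to_skos(dep):
    """Turn one typed dependency string into a (subject, predicate, object) tuple.

    Returns None for relations this sketch does not handle.
    """
    match = DEP_RE.fullmatch(dep.strip())
    if match is None:
        return None
    relation, head, dependent = match.groups()
    if relation == "nn":
        # Noun modifier: the modified noun phrase is narrower than its head noun.
        return (head, "skos:narrower", f"{dependent}_{head}")
    if relation == "conj":
        # Conjoined nouns are treated as related terms.
        return (head, "skos:related", dependent)
    return None
```

The tuples are plain strings here; a real pipeline would presumably mint URIs for each term and write the statements out as SKOS/RDF.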
  • 17. HEARST PATTERNS o Lexico-syntactic patterns that indicate hypernymic/hyponymic relations o NP (,)? (such as | like) (NP,)* (or | and) NP o Example: All vehicles like cars, trucks, and go-karts o PS: o hypernym == word for superset containing term o hyponym == more specific term
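
A toy version of this Hearst pattern can be written as a regular expression. This is a simplification under a loud assumption: each noun phrase (NP) is approximated by a single word, whereas a real extractor would work on parsed noun phrases. The names `HEARST_RE` and `hearst_pairs` are made up for the sketch.

```python
import re

# Assumption: a noun phrase is approximated by one (possibly hyphenated) word.
NP = r"[A-Za-z][A-Za-z-]*"

# NP (,)? (such as | like) (NP,)* (or | and) NP
HEARST_RE = re.compile(
    rf"(?P<hypernym>{NP}),? (?:such as|like) "
    rf"(?P<hyponyms>(?:{NP}, )*(?:(?:or|and) )?{NP})"
)


def hearst_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the pattern, or an empty list."""
    match = HEARST_RE.search(sentence)
    if match is None:
        return []
    # Split the hyponym list on commas, dropping a leading "or"/"and".
    hyponyms = re.split(
        r",\s*(?:(?:or|and)\s+)?|\s+(?:or|and)\s+", match.group("hyponyms")
    )
    return [(hyponym, match.group("hypernym")) for hyponym in hyponyms if hyponym]
```

On the slide's example sentence this pairs "cars", "trucks", and "go-karts" with the hypernym "vehicles"; each pair could then be recorded as a skos:broader/skos:narrower relation.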
  • 18. “principal display panel”: the parser understands “display” as a verb. Oops.
  • 19. WHY IS THIS HARD? • Legal text is structurally complicated o Parser dies on long sentences, leading to incorrect extractions • Named entities ("Food, Drug, and Cosmetic Act") confuse the parser o Should be separately extracted/tagged o Parser should think of them as a single token, but doesn't o May need authority files for entities and acronyms, etc. • Corpus is huge (CFR == 96.5 million words) o Strains memory limits and computational resources
  • 20. DEFINITIONS: IMPROVING SEARCH AND PRESENTATION • The big idea: find all terms defined by the reg or statute, and do cool stuff with them, for example o linking terms in text to their definitions o pushing definitions to the top of results when the term is searched for o altering presentation so that a (legally) naive user understands the importance of definitions for, e.g., compliance. • Of course, that also means figuring out what the scope of definitions is.... :(
  • 21. WHERE DO THE DEFINITIONS COME FROM? • Input: heading elements in the CFR XML with the term "definition". • Using regular expressions, we extract o Defined term and definition text o Location of the definition (section of the CFR) o Scoping information: "For the purposes of this part" • Output: SKOS/RDF o defined term --> SKOS Vocabulary
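
The regular-expression step might look something like the sketch below. The patterns and the function names `extract_definition` and `extract_scope` are assumptions made for illustration; they cover only the simple "X means ..." and "For the purposes of this part" forms, and real CFR text is much less regular than this.

```python
import re

# "<Term> means[:] <definition text>" at the start of a paragraph.
DEFINITION_RE = re.compile(
    r"^(?P<term>[A-Z][\w\s-]*?) means:?\s*(?P<definition>.+)$", re.S
)

# Scoping phrases such as "For the purposes of this part".
SCOPE_RE = re.compile(
    r"for (?:the )?purposes of this (?P<unit>part|subpart|section|chapter)", re.I
)


def extract_definition(paragraph, location):
    """Return the defined term, its definition text, and where it was found."""
    match = DEFINITION_RE.match(paragraph.strip())
    if match is None:
        return None
    return {
        "term": match.group("term").strip(),
        "definition": match.group("definition").strip(),
        "location": location,  # caller-supplied label for the CFR section
    }


def extract_scope(text):
    """Return the structural unit named in a scoping phrase, or None."""
    match = SCOPE_RE.search(text)
    return match.group("unit").lower() if match else None
```

The resulting dictionaries are a stand-in for the SKOS/RDF output; each defined term would become an entry in the SKOS vocabulary, carrying its location and scope.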
  • 22. DEFINITIONS: TOOLS • Python Natural Language Toolkit (NLTK) • ElementTree, XML parsing library • Snowball Stemmer Package • RDFlib, an RDF generation library
  • 23.
  • 24. WHY THIS IS HARD: FINDING DEFINITIONS o Text containing definition can make it hard to extract. o Sponsor means: o (1) A person who initiates and supports, by provision of financial or other resources, a nonclinical laboratory study; o (2) A person who submits a nonclinical study to the Food and Drug Administration in support of an application for a research or marketing permit o Pattern identification/inconsistencies in sections that are not explicitly meant to be definitions (or, what does “means” mean?)
  • 25. WHY THIS IS HARD: SCOPING DEFINITIONS o Scoping not stated in text, implicit in structure o Complex scoping statements:  "The definitions and interpretations contained in section 201 of the act apply to those terms when used in this part".  "Any term not defined in this part shall have the definition set forth in section 102 of the Act (21 U.S.C. 802 ), except that certain terms used in part 1316 of this chapter are defined at the beginning of each subpart of that part".
  • 26. SO, WHAT CAN WE DO? [*]
  • 27. IMPROVEMENTS o Vocabulary: better extraction and quality o Definitions: retrieval and completeness o Obligations: false positives, identification of parts o Product Codes: semantic matching
  • 28. FUTURE WORK o RDF-ification, refinement, implementation:  Table III, PTOA, Popular Names  Agency structure o Data management and quality o Crowdsourcing
  • 29. RESOURCES: STANDARDS AND PRIMERS • RDF: o Primer: http://www.w3.org/TR/rdf-primer/ o Advantages: http://www.w3.org/RDF/advantages.html • SKOS o http://www.w3.org/2004/02/skos/
  • 30. MORE RESOURCES • Linked Open Data: o General: http://linkeddata.org/ o Tutorial: http://www4.wiwiss.fu-berlin.de/bizer/pub/linkeddatatutorial/ o Government Data: http://logd.tw.rpi.edu/ • W3C Semantic Web resources: o http://www.w3.org/standards/semanticweb/
  • 31. EVEN MORE RESOURCES: RANTS AND RAVES • VoxPop articles on the SW and Law: http://blog.law.cornell.edu/voxpop/category/semantic-web-and-law/ • Mangy dogs: http://liicr.nl/JPcAb2 • Legal Informatics blog: http://legalinformatics.wordpress.com/ • Books on law and the SW: http://liicr.nl/MGRbkA
  • 32. US • Núria o nuria.casellas@liicornell.org o @ncasellas o http://nuriacasellas.blogspot.com • Tom o tom@liicornell.org o @trbruce o http://blog.law.cornell.edu/(tbruce | metasausage)
