This document discusses potential enhancements to the EPA's Facility Registry System (FRS) Linked Open Data approach. It notes issues with the current data serialization, which treats data as flat tables without semantic structure. The document proposes improving data modeling, leveraging existing resources and metadata, and collaborating with others to enhance query capabilities and representational robustness. Short-term needs include semantic enhancements to support faceted analysis and unique identification. Long-term, the data model may need updates to better support Linked Open Data applications.
This patent describes a method for assigning importance ranks to nodes in a linked database, such as the world wide web. The rank assigned to a document is calculated based on the ranks of documents that cite it, and a constant representing the probability a user will randomly access the document. The method enhances the performance of search engine results for hypermedia databases like the web, whose documents have large variations in quality.
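The ranking scheme described here can be sketched in a few lines. This is an illustrative toy implementation, not the patented method itself; the damping factor `d` stands in for the patent's constant modeling the probability that a user randomly accesses a document.

```python
# Minimal iterative rank sketch over a hypothetical link graph.
# Each node's rank combines the ranks of nodes citing it with a
# constant "random access" term, as the abstract describes.
def pagerank(links, d=0.85, iterations=50):
    """links maps each node to the list of nodes it cites."""
    nodes = set(links) | {t for ts in links.values() for t in ts}
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        new = {node: (1 - d) / n for node in nodes}
        for src, targets in links.items():
            if targets:
                share = d * rank[src] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # Dangling node: spread its rank over all nodes.
                for t in nodes:
                    new[t] += d * rank[src] / n
        rank = new
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
# Ranks sum to ~1; "c" is cited most here and ranks highest.
```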
Linked Open Data Principles, Technologies and Examples – Open Data Support
Theoretical and practical introduction to linked data, focusing on the value proposition, the theory and foundations, and practical examples. The material is tailored to the context of the EU institutions.
The document discusses the Metadata Encoding and Transmission Standard (METS), which is an XML schema for encoding descriptive, administrative, and structural metadata regarding objects within a digital library. It describes the characteristics and sections of a METS file, including the header, descriptive and administrative metadata, file and structural map sections. Current users of METS are also listed, such as libraries and universities. The purpose of METS is to provide a flexible structure for linking metadata and content about digital objects.
The need for interoperability in Office and GIS formats – Markus Neteler
Free GIS and Interoperability: The need for interoperability in Office and GIS formats
GIS Open Source, interoperabilità e cultura del dato nei SIAT della Pubblica Amministrazione
[GIS Open Source, interoperability and the 'culture of data' in the spatial data warehouses of the Public Administration]
This document discusses linked data and its use for publishing and connecting environmental data on the web. It describes how linked data allows data to work like web pages by using URIs and standards like RDF to connect related information. The document provides an overview of linked data basics including its underlying structure using triples, standards for formatting and sharing data, and techniques for querying linked data using SPARQL similar to SQL. It also discusses ongoing work by the EPA and other organizations to publish environmental and geospatial data as linked open data.
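The triple structure and SPARQL-style querying mentioned above can be illustrated with a toy in-memory store. The URIs and predicates below are invented for illustration, not actual EPA identifiers.

```python
# A toy triple store: every fact is a (subject, predicate, object)
# tuple, and a query is a pattern where None acts like a SPARQL variable.
triples = {
    ("epa:facility/1001", "rdf:type", "epa:Facility"),
    ("epa:facility/1001", "epa:locatedIn", "dbpedia:Ohio"),
    ("epa:facility/1001", "rdfs:label", "Example Plant"),
    ("dbpedia:Ohio", "rdf:type", "dbpedia:State"),
}

def query(pattern):
    """Return all triples matching the (s, p, o) pattern."""
    s, p, o = pattern
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Roughly "SELECT ?s WHERE { ?s rdf:type epa:Facility }"
facilities = [s for s, _, _ in query((None, "rdf:type", "epa:Facility"))]
```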
Presentation by Luiz Olavo Bonino about the current state of the developments on FAIR Data supporting tools at the Dutch Techcentre for Life Sciences Partners Event on November 3-4 2016.
Discussion Notes: Presentation to Ecoinformatics International Technical Collaboration Partnership
International Web Meeting - Linked Open Data and Environmental Information
Day 1 – December 6, 2010
Geospatial Topic – Dave Smith
Big data processing using Hadoop Technology – Shital Kat
This document summarizes a report on Hadoop technology as a solution to big data processing. It discusses the big data problem, including defining big data, its characteristics and challenges. It then introduces Hadoop as a solution, describing its components HDFS for storage and MapReduce for parallel processing. Examples of common friend lists and word counting are provided. Finally, it briefly mentions some Hadoop projects and companies that use Hadoop.
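The word-counting example cited above is the canonical MapReduce illustration. A plain-Python sketch of the same map/shuffle/reduce flow (without Hadoop's distribution across nodes):

```python
# Word counting in MapReduce style: the map phase emits (word, 1)
# pairs, then a shuffle groups pairs by key and a reduce phase sums
# each group (collapsed into one step here for brevity).
from collections import defaultdict

def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reduce_phase(pairs):
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

docs = ["big data big problems", "big data tools"]
counts = reduce_phase(map_phase(docs))
# counts["big"] == 3, counts["data"] == 2
```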
Secrets of Enterprise Data Mining: SQL Saturday Oregon 2014 – Mark Tabladillo
If you have a SQL Server license (Standard or higher) then you already have the ability to start data mining. In this new presentation, you will see how to scale up data mining from the free Excel 2013 add-in to production use. Aimed at beginning to intermediate data miners, this presentation will show how mining models move from development to production. We will use SQL Server 2014 tools including SSMS, SSIS, and SSDT.
EDF2014: Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universid... – European Data Forum
Selected Talk of Daniel Vila-Suero, Researcher, Ontology Engineering Group, Universidad Politecnica de Madrid, Spain at the European Data Forum 2014, 19 March 2014 in Athens, Greece: 3LD: Towards high quality, industry-ready Linguistic Linked Licensed Data
Secrets of Enterprise Data Mining: SQL Saturday 328 Birmingham AL – Mark Tabladillo
This document discusses secrets of enterprise data mining. It begins by defining data mining as the automated or semi-automated process of discovering patterns in data. It then discusses how data mining can be applied in various industries like telecommunications, oil and gas, and Volkswagen Group. Finally, it discusses how Microsoft offers solutions for enterprise data mining through SQL Server Analysis Services and Microsoft Azure Machine Learning.
The document discusses database concepts including:
- What a database is and its components like data, hardware, software, and users.
- Database management systems (DBMS) that enable users to define, create and maintain databases.
- Data models like hierarchical, network, and relational models. Relational databases using SQL are now most common.
- Database design including logical design, physical implementation, and application development.
- Key concepts like data abstraction, instances and schemas, normalization, and integrity rules.
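The relational concepts in the list above can be made concrete with Python's built-in sqlite3 module. The schema below is a made-up example of a normalized design with a foreign-key integrity rule, not taken from the summarized document.

```python
# A minimal relational sketch: a normalized two-table schema, an
# integrity rule (foreign key), and a SQL join that reconstructs the
# relationship the normalization split apart.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE dept (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("""CREATE TABLE employee (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    dept_id INTEGER REFERENCES dept(id))""")
conn.execute("INSERT INTO dept VALUES (1, 'Engineering')")
conn.execute("INSERT INTO employee VALUES (1, 'Ada', 1)")

row = conn.execute("""SELECT e.name, d.name FROM employee e
                      JOIN dept d ON e.dept_id = d.id""").fetchone()
# row == ('Ada', 'Engineering')
```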
20th Athens Big Data Meetup - 1st Talk - Druid: the open source, performant, ... – Athens Big Data
Title: Druid: the open source, performant, real-time, analytical datastore
Speaker: Peter Marshall (https://linkedin.com/in/amillionbytes/)
Date: Tuesday, January 28, 2020
Event: https://meetup.com/Athens-Big-Data/events/266900242/
DC-2008 Tutorial 3 - Dublin Core and other metadata schemas – Mikael Nilsson
The document discusses metadata standards and interoperability. It provides an overview of Dublin Core and other metadata schemas. It describes how Dublin Core terms are defined both for human understanding through textual definitions, as well as machine understanding through formal semantics expressed in RDF. This allows metadata using Dublin Core terms to be combined and processed in an interoperable way on the Semantic Web.
The document discusses challenges facing the semantic web as it tries to keep up with the growth of the regular web, including not having enough agreed upon vocabularies, data, and links between data. It also notes problems with reasoning over large amounts of noisy and inconsistent web data from different sources. Solutions proposed include cleverly injecting semantic web technologies into content management systems to extract and link more data, as well as developing lightweight vocabularies and simplified reasoning techniques.
The document summarizes key concepts about linked data and the semantic web. It discusses how linked data uses URIs and RDF to publish structured data on the web in a way that is machine-readable and interconnected. It provides examples of how linked data is being implemented in projects from the UK government and BBC to link disparate data sources on the web. While progress is being made, challenges remain around getting organizations to publish their data as linked open data and proving the business value of doing so.
This document provides an overview of XML and related technologies. It discusses how XML can be used to structure data to allow it to be passed between different systems. Some key benefits of using XML include its flexibility, ability to search and extract data, and compressibility. The document also outlines several common tasks involved in working with XML data, such as validation, editing, transformation, querying, and linking documents. It recommends using technologies like XML Schema, XSLT, and namespaces to accomplish these tasks.
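The parsing and querying tasks described above can be shown with the standard library's ElementTree. The sample XML is invented for illustration.

```python
# Parse an XML document and run a simple structural query over it,
# illustrating how XML carries structured data between systems.
import xml.etree.ElementTree as ET

doc = """<facilities>
  <facility id="1001"><name>Example Plant</name><state>OH</state></facility>
  <facility id="1002"><name>Other Site</name><state>TX</state></facility>
</facilities>"""

root = ET.fromstring(doc)
# Query: names of facilities located in Ohio.
names = [f.findtext("name") for f in root.findall("facility")
         if f.findtext("state") == "OH"]
```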
The USA and UK governments have made significant progress with linked, open data in recent months. Several fundamental datasets from the Australian Government are on the cusp of being exposed as meaningful, reusable, machine-readable assets, further driving the adoption of linked data within and around government.
Making better use of online data offerings using a combination of top-down policy and guidance, together with bottom-up development efforts from agency web teams, would seem to describe a sustainable, organic growth in linked government data.
Learn about the path to the first release of data.gov.au; a draft roadmap to future releases; the barriers to linked data and open public sector information (PSI); and the real-world questions this technology aims to solve.
IRJET - Generate Distributed Metadata using Blockchain Technology within HDFS ... – IRJET Journal
This document proposes a new HDFS architecture that eliminates the single point of failure of the NameNode by distributing metadata storage using blockchain technology. In the traditional HDFS, the NameNode stores all metadata, but in the new architecture this is replaced by blockchain miners that securely store encrypted metadata across data nodes. Blockchain links data blocks in a serial manner with cryptographic hashes to ensure integrity. The key components are HDFS clients, data nodes for storage, and specially designated miner nodes that help create and store metadata blocks in an encrypted and distributed fashion similar to how transactions are recorded in a blockchain. This architecture aims to provide reliable, secure and faster metadata access without a single point of failure.
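The hash-linked metadata blocks described above reduce to a simple chaining idea, sketched here in Python. The field names are illustrative, not the paper's actual schema.

```python
# Each metadata block stores the hash of its predecessor, so tampering
# with any block breaks every later link in the chain.
import hashlib
import json

def make_block(metadata, prev_hash):
    payload = {"metadata": metadata, "prev_hash": prev_hash}
    block = dict(payload)
    block["hash"] = hashlib.sha256(
        json.dumps(payload, sort_keys=True).encode()).hexdigest()
    return block

def verify(chain):
    """Check that every block references its predecessor's hash."""
    return all(b["prev_hash"] == a["hash"]
               for a, b in zip(chain, chain[1:]))

chain = [make_block({"file": "/data/a", "blocks": 3}, "0" * 64)]
chain.append(make_block({"file": "/data/b", "blocks": 1}, chain[-1]["hash"]))
assert verify(chain)

chain[0]["hash"] = "f" * 64  # tamper with the first block
assert not verify(chain)
```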
The document discusses integrating government data from multiple sources using semantic web technologies. It describes various data formats used by government sources, including spreadsheets, XML, RSS, RDFa. It also discusses strategies for importing different types of "found data" into RDF, merging the data through schema mapping and tagging, and analyzing and displaying the integrated data using semantic web approaches. Controlled vocabularies play an important role in mapping schemas and enabling data integration and reuse.
"Big Data" is term heard more and more in industry – but what does it really mean? There is a vagueness to the term reminiscent of that experienced in the early days of cloud computing. This has led to a number of implications for various industries and enterprises. These range from identifying the actual skills needed to recruit talent to articulating the requirements of a "big data" project. Secondary implications include difficulties in finding solutions that are appropriate to the problems at hand – versus solutions looking for problems. This presentation will take a look at Big Data and offer the audience with some considerations they may use immediately to assess the use of analytics in solving their problems.
The talk begins with an idea of how big "Big Data" can be. This leads to an appreciation of how important "Management Questions" are to assessing analytic needs. The fields of data and analysis have become extremely important and impact nearly all facets of life and business. During the talk we will look at the two pillars of Big Data – Data Warehousing and Predictive Analytics. Then we will explore the open source tools and datasets available to NATO action officers to work in this domain. Use cases relevant to NATO will be explored with the purpose of showing where analytics lies hidden within many of the day-to-day problems of enterprises. The presentation will close with a look at the future. Advances in the area of semantic technologies continue. The much-acclaimed consultants at Gartner listed Big Data and Semantic Technologies as the first- and third-ranked top technology trends to modernize information management in the coming decade. They note there is an incredible value "locked inside all this ungoverned and underused information." HQ SACT can leverage this powerful analytic approach to capture requirement trends when establishing acquisition strategies, monitor Priority Shortfall Areas, prepare solicitations, and retrieve meaningful data from archives.
The document outlines Renault's big data initiatives from 2014-2016 which progressed from an initial sandbox to a full industrialized big data platform. Key steps included implementing a new Hadoop infrastructure in 2015, industrializing the platform in 2016 to host production projects and POCs, and designing for scalability, isolation, simplified operations, and data protection. The document also discusses deploying quality projects to the data lake, ingestion scenarios, interactive SQL analytics, security measures including tokenization, and the next steps of federation and dynamic data change management.
The document outlines Renault's big data initiatives from 2014-2016, including:
1. Starting with a big data sandbox in 2014 using an old HPC infrastructure for data exploration.
2. Implementing a DataLab in 2015 with a new HP infrastructure and establishing a first level of industrialization while improving data protection.
3. Creating a big data platform in 2016 to industrialize hosting both proofs of concept and production projects while ensuring data protection.
Logical Data Fabric: Architectural Components – Denodo
Watch full webinar here: https://bit.ly/39MWm7L
Is the Logical Data Fabric one monolithic technology, or does it comprise various components? If so, what are they? In this presentation, Denodo CTO Alberto Pan will elucidate what components make up the logical data fabric.
Presentation on EPA's Facility Registry Service API for the DC Web API Meetup. The API is being used in front end integration and master data management, delivering data quality improvements, better integration, and burden reduction to reporters.
Similar to FRS Linked Open Data Concept v1.3 20101130
This document discusses initiatives by the Facility Registry Service (FRS) to improve data quality for key environmental datasets relevant to emergency response. It identifies known data gaps for oil and hazardous waste facilities (ESF-10) and wastewater/drinking water infrastructure (ESF-3). The FRS will conduct thematically and geographically targeted reviews of high-risk facilities to address these gaps. Geographically, counties in Louisiana, Florida, Alabama and Mississippi with the most frequent hurricane disaster declarations will be prioritized. The FRS will also develop new GIS layers, including one for wastewater treatment plants integrating data from ICIS-NPDES and other sources.
The executive order establishes a cross-agency working group to improve coordination between federal, state, and local agencies on chemical facility safety. The working group is tasked with developing plans to modernize regulations and information sharing, identify best practices, and enhance emergency response coordination. Key objectives include reviewing coverage of existing risk management programs, identifying ways to improve ammonium nitrate safety, and convening stakeholders to discuss options for strengthening chemical safety and security.
This document discusses the EPA's use of infrastructure data for emergency response efforts. It notes that the EPA's emergency response has traditionally focused on oil and hazardous waste cleanup after disasters. However, the poor quality of drinking water and wastewater infrastructure data hampered the EPA's response to Hurricane Sandy. Address and location data for many facilities was missing, invalid, or in the wrong county. The EPA's Facility Registry Service helped fill some gaps by integrating data from other EPA programs and sources. The document calls for more reliable infrastructure data to better support emergency response, assessment of damage, and prioritization of aid.
The document discusses the EPA's efforts to publish environmental data as linked open data. It provides background on the Facility Registry System (FRS), which contains information on 2.8 million facilities. The EPA has begun publishing FRS data as linked data and is testing functionality to better represent the data. The EPA is also working to publish other data sets as linked open data, such as the Substance Registry and Toxic Release Inventory. It is collaborating with other organizations to develop standards and best practices for linked open data.
The document discusses the Facility Registry System (FRS) which aggregates and integrates facility data from over 30 federal and 50 state, local, and tribal databases. FRS contains information on nearly 2.8 million facilities, over 80% of which have latitude and longitude data. FRS improves the validity of facility program data from 40% to 95% by selecting the best contact and location information from multiple sources. It allows users to evaluate facility compliance and perform cross-media analyses. FRS incorporates several layers of quality control and utilizes EPA standards to determine the best pick location from possible location options for each facility.
The document summarizes data and services provided by the U.S. Environmental Protection Agency (EPA) to support health initiatives. It describes EPA's mission to protect human health and the environment. It then provides an overview of various EPA data assets and systems, including the EPA Data Finder, System of Registries, Environmental Dataset Gateway, Substance Registry, and the Facility Registry System. It also describes the National Environmental Information Exchange Network for exchanging data.
The Facility Registry System (FRS) is a data aggregator that integrates, validates and quality assures data from 32 federal and 57 state, tribal and territorial environmental databases containing information on over 2.6 million facilities, over 80% of which have latitude and longitude data. FRS currently publishes this geospatial and facility information as basic RDF on Data.gov but aims to develop a more robust, standards-driven and semantically enriched linked open data representation.
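The "basic RDF" publication step described here amounts to lifting flat facility rows into triples. A hedged sketch of that transformation, using an invented URI scheme and made-up sample values (not EPA's actual identifiers):

```python
# Convert a flat facility record into N-Triples text: one subject URI
# per facility, one triple per attribute. URIs and predicates besides
# the standard RDF/RDFS/geo vocabularies are illustrative only.
def row_to_ntriples(row):
    subj = f"<http://example.org/frs/facility/{row['registry_id']}>"
    triples = [
        (subj, "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>",
         "<http://example.org/frs/Facility>"),
        (subj, "<http://www.w3.org/2000/01/rdf-schema#label>",
         f"\"{row['name']}\""),
        (subj, "<http://www.w3.org/2003/01/geo/wgs84_pos#lat>",
         f"\"{row['lat']}\""),
        (subj, "<http://www.w3.org/2003/01/geo/wgs84_pos#long>",
         f"\"{row['lon']}\""),
    ]
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)

nt = row_to_ntriples({"registry_id": "110000123", "name": "Example Plant",
                      "lat": 39.96, "lon": -82.99})
```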
More from Dave Smith / USEPA Office of Environmental Information
Have you ever been confused by the myriad of choices offered by AWS for hosting a website or an API?
Lambda, Elastic Beanstalk, Lightsail, Amplify, S3 (and more!) can each host websites + APIs. But which one should we choose?
Which one is cheapest? Which one is fastest? Which one will scale to meet our needs?
Join me in this session as we dive into each AWS hosting service to determine which one is best for your scenario and explain why!
How information systems are built or acquired puts information, which is what they should be about, in a secondary place. Our language adapted accordingly, and we no longer talk about information systems but applications. Applications evolved in a way to break data into diverse fragments, tightly coupled with applications and expensive to integrate. The result is technical debt, which is re-paid by taking even bigger "loans", resulting in an ever-increasing technical debt. Software engineering and procurement practices work in sync with market forces to maintain this trend. This talk demonstrates how natural this situation is. The question is: can something be done to reverse the trend?
[OReilly Superstream] Occupy the Space: A grassroots guide to engineering (an...Jason Yip
The typical problem in product engineering is not bad strategy, so much as “no strategy”. This leads to confusion, lack of motivation, and incoherent action. The next time you look for a strategy and find an empty space, instead of waiting for it to be filled, I will show you how to fill it in yourself. If you’re wrong, it forces a correction. If you’re right, it helps create focus. I’ll share how I’ve approached this in the past, both what works and lessons for what didn’t work so well.
In the realm of cybersecurity, offensive security practices act as a critical shield. By simulating real-world attacks in a controlled environment, these techniques expose vulnerabilities before malicious actors can exploit them. This proactive approach allows manufacturers to identify and fix weaknesses, significantly enhancing system security.
This presentation delves into the development of a system designed to mimic Galileo's Open Service signal using software-defined radio (SDR) technology. We'll begin with a foundational overview of both Global Navigation Satellite Systems (GNSS) and the intricacies of digital signal processing.
The presentation culminates in a live demonstration. We'll showcase the manipulation of Galileo's Open Service pilot signal, simulating an attack on various software and hardware systems. This practical demonstration serves to highlight the potential consequences of unaddressed vulnerabilities, emphasizing the importance of offensive security practices in safeguarding critical infrastructure.
High performance Serverless Java on AWS- GoTo Amsterdam 2024Vadym Kazulkin
Java is for many years one of the most popular programming languages, but it used to have hard times in the Serverless community. Java is known for its high cold start times and high memory footprint, comparing to other programming languages like Node.js and Python. In this talk I'll look at the general best practices and techniques we can use to decrease memory consumption, cold start times for Java Serverless development on AWS including GraalVM (Native Image) and AWS own offering SnapStart based on Firecracker microVM snapshot and restore and CRaC (Coordinated Restore at Checkpoint) runtime hooks. I'll also provide a lot of benchmarking on Lambda functions trying out various deployment package sizes, Lambda memory settings, Java compilation options and HTTP (a)synchronous clients and measure their impact on cold and warm start times.
Taking AI to the Next Level in Manufacturing.pdfssuserfac0301
Read Taking AI to the Next Level in Manufacturing to gain insights on AI adoption in the manufacturing industry, such as:
1. How quickly AI is being implemented in manufacturing.
2. Which barriers stand in the way of AI adoption.
3. How data quality and governance form the backbone of AI.
4. Organizational processes and structures that may inhibit effective AI adoption.
6. Ideas and approaches to help build your organization's AI strategy.
LF Energy Webinar: Carbon Data Specifications: Mechanisms to Improve Data Acc...DanBrown980551
This LF Energy webinar took place June 20, 2024. It featured:
-Alex Thornton, LF Energy
-Hallie Cramer, Google
-Daniel Roesler, UtilityAPI
-Henry Richardson, WattTime
In response to the urgency and scale required to effectively address climate change, open source solutions offer significant potential for driving innovation and progress. Currently, there is a growing demand for standardization and interoperability in energy data and modeling. Open source standards and specifications within the energy sector can also alleviate challenges associated with data fragmentation, transparency, and accessibility. At the same time, it is crucial to consider privacy and security concerns throughout the development of open source platforms.
This webinar will delve into the motivations behind establishing LF Energy’s Carbon Data Specification Consortium. It will provide an overview of the draft specifications and the ongoing progress made by the respective working groups.
Three primary specifications will be discussed:
-Discovery and client registration, emphasizing transparent processes and secure and private access
-Customer data, centering around customer tariffs, bills, energy usage, and full consumption disclosure
-Power systems data, focusing on grid data, inclusive of transmission and distribution networks, generation, intergrid power flows, and market settlement data
Monitoring and Managing Anomaly Detection on OpenShift.pdfTosin Akinosho
Monitoring and Managing Anomaly Detection on OpenShift
Overview
Dive into the world of anomaly detection on edge devices with our comprehensive hands-on tutorial. This SlideShare presentation will guide you through the entire process, from data collection and model training to edge deployment and real-time monitoring. Perfect for those looking to implement robust anomaly detection systems on resource-constrained IoT/edge devices.
Key Topics Covered
1. Introduction to Anomaly Detection
- Understand the fundamentals of anomaly detection and its importance in identifying unusual behavior or failures in systems.
2. Understanding Edge (IoT)
- Learn about edge computing and IoT, and how they enable real-time data processing and decision-making at the source.
3. What is ArgoCD?
- Discover ArgoCD, a declarative, GitOps continuous delivery tool for Kubernetes, and its role in deploying applications on edge devices.
4. Deployment Using ArgoCD for Edge Devices
- Step-by-step guide on deploying anomaly detection models on edge devices using ArgoCD.
5. Introduction to Apache Kafka and S3
- Explore Apache Kafka for real-time data streaming and Amazon S3 for scalable storage solutions.
6. Viewing Kafka Messages in the Data Lake
- Learn how to view and analyze Kafka messages stored in a data lake for better insights.
7. What is Prometheus?
- Get to know Prometheus, an open-source monitoring and alerting toolkit, and its application in monitoring edge devices.
8. Monitoring Application Metrics with Prometheus
- Detailed instructions on setting up Prometheus to monitor the performance and health of your anomaly detection system.
9. What is Camel K?
- Introduction to Camel K, a lightweight integration framework built on Apache Camel, designed for Kubernetes.
10. Configuring Camel K Integrations for Data Pipelines
- Learn how to configure Camel K for seamless data pipeline integrations in your anomaly detection workflow.
11. What is a Jupyter Notebook?
- Overview of Jupyter Notebooks, an open-source web application for creating and sharing documents with live code, equations, visualizations, and narrative text.
12. Jupyter Notebooks with Code Examples
- Hands-on examples and code snippets in Jupyter Notebooks to help you implement and test anomaly detection models.
How to Interpret Trends in the Kalyan Rajdhani Mix Chart.pdfChart Kalyan
A Mix Chart displays historical data of numbers in a graphical or tabular form. The Kalyan Rajdhani Mix Chart specifically shows the results of a sequence of numbers over different periods.
"Frontline Battles with DDoS: Best practices and Lessons Learned", Igor IvaniukFwdays
At this talk we will discuss DDoS protection tools and best practices, discuss network architectures and what AWS has to offer. Also, we will look into one of the largest DDoS attacks on Ukrainian infrastructure that happened in February 2022. We'll see, what techniques helped to keep the web resources available for Ukrainians and how AWS improved DDoS protection for all customers based on Ukraine experience
Skybuffer SAM4U tool for SAP license adoptionTatiana Kojar
Manage and optimize your license adoption and consumption with SAM4U, an SAP free customer software asset management tool.
SAM4U, an SAP complimentary software asset management tool for customers, delivers a detailed and well-structured overview of license inventory and usage with a user-friendly interface. We offer a hosted, cost-effective, and performance-optimized SAM4U setup in the Skybuffer Cloud environment. You retain ownership of the system and data, while we manage the ABAP 7.58 infrastructure, ensuring fixed Total Cost of Ownership (TCO) and exceptional services through the SAP Fiori interface.
This talk will cover ScyllaDB Architecture from the cluster-level view and zoom in on data distribution and internal node architecture. In the process, we will learn the secret sauce used to get ScyllaDB's high availability and superior performance. We will also touch on the upcoming changes to ScyllaDB architecture, moving to strongly consistent metadata and tablets.
Discover top-tier mobile app development services, offering innovative solutions for iOS and Android. Enhance your business with custom, user-friendly mobile applications.
Dandelion Hashtable: beyond billion requests per second on a commodity serverAntonios Katsarakis
This slide deck presents DLHT, a concurrent in-memory hashtable. Despite efforts to optimize hashtables, that go as far as sacrificing core functionality, state-of-the-art designs still incur multiple memory accesses per request and block request processing in three cases. First, most hashtables block while waiting for data to be retrieved from memory. Second, open-addressing designs, which represent the current state-of-the-art, either cannot free index slots on deletes or must block all requests to do so. Third, index resizes block every request until all objects are copied to the new index. Defying folklore wisdom, DLHT forgoes open-addressing and adopts a fully-featured and memory-aware closed-addressing design based on bounded cache-line-chaining. This design offers lock-free index operations and deletes that free slots instantly, (2) completes most requests with a single memory access, (3) utilizes software prefetching to hide memory latencies, and (4) employs a novel non-blocking and parallel resizing. In a commodity server and a memory-resident workload, DLHT surpasses 1.6B requests per second and provides 3.5x (12x) the throughput of the state-of-the-art closed-addressing (open-addressing) resizable hashtable on Gets (Deletes).
For the full video of this presentation, please visit: https://www.edge-ai-vision.com/2024/06/temporal-event-neural-networks-a-more-efficient-alternative-to-the-transformer-a-presentation-from-brainchip/
Chris Jones, Director of Product Management at BrainChip , presents the “Temporal Event Neural Networks: A More Efficient Alternative to the Transformer” tutorial at the May 2024 Embedded Vision Summit.
The expansion of AI services necessitates enhanced computational capabilities on edge devices. Temporal Event Neural Networks (TENNs), developed by BrainChip, represent a novel and highly efficient state-space network. TENNs demonstrate exceptional proficiency in handling multi-dimensional streaming data, facilitating advancements in object detection, action recognition, speech enhancement and language model/sequence generation. Through the utilization of polynomial-based continuous convolutions, TENNs streamline models, expedite training processes and significantly diminish memory requirements, achieving notable reductions of up to 50x in parameters and 5,000x in energy consumption compared to prevailing methodologies like transformers.
Integration with BrainChip’s Akida neuromorphic hardware IP further enhances TENNs’ capabilities, enabling the realization of highly capable, portable and passively cooled edge devices. This presentation delves into the technical innovations underlying TENNs, presents real-world benchmarks, and elucidates how this cutting-edge approach is positioned to revolutionize edge AI across diverse applications.
AppSec PNW: Android and iOS Application Security with MobSFAjin Abraham
Mobile Security Framework - MobSF is a free and open source automated mobile application security testing environment designed to help security engineers, researchers, developers, and penetration testers to identify security vulnerabilities, malicious behaviours and privacy concerns in mobile applications using static and dynamic analysis. It supports all the popular mobile application binaries and source code formats built for Android and iOS devices. In addition to automated security assessment, it also offers an interactive testing environment to build and execute scenario based test/fuzz cases against the application.
This talk covers:
Using MobSF for static analysis of mobile applications.
Interactive dynamic security assessment of Android and iOS applications.
Solving Mobile app CTF challenges.
Reverse engineering and runtime analysis of Mobile malware.
How to shift left and integrate MobSF/mobsfscan SAST and DAST in your build pipeline.
GraphRAG for LifeSciences Hands-On with the Clinical Knowledge Graph
FRS Linked Open Data Concept v1.3 20101130
FRS and Linked Open Data Potential – Conceptual Discussion v 1.3
November 30, 2010
Dave Smith
USEPA/OEI/OIC/IESD/ISSB
smith.davidg@epa.gov
202-566-0797
Document Change History
Revision  Date        Author          Description
1.0       11/12/2010  David G. Smith  Initial Version
1.1       11/24/2010  David G. Smith  Minor updates/revisions as follow-on to 11/23 discussion
1.2       11/29/2010  David G. Smith  Collaborations, potential pilots, FOAF and other models
1.3       11/30/2010  David G. Smith  Additional collaborations and detail on facility granularity concept
FRS Data Model Initial Conceptual Discussion
November 11, 2010 / November 30, 2010
Contents
Document Change History
Introduction
Concept
Current Situation
Linked Open Data Issues
Data Model Issues
Linked Open Data Development
Existing Resources
Short-Term Data Needs
Potential Pilots
Longer-Range, Emergent Data Needs
Other Ongoing, Related Activities
Anticipated Next Steps
Introduction:
The intent of this concept paper is to explore some initial, blue-sky, no-constraints ideas for
potential improvements to the FRS Linked Open Data approach being published via Data.gov, and to
stimulate additional ideas and brainstorming. Follow-on work will examine alternatives,
prioritization and finalization of thoughts toward implementation.
Concept:
Provide enhancements to the FRS Linked Open Data approach to improve analysis, enhance facility
representation, improve the robustness of LOD querying and analytics, integrate other existing
metadata capabilities and better support Semantic Web approaches, such as more-informed RDF
serialization.
Current Situation:
FRS data is currently being published via Data.gov, e.g. via the RDF button on Data.gov catalog pages
(e.g. http://www.data.gov/raw/1030 ) for FRS data.
Figure 1: Example of Current FRS RDF Offering (highlighted in red box)
The data returned is tied to a data.gov URL, e.g.
http://www.data.gov/semantic/data/alpha/1030/dataset-1030.rdf.gz
Linked Open Data Issues:
Currently, FRS and other datasets published via Data.gov are serialized as RDF to support the
semantic web and linked open data. A basic problem with the Data.gov RDF is not specific to the FRS
data; it likely applies across the board.

First, in terms of access, the data is a gzipped download: it must be downloaded and unzipped
before it can be used. Ideally, Data.gov would serve the data as a SPARQL endpoint, a Sesame
repository or some other means of serving a triple store. The download/unzip paradigm does not lend
itself to dynamic mashups.
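As a sketch of what endpoint-based access could enable, the snippet below builds the kind of GET request a mashup might issue against a hypothetical Data.gov SPARQL endpoint. The endpoint URL is an illustrative assumption (no such endpoint exists today); the dgp: prefix points at the existing dataset property URIs.

```python
from urllib.parse import urlencode

# Hypothetical endpoint -- Data.gov does not currently offer one.
ENDPOINT = "http://example.data.gov/sparql"

# The dgp: prefix points at the existing dataset-1030 property URIs.
QUERY = """\
PREFIX dgp: <http://www.data.gov/semantic/data/alpha/1030/dataset-1030.rdf#>
SELECT ?name ?lat ?long WHERE {
  ?entry dgp:state_code   "NE" ;
         dgp:primary_name ?name ;
         dgp:latitude83   ?lat ;
         dgp:longitude83  ?long .
}
LIMIT 10
"""

def sparql_request_url(endpoint, query):
    """Build the GET request a mashup could issue directly, with no
    download-and-unzip step."""
    return endpoint + "?" + urlencode({"query": query, "format": "json"})

url = sparql_request_url(ENDPOINT, QUERY)
print(url[:60] + "...")
```

With a live endpoint, the same URL pattern would let third-party applications pull only the facilities they need, on demand.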
With regard to the Data.gov RDF itself, it appears to be a brute-force serialization of data tables
into RDF. It doesn't really have the semantic depth that analysis could use (see Figures 1-3).
<rdf:Description rdf:about="#entry9985">
  <hdatum_desc>NAD83</hdatum_desc>
  <state_name>NEBRASKA</state_name>
  <latitude83>40.944623</latitude83>
  <interest_types>STATE MASTER</interest_types>
  <city_name>GARLAND</city_name>
  <create_date>01-MAR-00</create_date>
  <frs_facility_detail_report_url rdf:resource="http://iaspub.epa.gov/enviro/fii_query_detail.disp_program_facility?p_registry_id=110006555085"/>
  <congressional_dist_num>01</congressional_dist_num>
  <pgm_sys_acrnms>NE-IIS</pgm_sys_acrnms>
  <epa_region_code>07</epa_region_code>
  <country_name>USA</country_name>
  <fips_code>31159</fips_code>
  <huc_code>10200203</huc_code>
  <collect_desc>ADDRESS MATCHING-HOUSE NUMBER</collect_desc>
  <primary_name>TERRI KELLER RESIDENCE</primary_name>
  <rdf:type rdf:resource="http://data-gov.tw.rpi.edu/2009/data-gov-twc.rdf#DataEntry"/>
  <ref_point_desc>ENTRANCE POINT OF A FACILITY OR STATION</ref_point_desc>
  <postal_code>683609338</postal_code>
  <registry_id>110006555085</registry_id>
  <location_address>1976 OLD MILL RD</location_address>
  <accuracy_value>30</accuracy_value>
  <update_date>06-AUG-01</update_date>
  <county_name>SEWARD</county_name>
  <conveyor>FRS</conveyor>
  <longitude83>-96.990306</longitude83>
  <state_code>NE</state_code>
  <site_type_name>STATIONARY</site_type_name>
</rdf:Description>
Figure 1: Sample of current Data.gov FRS RDF/XML Representation
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#longitude83> "-96.990306" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#state_code> "NE" .
<http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#entry9985> <http://www.data.gov/semantic/data/alpha/997/dataset-997.rdf#site_type_name> "STATIONARY" .
Figure 2: Sample of current Data.gov FRS Representation as Triples
The current RDF serialization is essentially just a brute-force conversion; there is plenty of
opportunity to enhance and improve it.
Some EPA users might easily understand properties such as huc_code or pgm_sys_acrnms, but would
others? Are these uniquely identifiable and understood outside this dataset? One option is to
import references to the EPA data dictionary, or perhaps define an EPA namespace or other means of
defining them more positively. We have a lot of metadata that we can bring into the mix toward
enhancing the identifiability, understandability and usability of the RDF data.
There isn't really much structure or model; it's essentially a flat table. Everything is treated as
an alphanumeric data type, with no temporal intelligence to dates, et cetera. It doesn't identify
the registry ID as something unique or indexable. Many things can and should be defined better.
There is probably a semantic analogue to our data model that we can develop as an RDF/OWL/etc.
representation and then map to it.
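To illustrate the kind of semantic analogue discussed above, the sketch below re-serializes one flat record with a stable facility URI, typed literals for dates and coordinates, and registry_id treated as the unique key. The example.epa.gov subject URI and the epa: namespace are invented for illustration, not established EPA URIs.

```python
from datetime import datetime

# Hypothetical namespace -- illustrative only, not an established EPA URI.
PREFIXES = """\
@prefix epa: <http://example.epa.gov/def/> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
"""

def to_xsd_date(oracle_date):
    """Convert an Oracle-style date such as '01-MAR-00' to ISO 8601,
    so it can carry an xsd:date type instead of being a plain string."""
    return datetime.strptime(oracle_date, "%d-%b-%y").date().isoformat()

def serialize_facility(rec):
    """Emit Turtle with a stable facility URI and typed literals,
    instead of one anonymous 'entry' node of untyped strings."""
    subject = f"<http://example.epa.gov/id/frs/facility/{rec['registry_id']}>"
    return PREFIXES + f"""
{subject} a epa:Facility ;
    epa:registryId  "{rec['registry_id']}" ;
    epa:primaryName "{rec['primary_name']}" ;
    epa:latitude    "{rec['latitude83']}"^^xsd:decimal ;
    epa:longitude   "{rec['longitude83']}"^^xsd:decimal ;
    epa:createDate  "{to_xsd_date(rec['create_date'])}"^^xsd:date .
"""

record = {"registry_id": "110006555085",
          "primary_name": "TERRI KELLER RESIDENCE",
          "latitude83": "40.944623", "longitude83": "-96.990306",
          "create_date": "01-MAR-00"}
print(serialize_facility(record))
```

The point is not this particular shape, but that typed literals and a resolvable subject URI make the same record queryable by date range, by coordinate and by identifier rather than only by string match.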
One approach which may make more sense is to go back to the relational database model, which can
support more richness: essentially, individual tables and their relationships would be generated as
Linked Open Data, and SPARQL queries would then have the flexibility of current SQL queries.
Regarding the properties, are there in some cases other namespaces that we could or should be
leveraging? geo: is one example; our data, however, is NAD83, and geo: assumes WGS84. We could
reproject to WGS84 and provide geo: values to supplement what we have, as one possibility.
Similarly, foaf: or other namespaces deal with addresses and points of contact. The RDF only
carries locations, but FRS also has contacts, should we at some point incorporate those as well.
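As a sketch of the geo: supplement idea (treating the NAD83-to-WGS84 shift as negligible, which holds only at roughly meter scale over the continental US; precision work would require a proper datum transformation):

```python
# WGS84 geo positioning vocabulary (a real, widely used namespace).
GEO = "http://www.w3.org/2003/01/geo/wgs84_pos#"

def geo_triples(subject_uri, lat_nad83, lon_nad83):
    """Emit supplemental geo: triples. NAD83 and WGS84 agree to roughly
    a meter over the continental US; a real pipeline would apply a proper
    datum transformation before asserting WGS84 coordinates."""
    return [
        (subject_uri, GEO + "lat",  f"{lat_nad83:.6f}"),
        (subject_uri, GEO + "long", f"{lon_nad83:.6f}"),
    ]

for s, p, o in geo_triples(
        "http://example.epa.gov/id/frs/facility/110006555085",
        40.944623, -96.990306):
    print(f'<{s}> <{p}> "{o}" .')
```

Publishing geo:lat/geo:long alongside the existing latitude83/longitude83 properties would let generic mapping tools consume the data immediately, while preserving the authoritative NAD83 values.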
In summary, I think the offering could stand to be improved from a standpoint of accessibility
(SPARQL, et cetera; I think Data.gov needs to look at that from a services infrastructure
standpoint) and of usability: following more of a data model approach as opposed to this flat
mapping, mapping to existing namespaces and following existing models where appropriate, and
leveraging our metadata elements, data models and other artifacts toward a better representation
and mapping.
Data Model Issues:
Long range, some additional tweaks to the FRS data model may be needed to enhance data
representation and better support Linked Open Data; some of these are described briefly below.
Linked Open Data Development:
Potential collaboration with:
• Joshua Lieberman (OGC Geospatial Semantics SWG)
• Spatial Ontology Community of Practice
• Jim Hendler (RPI), George Thomas (HHS): CIO Council and Data.gov Geospatial Semantics
threads
• John Harman / Michael Pendleton (LOD, SRS)
• Steve Young / Zach Scott / Open Gov Team (LOD)
• Talis, pending contract (LOD)
• TRI Program (Potential Pilot)
• Kevin Kirby (Data Model)
• Tom Giffen (Data Model, Business Rules)
• Ken Blumberg (Business Rules)
• Cindy Dickinson (Standards, Business Rules)
• Others (program offices, regions, GISWG)
Existing Resources
• Leverage Data Modeling work that Kevin Kirby has been working on
• Drill into gist.owl and other potential resources
Short-Term data needs:
• Semantic Enhancements / Linked Open Data
Improvement of capabilities for supporting Linked Open Data applications –
Analysis of data structure toward supporting faceted, dimensional analyses (Figure 3)
Development of URI schemes, potentially namespaces, and means and approaches for allowing
unique identification and linkage
[Figure 3 diagram: a central Site linked to an Organizational Dimension (administrative, legal and
operational POCs; people; site-level organizational affiliation; ultimate organizational parent), a
Spatial Dimension (lat/long, physical USPS address, municipality, HUC code), a Temporal Dimension,
and a Regulatory Dimension (program IDs, activity, NAICS code, SIC code).]
Figure 3: Potential Facets / Dimensions for Analysis and Semantic Enhancement
• Semantic Dimensions:
Explore various dimensions of facility:
• Spatial –
o GML representation of absolute location (lat/long, etc)
o Spatial representation framework for facility (building footprints, parcel boundary,
others for future)
o Facility data modeling granularity and relationships - get a better handle on what
the facility "thing" represents and its relation to other things - for example, a parcel
boundary containing an industrial complex with manufacturing and storage
buildings (differing NAICS codes, possibly even different companies operating and
licensed/permitted), plus associated air stacks, SPCC measures, water outfalls, et
cetera. When we pull up a "facility" it should ultimately reflect that bigger picture for
context, with the component of interest highlighted.
• Temporal
o Data currency
o Temporal aspects to regulation, enforcement, permitting, et cetera – future
• Corporate Dimension
o Corporate ownership – at facility level and at ultimate corporate parent level
• Function - Activity and Use
o NAICS/SIC Codes
o EPA Regulatory program
o EPA Interest Type
o Linkages / translation between interest type and other ontologies/vocabularies
o Linkages to regulatory programs and other components
• Interrelationships of facilities (future)
• Individuals
o Friend of a Friend (FOAF) and other existing RDF constructs
• Many other potential enhancements
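A minimal sketch of what the URI scheme mentioned above might look like. The base URL and the 12-digit check (matching the registry IDs in the samples in this paper) are illustrative assumptions; the actual scheme would need to be designed and governed by EPA.

```python
# Hypothetical URI scheme -- the base URL and path layout are illustrative only.
FRS_BASE = "http://example.epa.gov/id/frs/facility/"

def facility_uri(registry_id: str) -> str:
    """Mint a stable, dereferenceable URI keyed on the FRS registry ID
    (a 12-digit numeric string in the samples in this paper)."""
    if not (registry_id.isdigit() and len(registry_id) == 12):
        raise ValueError("expected a 12-digit numeric FRS registry ID")
    return FRS_BASE + registry_id

print(facility_uri("110006555085"))
# -> http://example.epa.gov/id/frs/facility/110006555085
```

Minting one canonical URI per registry ID is what would let external datasets link to, and be linked from, FRS facilities without ambiguity.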
Potential Pilots
A number of potential pilots for mashups can be considered. The "low-hanging fruit" for OEI may be
to build upon exploitation of known internal assets, i.e.
• FRS
• TRI (Toxic Release Quantities for Given Location)
• SRS (Substance)
Potentially, as one scenario, one could tie TRI discharges to reaches via OW web services and TRI
reported receiving waters, and then tie this to observed impacts downstream.
One caveat of using EPA data is that it is well known to EPA users, but it ideally needs to be more
fully fleshed out to make it discoverable and uniquely identifiable for external users, perhaps via
embedded EPA identifiers (perhaps an epa: namespace or similar means of identifying our assets).
Other potential scenarios TBD… OECA targeted enforcement vs. OSHA, or OPP vs. USDA pesticides
application data.
Longer-Range, Emergent data needs:
These are not specific to LOD, but are instead emergent attributes of interest for FRS – LOD approaches
may help inform on how to structure these.
• HUC Codes
Completing the prepopulation of HUC codes can support identification of facilities impacting
major watersheds, e.g. Chesapeake Bay (an OECA need). Other potential needs: airsheds.
• Municipality
Toward improving data quality - a physical street address may carry the ZIP Code of a city
different from the actual municipality where the site resides. For example, Suburban Drive, State
College, PA is actually in Ferguson Township, PA, so the local planning and building code officials
and emergency responders who have or need information on the facility of interest would differ
from those of the listed city.
• Relationship
Ability to relate facilities – relating individual components of a larger system of infrastructure,
such as relating a gas terminal to a compressor station – changes to one may impact others.
Ability to organize information in appropriate fashions, such as relating multiple individual oil
platforms with discrete permits to a lease boundary with another level of permitting.
• Indian Country
More robust identification/validation of facilities which may lie within tribal boundaries -
refinement of IND-3 boundaries with other source data, and analysis of flows containing either a
tribal flag (Y/N) and/or a tribal identifier (tribe/reservation name) (collaboration with Elizabeth
Jackson / Ed Liu)
• Facility Definition
Potential broadening of the scope and use of FRS to accommodate grant award locations and other
types of locations, per the 2005 NAPA Report recommendations for consistent agencywide site
identification. This may be predicated on the buildout of other capabilities, such as being able to
relate sites.
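The "Relationship" need above could itself be expressed as linked data. A rough sketch, with invented epa:partOf and epa:connectedTo predicates standing in for whatever relationship vocabulary FRS would ultimately adopt:

```python
# Invented predicates for illustration; real ones would come from an
# agreed FRS relationship vocabulary.
PART_OF      = "epa:partOf"
CONNECTED_TO = "epa:connectedTo"

relationships = [
    # individual oil platforms related to a common lease boundary
    ("frs:facility/PLATFORM-A", PART_OF, "frs:lease/LEASE-1"),
    ("frs:facility/PLATFORM-B", PART_OF, "frs:lease/LEASE-1"),
    # a gas terminal related to its compressor station
    ("frs:facility/GAS-TERMINAL-1", CONNECTED_TO, "frs:facility/COMPRESSOR-1"),
]

def components_of(container, rels):
    """All facilities related to a containing unit via epa:partOf."""
    return [s for s, p, o in rels if p == PART_OF and o == container]

print(components_of("frs:lease/LEASE-1", relationships))
# -> ['frs:facility/PLATFORM-A', 'frs:facility/PLATFORM-B']
```

Modeled this way, "what else shares this lease?" or "what does a change at this compressor station affect?" become single graph queries rather than manual research.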
Other Ongoing, Related Activities
A number of activities, internal and external, can help inform the direction and data model for FRS
data collection and publishing activities; some of these are listed below:
• Potential EPA Corporate ID Workgroup
Collaborate with TRI, TSCA, FRP, RMP, Others who collect corporate parent information, as well
as OECA and others who need corporate parent information to support analysis.
• White House Corporate ID Workgroup
Collaborate with emergent White House Corporate ID workgroup – Beth Noveck / Steve Croley,
SEC, Labor and other agencies to align, coordinate and collaborate on corporate identifiers
• OpenGov
Collaboration with EPA Open Gov initiatives to inform on how best to publish data for external
reuse.
• National Academy of Public Administration
Follow-through on 2005 NAPA Report recommendations
• Spatial Ontology Community of Practice (SOCoP)
Collaboration on vocabularies, standards and data modeling approaches
• Data.Gov Data Architecture Subgroup
Collaboration on vocabularies, standards and data modeling approaches
• EPA OEI/OIC/IESD Data Standards Branch
Collaboration on vocabularies, standards and data modeling approaches
• Others…
Anticipated Next Steps:
TBD; develop ideas for potential pilots, and engage on the "LOD Cookbook" and approaches for
representing and rendering our data as RDF.