Publishing EPA Data as Linked Data
A brief by Michael Pendleton
EPA Office of Environmental Information
pendleton.michael@epa.gov
What is driving us?
“We’re moving from managing documents to managing discrete pieces of open data and content which can be tagged, shared, secured, mashed up and presented in the way that is most useful for the consumer of that information.”

-- Report on Digital Government: Building a 21st Century Platform to Better Serve the American People
Goal: Make Open Data, Content, and
     Web APIs the New Default
Linked Data
What’s It All About?

 • Speak the Language of the Web
 • Just as you surf web pages, linked data lets you surf
   data.
 • SOAP was about making the web try to work like
   applications; REST was about making applications
   work like the web.
 • Linked Data is about making your DATA work like the
   web.


Slide Credit: David G. Smith, U.S. Environmental Protection Agency, Aug 16, 2011 presentation
RDF is a lingua
franca for data
   exchange
Linked Data
Basics
• Tim Berners-Lee: 5-Star model for publishing data

Slide Credit: David G. Smith, U.S. Environmental Protection Agency
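
To make the 5-Star model concrete, here is a minimal Python sketch that counts how many stars a dataset earns. The criteria are summarized from Berners-Lee's published scheme; the dictionary keys and the sample dataset are hypothetical names chosen for illustration, not part of any EPA system.

```python
# Cumulative criteria of the 5-star open data scheme, lowest star first.
# The key names are hypothetical labels invented for this sketch.
FIVE_STAR_CRITERIA = [
    "on_web_open_license",   # 1 star: on the Web, any format, open license
    "structured",            # 2 stars: machine-readable structured data
    "non_proprietary",       # 3 stars: non-proprietary format such as CSV
    "uses_uris",             # 4 stars: open W3C standards (RDF) with URIs naming things
    "linked_to_other_data",  # 5 stars: links out to other people's data
]


def star_rating(dataset: dict) -> int:
    """Count stars; each star requires every earlier criterion to hold."""
    stars = 0
    for criterion in FIVE_STAR_CRITERIA:
        if not dataset.get(criterion, False):
            break
        stars += 1
    return stars


# A plain CSV download published under an open license (assumed example).
csv_download = {
    "on_web_open_license": True,
    "structured": True,
    "non_proprietary": True,
}
print(star_rating(csv_download))
```

Under these assumptions a CSV file stops at three stars; reaching four and five stars requires URIs and links to other data, which is what the rest of this brief is about.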
• Linked Data is about
 publishing and
 consuming data
 using international
 data standards
• Based on a 20-year-old
 idea (the Web)
• A system of linked
 information systems
Global requirements
• Comprehensively link
  legislation & regulations
  for more effective
  government

• Explain context, source,
  version & publication
  date with the data itself

• We need global
  standards for metadata
The mission of the Government Linked
Data (GLD) Working Group is to provide
standards and other information which
help governments around the world
publish their data as effective and usable
Linked Data using Semantic Web
technologies.
Best Practices

Vocabulary Guidance

Community Building
US EPA publishes lots of CSV files ...
And now,
            Linked Open Data ...
•   A proof-of-concept launched in 2011 with 5 Star Linked Data

•   Publication of 1.3M facilities (FRS) and the substances (SRS)
    regulated by the EPA

•   TRI program links to 25 years of data on major polluters

•   Additional pilots in 2012 incorporating EPA and anonymized
    electronic medical records (EMR) data from Sentara
    Healthcare

•   5 Star Linked Open Data to be hosted & accessible on an EPA
    production Web site in summer 2012
Increase re-use by publishing
        Linked Data
  •   Empower users to create their own views of data to
      satisfy different applications

  •   Build a community around the data in which users help
      each other to curate and connect as needed

  •   Skip the supermodel - Leave data in the multiple “best
      of breed” systems; wrap and expose on the Web of Data
There is a Process


Identify → Model → Name → Describe → Convert → Publish

Maintain
7 steps to publishing Linked Data
•   Identify a dataset others are likely to want to re-use
•   Modeling
    •   Onsite modeling session (half day)
    •   Linked Data modeling supported by experts
    •   Validate the model with data owners/stewards
•   Publish data on the Web (opendata.epa.gov) per Best Practices
•   Produce automated scripts to maintain current data
•   Announce Linked Open Data sets *
•   Review usage reports to support relevance & user feedback


             * Pending EPA Systems Security Plan approval
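
The naming, describing and converting steps can be sketched as a small script. The following Python example is only an illustration under stated assumptions: the column names, the `example.org` URI pattern and the vocabulary namespace are hypothetical, and it writes N-Triples by hand with the standard library rather than using the tooling EPA actually used.

```python
import csv
import io
from urllib.parse import quote

# Hypothetical URI pattern and vocabulary namespace, for illustration only.
BASE = "http://example.org/id/facility/"
VOCAB = "http://example.org/def/"
RDFS_LABEL = "http://www.w3.org/2000/01/rdf-schema#label"


def escape_literal(text: str) -> str:
    """Escape backslashes, quotes and line breaks for an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))


def row_to_ntriples(row: dict) -> list:
    """Name the record with a URI, then describe it with two properties."""
    subject = "<" + BASE + quote(row["facility_id"], safe="") + ">"
    name = escape_literal(row["name"])
    state = escape_literal(row["state"])
    return [
        f'{subject} <{RDFS_LABEL}> "{name}" .',
        f'{subject} <{VOCAB}state> "{state}" .',
    ]


def convert(csv_text: str) -> list:
    """Convert a whole CSV extract into a list of N-Triples lines."""
    triples = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        triples.extend(row_to_ntriples(row))
    return triples


# Made-up sample extract; real column names will differ.
SAMPLE = 'facility_id,name,state\n110000001,"Acme ""North"" Plant",VA\n'
for line in convert(SAMPLE):
    print(line)
```

Because the script is a pure function of the CSV extract, it can be re-run whenever the extract is refreshed, which is the point of the "automated scripts" step.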
Open Data Platforms
•   We’re using Callimachus, a Web
    platform for data-driven applications
    based on Linked Data principles.

•   It is hosted on Amazon EC2 and we
    have 24x7x365 data & application
    support.

•   There are other data platforms; we
    selected this one because it is fully
    W3C standards compliant, with no vendor
    “lock in”

•   It’s Open Source (Apache 2.0)
Recommendations
• Linked Data promotes goals of transparency &
  economic development during times of fiscal
  austerity
 •  Publish in reusable format (RDF family of
    standards)
 •  Use OPEN vs proprietary in data formats
 •  Define a URI Policy and Strategy
 •  Use the best practices and vocabularies that
    already exist; don’t reinvent the wheel
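
One way to make a URI policy enforceable is to put it in a single function that every publishing script calls. The sketch below assumes a hypothetical pattern of `http://{domain}/{kind}/{concept}/{reference}` with three kinds (`id`, `doc`, `def`); the domain, the kinds and the function names are illustrative assumptions, not EPA's actual policy.

```python
import re
from urllib.parse import quote

# Hypothetical policy values, for illustration only.
DOMAIN = "data.example.gov"
ALLOWED_KINDS = {"id", "doc", "def"}  # things, documents about them, vocabulary terms


def slug(text: str) -> str:
    """Build a lower-case, hyphen-separated, URL-safe path segment."""
    cleaned = re.sub(r"[^a-z0-9]+", "-", text.strip().lower()).strip("-")
    if not cleaned:
        raise ValueError(f"cannot build a URI segment from {text!r}")
    return quote(cleaned)


def mint_uri(kind: str, concept: str, reference: str) -> str:
    """Mint a URI that follows the assumed pattern, rejecting unknown kinds."""
    if kind not in ALLOWED_KINDS:
        raise ValueError(f"unknown URI kind: {kind!r}")
    return f"http://{DOMAIN}/{kind}/{slug(concept)}/{slug(reference)}"


print(mint_uri("id", "Facility", "110000001"))
```

Keeping the pattern in one place means a documented policy and the published URIs cannot drift apart silently.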
Publishing Linked Data
will require continual
nurturing but the
rewards are worth it
Resources
•   VisibleGovernment.ca Website http://visiblegovernment.ca
•   Hack, Mash and Peer: Crowdsourcing Government Transparency, Jerry Brito, George
    Mason University, http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1023485
•   Blog on UK Environment Agency Water Quality, see
    http://data.southampton.ac.uk/datasets.html
•   Southampton Open Data Service, see http://data.southampton.ac.uk/datasets.html
•   Blog post on Clean Energy data from Reegle, see
    http://blog.semantic-web.at/2012/04/13/reegle-info-linked-open-energy-data-cloud/
•   Blog post on Publishing Linked Open Data in Tight Economic Times, 30-Jan-2012,
    http://3roundstones.com/2012/01/30/publishing-linked-open-data-makes-good-sense-in-tight-economic-times/
•   Blog post on HealthData.gov from US Health & Human Services, 4-June-2012,
    http://www.healthdata.gov/blog/welcome-new-healthdatagov
•   Blog post on US HHS Domain Challenge 1: Metadata, 2-June-2012,
    http://www.healthdata.gov/blog/domain-challenge-1-metadata
Coming soon ...
•   Best Practices for Publishing Linked Data (Editor’s Draft
    20-Apr-2012), see
    https://dvcs.w3.org/hg/gld/raw-file/default/bp/index.html

•   Linked Data Cookbook, see
    http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

•   Linked Data Directory, see http://dir.w3.org

•   Attend the 2012 International Open Government Data
    Conference co-sponsored by data.gov & The World Bank
    10-12 July 2012, Washington DC, see
    http://www.data.gov/communities/conference
This work is Copyright © 2011-2012 3 Round Stones Inc.
It is licensed under the Creative Commons Attribution 3.0 Unported License
Full details at: http://creativecommons.org/licenses/by/3.0/

You are free:

       to Share — to copy, distribute and transmit the work



       to Remix — to adapt the work



Under the following conditions:
       Attribution. You must attribute the work in the manner specified by the
       author or licensor (but not in any way that suggests that they endorse you
       or your use of the work).

       Share Alike. If you alter, transform, or build upon this work, you may
       distribute the resulting work only under the same or similar license to this
       one.
Credits
•   VisibleGovernment.ca, Jennifer Bell (CC-BY-SA): http://www.slideshare.net/jenniferbell
•   1-5 Star Linked Data image: http://lab.linkeddata.deri.ie/2010/star-scheme-by-example/
•   LOD Cloud Diagrams, Richard Cyganiak, Anja Jentzsch (CC-BY-SA): http://lod-cloud.net/

Book covers © their respective owners and used under Fair Use for educational purposes

© 2012 Bernadette Hyland, released under a CC-BY-SA license

EPA OEI Linked Data Process


Editor's Notes

  • #3 The report recently published by the White House describes the information, platform and presentation layers of the digital services that agencies are to provide. The EPA joins government authorities around the world that are defining plans based on open data and open APIs.
  • #4 Many people in governments around the world are publishing data on the Web of Data. We’re familiar with portals such as data.gov.uk and data.gov. Often this is in the form of CSV files, but an increasing amount is available as well-modeled LINKED DATA. We just participated in the International Open Data Conference, which showed that the open (government) data community is really thriving: 450 in-person participants from over 50 countries, 4000 online participants, over 2000 tweets & 162 speakers.
  • #6 Not all Open Government content is Linked Data, but a growing number of data sets are available as 4-5 star Linked Data. Use of structured data is actively promoted by international standards groups like the W3C and by major search engines: Google, Yahoo!, Bing and Yandex.
  • #7 This presentation discusses the increasing number of high-value data sets being published by the EPA as 5 star Linked Data. This means data is published on the Web in both human- and machine-readable formats. A human can read the nicely formatted content AND a machine can find, access and re-use the machine-readable format if it is published in the Web’s data exchange format, RDF.
  • #8 There are a growing number of resources on this topic; several have been authored by EPA’s Linked Data contractors, Dr. David Wood and Bernadette Hyland, and their colleague Dr. Tom Heath. Links to all the projects described in this talk are included at the end of this presentation.
  • #9 Data formats and standards sometimes sound like alphabet soup to many people. The EPA is a member of the W3C Government Linked Data Working Group. We have a practical focus on removing the friction from the Web publishing process and, specifically, are working to make it easier for government authorities to publish DATA on the WEB.
  • #10 The GLD working group works with leading academics, and has guests from the private sector & non-profits who define use cases for open government data. They describe the need for government agencies to publish content that describes the relative authority of a piece of data, for example, case law and regulation.
  • #11 The EPA is a member of the W3C and is active on the Government Linked Data Working Group, along with our colleague George Thomas from HHS. We are one year into the working group’s two-year charter. Our mission is ...
  • #12 The GLD WG is on track to publish BEST PRACTICES and Vocabulary Guidance as W3C Recommendations, which are the standards of the World Wide Web. We’ve also produced a Linked Data Directory of projects, products and service providers, and a Government Linked Data Cookbook describing a step-by-step approach for developers.
  • #13 So where are we today? The EPA already publishes a huge amount of information as CSV files and through portals like Envirofacts. Unfortunately, that data is often hard to find and lacks context. Furthermore, it’s written from a regulatory perspective. It is not re-usable by other scientists and the public without significant re-structuring.
  • #15 Our goals are to broaden access to and re-use of this important data that taxpayers have paid us to collect, and to reduce the burden of compliance for regulated entities.
  • #16 So here is the exact process: Identify the data, then model exemplar records -- what you are going to carry forward. Name all of the NOUNs. Turn the records into URIs. Next, describe RESOURCES with vocabularies. Write a script or process to convert from, say, CSV to RDF. Automate it so it is easy to maintain. This is routinely done in a 30-60 day sprint, with the involvement of the EPA data steward, a project manager and 2 Linked Data experts, part time.
  • #18 We draw “ball and stick” diagrams that describe how all the pieces of data are RELATED to each other. That is all there is to Linked Data: it is a view of data and its relationships to other pieces of information. Other people can come along and add more relationships and information they have.
  • #19 Then we produce scripts that convert CSV to RDF. These scripts can be run ANYTIME there is an update to the underlying CSV extract from the relational database that today stores the data.
  • #20 So let’s review the entire process for producing Linked Open Data, and we’ll show you what the UI looks like next... OEI has followed this process with 3 different data sets of varying complexity, size and data quality. Each data set was published within 60 days on an interim cloud server on Amazon EC2 (see http://usepa.3roundstones.net), with part-time involvement by several EPA staff and a couple of contractors. We expect the System Security Plan for the production data platform to be approved this summer & we’ll host as much Linked Data as EPA produces.
  • #21 The data platform landscape is emerging. Data.gov is using Socrata for 1-3 star data. We felt it was important to avoid vendor lock-in from proprietary formats & ETL processes, so we chose a Web Standards compliant, open source platform specializing in 4 & 5 star data. It’s commercially supported and available via the cloud.
  • #22 Once we had the data modeled and validated with SMEs, we converted & loaded it into Callimachus. We spent about 1 hour creating templates to view the data in Callimachus. So here is the power of LOD in action: within one hour, we could view the data, navigate through the data and verify the contents without being a DBA or Java developer!
  • #23 A designer with CSS skills can help us make it look pretty with a nice CSS theme. Thus, Web developers with HTML, CSS and RDFa / SPARQL skills can create data-driven Web applications. No understanding of semantics or deep RDF knowledge is required.
  • #24 Callimachus’ forms-driven interface allows authorized users to modify the underlying triples in the database -- we are round tripping create/modify/delete to a triple store via a Web page!
  • #25 This is an example of an application that was created in less than 3 days by a Web developer using Callimachus. The data sources included EPA FRS, SRS and TRI Linked Data, spreadsheet data from ABT Associates on corporate ownership (as CSV), Open Street Maps content from the Web (Linked Data Cloud).
  • #26 If you have permissions, you can edit the underlying data stored in the database (an RDF triple store). Several different triple stores are supported by Callimachus. A triple store is effectively just a “library” to Callimachus -- as long as it supports the data standards (RDF, SPARQL), it doesn’t matter which one is used.
  • #28 Note the fixed name and added comment.
  • #29 A history of changes is kept. Note the change to the name and the added comment, along with the time/date and name of the user who made the edit.
  • #31 If you’re interested in the maturity of the RDF family of standards, here is the technology “layer cake”. The data exchange standards are mature and well defined. The world’s leading technology companies are supporting RDF in their products, including Oracle 11g, IBM DB2 and EMC. The world’s leading search engines, including Google, Yahoo!, Bing (Microsoft) and Yandex, are displaying content with RDF (RDFa & RDFa Lite). That is why we’re joining leading governments worldwide to publish our valuable content as LOD.
  • #32 Your ability to move into the future will be ensured by publishing data to the Web. Use data exchange standards. Define a URI policy, document it and help people to comply. Leverage existing vocabularies. Despite what you think, you are probably talking about many of the same objects as someone else (people, organizations, assets, scientific terms, etc.), so use a shared vocabulary to realize the benefits of Linked Data.
  • #36 This presentation is licensed under a Creative Commons BY-SA license, allowing you to share and remix its contents as long as you give us attribution and share alike.
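
The “ball and stick” view described in note #18 can be illustrated with a toy in-memory graph. This Python sketch is a simplification under stated assumptions: the `ex:` identifiers and property names are made up, and a real deployment would use an RDF triple store queried with SPARQL rather than a Python set.

```python
def match(graph, s=None, p=None, o=None):
    """Return the triples that fit a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Each triple is one "stick" joining two "balls" (subject and object).
# All identifiers below are hypothetical.
graph = {
    ("ex:facility/1", "ex:name", "Example Plant"),
    ("ex:facility/1", "ex:releases", "ex:substance/benzene"),
}

# Someone else can later add a relationship they know about,
# without changing the triples that are already there.
graph.add(("ex:facility/1", "ex:ownedBy", "ex:org/acme"))

for triple in match(graph, s="ex:facility/1"):
    print(triple)
```

Adding the ownership link did not require a schema change, which is the sense in which others can "come along and add more relationships".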