Linked Data at US EPA


1-Feb-2013
Linked Data Workshop with EPA OSWER
By Bernadette Hyland, David Wood & Luke Ruth

1
These slides will walk us through a common workflow use case comparing the Linked Data Service to Envirofacts.
Agenda
• Intros ...
• Trends in data management
  • Government data publication
• Update on EPA Linked Data Service
• EPA OSWER moving towards Linked Data?
• Review next steps ...

2
Trends in government data management

3
4

Headlines and agency memos about government transparency with open data, and various government Web sites.

... innovation challenges based on open government data

... High-energy datapaloozas are emerging, with awards ranging from a couple of thousand dollars to $100k+. These challenges open the doors to innovation for better healthcare solutions and more efficient use of energy, to name but a few. They all require access to and re-use of HIGH QUALITY DATA.

In 2012, we read many headlines about big data and the world's search engines and social media sites.
5

However, while there is lots of gold to be mined from public data, it is an uncomfortable time for Government
IT and business managers who are tasked with data management programs.

Most people are having a difficult time keeping up. If you feel like you are hanging on while the world changes
too fast, you are not alone.

Photo credit: http://www.flickr.com/photos/glennharper/4452247708/
6

Who is sharing their data as Linked Data? Small and large commercial and government organizations, NGOs, non-profits ... plus many universities.
Governments in the last few years have been responding to Open Government initiatives that mandate publishing
open government data.
Some are careful, slow-moving entities who simply needed to find real solutions to real problems.
Governments
Goals: Governmental transparency and/or improved internal efficiencies (data warehouses)

7
Big Data
• Simple data
• Complex data
• Legacy data

8

KEY POINT: Search, discovery and data access approaches have evolved over the last decade, and techniques are beginning to come together. GoPubMed was launched in 2002 as the first semantic search portal. Later, Microsoft's Bing and Google's Knowledge Graph became two of the other well-known search engines employing semantic techniques.

Semantic search systems generally consider the context of a search: location, intent, variation of words, synonyms and concepts. Semantic search has roots in linguistics research and NLP.

Big data research has grown to include the MapReduce algorithm for handling really large data sets, often
measured in terabytes or greater. This is the kind of data that people at the Large Hadron Collider at CERN
are working on to provide insights into how the universe works, including the recent discovery of the Higgs
Boson, the particle that gives mass to matter.

Under the big top tent of semantic search we're dealing with different types of content: big, public, complex and legacy data. Simple, complex and legacy data come in small, medium and large sizes.

Many government agencies, by contrast, have lots of small to medium data sets in structured databases, like Oracle. These databases (and the systems that depend upon them) are not going away; however, fewer new data warehouse projects are likely to be started. Data warehouses are widely recognized to be costly to create and maintain, and they change SLOWLY.

The biggest win for governments worldwide that adopt a Web architecture for data publishing is combining data sets to discover new or previously uncontemplated relationships.
“Big Data Is Important, but Open Data Is More Valuable”

As change agents, enterprise architects can help their organizations become richer through strategies such as open data.

-- David Newman, VP Research, Gartner

9

Open data refers to the idea that certain data should be freely available to everyone to use and republish as they
wish, without restrictions from copyright, patents or other forms of control.

The term “open data” has gained popularity with open data initiatives including data.gov.uk, data.gov and other
government data catalog sites.

Enterprise architects are playing an important role in fostering information-sharing practices. Access to, and use of, open data will be particularly critical for businesses that operate using the Web; organizations should focus on using open data to enhance business practices that generate growth and innovation.
Open data + open standards + open platforms
• Highly scalable computing & hosting via the Cloud
• International Data Exchange Standards
• 5 Star Data (Linked Data)
• Open Source tools

10

A Web-oriented approach to information sharing has impacted how scientists, researchers, regulators and the public interact with government.

Linked data lowers the barriers to re-use and interoperability among multiple, distributed and heterogeneous
data sources.

Access to high-quality Linked Open Data via the Web means millions of researchers and developers will be able
to shorten the time-consuming research process involving data cleansing and modeling.
11

How do we get a loose coupling of shared data over Web architectures? By using the structured data model for
the Web: RDF.

There is a project to create freely available data on the Web in this way, which is known as the Linked Open
Data project.

W3C sees Linked Data as the set of best practices and technologies to support worldwide data access,
integration and creative re-use of authoritative data.
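
To make the model concrete, here is a minimal sketch of RDF statements (subject, predicate, object) using Python's rdflib. The facility URI and property names are illustrative assumptions, not EPA's actual vocabulary:

```python
# Minimal RDF sketch with rdflib; URIs and properties are illustrative only.
from rdflib import Graph, Literal, Namespace, URIRef

EX = Namespace("http://example.org/epa/")                    # hypothetical namespace
facility = URIRef("http://example.org/epa/facility/12345")   # subject: a thing, named by URI

g = Graph()
g.add((facility, EX.facilityName, Literal("Hanson Permanente Cement")))  # predicate, object
g.add((facility, EX.zipCode, Literal("95014")))

print(g.serialize(format="turtle"))  # the same triples, shareable as Turtle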
12
The mission of the Government Linked Data (GLD) Working Group is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies.

13

We are 16 months into the Government Linked Data Working Group's two-year charter.
14

A sound government information management strategy requires providing CONTEXT and CONFIDENCE to those accessing and potentially re-using your data.

When people have timely access to information for disaster preparedness, scientific research and policy, the network effect of people helping people is our greatest hope.

On the heels of the recent East Coast hurricane that devastated parts of New York and New Jersey, government executives suggested that fear of cyber-doom scenarios may be taking up too much of our thinking & planning. According to Secretary Panetta, it may be driving us to unrealistic and potentially dangerous responses to threats that don't exist.

The reality is that when disasters strike, people come together and help one another. We don't see paralysis, panic and social collapse.

During today’s session, I’ll describe how several agencies and private sector organizations are using Web
technologies and semantics to improve information access and discovery. Simply put, semantic technologies
provide CONTEXT.
Open Government Data




                       15
Growing chorus ...

“We’re moving from managing documents to managing discrete pieces of open data and content which can be tagged, shared, secured, mashed up and presented in the way that is most useful for the consumer of that information.”

-- Report on Digital Government: Building a 21st Century Platform to Better Serve the American People

16

The Digital Government Strategy sets out to accomplish three things: Access to high quality digital information
& services; procure and manage devices, applications, and data in smart, secure and affordable ways; and unlock
the power of government data to spur innovation.

Governments around the world are defining detailed digital services plans based on open data, open APIs and
open source data platforms. They are defining how governments are publishing data with an eye towards
improving access and re-use. Administrators and program managers are committing to delivery of digital services
using semantic technologies broadly, and Linked Data specifically.
Big data
Integrating ...
• Simple data
• Complex data
• Legacy data

17

We need to find ways to fit together things that weren't originally intended to fit together.

NB: This is the Musée du Louvre, which has evolved from a late 12th Century fortress under Philip II, extended over centuries to incorporate the landmark Inverted Pyramid architected by I.M. Pei and completed in 1993.

A recent addition to house its new galleries for Islamic art opened this year, 2012. The Louvre continues to accommodate new works of art & galleries in new & previously unanticipated ways.

Today, we need to fit big data + complex data + public data + legacy data into one consistent whole.
18

September 2011: 295 datasets met the LOD Cloud criteria, consisting of over 31 billion RDF triples interlinked by around 504 million links.
THERE IS A PROCESS

Identify → Model → Name → Describe → Convert → Publish

Maintain

19

Take comfort in the fact that there is a familiar process. It is similar to the process & roles of
traditional data modeling.

Creating Linked Data requires that we identify the data, model exemplar records -- what you
are going to carry forward & what you are going to leave behind.

Name all of the NOUNs. Turn the records into URIs.

Next, describe RESOURCES with vocabularies.

Write a script or process to convert from canonical form to RDF. Then publish. Maintain over
time.
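
As a minimal sketch of that process under stated assumptions (a hypothetical canonical CSV record, example.org URIs, and vcard as one possible descriptive vocabulary), the conversion step might look like this in Python with rdflib:

```python
# Identify -> Model -> Name -> Describe -> Convert -> Publish, sketched with rdflib.
# The record, URIs and vocabulary choices below are illustrative assumptions.
import csv
import io

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, RDFS

EX = Namespace("http://example.org/epa/")
VCARD = Namespace("http://www.w3.org/2006/vcard/ns#")

# Identify: an exemplar record from a hypothetical canonical CSV export
records = io.StringIO("id,name,zip\n12345,Hanson Permanente Cement,95014\n")

g = Graph()
for row in csv.DictReader(records):
    # Name: turn the record (a NOUN) into a URI
    facility = EX["facility/" + row["id"]]
    # Describe: attach properties drawn from shared vocabularies
    g.add((facility, RDF.type, EX.Facility))
    g.add((facility, RDFS.label, Literal(row["name"])))
    g.add((facility, VCARD["postal-code"], Literal(row["zip"])))

# Convert & Publish: serialize to Turtle and put the file on the Web; Maintain over time.
g.serialize(destination="facilities.ttl", format="turtle")
```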
3 Round Stones produces the leading platform for the publication of reusable data on the Web. Our commercially supported Open Source platform is used by the Fortune 2000 and US Government agencies to collect, publish and reuse data, both on the public Internet and behind institutional firewalls.

20

Our goal is to produce the leading platform for the publication of reusable data on the Web.
Callimachus
http://callimachusproject.org
http://3roundstones.com

21

Callimachus is that platform. It is available via 3roundstones.com or its Open Source site
callimachusproject.org.
CONTENT MANAGEMENT SYSTEM vs. LINKED DATA MANAGEMENT SYSTEM
[Diagram: a content management system handles mostly unstructured text and data, while Callimachus handles structured data]

22

Callimachus may be compared to a distributed CMS. CMSs manage mostly unstructured information. Callimachus, by contrast to a CMS, manages primarily structured Linked Data. We call this a Linked Data management system.
23

Callimachus started in 2009 as a simple online RDF editor. Users could fill out HTML forms,
which would create RDF behind the scenes. The resulting RDF could be viewed as HTML and,
of course, shared and combined with other RDF data. Since then HTML5 has allowed us to
hack the browser environment much less and extend its capabilities.
Data driven Web apps using Callimachus
• US Legislation + enterprise data
• DBpedia + enterprise datasets
• Clinical Trials + enterprise linked data

24

Callimachus integrates (very) well with other enterprise systems as well as Web content. It
can form an entire application or part of one.
NB: Mention Documentum, Oracle via HTTP
25

• US HHS committed to making a vast array of open data more readily available to improve health care delivery & reduce costs in 2013 and beyond.

• In 2012, Sentara created a Web application that integrates authoritative data from 5 different sources, including content from NLM, NOAA, EPA and DBpedia.

• This application utilizes open data, open standards and an open source data platform.
[Diagram: the user at the center, drawing on NOAA, US EPA AirNow, US EPA SunWise, DBpedia and the National Library of Medicine]

26
US EPA Linked Data
• Cloud-based Linked Data provision of 3 core programs:
  • 2.9M facilities
  • 100K substances
  • 25 years of toxic pollution reports
• FISMA compliant
• 16 Callimachus templates
• Official launch Feb 2013

27
28

Envirofacts, EPA’s older system.
29

EPA’s new Linked Data system. Cooperation without coordination. Data reuse breaks the back of API gridlock.
Clay Shirky stole that from me :)
30

This data is exactly the same data used to create the interface. Unlike traditional database-driven applications,
the data is immediately accessible for reuse by third parties. This prevents data duplication, allows for tracking of
provenance and avoids reinventing the wheel.
We’ve Seen This Before




                                                                                          31

Like HTML and RDF, credit cards have a human-readable side and a machine-readable side.
Linked Data management system located at a Tier 1 Cloud Provider (FISMA compliant)
[Diagram: an RDF database exposes Resource URIs, a REST API and a SPARQL endpoint; the public reaches it through a Web browser, while registered developers use applications, scripts or automated clients]

32

Introduce Callimachus, an open source, open data platform based on open standards.
3 Round Stones provides commercial support for Callimachus and is a major contributor to
the OS project.

Users of Callimachus see a generated Web interface, but can also directly access the data via
REST or SPARQL.

SPARQL Named Queries (like stored procedures) allow for automated conversion to different
formats for reuse in non-RDF environments.
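
As a sketch of the SPARQL access path (the endpoint URL and query shape are placeholder assumptions, not the service's published interface), a client might use Python's SPARQLWrapper:

```python
# Querying a Linked Data service's SPARQL endpoint; URL and data shapes are assumptions.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("http://example.org/epa/sparql")  # placeholder endpoint
sparql.setQuery("""
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    SELECT ?facility ?name
    WHERE { ?facility rdfs:label ?name .
            FILTER regex(?name, "Hanson", "i") }
    LIMIT 10
""")
sparql.setReturnFormat(JSON)

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["facility"]["value"], row["name"]["value"])
```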
[Screenshot: a single page combining content From EPA, From Wikipedia and Open Street Map]

33

Data may be easily combined from several sources.
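
A sketch of how that combination works at the data level, assuming illustrative URLs: because everything is RDF, merging is just parsing several sources into one graph.

```python
# Merging Linked Data sources is just parsing them into one rdflib Graph.
# The EPA URL is a placeholder; the DBpedia resource is fetched via content negotiation.
from rdflib import Graph

g = Graph()
g.parse("http://example.org/epa/facility/12345.ttl", format="turtle")  # hypothetical
g.parse("http://dbpedia.org/resource/Cupertino,_California")

print(len(g), "triples after merging both sources")
```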
HOW IT IS DONE TODAY ...




                           34
Audience for EPA Data
• Middle school student doing a science project
• Concerned citizen worried about local pollution
• Environmental Science PhD from EPA
• Doctor from NIH writing a research paper

35
To try to understand the advantages and disadvantages of both systems, we need to know the audience for the system. That presents a problem, though. It's nearly impossible to know your audience at any given moment. Even if it were possible, the audience is so varied that it would be unwise to cater to a single group at the expense of another.

The audience could be a middle school student, a concerned local citizen, a PhD collecting information for a report, or a doctor writing a research paper. We just don't know.

For example, if the system was designed to accommodate a 6th grader, the system would be oversimplified and thin. If it was designed with a PhD or a doctor in mind, the average citizen could be overwhelmed and find the system complicated and verbose.

That’s why the goal should be to make the simple things easy and the complicated things possible.
How much mercury did Hanson Permanente Cement release in 2004?

36
With that in mind, let's walk through our example, trying to keep all our audience members in mind. Let's pretend our audience members live in Cupertino, California and want to know about the local cement plant. The question is: How much mercury did Hanson Permanente Cement release in 2004?
37
The process starts out much the same. The user enters their zip code, in this case 95014, into the search field and is presented with the results. From here though, the workflow is quite different.
Envirofacts


              38
39
Rather than immediately returning a list of the facilities in that zip code, Envirofacts gives users the option to verify their location and to drag, drop, and resize a map to match their request. While this provides a high level of granularity, it makes a simple thing harder than it needs to be for a lot of users. While some people may know the physical boundary of their zip code, the average user would most likely trust the application to take care of that.
40
Envirofacts then returns a page that looks like this (with some rows obviously cut from the table). It’s great that right up front, the user is able to both see the results and have access to the
data. There is a link they can paste into their browser and get the raw data as well as a button to click that will download the results as CSV. The big problem here though is that the data
comes back with formatting that renders it nearly unusable and content that adds very little value above what is present on the screen.

Copying and pasting the link at the top of the screen gives us the following data:
41
It comes back in what *appears* to be CSV. However, there is no actual indication of that. There are no column headers, no descriptions of what fields represent, the text is difficult to
partition and understand, and there are no links to documentation on the use of the API or how the query is structured. All we know is that it is data from “this report”.

The other link - the CSV Table link - returns the following data:
42
Having the table as CSV is theoretically useful, but it too comes back without the necessary structure to understand and use the data. This time there are column headers, but it is unclear what exactly they represent. Many of the fields are just the URL links found in the HTML table on the website. There doesn't appear to be any raw data - just links to where the raw data could potentially be found.

Moving on from accessing the data, we will try to find the facility we are looking for - Hanson Permanente.
Finding Hanson Permanente




                                                                                                                                                                                              43
Finding Hanson Permanente can only be done in this table by scrolling down and finding "Hanson" in the alphabetically sorted list of facilities. There are no options to sort on the contents of another column or to search within the table. The unfortunate side effect of this is that by the time you scroll down you can no longer see the column headers. Simply looking at this screen, you can see that there are 8 separate reports that can be viewed, but it is unclear how they are differentiated and what each contains.

The key here is that the data reflects internal EPA systems - which are unknown to the majority of users. By doing this, Envirofacts is implicitly asking users to become experts on internal EPA systems, which they either are not capable of, or do not have the time for.
Finding Mercury Released in 2004




                                                                                                                                                                                            44
Because most users do not have this knowledge, the first report they'll most likely click is the Summary Report. The "Summary Report" brings us to a long page where, after quite a bit of scrolling, we can see the Toxic Releases for 2011. However, unlike the previous search results, this data is not available for download or retrieval by any means other than screen-scraping or re-keying. It is also a limited dataset and does not have the data for 2004.
Compliance Report




                                                                                                                                                                                              45
The Summary, Facility, AFS, BR, RCRA, TRI, and TSCA Reports at their top level do not have the data about mercury either. It is actually contained in the Compliance Report. However, like the other tables, there is no way to download this data and repurpose it for other applications. The other source of confusion is that this data can be found in multiple places depending on its originating report, and it can be unclear whether the data is in fact the same.

For example, this data can also be found by drilling down in the TRI Report by clicking:
“View Report”->”P2 Report (Report)”->”P2 Report”-> and then manipulating the view based on the year and view you want. These graphs and charts ultimately contain very interesting and
relevant data but they are so obscured and inaccessible that it becomes extremely difficult to create anything new.
Potential Audience
X Middle school student doing a science project
X Concerned citizen worried about local pollution
✔ Environmental Science PhD from EPA
X Doctor from NIH writing a research paper

46
Who did we cater to? The middle school student? Probably not. The concerned citizen? Unless that citizen happens to have specific knowledge of the EPA system and a great deal of experience navigating technology, most likely not. What about the Environmental Science PhD and the doctor from NIH? They may have the knowledge to understand column names, chemical compounds, and reporting a bit better, but only the Environmental Science PhD with a working knowledge of EPA's system can determine enough information to make use of it. The doctor, on the other hand, is still working against the system itself to find the data behind it.
Linked Data


                                                                 47

Now let’s look at the same workflow in the Linked Data Service.
Finding Hanson Permanente




                                                                                                  48

By keeping the application simple - and letting the results be viewed either as a table or a map - the user can adjust their search as they see fit without extra navigation. Also, by having the data in a table that can be searched or sorted however the user sees fit, finding a specific facility is as easy as typing the name in or sorting on relevant criteria. This is made possible by exposing the data, rather than containing it in a standard HTML table.

I fully recognize that Envirofacts could offer identical functionality by tweaking their application, but the key underlying point is that this application was created very cheaply and quickly *because* the data is modeled as Linked Data. When the development environment is a Web browser, and the data is described and Linked, an application can be a simple XHTML page with JavaScript, instead of a heavy-weight dedicated application.
Finding Mercury Released in 2004
[Screenshot with two callouts: (1) the data download options, (2) the data-driven navigation links]

49

There are two very important things to note on this page. 1 is that on any facility's page, there is always an option to download the data. This data is available in two formats (RDF/XML and Turtle). With the click of a button a user can have all of the data that was used to drive the creation of the current page, which means he or she can repurpose that data into any new application. Note here that this download is not an extract, summary, or recreation of the data - it is literally the *same* data that was used to drive that page.

2 is that because this page is "data-driven", navigation relies on exploring the data, not the system that contains it. On the same page where we get information like its latitude and longitude, we can also find a link to a report detailing exactly how much mercury was released in 2004. We could easily do an in-page search for 2004 or Mercury to identify the releases associated with those terms.
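
As a sketch of what that download looks like programmatically (the facility URL is a placeholder, not the service's actual URI), a client can ask the same page URL for Turtle via HTTP content negotiation:

```python
# Fetch the same data that drives the page, as Turtle, via content negotiation.
# The facility URL is a placeholder.
import requests
from rdflib import Graph

resp = requests.get("http://example.org/epa/facility/12345",
                    headers={"Accept": "text/turtle"})

g = Graph()
g.parse(data=resp.text, format="turtle")
print(f"{len(g)} triples drive this page")
```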
TRI Report




                                                                                             50

Rather than aggregating the data for presentation, the actual report is presented with the raw
data continuously available in the top right of the page.

A subtle difference to be pointed out here is the difference in the name of the facility.
Previously it was identified as Hanson Permanente, but now it is known as Lehigh Southwest
Cement Co. During the modeling phase, the Linked Data was created to implicitly include this
relationship (which is known via the mapping of EPA FRS identifiers). On the other hand,
pulling down the CSV files would not give the user any obvious way of understanding this
relationship.
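
A sketch of how such a relationship can be recorded during modeling; the URIs and the choice of skos:altLabel are illustrative assumptions, not EPA's actual vocabulary:

```python
# One way the modeling phase can record that two names refer to the same
# FRS-identified facility; URIs and predicates here are illustrative only.
from rdflib import Graph

data = """
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix skos: <http://www.w3.org/2004/02/skos/core#> .

<http://example.org/epa/frs/12345>
    rdfs:label    "Lehigh Southwest Cement Co" ;
    skos:altLabel "Hanson Permanente Cement" .
"""
g = Graph().parse(data=data, format="turtle")
# Any query against the facility now finds it under either name.
```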
Data Reuse




                                                                                            51

Lastly, giving users the ability to grab the data off any page, at any time during navigation,
strongly facilitates the reuse of data. These graphs are not natively embedded in the webpage
of a given facility. Rather, by downloading the data the user can quickly and easily make new
and different visualizations for a report or presentation.

For example, this history of air stack pollution reports was made with a single parameterized SPARQL query and a single JavaScript pattern. It could very easily be applied to any number of facilities, changed to a bar graph, or altered in any number of other ways with very little effort, thanks to the fact that it was modeled using Linked Data.
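
A sketch of that pattern, with a placeholder endpoint and predicates: one parameterized query, reusable for any facility.

```python
# One parameterized query drives the chart for any facility.
# Endpoint and predicates are placeholders, not the service's actual schema.
from SPARQLWrapper import JSON, SPARQLWrapper

QUERY = """
PREFIX ex: <http://example.org/epa/>
SELECT ?year ?amount
WHERE {{ <{facility}> ex:airStackRelease ?r .
         ?r ex:year ?year ; ex:amount ?amount . }}
ORDER BY ?year
"""

def release_history(facility_uri):
    sparql = SPARQLWrapper("http://example.org/epa/sparql")
    sparql.setQuery(QUERY.format(facility=facility_uri))
    sparql.setReturnFormat(JSON)
    rows = sparql.query().convert()["results"]["bindings"]
    return [(r["year"]["value"], r["amount"]["value"]) for r in rows]

# Swap in any facility URI to chart a different facility.
print(release_history("http://example.org/epa/facility/12345"))
```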
Potential Audience
✔ Middle school student doing a science project
✔ Concerned citizen worried about local pollution
✔ Environmental Science PhD from EPA
✔ Doctor from NIH writing a research paper

52

Linked Data allowed us to reach all the members of our potential audience by giving the user options, aggregating based on relevance rather than data source, and by exposing the data that drives the service for reuse.

The middle school student or concerned citizen who wants to know the location of a facility, the amount of a particular chemical it released, and the year it was released in never has to click any of the options in the Linked Data box. They can simply use the interface, explore the data, and find what they need in a read-only experience.

The Environmental Science PhD is still able to find what he is looking for with Linked Data but can do so in a much more intuitive way. The doctor from NIH is now able to find the data they're interested in and, if they choose to take the next step, download the actual data behind the page. By quickly and easily obtaining the raw data, anyone from scientists to journalists can generate their own applications without any knowledge of the Linked Data Service itself.
What Callimachus is



                      53
Subject --Predicate--> Object

54
The heart of Callimachus is a template engine used to navigate, visualize and build applications upon Linked Data. Here we see some typical
RDF data, with a subject, a predicate and an object.
Subject ----------> Object
(Predicate is defined in a template)

55
Callimachus can use that data to build complex Web pages.
Subject --Predicate--> ...
(Object gets filled in when template is evaluated)

56
It does this with a template language that is simply XHTML with RDFa markup. There are some extensions for syntactic
convenience.
Templates

• Written in XHTML+RDFa (declarative pattern);
• Parsed to create SPARQL queries;
• Query results are filled into the same template (see the sketch below).

57
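
A rough sketch of that idea, illustrative only and not Callimachus's actual parser: the declarative pattern corresponds to a SPARQL query whose results are filled back into the template.

```python
# Illustrative only: a declarative RDFa pattern corresponds to a SPARQL query whose
# results fill the page. This mimics the idea, not Callimachus's implementation.
from rdflib import Graph

TEMPLATE = '<div about="?facility"><h1 property="rdfs:label" /></div>'

# Roughly the query a template engine would derive from the pattern above
DERIVED_QUERY = """
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?facility ?label WHERE { ?facility rdfs:label ?label . }
"""

g = Graph()
g.parse("facilities.ttl", format="turtle")  # data from the earlier conversion sketch
for facility, label in g.query(DERIVED_QUERY):
    print(f'<div about="{facility}"><h1>{label}</h1></div>')  # results fill the template
```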
[Diagram: Callimachus request flow. An HTTP GET request arrives at the Web server controller, which looks up the resource and its class in the RDF store; the class's Viewable XHTML+RDFa template is turned into a SPARQL query, the RDF response fills the template via the template engine (apply.xsl), and the resulting HTML is returned in the HTTP response]

58

Callimachus is implemented as a Web MVC architecture with an underlying RDF DB. The
process shown demonstrates how a view is generated from a Web request.
Create/edit templates are HTML forms




                                                                                       59

Callimachus templates may also be used to create or edit RDF data by generating HTML
forms.
60

Callimachus provides a pseudo file system that is used to store and represent content,
including RDF/OWL data, named SPARQL queries, schemata, templates, etc. The pseudo file
system provides a common view of content that is abstracted from its actual storage location;
RDF data is stored in an RDF store whereas file-oriented content is stored in a BLOB store.
61

Documents, including data and ontologies, can be uploaded via drag-and-drop when using
an HTML5-compliant browser. File upload via a separate interface is available for older
browsers.
Linked Data management system located at a Tier 1 Cloud Provider (FISMA compliant)
[Diagram repeated from slide 32: an RDF database exposes Resource URIs, a REST API and a SPARQL endpoint; the public reaches it through a Web browser, while registered developers use applications, scripts or automated clients]

62

Users of Callimachus see a generated Web interface, but can also directly access the data via
REST or SPARQL. SPARQL Named Queries (like stored procedures) allow for automated
conversion to different formats for reuse in non-RDF environments.
63

Callimachus can associate SPARQL queries with URLs, so that they are executed when their
URL is resolved. We call these “named queries” and they are analogous to stored procedures
in a relational database. Named queries can accept parameters, which allows them to be a
very flexible way to manage routine access to queries that can drive visualizations.
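
A sketch of calling such a named query; the query URL and parameter name are hypothetical:

```python
# Resolving a named query URL with a parameter, like calling a stored procedure.
# The URL and parameter name are hypothetical.
import requests

resp = requests.get(
    "http://example.org/epa/queries/releases-by-year",
    params={"facility": "http://example.org/epa/facility/12345"},
    headers={"Accept": "text/csv"},  # named queries can also serve non-RDF formats
)
print(resp.text)
```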
64

The view of a named query displays its results. Results, like template results, are naturally
cached to increase performance.
65

The results of named queries may be formatted in a variety of ways, or arbitrarily
transformed via XProc pipelines and XSLT. This screenshot shows the results of a named
query being used to drive a Google Chart widget. Callimachus also has stock transforms for
d3 visualizations.
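
As a sketch of that last step (same hypothetical query URL as above), the SPARQL JSON results can be reshaped into the rows a charting widget expects:

```python
# Reshape named-query results (SPARQL JSON) into chart-ready rows.
# The query URL is the same hypothetical one used above.
import requests

resp = requests.get(
    "http://example.org/epa/queries/releases-by-year",
    params={"facility": "http://example.org/epa/facility/12345"},
    headers={"Accept": "application/sparql-results+json"},
)
bindings = resp.json()["results"]["bindings"]
rows = [[b["year"]["value"], float(b["amount"]["value"])] for b in bindings]
print(rows)  # feed these rows to a Google Chart or d3 visualization
```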
[Photo: papyrus rolls with leather tags holding metadata]

66
Credits

• David Newman. Gartner: “Innovation Insight: Linked Data Drives Innovation Through Information-Sharing Network Effects”, published 15 December 2011.
• David Wood, ed. Linking Government Data, Springer (2011). http://3roundstones.com/linking-government-data/
• US Executive Branch. Digital Government Strategy: Building a 21st Century Platform to Better Serve the American People. http://www.whitehouse.gov/sites/default/files/omb/egov/digital-government/digital-government.html
• W3C Linked Data Cookbook. http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook

All other photos and images © 2010-2012 3 Round Stones, Inc. and released under a CC-by-sa license.

67
This work is Copyright © 2011-2012 3 Round Stones Inc.
It is licensed under the Creative Commons Attribution 3.0 Unported License.
Full details at: http://creativecommons.org/licenses/by/3.0/

You are free:
  to Share — to copy, distribute and transmit the work
  to Remix — to adapt the work

Under the following conditions:
  Attribution. You must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work).
  Share Alike. If you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one.

68

This presentation is licensed under a Creative Commons BY-SA license, allowing you to share
and remix its contents as long as you give us attribution and share alike.

US EPA OSWER Linked Data Workshop 1-Feb-2013

  • 1.
    Linked Data atUS EPA 1-Feb-2013 Linked Data Workshop with EPA OSWER By Bernadette Hyland, David Wood & Luke Ruth 1 These slides will walk us through a common workflow use case comparing the Linked Data Service to Envirofacts.
  • 2.
    Agenda • Intros ... •Trends in data management • Government data publication • Update on EPA Linked Data Service • EPA OSWER moving towards Linked Data? • Review Next steps ... 2
  • 3.
    Trends in government data management 3
  • 4.
    4 Headlines and agencymemos about government transparency with open data and various government Web sites. ... innovation challenges based on open government data ... High energy datapalooza’s are emerging with awards ranging from a couple thousand to $100k+. These challenges open the doors to innovation for better healthcare solutions and more efficient use of energy, to name but a few. They all require access to and re-use of HIGH QUALITY DATA. In 2012, we read many headlines about big data and world’s search engines and social media sites.
  • 5.
    Photo credit: http://www.flickr.com/photos/glennharper/4452247708/ 5 5 However, while there is lots of gold to be mined from public data, it is an uncomfortable time for Government IT and business managers who are tasked with data management programs. Most people are having a difficult time keeping up. If you feel like you are hanging on while the world changes too fast, you are not alone. Photo credit: http://www.flickr.com/photos/glennharper/4452247708/
  • 6.
    6 Who is sharingtheir data as Linked Data? Small and large commercial and government organizations, NGOs, Non-profits ... plus many universities. Governments in the last few years have been responding to Open Government initiatives that mandate publishing open government data. Some are careful, slow-moving entities who simply needed to find real solutions to real problems.
  • 7.
    Governments Goals: Governmental transparencyand/or improved internal efficiencies (data warehouses) 7
  • 8.
    Big Data Simple data Complex data Legacy data 8 KEY POINT: Search, discovery and data access approaches have evolved over the last decade and techniques are beginning to come together. GoPubMed was launched in 2002 as the first semantic search portal. Later, Microsoft’s Bing, Google’s Knowledge Graph are two of the other well known search engines employing semantic techniques. Semantic search systems generally considers the context of search, location, intent, variation of words, synonyms and concepts. Semantic search has roots in linguistic research and NLP. Big data research has grown to include the MapReduce algorithm for handling really large data sets, often measured in terabytes or greater. This is the kind of data that people at the Large Hadron Collider at CERN are working on to provide insights into how the universe works, including the recent discovery of the Higgs Boson, the particle that gives mass to matter. Under the big top tent of semantic search we’re dealing with different types of content, big, public, complex and legacy data. Simple, complex and legacy data comes in small, medium and large sizes. Many government agencies by contrast have lots of small to medium data sets in structured databases, like Oracle. These databases (and the systems that depend upon them) are not going away however fewer new data warehouse projects are likely to be started. Data warehouses are widely recognized to be costly to create and maintain, and change SLOWLY. The biggest win for governments worldwide who adopt a Web architecture for data publishing is combining data sets to discover new or previously uncontemplated relationships.
  • 9.
    “Big Data IsImportant, but Open Data Is More Valuable” As change agents, enterprise architects can help their organizations become richer through strategies such as open data. David Newman, VP Research, Gartner 9 Open data refers to the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other forms of control. The term “open data” has gained popularity with open data initiatives including data.gov.uk, data.gov and other government data catalog sites. Enterprise architects are playing an important role in fostering information-sharing practices. Access to, and use of, open data will be particularly critical for a business that operate using the Web; organizations should focus on using open data to enhance business practices that generate growth and innovation.
  • 10.
    Open data +open standards + open platforms Highly scalable computing & hosting via the Cloud International Data Exchange Standards 5 Star Data (Linked Data) Open Source tools 10 A Web-oriented approach to information sharing has impacted how scientists, researchers, regulators and the public interacts with government. Linked data lowers the barriers to re-use and interoperability among multiple, distributed and heterogeneous data sources. Access to high-quality Linked Open Data via the Web means millions of researchers and developers will be able to shorten the time-consuming research process involving data cleansing and modeling.
  • 11.
    11 How do weget a loose coupling of shared data over Web architectures? By using the structured data model for the Web: RDF. There is a project to create freely available data on the Web in this way, which is known as the Linked Open Data project. W3C sees Linked Data as the set of best practices and technologies to support worldwide data access, integration and creative re-use of authoritative data.
  • 12.
  • 13.
    The mission ofthe Government Linked Data (GLD) Working Group is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies. 13 We are 16 months into the Government Linked Data Working group’s two year charter.
  • 14.
    14 A sound governmentinformation management strategy requires providing CONTEXT and CONFIDENCE to those accessing and potentially re-using your data. Giving people have timely access to information, for disaster preparedness, scientific research, policy and research, the network effect of people helping people is our greatest hope. On the heels of the recent East Coast hurricane that devastated parts of New York and New Jersey, government executives suggested that fear of cyber-doom scenarios may be taking too much of our thinking & planning. According to Secretary Panetta, it may be driving us to unrealistic and potentially dangerous responses to threats that don’t exist. The reality is that when disaster strike, people come together and help one another. We don’t see paralysis, panic and social collapse. During today’s session, I’ll describe how several agencies and private sector organizations are using Web technologies and semantics to improve information access and discovery. Simply put, semantic technologies provide CONTEXT.
  • 15.
  • 16.
    Growing chorus ... “We’re moving from managing documents to managing discrete pieces of open data and content which can be tagged, shared, secured, mashed up and presented in the way that is most useful for the consumer of that information.” -- Report on Digital Government: Building a 21st Century Platform to Better Serve the American People 16 The Digital Government Strategy sets out to accomplish three things: Access to high quality digital information & services; procure and manage devices, applications, and data in smart, secure and affordable ways; and unlock the power of government data to spur innovation. Governments around the world are defining detailed digital services plans based on open data, open APIs and open source data platforms. They are defining how governments are publishing data with an eye towards improving access and re-use. Administrators and program managers are committing to delivery of digital services using semantic technologies broadly, and Linked Data specifically.
  • 17.
    Big data Integrating ... • Simple data • Complex data • Legacy data 17 We need to find ways to fit things together that wasn’t originally intended to fit together. NB: This is the Musée du Louvre which has evolved from a late 12th Century fortress under Phillip II, extended over centuries to incorporate the landmark Inverted Pyramid architected by I.M. Pei that was completed in 1993. A recent competition to house its new galleries for Islamic art opened this year, 2012. It continues to accommodate new works for art & galleries in new & previously unanticipated ways. Today, we need to understand the context of big data + complex data + public data and legacy data into one consistent whole.
  • 18.
    18 September 2011: 295datasets that meet the LOD Cloud criteria, consisting of over 31 billion RDF triples and are interlinked by around 504 million links.
  • 19.
    THERE IS APROCESS Identify Model Name Describe Convert Publish Maintain 19 Take comfort in the fact that there is a familiar process. It is similar to the process & roles of traditional data modeling. Creating Linked Data requires that we identify the data, model exemplar records -- what you are going to carry forward & what you are going to leave behind. Name all of the NOUNs. Turn the records into URIs. Next, describe RESOURCES with vocabularies. Write a script or process to convert from canonical form to RDF. Then publish. Maintain over time.
  • 20.
    3 Round Stonesproduces the leading platform for the publication of reusable data on the Web. Our commercially supported Open Source platform is used by the Fortune 2000 and US Government agencies to collect, publish and reuse data, both on the public Internet and behind institutional firewalls. 20 Our goal is to produce the leading platform for the publication of reusable data on the Web.
  • 21.
    Callimachus http://callimachusproject.org http://3roundstones.com 21 Callimachus is that platform. It is available via 3roundstones.com or its Open Source site callimachusproject.org.
  • 22.
    CONTENT LINKED DATA MANAGEMENT MANAGEMENT SYSTEM SYSTEM DATA TEXT UNSTRUCTURED Callimachus STRUCTURED DATA TEXT 22 Callimachus may be compared to a distributed CMS. CMS’s manage mostly unstructured information. Callimachus, by contrast to a CMS, manages primarily structured Linked Data. We call this a Linked Data management system.
  • 23.
    23 Callimachus started in2009 as a simple online RDF editor. Users could fill out HTML forms, which would create RDF behind the scenes. The resulting RDF could be viewed as HTML and, of course, shared and combined with other RDF data. Since then HTML5 has allowed us to hack the browser environment much less and extend its capabilities.
  • 24.
    Data driven Webapps using Callimachus US Legislation + enterprise data Clinical Trials + DBpedia + enterprise linked enterprise datasets data 24 24 Callimachus integrates (very) well with other enterprise systems as well as Web content. It can form an entire application or part of one. NB: Mention Documentum, Oracle via HTTP
  • 25.
    25 • US HHS committed to making a vast array of open data more readily available to improve health care delivery & reduce costs in 2013 and beyond. • In 2012, Sentara created a Web application that integrates authoritative data from 5 different sources including content from NLM, NOAA, EPA and DBpedia • This application utilizes open data, open standards and an open source data platform
  • 26.
    User US EPA US EPA NOAA AirNow SunWise National DBpedia Library of Medicine 26
  • 27.
    US EPA LinkedData • Cloud-based Linked Data provision of 3 core programs: • 2.9M Facilities • 100K substances • 25 years of toxic pollution reports • FISMA compliant • 16 Callimachus templates • Official launch Feb 2013 27
  • 28.
  • 29.
    29 EPA’s new LinkedData system. Cooperation without coordination. Data reuse breaks the back of API gridlock. Clay Shirky stole that from me :)
  • 30.
    30 This data isexactly the same data used to create the interface. Unlike traditional database-driven applications, the data is immediately accessible for reuse by third parties. This prevents data duplication, allows for tracking of provenance and avoids reinventing the wheel.
  • 31.
    We’ve Seen ThisBefore 31 Like HTML and RDF, credit cards have a human-readable side and a machine-readable side.
  • 32.
    Linked Data managementsystem located at a Tier 1 Cloud Provider (FISMA compliant) RDF Database Resource URIs REST API SPARQL endpoint Public Web Browser Application, Script or automated client Registered developer 32 Introduce Callimachus, an open source, open data platform based on open standards. 3 Round Stones provides commercial support for Callimachus and is a major contributor to the OS project. Users of Callimachus see a generated Web interface, but can also directly access the data via REST or SPARQL. SPARQL Named Queries (like stored procedures) allow for automated conversion to different formats for reuse in non-RDF environments.
  • 33.
    From EPA From Wikipedia Open Street Map 33 Data may be easily combined from several sources.
  • 34.
    HOW IT ISDONE TODAY ... 34
  • 35.
    Audience for EPAData • Middle school student doing a science project • Concerned citizen worried about local pollution • Environmental Science PhD from EPA • Doctor from NIH writing a research paper 35 To try and understand the advantages and disadvantages of both systems, we need to know the audience for the system. That presents a problem though. It’s nearly impossible to know your audience at any given moment. Even if it were possible, the audience is so varying that it would be unwise to cater to a single group at the expense of another. The audience could be a middle school student, a concerned local citizen, a PhD collecting information for a report, or a doctor writing a research paper. We just don’t know. For example, if the system was designed to accommodate a 6th grader, the system would be over simplified and thin. If it was designed with a PhD or a doctor in mind, the average citizen could be overwhelmed and find the system complicated and verbose. That’s why the goal should be to make the simple things easy and the complicated things possible.
  • 36.
    How much mercurydid Hanson Permanente Cement release in 2004? 36 With that in mind, let’s walk through our example trying to keep all our audience members in mind. Let’s pretend our audience members live in Cupertino, California and want to know about the local Cement plant. The question is - How much mercury did Hanson Permanente Cement release in 2004?
  • 37.
    37 The process startsout much the same. The user enters their zip code into the search field, which in this case is 95014, and are presented with the results. From here though, the workflow is quite different.
  • 38.
  • 39.
    39 Rather than immediatelyreturning a list of the facilities in that zip code, Envirofacts gives users the option to verify their location, drag, drop, and resize a map to match their request. While this provides a high level of granularity it is making a simple thing harder than it needs to be for a lot of users. While some people may know the physical boundary of their zip code, the average user would most likely trust the application to take care of that.
  • 40.
    40 Envirofacts then returnsa page that looks like this (with some rows obviously cut from the table). It’s great that right up front, the user is able to both see the results and have access to the data. There is a link they can paste into their browser and get the raw data as well as a button to click that will download the results as CSV. The big problem here though is that the data comes back with formatting that renders it nearly unusable and content that adds very little value above what is present on the screen. Copying and pasting the link at the top of the screen gives us the following data:
  • 41.
    41 It comes backin what *appears* to be CSV. However, there is no actual indication of that. There are no column headers, no descriptions of what fields represent, the text is difficult to partition and understand, and there are no links to documentation on the use of the API or how the query is structured. All we know is that it is data from “this report”. The other link - the CSV Table link - returns the following data:
  • 42.
    42 Having the tableas CSV is theoretically useful but it too comes back without the necessary structure to understand and use the data. This time there are column headers but it is unclear what exactly they represent. Many of the fields are just the URL links found in the HTML table from the website. There doesn’t appear to be raw data - just links to where the raw data could potentially be found. Moving on from accessing the data, we will try to find the facility we are looking for - Hanson Permanente.
  • 43.
    Finding Hanson Permanente 43 Finding Hanson Permanente can only be done in this table by scrolling down and finding "Hanson" in the alphabetically sorted list of facilities. There are no options to sort on the contents of another column or search within the table. The unfortunate side effect of this is that by the time you scroll down you can no longer see the column headers. Simply looking at this screen, you can see that there are 8 separate reports that can be viewed, but it is unclear how they are differentiated and what each contains. The key here is that the data reflects internal EPA systems - which are unknown to the majority of users. By doing this, Envirofacts is implicitly asking users to become expert on internal EPA systems which they either are not capable of, or do not have the time for.
  • 44.
    Finding Mercury Releasedin 2004 44 Because most users do not have this knowledge, the first report they'll most likely click is the Summary Report. The “Summary Report” brings us to a long page where after quite a bit of scrolling we can see the Toxic Releases for 2011. However, unlike the previous search results, this data is not available for download or retrieval by any means other than screen-scraping or re-keying. It is also a limited dataset and does not have the data for 2004.
  • 45.
    Compliance Report 45 The Summary, Facility, AFS, BR, RCRA, TRI, TSCA Reports at their top level do not have the data about Mercury either. It is actually contained in the Compliance Report. However, like the other tables, there is no way to download this data and repurpose it for other applications. The other source of confusion is that this data can be found in multiple places depending on its originating report, and it can be unclear whether the data is in fact the same. For example, this data can also be found by drilling down in the TRI Report by clicking: “View Report”->”P2 Report (Report)”->”P2 Report”-> and then manipulating the view based on the year and view you want. These graphs and charts ultimately contain very interesting and relevant data but they are so obscured and inaccessible that it becomes extremely difficult to create anything new.
Potential Audience: ✗ Middle school student doing a science project • ✗ Concerned citizen worried about local pollution • ✔ Environmental Science PhD from EPA • ✗ Doctor from NIH writing a research paper 46 Who did we cater to? The middle school student? Probably not. The concerned citizen? Unless that citizen happens to have specific knowledge of the EPA system and a great deal of experience navigating technology, most likely not. What about the Environmental Science PhD and the doctor from NIH? They may understand column names, chemical compounds, and reporting a bit better, but only the Environmental Science PhD, with a working knowledge of EPA's system, can determine enough to make use of it. The doctor, on the other hand, is still working against the system itself to find the data behind it.
    Linked Data 47 Now let’s look at the same workflow in the Linked Data Service.
Finding Hanson Permanente 48 By keeping the application simple - and letting the results be viewed either as a table or a map - the user can adjust their search as they see fit without extra navigation. Because the data sits in a table that can be searched or sorted however the user likes, finding a specific facility is as easy as typing the name or sorting on relevant criteria. This is made possible by exposing the data, rather than containing it in a standard HTML table. I fully recognize that Envirofacts could offer identical functionality by tweaking their application, but the key underlying point is that this application was created very cheaply and quickly *because* the data is modeled as Linked Data. When the development environment is a Web browser and the data is described and linked, an application can be a simple XHTML page with JavaScript (a sketch follows), instead of a heavyweight dedicated application.
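As a hedged illustration of what "a simple XHTML page with JavaScript" can mean here, this minimal sketch queries a SPARQL endpoint and filters the results client-side. The endpoint URL is a placeholder, and the query uses only the generic rdfs:label property, not the EPA Linked Data Service's actual vocabulary:

    // Placeholder endpoint - not the actual EPA Linked Data Service URL.
    const ENDPOINT = "https://example.gov/sparql";

    // Generic query using only rdfs:label; the real service's facility
    // vocabulary would be substituted here.
    const query = `
      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
      SELECT ?facility ?label WHERE {
        ?facility rdfs:label ?label .
      } LIMIT 100`;

    async function findFacilities(nameFilter) {
      const res = await fetch(ENDPOINT + "?query=" + encodeURIComponent(query), {
        headers: { Accept: "application/sparql-results+json" }
      });
      const json = await res.json();
      // Typing "Hanson" narrows the table client-side, with no extra
      // navigation and no knowledge of internal EPA systems.
      return json.results.bindings
        .filter(b => b.label.value.toLowerCase().includes(nameFilter.toLowerCase()))
        .map(b => ({ uri: b.facility.value, label: b.label.value }));
    }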
Finding Mercury Released in 2004 (callouts 1 and 2) 49 There are two very important things to note on this page. The first is that on any facility's page there is always an option to download the data, in two formats (RDF/XML and Turtle). With the click of a button a user can have all of the data that was used to drive the creation of the current page, which means he or she can repurpose that data into any new application. Note that this download is not an extract, summary, or recreation of the data - it is literally the *same* data that drove the page. The second is that because this page is "data-driven", navigation relies on exploring the data, not the system that contains it. On the same page where we find information like the facility's latitude and longitude, we can also find a link to a report detailing exactly how much mercury was released in 2004; we could easily do an in-page search for "2004" or "Mercury" to identify the releases associated with those terms.
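That "same data" claim is exactly what HTTP content negotiation provides. A minimal, hedged sketch - the facility URI below is a placeholder, not a real EPA resource URI:

    // Placeholder facility URI - not a real EPA resource URI.
    const facility = "https://example.gov/id/facility/hanson-permanente";

    // Ask the server for the same underlying data in the format we want.
    async function downloadAs(mediaType) {
      const res = await fetch(facility, { headers: { Accept: mediaType } });
      return res.text(); // raw RDF, ready for any RDF tool
    }

    // downloadAs("text/turtle");          // Turtle, as on the slide
    // downloadAs("application/rdf+xml");  // RDF/XML, as on the slide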
TRI Report 50 Rather than aggregating the data for presentation, the actual report is presented, with the raw data continuously available at the top right of the page. A subtle difference worth pointing out here is the name of the facility: previously it was identified as Hanson Permanente, but now it is known as Lehigh Southwest Cement Co. During the modeling phase, the Linked Data was created to implicitly include this relationship (which is known via the mapping of EPA FRS identifiers). Pulling down the CSV files, on the other hand, would give the user no obvious way of discovering it.
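One hedged illustration of what carrying that relationship in the data itself might look like, with invented prefix and property names - the actual EPA FRS modeling will differ:

    // Turtle held in a string purely for illustration; the vocabulary
    // below is assumed, not EPA's actual FRS schema.
    const sample = `
      @prefix ex: <http://example.org/frs/> .
      ex:facility-110000123456              # hypothetical FRS identifier
        ex:facilityName    "Lehigh Southwest Cement Co" ;
        ex:alternativeName "Hanson Permanente" .`;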
Data Reuse 51 Lastly, giving users the ability to grab the data off any page, at any time during navigation, strongly facilitates reuse. These graphs are not natively embedded in the webpage of a given facility; rather, by downloading the data the user can quickly and easily make new and different visualizations for a report or presentation. For example, this history of air stack pollution reports was made with a single parameterized SPARQL query and a single JavaScript pattern (sketched below). It could very easily be applied to any number of facilities, changed to a bar graph, or altered in any number of other ways with very little effort, thanks to the fact that the data is modeled as Linked Data.
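A hedged sketch of that "one query, one pattern" idea; the property URIs are invented for illustration and the real EPA vocabulary will differ:

    // Invented property URIs - the real EPA vocabulary will differ.
    function releaseHistoryQuery(facilityUri) {
      return `
        SELECT ?year ?amount WHERE {
          <${facilityUri}> <http://example.org/vocab/hasRelease> ?r .
          ?r <http://example.org/vocab/year>   ?year ;
             <http://example.org/vocab/amount> ?amount .
        } ORDER BY ?year`;
    }

    // The single reusable pattern: run the query for any facility and
    // hand the rows to whichever chart-drawing function you like.
    async function plotFacility(endpoint, facilityUri, drawChart) {
      const res = await fetch(endpoint + "?query=" +
          encodeURIComponent(releaseHistoryQuery(facilityUri)), {
        headers: { Accept: "application/sparql-results+json" }
      });
      const rows = (await res.json()).results.bindings
        .map(b => [Number(b.year.value), Number(b.amount.value)]);
      drawChart(rows); // line graph, bar graph - only this callback changes
    }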
Potential Audience: ✔ Middle school student doing a science project • ✔ Concerned citizen worried about local pollution • ✔ Environmental Science PhD from EPA • ✔ Doctor from NIH writing a research paper 52 Linked Data allowed us to reach every member of our potential audience by giving the user options, aggregating based on relevance rather than data source, and exposing the data that drives the service for reuse. The middle school student or concerned citizen who wants to know the location of a facility, the amount of a particular chemical it released, and the year it was released never has to click any of the options in the Linked Data box; they can simply use the interface, explore the data, and find what they need in a read-only experience. The Environmental Science PhD is still able to find what they are looking for, but can do so in a much more intuitive way. The doctor from NIH is now able to find the data they're interested in and, if they choose to take the next step, download the actual data behind the page. By quickly and easily obtaining the raw data, anyone from scientists to journalists can generate their own applications without any knowledge of the Linked Data Service itself.
    Subject Predicate Object 54 The heart of Callimachus is a template engine used to navigate, visualize and build applications upon Linked Data. Here we see some typical RDF data, with a subject, a predicate and an object.
    Subject Object (Predicate is defined in a template) 55 Callimachus can use that data to build complex Web pages.
    Subject Predicate (Object gets filled in when template is evaluated) 56 It does this with a template language that is simply XHTML with RDFa markup. There are some extensions for syntactic convenience.
Templates • Written in XHTML+RDFa (declarative pattern); • Parsed to create SPARQL queries; • Query results are filled into the same template (see the sketch below). 57
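A conceptual, hedged sketch of that parse-query-fill cycle in JavaScript. This is not Callimachus's actual implementation, just the shape of the idea; prefix handling and nested patterns are omitted:

    // A toy template: an element whose RDFa attributes describe the data.
    //   <li resource="?facility"><span property="rdfs:label"></span></li>
    function templateToQuery(li) {
      const subjectVar = li.getAttribute("resource");            // "?facility"
      const prop = li.querySelector("[property]")
                     .getAttribute("property");                  // "rdfs:label"
      // Step 1: read the declarative pattern as a SPARQL query.
      return `SELECT ${subjectVar} ?value WHERE { ${subjectVar} ${prop} ?value . }`;
    }

    function fillTemplate(li, bindings) {
      // Step 2: fill the query results back into copies of the same
      // template, one clone per result row.
      return bindings.map(b => {
        const clone = li.cloneNode(true);
        clone.setAttribute("resource", b.facility.value);
        clone.querySelector("[property]").textContent = b.value.value;
        return clone;
      });
    }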
[Architecture diagram: an HTTP GET request for a viewable resource arrives at the Web server; the Controller issues a SPARQL query against the RDF Store, the Template Engine applies the matching XHTML+RDFa class template (via apply.xsl) to the RDF/XML results, and the HTTP response returns HTML.] 58 Callimachus is implemented as a Web MVC architecture with an underlying RDF DB. The process shown demonstrates how a view is generated from a Web request.
Create/edit templates are HTML forms 59 Callimachus templates may also be used to create or edit RDF data by generating HTML forms.
60 Callimachus provides a pseudo file system that is used to store and represent content, including RDF/OWL data, named SPARQL queries, schemata, templates, etc. The pseudo file system provides a common view of content that is abstracted from its actual storage location; RDF data is stored in an RDF store, whereas file-oriented content is stored in a BLOB store.
61 Documents, including data and ontologies, can be uploaded via drag-and-drop when using an HTML5-compliant browser. File upload via a separate interface is available for older browsers.
Linked Data management system located at a Tier 1 Cloud Provider (FISMA compliant). [Diagram: an RDF Database exposed through resource URIs, a REST API and a SPARQL endpoint; accessed by the public via a Web browser, by applications, scripts or automated clients, and by registered developers.] 62 Users of Callimachus see a generated Web interface, but can also directly access the data via REST or SPARQL. SPARQL named queries (like stored procedures) allow for automated conversion to different formats for reuse in non-RDF environments; a sketch of such a non-RDF consumer follows.
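For example, a SPARQL 1.1 endpoint can hand results straight to a spreadsheet-oriented consumer as CSV. A minimal, hedged sketch; the endpoint URL is assumed:

    // Placeholder endpoint; any SPARQL 1.1 endpoint supporting the
    // standard CSV results format would behave the same way.
    async function resultsAsCsv(endpoint, query) {
      const res = await fetch(endpoint + "?query=" + encodeURIComponent(query), {
        headers: { Accept: "text/csv" } // SPARQL 1.1 Query Results CSV
      });
      return res.text(); // plain CSV - no RDF tooling required downstream
    }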
63 Callimachus can associate SPARQL queries with URLs, so that they are executed when their URL is resolved. We call these "named queries"; they are analogous to stored procedures in a relational database. Named queries can accept parameters, which makes them a very flexible way to manage routine access to queries that can drive visualizations.
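A hedged sketch of invoking such a named query from a client; the URL and parameter name below are hypothetical, not actual EPA or Callimachus paths:

    // Hypothetical named-query URL and parameter name.
    async function airReleases(facilityId) {
      const url = "https://example.gov/queries/air-releases" +
                  "?facility=" + encodeURIComponent(facilityId);
      const res = await fetch(url, {
        headers: { Accept: "application/sparql-results+json" }
      });
      // Resolving the URL runs the stored query server-side, like
      // calling a stored procedure.
      return (await res.json()).results.bindings;
    }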
64 The view of a named query displays its results. Results, like template results, are naturally cached to increase performance.
65 The results of named queries may be formatted in a variety of ways, or arbitrarily transformed via XProc pipelines and XSLT. This screenshot shows the results of a named query being used to drive a Google Chart widget; Callimachus also has stock transforms for d3 visualizations.
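As a hedged illustration of the client side of that screenshot, here is a minimal sketch wiring named-query results into a Google Chart. The query URL and its parameter are placeholders, and the page is assumed to already include the Google Charts loader and a div with id "chart":

    // Assumes <script src="https://www.gstatic.com/charts/loader.js">
    // is already on the page.
    google.charts.load("current", { packages: ["corechart"] });
    google.charts.setOnLoadCallback(async () => {
      // Placeholder named-query URL and parameter.
      const url = "https://example.gov/queries/releases-by-year" +
                  "?facility=" + encodeURIComponent("example-facility");
      const res = await fetch(url, {
        headers: { Accept: "application/sparql-results+json" }
      });
      const rows = (await res.json()).results.bindings
        .map(b => [b.year.value, Number(b.amount.value)]);
      const data = google.visualization.arrayToDataTable(
        [["Year", "Amount"]].concat(rows));
      new google.visualization.LineChart(document.getElementById("chart"))
        .draw(data, { title: "Releases by year" });
    });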
Leather tags holding metadata; papyrus rolls. 66
Credits
Gartner, "Innovation Insight: Linked Data Drives Innovation Through Information-Sharing Network Effects", David Newman, published 15 December 2011.
Linking Government Data, Springer (2011), David Wood, ed. http://3roundstones.com/linking-government-data/
Digital Government Strategy: Building a 21st Century Platform to Better Serve the American People, US Executive Branch. http://www.whitehouse.gov/sites/default/files/omb/egov/digital-government/digital-government.html
W3C Linked Data Cookbook. http://www.w3.org/2011/gld/wiki/Linked_Data_Cookbook
All other photos and images © 2010-2012 3 Round Stones, Inc., released under a CC BY-SA license. 67
This work is Copyright © 2011-2012 3 Round Stones Inc. It is licensed under the Creative Commons Attribution 3.0 Unported License; full details at http://creativecommons.org/licenses/by/3.0/ You are free: to Share - to copy, distribute and transmit the work; to Remix - to adapt the work. Under the following conditions: Attribution - you must attribute the work in the manner specified by the author or licensor (but not in any way that suggests that they endorse you or your use of the work). Share Alike - if you alter, transform, or build upon this work, you may distribute the resulting work only under the same or similar license to this one. 68 This presentation is licensed under a Creative Commons BY-SA license, allowing you to share and remix its contents as long as you give us attribution and share alike.