US Government Linked Data
Bernadette Hyland, CEO; co-chair, W3C Government Linked Data WG
email@example.com @BernHyland
NARA II - College Park MD, 07 February 2013
Agenda
• Intros ...
• Trends in data management
• Government data publication
• Update on new Linked Data Services
3 Round Stones produces the leading platform for the publication of data on the Web. Our commercially supported Open Source platform is used by the Fortune 2000 and US Government agencies to collect, publish and reuse data, both on the public Internet and behind institutional firewalls.
Our Partners
Callimachus
Our partners ... Our customers: 50% US Gov’t and 50% private sector, focused on pharma & health delivery, and business publishing.
Headlines and agency memos about government transparency with open data and various government Websites ... innovation challenges based on open government data ... High-energy datapaloozas are emerging with awards ranging from a couple thousand dollars to $100k+. These challenges open the doors to innovation for better healthcare solutions and more efficient use of energy, to name but a few. They all require access to and re-use of HIGH QUALITY DATA.
In 2012, we read many headlines about big data and the world’s search engines and social media sites.
Who is sharing their data as Linked Data? Small and large commercial and government organizations, NGOs, non-profits ... plus many universities.
Governments in the last few years have been responding to Open Government initiatives that mandate publishing open government data.
Some are careful, slow-moving entities who simply needed to find real solutions to real problems.
Photo credit: http://www.flickr.com/photos/glennharper/4452247708/
However, while there is lots of gold to be mined from public data, it is an uncomfortable time for Government IT and business managers who are tasked with data management programs.
Most people are having a difficult time keeping up. If you feel like you are hanging on while the world changes too fast, you are not alone.
Linked Data is used extensively by the UK Government, which is widely seen as the global leader in data transparency. This is their home page.
Big Data / Simple data / Complex data / Legacy data
KEY POINT: Search, discovery and data access approaches have evolved over the last decade, and the techniques are beginning to come together. GoPubMed was launched in 2002 as the first semantic search portal. Later, Microsoft’s Bing and Google’s Knowledge Graph became two of the other well-known examples of search employing semantic techniques.
Big data research has grown to include the MapReduce programming model for handling really large data sets, often measured in terabytes or greater. This is the kind of data that people at the Large Hadron Collider at CERN are working on to provide insights into how the universe works, including the recent discovery of the Higgs boson, the particle associated with the mechanism that gives mass to matter.
Under the big top tent of semantic search we’re dealing with different types of content: big, public, complex and legacy data. Simple, complex and legacy data come in small, medium and large sizes.
Many government agencies, by contrast, have lots of small to medium data sets in structured databases. These databases (and the systems that depend upon them) are not going away; however, fewer new data warehouse projects are likely to be started. Data warehouses are widely recognized to be costly to create and maintain, and they change SLOWLY.
The biggest win for governments worldwide who adopt a Web architecture for data publishing is combining data sets to discover new or previously uncontemplated relationships.
“Big Data Is Important, but Open Data Is More Valuable”
As change agents, enterprise architects can help their organizations become richer through strategies such as open data.
David Newman, VP Research, Gartner
Open data refers to the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other forms of control.
The term “open data” has gained popularity with open data initiatives including data.gov.uk, data.gov and other government data catalog sites.
Enterprise architects are playing an important role in fostering information-sharing practices. Access to, and use of, open data will be particularly critical for businesses that operate on the Web; organizations should focus on using open data to enhance business practices that generate growth and innovation.
A sound government information management strategy requires providing CONTEXT and CONFIDENCE to those accessing and potentially re-using your data.
When people have timely access to information for disaster preparedness, scientific research and policy, the network effect of people helping people is our greatest hope.
On the heels of the recent East Coast hurricane that devastated parts of New York and New Jersey, government executives suggested that fear of cyber-doom scenarios may be taking up too much of our thinking & planning. According to Secretary Panetta, it may be driving us to unrealistic and potentially dangerous responses to threats that don’t exist.
The reality is that when disaster strikes, people come together and help one another. We don’t see paralysis, panic and social collapse.
During today’s session, I’ll describe how several agencies and private sector organizations are using Web technologies and semantics to improve information access and discovery. Simply put, semantic technologies provide CONTEXT.
Growing chorus ...
“We’re moving from managing documents to managing discrete pieces of open data and content which can be tagged, shared, secured, mashed up and presented in the way that is most useful for the consumer of that information.” -- Report on Digital Government: Building a 21st Century Platform to Better Serve the American People
The Digital Government Strategy sets out to accomplish three things: enable access to high-quality digital information & services; procure and manage devices, applications, and data in smart, secure and affordable ways; and unlock the power of government data to spur innovation.
Governments around the world are defining detailed digital services plans based on open data, open APIs and open source data platforms. They are defining how governments publish data with an eye towards improving access and re-use. Administrators and program managers are committing to delivery of digital services using semantic technologies broadly, and Linked Data specifically.
Open data + open standards + open platforms
• Highly scalable computing & hosting via the Cloud
• International Data Exchange Standards
• 5 Star Data (Linked Data)
• Open Source tools
A Web-oriented approach to information sharing has changed how scientists, researchers, regulators and the public interact with government.
Linked Data lowers the barriers to re-use and interoperability among multiple, distributed and heterogeneous data sources.
Access to high-quality Linked Open Data via the Web means millions of researchers and developers will be able to shorten the time-consuming research process involving data cleansing and modeling.
How do we get a loose coupling of shared data over Web architectures? By using the structured data model for the Web: RDF.
There is a project to create freely available data on the Web in this way, which is known as the Linked Open Data project.
W3C sees Linked Data as the set of best practices and technologies to support worldwide data access, integration and creative re-use of authoritative data.
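As a rough illustration of the RDF data model mentioned above, the sketch below models statements as subject-predicate-object triples and prints them in N-Triples syntax. It uses only the Python standard library; the example.org URIs and the label text are hypothetical, and real projects would normally use an RDF library rather than hand-rolled serialization.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str             # URI of the thing being described
    predicate: str           # URI of the property
    obj: str                 # a URI, or a plain literal value
    obj_is_uri: bool = True  # False when obj is a literal


def to_ntriples(t: Triple) -> str:
    """Serialize one triple as a single N-Triples line."""
    if t.obj_is_uri:
        o = f"<{t.obj}>"
    else:
        # Escape backslashes and double quotes inside the literal.
        escaped = t.obj.replace("\\", "\\\\").replace('"', '\\"')
        o = f'"{escaped}"'
    return f"<{t.subject}> <{t.predicate}> {o} ."


# Hypothetical URIs, used for illustration only.
FACILITY = "http://example.org/facility/123"
OTHER = "http://example.org/other-dataset/plant-123"

triples = [
    Triple(FACILITY, "http://www.w3.org/2000/01/rdf-schema#label",
           "Example Cement Plant", obj_is_uri=False),
    # A link into another dataset is what makes the data "linked".
    Triple(FACILITY, "http://www.w3.org/2002/07/owl#sameAs", OTHER),
]

ntriples = [to_ntriples(t) for t in triples]
print("\n".join(ntriples))
```

Because every subject and predicate is a Web address, two publishers who use the same URIs describe the same thing without coordinating in advance, which is the loose coupling the slide refers to.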
September 2011: 295 datasets meet the LOD Cloud criteria. Together they consist of over 31 billion RDF triples and are interlinked by around 504 million links.
Callimachus
http://callimachusproject.org
http://3roundstones.com
Callimachus is that platform. It is available via 3roundstones.com or its Open Source site, callimachusproject.org.
[Diagram: a Content Management System (unstructured text and data) compared with a Linked Data Management System (structured data and text), with Callimachus as the latter]
Callimachus may be compared to a distributed CMS. CMSs manage mostly unstructured information. Callimachus, by contrast, manages primarily structured Linked Data. We call this a Linked Data management system.
Data driven Web apps using Callimachus
[Screenshots of example apps combining US legislation, clinical trials and DBpedia with enterprise data and linked enterprise datasets]
Callimachus integrates (very) well with other enterprise systems as well as Web content. It can form an entire application or part of one.
NB: Mention Documentum, Oracle via HTTP
• US HHS committed to making a vast array of open data more readily available to improve health care delivery & reduce costs in 2013 and beyond.
• In 2012, Sentara created a Web application that integrates authoritative data from 5 different sources, including content from NLM, NOAA, EPA and DBpedia.
• This application utilizes open data, open standards and an open source data platform.
[Diagram: the user's application draws on US EPA AirNow, US EPA SunWise, NOAA, DBpedia and the National Library of Medicine]
US EPA Linked Data
• Cloud-based Linked Data provision of 3 core programs:
  • 2.9M facilities
  • 100K substances
  • 25 years of toxic pollution reports
• FISMA compliant
• 16 Callimachus templates
• Official launch March 2013
EPA’s new Linked Data system. Cooperation without coordination. Data reuse breaks the back of API gridlock.
Clay Shirky stole that from me :)
This data is exactly the same data used to create the interface. Unlike traditional database-driven applications, the data is immediately accessible for reuse by third parties. This prevents data duplication, allows for tracking of provenance and avoids reinventing the wheel.
We’ve Seen This Before
Like HTML and RDF, credit cards have a human-readable side and a machine-readable side.
[Diagram: a Linked Data management system located at a Tier 1 Cloud Provider (FISMA compliant), containing an RDF database and exposing resource URIs, a REST API and a SPARQL endpoint; clients are a public Web browser, an application, script or automated client, and a registered developer]
Introduce Callimachus, an open source, open data platform based on open standards.
3 Round Stones provides commercial support for Callimachus and is a major contributor to the OS project.
Users of Callimachus see a generated Web interface, but can also directly access the data via REST or SPARQL.
SPARQL Named Queries (like stored procedures) allow for automated conversion to different formats for reuse in non-RDF environments.
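To make the "application, script or automated client" path more concrete, here is a minimal sketch of how a script might prepare a query for a SPARQL endpoint over HTTP, following the standard SPARQL protocol convention of passing the query text in a `query` parameter. The endpoint URL is hypothetical and the request is only built, never sent; nothing here is specific to Callimachus or its Named Queries.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

# Hypothetical endpoint; substitute the address of a real service.
ENDPOINT = "https://data.example.gov/sparql"

QUERY = """
SELECT ?facility ?label WHERE {
  ?facility <http://www.w3.org/2000/01/rdf-schema#label> ?label .
} LIMIT 10
""".strip()


def build_sparql_request(endpoint: str, query: str) -> Request:
    """Build (but do not send) a SPARQL protocol GET request."""
    url = f"{endpoint}?{urlencode({'query': query})}"
    # Ask for the standard JSON serialization of SELECT results.
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = build_sparql_request(ENDPOINT, QUERY)

# Decode the URL again to confirm the query survives URL encoding intact.
sent_query = parse_qs(urlsplit(request.full_url).query)["query"][0]
print(request.full_url)
```

Passing `request` to `urllib.request.urlopen` would execute the query against a live endpoint; that step is left out so the sketch runs without network access.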
[Screenshot callouts: From EPA / From Wikipedia / Open Street Map]
Data may be easily combined from several sources.
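One way to picture why combining sources is easy: when each source describes a facility with the same URI, merging is just a union of their statements. The sketch below shows this with plain Python sets; the URI, property names and values are all invented for illustration and do not come from the EPA, Wikipedia or Open Street Map data.

```python
from collections import defaultdict

# Hypothetical facility URI shared by all three sources.
FACILITY = "http://example.org/facility/123"

# Each source is a set of (subject, property, value) statements.
# All values below are made up for the example.
epa_data = {(FACILITY, "releasedChemical", "Mercury")}
encyclopedia_data = {(FACILITY, "description", "A cement plant.")}
map_data = {
    (FACILITY, "latitude", "37.32"),
    (FACILITY, "longitude", "-122.09"),
}


def merge(*graphs):
    """Combine several sets of statements into one."""
    merged = set()
    for graph in graphs:
        merged |= graph
    return merged


def describe(graph, subject):
    """Collect every property value known about one subject."""
    properties = defaultdict(list)
    for s, p, o in graph:
        if s == subject:
            properties[p].append(o)
    return dict(properties)


combined = merge(epa_data, encyclopedia_data, map_data)
profile = describe(combined, FACILITY)
print(sorted(profile))
```

No join keys or schema mapping are needed in this toy case because the shared URI does that work; real datasets usually need some alignment of identifiers first.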
US GPO
• Cloud-based Linked Data provision of persistent URLs for US Government documents:
  • 33K documents
  • Used by 1,240 Federal Depository Libraries and the public
• In 3rd year of operation
• Deemed an Essential service supporting US Congress
Real World Linked Data
Now let’s look at the same workflow in the Linked Data Service.
Finding Mercury Released in 2004
There are two very important things to note on this page. The first (callout 1) is that on any facility’s page, there is always an option to download the data. This data is available in two formats (RDF/XML and Turtle). With the click of a button a user can have all of the data that was used to drive the creation of the current page, which means he or she can repurpose that data into any new application. Note here that this download is not an extract, summary, or recreation of the data: it is literally the *same* data that was used to drive that page.
The second (callout 2) is that because this page is “data-driven”, navigation relies on exploring the data, not the system that contains it. On the same page where we get information like the facility’s latitude and longitude, we can also find a link to a report detailing exactly how much mercury was released in 2004. We could easily do an in-page search for 2004 or Mercury to identify the releases associated with those terms.
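A script could ask for the same two formats that the download button offers. The sketch below assumes the service honors HTTP content negotiation through the `Accept` header, which is a common Linked Data convention but is not confirmed by these slides; the page URL is hypothetical and the request is only constructed, not sent.

```python
from urllib.request import Request

# Registered media types for the two formats named on the slide.
MEDIA_TYPES = {
    "rdfxml": "application/rdf+xml",
    "turtle": "text/turtle",
}


def build_download_request(page_url: str, fmt: str) -> Request:
    """Build (but do not send) a request for a page's data in one format."""
    try:
        media_type = MEDIA_TYPES[fmt]
    except KeyError:
        raise ValueError(f"unsupported format: {fmt!r}") from None
    return Request(page_url, headers={"Accept": media_type})


# Hypothetical facility page address.
PAGE = "https://data.example.gov/facility/123"

turtle_request = build_download_request(PAGE, "turtle")
rdfxml_request = build_download_request(PAGE, "rdfxml")
print(turtle_request.get_header("Accept"))
print(rdfxml_request.get_header("Accept"))
```

If a service exposes the formats through separate download links instead, the same idea applies with a different URL per format rather than a different header.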
TRI Report
Rather than aggregating the data for presentation, the actual report is presented with the raw data continuously available in the top right of the page.
A subtle point to note here is the change in the name of the facility. Previously it was identified as Hanson Permanente, but now it is known as Lehigh Southwest Cement Co. During the modeling phase, the Linked Data was created to implicitly include this relationship (which is known via the mapping of EPA FRS identifiers). On the other hand, pulling down the CSV files would not give the user any obvious way of understanding this relationship.
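The role of the shared identifier can be sketched in a few lines. In the example below, reports that carry the same facility identifier are grouped so that both names surface together. The identifier values, the second facility and the 2011 report year are invented for illustration; only the two facility names and the 2004 report come from the slides, and this is not how the EPA service itself is implemented.

```python
from collections import defaultdict

# Hypothetical report rows; "frs_id" stands in for an EPA FRS identifier.
reports = [
    {"frs_id": "FRS-0001", "name": "Hanson Permanente", "year": 2004},
    {"frs_id": "FRS-0001", "name": "Lehigh Southwest Cement Co", "year": 2011},
    {"frs_id": "FRS-0002", "name": "Another Facility", "year": 2004},
]


def names_by_facility(rows):
    """Map each identifier to the names it has been reported under, oldest first."""
    names = defaultdict(list)
    for row in sorted(rows, key=lambda r: r["year"]):
        if row["name"] not in names[row["frs_id"]]:
            names[row["frs_id"]].append(row["name"])
    return dict(names)


aliases = names_by_facility(reports)
print(aliases["FRS-0001"])
```

With separate CSV files, a user would have to know to join on the identifier; in the Linked Data model the identifier is the resource itself, so the relationship is already there.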
Potential Audience
✔ Middle school student doing a science project
✔ Concerned citizen worried about local pollution
✔ Environmental Science PhD from EPA
✔ Doctor from NIH writing a research paper
Linked Data allowed us to reach all the members of our potential audience by giving the user options, aggregating based on relevance rather than data source, and by exposing the data that drives the service for reuse.
The middle school student or concerned citizen who wants to know the location of a facility, the amount of a particular chemical it released, and the year it was released in never has to click any of the options in the Linked Data box. They can simply use the interface, explore the data, and find what they need in a read-only experience.
The Environmental Science PhD is still able to find what they are looking for with Linked Data but can do so in a much more intuitive way. The doctor from NIH is now able to find the data they’re interested in and, if they choose to take the next step, download the actual data behind the page. By quickly and easily obtaining the raw data, anyone from scientists to journalists can generate their own applications without any knowledge of the Linked Data Service itself.
The mission of the Government Linked Data (GLD) Working Group is to provide standards and other information which help governments around the world publish their data as effective and usable Linked Data using Semantic Web technologies.
We are 16 months into the Government Linked Data Working Group’s two-year charter.