Provenance and Reuse of Open Data (PILOD 2.0 June 2014)

Published in: Technology, Education


  1. Converts' rally, Evangelistic Committee of New York City, Carnegie Hall, Sept. 14, 1908
  2. Linked Open Data: Six Ingredients
     • The missing ★
     • Mix 'n Mash
     • Contextualize!
     • Choose your Grain Size
     • Lower the Threshold
     • Repeatable Transformation
  3. 1: The missing ★
     http://give.everything/a/URI
     HTTPs URIs only please! (or resolver + URN)
     • Version information
     • Version agnostic
     • Guessable
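
One way to read slide 3's three URI properties (version information, version agnostic, guessable) is as a naming scheme with a stable URI per resource plus an optional version segment. The following is a minimal sketch under that assumption. The domain `data.example.org`, the `/id/<kind>/<local-id>` path pattern, and the function name are all hypothetical and do not come from the slides.

```python
from typing import Optional
from urllib.parse import quote

BASE = "https://data.example.org"  # hypothetical publisher domain (assumption)


def resource_uri(kind: str, local_id: str, version: Optional[str] = None) -> str:
    """Build a guessable HTTPS URI.

    Without `version` the result is the version-agnostic URI;
    with `version` it identifies one specific version of the resource.
    """
    path = f"/id/{quote(kind)}/{quote(local_id)}"
    if version is not None:
        path += f"/{quote(version)}"
    return BASE + path


# Version-agnostic URI and a URI pinned to one (made-up) version label.
latest = resource_uri("dataset", "air-quality")
pinned = resource_uri("dataset", "air-quality", "2014-06-01")
print(latest)
print(pinned)
```

Because the pinned URI extends the version-agnostic one, a reader who knows either form can guess the other.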
  4. 2: Repeatable Transformation
     Transformation should be part of routine ... manageable and scalable ... repeatable ...
     Linked Data will not be the official source anytime soon
     Provenance is key
     http://www.w3.org/TR/prov-overview/
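
Slide 4's point that a transformation should be repeatable and accompanied by provenance can be sketched as a deterministic conversion step that also reports what it used and what it generated. This is an illustration only. The record layout, the function names, the example rows, and the `data.example.org` base URI are assumptions, and the dictionary is a simplified stand-in for a real PROV record, not a PROV serialization.

```python
import hashlib
import json


def sha256_of(obj) -> str:
    """Hash a JSON-serialisable object in a canonical (sorted-key) form."""
    canonical = json.dumps(obj, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def to_linked_records(rows, base="https://data.example.org/id/station/"):
    """Deterministic transformation: the same rows always yield the same records."""
    return [
        {"@id": base + row["code"], "name": row["name"].strip()}
        for row in sorted(rows, key=lambda r: r["code"])
    ]


def transform_with_provenance(rows, source_uri, script_version):
    """Run the transformation and describe what was used and what was generated."""
    records = to_linked_records(rows)
    provenance = {
        "used": {"source": source_uri, "sha256": sha256_of(rows)},
        "activity": {"script": "to_linked_records", "version": script_version},
        "generated": {"sha256": sha256_of(records)},
    }
    return records, provenance


# Made-up input rows standing in for an official source file.
rows = [{"code": "B2", "name": " Beta "}, {"code": "A1", "name": "Alpha"}]
records, provenance = transform_with_provenance(
    rows, "https://example.org/source.csv", "1.0"
)
print(json.dumps(provenance, indent=2))
```

The sketch leaves out a timestamp so that two runs over the same input produce identical provenance. A production record would normally also state when the activity ran and who was responsible.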
  5. 3: Choose your Grain Size
     • The document is the traditional grain size (Dublin Core)
     • Linked data allows for deep links into data
     • Cost versus usefulness
     • Are you the right party to provide detailed descriptions?
     http://creatingandeducating.blogspot.nl/2011/11/blog-post.html
  6. 4: Mix 'n Mash
     • Multiple vocabularies won't bite
     • Multiple identifiers won't bite!
     • Choose what's useful for you...
     • ... then map to others!
     Good News: the bulk has already been done for you!
     Image © David Sykes 2009 All rights reserved
  7. 5: Contextualize!
     • Information is not always compatible
     • Make explicit in which context the information holds ...
     • ... and who stated the information, why and how.
     Flat Earth and Square Earth idea courtesy of Szymon Klarman
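
Slide 7's advice can be illustrated with a tiny data model in which incompatible claims coexist because each carries its own context and attribution. This is a hypothetical sketch. The field names, the context labels, and the example claims (loosely borrowing the slide's flat and square Earth) are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Statement:
    subject: str
    predicate: str
    value: str
    context: str    # in which context the claim holds
    stated_by: str  # who stated it
    method: str     # how (or why) it was established


# Two incompatible claims that can coexist because each is contextualized.
statements = [
    Statement("Earth", "shape", "flat", "context-a", "source-a", "assumed for local maps"),
    Statement("Earth", "shape", "square", "context-b", "source-b", "assumed for grid tiling"),
]


def holds_in(context, stmts):
    """Return only the statements asserted within the given context."""
    return [s for s in stmts if s.context == context]


for s in holds_in("context-a", statements):
    print(f"{s.subject} {s.predicate} = {s.value} (per {s.stated_by}: {s.method})")
```

Querying by context returns a consistent view even though the full collection contains contradictory values.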
  8. Data2Semantics: From Data to Semantics for Scientific Data Publishers
  9. Photo by Filip Dujardin, http://www.filipdujardin.be
  10. Provenance and Reuse of Open Data
      Rinke Hoekstra
      VU University Amsterdam / University of Amsterdam
      rinke.hoekstra@vu.nl
      Photo by Filip Dujardin, http://www.filipdujardin.be
  11. Definition (Oxford English Dictionary)
      • The fact of coming from some particular source or quarter; origin, derivation;
      • the history or pedigree of a work of art, manuscript, rare book, etc.;
      • concretely, a record of the passage of an item through its various owners.
  12. • Making trust judgements
      • Liability, trust and privacy in open government data
      • Compliance and auditing of business processes
      • Licensing and attribution of combined information
  13. Curt Tilmes, Peter Fox, Xiaogang Ma, Deborah L. McGuinness, Ana Pinheiro Privette, Aaron Smith, Anne Waple, Stephan Zednik, Jinguang Zheng: Provenance Representation for the National Climate Assessment in the Global Change Information System. IEEE T. Geoscience and Remote Sensing 51(11): 5160-5168 (2013)
      Integrated & Summarized Data
      Transparency and Trust
      “Provenance is the number one issue that we face when publishing government data in data.gov.uk”
      John Sheridan, UK National Archives, data.gov.uk
  14. Provenance?
      • Provenance = Metadata?
        Provenance can be seen as metadata, but not all metadata is provenance
      • Provenance = Trust?
        Provenance provides a substrate for deriving different trust metrics
      • Provenance = Authentication?
        Provenance records can be used to verify and authenticate amongst users
  15. Three Dimensions
      • Content: Capturing and representing provenance information
      • Management: Storing, querying, and accessing provenance information
      • Use: Interpreting and understanding provenance in practice
  16. Three Dimensions
      • Content: Capturing and representing provenance information (recording, annotating, workflows)
      • Management: Storing, querying, and accessing provenance information
      • Use: Interpreting and understanding provenance in practice
  17. Three Dimensions
      • Content: Capturing and representing provenance information (recording, annotating, workflows)
      • Management: Storing, querying, and accessing provenance information (scalability, interoperability)
      • Use: Interpreting and understanding provenance in practice
  18. Three Dimensions
      • Content: Capturing and representing provenance information (recording, annotating, workflows)
      • Management: Storing, querying, and accessing provenance information (scalability, interoperability)
      • Use: Interpreting and understanding provenance in practice (trust, accountability, compliance, explanation, debugging)
  19. Standardization
  20. W3C PROV Standard
      Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.
      http://www.w3.org/TR/prov-overview
  21. W3C PROV Standard (Luc Moreau & Paul Groth)
      Provenance is a record that describes the people, institutions, entities, and activities involved in producing, influencing, or delivering a piece of data or a thing.
      http://www.w3.org/TR/prov-overview
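
The PROV definition above names entities, activities, and the people or institutions responsible for them (agents in PROV terms). A minimal sketch of how such a record might be written down is to print a handful of PROV-O triples in Turtle. The `prov:` classes and properties used here are part of the W3C PROV-O vocabulary. The `ex:` namespace, the resource names, and the helper function are hypothetical.

```python
PREFIXES = (
    "@prefix prov: <http://www.w3.org/ns/prov#> .\n"
    "@prefix ex: <https://data.example.org/> .\n"  # hypothetical namespace
)


def prov_turtle(entity, activity, agent, source):
    """Describe how `entity` was produced, as PROV-O triples in Turtle syntax."""
    triples = [
        (entity, "a", "prov:Entity"),
        (source, "a", "prov:Entity"),
        (activity, "a", "prov:Activity"),
        (agent, "a", "prov:Agent"),
        (activity, "prov:used", source),
        (activity, "prov:wasAssociatedWith", agent),
        (entity, "prov:wasGeneratedBy", activity),
        (entity, "prov:wasDerivedFrom", source),
        (entity, "prov:wasAttributedTo", agent),
    ]
    body = "\n".join(f"{s} {p} {o} ." for s, p, o in triples)
    return PREFIXES + "\n" + body + "\n"


# Made-up resources: a converted dataset, the run that produced it, and its publisher.
turtle = prov_turtle("ex:dataset-v2", "ex:conversion-run", "ex:publisher", "ex:dataset-v1")
print(turtle)
```

String formatting keeps the sketch dependency-free. A real pipeline would more likely build the graph with an RDF library and let it handle serialization and escaping.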
  22. http://doc.metalex.eu
      http://yasgui.data2semantics.org
  23. Interpretation
  24. Naive Approaches
      InProv: Visualizing Provenance Graphs with Radial Layouts and Time-Based Hierarchical Grouping
      Madelaine D. Boyd - http://www.seas.harvard.edu/sites/default/files/files/archived/Boyd.pdf
      [Slide shows excerpted pages from the thesis: the limitations of Orbiter's node-link layout (Figures 10 and 11, where dense edges obscure the nodes in Orbiter and GraphViz renderings of provenance graphs) and Section 2.5, Tree Visualization (Figures 12 and 13: Pajek, TopoLayout, and the treevis.net categories).]
  25. • Width of activities and entities is based on information flow
      • Activities and entities are extracted from an ego graph
  26. Capturing
  27. Wrapper-based
      Application Developers want to consume data: "We need an intuitive REST-like API to integrated Open Government data. Dealing with all these different formats and identifiers is really taking too much time."
      Civil Servant wants to publish data: "I have all this data, and I want to make (part of) it available for the general public, but haven't a clue how!"
      [Architecture diagram, components as labelled:]
      • Apps and applications: Visual interactions with Open Data. Application specific logics (e.g. 'danger')
      • CitySDK API: HTTP API to the CitySDK. Returns JSON, Turtle, etc. (includes the Linked Data API of CitySDK)
      • SPARQL API: SPARQL Endpoint to the Linked Data storage of the ODE
      • Query Orchestrator: Amsterdam Open Data Exchange HTTP API to 'canned queries' across multiple datasets. Returns JSON-LD, Turtle
      • CitySDK Datastores and Linked Data Triplestore (Partial Synchronisation)
      • Data Integrator (Feed into)
      • ODE Best Practices: Best practices for publishing Open Data
      • CitySDK Ingestion Plugins: "Standard" adapters part of CitySDK
      • ODE Ingestion Adapters: Ingestion adapters developed within ODE
      • Sources: Municipal Legacy Systems, Excel Files, Amsterdam Open Data
      • CKAN: Amsterdam Open Data Catalog. Will point to datasets in the ODE. May provide a direct query interface on top of ODE
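
From the application developer's side, the SPARQL API in slide 27 would be called over HTTP. The following sketch only builds such a request, following the standard SPARQL protocol convention of a `query` URL parameter, and does not send it. The endpoint URL is a placeholder, since the slides give no address for the ODE endpoint, and the query itself is an invented example.

```python
from urllib.parse import urlencode
from urllib.request import Request

ENDPOINT = "https://ode.example.org/sparql"  # placeholder endpoint (assumption)

# Invented example query: list ten dataset titles.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
SELECT ?dataset ?title WHERE {
  ?dataset dcterms:title ?title .
} LIMIT 10
"""


def sparql_request(endpoint, query):
    """Build (but do not send) a SPARQL protocol GET request asking for JSON results."""
    url = endpoint + "?" + urlencode({"query": query})
    return Request(url, headers={"Accept": "application/sparql-results+json"})


request = sparql_request(ENDPOINT, QUERY)
print(request.full_url)
```

Passing the resulting object to `urllib.request.urlopen` would execute the query against a real endpoint. A 'canned query' API such as the Query Orchestrator would hide this step behind a plain HTTP call.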
  28. Workflow-based
  29. Tom de Nies (Ghent University)
      Sara Magliacane (VU University Amsterdam)
  30. Integrated
  31. Data2Semantics: From Data to Semantics for Scientific Data Publishers
      The Big Future of Data
      2 October 2014
  32. Enrich, Publish, Analyze
  33. Semantic Publication of Data
      • Publish directly from the cloud to the cloud
      • On-the-fly analysis and tag suggestion
  34. Interactive Data Construction via Instrumented IPython Notebook
      • Integration in popular tool
      • No “green field”
  35. Visual Exploration of Big Data
      • Virtualisation
      • Discover patterns
      • Interactive visualisation
      • Sparse and heterogeneous
