Open Data, by definition, provides the chance to re-shape and publish heterogeneous pieces and fragments of information which are open, meaning anyone is free to use, reuse, and redistribute them. For users to fully benefit from this idea, the Open Data systems of tomorrow must provide high-quality data, relying on real-time and ubiquitous services, along with a deep integration with mobile and smart devices and infrastructures.
In this session, we present a synthesis of the Whitehall proposal addressing this vision: it aims at building Open Data on a fully-fledged Big Data infrastructure, realized using graph-based and NoSQL technologies. This idea is shaped in a cultural heritage scenario, where data is aimed at valorizing one of the main assets of Italy: cultural heritage.

Published in: Technology


  1. 1. OPEN DATA & OPEN CULTURE Michele Piunti Whitehall Reply
  2. 2. OUTLINE • Background • Motivations • Approaches • Open Data As A Service • Cultural Heritage Hacking • Case Study
  3. 3. Background
  4. 4. Obama Vision “In the coming year, we’ll also work to rebuild people’s faith in the institution of government. Because you deserve to know exactly how and where your tax dollars are being spent, you’ll be able to go to a website and get that information for the very first time in history. Because you deserve to know when your elected officials are meeting with lobbyists, I ask Congress to do what the White House has already done: put that information online.” (President Obama, The State of the Union Speech, 2011)
  5. 5. What Open Data is Open data is the idea that certain data should be freely available to everyone to use and republish as they wish, without restrictions from copyright, patents or other mechanisms of control. The goals of the open data movement are similar to those of other "Open" movements such as open source, open content, and open access (Ref. Wikipedia). Citizen centricity comes from citizen empowerment, namely disintermediation with respect to traditional actors
  6. 6. Expected Payoff • Ubiquitous access • Reusability • Optimization • Social and Cultural enrichment ROI “A greater than 100X return on investment in direct Federal IT spending through economies of scope is achievable by equipping agencies with an Open Data platform that is the shared foundation for numerous programs that are independently funded today” OD turns out to be a formidable tool for: • Analyzing Spending Review on administrations’ expenses • Enforcing Fact Checking on declarations, policies and campaigns
  7. 7. Where Open Data is counting
  8. 8. Italian Digital Agenda: Open Data + E-Gov In 2012 Italy established a “control room” of experts aimed at promoting Open Data in the context of a digital agenda. Open Data is integrated with E-Gov: 1. Enabling Infrastructures 2. PA digital switchover 3. Purposive and regulative set of norms and rules 4. Communication plan. The challenge is optimizing services and costs: • Digital Identities and related services, unified and web-based registry offices, e-payments, continuous census, interoperability of EU platforms • Digital health, Cultural Heritage • eLearning, eProcurement, eRecruitment
  9. 9.
  10. 10. Approaches
  11. 11. Roadmap to Open Data
      Phase 1 (feasibility):
      • Data assets analysis: Relevant Datasets; Customer internal processes analysis
      • Identify Use Cases and Final Users: Best Practices and similar datasets; Linked Data Cloud
      • Identify ROI: Risk assessment; Savings; Identify non quantifiable ROI
      • Architecture Definition: Identify service level; W3C Compliance
      • Legal Issues: Copyright; Licensing; Liability of Data update
      • LOD Feasibility report, Executive Plan and Road Map
      Phase 2 (Service Development and Knowledge Transfer):
      • Identify Datasets: Data Analysis; Datasets Store; transformation; Normalization
      • Development: Architecture Definition; SPARQL Endpoint
      • Data Enrichment: Metadata description, Ontology, RDF; Internal linking; External component (GIS, Data-Mining, BI Analysis modules)
      • Validation and Publication: W3C Compliance; Data Localization, History; External Linking; Communication Plan
      • Composition of Services: Documentation; Build Ecosystem; Public API
      • LOD Services: Data and Platform
  12. 12. Legal Restrictions, Privacy, Licenses Multiple legal or regulatory restrictions on the use of the data.
  13. 13. 5★ Open Data Tim Berners-Lee, the inventor of the Web and Linked Data initiator, suggested a 5 star deployment scheme for Open Data.
      ★ make PUBLIC stuff available on the Web (whatever format, .jpeg .pdf) under an open license
      ★★ make it available as structured data (e.g., Excel instead of image scan of a table)
      ★★★ use non-proprietary formats (e.g., CSV instead of Excel)
      ★★★★ use URIs to denote things, so that people can point at your stuff
      ★★★★★ link your data to other data to provide context
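A minimal Python sketch of how the 5-star scheme could be applied to a dataset description. It assumes the stars are cumulative (each level presupposes the previous ones); the `Dataset` class and its field names are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    # Hypothetical flags, one per star of the deployment scheme.
    open_license: bool            # on the Web under an open license
    structured: bool              # structured data rather than a scan
    non_proprietary_format: bool  # e.g. CSV instead of Excel
    uses_uris: bool               # things are denoted by URIs
    linked_to_other_data: bool    # links out to other datasets


def star_rating(dataset: Dataset) -> int:
    """Count stars in order, stopping at the first unmet criterion."""
    criteria = [
        dataset.open_license,
        dataset.structured,
        dataset.non_proprietary_format,
        dataset.uses_uris,
        dataset.linked_to_other_data,
    ]
    stars = 0
    for met in criteria:
        if not met:
            break
        stars += 1
    return stars
```

Under this reading, an openly licensed Excel file rates two stars even if it already uses URIs, because the non-proprietary-format criterion is not met.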
  14. 14. W3C Roadmap Having Standard Names/URIs for All Government Objects aids in discoverability, improves metadata, and ensures authenticity. • Provide permanent, patterned and/or discoverable URI/URLs to your data • Create a web page with a plain language description of the dataset to help search engines find the data, so people can use it. • Provide links out to other data and documentation. • Ensure that data is findable and can be referenced for as long as people need it • Data published in industry standards like (X)HTML, XML and RDF can be used as an object database or RESTful API
  15. 15. Linked Open Data Recommended best practice for exposing, sharing, and connecting pieces of data, information, and knowledge on the Semantic Web using URIs, OWL and RDF. 1. Requires Ontologies to be applied to data 2. Allows heterogeneous Nodes to be traversed in a semantically coherent fashion
  16. 16. Botticelli Case One may specify that the author mentioned for “La Primavera” at the Uffizi Museum LINKS to exactly the same person as the one described on DBpedia (the LOD version of Wikipedia). The link is not just a hyperlink, because it is typed. In the BOTTICELLI page, the information about his life and works is structured by means of the topology
  17. 17. Semantic Network Enables Reasoning: OWL-DL, based on Description Logics, represents a decidable fragment of First Order Logic. Sandro_Botticelli → category:Italian_Renaissance_painters; category:Italian_Renaissance_painters → category:Quattrocento_painters; hence Sandro_Botticelli → category:Quattrocento_painters
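The inference on this slide amounts to following category links transitively. The Python sketch below illustrates only that one step; it is not an OWL-DL reasoner, and the function name and the two dictionaries are assumptions introduced for the example.

```python
def infer_categories(entity, member_of, subcategory_of):
    """Return every category the entity belongs to, directly or by inheritance."""
    inferred = set()
    pending = list(member_of.get(entity, []))
    while pending:
        category = pending.pop()
        if category in inferred:
            continue  # already visited, also guards against cycles
        inferred.add(category)
        # Membership propagates upwards along the category hierarchy.
        pending.extend(subcategory_of.get(category, []))
    return inferred


# Asserted facts, taken from the slide.
member_of = {"Sandro_Botticelli": ["Italian_Renaissance_painters"]}
subcategory_of = {"Italian_Renaissance_painters": ["Quattrocento_painters"]}

categories = infer_categories("Sandro_Botticelli", member_of, subcategory_of)
```

Here `categories` contains both the asserted category and the inferred one, `Quattrocento_painters`.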
  18. 18. Linked Open Data Cloud (2011) Doubled in size every 10 months since 2007. Domains: Media, User-generated, Geographic, Publications, Government, Cross-Domain, Life Sciences
  19. 19. Recipes for Serving Information as Linked Data • Entities must be identified with referenceable HTTP URIs. • When the MIME type application/rdf+xml is requested, the data source must return an RDF/XML description. • RDF descriptions should also contain RDF links to resources provided by other data sources, so that clients can navigate the Web of Data as a whole by following RDF links.
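The recipes above can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than a server implementation: `choose_representation` ignores q-values in the Accept header, the `example.org` URI is hypothetical, and `owl:sameAs` is just one common way to express an RDF link to another data source.

```python
from xml.sax.saxutils import quoteattr

RDF_XML = "application/rdf+xml"


def choose_representation(accept_header: str) -> str:
    """Pick RDF/XML when the client asks for it, HTML otherwise (q-values ignored)."""
    requested = [part.split(";")[0].strip().lower() for part in accept_header.split(",")]
    return RDF_XML if RDF_XML in requested else "text/html"


def describe(entity_uri: str, linked_uri: str) -> str:
    """Build a tiny RDF/XML description containing one link to an external resource."""
    return (
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
        'xmlns:owl="http://www.w3.org/2002/07/owl#">'
        f"<rdf:Description rdf:about={quoteattr(entity_uri)}>"
        f"<owl:sameAs rdf:resource={quoteattr(linked_uri)}/>"
        "</rdf:Description></rdf:RDF>"
    )


document = describe(
    "http://example.org/artist/botticelli",          # hypothetical local URI
    "http://dbpedia.org/resource/Sandro_Botticelli",  # link into another data source
)
```

A client that receives `document` can dereference the linked URI and continue navigating from there.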
  20. 20. Open Data As A Service
  21. 21. Towards Government 2.0 “Government IT needs to redefine itself as Government as a Platform” Open Data is the platform for Open Government. Actors: • Institutions: to better deliver services to citizens • Civic-minded developers: to serve themselves and others by extending the platform (i.e. mash-ups, applications) What actors need: Open Data management platforms, consistent admin tools and a powerful Open Data Catalog to consolidate the entire Open Data lifecycle (STEP 1-5)
  22. 22. Open Data As-a-Service
  23. 23. Open Data As-A-Service [Diagram: Mobile App, Web App and Mobile App clients, each connected through a REST API] Data-on-Demand: data are not closed inside CMS applications but are consumed on demand, As-a-Service. Data as Web Resources: RESTful APIs make it possible to retrieve data as a web resource (through a URI)
  24. 24. Socrata: GovStat Approach Socrata is releasing fragments of the platform as Open Source on GitHub. The business model is moving to advanced data analysis tools, mining, real-time monitoring, and decision-making support systems
  25. 25. Cultural Heritage Hacking
  26. 26. Open Data in a Cultural Heritage Scenario Art Galleries, Libraries, Archives and Museums (GLAMS) are exploring the added value of sharing their data resources as LOD. Key facts: • Rich and structured data sets accumulated over many years by experts • Ability to reach out to audiences to both enrich datasets and evaluate services • Long-standing expertise in meta-data management and (co-)curation • Authoritative knowledge on a wide range of subjects
  27. 27. GLAMS LOD Examples In Agora, the Rijksmuseum Amsterdam and the Netherlands Institute for Sound and Vision collaborate with the Computer Science and History departments at the VU to integrate their collections and enrich them with historical information, to facilitate a more comprehensive understanding of the historical dimension of objects in online heritage collections. The Amsterdam Museum was the first museum in the Netherlands to convert its complete museum collection database to RDF. The resulting resource consists of more than 5 million RDF triples describing more than 70,000 cultural heritage objects. Several working examples use this dataset, such as a mobile city guide.
  28. 28. GLAMS LOD Examples Europeana is a pan-European initiative that provides access to millions of objects as LOD through an API. The Europeana Thought Lab[5] search interface shows how LOD principles can aid the search process. Europeana has been a strong supporter of the uptake of CC0, the "no rights reserved" option among the Creative Commons licenses. Open Images provides access to a large and growing collection of Creative Commons licensed archive material. The meta-data is converted to RDF, allowing the creation of rich semantic links with other datasets such as the Amsterdam Museum dataset
  29. 29. PROS and CONS of LOD for GLAMS PROS: • Driving users to online content held by GLAMS (e.g., by improved search engine optimization); • Stimulating collaboration in the library, archives and museums domain and beyond, for instance by inviting people to clean/enrich existing data; • Enabling new scholarship that can only be done with open data; • Allowing the creation of new services for discovery; • Quoting Verwayen (2011), “increas[ing] relevance to digital society.” CONS: • Loss of Attribution to the “memory institution”, which may in turn decrease the value of the artworks • Loss of potential income: open data may not be sold
  30. 30. Metrics of Success • Income: measured in money • Public Outreach: to measure the number of (online) visitors • Reuse: to measure the use of data and content by heritage institutions themselves and by others • Public Participation: to measure the amount of added metadata and content
  31. 31. Developing Open Linked Data (with Graph Database)
  32. 32. Developing Open Linked Data We may recognize a few contingencies in our scenario: • Exponential growth in data volumes • Rise of connectedness • Increase in degrees of semi-structure • Structures and schemas emerge rather than being pre-defined upfront Key facts: • Volume: the size of the stored data • Velocity: the rate at which data changes over time • Variety: the degree to which data is regularly or irregularly structured, dense or sparse, and importantly connected or disconnected
  33. 33. ER Approach We do not know the structure of the documents at design time. Adopting an ER approach, we have to define vertical tables
  34. 34. Relational Model Weakness In the ER model, relationships are semantics-free (no direction, no name) • As the amount of semi-structured information increases, the relational model becomes burdened • Maintenance overhead: join tables and foreign key constraints must be maintained just to make the database work • Large join tables, sparsely populated rows and lots of null-checking logic • Reciprocal queries are difficult to handle in today's semi-structured, real-world cases • Recommendation systems, social networks
  35. 35. Aggregate Stores Weakness Aggregates allow relationships to be mimicked by embedding cross-store identifiers, but: • It is up to the developer to manage, infer and reify useful knowledge from them • They do not provide index-free adjacency • Deletes must be checked • Traversing relationships is expensive, each link requiring an index lookup • Brute-force computation over an entire data set is O(n), since all n aggregates in the data store must be considered. That’s far too costly where we’d prefer O(log n) • Impractical in real-time scenarios
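The O(n) cost mentioned above shows up as soon as a relationship has to be followed backwards. In the Python sketch below, an aggregate store is modelled as a plain dictionary of documents (an assumption made for illustration; the identifiers and the `owner_id` key are hypothetical): finding everything that points at a given aggregate means visiting all n aggregates.

```python
def referencing(aggregates: dict, target_id: str, key: str) -> list:
    """Reverse lookup: scan every aggregate for an embedded identifier (O(n))."""
    return [
        aggregate_id
        for aggregate_id, document in aggregates.items()
        if document.get(key) == target_id
    ]


# Hypothetical aggregates embedding the identifier of their owner.
aggregates = {
    "doc:1": {"title": "La Primavera", "owner_id": "user:1"},
    "doc:2": {"title": "Uffizi guide", "owner_id": "user:1"},
    "doc:3": {"title": "Other", "owner_id": "user:2"},
}

owned_by_user_1 = referencing(aggregates, "user:1", "owner_id")
```

A secondary index on `owner_id` would avoid the full scan, but it is the developer who has to create and maintain it, which is the point the slide makes.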
  36. 36. Storing data in Graphs Graph theory was pioneered by Euler in the 18th century and has received multidisciplinary contributions across the centuries • Facebook, Google and Twitter have centered their business models around their own proprietary distributed graph technologies (Facebook TAO, Twitter FlockDB) Graph databases store information in ways that much more closely resemble the ways the world is organized and the way humans “think about” data. Top 10 Gartner IT technologies in 2013: “[..] are designed to support new transaction, interaction and observation use cases involving web scale, mobile, cloud and clustered environments”
  37. 37. From Relational to Graph based Modeling Graph DBs place relationships as first-class abstractions of the data model • A graph contains nodes and relationships • Nodes contain properties (key-value pairs) • Relationships are named, directed and always have a start and end node • Relationships can also contain properties. A Graph -[:RECORDS_DATA_IN]-> Nodes -[:WHICH_HAVE]-> Properties. Nodes -[:LINKED_BY]-> Relationships
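The property graph model described above can be written down as two small Python data classes. This is a sketch of the data model only, not of the Neo4j API; the `PAINTED` relationship and all property values are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Nodes contain properties (key-value pairs).
    properties: dict = field(default_factory=dict)


@dataclass
class Relationship:
    # Relationships are named, directed, and always have a start and an end node.
    name: str
    start: Node
    end: Node
    # Relationships can also contain properties.
    properties: dict = field(default_factory=dict)


botticelli = Node({"name": "Sandro Botticelli"})
primavera = Node({"title": "La Primavera"})
painted = Relationship("PAINTED", botticelli, primavera, {"source": "hypothetical"})
```

Because the relationship is an object with its own name, direction and properties, it can be queried and traversed directly instead of being reconstructed from foreign keys.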
  38. 38. From Relational to Graph based Modeling Shake an RDBMS while keeping all the relationships, and you’ll see a graph. Where RDBMSs are optimized for aggregated data, Graph Databases are optimized for highly connected data
  39. 39. Traversing Map Performances Friend of Friend (FoF) problem: for any person in a social network, look for a route to some other person in the graph at most depth=N hops away. For a social network containing 1,000,000 people, each with ~50 friends, the results (*) show that graph databases are the best choice:
      Depth | RDBMS Execution Time (s) | Neo4j (s) | Returned Records
      2     | 0.016                    | 0.01      | ~2500
      3     | 30.267                   | 0.168     | ~110,000
      4     | 1543.505                 | 1.359     | ~600,000
      5     | Unfinished               | 2.132     | ~800,000
      (*) Graph Databases, O’Reilly, To Appear
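The FoF query itself is a bounded breadth-first search. The Python sketch below runs it over an in-memory adjacency dictionary (an assumption made for illustration; it does not reproduce the benchmark or Neo4j's traversal framework) to show that each hop only touches the neighbours of the current person.

```python
from collections import deque


def within_depth(friends: dict, start: str, target: str, max_depth: int) -> bool:
    """Return True if target is reachable from start in at most max_depth hops."""
    if start == target:
        return True
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the allowed depth
        for friend in friends.get(person, ()):
            if friend == target:
                return True
            if friend not in seen:
                seen.add(friend)
                queue.append((friend, depth + 1))
    return False


# Tiny hypothetical network: a - b - c - d
friends = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

With this data, `within_depth(friends, "a", "d", 2)` is false and `within_depth(friends, "a", "d", 3)` is true.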
  40. 40. A Case Study Using Neo4j and Spring Data
  41. 41. Neo4j Graph DB • intuitive, using a graph model for data representation • reliable, with full ACID transactions • durable and fast, using a custom disk-based, native storage engine • massively scalable, up to several billion nodes/relationships/properties • highly-available, when distributed across multiple machines • expressive, with a powerful, human readable graph query language • fast, with a powerful traversal framework for high-speed graph queries • embeddable, with a few small jars • simple, accessible by a convenient REST interface or an object-oriented Java API
  42. 42. Spring Data and Neo4j Promotes POJO-based development for the Graph Database Neo4j. It maps annotated entity classes to the Neo4j Graph Database with advanced mapping functionality. Seamless integration of the Cypher Query Language
  43. 43. Spring Data Neo4j It is possible to derive queries for domain entities from finder method names like Iterable<T> • @Indexed fields will be converted into index lookups in the start clause • navigation along relationships will be reflected in the match clause • properties with operators will end up as expressions in the where clause
  44. 44. Open Linked Graph [Diagram: a single User node]
  45. 45. Open Linked Graph [Diagram: the User node connected by [:OWNS] relationships to two Document nodes]
  46. 46. Open Linked Graph [Diagram: as before, with the two Document nodes connected by [:INCLUDES] relationships to six Node nodes]
  47. 47. Open Linked Graph [Diagram: as before, with Node nodes connected by [:DBP_LINKED] relationships to four DBPedia URI nodes and by [:LOCATED] relationships to four Venue nodes]
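A minimal Python sketch of the Open Linked Graph built up in the slides above, stored as (source, relationship, target) triples. The relationship names come from the slides; the identifiers, the edge directions (User to Document to Node to DBPedia URI or Venue) and the `follow` helper are assumptions made for illustration, not the project's actual schema.

```python
# Hypothetical instance of the model: one user, one document, one node.
edges = [
    ("user:1", "OWNS", "doc:1"),
    ("doc:1", "INCLUDES", "node:1"),
    ("node:1", "DBP_LINKED", "http://dbpedia.org/resource/Sandro_Botticelli"),
    ("node:1", "LOCATED", "venue:uffizi"),
]


def follow(edges, start, path):
    """Follow a sequence of relationship names from start; return the nodes reached."""
    frontier = {start}
    for relationship in path:
        frontier = {
            target
            for source, name, target in edges
            if source in frontier and name == relationship
        }
    return frontier


venues = follow(edges, "user:1", ["OWNS", "INCLUDES", "LOCATED"])
links = follow(edges, "user:1", ["OWNS", "INCLUDES", "DBP_LINKED"])
```

Starting from a user, the same three-hop traversal yields either the venues or the DBpedia links attached to the content of the user's documents.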
  48. 48. Ideas for Changing the Future 1. User Centric Experience 2. Relevance Based Approach 3. Big Data 4. Environments and Societies 5. Smart Cities
  49. 49. Thanks Michele
  50. 50. Annex
  51. 51. Modular Approach [Architecture diagram] • Mobile Interface • Web UI: JQuery UI, Kendo UI, Bootstrap • Public API: Spring REST • Open CMS • Enterprise Framework: Spring, J2EE • OLAP Analysis, Mining: Pentaho • SaaS Manager: SOCRATA, CKAN (**) • Persistence Manager: Spring Data, MyBatis, Hibernate • Geographic Information System • Data Storage: Ontology (RDF, AML (*)), NoSQL (Neo4j, MongoDB), SQL DBMS (Oracle, MySQL, PostgreSQL) Ongoing collaborations: (*) ST-Lab, ISTC-CNR (**) Socrata, Inc.