Technical challenges in resource discovery Paul Walk email@example.com @paulwalk http://www.paulwalk.net
Contents
1. a general consideration:
  • open or closed
2. a particular challenge:
  • synchronisation in an open world
3. the ‘nothing new’, but doing it better
  • APIs that work and can be trusted
a healthy(?) state of tension between open and closed
open and closed worlds
• I’m not talking about licensing or access to data
• open
  • unbounded - like the Web
• closed
  • bounded - like most collections management systems, aggregations etc.
• formally, much of what we do is underpinned by ‘open/closed worlds’ assumptions:
  • open world assumption: any statement not known to be true is unknown
  • closed world assumption: any statement not known to be true is false
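The two assumptions above can be made concrete with a small sketch. This is only an illustration: the statements, the `KNOWN` set and the function names are hypothetical and do not come from any particular system.

```python
from enum import Enum


class Truth(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical statements known to a system, as (subject, predicate, object).
KNOWN = {
    ("item-1", "heldBy", "British Library"),
}


def ask_closed(statement, known=KNOWN):
    """Closed world: any statement not known to be true is false."""
    return Truth.TRUE if statement in known else Truth.FALSE


def ask_open(statement, known=KNOWN):
    """Open world: any statement not known to be true is unknown."""
    return Truth.TRUE if statement in known else Truth.UNKNOWN


query = ("item-1", "heldBy", "Bodleian Library")
print(ask_closed(query).value)  # false
print(ask_open(query).value)    # unknown
```

The same missing statement gives a firm "false" inside a bounded system and only "unknown" on the open Web, which is why the choice of assumption matters when integrating the two.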
judging where to apply each
• we need our infrastructure (especially integration technology between systems) to be open and relatively unbounded
  • the Web is still the best available foundation for this
• however, we still need to manage our resources, maintain quality and honour complex rights management commitments
• we probably need to recognise that users’ experience is often enhanced through the application of a more focussed, targeted and context-aware approach
synchronisation
• how is the state of the resource maintained across an infrastructure of ‘federated’ repositories?
• if a resource is changed or deleted, how does the right-hand side aggregation know?
• note - this is based on our existing ‘harvesting’ or ‘pull’ approach
[Diagram labels: Resource, Collection, Aggregation; caption: multiple harvest routes, multiple copies]
ResourceSync
• a joint project of NISO and OAI, led by Herbert Van de Sompel of Los Alamos
• a light-weight mechanism to allow the state of web resources to be communicated between web systems
• developing a spec which builds on the sitemap specification, allowing content providers to publish changesets
• draft: http://bit.ly/WYhTz2
• Jisc have funded UK participation in this
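A rough sketch of how an aggregator might consume a sitemap-based changeset. The `rs:` namespace and the `rs:md change="..."` element follow the ResourceSync specification as later published, so they are assumptions relative to the draft linked above. The example.org URIs, the function names and the "needs-fetch" marker are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",  # assumed from the published spec
}

# Hypothetical changeset published by a content provider.
CHANGE_LIST = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <url>
    <loc>http://example.org/res/1</loc>
    <lastmod>2013-01-02T10:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.org/res/2</loc>
    <lastmod>2013-01-02T11:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
"""


def parse_changes(xml_text):
    """Return (uri, change) pairs in document order."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        changes.append((loc, change))
    return changes


def apply_changes(local_copy, changes):
    """Return a new uri -> state mapping after applying the changes."""
    synced = dict(local_copy)
    for uri, change in changes:
        if change == "deleted":
            synced.pop(uri, None)          # drop our copy
        elif change in ("created", "updated"):
            synced[uri] = "needs-fetch"    # mark for re-harvest
    return synced


local = {
    "http://example.org/res/1": "fresh",
    "http://example.org/res/2": "fresh",
}
print(apply_changes(local, parse_changes(CHANGE_LIST)))
```

With a changeset like this, an aggregation can learn about deletions and updates without re-harvesting everything. A plain ‘pull’ of the current resource list does not tell it this directly.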
The sun shone, having no alternative, on the nothing new.
Murphy, Samuel Beckett
A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable
Leslie Lamport
a common ‘anti-pattern’
• as a developer, I have no reason to trust that these APIs are any good.
• after all, the service provider doesn’t seem to trust them for their own application....
[Diagram labels: end-user, UI, future 3rd-party dev, API, some aggregated data of broad interest and potential usefulness; legend: certainty, belief, speculation]
a better pattern
• As a developer, I’m more likely to trust this pattern.
• the content provider is using their own API to deliver their own application.
• they have a vested interest!
[Diagram labels: end-user, UI, 3rd-party app, focussed app, API, some aggregated data of broad interest and potential usefulness; legend: certainty, belief, speculation]
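A minimal sketch of the better pattern, assuming a single search call as the API. The records and all function names are hypothetical. The point is only that the provider's own application and a third-party application go through the same call.

```python
# Hypothetical aggregated data held by the content provider.
_RECORDS = [
    {"id": "r1", "title": "Map of London", "year": 1850},
    {"id": "r2", "title": "Map of Paris", "year": 1870},
]


def api_search(term):
    """The API: the only route to the data, for provider and third party alike."""
    needle = term.lower()
    return [dict(r) for r in _RECORDS if needle in r["title"].lower()]


def provider_ui(term):
    """The provider's own application, built on api_search."""
    return [f'{r["title"]} ({r["year"]})' for r in api_search(term)]


def third_party_app(term):
    """A third-party application: it uses exactly the same call."""
    return sorted(r["year"] for r in api_search(term))


print(provider_ui("london"))
print(third_party_app("map"))
```

Because `provider_ui` breaks whenever `api_search` breaks, the provider has a direct interest in keeping the API working.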
APIs are not best thought of as machine-to-machine interfaces
APIs are interfaces for developers
messages from developers to content-providers
• These are from yesterday’s developer day held here at the BL in support of this summit:
• please don’t build elaborate APIs which do not allow us to see all of the data, or its extent. It’s not that we simply want to download all the data - but we do need to see what we’re dealing with
• if you give us access to incomplete data (perhaps because you’re worried about revealing poor data quality), then we will tend to either abandon our attempts to use it or we will ‘fill in the gaps’ with data from elsewhere. So offering an API which delivers incomplete data is usually self-defeating
• the implicit bargain, made explicit:
  • give us access to the data as soon as possible and we will do some of the work to process it so it is fit for some new purpose - and we will happily share this code with you
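One way to act on the first two messages is for every response to state the total size of the collection and to return records as they are, gaps included. This is a sketch under assumptions: the dataset, the field names and the `next_offset` convention are hypothetical.

```python
# Hypothetical dataset; incomplete records are kept, not hidden.
_DATA = [
    {"id": 1, "title": "A", "creator": "X"},
    {"id": 2, "title": "B", "creator": None},
    {"id": 3, "title": None, "creator": "Y"},
    {"id": 4, "title": "D", "creator": "Z"},
    {"id": 5, "title": "E", "creator": None},
]


def list_records(offset=0, limit=2):
    """Return one page plus the total, so callers can see the extent."""
    end = offset + limit
    return {
        "total": len(_DATA),
        "offset": offset,
        "records": _DATA[offset:end],
        "next_offset": end if end < len(_DATA) else None,
    }


def walk_all(page_size=2):
    """Client side: follow next_offset until the whole set has been seen."""
    offset, seen = 0, []
    while offset is not None:
        page = list_records(offset, page_size)
        seen.extend(page["records"])
        offset = page["next_offset"]
    return seen


print(list_records()["total"])
print(len(walk_all()))
```

A developer can read `total` from the first response and judge the scale of the data. The missing titles and creators stay visible, so nobody has to guess where the gaps are.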
Questions for the parallel sessions
1. Which emerging technologies do we need to focus on in 2013?
2. Do we still need to aggregate?
3. What does data quality stop us doing?
Which emerging technologies do we need to focus on in 2013?
• Graphs: Content Context is king
• both Facebook and Google are betting heavily on graph technologies
• closer to home - so are content providers like the BBC
• linking these is an interesting challenge
• databases based on a graph model give the potential for a richer understanding about entities (users!)
• instrumentation in personal devices makes more context available (e.g. geo-location).
Do we still need to aggregate?
yes.
• to address systems/network latency - provide a cache
• to showcase!
• for ‘Web Scale concentration’
• network effects if user facing services also developed
• to create middleman business opportunities
• as infrastructure to support locally developed services
• as an approach to preservation
What does data quality stop us doing?
• interpreted as: “what does a concern for data quality stop us doing?”
  • it stops us from releasing data early
• interpreted as: “what does poor/uncertain data quality stop us doing?”
  • it erodes trust, which impacts the likelihood of someone doing something worthwhile with our data
• reconciling these concerns is a major challenge for us.
thank you! Paul Walk firstname.lastname@example.org @paulwalk http://www.paulwalk.net