Technical challenges in resource discovery




Paul Walk
paul@paulwalk.net
@paulwalk
http://www.paulwalk.net
Contents

1. a general consideration:
   • open or closed

2. a particular challenge:
   • synchronisation in an open world

3. the ‘nothing new’, but doing it better
   • APIs that work and can be trusted
a healthy(?) state of tension
between open and closed
open and closed worlds

• I’m not talking about licensing or access to data

• open
   • unbounded - like the Web

• closed
   • bounded - like most collections management systems, aggregations etc.

• formally, much of what we do is underpinned by ‘open/closed world’ assumptions:

    • open world assumption: any statement not known to be true is unknown
    • closed world assumption: any statement not known to be true is false
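
The difference between the two assumptions can be sketched in a few lines of Python. The fact store, the `Answer` enum and the example statements are all hypothetical, chosen only for illustration:

```python
from enum import Enum


class Answer(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"


# Hypothetical set of statements we know to be true.
KNOWN_FACTS = {
    ("item-1", "heldBy", "Library A"),
    ("item-2", "heldBy", "Library B"),
}


def ask_closed_world(statement):
    """Closed world: anything not known to be true is treated as false."""
    return Answer.TRUE if statement in KNOWN_FACTS else Answer.FALSE


def ask_open_world(statement):
    """Open world: anything not known to be true is simply unknown."""
    return Answer.TRUE if statement in KNOWN_FACTS else Answer.UNKNOWN


if __name__ == "__main__":
    query = ("item-1", "heldBy", "Library B")
    print(ask_closed_world(query))  # Answer.FALSE
    print(ask_open_world(query))    # Answer.UNKNOWN
```

The same query gets a different answer depending on the assumption: a bounded system can say "no", while an unbounded one can only say "not that I know of".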
characteristics of an open world
characteristics of a closed/bounded world
judging where to apply each

• we need our infrastructure (especially integration technology between systems) to be open and relatively unbounded

    • the Web is still the best available foundation for this

• however, we still need to manage our resources, maintain quality and honour complex rights management commitments

• we probably need to recognise that users’ experience is often enhanced through the application of a more focussed, targeted and context-aware approach
a particular challenge
synchronisation

• how is the state of the resource maintained across an infrastructure of ‘federated’ repositories?

• if a resource is changed or deleted, how does the right-hand side aggregation know?

• note - this is based on our existing ‘harvesting’ or ‘pull’ approach

[Diagram: several Resource Collections are harvested into Aggregations, which are harvested in turn by a further Aggregation on the right-hand side. Caption: multiple harvest routes, multiple copies]
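
With a pure ‘pull’ approach, a downstream aggregation can only learn about changes by harvesting again and comparing the result with what it already holds. A minimal sketch, assuming each harvest yields a mapping from resource URI to some content fingerprint (the function name, the URIs and the fingerprints are all hypothetical):

```python
def diff_harvests(previous, current):
    """Compare two harvest snapshots, each a dict of URI -> fingerprint."""
    created = sorted(uri for uri in current if uri not in previous)
    deleted = sorted(uri for uri in previous if uri not in current)
    updated = sorted(
        uri for uri in current
        if uri in previous and current[uri] != previous[uri]
    )
    return {"created": created, "updated": updated, "deleted": deleted}


if __name__ == "__main__":
    last_week = {"http://example.org/a": "v1", "http://example.org/b": "v1"}
    today = {"http://example.org/a": "v2", "http://example.org/c": "v1"}
    print(diff_harvests(last_week, today))
```

Every hop in a chain of aggregations has to repeat this comparison, so a change or deletion reaches the right-hand side no sooner than the harvesting schedules along that route allow.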
ResourceSync

• a joint project of NISO and OAI, led by Herbert Van de Sompel of Los Alamos

• a light-weight mechanism to allow the state of web resources to be communicated between web systems

• developing a spec which builds on the sitemap specification, allowing content providers to publish changesets

• draft: http://bit.ly/WYhTz2

• Jisc have funded UK participation in this
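
To give a feel for what a sitemap-based changeset might look like to a consumer, here is a rough sketch. The XML is illustrative only: it is loosely modelled on the ResourceSync specification as later published, the namespaces, the `rs:md` element and its `change` attribute are stated from memory and should be checked against the spec, and the example.org URLs are invented.

```python
import xml.etree.ElementTree as ET

# Assumed namespaces: the sitemap schema plus ResourceSync terms.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# Illustrative change list with invented resources.
CHANGE_LIST = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="changelist" from="2013-01-03T00:00:00Z"/>
  <url>
    <loc>http://example.org/res1</loc>
    <lastmod>2013-01-03T11:00:00Z</lastmod>
    <rs:md change="updated"/>
  </url>
  <url>
    <loc>http://example.org/res2</loc>
    <lastmod>2013-01-03T13:00:00Z</lastmod>
    <rs:md change="deleted"/>
  </url>
</urlset>
""".strip()


def read_changes(xml_text):
    """Return (location, change type) pairs from a change list document."""
    root = ET.fromstring(xml_text)
    changes = []
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", namespaces=NS)
        md = url.find("rs:md", NS)
        change = md.get("change") if md is not None else None
        changes.append((loc, change))
    return changes


if __name__ == "__main__":
    for loc, change in read_changes(CHANGE_LIST):
        print(change, loc)
```

The point of the approach is that the provider states what has changed, so a consumer no longer has to re-harvest everything and work it out for itself.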
The sun shone, having no
alternative, on the nothing
new.
                    Murphy, Samuel Beckett
A distributed system is one
in which the failure of a
computer you didn't even
know existed can render
your own computer unusable
                    Leslie Lamport
a common ‘anti-pattern’

• as a developer, I have no reason to trust that these APIs are any good.

• after all, the service provider doesn’t seem to trust them for their own application...

[Diagram: some aggregated data of broad interest and potential usefulness sits at the centre. Several APIs are offered to future 3rd-party developers, each imagined with their own UI and end-users, while the provider's own UI and end-user appear separately from those APIs. Legend: certainty, belief, speculation]
a better pattern

• As a developer, I’m more likely to trust this pattern.

• the content provider is using their own API to deliver their own application.

• they have a vested interest!

[Diagram: a single API sits on top of some aggregated data of broad interest and potential usefulness. A 3rd-party app and a focussed app are both built on that API, each with its own UI and end-users. Legend: certainty, belief, speculation]
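
The better pattern can be sketched as a provider whose own application has no private route to the data. Everything here is hypothetical (the records, the function names and the search behaviour); it only illustrates the shape of the arrangement:

```python
# Invented records standing in for the provider's aggregated data.
_RECORDS = [
    {"id": "r1", "title": "Map of Bath", "year": 1810},
    {"id": "r2", "title": "Map of Bristol", "year": 1828},
]


def api_search(term):
    """The single public API: the only route to the data for every client."""
    term = term.lower()
    return [dict(r) for r in _RECORDS if term in r["title"].lower()]


def provider_ui(term):
    """The provider's own focussed application, built on the public API."""
    return [f"{r['title']} ({r['year']})" for r in api_search(term)]


def third_party_app(term):
    """A third-party application using exactly the same API."""
    return {r["id"] for r in api_search(term)}


if __name__ == "__main__":
    print(provider_ui("bath"))
    print(third_party_app("map"))
```

If `api_search` breaks or returns poor results, the provider's own application suffers first, which is the vested interest the slide refers to.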
APIs are not best thought of
as machine-to-machine
interfaces

APIs are interfaces for
developers
messages from developers to content-providers

• These are from yesterday’s developer day held here at the BL in support of this summit:

• please don’t build elaborate APIs which do not allow us to see all of the data, or its extent. It’s not that we simply want to download all the data - but we do need to see what we’re dealing with

• if you give us access to incomplete data (perhaps because you’re worried about revealing poor data quality), then we will tend to either abandon our attempts to use it or we will ‘fill in the gaps’ with data from elsewhere. So offering an API which delivers incomplete data is usually self-defeating

• the implicit bargain, made explicit:
   • give us access to the data as soon as possible and we will do some of the work to process it so it is fit for some new purpose - and we will happily share this code with you
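
One way to read the request to see ‘all of the data, or its extent’ is that every response should say how much data there is, even when it returns only a slice. A minimal sketch, with a hypothetical `list_records` function and invented records:

```python
# Seven invented records standing in for a provider's collection.
_RECORDS = [{"id": f"rec-{n}"} for n in range(1, 8)]


def list_records(offset=0, limit=3):
    """Return one page of records plus the full extent of the collection."""
    page = _RECORDS[offset:offset + limit]
    return {
        "total": len(_RECORDS),  # the extent is visible on every response
        "offset": offset,
        "count": len(page),
        "records": page,
    }


def walk_all(limit=3):
    """Page through everything, which is possible because nothing is hidden."""
    offset, seen = 0, []
    while True:
        page = list_records(offset, limit)
        seen.extend(page["records"])
        offset += page["count"]
        if page["count"] == 0 or offset >= page["total"]:
            return seen


if __name__ == "__main__":
    first = list_records()
    print(first["count"], "of", first["total"], "records in the first page")
    print(len(walk_all()), "records in total")
```

A developer can tell from the first call what they are dealing with, without having to download the whole collection.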
Questions for the parallel sessions

1. Which emerging technologies do we need to focus on in 2013?

2. Do we still need to aggregate?

3. What does data quality stop us doing?
Which emerging technologies do we need to
focus on in 2013?

• Graphs: Context (not Content) is king

• both Facebook and Google are betting heavily on graph technologies

• closer to home - so are content providers like the BBC

• linking these is an interesting challenge

• databases based on a graph model give the potential for a richer understanding about entities (users!)

• instrumentation in personal devices makes more context available (e.g. geo-location).
Do we still need to aggregate?

yes.

• to address systems/network latency - provide a cache

• to showcase!

• for ‘Web Scale concentration’

• network effects, if user-facing services are also developed

• to create middleman business opportunities

• as infrastructure to support locally developed services

• as an approach to preservation
What does data quality stop us doing?

• interpreted as: “what does a concern for data quality stop us doing?”
   • it stops us from releasing data early

• interpreted as: “what does poor/uncertain data quality stop us doing?”
   • it erodes trust, which impacts the likelihood of someone doing something worthwhile with our data

• reconciling these concerns is a major challenge for us.
thank you!

Paul Walk
paul@paulwalk.net
@paulwalk
http://www.paulwalk.net

Technical Coping Strategies for Resource Discovery - Paul Walk
