Mashspa

Open Bibliography and standards.

Published in: Technology

  1. 1. Open Bibliography, And why it shouldn't have to exist. Ben O'Steen “Mashspa” Mashed Libraries, Bath 29/10/2010 CC-By
  2. 2. Morning, (don't worry, I'll be quick...)
  3. 3. Urgh, “Open” - what does that mean?
  4. 4. Publishing bibliographic information under a permissive license to encourage indexing, re-use, and re-purposing.
  5. 5. But.... why?
  6. 6. In essence, an open bibliography is all about Advertising
  7. 7. Bibliographic info allows you to ● Identify and find an item you know you want
  8. 8. Bibliographic info allows you to ● Identify and find an item you know you want, ● Discover related items or items you believe you want
  9. 9. Bibliographic info allows you to ● Identify and find an item you know you want, ● Discover related items or items you believe you want ● Serendipitously discover items you would like without knowing they might exist ● And so on.
  10. 10. Bibliographic info allows you to ● Identify and find an item you know you want, ● Discover related items or items you believe you want ● Serendipitously discover items you would like without knowing they might exist ● And so on. Requires Increasing Investment!
  11. 11. Advertising 'proverb' You never spend money on advertising; you invest with an expectation of return on investment
  12. 12. To maximise returns, you maximise the audience.
  13. 13. Should the advertising target 'b2b' or 'consumers'?
  14. 14. One thing I am not saying must be necessary...
  15. 15. But, by not making bibliographic data open, you limit the audience. (You also limit the data quality, but more on that later.)
  16. 16. “Can't I just scrape sites and reuse it anyway? It's just facts after all...”
  17. 17. “Directive 96/9/EC of the European Parliament and of the Council of 11 March 1996 on the legal protection of databases” http://is.gd/gqkqb
  18. 18. Databases have in the past been defended using Copyright laws. This new law codifies a new protection based on “sui generis”* rights, rights earned by the “sweat of the brow” * http://en.wikipedia.org/wiki/Sui_generis
  19. 19. So far, no one seems to have any evidence that this encouraged database-based economies. There is evidence that it 'awarded' unending monopolies on existing datasets.
  20. 20. Due to fluffy wording, it is a timebomb. It is a right, like copyright, that doesn't need to be defended and can be assumed for almost any aggregation.
  21. 21. When we asked UK PubMedCentral if we could reproduce the bibliographic data they share through their OAI-PMH service, they said “Generally, No”* (*me paraphrasing that they had non-transferable licenses and contracts, yada yada. Their 'OA subset' of 1876 journals is available, however, mainly BMC.)
  22. 22. From OAI-PMH specification: * Data Providers administer systems that support the OAI-PMH as a means of exposing metadata; and * Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added services. http://www.openarchives.org/OAI/openarchivesprotocol.html
  23. 23. “… Service Providers use metadata harvested via the OAI-PMH as a basis for building value-added services.” And the survey said...
  24. 24. X
  25. 25. Open Bibliographic principles http://openbiblio.net/2010/10/15/principles-for-open-bibliographic-data/
  26. 26. 1 - When publishing data make an explicit and robust license statement.
  27. 27. 2 -Use a recognized waiver or license that is appropriate for metadata.
  28. 28. 3 - If you want your data to be effectively used and added to by others it should be open … – in particular non-commercial and other restrictive clauses should not be used.
  29. 29. 4 - We strongly recommend explicitly placing bibliographic data in the Public Domain via PDDL or CC0.
  30. 30. 5 - We strongly urge creators of bibliographic metadata to explicitly either dedicate this to the public domain or use an open licence.
  31. 31. Bibliographic Sliding Scale. Identify: Title, Date, any identifiers, Publisher, Container (e.g. Journal), Author names, etc. Discover: Keywords, Abstract, Author Identifiers, etc. Serendipity: Citations, citing text, usage data, supplemental data, etc.
  32. 32. Bibliographic Sliding Scale. Identify → Discover → Serendipity: Increasing Investment BUT Increased Chance of usage.
  33. 33. “So, we just pick a standard and publish and we'll reap all the benefits, right?”
  34. 34. Erm, no. For three main reasons.
  35. 35. #1 “Where there is human input, there is interpretation” Meanings of words and usage of fields have changed over time.
  36. 36. #1 (cont.) Interchange standards don't make the information any more understandable. Someone interprets them.
  37. 37. #2 Data has been entered and curated without large-scale sharing as a focus. Lots of implicit, contextual info left out.
  38. 38. #3 Data quality is typically poor with formally closed datasets.
  39. 39. For #1 - Collisions caused by interpretation can really only be solved by sharing data and seeing how bad things are.
  40. 40. Standards and interoperability: “The first follower transforms a lone nut into a leader” - Derek Sivers' TED Talk http://www.ted.com/talks/lang/eng/derek_sivers_how_to_start_a_movement.html
  41. 41. Video: http://www.youtube.com/watch?v=GA8z7f7a2Pk The man dancing is joined by one or two, but he is still doing his own thing. Eventually a group decides to join him, and the group grows. The quality of the dance isn't important, but the community dancing along with it is. And so it is with standards.
  42. 42. For #2 (implicit info), provenance and the source of data gives us crucial clues. Due to #1, I remain unconvinced that this information can ever be totally machine-readable.
  43. 43. And for #3, misleading or incorrect data... … um. No easy answers – we just don't have the info.
  44. 44. The data clean-up process is going to be probabilistic. (We cannot be sure - by definition - that we are 'accurate' when we de-duplicate or disambiguate.)
  45. 45. Typical methods then: Natural Language Processing; Machine learning techniques; and String Metrics and old skool record deduplication.
  46. 46. I <3 String Metrics and old skool record deduplication (out of the 3)
  47. 47. http://staffwww.dcs.shef.ac.uk/people/S.Chapman/stringmetrics.html http://is.gd/gqOjQ
  48. 48. Old skool record linkage: “Fellegi-Sunter” - probabilistic record linkage (PRL). It's not a great model, but it's achievable. Machine-learning requires a reasonably large golden set. (http://en.wikipedia.org/wiki/Record_linkage)
  49. 49. PRL is not great in itself, BUT it does lend itself to Map-Reduce style operations, AND it's a great way to filter down to those records that really do need to be compared by eye.
  50. 50. http://datamining.anu.edu.au/projects/linkage.html “Record or data linkage techniques are used to link together records which relate to the same entity (e.g. patient, customer, household) in one or more data sets where a unique identifier for each entity is not available in all or any of the data sets to be linked.” ANU's Febrl python code
  51. 51. Bibliographic directions: So far, much effort has been directed at the Works; we need to put much more effort into their Networks.
  52. 52. Networks?
  53. 53. Networks? ● A cites B
  54. 54. Networks? ● A cites B ● Works by a given (identified) Author ● Works cited by a given Author ● Works citing articles that have since been disproved, redacted or withdrawn. ● Co-authors ● And many more connections we've not even considered yet ('betweenness', 'centrality', etc)
  55. 55. In Summary, ● Accessible Bibliography as Advertising. ● Bibliography authors choose how they wish to invest to gain usage and real impact. ● Closed data has a much slimmer chance of increasing in quality ● Open data makes it easier to find problems and to improve the data ● Benefits will come from developing networks of information ● Don't get hung up on standards! A lone nut with followers doing something copyable is enough!
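
The record-linkage idea from slides 44 to 49 can be sketched roughly as follows. This is a minimal illustration, not the speaker's code and not Febrl's API: it combines one string metric (Levenshtein edit distance, picked here only as an example) with Fellegi-Sunter style agreement weights. The field names, the m/u probabilities, the agreement cut-off and the two thresholds are all made-up assumptions; in practice they would be estimated from the data.

```python
import math


def levenshtein(a, b):
    """Edit distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a, b):
    """Normalised similarity in [0, 1]; 1.0 means identical after cleanup."""
    a, b = a.strip().lower(), b.strip().lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


# Hypothetical per-field probabilities (assumptions, not measured values):
#   m = P(field agrees | records describe the same work)
#   u = P(field agrees | records describe different works)
FIELDS = {
    "title": (0.95, 0.01),
    "author": (0.90, 0.05),
    "year": (0.98, 0.10),
}
AGREE_AT = 0.85          # assumed similarity needed to count as agreement
UPPER, LOWER = 8.0, 0.0  # assumed decision thresholds on the total weight


def match_weight(rec_a, rec_b):
    """Sum of log2 agreement/disagreement weights over the shared fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        va, vb = rec_a.get(field, ""), rec_b.get(field, "")
        if not va or not vb:
            continue  # a missing value is treated as no evidence either way
        if similarity(va, vb) >= AGREE_AT:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(rec_a, rec_b):
    """Return 'match', 'non-match', or 'review' (compare by eye)."""
    weight = match_weight(rec_a, rec_b)
    if weight >= UPPER:
        return "match"
    if weight <= LOWER:
        return "non-match"
    return "review"


if __name__ == "__main__":
    a = {"title": "Principles for Open Bibliographic Data",
         "author": "O'Steen, Ben", "year": "2010"}
    b = {"title": "Principles for open bibliographic data.",
         "author": "O'Steen, B.", "year": "2010"}
    print(classify(a, b), round(match_weight(a, b), 2))
```

Each pair of records is scored independently of every other pair, which is why this style of comparison fits Map-Reduce style processing. The middle band between the two thresholds is the set of records left over for comparison by eye.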
