Metadata based statistics for DSpace

The DSpace infrastructure for logging page-views and downloads has been limited to aggregations on communities, collections and items. While this already provides a wealth of aggregated information that is impossible to retrieve using Google Analytics, it still does not assist a repository manager in addressing questions such as:

“How many downloads did Professor X get through Google Scholar last month?”

Because authors are represented as metadata on items, tackling this challenge effectively means opening the potential to aggregate pageview and download statistics on any metadata field in the repository.

By the time of the conference, functionality that addresses this need will be available as part of @mire’s Content and Usage analysis module. The metadata based usage statistics were realized in co-development with the World Bank.

Presentation created by Lieven Droogmans, Art Lowel and Ignace Deroost.
Presented at Open Repositories 2015 by Ignace Deroost.

  1. 1. www.atmire.com Metadata based usage statistics
  2. 2. OVERVIEW 1. Why DSpace statistics? 2. Usage event vs. Item metadata 3. Generating metadata based statistics 4. Linking metadata to usage events 5. Performance 6. Problem solved?
  3. 3. 1 - WHY DSPACE STATISTICS? Statistics solution that knows DSpace: • Structure: “Which are the most downloaded bitstreams in a collection?” • Metadata: “Who are the most popular authors in terms of downloads?”
  4. 4. USAGE EVENT VS. ITEM METADATA 2 types of metadata: • Usage event metadata: additional information about the usage event • Item metadata: additional information about the target of the usage event
  5. 5. USAGE EVENT METADATA Additional information about the usage event. Not related to the repository. Also possible with other statistics solutions: • IP address • Country • User Agent • HTTP Referrer • ...
  6. 6. ITEM METADATA Relate usage event to information stored in your repository. Allows statistics queries based on item metadata. → Not possible with a statistics solution that is not tied to the repository.
  7. 7. GENERATING METADATA BASED STATISTICS How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
  8. 8. LINKING METADATA TO USAGE EVENTS Solr Query http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
  9. 9. LINKING METADATA TO USAGE EVENTS • facet.field=dateYearMonth → group by the field dateYearMonth • fq=type:+0 → only include bitstream downloads • fq=bundleName:ORIGINAL → only include files in bundle “ORIGINAL” • fq=-isBot:true → filter out all bot statistics • fq=-isInternal:true → filter out all internal statistics • fq=time:[2014-07-01+TO+2015-06-06] → only include statistics between Jul 1st 2014 and Jun 6th 2015 • fq=+(author_mtdt:Barnes,+Douglas+F.)+ → only include statistics for items by author Barnes, Douglas F.
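The filters above can be assembled programmatically. The following is a minimal sketch, not the module's actual code: the function name and its parameters are hypothetical, the field names are taken from the slides, and the phrase quoting around the author value is an assumption (the slide shows the value unquoted). `wt` defaults to `xml` here for readability, whereas the slide's query uses `javabin`.

```python
from urllib.parse import urlencode


def build_author_downloads_url(base_url, author, start, end, limit=24, wt="xml"):
    """Build a Solr URL that counts an author's downloads per month.

    Hypothetical helper; field names (dateYearMonth, author_mtdt, ...) come
    from the slides, everything else is illustrative.
    """
    # Escape backslashes and quotes so the author can be sent as one phrase.
    phrase = author.replace("\\", "\\\\").replace('"', '\\"')
    params = [
        ("q", "*:*"),
        ("rows", "0"),                       # only facet counts are needed
        ("wt", wt),
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),    # group by month
        ("facet.mincount", "1"),
        ("facet.limit", str(limit)),
        ("fq", "type:0"),                    # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f'author_mtdt:"{phrase}"'),   # quoting is an assumption
    ]
    # A list of pairs lets the fq parameter repeat, as Solr expects.
    return f"{base_url}/select?{urlencode(params)}"


url = build_author_downloads_url(
    "http://localhost:8080/solr/statistics",
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs rather than a dict is what allows several `fq` filters in one request.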
  10. 10.
      <response>
        <lst name="responseHeader"> ... </lst>
        <result name="response" numFound="164" start="0"></result>
        <lst name="facet_counts">
          <lst name="facet_fields">
            <lst name="dateYearMonth">
              <int name="2014-07">15</int>
              <int name="2014-08">19</int>
              <int name="2014-09">15</int>
              <int name="2014-10">10</int>
              <int name="2014-11">7</int>
              <int name="2014-12">13</int>
              <int name="2015-01">13</int>
              <int name="2015-02">15</int>
              <int name="2015-03">21</int>
              <int name="2015-04">22</int>
              <int name="2015-05">12</int>
              <int name="2015-06">2</int>
            </lst>
          </lst>
        </lst>
      </response>
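A response like the one above can be reduced to a month-to-count mapping with the Python standard library. This is only an illustrative sketch: the helper name is hypothetical, and the sample string reproduces the response from the slide, including its elided header.

```python
import xml.etree.ElementTree as ET

# Response from the slide; the header is elided there and stays elided here.
SAMPLE_RESPONSE = """
<response>
  <lst name="responseHeader"> ... </lst>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>
"""


def facet_counts(xml_text, field):
    """Return {facet value: count} for one facet field of a Solr XML response."""
    root = ET.fromstring(xml_text)
    path = (
        "./lst[@name='facet_counts']"
        "/lst[@name='facet_fields']"
        f"/lst[@name='{field}']/int"
    )
    return {node.get("name"): int(node.text) for node in root.findall(path)}


counts = facet_counts(SAMPLE_RESPONSE, "dateYearMonth")
total = sum(counts.values())
print(total)  # 164, the same as numFound in the response
```

The monthly counts add up to the `numFound` value, which is a quick sanity check that every matching event falls into exactly one month.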
  11. 11. LINKING METADATA TO USAGE EVENTS In a vanilla DSpace installation: • Usage statistics only contain bitstream IDs: no metadata • The metadata is stored in the database
  12. 12. PROPOSED SOLUTION 1. Query the database for bitstream IDs based on the author metadata 2. Use those IDs to query Solr for statistics
  13. 13. PROPOSED SOLUTION: DOWNSIDES • Two queries to answer one question • The Solr query can get very long and inefficient to execute • Inefficient but still possible
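To see why the second query grows, here is a small sketch of the filter that step 2 would need. The helper is hypothetical; it assumes download events carry the bitstream ID in the `id` field with `type:0`, as in the usage event shown later in the deck.

```python
def bitstream_filter(bitstream_ids):
    """Build a Solr filter matching download events for the given bitstreams.

    Hypothetical helper: assumes the bitstream ID is stored in the "id" field
    of events with type 0.
    """
    ids = [str(int(i)) for i in bitstream_ids]
    if not ids:
        raise ValueError("at least one bitstream ID is required")
    return f"type:0 AND id:({' OR '.join(ids)})"


small = bitstream_filter([54, 55, 56])
large = bitstream_filter(range(1, 5001))
print(small)
print(len(large), "characters for 5,000 bitstreams")
```

Every bitstream adds another clause, so a prolific author produces a filter of tens of thousands of characters. Solr also limits the number of boolean clauses per query through `maxBooleanClauses`, so very long ID lists may additionally need that setting raised or the query split up.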
  14. 14. PROPOSED SOLUTION: DOWNSIDES What if we want to show the 10 authors with the most downloads? • query the database for all authors • query Solr to get the number of usage events for each author • sort those counts, and return the 10 highest
  15. 15. PROPOSED SOLUTION: DOWNSIDES Very inefficient! • it takes a lot of queries • most of the results are thrown away: we only need the top 10
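The cost of this approach can be made concrete with a toy simulation. Everything below is hypothetical sample data: a dict stands in for the database, a list for the Solr statistics core, and the function counts how many queries the naive ranking needs.

```python
# Stand-ins for the two data stores (hypothetical sample data).
AUTHOR_BITSTREAMS = {          # "database": author -> bitstream IDs
    "Author A": [1, 2],
    "Author B": [3],
    "Author C": [4, 5],
}
DOWNLOAD_EVENTS = [1, 1, 2, 3, 4, 4, 4, 5]  # "Solr": one bitstream ID per download


def count_downloads(bitstream_ids):
    """Stands in for one Solr query filtered on a list of bitstream IDs."""
    wanted = set(bitstream_ids)
    return sum(1 for bitstream_id in DOWNLOAD_EVENTS if bitstream_id in wanted)


def top_authors_naive(n):
    """Rank authors by downloads and report how many queries that took."""
    queries = 1  # one database query to list all authors and their bitstreams
    totals = {}
    for author, ids in AUTHOR_BITSTREAMS.items():
        totals[author] = count_downloads(ids)
        queries += 1  # one Solr query per author
    ranked = sorted(totals.items(), key=lambda pair: pair[1], reverse=True)
    return ranked[:n], queries


top, query_count = top_authors_naive(2)
print(top, query_count)
```

The number of Solr queries grows with the number of authors in the repository, not with the size of the requested top list.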
  16. 16. SOLR FACETS To do a facet query: • specify “facet.field” along with the regular query • results will be grouped by the values they have for that field
  17. 17. SOLR FACETS: EXAMPLE q=type:0&facet.field=owningItem • q=type:0 → search for all usage events that are bitstream downloads • facet.field=owningItem → group these by item and count the number of records in each group
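What a facet query computes can be mimicked in a few lines of plain Python. This is a conceptual sketch on hypothetical sample events, not how Solr implements faceting internally: filter the documents, then count the values of one field.

```python
from collections import Counter

# Hypothetical usage events; owningItem is multi-valued, as in the Solr core.
EVENTS = [
    {"type": 0, "owningItem": [1652]},
    {"type": 0, "owningItem": [1652]},
    {"type": 2, "id": 1652},            # not a bitstream download
    {"type": 0, "owningItem": [1700]},
]


def facet_by(docs, matches, field):
    """Count the values of `field` over all docs accepted by `matches`."""
    counts = Counter()
    for doc in docs:
        if matches(doc):
            counts.update(doc.get(field, []))
    return counts.most_common()  # highest count first


# Equivalent in spirit to q=type:0&facet.field=owningItem
downloads_per_item = facet_by(EVENTS, lambda doc: doc["type"] == 0, "owningItem")
print(downloads_per_item)
```

Because the grouping happens inside the search engine in one pass, only the aggregated counts travel back to the client.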
  18. 18. OUR SOLUTION • Add item metadata to Solr • Use Solr’s built-in filtering and grouping
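A rough sketch of the idea, under stated assumptions: the `enrich` helper, the metadata keys `author` and `subject`, and the sample data are all hypothetical, while the `_mtdt` field suffix follows the field names shown in the slides. Once each usage event carries the item's metadata, the top authors follow from a single grouping step.

```python
from collections import Counter

# Hypothetical item metadata, as it would come from the DSpace database.
ITEM_METADATA = {
    1: {"author": ["Author A", "Author B"], "subject": ["ENERGY"]},
    2: {"author": ["Author A"], "subject": ["BIOMASS"]},
}


def enrich(event, item_metadata):
    """Copy the owning item's metadata into the usage event before indexing."""
    item_id = event["owningItem"][0]
    metadata = item_metadata.get(item_id, {})
    enriched = dict(event)
    enriched["author_mtdt"] = list(metadata.get("author", []))
    enriched["subject_mtdt"] = list(metadata.get("subject", []))
    return enriched


def top_authors(events, n):
    """Single pass over enriched events, like faceting on author_mtdt."""
    counts = Counter()
    for event in events:
        counts.update(event["author_mtdt"])
    return counts.most_common(n)


raw_events = [
    {"type": 0, "owningItem": [1]},
    {"type": 0, "owningItem": [1]},
    {"type": 0, "owningItem": [2]},
]
enriched_events = [enrich(event, ITEM_METADATA) for event in raw_events]
print(top_authors(enriched_events, 1))
```

In Solr itself, the second step would be a facet on `author_mtdt` with a facet limit of 10, so no per-author queries are needed.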
  19. 19. CHALLENGE: SIZE OF THE SOLR CORE That solution creates new challenges. Metadata is duplicated in every statistical record: • that takes up a lot of space • it needs to be kept in sync
  20. 20. SIZE OF SINGLE USAGE EVENT
      <doc>
        <str name="ip">177.21.194.80</str>
        <arr name="ip_search"><str>177.21.194.80</str></arr>
        <arr name="ip_ngram"><str>177.21.194.80</str></arr>
        <int name="type">0</int>
        <int name="id">54</int>
        <date name="time">2015-05-11T04:33:49.077Z</date>
        <str name="dateYearMonth">2015-05</str>
        <str name="dateYear">2015</str>
        <str name="continent">SA</str>
        <str name="countryCode">BR</str>
        <float name="latitude">-10.0</float>
        <float name="longitude">-55.0</float>
        <arr name="bundleName"><str>ORIGINAL</str></arr>
        <arr name="containerBitstream"><int>54</int></arr>
        <arr name="owningItem"><int>1652</int></arr>
        <arr name="containerItem"><int>1652</int></arr>
        <arr name="owningColl"><int>14</int></arr>
        <arr name="containerCollection"><int>14</int></arr>
        <arr name="owningComm"><int>1</int></arr>
        <arr name="containerCommunity"><int>1</int></arr>
        <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
        <bool name="isBot">false</bool>
        <bool name="isInternal">false</bool>
        <str name="statistics_type">view</str>
        <long name="_version_">1501767933804675072</long>
      </doc>
      25 elements
  21. 21. SIZE OF SINGLE USAGE EVENT WITH METADATA
      <doc>
        <str name="ip">177.21.194.80</str>
        ...
        <arr name="author_mtdt">
          <str>Khandker, Shahidur R.</str>
          <str>Barnes, Douglas F.</str>
          <str>Samad, Hussain A.</str>
        </arr>
        <arr name="subject_mtdt">
          <str>ACCESS TO LIGHTING</str>
          <str>ACCESS TO MODERN ENERGY</str>
          <str>AGRICULTURAL LAND</str>
          <str>AGRICULTURAL RESIDUE</str>
          <str>AIR CONDITIONERS</str>
          <str>AIR POLLUTION</str>
          <str>ALTERNATIVE ENERGY</str>
          <str>ALTERNATIVE SOURCES OF ENERGY</str>
          <str>APPROACH</str>
          <str>ATMOSPHERE</str>
          <str>AVAILABILITY</str>
          <str>BASIC ENERGY</str>
          <str>BIOMASS</str>
          <str>BIOMASS BURNING</str>
          <str>BIOMASS COLLECTION</str>
          <str>BIOMASS CONSUMPTION</str>
          <str>BIOMASS ENERGY</str>
          ...
          <str>WORLD ENERGY</str>
          <str>WORLD ENERGY OUTLOOK</str>
        </arr>
        ...
      </doc>
      3 authors, 140 subjects
  22. 22. KEEPING METADATA IN SYNC When the metadata of an item changes (• a mistake was corrected • extra info was added), the statistical records for that item need to be updated as well.
  23. 23. KEEPING METADATA IN SYNC Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events. • That takes time • During that time, it takes longer to view other statistical reports
  24. 24. PERFORMANCE • Size of single usage event • Metadata updates • Number of events • Live search queries
  25. 25. PERFORMANCE ENHANCEMENT: SYNCING Try to keep the load created by syncing metadata in the statistics as low as possible: • only sync while Solr is idle • interrupt the operation when a search request can’t be handled in time • interrupt the operation when Solr’s memory usage nears its max
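One way to picture these rules is a batch loop that checks all three conditions before each batch and stops as soon as one fails. This is a sketch only: the function, the batch size and the three callbacks are hypothetical placeholders for whatever signals the real module reads from Solr.

```python
def sync_item_events(event_ids, update_batch, solr_is_idle,
                     search_is_delayed, memory_near_max, batch_size=500):
    """Update usage events in batches and back off when Solr is under load.

    All callbacks are hypothetical placeholders. Returns the event IDs that
    still need updating, so the caller can resume later.
    """
    pending = list(event_ids)
    while pending:
        # Stop before the next batch if any of the three conditions fails.
        if not solr_is_idle() or search_is_delayed() or memory_near_max():
            break
        batch, pending = pending[:batch_size], pending[batch_size:]
        update_batch(batch)
    return pending


# Example: 12,000 events to update, Solr becomes busy after three batches.
updated = []
idle_checks = iter([True, True, True, False])
remaining = sync_item_events(
    range(12000),
    update_batch=updated.extend,
    solr_is_idle=lambda: next(idle_checks),
    search_is_delayed=lambda: False,
    memory_near_max=lambda: False,
)
print(len(updated), len(remaining))
```

Returning the unfinished IDs rather than blocking keeps the sync resumable, which matters when an item has thousands of usage events.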
  26. 26. PERFORMANCE ENHANCEMENT: CACHING • store generated reports in a separate Solr core • retrieving them is very fast • invalidate cached reports after a set time (e.g. 24 hours)
  27. 27. PERFORMANCE ENHANCEMENT: CACHING Don’t delete expired cached reports. If a user requests a report whose cached copy has expired → show the outdated version. In the meantime → generate a new version. Automatically show the new report when it’s done.
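The behaviour described on these two slides resembles a "serve stale, then refresh" cache. The sketch below is an assumption-laden illustration: it keeps reports in an in-memory dict instead of a separate Solr core, the class and method names are hypothetical, and the refresh is an explicit call where a real system would presumably run it in the background.

```python
class ReportCache:
    """Serve cached reports even when expired, and queue them for a refresh."""

    def __init__(self, generate, ttl_seconds=24 * 60 * 60):
        self._generate = generate   # function: report key -> report
        self._ttl = ttl_seconds
        self._entries = {}          # key -> (report, created_at)
        self._stale = set()         # keys waiting for regeneration

    def get(self, key, now):
        entry = self._entries.get(key)
        if entry is None:
            # Cache miss: there is nothing to show yet, so generate right away.
            report = self._generate(key)
            self._entries[key] = (report, now)
            return report
        report, created_at = entry
        if now - created_at > self._ttl:
            # Expired: still return the old report, but schedule a new version.
            self._stale.add(key)
        return report

    def refresh_stale(self, now):
        """Regenerate every report that was requested after it expired."""
        for key in list(self._stale):
            self._entries[key] = (self._generate(key), now)
        self._stale.clear()


generated = []


def generate_report(key):
    generated.append(key)
    return f"{key}-v{len(generated)}"


cache = ReportCache(generate_report, ttl_seconds=10)
first = cache.get("top-authors", now=0)     # miss: generated on the spot
stale = cache.get("top-authors", now=20)    # expired: old version is served
cache.refresh_stale(now=20)
fresh = cache.get("top-authors", now=21)    # new version is now available
print(first, stale, fresh)
```

The user never waits for an expired report to be rebuilt; only a true cache miss pays the full generation cost.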
  28. 28. EXAMPLE: CACHE MISS
  29. 29. EXAMPLE: CACHE MISS
  30. 30. PROBLEM SOLVED? • Additional complexity • Number of usage events keeps growing • Name variants: different names for one author
  31. 31. NAME VARIANTS USE CASE “Who are the most popular authors in terms of downloads?”
  32. 32. https://openknowledge.worldbank.org/most-popular/author
  33. 33. 3 name variants: • Ferreira, Francisco H. G. • Ferreira, Francisco H.G. • Ferreira, Francisco
  34. 34. SOLUTION FOR NAME VARIANTS Include all name variants in the Solr query: author_mtdt: (Ferreira, Francisco H. G.) OR (Ferreira, Francisco H.G.) OR (Ferreira, Francisco)
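Such a filter can be generated from a list of known variants. The helper below is a hypothetical sketch; it quotes each variant as a phrase, which is an assumption on top of the slide (the slide shows the variants in parentheses without quotes), so that spaces and commas inside a name are not read as separate terms.

```python
def escape_phrase(value):
    """Escape backslashes and double quotes inside a quoted Solr phrase."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def name_variant_filter(field, variants):
    """Build one filter that matches any of the given name variants."""
    if not variants:
        raise ValueError("at least one name variant is required")
    phrases = " OR ".join(f'"{escape_phrase(variant)}"' for variant in variants)
    return f"{field}:({phrases})"


variant_filter = name_variant_filter(
    "author_mtdt",
    ["Ferreira, Francisco H. G.", "Ferreira, Francisco H.G.", "Ferreira, Francisco"],
)
print(variant_filter)
```

The list of variants has to come from somewhere, for example an authority file, which is why the unique-ID alternative on the next slide is attractive.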
  35. 35. ALTERNATIVE SOLUTION If you have unique IDs (e.g. ORCID): index them, and search on them instead
  36. 36. www.atmire.com Thank you! Questions?
