The DSpace infrastructure for logging page-views and downloads has been limited to aggregations on communities, collections and items. While this already provides a wealth of aggregated information that is impossible to retrieve using Google Analytics, it still does not assist a repository manager in addressing questions such as:
“How many downloads did Professor X get through Google Scholar last month?”
Because authors are represented as metadata on items, tackling this challenge effectively means opening the potential to aggregate pageview and download statistics on any metadata field in the repository.
By the time of the conference, functionality that addresses this need will be available as part of @mire’s Content and Usage analysis module. The metadata-based usage statistics were realized in co-development with the World Bank.
Presentation created by Lieven Droogmans, Art Lowel and Ignace Deroost.
Presented at Open Repositories 2015 by Ignace Deroost.
2. OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
3. WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”
4. USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
5. USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
6. ITEM METADATA
Relate usage event to information stored in your repository.
Allows statistics queries based on item metadata.
→ Not possible with a statistics solution that is not tied to the repository.
7. GENERATING METADATA BASED STATISTICS
How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
14. LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include statistics that are by
author Barnes, Douglas F.
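The parameters above can be assembled into a single request. The following is a minimal Python sketch, not the module's actual code. The endpoint URL, the `q`, `rows` and `facet` parameters, and the phrase quoting of the author name are assumptions that do not appear on the slide.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions.
SOLR_SELECT_URL = "http://localhost:8983/solr/statistics/select"


def quote_phrase(value):
    """Wrap a value in double quotes, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def build_author_download_query(author, start, end):
    """Return the URL-encoded query string for one author's downloads per month."""
    params = [
        ("q", "*:*"),                      # assumption: match all, narrow with fq
        ("rows", "0"),                     # assumption: only facet counts are needed
        ("facet", "true"),                 # assumption: faceting switched on explicitly
        ("facet.field", "dateYearMonth"),  # group by the field dateYearMonth
        ("fq", "type:0"),                  # only include bitstream downloads
        ("fq", "bundleName:ORIGINAL"),     # only include files in bundle ORIGINAL
        ("fq", "-isBot:true"),             # filter out all bot statistics
        ("fq", "-isInternal:true"),        # filter out all internal statistics
        ("fq", f"time:[{start} TO {end}]"),
        # assumption: the name is sent as a quoted phrase
        ("fq", f"author_mtdt:{quote_phrase(author)}"),
    ]
    return urlencode(params)


query = build_author_download_query("Barnes, Douglas F.", "2014-07-01", "2015-06-06")
print(f"{SOLR_SELECT_URL}?{query}")
```

`urlencode` turns spaces into `+`, which is why the slide shows values such as `Barnes,+Douglas+F.`.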
16. LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no metadata
• The metadata is stored in the database
17. PROPOSED SOLUTION
1. Query the database for bitstream IDs based on the author metadata
2. Use those IDs to query Solr for statistics
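The two steps can be sketched as follows. This is an illustration only: the table `bitstream_author` is a hypothetical stand-in for the real DSpace schema, and the Solr field name `id` is an assumption.

```python
import sqlite3


def bitstream_ids_for_author(conn, author):
    """Step 1: look up the bitstream IDs linked to an author in the database."""
    rows = conn.execute(
        "SELECT bitstream_id FROM bitstream_author "
        "WHERE author = ? ORDER BY bitstream_id",
        (author,),
    ).fetchall()
    return [row[0] for row in rows]


def solr_id_filter(bitstream_ids):
    """Step 2: turn the IDs into one Solr filter; it grows with every ID."""
    if not bitstream_ids:
        return None  # nothing to query for
    return "id:(" + " OR ".join(str(i) for i in bitstream_ids) + ")"


# Hypothetical in-memory data, only to make the sketch runnable.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE bitstream_author (bitstream_id INTEGER, author TEXT)")
conn.executemany(
    "INSERT INTO bitstream_author VALUES (?, ?)",
    [(12, "Barnes, Douglas F."), (15, "Barnes, Douglas F."), (31, "Samad, Hussain A.")],
)

ids = bitstream_ids_for_author(conn, "Barnes, Douglas F.")
print(solr_id_filter(ids))  # id:(12 OR 15)
```

An author with thousands of bitstreams produces a filter with thousands of OR clauses.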
18. PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and inefficient to execute
• Inefficient but still possible
19. PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query Solr to get the number of usage events for each author
• sort those counts, and return the 10 highest
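A short Python sketch of these three steps shows where the cost comes from. The function `count_downloads` is a hypothetical placeholder for one Solr query per author, and the numbers are made up.

```python
def top_authors_naive(authors, count_downloads, limit=10):
    """Ask for one count per author, then sort and keep the highest."""
    counts = [(author, count_downloads(author)) for author in authors]
    counts.sort(key=lambda pair: pair[1], reverse=True)
    return counts[:limit]


# Made-up counts standing in for the statistics core.
fake_counts = {"Author A": 120, "Author B": 45, "Author C": 300}
calls = []


def fake_count_downloads(author):
    calls.append(author)  # every call represents one Solr round trip
    return fake_counts[author]


top = top_authors_naive(fake_counts.keys(), fake_count_downloads, limit=2)
print(top)         # [('Author C', 300), ('Author A', 120)]
print(len(calls))  # 3
```

The number of queries equals the number of authors in the repository, regardless of how few results are shown.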
21. SOLR FACETS
To do a facet query:
• specify “facet.field” along with the regular query
• results will be grouped by the values they have for that field
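In Solr's default JSON response, the counts for a facet field come back as a flat list that alternates value and count. A minimal Python sketch of reading that list follows; the sample response and its numbers are made up for illustration.

```python
def facet_counts(response, field):
    """Pair up the flat [value, count, value, count, ...] list for one field."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


# Made-up response fragment in the shape Solr returns by default.
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "dateYearMonth": ["2014-07", 31, "2014-08", 27, "2014-09", 40],
        }
    }
}

for month, downloads in facet_counts(sample_response, "dateYearMonth"):
    print(month, downloads)
```

Adding `facet.limit=10` to a facet on an author field should return only the ten values with the highest counts, since facets are sorted by count when a limit is set.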
23. OUR SOLUTION
• Add item metadata to Solr
• Use built-in filtering and grouping
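Adding item metadata to a usage event could look like the Python sketch below. The function and the input dictionaries are hypothetical; only the `_mtdt` field suffix is taken from the Solr documents in this presentation.

```python
def enrich_usage_event(event, item_metadata):
    """Return a copy of the usage event with item metadata fields added."""
    enriched = dict(event)
    for field, values in item_metadata.items():
        # Multi-valued Solr field named after the metadata field.
        enriched[f"{field}_mtdt"] = list(values)
    return enriched


event = {"ip": "177.21.194.80", "type": 0}
metadata = {"author": ["Khandker, Shahidur R.", "Barnes, Douglas F."]}

print(enrich_usage_event(event, metadata))
```

Once every usage event carries these fields, filtering and faceting on them needs no database lookup.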
24. CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges:
• Metadata is duplicated in every statistical record
• That takes up a lot of space
• It needs to be kept in sync
26. SIZE OF SINGLE USAGE EVENT WITH METADATA
<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
<str>Khandker, Shahidur R.</str>
<str>Barnes, Douglas F.</str>
<str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
<str>ACCESS TO LIGHTING</str>
<str>ACCESS TO MODERN ENERGY</str>
<str>AGRICULTURAL LAND</str>
<str>AGRICULTURAL RESIDUE</str>
<str>AIR CONDITIONERS</str>
<str>AIR POLLUTION</str>
<str>ALTERNATIVE ENERGY</str>
<str>ALTERNATIVE SOURCES OF ENERGY</str>
<str>APPROACH</str>
<str>ATMOSPHERE</str>
<str>AVAILABILITY</str>
<str>BASIC ENERGY</str>
<str>BIOMASS</str>
<str>BIOMASS BURNING</str>
<str>BIOMASS COLLECTION</str>
<str>BIOMASS CONSUMPTION</str>
<str>BIOMASS ENERGY</str>
...
<str>WORLD ENERGY</str>
<str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
3 authors
140 subjects
27. KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well
28. KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports
30. PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by syncing metadata in the statistics as low as possible:
→ only sync while Solr is idle
• interrupt the operation when a search request can’t be handled in time
• interrupt the operation when Solr’s memory usage nears its max
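One way to picture an interruptible sync is a batch loop that checks the load before every batch. This Python sketch is an assumption about the general shape, not the module's implementation: `update_batch` and `solr_is_busy` are hypothetical callbacks, and the batch size is arbitrary.

```python
def sync_metadata(record_ids, update_batch, solr_is_busy, batch_size=500):
    """Update records in batches and stop early when Solr is under load.

    Returns the IDs that still need updating, so the sync can resume later.
    """
    pending = list(record_ids)
    while pending:
        if solr_is_busy():  # e.g. slow searches or memory close to the limit
            break
        batch, pending = pending[:batch_size], pending[batch_size:]
        update_batch(batch)
    return pending


# Simulated run: Solr reports load on the third check.
load_checks = iter([False, False, True])
updated = []

remaining = sync_metadata(
    range(12000),                          # 7,000 page visits + 5,000 downloads
    update_batch=updated.extend,
    solr_is_busy=lambda: next(load_checks),
    batch_size=5000,
)
print(len(updated), len(remaining))  # 10000 2000
```

Returning the unprocessed IDs lets the sync pick up where it stopped once Solr is idle again.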
32. PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report that is cached
→ show the outdated version
In the meantime
→ generate a new version
Automatically show new report when it’s done
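The caching behaviour described here resembles a "serve stale, refresh in the background" cache. The Python sketch below is a minimal illustration under that assumption. The class name, the `generate` callback and the injectable clock are all hypothetical, and pushing the fresh report to the user's screen is left out.

```python
import threading
import time


class StaleReportCache:
    """Serve cached reports even when expired and refresh them in the background."""

    def __init__(self, generate, max_age_seconds, clock=time.monotonic):
        self._generate = generate        # hypothetical report generator
        self._max_age = max_age_seconds
        self._clock = clock
        self._entries = {}               # key -> (report, created_at)
        self._refreshing = set()         # keys with a refresh in progress
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            entry = self._entries.get(key)
            if entry is not None:
                report, created_at = entry
                expired = self._clock() - created_at > self._max_age
                if expired and key not in self._refreshing:
                    self._refreshing.add(key)
                    threading.Thread(
                        target=self._refresh, args=(key,), daemon=True
                    ).start()
                return report            # possibly outdated, but immediate
        # First request for this report: nothing to show yet, so wait for it.
        return self._refresh(key)

    def _refresh(self, key):
        try:
            report = self._generate(key)
            with self._lock:
                self._entries[key] = (report, self._clock())
        finally:
            with self._lock:
                self._refreshing.discard(key)
        return report
```

Only the very first request for a report waits for generation; later requests get the cached copy at once, and an expired copy is replaced as soon as the background refresh finishes.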
40. SOLUTION FOR NAME VARIANTS
include all name variants in Solr query:
author_mtdt:
(Ferreira, Francisco H. G.) OR
(Ferreira, Francisco H.G.) OR
(Ferreira, Francisco)
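Building such a filter from a list of known variants can be sketched in Python as below. The helper is hypothetical, and quoting each variant as a phrase is an assumption; the slide groups each variant with parentheses instead.

```python
def name_variant_filter(field, variants):
    """Combine all name variants of one author into a single Solr filter."""
    if not variants:
        raise ValueError("at least one name variant is required")
    quoted = []
    for variant in variants:
        # Escape backslashes and double quotes before wrapping in quotes.
        escaped = variant.replace("\\", "\\\\").replace('"', '\\"')
        quoted.append(f'"{escaped}"')
    return f"{field}:(" + " OR ".join(quoted) + ")"


variants = [
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]
print(name_variant_filter("author_mtdt", variants))
```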