SlideShare a Scribd company logo
1 of 45
Download to read offline
www.atmire.com
Metadata based
usage statistics
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection”
Metadata
“Who are the most popular authors in terms of downloads?”
1 - WHY DSPACE STATISTICS?
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
ITEM METADATA
Relate usage event to information stored in
your repository.
Allows statistics queries based on item
metadata.
→ Not possible with a statistics solution that
is not tied to the repository.
GENERATING METADATA BASED STATISTICS
How many downloads did
author "Barnes, Douglas F.”
get in the last year, grouped
by month
LINKING METADATA TO USAGE EVENTS
Solr Query
http://localhost:8080/solr/statistics/select?
facet=true&facet.offset=0&facet.mincount=1&facet.sort=
false&q=*:*&facet.limit=24&facet.field=dateYearMonth&f
acet.method=enum&fq=bundleName:ORIGINAL&fq=type:
+0&fq=statistics_type:view&fq=-isBot:true&fq=-
isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO
+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,
+Douglas+F.)+&wt=javabin&rows=0
LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include statistics that are by
author Barnes, Douglas F.
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="164" start="0"></result>
<lst name="facet_counts">
<lst name="facet_fields">
<lst name="dateYearMonth">
<int name="2014-07">15</int>
<int name="2014-08">19</int>
<int name="2014-09">15</int>
<int name="2014-10">10</int>
<int name="2014-11">7</int>
<int name="2014-12">13</int>
<int name="2015-01">13</int>
<int name="2015-02">15</int>
<int name="2015-03">21</int>
<int name="2015-04">22</int>
<int name="2015-05">12</int>
<int name="2015-06">2</int>
</lst>
</lst>
</lst>
</response>
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no
metadata
• The metadata is stored in the database
PROPOSED SOLUTION
1. Query the database for bitstream IDs
based on the author metadata
2. Use those IDs to query solr for statistics
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The solr query can get very long and
inefficient to execute
• Inefficient but still possible
PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query SOLR to get the number of usage events
for each author
• sort those counts, and return the 10 highest
PROPOSED SOLUTION: DOWNSIDES
Very inefficient!
• do a lot of queries
• throw away most of the results: we only
need top 10
SOLR FACETS
To do a facet query:
• specify ”facet.field” along with the
regular query
• results will be grouped by the values they have
for that field
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the # records in each group
OUR SOLUTION
• Add Item metadata to SOLR.
• Use built-in filtering and grouping
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges
Metadata is duplicated in every statistical record
that takes up a lot of space
and it needs to be kept in sync
SIZE OF SINGLE USAGE EVENT
<doc>
<str name="ip">177.21.194.80</str>
<arr name="ip_search"><str>177.21.194.80</str></arr>
<arr name="ip_ngram"><str>177.21.194.80</str></arr>
<int name="type">0</int>
<int name="id">54</int>
<date name="time">2015-05-11T04:33:49.077Z</date>
<str name="dateYearMonth">2015-05</str>
<str name="dateYear">2015</str>
<str name="continent">SA</str>
<str name="countryCode">BR</str>
<float name="latitude">-10.0</float>
<float name="longitude">-55.0</float>
<arr name="bundleName"><str>ORIGINAL</str></arr>
<arr name="containerBitstream"><int>54</int></arr>
<arr name="owningItem"><int>1652</int></arr>
<arr name="containerItem"><int>1652</int></arr>
<arr name="owningColl"><int>14</int></arr>
<arr name="containerCollection"><int>14</int></arr>
<arr name="owningComm"><int>1</int></arr>
<arr name="containerCommunity"><int>1</int></arr>
<str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
<bool name="isBot">false</bool>
<bool name="isInternal">false</bool>
<str name="statistics_type">view</str>
<long name="_version_">1501767933804675072</long>
</doc>
25 elements
<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
<str>Khandker, Shahidur R.</str>
<str>Barnes, Douglas F.</str>
<str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
<str>ACCESS TO LIGHTING</str>
<str>ACCESS TO MODERN ENERGY</str>
<str>AGRICULTURAL LAND</str>
<str>AGRICULTURAL RESIDUE</str>
<str>AIR CONDITIONERS</str>
<str>AIR POLLUTION</str>
<str>ALTERNATIVE ENERGY</str>
<str>ALTERNATIVE SOURCES OF ENERGY</str>
<str>APPROACH</str>
<str>ATMOSPHERE</str>
<str>AVAILABILITY</str>
<str>BASIC ENERGY</str>
<str>BIOMASS</str>
<str>BIOMASS BURNING</str>
<str>BIOMASS COLLECTION</str>
<str>BIOMASS CONSUMPTION</str>
<str>BIOMASS ENERGY</str>
...
<str>WORLD ENERGY</str>
<str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
SIZE OF SINGLE USAGE EVENT WITH METADATA
3 authors
140 subjects
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well
KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports
PERFORMANCE
Size of single usage event
Metadata updates
Amount of events
Live search queries
PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by synching
metadata in the statistics as low as possible:
→ only sync while solr is idle
interrupt the operation when a search request
can’t be handled in time
interrupt the operation when Solr’s memory
usage nears its max
PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time
(e.g. 24 hours)
PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report that is cached
→ show the outdated version
In the mean time
→ generate a new version
Automatically show new report when it’s done
EXAMPLE: CACHE MISS
EXAMPLE: CACHE MISS
PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author
“Who are the Most
Popular Authors in terms
of downloads?”
NAME VARIANTS USE CASE
https://openknowledge.worldbank.org/most-popular/author
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco
3 name variants:
SOLUTION FOR NAME VARIANTS
include all name variants in Solr query:
author_mtdt:
(Ferreira, Francisco H. G.) OR
(Ferreira, Francisco H.G.) OR
(Ferreira, Francisco)
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID)
Index, and search for them instead
www.atmire.com
Thank you!
Questions?
Desktop view Phone view
Desktop view
Phone view
Desktop view
Phone view

More Related Content

What's hot

DSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISDSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISAndrea Bollini
 
DSpace 7 - The Power of Configurable Entities
DSpace 7 - The Power of Configurable EntitiesDSpace 7 - The Power of Configurable Entities
DSpace 7 - The Power of Configurable EntitiesAtmire
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Claudio Martella
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptxnehabsairam
 
DSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovationDSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovation4Science
 
Open Source Software in Libraries
Open Source Software in LibrariesOpen Source Software in Libraries
Open Source Software in LibrariesSukhdev Singh
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformAndrea Bollini
 
Module 1 introduction of Dspace
Module 1  introduction of DspaceModule 1  introduction of Dspace
Module 1 introduction of DspaceShehzad Ali
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark JobsDataWorks Summit
 
Introducing the New DSpace User Interface
Introducing the New DSpace User InterfaceIntroducing the New DSpace User Interface
Introducing the New DSpace User InterfaceTim Donohue
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!Databricks
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJADataWorks Summit
 
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCERELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCELibcorpio
 
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).Alexey Lesovsky
 

What's hot (20)

DSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRISDSpace standard Data model and DSpace-CRIS
DSpace standard Data model and DSpace-CRIS
 
DSpace 7 - The Power of Configurable Entities
DSpace 7 - The Power of Configurable EntitiesDSpace 7 - The Power of Configurable Entities
DSpace 7 - The Power of Configurable Entities
 
Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014Giraph at Hadoop Summit 2014
Giraph at Hadoop Summit 2014
 
SQOOP PPT
SQOOP PPTSQOOP PPT
SQOOP PPT
 
Copy of MongoDB .pptx
Copy of MongoDB .pptxCopy of MongoDB .pptx
Copy of MongoDB .pptx
 
DSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovationDSpace-CRIS, anticipating innovation
DSpace-CRIS, anticipating innovation
 
Database System Architecture
Database System ArchitectureDatabase System Architecture
Database System Architecture
 
Open Source Software in Libraries
Open Source Software in LibrariesOpen Source Software in Libraries
Open Source Software in Libraries
 
Sqoop
SqoopSqoop
Sqoop
 
DSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platformDSpace-CRIS: a CRIS enhanced repository platform
DSpace-CRIS: a CRIS enhanced repository platform
 
SRS of Library Circulation System
SRS of Library Circulation SystemSRS of Library Circulation System
SRS of Library Circulation System
 
Module 1 introduction of Dspace
Module 1  introduction of DspaceModule 1  introduction of Dspace
Module 1 introduction of Dspace
 
Dspace 7 presentation
Dspace 7 presentationDspace 7 presentation
Dspace 7 presentation
 
The Hidden Life of Spark Jobs
The Hidden Life of Spark JobsThe Hidden Life of Spark Jobs
The Hidden Life of Spark Jobs
 
Introducing the New DSpace User Interface
Introducing the New DSpace User InterfaceIntroducing the New DSpace User Interface
Introducing the New DSpace User Interface
 
How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!How Adobe Does 2 Million Records Per Second Using Apache Spark!
How Adobe Does 2 Million Records Per Second Using Apache Spark!
 
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJAEvaluation of TPC-H on Spark and Spark SQL in ALOJA
Evaluation of TPC-H on Spark and Spark SQL in ALOJA
 
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCERELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
RELATIONSHIP OF LIBRARY SCIENCE WITH ‎INFORMATION SCIENCE
 
Recommendation and the Library
Recommendation and the LibraryRecommendation and the Library
Recommendation and the Library
 
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
PostgreSQL Troubleshoot On-line, (RITfest 2015 meetup at Moscow, Russia).
 

Viewers also liked

DSpace in Belgium and beyond
DSpace in Belgium and beyondDSpace in Belgium and beyond
DSpace in Belgium and beyondBram Luyten
 
Working for Atmire
Working for AtmireWorking for Atmire
Working for AtmireBram Luyten
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrowBram Luyten
 
DSpace UI prototype dsember
DSpace UI prototype dsemberDSpace UI prototype dsember
DSpace UI prototype dsemberBram Luyten
 
Durable Item Relations for DSpace
Durable Item Relations for DSpaceDurable Item Relations for DSpace
Durable Item Relations for DSpaceBram Luyten
 
Git and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshopGit and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshopBram Luyten
 
So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?Bram Luyten
 
Límite de una función
Límite de una funciónLímite de una función
Límite de una funciónmariofriedman
 
Price list
Price listPrice list
Price listGunaep
 
Classroom20 precentation
Classroom20 precentationClassroom20 precentation
Classroom20 precentationaivanoulis
 
Rubanomics - Corporate Presentation
Rubanomics - Corporate PresentationRubanomics - Corporate Presentation
Rubanomics - Corporate PresentationRheetam Mitra
 
Private Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to SolarPrivate Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to SolarDon Buchanan
 
Presentation3-One Pound
Presentation3-One PoundPresentation3-One Pound
Presentation3-One PoundChaseTomlinson
 
Ingles isabel mª
Ingles isabel mªIngles isabel mª
Ingles isabel mªmiguelingp
 

Viewers also liked (20)

DSpace in Belgium and beyond
DSpace in Belgium and beyondDSpace in Belgium and beyond
DSpace in Belgium and beyond
 
Working for Atmire
Working for AtmireWorking for Atmire
Working for Atmire
 
DSpace repositories today and tomorrow
DSpace repositories today and tomorrowDSpace repositories today and tomorrow
DSpace repositories today and tomorrow
 
DSpace UI prototype dsember
DSpace UI prototype dsemberDSpace UI prototype dsember
DSpace UI prototype dsember
 
Durable Item Relations for DSpace
Durable Item Relations for DSpaceDurable Item Relations for DSpace
Durable Item Relations for DSpace
 
Email deposit
Email depositEmail deposit
Email deposit
 
Git and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshopGit and Github - a 90 Minute interactive workshop
Git and Github - a 90 Minute interactive workshop
 
So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?So we all have ORCID integrations, now what?
So we all have ORCID integrations, now what?
 
Enterprize aws
Enterprize awsEnterprize aws
Enterprize aws
 
Tarea unidad II
Tarea unidad  II Tarea unidad  II
Tarea unidad II
 
¿Cómo organizar una estrategia de investigación?
¿Cómo organizar una estrategia de investigación?¿Cómo organizar una estrategia de investigación?
¿Cómo organizar una estrategia de investigación?
 
Pilicolayi
PilicolayiPilicolayi
Pilicolayi
 
Límite de una función
Límite de una funciónLímite de una función
Límite de una función
 
Price list
Price listPrice list
Price list
 
Classroom20 precentation
Classroom20 precentationClassroom20 precentation
Classroom20 precentation
 
Rubanomics - Corporate Presentation
Rubanomics - Corporate PresentationRubanomics - Corporate Presentation
Rubanomics - Corporate Presentation
 
Masgnb seminar itr_2013-program
Masgnb seminar itr_2013-programMasgnb seminar itr_2013-program
Masgnb seminar itr_2013-program
 
Private Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to SolarPrivate Sector Leads Virgin Islands to Solar
Private Sector Leads Virgin Islands to Solar
 
Presentation3-One Pound
Presentation3-One PoundPresentation3-One Pound
Presentation3-One Pound
 
Ingles isabel mª
Ingles isabel mªIngles isabel mª
Ingles isabel mª
 

Similar to Metadata based statistics for DSpace

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by SalesforceThinqloud
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesCidar Mendizabal
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with HadoopJayant Shekhar
 
(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big DataAmazon Web Services
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Codemotion
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSBowenDing4
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Amazon Web Services
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondJeremyOtt5
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleBharvi Dixit
 
Integrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI EnvironmentIntegrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI EnvironmentCloudera, Inc.
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Spark Summit
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014ALTER WAY
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDSFrauke Ziedorn
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Nishant Gandhi
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Arun Karthick Manoharan
 
SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803Andreas Grabner
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...DataWorks Summit
 

Similar to Metadata based statistics for DSpace (20)

Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Handling of Large Data by Salesforce
Handling of Large Data by SalesforceHandling of Large Data by Salesforce
Handling of Large Data by Salesforce
 
Large Data Volume Salesforce experiences
Large Data Volume Salesforce experiencesLarge Data Volume Salesforce experiences
Large Data Volume Salesforce experiences
 
Unifying your data management with Hadoop
Unifying your data management with HadoopUnifying your data management with Hadoop
Unifying your data management with Hadoop
 
Database
DatabaseDatabase
Database
 
(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data(BDT313) Amazon DynamoDB For Big Data
(BDT313) Amazon DynamoDB For Big Data
 
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
Search on the fly: how to lighten your Big Data - Simona Russo, Auro Rolle - ...
 
Minerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFSMinerva: Drill Storage Plugin for IPFS
Minerva: Drill Storage Plugin for IPFS
 
Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018Success has Many Query Engines- Tel Aviv Summit 2018
Success has Many Query Engines- Tel Aviv Summit 2018
 
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & BeyondAutomated Data Synchronization: Data Loader, Data Mirror & Beyond
Automated Data Synchronization: Data Loader, Data Mirror & Beyond
 
Configuring elasticsearch for performance and scale
Configuring elasticsearch for performance and scaleConfiguring elasticsearch for performance and scale
Configuring elasticsearch for performance and scale
 
Integrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI EnvironmentIntegrating Hadoop in Your Existing DW and BI Environment
Integrating Hadoop in Your Existing DW and BI Environment
 
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
Building a Dataset Search Engine with Spark and Elasticsearch: Spark Summit E...
 
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
Séminaire Big Data Alter Way - Elasticsearch - octobre 2014
 
DataCite How To: Use the MDS
DataCite How To: Use the MDSDataCite How To: Use the MDS
DataCite How To: Use the MDS
 
Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks Customer Feedback Analytics for Starbucks
Customer Feedback Analytics for Starbucks
 
Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016Apache Eagle Strata Hadoop World London 2016
Apache Eagle Strata Hadoop World London 2016
 
SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803SharePoint TechCon 2009 - 803
SharePoint TechCon 2009 - 803
 
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
From Insights to Value - Building a Modern Logical Data Lake to Drive User Ad...
 
Hadoop HDFS.ppt
Hadoop HDFS.pptHadoop HDFS.ppt
Hadoop HDFS.ppt
 

More from Bram Luyten

Archiving Sensitive Data
Archiving Sensitive DataArchiving Sensitive Data
Archiving Sensitive DataBram Luyten
 
Update on DSpace 7
Update on DSpace 7Update on DSpace 7
Update on DSpace 7Bram Luyten
 
DSpace 5.7 and 6.1 Preview
DSpace 5.7 and 6.1 PreviewDSpace 5.7 and 6.1 Preview
DSpace 5.7 and 6.1 PreviewBram Luyten
 
DSpace Today and Tomorrow
DSpace Today and TomorrowDSpace Today and Tomorrow
DSpace Today and TomorrowBram Luyten
 
Mirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceMirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceBram Luyten
 
Dépôts institutionnels et collections spéciales en DSpace
Dépôts institutionnels et collections spéciales en DSpaceDépôts institutionnels et collections spéciales en DSpace
Dépôts institutionnels et collections spéciales en DSpaceBram Luyten
 
Secrets of the DSpace Submission Form
Secrets of the DSpace Submission FormSecrets of the DSpace Submission Form
Secrets of the DSpace Submission FormBram Luyten
 
Introduction to XMLUI and Mirage Theming for DSpace 3
Introduction to XMLUI and Mirage Theming for DSpace 3Introduction to XMLUI and Mirage Theming for DSpace 3
Introduction to XMLUI and Mirage Theming for DSpace 3Bram Luyten
 
What's in Store for DSpace 4?
What's in Store for DSpace 4?What's in Store for DSpace 4?
What's in Store for DSpace 4?Bram Luyten
 
ORCID for DSpace
ORCID for DSpaceORCID for DSpace
ORCID for DSpaceBram Luyten
 
Using Github for DSpace development
Using Github for DSpace developmentUsing Github for DSpace development
Using Github for DSpace developmentBram Luyten
 
Workshop: Google Analytics for DSpace
Workshop: Google Analytics for DSpaceWorkshop: Google Analytics for DSpace
Workshop: Google Analytics for DSpaceBram Luyten
 

More from Bram Luyten (12)

Archiving Sensitive Data
Archiving Sensitive DataArchiving Sensitive Data
Archiving Sensitive Data
 
Update on DSpace 7
Update on DSpace 7Update on DSpace 7
Update on DSpace 7
 
DSpace 5.7 and 6.1 Preview
DSpace 5.7 and 6.1 PreviewDSpace 5.7 and 6.1 Preview
DSpace 5.7 and 6.1 Preview
 
DSpace Today and Tomorrow
DSpace Today and TomorrowDSpace Today and Tomorrow
DSpace Today and Tomorrow
 
Mirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpaceMirage 2: A responsive user interface for DSpace
Mirage 2: A responsive user interface for DSpace
 
Dépôts institutionnels et collections spéciales en DSpace
Dépôts institutionnels et collections spéciales en DSpaceDépôts institutionnels et collections spéciales en DSpace
Dépôts institutionnels et collections spéciales en DSpace
 
Secrets of the DSpace Submission Form
Secrets of the DSpace Submission FormSecrets of the DSpace Submission Form
Secrets of the DSpace Submission Form
 
Introduction to XMLUI and Mirage Theming for DSpace 3
Introduction to XMLUI and Mirage Theming for DSpace 3Introduction to XMLUI and Mirage Theming for DSpace 3
Introduction to XMLUI and Mirage Theming for DSpace 3
 
What's in Store for DSpace 4?
What's in Store for DSpace 4?What's in Store for DSpace 4?
What's in Store for DSpace 4?
 
ORCID for DSpace
ORCID for DSpaceORCID for DSpace
ORCID for DSpace
 
Using Github for DSpace development
Using Github for DSpace developmentUsing Github for DSpace development
Using Github for DSpace development
 
Workshop: Google Analytics for DSpace
Workshop: Google Analytics for DSpaceWorkshop: Google Analytics for DSpace
Workshop: Google Analytics for DSpace
 

Recently uploaded

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersThousandEyes
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...shyamraj55
 

Recently uploaded (20)

Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC Architecture
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
Automating Business Process via MuleSoft Composer | Bangalore MuleSoft Meetup...
 

Metadata based statistics for DSpace

  • 2. OVERVIEW 1. Why DSpace statistics? 2. Usage event vs. Item metadata 3. Generating metadata based statistics 4. Linking metadata to usage events 5. Performance 6. Problem solved?
  • 3. Statistics solution that knows DSpace: Structure “Which are the most downloaded bitstreams in a collection” Metadata “Who are the most popular authors in terms of downloads?” 1 - WHY DSPACE STATISTICS?
  • 4. USAGE EVENT VS. ITEM METADATA 2 types of metadata: Usage event metadata Additional information about the usage event Item metadata Additional information about the target of the usage event
  • 5. USAGE EVENT METADATA Additional information about the usage event Not related to repository Also possible with other statistics solutions: • IP address • Country • User Agent • HTTP Referrer • ...
  • 6. ITEM METADATA Relate usage event to information stored in your repository. Allows statistics queries based on item metadata. → Not possible with a statistics solution that is not tied to the repository.
  • 7. GENERATING METADATA BASED STATISTICS How many downloads did author "Barnes, Douglas F.” get in the last year, grouped by month
  • 8.
  • 9.
  • 10.
  • 11.
  • 12.
  • 13. LINKING METADATA TO USAGE EVENTS Solr Query http://localhost:8080/solr/statistics/select? facet=true&facet.offset=0&facet.mincount=1&facet.sort= false&q=*:*&facet.limit=24&facet.field=dateYearMonth&f acet.method=enum&fq=bundleName:ORIGINAL&fq=type: +0&fq=statistics_type:view&fq=-isBot:true&fq=- isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO +2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes, +Douglas+F.)+&wt=javabin&rows=0
  • 14. LINKING METADATA TO USAGE EVENTS facet.field=dateYearMonth group by the field dateYearMonth fq=type:+0 only include bitstream downloads fq=bundleName:ORIGINAL only include files in bundle “ORIGINAL” fq=-isBot:true filter out all bot statistics fq=-isInternal:true filter out all internal statistics fq=time:[2014-07-01+TO+2015-06-06] only include stats that are between Jul 1st 2014 and Jun 6th 2015 fq=+(author_mtdt:Barnes,+Douglas+F.)+ only include statistics that are by author Barnes, Douglas F.
  • 15. <response> <lst name="responseHeader"> ... </lst> <result name="response" numFound="164" start="0"></result> <lst name="facet_counts"> <lst name="facet_fields"> <lst name="dateYearMonth"> <int name="2014-07">15</int> <int name="2014-08">19</int> <int name="2014-09">15</int> <int name="2014-10">10</int> <int name="2014-11">7</int> <int name="2014-12">13</int> <int name="2015-01">13</int> <int name="2015-02">15</int> <int name="2015-03">21</int> <int name="2015-04">22</int> <int name="2015-05">12</int> <int name="2015-06">2</int> </lst> </lst> </lst> </response>
  • 16. LINKING METADATA TO USAGE EVENTS In a vanilla DSpace installation: • Usage statistics only contain bitstream IDs: no metadata • The metadata is stored in the database
  • 17. PROPOSED SOLUTION 1. Query the database for bitstream IDs based on the author metadata 2. Use those IDs to query solr for statistics
  • 18. PROPOSED SOLUTION: DOWNSIDES • Two queries to answer one question • The solr query can get very long and inefficient to execute • Inefficient but still possible
  • 19. PROPOSED SOLUTION: DOWNSIDES What if we want to show the 10 authors with the most downloads? • query the database for all authors • query SOLR to get the number of usage events for each author • sort those counts, and return the 10 highest
  • 20. PROPOSED SOLUTION: DOWNSIDES Very inefficient! • do a lot of queries • throw away most of the results: we only need top 10
  • 21. SOLR FACETS To do a facet query: • specify ”facet.field” along with the regular query • results will be grouped by the values they have for that field
  • 22. SOLR FACETS: EXAMPLE q=type:0&facet.field=owningItem q=type:0 search for all usage events that are bitstream downloads facet.field=owningItem group these by item count the # records in each group
  • 23. OUR SOLUTION • Add Item metadata to SOLR. • Use built-in filtering and grouping
  • 24. CHALLENGE: SIZE OF THE SOLR CORE That solution creates new challenges Metadata is duplicated in every statistical record that takes up a lot of space and it needs to be kept in sync
  • 25. SIZE OF SINGLE USAGE EVENT <doc> <str name="ip">177.21.194.80</str> <arr name="ip_search"><str>177.21.194.80</str></arr> <arr name="ip_ngram"><str>177.21.194.80</str></arr> <int name="type">0</int> <int name="id">54</int> <date name="time">2015-05-11T04:33:49.077Z</date> <str name="dateYearMonth">2015-05</str> <str name="dateYear">2015</str> <str name="continent">SA</str> <str name="countryCode">BR</str> <float name="latitude">-10.0</float> <float name="longitude">-55.0</float> <arr name="bundleName"><str>ORIGINAL</str></arr> <arr name="containerBitstream"><int>54</int></arr> <arr name="owningItem"><int>1652</int></arr> <arr name="containerItem"><int>1652</int></arr> <arr name="owningColl"><int>14</int></arr> <arr name="containerCollection"><int>14</int></arr> <arr name="owningComm"><int>1</int></arr> <arr name="containerCommunity"><int>1</int></arr> <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str> <bool name="isBot">false</bool> <bool name="isInternal">false</bool> <str name="statistics_type">view</str> <long name="_version_">1501767933804675072</long> </doc> 25 elements
  • 26. <doc> <str name="ip">177.21.194.80</str> ... <arr name="author_mtdt"> <str>Khandker, Shahidur R.</str> <str>Barnes, Douglas F.</str> <str>Samad, Hussain A.</str> </arr> <arr name="subject_mtdt"> <str>ACCESS TO LIGHTING</str> <str>ACCESS TO MODERN ENERGY</str> <str>AGRICULTURAL LAND</str> <str>AGRICULTURAL RESIDUE</str> <str>AIR CONDITIONERS</str> <str>AIR POLLUTION</str> <str>ALTERNATIVE ENERGY</str> <str>ALTERNATIVE SOURCES OF ENERGY</str> <str>APPROACH</str> <str>ATMOSPHERE</str> <str>AVAILABILITY</str> <str>BASIC ENERGY</str> <str>BIOMASS</str> <str>BIOMASS BURNING</str> <str>BIOMASS COLLECTION</str> <str>BIOMASS CONSUMPTION</str> <str>BIOMASS ENERGY</str> ... <str>WORLD ENERGY</str> <str>WORLD ENERGY OUTLOOK</str> </arr> ... </doc> SIZE OF SINGLE USAGE EVENT WITH METADATA 3 authors 140 subjects
  • 27. KEEPING METADATA IN SYNC When the metadata of an item changes • a mistake was corrected • extra info was added the statistical records for that item need to be updated as well
  • 28. KEEPING METADATA IN SYNC Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events. • That takes time • During that time, it takes longer to view other statistical reports
  • 29. PERFORMANCE Size of single usage event Metadata updates Amount of events Live search queries
  • 30. PERFORMANCE ENHANCEMENT: SYNCING Try to keep the load created by synching metadata in the statistics as low as possible: → only sync while solr is idle interrupt the operation when a search request can’t be handled in time interrupt the operation when Solr’s memory usage nears its max
  • 31. PERFORMANCE ENHANCEMENT: CACHING Caching store generated reports in a separate Solr core retrieving them is very fast invalidate cached reports after a set time (e.g. 24 hours)
  • 32. PERFORMANCE ENHANCEMENT: CACHING Don’t delete expired cached reports If a user requests a report that is cached → show the outdated version In the mean time → generate a new version Automatically show new report when it’s done
  • 35. PROBLEM SOLVED? Additional complexity Number of usage events keeps growing Name variants Different names for one author
  • 36. “Who are the Most Popular Authors in terms of downloads?” NAME VARIANTS USE CASE
  • 38. Ferreira, Francisco H. G. Ferreira, Francisco H.G. Ferreira, Francisco 3 name variants:
  • 39.
  • 40. SOLUTION FOR NAME VARIANTS include all name variants in Solr query: author_mtdt: (Ferreira, Francisco H. G.) OR (Ferreira, Francisco H.G.) OR (Ferreira, Francisco)
  • 41. ALTERNATIVE SOLUTION If you have unique IDs (e.g. ORCID) Index, and search for them instead