www.atmire.com
Metadata based
usage statistics
OVERVIEW
1. Why DSpace statistics?
2. Usage event vs. Item metadata
3. Generating metadata based statistics
4. Linking metadata to usage events
5. Performance
6. Problem solved?
1 - WHY DSPACE STATISTICS?
Statistics solution that knows DSpace:
Structure
“Which are the most downloaded bitstreams in a collection?”
Metadata
“Who are the most popular authors in terms of downloads?”
USAGE EVENT VS. ITEM METADATA
2 types of metadata:
Usage event metadata
Additional information about the usage event
Item metadata
Additional information about the target of the usage event
USAGE EVENT METADATA
Additional information about the usage event
Not related to repository
Also possible with other statistics solutions:
• IP address
• Country
• User Agent
• HTTP Referrer
• ...
ITEM METADATA
Relate usage event to information stored in
your repository.
Allows statistics queries based on item
metadata.
→ Not possible with a statistics solution that
is not tied to the repository.
GENERATING METADATA BASED STATISTICS
How many downloads did
author “Barnes, Douglas F.”
get in the last year, grouped
by month?
LINKING METADATA TO USAGE EVENTS
Solr Query
http://localhost:8080/solr/statistics/select?
facet=true&facet.offset=0&facet.mincount=1&facet.sort=false
&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum
&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view
&fq=-isBot:true&fq=-isInternal:true
&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]
&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
LINKING METADATA TO USAGE EVENTS
facet.field=dateYearMonth
group by the field dateYearMonth
fq=type:+0
only include bitstream downloads
fq=bundleName:ORIGINAL
only include files in bundle “ORIGINAL”
fq=-isBot:true
filter out all bot statistics
fq=-isInternal:true
filter out all internal statistics
fq=time:[2014-07-01+TO+2015-06-06]
only include stats that are between Jul 1st 2014
and Jun 6th 2015
fq=+(author_mtdt:Barnes,+Douglas+F.)+
only include statistics for items by
author Barnes, Douglas F.
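The parameters above can be assembled programmatically. The following is a minimal Python sketch, not the code DSpace itself runs: the function name is made up for illustration, the base URL is the one from the slide, and the author name is wrapped in double quotes so Solr treats it as one phrase, which is an assumption about how the `author_mtdt` field is queried.

```python
from urllib.parse import urlencode

# Base URL as shown on the slide; adjust to your installation.
SOLR_SELECT = "http://localhost:8080/solr/statistics/select"

def author_downloads_by_month_url(author, start, end):
    """Return a select URL for monthly download counts of one author."""
    params = [
        ("q", "*:*"),
        ("rows", "0"),                       # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),    # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                    # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        ("fq", f'author_mtdt:"{author}"'),   # assumption: quoted phrase match
        ("wt", "xml"),
    ]
    return SOLR_SELECT + "?" + urlencode(params)

url = author_downloads_by_month_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs rather than a dict keeps the repeated `fq` entries, and `urlencode` takes care of escaping spaces, brackets and quotes.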
<response>
<lst name="responseHeader">
...
</lst>
<result name="response" numFound="164" start="0"></result>
<lst name="facet_counts">
<lst name="facet_fields">
<lst name="dateYearMonth">
<int name="2014-07">15</int>
<int name="2014-08">19</int>
<int name="2014-09">15</int>
<int name="2014-10">10</int>
<int name="2014-11">7</int>
<int name="2014-12">13</int>
<int name="2015-01">13</int>
<int name="2015-02">15</int>
<int name="2015-03">21</int>
<int name="2015-04">22</int>
<int name="2015-05">12</int>
<int name="2015-06">2</int>
</lst>
</lst>
</lst>
</response>
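Reading the monthly counts out of such a response takes only a few lines. This is a sketch using Python's standard library XML parser on a copy of the response above (with the elided header left out); the function name is illustrative.

```python
import xml.etree.ElementTree as ET

# Copy of the response shown above, without the elided responseHeader.
RESPONSE = """<response>
<result name="response" numFound="164" start="0"></result>
<lst name="facet_counts">
<lst name="facet_fields">
<lst name="dateYearMonth">
<int name="2014-07">15</int>
<int name="2014-08">19</int>
<int name="2014-09">15</int>
<int name="2014-10">10</int>
<int name="2014-11">7</int>
<int name="2014-12">13</int>
<int name="2015-01">13</int>
<int name="2015-02">15</int>
<int name="2015-03">21</int>
<int name="2015-04">22</int>
<int name="2015-05">12</int>
<int name="2015-06">2</int>
</lst>
</lst>
</lst>
</response>"""

def facet_counts(xml_text, field):
    """Map each facet value of `field` to its count."""
    root = ET.fromstring(xml_text)
    path = f".//lst[@name='facet_fields']/lst[@name='{field}']/int"
    return {node.get("name"): int(node.text) for node in root.findall(path)}

counts = facet_counts(RESPONSE, "dateYearMonth")
print(counts["2015-03"])       # 21
print(sum(counts.values()))    # 164, the same as numFound
```

The monthly counts add up to `numFound`, because every matching usage event falls in exactly one month.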
LINKING METADATA TO USAGE EVENTS
In a vanilla DSpace installation:
• Usage statistics only contain bitstream IDs: no
metadata
• The metadata is stored in the database
PROPOSED SOLUTION
1. Query the database for bitstream IDs
based on the author metadata
2. Use those IDs to query Solr for statistics
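The second step can be sketched as follows, in Python. The ID list stands in for the result of the database query and is hypothetical apart from bitstream 54, which appears in the sample usage event later in the deck; the function name is made up for illustration.

```python
def bitstream_filter(bitstream_ids):
    """Build a Solr filter that matches downloads of the given bitstreams."""
    unique_ids = sorted(set(bitstream_ids))
    if not unique_ids:
        raise ValueError("no bitstreams found for this author")
    joined = " OR ".join(str(i) for i in unique_ids)
    return f"type:0 AND id:({joined})"

# Step 1 (assumed): IDs returned by a database query on the author metadata.
ids_from_database = [54, 61, 1203]

# Step 2: use those IDs in the Solr statistics query.
print(bitstream_filter(ids_from_database))
# type:0 AND id:(54 OR 61 OR 1203)

# The filter grows with every bitstream the author has.
print(len(bitstream_filter(range(1, 5001))))
```

Besides its length, such a filter can run into Solr's `maxBooleanClauses` setting (1024 by default) for authors with many files.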
PROPOSED SOLUTION: DOWNSIDES
• Two queries to answer one question
• The Solr query can get very long and
inefficient to execute
• Inefficient but still possible
PROPOSED SOLUTION: DOWNSIDES
What if we want to show the 10 authors with
the most downloads?
• query the database for all authors
• query Solr to get the number of usage events
for each author
• sort those counts, and return the 10 highest
PROPOSED SOLUTION: DOWNSIDES
Very inefficient!
• do a lot of queries
• throw away most of the results: we only
need top 10
SOLR FACETS
To do a facet query:
• specify “facet.field” along with the
regular query
• results will be grouped by the values they have
for that field
SOLR FACETS: EXAMPLE
q=type:0&facet.field=owningItem
q=type:0
search for all usage events that are bitstream downloads
facet.field=owningItem
group these by item
count the # records in each group
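What a facet query computes can be imitated in a few lines of Python. This is only an illustration of the grouping and counting, not how Solr implements it; the events are hypothetical and reduced to the two fields used in the example.

```python
from collections import Counter

# Hypothetical usage events, reduced to the two fields used in the example.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},   # not type 0, so not a bitstream download
    {"type": 0, "owningItem": 87},
]

def facet(docs, matches, field):
    """Count the matching docs per value of `field`, highest count first."""
    return Counter(d[field] for d in docs if matches(d)).most_common()

# q=type:0 & facet.field=owningItem
print(facet(events, lambda d: d["type"] == 0, "owningItem"))
# [(1652, 2), (87, 1)]
```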
OUR SOLUTION
• Add item metadata to Solr.
• Use Solr's built-in filtering and grouping.
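With the author stored on every usage event, the "10 authors with the most downloads" question becomes a single facet query. A minimal Python sketch of the parameters, assuming the `author_mtdt` field name used elsewhere in this deck; the function name is illustrative.

```python
from urllib.parse import urlencode

def top_authors_params(limit=10):
    """Query parameters for the `limit` most downloaded authors."""
    return [
        ("q", "*:*"),
        ("rows", "0"),                    # no documents, only facet counts
        ("fq", "type:0"),                 # bitstream downloads only
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("facet", "true"),
        ("facet.field", "author_mtdt"),   # group by author
        ("facet.sort", "count"),          # highest counts first
        ("facet.limit", str(limit)),      # Solr returns only the top entries
        ("facet.mincount", "1"),
    ]

print(urlencode(top_authors_params()))
```

Sorting and truncating happen inside Solr, so nothing is fetched only to be thrown away.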
CHALLENGE: SIZE OF THE SOLR CORE
That solution creates new challenges.
Metadata is duplicated in every statistical record:
• that takes up a lot of space
• it needs to be kept in sync
SIZE OF SINGLE USAGE EVENT
<doc>
<str name="ip">177.21.194.80</str>
<arr name="ip_search"><str>177.21.194.80</str></arr>
<arr name="ip_ngram"><str>177.21.194.80</str></arr>
<int name="type">0</int>
<int name="id">54</int>
<date name="time">2015-05-11T04:33:49.077Z</date>
<str name="dateYearMonth">2015-05</str>
<str name="dateYear">2015</str>
<str name="continent">SA</str>
<str name="countryCode">BR</str>
<float name="latitude">-10.0</float>
<float name="longitude">-55.0</float>
<arr name="bundleName"><str>ORIGINAL</str></arr>
<arr name="containerBitstream"><int>54</int></arr>
<arr name="owningItem"><int>1652</int></arr>
<arr name="containerItem"><int>1652</int></arr>
<arr name="owningColl"><int>14</int></arr>
<arr name="containerCollection"><int>14</int></arr>
<arr name="owningComm"><int>1</int></arr>
<arr name="containerCommunity"><int>1</int></arr>
<str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str>
<bool name="isBot">false</bool>
<bool name="isInternal">false</bool>
<str name="statistics_type">view</str>
<long name="_version_">1501767933804675072</long>
</doc>
25 elements
SIZE OF SINGLE USAGE EVENT WITH METADATA
<doc>
<str name="ip">177.21.194.80</str>
...
<arr name="author_mtdt">
<str>Khandker, Shahidur R.</str>
<str>Barnes, Douglas F.</str>
<str>Samad, Hussain A.</str>
</arr>
<arr name="subject_mtdt">
<str>ACCESS TO LIGHTING</str>
<str>ACCESS TO MODERN ENERGY</str>
<str>AGRICULTURAL LAND</str>
<str>AGRICULTURAL RESIDUE</str>
<str>AIR CONDITIONERS</str>
<str>AIR POLLUTION</str>
<str>ALTERNATIVE ENERGY</str>
<str>ALTERNATIVE SOURCES OF ENERGY</str>
<str>APPROACH</str>
<str>ATMOSPHERE</str>
<str>AVAILABILITY</str>
<str>BASIC ENERGY</str>
<str>BIOMASS</str>
<str>BIOMASS BURNING</str>
<str>BIOMASS COLLECTION</str>
<str>BIOMASS CONSUMPTION</str>
<str>BIOMASS ENERGY</str>
...
<str>WORLD ENERGY</str>
<str>WORLD ENERGY OUTLOOK</str>
</arr>
...
</doc>
3 authors
140 subjects
KEEPING METADATA IN SYNC
When the metadata of an item changes
• a mistake was corrected
• extra info was added
the statistical records for that item need to be
updated as well
KEEPING METADATA IN SYNC
Item with 7,000 page visits and 5,000 downloads
→ that means updating 12,000 usage events.
• That takes time
• During that time, it takes longer to view other
statistical reports
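One way to organise such an update is to send it in batches. The Python sketch below only builds the payloads: it assumes that `uid` is the unique key of the statistics core and that Solr's atomic update syntax (`{"set": ...}`) can be used on the core, neither of which is stated in this deck. The uids and the batch size of 500 are placeholders.

```python
import json

def atomic_updates(uids, field, values):
    """One Solr atomic-update document per usage event (assumed approach)."""
    return [{"uid": uid, field: {"set": list(values)}} for uid in uids]

def batches(items, size):
    """Split `items` into consecutive lists of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]

# 7,000 page visits + 5,000 downloads, with placeholder uids.
uids = [f"event-{n}" for n in range(7000 + 5000)]
docs = atomic_updates(uids, "author_mtdt", ["Barnes, Douglas F."])
chunks = batches(docs, 500)

print(len(docs), len(chunks))        # 12000 24
print(json.dumps(chunks[0][0]))      # body of the first update document
```

Small batches make it possible to pause between requests, so that other statistical reports are not held up for the whole duration of the update.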
PERFORMANCE
Size of single usage event
Metadata updates
Number of events
Live search queries
PERFORMANCE ENHANCEMENT: SYNCING
Try to keep the load created by syncing
metadata in the statistics as low as possible:
→ only sync while Solr is idle
interrupt the operation when a search request
can’t be handled in time
interrupt the operation when Solr’s memory
usage nears its max
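The rules above amount to a loop that checks Solr's state before every unit of work. A minimal Python sketch with stand-in callbacks; how "busy" and "memory nearly full" are actually measured is not described in this deck, so both checks are hypothetical.

```python
def sync_metadata(pending, update, solr_is_busy, memory_is_high):
    """Apply queued updates until Solr needs its resources for searches.

    Returns the updates that still have to be done in a later run.
    """
    remaining = list(pending)
    while remaining:
        if solr_is_busy() or memory_is_high():
            break                      # interrupt: live searches come first
        update(remaining.pop(0))
    return remaining

# Usage with stand-in callbacks: Solr becomes busy after two updates.
done = []
left = sync_metadata(
    ["batch-1", "batch-2", "batch-3"],
    update=done.append,
    solr_is_busy=lambda: len(done) >= 2,
    memory_is_high=lambda: False,
)
print(done, left)   # ['batch-1', 'batch-2'] ['batch-3']
```

Returning the unfinished work lets the caller resume the sync the next time Solr is idle.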
PERFORMANCE ENHANCEMENT: CACHING
Caching
store generated reports in a separate Solr core
retrieving them is very fast
invalidate cached reports after a set time
(e.g. 24 hours)
PERFORMANCE ENHANCEMENT: CACHING
Don’t delete expired cached reports
If a user requests a report that is cached
→ show the outdated version
In the meantime
→ generate a new version
Automatically show new report when it’s done
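This behaviour (serve the outdated report, regenerate in the background) can be sketched in Python. The sketch keeps reports in a dictionary instead of a separate Solr core and leaves the background scheduling to the caller; the class and method names are made up for illustration.

```python
import time

class ReportCache:
    """Serve cached reports immediately; queue expired ones for regeneration."""

    def __init__(self, generate, max_age_seconds=24 * 3600, clock=time.time):
        self.generate = generate
        self.max_age = max_age_seconds
        self.clock = clock
        self.entries = {}        # key -> (report, created_at)
        self.to_refresh = []     # keys waiting for a new version

    def get(self, key):
        """Return (report, is_outdated)."""
        if key not in self.entries:            # cache miss: user has to wait
            self.entries[key] = (self.generate(key), self.clock())
            return self.entries[key][0], False
        report, created = self.entries[key]
        outdated = self.clock() - created > self.max_age
        if outdated and key not in self.to_refresh:
            self.to_refresh.append(key)        # keep the old copy for now
        return report, outdated

    def refresh_pending(self):
        """Regenerate expired reports; meant to run outside the request."""
        while self.to_refresh:
            key = self.to_refresh.pop(0)
            self.entries[key] = (self.generate(key), self.clock())

# Usage with a fake clock and a generator that numbers its versions.
now = [0]
versions = iter(range(1, 100))
cache = ReportCache(lambda key: f"{key} v{next(versions)}", clock=lambda: now[0])
print(cache.get("top-authors"))   # ('top-authors v1', False)
now[0] = 25 * 3600                # 25 hours later
print(cache.get("top-authors"))   # ('top-authors v1', True)
cache.refresh_pending()
print(cache.get("top-authors"))   # ('top-authors v2', False)
```

The `is_outdated` flag is what a front end could use to swap in the new report once it is ready.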
EXAMPLE: CACHE MISS
PROBLEM SOLVED?
Additional complexity
Number of usage events
keeps growing
Name variants
Different names for one author
“Who are the Most
Popular Authors in terms
of downloads?”
NAME VARIANTS USE CASE
https://openknowledge.worldbank.org/most-popular/author
3 name variants:
Ferreira, Francisco H. G.
Ferreira, Francisco H.G.
Ferreira, Francisco
SOLUTION FOR NAME VARIANTS
Include all name variants in the Solr query:
author_mtdt:(
"Ferreira, Francisco H. G." OR
"Ferreira, Francisco H.G." OR
"Ferreira, Francisco")
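Building that filter from a list of variants is straightforward. In this Python sketch every variant is quoted as a phrase so that the spaces and commas stay part of one value; that quoting is an assumption about how the field is queried, and the function name is made up for illustration.

```python
def author_variants_filter(variants, field="author_mtdt"):
    """OR together exact-phrase matches for every spelling of one author."""
    def quote(name):
        # Escape backslashes and double quotes inside the phrase.
        escaped = name.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    return f"{field}:(" + " OR ".join(quote(v) for v in variants) + ")"

print(author_variants_filter([
    "Ferreira, Francisco H. G.",
    "Ferreira, Francisco H.G.",
    "Ferreira, Francisco",
]))
# author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
```

The list of variants still has to come from somewhere, for example an authority file.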
ALTERNATIVE SOLUTION
If you have unique IDs (e.g. ORCID):
index them, and search for them instead
www.atmire.com
Thank you!
Questions?
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 

Recently uploaded (20)

🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 

Metadata based statistics for DSpace

  • 2. OVERVIEW 1. Why DSpace statistics? 2. Usage event vs. Item metadata 3. Generating metadata based statistics 4. Linking metadata to usage events 5. Performance 6. Problem solved?
  • 3. 1 - WHY DSPACE STATISTICS? Statistics solution that knows DSpace: • Structure: “Which are the most downloaded bitstreams in a collection?” • Metadata: “Who are the most popular authors in terms of downloads?”
  • 4. USAGE EVENT VS. ITEM METADATA 2 types of metadata: • Usage event metadata: additional information about the usage event • Item metadata: additional information about the target of the usage event
  • 5. USAGE EVENT METADATA Additional information about the usage event Not related to repository Also possible with other statistics solutions: • IP address • Country • User Agent • HTTP Referrer • ...
  • 6. ITEM METADATA Relate usage event to information stored in your repository. Allows statistics queries based on item metadata. → Not possible with a statistics solution that is not tied to the repository.
  • 7. GENERATING METADATA BASED STATISTICS How many downloads did author “Barnes, Douglas F.” get in the last year, grouped by month?
  • 13. LINKING METADATA TO USAGE EVENTS Solr Query: http://localhost:8080/solr/statistics/select?facet=true&facet.offset=0&facet.mincount=1&facet.sort=false&q=*:*&facet.limit=24&facet.field=dateYearMonth&facet.method=enum&fq=bundleName:ORIGINAL&fq=type:+0&fq=statistics_type:view&fq=-isBot:true&fq=-isInternal:true&fq=time:[2014-07-01T00:00:00.000Z+TO+2015-06-06T00:00:00.000Z]&fq=+(author_mtdt:Barnes,+Douglas+F.)+&wt=javabin&rows=0
  • 14. LINKING METADATA TO USAGE EVENTS • facet.field=dateYearMonth → group by the field dateYearMonth • fq=type:+0 → only include bitstream downloads • fq=bundleName:ORIGINAL → only include files in bundle “ORIGINAL” • fq=-isBot:true → filter out all bot statistics • fq=-isInternal:true → filter out all internal statistics • fq=time:[2014-07-01+TO+2015-06-06] → only include stats that are between Jul 1st 2014 and Jun 6th 2015 • fq=+(author_mtdt:Barnes,+Douglas+F.)+ → only include statistics that are by author Barnes, Douglas F.
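As a rough illustration of how these filters combine into one request, the Python sketch below assembles a comparable query URL. The field names and the localhost URL come from the slide; the helper name `build_author_downloads_url`, the choice of `wt=json` instead of `javabin`, and quoting the author name as a phrase are assumptions made for this example only.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/statistics/select"  # URL from the slide


def build_author_downloads_url(author, start, end):
    """Build a facet query: downloads for one author, grouped by month.

    Hypothetical helper; `start` and `end` are Solr date strings.
    """
    params = [
        ("q", "*:*"),
        ("rows", "0"),                     # only the facet counts are needed
        ("facet", "true"),
        ("facet.field", "dateYearMonth"),  # group by month
        ("facet.mincount", "1"),
        ("facet.limit", "24"),
        ("fq", "type:0"),                  # bitstream downloads only
        ("fq", "bundleName:ORIGINAL"),
        ("fq", "statistics_type:view"),
        ("fq", "-isBot:true"),
        ("fq", "-isInternal:true"),
        ("fq", f"time:[{start} TO {end}]"),
        # Assumption: quote the name so it is matched as a single value.
        ("fq", f'author_mtdt:"{author}"'),
        ("wt", "json"),                    # assumption: JSON instead of javabin
    ]
    return SOLR_SELECT + "?" + urlencode(params)


url = build_author_downloads_url(
    "Barnes, Douglas F.",
    "2014-07-01T00:00:00.000Z",
    "2015-06-06T00:00:00.000Z",
)
print(url)
```

Passing the parameters as a list of pairs rather than a dict keeps the repeated `fq` keys, which is how Solr expects several independent filters to be sent.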
  • 15. <response> <lst name="responseHeader"> ... </lst> <result name="response" numFound="164" start="0"></result> <lst name="facet_counts"> <lst name="facet_fields"> <lst name="dateYearMonth"> <int name="2014-07">15</int> <int name="2014-08">19</int> <int name="2014-09">15</int> <int name="2014-10">10</int> <int name="2014-11">7</int> <int name="2014-12">13</int> <int name="2015-01">13</int> <int name="2015-02">15</int> <int name="2015-03">21</int> <int name="2015-04">22</int> <int name="2015-05">12</int> <int name="2015-06">2</int> </lst> </lst> </lst> </response>
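A response like the one above can be turned into a month-to-downloads mapping with a few lines of Python. This is only a sketch: it assumes the XML response writer was used, it leaves out the `responseHeader` element that the slide elides, and `monthly_counts` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The facet part of the response shown on the slide (responseHeader left out).
RESPONSE = """<response>
  <result name="response" numFound="164" start="0"></result>
  <lst name="facet_counts">
    <lst name="facet_fields">
      <lst name="dateYearMonth">
        <int name="2014-07">15</int>
        <int name="2014-08">19</int>
        <int name="2014-09">15</int>
        <int name="2014-10">10</int>
        <int name="2014-11">7</int>
        <int name="2014-12">13</int>
        <int name="2015-01">13</int>
        <int name="2015-02">15</int>
        <int name="2015-03">21</int>
        <int name="2015-04">22</int>
        <int name="2015-05">12</int>
        <int name="2015-06">2</int>
      </lst>
    </lst>
  </lst>
</response>"""


def monthly_counts(xml_text, field="dateYearMonth"):
    """Return {facet value: count} for one facet field of a Solr XML response."""
    root = ET.fromstring(xml_text)
    facet = root.find(
        f"./lst[@name='facet_counts']/lst[@name='facet_fields']/lst[@name='{field}']"
    )
    return {node.get("name"): int(node.text) for node in facet.findall("int")}


counts = monthly_counts(RESPONSE)
total = sum(counts.values())
num_found = int(ET.fromstring(RESPONSE).find("result").get("numFound"))
print(total, num_found)  # prints "164 164"
```

The monthly counts add up to the `numFound` value, which is a quick sanity check that every matching usage event landed in exactly one month bucket.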
  • 16. LINKING METADATA TO USAGE EVENTS In a vanilla DSpace installation: • Usage statistics only contain bitstream IDs: no metadata • The metadata is stored in the database
  • 17. PROPOSED SOLUTION 1. Query the database for bitstream IDs based on the author metadata 2. Use those IDs to query Solr for statistics
  • 18. PROPOSED SOLUTION: DOWNSIDES • Two queries to answer one question • The Solr query can get very long and inefficient to execute • Inefficient but still possible
  • 19. PROPOSED SOLUTION: DOWNSIDES What if we want to show the 10 authors with the most downloads? • query the database for all authors • query Solr to get the number of usage events for each author • sort those counts, and return the 10 highest
  • 20. PROPOSED SOLUTION: DOWNSIDES Very inefficient! • do a lot of queries • throw away most of the results: we only need top 10
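To make the cost of this approach concrete, the sketch below simulates it in Python with in-memory stand-ins for the database and the Solr core. All data and function names here are hypothetical; the point is only to count how many queries a "top N authors" report needs.

```python
import heapq

# Hypothetical in-memory stand-ins for the database and the Solr core.
AUTHOR_BITSTREAMS = {
    "Author A": [1, 2],
    "Author B": [3],
    "Author C": [4, 5, 6],
}
DOWNLOAD_EVENTS = [1, 1, 2, 3, 3, 3, 3, 4, 6]  # one bitstream ID per download

queries = {"db": 0, "solr": 0}


def db_all_authors():
    queries["db"] += 1
    return list(AUTHOR_BITSTREAMS)


def db_bitstream_ids(author):
    queries["db"] += 1
    return AUTHOR_BITSTREAMS[author]


def solr_count_downloads(bitstream_ids):
    queries["solr"] += 1
    wanted = set(bitstream_ids)
    return sum(1 for bid in DOWNLOAD_EVENTS if bid in wanted)


def top_authors(n):
    """One database and one Solr query per author, then keep only the top n."""
    counts = {}
    for author in db_all_authors():
        ids = db_bitstream_ids(author)
        counts[author] = solr_count_downloads(ids)
    return heapq.nlargest(n, counts.items(), key=lambda kv: kv[1])


top = top_authors(2)
print(top)      # [('Author B', 4), ('Author A', 3)]
print(queries)  # {'db': 4, 'solr': 3}
```

With three authors this already takes seven queries, and the number grows linearly with the number of authors even though only the top few results are kept.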
  • 21. SOLR FACETS To do a facet query: • specify “facet.field” along with the regular query • results will be grouped by the values they have for that field
  • 22. SOLR FACETS: EXAMPLE q=type:0&facet.field=owningItem • q=type:0 → search for all usage events that are bitstream downloads • facet.field=owningItem → group these by item and count the # records in each group
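What a facet query computes can be mimicked in a few lines of Python: filter the records, then count them per value of the facet field. The events below are made up, and `owningItem` is simplified to a single value per event (in the Solr core it is multi-valued), so this is a conceptual sketch rather than a description of Solr internals.

```python
from collections import Counter

# Hypothetical usage events; only the fields used on the slide are included.
events = [
    {"type": 0, "owningItem": 1652},
    {"type": 0, "owningItem": 1652},
    {"type": 2, "owningItem": 1652},  # some other event type, filtered out below
    {"type": 0, "owningItem": 1700},
]


def facet(records, query, field):
    """Keep records matching every key/value in `query`, then count per `field`."""
    matching = (r for r in records if all(r.get(k) == v for k, v in query.items()))
    return Counter(r[field] for r in matching)


counts = facet(events, {"type": 0}, "owningItem")
print(counts.most_common())  # [(1652, 2), (1700, 1)]
```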
  • 23. OUR SOLUTION • Add item metadata to Solr • Use built-in filtering and grouping
  • 24. CHALLENGE: SIZE OF THE SOLR CORE That solution creates new challenges: • Metadata is duplicated in every statistical record, which takes up a lot of space • The duplicated metadata needs to be kept in sync
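Once the author names are part of each usage event, a "top 10 authors" report becomes a single facet request. The Python sketch below builds such a request and reads a JSON response; `facet.limit` and `facet.sort=count` are standard Solr facet parameters, while the sample response body, its counts, and the helper name `top_values` are invented for illustration.

```python
import json
from urllib.parse import urlencode

params = [
    ("q", "*:*"),
    ("rows", "0"),
    ("fq", "type:0"),              # bitstream downloads only
    ("fq", "-isBot:true"),
    ("facet", "true"),
    ("facet.field", "author_mtdt"),
    ("facet.limit", "10"),         # only the ten largest groups
    ("facet.sort", "count"),       # order groups by number of events
    ("wt", "json"),
]
query_string = urlencode(params)

# Made-up response body. By default Solr's JSON writer returns the values of a
# facet field as a flat list: [value, count, value, count, ...].
SAMPLE_BODY = json.dumps(
    {"facet_counts": {"facet_fields": {"author_mtdt": ["Author A", 40, "Author B", 25]}}}
)


def top_values(body, field):
    """Pair up the flat [value, count, ...] list of one facet field."""
    flat = json.loads(body)["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))


print(query_string)
print(top_values(SAMPLE_BODY, "author_mtdt"))  # [('Author A', 40), ('Author B', 25)]
```

Compared with querying the database and Solr once per author, the sorting and truncation now happen inside Solr and only the ten requested rows are returned.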
  • 25. SIZE OF SINGLE USAGE EVENT <doc> <str name="ip">177.21.194.80</str> <arr name="ip_search"><str>177.21.194.80</str></arr> <arr name="ip_ngram"><str>177.21.194.80</str></arr> <int name="type">0</int> <int name="id">54</int> <date name="time">2015-05-11T04:33:49.077Z</date> <str name="dateYearMonth">2015-05</str> <str name="dateYear">2015</str> <str name="continent">SA</str> <str name="countryCode">BR</str> <float name="latitude">-10.0</float> <float name="longitude">-55.0</float> <arr name="bundleName"><str>ORIGINAL</str></arr> <arr name="containerBitstream"><int>54</int></arr> <arr name="owningItem"><int>1652</int></arr> <arr name="containerItem"><int>1652</int></arr> <arr name="owningColl"><int>14</int></arr> <arr name="containerCollection"><int>14</int></arr> <arr name="owningComm"><int>1</int></arr> <arr name="containerCommunity"><int>1</int></arr> <str name="uid">60fe8ebb-b8a9-454c-8eef-3f9f800d1399</str> <bool name="isBot">false</bool> <bool name="isInternal">false</bool> <str name="statistics_type">view</str> <long name="_version_">1501767933804675072</long> </doc> 25 elements
  • 26. <doc> <str name="ip">177.21.194.80</str> ... <arr name="author_mtdt"> <str>Khandker, Shahidur R.</str> <str>Barnes, Douglas F.</str> <str>Samad, Hussain A.</str> </arr> <arr name="subject_mtdt"> <str>ACCESS TO LIGHTING</str> <str>ACCESS TO MODERN ENERGY</str> <str>AGRICULTURAL LAND</str> <str>AGRICULTURAL RESIDUE</str> <str>AIR CONDITIONERS</str> <str>AIR POLLUTION</str> <str>ALTERNATIVE ENERGY</str> <str>ALTERNATIVE SOURCES OF ENERGY</str> <str>APPROACH</str> <str>ATMOSPHERE</str> <str>AVAILABILITY</str> <str>BASIC ENERGY</str> <str>BIOMASS</str> <str>BIOMASS BURNING</str> <str>BIOMASS COLLECTION</str> <str>BIOMASS CONSUMPTION</str> <str>BIOMASS ENERGY</str> ... <str>WORLD ENERGY</str> <str>WORLD ENERGY OUTLOOK</str> </arr> ... </doc> SIZE OF SINGLE USAGE EVENT WITH METADATA 3 authors 140 subjects
  • 27. KEEPING METADATA IN SYNC When the metadata of an item changes (e.g. a mistake was corrected or extra info was added), the statistical records for that item need to be updated as well
  • 28. KEEPING METADATA IN SYNC Item with 7,000 page visits and 5,000 downloads → that means updating 12,000 usage events. • That takes time • During that time, it takes longer to view other statistical reports
  • 29. PERFORMANCE • Size of single usage event • Metadata updates • Number of events • Live search queries
  • 30. PERFORMANCE ENHANCEMENT: SYNCING Try to keep the load created by syncing metadata in the statistics as low as possible: → only sync while Solr is idle • interrupt the operation when a search request can’t be handled in time • interrupt the operation when Solr’s memory usage nears its max
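One way to picture an interruptible sync is a loop that updates records in small batches and checks before each batch whether it should back off. The Python sketch below is a simplified assumption of how that could look, not the implementation described in the slides; `update_batch` and `should_pause` are hypothetical callbacks standing in for the Solr update and for the idle and memory checks.

```python
def sync_in_batches(pending, update_batch, should_pause, batch_size=100):
    """Update records batch by batch; stop early when `should_pause()` is true.

    Returns the records still waiting, so a later run can resume from there.
    """
    while pending:
        if should_pause():
            break
        batch, pending = pending[:batch_size], pending[batch_size:]
        update_batch(batch)
    return pending


# Example: 12,000 usage events to update; pretend Solr becomes busy
# right before the third batch would start.
updated = []
checks = iter([False, False, True])
remaining = sync_in_batches(
    list(range(12_000)),
    update_batch=updated.extend,
    should_pause=lambda: next(checks),
    batch_size=1000,
)
print(len(updated), len(remaining))  # 2000 10000
```

Returning the unprocessed records makes the operation resumable, so an interruption only delays the sync instead of losing work.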
  • 31. PERFORMANCE ENHANCEMENT: CACHING • store generated reports in a separate Solr core • retrieving them is very fast • invalidate cached reports after a set time (e.g. 24 hours)
  • 32. PERFORMANCE ENHANCEMENT: CACHING Don’t delete expired cached reports • If a user requests a report that is cached → show the outdated version • In the meantime → generate a new version • Automatically show the new report when it’s done
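The "serve the outdated report, regenerate in the background" idea can be sketched in Python as follows. This is an assumption-laden illustration: a plain dictionary stands in for the separate Solr core, regeneration is triggered by an explicit `refresh_pending()` call instead of a real background job, and the class and method names are hypothetical.

```python
import time


class ReportCache:
    def __init__(self, generate, ttl_seconds=24 * 3600, clock=time.time):
        self._generate = generate
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}     # key -> (report, created_at)
        self._pending = set()  # keys waiting for regeneration

    def get(self, key):
        entry = self._entries.get(key)
        if entry is None:
            return self._store(key)  # nothing cached yet: generate now
        report, created_at = entry
        if self._clock() - created_at > self._ttl:
            self._pending.add(key)   # expired: keep serving it, refresh later
        return report

    def refresh_pending(self):
        """Regenerate expired reports; meant to run outside the request path."""
        while self._pending:
            self._store(self._pending.pop())

    def _store(self, key):
        report = self._generate(key)
        self._entries[key] = (report, self._clock())
        return report


# Example with a fake clock and a generator that returns "v1", then "v2".
now = [0.0]
versions = iter(["v1", "v2"])
cache = ReportCache(lambda key: next(versions), ttl_seconds=10, clock=lambda: now[0])

first = cache.get("top-authors")  # generated on the first request
now[0] = 11.0                     # the cached report has now expired
stale = cache.get("top-authors")  # the outdated version is still served
cache.refresh_pending()           # the new version is generated afterwards
fresh = cache.get("top-authors")
print(first, stale, fresh)        # v1 v1 v2
```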
  • 35. PROBLEM SOLVED? • Additional complexity • Number of usage events keeps growing • Name variants: different names for one author
  • 36. NAME VARIANTS USE CASE “Who are the most popular authors in terms of downloads?”
  • 38. 3 name variants: • Ferreira, Francisco H. G. • Ferreira, Francisco H.G. • Ferreira, Francisco
  • 40. SOLUTION FOR NAME VARIANTS include all name variants in Solr query: author_mtdt: (Ferreira, Francisco H. G.) OR (Ferreira, Francisco H.G.) OR (Ferreira, Francisco)
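A filter that covers all known variants can be generated from a list of names. The Python sketch below is one possible way to do it; quoting each variant as a phrase and the helper names `escape_phrase` and `name_variants_filter` are assumptions made for this example and are not taken from the slides.

```python
def escape_phrase(value):
    """Escape backslashes and double quotes so the value is safe inside "..."."""
    return value.replace("\\", "\\\\").replace('"', '\\"')


def name_variants_filter(field, variants):
    """Build a filter such as field:("variant 1" OR "variant 2")."""
    phrases = " OR ".join(f'"{escape_phrase(v)}"' for v in variants)
    return f"{field}:({phrases})"


fq = name_variants_filter(
    "author_mtdt",
    ["Ferreira, Francisco H. G.", "Ferreira, Francisco H.G.", "Ferreira, Francisco"],
)
print(fq)
# author_mtdt:("Ferreira, Francisco H. G." OR "Ferreira, Francisco H.G." OR "Ferreira, Francisco")
```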
  • 41. ALTERNATIVE SOLUTION If you have unique IDs (e.g. ORCID): index them and search on them instead