Traffic Analytics for Linked Data Publishers

Luca Costabello
Luca CostabelloPostdoctoral Researcher at Fujitsu Ireland
FUJITSU RESTRICTED - UK & IRELAND EYES ONLY © Copyright 2014 Fujitsu (Ireland) Limited
Traffic Analytics for
Linked Data Publishers
Luca Costabello, Pierre-Yves Vandenbussche, Gofran Shukair
Fujitsu Ireland
Corine Deliot, Neil Wilson
British Library
2
The Problem: Measuring Traffic on RDF Datasets
Linked Data publishers have limited awareness of how
datasets are accessed by visitors.
No tool to mine Linked Data servers access logs
Why is this such a big deal?
Justify investment in Linked Data IT infrastructure
Cost control
Identify abuses
Interpret access peaks
3
Which traffic metrics?
 Adapt conventional web analytics metrics
 Define Linked-Data specific extensions
How to extract and compute such metrics?
 Which data sources? (client tracking? server access logs mining? both?)
 Need to support dual data access protocol (HTTP operations + SPARQL)
 How to filter noise? (i.e. robots, search engines crawlers)
 How to detect client sessions? (no client tracking, dual data access protocol)
 How to detect SPARQL activity peaks?
Challenges
4
Existing tools do not include Linked Data-specific metrics:
Linked data-specific metrics, but no platform [Moller et al, WebScience 2010]
Filling a Gap in Prior Art
5
• Traffic analytics platform for LD servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
Our Contribution / Agenda
6
• Traffic analytics platform for LD servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
7
8
9
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
10
Metrics
* Linked Data-specific
11
Metrics
* Linked Data-specific
12
Metrics
* Linked Data-specific
13
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
14
Visitor Session Detection
Session: sequence of requests issued with no significant
interruptions by a uniquely identified visitor. Expires after a
period of inactivity.
We use the HAC variant by [Murray et al. 2006, Mehrzadi et al. 2012]
Unsupervised, gap-based session boundary detection
Traditional web logs analysis
Benefit: visitor-specific temporal cut-off
Two-step procedure:
Set visitor-specific session cut-off as time interval that significantly
increases the variance.
Group HTTP/SPARQL requests into sessions according to the cut-off
15
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
16
Heavy/Light SPARQL Queries Binary Classifier
 Rough estimate of heavy and light queries with supervised
binary classification.
 Heavy SPARQL Query: if it requires considerable computational
and memory resources.
Light
Heavy
17
Heavy/Light SPARQL Queries Binary Classifier
 Feature vectors: SPARQL 1.1 syntactic features only:
18
• Traffic Analytics Platform for LD Servers
• Metrics
• Metrics Extraction
• Visitor Sessions
• Heavy/Light SPARQL Queries
• Results & British Library Trial Insight
19
British National Bibliography access logs
bnb.data.bl.uk (access logs are not public)
13 months
~ 10M HTTP requests/month
DBpedia 3.9 access logs
USEWOD 2015 Dataset
Datasets
20
Visitor Session Detection: Results
How well do we detect the beginning of a new session?
Dataset
British National Bibliography access logs (3 consecutive days)
~16k HTTP/SPARQL requests
• 32% Desktop browsers (115 visitors)
• 68% Software libraries (10 visitors)
Manually annotated records
• 1=session_start | 0=internal
Baseline: fixed-length cut-offs
HAC outperforms fixed-length cut-offs
21
Random distinct queries from DBpedia 3.9 access logs
Run the queries multiple times on local clone of DBpedia
Kept ~3.7k queries with low variance (3.1k light, 600 heavy)
Cut-off threshold: 100ms
Naïve Bayes and SVM
Grid search & randomized search w/ 10-fold CV
Heavy/Light SPARQL: Experiment Protocol
22
Heavy/Light SPARQL: Results
23
Genuine calls account for 0.6% of total traffic!
+30% of HTTP/SPARQL traffic over the observed 13 months
Sharp increase in requests from Software Libraries (95x)
SPARQL accounts for 29% of traffic
6% of heavy SPARQL queries
37 days have unusual traffic spikes
Bounce rate: 48%
Software Libraries have bigger, deeper, and longer sessions.
Some Insights on BL Traffic Logs
24
We relieve publishers from manual and time-consuming
access log mining
Support Linked Data-specific metrics
 Break down traffic by RDF content
 Capture SPARQL insights
 Properly interpret 303 patterns
Reconstruction of Linked Data visitors sessions
Heavy/light SPARQL classifier w/ SPARQL syntax +
supervised learning
Revealed hidden insights on 13 months of access logs of the
British Library
Summary
25
Statistics on noise (i.e. web crawlers)
Heavy/light classifier
 Feature set refinements
 Does it generalize to other datasets?
Enhance session detection with content-based heuristics
 Relatedness of subsequent SPARQL queries
 Structure and type of requested RDF entities
Future Work
26
Public Demo: bit.ly/ld-traffic
innovation.ie.fujitsu.com/kedi
1 of 26

Recommended

20101020 harper by
20101020 harper20101020 harper
20101020 harpercharper
401 views16 slides
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati... by
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...
BioSHaRE: Opal and Mica: a software suite for data harmonization and federati...Lisette Giepmans
912 views33 slides
Linked Open Government Data (LOGD) by
Linked Open Government Data (LOGD)Linked Open Government Data (LOGD)
Linked Open Government Data (LOGD)Nooshin Allahyari
530 views22 slides
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe... by
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...
OSDC 2015: Tudor Golubenco | Application Performance Management with Packetbe...NETWAYS
202 views28 slides
Log aggragation by
Log aggragationLog aggragation
Log aggragationMatheusCanabarroPea
28 views15 slides
The CIARD RINGValeri by
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeriCIARD Movement
445 views18 slides

More Related Content

Similar to Traffic Analytics for Linked Data Publishers

Data monstersrealtimeetl new by
Data monstersrealtimeetl newData monstersrealtimeetl new
Data monstersrealtimeetl newGreenM
17 views38 slides
Southwest Power Pool big data case study by
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study Seeling Cheung
1.3K views15 slides
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre by
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreHPCC Systems
599 views28 slides
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog by
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDogRedis Labs
2K views76 slides
Building Fast Applications for Streaming Data by
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Datafreshdatabos
1.2K views44 slides
nstitutional repositories, item and research data metrics by
nstitutional repositories, item and research data metricsnstitutional repositories, item and research data metrics
nstitutional repositories, item and research data metricsUKSG: connecting the knowledge community
315 views32 slides

Similar to Traffic Analytics for Linked Data Publishers(20)

Data monstersrealtimeetl new by GreenM
Data monstersrealtimeetl newData monstersrealtimeetl new
Data monstersrealtimeetl new
GreenM17 views
Southwest Power Pool big data case study by Seeling Cheung
Southwest Power Pool big data case study Southwest Power Pool big data case study
Southwest Power Pool big data case study
Seeling Cheung1.3K views
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre by HPCC Systems
Big Data Processing Beyond MapReduce by Dr. Flavio VillanustreBig Data Processing Beyond MapReduce by Dr. Flavio Villanustre
Big Data Processing Beyond MapReduce by Dr. Flavio Villanustre
HPCC Systems599 views
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog by Redis Labs
 Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Monitoring and Scaling Redis at DataDog - Ilan Rabinovitch, DataDog
Redis Labs2K views
Building Fast Applications for Streaming Data by freshdatabos
Building Fast Applications for Streaming DataBuilding Fast Applications for Streaming Data
Building Fast Applications for Streaming Data
freshdatabos1.2K views
Improving Reporting Performance by Dhiren Gala
Improving Reporting PerformanceImproving Reporting Performance
Improving Reporting Performance
Dhiren Gala256 views
Introducing the IRUSdataUK pilot webinar by Jisc
Introducing the IRUSdataUK pilot webinarIntroducing the IRUSdataUK pilot webinar
Introducing the IRUSdataUK pilot webinar
Jisc464 views
HUG Ireland Event - HPCC Presentation Slides by John Mulhall
HUG Ireland Event - HPCC Presentation SlidesHUG Ireland Event - HPCC Presentation Slides
HUG Ireland Event - HPCC Presentation Slides
John Mulhall572 views
Tufts Research: Strategies from Data Management Leaders to Speed Clinical Trials by Veeva Systems
Tufts Research: Strategies from Data Management Leaders to Speed Clinical TrialsTufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Tufts Research: Strategies from Data Management Leaders to Speed Clinical Trials
Veeva Systems186 views
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ... by Maya Lumbroso
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Maya Lumbroso79 views
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ... by Dataconomy Media
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Ronan Corkery, kdb+ developer at Kx Systems: “Kdb+: How Wall Street Tech can ...
Dataconomy Media391 views
Web Performance – die effektivsten Techniken aus der Praxis by Felix Gessert
Web Performance – die effektivsten Techniken aus der PraxisWeb Performance – die effektivsten Techniken aus der Praxis
Web Performance – die effektivsten Techniken aus der Praxis
Felix Gessert1.7K views

More from Luca Costabello

Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings by
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsMachine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsLuca Costabello
643 views24 slides
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data... by
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Luca Costabello
1.5K views26 slides
Context-Aware Access Control and Presentation of Linked Data by
Context-Aware Access Control and Presentation of Linked DataContext-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked DataLuca Costabello
7.8K views52 slides
Access Control for HTTP Operations on Linked Data by
Access Control for HTTP Operations on Linked DataAccess Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked DataLuca Costabello
1.8K views31 slides
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores by
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLinked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLuca Costabello
1.2K views9 slides
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data by
PRISSMA,Towards Mobile Adaptive Presentation of the Web of DataPRISSMA,Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA, Towards Mobile Adaptive Presentation of the Web of DataLuca Costabello
1.2K views7 slides

More from Luca Costabello(7)

Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings by Luca Costabello
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph EmbeddingsMachine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Machine Learning on Knowledge Graphs: a Quick Tour of Knowledge Graph Embeddings
Luca Costabello643 views
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data... by Luca Costabello
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Error-Tolerant RDF Subgraph Matching for Adaptive Presentation of Linked Data...
Luca Costabello1.5K views
Context-Aware Access Control and Presentation of Linked Data by Luca Costabello
Context-Aware Access Control and Presentation of Linked DataContext-Aware Access Control and Presentation of Linked Data
Context-Aware Access Control and Presentation of Linked Data
Luca Costabello7.8K views
Access Control for HTTP Operations on Linked Data by Luca Costabello
Access Control for HTTP Operations on Linked DataAccess Control for HTTP Operations on Linked Data
Access Control for HTTP Operations on Linked Data
Luca Costabello1.8K views
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores by Luca Costabello
Linked Data Access Goes Mobile: Context Aware Authorization for Graph StoresLinked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Linked Data Access Goes Mobile: Context Aware Authorization for Graph Stores
Luca Costabello1.2K views
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data by Luca Costabello
PRISSMA,Towards Mobile Adaptive Presentation of the Web of DataPRISSMA,Towards Mobile Adaptive Presentation of the Web of Data
PRISSMA, Towards Mobile Adaptive Presentation of the Web of Data
Luca Costabello1.2K views
Time Based Cluster Analysis for Automatic Blog Generation by Luca Costabello
Time Based Cluster Analysis for Automatic Blog GenerationTime Based Cluster Analysis for Automatic Blog Generation
Time Based Cluster Analysis for Automatic Blog Generation
Luca Costabello2.2K views

Recently uploaded

Short Story Assignment by Kelly Nguyen by
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyenkellynguyen01
18 views17 slides
CRIJ4385_Death Penalty_F23.pptx by
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptxyvettemm100
6 views24 slides
PROGRAMME.pdf by
PROGRAMME.pdfPROGRAMME.pdf
PROGRAMME.pdfHiNedHaJar
17 views13 slides
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfvikas12611618
8 views30 slides
ColonyOS by
ColonyOSColonyOS
ColonyOSJohanKristiansson6
9 views17 slides
Data structure and algorithm. by
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm. Abdul salam
18 views24 slides

Recently uploaded(20)

Short Story Assignment by Kelly Nguyen by kellynguyen01
Short Story Assignment by Kelly NguyenShort Story Assignment by Kelly Nguyen
Short Story Assignment by Kelly Nguyen
kellynguyen0118 views
CRIJ4385_Death Penalty_F23.pptx by yvettemm100
CRIJ4385_Death Penalty_F23.pptxCRIJ4385_Death Penalty_F23.pptx
CRIJ4385_Death Penalty_F23.pptx
yvettemm1006 views
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf by vikas12611618
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdfVikas 500 BIG DATA TECHNOLOGIES LAB.pdf
Vikas 500 BIG DATA TECHNOLOGIES LAB.pdf
vikas126116188 views
Data structure and algorithm. by Abdul salam
Data structure and algorithm. Data structure and algorithm.
Data structure and algorithm.
Abdul salam 18 views
3196 The Case of The East River by ErickANDRADE90
3196 The Case of The East River3196 The Case of The East River
3196 The Case of The East River
ErickANDRADE9011 views
Chapter 3b- Process Communication (1) (1)(1) (1).pptx by ayeshabaig2004
Chapter 3b- Process Communication (1) (1)(1) (1).pptxChapter 3b- Process Communication (1) (1)(1) (1).pptx
Chapter 3b- Process Communication (1) (1)(1) (1).pptx
ayeshabaig20045 views
Survey on Factuality in LLM's.pptx by NeethaSherra1
Survey on Factuality in LLM's.pptxSurvey on Factuality in LLM's.pptx
Survey on Factuality in LLM's.pptx
NeethaSherra15 views
Advanced_Recommendation_Systems_Presentation.pptx by neeharikasingh29
Advanced_Recommendation_Systems_Presentation.pptxAdvanced_Recommendation_Systems_Presentation.pptx
Advanced_Recommendation_Systems_Presentation.pptx
RuleBookForTheFairDataEconomy.pptx by noraelstela1
RuleBookForTheFairDataEconomy.pptxRuleBookForTheFairDataEconomy.pptx
RuleBookForTheFairDataEconomy.pptx
noraelstela167 views
Understanding Hallucinations in LLMs - 2023 09 29.pptx by Greg Makowski
Understanding Hallucinations in LLMs - 2023 09 29.pptxUnderstanding Hallucinations in LLMs - 2023 09 29.pptx
Understanding Hallucinations in LLMs - 2023 09 29.pptx
Greg Makowski13 views
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx by DataScienceConferenc1
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
[DSC Europe 23] Zsolt Feleki - Machine Translation should we trust it.pptx
Introduction to Microsoft Fabric.pdf by ishaniuudeshika
Introduction to Microsoft Fabric.pdfIntroduction to Microsoft Fabric.pdf
Introduction to Microsoft Fabric.pdf
ishaniuudeshika24 views
Building Real-Time Travel Alerts by Timothy Spann
Building Real-Time Travel AlertsBuilding Real-Time Travel Alerts
Building Real-Time Travel Alerts
Timothy Spann109 views
Cross-network in Google Analytics 4.pdf by GA4 Tutorials
Cross-network in Google Analytics 4.pdfCross-network in Google Analytics 4.pdf
Cross-network in Google Analytics 4.pdf
GA4 Tutorials6 views
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation by DataScienceConferenc1
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
[DSC Europe 23] Spela Poklukar & Tea Brasanac - Retrieval Augmented Generation
Supercharging your Data with Azure AI Search and Azure OpenAI by Peter Gallagher
Supercharging your Data with Azure AI Search and Azure OpenAISupercharging your Data with Azure AI Search and Azure OpenAI
Supercharging your Data with Azure AI Search and Azure OpenAI
Peter Gallagher37 views

Traffic Analytics for Linked Data Publishers

  • 1. FUJITSU RESTRICTED - UK & IRELAND EYES ONLY © Copyright 2014 Fujitsu (Ireland) Limited Traffic Analytics for Linked Data Publishers Luca Costabello, Pierre-Yves Vandenbussche, Gofran Shukair Fujitsu Ireland Corine Deliot, Neil Wilson British Library
  • 2. 2 The Problem: Measuring Traffic on RDF Datasets Linked Data publishers have limited awareness of how datasets are accessed by visitors. No tool to mine Linked Data servers access logs Why is this such a big deal? Justify investment in Linked Data IT infrastructure Cost control Identify abuses Interpret access peaks
  • 3. 3 Which traffic metrics?  Adapt conventional web analytics metrics  Define Linked-Data specific extensions How to extract and compute such metrics?  Which data sources? (client tracking? server access logs mining? both?)  Need to support dual data access protocol (HTTP operations + SPARQL)  How to filter noise? (i.e. robots, search engines crawlers)  How to detect client sessions? (no client tracking, dual data access protocol)  How to detect SPARQL activity peaks? Challenges
  • 4. 4 Existing tools do not include Linked Data-specific metrics: Linked data-specific metrics, but no platform [Moller et al, WebScience 2010] Filling a Gap in Prior Art
  • 5. 5 • Traffic analytics platform for LD servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight Our Contribution / Agenda
  • 6. 6 • Traffic analytics platform for LD servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 7. 7
  • 8. 8
  • 9. 9 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 13. 13 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 14. 14 Visitor Session Detection Session: sequence of requests issued with no significant interruptions by a uniquely identified visitor. Expires after a period of inactivity. We use the HAC variant by [Murray et al. 2006, Mehrzadi et al. 2012] Unsupervised, gap-based session boundary detection Traditional web logs analysis Benefit: visitor-specific temporal cut-off Two-step procedure: Set visitor-specific session cut-off as time interval that significantly increases the variance. Group HTTP/SPARQL requests into sessions according to the cut-off
  • 15. 15 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 16. 16 Heavy/Light SPARQL Queries Binary Classifier  Rough estimate of heavy and light queries with supervised binary classification.  Heavy SPARQL Query: if it requires considerable computational and memory resources. Light Heavy
  • 17. 17 Heavy/Light SPARQL Queries Binary Classifier  Feature vectors: SPARQL 1.1 syntactic features only:
  • 18. 18 • Traffic Analytics Platform for LD Servers • Metrics • Metrics Extraction • Visitor Sessions • Heavy/Light SPARQL Queries • Results & British Library Trial Insight
  • 19. 19 British National Bibliography access logs bnb.data.bl.uk (access logs are not public) 13 months ~ 10M HTTP requests/month DBpedia 3.9 access logs USEWOD 2015 Dataset Datasets
  • 20. 20 Visitor Session Detection: Results How well do we detect the beginning of a new session? Dataset British National Bibliography access logs (3 consecutive days) ~16k HTTP/SPARQL requests • 32% Desktop browsers (115 visitors) • 68% Software libraries (10 visitors) Manually annotated records • 1=session_start | 0=internal Baseline: fixed-length cut-offs HAC outperforms fixed-length cut-offs
  • 21. 21 Random distinct queries from DBpedia 3.9 access logs Run the queries multiple times on local clone of DBpedia Kept ~3.7k queries with low variance (3.1k light, 600 heavy) Cut-off threshold: 100ms Naïve Bayes and SVM Grid search & randomized search w/ 10-fold CV Heavy/Light SPARQL: Experiment Protocol
  • 23. 23 Genuine calls account for 0.6% of total traffic! +30% of HTTP/SPARQL traffic over the observed 13 months Sharp increase in requests from Software Libraries (95x) SPARQL accounts for 29% of traffic 6% of heavy SPARQL queries 37 days have unusual traffic spikes Bounce rate: 48% Software Libraries have bigger, deeper, and longer sessions. Some Insights on BL Traffic Logs
  • 24. 24 We relieve publishers from manual and time-consuming access log mining Support Linked Data-specific metrics  Break down traffic by RDF content  Capture SPARQL insights  Properly interpret 303 patterns Reconstruction of Linked Data visitors sessions Heavy/light SPARQL classifier w/ SPARQL syntax + supervised learning Revealed hidden insights on 13 months of access logs of the British Library Summary
  • 25. 25 Statistics on noise (i.e. web crawlers) Heavy/light classifier  Feature set refinements  Does it generalize to other datasets? Enhance session detection with content-based heuristics  Relatedness of subsequent SPARQL queries  Structure and type of requested RDF entities Future Work