Spatio-Temporal Pseudo Relevance Feedback for Large-Scale and Heterogeneous Scientific Repositories
Shinichi Takeuchi, Yuhei Akahoshi, Bun Theang Ong,
Komei Sugiura, and Koji Zettsu
National Institute of Information and Communications Tech., Japan
Background: Our target task is scientific data retrieval
• Why is scientific data retrieval important?
– Some funding agencies have started to request open access to research outcomes
– Open science data are useless unless they are searchable
[Figure: 1 correct vs 9 incorrect]
• Existing systems
– Portals: WDS Portal, Pangaea Portal, …
– Search engine: Google Fusion Tables, …
Examples of existing systems
• Google Fusion Tables
– https://research.google.com/tables?source=fthm
• Pangaea
– http://www.pangaea.de/
Difficulty: Text information is very limited
• Text information is limited compared with web page searching
– e.g. Only 1.7% of Pangaea’s datasets have sufficient text data
Dataset attributes            # of datasets   Ratio [%]
With abstract                         7,028         1.7
With spatial info.                  404,145        99.6
With temporal info.                 297,478        73.3
With spatio-temporal info.          297,037        73.2
Total (Pangaea)                     405,456       100.0
Definition of a “dataset” := a dataset having metadata
cf: We have collected approx. 800,000 scientific datasets
Conventional studies
• PRF = Pseudo (Blind) Relevance Feedback
Field                      Example
Scientific data retrieval  • Generation of spatio-temporal metadata [Pallickara+ 2010]
                           • KVS for discretized spatio-temporal information [Fox+ 2013]
Original PRF               Validation with TREC tasks [Buckley+ 1995]
PRF applications           Microblog search, temporal expression extraction, … [Lioma+ 2008, Lv+ 2010, Chen+ 2013]
Main innovation and differentiation
• Pseudo relevance feedback using Space-Time-Text (STT) information
• Dataset similarity based on the Bhattacharyya distance between spatio-temporal probability distributions
Overview: Space-Time-Text (STT) query is used in the 2nd search
[Figure: system overview. Labels: Browser (GUI input, GUI output); System (Index, DB search, Retrieval, Clustering, Datasets, Dataset clusters); Text query → 1st search results; Text, Space, and Time query expansion → STT query expansion → STT query → 2nd search results; Text score, Space score, Time score]
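The two-pass flow named in this overview (a text-only 1st search, then a 2nd search that also uses space and time) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the dictionary fields `text` and `loc`, the term-count text score, and the plain Euclidean distance between point locations, which stands in for the distribution-based distance the slides propose. Clustering, query expansion, and the time score are omitted.

```python
import math

def text_score(terms, dataset):
    """Toy text score: number of query terms found in the dataset's text."""
    words = set(dataset["text"].lower().split())
    return sum(1 for t in terms if t in words)

def prf_search(terms, datasets, top_l=1):
    """Two-pass search with pseudo relevance feedback (toy version)."""
    # 1st search: text only. Datasets without matching text are not retrieved.
    first = sorted(
        (d for d in datasets if text_score(terms, d) > 0),
        key=lambda d: text_score(terms, d),
        reverse=True,
    )
    feedback = first[:top_l]  # top L results, treated as relevant without checking
    if not feedback:
        return []

    # 2nd search: text score plus closeness in space to the feedback set.
    def second_score(d):
        nearest = min(
            math.hypot(d["loc"][0] - f["loc"][0], d["loc"][1] - f["loc"][1])
            for f in feedback
        )
        return text_score(terms, d) + math.exp(-nearest ** 2)

    return sorted(datasets, key=second_score, reverse=True)

# Hypothetical datasets: only "A" has any text.
datasets = [
    {"id": "C", "text": "", "loc": (10.0, 10.0)},
    {"id": "B", "text": "", "loc": (0.1, 0.0)},
    {"id": "A", "text": "glacier mass balance", "loc": (0.0, 0.0)},
]
ranking = [d["id"] for d in prf_search(["glacier"], datasets)]
print(ranking)  # ['A', 'B', 'C']
```

In this toy run, dataset B has no text and would be missed by the 1st search, but it ranks second in the 2nd search because it lies close to the feedback dataset A.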
Proposed: Bhattacharyya distance is used for measuring similarity between two spatio-temporal distributions
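The Space-Time-Text score $\phi(y)$ is defined as a simple linear combination of the space, time, and text scores (in that order):
$$\phi(y) = w_s \phi_s(y) + w_t \phi_t(y) + \phi_k(y)$$
The space score uses the minimum distance from $y$ to the top $L$ results $Y_L$:
$$\phi_s(y) = \exp\left(-\left(\min_{y' \in Y_L} d_s(y, y')\right)^2\right)$$
* Time score is calculated in the same manner
* Cosine distance is used as text score
If we approximate distributions as Gaussians, the Bhattacharyya distance can be written as follows:
$$d(y_i, y_j) = \frac{1}{8} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)' \left(\frac{\Sigma_i + \Sigma_j}{2}\right)^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \frac{1}{2} \ln \frac{\det\left(\frac{\Sigma_i + \Sigma_j}{2}\right)}{\sqrt{\det \Sigma_i \, \det \Sigma_j}}$$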
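The score and distance on this slide can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation. It assumes 2-D Gaussians (for example longitude and latitude) given as a mean pair and a 2x2 covariance, and it uses the standard closed form of the Bhattacharyya distance between Gaussians. The function names and the default weights `w_s = w_t = 1.0` are hypothetical.

```python
import math

def det2(m):
    """Determinant of a 2x2 matrix given as nested lists."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def bhattacharyya_2d(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two 2-D Gaussians."""
    # Averaged covariance (cov_i + cov_j) / 2.
    avg = [[(cov_i[r][c] + cov_j[r][c]) / 2.0 for c in range(2)] for r in range(2)]
    det_avg = det2(avg)
    dx, dy = mu_i[0] - mu_j[0], mu_i[1] - mu_j[1]
    # Quadratic form diff' * inv(avg) * diff, using the closed-form 2x2 inverse.
    quad = (avg[1][1] * dx * dx
            - (avg[0][1] + avg[1][0]) * dx * dy
            + avg[0][0] * dy * dy) / det_avg
    log_term = math.log(det_avg / math.sqrt(det2(cov_i) * det2(cov_j)))
    return quad / 8.0 + 0.5 * log_term

def space_score(y, top_results):
    """exp(-(min distance to the top L results)^2); y and results are (mean, cov)."""
    nearest = min(bhattacharyya_2d(y[0], y[1], t[0], t[1]) for t in top_results)
    return math.exp(-nearest ** 2)

def stt_score(space, time, text, w_s=1.0, w_t=1.0):
    """Linear combination of space, time, and text scores (weights are assumed)."""
    return w_s * space + w_t * time + text

identity = [[1.0, 0.0], [0.0, 1.0]]
a = ((0.0, 0.0), identity)
b = ((2.0, 0.0), identity)
print(bhattacharyya_2d(a[0], a[1], b[0], b[1]))  # 0.5
print(space_score(a, [a, b]))                    # 1.0
```

With unit covariances the log term vanishes, so the distance reduces to one eighth of the squared distance between the means, which gives 0.5 for means two units apart.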
Experiment: We built a test set for evaluation
• No standard benchmarking test is available for scientific data retrieval
• Our test set
– Queries: Scientific keywords
– Training/test datasets obtained from Pangaea
– Labels are given as the average of three expert labelers
Item                            Size               Source of datasets
Queries (scientific keywords)   50                 Cross-DB, Google Trends, Microsoft Academic Search, SWEET Ontology
Training/test datasets          6,000 (120 * 50)   Top 120 Pangaea’s search results per query
Queries: acid deposition, aerosol, air quality, atmospheric circulation, boreal forest, climate change, coastal waters, desert, glacier, global warming, heavy metal, hurricane, interannual variability, marine biology, ocean circulation, ozone, particulate matter, sea level pressure, sediment, soil ph, species richness, trade wind, typhoon, …
Experimental conditions: quantitative comparison
• Labeling by experts in natural science
– Labelers: holders of (at least) a master's degree
– Relevance: 0 (no relevance) – 3 (high relevance)
• Measures
– nDCG@k, Precision@k, Recall@k, Average Precision
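$$\mathrm{P@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fp@}k}$$
$$\mathrm{R@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fn@ALL}}$$
$$\mathrm{AP} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{rel}(k)\, \mathrm{P@}k$$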
Method     Text PRF   Space-Time PRF
Baseline   No         No
Text-PRF   Yes        No
STT-PRF    Yes        Yes
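The measures P@k, R@k, and AP listed on this slide can be computed from a ranked list of relevance labels, as in the sketch below. Several points are assumptions: labels are taken to be binary (the slides use a 0 to 3 scale, and how it is binarized is not stated), the recall denominator tp@k + fn@ALL is read as the total number of relevant datasets for the query, and N in the AP formula is read literally as the length of the ranked list. A common alternative convention normalizes AP by the number of relevant items instead.

```python
def precision_at_k(rels, k):
    """P@k = tp@k / (tp@k + fp@k) for a ranked list of 0/1 relevance labels."""
    top = rels[:k]
    return sum(top) / len(top) if top else 0.0

def recall_at_k(rels, k, total_relevant):
    """R@k = tp@k / total_relevant (assumed reading of tp@k + fn@ALL)."""
    return sum(rels[:k]) / total_relevant if total_relevant else 0.0

def average_precision(rels):
    """AP = (1/N) * sum_k rel(k) * P@k, with N taken as the list length."""
    n = len(rels)
    if n == 0:
        return 0.0
    total = sum(rel * precision_at_k(rels, k) for k, rel in enumerate(rels, start=1))
    return total / n

# Hypothetical ranked result list: relevant, not relevant, relevant, not relevant.
rels = [1, 0, 1, 0]
print(precision_at_k(rels, 2))                  # 0.5
print(recall_at_k(rels, 2, total_relevant=4))   # 0.25
print(average_precision(rels))
```

For this list, AP is (1 + 2/3) / 4, since only ranks 1 and 3 contribute.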
Quantitative result (1): Text-PRF and STT-PRF improved Average Precision
[Figure: AP (y-axis, 0.0 to 0.7) vs. ratio of datasets having abstract [%] (x-axis: 0, 1, 2, 5, 10, 20, 50, 100) for Baseline, Text-PRF, and STT-PRF]
STT-PRF beats baseline in standard setups
Quantitative results (2): STT-PRF obtained best results in Recall, AP, and number of hits
Method     nDCG@30   P@30    R@30    AP      #Hit
Baseline   0.681     0.388   0.095   0.086   15.0
ST-PRF     0.627     0.354   0.155   0.137   26.8
Text-PRF   0.725     0.332   0.221   0.339   91.5
STT-PRF    0.722     0.332   0.238   0.343   91.6
Ratio of datasets having abstract = 2% (simulating Pangaea’s condition)
Future directions: Application to heterogeneous data
We have collected 1.25 million datasets (2.5 PB) as of Jan. 2014
Asset category          Details
Physical sensor data    Winds, temperature, pressure, humidity, rainfall, snowfall, luminance, CO2, air quality, pollen allergy, radiation, typhoon, earthquake, landslide, infectious disease, etc. (49 sensors)
Social sensor data      Geo-tagged Twitter (JP, US, Sample, trend), Google news, RSS news
Web archive             Full-text data, sender data, reputation data, modification relation data
Science data            WDS metadata (40 domains, 25 sites from Pangaea, ICPSR, DRYAD, ESDS, ADA, etc.)
Open government data    Data.gov metadata
Geographical data       Landmarks, river-level data, shelter data
Text analysis data      Web text ontology, EDR concept dictionary, WordNet, sentiment dictionary
Language trans. tools   VoiceTra text translation, JServer
Text analysis tools     Proper noun extractor, morphological analyzer, dependency parsing
GIS tools               Google Geocoding, Yahoo Contents Geocoder, landmark extractor, postal code search, GeoNLP
Speech tools            VoiceTra (speech recognition & synthesis), Rospeex
Summary
• Novelty of approach
– Pseudo relevance feedback using Space-Time-Text (STT) information
• Results
– The proposed method improved Recall, AP, and #Hit under a practical setup
• Applications
– SNS and other geo-tagged messages