Spatio-Temporal Pseudo Relevance Feedback for Large-Scale and Heterogeneous Scientific Repositories

  1. Spatio-Temporal Pseudo Relevance Feedback for Large-Scale and Heterogeneous Scientific Repositories Shinichi Takeuchi, Yuhei Akahoshi, Bun Theang Ong, Komei Sugiura, and Koji Zettsu National Institute of Information and Communications Tech., Japan
  2. Background: Our target task is scientific data retrieval • Why is scientific data retrieval important? – Some funding agencies have started to request open access to research outcomes – Open science data are useless unless they are searchable (figure: 1 correct vs. 9 incorrect) • Existing systems – Portals: WDS Portal, Pangaea Portal, … – Search engines: Google Fusion Tables, …
  3. Examples of existing systems • Google Fusion Tables • Pangaea
  4. Difficulty: Text information is very limited • Text information is limited compared with web page searching – e.g., only 1.7% of Pangaea’s datasets have sufficient text data
     Dataset attributes            # of datasets   Ratio [%]
     With abstract                         7,028         1.7
     With spatial info.                  404,145        99.6
     With temporal info.                 297,478        73.3
     With spatio-temporal info.          297,037        73.2
     Total (Pangaea)                     405,456       100.0
     Definition of a “dataset” := a dataset having metadata
     cf. We have collected approx. 800,000 scientific datasets
  5. Demo: Baseline has low recall
  6. Conventional studies • PRF = Pseudo (Blind) Relevance Feedback
     Field                       Example
     Scientific data retrieval   Generation of spatio-temporal metadata [Pallickara+ 2010]; KVS for discretized spatio-temporal information [Fox+ 2013]
     Original PRF                Validation with TREC tasks [Buckley+ 1995]
     PRF applications            Microblog search, temporal expression extraction, … [Lioma+ 2008, Lv+ 2010, Chen+ 2013]
     Main innovation and differentiation • Pseudo relevance feedback using Space-Time-Text (STT) information • Dataset similarity based on Bhattacharyya distance of spatio-temporal probabilistic distributions
  7. Standard dataset example (figure labels: citation info (author, year, etc.); what is observed (sensory observations); spatio-temporal info.; dataset)
  8. Overview: Space-Time-Text (STT) query is used in the 2nd search (diagram labels: Browser with GUI input and GUI output; System with Index, DB search, Retrieval, and Clustering; text query, 1st search results, datasets, dataset clusters; text, space, and time query expansion forming the STT query expansion; STT query, 2nd search results; space score, time score, text score)
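The two-pass flow in this overview can be illustrated for the text part alone. The following is a minimal sketch of generic pseudo relevance feedback, not the authors' system: the scoring function (raw term counts), the expansion rule (most frequent new terms in the top-L results), and all names such as `search`, `expand_query`, and `prf_search` are assumptions made for illustration. The space and time expansion steps are not covered here.

```python
from collections import Counter


def search(query_terms, docs, top_n):
    """Rank documents by how often the query terms occur (toy scoring, an assumption)."""
    scored = []
    for doc_id, text in docs.items():
        counts = Counter(text.lower().split())
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((score, doc_id))
    # Highest score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_n]]


def expand_query(query_terms, docs, feedback_ids, n_new_terms):
    """Add the most frequent terms of the feedback documents to the query."""
    counts = Counter()
    for doc_id in feedback_ids:
        counts.update(
            term for term in docs[doc_id].lower().split() if term not in query_terms
        )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:n_new_terms]]


def prf_search(query_terms, docs, top_l=2, n_new_terms=2, top_n=10):
    """1st search, query expansion from the top-L results, then 2nd search."""
    first_results = search(query_terms, docs, top_l)
    expanded = expand_query(query_terms, docs, first_results, n_new_terms)
    return search(expanded, docs, top_n)


# Hypothetical toy collection: "d3" never mentions "sediment".
docs = {
    "d1": "sediment core marine",
    "d2": "sediment marine core sample",
    "d3": "marine core drilling",
    "d4": "ozone measurement",
}
first_pass = search(["sediment"], docs, top_n=10)
second_pass = prf_search(["sediment"], docs)
```

In this toy collection the second pass also retrieves `d3`, which shares vocabulary with the top results but not the original query term. This is the kind of recall gain that feedback aims for.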
  9. Proposed: Bhattacharyya distance is used for measuring similarity between two spatio-temporal distributions
     The Space-Time-Text score $\phi(y)$ is defined as a simple linear combination of the space, time, and text terms:
     $$\phi(y) = w_s \phi_s(y) + w_t \phi_t(y) + \phi_k(y)$$
     The space score uses the minimum distance from the top $L$ results $Y_L$:
     $$\phi_s(y) = \exp\left(-\left(\min_{y' \in Y_L} d_s(y, y')\right)^2\right)$$
     If we approximate distributions as Gaussians, Bhattacharyya distance can be written as follows:
     $$d(y_i, y_j) = \frac{1}{8} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j)' \left(\frac{\Sigma_i + \Sigma_j}{2}\right)^{-1} (\boldsymbol{\mu}_i - \boldsymbol{\mu}_j) + \frac{1}{2} \ln \frac{\det\left(\frac{\Sigma_i + \Sigma_j}{2}\right)}{\sqrt{\det \Sigma_i \, \det \Sigma_j}}$$
     * Time score is calculated in the same manner * Cosine distance is used as text score
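The scoring on this slide can be sketched in a few lines. The code below is a dependency-free illustration under stated assumptions, not the authors' implementation: each dataset is represented as a hypothetical `(mean, covariance)` pair, the weights default to 1.0 because the slide gives no values, and the helper `_solve_and_logdet` is a plain Gaussian elimination that a numerical library would normally replace.

```python
import math


def _solve_and_logdet(a, b):
    """Solve a x = b by Gaussian elimination; also return ln|det a|."""
    n = len(a)
    m = [list(row) + [b[k]] for k, row in enumerate(a)]  # augmented matrix
    logdet = 0.0
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[pivot] = m[pivot], m[c]
        logdet += math.log(abs(m[c][c]))
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x, logdet


def bhattacharyya_gaussian(mu_i, cov_i, mu_j, cov_j):
    """Bhattacharyya distance between two Gaussians (covariances assumed positive definite)."""
    n = len(mu_i)
    diff = [a - b for a, b in zip(mu_i, mu_j)]
    cov = [[(cov_i[r][c] + cov_j[r][c]) / 2.0 for c in range(n)] for r in range(n)]
    x, logdet = _solve_and_logdet(cov, diff)  # x = cov^{-1} diff
    zeros = [0.0] * n
    _, logdet_i = _solve_and_logdet(cov_i, zeros)
    _, logdet_j = _solve_and_logdet(cov_j, zeros)
    mean_term = sum(d * v for d, v in zip(diff, x)) / 8.0
    cov_term = 0.5 * (logdet - 0.5 * (logdet_i + logdet_j))
    return mean_term + cov_term


def component_score(y, top_results):
    """exp(-(min distance to the top-L results)^2); y and results are (mean, cov) pairs."""
    d_min = min(bhattacharyya_gaussian(*y, *y_prime) for y_prime in top_results)
    return math.exp(-d_min ** 2)


def stt_score(space_score, time_score, text_score, w_s=1.0, w_t=1.0):
    """Linear combination of space, time, and text scores (weights are assumptions)."""
    return w_s * space_score + w_t * time_score + text_score
```

A dataset whose distribution coincides with one of the top-L results has distance 0 and therefore a component score of 1; the score decays toward 0 as the nearest top result gets farther away.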
  10. Experiment: We built a test set for evaluation • No standard benchmarking test is available for scientific data retrieval • Our test set – Queries: Scientific keywords – Training/test datasets obtained from Pangaea – Labels are given as the average of three expert labelers
      Item                            Size               Source of datasets
      Queries (scientific keywords)   50                 Cross-DB, Google Trends, Microsoft Academic Search, SWEET Ontology
      Training/test datasets          6,000 (120 * 50)   Top 120 Pangaea search results per query
      Queries: acid deposition, aerosol, air quality, atmospheric circulation, boreal forest, climate change, coastal waters, desert, glacier, global warming, heavy metal, hurricane, interannual variability, marine biology, ocean circulation, ozone, particulate matter, sea level pressure, sediment, soil ph, species richness, trade wind, typhoon, …
  11. Qualitative example: query = “sediment” (figure panels: Baseline, Proposed) Green: Correct (high relevance) Red: Incorrect (low relevance)
  12. Experimental conditions: quantitative comparison • Labeling by experts in natural science – Labelers: holders of (at least) a master's degree – Relevance: from 0 (no relevance) to 3 (high relevance) • Measures – nDCG@k, Precision@k, Recall@k, Average Precision
      $$\mathrm{P@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fp@}k}, \qquad \mathrm{R@}k = \frac{\mathrm{tp@}k}{\mathrm{tp@}k + \mathrm{fn@ALL}}, \qquad \mathrm{AP} = \frac{1}{N} \sum_{k=1}^{N} \mathrm{rel}(k)\, \mathrm{P@}k$$
      Method     Text PRF   Space-Time PRF
      Baseline   No         No
      Text-PRF   Yes        No
      STT-PRF    Yes        Yes
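The measures on this slide can be sketched as follows. This is an illustration under assumptions, not the evaluation code used in the study: results are given as binary relevance flags (the slide does not say how the 0 to 3 labels are binarized), the recall denominator is taken as the total number of relevant datasets, and `average_precision` follows the slide's formula literally with N as the length of the ranked list unless a different `normalizer` (for example the number of relevant datasets, as in the more common definition) is passed.

```python
def precision_at_k(flags, k):
    """P@k = tp@k / (tp@k + fp@k); flags[i] is 1 if the result at rank i+1 is relevant."""
    top = flags[:k]
    return sum(top) / len(top) if top else 0.0


def recall_at_k(flags, k, total_relevant):
    """R@k = tp@k / (tp@k + fn), with the denominator taken as all relevant datasets."""
    return sum(flags[:k]) / total_relevant if total_relevant else 0.0


def average_precision(flags, normalizer=None):
    """AP = (1/N) * sum over k of rel(k) * P@k, with N = len(flags) by default."""
    n = len(flags)
    if n == 0:
        return 0.0
    total = sum(
        rel * precision_at_k(flags, k) for k, rel in enumerate(flags, start=1)
    )
    return total / (normalizer if normalizer else n)


# Hypothetical ranked list: relevant results at ranks 1 and 3.
flags = [1, 0, 1, 0]
p_at_2 = precision_at_k(flags, 2)
r_at_2 = recall_at_k(flags, 2, total_relevant=2)
ap_literal = average_precision(flags)
ap_by_relevant = average_precision(flags, normalizer=2)
```

The two AP variants differ only in the normalizing constant, so they rank methods identically on a fixed test set but yield different absolute values.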
  13. Quantitative result (1): Text-PRF and STT-PRF improved Average Precision (plot: AP, from 0.0 to 0.7, versus ratio of datasets having abstract [%] at 0, 1, 2, 5, 10, 20, 50, and 100; series: Baseline, Text-PRF, STT-PRF) STT-PRF beats baseline in standard setups
  14. Quantitative results (2): STT-PRF obtained best results in Recall, AP, and number of hits
      Method     nDCG@30   P@30    R@30    AP      #Hit
      Baseline   0.681     0.388   0.095   0.086   15.0
      ST-PRF     0.627     0.354   0.155   0.137   26.8
      Text-PRF   0.725     0.332   0.221   0.339   91.5
      STT-PRF    0.722     0.332   0.238   0.343   91.6
      Ratio of datasets having abstract = 2% (simulating Pangaea’s condition)
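The slides report nDCG@30 without giving its formula. The sketch below shows one common formulation (linear gain with a log2 rank discount, normalized by the ideal ordering of the same list). Which variant the authors used is not stated, so the gain function and the function names here are assumptions.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )


def ndcg_at_k(relevances, k):
    """DCG@k divided by the DCG@k of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

Because the graded labels (0 to 3) enter the gain directly, nDCG rewards placing highly relevant datasets near the top, whereas P@k and R@k only count hits.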
  15. Future directions: Application to heterogeneous data We have collected 1.25 million datasets (2.5 PB) as of Jan. 2014
      Asset category          Details
      Physical sensor data    Winds, temperature, pressure, humidity, rainfalls, snowfalls, luminance, CO2, air quality, pollen allergy, radiation, typhoon, earthquake, landslide, infectious disease, etc. (49 sensors)
      Social sensor data      Geo-tagged Twitter (JP, US, Sample, trend), Google news, RSS news
      Web archive             Full-text data, sender data, reputation data, modification relation data
      Science data            WDS metadata (40 domains, 25 sites from Pangaea, ICPSR, DRYAD, ESDS, ADA, etc.); Open government data metadata
      Geographical data       Landmarks, river-level data, shelter data
      Text analysis data      Web text ontology, EDR concept dictionary, WordNet, sentiment dictionary
      Language trans. tools   VoiceTra text translation, JServer
      Text analysis tools     Proper noun extractor, morphological analyzer, dependency parsing
      GIS tools               Google Geocoding, Yahoo Contents Geocoder, landmark extractor, postal code search, GeoNLP
      Speech tools            VoiceTra (speech recognition & synthesis), Rospeex
  16. Summary • Novelty of approach – Pseudo relevance feedback using Space-Time-Text (STT) information • Results – Proposed method improved Recall, AP, and #Hit under a practical setup • Applications – SNS and other geo-tagged messages