
MUDROD - Ranking


A machine learning based ranking system for geospatial data discovery

Published in: Data & Analytics

  1. Mining and Utilizing Dataset Relevancy from Oceanographic Dataset (MUDROD): Metadata, Usage Metrics, and User Feedback to Improve Data Discovery and Access. NASA AIST (NNX15AM85G). Yongyao Jiang, Chaowei (Phil) Yang, Yun Li (George Mason University); Edward M. Armstrong, Thomas Huang, David Moroni, Chris Finch, Lewis McGibbney, Frank Greguska (JPL, NASA). ESIP Discovery Cluster Telecon
  2. Data discovery problems • Keyword matching, no semantic context – Original query: sea surface temperature – Keyword-based: sea AND surface AND temperature – Real intent: “sea surface temperature” OR “sst” OR “ghrsst” OR … • Single-dimension ranking – Popularity, spatial resolution, text match, etc. – Hundreds, even thousands of matches • Lack of data recommendation – When a user clicks on a dataset, related datasets should be presented
  3. Objectives • Analyze web logs to discover user knowledge (query and data relationships) • Construct a knowledge base by combining the semantic and profile analyzers • Improve data discovery through 1) better ranking; 2) recommendation; 3) ontology navigation
  4. Functions/Modules • Web log preprocessing • Semantic analysis of user queries • Machine learning based search ranking • Data recommendation
  5. Background • Ranking is a long-standing problem in geospatial data discovery • A query typically returns hundreds, even thousands of matches • The problem grows as more Earth observation data is collected
  6. Background • Some portals provide metrics to sort the search results • These can be helpful, but induce users to focus on a single data characteristic • They fail to take into account users’ multidimensional preferences (e.g., processing level & spatial resolution)
  7. Objective • Put the most desired data at the top of the result list • What features can represent users’ search preferences for geospatial data? • How can the ranking function balance all of these features?
  8. Ranking features • Eleven features identified from – Geospatial metadata attributes – Query–metadata content overlap – User behavior from web logs
  9. Ranking features – metadata attributes • Five features, verified by domain experts • Query-independent: static, depend on the data itself, and do not change with the query – Release date: the date the data was published – Processing level (PL): the processing level of image products, from level 0 to level 4 – Version number: the published version of the data – Spatial resolution: the spatial resolution of the data – Temporal resolution: the temporal resolution of the data
  10. Ranking features – query–metadata overlap • Text overlap • Spatial overlap
  11. Text query–metadata overlap • Term frequency: how often does the query term appear in the metadata attribute? • Inverse document frequency: query terms that appear in many metadata records get a lower weight than uncommon terms (e.g., ocean < temperature) • Field length: the length of the metadata attribute (e.g., title > abstract/description) • Lucene’s practical scoring function combines them
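The three factors above can be combined in a simplified, Lucene-style score. This is a minimal sketch, not Lucene's actual implementation: the function name and the square-root tf / log idf shapes are illustrative assumptions, and real Lucene adds boost and normalization factors omitted here.

```python
import math

def lucene_style_score(query_terms, doc_field, corpus_fields):
    """Simplified sketch of a Lucene-style score: sum over query terms of
    tf * idf, scaled by a field-length norm. Hypothetical, not Lucene's API."""
    n_docs = len(corpus_fields)
    tokens = doc_field.lower().split()
    score = 0.0
    for term in query_terms:
        tf = math.sqrt(tokens.count(term))                # term frequency
        df = sum(1 for f in corpus_fields if term in f.lower().split())
        idf = 1.0 + math.log(n_docs / (df + 1))           # rare terms weigh more
        score += tf * idf
    norm = 1.0 / math.sqrt(len(tokens))                   # shorter fields score higher
    return score * norm
```

A metadata title containing the query term thus outscores a longer abstract with the same term count, matching the field-length intuition on the slide.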
  12. Text query–metadata overlap • Expand the original query by providing semantic context (supported by the knowledge base) • Original query: ocean temperature • Semantically expanded query: ocean temperature OR sea surface temperature OR sst • Boosts relevance and increases recall & precision
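The expansion step can be sketched as a lookup into the knowledge base followed by an OR-join. The `VOCAB_LINKS` dictionary here is a hypothetical stand-in for the mined vocabulary linkages; in MUDROD these come from web-log and ontology analysis.

```python
# Hypothetical linkage table; in MUDROD these relationships are mined
# from web logs and ontologies and stored in the knowledge base.
VOCAB_LINKS = {
    "ocean temperature": ["sea surface temperature", "sst"],
}

def expand_query(query, links=VOCAB_LINKS):
    """Expand a query with its semantically related terms, joined by OR."""
    related = links.get(query.lower(), [])
    terms = [query] + related
    return " OR ".join(f'"{t}"' for t in terms)
```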
  13. Spatial query–metadata overlap • Spatial similarity between the query area and the coverage of a particular dataset • Overlap area normalized by the original areas of the query and the data
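For two bounding boxes this reduces to an intersection-area calculation. A minimal sketch follows; the averaging of the two normalized overlaps is an assumption (the slide says only that the overlap is normalized by both areas), and the function name is illustrative.

```python
def bbox_overlap_similarity(q, d):
    """Spatial similarity of two bounding boxes (min_lon, min_lat, max_lon, max_lat).
    Returns the intersection area normalized by each box's own area, then
    averaged -- one plausible normalization, assumed here for illustration."""
    ix = max(0.0, min(q[2], d[2]) - max(q[0], d[0]))   # overlap width
    iy = max(0.0, min(q[3], d[3]) - max(q[1], d[1]))   # overlap height
    inter = ix * iy
    if inter == 0.0:
        return 0.0
    area_q = (q[2] - q[0]) * (q[3] - q[1])
    area_d = (d[2] - d[0]) * (d[3] - d[1])
    return 0.5 * (inter / area_q + inter / area_d)
```

Identical boxes score 1.0, disjoint boxes 0.0, and partial overlaps fall in between.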
  14. Ranking features – user behavior • All-time, monthly, user popularity, and semantic popularity (retrieved from web logs) • Semantic popularity: the number of times the data has been clicked after searching a particular query and its highly related ones (query-dependent) • Example: Data A is clicked 10 times after “ocean temperature”, 5 times after “sea surface temperature”, and 5 times after “sst”; with linkage weights of 1.0 each, its semantic popularity for “ocean temperature” is 20
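Semantic popularity is then a click count weighted by the query-to-query linkage weights from the knowledge base. A minimal sketch, with hypothetical data structures for the click counts and similarity table:

```python
def semantic_popularity(query, clicks, query_similarity):
    """Clicks on a dataset after the query and its related queries,
    each weighted by the query-query linkage weight (sketch; the
    dict-based inputs are illustrative assumptions)."""
    return sum(query_similarity.get((query, q), 0.0) * n
               for q, n in clicks.items())
```

With the slide's example (10 + 5 + 5 clicks, all linkage weights 1.0) the score is 20.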
  15. Why popularity alone is not enough • A proxy for users’ search interests, used by many websites as the default ranking • A user is less likely to click a dataset that is low in the ranking, regardless of how relevant it is • The probability that a user clicks a result on page N (N ≥ 2) is nearly zero (Joachims 2002) • Newly released data generally have fewer clicks than long-standing data, but can be more relevant
  16. Ranking features • Eleven features identified • Machine learning is needed – Difficult to tune the weighting by hand – No assumption of feature independence required – Good at handling a large number of features
  17. RankSVM • One of the well-recognized ML ranking algorithms • Converts a ranking problem into a classification problem that a regular SVM can solve • Three main steps • 1) Standardize: mean = 0, std = 1 – SVM is not scale invariant – Without standardization, large-valued features dominate the optimization and training takes longer
  18. Conceptual example • 2) For each pair of training data, compute the difference • For each pair xi − xj, a label +1 is assigned if data i is more relevant than data j, and −1 for the reverse case • Pairs with the same relevance label are discarded
  19. RankSVM • 3) The ranking problem becomes a binary classification problem, and SVM is applied to find the optimal decision boundary
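The three steps above can be sketched end to end. This is an illustrative reimplementation, not MUDROD's Spark MLlib code: the pairwise transform follows the slides, while the plain hinge-loss subgradient trainer is an assumed stand-in for a full SVM solver.

```python
import numpy as np

def pairwise_transform(X, y, groups):
    """Steps 1-2 of RankSVM: standardize features (mean 0, std 1), then
    build difference vectors x_i - x_j within each query group, labelled
    +1 if item i is more relevant, -1 otherwise. Equal-label pairs are
    discarded."""
    std = X.std(axis=0)
    std[std == 0] = 1.0                      # guard against constant features
    X = (X - X.mean(axis=0)) / std
    diffs, labels = [], []
    for g in set(groups):
        idx = [k for k in range(len(y)) if groups[k] == g]
        for a in idx:
            for b in idx:
                if a < b and y[a] != y[b]:
                    diffs.append(X[a] - X[b])
                    labels.append(1 if y[a] > y[b] else -1)
    return np.array(diffs), np.array(labels)

def train_linear_ranker(D, s, epochs=200, lr=0.01, C=1.0):
    """Step 3 as binary classification: hinge-loss subgradient descent on
    the pair differences -- a stand-in for the SVM solver. The learned w
    scores new items by w . x."""
    w = np.zeros(D.shape[1])
    for _ in range(epochs):
        for x, t in zip(D, s):
            if t * w.dot(x) < 1:             # margin violated: hinge gradient
                w += lr * (C * t * x - w / len(s))
            else:                            # only regularization
                w -= lr * w / len(s)
    return w
```

Because only difference vectors are classified, the learned weight vector directly defines a query-independent scoring function over the eleven features.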
  20. Architecture • [Diagram: user query → semantic query (via knowledge base) → top-K retrieval from the index → feature extractor (web logs, user clicks) → ranking model, trained by the learning algorithm on training data → re-ranked results] • All of these steps (except training) finish within 2 seconds • No mainstream open-source ML library provided a ranking algorithm • We implemented it ourselves with the aid of Spark MLlib
  21. Sample results for “ocean temperature AND Atlantic ocean”
  22. Quantitative evaluation • Precision at K measures the fraction of data ranked in the top K results that are labelled as “excellent”: P(K) = (number of excellent data in top K) / K • Example: – K=1, excellent → P(1) = 1 – K=2, not excellent → P(2) = 1/2 – K=3, not excellent → P(3) = 1/3 – K=4, excellent → P(4) = 2/4 – K=5, excellent → P(5) = 3/5
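The metric is a one-liner; this sketch reproduces the worked example on the slide (the boolean-list input format is an assumption).

```python
def precision_at_k(labels, k):
    """Fraction of the top-k results labelled 'excellent'.
    labels: booleans in ranked order (True = excellent)."""
    return sum(labels[:k]) / k
```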
  23. Precision (K) for five different ranking methods at varying K (1–40)
  24. NDCG (K) for five different ranking methods at varying K (1–40)
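For reference, the standard NDCG@K used in this evaluation can be computed as follows; this is the textbook log2-discounted form, which I assume matches the variant used in the experiments.

```python
import math

def ndcg_at_k(gains, k):
    """Normalized discounted cumulative gain at rank k.
    gains: graded relevance labels in ranked order."""
    def dcg(g):
        return sum(rel / math.log2(i + 2) for i, rel in enumerate(g[:k]))
    ideal = dcg(sorted(gains, reverse=True))
    return dcg(gains) / ideal if ideal > 0 else 0.0
```

Unlike precision at K, NDCG rewards placing the most relevant datasets earliest, which is why both metrics are reported.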
  25. Conclusion • The rich set of ranking features and the ML algorithm provide substantial advantages over other ranking methods • The proposed architecture keeps the data portal’s software structure loosely coupled and avoids the cost of replacing the existing system
  26. Next steps • Add more features (e.g., temporal similarity) • Create training data from web logs • Develop a query-understanding module to better interpret the user’s search intent (e.g., “ocean wind level 3” → “ocean wind” AND “level 3”) • Support Solr • Support near-real-time data ingestion to detect temporal events • Integrate with DOMS and OceanXtremes
  27. Resources • Publications – Jiang, Y., Y. Li, C. Yang, E. M. Armstrong, T. Huang & D. Moroni (2016) Reconstructing Sessions from Data Discovery and Access Logs to Build a Semantic Knowledge Base for Improving Data Discovery. ISPRS International Journal of Geo-Information, 5, 54. – Li, Y., Y. Jiang, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2016) Leverage cloud computing to improve data access log mining. IEEE Oceans 2016. – Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang & D. Moroni (2017) A Comprehensive Approach to Determining the Linkage Weights among Geospatial Vocabularies – An Example with Oceanographic Data Discovery. International Journal of Geographical Information Science (under review). – Jiang, Y., Y. Li, C. Yang, K. Liu, E. M. Armstrong, T. Huang, D. Moroni & L. McGibbney (2017) Towards intelligent geospatial discovery: a machine learning ranking framework. Remote Sensing (under review). • Source code: • Demo: