Improving VIVO search through semantic ranking.

Transcript

    • 1. Improving VIVO search results through Semantic Ranking. Anup Sawant, Deepak Konidena.
    • 2. VIVO Search up to Release 1.2.1
        • Lucene keyword-based search.
        • Scores based on textual relevance only.
        • The importance of a node was not taken into consideration.
        • Additional data describing a relationship was not searched.
    • 3. Adding knowledge from semantic relationships
        • VIVO 1.2 search kept only restricted information about an individual in the index. This led people to ask questions like:
        • "Hey, I work for USDA, and when I search for 'USDA', my profile doesn't show up in the search results, and vice versa."
        • "Hey, information related to my educational background, awards, the roles I assumed, etc. appears on my profile but doesn't show up in the search results when I search for it individually or when I search for my name."
    • 4. The Lucene fields for an Individual, and here's why. [Slide figure.]
    • 5. Intermediate nodes were overlooked.
        • Traditionally, semantic relationships of an Individual, such as Roles, Educational Training, Awards, and Authorships, were not stored in the index.
        • Individuals were connected to these properties through intermediate nodes called "context nodes", and the information hidden behind those context nodes was not captured.
    • 6. What does the semantic graph look like in the presence of context nodes?
    • 7. VIVO Search in 1.3
        • Transition from Lucene to SOLR.
        • Provides a base for distributed search capabilities.
        • Individuals enriched with descriptions of their semantic relationships.
        • Scores enhanced by an Individual's connectivity.
        • Improved precision and recall of search results.
    • 8. Enriching Individuals with Semantic Relations.
        • For 1.3, individuals are enriched with information from their semantic relations, i.e., the information hidden behind context nodes.
        • SPARQL queries are run at index time to capture this information (see the sketch below).
        • Result: improvement in the overall placement of search results; relevant results float up.
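
      A minimal sketch of the kind of index-time SPARQL query meant here, written as a plain Java helper; the ex: property names are illustrative placeholders, not the actual VIVO ontology terms. For each individual, the bound labels are concatenated into the searchable text of its Solr document, so text hidden behind a context node (here, a Position) becomes findable.

          // Builds an index-time SPARQL query that reaches through a context node
          // (a Position) to the organization's label. Property names are hypothetical.
          public class ContextNodeQuery {
              static String positionQuery(String personUri) {
                  return
                      "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
                      "PREFIX ex: <http://example.org/ontology#>\n" +
                      "SELECT ?orgName WHERE {\n" +
                      "  <" + personUri + "> ex:personInPosition ?position .\n" + // context node
                      "  ?position ex:positionInOrganization ?org .\n" +
                      "  ?org rdfs:label ?orgName .\n" +
                      "}";
              }

              public static void main(String[] args) {
                  // The organization's label (e.g. "USDA") ends up in the person's
                  // Solr document, so searching for "USDA" now finds the person.
                  System.out.println(positionQuery("http://vivo.example.edu/individual/n123"));
              }
          }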
    • 9. Influence of PageRank
        • Introduced by Larry Page & Sergey Brin.
        • Every node relies on every other node for its ranking.
        • Intuitive understanding: a node's importance is calculated from its incoming connections and from the contribution of highly ranked nodes.
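
      For reference, the standard PageRank recurrence (well-known background, not stated on the slide): with damping factor d, N nodes in the graph, In(u) the set of nodes linking to u, and outdeg(v) the out-degree of v,

          PR(u) = \frac{1 - d}{N} + d \sum_{v \in \mathrm{In}(u)} \frac{PR(v)}{\mathrm{outdeg}(v)}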
    • 10. Some parameters based on PageRank (see the sketch after this list)
      • β
        • Number of nodes connected to a particular node.
        • Intuition: a node probably deserves a high rank because it is connected to a lot of individuals.
      • Φ
        • Average of the β values of all the nodes to which a node is connected.
        • Intuition: a node probably deserves a high rank because it is connected to some important individuals.
      • Γ
        • Average strength of uniqueness of the properties through which a node is connected.
        • Intuition: a node probably deserves a high rank based on the strength of its connections to other nodes.
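
      A minimal sketch of how β and Φ could be computed from an adjacency list, going strictly by the definitions above; the slide gives no code, and Γ is omitted because its "strength of uniqueness" weighting is not specified here.

          import java.util.*;

          public class ConnectivityParams {
              // beta(n): the number of nodes connected to n.
              static int beta(Map<String, Set<String>> graph, String node) {
                  return graph.getOrDefault(node, Collections.<String>emptySet()).size();
              }

              // phi(n): the average of beta over all nodes to which n is connected.
              static double phi(Map<String, Set<String>> graph, String node) {
                  Set<String> neighbors = graph.getOrDefault(node, Collections.<String>emptySet());
                  if (neighbors.isEmpty()) return 0.0;
                  double sum = 0;
                  for (String m : neighbors) sum += beta(graph, m);
                  return sum / neighbors.size();
              }

              public static void main(String[] args) {
                  // Toy undirected graph: a person connected to two organizations.
                  Map<String, Set<String>> g = new HashMap<>();
                  g.put("person", new HashSet<>(Arrays.asList("org1", "org2")));
                  g.put("org1", new HashSet<>(Arrays.asList("person", "org2")));
                  g.put("org2", new HashSet<>(Arrays.asList("person", "org1")));
                  System.out.println("beta(person) = " + beta(g, "person")); // 2
                  System.out.println("phi(person)  = " + phi(g, "person"));  // (2 + 2) / 2 = 2.0
              }
          }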
    • 11. Search Index Architecture: Enriching with Semantic Relations. [Diagram: in the indexing phase, multithreaded SPARQL queries and the overall connectivity of an Individual (β) feed Apache Solr with the proper boosts; in the searching phase, the Dismax query handler returns the relevant documents.]
    • 12. Real-time Indexing: Enriching with Semantic Relations. [Diagram: the same pipeline as slide 11; an ADD/EDIT/DELETE of an Individual or its properties updates the index in real time, and the changes propagate beyond intermediate nodes.]
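
      A minimal sketch of the searching phase from the diagrams above, using the SolrJ client API; the field names and boost values are illustrative assumptions, not VIVO's actual schema.

          import org.apache.solr.client.solrj.SolrQuery;

          public class DismaxQuerySketch {
              public static void main(String[] args) {
                  // Build a dismax request like the one the searching phase issues.
                  SolrQuery q = new SolrQuery("Kenny Paul");
                  q.set("defType", "dismax");
                  // Search both the name field and the enriched all-text field,
                  // weighting name matches higher (hypothetical field names).
                  q.set("qf", "nameText^3 ALLTEXT^1");
                  // The connectivity-based boost (beta) is applied to each document
                  // at index time, so it needs no extra request parameter here.
                  System.out.println(q); // prints the encoded request parameters
              }
          }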
    • 13. Cluster Analysis of Search Results
      • Intuition
        • Treat the search results from Release 1.2.1 and Release 1.3 as two different clusters.
      • Expectation
        • Results from Release 1.3 should have their mean vector closer to the query vector.
      • Method
        • Text-to-vector conversion using the 'bag of words' technique.
        • Tanimoto distance measure used (see the sketch after the table).
        • Code at: https://github.com/anupsavvy/Cluster_Analysis
      Query               Distance from mean vector of 1.2.1   Distance from mean vector of 1.3
      Scripps             0.27286328362357193                  0.004277746256068157
      Paulson James       0.009907336493786136                 0.004650133621323327
      Genome Sequencing   9.185463752863598E-4                 8.154498815206635E-4
      Kenny Paul          0.007610235640599918                 0.003984303949283425
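
      A minimal sketch of the distance measure used in the table above: the Tanimoto distance between two bag-of-words count vectors (the vectors themselves come from a term-document matrix like the one on slide 15).

          public class TanimotoDistance {
              // 1 - Tanimoto coefficient, where T(a, b) = a.b / (|a|^2 + |b|^2 - a.b).
              static double distance(double[] a, double[] b) {
                  double dot = 0, na = 0, nb = 0;
                  for (int i = 0; i < a.length; i++) {
                      dot += a[i] * b[i];
                      na  += a[i] * a[i];
                      nb  += b[i] * b[i];
                  }
                  return 1.0 - dot / (na + nb - dot);
              }

              public static void main(String[] args) {
                  // Toy term-count vectors over the terms (scripps, institute, cornell).
                  double[] query  = {1, 0, 0};
                  double[] result = {6, 4, 0};
                  System.out.println(distance(query, result)); // ~0.872
              }
          }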
    • 14. Understanding how it happens… [Diagram: a list of search results R1, R2, R3, R4, R5, …, each contributing text from fields such as name, location, description, research, and articles.]
    • 15. Understanding how it happens… [Diagram: a term-document matrix over terms such as scripps, loring, jeanne, institute, cornell, and florida; the query Q and each result R1, R2, R3, … become columns of term counts.]
    • 16. Understanding how it happens… [Diagram: two vectors V1 and V2 plotted in a term space with axes institute, cornell, and loring; the angle θ between them gives the cosine distance, the straight-line gap the Euclidean distance.]
    • 17. Understanding how it happens… [Diagram: V1 is scaled; the Euclidean distance to V2 increases while the cosine distance (angle θ) remains the same.]
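
      A small sketch of the point slides 16 and 17 are making: scaling a vector (say, a longer document with the same word proportions) changes its Euclidean distance to another vector but leaves the cosine distance unchanged.

          public class CosineVsEuclidean {
              static double euclidean(double[] a, double[] b) {
                  double s = 0;
                  for (int i = 0; i < a.length; i++) s += (a[i] - b[i]) * (a[i] - b[i]);
                  return Math.sqrt(s);
              }

              static double cosine(double[] a, double[] b) {
                  double dot = 0, na = 0, nb = 0;
                  for (int i = 0; i < a.length; i++) {
                      dot += a[i] * b[i];
                      na  += a[i] * a[i];
                      nb  += b[i] * b[i];
                  }
                  return dot / (Math.sqrt(na) * Math.sqrt(nb));
              }

              public static void main(String[] args) {
                  double[] v1 = {1, 2, 0};    // counts over (institute, cornell, loring)
                  double[] v2 = {2, 1, 1};
                  double[] v1x3 = {3, 6, 0};  // v1 with three times the word counts
                  System.out.println(euclidean(v1, v2) + " vs " + euclidean(v1x3, v2)); // grows
                  System.out.println(cosine(v1, v2) + " vs " + cosine(v1x3, v2));       // identical
              }
          }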
    • 18. Query vector distance from Cluster Mean vectors
    • 19. User testing for Relevance
    • 20. Precision and Recall. Let X be the number of results that are both retrieved and relevant. Then Precision = X / (total retrieved) and Recall = X / (total relevant). [Venn diagram of the retrieved and relevant sets, with X as their intersection.]
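
      A worked example with illustrative numbers (not from the slides): if a query retrieves 10 results of which X = 4 are relevant, and the collection contains 8 relevant results in total, then Precision = 4 / 10 = 0.4 and Recall = 4 / 8 = 0.5.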
    • 21. Precision-Recall graphs based on User Analysis.
    • 22. Cluster Analysis for Relevance
    • 23. Precision-Recall graphs based on Cluster Analysis
    • 24. Query vector distance from individual search result vectors
    • 25. Experiments: SOLR
        • Search query expansion can be done using the SOLR synonym analyzer.
        • Princeton WordNet (http://wordnet.princeton.edu/) is frequently used with the SOLR synonym analyzer.
        • A gist by Bradford on GitHub (https://gist.github.com/562776) was used to convert the WordNet flat file into a SOLR-compatible synonyms file (see the sketch after this list).
        • Pros
          • High recall.
          • Documents can be matched to well-known acronyms and to words not present in the SOLR index. For instance, a query that has 'fl' as one of its terms would also retrieve documents related to 'Florida'.
        • Cons
          • Documents matching just the synonym part of the query could be ranked higher.
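
      A minimal sketch of such a conversion, assuming the WordNet prolog flat file format with lines like s(100001740,1,'entity',n,1,11). and the comma-separated equivalence lines that Solr's synonym filter reads; it is a rough stand-in for the gist above, not a copy of it.

          import java.io.*;
          import java.util.*;
          import java.util.regex.*;

          public class WordnetToSolrSynonyms {
              public static void main(String[] args) throws IOException {
                  // Capture the synset id and the word from each s(...) fact.
                  Pattern p = Pattern.compile("^s\\((\\d+),\\d+,'([^']+)',");
                  Map<String, Set<String>> synsets = new LinkedHashMap<>();
                  try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
                      String line;
                      while ((line = in.readLine()) != null) {
                          Matcher m = p.matcher(line);
                          if (m.find()) {
                              synsets.computeIfAbsent(m.group(1), k -> new TreeSet<String>())
                                     .add(m.group(2).toLowerCase().replace(' ', '_'));
                          }
                      }
                  }
                  try (PrintWriter out = new PrintWriter(new FileWriter(args[1]))) {
                      for (Set<String> words : synsets.values()) {
                          // One comma-separated synonym line per synset with 2+ members.
                          if (words.size() > 1) out.println(String.join(",", words));
                      }
                  }
              }
          }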
    • 26. Experiments: SOLR (cont.)
        • A certain degree of spelling-correction-like behavior can be achieved through the SOLR phonetic analyzer.
        • The phonetic analyzer uses Apache Commons Codec for its phonetic implementations.
        • Pros
          • High recall.
          • Helps catch spelling mistakes in the search query (see the sketch after this list). For instance, a query like 'scrips' would be matched to the similar-sounding word 'scripps', which is actually present in the index; a misspelled name like 'Polex Frank' in the query could be matched to the correct name 'Polleux Franck'.
        • Cons
          • Results matched just on phonetics could decrease the precision of the engine.
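
      A minimal sketch of the underlying mechanism, using the DoubleMetaphone encoder from Apache Commons Codec (the library named above); the example terms come from the slide.

          import org.apache.commons.codec.language.DoubleMetaphone;

          public class PhoneticMatchDemo {
              public static void main(String[] args) {
                  DoubleMetaphone encoder = new DoubleMetaphone();
                  String query   = "scrips";   // misspelled query term
                  String indexed = "scripps";  // term actually present in the index
                  String qCode = encoder.doubleMetaphone(query);
                  String iCode = encoder.doubleMetaphone(indexed);
                  System.out.println(query + " -> " + qCode);
                  System.out.println(indexed + " -> " + iCode);
                  // If the codes match, the misspelled query still reaches the document.
                  System.out.println("phonetic match: " + qCode.equals(iCode));
              }
          }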
    • 27. Experiments: Ontology provides a good base for factoid questioning.
        • Properties of Individuals point directly to the requested information.
        • Natural language techniques and machine learning algorithms could help us understand the search query better.
        • A query like "What is Brian Lowe's email id?" should probably return just the email id at the top, and a query like "Who are the co-authors of Brian Lowe?" should return just the list of Brian Lowe's co-authors.
        • We can train an algorithm to recognize the type of question or search query that has been fired. The Cognitive Computation Group at the University of Illinois at Urbana-Champaign provides a corpus of tagged questions to use as a training set: http://cogcomp.cs.illinois.edu/page/resources/data
    • 28. Experiments: Ontology provides a good base for factoid questioning. (cont.)
        • Once the question type is determined, we could grammatically parse the question using the Stanford Lexparser (http://nlp.stanford.edu/software/lex-parser.shtml).
        • The question type tells us whether to look for a datatype property or an object property; the Lexparser helps us form a SPARQL query (see the sketch below).
      [Pipeline diagram: the search query goes to a K-means/SVM classifier trained on the corpora, which outputs the question type; the Stanford Lexparser extracts the terms; question type and terms together form the SPARQL query.]
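
      A minimal sketch of the final step, assuming the classifier has already produced a question type and the Lexparser has extracted the person's name; the question-type labels, ex: property names, and query shapes are hypothetical, and prefix declarations are omitted for brevity.

          public class FactoidToSparql {
              // Map a classified question to a SPARQL template (hypothetical properties).
              static String toSparql(String questionType, String personName) {
                  switch (questionType) {
                      case "EMAIL": // datatype property: ask for a literal value
                          return "SELECT ?email WHERE { ?p rdfs:label \"" + personName + "\" . "
                               + "?p ex:email ?email . }";
                      case "COAUTHORS": // object property: follow authorship context nodes
                          return "SELECT ?coauthor WHERE { ?p rdfs:label \"" + personName + "\" . "
                               + "?paper ex:hasAuthor ?p . ?paper ex:hasAuthor ?coauthor . "
                               + "FILTER(?coauthor != ?p) }";
                      default: // fall back to ordinary keyword search
                          return null;
                  }
              }

              public static void main(String[] args) {
                  System.out.println(toSparql("EMAIL", "Brian Lowe"));
                  System.out.println(toSparql("COAUTHORS", "Brian Lowe"));
              }
          }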
    • 29. Summary
        • Transition from Lucene to SOLR
        • Additional information about semantic relationships and interconnectivity in the index.
        • More relevant results and better ranking compared to VIVO 1.2.1.
        • Improvements in indexing time due to multithreading.
    • 30. Team Work…