The VIVO 1.2 search index contained only restricted information about an individual. This led people to ask questions like:
“Hey, I work for USDA, and when I search for ‘USDA’ my profile doesn't show up in the search results, and vice versa.”
“Hey, information related to my educational background, awards, the roles I have assumed, etc. appears on my profile but doesn't show up in the search results, whether I search for it individually or search for my name.”
Intuition: a node probably deserves a high rank because it is connected to a lot of individuals.
Average over the β values of all the nodes to which a node is connected.
Intuition: a node probably deserves a high rank because it is connected to some important individuals.
Average strength of uniqueness of the properties through which a node is connected.
Intuition: a node probably deserves a high rank based on the strength of its connections to other nodes.
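The three intuitions above can be sketched in code. This is a minimal illustration under assumed definitions (β taken as a node's degree, a neighbor-averaged β score, and made-up per-property uniqueness weights); the slides do not give VIVO's actual formulas.

```python
# Hedged sketch of the ranking intuitions above. Assumptions (not from
# the slides): beta(n) = degree of n; important neighbors measured by
# averaging neighbors' beta; connection strength measured by assumed
# per-property uniqueness weights (rarer property => higher weight).

graph = {  # toy adjacency: node -> {neighbor: connecting property}
    "A": {"B": "coAuthorOf", "C": "coAuthorOf", "D": "worksAt"},
    "B": {"A": "coAuthorOf"},
    "C": {"A": "coAuthorOf", "D": "worksAt"},
    "D": {"A": "worksAt", "C": "worksAt"},
}
uniqueness = {"coAuthorOf": 0.9, "worksAt": 0.4}  # assumed weights

def beta(node):
    """Overall connectivity: here simply the node's degree."""
    return len(graph[node])

def avg_neighbor_beta(node):
    """Average beta over all nodes this node is connected to."""
    nbrs = graph[node]
    return sum(beta(n) for n in nbrs) / len(nbrs)

def avg_property_uniqueness(node):
    """Average uniqueness of the properties connecting this node."""
    props = list(graph[node].values())
    return sum(uniqueness[p] for p in props) / len(props)
```

Node "A" scores high on all three measures in this toy graph: many neighbors, and two of its links are through the rarer co-author property.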
Search Index Architecture: Enriching with Semantic Relations.
- Overall connectivity of an individual (β)
- Indexing phase: SPARQL queries, proper boosts, multithreaded
- Searching phase: Apache Solr with the Dismax query handler returning relevant documents
Real-time Indexing: Enriching with Semantic Relations.
- Same pipeline as above, triggered on ADD/EDIT/DELETE of an individual or its properties
- The changes occur in real time and propagate beyond intermediate nodes
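The multithreaded indexing phase can be sketched as follows. The real VIVO indexer is Java and talks to Solr; the function and field names here (`build_solr_doc`, the `boost` field, the stubbed β table) are hypothetical stand-ins.

```python
# Hedged sketch of the multithreaded indexing phase described above.
# Names are hypothetical: each worker turns one individual into a
# Solr-style document whose boost is derived from its beta score.
from concurrent.futures import ThreadPoolExecutor

betas = {"indiv1": 3, "indiv2": 1, "indiv3": 2}  # toy beta scores

def build_solr_doc(uri):
    """Build one (stubbed) search document; boost grows with beta."""
    return {"id": uri, "boost": 1.0 + betas[uri]}

def reindex(uris, workers=4):
    """Index individuals in parallel, mirroring the multithreaded phase."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(build_solr_doc, uris))
```

On an ADD/EDIT/DELETE event, the same routine would be invoked for the changed individual plus the nodes its change propagates to.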
Assume the search results from Release 1.2.1 and Release 1.3 form two different clusters.
Results from Release 1.3 should have their mean vector closer to the query vector.
Text-to-vector conversion uses the bag-of-words technique.
The Tanimoto distance measure is used.
Code at: https://github.com/anupsavvy/Cluster_Analysis
Query               Distance from mean vector of Release 1.2.1   Distance from mean vector of Release 1.3
Scripps             0.27286328362357193                          0.004277746256068157
Paulson James       0.009907336493786136                         0.004650133621323327
Genome Sequencing   9.185463752863598E-4                         8.154498815206635E-4
Kenny Paul          0.007610235640599918                         0.003984303949283425
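The evaluation above can be sketched with a minimal bag-of-words and Tanimoto-distance implementation. The vocabulary and texts below are toy assumptions, not the actual release data from the linked repository.

```python
# Minimal sketch of the evaluation above: bag-of-words vectors, a
# cluster mean, and the Tanimoto distance between query and mean.
def bag_of_words(text, vocab):
    """Count occurrences of each vocabulary term in the text."""
    words = text.lower().split()
    return [words.count(term) for term in vocab]

def tanimoto_distance(a, b):
    """1 - Tanimoto similarity = 1 - dot / (|a|^2 + |b|^2 - dot)."""
    dot = sum(x * y for x, y in zip(a, b))
    denom = sum(x * x for x in a) + sum(y * y for y in b) - dot
    return 1.0 - (dot / denom if denom else 0.0)

def mean_vector(vectors):
    """Component-wise mean of a cluster of result vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

A release whose results mention the query terms more consistently yields a mean vector with a smaller Tanimoto distance to the query vector, which is the pattern the table above shows for Release 1.3.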
A certain degree of spelling-correction-like behavior can be achieved through the Solr phonetic analyzer.
The phonetic analyzer uses Apache Commons Codec for its phonetic implementations.
This helps in detecting spelling mistakes in the search query. For instance, a query like ‘scrips’ could be matched to the similar-sounding word ‘scripps’, which is actually present in the index. A misspelled name like ‘Polex Frank’ in the query could be matched to the correct name ‘Polleux Franck’.
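Solr's phonetic filter delegates to encoders from Apache Commons Codec, such as Soundex and Double Metaphone. A minimal pure-Python Soundex sketch (an illustration, not Solr's actual code path) shows why the pairs above collide:

```python
# Minimal Soundex sketch illustrating phonetic matching. Not Solr's
# implementation (which uses Commons Codec), but the same basic idea:
# words that sound alike get the same 4-character code.
CODES = {}
for letters, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")]:
    for ch in letters:
        CODES[ch] = digit

def soundex(name):
    name = name.lower()
    result = name[0].upper()
    prev = CODES.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":      # h/w are transparent: keep the previous code
            continue
        code = CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code         # vowels reset prev, separating duplicates
    return (result + "000")[:4]
```

Indexing the Soundex code alongside the literal token lets the misspelled query term hit the correctly spelled indexed term.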
However, results matched purely on phonetics could decrease the precision of the engine.
Experiments: Ontology provides a good base for factoid questioning.
Properties of individuals give direct references to the information.
Natural-language techniques and machine-learning algorithms could help us understand the search query better.
A query like “What is Brian Lowe's email id?” should probably return just the email id at the top, while a query like “Who are the co-authors of Brian Lowe?” should return just the list of Brian Lowe's co-authors.
We can train an algorithm to recognize the type of question or search query that has been fired. The Cognitive Computation Group at the University of Illinois at Urbana-Champaign provides a corpus of tagged questions that can be used as a training set: http://cogcomp.cs.illinois.edu/page/resources/data
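The training idea can be sketched with a tiny unigram Naive Bayes classifier. The handful of training examples below are made up; in practice one would train on the UIUC tagged-question corpus linked above (labels here follow its coarse taxonomy: HUM = human, ENTY = entity, NUM = number).

```python
# Hedged sketch: train a tiny unigram Naive Bayes question classifier.
# The examples are invented stand-ins for the UIUC corpus.
import math
from collections import Counter, defaultdict

TRAIN = [
    ("HUM", "who are the co-authors of brian lowe"),
    ("HUM", "who wrote this paper"),
    ("HUM", "who works at the usda"),
    ("ENTY", "what is brian lowe 's email id"),
    ("ENTY", "what award did she receive"),
    ("NUM", "how many papers did he publish"),
    ("NUM", "how many grants does the lab hold"),
]

def train(examples):
    """Collect per-label word counts, label priors, and the vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    vocab = set()
    for label, text in examples:
        label_counts[label] += 1
        for w in text.split():
            word_counts[label][w] += 1
            vocab.add(w)
    return word_counts, label_counts, vocab

def classify(question, word_counts, label_counts, vocab):
    """Pick the label maximizing log P(label) + sum log P(word|label),
    with add-one smoothing."""
    total = sum(label_counts.values())
    best, best_score = None, float("-inf")
    for label in label_counts:
        n = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total)
        for w in question.lower().split():
            score += math.log((word_counts[label][w] + 1) / (n + len(vocab) + 1))
        if score > best_score:
            best, best_score = label, score
    return best
```

A HUM prediction would steer the engine toward person-valued answers, while ENTY/NUM predictions would steer it toward attribute values.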
Once the question type is determined, we can grammatically parse the question using the Stanford Lexparser: http://nlp.stanford.edu/software/lex-parser.shtml
The question type tells us whether we should look for a datatype property or an object property, and the Lexparser helps us form a SPARQL query.