This document discusses how graph capabilities have evolved in Lucene, Solr, and Elasticsearch. It provides an overview of how full-text search engines started supporting relationship exploration through indexing and scoring related terms. It then explains how Elasticsearch, Solr, and streaming expressions in Solr added more explicit graph traversal capabilities through features like the graph explorer API, GraphQueryParser, and functions like shortestPath and gatherNodes. Examples of using these graph features are also provided.
2. Agenda
1. Introduction to Lucene and friends
2. Evolution of data analysis by Solr and Elasticsearch
3. Graph capabilities of Elasticsearch (briefly)
4. Solr - QueryParserPlugin
5. Solr - Streaming Expressions
6. Examples
17. Elasticsearch+Kibana: Plugin for Elasticsearch and Kibana - Graph
• From Elasticsearch 2.3
• REST API - /_graph/explore
• visualization for Kibana
• Part of Elastic's commercial offering (named X-Pack from 5.0)
picture from: https://www.elastic.co/guide/en/graph/current/graph-introduction.html
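A request to the explore API can be sketched as follows. This is a minimal illustration, assuming a JSON body with a seed query, a "vertices" list and a nested "connections" block; the field names (product, query.raw) and the search term are hypothetical, and the exact endpoint path differs between Elasticsearch versions.

```python
import json


def build_explore_request(seed_match, start_field, connected_field):
    """Build a JSON body for a graph explore request.

    All field names passed in are hypothetical examples, not part of any
    real index.
    """
    body = {
        # Seed query: selects the documents the exploration starts from.
        "query": {"match": seed_match},
        # Terms of this field become the first set of graph vertices.
        "vertices": [{"field": start_field}],
        # Terms of this field that co-occur with the vertices above
        # become the connected vertices.
        "connections": {"vertices": [{"field": connected_field}]},
    }
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    print(build_explore_request({"query.raw": "midi"}, "product", "query.raw"))
```

The resulting body would be POSTed to the explore endpoint of an index; the sketch only builds the payload and does not contact a cluster.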
18. Solr: built-in GraphQueryParser
• Available from Solr 6.0
• experimental feature
• currently works only for single-node, single-core applications (subject to change)
• no first-party visualization
• does not track edges of the traversal
picture from: http://solr.pl/2016/04/25/wizualizacja-grafow-przy-pomocy-solr-6/
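A graph query is written with Solr's local-params syntax, {!graph from=... to=...}, followed by the query that selects the root documents. The helper below is a small sketch that assembles such a query string; the helper itself and the field names in_edge and out_edge are hypothetical, and only the from, to, maxDepth and returnRoot parameters are covered.

```python
def graph_query(root_query, from_field, to_field, max_depth=None, return_root=None):
    """Assemble a {!graph} query string for Solr's graph query parser.

    root_query selects the starting documents; from_field and to_field
    name the (hypothetical) fields that define an edge.
    """
    params = [f"from={from_field}", f"to={to_field}"]
    if max_depth is not None:
        # Limit how many hops the traversal may take.
        params.append(f"maxDepth={max_depth}")
    if return_root is not None:
        # Include or exclude the root documents in the result.
        params.append(f"returnRoot={'true' if return_root else 'false'}")
    return "{!graph " + " ".join(params) + "}" + root_query


if __name__ == "__main__":
    print(graph_query("id:A", "in_edge", "out_edge", max_depth=1))
    # prints {!graph from=in_edge to=out_edge maxDepth=1}id:A
```

The string would be passed as the q parameter of an ordinary select request.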
19. Solr: built-in Streaming Expressions
• Available from Solr 5.5
• experimental feature
• no first-party visualization
• tracks both the edges of the traversal and the level
picture from: http://solr.pl/2016/04/25/wizualizacja-grafow-przy-pomocy-solr-6/
28. Streaming Expressions
• New alternative way of creating and processing queries
• allow chaining functions
• also experimental
• graph functions - shortestPath, gatherNodes, scoreNodes
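Because functions can be chained, a whole pipeline is a single expression string sent to a collection's /stream handler in the expr parameter. The sketch below only builds the request URL; the base URL, the emails collection and the address in the example are assumptions for illustration, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def stream_url(base_url, collection, expr):
    """Build the URL that submits a streaming expression to /stream.

    base_url and collection are placeholders for a real Solr install.
    """
    # urlencode takes care of quotes, spaces and other reserved characters.
    return f"{base_url}/{collection}/stream?" + urlencode({"expr": expr})


if __name__ == "__main__":
    # One function wrapping another: scoreNodes consumes gatherNodes output.
    expr = 'scoreNodes(gatherNodes(emails, walk="a@example.com->from", gather="to"))'
    print(stream_url("http://localhost:8983/solr", "emails", expr))
```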
30. shortestPath
• a source function, i.e. a function that produces a tuple stream
• returns the shortest path between two given nodes using an iterative breadth-first search of the graph
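The idea of an iterative breadth-first search can be illustrated in plain Python. This is a simplified in-memory sketch, not Solr's implementation: the graph is a hypothetical list of directed (from, to) pairs, and only one shortest path is returned.

```python
from collections import deque


def shortest_path(edges, start, goal, max_depth=None):
    """Return one shortest path from start to goal, or None if unreachable."""
    # Build an adjacency list from the directed edge pairs.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, []).append(dst)

    parents = {start: None}          # also serves as the visited set
    frontier = deque([(start, 0)])   # (node, depth) pairs, explored level by level
    while frontier:
        node, depth = frontier.popleft()
        if node == goal:
            # Walk the parent links back to the start node.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        if max_depth is not None and depth >= max_depth:
            continue  # do not expand beyond the allowed depth
        for nxt in adjacency.get(node, []):
            if nxt not in parents:
                parents[nxt] = node
                frontier.append((nxt, depth + 1))
    return None


if __name__ == "__main__":
    edges = [("a", "b"), ("b", "c"), ("a", "d"), ("d", "e"), ("e", "c")]
    print(shortest_path(edges, "a", "c"))  # prints ['a', 'b', 'c']
```

Because nodes are expanded level by level, the first time the goal is dequeued the path to it has the fewest possible edges.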
31. shortestPath - params
• collection - collection in which the search is performed
• from - starting node
• to - ending node
• edge - definition of an edge, in the format <from_field>=<to_field>
• fq - filter query that restricts which nodes are taken into account
• maxDepth - maximum depth of the traversal
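Put together, the parameters above form a single expression string. The helper below is a hypothetical convenience function for assembling it; the emails collection, the addresses and the from_address/to_address fields are illustrative assumptions.

```python
def shortest_path_expr(collection, from_node, to_node, edge, fq=None, max_depth=None):
    """Assemble a shortestPath streaming expression as a string."""
    parts = [
        collection,
        f'from="{from_node}"',   # starting node
        f'to="{to_node}"',       # ending node
        f'edge="{edge}"',        # <from_field>=<to_field>
    ]
    if fq is not None:
        parts.append(f'fq="{fq}"')
    if max_depth is not None:
        parts.append(f'maxDepth="{max_depth}"')
    return "shortestPath(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    print(shortest_path_expr(
        "emails",
        "john@company.com",
        "jane@company.com",
        "from_address=to_address",
        max_depth=4,
    ))
```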
32. gatherNodes
• transforms an input document stream into a stream of documents reachable through
graph traversal
• can return edges
• allows nesting functions
• works for multi-collection streams, regardless of the number of cluster nodes
• is also a source function
• currently does not support multivalued fields
33. gatherNodes - params
• collection - collection on which the function is run
• walk - defines the starting nodes and the field, e.g. "zpapierski@atlassian.com->from"
• gather - defines which fields are gathered
• scatter - parameter that can take one or both of the values:
• leaves - emits only leaf nodes (outer-most ones)
• branches - emits nodes leading up to leaves (root node is a branch)
• fq - filter query that filters out nodes
• maxDocFreq - nodes whose document frequency exceeds this number are filtered out of the result
Aggregations, cross-collection gathering and combining with other streaming expressions
are possible
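The parameters and the nesting can be sketched with a small string builder. The helper is hypothetical; the emails collection and the to field are assumptions, and the nested form assumes the inner expression is passed right after the collection name, with walk="node->from" continuing from the nodes it emits.

```python
def gather_nodes_expr(collection, walk, gather, source=None,
                      scatter=None, fq=None, max_doc_freq=None):
    """Assemble a gatherNodes streaming expression as a string.

    source is an optional inner expression whose output nodes are walked.
    """
    parts = [collection]
    if source is not None:
        parts.append(source)
    parts.append(f'walk="{walk}"')
    parts.append(f'gather="{gather}"')
    if scatter:
        # e.g. ["branches", "leaves"] to emit both kinds of nodes
        parts.append('scatter="' + ", ".join(scatter) + '"')
    if fq is not None:
        parts.append(f'fq="{fq}"')
    if max_doc_freq is not None:
        parts.append(f'maxDocFreq="{max_doc_freq}"')
    return "gatherNodes(" + ", ".join(parts) + ")"


if __name__ == "__main__":
    # First hop: everyone the starting address wrote to.
    first_hop = gather_nodes_expr("emails", "zpapierski@atlassian.com->from", "to")
    # Second hop: everyone those recipients wrote to, keeping both levels.
    print(gather_nodes_expr("emails", "node->from", "to",
                            source=first_hop, scatter=["branches", "leaves"]))
```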
34. scoreNodes
• Function used only with the output of gatherNodes
• Scores document relevancy using the TF-IDF formula
• TF - how often the document appeared in the graph traversal
• IDF - fetched from the document's original collection
• Adds a field, nodeScore, to the output stream
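The scoring idea can be illustrated with a textbook TF-IDF variant. This is a sketch under stated assumptions, not the exact formula Solr uses: TF is the traversal count of a node, IDF is log((numDocs + 1) / (docFreq + 1)), and all input numbers are made up.

```python
import math


def node_scores(traversal_counts, doc_freqs, num_docs):
    """Score nodes as tf * idf.

    traversal_counts: how often each node was reached during the traversal (TF)
    doc_freqs: how many documents of the original collection contain the node
    num_docs: total number of documents in that collection
    """
    scores = {}
    for node, tf in traversal_counts.items():
        # Nodes that are rare in the collection get a higher IDF.
        idf = math.log((num_docs + 1) / (doc_freqs[node] + 1))
        scores[node] = tf * idf
    return scores


if __name__ == "__main__":
    counts = {"rare_item": 3, "common_item": 3}
    doc_freqs = {"rare_item": 1, "common_item": 99}
    for node, score in node_scores(counts, doc_freqs, num_docs=99).items():
        print(node, round(score, 3))
```

Both nodes are reached equally often here, but the one that appears in almost every document of the collection is pushed down, which is the point of weighting by IDF.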