The Apache Solr Smart Data Ecosystem

Search engines, and Apache Solr in particular, are quickly shifting the focus away from “big data” systems storing massive amounts of raw (but largely unharnessed) content, to “smart data” systems where the most relevant and actionable content is quickly surfaced instead. Apache Solr is the blazing-fast and fault-tolerant distributed search engine leveraged by 90% of Fortune 500 companies. As a community-driven open source project, Solr brings in diverse contributions from many of the top companies in the world, particularly those for whom returning the most relevant results is mission critical.

Out of the box, Solr includes advanced capabilities like learning to rank (machine-learned ranking), graph queries and distributed graph traversals, job scheduling for processing batch and streaming data workloads, the ability to build and deploy machine learning models, and a wide variety of query parsers and functions allowing you to very easily build highly relevant and domain-specific semantic search, recommendations, or personalized search experiences. These days, Solr even enables you to run SQL queries directly against it, mixing and matching the full power of Solr’s free-text, geospatial, and other search capabilities with a query language already known by most developers (and which many external systems can use to query Solr directly).

Due to the community-oriented nature of Solr, the ecosystem of capabilities also spans well beyond just the core project. In this talk, we’ll also cover several other projects within the larger Apache Lucene/Solr ecosystem that further enhance Solr’s smart data capabilities: bi-directional integration of Apache Spark and Solr’s capabilities, large-scale entity extraction, semantic knowledge graphs for discovering, traversing, and scoring meaningful relationships within your data, auto-generation of domain-specific ontologies, running SPARQL queries against Solr on RDF triples, probabilistic identification of key phrases within a query or document, conceptual search leveraging Word2Vec, and even Lucidworks’ own Fusion project which extends Solr to provide an enterprise-ready smart data platform out of the box.

We’ll dive into how all of these capabilities can fit within your data science toolbox, and you’ll come away with a really good feel for how to build highly relevant “smart data” applications leveraging these key technologies.

The Apache Solr Smart Data Ecosystem

  1. 1. The Apache Solr Smart Data Ecosystem Trey Grainger SVP of Engineering, Lucidworks DFW Data Science 2017.01.09
  2. 2. Trey Grainger SVP of Engineering • Previously Director of Engineering @ CareerBuilder • MBA, Management of Technology – Georgia Tech • BA, Computer Science, Business, & Philosophy – Furman University • Information Retrieval & Web Search - Stanford University Other fun projects: • Co-author of Solr in Action, plus numerous research papers • Frequent conference speaker • Founder of Celiaccess.com, the gluten-free search engine • Lucene/Solr contributor About Me
  3. 3. • Apache Solr Overview Lucidworks Fusion Overview • Search & Relevancy - Keyword Search - Text Analysis - Multilingual Text Analysis • Recommendations (Demo) • Relevancy Spectrum • Reflected Intelligence - Relevancy Tuning - Learning to Rank (Demo) - Signals (Demo) … Agenda … • Semantic Search - Entity Extraction (Demo) - Query Parsing (Demo) - Semantic Knowledge Graph (Demo) • Streaming Expressions • Solr / Fusion SQL (Demo) • Solr Graph DFW Data Science
  4. 4. what do you do?
  5. 5. Search-Driven Everything Customer Service Customer Insights Fraud Surveillance Research Portal Online Retail Digital Content
  6. 6. Lucidworks enables Search-Driven Everything Data Acquisition Indexing & Streaming Smart Access API Recommendations & Alerts Analytics & Insights Extreme Relevancy CUSTOMER SERVICE RESEARCH PORTAL DIGITAL CONTENT CUSTOMER INSIGHTS FRAUD SURVEILLANCE ONLINE RETAIL • Access all your data in a number of ways from one place. • Secure storage and processing from Solr and Spark. • Acquire data from any source with pre-built connectors and adapters. Machine learning and advanced analytics turn all of your apps into intelligent data-driven applications.
  7. 7. Apache Solr
  8. 8. “Solr is the popular, blazing-fast, open source enterprise search platform built on Apache Lucene™.”
  9. 9. Key Solr Features: ● Multilingual Keyword search ● Relevancy Ranking of results ● Faceting & Analytics (nested / relational) ● Highlighting ● Spelling Correction ● Autocomplete/Type-ahead Prediction ● Sorting, Grouping, Deduplication ● Distributed, Fault-tolerant, Scalable ● Geospatial search ● Complex Function queries ● Recommendations (More Like This) ● Graph Queries and Traversals ● SQL Query Support ● Streaming Aggregations ● Batch and Streaming processing ● Highly Configurable / Plugins ● Learning to Rank ● Building machine-learning models ● … many more *source: Solr in Action, chapter 2
  10. 10. The standard for enterprise search. of Fortune 500 uses Solr. 90%
  11. 11. Lucidworks Fusion
  12. 12. DFW Data Science
  13. 13. All Your Data
  14. 14. • Over 50 connectors to integrate all your data • Robust parsing framework to seamlessly ingest all your document types • Point and click Indexing configuration and iterative simulation of results for full control over your ETL process • Your security model enforced end-to-end from ingest to search across your different datasources
  15. 15. Experience Management
  16. 16. • Relevancy tuning: Point-and-click query pipeline configuration allows fine-grained control of results. • Machine-driven relevancy: Signal aggregations learn and automatically tune relevancy and drive recommendations out of the box. • Powerful pipeline stages: Customize fields, stages, synonyms, boosts, facets, machine learning models, your own scripted behavior, and dozens of other powerful search stages. • Turnkey search UI (Lucidworks View): Build a sophisticated end-to-end search application in just hours.
  17. 17. Operational Simplicity
  18. 18. SECURITY BUILT-IN Shards Shards Apache Solr Apache Zookeeper ZK 1 Leader Election Load Balancing Shared Config Management Worker Worker Apache Spark Cluster Manager Core Services • • • NLP Recommenders / Signals Blob Storage Pipelines Scheduling Alerting / Messaging Connectors RESTAPI Admin UI Lucidworks View LOGS FILE WEB DATABASE CLOUD HDFS(Optional)
  19. 19. • 75% decrease in development time • Licensing costs cut by 50% With Fusion’s out-of-the-box capabilities, we skipped months in our dev cycle so we could focus our team where they would have the most impact. We cut our licensing costs by 50% and improved application usability. The Lucidworks professional services team amplified our success even further. We’re all Fusion from here on out!” “ Lourduraju Pamishetty Senior IT Application Architect —
  20. 20. • Seamless integration of your entire search & analytics platform • All capabilities exposed through secured API's, so you can use our UI or build your own. • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem. • Distributed, fault-tolerant scaling and supervision of your entire search application
  21. 21. Core Services • • • NLP Recommenders / Signals Blob Storage Pipelines Scheduling Alerting / Messaging Connectors RESTAPI Admin UI Lucidworks View LOGS FILE WEB DATABASE CLOUD • Seamless integration of your entire search & analytics platform • All capabilities exposed through secured API's, so you can use our UI or build your own. • End-to-end security policies can be applied out of the box to every aspect of your search ecosystem. • Distributed, fault-tolerant scaling and supervision of your entire search application
  22. 22. Fusion powers search for the brightest companies in the world.
  23. 23. Lucidworks Fusion
  24. 24. search & relevancy
  25. 25. Basic Keyword Search The beginning of a typical search journey
  26. 26. Term Documents a doc1 [2x] brown doc3 [1x] , doc5 [1x] cat doc4 [1x] cow doc2 [1x] , doc5 [1x] … ... once doc1 [1x], doc5 [1x] over doc2 [1x], doc3 [1x] the doc2 [2x], doc3 [2x], doc4[2x], doc5 [1x] … … Document Content Field doc1 once upon a time, in a land far, far away doc2 the cow jumped over the moon. doc3 the quick brown fox jumped over the lazy dog. doc4 the cat in the hat doc5 The brown cow said “moo” once. … … What you SEND to Lucene/Solr: How the content is INDEXED into Lucene/Solr (conceptually): The inverted index DFW Data Science
  27. 27. /solr/select/?q=apache solr Field Documents … … apache doc1, doc3, doc4, doc5 … hadoop doc2, doc4, doc6 … … solr doc1, doc3, doc4, doc7, doc8 … … doc5 doc7 doc8 doc1 doc3 doc4 solr apache apache solr Matching queries to documents DFW Data Science
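To make the last two slides concrete, here is a toy Python sketch (nothing like Lucene's actual on-disk data structures) that builds an inverted index from the example documents and answers a conjunctive query by intersecting postings lists; the doc ids and text come from the slides.

from collections import defaultdict

# Example documents from the slide.
docs = {
    "doc1": "once upon a time, in a land far, far away",
    "doc2": "the cow jumped over the moon.",
    "doc3": "the quick brown fox jumped over the lazy dog.",
    "doc4": "the cat in the hat",
    "doc5": "The brown cow said 'moo' once.",
}

# Inverted index: term -> set of doc ids (a simplified postings list).
index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.lower().replace(",", " ").replace(".", " ").split():
        index[term].add(doc_id)

def match_all(query):
    """AND semantics: return only docs containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return set.intersection(*postings) if postings else set()

print(sorted(match_all("brown cow")))    # ['doc5']
print(sorted(match_all("jumped over")))  # ['doc2', 'doc3']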
  28. 28. Text Analysis Generating terms to index from raw text
  29. 29. Text Analysis in Solr A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream *From Solr in Action, Chapter 6 DFW Data Science
  30. 30. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
  31. 31. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
  32. 32. A text field in Lucene/Solr has an Analyzer containing: ① Zero or more CharFilters Takes incoming text and “cleans it up” before it is tokenized ② One Tokenizer Splits incoming text into a Token Stream containing Zero or more Tokens ③ Zero or more TokenFilters Examines and optionally modifies each Token in the Token Stream Text Analysis in Solr *From Solr in Action, Chapter 6 DFW Data Science
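As a rough illustration of the CharFilter, Tokenizer, and TokenFilter pipeline described above, here is a toy Python analogue; the real thing is configured declaratively in the fieldType XML shown a couple of slides later, and the filter choices and stopword list here are just examples.

import re

def html_strip_char_filter(text):          # CharFilter: clean up raw text
    return re.sub(r"<[^>]+>", " ", text)

def whitespace_tokenizer(text):            # Tokenizer: text -> token stream
    return text.split()

def lowercase_filter(tokens):              # TokenFilter: modify each token
    return [t.lower() for t in tokens]

def stopword_filter(tokens, stopwords=frozenset({"a", "the", "of"})):
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    tokens = whitespace_tokenizer(html_strip_char_filter(text))
    return stopword_filter(lowercase_filter(tokens))

print(analyze("<b>The</b> Quick Brown Fox"))   # ['quick', 'brown', 'fox']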
  33. 33. Multi-lingual Text Analysis Analyzing text across multiple languages
  34. 34. Example English Analysis Chains

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_en.txt" ignoreCase="true"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.EnglishPossessiveFilterFactory"/>
      <filter class="solr.KeywordMarkerFilterFactory" protected="lang/en_protwords.txt"/>
      <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
  </fieldType>

  <fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <charFilter class="solr.HTMLStripCharFilterFactory"/>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.SynonymFilterFactory" synonyms="lang/en_synonyms.txt" ignoreCase="true" expand="true"/>
      <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
      <filter class="solr.ASCIIFoldingFilterFactory"/>
      <filter class="solr.KStemFilterFactory"/>
      <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    </analyzer>
  </fieldType>

  DFW Data Science
  35. 35. Per-language Analysis Chains DFW Data Science *Some of the 32 different language configurations in Appendix B of Solr in Action
  36. 36. Per-language Analysis Chains *Some of the 32 different language configurations in Appendix B of Solr in Action DFW Data Science
  37. 37. Which Stemmer do I choose? *From Solr in Action, Chapter 14 DFW Data Science
  38. 38. Common English Stemmers DFW Data Science *From Solr in Action, Chapter 14
  39. 39. When Stemming goes awry Fixing Stemming Mistakes: • Unfortunately, every stemmer will have problem-cases that aren’t handled as you would expect • Thankfully, stemmers can be overridden • KeywordMarkerFilter: protects a list of terms you specify from being stemmed • StemmerOverrideFilter: applies a list of custom term mappings you specify Alternate strategy: • Use Lemmatization (root-form analysis) instead of Stemming • Commercial vendors help tremendously in this space • The Hunspell stemmer enables dictionary-based support of varying quality in over 100 languages DFW Data Science
  40. 40. Relevancy Scoring the results, returning the best matches
  41. 41. Classic Lucene Relevancy Algorithm (now switched to BM25): *Source: Solr in Action, chapter 3

  Score(q, d) = [ Σ over t in q of ( tf(t in d) · idf(t)² · t.getBoost() · norm(t, d) ) ] · coord(q, d) · queryNorm(q)

  Where: t = term; d = document; q = query; f = field
    tf(t in d) = sqrt(numTermOccurrencesInDocument)
    idf(t) = 1 + log(numDocs / (docFreq + 1))
    coord(q, d) = numTermsInDocumentFromQuery / numTermsInQuery
    queryNorm(q) = 1 / sqrt(sumOfSquaredWeights)
    sumOfSquaredWeights = q.getBoost()² · Σ over t in q of ( idf(t) · t.getBoost() )²
    norm(t, d) = d.getBoost() · lengthNorm(f) · f.getBoost()

  DFW Data Science
  42. 42. • Term Frequency: “How well a term describes a document?” – Measure: how often a term occurs per document • Inverse Document Frequency: “How important is a term overall?” – Measure: how rare the term is across all documents TF * IDF *Source: Solr in Action, chapter 3 DFW Data Science
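A short Python sketch of the tf-idf pieces from the two slides above, simplified to omit the norms, coord, and boost factors of the full formula; the corpus and query are made up.

import math

def tf(term, doc_tokens):
    # "How well does the term describe this document?" (sqrt of occurrences)
    return math.sqrt(doc_tokens.count(term))

def idf(term, corpus):
    # "How important is the term overall?" (rarer across the corpus -> higher)
    doc_freq = sum(1 for doc in corpus if term in doc)
    return 1 + math.log(len(corpus) / (doc_freq + 1))

def score(query_terms, doc_tokens, corpus):
    # Sum of tf * idf^2 per query term, mirroring the classic formula above.
    return sum(tf(t, doc_tokens) * idf(t, corpus) ** 2 for t in query_terms)

corpus = [
    ["apache", "solr", "search", "relevancy"],
    ["apache", "hadoop", "distributed", "storage"],
    ["solr", "faceting", "relevancy", "ranking"],
]
print(score(["apache", "solr"], corpus[0], corpus))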
  43. 43. News Search: popularity and freshness drive relevance Restaurant Search: geographical proximity and price range are critical Ecommerce: likelihood of a purchase is key Movie search: more popular titles are generally more relevant Job search: category of job, salary range, and geographical proximity matter TF * IDF of keywords can’t hold its own against good domain-specific relevance factors! That’s great, but what about domain-specific knowledge? DFW Data Science
  44. 44. John lives in Boston but wants to move to New York or possibly another big city. He is currently a sales manager but wants to move towards business development. Irene is a bartender in Dublin and is only interested in jobs within 10KM of her location in the food service industry. Irfan is a software engineer in Atlanta and is interested in software engineering jobs at a Big Data company. He is happy to move across the U.S. for the right job. Jane is a nurse educator in Boston seeking between $40K and $60K *Example from chapter 16 of Solr in Action Consider what you know about users DFW Data Science
  45. 45. http://localhost:8983/solr/jobs/select/? fl=jobtitle,city,state,salary& q=( jobtitle:"nurse educator"^25 OR jobtitle:(nurse educator)^10 ) AND ( (city:"Boston" AND state:"MA")^15 OR state:"MA") AND _val_:"map(salary, 40000, 60000, 10, 0)" *Example from chapter 16 of Solr in Action Query for Jane Jane is a nurse educator in Boston seeking between $40K and $60K DFW Data Science
  46. 46. Search Results for Jane { ... "response":{"numFound":22,"start":0,"docs":[ {"jobtitle":"Clinical Educator (New England/Boston)", "city":"Boston", "state":"MA", "salary":41503}, {"jobtitle":"Nurse Educator", "city":"Braintree", "state":"MA", "salary":56183}, {"jobtitle":"Nurse Educator", "city":"Brighton", "state":"MA", "salary":71359}, …]}} *Example documents available @ http://github.com/treygrainger/solr-in-action DFW Data Science
  47. 47. You just built a recommendation engine!
  48. 48. Demo: Recommendations
  49. 49. Traditional Keyword Search Recommendations Semantic Search User Intent Personalized Search Augmented Search Domain-aware Matching The Relevancy Spectrum DFW Data Science
  50. 50. Basic Keyword Search (inverted index, tf-idf, bm25, query formulation, etc.) Taxonomies / Entity Extraction (entity recognition, ontologies, synonyms, etc.) Query Intent (query classification, semantic query parsing, concept expansion, rules, clustering, classification) Relevancy Tuning (signals, AB testing/genetic algorithms, Learning to Rank, Neural Networks) Self-learning Data-driven App Sophistication DFW Data Science
  51. 51. what is “reflected intelligence”?
  52. 52. The Three C’s Content: Keywords and other features in your documents Collaboration: How others have chosen to interact with your system Context: Available information about your users and their intent Reflected Intelligence: “Leveraging previous data and interactions to improve how new data and interactions should be interpreted” DFW Data Science
  53. 53. Feedback Loops User Searches User Sees Results User takes an action Users’ actions inform system improvements DFW Data Science
  54. 54. ● Recommendation Algorithms ● Building user profiles from past searches, clicks, and other actions ● Identifying correlations between keywords/phrases ● Building out automatically-generated ontologies from content and queries ● Determining relevancy judgements (precision, recall, nDCG, etc.) from click logs ● Learning to Rank - using relevancy judgements and machine learning to train a relevance model ● Discovering misspellings, synonyms, acronyms, and related keywords ● Disambiguation of keyword phrases with multiple meanings ● Learning what’s important in your content Examples of Reflected Intelligence DFW Data Science
  55. 55. Relevancy Tuning Improving ranking algorithms through experiments and models
  56. 56. How to Measure Relevancy? A = retrieved documents, C = relevant documents, B = their intersection. Precision = B/A, Recall = B/C. Problem: suppose precision = 90% and recall = 100%, but the 10% of retrieved documents that are irrelevant are ranked at the top of the results. Is that OK? DFW Data Science
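A minimal Python sketch of the two set-based metrics above; the doc ids are arbitrary. Note that neither number changes no matter where the irrelevant result is ranked, which is exactly the problem the slide raises.

def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)   # A and C
    hits = retrieved & relevant                           # B
    return len(hits) / len(retrieved), len(hits) / len(relevant)  # B/A, B/C

# 3 of the 4 retrieved docs are relevant; one relevant doc was missed.
print(precision_recall(["d1", "d2", "d3", "d4"], ["d2", "d3", "d4", "d9"]))  # (0.75, 0.75)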
  57. 57. Normalized Discounted Cumulative Gain (nDCG) Ranking, Ideal vs. Given: [table 1: rank 3 → 0.95, rank 1 → 0.70, rank 2 → 0.60, rank 4 → 0.45] [table 2: rank 1 → 0.95, rank 2 → 0.85, rank 3 → 0.80, rank 4 → 0.65] • Position is considered in quantifying relevancy. • A labeled dataset is required. DFW Data Science
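A small Python sketch of nDCG, assuming graded relevance labels per rank position; the exact gain and discount functions vary between implementations, and this uses the common log2 discount.

import math

def dcg(relevances):
    # Discount each graded relevance by its (1-based) rank position.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))

def ndcg(given):
    # Normalize by the DCG of the ideal (descending) ordering of the same labels.
    return dcg(given) / dcg(sorted(given, reverse=True))

given_ranking = [0.70, 0.60, 0.95, 0.45]   # the best-labeled doc sits at rank 3
print(round(ndcg(given_ranking), 3))        # < 1.0, penalizing the misplaced doc
print(round(ndcg(sorted(given_ranking, reverse=True)), 3))  # 1.0 for the ideal order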
  58. 58. Learning to Rank
  59. 59. Learning to Rank (LTR) ● It applies machine learning techniques to discover the combination of features that provides the best ranking. ● It requires a labeled set of documents with relevancy scores for a given set of queries. ● Features used for ranking are usually more computationally expensive than the ones used for matching. ● It typically re-ranks a subset of the matched documents (e.g. the top 1000). DFW Data Science
  60. 60. DFW Data Science
  61. 61. Common LTR Algorithms • RankNet* (Neural Network, boosted trees) • LambdaMart* (set of regression trees) • SVM Rank** (SVM classifier) ** http://research.microsoft.com/en-us/people/hangli/cao-et-al-sigir2006.pdf * http://research.microsoft.com/pubs/132652/MSR-TR-2010-82.pdf DFW Data Science
  62. 62. LambdaMart Example Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
  63. 63. Demo: Solr Learning to Rank
  64. 64. Obtaining Relevancy Judgements Typical Methodologies 1) Hire employees, contractors, or interns -Pros: Accuracy -Cons: Expensive Not scalable (cost or man-power-wise) Data Becomes Stale 2) Crowdsource -Pros: Less cost, more scalable -Cons: Less accurate Data still becomes stale Source: T. Grainger, K. AlJadda. ”Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
  65. 65. Reflected Intelligence: Possible to infer relevancy judgements? [diagram: a ranked result list (Rank 1–4: Doc1–Doc4) alongside per-query click vectors over Doc1–Doc3, e.g. (0, 1, 1) and (1, 0, 0)] Source: T. Grainger, K. AlJadda. "Reflected Intelligence: Evolving self-learning data systems". Georgia Tech, 2016 DFW Data Science
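The slide asks whether relevancy judgements can be inferred from interactions instead of human labelers. Below is a hedged sketch of the simplest possible version, using per-query click-through rate as a stand-in for a graded label; real systems correct for position bias with proper click models, which this does not.

from collections import defaultdict

# (query, doc, rank_shown, clicked) tuples harvested from search logs (made up).
log = [
    ("solr", "docA", 1, True),  ("solr", "docB", 2, False),
    ("solr", "docA", 1, False), ("solr", "docB", 2, True),
    ("solr", "docC", 3, True),  ("solr", "docC", 3, True),
]

impressions = defaultdict(int)
clicks = defaultdict(int)
for query, doc, rank, clicked in log:
    impressions[(query, doc)] += 1
    clicks[(query, doc)] += int(clicked)

# Naive judgement: click-through rate per (query, doc) pair.
judgements = {pair: clicks[pair] / impressions[pair] for pair in impressions}
print(judgements)  # {('solr', 'docA'): 0.5, ('solr', 'docB'): 0.5, ('solr', 'docC'): 1.0}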
  66. 66. Automated Relevancy Benchmarking [chart: daily relevancy scores from 10/1/16 to 10/10/16 comparing the Default Algorithm against Algorithms 1–5; scores range roughly 0.30–0.81] DFW Data Science
  67. 67. Demo: Fusion Signals
  68. 68. • 200%+ increase in click-through rates • 91% lower TCO • Fewer support tickets • Increased customer satisfaction
  69. 69. semantic search
  70. 70. DFW Data Science
  71. 71. Building a Taxonomy of Entities Many ways to generate this: • Topic Modelling • Clustering of documents • Statistical Analysis of interesting phrases - Word2Vec / Glove / Dice Conceptual Search • Buy a dictionary (often doesn’t work for domain-specific search problems) • Generate a model of domain-specific phrases by mining query logs for commonly searched phrases within the domain* * K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. DFW Data Science
  72. 72. DFW Data Science
  73. 73. DFW Data Science
  74. 74. entity extraction
  75. 75. DFW Data Science
  76. 76. Demo: Solr Text Tagger
  77. 77. semantic query parsing
  78. 78. DFW Data Science
  79. 79. Probabilistic Query Parser Goal: given a query, predict which combinations of keywords should be combined together as phrases Example: senior java developer hadoop Possible Parsings: senior, java, developer, hadoop "senior java", developer, hadoop "senior java developer", hadoop "senior java developer hadoop” "senior java", "developer hadoop” senior, "java developer", hadoop senior, java, "developer hadoop" Source: Trey Grainger, “Searching on Intent: Knowledge Graphs, Personalization, and Contextual Disambiguation”, Bay Area Search Meetup, November 2015. DFW Data Science
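A Python sketch of just the enumeration step behind the query parser above: generate every contiguous phrase segmentation of the query, which a real implementation would then score with phrase statistics from a corpus or query-log language model (the scoring model is omitted here).

def segmentations(terms):
    """All ways to split a term list into contiguous phrases."""
    if not terms:
        return [[]]
    results = []
    for i in range(1, len(terms) + 1):
        head = terms[:i]                       # candidate leading phrase
        for rest in segmentations(terms[i:]):  # recurse on the remainder
            results.append([head] + rest)
    return results

query = "senior java developer hadoop".split()
for seg in segmentations(query):
    print(", ".join('"' + " ".join(phrase) + '"' for phrase in seg))
# 2^(n-1) = 8 candidate parsings for a 4-term query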
  80. 80. Demo: Probabilistic Query Parser
  81. 81. Semantic Query Parsing Identification of phrases in queries using two steps: 1) Check a dictionary of known terms that is continuously built, cleaned, and refined based upon common inputs from interactions with real users of the system. The SolrTextTagger works well for this.* 2) Also invoke a probabilistic query parser to dynamically identify unknown phrases using statistics from a corpus of data (language model) *K. Aljadda, M. Korayem, T. Grainger, C. Russell. "Crowdsourced Query Augmentation through Semantic Discovery of Domain-specific Jargon," in IEEE Big Data 2014. DFW Data Science
  82. 82. query augmentation
  83. 83. DFW Data Science
  84. 84. Knowledge Graph Semantic Data Encoded into Free Text Content e en eng engi engineer engineers engineer engineers Node Type: Term software engineer software engineers electrical engineering engineer engineering software … … … Node Type: Character Sequence Node Type: Term Sequence Node Type: Document id: 1 text: looking for a software engineer with degree in computer science or electrical engineering id: 2 text: apply to be a software engineer and work with other great software engineers id: 3 text: start a great career in electrical engineering … … DFW Data Science
  85. 85. id: 1 job_title: Software Engineer desc: software engineer at a great company skills: .Net, C#, java id: 2 job_title: Registered Nurse desc: a registered nurse at hospital doing hard work skills: oncology, phlebotomy id: 3 job_title: Java Developer desc: a software engineer or a java engineer doing work skills: java, scala, hibernate field term postings list doc pos desc a 1 4 2 1 3 1, 5 at 1 3 2 4 company 1 6 doing 2 6 3 8 engineer 1 2 3 3, 7 great 1 5 hard 2 7 hospital 2 5 java 3 6 nurse 2 3 or 3 4 registered 2 2 software 1 1 3 2 work 2 10 3 9 job_title java developer 3 1 … … … … field doc term desc 1 a at company engineer great software 2 a at doing hard hospital nurse registered work 3 a doing engineer java or software work job_title 1 Software Engineer … … … Terms-Docs Inverted Index / Docs-Terms Forward Index / Documents Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph DFW Data Science
  86. 86. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Set-theory View Graph View How the Graph Traversal Works skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology Data Structure View Java Scala Hibernate docs 1, 2, 6 docs 3, 4 Oncology doc 5 DFW Data Science
  87. 87. Knowledge Graph Graph Model Structure: Single-level Traversal / Scoring: Multi-level Traversal / Scoring:
  88. 88. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Multi-level Traversal Data Structure View Graph View doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 skill: Java skill: Java skill: Scala skill: Hibernate skill: Oncology doc 1 doc 2 doc 3 doc 4 doc 5 doc 6 job_title: Software Engineer job_title: Data Scientist job_title: Java Developer …… Inverted Index Lookup Forward Index Lookup Forward Index Lookup Inverted Index Lookup Java Java Developer Hibernate Scala Software Engineer Data Scientist has_related_job_title has_related_job_title DFW Data Science
  89. 89. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Scoring nodes in the Graph Foreground vs. Background Analysis Every term is scored against its context. The more commonly the term appears within its foreground context versus its background context, the more relevant it is to the specified foreground context. z = (countFG(x) - totalDocsFG * probBG(x)) / sqrt(totalDocsFG * probBG(x) * (1 - probBG(x))) { "type":"keywords", "values":[ { "value":"hive", "relatedness": 0.9765, "popularity":369 }, { "value":"spark", "relatedness": 0.9634, "popularity":15653 }, { "value":".net", "relatedness": 0.5417, "popularity":17683 }, { "value":"bogus_word", "relatedness": 0.0, "popularity":0 }, { "value":"teaching", "relatedness": -0.1510, "popularity":9923 }, { "value":"CPR", "relatedness": -0.4012, "popularity":27089 } ] } + - Foreground Query: "Hadoop" DFW Data Science
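A Python sketch of the foreground-vs-background z-score above; the counts are invented, and note that the relatedness values in the slide's JSON have additionally been scaled into roughly the [-1, 1] range, while this computes only the raw z statistic.

import math

def relatedness_z(count_fg, total_docs_fg, count_bg, total_docs_bg):
    prob_bg = count_bg / total_docs_bg                 # probBG(x)
    expected_fg = total_docs_fg * prob_bg              # totalDocsFG * probBG(x)
    return (count_fg - expected_fg) / math.sqrt(expected_fg * (1 - prob_bg))

# e.g. how related is "spark" to the foreground set of docs matching "Hadoop"?
print(relatedness_z(count_fg=9000, total_docs_fg=10000,
                    count_bg=15653, total_docs_bg=1000000))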
  90. 90. Source: Trey Grainger, Khalifeh AlJadda, Mohammed Korayem, Andries Smith.“The Semantic Knowledge Graph: A compact, auto-generated model for real-time traversal and ranking of any relationship within a domain”. DSAA 2016. Knowledge Graph Multi-level Graph Traversal with Scores software engineer* (materialized node) Java C# .NET .NET Developer Java Developer Hibernate ScalaVB.NET Software Engineer Data Scientist Skill Nodes has_related_skillStarting Node Skill Nodes has_related_skill Job Title Nodes has_related_job_title 0.90 0.88 0.93 0.93 0.34 0.74 0.91 0.89 0.74 0.89 0.780.72 0.48 0.93 0.76 0.83 0.80 0.64 0.61 0.780.55 DFW Data Science
  91. 91. Knowledge Graph Use Case: Document Summarization Experiment: Pass in raw text (extracting phrases as needed), and rank their similarity to the documents using the SKG. Additionally, can traverse the graph to “related” entities/keyword phrases NOT found in the original document Applications: Content-based and multi-modal recommendations (no cold-start problem), data cleansing prior to clustering or other ML methods, semantic search / similarity scoring
  92. 92. Demo: Semantic Knowledge Graph
  93. 93. Knowledge Graph DFW Data Science
  94. 94. Knowledge Graph DFW Data Science
  95. 95. DFW Data Science
  96. 96. streaming expressions
  97. 97. • Perform relational operations on streams • Stream sources: search, jdbc, facets, features, gatherNodes, shortestPath, train, model, random, stats, topic • Stream decorators: classify, commit, complement, daemon, executor, fetch, having, leftOuterJoin, hashJoin, innerJoin, intersect, merge, null, outerHashJoin, parallel, priority, reduce, rollup, scoreNodes, select, sort, top, unique, update Streaming Expressions Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  98. 98. • Relies on docValues (column-oriented data structure) and /export handler • Extreme read performance (8-10x faster than queries using cursorMark) • Facet or map/reduce style aggregation modes • Tiered architecture • SQL interface tier • Worker tier (scale a pool of worker “nodes” independently of the data collection) • Data tier (Solr collection) Streaming API: Nuts and Bolts Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
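A hedged Python sketch of calling the streaming API over HTTP: it posts a streaming expression (a simple search stream source over the /export handler, as in the collaborative-filtering example later in the deck) to the /stream handler shown in the graph slides. The collection and field names come from the deck's movielens examples; host and port are illustrative.

import requests

# A simple stream source: export every doc's movie_id_i and rating_i, sorted.
expr = ('search(movielens, q="*:*", fl="movie_id_i,rating_i", '
        'sort="movie_id_i asc", qt="/export")')

resp = requests.post(
    "http://localhost:8983/solr/movielens/stream",  # /stream handler on the collection
    data={"expr": expr},
)
print(resp.text)  # {"result-set":{"docs":[..., {"EOF":true,"RESPONSE_TIME":...}]}}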
  99. 99. Streaming Expressions - Examples Shortest-path Graph Traversal Parallel Batch Processing Train a Logistic Regression Model Distributed Joins Rapid Export of all Search Results Pull Results from External Database Sources: https://cwiki.apache.org/confluence/display/solr/Streaming+Expressions http://joelsolr.blogspot.com/2016/10/solr-63-batch-jobs-parallel-etl-and.html Classifying Search Results
  100. 100. Solr SQL
  101. 101. • SQL is the ubiquitous language for analytics • People: Less training and easier to understand • Tools! Solr as a JDBC data source (DbVisualizer, Apache Zeppelin, and SQuirreL SQL) • Query planning / optimization can evolve iteratively SQL is a natural extension for Solr’s parallel computing engine Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  102. 102. Give me the top 5 action movies with rating of 4 or better Mental Warm-up /select?q=*:* &fq=genre_ss:action &fq=rating_i:[4 TO *] &facet=true &facet.limit=5 &facet.mincount=1 &facet.field=title_s SELECT title_s, COUNT(*) as cnt FROM movielens WHERE genre_ss='action' AND rating_i='[4 TO *]' GROUP BY title_s ORDER BY cnt desc LIMIT 5 { ... "facet_counts":{ "facet_fields":{ "title_s":[ "Star Wars (1977)",501, "Return of the Jedi (1983)",379, "Godfather, The (1972)",351, "Raiders of the Lost Ark (1981)",348, "Empire Strikes Back, The (1980)",293]}, ...}} {"result-set":{"docs":[ {"title_s":"Star Wars (1977)","cnt":501}, {"title_s":"Return of the Jedi (1983)","cnt":379}, {"title_s":"Godfather, The (1972)","cnt":351}, {"title_s":"Raiders of the Lost Ark (1981)","cnt":348}, {"title_s":"Empire Strikes Back, The (1980)","cnt":293}, {"EOF":true,"RESPONSE_TIME":42}]}} Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
  103. 103. SELECT gender_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens WHERE genre_ss='romance' AND age_i='[30 TO *]' GROUP BY gender_s ORDER BY num_ratings desc SQL Examples SELECT title_s, genre_s, COUNT(*) as num_ratings, avg(rating_i) as avg_rating FROM movielens GROUP BY title_s, genre_s HAVING num_ratings >= 100 ORDER BY avg_rating desc LIMIT 5 SELECT DISTINCT(user_id_i) as user_id FROM movielens WHERE genre_ss='documentary' ORDER BY user_id desc Give me the avg rating for men and women over 30 for romance movies Give me the top 5 rated movies with at least 100 ratings Give me the set of unique users that have rated documentaries DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  104. 104. parallel(workers, hashJoin( search(movielens, q=*:*, fl="user_id_i,movie_id_i,rating_i", sort="movie_id_i asc", partitionKeys="movie_id_i"), hashed=search(movielens_movies, q=*:*, fl="movie_id_i,title_s,genre_s", sort="movie_id_i asc", partitionKeys="movie_id_i"), on="movie_id_i" ), workers="4", sort="movie_id_i asc" ) Streaming Expression Example: hashJoin The small “right” side of the join gets loaded into memory on each worker node Each shard queried by N workers, so 4 workers x 4 shards means 16 queries (usually all replicas per shard are hit) Workers collection isolates parallel computation nodes from data nodes DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  105. 105. • spark-solr project uses the streaming API to pull data from Solr into Spark jobs (if docValues are enabled); see: https://github.com/lucidworks/spark-solr • Perform aggregations of “signals”, e.g. clicks, to compute boosts and recommendations using Spark • Custom Scala script jobs to perform complex analysis on data in Solr, e.g. sessionize request logs • Power rich data visualizations using Fusion’s SQL Engine powered by SparkSQL + Solr streaming aggregations How we use Solr streaming API in Fusion Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016. DFW Data Science
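A hedged PySpark sketch of the first bullet above: reading a Solr collection into a Spark DataFrame with the spark-solr DataSource and aggregating it. The "solr" format and the zkhost/collection options follow the lucidworks/spark-solr README; the collection, ZooKeeper address, and field name are illustrative.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("solr-signal-aggregation").getOrCreate()

# Read a Solr collection as a DataFrame (fields should have docValues enabled).
ratings = (spark.read.format("solr")
           .option("zkhost", "localhost:9983")
           .option("collection", "movielens")
           .load())

# Example "signal" aggregation: count interactions per movie.
ratings.groupBy("movie_id_i").count().orderBy("count", ascending=False).show(5)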
  106. 106. DFW Data Science
  107. 107. Comparing SQL Capabilities (columns: Fusion | Solr | Hive | Drill | SparkSQL)
  Secret Sauce: [Fusion] SparkSQL benefits + Solr benefits + enterprise security; [Solr] push complex query constructs into the engine (full text, spatial, relevancy, graph, functions, etc.); [Hive] mature SQL solution for the Hadoop stack; [Drill] execute SQL over NoSQL data sources; [SparkSQL] Spark core (optimized shuffle, in-memory, etc.) plus integration of other APIs: ML, Streaming, GraphX
  SQL Features: [Fusion] Maturing; [Solr] Evolving; [Hive] Mature; [Drill] Maturing; [SparkSQL] Maturing
  Scaling: [Fusion] Linear (shards and replicas), backed by inverted index; [Solr] Linear (shards and replicas), backed by inverted index; [Hive] Limited by Hadoop infrastructure (table scans); [Drill] Good, but need to benchmark; [SparkSQL] Memory intensive, scale out using Spark cluster, backed by RDDs
  Integration w/ external systems: [Fusion] Analytics Catalog API, JDBC Driver, ODBC Bridge; [Solr] JDBC stream source; [Hive] external tables / plugin API; [Drill] many drivers available; [SparkSQL] DataSource API, many systems supported
  DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  108. 108. Demo: Fusion SQL Engine
  109. 109. Graph
  110. 110. Graph Use Cases • Anomaly detection / fraud detection • Recommenders • Social network analysis • Graph Search • Access Control • Relationship discovery / scoring Examples o Find all draft blog posts about “Parallel SQL” written by a developer o Find all tweets mentioning “Solr” by me or people I follow o Find 3-star hotels in NYC my friends stayed in last year DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  111. 111. Solr Graph Timeline • Some data is much more naturally represented as a graph structure • Solr 6.0: Introduced the Graph Query Parser • Solr 6.1: Introduced Graph Streaming expressions … • Solr 6.3: Current Version • TBD: Semantic Knowledge Graph (patch available) DFW Data Science
  112. 112. Graph Query Parser • Query-time, cyclic aware graph traversal is able to rank documents based on relationships • Provides controls for depth, filtering of results and inclusion of root and/or leaves • Limitations: single node/shard only Examples: • http://localhost:8983/solr/graph/query?fl=id,score& q={!graph from=in_edge to=out_edge}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge traversalFilter='foo:[* TO 15]'}id:A • http://localhost:8983/solr/my_graph/query?fl=id& q={!graph from=in_edge to=out_edge maxDepth=1}foo:[* TO 10] DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  113. 113. Graph Streaming Expressions • Part of Solr’s broader Streaming Expressions capability • Implements a powerful, breadth-first traversal • Works across shards AND collections • Supports aggregations • Cycle aware curl -X POST -H "Content-Type: application/x-www-form-urlencoded" -d 'expr=…' "http://localhost:18984/solr/movielens/stream" DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  114. 114. All movies that user 389 watched expr:gatherNodes(movielens,walk="389->user_id_i",gather="movie_id_i") DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  115. 115. All movies that viewers of a specific movie watched expr:gatherNodes(movielens, gatherNodes(movielens,walk="161->movie_id_i",gather="user_id_i"), walk="node->user_id_i",gather="movie_id_i", trackTraversal="true" ) Movie 161: “The Air Up There” DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  116. 116. Collaborative Filtering expr=top(n="5", sort="count(*) desc", gatherNodes(movielens, top(n="30", sort="count(*) desc", gatherNodes(movielens, search(movielens, q="user_id_i:305", fl="movie_id_i", sort="movie_id_i asc", qt=“/export"), walk="movie_id_i->movie_id_i", gather="user_id_i", maxDocFreq="10000", count(*) ) ), walk="node->user_id_i", gather="movie_id_i", count(*) ) ) DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  117. 117. Comparing Graph Choices (columns: Solr | Elastic Graph | Neo4J | Spark GraphX)
  Best Use Case: [Solr] QParser: predefined relationships as filters; Expressions: fast, query-based, distributed graph ops. [Elastic Graph] Limited to sequential, term relatedness exploration only. [Neo4J] Graph ops and querying that fit on a single node. [Spark GraphX] Large-scale, iterative graph ops.
  Common Graph Algorithms (e.g. Pregel, Traversal): [Solr] Partial. [Elastic Graph] No. [Neo4J] Yes. [Spark GraphX] Yes.
  Scaling: [Solr] QParser: co-located shards only; Expressions: yes. [Elastic Graph] Yes. [Neo4J] Master/Replica. [Spark GraphX] Yes.
  Commercial License Required: [Solr] No. [Elastic Graph] Yes. [Neo4J] GPLv3. [Spark GraphX] No.
  Visualizations: [Solr] GraphML support (e.g. Gephi). [Elastic Graph] Kibana. [Neo4J] Neo4j browser. [Spark GraphX] 3rd party.
  DFW Data Science Source: “Solr 6 Deep Dive: SQL and Graph”. Grant Ingersoll & Tim Potter, 2016.
  118. 118. Additional References: DFW Data Science
  119. 119. Contact Info Trey Grainger trey.grainger@lucidworks.com @treygrainger http://solrinaction.com Meetup discount (39% off): 39grainger Other presentations: http://www.treygrainger.com DFW Data Science
