October 11-14, 2016 • Boston, MA
SearchHub: How to Spend Your Summer Keeping it Real
Grant Ingersoll
CTO, Lucidworks
SearchHub - How to Spend Your Summer Keeping it Real: Presented by Grant Ingersoll, Lucidworks

Presented at Lucene/Solr Revolution 2016

  1. October 11-14, 2016 • Boston, MA
  2. SearchHub: How to Spend Your Summer Keeping it Real
     Grant Ingersoll, CTO, Lucidworks
  3. SearchHub Demo
     • github.com/lucidworks/searchhub
     • http://searchhub.lucidworks.com
  4. SearchHub Details
     • Basics:
       • 37 Apache projects registered so far, plus LW properties, opensource.com, Stack Overflow
       • 130 datasources* including email, GitHub, JIRA*, website and wiki
       • Fusion 2.4.2
       • Signals everywhere
       • UI based on View (work not complete)
       • ASF mail archives mirrored at: http://asfmail.lucidworks.io
  5. Goals
     • Company:
       • “LucidFind” aka SearchHub on Fusion
       • Provide backend for LW.com search, including docs and support
       • Real, production, living, breathing instance of Fusion that we control
       • Fusion best practices demo of major use cases
     • CTO Office:
       • Real data, including clicks
       • Platform for machine learning and experimentation
       • Demos and talks
  6. Agenda
     • Quick Intro to Fusion and SearchHub
     • Fusion Configuration, UI, Middle Tier
     • Data Acquisition
     • Deployment
     • Signals and Machine Learning
     • Next Steps
  7. Fusion in a Nutshell
     • Drive next generation relevance via Content, Collaboration and Context
     • Built on best in class Open Source: Apache Solr + Spark
     • Simplify application development and reduce ongoing maintenance
     • Access data from anywhere to build intelligent, data-driven applications
  8. Fusion (architecture diagram; labels recovered from the slide)
     • Security built in
     • Apache Solr: shards
     • Apache ZooKeeper (ZK 1 to ZK N): leader election, load balancing, shared config management
     • Apache Spark: cluster manager, workers
     • Core services: NLP, recommenders / signals, blob storage, pipelines, scheduling, alerting / messaging, connectors
     • REST API, Admin UI, Lucidworks View
     • HDFS (optional)
     • Data sources: logs, file, web, database, cloud, Hadoop
  9. Fusion Configuration, UI and Middle Tier
     • UI
       • Derivative of Lucidworks View (https://lucidworks.com/products/view/)
       • Deep integration of Snowplow Javascript Tracker (https://github.com/snowplow/snowplow/wiki/javascript-tracker)
     • Python Flask middle tier ($SEARCHHUB_HOME/python)
       • Data sources (project_config)
       • Pipelines (fusion_config)
       • Schedules (fusion_config)
  10. Data Acquisition
      • Sources:
        • ASF mail archives mirrored at: http://asfmail.lucidworks.io
        • Stack Overflow (SO)
        • GitHub
      • Processing:
        • Pipelines, including custom stage for parsing mail
      • Main Challenges:
        • “fail2ban” by the ASF
        • Focused crawling of SO: JSoup FTW! (try.jsoup.org)
        • Mail threads
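The "Mail threads" challenge on the Data Acquisition slide can be illustrated with a minimal sketch. It is not the custom Fusion pipeline stage the slide mentions; it only assumes that archived messages carry the standard `Message-ID`, `In-Reply-To` and `References` headers, and the helper names (`thread_root`, `group_threads`) and the sample messages are hypothetical.

```python
from collections import defaultdict
from email import message_from_string


def thread_root(msg_id, parent_of):
    """Follow parent links upward until reaching an id with no known parent."""
    seen = set()
    while msg_id in parent_of and msg_id not in seen:
        seen.add(msg_id)  # guards against reply cycles in malformed archives
        msg_id = parent_of[msg_id]
    return msg_id


def group_threads(raw_messages):
    """Group raw messages into threads keyed by the root Message-ID."""
    parent_of = {}
    ids = []
    for raw in raw_messages:
        msg = message_from_string(raw)
        msg_id = (msg.get("Message-ID") or "").strip()
        if not msg_id:
            continue  # a message without an id cannot be threaded
        ids.append(msg_id)
        parent = (msg.get("In-Reply-To") or "").strip()
        if not parent:
            # Fall back to the last References entry, which names the direct parent.
            refs = (msg.get("References") or "").split()
            parent = refs[-1] if refs else ""
        if parent:
            parent_of[msg_id] = parent
    threads = defaultdict(list)
    for msg_id in ids:
        threads[thread_root(msg_id, parent_of)].append(msg_id)
    return dict(threads)


# Made-up messages for illustration only.
SAMPLE = [
    "Message-ID: <1@example.org>\nSubject: Faceting question\n\nHow do I facet?",
    "Message-ID: <2@example.org>\nIn-Reply-To: <1@example.org>\n"
    "Subject: Re: Faceting question\n\nTry facet.field.",
    "Message-ID: <3@example.org>\nReferences: <1@example.org> <2@example.org>\n"
    "Subject: Re: Faceting question\n\nThanks!",
    "Message-ID: <4@example.org>\nSubject: Unrelated\n\nHello.",
]
threads = group_threads(SAMPLE)
```

Real archives also contain replies whose parent was never archived, or clients that drop these headers, so a production stage would likely need subject-based fallbacks as well.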
  11. Deployment
      • Client and Middle Tier run in a Docker container using Apache HTTPd and mod_wsgi
      • Hosted on AWS (m4.2xlarge instances)
      • Fusion backend is out-of-the-box 2.4.2 with extra memory for Connectors and Solr
      • README has the gory details: https://github.com/lucidworks/searchhub/blob/master/README.md
  12. Signals
      • UI is fully instrumented, using Snowplow Javascript Tracker, for most user interactions. See SnowplowService.js
      • Captures, amongst other things:
        • User Id, Session Id, Unique Query Id, IP address, Location, Timing data
      • Actions tracked:
        • Page View
        • Page Ping (heartbeat) every 30 seconds
        • Search with query, displayed doc list and displayed facet list
        • Clicks with query, doc id, position, score and query UUID
        • Typeahead Clicks with characters typed and suggestions offered
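To make the click signal on the Signals slide concrete, here is a hedged sketch of how such events might be represented and rolled up into per-query click counts. The `ClickSignal` field names and the `click_counts` and `top_docs` helpers are hypothetical; they mirror the attributes listed on the slide and are not the Snowplow payload schema or Fusion's own signal aggregation.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ClickSignal:
    # Illustrative field names based on the attributes the slide lists.
    user_id: str
    session_id: str
    query_id: str   # query UUID tying the click back to its search event
    query: str
    doc_id: str
    position: int   # rank of the clicked result, counted from 0 (an assumption)
    score: float


def click_counts(signals):
    """Count clicks per (normalized query, doc_id) pair."""
    counts = Counter()
    for s in signals:
        counts[(s.query.strip().lower(), s.doc_id)] += 1
    return counts


def top_docs(counts, query, n=3):
    """Return up to n doc ids for a query, most clicked first (ties by doc id)."""
    q = query.strip().lower()
    ranked = sorted(
        ((doc, c) for (qq, doc), c in counts.items() if qq == q),
        key=lambda pair: (-pair[1], pair[0]),
    )
    return [doc for doc, _ in ranked[:n]]


# Made-up events for illustration only.
signals = [
    ClickSignal("u1", "s1", "q-1", "Solr facets", "doc-a", 0, 4.2),
    ClickSignal("u2", "s2", "q-2", "solr facets ", "doc-a", 1, 3.9),
    ClickSignal("u2", "s2", "q-3", "solr facets", "doc-b", 0, 4.0),
    ClickSignal("u3", "s3", "q-4", "spark", "doc-c", 2, 1.5),
]
counts = click_counts(signals)
popular = top_docs(counts, "Solr Facets")
```

Counts like these are one simple input for click-based boosting or recommendations; weighting by position or recency would be a natural refinement.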
  13. Machine Learning
      • Fusion makes it easy to “round-trip” ML data/models between Spark and Solr
      • Examples of:
        • Recommenders
        • Spark Lucene tokenization
        • k-Means
        • Word2Vec
        • Topic Detection (LDA)
        • Random Forests Classifier
      • Many examples in SparkShellHelpers.scala
  14. Experiment Management and Bandits
      • Goal: Experimentation, not hard-coded rules*
      • Goal: Drive down the cost of experimentation
      • “A/B testing on steroids”
      • Exploration vs. Exploitation
      • Fusion 3.0 (beta):
        • Record and calculate relevance metrics from within Fusion (gold standard, TREC, other)
        • Easily calculate MRR, NDCG, Precision, Recall and report over time
        • Support for Bandits: Epsilon-Greedy, SoftMax, UCB1
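The metrics and the bandit strategy named on the Experiment Management slide can be sketched in a few lines. This is a textbook-style illustration under stated assumptions, not Fusion 3.0's implementation: the function and class names are hypothetical, NDCG uses linear gains with a log2 discount, and the two "pipelines" and their click-through rates are made up.

```python
import math
import random


def reciprocal_rank(relevant, ranking):
    """1 / rank of the first relevant document, or 0.0 if none is returned."""
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(judged_queries):
    """judged_queries: list of (set_of_relevant_docs, ranked_doc_list) pairs."""
    return sum(reciprocal_rank(r, k) for r, k in judged_queries) / len(judged_queries)


def ndcg(gains, ranking, k=10):
    """NDCG@k with graded gains (doc -> gain) and a log2 rank discount."""
    def dcg(values):
        return sum(g / math.log2(i + 2) for i, g in enumerate(values))
    actual = dcg([gains.get(doc, 0.0) for doc in ranking[:k]])
    ideal = dcg(sorted(gains.values(), reverse=True)[:k])
    return actual / ideal if ideal > 0 else 0.0


class EpsilonGreedy:
    """Explore a random arm with probability epsilon, else exploit the best mean."""

    def __init__(self, arms, epsilon=0.1, rng=None):
        self.epsilon = epsilon
        self.rng = rng or random.Random()
        self.counts = {arm: 0 for arm in arms}
        self.means = {arm: 0.0 for arm in arms}

    def select(self):
        if self.rng.random() < self.epsilon:
            return self.rng.choice(list(self.counts))
        return max(self.means, key=self.means.get)

    def update(self, arm, reward):
        self.counts[arm] += 1
        # Incremental running mean of the rewards observed for this arm.
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


# Simulated experiment: reward is 1.0 when a simulated user clicks.
rng = random.Random(42)
true_ctr = {"pipeline_a": 0.05, "pipeline_b": 0.30}  # made-up rates
bandit = EpsilonGreedy(true_ctr, epsilon=0.1, rng=rng)
for _ in range(2000):
    arm = bandit.select()
    bandit.update(arm, 1.0 if rng.random() < true_ctr[arm] else 0.0)
```

After the loop, `bandit.counts` shows how traffic was split and `bandit.means` holds the estimated click-through rate per arm. Unlike a fixed A/B split, the bandit shifts traffic toward the better arm while the experiment is still running.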
  15. Demo
  16. Still Hungry?
      • “Combining Content and Collaboration in Recommenders” by Jake Mannix: Friday at 1:10 pm http://sched.co/7amt
      • https://github.com/lucidworks/searchhub
      • http://searchhub.lucidworks.com
      • Email: grant@lucidworks.com
      • Twitter: @gsingers
      • Web: http://lucidworks.com