Presented by Amit Nithianandan, Lead Engineer Search/Analytics New Platforms, Zvents/Stubhub
Zvents has used Apache Solr since 2007, when the project was still young. Since then, the team has made extensive use of its features and most recently completed an overhaul of the search engine, moving it to Solr 4.0. We'll touch on a variety of development and operational topics, including how we manage the build lifecycle of the search application using Maven, release the deployment package using Capistrano, and monitor with New Relic, as well as our extensive use of virtual machines to simplify node management. We'll also cover application-level details such as our unique federated search product and the integration of technologies such as Hypertable, RabbitMQ, and EHCache to power more real-time ranking and filtering based on traffic statistics and ticket inventory.
4. My “Street Cred”
• Joined Zvents in Aug 2008 as a member of the search
engineering team.
– Knew nothing about Solr/Lucene (Lucene... isn’t that a
misspelling of the Safeway brand milk?)
• Worked on small features early on
– New ranking configuration for “hot tickets” module on site.
• Worked on larger initiatives
– Multiple re-writes of the federated search component
– Recent upgrade to Solr 4.0
• Contribute to community
– Authored a few articles/blog posts, most notably about
running Solr in Eclipse.
– Wrote a Chrome extension for easily editing long (Solr) API URLs
5. Overview
• About Zvents
• Why Solr?
• Search @ Zvents Details
• Federated Search discussion
• Integration with external data stores
• Development/Deployment
• Operations/Performance Details
7. About Zvents
• Helping people find fun things to do since 2005!
• Content sourced from a variety of places:
– Normal end users
– Internal content editors
– External content editors @ local newspapers
– Feeds
• Powers the events guide section of hundreds
of local newspaper sites around the nation.
9. Why Solr?
• Flexible, Powerful, Customizable
• RESTful query API
• Scales reasonably well without hassle.
• Fast and easy to get started given the samples.
• Strong and active community
– Mailing list amazing. Conferences and meetups
help too
10. Zvents Search at a quick glance…
• 2 masters / 10 slaves, not sharded.
• Solr 4.x running on Jetty
• Six cores – Five host actual data, sixth used for
federated search
• Federated search among eight different
document types (e.g. venues, restaurants,
movies…)
• Total number of documents: ~5M
11. Search Challenges
• We allow blank text (“what”) searches so people can
look for stuff based on date and location
• How to surface the most relevant things to do?
12. Document Design Notes
• Venues, Artists are as you would expect
• For movies, index each showtime;
pk = {theater_id, movie_id, time} triple.
– When searching, filter by location, collapse on
movie_id, and sort by time asc.
• For events, index each occurrence (time).
– When searching, collapse on sequence_id and sort by
time asc to show the next upcoming occurrence.
• Avoid showing visual “duplicates”
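The collapse-and-sort idea above can be illustrated outside of Solr. This is only a minimal Python sketch of the logic (in production the collapsing happens inside Solr); the function name `collapse_earliest` and the sample showtimes are hypothetical.

```python
def collapse_earliest(docs, group_field, time_field="time"):
    """Keep only the earliest document per group, ordered by time ascending."""
    earliest = {}
    for doc in docs:
        key = doc[group_field]
        # Replace the stored doc only if this one occurs sooner.
        if key not in earliest or doc[time_field] < earliest[key][time_field]:
            earliest[key] = doc
    return sorted(earliest.values(), key=lambda d: d[time_field])


# Hypothetical showtime documents, one per {theater_id, movie_id, time} triple.
showtimes = [
    {"theater_id": 1, "movie_id": "m1", "time": "2013-05-01T19:00"},
    {"theater_id": 2, "movie_id": "m1", "time": "2013-05-01T17:30"},
    {"theater_id": 1, "movie_id": "m2", "time": "2013-05-01T18:00"},
]
collapsed = collapse_earliest(showtimes, "movie_id")
```

Each movie now appears once, represented by its soonest showtime, which is what keeps visual "duplicates" out of the result list.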
14. Zvents Search Service API
• Essentially Solr API with a few changes.
• ServletFilter and custom QueryComponent
used to translate URL parameters to proper
Solr parameter “syntax”
– E.g. latitude/longitude/radius converted to
geospatial query and distance in km.
• Federated search executed using ThreadPool
– Parallel searches, results blended together.
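As a rough illustration of the parameter translation, the Python sketch below rewrites latitude/longitude/radius into Solr's `{!geofilt}` filter, which takes `sfield`, `pt`, and `d` (distance in km). The real translation lives in a Java ServletFilter and QueryComponent; the field name `location`, the function name, and the assumption that the incoming radius is in miles are all hypothetical.

```python
MILES_TO_KM = 1.609344


def to_solr_params(params, sfield="location"):
    """Translate latitude/longitude/radius into a Solr geofilt filter query.

    Assumes (hypothetically) that the incoming radius is given in miles.
    """
    geo_keys = ("latitude", "longitude", "radius")
    # Pass every non-geo parameter through unchanged.
    out = {k: v for k, v in params.items() if k not in geo_keys}
    if all(k in params for k in geo_keys):
        d_km = float(params["radius"]) * MILES_TO_KM
        out["fq"] = "{!geofilt sfield=%s pt=%s,%s d=%.3f}" % (
            sfield, params["latitude"], params["longitude"], d_km)
    return out


translated = to_solr_params(
    {"q": "jazz", "latitude": "37.77", "longitude": "-122.42", "radius": "10"})
```

A fuller version would need to merge with any `fq` already present instead of overwriting it.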
18. Federated Search (cont’d)
• Zvents federator component executes multiple concurrent searches
and blends the results.
• Raw scores are not comparable across products, so they must be
normalized onto a common scale.
• Division by max to yield a 0-1 scale throws out the score
distribution differences.
• We chose to use the Z score: (score - avg) / stddev.
• Getting stats like average and standard deviation on the results is not
trivial.
• Initially considered hacking the handler to plug in my own
collector/scorer
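The blending step can be sketched in a few lines of Python. This is an assumption-laden illustration, not the Zvents federator: the names `z_normalize` and `federate` are made up, each product search is stood in for by a zero-argument callable, and the statistics are computed over the returned hits only, whereas the talk gathers them over all matching documents.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean, pstdev


def z_normalize(results):
    """Convert [(doc_id, raw_score)] into [(doc_id, z_score)]."""
    if not results:
        return []
    scores = [score for _, score in results]
    avg, sd = mean(scores), pstdev(scores)
    if sd == 0:
        # All scores identical: every hit is exactly average.
        return [(doc_id, 0.0) for doc_id, _ in results]
    return [(doc_id, (score - avg) / sd) for doc_id, score in results]


def federate(searchers):
    """Run each product search in parallel and blend by z score.

    searchers maps product name -> zero-argument callable returning
    [(doc_id, raw_score)] (a hypothetical stand-in for a Solr core query).
    """
    with ThreadPoolExecutor(max_workers=max(1, len(searchers))) as pool:
        futures = {name: pool.submit(fn) for name, fn in searchers.items()}
        blended = []
        for name, future in futures.items():
            for doc_id, z in z_normalize(future.result()):
                blended.append((name, doc_id, z))
    return sorted(blended, key=lambda row: row[2], reverse=True)


blended = federate({
    "events": lambda: [("e1", 10.0), ("e2", 20.0), ("e3", 30.0)],
    "venues": lambda: [("v1", 0.1), ("v2", 0.3)],
})
```

Although the raw venue scores are two orders of magnitude smaller than the event scores, the best venue still ranks near the top after normalization, because it is judged relative to its own product's distribution.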
19. PostFilter to the rescue!
• PostFilters allow you to (as the name suggests) execute filtering
logic *after* the main query and all other filters have executed.
• Lucene filters + main query execute in parallel in a leap-frog
manner. Some filters (e.g. filter by distance to user) are
expensive to generate up front for all documents.
• You can create a delegating Collector that calls
“super.collect()” only if some condition is true.
• Since I am now effectively at the lowest level of Lucene
(Collector/Scorer), I can store distribution information about the
scores as they pass through the collector and custom scorer!
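To make the delegating-collector idea concrete, here is a small Python sketch of the pattern. In Solr the real pieces are Java (`PostFilter` and `DelegatingCollector`, with the score obtained from the Scorer); the classes below are hypothetical stand-ins that pass the score directly and use Welford's running algorithm to accumulate mean and standard deviation in one pass.

```python
import math


class ListCollector:
    """Stand-in for the downstream collector; just records hits."""

    def __init__(self):
        self.hits = []

    def collect(self, doc_id, score):
        self.hits.append((doc_id, score))


class StatsCollector:
    """Forwards accepted hits to a delegate while tracking score statistics."""

    def __init__(self, delegate, accept=lambda doc_id, score: True):
        self.delegate = delegate
        self.accept = accept
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def collect(self, doc_id, score):
        if not self.accept(doc_id, score):
            return  # filtered out: never reaches the delegate
        self.count += 1
        delta = score - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (score - self.mean)
        self.delegate.collect(doc_id, score)

    @property
    def stddev(self):
        return math.sqrt(self._m2 / self.count) if self.count else 0.0


sink = ListCollector()
stats = StatsCollector(sink, accept=lambda doc_id, score: doc_id != 99)
for doc_id, score in enumerate([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]):
    stats.collect(doc_id, score)
stats.collect(99, 1000.0)  # rejected by the accept predicate
```

The single-pass update matters here because a collector sees each hit exactly once and cannot afford to buffer every score.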
21. Federated Search – Victory!
• Now the federator, when executing the product specific searches, can extract
this information to produce a “normal” score.
• Results from different products can be blended based on how good individual
results are relative to their peers.
22. Ranking/Filtering using (highly) volatile data…
• Store data in field, re-index document
constantly with updated field value
• Atomic updates? Solr 4.0 feature
– I claim ignorance here: I don’t know the performance
impact or the usage details.
• Use functions/FunctionQuery + pseudo-fields
– Instead of indexed click field, use clk() function.
• Use PostFilter to support filtering of
documents based on this volatile data
23. Solr + External Data Store == Sweet!
• Log processing
• Jetty container: Solr functions pull volatile data
from EhCache
– Example: log(clk(EVENT,sequence_id))
• Separate thread updates EhCache from Hypertable
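The following Python sketch mimics what a `clk()` lookup feeding `log()` might compute. It is not the Zvents implementation: a plain dict stands in for EhCache, the `refresh` function stands in for the background thread that reads from Hypertable, and returning 0 for documents with no clicks is an assumption. Solr's `log()` function is base 10, which the sketch follows.

```python
import math

# Hypothetical in-memory stand-in for EhCache: (doc_type, sequence_id) -> clicks.
click_cache = {("EVENT", 42): 100, ("EVENT", 7): 0}


def refresh(cache, fresh_counts):
    """What a background thread might do after reading the external store."""
    cache.update(fresh_counts)


def clk(doc_type, sequence_id, cache=click_cache):
    """Look up the current click count; unknown documents count as 0."""
    return cache.get((doc_type, sequence_id), 0)


def click_boost(doc_type, sequence_id):
    """Equivalent of log(clk(...)), guarding against log(0) (an assumption)."""
    clicks = clk(doc_type, sequence_id)
    return math.log10(clicks) if clicks > 0 else 0.0


boost_before = click_boost("EVENT", 7)
refresh(click_cache, {("EVENT", 7): 1000})
boost_after = click_boost("EVENT", 7)
```

Because the value is read at query time, the ranking reflects new click counts as soon as the cache is refreshed, with no re-indexing.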
24. Filtering events based on ticket availability
• Example: &fq={!ticket_filter idField=id}
• Ticket availability publisher publishes ticket
information via AMQP
• Jetty: EHCache stores {Event_id=>ticket_count}
• Solr PostFilter, for each document id:
1) Fetch ticket information.
2) Filter out document if ticket_count == 0
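The two filtering steps can be sketched as follows. This is illustrative Python only (the real filter is a Solr PostFilter in Java); the function name, the sample ids, and the choice to keep events that are missing from the cache are assumptions.

```python
def ticket_filter(doc_ids, ticket_counts):
    """Drop events whose cached ticket_count is 0.

    ticket_counts is a stand-in for the cache: event_id -> ticket_count.
    Events absent from the cache are kept (an assumption, since only a
    known count of 0 is grounds for removal).
    """
    kept = []
    for doc_id in doc_ids:
        count = ticket_counts.get(doc_id)  # 1) fetch ticket information
        if count == 0:                     # 2) filter out sold-out events
            continue
        kept.append(doc_id)
    return kept


# Hypothetical cache contents and candidate documents.
cache = {1245: 12, 5678: 0}
visible = ticket_filter([1245, 5678, 9999], cache)
```

Running this check as a post filter means the cache is consulted only for documents that already matched the main query and the cheaper filters.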
26. Production Environment
• Java 1.7
• Quad Core 2.8 GHz
• 10 GB RAM
– 8GB dedicated to JVM heap.
• All provisioned as VMs on VMware ESX servers.
– Significantly simplifies cluster growth. Simply add
servers and go!
• 10 Slaves, 2 Masters
– From a configuration standpoint, masters == slaves, except
masters have a 4GB JVM heap instead of 8GB.
27. Solr Project Configuration
• Maven based
– Treat Solr as dependency *not* as application.
• Other dependencies specified in POM,
bundled into war during assembly phase.
• Build tarball that is pushed to Nexus
– Tarball contains configuration scripts + Jetty jar
etc.
• Bundle Jetty with the app for all in one
deployment.
28. Advantages of using Maven
• Solr version upgrades as simple as increasing
dependency version in pom.xml.
– Of course run tests before deploy!
• All dependencies managed by pom.xml and
bundled into deployment artifact
– No management of classpath via solrconfig.xml
• Take advantage of standard release
management practices. Everything self
contained.
29. Deployment via Capistrano
• Capistrano: framework/utility for executing
commands in parallel via SSH on multiple
servers
(https://github.com/capistrano/capistrano)
• Capistrano-Nexus gem: Zvents-built gem to
deploy a tarball hosted on a Nexus server out
to staging/production.
30. Examples
• Staging/Development Deploy:
– mvn deploy
– RELEASE="2.10-SNAPSHOT" cap staging deploy
• Production Deploy:
– mvn release:prepare
– mvn release:perform
– RELEASE="2.10" cap production deploy