Jazzed about Solr: People as a Search Problem - By Joshua Tuberville

See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Search-oriented architectures are an obvious approach for web pages, emails, documents, and other
text-based entities. With traditional structured data, text searching is often “added on” to the
traditional Boolean queries in relational stores. When Jazzed was initiated, we wanted search to be
front and center. When we evaluated Solr, we realized we could take the opposite approach and “add on”
Boolean components to textual searches. This hybrid query approach makes transitioning to flexible
ranking easy and straightforward.

Presentation Transcript

  • Jazzed About Solr: People as a Search Problem • Thursday, May 26, 2011
  • About Me • Building websites since 1996, Java since 1997 • Prior web search experience • Building and scaling eHarmony products since 2002
  • What is Jazzed • Subscription Based Dating Site • Incubated by eHarmony
  • What is Jazzed • Create a profile • Search for others • View their photos • Privately Communicate
  • How is it different? • Covers broader range of relationships • Easy to get started • Real profiles screened by machine and humans • Fast, effective search-oriented tools
  • Jazzed Stats • Started Fall 2009 • Beta Summer 2010 • Launched October 2010 • 100,000s of Profiles • 1,000s of Searches Daily
  • Jazzed Architecture • Event-driven SOA • REST, JSON, EIP, Not-only-SQL • Technology incubation
  • Tech Stack • Java 6, Spring 3, Jersey 1.1, JMS (AMQP) • RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS
  • Not Covered • Distributed Search • Caching Strategies • Data Import • Analyzers/Tokenizers
  • Why Lucene? • Proven Solid IR library • Prefer Open Source Solutions • Not Only SQL • Flexible Ranking • Pluggable
  • Why Solr • Performant, Extensible, RESTful Service • Configuration, Schema, Multicores • Admin Interface • Replication, Backups, Monitoring
  • Open Source • Strengthens Engineering Team • Be a part of a great community • Not Brochure-ware
  • Not Only SQL • One solution does not fit all • Prefer availability over consistency • Horizontal Scaling over Vertical
  • Flexible Ranking • Query Strategies • Boolean Algebra • Vector Space Analysis • Hybrids • Extensive Function Support • Index and Query Boosting
  • ...Oh My! • Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis • Full Text with Highlighted Results • Client agnostic
  • Inevitable Question • “Does it scale?” • Solr POC Benchmark • 10 Million profiles • >200 queries/sec, under 100 ms at the 90th percentile • Default tuning until 5 million profiles
  • Profile Service • RESTful Hybrid Data Service • Public, Private, Attributes • Event Producer
  • Profiles • Mostly structured • Categories - Eye Color, Desired Ethnicity • Dates - Birthdate • Numbers - Coordinates, Age Range • Text - Name, Headline
  • Inverting People • Stored as an inverted index • Index randomly accessed by term
      Term         Document
      MALE         1, 3, 5, 7, 9
      FEMALE       2, 4, 6, 8, 10
      HAIR_RED     8
      HAIR_BLOND   1, 2, 5, 6
      EYE_BLUE     1, 2, 3, 10
      EYE_BROWN    4, 5, 6, 7, 8, 9
      fun          1, 3, 7, 9
      funny        2, 4, 6, 10
      beach        1, 2, 3, 4, 5, 6, 7, 8
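
The inversion on this slide can be sketched in a few lines of Python. This is only an illustration of the data structure, not how Lucene stores its index; the `PROFILES` contents are reverse-engineered from the slide's term table, and the helper names `build_inverted_index` and `match_all` are hypothetical.

```python
from collections import defaultdict

# Hypothetical per-profile terms, chosen so that inverting them
# reproduces the term table shown on the slide.
PROFILES = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE", "fun", "beach"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE", "funny", "beach"],
    3: ["MALE", "EYE_BLUE", "fun", "beach"],
    4: ["FEMALE", "EYE_BROWN", "funny", "beach"],
    5: ["MALE", "HAIR_BLOND", "EYE_BROWN", "beach"],
    6: ["FEMALE", "HAIR_BLOND", "EYE_BROWN", "funny", "beach"],
    7: ["MALE", "EYE_BROWN", "fun", "beach"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN", "beach"],
    9: ["MALE", "EYE_BROWN", "fun"],
    10: ["FEMALE", "EYE_BLUE", "funny"],
}


def build_inverted_index(profiles):
    """Map each term to the sorted list of profile ids that contain it."""
    index = defaultdict(list)
    for doc_id in sorted(profiles):
        for term in profiles[doc_id]:
            index[term].append(doc_id)
    return dict(index)


def match_all(index, *terms):
    """Return ids of profiles containing every term (a Boolean AND)."""
    postings = [set(index.get(term, [])) for term in terms]
    return sorted(set.intersection(*postings)) if postings else []


index = build_inverted_index(PROFILES)
print(index["HAIR_RED"])                       # [8]
print(match_all(index, "FEMALE", "EYE_BLUE"))  # [2, 10]
```

Looking up a term is a single dictionary access, which is the "accessed by term" property the slide refers to.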
  • Schema Design • Single “Table” • One-to-many = multi-value fields • Individual vs Composite Fields • copyField and have both!
  • Field considerations • Stored or not • Indexed or not • Multivalued - desires fields • Type
  • Solr Types Used (the ‘t’ is for Trie) • tdate, tint, tfloat* - birthdate, loginAt • text - all text • string - id, non-indexed text • random - good for random sorts • enum - for all enumerations
  • Data Duplication • By function - numberPhotos & hasPhotos • By relationship - hiddenBy & hidden • By analysis - name & text
  • Saving Profiles • Updating is an in-memory operation • No partial updates • Commit means flush index changes • Autocommit on maxDocs, maxTime or both
  • Why Also Voldemort • Private profiles cannot be stale • Many fields not searchable or viewable by others • Isolate queries from fetch by id
  • Querying • Superset of Lucene • Efficient Range Queries • Multiple Query Handlers • Dismax, Boost, Geo
  • Recall vs Precision • Focus on recall when corpus is small • Precision once it is at critical mass
  • Boolean Queries • Default operator set to AND • +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED, HAIR_BLONDE) • Sort order is important
  • Hybrid Queries • Default operator set to OR • +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED, HAIR_BLONDE)
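
As a rough sketch of the hybrid approach, the query string can be assembled from two lists: required clauses (prefixed with `+`) that act as Boolean filters, and optional clauses that only influence ranking. The helper names `term_clause` and `hybrid_query` are hypothetical, and the grouped values are separated by whitespace here, which is an assumption that differs from the comma shown on the slide. `q` and `q.op` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def term_clause(field, values):
    """Render field:value, or field:(v1 v2) when several values are allowed."""
    values = list(values)
    if len(values) == 1:
        return f"{field}:{values[0]}"
    return f"{field}:({' '.join(values)})"


def hybrid_query(required, optional):
    """Required clauses get '+', optional clauses only affect ranking."""
    must = [f"+{term_clause(field, values)}" for field, values in required]
    should = [term_clause(field, values) for field, values in optional]
    return " ".join(must + should)


query = hybrid_query(
    required=[("gender", ["FEMALE"]), ("seeking", ["MALE"])],
    optional=[("eyeColor", ["EYE_BLUE"]),
              ("hairColor", ["HAIR_RED", "HAIR_BLOND"])],
)
print(query)

# Request parameters with the default operator set to OR, as on the slide.
print(urlencode({"q": query, "q.op": "OR"}))
```

Moving a clause from `optional` to `required` turns a ranking preference into a hard filter, which is what makes it easy to trade recall against precision.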
  • Why you’re lucky if you like redheads • Inverse Document Frequency (IDF) • Rarer is favored over more common • More fields matched = higher ranking • Resulting order: 1. Blue eyed, redheads 2. Blue eyed, blonds 3. Redheads 4. Blonds
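
The ordering on this slide can be reproduced with a toy calculation. The sketch below uses the idf formula from Lucene's classic default similarity, `1 + ln(numDocs / (docFreq + 1))`, and the document frequencies from the ten-profile example earlier in the talk. The score is deliberately simplified to a sum of idf values for the optional terms matched; real Lucene scoring also involves term frequency, norms, and other factors, and the candidate labels are hypothetical.

```python
import math

NUM_DOCS = 10
# Document frequencies taken from the ten-profile example table.
DOC_FREQ = {"EYE_BLUE": 4, "HAIR_RED": 1, "HAIR_BLOND": 4}


def idf(term):
    """Classic Lucene idf: rarer terms get a larger weight."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1))


def toy_score(matched_terms):
    """Simplified score: sum of idf over the optional terms matched."""
    return sum(idf(term) for term in matched_terms)


# Hypothetical candidates, described by the optional terms they match.
candidates = {
    "blond": ["HAIR_BLOND"],
    "blue-eyed blond": ["EYE_BLUE", "HAIR_BLOND"],
    "redhead": ["HAIR_RED"],
    "blue-eyed redhead": ["EYE_BLUE", "HAIR_RED"],
}

ranking = sorted(candidates,
                 key=lambda name: toy_score(candidates[name]),
                 reverse=True)
for name in ranking:
    print(f"{name}: {toy_score(candidates[name]):.3f}")
```

Because only one profile in ten is a redhead, `HAIR_RED` carries more weight than `HAIR_BLOND`, and matching two optional fields beats matching one.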
  • Boosting • Query time by importance • eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND
  • Filter Fields • Useful for roles and other lists • -hidden:(2 4 6) • -hiddenBy:1
      id   hidden
      1    2, 4, 6
      2    1

      id   hiddenBy
      1    2
      2    1
      4    1
      6    1
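
A small sketch of the `hiddenBy` variant, under the assumption that each profile document carries a multi-valued `hiddenBy` field listing the ids of users who have hidden it. The helper names, and profiles 3 and 5, are hypothetical additions for illustration; the Python filter only mimics what the negative clause would do inside Solr.

```python
# Toy documents mirroring the slide's hiddenBy table; profiles 3 and 5
# are hypothetical additions that nobody has hidden.
DOCS = [
    {"id": 1, "hiddenBy": [2]},
    {"id": 2, "hiddenBy": [1]},
    {"id": 3, "hiddenBy": []},
    {"id": 4, "hiddenBy": [1]},
    {"id": 5, "hiddenBy": []},
    {"id": 6, "hiddenBy": [1]},
]


def hidden_by_clause(viewer_id):
    """Negative clause excluding every profile the viewer has hidden."""
    return f"-hiddenBy:{viewer_id}"


def visible_ids(docs, viewer_id):
    """In-memory equivalent of applying the negative clause."""
    return [doc["id"] for doc in docs if viewer_id not in doc["hiddenBy"]]


print(hidden_by_clause(1))      # -hiddenBy:1
print(visible_ids(DOCS, 1))     # [1, 3, 5]
```

The clause stays the same size no matter how many profiles the viewer has hidden, at the cost of storing the relationship on the hidden profile rather than on the viewer.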
  • Date Math • Simplifies query preprocessing • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR] (between 25 and 35 years old)
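
The off-by-one in the lower bound (36 rather than 35) is easy to miss, so here is a minimal sketch of how an age range could be turned into the clause above. The function names are hypothetical; `age_on` is plain Python used only to check the two boundary birthdates for the talk's date.

```python
from datetime import date


def age_range_clause(min_age, max_age, field="birthDate"):
    """Build a Solr date-math range for people aged min_age..max_age."""
    # Oldest allowed: turns max_age + 1 tomorrow, so is still max_age today.
    oldest = f"NOW/DAY+1DAY-{max_age + 1}YEAR"
    # Youngest allowed: turned min_age today.
    youngest = f"NOW/DAY-{min_age}YEAR"
    return f"+{field}:[{oldest} TO {youngest}]"


def age_on(birth, today):
    """Age in whole years on a given day."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)


print(age_range_clause(25, 35))

today = date(2011, 5, 26)
print(age_on(date(1975, 5, 27), today))  # lower bound of the range: 35
print(age_on(date(1986, 5, 26), today))  # upper bound of the range: 25
```

Since Solr evaluates `NOW` at query time, the application never has to compute these dates itself, which is the preprocessing the slide says is simplified.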
  • Distance Searching • lat, lon, distance • SolrLocal by Patrick O’Leary • Additional overhead ~90ms per query • Superseded in Solr 3.1
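
To make the three inputs (lat, lon, distance) concrete, the sketch below filters profiles by great-circle distance using the haversine formula. This is a swapped-in, in-memory illustration and not how SolrLocal or Solr 3.1 implement spatial search; the coordinates and the names `haversine_km` and `within` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(profiles, lat, lon, distance_km):
    """Ids of profiles no farther than distance_km from (lat, lon)."""
    return [p["id"] for p in profiles
            if haversine_km(lat, lon, p["lat"], p["lon"]) <= distance_km]


# Hypothetical profiles with approximate city coordinates.
PROFILES = [
    {"id": 1, "lat": 34.05, "lon": -118.24},  # Los Angeles
    {"id": 2, "lat": 34.15, "lon": -118.14},  # Pasadena
    {"id": 3, "lat": 32.72, "lon": -117.16},  # San Diego
]

print(within(PROFILES, 34.05, -118.24, 50))
```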
  • Testing Queries • Log queries and ids returned • Version your search strategies • Improve one thing at a time
  • Geo Service • Read-mostly service • Fields - Postal Code, Country, State, Cities, Lat, Lon • Usage - Registration Validation, City Selection
  • Operations • Servlet container and filesystem • Jetty 6, 64-bit Java 6 JVM • 8G Heap -XX:+UseCompressedOops
  • Operations • Active/Passive • Layer 7 Load balancing • Nightly snapshots • Eventually SolrCloud
  • Multicore • Run multiple schemas on the same • Hot swappable for backwards compatible changes • private / public profiles
  • Security • No security provided • At minimum secure your UpdateHandler, e.g. against <delete><query>*:*</query></delete> • Separate Cores
  • Future • Solr 3.1 • Mutual Matching • Faceting / Guided Search • Incorporating spelling • Hierarchies, categories, better ranking models
  • Faceting • Returns counts with query results • Efficient • Guides the user toward precision
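
As a sketch of what "counts with query results" means, the snippet below builds request parameters using Solr's standard `facet` and `facet.field` parameters, then computes the same kind of counts in memory for a toy result set. The field names come from the earlier slides; the result documents are hypothetical (the five FEMALE profiles from the inverted index example), and `facet_counts` only imitates what Solr would return.

```python
from collections import Counter
from urllib.parse import urlencode

# Standard Solr faceting parameters; facet.field may be repeated.
params = [
    ("q", "+gender:FEMALE"),
    ("facet", "true"),
    ("facet.field", "eyeColor"),
    ("facet.field", "hairColor"),
]
print(urlencode(params))

# Hypothetical result set: the FEMALE profiles from the example index.
results = [
    {"id": 2, "eyeColor": "EYE_BLUE"},
    {"id": 4, "eyeColor": "EYE_BROWN"},
    {"id": 6, "eyeColor": "EYE_BROWN"},
    {"id": 8, "eyeColor": "EYE_BROWN"},
    {"id": 10, "eyeColor": "EYE_BLUE"},
]


def facet_counts(docs, field):
    """Count how many matching documents carry each value of a field."""
    return Counter(doc[field] for doc in docs if field in doc)


print(facet_counts(results, "eyeColor").most_common())
```

Showing these counts next to the results lets the user see how many profiles each additional filter would leave before applying it.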
  • Thank you • jtuberville@eharmony.com • Twitter: @jtuberville