
Jazzed about Solr: People as a Search Problem - By Joshua Tuberville


See conference video - http://www.lucidimagination.com/devzone/events/conferences/revolution/2011

Search-oriented architectures are an obvious fit for web pages, emails, documents, and other
text-based entities. With traditional structured data, text searching is often “added on” to the
Boolean queries of a relational store. When Jazzed was initiated, we wanted search to be
front and center. When we evaluated Solr, we realized we could take the opposite approach and
“add on” Boolean components to textual searches. This hybrid query approach makes the transition
to flexible ranking straightforward.


Transcript of "Jazzed about Solr: People as a Search Problem - By Joshua Tuberville"

  1. Jazzed about Solr: People as a Search Problem (Thursday, May 26, 2011)
  2. About Me
     • Building websites since 1996, Java since 1997
     • Prior web search experience
     • Building and scaling eHarmony products since 2002
  3. What is Jazzed
     • Subscription Based Dating Site
     • Incubated by eHarmony
  4. What is Jazzed
     • Create a profile
     • Search for others
     • View their photos
     • Privately Communicate
  8. How is it different?
     • Covers a broader range of relationships
     • Easy to get started
     • Real profiles screened by machine and humans
     • Fast, effective search-oriented tools
  9. Jazzed Stats
     • Started Fall 2009
     • Beta Summer 2010
     • Launched October 2010
     • 100,000s of Profiles
     • 1,000s of Searches Daily
  10. Jazzed Architecture
     • Event-driven SOA
     • REST, JSON, EIP, Not-only-SQL
     • Technology incubation
  11. Tech Stack
     • Java 6, Spring 3, Jersey 1.1, JMS (AMQP)
     • RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS
  14. Not Covered
     • Distributed Search
     • Caching Strategies
     • Data Import
     • Analyzers/Tokenizers
  15. Why Lucene?
     • Proven Solid IR library
     • Prefer Open Source Solutions
     • Not Only SQL
     • Flexible Ranking
     • Pluggable
  16. Why Solr
     • Performant, Extensible, RESTful Service
     • Configuration, Schema, Multicores
     • Admin Interface
     • Replication, Backups, Monitoring
  17. Open Source
     • Strengthens Engineering Team
     • Be a part of a great community
     • Not Brochure-ware
  18. Not Only SQL
     • One solution does not fit all
     • Prefer availability over consistency
     • Horizontal Scaling over Vertical
  19. Flexible Ranking
     • Query Strategies
       • Boolean Algebra
       • Vector Space Analysis
       • Hybrids
     • Extensive Function Support
     • Index and Query Boosting
  20. ...Oh My!
     • Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis
     • Full Text with Highlighted Results
     • Client agnostic
  21. Inevitable Question
     • “Does it scale?”
     • Solr POC Benchmark
       • 10 Million profiles
       • >200 queries/sec, under 100ms at the 90th percentile
       • Default tuning until 5 million profiles
  22. Profile Service
     • RESTful Hybrid Data Service
     • Public, Private, Attributes
     • Event Producer
  23. Profiles
     • Mostly structured
     • Categories - Eye Color, Desired Ethnicity
     • Dates - Birthdate
     • Numbers - Coordinates, Age Range
     • Text - Name, Headline
  24. Inverting People
     • Stored as an inverted index
     • Index random accessed by term

     Term         Document
     MALE         1, 3, 5, 7, 9
     FEMALE       2, 4, 6, 8, 10
     HAIR_RED     8
     HAIR_BLOND   1, 2, 5, 6
     EYE_BLUE     1, 2, 3, 10
     EYE_BROWN    4, 5, 6, 7, 8, 9
     fun          1, 3, 7, 9
     funny        2, 4, 6, 10
     beach        1, 2, 3, 4, 5, 6, 7, 8
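
The inversion on slide 24 can be sketched in a few lines of Python. The deck itself shows no code, so this is only an illustration: the three sample profiles are consistent with the slide's table but are not Jazzed data, and `build_inverted_index` is a hypothetical helper, not Lucene's implementation.

```python
from collections import defaultdict

# Hypothetical profiles keyed by document id; the terms mirror the slide's vocabulary.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE", "fun", "beach"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE", "funny", "beach"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN", "beach"],
}

def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_inverted_index(profiles)
# Looking up a term is one dictionary access rather than a scan of all profiles.
blue_eyed_women = set(index["FEMALE"]) & set(index["EYE_BLUE"])
```

Intersecting two posting lists, as in the last line, is the basic operation behind a query with two required terms.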
  25. Schema Design
     • Single “Table”
     • One-to-many = multi-value fields
     • Individual vs Composite Fields
     • copyField and have both!
  26. Field considerations
     • Stored or not
     • Indexed or not
     • Multivalued - desires fields
     • Type
  27. Solr Types Used
     • tdate, tint, tfloat* - birthdate, loginAt (the ‘t’ is for Trie)
     • text - all text
     • string - id, non indexed text
     • random - good for random sorts
     • enum - for all enumerations
  28. Data Duplication
     • By function - numberPhotos & hasPhotos
     • By relationship - hiddenBy & hidden
     • By analysis - name & text
  29. Saving Profiles
     • Updating is an in-memory operation
     • No partial updates
     • Commit means flush index changes
     • Autocommit on maxDocs, maxTime or both
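
Because there are no partial updates, every save has to resend the whole document. The sketch below builds a Solr XML `<add>` message for one profile in Python. It is only an illustration under stated assumptions: the profile contents and the `profile_to_add_xml` helper are hypothetical, and `hasPhotos` is derived from `numberPhotos` to show the "by function" duplication from slide 28.

```python
import xml.etree.ElementTree as ET

def profile_to_add_xml(profile):
    """Build a Solr XML <add> message that carries every field of the profile."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in profile.items():
        # A multi-valued field is sent as one <field> element per value.
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(v).lower() if isinstance(v, bool) else str(v)
    return ET.tostring(add, encoding="unicode")

# Hypothetical profile; field names other than numberPhotos/hasPhotos are assumptions.
profile = {"id": 1, "gender": "MALE", "seeking": ["FEMALE"], "numberPhotos": 3}
# Duplicate "by function": derive hasPhotos from numberPhotos before indexing.
profile["hasPhotos"] = profile["numberPhotos"] > 0
message = profile_to_add_xml(profile)
```

The message would be posted to the core's update handler. The change becomes searchable only after a commit, whether explicit or triggered by autocommit.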
  30. Why Also Voldemort
     • Private profiles cannot be stale
     • Many fields not searchable or viewable by others
     • Isolate queries from fetch by id
  31. Querying
     • Superset of Lucene
     • Efficient Range Queries
     • Multiple Query Handlers
     • Dismax, Boost, Geo
  32. Recall vs Precision
     • Focus on recall when corpus is small
     • Precision once it is at critical mass
  33. Boolean Queries
     • Default operator set to AND
     • +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED, HAIR_BLONDE)
     • Sort order is important
  34. Hybrid Queries
     • Default operator set to OR
     • +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED, HAIR_BLONDE)
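
The difference between slides 33 and 34 is that the `+` clauses decide who is returned, while the unprefixed clauses only influence the order. A minimal Python sketch of that behaviour follows. The sample profiles and the `hybrid_search` helper are hypothetical, and the score is a plain count of matched optional clauses rather than Lucene's real scoring.

```python
def hybrid_search(profiles, required, optional):
    """Keep profiles matching every required clause; rank by optional matches."""
    results = []
    for doc_id, fields in profiles.items():
        if all(fields.get(f) == v for f, v in required):
            score = sum(1 for f, values in optional if fields.get(f) in values)
            results.append((score, doc_id))
    # Highest score first; ties broken by id so the order is deterministic.
    return [doc_id for score, doc_id in sorted(results, key=lambda r: (-r[0], r[1]))]

# Hypothetical profiles for illustration only.
profiles = {
    1: {"gender": "FEMALE", "seeking": "MALE", "eyeColor": "EYE_BROWN", "hairColor": "HAIR_BLACK"},
    2: {"gender": "FEMALE", "seeking": "MALE", "eyeColor": "EYE_BLUE", "hairColor": "HAIR_RED"},
    3: {"gender": "MALE", "seeking": "FEMALE", "eyeColor": "EYE_BLUE", "hairColor": "HAIR_RED"},
    4: {"gender": "FEMALE", "seeking": "MALE", "eyeColor": "EYE_BLUE", "hairColor": "HAIR_BROWN"},
}
required = [("gender", "FEMALE"), ("seeking", "MALE")]
optional = [("eyeColor", {"EYE_BLUE"}), ("hairColor", {"HAIR_RED", "HAIR_BLONDE"})]
ranked = hybrid_search(profiles, required, optional)
```

Profile 1 matches none of the preferences but is still returned, last. If every clause were required, as in the Boolean query, only profile 2 would come back. That is the recall-versus-precision trade-off from slide 32.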
  35. Why you’re lucky if you like redheads
     • Inverse Document Frequency (IDF)
     • Rarer is favored over more common
     • More fields matched = higher ranking

     Rank order:
     1. Blue eyed, redheads
     2. Blue eyed, blonds
     3. Redheads
     4. Blonds
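
The rank order on slide 35 can be reproduced with a deliberately simplified score. The sketch assumes the classic Lucene IDF formula, 1 + ln(N / (df + 1)), and takes document frequencies from the slide-24 index (10 profiles, one redhead, four blonds, four with blue eyes). Real Lucene scoring also involves term frequency, norms and coordination factors, which are left out here. The candidate profiles and helper names are hypothetical.

```python
import math

NUM_DOCS = 10
# Document frequencies taken from the slide-24 index.
DOC_FREQ = {"HAIR_RED": 1, "HAIR_BLOND": 4, "EYE_BLUE": 4}

def idf(term):
    """Classic Lucene-style IDF: rarer terms get a larger weight."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1))

def score(profile_terms, query_terms):
    """Simplified score: sum the IDF of every query term the profile matches."""
    return sum(idf(t) for t in query_terms if t in profile_terms)

query = ["EYE_BLUE", "HAIR_RED", "HAIR_BLOND"]
# Hypothetical candidates, one per row of the slide's ranking.
candidates = {
    "blue-eyed redhead": {"EYE_BLUE", "HAIR_RED"},
    "blue-eyed blond": {"EYE_BLUE", "HAIR_BLOND"},
    "redhead": {"HAIR_RED"},
    "blond": {"HAIR_BLOND"},
}
ranking = sorted(candidates, key=lambda name: score(candidates[name], query), reverse=True)
```

Under these assumptions, matching two fields beats matching one, and among single matches the rarer hair color wins.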
  36. Boosting
     • Query time by importance
       • eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND
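
Query-time boosts are just a `^n` suffix on a clause, so they are easy to generate from a list of user preferences. A small Python sketch follows; the `clause` and `build_preferences` helpers are hypothetical names for illustration, not part of any Solr client.

```python
def clause(field, value, boost=None):
    """Render one field:value clause, appending ^boost when one is given."""
    text = f"{field}:{value}"
    return f"{text}^{boost}" if boost is not None else text

def build_preferences(preferences):
    """Join optional clauses with spaces; each entry is (field, value, boost)."""
    return " ".join(clause(f, v, b) for f, v, b in preferences)

query = build_preferences([
    ("eyeColor", "EYE_BLUE", 2),        # weighted more heavily than hair color
    ("hairColor", "HAIR_BLOND", None),  # default weight
])
```

Here `query` is the same string shown on the slide.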
  38. Filter Fields
     • Useful for roles and other lists
     • -hidden:(2 4 6)

       id   hidden
       1    2, 4, 6
       2    1

     • -hiddenBy:1

       id   hiddenBy
       1    2
       2    1
       4    1
       6    1
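
The two tables on slide 38 hold the same relationship from opposite directions, which is the "by relationship" duplication from slide 28. Reading `hidden` as "profiles this user has hidden" (an interpretation of the slide, not something it states), the `hiddenBy` table can be derived as sketched below in Python. Both helper names are hypothetical.

```python
from collections import defaultdict

def invert_hidden(hidden):
    """Turn {user: [profiles they hid]} into {profile: [users who hid it]}."""
    hidden_by = defaultdict(list)
    for user_id, hidden_ids in sorted(hidden.items()):
        for profile_id in hidden_ids:
            hidden_by[profile_id].append(user_id)
    return dict(hidden_by)

def exclusion_clause(searcher_id):
    """Clause that drops every profile the searcher has hidden."""
    return f"-hiddenBy:{searcher_id}"

hidden = {1: [2, 4, 6], 2: [1]}   # the slide's first table
hidden_by = invert_hidden(hidden)  # the slide's second table
```

With `hiddenBy` indexed on each profile, the exclusion clause stays the same length no matter how many profiles the searcher has hidden.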
  40. Date Math
     • Simplifies query preprocessing
     • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]
       Between 25 and 35 years old
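
The off-by-one in the lower bound (36 years for a maximum age of 35) is easy to get wrong, so it helps to generate the clause from the age range. A Python sketch follows, with hypothetical helper names; `age_on` is included only to check the boundary dates by hand.

```python
from datetime import date

def age_range_clause(min_age, max_age):
    """Build the birthDate range clause for ages min_age..max_age inclusive."""
    # Lower bound: one day after the birth date of someone turning max_age + 1 today.
    # Upper bound: the birth date of someone turning min_age today.
    return f"+birthDate:[NOW/DAY+1DAY-{max_age + 1}YEAR TO NOW/DAY-{min_age}YEAR]"

def age_on(birth, today):
    """Whole years between birth and today, for checking the range by hand."""
    had_birthday = (today.month, today.day) >= (birth.month, birth.day)
    return today.year - birth.year - (0 if had_birthday else 1)

clause = age_range_clause(25, 35)
```

For a search run on May 26, 2011, the bounds resolve to birth dates of May 27, 1975 (still 35) and May 26, 1986 (just turned 25).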
  41. Distance Searching
     • lat, lon, distance
     • SolrLocal by Patrick O’Leary
     • Additional overhead ~90ms per query
     • Superseded in Solr 3.1
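
A lat, lon, distance search boils down to a radius test around the searcher. The deck does not say how SolrLocal computes this, so the Python sketch below uses the standard haversine formula purely as an illustration. The coordinates, the `within` helper and the kilometre unit are assumptions.

```python
import math

EARTH_RADIUS_KM = 6371.0

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def within(profiles, lat, lon, distance_km):
    """Ids of profiles whose coordinates fall inside the search radius."""
    return [pid for pid, (plat, plon) in profiles.items()
            if haversine_km(lat, lon, plat, plon) <= distance_km]

# Hypothetical coordinates: Los Angeles, Santa Monica, San Francisco.
profiles = {1: (34.05, -118.24), 2: (34.02, -118.49), 3: (37.77, -122.42)}
nearby = within(profiles, 34.05, -118.24, 50)
```

Computing the distance for every candidate is what makes this kind of filter expensive, which fits the per-query overhead the slide reports.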
  42. Testing Queries
     • Log queries and ids returned
     • Version your search strategies
     • Improve one thing at a time
  43. Geo Service
     • Read-mostly service
     • Fields - Postal Code, Country, State, Cities, Lat, Lon
     • Usage - Registration Validation, City Selection
  44. Operations
     • Servlet container and filesystem
     • Jetty 6, 64-bit Java 6 JVM
     • 8G Heap -XX:+UseCompressedOops
  45. Operations
     • Active/Passive
     • Layer 7 Load balancing
     • Nightly snapshots
     • Eventually SolrCloud
  46. Multicore
     • Run multiple schemas on the same
     • Hot swappable for backwards compatible changes
     • private / public profiles
  47. Security
     • No security provided
     • At minimum secure your UpdateHandler
       <delete><query>*:*</query></delete>
     • Separate Cores
  48. Future
     • Solr 3.1
     • Mutual Matching
     • Faceting / Guided Search
     • Incorporating spelling
     • Hierarchies, categories, better ranking models
  49. Faceting
     • Returns counts with query results
     • Efficient
     • Guides the user toward precision
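
Conceptually, a facet is a count of field values over the profiles that matched the query. The Python sketch below shows the idea with hypothetical data and a hypothetical `facet_counts` helper. In Solr itself the counts are requested with parameters such as `facet=true` and `facet.field=eyeColor`, and Solr computes them from the index rather than by scanning documents.

```python
from collections import Counter

def facet_counts(profiles, matching_ids, field):
    """Count how many matching profiles carry each value of the facet field."""
    return Counter(profiles[pid][field] for pid in matching_ids)

# Hypothetical profiles for illustration only.
profiles = {
    1: {"eyeColor": "EYE_BLUE"},
    2: {"eyeColor": "EYE_BLUE"},
    3: {"eyeColor": "EYE_BROWN"},
    4: {"eyeColor": "EYE_GREEN"},
}
# Pretend the query matched profiles 1, 2 and 3.
counts = facet_counts(profiles, [1, 2, 3], "eyeColor")
```

Showing these counts next to the results tells the user how many profiles each extra filter would leave, which is how faceting steers a broad, recall-oriented search toward precision.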
  50. Thank you
     jtuberville@eharmony.com
     Twitter: @jtuberville