Jazzed about Solr - People as A Search Problem

Transcript of "Jazzed about Solr - People as A Search Problem"

  1. About Solr: People as A Search Problem (Thursday, May 26, 2011)
  2. About Me
     • Building websites since 1996, Java since 1997
     • Prior web search experience
     • Building and scaling eHarmony products since 2002
  3. What is Jazzed
     • Subscription Based Dating Site
     • Incubated by eHarmony
  4. What is Jazzed
     • Create a profile
     • Search for others
     • View their photos
     • Privately Communicate
  8. How is it different?
     • Covers broader range of relationships
     • Easy to get started
     • Real profiles screened by machine and humans
     • Fast, effective search oriented tools
  9. Jazzed Stats
     • Started Fall 2009
     • Beta Summer 2010
     • Launched October 2010
     • 100,000s of Profiles
     • 1,000s of Searches Daily
  10. Jazzed Architecture
      • Event-driven SOA
      • REST, JSON, EIP, Not-only-SQL
      • Technology incubation
  11. Tech Stack
      • Java 6, Spring 3, Jersey 1.1, JMS (AMQP)
      • RHEL 4, Oracle 11g, Voldemort 0.81, Solr 1.4.1, NFS
  14. Not Covered
      • Distributed Search
      • Caching Strategies
      • Data Import
      • Analyzers/Tokenizers
  15. Why Lucene?
      • Proven Solid IR library
      • Prefer Open Source Solutions
      • Not Only SQL
      • Flexible Ranking
      • Pluggable
  16. Why Solr
      • Performant, Extensible, RESTful Service
      • Configuration, Schema, Multicores
      • Admin Interface
      • Replication, Backups, Monitoring
  17. Open Source
      • Strengthens Engineering Team
      • Be a part of a great community
      • Not Brochure-ware
  18. Not Only SQL
      • One solution does not fit all
      • Prefer availability over consistency
      • Horizontal Scaling over Vertical
  19. Flexible Ranking
      • Query Strategies
        • Boolean Algebra
        • Vector Space Analysis
        • Hybrids
      • Extensive Function Support
      • Index and Query Boosting
  20. ...Oh My!
      • Standard Plugins - Geospatial*, Faceting, Spelling, MoreLikeThis
      • Full Text with Highlighted Results
      • Client agnostic
  21. Inevitable Question
      • “Does it scale?”
      • Solr POC Benchmark
        • 10 million profiles
        • >200 queries/sec, with the 90th percentile under 100 ms
        • Default tuning until 5 million profiles
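A latency target stated as a 90th percentile can be checked against logged query times. The sketch below uses the nearest-rank definition of a percentile; the `percentile` helper and the sample latencies are illustrative assumptions, not figures or code from the benchmark.

```python
def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100 gives the 1-based rank.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical query latencies in milliseconds (made up for illustration).
latencies_ms = [12, 18, 25, 31, 40, 47, 55, 63, 88, 240]
p90 = percentile(latencies_ms, 90)
meets_target = p90 < 100
```

One slow outlier (240 ms here) does not break a 90th percentile target, which is why the percentile is usually a better benchmark statistic than the maximum.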
  22. Profile Service
      • RESTful Hybrid Data Service
      • Public, Private, Attributes
      • Event Producer
  23. Profiles
      • Mostly structured
      • Categories - Eye Color, Desired Ethnicity
      • Dates - Birthdate
      • Numbers - Coordinates, Age Range
      • Text - Name, Headline
  24. Inverting People
      • Stored as an inverted index
      • Index is randomly accessed by term

      Term         Document
      MALE         1, 3, 5, 7, 9
      FEMALE       2, 4, 6, 8, 10
      HAIR_RED     8
      HAIR_BLOND   1, 2, 5, 6
      EYE_BLUE     1, 2, 3, 10
      EYE_BROWN    4, 5, 6, 7, 8, 9
      fun          1, 3, 7, 9
      funny        2, 4, 6, 10
      beach        1, 2, 3, 4, 5, 6, 7, 8
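The term-to-document table can be reproduced in a few lines. This is a minimal sketch of the idea, not how Lucene stores its index: it maps each term to the set of profile ids containing it and answers an all-terms-required query by intersecting those sets. The four profiles are a subset chosen to agree with the table; the function names are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def match_all(index, terms):
    """Return sorted ids of documents containing every given term."""
    postings = [index.get(term, set()) for term in terms]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# A few profiles consistent with the table on the slide.
profiles = {
    1: ["MALE", "HAIR_BLOND", "EYE_BLUE", "fun", "beach"],
    2: ["FEMALE", "HAIR_BLOND", "EYE_BLUE", "funny", "beach"],
    8: ["FEMALE", "HAIR_RED", "EYE_BROWN", "beach"],
    10: ["FEMALE", "EYE_BLUE", "funny"],
}
index = build_inverted_index(profiles)
blue_eyed_women = match_all(index, ["FEMALE", "EYE_BLUE"])
```

Lookup cost depends on the number of query terms and the length of their posting lists, not on the total number of profiles, which is the point of inverting.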
  25. Schema Design
      • Single “Table”
      • One-to-many = multi-value fields
      • Individual vs Composite Fields
      • copyField and have both!
  26. Field considerations
      • Stored or not
      • Indexed or not
      • Multivalued - desires fields
      • Type
  27. Solr Types Used (the ‘t’ is for Trie)
      • tdate, tint, tfloat* - birthdate, loginAt
      • text - all text
      • string - id, non indexed text
      • random - good for random sorts
      • enum - for all enumerations
  28. Data Duplication
      • By function - numberPhotos & hasPhotos
      • By relationship - hiddenBy & hidden
      • By analysis - name & text
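The three kinds of duplication can be pictured as one function that derives the redundant fields when a profile is flattened into a search document. The field names come from the slide; the shape of the input `profile` dict, the `to_search_document` helper, and the reading of `text` as a catch-all built from name and headline are assumptions for illustration.

```python
def to_search_document(profile, hidden_by):
    """Flatten a profile into a search document with derived duplicate fields."""
    photo_count = len(profile["photos"])
    return {
        "id": profile["id"],
        # By function: a count for sorting or ranges, a flag for cheap filtering.
        "numberPhotos": photo_count,
        "hasPhotos": photo_count > 0,
        # By relationship: both directions of the same "hide" relation.
        "hidden": sorted(profile["hidden"]),
        "hiddenBy": sorted(hidden_by),
        # By analysis: the name on its own, plus a catch-all text field.
        "name": profile["name"],
        "text": " ".join([profile["name"], profile["headline"]]),
    }


# Hypothetical profile record.
profile = {
    "id": 1,
    "name": "Pat",
    "headline": "Loves the beach",
    "photos": ["a.jpg", "b.jpg"],
    "hidden": [6, 2, 4],
}
doc = to_search_document(profile, hidden_by=[2])
```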
  29. Saving Profiles
      • Updating is an in-memory operation
      • No partial updates
      • Commit means flush index changes
      • Autocommit on maxDocs, maxTime or both
  30. Why Also Voldemort
      • Private profiles cannot be stale
      • Many fields not searchable or viewable by others
      • Isolate queries from fetch by id
  31. Querying
      • Superset of Lucene
      • Efficient Range Queries
      • Multiple Query Handlers
      • Dismax, Boost, Geo
  32. Recall vs Precision
      • Focus on recall when corpus is small
      • Precision once it is at critical mass
  33. Boolean Queries
      • Default operator set to AND
      • +gender:FEMALE +seeking:MALE +eyeColor:EYE_BLUE +hairColor:(HAIR_RED OR HAIR_BLONDE)
      • Sort order is important
  34. Hybrid Queries
      • Default operator set to OR
      • +gender:FEMALE +seeking:MALE eyeColor:EYE_BLUE hairColor:(HAIR_RED HAIR_BLONDE)
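The difference between the two strategies is which clauses carry the required `+` prefix. A minimal sketch of a query builder follows, assuming field values are plain enum tokens that need no escaping; `field_clause` and `build_query` are hypothetical helpers, and writing a multi-value clause with an explicit `OR` is an assumption about the intended meaning that holds under either default operator.

```python
def field_clause(field, values):
    """Render one field clause; multiple values are OR-ed inside parentheses."""
    if len(values) == 1:
        return "{}:{}".format(field, values[0])
    return "{}:({})".format(field, " OR ".join(values))


def build_query(required, preferred=None):
    """Required clauses get a '+' prefix; preferred clauses only affect ranking."""
    clauses = ["+" + field_clause(f, v) for f, v in required]
    clauses += [field_clause(f, v) for f, v in (preferred or [])]
    return " ".join(clauses)


who = [("gender", ["FEMALE"]), ("seeking", ["MALE"])]
looks = [("eyeColor", ["EYE_BLUE"]), ("hairColor", ["HAIR_RED", "HAIR_BLONDE"])]

# Boolean strategy: every criterion must match.
strict_query = build_query(who + looks)
# Hybrid strategy: hard constraints required, appearance only boosts rank.
hybrid_query = build_query(who, looks)
```

The strict form returns fewer, exact matches; the hybrid form keeps recall high and lets scoring order the results.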
  35. Why you’re lucky if you like redheads
      • Inverse Document Frequency (IDF)
      • Rarer is favored over more common
      • More fields matched = higher ranking

      Resulting rank order:
      1. Blue eyed, redheads
      2. Blue eyed, blonds
      3. Redheads
      4. Blonds
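The rank order can be reproduced with a deliberately simplified score: the sum of the IDF of each matched term. The sketch uses the IDF shape of Lucene's classic default similarity, 1 + ln(numDocs / (docFreq + 1)), and document frequencies taken from the ten-profile inverted index earlier in the deck. Real Lucene scoring also includes term frequency, norms, and coordination factors, which are omitted here.

```python
import math

NUM_DOCS = 10
# Document frequencies from the "Inverting People" table.
DOC_FREQ = {"EYE_BLUE": 4, "HAIR_RED": 1, "HAIR_BLOND": 4}


def idf(term):
    """Rarer terms get a larger weight."""
    return 1.0 + math.log(NUM_DOCS / (DOC_FREQ[term] + 1.0))


def score(matched_terms):
    """Simplified score: sum of IDF over the optional terms a profile matched."""
    return sum(idf(term) for term in matched_terms)


candidates = {
    "blue-eyed redhead": ["EYE_BLUE", "HAIR_RED"],
    "blue-eyed blond": ["EYE_BLUE", "HAIR_BLOND"],
    "redhead": ["HAIR_RED"],
    "blond": ["HAIR_BLOND"],
}
ranking = sorted(candidates, key=lambda name: score(candidates[name]), reverse=True)
```

With one redhead among ten profiles, matching HAIR_RED alone outweighs matching HAIR_BLOND alone, and matching two fields outweighs matching one.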
  36. Boosting
      • Query time by importance
        • eyeColor:EYE_BLUE^2 hairColor:HAIR_BLOND
  37. Filter Fields
      • Useful for roles and other lists
      • -hidden:(2 4 6)

      id   hidden
      1    2, 4, 6
      2    1
  38. Filter Fields
      • Useful for roles and other lists
      • -hidden:(2 4 6)
      • -hiddenBy:1

      id   hidden
      1    2, 4, 6
      2    1

      id   hiddenBy
      1    2
      2    1
      4    1
      6    1
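The second table is the first one inverted: instead of listing on each profile whom it has hidden, each profile lists who has hidden it, so excluding hidden profiles for a searcher becomes a single-term filter. The sketch below derives `hiddenBy` from `hidden` and mimics `-hiddenBy:1` in plain Python. It assumes `hidden` means "profiles this user has hidden"; the helper names are hypothetical.

```python
from collections import defaultdict


def invert_hidden(hidden):
    """Turn {user: [profiles they hid]} into {profile: {users who hid it}}."""
    hidden_by = defaultdict(set)
    for owner, targets in hidden.items():
        for target in targets:
            hidden_by[target].add(owner)
    return hidden_by


def visible_to(searcher, profile_ids, hidden_by):
    """Plain-Python equivalent of applying the filter -hiddenBy:<searcher>."""
    return [pid for pid in profile_ids
            if searcher not in hidden_by.get(pid, set())]


# Data from the slide: user 1 hid profiles 2, 4 and 6; user 2 hid profile 1.
hidden = {1: [2, 4, 6], 2: [1]}
hidden_by = invert_hidden(hidden)
results_for_user_1 = visible_to(1, [2, 3, 4, 5, 6], hidden_by)
```

The filter stays one term long no matter how many profiles a user hides, at the cost of re-indexing a profile whenever someone hides it.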
  39. Date Math
      • Simplifies query preprocessing
      • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]
  40. Date Math
      • Simplifies query preprocessing
      • +birthDate:[NOW/DAY+1DAY-36YEAR TO NOW/DAY-25YEAR]
        (between 25 and 35 years old)
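To see why the lower bound is `+1DAY-36YEAR` rather than `-35YEAR`, it helps to compute the same range by hand. The sketch below does the preprocessing that Solr's date math makes unnecessary: the oldest acceptable birth date is the day after the 36th birthday cutoff, so someone who turns 36 tomorrow still counts as 35. The helpers are hypothetical, work on whole dates rather than timestamps, and clamp February 29 to February 28 in non-leap years as an assumption.

```python
from datetime import date, timedelta


def minus_years(d, years):
    """Subtract whole years, clamping Feb 29 to Feb 28 when needed."""
    try:
        return d.replace(year=d.year - years)
    except ValueError:
        return d.replace(year=d.year - years, day=28)


def birth_date_range(today, min_age, max_age):
    """Inclusive birth-date bounds for people aged min_age..max_age today."""
    earliest = minus_years(today + timedelta(days=1), max_age + 1)
    latest = minus_years(today, min_age)
    return earliest, latest


def age_on(birth, today):
    """Age in completed years on the given day."""
    before_birthday = (today.month, today.day) < (birth.month, birth.day)
    return today.year - birth.year - before_birthday


today = date(2011, 5, 26)
earliest, latest = birth_date_range(today, 25, 35)
```

For the date of the talk this gives births from 1975-05-27 through 1986-05-26, the same window the query expresses without any client-side date arithmetic.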
  41. Distance Searching
      • lat, lon, distance
      • SolrLocal by Patrick O’Leary
      • Additional overhead ~90ms per query
      • Superseded in Solr 3.1
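A distance search over lat, lon, and a radius comes down to a great-circle calculation. The sketch below shows the haversine formula with a mean Earth radius of 6371 km and a brute-force filter over candidate profiles. It illustrates the underlying arithmetic only and is not how SolrLocal or Solr 3.1 implement spatial search; the `within` helper and the profile coordinates are made up.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two lat/lon points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def within(profiles, lat, lon, distance_km):
    """Ids of profiles whose coordinates lie within distance_km of (lat, lon)."""
    return [pid for pid, (plat, plon) in sorted(profiles.items())
            if haversine_km(lat, lon, plat, plon) <= distance_km]


# Hypothetical profile coordinates.
profile_locations = {1: (0.0, 0.0), 2: (0.5, 0.0), 3: (2.0, 0.0)}
nearby = within(profile_locations, 0.0, 0.0, 100.0)
```

Computing this per candidate is expensive at scale, which is why spatial plugins first narrow the candidates with an indexed bounding box or grid before measuring exact distance.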
  42. Testing Queries
      • Log queries and ids returned
      • Version your search strategies
      • Improve one thing at a time
  43. Geo Service
      • Read-mostly service
      • Fields - Postal Code, Country, State, Cities, Lat, Lon
      • Usage - Registration Validation, City Selection
  44. Operations
      • Servlet container and filesystem
      • Jetty 6, 64-bit Java 6 JVM
      • 8G heap, -XX:+UseCompressedOops
  45. Operations
      • Active/Passive
      • Layer 7 Load balancing
      • Nightly snapshots
      • Eventually SolrCloud
  46. Multicore
      • Run multiple schemas on the same
      • Hot swappable for backwards compatible changes
      • private / public profiles
  47. Security
      • No security provided
      • At minimum secure your UpdateHandler; otherwise anyone can post
        <delete><query>*:*</query></delete>
      • Separate Cores
  48. Future
      • Solr 3.1
      • Mutual Matching
      • Faceting / Guided Search
      • Incorporating spelling
      • Hierarchies, categories, better ranking models
  49. Faceting
      • Returns counts with query results
      • Efficient
      • Guides the user toward precision
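Conceptually, a facet is a count of field values over the documents that matched the query. The sketch below computes those counts in plain Python over an in-memory result list; Solr computes them inside the index far more efficiently. The `facet_counts` helper and the sample results are illustrative assumptions.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many matching documents carry each value of a field."""
    counts = Counter()
    for doc in results:
        value = doc.get(field)
        # Treat single-valued and multi-valued fields the same way.
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v is not None:
                counts[v] += 1
    return counts


# Hypothetical search results.
results = [
    {"id": 2, "eyeColor": "EYE_BLUE", "hairColor": "HAIR_BLOND"},
    {"id": 8, "eyeColor": "EYE_BROWN", "hairColor": "HAIR_RED"},
    {"id": 10, "eyeColor": "EYE_BLUE"},
]
eye_facet = facet_counts(results, "eyeColor")
hair_facet = facet_counts(results, "hairColor")
```

Showing these counts next to the results tells the user how many profiles remain if they narrow by a given value, which is how faceting guides a search toward precision.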
  50. Thank you
      jtuberville@eharmony.com
      Twitter: @jtuberville