The Seven Deadly Sins of Solr

Etsy is using Solr and Lucene to serve queries at a rate of more than 8 billion per year (and growing). In this case study, we will describe how Etsy has integrated Solr/Lucene into our continuous deployment infrastructure, allowing for Solr configuration, Java-based indexers, and query parsing logic to go from passing tests to production code in minutes.

Usage Rights

© All Rights Reserved

The Seven Deadly Sins of Solr: Presentation Transcript

  • Introductions…
    • Who the hell am I?
      • Jay Hill, Lucid Imagination
      • 7 years Lucene experience
      • 4 years Solr experience
      • Author of Lucid Training
      • SME for Lucid Certification
    • Who the hell are you?
      • New to search?
      • New to Lucene/Solr?
      • Battle-tested veterans?
    © Lucid Imagination, Inc.
  • We'll Leave Time For Q&A
    • Who's doing what?
      • Solr 3.1?
      • Solr 1.4.1?
      • Nightly build?
      • Solr 1.3 or older?
    • Are there any specific problems you're having?
    • Meanwhile, interrupt, ask questions as we go, etc.
  • A Brief Word About Lucid Imagination
    • Lucid Imagination: the commercial company supporting Lucene/Solr open source search
    • Founded by:
      • Yonik Seeley – Creator of Solr
      • Erik Hatcher – Co-author, Lucene in Action
      • Grant Ingersoll – Apache PMC Chair
      • Marc Krellenstein – Lucid CTO
    • Staff includes 9 Lucene/Solr committers
    • Training, certification, support, LucidWorks Enterprise
  • Lucid Customers (That I've Worked With)
  • …On To The Sinning!
  • Sins As Anti-Patterns?
    • "Sorta kinda"
      • Specify Nothing (Sloth)
      • Creeping Featuritis (Greed)
      • Blowhard Jamboree (Pride)
      • Boat Anchor (Lust)
      • Not Invented Here (Envy)
      • Phatware (Gluttony)
      • Emperor's New Clothes (Wrath)
  • Sins Can Contradict One Another!
    • You'll notice that many of the "sins" we see will be the exact opposite of others
    • Just as some of us tend towards laziness, others towards excess
    • Sometimes: "Look before you leap."
    • Other times: "He who hesitates is lost."
    • In Solr (or any search app), one size never fits all
  • "I don't know and I don't care."
  • Sloth
    • "We aren't really into open source."
    • Lack of commitment to Solr and/or the search application itself
    • Not developing in-house Solr expertise
    • Not paying enough attention to JVM settings, garbage collection, and RAM allocation
  • Sloth
    • Neglecting to get familiar with the source code
      • It is open source after all!
    • Not taking the time to understand the main parts of Solr:
      • Request handlers
      • Search components
      • Query parsers
        • Extend the QParserPlugin class
      • ValueSource & ValueSourceParser – custom functions
        • New pseudo-fields in 4.x
      • Response writers
  • Sloth
    • Not keeping up with new features and developments in Lucene and Solr
    • CHANGES.txt – use "diff" to keep up on changes
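
The "diff" advice on the slide above can be sketched in a few lines. This is only an illustration: the two strings are made-up stand-ins for the CHANGES.txt of two releases, and `new_change_lines` is a hypothetical helper name, not part of Solr.

```python
import difflib


def new_change_lines(old_text, new_text):
    """Return the lines that appear in new_text but not in old_text, in order."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    matcher = difflib.SequenceMatcher(a=old_lines, b=new_lines, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # "insert" and "replace" both contribute lines that are new on the b side.
        if tag in ("insert", "replace"):
            added.extend(new_lines[j1:j2])
    return added


# Made-up excerpts; in practice, read the CHANGES.txt of the old and new release.
old_changes = "New Features\n* Date faceting\n"
new_changes = (
    "New Features\n"
    "* Date faceting\n"
    "* Numeric range faceting\n"
    "* Suggester component\n"
)

added = new_change_lines(old_changes, new_changes)
for line in added:
    print(line)
```

Running the same comparison after each upgrade gives a short reading list instead of the whole file.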
  • Sloth
    • New features in Solr 3.1:
      • Solr spatial
      • Edismax query parser
        • NOT experimental!
      • Dynamic metadata extraction via UIMA
      • Numeric range faceting (like date faceting)
      • Lucene RAMDirectoryFactory available
      • Faceting performance improvements
      • Spellcheck and Terms components now work for distributed search
      • Suggester component – better autosuggest!
        • Can add custom dict., phrases, etc.
  • Sloth
    • New features coming in Solr 4.x:
      • Lucene DocumentsWriterPerThread (DWPT)
        • Moving towards "real time"
      • UpdateHandler upgrade to work with real-time
      • Field collapsing/grouping
      • Pivot facets
      • SolrCloud (ZooKeeper)
      • Fuzzy queries 100 times faster
      • Pseudo-fields via functions
      • Relevancy function queries: tf, idf, docFreq, norm, …
  • Sloth: The Path To Salvation
    • Commit to the project and to learning Solr
    • Stay up to date on Solr changes
    • Stay current with ongoing releases
    • Get familiar with the source code
    • Spend some time to understand the main configuration files:
      • solrconfig.xml
      • schema.xml
    • Read through the entire Solr Wiki once every so often
    • Develop in-house Solr expertise
  • Save a penny, lose a customer.
  • Greed
    • Skimping on resources such as:
      • RAM
        • "Here's a quarter, buddy, go buy some RAM!"
      • Storage space
    • You will get what you pay for!
    • …on the other hand, not every company has "deep pockets"
  • Greed
    • Trying to "squeeze by", indexing to, and searching on, the same server
    • [Diagram labels: Indexing, Shards (Indexers), Slave/Searchers, Load Balancer, Searches]
  • Greed
    • Not making the effort to find the right balance between precision and recall
      • Recall: What fraction of the relevant documents in the collection were returned by the system?
      • Precision: What fraction of the returned results are relevant to the information need?
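
The two definitions on the slide above translate directly into arithmetic. A minimal sketch, assuming you have a set of returned document IDs and a hand-judged set of relevant IDs for one query (the IDs below are invented for illustration):

```python
def precision_recall(returned, relevant):
    """Compute (precision, recall) for one query from two sets of document IDs."""
    hits = len(returned & relevant)  # relevant documents that were actually returned
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Invented judgments: 4 documents returned, 3 documents are truly relevant.
returned = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc7"}

p, r = precision_recall(returned, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here 2 of the 4 returned documents are relevant (precision 0.50) and 2 of the 3 relevant documents were found (recall about 0.67). Returning more documents tends to raise recall and lower precision, which is the balance the slide refers to.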
  • Greed
    • A few thoughts about relevance:
      • Get feedback from domain experts
      • Is it better to have lots of results with less precision, or fewer, more targeted results?
      • Different sites will have very different requirements
  • Greed: The Path To Salvation
    • Pry open your wallet – don't be cheap
    • You don't have to push the envelope
    • Find the right balance between recall and precision
    • Don't push for more results over precision – unless that is a clear requirement (sometimes it is)
  • "What could possibly go wrong?"
  • Pride
    • Reinventing the wheel
      • "Why don't we just write our own search libraries?"
      • Nobody has a use case like us – right?
      • "We need to change the scoring algorithms."
  • Pride
    • Thinking you can "do it all" in Solr
      • Solr is rarely a good choice as a system of record (SOR)
    • Consider other tools to work with Solr:
      • Nutch
      • Mahout
      • OpenNLP
      • Google Connector Framework
      • Your own code
  • Pride
    • Stubbornly refusing to use resources such as the mailing lists:
      • Solr user list: solr-user@lucene.apache.org
      • Solr developer list: dev@lucene.apache.org
      • Lucene user list: java-user@lucene.apache.org
      • LucidFind: http://www.lucidimagination.com/search/
  • Pride
    • "I will not yield!"
      • Trying to "win battles" on the mailing lists
    • Good Karma – be a good citizen in the community
  • Pride: The Path To Salvation
    • Ask for help when needed
    • Let the business needs define the project – don't let the tail wag the dog
    • Get a feel for the Solr community and respect the experience of others
    • Your situation, while possibly unique, is probably not completely dissimilar to others'. Learn from the pioneers and Solr veterans
  • "Someone stop me!"
  • Lust
    • Obsessing over unimportant details too early in the project
      • Agile approach is well suited to Solr development – iterate!
    • Trying to "push the envelope"
      • Necessary sometimes, but it's not called the "bleeding edge" without reason
      • "Ease in" to major changes
    • Too much attention to JVM settings
      • Solr experts are not usually JVM/GC experts
  • Lust
    • "Anti-greed" – Committing too many resources to Solr
      • Make sure the OS has plenty of RAM to cache files, etc.
    • "If one is good, a dozen must be better!"
      • As much as possible, try to get a sense of what your query volume will be, and don't just throw money at building a monstrous farm of searchers
      • Solr has proven to be much more efficient than some large, commercial search solutions
  • Lust
    • Blood from a turnip:
      • Trying some absurd new technique, "just because"
      • RAMDirectoryFactory – not a secret way to faster indexing/searching
        • No disk-backed persistence
        • Usually not worth it
        • …but you never know…
      • Research first before going "extreme"
  • Lust
    • No need to index millions of docs for development
      • Better to work with small sets of data while getting started
    • Don't worry too much about field types as you get started. Get data in the index, then analyze and refine.
  • Lust: The Path To Salvation
    • Use an agile approach – start simply, build your application slowly, iterate
    • Deal with the low-hanging fruit first
    • Measure twice, cut once
    • Don't miss the forest for the trees – no need to obsess over details in the early stages
    • Do some due diligence before trying unorthodox approaches
    • Get a small sample of data indexed without worrying about type, then iterations of refinement
  • "If we had some bacon we could have some bacon and eggs – if we had some eggs."
  • Envy
    • Adding "cool" features you see on other sites, but don't really need
    • Keep it "lean and mean", especially to start
    • Resist the urge to include the "kitchen sink"
  • Envy
    • You too can master dismax!
    • Don't be afraid of dismax/edismax
    • Lots of controls to learn, but also lots of power
      • Flexibility to search multiple fields
      • Boost different fields
      • Boost phrase fields (pf) higher than query fields (qf)
      • Use boost queries (bq) and function queries (bf)
    • Most intimidating params:
      • tie
      • mm
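
To make the parameters named on the slide above less intimidating, here is a rough sketch of a request that uses all of them. The field names (`title`, `description`, `in_stock`, `created`), the boost values, and the host and path are assumptions for illustration only; the parameter names themselves (`defType`, `qf`, `pf`, `bq`, `bf`, `tie`, `mm`) are the dismax/edismax ones.

```python
from urllib.parse import urlencode


def build_edismax_url(user_query, base="http://localhost:8983/solr/select"):
    """Build a select URL for the edismax parser (hypothetical fields and weights)."""
    params = {
        "defType": "edismax",
        "q": user_query,
        "qf": "title^2.0 description",  # query fields, each with its own boost
        "pf": "title^4.0",              # phrase field, boosted higher than its qf entry
        "bq": "in_stock:true^1.5",      # boost query: favor matching docs, exclude none
        "bf": "recip(ms(NOW,created),3.16e-11,1,1)",  # function query favoring newer docs
        "tie": "0.1",    # how much the lower-scoring field matches add to the best one
        "mm": "2<75%",   # 1 or 2 clauses: all required; more than 2: 75% required
    }
    return base + "?" + urlencode(params)


url = build_edismax_url("red shoes")
print(url)
```

Starting with only `qf` and adding `pf`, `tie`, and `mm` one at a time makes it easier to see what each parameter does to the ranking.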
  • Envy
    • Spatial search – seems complicated, but major sites make it look easy
    • Now, in Solr 3.1 – it is easy!
    • You can:
      • Store spatial data in your index
      • Filter by distance
      • Sort by distance
      • Boost/bias by distance
      • Facet by distance
    • Also consider: Search-based navigation such as "Show me in-stock items only"
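
As a sketch of the filter, sort, and facet items on the slide above, the following builds request parameters using the `geofilt` filter and `geodist()` function from Solr 3.1 spatial search. The field name `store`, the coordinates, and the bucket boundaries are assumptions; the field is assumed to be a lat/lon type in schema.xml.

```python
from urllib.parse import urlencode


def distance_bucket(lower, upper):
    """Facet query counting docs whose distance falls between lower and upper km."""
    return "{!frange l=" + f"{lower:g}" + " u=" + f"{upper:g}" + "}geodist()"


def spatial_params(lat, lon, radius_km, field="store"):
    """Return request parameters that filter, sort, and facet by distance."""
    half = radius_km / 2
    return [
        ("q", "*:*"),
        ("sfield", field),             # spatial field used by geofilt and geodist()
        ("pt", f"{lat},{lon}"),        # the point to measure from
        ("d", f"{radius_km:g}"),       # radius in kilometers
        ("fq", "{!geofilt}"),          # filter by distance
        ("sort", "geodist() asc"),     # sort by distance, nearest first
        ("facet", "true"),
        ("facet.query", distance_bucket(0, half)),          # facet by distance: near
        ("facet.query", distance_bucket(half, radius_km)),  # facet by distance: far
    ]


query_string = urlencode(spatial_params(45.15, -93.85, 20))
print(query_string)
```

A list of pairs is used instead of a dict because `facet.query` appears twice.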
  • Envy: The Path To Salvation
    • Focus on your requirements, don't try to add "bells and whistles" you don't need
    • Don't be hesitant to dive into the power of dismax/edismax
    • Take advantage of new features such as Solr spatial, if those features will add value to the end user experience
  • "A fat stomach never breeds fine thoughts."
  • Gluttony
    • “Staying fit and trim” is usually good practice when designing and running Solr applications
      • Once again – keep it "lean and mean"
    • A lot of these issues cross over into the “Sloth” category
      • The effort needed to keep your configuration and data efficiently managed is not considered important
    • Don't lose control of your configuration files
      • Remove unnecessary elements
      • Version control all configuration files
  • Gluttony
    • Slim down those "bloated" queries:
      q="red shoes"&accountId=(12343 OR 338899 OR 554443 OR 243445 OR 55442 OR 3330899 OR 59927 OR 3888999 OR 549 OR 440293579 34201 OR 339917 OR 300191 OR 339338 OR 109823 OR 679176 OR 31407815 OR 3001756 OR 134322 OR 311123 OR 987888 OR 997181 OR 771819 OR 100292 OR 3389474 OR 5505759 OR 2459577 OR 4499957 OR 1996571 OR 559590 OR 220299 OR 4404872 OR 151510 OR 66017 OR 666 OR 113459 OR 890575 OR 505725 OR 330393 OR 349940 OR 4094994 OR 1245995 OR 2459959 OR 4255909 OR 899955 OR 7878899 OR 100999 … ∞ )
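
The slide above does not say how to slim such a query down, so the sketch below only covers the first step: spotting the bloat. It counts the terms joined by OR in a parameter value and flags values above a review threshold. Both helper names and the threshold of 20 are hypothetical choices, not Solr settings.

```python
import re

MAX_CLAUSES = 20  # hypothetical review threshold, not a Solr setting


def count_or_clauses(param_value):
    """Count the terms joined by OR in one query parameter value."""
    inner = param_value.strip("() ")
    return len([term for term in re.split(r"\s+OR\s+", inner) if term.strip()])


def is_bloated(param_value, limit=MAX_CLAUSES):
    """Return True when the value has more OR'd terms than the review limit."""
    return count_or_clauses(param_value) > limit


account_filter = "(12343 OR 338899 OR 554443 OR 243445)"
n = count_or_clauses(account_filter)
print(n, is_bloated(account_filter))
```

Run against logged requests, a check like this gives a list of queries worth redesigning.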
  • Gluttony
    • Stay in shape – Flex Your Solr Muscles!
      • Keep up on new features
      • Training, when appropriate
      • Certification
      • Contribute!
      • Follow the user lists
      • Refactor when new features can help
      • Keep up to date on new releases
  • Gluttony: The Path To Salvation
    • Keep configuration files clean and trim. Remove unused elements
    • Periodically review queries to make sure they are efficient
    • Refactor when necessary – keep your application fit and trim
  • "Hope is the denial of reality."
  • Wrath
    • Wrath – usually synonymous with anger, but…
    • Let’s use an older definition here:
      “A vehement denial of the truth, both to others and in the form of self-denial and impatience.”
    • Step back every now and then and look objectively at your application
  • Wrath
    • Resist the push to rush to production…
  • Wrath
    • Ignoring new Solr releases
      • OK to wait until a release is proven
      • But getting too far behind makes upgrading more painful with each release
    • We don't have time to do it right, but we always have time to fix it
  • Wrath
    • Ignoring complaints about results relevance
    • Disregarding feedback from stakeholders
      • Remember – the point of your search application is to support the business, not to "build cool stuff"
    • Not taking advantage of log files
      • Consider mining log files, storing data in a relational DB for generating reports
      • Capturing user queries and query counts can be extremely useful
      • Can also be used for query-based autosuggest (not just indexed terms)
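
The log-mining idea on the slide above can be sketched as follows. The sample lines assume a request-log format with a `params={...}` section holding URL-encoded parameters; your log format may differ, and `count_queries` and `suggest` are hypothetical helper names.

```python
import re
from collections import Counter
from urllib.parse import parse_qs

# Assumed log format; note this simple pattern stops at the first "}",
# so parameter values that themselves contain "}" would be cut short.
PARAMS_RE = re.compile(r"params=\{(.*?)\}")


def count_queries(log_lines):
    """Count how often each user query (the q parameter) appears in the log."""
    counts = Counter()
    for line in log_lines:
        match = PARAMS_RE.search(line)
        if not match:
            continue
        for q in parse_qs(match.group(1)).get("q", []):
            counts[q.strip().lower()] += 1
    return counts


def suggest(counts, prefix, limit=5):
    """Query-based autosuggest: most frequent past queries starting with prefix."""
    prefix = prefix.lower()
    matches = [(q, n) for q, n in counts.items() if q.startswith(prefix)]
    matches.sort(key=lambda item: (-item[1], item[0]))
    return [q for q, _ in matches[:limit]]


sample_log = [
    "INFO: [] webapp=/solr path=/select params={q=red+shoes&rows=10} hits=42 status=0 QTime=3",
    "INFO: [] webapp=/solr path=/select params={q=Red+Shoes&rows=10} hits=42 status=0 QTime=2",
    "INFO: [] webapp=/solr path=/select params={q=red+hat} hits=7 status=0 QTime=1",
    "INFO: [] webapp=/solr path=/update params={commit=true} status=0 QTime=15",
]

counts = count_queries(sample_log)
print(counts.most_common(2))
print(suggest(counts, "red"))
```

The same counts could be written to a relational table for reporting, as the slide suggests.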
  • Wrath: The Path To Salvation
    • Keep your version of Solr up to date
      • OK to wait "awhile", but don't skip versions
    • Seek and embrace feedback from business and domain experts
    • Constantly gauge and improve relevance as an ongoing task
    • Avoid the push to release too soon (as best you can)
    • Take advantage of log files to understand what users are doing, and what is not working well
  • Seek, and you shall find!