RAPID PRUNING OF SEARCH SPACE THROUGH
HIERARCHICAL MATCHING
Chandra Mouleeswaran
Machine Learning Scientist, ThreatMetrix Inc.
5/2/13	
My Background

• Machine Learning Scientist at ThreatMetrix Inc.
• Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA

Career Path

- Siemens Corporate Research: Learning & Expert Systems
- Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group - Network Monitoring
- Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
Outline

• Task description
• Approaches
• Why search paradigm?
• Hierarchical matching
• Results
• Acknowledgments
The Device Identification Task

• Computationally, it's a CLASSIFICATION problem:

  { a0, a1, a2, a3 … an } → { ci }
  ai = ( attribute | field | key ) value
  ci = ( label | signature | class | hash )

• Returning devices should be correctly identified within certain tolerances
• New classes may be created if a good match is not found in the repository of known devices
• Devices age out, based on data retention policy
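The mapping { a0 … an } → { ci } above can be pictured as a similarity lookup against a repository of known device signatures, where a new class is minted when no candidate clears the tolerance. The sketch below is only an illustration: the attribute names, the exact-match similarity measure, and the 0.8 threshold are all invented, not ThreatMetrix's actual method.

```python
def similarity(device: dict, known: dict) -> float:
    """Fraction of the device's attributes that match a known signature exactly."""
    if not device:
        return 0.0
    hits = sum(1 for k, v in device.items() if known.get(k) == v)
    return hits / len(device)

def identify(device: dict, repository: dict, threshold: float = 0.8) -> str:
    """Return the best-matching class label, or create a new class when no
    good match exists in the repository (as the slide describes)."""
    best_label, best_score = None, 0.0
    for label, signature in repository.items():
        score = similarity(device, signature)
        if score > best_score:
            best_label, best_score = label, score
    if best_score >= threshold:
        return best_label
    new_label = f"c{len(repository)}"      # mint a new class label
    repository[new_label] = dict(device)
    return new_label

repo = {"c0": {"os": "iOS", "tz": "UTC-8", "screen": "320x480"}}
print(identify({"os": "iOS", "tz": "UTC-8", "screen": "320x480"}, repo))  # c0
```

A returning device with unchanged attributes maps back to its existing label; an unseen combination falls below the threshold and creates a fresh class.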
Task Challenges

• Extremely volatile attributes
• There are no pivot attributes to divide and conquer the search space
• Changing distributions
• Emphasis on PRECISION
• Stringent RESPONSE time
Engineering Challenges

• Precision (accuracy) and latency (response time) are antagonistic constraints
• Project management

                  Repository Size (millions)   Load (TPS)   Latency (ms)
  Project start   28                           200          < 100
  Present         280                          300          < 100
  Change          10x                          1.5x         None
Approaches

• Rules engine
• Learning models
• Vector space models

Need an enterprise-grade solution!
Rules Engine

• No experts
• Number of rules?
• Maintenance?

Not a viable approach!
Learning Models

• Most machine learning methods deal predominantly with binary classification problems (e.g., fraud / not fraud) or a small number of target classes
• Few exemplars for each class
• Attribute values may be unbounded
• Attributes may not follow a natural progression
Learning Models …

• Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use. Any simplification process will compromise on accuracy
• Ability to explain is critical
• Tend to ignore domain knowledge

Challenge in providing an enterprise solution
Thoughts

• No comparable application with such requirements
• Build and deploy a classifier that explains itself easily, scales temporally, and offers quick response
• Use domain knowledge to guide verification
• Improve the classifier through machine learning methods by analyzing performance in the field
Vector-Space Models

• Similarity-based search makes the vector-space model a good choice for generating selections
• Given the volatile nature of the data, information retrieval (IR) systems can adapt easily
• Good at neighborhood search

Sensitive to individual attribute changes!
Sources of Inspiration

• Lucene/Solr features
• Documentation from (erstwhile) Lucid Imagination
• Ease with which Lucene/Solr could be installed and explored

Very short learning curve for novices!
Feature Selection

• Primitive and derived attributes
• Entropy
• Distribution
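Entropy is one of the selection criteria above: an attribute whose value distribution has near-zero entropy carries no signal, while a near-unique one is too volatile to pivot on. A minimal sketch of ranking attributes by the Shannon entropy of their observed values (the attribute names and toy observations are made up for illustration):

```python
import math
from collections import Counter

def entropy(values) -> float:
    """Shannon entropy (bits) of the empirical value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Toy per-attribute observations: a constant column carries no signal,
# a unique-per-row column is maximally volatile.
observations = {
    "plugin_hash": ["p1", "p2", "p1", "p3", "p2", "p1"],
    "accept_lang": ["en", "en", "en", "en", "en", "en"],   # zero entropy
    "request_id":  ["r1", "r2", "r3", "r4", "r5", "r6"],   # maximal entropy
}
ranked = sorted(observations, key=lambda a: entropy(observations[a]))
print(ranked)  # ['accept_lang', 'plugin_hash', 'request_id']
```

Attributes in the middle of this ranking are the useful candidates; the extremes are either constant or effectively random.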
Domain

• Devices come with structural information but not much grammar or semantics
• Bag-of-words (single field) approach is fast but not precise
• Using all fields is precise but response is slow

Now what?
Disjunction Max

• Matrix of all possible combinations of user input query and document fields
• Transforms into a Boolean query of DisjunctionMaxQueries, one per row
• The maximum score of the sub-clauses is used by DisjunctionMaxQuery
• No single term in user input dominates

This is needed!

Src: SearchHub and LucidWorks
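The scoring behavior described above (per query term, take the maximum score across fields so that no single field dominates) can be sketched as follows. The scores are toy numbers; Lucene's DisjunctionMaxQuery additionally supports a tie-breaker multiplier applied to the non-maximum sub-scores, which is modeled here too.

```python
def dismax_score(per_term_field_scores, tie_breaker=0.0):
    """Score a document: for each query term (one row), keep the max field
    score plus tie_breaker times the remaining sub-scores, then sum rows
    (the enclosing Boolean query)."""
    total = 0.0
    for field_scores in per_term_field_scores:
        best = max(field_scores)
        rest = sum(field_scores) - best
        total += best + tie_breaker * rest
    return total

# Two query terms scored against three document fields:
rows = [
    [0.9, 0.1, 0.0],   # term 1 matches field a1 strongly
    [0.0, 0.0, 0.7],   # term 2 matches field a3
]
print(round(dismax_score(rows), 2))                   # 1.6
print(round(dismax_score(rows, tie_breaker=0.1), 2))  # 1.61
```

With tie_breaker = 0, only the best field per term counts, which is exactly the "no single term dominates" property the slide wants.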
DisMax Experiments (index size = 60 million)

Scenario 1
  mm = 2
  Solr fields = { a1, a2, a3 }
  Values = { phrase1, phrase2, phrase3 }
  Must-Match Clauses
  Latency: YES (35 ms)
  Precision: NO (20% failure)

Scenario 2
  mm = 50%
  Solr fields = { a1 }
  Values = { term1, term2, term3 … termn }
  Should-Match Clauses
  Latency: NO (> 2 seconds)
  Precision: YES (> 98%)
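In Solr terms, the two scenarios correspond to different dismax request parameters. The sketch below shows roughly what each request might look like: defType, qf, mm, and q are standard (e)dismax parameters, while the field names a1–a3 and the phrase/term values are the slide's placeholders, not a real schema.

```python
# Hypothetical request parameters behind each scenario (field names and
# values are the slide's placeholders).

scenario_1 = {                # few phrase clauses, must-match: fast, imprecise
    "defType": "dismax",
    "qf": "a1 a2 a3",         # query spread across three fields
    "q": '"phrase1" "phrase2" "phrase3"',
    "mm": "2",                # at least 2 clauses must match
}

scenario_2 = {                # many term clauses, should-match: precise, slow
    "defType": "dismax",
    "qf": "a1",
    "q": "term1 term2 term3",  # ... up to n terms
    "mm": "50%",              # at least half the clauses should match
}

for name, params in [("scenario 1", scenario_1), ("scenario 2", scenario_2)]:
    query_string = "&".join(f"{k}={v}" for k, v in params.items())
    print(name, "->", query_string)
```

The trade-off in the experiment falls directly out of these parameters: a strict mm over a few phrase clauses prunes fast but mislabels, while a lenient mm over many term clauses is precise but explodes the disjunction.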
Possible Workaround

• Look-ahead: Customize Lucene/Solr to do a branch-and-bound search, bail out on some lower-bound score
• Minimize candidates for DisMax search
  - reduce total number of Solr instances to search
  - reduce total number of disjunctive terms

[ Empirical estimate: t_n = 2 * t_(n-1), where t = time & n = number of disjunctive terms ]
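The empirical estimate t_n = 2 * t_(n-1) unrolls to t_n = t_1 * 2^(n-1), so latency grows exponentially in the number of disjunctive terms and every term removed halves the cost. A quick illustration (the 1 ms base latency is an invented number, only the doubling recurrence comes from the slide):

```python
def estimated_latency_ms(n_terms: int, t1_ms: float = 1.0) -> float:
    """Apply the slide's recurrence t_n = 2 * t_(n-1) starting from t_1."""
    t = t1_ms
    for _ in range(n_terms - 1):
        t *= 2
    return t

print(estimated_latency_ms(10))   # 512.0
# Cutting terms by ~30% (see the next slide) shrinks the estimate sharply:
print(estimated_latency_ms(7))    # 64.0
```

This is why the optimizations below target the term count rather than, say, constant-factor tuning.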
Phrases over Terms

• Used collocation (co-occurrence matrix) to determine the most common phrases
• Delete terms covered by phrases
• Add stop words based on frequency analysis
• Ensure precision is preserved through regression tests

Reduced the number of DisMax terms by 30%
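The collocation step above can be sketched as: count adjacent-pair co-occurrences across the corpus, promote frequent pairs to phrases, then drop the individual terms those phrases cover. The toy corpus and the frequency cutoff below are made up; the real pipeline's co-occurrence matrix and thresholds are not described on the slide.

```python
from collections import Counter

def common_phrases(docs, min_count=2):
    """Count adjacent term pairs across docs; frequent pairs become phrases."""
    pairs = Counter()
    for doc in docs:
        terms = doc.split()
        pairs.update(zip(terms, terms[1:]))
    return {" ".join(p) for p, c in pairs.items() if c >= min_count}

def compress(doc, phrases):
    """Rewrite a query: covered term pairs collapse into one phrase clause."""
    terms = doc.split()
    out, i = [], 0
    while i < len(terms):
        if i + 1 < len(terms) and f"{terms[i]} {terms[i + 1]}" in phrases:
            out.append(f'"{terms[i]} {terms[i + 1]}"')
            i += 2                # both terms are covered by the phrase
        else:
            out.append(terms[i])
            i += 1
    return out

docs = ["mozilla firefox linux", "mozilla firefox windows", "opera windows"]
phrases = common_phrases(docs)
print(phrases)                                   # {'mozilla firefox'}
print(compress("mozilla firefox linux", phrases))
# ['"mozilla firefox"', 'linux'] -> 3 disjunctive clauses reduced to 2
```

Each promoted phrase removes one disjunctive clause per occurrence, which is where the 30% reduction comes from.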
Sources of Inspiration

• Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
• Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
• Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
Hierarchical Matching

[Architecture diagram: bag-of-words models, phrases, and filters feed a DisMax query formulator driven by domain-specific patterns; a Solr instances selector (CSV/JSON) routes the query to the Solr servers; results flow into verification]
Conflict Resolution

• Top n candidates are returned from each Solr instance
• They are ranked by a custom verification module
• Ties are broken using recency
• Top candidate is persisted and returned along with its custom score
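The resolution steps above might look like this sketch: merge the per-instance candidate lists, sort by verification score with recency as the tie-breaker, and persist the winner. The candidate fields (label, score, last_seen) and the numbers are invented for illustration; the actual verification scoring is custom and not shown on the slide.

```python
def resolve(candidates_per_instance):
    """Pick one winner from the top-n lists returned by each Solr instance."""
    merged = [c for instance in candidates_per_instance for c in instance]
    # Highest verification score first; among equals, most recent first.
    merged.sort(key=lambda c: (c["score"], c["last_seen"]), reverse=True)
    winner = merged[0]
    return winner["label"], winner["score"]   # persisted with its custom score

instances = [
    [{"label": "c7", "score": 0.92, "last_seen": 1001},
     {"label": "c3", "score": 0.88, "last_seen": 1400}],
    [{"label": "c9", "score": 0.92, "last_seen": 1300}],
]
print(resolve(instances))   # ('c9', 0.92) -- tie broken by recency
```

Sorting on the (score, last_seen) tuple implements both rules in one pass: the tie-breaker only matters when scores are equal.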
Comments

• DisMax performs a multidimensional match
• Extracted multiple filters and arranged them hierarchically
• Separation of selection and evaluation
  - Selection = approximate solution
  - Evaluation = refinement
Where Time Went …

• Attribute selection
• Ranking
• Optimization
• Index re-generation
• Regression testing
Sources for Tune Up

• Scaling Solr, Lucene Revolution, May 2011
• Practical Search with Solr: Beyond Just Looking It Up, Lucid Imagination, May 2010
Testing

• Precision testing using self and mixed modes
• Latency tests
  - custom harness for stand-alone tests
  - integrated tests with the JMeter framework
 
Results
Latency Percentiles

[Chart comparing latency percentiles across four configurations: original edismax; initial solution; Optimization 1: filters, focused search, verification; Optimization 2: domain patterns, stop words, de-dupe]
TPS

[Chart: transactions per second]
Response Times over Time

[Chart]
Project Execution

• Agile methodology
• Risk mitigation through primary and contingency plans
• Rapid prototyping followed by good software engineering practices
• Evaluating DSE (DataStax) & Solr Cloud
Gleanings

• You can classify anything with Lucene/Solr; the lexicon is your own
• The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
• If you run into a problem, someone has solved it or will solve it in the near future
Gleanings …

• Deal with accuracy before latency
• If precision, latency, and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
• "Index once, run anytime, anywhere" does not apply during development
• Throwing all data at Lucene/Solr will not work for mission-critical applications
• Rapid prototyping and willingness to fail
Summary

Simplify and match at multiple levels of abstraction
Contributors

Chandra Mouleeswaran: Research & Prototyping
Fang Chen: Research & Prototyping
Luke Mertens: Productization & Scalability
Brent Pearson: Release Management
Tracy Hsu: Precision Testing & QA
Srinivas Nayani: Deployment & QA
COMMENTS & FEEDBACK:
Chandra Mouleeswaran
cmouleeswaran@threatmetrix.com
