Rapid pruning of search space through hierarchical matching

Presented by Chandra Mouleeswaran, Co-Chair at IntelliFest.org and Machine Learning Scientist at ThreatMetrix

This talk presents our experience applying Lucene/Solr to the classification of user and device data. ThreatMetrix, Inc. handles a huge volume of volatile data on a daily basis. The primary challenge is classifying each incoming transaction rapidly and precisely, by searching a huge index within a very strict latency specification. The audience will be taken through the various design choices and the lessons learned, including details of a hierarchical search procedure that systematically divides the search space into manageable partitions while maintaining precision.


  1. RAPID PRUNING OF SEARCH SPACE THROUGH HIERARCHICAL MATCHING
     Chandra Mouleeswaran, Machine Learning Scientist, ThreatMetrix Inc.
  2. My Background
     • Machine Learning Scientist at ThreatMetrix Inc.
     • Co-Chair, Developer Programs, IntelliFest.org, Oct 2013, San Diego, CA
     Career Path
     - Siemens Corporate Research: Learning & Expert Systems
     - Technology division of Donaldson, Lufkin and Jenrette company (Pershing): Artificial Intelligence Group
     - Network Monitoring
     - Several startups: Classification, Web Crawling, Security, Financial Trading, etc.
  3. Outline
     • Task description
     • Approaches
     • Why search paradigm?
     • Hierarchical matching
     • Results
     • Acknowledgments
  4. The Device Identification Task
     • Computationally, it is a CLASSIFICATION problem:
       { a0, a1, a2, a3, …, an } → { ci }
       ai = (attribute | field | key) value
       ci = (label | signature | class | hash)
     • Returning devices should be correctly identified within certain tolerances
     • New classes may be created if a good match is not found in the repository of known devices
     • Devices age out, based on data retention policy
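To make that framing concrete, here is a minimal sketch of the mapping from an attribute vector to a class label, assuming hypothetical attribute names (user_agent, screen, tz) and a hash-derived signature; the talk does not disclose the real attribute set or labeling scheme.

```python
import hashlib

def device_signature(attributes: dict) -> str:
    """Derive a class label c_i (signature) from an attribute vector {a_i}.
    In production a match against the repository of known devices is
    attempted first; a new signature is minted only when no good match
    is found, and signatures age out per the data retention policy."""
    canonical = "|".join(f"{k}={attributes[k]}" for k in sorted(attributes))
    return hashlib.sha1(canonical.encode("utf-8")).hexdigest()

# Hypothetical transaction attributes, for illustration only
tx = {"user_agent": "Mozilla/5.0", "screen": "1920x1080", "tz": "-480"}
print(device_signature(tx))  # -> a stable hash serving as the class c_i
```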
  5. Task Challenges
     • Extremely volatile attributes
     • There are no pivot attributes to divide and conquer the search space
     • Changing distributions
     • Emphasis on PRECISION
     • Stringent RESPONSE time
  6. Engineering Challenges
     • Precision (accuracy) and latency (response time) are antagonistic constraints
     • Project management

                       Repository Size (millions)   Load (TPS)   Latency (ms)
       Project start   28                           200          < 100
       Present         280                          300          < 100
       Change          10x                          1.5x         none
  7. Approaches
     • Rules engine
     • Learning models
     • Vector-space models

     Need an enterprise-grade solution!
  8. Rules Engine
     • No experts
     • Number of rules?
     • Maintenance?

     Not a viable approach!
  9. Learning Models
     • Most machine learning methods deal predominantly with binary classification problems (e.g., fraud / not fraud) or a small number of target classes
     • Few exemplars for each class
     • Attribute values may be unbounded
     • Attributes may not follow a natural progression
  10. Learning Models …
     • Unsupervised learning such as clustering methods would make good models, but not good enough to be of practical use; any simplification process will compromise accuracy
     • Ability to explain is critical
     • Tend to ignore domain knowledge

     Challenge in providing an enterprise solution
  11. Thoughts
     • No comparable application with such requirements
     • Build and deploy a classifier that explains itself easily, scales temporally, and offers quick response
     • Use domain knowledge to guide verification
     • Improve the classifier through machine learning methods by analyzing performance in the field
  12. Vector-Space Models
     • Similarity-based search makes the vector-space model a good choice for generating selections
     • Given the volatile nature of the data, information retrieval (IR) systems can adapt easily
     • Good at neighborhood search

     Sensitive to individual attribute changes!
  13. Sources of Inspiration
     • Lucene/Solr features
     • Documentation from (erstwhile) Lucid Imagination
     • Ease with which Lucene/Solr could be installed and explored

     Very short learning curve for novices!
  14. Feature Selection
     • Primitive and derived attributes
     • Entropy
     • Distribution
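As an illustration of the entropy criterion, a hedged sketch that ranks candidate attributes by the Shannon entropy of their observed values; the attribute names and sample data are invented for the example.

```python
from collections import Counter
from math import log2

def entropy(values) -> float:
    """Shannon entropy of an attribute's observed value distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return sum(-(c / total) * log2(c / total) for c in counts.values())

# Invented sample: higher-entropy attributes discriminate devices better
sample = [{"os": "win", "tz": "-480"}, {"os": "win", "tz": "60"},
          {"os": "win", "tz": "120"}]
for field in ("os", "tz"):
    print(field, round(entropy(tx[field] for tx in sample), 3))
# os is constant here (entropy 0); tz varies (entropy ~1.585)
```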
  15. Domain
     • Devices come with structural information but not much grammar or semantics
     • Bag-of-words (single-field) approach is fast but not precise
     • Using all fields is precise, but response is slow

     Now what?
  16. Disjunction Max
     • Matrix of all possible combinations of user input query and document fields
     • Transforms into a Boolean query of DisjunctionMaxQueries, one per row
     • DisjunctionMaxQuery uses the maximum score of its sub-clauses
     • No single term in the user input dominates

     This is needed!
     Src: SearchHub and LucidWorks
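For reference, a minimal sketch of issuing a DisMax query through Solr's standard HTTP API; the collection name ("devices"), field names (a1, a2, a3), and boosts are placeholders, not values from the talk.

```python
import requests

params = {
    "q": "phrase1 phrase2 phrase3",  # user input terms
    "defType": "dismax",             # DisjunctionMaxQuery-based parser
    "qf": "a1^2.0 a2^1.5 a3",        # fields to search, with boosts
    "tie": "0.1",                    # weight of non-maximum sub-clauses
    "mm": "2",                       # minimum clauses that must match
    "rows": "10",
    "wt": "json",
}
resp = requests.get("http://localhost:8983/solr/devices/select", params=params)
print(resp.json()["response"]["numFound"])
```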
  17. DisMax Experiments (index size = 60 million)

     Scenario 1: must-match clauses
     - mm = 2
     - Solr fields = { a1, a2, a3 }
     - Values = { phrase1, phrase2, phrase3 }
     - Latency: YES (35 ms) | Precision: NO (20% failure)

     Scenario 2: should-match clauses
     - mm = 50%
     - Solr fields = { a1 }
     - Values = { term1, term2, term3, …, termn }
     - Latency: NO (> 2 seconds) | Precision: YES (> 98%)
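The two configurations differ mainly in the minimum-match (mm) specification, which Solr accepts as an absolute count or a percentage. Expressed as request parameters, with the placeholder field and value names from the slide:

```python
# Scenario 1: few phrase clauses over several fields, absolute must-match
scenario_1 = {"defType": "dismax", "qf": "a1 a2 a3",
              "q": '"phrase1" "phrase2" "phrase3"', "mm": "2"}

# Scenario 2: many term clauses over one field, percentage should-match
scenario_2 = {"defType": "dismax", "qf": "a1",
              "q": "term1 term2 term3", "mm": "50%"}
```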
  18. Possible Workaround
     • Look-ahead: customize Lucene/Solr to do a branch-and-bound search; bail out on some lower-bound score
     • Minimize candidates for the DisMax search
       - reduce the total number of Solr instances to search
       - reduce the total number of disjunctive terms

     [ Empirical estimate: t(n) = 2 * t(n-1), where t = time and n = number of disjunctive terms ]
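The empirical estimate implies query time roughly doubles with each added disjunctive term, i.e. t(n) = t(1) * 2^(n-1). A quick sketch of the projected growth; the 35 ms base is borrowed from Scenario 1 above as an assumed anchor, not a figure the talk gives for this formula.

```python
t1 = 35.0  # ms for the smallest query (assumed baseline)
for n in range(1, 8):
    print(f"{n} disjunctive terms -> ~{t1 * 2 ** (n - 1):.0f} ms")
# Doubling per term is why pruning disjunctive terms pays off so quickly.
```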
  19. Phrases over Terms
     • Used collocation (a co-occurrence matrix) to determine the most common phrases
     • Delete terms covered by phrases
     • Add stop words based on frequency analysis
     • Ensure precision is preserved through regression tests

     Reduced the number of DisMax terms by 30%
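A toy sketch of the phrase-mining step, using simple bigram co-occurrence counts in place of the full co-occurrence matrix; the documents and threshold are invented.

```python
from collections import Counter

def top_bigrams(docs, min_count=2):
    """Return bigrams frequent enough to treat as phrases."""
    counts = Counter()
    for doc in docs:
        tokens = doc.split()
        counts.update(zip(tokens, tokens[1:]))
    return [" ".join(bg) for bg, c in counts.most_common() if c >= min_count]

docs = ["intel core i7", "intel core i5", "amd ryzen"]
print(top_bigrams(docs))  # -> ['intel core']
# Terms covered by the phrase ("intel", "core") can then be dropped
# from the disjunction, shrinking the DisMax query.
```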
  20. Sources of Inspiration
     • Planning in a Hierarchy of Abstraction Spaces, Artificial Intelligence, Vol. 5, No. 2, pp. 115-135 (1974)
     • Search Reduction in Hierarchical Problem Solving, Proc. of the 9th IJCAI, AAAI Press, Menlo Park, CA (1991)
     • Exceptional Data Quality Using Intelligent Matching and Retrieval, AI Magazine, AAAI Press (Spring 2010)
  21. Hierarchical Matching
     [Architecture diagram. Components: CSV/JSON input, domain-specific patterns, filters, bag-of-words, phrases, models, DisMax query formulator, Solr instances selector, Solr servers, verification]
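The diagram's exact wiring is not recoverable from the transcript, but a plausible reading of the flow (filter cheaply first, then select a Solr instance, formulate the DisMax query, and verify the returned candidates) can be sketched as follows; every function body here is a stand-in for proprietary logic, shown only to make the control flow concrete.

```python
def passes_domain_filters(tx) -> bool:
    """Domain-specific patterns: cheap checks that prune early."""
    return bool(tx.get("user_agent"))

def select_solr_instance(tx, n_instances=4) -> int:
    """Route the query to one partition instead of searching all shards."""
    return hash(tx.get("tz", "")) % n_instances

def formulate_dismax_query(tx) -> dict:
    """Combine phrases and bag-of-words terms into one DisMax request."""
    return {"defType": "dismax", "q": " ".join(map(str, tx.values()))}

def classify(tx, search, verify):
    """search(shard, query) -> candidates; verify(tx, candidates) -> label."""
    if not passes_domain_filters(tx):
        return None                 # no usable signal: mint a new class
    shard = select_solr_instance(tx)
    candidates = search(shard, formulate_dismax_query(tx))  # approximate selection
    return verify(tx, candidates)   # refinement: exact, domain-aware checks
```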
  22. Conflict Resolution
     • The top n candidates are returned from each Solr instance
     • They are ranked by a custom verification module
     • Ties are broken using recency
     • The top candidate is persisted and returned along with a custom score
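A minimal sketch of that resolution step; the candidate fields (score, last_seen) are assumed stand-ins for the custom verification score and the recency signal.

```python
def resolve(per_instance_candidates):
    """Merge top-n lists from all Solr instances; rank by verification
    score, breaking ties with recency (larger last_seen wins)."""
    merged = [c for cands in per_instance_candidates for c in cands]
    merged.sort(key=lambda c: (c["score"], c["last_seen"]), reverse=True)
    return merged[0] if merged else None

winner = resolve([
    [{"id": "d1", "score": 0.92, "last_seen": 1367452800}],
    [{"id": "d2", "score": 0.92, "last_seen": 1367539200}],
])
print(winner["id"])  # -> d2: scores tie, so the more recent device wins
```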
  23. Comments
     • DisMax performs a multidimensional match
     • Extracted multiple filters and arranged them hierarchically
     • Separation of selection and evaluation
       - Selection = approximate solution
       - Evaluation = refinement
  24. Where Time Went …
     • Attribute selection
     • Ranking
     • Optimization
     • Index re-generation
     • Regression testing
  25. Sources for Tune-Up
     • Scaling Solr, Lucene Revolution, May 2011
     • Practical Search with Solr: Beyond Just Looking It Up, Lucid Imagination, May 2010
  26. Testing
     • Precision testing using self and mixed modes
     • Latency tests
       - custom harness for stand-alone tests
       - integrated tests with the JMeter framework
  27. Results
  28. Latency Percentiles
     [Chart: latency percentiles for the original edismax, the initial solution, optimization 1 (filters, focused search, verification), and optimization 2 (domain patterns, stop words, de-dupe)]
  29. TPS
     [Chart: transactions per second]
  30. Response Times over Time
     [Chart: response times over time]
  31. Project Execution
     • Agile methodology
     • Risk mitigation through primary and contingency plans
     • Rapid prototyping followed by good software engineering practices
     • Evaluating DSE (DataStax) & SolrCloud
  32. Gleanings
     • You can classify anything with Lucene/Solr; the lexicon is your own
     • The question is not whether Lucene/Solr can solve a particular classification problem, but whether you can prioritize among the many ways of doing it
     • If you run into a problem, someone has solved it or will solve it in the near future
  33. Gleanings …
     • Deal with accuracy before latency
     • If precision, latency, and scale are all critical to your domain, expect to invest some time in hierarchical abstractions
     • "Index once, run anytime, anywhere" does not apply during development
     • Throwing all data at Lucene/Solr will not work for mission-critical applications
     • Rapid prototyping and willingness to fail
  34. Summary
     Simplify and match at multiple levels of abstraction
  35. Contributors
     • Chandra Mouleeswaran: Research & Prototyping
     • Fang Chen: Research & Prototyping
     • Luke Mertens: Productization & Scalability
     • Brent Pearson: Release Management
     • Tracy Hsu: Precision Testing & QA
     • Srinivas Nayani: Deployment & QA
  36. Comments & Feedback
     Chandra Mouleeswaran
     cmouleeswaran@threatmetrix.com
