Clouds, Search or HLT The 'forecast'?
Benson Margulies
Executive Vice President and Chief Technology Officer

Basis Technology – Human Language Technology Conference 2012   1




Meteorology - or - Why Clouds

•  Lie on the grass and look up at the clouds
      •  Everyone sees something different

•  Computerized Clouds are no different
      •  Applications Always Available
      •  Data Always Available
      •  Tools for Processing Big Data


Big Data and Clouds =~ Hadoop

•  It's not just a matter of size
•  Hadoop ...
      o    Takes in structured data sets
      o    Optimizes stateless, batch processes
      o    Moves computation to data
•  All of which is great if that's what you have
•  The world is more complicated than that
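
The "stateless, batch" shape that suits Hadoop can be sketched in a few lines. This is plain Python illustrating the map/reduce pattern, not the Hadoop API; the function names and the word-count task are assumptions chosen for illustration, and the sketch does not model data locality.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    """Stateless map step: one input record in, (key, value) pairs out."""
    return [(word.lower(), 1) for word in record.split()]

def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Stateless reduce step: all values for one key in, one result out."""
    return key, sum(values)

def run_batch(records):
    """Run the whole batch: map every record, group by key, reduce each group."""
    mapped = chain.from_iterable(map_phase(r) for r in records)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())

counts = run_batch(["clouds and search", "search and HLT"])
```

Neither step remembers anything between calls, which is why the work can be split across machines. A task that needs shared, changing state does not fit this shape so neatly.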
  

What it Doesn't Do So Easily

•  On-the-fly (non-batch) processing
•  Stateful, non-local processing
•  For example, consider a search engine
      o    All about online: a document arrives, users want to find it.
      o    All about global state: relevancy involves global data across the whole index.
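
One way to see the "global state" point: inverse document frequency (IDF), a common ingredient of relevance scoring, depends on counts over the whole collection. The sketch below assumes the textbook formula log(N / df) and two made-up shards; it is not the scoring of any particular engine.

```python
import math

def idf(doc_freq, num_docs):
    """Textbook inverse document frequency: log(N / df)."""
    return math.log(num_docs / doc_freq)

# Hypothetical shards: each shard is a list of documents, each document a set of terms.
shard_a = [{"cloud", "search"}, {"cloud", "hadoop"}, {"cloud"}]
shard_b = [{"search"}, {"hadoop"}, {"cloud", "solr"}]

def doc_freq(docs, term):
    """Number of documents that contain the term."""
    return sum(1 for d in docs if term in d)

def local_idf(shard, term):
    """IDF computed from one shard's statistics only."""
    return idf(doc_freq(shard, term), len(shard))

def global_idf(shards, term):
    """IDF computed from statistics over every shard."""
    all_docs = [d for s in shards for d in s]
    return idf(doc_freq(all_docs, term), len(all_docs))

local_a = local_idf(shard_a, "cloud")               # log(3/3)
local_b = local_idf(shard_b, "cloud")               # log(3/1)
overall = global_idf([shard_a, shard_b], "cloud")   # log(6/4)
```

The same term gets a different weight on each shard and a third weight globally, so scores computed locally are not directly comparable without sharing statistics.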
  




More on Search-in-a-Cloud

•  Good News: 'conventional' technologies scale to very large indices.
      o    Solr
      o    SolrCloud
      o    Elastic Search
      o    ...
•  How? Shards.
      o    'hash' to split docs
      o    queries go everywhere
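
The two shard bullets can be sketched as follows: a stable hash of the document id picks the shard at index time, and a query is sent to every shard and the partial results merged. This is a toy in-memory model, not how Solr or any other engine is implemented; `NUM_SHARDS`, `Shard`, `index`, and `search` are hypothetical names, and ranking is by raw term count only.

```python
import hashlib
import heapq

NUM_SHARDS = 3  # hypothetical cluster size

def shard_for(doc_id):
    """Pick a shard from a stable hash of the document id."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

class Shard:
    def __init__(self):
        self.docs = {}  # doc_id -> list of lowercased words

    def add(self, doc_id, text):
        self.docs[doc_id] = text.lower().split()

    def search(self, term, k):
        """Local top-k as (term count, doc_id) pairs."""
        hits = [(words.count(term), doc_id)
                for doc_id, words in self.docs.items() if term in words]
        return heapq.nlargest(k, hits)

shards = [Shard() for _ in range(NUM_SHARDS)]

def index(doc_id, text):
    """'hash' to split docs: each document lives on exactly one shard."""
    shards[shard_for(doc_id)].add(doc_id, text)

def search(term, k=10):
    """Queries go everywhere: ask every shard, then merge the partial results."""
    partial = [hit for s in shards for hit in s.search(term.lower(), k)]
    return [doc_id for _, doc_id in heapq.nlargest(k, partial)]

index("d1", "cloud cloud search")
index("d2", "cloud")
index("d3", "hadoop")
results = search("cloud")
```

Indexing touches one shard, but every query touches all of them, so query cost grows with the number of shards.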
  

Search-in-a-Cloud less good news

•  Alternatives are still:
      o    Limited
      o    Research
      o    or both
•  Solandra
      o    Scaling via Cassandra
      o    'just another sharded solution'
      o    Just the thing if you like Cassandra
•  or Accumulo
      o    So far, very basic inverted index
      o    better things coming
Other HLT tasks ...

•  'Extraction' is 'straightforward'
•  Text comes in, entities or relationships come out.
•  Results end up in graph DB or bigtable or ...
•  Scale via Hadoop or whatever
•  The Challenge of Mixing and Matching
•  But ... what if you want a feedback loop?



Interoperation

•  Lots of focus on applications
      o    e.g. Ozone Widgets
•  Not so much on backend processes
•  What good is 'data everywhere' if:
      o    you can't deploy processing to exploit it?
      o    you can't fit together pieces of the puzzle?
•  A stovepipe in a cloud is still........
•  A stovepipe
Harder Unstructured Problems

•  Imagine you wanted to cluster ...
•  New items show up
•  Need to find 'best' existing cluster
      o    It could be 'anywhere'
•  Need to update to reflect each new item
•  (If you're wondering what we're clustering ...)
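
The steps above (a new item arrives, find the best existing cluster, update it) might look like this in the simplest case. The sketch assumes items are numeric vectors, "best" means nearest centroid by Euclidean distance, and a hypothetical `threshold` decides when to open a new cluster; the slides do not say which clustering method is meant.

```python
import math

class OnlineClusters:
    def __init__(self, threshold):
        self.threshold = threshold  # hypothetical cutoff for opening a new cluster
        self.centroids = []         # one running-mean vector per cluster
        self.counts = []            # number of items absorbed by each cluster

    def add(self, item):
        """Assign item to its nearest cluster, or open a new one; return the index."""
        best, best_dist = None, math.inf
        # The best cluster could be 'anywhere', so every centroid is examined.
        for i, centroid in enumerate(self.centroids):
            d = math.dist(item, centroid)
            if d < best_dist:
                best, best_dist = i, d
        if best is None or best_dist > self.threshold:
            self.centroids.append([float(x) for x in item])
            self.counts.append(1)
            return len(self.centroids) - 1
        # Update the chosen cluster's state to reflect the new item (running mean).
        self.counts[best] += 1
        n = self.counts[best]
        self.centroids[best] = [c + (x - c) / n
                                for c, x in zip(self.centroids[best], item)]
        return best

clusters = OnlineClusters(threshold=1.0)
first = clusters.add((0.0, 0.0))
second = clusters.add((0.5, 0.0))
third = clusters.add((10.0, 10.0))
```

Each arrival both reads and writes shared state (the centroids), which is exactly the stateful, non-local pattern that a stateless batch job handles poorly.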
  


Rosette Concrete Examples

•  Straight Search
      o    Rosette Solr Plugins work all the same
      o    SolrCloud hashes/shards
      o    Rosette runs on the target node

•  Extraction and similar processes
      o    Same story, using Update Request Processor




Rosette and Hadoop

•  Stateless APIs lead to simple implementation
•  Non-code resources lead to some issues
•  Stateful processes (e.g. RNI) ... back to Solr





A Lightning Introduction To Clouds & HLT - Human Language Technology Conference
