Searching Conversations using Hadoop: More than Just Analytics

Jacques Nadeau, CTO
jacques@yapmap.com
@intjesus

June 13, 2012
  
Agenda
 ✓  What is YapMap?
 •  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 •  Operations, Getting Started & Questions
  
What is YapMap?
 •  A visual search technology
 •  Focused on threaded conversations
 •  Built to provide better context and ranking
 •  Built on the Hadoop ecosystem for massive scale
 •  Two self-funded guys
 •  Motoyap.com (www.motoyap.com) is the largest implementation, at 650 million automotive docs
  
Why do this?
 •  Discussion forums and mailing lists are the primary home for many hobbies
 •  Threaded search sucks
    –  No context in the middle of the conversation
  
How does it work?
    Post 1
    Post 2
        Post 3
            Post 4
    Post 5
        Post 6
  
Conceptual data model
 [Figure: posts laid out as a tree, with callouts "Thread", "Sub-thread" and "Individual post"]
    Post 1
    Post 2
        Post 3
            Post 4
    Post 5
        Post 6
 •  Single thread scattered across many web pages
 •  Posts don’t necessarily arrive in order
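A minimal sketch of how posts that arrive out of order could still be assembled into the indented tree above. It assumes each post carries a `parent_id` pointing at the post it replies to, and that the indentation on the slide denotes reply nesting; both are assumptions, as are the names `build_thread` and `parent_id`.

```python
from collections import defaultdict


def build_thread(posts):
    """Assemble (post_id, parent_id) pairs, given in any order, into an indented outline.

    parent_id is None for top-level posts. Hypothetical model, not YapMap's schema.
    """
    children = defaultdict(list)
    for post_id, parent_id in posts:
        children[parent_id].append(post_id)

    def render(parent_id, depth):
        lines = []
        for post_id in sorted(children[parent_id]):
            lines.append("    " * depth + f"Post {post_id}")
            lines.extend(render(post_id, depth + 1))
        return lines

    return render(None, 0)


# Posts deliberately supplied out of order.
arrivals = [(4, 3), (1, None), (6, 5), (3, 2), (5, None), (2, None)]
outline = build_thread(arrivals)
print("\n".join(outline))
```

Because the tree is only rendered after all known posts are grouped by parent, arrival order does not affect the result.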
  
A YapMap search result page
  
Agenda
 •  What is YapMap?
 ✓  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 •  Operations, Getting Started & Questions
  
Evolution of Hadoop

 Hadoop Today                                 Hadoop Tomorrow
 •  Batch analysis system                     •  Real-time enterprise application platform
 •  Lacks enterprise features                 •  Strong enterprise features
    (e.g. HA, stability, compat)
 •  Limited applications,                     •  BI, Email/Collaboration, Marketing DW, etc.
    primarily BI & analytics
 •  Clusters focused on point use cases       •  Shared resource supporting a large number
                                                 of use cases
  
Complementary to existing technologies

 Traditional Tools                 Hadoop Additions
 •  Glassfish 3.1.2 (EJB & CDI)    •  Zookeeper
 •  MySQL                          •  HBase
 •  RabbitMQ                       •  MapReduce
 •  Protobuf                       •  MapRfs/HDFS
 •  Varnish                        •  Mahout
 •  Riak
  
General architecture
 [Diagram]
 •  Stages, left to right: Crawler, Processing Pipeline, Indexing Engine, Results Presentation
 •  Above the stages: RabbitMQ, MapReduce
 •  Below the stages: HBase, Riak, HDFS/MapRfs, Zookeeper
 •  At either end: MySQL
  
Hadoop doesn’t solve all problems

                       MySQL                  HBase                          Riak
 Primary Use           Business management    Storage of crawl data,         Storage of components
                       information            processing pipeline            directly related to
                                                                             presentation
 Key features that     Transactions, SQL,     Consistency, redundancy,       Predictable low latency,
 drove selection       JPA                    memory to persistence ratio    full uptime, max one
                                                                             IOP per object
 Average Object Size   Small                  20k                            2k
 Object Count          <1 million             500 million                    1 billion
 System Count          2                      10                             8
 Memory Footprint      <1gb                   120gb                          240gb
 Dataset Size          10mb                   10tb                           2tb

 We also evaluated Voldemort and Cassandra

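As a quick sanity check, the dataset sizes in the table are consistent with average object size multiplied by object count. The sketch below assumes decimal units (1k = 1,000 bytes, 1tb = 10^12 bytes) and that the memory footprint and dataset size are cluster-wide totals; both are assumptions, since the slide does not say.

```python
# Dataset size ~ average object size x object count (decimal units assumed).
def dataset_size_tb(avg_object_bytes: int, object_count: int) -> float:
    return avg_object_bytes * object_count / 10**12


hbase_tb = dataset_size_tb(20_000, 500_000_000)    # 20k objects, 500 million of them
riak_tb = dataset_size_tb(2_000, 1_000_000_000)    # 2k objects, 1 billion of them

# Memory footprint relative to dataset size, using the table's gb and tb figures.
hbase_memory_ratio = 120 / (hbase_tb * 1000)
riak_memory_ratio = 240 / (riak_tb * 1000)

print(hbase_tb, riak_tb)
print(f"{hbase_memory_ratio:.1%} {riak_memory_ratio:.1%}")
```

Under these assumptions HBase holds roughly a tenth as much memory per byte stored as Riak, which fits the slide's "memory to persistence ratio" entry for HBase and "max one IOP per object" entry for Riak.
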
How we use Hadoop
 •  Zookeeper                                  •  Corosync, Accord, JGroups
    –  Distributed Locks
    –  Cluster membership coordination
    –  Index distribution coordination
 •  HBase                                      •  Teradata, Exadata, sharded MySQL, Cassandra
    –  Primary Data store
    –  Crawl Caching
    –  Data merging
    –  Processing Pipeline
 •  MapReduce                                  •  MPI, JPPF, Clustered EJB
    –  Index generation
 •  MapRfs/HDFS                                •  Gluster, SAN/NAS, Lustre
    –  Index storage
 •  Mahout                                     •  Carrot2, Lingpipe, Lexalytics
    –  Cluster identification
  
YapMap Approach: Crawling
 Crawler | Processing Pipeline | Indexing Engine | Results Presentation
  
YapMap crawling challenges
 •  Depth versus breadth
 •  Crawls must be throttled to avoid overloading
 •  Avoid duplicate crawling
 •  Save progress of long-running crawls
 •  Need an elastic and fully distributed approach to crawling
 •  Crawler death must be managed
  
Crawler overview
 [Diagram: a Crawler surrounded by RabbitMQ, HBase, Zookeeper and DFS]
    1. New crawl job arrives (RabbitMQ)
    2. Crawler checks document cache (HBase)
    3. Crawler acquires domain lock (Zookeeper)
    4. Crawler retrieves external assets
    5. Crawler outputs posts, using append as necessary (DFS)
    6. Crawler generates more crawl tasks
 •  After achieving time and/or quantity thresholds, crawl pauses, checkpoints in HBase
    and resubmits to RabbitMQ queue
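The crawl steps above can be sketched as a single function. This is an illustrative, in-memory model only: the `Cluster` fields stand in for RabbitMQ, HBase, Zookeeper and the DFS, and every name here (`Cluster`, `run_crawl_job`, `fetch`, `max_pages`) is hypothetical rather than YapMap's code. For simplicity the sketch takes the domain lock first and checks the document cache per URL, and it uses a page-count threshold in place of the time and/or quantity thresholds.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """In-memory stand-ins for the real services (all names hypothetical)."""
    queue: deque = field(default_factory=deque)       # RabbitMQ crawl queue
    doc_cache: set = field(default_factory=set)       # HBase document cache
    checkpoints: dict = field(default_factory=dict)   # HBase checkpoint rows
    domain_locks: set = field(default_factory=set)    # Zookeeper domain locks
    dfs: list = field(default_factory=list)           # append-only DFS output


def run_crawl_job(cluster, domain, fetch, max_pages):
    """Crawl one domain until done or until max_pages is reached, then checkpoint."""
    if domain in cluster.domain_locks:          # step 3: another crawler owns the domain
        cluster.queue.append(domain)
        return "locked"
    cluster.domain_locks.add(domain)
    try:
        frontier = deque(cluster.checkpoints.get(domain, []))
        fetched = 0
        while frontier and fetched < max_pages:
            url = frontier.popleft()
            if url in cluster.doc_cache:        # step 2: skip already crawled documents
                continue
            post, links = fetch(url)            # step 4: retrieve the page
            cluster.doc_cache.add(url)
            cluster.dfs.append((url, post))     # step 5: append post to output
            frontier.extend(links)              # step 6: generate more crawl tasks
            fetched += 1
        cluster.checkpoints[domain] = list(frontier)   # checkpoint remaining work
        if frontier:
            cluster.queue.append(domain)        # threshold hit: resubmit to the queue
            return "paused"
        return "done"
    finally:
        cluster.domain_locks.discard(domain)    # always release the domain lock


# A three-page toy site: url -> (post text, outgoing links).
pages = {"a": ("post a", ["b", "c"]), "b": ("post b", ["c"]), "c": ("post c", [])}
cluster = Cluster(checkpoints={"example.org": ["a"]})
first = run_crawl_job(cluster, "example.org", pages.get, max_pages=2)
second = run_crawl_job(cluster, "example.org", pages.get, max_pages=2)
print(first, second, [url for url, _ in cluster.dfs])
```

Keeping the frontier in the checkpoint means a crawler that dies or pauses loses at most the work since the last checkpoint, and the cache check makes a repeated URL harmless.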
  
YapMap Approach: Processing Pipeline
 Crawler | Processing Pipeline | Indexing Engine | Results Presentation
  
Processing pipeline challenges
 •  Independent posts => complete threads
 •  Split long threads into multiple sub-threads
 •  Fully parallel processing pipeline
 •  Accommodate out-of-order data
  
Processing pipeline using HBase
 •  Multiple steps with checkpoints to manage failures
 •  Idempotent operations at each stage of process
 •  Utilize optimistic locking to do coordinated merges
 •  Use regular cleanup scans to pick up lost tasks
 •  Control batch size of messages to control throughput versus latency
 •  Out-of-order input assumed

 [Diagram]
 •  Stages: Posts from Crawler, Build thread parts, Merge + split threads,
    Process & pre-index sub-threads, then Batch Indexing and RT Indexing
 •  Hand-offs between the early stages are labelled "Message"
 •  Backing stores: HBase, Riak
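A minimal sketch of a coordinated merge using optimistic locking, combined with an idempotent write so that replays and out-of-order input are harmless. HBase exposes an atomic check-and-put on a cell; here `VersionedTable` is a tiny in-memory stand-in for that primitive, and the row layout, `merge_post` and its parameters are assumptions for illustration.

```python
import threading


class VersionedTable:
    """In-memory stand-in for a table with an atomic compare-and-set per row."""

    def __init__(self):
        self._rows = {}
        self._mutex = threading.Lock()

    def get(self, key):
        with self._mutex:
            return self._rows.get(key, (0, {}))

    def check_and_put(self, key, expected_version, value):
        """Write only if nobody else has written since expected_version was read."""
        with self._mutex:
            current_version, _ = self._rows.get(key, (0, {}))
            if current_version != expected_version:
                return False
            self._rows[key] = (current_version + 1, value)
            return True


def merge_post(table, thread_id, post_id, parent_id, body, max_retries=10):
    """Merge one post into its thread row; safe to replay and to run concurrently."""
    for _ in range(max_retries):
        version, posts = table.get(thread_id)
        if posts.get(post_id) == (parent_id, body):
            return True                      # already merged: idempotent no-op
        merged = dict(posts)                 # never mutate the stored value in place
        merged[post_id] = (parent_id, body)
        if table.check_and_put(thread_id, version, merged):
            return True                      # our read was still current
    return False                             # left for a later cleanup scan


table = VersionedTable()
merge_post(table, "thread-1", 3, 2, "reply")     # arrives before its parent
merge_post(table, "thread-1", 2, None, "root")
merge_post(table, "thread-1", 3, 2, "reply")     # duplicate delivery
version, posts = table.get("thread-1")
print(version, sorted(posts))
```

A merge that keeps losing the race simply returns `False`, which is the kind of lost task the cleanup scans on the slide would pick up.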
  
YapMap Approach: Index Generation
 Crawler | Processing Pipeline | Indexing Engine | Results Presentation
  
Index generation challenges
 •  Shard size control
 •  Index ordering
 •  Maintain inverted and un-inverted data in parallel
 •  Minimize merging costs
 •  Support multi-grain indexing and scoring
  
Index Shards loosely based on HBase regions
 •  HBase primary key order is same as index order
 •  Shards sized based on parallelization requirements
    –  Typically ~5gb each
 •  Shards are based on snapshots of splits for data locality

 [Figure: pre-index docs in regions R1, R2, R3 map to Shard 1, Shard 2, Shard 3]
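One way to picture shard planning is a greedy pass over regions in row-key order, closing a shard once the size target would be exceeded. This is a sketch of the general idea only: the greedy packing rule, `plan_shards`, and the region sizes are assumptions, and the slide's figure may equally describe a one-to-one mapping of regions to shards.

```python
def plan_shards(regions, target_bytes):
    """Group consecutive regions into shards without reordering them.

    regions: list of (start_key, size_bytes), already sorted by start_key.
    Keeping regions in key order keeps index order equal to primary key order.
    """
    shards, current, current_size = [], [], 0
    for start_key, size in regions:
        if current and current_size + size > target_bytes:
            shards.append(current)           # close the shard before it overflows
            current, current_size = [], 0
        current.append(start_key)
        current_size += size
    if current:
        shards.append(current)
    return shards


GB = 10**9
regions = [("a", 2 * GB), ("g", 2 * GB), ("m", 3 * GB), ("t", 1 * GB)]
shards = plan_shards(regions, target_bytes=5 * GB)
print(shards)
```

A region larger than the target still gets a shard of its own, so the target acts as a soft limit rather than a hard cap.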
  
MapReduce for Index Generation
 [Diagram]
 •  Input: IndexedTableInputFormat
 •  Map side: Map tasks, Term Distribution Partitioner, Barrier Map Split, Statistics
 •  Shuffle: Term: Posting Lists
 •  Reduce side: Reduce tasks, FileAndPutOutputCommitter
 •  Outputs: Un-inverted Data, un-inverted data characteristics, Inverted Indices & dictionaries,
    inverted data characteristics
 •  Stores: HBase, DFS
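The map, partition and reduce roles in the diagram can be illustrated with a small single-process sketch that builds term-to-posting-list indices. It assumes the term partitioner assigns contiguous term ranges to reducers via split points; that behaviour, and the names `map_doc`, `partition`, `build_index` and `split_points`, are assumptions for illustration and not taken from the slide.

```python
import bisect
from collections import defaultdict


def map_doc(doc_id, text):
    """Map step: emit (term, (doc_id, position)) for every token."""
    for position, term in enumerate(text.lower().split()):
        yield term, (doc_id, position)


def partition(term, split_points):
    """Range partitioner: terms below split_points[0] go to reducer 0, and so on."""
    return bisect.bisect_right(split_points, term)


def build_index(docs, split_points):
    """Run map, partition and reduce in-process; returns one index per reducer."""
    reducers = [defaultdict(list) for _ in range(len(split_points) + 1)]
    for doc_id, text in docs:                       # docs arrive in row-key order
        for term, posting in map_doc(doc_id, text):
            reducers[partition(term, split_points)][term].append(posting)
    # Reduce step: each reducer emits its terms in sorted order.
    return [dict(sorted(reducer.items())) for reducer in reducers]


docs = [(1, "hadoop hbase"), (2, "hbase riak hbase")]
index = build_index(docs, split_points=["i"])
print(index)
```

Because documents are read in key order, each posting list comes out sorted by document id without a separate sort.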
  
YapMap Approach: Results Presentation
 Crawler | Processing Pipeline | Indexing Engine | Results Presentation
  
Presentation Layer Challenges
 •  Distributed search tree
 •  High performance index loading and serving
 •  No SPOF
 •  Effective memory management & allocation
 •  Automatic cluster management
 •  Smart index distribution
  
Results Presentation Layer
 [Diagram: Results Servers in front of Index Servers, each Index Server hosting several Shard Daemons;
  supporting services: Riak, Zookeeper, HBase, DFS]

 Query path (Results Server)
    1. Request
    2. Query Zookeeper for active servers
    3. Fan-out request, consolidate responses
    4. Retrieve assets
    5. Response

 Shard loading (Index Server)
    1. Load shard profile & configure memory
    2. Parallel load and integrate shard
    3. Register new shard availability
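The fan-out and consolidate step of the query path can be sketched with a thread pool and a top-k merge. The list of shard callables stands in for the active servers that would be looked up in Zookeeper, and the `(score, doc_id)` hit format, `search` and `top_k` are assumptions made for this illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search(query, active_shards, top_k):
    """Send the query to every active shard in parallel and merge the best hits.

    active_shards: list of callables, each mapping a query to [(score, doc_id), ...].
    """
    with ThreadPoolExecutor(max_workers=max(1, len(active_shards))) as pool:
        partials = list(pool.map(lambda shard: shard(query), active_shards))
    hits = [hit for partial in partials for hit in partial]
    return heapq.nlargest(top_k, hits)      # highest score first


def shard_a(query):
    return [(0.9, "a1"), (0.2, "a2")]


def shard_b(query):
    return [(0.7, "b1")]


results = search("brake pads", [shard_a, shard_b], top_k=2)
print(results)
```

Each shard only needs to return its own best hits, so the consolidation cost on the Results Server grows with the number of shards rather than with index size.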
  
Agenda
 •  What is YapMap?
 •  Fitting Hadoop into your architecture
 •  YapMap Approach
    –  Crawling
    –  Processing
    –  Index Generation
    –  Results
 ✓  Operations, Getting Started & Questions
  
Operations
 •  Hardware
    –  Supermicro with 8 core low power chips, low power DDR3
    –  WD Black 2TB drives
    –  DDR Infiniband using IPoIB for index loading performance
 •  Software
    –  Started on Cloudera, switched to MapR’s M3 distribution of Hadoop
 •  GC was painful, now manageable
    –  HBase now supports MSLAB for writes and off-heap block cache to support larger memory usage
    –  Shard servers utilize large pages to minimize fragmentation
    –  Shard servers do immediate large allocations to minimize GC problems
  
Getting Started
 •  Amazon Elastic MapReduce
    –  Common Crawl dataset is a great data set to start with
 •  Cheap old-gen cluster if you want to run things like HBase
    –  We built an effective 6 node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
    –  Mailing lists are littered with performance and interconnectivity challenges
       when using cloud computing resources to do Hadoop stuff
  
Questions
 •  Why not Lucene/Solr/ElasticSearch/Katta/etc?
    –  Not built to work well with Hadoop and HBase (Blur.io is first to tackle this head on)
    –  Data locality between threads and posts to do document-at-once scoring
 •  Why not store indices directly in HBase?
    –  Single cell storage would be the only way to do it efficiently
    –  No such thing as a single cell no-read append (HBASE-5993)
    –  No single cell partial read
 •  Why use Riak for presentation side?
    –  Hadoop SPOF
    –  Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
    –  Riak has more predictable latency
 •  Why did you switch to MapR?
    –  Index load performance was substantially faster
    –  Snapshots in trial copy were nice for those 30 days
    –  Less impact on HBase performance
  
