Searching conversations with hadoop



  1. Searching Conversations using Hadoop: More than Just Analytics
     Jacques Nadeau, CTO (@intjesus)
     June 13, 2012
  2. Agenda
     ✓ What is YapMap?
     • Fitting Hadoop into your architecture
     • YapMap Approach
       – Crawling
       – Processing
       – Index Generation
       – Results
     • Operations, Getting Started & Questions
  3. What is YapMap?
     • A visual search technology
     • Focused on threaded conversations
     • Built to provide better context and ranking
     • Built on the Hadoop ecosystem for massive scale
     • Two self-funded guys
     • Largest implementation at 650 million automotive docs
  4. Why do this?
     • Discussion forums and mailing lists are the primary home for many hobbies
     • Threaded search sucks
       – No context in the middle of the conversation
  5. How does it work?
     (diagram: a conversation rendered as Post 1 through Post 6)
  6. Conceptual data model
     (diagram: Thread → Sub-thread → Individual post, spanning Posts 1–6)
     • Single thread scattered across many web pages
     • Posts don't necessarily arrive in order
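The data model above can be sketched in a few lines. The class and function names here are illustrative, not YapMap's actual code, and the sub-thread split size is arbitrary; the point is that out-of-order posts get reassembled into ordered sub-threads:

```python
from dataclasses import dataclass

# Hypothetical names: the slide shows the model, not YapMap's real classes.
@dataclass
class Post:
    thread_id: str
    position: int   # position within the thread, derived from the source page
    body: str

def build_subthreads(posts, max_posts=3):
    """Order out-of-order posts and split a long thread into sub-threads."""
    ordered = sorted(posts, key=lambda p: p.position)
    return [ordered[i:i + max_posts] for i in range(0, len(ordered), max_posts)]

# Posts arrive out of order, as they would from a multi-page crawl
posts = [Post("t1", 4, "d"), Post("t1", 1, "a"), Post("t1", 3, "c"),
         Post("t1", 2, "b"), Post("t1", 6, "f"), Post("t1", 5, "e")]
subs = build_subthreads(posts)
# → two sub-threads, each in conversation order
```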
  7. A YapMap search result page (screenshot)
  8. Agenda
     • What is YapMap?
     ✓ Fitting Hadoop into your architecture
     • YapMap Approach
       – Crawling
       – Processing
       – Index Generation
       – Results
     • Operations, Getting Started & Questions
  9. Evolution of Hadoop
     Hadoop Today:
     • Batch analysis system
     • Lacks enterprise features
     • Limited applications, primarily BI & analytics
     • Clusters focused on point use cases
     Hadoop Tomorrow:
     • Real-time enterprise application platform
     • Strong enterprise features (e.g. HA, stability, compatibility)
     • BI, Email/Collaboration, Marketing DW, etc.
     • Shared resource supporting a large number of use cases
  10. Complementary to existing technologies
      Traditional Tools:
      • Glassfish 3.1.2 (EJB & CDI)
      • MySQL
      • RabbitMQ
      • Protobuf
      • Varnish
      • Riak
      Hadoop Additions:
      • Zookeeper
      • HBase
      • MapReduce
      • MapRfs/HDFS
      • Mahout
  11. General architecture
      (diagram: Crawler → Processing Pipeline → Indexing Engine → Results
      Presentation, backed by RabbitMQ, MapReduce, HBase, Riak, HDFS/MapRfs,
      Zookeeper, and MySQL)
  12. Hadoop doesn't solve all problems

                           MySQL                HBase                      Riak
      Primary use          Business management  Storage of crawl data,     Storage of information
                           components           processing pipeline        directly related to
                                                                           presentation
      Key features that    Transactions, SQL,   Consistency, redundancy,   Predictable low latency,
      drove selection      JPA                  memory-to-persistence      full uptime, max one
                                                ratio                      IOP per object
      Average object size  Small                20k                        2k
      Object count         <1 million           500 million                1 billion
      System count         2                    10                         8
      Memory footprint     <1 GB                120 GB                     240 GB
      Dataset size         10 MB                10 TB                      2 TB

      We also evaluated Voldemort and Cassandra.
  13. How we use Hadoop (alternatives considered in parentheses)
      • Zookeeper (vs. Corosync, Accord, JGroups)
        – Distributed locks
        – Cluster membership coordination
        – Index distribution coordination
      • HBase (vs. Teradata, Exadata, sharded MySQL, Cassandra)
        – Primary data store
        – Crawl caching
        – Data merging
        – Processing pipeline
      • MapReduce (vs. MPI, JPPF, clustered EJB)
        – Index generation
      • MapRfs/HDFS (vs. Gluster, SAN/NAS, Lustre)
        – Index storage
      • Mahout (vs. Carrot2, LingPipe, Lexalytics)
        – Cluster identification
  14. YapMap Approach: Crawling
      (pipeline: Crawler → Processing Pipeline → Indexing Engine → Results Presentation)
  15. YapMap crawling challenges
      • Depth versus breadth
      • Crawls must be throttled to avoid overloading sources
      • Avoid duplicate crawling
      • Save progress of long-running crawls
      • Need an elastic and fully distributed approach to crawling
      • Crawler death must be managed
  16. Crawler overview
      1. New crawl job arrives (RabbitMQ)
      2. Crawler checks document cache (HBase)
      3. Crawler acquires domain lock (Zookeeper)
      4. Crawler retrieves external assets
      5. Crawler outputs posts to DFS (using append as necessary)
      6. Crawler generates more crawl tasks
      After reaching time and/or quantity thresholds, the crawl pauses,
      checkpoints in HBase, and resubmits itself to the RabbitMQ queue.
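The checkpoint-and-resubmit control flow above can be sketched with in-memory stand-ins. The queue replaces RabbitMQ, the lock dictionary replaces Zookeeper domain locks, and the checkpoint dictionary replaces the HBase table; all names and the page budget are illustrative:

```python
import collections
import threading

queue = collections.deque()   # stand-in for the RabbitMQ crawl-job queue
domain_locks = {}             # stand-in for Zookeeper domain locks
checkpoints = {}              # stand-in for the HBase checkpoint table

def crawl(job, page_budget=3):
    """Crawl up to page_budget pages, then checkpoint and resubmit the rest."""
    domain, start_page, last_page = job
    lock = domain_locks.setdefault(domain, threading.Lock())
    if not lock.acquire(blocking=False):   # another crawler owns this domain
        queue.append(job)                  # retry later
        return []
    try:
        fetched = []
        page = start_page
        while page <= last_page and len(fetched) < page_budget:
            fetched.append((domain, page))   # fetch page, emit posts
            page += 1
        if page <= last_page:                # budget exhausted: pause crawl
            checkpoints[domain] = page       # checkpoint progress
            queue.append((domain, page, last_page))  # resubmit remainder
        return fetched
    finally:
        lock.release()

queue.append(("forum.example.com", 1, 5))
crawled = []
while queue:
    crawled += crawl(queue.popleft())
# → all 5 pages crawled across two jobs, with one checkpoint in between
```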
  17. YapMap Approach: Processing Pipeline
      (pipeline: Crawler → Processing Pipeline → Indexing Engine → Results Presentation)
  18. Processing pipeline challenges
      • Independent posts => complete threads
      • Split long threads into multiple sub-threads
      • Fully parallel processing pipeline
      • Accommodate out-of-order data
  19. Processing pipeline using HBase
      • Multiple steps with checkpoints to manage failures
      • Idempotent operations at each stage of the process
      • Utilize optimistic locking to do coordinated merges
      • Use regular cleanup scans to pick up lost tasks
      • Control batch size of messages to trade throughput against latency
      • Out-of-order input assumed
      (pipeline: posts from crawler → build thread parts → merge + split
      threads → process & pre-index sub-threads → batch indexing and RT
      indexing, backed by HBase and Riak)
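Optimistic locking plus idempotence is the pair that makes the merge step safe under parallel, duplicated, out-of-order delivery. A minimal sketch, with a dictionary standing in for an HBase table and a `check_and_put` that mimics HBase's compare-and-set semantics (names and schema are illustrative, not YapMap's):

```python
table = {}   # stand-in for an HBase table: row key -> tuple of post ids

def check_and_put(key, expected, new):
    """HBase-style optimistic write: succeed only if the row is unchanged."""
    if table.get(key) == expected:
        table[key] = new
        return True
    return False

def merge_post(thread_key, post):
    """Idempotently merge one post into a thread row; retry on lost races."""
    while True:
        current = table.get(thread_key)
        posts = set(current or ())
        if post in posts:          # idempotent: re-delivery is a no-op
            return
        merged = tuple(sorted(posts | {post}))
        if check_and_put(thread_key, current, merged):
            return                 # write won; otherwise re-read and retry

# Out-of-order and duplicated input, as the pipeline assumes
for p in ["p2", "p1", "p3", "p1"]:
    merge_post("thread:42", p)
# → thread row holds p1, p2, p3 exactly once, in order
```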
  20. YapMap Approach: Index Generation
      (pipeline: Crawler → Processing Pipeline → Indexing Engine → Results Presentation)
  21. Index generation challenges
      • Shard size control
      • Index ordering
      • Maintain inverted and un-inverted data in parallel
      • Minimize merging costs
      • Support multi-grain indexing and scoring
  22. Index shards loosely based on HBase regions
      (diagram: pre-index docs in regions R1–R3 map onto Shards 1–3)
      • HBase primary key order is the same as index order
      • Shards sized based on parallelization requirements
        – Typically ~5 GB each
      • Shards are based on snapshots of splits for data locality
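A hypothetical sizing pass for the scheme above: walk a snapshot of region sizes in primary-key order and cut shard boundaries once the running total reaches the ~5 GB target. The region sizes and target are made up for illustration:

```python
TARGET_GB = 5   # the slide's "typically ~5 GB each"

def plan_shards(region_sizes_gb):
    """Group consecutive regions (in key order) into roughly TARGET_GB shards."""
    shards, current, size = [], [], 0
    for i, gb in enumerate(region_sizes_gb):
        current.append(f"R{i + 1}")
        size += gb
        if size >= TARGET_GB:        # close the shard at a region boundary
            shards.append(current)   # so each shard keeps data locality
            current, size = [], 0
    if current:                      # trailing regions form the last shard
        shards.append(current)
    return shards

# Snapshot of region sizes in GB, in primary-key order
shards = plan_shards([2.0, 2.5, 1.5, 4.0, 3.0, 2.5])
# → three shards, each cut on a region boundary near the 5 GB target
```

Cutting only on region boundaries is what preserves the key-order = index-order property the slide relies on.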
  23. MapReduce for index generation
      • Custom pieces: IndexedTableInputFormat, barrier map split,
        term distribution partitioner, FileAndPutOutputCommitter
      • Map/Reduce passes produce term posting lists and statistics
      • Inverted data characteristics → inverted data, indices & dictionaries (DFS)
      • Un-inverted data characteristics → un-inverted data (HBase)
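A toy version of the map/reduce flow above: mappers emit (term, doc id) pairs while reading documents in key order, a partitioner (standing in for the term distribution partitioner) spreads terms across reducers, and each reducer emits a posting list. All function names here are illustrative, not YapMap's job classes:

```python
from collections import defaultdict

def map_doc(doc_id, text):
    """Map phase: emit one (term, doc_id) pair per distinct term."""
    for term in set(text.lower().split()):
        yield term, doc_id

def partition(term, n_reducers):
    """Stand-in for the term distribution partitioner."""
    return hash(term) % n_reducers

def run_job(docs, n_reducers=2):
    # Shuffle: route each term's pairs to one reducer's bucket
    shuffles = [defaultdict(list) for _ in range(n_reducers)]
    for doc_id, text in sorted(docs.items()):   # key order == index order
        for term, d in map_doc(doc_id, text):
            shuffles[partition(term, n_reducers)][term].append(d)
    # Reduce: each bucket independently emits sorted posting lists
    index = {}
    for shard in shuffles:
        for term, postings in shard.items():
            index[term] = sorted(postings)
    return index

index = run_job({1: "hadoop search", 2: "search threads", 3: "hadoop threads"})
# → posting list per term, e.g. "search" appears in docs 1 and 2
```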
  24. YapMap Approach: Results Presentation
      (pipeline: Crawler → Processing Pipeline → Indexing Engine → Results Presentation)
  25. Presentation layer challenges
      • Distributed search tree
      • High-performance index loading and serving
      • No SPOF
      • Effective memory management & allocation
      • Automatic cluster management
      • Smart index distribution
  26. Results presentation layer
      Results server:
      1. Request arrives
      2. Query Zookeeper for active servers
      3. Fan out the request, consolidate responses
      4. Retrieve assets (Riak)
      5. Respond
      Index server:
      1. Load shard profile & configure memory
      2. Parallel load and integrate shard (from HBase and DFS)
      3. Shard daemons register new shard availability (Zookeeper)
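The fan-out/consolidate step can be sketched with a thread pool querying every registered shard in parallel and merging partial hits by score. The shard registry and contents below are made up for illustration; in the real system each call would be an RPC to an index server discovered via Zookeeper:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical shard registry: shard name -> term -> [(doc_id, score)]
SHARDS = {
    "shard-1": {"hadoop": [("doc1", 0.9)]},
    "shard-2": {"hadoop": [("doc7", 0.7), ("doc3", 0.95)]},
    "shard-3": {},
}

def search_shard(shard_name, term):
    """One partial search; stands in for an RPC to an index server."""
    return SHARDS[shard_name].get(term, [])

def fan_out(term, top_k=2):
    """Query all shards in parallel, then consolidate by descending score."""
    with ThreadPoolExecutor(max_workers=len(SHARDS)) as pool:
        partials = pool.map(search_shard, SHARDS, [term] * len(SHARDS))
    merged = [hit for part in partials for hit in part]
    return sorted(merged, key=lambda h: h[1], reverse=True)[:top_k]

results = fan_out("hadoop")
# → the two best-scoring hits across all shards
```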
  27. Agenda
      • What is YapMap?
      • Fitting Hadoop into your architecture
      • YapMap Approach
        – Crawling
        – Processing
        – Index Generation
        – Results
      ✓ Operations, Getting Started & Questions
  28. Operations
      • Hardware
        – Supermicro with 8-core low-power chips, low-power DDR3
        – WD Black 2 TB drives
        – DDR InfiniBand using IPoIB for index-loading performance
      • Software
        – Started on Cloudera, switched to MapR's M3 distribution of Hadoop
          • GC was painful, now manageable
        – HBase now supports MSLAB for writes and off-heap block cache to
          support larger memory usage
        – Shard servers utilize large pages to minimize fragmentation
        – Shard servers do immediate large allocations to minimize GC problems
  29. Getting started
      • Amazon Elastic MapReduce
        – The Common Crawl dataset is a great data set to start with
      • Cheap old-gen cluster if you want to run things like HBase
        – We built an effective 6-node Hadoop/HBase cluster for $1500
          (Craigslist, eBay)
        – Mailing lists are littered with performance and interconnectivity
          challenges when using cloud computing resources for Hadoop
  30. Questions
      • Why not Lucene/Solr/ElasticSearch/Katta/etc.?
        – Not built to work well with Hadoop and HBase (… is first to tackle
          this head on)
        – Data locality between threads and posts to do document-at-once scoring
      • Why not store indices directly in HBase?
        – Single-cell storage would be the only way to do it efficiently
        – No such thing as a single-cell no-read append (HBASE-5993)
        – No single-cell partial read
      • Why use Riak for the presentation side?
        – Hadoop SPOF
        – Even with newer Hadoop versions, HBase does not do sub-second
          row-level HA on node failure (HBASE-2357)
        – Riak has more predictable latency
      • Why did you switch to MapR?
        – Index load performance was substantially faster
        – Snapshots in the trial copy were nice for those 30 days
        – Less impact on HBase performance