Searching conversations with Hadoop
Presentation Transcript

    • Searching Conversations using Hadoop: More than Just Analytics
      Jacques Nadeau, CTO
      jacques@yapmap.com, @intjesus
      June 13, 2012
    • Agenda
      ✓ What is YapMap?
      • Fitting Hadoop into your architecture
      • YapMap Approach
        – Crawling
        – Processing
        – Index Generation
        – Results
      • Operations, Getting Started & Questions
    • What is YapMap?
      • A visual search technology
      • Focused on threaded conversations
      • Built to provide better context and ranking
      • Built on the Hadoop ecosystem for massive scale
      • Two self-funded guys
      • Motoyap.com is the largest implementation, at 650mm automotive docs (www.motoyap.com)
    • Why do this?
      • Discussion forums and mailing lists are the primary home for many hobbies
      • Threaded search sucks
        – No context in the middle of the conversation
    • How does it work? (diagram: Post 1 through Post 6)
    • Conceptual data model (diagram: a Thread contains Sub-threads, which contain individual Posts 1–6)
      • Single thread scattered across many web pages
      • Posts don’t necessarily arrive in order
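The data model above can be sketched in a few lines. This is a minimal illustration with our own hypothetical names (`Post`, `Thread`, `sub_threads`), not YapMap's actual code: posts arriving in any order are keyed by position, then re-sorted and grouped into fixed-size sub-threads.

```python
# Hypothetical sketch of the thread/sub-thread model (names are ours).
from dataclasses import dataclass, field

@dataclass
class Post:
    position: int   # ordinal of the post within its thread
    body: str

@dataclass
class Thread:
    posts: dict = field(default_factory=dict)  # position -> Post

    def add(self, post):
        # Idempotent insert: re-crawling the same post is harmless.
        self.posts[post.position] = post

    def sub_threads(self, size=3):
        # Posts are re-ordered by position regardless of arrival order,
        # then grouped into sub-threads of `size` posts each.
        ordered = [self.posts[k] for k in sorted(self.posts)]
        return [ordered[i:i + size] for i in range(0, len(ordered), size)]

t = Thread()
for pos in (5, 1, 3, 2, 6, 4):           # posts arrive out of order
    t.add(Post(pos, f"post {pos}"))
subs = t.sub_threads(size=3)
# two sub-threads: posts 1-3 and posts 4-6
```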
    • A  YapMap  search  result  page  
    • Agenda
      • What is YapMap?
      ✓ Fitting Hadoop into your architecture
      • YapMap Approach
        – Crawling
        – Processing
        – Index Generation
        – Results
      • Operations, Getting Started & Questions
    • Evolution of Hadoop
      Hadoop Today:
      • Batch analysis system
      • Lacks enterprise features
      • Limited applications, primarily BI & analytics
      • Clusters focused on point use cases
      Hadoop Tomorrow:
      • Real-time enterprise application platform
      • Strong enterprise features (e.g. HA, stability, compat)
      • BI, email/collaboration, marketing DW, etc.
      • Shared resource supporting a large number of use cases
    • Complementary to existing technologies
      Traditional Tools: Glassfish 3.1.2 (EJB & CDI), MySQL, RabbitMQ, Protobuf, Varnish, Riak
      Hadoop Additions: Zookeeper, HBase, MapReduce, MapRfs/HDFS, Mahout
    • General architecture (diagram): Crawler → Processing Pipeline → Indexing → Results Engine → Presentation, built on RabbitMQ, MapReduce, HBase, Riak, HDFS/MapRfs, Zookeeper, and MySQL
    • Hadoop doesn’t solve all problems

      |                                   | MySQL                          | HBase                                                | Riak                                                         |
      |-----------------------------------|--------------------------------|------------------------------------------------------|--------------------------------------------------------------|
      | Primary use                       | Business management components | Storage of crawl data, processing pipeline           | Storage of information directly related to presentation      |
      | Key features that drove selection | Transactions, SQL, JPA         | Consistency, redundancy, memory-to-persistence ratio | Predictable low latency, full uptime, max one IOP per object |
      | Average object size               | Small                          | 20k                                                  | 2k                                                           |
      | Object count                      | <1 million                     | 500 million                                          | 1 billion                                                    |
      | System count                      | 2                              | 10                                                   | 8                                                            |
      | Memory footprint                  | <1 GB                          | 120 GB                                               | 240 GB                                                       |
      | Dataset size                      | 10 MB                          | 10 TB                                                | 2 TB                                                         |

      We also evaluated Voldemort and Cassandra.
    • How we use Hadoop (alternatives we considered in parentheses)
      • Zookeeper (vs. Corosync, Accord, JGroups)
        – Distributed locks
        – Cluster membership coordination
        – Index distribution coordination
      • HBase (vs. Teradata, Exadata, sharded MySQL, Cassandra)
        – Primary data store
        – Crawl caching
        – Data merging
        – Processing pipeline
      • MapReduce (vs. MPI, JPPF, clustered EJB)
        – Index generation
      • MapRfs/HDFS (vs. Gluster, SAN/NAS, Lustre)
        – Index storage
      • Mahout (vs. Carrot2, LingPipe, Lexalytics)
        – Cluster identification
    • YapMap Approach: Crawling
    • YapMap crawling challenges
      • Depth versus breadth
      • Crawls must be throttled to avoid overloading
      • Avoid duplicate crawling
      • Save progress of long-running crawls
      • Need an elastic and fully distributed approach to crawling
      • Crawler death managed
    • Crawler overview (diagram)
      1. New crawl job arrives (RabbitMQ)
      2. Crawler checks document cache (HBase)
      3. Crawler acquires domain lock (Zookeeper)
      4. Crawler retrieves external assets
      5. Crawler outputs posts, using append as necessary (DFS)
      6. Crawler generates more crawl tasks
      After reaching time and/or quantity thresholds, the crawl pauses, checkpoints in HBase, and resubmits to the RabbitMQ queue.
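The control flow above — one crawler per domain, duplicate avoidance via a document cache, and checkpoint-and-resubmit when a threshold is reached — can be sketched as follows. All names are ours; a real deployment would use Zookeeper for the domain lock, HBase for the cache and checkpoint, and RabbitMQ for the queue, while this sketch uses in-memory stand-ins.

```python
# Minimal sketch of the checkpointed crawl loop (hypothetical names).
import threading

domain_locks = {}            # stand-in for Zookeeper ephemeral locks
doc_cache = set()            # stand-in for the HBase document cache
queue = []                   # stand-in for the RabbitMQ crawl queue

def crawl(domain, urls, max_docs=2):
    lock = domain_locks.setdefault(domain, threading.Lock())
    if not lock.acquire(blocking=False):   # one crawler per domain at a time
        queue.append((domain, urls))       # someone else owns it: requeue
        return []
    try:
        fetched = []
        for i, url in enumerate(urls):
            if url in doc_cache:           # avoid duplicate crawling
                continue
            doc_cache.add(url)
            fetched.append(url)
            if len(fetched) >= max_docs:   # quantity threshold reached:
                queue.append((domain, urls[i + 1:]))  # checkpoint + resubmit
                break
        return fetched
    finally:
        lock.release()

got = crawl("example.org", ["/t1p1", "/t1p2", "/t1p3"])
# two docs fetched; the remainder is requeued as a checkpointed continuation
```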
    • YapMap Approach: Processing Pipeline
    • Processing pipeline challenges
      • Independent posts => complete threads
      • Split long threads into multiple sub-threads
      • Fully parallel processing pipeline
      • Accommodate out-of-order data
    • Processing pipeline using HBase
      • Multiple steps with checkpoints to manage failures
      • Idempotent operations at each stage of the process
      • Utilize optimistic locking to do coordinated merges
      • Use regular cleanup scans to pick up lost tasks
      • Control batch size of messages to control throughput versus latency
      • Out-of-order input assumed
      (diagram: posts from crawler → build thread parts → merge + split threads → process & pre-index sub-threads → batch indexing / RT indexing, backed by HBase and Riak)
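The optimistic-locking merge mentioned above works like HBase's check-and-put: read the row with its version, compute the merge, and write only if the version is unchanged, retrying on conflict. This sketch uses our own names and a simplified single-threaded emulation of the atomic compare-and-set; because the merge is a set union, retries are idempotent and concurrent mergers never lose posts.

```python
# Sketch of an optimistic-locking merge (hypothetical names; HBase's
# checkAndPut provides the atomic compare-and-set in the real pipeline).
class Row:
    def __init__(self):
        self.version = 0
        self.posts = frozenset()

def check_and_put(row, expected_version, new_posts):
    # Atomic in HBase; simplified emulation here.
    if row.version != expected_version:
        return False                # someone else wrote first: caller retries
    row.posts = new_posts
    row.version += 1
    return True

def merge(row, incoming):
    while True:                     # retry loop: idempotent, safe to repeat
        v, current = row.version, row.posts
        merged = current | frozenset(incoming)
        if check_and_put(row, v, merged):
            return merged

r = Row()
merge(r, {"post-1", "post-3"})
merged = merge(r, {"post-2"})
# merged == {"post-1", "post-2", "post-3"}
```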
    • YapMap Approach: Index Generation
    • Index generation challenges
      • Shard size control
      • Index ordering
      • Maintain inverted and un-inverted data in parallel
      • Minimize merging costs
      • Support multi-grain indexing and scoring
    • Index shards loosely based on HBase regions (diagram: regions R1–R3 of pre-index docs map to Shards 1–3)
      • HBase primary key order is the same as index order
      • Shards sized based on parallelization requirements
        – Typically ~5 GB each
      • Shards are based on snapshots of splits for data locality
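One plausible reading of the sizing rule above is bin-packing consecutive regions: walk regions in primary-key order (so index order matches key order) and close a shard once it reaches the target size. This sketch is our own illustration of that idea, not YapMap's algorithm; `plan_shards` and the target of 5 GB from the slide are the only givens.

```python
# Hypothetical shard-planning sketch: pack consecutive HBase regions into
# shards of roughly target_gb each, preserving key (and thus index) order.
def plan_shards(region_sizes_gb, target_gb=5):
    shards, current, current_size = [], [], 0.0
    for i, size in enumerate(region_sizes_gb):  # key order == index order
        current.append(i)
        current_size += size
        if current_size >= target_gb:           # shard is big enough: close it
            shards.append(current)
            current, current_size = [], 0.0
    if current:                                 # leftover regions form a shard
        shards.append(current)
    return shards

plan = plan_shards([2.0, 2.5, 1.5, 4.0, 3.0])
# → [[0, 1, 2], [3, 4]]  (two shards of ~6 GB and ~7 GB)
```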
    • MapReduce for index generation (diagram): IndexedTableInputFormat feeds barrier map splits into Map → Reduce stages with a term-distribution partitioner; reducers emit term posting lists and statistics through a FileAndPutOutputCommitter, writing inverted data (indices & dictionaries) to DFS and un-inverted data characteristics to HBase
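The shape of that job is the classic inverted-index MapReduce. The real job is Hadoop MapReduce in Java with the classes named above; this is a language-agnostic sketch with our own function names: mappers emit (term, doc_id) pairs, a partitioner routes each term to one reducer, and reducers fold the pairs into posting lists.

```python
# Sketch of the inverted-index MapReduce shape (hypothetical names).
from collections import defaultdict

def map_phase(docs):
    # Mapper: tokenize each document and emit (term, doc_id) pairs.
    for doc_id, text in docs.items():
        for term in text.lower().split():
            yield term, doc_id

def partition(term, n_reducers):
    # Stand-in for the term-distribution partitioner: all pairs for a
    # given term land on the same reducer.
    return hash(term) % n_reducers

def reduce_phase(pairs):
    # Reducer: fold (term, doc_id) pairs into sorted posting lists.
    postings = defaultdict(set)
    for term, doc_id in pairs:
        postings[term].add(doc_id)
    return {t: sorted(d) for t, d in postings.items()}

docs = {1: "hadoop search", 2: "search threads", 3: "hadoop threads"}
index = reduce_phase(map_phase(docs))
# index["hadoop"] == [1, 3]; index["search"] == [1, 2]
```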
    • YapMap Approach: Results Presentation
    • Presentation layer challenges
      • Distributed search tree
      • High-performance index loading and serving
      • No SPOF
      • Effective memory management & allocation
      • Automatic cluster management
      • Smart index distribution
    • Results presentation layer (diagram)
      Query path: 1. request arrives at the results server; 2. query Zookeeper for active servers; 3. fan out the request to shard daemons and consolidate responses; 4. retrieve assets from Riak; 5. respond.
      Index-server path: 1. load shard profile & configure memory (HBase); 2. parallel load and integrate shard (DFS); 3. register new shard availability (Zookeeper).
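The fan-out/consolidate step (step 3 of the query path) is a scatter-gather: query every shard daemon in parallel, then merge the per-shard top hits into one ranked list. This sketch uses our own names and an in-memory dict standing in for the RPC to each shard daemon.

```python
# Scatter-gather sketch for the results server (hypothetical names/data).
from concurrent.futures import ThreadPoolExecutor

SHARDS = {                                 # hypothetical per-shard hits
    "shard-1": [("thread-a", 0.9), ("thread-b", 0.4)],
    "shard-2": [("thread-c", 0.7)],
    "shard-3": [("thread-d", 0.8), ("thread-e", 0.2)],
}

def search_shard(shard, query):
    # Stand-in for an RPC to a shard daemon discovered via Zookeeper.
    return SHARDS[shard]

def fan_out(query, top_k=3):
    with ThreadPoolExecutor() as pool:     # scatter to all active shards
        partials = list(pool.map(lambda s: search_shard(s, query), SHARDS))
    hits = [h for part in partials for h in part]
    return sorted(hits, key=lambda h: -h[1])[:top_k]   # gather + rank

top = fan_out("hadoop")
# top == [("thread-a", 0.9), ("thread-d", 0.8), ("thread-c", 0.7)]
```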
    • Agenda
      • What is YapMap?
      • Fitting Hadoop into your architecture
      • YapMap Approach
        – Crawling
        – Processing
        – Index Generation
        – Results
      ✓ Operations, Getting Started & Questions
    • Operations
      • Hardware
        – Supermicro with 8-core low-power chips, low-power DDR3
        – WD Black 2 TB drives
        – DDR InfiniBand using IPoIB for index-loading performance
      • Software
        – Started on Cloudera, switched to MapR’s M3 distribution of Hadoop
          • GC was painful, now manageable
        – HBase now supports MSLAB for writes and an off-heap block cache to support larger memory usage
        – Shard servers utilize large pages to minimize fragmentation
        – Shard servers do immediate large allocations to minimize GC problems
    • Getting Started
      • Amazon Elastic MapReduce
        – The Common Crawl dataset is a great data set to start with
      • Cheap old-gen cluster if you want to run things like HBase
        – We built an effective 6-node Hadoop/HBase cluster for $1500 (Craigslist, eBay)
        – Mailing lists are littered with performance and interconnectivity challenges when using cloud computing resources for Hadoop
    • Questions
      • Why not Lucene/Solr/ElasticSearch/Katta/etc.?
        – Not built to work well with Hadoop and HBase (Blur.io is the first to tackle this head-on)
        – Data locality between threads and posts to do document-at-once scoring
      • Why not store indices directly in HBase?
        – Single-cell storage would be the only way to do it efficiently
        – No such thing as a single-cell no-read append (HBASE-5993)
        – No single-cell partial read
      • Why use Riak for the presentation side?
        – Hadoop SPOF
        – Even with newer Hadoop versions, HBase does not do sub-second row-level HA on node failure (HBASE-2357)
        – Riak has more predictable latency
      • Why did you switch to MapR?
        – Index load performance was substantially faster
        – Snapshots in the trial copy were nice for those 30 days
        – Less impact on HBase performance