Hadoop bangalore-meetup-dec-2011-yoda

A high performance Hadoop system for querying very large datasets

Published in: Technology, Education, Travel

  1. Yoda (gaurav @inmobi)
  2. What is it?
       • System to query very large amounts of data.
       • Used by the entire Inmobi Sales and Analytics teams.
       • Also used by multiple other Inmobi projects to extract data.
       • Built entirely on top of the Hadoop system.
       • Highly optimized for storage and efficiency.
  3. What is it?
       • Ingests close to 2.5B events per day.
       • A record consists of events belonging to different time-shifted streams.
       • Provides a unified (joined) view of the events.
       • Queries are asynchronous, and the goal is to execute most queries in under 2 minutes.
  4. Supported Features
       • Multiple aggregates (SUM, MAX, MIN, DISTINCT, COUNTDISTINCT etc.).
       • Custom formula expressions.
       • Powerful expression-based filters, integrated with the JEP expression library.
       • Decode, Truncate etc.
       • Top, Having.
       • UDF.
  5. Architecture Diagram
  6. Why not Hive, Pig etc.
       • We do not need a very generic and complex system for our requirements; we have focused more on speed, resource optimization and our analysts' requirements.
       • Single job: most user queries can be modeled as a single MR job. Hive and Pig launch job chains that process the data multiple times. We pack everything into one job.
       • Fact-fact joins are pre-processed, and smart metadata joins run on the map side (can join with tens of tables, running up to 500 MB in size).
       • Optimized query planning and execution based on our data model.
       • Flexible: we can quickly identify and fix problems or add new features.
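
The "single job" and "map-side metadata join" points above can be illustrated with a small sketch. This is not Yoda's code: it is a plain-Python simulation of one map/reduce pass, assuming a metadata table small enough to hold in memory. The table SITE_META, the records in EVENTS, and the functions map_phase, reduce_phase and run_query are all hypothetical names made up for the illustration.

```python
from collections import defaultdict

# Hypothetical small metadata (dimension) table, held in memory by every mapper.
SITE_META = {
    "s1": {"country": "IN"},
    "s2": {"country": "US"},
    "s3": {"country": "IN"},
}

# Hypothetical fact records: (site_id, user_id, revenue).
EVENTS = [
    ("s1", "u1", 2.0),
    ("s2", "u2", 5.0),
    ("s3", "u1", 1.5),
    ("s1", "u3", 4.0),
    ("s9", "u4", 9.0),  # no metadata row; dropped by the inner join
]


def map_phase(events, meta, keep):
    """Join each event with the in-memory metadata, filter, and emit (key, value)."""
    for site_id, user_id, revenue in events:
        row = meta.get(site_id)
        if row is None:
            continue  # inner join: skip events without a metadata match
        record = {"country": row["country"], "user": user_id, "revenue": revenue}
        if keep(record):
            yield record["country"], (user_id, revenue)


def reduce_phase(pairs):
    """Group by key and compute SUM, MAX and COUNTDISTINCT in one pass."""
    state = defaultdict(lambda: {"sum": 0.0, "max": float("-inf"), "users": set()})
    for key, (user_id, revenue) in pairs:
        agg = state[key]
        agg["sum"] += revenue
        agg["max"] = max(agg["max"], revenue)
        agg["users"].add(user_id)
    return {
        key: {"sum": a["sum"], "max": a["max"], "countdistinct": len(a["users"])}
        for key, a in state.items()
    }


def run_query(events=EVENTS, meta=SITE_META, keep=lambda r: r["revenue"] >= 1.0):
    """Run the join, the filter and all aggregates over the data exactly once."""
    return reduce_phase(map_phase(events, meta, keep))
```

Calling run_query() returns one row per country with all three aggregates, after reading the events a single time. A chain of jobs would instead write and re-read intermediate data between the join and each aggregation step, which is the overhead the slide attributes to generic systems.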
