Hadoop bangalore-meetup-dec-2011-yoda

A high performance Hadoop system for querying very large datasets

Published in: Technology, Education, Travel

Transcript

  • 1. Yoda gaurav @inmobi
  • 2. What is it?
      • System to query very large amount of data.
      • Used by entire Inmobi Sales and Analytics teams.
      • Also used by multiple other Inmobi projects to extract data.
      • Built entirely on top of Hadoop system.
      • Highly optimized for storage and efficiency.
  • 3. What is it?
      • Ingests close to 2.5B events per day.
      • A record consists of events belonging to different time-shifted streams.
      • Provides a unified (joined) view of the events.
      • Queries are asynchronous and the goal is to execute most queries in under 2 minutes.
  • 4. Supported Features
      • Multiple Aggregates (SUM, MAX, MIN, DISTINCT, COUNTDISTINCT etc.)
      • Custom formula expressions.
      • Powerful expression based filters. Integrated with JEP expression library.
      • Decode, Truncate etc.
      • Top, Having
      • UDF.
  • 5. Architecture Diagram
  • 6. Why not Hive, Pig etc.
      • We do not need a very generic and complex system for our requirements. We have focused more on speed, resource optimization and our analysts' requirements.
      • Single Job: Most user queries can be modeled as a single MR job. Hive and Pig launch job chains which process the data multiple times. We pack everything into one job.
      • Pre-process fact-fact joins and have smart meta-data joins on the map side (can join with tens of tables, each up to 500 MB in size).
      • Optimized query planning/execution based on our data model.
      • Flexible: can quickly identify and fix problems or add new features.
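
The "single job" and "map-side meta-data join" points in slide 6, together with the aggregates listed in slide 4, can be illustrated with a small in-memory simulation. This is only a sketch of the general technique, not Yoda's actual code: the record fields, the `SITE_META` table and all function names are hypothetical, and a real job would run on Hadoop rather than in a single Python process.

```python
from collections import defaultdict

# Hypothetical small meta-data (dimension) table, held in memory by every mapper.
SITE_META = {"s1": "IN", "s2": "US"}

# Hypothetical event records (the fact data).
EVENTS = [
    {"site": "s1", "user": "u1", "clicks": 3},
    {"site": "s1", "user": "u2", "clicks": 5},
    {"site": "s2", "user": "u1", "clicks": 2},
    {"site": "s1", "user": "u1", "clicks": 1},
]


def map_event(event, meta):
    """Map side: join the event with the in-memory meta-data, emit (key, value)."""
    country = meta.get(event["site"], "UNKNOWN")
    return country, (event["clicks"], event["user"])


def reduce_group(values):
    """Reduce side: compute every requested aggregate in one pass over the group."""
    total, highest, lowest, users = 0, None, None, set()
    for clicks, user in values:
        total += clicks
        highest = clicks if highest is None else max(highest, clicks)
        lowest = clicks if lowest is None else min(lowest, clicks)
        users.add(user)
    return {"SUM": total, "MAX": highest, "MIN": lowest, "COUNTDISTINCT": len(users)}


def run_single_job(events, meta):
    """Simulate one MapReduce job: map, shuffle by key, then reduce."""
    groups = defaultdict(list)
    for event in events:
        key, value = map_event(event, meta)
        groups[key].append(value)
    return {key: reduce_group(values) for key, values in groups.items()}


result = run_single_job(EVENTS, SITE_META)
```

Because the meta-data table is small enough to sit in each mapper's memory, the join needs no extra shuffle, and all aggregates for a group come out of the same reduce pass instead of a chain of jobs.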