Invisible loading

Invisible Loading Talk by Azza Abouzied at the VLDB Workshop on End-to-end Management of Big Data 2012

Transcript

  • 1. Invisible Loading. Yalies: Azza Abouzied, Daniel Abadi, Avi Silberschatz. BigData 2012
  • 2. Problem: The Crying Baby
  • 3. Two ways to deal with this:
      – Immediate gratification; long-term $$$ costs
      – Misery & sleep deprivation; long-term benefits
  • 4. The Crying Baby Problem (wants attention now!) ≈ The Impatient Boss Problem (wants answers now!)
  • 5. Two ways to analyze data
      – MapReduce way: Locate File → Determine Key Attributes → Hack it: Parse + Map + Reduce. Immediate gratification; long-term cumulative costs because MR is slow!
      – DB & HadoopDB way: Locate File → Determine Key Attributes → Figure out schema → Load File → Organize or Index DB tables → Query: Process without Parse. Misery & sleep deprivation; long-term benefits.
  • 6. The Problem: Can we get the immediate gratification of working with MapReduce and still make progress towards the performance advantages of working with databases?
  • 7. Our Solution: Begin with the MapReduce way
      – File system: Locate File → Determine Key Attributes → Write Map/Reduce Scripts → Run it!
      – Database system (behind the scenes, per job, incrementally): Figure out schema → Load File → Organize or Index DB tables
  • 8. Figure out schema. P1) How to automatically figure out a schema? Short answer: DON'T.
      – Split the map phase into Parse and Map phases.
      – Enforce a simple Parse API: the Parser has one output method, getField(int id).
      – Name a table after its Parser implementation and label attributes with their field ids.
      – Different parsers on the same file result in different tables.
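The Parse/Map split on slide 8 can be sketched as follows. This is a minimal illustration in Python rather than the Java a Hadoop job would likely use; only `getField(int id)` comes from the slide. The `parse` input hook, the `CsvIntParser` class, the helper names, and the `f` prefix for column labels are all assumptions made for the example.

```python
from abc import ABC, abstractmethod


class Parser(ABC):
    """Hypothetical Python rendering of the slide's Parse API."""

    @abstractmethod
    def parse(self, record: str) -> None:
        """Consume one raw record (assumed input hook, not from the slide)."""

    @abstractmethod
    def get_field(self, field_id: int):
        """The single output method: return attribute number field_id."""


class CsvIntParser(Parser):
    """Example parser (an assumption): comma-separated integer fields."""

    def parse(self, record: str) -> None:
        self._fields = record.strip().split(",")

    def get_field(self, field_id: int) -> int:
        return int(self._fields[field_id])


def table_name(parser: Parser) -> str:
    # A table is named after the parser implementation.
    return type(parser).__name__


def column_name(field_id: int) -> str:
    # Attributes are labelled by field id; the "f" prefix is an assumption.
    return f"f{field_id}"


def map_fn(parser: Parser):
    # The Map phase only asks for fields; it never sees the raw record.
    return parser.get_field(0), parser.get_field(2)


parser = CsvIntParser()
parser.parse("7,13,42\n")
print(table_name(parser), column_name(2), map_fn(parser))
```

Because the table name is derived from the parser class, a second parser over the same file would yield a second table, which matches the last point on the slide.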
  • 9. Incrementally load file. P2) How to load files with minimal marginal costs?
      • Load only touched attributes (vertical partition)
          – Requires a column-store
      • Load only parts of a column (horizontal partition)
          – After a file-split is processed by Map, its touched attributes are loaded entirely
          – The number of splits of a file is a tunable parameter
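The vertical and horizontal partitioning on slide 9 can be sketched with a toy column store. Everything here is a simplified assumption: the `ColumnStore` class, the `run_job` helper, and the `splits_per_job` knob are hypothetical names, and the real system loads into a database rather than Python lists.

```python
def make_splits(rows, num_splits):
    """Cut the file's rows into num_splits contiguous splits (tunable)."""
    size = -(-len(rows) // num_splits)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


class ColumnStore:
    """Toy column store: one growing list per attribute (field id)."""

    def __init__(self):
        self.columns = {}

    def append(self, field_id, values):
        self.columns.setdefault(field_id, []).extend(values)


def run_job(splits, touched, store, loaded, splits_per_job=1):
    """Load the next unloaded split(s) of every attribute the job touched.

    loaded maps field id -> number of splits already in the store.
    """
    for field_id in sorted(touched):
        start = loaded.get(field_id, 0)
        for split in splits[start:start + splits_per_job]:
            # Vertical partition: only this attribute is copied.
            store.append(field_id, [row[field_id] for row in split])
        # Horizontal partition: progress is tracked per split.
        loaded[field_id] = min(start + splits_per_job, len(splits))


rows = [(i, i * 10, i * 100) for i in range(8)]
splits = make_splits(rows, num_splits=4)
store, loaded = ColumnStore(), {}
run_job(splits, {0, 1}, store, loaded)  # first job touches attributes 0 and 1
run_job(splits, {1, 2}, store, loaded)  # second job touches attributes 1 and 2
print(store.columns, loaded)
```

After two jobs, attribute 1 is two splits ahead while attributes 0 and 2 have one split each, so each job pays only for the columns it actually used.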
  • 10. Tuple construction: some columns are at different loading stages.
      – Maintain OIDs for each column: an address column
          • The OIDs assigned are equivalent to the insertion order
      – Keep a catalog to track loading progress
      [Figure: columns a, b, c, d, divided between "Process in DB" and "Use File System"]
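Tuple construction over columns at different loading stages (slide 10) can be sketched as follows. This assumes the OID of a value is simply its position in the column, which is one way to read "equivalent to the insertion order"; the `construct_tuples` helper and the dictionary-based catalog are hypothetical.

```python
def construct_tuples(columns, catalog, file_rows, wanted):
    """Answer a request for the wanted attributes from DB and file together.

    columns:   attribute name -> loaded values, in insertion (OID) order
    catalog:   attribute name -> number of values loaded so far
    file_rows: the raw file, already parsed into dicts for this sketch
    """
    # Only the prefix loaded for every wanted column can be served by the DB.
    in_db = min(catalog.get(name, 0) for name in wanted)
    from_db = [tuple(columns[name][oid] for name in wanted)
               for oid in range(in_db)]
    # The remaining rows still come from the file system.
    from_file = [tuple(row[name] for name in wanted)
                 for row in file_rows[in_db:]]
    return from_db, from_file


file_rows = [{"a": i, "b": i * 2, "c": i * 3} for i in range(6)]
columns = {"a": [0, 1, 2, 3], "b": [0, 2]}  # c has not been loaded at all
catalog = {"a": 4, "b": 2}

db_part, file_part = construct_tuples(columns, catalog, file_rows, ["a", "b"])
print(db_part, file_part)
```

Aligning columns by OID is what lets a partially loaded column `b` be combined with a further-loaded column `a` without re-reading the loaded prefix from the file.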
  • 11. Incrementally organize file. P3) How to index a partially-loaded table?
      – If a selection filter is applied on an attribute, we organize it.
      – Dealing with partially loaded attributes
      [Figure: columns c1 and c2 with an address column; JOIN]
  • 12. Choosing an organization strategy
      • Why not use merge sort?
      [Figure: Load & Sort; splits in HDFS (file system) become slices of the column in the database]
  • 13. Incremental Merge Sort
      [Figure: Load & Sort, then Phase 1 (split-bit: 1000b) and Phase 2 (split-bit: 0100b); splits in HDFS (file system) become slices of the column in the database, with a simple index over the slice ranges 0-3, 4-7, 8-11, 12-15]
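Incremental merge sort on slide 13 can be read as a sequence of merge-and-split steps over individually sorted slices, with each phase splitting on one bit of the value. The sketch below follows that reading under stated assumptions: values are small non-negative integers of `value_bits` bits, the number of slices is a power of two, and the pairing of slices per phase is a choice made for this example. It is not the MonetDB implementation used in the talk, and the function name is hypothetical.

```python
from heapq import merge


def incremental_merge_sort(slices, value_bits):
    """Yield after every merge-split so a caller can run one step per query.

    slices must each be sorted already (the "Load & Sort" stage).
    """
    k = len(slices)
    stride = 1
    split_bit = 1 << (value_bits - 1)  # phase 1 splits on the top bit
    while stride < k:
        for i in range(k):
            if i & stride:
                continue  # slice i is the upper partner of slice i - stride
            j = i + stride
            merged = list(merge(slices[i], slices[j]))
            # Values with the split bit clear go to slice i, the rest to j.
            slices[i] = [v for v in merged if not v & split_bit]
            slices[j] = [v for v in merged if v & split_bit]
            yield i, j, split_bit
        stride *= 2
        split_bit >>= 1


column = [11, 2, 5, 14, 0, 9, 7, 12, 3, 15, 8, 4, 13, 6, 1, 10]
slices = [sorted(column[i:i + 4]) for i in range(0, 16, 4)]

sorter = incremental_merge_sort(slices, value_bits=4)
first_step = next(sorter)   # work piggybacked on one query
after_first = [list(s) for s in slices]
remaining = list(sorter)    # later queries finish the job
print(slices)
```

With four slices the work is spread over four small steps, two per phase. At the end each slice is sorted and covers a disjoint value range, so a small range-to-slice index is enough to find the slice that holds a given value.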
  • 14. EVALUATION
  • 15. Setup
      • Single-machine experiments
          – Embarrassingly parallel
          – No distributed reorganization or partitioning
      • MonetDB (hacked to support IMS)
      • Hadoop
      • 2 GB file of 5 integer attributes: 107,374,182 tuples
      • See paper for more details
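The tuple count on slide 15 is consistent with a simple back-of-the-envelope calculation, assuming 4-byte integers and 1 GB = 2^30 bytes. Both are assumptions; the slide does not state the encoding of the file.

```python
FILE_BYTES = 2 * 1024**3   # 2 GB, taking 1 GB as 2**30 bytes (assumption)
ATTRIBUTES = 5
BYTES_PER_INT = 4          # assumed width of each integer attribute

tuples = FILE_BYTES // (ATTRIBUTES * BYTES_PER_INT)
print(f"{tuples:,}")       # 107,374,182
```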
  • 16. The big picture
      [Figure: time in seconds (0 to 800) against job sequence (ticks at 1, 10, 100) for SQL Pre-load, Incremental Reorganization (5/5), Incremental Reorganization (2/5), Invisible Loading (5/5), Invisible Loading (2/5), and MapReduce]
  • 17. Cumulative costs
      [Figure: cumulative time spent in seconds (100 to 100,000) against job sequence (ticks at 1, 10, 100) for the same six configurations]
  • 18. Change the access pattern
      [Figure: time in seconds (0 to 800) against job sequence, log scale for jobs 1 to 10 and linear scale for jobs 83 to 93, for the same six configurations]
  • 19. Further Evaluation (Paper)
      • In-depth study of IMS
          – Comparison with Cracking and Pre-sorting
          – Effect of integrating lightweight compression into IMS
      • Little mini-experiments
          – Insertion vs. Copy
          – Processing in DB vs. using DB as a fast access medium with all processing in MapReduce
  • 20. Conclusion: Lessons Learned
      • Engineering nightmare
          – Many complementing technologies
              • Manimal, Adaptive Merging …
          – In the era of Big Data we need to design more modular, plug-n-play tools
      • Can of worms
          – Most BigData problems look deceptively simple until you start mucking around.
  • 21. Some problems are easier than others
  • 22. Thanks! Questions?
  • 23. Why is loading this log file hard?
      [Figure: excerpt of a server log file, annotated with the points below]
      – Message field varies depending on application!
      – What is the base schema? Time, Type, Message?
      – Different tables for each type?
      – Context-dependent schema awareness: different analysts know the schema of what they are looking for and don't care about other log messages.