Lessons learned scaling big data in cloud

Presented at a technical series on Big Data and Amazon EMR.

  1. Lessons Learned: Scaling Hadoop and Big Data in Cloud - Vijay Rayapati (@amnigos)
  2. Big Data
  3. Data Keeps Growing
  4. What can you do with data?
  5. Problem: Local Intelligence
  6. High-Level Architecture – multiple data sources → data ingest → data storage → processing (Hadoop jobs) → output location → datasets, API, intelligence.
  7. Our Challenges
      • Multiple data sources – social, retail, events, news, census, location, etc.
      • Spatial data analysis and querying – location overlays on the data
      • Temporal nature of the input datasets
      • Large input datasets – hundreds of GB of compressed input per job
      • Complex processing and business logic based on use cases
      • Custom output data formats – JSON, XML, XLS, flat files, etc.
  8. Why Amazon EMR? I am interested in using Hadoop to solve business problems, not in building and managing Hadoop infrastructure!
      • Scalable storage – S3
      • Flexible computing – EC2
      • No Hadoop management – EMR
  9. Amazon EMR - Service Architecture
  10. How to move existing data to the cloud?
      • 10s of GB – direct upload, any S3 tool
      • 100s of GB – any S3 tool, Tsunami
      • > 1 TB – AWS Import/Export, Aspera, Tsunami
  11. Solution Architecture – the same pipeline on AWS: data ingest on EC2, data storage on S3, processing (Hadoop jobs) on Amazon EMR, output to S3, and the datasets/API/intelligence layer served from EC2.
  12. Amazon EMR – Setup
      Launching a fully configured 500-node cluster is as simple as executing one command:

      > elastic-mapreduce --create --alive --plain-output \
          --master-instance-type m1.xlarge --slave-instance-type m2.2xlarge \
          --num-instances 500 --name "Site Analytics Cluster" \
          --bootstrap-action s3://com.bcb11.emr/scripts/bootstrap-custom.sh \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/install-ganglia \
          --bootstrap-action s3://elasticmapreduce/bootstrap-actions/configure-hadoop \
          --args "--mapred-config-file, s3://com.bcb11.emr/conf/custom-mapred-site.xml"

      > elastic-mapreduce -j ${jobflow} --stream --step-name "Profile Analyzer" \
          --jobconf mapred.task.timeout=0 \
          --mapper s3://com.bcb11.emr/code/mapper.rb \
          --reducer s3://com.bcb11.emr/bin/reducer.rb \
          --cache s3://com.bcb11.emr/cache/customdata.dat#data.txt \
          --input s3://com.bcb11.emr/input/ --output s3://com.bcb11.emr/output
  13. EMR MapReduce Jobs
      Amazon EMR supports streaming, custom JAR, Cascading, Pig and Hive.
      • Streaming – write map/reduce jobs in any scripting language.
      • Custom JAR – written in Java; good for speed and control (see the sketch below).
      • Cascading, Hive and Pig – higher levels of abstraction.
      • Use the AWS EMR forums if you need help.
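      The deck does not show job code, so as a rough illustration of the custom-JAR option, here is a minimal Hadoop mapper in Java; the class name, the choice of key and the tab-separated input format are assumptions for illustration, not something from the talk.

      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      // Hypothetical custom-JAR mapper: keys each tab-separated record on its
      // first field and emits a count of 1 for downstream aggregation.
      public class ProfileMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
          String[] fields = record.toString().split("\t");
          if (fields.length > 0 && !fields[0].isEmpty()) {
            outKey.set(fields[0]);
            context.write(outKey, ONE);
          }
        }
      }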
  14. Hadoop and EMR – Lessons Learned
  15. EMR – Good, Bad and Ugly
      • Great for bootstrapping large clusters and very cost-effective for transient clusters.
      • Most patches are applied and Amazon ships new AMIs with improvements – but not for everything.
      • Intermittent network issues can sometimes cause serious performance degradation.
      • Network and disk I/O vary by instance type, and streaming jobs are noticeably more sluggish on EMR than on a dedicated setup.
      • Be ready to face variable performance in the cloud.
  16. Hadoop and EMR – Jobs
      • Use a local Hadoop setup for debugging your jobs – there is no easy way to do it on EMR (see the sketch below).
      • Capture EMR cluster metrics – always bootstrap with Ganglia.
      • High JVM memory allocation leads to long GC pauses.
      • Don't trust the EMR-tuned Hadoop configuration settings.
      • Benchmark on a small cluster to gather data points.
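      One way to follow the "debug locally" advice is to run the same job through Hadoop's local job runner so it can be stepped through in an IDE before submitting to EMR. This is a sketch assuming Hadoop 0.20/1.x property names and the hypothetical ProfileMapper class from the earlier sketch; paths are placeholders.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class LocalDebugRunner {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          conf.set("mapred.job.tracker", "local");   // run in-process, no cluster needed
          conf.set("fs.default.name", "file:///");   // read and write the local filesystem

          Job job = new Job(conf, "profile-analyzer-local");
          job.setJarByClass(LocalDebugRunner.class);
          job.setMapperClass(ProfileMapper.class);   // hypothetical mapper from the earlier sketch
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          FileInputFormat.addInputPath(job, new Path("sample-input"));
          FileOutputFormat.setOutputPath(job, new Path("sample-output"));
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }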
  17. Hadoop and EMR – Job Performance
      • GC overhead – increase task memory and reduce the number of JVM-reuse tasks.
      • Avoid read contention on S3 – keep at least as many input files in S3 as there are available mappers.
      • Use mapred output compression to save storage, processing time and bandwidth costs.
      • Set the mapred task timeout to 0 for long-running jobs (> 10 mins), and consider disabling speculative execution.
      • Always benchmark third-party libraries used in your job code before putting them into production – there is too much sluggish stuff out there.
      A configuration sketch for these settings follows.
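      A sketch of how the knobs above might be set programmatically, using Hadoop 1.x-era property names; the codec choice is illustrative, and the same settings can be passed on the command line via --jobconf as in slide 12.

      import org.apache.hadoop.conf.Configuration;

      public class JobConfDefaults {
        public static Configuration tuned() {
          Configuration conf = new Configuration();
          // Compress job output written back to S3 to save storage and bandwidth.
          conf.setBoolean("mapred.output.compress", true);
          conf.set("mapred.output.compression.codec",
                   "org.apache.hadoop.io.compress.GzipCodec");
          // Long-running tasks: disable the task timeout entirely.
          conf.setLong("mapred.task.timeout", 0);
          // Avoid duplicate work from speculative execution on long jobs.
          conf.setBoolean("mapred.map.tasks.speculative.execution", false);
          conf.setBoolean("mapred.reduce.tasks.speculative.execution", false);
          return conf;
        }
      }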
  18. Hadoop – High-Level Tuning
      • Small-files problem – avoid too many small files in S3.
      • Tune your settings – JVM reuse, sort buffer, sort factor, map/reduce task counts, parallel copies, mapred output compression, etc. (see the sketch below).
      • Know what is limiting you at the node level – CPU, memory, disk I/O or network in/out.
      • Good thing is that you can use a small cluster and a sampled input size for tuning.
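      The settings called out above map to properties like the following under the Hadoop 1.x naming scheme; the values are placeholders to benchmark against a small cluster and sampled input, not recommendations from the talk.

      import org.apache.hadoop.conf.Configuration;

      public class ClusterTuning {
        public static void apply(Configuration conf) {
          conf.set("mapred.child.java.opts", "-Xmx1024m");    // per-task JVM heap
          conf.setInt("mapred.job.reuse.jvm.num.tasks", -1);  // reuse JVMs across tasks (-1 = unlimited)
          conf.setInt("io.sort.mb", 200);                     // map-side sort buffer (MB)
          conf.setInt("io.sort.factor", 25);                  // streams merged at once during sort
          conf.setInt("mapred.reduce.parallel.copies", 20);   // parallel shuffle fetches per reducer
        }
      }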
  19. Performance Tuning Golden Rules
      When you are operating at very large scale, even 10 ms makes a big difference!
      Example: moving away from simple-json to Jackson (a code sketch follows)
      • JSON parsing – 600 ms
      • Optimized parsing – 500 ms
      • Number of input JSON records – 3 million
      • Time saved by this simple optimization – 84 hrs of savings (100 ms × 3 million records)
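      For illustration, a minimal Jackson-based parser in the spirit of the example above; the field name is made up, and the imports assume Jackson 2.x (the releases of that era lived under org.codehaus.jackson instead).

      import java.io.IOException;
      import com.fasterxml.jackson.databind.JsonNode;
      import com.fasterxml.jackson.databind.ObjectMapper;

      public class RecordParser {
        // Reuse a single ObjectMapper: construction is expensive and the instance
        // is safe to share across calls.
        private static final ObjectMapper MAPPER = new ObjectMapper();

        public static String extractCity(String jsonLine) throws IOException {
          JsonNode record = MAPPER.readTree(jsonLine);
          return record.path("city").asText();  // returns "" if the field is missing
        }
      }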
  20. We have seen improvements from 10x to 100x in our production clusters – significant money savings.
  21. Lesson Learned – Saving Time
      Hadoop job with complex business logic operating on 350 MB of input data:

      Job Language              | Cluster Size | Input Files               | Processing Time
      Ruby                      | 6 m1.xlarge  | 1000                      | 184 mins
      Java                      | 6 m1.xlarge  | 1000                      | 69 mins
      Java                      | 6 m1.xlarge  | 100 (1000 files combined) | 39 mins
      Java (EMR tuned)          | 6 m1.xlarge  | 100 (1000 files combined) | 25 mins
      Java (EMR and code tuned) | 6 m1.xlarge  | 100 (1000 files combined) | 13 mins
  22. Lesson Learned – Saving Cost
      A data mining job in production with 50 GB of compressed input data:

      Job Language              | Cluster Size  | Processing Time | Each Job Cost | 100 Jobs Cost Per Month
      Ruby                      | 50 m2.2xlarge | 240 mins        | $242          | $24,200
      Java                      | 20 m1.xlarge  | 200 mins        | $68           | $6,800
      Java (EMR tuned)          | 20 m1.xlarge  | 165 mins        | $50           | $5,000
      Java (EMR and code tuned) | 20 m1.xlarge  | 50 mins         | $17           | $1,700
  23. EMR Cost Optimization
      • Use a small dedicated/transient cluster.
      • Leverage spot instances for task nodes.
      • Optimize, profile and tune your code always – code first, config next.
      • Tune the EMR configuration based on historical job data.
      • Always benchmark third-party libraries.
  24. Q &amp; A
  25. Like what we do? Connect with me: Kuliza.com | vijay.rayapati@kuliza.com | @kuliza | @amnigos
