May 29, 2014 Toronto Hadoop User Group - Micro ETL

This is a brief discussion of the various technologies used for micro-ETL, a.k.a. near-realtime ETL.


  1. MICRO-ETL: AN EMERGING PATTERN OF USE WITH HADOOP AND NEAR-REALTIME FRAMEWORKS. Adam Muise, Principal Architect, Hortonworks
  2. Who is Hortonworks?
  3. We do Hadoop. The leaders of Hadoop's development. Community driven, enterprise focused. We drive innovation in the platform and lead the roadmap. 100% open source: democratized access to data.
  4. What is Micro-ETL?
  5. I made it up.
  6. It turns out several other people have made it up before, so I don't feel like a megalomaniac. Another term for it is Near-Realtime ETL. http://www.researchgate.net/publication/226219087_Near_Real_Time_ETL/file/79e4150b23b3aca5aa.pdf
  7. Micro-ETL/Near-Realtime ETL involves: 1. An intra-batch and/or near-realtime ingest of analytical data. 2. Processing on small batches of data or event streams. 3. A scalable "ETL" processing framework.
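The three properties above can be sketched as a toy loop. This is a framework-free illustration in Python; `micro_etl`, the queue, and the sample events are all hypothetical stand-ins, not any real ingest API:

```python
from collections import deque

def micro_etl(batches, transform):
    """Process small batches as they arrive, instead of accumulating
    one large nightly batch before any processing happens."""
    results = []
    pending = deque(batches)      # stand-in for a near-realtime ingest queue
    while pending:
        batch = pending.popleft()             # take the next small batch
        results.extend(transform(row) for row in batch)  # per-batch "ETL" step
    return results

# Example: uppercase a field as each micro-batch arrives.
events = [[{"msg": "a"}, {"msg": "b"}], [{"msg": "c"}]]
out = micro_etl(events, lambda row: {"msg": row["msg"].upper()})
```

The point is only the shape: ingest arrives as many small batches, and the same scalable transform runs on each one as it lands.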
  8. Why use Micro-ETL: 1. Your analytics require more up-to-date information than a regular batch process provides. 2. There is operational risk (i.e., falling behind the data firehose, data loss, loss of data fidelity) in leaving the processing of events until batch can process them. 3. Your data ingest rate is inconsistent and there is value in keeping up to date with current events.
  9. When not to use Micro-ETL: 1. Your data ingest rates are predictable and analyzing them intra-batch provides little value. 2. Processing in a large batch yields a complete population of the data required to make a decision. 3. You have existing investments in traditional ETL tools that outweigh any benefits of a new tool/framework (as tools evolve, however, this will become a moot point).
  10. Micro-ETL runs regular processing tasks on data frameworks that can handle near-realtime event streams.
  11. Let's refresh on some core Hadoop concepts…
  12. Refresh on YARN
  13. YARN = Yet Another Resource Negotiator
  14. Resource Manager + Node Managers = YARN. [Architecture diagram: the Resource Manager runs the Scheduler; each Node Manager hosts containers, including AppMasters and application containers for MapReduce, Storm, Pig, etc.]
  15. YARN abstracts resource management so you can run all sorts of distributed applications. [Diagram: on top of HDFS and YARN sit MapReduce v2, Tez, Storm, Spark, Giraph, MPI, HBase, and more.]
  16. The following section outlines frameworks that are emerging as data-processing options to batch-driven MapReduce.
  17. Introducing Tez
  18. Three important facts about Tez: 1. Tez is a YARN application. 2. Tez will eventually replace MapReduce. 3. Tez scales as well as the rest of Hadoop scales (thousands of nodes).
  19. Tez provides a layer for abstract tasks; these could be mappers, reducers, customized stream processes, in-memory structures, etc.
  20. Tez chains tasks together into one job to get jobs like Map -> Reduce -> Reduce. This is ideal for apps like Hive. [Diagram: TezMap tasks feed chained TezReduce stages performing a group-by, projections, and an order-by over the input data blocks.]
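A Map -> Reduce -> Reduce chain like the Hive group-by-then-order-by above can be sketched as plain functions. This is a toy model in Python, not the Tez API; the stage names are illustrative, and the point is that the stages compose into one job with no intermediate write between the two reduces:

```python
from collections import defaultdict

def map_stage(lines):
    """Map: emit (word, 1) pairs, one per token."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def reduce_group_by(pairs):
    """First reduce: group by word and sum the counts."""
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return counts.items()

def reduce_order_by(counts):
    """Second reduce: order by count descending, then word."""
    return sorted(counts, key=lambda kv: (-kv[1], kv[0]))

# One chained job: Map -> Reduce -> Reduce, no intermediate materialization.
lines = ["a b a", "b a"]
result = reduce_order_by(reduce_group_by(map_stage(lines)))
```

In classic MapReduce the order-by would be a second full job reading the first job's output from HDFS; chaining the stages is what makes this shape attractive for Hive.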
  21. Tez provides the option for more complicated workflows.
  22. Tez allows for more complicated workflow primitives than just Map and Reduce. A Tez task is composed of a programmable Input, Output, and Processor.
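The Input/Processor/Output decomposition can be modeled in a few lines. This is a hedged Python sketch of the idea only; the real Tez interfaces are Java and differ in detail, and `ListInput`, `ListOutput`, and `Task` are invented names:

```python
class ListInput:
    """Programmable Input: where records come from (here, a plain list)."""
    def __init__(self, data):
        self.data = data
    def read(self):
        return iter(self.data)

class ListOutput:
    """Programmable Output: where results go (here, an in-memory sink)."""
    def __init__(self):
        self.sink = []
    def write(self, record):
        self.sink.append(record)

class Task:
    """A task wires one Input, one Processor function, and one Output."""
    def __init__(self, input_, processor, output):
        self.input, self.processor, self.output = input_, processor, output
    def run(self):
        for record in self.input.read():
            self.output.write(self.processor(record))

out = ListOutput()
Task(ListInput([1, 2, 3]), lambda x: x * 10, out).run()
```

Because all three pieces are pluggable, the same Processor logic could be fed by an HDFS reader in batch or by a stream source, which is exactly why this shape suits Micro-ETL.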
  23. YARN can provide long-running containers* for applications like Storm, HBase, JBoss, etc. (* with the help of Apache Slider: http://wiki.apache.org/incubator/SliderProposal)
  24. Yahoo!, the original author of Hadoop and the largest Hadoop user, is betting on Apache Hive/Tez/YARN as their core data architecture. http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-betting-on-apache-hive-tez-and-yarn
  25. Refresh on Storm
  26. Storm is a distributed execution engine that handles streaming data.
  27. Storm processes streaming event data as tuples. Each event is generated/ingested through a spout and processed in a series of bolts. The spouts and bolts form a topology.
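The spout-to-bolts flow can be sketched with generators. This is a toy model of the tuple flow only, not Storm code; real spouts and bolts implement Storm's Java interfaces and are wired up with a TopologyBuilder, while the names here are invented:

```python
def spout(events):
    """Spout: generate/ingest raw events, emitting each as a tuple."""
    for e in events:
        yield (e,)

def parse_bolt(tuples):
    """Bolt 1: parse each raw CSV event into a structured tuple."""
    for (raw,) in tuples:
        yield tuple(raw.split(","))

def filter_bolt(tuples):
    """Bolt 2: keep only events whose status field is 'ok'."""
    for t in tuples:
        if t[1] == "ok":
            yield t

# Wire spout -> bolt -> bolt; here a simple chain, though Storm
# topologies are general DAGs of spouts and bolts.
events = ["1,ok", "2,bad", "3,ok"]
results = list(filter_bolt(parse_bolt(spout(events))))
```

Each bolt consumes tuples and emits tuples, so processing steps compose, and the composed graph is the topology.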
  28. Refresh on Summingbird
  29. Summingbird is a processing framework that runs over Storm. It uses Scala and has MapReduce-like features. Technically it's Scalding (Cascading with Scala).
  30. Refresh on Spark
  31. Spark is a framework designed to handle in-memory computing. If you are using Hadoop, you are typically running Spark on YARN.
  32. Spark uses RDDs (Resilient Distributed Datasets) as a primitive to enable in-memory processing. RDDs can be created from all sorts of data, like HDFS files:

      scala> val distFile = sc.textFile("hdfs://my.namenode.com:8020/tmp/data.txt")
      distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
  33. Spark Streaming constructs DStream primitives from a streaming data source. DStreams are actually made up of many RDDs and use the common Spark engine.
  34. Spark Streaming has an expressive API that allows typical transformations on RDDs or DStreams. This is comparable to MapReduce.
  35. Since Micro-ETL involves porting traditional data-processing tasks to different execution environments, it makes sense that data-processing tools would facilitate execution on multiple platforms.
  36. Some options in the field…
  37. You can write generic processing libraries (Java) for your data and port them from MapReduce/Tez to Storm. Storm can also run Summingbird (Scala).
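The portable-library idea reduces to keeping the business logic framework-agnostic and wrapping it in thin harnesses. A minimal sketch, assuming invented names (`enrich`, `run_batch`, `run_stream`); a real library would bind the same logic to the MapReduce/Tez and Storm APIs:

```python
def enrich(record):
    """Framework-agnostic business logic: flag large values."""
    return {**record, "flag": record["value"] > 10}

def run_batch(records):
    """Batch harness: process the whole population at once."""
    return [enrich(r) for r in records]

def run_stream(records):
    """Stream harness: process one record at a time as it arrives,
    a stand-in for tuples flowing out of a spout."""
    out = []
    for r in records:
        out.append(enrich(r))
    return out

data = [{"value": 5}, {"value": 20}]
# The same library yields the same results under either harness.
batch_out = run_batch(data)
stream_out = run_stream(data)
```

Keeping `enrich` free of any framework types is the whole trick: the harnesses are cheap to rewrite per engine, the logic is written once.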
  38. Spark already provides a streaming and processing layer for Micro-ETL. These can be written in Scala, Java, or Python.
  39. Cascading has recently announced that it will support Tez and Storm immediately in version 3.0. They also have plans for Spark. http://www.concurrentinc.com/2014/05/cascading-3-0-adds-multiple-framework-support-concurrent-driven-manages-big-data-apps/
  40. Pentaho Kettle currently supports porting its data workflows to Storm in a beta version. http://wiki.pentaho.com/display/BAD/Kettle+Execution+on+Storm
  41. Talend recently asked for $40 million in funding to help push further into Big Data. That includes tooling to port Talend ETL workflows to Storm and Tez. http://techcrunch.com/2013/12/11/talend-raises-40m-to-more-aggressively-extend-into-big-data-market-sets-sights-on-ipo/
  42. Discuss.
  43. Thanks, THUGs.
