20140614 introduction to spark-ben white
Big Data Camp LA 2014, Introduction to Spark by Ben White of Cloudera

Presentation Transcript

  • 1. CONFIDENTIAL - RESTRICTED. Introduction to Spark. Ben White – Systems Engineer, Cloudera
  • 2. But first… how did we get here?
  • 3. What does Hadoop look like? (Cluster diagram: every worker node runs an HDFS worker, the DataNode or "DN", next to an MR worker, the TaskTracker or "TT"; the masters are the HDFS master, the NameNode or "NN", and the MR master, the JobTracker or "JT", plus a standby master.)
  • 4. But I want MORE! (The same cluster diagram, with MapReduce drawn as just one layer over the HDFS workers and the NN/JT/standby masters.)
  • 5. Hadoop as an Architecture. The Old Way: data storage (SAN, NAS) separated from compute (RDBMS, EDW) by the network; expensive, special-purpose, "reliable" servers and expensive licensed software; $30,000+ per TB: expensive and unattainable. Hard to scale, the network is a bottleneck, only handles relational data, and it is difficult to add new fields and data types. The Hadoop Way: compute (CPU), memory, and storage (disk) together on commodity "unreliable" servers running hybrid open-source software; $300-$1,000 per TB: affordable and attainable. Scales out forever, no bottlenecks, easy to ingest any data, agile data access.
  • 6. CDH: the App Store for Hadoop. (Stack diagram: batch processing (MapReduce), an analytic MPP DBMS, a search engine, machine learning, in-memory processing, and a NoSQL DBMS, all over shared storage, resource management, metadata, and integration layers, surrounded by system management, data management, security, and support.)
  • 7. Introduction to Apache Spark. Credits: Todd Lipcon, Ted Malaska, Jairam Ranganathan, Jayant Shekhar, Sandy Ryza
  • 8. Can we improve on MR? Problems with MR: Very low-level: requires a lot of code to do simple things. Very constrained: everything must be described as "map" and "reduce". Powerful, but sometimes difficult to think in these terms.
  • 9. Can we improve on MR? Two approaches to improve on MapReduce: 1. Special-purpose systems that solve one problem domain well: Giraph / GraphLab (graph processing), Storm (stream processing). 2. Generalize the capabilities of MapReduce to provide a richer foundation to solve problems: Tez, MPI, Hama/Pregel (BSP), Dryad (arbitrary DAGs). Both are viable strategies depending on the problem!
  • 10. What is Apache Spark? Spark is a general-purpose computational framework. It retains the advantages of MapReduce: linear scalability, fault tolerance, data-locality-based computations. …but offers so much more: leverages distributed memory for better performance; supports iterative algorithms that are not feasible in MR; improved developer experience; full directed-graph expressions for data-parallel computations; comes with libraries for machine learning, graph analysis, etc.
  • 11. Getting started with Spark: a Java API, plus interactive shells in Scala (spark-shell) and Python (pyspark).
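    A minimal sketch of the interactive workflow, assuming a CDH-style install where spark-shell is on the PATH and a hypothetical HDFS file /tmp/words.txt exists; the shell provides a ready-made SparkContext named sc:

        $ spark-shell
        scala> val lines = sc.textFile("/tmp/words.txt")   // RDD backed by the (hypothetical) file
        scala> lines.count()                               // an action: runs a distributed count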
  • 12. Execution modes: Standalone Mode (dedicated master and worker daemons: a dedicated Spark runtime with static resource limits); YARN Client Mode (launches a YARN application with the driver program running locally); YARN Cluster Mode (launches a YARN application with the driver program running in the YARN ApplicationMaster). The YARN modes give dynamic resource management between Spark, MR, Impala…
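    The mode is normally picked at submission time. A sketch using spark-submit with Spark 1.0-era master URLs (the class, jar, and host names are hypothetical):

        $ spark-submit --class com.example.MyApp --master spark://master:7077 myapp.jar   # standalone
        $ spark-submit --class com.example.MyApp --master yarn-client myapp.jar           # YARN client
        $ spark-submit --class com.example.MyApp --master yarn-cluster myapp.jar          # YARN cluster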
  • 13. Spark Concepts
  • 14. Parallelized Collections

        scala> val data = 1 to 5
        data: Range.Inclusive = Range(1, 2, 3, 4, 5)

        scala> val distData = sc.parallelize(data)
        distData: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0]

    Now I can apply parallel operations to this array:

        scala> distData.reduce(_ + _)
        [… Adding task set 0.0 with 56 tasks …]
        res0: Int = 15

    What just happened?!
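    What happened: Spark split the local collection into partitions and ran the reduce as one task per partition (the 56 tasks logged above), then merged the partial sums on the driver. The partition count can also be set explicitly; a small sketch (the count 4 is arbitrary):

        scala> val distData4 = sc.parallelize(1 to 5, 4)   // request 4 partitions, so 4 tasks
        scala> distData4.reduce(_ + _)                     // per-partition sums, merged on the driver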
  • 15. RDD – Resilient Distributed Dataset: collections of objects partitioned across a cluster, stored in RAM or on disk; you can control persistence and partitioning. Created by distributing local collection objects, by transforming data in storage, or by transforming other RDDs. Automatically rebuilt on failure (resilient): each RDD contains the lineage needed to recompute it from storage. Materialization is lazy.
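    A sketch of the three creation routes and of persistence control; the HDFS path is hypothetical, and cache() is the standard request for in-memory persistence:

        val fromLocal = sc.parallelize(Seq(1, 2, 3))           // distribute a local collection
        val fromStore = sc.textFile("hdfs:///tmp/events.log")  // transform data in storage
        val fromRDD   = fromStore.filter(_.nonEmpty)           // transform an existing RDD
        fromRDD.cache()                                        // keep it in RAM across jobs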
  • 16. RDD transformations (diagram)
  • 17. Operations on RDDs. Transformations lazily transform an RDD into a new RDD: map, flatMap, filter, sample, join, sort, reduceByKey, … Actions run a computation to return a value: collect, reduce(func), foreach(func), count, first, take(n), saveAs, …
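    A sketch of the lazy/eager split (the file path is hypothetical): the three transformations only record lineage, and nothing executes until the action on the last line.

        val errors = sc.textFile("hdfs:///var/log/app.log") // transformation: nothing runs yet
          .filter(_.contains("ERROR"))                      // transformation: builds lineage only
          .map(_.split("\t")(1))                            // transformation: still lazy
        errors.count()                                      // action: the distributed job runs now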
  • 18. Fault Tolerance. RDDs contain lineage: the source location plus the list of transformations, so lost partitions can be recomputed from the source data.

        msgs = textFile.filter(lambda s: s.startswith("ERROR")) \
                       .map(lambda s: s.split("\t")[2])

    (Lineage: HDFS File → Filtered RDD via filter(func = startswith(…)) → Mapped RDD via map(func = split(…)).)
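    Spark can print that lineage directly. A sketch of the same pipeline in Scala, inspected with the RDD's toDebugString method (the path is hypothetical, and the exact output format varies by version):

        val msgs = sc.textFile("hdfs:///var/log/app.log")
          .filter(_.startsWith("ERROR"))
          .map(_.split("\t")(2))
        println(msgs.toDebugString)   // shows the mapped <- filtered <- HadoopRDD chain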
  • 19. Examples
  • 20. Word Count in MapReduce

        package org.myorg;

        import java.io.IOException;
        import java.util.*;

        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.conf.*;
        import org.apache.hadoop.io.*;
        import org.apache.hadoop.mapreduce.*;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

        public class WordCount {

          public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
              String line = value.toString();
              StringTokenizer tokenizer = new StringTokenizer(line);
              while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable val : values) {
                sum += val.get();
              }
              context.write(key, new IntWritable(sum));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "wordcount");

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);

            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            job.waitForCompletion(true);
          }
        }
  • 21. Word Count in Spark

        sc.textFile("words")
          .flatMap(line => line.split(" "))
          .map(word => (word, 1))
          .reduceByKey(_ + _)
          .collect()
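    Those few lines replace the entire Java listing on the previous slide. To write results back to storage instead of collecting them to the driver, the usual variant (output path hypothetical) is:

        sc.textFile("words")
          .flatMap(_.split(" "))
          .map((_, 1))
          .reduceByKey(_ + _)
          .saveAsTextFile("hdfs:///tmp/wordcounts")   // one part-file per partition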
  • 22. Logistic Regression: read two sets of points and look for a plane W that separates them. Perform gradient descent: start with a random W; on each iteration, sum a function of W over the data; move W in a direction that improves it. (A code sketch follows slide 24 below.)
  • 23. Intuition (figure)
  • 24. Logistic Regression (code slide; the original listing did not survive extraction: a sketch follows)
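    The missing listing is presumably close to the canonical Spark logistic-regression example. A minimal sketch in that spirit, runnable in spark-shell; the input path, "label x1 … xD" file format, dimension D, and iteration count are all assumptions:

        import scala.math.exp
        import scala.util.Random

        case class DataPoint(x: Array[Double], y: Double)   // y is the label: +1.0 or -1.0

        val D = 10            // number of dimensions (assumption)
        val ITERATIONS = 5    // gradient-descent steps (assumption)

        // Parse and cache the points so every iteration reuses the in-memory RDD
        val points = sc.textFile("hdfs:///tmp/points.txt").map { line =>
          val parts = line.split(" ").map(_.toDouble)
          DataPoint(parts.tail, parts.head)
        }.cache()

        var w = Array.fill(D)(2 * Random.nextDouble - 1)    // start with a random plane

        for (i <- 1 to ITERATIONS) {
          // Sum a function of w over the data, in parallel across the cluster
          val gradient = points.map { p =>
            val dot   = (w, p.x).zipped.map(_ * _).sum
            val scale = (1.0 / (1.0 + exp(-p.y * dot)) - 1.0) * p.y
            p.x.map(_ * scale)
          }.reduce((a, b) => (a, b).zipped.map(_ + _))
          w = (w, gradient).zipped.map(_ - _)               // move w against the gradient
        }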
  • 25. Logistic Regression Performance (chart)
  • 26. Spark and Hadoop: a Framework within a Framework
  • 27. (Stack diagram: Spark sits alongside MapReduce, Impala, Solr, HBase, and others over the shared storage, resource management, metadata, and integration layers, surrounded by system management, data management, security, and support.)
  • 28. (figure)
  • 29. (figure)
  • 30. (The same stack diagram as slide 27, with Spark as a first-class engine next to MapReduce, Impala, Solr, and HBase.)
  • 31. Spark Streaming takes the concept of RDDs and extends it to DStreams: fault-tolerant like RDDs, transformable like RDDs, and with new "rolling window" operations (rolling averages, etc.). But it keeps everything else: regular Spark code works in Spark Streaming, and it can still access HDFS data, etc.
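    A sketch of a windowed word count over a socket stream; the host, port, and durations are assumptions, while reduceByKeyAndWindow is the real "rolling window" operation:

        import org.apache.spark.streaming.{Seconds, StreamingContext}
        import org.apache.spark.streaming.StreamingContext._    // pair-DStream operations

        val ssc   = new StreamingContext(sc, Seconds(1))        // 1-second micro-batches
        val lines = ssc.socketTextStream("localhost", 9999)     // hypothetical source

        val counts = lines.flatMap(_.split(" "))                // regular Spark-style code…
          .map((_, 1))
          .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(5)) // …over a 30s window sliding every 5s

        counts.print()
        ssc.start()
        ssc.awaitTermination()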
  • 32. Micro-batching for on-the-fly ETL (figure)
  • 33. Fault recovery: how fast can the system recover?
  • 34. Fault Recovery. RDDs store the dependency graph, and because RDDs are deterministic, missing RDDs are rebuilt in parallel on other nodes. Stateful RDDs can have infinite lineage, so periodic checkpoints to disk clear the lineage: faster recovery times, and better handling of stragglers than row-by-row streaming.
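    Enabling those periodic checkpoints is a single call on the streaming context; a sketch with a hypothetical HDFS directory:

        // Stateful and windowed DStreams need a checkpoint directory; Spark
        // persists state there periodically, truncating the lineage graph.
        ssc.checkpoint("hdfs:///tmp/spark-checkpoints")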
  • 35. Summary
  • 36. Why Spark? Flexible like MapReduce; high performance; machine learning and iterative algorithms; interactive data exploration; a concise, easy API for developer productivity.
  • 37. (figure)
  • 38. Spark: http://www.cloudera.com/content/cloudera/en/products-and-services/cdh/spark.html and http://www.cloudera.com/content/cloudera-content/cloudera-docs/CM5/latest/Cloudera-Manager-Installation-Guide/cm5ig_install_spark.html
  • 39. A Brief History, 2002-2014 (timeline): Doug Cutting launches the Nutch project (2002); Google releases the GFS paper (2003) and the MapReduce paper (2004); Nutch adds a distributed file system and MapReduce is implemented in Nutch; Hadoop is spun out of the Nutch project (2006); Hadoop breaks the Terasort world record and Cloudera is founded (2008); CDH and CDH2 released; Cloudera Manager released; HBase, ZooKeeper, Flume and more added to CDH; CDH3 released; CDH4 released, adding HA; Impala (SQL on Hadoop) launched (2012); Sentry and Search launched (2013); CDH5 (2014).
  • 40. What is Apache Hadoop? An open-source implementation of Google's GFS and MapReduce papers; an Apache Software Foundation top-level project; good at storing and processing all kinds of data; reliable storage at terabyte/petabyte scale on unreliable (cheap) hardware; a distributed system for counting words :)
  • 41. What is Apache Hadoop? Apache Hadoop is an open-source platform for data storage and processing that is scalable, fault-tolerant, and distributed. Core Hadoop system components: the Hadoop Distributed File System (HDFS), self-healing, high-bandwidth clustered storage; and MapReduce, a distributed computing framework. It has the flexibility to store and mine any type of data (ask questions across structured and unstructured data that were previously impossible to ask or solve; not bound by a single schema); excels at processing complex data (a scale-out architecture divides workloads across multiple nodes; a flexible file system eliminates ETL bottlenecks); and scales economically (can be deployed on industry-standard hardware; an open-source platform guards against vendor lock).