Hadoop and Spark

Slide deck from NJ Hadoop event held on 4/15/14, http://www.meetup.com/nj-hadoop/events/173570432/

  • MapReduce struggles with performance optimization for individual systems because of its design. Google has used both techniques in-house quite a bit, and the future will contain both.
  • Spark's storage levels are meant to provide different tradeoffs between memory usage and CPU efficiency. We recommend going through the following process to select one:
    – If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
    – If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
    – Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
    – Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
    – If you want to define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.
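Not part of the original notes: a minimal Scala sketch of how those storage-level choices might look in code. The HDFS path, the tab-separated format, and the local master URL are placeholders; the calls themselves (cache, persist, the StorageLevel constants) are standard Spark API, but the decision points shown are illustrative only.

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.storage.StorageLevel

      object StorageLevelSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("storage-levels").setMaster("local[*]"))

          // Default storage level (MEMORY_ONLY) via cache(): fastest to access if it fits in RAM.
          val logs = sc.textFile("hdfs://.../logs").cache()

          // If the deserialized objects do not fit, keep a serialized copy instead to save space.
          val parsed = logs.map(_.split("\t"))
          parsed.persist(StorageLevel.MEMORY_ONLY_SER)

          // Replicated level: a lost partition can be served from the second copy
          // instead of waiting for it to be recomputed from lineage.
          val served = parsed.filter(_.nonEmpty)
          served.persist(StorageLevel.MEMORY_AND_DISK_2)

          println(served.count())
          sc.stop()
        }
      }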
  • Transcript

    • 1. Hadoop  and  Spark   Shravan  (Sean)  Pabba   1  
    • 2. About Me • Diverse roles/languages and platforms. • Middleware space in recent years. • Worked for IBM/Grid Dynamics/GigaSpaces. • Working as a Systems Engineer for Cloudera since last July. • Work with and educate clients/prospects. 2
    • 3. Agenda • Introduction to Spark – MapReduce Review – Why Spark – Architecture (Stand-alone AND Cloudera) • Concepts • Examples/Use Cases • Spark Streaming • Shark – Shark Vs Impala • Demo 3
    • 4. Have you done? • Programming languages (Java/Python/Scala) • Written multi-threaded or distributed programs • Numerical Programming/Statistical Computing (R, MATLAB) • Hadoop 4
    • 5. INTRODUCTION  TO  SPARK   5  
    • 6. A brief review of MapReduce (diagram: many Map tasks feeding a smaller number of Reduce tasks) Key advances by MapReduce: • Data Locality: Automatic split computation and launch of mappers appropriately • Fault tolerance: Writing intermediate results and restartable mappers means the ability to run on commodity hardware • Linear scalability: Combination of locality + a programming model that forces developers to write generally scalable solutions to problems 6
    • 7. MapReduce  sufficient  for  many  classes   of  problems   MapReduce   Hive   Pig   Mahout   Crunch   Solr   A  bit  like  Haiku:     •  Limited  expressivity   •  But  can  be  used  to  approach  diverse  problem  domains     7  
    • 8. BUT… Can we do better? Areas ripe for improvement: • Launching Mappers/Reducers takes time • Having to write to disk (replicated) between each step • Reading data back from disk in the next step • Each Map/Reduce step has to go back into the queue and get its resources • Not in memory • Cannot iterate fast 8
    • 9. What is Spark? Spark is a general-purpose computational framework with more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1]. Key properties: • Leverages distributed memory • Full directed-graph expressions for data-parallel computations • Improved developer experience Yet retains: linear scalability, fault tolerance, and data-locality-based computations 1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf 9
    • 10. Spark: Easy and Fast Big Data • Easy to Develop – Rich APIs in Java, Scala, Python – Interactive shell • Fast to Run – General execution graphs – In-memory storage 2-5× less code; up to 10× faster on disk, 100× in memory 10
    • 11. Easy: Get Started Immediately • Multi-language support • Interactive Shell Python: lines = sc.textFile(...) lines.filter(lambda s: "ERROR" in s).count() Scala: val lines = sc.textFile(...) lines.filter(s => s.contains("ERROR")).count() Java: JavaRDD<String> lines = sc.textFile(...); lines.filter(new Function<String, Boolean>() { public Boolean call(String s) { return s.contains("ERROR"); } }).count(); 11
    • 12. Spark Ecosystem http://www.databricks.com/spark/#sparkhadoop 12
    • 13. Spring Framework http://docs.spring.io/spring/docs/1.2.9/reference/introduction.html 13
    • 14. Spark  in  Cloudera  EDH   3RD  PARTY   APPS   STORAGE  FOR  ANY  TYPE  OF  DATA   UNIFIED,  ELASTIC,  RESILIENT,  SECURE             CLOUDERA’S  ENTERPRISE  DATA  HUB   BATCH   PROCESSING   MAPREDUCE   SPARK   ANALYTIC   SQL   IMPALA   SEARCH   ENGINE   SOLR   MACHINE   LEARNING   SPARK   STREAM   PROCESSING   SPARK  STREAMING   WORKLOAD  MANAGEMENT   YARN   FILESYSTEM   HDFS   ONLINE  NOSQL   HBASE   DATA   MANAGEMENT   CLOUDERA  NAVIGATOR   SYSTEM   MANAGEMENT   CLOUDERA  MANAGER   SENTRY  ,  SECURE   14  
    • 15. Adoption • Supporting: – Databricks • Contributing: – UC Berkeley, Databricks, Yahoo, etc. • Well-known use cases: – Conviva, Quantifind, Bizo 15
    • 16. CONCEPTS   16  
    • 17. Spark Concepts - Overview • Driver & Workers • RDD – Resilient Distributed Dataset • Transformations • Actions • Caching 17
    • 18. Driver and Workers (diagram: one Driver coordinating several Workers, each holding its Data in RAM) 18
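To make the diagram concrete, here is a small Scala sketch (not in the deck) of the driver side of that picture: the driver program owns the SparkContext and ships tasks to the workers, which keep their partitions in memory. The local master URL and the toy dataset are assumptions for illustration.

      import org.apache.spark.{SparkConf, SparkContext}

      object DriverSketch {
        def main(args: Array[String]): Unit = {
          // The driver: builds the job and coordinates the workers (executors).
          val conf = new SparkConf().setAppName("driver-and-workers").setMaster("local[4]")
          val sc = new SparkContext(conf)

          // 8 partitions spread across the workers; each worker keeps its slices in RAM.
          val data = sc.parallelize(1 to 1000, numSlices = 8)

          // Tasks run on the workers; only the final result comes back to the driver.
          println(data.map(_ * 2).reduce(_ + _))

          sc.stop()
        }
      }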
    • 19. RDD – Resilient Distributed Dataset • Read-only partitioned collection of records • Created through: – Transformation of data in storage – Transformation of RDDs • Contains lineage to compute from storage • Lazy materialization • Users control persistence and partitioning 19
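A short Scala sketch (not from the deck) of the two creation paths listed above: an RDD built from data in storage, and RDDs derived from other RDDs. The path and the tab-separated format are placeholders; toDebugString simply prints the lineage Spark would replay to rebuild a lost partition.

      import org.apache.spark.{SparkConf, SparkContext}

      object RddSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))

          // Created through transformation of data in storage (placeholder path).
          val lines = sc.textFile("hdfs://.../events")

          // Created through transformation of existing RDDs; nothing runs yet (lazy materialization).
          val fields   = lines.map(_.split("\t"))
          val nonEmpty = fields.filter(_.length > 1)

          // The lineage (textFile -> map -> filter) is what makes the RDD resilient.
          println(nonEmpty.toDebugString)

          sc.stop()
        }
      }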
    • 20. Operations Transformations: • Map • Filter • Sample • Join Actions: • Reduce • Count • First, Take • SaveAs 20
    • 21. Operations • Transformations create a new RDD from an existing one • Actions run a computation on an RDD and return a value • Transformations are lazy • Actions materialize RDDs by computing their transformations • RDDs can be cached to avoid re-computing 21
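Illustrating slides 20–21 with a Scala sketch that is not part of the deck: transformations only describe a new RDD, the first action triggers the actual computation, and cache() keeps the result around for later actions.

      import org.apache.spark.{SparkConf, SparkContext}

      object LazyOpsSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("lazy-ops").setMaster("local[*]"))

          val nums = sc.parallelize(1 to 100)

          // Transformations: lazy, they only build up the lineage.
          val squares = nums.map(n => n * n)
          val evens   = squares.filter(_ % 2 == 0)
          evens.cache()                          // marked for caching; filled on the first action

          // Actions: materialize the RDD and return values to the driver.
          println(evens.count())                 // first action computes and caches the partitions
          println(evens.take(5).mkString(", "))  // later actions reuse the cached data

          sc.stop()
        }
      }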
    • 22. Fault Tolerance • RDDs contain lineage • Lineage – source location and list of transformations • Lost partitions can be re-computed from source data msgs = textFile.filter(lambda s: s.startswith("ERROR")).map(lambda s: s.split("\t")[2]) (lineage: HDFS File → Filtered RDD via filter(startswith(…)) → Mapped RDD via map(split(...))) 22
    • 23. Caching • persist() and cache() mark data for caching • An RDD is cached after the first action • Fault tolerant – lost partitions will be re-computed • If there is not enough memory, some partitions will not be cached • Future actions are performed on the cached partitions, so they are much faster. Use caching for iterative algorithms 23
    • 24. Caching  –  Storage  Levels   •  MEMORY_ONLY   •  MEMORY_AND_DISK   •  MEMORY_ONLY_SER   •  MEMORY_AND_DISK_SER   •  DISK_ONLY   •  MEMORY_ONLY_2,  MEMORY_AND_DISK_2…   24  
    • 25. SPARK  EXAMPLES   25  
    • 26. Easy: Example – Word Count • Hadoop MapReduce: public static class WordCountMapClass extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { private final static IntWritable one = new IntWritable(1); private Text word = new Text(); public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { String line = value.toString(); StringTokenizer itr = new StringTokenizer(line); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); output.collect(word, one); } } } public static class WordCountReduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { int sum = 0; while (values.hasNext()) { sum += values.next().get(); } output.collect(key, new IntWritable(sum)); } } • Spark: val spark = new SparkContext(master, appName, [sparkHome], [jars]) val file = spark.textFile("hdfs://...") val counts = file.flatMap(line => line.split(" ")) .map(word => (word, 1)) .reduceByKey(_ + _) counts.saveAsTextFile("hdfs://...") 26
    • 28. Spark Word Count in Java JavaSparkContext sc = new JavaSparkContext(...); JavaRDD<String> lines = sc.textFile("hdfs://..."); JavaRDD<String> words = lines.flatMap( new FlatMapFunction<String, String>() { public Iterable<String> call(String s) { return Arrays.asList(s.split(" ")); } } ); JavaPairRDD<String, Integer> ones = words.map( new PairFunction<String, String, Integer>() { public Tuple2<String, Integer> call(String s) { return new Tuple2(s, 1); } } ); JavaPairRDD<String, Integer> counts = ones.reduceByKey( new Function2<Integer, Integer, Integer>() { public Integer call(Integer i1, Integer i2) { return i1 + i2; } } ); With Java 8 lambda expressions [1]: JavaRDD<String> lines = sc.textFile("hdfs://log.txt"); JavaRDD<String> words = lines.flatMap(line -> Arrays.asList(line.split(" "))); JavaPairRDD<String, Integer> ones = words.mapToPair(w -> new Tuple2<String, Integer>(w, 1)); JavaPairRDD<String, Integer> counts = ones.reduceByKey((x, y) -> x + y); 1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html 28
    • 29. Log Mining • Load error messages from a log into memory • Interactively search for patterns 29
    • 30. Log Mining val lines = sparkContext.textFile("hdfs://…") val errors = lines.filter(_.startsWith("ERROR")) val messages = errors.map(_.split('\t')(2)) val cachedMsgs = messages.cache() cachedMsgs.filter(_.contains("foo")).count cachedMsgs.filter(_.contains("bar")).count … (lines is the base RDD; errors and messages are transformed RDDs; count is an action) 30
    • 31. Logistic Regression • Read two sets of points • Look for a plane W that separates them • Perform gradient descent: – Start with a random W – On each iteration, sum a function of W over the data – Move W in a direction that improves it 31
    • 32. Intuition 32
    • 33. Logistic Regression val points = spark.textFile(…).map(parsePoint).cache() var w = Vector.random(D) for (i <- 1 to ITERATIONS) { val gradient = points.map(p => (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x ).reduce(_ + _) w -= gradient } println("Final separating plane: " + w) 33
    • 34. Conviva Use Case [1] • Monitor online video consumption • Analyze trends Need to run tens of queries like this a day: SELECT videoName, COUNT(1) FROM summaries WHERE date='2011_12_12' AND customer='XYZ' GROUP BY videoName; 1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/ 34
    • 35. Conviva With Spark val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs) val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache val mapFn: SessionSummary => (String, Long) = { s => (s.videoName, 1) } val reduceFn: (Long, Long) => Long = { (a, b) => a + b } val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap 35
    • 36. SPARK  STREAMING   36  
    • 37. Large-Scale Stream Processing Requires • Fault tolerance – for crashes and stragglers • Efficiency • Row-by-row (continuous-operator) systems do not handle straggler nodes • Batch processing provides fault tolerance efficiently – the job is divided into deterministic tasks 37
    • 38. Key Question • How fast can the system recover? 38
    • 39. Spark Streaming http://spark.apache.org/docs/latest/streaming-programming-guide.html 39
    • 40. Spark Streaming – Run continuous processing of data using Spark's core API. – Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing. – Adds "rolling window" operations, e.g. compute rolling averages or counts for data over the last five minutes. – Example use cases: • "On-the-fly" ETL as data is ingested into Hadoop/HDFS. • Detecting anomalous behavior and triggering alerts. • Continuous reporting of summary metrics for incoming data. 40
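Not in the deck: a minimal Scala sketch of the "rolling window" idea on a DStream, assuming a socket text source, 10-second micro-batches, and word counts computed over the last five minutes.

      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}

      object WindowSketch {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf().setAppName("rolling-window").setMaster("local[2]")
          val ssc  = new StreamingContext(conf, Seconds(10))   // 10-second micro-batches

          // Assumption: lines arrive on a local socket; any receiver would do.
          val lines = ssc.socketTextStream("localhost", 9999)
          val words = lines.flatMap(_.split(" "))

          // Rolling word counts over the last five minutes, recomputed every 10 seconds.
          words.window(Minutes(5), Seconds(10)).countByValue().print()

          ssc.start()
          ssc.awaitTermination()
        }
      }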
    • 41. val tweets = ssc.twitterStream() val hashTags = tweets.flatMap(status => getTags(status)) hashTags.saveAsHadoopFiles("hdfs://...") (diagram: the tweets and hashTags DStreams are each composed of small (1–10s) batch computations at t, t+1, t+2; flatMap and save run once per batch) "Micro-batch" Architecture 41
    • 42. SHARK   42  
    • 43. Shark Architecture • Identical to Hive • Same CLI, JDBC, SQL parser, Metastore • Replaced the optimizer, plan generator and the execution engine • Added a Cache Manager • Generates Spark code instead of MapReduce 43
    • 44. Hive Compatibility • MetaStore • HQL • UDF / UDAF • SerDes • Scripts 44
    • 45. Shark Vs Impala • Shark inherits Hive's limitations while Impala is purpose-built for SQL. • Impala is significantly faster per our tests. • Shark does not have security, audit/lineage, support for high concurrency, or operational tooling for config/monitoring/reporting/debugging. • Interactive SQL is needed for connecting BI tools; Shark is not certified by any BI vendor. 45
    • 46. DEMO   46  
    • 47. SUMMARY   47  
    • 48. Why Spark? • Flexible like MapReduce • High performance • Machine learning, iterative algorithms • Interactive data exploration • Developer productivity 48
    • 49. How Spark Works • RDDs – resilient distributed datasets • Lazy transformations • Caching • Fault tolerance by storing lineage • Streams – micro-batches of RDDs • Shark – Hive + Spark 49
