Hadoop and Spark

Published: Slide deck from the NJ Hadoop event held on 4/15/14, http://www.meetup.com/nj-hadoop/events/173570432/
Published in: Technology, Education
Slide notes:
  • MapReduce struggles with performance optimization for individual systems because of its design. Google has used both techniques in-house quite a bit, and the future will contain both.
  • Spark's storage levels are meant to provide different trade-offs between memory usage and CPU efficiency. We recommend going through the following process to select one:
    – If your RDDs fit comfortably with the default storage level (MEMORY_ONLY), leave them that way. This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
    – If not, try MEMORY_ONLY_SER and select a fast serialization library to make the objects much more space-efficient, but still reasonably fast to access.
    – Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
    – Use the replicated storage levels if you want fast fault recovery (e.g. if using Spark to serve requests from a web application). All the storage levels provide full fault tolerance by recomputing lost data, but the replicated ones let you continue running tasks on the RDD without waiting to recompute a lost partition.
    – If you want to define your own storage level (say, with a replication factor of 3 instead of 2), use the apply() factory method of the StorageLevel singleton object.

Transcript:
    1. Hadoop and Spark
       Shravan (Sean) Pabba

    2. About Me
       • Diverse roles/languages and platforms.
       • Middleware space in recent years.
       • Worked for IBM/Grid Dynamics/GigaSpaces.
       • Working as a Systems Engineer for Cloudera since last July.
       • Work with and educate clients/prospects.

    3. Agenda
       • Introduction to Spark
         – MapReduce Review
         – Why Spark
         – Architecture (Stand-alone AND Cloudera)
       • Concepts
       • Examples/Use Cases
       • Spark Streaming
       • Shark
         – Shark vs Impala
       • Demo

    4. Have you done?
       • Programming languages (Java/Python/Scala)
       • Written multi-threaded or distributed programs
       • Numerical Programming/Statistical Computing (R, MATLAB)
       • Hadoop
    5. INTRODUCTION TO SPARK

    6. A brief review of MapReduce
       [Diagram: many parallel Map tasks feeding a smaller set of Reduce tasks]
       Key advances by MapReduce:
       • Data locality: automatic split computation and launch of mappers appropriately
       • Fault tolerance: writing intermediate results and restartable mappers means the ability to run on commodity hardware
       • Linear scalability: combination of locality + a programming model that forces developers to write generally scalable solutions to problems

    7. MapReduce sufficient for many classes of problems
       MapReduce: Hive, Pig, Mahout, Crunch, Solr
       A bit like Haiku:
       • Limited expressivity
       • But can be used to approach diverse problem domains

    8. BUT… Can we do better?
       Areas ripe for improvement:
       • Launching Mappers/Reducers takes time
       • Having to write to disk (replicated) between each step
       • Reading data back from disk in the next step
       • Each Map/Reduce step has to go back into the queue and get its resources
       • Not in memory
       • Cannot iterate fast

    9. What is Spark?
       Spark is a general-purpose computational framework with more flexibility than MapReduce. It is an implementation of a 2010 Berkeley paper [1].
       Key properties:
       • Leverages distributed memory
       • Full directed-graph expressions for data-parallel computations
       • Improved developer experience
       Yet retains: linear scalability, fault tolerance and data-locality-based computations
       1 - http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf

    10. Spark: Easy and Fast Big Data
        • Easy to Develop
          – Rich APIs in Java, Scala, Python
          – Interactive shell
        • Fast to Run
          – General execution graphs
          – In-memory storage
        2-5× less code; up to 10× faster on disk, 100× in memory
    11. Easy: Get Started Immediately
        • Multi-language support
        • Interactive Shell

        Python:
          lines = sc.textFile(...)
          lines.filter(lambda s: "ERROR" in s).count()

        Scala:
          val lines = sc.textFile(...)
          lines.filter(s => s.contains("ERROR")).count()

        Java:
          JavaRDD<String> lines = sc.textFile(...);
          lines.filter(new Function<String, Boolean>() {
            Boolean call(String s) { return s.contains("ERROR"); }
          }).count();
    12. Spark Ecosystem
        http://www.databricks.com/spark/#sparkhadoop

    13. Spring Framework
        http://docs.spring.io/spring/docs/1.2.9/reference/introduction.html

    14. Spark in Cloudera EDH
        [Diagram: Cloudera's Enterprise Data Hub – batch processing (MapReduce, Spark), analytic SQL (Impala), search engine (Solr), machine learning (Spark), stream processing (Spark Streaming), workload management (YARN), filesystem (HDFS), online NoSQL (HBase); storage for any type of data – unified, elastic, resilient, secure; data management (Cloudera Navigator), system management (Cloudera Manager), security (Sentry); 3rd-party apps on top]

    15. Adoption
        • Supporting:
          – DataBricks
        • Contributing:
          – UC Berkeley, DataBricks, Yahoo, etc.
        • Well-known use-cases:
          – Conviva, Quantifind, Bizo

    16. CONCEPTS

    17. Spark Concepts - Overview
        • Driver & Workers
        • RDD – Resilient Distributed Dataset
        • Transformations
        • Actions
        • Caching
    18. Driver and Workers
        [Diagram: one Driver coordinating three Workers, each holding Data in RAM]
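        A minimal sketch of the driver/worker split, assuming a hypothetical master URL and application name; the driver builds the SparkContext and defines RDDs, while partitions of the data live in the workers' memory:

          import org.apache.spark.SparkContext

          // Driver process: creates the context that schedules tasks on the workers
          val sc = new SparkContext("spark://master:7077", "MyApp")   // placeholder master URL and app name

          // RDDs are defined on the driver; their partitions are held and processed by the workers
          val data = sc.textFile("hdfs://...")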
    19. RDD – Resilient Distributed Dataset
        • Read-only partitioned collection of records
        • Created through:
          – Transformation of data in storage
          – Transformation of RDDs
        • Contains lineage to compute from storage
        • Lazy materialization
        • Users control persistence and partitioning

    20. Operations
        Transformations
        • Map
        • Filter
        • Sample
        • Join
        Actions
        • Reduce
        • Count
        • First, Take
        • SaveAs
    21. Operations
        • Transformations create a new RDD from an existing one
        • Actions run computation on an RDD and return a value
        • Transformations are lazy.
        • Actions materialize RDDs by computing transformations.
        • RDDs can be cached to avoid re-computing.
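        A minimal sketch of this lazy behaviour, reusing the textFile/filter/count pattern from the earlier shell examples:

          val lines  = sc.textFile("hdfs://...")          // RDD over data in storage; nothing read yet
          val errors = lines.filter(_.contains("ERROR"))  // transformation: a new RDD, still not computed
          val n      = errors.count()                     // action: triggers the scan and returns a value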
    22. Fault Tolerance
        • RDDs contain lineage.
        • Lineage – source location and list of transformations
        • Lost partitions can be re-computed from source data

          msgs = textFile.filter(lambda s: s.startsWith("ERROR"))
                         .map(lambda s: s.split("\t")[2])

        Lineage: HDFS File → filter (func = startsWith(…)) → Filtered RDD → map (func = split(…)) → Mapped RDD
    23. Caching
        • persist() and cache() mark data
        • RDD is cached after the first action
        • Fault tolerant – lost partitions will be re-computed
        • If there is not enough memory, some partitions will not be cached
        • Future actions are performed on the cached partitions, so they are much faster
        Use caching for iterative algorithms
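        A minimal sketch of cache() in an iterative job, in the spirit of the logistic-regression example later in the deck (parsePoint, ITERATIONS and the per-point function f are assumed to be defined elsewhere):

          val points = sc.textFile("hdfs://...").map(parsePoint).cache()  // marked for caching; nothing cached yet
          points.count()                      // first action materializes the RDD and caches its partitions
          for (i <- 1 to ITERATIONS) {
            // each pass reuses the in-memory partitions instead of re-reading from HDFS
            val stat = points.map(p => f(p)).reduce(_ + _)
          }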
    24. Caching – Storage Levels
        • MEMORY_ONLY
        • MEMORY_AND_DISK
        • MEMORY_ONLY_SER
        • MEMORY_AND_DISK_SER
        • DISK_ONLY
        • MEMORY_ONLY_2, MEMORY_AND_DISK_2…
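        A minimal sketch of choosing a non-default level (cache() is shorthand for persist() at MEMORY_ONLY); the path is a placeholder:

          import org.apache.spark.storage.StorageLevel

          val sessions = sc.textFile("hdfs://...")
          // keep partitions in memory, spilling to local disk when they do not fit
          sessions.persist(StorageLevel.MEMORY_AND_DISK)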
    25. SPARK EXAMPLES
    26–27. Easy: Example – Word Count
        • Hadoop MapReduce

          public static class WordCountMapClass extends MapReduceBase
              implements Mapper<LongWritable, Text, Text, IntWritable> {
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            public void map(LongWritable key, Text value,
                            OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              String line = value.toString();
              StringTokenizer itr = new StringTokenizer(line);
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                output.collect(word, one);
              }
            }
          }

          public static class WordCountReduce extends MapReduceBase
              implements Reducer<Text, IntWritable, Text, IntWritable> {
            public void reduce(Text key, Iterator<IntWritable> values,
                               OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
              int sum = 0;
              while (values.hasNext()) {
                sum += values.next().get();
              }
              output.collect(key, new IntWritable(sum));
            }
          }

        • Spark

          val spark = new SparkContext(master, appName, [sparkHome], [jars])
          val file = spark.textFile("hdfs://...")
          val counts = file.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
          counts.saveAsTextFile("hdfs://...")
    28. Spark Word Count in Java

          JavaSparkContext sc = new JavaSparkContext(...);
          JavaRDD<String> lines = sc.textFile("hdfs://...");
          JavaRDD<String> words = lines.flatMap(
            new FlatMapFunction<String, String>() {
              public Iterable<String> call(String s) {
                return Arrays.asList(s.split(" "));
              }
            }
          );

          JavaPairRDD<String, Integer> ones = words.map(
            new PairFunction<String, String, Integer>() {
              public Tuple2<String, Integer> call(String s) {
                return new Tuple2(s, 1);
              }
            }
          );

          JavaPairRDD<String, Integer> counts = ones.reduceByKey(
            new Function2<Integer, Integer, Integer>() {
              public Integer call(Integer i1, Integer i2) {
                return i1 + i2;
              }
            }
          );

        With Java 8 Lambda Expressions [1]:

          JavaRDD<String> lines = sc.textFile("hdfs://log.txt");

          JavaRDD<String> words =
            lines.flatMap(line -> Arrays.asList(line.split(" ")));

          JavaPairRDD<String, Integer> ones =
            words.mapToPair(w -> new Tuple2<String, Integer>(w, 1));

          JavaPairRDD<String, Integer> counts =
            ones.reduceByKey((x, y) -> x + y);

        1 - http://docs.oracle.com/javase/tutorial/java/javaOO/lambdaexpressions.html
    29. Log Mining
        • Load error messages from a log into memory
        • Interactively search for patterns

    30. Log Mining

          val lines = sparkContext.textFile("hdfs://…")         // base RDD
          val errors = lines.filter(_.startsWith("ERROR"))      // transformed RDD
          val messages = errors.map(_.split('\t')(2))

          val cachedMsgs = messages.cache()

          cachedMsgs.filter(_.contains("foo")).count            // action
          cachedMsgs.filter(_.contains("bar")).count
          …
    31. Logistic Regression
        • Read two sets of points
        • Look for a plane W that separates them
        • Perform gradient descent:
          – Start with a random W
          – On each iteration, sum a function of W over the data
          – Move W in a direction that improves it

    32. Intuition
        [Diagram only]
    33. Logistic Regression

          val points = spark.textFile(…).map(parsePoint).cache()

          var w = Vector.random(D)

          for (i <- 1 to ITERATIONS) {
            val gradient = points.map(p =>
              (1 / (1 + exp(-p.y * (w dot p.x))) - 1) * p.y * p.x
            ).reduce(_ + _)
            w -= gradient
          }
          println("Final separating plane: " + w)
    34. Conviva Use-Case [1]
        • Monitor online video consumption
        • Analyze trends
        Need to run tens of queries like this a day:

          SELECT videoName, COUNT(1)
          FROM summaries
          WHERE date='2011_12_12' AND customer='XYZ'
          GROUP BY videoName;

        1 - http://www.conviva.com/using-spark-and-hive-to-process-bigdata-at-conviva/

    35. Conviva With Spark

          val sessions = sparkContext.sequenceFile[SessionSummary, NullWritable](pathToSessionSummaryOnHdfs)

          val cachedSessions = sessions.filter(whereConditionToFilterSessions).cache

          val mapFn : SessionSummary => (String, Long) = { s => (s.videoName, 1) }
          val reduceFn : (Long, Long) => Long = { (a, b) => a + b }

          val results = cachedSessions.map(mapFn).reduceByKey(reduceFn).collectAsMap
    36. SPARK STREAMING

    37. Large-Scale Stream Processing
        Requires:
        • Fault tolerance – for crashes and stragglers
        • Efficiency
        Row-by-row (continuous operator) systems do not handle straggler nodes.
        Batch processing provides fault tolerance efficiently: the job is divided into deterministic tasks.

    38. Key Question
        • How fast can the system recover?

    39. Spark Streaming
        http://spark.apache.org/docs/latest/streaming-programming-guide.html
    40. Spark Streaming
        – Run continuous processing of data using Spark's core API.
        – Extends Spark's concept of RDDs to DStreams (Discretized Streams), which are fault-tolerant, transformable streams. Users can re-use existing code for batch/offline processing.
        – Adds "rolling window" operations, e.g. compute rolling averages or counts for data over the last five minutes.
        – Example use cases:
          • "On-the-fly" ETL as data is ingested into Hadoop/HDFS.
          • Detecting anomalous behavior and triggering alerts.
          • Continuous reporting of summary metrics for incoming data.
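        A minimal sketch of a rolling-window count, assuming a placeholder socket source and an existing SparkConf named sparkConf; the window covers the last five minutes and slides every ten seconds:

          import org.apache.spark.streaming.{Minutes, Seconds, StreamingContext}
          import org.apache.spark.streaming.StreamingContext._

          val ssc = new StreamingContext(sparkConf, Seconds(1))      // 1-second micro-batches
          val lines = ssc.socketTextStream("localhost", 9999)        // placeholder host/port
          val counts = lines.flatMap(_.split(" "))
                            .map(w => (w, 1))
                            .reduceByKeyAndWindow(_ + _, Minutes(5), Seconds(10))
          counts.print()
          ssc.start()
          ssc.awaitTermination()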
    41. "Micro-batch" Architecture

          val tweets = ssc.twitterStream()
          val hashTags = tweets.flatMap(status => getTags(status))
          hashTags.saveAsHadoopFiles("hdfs://...")

        [Diagram: the tweets DStream and the derived hashTags DStream as a stream composed of small (1-10s) batch computations; each batch (t, t+1, t+2) applies flatMap and then save]
    42. SHARK

    43. Shark Architecture
        • Identical to Hive
        • Same CLI, JDBC, SQL Parser, Metastore
        • Replaced the optimizer, plan generator and the execution engine.
        • Added a Cache Manager.
        • Generate Spark code instead of MapReduce

    44. Hive Compatibility
        • MetaStore
        • HQL
        • UDF / UDAF
        • SerDes
        • Scripts
    45. Shark vs Impala
        • Shark inherits Hive limitations while Impala is purpose-built for SQL.
        • Impala is significantly faster per our tests.
        • Shark does not have security, audit/lineage, support for high concurrency, or operational tooling for config/monitoring/reporting/debugging.
        • Interactive SQL is needed for connecting BI tools. Shark is not certified by any BI vendor.

    46. DEMO

    47. SUMMARY

    48. Why Spark?
        • Flexible like MapReduce
        • High performance
        • Machine learning, iterative algorithms
        • Interactive data exploration
        • Developer productivity

    49. How Spark Works?
        • RDDs – resilient distributed datasets
        • Lazy transformations
        • Caching
        • Fault tolerance by storing lineage
        • Streams – micro-batches of RDDs
        • Shark – Hive + Spark
