SD CADD meeting 2016-08-30: Intro to Spark

  1. Introduction to Apache Spark
     Peter Rose, peter.rose@rcsb.org
  2. Apache Spark is a fast and general engine for large-scale data processing
     • In-memory processing
     Successor of Hadoop (MapReduce)
     • File-based processing
     http://spark.apache.org/
  3. Spark Ecosystem
  4. Apache Spark works in parallel on
     • Multicore laptop, desktop
     • Single server
     • Cluster (need cluster manager)
  5. Map-Reduce Example
     • RDD<String>, RDD<String>, PairRDD<String,Integer>, PairRDD<String,Integer>
     • one to many, one to one
  6. • Scalable machine learning library
     • Module for running queries on structured data
     • Data Sources
     • Module to build scalable fault-tolerant streaming applications
     • Core Data Structures
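
The Map-Reduce example on slide 5 survives only as type labels, so the following is a minimal sketch of what such a pipeline might look like, not the author's code. It assumes the example is a word count: a one-to-many step splits lines into words, a one-to-one step turns each word into a (word, 1) pair, and a reduce step sums the values per key. It is plain Python that imitates the stages on a single machine; the helper names flat_map and reduce_by_key and the sample lines are hypothetical, and nothing here calls Spark.

```python
from itertools import chain


def flat_map(func, items):
    """One-to-many step: each input item may yield several output items."""
    return list(chain.from_iterable(func(item) for item in items))


def reduce_by_key(func, pairs):
    """Merge all values that share a key using a two-argument function."""
    result = {}
    for key, value in pairs:
        result[key] = func(result[key], value) if key in result else value
    return result


# Hypothetical input, standing in for an RDD<String> of text lines.
lines = ["spark is fast", "spark is general"]

# One to many: lines -> words (still a collection of strings).
words = flat_map(str.split, lines)

# One to one: word -> (word, 1), the analogue of a PairRDD<String,Integer>.
pairs = [(word, 1) for word in words]

# Reduce: sum the counts for each distinct word.
counts = reduce_by_key(lambda a, b: a + b, pairs)

print(counts)
```

In PySpark the three stages would typically be written with the RDD methods flatMap, map, and reduceByKey, and the work would be distributed across cores or cluster nodes instead of running in a single loop.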
