
SD CADD meeting 2016-08-30: Intro to Spark

Introduction to Spark

Published in: Data & Analytics

  1. Introduction to Apache Spark
     Peter Rose, peter.rose@rcsb.org
  2. Apache Spark is a fast and general engine for large-scale data processing
       • In-memory processing
     Successor of Hadoop (MapReduce)
       • File-based processing
     http://spark.apache.org/
  3. Spark Ecosystem
  4. Apache Spark works in parallel on
       • Multicore laptop, desktop
       • Single server
       • Cluster (needs a cluster manager)
  5. Map-Reduce Example
     RDD<String> → RDD<String> → PairRDD<String,Integer> → PairRDD<String,Integer>
     (transformations labeled "one to many" and "one to one")
  6. Diagram labels:
       • Scalable machine learning library
       • Module for running queries on structured data
       • Data Sources
       • Module to build scalable fault-tolerant streaming applications
       • Core Data Structures
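
The Map-Reduce Example on slide 5 lists only the data types at each stage, so the following is a minimal sketch of what such a pipeline might look like, assuming it is the usual word count. It is plain Python and does not use Spark; the helper names `flat_map`, `map_to_pair`, and `reduce_by_key` are hypothetical stand-ins chosen to echo the Spark operations `flatMap`, `mapToPair`, and `reduceByKey`, and the input lines are made up for illustration.

```python
from itertools import chain


def flat_map(func, items):
    # One to many: each input item may produce several output items.
    return list(chain.from_iterable(func(item) for item in items))


def map_to_pair(func, items):
    # One to one: each input item produces exactly one (key, value) pair.
    return [func(item) for item in items]


def reduce_by_key(func, pairs):
    # Combine all values that share a key using the given binary function.
    totals = {}
    for key, value in pairs:
        totals[key] = func(totals[key], value) if key in totals else value
    return totals


def word_count(lines):
    # Lines (strings) -> words (strings): corresponds to RDD<String> -> RDD<String>.
    words = flat_map(str.split, lines)
    # Words -> (word, 1) pairs: corresponds to RDD<String> -> PairRDD<String,Integer>.
    pairs = map_to_pair(lambda word: (word, 1), words)
    # Sum the counts per word: PairRDD<String,Integer> -> PairRDD<String,Integer>.
    return reduce_by_key(lambda a, b: a + b, pairs)


# Hypothetical input, chosen only to exercise the pipeline.
counts = word_count(["to be or", "not to be"])
print(counts)
```

In Spark itself, each of these steps would run in parallel across partitions of the data rather than over in-memory Python lists; the sketch only illustrates the shape of the transformations.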
