Introduction to Spark
By Tsai Li Ming 
Spark Meetup (SG) - 26 Nov 2014
Who am I? 
• Started with High Performance Computing -> Grid -> Cloud, and now Big Data 
• Member of PUGS committee 
• http://about.me/tsailiming
Why In Memory Computing?
Speed matters 
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
http://hblok.net/blog/storage/
18:1 server consolidation
In Memory Data Grid 
Ignite vs. Spark 

Apache Spark is a data-analytics and ML-centric system that ingests data from HDFS or another distributed file system and performs in-memory processing of this data. Ignite is an In-Memory Data Fabric that is data-source agnostic and provides both a Hadoop-like computation engine (MapReduce) as well as many other computing paradigms such as MPP, MPI, and stream processing. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources. Ignite is a broader in-memory system that is less focused on Hadoop. Apache Spark is more inclined towards analytics and ML, and focused on MR-specific payloads. 
https://wiki.apache.org/incubator/IgniteProposal
In Memory Database 
And many more! 
http://en.wikipedia.org/wiki/List_of_in-memory_databases
What is Spark?
What is Spark? 
• Developed at UC Berkeley in 2009. Open sourced in 2010. 
• Fast and general engine for large-scale data processing 
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk 
• Multi-step Directed Acyclic Graphs (DAGs): many stages, compared to just Map and Reduce in Hadoop 
• Rich Scala, Java and Python APIs. R too! (SparkR to be merged into 1.3.0) 
• Interactive shell 
• Active development
Spark Stack
Active Development 
*June 2014* 
http://radar.oreilly.com/2014/06/a-growing-number-of-applications-are-being-built-with-spark.html
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Logistic Regression Performance 
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
How Spark works
Spark Execution Flow 
YARN 
Standalone
Resilient Distributed Datasets (RDDs) 
• Basic abstraction in Spark. Fault-tolerant collection of elements that can be operated on in parallel 
• RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat. 
• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc. 
• Rich APIs for Transformations and Actions 
• Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
RDD Operations 
map flatMap sortByKey 
filter union reduce 
sample join count 
groupByKey distinct saveAsTextFile 
reduceByKey mapValues first
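The operations above split into transformations (map, filter, join, ...) and actions (reduce, count, collect, ...). In Spark, transformations are lazy and only actions trigger computation; as a rough analogy (not Spark code), Python generators show the same deferred-evaluation flavour:

```python
# Transformations in Spark only record what to do; actions force the work.
# A generator expression behaves similarly: building it runs nothing.
data = range(1, 6)
transformed = (x * 2 for x in data)   # like a transformation: nothing runs yet
result = list(transformed)            # like an action: forces evaluation
print(result)   # [2, 4, 6, 8, 10]
```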
RDDs Transformations Examples (1) 
map(func) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.map(lambda x: x+1).collect() 
  [2, 3, 4, 5, 6] 

filter(func) 
  >>> myRDD.filter(lambda x: x > 3).collect() 
  [4, 5]
RDDs Transformations Examples (2) 
reduceByKey(func, [numTasks]) 
  >>> myList = [1,2,2,3,3,3,4,4,4,4,5,5,5,5,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> pairs = myRDD.map(lambda x: (x,1)) 
  >>> pairs.collect()[:3] 
  [(1, 1), (2, 1), (2, 1)] 
  >>> counts = pairs.reduceByKey(lambda a, b: a + b) 
  >>> counts.collect() 
  [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)] 

groupByKey([numTasks]) 
  >>> for key, values in pairs.groupByKey().collect(): 
  ...     print "%s : %s" % (key, list(values)) 
  1 : [1] 
  2 : [1, 1] 
  3 : [1, 1, 1] 
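To make the reduceByKey semantics concrete without a cluster, here is a plain-Python sketch of the same per-key fold (it ignores partitioning and the map-side combining a real Spark job would do):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    # Group values by key, then fold each key's value list with func.
    # This mirrors what reduceByKey computes, minus distribution.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted((key, reduce(func, values)) for key, values in grouped.items())

myList = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
pairs = [(x, 1) for x in myList]
print(reduce_by_key(pairs, lambda a, b: a + b))
# [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```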
RDDs Actions Examples (1) 
reduce(func) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.reduce(lambda x,y: x+y) 
  15 

collect() 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.collect() 
  [1, 2, 3, 4, 5]
RDDs Actions Examples (2) 
count() 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.count() 
  5 

take(3) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.take(3) 
  [1, 2, 3]
Fault Recovery 
• An RDD is an immutable, deterministically re-computable, distributed dataset. 
• An RDD tracks its lineage info to rebuild lost data
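A toy sketch of the lineage idea (conceptual only, not Spark code: the class, names, and single-parent model are simplifications): each dataset remembers its parent and the function that produced it, so lost data can be recomputed instead of replicated.

```python
class MiniRDD:
    # Illustrative stand-in for an RDD: immutable input plus recorded lineage.
    def __init__(self, data=None, parent=None, fn=None):
        self.parent = parent   # lineage: where this dataset came from
        self.fn = fn           # lineage: how it was derived from the parent
        self._cache = data     # computed data; may be dropped ("lost")

    def map(self, fn):
        # A transformation records lineage; nothing is computed yet.
        return MiniRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self._cache is None:                            # data lost?
            self._cache = self.fn(self.parent.collect())   # rebuild via lineage
        return self._cache

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled._cache = None      # simulate losing the computed partition
print(doubled.collect())   # [2, 4, 6] again, recomputed from lineage
```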
Types of Deployment 
• Standalone Deployment 
  $ sbin/start-all.sh 
• Amazon EC2 
  $ ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name> 
• Apache Mesos 
• Hadoop YARN
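As a sketch of how an application is submitted against each cluster manager with the Spark 1.x spark-submit interface (hostnames, ports, resource sizes, and my_app.py are all illustrative, not from the slides):

```shell
# Standalone cluster (illustrative master host/port)
bin/spark-submit --master spark://master-host:7077 \
  --executor-memory 2G my_app.py

# Hadoop YARN (1.x syntax)
bin/spark-submit --master yarn-cluster \
  --num-executors 4 --executor-memory 2G my_app.py

# Apache Mesos (illustrative master host/port)
bin/spark-submit --master mesos://mesos-host:5050 my_app.py
```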
Other Spark Projects
Other Spark Projects (1) 
• Spark Job Server 
  "Spark as a Service": Simple REST interface for all aspects of job and context management 
  https://github.com/ooyala/spark-jobserver 
• Hue support for Spark 
  http://gethue.com/a-new-spark-web-ui-spark-app/
Other Spark Projects (2) 
• Tachyon 
  “Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks” 
  http://tachyon-project.org/ 
• Spark Notebook 
  “Use Apache Spark straight from the Browser” 
  https://github.com/andypetrella/spark-notebook/
Spark Example
Wordcount Example (1) 
Hadoop MapReduce: 

//package org.myorg;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Spark Scala: 

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Spark Python: 

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
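For intuition, the same data flow can be traced in plain Python on a couple of made-up lines (the input strings here are illustrative, not from the slides): flatMap splits lines into words, and the map/reduceByKey pair amounts to counting occurrences per word.

```python
from collections import Counter

lines = ["to be or not", "to be"]   # illustrative input
words = [w for line in lines for w in line.split(" ")]   # flatMap
counts = Counter(words)                                  # map + reduceByKey
print(sorted(counts.items()))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```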
Wordcount Example (2) 
$ bin/pyspark --master local[2] 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0 
      /_/ 

Using Python version 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012 20:52:43) 
SparkContext available as sc. 
>>> 
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). 
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. 
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
Wordcount Example (3) 
>>> file = sc.textFile("file:///Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md").cache() 
14/11/25 15:33:14 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=327410, maxMem=278302556 
14/11/25 15:33:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 159.9 KB, free 264.9 MB) 

>>> file.count() 
14/11/25 15:33:34 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 2.185 s 
14/11/25 15:33:34 INFO SparkContext: Job finished: count at <stdin>:1, took 2.301952 s 
141 

>>> counts = file.flatMap(lambda line: line.split()) \
...               .map(lambda word: (word, 1)) \
...               .reduceByKey(lambda a, b: a + b) 

>>> counts.take(5) 
[(u'email,', 1), (u'Contributing', 1), (u'webpage', 1), (u'when', 3), (u'<http://spark.apache.org/documentation.html>.', 1)]
Actual Demo
Spark in Enterprises
Spark in Enterprise 
Data Sources: 
  Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV) 
  New Sources (Web, Logs, Social Media, IoT) 
Data Processing (In Memory = Fast!): 
  Real Time Streaming + Analytics + Spark SQL 
  Batch Processing + HiveQL 
Applications: 
  BI Tools, Enterprise Applications, Custom Applications
Enterprise Ready? 
• Authentication? Kerberos? AD? LDAP? 
• Encryption? SSL? TLS? 
• Audit Trail? 
• Commercial Support? 
• …?
Use Cases: #1 
Music Recommendations at Scale with Spark 
http://spark-summit.org/2014/talk/music-recommendations-at-scale-with-spark
Thank You!
