Introduction to Spark
By Tsai Li Ming 
Spark Meetup (SG) - 26 Nov 2014
Who am I? 
• Started with High Performance Computing -> Grid -> Cloud, and now Big Data 
• Member of PUGS committee 
• http://about.me/tsailiming
Why In Memory Computing?
Speed matters 
http://web.stanford.edu/~ouster/cgi-bin/papers/ramcloud.pdf
http://hblok.net/blog/storage/
18:1 server consolidation
In Memory Data Grid 
Ignite vs. Spark 

Apache Spark is a data-analytics and ML-centric system that ingests data from HDFS or another distributed file system and performs in-memory processing of this data. Ignite is an In-Memory Data Fabric that is data-source agnostic and provides both a Hadoop-like computation engine (MapReduce) as well as many other computing paradigms such as MPP, MPI, and stream processing. Ignite also includes first-class support for cluster management and operations, cluster-aware messaging, and zero-deployment technologies. Ignite also provides support for full ACID transactions spanning memory and optional data sources. Ignite is a broader in-memory system that is less focused on Hadoop. Apache Spark is more inclined towards analytics and ML, and focused on MR-specific payloads. 
https://wiki.apache.org/incubator/IgniteProposal
In Memory Database 
And many more! 
http://en.wikipedia.org/wiki/List_of_in-memory_databases
What is Spark?
What is Spark? 
• Developed at UC Berkeley in 2009. Open sourced in 2010. 
• Fast and general engine for large-scale data processing 
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk 
• Multi-step Directed Acyclic Graphs (DAGs): many stages, compared to just Map and Reduce in Hadoop 
• Rich Scala, Java and Python APIs. R too! (SparkR to be merged into 1.3.0) 
• Interactive shell 
• Active development
Spark Stack
Active Development 
*June 2014* 
http://radar.oreilly.com/2014/06/a-growing-number-of-applications-are-being-built-with-spark.html
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
Logistic Regression Performance 
http://www.slideshare.net/jamesskillsmatter/zaharia-sparkscaladays2012
How Spark works
Spark Execution Flow 
YARN 
Standalone
Resilient Distributed Datasets (RDDs) 
• Basic abstraction in Spark. Fault-tolerant collection of elements that can be operated on in parallel 
• RDDs can be created from the local file system, HDFS, Cassandra, HBase, Amazon S3, SequenceFiles, and any other Hadoop InputFormat. 
• Different levels of caching: MEMORY_ONLY, MEMORY_AND_DISK, DISK_ONLY, OFF_HEAP, etc. 
• Rich APIs for Transformations and Actions 
• Data Locality: PROCESS_LOCAL -> NODE_LOCAL -> RACK_LOCAL
RDD Operations 
map flatMap sortByKey 
filter union reduce 
sample join count 
groupByKey distinct saveAsTextFile 
reduceByKey mapValues first
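The operations above split into transformations (map, filter, join, ...) and actions (reduce, count, collect, ...). In Spark, transformations are lazy and only actions trigger computation; as a rough analogy (not Spark code), Python generators show the same deferred-evaluation flavour:

```python
# Transformations in Spark only record what to do; actions force the work.
# A generator expression behaves similarly: building it runs nothing.
data = range(1, 6)
transformed = (x * 2 for x in data)   # like a transformation: nothing runs yet
result = list(transformed)            # like an action: forces evaluation
print(result)   # [2, 4, 6, 8, 10]
```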
RDDs Transformations Examples (1) 
map(func) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.map(lambda x: x+1).collect() 
  [2, 3, 4, 5, 6] 

filter(func) 
  >>> myRDD.filter(lambda x: x > 3).collect() 
  [4, 5]
RDDs Transformations Examples (2) 
reduceByKey(func, [numTasks]) 
  >>> myList = [1,2,2,3,3,3,4,4,4,4,5,5,5,5,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> pairs = myRDD.map(lambda x: (x,1)) 
  >>> pairs.collect()[:3] 
  [(1, 1), (2, 1), (2, 1)] 
  >>> counts = pairs.reduceByKey(lambda a, b: a + b) 
  >>> counts.collect() 
  [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)] 

groupByKey([numTasks]) 
  >>> for key, values in pairs.groupByKey().collect(): 
  ...     print "%s : %s" % (key, list(values)) 
  1 : [1] 
  2 : [1, 1] 
  3 : [1, 1, 1] 
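To make the reduceByKey semantics concrete without a cluster, here is a plain-Python sketch of the same per-key fold (it ignores partitioning and the map-side combining a real Spark job would do):

```python
from collections import defaultdict
from functools import reduce

def reduce_by_key(pairs, func):
    # Group values by key, then fold each key's value list with func.
    # This mirrors what reduceByKey computes, minus distribution.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return sorted((key, reduce(func, values)) for key, values in grouped.items())

myList = [1, 2, 2, 3, 3, 3, 4, 4, 4, 4, 5, 5, 5, 5, 5]
pairs = [(x, 1) for x in myList]
print(reduce_by_key(pairs, lambda a, b: a + b))
# [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
```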
RDDs Actions Examples (1) 
reduce(func) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.reduce(lambda x,y: x+y) 
  15 

collect() 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.collect() 
  [1, 2, 3, 4, 5]
RDDs Actions Examples (2) 
count() 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.count() 
  5 

take(3) 
  >>> myList = [1,2,3,4,5] 
  >>> myRDD = sc.parallelize(myList) 
  >>> myRDD.take(3) 
  [1, 2, 3]
Fault Recovery 
• An RDD is an immutable, deterministically re-computable, distributed dataset. 
• An RDD tracks its lineage info to rebuild lost data
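A toy sketch of the lineage idea (conceptual only, not Spark code: the class, names, and single-parent model are simplifications): each dataset remembers its parent and the function that produced it, so lost data can be recomputed instead of replicated.

```python
class MiniRDD:
    # Illustrative stand-in for an RDD: immutable input plus recorded lineage.
    def __init__(self, data=None, parent=None, fn=None):
        self.parent = parent   # lineage: where this dataset came from
        self.fn = fn           # lineage: how it was derived from the parent
        self._cache = data     # computed data; may be dropped ("lost")

    def map(self, fn):
        # A transformation records lineage; nothing is computed yet.
        return MiniRDD(parent=self, fn=lambda rows: [fn(r) for r in rows])

    def collect(self):
        if self._cache is None:                            # data lost?
            self._cache = self.fn(self.parent.collect())   # rebuild via lineage
        return self._cache

base = MiniRDD(data=[1, 2, 3])
doubled = base.map(lambda x: x * 2)
print(doubled.collect())   # [2, 4, 6]
doubled._cache = None      # simulate losing the computed partition
print(doubled.collect())   # [2, 4, 6] again, recomputed from lineage
```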
Types of Deployment 
• Standalone Deployment 
  $ sbin/start-all.sh 
• Amazon EC2 
  $ ec2/spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name> 
• Apache Mesos 
• Hadoop YARN
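As a sketch of how an application is submitted against each cluster manager with the Spark 1.x spark-submit interface (hostnames, ports, resource sizes, and my_app.py are all illustrative, not from the slides):

```shell
# Standalone cluster (illustrative master host/port)
bin/spark-submit --master spark://master-host:7077 \
  --executor-memory 2G my_app.py

# Hadoop YARN (1.x syntax)
bin/spark-submit --master yarn-cluster \
  --num-executors 4 --executor-memory 2G my_app.py

# Apache Mesos (illustrative master host/port)
bin/spark-submit --master mesos://mesos-host:5050 my_app.py
```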
Other Spark Projects
Other Spark Projects (1) 
• Spark Job Server 
  "Spark as a Service": Simple REST interface for all aspects of job and context management 
  https://github.com/ooyala/spark-jobserver 
• Hue support for Spark 
  http://gethue.com/a-new-spark-web-ui-spark-app/
Other Spark Projects (2) 
• Tachyon 
  “Tachyon is a memory-centric distributed file system enabling reliable file sharing at memory-speed across cluster frameworks” 
  http://tachyon-project.org/ 
• Spark Notebook 
  “Use Apache Spark straight from the Browser” 
  https://github.com/andypetrella/spark-notebook/
Spark Example
Wordcount Example (1) 
Hadoop MapReduce: 

//package org.myorg;
import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.conf.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;

public class WordCount {

    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                output.collect(word, one);
            }
        }
    }

    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);
        conf.setMapperClass(Map.class);
        //conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
    }
}

Spark Scala: 

val file = spark.textFile("hdfs://...")
val counts = file.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

Spark Python: 

file = spark.textFile("hdfs://...")
counts = file.flatMap(lambda line: line.split(" ")) \
             .map(lambda word: (word, 1)) \
             .reduceByKey(lambda a, b: a + b)
counts.saveAsTextFile("hdfs://...")
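For intuition, the same data flow can be traced in plain Python on a couple of made-up lines (the input strings here are illustrative, not from the slides): flatMap splits lines into words, and the map/reduceByKey pair amounts to counting occurrences per word.

```python
from collections import Counter

lines = ["to be or not", "to be"]   # illustrative input
words = [w for line in lines for w in line.split(" ")]   # flatMap
counts = Counter(words)                                  # map + reduceByKey
print(sorted(counts.items()))
# [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```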
Wordcount Example (2) 
$ bin/pyspark --master local[2] 
Welcome to 
      ____              __ 
     / __/__  ___ _____/ /__ 
    _\ \/ _ \/ _ `/ __/  '_/ 
   /__ / .__/\_,_/_/ /_/\_\   version 1.1.0 
      /_/ 

Using Python version 2.7.3 (v2.7.3:70274d53c1dd, Apr 9 2012 20:52:43) 
SparkContext available as sc. 
>>> 
--driver-memory MEM Memory for driver (e.g. 1000M, 2G) (Default: 512M). 
--master MASTER_URL spark://host:port, mesos://host:port, yarn, or local. 
--executor-memory MEM Memory per executor (e.g. 1000M, 2G) (Default: 1G).
Wordcount Example (3) 
>>> file = sc.textFile("file:///Users/ltsai/Downloads/spark-1.1.0-bin-hadoop2.4/README.md").cache() 
14/11/25 15:33:14 INFO MemoryStore: ensureFreeSpace(163705) called with curMem=327410, maxMem=278302556 
14/11/25 15:33:14 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 159.9 KB, free 264.9 MB) 

>>> file.count() 
14/11/25 15:33:34 INFO DAGScheduler: Stage 0 (count at <stdin>:1) finished in 2.185 s 
14/11/25 15:33:34 INFO SparkContext: Job finished: count at <stdin>:1, took 2.301952 s 
141 

>>> counts = file.flatMap(lambda line: line.split()) \
...               .map(lambda word: (word, 1)) \
...               .reduceByKey(lambda a, b: a + b) 

>>> counts.take(5) 
[(u'email,', 1), (u'Contributing', 1), (u'webpage', 1), (u'when', 3), (u'<http://spark.apache.org/documentation.html>.', 1)]
Actual Demo
Spark in Enterprises
Spark in Enterprise 
Data Sources: 
  Traditional Data Sources (RDBMS, EDW, MPP, Excel, CSV) 
  New Sources (Web, Logs, Social Media, IoT) 
Data Processing (In Memory = Fast!): 
  Real Time Streaming + Analytics + Spark SQL 
  Batch Processing + HiveQL 
Applications: 
  BI Tools, Enterprise Applications, Custom Applications
Enterprise Ready? 
• Authentication? Kerberos? AD? LDAP? 
• Encryption? SSL? TLS? 
• Audit Trail? 
• Commercial Support? 
• …?
Use Cases: #1 
Music Recommendations at Scale with Spark 
http://spark-summit.org/2014/talk/music-recommendations-at-scale-with-spark
Thank You!
