APACHE-SPARK	

LARGE-SCALE DATA PROCESSING ENGINE
Bartosz Bogacki <bbogacki@bidlab.pl>
CTO, CODER, ROCK CLIMBER
• current:
• Chief Technology Officer at Bidlab
• previous:
• IT Director at InternetowyKantor.pl SA
• Software Architect / Project Manager at Wolters Kluwer Polska
• find out more (if you care):
• linkedin.com/in/bartoszbogacki
WE PROCESS MORE THAN 200GB OF LOGS DAILY
Did I mention that…?
WHY?
• To discover inventory and potential	

• To optimize traffic	

• To optimize campaigns	

• To learn about trends	

• To calculate conversions
APACHE SPARK !
HISTORY
• 2013-06-19 Project enters Apache incubation	

• 2014-02-19 Project established as an Apache Top Level Project.

• 2014-05-30 Spark 1.0.0 released
• "Apache Spark is a (lightning-) fast and
general-purpose cluster computing system"	

• Engine compatible with Apache Hadoop	

• Up to 100x faster than Hadoop MapReduce (for in-memory workloads)

• Less code to write, more flexible

• Active community (117 developers
contributed to release 1.0.0)
KEY CONCEPTS
• Runs on the Spark standalone, YARN, or Mesos resource managers

• HDFS / S3 support built-in	

• RDD - Resilient Distributed Dataset

• Transformations & Actions	

• Written in Scala; APIs for Java / Scala / Python
ECOSYSTEM
• Spark Streaming	

• Shark	

• MLlib (machine learning)	

• GraphX	

• Spark SQL
RDD
• Collections of objects	

• Stored in memory (or disk)	

• Spread across the cluster	

• Auto-rebuild on failure
TRANSFORMATIONS
• map / flatMap	

• filter	

• union / intersection / join / cogroup	

• distinct	

• many more…
ACTIONS
• reduce (reduceByKey is technically a transformation)

• foreach	

• count / countByKey	

• first / take / takeOrdered	

• collect / saveAsTextFile / saveAsObjectFile
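Transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. A minimal Java sketch (logLines as on the later CREATING RDD slides):

// filter() only records the transformation; no data is touched yet
JavaRDD<String> errors = logLines.filter(line -> line.contains("ERROR"));

// count() is an action: it triggers the distributed job and returns a value
long numErrors = errors.count();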
EXAMPLES
val s1=sc.parallelize(Array(1,2,3,4,5))
val s2=sc.parallelize(Array(3,4,6,7,8))
val s3=sc.parallelize(Array(1,2,2,3,3,3))
s2.map(num => num * num)
// => 9, 16, 36, 49, 64
s1.reduce((a,b) => a + b)
// => 15
s1 union s2
// => 1, 2, 3, 4, 5, 3, 4, 6, 7, 8
s1 subtract s2
// => 1, 5, 2
s1 intersection s2
// => 4, 3
s3.distinct
// => 1, 2, 3
EXAMPLES
val set1 = sc.parallelize(Array[(Integer,String)](
  (1,"bartek"), (2,"jacek"), (3,"tomek")))
val set2 = sc.parallelize(Array[(Integer,String)](
  (2,"nowak"), (4,"kowalski"), (5,"iksiński")))
set1 join set2
// =>(2,(jacek,nowak))
set1 leftOuterJoin set2
// => (1,(bartek,None)), (2,(jacek,Some(nowak))), (3,(tomek,None))
set1 rightOuterJoin set2
// => (4,(None,kowalski)), (5,(None,iksiński)), (2,(Some(jacek),nowak))
EXAMPLES
set1.cogroup(set2).sortByKey()
// => (1,(ArrayBuffer(bartek),ArrayBuffer())),
//    (2,(ArrayBuffer(jacek),ArrayBuffer(nowak))),
//    (3,(ArrayBuffer(tomek),ArrayBuffer())),
//    (4,(ArrayBuffer(),ArrayBuffer(kowalski))),
//    (5,(ArrayBuffer(),ArrayBuffer(iksiński)))
set2.map((t) => (t._1, t._2.length))
// => (2,5), (4,8), (5,8)
val set3 = sc.parallelize(Array[(String,Long)](
  ("onet.pl",1), ("onet.pl",1), ("wp.pl",1)))
set3.reduceByKey((n1,n2) => n1 + n2)
// => (onet.pl,2), (wp.pl,1)
HANDS ON
RUNNING AN EC2 SPARK CLUSTER
./spark-ec2 -k spark-key -i spark-key.pem \
  -s 5 \
  -t m3.2xlarge \
  --region=eu-west-1 \
  launch cluster-name
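Assuming the cluster created above (same key pair and cluster name), the same script handles day-to-day operations; a sketch:

# Open an SSH session on the master node
./spark-ec2 -k spark-key -i spark-key.pem --region=eu-west-1 login cluster-name

# Terminate all instances when you are done
./spark-ec2 --region=eu-west-1 destroy cluster-name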
SPARK CONSOLE
LINKING WITH SPARK
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
</dependency>
If you want to use HDFS	

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
If you want to use Spark Streaming	

groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.0.0
INITIALIZING
• SparkConf conf = new SparkConf()
.setAppName("TEST")
.setMaster("local");	

• JavaSparkContext sc = new
JavaSparkContext(conf);
CREATING RDD
• List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);	

• JavaRDD<Integer> distData = sc.parallelize(data);
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("data.txt");
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("hdfs://<HOST>:<PORT>/daily/data-20-00.txt");

• JavaRDD<String> logLines = sc.textFile("s3n://my-bucket/daily/data-*.txt");
TRANSFORM
JavaRDD<Log> logs = logLines
	.map(new Function<String, Log>() {
		public Log call(String s) {
			return LogParser.parse(s);
		}
	})
	.filter(new Function<Log, Boolean>() {
		public Boolean call(Log log) {
			return log.getLevel() == 1;
		}
	});
ACTION :)
logs.count();
TRANSFORM-ACTION
List<Tuple2<String,Integer>> result = 	
	 sc.textFile("/data/notifies-20-00.txt")
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 })	
	 .reduceByKey(new Function2<Integer, Integer, Integer>(){	
	 	 	 @Override	
	 	 	 public Integer call(Integer v1, Integer v2) throws Exception {	
	 	 	 	 return v1 + v2;	
	 	 	 }})	
	 .sortByKey()	
.collect();
FUNCTIONS, PAIRFUNCTIONS, ETC.
BROADCAST VARIABLES
• "allow the programmer to keep a read-only
variable cached on each machine rather than
shipping a copy of it with tasks"
Broadcast<int[]> broadcastVar =
sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value();
// returns [1, 2, 3]
ACCUMULATORS
• variables that are only “added” to through an associative
operation (add())	

• only the driver program can read the accumulator’s value
Accumulator<Integer> accum = sc.accumulator(0);
sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x ->
accum.add(x));
accum.value();
// returns 10
SERIALIZATION
• All objects referenced in the functions you ship to the cluster have to be serializable

• Otherwise:
org.apache.spark.SparkException: Job aborted: Task not
serializable: java.io.NotSerializableException
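The usual fix is to make those classes implement java.io.Serializable (or register them with Kryo, next slide). A minimal sketch, reusing the Log class name from the TRANSFORM slide; its fields here are invented for illustration:

import java.io.Serializable;

// Instances are shipped to executors inside closures, so the class must be serializable.
public class Log implements Serializable {
	private static final long serialVersionUID = 1L;

	private final int level;
	private final String message;

	public Log(int level, String message) {
		this.level = level;
		this.message = message;
	}

	public int getLevel() { return level; }
	public String getMessage() { return message; }
}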
USE KRYO SERIALIZER
public class MyRegistrator implements KryoRegistrator {
	@Override
	public void registerClasses(Kryo kryo) {
		kryo.register(BidRequest.class);
		kryo.register(NotifyRequest.class);
		kryo.register(Event.class);
	}
}
sparkConfig.set(	
	 "spark.serializer", "org.apache.spark.serializer.KryoSerializer");	
sparkConfig.set(	
	 "spark.kryo.registrator", "pl.instream.dsp.offline.MyRegistrator");	
sparkConfig.set(	
	 "spark.kryoserializer.buffer.mb", "10");
CACHE !
JavaPairRDD<String, Integer> cachedSet = 	
	 sc.textFile("/data/notifies-20-00.txt")
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception
	 	 	 {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 }).cache();
RDD PERSISTENCE
• MEMORY_ONLY	

• MEMORY_AND_DISK	

• MEMORY_ONLY_SER	

• MEMORY_AND_DISK_SER	

• DISK_ONLY	

• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …	

• OFF_HEAP (Tachyon, experimental)
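For example (a minimal sketch, assuming the logLines RDD from the CREATING RDD slides): cache() is just shorthand for persist(MEMORY_ONLY); any other level is chosen explicitly.

import org.apache.spark.api.java.StorageLevels;

// Keep the dataset around between actions; the serialized level trades CPU for RAM.
logLines.persist(StorageLevels.MEMORY_AND_DISK_SER);

long total = logLines.count();   // first action materializes the cache
long again = logLines.count();   // served from storage, not recomputed

logLines.unpersist();            // drop it when no longer needed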
PARTITIONS
• RDD is partitioned	

• You may (and probably should) control the number and size of partitions with coalesce() / repartition() (see the sketch below)

• By default, one input split (e.g. one HDFS block) = one partition
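A minimal sketch, assuming the logLines RDD from the earlier textFile() call:

// How the data is currently split (same API the partition-by-partition slide uses)
int numPartitions = logLines.rdd().partitions().length;

// Merge many small partitions into fewer, larger ones (avoids a full shuffle)
JavaRDD<String> merged = logLines.coalesce(32);

// Or force a full shuffle into any chosen number of partitions
JavaRDD<String> reshuffled = logLines.repartition(200);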
PARTITIONS
• If your partitions are too big, you’ll face:
[GC 5208198K(5208832K), 0,2403780 secs]
[Full GC 5208831K->5208212K(5208832K), 9,8765730 secs]
[Full GC 5208829K->5208238K(5208832K), 9,7567820 secs]
[Full GC 5208829K->5208295K(5208832K), 9,7629460 secs]
[GC 5208301K(5208832K), 0,2403480 secs]
[Full GC 5208831K->5208344K(5208832K), 9,7497710 secs]
[Full GC 5208829K->5208366K(5208832K), 9,7542880 secs]
[Full GC 5208831K->5208415K(5208832K), 9,7574860 secs]
WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-xx-xx-xxx-xxx.eu-west-1.compute.internal, 60048, 0) with no recent heart beats: 64828ms exceeds 45000ms
RESULTS
• result.saveAsTextFile("hdfs://<HOST>:<PORT>/out.txt")

• result.saveAsObjectFile("/result/out.obj")

• collect()
PROCESS RESULTS PARTITION BY PARTITION
for (Partition partition : result.rdd().partitions()) {	
	 List<String> subresult[] = 	
	 	 result.collectPartitions(new int[] { partition.index() });	
	 	
	 for (String line : subresult[0])	
	 {	
	 	 System.out.println(line);	
	 }	
}
SPARK STREAMING
"SPARK STREAMING IS AN EXTENSION OF THE CORE SPARK API THAT ENABLES HIGH-THROUGHPUT, FAULT-TOLERANT STREAM PROCESSING OF LIVE DATA STREAMS."
HOW DOES IT WORK?
DSTREAMS
• a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream

• represented by a continuous sequence of RDDs
INITIALIZING
• SparkConf conf = new
SparkConf().setAppName("Real-Time
Analytics").setMaster("local");	

• JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(TIME_IN_MILLIS));
CREATING DSTREAM
• JavaReceiverInputDStream<String> logLines =
jssc.socketTextStream(sourceAddr, sourcePort,
StorageLevels.MEMORY_AND_DISK_SER);
DATA SOURCES
• plain TCP sockets

• Apache Kafka	

• Apache Flume	

• ZeroMQ
TRANSFORMATIONS
• map, flatMap, filter, union, join, etc.	

• transform	

• updateStateByKey
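A minimal sketch tying these together, assuming the logLines DStream from the CREATING DSTREAM slide: a per-batch word count, written with Java 8 lambdas as on the accumulator slide.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Split each line into words and count them within every batch interval
JavaDStream<String> words =
	logLines.flatMap(line -> Arrays.asList(line.split(" ")));

JavaPairDStream<String, Integer> counts =
	words.mapToPair(word -> new Tuple2<String, Integer>(word, 1))
		.reduceByKey((a, b) -> a + b);

// Nothing runs until the streaming context is started (see OUTPUT OPERATIONS below)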
WINDOW OPERATIONS
• window	

• countByWindow / countByValueAndWindow	

• reduceByWindow / reduceByKeyAndWindow
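A minimal sketch, reusing the counts DStream from the previous sketch: re-aggregate over a sliding 60-second window that advances every 10 seconds.

import org.apache.spark.streaming.Duration;

// Sum the per-batch counts over the last 60 seconds, recomputed every 10 seconds
JavaPairDStream<String, Integer> windowedCounts =
	counts.reduceByKeyAndWindow(
		(a, b) -> a + b,            // how to combine values
		new Duration(60 * 1000),    // window length
		new Duration(10 * 1000));   // slide interval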
OUTPUT OPERATIONS
• print	

• foreachRDD	

• saveAsObjectFiles	

• saveAsTextFiles	

• saveAsHadoopFiles
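A minimal sketch, again with the hypothetical counts DStream and the jssc context from the INITIALIZING slide: print a sample of each batch, handle each batch's RDD yourself with foreachRDD, then start the context (nothing happens before start()).

// Print the first elements of every batch on the driver
counts.print();

// Full control over each batch: here we only log its size, but this is the
// place to write results to a database, a queue, etc.
counts.foreachRDD(rdd -> {
	System.out.println("batch size: " + rdd.count());
	return null;
});

jssc.start();
jssc.awaitTermination();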
THINGS TO REMEMBER
USE SPARK-SHELL TO LEARN
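For example, Spark 1.0 starts a local shell with four worker threads via:

./bin/spark-shell --master local[4]

The shell predefines sc, so the Scala EXAMPLES from the earlier slides can be pasted in as-is.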
PROVIDE ENOUGH RAM TO WORKERS
PROVIDE ENOUGH RAM TO EXECUTORS
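Both knobs, sketched for Spark 1.0 (values are arbitrary examples); sparkConfig is the SparkConf from the Kryo slide:

// Per-application executor heap
sparkConfig.set("spark.executor.memory", "4g");

// On a standalone cluster, the total RAM a worker may hand out is configured
// outside the application, e.g. SPARK_WORKER_MEMORY=28g in conf/spark-env.sh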
SET FRAME SIZE / BUFFERS
ACCORDINGLY
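The Spark 1.0-era settings this refers to (a sketch; values are examples): the Akka frame size caps driver/executor messages, and the Kryo buffer must fit your largest serialized object.

// Maximum driver/executor message size, in MB (Spark 1.x)
sparkConfig.set("spark.akka.frameSize", "64");

// Kryo serialization buffer, in MB (also shown on the Kryo slide)
sparkConfig.set("spark.kryoserializer.buffer.mb", "32");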
USE KRYO SERIALIZER
SPLIT DATA INTO AN APPROPRIATE NUMBER OF PARTITIONS
PACKAGE YOUR APPLICATION IN AN UBER-JAR
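One common way to do that with the Maven setup from the LINKING WITH SPARK slide is the shade plugin; a sketch (the plugin version is only an example):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>

Marking the Spark dependency as provided keeps the resulting jar small, since the cluster already ships Spark.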
DESIGN YOUR DATA FLOW
AND…
BUILD A FRAMEWORK TO PROCESS DATA EFFICIENTLY
IT’S EASIER WITH SCALA!
	 // word count example	
	 inputLine.flatMap(line => line.split(" "))	
	 	 .map(word => (word, 1))	
	 	 .reduceByKey(_ + _);
HOW WE USE SPARK
HOW WE USE SPARK
HOW WE USE SPARK
THANKS!
we’re hiring !	

mail me: bbogacki@bidlab.pl
