APACHE-SPARK	

LARGE-SCALE DATA PROCESSING ENGINE
Bartosz Bogacki <bbogacki@bidlab.pl>
CTO, CODER, ROCK CLIMBER
• current:
• Chief Technology Officer at Bidlab
• previous:
• IT Director at InternetowyKantor.pl SA
• Software Architect / Project Manager at Wolters Kluwer Polska
• find out more (if you care):
• linkedin.com/in/bartoszbogacki
WE PROCESS MORE THAN 200GB OF LOGS DAILY
Did I mention that…?
WHY?
• To discover inventory and potential	

• To optimize traffic	

• To optimize campaigns	

• To learn about trends	

• To calculate conversions
APACHE SPARK !
HISTORY
• 2013-06-19 Project enters Apache incubation	

• 2014-02-19 Project established as an Apache Top Level Project.

• 2014-05-30 Spark 1.0.0 released
• "Apache Spark is a (lightning-) fast and
general-purpose cluster computing system"	

• Engine compatible with Apache Hadoop	

• Up to 100x faster than Hadoop MapReduce (for in-memory workloads)

• Less code to write, more flexible

• Active community (117 developers
contributed to release 1.0.0)
KEY CONCEPTS
• Runs on the Spark standalone, YARN, or Mesos resource managers

• HDFS / S3 support built-in	

• RDD - Resilient Distributed Dataset

• Transformations & Actions	

• Written in Scala; APIs for Java / Scala / Python
ECOSYSTEM
• Spark Streaming	

• Shark	

• MLlib (machine learning)	

• GraphX	

• Spark SQL
RDD
• Collections of objects	

• Stored in memory (or disk)	

• Spread across the cluster	

• Auto-rebuild on failure
TRANSFORMATIONS
• map / flatMap	

• filter	

• union / intersection / join / cogroup	

• distinct	

• many more…
ACTIONS
• reduce (reduceByKey is technically a transformation)

• foreach	

• count / countByKey	

• first / take / takeOrdered	

• collect / saveAsTextFile / saveAsObjectFile
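Transformations are lazy: they only describe a new RDD, and nothing is computed until an action asks for a result. A minimal Java sketch (logLines as on the later CREATING RDD slides):

// filter() only records the transformation; no data is touched yet
JavaRDD<String> errors = logLines.filter(line -> line.contains("ERROR"));

// count() is an action: it triggers the distributed job and returns a value
long numErrors = errors.count();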
EXAMPLES
val s1=sc.parallelize(Array(1,2,3,4,5))
val s2=sc.parallelize(Array(3,4,6,7,8))
val s3=sc.parallelize(Array(1,2,2,3,3,3))
s2.map(num => num * num)
// => 9, 16, 36, 49, 64
s1.reduce((a,b) => a + b)
// => 15
s1 union s2
// => 1, 2, 3, 4, 5, 3, 4, 6, 7, 8
s1 subtract s2
// => 1, 5, 2
s1 intersection s2
// => 4, 3
s3.distinct
// => 1, 2, 3
EXAMPLES
val set1 = sc.parallelize(Array[(Integer,String)](
  (1,"bartek"), (2,"jacek"), (3,"tomek")))
val set2 = sc.parallelize(Array[(Integer,String)](
  (2,"nowak"), (4,"kowalski"), (5,"iksiński")))
set1 join set2
// =>(2,(jacek,nowak))
set1 leftOuterJoin set2
// => (1,(bartek,None)), (2,(jacek,Some(nowak))), (3,(tomek,None))
set1 rightOuterJoin set2
// => (4,(None,kowalski)), (5,(None,iksiński)), (2,(Some(jacek),nowak))
EXAMPLES
set1.cogroup(set2).sortByKey()
// => (1,(ArrayBuffer(bartek),ArrayBuffer())),
//    (2,(ArrayBuffer(jacek),ArrayBuffer(nowak))),
//    (3,(ArrayBuffer(tomek),ArrayBuffer())),
//    (4,(ArrayBuffer(),ArrayBuffer(kowalski))),
//    (5,(ArrayBuffer(),ArrayBuffer(iksiński)))
set2.map((t) => (t._1, t._2.length))
// => (2,5), (4,8), (5,8)
val set3 = sc.parallelize(Array[(String,Long)](
  ("onet.pl",1), ("onet.pl",1), ("wp.pl",1)))
set3.reduceByKey((n1,n2) => n1 + n2)
// => (onet.pl,2), (wp.pl,1)
HANDS ON
RUNNING AN EC2 SPARK CLUSTER
./spark-ec2 -k spark-key -i spark-key.pem \
  -s 5 \
  -t m3.2xlarge \
  --region=eu-west-1 \
  launch cluster-name
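Assuming the cluster created above (same key pair and cluster name), the same script handles day-to-day operations; a sketch:

# Open an SSH session on the master node
./spark-ec2 -k spark-key -i spark-key.pem --region=eu-west-1 login cluster-name

# Terminate all instances when you are done
./spark-ec2 --region=eu-west-1 destroy cluster-name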
SPARK CONSOLE
LINKING WITH SPARK
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.10</artifactId>
<version>1.0.0</version>
</dependency>
If you want to use HDFS	

groupId = org.apache.hadoop
artifactId = hadoop-client
version = <your-hdfs-version>
If you want to use Spark Streaming	

groupId = org.apache.spark
artifactId = spark-streaming_2.10
version = 1.0.0
INITIALIZING
• SparkConf conf = new SparkConf()
.setAppName("TEST")
.setMaster("local");	

• JavaSparkContext sc = new
JavaSparkContext(conf);
CREATING RDD
• List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);	

• JavaRDD<Integer> distData = sc.parallelize(data);
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("data.txt");
CREATING RDD
• JavaRDD<String> logLines = sc.textFile("hdfs://<HOST>:<PORT>/daily/data-20-00.txt");

• JavaRDD<String> logLines = sc.textFile("s3n://my-bucket/daily/data-*.txt");
TRANSFORM
JavaRDD<Log> logs = logLines
	.map(new Function<String, Log>() {
		public Log call(String s) {
			return LogParser.parse(s);
		}
	})
	.filter(new Function<Log, Boolean>() {
		public Boolean call(Log log) {
			return log.getLevel() == 1;
		}
	});
ACTION :)
logs.count();
TRANSFORM-ACTION
List<Tuple2<String,Integer>> result = 	
	 sc.textFile("/data/notifies-20-00.txt")
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 })	
	 .reduceByKey(new Function2<Integer, Integer, Integer>(){	
	 	 	 @Override	
	 	 	 public Integer call(Integer v1, Integer v2) throws Exception {	
	 	 	 	 return v1 + v2;	
	 	 	 }})	
	 .sortByKey()	
.collect();
FUNCTIONS, PAIRFUNCTIONS, ETC.
BROADCAST VARIABLES
• "allow the programmer to keep a read-only
variable cached on each machine rather than
shipping a copy of it with tasks"
Broadcast<int[]> broadcastVar =
sc.broadcast(new int[] {1, 2, 3});
broadcastVar.value();
// returns [1, 2, 3]
ACCUMULATORS
• variables that are only “added” to through an associative
operation (add())	

• only the driver program can read the accumulator’s value
Accumulator<Integer> accum = sc.accumulator(0);
sc.parallelize(Arrays.asList(1, 2, 3, 4)).foreach(x ->
accum.add(x));
accum.value();
// returns 10
SERIALIZATION
• All objects referenced in the functions you ship to the cluster have to be serializable

• Otherwise:
org.apache.spark.SparkException: Job aborted: Task not
serializable: java.io.NotSerializableException
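The usual fix is to make those classes implement java.io.Serializable (or register them with Kryo, next slide). A minimal sketch, reusing the Log class name from the TRANSFORM slide; its fields here are invented for illustration:

import java.io.Serializable;

// Instances are shipped to executors inside closures, so the class must be serializable.
public class Log implements Serializable {
	private static final long serialVersionUID = 1L;

	private final int level;
	private final String message;

	public Log(int level, String message) {
		this.level = level;
		this.message = message;
	}

	public int getLevel() { return level; }
	public String getMessage() { return message; }
}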
USE KRYO SERIALIZER
public class MyRegistrator implements KryoRegistrator {
	@Override
	public void registerClasses(Kryo kryo) {
		kryo.register(BidRequest.class);
		kryo.register(NotifyRequest.class);
		kryo.register(Event.class);
	}
}
sparkConfig.set(	
	 "spark.serializer", "org.apache.spark.serializer.KryoSerializer");	
sparkConfig.set(	
	 "spark.kryo.registrator", "pl.instream.dsp.offline.MyRegistrator");	
sparkConfig.set(	
	 "spark.kryoserializer.buffer.mb", "10");
CACHE !
JavaPairRDD<String, Integer> cachedSet = 	
	 sc.textFile("/data/notifies-20-00.txt")
	 .mapToPair(new PairFunction<String, String, Integer>() {	
	 	 	 @Override	
	 	 	 public Tuple2<String, Integer> call(String line) throws Exception
	 	 	 {	
	 	 	 	 NotifyRequest nr = LogParser.parseNotifyRequest(line);	
	 	 	 	 return new Tuple2<String, Integer>(nr.getFlightId(), 1);	
	 	 	 }	
	 	 }).cache();
RDD PERSISTENCE
• MEMORY_ONLY	

• MEMORY_AND_DISK	

• MEMORY_ONLY_SER	

• MEMORY_AND_DISK_SER	

• DISK_ONLY	

• MEMORY_ONLY_2, MEMORY_AND_DISK_2, …	

• OFF_HEAP (Tachyon, experimental)
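For example (a minimal sketch, assuming the logLines RDD from the CREATING RDD slides): cache() is just shorthand for persist(MEMORY_ONLY); any other level is chosen explicitly.

import org.apache.spark.api.java.StorageLevels;

// Keep the dataset around between actions; the serialized level trades CPU for RAM.
logLines.persist(StorageLevels.MEMORY_AND_DISK_SER);

long total = logLines.count();   // first action materializes the cache
long again = logLines.count();   // served from storage, not recomputed

logLines.unpersist();            // drop it when no longer needed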
PARTITIONS
• RDD is partitioned	

• You may (and probably should) control the number and size of partitions with coalesce() / repartition() (see the sketch below)

• By default, one input split (e.g. one HDFS block) = one partition
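A minimal sketch, assuming the logLines RDD from the earlier textFile() call:

// How the data is currently split (same API the partition-by-partition slide uses)
int numPartitions = logLines.rdd().partitions().length;

// Merge many small partitions into fewer, larger ones (avoids a full shuffle)
JavaRDD<String> merged = logLines.coalesce(32);

// Or force a full shuffle into any chosen number of partitions
JavaRDD<String> reshuffled = logLines.repartition(200);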
PARTITIONS
• If your partitions are too big, you’ll face:
[GC 5208198K(5208832K), 0,2403780 secs]
[Full GC 5208831K->5208212K(5208832K), 9,8765730 secs]
[Full GC 5208829K->5208238K(5208832K), 9,7567820 secs]
[Full GC 5208829K->5208295K(5208832K), 9,7629460 secs]
[GC 5208301K(5208832K), 0,2403480 secs]
[Full GC 5208831K->5208344K(5208832K), 9,7497710 secs]
[Full GC 5208829K->5208366K(5208832K), 9,7542880 secs]
[Full GC 5208831K->5208415K(5208832K), 9,7574860 secs]
WARN storage.BlockManagerMasterActor: Removing BlockManager BlockManagerId(0, ip-xx-xx-xxx-xxx.eu-west-1.compute.internal, 60048, 0) with no recent heart beats: 64828ms exceeds 45000ms
RESULTS
• result.saveAsTextFile("hdfs://<HOST>:<PORT>/out.txt")

• result.saveAsObjectFile("/result/out.obj")

• collect()
PROCESS RESULTS PARTITION BY PARTITION
for (Partition partition : result.rdd().partitions()) {	
	 List<String> subresult[] = 	
	 	 result.collectPartitions(new int[] { partition.index() });	
	 	
	 for (String line : subresult[0])	
	 {	
	 	 System.out.println(line);	
	 }	
}
SPARK STREAMING
"SPARK STREAMING IS AN EXTENSION OF THE CORE SPARK API THAT ENABLES HIGH-THROUGHPUT, FAULT-TOLERANT STREAM PROCESSING OF LIVE DATA STREAMS."
HOW DOES IT WORK?
DSTREAMS
• a continuous stream of data, either the input data stream received from a source or the processed data stream generated by transforming the input stream

• represented by a continuous sequence of RDDs
INITIALIZING
• SparkConf conf = new
SparkConf().setAppName("Real-Time
Analytics").setMaster("local");	

• JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(TIME_IN_MILLIS));
CREATING DSTREAM
• JavaReceiverInputDStream<String> logLines =
jssc.socketTextStream(sourceAddr, sourcePort,
StorageLevels.MEMORY_AND_DISK_SER);
DATA SOURCES
• plain TCP sockets

• Apache Kafka	

• Apache Flume	

• ZeroMQ
TRANSFORMATIONS
• map, flatMap, filter, union, join, etc.	

• transform	

• updateStateByKey
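A minimal sketch tying these together, assuming the logLines DStream from the CREATING DSTREAM slide: a per-batch word count, written with Java 8 lambdas as on the accumulator slide.

import java.util.Arrays;
import scala.Tuple2;
import org.apache.spark.streaming.api.java.JavaDStream;
import org.apache.spark.streaming.api.java.JavaPairDStream;

// Split each line into words and count them within every batch interval
JavaDStream<String> words =
	logLines.flatMap(line -> Arrays.asList(line.split(" ")));

JavaPairDStream<String, Integer> counts =
	words.mapToPair(word -> new Tuple2<String, Integer>(word, 1))
		.reduceByKey((a, b) -> a + b);

// Nothing runs until the streaming context is started (see OUTPUT OPERATIONS below)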
WINDOW OPERATIONS
• window	

• countByWindow / countByValueAndWindow	

• reduceByWindow / reduceByKeyAndWindow
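A minimal sketch, reusing the counts DStream from the previous sketch: re-aggregate over a sliding 60-second window that advances every 10 seconds.

import org.apache.spark.streaming.Duration;

// Sum the per-batch counts over the last 60 seconds, recomputed every 10 seconds
JavaPairDStream<String, Integer> windowedCounts =
	counts.reduceByKeyAndWindow(
		(a, b) -> a + b,            // how to combine values
		new Duration(60 * 1000),    // window length
		new Duration(10 * 1000));   // slide interval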
OUTPUT OPERATIONS
• print	

• foreachRDD	

• saveAsObjectFiles	

• saveAsTextFiles	

• saveAsHadoopFiles
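A minimal sketch, again with the hypothetical counts DStream and the jssc context from the INITIALIZING slide: print a sample of each batch, handle each batch's RDD yourself with foreachRDD, then start the context (nothing happens before start()).

// Print the first elements of every batch on the driver
counts.print();

// Full control over each batch: here we only log its size, but this is the
// place to write results to a database, a queue, etc.
counts.foreachRDD(rdd -> {
	System.out.println("batch size: " + rdd.count());
	return null;
});

jssc.start();
jssc.awaitTermination();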
THINGS TO REMEMBER
USE SPARK-SHELL TO LEARN
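For example, Spark 1.0 starts a local shell with four worker threads via:

./bin/spark-shell --master local[4]

The shell predefines sc, so the Scala EXAMPLES from the earlier slides can be pasted in as-is.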
PROVIDE ENOUGH RAM TO WORKERS
PROVIDE ENOUGH RAM TO EXECUTORS
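Both knobs, sketched for Spark 1.0 (values are arbitrary examples); sparkConfig is the SparkConf from the Kryo slide:

// Per-application executor heap
sparkConfig.set("spark.executor.memory", "4g");

// On a standalone cluster, the total RAM a worker may hand out is configured
// outside the application, e.g. SPARK_WORKER_MEMORY=28g in conf/spark-env.sh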
SET FRAME SIZE / BUFFERS
ACCORDINGLY
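The Spark 1.0-era settings this refers to (a sketch; values are examples): the Akka frame size caps driver/executor messages, and the Kryo buffer must fit your largest serialized object.

// Maximum driver/executor message size, in MB (Spark 1.x)
sparkConfig.set("spark.akka.frameSize", "64");

// Kryo serialization buffer, in MB (also shown on the Kryo slide)
sparkConfig.set("spark.kryoserializer.buffer.mb", "32");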
USE KRYO SERIALIZER
SPLIT DATA INTO AN APPROPRIATE NUMBER OF PARTITIONS
PACKAGE YOUR APPLICATION IN AN UBER-JAR
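One common way to do that with the Maven setup from the LINKING WITH SPARK slide is the shade plugin; a sketch (the plugin version is only an example):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.2</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
    </execution>
  </executions>
</plugin>

Marking the Spark dependency as provided keeps the resulting jar small, since the cluster already ships Spark.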
DESIGN YOUR DATA FLOW
AND…
BUILD A FRAMEWORK TO PROCESS DATA EFFICIENTLY
IT’S EASIER WITH SCALA!
	 // word count example	
	 inputLine.flatMap(line => line.split(" "))	
	 	 .map(word => (word, 1))	
	 	 .reduceByKey(_ + _);
HOW WE USE SPARK
HOW WE USE SPARK
HOW WE USE SPARK
THANKS!
we’re hiring !	

mail me: bbogacki@bidlab.pl
