Power Software Development with Apache Spark
Josiah Samuel
IBM Systems Development Labs – Bangalore
Agenda
§ Motivation
§ Introduction to Apache Spark
§ Spark SQL
§ Spark Internals
§ Spark ML Pipelines
Big-Data	Era
§ Collect,	Store	&	Process	information	at	scale	
§ Rise	of	Open	Source	Software
– Leverage	clusters	of	commodity	computers	to	process	the	data
§ Data	Science
– Bridges the gap between the data and the tools
– Starts with running simple queries
§ Placing a schema on the data and running SQL queries
– R, Octave, Python scikit-learn
§ Data is partitioned and spread across nodes (HDFS)
– Algorithms with wide data dependencies suffer from network delays
– The probability of node failure increases
Examples	of	Big-Data	Processing
§ Build	a	model	to	detect	credit	card	fraud	using	thousands	of	features	and	
billions	of	transactions.
§ Intelligently	recommend	millions	of	products	to	millions	of	users.
§ Estimate	financial	risk	through	simulations	of	portfolios	including	millions	of	
instruments.
§ Easily	manipulate	data	from	thousands	of	human	genomes	to	detect	genetic	
associations	with	disease.
Parallel Systems
§ Tightly coupled systems
§ Multiple processors share the same memory address space
§ Scale-up servers
§ High-Performance Computing (HPC)
§ Disadvantages:
– Limited scalability
– Expensive

Distributed Systems
§ Loosely coupled systems
§ Communicate with each other over a network
§ Scale-out servers
§ Capable of collaborating to complete a task
§ Disadvantages:
– Difficulty in developing distributed software
– Network problems
– Reliability & fault tolerance
Apache	Hadoop
§ Hadoop	emerges	as	a	leader
– Filesystem	Abstraction
– M/R	programming	model
– Linear	scalability
– Automatic	failure	recovery
– Cheaper	solution
§ Challenges
– Transformational APIs for feature engineering are missing
– Not well suited for ML modeling
§ ML algorithms make multiple passes over the same data sets, and M/R writes results to disk between passes
Apache	Spark
§ An analytics operating system
§ No need to write intermediate results to disk
§ All transformations are represented in a DAG
– A directed acyclic graph of operators
§ Results are passed directly to the next step in the pipeline
§ In-memory processing
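The DAG idea can be illustrated with a toy lazy-evaluation sketch (plain Python, not Spark's implementation; the `LazyDataset` class and its methods are made up for illustration): transformations only record operators, and an action triggers a single pass in which each record flows through the whole chain in memory.

```python
# Toy sketch (NOT Spark's implementation): transformations build a
# chain of operators lazily; nothing runs until an action is called,
# and intermediate results stay in memory instead of going to disk.

class LazyDataset:
    def __init__(self, source, ops=()):
        self.source = source          # in-memory input
        self.ops = ops                # recorded chain (DAG path) of operators

    def map(self, f):
        return LazyDataset(self.source, self.ops + (("map", f),))

    def filter(self, p):
        return LazyDataset(self.source, self.ops + (("filter", p),))

    def collect(self):                # the "action" that triggers execution
        out = []
        for x in self.source:         # each record flows through the whole
            keep = True               # operator chain; no intermediate
            for kind, f in self.ops:  # collection is written anywhere
                if kind == "map":
                    x = f(x)
                elif kind == "filter" and not f(x):
                    keep = False
                    break
            if keep:
                out.append(x)
        return out

ds = LazyDataset(range(6)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(ds.collect())  # [0, 4, 16]
```

Note that building `ds` does no work at all; only `collect()` executes the recorded operators, which mirrors how Spark defers computation until an action.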
Performance Aspect
Ecosystem Aspect
Apache Spark APIs:
• Easy-to-use APIs
• Improve productivity when operating on large datasets
• Intuitive and expressive
• RDDs, DataFrames/Datasets
v The APIs allow seamless movement between DataFrames/Datasets and RDDs
v DataFrames and Datasets are built on top of RDDs
v REPL – interactive analytics
Development Aspect
Hardware	Trends
            2010             2017             Rate of Increase
Storage     50+ MB/s (HDD)   500+ MB/s (SSD)  10X
Network     1 Gbps           10 Gbps          10X
CPU         ~3 GHz           ~3 GHz           ??
Spark	Software	Stack	Trend
§ IO	has	been	optimized
– Reduce	IO	by	pruning	input	data	that	is	not	needed
– New	shuffle	and	network	implementations	
(2014	sort	record	– Ref:	http://sortbenchmark.org/)
§ Data	formats	have	improved
– E.g.	Parquet	is	a	“dense”	columnar	format
§ CPU	increasingly	the	bottleneck;	trend	expected	to	continue
Project	Tungsten
§ Phase	1	– Foundation – Spark	1.6.1
– Memory	Management
– Code	Generation
– Cache-aware	Algorithms
§ Phase	2 - Spark	2.0.1
– Whole-stage	Codegen
– Vectorization
Project	Tungsten:	Phase	1
§ Perform explicit memory management instead of relying on Java objects
– Reduce memory footprint
– Eliminate garbage collection overheads
– Use sun.misc.Unsafe-based rows (UnsafeRow) and off-heap memory
§ Code generation for expression evaluation
– Reduce virtual function calls and interpretation overhead
§ Cache-conscious sorting
– Avoid bad memory access patterns
Ref:
https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html
https://www.youtube.com/watch?v=5ajs8EIPWGI&feature=youtu.be
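The expression code-generation idea can be sketched in plain Python (Tungsten actually generates Java bytecode; the expression-tree encoding below is invented for illustration): instead of walking an expression tree for every row, emit source for the whole expression once and compile it.

```python
# Conceptual sketch of expression code generation (Python stand-in
# for Tungsten's generated Java code).

def interpret(node, row):
    """Tree-walking interpreter: one dispatch per node, per row."""
    op = node[0]
    if op == "col":
        return row[node[1]]
    if op == "lit":
        return node[1]
    if op == "+":
        return interpret(node[1], row) + interpret(node[2], row)
    if op == "*":
        return interpret(node[1], row) * interpret(node[2], row)
    raise ValueError(op)

def codegen(node):
    """Emit a single Python expression string for the whole tree."""
    op = node[0]
    if op == "col":
        return f"row[{node[1]!r}]"
    if op == "lit":
        return repr(node[1])
    return f"({codegen(node[1])} {op} {codegen(node[2])})"

# (a + 1) * b
expr = ("*", ("+", ("col", "a"), ("lit", 1)), ("col", "b"))
compiled = eval("lambda row: " + codegen(expr))  # compiled once, reused per row

row = {"a": 3, "b": 5}
assert interpret(expr, row) == compiled(row) == 20
```

The compiled function pays no per-node dispatch cost per row, which is the overhead the slide's "virtual function calls and interpretation" bullet refers to.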
Project	Tungsten:	Phase	1:	Memory	Management	
String str = “abcd”;
In Java, this 4-character string takes 48 bytes (object header, hash field, and the backing char array).
In Spark’s Tungsten binary format, the entire object fits in 80 bytes
Note: this applies only to Datasets/DataFrames, as the schema is known
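A fixed-layout binary row can be sketched with Python's `struct` module (a conceptual stand-in, NOT Spark's UnsafeRow; the schema and helper names are invented): because the schema is known, fields live at fixed offsets in raw bytes instead of being boxed objects with headers and pointers.

```python
# Conceptual sketch of a schema-aware, fixed-layout binary row.
import struct

ROW = struct.Struct("<qd8s")     # schema: (long, double, 8-byte string slot)

def pack_row(uid, score, name):
    return ROW.pack(uid, score, name.encode("utf-8").ljust(8, b"\0"))

def unpack_row(buf):
    uid, score, raw = ROW.unpack(buf)
    return uid, score, raw.rstrip(b"\0").decode("utf-8")

buf = pack_row(42, 3.5, "abcd")
print(len(buf))            # 24 bytes for the whole row, no per-field objects
print(unpack_row(buf))     # (42, 3.5, 'abcd')
```

Packing also means a row can be compared, hashed, or spilled as raw bytes, which is why eliminating Java object overhead helps both memory footprint and GC.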
Project	Tungsten:	Phase	2
(Figure: the Volcano iterator model vs. the simplicity of hand-written code)
Project	Tungsten:	Phase	2
§ Shortcomings of the Volcano iterator model
– Too many virtual function calls
– Intermediate data in memory (or L1/L2/L3 cache)
– Can’t take advantage of modern CPU features:
§ Loop unrolling
§ SIMD
§ Pipelining
§ Prefetching
§ Branch prediction
Project	Tungsten:	Phase	2
§ Fuse operators together so the generated code looks like hand-optimized code
§ Identify chains of operators (“stages”)
§ Compile each stage into a single function
§ Result: the performance of hand-written code with the functionality of a general-purpose execution engine
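The contrast can be sketched in plain Python (a conceptual stand-in; whole-stage codegen actually emits Java code inside Spark): the Volcano model chains one iterator per operator and pays a call per row per operator, while fusing the stage produces a single tight loop.

```python
# Conceptual sketch of whole-stage code generation.
data = list(range(10))

# Volcano-style: each operator is its own generator, pulled row by row.
def scan(rows):   yield from rows
def filt(rows):   yield from (r for r in rows if r % 2 == 0)
def proj(rows):   yield from (r * r for r in rows)
volcano_result = list(proj(filt(scan(data))))

# Fused stage: one generated function, one loop, no per-operator calls.
fused_src = """
def stage(rows):
    out = []
    for r in rows:              # single loop over the input
        if r % 2 == 0:          # filter inlined
            out.append(r * r)   # projection inlined
    return out
"""
ns = {}
exec(fused_src, ns)             # the "code generation" step: compile once
fused_result = ns["stage"](data)

assert volcano_result == fused_result == [0, 4, 16, 36, 64]
```

The fused loop is exactly the kind of code a compiler and CPU can optimize with unrolling, SIMD, and good branch prediction, which the iterator chain defeats.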
Challenges	with	Data	Science
§ The majority of the work lies in preprocessing the data
– Feature engineering
– Choosing the algorithms
– Converting data to vectors for ML algorithms
§ Iteration
– Scans over the input vectors until the model converges
– Results are based on experimentation
§ Putting models into production
– Evaluate their accuracy over time
– Rebuild the models periodically
The system should support flexible transformations
Repeated data access from disk should be handled efficiently
Model creation should be easy and suitable for production use
SparkML – High	level	functionality
• Built on top of DataFrames
• The RDD-based org.apache.spark.mllib.* API is deprecated
SparkML – TF-IDF
§ Term Frequency – Inverse Document Frequency
§ Used to build search engines
– The score indicates how important a word is to a document within a collection of documents
§ If a word appears frequently in a document, it’s important
§ But if a word appears in many documents (stop-words like “the”, “and”, “of”), the word is not
meaningful, so its score is lowered
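The scoring rule above can be written out in a few lines of plain Python (a minimal sketch, not Spark's `HashingTF`/`IDF` implementation; the example documents are made up):

```python
# Minimal TF-IDF sketch: tf rewards words frequent in a document;
# idf down-weights words that appear in many documents (stop-words).
import math

docs = [
    "the cat sat on the mat".split(),
    "the dog ate the bone".split(),
    "cat and dog play".split(),
]

def tf(word, doc):
    return doc.count(word) / len(doc)

def idf(word, docs):
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing)

def tfidf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)

# "the" occurs in most documents -> low idf drags its score down;
# "mat" is distinctive to one document -> higher score.
assert tfidf("mat", docs[0], docs) > tfidf("the", docs[0], docs)
```

Spark's `HashingTF` replaces the explicit vocabulary with feature hashing so term frequencies fit in fixed-size vectors, but the scoring intuition is the same.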
SparkML – KMeans Clustering
§ Unsupervised learning
§ Groups items into K different clusters
§ Randomly initialize the centroid of each cluster
§ Compute the Euclidean distance between each data point and the centroids to assign it to a cluster
§ Recompute each centroid from all the data points that belong to its cluster
§ Repeat until the centroid movement is negligible
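The loop described above can be sketched in plain single-machine Python (Spark's k-means is distributed; the data points here are made up):

```python
# Plain-Python sketch of the k-means assign/update loop.
import math, random

def kmeans(points, k, iters=20, seed=0):
    rnd = random.Random(seed)
    centroids = rnd.sample(points, k)            # random initialization
    for _ in range(iters):
        # Assignment step: nearest centroid by Euclidean distance.
        groups = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            groups[i].append(p)
        # Update step: recompute each centroid from its members.
        for i, g in enumerate(groups):
            if g:
                centroids[i] = tuple(sum(x) / len(g) for x in zip(*g))
    return centroids

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
cents = sorted(kmeans(pts, 2))
print(cents)  # one centroid near (0.33, 0.33), one near (10.33, 10.33)
```

A fixed iteration count stands in for the "centroid movement is negligible" stopping test to keep the sketch short.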
Spark	ML	Pipeline
§ DataFrame:
– uses the DataFrame from Spark SQL as the ML dataset; different columns can store text, feature
vectors, true labels and predictions
§ Transformer:
– an algorithm which can transform one DataFrame into another DataFrame
(example: an ML model is a Transformer that transforms a DF with features into a DF with predictions)
§ Estimator:
– an algorithm which can be fit on a DF to produce a Model
(example: a learning algorithm is an Estimator which trains on a DF and produces a model)
§ Pipeline:
– chains multiple Transformers and Estimators together to specify an ML workflow
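These four abstractions can be mimicked in a toy plain-Python form (NOT the Spark API; here a "DataFrame" is just a list of dicts and all class names are illustrative):

```python
# Toy sketch of the Transformer / Estimator / Pipeline abstractions.

class Transformer:
    def transform(self, df):            # DataFrame -> DataFrame
        raise NotImplementedError

class Tokenizer(Transformer):
    def transform(self, df):
        return [{**row, "words": row["text"].split()} for row in df]

class Model(Transformer):               # a fitted model is itself a Transformer
    def __init__(self, vocab):
        self.vocab = vocab
    def transform(self, df):
        return [{**row, "features": [row["words"].count(w) for w in self.vocab]}
                for row in df]

class Estimator:
    def fit(self, df):                  # DataFrame -> Model (a Transformer)
        vocab = sorted({w for row in df for w in row["words"]})
        return Model(vocab)

class Pipeline:
    def __init__(self, stages):
        self.stages = stages
    def fit(self, df):                  # fit Estimators, transform as we go
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(df)   # Estimator produces a Transformer
            df = stage.transform(df)
            fitted.append(stage)
        return fitted                   # the fitted stages form the "model"

df = [{"text": "spark is fast"}, {"text": "spark is easy"}]
fitted = Pipeline([Tokenizer(), Estimator()]).fit(df)
```

The key design point carried over from Spark ML is that `fit()` turns every Estimator in the chain into a Transformer, so the fitted pipeline can be applied uniformly to new data.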
Spark	ML	Pipeline
(Figure: Pipeline.fit() on a Pipeline (Estimator) chaining Tokenizer → HashingTF → KMeans; raw text is tokenized into words, hashed into feature vectors, and fit to produce a KMeansModel)
Backup
GPU	Acceleration
§ Target computation-heavy Spark applications
• Machine learning algorithms like linear regression, logistic regression, etc.
• The same lambda function is applied to a huge set of rows
§ Rising need to offload CPU work, as the CPU has become the bottleneck in Spark
§ The goal is to shorten the execution time of long-running, computation-heavy Spark applications
§ Approach
– Accelerate a Spark application by using GPUs effectively and transparently
– Minimal changes to the user’s Spark program
– No change to the existing Spark code base
GPUEnabler in	Catalyst
§ Put	GPU	kernel	launcher	and	code	generator	into	Catalyst
(Figure: architecture — the user’s Spark program uses the DataFrame/Dataset APIs; inside Catalyst, the GPU code generator and GPU kernel launcher sit alongside the logical optimizer and CPU code generator; Tungsten’s memory manager moves off-heap UnsafeRow and columnar data to and from columnar GPU device memory)
How	to	use	GPUEnabler Plugin
Available	at	 https://github.com/IBMSparkGPU/GPUEnabler
Build the package:
This installs the package to the local Maven repository.
To include this package in your Spark application, add the dependency to the application’s pom.xml file:
More information on the APIs and sample programs can be found in the GitHub repository.
API	usage	compared	to	Spark	APIs
§ map & reduce are replaced with mapExtFunc & reduceExtFunc
§ Handles which hold the mapping to the CUDA kernels are passed to them
§ The handles also specify the input and output parameter mappings
