Learning Apache Spark
Part 1
Presenter Introduction
• Tim Spann, Senior Solutions Architect, airis.DATA
• ex-Pivotal Senior Field Engineer
• DZone MVB and Zone Leader
• ex-Startup Senior Engineer / Team Lead
http://www.slideshare.net/bunkertor
http://sparkdeveloper.com/
http://www.twitter.com/PaasDev
airis.DATA
airis.DATA is a next-generation system integrator that specializes in rapidly deployable machine learning and graph solutions.
Our core competencies involve providing modular, scalable Big Data products that can be tailored to fit use cases across industry verticals.
We offer predictive modeling and machine learning solutions at petabyte scale using the most advanced, best-in-class technologies and frameworks, including Spark, H2O, Mahout, and Flink.
Our data pipelining solutions can be deployed in batch, real-time, or near-real-time settings to fit your specific business use case.
Agenda
• Overview
• What is Map Reduce?
• Hands-On:
  • Installation
  • Spark Map Reduce
  • Build with IntelliJ/SBT
  • Deploy Local
Overview
Spark	is	a	fast	cluster	computing	
system	that	supports	Java,	Scala,	
Python	and	R	APIs.			It	allows	for	
multiple	workloads	using	the	same	
system	and	coding.			
One	stop	shopping	for	your	big	
data	processing	at	scale	needs.
It	works	well	with	existing	Hadoop	
clusters,	by	itself,	with	AWS	or	on	
it’s	own.
http://spark.apache.org/docs/latest/index.html
What is Map Reduce?

TRANSFORMATION
map(func): Return a new distributed dataset formed by passing each element of the source through a function func.

ACTION
reduce(func): Aggregate the elements of the dataset using a function func (which takes two arguments and returns one). The function should be commutative and associative so that it can be computed correctly in parallel.
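A minimal sketch of the two operations in spark-shell (sc is the SparkContext the shell provides; the data and variable names are illustrative):

// TRANSFORMATION: map builds a new RDD lazily; nothing executes yet
val nums = sc.parallelize(Seq(1, 2, 3, 4, 5))
val doubled = nums.map(n => n * 2)
// ACTION: reduce triggers the computation and returns a single value to the driver
val total = doubled.reduce(_ + _)   // 30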
Problem Definition
We have Apache logs from our website. They follow a standard pattern and we want to parse them to gain some insights on usage.
114.200.179.85 - - [24/Feb/2016:00:10:02 -0500] "GET /wp HTTP/1.1" 200 5279 "http://sparkdeveloper.com/" "Mozilla/5.0"
The fields, in the order they appear in the line:
• IP Address
• ClientID
• UserID
• Date Time Stamp
• Request String
• HTTP Status Code
• Bytes Sent
• HTTP Referer
• User Agent
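A case class and regular expression along these lines can capture those fields. This is only a sketch; the exact pattern and field names used in the course code may differ.

// One case class field per group we keep (the request string is split into method, resource, protocol)
case class LogRecord(ipAddress: String, clientId: String, userId: String,
                     dateTime: String, method: String, httpStatus: Int,
                     bytesSent: Long, referer: String, userAgent: String)

// Groups: 1 ip, 2 client, 3 user, 4 timestamp, 5 method, 6 resource,
//         7 protocol, 8 status, 9 bytes, 10 referer, 11 user agent
val logPattern =
  """^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+) "([^"]*)" "([^"]*)"$""".r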
Map Function

logFile.map(parseLogLine)

LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
  m.group(5), m.group(8).toInt, m.group(9).toLong, m.group(10), m.group(11))

Our mapping function is parseLogLine, which takes a log string and uses regular expressions to split it into the fields of a case class.

val contentSizes = accessLogs.map(log => log.bytesSent)

Our second mapping function maps each record to just the bytes-sent field.
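Put together with the illustrative logPattern and LogRecord sketched earlier, parseLogLine might look roughly like this (the input path is a placeholder):

def parseLogLine(line: String): LogRecord = {
  // Match the line against the Apache log pattern and pick out the groups we keep
  val m = logPattern.findFirstMatchIn(line)
    .getOrElse(throw new RuntimeException(s"Cannot parse log line: $line"))
  LogRecord(m.group(1), m.group(2), m.group(3), m.group(4),
            m.group(5), m.group(8).toInt, m.group(9).toLong,
            m.group(10), m.group(11))
}

val logFile = sc.textFile("access.log")                  // RDD[String], one log line per record
val accessLogs = logFile.map(parseLogLine)               // RDD[LogRecord]
val contentSizes = accessLogs.map(log => log.bytesSent)  // RDD[Long]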
Reduce

contentSizes.reduce(_ + _)

We reduce by summing all the bytes in the dataset. The result is the total of all the content sizes.
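Because the reducing function must be commutative and associative, the same action works for other aggregates too (a sketch; count() is part of the RDD API):

val totalBytes = contentSizes.reduce(_ + _)                          // sum of all bytes sent
val maxBytes = contentSizes.reduce((a, b) => if (a > b) a else b)    // largest single response
val avgBytes = totalBytes / contentSizes.count()                     // rough average per request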
Spark 1.6.1 Stack
• Libraries: Spark SQL, Spark Streaming, MLlib, GraphX
• Engine: Spark Core
• Cluster managers: Standalone, YARN, Mesos
Hands-On
• Spark Map Reduce
• Build with IntelliJ/SBT (see the build sketch below)
• Deploy Local
• Run History Server:
spark-1.6.1-bin-hadoop2.6/sbin/start-history-server.sh
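For the build step, an sbt definition along these lines is enough for a Spark 1.6.1 application; the project name is a placeholder:

name := "spark-log-analysis"

version := "1.0"

scalaVersion := "2.10.6"

// "provided" keeps Spark out of the packaged jar; spark-submit supplies it at runtime
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1" % "provided"

After sbt package, the resulting jar can be deployed locally with spark-submit, for example: spark-1.6.1-bin-hadoop2.6/bin/spark-submit --class LogAnalysis --master "local[2]" target/scala-2.10/spark-log-analysis_2.10-1.0.jar (the main class and jar names depend on your project).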
Installation
• Install JDK
• Install Scala 2.10
• Install SBT
• Install Maven (optional)
• Unzip Spark 1.6.1
Environment Variables (example values)

Unix / Linux / Mac:
export SCALA_HOME=/usr/local/share/scala
export PATH=$PATH:$SCALA_HOME/bin

Windows:
set SCALA_HOME=C:\Progra~1\Scala
set PATH=%PATH%;%SCALA_HOME%\bin
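To check the setup, start the shell with spark-1.6.1-bin-hadoop2.6/bin/spark-shell and run a one-liner (the numbers are illustrative):

// sc is the SparkContext that spark-shell creates for you
sc.parallelize(1 to 100).filter(_ % 2 == 0).count()   // should return 50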
Spark Resources
• https://courses.edx.org/courses/BerkeleyX/CS100.1x/1T2015/info
• http://airisdata.com/scala-spark-resources-setup-learning/
• http://spark.apache.org/docs/latest/monitoring.html
• http://spark.apache.org/docs/latest/submitting-applications.html
Spark Cluster
http://spark.apache.org/docs/latest/cluster-overview.html

Glossary
The following table summarizes terms you'll see used to refer to cluster concepts:

Application: User program built on Spark. Consists of a driver program and executors on the cluster.
Application jar: A jar containing the user's Spark application. In some cases users will want to create an "uber jar" containing their application along with its dependencies. The user's jar should never include Hadoop or Spark libraries; however, these will be added at runtime.
Driver program: The process running the main() function of the application and creating the SparkContext.
Cluster manager: An external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
Deploy mode: Distinguishes where the driver process runs. In "cluster" mode, the framework launches the driver inside of the cluster. In "client" mode, the submitter launches the driver outside of the cluster.
Worker node: Any node that can run application code in the cluster.
Executor: A process launched for an application on a worker node, that runs tasks and keeps data in memory or disk storage across them. Each application has its own executors.
Task: A unit of work that will be sent to one executor.
Job: A parallel computation consisting of multiple tasks that gets spawned in response to a Spark action (e.g. save, collect); you'll see this term used in the driver's logs.
Stage: Each job gets divided into smaller sets of tasks called stages that depend on each other (similar to the map and reduce stages in MapReduce); you'll see this term used in the driver's logs.
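To tie these terms together, here is a minimal sketch of a driver program; the object name, master URL, and input path are placeholders (in a real cluster deployment the master is usually set by spark-submit):

import org.apache.spark.{SparkConf, SparkContext}

object LogAnalysis {
  def main(args: Array[String]): Unit = {
    // The driver program creates the SparkContext, which asks the cluster manager for executors
    val conf = new SparkConf().setAppName("Log Analysis").setMaster("local[2]")
    val sc = new SparkContext(conf)

    // An action such as count() triggers a job, which is split into stages and tasks run on executors
    val lineCount = sc.textFile("access.log").count()
    println(s"Lines: $lineCount")

    sc.stop()
  }
}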
