Apache Spark
Cluster Computing Platform
ITV-DS, Applied Computing Group.
Sergio Viademonte, PhD.
Spark Introduction
-  Apache Spark is an open-source, fast, general-purpose cluster computing platform
-  parallel distributed processing
-  fault tolerance
-  on commodity hardware
-  Originally developed at UC Berkeley AMPLab, 2009
-  Open sourced in March 2010
-  Donated to the Apache Software Foundation, 2013
-  Written in Scala
-  Runs on the JVM
Insight Data Labs, December 2015
Spark Introduction
-  Deployed at massive scale, multiple petabytes of data
-  Clusters of over 8,000 nodes
-  Yahoo, Baidu, Tencent, ...
Spark Features
•  Spark is a general computation engine that uses distributed memory to perform fault-tolerant computations on a cluster
•  Speed
•  Ease of use
•  Analytics
•  Environments that require
   •  Large datasets
   •  Low latency processing
•  Spark can perform iterative computations at scale (in memory), which opens up the possibility of executing machine learning algorithms much faster than with Hadoop MR (disk-based) [2] [4].
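The in-memory iteration described above can be sketched as follows. This is a minimal illustration, not code from the original slides: the object name `IterativeSketch`, the sample values, the step size, and the local master setting are all assumptions. It estimates the mean of a small dataset by gradient descent, and `cache()` asks Spark to keep the partitions in memory so that later passes avoid re-reading the input.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IterativeSketch {
  // Hypothetical example: estimate the mean of a dataset by gradient descent.
  def estimateMean(sc: SparkContext, iterations: Int = 50): Double = {
    // cache() asks Spark to keep the partitions in memory after the first pass.
    val points = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0, 5.0)).cache()
    val n = points.count()
    var estimate = 0.0
    for (_ <- 1 to iterations) {
      val current = estimate // copy to a val so the closure captures a plain value
      // One full pass over the cached data per iteration.
      val gradient = points.map(x => current - x).reduce(_ + _) / n
      estimate = current - 0.5 * gradient
    }
    estimate
  }

  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM; a real deployment would use a cluster master.
    val sc = new SparkContext(new SparkConf().setAppName("iterative-sketch").setMaster("local[*]"))
    try println(estimateMean(sc)) finally sc.stop()
  }
}
```

With a disk-based engine each of the 50 passes would start from storage again, which is the cost the slide contrasts with Hadoop MR.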
Spark Features
•  Computational engine: scheduling, distributing, and monitoring applications that consist of many computational tasks across a computing cluster.
•  From an engineering perspective, Spark hides the complexity of:
   •  distributed systems programming
   •  network communication
   •  fault tolerance
Spark Features
•  Spark contains multiple closely integrated components, designed to interoperate and to be combined as libraries in a software project.
•  Supports Java (6+), Scala (2.10+) and Python (2.6+)
•  Runs on top of Hadoop, Mesos* [1], standalone, or in the cloud
•  Accesses diverse data sources: HDFS, Cassandra, HBase [2]
•  Supports SQL queries
•  Machine learning algorithms
•  Graph processing
•  Stream processing
•  Sensor data processing
* Mesos: a general cluster manager that provides APIs for resource management and scheduling across datacenter and cloud environments (www.mesos.apache.org). It can run Hadoop MR and service applications.
Spark Ecosystem
https://www.safaribooksonline.com/library/view/learning-spark/
Spark Ecosystem
Spark Core
The execution engine for the Spark platform, on which all other functionality is built.

Contains the basic functionality of Spark [5]:
•  in-memory computing capabilities
•  memory management
•  components for task scheduling
•  fault recovery
•  interacting with storage systems
•  Java, Scala, and Python APIs
•  Resilient Distributed Datasets (RDDs) API
Spark Ecosystem

§  Resilient Distributed Datasets (RDDs)
•  Spark's main programming abstraction for working with data.
•  An RDD represents a fault-tolerant collection of elements distributed across many compute nodes that can be manipulated in parallel.
•  Spark Core provides many APIs for building and manipulating these collections.
•  All work is expressed as
   •  creating new RDDs
   •  transforming existing RDDs (transformations return pointers to new RDDs)
   •  actions, which call operations on RDDs and return values
Ex:  val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
Spark Ecosystem

§  Resilient Distributed Datasets (RDDs)
•  creating new RDDs
scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3

•  transforming existing RDDs (transformations return pointers to new RDDs)
Ex: a filter transformation returns a new RDD with a subset of the items in the file.
scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
•  actions, which call operations on RDDs and return values
scala> textFile.count() // Number of items in this RDD
res0: Long = 126
scala> textFile.first() // First item in this RDD
res1: String = # Apache Spark
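The three kinds of RDD operation listed above can be combined in one small program. The sketch below is an illustration under assumptions: the object name `RddSketch` and the sample lines are made up, an in-memory collection stands in for `sc.textFile`, and the implicit pair-RDD operations assume Spark 1.3 or later. Transformations only describe new RDDs; nothing runs until the `collect()` action is called.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddSketch {
  // Hypothetical example: count words in a few made-up lines.
  def wordCounts(sc: SparkContext): Map[String, Int] = {
    // Creating an RDD from an in-memory collection (a stand-in for sc.textFile).
    val lines = sc.parallelize(Seq("Apache Spark", "Spark runs on the JVM", "written in Scala"))

    // Transformations: each call returns a new RDD and is evaluated lazily.
    val counts = lines
      .flatMap(line => line.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    // Action: triggers the job and returns values to the driver.
    counts.collect().toMap
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-sketch").setMaster("local[*]"))
    try wordCounts(sc).foreach(println) finally sc.stop()
  }
}
```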
Spark Ecosystem
Spark SQL
•  Spark SQL is a Spark module for structured data processing
•  Allows querying data via SQL as well as HQL (Hive Query Language)
•  Acts as a distributed SQL query engine
•  Extends the Spark RDD API
•  It provides DataFrames: a DataFrame is equivalent to a relational table in Spark SQL.
•  It also provides powerful integration with the rest of the Spark ecosystem (e.g., integrating SQL query processing with machine learning)
•  It enables unmodified Hadoop Hive queries to run up to 100x faster on existing deployments and data
Spark Ecosystem
Spark SQL - DataFrames
// sc is an existing SparkContext (spark-shell provides one).
val sqlContext = new org.apache.spark.sql.SQLContext(sc)

// Create the DataFrame
val df = sqlContext.read.json("examples/src/main/resources/people.json")

// Displays the content of the DataFrame to stdout
df.show()

// age name
// 20 Michael
// 30 Andy
// 19 Justin
Spark Ecosystem
Spark SQL - DataFrames
// Select only the "name" column
df.select("name").show()

// name
// Michael
// Andy
// Justin

// Select people older than 21
df.filter(df("age") > 21).show()

// Count people by age
df.groupBy("age").count().show()
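The same filter can also be written as a SQL query, which is the other access path Spark SQL offers. The sketch below is an assumption-laden illustration rather than code from the slides: it uses the `SparkSession` entry point from Spark 2.0 and later instead of the `SQLContext` shown above, the object name `SqlSketch` is made up, and the three rows are typed in by hand to mirror the slide output instead of being read from `people.json`.

```scala
import org.apache.spark.sql.SparkSession

object SqlSketch {
  // Hypothetical example: names of people older than minAge, via a SQL query.
  def namesOlderThan(spark: SparkSession, minAge: Int): Seq[String] = {
    import spark.implicits._ // enables toDF on local collections

    // In-memory rows mirroring the slide output (not loaded from people.json).
    val people = Seq(("Michael", 20), ("Andy", 30), ("Justin", 19)).toDF("name", "age")

    // Register the DataFrame under a name that SQL queries can refer to.
    people.createOrReplaceTempView("people")

    spark.sql(s"SELECT name FROM people WHERE age > $minAge")
      .collect()
      .map(row => row.getString(0))
      .toSeq
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("sql-sketch").master("local[*]").getOrCreate()
    try println(namesOlderThan(spark, 21)) finally spark.stop()
  }
}
```

On Spark 1.x the equivalent calls would go through `sqlContext.sql` and `registerTempTable`.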
	
Spark Ecosystem
Spark Streaming
•  Spark component that provides the ability to process and analyze live streams of data in real time.
•  Typical sources: web logs, online posts, updates from web services, log files, etc.
•  Enables powerful interactive and analytical applications across both streaming and historical data
•  Integrates with a wide variety of popular data sources, including HDFS, Flume, Kafka, and Twitter
•  API for manipulating data streams
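A stream-manipulation job in the DStream API might look like the sketch below. It is illustrative only: the object name `StreamingSketch`, the log lines, the one-second batch interval, and the five-second timeout are assumptions, and `queueStream` is used as a stand-in for a live source such as Kafka so that the example runs without external services.

```scala
import scala.collection.mutable
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingSketch {
  // Hypothetical example: count requests per HTTP method in made-up log lines.
  def run(): Map[String, Int] = {
    val conf = new SparkConf().setAppName("streaming-sketch").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(1)) // 1-second micro-batches

    // queueStream turns each queued RDD into one batch (stand-in for a live source).
    val queue = mutable.Queue[RDD[String]]()
    queue += ssc.sparkContext.parallelize(
      Seq("GET /index.html", "GET /about.html", "POST /login"))
    val lines = ssc.queueStream(queue)

    // The same map/reduceByKey style as the RDD API, applied to every batch.
    val counts = lines.map(line => (line.split(" ")(0), 1)).reduceByKey(_ + _)

    // foreachRDD runs on the driver once per batch; gather the small results there.
    val seen = mutable.Map[String, Int]()
    counts.foreachRDD { rdd =>
      val batch = rdd.collect()
      seen.synchronized {
        batch.foreach { case (method, n) => seen(method) = seen.getOrElse(method, 0) + n }
      }
    }

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000) // let a few batches run, then shut down
    ssc.stop(stopSparkContext = true, stopGracefully = true)
    seen.synchronized { seen.toMap }
  }

  def main(args: Array[String]): Unit = println(run())
}
```

A production job would normally call `awaitTermination()` and run until stopped; the timeout here only keeps the sketch finite.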
Spark Ecosystem
Spark MLlib - Machine Learning
•  MLlib is a scalable machine learning library that delivers both high-quality algorithms (e.g., multiple iterations to increase accuracy) and speed (up to 100x faster than MapReduce).
•  Provides multiple types of machine learning algorithms, including classification, regression, clustering, and collaborative filtering, as well as supporting functionality such as model evaluation and data import.
•  The library is usable in Java, Scala, and Python as part of Spark applications, so it can be included in complete workflows
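As one example of the clustering support mentioned above, the sketch below runs k-means from the RDD-based `org.apache.spark.mllib` package on six hand-made points. The object name `KMeansSketch`, the points, and the choice of two clusters and 20 iterations are assumptions made for illustration.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

object KMeansSketch {
  // Hypothetical example: cluster six 2-D points into two groups.
  def clusterAssignments(sc: SparkContext): Seq[Int] = {
    // Two well-separated groups of made-up points.
    val points = Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.2), Vectors.dense(0.2, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 8.9), Vectors.dense(8.9, 9.2))

    // k-means is iterative, so the training data is cached in memory.
    val data = sc.parallelize(points).cache()

    // Train with k = 2 clusters and at most 20 iterations.
    val model = KMeans.train(data, 2, 20)

    // Cluster index assigned to each input point, in input order.
    points.map(p => model.predict(p))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("kmeans-sketch").setMaster("local[*]"))
    try println(clusterAssignments(sc)) finally sc.stop()
  }
}
```

The cluster indices themselves are arbitrary; only the grouping of the points is meaningful.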
Spark Ecosystem
Spark GraphX - Graph Computation
•  It is a library for manipulating graphs
•  Performs graph-parallel computations
•  GraphX is a graph computation engine built on top of Spark that enables users to interactively build, transform and reason about graph-structured data at scale
•  GraphX extends the Spark RDD API
•  Provides a library of graph algorithms (e.g., PageRank and triangle counting)
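A small PageRank run shows how a graph is built from two RDDs, one of vertices and one of edges. The sketch is illustrative: the object name `GraphSketch`, the three vertices, the two edges, and the tolerance of 0.001 are assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object GraphSketch {
  // Hypothetical example: return the id of the vertex with the highest PageRank.
  def topRanked(sc: SparkContext): Long = {
    // Vertices are (id, attribute) pairs; edges carry source id, target id, attribute.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 3L, 1), Edge(2L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // Run PageRank until the ranks change by less than the tolerance.
    val ranks = graph.pageRank(0.001).vertices // pairs of (vertex id, rank)

    ranks.collect().maxBy { case (_, rank) => rank }._1
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("graph-sketch").setMaster("local[*]"))
    try println(topRanked(sc)) finally sc.stop()
  }
}
```

Vertex 3 is the only vertex with incoming edges, so it receives the highest rank.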
Spark Ecosystem
Cluster Managers
•  Spark is designed to scale up from one to many thousands of compute nodes
•  Runs on diverse cluster managers:
   •  Hadoop YARN
   •  Apache Mesos [1]
   •  Standalone Scheduler: a simple cluster manager included in Spark
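From the application's side, the choice of cluster manager mostly comes down to the master URL passed in the configuration. The helper below is a hypothetical sketch: `MasterSketch`, `confFor`, and the host names are made up, the ports are the commonly used defaults, and the plain `yarn` master string assumes Spark 2.0 or later (Spark 1.x used `yarn-client` and `yarn-cluster`).

```scala
import org.apache.spark.SparkConf

object MasterSketch {
  // Hypothetical helper: map a cluster manager name to a master URL.
  // Host names are placeholders, not real endpoints.
  def confFor(manager: String): SparkConf = {
    val master = manager match {
      case "standalone" => "spark://master-host:7077" // Spark's own scheduler
      case "mesos"      => "mesos://mesos-host:5050"  // Apache Mesos
      case "yarn"       => "yarn"                     // Hadoop YARN (Spark 2.0+ spelling)
      case _            => "local[*]"                 // single JVM, all cores
    }
    new SparkConf().setAppName("cluster-sketch").setMaster(master)
  }

  def main(args: Array[String]): Unit =
    Seq("standalone", "mesos", "yarn", "local").foreach { m =>
      println(s"$m -> ${confFor(m).get("spark.master")}")
    }
}
```

The application code that follows the configuration stays the same across managers; in practice the master is often supplied on the `spark-submit` command line instead of being hard-coded.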
References
[1] Apache Mesos. http://spark.apache.org/docs/1.3.0/cluster-overview.html
[2] Apache Hadoop. http://hadoop.apache.org/.
[3] Apache Mahout. https://mahout.apache.org/.
[4] Shi, Juwei; Qiu, Yunjie et al. "Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics." IBM Research China, IBM Almaden Research Center, Renmin University of China.
[5] Safari Books Online. https://www.safaribooksonline.com/library/view/learning-spark/.