SlideShare a Scribd company logo
1 of 1
Download to read offline
Investigating	Performance	Using	Apache	Tez	
Ace	Haidrey,	Ad-Analytics	Team	-	Engineering	
July	28,	2016	
Abstract	
Big	data	has	recently	become	a	common	buzzword	in	the	industry,	but	few	
understand	the	true	extent	of	the	word.	Big	data	is	a	term	for	massive	data	sets	
having	large,	complex,	and	more	varied	structure,	which	creates	difGiculties	of	storing	
and	analyzing	this	data	for	results	or	further	processes.	A	lot	of	research	is	being	
done	today	to	reveal	hidden	patterns	or	secret	correlations	within	these	data	sets	to	
gain	richer	and	deeper	insights.	These	insights	are	extremely	valuable	for	companies	
for	multiple	reasons,	such	as	gaining	advantages	over	their	competitors	and	making	
their	engineering	teams	more	productive.	One	big	data	framework,	Hadoop	
MapReduce,	has	become	the	de	facto	standard	for	processing	voluminous	data	on	a	
large	cluster	of	machines.	However,	some	tasks/jobs	do	not	naturally	yield	to	this	
programming	model.	This	poster	presentation	will	delve	into	the	shortcomings	of	
Hadoop	MapReduce	and	will	examine	another	programming	model:	Apache	Tez.			
	
Hadoop	was	created	to	handle	processing	of	large	scales	of	data	using	clusters	of	
computers.	It’s	based	on	two	components:	Hadoop	Distributed	File	System	(HDFS)	
and	MapReduce	(MR).	It	works	by	distributing	the	data	across	nodes	in	its	cluster	of	
computers,	and	then	computes	on	this	data	locally	on	each	of	these	nodes.	Hadoop	
maintains	reliability	by	replicating	its	data	across	multiple	nodes,	in	case	a	computer	
fails.	The	MapReduce	component	provides	developers	with	a	higher-level	
abstraction	and	while	it	does	hide	away	the	complexity	of	writing	a	distributed	
program/application,	it	severely	restricts	the	input/output	model	and	problem	
composition.	It	requires	any	problem	to	be	formulated	into	a	rigid	three-stage	
process	which	is	composed	of	the	following:	Map,	ShufGle/Sort,	and	Reduce.	This	can	
be	seen	in	the	Gigure	below.	MR	also	allows	chaining	of	many	of	these	two	phase	
tasks	for	sequential	execution,	and	this	allows	for	simpliGication	of	development,	but	
it	is	clearly	not	suitable	to	all	problems.	For	example,	it	is	unGit	for	iterative	
workloads	such	as	Machine	Learning	tasks	trying	to	optimize	cost	functions,	or	even	
interactive	workloads	such	as	streaming	data	processing	or	data	mining.	Beyond	just	
the	simplicity	of	the	model,	there	are	other	shortcomings	of	MR,	such	as	storing	
intermediate	outputs	in	the	hard	drive,	which	takes	a	hit	on	performance	for	very	
large	data	sets.	Thus,	it	was	evident	a	new	model	for	processing	data	on	Hadoop	was	
needed.		
Hadoop	MapReduce	
Apache	Tez	
My	Project:	Adding	the	Execution	Option		
Apache	Tez	uses	Directed	Acyclic	Graphs	(DAGs)		and	does	everything	MapReduce	
does,	but	faster.	MapReduce	can	be	thought	of	as	a	DAG	with	3	vertices,	but	Tez	allows	
for	as	many	vertices	as	desired.	The	interactions	among	these	vertices	are	performed	
using	shared	memory,	Giles,	or	network	pipelines.	One	reason	Tez	is	more	successful	is	
that	it	is	more	natural	for	SQL	query	plans.	Tez	models	an	application	into	a	dataGlow	
graph	where	vertices	represent	logic	and	the	dataGlow	among	these	vertices	are	the	
edges,	so	it	can	be	seen	how	this	is	more	Gitting	for	most	query	languages.	Tez	also	
allows	for	dynamic	graph	reconGiguration	where	it	can	alter	the	number	of	mapper	and	
reducer	tasks	on	the	Gly	as	it	seems	the	data	output	estimate	getting	larger	or	smaller.	
Another	beneGit	Tez	introduces	is	data	type	agnostic,	i.e.	this	execution	engine	is	only	
concerned	with	moving	data	through	the	pipelines,	not	the	data	format	(tuple	oriented	
formats,	key-value	pairs,	csv,	etc.)		
	
The	most	convenient	advantage	of	using	Apache	Tez	for	developers	is	that	it	is	
completely	a	client	side	application,	so	it	leverages	Apache	Hadoop	YARN	local	
resources	and	distributed	caches.	To	use	Tez	on	your	cluster,	there’s	no	need	to	deploy	
anything	besides	uploading	the	relevant	Tez	jars	to	HDFS.	Best	of	all,	Tez	can	run	any	
MR	job	without	modiGication.	This	allows	for	migration	of	projects	in	the	staging	
process	which	currently	depend	on	MR.	My	project	aims	to	take	advantage	of	this	
migration	process	within	the	ad-analytics	team,	and	eventually	expand	to	other	teams	
as	well.	
	
	
Over	the	past	7	weeks,	I’ve	been	working	on	incorporating	the	Tez	execution	engine	to	
existing	hive	scripts.	Upon	further	investigation,	I	found	that	there	are	currently	three	
hive	execution	options:	MapReduce,	Tez,	and	Spark.	I	created	two	options	to	set	this	
option	within	our	jobs	in	the	Analytics	module,	which	is	used	by	many	teams.		
	
The	Girst	option	is	to	set	the	parameter	in	the	VM:	
		SET_HIVE_EXECUTION_ENGINE	=	‘TEZ’	
	
The	second	option	is	within	the	job	URL,	add	the	parameter:	
	http://drake-ahaidrey-1.savagebeast.com:9001/analytics/job_control.vm		
	action=start&name=BrandHiveJob&set_execution_engine=tez	
	
If	the	option	within	the	VM	is	set,	every	job	and	task	will	run	with	the	engine	option,	but	
if	it	set	in	the	URL,	it	will	run	for	only	that	job.	There’s	also	an	option	to	leave	the	hive	
execution	engine	already	set	in	the	script	to	remain	unchanged.	I	am	proud	to	say	after	
writing	unit	tests	and	going	through	multiple	iterations	after	code	review	suggestions,	I	
was	able	to	commit	and	merge	my	changes	into	the	main-insights,	release-insights	and	
main	branches.		
	
	
My	Project:	Tracking	Performance	
The	next	part	of	my	project	is	to	investigate	performance	of	our	jobs	running	on	
MR	vs	Tez.	I	have	created	a	Hive	post	hook	to	collect	Job	counters	after	completion.	
We	will	mainly	be	focusing	on	CPU	run	time,	number	of	reads	and	writes,	and	a	few	
other	metrics	that	are	to	be	decided.		I	use	a	graphing	utilities	called	Graphite	and	
Grafana	to	easily	evaluate	performance	changes	at	a	glance.	It’s	especially	helpful	
to	have	counters	vs	time	plots	for	both	MR	and	Tez	jobs	on	the	same	graph	and	job.	
This	utility	may	help	convince	other	teams	to	consider	using	Tez	for	their	tasks,	
and	use	the	changes	I’ve	added	to	the	HiveTask	class	to	easily	integrate.	
	
Future	Work	
There	is	a	lot	of	investigation	left	for	using	the	Tez	execution	engine	as	well.	A	couple	
of	errors	have	risen	that	weren’t	expected,	such	as	conGlicting	GC	types,	inaccurate	
estimates	of	output	causing	a	task	to	run	4	times	longer	(rare	case),	the	creation	of	
subdirectories	in	HDFS	causing	failures,	and	it	is	safe	to	say	there	are	other	cases	we	
haven’t	run	into	yet.	It’s	essential	to	hammer	these	points	out	before	presenting	the	
option	to	other	teams.		
	
After	this	we’d	like	to	investigate	Presto,	which	is	an	open	source	distributed	SQL	
query	engine	for	running	interactive	analytic	queries	against	data	sources	of	all	sizes	
(Gb	to	Pb).	Presto	can	combine	data	from	multiple	sources,	allowing	for	analytics	
across	our	entire	team	or	organization,	so	there	is	a	lot	of	beneGit	from	a	service	like	
this.		
	Acknowledgements	
	
I	would	like	everyone	to	know	that	there	was	no	“I”	really	when	it	came	to	this	
project.	Everything	I	was	able	to	put	together	came	from	the	help,	patience,	and	
guidance	of	my	mentors,	Matthieu	Martin	and	Jonathan	Hsu,	from	members	of	the	
analytics	team,	including	but	not	limited	to:	Li,	Alexandra,	Jeff,	Dom,	Patrick,	Karen,	
Saurav,	Gautam	Jenny,	Sean,	and	my	manager,	Jack	Schonbrun.	I	would	also	like	to	
thank	the	people	at	Pandora	for	giving	me	this	opportunity.	I	can’t	name	them	all	but	
there	are	many	who	have	made	this	experience	the	most	worthwhile.	
	
https://www.infoq.com/articles/apache-tez-saha-murthy	
http://www.sjsu.edu/people/robert.chun/courses/CS259Fall2013/s3/F.pdf	
http://ieeexplore.ieee.org/xpl/login.jsp?tp=&arnumber=6567202

More Related Content

What's hot

Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Jonathan Seidman
 
The Forrester Wave - Big Data Hadoop
The Forrester Wave - Big Data HadoopThe Forrester Wave - Big Data Hadoop
The Forrester Wave - Big Data HadoopIBM Software India
 
Hortonworks.HadoopPatternsOfUse.201304
Hortonworks.HadoopPatternsOfUse.201304Hortonworks.HadoopPatternsOfUse.201304
Hortonworks.HadoopPatternsOfUse.201304James Kenney
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Rio Info
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceInformation Security Awareness Group
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digitalsambiswal
 
Christian Hansen case
Christian Hansen caseChristian Hansen case
Christian Hansen caseMicrosoft
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop siliconsudipt
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introductionsaisreealekhya
 
Jon cohn exton pa corporate data architecture
Jon cohn exton pa   corporate data architectureJon cohn exton pa   corporate data architecture
Jon cohn exton pa corporate data architectureJon Cohn
 
Getting more out of your big data
Getting more out of your big dataGetting more out of your big data
Getting more out of your big dataGeert Van Landeghem
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013boorad
 
Hadoop for Finance - sample chapter
Hadoop for Finance - sample chapterHadoop for Finance - sample chapter
Hadoop for Finance - sample chapterRajiv Tiwari
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattooMohamed Magdy
 
What is big data
What is big data What is big data
What is big data DeZyre
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective Viewijtsrd
 

What's hot (20)

Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
Architecting for Big Data - Gartner Innovation Peer Forum Sept 2011
 
The Forrester Wave - Big Data Hadoop
The Forrester Wave - Big Data HadoopThe Forrester Wave - Big Data Hadoop
The Forrester Wave - Big Data Hadoop
 
Hortonworks.HadoopPatternsOfUse.201304
Hortonworks.HadoopPatternsOfUse.201304Hortonworks.HadoopPatternsOfUse.201304
Hortonworks.HadoopPatternsOfUse.201304
 
Big data
Big dataBig data
Big data
 
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...Big data: Descoberta de conhecimento em ambientes de big data e computação na...
Big data: Descoberta de conhecimento em ambientes de big data e computação na...
 
Big data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security AllianceBig data analysis concepts and references by Cloud Security Alliance
Big data analysis concepts and references by Cloud Security Alliance
 
Enterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable DigitalEnterprise Data Lake - Scalable Digital
Enterprise Data Lake - Scalable Digital
 
Christian Hansen case
Christian Hansen caseChristian Hansen case
Christian Hansen case
 
Présentation on radoop
Présentation on radoop   Présentation on radoop
Présentation on radoop
 
A Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - IntroductionA Glimpse of Bigdata - Introduction
A Glimpse of Bigdata - Introduction
 
Jon cohn exton pa corporate data architecture
Jon cohn exton pa   corporate data architectureJon cohn exton pa   corporate data architecture
Jon cohn exton pa corporate data architecture
 
Getting more out of your big data
Getting more out of your big dataGetting more out of your big data
Getting more out of your big data
 
Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013Big Data Analysis Patterns - TriHUG 6/27/2013
Big Data Analysis Patterns - TriHUG 6/27/2013
 
Hadoop for Finance - sample chapter
Hadoop for Finance - sample chapterHadoop for Finance - sample chapter
Hadoop for Finance - sample chapter
 
Combining hadoop with big data analytics
Combining hadoop with big data analyticsCombining hadoop with big data analytics
Combining hadoop with big data analytics
 
Hadoop Business Cases
Hadoop Business CasesHadoop Business Cases
Hadoop Business Cases
 
The book of elephant tattoo
The book of elephant tattooThe book of elephant tattoo
The book of elephant tattoo
 
Hadoop
HadoopHadoop
Hadoop
 
What is big data
What is big data What is big data
What is big data
 
Memory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective ViewMemory Management in BigData: A Perpective View
Memory Management in BigData: A Perpective View
 

Viewers also liked

silent valley national park kerala india
silent valley national park kerala indiasilent valley national park kerala india
silent valley national park kerala indiaSanoop Krishna
 
Ground water presentation
Ground water presentationGround water presentation
Ground water presentationHamza Ali
 
Measurement & calibration of medical equipments
Measurement & calibration of medical equipmentsMeasurement & calibration of medical equipments
Measurement & calibration of medical equipmentsJumaan AlAmri
 
Ground water hydrology
Ground water hydrologyGround water hydrology
Ground water hydrologySandra4Smiley
 
water pollution.
water pollution.water pollution.
water pollution.acitron
 
presentation on water pollution
presentation on water pollutionpresentation on water pollution
presentation on water pollutionakshat2010
 
Water pollution ppt
Water pollution pptWater pollution ppt
Water pollution pptApril Cudo
 

Viewers also liked (11)

silent valley national park kerala india
silent valley national park kerala indiasilent valley national park kerala india
silent valley national park kerala india
 
Ground water presentation
Ground water presentationGround water presentation
Ground water presentation
 
Groundwater
GroundwaterGroundwater
Groundwater
 
Measurement & calibration of medical equipments
Measurement & calibration of medical equipmentsMeasurement & calibration of medical equipments
Measurement & calibration of medical equipments
 
Ground water hydrology
Ground water hydrologyGround water hydrology
Ground water hydrology
 
Ground Water Hydrology
Ground Water HydrologyGround Water Hydrology
Ground Water Hydrology
 
groundwater
groundwatergroundwater
groundwater
 
Water pollution power point
Water pollution power pointWater pollution power point
Water pollution power point
 
water pollution.
water pollution.water pollution.
water pollution.
 
presentation on water pollution
presentation on water pollutionpresentation on water pollution
presentation on water pollution
 
Water pollution ppt
Water pollution pptWater pollution ppt
Water pollution ppt
 

Similar to PandoraPPT2

Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceAssignment Help
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsCognizant
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelEditor IJCATR
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Edgar Alejandro Villegas
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitzRaghu Kashyap
 
Is Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceIs Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceEdureka!
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopIOSR Journals
 
Pervasive DataRush
Pervasive DataRushPervasive DataRush
Pervasive DataRushtempledf
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khanKamranKhan587
 

Similar to PandoraPPT2 (20)

Learn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant ResourceLearn About Big Data and Hadoop The Most Significant Resource
Learn About Big Data and Hadoop The Most Significant Resource
 
Rajesh Angadi Brochure
Rajesh Angadi Brochure Rajesh Angadi Brochure
Rajesh Angadi Brochure
 
Infrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical WorkloadsInfrastructure Considerations for Analytical Workloads
Infrastructure Considerations for Analytical Workloads
 
Big data
Big dataBig data
Big data
 
Big data
Big dataBig data
Big data
 
Unstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus ModelUnstructured Datasets Analysis: Thesaurus Model
Unstructured Datasets Analysis: Thesaurus Model
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869Big Data and Enterprise Data - Oracle -1663869
Big Data and Enterprise Data - Oracle -1663869
 
Gartner peer forum sept 2011 orbitz
Gartner peer forum sept 2011   orbitzGartner peer forum sept 2011   orbitz
Gartner peer forum sept 2011 orbitz
 
Is Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data ScienceIs Hadoop a Necessity for Data Science
Is Hadoop a Necessity for Data Science
 
Big Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – HadoopBig Data Analysis and Its Scheduling Policy – Hadoop
Big Data Analysis and Its Scheduling Policy – Hadoop
 
G017143640
G017143640G017143640
G017143640
 
Big Data & Hadoop
Big Data & HadoopBig Data & Hadoop
Big Data & Hadoop
 
Big data and hadoop
Big data and hadoopBig data and hadoop
Big data and hadoop
 
Hadoop Overview
Hadoop OverviewHadoop Overview
Hadoop Overview
 
finap ppt conference.pptx
finap ppt conference.pptxfinap ppt conference.pptx
finap ppt conference.pptx
 
View on big data technologies
View on big data technologiesView on big data technologies
View on big data technologies
 
Pervasive DataRush
Pervasive DataRushPervasive DataRush
Pervasive DataRush
 
Big Data
Big DataBig Data
Big Data
 
Hadoop by kamran khan
Hadoop by kamran khanHadoop by kamran khan
Hadoop by kamran khan
 

PandoraPPT2