1©	Cloudera,	Inc.	All	rights	reserved.
Enterprise	Metadata	Integration	
Mirko Kämpf |	Cloudera
GraphConnect 2017	– London
Who	is	speaking?
Solutions	Architect	@	Cloudera
-time	series	analysis,	network	analysis,	data	enrichment	pipelines
-personal	interest:	QA-Systems	and	semantic	search
Data	Science	Activities
The	Detection	of	Emerging	Trends	Using	Wikipedia	Traffic	Data
and	Context	Networks	(PLOS	ONE,	2015)
Hadoop.TS (IJCA,	2013)
Fluctuations	in	Wikipedia	Access-Rate	and	Edit-Event	Data.	
(Physica A,	2012).
Our	Approach: Multilayer	Metadata	Integration	…
• Status	dashboards	are	provided	per	Topic	/	Use-Case.
• Each	dashboard	offers	facts	from	multiple	layers:
- (L1)	Cluster	specific	metadata
- (L2)	Hadoop	specific	ops-metadata	(only)
- (L3)	Application	specific	ops-metadata
- (L4)	Quality	metrics	and	derived	facts
• Current Project Status:
- The graph database Neo4j and Cypher allow context exploration.
- Cluster-spanning metadata exploration is possible.
- Exposing inherent but sometimes hidden facts becomes as easy as writing an email.
Integration	of	facts	
to	gain	business	
knowledge
Agenda	
EMI	- Enterprise	Metadata	Integration
• Idea	&	Vision
• Material
• Skills	/	Methods
• Tools
How	To	Become	Data	Driven?
Treat	“data	as	a	resource“	for	your	business.
Think	in	terms	of	dataset	life	cycles.
People have been mining … for centuries!
http://www.montanregion-erzgebirge.de/welterbe-erleben/montanregion-fuer-bergbauspezialisten/geschichtliches.html
gold	&	diamonds,	
ore	&	coal,	
minerals,	
oil	…
The outcome drives the whole economy
People have used computers … for decades!
1938: Z1, the world's first freely programmable device, created by Konrad Zuse.
http://www.horst-zuse.homepage.t-online.de/z1.html
2015: The U.S. Department of Energy uses an Intel supercomputer at Argonne National Laboratory.
http://www.intel.com/content/dam/www/public/us/en/images/photography-business/RWD/aurora-aerial-reflection-floor-rwd.png
DATA
MINING
http://codecondo.com/9-free-books-for-learning-data-mining-data-analysis/
Blog: About Learning Data Mining & Data Analysis
If	data	is	the	new	oil	…
…	metadata	are	nuggets	
and	brilliants	of	our	age.
Screenshot	 taken	from:	
https://www.quora.com/Who-should-get-credit-for-the-quote-data-is-the-new-oil
Diamonds are beautiful even as raw material.
A brilliant is the result of an expert's work: you have to cut and grind it!
Even more exciting in combination with other material and skills …
Process	optimization
Requires	knowledge	
gathering	and	transfer.
Success Factors:
• Idea & Vision
• Material
• Skills / Methods
• Tools
http://www.burkhard-beyer.net/Reportage_Goldschmied.html
Tools	and	processes	evolve	…
...	success	criteria	have	been	stable.
Let’s	Think	Data	Driven!	
• Build a long-term strategy!
Not the fancy toolset but rather your data is what matters most!
• After initial success, you should carefully control the speed of expansion.
• Maximize	accessibility	of	data!
Example:	Google’s	goal	was	to	make	the	data	of	the	internet	accessible.	
You	should	become	your	own	Google!
Dataset	Profiles	/	Flow	Descriptors
• Our material is data & metadata:
- Data about data: descriptive data, Dublin Core metadata model, …
- Derived data: statistics extracted from processes, documents, …
- Results of ML/AI procedures: extracted structure and learned models
- Outcome of crowd-based operations: Wikipedia with its inherent structure, communication logs, access and edit history.
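A minimal sketch of what such a dataset profile could look like. The class `DatasetProfile`, its field names, and all sample values are hypothetical illustrations, not part of the presented toolbox; only the 15 element names of the Dublin Core Metadata Element Set are taken from the standard.

```python
from dataclasses import dataclass, field
from typing import Dict

# The 15 element names of the Dublin Core Metadata Element Set.
DUBLIN_CORE_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


@dataclass
class DatasetProfile:
    """Hypothetical container for the four kinds of metadata listed above."""
    descriptive: Dict[str, str] = field(default_factory=dict)  # data about data
    derived: Dict[str, float] = field(default_factory=dict)    # extracted statistics
    learned: Dict[str, str] = field(default_factory=dict)      # results of ML/AI procedures
    crowd: Dict[str, int] = field(default_factory=dict)        # outcome of crowd-based operations

    def describe(self, element: str, value: str) -> None:
        """Store a descriptive fact; accept only Dublin Core element names."""
        if element not in DUBLIN_CORE_ELEMENTS:
            raise ValueError(f"not a Dublin Core element: {element}")
        self.descriptive[element] = value


# Illustrative values only.
profile = DatasetProfile()
profile.describe("title", "Wikipedia page view counts")
profile.describe("format", "text/csv")
profile.derived["row_count"] = 1000.0
profile.crowd["edit_events"] = 42
```

Keeping the four kinds of metadata in separate fields makes it easy to tell which facts were entered by people and which were computed.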
Knowledge	Extraction	for	
Better	Data	Science
Science:
According	to	Wikipedia:
Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the universe.
https://en.wikipedia.org/wiki/Science
Data	Science:
My	observation:
Data Science is a systematic enterprise that builds and organizes knowledge in the form of testable explanations and predictions about the market and business context.
https://en.wikipedia.org/wiki/Infographic#/media/File:Gartner_Hype_Cycle_for_Emerging_Technologies.gif
Details
Look	into	nature	….
Context
Look	into	nature	….
Result:	Visualization	of	Facts
• An	image	shows	what	the	text	says.	
>	Multi-channel	communication
• Data	Science	benefits	from	such	an	approach.
>	Today	we	still	use	infographics
Difference:
The biologist who created the image on the left observed by eye.
Today, data scientists look more into data than into nature.
Process:	Knowledge	Extraction	is	a	Natural	Process	
• Combine	multiple	sources	
• Repeat	observation
• Incorporate context to explain differences/variation
• Cross-checks to identify anomalies
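The steps above can be sketched in a few lines. This is an illustration under assumptions, not the method used in the project: the function `cross_check`, the source names, and the numbers are hypothetical, and the median is just one simple way to compare sources. The context step is left out; in practice, context would be consulted before a deviating source is called an anomaly.

```python
from statistics import median


def cross_check(observations, tolerance):
    """Combine repeated observations from several sources and flag outliers.

    observations: dict mapping a source name to a list of repeated measurements.
    Returns (consensus, anomalies), where anomalies lists the sources whose
    median deviates from the consensus by more than `tolerance`.
    """
    # Repeat observation: condense each source's repeated measurements.
    per_source = {source: median(values) for source, values in observations.items()}
    # Combine multiple sources: the consensus is the median across sources.
    consensus = median(per_source.values())
    # Cross-check: a source far away from the consensus is a candidate anomaly.
    anomalies = sorted(
        source for source, value in per_source.items()
        if abs(value - consensus) > tolerance
    )
    return consensus, anomalies


# Hypothetical measurements of the same quantity taken on three clusters.
observations = {
    "cluster_a": [100, 102, 98],
    "cluster_b": [101, 99, 100],
    "cluster_c": [150, 149, 151],
}
consensus, anomalies = cross_check(observations, tolerance=10)
```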
Process:	Knowledge	Extraction	is	a	Natural	Process	
Knowledge
Facts	
Data
How	did	we	implement	EMDM?
- Hadoop	Based:	for	scalability.
- Open	Graph	Data	Model:	for	flexibility	and	connectivity
- Data	Centric:	following	the	Big	Data	paradigm
Big	Data	Processing:
e.g.,	with	Hadoop
Big	Graph	Processing	on	Hadoop:
e.g.,	with	Giraph
Project	Name	should	stand	for:	
Graphs,	Hadoop,	and	the	ecosystem	…
Data	Science	Process	Model	(DSPM)
• DSPM	defines	core	artifacts	for	knowledge	management
• Describes	analysis	/	transformation	context	
• Allows	repeatable	execution
• Process	properties	become	measurable
• Supports	comparison	of	results	from	multiple	procedures
• All	those	facts	are	essential	ingredients	to	business	optimization.
• But:	Logging	&	tracking should	never	block	creativity!	
• Remember:	Scientists	often	act	like	artists.	
Toolbox	and	
Management	Methods
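A minimal sketch of how the DSPM properties listed on this slide (analysis context, repeatable execution, measurable process properties, comparison of results) could be captured. All names here (`RunRecord`, `tracked_run`, `fingerprint`, the two sample procedures) are hypothetical and do not describe the actual DSPM artifacts.

```python
import hashlib
import json
import time
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Hypothetical record describing one execution of an analysis procedure."""
    procedure: str            # which procedure was executed
    parameters: dict          # analysis / transformation context
    input_fingerprint: str    # identifies the input, so the run can be repeated
    duration_seconds: float   # a measurable process property
    result: object            # outcome, kept for later comparison


def fingerprint(data) -> str:
    """Stable hash of JSON-serializable input data."""
    payload = json.dumps(data, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def tracked_run(procedure, data, **parameters) -> RunRecord:
    """Run a procedure and record its context without changing its result."""
    start = time.perf_counter()
    result = procedure(data, **parameters)
    duration = time.perf_counter() - start
    return RunRecord(procedure.__name__, parameters, fingerprint(data), duration, result)


def mean(values, digits=2):
    return round(sum(values) / len(values), digits)


def midrange(values, digits=2):
    return round((min(values) + max(values)) / 2, digits)


data = [1.0, 2.0, 6.0]
runs = [tracked_run(mean, data), tracked_run(midrange, data, digits=1)]
# Results are comparable only if both runs saw the same input.
comparable = runs[0].input_fingerprint == runs[1].input_fingerprint
```

The procedures themselves stay untouched; tracking wraps them from the outside, in line with the point that logging should not block creativity.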
Data	Science	Process	Model	(DSPM)
Representation of domain knowledge (in our case it is data science in general)
Human Interaction
Ontology
Toolbox and Management Methods
Ability to solve a problem using IT and data
Technology Aspects
- represent and interact with facts & data
Data Governance
Certified QM
Semantic	Logging
• Property with name: (K,V) : key-value pair
• Property of a thing: S => (K,V) : (S,P,O) is a triple
K becomes P; V becomes O
• Many of those triples in one common context with name G:
G => (S,P,O) is called a quad or named graph
We have to hide these technical details from users!
Obvious facts have to be connected to the knowledge graph as directly as possible.
• Log4j is the logging standard we build on.
• Using structured data instead of plain strings allows easy parsing (e.g., Apache log format).
• Triple representation avoids specific parsing and makes log data part of the linked data graph.
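The step from key-value pairs to triples and quads can be sketched as follows. This is a plain Python illustration, not the Log4j-based implementation mentioned above; `Quad`, `log_fact`, `facts_about`, and the graph and job names are hypothetical.

```python
from typing import List, NamedTuple


class Quad(NamedTuple):
    """A triple (S, P, O) placed in a named context G."""
    graph: str      # G: the common context
    subject: str    # S: the thing the property belongs to
    predicate: str  # P: the former key K
    obj: str        # O: the former value V


def log_fact(store: List[Quad], graph: str, subject: str, **properties) -> None:
    """Callers pass plain key-value pairs; the quad structure stays hidden."""
    for key, value in properties.items():
        store.append(Quad(graph, subject, key, str(value)))


def facts_about(store: List[Quad], subject: str) -> dict:
    """Collect all facts about one subject, across all named graphs."""
    return {q.predicate: q.obj for q in store if q.subject == subject}


store: List[Quad] = []
log_fact(store, "cluster-1/ops", "job-17", state="FINISHED", runtime_s=42)
log_fact(store, "cluster-1/quality", "job-17", rows_rejected=0)
job_facts = facts_about(store, "job-17")
```

Because every log entry is already structured, no format-specific parser is needed to query it later.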
Etosha Toolbox
CONCEPT
Data extractors,
Data transformers,
Ontology-based orchestration,
People and machines contribute facts,
Iterative approach with closed feedback loops,
Scalable environment …
Multi-layer	metadata	capturing
Operational	metrics
Metrics	about	fast	&	static	data
Business	metrics
Contextualized	presentation
Ad-hoc	queries	for	exploration
Graph-analytics
>	Knowledge	exposure
> Self-Service DS and BI can speak the same language.
INITIAL IMPLEMENTATION
Results:	Better	Collaboration	for	
(Hadoop)	Knowledge	Workers
• Our Achievements:
- The open graph model is language-, OS-, and hardware-independent.
- Merging of knowledge partitions enables cluster-spanning metadata exploration.
- Query beans expose facts from multiple stores to web-based interfaces.
• Next Steps:
- Improve implicit triplification (query the Solr index and get RDF data).
- Standardize the process and integrate with existing ontologies.
- Grow a community … and enter the Apache Incubator.
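A minimal sketch of why merging knowledge partitions enables cluster-spanning exploration, assuming each partition is a set of (graph, subject, predicate, object) tuples. The functions `merge_partitions` and `subjects_with` and all sample facts are hypothetical; the real stores and query beans are not shown in the slides.

```python
def merge_partitions(*partitions):
    """Union of several knowledge partitions; identical facts collapse into one."""
    merged = set()
    for partition in partitions:
        merged |= set(partition)
    return merged


def subjects_with(store, predicate, obj):
    """Return sorted (graph, subject) pairs that carry the given property."""
    return sorted({(g, s) for g, s, p, o in store if p == predicate and o == obj})


# Hypothetical facts collected on two separate clusters.
cluster_a = {
    ("cluster-a", "dataset-1", "owner", "team-x"),
    ("cluster-a", "dataset-1", "format", "parquet"),
}
cluster_b = {
    ("cluster-b", "dataset-9", "owner", "team-x"),
}

merged = merge_partitions(cluster_a, cluster_b)
# One query now spans both clusters.
hits = subjects_with(merged, "owner", "team-x")
```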
Results:	Access	Facts & Context of	Critical	Processes
DEMO:	https://www.youtube.com/watch?v=ZE7Gcanv90s&feature=youtu.be
Thank	you!
Many thanks to the Cloudera team that supported this work.
