Entity-Relationship Extraction from Wikipedia Unstructured Text
Radityo Eko Prasojo (Rido)
PhD Student @ KRDB, Free University of Bozen-Bolzano
Supervised by: Mouna Kacimi & Werner Nutt
20.07.16, Bilbao, Spain
Automatically generated | Manually curated
Automated extraction without (yet) a KB as a result
Knowledge Vault [1]
Knowledge	Graph
NELL	[2]
20/07/16 RE Prasojo | KRDB @ UNIBZ | WebST'16, Bilbao
Infobox completion	[3]	[4]
Where	was	Obama	born?
Who are the children of Obama?
Where was Obama born?
Who are the children of Obama?
Yes	we	can!
Honolulu,	 Hawaii
Malia	and	Sasha	Obama
Which are Obama’s favourite sports teams?
Does	Obama	have	pets?
Our goal is to enrich existing Knowledge Bases by extracting new facts, in the form of machine-readable entity relationships, from unstructured Wikipedia text.
Specific focus: RDF
Why	is	it	difficult?
• The extraction problem
  • Entity extraction & disambiguation
  • Relation extraction
• The representation problem
  • Lack of predefined schema/ontology
  • Topic independence
  • Complex fact representation
Why	is	it	difficult?	Example
• “Obama	is	a	supporter	of	the Chicago	White	Sox”
• Straightforward,	singleton	information
• Pure	syntactic	extraction	possible
• Barack_Obama supporterOf Chicago_White_Sox
• “He is also primarily a Chicago Bears football fan in the NFL, but in his childhood and adolescence was a fan of the Pittsburgh Steelers”
• Complex,	multiple	information
• Semantic	understanding	necessary
• …	how	do	we	represent	this?
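For the first sentence above, "pure syntactic extraction" could look roughly like the sketch below. The surface pattern and the `ENTITY_IDS` lookup are hypothetical stand-ins for illustration, not the rules used in this work; the lookup merely imitates entity disambiguation.

```python
import re

# Hypothetical surface pattern: "<subject> is a supporter of the <object>".
PATTERN = re.compile(
    r"^(?P<subj>.+?) is a supporter of (?:the )?(?P<obj>.+?)\.?$"
)

# Hypothetical lookup standing in for entity disambiguation.
ENTITY_IDS = {
    "Obama": "Barack_Obama",
    "Chicago White Sox": "Chicago_White_Sox",
}


def extract_supporter_of(sentence):
    """Return a (subject, predicate, object) triple, or None if no match."""
    match = PATTERN.match(sentence.strip())
    if match is None:
        return None
    subj = ENTITY_IDS.get(match.group("subj"))
    obj = ENTITY_IDS.get(match.group("obj"))
    if subj is None or obj is None:
        return None
    return (subj, "supporterOf", obj)


simple = extract_supporter_of("Obama is a supporter of the Chicago White Sox")
complex_case = extract_supporter_of(
    "He is also primarily a Chicago Bears football fan in the NFL, but in "
    "his childhood and adolescence was a fan of the Pittsburgh Steelers"
)
print(simple)
print(complex_case)
```

The second sentence does not fit the pattern at all, which is the point of the slide: a purely syntactic rule has nothing to say about it.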
Example: representing a complex fact
• “He is also primarily a Chicago Bears football fan in the NFL, but in his childhood and adolescence was a fan of the Pittsburgh Steelers”
• Barack_Obama footballFan Chicago_Bears in NFL
• supporterOf vs footballFan
• Is it necessary to include NFL in the relation?
• What about the adverb primarily? What information does it imply?
• Barack_Obama fanOf Pittsburgh_Steelers
• fanOf vs supporterOf
• Missing the time information referred to in “in his childhood and adolescence was”
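One possible way to keep the extra information (league, "primarily", time period) is to attach qualifiers to a statement node, in the style of RDF reification. The sketch below is only an assumption about how this could be done, not the representation chosen in this work: the `QualifiedFact` class, the qualifier names (`league`, `degree`, `validDuring`) and the choice of `fanOf` as the single predicate are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class QualifiedFact:
    """A binary relation plus optional qualifiers (hypothetical structure)."""
    subject: str
    predicate: str
    obj: str
    qualifiers: dict = field(default_factory=dict)

    def to_triples(self, statement_id):
        """Flatten into plain triples about a statement node."""
        triples = [
            (statement_id, "rdf:type", "rdf:Statement"),
            (statement_id, "rdf:subject", self.subject),
            (statement_id, "rdf:predicate", self.predicate),
            (statement_id, "rdf:object", self.obj),
        ]
        # Each qualifier becomes one more triple on the statement node.
        for name, value in self.qualifiers.items():
            triples.append((statement_id, name, value))
        return triples


facts = [
    QualifiedFact("Barack_Obama", "fanOf", "Chicago_Bears",
                  {"league": "NFL", "degree": "primarily"}),
    QualifiedFact("Barack_Obama", "fanOf", "Pittsburgh_Steelers",
                  {"validDuring": "childhood and adolescence"}),
]

all_triples = []
for index, fact in enumerate(facts, start=1):
    all_triples.extend(fact.to_triples(f"_:stmt{index}"))

for triple in all_triples:
    print(" ".join(triple))
```

This leaves the open questions on the slide untouched (which predicate name to use, whether NFL belongs in the relation); it only shows that the time information does not have to be dropped.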
Approach
• Document preprocessing to annotate all entity occurrences.
• Grammatical dependencies to extract (candidate) relations.
• Separation between the extraction problem and the representation problem
• We first extract all candidate relations and then later apply semantic refinement for better representation.
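The two-stage idea above can be sketched on a single sentence. Everything here is an illustrative assumption: the dependency edges are written by hand (labels loosely follow Universal Dependencies) instead of coming from a parser, the rule in `extract_candidates` is a made-up example and not one of the rules used in this work, and `refine` is a trivial placeholder for semantic refinement.

```python
# Hand-written dependency edges (head, label, dependent) for
# "Obama is a supporter of the Chicago White Sox".
# Assumption: entity mentions were already annotated as single tokens
# during preprocessing; a real parser would produce the edges.
EDGES = [
    ("supporter", "nsubj", "Obama"),
    ("supporter", "cop", "is"),
    ("supporter", "det", "a"),
    ("supporter", "nmod", "Chicago White Sox"),
    ("Chicago White Sox", "case", "of"),
    ("Chicago White Sox", "det", "the"),
]

# Hypothetical mapping from annotated mentions to KB identifiers.
ENTITY_IDS = {
    "Obama": "Barack_Obama",
    "Chicago White Sox": "Chicago_White_Sox",
}


def dependents(edges, head, label):
    """All dependents of `head` attached with the given label."""
    return [dep for h, lab, dep in edges if h == head and lab == label]


def extract_candidates(edges):
    """Stage 1: emit (subject, surface relation, object) candidates."""
    candidates = []
    for head in sorted({h for h, _, _ in edges}):
        for subj in dependents(edges, head, "nsubj"):
            for obj in dependents(edges, head, "nmod"):
                for prep in dependents(edges, obj, "case"):
                    candidates.append((subj, f"{head} {prep}", obj))
    return candidates


def refine(candidate, entity_ids):
    """Stage 2 (placeholder): map surface strings to KB-style names."""
    subj, relation, obj = candidate
    first, *rest = relation.split()
    predicate = first + "".join(word.capitalize() for word in rest)
    return (entity_ids[subj], predicate, entity_ids[obj])


candidates = extract_candidates(EDGES)
refined = [refine(candidate, ENTITY_IDS) for candidate in candidates]
print(candidates)
print(refined)
```

Keeping the stages apart means the candidate list can be produced once and the representation decisions (predicate naming, qualifiers) revisited later without re-running extraction.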
Preliminary	results
• Ground truth manually curated from 25 Wikipedia articles of famous people.
• Preprocessing	
• 4	handcrafted	extraction	rules	leveraging	grammatical	dependency
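The slides do not say which measure is used against the ground truth. Assuming the usual precision and recall over sets of triples, a comparison could be sketched as follows; the two sets are toy data for illustration only and are not the reported results.

```python
def precision_recall(extracted, ground_truth):
    """Compare two collections of triples; returns (precision, recall)."""
    extracted = set(extracted)
    ground_truth = set(ground_truth)
    correct = extracted & ground_truth
    precision = len(correct) / len(extracted) if extracted else 0.0
    recall = len(correct) / len(ground_truth) if ground_truth else 0.0
    return precision, recall


# Toy sets for illustration only; these are not results from the study.
ground_truth = {
    ("Barack_Obama", "supporterOf", "Chicago_White_Sox"),
    ("Barack_Obama", "fanOf", "Chicago_Bears"),
    ("Barack_Obama", "fanOf", "Pittsburgh_Steelers"),
}
extracted = {
    ("Barack_Obama", "supporterOf", "Chicago_White_Sox"),
    ("Barack_Obama", "fanOf", "Pittsburgh_Steelers"),
}

precision, recall = precision_recall(extracted, ground_truth)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Exact set matching is the strictest choice; it counts `fanOf` and `supporterOf` as different relations, so the naming questions raised earlier directly affect the score.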
Ongoing	work
• Automated	rule	mining
• Semantic	refinement	for	knowledge	representation
• Ontology	building
• Naming	and	taxonomy	of	entities,	classes,	and	relations
• Handling complex facts
• Obama	appoints	x	as	y	in	z
• Handling	modality,	adjectives,	and	sentiment
• “In	the	past”,	“it	is	rumoured that”,	“it	is	not	true	that”
• Future	evaluation
• Bigger	ground	truth	(amount	+	topic	coverage)
• Evaluate	how	well	we	enrich	existing	KBs
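For the modality bullet above, a first step could be to detect such markers and separate them from the core sentence before relation extraction. The sketch below is only an assumption about how that might start: the marker list comes from the slide, but the category labels (`temporal`, `hearsay`, `negation`) and the function name are hypothetical.

```python
# Markers taken from the slide; the category labels are hypothetical.
MARKERS = {
    "in the past": "temporal",
    "it is rumoured that": "hearsay",
    "it is not true that": "negation",
}


def split_modality(sentence):
    """Return (core_sentence, labels) with known markers stripped out."""
    core = sentence
    labels = []
    for marker, label in MARKERS.items():
        index = core.lower().find(marker)
        if index != -1:
            labels.append(label)
            core = core[:index] + core[index + len(marker):]
    # Normalise whitespace and stray commas left behind by the removal.
    core = " ".join(core.split()).strip(" ,")
    return core, labels


core, labels = split_modality(
    "It is rumoured that Obama is a supporter of the Chicago White Sox"
)
print(core)
print(labels)
```

The labels could then travel with the extracted relation as qualifiers instead of being lost, which is where this connects to the representation problem.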
Future	work
• Metadata	extraction
• Data	quality,	data	completeness
• Natural	language	question	answering	based	on	the	enriched	KB.