Learning with the Web
Structuring data to ease machine understanding
http://twitter.com/giusepperizzo
July 11th, 2013 Università di Torino, Italy 2/44

Talk given at the Università di Torino, Turin, Italy, July 11, 2013

  1. Learning with the Web. Structuring data to ease machine understanding. http://twitter.com/giusepperizzo
  2. Google Knowledge Graph Viewer
  3. Google Knowledge Graph
  4. The bulk of the Google Knowledge Graph: encyclopedic sources
  5. The Web community has highlighted the road, but ...
  6. Vast wealth of unstructured data
      “80% of data on the Web and on internal corporate intranets is unstructured”
      (“Semantic Web and Information Extraction Workshop”, SWAIE at RANLP 2013)
  7. The entire digital universe, going to be part of the Web
      “unstructured data will account for 90 percent of all data created in the next decade”
      (IDC IVIEW, “Extracting Value from Chaos”, June 2011)
  8. Structured means making those resources available to be easily processed by machines
  9. A Web of Linked Entities
      http://wole2013.eurecom.fr
      http://wole2012.eurecom.fr
      ➢ GGG (Giant Global Graph) http://goo.gl/fH3h
      ➢ Nodes are Web entities
      ➢ Entities provide disambiguation pointers
      ➢ Entities can be referred to unambiguously (disambiguated)
      ➢ Entities as centroids for topic generation and understanding
  10. Chapter 1: Named Entity Recognition (NER) and Named Entity Linking (NEL)
  11. “I want to book a room in an hotel located in the heart of Paris, just a stone’s throw from the Eiffel Tower”
      (Eric Charton, “Named Entity Detection and Entity Linking in the Context of Semantic Web: Exploring the ambiguity question”)
  12. Part of Speech
      I    want  to  book  a   room  in  ..  Paris
      PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
      NER: What is Paris?
      NEL: Which Paris are we talking about?
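The part-of-speech row above can be read as a hint for where entity mentions may sit: proper nouns (NNP) are natural candidates. The following is a minimal sketch of that idea, using only the tokens and Penn Treebank tags shown on the slide; the helper name `proper_noun_candidates` is an illustrative assumption, and a real pipeline would obtain the tags from a tagger rather than hard-code them.

```python
# Tokens and Penn Treebank tags copied from the slide (the ".." gap is omitted).
tagged = [
    ("I", "PRP"), ("want", "VBP"), ("to", "TO"), ("book", "VB"),
    ("a", "DT"), ("room", "NN"), ("in", "IN"), ("Paris", "NNP"),
]


def proper_noun_candidates(tagged_tokens):
    """Return the tokens tagged as proper nouns (NNP/NNPS).

    Hypothetical helper: proper nouns are only *candidate* mentions;
    deciding their type (NER) and identity (NEL) is a separate step.
    """
    return [token for token, tag in tagged_tokens if tag in ("NNP", "NNPS")]


candidates = proper_noun_candidates(tagged)
print(candidates)
```

Here the only candidate is "Paris", which is exactly the token the NER and NEL questions on the slide are about.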
  13. What is Paris? Type ambiguity: asteroid, location/city, film
  14. Entity recognition
      I    want  to  book  a   room  in  ..  Paris
      PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
      O    O     O   O     O   O     O   ..  LOC
  15. NER: State of the art
      ➢ CRFs (Conditional Random Fields)
      ➢ FSM (Finite-State Machine)
      ➢ HMM (Hidden Markov Model)
      ➢ Gazetteers
          ➢ Wikipedia/DBpedia
          ➢ In-house dictionaries
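Of the techniques listed, the gazetteer is the simplest to illustrate. The sketch below labels tokens with O or an entity type by longest-match lookup in a dictionary, reproducing the O ... LOC row of the entity recognition slide. The gazetteer entries and the function name `gazetteer_tag` are toy assumptions, not taken from any of the systems named in the talk; CRFs and HMMs would instead learn the labelling from annotated data.

```python
# Toy gazetteer: lower-cased surface form -> entity type (illustrative entries only).
GAZETTEER = {"paris": "LOC", "eiffel tower": "LOC"}


def gazetteer_tag(tokens, gazetteer=GAZETTEER, max_len=3):
    """Label each token with an entity type from the gazetteer, or "O".

    At each position the longest phrase (up to max_len tokens) found in
    the gazetteer wins, so "Eiffel Tower" is matched as one mention.
    """
    labels = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = False
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n]).lower()
            if phrase in gazetteer:
                for j in range(i, i + n):
                    labels[j] = gazetteer[phrase]
                i += n
                matched = True
                break
        if not matched:
            i += 1
    return labels


tokens = ["I", "want", "to", "book", "a", "room", "in", "Paris"]
labels = gazetteer_tag(tokens)
print(list(zip(tokens, labels)))
```

A lookup like this cannot resolve the type ambiguity raised earlier (asteroid, city, film): it returns whatever single type the dictionary stores, which is one reason it is usually combined with statistical models.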
  16. Which Paris? Name ambiguity: Paris, Kentucky; Paris, Maine; Paris, Tennessee; Paris, France; Paris, Ontario; Paris, Idaho
  17. Entity linking
      I    want  to  book  a   room  in  ..  Paris
      PRP  VBP   TO  VB    DT  NN    IN  ..  NNP
      O    O     O   O     O   O     O   ..  LOC
      O    O     O   O     O   O     O   ..  http://en.wikipedia.org/wiki/Paris
  18. Ambiguity resolution: linking to an external knowledge base
      ➢ Wikipedia/DBpedia
      ➢ Gigaword Corpus
      ➢ In-house dataset
      ➢ LOD dataset
          ➢ DBLP
          ➢ ACM
          ➢ BBC
          ➢ ...
  19. NEL: State of the art
      ➢ Clustering
      ➢ Vector Space Model (cosine similarity or Maximum Entropy): it requires a priori knowledge of the spotted entities
      ➢ Conditional probability: it requires a priori knowledge of the spotted entities
      ➢ Dictionaries
          ➢ Wikipedia/DBpedia
          ➢ In-house dataset
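The Vector Space Model entry can be made concrete with a small sketch: the context of the mention and a textual description of each candidate are turned into bag-of-words vectors, and the candidate with the highest cosine similarity is chosen. The candidate descriptions below are invented for illustration (a real system would draw them from a knowledge base such as Wikipedia/DBpedia), and the function names are assumptions.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lower-case, whitespace-tokenised term counts."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical candidate descriptions, written by hand for this example.
CANDIDATES = {
    "http://en.wikipedia.org/wiki/Paris":
        "capital city of france home of the eiffel tower",
    "http://en.wikipedia.org/wiki/Paris,_Kentucky":
        "city in bourbon county kentucky united states",
}


def link(context, candidates=CANDIDATES):
    """Return the candidate URI whose description is closest to the context."""
    ctx = bag_of_words(context)
    return max(candidates,
               key=lambda uri: cosine(ctx, bag_of_words(candidates[uri])))


context = "book a room in a hotel in the heart of paris near the eiffel tower"
best = link(context)
print(best)
```

This also shows why the slide notes that the approach needs a priori knowledge of the spotted entities: without a description for each candidate there is nothing to compare the context against.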
  20. Processing natural language texts
      ➢ Several attempts from the Web community to structure the large wealth of data available
      ➢ Numerous off-the-shelf systems (commercial and academic) that perform the NER+NEL chain
          ➢ AlchemyAPI
          ➢ DBpedia Spotlight
          ➢ Wikimeta
          ➢ TextRazor
          ➢ Stanford CRF
          ➢ ...
  21. The NERD initiative http://nerd.eurecom.fr
  22. Combination of off-the-shelf systems and properly trained CRFs
  23. The strength of this approach lies in the fact that the supported off-the-shelf systems have access to large knowledge bases of entities such as DBpedia and Freebase, while CRFs are domain-specific
  24. Diversity
      Extractor          | Classification schema          | Number of classes
      AlchemyAPI         | Alchemy                        | 324
      DBpedia Spotlight  | DBpedia, Freebase, Schema.org  | 320
      Extractiv          | Extractiv                      | 34
      Lupedia            | DBpedia, LinkedMDB             | 319
      OpenCalais         | OpenCalais                     | 95
      Saplo              | Saplo                          | 5
      SemiTags           | CoNLL-3                        | 4
      Wikimeta           | ESTER                          | 7
      Yahoo!             | Yahoo                          | 13
      Zemanta            | Freebase                       | 81
  25. NERD Ontology
      NERD type     | Occurrence
      Person        | 10
      Organization  | 10
      Country       | 6
      Company       | 6
      Location      | 6
      Continent     | 5
      City          | 5
      RadioStation  | 5
      Album         | 5
      Product       | 5
      ...           | ...
      The NERD ontology has been integrated into the NIF project, part of the EU FP7 project LOD2: Creating Knowledge out of Interlinked Data
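Because every extractor uses its own classification schema, a shared ontology only helps once each native type is aligned to a common type. The sketch below shows what such an alignment lookup could look like. The alignment entries, the extractor keys, and the fallback type "Thing" are assumptions made for illustration; they are not the actual NERD ontology mappings.

```python
# Hypothetical alignment table: (extractor, native type) -> common NERD-style type.
ALIGNMENT = {
    ("alchemyapi", "City"): "City",
    ("dbpedia spotlight", "dbpedia:Place"): "Location",
    ("opencalais", "Company"): "Company",
    ("wikimeta", "LOC"): "Location",
}


def to_common_type(extractor, native_type, default="Thing"):
    """Map an extractor-specific type to the shared type.

    Unknown pairs fall back to a generic type (assumed here to be "Thing").
    """
    return ALIGNMENT.get((extractor.lower(), native_type), default)


print(to_common_type("Wikimeta", "LOC"))
print(to_common_type("Saplo", "UnknownType"))
```

Once all outputs are expressed with the same types, annotations from different extractors for the same mention become directly comparable.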
  26. Learning with the Web
      ➢ FSM-core based
          ➢ combination of the NERD supported off-the-shelf systems
      ➢ ML-core based
          ➢ combination of the NERD supported off-the-shelf systems and a CRF, properly trained with the given corpus
  27. Challenges and benchmark
  28. ETAPE 2012 - Entity Extraction Challenge
      ➢ French transcripts of radio and video programs
      ➢ Challenge objective: entity typing
      ➢ Submitted system:
          ➢ FSM-core based
          ➢ Annotation priority given to the systems with fine-grained classification schemes
      ➢ Ranked 7th/7
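The priority rule mentioned for the submitted system can be sketched as follows: when several extractors type the same mention, the answer of the extractor with the most fine-grained schema is kept. The class counts are those reported on the "Diversity" slide; using them directly as the priority order, and the function name `combine`, are assumptions of this sketch rather than a description of the actual NERD finite-state machine.

```python
# Number of classes per extractor, as reported on the "Diversity" slide.
NUM_CLASSES = {
    "AlchemyAPI": 324, "DBpedia Spotlight": 320, "Extractiv": 34,
    "Lupedia": 319, "OpenCalais": 95, "Saplo": 5, "SemiTags": 4,
    "Wikimeta": 7, "Yahoo!": 13, "Zemanta": 81,
}


def combine(annotations):
    """Pick one type for a mention from several extractor outputs.

    annotations maps extractor name -> proposed type, or None when the
    extractor did not annotate the mention. Extractors are visited from
    the most to the least fine-grained schema (assumed priority rule).
    Returns (type, extractor), or (None, None) if nobody answered.
    """
    ranked = sorted(annotations,
                    key=lambda name: NUM_CLASSES.get(name, 0),
                    reverse=True)
    for name in ranked:
        if annotations[name] is not None:
            return annotations[name], name
    return None, None


result = combine({"Saplo": "Location", "AlchemyAPI": None, "Lupedia": "City"})
print(result)
```

In this example AlchemyAPI has the highest priority but gave no answer, so the type proposed by Lupedia is preferred over the coarser one from Saplo.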
  29. #MSM'13 - Concept Extraction Challenge
      ➢ English Twitter microposts
      ➢ Challenge objective: entity typing
      ➢ Submitted system:
          ➢ ML-core based: SVM
          ➢ Features = linguistic features (some of them are capitalization, 3 chars of prefix and suffix, POS), output of a CRF properly trained with the challenge training dataset, outputs of the off-the-shelf systems
      ➢ Ranked 2nd/22
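The feature set listed above can be pictured as one feature dictionary per token. The sketch below builds such a dictionary from the features named on the slide (capitalization, 3-character prefix and suffix, POS tag, CRF output, outputs of the off-the-shelf systems). The function name, the feature keys, and the example labels are assumptions; the dictionaries would still need to be vectorized before being passed to an SVM.

```python
def token_features(token, pos_tag, crf_label, extractor_labels):
    """Build a feature dictionary for one token.

    extractor_labels maps an extractor name to the type it assigned to
    the token ("O" when it assigned none). All keys are illustrative.
    """
    features = {
        "is_capitalized": token[:1].isupper(),
        "prefix3": token[:3],
        "suffix3": token[-3:],
        "pos": pos_tag,
        "crf": crf_label,
    }
    for name, label in extractor_labels.items():
        features["extractor=" + name] = label
    return features


feats = token_features("Paris", "NNP", "LOC",
                       {"AlchemyAPI": "City", "Wikimeta": "LOC"})
print(feats)
```

Treating the CRF and extractor outputs as features lets the classifier learn how much to trust each source, instead of fixing a priority order by hand.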
  30. CoNLL-2003
      ➢ English newswire corpus
      ➢ Benchmark objective: entity typing
      ➢ System:
          ➢ ML-core based: SVM and NB
          ➢ Features = linguistic features (some of them are capitalization, 3 chars of prefix, 3 chars of suffix, POS), output of a CRF properly trained with the challenge training dataset, output of the off-the-shelf systems
      ➢ Results: significantly outperformed all the off-the-shelf systems used as inputs, as well as the Stanford CRF properly trained with the CoNLL-2003 training corpus
  31. TAC KBP 2011
      ➢ English newswire corpus
      ➢ Benchmark objective: entity linking
      ➢ System:
          ➢ FSM-core based
          ➢ Features: outputs of the off-the-shelf systems, harmonized with the Gigaword corpus (ongoing)
  32. NERD in action http://nerd.eurecom.fr/annotation/247957
  33. Chapter 2: Annotating streams of heterogeneous data coming from social platforms for topic generation
  34. The Social Web is growing fast and is becoming crucially important for research and companies
  35. Social Web = Big Data. Gartner “3V” definition: Volume, Velocity, Variety of microposts
  36. Microposts
      ➢ Short (~140 characters) and informal text
      ➢ Grammar-free text
      ➢ Slang
      ➢ Media items
          ➢ Picture
          ➢ Video
  37. Can we make sense out of the massive and rapidly changing amount of information shared in the Social Web?
  38. Live topic generation http://youtu.be/8iRiwz7cDYY
  39. http://mediafinder.eurecom.fr
  40. Tracking and analyzing an event
      ➢ 1 week period
      ➢ We collected microposts with attached pictures
      ➢ We followed the 2013 Italian Election
      ➢ We compared the results with the articles published in major newspapers during those days
      http://youtu.be/jIMdnwMoWnk
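One simple way to picture entities acting as centroids for topic generation is to group microposts by the entities they were annotated with, and to keep the entities shared by enough posts as topics. The sketch below does only that. It is not the MediaFinder algorithm; the function name, the `min_posts` threshold, the `ex:` identifiers, and the sample posts are all invented for illustration.

```python
from collections import defaultdict


def topics_by_entity(microposts, min_posts=2):
    """Group microposts by annotated entity.

    microposts is a list of (text, [entity identifiers]) pairs. An entity
    becomes a topic when at least min_posts posts mention it. Topics are
    returned as a dict ordered from most to least mentioned.
    """
    clusters = defaultdict(list)
    for text, entities in microposts:
        for entity in set(entities):  # count each entity once per post
            clusters[entity].append(text)
    topics = {entity: posts for entity, posts in clusters.items()
              if len(posts) >= min_posts}
    return dict(sorted(topics.items(),
                       key=lambda item: len(item[1]),
                       reverse=True))


# Invented sample data; "ex:" identifiers stand in for entity URIs.
posts = [
    ("Polls are open #elezioni2013", ["ex:Italian_election_2013"]),
    ("Queue at the polling station in Turin",
     ["ex:Italian_election_2013", "ex:Turin"]),
    ("Sunny day in Turin", ["ex:Turin"]),
    ("Coffee time", []),
]
topics = topics_by_entity(posts)
print(topics)
```

Posts without any recognised entity, such as the last one, simply do not contribute to a topic under this scheme.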
  41. http://mediafinder.eurecom.fr/story/elezioni2013
  42. Outlook: an entity graph from the open and Social Web
  43. Thanks for your time and attention http://www.slideshare.net/giusepperizzo
  44. Do you have any questions?
