On the Semantic Representation and
Extraction of Complex Category
Descriptors
André Freitas, Rafael Vieira, Edward Curry, ...
Outline
 Motivation
 Extracting Natural Language Category Descriptors (NLCDs)
 Evaluation
 Summary
2
Motivation
3
Big Data
 Vision: More complete data-based picture of the world for
systems and users.
4
“Schema” Growth & Complexity
 Fundamental shift in the database landscape
 How to build large ‘schemas’?
10s-100s attrib...
Target Motivational Scenario: Wikipedia
 Decentralized content generation
 300,000 editors have edited Wikipedia more th...
Natural Language Category Descriptors
(NLCDs)
7
NLCDs
 Natural Language Category Descriptors (NLCDs) are
natural language descriptors for sets
 Simple NLCDs:
- ‘People’...
Assumptions
N
L
C
D
 NLCDs as a more syntactically tractable subset of natural
language
 NLCDs as a low effort interface...
Formality vs. Usability Spectrum
NLCDss NLCD graphss
Information Extraction
10
NLCD graphss
Applications
 Database Creation
 Semantic Annotation
 Entity/Semantic Search
11
Other Examples
 IFRS and US GAAP
- ‘Partially owned properties’
- ‘Residential portfolio segment’
- ‘Assets arising from ...
Extracting Natural Language
Category Descriptors (NLCDs)
13
Natural Language Category
Descriptors
What is Big Data?
14
Core Features
 Manual analysis of 10,000 NLCDs.
15
Features/Core Lexical Categories
Distribution
16
Number of distinct POS Tag patterns
17
Graph Representation Model
18
Focus of the Representation
 Taxonomic Structure
 Context Representation (Open Relation Extraction)
- Reification-based
Examples
20
Examples
21
Examples
22
Examples
23
NLCD Extractor
24
NLCD Extractor: POS Tagging
25
NLCD Extractor: Segmentation
26
NLCD Extractor: Named Entity
Recognition
27
NLCD Extractor: Core Detection
28
NLCD Extractor: WSD
29
NLCD Extractor: Entity Linking
30
Dbpedia
NLCD Extractor: RDF Representation
31
Dbpedia
RDF Representation
32
Evaluation
33
Evaluation Setup
 Total of 287,957 English Wikipedia categories (Open Domain
scenario)
 Selected random sample of 2,696 ...
Results
 Performance:
- (i) graph extraction time: 9.8 ms per graph
- (ii) word sense disambiguation: 121.0 ms per word
-...
Summary
 NLCDs can provide a more tractable (from the IE perspective)
natural language interface for structuring large KB...
Upcoming SlideShare
Loading in …5
×

On the Semantic Representation and Extraction of Complex Category Descriptors

518 views
404 views

Published on

Natural language descriptors used for categorizations are
present from folksonomies to ontologies. While some descriptors are composed of simple expressions, other descriptors have complex compositional patterns (e.g. ‘French Senators Of The Second Empire’, ‘Churches
Destroyed In The Great Fire Of London And Not Rebuilt’). As conceptual models get more complex and decentralized, more content is transferred to unstructured natural language descriptors, increasing the
terminological variation, reducing the conceptual integration and the structure level of the model. This work describes a formal representation for complex natural language category descriptors (NLCDs). In the
representation, complex categories are decomposed into a graph of primitive concepts, supporting their interlinking and semantic interpretation. A category extractor is built and the quality of its extraction under the proposed representation model is evaluated.

Published in: Science, Technology, Education
0 Comments
1 Like
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
518
On SlideShare
0
From Embeds
0
Number of Embeds
0
Actions
Shares
0
Downloads
10
Comments
0
Likes
1
Embeds 0
No embeds

No notes for slide

On the Semantic Representation and Extraction of Complex Category Descriptors

  1. 1. On the Semantic Representation and Extraction of Complex Category Descriptors André Freitas, Rafael Vieira, Edward Curry, Danilo Carvalho, João C. Pereira da Silva Insight Centre for Data Analytics NLDB 2014 Montpellier, France
  2. 2. Outline  Motivation  Extracting Natural Language Category Descriptors (NLCDs)  Evaluation  Summary 2
  3. 3. Motivation 3
  4. 4. Big Data  Vision: More complete data-based picture of the world for systems and users. 4
  5. 5. “Schema” Growth & Complexity  Fundamental shift in the database landscape  How to build large ‘schemas’? 10s-100s attributes 1,000s-1,000,000s attributes 5
  6. 6. Target Motivational Scenario: Wikipedia  Decentralized content generation  300,000 editors have edited Wikipedia more than 10 times  > 280,000 distinct Natural Language Category Descriptors (NLCDs) 6
  7. 7. Natural Language Category Descriptors (NLCDs) 7
  8. 8. NLCDs  Natural Language Category Descriptors (NLCDs) are natural language descriptors for sets  Simple NLCDs: - ‘People’ - ‘Countries’ - ‘Films’  Complex NLCDs: - ‘French Senators Of The Second Empire’ - ‘United Kingdom Parliamentary Constituencies Represented By A Sitting Prime Minister’  Goal: - Parse NLCDs into an integrated structured graph 8
  9. 9. Assumptions N L C D  NLCDs as a more syntactically tractable subset of natural language  NLCDs as a low effort interface for structuring a domain of discourse IE 9
  10. 10. Formality vs. Usability Spectrum NLCDss NLCD graphss Information Extraction 10 NLCD graphss
  11. 11. Applications  Database Creation  Semantic Annotation  Entity/Semantic Search 11
  12. 12. Other Examples  IFRS and US GAAP - ‘Partially owned properties’ - ‘Residential portfolio segment’ - ‘Assets arising from exploration for and evaluation of mineral resources’ - ‘Key management personnel compensation’ - ‘Other long-term employee benefits’ 12
  13. 13. Extracting Natural Language Category Descriptors (NLCDs) 13
  14. 14. Natural Language Category Descriptors What is Big Data? 14
  15. 15. Core Features  Manual analysis of 10,000 NLCDs. 15
  16. 16. Features/Core Lexical Categories Distribution 16
  17. 17. Number of distinct POS Tag patterns 17
  18. 18. Graph Representation Model 18
  19. 19. Focus of the Representation  Taxonomic Structure  Context Representation (Open Relation Extraction) - Reification-based
  20. 20. Examples 20
  21. 21. Examples 21
  22. 22. Examples 22
  23. 23. Examples 23
  24. 24. NLCD Extractor 24
  25. 25. NLCD Extractor: POS Tagging 25
  26. 26. NLCD Extractor: Segmentation 26
  27. 27. NLCD Extractor: Named Entity Recognition 27
  28. 28. NLCD Extractor: Core Detection 28
  29. 29. NLCD Extractor: WSD 29
  30. 30. NLCD Extractor: Entity Linking 30 Dbpedia
  31. 31. NLCD Extractor: RDF Representation 31 Dbpedia
  32. 32. RDF Representation 32
  33. 33. Evaluation 33
  34. 34. Evaluation Setup  Total of 287,957 English Wikipedia categories (Open Domain scenario)  Selected random sample of 2,696 categories  Manual evaluation of the core extraction features - Entity segmentation - Relation identification - Unary operators - Specialization relations - Category core identification - Entity core identification - Word Sense Disambiguation (WordNet) - Entity linking (DBpedia) 34
  35. 35. Results  Performance: - (i) graph extraction time: 9.8 ms per graph - (ii) word sense disambiguation: 121.0 ms per word - (iii) entity linking: 530.0 ms per link * i5-3317U (1.70GHz) CPU computer with 4GB RAM (4 core, 2 threads per core). 35
  36. 36. Summary  NLCDs can provide a more tractable (from the IE perspective) natural language interface for structuring large KBs  We developed an approach for the representation, extraction and integration of NLCDs - ~75% extraction accuracy  Limitations: - Need for a more principled and formal definition for a NLCD - Need for a better entity recognition and linking approach  Future Work: evaluation under a domain-specific scenario 36

×