Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
Big Data is based on the vision of providing users and applications with a more complete picture of reality, supported and mediated by data. This vision comes with the inherent price of data variety, i.e. data that is semantically heterogeneous, poorly structured, complex, and affected by data quality issues. Despite the hype around technologies targeting data volume and velocity, solutions for coping with data variety remain fragmented and have seen limited adoption. In this talk we focus on emerging data management approaches, supported by semantic technologies, for coping with data variety. We provide a broad overview of semantic computing approaches and how they can be applied to data management challenges within organizations today. The talk gives the audience a glimpse into the next generation of Big Data-driven information systems.

Published in: Software, Education

    1. Coping with Data Variety in the Big Data Era: The Semantic Computing Approach. André Freitas, Insight Centre for Data Analytics. Rio Big Data Meetup (June 2014)
    2. Outline
       • Shift in the Information Systems Landscape
       • Semantic Computing
       • Semantics Technologies that Work Today: Data Creation
       • Semantics Technologies that Work Today: Data Consumption
       • Case Study: Treo QA System
       • Conclusions
    3. Shift in the Information Systems Landscape
    4. Big Data
       • Vision: More complete data-based picture of the world for systems and users.
    5. Big Data Dimensions
       • Volume
       • Velocity
       • Variety
    6. Big Data Dimensions
       • Volume
       • Velocity
       • Variety
       • Veracity
       • Value
    7. Big Data Definitions: Data Variety. What is Big Data?
    8. Cost of Making Sense of It
       • “A lot of Big Data is a lot of small data put together.”
       • “Most of Big Data is not a uniform big block.”
       • “Each data piece is very small and very messy, and a lot of what we are doing there is dealing with that variety.”
    9. Cost of Making Sense of It
       • “It is more about the rate of change, the amount and the resources that you need to deal with it.”
       • “If the programming effort per amount of high quality data is really high, the data is big in the sense of high cost to produce new information.”
       • “Big Data seems to be about addressing challenges of scale, in terms of how fast things are coming out at you versus how much it costs to get value out of what you already have.”
    10. Cost of Making Sense of It
       • “You can have Big Data challenges not only because you have PBs of data but because data is incredibly varied and therefore consumes a lot of resources to make sense of it.”
    11. Cost of Making Sense of It
       • “The speed in which data is generated and the speed in which it needs to be processed in order to use it effectively.”
    12. “Schema” Growth
       • Heterogeneous, complex and large-scale databases.
       • Very large and dynamic “schemas”.
       • Circa 2000: 10s-100s attributes. Circa 2014: 1,000s-1,000,000s attributes.
    13. Semantic Heterogeneity
       • Decentralized content generation.
       • Multiple perspectives (conceptualizations) of the reality.
       • Ambiguity, vagueness, inconsistency.
    14. Figure labels: Data variety (+); Data quality (-); Data; Programs; Full data coverage; Full automation
    15. Figure labels: Structure level; Unstructured Data (easy to generate); Structured Data (easy to analyze); Consistent; Comparable; Processable; Semantic Computing
    16. The Futurist Perspective
    17. The Futurist Perspective
       • AI vision
       • Full automation
       • Perfect natural language interaction
    18. The Realist Perspective: What can be achieved with semantic computing today?
    19. Google Knowledge Graph
    20. FB Graph Search
    21. Apple Siri
    22. IBM Watson
    23. QA: Vision
    24. Semantic Computing
    25. (Some) Challenges in Semantics
       • Knowledge Representation Model
       • Reasoning
       • Large, inconsistent, heterogeneous Data
       • Acquisition, Learning
       • Expected Result: intelligent behavior (semantic flexibility, predictive power, automation, ...)
       • There is an economic model behind each element!
    26. Meaning
       • Word meaning is usually represented in terms of some formal, symbolic structure, either external or internal to the word
       • External structure: associations between different concepts
       • Internal structure: feature (property, attribute) lists
       • The semantic properties of a word are derived from the formal structure of its representation (e.g. inference algorithm, etc.)
       • Semantics = Meaning representation model (data) + inference model
    28. Formal Representation of Meaning (Problems)
       • Different meanings: bank (financial institution), bank (river side)
       • Meaning variation in context: clever politician, clever tycoon
       • Meaning evolution
       • Ambiguity, vagueness, inconsistency
       • Figure labels: Word meaning acquisition & representation; Lack of flexibility; Scalability
    29. Semantics for a Complex World
       • Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.
       • If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.
       • Cited: Sahlgren, 2013; Baroni et al. 2013
       • Figure labels: Formal World; Real World
    30. Semantics Technologies that Work Today: Data Creation
    31. Data Creation
       • Human interaction element (Data Curation)
       • Semantic representation
       • Information extraction
    32. Data Curation
    33. Entity-Centric Content Generation
    34. Defining Core Categories
    35. Disambiguation/Synonym
    36. Defining Attributes & Relationships
    37. Data curation elements
       • Data curation platforms: Spreadsheets, Open Refine, Karma
       • Algorithmic curation: Validation & Annotation robots
       • Curation at source: Minimal Information Models (MIRIAM)
       • Data curation roles
       • Crowdsourcing
    38. Standardized Data Models
       • Provides a minimum level of data interoperability
       • Examples: Resource Description Framework (RDF), Linked Comma Separated Value (CSV), Javascript Object Notation (JSON)
    39. Resource Description Framework (RDF)
       • Graph data model
       • Entity-centric data integration
       • Facilitates decentralized content generation
       • URIs for concept identifiers
       • Associated structured query language (SPARQL)
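    44. Resource Description Framework (RDF)
       • dbpedia:General_Electric rdf:type dbo:Organization
       • dbpedia:General_Electric dbp:revenue "US$ 147.3 billion"@en
       • dbpedia:General_Electric dbp:locationCity dbpedia:Fairfield, Connecticut
       • dbpedia:General_Electric owl:sameAs sec:General_Electric
       • sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal …
       • dbpedia:Fairfield, Connecticut owl:sameAs geo:Fairfield
       • geo:Fairfield geo:latitude "N 41° 13' 29''"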
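
The entity-centric integration on the RDF slide can be sketched in plain Python as follows. This is a minimal illustration, not how an RDF store works internally: triples are plain tuples, the subject/predicate/object pairing is reconstructed from the flattened diagram, and the names `match`, `same_as_closure` and `describe` are hypothetical helpers introduced here.

```python
from typing import Iterator, List, Optional, Set, Tuple

Triple = Tuple[str, str, str]

# Triples transcribed from the slide. The pairing of nodes and edges is an
# assumption, and the underscore in "Fairfield,_Connecticut" is added here
# only so the identifier contains no space.
TRIPLES: List[Triple] = [
    ("dbpedia:General_Electric", "rdf:type", "dbo:Organization"),
    ("dbpedia:General_Electric", "dbp:revenue", '"US$ 147.3 billion"@en'),
    ("dbpedia:General_Electric", "dbp:locationCity", "dbpedia:Fairfield,_Connecticut"),
    ("dbpedia:General_Electric", "owl:sameAs", "sec:General_Electric"),
    ("dbpedia:Fairfield,_Connecticut", "owl:sameAs", "geo:Fairfield"),
    ("geo:Fairfield", "geo:latitude", "N 41° 13' 29''"),
]


def match(triples: List[Triple], s: Optional[str] = None,
          p: Optional[str] = None, o: Optional[str] = None) -> Iterator[Triple]:
    """Yield triples that fit the pattern; None acts as a wildcard."""
    for t in triples:
        if (s is None or t[0] == s) and (p is None or t[1] == p) and (o is None or t[2] == o):
            yield t


def same_as_closure(triples: List[Triple], entity: str) -> Set[str]:
    """Collect every identifier connected to `entity` through owl:sameAs links."""
    seen = {entity}
    frontier = [entity]
    while frontier:
        current = frontier.pop()
        for s, _, o in match(triples, p="owl:sameAs"):
            for a, b in ((s, o), (o, s)):  # follow the link in both directions
                if a == current and b not in seen:
                    seen.add(b)
                    frontier.append(b)
    return seen


def describe(triples: List[Triple], entity: str) -> List[Triple]:
    """Return the facts about an entity, merged across its owl:sameAs aliases."""
    ids = same_as_closure(triples, entity)
    return [t for t in triples if t[0] in ids and t[1] != "owl:sameAs"]
```

Calling `describe(TRIPLES, "dbpedia:Fairfield,_Connecticut")` returns the latitude fact that is stated only about `geo:Fairfield`, which is the point of linking identifiers across datasets.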
    45. Representation
       • Rules (SWRL, RIF)
       • Ontology (OWL): logical constraints
       • Taxonomy (RDFS): classes in sub-/super-class hierarchy
       • Relational (RDF): attributes, associations
       • Dictionary: terms and definitions
       • Figure label: Increasing Semantic Representation
    46. Representation (figure label: Increasing Semantic Representation)
    47. Linked Data (figure labels: HTTP request; RDF; JSON; SPARQL; R2RML; Relational Database)
    48. http://dbpedia.org/resource/Jupiter
    49. Open Data
       • Common-sense Knowledge Base
       • Domain-specific Knowledge Base
       • Entity reference system
    50. Open Data
       • DBpedia: http://dbpedia.org/
       • YAGO: http://www.mpi-inf.mpg.de/yago-naga/yago/
       • Freebase: http://www.freebase.com/
       • Wikipedia dumps: http://dumps.wikimedia.org/
       • ConceptNet: http://conceptnet5.media.mit.edu/
       • Geonames: http://www.geonames.org/
       • Common Crawl: http://commoncrawl.org/
    51. Standardized Vocabularies
       • Open conceptual models to be reused across different datasets
       • Provides conceptual model level interoperability
       • Useful for modelling recurrent domains of discourse
    52. Standardized Vocabularies
       • FOAF, SIOC, COGS, Data Cube Vocabulary, PROV-O, DCTERMS, WGS84 Geo Positioning, SDMX, QUDT, SSN, Schema.org, VoID, Data Catalog, ...
       • http://lov.okfn.org/dataset/lov/
    55. Entity Recognition & Linking
       • Align terms in unstructured text to entities in a structured KB
       • Integrates structured to unstructured data
       • Can be used to support semantic search
       • Provides a first level of structure to unstructured data
       • Exploratory browsing
    56. Entity Recognition & Linking
       • Example: “GE has also been implicated in the creation of toxic waste.”
    58. Entity Recognition & Linking
       • Example: “GE has also been implicated in the creation of toxic waste.”
       • Annotation: <http://dbpedia.org/resource/General_Electric>, with types yago:ConglomerateCompanies, yago:MedicalEquipmentManufacturers, yago:CompaniesListedOnTheNewYorkStockExchange
    59. Entity Recognition & Linking
       • Example: “GE has also been implicated in the creation of toxic waste.”
       • Annotation: <http://dbpedia.org/resource/Toxic_waste>
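
A rough sketch of the linking step in the example above, assuming a small hand-made gazetteer. The dictionary lookup here stands in for the statistical disambiguation that tools such as DBpedia Spotlight perform; `GAZETTEER` and `link_entities` are hypothetical names, and only the two URIs come from the slides.

```python
import re
from typing import Dict, List, Tuple

# Surface forms mapped to knowledge-base URIs (a toy gazetteer, assumed here).
GAZETTEER: Dict[str, str] = {
    "GE": "http://dbpedia.org/resource/General_Electric",
    "General Electric": "http://dbpedia.org/resource/General_Electric",
    "toxic waste": "http://dbpedia.org/resource/Toxic_waste",
}

SENTENCE = "GE has also been implicated in the creation of toxic waste."


def link_entities(text: str, gazetteer: Dict[str, str]) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, surface form, URI) for each non-overlapping match.

    Longer surface forms are tried first so that a long mention is not
    shadowed by a shorter one contained in it.
    """
    found: List[Tuple[int, int, str, str]] = []
    for surface in sorted(gazetteer, key=len, reverse=True):
        for m in re.finditer(r"\b" + re.escape(surface) + r"\b", text):
            overlaps = any(m.start() < end and start < m.end()
                           for start, end, _, _ in found)
            if not overlaps:
                found.append((m.start(), m.end(), surface, gazetteer[surface]))
    return sorted(found)


if __name__ == "__main__":
    for start, end, surface, uri in link_entities(SENTENCE, GAZETTEER):
        print(f"{surface!r} [{start}:{end}] -> <{uri}>")
```

A real linker also has to pick among several candidate entities for an ambiguous mention such as "GE"; this sketch sidesteps that by listing one URI per surface form.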
    60. Entity Recognition/Linking
       • DBpedia Spotlight: http://spotlight.dbpedia.org
       • NERD (Named Entity Recognition and Disambiguation): http://nerd.eurecom.fr/
       • Stanford Named Entity Recognizer: http://nlp.stanford.edu/software/CRF-NER.shtml
    61. Syntactic Parsers
       • GE/NNP has/VBZ also/RB been/VBN implicated/VBN in/IN the/DT creation/NN of/IN toxic/JJ waste/NN
    62. Parsers
       • Stanford parser: http://nlp.stanford.edu/software/lex-parser.shtml (languages: English, German, Chinese, and others)
       • MALT: http://www.maltparser.org/ (pre-trained languages: English, French, Swedish)
       • C&C Parser: http://svn.ask.it.usyd.edu.au/trac/candc
    63. Text Processing Tools
       • GATE (General Architecture for Text Engineering): http://gate.ac.uk/
       • NLTK (Natural Language Toolkit): http://nltk.org/
       • Stanford NLP: http://www-nlp.stanford.edu/software/index.shtml
       • LingPipe: http://alias-i.com/lingpipe/index.html
    64. Database Representation
       • Easy evolution of schemas (schema-less)
       • Graph Databases: OpenLink Virtuoso, Neo4J, Transforming Lucene into a Graph Database
       • NoSQL ...
    65. NLP Integration
       • Apache Unstructured Information Management Architecture (UIMA): component software architecture for the analysis of unstructured data, http://uima.apache.org/
       • NLP Interchange Format (NIF): RDF & OWL-based, http://persistence.uni-leipzig.org/nlp2rdf/
    66. Relation/Graph Extraction
       • Reverb: http://reverb.cs.washington.edu/
       • Graphia: http://graphia.dcc.ufrj.br/
    67. Relation/Graph Extraction
       • Example: “In 2002, GE acquired the wind power assets of Enron.”
    68. Relation/Graph Extraction
       • Example: “General Electric Company, or GE, is an American multinational conglomerate corporation incorporated in Schenectady, New York”
    69. Semantics Technologies that Work Today: Data Consumption
    70. Vector Space Models
       • Representation useful for approximate search
       • Search over structured and unstructured data
       • Construction of approximate semantic models
    71. Vector Space Models (figure labels: angle θ; document http://en.wikipedia.org/wiki/General_Electric; dimensions General, Electric, ...; query “General Electric company”)
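
The angle θ between a query and a document in the vector space model can be illustrated with a small term-frequency sketch. Everything below is an assumption made for illustration: the three toy documents are invented, plain term counts are used instead of a weighting scheme such as TF/IDF, and `search` is a hypothetical helper, not the API of Lucene or Terrier.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple

# Toy document collection (invented for this sketch).
DOCS: Dict[str, str] = {
    "General_Electric": "General Electric is an American multinational conglomerate corporation",
    "Enron": "Enron was an American energy company",
    "Toxic_waste": "Toxic waste is waste material that can cause harm",
}


def tf_vector(text: str) -> Counter:
    """Map a text to a sparse term-frequency vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def search(query: str, docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Rank documents by cosine similarity to the query, best first."""
    q = tf_vector(query)
    scored = [(doc_id, cosine(q, tf_vector(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    for doc_id, score in search("General Electric company", DOCS):
        print(f"{doc_id}: {score:.3f}")
```

Because the score depends only on the angle, a partial match ("company" alone) still yields a non-zero score, which is what makes the representation usable for approximate search.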
    72. Indexing & Search Engines
       • Lucene & Solr: http://lucene.apache.org/
       • Terrier: http://terrier.org/
    73. Distributional Hypothesis
       • “Words occurring in similar (linguistic) contexts tend to be semantically similar”
       • He filled the wampimuk with the substance, passed it around and we all drunk some
       • We found a little, hairy wampimuk sleeping behind the tree
    74. Distributional Semantic Models (DSMs)
       • Computational models that build contextual semantic representations from corpus data
       • Semantic context is represented by a vector
       • Vectors are obtained through the statistical analysis of the linguistic contexts of a word
       • Salience of contexts (cf. context weighting scheme)
       • Semantic similarity/relatedness as the core operation over the model
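
The bullets above can be made concrete with a very small distributional model. This is a sketch under strong simplifying assumptions: the five-sentence corpus is invented, the context of a word is the set of other content words in the same sentence, and raw co-occurrence counts are used with no context weighting. `build_vectors` and `relatedness` are hypothetical names.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List

# Toy corpus and stop-word list, both invented for this sketch.
CORPUS: List[str] = [
    "the dog will bark and run on a leash",
    "the dog can run and bark at the cat",
    "the cat will run from the dog",
    "the car will run on the road",
    "the car needs fuel on the road",
]
STOP_WORDS = {"the", "a", "and", "will", "can", "on", "at", "from", "needs"}


def build_vectors(sentences: List[str]) -> Dict[str, Counter]:
    """Build one context vector per word from sentence-level co-occurrence."""
    vectors: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        tokens = [t for t in sentence.split() if t not in STOP_WORDS]
        for i, word in enumerate(tokens):
            for j, context in enumerate(tokens):
                if i != j:
                    vectors[word][context] += 1
    return vectors


def relatedness(vectors: Dict[str, Counter], a: str, b: str) -> float:
    """Cosine similarity between the context vectors of two words."""
    va, vb = vectors[a], vectors[b]
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


if __name__ == "__main__":
    vecs = build_vectors(CORPUS)
    print("dog ~ cat:", round(relatedness(vecs, "dog", "cat"), 3))
    print("dog ~ car:", round(relatedness(vecs, "dog", "car"), 3))
```

On this corpus "dog" comes out closer to "cat" than to "car", because dog and cat share the contexts "bark" and "run" while dog and car share only "run". A realistic model would be built from a large reference corpus and would weight contexts by salience.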
    75. DSMs as Commonsense Reasoning (figure labels: Commonsense is here; θ; car, dog, cat, bark, run, leash)
    76. DSMs as Commonsense Reasoning
    77. DSMs as Commonsense Reasoning
    78. DSMs as Commonsense Reasoning
    79. DSMs as Commonsense Reasoning (figure labels: θ; car, dog, cat, bark, run, leash, ...; vs.; Semantic best-effort)
    80. Distributional Semantic Models (DSMs)
    81. String similarity and semantic relatedness
       • Amtera Esprit (distributional semantic relatedness): http://www.mashape.com/amtera/esa-semantic-relatedness
       • WS4J (Java API for several semantic relatedness algorithms): https://code.google.com/p/ws4j/
       • SecondString (string matching): http://secondstring.sourceforge.net
       • S-space (distributional semantics framework): https://github.com/fozziethebeat/S-Space
    82. Lexical Resources
       • WordNet: http://wordnet.princeton.edu/
       • Wiktionary: http://www.wiktionary.org/
       • FrameNet: https://framenet.icsi.berkeley.edu/fndrupal/
       • VerbNet: http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
       • BabelNet: http://babelnet.org/
    83. Figure labels: Entity Recognition & Linking; Distributional Semantics; Relation/Graph Extraction; Internal Datasets; Reference Corpora; Semantic Pipeline; Vocabulary Management; Semantic Search & QA; Crawling & Indexing; Open Data; Vocabularies, Taxonomies, Lexical Resources; Internal Documents; Knowledge Graph Management; Knowledge Graph; Data Curation Platform; Crowdsourcing Services Applications; User feedback; Provenance Management
    84. Case Study: Treo QA System
    85. Querying your Knowledge Graph (figure label: Gaelic: direction)
    86. Solution (Video)
    87. More Complex Queries (Video)
    89. Vocabulary Problem
       • Query: Who is the daughter of Bill Clinton married to?
       • Figure labels: Semantic approximation; Semantic Gap
       • Possible representations = Commonsense Knowledge
       • Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
    90. Core Principles
       • Minimize the impact of Ambiguity, Vagueness, Synonymy.
       • Address the simplest matchings first (heuristics).
       • Semantic Relatedness as a primitive operation.
       • Distributional semantics as commonsense knowledge.
    91. Step 1: POS Tagging. Query Pre-Processing (Question Analysis)
       • Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP married/VBN to/TO ?/.
    92. Step 2: Core Entity Recognition. Query Pre-Processing (Question Analysis)
       • Rules-based: POS Tag + TF/IDF
       • Who is the daughter of Bill Clinton married to? (PROBABLY AN INSTANCE)
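
Steps 1 and 2 can be sketched as follows, taking the tagged query from the slide as input. The rule used here (merge consecutive proper-noun tokens into one candidate entity) is an assumed simplification; the TF/IDF part of the rule set mentioned on the slide is not reproduced, and `parse_tagged` and `proper_noun_spans` are hypothetical names.

```python
from typing import List, Tuple

# POS-tagged query exactly as shown on the Step 1 slide.
TAGGED_QUERY = ("Who/WP is/VBZ the/DT daughter/NN of/IN "
                "Bill/NNP Clinton/NNP married/VBN to/TO ?/.")


def parse_tagged(tagged: str) -> List[Tuple[str, str]]:
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in tagged.split():
        word, tag = token.rsplit("/", 1)  # split on the last slash only
        pairs.append((word, tag))
    return pairs


def proper_noun_spans(pairs: List[Tuple[str, str]]) -> List[str]:
    """Merge runs of consecutive NNP/NNPS tokens into candidate entities."""
    spans: List[str] = []
    current: List[str] = []
    for word, tag in pairs:
        if tag in ("NNP", "NNPS"):
            current.append(word)
        elif current:
            spans.append(" ".join(current))
            current = []
    if current:  # flush a run that ends the sentence
        spans.append(" ".join(current))
    return spans


if __name__ == "__main__":
    print(proper_noun_spans(parse_tagged(TAGGED_QUERY)))
```

For the example query this yields the single candidate "Bill Clinton", the term the later slides treat as the instance.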
    93. Step 3: Determine answer type. Query Pre-Processing (Question Analysis)
       • Rules-based.
       • Who is the daughter of Bill Clinton married to? (PERSON)
    94. Step 4: Dependency parsing. Query Pre-Processing (Question Analysis)
       • dep(married-8, Who-1)
       • auxpass(married-8, is-2)
       • det(daughter-4, the-3)
       • nsubjpass(married-8, daughter-4)
       • prep(daughter-4, of-5)
       • nn(Clinton-7, Bill-6)
       • pobj(of-5, Clinton-7)
       • root(ROOT-0, married-8)
       • xcomp(married-8, to-9)
    95. Step 5: Determine Partial Ordered Dependency Structure (PODS). Query Pre-Processing (Question Analysis)
       • Rules-based.
       • Remove stop words.
       • Merge words into entities.
       • Reorder structure from core entity position.
       • PODS: Bill Clinton (INSTANCE), daughter, married to
       • Figure labels: ANSWER TYPE; QUESTION FOCUS
    96. Question Analysis
       • Query Features (PODS): Bill Clinton (INSTANCE), daughter (PREDICATE), married to (PREDICATE)
    97. Query Plan
       • Map query features into a query plan.
       • A query plan contains a sequence of core operations.
       • Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
       • (1) INSTANCE SEARCH (Bill Clinton)
       • (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
       • (3) e1 <- NAVIGATE (Bill Clinton, p1)
       • (4) p2 <- SEARCH PREDICATE (e1, married to)
       • (5) e2 <- NAVIGATE (e1, p2)
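
The five operations of the query plan can be traced over a toy graph as below. This is a sketch, not the Treo implementation: the triples come from the walkthrough slides, the relatedness scores are a lookup table rather than a distributional model, and several values are assumptions flagged in the comments. `instance_search`, `search_predicate`, `navigate` and `run_plan` are hypothetical names mirroring the plan's operations.

```python
from typing import Dict, List, Optional, Tuple

# Triples shown on the walkthrough slides.
TRIPLES: List[Tuple[str, str, str]] = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Toy label index for instance search (assumed).
LABELS: Dict[str, str] = {"Bill Clinton": ":Bill_Clinton"}

# Stand-in for a distributional relatedness measure.
SEM_REL: Dict[Tuple[str, str], float] = {
    ("daughter", ":child"): 0.054,      # from the slides
    ("daughter", ":religion"): 0.004,   # score from the slides; pairing with :religion is assumed
    ("daughter", ":almaMater"): 0.001,  # from the slides
    ("married to", ":spouse"): 0.05,    # hypothetical value, not in the slides
}


def instance_search(term: str) -> str:
    """(1) Resolve the core entity term to an instance in the graph."""
    return LABELS[term]


def search_predicate(entity: str, term: str) -> Optional[str]:
    """(2)/(4) Pick the predicate of `entity` most related to the query term."""
    candidates = {p for s, p, _ in TRIPLES if s == entity}
    scored = [(SEM_REL.get((term, p), 0.0), p) for p in candidates]
    if not scored:
        return None
    score, best = max(scored)
    return best if score > 0.0 else None


def navigate(entity: str, predicate: str) -> List[str]:
    """(3)/(5) Follow the predicate from the pivot entity to its objects."""
    return [o for s, p, o in TRIPLES if s == entity and p == predicate]


def run_plan(instance_term: str, predicate_terms: List[str]) -> List[str]:
    """Execute INSTANCE SEARCH, then SEARCH PREDICATE + NAVIGATE per term."""
    entities = [instance_search(instance_term)]
    for term in predicate_terms:
        next_entities: List[str] = []
        for entity in entities:
            predicate = search_predicate(entity, term)
            if predicate is not None:
                next_entities.extend(navigate(entity, predicate))
        entities = next_entities
    return entities


if __name__ == "__main__":
    print(run_plan("Bill Clinton", ["daughter", "married to"]))
```

Each navigation step turns its result into the next pivot entity, so the vocabulary gap ("daughter" vs. `:child`, "married to" vs. `:spouse`) is bridged one predicate at a time.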
    98. Instance Search
       • Query: Bill Clinton daughter married to
       • Linked Data: :Bill_Clinton
    99. Predicate Search
       • Query: Bill Clinton daughter married to
       • Linked Data: :Bill_Clinton (PIVOT ENTITY)
       • (ASSOCIATED TRIPLES): :child :Chelsea_Clinton; :religion :Baptists; :almaMater :Yale_Law_School; ...
    102. Predicate Search
       • Query: Bill Clinton daughter married to
       • Linked Data: :Bill_Clinton with :child :Chelsea_Clinton; :religion :Baptists; :almaMater :Yale_Law_School; ...
       • Which properties are semantically related to ‘daughter’?
       • sem_rel(daughter, child)=0.054
       • sem_rel(daughter, religion)=0.004
       • sem_rel(daughter, alma mater)=0.001
    104. Navigate
       • Query: Bill Clinton daughter married to
       • Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
    105. Predicate Search
       • Query: Bill Clinton daughter married to
       • Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY) :spouse :Mark_Mezvinsky
    106. Results
    107. Evaluation
       • 102 natural language queries (Test Collection: QALD 2011).
       • Avg. query execution time: 1.52 s (simple queries) – 8.53 s (all queries).
    108. Treo Answers Jeopardy Queries (Video): http://bit.ly/1hWcch9
    109. Hybrid unstructured & structured
       • Sydney's dad, Jack, was a CIA double agent working against SD-6 on this Jennifer Garner show.
    110. Core Principles
       • Semantic best-effort
       • Dialog & user disambiguation
       • Pay-as-you-go data integration
       • Simplicity of use
       • Franklin et al. (2005): From Databases to Dataspaces.
       • Helland (2011): If You Have Too Much Data, then “Good Enough” Is Good Enough.
    111. Take-away message
       • There are approaches that can be used today to cope with data variety in the Big Data era
       • Coping with data variety demands a multi-disciplinary perspective and a new infrastructure: Knowledge Representation, IR and Natural Language Processing
       • Semantics at scale as a central concern
       • You can build your own IBM Watson-like application!
       • Great opportunity for new solutions and for being a pioneer
    112. andre.freitas – at – deri.org
