Coping with Data Variety
in the Big Data Era:
The Semantic Computing Approach
André Freitas
Insight Centre for Data Analytics
Rio Big Data Meetup (June 2014)
Outline
 Shift in the Information Systems Landscape
 Semantic Computing
 Semantics Technologies that Work Today: Data Creation
 Semantics Technologies that Work Today: Data Consumption
 Case Study: Treo QA System
 Conclusions
Shift in the Information
Systems Landscape
Big Data
 Vision: More complete data-based picture of the world for
systems and users.
Big Data Dimensions
 Volume
 Velocity
 Variety
 Veracity
 Value
Big Data Definitions
Data Variety
What is Big Data?
Cost of Making Sense of It
“A lot of Big Data is a lot of small data put together.”
“Most of Big Data is not a
uniform big block.”
“Each data piece is very small and very
messy, and a lot of what we are doing
there is dealing with that variety.”
Cost of Making Sense of It
“It is more about the rate of change, the amount and the
resources that you need to deal with it.”
“If the programming effort per amount of
high quality data is really high, the data is
big in the sense of high cost to produce
new information.”
“Big Data seems to be about addressing challenges
of scale, in terms of how fast things are coming
out at you versus how much it costs to get value
out of what you already have.”
Cost of Making Sense of It
“You can have Big Data challenges not only
because you have PBs of data but because data
is incredibly varied and therefore consumes a
lot of resources to make sense of it.”
Cost of Making Sense of It
“The speed in which data is generated and the
speed in which it needs to be processed in
order to use it effectively.”
“Schema” Growth
 Heterogeneous, complex and large-scale databases.
 Very-large and dynamic “schemas”.
10s-100s attributes
1,000s-1,000,000s attributes
circa 2000
circa 2014
Semantic Heterogeneity
 Decentralized content generation.
 Multiple perspectives (conceptualizations) of reality.
 Ambiguity, vagueness, inconsistency.
Data variety +
Data quality -
Data
Programs
Full data coverage
Full automation
Structure level
Unstructured Data Structured Data
Consistent
Comparable
Processable
Easy to generate Easy to analyze
Semantic Computing
The Futurist Perspective
 AI vision
 Full automation
 Perfect natural language
interaction
The Realist Perspective
What can be achieved with semantic computing today?
Google Knowledge Graph
FB Graph Search
Apple Siri
IBM Watson
QA: Vision
Semantic Computing
(Some) Challenges in Semantics
Knowledge Representation Model
Reasoning
Large, inconsistent,
heterogeneous
Data
Expected Result: intelligent behavior
Semantic flexibility, predictive power, automation ...
Acquisition, Learning
There is an economic model behind each element!
Meaning
 Word meaning is usually represented in terms of some formal,
symbolic structure, either external or internal to the word
 External structure
- Associations between different concepts
 Internal structure
- Feature (property, attribute) lists
 The semantic properties of a word are derived from the formal
structure of its representation
- e.g. Inference algorithm, etc.
Semantics = Meaning representation model (data) +
inference model
Formal Representation of Meaning
(Problems)
 Different meanings
- bank (financial institution)
- bank (river side)
 Meaning variation in context
- clever politician, clever tycoon
 Meaning evolution
 Ambiguity, vagueness, inconsistency
Word meaning acquisition &
representation
Lack of flexibility
Scalability
 Most semantic models have dealt with particular types of
constructions, and have been carried out under very simplifying
assumptions, in true lab conditions.
 If these idealizations are removed it is not clear at all that modern
semantics can give a full account of all but the simplest
models/statements.
Sahlgren, 2013
Formal World Real World
Baroni et al. 2013
Semantics for a Complex World
Semantics Technologies that
Work Today
Data Creation
Data Creation
 Human interaction element (Data Curation)
 Semantic representation
 Information extraction
Data Curation
Entity-Centric Content Generation
Defining Core Categories
Disambiguation/Synonym
Defining Attributes & Relationships
Data curation elements
 Data curation platforms
- Spreadsheets
- Open Refine
- Karma
 Algorithmic curation
- Validation & Annotation robots
 Curation at source
- Minimal Information Models (MIRIAM)
 Data curation roles
 Crowdsourcing
Standardized Data Models
 Provides a minimum level of data interoperability
 Examples:
- Resource Description Framework (RDF)
- Linked Comma Separated Value (CSV)
- JavaScript Object Notation (JSON)
Resource Description Framework (RDF)
 Graph data model
 Entity-centric data integration
 Facilitates decentralized content generation
 URIs for concept identifiers
 Associated structured query language (SPARQL)
Resource Description Framework (RDF)
dbpedia:General_Electric  dbp:revenue  "US$ 147.3 billion"@en
dbpedia:General_Electric  rdf:type  dbo:Organization
dbpedia:General_Electric  dbp:locationCity  dbpedia:Fairfield, Connecticut
dbpedia:General_Electric  owl:sameAs  sec:General_Electric
sec:General_Electric  ifrs:CashFlowsFromUsedInOperationsTotal  …
dbpedia:Fairfield, Connecticut  owl:sameAs  geo:Fairfield
geo:Fairfield  geo:latitude  "N 41° 13' 29''"
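A minimal sketch of the entity-centric integration shown in the RDF graph: the triples are held as plain Python tuples (no RDF library is assumed), and owl:sameAs links are followed so that statements from different sources end up attached to one entity. The helper names `same_as_closure` and `describe` are hypothetical, and the cash-flow value that the slide elides stays elided.

```python
# Triples copied from the slide. The cash-flow value is elided there ("…")
# and is deliberately left elided here.
TRIPLES = [
    ("dbpedia:General_Electric", "dbp:revenue", '"US$ 147.3 billion"@en'),
    ("dbpedia:General_Electric", "rdf:type", "dbo:Organization"),
    ("dbpedia:General_Electric", "dbp:locationCity", "dbpedia:Fairfield, Connecticut"),
    ("dbpedia:General_Electric", "owl:sameAs", "sec:General_Electric"),
    ("sec:General_Electric", "ifrs:CashFlowsFromUsedInOperationsTotal", "…"),
    ("dbpedia:Fairfield, Connecticut", "owl:sameAs", "geo:Fairfield"),
    ("geo:Fairfield", "geo:latitude", "N 41° 13' 29''"),
]


def same_as_closure(entity, triples):
    """Return every identifier reachable from `entity` via owl:sameAs (both directions)."""
    seen, todo = {entity}, [entity]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if p != "owl:sameAs":
                continue
            for a, b in ((s, o), (o, s)):
                if a == current and b not in seen:
                    seen.add(b)
                    todo.append(b)
    return seen


def describe(entity, triples):
    """Collect the (predicate, object) pairs stated about the entity or any of its aliases."""
    aliases = same_as_closure(entity, triples)
    return sorted((p, o) for s, p, o in triples if s in aliases and p != "owl:sameAs")


if __name__ == "__main__":
    for predicate, value in describe("dbpedia:General_Electric", TRIPLES):
        print(predicate, value)
```

Because the two sources only share an owl:sameAs link, neither has to adopt the other's schema; the merge happens at query time.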
Representation
 Rules (SWRL, RIF)
 Ontology (OWL)
– Logical Constraints
 Taxonomy (RDFS)
– Classes in sub-/super-class hierarchy
 Relational (RDF)
– Attributes
– Associations
 Dictionary
– Terms and definitions
Increasing Semantic Representation
Linked Data
[Diagram labels: HTTP request; RDF; JSON; SPARQL; R2RML; Relational Database; http://dbpedia.org/resource/Jupiter]
Open Data
 Common-sense Knowledge Base
 Domain-specific Knowledge Base
 Entity reference system
 DBpedia
- http://dbpedia.org/
 YAGO
- http://www.mpi-inf.mpg.de/yago-naga/yago/
 Freebase
- http://www.freebase.com/
 Wikipedia dumps
- http://dumps.wikimedia.org/
 ConceptNet
- http://conceptnet5.media.mit.edu/
 Geonames
- http://www.geonames.org/
 Common Crawl
- http://commoncrawl.org/
Open Data
Standardized Vocabularies
 Open conceptual models to be reused across different
datasets
 Provides conceptual model level interoperability
 Useful for modelling recurrent domains of discourse
Standardized Vocabularies
 FOAF
 SIOC
 COGS
 Data Cube Vocabulary
 PROV-O
 DCTERMS
 WGS84 Geo Positioning
 SDMX
 QUDT
 SSN
 Schema.org
 VoID
 Data Catalog
 ...
http://lov.okfn.org/dataset/lov/
Entity Recognition & Linking
 Align terms in unstructured text to entities in a structured KB
 Integrates structured with unstructured data
 Can be used to support semantic search
 Provides a first level of structure to unstructured data
 Exploratory browsing
Entity Recognition & Linking
 Example:
“GE has also been implicated in the creation of toxic waste.”
<http://dbpedia.org/resource/General_Electric>
yago:ConglomerateCompanies
yago:MedicalEquipmentManufacturers
yago:CompaniesListedOnTheNewYorkStockExchange
Entity Recognition & Linking
 Example:
“GE has also been implicated in the creation of toxic waste.”
<http://dbpedia.org/resource/Toxic_waste>
 DBpedia Spotlight
- http://spotlight.dbpedia.org
 NERD (Named Entity Recognition and Disambiguation)
- http://nerd.eurecom.fr/
 Stanford Named Entity Recognizer
- http://nlp.stanford.edu/software/CRF-NER.shtml
Entity Recognition/Linking
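As a rough illustration of aligning terms in text to entities in a knowledge base, the sketch below links the "GE ... toxic waste" example with a longest-match dictionary lookup. This is not how DBpedia Spotlight, NERD or the Stanford recognizer work internally; the surface-form dictionary and the function name `link_entities` are assumptions made for the example, and no context-based disambiguation is attempted.

```python
import re

# Hypothetical surface-form dictionary. A real linker would derive it from the
# knowledge base and disambiguate candidates using the surrounding context.
SURFACE_FORMS = {
    "ge": "http://dbpedia.org/resource/General_Electric",
    "general electric": "http://dbpedia.org/resource/General_Electric",
    "toxic waste": "http://dbpedia.org/resource/Toxic_waste",
}


def link_entities(text, surface_forms=SURFACE_FORMS, max_len=3):
    """Greedy longest-match lookup of word n-grams against the dictionary."""
    tokens = re.findall(r"\w+", text)
    links, i = [], 0
    while i < len(tokens):
        # Try the longest span first so "toxic waste" wins over single words.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            span = " ".join(tokens[i:i + n])
            uri = surface_forms.get(span.lower())
            if uri:
                links.append((span, uri))
                i += n
                break
        else:
            i += 1
    return links


if __name__ == "__main__":
    sentence = "GE has also been implicated in the creation of toxic waste."
    for mention, uri in link_entities(sentence):
        print(mention, "->", uri)
```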
Syntactic Parsers
GE/NNP has/VBZ also/RB been/VBN implicated/VBN in/IN the/DT creation/NN of/IN
toxic/JJ waste/NN
 Stanford parser
- http://nlp.stanford.edu/software/lex-parser.shtml
- Languages: English, German, Chinese, and others
 MALT
- http://www.maltparser.org/
- Languages (pre-trained): English, French, Swedish
 C&C Parser
- http://svn.ask.it.usyd.edu.au/trac/candc
Parsers
 GATE (General Architecture for Text Engineering)
- http://gate.ac.uk/
 NLTK (Natural Language Toolkit)
- http://nltk.org/
 Stanford NLP
- http://www-nlp.stanford.edu/software/index.shtml
 LingPipe
- http://alias-i.com/lingpipe/index.html
Text Processing Tools
Database Representation
 Easy evolution of schemas (schema-less)
 Graph Databases
- OpenLink Virtuoso
- Neo4J
- Transforming Lucene into a Graph Database
 NoSQL ...
 Apache Unstructured Information Management Architecture
(UIMA)
- Component software architecture for the analysis of unstructured data
- http://uima.apache.org/
 NLP Interchange Format (NIF)
- RDF & OWL-based
- http://persistence.uni-leipzig.org/nlp2rdf/
NLP Integration
Relation/Graph Extraction
 Reverb
- http://reverb.cs.washington.edu/
 Graphia
- http://graphia.dcc.ufrj.br/
Relation/Graph Extraction
In 2002, GE acquired the wind power assets of Enron.
Relation/Graph Extraction
General Electric Company, or GE, is an American multinational conglomerate
corporation incorporated in Schenectady, New York
Semantics Technologies that
Work Today
Data Consumption
Vector Space Models
 Representation useful for approximate search
 Search over structured and unstructured data
 Construction of approximate semantic models
Vector Space Models
θ
http://en.wikipedia.org/wiki/General_Electric
General
Electric
...
“General Electric company”
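The angle θ in the vector space picture can be made concrete with a small sketch: the query and each document become term-count vectors, and the cosine of the angle between them serves as the approximate match score. The two document texts below are invented for illustration, and raw counts are used where an engine such as Lucene would apply a weighting scheme.

```python
import math
from collections import Counter


def term_vector(text):
    """Bag-of-words vector: lowercased term -> count."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse term vectors."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


# Illustrative document texts (assumptions, not taken from the slides).
DOCUMENTS = {
    "General_Electric": "general electric is an american multinational conglomerate company",
    "Toxic_waste": "toxic waste is waste material that can cause harm",
}

if __name__ == "__main__":
    query = term_vector("General Electric company")
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(query, term_vector(DOCUMENTS[d])), reverse=True)
    print(ranked)
```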
 Lucene & Solr
- http://lucene.apache.org/
 Terrier
- http://terrier.org/
Indexing & Search Engines
Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend
to be semantically similar”
 He filled the wampimuk with the substance, passed it
around and we all drunk some
 We found a little, hairy wampimuk sleeping behind the
tree
Distributional Semantic Models (DSMs)
 Computational models that build contextual semantic representations
from corpus data
 Semantic context is represented by a vector
 Vectors are obtained through the statistical analysis of the linguistic
contexts of a word
 Salience of contexts (cf. context weighting scheme)
 Semantic similarity/relatedness as the core operation over the model
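A toy sketch of a distributional semantic model, under simplifying assumptions: the corpus is three invented, pre-tokenized sentences, the context of a word is every other word in the same sentence, and raw co-occurrence counts stand in for a proper context weighting scheme. The function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, already tokenized and stripped of stop words (an assumption made
# to keep the sketch short).
CORPUS = [
    ["dog", "bark", "run", "leash"],
    ["cat", "run", "leash"],
    ["car", "run", "road"],
]


def build_context_vectors(sentences):
    """Count, for each word, the other words it shares a sentence with."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, context in enumerate(sentence):
                if i != j:
                    vectors[word][context] += 1
    return vectors


def relatedness(a, b, vectors):
    """Semantic relatedness as the cosine between two context vectors."""
    u, v = vectors[a], vectors[b]
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(sum(c * c for c in v.values()))
    return dot / norm if norm else 0.0


if __name__ == "__main__":
    vectors = build_context_vectors(CORPUS)
    print("dog ~ cat:", relatedness("dog", "cat", vectors))
    print("dog ~ car:", relatedness("dog", "car", vectors))
```

On this corpus "dog" comes out closer to "cat" than to "car" because the first two share more contexts, which is the distributional hypothesis at miniature scale.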
DSMs as Commonsense Reasoning
Commonsense is here
θ
car
dog
cat
bark
run
leash
DSMs as Commonsense Reasoning
θ
car
dog
cat
bark
run
leash
...
vs.
Semantic best-effort
Distributional Semantic Models (DSMs)
 Amtera Esprit (distributional semantic relatedness)
- http://www.mashape.com/amtera/esa-semantic-relatedness
 WS4J (Java API for several semantic relatedness
algorithms)
- https://code.google.com/p/ws4j/
 SecondString (string matching)
- http://secondstring.sourceforge.net
 S-space (distributional semantics framework)
- https://github.com/fozziethebeat/S-Space
String similarity and semantic relatedness
 WordNet
- http://wordnet.princeton.edu/
 Wiktionary
- http://www.wiktionary.org/
 FrameNet
- https://framenet.icsi.berkeley.edu/fndrupal/
 VerbNet
- http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
 BabelNet
- http://babelnet.org/
Lexical Resources
[Architecture diagram. Components: Entity Recognition & Linking; Distributional Semantics; Relation/Graph Extraction; Internal Datasets; Reference Corpora; Semantic Pipeline; Vocabulary Management; Semantic Search & QA; Crawling & Indexing; Open Data; Vocabularies, Taxonomies, Lexical Resources; Internal Documents; Knowledge Graph Management; Knowledge Graph; Data Curation Platform; Crowdsourcing Services; Applications; User feedback; Provenance Management]
Case Study:
Treo QA System
Querying your Knowledge Graph
Gaelic: direction
Solution (Video)
More Complex Queries (Video)
Vocabulary Problem
Query: Who is the daughter of Bill Clinton married to?
Semantic approximation
Semantic Gap
Possible representations = Commonsense Knowledge
Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and
9,434,677 instances
Core Principles
 Minimize the impact of Ambiguity, Vagueness, Synonymy.
 Address the simplest matchings first (heuristics).
 Semantic Relatedness as a primitive operation.
 Distributional semantics as commonsense knowledge.
Step 1: POS Tagging
Who/WP
is/VBZ
the/DT
daughter/NN
of/IN
Bill/NNP
Clinton/NNP
married/VBN
to/TO
?/.
Query Pre-Processing
(Question Analysis)
Step 2: Core Entity Recognition
Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
(PROBABLY AN INSTANCE)
Query Pre-Processing
(Question Analysis)
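One way to picture the POS-tag part of core entity recognition: parse the tagged question from Step 1 and treat each run of consecutive proper nouns (NNP) as a candidate instance. This is a simplified sketch; the TF/IDF component mentioned on the slide is left out, and the function names are hypothetical.

```python
TAGGED = "Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP married/VBN to/TO ?/."


def parse_tagged(tagged):
    """Split 'word/TAG' tokens into (word, tag) pairs."""
    return [tuple(token.rsplit("/", 1)) for token in tagged.split()]


def proper_noun_runs(pairs):
    """Merge consecutive NNP/NNPS tokens into candidate entity names."""
    runs, current = [], []
    for word, tag in pairs:
        if tag.startswith("NNP"):
            current.append(word)
        elif current:
            runs.append(" ".join(current))
            current = []
    if current:
        runs.append(" ".join(current))
    return runs


if __name__ == "__main__":
    print(proper_noun_runs(parse_tagged(TAGGED)))
```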
Step 3: Determine answer type
Rules-based.
Who is the daughter of Bill Clinton married to?
(PERSON)
Query Pre-Processing
(Question Analysis)
Step 4: Dependency parsing
dep(married-8, Who-1)
auxpass(married-8, is-2)
det(daughter-4, the-3)
nsubjpass(married-8, daughter-4)
prep(daughter-4, of-5)
nn(Clinton-7, Bill-6)
pobj(of-5, Clinton-7)
root(ROOT-0, married-8)
xcomp(married-8, to-9)
Query Pre-Processing
(Question Analysis)
Step 5: Determine Partial Ordered Dependency Structure
(PODS)
Rules-based.
Remove stop words.
Merge words into entities.
Reorder structure from core entity position.
Query Pre-Processing
(Question Analysis)
(INSTANCE)
ANSWER
TYPE
QUESTION FOCUS
Bill Clinton daughter married to
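The three PODS rules can be approximated as follows. This is a sketch only: the stop-word list is an assumption chosen so that "to" survives, and the reordering simply moves the core entity to the front and keeps the remaining terms in their original order instead of walking the dependency structure.

```python
# Assumed stop-word list; "to" is kept on purpose so "married to" survives.
STOP_WORDS = {"who", "is", "the", "of"}


def build_pods(question, core_entity, stop_words=STOP_WORDS):
    """Return the query terms ordered from the core entity outwards."""
    words = question.rstrip("?").split()
    entity_words = core_entity.split()

    # Merge the words of the core entity into a single term first, so that a
    # stop word inside an entity name would not be dropped.
    merged, i = [], 0
    while i < len(words):
        if words[i:i + len(entity_words)] == entity_words:
            merged.append(core_entity)
            i += len(entity_words)
        else:
            merged.append(words[i])
            i += 1

    # Remove stop words, then reorder: core entity first, the rest in place.
    rest = [w for w in merged if w != core_entity and w.lower() not in stop_words]
    return [core_entity] + rest


if __name__ == "__main__":
    pods = build_pods("Who is the daughter of Bill Clinton married to?", "Bill Clinton")
    print(" ".join(pods))
```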
Question Analysis
Query Features
Bill Clinton daughter married to
(INSTANCE) (PREDICATE) (PREDICATE) Query Features
PODS
Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
(INSTANCE) (PREDICATE) (PREDICATE) Query Features
Query Plan
 (1) INSTANCE SEARCH (Bill Clinton)
 (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
 (3) e1 <- NAVIGATE (Bill Clinton, p1)
 (4) p2 <- SEARCH PREDICATE (e1, married to)
 (5) e2 <- NAVIGATE (e1, p2)
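The five operations of the query plan can be sketched over a tiny in-memory graph. Everything here is illustrative and not Treo's implementation: the graph holds only the triples shown in the case study, the relatedness table is a stand-in for a distributional model (the scores for "daughter" follow the case-study figures, with 0.004 assumed to belong to :religion, while the score for "married to" is invented), and the function names are hypothetical.

```python
# Toy Linked Data graph: subject -> {predicate: object}.
GRAPH = {
    ":Bill_Clinton": {
        ":child": ":Chelsea_Clinton",
        ":religion": ":Baptists",
        ":almaMater": ":Yale_Law_School",
    },
    ":Chelsea_Clinton": {":spouse": ":Mark_Mezvinsky"},
}

# Stand-in for a distributional relatedness measure.
SEM_REL = {
    ("daughter", ":child"): 0.054,
    ("daughter", ":religion"): 0.004,   # assumed pairing for the 0.004 score
    ("daughter", ":almaMater"): 0.001,
    ("married to", ":spouse"): 0.05,    # invented for illustration
}


def instance_search(label):
    """INSTANCE SEARCH: resolve a label to an entity in the graph."""
    uri = ":" + label.replace(" ", "_")
    return uri if uri in GRAPH else None


def search_predicate(entity, term):
    """SEARCH PREDICATE: pick the entity's predicate most related to the term."""
    candidates = GRAPH.get(entity, {})
    return max(candidates, key=lambda p: SEM_REL.get((term, p), 0.0), default=None)


def navigate(entity, predicate):
    """NAVIGATE: follow the predicate to the next pivot entity."""
    return GRAPH[entity][predicate]


def run_plan(instance_label, predicate_terms):
    """Execute instance search, then alternate predicate search and navigation."""
    entity = instance_search(instance_label)
    for term in predicate_terms:
        if entity is None:
            return None
        predicate = search_predicate(entity, term)
        if predicate is None:
            return None
        entity = navigate(entity, predicate)
    return entity


if __name__ == "__main__":
    print(run_plan("Bill Clinton", ["daughter", "married to"]))
```

Each step only has to rank the handful of predicates attached to the current pivot entity, which is why relatedness scores that are small in absolute terms are still enough to pick the right edge.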
Instance Search
Bill Clinton daughter married to
:Bill_Clinton
Query:
Linked
Data:
Instance Search
Predicate Search
Bill Clinton daughter married to
:Bill_Clinton
Query:
Linked
Data:
:Chelsea_Clinton
:child
:Baptists
:religion
:Yale_Law_School
:almaMater
...
(PIVOT ENTITY)
(ASSOCIATED
TRIPLES)
Predicate Search
Bill Clinton daughter married to
:Bill_Clinton
Query:
Linked
Data:
:Chelsea_Clinton
:child
:Baptists
:religion
:Yale_Law_School
:almaMater
...
sem_rel(daughter,child)=0.054
sem_rel(daughter,religion)=0.004
sem_rel(daughter,alma mater)=0.001
Which properties are semantically related to ‘daughter’?
Navigate
Bill Clinton daughter married to
:Bill_Clinton
Query:
Linked
Data:
:Chelsea_Clinton
:child
(PIVOT ENTITY)
Predicate Search
Bill Clinton daughter married to
:Bill_Clinton
Query:
Linked
Data:
:Chelsea_Clinton
:child
(PIVOT ENTITY)
:Mark_Mezvinsky
:spouse
Results
Evaluation
 102 natural language queries (Test Collection: QALD 2011).
 Avg. query execution time: 1.52 s (simple queries) – 8.53 s
(all queries).
Treo Answers Jeopardy Queries (Video)
http://bit.ly/1hWcch9
Hybrid unstructured & structured
Sydney's dad, Jack, was a CIA double agent working against SD-6 on this
Jennifer Garner show.
Core Principles
 Semantic best-effort
 Dialog & user disambiguation
 Pay-as-you-go data integration
 Simplicity of use
 Franklin et al. (2005): From Databases to Dataspaces.
 Helland (2011): If You Have Too Much Data, then “Good
Enough” Is Good Enough.
Take-away message
 There are approaches that can be used today to cope with
data variety in the Big Data era
 Coping with data variety demands a multi-disciplinary
perspective and a new infrastructure
- Knowledge Representation, IR and Natural Language Processing
 Semantics at scale as a central concern
 You can build your own IBM Watson-like application!
 Great opportunity for new solutions and for being a pioneer
andre.freitas – at – deri.org

Coping with Data Variety in the Big Data Era: The Semantic Computing Approach

  • 1.
    Coping with DataVariety in the Big Data Era: The Semantic Computing Approach André Freitas Insight Centre for Data Analytics Rio Big Data Meetup (June 2014)
  • 2.
    Outline  Shift inthe Information Systems Landscape  Semantic Computing  Semantics Technologies that Work Today: Data Creation  Semantics Technologies that Work Today: Data Consumption  Case Study: Treo QA System  Conclusions
  • 3.
    Shift in theInformation Systems Landscape
  • 4.
    Big Data  Vision:More complete data-based picture of the world for systems and users.
  • 5.
    Big Data Dimensions Volume  Velocity  Variety
  • 6.
    Big Data Dimensions Volume  Velocity  Variety  Veracity  Value
  • 7.
    Big Data Definitions 7 DataVariety What is Big Data?
  • 8.
    Cost of MakingSense of It “A lot of Big Data is a lot of small data put together.” “Most of Big Data is not a uniform big block.” “Each data piece is very small and very messy, and a lot of what we are doing there is dealing with that variety.”
  • 9.
    Cost of MakingSense of It “It is more about the rate of change, the amount and the resources that you need to deal with it.”“If the programming effort per amount of high quality data is really high, the data is big in the sense of high cost to produce new information.” “Big Data seems to be about addressing challenges of scale, in terms of how fast things are coming out at you versus how much it costs to get value out of what you already have.”
  • 10.
    Cost of MakingSense of It “You can have Big Data challenges not only because you have PBs of data but because data is incredibly varied and therefore consumes a lot of resources to make sense of it.”
  • 11.
    Cost of MakingSense of It “The speed in which data is generated and the speed in which it needs to be processed in order to use it effectively.”
  • 12.
    “Schema” Growth  Heterogeneous,complex and large-scale databases.  Very-large and dynamic “schemas”. 10s-100s attributes 1,000s-1,000,000s attributes circa 2000 circa 2014
  • 13.
    Semantic Heterogeneity  Decentralizedcontent generation.  Multiple perspectives (conceptualizations) of the reality.  Ambiguity, vagueness, inconsistency.
  • 16.
    Data variety + Dataquality - Data Programs Full data coverage Full automation
  • 17.
    Structure level Unstructured DataStructured Data Consistent Comparable Processable Easy to generate Easy to analyze Semantic Computing
  • 18.
  • 19.
    The Futurist Perspective AI vision  Full automation  Perfect natural language interaction
  • 20.
    The Realist Perspective Whatcan be achieved with semantic computing today?
  • 21.
  • 22.
  • 23.
  • 24.
  • 25.
  • 26.
  • 27.
    (Some) Challenges inSemantics Knowledge Representation Model Reasoning Large, inconsistent, heterogeneous Data Expected Result: intelligent behavior Semantic flexibility, predictive power, automation ... Acquisition, Learning There is an economical model behind each element!
  • 28.
    Meaning  Word meaningis usually represented in terms of some formal, symbolic structure, either external or internal to the word  External structure - Associations between different concepts  Internal structure - Feature (property, attribute) lists  The semantic properties of a word are derived from the formal structure of its representation - e.g. Inference algorithm, etc. Semantics = Meaning representation model (data) + inference model
  • 29.
    Formal Representation ofMeaning (Problems)  Different meanings - bank (financial institution) bank (river side)  Meaning variation in context  Meaning evolution  Ambiguity, vagueness, inconsistency
  • 30.
    Formal Representation ofMeaning (Problems)  Different meanings - bank (financial institution) bank (river side)  Meaning variation in context - clever politician, clever tycoon  Meaning evolution  Ambiguity, vagueness, inconsistency Word meaning acquisition & representation Lack of flexibility Scalability
  • 31.
     Most semanticmodels have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.  If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements. Sahlgren, 2013 Formal World Real World Baroni et al. 2013 Semantics for a Complex World
  • 32.
  • 33.
    Data Creation  Humaninteraction element (Data Curation)  Semantic representation  Information extraction
  • 34.
  • 35.
  • 36.
  • 37.
  • 38.
    Defining Attributes &Relationships
  • 39.
    Data curation elements Data curation platforms - Spreadsheets - Open Refine - Karma  Algorithmic curation - Validation & Annotation robots  Curation at source - Minimal Information Models (MIRIAM)  Data curation roles  Crowdsourcing
  • 40.
    Standardized Data Models Provides a minimum level of data interoperability  Examples: - Resource Description Framework (RDF) - Linked Comma Separated Value (CSV) - Javascript Object Notation (JSON)
  • 41.
    Resource Description Framework(RDF)  Graph data model  Entity-centric data integration  Facilitates decentralized content generation  URIs for concept identfiers  Associated structured query language (SPARQL)
  • 42.
    Resource Description Framework(RDF) dbpedia:General_Electric "US$ 147.3 billion"@en dbp:revenue rdf:type dbo:Organization dbpedia:Fairfield, Connecticutdbp:locationCity
  • 43.
    Resource Description Framework(RDF) dbpedia:General_Electric "US$ 147.3 billion"@en dbp:revenue rdf:type dbo:Organization sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal … dbpedia:Fairfield, Connecticutdbp:locationCity
  • 44.
    Resource Description Framework(RDF) dbpedia:General_Electric "US$ 147.3 billion"@en dbp:revenue rdf:type dbo:Organization sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal … dbpedia:Fairfield, Connecticutdbp:locationCity owl:sameAs
  • 45.
    Resource Description Framework(RDF) dbpedia:General_Electric "US$ 147.3 billion"@en dbp:revenue rdf:type dbo:Organization sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal … dbpedia:Fairfield, Connecticutdbp:locationCity geo:Fairfield "N 41° 13' 29'' geo:latitude owl:sameAs
  • 46.
    Resource Description Framework(RDF) dbpedia:General_Electric "US$ 147.3 billion"@en dbp:revenue rdf:type dbo:Organization sec:General_Electric ifrs:CashFlowsFromUsedInOperationsTotal … dbpedia:Fairfield, Connecticutdbp:locationCity geo:Fairfield "N 41° 13' 29'' geo:latitude owl:sameAs owl:sameAs
  • 47.
    Representation  Rules (SWRL,RIF)  Ontology (OWL) – Logical Constraints  Taxonomy (RDFS) – Classes in sub-/super-class hierarchy  Relational (RDF) – Attributes – Associations  Dictionary – Terms and definitions Increasing Semantic Representation
  • 48.
  • 49.
  • 50.
  • 52.
    Open Data  Common-senseKnowledge Base  Domain-specific Knowledge Base  Entity reference system
  • 53.
     DBpedia - http://dbpedia.org/ YAGO - http://www.mpi-inf.mpg.de/yago-naga/yago/  Freebase - http://www.freebase.com/  Wikipedia dumps - http://dumps.wikimedia.org/  ConceptNet - http:// conceptnet5.media.mit.edu/  Geonames - http://www.geonames.org/  Common Crawl - http://commoncrawl.org/ Open Data
  • 54.
    Standardized Vocabularies  Openconceptual models to be reused across different datasets  Provides conceptual model level interoperability  Useful to be used for modelling recurrent domains of discourse
  • 55.
    Standardized Vocabularies  FOAF SIOC  COGS  Data Cube Vocabulary  PROV-O  DCTERMS  WGS84 Geo Positioning  SDMX  QUDT  SSN  Schema.org  VoID  Data Catalog  ... http://lov.okfn.org/dataset/lov/
  • 56.
    Entity Recognition &Linking  Align terms in unstructured text to entities in a structured KB  Integrates structured to unstructured data
  • 57.
    Entity Recognition &Linking  Align terms in unstructured text to entities in a structured KB  Integrates structured to unstructured data
  • 58.
    Entity Recognition &Linking  Align terms in unstructured text to entities in a structured KB  Integrates structured to unstructured data  Can be used to support semantic search  Provides a first level of structure to unstructured data  Exploratory browsing
  • 59.
    Entity Recognition &Linking  Example: “GE has also been implicated in the creation of toxic waste.”
  • 60.
    Entity Recognition &Linking  Example: “GE has also been implicated in the creation of toxic waste.”
  • 61.
    Entity Recognition &Linking  Example: “GE has also been implicated in the creation of toxic waste.” <http://dbpedia.org/resource/General_Electric> yago:ConglomerateCompanies yago:MedicalEquipmentManufacturers yago:CompaniesListedOnTheNewYorkStockExchange
  • 62.
    Entity Recognition &Linking  Example: “GE has also been implicated in the creation of toxic waste.” <http://dbpedia.org/resource/Toxic_waste>
  • 63.
     DBpedia Spotlight -http://spotlight.dbpedia.org  NERD (Named Entity Recognition and Disambiguation) - http://nerd.eurecom.fr/  Stanford Named Entity Recognizer - http://nlp.stanford.edu/software/CRF-NER.shtml Entity Recognition/Linking
  • 64.
    Syntactic Parsers GE/NNP has/VBZalso/RB been/VBN implicated/VBN in/IN the/DT creation/NN of/IN toxic/JJ waste/NN
  • 65.
     Stanford parser -http://nlp.stanford.edu/software/lex-parser.shtml - Languages: English, German, Chinese, and others  MALT - http://www.maltparser.org/ - Languages (pre-trained): English, French, Swedish  C&C Parser - http://svn.ask.it.usyd.edu.au/trac/candc Parsers
  • 66.
     GATE (GeneralArchitecture for Text Engineering) - http://gate.ac.uk/  NLTK (Natural Language Toolkit) - http://nltk.org/  Stanford NLP - http://www-nlp.stanford.edu/software/index.shtml  LingPipe - http://alias-i.com/lingpipe/index.html Text Processing Tools
  • 67.
    Database Representation  Easyevolution of schemas (schema-less)  Graph Databases - OpenLink Virtuoso - Neo4J - Transforming Lucene into a Graph Database  NoSQL ...
  • 68.
     Apache UnstructuredInformation Management Architecture (UIMA) - Component software architecture for the analysis of unstructured data - http://uima.apache.org/  NLP Interchange Format (NIF) - RDF & OWL-based - http://persistence.uni-leipzig.org/nlp2rdf/ NLP Integration
  • 69.
    Relation/Graph Extraction  Reverb -http://reverb.cs.washington.edu/  Graphia - http://graphia.dcc.ufrj.br/
  • 70.
    Relation/Graph Extraction In 2002,GE acquired the wind power assets of Enron.In 2002 GE acquired the wind power assets of Enron
  • 71.
    Relation/Graph Extraction General ElectricCompany, or GE , is an American multinational conglomerate corporation incorporated in Schenectady , New York
  • 72.
    Semantics Technologies that WorkToday Data Consumption
  • 73.
    Vector Space Models Representation useful for approximate search  Search over structured and unstructured data  Construction of approximate semantic models
  • 74.
  • 75.
     Lucene &Solr - http://lucene.apache.org/  Terrier - http://terrier.org/ Indexing & Search Engines
  • 76.
    Distributional Hypothesis “Words occurringin similar (linguistic) contexts tend to be semantically similar”  He filled the wampimuk with the substance, passed it around and we all drunk some  We found a little, hairy wampimuk sleeping behind the tree
  • 77.
    Distributional Semantic Models(DSMs)  Computational models that build contextual semantic representations from corpus data  Semantic context is represented by a vector  Vectors are obtained through the statistical analysis of the linguistic contexts of a word  Salience of contexts (cf. context weighting scheme)  Semantic similarity/relatedness as the core operation over the model
  • 78.
    DSMs as CommonsenseReasoning Commonsense is here θ car dog cat bark run leash
  • 79.
  • 80.
  • 81.
  • 82.
    DSMs as CommonsenseReasoning θ car dog cat bark run leash ... vs. Semantic best-effort
  • 83.
  • 84.
     Amtera Esprit(distributional semantic relatedness) - http://www.mashape.com/amtera/esa-semantic-relatedness  WS4J (Java API for several semantic relatedness algorithms) - https://code.google.com/p/ws4j/  SecondString (string matching) - http://secondstring.sourceforge.net  S-space (distributional semantics framework) - https://github.com/fozziethebeat/S-Space String similarity and semantic relatedness
  • 85.
     WordNet - http://wordnet.princeton.edu/ Wiktionary - http://www.wiktionary.org/  FrameNet - https://framenet.icsi.berkeley.edu/fndrupal/  VerbNet - http://verbs.colorado.edu/~mpalmer/projects/verbnet.html  BabelNet - http://babelnet.org/ Lexical Resources
  • 86.
    Entity Recognition & Linking Distributional Semantics Relation/Graph Extraction Internal Datasets Reference Corpora Semantic Pipeline Vocabulary Management Semantic Search &QA Crawling & Indexing Open Data Vocabularies, Taxonomies, Lexical Resources Internal Documents Knowledge Graph Management Knowledge Graph Data Curation Platform Crowdsourcing Services Applications User feedback Provenance Management
  • 87.
  • 88.
    Querying your KnowledgeGraph Gaelic: direction
  • 89.
  • 90.
  • 91.
    Vocabulary Problem Query: Whois the daughter of Bill Clinton married to? Possible representations = Commonsense Knowledge Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
  • 92.
    Vocabulary Problem Query: Whois the daughter of Bill Clinton married to? Semantic approximationSemantic Gap Possible representations = Commonsense Knowledge Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
  • 93.
    Core Principles  Minimizethe impact of Ambiguity, Vagueness, Synonymy.  Address the simplest matchings first (heuristics).  Semantic Relatedness as a primitive operation.  Distributional semantics as commonsense knowledge.
  • 94.
    Step 1: POSTagging Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP married/VBN to/TO ?/. Query Pre-Processing (Question Analysis)
  • 95.
    Step 2: CoreEntity Recognition Rules-based: POS Tag + TF/IDF Who is the daughter of Bill Clinton married to? (PROBABLY AN INSTANCE) Query Pre-Processing (Question Analysis)
  • 96.
    Step 3: Determine answer type Rules-based. Who is the daughter of Bill Clinton married to? (PERSON) Query Pre-Processing (Question Analysis)
  • 97.
    Step 4: Dependency parsing dep(married-8, Who-1) auxpass(married-8, is-2) det(daughter-4, the-3) nsubjpass(married-8, daughter-4) prep(daughter-4, of-5) nn(Clinton-7, Bill-6) pobj(of-5, Clinton-7) root(ROOT-0, married-8) xcomp(married-8, to-9) Query Pre-Processing (Question Analysis)
  • 98.
    Step 5: Determine Partial Ordered Dependency Structure (PODS) Rules-based. Remove stop words. Merge words into entities. Reorder structure from core entity position. Query Pre-Processing (Question Analysis) (INSTANCE) ANSWER TYPE QUESTION FOCUS Bill Clinton daughter married to
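One possible reading of the three PODS rules, as a minimal Python sketch. It assumes the structure is obtained by walking from the core entity up the Step 4 dependency tree to the root. The stop-word list, the choice to merge `nn` and `xcomp` dependents into their head, and the function names are illustrative assumptions, not the rules Treo actually uses.

```python
# Dependency edges from the Step 4 parse, as (relation, head, dependent).
EDGES = [
    ("dep", "married-8", "Who-1"),
    ("auxpass", "married-8", "is-2"),
    ("det", "daughter-4", "the-3"),
    ("nsubjpass", "married-8", "daughter-4"),
    ("prep", "daughter-4", "of-5"),
    ("nn", "Clinton-7", "Bill-6"),
    ("pobj", "of-5", "Clinton-7"),
    ("root", "ROOT-0", "married-8"),
    ("xcomp", "married-8", "to-9"),
]

# Assumption: a tiny stop-word list that is enough for this one query.
STOP_WORDS = {"who", "is", "the", "of"}
# Assumption: dependents with these relations are merged into their head.
MERGE_RELATIONS = {"nn", "xcomp"}


def word(node):
    """'Clinton-7' -> 'Clinton'."""
    return node.rsplit("-", 1)[0]


def position(node):
    """'Clinton-7' -> 7."""
    return int(node.rsplit("-", 1)[1])


def build_pods(edges, core_entity_head):
    """Walk from the core entity towards the root, dropping stop words."""
    head_of = {dep: head for _, head, dep in edges}
    children = {}
    for rel, head, dep in edges:
        children.setdefault(head, []).append((rel, dep))

    def label(node):
        # Merge words into one entity/phrase, keeping sentence order.
        parts = [node] + [
            dep for rel, dep in children.get(node, []) if rel in MERGE_RELATIONS
        ]
        return " ".join(word(n) for n in sorted(parts, key=position))

    pods = []
    node = core_entity_head
    while node != "ROOT-0":
        if word(node).lower() not in STOP_WORDS:
            pods.append(label(node))
        node = head_of[node]
    return pods


pods = build_pods(EDGES, "Clinton-7")
```

For the example query, `pods` holds the core entity first, followed by the two predicate terms in the order in which they will be resolved.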
  • 99.
    Question Analysis Query Features Bill Clinton daughter married to (INSTANCE) (PREDICATE) (PREDICATE) Query Features PODS
  • 100.
    Query Plan Map query features into a query plan. A query plan contains a sequence of core operations. (INSTANCE) (PREDICATE) (PREDICATE) Query Features Query Plan  (1) INSTANCE SEARCH (Bill Clinton)  (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)  (3) e1 <- NAVIGATE (Bill Clinton, p1)  (4) p2 <- SEARCH PREDICATE (e1, married to)  (5) e2 <- NAVIGATE (e1, p2)
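The mapping from query features to a query plan can be sketched in a few lines of Python. This is an illustration under assumptions: the tuple encoding of operations and the name `build_query_plan` are hypothetical, and the sketch only covers the pattern shown on the slide (one INSTANCE followed by PREDICATEs).

```python
def build_query_plan(features):
    """Turn ordered query features into a sequence of core operations.

    features: list of (text, kind) pairs, kind in {"INSTANCE", "PREDICATE"}.
    The operation tuples are a hypothetical encoding for illustration.
    """
    (pivot_text, pivot_kind), *rest = features
    if pivot_kind != "INSTANCE":
        raise ValueError("this sketch only handles plans that start at an instance")

    plan = [("INSTANCE_SEARCH", pivot_text)]
    current = pivot_text
    for i, (text, kind) in enumerate(rest, start=1):
        if kind != "PREDICATE":
            raise ValueError("this sketch only handles predicates after the pivot")
        # p_i <- SEARCH PREDICATE (current, text)
        plan.append(("SEARCH_PREDICATE", f"p{i}", current, text))
        # e_i <- NAVIGATE (current, p_i)
        plan.append(("NAVIGATE", f"e{i}", current, f"p{i}"))
        current = f"e{i}"
    return plan


plan = build_query_plan([
    ("Bill Clinton", "INSTANCE"),
    ("daughter", "PREDICATE"),
    ("married to", "PREDICATE"),
])
```

Each PREDICATE feature contributes one search and one navigation step, so the example yields the five operations listed on the slide.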
  • 101.
    Instance Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton
  • 102.
    Predicate Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton (PIVOT ENTITY) :child :Chelsea_Clinton :religion :Baptists :almaMater :Yale_Law_School ... (ASSOCIATED TRIPLES)
  • 103.
    Predicate Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton :religion :Baptists :almaMater :Yale_Law_School ... Which properties are semantically related to ‘daughter’? sem_rel(daughter,child)=0.054
  • 104.
    Predicate Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton :religion :Baptists :almaMater :Yale_Law_School ... Which properties are semantically related to ‘daughter’? sem_rel(daughter,child)=0.054 sem_rel(daughter,religion)=0.004
  • 105.
    Predicate Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton :religion :Baptists :almaMater :Yale_Law_School ... Which properties are semantically related to ‘daughter’? sem_rel(daughter,child)=0.054 sem_rel(daughter,religion)=0.004 sem_rel(daughter,alma mater)=0.001
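The predicate search step amounts to ranking the pivot entity's predicates by their semantic relatedness to the query term. A minimal Python sketch follows, assuming a lookup table in place of a real distributional relatedness measure. The scores are the ones shown on the slides, with 0.004 assumed to belong to `religion`; the helper names are hypothetical.

```python
import re

# Scores from the slides for the query term "daughter".
# Assumption: the 0.004 score belongs to "religion".
SLIDE_SCORES = {
    ("daughter", "child"): 0.054,
    ("daughter", "religion"): 0.004,
    ("daughter", "alma mater"): 0.001,
}


def slide_sem_rel(term, label):
    """Stand-in for a distributional semantic relatedness measure."""
    return SLIDE_SCORES.get((term, label), 0.0)


def predicate_label(predicate):
    """':almaMater' -> 'alma mater' (split camelCase, lowercase)."""
    name = predicate.lstrip(":")
    return re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name).lower()


def rank_predicates(term, predicates, sem_rel):
    """Return (predicate, score) pairs, most related first."""
    scored = [(p, sem_rel(term, predicate_label(p))) for p in predicates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_predicates(
    "daughter", [":religion", ":almaMater", ":child"], slide_sem_rel
)
best_predicate = ranking[0][0]
```

Because `sem_rel` is passed in as a function, the lookup table can be swapped for any relatedness service without touching the ranking logic.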
  • 106.
    Navigate Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton
  • 107.
    Navigate Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
  • 108.
    Predicate Search Query: Bill Clinton daughter married to Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY) :spouse :Mark_Mezvinsky
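The whole walk-through (instance search, then alternating predicate search and navigation) can be replayed over the four triples that appear on the slides. The Python sketch below is a toy under stated assumptions: the triple store is an in-memory list, instance search is a simple string rewrite, and `HYPOTHETICAL_BEST_MATCH` is a hard-coded stand-in for the semantic relatedness ranking.

```python
# The triples shown on the slides, as (subject, predicate, object).
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

# Assumption: stand-in for the relatedness-based predicate ranking.
HYPOTHETICAL_BEST_MATCH = {"daughter": ":child", "married to": ":spouse"}


def instance_search(text):
    """Toy instance search: 'Bill Clinton' -> ':Bill_Clinton'."""
    return ":" + text.replace(" ", "_")


def search_predicate(entity, term):
    """Pick the predicate of `entity` that best matches `term`, if any."""
    candidates = {p for s, p, _ in TRIPLES if s == entity}
    best = HYPOTHETICAL_BEST_MATCH.get(term)
    return best if best in candidates else None


def navigate(entity, predicate):
    """Follow `predicate` from `entity` to the next pivot entities."""
    return [o for s, p, o in TRIPLES if s == entity and p == predicate]


def answer(pivot_text, predicate_terms):
    """Resolve the pivot, then search and navigate once per predicate term."""
    entities = [instance_search(pivot_text)]
    for term in predicate_terms:
        next_entities = []
        for entity in entities:
            predicate = search_predicate(entity, term)
            if predicate is not None:
                next_entities.extend(navigate(entity, predicate))
        entities = next_entities
    return entities


result = answer("Bill Clinton", ["daughter", "married to"])
```

Each navigation step produces the pivot entities for the next predicate search, which is why the second search only has to look at the predicates of `:Chelsea_Clinton` rather than the whole dataset.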
  • 109.
  • 110.
    Evaluation  102 naturallanguage queries (Test Collection: QALD 2011).  Avg. query execution time: 1.52 s (simple queries) – 8.53 s (all queries).
  • 111.
    Treo Answers Jeopardy Queries (Video) http://bit.ly/1hWcch9
  • 112.
    Hybrid unstructured & structured Sydney's dad, Jack, was a CIA double agent working against SD-6 on this Jennifer Garner show.
  • 113.
    Core Principles  Semanticbest-effort  Dialog & user disambiguation  Pay-as-you-go data integration  Simplicity of use  Franklin et al. (2005): From Databases to Dataspaces.  Helland (2011): If You Have Too Much Data, then “Good Enough” Is Good Enough.
  • 114.
    Take-away message  Thereare approaches that can be used today to cope with data variety in the Big Data era  Coping with data variety demands a multi-disciplinary perspective and a new infrastructure - Knowledge Representation, IR and Natural Language Processing  Semantics at scale as a central concern  You can build your own IBM Watson-like application!  Great opportunity for new solutions and for being a pioneer
  • 115.
