Coping with Data Variety in the Big Data Era: The Semantic Computing Approach

André Freitas
Insight Centre for Data Analytics
Rio Big Data Meetup (June 2014)

Outline
 Shift in the Information Systems Landscape
 Semantic Computing
 Semantics Technologies that Work Today: Data Creation
 Semantics Technologies that Work Today: Data Consumption
 Case Study: Treo QA System
 Conclusions

Shift in the Information
Systems Landscape

Big Data
 Vision: More complete data-based picture of the world for
systems and users.

Big Data Dimensions
 Volume
 Velocity
 Variety
 Veracity
 Value

Big Data Definitions
Data Variety
What is Big Data?

Cost of Making Sense of It
“A lot of Big Data is a lot of small data put together.”
“Most of Big Data is not a
uniform big block.”
“Each data piece is very small and very
messy, and a lot of what we are doing
there is dealing with that variety.”

“It is more about the rate of change, the amount and the
resources that you need to deal with it.”

“If the programming effort per amount of
high quality data is really high, the data is
big in the sense of high cost to produce
new information.”
“Big Data seems to be about addressing challenges
of scale, in terms of how fast things are coming
out at you versus how much it costs to get value
out of what you already have.”

“You can have Big Data challenges not only
because you have PBs of data but because data
is incredibly varied and therefore consumes a
lot of resources to make sense of it.”

“The speed in which data is generated and the
speed in which it needs to be processed in
order to use it effectively.”

“Schema” Growth
 Heterogeneous, complex and large-scale databases.
 Very-large and dynamic “schemas”.
circa 2000: 10s-100s attributes
circa 2014: 1,000s-1,000,000s attributes

Semantic Heterogeneity
 Decentralized content generation.
 Multiple perspectives (conceptualizations) of reality.
 Ambiguity, vagueness, inconsistency.

[Figure labels: Data variety +, Data quality -, Data, Programs, Full data coverage, Full automation]

Structure level
Unstructured Data (easy to generate) vs. Structured Data (easy to analyze)
Consistent, Comparable, Processable

Semantic Computing

The Futurist Perspective
 AI vision
 Full automation
 Perfect natural language
interaction

The Realist Perspective
What can be achieved with semantic computing today?

(Some) Challenges in Semantics
Knowledge Representation Model
Reasoning
Large, inconsistent,
heterogeneous
Data
Expected Result: intelligent behavior
Semantic flexibility, predictive power, automation ...
Acquisition, Learning
There is an economic model behind each element!

Meaning
 Word meaning is usually represented in terms of some formal,
symbolic structure, either external or internal to the word
 External structure
- Associations between different concepts
 Internal structure
- Feature (property, attribute) lists
 The semantic properties of a word are derived from the formal
structure of its representation
- e.g. Inference algorithm, etc.
Semantics = Meaning representation model (data) +
inference model

Formal Representation of Meaning
(Problems)
 Different meanings
- bank (financial institution), bank (river side)
 Meaning variation in context
- clever politician, clever tycoon
 Meaning evolution
 Ambiguity, vagueness, inconsistency
Word meaning acquisition & representation
Lack of flexibility
Scalability

Semantics for a Complex World
 Most semantic models have dealt with particular types of
constructions, and have been carried out under very simplifying
assumptions, in true lab conditions.
 If these idealizations are removed it is not clear at all that modern
semantics can give a full account of all but the simplest
models/statements.
Sahlgren, 2013
Formal World Real World
Baroni et al. 2013

Semantics Technologies that
Work Today
Data Creation

Data Creation
 Human interaction element (Data Curation)
 Semantic representation
 Information extraction

Entity-Centric Content Generation

Defining Attributes & Relationships

Data curation elements
 Data curation platforms
- Spreadsheets
- Open Refine
- Karma
 Algorithmic curation
- Validation & Annotation robots
 Curation at source
- Minimal Information Models (MIRIAM)
 Data curation roles
 Crowdsourcing

Standardized Data Models
 Provides a minimum level of data interoperability
 Examples:
- Resource Description Framework (RDF)
- Linked Comma Separated Value (CSV)
- JavaScript Object Notation (JSON)

Resource Description Framework (RDF)
 Graph data model
 Entity-centric data integration
 Facilitates decentralized content generation
 URIs for concept identifiers
 Associated structured query language (SPARQL)

[Figure: RDF graph around dbpedia:General_Electric, extended step by step]
dbpedia:General_Electric  rdf:type  dbo:Organization
dbpedia:General_Electric  dbp:revenue  "US$ 147.3 billion"@en
dbpedia:General_Electric  dbp:locationCity  dbpedia:Fairfield, Connecticut
dbpedia:General_Electric  owl:sameAs  sec:General_Electric
sec:General_Electric  ifrs:CashFlowsFromUsedInOperationsTotal  …
dbpedia:Fairfield, Connecticut  owl:sameAs  geo:Fairfield
geo:Fairfield  geo:latitude  "N 41° 13' 29''
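
The graph above can be sketched as a set of (subject, predicate, object) triples. This is a minimal illustration in plain Python rather than a real RDF library; the helper names `match` and `same_as` are hypothetical, and only a few of the triples from the figure are included.

```python
# Toy triple store: each fact is a (subject, predicate, object) tuple.
# The triples are taken from the figure; the helpers are illustrative only.
TRIPLES = {
    ("dbpedia:General_Electric", "rdf:type", "dbo:Organization"),
    ("dbpedia:General_Electric", "dbp:revenue", "US$ 147.3 billion"),
    ("dbpedia:General_Electric", "owl:sameAs", "sec:General_Electric"),
    ("geo:Fairfield", "geo:latitude", "N 41° 13' 29''"),
}


def match(triples, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def same_as(triples, entity):
    """Identifiers linked to `entity` by owl:sameAs (one hop, both directions)."""
    linked = {entity}
    for s, p, o in triples:
        if p == "owl:sameAs":
            if s == entity:
                linked.add(o)
            if o == entity:
                linked.add(s)
    return linked
```

Following `owl:sameAs` is what lets descriptions published by different sources be merged around one entity, which is the entity-centric integration the RDF slide refers to.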

Representation
 Rules (SWRL, RIF)
 Ontology (OWL)
– Logical Constraints
 Taxonomy (RDFS)
– Classes in sub-/super-class hierarchy
 Relational (RDF)
– Attributes
– Associations
 Dictionary
– Terms and definitions
Increasing Semantic Representation

Linked Data
[Figure labels: HTTP request, RDF, JSON, SPARQL, R2RML, Relational Database]

http://dbpedia.org/resource/Jupiter

Open Data
 Common-sense Knowledge Base
 Domain-specific Knowledge Base
 Entity reference system

Open Data
 DBpedia
- http://dbpedia.org/
 YAGO
- http://www.mpi-inf.mpg.de/yago-naga/yago/
 Freebase
- http://www.freebase.com/
 Wikipedia dumps
- http://dumps.wikimedia.org/
 ConceptNet
- http://conceptnet5.media.mit.edu/
 Geonames
- http://www.geonames.org/
 Common Crawl
- http://commoncrawl.org/

Standardized Vocabularies
 Open conceptual models that can be reused across different
datasets
 Provide interoperability at the conceptual-model level
 Useful for modelling recurrent domains of discourse

Standardized Vocabularies
 FOAF
 SIOC
 COGS
 Data Cube Vocabulary
 PROV-O
 DCTERMS
 WGS84 Geo Positioning
 SDMX
 QUDT
 SSN
 Schema.org
 VoID
 Data Catalog
 ...
http://lov.okfn.org/dataset/lov/

Entity Recognition & Linking
 Align terms in unstructured text to entities in a structured KB
 Integrates structured with unstructured data
 Can be used to support semantic search
 Provides a first level of structure to unstructured data
 Exploratory browsing

 Example:
“GE has also been implicated in the creation of toxic waste.”
- “GE”: <http://dbpedia.org/resource/General_Electric>
  yago:ConglomerateCompanies
  yago:MedicalEquipmentManufacturers
  yago:CompaniesListedOnTheNewYorkStockExchange
- “toxic waste”: <http://dbpedia.org/resource/Toxic_waste>
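
The linking step in the example can be approximated with a dictionary lookup. The sketch below is a deliberately naive illustration: the surface-form table and the function `link_entities` are hypothetical, and tools such as DBpedia Spotlight additionally use context to disambiguate, which is not attempted here.

```python
# Hypothetical surface-form dictionary mapping lowercase phrases to DBpedia URIs.
SURFACE_FORMS = {
    "ge": "http://dbpedia.org/resource/General_Electric",
    "general electric": "http://dbpedia.org/resource/General_Electric",
    "toxic waste": "http://dbpedia.org/resource/Toxic_waste",
}


def link_entities(text, lexicon=SURFACE_FORMS, max_len=3):
    """Greedy longest-match lookup of token n-grams in the lexicon."""
    tokens = [t.strip(".,;:!?\"“”") for t in text.split()]
    links, i = [], 0
    while i < len(tokens):
        # Try the longest phrase starting at position i first.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            uri = lexicon.get(phrase.lower())
            if uri:
                links.append((phrase, uri))
                i += n
                break
        else:
            i += 1  # no phrase starting here is in the lexicon
    return links
```

Calling `link_entities` on the example sentence returns the two (mention, URI) pairs shown above; an ambiguous mention such as "GE" would need context-based ranking in a real system.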

Entity Recognition/Linking
 DBpedia Spotlight
- http://spotlight.dbpedia.org
 NERD (Named Entity Recognition and Disambiguation)
- http://nerd.eurecom.fr/
 Stanford Named Entity Recognizer
- http://nlp.stanford.edu/software/CRF-NER.shtml

Syntactic Parsers
GE/NNP has/VBZ also/RB been/VBN implicated/VBN in/IN the/DT creation/NN of/IN
toxic/JJ waste/NN

Parsers
 Stanford parser
- http://nlp.stanford.edu/software/lex-parser.shtml
- Languages: English, German, Chinese, and others
 MALT
- http://www.maltparser.org/
- Languages (pre-trained): English, French, Swedish
 C&C Parser
- http://svn.ask.it.usyd.edu.au/trac/candc

Text Processing Tools
 GATE (General Architecture for Text Engineering)
- http://gate.ac.uk/
 NLTK (Natural Language Toolkit)
- http://nltk.org/
 Stanford NLP
- http://www-nlp.stanford.edu/software/index.shtml
 LingPipe
- http://alias-i.com/lingpipe/index.html

Database Representation
 Easy evolution of schemas (schema-less)
 Graph Databases
- OpenLink Virtuoso
- Neo4J
- Transforming Lucene into a Graph Database
 NoSQL ...

NLP Integration
 Apache Unstructured Information Management Architecture
(UIMA)
- Component software architecture for the analysis of unstructured data
- http://uima.apache.org/
 NLP Interchange Format (NIF)
- RDF & OWL-based
- http://persistence.uni-leipzig.org/nlp2rdf/

Relation/Graph Extraction
 Reverb
- http://reverb.cs.washington.edu/
 Graphia
- http://graphia.dcc.ufrj.br/

In 2002, GE acquired the wind power assets of Enron.

General Electric Company, or GE, is an American multinational conglomerate
corporation incorporated in Schenectady, New York

Semantics Technologies that
Work Today
Data Consumption

Vector Space Models
 Representation useful for approximate search
 Search over structured and unstructured data
 Construction of approximate semantic models

Vector Space Models
θ
http://en.wikipedia.org/wiki/General_Electric
General
Electric
...
“General Electric company”
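
The angle θ in the figure can be made concrete with a cosine similarity between term-frequency vectors. This is a minimal sketch, assuming whitespace tokenization and raw term counts; the function names and the sample texts are illustrative, and an engine such as Lucene uses more elaborate weighting.

```python
import math
from collections import Counter


def tf_vector(text):
    """Bag-of-words term-frequency vector (lowercased, whitespace tokens)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (1.0 = same direction)."""
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


query = tf_vector("General Electric company")
doc_a = tf_vector("General Electric is an American multinational conglomerate")
doc_b = tf_vector("toxic waste creation")
```

Here `cosine(query, doc_a)` is higher than `cosine(query, doc_b)` because the first document shares terms with the query, which is the basis of approximate search over a vector space.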

Indexing & Search Engines
 Lucene & Solr
- http://lucene.apache.org/
 Terrier
- http://terrier.org/

Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend
to be semantically similar”
 He filled the wampimuk with the substance, passed it
around and we all drunk some
 We found a little, hairy wampimuk sleeping behind the
tree

Distributional Semantic Models (DSMs)
 Computational models that build contextual semantic representations
from corpus data
 Semantic context is represented by a vector
 Vectors are obtained through the statistical analysis of the linguistic
contexts of a word
 Salience of contexts (cf. context weighting scheme)
 Semantic similarity/relatedness as the core operation over the model
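
The points above can be illustrated with a tiny co-occurrence model. This is a sketch under strong simplifying assumptions: a three-sentence toy corpus, a fixed context window, and raw counts with no context weighting scheme; `build_dsm` and `relatedness` are hypothetical names.

```python
import math
from collections import Counter, defaultdict


def build_dsm(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors


def relatedness(vectors, a, b):
    """Cosine similarity between the context vectors of two words."""
    u, v = vectors[a], vectors[b]
    dot = sum(u[t] * v[t] for t in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


# Toy corpus of content words only (an assumption made for readability).
CORPUS = ["dog bark run leash", "cat run leash purr", "car drive road fuel"]
DSM = build_dsm(CORPUS)
```

In this toy model "dog" and "cat" come out as related because they share the context "run", while "dog" and "car" share no context at all. A realistic model is built from a large reference corpus and applies a weighting scheme to the raw counts.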

DSMs as Commonsense Reasoning
[Figure: words such as car, dog, cat, bark, run, leash, ... in a vector space, with the angle θ between vectors; annotations: “Commonsense is here”, “Semantic best-effort”]

String similarity and semantic relatedness
 Amtera Esprit (distributional semantic relatedness)
- http://www.mashape.com/amtera/esa-semantic-relatedness
 WS4J (Java API for several semantic relatedness
algorithms)
- https://code.google.com/p/ws4j/
 SecondString (string matching)
- http://secondstring.sourceforge.net
 S-space (distributional semantics framework)
- https://github.com/fozziethebeat/S-Space

Lexical Resources
 WordNet
- http://wordnet.princeton.edu/
 Wiktionary
- http://www.wiktionary.org/
 FrameNet
- https://framenet.icsi.berkeley.edu/fndrupal/
 VerbNet
- http://verbs.colorado.edu/~mpalmer/projects/verbnet.html
 BabelNet
- http://babelnet.org/

[Figure: semantic pipeline architecture. Component labels:]
Entity Recognition & Linking
Distributional Semantics
Relation/Graph Extraction
Internal Datasets
Reference Corpora
Semantic Pipeline
Vocabulary Management
Semantic Search & QA
Crawling & Indexing
Open Data
Vocabularies, Taxonomies, Lexical Resources
Internal Documents
Knowledge Graph Management
Knowledge Graph
Data Curation Platform
Crowdsourcing Services
Applications
User feedback
Provenance Management

Querying your Knowledge Graph
Treo (Gaelic: direction)

Vocabulary Problem
Query: Who is the daughter of Bill Clinton married to?
Semantic approximation
Semantic Gap
Possible representations = Commonsense Knowledge
Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and
9,434,677 instances

Core Principles
 Minimize the impact of Ambiguity, Vagueness, Synonymy.
 Address the simplest matchings first (heuristics).
 Semantic Relatedness as a primitive operation.
 Distributional semantics as commonsense knowledge.

Query Pre-Processing (Question Analysis)
Step 1: POS Tagging
Who/WP is/VBZ the/DT daughter/NN of/IN Bill/NNP Clinton/NNP married/VBN to/TO ?/.

Step 2: Core Entity Recognition (Question Analysis)
Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
Bill Clinton: (PROBABLY AN INSTANCE)

Step 3: Determine answer type (Question Analysis)
Rules-based.
Who is the daughter of Bill Clinton married to?
Answer type: (PERSON)
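
A rules-based answer-type step can be sketched as a lookup on the opening words of the question. Only the mapping from "Who" to PERSON comes from the slide; the other rules, the type labels, and the function name `answer_type` are assumptions made for illustration.

```python
# Only "who" -> PERSON is taken from the slide; the remaining rules are assumed.
ANSWER_TYPE_RULES = [
    ("how many", "NUMBER"),
    ("who", "PERSON"),
    ("where", "PLACE"),
    ("when", "DATE"),
]


def answer_type(question):
    """Return the expected answer type based on the question's opening words."""
    q = question.strip().lower()
    for prefix, type_name in ANSWER_TYPE_RULES:
        if q.startswith(prefix):
            return type_name
    return "UNKNOWN"
```

For the running example, `answer_type("Who is the daughter of Bill Clinton married to?")` yields `"PERSON"`.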

Step 4: Dependency parsing (Question Analysis)
dep(married-8, Who-1)
auxpass(married-8, is-2)
det(daughter-4, the-3)
nsubjpass(married-8, daughter-4)
prep(daughter-4, of-5)
nn(Clinton-7, Bill-6)
pobj(of-5, Clinton-7)
root(ROOT-0, married-8)
xcomp(married-8, to-9)

Step 5: Determine Partial Ordered Dependency Structure (PODS) (Question Analysis)
Rules-based.
Remove stop words.
Merge words into entities.
Reorder structure from core entity position.
PODS: Bill Clinton (INSTANCE), daughter, married to
[Figure labels: ANSWER TYPE, QUESTION FOCUS]
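
The three rules of this step can be approximated as follows. This sketch is a simplification: it works on the linear token order rather than on the dependency structure from Step 4, and the stop-word list, the phrase list, and the function `to_pods` are all assumptions.

```python
STOP_WORDS = {"who", "is", "the", "of", "?"}  # assumed stop list


def to_pods(tokens, phrases, core_entity):
    """Simplified PODS: merge phrases, drop stop words, put the core entity first."""
    merged, i = [], 0
    while i < len(tokens):
        for phrase in phrases:
            n = len(phrase)
            if tokens[i:i + n] == list(phrase):
                merged.append(" ".join(phrase))  # merge words into one term
                i += n
                break
        else:
            merged.append(tokens[i])
            i += 1
    content = [t for t in merged if t.lower() not in STOP_WORDS]
    # Reorder so that the structure starts at the core entity.
    return [core_entity] + [t for t in content if t != core_entity]


TOKENS = "Who is the daughter of Bill Clinton married to ?".split()
PHRASES = [("Bill", "Clinton"), ("married", "to")]
```

With these inputs, `to_pods(TOKENS, PHRASES, "Bill Clinton")` produces the sequence Bill Clinton, daughter, married to, matching the structure shown on the slide.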

Question Analysis
PODS: Bill Clinton, daughter, married to
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)

Query Plan
Map query features into a query plan.
A query plan contains a sequence of core operations.
Query Features: (INSTANCE) (PREDICATE) (PREDICATE)
Query Plan:
 (1) INSTANCE SEARCH (Bill Clinton)
 (2) p1 <- SEARCH PREDICATE (Bill Clinton, daughter)
 (3) e1 <- NAVIGATE (Bill Clinton, p1)
 (4) p2 <- SEARCH PREDICATE (e1, married to)
 (5) e2 <- NAVIGATE (e1, p2)
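
The mapping from query features to a plan can be sketched as a small function. The tuple layout and the name `build_query_plan` are assumptions; the sequence of operations it emits mirrors the five steps listed above.

```python
def build_query_plan(terms, features):
    """Turn (term, feature) pairs into a sequence of core operations."""
    plan = []
    pivot = None  # entity (or entity variable) the next operation starts from
    step = 0
    for term, feature in zip(terms, features):
        if feature == "INSTANCE":
            plan.append(("INSTANCE_SEARCH", term))
            pivot = term
        elif feature == "PREDICATE":
            step += 1
            p_var, e_var = f"p{step}", f"e{step}"
            plan.append(("SEARCH_PREDICATE", pivot, term, p_var))
            plan.append(("NAVIGATE", pivot, p_var, e_var))
            pivot = e_var  # the entity reached becomes the new pivot
    return plan


PLAN = build_query_plan(
    ["Bill Clinton", "daughter", "married to"],
    ["INSTANCE", "PREDICATE", "PREDICATE"],
)
```

Each PREDICATE feature expands into a predicate search followed by a navigation, so longer questions simply produce longer chains of the same two operations.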

Instance Search
Query: Bill Clinton
Linked Data: :Bill_Clinton

Predicate Search
Query: Bill Clinton daughter
Linked Data: :Bill_Clinton (PIVOT ENTITY) and its (ASSOCIATED TRIPLES):
:Bill_Clinton :child :Chelsea_Clinton
:Bill_Clinton :religion :Baptists
:Bill_Clinton :almaMater :Yale_Law_School
...
Which properties are semantically related to ‘daughter’?
sem_rel(daughter,child)=0.054
sem_rel(daughter,alma mater)=0.001

Navigate
Query: Bill Clinton daughter
Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)

Predicate Search
Query: Bill Clinton daughter married to
Linked Data: :Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky
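
The walkthrough above can be replayed on a toy graph. This is a sketch, not the Treo implementation: the relatedness table stands in for a distributional model, only the two scores marked as coming from the slides are real, and the threshold and function names are assumptions.

```python
# Toy Linked Data graph taken from the walkthrough slides.
GRAPH = {
    ":Bill_Clinton": {
        ":child": ":Chelsea_Clinton",
        ":religion": ":Baptists",
        ":almaMater": ":Yale_Law_School",
    },
    ":Chelsea_Clinton": {":spouse": ":Mark_Mezvinsky"},
}

# Stand-in for a distributional relatedness measure.
SEM_REL = {
    ("daughter", ":child"): 0.054,      # from the slides
    ("daughter", ":almaMater"): 0.001,  # from the slides
    ("daughter", ":religion"): 0.002,   # hypothetical value
    ("married to", ":spouse"): 0.05,    # hypothetical value
}


def search_predicate(pivot, term, threshold=0.01):
    """Pick the pivot's predicate most related to `term`, if above threshold."""
    candidates = GRAPH.get(pivot, {})
    best = max(candidates, key=lambda p: SEM_REL.get((term, p), 0.0), default=None)
    if best is None or SEM_REL.get((term, best), 0.0) < threshold:
        return None
    return best


def navigate(pivot, predicate):
    """Follow `predicate` from the pivot entity to the next entity."""
    return GRAPH[pivot][predicate]


def answer(instance, terms):
    """Alternate predicate search and navigation along the query terms."""
    pivot = instance
    for term in terms:
        predicate = search_predicate(pivot, term)
        if predicate is None:
            return None  # semantic best-effort: no sufficiently related predicate
        pivot = navigate(pivot, predicate)
    return pivot
```

Calling `answer(":Bill_Clinton", ["daughter", "married to"])` walks from :Bill_Clinton through :child to :Chelsea_Clinton and then through :spouse to :Mark_Mezvinsky. The query never mentions "child" or "spouse"; the relatedness scores bridge that vocabulary gap.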

Evaluation
 102 natural language queries (Test Collection: QALD 2011).
 Avg. query execution time: 1.52 s (simple queries) – 8.53 s
(all queries).

Treo Answers Jeopardy Queries (Video)
http://bit.ly/1hWcch9

Hybrid unstructured & structured
Sydney's dad, Jack, was a CIA double agent working against SD-6 on this
Jennifer Garner show.

Core Principles
 Semantic best-effort
 Dialog & user disambiguation
 Pay-as-you-go data integration
 Simplicity of use
 Franklin et al. (2005): From Databases to Dataspaces.
 Helland (2011): If You Have Too Much Data, then “Good
Enough” Is Good Enough.

Take-away message
 There are approaches that can be used today to cope with
data variety in the Big Data era
 Coping with data variety demands a multi-disciplinary
perspective and a new infrastructure
- Knowledge Representation, IR and Natural Language Processing
 Semantics at scale as a central concern
 You can build your own IBM Watson-like application!
 Great opportunity for new solutions and for being a pioneer

andre.freitas – at – deri.org
