Big Data is based on the vision of giving users and applications a more complete picture of reality, supported and mediated by data. This vision comes at the inherent price of data variety: data that is semantically heterogeneous, poorly structured, complex, and affected by data quality issues. Despite the hype around technologies targeting data volume and velocity, solutions for coping with data variety remain fragmented and have seen limited adoption. This talk focuses on emerging data management approaches, supported by semantic technologies, for coping with data variety. It gives a broad overview of semantic computing approaches and how they can be applied to data management challenges within organizations today, and offers the audience a glimpse into the next generation of Big Data-driven information systems.
Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
1. Coping with Data Variety in the Big Data Era: The Semantic Computing Approach
André Freitas
Insight Centre for Data Analytics
Rio Big Data Meetup (June 2014)
2. Outline
Shift in the Information Systems Landscape
Semantic Computing
Semantic Technologies that Work Today: Data Creation
Semantic Technologies that Work Today: Data Consumption
Case Study: Treo QA System
Conclusions
8. Cost of Making Sense of It
“A lot of Big Data is a lot of small data put together.”
“Most of Big Data is not a uniform big block.”
“Each data piece is very small and very messy, and a lot of what we are doing there is dealing with that variety.”
9. Cost of Making Sense of It
“It is more about the rate of change, the amount and the resources that you need to deal with it.”
“If the programming effort per amount of high quality data is really high, the data is big in the sense of high cost to produce new information.”
“Big Data seems to be about addressing challenges of scale, in terms of how fast things are coming out at you versus how much it costs to get value out of what you already have.”
10. Cost of Making Sense of It
“You can have Big Data challenges not only because you have PBs of data but because data is incredibly varied and therefore consumes a lot of resources to make sense of it.”
11. Cost of Making Sense of It
“The speed in which data is generated and the speed in which it needs to be processed in order to use it effectively.”
12. “Schema” Growth
Heterogeneous, complex and large-scale databases.
Very-large and dynamic “schemas”.
circa 2000: 10s-100s attributes
circa 2014: 1,000s-1,000,000s attributes
13. Semantic Heterogeneity
Decentralized content generation.
Multiple perspectives (conceptualizations) of the reality.
Ambiguity, vagueness, inconsistency.
16. Data variety +
Data quality -
Data
Programs
Full data coverage
Full automation
17. Structure level
Unstructured Data: easy to generate
Structured Data: easy to analyze
Consistent
Comparable
Processable
Semantic Computing
27. (Some) Challenges in Semantics
Knowledge Representation Model
Reasoning
Large, inconsistent,
heterogeneous
Data
Expected Result: intelligent behavior
Semantic flexibility, predictive power, automation ...
Acquisition, Learning
There is an economic model behind each element!
28. Meaning
Word meaning is usually represented in terms of some formal, symbolic structure, either external or internal to the word
External structure
- Associations between different concepts
Internal structure
- Feature (property, attribute) lists
The semantic properties of a word are derived from the formal structure of its representation
- e.g. Inference algorithm, etc.
Semantics = Meaning representation model (data) + inference model
30. Formal Representation of Meaning (Problems)
Different meanings
- bank (financial institution)
- bank (river side)
Meaning variation in context
- clever politician, clever tycoon
Meaning evolution
Ambiguity, vagueness, inconsistency
Word meaning acquisition & representation
Lack of flexibility
Scalability
31. Most semantic models have dealt with particular types of constructions, and have been carried out under very simplifying assumptions, in true lab conditions.
If these idealizations are removed it is not clear at all that modern semantics can give a full account of all but the simplest models/statements.
Sahlgren, 2013
Formal World Real World
Baroni et al. 2013
Semantics for a Complex World
52. Open Data
Common-sense Knowledge Base
Domain-specific Knowledge Base
Entity reference system
53. Open Data
DBpedia
- http://dbpedia.org/
YAGO
- http://www.mpi-inf.mpg.de/yago-naga/yago/
Freebase
- http://www.freebase.com/
Wikipedia dumps
- http://dumps.wikimedia.org/
ConceptNet
- http://conceptnet5.media.mit.edu/
Geonames
- http://www.geonames.org/
Common Crawl
- http://commoncrawl.org/
54. Standardized Vocabularies
Open conceptual models that can be reused across different datasets
Provide interoperability at the conceptual model level
Useful for modelling recurrent domains of discourse
58. Entity Recognition & Linking
Align terms in unstructured text to entities in a structured KB
Integrates structured and unstructured data
Can be used to support semantic search
Provides a first level of structure to unstructured data
Exploratory browsing
61. Entity Recognition & Linking
Example:
“GE has also been implicated in the creation of toxic waste.”
<http://dbpedia.org/resource/General_Electric>
yago:ConglomerateCompanies
yago:MedicalEquipmentManufacturers
yago:CompaniesListedOnTheNewYorkStockExchange
62. Entity Recognition & Linking
Example:
“GE has also been implicated in the creation of toxic waste.”
<http://dbpedia.org/resource/Toxic_waste>
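The linking step in this example can be sketched as a dictionary lookup. This is only a minimal illustration, not the approach used by any particular linker: the table `SURFACE_FORMS` and the function `link_entities` are hypothetical names, and real systems rely on far larger dictionaries plus context-based disambiguation. The two DBpedia URIs are the ones shown on the slides.

```python
import re

# Hypothetical surface-form dictionary: lowercase mention -> KB entity URI.
SURFACE_FORMS = {
    "ge": "http://dbpedia.org/resource/General_Electric",
    "toxic waste": "http://dbpedia.org/resource/Toxic_waste",
}

def link_entities(text, surface_forms=SURFACE_FORMS, max_len=3):
    """Greedy longest-match lookup of word n-grams in the dictionary."""
    tokens = re.findall(r"\w+", text)
    links, i = [], 0
    while i < len(tokens):
        # Try the longest candidate mention first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            mention = " ".join(tokens[i:i + n])
            uri = surface_forms.get(mention.lower())
            if uri:
                links.append((mention, uri))
                i += n
                break
        else:
            i += 1  # no known mention starts at this token
    return links

sentence = "GE has also been implicated in the creation of toxic waste."
for mention, uri in link_entities(sentence):
    print(mention, "->", uri)
```

Once a mention is linked, the classes attached to the entity in the KB (such as the YAGO classes listed for General Electric) become available as structure for search and browsing.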
73. Vector Space Models
Representation useful for approximate search
Search over structured and unstructured data
Construction of approximate semantic models
76. Distributional Hypothesis
“Words occurring in similar (linguistic) contexts tend to be semantically similar”
He filled the wampimuk with the substance, passed it around and we all drunk some
We found a little, hairy wampimuk sleeping behind the tree
77. Distributional Semantic Models (DSMs)
Computational models that build contextual semantic representations from corpus data
Semantic context is represented by a vector
Vectors are obtained through the statistical analysis of the linguistic contexts of a word
Salience of contexts (cf. context weighting scheme)
Semantic similarity/relatedness as the core operation over the model
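A minimal sketch of the idea, assuming a toy three-sentence corpus, a symmetric context window of two words, and raw co-occurrence counts with no context weighting scheme. The names `build_vectors` and `cosine` are illustrative, not part of any specific DSM toolkit.

```python
import math
from collections import Counter, defaultdict

def build_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (an assumption for illustration only).
corpus = [
    "the dog chased the ball",
    "the cat chased the mouse",
    "the car needs new tires",
]
vectors = build_vectors(corpus)
print("dog vs cat:", cosine(vectors["dog"], vectors["cat"]))
print("dog vs car:", cosine(vectors["dog"], vectors["car"]))
```

In this toy corpus "dog" and "cat" share the same contexts, so their similarity is higher than that of "dog" and "car". A realistic model would be built from a large corpus and would weight contexts by salience.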
78. DSMs as Commonsense Reasoning
[Figure: vector space with the terms car, dog, cat, bark, run and leash, an angle θ between vectors, and the annotation “Commonsense is here”]
92. Vocabulary Problem
Query: Who is the daughter of Bill Clinton married to?
Semantic approximation
Semantic Gap
Possible representations = Commonsense Knowledge
Dataset (DBpedia 3.7 + YAGO): 45,767 predicates, 5,556,492 classes and 9,434,677 instances
93. Core Principles
Minimize the impact of Ambiguity, Vagueness, Synonymy.
Address the simplest matchings first (heuristics).
Semantic Relatedness as a primitive operation.
Distributional semantics as commonsense knowledge.
95. Step 2: Core Entity Recognition
Rules-based: POS Tag + TF/IDF
Who is the daughter of Bill Clinton married to?
Bill Clinton (PROBABLY AN INSTANCE)
Query Pre-Processing (Question Analysis)
96. Step 3: Determine answer type
Rules-based.
Who is the daughter of Bill Clinton married to?
(PERSON)
Query Pre-Processing (Question Analysis)
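A rule-based answer type detector can be as small as a prefix table. The sketch below is hypothetical: only the mapping of the example question to PERSON comes from the slide, while the other rules and the names `ANSWER_TYPE_RULES` and `answer_type` are assumptions added for illustration.

```python
# Hypothetical rules: question prefix -> expected answer type.
# Only the "who" -> PERSON case is taken from the slide.
ANSWER_TYPE_RULES = [
    ("how many", "NUMBER"),
    ("who", "PERSON"),
    ("where", "PLACE"),
    ("when", "DATE"),
]

def answer_type(question):
    """Return the answer type of the first rule whose prefix matches."""
    q = question.lower().strip()
    for prefix, label in ANSWER_TYPE_RULES:
        if q.startswith(prefix):
            return label
    return "UNKNOWN"

print(answer_type("Who is the daughter of Bill Clinton married to?"))
```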
98. Step 5: Determine Partial Ordered Dependency Structure (PODS)
Rules-based.
Remove stop words.
Merge words into entities.
Reorder structure from core entity position.
Query Pre-Processing (Question Analysis)
Bill Clinton (INSTANCE) daughter married to
ANSWER TYPE
QUESTION FOCUS
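The three rules above can be sketched as follows. This is a simplification under stated assumptions: the stop word list and the multi-word phrase list are tiny and hand-written for this one question, and the reordering simply moves the core entity to the front and keeps the remaining terms in question order, whereas the slide describes a dependency-based structure. The name `build_pods` is hypothetical.

```python
import re

# Assumed, deliberately tiny resources for this one example.
STOP_WORDS = {"who", "is", "the", "of"}
PHRASES = {("bill", "clinton"), ("married", "to")}

def build_pods(question, core_entity):
    tokens = re.findall(r"\w+", question)
    # 1. Remove stop words.
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]
    # 2. Merge adjacent words that form a known multi-word unit.
    merged, i = [], 0
    while i < len(tokens):
        pair = tuple(t.lower() for t in tokens[i:i + 2])
        if pair in PHRASES:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    # 3. Reorder: core entity first, other terms in their original order.
    rest = [t for t in merged if t != core_entity]
    return [core_entity] + rest

print(build_pods("Who is the daughter of Bill Clinton married to?", "Bill Clinton"))
```

For the example question this yields the sequence shown on the slide: Bill Clinton, daughter, married to.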
102. Predicate Search
Query: Bill Clinton daughter married to
Linked Data:
:Bill_Clinton (PIVOT ENTITY)
- :child :Chelsea_Clinton
- :religion :Baptists
- :almaMater :Yale_Law_School
- ...
(ASSOCIATED TRIPLES)
105. Predicate Search
Query: Bill Clinton daughter married to
Linked Data:
:Bill_Clinton
- :child :Chelsea_Clinton
- :religion :Baptists
- :almaMater :Yale_Law_School
- ...
Which properties are semantically related to ‘daughter’?
sem_rel(daughter,child)=0.054
sem_rel(daughter,religion)=0.004
sem_rel(daughter,alma mater)=0.001
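A minimal sketch of predicate ranking by semantic relatedness. The scores are the three values shown on the slide, assuming the 0.004 score belongs to the :religion property; in a real system they would be computed from a distributional model rather than read from a table. The function `rank_predicates` and its `threshold` parameter are hypothetical.

```python
# Relatedness scores from the slide (0.004 assumed to refer to "religion").
SEM_REL = {
    ("daughter", "child"): 0.054,
    ("daughter", "religion"): 0.004,
    ("daughter", "alma mater"): 0.001,
}

def rank_predicates(term, predicates, threshold=0.01):
    """Score each candidate predicate against the query term, keep those
    at or above `threshold`, and return them best first."""
    scored = [(p, SEM_REL.get((term, p), 0.0)) for p in predicates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [(p, s) for p, s in scored if s >= threshold]

candidates = ["religion", "alma mater", "child"]
print(rank_predicates("daughter", candidates))
```

With the assumed threshold only :child survives, so the search continues from the object of that triple.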
108. Predicate Search
Query: Bill Clinton daughter married to
Linked Data:
:Bill_Clinton :child :Chelsea_Clinton (PIVOT ENTITY)
:Chelsea_Clinton :spouse :Mark_Mezvinsky
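The traversal from one pivot entity to the next can be sketched over an in-memory list of triples. The triples are the ones shown on the slides; the functions `follow` and `answer` are hypothetical, and the predicate path is assumed to have already been selected by the predicate search.

```python
# Triples taken from the slides, as (subject, predicate, object).
TRIPLES = [
    (":Bill_Clinton", ":child", ":Chelsea_Clinton"),
    (":Bill_Clinton", ":religion", ":Baptists"),
    (":Bill_Clinton", ":almaMater", ":Yale_Law_School"),
    (":Chelsea_Clinton", ":spouse", ":Mark_Mezvinsky"),
]

def follow(pivots, predicate, triples=TRIPLES):
    """Objects reachable from any pivot entity through `predicate`."""
    return [o for s, p, o in triples if s in pivots and p == predicate]

def answer(pivot, predicate_path):
    """Walk the predicate path; each step's objects become the new pivots."""
    pivots = [pivot]
    for predicate in predicate_path:
        pivots = follow(pivots, predicate)
    return pivots

print(answer(":Bill_Clinton", [":child", ":spouse"]))
```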
112. Hybrid unstructured & structured
Sydney's dad, Jack, was a CIA double agent working against SD-6 on this Jennifer Garner show.
113. Core Principles
Semantic best-effort
Dialog & user disambiguation
Pay-as-you-go data integration
Simplicity of use
Franklin et al. (2005): From Databases to Dataspaces.
Helland (2011): If You Have Too Much Data, then “Good Enough” Is Good Enough.
114. Take-away message
There are approaches that can be used today to cope with data variety in the Big Data era
Coping with data variety demands a multi-disciplinary perspective and a new infrastructure
- Knowledge Representation, IR and Natural Language Processing
Semantics at scale as a central concern
You can build your own IBM Watson-like application!
Great opportunity for new solutions and for being a pioneer