Semantics for
Big Data Integration and Analysis
Craig Knoblock and Pedro Szekely
University of Southern California
Dimensions of Big Data
Our Focus:
Karma addresses Variety: source types, formats, representations
BigKarma addresses Volume (cloud execution) and Velocity (streaming sources)
Exploit Semantics of Data to Support
Big Data Integration
Karma models both hierarchical sources (services) and tabular sources (databases, CSV, KML, …).
Karma is an interactive tool for rapidly extracting, cleaning, transforming, and publishing data.
The Secret Sauce: Karma Models
A Karma model captures the meaning of the data in each column and the relationships among data columns.
Semantic Types: Meaning of Data in Columns
Semantic vs Syntactic Types
Syntactic types: String, String, String, String, String, Date. Not useful for information integration.
Semantic types: Name of Person, Nationality of Entity, Location of Event, Type of Event, Description of Event, Date of Event.
Semantic Types Defined Using an Ontology
Each semantic type pairs an ontology property with a class: Name (Person), Nationality (Entity), Location (Event), Type (Event), Description (Event), Date (Event).
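As an illustration only (the dictionary form and the class/property names below are my assumptions, not Karma's actual model format), a column-to-semantic-type assignment can be thought of as a mapping from each column to an ontology class and property:

```python
# Hypothetical sketch: a semantic type as an (ontology class, ontology property) pair.
# The names below are illustrative, not Karma's actual vocabulary.
SEMANTIC_TYPES = {
    "Name":        ("Person", "name"),
    "Nationality": ("Entity", "nationality"),
    "Location":    ("Event",  "location"),
    "Type":        ("Event",  "type"),
    "Description": ("Event",  "description"),
    "Date":        ("Event",  "date"),
}

def semantic_type(column_name):
    """Return the (class, property) pair assigned to a column, if any."""
    return SEMANTIC_TYPES.get(column_name)
```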
Karma Learns the Semantic Types
Columns: Name, Nationality, Location, Type, Description, Date
1. The user specifies the semantic types once
2. Karma learns features to recognize them
3. The next time Karma sees similar data, it automatically proposes semantic types for classes such as Organization, Person, and Event (see the sketch below)
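As a purely illustrative sketch (the token-overlap features and similarity measure below are my assumptions, not Karma's actual learning method), the label-once, propose-later workflow might look like this:

```python
# Minimal sketch (not Karma's actual learner): propose a semantic type for a new
# column by comparing its values to columns the user has labeled before.
from collections import defaultdict

labeled_columns = defaultdict(list)   # semantic type -> token profiles of labeled columns

def tokens(values):
    return {tok.lower() for v in values for tok in str(v).split()}

def record_label(semantic_type, values):
    """Steps 1-2: the user labels a column once; keep its token profile as features."""
    labeled_columns[semantic_type].append(tokens(values))

def propose_type(values):
    """Step 3: for similar new data, propose the closest previously seen semantic type."""
    new = tokens(values)
    best, best_score = None, 0.0
    for sem_type, profiles in labeled_columns.items():
        for profile in profiles:
            overlap = len(new & profile) / max(len(new | profile), 1)  # Jaccard similarity
            if overlap > best_score:
                best, best_score = sem_type, overlap
    return best, best_score
```

Once a few columns have been recorded with `record_label`, `propose_type` ranks candidate types for a new column by similarity to those profiles.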
Relationships Specified in Terms of Classes and Properties
Ontology properties such as member (linking Person and Organization) and perpetrator (linking Person and Event) relate the columns Name, Nationality, Location, Type, Description, and Date.
Karma Automatically Infers Relationships
1. Karma automatically finds relationships (such as member and perpetrator) using the ontology; one way to search for them is sketched below
2. When proposed relationships are incorrect, the user adjusts them
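A minimal sketch of the idea, assuming relationship inference is framed as a search over the ontology graph (the toy graph, the breadth-first search, and the property names are illustrative assumptions, not Karma's actual algorithm):

```python
# Minimal sketch (an assumption, not Karma's actual algorithm): treat the ontology
# as a graph of classes connected by properties, and infer a relationship between
# two columns by finding the shortest property path between their classes.
from collections import deque

# Toy ontology graph: class -> [(property, neighboring class), ...]
ONTOLOGY = {
    "Person":       [("member", "Organization"), ("perpetrator", "Event")],
    "Organization": [("member", "Person")],
    "Event":        [("perpetrator", "Person")],
}

def infer_relationship(source_class, target_class):
    """Breadth-first search for the shortest chain of properties linking two classes."""
    queue = deque([(source_class, [])])
    seen = {source_class}
    while queue:
        cls, path = queue.popleft()
        if cls == target_class:
            return path
        for prop, neighbor in ONTOLOGY.get(cls, []):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, path + [prop]))
    return None

# e.g. infer_relationship("Organization", "Event") -> ["member", "perpetrator"]
```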
Alignment
1. load
sample
2. demonstrate
transformations
3. model
Batch Execution
1. load data 2. apply model 3. export
Each dataset is processed with its model to produce RDF (dataset A with model A yields RDF A, and likewise for datasets B and C), and the results are loaded into an RDF triple store. A sketch of this step appears below.
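A minimal sketch of the apply-model-and-export step, assuming rdflib for RDF generation and an example namespace of my own; Karma's real export pipeline and vocabulary are not shown here:

```python
# Minimal sketch (rdflib and the example namespace are my assumptions, not Karma's
# export pipeline): apply a model's semantic types to rows and emit RDF triples.
from rdflib import Graph, Namespace, Literal, URIRef

EX = Namespace("http://example.org/ontology/")

def apply_model(rows, dataset_id):
    """Turn each row into triples using the column-to-property mapping of the model."""
    g = Graph()
    g.bind("ex", EX)
    for i, row in enumerate(rows):
        event = URIRef(f"http://example.org/{dataset_id}/event/{i}")
        g.add((event, EX.location, Literal(row["Location"])))
        g.add((event, EX.date, Literal(row["Date"])))
        person = URIRef(f"http://example.org/{dataset_id}/person/{i}")
        g.add((person, EX.name, Literal(row["Name"])))
        g.add((person, EX.perpetrator, event))
    return g

# Illustrative usage with placeholder values:
# g = apply_model([{"Name": "Cano", "Location": "Bogota", "Date": "2001-01-01"}], "datasetA")
# g.serialize(destination="datasetA.ttl", format="turtle")  # then load into the triple store
```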
BigKarma
Query: select name, location of people associated with FARC
JSON source: { organization: FARC, last-name: Cano, location: Putumayo }
Relational source (Org, Name, Location): FARC Reyez Vaupes; FARC Cano Bogota
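Once both sources are mapped into a single RDF graph, the question can be posed as a query over the combined data. A minimal sketch with rdflib and an assumed vocabulary (ex:member, ex:name, ex:location, and the file names are illustrative, not Karma's ontology):

```python
# Minimal sketch (assumed vocabulary and file names): query the integrated RDF graph.
from rdflib import Graph

g = Graph()
g.parse("datasetA.ttl", format="turtle")   # integrated data exported by the models
g.parse("datasetB.ttl", format="turtle")

QUERY = """
PREFIX ex: <http://example.org/ontology/>
SELECT ?name ?location WHERE {
    ?person ex:name ?name ;
            ex:member ?org .
    ?org    ex:name "FARC" .
    ?person ex:perpetrator ?event .
    ?event  ex:location ?location .
}
"""
for name, location in g.query(QUERY):
    print(name, location)
```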
The Big Idea
Integrate many source formats (JSON, XML, Relational, CSV, Excel) and apply select and project operations over very large, streaming datasets.
Karma Alignment (per dataset):
1. load sample
2. demonstrate transformations
3. model
Then:
4. combine models
5. join, union, select, project
6. export workflow
BigKarma Alignment
Model Combination: individual models (Person, Organization, Event) are combined into a single integrated model, sketched below.
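One way to picture model combination (the edge-set representation is an assumption of mine, not Karma's model format): each model is a set of (class, property, class) edges, and combining models is simply their union, with shared classes such as Person acting as join points:

```python
# Minimal sketch (assumed representation): a model as a set of
# (source class, property, target class) edges; combination is set union.
person_model = {("Person", "name", "Literal"), ("Person", "member", "Organization")}
event_model  = {("Event", "perpetrator", "Person"), ("Event", "location", "Literal")}
org_model    = {("Organization", "name", "Literal")}

combined_model = person_model | event_model | org_model
```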
BigKarma Cloud Execution
1. load multiple datasets
2. apply combined model
3. export transformed results
CHALLENGE 1: Materialize and transform only the data needed for the results
CHALLENGE 2: Use a column store rather than RDF to integrate the data (see the sketch below)
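A minimal sketch of the column-store idea, using pyarrow/Parquet as a stand-in (the choice of library and the schema are my assumptions, not a stated BigKarma design):

```python
# Minimal sketch (assumed tooling): store integrated data column-wise so a query
# can materialize only the columns it needs instead of scanning full RDF triples.
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({
    "org":      ["FARC", "FARC"],
    "name":     ["Reyez", "Cano"],
    "location": ["Vaupes", "Bogota"],
})
pq.write_table(table, "integrated.parquet")

# Read back only the columns needed for the result.
result = pq.read_table("integrated.parquet", columns=["name", "location"])
```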
BigKarma Streaming
1. stream multiple datasets
2. apply combined model
3. export transformed results
CHALLENGE 1: Modeling streaming sources
CHALLENGE 2: Buffering to support joins (a sketch follows)
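A minimal sketch of buffering to support joins over streams (the keyed, bounded-window scheme is my assumption, not a stated BigKarma design):

```python
# Minimal sketch (assumed windowing scheme): buffer recent records from each stream
# so records arriving on one stream can be joined with matching records that
# arrived slightly earlier on the other stream.
from collections import defaultdict, deque

class StreamJoiner:
    def __init__(self, window_size=1000):
        self.window_size = window_size
        self.buffers = {"left": defaultdict(deque), "right": defaultdict(deque)}

    def add(self, side, key, record):
        """Buffer the record and emit joins with the opposite side's buffered records."""
        other = "right" if side == "left" else "left"
        matches = [(record, r) if side == "left" else (r, record)
                   for r in self.buffers[other][key]]
        buf = self.buffers[side][key]
        buf.append(record)
        if len(buf) > self.window_size:
            buf.popleft()          # bounded buffer: drop the oldest record
        return matches

# joiner = StreamJoiner()
# joiner.add("left", "FARC", {"name": "Cano"})
# joiner.add("right", "FARC", {"location": "Bogota"})  # -> [({'name': 'Cano'}, {'location': 'Bogota'})]
```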
BigKarma Analysis
Semantics for Big Data Integration and Analysis
Pedro Szekely and Craig Knoblock, University of Southern California
http://isi.edu/integration/karma