This document discusses using big data and semantics for data integration. It describes loading multiple data sources into Hadoop and mapping the data into a common domain vocabulary that can then be queried using SPARQL. Adding a new data source involves mapping it to the existing domain vocabulary rather than changing queries. Key technologies mentioned include RDF for the data model, RDFS for schemas, SPARQL for querying, and R2RML for mapping relational data to RDF triples.
Big Data Integration with Semantic Technologies
1. Big Data with Semantics
Alex Miller
@puredanger
picture: http://bit.ly/MLUIon
2. Hadoop for Data Integration
• Companies are flocking to Hadoop right now, mostly for ETL/analysis
• Starting to also use it for data integration
• Traditionally the domain of data warehouses
3. Data Integration in Hive
• Load multiple sources
• Define, query with HiveQL
• Queries access multiple sources in terms of their original data
• Adding a new "data source" means changing all of your queries to accommodate the new data
4. Integration with Semantics
• Load data into Hadoop
• Map data into common domain vocabulary
• Query all your sources with common domain vocabulary
• Adding a new "data source" means mapping the new source into the domain
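A minimal Python sketch of the idea on this slide, not taken from the original deck. The two sources, their column names, and the ex: vocabulary terms are all hypothetical. It shows that the query is written once against the common vocabulary, and that adding a source only requires one more mapping.

```python
# Hypothetical per-source mappings from source columns to a shared vocabulary.
CRM_MAPPING = {"cust_name": "ex:name", "cust_city": "ex:city"}
ORDERS_MAPPING = {"buyer": "ex:name", "ship_to_city": "ex:city"}


def map_record(record, mapping):
    """Rename source-specific fields to the common domain vocabulary."""
    return {mapping[field]: value
            for field, value in record.items()
            if field in mapping}


def names_in_city(records, city):
    """A query expressed only in terms of the domain vocabulary."""
    return sorted(r["ex:name"] for r in records if r.get("ex:city") == city)


# Hypothetical sample rows from two differently shaped sources.
crm_rows = [{"cust_name": "Ada", "cust_city": "Chicago"}]
order_rows = [{"buyer": "Grace", "ship_to_city": "Chicago"}]

integrated = [map_record(r, CRM_MAPPING) for r in crm_rows]
integrated += [map_record(r, ORDERS_MAPPING) for r in order_rows]

print(names_in_city(integrated, "Chicago"))
```

Integrating a third source in this sketch would mean writing a third mapping dictionary; `names_in_city` would stay unchanged.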
14. Relationships
[Graph diagram]
Nodes: dbp:The_Blues_Brothers_(film), dbp:Wrigley_Field, dbp:Chicago_(band), dbp:Chicago, dbp:Chicago_Cubs, dbp:Barack_Obama, dbp:Pizza
Edge labels: dbpo:location, movie:film_location, dbpo:owner, dbpo:residence
dbp: http://dbpedia.org/resource/
dbpo: http://dbpedia.org/ontology/
15. Triple
"fact" or "assertion"
<subject> <predicate> <object>
16. Subject, Predicate, Object
[Graph diagram annotated with the labels Subject, Predicate, and Object]
Nodes: dbp:The_Blues_Brothers_(film), dbp:Wrigley_Field, dbp:Chicago_(band), dbp:Chicago, dbp:Chicago_Cubs, dbp:Barack_Obama, dbp:Pizza
Edge labels: dbpo:location, movie:film_location, dbpo:owner, dbpo:residence
dbp: http://dbpedia.org/resource/
dbpo: http://dbpedia.org/ontology/
17. Triple
<subject>          <predicate>    <object>
dbp:Wrigley_Field  dbpo:location  dbp:Chicago
resource           resource       resource or value
(vertex)           (edge)         (vertex)
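As an illustrative sketch (not from the original slides), a triple can be modelled in Python as a three-field tuple and a graph as a set of such tuples. The `ex:label` predicate and its literal value are hypothetical, added only to show an object that is a value rather than a resource.

```python
from collections import namedtuple

# A triple is one "fact": subject and predicate are resources,
# the object is either a resource or a plain value.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

graph = {
    # Taken from the slide: vertex, edge, vertex.
    Triple("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    # Hypothetical triple whose object is a literal value.
    Triple("dbp:Chicago", "ex:label", "Chicago"),
}


def objects(graph, subject, predicate):
    """Follow one edge label out of one vertex."""
    return {t.object for t in graph
            if t.subject == subject and t.predicate == predicate}


print(objects(graph, "dbp:Wrigley_Field", "dbpo:location"))
```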
18. Graph
[Graph diagram]
Nodes: dbp:The_Blues_Brothers_(film), dbp:Wrigley_Field, dbp:Chicago_(band), dbp:Chicago, dbp:Chicago_Cubs, dbp:Barack_Obama, dbp:Pizza
Edge labels: dbpo:location, movie:film_location, dbpo:owner, dbpo:residence
dbp: http://dbpedia.org/resource/
dbpo: http://dbpedia.org/ontology/
19. If things and relationships can be defined by any URI, how do we know what we're talking about?
22. A class describes a group of things that share common properties.
23. Class
dbp:San_Francisco  is a  ex:City
dbp:Chicago        is a  ex:City
dbp:Saint_Louis    is a  ex:City
dbp: http://dbpedia.org/resource/
ex: http://example.org/ontology/
rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
rdfs: http://www.w3.org/2000/01/rdf-schema#
37. How do we map tables (text or sequence file) to triples?
38. Music Database
Musicians:
MID  First  Last       Inst_ID
1    Eddie  Van Halen  10
2    Yo Yo  Ma         20
3    Kenny  G          30

Instruments:
IID  Instrument  Type
10   Guitar      String
20   Cello       String
30   Saxophone   Woodwind
39. Musician Schema
[Schema diagram; rdfs:Class and rdf:Property appear as rdf:type targets]
music:firstName  rdfs:domain  music:Musician
music:lastName   rdfs:domain  music:Musician
music:plays      rdfs:domain  music:Musician
music:plays      rdfs:range   music:Instrument
music:instName   rdfs:domain  music:Instrument
music:instType   rdfs:domain  music:Instrument
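One thing such a schema allows is type inference: under RDFS semantics, using a property on a resource implies the resource belongs to the property's domain class, and the value belongs to its range class. The sketch below assumes the domain and range declarations read off this slide's diagram (the diagram is partly illegible in this transcript, so treat the two dictionaries as an assumption).

```python
# Assumed from the schema diagram: property -> domain class.
DOMAIN = {
    "music:firstName": "music:Musician",
    "music:lastName": "music:Musician",
    "music:plays": "music:Musician",
    "music:instName": "music:Instrument",
    "music:instType": "music:Instrument",
}
# Assumed from the schema diagram: property -> range class.
RANGE = {"music:plays": "music:Instrument"}


def infer_types(triples):
    """Derive rdf:type triples from rdfs:domain and rdfs:range declarations."""
    inferred = set()
    for subject, predicate, obj in triples:
        if predicate in DOMAIN:
            inferred.add((subject, "rdf:type", DOMAIN[predicate]))
        if predicate in RANGE:
            inferred.add((obj, "rdf:type", RANGE[predicate]))
    return inferred


facts = [("artist:1", "music:plays", "instrument:10")]
for triple in sorted(infer_types(facts)):
    print(triple)
```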
40. Tables to Triples
Musicians:
MID  First  Last       Inst_ID
1    Eddie  Van Halen  10
2    Yo Yo  Ma         20
3    Kenny  G          30

Instruments:
IID  Instrument  Type
10   Guitar      String
20   Cello       String
30   Saxophone   Woodwind

Turn each key into a resource and specify the proper type of each resource:
artist:1       rdf:type  music:Musician
artist:2       rdf:type  music:Musician
artist:3       rdf:type  music:Musician
instrument:10  rdf:type  music:Instrument
instrument:20  rdf:type  music:Instrument
instrument:30  rdf:type  music:Instrument
41. Tables to Triples
Musicians:
MID  First  Last       Inst_ID
1    Eddie  Van Halen  10
2    Yo Yo  Ma         20
3    Kenny  G          30

Instruments:
IID  Instrument  Type
10   Guitar      String
20   Cello       String
30   Saxophone   Woodwind

Turn each cell into a triple based on the key, property (mapped per column), and value:
artist:1       music:firstName  "Eddie"
artist:1       music:lastName   "Van Halen"
artist:2       music:firstName  "Yo Yo"
artist:2       music:lastName   "Ma"
artist:3       music:firstName  "Kenny"
artist:3       music:lastName   "G"
instrument:10  music:instName   "Guitar"
instrument:10  music:instType   "String"
instrument:20  music:instName   "Cello"
instrument:20  music:instType   "String"
instrument:30  music:instName   "Saxophone"
instrument:30  music:instType   "Woodwind"
42. Tables to Triples
Musicians:
MID  First  Last       Inst_ID
1    Eddie  Van Halen  10
2    Yo Yo  Ma         20
3    Kenny  G          30

Instruments:
IID  Instrument  Type
10   Guitar      String
20   Cello       String
30   Saxophone   Woodwind

Turn each foreign key reference into a relationship between the foreign and primary resources:
artist:1  music:plays  instrument:10
artist:2  music:plays  instrument:20
artist:3  music:plays  instrument:30
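The three steps on the "Tables to Triples" slides (keys become typed resources, cells become property triples, foreign keys become relationships) can be sketched in plain Python. This is an illustrative sketch rather than the deck's implementation; the row dictionaries mirror the Music Database tables, and the function names are made up for the example.

```python
MUSICIANS = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]
INSTRUMENTS = [
    {"IID": 10, "Instrument": "Guitar", "Type": "String"},
    {"IID": 20, "Instrument": "Cello", "Type": "String"},
    {"IID": 30, "Instrument": "Saxophone", "Type": "Woodwind"},
]


def musician_triples(row):
    subject = f"artist:{row['MID']}"
    # Step 1: the primary key becomes a typed resource.
    yield (subject, "rdf:type", "music:Musician")
    # Step 2: each cell becomes a (key, property, value) triple.
    yield (subject, "music:firstName", row["First"])
    yield (subject, "music:lastName", row["Last"])
    # Step 3: the foreign key becomes a relationship between resources.
    yield (subject, "music:plays", f"instrument:{row['Inst_ID']}")


def instrument_triples(row):
    subject = f"instrument:{row['IID']}"
    yield (subject, "rdf:type", "music:Instrument")
    yield (subject, "music:instName", row["Instrument"])
    yield (subject, "music:instType", row["Type"])


triples = [t for row in MUSICIANS for t in musician_triples(row)]
triples += [t for row in INSTRUMENTS for t in instrument_triples(row)]

for triple in triples:
    print(triple)
```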
43. R2RML
• "RDB to RDF Mapping Language"
• RDB2RDF Working Group at W3C
• ETL "data transformation" use case
• Dynamic "query translation" use case
• Translate SPARQL query against domain to SQL query against the DBMS
44. R2RML Triple Mapping
[Diagram: schema properties mapped to table columns]
music:instName  rdfs:domain  music:Instrument
music:instType  rdfs:domain  music:Instrument

Instruments:
IID  Instrument  Type
10   Guitar      String
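An R2RML triples map pairs a subject template with a set of predicate-to-column rules. The Python sketch below imitates that shape with a plain dictionary; it is not R2RML syntax and does not use any R2RML library, and the dictionary keys (`subject_template`, `class`, `predicate_columns`) are names invented for this example.

```python
# Hypothetical declarative mapping, loosely modelled on an R2RML triples map.
INSTRUMENT_MAP = {
    "subject_template": "instrument:{IID}",
    "class": "music:Instrument",
    "predicate_columns": {
        "music:instName": "Instrument",
        "music:instType": "Type",
    },
}


def apply_mapping(mapping, row):
    """Produce the triples that one table row yields under the mapping."""
    subject = mapping["subject_template"].format(**row)
    triples = [(subject, "rdf:type", mapping["class"])]
    for predicate, column in mapping["predicate_columns"].items():
        triples.append((subject, predicate, row[column]))
    return triples


row = {"IID": 10, "Instrument": "Guitar", "Type": "String"}
for triple in apply_mapping(INSTRUMENT_MAP, row):
    print(triple)
```

Keeping the mapping as data rather than code is what makes both use cases possible: the same description can drive an ETL pass or be consulted when translating a query.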
49. Direct mapping
• Automatically map relational tables into a domain vocabulary using R2RML
• Good starting point to rapidly integrate two data sources
51. Triple data in Hadoop
• N-Triples files
• standard line format for RDF data
• indexed triple format
• triples in Thrift representing RDF terms
• text / sequence files as tabular sources
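N-Triples writes one triple per line with full IRIs in angle brackets and a terminating period, which is why it splits cleanly across Hadoop input blocks. The sketch below serializes prefixed names into that line format using the dbp: and dbpo: prefixes from the earlier slides. It is a simplified illustration: the literal escaping covers only backslash, double quote, and line breaks, and language tags and datatypes are omitted.

```python
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "dbpo": "http://dbpedia.org/ontology/",
}


def expand(curie):
    """Expand a prefixed name such as dbp:Chicago into a bracketed IRI."""
    prefix, local = curie.split(":", 1)
    return f"<{PREFIXES[prefix]}{local}>"


def to_ntriples_line(subject, predicate, obj, literal=False):
    """Serialize one triple as a single N-Triples line."""
    if literal:
        escaped = (obj.replace("\\", "\\\\")
                      .replace('"', '\\"')
                      .replace("\n", "\\n")
                      .replace("\r", "\\r"))
        obj_term = f'"{escaped}"'
    else:
        obj_term = expand(obj)
    return f"{expand(subject)} {expand(predicate)} {obj_term} ."


print(to_ntriples_line("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"))
```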
52. SPARQL in Hadoop
• Compile SPARQL to map-reduce jobs against triple (or tuple) data
• Results materialized back into Hadoop files
• Similar to HiveQL compiling SQL to map-reduce against tabular data
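To make "compile SPARQL to map-reduce" concrete, here is a pure-Python simulation of a reduce-side join for the two-pattern query `?a music:lastName ?last . ?a music:plays ?i`. It does not reflect how any particular engine plans queries and uses no Hadoop API; the map, shuffle, and reduce functions are stand-ins written for this example, and the sample triples follow the Music Database slides.

```python
from collections import defaultdict

TRIPLES = [
    ("artist:1", "music:lastName", "Van Halen"),
    ("artist:1", "music:plays", "instrument:10"),
    ("artist:2", "music:lastName", "Ma"),
    ("artist:2", "music:plays", "instrument:20"),
    ("instrument:10", "music:instName", "Guitar"),
]


def map_phase(triple):
    """Emit the join key (?a) for triples matching either pattern."""
    subject, predicate, obj = triple
    if predicate in ("music:lastName", "music:plays"):
        yield subject, (predicate, obj)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Join the two patterns for one binding of ?a."""
    names = [o for p, o in values if p == "music:lastName"]
    instruments = [o for p, o in values if p == "music:plays"]
    for name in names:
        for instrument in instruments:
            yield {"a": key, "last": name, "i": instrument}


pairs = [kv for t in TRIPLES for kv in map_phase(t)]
results = [row
           for key, values in shuffle(pairs).items()
           for row in reduce_phase(key, values)]

for row in results:
    print(row)
```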
53. R2RML in Hadoop
• Provide mapping file against tabular data files in Hadoop
• Execute SPARQL queries through the virtual mapping
• View your data as triples
• But leave it in sequence files
• OR materialize the virtual mapping into a real set of triples
54. Federation
• Execute queries against combination of data inside and outside Hadoop
• Or against combination of Hadoop and real-time (Storm)
• Or across multiple Hadoop clusters!