Big Data with Semantics




                  Alex Miller
                @puredanger

                picture: http://bit.ly/MLUIon
Hadoop for Data Integration

 • Companies are flocking
   to Hadoop right now,
   mostly for ETL/analysis

 • Starting to also use it for data integration
 • Traditionally the domain of data
   warehouses


                       2
Data Integration in Hive

• Load multiple sources
• Define, query with HiveQL
• Queries access multiple sources in terms
  of their original data

• Adding a new "data source" means
  changing all of your queries to
  accommodate the new data
                       3
Integration with Semantics
• Load data into Hadoop
• Map data into common domain
  vocabulary

• Query all your sources with common
  domain vocabulary

• Adding a new "data source" means
  mapping the new source into the domain
                   4
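
A minimal sketch of the idea on this slide, assuming two made-up sources (`crm` and `web`) whose columns are mapped into a shared vocabulary. The source names, column names, and `ex:` terms are hypothetical and only illustrate that a query written against the domain vocabulary does not need to change when a source is added.

```python
# Hypothetical rows from two sources with different column names.
crm_rows = [{"cust_name": "Ada", "cust_city": "Chicago"}]
web_rows = [{"user": "Grace", "location": "Saint Louis"}]

# Per-source mapping from original column names to common domain terms.
SOURCE_MAPPINGS = {
    "crm": {"cust_name": "ex:name", "cust_city": "ex:city"},
    "web": {"user": "ex:name", "location": "ex:city"},
}

def to_triples(source, rows):
    """Rewrite each cell as (subject, domain property, value)."""
    mapping = SOURCE_MAPPINGS[source]
    triples = []
    for i, row in enumerate(rows):
        subject = f"{source}:{i}"
        for column, value in row.items():
            triples.append((subject, mapping[column], value))
    return triples

all_triples = to_triples("crm", crm_rows) + to_triples("web", web_rows)

# One "query" in domain terms covers both sources.
names = sorted(o for s, p, o in all_triples if p == "ex:name")
print(names)
```

Adding a third source here would mean adding one entry to `SOURCE_MAPPINGS`; the final query stays the same.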
Multiple Sources
     in Hive
      Query        Query
        1            2




 S1           S2           S3




              5
Multiple Sources
with Semantics
      Query         Query
        1             2




         Domain Vocab



 S1           S2            S3



              6
Key Technologies


• RDF - data model
• RDFS - schema definition
• SPARQL - query language
• R2RML - relational to RDF mapping


                     7
RDF

"Resource Description Framework"




               8
There are things we wish
      to describe.


           9
We need some way to
 identify each thing.


          10
A URI is about "identifying" things,
not "locating" things (a URL).




On the web, we identify
  things with a URI.


           11
dbp:Chicago_(band)




dbp:Wrigley_Field
                                                        dbp:The_Blues_Brothers_(film)



                              dbp:Chicago




dbp:Chicago_Cubs                                             dbp:Barack_Obama

                                dbp:Pizza

                    dbp: http://dbpedia.org/resource/

                                   12
Things are more
interesting if we relate
         them.

Relationships are also
 described by a URI.

           13
Relationships

  dbp:Wrigley_Field              --dbpo:location-->        dbp:Chicago
  dbp:The_Blues_Brothers_(film)  --movie:film_location-->  dbp:Chicago
  dbp:Barack_Obama               --dbpo:residence-->       dbp:Chicago
  dbp:Chicago_Cubs               --dbpo:owner-->           dbp:Wrigley_Field

  Also in the graph: dbp:Chicago_(band), dbp:Pizza

                                      dbp: http://dbpedia.org/resource/
                                     dbpo: http://dbpedia.org/ontology/

                                                   14
Triple
         "fact" or "assertion"


<subject> <predicate> <object>




                  15
  Subject              Predicate          Object
  dbp:Wrigley_Field    dbpo:location      dbp:Chicago

  Other edges in the graph:
  dbp:The_Blues_Brothers_(film)  --movie:film_location-->  dbp:Chicago
  dbp:Barack_Obama               --dbpo:residence-->       dbp:Chicago
  dbp:Chicago_Cubs               --dbpo:owner-->           dbp:Wrigley_Field

  Also in the graph: dbp:Chicago_(band), dbp:Pizza

                                       dbp: http://dbpedia.org/resource/
                                      dbpo: http://dbpedia.org/ontology/

                                                     16
Triple
  <subject> <predicate> <object>

dbp:Wrigley_Field dbpo:location dbp:Chicago

   resource        resource     resource
   (vertex)         (edge)      (vertex)
                                   or
                                  value

                     17
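
As a rough sketch of this slide, a triple can be held as a plain `(subject, predicate, object)` tuple and a graph as a set of such tuples. The `match` helper is a hypothetical name, not part of any RDF library; the two triples are the ones drawn in the slides.

```python
DBP = "http://dbpedia.org/resource/"
DBPO = "http://dbpedia.org/ontology/"

# A graph is just a set of (subject, predicate, object) tuples.
graph = {
    (DBP + "Wrigley_Field", DBPO + "location", DBP + "Chicago"),
    (DBP + "Chicago_Cubs", DBPO + "owner", DBP + "Wrigley_Field"),
}

def match(graph, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in graph
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )

# Which resources have a dbpo:location edge pointing at dbp:Chicago?
located_in_chicago = [
    t[0] for t in match(graph, p=DBPO + "location", o=DBP + "Chicago")
]
print(located_in_chicago)
```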
Graph

  dbp:Wrigley_Field              --dbpo:location-->        dbp:Chicago
  dbp:The_Blues_Brothers_(film)  --movie:film_location-->  dbp:Chicago
  dbp:Barack_Obama               --dbpo:residence-->       dbp:Chicago
  dbp:Chicago_Cubs               --dbpo:owner-->           dbp:Wrigley_Field

  Also in the graph: dbp:Chicago_(band), dbp:Pizza

                                      dbp: http://dbpedia.org/resource/
                                     dbpo: http://dbpedia.org/ontology/

                                                    18
If things and relationships
   can be defined by any
   URI, how do we know
what we're talking about?


             19
We need metadata.



        20
Specifically, we need a
  vocabulary of terms
that describe our data.


           21
A class describes a
group of things that
  share common
    properties.


         22
Class

                                      ex:City



              is a                           is a                       is a




dbp:San_Francisco                    dbp:Chicago                           dbp:Saint_Louis


                     dbp: http://dbpedia.org/resource/
                     ex: http://example.org/ontology/
                     rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                     rdfs: http://www.w3.org/2000/01/rdf-schema#

                                           23
rdf:type (aka "a")

                                        ex:City


                                                                          rdf:type
            rdf:type                           rdf:type




dbp:San_Francisco                      dbp:Chicago                             dbp:Saint_Louis


                       dbp: http://dbpedia.org/resource/
                       ex: http://example.org/ontology/
                       rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                       rdfs: http://www.w3.org/2000/01/rdf-schema#

                                             24
rdfs:Class                             rdfs:Class

                                                rdf:type



                                         ex:City


                                                                           rdf:type
             rdf:type                           rdf:type




 dbp:San_Francisco                      dbp:Chicago                             dbp:Saint_Louis


                        dbp: http://dbpedia.org/resource/
                        ex: http://example.org/ontology/
                        rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                        rdfs: http://www.w3.org/2000/01/rdf-schema#

                                              25
rdfs:subClassOf

                                        rdf:type
                ex:Location                         rdfs:Class

                         rdfs:subClassOf

                                       rdf:type
                   ex:City                          rdfs:Class




 dbp: http://dbpedia.org/resource/
 ex: http://example.org/ontology/
 rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
 rdfs: http://www.w3.org/2000/01/rdf-schema#

                       26
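
One consequence of `rdfs:subClassOf` is that an instance of a subclass is also an instance of every superclass. The sketch below assumes the `ex:City` / `ex:Location` classes from this slide plus the `dbp:Chicago rdf:type ex:City` fact from the earlier slides; the helper names are hypothetical and the code is a toy, not a full RDFS reasoner.

```python
RDF_TYPE = "rdf:type"
SUBCLASS = "rdfs:subClassOf"

triples = {
    ("ex:City", SUBCLASS, "ex:Location"),
    ("dbp:Chicago", RDF_TYPE, "ex:City"),
}

def superclasses(cls, triples):
    """All classes reachable from cls by following rdfs:subClassOf edges."""
    seen, stack = set(), [cls]
    while stack:
        current = stack.pop()
        for s, p, o in triples:
            if s == current and p == SUBCLASS and o not in seen:
                seen.add(o)
                stack.append(o)
    return seen

def types_of(resource, triples):
    """Asserted types plus the types implied by rdfs:subClassOf."""
    asserted = {o for s, p, o in triples if s == resource and p == RDF_TYPE}
    inferred = set()
    for cls in asserted:
        inferred |= superclasses(cls, triples)
    return asserted | inferred

chicago_types = types_of("dbp:Chicago", triples)
print(sorted(chicago_types))
```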
Classes let us talk about
kinds of things. Now we
   need some way to
   describe attributes.


            27
ex:City



                                              rdf:type




                    ex:country                             ex:founded
dbp:United_States                                                        1837


                                      dbp:Chicago




                      dbp: http://dbpedia.org/resource/
                      ex: http://example.org/ontology/
                      rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                      rdfs: http://www.w3.org/2000/01/rdf-schema#

                                            28
rdf:Property

  ex:founded   rdf:type      rdf:Property
  ex:founded   rdfs:domain   ex:City
  ex:founded   rdfs:range    xsd:gYear

  dbp:Chicago  rdf:type      ex:City
  dbp:Chicago  ex:founded    1837




               dbp: http://dbpedia.org/resource/
               ex: http://example.org/ontology/
               rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
               rdfs: http://www.w3.org/2000/01/rdf-schema#

                                     29
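
A small sketch of what `rdfs:domain` buys us, using the triples on this slide: if a property has a domain, any subject that uses the property can be inferred to belong to that class. The function name is hypothetical, and the sketch assumes at most one domain per property.

```python
triples = {
    ("ex:founded", "rdf:type", "rdf:Property"),
    ("ex:founded", "rdfs:domain", "ex:City"),
    ("ex:founded", "rdfs:range", "xsd:gYear"),
    ("dbp:Chicago", "ex:founded", "1837"),
}

def infer_domain_types(triples):
    """For each use of a property with a declared domain, type the subject."""
    # Assumption: one rdfs:domain per property.
    domains = {s: o for s, p, o in triples if p == "rdfs:domain"}
    return {(s, "rdf:type", domains[p]) for s, p, o in triples if p in domains}

inferred = infer_domain_types(triples)
print(inferred)
```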
How do we query stuff in
      this data?

        SPARQL


           30
Data and metadata
ex:Baseball_Team                ex:Stadium                          ex:City


  rdf:type                    rdf:type                        rdf:type



                 dbpo:owner                        dbpo:location


                                                                   dbp:Chicago
  dbp:Chicago_Cubs
                               dbp:Wrigley_Field




                        dbp: http://dbpedia.org/resource/
                       dbpo: http://dbpedia.org/ontology/

                                         31
ex:Stadium                         ex:City


                      rdf:type                        rdf:type



         dbpo:owner                   dbpo:location
?owner                   ?stadium                           ?city




         Graph pattern


                                 32
ex:Stadium                                 ex:City

?stadium rdf:type ex:Stadium .                     ?city rdf:type ex:City .
                                 rdf:type                                rdf:type



                    dbpo:owner                           dbpo:location
   ?owner                            ?stadium                                     ?city

             ?owner dbpo:owner ?stadium .        ?stadium dbpo:location ?city .




                   Triple pattern


                                            33
ex:Stadium                                 ex:City

  ?stadium rdf:type ex:Stadium .                     ?city rdf:type ex:City .
                                   rdf:type                                rdf:type



                      dbpo:owner                           dbpo:location
     ?owner                            ?stadium                                     ?city

               ?owner dbpo:owner ?stadium .        ?stadium dbpo:location ?city .



SELECT ?owner ?stadium ?city
WHERE {
  ?owner dbpo:owner ?stadium .
  ?stadium dbpo:location ?city .
  ?stadium rdf:type ex:Stadium .
  ?city rdf:type ex:City .
}
                                              34
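
To make the graph-pattern idea concrete, here is a toy evaluator for the four triple patterns in the query above, run over the data from the "Data and metadata" slide. It is a naive nested-loop join written for illustration, not how a real SPARQL engine works; `match_pattern` and `query` are hypothetical names.

```python
def match_pattern(pattern, triple, binding):
    """Extend binding so that pattern matches triple, or return None."""
    new = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if new.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return new

def query(patterns, triples):
    """Join the triple patterns one at a time over all triples."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in triples:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

data = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Chicago_Cubs", "rdf:type", "ex:Baseball_Team"),
    ("dbp:Wrigley_Field", "rdf:type", "ex:Stadium"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
]

patterns = [
    ("?owner", "dbpo:owner", "?stadium"),
    ("?stadium", "dbpo:location", "?city"),
    ("?stadium", "rdf:type", "ex:Stadium"),
    ("?city", "rdf:type", "ex:City"),
]

results = query(patterns, data)
print(results)
```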
SPARQL

• Unions
• Joins
• Outer joins
• Filter with criteria
• Project expressions
• Sort
• Duplicate removal
• Slice (limit / offset)
• Aggregates (grouping, etc.)
• Subqueries
               35
Sounds interesting.
But I don't have triples!



            36
How do we map tables
(text or sequence file)
       to triples?


           37
Music Database
Musicians:
 MID         First       Last        Inst_ID
   1     Eddie         Van Halen       10
   2     Yo Yo            Ma           20
   3     Kenny            G            30




                      Instruments:     IID     Instrument     Type
                                       10        Guitar      String
                                       20        Cello       String
                                       30      Saxophone    Woodwind



                                      38
Musician Schema
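
  Classes (rdf:type rdfs:Class):
    music:Musician
    music:Instrument

  Properties (rdf:type rdf:Property):
    music:firstName   rdfs:domain  music:Musician
    music:lastName    rdfs:domain  music:Musician
    music:plays       rdfs:domain  music:Musician
    music:plays       rdfs:range   music:Instrument
    music:instName    rdfs:domain  music:Instrument
    music:instType    rdfs:domain  music:Instrument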



                           39
Tables to Triples
    Musicians:                                     Instruments:
      MID    First      Last       Inst_ID           IID    Instrument     Type
       1     Eddie    Van Halen      10               10      Guitar      String
       2     Yo Yo       Ma          20               20      Cello       String
       3     Kenny       G           30               30    Saxophone    Woodwind



  Turn each key into a resource and specify the proper
  type of each resource:

artist:1 rdf:type music:Musician             instrument:10 rdf:type music:Instrument
artist:2 rdf:type music:Musician             instrument:20 rdf:type music:Instrument
artist:3 rdf:type music:Musician             instrument:30 rdf:type music:Instrument



                                             40
Tables to Triples
     Musicians:                                         Instruments:
       MID         First      Last      Inst_ID           IID      Instrument     Type
           1       Eddie    Van Halen     10               10        Guitar      String
           2       Yo Yo       Ma         20               20        Cello       String
           3       Kenny       G          30               30      Saxophone    Woodwind



   Turn each cell into a triple based on the key, property
   (mapped per column), and value:
artist:1       music:firstName "Eddie"             instrument:10   music:instName "Guitar"
artist:1       music:lastName "Van Halen"         instrument:10   music:instType "String"
artist:2       music:firstName "Yo Yo"             instrument:20   music:instName "Cello"
artist:2       music:lastName "Ma"                instrument:20   music:instType "String"
artist:3       music:firstName "Kenny"             instrument:30   music:instName "Saxophone"
artist:3       music:lastName "G"                 instrument:30   music:instType "Woodwind"


                                                  41
Tables to Triples
 Musicians:                                   Instruments:
  MID     First      Last      Inst_ID          IID    Instrument     Type
   1      Eddie    Van Halen     10             10       Guitar      String
   2      Yo Yo       Ma         20             20       Cello       String
   3      Kenny       G          30             30     Saxophone    Woodwind



Turn each foreign key reference into a relationship
between the foreign and primary resources.

                   artist:1 music:plays instrument:10
                   artist:2 music:plays instrument:20
                   artist:3 music:plays instrument:30




                                         42
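
The three steps (keys to typed resources, cells to property triples, foreign keys to relationships) can be sketched in a few lines. This assumes the two tables are available as lists of dictionaries and reuses the `artist:`, `instrument:`, and `music:` names from the slides; the function names are hypothetical.

```python
musicians = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]
instruments = [
    {"IID": 10, "Instrument": "Guitar", "Type": "String"},
    {"IID": 20, "Instrument": "Cello", "Type": "String"},
    {"IID": 30, "Instrument": "Saxophone", "Type": "Woodwind"},
]

def musician_triples(row):
    subject = f"artist:{row['MID']}"                    # key -> resource
    return [
        (subject, "rdf:type", "music:Musician"),        # resource type
        (subject, "music:firstName", row["First"]),     # cell -> triple
        (subject, "music:lastName", row["Last"]),
        # foreign key -> relationship to the referenced resource
        (subject, "music:plays", f"instrument:{row['Inst_ID']}"),
    ]

def instrument_triples(row):
    subject = f"instrument:{row['IID']}"
    return [
        (subject, "rdf:type", "music:Instrument"),
        (subject, "music:instName", row["Instrument"]),
        (subject, "music:instType", row["Type"]),
    ]

triples = [t for row in musicians for t in musician_triples(row)]
triples += [t for row in instruments for t in instrument_triples(row)]

for triple in triples:
    print(*triple)
```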
R2RML
• "Relational to RDF Mapping Language"
• RDB2RDF Working Group at W3C
• ETL "data transformation" use case
• Dynamic "query translation" use case
  • Translate SPARQL query against
    domain to SQL query against the dbms

                   43
R2RML Triple Mapping
  music:instName  --rdfs:domain-->  music:Instrument
  music:instType  --rdfs:domain-->  music:Instrument




           Instruments:
             IID     Instrument           Type
              10          Guitar          String

                              44
R2RML Triple Mapping
  music:instName  --rdfs:domain-->  music:Instrument
  music:instType  --rdfs:domain-->  music:Instrument




Triples Map       rr:tableName

                 Instruments:
                   IID       Instrument          Type
                    10           Guitar          String

                                     44
R2RML Triple Mapping
  music:instName  --rdfs:domain-->  music:Instrument
  music:instType  --rdfs:domain-->  music:Instrument

  Subject Map  --rr:class-->  music:Instrument
    template: "http://example.com/music/Inst-{iid}"




Triples Map              rr:tableName

                        Instruments:
                           IID         Instrument        Type
                            10           Guitar          String

                                             44
R2RML Triple Mapping
  music:instName  --rdfs:domain-->  music:Instrument
  music:instType  --rdfs:domain-->  music:Instrument

  Triples Map
    rr:tableName  -->  Instruments
    Subject Map
      rr:class  -->  music:Instrument
      template: "http://example.com/music/Inst-{iid}"
    Predicate Object Map
      Predicate   rr:predicate  -->  music:instType
      Object Map  rr:column     -->  (column of Instruments)

  Instruments:
     IID         Instrument        Type
      10           Guitar          String

                                             44
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix music: <http://example.com/music/> .
@prefix mapping: <http://example.com/ont/> .

mapping:InstrumentMapping
    a rr:TriplesMap;
    rr:logicalTable [ rr:tableName "Instruments" ];
    rr:subjectMap [
       rr:template "http://example.com/music/Inst-{iid}";
       rr:class     music:Instrument
    ];
    rr:predicateObjectMap [
       rr:predicate      music:instName ;
       rr:objectMap      [ rr:column "instrument" ];
    ];
    rr:predicateObjectMap [
       rr:predicate      music:instType ;
       rr:objectMap      [ rr:column "type" ];
    ];
.

                             45
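
To show roughly what a processor does with a mapping like the one above, the sketch below applies the same subject template, class, and column mappings to one row of the Instruments table. The dictionary layout and `apply_mapping` are hypothetical stand-ins for the Turtle mapping, and the sketch skips things a real R2RML processor handles, such as IRI-safe encoding of template values and datatype handling. Column names are matched case-insensitively here, which is an assumption.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical in-memory form of mapping:InstrumentMapping.
mapping = {
    "table": "Instruments",
    "subject_template": "http://example.com/music/Inst-{iid}",
    "subject_class": "http://example.com/music/Instrument",
    "predicate_object_maps": [
        ("http://example.com/music/instName", "instrument"),
        ("http://example.com/music/instType", "type"),
    ],
}

def apply_mapping(mapping, row):
    """Generate the triples that the mapping produces for one table row."""
    row = {name.lower(): value for name, value in row.items()}
    # Fill {column} placeholders in the subject template from the row.
    subject = re.sub(
        r"\{(\w+)\}",
        lambda m: str(row[m.group(1).lower()]),
        mapping["subject_template"],
    )
    triples = [(subject, RDF_TYPE, mapping["subject_class"])]
    for predicate, column in mapping["predicate_object_maps"]:
        triples.append((subject, predicate, row[column]))
    return triples

row = {"IID": 10, "Instrument": "Guitar", "Type": "String"}
generated = apply_mapping(mapping, row)
for triple in generated:
    print(triple)
```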
Direct mapping


• Automatically map relational tables into a
  domain vocabulary using R2RML

• Good starting point to rapidly integrate
  two data sources



                     46
So what about big data?



           47
Triple data in Hadoop

• n-triple files
  • standard line format for RDF data
• indexed triple format
  • triples in Thrift representing RDF terms
• text / sequence files as tabular sources

                     48
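
For reference, an N-Triples file holds one triple per line: three terms followed by a period. The sketch below writes such lines for IRIs and plain string literals only; the `("iri", ...)` / `("literal", ...)` tagging and the function names are assumptions of this example, and blank nodes, language tags, and datatypes are left out.

```python
def nt_term(term):
    """Serialize one term given as ("iri", value) or ("literal", value)."""
    kind, value = term
    if kind == "iri":
        return f"<{value}>"
    # Escape the characters that cannot appear raw inside a literal.
    escaped = (
        value.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )
    return f'"{escaped}"'

def nt_line(s, p, o):
    """One N-Triples line: subject, predicate, object, then a period."""
    return f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ."

line = nt_line(
    ("iri", "http://dbpedia.org/resource/Wrigley_Field"),
    ("iri", "http://dbpedia.org/ontology/location"),
    ("iri", "http://dbpedia.org/resource/Chicago"),
)
print(line)
```

Because each line is a complete triple, files in this format split cleanly across Hadoop input blocks.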
SPARQL in Hadoop

• Compile SPARQL to map-reduce jobs
  against triple (or tuple) data

• Results materialized back into Hadoop
  files

• Similar to HiveQL compiling SQL to map-
  reduce against tabular data

                     49
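
As an illustration of how a pair of triple patterns could become a map-reduce job, the sketch below joins `?owner dbpo:owner ?stadium` with `?stadium dbpo:location ?city` on the shared `?stadium` variable. It is a generic reduce-side join simulated in plain Python, not the compiler described in the talk; all function names are hypothetical.

```python
from collections import defaultdict

def map_phase(triple):
    """Emit (join key, tagged value) for triples matching either pattern."""
    s, p, o = triple
    if p == "dbpo:owner":         # ?owner dbpo:owner ?stadium
        yield o, ("owner", s)
    elif p == "dbpo:location":    # ?stadium dbpo:location ?city
        yield s, ("city", o)

def reduce_phase(stadium, values):
    """Combine both sides of the join for one ?stadium key."""
    owners = [v for tag, v in values if tag == "owner"]
    cities = [v for tag, v in values if tag == "city"]
    for owner in owners:
        for city in cities:
            yield {"?owner": owner, "?stadium": stadium, "?city": city}

def run_job(triples):
    """Simulate map, shuffle (group by key), and reduce in one process."""
    groups = defaultdict(list)
    for triple in triples:
        for key, value in map_phase(triple):
            groups[key].append(value)
    results = []
    for key in sorted(groups):
        results.extend(reduce_phase(key, groups[key]))
    return results

triples = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Barack_Obama", "dbpo:residence", "dbp:Chicago"),
]

joined = run_job(triples)
print(joined)
```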
R2RML in Hadoop
• Provide mapping file against tabular data
  files in Hadoop
• Execute SPARQL queries through the
  virtual mapping
  • View your data as triples
  • But leave it in sequence files
• OR materialize the virtual mapping into a
  real set of triples
                        50
Federation

• Execute queries against combination of
  data inside and outside Hadoop

• Or against combination of Hadoop and
  real-time (Storm)

• Or across multiple Hadoop clusters!

                      51
Additional capabilities


• SQL queries against tabular data
• Metadata registry
• Workflow design and execution


                      52
BioBig example

• Load into Hadoop as triples
  •   Diseasome - diseases (16.2 MB)
  •   LinkedCT - clinical trials (4.5 GB)
  •   DrugBank - drugs (144 MB)
  •   GeneID - genes (18 GB)
  •   PubMed - research publications (12 GB)

• Map into common domain vocabulary
• Query across all data sets
                         53
BioBig domain ontology
        (partial)




          54
SELECT ?disease ?disname ?geneid
              WHERE {
                 ?geneid a geneid:Gene .
                 ?geneid gene2pub:pubmed_xref ?article .
                 OPTIONAL { ?geneid dc:title ?genetitle . }
                 ?disease a diseasome:diseases .
                 ?genedb a diseasome:genes .
                 ?disease diseasome:associatedGene ?genedb .
                 ?genedb diseasome:geneId ?geneid .
                 OPTIONAL { ?disease diseasome:name ?disname . }
               }

Graph pattern:

  ?disease  a                         diseasome:diseases
  ?disease  diseasome:associatedGene  ?genedb
  ?disease  diseasome:name            ?disname      (optional)
  ?genedb   a                         diseasome:genes
  ?genedb   diseasome:geneId          ?geneid
  ?geneid   a                         geneid:Gene
  ?geneid   gene2pub:pubmed_xref      ?article
  ?geneid   dc:title                  ?genetitle    (optional)

                                           55
Thanks!

Big Data with Semantics - StampedeCon 2012

  • 1.
    Big Data withSemantics Alex Miller @puredanger picture: http://bit.ly/MLUIon
  • 2.
    Hadoop for DataIntegration • Companies are flocking to Hadoop right now, mostly for ETL/analysis • Starting to also use it for data integration • Traditionally the domain of data warehouses 2
  • 3.
    Data Integration inHive • Load multiple sources • Define, query with HiveQL • Queries access multiple sources in terms of their original data • Adding a new "data source" means changing all of your queries to accommodate the new data 3
  • 4.
    Integration with Semantics •Load data into Hadoop • Map data into common domain vocabulary • Query all your sources with common domain vocabulary • Adding a new "data source" means mapping the new source into the domain 4
  • 5.
    Multiple Sources in Hive Query Query 1 2 S1 S2 S3 5
  • 6.
    Multiple Sources with Semantics Query Query 1 2 Domain Vocab S1 S2 S3 6
  • 7.
    Key Technologies • RDF- data model • RDFS - schema definition • SPARQL - query language • R2RML - relational to RDF mapping 7
  • 8.
  • 9.
    There are thingswe wish to describe. 9
  • 10.
    We need someway to identify each thing. 10
  • 11.
    A URI isabo ut "identifying" things, not "locating" things (a URL). On the web, we identify things with a URI. 11
  • 12.
    dbp:Chicago_(band) dbp:Wrigley_Field dbp:The_Blues_Brothers_(film) dbp:Chicago dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza dbp: http://dbpedia.org/resource/ 12
  • 13.
    Things are more interestingif we relate them. Relationships are also described by a URI. 13
  • 14.
    Relationships dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field dbp:Chicago_(band) n db tio po oca :lo c _l m at ion :fil ie ov m dbpo:owner dbp:Chicago dbp o:r e si den c e dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 14
  • 15.
    Triple "fact" or "assertion" <subject> <predicate> <object> 15
  • 16.
    Subject dbp:Chicago_(band) dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field Predicate n db tio po ca :lo o ca _l m tio fil Object : n ie ov m dbpo:owner dbp:Chicago dbp o:r e si den c e dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 16
  • 17.
    Triple <subject><predicate> <object> dbp:Wrigley_Field dbpo:location dbp:Chicago resource resource resource (vertex) (edge) (vertex) or value 17
  • 18.
    Graph dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field dbp:Chicago_(band) n db tio po oca :lo c _l m at ion :fil ie ov m dbpo:owner dbp:Chicago dbp o:r e si den c e dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 18
  • 19.
    If things andrelationships can be defined by any URI, how do we know what we're talking about? 19
  • 20.
  • 21.
    Specifically, we needa vocabulary of terms that describe our data. 21
  • 22.
    A class describesa group of things that share common properties. 22
  • 23.
    Class ex:City is a is a is a dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 23
  • 24.
    rdf:type (aka "a") ex:City rdf:type rdf:type rdf:type dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 24
  • 25.
    rdfs:Class rdfs:Class rdf:type ex:City rdf:type rdf:type rdf:type dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 25
  • 26.
    rdf:subClassOf rdf:type ex:Location rdfs:Class rdfs:subClassOf rdf:type ex:City rdfs:Class dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 26
  • 27.
    Classes let ustalk about kinds of things. Now we need some way to describe attributes. 27
  • 28.
    ex:City rdf:type ex:country ex:founded dbp:United_States 1837 dbp:Chicago dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 28
  • 29.
    rdf:Property rdfs:do ex:City main rdfs:range rdf:Property xsd:gYear rdf:type rdf:type ex:founded 1837 dbp:Chicago dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 29
  • 30.
    How do wequery stuff in this data? SPARQL 30
  • 31.
    Data and metadata ex:Baseball_Team ex:Stadium ex:City rdf:type rdf:type rdf:type dbpo:owner dbpo:location dbp:Chicago dbp:Chicago_Cubs dbp:Wrigley_Field dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 31
  • 32.
    ex:Stadium ex:City rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city Graph pattern 32
  • 33.
    ex:Stadium ex:City ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . Triple pattern 33
  • 34.
    ex:Stadium ex:City ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . SELECT ?owner ?stadium ?city WHERE { ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . } 34
  • 35.
    Unions Joins SPARQL Outer joins Filter with criteria Project expressions Sort Duplicate removal Slice (limit / offset) Aggregates (grouping, etc) Subqueries 22 35
  • 36.
    Sounds interesting. But Idon't have triples! 36
  • 37.
    How do wemap tables (text or sequence file) to triples? 37
  • 38.
    Music Database Musicians: MID First Last Inst_ID 1 Eddie Van Halen 10 2 Yo Yo Ma 20 3 Kenny G 30 Instruments: IID Instrument Type 10 Guitar String 20 Cello String 30 Saxophone Woodwind 38
  • 39.
    Musician Schema rdfs:Class rdf:Property rdf:type rdf:type rdfs:domain music:firstName music:Musician rdfs:doma in rdfs music:lastName :dom ain rdfs:range music:plays music:Instrument rdfs:dom ain rdfs :do music:instName mai n music:instType 39
  • 40.
    Tables to Triples Musicians: Instruments: MID First Last Inst_ID IID Instrument Type 1 Eddie Van Halen 10 10 Guitar String 2 Yo Yo Ma 20 20 Cello String 3 Kenny G 30 30 Saxophone Woodwind Turn each key into a resource and specify the proper type of each resource: artist:1 rdf:type music:Musician instrument:10 rdf:type music:Instrument artist:2 rdf:type music:Musician instrument:20 rdf:type music:Instrument artist:3 rdf:type music:Musician instrument:30 rdf:type music:Instrument 40
  • 41.
    Tables to Triples
    Source tables: Musicians and Instruments (see the Music Database slide).

    Turn each cell into a triple based on the key, property (mapped per column), and value:
      artist:1       music:firstName  "Eddie"
      artist:1       music:lastName   "Van Halen"
      artist:2       music:firstName  "Yo Yo"
      artist:2       music:lastName   "Ma"
      artist:3       music:firstName  "Kenny"
      artist:3       music:lastName   "G"
      instrument:10  music:instName   "Guitar"
      instrument:10  music:instType   "String"
      instrument:20  music:instName   "Cello"
      instrument:20  music:instType   "String"
      instrument:30  music:instName   "Saxophone"
      instrument:30  music:instType   "Woodwind"
  • 42.
    Tables to Triples
    Source tables: Musicians and Instruments (see the Music Database slide).

    Turn each foreign key reference into a relationship between the foreign and primary resources:
      artist:1  music:plays  instrument:10
      artist:2  music:plays  instrument:20
      artist:3  music:plays  instrument:30
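    The three steps (type each key, turn each cell into a triple, turn each foreign key into a link) can be sketched in a few lines of Python. This is an illustration under simple assumptions: rows are plain dictionaries, literals are shown as quoted strings, and the helper `table_to_triples` and its parameters are hypothetical names, not part of any tool mentioned in the talk.

```python
# Sketch only: rows as dicts, triples as (subject, predicate, object) tuples.
MUSICIANS = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]
INSTRUMENTS = [
    {"IID": 10, "Instrument": "Guitar", "Type": "String"},
    {"IID": 20, "Instrument": "Cello", "Type": "String"},
    {"IID": 30, "Instrument": "Saxophone", "Type": "Woodwind"},
]


def table_to_triples(rows, key, prefix, rdf_class, columns, foreign_keys=None):
    """Map one table to triples.

    columns:      {column name: property} for literal-valued cells
    foreign_keys: {column name: (property, prefix of the referenced resource)}
    """
    triples = []
    for row in rows:
        subject = f"{prefix}:{row[key]}"
        # Step 1: the key becomes a typed resource.
        triples.append((subject, "rdf:type", rdf_class))
        # Step 2: each mapped cell becomes a literal-valued triple.
        for column, prop in columns.items():
            triples.append((subject, prop, f'"{row[column]}"'))
        # Step 3: each foreign key becomes a link to another resource.
        for column, (prop, target_prefix) in (foreign_keys or {}).items():
            triples.append((subject, prop, f"{target_prefix}:{row[column]}"))
    return triples


triples = table_to_triples(
    MUSICIANS, "MID", "artist", "music:Musician",
    {"First": "music:firstName", "Last": "music:lastName"},
    {"Inst_ID": ("music:plays", "instrument")},
) + table_to_triples(
    INSTRUMENTS, "IID", "instrument", "music:Instrument",
    {"Instrument": "music:instName", "Type": "music:instType"},
)

for triple in triples:
    print(*triple)
```

    Each musician row links to the instrument named in its own Inst_ID column, so the foreign key step yields exactly one music:plays triple per musician.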
  • 43.
    R2RML
    • "Relational to RDF Mapping Language"
    • RDB2RDF Working Group at W3C
    • ETL "data transformation" use case
    • Dynamic "query translation" use case
    • Translate SPARQL query against domain to SQL query against the DBMS
  • 44.
    R2RML Triple Mapping
    Figure: schema fragment (music:instName and music:instType, each with rdfs:domain music:Instrument) shown above the Instruments table (columns IID, Instrument, Type; sample row 10, Guitar, String).
  • 45.
    R2RML Triple Mapping
    Figure: a Triples Map is added between the schema fragment and the Instruments table; rr:tableName links the Triples Map to the table.
  • 46.
    R2RML Triple Mapping
    Figure: a Subject Map is added to the Triples Map, with template "http://example.com/music/Inst-{iid}" and rr:class pointing to music:Instrument; rr:tableName still links the Triples Map to the Instruments table.
  • 47.
    R2RML Triple Mapping
    Figure: two Predicate Object Maps are added to the Triples Map. Each one points to a schema property (music:instName, music:instType) with rr:predicate and to a column of the Instruments table with rr:column. The Subject Map (template "http://example.com/music/Inst-{iid}", rr:class music:Instrument) and rr:tableName are unchanged.
  • 48.
    @prefix rr: <http://www.w3.org/ns/r2rml#> .
    @prefix music: <http://example.com/music/> .
    @prefix mapping: <http://example.com/ont/> .

    mapping:InstrumentMapping a rr:TriplesMap ;
        rr:logicalTable [ rr:tableName "Instruments" ] ;
        rr:subjectMap [
            rr:template "http://example.com/music/Inst-{iid}" ;
            rr:class music:Instrument
        ] ;
        rr:predicateObjectMap [
            rr:predicate music:instName ;
            rr:objectMap [ rr:column "instrument" ]
        ] ;
        rr:predicateObjectMap [
            rr:predicate music:instType ;
            rr:objectMap [ rr:column "type" ]
        ] .
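    As a rough illustration of what such a mapping does at run time, the sketch below applies the same template, class, and column choices to a row in Python. It is not an R2RML processor: the dictionary layout, the `apply_mapping` helper, and the case-insensitive column lookup are assumptions made for this example, and the percent-encoding only approximates the IRI-safe encoding that R2RML specifies for template values.

```python
import re
from urllib.parse import quote

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical in-memory form of the InstrumentMapping shown in Turtle.
INSTRUMENT_MAPPING = {
    "table": "Instruments",
    "subject_template": "http://example.com/music/Inst-{iid}",
    "class": "http://example.com/music/Instrument",
    "predicate_object_maps": [
        ("http://example.com/music/instName", "instrument"),
        ("http://example.com/music/instType", "type"),
    ],
}


def apply_mapping(mapping, rows):
    """Generate (subject, predicate, object) tuples for each row."""
    triples = []
    for row in rows:
        # Assumption: column names are matched case-insensitively.
        cells = {name.lower(): value for name, value in row.items()}

        def fill(match):
            # Percent-encode the cell value before placing it in the IRI.
            return quote(str(cells[match.group(1).lower()]), safe="")

        subject = re.sub(r"\{(\w+)\}", fill, mapping["subject_template"])
        triples.append((subject, RDF_TYPE, mapping["class"]))
        for predicate, column in mapping["predicate_object_maps"]:
            triples.append((subject, predicate, str(cells[column.lower()])))
    return triples


rows = [{"IID": 10, "Instrument": "Guitar", "Type": "String"}]
mapped = apply_mapping(INSTRUMENT_MAPPING, rows)
for triple in mapped:
    print(triple)
```

    Keeping the mapping as data rather than code is what allows the same description to drive either an ETL run or on-the-fly query translation.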
  • 49.
    Direct mapping
    • Automatically map relational tables into a domain vocabulary using R2RML
    • Good starting point to rapidly integrate two data sources
  • 50.
    So what about big data?
  • 51.
    Triple data in Hadoop
    • N-Triples files
      • standard line format for RDF data
    • indexed triple format
      • triples in Thrift representing RDF terms
    • text / sequence files as tabular sources
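    Because N-Triples puts one complete triple on each line, files in that format split cleanly across Hadoop input splits. The sketch below writes triples in that line format using only the Python standard library. It handles absolute IRIs and plain string literals only (no blank nodes, datatypes, or language tags), and the function names are hypothetical.

```python
def escape_literal(text):
    """Escape the characters that cannot appear raw in an N-Triples string."""
    return (
        text.replace("\\", "\\\\")
        .replace('"', '\\"')
        .replace("\n", "\\n")
        .replace("\r", "\\r")
    )


def to_ntriples_line(subject, predicate, obj, obj_is_literal=False):
    """Format one triple as a single N-Triples line (without the newline)."""
    if obj_is_literal:
        object_term = f'"{escape_literal(obj)}"'
    else:
        object_term = f"<{obj}>"
    return f"<{subject}> <{predicate}> {object_term} ."


lines = [
    to_ntriples_line(
        "http://example.com/music/Inst-10",
        "http://www.w3.org/1999/02/22-rdf-syntax-ns#type",
        "http://example.com/music/Instrument",
    ),
    to_ntriples_line(
        "http://example.com/music/Inst-10",
        "http://example.com/music/instName",
        "Guitar",
        obj_is_literal=True,
    ),
]
print("\n".join(lines))
```

    A mapper can then treat every input line as an independent record, with no state carried from one line to the next.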
  • 52.
    SPARQL in Hadoop
    • Compile SPARQL to map-reduce jobs against triple (or tuple) data
    • Results materialized back into Hadoop files
    • Similar to HiveQL compiling SQL to map-reduce against tabular data
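    One way to picture a SPARQL join running as a map-reduce job is a reduce-side join on the shared variable. The sketch below simulates the map, shuffle, and reduce phases in a single Python process for the two patterns `?owner dbpo:owner ?stadium` and `?stadium dbpo:location ?city`. It is an illustration of the general technique only, not a description of how any particular engine compiles queries; the data and function names are assumptions.

```python
from collections import defaultdict

# Assumed sample data, matching the stadium example used earlier in the deck.
TRIPLES = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Wrigley_Field", "rdf:type", "ex:Stadium"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
]


def map_phase(triple):
    """Emit (join key, tagged value) pairs; the join key is ?stadium."""
    subject, predicate, obj = triple
    if predicate == "dbpo:owner":
        yield obj, ("owner", subject)    # ?stadium is the object here
    elif predicate == "dbpo:location":
        yield subject, ("city", obj)     # ?stadium is the subject here


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(stadium, values):
    """Combine every owner with every city that shares the same stadium."""
    owners = [value for tag, value in values if tag == "owner"]
    cities = [value for tag, value in values if tag == "city"]
    for owner in owners:
        for city in cities:
            yield {"?owner": owner, "?stadium": stadium, "?city": city}


def run_job(triples):
    pairs = [pair for triple in triples for pair in map_phase(triple)]
    groups = shuffle(pairs)
    return [row for key in sorted(groups) for row in reduce_phase(key, groups[key])]


joined = run_job(TRIPLES)
print(joined)
```

    A query with more patterns would typically need further joins, which in a map-reduce setting usually means either more jobs or a wider join key.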
  • 53.
    R2RML in Hadoop
    • Provide mapping file against tabular data files in Hadoop
    • Execute SPARQL queries through the virtual mapping
      • View your data as triples
      • But leave it in sequence files
    • OR materialize the virtual mapping into a real set of triples
  • 54.
    Federation
    • Execute queries against a combination of data inside and outside Hadoop
    • Or against a combination of Hadoop and real-time (Storm)
    • Or across multiple Hadoop clusters!
  • 55.
    Additional capabilities
    • SQL queries against tabular data
    • Metadata registry
    • Workflow design and execution
  • 56.
    BioBig example
    • Load into Hadoop as triples
      • Diseasome - diseases (16.2 MB)
      • LinkedCT - clinical trials (4.5 GB)
      • DrugBank - drugs (144 MB)
      • GeneID - genes (18 GB)
      • PubMed - research publications (12 GB)
    • Map into common domain vocabulary
    • Query across all data sets
  • 57.
  • 58.
    SELECT ?disease ?disname ?geneid
    WHERE {
      ?geneid a geneid:Gene .
      ?geneid gene2pub:pubmed_xref ?article .
      OPTIONAL { ?geneid dc:title ?genetitle . }
      ?disease a diseasome:diseases .
      ?genedb a diseasome:genes .
      ?disease diseasome:associatedGene ?genedb .
      ?genedb diseasome:geneId ?geneid .
      OPTIONAL { ?disease diseasome:name ?disname . }
    }

    Figure: graph view of the query pattern, linking ?disease, ?genedb, ?geneid, and ?article, with optional ?disname and ?genetitle.
  • 59.