Big Data with Semantics




                  Alex Miller
                @puredanger

                picture: http://bit.ly/MLUIon
Hadoop for Data Integration

 • Companies are flocking
   to Hadoop right now,
   mostly for ETL/analysis

 • Starting to also use it for data integration
 • Traditionally the domain of data
   warehouses


                       2
Data Integration in Hive

• Load multiple sources
• Define tables and query them with HiveQL
• Queries access multiple sources in terms
  of their original data

• Adding a new "data source" means
  changing all of your queries to
  accommodate the new data
                       3
Integration with Semantics
• Load data into Hadoop
• Map data into common domain
  vocabulary

• Query all your sources with common
  domain vocabulary

• Adding a new "data source" means
  mapping the new source into the domain
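
A minimal sketch of the idea on this slide, in plain Python rather than any Hadoop tooling. The two sources, their column names, and the "domain:" terms are made up for illustration; the point is only that each source gets its own mapping, while the query is written once against the shared vocabulary.

```python
# Two hypothetical sources describing the same kind of thing
# with different column names.
crm_rows = [{"cust_id": 1, "full_name": "Ada"}]
billing_rows = [{"id": 7, "name": "Grace"}]

# Per-source mappings from local columns to shared domain terms.
# The "domain:" terms are assumptions made for this sketch.
MAPPINGS = {
    "crm": {"key": "cust_id", "columns": {"full_name": "domain:name"}},
    "billing": {"key": "id", "columns": {"name": "domain:name"}},
}

def to_domain(source, rows):
    """Yield (subject, property, value) facts in the shared vocabulary."""
    mapping = MAPPINGS[source]
    for row in rows:
        subject = f"{source}:{row[mapping['key']]}"
        for column, prop in mapping["columns"].items():
            yield (subject, prop, row[column])

facts = list(to_domain("crm", crm_rows)) + list(to_domain("billing", billing_rows))

# One query, phrased only in domain terms, covers every mapped source.
names = sorted(value for _, prop, value in facts if prop == "domain:name")
```

Adding a third source in this sketch means adding one entry to MAPPINGS; the query that builds `names` does not change.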
                   4
Multiple Sources
     in Hive
[Diagram: Query 1 and Query 2 drawn above sources S1, S2, S3; each query works against the sources directly]

              5
Multiple Sources
with Semantics
[Diagram: Query 1 and Query 2 sit above a "Domain Vocab" layer, which sits above sources S1, S2, S3]

              6
Key Technologies


• RDF - data model
• RDFS - schema definition
• SPARQL - query language
• R2RML - relational to RDF mapping


                     7
RDF

"Resource Description Framework"




               8
There are things we wish
      to describe.


           9
We need some way to
 identify each thing.


          10
A URI is about "identifying" things,
not "locating" things (a URL).

On the web, we identify
  things with a URI.


           11
[Diagram: dbp:Chicago in the center, surrounded by other resources]
  dbp:Chicago
  dbp:Chicago_(band)
  dbp:Wrigley_Field
  dbp:The_Blues_Brothers_(film)
  dbp:Chicago_Cubs
  dbp:Barack_Obama
  dbp:Pizza

                    dbp: http://dbpedia.org/resource/
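
The "dbp:" shorthand stands for the full URI prefix shown here. A small sketch of that expansion, assuming a plain dictionary of prefixes (the `expand` helper is a hypothetical name, not part of any RDF library):

```python
# Prefix table taken from the slide.
PREFIXES = {"dbp": "http://dbpedia.org/resource/"}

def expand(curie, prefixes=PREFIXES):
    """Turn a prefixed name such as 'dbp:Chicago' into a full URI."""
    prefix, _, local = curie.partition(":")
    if prefix not in prefixes:
        raise KeyError(f"unknown prefix: {prefix}")
    return prefixes[prefix] + local

chicago = expand("dbp:Chicago")
```

The expanded string is an identifier first; whether anything can be fetched from it is a separate question.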

                                   12
Things are more
interesting if we relate
         them.

Relationships are also
 described by a URI.

           13
Relationships
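[Diagram: the same resources, now joined by labeled edges]
  dbp:Wrigley_Field  dbpo:location  dbp:Chicago
  dbp:Chicago_Cubs  dbpo:owner  dbp:Wrigley_Field
  dbp:The_Blues_Brothers_(film) and dbp:Chicago, edge movie:film_location
  dbp:Barack_Obama and dbp:Chicago, edge dbpo:residence
  dbp:Chicago_(band) and dbp:Pizza, no labeled edge
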
                                      dbp: http://dbpedia.org/resource/
                                     dbpo: http://dbpedia.org/ontology/

                                                   14
Triple
         "fact" or "assertion"


<subject> <predicate> <object>




                  15
[Diagram: the relationship graph again, with labels marking one triple]
  Subject:   dbp:Wrigley_Field
  Predicate: dbpo:location
  Object:    dbp:Chicago

                                       dbp: http://dbpedia.org/resource/
                                      dbpo: http://dbpedia.org/ontology/

                                                     16
Triple
  <subject> <predicate> <object>

dbp:Wrigley_Field dbpo:location dbp:Chicago

   resource        resource     resource
   (vertex)         (edge)      (vertex)
                                   or
                                  value
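
One way to picture the vertex/edge reading is to hold triples as tuples and index them by subject. This is only a sketch in plain Python; the triples are taken from the deck's examples, and the literal 1837 shows an object that is a value rather than a resource.

```python
from collections import defaultdict

# (subject, predicate, object) tuples from the deck's examples.
triples = [
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Chicago", "ex:founded", 1837),  # object is a plain value
]

# Adjacency list: each subject vertex maps to its outgoing (edge, target) pairs.
outgoing = defaultdict(list)
for s, p, o in triples:
    outgoing[s].append((p, o))

wrigley_edges = outgoing["dbp:Wrigley_Field"]
```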

                     17
Graph
[Diagram: the same resources and labeled edges, viewed together as one graph]

                                      dbp: http://dbpedia.org/resource/
                                     dbpo: http://dbpedia.org/ontology/

                                                    18
If things and relationships
   can be defined by any
   URI, how do we know
what we're talking about?


             19
We need metadata.



        20
Specifically, we need a
  vocabulary of terms
that describe our data.


           21
A class describes a
group of things that
  share common
    properties.


         22
Class

dbp:San_Francisco   is a   ex:City
dbp:Chicago         is a   ex:City
dbp:Saint_Louis     is a   ex:City


                     dbp: http://dbpedia.org/resource/
                     ex: http://example.org/ontology/
                     rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                     rdfs: http://www.w3.org/2000/01/rdf-schema#

                                           23
rdf:type (aka "a")

dbp:San_Francisco   rdf:type   ex:City
dbp:Chicago         rdf:type   ex:City
dbp:Saint_Louis     rdf:type   ex:City


                       dbp: http://dbpedia.org/resource/
                       ex: http://example.org/ontology/
                       rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                       rdfs: http://www.w3.org/2000/01/rdf-schema#

                                             24
rdfs:Class

ex:City             rdf:type   rdfs:Class

dbp:San_Francisco   rdf:type   ex:City
dbp:Chicago         rdf:type   ex:City
dbp:Saint_Louis     rdf:type   ex:City


                        dbp: http://dbpedia.org/resource/
                        ex: http://example.org/ontology/
                        rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                        rdfs: http://www.w3.org/2000/01/rdf-schema#

                                              25
rdfs:subClassOf

ex:Location   rdf:type          rdfs:Class
ex:City       rdf:type          rdfs:Class
ex:City       rdfs:subClassOf   ex:Location




 dbp: http://dbpedia.org/resource/
 ex: http://example.org/ontology/
 rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
 rdfs: http://www.w3.org/2000/01/rdf-schema#
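
Under RDFS semantics, a resource typed with a class also belongs to that class's superclasses. The slide does not spell this out, so treat the following as a sketch of that standard inference in plain Python, not as part of the deck. The function names are hypothetical; the triples reuse ex:City, ex:Location, and dbp:Chicago from the slides.

```python
triples = {
    ("ex:City", "rdfs:subClassOf", "ex:Location"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
}

def superclasses(cls, triples):
    """All classes reachable from cls via rdfs:subClassOf, transitively."""
    found, todo = set(), [cls]
    while todo:
        current = todo.pop()
        for s, p, o in triples:
            if s == current and p == "rdfs:subClassOf" and o not in found:
                found.add(o)
                todo.append(o)
    return found

def types_of(resource, triples):
    """Asserted types of a resource plus the types implied by subclassing."""
    direct = {o for s, p, o in triples if s == resource and p == "rdf:type"}
    inferred = set()
    for cls in direct:
        inferred |= superclasses(cls, triples)
    return direct | inferred

chicago_types = types_of("dbp:Chicago", triples)
```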

                       26
Classes let us talk about
kinds of things. Now we
   need some way to
   describe attributes.


            27
dbp:Chicago   rdf:type     ex:City
dbp:Chicago   ex:country   dbp:United_States
dbp:Chicago   ex:founded   1837




                      dbp: http://dbpedia.org/resource/
                      ex: http://example.org/ontology/
                      rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
                      rdfs: http://www.w3.org/2000/01/rdf-schema#

                                            28
rdf:Property

ex:founded    rdf:type      rdf:Property
ex:founded    rdfs:domain   ex:City
ex:founded    rdfs:range    xsd:gYear

dbp:Chicago   rdf:type      ex:City
dbp:Chicago   ex:founded    1837




               dbp: http://dbpedia.org/resource/
               ex: http://example.org/ontology/
               rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#
               rdfs: http://www.w3.org/2000/01/rdf-schema#

                                     29
How do we query stuff in
      this data?

        SPARQL


           30
Data and metadata
dbp:Chicago_Cubs    rdf:type        ex:Baseball_Team
dbp:Wrigley_Field   rdf:type        ex:Stadium
dbp:Chicago         rdf:type        ex:City

dbp:Chicago_Cubs    dbpo:owner      dbp:Wrigley_Field
dbp:Wrigley_Field   dbpo:location   dbp:Chicago




                        dbp: http://dbpedia.org/resource/
                       dbpo: http://dbpedia.org/ontology/

                                         31
Graph pattern

?stadium   rdf:type        ex:Stadium
?city      rdf:type        ex:City
?owner     dbpo:owner      ?stadium
?stadium   dbpo:location   ?city


                                 32
Triple pattern

?stadium rdf:type ex:Stadium .
?city rdf:type ex:City .
?owner dbpo:owner ?stadium .
?stadium dbpo:location ?city .


                                            33
[Diagram: the four triple patterns on the graph, collected into the query below]

SELECT ?owner ?stadium ?city
WHERE {
  ?owner dbpo:owner ?stadium .
  ?stadium dbpo:location ?city .
  ?stadium rdf:type ex:Stadium .
  ?city rdf:type ex:City .
}
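
To make the matching concrete, here is a rough sketch of how the four triple patterns in this query could be evaluated against the "Data and metadata" triples. It is a naive nested-loop matcher in plain Python, written for illustration; real SPARQL engines plan and index their joins, and the function names are hypothetical.

```python
# Triples from the "Data and metadata" slide.
data = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Wrigley_Field", "rdf:type", "ex:Stadium"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
    ("dbp:Chicago_Cubs", "rdf:type", "ex:Baseball_Team"),
]

# The WHERE clause, one tuple per triple pattern.
patterns = [
    ("?owner", "dbpo:owner", "?stadium"),
    ("?stadium", "dbpo:location", "?city"),
    ("?stadium", "rdf:type", "ex:Stadium"),
    ("?city", "rdf:type", "ex:City"),
]

def match(pattern, triple, binding):
    """Extend binding so that pattern equals triple, or return None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was first bound to.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result

def solve(patterns, data):
    """Join the patterns one at a time, keeping only consistent bindings."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in data:
                extended = match(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings

# SELECT ?owner ?stadium ?city
rows = [(b["?owner"], b["?stadium"], b["?city"]) for b in solve(patterns, data)]
```

Each shared variable (?stadium, ?city) acts as a join key between patterns, which is why the query reads like a small graph.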
                                              34
SPARQL
• Unions
• Joins
• Outer joins
• Filter with criteria
• Project expressions
• Sort
• Duplicate removal
• Slice (limit / offset)
• Aggregates (grouping, etc.)
• Subqueries
               35
Sounds interesting.
But I don't have triples!



            36
How do we map tables
(text or sequence file)
       to triples?


           37
Music Database
Musicians:
  MID   First   Last        Inst_ID
   1    Eddie   Van Halen     10
   2    Yo Yo   Ma            20
   3    Kenny   G             30

Instruments:
  IID   Instrument   Type
   10   Guitar       String
   20   Cello        String
   30   Saxophone    Woodwind



                                      38
Musician Schema
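Classes (rdf:type rdfs:Class):
  music:Musician, music:Instrument

Properties (rdf:type rdf:Property):
  music:firstName   rdfs:domain   music:Musician
  music:lastName    rdfs:domain   music:Musician
  music:plays       rdfs:domain   music:Musician
  music:plays       rdfs:range    music:Instrument
  music:instName    rdfs:domain   music:Instrument
  music:instType    rdfs:domain   music:Instrument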



                           39
Tables to Triples
    Musicians:                                     Instruments:
      MID    First      Last       Inst_ID           IID    Instrument     Type
       1     Eddie    Van Halen      10               10      Guitar      String
       2     Yo Yo       Ma          20               20      Cello       String
       3     Kenny       G           30               30    Saxophone    Woodwind



  Turn each key into a resource and specify the proper
  type of each resource:

artist:1 rdf:type music:Musician             instrument:10 rdf:type music:Instrument
artist:2 rdf:type music:Musician             instrument:20 rdf:type music:Instrument
artist:3 rdf:type music:Musician             instrument:30 rdf:type music:Instrument



                                             40
Tables to Triples
     Musicians:                                         Instruments:
       MID         First      Last      Inst_ID           IID      Instrument     Type
           1       Eddie    Van Halen     10               10        Guitar      String
           2       Yo Yo       Ma         20               20        Cello       String
           3       Kenny       G          30               30      Saxophone    Woodwind



   Turn each cell into a triple based on the key, property
   (mapped per column), and value:
artist:1       music:firstName "Eddie"             instrument:10   music:instName "Guitar"
artist:1       music:lastName "Van Halen"         instrument:10   music:instType "String"
artist:2       music:firstName "Yo Yo"             instrument:20   music:instName "Cello"
artist:2       music:lastName "Ma"                instrument:20   music:instType "String"
artist:3       music:firstName "Kenny"             instrument:30   music:instName "Saxophone"
artist:3       music:lastName "G"                 instrument:30   music:instType "Woodwind"


                                                  41
Tables to Triples
 Musicians:                                   Instruments:
  MID     First      Last      Inst_ID          IID    Instrument     Type
   1      Eddie    Van Halen     10             10       Guitar      String
   2      Yo Yo       Ma         20             20       Cello       String
   3      Kenny       G          30             30     Saxophone    Woodwind



Turn each foreign key reference into a relationship
between the foreign and primary resources.

                   artist:1 music:plays instrument:10
                   artist:2 music:plays instrument:20
                   artist:3 music:plays instrument:30
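
The three steps (key to typed resource, cell to triple, foreign key to relationship) can be sketched for the Musicians table as follows. This is plain Python for illustration, with rows copied from the table on the slide; `musician_triples` is a hypothetical helper name.

```python
# Rows of the Musicians table from the slide.
musicians = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]

def musician_triples(row):
    """Apply the three mapping steps to one row."""
    subject = f"artist:{row['MID']}"
    return [
        # Step 1: the primary key becomes a typed resource.
        (subject, "rdf:type", "music:Musician"),
        # Step 2: each remaining cell becomes a triple with a literal value.
        (subject, "music:firstName", row["First"]),
        (subject, "music:lastName", row["Last"]),
        # Step 3: the foreign key becomes a link to another resource.
        (subject, "music:plays", f"instrument:{row['Inst_ID']}"),
    ]

triples = [t for row in musicians for t in musician_triples(row)]
plays = [(s, o) for s, p, o in triples if p == "music:plays"]
```

Each musician row yields exactly one music:plays link, pointing at the instrument whose IID matches that row's Inst_ID.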




                                         42
R2RML
• "Relational to RDF Mapping Language"
• RDB2RDF Working Group at W3C
• ETL "data transformation" use case
• Dynamic "query translation" use case
  • Translate a SPARQL query against the
    domain into a SQL query against the DBMS

                   43
R2RML Triple Mapping

Instruments:
  IID   Instrument   Type
   10   Guitar       String

Triples Map
  rr:tableName: the Instruments table
Subject Map
  template "http://example.com/music/Inst-{iid}"
  rr:class: music:Instrument
Predicate Object Map
  Predicate (rr:predicate): a schema property, e.g. music:instType
  Object Map (rr:column): a table column

Schema: music:instName and music:instType have rdfs:domain music:Instrument

                                             44
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix music: <http://example.com/music/> .
@prefix mapping: <http://example.com/ont/> .

# One triples map per table: here, the Instruments table.
mapping:InstrumentMapping
    a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "Instruments" ] ;
    # Each row becomes a music:Instrument whose IRI is built from its key.
    rr:subjectMap [
       rr:template "http://example.com/music/Inst-{iid}" ;
       rr:class    music:Instrument
    ] ;
    # Each mapped column becomes one triple about that subject.
    rr:predicateObjectMap [
       rr:predicate      music:instName ;
       rr:objectMap      [ rr:column "instrument" ]
    ] ;
    rr:predicateObjectMap [
       rr:predicate      music:instType ;
       rr:objectMap      [ rr:column "type" ]
    ] .
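
To show what such a mapping produces for one row, here is a rough Python sketch that applies the same template, class, and column-to-predicate pairs. It is not an R2RML processor: the `mapping` dictionary and `apply_mapping` function are hypothetical, and the sketch skips details a real processor handles, such as percent-encoding template values and SQL identifier case rules.

```python
import re

# The parts of the mapping that matter for generating triples.
mapping = {
    "template": "http://example.com/music/Inst-{iid}",
    "class": "music:Instrument",
    "predicate_objects": [
        ("music:instName", "instrument"),
        ("music:instType", "type"),
    ],
}

def apply_mapping(mapping, row):
    """Generate the triples that the mapping describes for one table row."""
    # Substitute each {column} placeholder with the row's value.
    subject = re.sub(
        r"\{(\w+)\}", lambda m: str(row[m.group(1)]), mapping["template"]
    )
    triples = [(subject, "rdf:type", mapping["class"])]
    for predicate, column in mapping["predicate_objects"]:
        triples.append((subject, predicate, row[column]))
    return triples

# First row of the Instruments table, keyed by the column names the mapping uses.
row = {"iid": 10, "instrument": "Guitar", "type": "String"}
triples = apply_mapping(mapping, row)
```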

                             45
Direct mapping


• Automatically map relational tables into a
  domain vocabulary using R2RML

• Good starting point to rapidly integrate
  two data sources



                     46
So what about big data?



           47
Triple data in Hadoop

• N-Triples files
  • standard line-based format for RDF data
• indexed triple format
  • triples in Thrift representing RDF terms
• text / sequence files as tabular sources
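
For a sense of why a line-per-triple format suits Hadoop, here is a small sketch that writes one triple as an N-Triples line. It is a simplification in plain Python: the rule "a string starting with http is an IRI, anything else is a literal" is an assumption made for this sketch, and only backslash and double-quote escaping is handled.

```python
def nt_term(term):
    """Render one term: IRIs in angle brackets, everything else as a literal."""
    if isinstance(term, str) and term.startswith("http"):
        return f"<{term}>"
    escaped = str(term).replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'

def nt_line(s, p, o):
    """One triple per line, terminated by ' .'"""
    return f"{nt_term(s)} {nt_term(p)} {nt_term(o)} ."

line = nt_line(
    "http://dbpedia.org/resource/Wrigley_Field",
    "http://dbpedia.org/ontology/location",
    "http://dbpedia.org/resource/Chicago",
)
```

Because every line is a complete, self-describing fact with full URIs, a file can be split at any line boundary and each split processed independently.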

                     48
SPARQL in Hadoop

• Compile SPARQL to map-reduce jobs
  against triple (or tuple) data

• Results materialized back into Hadoop
  files

• Similar to Hive compiling HiveQL to
  map-reduce jobs against tabular data
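
As an illustration of what "compile SPARQL to map-reduce" can mean, the sketch below joins two triple patterns (?owner dbpo:owner ?stadium and ?stadium dbpo:location ?city) with a reduce-side join on ?stadium. It simulates map, shuffle, and reduce in memory in plain Python; it is an assumption about one possible strategy, not a description of how any particular engine plans queries.

```python
from collections import defaultdict

triples = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
]

def map_phase(triple):
    """Emit (join key, tagged value) pairs, keyed on the shared ?stadium."""
    s, p, o = triple
    if p == "dbpo:owner":        # ?owner dbpo:owner ?stadium
        yield o, ("owner", s)
    elif p == "dbpo:location":   # ?stadium dbpo:location ?city
        yield s, ("city", o)

def reduce_phase(stadium, values):
    """Combine every owner with every city seen for the same stadium."""
    owners = [v for tag, v in values if tag == "owner"]
    cities = [v for tag, v in values if tag == "city"]
    for owner in owners:
        for city in cities:
            yield (owner, stadium, city)

# Simulated shuffle: group mapper output by key.
groups = defaultdict(list)
for triple in triples:
    for key, value in map_phase(triple):
        groups[key].append(value)

results = [row for key in sorted(groups) for row in reduce_phase(key, groups[key])]
```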

                     49
R2RML in Hadoop
• Provide mapping file against tabular data
  files in Hadoop
• Execute SPARQL queries through the
  virtual mapping
  • View your data as triples
  • But leave it in sequence files
• OR materialize the virtual mapping into a
  real set of triples
                        50
Federation

• Execute queries against combination of
  data inside and outside Hadoop

• Or against combination of Hadoop and
  real-time (Storm)

• Or across multiple Hadoop clusters!

                      51
Additional capabilities


• SQL queries against tabular data
• Metadata registry
• Workflow design and execution


                      52
BioBig example

• Load into Hadoop as triples
  •   Diseasome - diseases (16.2 MB)
  •   LinkedCT - clinical trials (4.5 GB)
  •   DrugBank - drugs (144 MB)
  •   GeneID - genes (18 GB)
  •   PubMed - research publications (12 GB)

• Map into common domain vocabulary
• Query across all data sets
                         53
BioBig domain ontology
        (partial)




          54
SELECT ?disease ?disname ?geneid
WHERE {
  ?geneid a geneid:Gene .
  ?geneid gene2pub:pubmed_xref ?article .
  OPTIONAL { ?geneid dc:title ?genetitle . }
  ?disease a diseasome:diseases .
  ?genedb a diseasome:genes .
  ?disease diseasome:associatedGene ?genedb .
  ?genedb diseasome:geneId ?geneid .
  OPTIONAL { ?disease diseasome:name ?disname . }
}

[Diagram: the query above drawn as a graph pattern over ?disease, ?genedb, ?geneid, ?article, ?genetitle, and ?disname]

                                           55
Thanks!

Portal Kombat : extension du réseau de propagande russe
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
ChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps ProductivityChatGPT and Beyond - Elevating DevOps Productivity
ChatGPT and Beyond - Elevating DevOps Productivity
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
Introduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptxIntroduction to FIDO Authentication and Passkeys.pptx
Introduction to FIDO Authentication and Passkeys.pptx
 
Design Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptxDesign Guidelines for Passkeys 2024.pptx
Design Guidelines for Passkeys 2024.pptx
 
Microsoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - QuestionnaireMicrosoft CSP Briefing Pre-Engagement - Questionnaire
Microsoft CSP Briefing Pre-Engagement - Questionnaire
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Working together SRE & Platform Engineering
Working together SRE & Platform EngineeringWorking together SRE & Platform Engineering
Working together SRE & Platform Engineering
 
ERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage IntacctERP Contender Series: Acumatica vs. Sage Intacct
ERP Contender Series: Acumatica vs. Sage Intacct
 
CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)CORS (Kitworks Team Study 양다윗 발표자료 240510)
CORS (Kitworks Team Study 양다윗 발표자료 240510)
 
How to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cfHow to Check CNIC Information Online with Pakdata cf
How to Check CNIC Information Online with Pakdata cf
 
Microsoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdfMicrosoft BitLocker Bypass Attack Method.pdf
Microsoft BitLocker Bypass Attack Method.pdf
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
“Iamnobody89757” Understanding the Mysterious of Digital Identity.pdf
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 

Big Data with Semantics - StampedeCon 2012

  • 1. Big Data with Semantics Alex Miller @puredanger picture: http://bit.ly/MLUIon
  • 2. Hadoop for Data Integration • Companies are flocking to Hadoop right now, mostly for ETL/analysis • Starting to also use it for data integration • Traditionally the domain of data warehouses 2
  • 3. Data Integration in Hive • Load multiple sources • Define, query with HiveQL • Queries access multiple sources in terms of their original data • Adding a new "data source" means changing all of your queries to accommodate the new data 3
  • 4. Integration with Semantics • Load data into Hadoop • Map data into common domain vocabulary • Query all your sources with common domain vocabulary • Adding a new "data source" means mapping the new source into the domain 4
  • 5. Multiple Sources in Hive Query Query 1 2 S1 S2 S3 5
  • 6. Multiple Sources with Semantics Query Query 1 2 Domain Vocab S1 S2 S3 6
  • 7. Key Technologies • RDF - data model • RDFS - schema definition • SPARQL - query language • R2RML - relational to RDF mapping 7
  • 9. There are things we wish to describe. 9
  • 10. We need some way to identify each thing. 10
  • 11. A URI is about "identifying" things, not "locating" things (a URL). On the web, we identify things with a URI. 11
  • 12. dbp:Chicago_(band) dbp:Wrigley_Field dbp:The_Blues_Brothers_(film) dbp:Chicago dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza dbp: http://dbpedia.org/resource/ 12
  • 13. Things are more interesting if we relate them. Relationships are also described by a URI. 13
  • 14. Relationships Nodes: dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field dbp:Chicago_(band) dbp:Chicago dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza Edge labels: movie:film_location dbpo:location dbpo:owner dbpo:residence dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 14
  • 15. Triple "fact" or "assertion" <subject> <predicate> <object> 15
  • 16. Subject, Predicate, Object (callouts on the relationship graph) Nodes: dbp:Chicago_(band) dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field dbp:Chicago dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza Edge labels: movie:film_location dbpo:location dbpo:owner dbpo:residence dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 16
  • 17. Triple <subject> <predicate> <object> dbp:Wrigley_Field dbpo:location dbp:Chicago resource resource resource (vertex) (edge) (vertex) or value 17
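The triple on this slide can be sketched in a few lines of Python. This is only an illustration: a graph is modelled as a plain set of (subject, predicate, object) tuples, the two prefixes come from the slides, and the helper names `expand` and `objects` are made up for this sketch.

```python
# Prefix table taken from the slides.
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
    "dbpo": "http://dbpedia.org/ontology/",
}


def expand(curie):
    """Expand a prefixed name such as 'dbp:Chicago' into a full URI."""
    prefix, local = curie.split(":", 1)
    return PREFIXES[prefix] + local


# A triple is (subject, predicate, object); a graph is a set of triples.
graph = {
    (expand("dbp:Wrigley_Field"), expand("dbpo:location"), expand("dbp:Chicago")),
}


def objects(graph, subject, predicate):
    """Return every object reachable from `subject` over the `predicate` edge."""
    return {o for s, p, o in graph if s == subject and p == predicate}


print(objects(graph, expand("dbp:Wrigley_Field"), expand("dbpo:location")))
```

Subjects and objects act as vertices and the predicate as a labelled edge, which is why a set of triples can be read as a graph.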
  • 18. Graph Nodes: dbp:The_Blues_Brothers_(film) dbp:Wrigley_Field dbp:Chicago_(band) dbp:Chicago dbp:Chicago_Cubs dbp:Barack_Obama dbp:Pizza Edge labels: movie:film_location dbpo:location dbpo:owner dbpo:residence dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 18
  • 19. If things and relationships can be defined by any URI, how do we know what we're talking about? 19
  • 21. Specifically, we need a vocabulary of terms that describe our data. 21
  • 22. A class describes a group of things that share common properties. 22
  • 23. Class ex:City is a is a is a dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 23
  • 24. rdf:type (aka "a") ex:City rdf:type rdf:type rdf:type dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 24
  • 25. rdfs:Class rdfs:Class rdf:type ex:City rdf:type rdf:type rdf:type dbp:San_Francisco dbp:Chicago dbp:Saint_Louis dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 25
  • 26. rdfs:subClassOf rdf:type ex:Location rdfs:Class rdfs:subClassOf rdf:type ex:City rdfs:Class dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 26
  • 27. Classes let us talk about kinds of things. Now we need some way to describe attributes. 27
  • 28. ex:City rdf:type ex:country ex:founded dbp:United_States 1837 dbp:Chicago dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 28
  • 29. rdf:Property rdfs:domain ex:City rdfs:range rdf:Property xsd:gYear rdf:type rdf:type ex:founded 1837 dbp:Chicago dbp: http://dbpedia.org/resource/ ex: http://example.org/ontology/ rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns# rdfs: http://www.w3.org/2000/01/rdf-schema# 29
  • 30. How do we query stuff in this data? SPARQL 30
  • 31. Data and metadata ex:Baseball_Team ex:Stadium ex:City rdf:type rdf:type rdf:type dbpo:owner dbpo:location dbp:Chicago dbp:Chicago_Cubs dbp:Wrigley_Field dbp: http://dbpedia.org/resource/ dbpo: http://dbpedia.org/ontology/ 31
  • 32. ex:Stadium ex:City rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city Graph pattern 32
  • 33. ex:Stadium ex:City ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . Triple pattern 33
  • 34. ex:Stadium ex:City ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . rdf:type rdf:type dbpo:owner dbpo:location ?owner ?stadium ?city ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . SELECT ?owner ?stadium ?city WHERE { ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city . ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . } 34
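To make the idea of a graph pattern concrete, here is a minimal sketch of how the four triple patterns in the SELECT query could be matched against the data from the "Data and metadata" slide. It is a naive nested-loop matcher written for illustration, not how a SPARQL engine is actually implemented; the function names are hypothetical and terms are kept as prefixed strings for brevity.

```python
# Data from the "Data and metadata" slide, kept as prefixed names.
graph = [
    ("dbp:Chicago_Cubs", "rdf:type", "ex:Baseball_Team"),
    ("dbp:Wrigley_Field", "rdf:type", "ex:Stadium"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
]

# The WHERE clause of the query: terms starting with '?' are variables.
patterns = [
    ("?owner", "dbpo:owner", "?stadium"),
    ("?stadium", "dbpo:location", "?city"),
    ("?stadium", "rdf:type", "ex:Stadium"),
    ("?city", "rdf:type", "ex:City"),
]


def match_pattern(pattern, triple, binding):
    """Extend `binding` so that `pattern` matches `triple`, or return None."""
    extended = dict(binding)
    for p_term, t_term in zip(pattern, triple):
        if p_term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if extended.setdefault(p_term, t_term) != t_term:
                return None
        elif p_term != t_term:
            return None
    return extended


def query(graph, patterns):
    """Return all variable bindings that satisfy every triple pattern."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in graph:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return bindings


for row in query(graph, patterns):
    print(row["?owner"], row["?stadium"], row["?city"])
```

Each shared variable (here `?stadium` and `?city`) acts as a join condition between patterns, which is what ties the separate triple patterns into one graph pattern.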
  • 35. SPARQL • Unions • Joins • Outer joins • Filter with criteria • Project expressions • Sort • Duplicate removal • Slice (limit / offset) • Aggregates (grouping, etc) • Subqueries 35
  • 36. Sounds interesting. But I don't have triples! 36
  • 37. How do we map tables (text or sequence file) to triples? 37
  • 38. Music Database Musicians: MID First Last Inst_ID 1 Eddie Van Halen 10 2 Yo Yo Ma 20 3 Kenny G 30 Instruments: IID Instrument Type 10 Guitar String 20 Cello String 30 Saxophone Woodwind 38
  • 39. Musician Schema Classes (rdf:type rdfs:Class): music:Musician, music:Instrument Properties (rdf:type rdf:Property): music:firstName rdfs:domain music:Musician; music:lastName rdfs:domain music:Musician; music:plays rdfs:domain music:Musician, rdfs:range music:Instrument; music:instName rdfs:domain music:Instrument; music:instType rdfs:domain music:Instrument 39
  • 40. Tables to Triples Musicians: Instruments: MID First Last Inst_ID IID Instrument Type 1 Eddie Van Halen 10 10 Guitar String 2 Yo Yo Ma 20 20 Cello String 3 Kenny G 30 30 Saxophone Woodwind Turn each key into a resource and specify the proper type of each resource: artist:1 rdf:type music:Musician instrument:10 rdf:type music:Instrument artist:2 rdf:type music:Musician instrument:20 rdf:type music:Instrument artist:3 rdf:type music:Musician instrument:30 rdf:type music:Instrument 40
  • 41. Tables to Triples Musicians: Instruments: MID First Last Inst_ID IID Instrument Type 1 Eddie Van Halen 10 10 Guitar String 2 Yo Yo Ma 20 20 Cello String 3 Kenny G 30 30 Saxophone Woodwind Turn each cell into a triple based on the key, property (mapped per column), and value: artist:1 music:firstName "Eddie" instrument:10 music:instName "Guitar" artist:1 music:lastName "Van Halen" instrument:10 music:instType "String" artist:2 music:firstName "Yo Yo" instrument:20 music:instName "Cello" artist:2 music:lastName "Ma" instrument:20 music:instType "String" artist:3 music:firstName "Kenny" instrument:30 music:instName "Saxophone" artist:3 music:lastName "G" instrument:30 music:instType "Woodwind" 41
  • 42. Tables to Triples Musicians: Instruments: MID First Last Inst_ID IID Instrument Type 1 Eddie Van Halen 10 10 Guitar String 2 Yo Yo Ma 20 20 Cello String 3 Kenny G 30 30 Saxophone Woodwind Turn each foreign key reference into a relationship between the foreign and primary resources. artist:1 music:plays instrument:10 artist:2 music:plays instrument:20 artist:3 music:plays instrument:30 42
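The three table-to-triple steps (type each key, turn each cell into a property triple, turn each foreign key into a relationship) can be sketched as follows. The rows are copied from the Music Database slide; the function names and the choice to represent literals and resources both as plain strings are simplifications made for this sketch.

```python
# Rows copied from the Musicians and Instruments tables.
musicians = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]
instruments = [
    {"IID": 10, "Instrument": "Guitar", "Type": "String"},
    {"IID": 20, "Instrument": "Cello", "Type": "String"},
    {"IID": 30, "Instrument": "Saxophone", "Type": "Woodwind"},
]


def musician_triples(row):
    """Map one Musicians row to triples."""
    subject = f"artist:{row['MID']}"  # step 1: the key becomes a resource
    return [
        (subject, "rdf:type", "music:Musician"),
        (subject, "music:firstName", row["First"]),  # step 2: one triple per cell
        (subject, "music:lastName", row["Last"]),
        # step 3: the foreign key becomes a link to another resource
        (subject, "music:plays", f"instrument:{row['Inst_ID']}"),
    ]


def instrument_triples(row):
    """Map one Instruments row to triples."""
    subject = f"instrument:{row['IID']}"
    return [
        (subject, "rdf:type", "music:Instrument"),
        (subject, "music:instName", row["Instrument"]),
        (subject, "music:instType", row["Type"]),
    ]


triples = [t for row in musicians for t in musician_triples(row)]
triples += [t for row in instruments for t in instrument_triples(row)]

for triple in triples:
    print(*triple)
```

Because the `music:plays` object is built from the same key scheme as the instrument subjects, the two tables end up connected in one graph without an explicit join.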
  • 43. R2RML • "Relational to RDF Mapping Language" • RDB2RDF Working Group at W3C • ETL "data transformation" use case • Dynamic "query translation" use case • Translate SPARQL query against domain to SQL query against the dbms 43
  • 44. R2RML Triple Mapping music:instName rdfs:domain music:Instrument music:instType rdfs:domain music:Instrument Instruments: IID Instrument Type 10 Guitar String 44
  • 45. R2RML Triple Mapping music:instName rdfs:domain music:Instrument music:instType rdfs:domain music:Instrument Triples Map rr:tableName Instruments: IID Instrument Type 10 Guitar String 44
  • 46. R2RML Triple Mapping music:instName rdfs:domain music:Instrument music:instType rdfs:domain music:Instrument rr:class Subject Map "http://example.com/music/Inst-{iid}" Triples Map rr:tableName Instruments: IID Instrument Type 10 Guitar String 44
  • 47. R2RML Triple Mapping music:instName rdfs:domain music:Instrument music:instType rdfs:domain music:Instrument rr:class rr:predicate Subject Map "http://example.com/music/Inst-{iid}" Predicate Object Map Predicate Object Map Triples Map rr:tableName rr:column Instruments: IID Instrument Type 10 Guitar String 44
  • 48. @prefix rr: <http://www.w3.org/ns/r2rml#> . @prefix music: <http://example.com/music/> . @prefix mapping: <http://example.com/ont/> . mapping:InstrumentMapping a rr:TriplesMap; rr:logicalTable [ rr:tableName "Instruments" ]; rr:subjectMap [ rr:template "http://example.com/music/Inst-{iid}"; rr:class music:Instrument ]; rr:predicateObjectMap [ rr:predicate music:instName ; rr:objectMap [ rr:column "instrument" ]; ]; rr:predicateObjectMap [ rr:predicate music:instType ; rr:objectMap [ rr:column "type" ]; ]; . 45
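What the mapping above asks a processor to do for each row can be approximated in plain Python. This is a rough sketch and not an R2RML processor: the mapping is hand-copied into a dictionary, the `apply_mapping` name is hypothetical, the row keys are assumed to be lowercase to match the column names in the mapping, and details such as percent-encoding of template values and literal datatypes are ignored.

```python
# Hand-written stand-in for mapping:InstrumentMapping.
instrument_mapping = {
    "table": "Instruments",
    "template": "http://example.com/music/Inst-{iid}",  # rr:template
    "class": "music:Instrument",                        # rr:class
    "predicate_object_maps": [                          # (rr:predicate, rr:column)
        ("music:instName", "instrument"),
        ("music:instType", "type"),
    ],
}


def apply_mapping(mapping, row):
    """Generate the triples that the mapping describes for one table row."""
    # Subject map: fill the {column} placeholders of the template from the row.
    subject = mapping["template"].format(**row)
    triples = [(subject, "rdf:type", mapping["class"])]
    # Predicate-object maps: one triple per mapped column.
    for predicate, column in mapping["predicate_object_maps"]:
        triples.append((subject, predicate, row[column]))
    return triples


row = {"iid": 10, "instrument": "Guitar", "type": "String"}
for triple in apply_mapping(instrument_mapping, row):
    print(*triple)
```

Keeping the mapping as data rather than code is the point of the approach: adding a source means writing another mapping, not another transformation program.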
  • 49. Direct mapping • Automatically map relational tables into a domain vocabulary using R2RML • Good starting point to rapidly integrate two data sources 46
  • 50. So what about big data? 47
  • 51. Triple data in Hadoop • n-triple files • standard line format for RDF data • indexed triple format • triples in Thrift representing RDF terms • text / sequence files as tabular sources 48
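As an illustration of the n-triple line format mentioned above, the sketch below writes one triple per line, with URIs in angle brackets, literals in double quotes, and a terminating period. It handles only a few escape sequences and plain literals (no datatypes or language tags), and the `("uri", value)` / `("literal", value)` term representation is an assumption made for this example.

```python
def nt_term(term):
    """Serialize one term given as ('uri', value) or ('literal', value)."""
    kind, value = term
    if kind == "uri":
        return f"<{value}>"
    # Escape backslashes first, then quotes and newlines.
    escaped = value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def nt_line(subject, predicate, obj):
    """Serialize one triple as a single N-Triples line."""
    return f"{nt_term(subject)} {nt_term(predicate)} {nt_term(obj)} ."


print(nt_line(
    ("uri", "http://dbpedia.org/resource/Wrigley_Field"),
    ("uri", "http://dbpedia.org/ontology/location"),
    ("uri", "http://dbpedia.org/resource/Chicago"),
))
```

One self-contained triple per line is what makes the format convenient for Hadoop: a file can be split at any line boundary and each split processed independently.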
  • 52. SPARQL in Hadoop • Compile SPARQL to map-reduce jobs against triple (or tuple) data • Results materialized back into Hadoop files • Similar to HiveQL compiling SQL to map-reduce against tabular data 49
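One way to picture compiling a graph pattern to map-reduce is a reduce-side join on the shared variable. The sketch below simulates that in memory for the two patterns `?owner dbpo:owner ?stadium` and `?stadium dbpo:location ?city`. It is a guess at the general technique, not the implementation described in the talk, and all function names are hypothetical.

```python
from collections import defaultdict

triples = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Barack_Obama", "dbpo:residence", "dbp:Chicago"),
]


def map_phase(triple):
    """Emit (join key, tagged value) for triples matching either pattern."""
    s, p, o = triple
    if p == "dbpo:owner":        # ?owner dbpo:owner ?stadium: key is the object
        yield o, ("owner", s)
    elif p == "dbpo:location":   # ?stadium dbpo:location ?city: key is the subject
        yield s, ("city", o)


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(stadium, values):
    """Join the two sides that share the same ?stadium binding."""
    owners = [v for tag, v in values if tag == "owner"]
    cities = [v for tag, v in values if tag == "city"]
    for owner in owners:
        for city in cities:
            yield {"?owner": owner, "?stadium": stadium, "?city": city}


pairs = [pair for triple in triples for pair in map_phase(triple)]
results = [
    row
    for key, values in shuffle(pairs).items()
    for row in reduce_phase(key, values)
]
print(results)
```

Triples that match neither pattern (the `dbpo:residence` one here) are dropped in the map phase, so only candidate bindings are shuffled across the cluster.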
  • 53. R2RML in Hadoop • Provide mapping file against tabular data files in Hadoop • Execute SPARQL queries through the virtual mapping • View your data as triples • But leave it in sequence files • OR materialize the virtual mapping into a real set of triples 50
  • 54. Federation • Execute queries against combination of data inside and outside Hadoop • Or against combination of Hadoop and real-time (Storm) • Or across multiple Hadoop clusters! 51
  • 55. Additional capabilities • SQL queries against tabular data • Metadata registry • Workflow design and execution 52
  • 56. BioBig example • Load into Hadoop as triples • Diseasome - diseases (16.2 MB) • LinkedCT - clinical trials (4.5 GB) • DrugBank - drugs (144 MB) • GeneID - genes (18 GB) • PubMed - research publications (12 GB) • Map into common domain vocabulary • Query across all data sets 53
  • 57. BioBig domain ontology (partial) 54
  • 58. SELECT ?disease ?disname ?geneid WHERE { ?geneid a geneid:Gene . ?geneid gene2pub:pubmed_xref ?article . OPTIONAL { ?geneid dc:title ?genetitle . } ?disease a diseasome:diseases . ?genedb a diseasome:genes . ?disease diseasome:associatedGene ?genedb . ?genedb diseasome:geneId ?geneid . OPTIONAL { ?disease diseasome:name ?disname . } } (Diagram: the same graph pattern drawn over ?disease, ?genedb, ?geneid, ?article, ?genetitle, and ?disname.) 55