Big Data with Semantics - StampedeCon 2012

At StampedeCon 2012 in St. Louis, Alex Miller of Revelytix presents "Big Data with Semantics." The talk demonstrates how RDF (Resource Description Framework) can describe a variety of data and metadata, how data stored in Hadoop can be transformed into or virtualized as an RDF graph, and how queries and transformations can be defined with SPARQL and R2RML (the RDB to RDF Mapping Language).

Published in: Technology, News & Politics

  1. 1. Big Data with Semantics. Alex Miller, @puredanger. Picture: http://bit.ly/MLUIon
  2. 2. Hadoop for Data Integration • Companies are flocking to Hadoop right now, mostly for ETL/analysis • Starting to also use it for data integration • Traditionally the domain of data warehouses
  3. 3. Data Integration in Hive • Load multiple sources • Define, query with HiveQL • Queries access multiple sources in terms of their original data • Adding a new "data source" means changing all of your queries to accommodate the new data
  4. 4. Integration with Semantics • Load data into Hadoop • Map data into common domain vocabulary • Query all your sources with common domain vocabulary • Adding a new "data source" means mapping the new source into the domain
  5. 5. Multiple Sources in Hive [diagram: Query 1 and Query 2 drawn above sources S1, S2, S3]
  6. 6. Multiple Sources with Semantics [diagram: Query 1 and Query 2 drawn above a Domain Vocab layer, with sources S1, S2, S3 below it]
  7. 7. Key Technologies • RDF - data model • RDFS - schema definition • SPARQL - query language • R2RML - relational to RDF mapping
  8. 8. RDF: "Resource Description Framework"
  9. 9. There are things we wish to describe.
  10. 10. We need some way to identify each thing.
  11. 11. On the web, we identify things with a URI. A URI is about "identifying" things, not "locating" things (a URL).
  12. 12. dbp:Chicago_(band), dbp:Wrigley_Field, dbp:The_Blues_Brothers_(film), dbp:Chicago, dbp:Chicago_Cubs, dbp:Barack_Obama, dbp:Pizza. Prefix: dbp: http://dbpedia.org/resource/
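A prefixed name such as dbp:Chicago is shorthand for a full URI: the prefix stands for a namespace and the local part is appended to it. The following is a minimal sketch of that expansion, assuming only the dbp: prefix given on the slide; the helper name `expand` is hypothetical and not part of any RDF library.

```python
# Prefix table taken from the slide. The helper name `expand` is hypothetical.
PREFIXES = {
    "dbp": "http://dbpedia.org/resource/",
}


def expand(name: str) -> str:
    """Expand a prefixed name such as 'dbp:Chicago' into a full URI."""
    prefix, _, local = name.partition(":")
    if prefix not in PREFIXES:
        raise KeyError(f"unknown prefix: {prefix}")
    # The full URI is the namespace followed by the local part, unchanged.
    return PREFIXES[prefix] + local


if __name__ == "__main__":
    for name in ("dbp:Chicago", "dbp:Wrigley_Field", "dbp:Chicago_(band)"):
        print(name, "->", expand(name))
```

The sketch does plain string concatenation; it makes no attempt to validate or escape the local part.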
  13. 13. Things are more interesting if we relate them. Relationships are also described by a URI.
  14. 14. Relationships [diagram: a graph connecting dbp:The_Blues_Brothers_(film), dbp:Wrigley_Field, dbp:Chicago_(band), dbp:Chicago, dbp:Chicago_Cubs, dbp:Barack_Obama, and dbp:Pizza; the legible edge labels include dbpo:location, dbpo:owner, and dbpo:residence] Prefixes: dbp: http://dbpedia.org/resource/, dbpo: http://dbpedia.org/ontology/
  15. 15. Triple: a "fact" or "assertion". <subject> <predicate> <object>
  16. 16. [diagram: the relationship graph from slide 14, annotated with the labels Subject, Predicate, and Object] Prefixes: dbp: http://dbpedia.org/resource/, dbpo: http://dbpedia.org/ontology/
  17. 17. Triple: <subject> <predicate> <object>, for example dbp:Wrigley_Field dbpo:location dbp:Chicago. The subject is a resource (vertex), the predicate is a resource (edge), and the object is a resource (vertex) or a value.
  18. 18. Graph [diagram: the relationship graph from slide 14, shown as a whole] Prefixes: dbp: http://dbpedia.org/resource/, dbpo: http://dbpedia.org/ontology/
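A set of triples can be read as a directed, labeled graph: subjects and objects are vertices and predicates are edge labels. Here is a rough sketch of that view in plain Python, using only the two triples that are legible in the transcript (the rest of the diagram is not recoverable). The function names `build_graph` and `reachable` are made up for illustration.

```python
from collections import defaultdict

# Two triples stated in the slides; other edges in the diagram are omitted
# because their labels are not legible in the transcript.
TRIPLES = {
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
}


def build_graph(triples):
    """Index triples as adjacency lists: subject -> [(predicate, object), ...]."""
    graph = defaultdict(list)
    for subject, predicate, obj in sorted(triples):
        graph[subject].append((predicate, obj))
    return graph


def reachable(graph, start):
    """Return every vertex reachable from `start` by following edges."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for _predicate, target in graph.get(node, []):
            if target not in seen:
                seen.add(target)
                stack.append(target)
    return seen


if __name__ == "__main__":
    g = build_graph(TRIPLES)
    print(sorted(reachable(g, "dbp:Chicago_Cubs")))
```

Starting from dbp:Chicago_Cubs, the traversal follows dbpo:owner to the stadium and then dbpo:location to the city, which is the kind of multi-hop question a graph model makes easy to ask.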
  19. 19. If things and relationships can be defined by any URI, how do we know what we're talking about?
  20. 20. We need metadata.
  21. 21. Specifically, we need a vocabulary of terms that describe our data.
  22. 22. A class describes a group of things that share common properties.
  23. 23. Class: dbp:San_Francisco, dbp:Chicago, and dbp:Saint_Louis each "is a" ex:City. Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
  24. 24. rdf:type (aka "a"): dbp:San_Francisco rdf:type ex:City . dbp:Chicago rdf:type ex:City . dbp:Saint_Louis rdf:type ex:City . Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
  25. 25. rdfs:Class: ex:City rdf:type rdfs:Class . dbp:San_Francisco, dbp:Chicago, and dbp:Saint_Louis each have rdf:type ex:City. Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
  26. 26. rdfs:subClassOf: ex:Location rdf:type rdfs:Class . ex:City rdf:type rdfs:Class . ex:City rdfs:subClassOf ex:Location . Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
  27. 27. Classes let us talk about kinds of things. Now we need some way to describe attributes.
  28. 28. dbp:Chicago rdf:type ex:City . dbp:Chicago ex:country dbp:United_States . dbp:Chicago ex:founded 1837 . Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
  29. 29. rdf:Property: ex:founded rdf:type rdf:Property . ex:founded rdfs:domain ex:City . ex:founded rdfs:range xsd:gYear . dbp:Chicago ex:founded 1837 . Prefixes: dbp: http://dbpedia.org/resource/, ex: http://example.org/ontology/, rdf: http://www.w3.org/1999/02/22-rdf-syntax-ns#, rdfs: http://www.w3.org/2000/01/rdf-schema#
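Schema triples are useful because they allow new facts to be derived. Under the standard RDFS semantics, rdfs:domain lets a reasoner infer the type of any subject that uses a property, and rdfs:subClassOf propagates that type to superclasses. The sketch below applies just those two rules to the ex: vocabulary from the slides. It is a toy fixed-point loop over Python tuples, not a full RDFS reasoner, and the function name `infer_types` is hypothetical.

```python
# Schema and data from the slides, written as (subject, predicate, object).
TRIPLES = {
    ("ex:City", "rdfs:subClassOf", "ex:Location"),
    ("ex:founded", "rdfs:domain", "ex:City"),
    ("dbp:Chicago", "ex:founded", "1837"),
}


def infer_types(triples):
    """Apply the rdfs:domain and rdfs:subClassOf rules until nothing new appears."""
    inferred = set(triples)
    while True:
        new = set()
        for s, p, o in inferred:
            # Domain rule: if property p has rdfs:domain C, then s is a C.
            for prop, q, cls in inferred:
                if q == "rdfs:domain" and prop == p:
                    new.add((s, "rdf:type", cls))
            # Subclass rule: if s is a C1 and C1 is a subclass of C2, s is a C2.
            if p == "rdf:type":
                for sub, q, sup in inferred:
                    if q == "rdfs:subClassOf" and sub == o:
                        new.add((s, "rdf:type", sup))
        if new <= inferred:
            return inferred
        inferred |= new


if __name__ == "__main__":
    for triple in sorted(infer_types(TRIPLES) - TRIPLES):
        print(triple)
```

Nothing in the input says that dbp:Chicago is a city; the sketch derives that from the use of ex:founded, and then derives that it is also an ex:Location.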
  30. 30. How do we query stuff in this data? SPARQL
  31. 31. Data and metadata: dbp:Chicago_Cubs rdf:type ex:Baseball_Team . dbp:Wrigley_Field rdf:type ex:Stadium . dbp:Chicago rdf:type ex:City . dbp:Chicago_Cubs dbpo:owner dbp:Wrigley_Field . dbp:Wrigley_Field dbpo:location dbp:Chicago . Prefixes: dbp: http://dbpedia.org/resource/, dbpo: http://dbpedia.org/ontology/
  32. 32. Graph pattern [diagram: the same graph with the resources replaced by the variables ?owner, ?stadium, and ?city; ?stadium has rdf:type ex:Stadium, ?city has rdf:type ex:City, ?owner is linked to ?stadium by dbpo:owner, and ?stadium is linked to ?city by dbpo:location]
  33. 33. Triple pattern: each edge of the graph pattern is written as a triple pattern. ?stadium rdf:type ex:Stadium . ?city rdf:type ex:City . ?owner dbpo:owner ?stadium . ?stadium dbpo:location ?city .
  34. 34. The triple patterns form the WHERE clause of a query:
      SELECT ?owner ?stadium ?city
      WHERE {
        ?owner dbpo:owner ?stadium .
        ?stadium dbpo:location ?city .
        ?stadium rdf:type ex:Stadium .
        ?city rdf:type ex:City .
      }
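To make the idea of matching a graph pattern concrete, here is a small sketch that evaluates the four triple patterns of this query against the five data triples from slide 31. It is a naive nested-loop matcher written for illustration, assuming variables are strings that start with "?". It is not how a SPARQL engine is actually implemented, and the names `match_pattern` and `select` are invented here.

```python
# Data triples from the "Data and metadata" slide.
DATA = [
    ("dbp:Chicago_Cubs", "rdf:type", "ex:Baseball_Team"),
    ("dbp:Wrigley_Field", "rdf:type", "ex:Stadium"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
]

# The WHERE clause of the query, one tuple per triple pattern.
PATTERNS = [
    ("?owner", "dbpo:owner", "?stadium"),
    ("?stadium", "dbpo:location", "?city"),
    ("?stadium", "rdf:type", "ex:Stadium"),
    ("?city", "rdf:type", "ex:City"),
]


def match_pattern(pattern, triple, binding):
    """Return `binding` extended so `pattern` matches `triple`, or None."""
    result = dict(binding)
    for term, value in zip(pattern, triple):
        if term.startswith("?"):
            # A variable must keep the value it was bound to earlier.
            if result.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return result


def select(variables, patterns, data):
    """Evaluate the patterns left to right and project the given variables."""
    bindings = [{}]
    for pattern in patterns:
        next_bindings = []
        for binding in bindings:
            for triple in data:
                extended = match_pattern(pattern, triple, binding)
                if extended is not None:
                    next_bindings.append(extended)
        bindings = next_bindings
    return [tuple(b[v] for v in variables) for b in bindings]


if __name__ == "__main__":
    print(select(["?owner", "?stadium", "?city"], PATTERNS, DATA))
```

Each pattern narrows or extends the set of candidate bindings, and shared variables such as ?stadium act as join keys between patterns.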
  35. 35. SPARQL • Joins • Unions • Outer joins • Filter with criteria • Project expressions • Sort • Duplicate removal • Slice (limit / offset) • Aggregates (grouping, etc.) • Subqueries
  36. 36. Sounds interesting. But I don't have triples!
  37. 37. How do we map tables (text or sequence file) to triples?
  38. 38. Music Database
      Musicians:
      MID  First  Last       Inst_ID
      1    Eddie  Van Halen  10
      2    Yo Yo  Ma         20
      3    Kenny  G          30
      Instruments:
      IID  Instrument  Type
      10   Guitar      String
      20   Cello       String
      30   Saxophone   Woodwind
  39. 39. Musician Schema: music:Musician and music:Instrument have rdf:type rdfs:Class. The properties (rdf:type rdf:Property) music:firstName, music:lastName, and music:plays have rdfs:domain music:Musician, and music:plays has rdfs:range music:Instrument. The properties music:instName and music:instType have rdfs:domain music:Instrument.
  40. 40. Tables to Triples (Musicians and Instruments tables as on slide 38). Turn each key into a resource and specify the proper type of each resource:
      artist:1 rdf:type music:Musician
      artist:2 rdf:type music:Musician
      artist:3 rdf:type music:Musician
      instrument:10 rdf:type music:Instrument
      instrument:20 rdf:type music:Instrument
      instrument:30 rdf:type music:Instrument
  41. 41. Tables to Triples (Musicians and Instruments tables as on slide 38). Turn each cell into a triple based on the key, property (mapped per column), and value:
      artist:1 music:firstName "Eddie"
      artist:1 music:lastName "Van Halen"
      artist:2 music:firstName "Yo Yo"
      artist:2 music:lastName "Ma"
      artist:3 music:firstName "Kenny"
      artist:3 music:lastName "G"
      instrument:10 music:instName "Guitar"
      instrument:10 music:instType "String"
      instrument:20 music:instName "Cello"
      instrument:20 music:instType "String"
      instrument:30 music:instName "Saxophone"
      instrument:30 music:instType "Woodwind"
  42. 42. Tables to Triples (Musicians and Instruments tables as on slide 38). Turn each foreign key reference into a relationship between the foreign and primary resources:
      artist:1 music:plays instrument:10
      artist:2 music:plays instrument:20
      artist:3 music:plays instrument:30
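The three steps (type each key, turn each cell into a triple, turn each foreign key into a relationship) can be sketched in a few lines of Python. The rows below are the Musicians and Instruments tables from the slides and the plays relationships follow the Inst_ID column of the Musicians table. The function names are hypothetical, and the sketch hard-codes the column-to-property mapping that a real mapping language would express declaratively.

```python
MUSICIANS = [
    {"MID": 1, "First": "Eddie", "Last": "Van Halen", "Inst_ID": 10},
    {"MID": 2, "First": "Yo Yo", "Last": "Ma", "Inst_ID": 20},
    {"MID": 3, "First": "Kenny", "Last": "G", "Inst_ID": 30},
]

INSTRUMENTS = [
    {"IID": 10, "Instrument": "Guitar", "Type": "String"},
    {"IID": 20, "Instrument": "Cello", "Type": "String"},
    {"IID": 30, "Instrument": "Saxophone", "Type": "Woodwind"},
]


def musician_triples(rows):
    """Map each Musicians row to triples about an artist:<MID> resource."""
    triples = []
    for row in rows:
        subject = f"artist:{row['MID']}"
        # Step 1: the primary key becomes a typed resource.
        triples.append((subject, "rdf:type", "music:Musician"))
        # Step 2: each value column becomes a property of that resource.
        triples.append((subject, "music:firstName", row["First"]))
        triples.append((subject, "music:lastName", row["Last"]))
        # Step 3: the foreign key becomes a link to another resource.
        triples.append((subject, "music:plays", f"instrument:{row['Inst_ID']}"))
    return triples


def instrument_triples(rows):
    """Map each Instruments row to triples about an instrument:<IID> resource."""
    triples = []
    for row in rows:
        subject = f"instrument:{row['IID']}"
        triples.append((subject, "rdf:type", "music:Instrument"))
        triples.append((subject, "music:instName", row["Instrument"]))
        triples.append((subject, "music:instType", row["Type"]))
    return triples


if __name__ == "__main__":
    for triple in musician_triples(MUSICIANS) + instrument_triples(INSTRUMENTS):
        print(*triple)
```

Three musicians yield twelve triples and three instruments yield nine, one per key plus one per non-key cell.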
  43. 43. R2RML • "Relational to RDF Mapping Language" • RDB2RDF Working Group at W3C • ETL "data transformation" use case • Dynamic "query translation" use case: translate a SPARQL query against the domain into a SQL query against the DBMS
  44. 44. R2RML Triple Mapping [diagram: the schema fragment (music:instName and music:instType, each with rdfs:domain music:Instrument) above the Instruments table (columns IID, Instrument, Type; first row 10, Guitar, String)]
  45. 45. R2RML Triple Mapping [diagram, next build step: a Triples Map is attached to the Instruments table through rr:tableName]
  46. 46. R2RML Triple Mapping [diagram, next build step: a Subject Map with the template "http://example.com/music/Inst-{iid}" is linked to music:Instrument through rr:class]
  47. 47. R2RML Triple Mapping [diagram, next build step: two Predicate Object Maps are linked to music:instName and music:instType through rr:predicate, and to table columns through rr:column]
  48. 48. @prefix rr: <http://www.w3.org/ns/r2rml#> .
      @prefix music: <http://example.com/music/> .
      @prefix mapping: <http://example.com/ont/> .
      mapping:InstrumentMapping a rr:TriplesMapClass;
        rr:logicalTable [ rr:tableName "Instruments" ];
        rr:subjectMap [
          rr:template "http://example.com/music/Inst-{iid}";
          rr:class music:Instrument
        ];
        rr:predicateObjectMap [
          rr:predicate music:instName ;
          rr:objectMap [ rr:column "instrument" ];
        ];
        rr:predicateObjectMap [
          rr:predicate music:instType ;
          rr:objectMap [ rr:column "type" ];
        ];
      .
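Conceptually, a mapping like this one is a recipe that is applied to every row of the logical table: the template produces the subject URI, the class produces an rdf:type triple, and each predicate-object map produces one triple from a column value. The sketch below mimics that behavior with a plain Python dictionary. It is an illustration only: it does not parse R2RML, the dictionary layout and the `apply_mapping` function are invented for this example, and it skips details a real processor would handle, such as escaping template values and datatype handling.

```python
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hand-written stand-in for the mapping on the slide (hypothetical layout).
INSTRUMENT_MAPPING = {
    "table": "Instruments",
    "subject_template": "http://example.com/music/Inst-{iid}",
    "subject_class": "http://example.com/music/Instrument",
    # (predicate URI, source column) pairs, one per predicate-object map.
    "predicate_object_maps": [
        ("http://example.com/music/instName", "instrument"),
        ("http://example.com/music/instType", "type"),
    ],
}

# Rows of the Instruments table, keyed by the column names the mapping uses.
ROWS = [
    {"iid": 10, "instrument": "Guitar", "type": "String"},
    {"iid": 20, "instrument": "Cello", "type": "String"},
    {"iid": 30, "instrument": "Saxophone", "type": "Woodwind"},
]


def apply_mapping(mapping, rows):
    """Generate (subject, predicate, object) triples for every row."""
    triples = []
    for row in rows:
        # Fill the {column} placeholders in the template from the row.
        subject = mapping["subject_template"].format_map(row)
        triples.append((subject, RDF_TYPE, mapping["subject_class"]))
        for predicate, column in mapping["predicate_object_maps"]:
            triples.append((subject, predicate, row[column]))
    return triples


if __name__ == "__main__":
    for triple in apply_mapping(INSTRUMENT_MAPPING, ROWS):
        print(triple)
```

Because the mapping is data rather than code, the same function could be pointed at a different table by swapping in another dictionary, which is the property that makes declarative mappings attractive for adding new sources.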
  49. 49. Direct mapping • Automatically map relational tables into a domain vocabulary using R2RML • Good starting point to rapidly integrate two data sources
  50. 50. So what about big data?
  51. 51. Triple data in Hadoop • N-Triples files: standard line format for RDF data • Indexed triple format: triples in Thrift representing RDF terms • Text / sequence files as tabular sources
  52. 52. SPARQL in Hadoop • Compile SPARQL to map-reduce jobs against triple (or tuple) data • Results materialized back into Hadoop files • Similar to HiveQL compiling SQL to map-reduce against tabular data
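The slides do not show how the compilation to map-reduce works. One common way to express a join between two triple patterns in map-reduce is a reduce-side join on the shared variable, and the sketch below simulates that in plain Python for the patterns ?owner dbpo:owner ?stadium and ?stadium dbpo:location ?city. This is an assumption made for illustration, not the compiler described in the talk, and it runs in a single process rather than on Hadoop. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import product

TRIPLES = [
    ("dbp:Chicago_Cubs", "dbpo:owner", "dbp:Wrigley_Field"),
    ("dbp:Wrigley_Field", "dbpo:location", "dbp:Chicago"),
    ("dbp:Chicago", "rdf:type", "ex:City"),
]


def map_phase(triples):
    """Emit (join key, tagged partial binding) pairs keyed on ?stadium."""
    for s, p, o in triples:
        if p == "dbpo:owner":        # matches ?owner dbpo:owner ?stadium
            yield o, ("left", {"?owner": s})
        elif p == "dbpo:location":   # matches ?stadium dbpo:location ?city
            yield s, ("right", {"?city": o})


def shuffle(pairs):
    """Group mapper output by key, as the framework would between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine every left binding with every right binding for one key."""
    left = [b for tag, b in values if tag == "left"]
    right = [b for tag, b in values if tag == "right"]
    for l, r in product(left, right):
        yield {"?stadium": key, **l, **r}


def run_join(triples):
    """Run the simulated map, shuffle, and reduce steps in sequence."""
    results = []
    for key, values in shuffle(map_phase(triples)).items():
        results.extend(reduce_phase(key, values))
    return results


if __name__ == "__main__":
    print(run_join(TRIPLES))
```

Triples that match neither pattern are dropped in the map phase, so only candidate bindings are shuffled; keys that appear on just one side produce no output in the reduce phase.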
  53. 53. R2RML in Hadoop • Provide mapping file against tabular data files in Hadoop • Execute SPARQL queries through the virtual mapping: view your data as triples, but leave it in sequence files • OR materialize the virtual mapping into a real set of triples
  54. 54. Federation • Execute queries against a combination of data inside and outside Hadoop • Or against a combination of Hadoop and real-time (Storm) • Or across multiple Hadoop clusters!
  55. 55. Additional capabilities • SQL queries against tabular data • Metadata registry • Workflow design and execution
  56. 56. BioBig example • Load into Hadoop as triples: Diseasome - diseases (16.2 MB); LinkedCT - clinical trials (4.5 GB); DrugBank - drugs (144 MB); GeneID - genes (18 GB); PubMed - research publications (12 GB) • Map into common domain vocabulary • Query across all data sets
  57. 57. BioBig domain ontology (partial)
  58. 58. SELECT ?disease ?disname ?geneid
      WHERE {
        ?geneid a geneid:Gene .
        ?geneid gene2pub:pubmed_xref ?article .
        OPTIONAL { ?geneid dc:title ?genetitle . }
        ?disease a diseasome:diseases .
        ?genedb a diseasome:genes .
        ?disease diseasome:associatedGene ?genedb .
        ?genedb diseasome:geneId ?geneid .
        OPTIONAL { ?disease diseasome:name ?disname . }
      }
      [diagram: the same query drawn as a graph pattern over ?disease, ?genedb, ?geneid, ?article, ?genetitle, and ?disname]
  59. 59. Thanks!
