Information Sciences Institute
D-REPR: A Language For Describing And Mapping
Diversely-Structured Data Sources To RDF
Binh Vu, Jay Pujara, and Craig Knoblock
Motivating example
• Three data sources in three formats: (1) csv, (2) json, (3) netcdf
• No uniform method to access data
• Need one method to access all types of data
• A language for describing datasets
Heterogeneous datasets
• Multiple formats: CSV, JSON, XLSX, NetCDF4, …
Heterogeneous datasets
• Same format, multiple layouts
Related work
• Mapping nested relational datasets:
– RML (Dimou et al., 2014), KR2RML (Slepicka et al., 2015), xR2RML (Michel et al.,
2015), etc.
– Can handle multiple formats, but only work for the nested relational model layout
• Mapping tabular datasets:
– XLWrap (Langegger et al., 2009), M2 (O’Connor et al., 2010), T2WML (Szekely
et al., 2019)
– Can handle multiple layouts, but support only tabular formats
Contributions
• A generic language for easily describing and mapping heterogeneous
datasets to RDF
– It can map a wide variety of data sources, going beyond the set of
sources that existing languages support.
• The language is extensible to new formats and layouts
• An efficient engine to convert datasets to RDF
Our approach
• Step 1: define resources
• Step 2: define attributes
• Step 3: join attributes to tables
• Step 4: semantic modeling

Attributes: gender, observation, indicator_col, indicator, url

gender   observation   indicator_col
male     57.7          LIFE_0035
female   59.6          LIFE_0035
male     60.6          LIFE_0035
female   62.1          LIFE_0035

indicator   url
LIFE_0035   http://...?iid=35
LIFE_0029   http://...?iid=29

Semantic model classes: qb:Observation, eg:Indicator
Semantic model predicates: sdmx-m:obsValue, sdmx-d:sex, eg:indicator, drepr:uri, eg:code
Step 1: Resources
• A resource can be a physical file, SQL table, etc.
• Syntax:
• Example: life_table.csv, indicators.json
Step 2: Attributes
• An attribute contains values that belong to one group
• Syntax
• Examples: attributes of life_table.csv and of indicators.json
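As a rough illustration of the idea (this is not D-REPR's own syntax), an attribute can be pictured as a path that selects one group of cells from a resource. The sketch below uses plain Python on an in-memory copy of the example table; the `select` helper and the range-based paths are hypothetical.

```python
# In-memory copy of the example table (rows x columns), for illustration only.
table = [
    ["Indicator", "male", "female"],
    ["LIFE_0035", "57.7", "59.6"],
    ["LIFE_0035", "60.6", "62.1"],
]


def select(grid, rows, cols):
    """Return {(row, col): value} for every cell addressed by the two ranges."""
    return {(r, c): grid[r][c] for r in rows for c in cols}


# Each attribute is a named group of cells (hypothetical path notation).
attributes = {
    "gender": select(table, range(0, 1), range(1, 3)),
    "observation": select(table, range(1, 3), range(1, 3)),
    "indicator_col": select(table, range(1, 3), range(0, 1)),
}

for name, cells in attributes.items():
    print(name, list(cells.values()))
```

Keeping the cell positions next to the values is what later makes position-based joins possible.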
Step 3: Alignments
• Explicitly specifying the layout through alignments
• For linking across resources

life_tbl.csv
Indicator   male   female
LIFE_0035   57.7   59.6
LIFE_0035   60.6   62.1

Attributes from life_tbl.csv
– gender: male, female
– observation: 57.7, 59.6, 60.6, 62.1
– indicator_col: LIFE_0035, LIFE_0035

Attributes from indicators.json
– indicator: LIFE_0035, LIFE_0029
– url: http://...?iid=35, http://...?iid=29

After aligning gender, observation, and indicator_col
gender   observation   indicator_col
male     57.7          LIFE_0035
female   59.6          LIFE_0035
male     60.6          LIFE_0035
female   62.1          LIFE_0035
Step 3: Alignments
• Join by value (equi-join)
• Syntax
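A minimal sketch of a join by value, written in plain Python rather than in D-REPR's syntax. It pairs each value of `indicator_col` with the records of indicators.json that carry the same code; the function name `join_by_value` and the record layout are assumptions made for illustration.

```python
# Values taken from the running example; variable names are illustrative.
indicator_col = ["LIFE_0035", "LIFE_0035", "LIFE_0035", "LIFE_0035"]
indicators = [
    {"indicator": "LIFE_0035", "url": "http://...?iid=35"},
    {"indicator": "LIFE_0029", "url": "http://...?iid=29"},
]


def join_by_value(left_values, right_records, key):
    """Equi-join: pair each left value with the right records whose key equals it."""
    index = {}
    for record in right_records:
        index.setdefault(record[key], []).append(record)
    return [(value, match) for value in left_values for match in index.get(value, [])]


pairs = join_by_value(indicator_col, indicators, "indicator")
urls = [match["url"] for _, match in pairs]
print(urls)
```

Building the lookup index once keeps the join linear in the number of values instead of comparing every pair.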
Step 3: Alignments
• Join by positions in the dataset
Cell index: (dimension 0 (row), dimension 1 (column))
     0           1           2      3
0                            2016
1    Indicator   Age Group   male   female
2    LIFE_0035   <1 year     57.7   59.6
3    LIFE_0035   1-4 years   60.6   62.1
• gender at (1,2), (1,3); observation at (2,2), (2,3), (3,2), (3,3)
• gender and observation are aligned on dimension 1 (column):
– male (1,2): 57.7 (2,2), 60.6 (3,2)
– female (1,3): 59.6 (2,3), 62.1 (3,3)
Step 3: Alignments
• Join by positions in the dataset
Cell index: (dimension 0 (row), dimension 1 (column))
     0           1           2      3
0                            2016
1    Indicator   Age Group   male   female
2    LIFE_0035   <1 year     57.7   59.6
3    LIFE_0035   1-4 years   60.6   62.1
• indicator_col at (2,0), (3,0); observation at (2,2), (2,3), (3,2), (3,3)
• indicator_col and observation are aligned on dimension 0 (row):
– LIFE_0035 (2,0): 57.7 (2,2), 59.6 (2,3)
– LIFE_0035 (3,0): 60.6 (3,2), 62.1 (3,3)
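The two position-based joins on this sheet can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the engine's implementation: `align_by_dimension` is a hypothetical helper, and placing 2016 in column 2 of row 0 is assumed.

```python
# The example sheet; empty strings stand for blank cells.
# The position of "2016" in row 0 is assumed and is not used below.
grid = [
    ["", "", "2016", ""],
    ["Indicator", "Age Group", "male", "female"],
    ["LIFE_0035", "<1 year", "57.7", "59.6"],
    ["LIFE_0035", "1-4 years", "60.6", "62.1"],
]

# Attributes as lists of (row, col) positions.
gender = [(1, 2), (1, 3)]
observation = [(2, 2), (2, 3), (3, 2), (3, 3)]
indicator_col = [(2, 0), (3, 0)]


def align_by_dimension(source, target, dim):
    """Map each source position to the target positions sharing its index in `dim`."""
    return {s: [t for t in target if t[dim] == s[dim]] for s in source}


by_column = align_by_dimension(observation, gender, dim=1)      # same column
by_row = align_by_dimension(observation, indicator_col, dim=0)  # same row

# Assemble one (gender, observation, indicator_col) row per observation cell.
rows = [
    (grid[gr][gc], grid[r][c], grid[ir][ic])
    for (r, c) in observation
    for (gr, gc) in by_column[(r, c)]
    for (ir, ic) in by_row[(r, c)]
]
for row in rows:
    print(row)
```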
Step 3: Alignments
• Join by positions in the dataset
• Attribute paths over the sample data:
– Name: $.*.departments.people.*.name
– Phone: $.*.departments.people.*.phone
• The five path steps after $ are dimension 0 to dimension 4
• Name and Phone are aligned in dimensions 0 and 3 (the two * steps)
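A small sketch of the same idea for nested data, in plain Python. The sample records, the `walk` helper, and the list-of-steps path notation are all hypothetical; only the two paths and the aligned dimensions (0 and 3) come from the slide.

```python
# Hypothetical sample data matching the shape of the two paths.
data = [
    {"departments": {"people": [
        {"name": "Ann", "phone": "111"},
        {"name": "Bob", "phone": "222"},
    ]}},
    {"departments": {"people": [
        {"name": "Cy", "phone": "333"},
    ]}},
]


def walk(node, steps, index=()):
    """Yield (index, value) for each value reached by `steps`; '*' iterates a list."""
    if not steps:
        yield index, node
        return
    step, rest = steps[0], steps[1:]
    if step == "*":
        for i, child in enumerate(node):
            yield from walk(child, rest, index + (i,))
    else:
        yield from walk(node[step], rest, index + (step,))


names = dict(walk(data, ["*", "departments", "people", "*", "name"]))
phones = dict(walk(data, ["*", "departments", "people", "*", "phone"]))


def key(index, dims=(0, 3)):
    """Keep only the aligned dimensions of a five-step index."""
    return tuple(index[d] for d in dims)


phone_by_key = {key(i): value for i, value in phones.items()}
pairs = [(value, phone_by_key[key(i)]) for i, value in names.items()]
print(pairs)
```

Dimensions 1, 2, and 4 are fixed keys, so only the two list positions distinguish one person from another.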
Step 3: Alignments
• Easy to incorporate new alignment functions
• Users only need to define the minimum number of joins (N-1 for N attributes)
because the engine can infer the rest via composition.
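One way to picture inference via composition, as a sketch only (how the engine actually does this is not shown on the slide). Two declared joins over three attributes are represented as index mappings, and the third join is derived by chaining them. The `compose` helper is hypothetical; the cell positions come from the example sheet.

```python
def compose(first, second):
    """Compose two alignments: follow `first`, then `second`, for every source index."""
    return {
        src: [dst for mid in mids for dst in second.get(mid, [])]
        for src, mids in first.items()
    }


# Two user-defined joins over three attributes (N = 3, so N - 1 = 2 joins).
gender_to_observation = {
    (1, 2): [(2, 2), (3, 2)],
    (1, 3): [(2, 3), (3, 3)],
}
observation_to_indicator = {
    (2, 2): [(2, 0)],
    (2, 3): [(2, 0)],
    (3, 2): [(3, 0)],
    (3, 3): [(3, 0)],
}

# The third join (gender -> indicator_col) is derived rather than declared.
gender_to_indicator = compose(gender_to_observation, observation_to_indicator)
print(gender_to_indicator)
```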
Step 4: Semantic Model
• Using ontologies to describe your data (classes and predicates)
• Syntax
gender   observation
male     57.7
female   59.6
male     60.6
female   62.1

indicator   url
LIFE_0035   http://...?iid=35
LIFE_0029   http://...?iid=29

Classes: qb:Observation, eg:Indicator
Predicates: sdmx-m:obsValue, sdmx-d:sex, eg:indicator, drepr:uri, eg:code
Step 4: Semantic Model
• Users can create an arbitrary semantic model, even when using attributes
across multiple resources

gender   observation   url
male     57.7          http://...?iid=35
female   59.6          http://...?iid=35
male     60.6          http://...?iid=35
female   62.1          http://...?iid=35

Class: qb:Observation
Predicates: sdmx-m:obsValue, sdmx-d:sex, eg:indicator
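To make the last step concrete, here is a hedged sketch of how joined rows could be turned into triples under such a semantic model. It uses plain tuples rather than an RDF library. The pairing of each attribute with a predicate (gender with sdmx-d:sex, observation with sdmx-m:obsValue, url with eg:indicator) and the blank-node naming are assumptions for illustration.

```python
# Joined rows from the example, with attributes drawn from two resources.
rows = [
    {"gender": "male", "observation": "57.7", "url": "http://...?iid=35"},
    {"gender": "female", "observation": "59.6", "url": "http://...?iid=35"},
    {"gender": "male", "observation": "60.6", "url": "http://...?iid=35"},
    {"gender": "female", "observation": "62.1", "url": "http://...?iid=35"},
]

# Assumed semantic model: one class, one predicate per attribute.
CLASS = "qb:Observation"
PREDICATES = {
    "gender": "sdmx-d:sex",
    "observation": "sdmx-m:obsValue",
    "url": "eg:indicator",
}


def to_triples(records):
    """Emit (subject, predicate, object) tuples, one blank-node subject per record."""
    triples = []
    for i, record in enumerate(records):
        subject = f"_:obs{i}"
        triples.append((subject, "rdf:type", CLASS))
        for attribute, predicate in PREDICATES.items():
            triples.append((subject, predicate, record[attribute]))
    return triples


triples = to_triples(rows)
for triple in triples[:4]:
    print(triple)
```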
Data cleaning (optional)
• Users can write Python functions to clean or transform the data
• Can re-use functions or existing libraries

Table after cleaning:
     0           1           2      3
0                            2016   2016
1    Indicator   Age Group   male   female
2    LIFE_0035   <1 year     57.7   59.6
3    LIFE_0035   1-4 years   60.6   62.1

Cell before cleaning: index = (0,3), value = “”
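A cleaning function of this kind might look like the sketch below, which fills an empty header cell with the nearest non-empty value to its left. It is plain Python, not the engine's preprocessing API; the name `fill_from_left` and the placement of 2016 in column 2 before cleaning are assumptions.

```python
def fill_from_left(grid, row):
    """Replace each empty cell in `row` with the nearest non-empty value to its left."""
    cleaned = [list(r) for r in grid]  # copy so the input is not modified
    last = ""
    for col, value in enumerate(cleaned[row]):
        if value == "":
            cleaned[row][col] = last
        else:
            last = value
    return cleaned


# Sheet before cleaning; the cell at index (0, 3) is empty.
grid = [
    ["", "", "2016", ""],
    ["Indicator", "Age Group", "male", "female"],
    ["LIFE_0035", "<1 year", "57.7", "59.6"],
    ["LIFE_0035", "1-4 years", "60.6", "62.1"],
]

cleaned = fill_from_left(grid, row=0)
print(cleaned[0])
```

Cells with nothing to their left stay empty, so only the cell at (0,3) changes here.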
Evaluation
• Coverage of D-REPR
– Randomly sampling 700 datasets from data.gov
– Modeling datasets of different formats and layouts
Evaluation
• Runtime of D-REPR engine (ms)
– Mapping large CSV files (row-based table) containing (name, phone, address)
– Generating 1.3 million triples per second (15 times faster than KR2RML)
Discussion and Future work
• A novel generic data representation language: D-REPR
– Uses a declarative approach
– Works for heterogeneous datasets of different formats and layouts
• Open source: https://github.com/usc-isi-i2/d-repr
• Future work:
– (Semi-)automatically generating D-REPR models
– UI for annotating datasets
– Improving efficiency of D-REPR’s engine by doing parallel processing
References
• [1] RML: Anastasia Dimou, Miel Vander Sande, Pieter Colpaert, Ruben Verborgh,
Erik Mannens, and Rik Van de Walle. 2014. RML: A Generic Language for Integrated
RDF Mappings of Heterogeneous Data
• [2] KR2RML: Jason Slepicka, Chengye Yin, Pedro Szekely, and Craig A. Knoblock.
2015. KR2RML: An Alternative Interpretation of R2RML for Heterogenous Sources
• [3] xR2RML: Franck Michel, Loïc Djimenou, Catherine Faron Zucker, and Johan
Montagnat. 2015. Translation of relational and non-relational databases into RDF
with xR2RML
• [4] XLWrap: Andreas Langegger and Wolfram Wöß. 2009. XLWrap: Querying and
Integrating Arbitrary Spreadsheets with SPARQL
• [5] M2: Martin J. O’Connor, Christian Halaschek-Wiener, and Mark A. Musen. 2010. M2: A
Language for Mapping Spreadsheets to OWL
