SlideShare a Scribd company logo
Rule-based Capture/Storage
of Scientific Data from PDF Files
and Export using a
Generic Scientific Data Model
Stuart J. Chalk, Audrey Bartholomew, Bashar Baraz, John Turner
Department of Chemistry, University of North Florida
schalk@unf.edu
CINF Paper 3 – 251st ACS Meeting Spring 2016
#ACSCINFDataSummit
 Data, Data Everywhere
 Extracting Text from PDF Files
 Making Sense of Textual Data
 Rules and Rulesets
 Overcoming Incompatibilities
 A Generic Scientific Data Model
 Building the Website
 Functionality of the Website
 Next Steps
 Future Plans
 Conclusion
Outline
Data, Data Everywhere
 U.S. Gov. “…open and machine readable data…” (5/2013)
 http://www.data.gov/
 NSF's Open Data Inventory (http://www.nsf.gov/data/)
 Expand -- publish additional data assets in the inventory:
 Enrich -- improve the discoverability, management, and
reusability of agency data assets through metadata.
 Open -- provide machine-readable and publically accessible
agency data assets.
 Funding agencies require “Data Management Plans”
 MPS Open Data Workshop - https://mpsopendata.crc.nd.edu/
Data, Data Everywhere
(but interoperable?)
 Joint Declaration of Data Citation Principles
Force 11 - https://www.force11.org/datacitation
 Importance
 Credit and Attribution
 Unique Identification
 Access
 Persistence
 Specificity and Verifiability
 Interoperability and Flexibility
 PDF Files are not meant to be vehicles for research data
sharing – but sadly have become the default
 Force 11 grew out of an initial meeting at UC San Diego in
2011 – Beyond the PDF
https://sites.google.com/site/beyondthepdf/
 PDF files store text in many places, and not in logical
sequence as it appears on the page
 Scanned images can be converted to text using optical
character recognition
 Many tools to get text out of PDF files
Extracting Text from PDF Files
 Software tested
 Adobe acrobat (plain text, accessible text, RTF, XML)
Converted to the newest PDF version
 lapdftext (Location Aware)
https://github.com/GullyAPCBurns/lapdftext
 pdftotext (part of)
Xpdf http://www.foolabs.com/xpdf/home.html
Extracting Text from PDF Files
Extracting Text from PDF Files
Making Sense of Textual Data
 Process the text file one line at a time
 ‘Generic’ rules to make decisions on a line
 May be multiple rules for one line
 Includes instructions to move down or stay on line
 Sets of rules tailored to match the generic layout of the text
 These were linked to datatype and publication
Rules and Rulesets
More sophiticated GetFormula regex (([A-Z][a-zA-Z0-9]{0,2}s)+)
 The data we extract with
the ruleset is saved in a
json string
 But how should it be
saved so that compatible
with other data?
 Different properties
 Different units
 We need a generic data
model to put the data in…
Overcoming Incompatibilities
 What is data?
A value of a measured property with/without unit
…or a series of data
... or an observation (text)
 But on its own its not that useful...
 What’s the context?
 Under what conditions was the data obtained?
 What were you testing?
 What instrument did you use to take the measurement?
 Who did the experiment?
A Generic Scientific Data Model
 How to store all this information?
 Relational database – very structured, data has to fixed schema
 Graph database – unstructured, flexible, non-standardized
 Intermediate solution needed
 Must have some structure so it can be easily searched
 Must be flexible to handle and type of data
 Approach
 Use JSON for linked data (JSON-LD) as format to hold data
 Provide a framework upon which you can hang the data and
metadata
A Generic Scientific Data Model
 JavaScript Object Notation (JSON)
 JSON for Linked Data (JSON-LD)
 W3C Recommentation (https://www.w3.org/TR/json-ld/)
 Specification that allows data/metadata to be stored in
JSON format but translated automatically to RDF
(Resource Description Framework)
 Provides a mechanism to add context to data/metadata
stored in JSON
 Note: Does not provide validation of the data/metadata
 Can be validated using JSON Schema
(http://json-schema.org)
What is JSON for Linked Data?
What is JSON for Linked Data?
{
"@context": {
"name": "http://schema.org/name",
"isAlive": "http://example.org/isAlive",
"age": "http://example.org/age",
"height": "http://schema.org/height",
"@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx"
},
"@id": "",
"name": "Stuart Chalk",
"isAlive": true,
"age": 49,
"height": 188.0
}
What is JSON for Linked Data?
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://example.org/age>
"49"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://example.org/isAlive>
"true"^^<http://www.w3.org/2001/XMLSchema#boolean> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://schema.org/height>
"188"^^<http://www.w3.org/2001/XMLSchema#integer> .
<http://www.unf.edu/chemistry/stuart_chalk.aspx>
<http://schema.org/name>
"Stuart Chalk" . http://json-ld.org/playground/
 CakePHP 2.7 PHP Framework (http://cakephp.org)
 Apache 2.4 webserver (http://httpd.apache.org)
 PHP 5.6 (http://www.php.org)
 MySQL 5.5 (http://www.mysql.com)
 jQuery (JavaScript) (http://jquery.com)
 Bootstrap (http://getbootstrap.com/)
 PHPStorm (https://www.jetbrains.com)
 GitHub (https://github.com)
Building the Website
 Upload PDF
 Assign Ruleset
 Extract data
 Perform QC
 Ingest Data
 View “dataset”
 Mass Process – many files from one publication
 Map the Process – see what happened
Functionality of the Website
Database Schema
Live Demo
 Formalize the system to
 Organize the processing
 Capture any processing current done in scripts
 Publish a paper
 Ingest data in SDM format into triplestore and
perform SPARQL queries on aggregated data
 Aggregate data of the same type on the website and
view, organize, plot together
 Provide option to capture and present equation
based results
Next Steps
Take Home
 ~27,500 PDF pages processed
 ~530,000 pieces of physical property data extracted
 Data can be extracted from PDF files as long as the
 Text is cleaned up
 Context is encoded in the process of extraction
 Data homogeneity can be achieved by using a generic
scientific data model
 More work needed to create robust system to handle data
in any “well structured” format
 schalk@unf.edu
 Phone: 904-620-5311
 Skype: stuartchalk
 LinkedIn/Slidehare: https://www.linkedin.com/in/stuchalk
 ORCID: http://orcid.org/0000-0002-0703-7776
 ResearcherID: http://www.researcherid.com/rid/D-8577-2013
Questions?
Thanks to Michael Klinge and Stefan Scherer
 Text Extraction of 11 Landolt-Börnstein Volumes
 Regulat Expression (Regex) fragments used to develop
rules to detect specific types of data
 PHP Scripts used to apply ‘Rulesets’ to text
 Extracted data and metadata captured in JSON
 JSON data added to MySQL database to support searching
and export in a generic scientific data model (JSON-LD)
 Website developed to search and display data
 ~27,500 PDF pages processed
 ~530,000 pieces of physical property data extracted
ChemExtractor:
Semantic Data from PDF Files

More Related Content

What's hot

Cdao Evolution08
Cdao Evolution08Cdao Evolution08
Cdao Evolution08
Brandon Chisham
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
Jun Zhao
 
CDISC2RDF overview with examples
CDISC2RDF overview with examplesCDISC2RDF overview with examples
CDISC2RDF overview with examplesKerstin Forsberg
 
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v12016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
Bruce Kozuma
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
ErhardRahm
 
INSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology InfrastructureINSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology Infrastructure
Research Data Alliance
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
Paul Groth
 
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Stuart Chalk
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectAlexandro Colorado
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Laurent Alquier
 
Crosslinks
Crosslinks Crosslinks
Crosslinks
ericmeeks
 
The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook Ontology
Stuart Chalk
 
Yosemite part-4 webinar-final
Yosemite part-4 webinar-finalYosemite part-4 webinar-final
Yosemite part-4 webinar-final
DATAVERSITY
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark GreavesMediabistro
 
Citation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online LiteratureCitation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online Literature
Balachandar Radhakrishnan
 
Semantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical InformaticsSemantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical Informatics
Chimezie Ogbuji
 
Presentation_euroCRIS_ES
Presentation_euroCRIS_ESPresentation_euroCRIS_ES
Presentation_euroCRIS_ESEd Simons
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
Kerstin Forsberg
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
Kerstin Forsberg
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
CIARD Movement
 

What's hot (20)

Cdao Evolution08
Cdao Evolution08Cdao Evolution08
Cdao Evolution08
 
2009 0807 Lod Gmod
2009 0807 Lod Gmod2009 0807 Lod Gmod
2009 0807 Lod Gmod
 
CDISC2RDF overview with examples
CDISC2RDF overview with examplesCDISC2RDF overview with examples
CDISC2RDF overview with examples
 
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v12016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
2016 Bio-IT World Cell Line Coordination Poster 2016-04-05v1
 
Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1Scalable and privacy-preserving data integration - part 1
Scalable and privacy-preserving data integration - part 1
 
INSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology InfrastructureINSTRUCT - Integrated Structural Biology Infrastructure
INSTRUCT - Integrated Structural Biology Infrastructure
 
Research Data Sharing: A Basic Framework
Research Data Sharing: A Basic FrameworkResearch Data Sharing: A Basic Framework
Research Data Sharing: A Basic Framework
 
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
Integrating AnIML Files in Electronic Laboratory Notebooks - PittCon 2015
 
The OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit ProjectThe OpenOffice.org ODF Toolkit Project
The OpenOffice.org ODF Toolkit Project
 
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.Exploration of a Data Landscape using a Collaborative Linked Data Framework.
Exploration of a Data Landscape using a Collaborative Linked Data Framework.
 
Crosslinks
Crosslinks Crosslinks
Crosslinks
 
The Electronic Notebook Ontology
The Electronic Notebook OntologyThe Electronic Notebook Ontology
The Electronic Notebook Ontology
 
Yosemite part-4 webinar-final
Yosemite part-4 webinar-finalYosemite part-4 webinar-final
Yosemite part-4 webinar-final
 
Web 3 Mark Greaves
Web 3 Mark GreavesWeb 3 Mark Greaves
Web 3 Mark Greaves
 
Citation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online LiteratureCitation Analysis for the Free, Online Literature
Citation Analysis for the Free, Online Literature
 
Semantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical InformaticsSemantic Web Technologies: A Paradigm for Medical Informatics
Semantic Web Technologies: A Paradigm for Medical Informatics
 
Presentation_euroCRIS_ES
Presentation_euroCRIS_ESPresentation_euroCRIS_ES
Presentation_euroCRIS_ES
 
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
MIE2014: A Framework for Evaluating and Utilizing Medical Terminology Mappings
 
Linked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcareLinked Data efforts for data standards in biopharma and healthcare
Linked Data efforts for data standards in biopharma and healthcare
 
The CIARD RINGValeri
The CIARD RINGValeriThe CIARD RINGValeri
The CIARD RINGValeri
 

Similar to Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model

Datalake Architecture
Datalake ArchitectureDatalake Architecture
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale
Bernadette Hyland-Wood
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
Vital.AI
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataMatthew Rowe
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
LEARN Project
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEWShiyong Lu
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout
Carole Goble
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
VMware Tanzu
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
vty
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
Jongwook Woo
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
l_ernest
 
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Support Team, IT Services, University of Oxford
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Research Data Alliance
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
Laboratory of Information Science and Semantic Technologies
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
Ivan Ermilov
 
Enterprise Content Management Migration Best Practices Feat Migrations From...
Enterprise Content Management Migration Best Practices   Feat Migrations From...Enterprise Content Management Migration Best Practices   Feat Migrations From...
Enterprise Content Management Migration Best Practices Feat Migrations From...
Alfresco Software
 
Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Julie Allinson
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
Chris Mattmann
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research Objects
Carole Goble
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
CCG
 

Similar to Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model (20)

Datalake Architecture
Datalake ArchitectureDatalake Architecture
Datalake Architecture
 
Linking Open Government Data at Scale
Linking Open Government Data at Scale Linking Open Government Data at Scale
Linking Open Government Data at Scale
 
Vital AI: Big Data Modeling
Vital AI: Big Data ModelingVital AI: Big Data Modeling
Vital AI: Big Data Modeling
 
Data.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked DataData.dcs: Converting Legacy Data into Linked Data
Data.dcs: Converting Legacy Data into Linked Data
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
An Overview of VIEW
An Overview of VIEWAn Overview of VIEW
An Overview of VIEW
 
FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout FAIR Workflows and Research Objects get a Workout
FAIR Workflows and Research Objects get a Workout
 
Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...Federated Queries Across Both Different Storage Mediums and Different Data En...
Federated Queries Across Both Different Storage Mediums and Different Data En...
 
Data standardization process for social sciences and humanities
Data standardization process for social sciences and humanitiesData standardization process for social sciences and humanities
Data standardization process for social sciences and humanities
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation WorkshopHKU Data Curation MLIM7350 Student Project: Data Curation Workshop
HKU Data Curation MLIM7350 Student Project: Data Curation Workshop
 
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
Research Data Management: An Overview - 2014-05-12 - Humanities Division, Uni...
 
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
Enabling Precise Identification and Citability of Dynamic Data: Recommendatio...
 
Linked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter HaaseLinked Data and Semantic Web Application Development by Peter Haase
Linked Data and Semantic Web Application Development by Peter Haase
 
Data Integration And Visualization
Data Integration And VisualizationData Integration And Visualization
Data Integration And Visualization
 
Enterprise Content Management Migration Best Practices Feat Migrations From...
Enterprise Content Management Migration Best Practices   Feat Migrations From...Enterprise Content Management Migration Best Practices   Feat Migrations From...
Enterprise Content Management Migration Best Practices Feat Migrations From...
 
Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29Swap For Dummies Rsp 2007 11 29
Swap For Dummies Rsp 2007 11 29
 
A Look into the Apache OODT Ecosystem
A Look into the Apache OODT EcosystemA Look into the Apache OODT Ecosystem
A Look into the Apache OODT Ecosystem
 
RO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research ObjectsRO-Crate: A framework for packaging research products into FAIR Research Objects
RO-Crate: A framework for packaging research products into FAIR Research Objects
 
Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage Data Analytics Meetup: Introduction to Azure Data Lake Storage
Data Analytics Meetup: Introduction to Azure Data Lake Storage
 

More from Stuart Chalk

Semantic properties and units
Semantic properties and unitsSemantic properties and units
Semantic properties and units
Stuart Chalk
 
Open semantic chemical structures
Open semantic chemical structuresOpen semantic chemical structures
Open semantic chemical structures
Stuart Chalk
 
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
Stuart Chalk
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Stuart Chalk
 
Bringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic WebBringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic Web
Stuart Chalk
 
Reactions to the Open Spectral Database
Reactions to the Open Spectral DatabaseReactions to the Open Spectral Database
Reactions to the Open Spectral Database
Stuart Chalk
 
Building a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectBuilding a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectStuart Chalk
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSX
Stuart Chalk
 
Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)
Stuart Chalk
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
Stuart Chalk
 
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
Stuart Chalk
 
ACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility DataACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility Data
Stuart Chalk
 
ACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData ProjectACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData Project
Stuart Chalk
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
Stuart Chalk
 
ACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka CollaborationACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka Collaboration
Stuart Chalk
 
247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)
Stuart Chalk
 
Liberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaLiberating Laboratory Data - Eureka
Liberating Laboratory Data - Eureka
Stuart Chalk
 
Liberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIMLLiberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIML
Stuart Chalk
 

More from Stuart Chalk (18)

Semantic properties and units
Semantic properties and unitsSemantic properties and units
Semantic properties and units
 
Open semantic chemical structures
Open semantic chemical structuresOpen semantic chemical structures
Open semantic chemical structures
 
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
ChemExtractor: Enhanced Rule-Based Capture and Identification of PDF Based Pr...
 
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series DataSharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
Sharing Science Data: Semantically Reimagining the IUPAC Solubility Series Data
 
Bringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic WebBringing Flow injection Analysis to the Semantic Web
Bringing Flow injection Analysis to the Semantic Web
 
Reactions to the Open Spectral Database
Reactions to the Open Spectral DatabaseReactions to the Open Spectral Database
Reactions to the Open Spectral Database
 
Building a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP ProjectBuilding a Standard for Standards: The ChAMP Project
Building a Standard for Standards: The ChAMP Project
 
A Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSXA Standard Data Format for Computational Chemistry: CSX
A Standard Data Format for Computational Chemistry: CSX
 
Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)Overview of the Analytical Information Markup Language (AnIML)
Overview of the Analytical Information Markup Language (AnIML)
 
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into EurekaACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
ACS 248th Paper 146 VIVO/ScientistsDB Integration into Eureka
 
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka IntegrationACS 248th Paper 136 JSmol/JSpecView Eureka Integration
ACS 248th Paper 136 JSmol/JSpecView Eureka Integration
 
ACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility DataACS 248th Paper 108 NIST-IUPAC Solubility Data
ACS 248th Paper 108 NIST-IUPAC Solubility Data
 
ACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData ProjectACS 248th Paper 104 ChemData Project
ACS 248th Paper 104 ChemData Project
 
ACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP ProjectACS 248th Paper 71 ChAMP Project
ACS 248th Paper 71 ChAMP Project
 
ACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka CollaborationACS 248th Paper 67 Eureka Collaboration
ACS 248th Paper 67 Eureka Collaboration
 
247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)247th ACS Meeting: Experiment Markup Language (ExptML)
247th ACS Meeting: Experiment Markup Language (ExptML)
 
Liberating Laboratory Data - Eureka
Liberating Laboratory Data - EurekaLiberating Laboratory Data - Eureka
Liberating Laboratory Data - Eureka
 
Liberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIMLLiberating Laboratory Data - AnIML
Liberating Laboratory Data - AnIML
 

Recently uploaded

The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
mediapraxi
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
University of Maribor
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills MN
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
Abdul Wali Khan University Mardan,kP,Pakistan
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
University of Maribor
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
muralinath2
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Erdal Coalmaker
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Sérgio Sacani
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
Wasswaderrick3
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
Sharon Liu
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
RASHMI M G
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
Nistarini College, Purulia (W.B) India
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
RASHMI M G
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
yqqaatn0
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Ana Luísa Pinho
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
yqqaatn0
 

Recently uploaded (20)

The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdfThe Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
The Evolution of Science Education PraxiLabs’ Vision- Presentation (2).pdf
 
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
Remote Sensing and Computational, Evolutionary, Supercomputing, and Intellige...
 
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
Travis Hills' Endeavors in Minnesota: Fostering Environmental and Economic Pr...
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...THEMATIC  APPERCEPTION  TEST(TAT) cognitive abilities, creativity, and critic...
THEMATIC APPERCEPTION TEST(TAT) cognitive abilities, creativity, and critic...
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...
 
Oedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptxOedema_types_causes_pathophysiology.pptx
Oedema_types_causes_pathophysiology.pptx
 
Unveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdfUnveiling the Energy Potential of Marshmallow Deposits.pdf
Unveiling the Energy Potential of Marshmallow Deposits.pdf
 
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
Observation of Io’s Resurfacing via Plume Deposition Using Ground-based Adapt...
 
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
DERIVATION OF MODIFIED BERNOULLI EQUATION WITH VISCOUS EFFECTS AND TERMINAL V...
 
20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx20240520 Planning a Circuit Simulator in JavaScript.pptx
20240520 Planning a Circuit Simulator in JavaScript.pptx
 
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptxANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
ANAMOLOUS SECONDARY GROWTH IN DICOT ROOTS.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.Nucleic Acid-its structural and functional complexity.
Nucleic Acid-its structural and functional complexity.
 
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptxBREEDING METHODS FOR DISEASE RESISTANCE.pptx
BREEDING METHODS FOR DISEASE RESISTANCE.pptx
 
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
原版制作(carleton毕业证书)卡尔顿大学毕业证硕士文凭原版一模一样
 
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
Deep Behavioral Phenotyping in Systems Neuroscience for Functional Atlasing a...
 
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
如何办理(uvic毕业证书)维多利亚大学毕业证本科学位证书原版一模一样
 

Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model

  • 1. Rule-based Capture/Storage of Scientific Data from PDF Files and Export using a Generic Scientific Data Model Stuart J. Chalk, Audrey Bartholomew, Bashar Baraz, John Turner Department of Chemistry, University of North Florida schalk@unf.edu CINF Paper 3 – 251st ACS Meeting Spring 2016 #ACSCINFDataSummit
  • 2.  Data, Data Everywhere  Extracting Text from PDF Files  Making Sense of Textual Data  Rules and Rulesets  Overcoming Incompatibilities  A Generic Scientific Data Model  Building the Website  Functionality of the Website  Next Steps  Future Plans  Conclusion Outline
  • 3. Data, Data Everywhere  U.S. Gov. “…open and machine readable data…” (5/2013)  http://www.data.gov/  NSF's Open Data Inventory (http://www.nsf.gov/data/)  Expand -- publish additional data assets in the inventory:  Enrich -- improve the discoverability, management, and reusability of agency data assets through metadata.  Open -- provide machine-readable and publically accessible agency data assets.  Funding agencies require “Data Management Plans”  MPS Open Data Workshop - https://mpsopendata.crc.nd.edu/
  • 4. Data, Data Everywhere (but interoperable?)  Joint Declaration of Data Citation Principles Force 11 - https://www.force11.org/datacitation  Importance  Credit and Attribution  Unique Identification  Access  Persistence  Specificity and Verifiability  Interoperability and Flexibility
  • 5.  PDF Files are not meant to be vehicles for research data sharing – but sadly have become the default  Force 11 grew out of an initial meeting at UC San Diego in 2011 – Beyond the PDF https://sites.google.com/site/beyondthepdf/  PDF files store text in many places, and not in logical sequence as it appears on the page  Scanned images can be converted to text using optical character recognition  Many tools to get text out of PDF files Extracting Text from PDF Files
  • 6.  Software tested  Adobe acrobat (plain text, accessible text, RTF, XML) Converted to the newest PDF version  lapdftext (Location Aware) https://github.com/GullyAPCBurns/lapdftext  pdftotext (part of) Xpdf http://www.foolabs.com/xpdf/home.html Extracting Text from PDF Files
  • 8. Making Sense of Textual Data
  • 9.  Process the text file one line at a time  ‘Generic’ rules to make decisions on a line  May be multiple rules for one line  Includes instructions to move down or stay on line  Sets of rules tailored to match the generic layout of the text  These were linked to datatype and publication Rules and Rulesets More sophiticated GetFormula regex (([A-Z][a-zA-Z0-9]{0,2}s)+)
  • 10.  The data we extract with the ruleset is saved in a json string  But how should it be saved so that compatible with other data?  Different properties  Different units  We need a generic data model to put the data in… Overcoming Incompatibilities
  • 11.  What is data? A value of a measured property with/without unit …or a series of data ... or an observation (text)  But on its own its not that useful...  What’s the context?  Under what conditions was the data obtained?  What were you testing?  What instrument did you use to take the measurement?  Who did the experiment? A Generic Scientific Data Model
  • 12.  How to store all this information?  Relational database – very structured, data has to fixed schema  Graph database – unstructured, flexible, non-standardized  Intermediate solution needed  Must have some structure so it can be easily searched  Must be flexible to handle and type of data  Approach  Use JSON for linked data (JSON-LD) as format to hold data  Provide a framework upon which you can hang the data and metadata A Generic Scientific Data Model
  • 13.  JavaScript Object Notation (JSON)  JSON for Linked Data (JSON-LD)  W3C Recommentation (https://www.w3.org/TR/json-ld/)  Specification that allows data/metadata to be stored in JSON format but translated automatically to RDF (Resource Description Framework)  Provides a mechanism to add context to data/metadata stored in JSON  Note: Does not provide validation of the data/metadata  Can be validated using JSON Schema (http://json-schema.org) What is JSON for Linked Data?
  • 14. What is JSON for Linked Data? { "@context": { "name": "http://schema.org/name", "isAlive": "http://example.org/isAlive", "age": "http://example.org/age", "height": "http://schema.org/height", "@base": "http://www.unf.edu/chemistry/stuart_chalk.aspx" }, "@id": "", "name": "Stuart Chalk", "isAlive": true, "age": 49, "height": 188.0 }
  • 15. What is JSON for Linked Data? <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/age> "49"^^<http://www.w3.org/2001/XMLSchema#integer> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://example.org/isAlive> "true"^^<http://www.w3.org/2001/XMLSchema#boolean> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/height> "188"^^<http://www.w3.org/2001/XMLSchema#integer> . <http://www.unf.edu/chemistry/stuart_chalk.aspx> <http://schema.org/name> "Stuart Chalk" . http://json-ld.org/playground/
  • 16.  CakePHP 2.7 PHP Framework (http://cakephp.org)  Apache 2.4 webserver (http://httpd.apache.org)  PHP 5.6 (http://www.php.org)  MySQL 5.5 (http://www.mysql.com)  jQuery (JavaScript) (http://jquery.com)  Bootstrap (http://getbootstrap.com/)  PHPStorm (https://www.jetbrains.com)  GitHub (https://github.com) Building the Website
  • 17.  Upload PDF  Assign Ruleset  Extract data  Perform QC  Ingest Data  View “dataset”  Mass Process – many files from one publication  Map the Process – see what happened Functionality of the Website
  • 20.  Formalize the system to  Organize the processing  Capture any processing current done in scripts  Publish a paper  Ingest data in SDM format into triplestore and perform SPARQL queries on aggregated data  Aggregate data of the same type on the website and view, organize, plot together  Provide option to capture and present equation based results Next Steps
  • 21. Take Home  ~27,500 PDF pages processed  ~530,000 pieces of physical property data extracted  Data can be extracted from PDF files as long as the  Text is cleaned up  Context is encoded in the process of extraction  Data homogeneity can be achieved by using a generic scientific data model  More work needed to create robust system to handle data in any “well structured” format
  • 22.  schalk@unf.edu  Phone: 904-620-5311  Skype: stuartchalk  LinkedIn/Slidehare: https://www.linkedin.com/in/stuchalk  ORCID: http://orcid.org/0000-0002-0703-7776  ResearcherID: http://www.researcherid.com/rid/D-8577-2013 Questions? Thanks to Michael Klinge and Stefan Scherer
  • 23.  Text Extraction of 11 Landolt-Börnstein Volumes  Regulat Expression (Regex) fragments used to develop rules to detect specific types of data  PHP Scripts used to apply ‘Rulesets’ to text  Extracted data and metadata captured in JSON  JSON data added to MySQL database to support searching and export in a generic scientific data model (JSON-LD)  Website developed to search and display data  ~27,500 PDF pages processed  ~530,000 pieces of physical property data extracted ChemExtractor: Semantic Data from PDF Files