Analysing Structured Scholarly Data
Embedded in Web Pages
Pracheta Sahoo, Ujwal Gadiraju, Ran Yu,
Sriparna Saha and Stefan Dietze
WWW 2016
April 11th, 2016
Montreal, Canada
OVERVIEW
❏ INTRODUCTION
❏ MOTIVATION
❏ RESEARCH QUESTIONS
❏ ANALYSES
❏ CONCLUSIONS
❏ FUTURE WORK
INTRODUCTION (1/3)
The Web: nearly 46 trillion
Web pages indexed by Google
VS
Linked Data: approx. 1000
datasets & 100 billion
statements
● different order of
magnitude w.r.t. scale &
dynamics
Are there other semantics (structured facts) on the Web?
INTRODUCTION (2/3)
● Web pages embed structured data
(microdata, microformats and RDFa)
○ Interpretation of web documents
(search & retrieval)
● Increase in prevalence of embedded
markup (2014 Google study of 12 bn
pages estimates an adoption of 26%)
● “Web Data Commons” (Meusel et al.
[ISWC’14])
○ Markup from Common Crawl (2.2 bn
pages)
○ 17 billion RDF quads
○ Markup in 26% of pages, 14% of PLDs
in 2013 (increase from 6% in 2011)
Other semantics
(structured facts) on
the Web!
INTRODUCTION (3/3)
Characteristics of Markup Data
MOTIVATION
● Embedded markup ⇒ sparsely
linked, large % of coreferences,
redundant statements
● Uptake and reuse of embedded
markup is hindered by the lack
of dynamics, scale
● Lack of understanding of the
adoption of markup for
scholarly resource metadata
WHAT WE BRING TO THE TABLE ...
● Study of scholarly data
extracted from embedded
annotations (Web Data
Commons)
● Shape & characteristics of
entity descriptions
● Level of adoption of terms
& types, distributions
across TLDs, PLDs, data
publishers
RESEARCH QUESTIONS
RQ1 What are frequently used
terms & types for scholarly data?
RQ2 How are statements about
bibliographic data distributed
across the web? Who are the key
providers of bibliographic markup?
RQ3 What are the frequent errors
that can be observed?
DATASET
● Web Data Commons (WDC) 2014 dataset
● Subset ⇒ all statements describing entities
of type s:ScholarlyArticle or co-occurring
on the same document with any
s:ScholarlyArticle instance
○ 6,793,764 quads
○ 1,184,623 entities
○ 83 distinct classes
○ 429 distinct predicates
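The subset selection described above can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: it assumes the quads are N-Quads lines whose fourth element is the URL of the source page, that `s:` abbreviates `http://schema.org/`, and it uses a simplified regular expression rather than a full N-Quads parser. The names `parse_quad`, `scholarly_subset` and `SAMPLE` are hypothetical.

```python
import re

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SCHOLARLY_ARTICLE = "http://schema.org/ScholarlyArticle"  # assumed expansion of s:ScholarlyArticle

# Simplified N-Quads pattern (assumption: one quad per line, graph = page URL).
QUAD_RE = re.compile(
    r'^(<[^>]*>|_:\S+)\s+'   # subject: IRI or blank node
    r'<([^>]*)>\s+'          # predicate IRI
    r'(.+?)\s+'              # object: IRI, blank node or literal
    r'<([^>]*)>\s*\.\s*$'    # graph IRI, then the closing dot
)

def parse_quad(line):
    """Return (subject, predicate, object, page) or None if the line does not match."""
    m = QUAD_RE.match(line.strip())
    return m.groups() if m else None

def scholarly_subset(lines):
    """Keep every quad found on a page that contains at least one ScholarlyArticle."""
    quads = [q for q in map(parse_quad, lines) if q is not None]
    target = f"<{SCHOLARLY_ARTICLE}>"
    pages = {g for s, p, o, g in quads if p == RDF_TYPE and o == target}
    return [q for q in quads if q[3] in pages]

# Made-up sample data for illustration only.
SAMPLE = [
    f'_:a1 <{RDF_TYPE}> <http://schema.org/ScholarlyArticle> <http://example.org/page1> .',
    '_:a1 <http://schema.org/name> "A title"@en <http://example.org/page1> .',
    f'_:p1 <{RDF_TYPE}> <http://schema.org/Person> <http://example.org/page1> .',
    f'_:p2 <{RDF_TYPE}> <http://schema.org/Person> <http://example.org/page2> .',
]

subset = scholarly_subset(SAMPLE)
```

In the sample, the person on `page1` is kept because it co-occurs with an article, while the person on `page2` is dropped.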
DATASET - Considerations
● s:ScholarlyArticle is the only type which
explicitly refers to scholarly articles
● We focus on schema.org, the most
widely used schema
● Types considered ⇒ s:ScholarlyArticle,
s:Person and s:Organization
○ 280,616 instances (s:ScholarlyArticle)
○ 847,417 instances (s:Person)
○ 3,798 instances (s:Organization)
SCHOLARLY TYPES & PREDICATES (1/2)
Cumulative dist. of predicates over instances across
extracted types
[Figure annotations: 1 to 14; 1 to 9; 1 to 4]
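A cumulative distribution of this kind could be computed along the following lines. This is a sketch under assumptions: the input is a list of (instance, predicate) pairs for one type, and "predicates over instances" is read as the number of distinct predicates per instance. The function names and the sample data are hypothetical.

```python
from collections import Counter, defaultdict

def predicates_per_instance(pairs):
    """Map each instance to its number of distinct predicates."""
    preds = defaultdict(set)
    for subject, predicate in pairs:
        preds[subject].add(predicate)
    return {s: len(p) for s, p in preds.items()}

def cumulative_distribution(counts):
    """Return (n, fraction of instances with at most n distinct predicates)."""
    hist = Counter(counts.values())
    total = sum(hist.values())
    running, cdf = 0, []
    for n in sorted(hist):
        running += hist[n]
        cdf.append((n, running / total))
    return cdf

# Made-up pairs for illustration only.
pairs = [
    ("a", "name"), ("a", "author"), ("a", "name"),
    ("b", "name"),
    ("c", "name"), ("c", "author"), ("c", "datePublished"),
]
cdf = cumulative_distribution(predicates_per_instance(pairs))
```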
SCHOLARLY TYPES & PREDICATES (2/2)
Top-10 Predicates for s:ScholarlyArticle
DOMAINS & DOCUMENTS (1/5)
Distribution of Entities & Statements across PLDs
DOMAINS & DOCUMENTS (2/5)
Top-10 PLDs (ranked by no. of entities)
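A ranking like this could be produced roughly as shown below. This is only a sketch: the pay-level domain (PLD) is approximated by the last two labels of the host name, which is wrong for suffixes such as `co.uk`; a real analysis would use the Public Suffix List. The names `naive_pld` and `top_plds` and the sample pairs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlsplit

def naive_pld(url):
    """Approximate the PLD by the last two host labels (assumption, see above)."""
    host = urlsplit(url).hostname or ""
    labels = host.split(".")
    return ".".join(labels[-2:]) if len(labels) >= 2 else host

def top_plds(entity_pages, k=10):
    """Rank PLDs by number of distinct (entity, page) pairs."""
    counts = Counter()
    for entity_id, page_url in set(entity_pages):
        counts[naive_pld(page_url)] += 1
    return counts.most_common(k)

# Made-up pairs for illustration only.
pairs = [
    ("_:a", "http://www.example.org/x"),
    ("_:b", "http://example.org/y"),
    ("_:a", "http://www.example.org/x"),   # duplicate, counted once
    ("_:c", "https://journal.example.com/z"),
]
ranking = top_plds(pairs)
```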
DOMAINS & DOCUMENTS (3/5)
Distribution of Entities & Statements across TLDs
DOMAINS & DOCUMENTS (4/5)
Distribution of Entities & Statements across HTML
Documents
DOMAINS & DOCUMENTS (5/5)
Top-10 Documents Ranked According to
Embedded Entities
TOPICS & PUBLICATION TYPES (1/4)
Distribution of Scholarly Articles across Publishers
TOPICS & PUBLICATION TYPES (2/4)
Top-10 Publishers and corresponding no. of
Publications
TOPICS & PUBLICATION TYPES (3/4)
Top-10 Publication Types (genres) across WDC
TOPICS & PUBLICATION TYPES (4/4)
Top-10 Article Titles (ranked by frequency of occurrence)
FREQUENT ERRORS - Schema Violations
Top-10 Misused Predicates
CONCLUSIONS (1/2)
● First study on coverage & char. of
bibliographic metadata embedded
in web pages.
● Early adopters ⇒ publishers,
libraries, other providers of
bibliographic data.
● Usage of terms, types ⇒ dist.
across providers, domains and
topics follows a power law; few
providers & documents
contributing to majority of data.
CONCLUSIONS (2/2)
● Top-k genres & publishers indicate a
bias towards French and English data
providers.
● Article titles, PLDs & publishers ⇒
bias towards Computer Science and
Life Sciences.
● In this study we only consider entities
tagged explicitly as "scholarlyArticle";
a deeper analysis considering more
types (article, book, etc.) and other
creative works can shed light on the
true scale and potential of
embedded markup data.
FUTURE WORK
● Targeted crawl of typical
providers of scholarly data
(publishers, academic
orgs., libraries, etc.)
● Consider implicitly typed
bibliographic or creative
work as scholarly data
Contact Details :
gadiraju@l3s.de
http://www.L3S.de
LIMITATIONS
● Our study is limited to
schema.org & the types
s:ScholarlyArticle,
s:Person, s:Organization.
● We consider only explicitly
linked scholarly works.

BÀI TẬP BỔ TRỢ TIẾNG ANH GLOBAL SUCCESS LỚP 3 - CẢ NĂM (CÓ FILE NGHE VÀ ĐÁP Á...
 

Analysing Structured Scholarly Data Embedded in Web Pages

  • 1. Analysing Structured Scholarly Data Embedded in Web Pages Pracheta Sahoo, Ujwal Gadiraju, Ran Yu, Sriparna Saha and Stefan Dietze WWW 2016 April 11th, 2016 Montreal, Canada
  • 2. OVERVIEW ❏ INTRODUCTION ❏ MOTIVATION ❏ RESEARCH QUESTIONS ❏ ANALYSES ❏ CONCLUSIONS ❏ FUTURE WORK
  • 3. INTRODUCTION (1/3) The Web: nearly 46 trillion Web pages indexed by Google VS Linked Data: approx. 1000 datasets & 100 billion statements ● different order of magnitude w.r.t. scale & dynamics Are there other semantics (structured facts) on the Web?
  • 4. INTRODUCTION (2/3) ● Web pages embed structured data (microdata, microformats and RDFa) ○ Interpretation of web documents (search & retrieval) ● Increase in prevalence of embedded markup (2014 Google study of 12 bn pages estimates an adoption of 26%) ● “Web Data Commons” (Meusel et al. [ISWC’14]) ○ Markup from Common Crawl (2.2 bn pages) ○ 17 billion RDF quads ○ Markup in 26% of pages, 14% of PLDs in 2013 (increase from 6% in 2011)
  • 7. MOTIVATION ● Embedded markup ⇒ sparsely linked, large % of coreferences, redundant statements ● Uptake and reuse of embedded markup is hindered by the lack of dynamics, scale ● Lack of understanding of the adoption of markup for scholarly resource metadata
  • 8. WHAT WE BRING TO THE TABLE ... ● Study of scholarly data extracted from embedded annotations (Web Data Commons) ● Shape & characteristics of entity descriptions ● Level of adoption of terms & types, distributions across TLDs, PLDs, data publishers
  • 9. RESEARCH QUESTIONS RQ1 What are frequently used terms & types for scholarly data? RQ2 How are statements about bibliographic data distributed across the web? Who are the key providers of bibliographic markup? RQ3 What are the frequent errors that can be observed?
  • 10. DATASET ● Web Data Commons (WDC) 2014 dataset ● Subset ⇒ all statements describing entities of type s:ScholarlyArticle or co-occurring on the same document with any s:ScholarlyArticle instance ○ 6,793,764 quads ○ 1,184,623 entities ○ 83 distinct classes ○ 429 distinct predicates
  • 11. DATASET - Considerations ● s:ScholarlyArticle is the only type which explicitly refers to scholarly articles ● We focus on schema.org, the most widely used schema ● Types considered ⇒ s:ScholarlyArticle, s:Person and s:Organization ○ 280,616 instances (s:ScholarlyArticle) ○ 847,417 instances (s:Person) ○ 3,798 instances (s:Organization)
  • 12. SCHOLARLY TYPES & PREDICATES (1/2) Cumulative dist. of predicates over instances across extracted types [figure; annotated ranges: 1 to 14, 1 to 9, 1 to 4]
  • 13. SCHOLARLY TYPES & PREDICATES (2/2) Top-10 Predicates for s:ScholarlyArticle
  • 14. DOMAINS & DOCUMENTS (1/5) Distribution of Entities & Statements across PLDs
  • 15. DOMAINS & DOCUMENTS (2/5) Top-10 PLDs (ranked by no. of entities)
  • 16. DOMAINS & DOCUMENTS (3/5) Distribution of Entities & Statements across TLDs
  • 17. DOMAINS & DOCUMENTS (4/5) Distribution of Entities & Statements across HTML Documents
  • 18. DOMAINS & DOCUMENTS (5/5) Top-10 Documents Ranked According to Embedded Entities
  • 19. TOPICS & PUBLICATION TYPES (1/4) Distribution of Scholarly Articles across Publishers
  • 20. TOPICS & PUBLICATION TYPES (2/4) Top-10 Publishers and corresponding no. of Publications
  • 21. TOPICS & PUBLICATION TYPES (3/4) Top-10 Publication Types (genres) across WDC
  • 22. TOPICS & PUBLICATION TYPES (4/4) Top-10 Article Titles (ranked by frequency of occurrence)
  • 23. FREQUENT ERRORS - Schema Violations Top-10 Misused Predicates
  • 24. CONCLUSIONS (1/2) ● First study on coverage & characteristics of bibliographic metadata embedded in web pages. ● Early adopters ⇒ publishers, libraries, other providers of bibliographic data. ● Usage of terms, types ⇒ distribution across providers, domains and topics follows a power law; a few providers & documents contribute the majority of data.
  • 25. CONCLUSIONS (2/2) ● Top-k genres & publishers indicate a bias towards French, English data providers. ● Article titles, PLDs & publishers ⇒ bias towards Computer Science and Life Sciences. ● In this study we only consider entities tagged explicitly as s:ScholarlyArticle; a deeper analysis considering more types (article, book, etc.) and other creative works can shed light on the true scale and potential of embedded markup data.
  • 26. FUTURE WORK ● Targeted crawl of typical providers of scholarly data (publishers, academic orgs., libraries, etc.) ● Consider implicitly typed bibliographic or creative work as scholarly data
  • 28. LIMITATIONS ● Our study is limited to schema.org & the types s:ScholarlyArticle, s:Person, s:Organization. ● We consider only explicitly linked scholarly works.
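
The subset selection described on the DATASET slide (all statements on documents that contain at least one s:ScholarlyArticle instance) could be sketched roughly as follows. This is an illustrative sketch, not the authors' pipeline: it assumes the data is available as N-Quads lines whose fourth element identifies the source document, that `s:` expands to `http://schema.org/`, and it uses a deliberately naive line splitter instead of a full N-Quads parser. The function names and the sample data are hypothetical.

```python
# Assumed IRIs; the slides only use the "s:" prefix.
RDF_TYPE = "<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>"
SCHOLARLY_ARTICLE = "<http://schema.org/ScholarlyArticle>"


def parse_quad(line):
    """Naively split one N-Quads line into (subject, predicate, object, graph).

    Assumes whitespace-free subject, predicate and graph terms; the object
    may be a literal containing spaces.
    """
    body = line.strip()
    if body.endswith("."):
        body = body[:-1].rstrip()  # drop the statement terminator
    subject, predicate, rest = body.split(None, 2)
    obj, graph = rest.rsplit(None, 1)
    return subject, predicate, obj, graph


def scholarly_subset(lines):
    """Keep every quad from documents that contain a ScholarlyArticle."""
    quads = [parse_quad(line) for line in lines if line.strip()]
    # Documents (graphs) with at least one entity typed s:ScholarlyArticle.
    docs = {g for _, p, o, g in quads if p == RDF_TYPE and o == SCHOLARLY_ARTICLE}
    return [q for q in quads if q[3] in docs]


# Hypothetical sample: two documents, only the first has a scholarly article.
sample = [
    f"_:a {RDF_TYPE} {SCHOLARLY_ARTICLE} <http://example.org/p1> .",
    '_:a <http://schema.org/name> "A title with spaces" <http://example.org/p1> .',
    f"_:b {RDF_TYPE} <http://schema.org/Person> <http://example.org/p1> .",
    f"_:c {RDF_TYPE} <http://schema.org/Person> <http://example.org/p2> .",
]

subset = scholarly_subset(sample)
entities = {s for s, _, _, _ in subset}
print(len(subset), "quads,", len(entities), "entities")
```

Holding all quads in memory is fine for a toy sample; at the scale reported on the slide (millions of quads), a two-pass streaming variant that first collects the qualifying documents would be the more realistic choice.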