SlideShare a Scribd company logo
1 of 41
Text mining in CORE
       Petr Knoth
   The Open University




         1/41
Outline
•   Introduction of the CORE system
•   Three phases:
    • Metadata and content harvesting
    • Semantic Enrichment
    • Providing services
•   Supporting research in mining databases of scientific
    publications (DiggiCORE)




                                2/41
CORE objectives
• To provide a platform for the delivery of Open Access content
  aggregated from multiple sources and to deliver a wide range of
  services on top of this aggregation.
• A nation-wide aggregation system that will improve the discovery
  of publications stored in British Open Access Repositories (OARs).




                               3/41
CORE functionality




                     4/41
CORE functionality

              Content harvesting, processing




                             5/41
CORE functionality

                            Semantic enrichment




                     6/41
CORE functionality




              Providing services




                              7/41
CORE functionality

              Content harvesting, processing




                             8/41
Growth of items in Open Access repositories




                         9/41
Growth of Open Access repositories




                         10/41
Green Open Access - statistics




                       11/41
Why we need aggregations?
“Each individual repository is of limited value for research: the real
power of Open Access lies in the possibility of connecting and tying
together repositories, which is why we need interoperability. In
order to create a seamless layer of content through connected
repositories from around the world, Open Access relies on
interoperability, the ability for systems to communicate with each
other and pass information back and forth in a usable format.
Interoperability allows us to exploit today's computational power so
that we can aggregate, data mine, create new tools and
services, and generate new knowledge from repository content.’’
                                                   [COAR manifesto]


                                12/41
Aggregation in CORE

•   OAI-PMH metadata harvesting
•   Locating full-text
•   Focused crawling (to locate full-texts)
•   Focused crawling (driven by citation analysis)




                                 13/41
CORE functionality

                             Semantic enrichment




                     14/41
Aggregations need access to content, not just metadata!

• Certain metadata types can be created only at the level of the
  aggregation
• Certain metadata can be changing in time
• Ensuring content:
   • accessibility
   • availability
   • validity
   • quality
   • …



                               15/41
Semantic similarity and duplicates detection
• Cosine similarity calculated on tfidf vectors extracted from full-
  texts




                [Knoth et al, COLING 2010; Knoth et al, IMMM 2011]
                                16/41
Semantic similarity and duplicates detection
• Heuristics to reduce the number of combinations (problem with
  the query length)
• Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011;
  Knoth et al IJC-NLP CLIA 2011]




                               17/41
Information extraction, citation parsing and target recognition

• ParsCIT tool (based on CRF) for extraction of reference sections
• Levensthein distance used for target detection




                               18/41
Text categorisation
• 17 top-level DOAJ classes
  (http://www.doaj.org/doaj?func=browse&uiLanguage=en)
• 1080 examples
• SVM multiclass
• 10 fold cross-validation
• 91.4% accuracy




                           19/41
CORE functionality




              Providing services




                              20/41
Who should be supported by aggregations?

The following users groups (divided according to the level of
abstraction of information they need):
   •   Raw data access.
   •   Transaction information access.
   •   Analytical information access.




                                    21/41
Who should be supported by aggregations?

• The following users groups (divided according to the level of
  abstraction of information they need):
   •   Raw data access. Developers, DLs, DL researchers, companies …
   •   Transaction information access. Researchers, students, life-long learners …
   •   Analytical information access. Funders, government, bussiness intelligence
       …




                                     22/41
Should a single aggregation system support all three user types?


             Can be realised by more than one system
                           providing that
                     the dataset is the same!




                               23/41
CORE applications

 •   CORE Portal
 •   CORE Mobile
 •   CORE Plugin
 •   CORE API
 •   Repository Analytics




                            24/41
Who should be supported by aggregations?

• The following users groups (divided according to the level of
  abstraction of information they need):
   •   Raw data access. Developers, DLs, DL researchers, companies …
   •   Transaction information access. Researchers, students, life-long learners …
   •   Analytical information access. Funders, government, bussiness intelligence
       …



  CORE API                   CORE Portal, CORE
                             Mobile, CORE Plugin     Repository Analytics




                                     25/41
CORE Applications
CORE API – Enables external systems and services to interact with the
CORE repository.


                                                  • Search service
                                                  • Pdf and plain text
                                                    service
                                                  • Similarity service
                                                  • Classification service
                                                  • Citation service




                                  26/41
CORE Applications
CORE Portal – Allows searching and navigating scientific publications
aggregated from Open Access repositories




                                   27/41
Snippets




           28/41
CORE Applications

CORE Mobile – Allows
searching and
navigating scientific
publications aggregated
from Open Access
repositories




                          29/41
CORE Applications
CORE Plugin – A plugin to system that recommendations for related
items.




                                 30/41
CORE Applications
Repository Analytics – is an analytical tool supporting providers of
open access content (in particular repository managers).




                                   31/41
32/41
33/41
CORE statistics
• Content
   • 7M records
   • 230 repositories
   • 402k full-texts
   • 1TB of data
   • 40GB large index
   • 35 million RDF triples in the CORE LOD repository
• Started: February 2011
• Budget: 140k£



                              34/41
Outline
•   Introduction of the CORE system
•   Three phases:
    • Metadata and content harvesting
    • Semantic Enrichment
    • Providing services
•   Supporting research in mining databases of scientific
    publications (DiggiCORE)




                               35/41
objective




Software for exploration and analysis of very large and
fast-growing amounts of research publications stored
across Open Access Repositories (OAR).




                           36/41
DiggiCORE networks




Three networks: (a) semantically related papers,
(b) citation network, (c) author citation network


                          37/41
DiggiCORE objectives

Allow researchers to use this platform to analyse
publications.
Why?
•   To identifying patterns in the behaviour of research
    communities
•   To detect trends in research disciplines
•   To gain new insights into the citation behaviour of researchers
•   To discover features that distinguish papers with high impact



                               38/41
Summary
•   The rapid growth of OA content provides great opportunity for
    text-mining.
•   Aggregations need to aggregate content, not just metadata.
•   Aggregations should serve the needs of different user groups
    including researchers who need access to data. CORE aims to
    support them.
•   We can have many services that are part of the infrastructure,
    but should work with the same data.




                               39/41
Thank you!




             William Wallace
   40/41
41/41

More Related Content

What's hot

Applying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationApplying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationJon W. Dunn
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Roxanne Missingham
 
Introduction to Digital Humanities: Metadata standards and ontologies
Introduction to Digital Humanities: Metadata standards and ontologies Introduction to Digital Humanities: Metadata standards and ontologies
Introduction to Digital Humanities: Metadata standards and ontologies LIBIS
 

What's hot (6)

Applying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual PreservationApplying Repository Systems to Audiovisual Preservation
Applying Repository Systems to Audiovisual Preservation
 
Caplan and York, 'What It Takes To Make It Last: E-Resources Preservation"
Caplan and York, 'What It Takes To Make It Last:  E-Resources Preservation"Caplan and York, 'What It Takes To Make It Last:  E-Resources Preservation"
Caplan and York, 'What It Takes To Make It Last: E-Resources Preservation"
 
NewGenLib 3.0 Presentation
NewGenLib 3.0 PresentationNewGenLib 3.0 Presentation
NewGenLib 3.0 Presentation
 
Caa2015 2 a_gattiglia
Caa2015 2 a_gattigliaCaa2015 2 a_gattiglia
Caa2015 2 a_gattiglia
 
Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012Slides anu talkwebarchivingaug2012
Slides anu talkwebarchivingaug2012
 
Introduction to Digital Humanities: Metadata standards and ontologies
Introduction to Digital Humanities: Metadata standards and ontologies Introduction to Digital Humanities: Metadata standards and ontologies
Introduction to Digital Humanities: Metadata standards and ontologies
 

Viewers also liked

Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publicationspetrknoth
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobilepetrknoth
 
The murder of a student.
The murder of a student.The murder of a student.
The murder of a student.selimkaradag
 
Amicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyAmicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyrachna1122
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositoriespetrknoth
 
Core presentation
Core presentationCore presentation
Core presentationpetrknoth
 
CORE projects family
CORE projects familyCORE projects family
CORE projects familypetrknoth
 
From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...petrknoth
 
Snail 12345
Snail 12345Snail 12345
Snail 12345reblyn1
 
Ali’S Careers Power Point
Ali’S Careers Power PointAli’S Careers Power Point
Ali’S Careers Power Pointguestb4db5a8
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?petrknoth
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...petrknoth
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)petrknoth
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Accesspetrknoth
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluationpetrknoth
 
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-finalZarobiza
 

Viewers also liked (19)

Towards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific PublicationsTowards an Infrastructure for Mining Scientific Publications
Towards an Infrastructure for Mining Scientific Publications
 
All Joke Photos
All Joke PhotosAll Joke Photos
All Joke Photos
 
DEVCSI Core Mobile
DEVCSI Core MobileDEVCSI Core Mobile
DEVCSI Core Mobile
 
The murder of a student.
The murder of a student.The murder of a student.
The murder of a student.
 
Amicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource companyAmicable resources corporate presentation- Human resource company
Amicable resources corporate presentation- Human resource company
 
DiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected RepositoriesDiggiCORE: Digging into Connected Repositories
DiggiCORE: Digging into Connected Repositories
 
Core presentation
Core presentationCore presentation
Core presentation
 
CORE projects family
CORE projects familyCORE projects family
CORE projects family
 
From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...From Open Access Metadata to Open Access Content: Two Principles for Increase...
From Open Access Metadata to Open Access Content: Two Principles for Increase...
 
Snail 12345
Snail 12345Snail 12345
Snail 12345
 
Ali’S Careers Power Point
Ali’S Careers Power PointAli’S Careers Power Point
Ali’S Careers Power Point
 
My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?My repository is being aggregated: a blessing or a curse?
My repository is being aggregated: a blessing or a curse?
 
Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...Aggregating Research papers from Publishers' Systems to Support Text and Data...
Aggregating Research papers from Publishers' Systems to Support Text and Data...
 
FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)FOSTER - Content Delivery (WP3)
FOSTER - Content Delivery (WP3)
 
CORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open AccessCORE: Aggregating and Enriching Content to Support Open Access
CORE: Aggregating and Enriching Content to Support Open Access
 
Suman Pandit
Suman PanditSuman Pandit
Suman Pandit
 
The Clown Doctor
The Clown DoctorThe Clown Doctor
The Clown Doctor
 
Semantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research EvaluationSemantometrics: Towards Fulltext-based Research Evaluation
Semantometrics: Towards Fulltext-based Research Evaluation
 
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final93136540 spider-cloud-small-cell-cluster-case-study-091911-final
93136540 spider-cloud-small-cell-cluster-case-study-091911-final
 

Similar to Text mining in CORE (OR2012)

CORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research AssociateCORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research AssociateThe European Library
 
Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library servicesNikesh Narayanan
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...petrknoth
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...petrknoth
 
Next Generation Repositories
Next Generation RepositoriesNext Generation Repositories
Next Generation Repositoriesukcorr
 
Introducing the Open Discovery Initiative
Introducing the Open Discovery InitiativeIntroducing the Open Discovery Initiative
Introducing the Open Discovery InitiativeNASIG
 
Impact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and EducationImpact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and EducationMANENDRASINGH30
 
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...Open Science Fair
 
Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...Nikesh Narayanan
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)floyd taag
 
-Open Archives Initiatives(final)
-Open Archives Initiatives(final)-Open Archives Initiatives(final)
-Open Archives Initiatives(final)floyd taag
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...Pedro Príncipe
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008Nancy Elkington
 

Similar to Text mining in CORE (OR2012) (20)

CORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research AssociateCORE - Petr Knoth, Research Associate
CORE - Petr Knoth, Research Associate
 
Current and emerging trends in library services
Current and emerging trends in library servicesCurrent and emerging trends in library services
Current and emerging trends in library services
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...
 
Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...Better together: building services for public good on top of content from the...
Better together: building services for public good on top of content from the...
 
Next Generation Repositories
Next Generation RepositoriesNext Generation Repositories
Next Generation Repositories
 
Introducing the Open Discovery Initiative
Introducing the Open Discovery InitiativeIntroducing the Open Discovery Initiative
Introducing the Open Discovery Initiative
 
Breeding, Introducing the Open Discovery Initiative
Breeding, Introducing the Open Discovery InitiativeBreeding, Introducing the Open Discovery Initiative
Breeding, Introducing the Open Discovery Initiative
 
NISO Webinar: Discovery & Delivery: Innovations & Challenges
NISO Webinar: Discovery & Delivery: Innovations & ChallengesNISO Webinar: Discovery & Delivery: Innovations & Challenges
NISO Webinar: Discovery & Delivery: Innovations & Challenges
 
Impact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and EducationImpact of Covid-19 on Learning and Education
Impact of Covid-19 on Learning and Education
 
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
OSFair2017 Workshop | Building a global knowledge commons - ramping up reposi...
 
Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...Implementing web scale discovery services: special reference to Indian Librar...
Implementing web scale discovery services: special reference to Indian Librar...
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)
 
Open archives initiatives(final)
 Open archives initiatives(final) Open archives initiatives(final)
Open archives initiatives(final)
 
-Open Archives Initiatives(final)
-Open Archives Initiatives(final)-Open Archives Initiatives(final)
-Open Archives Initiatives(final)
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
 
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
OpenAIRE infrastructure presentation at the Semantic Services in EOSC worksho...
 
Jyoti singh
Jyoti singhJyoti singh
Jyoti singh
 
Gatways And Portal
Gatways And PortalGatways And Portal
Gatways And Portal
 
Update From OCLC Research May 2008
Update From OCLC Research May 2008Update From OCLC Research May 2008
Update From OCLC Research May 2008
 

More from petrknoth

Qui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingQui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingpetrknoth
 
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in RepositoriesOAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositoriespetrknoth
 
UKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet themUKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet thempetrknoth
 
Enabling Educators to Locate High-Quality Teaching Resources
Enabling Educators to LocateHigh-Quality Teaching ResourcesEnabling Educators to LocateHigh-Quality Teaching Resources
Enabling Educators to Locate High-Quality Teaching Resourcespetrknoth
 
Tracking compliance of the REF2021 policy with the CORE Repository Dashboard
Tracking compliance of the REF2021 policy with the CORE Repository DashboardTracking compliance of the REF2021 policy with the CORE Repository Dashboard
Tracking compliance of the REF2021 policy with the CORE Repository Dashboardpetrknoth
 
CORE Analytics Dashboard
CORE Analytics DashboardCORE Analytics Dashboard
CORE Analytics Dashboardpetrknoth
 
Analysing the performance of open access papers discovery tools
Analysing the performance of open access papers discovery toolsAnalysing the performance of open access papers discovery tools
Analysing the performance of open access papers discovery toolspetrknoth
 
Assessing Compliance with the UK REF 2021 Open Access Policy
Assessing Compliance with the UK REF 2021 Open Access PolicyAssessing Compliance with the UK REF 2021 Open Access Policy
Assessing Compliance with the UK REF 2021 Open Access Policypetrknoth
 
Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)petrknoth
 
Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure petrknoth
 
Towards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriesTowards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriespetrknoth
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...petrknoth
 
Seamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncSeamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncpetrknoth
 

More from petrknoth (14)

Qui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishingQui Bono? Cumulative advantage in open access publishing
Qui Bono? Cumulative advantage in open access publishing
 
CORE APIv3
CORE APIv3CORE APIv3
CORE APIv3
 
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in RepositoriesOAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
OAI Identifiers: Decentralised PIDs for Research Outputs in Repositories
 
UKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet themUKRI OA policy requirements for repositories and how to meet them
UKRI OA policy requirements for repositories and how to meet them
 
Enabling Educators to Locate High-Quality Teaching Resources
Enabling Educators to LocateHigh-Quality Teaching ResourcesEnabling Educators to LocateHigh-Quality Teaching Resources
Enabling Educators to Locate High-Quality Teaching Resources
 
Tracking compliance of the REF2021 policy with the CORE Repository Dashboard
Tracking compliance of the REF2021 policy with the CORE Repository DashboardTracking compliance of the REF2021 policy with the CORE Repository Dashboard
Tracking compliance of the REF2021 policy with the CORE Repository Dashboard
 
CORE Analytics Dashboard
CORE Analytics DashboardCORE Analytics Dashboard
CORE Analytics Dashboard
 
Analysing the performance of open access papers discovery tools
Analysing the performance of open access papers discovery toolsAnalysing the performance of open access papers discovery tools
Analysing the performance of open access papers discovery tools
 
Assessing Compliance with the UK REF 2021 Open Access Policy
Assessing Compliance with the UK REF 2021 Open Access PolicyAssessing Compliance with the UK REF 2021 Open Access Policy
Assessing Compliance with the UK REF 2021 Open Access Policy
 
Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)Data interoperability toolkit (OpenMinTeD)
Data interoperability toolkit (OpenMinTeD)
 
Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure Integrating research indicators for use in the repositories infrastructure
Integrating research indicators for use in the repositories infrastructure
 
Towards effective research recommender systems for repositories
Towards effective research recommender systems for repositoriesTowards effective research recommender systems for repositories
Towards effective research recommender systems for repositories
 
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...COAR Next Generation Repositories WG - Text mining and Recommender system sto...
COAR Next Generation Repositories WG - Text mining and Recommender system sto...
 
Seamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSyncSeamless access to the world’s open access research papers via ResourceSync
Seamless access to the world’s open access research papers via ResourceSync
 

Recently uploaded

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityIES VE
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterMydbops
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality AssuranceInflectra
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsNathaniel Shimoni
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Alkin Tezuysal
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 

Recently uploaded (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Decarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a realityDecarbonising Buildings: Making a net-zero built environment a reality
Decarbonising Buildings: Making a net-zero built environment a reality
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Scale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL RouterScale your database traffic with Read & Write split using MySQL Router
Scale your database traffic with Read & Write split using MySQL Router
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance[Webinar] SpiraTest - Setting New Standards in Quality Assurance
[Webinar] SpiraTest - Setting New Standards in Quality Assurance
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
Unleashing Real-time Insights with ClickHouse_ Navigating the Landscape in 20...
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 

Text mining in CORE (OR2012)

  • 1. Text mining in CORE Petr Knoth The Open University 1/41
  • 2. Outline • Introduction of the CORE system • Three phases: • Metadata and content harvesting • Semantic Enrichment • Providing services • Supporting research in mining databases of scientific publications (DiggiCORE) 2/41
  • 3. CORE objectives • To provide a platform for the delivery of Open Access content aggregated from multiple sources and to deliver a wide range of services on top of this aggregation. • A nation-wide aggregation system that will improve the discovery of publications stored in British Open Access Repositories (OARs). 3/41
  • 5. CORE functionality Content harvesting, processing 5/41
  • 6. CORE functionality Semantic enrichment 6/41
  • 7. CORE functionality Providing services 7/41
  • 8. CORE functionality Content harvesting, processing 8/41
  • 9. Growth of items in Open Access repositories 9/41
  • 10. Growth of Open Access repositories 10/41
  • 11. Green Open Access - statistics 11/41
  • 12. Why we need aggregations? “Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.’’ [COAR manifesto] 12/41
  • 13. Aggregation in CORE • OAI-PMH metadata harvesting • Locating full-text • Focused crawling (to locate full-texts) • Focused crawling (driven by citation analysis) 13/41
  • 14. CORE functionality Semantic enrichment 14/41
  • 15. Aggregations need access to content, not just metadata! • Certain metadata types can be created only at the level of the aggregation • Certain metadata can be changing in time • Ensuring content: • accessibility • availability • validity • quality • … 15/41
  • 16. Semantic similarity and duplicates detection • Cosine similarity calculated on tfidf vectors extracted from full- texts [Knoth et al, COLING 2010; Knoth et al, IMMM 2011] 16/41
  • 17. Semantic similarity and duplicates detection • Heuristics to reduce the number of combinations (problem with the query length) • Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011; Knoth et al IJC-NLP CLIA 2011] 17/41
  • 18. Information extraction, citation parsing and target recognition • ParsCIT tool (based on CRF) for extraction of reference sections • Levensthein distance used for target detection 18/41
  • 19. Text categorisation • 17 top-level DOAJ classes (http://www.doaj.org/doaj?func=browse&uiLanguage=en) • 1080 examples • SVM multiclass • 10 fold cross-validation • 91.4% accuracy 19/41
  • 20. CORE functionality Providing services 20/41
  • 21. Who should be supported by aggregations? The following users groups (divided according to the level of abstraction of information they need): • Raw data access. • Transaction information access. • Analytical information access. 21/41
  • 22. Who should be supported by aggregations? • The following users groups (divided according to the level of abstraction of information they need): • Raw data access. Developers, DLs, DL researchers, companies … • Transaction information access. Researchers, students, life-long learners … • Analytical information access. Funders, government, bussiness intelligence … 22/41
  • 23. Should a single aggregation system support all three user types? Can be realised by more than one system providing that the dataset is the same! 23/41
  • 24. CORE applications • CORE Portal • CORE Mobile • CORE Plugin • CORE API • Repository Analytics 24/41
  • 25. Who should be supported by aggregations? • The following users groups (divided according to the level of abstraction of information they need): • Raw data access. Developers, DLs, DL researchers, companies … • Transaction information access. Researchers, students, life-long learners … • Analytical information access. Funders, government, bussiness intelligence … CORE API CORE Portal, CORE Mobile, CORE Plugin Repository Analytics 25/41
  • 26. CORE Applications CORE API – Enables external systems and services to interact with the CORE repository. • Search service • Pdf and plain text service • Similarity service • Classification service • Citation service 26/41
  • 27. CORE Applications CORE Portal – Allows searching and navigating scientific publications aggregated from Open Access repositories 27/41
  • 28. Snippets 28/41
  • 29. CORE Applications CORE Mobile – Allows searching and navigating scientific publications aggregated from Open Access repositories 29/41
  • 30. CORE Applications CORE Plugin – A plugin to system that recommendations for related items. 30/41
  • 31. CORE Applications Repository Analytics – is an analytical tool supporting providers of open access content (in particular repository managers). 31/41
  • 32. 32/41
  • 33. 33/41
  • 34. CORE statistics • Content • 7M records • 230 repositories • 402k full-texts • 1TB of data • 40GB large index • 35 million RDF triples in the CORE LOD repository • Started: February 2011 • Budget: 140k£ 34/41
  • 35. Outline • Introduction of the CORE system • Three phases: • Metadata and content harvesting • Semantic Enrichment • Providing services • Supporting research in mining databases of scientific publications (DiggiCORE) 35/41
  • 36. objective Software for exploration and analysis of very large and fast-growing amounts of research publications stored across Open Access Repositories (OAR). 36/41
  • 37. DiggiCORE networks Three networks: (a) semantically related papers, (b) citation network, (c) author citation network 37/41
  • 38. DiggiCORE objectives Allow researchers to use this platform to analyse publications. Why? • To identifying patterns in the behaviour of research communities • To detect trends in research disciplines • To gain new insights into the citation behaviour of researchers • To discover features that distinguish papers with high impact 38/41
  • 39. Summary • The rapid growth of OA content provides great opportunity for text-mining. • Aggregations need to aggregate content, not just metadata. • Aggregations should serve the needs of different user groups including researchers who need access to data. CORE aims to support them. • We can have many services that are part of the infrastructure, but should work with the same data. 39/41
  • 40. Thank you! William Wallace 40/41
  • 41. 41/41

Editor's Notes

  1. The idea is to give you an overview of CORE and how it makes use of text-mining not a comprehensive description of one method
  2. Content – story about why I started to think about CORE. CORE is not a cross-repository search engine.Wide range of services (not focused only on people looking fro content) – will explain laterFocusing on British, but becoming international
  3. Ou main focus are British Open Access repositories, but because of the collaboration with Europeana we have to go international
  4. All text mining takes place at this phase
  5. Currently 99% of CORE data through metadata havestingThe combination with other techniques has more potential
  6. All text mining takes place at this phase
  7. The use of content is one of the relatively unique features of CORE
  8. Alternative tools TeamBeam,Mendeley tool
  9. I will give an overview of the system (not a comprehensive description of all text mining services)