Text mining in CORE (OR2012)
 

OR2012 presentation on Text Mining in CORE

  • The idea is to give you an overview of CORE and how it makes use of text mining, not a comprehensive description of one method.
  • Content: the story of why I started to think about CORE. CORE is not a cross-repository search engine. It offers a wide range of services (not focused only on people looking for content); I will explain later. It focuses on British repositories but is becoming international.
  • Our main focus is British Open Access repositories, but because of the collaboration with Europeana we have to go international.
  • All text mining takes place at this phase.
  • Currently 99% of CORE data comes through metadata harvesting. The combination with other techniques has more potential.
  • The use of content is one of the relatively unique features of CORE.
  • Alternative tools: TeamBeam, Mendeley tool.
  • I will give an overview of the system (not a comprehensive description of all text mining services).

Text mining in CORE (OR2012) Presentation Transcript

  • Text mining in CORE Petr Knoth The Open University 1/41
  • Outline
    • Introduction of the CORE system
    • Three phases:
      • Metadata and content harvesting
      • Semantic Enrichment
      • Providing services
    • Supporting research in mining databases of scientific publications (DiggiCORE) 2/41
  • CORE objectives
    • To provide a platform for the delivery of Open Access content aggregated from multiple sources and to deliver a wide range of services on top of this aggregation.
    • A nation-wide aggregation system that will improve the discovery of publications stored in British Open Access Repositories (OARs). 3/41
  • CORE functionality 4/41
  • CORE functionality Content harvesting, processing 5/41
  • CORE functionality Semantic enrichment 6/41
  • CORE functionality Providing services 7/41
  • CORE functionality Content harvesting, processing 8/41
  • Growth of items in Open Access repositories 9/41
  • Growth of Open Access repositories 10/41
  • Green Open Access - statistics 11/41
  • Why do we need aggregations? “Each individual repository is of limited value for research: the real power of Open Access lies in the possibility of connecting and tying together repositories, which is why we need interoperability. In order to create a seamless layer of content through connected repositories from around the world, Open Access relies on interoperability, the ability for systems to communicate with each other and pass information back and forth in a usable format. Interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.” [COAR manifesto] 12/41
  • Aggregation in CORE
    • OAI-PMH metadata harvesting
    • Locating full-text
    • Focused crawling (to locate full-texts)
    • Focused crawling (driven by citation analysis) 13/41
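A minimal sketch of the first aggregation step, OAI-PMH metadata harvesting, assuming the standard `ListRecords` verb with the `oai_dc` metadata prefix. The base URL, the sample response, and the function names are hypothetical and are not taken from CORE; the sketch only builds request URLs and parses a response, and leaves the HTTP fetching and crawling to the caller.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
DC_NS = "{http://purl.org/dc/elements/1.1/}"


def list_records_url(base_url, resumption_token=None, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token:
        # In OAI-PMH, resumptionToken is an exclusive argument.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(params)


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iter(OAI_NS + "record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        title = next((e.text for e in record.iter(DC_NS + "title")), None)
        # dc:identifier values that look like URLs are candidate starting
        # points for locating the full text.
        links = [e.text for e in record.iter(DC_NS + "identifier")
                 if e.text and e.text.startswith("http")]
        records.append({"id": identifier, "title": title, "links": links})
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    return records, (token.strip() if token and token.strip() else None)


# Hypothetical response from a hypothetical repository.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>An example paper</dc:title>
          <dc:identifier>http://example.org/1.pdf</dc:identifier>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>page-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_list_records(SAMPLE)
next_url = list_records_url("http://example.org/oai", resumption_token=token)
```

A harvester would keep requesting `next_url` until no resumption token is returned, then hand the collected links to the full-text location step.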
  • CORE functionality Semantic enrichment 14/41
  • Aggregations need access to content, not just metadata!
    • Certain metadata types can be created only at the level of the aggregation
    • Certain metadata can change over time
    • Ensuring content:
      • accessibility
      • availability
      • validity
      • quality
      • … 15/41
  • Semantic similarity and duplicates detection
    • Cosine similarity calculated on tf-idf vectors extracted from full-texts [Knoth et al, COLING 2010; Knoth et al, IMMM 2011] 16/41
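A minimal sketch of cosine similarity on tf-idf vectors, using only the Python standard library. The weighting variant (relative term frequency times log inverse document frequency), the tokenisation, and the toy documents are assumptions for illustration; the slide does not specify which variant CORE uses.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Map each tokenised document to a sparse {term: weight} tf-idf vector."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({t: (c / len(doc)) * math.log(n / df[t])
                        for t, c in tf.items()})
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is zero)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Toy corpus: the first two documents are exact duplicates.
docs = [
    "open access repositories aggregate research papers".split(),
    "open access repositories aggregate research papers".split(),
    "support vector machines classify text".split(),
]
vectors = tfidf_vectors(docs)
duplicate_score = cosine(vectors[0], vectors[1])
unrelated_score = cosine(vectors[0], vectors[2])
```

Pairs scoring close to 1 are duplicate candidates, while lower but non-zero scores indicate related papers; any cut-off between the two would have to be tuned on real data.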
  • Semantic similarity and duplicates detection
    • Heuristics to reduce the number of combinations (problem with the query length)
    • Cross-language linking tests [Knoth et al, NTCIR-9 CrossLink 2011; Knoth et al, IJC-NLP CLIA 2011] 17/41
  • Information extraction, citation parsing and target recognition
    • ParsCit tool (based on CRF) for extraction of reference sections
    • Levenshtein distance used for target detection 18/41
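A minimal sketch of how Levenshtein distance could be used for target detection, that is, matching a parsed reference to a record in the collection. Matching on titles, the length normalisation, and the `max_ratio` threshold are assumptions for illustration; the slide only states that Levenshtein distance is used.

```python
def levenshtein(a, b):
    """Dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def best_target(reference_title, candidate_titles, max_ratio=0.2):
    """Return the closest candidate title, or None if none is close enough.

    max_ratio is a hypothetical threshold on distance / longer length.
    """
    ref = reference_title.lower().strip()
    best, best_ratio = None, max_ratio
    for title in candidate_titles:
        cand = title.lower().strip()
        ratio = levenshtein(ref, cand) / max(len(ref), len(cand), 1)
        if ratio <= best_ratio:
            best, best_ratio = title, ratio
    return best


titles = ["Text mining in CORE", "Repository analytics"]
match = best_target("Text minig in CORE", titles)      # reference with a typo
no_match = best_target("Something unrelated", titles)
```

Normalising by the longer title keeps the threshold comparable between short and long titles, which tolerates small extraction errors in parsed references without linking unrelated papers.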
  • Text categorisation
    • 17 top-level DOAJ classes (http://www.doaj.org/doaj?func=browse&uiLanguage=en)
    • 1080 examples
    • SVM multiclass
    • 10-fold cross-validation
    • 91.4% accuracy 19/41
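A minimal sketch of the 10-fold cross-validation protocol named on the slide. The multiclass SVM itself is not reproduced here: `train` and `predict` are caller-supplied callables, and the majority-class baseline and toy labels below are hypothetical stand-ins used only to exercise the evaluation loop.

```python
import random
from collections import Counter


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle indices 0..n_items-1 and split them into k near-equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, labels, train, predict, k=10):
    """Mean accuracy over k folds, training on k-1 folds and testing on one."""
    accuracies = []
    for held_out in k_fold_indices(len(examples), k):
        held = set(held_out)
        train_idx = [i for i in range(len(examples)) if i not in held]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        correct = sum(predict(model, examples[i]) == labels[i]
                      for i in held_out)
        accuracies.append(correct / len(held_out))
    return sum(accuracies) / len(accuracies)


# Hypothetical stand-in classifier: always predicts the most common label.
def train_majority(examples, labels):
    return Counter(labels).most_common(1)[0][0]


def predict_majority(model, example):
    return model


# With 1080 examples, each of the 10 folds holds 108 of them.
folds = k_fold_indices(1080)

# Toy data: 90 documents of class "a", 10 of class "b".
toy_labels = ["a"] * 90 + ["b"] * 10
toy_examples = list(range(100))
baseline = cross_validate(toy_examples, toy_labels,
                          train_majority, predict_majority)
```

Comparing a real classifier against such a baseline under the same folds shows how much of the reported accuracy comes from learning rather than from class imbalance.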
  • CORE functionality Providing services 20/41
  • Who should be supported by aggregations?
    • The following user groups (divided according to the level of abstraction of information they need):
      • Raw data access.
      • Transaction information access.
      • Analytical information access. 21/41
  • Who should be supported by aggregations?
    • The following user groups (divided according to the level of abstraction of information they need):
      • Raw data access. Developers, DLs, DL researchers, companies …
      • Transaction information access. Researchers, students, life-long learners …
      • Analytical information access. Funders, government, business intelligence … 22/41
  • Should a single aggregation system support all three user types? This can be realised by more than one system, provided that the dataset is the same! 23/41
  • CORE applications
    • CORE Portal
    • CORE Mobile
    • CORE Plugin
    • CORE API
    • Repository Analytics 24/41
  • Who should be supported by aggregations?
    • The following user groups (divided according to the level of abstraction of information they need):
      • Raw data access. Developers, DLs, DL researchers, companies … Served by: CORE API
      • Transaction information access. Researchers, students, life-long learners … Served by: CORE Portal, CORE Mobile, CORE Plugin
      • Analytical information access. Funders, government, business intelligence … Served by: Repository Analytics 25/41
  • CORE Applications: CORE API. Enables external systems and services to interact with the CORE repository.
    • Search service
    • PDF and plain text service
    • Similarity service
    • Classification service
    • Citation service 26/41
  • CORE Applications: CORE Portal. Allows searching and navigating scientific publications aggregated from Open Access repositories. 27/41
  • Snippets 28/41
  • CORE Applications: CORE Mobile. Allows searching and navigating scientific publications aggregated from Open Access repositories. 29/41
  • CORE Applications: CORE Plugin. A plugin for systems that provides recommendations for related items. 30/41
  • CORE Applications: Repository Analytics. An analytical tool supporting providers of open access content (in particular repository managers). 31/41
  • 32/41
  • 33/41
  • CORE statistics
    • Content
      • 7M records
      • 230 repositories
      • 402k full-texts
      • 1 TB of data
      • 40 GB index
      • 35 million RDF triples in the CORE LOD repository
    • Started: February 2011
    • Budget: £140k 34/41
  • Outline
    • Introduction of the CORE system
    • Three phases:
      • Metadata and content harvesting
      • Semantic Enrichment
      • Providing services
    • Supporting research in mining databases of scientific publications (DiggiCORE) 35/41
  • Objective: software for exploration and analysis of very large and fast-growing amounts of research publications stored across Open Access Repositories (OAR). 36/41
  • DiggiCORE networks. Three networks: (a) semantically related papers, (b) citation network, (c) author citation network 37/41
  • DiggiCORE objectives. Allow researchers to use this platform to analyse publications. Why?
    • To identify patterns in the behaviour of research communities
    • To detect trends in research disciplines
    • To gain new insights into the citation behaviour of researchers
    • To discover features that distinguish papers with high impact 38/41
  • Summary
    • The rapid growth of OA content provides a great opportunity for text mining.
    • Aggregations need to aggregate content, not just metadata.
    • Aggregations should serve the needs of different user groups, including researchers who need access to data. CORE aims to support them.
    • We can have many services that are part of the infrastructure, but they should work with the same data. 39/41
  • Thank you! William Wallace 40/41
  • 41/41