Specimen-level mining: bringing knowledge back 'home' to the Natural History ... (Ross Mounce)
A talk given at the Geological Society of London, UK on 2016/03/09 as part of the Lyell meeting on Palaeoinformatics. http://www.geolsoc.org.uk/lyell16 #lyell16
Open scholarship [a FOSTER open science talk] (Ross Mounce)
A talk by Dr Ross Mounce, given at the FOSTER Open Science event 4th September, King's College London http://www.fosteropenscience.eu/event/foster-discovering-open-practices-pgr-and-early-career-researchers-0
Liberating facts from the scientific literature - Jisc Digifest 2016 (TheContentMine)
Published on Mar 4, 2016 by PMR
Text and data mining (TDM) techniques can be applied to a wide range of materials, from published research papers, books and theses, to cultural heritage materials, digitised collections, administrative and management reports and documentation, etc. Use cases include academic research, resource discovery and business intelligence.
This workshop will show the value and benefits of TDM techniques and demonstrate how ContentMine aims to liberate 100,000,000 facts from the scientific literature; ContentMine will also provide a hands-on demo on a topical and accessible scientific/medical subject.
Automatic Extraction of Knowledge from the Literature (TheContentMine)
Published on May 11, 2016 by PMR
ContentMine tools (and the Harvest alliance) can be used to search the literature for knowledge, especially in biomedicine. All tools are Open, and shortly we shall be indexing the complete daily scholarly literature.
Published on Jan 29, 2016 by PMR
Keynote talk to LEARN (LERU/H2020 project) on research data management. Emphasizes that the problems are cultural, not technical. Promotes modern approaches such as Git / Continuous Integration, and announces DAT. Asserts that the Right to Read is the Right to Mine. Calls for widespread development of content mining (TDM).
An overview of Text and Data Mining (ContentMining) including live demonstrations. The fundamentals: discover, scrape, normalize, facet/index, analyze, publish are exemplified using the recent Zika outbreak. Mining covers textual and non-textual content, and examples of chemistry and phylogenetic trees are given.
Amanuens.is: Humans and machines annotating scholarly literature (petermurrayrust)
About 10,000 scholarly articles ("papers") are published each day. Amanuens.is is a symbiont of ContentMine and Hypothes.is (both Shuttleworth projects/Fellows) which annotates theses using an array of controlled vocabularies ("dictionaries"). The results, in semantic form, are used to annotate the original material. The talk had live demos and used plant chemistry as the examples.
High throughput mining of the scholarly literature TheContentMine
Published on Jun 7, 2016 by PMR
Talk given to statisticians in Tilburg, with emphasis on scholarly comms for detecting unusual features. Includes demo of Amanuens.is and image mining
Use of ContentMine tools on the Open Access subset of Europe PubMed Central to discover new knowledge about the Zika virus.
Three slides have embedded movies; these do not show in SlideShare, and a first pass can be seen as a single file at https://vimeo.com/154705161
Talk to the EBI Industry group on Open Software for the chemical and pharmaceutical sciences. Covers examples of chemistry, with demos, and argues that all public knowledge should be Openly accessible.
Open Access for Early Career ResearchersRoss Mounce
My talk for the University of Bath Open Access Week session; 23rd October 2013.
http://www.bath.ac.uk/learningandteaching/rdu/courses/pgskills/modules/RP00335.htm
Automatic Extraction of Knowledge from Biomedical literature (petermurrayrust)
A plenary lecture to the Cochrane Collaboration in Birmingham, on the value of automatically extracting knowledge. Covers the Why? How? What? Who?, discusses problems, and invites collaboration.
Talk to OpenForum Academy (Open Forum Europe) about Text and Data Mining. Four use cases selected for non-scientists. Also discussion of the latest on European copyright reform and TDM exceptions.
PLUTo: Phyloinformatic Literature Unlocking Tools
A BBSRC-funded project to find phylogenetic trees in the literature, and make their underlying data re-usable again by extracting it & re-releasing it from the figure image as open, re-usable data
How can repositories support the text-mining of their content and why? Nancy Pontika
Co-presented with Petr Knoth http://www.slideshare.net/petrknoth/ at the "Mining Repositories: How to assist the research and academic community on their text and data mining needs" workshop, which took place at the 11th International Conference on Open Repositories, Monday 13 June 2016.
Subscription costs versus open access costs, & Dissolving journals' boundaries (Alex Holcombe)
draft of talk for Reclaiming the Knowledge Commons http://www.eventbrite.com.au/e/reclaiming-the-knowledge-commons-the-ethics-of-academic-publishing-and-the-futures-of-research-tickets-17560178968
The slides that will accompany my live webcast for OpenCon 2014 attendees, all about open data in research. The benefits, the how to (both legally & technically), examples, pitfalls, and the future of open research data.
SocialCite makes its debut at the HighWire Press meetingKent Anderson
A new service designed to allow readers and researchers to comment on the appropriateness, quality, and type of citations made in the literature made its debut at the HighWire Press Publishers Meeting yesterday.
Open access (OA) to scholarly literature recently hit a major milestone: Half of all research articles published become open access, either immediately or after an embargo period. Are the articles you read among them? What about the articles you write? Are the journals to which you submit open-access friendly? What about the journals for which you peer review? Are there any reasons why the public should not have access to the results of taxpayer-funded research?
In this slideshow, Jill Cirasella (Associate Librarian for Public Services and Scholarly Communication, Graduate Center, CUNY) explains the motivation for OA, describes the details of OA, and differentiates between publishing in open access journals (“gold” OA) and self-archiving works in OA repositories (“green” OA). She also dispels persistent myths about OA and examines some of the challenges to OA.
Fifty shades of green and gold: open access to scholarly information (hierohiero)
Presentation for Urban Research Utrecht, a research school at Utrecht University, on Open Access to scholarly information in geography and planning, focusing on advantages, disadvantages, various forms, costs, and actions of stakeholders.
ContentMining for France and Europe; Lessons from 2 years in UK (petermurrayrust)
I have spent 2 years carrying out Content Mining (aka Text and Data Mining) in the UK under the 2014 "Hargreaves" exception. This talk was given in Paris, to ADBU, after France had passed the law for a Digital Republic (loi pour une République numérique). I illustrate what worked and what did not, and why, and offer ideas to France and Europe.
The ContentMine system (Open Source) can search EuropePMC and download hundreds of articles in seconds. These can be indexed by AMI dictionaries, allowing rapid evaluation and refinement of the search.
Basics of ContentMining presented to Synthetic Biologists. This was followed by a lively discussion of what components could be extracted from the literature
Published on May 18, 2015 by PMR
Basics of ContentMining presented to Synthetic Biologists. This was followed by a lively discussion of what components could be extracted from the literature
Liberating facts from the scientific literature - Jisc Digifest 2016 (Jisc)
The scientific scholarly literature now contains many millions of articles. They contain semi-structured information of high quality and veracity. We show how this resource can be converted to a universal Wikicite format and full-text indexed against Wikidata dictionaries. We now have > 5 million bibliographic records and over 200 dictionaries based on Wikidata properties, queryable by SPARQL.
ContentMine: Open Data and Social MachinesTheContentMine
Published on Nov 13, 2014 by PMR
Scientific information is often hidden or not published properly. The ContentMine is a Social Machine consisting of semantic software and communities of domain expertise; it aims to liberate all scientific facts from the published literature on a daily basis.
The talk, delivered to the Computational Institute, was followed by a hands-on workshop learning how to use the technology and work as a community.
High throughput mining of the scholarly literature; talk at NIH (petermurrayrust)
The scientific and medical literature contains huge amounts of valuable unused information. This talk shows how to discover it, extract, re-use and interpret it. Wikidata is presented as a key new tool and infrastructure. Everyone can become involved. However some of the barriers to use are sociopolitical and these are identified and discussed.
Similar to Museum impact: linking-up specimens with research published on them (20)
My 2 slides for #nfdp13
It was a 5min talk (and I was told strictly no more than 3 slides!)
The future of data sharing involves educating future generations in digital techniques, tools & values e.g. http://www.opensciencetraining.com/
My talk given at the 2nd meeting of the Licences for Europe Stakeholder dialogue meeting in Brussels (8th March, 2013), Working Group 4: Text & Data Mining.
context: http://ec.europa.eu/licences-for-europe-dialogue/en/content/about-site
2. Talk Structure
● Background: the collections, the research literature
● Interesting things you should know about access to research
○ The costs of knowledge $$$
● Examples of content mining
○ Including a video demo!
● My work (in progress) on finding NHM specimens in recent literature
4. ● New
● Open Data
● Easy-to-use
● Quick
● Images
● Audio
● Interactive Maps
● Citable
● API access
● Open Source Infrastructure
It’s not KE Emu :)
5. What I want to do:
link specimen records to their mentions in the literature
“Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
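The linking step this slide describes can be sketched in a few lines: pull candidate register numbers out of article text with a regular expression, then resolve them against a lookup of Data Portal record UUIDs. The `PORTAL_RECORDS` table and the exact regex are illustrative assumptions; real matching would query the Data Portal API and handle many more code variants.

```python
import re

# Hypothetical lookup from specimen codes to NHM Data Portal record UUIDs;
# real matching would query the Data Portal API instead.
PORTAL_RECORDS = {
    "BMNH.76.3.15.14": "69e97f52-0275-4a82-9fa6-cf1c3949f408",
}

# BMNH/NHMUK register numbers as dot- or space-separated digit runs
# (an illustrative pattern, not a complete one).
SPECIMEN_RE = re.compile(r"\b(?:BMNH|NHMUK)[ .]\d+(?:\.\d+)*\b")

def link_specimens(text):
    """Find specimen codes in article text and resolve them to Portal URLs."""
    links = {}
    for code in SPECIMEN_RE.findall(text):
        uuid = PORTAL_RECORDS.get(code)
        if uuid:
            links[code] = "http://data.nhm.ac.uk/specimen/" + uuid
    return links

caption = ("Scans are from the following species: "
           "(A) Pteropus rodricensis (BMNH.76.3.15.14); ...")
print(link_specimens(caption))
```

The hard part in practice is not the lookup but the extraction: register numbers are written inconsistently across journals, so the pattern needs per-collection tuning.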
6. 114,000,000
scholarly papers available online
36,000,000 of which are
‘Biology’ / ‘Environmental Studies’ / ‘Geosciences’ / ‘Multidisciplinary’
Khabsa, M. and Giles, C. L. 2014. The number of scholarly documents on the public web. PLoS ONE
7. Sadly, the vast majority of papers are only ‘available’ online to paying subscribers and
no institution in the world has access to everything. Not even close to everything!
8. Cheryl Hall (2014) FOI request https://www.whatdotheyknow.com/request/academic_journal_subscription_co
We rent access to knowledge. Companies profiteer from it
2004/05 £357,197.79
2005/06 £383,214.29
2006/07 £340,690.33
2007/08 £381,526.57
2008/09 £441,706.36
2009/10 £437,539.71
2010/11 £430,105.08
2011/12 £449,515.12
2012/13 £469,007.50
2013/14 £494,913.01
10-year-total: £4,185,415.76
Tax Year Revenue Profit Profit Margin
2004 £1363m £460m 33.75%
2005 £1436m £449m 31.25%
2006 £1521m £465m 30.57%
2007 £1507m £477m 31.65%
2008 £1700m £568m 33.41%
2009 £1985m £693m 34.91%
2010 £2026m £724m 35.74%
2011 £2058m £768m 37.30%
2012 £2063m £780m 37.81%
2013 £2126m £826m 38.85%
Source: RELX Group (Parent company of Elsevier) Company Reports
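The two tables are easy to sanity-check. A quick sketch reproducing the slide's ten-year subscription total and the 2013 profit margin from the figures above:

```python
# Sanity-check the slide's figures: the NHM's ten-year journal spend
# (FOI data, 2004/05 to 2013/14) and Elsevier's 2013 profit margin.
spend = [357197.79, 383214.29, 340690.33, 381526.57, 441706.36,
         437539.71, 430105.08, 449515.12, 469007.50, 494913.01]
total = round(sum(spend), 2)
print(total)            # ten-year total in GBP

margin_2013 = round(826 / 2126 * 100, 2)  # profit / revenue, as a percentage
print(margin_2013)
```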
9. Actually, the NHM’s annual bill isn’t bad compared to others
Source: Lawson S and Meghreblian B. (2015) Journal subscription
expenditure of UK higher education institutions. F1000Research
http://shiny.retr0.me/journal_costs/
10. Content Mining provides more bang for your buck
Making fuller use of our expensively provisioned access
● If the NHM is going to pay £500,000 per year to rent journals, why not use the access to this resource to its fullest?
● I can’t read everything with my human eyes but… computers can!
● If you can process one document with a computer, you can process a million: content mining
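The "one document, therefore a million" point is mechanical: the same lookup runs unchanged over any number of texts. A toy sketch of dictionary-based mining, where the vocabulary and document snippets are made up for illustration (ContentMine's real pipeline works on normalized XML, not raw strings):

```python
import re
from collections import Counter

# A made-up controlled vocabulary ("dictionary") of terms to look for.
DICTIONARY = {"Pteropus rodricensis", "Zika virus", "Halomonas elongata"}

def mine(documents, vocabulary):
    """Count occurrences of each vocabulary term across all documents."""
    hits = Counter()
    for doc in documents:
        for term in vocabulary:
            hits[term] += len(re.findall(re.escape(term), doc))
    return hits

docs = [
    "We scanned four skulls of Pteropus rodricensis.",
    "Zika virus RNA was detected; Zika virus persisted in serum.",
]
print(mine(docs, DICTIONARY))
```

Scaling from two documents to a million changes only the size of `docs`, which is the whole argument of the slide.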
11. Recent examples of Content Mining
Fig. 6 from the paper: brachiopod body-size estimates
Red line: humans; grey bars: machines (PaleoDeepDive)
Better than PaleoDB? I think so. PDD is more clearly linked to evidence than PDB.
Provenance matters.
12. Recent examples of Content Mining (Images)
3-second
image analysis
source: 10.1099/ijs.0.65212-0
(Zymobacter_palmae:261,((((Chromohalobacter_canadensis:42,
(Chromohalobacter_sarecensis:96,Chromohalobacter_nigrandesensls:154):41):80,
(Chromohalobacter_marismortui:125,Chromohalobacter_beijerinckii:103):164):61,
(Chromohalobacter_israelensis:11,Chromohalobacter_salexigens:11):92):293,
((Halomonas_halodurans:328,(Halomonas_ventosae:100,(Halomonas_pacifica:116,
(Halomonas_halophila:223,(Halomonas_eurihalina:27,Halomonas_elongate:58):236):
79):41):46):72,(Halomonas_desiderata:187,(Halomonas_pantelleriensis:173,
Halomonas_muralis:190):70):30):110):187);
outputs re-usable Newick & NeXML; no manual input required
Can replot data, re-analyse, combine many to make a supertree!
PLUTo Project
Mounce, Murray-Rust, Wills (in prep.)
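Because the output is plain Newick, downstream reuse needs no special tooling. A minimal leaf-name extractor, assuming no phylogenetics library; the example tree reuses a few taxa from the slide's extracted Newick:

```python
import re

def leaf_names(newick):
    """Return leaf taxon names from a Newick string.

    Leaves directly follow '(' or ','; internal nodes only carry
    branch lengths after ')'.
    """
    return re.findall(r"(?<=[(,])\s*([A-Za-z_][\w.]*)\s*:", newick)

# A small tree reusing taxa from the slide's extracted Newick.
tree = ("((Halomonas_pacifica:116,Halomonas_halophila:223):79,"
        "Zymobacter_palmae:261);")
print(leaf_names(tree))
```

For serious re-analysis or supertree work a proper parser (e.g. one handling quoted labels and comments) is needed; this only shows that the extracted string is immediately machine-readable.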
13. How to get a sufficient volume of journal articles?
● The ContentMine (CM) team are actively developing new tools
& training workshops to help researchers get into content
mining: be it text, data, or image mining
● CM are a not-for-profit Shuttleworth-funded project led by
Peter Murray-Rust
● All the software tools are open source and available on github:
https://github.com/ContentMine/
● I’m a Scientific Advisor with the ContentMine
● Try getpapers OR quickscrape to get journal content en masse
https://github.com/ContentMine/getpapers
https://github.com/ContentMine/quickscrape
http://contentmine.org/
14.
15. ● No problem. PMC to the rescue!
● PMC has a full text
Open Access-only subset which
you can download easily for free
● >1,100,000 full texts in XML
(compressed) is just 16.6GB
Want to download more than a million (OA) papers?
Source: Neil Saunders (2014) https://rpubs.com/neilfws/45828
http://www.ncbi.nlm.nih.gov/pmc/tools/openftlist/
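PMC publishes a machine-readable file list alongside the OA subset, so finding the archive for a given article is a simple parse. The sample line and column order below are assumptions in the style of oa_file_list.txt; check the real file's header line before relying on them.

```python
import csv
import io

# A made-up line in the style of PMC's oa_file_list.txt; the real file's
# column order should be verified against its header before use.
SAMPLE = ("oa_package/ab/cd/PMC0000001.tar.gz\t"
          "Example J. 2014; 1(1):1-9\tPMC0000001\tCC BY\n")

def parse_file_list(text):
    """Parse tab-separated file-list lines into dicts (assumed layout)."""
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    return [{"path": r[0], "citation": r[1], "pmcid": r[2], "licence": r[3]}
            for r in reader]

records = parse_file_list(SAMPLE)
print(records[0]["pmcid"])
```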
16. Are there NHM specimens in the PMC OA subset?
PMC is medically focused, so one wouldn’t expect it to be
rich in organismal biology. However, there is some relevant content:
ALL of PLOS ONE is in the PMC OA subset.
Over 100,000 articles in that journal alone!
18. Searching ALL full texts
is not enough!!!
A significant number of specimens
are probably ‘hiding-out’ in
supplementary data files of all sorts
of formats.
Google Scholar does not index SI
Web of Science doesn’t either
Nor does Scopus
At scale, journal-held
supplementary data files are the
‘darkest corners’ of science.
“Specimens were deposited in the collections of the California Academy of Sciences'
Department of Herpetology (CAS), the British Museum of Natural History (BMNH) and of
author GJM (Table S1)” 10.1371/journal.pone.0104628
http://rossmounce.co.uk/2015/06/20/deep-indexing-supplementary-data-files/
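Finding in-text specimen mentions like the quote above boils down to pattern matching over full texts. A minimal sketch, assuming a deliberately rough regular expression for NHM London catalogue numbers (the real study's patterns will be more exhaustive than this illustrative guess):

```python
import re

# Rough pattern for NHM London specimen codes as they appear in article
# text, e.g. "BMNH 37001" or "BMNH 2013.2.13.3". Illustrative only --
# real catalogue-number formats are messier than this.
SPECIMEN = re.compile(r"\b(?:BMNH|NHMUK)[ .]?[A-Z]{0,2}[ .]?\d[\d. ]*\d")

def find_specimens(text: str) -> list[str]:
    """Return all specimen-code matches found in a chunk of full text."""
    return [m.group(0).strip() for m in SPECIMEN.finditer(text)]

para = ("Specimens were deposited in the BMNH: the holotype BMNH 2013.2.13.3 "
        "and the famous BMNH 37001 (Archaeopteryx).")
print(find_specimens(para))
# → ['BMNH 2013.2.13.3', 'BMNH 37001']
```

Running this over every article XML *and* every supplementary file (once converted to text) is what surfaces the specimens 'hiding out' beyond the indexed full text.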
19. I don’t just find in-text mentions.
I’m trying to match them up to our
NHM Data Portal records too!
Specimens in RED do not
appear to be on the Data Portal
...yet
Blue globe represents a PLOS ONE paper
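Matching mined codes against Data Portal records can be automated because the NHM Data Portal runs on CKAN, whose standard `datastore_search` endpoint accepts free-text queries. A sketch of building such a query URL; the `resource_id` below is a placeholder, as the real collection-specimens resource id must be looked up on data.nhm.ac.uk:

```python
from urllib.parse import urlencode

# CKAN's standard datastore_search action (the NHM Data Portal is CKAN-based).
PORTAL = "https://data.nhm.ac.uk/api/3/action/datastore_search"

# Placeholder: substitute the real collection-specimens resource id here.
SPECIMENS_RESOURCE = "0000-placeholder-resource-id"

def portal_query_url(catalogue_number: str) -> str:
    """Build a datastore_search URL filtering on a catalogue number."""
    params = {"resource_id": SPECIMENS_RESOURCE,
              "q": catalogue_number,
              "limit": 5}
    return PORTAL + "?" + urlencode(params)

print(portal_query_url("BMNH 37001"))
```

Codes that return no portal record are exactly the ones shown in RED on the slide: published specimens not yet surfaced in the museum's own catalogue.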
20. Blue globe represents a PLOS ONE paper
Very few specimens occur in more than one paper
Can you guess what BMNH 37001 is? Hint: it’s famous!
Grey represents an NHMUK specimen
21. Mining over 200 subscription-access / non-PMC journals
from 2000 to 2015 inclusive
Nature + Science + PNAS + Phytotaxa + Zootaxa
BioOne Journals (131)
Springer Journals (32)
Wiley Journals (22)
Taylor & Francis Journals (14)
Elsevier Journals (12)
Oxford University Press Journals (8)
SciELO Journals (7) [Open Access but not in PMC]
Ecological Society of America Journals (6)
Geological Society Journals (4)
CSIRO Journals (4)
Cambridge University Press Journals (3)
Royal Society Journals (2)
Journal-omics!
22. Thanks to a recent change in
UK copyright law,
text and data mining for
non-commercial research purposes
is legal (in the UK),
provided that you have
legitimate access to the
resource you want to mine
(e.g. a paid-for institutional
subscription).
http://blogs.lse.ac.uk/impactofsocialsciences/2014/06/04/the-right-to-read-is-the-right-to-mine-tdm/
25. Almost nothing in Nature & Science ‘full (short) text’
Context: 15 years worth of full text research in Nature & Science examined.
Science: only 11 NHM specimens found in ~39,600 texts.
Nature: a similar story; fewer than 30 specimens in 14,132 ‘full’ texts.
Clearly there are more, but it’s all buried in supplementary materials :(
26. Shoving all the research details into non-searchable
supplementary materials is bad for science
● For the avoidance of doubt, this is not a criticism of authors. This is squarely
aimed at journals that artificially restrict the ‘length’ of research articles online.
e.g. Prüfer, K. et al. 2014. The complete genome sequence of a Neanderthal from
the Altai Mountains. Nature 505: 43–49.
7 pages (in print), 12 pages (in PDF, with extra data tables & figures)
The supplementary data file?
249 pages!
27. Someone needs to build a searchable index of
supplementary data. ASAP
29. “Micro-computed tomography scan slice through four bat skulls, displaying the relative
position of the three semicircular canals within the skull. Scans are from the following
species: (A) Pteropus rodricensis (BMNH.76.3.15.14); …”
NHM Data Portal Link (Stable, Unique Identifier)
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
Article DOI (Stable, Unique Identifier)
http://dx.doi.org/10.1371/journal.pone.0061998
Huge potential to go beyond mere linking-up of identifiers.
This specimen & others have been CT scanned in the PLOS ONE paper.
We could do data, media and knowledge ‘repatriation’ back to the museum/portal.
30. Credit: Davies KTJ, Bates PJJ, Maryanto I, Cotton JA, Rossiter SJ (2013)
The Evolution of Bat Vestibular Systems in the Face of Potential
Antagonistic Selection Pressures for Flight and Echolocation. PLoS ONE 8
(4): e61998. doi:10.1371/journal.pone.0061998
Openly-licensed data on specimens, published elsewhere,
could be re-incorporated back into the online museum
catalogue. A one-stop shop for information.
Beyond-linking:
repatriation of knowledge
This is a CT-scan of “BMNH 76.3.15.14”.
Without mining, I wouldn’t know this data exists.
Perhaps it could also be made available on the portal?
http://data.nhm.ac.uk/specimen/69e97f52-0275-4a82-9fa6-cf1c3949f408
31. Does published info make it back ‘home’ to the collections?
BMNH 2013.2.13.3 on the portal as “Petrochromis nov.sp. Takahashi”
I found it (by text mining) here: http://dx.doi.org/10.1007/s10228-014-0396-9
It’s now called Petrochromis horii n. sp., according to the paper.
What mechanisms are there to update newer information back into the collection?
Content mining could definitely help keep collections data up-to-date!
32. Acknowledgements
● Sincere thanks to:
○ The NHM Library staff, particularly Sarah Vincent for actively supporting my content mining
○ Nancy Chillingsworth (IPR, NHM London)
○ Mark Wilkinson (Life Sciences, NHM London)
○ Peter Murray-Rust & the ContentMine team
○ Vince Smith (Life Sciences, NHM London)
○ Ben Scott (NHM Data Portal Lead Architect)
○ Rod Page (University of Glasgow)
○ All of the Biodiversity Informatics team
http://contentmine.org/
33. Please ask me questions!
Feedback appreciated :)
@rmounce
ross.mounce@nhm.ac.uk