1. Content Mining of Science in Cambridge
Peter Murray-Rust,
Dept of Chemistry, University of Cambridge
libraries@cambridge, Cambridge, UK 2016-01-07
What is mining?
Why is it useful?
Open Access and UK “Hargreaves” legislation
How Cambridge can become a world leader
2. The Right to Read is the Right to Mine (Peter Murray-Rust, 2011)
http://contentmine.org
3. Use Cases of ContentMining
• Epidemiology of obesity (Cambridge U)
• (OKF, OpenTrials) Mapping clinical trials
repositories to reports in scientific literature
• Mining chemical reactions from patents
• Creating a bacterial supertree-of-life from
4500 papers
4. Polly has 20 seconds to read this paper…
…and 10,000 more
5. ContentMine software can do this in a few minutes
Polly: “there were 10,000 abstracts and due
to time pressures, we split this between 6
researchers. It took about 2-3 days of work
(working only on this) to get through
~1,600 papers each. So, at a minimum this
equates to 12 days of full-time work (and
would normally be done over several weeks
under normal time pressures).”
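A quick check of the arithmetic in Polly's account, using only the figures she quotes (10,000 abstracts, 6 researchers, 2-3 days each). The variable names are illustrative, not part of any ContentMine tool.

```python
# Figures taken directly from the quote above.
abstracts = 10_000
researchers = 6
days_each = (2, 3)  # lower and upper estimate per researcher

# Share of abstracts per researcher if split evenly.
per_researcher = abstracts / researchers

# Total effort in person-days for the lower and upper estimate.
person_days = tuple(researchers * d for d in days_each)

print(round(per_researcher))  # abstracts per researcher
print(person_days)            # (minimum, maximum) person-days
```

The lower bound matches the "12 days of full-time work" Polly gives as a minimum.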
6. 400,000 Clinical Trials
In 10 government registries
Mapping trials => papers
http://www.trialsjournal.com/content/16/1/80
2009 => 2015. What’s
happened in the last 6 years?
Search the whole scientific literature
for “2009-0100068-41”
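A minimal sketch of how such a search could work once paper texts are available locally. It assumes the texts are already held as plain strings; the function name `find_trial_mentions` and the sample papers are hypothetical and are not ContentMine's own code. The pattern tolerates the dash variants that typesetting often introduces.

```python
import re

def find_trial_mentions(documents, trial_id):
    """Return the keys of documents whose text mentions trial_id.

    documents: dict mapping a paper identifier to its plain text.
    """
    # Split the identifier on hyphens and allow each hyphen to appear as an
    # ASCII hyphen, Unicode hyphen, non-breaking hyphen, or en dash,
    # with optional whitespace around it.
    parts = [re.escape(p) for p in trial_id.split("-")]
    separator = r"\s*[-\u2010\u2011\u2013]\s*"
    # Digit lookarounds stop the identifier matching inside a longer number.
    pattern = re.compile(r"(?<!\d)" + separator.join(parts) + r"(?!\d)")
    return [key for key, text in documents.items() if pattern.search(text)]

# Hypothetical sample texts, for illustration only.
papers = {
    "paper-a": "The study was registered as 2009-0100068-41 before enrolment.",
    "paper-b": "No registry identifier is reported in this article.",
    "paper-c": "Registry ID 2009\u20130100068\u201341 (en dashes from typesetting).",
}
print(find_trial_mentions(papers, "2009-0100068-41"))
```

Run over a full corpus, the same lookup would link each registry entry to the papers that cite it, which is the trials => papers mapping described on this slide.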
11. Open Content Mining of FACTs
Machines can interpret chemical reactions
We have mined 500,000 patents. There are more than
3,000,000 reactions/year. Added value > 1B EUR.
19. Copyright and Mining
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data
analytics”
– legitimizes copying (?to disk), but not publishing
*teaching, textbooks, etc. may be “commercial”
20. STM Publishers prevent Mining
• FUD & disinformation about legality (Elsevier)
• Monopolies on infrastructure (“API”s, CCC
Rightfind)
• Technical obstruction (Wiley Captcha,
Macmillan Readcube)
• Restrictive contracts with libraries (ALL) [1]
• Wasting my/our time (ALL)
[1] [You may not] utilize the TDM Output to enhance … subject repositories
in a way that would [… ] have the potential to substitute and/or replicate
any other existing Elsevier products, services and/or solutions.
21. WILEY … “new security feature… to prevent systematic download of content”
“[limit of] 100 papers per day”
“essential security feature … to protect both parties (sic)”
CAPTCHA
User has to type words
22. ContentMine working with Libraries
• Cambridge: Library, Plant Sciences, Public Health,
Chemistry
• Cochrane Collaboration on Systematic Reviews of
Clinical Trials
• FutureTDM (H2020, LIBER)
• Running workshops and training
• We have dedicated servers running in chemistry
Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I just want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I’m going to stress the importance of data in a specific format and its utility for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format used, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that can process non-chemistry-specific data. Finally, I will demonstrate some uses of the ChemVisitor within the realm of validation and metabolism.
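To make the idea of pluggable visitors concrete, here is a minimal sketch in Python. It is not AMI's actual API: the class names `TextVisitor` and `LineVisitor`, the `run_visitors` helper, and the sample SVG are all hypothetical. It only illustrates the general pattern of walking an SVG document once and letting each visitor pull out the elements it cares about.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"

class TextVisitor:
    """Collects the character content of <text> elements."""
    def __init__(self):
        self.strings = []

    def visit(self, element):
        if element.tag == SVG_NS + "text" and element.text:
            self.strings.append(element.text.strip())

class LineVisitor:
    """Collects <line> elements as ((x1, y1), (x2, y2)) coordinate pairs."""
    def __init__(self):
        self.lines = []

    def visit(self, element):
        if element.tag == SVG_NS + "line":
            # Missing coordinate attributes default to 0, as in SVG itself.
            x1, y1, x2, y2 = (float(element.get(name, "0"))
                              for name in ("x1", "y1", "x2", "y2"))
            self.lines.append(((x1, y1), (x2, y2)))

def run_visitors(svg_source, visitors):
    """Parse the SVG once and offer every element to every visitor."""
    root = ET.fromstring(svg_source)
    for element in root.iter():
        for visitor in visitors:
            visitor.visit(element)
    return visitors

# Hypothetical input: two text labels and one line.
SVG = """<svg xmlns="http://www.w3.org/2000/svg">
  <text x="10" y="20">OH</text>
  <line x1="0" y1="0" x2="30" y2="40"/>
  <text x="50" y="20">N</text>
</svg>"""

texts, lines = run_visitors(SVG, [TextVisitor(), LineVisitor()])
print(texts.strings)
print(lines.lines)
```

Adding a new kind of extraction then means writing one more class with a `visit` method, without touching the traversal code.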