Hi, I’m here to talk about AMI, a data extraction framework and tool. First, I want to highlight some of the key contributors to the project: Andy for his work on the ChemistryVisitor and Peter for the overall architecture.
In this talk, I’m going to stress the importance of holding data in a specific format and its usefulness for automated machine processing. Then I’m going to demonstrate AMI’s architecture and the transformation of data as it flows through the process. I’m going to dwell a little on a core format, Scalable Vector Graphics (SVG), before introducing the concept of visitors, which are pluggable, context-specific data extractors. Next, I’m going to introduce Andy’s ChemVisitor, for extracting semantic chemistry data, along with a few other visitors that process data not specific to chemistry. Finally, I will demonstrate some uses of the ChemVisitor in validation and metabolism.
Principles and practice of Open Science
ContentMine.org, and University of Cambridge
OpenCon 2015, Bologna, IT, 2015-11-18
What is “Open”?
Why is it essential?
Content Mining – a battle we must win
Young researchers are the present (Mike Eisen)
The Right to Read is the Right to Mine (Peter Murray-Rust, 2011)
My European Heroes
• The system is completely broken
• We are at war with major publishers
• Students have the power to change the world
• Universities need help from students
• Open is a state of mind
• The opposite of Open is broken 
• Friction destroys Open
• Don’t buy it, build it …
• … TOGETHER
 (John Wilbanks)
@Senficon (Julia Reda): Text & Data mining in times of
"Elsevier stopped me doing my research"
er-stopped-me-doing-my-research/ … #opencon #TDM
Elsevier stopped me doing my research
I am a statistician interested in detecting potentially problematic research such as data fabrication,
which results in unreliable findings and can harm policy-making, confound funding decisions, and
hamper research progress.
To this end, I am content mining results reported in the psychology literature. Content mining the
literature is a valuable avenue of investigating research questions with innovative methods. For
example, our research group has written an automated program to mine research papers for errors in
the reported results and found that 1/8 papers (of 30,000) contains at least one result that could
directly influence the substantive conclusion (Nuijten et al., 2015).
In new research, I am trying to extract test results, figures, tables, and other information reported in
papers throughout the majority of the psychology literature. As such, I need the research papers
published in psychology that I can mine for these data. To this end, I started ‘bulk’ downloading research
papers from, for instance, Sciencedirect. I was doing this for scholarly purposes and took into account
potential server load by limiting the amount of papers I downloaded per minute to 9. I had no intention
to redistribute the downloaded materials, had legal access to them because my university pays a
subscription, and I only wanted to extract facts from these papers.
Full disclosure, I downloaded approximately 30 GB of data from Sciencedirect in approximately 10 days.
This boils down to a server load of 0.0021 GB/min, 0.125 GB/h, 3 GB/day.
Approximately two weeks after I started downloading psychology research papers, Elsevier notified
my university that this was a violation of the access contract, that this could be considered stealing of
content, and that they wanted it to stop. My librarian explicitly instructed me to stop downloading
(which I did immediately), otherwise Elsevier would cut all access to Sciencedirect for my university.
I am now not able to mine a substantial part of the literature, and because of this Elsevier is directly
hampering me in my research.
 Nuijten, M. B., Hartgerink, C. H. J., van Assen, M. A. L. M., Epskamp, S., & Wicherts, J. M. (2015). The
prevalence of statistical reporting errors in psychology (1985–2013). Behavior Research Methods, 1–22.
Chris Hartgerink’s blog post
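The server-load figures quoted in the blog post follow from simple division. A minimal Python sketch, assuming the 30 GB was spread evenly over the 10 days (the function name `download_rates` is hypothetical, introduced only for illustration):

```python
def download_rates(total_gb: float, days: float) -> dict:
    """Average download rate, assuming an even spread over the period."""
    per_day = total_gb / days        # GB per day
    per_hour = per_day / 24          # GB per hour
    per_minute = per_hour / 60       # GB per minute
    return {"GB/day": per_day, "GB/h": per_hour, "GB/min": per_minute}


# Figures from the blog post: roughly 30 GB in roughly 10 days.
rates = download_rates(30, 10)
for unit, value in rates.items():
    print(f"{unit}: {value:.4f}")
```

Rounded to four decimal places, the results agree with the numbers given in the post.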
Typical chemical synthesis
Open Content Mining of FACTs
Machines can interpret chemical reactions
We have done 500,000 patents. There are >
3,000,000 reactions/year. Added value > 1B Eur.
C) What’s the problem with this spectrum?
Org. Lett., 2011, 13 (15), pp 4084–4087
Original thanks to ChemBark
After AMI2 processing…
… AMI2 has detected a square
CORE, HAL,
PeerJ… Nature, IEEE,
30,000 pages/day
CONTENTMINE Complete OPEN Platform for Mining Scientific Literature
Stand back! I am about to do
• Erriquez Daniela, final examination: Bologna, April 2014
• Dr. Elena Fiorentini, no. 0000274966, doctoral thesis, Bologna
• Qian Gou, final examination: Bologna, 2014
• Maurizio BARONTINI, UNIVERSITÀ DEGLI STUDI DELLA TUSCIA DI VITERBO
• Terracciano Mario, final examination, year 2014
Refs: Erriquez_Daniela_tesi, Fiorentina_Elena_tesi, Gou_Qian_Tesi, mbarontini_tesid, terracciano_maria_tesi
BagOfWords for Italian Theses
Copyright and Mining
• UK (“Hargreaves”) 2014 legislation:
– “personal” “non-commercial*” “research” “data
– legitimizes copying (?to disk), but not publishing
• PMR-premise: You cannot do reproducible
scientific mining and avoid violating copyright.
Massive political activity in Europe
Elsevier wants to control Open Data
[asked by Michelle Brook]
Scholarly infrastructure becomes closed
No accountability for monitoring and control
We were stunned recently when we stumbled across an article by European
researchers in Annals of Virology: “The results seem to indicate that
Liberia has to be included in the Ebola virus endemic zone.” In the future,
the authors asserted, “medical personnel in Liberian health centers should be
aware of the possibility that they may come across active cases and thus be
prepared to avoid nosocomial epidemics,” referring to hospital-acquired
Adage in public health: “The road to inaction is paved with research
Bernice Dahn (chief medical officer of Liberia’s Ministry of Health)
Vera Mussah (director of county health services)
Cameron Nutt (Ebola response adviser to Partners in Health)
A System Failure of Scholarly Publishing
 The Military-Industrial-Academic complex (1961)
(Dwight D Eisenhower, US President)
The Publisher-Academic complex
[Wikipedia:] On the steps of Sproul Hall [Student] Mario Savio gave a
... But we're a bunch of raw materials that don't mean … to end up being
bought by some clients of the University, be they the government, be they
industry, be they organized labor, be they anyone! We're human beings!
... There's a time when the operation of the machine becomes so odious
— makes you so sick at heart — that you can't take part. You can't even
passively take part. And you've got to put your bodies upon the gears and
upon the wheels, upon the levers, upon all the apparatus, and you've got
to make it stop. And you've got to indicate to the people who run it, to the
people who own it, that unless you're free, the machine will be prevented
from working at all. 
The Free Speech Movement
student occupations and sit-ins
University of Stirling
Used without permission but with thanks and Love
Liverpool, Warwick, Emmanuel Coll Camb., UCL, Glasgow, Middlesex, …
“How We Stopped SOPA”:
This bill ... shut down whole websites. Essentially, it stopped Americans from
communicating entirely with certain groups....
I called all my friends, and we stayed up all night setting up a website for this new group,
Demand Progress, with an online petition opposing this noxious bill.... We [got] ... 300,000
signers.... We met with the staff of members of Congress and pleaded with them.... And then
it passed unanimously....
And then, suddenly, the process stopped. Senator Ron Wyden ... put a hold on the
He added, "We won this fight because everyone made themselves the hero of their own
story. Everyone took it as their job to save this crucial freedom.”
Robert Swartz: "Aaron was killed by the government, and MIT betrayed all of its basic
Rules for Revolutionaries
• Be publicly clear about your public aims.
• Gather whole-hearted allies.
• Choose your moment/s carefully.
• Be prominent – blogs, talks, papers.
• Be bold – and probably brave.
• Write Liberation Software.
• Create slogans, warcries, mantras.
Take the fight to publishers. Hold them accountable for the near-
criminal business models they operate on, and the stranglehold they
have had on academia for too long.
Extending this, I need your help. I want to know if we initiate a formal
investigation into the practices of publishers, in terms of the fact that
they operate within an unregulated market and enjoy enormous
profits to commit immoral acts (creating knowledge inequality). …. I
want to know what we can do, and if such an investigation is even
feasible, and whether or not we have a legal case supporting us.
Don’t sacrifice your career.. [PMR] said it best, that for any revolution
blood will be spilled. If you’re making someone angry, you’re probably
doing it right. But when you’re ‘advocating’ for open access, maintain
one simple rule: don’t be a dick…. (and lots more)
Jon Tennant 2014-11-25
The Right to Read
The Right to Roam
The Right to Mine
Kinder Mass Trespass
used without permission but with love and thanks
How can we achieve Freedom?
• Change the law to allow ContentMining
– Hard, tedious, but necessary
– Requires evidence, campaigning, making yourselves a
pain in the arse…
• Make all outputs Open
– Requires culture change in researchers
– Tools: Open Notebook Science, GitHub, Open source,
– Needs support from funders, learned societies,
Four Freedoms (Richard Stallman)
The freedom to:
0 run the program as you wish, for any purpose
1 study how the program works, and change it
2 redistribute copies
3 distribute copies of your modified program
Most other “Opens” follow these principles, including CC-BY material.
However, “Green Open Access” is incompatible with Freedoms 2 and 3.
The Open Definition
“Open means anyone can freely access, use, modify, and share for
any purpose (subject, at most, to requirements that preserve
provenance and openness).”
… an unprecedented public good. …
… completely free and unrestricted access to [peer-
reviewed literature] by all scientists, scholars, teachers,
students, and other curious minds. …
…Removing access barriers to this literature will
accelerate research, enrich education, share the
learning of the rich with the poor and the poor with
the rich, make this literature as useful as it can be, and
lay the foundation for uniting humanity in a common
intellectual conversation and quest for knowledge.
(Budapest Open Access Initiative, 2002)
Panton Principles for Open Data in
• PUBLISH YOUR DATA OPENLY
• …make an explicit and robust statement of your wishes.
• Use a recognized waiver or license that is appropriate for
• open as defined by the Open Knowledge/Data Definition
(… NOT non-commercial)
• Explicit dedication of data … into the public domain via
PDDL or CCZero
Peter Murray-Rust, Cameron Neylon, Rufus Pollock, John
Bjorn Brembs enhanced by OpenData
This is a response to Dorothy Bishop’s post “Who’s afraid of open data?”.
After we had published a paper on how Drosophila strains that are referred to by the same name in the literature
(Canton S), but came from different laboratories, behaved completely differently in a particular behavioral experiment,
Casey Bergman from Manchester contacted me, asking if we shouldn’t sequence the genomes of these five fly strains
to find out how they differ. So I went and behaviorally tested each of the strains again, extracted the DNA from the 100
individuals I had just tested and sent the material to him. I also published the behavioral data immediately on our
GitHub project page.
Casey then sequenced the strains and made the sequences available, as well. A few weeks later, both Casey and I
were contacted by Nelson Lau at Brandeis, showing us his bioinformatics analyses of our genome data. Importantly,
his analyses weren’t even close to what we had planned. On the contrary, he had looked at something I (not being a
bioinformatician) would have considered orthogonal (Casey may disagree). So there we had a large chunk of work we
would have never done on the data we hadn’t even started analyzing, yet. I was so thrilled! I learned so much from
Nelson’s work, this was fantastic! Nelson even asked us to be co-author, to which I quickly protested and suggested, if
anything, I might be mentioned in the acknowledgments for “technical assistance” – after all, I had only extracted the
However, after some back-and-forth, he persuaded me with the argument that he
wanted to have us as co-authors to set an example. He wanted to show everyone that
sharing data is something that can bring you direct rewards in publications. He
wanted us to be co-authors as a reward for posting our data and as incentive for
others to let go of their fears and also post their data online.
Arguments for Open
• Open Science:
– is Better Science
– can reach and involve everyone
– moves more quickly
– challenges injustice
– helps the world
It also happens to:
– Promote the careers of scientists
– Save money
Jean-Claude Bradley was one of the
most influential open scientists of our
time. He was an innovator in all that
he did, from Open Education to
bleeding edge Open Science; in 2006,
he coined the phrase Open Notebook
Science. His loss is felt deeply by
friends and colleagues around the
On Monday July 14, 2014 we shall
gather at Cambridge University to
honour his memory and the legacy he
leaves behind with a highly
distinguished set of invited speakers to
revisit and build upon the ideas which
inspired and defined his life’s work.
Wikipedia CC BY-SA
Traditional Research and Publication
“Lab” work paper/th
Free/Open Software Development
Example: ContentMine at
Open Notebook Science
Problems are solved communally;
Nothing is needlessly duplicated; “publication” is
Mat Todd (Sydney) and MANY collaborators
Sam Moore, Peter Kraker, Rosie Gray, Sophie Kay
Sophie: 3rd yr Grad students train 1st year students
Sophie Kershaw, Panton Fellow, Training PhD Students
Rotation-Based Learning (RBL)
Phase 1: Initiator
• No communication
permitted between groups
• Attempt to reproduce
• Deliver a coherent research
story by the end of Phase 1
Phase 2: Successor
• Communication between
groups still prohibited
• Validate and develop the
inherited research story
• Critique your predecessors
• Role of research producer vs. research user
• Can this approach help to foster awareness of reproducibility issues?
Throughout Phases 1 & 2:
• Daily lectures on open
science culture & techniques
• First-hand application to own
• Version control using GitHub
• Daily group supervision
“Do you think you would be
more confident in the future
about trying to apply Open
techniques to your work..?”
• 50% Yes, by myself
• 41% Yes, with help/guidance
• 9% No opinion/neutral
• 0% No
of the Digital Enlightenment
• David Carroll & Joe McArthur: OAButton
• Rayna Stamboliyska & Pierre-Carl Langlais
• Jon Tennant
• Ross Mounce
• Jenny Molloy
• Erin McKiernan
• Jack Andraka
• Michelle Brook
• Heather Piwowar
• The ContentMine Team
• Rufus Pollock
• Jonathan Gray
• Sophie Kay
Jean-Claude Bradley, a chemist,
developed Open Notebook Science:
making the entire primary record of a
research project publicly available
online as it is recorded. (WP)
J-C promoted these ideas with
 Unfortunately J-C died in 2014;
we held a memorial meeting in
• Don’t negotiate with walled gardens, make
them change or make them obsolete
• Building on top of non-Open is very fragile,
unpredictable and usually bad engineering
• Many start-ups get acquired and lose their
• “Embrace, extend, exterminate” (Microsoft)
• Consider adding “Open Lock” clauses to
articles of incorporation