Slides from London TensorFlow Meetup on 18th July 2017
Corresponding repositories:
https://github.com/elifesciences/sciencebeam
https://github.com/elifesciences/sciencebeam-gym
1. @eLifeInnovation
Can TensorFlow help read
PDFs?
Changing the culture of peer review
London TensorFlow Meetup - 18/07/2017
Daniel Ecer
Data Scientist
6. @eLifeInnovation
eLife is a non-profit organisation inspired by research funders and led by scientists
elifesciences.org
7. @eLifeInnovation
Current problems in research communication
• Slow from discovery to communication
• Optimised for printing, not online sharing
• Little incentive to share the full story
• Competition over collaboration
• Inaccessible to the public
9. @eLifeInnovation
What do we mean by “responsible behaviours”?
• Sharing of data, tools, and resources
• Objective and comprehensive reporting
• Cooperation and collaboration
• Constructive feedback and encouragement
WHY ELIFE IS IMPORTANT AND WHAT WE DO
13. @eLifeInnovation
Why
● Make preprints accessible:
○ Preprints:
■ complete scientific manuscripts
■ prior to:
● being peer reviewed and
● accepted in a journal
■ deposited online, e.g. arXiv
● Make historic scientific manuscripts accessible
● Make publishing process more efficient:
○ Journals pay third parties to create XML
● Wikidata, etc
16. @eLifeInnovation
What is JATS XML?
● XML
● Journal Article Tag Suite
● Metadata, e.g.:
○ Title
○ Authors
○ Abstract
○ References
● Full text, e.g.:
○ Section paragraphs
○ Formulas
○ Figures
○ Tables
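A minimal sketch of how the metadata and full-text parts listed above sit inside a JATS document, and how they could be read back with Python's standard library. The sample article is invented for illustration and is far smaller than a real JATS file; the element names (article-meta, article-title, contrib, abstract, sec, ref-list) are standard JATS tags, while the function name extract_fields is an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified JATS-style article (illustration only).
JATS_SAMPLE = """
<article>
  <front>
    <article-meta>
      <title-group>
        <article-title>An example title</article-title>
      </title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
      <abstract><p>An example abstract.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
  </body>
  <back>
    <ref-list>
      <ref id="r1"><mixed-citation>Example reference.</mixed-citation></ref>
    </ref-list>
  </back>
</article>
"""


def extract_fields(xml_text):
    """Pull title, authors, abstract, paragraphs and references out of the sample."""
    root = ET.fromstring(xml_text.strip())
    meta = root.find("front/article-meta")
    authors = [
        "{} {}".format(name.findtext("given-names"), name.findtext("surname"))
        for name in meta.findall("contrib-group/contrib/name")
    ]
    return {
        "title": meta.findtext("title-group/article-title"),
        "authors": authors,
        "abstract": "".join(meta.find("abstract").itertext()).strip(),
        "paragraphs": [p.text for p in root.findall("body/sec/p")],
        "references": [
            "".join(ref.itertext()).strip()
            for ref in root.findall("back/ref-list/ref")
        ],
    }


fields = extract_fields(JATS_SAMPLE)
print(fields)
```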
17. @eLifeInnovation
What should the box include?
● PDF to XML Conversion Pipeline:
○ Automatic bulk translation
○ Full text XML (not just metadata)
○ Priority is accuracy over speed
○ Extensible (can add tools to process specific data)
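One way to picture the "extensible" and "bulk" requirements is a pipeline made of pluggable steps. The sketch below is purely illustrative and is not the ScienceBeam implementation: the class ConversionPipeline, the step functions, and the dictionary keys are all hypothetical, and the steps are placeholders that do not read real PDFs or emit real JATS.

```python
from typing import Callable, Dict, Iterable, List

Document = Dict[str, object]
Step = Callable[[Document], Document]


class ConversionPipeline:
    """Runs registered steps in order; each step takes and returns a document dict."""

    def __init__(self) -> None:
        self.steps: List[Step] = []

    def add_step(self, step: Step) -> "ConversionPipeline":
        # Extensibility: new tools are added as further steps.
        self.steps.append(step)
        return self

    def convert(self, pdf_path: str) -> Document:
        doc: Document = {"source": pdf_path}
        for step in self.steps:
            doc = step(doc)
        return doc

    def convert_all(self, pdf_paths: Iterable[str]) -> List[Document]:
        # Bulk conversion: apply the same steps to every input.
        return [self.convert(path) for path in pdf_paths]


def extract_text(doc: Document) -> Document:
    # Placeholder: a real step would read the PDF at doc["source"].
    return {**doc, "text": "placeholder text for " + str(doc["source"])}


def to_xml(doc: Document) -> Document:
    # Placeholder: wraps the text in a single element, not real JATS.
    xml = "<article><body><p>{}</p></body></article>".format(doc["text"])
    return {**doc, "xml": xml}


pipeline = ConversionPipeline().add_step(extract_text).add_step(to_xml)
results = pipeline.convert_all(["a.pdf", "b.pdf"])
```

Keeping each tool behind the same document-in, document-out interface means a specialised step (for example one that only handles tables) can be inserted without touching the others.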
25. @eLifeInnovation
Training data source
● Public training data, e.g.:
○ PMC
○ eLife
● Private training data:
○ Publishers will provide private data to train public model
○ Interested Publishers (pending contract):
■ PLOS
■ Cambridge University Press
■ De Gruyter
■ Taylor and Francis
31. @eLifeInnovation
The Model - U-Net / pix2pix
Original
https://github.com/phillipi/pix2pix
https://phillipi.github.io/pix2pix/
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
TensorFlow port
https://github.com/affinelayer/pix2pix-tensorflow
https://affinelayer.com/pix2pix/
Christopher Hesse
40. @eLifeInnovation
Training details
● 1000 steps
● GPU: K80 (on Google ML Engine)
● ~600 training examples
● Results shown for separate test examples
● Model input: 256x256x3
43. @eLifeInnovation
Using colours for classifications
● Colours mix, e.g. red + blue = purple
● Classes don’t mix, e.g. abstract + title != paragraph (discrete values)
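A small sketch of why this matters. The colour assignments below are hypothetical, chosen so that the blend of two class colours lands on a third class colour; the talk does not state which colours were used. Blending red (title) and blue (abstract) gives purple, which decodes to the wrong class, whereas averaging one-hot class vectors leaves the unrelated class at zero.

```python
# Hypothetical colour-to-class assignment (illustration only).
CLASS_COLOURS = {
    "title": (255, 0, 0),        # red
    "abstract": (0, 0, 255),     # blue
    "paragraph": (128, 0, 128),  # purple
}
CLASSES = list(CLASS_COLOURS)


def nearest_class(rgb):
    """Decode a pixel by picking the class with the closest colour."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(rgb, CLASS_COLOURS[name]))
    return min(CLASSES, key=squared_distance)


def blend(colour_a, colour_b):
    """Average two RGB colours, as a smooth image output might do at a boundary."""
    return tuple((a + b) // 2 for a, b in zip(colour_a, colour_b))


def one_hot(name):
    """Represent a class as a discrete vector, one slot per class."""
    return [1.0 if c == name else 0.0 for c in CLASSES]


def average(vec_a, vec_b):
    return [(a + b) / 2 for a, b in zip(vec_a, vec_b)]


# Colour encoding: title + abstract blends into something that reads as paragraph.
mixed_colour = blend(CLASS_COLOURS["title"], CLASS_COLOURS["abstract"])
decoded = nearest_class(mixed_colour)

# One-hot encoding: the same mix stays split between title and abstract,
# and the paragraph slot remains 0.
mixed_one_hot = average(one_hot("title"), one_hot("abstract"))

print(mixed_colour, decoded, mixed_one_hot)
```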
52. @eLifeInnovation
Project outputs
● Customisable conversion pipeline or tools for it
● Trained public model
● Potentially one aspect of the conversion improved
(using computer vision)
● Tools to generate annotated PDFs,
and potentially a public dataset
● Training pipeline that other people can run
(without access to our private datasets, though)