London TensorFlow Meetup - 18th July 2017

@eLifeInnovation
Can TensorFlow help read PDFs?
Changing the culture of peer review
London TensorFlow Meetup - 18/07/2017
Daniel Ecer
Data Scientist

Outline
About eLife
Why
What
Challenges
Training Data
Computer Vision
Summary
Your Ideas
elifesciences.org

Objective
Why it matters
How TensorFlow may be able to help
How you could help

About eLife
@eLife @eLifeInnovation

-- Sir Mark Walport, UK Science Advisor
Founding member, eLife Board of Directors

eLife is a non-profit organisation inspired by research funders and led by scientists

Current problems in research communication
• Slow from discovery to communication
• Optimised for printing, not online sharing
• Little incentive to share the full story
• Competition over collaboration
• Inaccessible to the public

eLife workshop on peer review

What do we mean by “responsible behaviours”?
• Sharing of data, tools, and resources
• Objective and comprehensive reporting
• Cooperation and collaboration
• Constructive feedback and encouragement
WHY ELIFE IS IMPORTANT AND WHAT WE DO

Why
PDF to JATS XML

Why
● Make preprints accessible:
○ Preprints:
■ complete scientific manuscripts
■ prior to:
● being peer reviewed and
● accepted in a journal
■ deposited online, e.g. arXiv
● Make historic scientific manuscripts accessible
● Make publishing process more efficient:
○ Journals pay third parties to create XML
● Wikidata, etc

What
PDF to JATS XML

The Box
[Diagram: PDF → the box → JATS XML]

What is JATS XML?
● XML
● Journal Article Tag Suite
● Metadata, e.g.:
○ Title
○ Authors
○ Abstract
○ References
● Full text, e.g.:
○ Section paragraphs
○ Formulas
○ Figures
○ Tables
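
To make the list above concrete, here is a minimal sketch that parses a hand-written, heavily simplified JATS-like fragment with Python's standard library. The element names (`article-title`, `contrib`, `abstract`, `sec`, `ref-list`) are standard JATS tags, but the sample values and the `read_jats` helper are illustrative assumptions, not part of the project.

```python
import xml.etree.ElementTree as ET

# Hand-written, simplified JATS-like sample (illustrative values only).
JATS_SAMPLE = """<article>
  <front>
    <article-meta>
      <title-group><article-title>An example title</article-title></title-group>
      <contrib-group>
        <contrib contrib-type="author">
          <name><surname>Doe</surname><given-names>Jane</given-names></name>
        </contrib>
      </contrib-group>
      <abstract><p>An example abstract.</p></abstract>
    </article-meta>
  </front>
  <body>
    <sec><title>Introduction</title><p>First paragraph.</p></sec>
  </body>
  <back>
    <ref-list><ref id="r1"><mixed-citation>Example reference.</mixed-citation></ref></ref-list>
  </back>
</article>"""


def read_jats(xml_text):
    """Extract a few metadata and full-text fields (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    authors = [
        "{} {}".format(name.findtext("given-names"), name.findtext("surname"))
        for name in root.iter("name")
    ]
    return {
        "title": root.findtext("front/article-meta/title-group/article-title"),
        "authors": authors,
        "abstract": root.findtext("front/article-meta/abstract/p"),
        "paragraphs": [p.text for p in root.findall("body/sec/p")],
        "references": [ref.findtext("mixed-citation") for ref in root.iter("ref")],
    }


record = read_jats(JATS_SAMPLE)
print(record["title"], record["authors"])
```

Real JATS documents carry many more elements and attributes; the point is only that metadata lives under `front`, full text under `body`, and references under `back`.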

What should the box include?
● PDF to XML Conversion Pipeline:
○ Automatic bulk translation
○ Full text XML (not just metadata)
○ Priority is accuracy over speed
○ Extensible (can add tools to process specific data)

Conversion pipeline

Challenges
PDF to JATS XML

What can reliably be read from a PDF?
● Metadata
● Paragraphs
● Words
● Characters*
● Font glyphs
● Images
● Paths

PDF to XML is difficult...

Training Data
PDF to JATS XML

What training data do we have?
PDF JATS XML

Generate training data from PDF & XML
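
One plausible way to turn a PDF and its matching XML into training data is to tag each line of PDF text with the XML field it most resembles. The sketch below illustrates that idea only; the slide does not describe the actual matching method, and `annotate_lines`, the 0.8 threshold, and the sample strings are all assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two strings (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def annotate_lines(pdf_lines, xml_fields, threshold=0.8):
    """Tag each PDF text line with the best-matching XML field, or None."""
    annotated = []
    for line in pdf_lines:
        best_tag, best_score = None, 0.0
        for tag, value in xml_fields.items():
            score = similarity(line, value)
            if score > best_score:
                best_tag, best_score = tag, score
        # Lines that match nothing well enough stay unlabelled.
        annotated.append((line, best_tag if best_score >= threshold else None))
    return annotated


# Text as it might come out of a PDF (note the stray space before "?").
pdf_lines = ["Can TensorFlow help read PDFs ?", "Daniel Ecer", "Page 1"]
# Field values as they might appear in the matching XML.
xml_fields = {"title": "Can TensorFlow help read PDFs?", "author": "Daniel Ecer"}

annotations = annotate_lines(pdf_lines, xml_fields)
for line, tag in annotations:
    print(tag, "<-", line)
```

Fuzzy matching is used because text extracted from a PDF rarely matches the XML character for character.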

Training data source
● Public training data, e.g.:
○ PMC
○ eLife
● Private training data:
○ Publishers will provide private data to train a public model
○ Interested Publishers (pending contract):
■ PLOS
■ Cambridge University Press
■ De Gruyter
■ Taylor and Francis

Training data pipeline

Computer Vision
PDF to JATS XML

Can you guess...?
[Figure: page image with regions marked by numbered question marks for the audience to guess]

Training data for computer vision
convert

The Objective… learn annotation
learn

The Model - U-Net / pix2pix
Original
https://github.com/phillipi/pix2pix
https://phillipi.github.io/pix2pix/
Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
TensorFlow port
https://github.com/affinelayer/pix2pix-tensorflow
https://affinelayer.com/pix2pix/
Christopher Hesse

Encoder-Decoder
4x4x3 (48) → encode → 1x1x12 (12) → decode → 4x4x3 (48)

Encoder-Decoder w/ multiple en-/decoders
4x4x3 (48) → encode → 2x2x6 (24) → encode → 1x1x12 (12) → decode → 2x2x6 (24) → decode → 4x4x3 (48)

Encoder-Decoder w/ skip connections (U-Net)
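4x4x3 (48) → encode → 2x2x6 (24) → encode → 1x1x12 (12) → decode → 2x2x6 (24) → decode → 4x4x3 (48)
[Diagram: as before, with skip connections from encoder outputs to the decoder steps of matching size]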
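
The toy sizes on these slides can be traced with a little shape arithmetic. This is a sketch under assumptions: each encode step halves height and width and doubles the channels, and a skip connection concatenates the encoder output onto the decoder output along the channel axis (the usual U-Net arrangement). The function names are made up for illustration.

```python
def encode(shape):
    """Toy encoder step: halve height and width, double the channels."""
    h, w, c = shape
    return (h // 2, w // 2, c * 2)


def decode(shape, out_channels):
    """Toy decoder step: double height and width, emit out_channels channels."""
    h, w, _ = shape
    return (h * 2, w * 2, out_channels)


def concat_channels(a, b):
    """Skip connection: stack two same-sized feature maps along the channel axis."""
    assert a[:2] == b[:2], "skip connections need matching height and width"
    return (a[0], a[1], a[2] + b[2])


def num_values(shape):
    h, w, c = shape
    return h * w * c


x0 = (4, 4, 3)   # input, 48 values
e1 = encode(x0)  # (2, 2, 6), 24 values
e2 = encode(e1)  # (1, 1, 12), 12 values: the bottleneck

d1 = decode(e2, out_channels=6)        # (2, 2, 6)
d1_with_skip = concat_channels(d1, e1)  # (2, 2, 12): decoder sees encoder detail too
d0 = decode(d1_with_skip, out_channels=3)  # (4, 4, 3), same size as the input

for name, shape in [("x0", x0), ("e1", e1), ("e2", e2), ("d1+skip", d1_with_skip), ("d0", d0)]:
    print(name, shape, num_values(shape))
```

Without the skip connection, everything the decoder knows has to pass through the 12-value bottleneck; with it, fine spatial detail from `e1` is available directly.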

Generator vs Discriminator (conditional GAN)
[Diagram: Input → Generator → prediction (?); the Discriminator, conditional on the Input, compares the prediction with the Target (annotation) → error]
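
"Conditional" here means the discriminator never judges an annotation on its own: it sees the input page together with either the target or the generated annotation. A minimal sketch, assuming the pairing is done by concatenating the two images along the channel axis (as in the pix2pix TensorFlow port, and consistent with the 6-channel discriminator input listed later). The tiny 1x2 images are made-up values.

```python
def concat_channels(image_a, image_b):
    """Pixel-wise channel concatenation of two H x W x C images (nested lists)."""
    return [
        [pixel_a + pixel_b for pixel_a, pixel_b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]


# Tiny 1x2 RGB "page" and two candidate annotations (illustrative values).
page = [[[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]]
target_annotation = [[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]]
generated_annotation = [[[0.1, 0.0, 0.9], [0.8, 0.1, 0.0]]]

# The discriminator is shown (input, target) as "real" and (input, prediction) as "fake".
real_pair = concat_channels(page, target_annotation)
fake_pair = concat_channels(page, generated_annotation)

print(len(real_pair[0][0]), "channels per pixel")
```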

GAN vs PatchGAN
GAN discriminator output: 1x1x1 (1)
PatchGAN discriminator output: 2x2x1 (4)
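
The difference shows up in the loss: a plain GAN discriminator gives one real/fake probability per image, while a PatchGAN gives one per patch and the loss is averaged over them. The sketch below uses the standard GAN discriminator loss; the probability values are made up, and flattening the 2x2 grid into a list is a simplification.

```python
import math


def discriminator_loss(real_probs, fake_probs, eps=1e-12):
    """Standard GAN discriminator loss, averaged over all output cells.

    real_probs: discriminator outputs for real pairs (should approach 1)
    fake_probs: discriminator outputs for generated pairs (should approach 0)
    """
    real_term = sum(-math.log(p + eps) for p in real_probs) / len(real_probs)
    fake_term = sum(-math.log(1.0 - p + eps) for p in fake_probs) / len(fake_probs)
    return real_term + fake_term


# Plain GAN: a single 1x1x1 output per image.
gan_loss = discriminator_loss(real_probs=[0.9], fake_probs=[0.1])

# PatchGAN: a 2x2x1 output, flattened to four per-patch probabilities.
# Here the discriminator is fooled on one patch of the fake image (0.8).
patch_loss = discriminator_loss(
    real_probs=[0.9, 0.9, 0.9, 0.9],
    fake_probs=[0.1, 0.1, 0.8, 0.1],
)

print(round(gan_loss, 3), round(patch_loss, 3))
```

Because each patch contributes its own term, a PatchGAN can penalise a locally wrong region even when the rest of the image looks right.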

pix2pix:
● Conditional
● PatchGAN
● U-Net

Generator Model layer details
● 256x256x3 (192k)
● 128x128x64 (1024k)
● 64x64x128 (512k)
● 32x32x256 (256k)
● 16x16x512 (128k)
● 8x8x512 (32k)
● 4x4x512 (8k)
● 2x2x512 (2k)
● 1x1x512 (0.5k)
(all with batch norm & ReLU activation)
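
The sizes in brackets are value counts in units of 1024. A short sketch that reproduces the list, assuming each step halves height and width and doubles the channels from 64 up to a cap of 512 (the list appears to cover the encoder half of the generator; the function names are illustrative):

```python
def encoder_shapes(input_shape=(256, 256, 3), first_channels=64, max_channels=512):
    """Shapes of the input and every encoder layer down to 1x1."""
    shapes = [input_shape]
    h, w, _ = input_shape
    channels = first_channels
    while h > 1 and w > 1:
        h, w = h // 2, w // 2
        shapes.append((h, w, channels))
        channels = min(channels * 2, max_channels)
    return shapes


def size_in_k(shape):
    """Number of values in a layer, in units of 1024."""
    h, w, c = shape
    return h * w * c / 1024


for shape in encoder_shapes():
    print("{}x{}x{} ({}k)".format(shape[0], shape[1], shape[2], size_in_k(shape)))
```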

Discriminator Model layer details
● 256x256x6 (384k)
● 128x128x64 (1024k)
● 64x64x128 (512k)
● 32x32x256 (256k)
● 31x31x512 (480k)
● 30x30x1 (0.9k)
(all with batch norm & ReLU activation)
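
The 31x31 and 30x30 sizes follow from ordinary convolution arithmetic. The sketch below assumes 4x4 kernels with one pixel of padding, stride 2 for the first three layers and stride 1 for the last two, which is how the pix2pix TensorFlow port builds its discriminator; these values are not stated on the slide.

```python
# (stride, output channels) per layer; assumed values, see the note above.
LAYERS = [(2, 64), (2, 128), (2, 256), (1, 512), (1, 1)]


def conv_out(size, kernel=4, stride=2, pad=1):
    """Output width/height of a convolution on a square input."""
    return (size + 2 * pad - kernel) // stride + 1


def discriminator_shapes(size=256, in_channels=6, layers=LAYERS):
    """Shapes of the discriminator input and each layer output."""
    shapes = [(size, size, in_channels)]
    for stride, channels in layers:
        size = conv_out(size, stride=stride)
        shapes.append((size, size, channels))
    return shapes


def receptive_field(layers=LAYERS, kernel=4):
    """Width of the input patch that a single output cell can see."""
    field = 1
    for stride, _ in reversed(layers):
        field = (field - 1) * stride + kernel
    return field


print(discriminator_shapes())
print("each output cell sees a {0}x{0} patch".format(receptive_field()))
```

Under these assumptions each of the 30x30 outputs judges a 70x70 patch of the input, which is what makes this a PatchGAN rather than a whole-image discriminator.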

Training details
● 1000 steps
● GPU: K80 (on Google ML Engine)
● ~600 training examples
● Results shown for separate test examples
● Model input: 256x256x3

Training pipeline

Initial results using coloured segmentation
[Figure columns: input, target, prediction]

Using colours for classifications
● Colours mix, e.g. red + blue = purple
● Classes don’t mix (they are discrete values), e.g. abstract + title != paragraph
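
A small sketch of why this matters. When a model is unsure between two classes, its colour output drifts towards the average of their colours, and decoding by nearest colour can then land on a third, unrelated class. The palette below is an assumption chosen to show the effect; it is not the palette used in the talk.

```python
# Hypothetical palette: class name -> RGB colour.
PALETTE = {
    "abstract": (255, 0, 0),     # red
    "title": (0, 0, 255),        # blue
    "paragraph": (128, 0, 128),  # purple
}


def blend(colour_a, colour_b):
    """Average two colours, as an uncertain prediction between them might."""
    return tuple((a + b) // 2 for a, b in zip(colour_a, colour_b))


def nearest_class(colour, palette=PALETTE):
    """Decode a colour to the class with the closest palette colour."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(colour, palette[name]))
    return min(palette, key=squared_distance)


unsure = blend(PALETTE["abstract"], PALETTE["title"])
print(unsure, "decodes to", nearest_class(unsure))
```

Here a pixel that is half abstract, half title decodes to paragraph, a class the model never considered.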

Using separate channels - R, G, B
R: abstract
G: doi (identifier)
B: title

Using separate channels - more than three
0: abstract
1: doi (identifier)
2: title
3: paragraph
4: ...
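
With one channel per class, an annotation becomes a one-hot vector per pixel, and a prediction is decoded by picking the strongest channel. A minimal sketch using the first four channel numbers from the slide (the helper names and the tiny 2x2 grid are illustrative):

```python
CLASSES = ["abstract", "doi", "title", "paragraph"]  # channel index = list position


def one_hot(class_index, num_classes=len(CLASSES)):
    """Vector with 1.0 in the channel of the given class, 0.0 elsewhere."""
    return [1.0 if i == class_index else 0.0 for i in range(num_classes)]


def to_channels(label_grid):
    """H x W grid of class indices -> H x W x C grid of one-hot vectors."""
    return [[one_hot(label) for label in row] for row in label_grid]


def to_labels(channel_grid):
    """H x W x C grid of channel values -> H x W grid of class indices (argmax)."""
    return [
        [max(range(len(pixel)), key=lambda i: pixel[i]) for pixel in row]
        for row in channel_grid
    ]


labels = [[2, 2], [0, 3]]  # title, title / abstract, paragraph
channels = to_channels(labels)

# A pixel the model is unsure about: half abstract, half title.
unsure_pixel = [(a + b) / 2 for a, b in zip(one_hot(0), one_hot(2))]
decoded = to_labels([[unsure_pixel]])[0][0]
print(unsure_pixel, "decodes to", CLASSES[decoded])
```

Unlike blended colours, an uncertain pixel decodes to one of the classes the model was actually weighing, never to an unrelated one.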

‘Manuscript title’ channel (coloured blue)

‘Abstract’ channel (coloured red)

‘Author’ channel (coloured dark red)

All channels combined (coloured)

Next steps
● Improve model performance
● Use a quantitative measure
● Use computer vision predictions for XML extraction
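
The slides do not say which quantitative measure is planned. As one common option for segmentation, here is a sketch of per-class intersection over union (IoU) between a target and a predicted label grid; the choice of IoU and the tiny grids are assumptions.

```python
def iou_per_class(target, prediction, num_classes):
    """Intersection over union for each class present in target or prediction."""
    scores = {}
    for class_index in range(num_classes):
        target_pixels = {
            (y, x)
            for y, row in enumerate(target)
            for x, label in enumerate(row)
            if label == class_index
        }
        predicted_pixels = {
            (y, x)
            for y, row in enumerate(prediction)
            for x, label in enumerate(row)
            if label == class_index
        }
        union = target_pixels | predicted_pixels
        if union:  # skip classes that appear in neither grid
            scores[class_index] = len(target_pixels & predicted_pixels) / len(union)
    return scores


target = [[0, 0], [1, 2]]
prediction = [[0, 1], [1, 2]]  # one pixel of class 0 mislabelled as class 1

scores = iou_per_class(target, prediction, num_classes=3)
print(scores)
```

A per-class score makes it visible when small classes such as the title are handled badly even though overall pixel accuracy looks high.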

Summary
PDF to JATS XML

Project outputs
● Customisable conversion pipeline or tools for it
● Trained public model
● Potentially one aspect of the conversion improved (using computer vision)
● Tools to generate annotated PDFs, and potentially a public dataset
● Training pipeline that other people can run (though without access to our private datasets)

Your Ideas & QA
PDF to JATS XML

Your Ideas & questions?
● Improving aspects of the PDF to XML conversion
● Contribute!
● Questions?
