Ways to generate PDF from Python Web applications, Gaël Le Mignot
1. Generating PDF from Python web applications
Gaël LE MIGNOT
Pilot Systems
June 6, 2017
Gaël LE MIGNOT Pilot Systems Generating PDF from Python web applications
2. Summary
1 Introduction
2 Tools
3 Tips, tricks and pitfalls
4 Conclusion
3. Introduction
Pilot Systems
Free Software service provider
Python Web application development and hosting
Using Zope/Plone (since 2000) and Django (since 0.96)
All kinds of customers (public/private, small/big, ...)
Generating PDFs
Very frequently asked
Different purposes require different tools
Several pitfalls to avoid
4. Weasyprint - presentation
What is weasyprint?
Free Software Python library
Converts HTML5 pages (using a print CSS) into PDF
Also available as a command-line tool
When to use it?
To convert an existing HTML document
Consistency: same templating engine, same language
For simple page layouts
5. Weasyprint - code details
Simple usage
from weasyprint import HTML, CSS
html = template()
data = HTML(string=html).write_pdf()
Some mangling with BeautifulSoup
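from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Blacklist of stylesheets that should not reach the PDF
bl = ('typography.com', 'logged-in.css')
for css in soup.find_all("link"):
    # .get() avoids a KeyError on <link> tags without an href
    if any(cssname in css.get('href', '') for cssname in bl):
        css.extract()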
6. Weasyprint - code details
Add some page header/footer
@page {
    margin: 3cm 2cm;
    @bottom-right {
        content: "Page " counter(page);
    }
    @top-center {
        content: "Pilot Systems";
    }
}
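The page rules above can be handed to weasyprint as an extra stylesheet instead of being embedded in the template. This is a minimal sketch: `PAGE_CSS` and `render_pdf` are illustrative names, not part of the original slides, and weasyprint is imported lazily so the module still loads where it is not installed.

```python
# Print stylesheet taken from the slide: margins plus header and footer boxes.
PAGE_CSS = """
@page {
    margin: 3cm 2cm;
    @bottom-right { content: "Page " counter(page); }
    @top-center { content: "Pilot Systems"; }
}
"""


def render_pdf(html_text, page_css=PAGE_CSS):
    """Return the PDF bytes for html_text with the print stylesheet applied.

    render_pdf is a hypothetical helper name used for illustration.
    """
    # Lazy import: weasyprint is a third-party dependency.
    from weasyprint import HTML, CSS

    return HTML(string=html_text).write_pdf(
        stylesheets=[CSS(string=page_css)]
    )
```

Keeping the page rules in a separate stylesheet means the same HTML template can serve both the screen and the PDF version.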
7. Reportlab - presentation
What is reportlab?
Python library for generating PDF and graphs
Powerful RML templating language
Template and story concepts
Versions and tools
Complicated licensing
Reportlab PDF toolkit: limited Free Software version
Reportlab PLUS: non-free complete version
trml2pdf: free software, third-party implementation of RML
RMLPageTemplate: Zope integration of trml2pdf
8. Reportlab - code example
Bright warning
<template>
    <fill color="yellow"/>
    <rect x="115mm" y="217mm"
          width="90mm" height="18mm"
          fill="yes" stroke="yes"/>
    <frame id="warning" x1="115mm" y1="213mm"
           width="90mm" height="24mm" />
</template>
<story>
    <para>
        TEMPORARY DOCUMENT - DO NOT PRINT
    </para>
</story>
9. pdftk
What is pdftk?
Toolbox to manipulate PDF
Performs operations such as extracting pages and concatenating files
Can also stamp a PDF on top of another
Command-line tool, so use subprocess
Use-case
Afdas: collects taxes and finances training
Companies make a yearly declaration
Take a background and fill cells
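Driving pdftk from Python could look like the sketch below. The helper names and file names are hypothetical. The `stamp` and `cat` operations are standard pdftk operations. Note that `stamp` applies only the first page of the overlay to every page of the input, while `multistamp` works page by page.

```python
import subprocess


def pdftk_stamp_cmd(background_pdf, overlay_pdf, output_pdf, pdftk="pdftk"):
    """Build the argv that draws overlay_pdf on top of background_pdf."""
    return [pdftk, background_pdf, "stamp", overlay_pdf, "output", output_pdf]


def pdftk_cat_cmd(input_pdfs, output_pdf, pdftk="pdftk"):
    """Build the argv that concatenates input_pdfs into output_pdf."""
    return [pdftk] + list(input_pdfs) + ["cat", "output", output_pdf]


def run(cmd):
    """Run a pdftk command; raises CalledProcessError on a non-zero exit."""
    # A list argv (no shell=True) keeps file names from being shell-parsed.
    subprocess.check_call(cmd)
```

For the declaration use-case, the filled cells would be rendered as a separate PDF and stamped onto the blank form, for example `run(pdftk_stamp_cmd("form.pdf", "cells.pdf", "declaration.pdf"))`.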
10. LaTeX
What is LaTeX?
Very powerful document composition system
Used for scientific publishing, among others
Used for these slides, too
How to use it?
Generate a .tex file
Can use a template, or intermediate language (like rst)
Then execute pdflatex
When to use it?
Rich formatting
Table of contents, index, glossary, ...
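The generate-then-compile workflow can be sketched as follows. The template, the helper names and the job name are assumptions made for illustration. User-supplied text has to be escaped before it reaches the .tex file, and pdflatex is run twice so that the table of contents is resolved.

```python
import os
import subprocess
from string import Template

# Characters with a special meaning in LaTeX and a safe replacement for each.
_LATEX_SPECIALS = {
    "\\": r"\textbackslash{}",
    "&": r"\&", "%": r"\%", "$": r"\$", "#": r"\#", "_": r"\_",
    "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}

# Illustrative template; a real one would live in its own file.
TEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
\tableofcontents
\section{$title}
$body
\end{document}
""")


def latex_escape(text):
    """Escape LaTeX special characters, one character at a time."""
    return "".join(_LATEX_SPECIALS.get(ch, ch) for ch in text)


def build_pdf(title, body, workdir, jobname="report"):
    """Write the .tex file into workdir, compile it, return the PDF path."""
    tex_path = os.path.join(workdir, jobname + ".tex")
    with open(tex_path, "w", encoding="utf-8") as fh:
        fh.write(TEX_TEMPLATE.substitute(
            title=latex_escape(title), body=latex_escape(body)))
    cmd = ["pdflatex", "-interaction=nonstopmode", "-halt-on-error",
           "-output-directory", workdir, tex_path]
    for _ in range(2):  # second pass fills in the table of contents
        subprocess.check_call(cmd)
    return os.path.join(workdir, jobname + ".pdf")
```

Running in `nonstopmode` with `-halt-on-error` keeps pdflatex from waiting on an interactive prompt when the generated source is invalid.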
11. Other tools
For the brave
Client-side rendering with JS libraries
Using LibreOffice with pyuno
Generate QR-code/datamatrix with elaphe
12. HTTP Headers
Don’t forget HTTP headers
Specify the content-type
Hint whether the browser should display or download the file
Provide default filename
Code example
response.setHeader('Content-Type',
'application/pdf')
cd = 'attachment; filename="%s"' % filename
response.setHeader('Content-Disposition', cd)
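The same headers, independent of the web framework, could be produced by a small helper. `pdf_headers` is a hypothetical name. The sketch also strips quotes, backslashes and line breaks from the file name, since those would break the quoted header value.

```python
def pdf_headers(filename, inline=False):
    """Return (name, value) header pairs for serving a PDF.

    inline=True hints that the browser should display the document,
    inline=False that it should offer it as a download.
    """
    # Drop characters that would terminate or corrupt the quoted value.
    safe = "".join(ch for ch in filename if ch not in '"\\\r\n')
    disposition = "inline" if inline else "attachment"
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", '%s; filename="%s"' % (disposition, safe)),
    ]
```

With the Zope response object used above, the pairs can be applied with `for name, value in pdf_headers(filename): response.setHeader(name, value)`.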
13. Handling long generation times
The problem
Generate a PDF report of 500 pages
It takes 10 minutes
Timeout or users get angry
Solutions
Increase timeouts, inform users
Use fork or threads to generate asynchronously
Use a scheduler like Celery
Send the result by email, with a link
Cleaning
find /path/to/pdfs -mtime +14 -delete
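The thread-based option can be sketched with the standard library alone. Everything here (the job registry, the function names, the pool size) is an assumption made for illustration. Jobs live in the memory of a single process and are lost on restart, which is one reason to move to a scheduler like Celery.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# A small pool bounds how many PDFs are generated at the same time.
_executor = ThreadPoolExecutor(max_workers=2)
_jobs = {}  # job id -> Future


def start_job(generate, *args):
    """Queue generate(*args) in the background and return a job id."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(generate, *args)
    return job_id


def job_status(job_id):
    """Return 'pending', 'done' or 'failed' for a known job id."""
    future = _jobs[job_id]
    if not future.done():
        return "pending"
    return "failed" if future.exception() else "done"


def job_result(job_id, timeout=None):
    """Block until the job finishes (or timeout expires) and return its value."""
    return _jobs[job_id].result(timeout=timeout)
```

The request handler returns the job id immediately. A status page, or the email with a link mentioned above, then points the user to the finished file.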
14. Careful with search engines
Typical situation
Public website (using a CMS)
Button on each page to get a PDF version
A crawler comes... and boom.
Don’t panic
Use a robots.txt file (limited: only well-behaved crawlers honor it)
Have the button do a POST
Use a load balancer like haproxy and pin PDF requests to a dedicated backend
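The POST-only option can be sketched as a plain WSGI application. `pdf_app` and `make_pdf` are hypothetical names, and `make_pdf` only returns a placeholder document. Crawlers follow links with GET, so refusing GET keeps them from triggering a generation.

```python
def make_pdf(environ):
    """Hypothetical placeholder for the real (expensive) PDF generation."""
    return b"%PDF-1.4\n%%EOF\n"


def pdf_app(environ, start_response):
    """WSGI application that only generates the PDF on POST requests."""
    if environ.get("REQUEST_METHOD") != "POST":
        # Cheap refusal: no PDF is generated for GET or HEAD requests.
        start_response("405 Method Not Allowed",
                       [("Allow", "POST"), ("Content-Type", "text/plain")])
        return [b"Use the PDF button on the page.\n"]
    pdf = make_pdf(environ)
    start_response("200 OK", [("Content-Type", "application/pdf"),
                              ("Content-Length", str(len(pdf)))])
    return [pdf]
```

On the page itself, the button becomes a small form with `method="post"` rather than a link.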
15. CPU and RAM usage
PDF generation is expensive
PDF generation can be heavy both in CPU and RAM
Always estimate your volume before deploying
Task schedulers (like Celery) are a great help
Be nice!
#!/bin/sh
PDFTK=/usr/bin/pdftk
exec nice -n 10 taskset -c 0 $PDFTK "$@"
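When the tool is launched from Python rather than through a wrapper script, a similar effect can be obtained at process creation. This is a POSIX-only sketch, `run_niced` is a hypothetical helper, and CPU pinning relies on `os.sched_setaffinity`, which is only available on some platforms such as Linux.

```python
import os
import subprocess


def run_niced(cmd, increment=10, cpus=None):
    """Run cmd with lowered priority and return its standard output.

    cpus is an optional set of CPU numbers to pin the child to.
    """
    def lower_priority():
        # Runs in the child process, just before cmd is executed.
        os.nice(increment)
        if cpus is not None and hasattr(os, "sched_setaffinity"):
            os.sched_setaffinity(0, cpus)

    return subprocess.check_output(cmd, preexec_fn=lower_priority)
```

The Python documentation warns that `preexec_fn` is not safe in applications that use threads, so the shell wrapper shown above remains the simpler choice in a threaded server.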
16. Accessing external resources
The problem
Restricted access CSS and images
Common with weasyprint, but can also happen with other tools
Solutions
Reuse the user’s cookies in the sub-requests
Extract the resources to a temporary directory
Allow unprotected access from localhost (dangerous)
17. Accessing external resources
Cookie code example
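import urllib2  # Python 2; in Python 3 this module is urllib.request
cookies = request.cookies.items()
cookies = ['%s=%s' % (k, v) for k, v in cookies]
cookiestr = "; ".join(cookies)
# Strip newlines so a cookie value cannot inject extra headers
cookiestr = cookiestr.replace('\n', '')
opener = urllib2.build_opener()
# Send the joined string, not the list of pairs
opener.addheaders.append(('Cookie', cookiestr))
html = opener.open(ressource_url).read()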
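With weasyprint, the same idea can be plugged into the `url_fetcher` hook so that stylesheets and images are fetched with the user's cookies. This is a Python 3 sketch under assumptions: `cookie_header`, `make_fetcher` and `allowed_prefix` are illustrative names, and the import of `default_url_fetcher` from the top-level package should be checked against the installed weasyprint version.

```python
from urllib.request import Request, urlopen


def cookie_header(cookies):
    """Join a cookie mapping into a single Cookie header value."""
    pairs = ["%s=%s" % (k, v) for k, v in cookies.items()]
    # Strip line breaks so a cookie value cannot inject extra headers.
    return "; ".join(pairs).replace("\r", "").replace("\n", "")


def make_fetcher(cookies, allowed_prefix):
    """Build a url_fetcher that forwards cookies only to our own site."""
    header = cookie_header(cookies)

    def fetcher(url):
        # Lazy import: weasyprint is a third-party dependency.
        from weasyprint import default_url_fetcher

        if not url.startswith(allowed_prefix):
            # Never leak session cookies to third-party hosts.
            return default_url_fetcher(url)
        response = urlopen(Request(url, headers={"Cookie": header}))
        return {"string": response.read(),
                "mime_type": response.headers.get_content_type()}

    return fetcher
```

It would be used as `HTML(string=html, url_fetcher=make_fetcher(request.cookies, site_url))`, where `site_url` stands for the application's own base URL.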
18. Encrypted PDFs
Typical use-case
User submitted a form with text fields and PDF attachments
At the end, the answers are concatenated into a PDF
Or even all the answers of all users!
Use weasyprint + pdftk or LaTeX
What happens
It works most of the time
But on some PDFs it breaks in strange ways
The culprit: DRM (Digital Restrictions Management)
19. Encrypted PDFs
What to do?
Ensure the PDF is not DRM-protected
Use pdfinfo from poppler
Code example
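import re
import subprocess
out = subprocess.check_output(['pdfinfo', pdffile])
# check_output returns bytes on Python 3, so decode before matching
if re.search(r'Encrypted:.*yes', out.decode('utf-8', 'replace')):
    raise ValueError("DRM protected")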