
Ways to generate PDF from Python Web applications, Gaël Le Mignot

PyParis 2017
http://pyparis.org

Published in: Technology

  1. Generating PDF from Python web applications
     Gaël LE MIGNOT, Pilot Systems
     June 6, 2017
  2. Summary
       1. Introduction
       2. Tools
       3. Tips, tricks and pitfalls
       4. Conclusion
  3. Introduction
     Pilot Systems
       • Free Software service provider
       • Python Web application development and hosting
       • Using Zope/Plone (since 2000) and Django (since 0.96)
       • All kinds of customers (public/private, small/big, ...)
     Generating PDFs
       • Very frequently requested
       • Different purposes require different tools
       • Several pitfalls to avoid
  4. Weasyprint - presentation
     What is weasyprint?
       • Free Software Python library
       • Converts an HTML5 page (using a print CSS) into PDF
       • Also available as a command-line tool
     When to use it?
       • To convert an existing HTML document
       • Consistency: same templating engine, same language
       • For simple page layouts
  5. Weasyprint - code details
     Simple usage
         from weasyprint import HTML, CSS
         # template() stands for whatever renders your HTML page
         html = template()
         data = HTML(string=html).write_pdf()
     Some mangling with BeautifulSoup
         from bs4 import BeautifulSoup
         soup = BeautifulSoup(html, 'html.parser')
         # Stylesheets that must not reach the PDF renderer
         bl = ('typography.com', 'logged-in.css')
         for css in soup.find_all("link"):
             # <link> tags without an href are left alone
             if any(cssname in css.get('href', '') for cssname in bl):
                 css.extract()
  6. Weasyprint - code details
     Add some page header/footer
         @page {
             margin: 3cm 2cm;
             @bottom-right { content: "Page " counter(page); }
             @top-center { content: "Pilot Systems"; }
         }
  7. Reportlab - presentation
     What is reportlab?
       • Python library for generating PDF and graphs
       • Powerful RML templating language
       • Template and story concepts
     Versions and tools
       • Complicated licensing
       • Reportlab PDF toolkit: limited Free Software version
       • Reportlab PLUS: non-free complete version
       • trml2pdf: free software, third-party implementation of RML
       • RMLPageTemplate: Zope integration of trml2pdf
  8. Reportlab - code example
     Bright warning
         <template>
           <fill color="yellow"/>
           <rect x="115mm" y="217mm" width="90mm" height="18mm"
                 fill="yes" stroke="yes"/>
           <frame id="warning" x1="115mm" y1="213mm"
                  width="90mm" height="24mm"/>
         </template>
         <story>
           <para>TEMPORARY DOCUMENT - DO NOT PRINT</para>
         </story>
  9. pdftk
     What is pdftk?
       • Toolbox to manipulate PDFs
       • Performs operations like extracting pages or concatenating files
       • Can also stamp one PDF on top of another
       • Command-line tool, so call it through subprocess
     Use case
       • Afdas: collects taxes and finances training
       • Companies make a yearly declaration
       • Take a background PDF and fill in its cells
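
The pdftk calls described on this slide can be wrapped in a few lines of Python. This is a minimal sketch, not the speaker's code: it assumes `pdftk` is on the `PATH`, and the helper names (`stamp_command`, `concat_command`, `run`) are made up for illustration. Building the argument list separately from running it keeps the command easy to inspect and avoids any shell quoting.

```python
import subprocess

PDFTK = "pdftk"  # assumption: the pdftk binary is on PATH


def stamp_command(background, overlay, output):
    """Build the pdftk command that stamps `overlay` on top of `background`."""
    return [PDFTK, background, "stamp", overlay, "output", output]


def concat_command(inputs, output):
    """Build the pdftk command that concatenates `inputs` in the given order."""
    return [PDFTK] + list(inputs) + ["cat", "output", output]


def run(command):
    """Run a pdftk command; raises CalledProcessError if pdftk fails."""
    subprocess.run(command, check=True)
```

For the declaration use case, something like `run(stamp_command("form.pdf", "cells.pdf", "declaration.pdf"))` would lay the generated cell contents over the blank form; the file names here are placeholders.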
  10. LaTeX
      What is LaTeX?
        • Very powerful document composition system
        • Used for scientific publishing, among others
        • Used for these slides, too
      How to use it?
        • Generate a .tex file
        • Can use a template, or an intermediate language (like rst)
        • Then execute pdflatex
      When to use it?
        • Rich formatting
        • Table of contents, index, glossary, ...
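
The "generate a .tex file, then execute pdflatex" workflow could look roughly like the sketch below. Everything here is illustrative: the template, the `tex_escape`, `render_tex` and `build_pdf` helpers and the `document.tex` file name are assumptions, and a real project would more likely use its web framework's templating engine. The one point worth keeping is that user-supplied text must be escaped before it lands in the LaTeX source.

```python
import os
import subprocess
import tempfile
from string import Template

# Hypothetical template; $title and $body are placeholders of this sketch.
TEX_TEMPLATE = Template(r"""\documentclass{article}
\begin{document}
\section*{$title}
$body
\end{document}
""")

# Characters that LaTeX treats specially, with their escaped forms.
_TEX_ESCAPES = {
    "\\": r"\textbackslash{}", "&": r"\&", "%": r"\%", "$": r"\$",
    "#": r"\#", "_": r"\_", "{": r"\{", "}": r"\}",
    "~": r"\textasciitilde{}", "^": r"\textasciicircum{}",
}


def tex_escape(text):
    """Escape user-supplied text so it is safe inside a LaTeX document."""
    return "".join(_TEX_ESCAPES.get(ch, ch) for ch in text)


def render_tex(title, body):
    """Fill the template with escaped values and return the .tex source."""
    return TEX_TEMPLATE.substitute(title=tex_escape(title), body=tex_escape(body))


def build_pdf(tex_source, workdir=None):
    """Write the source to disk, run pdflatex on it, return the PDF path."""
    workdir = workdir or tempfile.mkdtemp()
    tex_path = os.path.join(workdir, "document.tex")
    with open(tex_path, "w", encoding="utf-8") as handle:
        handle.write(tex_source)
    subprocess.run(
        ["pdflatex", "-interaction=nonstopmode",
         "-output-directory=" + workdir, tex_path],
        check=True,
    )
    return os.path.join(workdir, "document.pdf")
```

Documents with a table of contents or an index usually need pdflatex to be run more than once so that cross-references settle.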
  11. Other tools
      For the brave
        • Client-side rendering with JS libraries
        • Using LibreOffice with pyuno
        • Generate QR-code/datamatrix with elaphe
  12. HTTP Headers
      Don't forget HTTP headers
        • Specify the content-type
        • Hint between displaying and downloading
        • Provide a default filename
      Code example
          response.setHeader('Content-Type', 'application/pdf')
          cd = 'attachment; filename="%s"' % filename
          response.setHeader('Content-Disposition', cd)
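
The same headers can be produced by a small framework-independent helper. This is a sketch under assumptions: `pdf_headers` and its `download` flag are invented names, and the returned pairs would still have to be passed to whatever response object the framework provides. It shows the two choices the slide mentions (`attachment` to download, `inline` to display) and strips characters that would break the quoted filename.

```python
def pdf_headers(filename, download=True):
    """Return (name, value) header pairs for a PDF response."""
    # Drop characters that would break the quoted filename or inject headers.
    safe = "".join(c for c in filename if c not in '"\\\r\n')
    # "attachment" suggests a download, "inline" suggests in-browser display.
    disposition = "attachment" if download else "inline"
    return [
        ("Content-Type", "application/pdf"),
        ("Content-Disposition", '%s; filename="%s"' % (disposition, safe)),
    ]
```

Non-ASCII filenames need the extended `filename*` parameter defined in RFC 6266, which this sketch does not handle.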
  13. Handling long generation times
      The problem
        • Generate a PDF report of 500 pages
        • It takes 10 minutes
        • Timeout or users get angry
      Solutions
        • Increase timeouts, inform users
        • Use fork or threads to generate asynchronously
        • Use a scheduler like Celery
        • Send the result by email, with a link
      Cleaning
          find /path/to/pdfs -mtime +14 -delete
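
The "use threads to generate asynchronously" option can be sketched with the standard library alone. The names below (`submit_pdf_job`, `job_status`, `job_result`) are hypothetical, and the in-memory job table is an assumption that only holds for a single-process server; with several worker processes, a task queue such as Celery is the more robust choice. The idea is that the web request returns a job id immediately and the user is notified (for instance by email, with a link) once the job is done.

```python
import uuid
from concurrent.futures import ThreadPoolExecutor

# A single worker keeps PDF generation from starving the web threads.
_executor = ThreadPoolExecutor(max_workers=1)
_jobs = {}  # job id -> Future (in-memory: single-process servers only)


def submit_pdf_job(generate, *args):
    """Start `generate(*args)` in the background and return a job id."""
    job_id = uuid.uuid4().hex
    _jobs[job_id] = _executor.submit(generate, *args)
    return job_id


def job_status(job_id):
    """Return 'pending', 'failed' or 'done' for a known job id."""
    future = _jobs[job_id]
    if not future.done():
        return "pending"
    return "failed" if future.exception() else "done"


def job_result(job_id, timeout=None):
    """Block until the job finishes and return what `generate` returned."""
    return _jobs[job_id].result(timeout=timeout)
```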
  14. Careful with search engines
      Typical situation
        • Public website (using a CMS)
        • Button on each page to get a PDF version
        • A crawler comes... and boom.
      Don't panic
        • Use a robots.txt file, but its effect is limited
        • Have the button do a POST
        • Use a load-balancer like haproxy and pin PDF requests
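
"Have the button do a POST" works because crawlers follow links (GET requests) but do not normally submit forms. A minimal WSGI sketch of the server side is shown below; `pdf_view` and `generate_pdf` are hypothetical names, and `generate_pdf` is only a placeholder for the real rendering call.

```python
def generate_pdf(environ):
    """Placeholder for the real (expensive) PDF rendering call."""
    return b"%PDF-1.4 placeholder"


def pdf_view(environ, start_response):
    """Minimal WSGI app: only a POST request triggers PDF generation."""
    if environ.get("REQUEST_METHOD") != "POST":
        # Crawlers following a plain link end up here, at almost no cost.
        start_response("405 Method Not Allowed",
                       [("Allow", "POST"), ("Content-Type", "text/plain")])
        return [b"Use the PDF button on the page.\n"]
    pdf = generate_pdf(environ)
    start_response("200 OK", [("Content-Type", "application/pdf"),
                              ("Content-Length", str(len(pdf)))])
    return [pdf]
```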
  15. CPU and RAM usage
      PDF generation is expensive
        • PDF generation can be heavy on both CPU and RAM
        • Always estimate your volume before deploying
        • Task schedulers (like Celery) are a great help
      Be nice!
          #!/bin/sh
          # Run pdftk at low priority, pinned to CPU 0
          PDFTK=/usr/bin/pdftk
          exec nice -n 10 taskset -c 0 "$PDFTK" "$@"
  16. Accessing external resources
      The problem
        • CSS and images with restricted access
        • Common with weasyprint, but can also happen with other tools
      Solutions
        • Reuse the user's cookies in the sub-requests
        • Extract the resources to a temporary directory
        • Allow unprotected access from localhost (dangerous)
  17. Accessing external resources
      Cookie code example
          import urllib2  # Python 2; urllib.request on Python 3
          cookies = request.cookies.items()
          cookies = ['%s=%s' % (k, v) for k, v in cookies]
          cookiestr = "; ".join(cookies)
          # Strip newlines so a cookie value cannot inject extra headers
          cookiestr = cookiestr.replace('\n', '')
          opener = urllib2.build_opener()
          # Forward the user's cookies on the sub-request
          opener.addheaders.append(('Cookie', cookiestr))
          html = opener.open(resource_url).read()
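
The slide's snippet targets Python 2 (`urllib2`). On Python 3 the same idea might be written as follows; `cookie_header` and `fetch_with_cookies` are names made up for this sketch, and `cookies` is assumed to be a plain dict taken from the incoming request.

```python
import urllib.request


def cookie_header(cookies):
    """Serialize a dict of cookies into a single Cookie header value."""
    pairs = ["%s=%s" % (name, value) for name, value in cookies.items()]
    # Strip line breaks so a cookie value cannot inject extra headers.
    return "; ".join(pairs).replace("\r", "").replace("\n", "")


def fetch_with_cookies(url, cookies):
    """Fetch `url`, forwarding the user's cookies on the sub-request."""
    request = urllib.request.Request(
        url, headers={"Cookie": cookie_header(cookies)})
    with urllib.request.urlopen(request) as response:
        return response.read()
```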
  18. Encrypted PDFs
      Typical use case
        • Users submit a form with text fields and PDF attachments
        • At the end the answers are concatenated into a PDF
        • Or even all the answers of all users!
        • Use weasyprint + pdftk or LaTeX
      What happens
        • It works most of the time
        • But on some PDFs it breaks in odd ways
        • The culprit: DRM (Digital Restrictions Management)
  19. Encrypted PDFs
      What to do?
        • Ensure the PDF is not DRM-protected
        • Use pdfinfo from poppler
      Code example
          import re
          import subprocess
          # universal_newlines=True returns pdfinfo's output as text
          out = subprocess.check_output(['pdfinfo', pdffile], universal_newlines=True)
          if re.search('Encrypted:.*yes', out):
              raise ValueError("DRM protected")
  20. Conclusion
  21. Conclusion
      Thanks for listening! Any questions?
