Perl and PDF - YAPC::EU 2015 Presentation

My slides for the PDF and Perl presentation at YAPC::EU 2015: a summary of CPAN modules related to PDF creation, followed by PDF creation using PDFlib. Covers composing PDF documents with text, images, barcodes, non-page data, compression, PDF/A and more.

  1. Zentech Innovations Pvt. Ltd., Hyderabad, Telangana, India. © 2010 Zentech Innovations Pvt. Ltd.
     Perl and PDF
     Prabhakar Somu, psomu@yahoo.com
     Zentech Innovations Pvt. Ltd., Hyderabad, India
     September 2, 2015
  2. What we do: Transactional Communications
     • Our business is primarily the creation, printing, dispatching, emailing and web presentation of transactional and financial documents.
     • Large-volume PDF document production
     • Variable data, statement composition
  3. A brief history of PDF (Portable Document Format)
     • Created by Adobe in 1993
     • Compact, device independent, cross platform
     • Based on a subset of the PostScript page description language
     • Font embedding, replacement and subsetting
     • Compression and structured (reusable) component storage
  4. A brief history of PDF (contd.)
     • Versions:
        - PDF 1.0 (Acrobat 1.0), 1993
        - PDF 1.1 (Acrobat 2.0), 1994
        - PDF 1.2 (Acrobat 3.0), 1996
        - PDF 1.3 (Acrobat 4.0), 1999
        - PDF 1.4 (Acrobat 5.0), 2001
        - PDF 1.5 (Acrobat 6.0), 2003
        - PDF 1.6 (Acrobat 7.0), 2005
        - PDF 1.7 (Acrobat 8.0), 2006
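
The version numbers on this slide are recorded at the start of every PDF file in a header of the form %PDF-1.x. A minimal core-Perl sketch for reading that header follows. The helper name pdf_header_version is made up for this example, and the sketch ignores the optional /Version entry in the document catalog, which can override the header from PDF 1.4 onwards.

```perl
use strict;
use warnings;

# Hypothetical helper: return the version from the "%PDF-x.y" header, or undef.
sub pdf_header_version {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    read $fh, my $head, 1024;    # the header sits at the very start of the file
    close $fh;
    return undef unless defined $head;
    return $head =~ /%PDF-(\d+\.\d+)/ ? $1 : undef;
}

# Usage: perl pdfversion.pl file.pdf
if (@ARGV) {
    my $version = pdf_header_version($ARGV[0]);
    print defined $version ? "PDF $version\n" : "No PDF header found\n";
}
```

The header only states which version the producer claimed; it does not prove that the file uses features of that version.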
  5. PostScript and PDF
     • PostScript is a page description language and a programming language
     • Has to be interpreted and imaged (ripped) on a device
     • Device specific
     • PDF contains a subset of PostScript elements without any control flow
     • Self-contained
  6. PostScript and PDF
     • A page in PostScript has to be 'ripped' and imaged before a subsequent page can be imaged (graphics state needs to be maintained)
     • Any page of a PDF file can be displayed without needing to display earlier pages
     • Device independent
     • Self-contained, fonts embedded, identical rendering on all platforms
  7. PostScript and PDF
     • Ideal in a web environment, as any page can be displayed at any time
     • PDF files can be streamed
     • Compact, reusable components within a PDF file
     • Identical rendering across devices
     • No interpretation required in PDF
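
The random page access described on the last two slides rests on the cross-reference table at the end of a PDF file, which records the byte offset of every object so that a viewer can jump straight to the page it needs. The core-Perl sketch below writes a one-page PDF by hand to make that table visible. It illustrates the file layout only: the helper name build_minimal_pdf and the page content are invented for this example, and this is not how PDFlib or the CPAN modules work internally.

```perl
use strict;
use warnings;

# Build a one-page PDF as a byte string, recording each object's byte offset.
sub build_minimal_pdf {
    my $stream  = "BT /F1 24 Tf 72 720 Td (Hello from Perl) Tj ET";
    my @objects = (
        "<< /Type /Catalog /Pages 2 0 R >>",
        "<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        "<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] "
          . "/Contents 4 0 R /Resources << /Font << /F1 5 0 R >> >> >>",
        "<< /Length " . length($stream) . " >>\nstream\n$stream\nendstream",
        "<< /Type /Font /Subtype /Type1 /BaseFont /Courier >>",
    );

    my $pdf = "%PDF-1.4\n";
    my @offsets;
    for my $i (0 .. $#objects) {
        push @offsets, length $pdf;    # byte position where object $i+1 starts
        $pdf .= ($i + 1) . " 0 obj\n$objects[$i]\nendobj\n";
    }

    # Cross-reference table: one 20-byte entry per object, plus the free entry 0.
    my $xref_pos = length $pdf;
    $pdf .= "xref\n0 " . (@objects + 1) . "\n";
    $pdf .= "0000000000 65535 f \n";
    $pdf .= sprintf("%010d 00000 n \n", $_) for @offsets;
    $pdf .= "trailer\n<< /Size " . (@objects + 1) . " /Root 1 0 R >>\n";
    $pdf .= "startxref\n$xref_pos\n%%EOF\n";
    return ($pdf, \@offsets, $xref_pos);
}

my ($pdf, $offsets, $xref_pos) = build_minimal_pdf();
printf "object %d starts at byte %d\n", $_ + 1, $offsets->[$_] for 0 .. $#$offsets;
print "xref table starts at byte $xref_pos\n";

# Usage: perl minimal.pl out.pdf   (writes the file only when a name is given)
if (@ARGV) {
    open my $out, '>:raw', $ARGV[0] or die "Cannot write $ARGV[0]: $!";
    print {$out} $pdf;
    close $out;
}
```

A viewer reads the startxref value from the end of the file, loads the table, and follows the offsets from the catalog to the page tree to the requested page, which is why no earlier page has to be processed first.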
  8. Perl and PDF: several modules on CPAN
     • PDF::Create
     • CAM::PDF
     • PDF::API2
     • PDF::API3
     • PDF::Extract
     • PDF::Xtract
     • PDF::GetImages
     • PDF::Template
     • PDF::Reuse
     • PDF::ReportWriter
     • PDF::Table
     • PDF::Parse
     • PDF::Report
  9. Perl and PDF (contd.)
     • PDF::Core
     • PDF::OCR2
     • Fuse::PDF
     • PDF::Burst
     • PDF::Haru
     • PDF::Imposition
     • PDF::EasyPDF
     • Image::Magick::Thumbnail::PDF
     • PDF::Labels
     • PDF::Tk
     • PDF::Reuse::Barcode
     • deletepdfpage.pl
     • PDF::Boxer
     • PDFlib
  10. Special Mention (quite extensive)
     • PDF::API2
     • PDF::API3
     • CAM::PDF
     • PDF::Haru
  11. Approximate categorization of Perl PDF Modules
     • Creation
     • Repurposing
     • Extraction
     • Miscellaneous
  12. Creation | Repurposing | Content Extraction | Miscellaneous
     • Creation: PDF::Create, PDF::Table, PDF::Report, PDF::Template, PDF::ReportWriter, PDF::Haru, PDF::EasyPDF, PDF::Labels, PDF::Boxer
     • Repurposing: PDF::Reuse, PDF::Burst, Image::Magick::Thumbnail::PDF, PDF::Tk
     • Content Extraction: PDF::Extract, PDF::Xtract, PDF::GetImages, PDF::OCR, PDF::OCR2
     • Miscellaneous: PDF::Parse, PDF::Core, Fuse::PDF
     • Also shown: PDF::API2, PDF::API3, CAM::PDF, pdflib, PDF::Reuse::Barcode
  13. Approximate categorization of Perl PDF Modules - Creation
     • PDF::Create
     • PDF::Table
     • PDF::Report
     • PDF::Template
     • PDF::ReportWriter
     • PDF::Haru
     • PDF::EasyPDF
     • PDF::Labels
     • PDF::Boxer
  14. Approximate categorization of Perl PDF Modules - Repurposing
     • PDF::Reuse
     • PDF::Burst
     • PDF::Imposition
     • Image::Magick::Thumbnail::PDF
     • PDF::Tk
  15. Approximate categorization of Perl PDF Modules - Content Extraction
     • PDF::Extract
     • PDF::Xtract
     • PDF::OCR
     • PDF::OCR2
     • PDF::GetImages
  16. Approximate categorization of Perl PDF Modules - Miscellaneous
     • PDF::Parse
     • PDF::Core
     • Fuse::PDF
  17. Approximate categorization of Perl PDF Modules - General Purpose
     • PDF::API2
     • PDF::API3
     • CAM::PDF
     • pdflib
  18. PDFlib will be used to illustrate the various aspects
     • From pdflib.com
     • Commercial as well as open source
     • Comprehensive and cross-platform, with wide support for PDF versions, image formats, color spaces, text rendering, graphics etc.
     • PDFlib (for creating PDF files)
     • PDI (for repurposing existing PDF files)
     • TET (for extracting text)
     • pCOS (for accessing non-page data)
     • PLOP (for linearizing PDFs)
  19. Creating PDF files - PDFlib
         use PDFlib::PDFlib 8.0;
         my $p = PDFlib::PDFlib->new;
         $p->set_parameter("compatibility", "1.7");
         $p->set_parameter("license", "XYX");    # placeholder license key
         $p->begin_document("output.pdf", "optimize");
         # Width and height arguments come first; the option list overrides them
         $p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
         # Create content here
         $p->end_page_ext("");
         $p->end_document("");
  20. Creating linearized PDF files - PDFlib
         use PDFlib::PDFlib 8.0;
         my $p = PDFlib::PDFlib->new;
         $p->set_parameter("compatibility", "1.7");
         $p->set_parameter("license", "XYX");    # placeholder license key
         # "linearize" is the only change from the previous slide
         $p->begin_document("output.pdf", "optimize linearize");
         $p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
         # Create content here
         $p->end_page_ext("");
         $p->end_document("");
  21. Creating PDF files - PDFlib
     Several options are available when creating a document and a page.
     Document options:
     • password, document open actions, openmode (bookmarks, thumbnails etc.), optimize
     • permissions (noprint, nomodify etc.)
     Page options:
     • Specify artbox, cropbox etc.
     • Width and height
     • XMP metadata
     • and many more
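
Page width and height are ultimately expressed in PDF user space units of 1/72 inch (points), which is what the symbolic A4 width and height on the earlier slides resolve to, rounded to whole points. A small core-Perl sketch of the conversion from millimetres follows; the helper name mm_to_pt is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: millimetres to PDF points (1 pt = 1/72 inch, 1 inch = 25.4 mm).
sub mm_to_pt {
    my ($mm) = @_;
    return $mm / 25.4 * 72;
}

# ISO A4 is 210 mm x 297 mm.
printf "A4: %.2f x %.2f pt\n", mm_to_pt(210), mm_to_pt(297);
# prints "A4: 595.28 x 841.89 pt"
```

The same conversion applies to the page boxes (crop box, trim box and so on), which are given in the same units.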
  22. Fonts and PDF
         my $font_handle1 = $p->load_font("Courier", "", "");
         my $font_handle2 = $p->load_font("Calibri", "", "");
     • Searches for fonts in the resource path (set_parameter)
     • Unicode fonts
  23. Laying out Text
         $p->setfont($font_handle1, 32);
         # fit_textline places one line of text at the given position
         $p->fit_textline("Your Text Here", $xpos, $ypos, "");
         $p->fit_textline("More text", $xpos2, $ypos2, "");
  24. Placing Images
         my $image_handle = $p->load_image("auto", $image_file, "options");
         $p->fit_image($image_handle, $x, $y, "options");
     • Many image formats such as JPG, TIF and BMP are automatically identified and loaded
     • Multi-page TIF files are handled as well
     • Black and white, grayscale and color images are handled
     • Color profiles and many other parameters can be set
     • Several positioning and fitting options
  25. Barcodes in PDF files
     • A number of barcode types are possible
     • Two methods of placing barcodes:
        - Font based
        - Image based
     • In the font-based method, load a barcode font (like a QRCode font) and place text in that font
     • In the image-based method, load an image (of the barcode) and place the image on a page
     PDF-SamplesPunjabi.pdf PDF-SamplesAssame.pdf
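
With the font-based method, the text usually has to be prepared for the symbology before it is set in the barcode font. As an illustration under stated assumptions, the core-Perl sketch below prepares text for a Code 39 font, a symbology chosen here because its rules are simple (the slide itself mentions a QRCode font). The helper name code39_text is invented, and whether a check character is also needed depends on the font and the scanner.

```perl
use strict;
use warnings;

# Hypothetical helper: prepare text for a Code 39 barcode font.
# Code 39 encodes digits, upper-case letters, space and - . $ / + %,
# and uses "*" as its start and stop character.
sub code39_text {
    my ($data) = @_;
    $data = uc $data;
    die "Not encodable in Code 39: $data\n"
        unless $data =~ m{\A[0-9A-Z \-.\$/+%]+\z};
    return "*$data*";
}

print code39_text('inv-2015'), "\n";    # prints "*INV-2015*"
```

The returned string is what would then be placed on the page in the barcode font, in the same way as any other line of text.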
  26. Metadata in PDF files (XMP Metadata)
     • Concept of non-printable metadata in PDF files
     • Some information such as author, date of creation and keywords can be placed using the set_info call
     • More extensive arbitrary data can be injected using the 'XMP Metadata' channel
     • Possible in TIFF and JPEG files as well
  27. Metadata in PDF files (XMP Metadata)
     • XMP data can be placed at the document, page or image level
     PDF-Samplessimple.txt
         $p->begin_document($output_pdf, "metadata={filename={simple.txt}}");
  28. Named Destinations - Navigating to specific pages in a PDF
         $p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
         $p->add_nameddest("Page1", "options");
         $p->end_page_ext("");
     file:///C:/Output.pdf#nameddest=Page1
  29. JavaScript in PDF files
     • JavaScript can be embedded in PDF files
     • Actions can be tied to JavaScript code
     • For example, when a page is displayed (opened), execute a function
     PDF-Samplesbarcode_field.pdf PDF-Samplesbarcode_field.pl
  30. PDF/A
     • Special version of PDF meant for long-term archival and retrieval
     • Many interactive elements not allowed
     • No hyperlinks, forms etc.
     • Guaranteed to be supported by Adobe
     • Applications in library archival systems, legal document archival and retrieval etc., where long-term compatibility of documents is crucial
     • PDFlib can create such documents
  31. Types of Page Boundaries
     http://www.prepressure.com/pdf/basics/page-boxes
     • Media Box: specifies the width and height of the media (paper size)
     • Crop Box: area to which page contents are clipped (for display)
     • Trim Box: intended dimensions of the finished page (by default = Crop Box)
     • All of these can be specified in PDFlib as page options in begin_page_ext
  32. Bookmarks, Thumbnails
     • Bookmarks and thumbnails can be created using PDFlib
         $p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
         $p->create_bookmark("Bookmark Display Name", "{type fitwindow}");
         $p->add_thumbnail($thumbnail_image_handle);
         $p->end_page_ext("");
  33. Controlling how a PDF file is first displayed in Adobe (Acrobat or Viewer)
         $p->begin_document("output.pdf", "viewerpreferences=centerwindow");
         $p->begin_document("output.pdf", "viewerpreferences=duplex");
  34. Dealing with encrypted PDF files
     • Specify the password in begin_document
  35. Disabling ability to Print, Cut/Copy/Save etc.
         $p->begin_document("output.pdf", "permissions={noprint nomodify}");
  36. Repurposing existing PDF files using PDI
     • An existing PDF file can be read in and its content placed as is in an output file
         my $input_doc_handle = $p->open_pdi($input_file, "", 0);
         $p->begin_document($output_file, "");
         $p->begin_page_ext($width, $height, "");
         my $page_handle = $p->open_pdi_page($input_doc_handle, $page_no, "");
         # The last argument of fit_pdi_page is an option list
         $p->fit_pdi_page($page_handle, 0, 0, $boxsize);
         $p->end_page_ext("");
         # Release the imported page by its handle
         $p->close_pdi_page($page_handle);
         $p->end_document("");
  37. PDF Files with embedded images
     • Each page can have an image and nothing else
     • Scanned images are typically combined into such PDF files
     • Many options and possibilities to compress such images in a PDF file (from Adobe Acrobat as well as PDFlib)
     • Text and other content can be overlaid on such files as well using PDFlib's graphic operators
  38. Converting PDF pages into other formats
     • ImageMagick and PerlMagick
     • Convert individual pages to images
     • PDF::GetImages
  39. Text Extraction with TET
     Using the Text Extraction Tool (TET), text in a PDF file can be intelligently and reliably extracted.
         tet.exe --tetopt --pageopt="{{200 750 400 755}}" --xml line
     (find text in the box 200,750,400,755, output as XML and recognize lines)
     • A Perl binding for TET exists
     • Note that this is not performing OCR; it is intelligently querying the PDF nodes
  40. Non-printable data extraction using pCOS
     Using the pCOS tool, non-printable data such as the number of pages, XMP metadata and Author/Creator/Date information can be extracted.
     • Extract/check for bookmarks
     • Extract ICC profiles
     • Check for security problems
  41. Important take-aways from this presentation
     • Too many modules on CPAN (quite confusing)
     • None are complete (in my humble opinion)
     • A commercial or open source equivalent of PDFlib (and associated libraries such as TET and pCOS) makes an ideal toolset
     • A lot more than text and graphics is possible with PDF files
  42. Thanks for your attention!
     Prabhakar Somu
     +91 97048 71236 (Mobile India)
     (908) 500 5902 (Mobile US)
     Email: somup@zensys.com
