Zentech Innovations Pvt. Ltd.
Hyderabad
Telangana
India
© 2010 Zentech Innovations Pvt. Ltd.
Perl and PDF
Prabhakar Somu
psomu@yahoo.com
Zentech Innovations Pvt. Ltd.
Hyderabad, India
September 2, 2015
What we do
Transactional Communications
• Our business is primarily involved in creation, printing, dispatching, emailing and web-presenting transactional/financial documents.
• Large-volume PDF document production
• Variable data, statement composition
A brief history of PDF (Portable Document Format)
• Created by Adobe in 1993
• Compact, device independent, cross-platform
• A subset of the PostScript page description language
• Font embedding/replacement/subsetting
• Compression and structured (reusable) component storage
A brief history of PDF contd…
• Versions:
– PDF 1.0 (Acrobat 1.0) – 1993
– PDF 1.1 (Acrobat 2.0) – 1994
– PDF 1.2 (Acrobat 3.0) – 1996
– PDF 1.3 (Acrobat 4.0) – 1999
– PDF 1.4 (Acrobat 5.0) – 2001
– PDF 1.5 (Acrobat 6.0) – 2003
– PDF 1.6 (Acrobat 7.0) – 2005
– PDF 1.7 (Acrobat 8.0) – 2006
PostScript and PDF
• PostScript is a page description language and a programming language
• Has to be interpreted and imaged (ripped) on a device
• Device specific
• PDF contains a subset of PostScript elements without any control flow
• Self-contained
PostScript and PDF
• A page in PostScript has to be ‘ripped’ and imaged before a subsequent page can be imaged (graphics state needs to be maintained)
• Any page of a PDF file can be displayed without needing to display earlier pages
• Device independent
• Self-contained, fonts embedded, identical rendering on all platforms
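The random-access property described here can be illustrated with a minimal sketch in plain Perl, using no PDF module at all. It writes a tiny two-page PDF by hand and records each object's byte offset in the cross-reference table, which is what allows a viewer to jump straight to any page object. The object layout and variable names are illustrative assumptions; this is not how PDFlib or any CPAN module builds its files.

```perl
use strict;
use warnings;

# Objects of a minimal two-page PDF (blank pages, illustrative only).
my @objects = (
    '<< /Type /Catalog /Pages 2 0 R >>',
    '<< /Type /Pages /Kids [3 0 R 4 0 R] /Count 2 /Resources << >> >>',
    '<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>',
    '<< /Type /Page /Parent 2 0 R /MediaBox [0 0 595 842] >>',
);

my $pdf = "%PDF-1.4\n";
my @offsets;
for my $i (0 .. $#objects) {
    push @offsets, length $pdf;    # byte offset where this object starts
    $pdf .= sprintf "%d 0 obj\n%s\nendobj\n", $i + 1, $objects[$i];
}

# The cross-reference table lists the offset of every object, so a reader
# can locate any page without processing the pages before it.
my $xref_pos = length $pdf;
$pdf .= sprintf "xref\n0 %d\n", scalar(@objects) + 1;
$pdf .= "0000000000 65535 f \n";
$pdf .= sprintf "%010d 00000 n \n", $_ for @offsets;
$pdf .= sprintf "trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n",
    scalar(@objects) + 1, $xref_pos;

# Jump directly to the second page (object 4) through its recorded offset.
my $page2 = substr $pdf, $offsets[3], 7;
print "object at offset $offsets[3] starts with: $page2\n";
```

A PostScript interpreter has no such table to consult, which is why it must process the pages in order.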
PostScript and PDF
• Ideal in a Web environment – as any page can be displayed at any time
• PDF files can be streamed
• Compact, reusable components within a PDF file
• Identical rendering across devices
• No interpretation required in PDF
Perl and PDF
• PDF::Create
• CAM::PDF
• PDF::API2
• PDF::API3
• PDF::Extract
• PDF::Xtract
• PDF::GetImages
• PDF::Template
• PDF::Reuse
• PDF::ReportWriter
• PDF::Table
• PDF::Parse
• PDF::Report
Several modules on CPAN
Perl and PDF (contd.)
• PDF::Core
• PDF::OCR2
• Fuse::PDF
• PDF::Burst
• PDF::Haru
• PDF::Imposition
• PDF::EasyPDF
• Image::Magick::Thumbnail::PDF
• PDF::Labels
• PDF::Tk
• PDF::Reuse::Barcode
• deletepdfpage.pl
• PDF::Boxer
• PDFlib
Special Mention
• PDF::API2
• PDF::API3
• CAM::PDF
• PDF::Haru
• Quite extensive
Approximate categorization of Perl PDF Modules
• Creation
• Repurposing
• Extraction
• Miscellaneous
• Creation: PDF::Create, PDF::Table, PDF::Report, PDF::Template, PDF::ReportWriter, PDF::Haru, PDF::EasyPDF, PDF::Labels, PDF::Boxer
• Repurposing: PDF::Reuse, PDF::Reuse::Barcode, PDF::Burst, Image::Magick::Thumbnail::PDF, PDF::Tk
• Content Extraction: PDF::Extract, PDF::Xtract, PDF::GetImages, PDF::OCR, PDF::OCR2
• Miscellaneous: PDF::Parse, PDF::Core, Fuse::PDF
• General purpose: PDF::API2, PDF::API3, CAM::PDF, pdflib
Approximate categorization of Perl PDF Modules – Creation
• PDF::Create
• PDF::Table
• PDF::Report
• PDF::Template
• PDF::ReportWriter
• PDF::Haru
• PDF::EasyPDF
• PDF::Labels
• PDF::Boxer
Approximate categorization of Perl PDF Modules – Repurposing
• PDF::Reuse
• PDF::Burst
• PDF::Imposition
• Image::Magick::Thumbnail::PDF
• PDF::Tk
Approximate categorization of Perl PDF Modules – Content Extraction
• PDF::Extract
• PDF::Xtract
• PDF::OCR
• PDF::OCR2
• PDF::GetImages
Approximate categorization of Perl PDF Modules – Miscellaneous
• PDF::Parse
• PDF::Core
• Fuse::PDF
Approximate categorization of Perl PDF Modules – General Purpose
• PDF::API2
• PDF::API3
• CAM::PDF
• pdflib
PDFlib is used in the rest of this talk to illustrate the various aspects
• From pdflib.com
• Commercial as well as open source
• Comprehensive, cross-platform, wide support for PDF versions, image formats, color spaces, text rendering, graphics etc.
• PDFlib (for creating PDF files)
• PDI (for repurposing existing PDF files)
• TET (for extracting text)
• pCOS (for accessing non-page data)
• PLOP (for linearizing PDFs)
Creating PDF files - PDFlib
use PDFlib::PDFlib 8.0;
my $p = PDFlib::PDFlib->new;
$p->set_parameter("compatibility", "1.7");
$p->set_parameter("license", "XYX");   # placeholder license key
$p->begin_document("output.pdf", "optimize");
# begin_page_ext takes width, height and an option list
$p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
# Create content here
$p->end_page_ext("");
$p->end_document("");
Creating linearized PDF files - PDFlib
use PDFlib::PDFlib 8.0;
my $p = PDFlib::PDFlib->new;
$p->set_parameter("compatibility", "1.7");
$p->set_parameter("license", "XYX");   # placeholder license key
$p->begin_document("output.pdf", "optimize linearize");
# begin_page_ext takes width, height and an option list
$p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
# Create content here
$p->end_page_ext("");
$p->end_document("");
Creating PDF files - PDFlib
Several options are available when creating a document and a page:
Document Options
• password, document open actions, openmode (bookmarks, thumbnails etc.), optimize
• permissions (noprint, nomodify etc.)
Page Options
• Specify artbox, cropbox etc.
• Width and height
• XMP Metadata
• and many more
Fonts and PDF
my $font_handle1 = $p->load_font("Courier", "", "");
my $font_handle2 = $p->load_font("Calibri", "", "");
• Searches for fonts in the resource search path (set with set_parameter)
• Unicode fonts are supported
Laying out Text
$p->setfont($font_handle1, 32);
$p->fit_textline("Your Text Here", $xpos, $ypos, "");
$p->fit_textline("More text", $xpos2, $ypos2, "");
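For comparison, roughly the same text layout can be sketched with PDF::API2, one of the open source CPAN modules named earlier in this talk. This is a minimal sketch that assumes PDF::API2 is installed; the coordinates, the font choice and the output file name are illustrative assumptions and it does not use the PDFlib API.

```perl
use strict;
use warnings;
use PDF::API2;

# Create a document with a single A4 page.
my $pdf  = PDF::API2->new();
my $page = $pdf->page();
$page->mediabox('A4');

# Courier is one of the standard PDF core fonts, so no font file is needed.
my $font = $pdf->corefont('Courier');

# A text object collects the text operations for this page.
my $text = $page->text();
$text->font($font, 32);

# Position is given in points from the lower-left corner of the page.
$text->translate(72, 720);
$text->text('Your Text Here');
$text->translate(72, 680);
$text->text('More text');

$pdf->saveas('output.pdf');
```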
Placing Images
my $image_handle = $p->load_image("auto", $image_file, "");   # third argument: image options
$p->fit_image($image_handle, $x, $y, "");                     # fourth argument: fitting options
• Many image formats such as JPG, TIF, BMP are automatically identified and loaded
• Multi-page TIF files are handled as well
• Black and white, grayscale and color images are handled
• Color profiles and many other parameters can be set
• Several positioning and fitting options
Barcodes in PDF files
• A number of barcode types are possible
• Two methods of placing barcodes:
– Font based
– Image based
• In the font-based method, load a barcode font (like a QRCode font) and place text in that font
• In the image-based method, load an image (of the barcode) and place the image on a page.
PDF-SamplesPunjabi.pdf
PDF-SamplesAssame.pdf
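In the font-based method, the text usually has to be prepared before it is set in the barcode font. As a minimal sketch, assuming a Code 39 font (a different symbology from the QRCode font mentioned above), the data is upper-cased, checked against the Code 39 character set and wrapped in the '*' start/stop characters such fonts commonly expect. The helper name code39_text and the sample value are hypothetical.

```perl
use strict;
use warnings;

# Prepare a string for display in a Code 39 barcode font.
sub code39_text {
    my ($data) = @_;
    $data = uc $data;

    # Code 39 encodes digits, upper-case letters, space and - . $ / + %
    # (tr with the c flag counts the characters outside that set).
    die "Character not encodable in Code 39: $data\n"
        if $data =~ tr{0-9A-Z .$/+%-}{}c;

    # '*' is the start/stop character and must not appear in the data itself.
    return "*$data*";
}

print code39_text('inv-2015'), "\n";   # prints *INV-2015*
```

The returned string is what would then be placed on the page with the barcode font selected.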
Metadata in PDF files (XMP Metadata)
• Concept of non-printable metadata in PDF files
• Some information such as Author, Date of Creation, Keywords can be set using the set_info call
• More extensive arbitrary data can be injected using the ‘XMP Metadata’ channel
• Possible in TIFF and JPEG files as well
Metadata in PDF files (XMP Metadata)
• XMP data can be placed at the document, page or image level
PDF-Samplessimple.txt
$p->begin_document($output_pdf, "metadata={filename={simple.txt}}");
Named Destinations – Navigating to specific pages in a PDF
$p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
$p->add_nameddest("Page1", "");   # second argument: destination options
$p->end_page_ext("");
file:///C:/Output.pdf#nameddest=Page1
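A link like the one above can be assembled programmatically. The following is a minimal sketch in plain Perl; the helper name pdf_link is hypothetical, and the page and zoom parameters are taken from Adobe's PDF open parameters rather than from this talk. Whether a particular viewer or browser honours these parameters varies.

```perl
use strict;
use warnings;

# Append PDF open parameters (nameddest, page, zoom, ...) to a URL fragment.
sub pdf_link {
    my ($url, %open) = @_;
    my @params = map { "$_=$open{$_}" } sort keys %open;
    return @params ? $url . '#' . join('&', @params) : $url;
}

# prints file:///C:/Output.pdf#nameddest=Page1
print pdf_link('file:///C:/Output.pdf', nameddest => 'Page1'), "\n";

# prints file:///C:/Output.pdf#page=3&zoom=150
print pdf_link('file:///C:/Output.pdf', page => 3, zoom => 150), "\n";
```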
JavaScript in PDF files
• JavaScript can be embedded in PDF files
• Actions can be tied to JavaScript code
• For example, when a page is displayed (opened), execute a function
PDF-Samplesbarcode_field.pdf
PDF-Samplesbarcode_field.pl
PDF/A
• Special version of PDF meant for long-term archival and retrieval
• Many interactive elements are not allowed
• No JavaScript, audio/video content, encryption etc.
• Standardized by ISO (ISO 19005) and supported by Adobe
• Applications in library archival systems, legal document archival and retrieval etc., where long-term compatibility of documents is crucial
• PDFlib can create such documents
Types of Page Boundaries
http://www.prepressure.com/pdf/basics/page-boxes
• Media Box – Specifies the width and height of the media (paper size)
• Crop Box – Area to which page contents are clipped (for display)
• Trim Box – Intended dimensions of the finished page (by default = Crop Box)
• All of these can be specified in PDFlib as page options in begin_page_ext
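Page boxes are rectangles given in points (1/72 inch). As a minimal sketch in plain Perl, the following computes a Media Box and a Trim Box for an A4 finished page printed on slightly larger media. The 10 mm margin and the helper name mm_to_pt are illustrative assumptions.

```perl
use strict;
use warnings;

# PDF user space is measured in points: 72 points per inch, 25.4 mm per inch.
sub mm_to_pt { my ($mm) = @_; return $mm * 72 / 25.4 }

my ($page_w_mm, $page_h_mm) = (210, 297);   # finished page size: A4
my $margin_mm = 10;                          # hypothetical margin around the finished page

# Media Box: the full sheet, i.e. the finished page plus the margin on every side.
my @media_box = (0, 0, map { mm_to_pt($_ + 2 * $margin_mm) } $page_w_mm, $page_h_mm);

# Trim Box: the finished page, offset by the margin from the lower-left corner.
my $m = mm_to_pt($margin_mm);
my @trim_box = ($m, $m, $m + mm_to_pt($page_w_mm), $m + mm_to_pt($page_h_mm));

printf "MediaBox: [%.2f %.2f %.2f %.2f]\n", @media_box;
printf "TrimBox:  [%.2f %.2f %.2f %.2f]\n", @trim_box;
```

The width of the Trim Box comes out at roughly 595 points and its height at roughly 842 points, the familiar size of A4 in PDF units.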
Bookmarks, Thumbnails
• Bookmarks and thumbnails can be created using PDFlib
$p->begin_page_ext(0, 0, "width=a4.width height=a4.height");
$p->create_bookmark("Bookmark Display Name", "{type fitwindow}");
$p->add_thumbnail($thumbnail_image_handle);
$p->end_page_ext("");
Controlling how a PDF file is first displayed in Adobe Acrobat or Reader
# Center the document window on the screen
$p->begin_document("output.pdf", "viewerpreferences=centerwindow");
# Alternatively, set the duplex printing preference
$p->begin_document("output.pdf", "viewerpreferences=duplex");
Dealing with encrypted PDF files
• Specify password in begin_document
Disabling the ability to print, cut/copy/save etc.
$p->begin_document("output.pdf", "permissions={noprint nomodify}");
Repurposing existing PDF files using PDI
• An existing PDF file can be read in and its content placed as is in an output document
my $input_doc_handle = $p->open_pdi($input_file, "", 0);
$p->begin_document($output_file, "");
$p->begin_page_ext($width, $height, "");
my $page_handle = $p->open_pdi_page($input_doc_handle, $page_no, "");
$p->fit_pdi_page($page_handle, 0, 0, $boxsize);   # fourth argument is an option list
$p->end_page_ext("");
$p->close_pdi_page($page_handle);
$p->close_pdi($input_doc_handle);
$p->end_document("");
PDF Files with embedded images
• Each page can have an image and nothing else
• Scanned images are typically combined into such PDF files
• Many options and possibilities to compress such images in a PDF file (from Adobe Acrobat as well as PDFlib)
• Text and other content can be overlaid on such files as well using PDFlib’s graphic operators
Converting PDF pages into other formats
• ImageMagick and PerlMagick
• Convert individual pages to images
• PDF::GetImages
Text Extraction with TET
Using the Text Extraction Toolkit (TET), text in a PDF file can be intelligently and reliably extracted
tet.exe --tetopt --pageopt="{{200 750 400 755}}" --xml line
(find text in the box 200,750,400,755, output as XML and recognize lines)
• A Perl binding for TET exists
• Note that this is not performing an ‘OCR’ operation – it is intelligently querying the PDF nodes
Non-printable data extraction using pCOS
Using the pCOS tool, non-printable data such as the number of pages, XMP metadata and Author/Creator/Date information can be extracted.
• Extract/check for bookmarks
• Extract ICC profiles
• Check for security problems
Important takeaways from this presentation
• There are too many PDF modules on CPAN, which is quite confusing
• None are complete (in my humble opinion)
• PDFlib (commercial, or its open source equivalent) and associated libraries such as TET and pCOS make an ideal toolset
• A lot more than text and graphics is possible with PDF files
Prabhakar Somu
+91 97048 71236 (Mobile India)
(908) 500 5902 (Mobile US)
Email: somup@zensys.com
Thanks for your attention!

Perl and PDF - YAPC::EU 2015 Presentation
