Apache POI
           Recipes
           Paolo Mottadelli - ApacheCon Oakland 2009




  http://chromasia.com
Thursday, November 5, 2009
paolo@apache.org



   my to-do list




                             - ApacheCon US 2009, Oakland - Apache POI Recipes -




   POI @ Content Tech
      ✴ Document to application (and back)
               ✴ Publish data

               ✴ Build a doc from your content

      ✴ Know your documents
               ✴ Extract text

               ✴ Extract content



                             1
                             A-B-C




   POI modules (1): OLE2
      ✴ POIFS: r/w Office documents
      ✴ HSSF: r/w Excel spreadsheets
      ✴ HWPF: r/w Word docs
      ✴ HSLF: r/w PowerPoint docs
      ✴ HPSF: r/w property sets





   POI modules (2): OOXML
      ✴ XSSF: r/w OOXML Excel
      ✴ XWPF: r/w OOXML Word
      ✴ XSLF: r/w OOXML PowerPoint




POI 3.5




   OOXML dev status
      ✴ XSSF: Final in POI-3.5
      ✴ XWPF: Draft (basic features)
      ✴ XSLF: Not covered (text extraction only)








   HSSF & XSSF
      ✴ Common user model interface
      ✴ User model based on existing HSSF
      ✴ Using OpenXML4J and SAX




                             2
                             Same recipe,
                             different flavours




   Common H/XSSF access
      ✴ org.apache.poi.ss.usermodel
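
A minimal sketch of what the common access layer allows, assuming a POI release with both the `poi` and `poi-ooxml` jars on the classpath. The class name, sheet name and cell values are hypothetical; only the factory method knows which flavour is in use, and the rest is written against the `org.apache.poi.ss.usermodel` interfaces.

```java
import org.apache.poi.hssf.usermodel.HSSFWorkbook;
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

class CommonModelSketch {

    // Pick the flavour once; callers only ever see the Workbook interface.
    static Workbook newWorkbook(boolean ooxml) {
        return ooxml ? new XSSFWorkbook() : new HSSFWorkbook();
    }

    // Format-neutral code: works for .xls (HSSF) and .xlsx (XSSF) alike.
    static void fill(Workbook workbook) {
        Sheet sheet = workbook.createSheet("Recipes"); // hypothetical sheet name
        Row row = sheet.createRow(0);                  // rows are 0-based
        Cell name = row.createCell(0);
        name.setCellValue("POI");                      // string cell
        Cell version = row.createCell(1);
        version.setCellValue(3.5);                     // numeric cell
    }
}
```

To save the result, `workbook.write(OutputStream)` is available on the same interface, so the calling code does not change when switching between the two formats.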








   Upgrading to POI-3.5
      ✴ HSSFFormulaEvaluator.CellValue
               ✴ convert from .hssf. to .ss.

      ✴ HSSFRow.MissingCellPolicy
               ✴ convert from .hssf. to .ss.

      ✴ RecordFormatException in DDF (Dreadful Drawing Format)
               ✴ convert from .hssf. to .util.


                             3
                             Meet
                             Office Open XML



   Open XML, made (very) simple
      ✴ XML based
               ✴ WordprocessingML

               ✴ SpreadsheetML

               ✴ PresentationML

      ✴ Stored as a package
               ✴ Open Packaging Conventions







   Package concepts
      ✴ Package (the container)
      ✴ Part (xml file)
      ✴ Relationship
               ✴ package-relationship

               ✴ part-relationship
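
The container idea can be explored without POI at all, since an Open XML package is a ZIP archive. The following is a rough sketch using only the JDK; the class and method names are hypothetical. It lists the ZIP entries with a leading "/" (the way part names are written) and recognises relationship parts by the `_rels` folder and `.rels` suffix.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

class PackagePartsSketch {

    // Lists every file entry of the ZIP container, prefixed with "/".
    static List<String> listEntries(InputStream in) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            for (ZipEntry entry = zip.getNextEntry(); entry != null; entry = zip.getNextEntry()) {
                if (!entry.isDirectory()) {
                    names.add("/" + entry.getName());
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    // Relationship parts live in "_rels" folders and end in ".rels".
    static boolean isRelationshipPart(String name) {
        return name.contains("/_rels/") && name.endsWith(".rels");
    }

    // Demo helper (hypothetical): an in-memory ZIP holding empty entries.
    static byte[] emptyZipWith(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }
}
```

Pointing `listEntries` at a `FileInputStream` over a real .xlsx file should show entries such as "/_rels/.rels" (package-level relationships) next to the worksheet parts. In a POI application the OpenXML4J classes take care of this, so the sketch is only meant to make the concepts tangible.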








   Expanded package, Excel








   WordprocessingML
      ✴ body
               ✴ paragraphs
                      ✴ runs


      ✴ properties (for runs and paragraphs)
      ✴ styles
      ✴ headers/footers ...
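
The body / paragraph / run hierarchy maps directly onto the XWPF user model. Below is a small sketch, assuming a recent POI release with `poi-ooxml` on the classpath; the class name and the sample text are hypothetical. Each run carries its own character formatting, which is why one paragraph can mix plain and italic text.

```java
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;

class SimpleDocSketch {

    // Builds a two-paragraph document entirely in memory.
    static XWPFDocument build() {
        XWPFDocument doc = new XWPFDocument();

        // First paragraph: a single bold run.
        XWPFParagraph title = doc.createParagraph();
        XWPFRun bold = title.createRun();
        bold.setBold(true);
        bold.setText("Apache POI Recipes");

        // Second paragraph: two runs with different formatting.
        XWPFParagraph body = doc.createParagraph();
        XWPFRun plain = body.createRun();
        plain.setText("Paragraphs hold runs; ");
        XWPFRun italic = body.createRun();
        italic.setItalic(true);
        italic.setText("runs carry formatting.");

        return doc;
    }
}
```

Calling `doc.write(OutputStream)` on the result would produce a .docx file that a word processor can open.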





   SpreadsheetML
      ✴ workbook
               ✴ worksheets
                      ✴ rows

                             ✴ cells



      ✴ styles
      ✴ formulas
      ✴ images ...




   PresentationML
      ✴ presentation
               ✴ slides

               ✴ slide masters

               ✴ notes masters

      ✴ layout, animation, audio, video,
               transitions ...

                             4
                             openxml4j




   OpenXML4J
      ✴ Package, parts, rels
               ✴ part name example: "/xl/worksheets/sheet1.xml"




                             5
                             Text Extraction




   Extractors
      ✴ POITextExtractor: getText()
               ✴ POIOLE2TextExtractor

               ✴ POIXMLTextExtractor
                      ✴ XSSFExcelExtractor

                      ✴ XWPFWordExtractor

                      ✴ XSLFPowerPointExtractor


      ✴ If text is all you need
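
As an illustration of the extractor hierarchy above, here is a short sketch built on `XSSFExcelExtractor`, assuming `poi-ooxml` is on the classpath. The class and method names are hypothetical; the extractor constructor, `setIncludeSheetNames` and `getText()` are existing POI calls.

```java
import org.apache.poi.xssf.extractor.XSSFExcelExtractor;
import org.apache.poi.xssf.usermodel.XSSFWorkbook;

class TextExtractionSketch {

    // Returns the plain text of all sheets.
    // The caller keeps ownership of the workbook and closes it when done.
    static String textOf(XSSFWorkbook workbook) {
        XSSFExcelExtractor extractor = new XSSFExcelExtractor(workbook);
        extractor.setIncludeSheetNames(false); // cell text only, no sheet titles
        return extractor.getText();
    }
}
```

The Word and PowerPoint extractors follow the same pattern: wrap the document object in the matching extractor and call `getText()`.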





   Text extraction
      ✴ made simple




                             6
                             EXCEL
                             Simple Tasks




   New Workbook








   New Sheet








   Creating Cells








   Cell types








   Fills and colors




                             7
                             EXCEL
                             Imp/Exp to XML




   Export to XML








   xmlMaps.xml








   XML Import/Export




                             8
                             WORD
                             Simple Doc




   A simple doc




                             9
                             Use Case 1
                             Alfresco Search




   Use Case
      ✴ Upload a document
      ✴ Detect document mimetype
      ✴ Extract text and metadata
      ✴ Create search index
      ✴ Search (and find) the document






   Without Tika
      ✴ Detect the document mimetype
               ✴ (source/target mimetype)

      ✴ Get the proper ContentTransformer
               ✴ (ContentTransformerRegistry)

      ✴ Transform Doc Content to Text
               ✴ (PoiHssfContentTransformer): POI here

      ✴ Create Lucene index




   With Tika








   Extension use case
      ✴ Adding support for Office Open
               XML documents (Office 2007+)
               ✴ Word 2007+

               ✴ Excel 2007+

               ✴ PowerPoint 2007+








   POI text extractors
      ✴ Remember?








   Apache Tika (Excel)








   Apache Tika








   Apache Tika (Word)








   Apache Tika (Word)




                             10
                             Use Case 2
                             JM Lafferty
                             Financial Forecasting




   Make your wb look pro-
      ✴ Rich text
      ✴ Graphics
      ✴ Formulas & Named Ranges
      ✴ Data validations
      ✴ Conditional formatting
      ✴ Cell comments




   Formula evaluation
      ✴ The evaluation engine enables you
               to calculate formula results from
               within a POI application
      ✴ Formulas may be added to your
               workbook by POI
      ✴ Evaluation is available for .xls
               and .xlsx




   Formula evaluation (continued)
      ✴ All arithmetic operators are
               implemented
      ✴ Over 280 built-in Excel functions
               are supported
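
A hedged sketch of the evaluation engine in use, assuming a POI release with the `ss.usermodel` interfaces. The class name, method name and sheet layout are hypothetical. It writes two numbers, adds a formula cell and asks a `FormulaEvaluator` for the computed value; the same code runs against an `HSSFWorkbook` or an `XSSFWorkbook`.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.CellValue;
import org.apache.poi.ss.usermodel.FormulaEvaluator;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;

class FormulaEvalSketch {

    // Puts a and b into A1 and B1, SUM(A1:B1) into C1, and returns the evaluated result.
    static double evaluateSum(Workbook workbook, double a, double b) {
        Sheet sheet = workbook.createSheet("calc"); // hypothetical sheet name
        Row row = sheet.createRow(0);
        row.createCell(0).setCellValue(a);
        row.createCell(1).setCellValue(b);

        Cell formula = row.createCell(2);
        formula.setCellFormula("SUM(A1:B1)"); // formula text has no leading '='

        // The evaluator is obtained from the workbook, so it matches the file format.
        FormulaEvaluator evaluator = workbook.getCreationHelper().createFormulaEvaluator();
        CellValue result = evaluator.evaluate(formula);
        return result.getNumberValue();
    }
}
```

`evaluate` leaves the cell itself untouched and only returns the computed value, which makes it a reasonable choice when the workbook is just being read.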








   Formula evaluation (code)




                             11
                             Use Case 3:
                             CQ5 Import




   importDocument()








   getParagraphs(...)
      ✴ Makes use of
               ✴ org.apache.poi.hwpf.usermodel.Range








   importDocument()








   getTitle(...)
      ✴ Gets the first paragraph’s text








   importDocument()




                             12
                             Want more?




   More Examples
      ✴ http://poi.apache.org/spreadsheet/examples.html








   Even more
      ✴ Get in touch
               ✴ http://poi.apache.org/

      ✴ Get informed
               ✴ dev@poi.apache.org

      ✴ Get involved
               ✴ http://svn.apache.org/repos/asf/poi/trunk/






      ✴ Get slides
               ✴ http://www.slideshare.net/paolomoz/apache-poi-recipes




   Thanks

