Office Open XML: a technical approach for OOo. OOoCon 2007, Barcelona, September 21st, 2007. Hubert Figuiere, Software Engineer, OpenOffice.org, Novell - hfiguiere@novell.com
Getting Started
What is Office Open XML? An XML-based office application file format, created by Microsoft for Microsoft Office 2007. ECMA standard 376. Proposed to ISO.
What Office Open XML is not: it is not OpenDocument (ISO 26300), nor the previous XML formats for Microsoft Office introduced in the last few MS-Office releases, nor an ISO standard, though it has been proposed.
Why support Open XML? Support = importing from (and/or exporting to). For interoperability with Microsoft Office 2007.
Overview of the format
The specification: available to anybody as ECMA standard 376, in 5 PDF documents: Fundamentals; Open Packaging Conventions; Primer; Markup Language Reference; Markup Compatibility and Extensibility. 173 + 129 + 472 + 5129 + 43 = 5946 pages.
The specification (cont.): some have printed it. OpenXML printed spec, photo by Pavel Janik: http://blog.janik.cz/archives/2007/05/19/T20_32_07/
“Packaging Conventions”: a zip file, the “Open Package”. It contains the main content and the embedded content. The same container is used for other Microsoft formats like XPS. It replaces the old OLE structured storage. In principle similar to OpenDocument, but not really.
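Since the package is a zip file and the older binary formats use OLE structured storage, a filter can tell the two apart from the first bytes of the file. The following is a minimal sketch of that check, not code from the OOX module: the function names are hypothetical, and it only recognizes a zip file that begins with a local file header (bytes 'P', 'K', 0x03, 0x04) and an OLE compound file by its 8-byte signature.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: compares the start of a buffer with a signature.
inline bool startsWith(const std::uint8_t* data, std::size_t size,
                       const std::uint8_t* sig, std::size_t sigSize)
{
    if (size < sigSize)
        return false;
    for (std::size_t i = 0; i < sigSize; ++i)
        if (data[i] != sig[i])
            return false;
    return true;
}

// True if the buffer starts with a zip local file header ("PK\x03\x04"),
// which is how a non-empty Open Package usually begins.
inline bool looksLikeZipPackage(const std::uint8_t* data, std::size_t size)
{
    static const std::uint8_t sig[] = {0x50, 0x4B, 0x03, 0x04};
    return startsWith(data, size, sig, sizeof sig);
}

// True if the buffer starts with the OLE compound file signature used by
// the old binary formats.
inline bool looksLikeOleStorage(const std::uint8_t* data, std::size_t size)
{
    static const std::uint8_t sig[] = {0xD0, 0xCF, 0x11, 0xE0,
                                       0xA1, 0xB1, 0x1A, 0xE1};
    return startsWith(data, size, sig, sizeof sig);
}
```

A real type detector would also look inside the package, since the zip signature alone does not distinguish Office Open XML from OpenDocument or any other zip file.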
Content: DrawingML (diagrams, charts, etc.); WordprocessingML (Word document); SpreadsheetML (Excel document); PresentationML (PowerPoint presentation; heavily relies on DrawingML).
Content (cont.): Relationships map embedded objects and set the relationships between fragments.
Content (cont.): VML (legacy format from Office 2000). Embedded objects: sound files, images; it can be anything! I have seen a PowerPoint document with an OpenDocument chart in an OLE container that was referenced from a slide.
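One way to picture the relationships: each part of the package has a small table that maps a relationship id used in the XML to a type and a target part. The sketch below is an assumption-laden illustration, not the OOX module's data structure. The class and member names are hypothetical, and only the three attributes (id, type, target) come from the packaging conventions.

```cpp
#include <map>
#include <string>

// One entry of a part's relationship table.
struct Relationship
{
    std::string type;   // URI naming the kind of relationship (e.g. an image)
    std::string target; // path of the related part or embedded object
};

// Hypothetical container: relationship id -> relationship.
class RelationshipTable
{
public:
    void add(const std::string& id, const std::string& type,
             const std::string& target)
    {
        maById[id] = Relationship{type, target};
    }

    // Returns nullptr when the id is not known for this part.
    const Relationship* find(const std::string& id) const
    {
        auto it = maById.find(id);
        return it == maById.end() ? nullptr : &it->second;
    }

private:
    std::map<std::string, Relationship> maById;
};
```

When the importer meets an id such as "rId1" in a fragment, it would look it up in the table of the part being read to find which embedded object or fragment to load next.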
OpenOffice implementation
Plans: implement a native filter for Office Open XML. Import (in progress). Export (Novell is committed to doing it). Split into 2 modules. Target is tentatively 2.4. Novell's “ooo-build” 2.3 has it: it ships with openSUSE 10.3 and will ship with other Linux distros.
Joint effort between Sun and Novell
“[...] a team of 5 developers will implement 25 handlers a week, which means that we'd have all the XML handlers written in 44 weeks. [...] Nevertheless, we’ve taken a little less than a year to get the converters reading the new file format. [...] This is just for Word.” -- Rick Schaut, Mac Office team, about implementing the Office 2007 importer for Word for Mac, December 2006. http://blogs.msdn.com/rick_schaut/archive/2006/12/07/open-xml-converters-for-mac-office.aspx
Microsoft released the beta version of the Word 2007 to RTF converter for MacOS in May 2007...
...and PowerPoint support was released July 31st, 2007.
Modules: Writerfilter: Word import; refactoring of the RTF and binary doc filter; see Fridrich Strba's presentation for all the details. OOX: Excel and PowerPoint, but not Word; CWS xmlfilter02; implements VML as well, called by the writerfilter if needed.
No XSLT: OOX is not an XSLT-based filter. It processes the XML as input into the OpenOffice.org internal model. Written in C++.
The fast SAX parser: 5568 tokens are listed in our code. String comparisons for tokens are slow. The fast SAX parser is designed to reduce the number of string comparisons by using a 32-bit hash for string tokens (including the XML namespace), and to offer that API through UNO. It lives in the sax module. Of course it is generic and could be used anywhere.
Fast parser details: hash tokens are generated by gperf at compile time, from a compile-time generated list (OOX). Each known string token is referenced by a const like XML_token. The XML namespace is in the high-order bits of the token, which allows selecting the namespace with a simple bit-mask.
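The idea of turning each element or attribute name into an integer once, so that handlers never compare strings again, can be sketched as follows. This is only an illustration: a std::unordered_map stands in for the perfect hash that gperf generates, and the token values, XML_TOKEN_INVALID and lookupToken are hypothetical names and numbers, not the ones in the sax module.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>

// Hypothetical token values; the real list is generated at compile time.
const std::int32_t XML_TOKEN_INVALID = -1;
const std::int32_t XML_lnSpc  = 1;
const std::int32_t XML_spcBef = 2;
const std::int32_t XML_spcAft = 3;

// Maps a name to its integer token with a single hashed lookup.
// Unknown names yield XML_TOKEN_INVALID.
inline std::int32_t lookupToken(const std::string& rName)
{
    static const std::unordered_map<std::string, std::int32_t> aTable = {
        {"lnSpc",  XML_lnSpc},
        {"spcBef", XML_spcBef},
        {"spcAft", XML_spcAft},
    };
    auto it = aTable.find(rName);
    return it == aTable.end() ? XML_TOKEN_INVALID : it->second;
}
```

The lookup happens once per name in the parser; everything downstream works on integers.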
Example:
    switch( aToken )
    {
        case NMSP_DRAWINGML|XML_lnSpc:   // line spacing
            break;
        case NMSP_DRAWINGML|XML_spcBef:  // spacing before
            break;
        case NMSP_DRAWINGML|XML_spcAft:  // spacing after
            break;
        default:
            break;
    }
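The switch relies on the namespace and the element token being packed into one integer. A self-contained sketch of that packing, under stated assumptions, could look like this: the 16-bit split, the numeric values, and the names NMSP_PPT, getNamespace, getLocalToken, Spacing and classifySpacing are all hypothetical and chosen for illustration; only NMSP_DRAWINGML and the XML_ constants appear in the slides.

```cpp
#include <cstdint>

// Assumed layout: namespace id in the high 16 bits, element token in the low 16.
constexpr std::uint32_t NMSP_SHIFT     = 16;
constexpr std::uint32_t NMSP_MASK      = 0xFFFF0000u;
constexpr std::uint32_t TOKEN_MASK     = 0x0000FFFFu;
constexpr std::uint32_t NMSP_DRAWINGML = 1u << NMSP_SHIFT;
constexpr std::uint32_t NMSP_PPT       = 2u << NMSP_SHIFT; // hypothetical second namespace

constexpr std::uint32_t XML_lnSpc  = 1;
constexpr std::uint32_t XML_spcBef = 2;
constexpr std::uint32_t XML_spcAft = 3;

// Selecting the namespace or the local name is a single bit-mask.
constexpr std::uint32_t getNamespace(std::uint32_t nToken)  { return nToken & NMSP_MASK; }
constexpr std::uint32_t getLocalToken(std::uint32_t nToken) { return nToken & TOKEN_MASK; }

enum class Spacing { None, Line, Before, After };

// One integer comparison per case replaces a namespace plus name string match.
inline Spacing classifySpacing(std::uint32_t aToken)
{
    switch (aToken)
    {
        case NMSP_DRAWINGML | XML_lnSpc:  return Spacing::Line;
        case NMSP_DRAWINGML | XML_spcBef: return Spacing::Before;
        case NMSP_DRAWINGML | XML_spcAft: return Spacing::After;
        default:                          return Spacing::None;
    }
}
```

The same local name in another namespace yields a different integer, so it falls through to the default case without any extra check.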
API: the OOX module only depends on the UNO API. We can't always get inspiration from the binary filters, which mostly use the internal APIs. Some UNO APIs are incomplete or missing; they need to be implemented.
The data model: the Office Open XML data model is very close to the one from the binary format.
“ [...] XLSX may be ugly, but its concepts were very familiar from XLS. We already had much of the code required to handle it.” -- Jody Goldberg about Gnumeric Excel 2007 support, http://blogs.gnome.org/jody/2007/09/10/odf-vs-oox-asking-the-wrong-questions/
Excel vs Calc: Excel 2007 has more features than Calc. Dealing with missing features in Calc: find a workaround; “downgrade” the data (a problem for round-trip conversions); implement the missing feature.
Excel 2007 vs Excel 2003: no notable new feature in the core. Overall structures are very similar: a shared string table contains the cell strings; the sheet protection options data contain the identical set of options; Autofilter uses internal cell range names (not visible to the user) that are identical in both xlsx and xls.
Excel 2007 vs Excel 2003 (cont.): overall structures are very similar (cont.): in both xls and xlsx formats, the pivot table record contains cached source data. Excel allows rich text and field objects in the header and footer, and they are encoded; in both xls and xlsx, the same encoding scheme is used.
PowerPoint vs Impress: Pixel-perfect rendering: people spend hours in airports refining their “PowerPoint”, so the import has to be perfect. SmartArt: this is a big feature in PowerPoint 2007. Animation / transition: both based on SMIL.
PowerPoint 2007 vs PowerPoint 2003: not many changes. SmartArt: saving in PowerPoint 2007 as binary PPT makes it an embedded OLE. Of course this requires having the engine.
DrawingML: a shared ML. Used directly by PresentationML. Encountered in WordprocessingML and SpreadsheetML documents. Defines styles, shapes, text, charts, diagrams, audio/video, etc. Supposed to be more functional than VML, and therefore to replace it.
VML: legacy Microsoft XML format. Still generated by the 2007 version of MS applications. Replaces the binary EMF for OLE. Used by annotations in Excel and a lot of drawing features in Word. Supposed to be superseded by DrawingML.
Alternative Implementations
odf-converter (Free Software): Microsoft-sponsored ODF to Office Open XML converter. XSLT based. Written in C# / .NET. Also runs with Mono (Free Software platform). Free Software (MIT-style license). Currently shipped by Novell for SUSE and Windows.
GNOME (Free Software): libgsf implements OpenPackage reading and writing. Gnumeric imports .xlsx files and exports .xlsx files (somewhat). AbiWord imports .docx. Both run on non-GNOME platforms like Windows.
“ The initial importer was written on the flight to London for the ECMA meeting, and export was added on the flight back. Toss in a few hours of debugging and the sample file [...] was under a week of effort to read and write.” -- Jody Goldberg about Gnumeric Excel 2007 support, http://blogs.gnome.org/jody/2007/09/10/odf-vs-oox-asking-the-wrong-questions/
Apple iWork '08 (non-Free): Pages: import and export .docx. Numbers: import and export .xlsx. Keynote: import and export .pptx.
Questions?
 
