# 302 sargent word2007-ssp2008

• This talk describes and demonstrates how Unicode’s rich mathematical character set, combined with OpenType font technology, TeX’s mathematical typography principles, and enhanced autocorrection, can be used to produce high-quality, streamlined technical text processing in Word 2007.
• This project was considerably harder than any of us imagined it would be. Mathematical typography is very intricate and varied, and making it work in an international rich-text environment encounters many complications one might not expect. On the other hand, that environment offers many advantages too. Mathematical expressions are always entered into math zones. These zones are regions of text like those in between $’s or $$’s in TeX, but are handled by a character format run attribute in our approach.
• Infrastructures outside and inside of Microsoft have emerged to enable major advances in the editing and display of mathematical formulae. While TeX has been stable since about 1986 (last major changes were in 1990), most of the other infrastructures have become available only recently.
• TeX (see The TeXbook, by Donald Knuth), a widely used document preparation program, provides both fundamental examples and many specifications for our new math editing and display facility. TeX is the dominant technical document preparation program today, used to typeset technical books and journals throughout the world. It’s also used widely on the web to display technical documents, either in TeX or PDF form. Experts and users alike agree that the typography used is excellent and sufficient to meet their needs. The program allows the user or copy editor to tweak settings to match end preferences. TeX’s input method can be used with any plain-text editor. While easy to use in principle, the method becomes awkward for complicated mathematical formulae. In addition, one of TeX’s strengths (easy definition of macros) is also a problem when it comes to interchange. The TeXbook is a user manual that includes a detailed specification for mathematical typography. We have used many of its choices and methodology in creating our solutions, which are appropriately enhanced with the use of OpenType tables and some additional constructs.
Although the TeX source code is available, it cannot be used directly for several reasons. First, the code is like a web rather than being hierarchical and uses many global variables. This makes it cumbersome to employ in the instance-oriented contexts used at Microsoft. Complicating this is that TeX is a complete document imaging system, not one limited to mathematics. As such, many aspects of the program that are used for mathematics are used also for other kinds of layout like headers, footers, figures, and footnotes. Extricating the mathematical algorithms from this web of code would be significantly harder than recreating the desired display quality using our own methodologies and the specifications given in The TeXbook. Furthermore we want to take advantage of our OpenType math fonts to obtain better positioning of subscripts, superscripts, and other symbols than possible by default using TeX. Another complication is that Office is an international environment and our math facility needs to be compatible with all languages that we support, potentially simultaneously. Limitations on screen display quality are discussed in later slides.
• Unicode is a character encoding system that Knuth would have loved to have had when he and his students developed TeX. Unicode 5.0 contains all standard mathematical characters used in print today. This includes about 2000 characters plus all the combinations that can be made with combining marks. As such Unicode provides an excellent foundation for technical documents, significantly better than the character sets used in TeX itself. In particular, all of TeX’s characters are included in Unicode or in glyph variants thereof.
• See http://www.unicode.org/charts for displays of all characters in Unicode 4.0. This slide shows some of the Miscellaneous Mathematical Symbols-B, range U+2980 – U+29FF. For information about the Unicode math characters, see B. Beeton, A. Freytag, M. Sargent III, Unicode support for mathematics, http://www.unicode.org/reports/tr25/ (2003).
• Mathematical notation uses a basic set of mathematical alphanumeric characters which consists of:
  - set of basic Latin digits (0 - 9) (U+0030 – U+0039)
  - set of basic upper- and lowercase Latin letters (a - z, A - Z)
  - uppercase Greek letters Α - Ω (U+0391 – U+03A9), plus the nabla ∇ (U+2207) and the variant of theta Θ given by U+03F4
  - lowercase Greek letters α - ω (U+03B1 – U+03C9), plus the partial differential sign ∂ (U+2202) and the six glyph variants of ε, θ, κ, φ, ρ, and π, given by U+03F5, U+03D1, U+03F0, U+03D5, U+03F1, and U+03D6.
Only unaccented forms of the letters are used for mathematical notation, because general accents such as the acute accent would interfere with common mathematical diacritics. Examples of common mathematical diacritics that can interfere with general accents are the circumflex, macron, or the single or double dot above, the latter two of which are used in physics to denote derivatives with respect to the time variable. Mathematical symbols with diacritics are always represented by combining character sequences, except as required by normalization. In addition to this basic set, mathematical notation also uses the four Hebrew-derived characters (U+2135 – U+2138). Occasional uses of other alphabetic and numeric characters are known. Examples include U+0428 cyrillic capital letter sha, U+306E hiragana letter no, and Eastern Arabic-Indic digits (U+06F0 – U+06F9). However, these characters are used in only the basic form.
• Generally the math alphanumerics substantially reduce the verbosity of markup, although one can construct cases that aren’t so verbose.
But a markup representation is poor for several reasons: 1) it complicates a search for a bold italic a, since the search engine needs to understand the bold and italic tags or attributes and dissect the tag contents, 2) it doesn’t tag the characters individually as math identifiers, which is a MathML requirement, and 3) it introduces complexity into the tag model by introducing multiple variable identifier tags. The last of these disadvantages can be overcome by representing the nature of the variables with attributes, e.g., <mi style=bolditalic>, but this approach is quite verbose for items as small as math characters. Admittedly this approach is necessary to handle (quite rare) alphanumeric math symbols that aren’t included in the math alphanumeric block. Searching for such symbols requires a sophisticated attribute-aware search engine since simple plain-text search engines would yield many undesired search hits.
• Mathematics has need for a number of Latin and Greek alphabets that on first thought appear to be just font variations of one another, e.g., normal, bold, italic and script H. However in any given document, these characters have distinct mathematical semantics. For example, a normal H represents a different variable from a bold H, etc. If one drops these distinctions in plain text, one gets gibberish. The next slide shows that instead of the well-known Hamiltonian formula ℋ = ∫dτ(ϵE² + μH²), you’d get the integral equation H = ∫dτ(ϵE² + μH²). Accordingly, Unicode encodes bold, italic, script, etc., Latin and Greek alphabets as mathematical alphanumeric characters. Straight encoding leads to 996 characters. They allow plain text to retain the proper character semantics and simple (nonrich) search methods to work. For example when you want to search for a script upper-case H math variable, you don’t want to find any other kind of H.
• The World Wide Web Consortium W3C recognized the need for a format for representing scientific and technical information.
In fact, the HTML 3.0 working draft (1994) included a proposal for HTML Math from Dave Raggett. In March, 1997, the W3C HTML Math working group was formally constituted. The first product of the W3C HTML Math working group was the Mathematical Markup Language (MathML). MathML 1.0 was released as a W3C Recommendation in April, 1998. As the first W3C endorsed XML application, MathML is a low-level format for describing mathematics. MathML provides a much needed foundation for the inclusion of mathematical expressions in Web pages and as a common encoding for scientific processors. Indeed, MathML facilitates the use and re-use of scientific content. The MathML 2.0 specification also provides a wealth of information about putting math on computers.
• Each MathML element falls into one of three categories: presentation elements, content elements and interface elements. Just as titles, sections, and paragraphs capture the higher-level syntactic structure of a textual document, presentation elements are meant to express the syntactic structure of math notation. Content elements describe mathematical objects directly, as opposed to describing the notation which represents them. Presentation MathML specifies how to display mathematical formulae, but it doesn’t specify the content unambiguously. Here the 2 is a square, known to most everyone. But such notation can also be used as an index. The corresponding content markup specifies the two cases unambiguously.
• Content MathML unambiguously defines the meaning of expressions. But it doesn’t specify how to display such expressions. It is possible to give both content and presentation forms for expressions using the <semantics> tag.
• See MathML 2.0 Section 7.2.3, Attributes for unspecified data. One could put WordProcessingML or DrawingML in attributes or inside <annotation-xml>.
• The linear format is by far the simplest, but it’s not XML.
• Math information is collected into two areas: 1) Document default math properties in the {\mmathPr…} group, and 2) Math zones in {\mmath…} groups. A math zone is a text range within which math typography rules usually apply and outside of which math typography rules do not apply. Math zones can contain specially marked normal text runs for which math typography rules don’t apply (see \mnor). With Office math, math zones are identified internally by a character-format effect bit like bold. Hence if you delete the ordinary text separating two math zones, you get a single merged math zone. Math zones can be inline or display, corresponding to TeX’s $ and $$ toggle keys. If a math zone fills an entire paragraph, it is a display math zone, i.e., it is displayed on its own line(s). If a math zone is preceded and/or followed by nonmath text other than a \par, the math zone is inline and is rendered in a more compressed fashion. Inline math zones usually consist of math expressions or variables, whereas display math zones usually consist of one or more equations or formulas. The RTF for the content of an inline math zone replaces the first ellipsis of the nested group structure {\mmath {\*\moMath…}{\mmathPict…}}. Readers that do not understand the ignorable {\*\moMath…} group can use one of the pictures in the {\mmathPict…} group. The RTF for the content of a display math zone replaces the second ellipsis in the nested group structure {\mmath{\*\moMathPara{\moMathParaPr…}{\*\moMath…}+}{\mmathPict…}}. Here the + means that a {\*\moMath…} group is emitted for each instance of mathematical text that should start on a new line, e.g., for each new equation. The control word \moMathPara stands for a “math paragraph”, which can contain multiple equations with various alignment and breaking options.
A math paragraph may be part of a text paragraph (text ending in a \par and either starting a document or following a \par). In general, a text paragraph can contain multiple math paragraphs separated from one another by lines of normal text. In this discussion, we see that math RTF uses two ways to assign property values depending on the property: 1) the standard RTF way with a parameter N as in \msty2, and 2) using a mini group like {\mtype skw}. The latter way is inspired by the corresponding OMML syntax, such as <m:type m:val="skw"/>, while the RTF way is more succinct. For detailed information see the RTF Specification, Version 1.9.1.
• Mathematics is the product of a myriad ingenious minds and many notational variations are in use. We have attempted to support most of these variations.
• Rigorous math spacing is essential for high quality mathematical typography. In the simplest cases, such as an equation like a = b + c, the variables a, b, and c are represented by Unicode math-italic letters and the operators are separated from the letters by spacing chosen according to a set of rules specified in Chap. 18 of The TeXbook. In more complicated equations, special “built-up” math-handler objects are used to place the glyphs in the correct places. These objects allow the math handler in conjunction with the math font to place glyphs as TeX would along with automating a number of spacing refinements that TeX delegates to the user. The objects are summarized in a later slide. The MathML 2.0 specification also has math spacing information.
• Math ribbons and handwriting recognition are beyond the scope of this talk.
• A handy hex-to-Unicode entry method works with WordPad 2000/XP, Office 2000/XP edit boxes, RichEdit controls in general, and in Microsoft Word starting with Word 2002. Basically you type a character’s hexadecimal code (in ASCII), making corrections as need be, and then type Alt+x. Presto! The hexadecimal code is replaced by the corresponding Unicode character. The Alt+x can be a toggle (as in Microsoft Word 2002). That is, type it once to convert the hex code to a character and type it again to convert the character back to a hex code. If the hex code is preceded by one or more hexadecimal digits, you need to “select” the code so that the preceding hexadecimal characters aren’t included in the code. The code can range up to the value 0x10FFFF, which is the highest character in the 17 planes of Unicode. The only problem with this approach is that some programs use Alt+x for something else (like quit) or the keyboard doesn’t have direct access to ASCII alphabetics.
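The Alt+x toggle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Word's or RichEdit's implementation: the function name `alt_x` is hypothetical, and the sketch assumes the hex code is simply the longest run of hex digits at the end of the text.

```python
import re

MAX_CODE_POINT = 0x10FFFF  # highest code point in the 17 planes of Unicode


def alt_x(text: str) -> str:
    """Toggle the end of `text` between a hex code and the character it names.

    Hypothetical helper: assumes the code is the longest trailing run of
    hex digits, which is why preceding hex digits would need to be excluded
    by selection in the real feature.
    """
    match = re.search(r"[0-9A-Fa-f]+$", text)
    if match:
        value = int(match.group(), 16)
        # Skip 0, out-of-range values, and surrogate code points.
        if 0 < value <= MAX_CODE_POINT and not 0xD800 <= value <= 0xDFFF:
            return text[:match.start()] + chr(value)
    if text:
        # No usable hex code: convert the last character back to its hex code.
        return text[:-1] + format(ord(text[-1]), "X")
    return text


print(alt_x("x = 222B"))         # hex code becomes the integral sign
print(alt_x(alt_x("x = 222B")))  # a second Alt+x restores the hex code
```

Because the sketch grabs every trailing hex digit, typing `a2207` would be read as the single code A2207, which is the situation the notes handle by selecting only the intended digits.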
• You can add autocorrect entries using the Tools/Autocorrect Options dialog. Type what you want replaced in the “Replace:” dialog and what you want it replaced with in the “With:” dialog. You can put mathematical expressions in linear form into the “With:” dialog. Then when the replace text is encountered, it will be replaced by a built-up form of the replacement text.
• It’s possible to define a “plain text” encoding that often looks like mathematics. Some constructs require some simplified markup, but many expressions are literally plain (Unicode) text. The notation is handy as a math input language for more elaborate markup languages like TeX and MathML and can be used in its own right. We define a simple operand to consist of all consecutive alphanumeric characters. We call this sequence of one or more alphanumeric characters a span of alphanumerics. As such, a simple numerator or denominator is terminated by any operator, including, for example, arithmetic operators, the blank operator U+0020, and all Unicode characters with codes U+22xx. The fraction operator is the ASCII forward slash U+002F.
• For more complicated operands, such as those that include operators, parentheses ( ), brackets [ ], or braces { } can be used to enclose the desired character combinations. If parentheses are used and the outermost parenthesis set is preceded and followed by operators, that set is not displayed in built-up form, since usually one doesn’t want to see such parentheses. So the plain text (a + b)/c displays as shown in the slide. In practice, this approach leads to a linear text that is significantly easier to read than TeX’s, e.g., {a + c \over d}, since in many cases, outermost parentheses are not needed, while TeX requires { }’s except for single letters. To force the display of an outermost parenthesis set, one encloses the set, in turn, within parentheses, which then become the outermost set. A really neat feature of this notation is that the linear text is, in fact, a legitimate mathematical notation in its own right, so it’s relatively easy to read. I plan to submit the full linear format as a Unicode Technical Note.
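The outermost-parenthesis rule for fractions can be illustrated with a small Python sketch. It is a simplification under stated assumptions, not the build-up algorithm used in Word: the helper names are hypothetical, the whole input string is assumed to be a single fraction, and LaTeX's `\frac` is used only as a convenient way to write down the built-up result.

```python
def strip_outer_parens(operand: str) -> str:
    """Drop one outermost parenthesis pair if it encloses the whole operand."""
    if not (operand.startswith("(") and operand.endswith(")")):
        return operand
    depth = 0
    for i, ch in enumerate(operand):
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth == 0 and i < len(operand) - 1:
                return operand  # the first "(" closes early, e.g. "(a)+(b)"
    return operand[1:-1]


def build_fraction(linear: str) -> str:
    """Convert one linear-format fraction such as "(a + b)/c" to \\frac form."""
    depth = 0
    for i, ch in enumerate(linear):
        if ch in "([{":
            depth += 1
        elif ch in ")]}":
            depth -= 1
        elif ch == "/" and depth == 0:
            # Split at the first top-level slash and hide outermost parentheses.
            numerator = strip_outer_parens(linear[:i].strip())
            denominator = strip_outer_parens(linear[i + 1:].strip())
            return r"\frac{%s}{%s}" % (numerator, denominator)
    return linear  # no top-level fraction operator found


print(build_fraction("(a + b)/c"))    # outermost parentheses are not displayed
print(build_fraction("((a + b))/c"))  # doubled parentheses force display of one set
```

Doubling the parentheses keeps one set visible, which matches the rule in the notes for forcing the display of an outermost parenthesis set.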
• Nature isn’t so kind with subscripts and superscripts, but they’re still quite readable. Specifically, we introduce a subscript by a subscript operator _ which we display as a subscripted down arrow. Similarly we introduce a superscript with a superscript operator ^, which we display as a superscripted up arrow. The subscript itself can be any operand as defined above. Another compound subscript is a subscripted subscript, which works using right-to-left associativity. This associativity can be overruled using parentheses as described for fractions. If you use Unicode’s built-in subscripts and superscripts, they should be rendered to look the same as if they had been represented by the corresponding general subscript/superscript markup. The numeric subscripts and superscripts are often used and can streamline the look of technical plain text.
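To illustrate the equivalence between Unicode's built-in numeric subscripts/superscripts and the general `_` and `^` operators, here is a minimal Python sketch. The function name `expand_scripts` is hypothetical, and only the digit subscripts and superscripts are covered.

```python
import re

# Unicode superscript and subscript digits mapped to their ASCII counterparts.
SUPERSCRIPT_DIGITS = str.maketrans("⁰¹²³⁴⁵⁶⁷⁸⁹", "0123456789")
SUBSCRIPT_DIGITS = str.maketrans("₀₁₂₃₄₅₆₇₈₉", "0123456789")


def expand_scripts(text: str) -> str:
    """Rewrite runs of built-in script digits as ^ or _ followed by the digits."""
    text = re.sub(
        "[⁰¹²³⁴⁵⁶⁷⁸⁹]+",
        lambda m: "^" + m.group().translate(SUPERSCRIPT_DIGITS),
        text,
    )
    text = re.sub(
        "[₀₁₂₃₄₅₆₇₈₉]+",
        lambda m: "_" + m.group().translate(SUBSCRIPT_DIGITS),
        text,
    )
    return text


print(expand_scripts("E = mc²"))   # E = mc^2
print(expand_scripts("a₁₀ + x³"))  # a_10 + x^3
```

Since a simple operand is a span of alphanumerics, a run such as `10` after `_` needs no enclosing parentheses.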
• A large community of technically oriented people have TeX input “in their fingers”. In addition, this kind of input is easy to describe and appears in many readily available books. The problem is that it becomes cumbersome to work with in plain text for formulae that have much complexity. However this problem goes away in our environment thanks to autocorrect in combination with formula autobuildup. Essentially the user sees the formulae automatically build up on the screen as s/he types them in. This contrasts remarkably with the traditional TeX scenario, in which the user always edits the full original text in TeX’s linear format. To get an idea of how simple the new approach is, consider the following. In TeX a user types \delta to see δ in print. With autocorrect and the right autocorrect data file (even Word 97 autocorrect) as soon as a blank or punctuation symbol is typed after the a in \delta, the Greek letter δ appears on the screen. No need to wait for a printout or preview. Similarly with the formula autobuildup facility, one can type in integrals with \int, fractions, square roots, etc., and see them displayed in built-up form on the screen instead of the relatively complicated way they appear when typed in. You never have to search the original plain text input to find where to edit. You just point and click at the right place in a formula and edit as desired. Typically such WYSIWYG editing is preferred once a formula is built up and you can use formula autobuildup wherever you want to, including inside built-up formulas. You can also toggle back to the linear format if that makes things easier, e.g., in converting a fraction to something else. A complete mathematical expression can be entered in linear form into an autocorrect target. The formula autobuildup mechanism automatically builds such expressions up as they are entered.
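The autocorrect behavior described above, where \delta turns into δ as soon as a blank or punctuation mark is typed, can be sketched as follows. The table contents and the names `AUTOCORRECT`, `type_char`, and `type_text` are assumptions made for illustration; the real autocorrect data file and its lookup logic are not shown in these notes.

```python
# Hypothetical sample of an autocorrect table: TeX-style control word -> character.
AUTOCORRECT = {
    r"\alpha": "α",
    r"\delta": "δ",
    r"\int": "∫",
    r"\sqrt": "√",
}


def type_char(buffer: str, ch: str) -> str:
    """Append a typed character, replacing a trailing control word on a terminator."""
    is_terminator = not ch.isalnum() and ch != "\\"
    if is_terminator:
        start = buffer.rfind("\\")
        if start != -1 and buffer[start:] in AUTOCORRECT:
            buffer = buffer[:start] + AUTOCORRECT[buffer[start:]]
    return buffer + ch


def type_text(keys: str) -> str:
    """Feed a whole string through type_char, one keystroke at a time."""
    buffer = ""
    for ch in keys:
        buffer = type_char(buffer, ch)
    return buffer


print(type_text(r"\delta "))       # the control word is replaced when the blank arrives
print(type_text(r"x\delta y\int "))
```

Unknown control words are left untouched, so a typo stays visible in the text where it can be corrected.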
• The space bar is the easiest key to hit on the keyboard and we make extensive use of it.
• Unicode has a variety of spaces that can be used in mathematical text. Fonts need to show no glyph for these.
• Note that many characters that are not operators in algebra nevertheless behave as operators in the linear format, namely all characters of the category concatenation. This includes space characters, along with arithmetic operators like +, *, =, etc. Note also that the absolute-value and norm operators don’t appear in the table, since they require a slightly more complicated formalism to handle (sometimes a ‘|’ acts like an opOpen and sometimes like an opClose). Similarly period and comma don’t appear, since when sandwiched between ASCII digits they are treated as part of an operand, while otherwise they have a precedence of 4.
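The special handling of period and comma can be shown with a small tokenizer sketch in Python. The regular expression and the function name `tokenize` are assumptions made for this illustration; they cover only ASCII digits and letters and are not the linear format's actual grammar.

```python
import re

# A number keeps periods and commas that are sandwiched between ASCII digits;
# letters form alphabetic spans; any other non-space character is an operator.
TOKEN = re.compile(r"[0-9]+(?:[.,][0-9]+)*|[A-Za-z]+|\S")


def tokenize(text: str) -> list:
    """Split linear-format text into operand and operator tokens."""
    return TOKEN.findall(text)


print(tokenize("1,234.5/x, y"))
```

Here the comma inside `1,234.5` stays part of the operand, while the comma after `x` comes out as a separate operator token.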
• In this model, math layout is performed by a collaboration between four entities: 1) a Unicode rich-text processing program such as Word or RichEdit, 2) the math handler built into the latest version of the Microsoft text layout component, 3) the math font, and 4) the math-font handler. This collaboration is invoked whenever text inside a math zone needs to be displayed. All such text is rendered using appropriate glyphs with measurements dependent on the glyph ascents, descents, and widths. In the simplest cases, such as an equation like a = b + c, the variables a, b, and c are represented by Unicode math-italic letters and the operators are separated from the letters by spacing chosen according to a set of rules specified in The TeXbook.
• The math handler is also capable of breaking equations into multiple lines either automatically or by user defined breaks. This feature is valuable particularly on screen, where window widths tend to change readily, making the hand breaking used for paper less successful. While no special font properties are required for this feature, the client backing store has to support the concept of a “math paragraph”. Word 2007 implements the line breaking functionality, but postponed equation-numbering. One can get equation numbers using tables.
• In more complicated equations, special “built-up” math-handler objects are used to place the glyphs in the correct places. These objects allow the math handler in conjunction with the math font to place glyphs as TeX would along with automating a number of spacing refinements that TeX delegates to the user. The Line Services math objects are:
  - Accent: Display accent over base character(s)
  - Box: Give properties to base
  - Boxed formula: Display borders and/or lines through base
  - Delimiters: Enclose base in parens, brackets, braces, etc.
  - Delimiters with separators: Enclose bases separated by separator character, such as a vertical bar
  - Equation array: Display set of horizontally aligned equations
  - Fraction: Display normal or small built-up fraction
  - Function apply: Display trigonometric and other functions with function name and base
  - Left subsup: Prefix a subscript and/or superscript to base
  - Lower limit: Display limit below base
  - Matrix: Display matrix with n columns and m rows
  - n-ary: Display large n-ary operator with a base and optional upper and lower limits
  - Operator character: Used internally to give proper spacing to operators
  - Overbar: Display bar over base (boxed formula special case)
  - Phantom: Suppress any combination of base ascent, descent, width, display, or transparency
  - Radical: Display square and nth roots
  - Slashed fraction: Display slashed or built-up linear fraction
  - Stack: Display first argument over second (like fraction w/o bar)
  - Stretch stack: Display stretchable character above/below base or a limit above/below stretchable base character
  - Subscript: Display subscript relative to base
  - Subsup: Display subscript and superscript relative to base
  - Superscript: Display superscript relative to base
  - Underbar: Display bar under base (boxed formula special case)
  - Upper limit: Display limit above base
• Cambria Math contains a full set of glyph variants that have a heavier weighting so that when scaled down to the first script level (about 71% of text size) the stem widths match those of the text level glyphs. Prime (U+2032) and multiple primes need to be superscripted and scaled down accordingly. Dotless i and j are automatically used in the bases of accent objects. When putting an accent over capital letters, partially flattened glyph variants are used. Furthermore the glyph variants are requested to have sufficient width to cover the accent base, which may consist of more than one character. Brackets, braces, parentheses and other growable characters have a number of larger glyph variants as well as arbitrarily large size created using glyph assemblies. When the assemblies are displayed, the pieces are clipped to prevent overlap, which would create ClearType artifacts. According to a document setting, the italic open-face characters 0x2145 - 0x2149 (differentials, e, i, j) can be displayed as themselves (useful for patent applications) or with the corresponding math italic or corresponding ASCII letters. Serifed italic glyphs are used for these in most math publications, but serifed upright glyphs are used in some European math publications. The use of the differential d (U+2146) automatically introduces a small space between it and the preceding character if that character is alphabetic. Right-to-left math requires mirroring the images of parentheses, integrals, square roots, arrows, etc. Many such mirror images can be obtained by using corresponding Unicode characters. For example the mirror image of a left parenthesis is a right parenthesis and vice versa. But Unicode doesn’t have many characters that are mirror images of other characters, such as integral signs and square roots. Furthermore it seems that a glyph variant approach for these characters makes more sense than adding characters to serve as the mirror images. 
Other approaches include using world transforms and mirrored bitmaps. The present version of our software doesn’t handle true right-to-left math. Math zones in right-to-left paragraphs are treated as left-to-right objects, with all characters in the math zone being strong left-to-right except those defined by Unicode to be strong right-to-left.
• New Unicode 4.0 Math Fonts have been developed both at Microsoft as well as by the STIX committee, which played a key role in generalizing Unicode to include all standard math characters. Our new math facilities have been developed along with the Cambria Math font, influencing one another to obtain ideal results. The Cambria Math effort was managed by Geraldine Wade and Michael Duggan together with Tiro Typeworks. Andrei Burago and Sergey Malkin also contributed in key ways. Cambria Math is part of a TrueType collection that also includes Cambria, Cambria Italic, Cambria Bold, and Cambria Bold Italic. High-quality low-resolution screen display is very important for the way people work with documents in the Internet age: most documents are perused on screen and only printed for purposes of detailed examination. This is a major advantage of our math system.
• Here’s a list of 202 other characters needed to complete the math character set (from UTR #25; includes the circled single digits and 52 circled alphabetics and six parenthesized alphabetics that I use in the math linear format):   232C..232E, 23E1..23E7, 2460..2468, 24A9, 249D, 249E, 24A8, 24AD, 24B1, 24B6..24EA, 25A2, 25AA..25AB, 25B2, , 25B4..25B9, 25BC..25BF, 25C0..25C3, 25C6..25C7, 25C9, 25CE..25CF, 25E6, 25EF, 25FB..25FE, 2605..2606, 2609, 26AA..26AC, 2772..2773, 27C0..27C9, 27CC, 2B00..2B03, 2B05, 2B08..2B0C, 2B0E..2B19, 2B1B..2B54   The circled/parenthesized characters in the new math linear format are: 24A9, 249D, 249E, 24A8, 24AD, 24B1, 24B7, 24B8, 24C1, 24C3, 24C9, 24D1, 24D2.   The geometric characters in UTR #25 Table 2.5 need to be sized appropriately. We probably should brainstorm about these and compare sizes in Cambria Math, STIX and UTR #25. Also there’s a passionate user who’s written up comments (http://www.unicode.org/~rick/Chastney-Phillip-Shapes-II.pdf) on this.   We also ought to discuss the glyph variants in UTR #25 tables 2.7—2.9. Presumably these can be accommodated using shaping.
• The new font tables enable one to position subscripts and superscripts horizontally better than TeX as well as having richer glyph choices for operators like the integral sign, square root, and growable brackets. The tables include parameters such as the em-size-dependent sub/superscript values:
  LONG lSubscriptShiftDown;
  LONG lSubscriptTopMax;
  LONG lSubscriptBottomDropMin;
  LONG lSuperscriptShiftUp;
  LONG lSuperscriptShiftUpCramped;
  LONG lSuperscriptBottomMin;
  LONG lSuperscriptTopRiseMin;
  LONG lSubSuperscriptMinGap;
  LONG lSuperscriptBottomMaxWithSubscript;
  LONG lSpaceAfterScript;
In addition math characters have four cut-in values, one for each corner, allowing sub/superscripts to be kerned with their bases. The information in the tables can be obtained from mathfont.dll along with appropriate scaling and glyph assemblies.
• Functions in mathfont.dll accessing math font tables
• The subscript/superscript/prescript callbacks are shown
• For example, consider Einstein’s most famous equation, E = mc². The E is in its own text run, the equal sign is a mathematical operator object, the m is in its own text run, and the c² is a superscript object with text runs for arguments. The text runs result in various callbacks to obtain character properties, widths, and glyphs, as well as to display the glyphs or variants thereof once the whole line is laid out. All text is treated using glyphs and glyph-ink ascents and descents. The math italic letters are given by Unicode math alphabetics in plane 1. The operator object for the equal sign results in callbacks to determine the operator’s text characteristics and its default spacing class, in this case, relational. The superscript object results in callbacks to get text-run information for the base and superscript text, as well as to obtain the superscript vertical shift and the cut-in values for the upper-right corner of the c and lower-left corner of the 2. These displacements are obtained from the math font handler (MFH), which is responsible for access to the math font’s math tables along with appropriate scaling. When the glyph for the superscript 2 is fetched, the MFH is requested to return a script level-1 glyph variant with a relative size specified by the font (typically about 70% of the text size). This example shows how even a simple mathematical equation involves interplay between the client, the math layout handler, the math font handler, and the font itself. More complicated examples have math objects like brackets or integrals and need glyph assemblies and other information. In addition, larger equations may need to be wrapped to two or more lines, a process that involves further callbacks and information.
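As a small illustration of the plane-1 math italic letters mentioned above, the following Python sketch maps ASCII letters to the Unicode Mathematical Italic alphabet. The function name `math_italic` is hypothetical; the code point values are those of the Mathematical Alphanumeric Symbols block, where the italic small h is not in plane 1 but is the existing character U+210E.

```python
import unicodedata

ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A
PLANCK_CONSTANT = "\u210E"  # serves as italic small h; U+1D455 is reserved


def math_italic(ch: str) -> str:
    """Return the math italic counterpart of an ASCII letter, else ch unchanged."""
    if ch == "h":
        return PLANCK_CONSTANT
    if "A" <= ch <= "Z":
        return chr(ITALIC_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(ITALIC_SMALL_A + ord(ch) - ord("a"))
    return ch


for letter in "Emc":
    italic = math_italic(letter)
    print(letter, f"U+{ord(italic):04X}", unicodedata.name(italic))
```

Digits, operators, and other characters pass through unchanged, so only the variables E, m, and c in the example equation would be remapped.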
• If a standard fraction's argument is a standard fraction that has a width greater than the outer fraction's rule length - 20% EM, increase the rule length by 20% EM to reveal which fraction contains the other.
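The nested-fraction rule can be written as a short Python sketch. The function name and the use of integer font units are assumptions made for illustration; only the 20% EM threshold and increment come from the rule as stated.

```python
def outer_rule_length(rule_length: int, inner_width: int, em: int) -> int:
    """Return the outer fraction's rule length after applying the nesting rule.

    All values are in the same integer font units; `em` is the em size.
    """
    margin = em // 5  # 20% of the em size
    if inner_width > rule_length - margin:
        # The inner fraction is nearly as wide as the outer rule, so lengthen
        # the outer rule to show which fraction contains the other.
        return rule_length + margin
    return rule_length


print(outer_rule_length(1000, 900, 1000))  # inner fraction too wide: rule is lengthened
print(outer_rule_length(1000, 700, 1000))  # inner fraction narrow enough: rule unchanged
```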
• Math editing and display in Office 2007 appear only in Word 2007, but the appearance is stunning. The math typography is competitive or superior to TeX’s, the input methods are state of the art, and the environment is Office’s, which comes with internationalization, spelling and grammar checking, interoperability, bibliography support, and many other features one expects from the leading word processor. Much of the underlying functionality is based on the sharable components, PTLS 4.0 (Page/Table/LineServices with its high quality math handlers), RichEdit, the math font library, Uniscribe, and the incredible Cambria Math font.
• The underlying technology was mostly available, but there wasn’t enough time to integrate it
• WordPad uses a RichEdit control for editing and displaying text. By using the latest RichEdit control with WordPad, we can edit and display mathematical expressions as described in this talk. To get an upgraded msftedit.dll for use with WordPad, go to \\scratch2\scratch\murrays\wordpad. The math build up/down facility is housed in Office 12’s RichEdit 6.0 dll and communicates with clients using a subset of the TOM2 (Text Object Model 2) interface methods. Any application that implements this subset of methods can have formula autobuildup and manual build up/down. In particular, the client needs to implement the ITextStrings rich-text strings interface. This interface gives access to a set of strings similar to a stack of C strings, but the ITextStrings strings may have rich-text properties that the build up/down facility doesn’t need to understand.
• Many people around the company and many technologies are involved in this effort. This slide lists the people and groups most directly involved. Many thanks also are due to our managers at all levels, who offered lots of support and encouragement.
• The Unicode Standard , Version 4.0, (Reading, MA, Addison-Wesley, 2003. ISBN 0-321-18578-1) or online as http://www.unicode.org/versions/Unicode4.0.0/   Barbara Beeton, Asmus Freytag, Murray Sargent III, Unicode Technical Report #25 “Unicode Support for Mathematics”, http://www.unicode.org/reports/tr25   Donald E. Knuth, The TeXbook , (Reading, Massachusetts: Addison-Wesley 1984)   Mathematical Markup Language (MathML) Version 2.0 (Second Edition) http://www.w3.org/TR/2003/REC-MathML2-20031021/ .   Murray Sargent III, Unicode Nearly Plain-Text Encoding of Mathematics , http://www.unicode.org/notes/tn28/UTN28-PlainTextMath.pdf.
• Caveat: we’re not finished yet. Right-to-left Arabic math and a number of important features have been postponed to the next iteration of Office. PowerPoint and OneNote didn’t quite make it, although we have impressive demos. We don’t have converters to/from TeX, although one could import/export TeX via MathML.
• ### 302 sargent word2007-ssp2008

1. 1. Math Editing and Display in Word 2007 Murray Sargent III Publisher Text Services 28-may-2008
2. 2. Overview 8 math infrastructures enable better math display/editing New Office math edit/display environment Interoperate with math programs such as Mathematica, Maple, publisher workflow Input methods and formats Layout Math font
3. 3. Complex Project Intricacies of math typesetting Creating and using a large set of glyph variants Vagaries of math notation Embedding math zones into international text environments Interaction with complex scripts Math in other objects like hyperlinks, ruby Input with nonASCII keyboards
4. 4. Eight Math Infrastructures [La]TeX: current tech-doc standards Unicode 5.0: includes ~2000 math symbols MathML 2.0: math K – 12 and beyond OpenType font technology: special math tables New math font (Cambria Math) Math layout handler Shared math input components MS Office environment, autocorrect
5. 5. [La]TeX Widely used, high-quality tech document preparation language Simple ASCII keyboard entry Usage and math typography are very well documented Stable since 1990 Complex scenarios are hard to edit Numerous dialects, user macros, and lack of Unicode complicate interchange Fonts aren’t well suited to screen display
6. 6. Unicode 5.0 340 math chars exist in ASCII, U+2200 block, arrows, combining marks 1016 math alphanumeric characters are in Unicode Plane 1 or Letterlike Symbols 591 new math symbols and operators are on BMP One math variant selector One new combining character (reverse solidus) New math characters were requested by STIX
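As a small illustration of the Plane 1 math alphanumerics, the Python sketch below maps ASCII Latin letters to their math italic counterparts. The function name is hypothetical. The code points U+1D434 and U+1D44E are the standard starts of the math italic capital and small ranges, and italic small h lives in the Letterlike Symbols block as U+210E, which is why the slide says "Plane 1 or Letterlike Symbols".

```python
import unicodedata

MATH_ITALIC_CAPITAL_A = 0x1D434  # MATHEMATICAL ITALIC CAPITAL A
MATH_ITALIC_SMALL_A = 0x1D44E    # MATHEMATICAL ITALIC SMALL A

def to_math_italic(ch: str) -> str:
    """Map one ASCII Latin letter to its Unicode math italic character."""
    if ch == "h":
        # The slot U+1D455 is reserved; italic h is PLANCK CONSTANT.
        return "\u210E"
    if "A" <= ch <= "Z":
        return chr(MATH_ITALIC_CAPITAL_A + ord(ch) - ord("A"))
    if "a" <= ch <= "z":
        return chr(MATH_ITALIC_SMALL_A + ord(ch) - ord("a"))
    return ch  # digits, operators, and everything else pass through

# The letters of E = mc^2 as math italic characters.
for letter in "Emc":
    print(unicodedata.name(to_math_italic(letter)))
```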
7. 7. Basic Set of Alphanumeric Characters Latin digits (0 - 9) Upper- & lowercase Latin letters (a - z, A - Z) Uppercase Greek letters Α - Ω plus the nabla ∇ and a variant of theta Θ Lowercase Greek letters α - ω plus the partial differential sign ∂ and glyph variants of ε, θ, κ, φ, ρ, and π Only unaccented forms of letters are used
8. 8. Legibility Loss Without math alphabetics, the Hamiltonian formula   ℋ = ∫dτ [εE² + μH²] becomes an integral equation H = ∫dτ [εE² + μH²]
9. 9. Math Alphanumeric Characters • Math needs various Latin and Greek styles like normal, bold, italic, script, Fraktur, and open-face • May appear to be font variations, but have distinct semantics and spacings • Without these distinctions, you get gibberish, violating Unicode rule: plain text must contain enough info to permit text to be rendered legibly, and nothing more • Plain-text searches should distinguish between alphabets, e.g., a search for script H shouldn’t match H, etc.
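The search point can be checked directly in Python. In this sketch the Hamiltonian from the Legibility Loss slide is stored with script H (U+210B), and an exact search for plain H finds only the magnetic field H. The `fold_math_styles` helper is an assumption added for contrast: it uses NFKC normalization, which discards the style distinction and reproduces the legibility loss.

```python
import unicodedata

SCRIPT_H = "\u210B"  # SCRIPT CAPITAL H
hamiltonian = SCRIPT_H + " = \u222Bd\u03C4 [\u03B5E\u00B2 + \u03BCH\u00B2]"

def fold_math_styles(text: str) -> str:
    """Collapse math alphabetics (and superscripts) to their plain forms."""
    return unicodedata.normalize("NFKC", text)

# An exact plain-text search keeps script H and H apart.
print(hamiltonian.count("H"))       # only the magnetic field H
print(hamiltonian.count(SCRIPT_H))  # the Hamiltonian itself

# After folding, the two letters can no longer be told apart.
print(fold_math_styles(hamiltonian))
```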
10. 10. MathML MathML 1.0 (April, 1998) was the first World Wide Web Consortium (W3C) endorsed XML vocabulary Low-level format for describing mathematics as a basis for machine to machine communication MathML facilitates the use and re-use of scientific content on the Web MathML 2.0 released in late 2003 is now widely used in exchanging mathematical text MathML 2.0 spec has a wealth of math info
11. 11. MathML Presentation Markup Presentation markup directs how the math should be rendered. For E = mc²: <mrow> <mi>E</mi> <mo>=</mo> <mrow> <mi>m</mi> <mo>&InvisibleTimes;</mo> <msup> <mi>c</mi> <mn>2</mn> </msup> </mrow> </mrow>
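The same presentation tree can be built and read back with Python's standard `xml.etree.ElementTree`. This is a sketch rather than anything Word does internally: the helper names are made up, and the Invisible Times character is written as U+2062 directly because a plain XML parser does not know the MathML named entity `&InvisibleTimes;`.

```python
import xml.etree.ElementTree as ET

INVISIBLE_TIMES = "\u2062"  # stands in for the &InvisibleTimes; entity

def leaf(tag: str, text: str) -> ET.Element:
    """Create a MathML token element such as <mi>E</mi>."""
    element = ET.Element(tag)
    element.text = text
    return element

def emc2() -> ET.Element:
    """Build presentation MathML for E = mc^2, mirroring the slide."""
    outer = ET.Element("mrow")
    outer.append(leaf("mi", "E"))
    outer.append(leaf("mo", "="))
    inner = ET.SubElement(outer, "mrow")
    inner.append(leaf("mi", "m"))
    inner.append(leaf("mo", INVISIBLE_TIMES))
    msup = ET.SubElement(inner, "msup")
    msup.append(leaf("mi", "c"))  # base
    msup.append(leaf("mn", "2"))  # superscript
    return outer

markup = ET.tostring(emc2(), encoding="unicode")
reparsed = ET.fromstring(markup)
print(reparsed.find("./mrow/msup/mn").text)
```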
12. 12. Office MathML (OMML) For E = mc²: <m:oMath> <m:r><m:t>E=m</m:t></m:r> <m:sSup> <m:e> <m:r><m:t>c</m:t></m:r> </m:e> <m:sup> <m:r><m:t>2</m:t></m:r> </m:sup> </m:sSup> </m:oMath>
13. 13. MathML with Custom XML Can put arbitrary namespace attributes in MathML tags More complicated embellishments can use <semantics> MathML representation <annotation-XML> Enhancements </annotation-XML> </semantics>
14. 14. MathML Parsing MathML can be tricky to parse. For sin x: <mrow> <mi>sin</mi> <mo>&ApplyFunction;</mo> <mi>x</mi> </mrow> The parser doesn’t know it’s a function-apply object until reaching &ApplyFunction;, so it has to analyze expressions as with the linear format
15. 15. Linear Format E=mc^2 builds up to E = mc²
16. 16. Math RTF Math RTF is OMML in RTF syntax Somewhat simplified (doesn’t need text tag) For example, <m:f> ... </m:f> → {\mf ... } Thoroughly defined in latest RTF spec Reading spec is great way to learn how Word represents math
17. 17. Accented characters Accents are handled by math accent object Accents may apply to multiple characters Accents may be flattened
18. 18. Vagaries of Math Notation Choice of subscript/superscript base Function arguments like Integrands and n-aryands Absolute value ambiguities like ||a|-|b||. Actually this example is unambiguous, but |a|b - c|d| has two possible meanings Context sensitive ellipses: … vs ⋯
19. 19. Math Spacing Operators have math spacing given by extended TeX spacing rules Function object gives correct spacing between object and neighbors, and between function name and argument n-aryand object gives correct spacing between n-ary operator and its n-aryand Automates away much of the need for TeX spacing “tweaks” Context-dependent operator spacing like + - . , :
20. 20. Font Sizing Text style, script style (70%), script script style (60%) Sub/sups…, fractions in line Cramped
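The script-level percentages on this slide can be applied as in the sketch below. The 70% and 60% figures come from the slide, and the notes say the actual relative size is specified by the font, so the table, the function name, and the clamping of deeper levels to script-script size are all assumptions for illustration.

```python
# Relative glyph size per script level, in percent of the text size.
# Level 0 is text style, 1 is script style, 2 is script-script style.
SCRIPT_PERCENT = {0: 100, 1: 70, 2: 60}

def glyph_size(text_size_pt: float, script_level: int) -> float:
    """Return the glyph size for a script level, clamped to levels 0..2."""
    level = min(max(script_level, 0), 2)
    return text_size_pt * SCRIPT_PERCENT[level] / 100

# In E = mc^2 at 10 pt, the superscript 2 is at script level 1.
print(glyph_size(10, 1))
```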
21. 21. Confusables 1 vs l, 𝑎 vs […], 𝒳 vs […], Y vs Υ. Other letter similarities are so close that they are avoided, e.g., UC alpha and LC omicron are never used.
22. 22. Math Input Methods Linear format input and manual buildup Formula autobuildup (FAB) Math ribbons Recognition of handwritten formulae Hex code input WYSIWYG editing Hybrid editing (combination of WYSIWYG and FAB)
23. 23. Hex to Unicode Input Method Type Unicode character hexadecimal code Make corrections as need be Type Alt+x to convert to character Type Alt+x to convert back to hex (useful especially for “missing glyph” character) Resolve ambiguities by selection Input higher-plane chars using 5 or 6-digit code MS Word and RichEdit standard
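The two directions of the Alt+x conversion amount to the following Python sketch. It covers only the code-to-character mapping. The real feature also has to find the hex digits before the caret and resolve ambiguities, which is not modeled here, and the function names are hypothetical.

```python
def hex_to_char(code: str) -> str:
    """Convert a hexadecimal code such as '3B4' or '1D49C' to its character."""
    return chr(int(code, 16))

def char_to_hex(ch: str) -> str:
    """Convert a character back to its hex code, padded to at least 4 digits."""
    return format(ord(ch), "04X")

print(hex_to_char("03B4"))          # Greek small delta
print(char_to_hex("\U0001D49C"))    # a 5-digit Plane 1 code
```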
24. 24. Autocorrect Examples Type delta and get δ, Delta and get Δ Define quadratic to be x = (-b ± √(b^2 - 4ac))/2a Then typing quadratic<space> inserts:
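The built-up result shown on the slide is not part of this transcript. Reading the linear-format definition above, it is presumably the familiar quadratic formula, written here in LaTeX for reference:

```latex
\[
  x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}
\]
```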
25. 25. Math Alphabetics scriptA, frakturA, doubleA, etc., are used to insert math script, Fraktur, and double-struck alphabetics Italic and bold are controlled by italic & bold format tools and only apply to math alphabetics Italic and/or bold is ignored for characters that don’t have corresponding Unicode
26. 26. Linear format math • Simple operand is a span of alphanumeric characters • E.g., simple numerator or denominator is terminated by any nonalphanumeric character • abc/d gives the built-up fraction with abc over d • More complicated operands use parentheses ( ), brackets [ ], or braces { } • Outermost parens in fractions aren’t displayed in built-up form
27. 27. Linear format math (cont) E.g., plain text (a + c)/d displays as the built-up fraction with a + c over d • Easier to read than TeX’s, e.g., {a + c \over d} • MathML: <mfrac><mrow><mi>a</mi><mo>+</mo> <mi>c</mi></mrow><mrow><mi>d</mi> </mrow></mfrac> • Neat feature: linear-format text looks like math
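The fraction rules from the two linear-format slides (alphanumeric spans as simple operands, parentheses for longer ones, outermost parentheses dropped on build-up) can be sketched with a regular expression. This is a deliberately tiny illustration, not the linear-format parser: it handles only non-nested parentheses, emits LaTeX `\frac` as a stand-in for the built-up display, and all names are hypothetical.

```python
import re

# One operand: a parenthesized group without nested parentheses,
# or a span of alphanumeric characters.
OPERAND = r"(\([^()]*\)|[A-Za-z0-9]+)"
FRACTION = re.compile(OPERAND + r"\s*/\s*" + OPERAND)

def strip_outer_parens(operand: str) -> str:
    """Outermost parentheses of a fraction operand are not displayed."""
    if operand.startswith("(") and operand.endswith(")"):
        return operand[1:-1].strip()
    return operand

def build_up_fractions(linear: str) -> str:
    """Rewrite each 'numerator/denominator' as a LaTeX \\frac."""
    def to_frac(match: re.Match) -> str:
        numerator, denominator = (strip_outer_parens(g) for g in match.groups())
        return r"\frac{%s}{%s}" % (numerator, denominator)
    return FRACTION.sub(to_frac, linear)

print(build_up_fractions("abc/d"))
print(build_up_fractions("(a + c)/d"))
```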
28. 28. Subscripts and Superscripts Unicode has numeric subscripts and superscripts along with some operators (U+2070-U+208E): convert to regular Others need some kind of markup like <msup>… </msup> Use TeX’s _ and ^ subscript/superscript ops for input; they can be displayed as a subscripted down arrow and superscripted up arrow Use parentheses as for fractions to overrule built-in precedence order
29. 29. Formula Autobuildup Enter formulas in linear format in a math zone When a character is typed that renders an expression syntactically unambiguous, the expression is built up Edit expressions in built-up form or in linear form For integrals, type int (which autocorrects to ∫ ) optionally followed by subscript and superscript for limits, which auto build up Can autocorrect <letters> to built-up characters or expressions
30. 30. Roles of Space (U+0020) The ASCII space is rarely needed inside math expressions, since math spacing is automatic Use to terminate autocorrect entries and to terminate expressions. When so used, is deleted Use as command to build up math objects Use to define spacings for , . and : and to force a unary operator to display with binary spacing A space builds up one subexpression; other operators build up as many as they can
31. 31. Unicode Spaces

    | Space | Unicode | Autocorrect |
    | --- | --- | --- |
    | 0 em | U+200B | zwsp |
    | 1/18 em | U+200A | hairsp |
    | 3/18 em | U+2009 | thinsp |
    | 4/18 em | U+205F | medsp |
    | 5/18 em | U+2005 | thicksp |
    | 6/18 em | U+2004 | vthicksp |
    | 9/18 em | U+2002 | ensp |
    | 18/18 em | U+2003 | emsp |
    | (digit width) | U+2007 | numsp |
    | (space width) | U+00A0 | nbsp |
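The fixed-width entries of the Unicode Spaces slide fit naturally in a lookup table. The Python sketch below stores each autocorrect name with its code point and its width in eighteenths of an em, taken from the slide. The dictionary layout and function names are assumptions, and numsp and nbsp are left out because their widths depend on the font.

```python
from fractions import Fraction

# autocorrect name -> (code point, width in eighteenths of an em)
MATH_SPACES = {
    "zwsp": (0x200B, 0),
    "hairsp": (0x200A, 1),
    "thinsp": (0x2009, 3),
    "medsp": (0x205F, 4),
    "thicksp": (0x2005, 5),
    "vthicksp": (0x2004, 6),
    "ensp": (0x2002, 9),
    "emsp": (0x2003, 18),
}

def space_char(name: str) -> str:
    """Return the Unicode space character for an autocorrect name."""
    return chr(MATH_SPACES[name][0])

def space_width(name: str, em: int) -> Fraction:
    """Return the width of the named space for a given em size."""
    return Fraction(MATH_SPACES[name][1], 18) * em

print(space_width("thinsp", 18))  # width in the same unit as the em
```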
32. 32. Operators

    | Operator | Precedence |
    | --- | --- |
    | CR | 0 |
    | opOpen | 1 |
    | opClose | 2 |
    | opSeparator | 3 |
    | concatenation | 4 |
    | / atop | 5 |
    | opNary | 6 |
    | _ ^ opFApply above below | 7 |
    | □ ∛ ∜ ■ opHbracket | 8 |
    | opAccent | 9 |
    | opUniSubSup | 10 |
33. 33. Four Math Invisibles There are four “invisible” math control codes

    | Math control code | Unicode |
    | --- | --- |
    | Invisible Function Apply | U+2061 |
    | Invisible Times | U+2062 |
    | Invisible Comma | U+2063 |
    | Invisible Plus | U+2064 |

    Used for semantic content and usually don’t display a glyph. May have a small width, e.g., Function Apply has thinsp
34. 34. Math LayoutCollaboration between 5 entities: Unicode rich-text text processing program such as Word or RichEdit LineServices math handler Page/TableServices math handler Math font, e.g., Cambria Math Math-font handler
35. 35. Equation Breaking & Numbering PTS math handler can break equations into multiple lines automatically or by user breaks PTS can handle layout of equation numbers Client needs to support “math paragraph” Two kinds of user breaks: at operator via context menu, at line break (Shift+Enter) At operator indentation: each TAB indents to next binary/relational operator Line break: align at specific operators, e.g., =
36. 36. Math Engine Objects
37. 37. Glyph Variants Subscripts/superscripts Primes Dotless i, j used in bases of accent objects Flattened and wide accents Growable brackets, integrals, arrows Display of differentials using U+2146 Mirror images for right-to-left math Variation selector U+FE00
38. 38. Cambria Math Font Cambria typeface designed by Jelle Bosma Extended for math by Ross Mills and Andrei Burago in collaboration with the ClearType and math-layout groups Contains extensive math tables, glyph variants and much of the Unicode math set Is designed with ClearType and excellent screen readability in mind Enables best screen-resolution display of math
39. 39. New Math Fonts Cambria Math has new version with more math characters, e.g., U+2900..U+2AFF 202 math characters still needed for Unicode 5.1 STIX Times Roman math font is in beta; doesn’t support Word 2007 math well STIX has full math character set + some STIX font is Type 1, so it doesn’t work with the Office pdf writer Font demos
40. 40. Font Math Tables Specialized math tables have been created to control glyph placements Position subscripts/superscripts horizontally using cut-ins and italic corrections Many math constants: axis height, fraction rule thickness, etc. Compare kerning of The math tables are formalized as OpenType tables accessible via mathfont.dll
41. 41. Math Constants
42. 42. User Spacing Adjustments Layout engine attempts to render with high typographic quality Users can spoil layout by inserting space where engine would insert it automatically Have autocorrect procedure to reduce this Users can insert Unicode spaces Phantoms and smashes Size and placement overrides
43. 43. Phantoms and Smashes Phantoms have size but no display. Can have both width & height, ascent only, descent only Smashes display, but remove one or more sizes, e.g., descent, ascent, and/or width
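The difference between the two objects is easiest to see on box dimensions. In this Python sketch a `Box` carries width, ascent, and descent plus a visibility flag: a phantom keeps selected dimensions but is not displayed, while a smash is displayed but zeroes selected dimensions. The class, the function names, and the default choices are illustrative assumptions, not the layout engine's API.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Box:
    width: float
    ascent: float
    descent: float
    visible: bool = True

def phantom(box: Box, width: bool = True, ascent: bool = True,
            descent: bool = True) -> Box:
    """Keep the chosen dimensions of the box but do not display it."""
    return Box(
        width=box.width if width else 0.0,
        ascent=box.ascent if ascent else 0.0,
        descent=box.descent if descent else 0.0,
        visible=False,
    )

def smash(box: Box, width: bool = False, ascent: bool = True,
          descent: bool = True) -> Box:
    """Display the box but remove the chosen dimensions from layout."""
    return replace(
        box,
        width=0.0 if width else box.width,
        ascent=0.0 if ascent else box.ascent,
        descent=0.0 if descent else box.descent,
    )

glyph = Box(width=5.0, ascent=7.0, descent=2.0)
print(phantom(glyph, descent=False))  # ascent-and-width phantom
print(smash(glyph, ascent=False))     # descent-only smash
```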
44. 44. Word 2007 Math Facility Elegant math entry and display Display is competitive with TeX Automatic line breaking, special kerning More math semantics than TeX: greater interoperability (Presentation MathML) Input with math ribbon, context menus Formula autobuildup input method WYSIWYG editing as well as linear format MS Math graphing calculator add-in
45. 45. What Word 2007 doesn’t have Built-in equation numbering Math Find/Replace OpenType enhancements (aside from math table functionality) Optimal line breaking Configurable math-zone vertical spacing [La]TeX import/export Document wide MathML support (only MathML for a single math zone)
46. 46. Conclusions Eight infrastructures allow us to do math display and editing better than ever before High quality math handler and font enable typography competitive with or better than TeX Best screen-resolution display of mathematics Streamlined input methods such as Formula Autobuildup Incorporated into Word 2007, Word down-level converter, Microsoft Math calculator Cambria Math font: state-of-art math font