XML Data Stream Preparation for VDP

Data preparation, XML conversion, and programming approaches for creating variable data documents. Demonstrates the juncture of graphic design and applied information technology. Uses affordable tools and techniques with cross-platform capabilities (PC & Mac).

Transcript

  • 1. Data Stream Preparation: One Approach to Document Variability. A Classroom Lecture for Print Media, Graphic Arts, and Applied Information Technology. Nick D. Barzelay, April 2010. Copyright © Nick D. Barzelay, 2010. All rights reserved.
  • 2. Talking Points
    I. The Workflow
    II. Data Structures
    III. Simple XML
    IV. Basic Data Processing
    V. Data Preparation Demonstration
    VI. Direct XML Data Stream Maintenance
  • 3. I. The Workflow
    • Basic Content Integration
    • The Data Handling Thread
    • Data Preparation Workflow
  • 4. Basic Content Integration
    [Diagram: three columns across four numbered integration layers; the arrangement of the upper two layers is not recoverable from the transcript.]
    Columns: Content Source (Data, Text, Graphics); Integrating Element (transfer medium & associated tools); Design & Assembly Element
    Other labels: Content Management; Data Management; Selected Dynamic Data; Selected Static Content; Selected Dynamic Graphics; Selected Static Graphics
    Layer 3 (Tools): Desktop DB or Spreadsheet; XML w/XSLT or Perl; Integration Engine Adobe InDesign
    Layer 4 (Skills & Techniques): Rudimentary Database & FileMaker Pro; Elementary XML & optional basic XSLT or Perl; Basic Adobe InDesign Proficiencies
    Integration Layers: 1. Process Layer 2. Content/Data Layer 3. Tools Layer 4. Skills & Techniques
  • 5. The Data Handling Thread
    [Diagram: A Variable Document Workflow. Labels recoverable from the transcript include Concept & Objectives; Design Sketch; Image Preparation; Document Design; Document Assembly; Content Management; Data Source; Working Dataset; Data Preparation & QA; XML Data Stream; VCP Document; Proofing; PDF Generation; Prepress & Imposition; Production. Callouts mark the "Discussion Focus" and the "Discussion End Point".]
  • 6. Data Preparation Workflow
    [Diagram: a Data Source (Database, Spreadsheet, or Text Files) feeds seven numbered processes that produce Prepared Data (Working Database & XML Content) and an XML Stream; the order of processes 2 through 6 is inferred from the flattened diagram.]
    1. Import Data into Working Database
    2. Combine Sources into Single Table
    3. Evaluate Records
    4. Cleanse Records
    5. Supplement Fields & Reference Images
    6. Reorder Fields & Sort Records
    7. Select Data Fields & Export XML
    Processes 1 through 7 interact with and are dependent upon Working Storage.
    Optional feedback paths: Cleansed Data: Use to Update Data Source. Marketing Results: Update Data for Ongoing Campaign.
  • 7. II. Data Structures
    • File Metadata
    • Sequential File
    • Delimited File
    • Spreadsheet
    • Database Table
  • 8. File Metadata
    • File headers
    • File name, type, and size
    • Dates for creation, modification, & last use
    • Access security and use permissions
    • Author, company, and copyright information
    • Title, subject, and comments
    • Key words and labeling
    More extensive “structural & descriptive” metadata is used for Spreadsheets and Databases.
  • 9. Sequential Data File
    File Header
    First Record: Field 1, Field 2, Field 3
    Second Record: Field 1, Field 2, Field 3
    Nth Record: Field 1, Field 2, Field 3
    End of File (EOF)
    Processing Order: Top to Bottom, Left to Right
  • 10. Delimited Sequential File
    Each column is separated from the next by a Tab character.
      Column 1      Column 2  Column 3   Column 4  Column 5  Column n
      (Identifier)  (name)    (address)  (city)    (state)   (etc.)
      Row 1         name 1    address 1  city 1    state 1   etc. 1
      Row 2         name 2    address 2  city 2    state 2   etc. 2
      Row 3         name 3    address 3  city 3    state 3   etc. 3
      Row 4         name 4    address 4  city 4    state 4   etc. 4
      Row 5         name 5    address 5  city 5    state 5   etc. 5
      Row n         name n    address n  city n    state n   etc. n
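
A minimal sketch of reading a tab-delimited file like the one on this slide. It is written in Python rather than the deck's Perl so that it runs with the standard library alone. The sample text and the header names (`id`, `name`, and so on) are hypothetical stand-ins for the slide's column labels.

```python
import csv
import io

# Hypothetical tab-delimited sample; header names mirror the slide's column labels.
SAMPLE = (
    "id\tname\taddress\tcity\tstate\n"
    "1\tname 1\taddress 1\tcity 1\tstate 1\n"
    "2\tname 2\taddress 2\tcity 2\tstate 2\n"
)

def read_delimited(text, delimiter="\t"):
    """Return each data row as a dict keyed by the header row's column names."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

rows = read_delimited(SAMPLE)
print(rows[0]["city"])
```

Passing `delimiter=","` to the same function would read a comma-separated file instead.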
  • 11. Spreadsheet
      Column 1      Column 2  Column 3   Column 4  Column 5  Column n
      (Identifier)  (name)    (address)  (city)    (state)   (etc.)
      Row 1         name 1    address 1  city 1    state 1   etc. 1
      Row 2         name 2    address 2  city 2    state 2   etc. 2
      Row 3         name 3    address 3  city 3    state 3   etc. 3
      Row 4         name 4    address 4  city 4    state 4   etc. 4
      Row 5         name 5    address 5  city 5    state 5   etc. 5
      Row n         name n    address n  city n    state n   etc. n
  • 12. Database Table
      Column 1       Column 2  Column 3   Column 4  Column 5  Column n
      (usually Key)  (name)    (address)  (city)    (state)   (etc.)
      Row 1          name 1    address 1  city 1    state 1   etc. 1
      Row 2          name 2    address 2  city 2    state 2   etc. 2
      Row 3          name 3    address 3  city 3    state 3   etc. 3
      Row 4          name 4    address 4  city 4    state 4   etc. 4
      Row 5          name 5    address 5  city 5    state 5   etc. 5
      Row n          name n    address n  city n    state n   etc. n
    Table Metadata:
    • Column name
    • Data type (text, number, date, time, or binary container)
    • Size (number of characters)
    • Use (primary key, foreign key, index key)
    • Creation date & Modification date
    • Permissions (who can read or manipulate field content)
    • Validation value (a specified test value)
    • Default value (including automatic incrementing and increment value)
    • Data entry mask (formatting assistance for data input) & Display format
  • 13. Relational Table Structure
    [Diagram: a Parent Record Set linked to a Child Record Set.]
    Parent Record Set
    Descriptive according to demographics or other qualities
    Transactional: Parent and Child Interaction
    Child Record Set
    Experiential: Transactions External to the Relationship
    Relational Metadata:
    • Table name
    • Table relationships (child table names)
    • Relationship type
    • Relationship direction (the “one” or “many”)
    • Table unique key (makes every record unique)
    • Table foreign key (points from child to parent)
    • Number of fields (columns)
    • Number of records (rows)
    • Table access (accounts and privileges)
  • 14. Parent & Child Example
    [Diagram: a Parent Table and a Child Table linked by Customer Number; the Detail Reference in the Child Table points to the Parent Table.]
  • 15. Joining (flattening) the Data
    Data Sources may be Sequential Files, Delimited Files, Spreadsheets, or Database Tables.
    [Diagram: a Primary Source and a Secondary or Qualifying Source joined on a Common Data Element (Assumed Key).]
    Data Selection Options:
    • Sort Primary and Secondary by Matching Sort Criteria
    • Matching data selected from Primary according to Secondary
    • Unmatched data selected from Primary according to Secondary
    • Matching data from Primary & Secondary selected for merging
    Merge Processing:
    • Read Primary and Secondary, match by option, and write new file
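
The selection options above can be sketched as follows. This is an illustration in Python, not the lecture's own tooling; the function names, the `customer_number` key, and the sample records are hypothetical, and both sources are assumed to fit in memory as lists of dicts.

```python
def select_by_key(primary, secondary, key):
    """Split primary records into those with and without a key match in secondary."""
    secondary_keys = {rec[key] for rec in secondary}
    matched = [rec for rec in primary if rec[key] in secondary_keys]
    unmatched = [rec for rec in primary if rec[key] not in secondary_keys]
    return matched, unmatched

def flatten(primary, secondary, key):
    """Join each secondary (child) record with its matching primary (parent) record."""
    parents = {rec[key]: rec for rec in primary}
    return [{**parents[child[key]], **child}
            for child in secondary if child[key] in parents]

# Hypothetical parent (customers) and child (orders) tables.
customers = [{"customer_number": 1, "name": "Able"},
             {"customer_number": 2, "name": "Baker"}]
orders = [{"customer_number": 1, "product": "Paving Stone"}]

matched, unmatched = select_by_key(customers, orders, "customer_number")
flat = flatten(customers, orders, "customer_number")
```

Building a dictionary on the key replaces the slide's sort-and-match pass; for files too large for memory, sorting both sources on the key and reading them in step would be the closer equivalent.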
  • 16. Approaches to Joining Datasets
    Concatenate: set A + set B + set “n” (The main issue is validation of information content between data sources.)
    • Are there duplicate records between the collections (files or tables)?
    • Are there erroneously matched fields?
    • Do matched records really contain the exact same information, or is some different?
    • Does one record contain more or cleaner information than the other?
    Update: secondary to primary according to defined logic
    • Dataset to Dataset
    • Table to Table
    Merge: primary1 + primary2 = new primary (Maintain data source identity to assure data integrity, traceability, and recoverability.)
    • Record the primary keys from both records in the new record
    • Record in the new record the respective matching fields, if other than the primary keys
    • Create a new unique primary key for the new record
    • Group fields in the new record according to their respective source
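
One way to picture the Merge rules is the sketch below. It is a Python illustration under stated assumptions: the field names (`merged_id`, `source_a_key`, `source_b_key`) and the sample records are hypothetical, and the new primary key is a simple counter.

```python
from itertools import count

def merge_records(rec_a, rec_b, key_a, key_b, new_ids):
    """Build a merged record that keeps both source keys and groups fields by source."""
    return {
        "merged_id": next(new_ids),    # new unique primary key for the merged record
        "source_a_key": rec_a[key_a],  # primary key carried over from the first source
        "source_b_key": rec_b[key_b],  # primary key carried over from the second source
        "a": dict(rec_a),              # fields grouped under their respective source
        "b": dict(rec_b),
    }

new_ids = count(1)
merged = merge_records({"id": 17, "zip": "14533"},
                       {"cust": "A-9", "zip": "14533"},
                       "id", "cust", new_ids)
```

Keeping both original keys in the merged record is what makes the merge traceable and reversible.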
  • 17. III. Simple XML
    • Concepts & Considerations
    • XML Building Blocks
    • DTD Building Blocks
    • Quality & Standardization
  • 18. Concepts & Considerations
    • XML is hierarchical & extensible
    • Describes content, not layout or formatting
    • Can reference external objects
    • Because content is separated from layout, content may be multi-purposed
    • “Well Formed” (follows all XML conventions); “Valid” (agrees with the document definition)
    • Being “well formed” doesn’t necessarily mean a document is “valid”
    • No “right way” to code documents, but some approaches may be more effective than others
    • Keep XML structures as simple and clean as possible; avoid unnecessary complexity
    • White space is preserved; indenting sub-elements and inserting comments may help readability, but may also interfere with integration processing
    • Fixed-width fonts during XML preparation increase text control at the character level
    • Formatting XML documents is superfluous, distracting, and a potential source of problems; plain lower case text is best!
    • Regarding plain text, a text editor is a better choice than a word processor because in XML processing, “what you see is not necessarily what you think you have”
    • XML structure and sequence drive processing: document design and XML must correlate
  • 19. XML Building Blocks
    XML Declaration:
      <?xml version="1.0" encoding="UTF-8"?>
    Namespace Declaration:
      Schema Access: <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
      XSL transformations (XSLT): <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
    Document Type Declaration:
      <!DOCTYPE rootname SYSTEM "filename.dtd">
    Processing Instruction (CSS Reference):
      <?xml-stylesheet type="text/css" href="path/filename.css"?>
    Comment:
      <!-- the double dashes are not single hyphens -->
  • 20. Elements & Attributes
    Element:
      <any_name>some element content</any_name>
    Indented:
      <state>
        <name>New York</name>
        <abbreviation>NY</abbreviation>
      </state>
    Not Indented:
      <state>
      <name>New York</name>
      <abbreviation>NY</abbreviation>
      </state>
    Empty:
      <tag_name></tag_name>
      <tag_name/>
  • 21. Elements & Attributes continued
    Attribute:
      <element attribute="item">element content</element>
      <state abbreviation="NY">New York</state>
    External file reference (provides added content without impacting XML content stream size):
      <image href="file://server3/photographs/RedRose.jpg"/>
      <state>
        <name>New York</name>
        <abbreviation>NY</abbreviation>
        <map href="file://NYmap.jpg" />
        <population/>
      </state>
  • 22. Defined Entities
    Predefined Entities:
      &amp;   ampersand
      &lt;    less-than or left angle bracket
      &gt;    greater-than or right angle bracket
      &quot;  double quote mark
      &apos;  apostrophe or single quote mark
    General Entities (defined by developer):
      Substitute into XML when a value may not “parse” correctly by the XML processor.
      <!ENTITY entity_name "substituted value">
    Example shows the declaration, the XML statement, and the parsing result:
      <!ENTITY bang "!">
      <warning>Hazardous Material&bang; Do Not Ingest&bang;</warning>
      RESULT: Hazardous Material! Do Not Ingest!
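
A short sketch of both entity kinds, using Python's standard library rather than any tool named in the deck. It assumes the `bang` entity is declared in an internal DTD subset, which the standard parser expands; `escape` is the library helper that substitutes the predefined entities for `&`, `<`, and `>`.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Predefined entities: escape() replaces &, <, and > before text goes into XML.
escaped = escape("Ham & Swiss on Rye")
sandwich = ET.fromstring("<sandwich>" + escaped + "</sandwich>")

# General entity: declared in an internal DTD subset, expanded by the parser.
DOC = (
    '<!DOCTYPE warning [<!ENTITY bang "!">]>'
    "<warning>Hazardous Material&bang; Do Not Ingest&bang;</warning>"
)
warning = ET.fromstring(DOC)

print(escaped)
print(sandwich.text)
print(warning.text)
```

The parsed `sandwich.text` holds the plain ampersand again, which shows that the entity exists only in the serialized XML, not in the data.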
  • 23. “Sack Lunch” XML Example
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE lunch SYSTEM "SackLunch.dtd">
      <!-- Sack lunch for the day -->
      <lunch>
        <day name="Monday">
          <sandwich>Ham &amp; Swiss on Rye</sandwich>
          <fruit>Orange</fruit>
          <chips>Potato Ruffles</chips>
          <drink>Root Beer</drink>
          <sweet>Chocolate Bar</sweet>
          <nuts>Cashews</nuts>
        </day>
        <day name="Wednesday">
          <sandwich>Egg &amp; Olive on Wheat</sandwich>
          <fruit>Apple</fruit>
          <chips>Corn Chips</chips>
          <drink>Iced Tea</drink>
          <sweet>Molasses Cookies</sweet>
          <nuts>Smoked Almonds</nuts>
        </day>
      </lunch>
    Slide callouts: line 1 is the XML Declaration; line 2 is the DTD Declaration; line 3 is a Comment; <lunch> is the Data Root; <day name="..."> is an Element with Attribute; the sandwich lines show Element content with a Pre-defined Entity.
  • 24. DTD Building Blocks
    Common element statements (see definitions below):
      <!ELEMENT elementName (#PCDATA)>
      <!ELEMENT elementName (sequence of sub-elements)>
      <!ELEMENT elementName EMPTY>
    Definitions: Element terms and qualifiers:
      #PCDATA  Stands for “parsed character data”: text that the parser scans for markup and entities
      CDATA    Stands for “character data” exempt from parsing; it is an attribute type (and a <![CDATA[ ... ]]> section in XML content), not an element content keyword, so (#CDATA) is not valid in an element declaration
      EMPTY    The element has no content or embedded sub-elements
      ( )      Element grouping
      ,        Element separator indicating specific order
      |        Element “or” separator indicating choice
      +        Element may occur 1 or more times
      ?        Element may occur 0 or 1 time
      *        Element may occur 0 or more times
  • 25. DTD Building Blocks continued
    Common attribute statements (see definitions below):
      <!ATTLIST elementName attributeName TYPE #KEYWORD>
      <!ATTLIST elementName valueName (val1|val2|val3) "defaultVal">
      <!ATTLIST elementName href CDATA #REQUIRED>
    Definitions: Attribute types and keywords:
      CDATA      Means that the attribute contains character data
      ID         The attribute provides a unique XML identifier name
      #REQUIRED  An attribute value is mandatory
      #IMPLIED   An attribute value is optional
      #FIXED     The attribute has a fixed value
  • 26. “Sack Lunch” DTD Example
      <?xml version="1.0" encoding="UTF-8"?>
      <!-- Sack Lunch DTD for one or more weekdays -->
      <!ELEMENT lunch (day)+>
      <!ELEMENT day (sandwich, fruit, chips, drink, sweet, nuts) >
      <!ELEMENT sandwich (#PCDATA)>
      <!ELEMENT fruit (#PCDATA)>
      <!ELEMENT chips (#PCDATA)>
      <!ELEMENT drink (#PCDATA)>
      <!ELEMENT sweet (#PCDATA)>
      <!ELEMENT nuts (#PCDATA)>
      <!ATTLIST day name CDATA #REQUIRED >
    Slide callouts: line 1 is the XML Declaration; line 2 is a Comment; lunch is the Data Root; day is an XML Element with its list of “sub” elements; sandwich through nuts are the Elements nested within “day”; the ATTLIST line declares the Attribute belonging to the Element “day”.
  • 27. “Mailer” Data Stream Example: DTD & XML for a VDP Document
    DTD:
      <?xml version="1.0" encoding="UTF-8"?>
      <!-- DTD for Mailer -->
      <!ELEMENT dataroot (row+)>
      <!ELEMENT row (name, image, location, offer, product, title, first, last, street, city, state, zip, code) >
      <!ELEMENT name (#PCDATA)>
      <!ELEMENT image EMPTY>
      <!ELEMENT location (#PCDATA)>
      <!ELEMENT offer (#PCDATA)>
      <!ELEMENT product (#PCDATA)>
      <!ELEMENT title (#PCDATA)>
      <!ELEMENT first (#PCDATA)>
      <!ELEMENT last (#PCDATA)>
      <!ELEMENT street (#PCDATA)>
      <!ELEMENT city (#PCDATA)>
      <!ELEMENT state (#PCDATA)>
      <!ELEMENT zip (#PCDATA)>
      <!ELEMENT code (#PCDATA)>
      <!ATTLIST image href CDATA #REQUIRED>
    XML:
      <?xml version="1.0" encoding="UTF-8"?>
      <!DOCTYPE dataroot SYSTEM "Mailer.dtd">
      <dataroot>
      <row>
      <name>Anthony</name>
      <image href="file://Berries.jpg"/>
      <location>North Side</location>
      <offer>5% Discount</offer>
      <product>Paving Stone</product>
      <title>Mr.</title>
      <first>Anthony</first>
      <last>Able</last>
      <street>27 Able Street</street>
      <city>North Side</city>
      <state>NY</state>
      <zip>14533</zip>
      <code>14533-27200803</code>
      </row>
      </dataroot>
  • 28. XML Standards & Conventions
    Well Formed:
      1. XML documents should begin with an XML declaration (optional in XML 1.0, but recommended)
      2. An XML document must contain a single root element that contains all the other elements
      3. All elements must be nested and not overlap (child tags are closed prior to closing the parent)
      4. XML tags are case-sensitive and need consistency
      5. Tag names can’t contain spaces or start with “XML”, a number, or punctuation (except underscore)
      6. All elements require start and end tags (an empty element may use a single self-closing tag)
      7. All attribute values must be enclosed in quote marks
      8. Syntactical characters are replaced with pre-defined entities
      9. The conventions apply to all types of XML documents
    Valid:
      The XML-tagged document must match the document’s definition (DTD).
      “internal DTD”: the definition is contained within brackets in the XML file.
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE document_root [ DTD definition details here ]>
      “external DTD”: the XML file and the DTD file are separate.
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE document_root SYSTEM "path/dtdName.dtd">
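
The well-formedness rules can be checked mechanically. The sketch below uses Python's standard `xml.etree.ElementTree` parser, which reports violations as `ParseError`. It tests well-formedness only: that parser does not validate against a DTD, so checking validity would need a validating parser. The function name and sample strings are illustrative.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as well-formed XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

good = is_well_formed("<state><name>New York</name></state>")
overlapping = is_well_formed("<state><name>New York</state></name>")  # rule 3
unquoted = is_well_formed("<state abbreviation=NY/>")                  # rule 7
two_roots = is_well_formed("<a/><b/>")                                 # rule 2
```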
  • 29. IV. Basic Data Processing
    • Handling Text Strings
    • Handling Typical Data Structures
    • Simple Programming Logic
    • “DogDB” Processing Example
  • 30. Handling Text Strings
    Fixed Length Record and Field Processing
    • Files contain records; records contain fields made of one or more characters in a string
    • This sample text string has 26 characters + Line Feed: “abcdefghijklmnopqrstuvwxyzLF”
    • Each character occupies a position in the string numbered sequentially from zero through 25
    • If the first field were six characters, the substring would start at zero (the first character position) and count six character positions (zero through five), giving “abcdef”
    • The next field (substring) would start at position six, and suppose it is ten characters long (six through fifteen), which gives “ghijklmnop”
    • Each record in the file would be a 26 character string subdivided into fields by substrings
    • AT ISSUE: Any deviation in the number of characters would corrupt field processing
    Delimited (variable length) Record and Field Processing
    • Delimiters resolve any variances in character counts
    • Record fields can be delimited with commas, called comma-separated values (CSV): “abcdef,ghijklmnop,qrstuvwxyzLF”
    • Other values that are not likely to appear in the text may also be used as delimiters, an example being the tab-delimited record below (Tab marks each tab character): “abcdef[Tab]ghijk,lmnop[Tab]qrstuvwxyzLF”
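
The two approaches can be compared in a few lines. This is a Python sketch, not the lecture's Perl; the helper names are hypothetical, and the width of the third fixed field (ten characters) is an assumption made so that the three fields fill the 26-character record.

```python
RECORD = "abcdefghijklmnopqrstuvwxyz\n"  # 26 characters plus a line feed

def fixed_fields(record, widths):
    """Cut a record into fields by character position, starting at zero."""
    fields, start = [], 0
    for width in widths:
        fields.append(record[start:start + width])
        start += width
    return fields

def delimited_fields(record, delimiter):
    """Cut a record into fields at each delimiter, ignoring the line feed."""
    return record.rstrip("\n").split(delimiter)

fixed = fixed_fields(RECORD, [6, 10, 10])
csv_fields = delimited_fields("abcdef,ghijklmnop,qrstuvwxyz\n", ",")
tab_fields = delimited_fields("abcdef\tghijk,lmnop\tqrstuvwxyz\n", "\t")
```

In the tab-delimited record the comma inside the second field survives as ordinary data, which is the reason for choosing a delimiter unlikely to appear in the text.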
  • 31. Handling Typical Data Structures
    • Sequential File Processing
    • Table or Spreadsheet Processing
    • XML DOM & XSLT Processing
    • XML Sequential Processing
  • 32. Sequential File Processing
    File structure and the matching File Processing Sequence:
      File Header                                 Start Processing (optional Open of new file)
      First Record                                Read Record
        Field 1, Field 2, Field 3                 Read Record Fields (optional Write to new file)
      Second Record                               Read Record
        Field 1, Field 2, Field 3                 Read Record Fields (optional Write to new file)
      Nth Record                                  Read Record
        Field 1, Field 2, Field 3                 Read Record Fields (optional Write to new file)
      End of File (EOF)                           Complete Processing (Close any open files)
  • 33. Table or Spreadsheet Processing of selected Columns within selected Rows
      Column 1       Column 2  Column 3   Column 4  Column 5  Column n
      (usually Key)  (name)    (address)  (city)    (state)   (etc.)
      Row 1          name 1    address 1  city 1    state 1   etc. 1
      Row 2          name 2    address 2  city 2    state 2   etc. 2
      Row 3          name 3    address 3  city 3    state 3   etc. 3
      Row 4          name 4    address 4  city 4    state 4   etc. 4
      Row 5          name 5    address 5  city 5    state 5   etc. 5
      Row n          name n    address n  city n    state n   etc. n
    Data Cleansing:
    • Seek by row and column selection criteria
    • Process selected row columns (read, update, delete, add)
    • Seek by row and column selection criteria for targeted processing
    Data Flattening:
    • Make conditional table joins to link related data
    • Query join results to create new table views for further processing
    Data Enhancement:
    • Convert views into tables and perform further enhancement processing
  • 34. XML DOM Processing Structure
    Document Object Model Hierarchical Structure:
      XML Header
      Data Root Tag
        First Row Tag: column tag 1, column tag 2
        Second Row Tag: column tag 1, column tag 2
        Last Row Tag: column tag 1, column tag 2
    Hierarchical Recursive Processing:
      Load: Load Entire Document or File into Memory
      Traverse: Start at Root; Down by Branch; Across Levels; Repeat to End
      Select: Locate by Tag; Identify Content; Identify Context; Record Content
      Execute: Move; Delete; Add; Revise; Replace
      Construct: Generate a New Document or File; Output to Designated Storage
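
The Load, Traverse, Select, Execute, and Construct steps can be sketched with Python's standard `xml.dom.minidom`. This is an illustration only: the sample document borrows element names from the DogDB example later in the deck, and the "revise" action (adding to each weight) is a hypothetical edit chosen to show an in-memory change.

```python
from xml.dom import minidom

SOURCE = (
    "<dataroot>"
    "<row><name>Slinky</name><weight>115</weight></row>"
    "<row><name>Nora</name><weight>90</weight></row>"
    "</dataroot>"
)

def revise_weights(xml_text, increment):
    """Load a document, revise every <weight>, and return the new XML text."""
    dom = minidom.parseString(xml_text)                      # Load into memory
    for row in dom.documentElement.getElementsByTagName("row"):   # Traverse
        weight = row.getElementsByTagName("weight")[0]       # Select by tag
        text_node = weight.firstChild
        text_node.data = str(int(text_node.data) + increment)     # Execute: revise
    return dom.documentElement.toxml()                       # Construct output

revised = revise_weights(SOURCE, 1)
```

The whole tree is held in memory at once, which is why this style suits small and medium files better than very large streams.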
  • 35. Processing with XSLT
    Extensible Stylesheet Language Transformations (XSLT)
    [Diagram: a Source XML File and an XSL Style Sheet (new XML file generation instructions) feed the XSLT Processor, which outputs a Revised XML File.]
    • The original XML file is preserved
    • A new XML file is assembled in XSLT storage
    • The result is a new XML file
    Note: memory-intensive processing
  • 36. XML Sequential Processing
    XML structure and the matching Sequential Processing by XML Tag:
      XML Header
      Data Root Tag                               Start Processing (optional Open of new file)
      First Row Tag                               Read Record (load data into memory)
        column tag 1, column tag 2                Read Record Fields (optional Write to new file)
      Row End Tag
      Second Row Tag                              Read Record (load new data into memory)
        column tag 1, column tag 2                Read Record Fields (optional Write to new file)
      Row End Tag
      Last Row Tag                                Read Record (load new data into memory)
        column tag 1, column tag 2                Read Record Fields (optional Write to new file)
      Row End Tag
      Data Root End Tag                           Complete Processing (Close any open files)
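
Row-at-a-time processing of this kind can be sketched with `iterparse` from Python's standard library, which reports each element as its end tag is read. The sample data and the `stream_rows` name are hypothetical; clearing each row after use is an assumption made here to keep memory use low on long streams.

```python
import io
import xml.etree.ElementTree as ET

SOURCE = (
    b"<dataroot>"
    b"<row><name>Slinky</name><weight>115</weight></row>"
    b"<row><name>Nora</name><weight>90</weight></row>"
    b"</dataroot>"
)

def stream_rows(stream):
    """Yield one dict per <row> as soon as its end tag has been read."""
    for event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "row":
            yield {child.tag: child.text for child in elem}
            elem.clear()  # discard the row's children before reading the next record

names = [row["name"] for row in stream_rows(io.BytesIO(SOURCE))]
```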
  • 37. Simple Programming Logic
    • Fundamental Logic
    • Program Overview
  • 38. Simple Read/Write Logic
    Access (open) File
    Basic File Processing Loop:
      Read a Record (string)
      Test for File End: If end of file (EOF), then close file and exit processing; Else next
      Test Selection (substring): If substring matches criteria, then perform specific actions; Else next
      Return to Top
    Note: disk-interactive processing
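
The loop above might look like this in Python. It is a sketch, not the lecture's Perl program: in-memory text streams stand in for the opened disk files so the example is self-contained, and the `<name>` selection prefix is a hypothetical criterion.

```python
import io

def copy_matching(infile, outfile, prefix):
    """Read records until end of file; write only those that start with prefix."""
    written = 0
    while True:
        record = infile.readline()           # Read a record (string)
        if record == "":                     # Test for file end: "" means EOF
            break
        if record[:len(prefix)] == prefix:   # Test selection (substring)
            outfile.write(record)            # Perform the specific action
            written += 1
    return written                           # Caller closes the files

source = io.StringIO("<row>\n<name>Slinky</name>\n</row>\n")
target = io.StringIO()
count = copy_matching(source, target, "<name>")
```

With real files, `open()` would replace the two `StringIO` objects and the loop body would stay the same.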
  • 39. XML Tag Process Logic
    Reading left to right, each successive line from the top...
    Select and Process a Line of Text:
      XML Element:           <tag> Some enclosed content </tag>
      Same as Text:          <tag> Some enclosed content </tag>
      String of Characters:  "<tag> Some enclosed content </tag>"
      Substring (red on the slide), 10 characters long starting at zero: "<tag> Some"
    Decision Criteria:
      Decision: If Substring(0,10) = "<tag> Some"
      Then Do: a specific process or group of processes
      Otherwise (Else): do a different process, or simply advance to the next element
  • 40. Processing Program Overview
    [Diagram: program structure, top to bottom.]
    Perl Declaration & Processing Parameters. Input/Output Identification: enter file names or URIs
    Initiating Subroutine Calls: Primary Processing (Primary Process Calls)
    Packaged Reusable File Handling Logic:
      Subroutine: Open Files (Input & Output) -- main process start call
      Subroutine: Close Files (Input & Output) -- main process end call
    Primary File Processing Logic: Subroutine(s): Primary Logic (one or more) (file processing loop)
    Additional Processing Logic: Subroutine(s): Secondary (zero or more) (return control to caller after processing completion)
  • 41. “DogDB” Processing Example
  • 42. A Short File of Dogs
      number  name    breed              gender  photo              weight
      1       Slinky  German Shepherd    Male    file://slinky.jpg  115
      2       Nora    Doberman Pinscher  Female  file://nora.jpg    90
      3       Crunch  Rottweiler         Male    file://crunch.jpg  125
    Possible Storage Formats:
    • Sequential File of Comma-Separated Values (CSV) with or without Header Row
    • Sequential File of Tab-Separated Values (TSV) with or without Header Row
    • Spreadsheet with or without Header Row
    • Database Table (will automatically use field names for Column Heads)
    Objectives:
      1. Import data into a Desktop Database (FileMaker Pro in this example)
      2. Cleanse and Enhance Data
      3. Convert it into a usable XML Data Stream
      4. Sanitize & simplify XML to avoid potential layout integration problems
  • 43. Raw Converted XML
    XML Source: data cleanup and enhancement results exported into XML from FileMaker Pro (a single unbroken line):
      <?xml version="1.0" encoding="UTF-8" ?><!-- This grammar has been deprecated - use FMPXMLRESULT instead --><FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult"><ERRORCODE>0</ERRORCODE><DATABASE>DogDB.fp7</DATABASE><LAYOUT></LAYOUT><ROW MODID="3" RECORDID="1"><number>1</number><name>Slinky</name><breed>German Shepherd</breed><gender>Male</gender><photo>file://slinky.jpg</photo><weight>115</weight></ROW><ROW MODID="2" RECORDID="2"><number>2</number><name>Nora</name><breed>Doberman Pinscher</breed><gender>Female</gender><photo>file://nora.jpg</photo><weight>90</weight></ROW><ROW MODID="2" RECORDID="3"><number>3</number><name>Crunch</name><breed>Rottweiler</breed><gender>Male</gender><photo>file://crunch.jpg</photo><weight>125</weight></ROW></FMPDSORESULT>
    Preliminary Setup: Use a text editor to search on the "><" character combination and replace it with ">CRLF<" or other appropriate line feed symbol.
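
The same preliminary search and replace can be scripted. This is a minimal Python sketch; the `split_tags` name is hypothetical and the sample is a shortened piece of the export above.

```python
def split_tags(raw_xml, newline="\n"):
    """Insert a line break between every adjacent pair of tags ("><")."""
    return raw_xml.replace("><", ">" + newline + "<")

raw = '<ROW MODID="3" RECORDID="1"><number>1</number><name>Slinky</name></ROW>'
lines = split_tags(raw).split("\n")
for line in lines:
    print(line)
```

Passing `newline="\r\n"` would produce the CRLF line endings mentioned on the slide.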
  • 44. Search/Replace Results
      <?xml version="1.0" encoding="UTF-8" ?><!-- This grammar has been deprecated - use FMPXMLRESULT instead -->
      <FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult">
      <ERRORCODE>0</ERRORCODE>
      <DATABASE>DogDB.fp7</DATABASE>
      <LAYOUT/>
      <ROW MODID="3" RECORDID="1">
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo>file://slinky.jpg</photo>
      <weight>115</weight>
      </ROW>
      <ROW MODID="2" RECORDID="2">
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo>file://nora.jpg</photo>
      <weight>90</weight>
      </ROW>
      <ROW MODID="2" RECORDID="3">
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo>file://crunch.jpg</photo>
      <weight>125</weight>
      </ROW>
      </FMPDSORESULT>
  • 45. Export Result using FMPXMLRESULT
      <?xml version="1.0" encoding="UTF-8" ?>
      <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
      <ERRORCODE>0</ERRORCODE>
      <PRODUCT BUILD="11-30-2007" NAME="FileMaker Pro" VERSION="8.5v2"/>
      <DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="DogDB.fp7" RECORDS="3" TIMEFORMAT="h:mm:ss a"/>
      <METADATA>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="number" TYPE="NUMBER"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="name" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="breed" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="gender" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="photo" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="weight" TYPE="NUMBER"/>
      </METADATA>
      <RESULTSET FOUND="3">
      <ROW MODID="2" RECORDID="1">
      <COL> <DATA>1</DATA> </COL>
      <COL> <DATA>Slinky</DATA> </COL>
      <COL> <DATA>German Shepherd</DATA> </COL>
      <COL> <DATA>Male</DATA> </COL>
      <COL> <DATA>file://slinky.jpg</DATA> </COL>
      <COL> <DATA>115</DATA> </COL>
      </ROW>
    FMPXMLRESULT Export Format -- Not What Is Needed!
    • Superfluous XML -- pink highlights on the slide
    • Columnar Data not Tagged with Database Column Names -- yellow highlights on the slide
  • 46. Export Result using FMPDSORESULT
      <?xml version="1.0" encoding="UTF-8" ?>
      <!-- This grammar has been deprecated - use FMPXMLRESULT instead -->
      <FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult">
      <ERRORCODE>0</ERRORCODE>
      <DATABASE>TestDB.fp7</DATABASE>
      <LAYOUT/>
      <ROW MODID="2" RECORDID="1">
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo>file://slinky.jpg</photo>
      <weight>115</weight>
      </ROW>
      <ROW MODID="0" RECORDID="2">
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo>file://nora.jpg</photo>
      <weight>90</weight>
      </ROW>
      <ROW MODID="0" RECORDID="3">
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo>file://crunch.jpg</photo>
      <weight>125</weight>
      </ROW>
      </FMPDSORESULT>
    Notice the differences from the previous export format!
    • Delete all pink and yellow highlighted items
    • Modify all green highlighted items
  • 47. Fully Prep’ed XML
      <?xml version="1.0" encoding="UTF-8" ?>
      <dataroot>
      <row>
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo href="file://slinky.jpg"/>
      <weight>115</weight>
      </row>
      <row>
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo href="file://nora.jpg"/>
      <weight>90</weight>
      </row>
      <row>
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo href="file://crunch.jpg"/>
      <weight>125</weight>
      </row>
      </dataroot>
    • All superfluous XML removed
    • All superfluous “white space” removed
    • Image links converted to attribute references linking external files
    How do we get there?
  • 48. Option 1: Using a Text Editor
    Extra XML: simply delete!
    XML or content changes: use Search & Replace
    Create Attribute Reference Links:
      1. Target text string: <photo>file://nora.jpg</photo>
      2. Search on <photo>, and replace it with <photo href="
      3. Search on </photo>, and replace it with "/>
      4. The result is <photo href="file://nora.jpg"/>
    Get rid of “row” attributes:
      1. Target text string: <ROW MODID="2" RECORDID="1">
      2. Set search to ignore case
      3. Search on the whole start tag, from <row modid= through the closing >, and replace it with <row> (the attribute values differ between rows, so use a wildcard or regular-expression search, or repeat the search for each distinct tag)
      4. The result is <row>
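
For editors that support regular expressions, the two replacements can be expressed as patterns. The sketch below tries them out in Python; the function names are hypothetical, and the patterns assume one element per line, as in the search/replace results shown earlier.

```python
import re

def photo_to_href(line):
    """Turn <photo>path</photo> into <photo href="path"/>."""
    return re.sub(r"<photo>(.*?)</photo>", r'<photo href="\1"/>', line)

def strip_row_attributes(line):
    """Turn any <ROW ...> start tag, in any letter case, into a bare <row>."""
    return re.sub(r"<row\b[^>]*>", "<row>", line, flags=re.IGNORECASE)

print(photo_to_href("<photo>file://nora.jpg</photo>"))
print(strip_row_attributes('<ROW MODID="2" RECORDID="1">'))
```

The end tag `</ROW>` is left alone by the second pattern because it begins with `</`, so it still needs its own replacement with `</row>`.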
  • 49. Option 2: XSLT Step 1
    XSL Stylesheet:
      <?xml version="1.0" encoding="UTF-8"?>
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <!-- Remove superfluous xml elements -->
        <xsl:template match="ERRORCODE"/>
        <xsl:template match="DATABASE"/>
        <xsl:template match="LAYOUT"/>
        <!-- Copy ROW Children into new row parents -->
        <xsl:template match="ROW">
          <row>
            <xsl:copy-of select="*|node()"/>
          </row>
        </xsl:template>
        <!-- Replace FMPDSORESULT root with a new root -->
        <xsl:template match="/">
          <dataroot>
            <xsl:apply-templates/>
          </dataroot>
        </xsl:template>
      </xsl:stylesheet>
    Note: in XSLT 1.0, unprefixed names in match patterns select only elements that are in no namespace, so this stylesheet works as shown only if the xmlns="http://www.filemaker.com/fmpdsoresult" declaration is first removed from the exported root element.
  • 50. Option 2: XSLT Result #1
      <?xml version="1.0" encoding="UTF-8"?>
      <dataroot>
        <row>
          <number>1</number>
          <name>Slinky</name>
          <breed>German Shepherd</breed>
          <gender>Male</gender>
          <photo>file://slinky.jpg</photo>
          <weight>115</weight>
        </row>
        <row>
          <number>2</number>
          <name>Nora</name>
          <breed>Doberman Pinscher</breed>
          <gender>Female</gender>
          <photo>file://nora.jpg</photo>
          <weight>90</weight>
        </row>
        <row>
          <number>3</number>
          <name>Crunch</name>
          <breed>Rottweiler</breed>
          <gender>Male</gender>
          <photo>file://crunch.jpg</photo>
          <weight>125</weight>
        </row>
      </dataroot>
  • 51. Option 2: XSLT Step 2
    XSL Stylesheet:
      <?xml version="1.0" encoding="UTF-8"?>
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <!-- Add new image element with reference attribute -->
        <!-- Add new weight element following image element -->
        <xsl:template match="row">
          <row>
            <xsl:apply-templates/>
            <photo href="{photo}"/>
            <weight><xsl:value-of select="weight"/></weight>
          </row>
        </xsl:template>
        <!-- Remove old duplicate elements -->
        <xsl:template match="photo"/>
        <xsl:template match="weight"/>
        <!-- Copy everything else -->
        <xsl:template match="@*|node()">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:template>
      </xsl:stylesheet>
  • 52. Option 2: XSLT Result #2
      <?xml version="1.0" encoding="UTF-8"?>
      <dataroot>
        <row>
          <number>1</number>
          <name>Slinky</name>
          <breed>German Shepherd</breed>
          <gender>Male</gender>
          <photo href="file://slinky.jpg"/>
          <weight>115</weight>
        </row>
        <row>
          <number>2</number>
          <name>Nora</name>
          <breed>Doberman Pinscher</breed>
          <gender>Female</gender>
          <photo href="file://nora.jpg"/>
          <weight>90</weight>
        </row>
        <row>
          <number>3</number>
          <name>Crunch</name>
          <breed>Rottweiler</breed>
          <gender>Male</gender>
          <photo href="file://crunch.jpg"/>
          <weight>125</weight>
        </row>
      </dataroot>
    • All superfluous XML removed
    • Image links converted to attribute references linking to external files
    • Superfluous “leading white space” will need removal with a Text Editor
  • 53. Option 3: Cleanup with Perl
      XML Cleanup: Program Logic Overview
      1. Call the "openFiles" subroutine to commence file processing
      2. Enter the processing loop and read a record, eliminating trailing blanks
      3. If a "ROW" element with attributes, replace it with a "row" element without attributes
      4. If an end "ROW" element, replace it with an end "row" element
      5. If a comment, skip it and don't write it to output
      6. If the starting "FMPDSORESULT" tag, replace it with "dataroot" (do the same with the corresponding end tag when encountered)
      7. If "ERRORCODE", "DATABASE", or "LAYOUT" tags, skip them and don't write them to output
      8. If a "photo" tag, rebuild it using the "hrefElement" subroutine
      9. For each record read: write the record as is, write its replacement, or write nothing; then read the next record
      10. After reading all records, call the "closeFiles" subroutine to terminate file processing
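The ten steps above can be sketched compactly. This is a minimal illustration in Python rather than the lecture's Perl, working on a string instead of files. The names `cleanse_xml`, `href_element`, and `SKIP_PREFIXES` are hypothetical, and stripping leading as well as trailing blanks is an assumption that goes slightly beyond step 2.

```python
import re

# Line prefixes that are dropped from the output (steps 5 and 7).
SKIP_PREFIXES = ("<!--", "<ERRORCODE>", "<DATABASE>", "<LAYOUT>", "</LAYOUT>")


def href_element(line: str) -> str:
    """Rebuild <photo>file://x.jpg</photo> as <photo href="file://x.jpg"/> (step 8)."""
    match = re.fullmatch(r"<(\w+)>(.*)</\1>", line)
    if match is None:
        return line  # leave anything unexpected untouched
    tag, content = match.groups()
    return f'<{tag} href="{content}"/>'


def cleanse_xml(raw: str) -> str:
    """Apply the cleanup steps to raw FMPDSORESULT text and return clean XML."""
    out = []
    # Put every element on its own line by breaking at each "><" junction.
    for line in raw.replace("><", ">\n<").splitlines():
        line = line.strip()  # assumption: remove leading blanks too
        if not line or line.startswith(SKIP_PREFIXES):
            continue
        if line.startswith("<ROW"):
            line = "<row>"
        elif line.startswith("</ROW>"):
            line = "</row>"
        elif line.startswith("<FMPDSORESULT"):
            line = "<dataroot>"
        elif line.startswith("</FMPDSORESULT>"):
            line = "</dataroot>"
        elif line.startswith("<photo>"):
            line = href_element(line)
        out.append(line)
    return "\n".join(out) + "\n"
```

Calling `cleanse_xml` on the text of a raw export returns the cleaned stream as one string, which can then be written to the output file.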
  • 54. Option 3: The Perl Program
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # Set up input & output file names
      $inputfl  = "rawXMLin.txt";
      $outputfl = "cleanXMLout.xml";
      $transfl  = "deleteME.xml";      # Delete this file after program completes
      $href     = "href=\"";
      $endtag   = "\"/>";

      # Main processing function calls
      &initializeXML;
      &cleanseXML;

      # Open input & output files for processing
      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      # Close input & output files after processing
      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      # Prepare raw file by inserting line feed between >< junctions
      sub initializeXML {
          &openFiles($inputfl, $transfl);
          while ($import = <INFILE>) {
              chomp($import);
              $import =~ s/></>\n</g;
              $export = $import . "\n";
              print OUTFILE $export;
          }
          &closeFiles;
      }
      (continued on next page)
  • 55. Option 3: Perl Program continued
      # Main file preparation function
      sub cleanseXML {
          &openFiles($transfl, $outputfl);
          while ($import = <INFILE>) {
              chomp($import);
              if    (substr($import,0,4)  eq "<ROW")            {$import = "<row>";}
              elsif (substr($import,0,6)  eq "</ROW>")          {$import = "</row>";}
              elsif (substr($import,0,4)  eq "<!--")            {next;}
              elsif (substr($import,0,13) eq "<FMPDSORESULT")   {$import = "<dataroot>";}
              elsif (substr($import,0,11) eq "<ERRORCODE>")     {next;}
              elsif (substr($import,0,10) eq "<DATABASE>")      {next;}
              elsif (substr($import,0,8)  eq "<LAYOUT>")        {next;}
              elsif (substr($import,0,9)  eq "</LAYOUT>")       {next;}
              elsif (substr($import,0,15) eq "</FMPDSORESULT>") {$import = "</dataroot>";}
              elsif (substr($import,0,7)  eq "<photo>")         {$import = hrefElement($import);}
              $export = $import . "\n";
              print OUTFILE $export;
          }
          &closeFiles;
      }

      # href preparation function
      sub hrefElement {
          my $element = $_[0];
          $lentxt     = length($element);
          $boundary1  = index($element,">");                 # end of the start tag
          $tag        = substr($element,0,$boundary1);       # start tag without ">"
          $offset     = $boundary1 + 1;                      # first content character
          $boundary2  = ($lentxt - rindex($element,"<"));    # length of the end tag
          $sublen     = $lentxt - $offset - $boundary2;      # length of the content
          $content    = substr($element, $offset, $sublen);
          $newelement = ($tag . " " . $href . $content . $endtag);
          return $newelement;
      }
  • 56. Option 4: File Conversion with Perl
      Convert a spreadsheet or delimited file directly to clean, integration-ready XML using a Perl program
      Spreadsheet or Delimited File (each gap between columns represents a tab):
      number  name    breed              gender  photo              weight
      1       Slinky  German Shepherd    Male    file://slinky.jpg  115
      2       Nora    Doberman Pinscher  Female  file://nora.jpg    90
      3       Crunch  Rottweiler         Male    file://crunch.jpg  125
      Any type of delimiter is acceptable, BUT tab delimiters present fewer problems
  • 57. Option 4: What You Need to Know
      CONVERSION CAVEATS:
      1. The absolute necessity of understanding the source data file or spreadsheet
      2. The absolute necessity of adhering to any conventions established by the program
      Additional Program Setup Considerations
      A. If the first row of the data file contains headings, set the program "hdflag" to "1"; otherwise set it to "0" (zero).
      B. If the file doesn't have column headings:
         (1) Edit them directly into the file with a text editor or spreadsheet program, or
         (2) Replace the default "tag names" in the program with the desired names
         (3) Ensure tag names correspond to the data columns (same number; same sequence).
      C. Ensure that the "line feeds" at the end of each row in the delimited file conform to the system on which the operation will take place.
         (1) Mac OS X will use the standard UNIX line feed (LF).
         (2) PCs will expect the combination carriage return and line feed (CRLF).
      D. References to external files or images use the format: file://URI path/file name with extension. The program explicitly searches for "file://" to identify references. (Example from "Dog Database": file://desktop/images/slinky.jpg.)
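Considerations B(3), C, and D above lend themselves to small pre-flight checks. The following is a hedged sketch in Python rather than the lecture's Perl. The function names are hypothetical, and converting every line ending to LF is one possible convention, not the program's stated behavior.

```python
def normalize_line_endings(text: str) -> str:
    """Convert CRLF (PC) and bare CR line endings to the UNIX line feed (LF)."""
    return text.replace("\r\n", "\n").replace("\r", "\n")


def is_file_reference(value: str) -> bool:
    """True when a column value follows the file://path/name.ext convention."""
    return value.startswith("file://")


def tag_names_match(tag_names, data_row: str, delimiter: str = "\t") -> bool:
    """Check that there is exactly one tag name per data column."""
    return len(tag_names) == len(data_row.split(delimiter))
```

Running the delimited file through `normalize_line_endings` first means the rest of the processing only has to deal with one kind of line ending.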
  • 58. Option 4: The Perl Program (page 1 of 3)
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # Set input and output file names
      $inputfl  = "DataFileIn.tab";
      $outputfl = "XMLFileOut.xml";
      $errlist  = "ErrListOut.txt";

      # Static variables
      $declaration = '<?xml version="1.0" encoding="UTF-8"?>';
      $rootfore = "<dataroot>\n";
      $rootaft  = "</dataroot>\n";
      $rowfore  = "<row>\n";
      $rowaft   = "</row>\n";
      $tagfore  = "<";
      $tagaft   = ">";
      $tagterm  = "</";
      $tagend   = ">\n";
      $href     = " href=\"";
      $hrefend  = "\"/>\n";
      $reftest  = "file://";

      # XML element processing preparations
      # Column name setup: names can be revised and increased or decreased
      @tagName = qw/col1 col2 col3 col4 col5 col6 col7/;
      $cols = @tagName;

      # Column headings in file? 1 = yes; 0 = no
      # Set flag to 1 if first line of input contains column headings
      $hdflag = 0;
      if (not $hdflag) {&prepTags;}
  • 59. Option 4: Perl Program continued (pg. 2 of 3)
      # Conversion processing
      &openFiles($inputfl, $outputfl, $errlist);
      $export = $declaration . "\n" . $rootfore;
      print OUTFILE $export;
      $export = "DELIMITED DATA ERROR LISTING\n\n";
      print ERRFILE $export;
      $errcount = 0;       # initialize the error counter before it is tested
      &convert2xml;        # the transcript defines this subroutine but never calls it

      # Read delimited data in, write XML out
      sub convert2xml {
          while ($import = <INFILE>) {
              chomp($import);
              @rowBuilder = split("\t", $import);
              $check = @rowBuilder;
              # First line holds the column headings: use them as tag names
              if ($hdflag) {
                  $hdflag = 0;
                  @tagName = @rowBuilder;
                  $cols = @tagName;
                  &prepTags;
                  next;
              }
              # Wrong number of columns: log the row and skip it
              if ($check != $cols) {
                  $errcount++;
                  $export = "Item Error " . $errcount . ": " . $import . "\n";
                  print ERRFILE $export;
                  next;
              }
              $export = &tagContent;
              print OUTFILE $export;
          }
          $export = $rootaft;
          print OUTFILE $export;
          if ($errcount == 0) {
              $export = "No data errors found!\n";
              print ERRFILE $export;
          }
          &closeFiles;
      }
  • 60. Option 4: Perl Program continued (pg. 3 of 3)
      # Prepare XML element tags
      sub prepTags {
          @tagFront = @tagName;
          @tagBack  = @tagName;
          foreach $tag (@tagFront) { $tag = $tagfore . $tag . $tagaft; }
          foreach $tag (@tagBack)  { $tag = $tagterm . $tag . $tagend; }
      }

      # Apply XML tags to row content
      sub tagContent {
          for ($i = 0; $i < $cols; $i++) {
              if (substr($rowBuilder[$i], 0, 7) eq $reftest) {
                  # File reference: build an empty element with an href attribute
                  $tag = substr($tagFront[$i], 0, (length($tagFront[$i]) - 1));
                  $rowBuilder[$i] = $tag . $href . $rowBuilder[$i] . $hrefend;
              }
              else {
                  $rowBuilder[$i] = $tagFront[$i] . $rowBuilder[$i] . $tagBack[$i];
              }
          }
          $newXMLrow = $rowfore . join("",@rowBuilder) . $rowaft;
          return $newXMLrow;
      }

      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          $errfl = $_[2];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
          open(ERRFILE, ">$errfl") || die "Can't open $errfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
          close(ERRFILE);
      }
  • 61. Option 4: Completely Prepared XML
      <?xml version="1.0" encoding="UTF-8" ?>
      <dataroot>
      <row>
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo href="file://slinky.jpg"/>
      <weight>115</weight>
      </row>
      <row>
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo href="file://nora.jpg"/>
      <weight>90</weight>
      </row>
      <row>
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo href="file://crunch.jpg"/>
      <weight>125</weight>
      </row>
      </dataroot>
      This preparation was done in a single pass through a delimited file:
      One process
      No additional steps
      No manual text edits
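For comparison, the same single-pass idea can be sketched in Python rather than the lecture's Perl. The function name `delimited_to_xml` and its parameters are hypothetical. Escaping `&`, `<`, and `>` in data values is an added assumption that the Perl program does not make, and the sketch expects at least one line of input.

```python
from xml.sax.saxutils import escape, quoteattr


def delimited_to_xml(text, has_header=True, tag_names=None):
    """Convert tab-delimited text to XML; return (xml_text, rejected_rows)."""
    rows = text.splitlines()  # copes with both LF and CRLF line endings
    if has_header:
        # First line supplies the tag names, as with hdflag = 1.
        tag_names = rows[0].split("\t")
        rows = rows[1:]
    out = ['<?xml version="1.0" encoding="UTF-8"?>', "<dataroot>"]
    errors = []
    for row in rows:
        values = row.split("\t")
        if len(values) != len(tag_names):
            errors.append(row)  # wrong column count: report it, do not convert
            continue
        out.append("<row>")
        for tag, value in zip(tag_names, values):
            if value.startswith("file://"):
                # External reference becomes an empty element with href.
                out.append(f"<{tag} href={quoteattr(value)}/>")
            else:
                out.append(f"<{tag}>{escape(value)}</{tag}>")
        out.append("</row>")
    out.append("</dataroot>")
    return "\n".join(out) + "\n", errors
```

The second return value plays the role of the error listing file: any row whose column count does not match the tag names is returned there instead of being converted.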
  • 62. V. Data Preparation Demonstration: Data Acquisition & Import; Data Cleanup; Supplement, Arrange, & Sort; Database Conversion to XML
  • 63. Acquiring Data - Initial Hurdles
      Commonly Encountered Operational Issues:
      Multiple data structures and schemas
      Multiple platforms and operating systems
      Multiple formats
      Multiple security levels
      Multiple locations
      Multiple transport media
      Multiple spoken languages
  • 64. Database Import
      1. Check the text file or spreadsheet to determine whether the first row contains column headings.
         a. If the first row contains column headings, make a note for possible use in the database import setup.
         b. If the first row contains data, use a text editor or spreadsheet application to insert a new first row of simple descriptive column names.
      2. The database application's import setup may only require the file source name, file type, delimiters used, and whether the first row of the file contains headings (life is much easier with column headings).
         a. As an additional setup requirement, some database applications may require creating an empty table as the destination for imported data.
         b. If so, the application may also require pairing each source data column heading with a corresponding destination table column heading (this pairing may be done automatically).
      3. If an empty destination table for receiving imported data must be built:
         a. The first column of the new table should automatically increment row numbers (values will be generated at import).
         b. Most fields for imported data (including zip codes and static numbers) can be designated as text.
         c. Fields for quantities and monetary values that may be used in calculations should be number fields (some databases may require explicit designation as integers, i.e. whole numbers, or as floating or fixed-place decimals).
  • 65. Create Empty Database Table for Import
      If no appropriate table exists: select "File > Define > Database" from the menu bar to add a new table
      Select the "fields" button
      Enter the new field name
      Choose a field type
      Click the "create" button
  • 66. Database Import from Spreadsheet
      1. Select the Customer worksheet
      2. Select the desired columns (omit the columns the design won't use)
  • 67. Data Evaluation Before Cleanup
      1. Based on project objectives and requirements:
         a. What types of record problems are irrelevant and can be ignored?
         b. What types of problems should be resolved?
         c. What types are significant and must be repaired?
      2. Based on the ratio of detected problem records to the total number of records in the data set, can the data set be sufficiently cleansed or can it be replaced?
         a. Is the replacement likely to be any better?
         b. If actually replaced, what is the condition of the replacement?
      3. Regarding individual records, on what basis should a record be replaced or discarded?
         a. If the question is about record discard, how significant is the record in relation to the entire data set?
         b. If the question is about replacement, what is the likelihood that the replacement will be any better?
  • 68. Potential Design-related Issues
      1. Normalization: inadequate or omitted normalization or database design steps
         a. Data redundancy: repeating data elements within individual tables
         b. Multiple values for the same data element: failure to establish lookup tables
         c. Duplicate records: unique record key not established or enforced for key-only related content
      2. Referential integrity: flawed parent and child table references are a design problem exacerbated by processing
         a. Update anomalies: updating the wrong parent or child record
         b. Insertion anomalies: adding a new record into the database using an existing key
         c. Deletion anomalies: deleting a parent record and orphaning its child records, or removing the parent record while deleting a child record
      3. Data integrity: database design deficiencies or programming logic that degrades the value or accuracy of the data, resulting in the following problems:
         a. Incomplete data: failure to select, capture, or account for all records or data elements within records
         b. Lost data: faulty program procedures that drop data elements or whole records during processing
         c. Missing data: data needs never comprehended during database analysis and design
         d. Corrupted data: data elements or whole records made unintelligible or unusable during processing
  • 69. Potential Usage Issues
      Incorrect or inconsistent spelling, and transposed characters
      Ambiguous names or variations of a name
      Incorrect, inconsistent, or mixed use of capitalization
      Improperly merged fields or merged content within a field
      Extraneous characters or leading and trailing blank characters
      Empty or null values
      Abstract or meaningless coded values and incorrectly selected codes
      Unit conversion errors
      Cross-record duplication during data entry (data from separate records erroneously mixed)
      Incorrect, irrelevant, or unknown content
      Excluded or omitted content
      Under-utilized databases (spotty data and partially completed records)
      Excessively large data sets (under-maintained and out-of-date records)
      Incorrect or ad hoc use of fields for purposes unintended in the design
  • 70. Common Data Cleanup Tasks
      Make sure field names or column headings are accurate and comply with XML element naming conventions (these will become the XML element tags when converting to XML)
      Correct invalid information if possible
      Check and correct spelling, inconsistent spelling, or transpositions
      Correct upper and lower case issues
      Eliminate duplicate records (any other records with all the same values)
      Resolve ambiguous names or name variations
      Resolve cross-record duplication (if workable, set up households)
      Decode or correct coded data values: make values meaningful
      Make appropriate unit conversions
      Remove extraneous characters and leading and trailing blank characters
      Remove unintelligible, meaningless, or irrelevant content
      Resolve incorrect or missing content if possible (including four-digit zip code extensions)
      Make sure all irrelevant or irreparable records are removed
      At the data field and record levels, collect and "trash" all remaining extraneous, irrelevant, or corrupted data that can't be repaired
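A few of the tasks above (XML-safe column names, blank-character removal, and duplicate elimination) are mechanical enough to sketch in code. The example below is in Python rather than the lecture's Perl or database scripting, all function names are hypothetical, and the name check covers only a simplified ASCII subset of the XML naming rules.

```python
import re

# Simplified XML name rule: a letter or underscore first, then letters,
# digits, underscores, periods, or hyphens (ASCII only, no colons).
XML_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.-]*$")


def to_xml_name(heading: str) -> str:
    """Turn a column heading into a usable XML element name."""
    name = re.sub(r"[^A-Za-z0-9_.-]+", "_", heading.strip()).strip("_")
    # Names must not start with a digit, and names beginning with "xml" are reserved.
    if not XML_NAME.match(name) or name.lower().startswith("xml"):
        name = "_" + name
    return name


def clean_record(record: dict) -> dict:
    """Remove leading/trailing blanks and collapse runs of blanks in each value."""
    return {key: " ".join(value.split()) for key, value in record.items()}


def drop_duplicates(records: list) -> list:
    """Keep the first of any records whose fields all have the same values."""
    seen = set()
    unique = []
    for record in records:
        key = tuple(sorted(record.items()))
        if key not in seen:
            seen.add(key)
            unique.append(record)
    return unique
```

Cleaning each record before removing duplicates matters: two records that differ only by hidden white space are then recognized as the same record.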
  • 71. Data Cleanup Example:
      1. Case is the most obvious issue
      2. Hidden white space is a potential issue
      3. Spelling could be an issue
      4. Duplicate records could be another issue
      5. Incorrect data could be a hidden problem
      6. Then there's the question of multiple people at the same address
      7. Check column names for XML conversion
  • 72. Supplement, Arrange, & Sort
      Purpose:
      1. Address business logic at the database level by incorporating decision results into the data structure.
         a. This eliminates the need to process logic during variable document generation
         b. It provides a reviewable and revisable audit trail prior to document generation, if corrections or changes are identified
         c. It provides a basis for updating the working database to accommodate new requirements as a campaign progresses or a new one is launched
      2. Align the working database (and ultimately the XML content stream) to satisfy document design and sequential processing requirements
      3. Take advantage of postal rates where applicable
  • 73. Field Supplement Needs
      Potential Types of Field Additions to Data Records:
      Insert processing flags to facilitate database-level decision making:
         Binary flags that are either turned "on" or "off"
         Optional values that are set from a controlled list
         Relational values or thresholds (equality, non-equality, relative size)
      Add a new field to contain concatenated data from multiple fields in a single field, simplifying content stream processing
      Replicate a field as many times as it is used in the document design
      Summarize quantitative fields (amounts and counts) if a total is needed
      Insert a field to reference the location (URI or storage address) of each digital image used in the document design that is dependent on other data in the record
  • 74. Example: Given a Document Layout
      [Figure: document sketch of one side of a duplex document, with dimension labels 9", 6", 3", 3", .5", and .25"]
      Elements shown: <first name>; pre-paid post mark; floating text message plus <store location> and <discount offer> on <product>; <referenced graphic>; additional static information or message; <postal barcode>; <title> <given name> <last name>; <address (one line or two lines)>; <city>, <state> <zip code>; <campaign barcode>
      Text in angle brackets "< >" is variable content from an XML content stream
  • 75. And the Envisioned Document... How the Design Sketch will look as a completed document
  • 76. What Added Data Will Be Needed?
      Record Layout Prior to Adding Fields:
      adoption_number (C and four digits for cats; D and four digits for dogs)
      last_name
      first_name
      prefix (Mr., Mrs. or Ms.)
      pet_name
      adoption_date
      address
      city
      state
      zip
      Required Supplemental Fields:
      1. tracking_code
      2. mailing_code
      3. salutation
      4. addressee
      5. location
      6. pet_image
      7. pet_message
      8. adoption_message
  • 77. And the Rationale Behind Each One?
      (1) tracking_code: Field will be used to track responses. Content will be created by concatenating adoption number, zip code, and start date of the campaign (example: "D1234 14532 090615"). On the document it will be converted to a standard barcode by applying a font.
      (2) mailing_code: To take advantage of mailing rates, field will be loaded with the addressee's zip code. On the document it will be converted to a mail barcode by applying a font.
      (3) salutation: The salutation will be created by concatenating "Dear " with prefix and last name, followed by "," (example: "Dear Mrs. Smith,"). If prefix is empty, it will be dropped and the salutation will use the first name (example: "Dear George,").
      (4) addressee: Created by combining prefix (if not empty), first name, and last name (example: "Ms. Ann Gorah").
      (5) location: Combination of city, state, and zip into one field to simplify processing (example: "Anyville, OH 45678").
      (6) pet_image: Field will contain the location of the pet's digital photo. The photo file name is based on the adoption number and the pet's name. Construct the content to anticipate the XML export; combine "href = file://photos/" with adoption number, pet name, and filetype extension (example: "href = file://photos/C3456Blossom.jpg").
      (7) pet_message: If the first character of the adoption number is a "D", this field will have a dog-related message. If the first character of the adoption number is a "C", this field will have a cat-related message.
      (8) adoption_message: If the adoption date is less than one year ago, the message will be "Thank you for recently adopting " plus the pet name. If the adoption date is more than a year ago, the message will be "Would " plus pet name, plus " like a playmate?" (example: "Would Scruffy like a playmate?").
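The eight rules above amount to a small function from an original record to a supplemented one. The sketch below is in Python rather than the database scripting used in the lecture, and the function name `supplement` is hypothetical. It makes several labeled assumptions: the campaign date is formatted as yymmdd to match the "090615" example, "one year" means 365 days, photos are .jpg files, the pet messages are supplied by the caller, and pet_image holds only the file:// URI without the "href = " prefix.

```python
from datetime import date


def supplement(rec: dict, campaign_start: date, today: date,
               dog_message: str, cat_message: str) -> dict:
    """Return a copy of rec with the eight supplemental fields added."""
    out = dict(rec)
    # (1) adoption number + zip + campaign start date (yymmdd is an assumption)
    out["tracking_code"] = f'{rec["adoption_number"]} {rec["zip"]} {campaign_start:%y%m%d}'
    # (2) the zip code, later rendered as a mail barcode by a font
    out["mailing_code"] = rec["zip"]
    # (3) use prefix + last name when a prefix exists, else the first name
    if rec["prefix"]:
        out["salutation"] = f'Dear {rec["prefix"]} {rec["last_name"]},'
    else:
        out["salutation"] = f'Dear {rec["first_name"]},'
    # (4) skip the prefix when it is empty
    parts = (rec["prefix"], rec["first_name"], rec["last_name"])
    out["addressee"] = " ".join(part for part in parts if part)
    # (5) city, state, and zip in one field
    out["location"] = f'{rec["city"]}, {rec["state"]} {rec["zip"]}'
    # (6) URI only; the ".jpg" extension is an assumption
    out["pet_image"] = f'file://photos/{rec["adoption_number"]}{rec["pet_name"]}.jpg'
    # (7) first character of the adoption number selects the message
    out["pet_message"] = dog_message if rec["adoption_number"].startswith("D") else cat_message
    # (8) "one year" is taken as 365 days
    if (today - rec["adoption_date"]).days < 365:
        out["adoption_message"] = f'Thank you for recently adopting {rec["pet_name"]}'
    else:
        out["adoption_message"] = f'Would {rec["pet_name"]} like a playmate?'
    return out
```

Because each decision is stored in the record itself, the results can be reviewed and corrected before any document is generated.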
  • 78. Fields Added (✔) to Match a Design Layout
      Processing Requirement: Each XML element can only be used ONCE when placed in a document layout, and elements are ordered in design PLACEMENT SEQUENCE from top to bottom, left to right
      1. "offer", "image", & "code" elements didn't exist in the original data
      2. "name" & "location" are replicated since each is used twice in the layout
  • 79. Editing or Loading Field Content
      Use desktop database scripting capabilities:
      To populate (fill with content) any supplemental fields (columns)
      To edit (modify) content in originally existing fields
      Most functions tailored to specific needs can be variations of the "basic four"
      Four basic scripted functions (shown on the following two pages):
      1. Insert new text or values in every row of a particular column
      2. Selectively insert new text or values in one column of a row based on the value of another column in the same row
      3. Copy column content for each row and paste it into another column in the same row
      4. Assemble values from multiple columns of a row into a single string and paste it into another column in the same row
      (The following functions were defined using FileMaker Pro)
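The "basic four" can also be expressed outside a desktop database. As a rough illustration only, here they are in Python rather than FileMaker Pro scripting, operating on rows held as dictionaries. The function names and the `separator` parameter are hypothetical.

```python
def set_all(rows, column, value):
    """1. Insert the same value in every row of one column."""
    for row in rows:
        row[column] = value


def set_where(rows, column, value, test_column, test_value):
    """2. Insert a value only in rows where another column has a given value."""
    for row in rows:
        if row[test_column] == test_value:
            row[column] = value


def copy_column(rows, source, target):
    """3. Copy one column's content into another column of the same row."""
    for row in rows:
        row[target] = row[source]


def combine_columns(rows, sources, target, separator=" "):
    """4. Join several non-empty column values into one string in a target column."""
    for row in rows:
        row[target] = separator.join(row[name] for name in sources if row[name])
```

Each function changes the rows in place, mirroring a script that steps through the found set of records one at a time.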
  • 80. Scripting: select "Scripts > ScriptMaker"
      An existing script may be used "as is", or...
      Create a new script
      Edit an existing script
      Script Edit Panel (point & click)
  • 81. Setup Functions (continued) Script for Inserting Column Values Script for Selected Column Value Insertion
  • 82. Setup Functions (continued) Script to Copy and Paste Values Script to Combine Multiple Values and Then Insert the Result
  • 83. Prep Results & Postal Sort
      Last step before XML Export: click to perform sort
      The data will be sorted into zip code order to get better postal rates
  • 84. Database Conversion to XML
  • 85. Why Not Just Export Directly From a Spreadsheet?
      [Figures: the spreadsheet; a portion of the exported XML]
      The Issues:
      1. 128 rows before data starts
      2. Column names are not tag names
      3. Individual data elements are not tagged with usable XML
      4. Superfluous XML throughout
  • 86. Possible Conversion Paths
      [Diagram: three paths to "clean" XML]
      Spreadsheet (clean, supplement, & sort) > tab-delimited file > Perl conversion (detailed program setup) > "clean" XML
      Tab-delimited file (more complex prep) > Perl conversion (detailed program setup) > "clean" XML
      Database (all preparations) > "raw" XML > text edit cleanup, XSLT cleanup, or Perl cleanup > "clean" XML
      Notes: editing requires manual effort and XSLT may be problematic; for Perl cleanup, simply set file names
  • 87. XML Export Setup
      Select "File > Export Records" from the menu bar
      Select the columns to be exported
      Click or drag to move a selection to the export list
      Set the order of selected columns to meet design requirements
  • 88. Obvious Export Format: FMPXMLRESULT “Obvious” is not necessarily “Optimal”! FMPXMLRESULT Output is similar to that of an export directly from a spreadsheet
  • 89. A Better Selection: "FMPDSORESULT"
      Column elements are embedded between Row elements
      Tag names use column names in the database
      This result will be much easier to prepare for design integration!
  • 90. Final XML Touchup
      Steps:
      1. Insert line breaks (CR, LF, or CRLF) between adjoining XML elements: "><"
      2. Delete unneeded XML non-repeating content
      3. Delete unneeded XML repeating content
      4. Delete extra & leading spaces
      5. Create attributes for image references
      Optional Tools:
      Text editor: delete; search & replace
      XSLT
      Programmatically: Perl, Ruby, or another scripting language
  • 91. Before XML Touchup Touchup Key: XML code to delete XML code to modify
  • 92. After XML Touchup
      All unneeded XML and leading space removed
      Editing of selected elements completed
      XML is ready for Adobe InDesign import
      The cleaner the XML, the more trouble-free the design integration processing will be
  • 93. DTD Validation Before Integration
      Assure Quality:
      ✓ Is it "Well Formed"?
      ✓ Is it "Valid"?
      Reduce Work: Detect problems prior to design integration
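The two quality questions above can be partly automated. The sketch below uses Python's standard library rather than the lecture's tools. `is_well_formed` answers the first question directly. `rows_not_matching` is only a rough stand-in for the second: it is not DTD validation, which needs a validating parser, but it checks that every row contains the expected elements in the expected order. Both function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """True when the text parses as XML (matched tags, a single root, and so on)."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


def rows_not_matching(xml_text: str, expected_tags: list) -> list:
    """Return the 1-based numbers of <row> elements whose children differ
    from expected_tags in name or order."""
    root = ET.fromstring(xml_text)
    bad = []
    for number, row in enumerate(root.findall("row"), start=1):
        if [child.tag for child in row] != expected_tags:
            bad.append(number)
    return bad
```

An empty list from `rows_not_matching` means every row has the layout the document design expects.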
  • 94. VI. Direct XML Data Stream Maintenance
      A set of XML Data Stream Utilities written in Perl provides most maintenance functionality
  • 95. Data Maintenance Alternatives
      While there may be other approaches, most will focus on one of three data conditions:
      1. Source Data Files prior to any preparation
      2. Data Files after Spreadsheet or Database import
      3. Data after conversion to an XML Data Stream
  • 96. Data Stream Utilities: Insert; Delete; Change; Reorder; Combine; Merge; Split
  • 97. Overview
      Extend the usefulness of generated XML content streams beyond the original setup or intent
      Make repairs or adjustments to content streams to more closely match document designs
      The utilities contain logic for manipulating both XML element content and XML structure
      Capability spans the entire XML content stream
      Modifications can be applied selectively:
         Specific XML elements
         Specific XML elements with particular content
         Elements or content based on some previously read XML element
      All operations are performed as if the XML stream were simply another text file. No programs use native XML manipulation capabilities such as DOM or XSLT
      In-depth knowledge of programming or the Perl programming language is not required
      Logic is already built and only requires some setup information
      Required setup changes are highlighted in yellow on the slides
      Comments about program code are in bold type
  • 98. About Their Use
      These programs have been tested with sample data: they do what they are supposed to do.
      Be warned that attention to detail is extremely important
      Proof your work after copying or making changes
      When setting up a program:
         Back up the original program before changing settings or logic
         Back up your data files before program processing
      There are not a lot of settings to worry about
         You may see the same setup requirements in multiple programs
         Most settings are similar from program to program
         Required settings are highlighted in yellow, and functionality is commented in bold
         Default settings illustrate patterns and program expectations
      Text string settings are very nuanced:
         The first character in a string is at position zero
         A ten-character string is numbered from position zero through position nine
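The zero-based positions mentioned above are the usual source of setup mistakes, because each tag test pairs a length with a tag name. A small illustration in Python rather than Perl follows. The slice `line[0:7]` is assumed here to correspond to Perl's `substr($line, 0, 7)`, and the helper `starts_with_tag` is hypothetical.

```python
def starts_with_tag(line: str, tag: str) -> bool:
    """Compare the first len(tag) characters of line with tag, so the
    test length can never disagree with the tag name."""
    return line[0:len(tag)] == tag


line = "<offer>10% Discount</offer>"
# "<offer>" is 7 characters long and occupies positions 0 through 6,
# so the element content begins at position 7.
first_seven = line[0:7]
first_content_character = line[7]
```

Letting the code compute the length from the tag name removes one of the two settings that must otherwise be kept in step by hand.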
  • 99. Insert
      This program inserts one or more XML elements at a specified element location within each <row> element of the XML content stream.
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # set input and output
      $inputfl  = "InputName.xml";
      $outputfl = "OutputName.xml";
      &insertProcess;

      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      sub insertProcess {
          &openFiles($inputfl, $outputfl);
          $addFlag = 0;
          # Number of new elements
          $new = 3;
          # List of new elements
          @newElements = ("<one>Insert First Element</one>",
                          "<two>Insert Second Element</two>",
                          "<three>Insert third element</three>");
  • 100. Insert continued
          # File processing main loop
          while ($import = <INFILE>) {
              chomp($import);
              # Write any pending new elements before the line that follows the
              # insertion point (as transcribed, that following line was not written)
              if ($addFlag) {
                  for ($i = 0; $i < $new; $i++) {
                      $export = $newElements[$i] . "\n";
                      print OUTFILE $export;
                  }
                  $addFlag = 0;
              }
              $export = ($import . "\n");
              print OUTFILE $export;
              # Tag location after which new elements are inserted
              # Test string length and name
              if (substr($import,0,7) eq "<offer>") {$addFlag = 1;}
          }
          &closeFiles;
      }
  • 101. Delete
      This program deletes one or more XML elements from each <row> element of the XML content stream.
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # set input and output (these must be set before processing is called)
      $inputfl  = "InputName.xml";
      $outputfl = "OutputName.xml";
      &deleteProcessing;

      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      sub deleteProcessing {
          &openFiles($inputfl, $outputfl);
          while ($import = <INFILE>) {
              chomp($import);
              # Tag length and name to be deleted
              if (substr($import,0,4) eq "<one")   {next;}
              if (substr($import,0,4) eq "<two")   {next;}
              if (substr($import,0,6) eq "<three") {next;}
              $export = $import . "\n";
              print OUTFILE $export;
          }
          &closeFiles;
      }
  • 102. Change
      This program completely changes the content of one or more selected elements from each <row> element of the XML content stream.
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # set input and output
      $inputfl  = "InputName.xml";
      $outputfl = "OutputName.xml";
      &changeProcessing;

      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      sub changeProcessing {
          &openFiles($inputfl, $outputfl);
          while ($import = <INFILE>) {
              chomp($import);
              # Set change tag test length and name
              if (substr($import,0,10) eq "<location>") {
                  # Set decision value position and length
                  $loc = substr($import,10,1);
              }
  • 103. Change continued
              # Default: pass the line through unchanged
              $export = $import . "\n";
              # Set test length, tag name, and replacement values
              # (assumes <location> comes before <offer> and <product> in each row)
              if (substr($import,0,7) eq "<offer>") {
                  if ($loc eq "N") {$export = ("<offer>5% Discount</offer>" . "\n");}
                  if ($loc eq "S") {$export = ("<offer>10% Discount</offer>" . "\n");}
                  if ($loc eq "E") {$export = ("<offer>15% Discount</offer>" . "\n");}
                  if ($loc eq "W") {$export = ("<offer>20% Discount</offer>" . "\n");}
              }
              # The test opening this block is missing from the transcript;
              # a <product> tag test is inferred from the replacement values
              elsif (substr($import,0,9) eq "<product>") {
                  if ($loc eq "N") {$export = ("<product>Paving Stone</product>" . "\n");}
                  if ($loc eq "S") {$export = ("<product>Fertilizer</product>" . "\n");}
                  if ($loc eq "E") {$export = ("<product>Garden Tools</product>" . "\n");}
                  if ($loc eq "W") {$export = ("<product>Seed Packets</product>" . "\n");}
              }
              print OUTFILE $export;
          }
          &closeFiles;
      }
  • 104. Reorder
      This program reorders (changes the sequential relationship of) each element within each <row> element of the XML content stream.
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # set input and output
      $inputfl  = "InputName.xml";
      $outputfl = "OutputName.xml";
      &reorderProcess;

      sub openFiles {
          $infl  = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl")   || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      sub reorderProcess {
          &openFiles($inputfl, $outputfl);
          # set up operating values
          # desired order of row elements in new file
          @order = (10, 3, 4, 5, 6, 7, 8, 9, 2, 1);
          # iteration limit is number of elements within a row
          $limit = 10;
          @content = ();    # array of elements within a row
          $flag = "N";      # process control flag
          $count = 0;       # iteration counter
  • 105. Reorder continued

            # File processing loop
            while ($import = <INFILE>) {
                chomp($import);
                # test for row content processing
                if ($flag eq "Y") {
                    $content[$count] = ($import . "\n");
                    # test for end row content completion
                    if ($count == $limit) {
                        # row content reorder processing
                        for ($i = 0; $i < $limit; $i++) {
                            $key = ($order[$i] - 1);
                            $export = $content[$key];
                            print OUTFILE $export;
                        }
                        # the line read at this point is the closing </row> tag;
                        # write it out so the row stays well formed
                        print OUTFILE $content[$limit];
                        $flag = "N"; # turn off row flag
                    }
                    $count++;
                }
                else {
                    $export = ($import . "\n");
                    print OUTFILE $export;
                    # test for row
                    if (substr($import,0,5) eq "<row>") {
                        $flag = "Y"; # turn on row flag
                        $count = 0;
                    }
                }
            }
            &closeFiles;
        }
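The heart of the reorder step, mapping a list of 1-based positions onto the elements of one row, can also be written as a Perl array slice. The sketch below is a minimal illustration under assumptions: reorder_row is a hypothetical helper name, and the sample row is generated in memory rather than read from an XML file.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Desired order of row elements, 1-based as on the slide.
my @order = (10, 3, 4, 5, 6, 7, 8, 9, 2, 1);

# reorder_row (hypothetical name) returns the elements of one row
# rearranged according to the 1-based positions in @$order.
sub reorder_row {
    my ($order, @elements) = @_;
    # Convert 1-based positions to 0-based indexes, then take a slice
    return @elements[ map { $_ - 1 } @$order ];
}

# Ten made-up elements standing in for the children of one <row>
my @row = map { sprintf("<f%d>value %d</f%d>", $_, $_, $_) } 1 .. 10;
print "$_\n" for reorder_row(\@order, @row);
```

The slice replaces the for loop and the $key arithmetic. The <row> and </row> lines would still be written around the reordered elements by the calling code.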
  • 106. Combine
    One Application Example
    A usual mailing address: name, address, city, state, and zip
    But some may be address1 & address2:
        Building 15 Suite 112
        32 Canal Parkway
    If document design has two address lines, single addresses will contain a blank line
    If document design has one address line, all double addresses will omit a line
    Following XML placement may be flawed
    One solution
    Design document for a single address line
    Combine double addresses into single XML elements in the content stream like this: address1 + line feed + address2
    Build flexibility & tolerance into the document design
    Let XML design integration handle address differences

        #!/usr/bin/perl
        use warnings;
        use strict "subs";

        # set input and output
        $inputfl = "InputName.xml";
        $outputfl = "OutputName.xml";

        &combineElements;

        sub openFiles {
            $infl = $_[0];
            $outfl = $_[1];
            open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
            open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
        }

        sub closeFiles {
            close(INFILE);
            close(OUTFILE);
        }

        sub combineElements {
            # Set element tag name patterns
            $pat1 = "<address1>";
            $pat2 = "<address2>";
            $pat3 = "<address2/>";
            $pat4 = "<address2><";
            &openFiles($inputfl, $outputfl);
  • 107. Combine continued

            # File processing loop
            while ($import = <INFILE>) {
                chomp($import);
                if (substr($import,0,10) eq $pat1) {
                    @tags2a = evalElement($import);
                    next;
                }
                if (substr($import,0,11) eq $pat3 or substr($import,0,11) eq $pat4) {
                    $import = join("",@tags2a);
                }
                elsif (substr($import,0,10) eq $pat2) {
                    @tags2b = evalElement($import);
                    if ($tags2b[1] =~ m/^[ +|0]/ ) {
                        $import = join("",@tags2a);
                    }
                    else {
                        $combine = $tags2a[1] . "\n" . $tags2b[1];
                        $tags2a[1] = $combine;
                        $import = join("",@tags2a);
                    }
                }
                $export = $import . "\n";
                print OUTFILE $export;
            }
            &closeFiles;
        }

        # Element analysis subroutine
        sub evalElement($) {
            $element = shift(@_);
            $lentxt = length($element);
            $boundary1 = index($element,">");
            $offset = $boundary1 + 1;
            $boundary2 = ($lentxt - rindex($element,"<"));
            $sublen = $lentxt - $offset - $boundary2;
            $content = substr($element, $offset, $sublen);
            $tag = substr($element, 0, $offset);
            $endtag = substr($element, ($offset + $sublen));
            @elementArray = ($tag, $content, $endtag);
            return @elementArray;
        }
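The element analysis and the address1 + line feed + address2 combination can be sketched with a regular expression in place of the index and substr arithmetic. This is an illustrative alternative only. The names split_element and combine_address are hypothetical, and the sketch treats an address2 as empty when it has no non-whitespace content, which is an assumption and not the same test as the one on the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# split_element (hypothetical name) breaks a one-line element into
# (start tag, content, end tag); it returns an empty list for
# anything else, such as the empty-element form <address2/>.
sub split_element {
    my ($element) = @_;
    if ($element =~ m{^(<[^>]+>)(.*)(</[^>]+>)$}s) {
        return ($1, $2, $3);
    }
    return;
}

# combine_address (hypothetical name) folds the address2 content into
# the address1 element, separated by a line feed. When address2 has no
# usable content the address1 line is returned unchanged.
sub combine_address {
    my ($line1, $line2) = @_;
    my ($tag, $text1, $endtag) = split_element($line1);
    my (undef, $text2) = split_element($line2);
    return $line1 unless defined $text2 && $text2 =~ /\S/;
    return $tag . $text1 . "\n" . $text2 . $endtag;
}

# Sample address lines taken from the slide
print combine_address("<address1>Building 15 Suite 112</address1>",
                      "<address2>32 Canal Parkway</address2>"), "\n";
print combine_address("<address1>32 Canal Parkway</address1>",
                      "<address2/>"), "\n";
```

Like the lecture program, the sketch assumes each element sits on its own line with its start and end tags.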
  • 108. Merge
    Imagine that we have three XML files (same type, different content). They’ve been used separately, but now we want to bring them together in a single content stream that satisfies XML language conventions.
    This program takes one or more XML content streams and appends them together, one after another, as a single “well formed” XML file.

        #!/usr/bin/perl
        use warnings;
        use strict "subs";

        # List files to be merged
        # First file name in list is the destination file
        # Remaining file names in list are subsequent source files for merger
        &mergeProcess("MergeOut.xml", "MergeOne.xml", "MergeTwo.xml",
                      "MergeThree.xml", "MergeFour.xml");

        sub mergeProcess {
            $outfl = shift(@_);
            @mergelist = @_;
            $count = $#mergelist;
            $cycle = 0;
            $proc = "<?xml";
            $root = "<dataroot>";
            $endroot = "</dataroot>";
            open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
            foreach $infl (@mergelist) {
                open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
                while ($import = <INFILE>) {
                    chomp($import);
                    if (substr($import,0,5) eq $proc and $cycle > 0) {next;}
                    if (substr($import,0,10) eq $root and $cycle > 0) {next;}
                    if (substr($import,0,11) eq $endroot and $cycle < $count) {next;}
                    $export = $import . "\n";
                    print OUTFILE $export;
                }
                close(INFILE);
                $cycle += 1;
            }
            close(OUTFILE);
        }
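The merge rule is: keep the XML declaration and the opening root tag only from the first stream, and the closing root tag only from the last. It can be tried out without any files by working on arrays of lines. The sketch below is a minimal illustration: merge_streams is a hypothetical name and the two sample streams are made up, but the <dataroot> root element and the three tests mirror the slide.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# merge_streams (hypothetical name) takes array references, one per
# XML content stream, and returns the lines of a single merged stream.
sub merge_streams {
    my @streams = @_;
    my @out;
    for my $i (0 .. $#streams) {
        for my $line (@{ $streams[$i] }) {
            # Declaration and opening root: first stream only
            next if $i > 0 && $line =~ m{^<\?xml};
            next if $i > 0 && $line =~ m{^<dataroot>};
            # Closing root: last stream only
            next if $i < $#streams && $line =~ m{^</dataroot>};
            push @out, $line;
        }
    }
    return @out;
}

# Two made-up streams with one record each
my @one = ('<?xml version="1.0" encoding="UTF-8"?>', '<dataroot>',
           '<row>', '<name>A</name>', '</row>', '</dataroot>');
my @two = ('<?xml version="1.0" encoding="UTF-8"?>', '<dataroot>',
           '<row>', '<name>B</name>', '</row>', '</dataroot>');
print "$_\n" for merge_streams(\@one, \@two);
```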
  • 109. Split (1 of 3)
    This is the converse of the previous situation.
    We have a single large XML content stream we want to divide into two or more files: each having a set number of records, each adhering to XML conventions (each new file is well formed).
    That is exactly what this program does.
    There may likely be other criteria for the split, rather than only record count. That logic could easily be added, or the record count logic replaced.

        #!/usr/bin/perl
        use warnings;
        use strict "subs";

        # set input and output
        $inputfl = "SplitSource.xml";
        $outputfl = "SplitSeg1.xml";
        $firstline = '<?xml version="1.0" encoding="UTF-8"?>';
        $secndline = "<dataroot>";
        $endrecord = "</row>";
        $endfile = "</dataroot>";

        &splitFile;

        sub openFiles {
            $infl = $_[0];
            $outfl = $_[1];
            open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
            open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
        }

        sub closeFiles {
            close(INFILE);
            close(OUTFILE);
        }
  • 110. Split continued (2 of 3)

        sub splitFile {
            &openFiles($inputfl, $outputfl);
            &recordCount;
            # Processing variables
            # Maximum number of records per file
            $rcrdlimit = 13;
            $rcrdcount = 0;
            $rcrdaccum = 0;
            $filecount = 1;
            # File processing loop
            while ($import = <INFILE>) {
                chomp($import);
                if ($import eq $endfile) {
                    # close the root element of a final, partly filled file
                    if ($rcrdcount > 0) {print OUTFILE $endfile . "\n";}
                    last;
                }
                if (substr($import,0,6) eq $endrecord) {
                    $rcrdcount += 1;
                    $rcrdaccum += 1;
                }
                $export = $import . "\n";
                print OUTFILE $export;
                # Check for split count
                if ($rcrdcount == $rcrdlimit) {
                    $rcrdcount = 0;
                    $filecount += 1;
                    $export = $endfile . "\n";
                    print OUTFILE $export;
                    close(OUTFILE);
  • 111. Split continued (3 of 3)

                    # Check for last record
                    if ($rcrdaccum < $rcrdtotal) {
                        substr($outfl,8,1) = $filecount;
                        open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
                        $export = $firstline . "\n";
                        print OUTFILE $export;
                        $export = $secndline . "\n";
                        print OUTFILE $export;
                    }
                }
            }
            &closeFiles;
        }

        sub recordCount {
            $rcrdtotal = 0;
            while ($import = <INFILE>) {
                chomp($import);
                if (substr($import,0,6) eq $endrecord) { $rcrdtotal += 1;}
            }
            print "Record Count: $rcrdtotal \n";
            open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
        }
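The record-count split can be sketched on an in-memory list of records, which makes the chunking rule easy to see on its own. This is only an illustration under assumptions: split_records is a hypothetical name, each record is assumed to be already gathered into one string, and the result is returned as strings instead of being written to SplitSeg files. The declaration line, root element, and limit of 13 are the ones on the slides.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $firstline = '<?xml version="1.0" encoding="UTF-8"?>';
my $root      = "<dataroot>";
my $endroot   = "</dataroot>";

# split_records (hypothetical name) groups records into chunks of at
# most $limit and wraps each chunk as a complete XML document string.
sub split_records {
    my ($limit, @records) = @_;
    my @files;
    # splice removes up to $limit records from the front on each pass;
    # the loop ends when no records remain
    while (my @chunk = splice(@records, 0, $limit)) {
        push @files, join("\n", $firstline, $root, @chunk, $endroot) . "\n";
    }
    return @files;
}

# Thirty made-up records, split 13 per file as on the slide
my @records = map { "<row><id>" . $_ . "</id></row>" } 1 .. 30;
my @files = split_records(13, @records);
printf "%d files\n", scalar @files;
```

With 30 records and a limit of 13 this yields three documents holding 13, 13, and 4 records, each with its own declaration and root element.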
  • 112. Thank You!  Any Comments?  Any Questions?  How Might I Help You?