Data Stream Preparation

    One Approach to Document Variability
     A Classroom Lecture for Print Media, Graphic
      ...
Talking Points
   I.   The Workflow
   II. Data Structures
   III. Simple XML
   IV. Basic Data Processing
   V. Data Prep...
I. The Workflow
      Basic Content Integration
      The Data Handling Thread
      Data Preparation Workflow


       ...
Basic Content Integration
           Content Source:
                                              Integrating Element
1  ...
The Data Handling Thread
                  Design                 Image
                  Sketch               Preparation...
Data Preparation Workflow
                                                                             Marketing
         ...
II. Data Structures
       File Metadata
       Sequential File
       Delimited File
       Spreadsheet
       Datab...
File Metadata

                    File headers
                            File name, type, and size
                    ...
Sequential Data File
                                                                       File Header


                ...
Delimited Sequential File

               Tab               Tab                     Tab                     Tab           ...
Spreadsheet


  Column 1     Column 2            Column 3                Column 4                  Column 5   Column n
 (I...
Database Table
                     Column 1         Column 2           Column 3           Column 4   Column 5   Column n
...
Relational Table Structure
Parent Record Set   Descriptive according to demographics or other qualities



               ...
Parent & Child Example



                                                                  Detail Reference
             ...
Joining (flattening) the Data
 Data Sources may be Sequential Files, Delimited Files,
          Spreadsheets, or Database ...
Approaches to Joining Datasets
 Concatenate: set A + set B + set “n”
 (Main issue is validation of information content bet...
III. Simple XML
       Concepts & Considerations
       XML Building Blocks
       DTD Building Blocks
       Quality ...
Concepts & Considerations
XML is hierarchical & extensible
Describes content, not layout or formatting
Can reference exter...
XML Building Blocks
  XML Declaration
  
   <?xml version="1.0" encoding="UTF-8"?>


  Namespace Declaration
      Schema ...
Elements & Attributes
  Element
  	   <any_name>some element content</any_name>

  Indented
  	   <state>
  	   	 <name>Ne...
Elements & Attributes continued

  Attribute
  
   <element attribute="item">element content</element>
  
   <state abbrev...
Defined Entities
  Predefined Entity
  &amp;	     ampersand
  &lt;	      less-than or left angle bracket
  &gt;	      grea...
“Sack Lunch” XML Example
 <?xml version=“1.0” encoding=“UTF-8”?>                                                   XML Dec...
DTD Building Blocks
  Common element statements (see definitions below):
  <!ELEMENT elementName (#PCDATA)>

  <!ELEMENT e...
DTD Building Blocks continued

  Common attribute statements (see definitions below):
  <!ATTLIST elementName attributeNam...
“Sack Lunch” DTD Example

<?xml version="1.0" encoding="UTF-8"?>                                            XML Declaratio...
“Mailer” Data Stream Example
                   DTD & XML for a VDP Document
             DTD                             ...
XML Standards & Conventions
                                           Well Formed
1. XML documents must begin with an XML...
IV. Basic Data Processing
      Handling Text Strings
      Handling Typical Data Structures
      Simple Programming L...
Handling Text Strings
Fixed Length Record and Field Processing
    Files contain records; records contain fields made of o...
Handling Typical Data Structures
       Sequential File Processing
       Table or Spreadsheet Processing
       XML DO...
Sequential File Processing
               File Header
                                                          File Proce...
Table or Spreadsheet
    Processing of selected Columns within selected Rows

          Column 1         Column 2         ...
XML DOM Processing Structure
                              Document Object Model
 Hierarchical Structure:                 ...
Processing with XSLT
  Extensible Stylesheet Language Transforms (XSLT)


                             Source XML File    ...
XML Sequential Processing
                XML Header                     Sequential Processing by XML Tag
               D...
Simple Programming Logic
       Fundamental Logic
       Program Overview



        Copyright © Nick D. Barzelay, 2010....
Simple Read/Write Logic

     Access (open) File           Basic File Processing Loop

   Read a Record (string)



      ...
XML Tag Process Logic
  Reading left to right, each successive line from the top...


  Select and Process a Line of Text:...
Processing Program Overview
                            Perl Declaration &
                          Processing Parameters...
“DogDB” Processing Example




A Short File of Dogs
number	   name	     breed	                           gender	              photo	              weight
...
Raw Converted XML
 XML Source
 Data cleanup and enhancement results exported into XML from FileMaker Pro



 <?xml version...
Search/Replace Results
<?xml version="1.0" encoding="UTF-8" ?><!-- This grammar has been deprecated
- use FMPXMLRESULT ins...
Export Result using FMPXMLRESULT
  <?xml version="1.0" encoding="UTF-8" ?>
  <FMPXMLRESULT xmlns="http://www.filemaker.com/...
Export Result using FMPDSORESULT
 <?xml version="1.0" encoding="UTF-8" ?>
 <!-- This grammar has been deprecated - use FMP...
Fully Prep’ed XML
<?xml version="1.0" encoding="UTF-8" ?>
<dataroot>
<row>
<number>1</number>
<name>Slinky</name>
<breed>G...
Option 1: Using a Text Editor
Extra XML: simply delete!
XML or content changes: use Search & Replace
Create Attribute Refe...
Option 2: XSLT Step 1
XSL Stylesheet:
                  <?xml version="1.0" encoding="UTF-8"?>
                  <xsl:styl...
Option 2: XSLT Result #1
<?xml version="1.0" encoding="UTF-8"?>
<dataroot>
  <row>
    <number>1</number>
    <name>Slinky...
Option 2: XSLT Step 2
XSL Stylesheet:
                  <?xml version="1.0" encoding="UTF-8"?>
                  <xsl:styl...
Option 2: XSLT Result #2
<?xml version="1.0" encoding="UTF-8"?>
<dataroot>
  <row>
    <number>1</number>
    <name>Slinky...
Option 3: Cleanup with Perl
XML Cleanup -- Program Logic Overview
 1.    Call the “open” subroutine to commence file proces...
Option 3: The Perl Program
#!/usr/bin/perl

use warnings;
use strict "subs";

$inputfl = "rawXMLin.txt";
$outputfl = "cleanX...
Option 3: Perl Program continued
sub cleanseXML {                                                                  Main   ...
Option 4: File Conversion with Perl

Convert a Spreadsheet or Delimited File Directly to clean,
integration-ready XML usin...
Option 4: What You Need to Know
                                    CONVERSION CAVEATS:
1. The absolute necessity of under...
Option 4: The Perl Program (page 1of 3)
#!/usr/bin/perl

use warnings;
use strict "subs";

# Set input and output file name...
Option 4: Perl Program continued (pg. 2of 3)
# Conversion processing
&openFiles($inputfl, $outputfl, $errlist);
$export = $d...
Option 4: Perl Program continued (pg. 3of 3)
# Prepare XML element tags
sub prepTags {
	      @tagFront = @tagName;
	     ...
Option 4: Completely Prepared XML
<?xml version="1.0" encoding="UTF-8" ?>
<dataroot>
<row>
<number>1</number>
<name>Slinky...
V. Data Preparation
   Demonstration:
      Data Acquisition & Import
      Data Cleanup
      Supplement, Arrange, & S...
Acquiring Data - Initial Hurdles

Commonly Encountered Operational Issues:
 Multiple data structures and schemas
 Multip...
Database Import
1. Check text file or spreadsheet to determine whether first row contains column headings.
     a. If first ...
Create Empty Database Table for Import
                                                   select the “fields” button




 ...
Database Import from Spreadsheet




         1. Select the Customer Worksheet




                                       ...
Data Evaluation Before Cleanup
1. Based on project objectives and requirements:
    a. What types of record problems are i...
Potential Design-related Issues
1. Normalization – inadequate or omitted normalization or database design steps
       a. ...
Potential Usage Issues
 Incorrect or inconsistent spelling, and transposed characters
 Ambiguous names or variations of a ...
Common Data Cleanup Tasks
 Make sure field names or column headings are accurate and comply with XML element naming
 conven...
Data Cleanup Example:




1.   Case is the most obvious issue
2.   Hidden white space is a potential issue
3.   Spelling c...
Supplement, Arrange, & Sort

Purpose:
1. Address business logic at database level by incorporating decision results into d...
Field Supplement Needs

Potential Types of Field Additions to Data Records:
 Insert processing flags to facilitate database...
Example: Given a Document Layout
         Document Sketch (one side of a duplex document)

       <first name>
            ...
And the Envisioned Document...

How the Design Sketch will look as a completed document




                Copyright © Ni...
What Added Data Will Be Needed?

Record Layout Prior to Adding Fields
adoption_number (C and four digits for cats; D and f...
And the Rationale Behind Each One?
(1) tracking_code – Field will be used to track responses. Content will be created by c...
Fields Added (✔) to Match a Design Layout
                 ✔                 ✔                     ✔                      ...
Editing or Loading Field Content

Use desktop database scripting capabilities:
        To populate (fill with content) any...
Scripting: select “Scripts > ScriptMaker”
                                                 An existing script may be
     ...
Setup Functions (continued)
           Script for Inserting Column Values




       Script for Selected Column Value Inse...
Setup Functions (continued)
              Script to Copy and Paste Values




   Script to Combine Multiple Values and The...
Prep Results & Postal Sort

   Last step before XML Export: click to perform sort




The data will be sorted into zip cod...
Database Conversion to XML




Why Not Just Export Directly From a Spreadsheet?

 The Spreadsheet                             A Portion of Exported XML

...
Possible Conversion Paths
Clean, Supplement, & Sort                               Detailed Program Setup
 Spreadsheet     ...
XML Export Setup
Select “File > Export Records” from the menu bar



     Select the columns
     to be exported
         ...
Obvious Export Format: FMPXMLRESULT

“Obvious” is not necessarily “Optimal”!




                                         ...
A Better Selection:
“FMPDSORESULT”
 Column elements are
 embedded between
 Row elements
                                 ...
Final XML Touchup Steps
Optional Tools:                        1. Insert line breaks (CR, LF, or
                         ...
Before XML Touchup




Touchup Key:
XML code to
delete

XML code to
modify
               Copyright © Nick D. Barzelay, 20...
After XML Touchup
                                                          All unneeded XML
                             ...
DTD Validation Before Integration

                                                        Assure Quality:
               ...
VI. Direct XML Data Stream Maintenance
      A set of XML Data Stream
      Utilities written in Perl provide
      mo...
Data Maintenance Alternatives

While there may be other approaches,
most will focus on one of three data conditions:


1. ...
Data Stream Utilities:
                                  Insert
                                  Delete
               ...
Overview
 Extend the usefulness of generated XML content streams beyond original setup or intent

 Make repairs or adjustm...
About Their Use
 These programs have been tested with sample data:
   They do what they are supposed to do.
   Be warned t...
Insert
#!/usr/bin/perl

use warnings;                                                       This program inserts one or mo...
Insert continued


	   # File processing main loop
	   while ($import = <INFILE>) {
	       	 chomp($import);
	       	 if...
Delete
#!/usr/bin/perl                                                     This program deletes one or more XML
use warnin...
Change
#!/usr/bin/perl                                                     This program completely changes the
use warning...
Change continued

	   	        # Set test length, tag name, and replacement values
	       	      if (substr($import,0,7) ...
XML Data Stream Preparation for VDP
Data preparation, XML conversion, and programming approaches for creating variable data documents. Demonstrates the juncture of graphic design and applied information technology. Uses affordable tools and techniques with cross-platform capabilities (PC & Mac).

  1. Data Stream Preparation
     One Approach to Document Variability
     A Classroom Lecture for Print Media, Graphic Arts, and Applied Information Technology
     Nick D. Barzelay
     April, 2010
     Copyright © Nick D. Barzelay, 2010. All rights reserved.
  2. Talking Points
     I.   The Workflow
     II.  Data Structures
     III. Simple XML
     IV.  Basic Data Processing
     V.   Data Preparation Demonstration
     VI.  Direct XML Data Stream Maintenance
  3. I. The Workflow
     - Basic Content Integration
     - The Data Handling Thread
     - Data Preparation Workflow
  4. Basic Content Integration
     [Diagram: content sources are joined to the design and assembly element by an integrating element (the transfer medium and its associated tools), shown across four integration layers.]
     Integration Layers:
       1. Process Layer: data management and content management (data, text, graphics)
       2. Content/Data Layer: selected dynamic data, selected static content, selected dynamic graphics, selected static graphics
       3. Tools Layer: desktop database or spreadsheet; XML with XSLT or Perl; integration engine (Adobe InDesign)
       4. Skills & Techniques: rudimentary database and FileMaker Pro proficiencies; elementary XML and optional basic XSLT or Perl; basic Adobe InDesign
  5. The Data Handling Thread
     [Diagram: A Variable Document Workflow. Stages shown: Concept & Objectives; Design Sketch; Image Preparation; Document Design; Content Management; Source; Working Dataset; Data Preparation; XML Data Stream; Document Assembly; VCP Document & QA; Proofing; PDF Generation; Prepress & Imposition; Production. Callouts mark the "Discussion Focus" and the "Discussion End Point".]
  6. Data Preparation Workflow
     [Diagram: processes 1 through 7 interact with, and are dependent upon, working storage (the working database and XML content).]
     - Input: data source (database, spreadsheet, or text files)
     - Process 1: import data into the working database
     - Processes 2 through 6: combine sources into a single table; evaluate records; cleanse records; supplement fields and reference images; reorder fields and sort records
     - Process 7: select data fields and export XML
     - Output: prepared data (XML stream)
     - Optional feedback: cleansed data and marketing results may be used to update the data source or the data for an ongoing campaign
  7. II. Data Structures
     - File Metadata
     - Sequential File
     - Delimited File
     - Spreadsheet
     - Database Table
  8. File Metadata
     File headers
     - File name, type, and size
     - Dates for creation, modification, and last use
     - Access security and use permissions
     - Author, company, and copyright information
     - Title, subject, and comments
     - Key words and labeling
     More extensive "structural & descriptive" metadata is used for spreadsheets and databases.
  9. Sequential Data File
     File Header
     First Record:   Field 1, Field 2, Field 3
     Second Record:  Field 1, Field 2, Field 3
     Nth Record:     Field 1, Field 2, Field 3
     End of File (EOF)
     Processing order: top to bottom, left to right
  10. Delimited Sequential File
     Columns are separated by a tab character.
     Column 1       Column 2   Column 3    Column 4   Column 5   Column n
     (Identifier)   (name)     (address)   (city)     (state)    (etc.)
     Row 1          name 1     address 1   city 1     state 1    etc. 1
     Row 2          name 2     address 2   city 2     state 2    etc. 2
     Row 3          name 3     address 3   city 3     state 3    etc. 3
     Row 4          name 4     address 4   city 4     state 4    etc. 4
     Row 5          name 5     address 5   city 5     state 5    etc. 5
     Row n          name n     address n   city n     state n    etc. n
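Reading a tab-delimited file of this shape can be sketched as follows. This is a minimal illustration in Python, not the lecture's own code; the sample text, the column headings, and the `read_delimited` helper are assumptions made up for the example.

```python
import csv
import io

# Hypothetical sample: a heading row followed by two tab-delimited records.
SAMPLE = (
    "id\tname\taddress\tcity\tstate\n"
    "1\tname 1\taddress 1\tcity 1\tNY\n"
    "2\tname 2\taddress 2\tcity 2\tOH\n"
)

def read_delimited(text, delimiter="\t"):
    """Return one dict per record, keyed by the headings in the first row."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    return list(reader)

rows = read_delimited(SAMPLE)
for row in rows:
    # Fields are addressed by column heading rather than by position.
    print(row["id"], row["name"], row["state"])
```

Passing `delimiter=","` would handle a comma-separated file in the same way.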
  11. Spreadsheet
     Column 1       Column 2   Column 3    Column 4   Column 5   Column n
     (Identifier)   (name)     (address)   (city)     (state)    (etc.)
     Row 1          name 1     address 1   city 1     state 1    etc. 1
     Row 2          name 2     address 2   city 2     state 2    etc. 2
     Row 3          name 3     address 3   city 3     state 3    etc. 3
     Row 4          name 4     address 4   city 4     state 4    etc. 4
     Row 5          name 5     address 5   city 5     state 5    etc. 5
     Row n          name n     address n   city n     state n    etc. n
  12. Database Table
     Column 1       Column 2   Column 3    Column 4   Column 5   Column n
     (usually Key)  (name)     (address)   (city)     (state)    (etc.)
     Row 1          name 1     address 1   city 1     state 1    etc. 1
     Row 2          name 2     address 2   city 2     state 2    etc. 2
     Row 3          name 3     address 3   city 3     state 3    etc. 3
     Row 4          name 4     address 4   city 4     state 4    etc. 4
     Row 5          name 5     address 5   city 5     state 5    etc. 5
     Row n          name n     address n   city n     state n    etc. n
     Table Metadata
     - Column name
     - Data type (text, number, date, time, or binary container)
     - Size (number of characters)
     - Use (primary key, foreign key, index key)
     - Creation date and modification date
     - Permissions (who can read or manipulate field content)
     - Validation value (a specified test value)
     - Default value (including automatic incrementing and increment value)
     - Data entry mask (formatting assistance for data input) and display format
  13. Relational Table Structure
     Parent Record Set
     - Descriptive: according to demographics or other qualities
     Child Record Set
     - Transactional: parent and child interaction
     - Experiential: transactions external to the relationship
     Relational Metadata
     - Table name
     - Table relationships (child table names)
     - Relationship type
     - Relationship direction (the "one" or "many")
     - Table unique key (makes every record unique)
     - Table foreign key (points from child to parent)
     - Number of fields (columns)
     - Number of records (rows)
     - Table access (accounts and privileges)
  14. Parent & Child Example
     [Screenshot: a parent table and a child table. The Detail Reference in the child table points to the Customer Number in the parent table.]
  15. Joining (flattening) the Data
     Data sources may be sequential files, delimited files, spreadsheets, or database tables.
     [Diagram: a Primary Source and a Secondary (or Qualifying) Source are linked by a common data element (the assumed key).]
     Sort the primary and secondary sources by matching sort criteria.
     Data Selection Options:
     - Matching data selected from the primary according to the secondary
     - Unmatched data selected from the primary according to the secondary
     - Matching data from the primary and secondary selected for merging
     Merge Processing:
     - Read the primary and secondary, match by option, and write a new file
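The selection options above can be sketched in Python. This is a minimal sketch under stated assumptions, not the lecture's implementation: the record contents, the `customer_no` key, and the `join_by_key` helper are hypothetical, and the secondary source is assumed to hold at most one record per key.

```python
# Hypothetical primary and secondary sources sharing a common key field.
primary = [
    {"customer_no": "C001", "name": "Able"},
    {"customer_no": "C002", "name": "Baker"},
    {"customer_no": "C003", "name": "Chase"},
]
secondary = [
    {"customer_no": "C001", "order": "Paving Stone"},
    {"customer_no": "C003", "order": "Mulch"},
]

def join_by_key(primary, secondary, key):
    """Split primary records into matched (merged) and unmatched lists."""
    # Assumption: each key value appears once in the secondary source.
    lookup = {rec[key]: rec for rec in secondary}
    matched, unmatched = [], []
    for rec in primary:
        other = lookup.get(rec[key])
        if other is None:
            unmatched.append(rec)            # unmatched data from primary
        else:
            matched.append({**rec, **other})  # flattened, merged record
    return matched, unmatched

matched, unmatched = join_by_key(primary, secondary, "customer_no")
print(len(matched), "matched;", len(unmatched), "unmatched")
```

A dictionary lookup stands in here for the sort-and-match pass described on the slide; for files too large to hold in memory, sorting both sources on the key and reading them in step would be the closer equivalent.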
  16. Approaches to Joining Datasets
     Concatenate: set A + set B + set "n"
     (The main issue is validation of information content between data sources.)
     - Are there duplicate records between the collections (files or tables)?
     - Are there erroneously matched fields?
     - Do matched records really contain exactly the same information, or does some of it differ?
     - Does one record contain more or cleaner information than the other?
     Update: secondary to primary according to defined logic
     - Dataset to dataset
     - Table to table
     Merge: primary1 + primary2 = new primary
     (Maintain data source identity to assure data integrity, traceability, and recoverability.)
     - Record the primary keys from both records in the new record
     - Record in the new record the respective matching fields, if other than the primary keys
     - Create a new unique primary key for the new record
     - Group fields in the new record according to their respective source
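One way to picture the merge guidelines is the sketch below. The field names (`merged_id`, `source_a_key`, and so on), the sample records, and the `merge_records` helper are all hypothetical choices for illustration; the lecture does not prescribe a record layout.

```python
from itertools import count

def merge_records(rec_a, rec_b, key_a, key_b, new_ids):
    """Build a merged record that keeps both source identities traceable."""
    return {
        "merged_id": next(new_ids),    # new unique primary key
        "source_a_key": rec_a[key_a],  # primary key from the first source
        "source_b_key": rec_b[key_b],  # primary key from the second source
        "source_a": dict(rec_a),       # fields grouped by their source
        "source_b": dict(rec_b),
    }

new_ids = count(1)
a = {"cust_no": "C001", "name": "Anthony Able", "zip": "14533"}
b = {"member_id": "M-77", "name": "A. Able", "zip": "14533"}
merged = merge_records(a, b, "cust_no", "member_id", new_ids)
print(merged["merged_id"], merged["source_a_key"], merged["source_b_key"])
```

Keeping each source's fields in its own group means a questionable match can later be traced back to, and recovered from, the original records.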
  17. III. Simple XML
     - Concepts & Considerations
     - XML Building Blocks
     - DTD Building Blocks
     - Quality & Standardization
  18. Concepts & Considerations
     - XML is hierarchical and extensible
     - Describes content, not layout or formatting
     - Can reference external objects
     - Because content is separated from layout, content may be multi-purposed
     - "Well Formed" (follows all XML conventions); "Valid" (agrees with the document definition)
     - Being "well formed" doesn't necessarily mean a document is "valid"
     - There is no "right way" to code documents, but some approaches may be more effective than others
     - Keep XML structures as simple and clean as possible; avoid unnecessary complexity
     - White space is preserved; indenting sub-elements and inserting comments may help readability, but may also interfere with integration processing
     - Fixed-width fonts during XML preparation increase text control at the character level
     - Formatting XML documents is superfluous, distracting, and a potential source of problems; plain lower-case text is best!
     - For plain text, a text editor is a better choice than a word processor because in XML processing, "what you see is not necessarily what you think you have"
     - XML structure and sequence drive processing: document design and XML must correlate
  19. XML Building Blocks
     XML Declaration
        <?xml version="1.0" encoding="UTF-8"?>
     Namespace Declaration
        Schema Access:
        <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
        XSL transformations (XSLT):
        <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
     Document Type Declaration
        <!DOCTYPE rootname SYSTEM "filename.dtd">
     Processing Instruction (CSS Reference)
        <?xml-stylesheet type="text/css" href="path/filename.css"?>
     Comment
        <!-- the double dashes are not single hyphens -->
  20. Elements & Attributes
     Element
        <any_name>some element content</any_name>
     Indented
        <state>
           <name>New York</name>
           <abbreviation>NY</abbreviation>
        </state>
     Not Indented
        <state>
        <name>New York</name>
        <abbreviation>NY</abbreviation>
        </state>
     Empty
        <tag_name></tag_name>
        <tag_name/>
  21. Elements & Attributes continued
     Attribute
        <element attribute="item">element content</element>
        <state abbreviation="NY">New York</state>
     External file reference
     (Provides added content without impacting XML content stream size.)
        <image href="file://server3/photographs/RedRose.jpg"/>
        <state>
           <name>New York</name>
           <abbreviation>NY</abbreviation>
           <map href="file://NYmap.jpg" />
           <population/>
        </state>
  22. Defined Entities
     Predefined Entities
        &amp;    ampersand
        &lt;     less-than or left angle bracket
        &gt;     greater-than or right angle bracket
        &quot;   quote marks
        &apos;   apostrophe or single quote mark
     General Entities (defined by the developer)
     Substitute an entity into the XML when a value might not be parsed correctly by the XML processor.
        <!ENTITY entity_name "substituted value">
     The example shows the declaration, the XML statement, and the parsing result:
        <!ENTITY bang "!">
        <warning>Hazardous Material&bang; Do Not Ingest&bang;</warning>
        RESULT: Hazardous Material! Do Not Ingest!
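Substituting the predefined entities before text is written into an XML stream can be sketched with Python's standard library. The `to_xml_text` and `from_xml_text` helper names are assumptions for this example; `escape` and `unescape` come from `xml.sax.saxutils`, which handles `&`, `<`, and `>` by default and accepts extra mappings for the two quote entities.

```python
from xml.sax.saxutils import escape, unescape

# Extra mappings for the two quote entities, which escape() skips by default.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARS = {"&quot;": '"', "&apos;": "'"}

def to_xml_text(value):
    """Replace syntactical characters with the five predefined entities."""
    return escape(value, QUOTE_ENTITIES)

def from_xml_text(value):
    """Reverse the substitution, giving back the original text."""
    return unescape(value, QUOTE_CHARS)

encoded = to_xml_text("Ham & Swiss on Rye")
print(encoded)
print(from_xml_text(encoded))
```

Escaping the ampersand first matters: doing it last would re-escape the ampersands that belong to the other entities.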
  23. "Sack Lunch" XML Example
     <?xml version="1.0" encoding="UTF-8"?>                  XML Declaration
     <!DOCTYPE lunch SYSTEM "SackLunch.dtd">                 DTD Declaration
     <!-- Sack lunch for the day -->                         Comment
     <lunch>                                                 Data Root
        <day name="Monday">                                  Element with Attribute
           <sandwich>Ham &amp; Swiss on Rye</sandwich>
           <fruit>Orange</fruit>
           <chips>Potato Ruffles</chips>
           <drink>Root Beer</drink>
           <sweet>Chocolate Bar</sweet>
           <nuts>Cashews</nuts>
        </day>
        <day name="Wednesday">
           <sandwich>Egg &amp; Olive on Wheat</sandwich>     Element content with Pre-defined Entity
           <fruit>Apple</fruit>
           <chips>Corn Chips</chips>
           <drink>Iced Tea</drink>
           <sweet>Molasses Cookies</sweet>
           <nuts>Smoked Almonds</nuts>
        </day>
     </lunch>
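To show how the attribute and the entity in this example look to a program, here is a small Python sketch using the standard `xml.etree.ElementTree` parser. The `LUNCH` string is an abbreviated copy of the slide's XML (declaration, DOCTYPE, and most elements omitted for brevity), and the `menu` variable is a name chosen for the example.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the "Sack Lunch" example.
LUNCH = """<lunch>
  <day name="Monday">
    <sandwich>Ham &amp; Swiss on Rye</sandwich>
    <fruit>Orange</fruit>
  </day>
  <day name="Wednesday">
    <sandwich>Egg &amp; Olive on Wheat</sandwich>
    <fruit>Apple</fruit>
  </day>
</lunch>"""

root = ET.fromstring(LUNCH)

# The attribute is read with get(); element content is read as text, and
# the parser has already turned &amp; back into a plain ampersand.
menu = {day.get("name"): day.findtext("sandwich") for day in root.findall("day")}
for day_name, sandwich in menu.items():
    print(day_name, "->", sandwich)
```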
  24. DTD Building Blocks
     Common element statements (see definitions below):
        <!ELEMENT elementName (#PCDATA)>
        <!ELEMENT elementName (sequence of sub-elements)>
        <!ELEMENT elementName EMPTY>
     Definitions: element terms and qualifiers
        #PCDATA   Parsed character data: text that the XML processor parses
        CDATA     Character data; in a DTD, CDATA is an attribute type rather than an element content model (unparsed text inside an element is written as a CDATA section in the XML document)
        EMPTY     The element has no content or embedded sub-elements
        ( )       Element grouping
        ,         Element separator indicating specific order
        |         Element "or" separator indicating choice
        +         Element may occur 1 or more times
        ?         Element may occur 0 or 1 time
        *         Element may occur 0 or more times
  25. DTD Building Blocks continued
     Common attribute statements (see definitions below):
        <!ATTLIST elementName attributeName TYPE #KEYWORD>
        <!ATTLIST elementName valueName (val1|val2|val3) "defaultVal">
        <!ATTLIST elementName href CDATA #REQUIRED>
     Definitions: attribute types and keywords
        CDATA       The attribute contains character data
        ID          The attribute provides a unique XML identifier name
        #REQUIRED   An attribute value is mandatory
        #IMPLIED    An attribute value is optional
        #FIXED      The attribute has a fixed value
  26. "Sack Lunch" DTD Example
     <?xml version="1.0" encoding="UTF-8"?>                              XML Declaration
     <!-- Sack Lunch DTD for one or more weekdays -->                    Comment
     <!ELEMENT lunch (day)+>                                             Data Root
     <!ELEMENT day (sandwich, fruit, chips, drink, sweet, nuts) >        XML Element and list of "sub" elements
     <!ELEMENT sandwich (#PCDATA)>                                       Elements nested within "day"
     <!ELEMENT fruit (#PCDATA)>
     <!ELEMENT chips (#PCDATA)>
     <!ELEMENT drink (#PCDATA)>
     <!ELEMENT sweet (#PCDATA)>
     <!ELEMENT nuts (#PCDATA)>
     <!ATTLIST day name CDATA #REQUIRED >                                Attribute belonging to the Element "day"
  27. "Mailer" Data Stream Example
     DTD & XML for a VDP Document
     DTD
        <?xml version="1.0" encoding="UTF-8"?>
        <!-- DTD for Mailer -->
        <!ELEMENT dataroot (row+)>
        <!ELEMENT row (name, image, location, offer, product, title,
                       first, last, street, city, state, zip, code) >
        <!ELEMENT name (#PCDATA)>
        <!ELEMENT image EMPTY>
        <!ELEMENT location (#PCDATA)>
        <!ELEMENT offer (#PCDATA)>
        <!ELEMENT product (#PCDATA)>
        <!ELEMENT title (#PCDATA)>
        <!ELEMENT first (#PCDATA)>
        <!ELEMENT last (#PCDATA)>
        <!ELEMENT street (#PCDATA)>
        <!ELEMENT city (#PCDATA)>
        <!ELEMENT state (#PCDATA)>
        <!ELEMENT zip (#PCDATA)>
        <!ELEMENT code (#PCDATA)>
        <!ATTLIST image href CDATA #REQUIRED>
     XML
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE dataroot SYSTEM "Mailer.dtd">
        <dataroot>
        <row>
        <name>Anthony</name>
        <image href="file://Berries.jpg"/>
        <location>North Side</location>
        <offer>5% Discount</offer>
        <product>Paving Stone</product>
        <title>Mr.</title>
        <first>Anthony</first>
        <last>Able</last>
        <street>27 Able Street</street>
        <city>North Side</city>
        <state>NY</state>
        <zip>14533</zip>
        <code>14533-27200803</code>
        </row>
        </dataroot>
  28. XML Standards & Conventions
     Well Formed
       1. XML documents should begin with an XML declaration (optional in XML 1.0, but when present it must come first)
       2. An XML document must contain a single root element that contains all the other elements
       3. All elements must be nested and not overlap (child tags are closed prior to closing the parent)
       4. XML tags are case-sensitive and need consistency
       5. Tag names can't contain spaces or start with "XML", a number, or punctuation (except underscore)
       6. All elements require start and end tags (an empty element may use the self-closing form, such as <tag_name/>)
       7. All attribute values must be enclosed in quote marks
       8. Syntactical characters are replaced with predefined entities
       9. The conventions apply to all types of XML documents
     Valid
     The XML-tagged document must match the document's definition (DTD).
     "Internal DTD": the definition is contained within brackets in the XML file.
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE document_root [ DTD definition details here ]>
     "External DTD": the XML file and the DTD file are separate.
        <?xml version="1.0" encoding="UTF-8"?>
        <!DOCTYPE document_root SYSTEM "path/dtdName.dtd">
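A quick well-formedness check can be sketched in Python with the standard `xml.etree.ElementTree` parser. The `is_well_formed` helper and the sample strings are assumptions for this example. Note that this parser does not validate against a DTD, so it tests only the "well formed" half of the slide, not the "valid" half.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if the text parses as XML; this does not check validity."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True

good = "<state><name>New York</name><abbreviation>NY</abbreviation></state>"
overlapping = "<state><name>New York</state></name>"    # breaks rule 3
case_mismatch = "<State>New York</state>"               # breaks rule 4
unquoted = "<state abbreviation=NY>New York</state>"    # breaks rule 7
two_roots = "<state>NY</state><state>OH</state>"        # breaks rule 2

for sample in (good, overlapping, case_mismatch, unquoted, two_roots):
    print(is_well_formed(sample), sample)
```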
  29. IV. Basic Data Processing
     - Handling Text Strings
     - Handling Typical Data Structures
     - Simple Programming Logic
     - "DogDB" Processing Example
  30. Handling Text Strings
     Fixed-Length Record and Field Processing
     - Files contain records; records contain fields made of one or more characters in a string.
     - This sample text string has 26 characters plus a line feed (LF): "abcdefghijklmnopqrstuvwxyz[LF]"
     - Each character occupies a position in the string, numbered sequentially from zero through 25.
     - If the first field is six characters long, its substring starts at position zero (the first character position) and spans six character positions (zero through five), giving "abcdef".
     - If the next field is ten characters long, its substring starts at position six and spans positions six through fifteen, giving "ghijklmnop".
     - Each record in the file would be a 26-character string subdivided into fields by substrings.
     - AT ISSUE: any deviation in the number of characters would corrupt field processing.
     Delimited (Variable-Length) Record and Field Processing
     - Delimiters resolve any variances in character counts.
     - Record fields can be delimited with commas, a format called comma-separated values (CSV): "abcdef,ghijklmnop,qrstuvwxyz[LF]"
     - Other values that are not likely to appear in the text may also be used as delimiters, for example the tab-delimited record: "abcdef[Tab]ghijk,lmnop[Tab]qrstuvwxyz[LF]"
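Both approaches can be sketched in a few lines of Python using the slide's sample string. The helper names and the third field width (ten characters, so that the widths total 26) are assumptions for this example; the slide specifies only the first two fields.

```python
RECORD = "abcdefghijklmnopqrstuvwxyz\n"

# Field widths: 6 and 10 come from the slide; the final 10 is an assumption.
FIELD_WIDTHS = [6, 10, 10]

def split_fixed(record, widths):
    """Cut a fixed-length record into fields by character position."""
    fields = []
    start = 0
    for width in widths:
        fields.append(record[start:start + width])
        start += width
    return fields

def split_delimited(record, delimiter):
    """Cut a variable-length record into fields at each delimiter."""
    return record.rstrip("\n").split(delimiter)

fixed = split_fixed(RECORD.rstrip("\n"), FIELD_WIDTHS)
csv_fields = split_delimited("abcdef,ghijklmnop,qrstuvwxyz\n", ",")
tab_fields = split_delimited("abcdef\tghijk,lmnop\tqrstuvwxyz\n", "\t")

print(fixed)
print(csv_fields)
print(tab_fields)
```

In the tab-delimited record the comma is ordinary field content, which is why a delimiter should be a character that is unlikely to appear in the data.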
  31. Handling Typical Data Structures
     - Sequential File Processing
     - Table or Spreadsheet Processing
     - XML DOM & XSLT Processing
     - XML Sequential Processing
  32. Sequential File Processing
     File Structure                    File Processing Sequence
     File Header                       Start Processing (optional Open of new file)
     First Record                      Read Record
        Field 1, Field 2, Field 3      Read Record Fields (optional Write to new file)
     Second Record                     Read Record
        Field 1, Field 2, Field 3      Read Record Fields (optional Write to new file)
     Nth Record                        Read Record
        Field 1, Field 2, Field 3      Read Record Fields (optional Write to new file)
     End of File (EOF)                 Complete Processing (Close any open files)
  33. Table or Spreadsheet
     Processing of selected Columns within selected Rows
     Column 1       Column 2   Column 3    Column 4   Column 5   Column n
     (usually Key)  (name)     (address)   (city)     (state)    (etc.)
     Row 1          name 1     address 1   city 1     state 1    etc. 1
     Row 2          name 2     address 2   city 2     state 2    etc. 2
     Row 3          name 3     address 3   city 3     state 3    etc. 3
     Row 4          name 4     address 4   city 4     state 4    etc. 4
     Row 5          name 5     address 5   city 5     state 5    etc. 5
     Row n          name n     address n   city n     state n    etc. n
     Data Cleansing:
     - Seek by row and column selection criteria
     - Process selected row columns (read, update, delete, add)
     - Seek by row and column selection criteria for targeted processing
     Data Flattening:
     - Make conditional table joins to link related data
     - Query join results to create new table views for further processing
     Data Enhancement:
     - Convert views into tables and perform further enhancement processing
  34. XML DOM Processing Structure
      Document Object Model Hierarchical Structure:
        XML Header
        Data Root Tag
          First Row Tag
            column tag 1
            column tag 2
          Second Row Tag
            column tag 1
            column tag 2
          Last Row Tag
            column tag 1
            column tag 2
      Hierarchical Recursive Processing:
        Load:       Load Entire Document or File into Memory
        Traverse:   Start at Root, Down by Branch, Across Levels, Repeat to End
        Select:     Locate by Tag, Identify Content, Identify Context, Record Content
        Execute:    Move, Delete, Add, Revise, Replace
        Construct:  Generate a New Document or File, Output to Designated Storage
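As a rough illustration of the load, traverse, select, execute, and construct steps, the sketch below uses Python's standard `xml.dom.minidom` module (Python rather than the lecture's Perl). The two-row sample document is made up for the example.

```python
from xml.dom import minidom

# Load: the entire document is parsed into memory as a tree.
doc = minidom.parseString(
    "<dataroot>"
    "<row><name>Slinky</name><weight>115</weight></row>"
    "<row><name>Nora</name><weight>90</weight></row>"
    "</dataroot>"
)

# Traverse and select: locate each row, then its child elements, by tag.
for row in doc.getElementsByTagName("row"):
    name = row.getElementsByTagName("name")[0]
    weight = row.getElementsByTagName("weight")[0]
    # Execute: revise one element's text content and delete another element.
    name.firstChild.data = name.firstChild.data.upper()
    row.removeChild(weight)

# Construct: serialize the modified tree as new XML text.
result = doc.documentElement.toxml()
print(result)
```

Because the whole tree is held in memory, this style suits small to medium documents; very large files are better served by the sequential approach.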
  35. Processing with XSLT
      Extensible Stylesheet Language Transformations (XSLT)
      Inputs:     Source XML File
                  XSL Style Sheet (new XML file generation instructions)
      Processor:  XSLT Processor
      Output:     Revised XML File
      The original XML file is preserved.
      A new XML file is assembled in XSLT storage.
      The result is a new XML file.
      Note: memory-intensive processing
  36. XML Sequential Processing
                               Sequential Processing by XML Tag
      XML Header
      Data Root Tag            Start Processing (optional Open of new file)
      First Row Tag            Read Record (load data into memory)
        column tag 1, 2        Read Record Fields (optional Write to new file)
      Row End Tag
      Second Row Tag           Read Record (load new data into memory)
        column tag 1, 2        Read Record Fields (optional Write to new file)
      Row End Tag
      Last Row Tag             Read Record (load new data into memory)
        column tag 1, 2        Read Record Fields (optional Write to new file)
      Row End Tag
      Data Root End Tag        Complete Processing (Close any open files)
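One way to process XML row by row, without holding the whole document in memory, is an incremental parser. The sketch below uses `xml.etree.ElementTree.iterparse` from Python's standard library (again Python, not the lecture's Perl); the sample data is illustrative.

```python
import io
import xml.etree.ElementTree as ET

xml_bytes = b"""<dataroot>
<row><number>1</number><name>Slinky</name></row>
<row><number>2</number><name>Nora</name></row>
</dataroot>"""

names = []
# An "end" event fires each time a closing tag is reached, so a complete
# row is available as soon as its Row End Tag has been read.
for event, elem in ET.iterparse(io.BytesIO(xml_bytes), events=("end",)):
    if elem.tag == "row":
        names.append(elem.findtext("name"))  # read the record's fields
        elem.clear()                         # discard the row's children

print(names)  # ['Slinky', 'Nora']
```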
  37. Simple Programming Logic
      Fundamental Logic
      Program Overview
  38. Simple Read/Write Logic
      Access (open) File
      Basic File Processing Loop
        Read a Record (string)
        Test for File End
          If end of file (EOF), then close file and exit processing;
          Else next
        Test Selection (substring)
          If substring matches criteria, then perform specific actions;
          Else next
        Return to Top
      Note: disk-interactive processing
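The loop above might look like the following in Python (a sketch, not the lecture's Perl program). In-memory `io.StringIO` objects stand in for real input and output files so the example is self-contained, and the selection rule (skip XML comment lines) is an assumed example of "specific actions".

```python
import io

# Stand-ins for opened input and output files.
infile = io.StringIO("<ROW>\n<name>Slinky</name>\n<!-- note -->\n</ROW>\n")
outfile = io.StringIO()

while True:
    record = infile.readline()      # read a record (string)
    if record == "":                # an empty string signals end of file
        break                       # exit processing
    record = record.rstrip("\n")
    if record[0:4] == "<!--":       # test selection (substring)
        continue                    # matched: skip the comment line
    outfile.write(record + "\n")    # else: write the record unchanged

output_text = outfile.getvalue()
print(output_text)
```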
  39. XML Tag Process Logic
      Reading left to right, each successive line from the top...
      Select and Process a Line of Text:
        XML Element:           <tag> Some enclosed content </tag>
        Same as Text:          <tag> Some enclosed content </tag>
        String of Characters:  "<tag> Some enclosed content </tag>"
        Substring:             "<tag> Some" (10 characters long, starting at zero)
      Decision Criteria:
        If Substring(0,10) = "<tag> Some"
        Then do a specific process or group of processes
        Otherwise (Else) do a different process or simply advance to the next element
  40. Processing Program Overview
      Perl Declaration & Processing Parameters
        Input/Output Identification: Enter File Names or URIs
      Initiating Primary Processing Subroutine Calls
        Primary Process Calls
      File Handling Logic
        Subroutine: Open Files (Input & Output), the main process start call
        Subroutine: Close Files (Input & Output), the main process end call
      Packaged Reusable Primary File Processing Logic
        Subroutine(s): Primary Logic (one or more); the file processing loop
      Additional Processing Logic
        Subroutine(s): Secondary (zero or more); return control to caller after processing completion
  41. "DogDB" Processing Example
  42. A Short File of Dogs
      number  name    breed              gender  photo               weight
      1       Slinky  German Shepherd    Male    file://slinky.jpg   115
      2       Nora    Doberman Pinscher  Female  file://nora.jpg     90
      3       Crunch  Rottweiler         Male    file://crunch.jpg   125
      Possible Storage Formats:
        Sequential File of Comma-Separated Values (CSV) with or without Header Row
        Sequential File of Tab-Separated Values (TSV) with or without Header Row
        Spreadsheet with or without Header Row
        Database Table (will automatically use field names for Column Heads)
      Objectives:
        1. Import data into a Desktop Database (FileMaker Pro in this example)
        2. Cleanse and Enhance Data
        3. Convert it into a usable XML Data Stream
        4. Sanitize & simplify XML to avoid potential layout integration problems
  43. Raw Converted XML
      XML Source: data cleanup and enhancement results exported into XML from FileMaker Pro (the export is one continuous line)
      <?xml version="1.0" encoding="UTF-8" ?><!-- This grammar has been deprecated - use FMPXMLRESULT instead --><FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult"><ERRORCODE>0</ERRORCODE><DATABASE>DogDB.fp7</DATABASE><LAYOUT></LAYOUT><ROW MODID="3" RECORDID="1"><number>1</number><name>Slinky</name><breed>German Shepherd</breed><gender>Male</gender><photo>file://slinky.jpg</photo><weight>115</weight></ROW><ROW MODID="2" RECORDID="2"><number>2</number><name>Nora</name><breed>Doberman Pinscher</breed><gender>Female</gender><photo>file://nora.jpg</photo><weight>90</weight></ROW><ROW MODID="2" RECORDID="3"><number>3</number><name>Crunch</name><breed>Rottweiler</breed><gender>Male</gender><photo>file://crunch.jpg</photo><weight>125</weight></ROW></FMPDSORESULT>
      Preliminary Setup
        Use a text editor to search on the "><" character combination and replace it with ">CRLF<" or another line-ending symbol appropriate to the system.
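The preliminary search and replace can also be scripted. A minimal Python sketch (not part of the lecture's Perl code), using a shortened sample of the raw export:

```python
# One continuous line, as exported; shortened here for illustration.
raw = '<ROW MODID="3" RECORDID="1"><number>1</number><name>Slinky</name></ROW>'

# Insert a line feed at every "><" junction so each tag starts a new line.
prepared = raw.replace("><", ">\n<")
lines = prepared.split("\n")

for line in lines:
    print(line)
```

With one tag per line, the later steps can test the start of each line with a simple substring comparison.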
  44. Search/Replace Results
      <?xml version="1.0" encoding="UTF-8" ?>
      <!-- This grammar has been deprecated - use FMPXMLRESULT instead -->
      <FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult">
      <ERRORCODE>0</ERRORCODE>
      <DATABASE>DogDB.fp7</DATABASE>
      <LAYOUT/>
      <ROW MODID="3" RECORDID="1">
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo>file://slinky.jpg</photo>
      <weight>115</weight>
      </ROW>
      <ROW MODID="2" RECORDID="2">
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo>file://nora.jpg</photo>
      <weight>90</weight>
      </ROW>
      <ROW MODID="2" RECORDID="3">
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo>file://crunch.jpg</photo>
      <weight>125</weight>
      </ROW>
      </FMPDSORESULT>
  45. Export Result using FMPXMLRESULT
      <?xml version="1.0" encoding="UTF-8" ?>
      <FMPXMLRESULT xmlns="http://www.filemaker.com/fmpxmlresult">
      <ERRORCODE>0</ERRORCODE>
      <PRODUCT BUILD="11-30-2007" NAME="FileMaker Pro" VERSION="8.5v2"/>
      <DATABASE DATEFORMAT="M/d/yyyy" LAYOUT="" NAME="DogDB.fp7" RECORDS="3" TIMEFORMAT="h:mm:ss a"/>
      <METADATA>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="number" TYPE="NUMBER"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="name" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="breed" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="gender" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="photo" TYPE="TEXT"/>
      <FIELD EMPTYOK="YES" MAXREPEAT="1" NAME="weight" TYPE="NUMBER"/>
      </METADATA>
      <RESULTSET FOUND="3">
      <ROW MODID="2" RECORDID="1">
      <COL><DATA>1</DATA></COL>
      <COL><DATA>Slinky</DATA></COL>
      <COL><DATA>German Shepherd</DATA></COL>
      <COL><DATA>Male</DATA></COL>
      <COL><DATA>file://slinky.jpg</DATA></COL>
      <COL><DATA>115</DATA></COL>
      </ROW>
      (excerpt ends here on the slide)
      FMPXMLRESULT Export Format: Not What Is Needed!
        Superfluous XML (pink highlights on the slide)
        Columnar Data not Tagged with Database Column Names (yellow highlights on the slide)
  46. Export Result using FMPDSORESULT
      <?xml version="1.0" encoding="UTF-8" ?>
      <!-- This grammar has been deprecated - use FMPXMLRESULT instead -->
      <FMPDSORESULT xmlns="http://www.filemaker.com/fmpdsoresult">
      <ERRORCODE>0</ERRORCODE>
      <DATABASE>TestDB.fp7</DATABASE>
      <LAYOUT/>
      <ROW MODID="2" RECORDID="1">
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo>file://slinky.jpg</photo>
      <weight>115</weight>
      </ROW>
      <ROW MODID="0" RECORDID="2">
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo>file://nora.jpg</photo>
      <weight>90</weight>
      </ROW>
      <ROW MODID="0" RECORDID="3">
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo>file://crunch.jpg</photo>
      <weight>125</weight>
      </ROW>
      </FMPDSORESULT>
      Notice the differences from the previous export format!
        Delete all pink and yellow highlighted items
        Modify all green highlighted items
  47. Fully Prep'ed XML
      <?xml version="1.0" encoding="UTF-8" ?>
      <dataroot>
      <row>
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo href="file://slinky.jpg"/>
      <weight>115</weight>
      </row>
      <row>
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo href="file://nora.jpg"/>
      <weight>90</weight>
      </row>
      <row>
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo href="file://crunch.jpg"/>
      <weight>125</weight>
      </row>
      </dataroot>
      All superfluous XML removed
      All superfluous "white space" removed
      Image links converted to attribute references linking external files
      How do we get there?
  48. Option 1: Using a Text Editor
      Extra XML: simply delete!
      XML or content changes: use Search & Replace
      Create Attribute Reference Links:
        1. Target text string: <photo>file://nora.jpg</photo>
        2. Search on <photo>, and replace it with <photo href="
        3. Search on </photo>, and replace it with "/>
        4. The result is <photo href="file://nora.jpg"/>
      Get rid of "row" attributes:
        1. Target text string: <ROW MODID="2" RECORDID="1">
        2. Set search to ignore case
        3. Search on the whole tag, from <row modid= through the closing >, and replace it with <row>. Because the attribute values differ from row to row, this needs a wildcard or pattern search if the editor offers one.
        4. The result is <row>
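For readers whose editor lacks pattern search, the same replacements can be sketched with regular expressions. This is a Python illustration, not part of the lecture; the pattern `<ROW\b[^>]*>` is one assumed way to express "the whole opening ROW tag, whatever its attributes".

```python
import re

text = '<ROW MODID="2" RECORDID="1">\n<photo>file://nora.jpg</photo>\n</ROW>'

# Replace the opening ROW tag, including any attributes, ignoring case.
text = re.sub(r"<ROW\b[^>]*>", "<row>", text, flags=re.IGNORECASE)
# Lower-case the matching end tag as well.
text = re.sub(r"</ROW>", "</row>", text, flags=re.IGNORECASE)
# Turn the photo element's content into an href attribute.
text = text.replace("<photo>", '<photo href="').replace("</photo>", '"/>')

print(text)
```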
  49. Option 2: XSLT Step 1
      XSL Stylesheet:
      <?xml version="1.0" encoding="UTF-8"?>
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <!-- Note: unprefixed match patterns only match elements in no namespace,
             so this assumes the xmlns declaration was removed from FMPDSORESULT first -->
        <!-- Remove superfluous xml elements -->
        <xsl:template match="ERRORCODE"/>
        <xsl:template match="DATABASE"/>
        <xsl:template match="LAYOUT"/>
        <!-- Copy ROW Children into new row parents -->
        <xsl:template match="ROW">
          <row>
            <xsl:copy-of select="*|node()"/>
          </row>
        </xsl:template>
        <!-- Replace FMPDSORESULT root with a new root -->
        <xsl:template match="/">
          <dataroot>
            <xsl:apply-templates/>
          </dataroot>
        </xsl:template>
      </xsl:stylesheet>
  50. Option 2: XSLT Result #1
      <?xml version="1.0" encoding="UTF-8"?>
      <dataroot>
        <row>
          <number>1</number>
          <name>Slinky</name>
          <breed>German Shepherd</breed>
          <gender>Male</gender>
          <photo>file://slinky.jpg</photo>
          <weight>115</weight>
        </row>
        <row>
          <number>2</number>
          <name>Nora</name>
          <breed>Doberman Pinscher</breed>
          <gender>Female</gender>
          <photo>file://nora.jpg</photo>
          <weight>90</weight>
        </row>
        <row>
          <number>3</number>
          <name>Crunch</name>
          <breed>Rottweiler</breed>
          <gender>Male</gender>
          <photo>file://crunch.jpg</photo>
          <weight>125</weight>
        </row>
      </dataroot>
  51. Option 2: XSLT Step 2
      XSL Stylesheet:
      <?xml version="1.0" encoding="UTF-8"?>
      <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
        <xsl:output method="xml" version="1.0" encoding="UTF-8" indent="yes"/>
        <xsl:strip-space elements="*"/>
        <!-- Add new image element with reference attribute -->
        <!-- Add new weight element following image element -->
        <xsl:template match="row">
          <row>
            <xsl:apply-templates/>
            <photo href="{photo}"/>
            <weight><xsl:value-of select="weight"/></weight>
          </row>
        </xsl:template>
        <!-- Remove old duplicate elements -->
        <xsl:template match="photo"/>
        <xsl:template match="weight"/>
        <!-- Copy everything else -->
        <xsl:template match="@*|node()">
          <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
          </xsl:copy>
        </xsl:template>
      </xsl:stylesheet>
  52. Option 2: XSLT Result #2
      <?xml version="1.0" encoding="UTF-8"?>
      <dataroot>
        <row>
          <number>1</number>
          <name>Slinky</name>
          <breed>German Shepherd</breed>
          <gender>Male</gender>
          <photo href="file://slinky.jpg"/>
          <weight>115</weight>
        </row>
        <row>
          <number>2</number>
          <name>Nora</name>
          <breed>Doberman Pinscher</breed>
          <gender>Female</gender>
          <photo href="file://nora.jpg"/>
          <weight>90</weight>
        </row>
        <row>
          <number>3</number>
          <name>Crunch</name>
          <breed>Rottweiler</breed>
          <gender>Male</gender>
          <photo href="file://crunch.jpg"/>
          <weight>125</weight>
        </row>
      </dataroot>
      All superfluous XML removed
      Image links converted to attribute references linking to external files
      Superfluous "leading white space" will need removal with a Text Editor
  53. Option 3: Cleanup with Perl
      XML Cleanup: Program Logic Overview
        1. Call the "open" subroutine to commence file processing
        2. Enter the processing loop and read a record, removing its trailing line ending
        3. If a "ROW" element with attributes, replace it with a "row" element without attributes
        4. If an end "ROW" element, replace it with an end "row" element
        5. If a comment, skip it and don't write it to output
        6. If the starting "FMPDSORESULT" tag, replace it with "dataroot" (do the same with the corresponding end tag when encountered)
        7. If "ERRORCODE", "DATABASE", or "LAYOUT" tags, skip them and don't write to output
        8. If a "photo" tag, rebuild it using the "hrefElement" subroutine
        9. For each record read, either write it as is, write its replacement, or write nothing; then read the next record
        10. After reading all records, call the "close" subroutine to terminate file processing
  54. Option 3: The Perl Program
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # Set up input & output file names
      $inputfl = "rawXMLin.txt";
      $outputfl = "cleanXMLout.xml";
      $transfl = "deleteME.xml";    # Delete this file after program completes
      $href = "href=\"";
      $endtag = "\"/>";

      # Main processing function calls
      &initializeXML;
      &cleanseXML;

      # Open input & output files for processing
      sub openFiles {
          $infl = $_[0];
          $outfl = $_[1];
          open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
      }

      # Close input & output files after processing
      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
      }

      # Prepare raw file by inserting line feed between >< junctions
      sub initializeXML {
          &openFiles($inputfl, $transfl);
          while ($import = <INFILE>) {
              chomp($import);
              $import =~ s/></>\n</g;
              $export = $import . "\n";
              print OUTFILE $export;
          }
          &closeFiles;
      }
      (continued on next page)
  55. Option 3: Perl Program continued
      # Main file preparation function
      sub cleanseXML {
          &openFiles($transfl, $outputfl);
          while ($import = <INFILE>) {
              chomp($import);
              if (substr($import,0,4) eq "<ROW") {$import = "<row>";}
              elsif (substr($import,0,6) eq "</ROW>") {$import = "</row>";}
              elsif (substr($import,0,4) eq "<!--") {next;}
              elsif (substr($import,0,13) eq "<FMPDSORESULT") {$import = "<dataroot>";}
              elsif (substr($import,0,11) eq "<ERRORCODE>") {next;}
              elsif (substr($import,0,10) eq "<DATABASE>") {next;}
              elsif (substr($import,0,8) eq "<LAYOUT>") {next;}
              elsif (substr($import,0,9) eq "</LAYOUT>") {next;}
              elsif (substr($import,0,15) eq "</FMPDSORESULT>") {$import = "</dataroot>";}
              elsif (substr($import,0,7) eq "<photo>") {$import = hrefElement($import);}
              $export = $import . "\n";
              print OUTFILE $export;
          }
          &closeFiles;
      }

      # href preparation function
      sub hrefElement {
          my $element = $_[0];
          $lentxt = length($element);
          $boundary1 = index($element,">");            # end of the opening tag
          $tag = substr($element,0,$boundary1);        # e.g. "<photo"
          $offset = $boundary1 + 1;                    # first content character
          $boundary2 = ($lentxt - rindex($element,"<"));   # length of the end tag
          $sublen = $lentxt - $offset - $boundary2;    # length of the content
          $content = substr($element, $offset, $sublen);
          $newelement = ($tag . " " . $href . $content . $endtag);
          return $newelement;
      }
  56. Option 4: File Conversion with Perl
      Convert a Spreadsheet or Delimited File directly to clean, integration-ready XML using a Perl program
      Spreadsheet or Delimited File (columns are separated by tab characters):
      number  name    breed              gender  photo               weight
      1       Slinky  German Shepherd    Male    file://slinky.jpg   115
      2       Nora    Doberman Pinscher  Female  file://nora.jpg     90
      3       Crunch  Rottweiler         Male    file://crunch.jpg   125
      Any type of delimiter is acceptable, BUT tab delimiters present fewer problems
  57. Option 4: What You Need to Know
      CONVERSION CAVEATS:
        1. The absolute necessity of understanding the source data file or spreadsheet
        2. The absolute necessity of adhering to any conventions established by the program
      Additional Program Setup Considerations
        A. If the first row of the data file contains headings, set the program's "hdflag" to "1"; otherwise set it to "0" (zero).
        B. If the file doesn't have column headings:
           (1) Edit them directly into the file with a text editor or spreadsheet program, or
           (2) Replace the default "tag names" in the program with the desired names.
           (3) Ensure tag names correspond to the data columns (same number; same sequence).
        C. Ensure that the line endings at the end of each row in the delimited file conform to the system on which the operation will take place.
           (1) Mac OS X uses the standard UNIX line feed (LF).
           (2) PCs expect the carriage return and line feed combination (CRLF).
        D. References to external files or images use the format: file://URI path/file name with extension. The program explicitly searches for "file://" to identify references. (Example from "Dog Database": file://desktop/images/slinky.jpg.)
  58. Option 4: The Perl Program (page 1 of 3)
      #!/usr/bin/perl
      use warnings;
      use strict "subs";

      # Set input and output file names
      $inputfl = "DataFileIn.tab";
      $outputfl = "XMLFileOut.xml";
      $errlist = "ErrListOut.txt";

      # Static variables
      $declaration = '<?xml version="1.0" encoding="UTF-8"?>';
      $rootfore = "<dataroot>\n";
      $rootaft = "</dataroot>\n";
      $rowfore = "<row>\n";
      $rowaft = "</row>\n";
      $tagfore = "<";
      $tagaft = ">";
      $tagterm = "</";
      $tagend = ">\n";
      $href = " href=\"";
      $hrefend = "\"/>\n";
      $reftest = "file://";

      # XML element processing preparations
      # Column names can be revised and increased or decreased
      @tagName = qw/col1 col2 col3 col4 col5 col6 col7/;
      $cols = @tagName;

      # Column headings in file? 1 = yes; 0 = no
      # Set flag to 1 if first line of input contains column headings
      $hdflag = 0;
      if (not $hdflag) {&prepTags;}
  59. Option 4: Perl Program continued (page 2 of 3)
      # Conversion processing
      &openFiles($inputfl, $outputfl, $errlist);
      $export = $declaration . "\n" . $rootfore;
      print OUTFILE $export;
      $export = "DELIMITED DATA ERROR LISTING\n\n";
      print ERRFILE $export;
      $errcount = 0;      # initialized so the final test works when no errors occur
      &convert2xml;       # the listing defines this subroutine but never calls it; call added here

      # Read delimited data in, write XML out
      sub convert2xml {
          while ($import = <INFILE>) {
              chomp($import);
              @rowBuilder = split("\t", $import);
              $check = @rowBuilder;
              if ($hdflag) {
                  # First row holds the column headings: use them as tag names
                  $hdflag = 0;
                  @tagName = @rowBuilder;
                  $cols = @tagName;
                  &prepTags;
                  next;
              }
              if ($check != $cols) {
                  # Wrong number of fields: log the row and skip it
                  $errcount++;
                  $export = "Item Error " . $errcount . ": " . $import . "\n";
                  print ERRFILE $export;
                  next;
              }
              $export = &tagContent;
              print OUTFILE $export;
          }
          $export = $rootaft;
          print OUTFILE $export;
          if ($errcount == 0) {
              $export = "No data errors found!\n";
              print ERRFILE $export;
          }
          &closeFiles;
      }
  60. Option 4: Perl Program continued (page 3 of 3)
      # Prepare XML element tags
      sub prepTags {
          @tagFront = @tagName;
          @tagBack = @tagName;
          foreach $tag (@tagFront) {
              $tag = $tagfore . $tag . $tagaft;
          }
          foreach $tag (@tagBack) {
              $tag = $tagterm . $tag . $tagend;
          }
      }

      # Apply XML tags to row content
      sub tagContent {
          for ($i = 0; $i < $cols; $i++) {
              if (substr($rowBuilder[$i], 0, 7) eq $reftest) {
                  # File reference: build an empty element with an href attribute
                  $tag = substr($tagFront[$i], 0, (length($tagFront[$i]) - 1));
                  $rowBuilder[$i] = $tag . $href . $rowBuilder[$i] . $hrefend;
              }
              else {
                  $rowBuilder[$i] = $tagFront[$i] . $rowBuilder[$i] . $tagBack[$i];
              }
          }
          $newXMLrow = $rowfore . join("",@rowBuilder) . $rowaft;
          return $newXMLrow;
      }

      sub openFiles {
          $infl = $_[0];
          $outfl = $_[1];
          $errfl = $_[2];
          open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
          open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
          open(ERRFILE, ">$errfl") || die "Can't open $errfl: $!";
      }

      sub closeFiles {
          close(INFILE);
          close(OUTFILE);
          close(ERRFILE);
      }
  61. Option 4: Completely Prepared XML
      <?xml version="1.0" encoding="UTF-8" ?>
      <dataroot>
      <row>
      <number>1</number>
      <name>Slinky</name>
      <breed>German Shepherd</breed>
      <gender>Male</gender>
      <photo href="file://slinky.jpg"/>
      <weight>115</weight>
      </row>
      <row>
      <number>2</number>
      <name>Nora</name>
      <breed>Doberman Pinscher</breed>
      <gender>Female</gender>
      <photo href="file://nora.jpg"/>
      <weight>90</weight>
      </row>
      <row>
      <number>3</number>
      <name>Crunch</name>
      <breed>Rottweiler</breed>
      <gender>Male</gender>
      <photo href="file://crunch.jpg"/>
      <weight>125</weight>
      </row>
      </dataroot>
      This preparation was done in a single pass through a delimited file:
        One process
        No additional steps
        No manual text edits
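For comparison, the single-pass conversion can be sketched in Python using only the standard library. This is not the lecture's Perl program: the function name `rows_to_xml` and its parameters are hypothetical, and unlike the Perl listing it escapes XML-special characters in the data, which is an added assumption about what "clean" output should mean.

```python
import csv
import io
from xml.sax.saxutils import escape, quoteattr


def rows_to_xml(source, tag_names=None):
    """Convert tab-delimited rows to XML text in one pass.

    If tag_names is None, the first row is taken as the column headings.
    Returns (xml_text, rejected_rows).
    """
    out = ['<?xml version="1.0" encoding="UTF-8"?>', "<dataroot>"]
    errors = []
    for fields in csv.reader(source, delimiter="\t"):
        if tag_names is None:
            tag_names = fields            # heading row becomes the tag names
            continue
        if len(fields) != len(tag_names):
            errors.append("\t".join(fields))   # wrong field count: log and skip
            continue
        out.append("<row>")
        for tag, value in zip(tag_names, fields):
            if value.startswith("file://"):
                # File reference: empty element with an href attribute
                out.append(f"<{tag} href={quoteattr(value)}/>")
            else:
                out.append(f"<{tag}>{escape(value)}</{tag}>")
        out.append("</row>")
    out.append("</dataroot>")
    return "\n".join(out) + "\n", errors


data = "number\tname\tphoto\n1\tSlinky\tfile://slinky.jpg\n2\tNora\n"
xml_text, errors = rows_to_xml(io.StringIO(data))
print(xml_text)
print(errors)
```

The second data row is deliberately short one field, so it lands in the error list instead of the XML, mirroring the field-count check in the Perl program.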
  62. V. Data Preparation Demonstration:
      Data Acquisition & Import
      Data Cleanup
      Supplement, Arrange, & Sort
      Database Conversion to XML
  63. Acquiring Data - Initial Hurdles
      Commonly Encountered Operational Issues:
        Multiple data structures and schemas
        Multiple platforms and operating systems
        Multiple formats
        Multiple security levels
        Multiple locations
        Multiple transport media
        Multiple spoken languages
  64. Database Import
      1. Check the text file or spreadsheet to determine whether the first row contains column headings.
         a. If the first row contains column headings, make a note for possible use in the database import setup.
         b. If the first row contains data, use a text editor or spreadsheet application to insert a new first row of simple descriptive column names.
      2. The database application's import setup may only require the source file name, file type, delimiters used, and whether the first row of the file contains headings (life is much easier with column headings).
         a. As an additional setup requirement, some database applications may require creating an empty table as the destination for imported data.
         b. If so, the application may also require pairing each source data column heading with a corresponding destination table column heading (this pairing may be done automatically).
      3. If an empty destination table for receiving imported data must be built:
         a. The first column of the new table should automatically increment row numbers (values will be generated at import).
         b. Most fields for imported data (including zip codes and static numbers) can be designated as text.
         c. Fields for quantities and monetary values that may be used in calculations should be number fields (some databases may require explicit designation as integers, that is, whole numbers, or as floating or fixed-place decimals).
  65. Create Empty Database Table for Import
      If no appropriate table exists:
        Select "File > Define > Database" from the Menu Bar to add a new Table
        Select the "fields" button
        Enter the new field name
        Choose a field type
        Click the "create" button
  66. Database Import from Spreadsheet
      1. Select the Customer Worksheet
      2. Select the desired columns
         Omit: the design won't use these
  67. Data Evaluation Before Cleanup
      1. Based on project objectives and requirements:
         a. What types of record problems are irrelevant and can be ignored?
         b. What types of problems should be resolved?
         c. What types are significant and must be repaired?
      2. Based on the ratio of detected problem records to the total number of records in the data set, can the data set be sufficiently cleansed or can it be replaced?
         a. Is the replacement likely to be any better?
         b. If actually replaced, what is the condition of the replacement?
      3. Regarding individual records, on what basis should a record be replaced or discarded?
         a. If the question is about record discard, how significant is the record in relation to the entire data set?
         b. If the question is about replacement, what is the likelihood that the replacement will be any better?
  68. Potential Design-related Issues
      1. Normalization: inadequate or omitted normalization or database design steps
         a. Data redundancy: repeating data elements within individual tables
         b. Multiple values for the same data element: failure to establish lookup tables
         c. Duplicate records: unique record key not established or enforced for key-only related content
      2. Referential integrity: flawed parent and child table references are a design problem exacerbated by processing
         a. Update anomalies: updating the wrong parent or child record
         b. Insertion anomalies: adding a new record into the database using an existing key
         c. Deletion anomalies: deleting a parent record and orphaning its child records, or removing the parent record while deleting a child record
      3. Data integrity: database design deficiencies or programming logic that degrades the value or accuracy of the data, resulting in the following problems:
         a. Incomplete data: failure to select, capture, or account for all records or data elements within records
         b. Lost data: faulty program procedures that drop data elements or whole records during processing
         c. Missing data: data needs never comprehended during database analysis and design
         d. Corrupted data: data elements or whole records made unintelligible or unusable during processing
  69. Potential Usage Issues
      Incorrect or inconsistent spelling, and transposed characters
      Ambiguous names or variations of a name
      Incorrect, inconsistent, or mixed use of capitalization
      Improperly merged fields or merged content within a field
      Extraneous characters or leading and trailing blank characters
      Empty or null values
      Abstract or meaningless coded values and incorrectly selected codes
      Unit conversion errors
      Cross-record duplication during data entry (data from separate records erroneously mixed)
      Incorrect, irrelevant, or unknown content
      Excluded or omitted content
      Under-utilized databases (spotty data and partially completed records)
      Excessively large data sets (under-maintained and out-of-date records)
      Incorrect or ad hoc use of fields for purposes unintended in the design
  70. Common Data Cleanup Tasks
      Make sure field names or column headings are accurate and comply with XML element naming conventions (these will become the XML element tags when converting to XML)
      Correct invalid information if possible
      Check and correct spelling, inconsistent spelling, or transpositions
      Correct upper and lower case issues
      Eliminate duplicate records (any other records with all the same values)
      Resolve ambiguous names or name variations
      Resolve cross-record duplication (if workable, set up households)
      Decode or correct coded data values; make values meaningful
      Make appropriate unit conversions
      Remove extraneous characters and leading and trailing blank characters
      Remove unintelligible, meaningless, or irrelevant content
      Resolve incorrect or missing content if possible (including four-digit zip code extensions)
      Make sure all irrelevant or irreparable records are removed
      At the data field and record levels, collect and "trash" all remaining extraneous, irrelevant, or corrupted data that can't be repaired
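Three of these tasks (trimming blanks, normalizing case, and eliminating exact duplicates) lend themselves to a short sketch. The Python below is illustrative only; the records are invented, and `str.title()` is a blunt tool that mishandles names such as "McDonald", so real data would need review.

```python
records = [
    {"name": "  SLINKY ", "city": "anyville"},
    {"name": "Slinky", "city": "Anyville "},
    {"name": "nora", "city": "ANYVILLE"},
]


def clean(record):
    """Collapse runs of white space, trim both ends, and normalize case."""
    return {field: " ".join(value.split()).title()
            for field, value in record.items()}


seen = set()
cleaned = []
for record in map(clean, records):
    key = tuple(sorted(record.items()))   # all field values must match
    if key not in seen:                   # keep only the first occurrence
        seen.add(key)
        cleaned.append(record)

print(cleaned)
```

Cleaning before comparing matters: the first two records only become recognizable as duplicates once case and white space are normalized.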
  71. Data Cleanup Example:
      1. Case is the most obvious issue
      2. Hidden white space is a potential issue
      3. Spelling could be an issue
      4. Duplicate records could be another issue
      5. Incorrect data could be a hidden problem
      6. Then there's the question of multiple people at the same address
      7. Check column names for XML conversion
  72. Supplement, Arrange, & Sort
      Purpose:
      1. Address business logic at the database level by incorporating decision results into the data structure.
         a. This eliminates the need to process logic during variable document generation.
         b. It provides a reviewable and revisable audit trail prior to document generation, if corrections or changes are identified.
         c. It provides a basis for updating the working database to accommodate new requirements as a campaign progresses or a new one is launched.
      2. Align the working database (and ultimately the XML content stream) to satisfy document design and sequential processing requirements.
      3. Take advantage of postal rates where applicable.
  73. Field Supplement Needs
      Potential Types of Field Additions to Data Records:
        Insert processing flags to facilitate database-level decision making:
          Binary flags that are either turned "on" or "off"
          Optional values that are set from a controlled list
          Relational values or thresholds (equality, non-equality, relative size)
        Add a new field that concatenates data from multiple fields into a single field, simplifying content stream processing
        Replicate a field as many times as it is used in the document design
        Summarize quantitative fields (amounts and counts) if a total is needed
        Insert a field to reference the location (URI or storage address) of each digital image used in the document design that is dependent on other data in the record
  74. Example: Given a Document Layout
      Document Sketch (one side of a duplex document)
      [Figure: layout sketch with dimension marks of 9", 6", 3", 3", .5", and .25". Elements shown:]
        <first name>
        Floating text message plus <store location> and <discount offer> on <product>
        <referenced graphic>
        Additional static information or message
        Pre-paid Post Mark
        <postal barcode>
        <title> <given name> <last name>
        <address (one line or two lines)>
        <city>, <state> <zip code>
        <campaign barcode>
      Text in angle brackets "< >" is variable content from an XML content stream
  75. And the Envisioned Document...
      How the Design Sketch will look as a completed document
  76. What Added Data Will Be Needed?
      Record Layout Prior to Adding Fields:
        adoption_number (C and four digits for cats; D and four digits for dogs)
        last_name
        first_name
        prefix (Mr., Mrs. or Ms.)
        pet_name
        adoption_date
        address
        city
        state
        zip
      Required Supplemental Fields:
        1. tracking_code
        2. mailing_code
        3. salutation
        4. addressee
        5. location
        6. pet_image
        7. pet_message
        8. adoption_message
  77. And the Rationale Behind Each One?
      (1) tracking_code: Field will be used to track responses. Content will be created by concatenating adoption number, zip code, and start date of the campaign (example: "D1234 14532 090615"). On the document it will be converted to a standard barcode by applying a font.
      (2) mailing_code: To take advantage of mailing rates, field will be loaded with the addressee's zip code. On the document it will be converted to a mail barcode by applying a font.
      (3) salutation: The salutation will be created by concatenating "Dear " with prefix and last name, followed by "," (example: "Dear Mrs. Smith,"). If prefix is empty, it will be dropped and the salutation will use the first name (example: "Dear George,").
      (4) addressee: Created by combining prefix (if not empty), first name, and last name (example: "Ms. Ann Gorah").
      (5) location: Combination of city, state, and zip into one field to simplify processing (example: "Anyville, OH 45678").
      (6) pet_image: Field will contain the location of the pet's digital photo. The photo file name is based on the adoption number and the pet's name. Construct the content to anticipate the XML export; combine "href = file://photos/" with adoption number, pet name, and filetype extension (example: "href = file://photos/C3456Blossom.jpg").
      (7) pet_message: If the first character of the adoption number is a "D", this field will have a dog-related message. If the first character of the adoption number is a "C", this field will have a cat-related message.
      (8) adoption_message: If the adoption took place less than one year ago, the message will be "Thank you for recently adopting " plus the pet name. If the adoption took place more than a year ago, the message will be "Would " plus pet name, plus " like a playmate?" (example: "Would Scruffy like a playmate?").
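The eight rules above can be expressed as one function that derives the supplemental fields from an original record. This Python sketch is an illustration, not the FileMaker scripting the lecture uses. Several details are assumptions: the date in the tracking code is read as YYMMDD, the photo extension is always ".jpg", "less than one year" is taken as fewer than 365 days, and the pet message texts are placeholders.

```python
from datetime import date


def supplement(rec, campaign_start, today):
    """Return a copy of rec with the eight supplemental fields added."""
    prefix = rec["prefix"].strip()
    pet = rec["pet_name"]
    out = dict(rec)
    out["tracking_code"] = (
        f'{rec["adoption_number"]} {rec["zip"]} {campaign_start:%y%m%d}'
    )
    out["mailing_code"] = rec["zip"]
    if prefix:
        out["salutation"] = f'Dear {prefix} {rec["last_name"]},'
    else:
        out["salutation"] = f'Dear {rec["first_name"]},'
    # Join only the non-empty name parts so an empty prefix leaves no gap.
    parts = (prefix, rec["first_name"], rec["last_name"])
    out["addressee"] = " ".join(p for p in parts if p)
    out["location"] = f'{rec["city"]}, {rec["state"]} {rec["zip"]}'
    out["pet_image"] = f'href = file://photos/{rec["adoption_number"]}{pet}.jpg'
    # Placeholder texts; the lecture does not give the actual messages.
    messages = {"D": "dog-related message", "C": "cat-related message"}
    out["pet_message"] = messages[rec["adoption_number"][0]]
    if (today - rec["adoption_date"]).days < 365:
        out["adoption_message"] = f"Thank you for recently adopting {pet}"
    else:
        out["adoption_message"] = f"Would {pet} like a playmate?"
    return out


record = {
    "adoption_number": "D1234", "last_name": "Smith", "first_name": "Ann",
    "prefix": "Mrs.", "pet_name": "Scruffy", "adoption_date": date(2008, 1, 1),
    "address": "1 Main St", "city": "Anyville", "state": "OH", "zip": "14532",
}
result = supplement(record, campaign_start=date(2009, 6, 15), today=date(2009, 6, 15))
print(result["tracking_code"], "|", result["salutation"], "|", result["adoption_message"])
```

Resolving these decisions once, in the data, is what lets the later document-generation step remain a plain fill-in of XML elements.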
  78. Fields Added (✔) to Match a Design Layout
      Processing Requirement: Each XML element can only be used ONCE when placed in a document layout, and elements must be ordered in design PLACEMENT SEQUENCE from top to bottom, left to right.
      1. “offer”, “image”, & “code” elements didn’t exist in the original data
      2. “name” & “location” are replicated since each is used twice in the layout
  79. Editing or Loading Field Content
      Use desktop database scripting capabilities:
          To populate (fill with content) any supplemental fields (columns)
          To edit (modify) content in originally existing fields
      Most functions tailored to specific needs can be variations of the “basic four”.
      Four basic scripted functions (shown on the following pages):
          1. Insert new text or values in every row of a particular column
          2. Selectively insert new text or values in one column of a row, based on the value of another column in the same row
          3. Copy a column’s content for each row and paste it into another column in the same row
          4. Assemble values from multiple columns of a row into a single string and paste it into another column in the same row
      (The following functions were defined using FileMaker Pro)
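      For readers without FileMaker Pro, the “basic four” can be sketched against a table held as a list of dictionaries, one dictionary per row. This is a Python illustration only; the function names and the row representation are hypothetical and do not reproduce the FileMaker scripts shown on the slides.

```python
def fill_column(rows, column, value):
    """1. Insert the same value in every row of one column."""
    for row in rows:
        row[column] = value


def fill_column_where(rows, column, value, condition):
    """2. Insert a value only in rows where condition(row) is true."""
    for row in rows:
        if condition(row):
            row[column] = value


def copy_column(rows, source, target):
    """3. Copy one column's content into another column of the same row."""
    for row in rows:
        row[target] = row[source]


def combine_columns(rows, sources, target, separator=" "):
    """4. Join several columns into one string and store it in another column."""
    for row in rows:
        row[target] = separator.join(str(row[name]) for name in sources)


# Example: two rows loaded from the adoption data
rows = [
    {"adoption_number": "D1234", "city": "Anyville", "state": "OH", "zip": "45678"},
    {"adoption_number": "C3456", "city": "Anyville", "state": "OH", "zip": "45678"},
]
fill_column(rows, "pet_message", "")
fill_column_where(rows, "pet_message", "(dog-related message)",
                  lambda row: row["adoption_number"].startswith("D"))
copy_column(rows, "zip", "mailing_code")
combine_columns(rows, ["state", "zip"], "state_zip")
```

      Each function changes the rows in place, much as a database script updates the records it loops over.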
  80. Scripting: select “Scripts > ScriptMaker”
      An existing script may be used “as is”, or...
          Create a new script
          Edit existing script
      Script Edit Panel (point & click)
  81. Setup Functions (continued)
      Script for Inserting Column Values
      Script for Selected Column Value Insertion
  82. Setup Functions (continued)
      Script to Copy and Paste Values
      Script to Combine Multiple Values and Then Insert the Result
  83. Prep Results & Postal Sort
      Last step before XML Export: click to perform sort
      The data will be sorted into zip code order to get better postal rates
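      Outside the database, the same postal sort is a one-line operation. A minimal Python sketch, assuming each row is a dictionary and that zip codes are stored as text (so that leading zeros survive); the function name is hypothetical:

```python
def postal_sort(rows):
    """Return the rows in zip code order; rows with equal zips keep their original order."""
    return sorted(rows, key=lambda row: row["zip"])


rows = [
    {"last_name": "Smith", "zip": "45678"},
    {"last_name": "Gorah", "zip": "14532"},
    {"last_name": "Jones", "zip": "01234"},
]
sorted_rows = postal_sort(rows)
```

      Keeping zip codes as strings matters: converted to numbers, “01234” would lose its leading zero.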
  84. Database Conversion to XML
  85. Why Not Just Export Directly From a Spreadsheet?
      (Slide shows the spreadsheet and a portion of the exported XML)
      The Issues:
          1. 128 rows before data starts
          2. Column names are not tag names
          3. Individual data elements are not tagged with usable XML
          4. Superfluous XML throughout
  86. Possible Conversion Paths
      Path 1: Spreadsheet (clean, supplement, & sort) -> Tab-delimited File -> Perl Conversion (detailed program setup) -> “Clean” XML
      Path 2: Tab-delimited File (more complex prep) -> Perl Conversion (detailed program setup) -> “Clean” XML
      Path 3: Database (all preparations) -> “Raw” XML -> Cleanup -> “Clean” XML
          Cleanup options: Text Edit, XSLT, or Perl
          Text editing requires manual effort, and XSLT may be problematic
          Perl cleanup: simply set file names
  87. XML Export Setup
      Select “File > Export Records” from the menu bar
      Select the columns to be exported
      Click or drag to move a selection to the export list
      Set the order of the selected columns to meet design requirements
  88. Obvious Export Format: FMPXMLRESULT
      “Obvious” is not necessarily “Optimal”!
      FMPXMLRESULT output is similar to that of an export directly from a spreadsheet
  89. A Better Selection: “FMPDSORESULT”
      Column elements are embedded between Row elements
      Tag names use the column names in the database
      This result will be much easier to prepare for design integration!
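      The shape described above, with one element per row and child elements named after the columns, can be imitated with a few lines of Python. This sketch illustrates the structure only; it does not reproduce FileMaker’s actual FMPDSORESULT output, and the “dataset” and “row” tag names and the function name are hypothetical.

```python
import xml.etree.ElementTree as ET


def rows_to_xml(rows, columns, root_tag="dataset", row_tag="row"):
    """Build XML in which each row element holds one child element per column."""
    root = ET.Element(root_tag)
    for record in rows:
        row = ET.SubElement(root, row_tag)
        # Columns are written in the order given, which lets the caller
        # match the placement sequence required by the design layout.
        for column in columns:
            ET.SubElement(row, column).text = str(record[column])
    return ET.tostring(root, encoding="unicode")


xml_text = rows_to_xml(
    [{"salutation": "Dear George,", "location": "Anyville, OH 45678"}],
    ["salutation", "location"],
)
```

      Because the element names come straight from the column names, the columns must already be valid XML names (no spaces, not starting with a digit).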
  90. Final XML Touchup
      Steps:
          1. Insert line breaks (CR, LF, or CRLF) between adjoining XML elements: “><”
          2. Delete unneeded XML non-repeating content
          3. Delete unneeded XML repeating content
          4. Delete extra & leading spaces
          5. Create attributes for image references
      Optional Tools:
          Text Editor: Delete, Search & Replace
          XSLT
          Programmatically: Perl, Ruby, or another scripting language
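      The five touchup steps can be sketched as plain text processing, in the same spirit as treating the XML stream as just another text file. The Python below is an illustration under several assumptions: the function name is hypothetical, unneeded content is assumed to sit on lines of its own after step 1, and the image field is assumed to hold text of the form “href = file://...” with no spaces in the file name.

```python
import re


def touch_up(raw_xml, drop_tags=()):
    """Apply the touchup steps to an XML stream held as one string."""
    # Step 1: break between adjoining elements so each lands on its own line.
    text = raw_xml.replace("><", ">\n<")
    kept = []
    for line in text.splitlines():
        # Step 4: remove leading (and trailing) spaces.
        line = line.strip()
        if not line:
            continue
        # Steps 2 and 3: drop lines that open or close an unneeded element.
        if any(re.match(rf"</?{re.escape(tag)}[\s>/]", line) for tag in drop_tags):
            continue
        # Step 5: turn "href = ..." element content into an href attribute.
        line = re.sub(r"<(\w+)>\s*href\s*=\s*(\S+?)\s*</\1>", r'<\1 href="\2"/>', line)
        kept.append(line)
    return "\n".join(kept) + "\n"


raw = ("<dataset><row><name>Ann</name><junk>x</junk>"
       "<pet_image>href = file://photos/C3456Blossom.jpg</pet_image></row></dataset>")
clean = touch_up(raw, drop_tags=("junk",))
```

      The tag test requires a space, “>”, or “/” right after the name, so dropping “one” does not also drop an element named “oneMore”.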
  91. Before XML Touchup
      Touchup Key (color-coded on the slide):
          XML code to delete
          XML code to modify
  92. After XML Touchup
      All unneeded XML and leading spaces removed
      Editing of selected elements completed
      XML is ready for Adobe InDesign import
      The cleaner the XML, the more trouble-free the design integration processing will be
  93. DTD Validation Before Integration
      Assure Quality:
          ✓ Is it “Well Formed”?
          ✓ Is it “Valid”?
      Reduce Work: Detect problems prior to design integration
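      The first of the two checks, well-formedness, can be tested with a standard XML parser. The Python sketch below covers only that half: the parser used here does not validate against a DTD, so the “Is it valid?” check still needs a validating parser or the validation built into an XML editor. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = is_well_formed("<row><name>Ann</name></row>")
bad = is_well_formed("<row><name>Ann</row>")  # <name> is never closed
```

      Running a check like this before design integration catches a mismatched or unclosed tag while it is still cheap to fix.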
  94. VI. Direct XML Data Stream Maintenance
      A set of XML Data Stream Utilities written in Perl provides most maintenance functionality
  95. Data Maintenance Alternatives
      While there may be other approaches, most will focus on one of three data conditions:
          1. Source Data Files prior to any preparation
          2. Data Files after Spreadsheet or Database import
          3. Data after conversion to an XML Data Stream
  96. Data Stream Utilities:
          Insert
          Delete
          Change
          Reorder
          Combine
          Merge
          Split
  97. Overview
      Extend the usefulness of generated XML content streams beyond their original setup or intent
      Make repairs or adjustments to content streams to more closely match document designs
      Contain logic for manipulating both XML element content and XML structure
      Capability spans the entire XML content stream
      Modifications can be applied selectively to:
          Specific XML elements
          Specific XML elements with particular content
          Elements or content based on some previously read XML element
      All operations are performed as if the XML stream were simply another text file. No program uses native XML manipulation capabilities such as DOM or XSLT
      In-depth knowledge of programming or of the Perl programming language is not required
      The logic is already built and only requires some setup information
      Required setup changes are highlighted in yellow
      Comments about program code are set in bold type
  98. About Their Use
      These programs have been tested with sample data: they do what they are supposed to do.
      Be warned that attention to detail is extremely important
          Proof your work after copying or making changes
      When setting up a program:
          Back up the original program before changing settings or logic
          Back up your data files before program processing
      There are not a lot of settings to worry about
          You may see the same setup requirements in multiple programs
          Most settings are similar from program to program
          Required settings are highlighted in yellow, and functionality is commented in bold
          Default settings illustrate patterns and program expectations
      Text string settings are very nuanced:
          The first character in a string is at position zero
          A ten-character string is numbered from position zero through position nine
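      The zero-based positions are the setting most easily miscounted, so a small illustration may help. The utilities compare the first characters of each line with a tag name, which means the test length must equal the tag’s length exactly. The Python sketch below shows the same counting; the helper name and the sample line are hypothetical.

```python
def starts_with_tag(line, tag):
    """Compare the first len(tag) characters of the line with the tag."""
    return line[0:len(tag)] == tag


line = "<location>North Region</location>"

# "<location>" is ten characters long and occupies positions 0 through 9,
# so the first character of the element content sits at position 10.
tag_length = len("<location>")
matches = starts_with_tag(line, "<location>")
first_content_character = line[10:11]
```

      Letting the program compute the tag length, rather than counting characters by hand, removes one common source of setup errors.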
  99. Insert
      This program inserts one or more XML elements at a specified element location within each <row> element of the XML content stream.

          #!/usr/bin/perl
          use warnings;
          use strict "subs";

          # set input and output
          $inputfl = "InputName.xml";
          $outputfl = "OutputName.xml";

          &insertProcess;

          sub openFiles {
              $infl = $_[0];
              $outfl = $_[1];
              open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
              open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
          }

          sub closeFiles {
              close(INFILE);
              close(OUTFILE);
          }

          sub insertProcess {
              &openFiles($inputfl, $outputfl);
              $addFlag = 0;
              # Number of new elements
              $new = 3;
              # List of new elements
              @newElements = ("<one>Insert First Element</one>",
                              "<two>Insert Second Element</two>",
                              "<three>Insert third element</three>");
  100. Insert continued

              # File processing main loop
              while ($import = <INFILE>) {
                  chomp($import);
                  # The previous line was the target tag: write the new elements first
                  if ($addFlag) {
                      for ($i = 0; $i < $new; $i++) {
                          $export = $newElements[$i] . "\n";
                          print OUTFILE $export;
                      }
                      $addFlag = 0;
                  }
                  # Always write the current line so that no input is lost
                  $export = ($import . "\n");
                  print OUTFILE $export;
                  # Tag location after which new elements are inserted
                  # Test string length and name
                  if (substr($import,0,7) eq "<offer>") {$addFlag = 1;}
              }
              &closeFiles;
          }
  101. Delete
      This program deletes one or more XML elements from each <row> element of the XML content stream.

          #!/usr/bin/perl
          use warnings;
          use strict "subs";

          # set input and output
          $inputfl = "InputName.xml";
          $outputfl = "OutputName.xml";

          # Call the main routine only after the file names have been set
          &deleteProcessing;

          sub openFiles {
              $infl = $_[0];
              $outfl = $_[1];
              open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
              open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
          }

          sub closeFiles {
              close(INFILE);
              close(OUTFILE);
          }

          sub deleteProcessing {
              &openFiles($inputfl, $outputfl);
              while ($import = <INFILE>) {
                  chomp($import);
                  # Tag length and name to be deleted
                  if (substr($import,0,4) eq "<one") {next;}
                  if (substr($import,0,4) eq "<two") {next;}
                  if (substr($import,0,6) eq "<three") {next;}
                  $export = $import . "\n";
                  print OUTFILE $export;
              }
              &closeFiles;
          }
  102. Change
      This program completely changes the content of one or more selected elements from each <row> element of the XML content stream.

          #!/usr/bin/perl
          use warnings;
          use strict "subs";

          # set input and output
          $inputfl = "InputName.xml";
          $outputfl = "OutputName.xml";

          &changeProcessing;

          sub openFiles {
              $infl = $_[0];
              $outfl = $_[1];
              open(INFILE, "<$infl") || die "Can't find or open $infl: $!";
              open(OUTFILE, ">$outfl") || die "Can't open $outfl: $!";
          }

          sub closeFiles {
              close(INFILE);
              close(OUTFILE);
          }

          sub changeProcessing {
              &openFiles($inputfl, $outputfl);
              # Decision value; stays empty until a <location> element has been read
              $loc = "";
              while ($import = <INFILE>) {
                  chomp($import);
                  # Set change tag test length and name
                  if (substr($import,0,10) eq "<location>") {
                      # Set decision value position and length
                      $loc = substr($import,10,1);
                  }
  103. Change continued

                  # Default: pass the line through unchanged
                  $export = $import . "\n";
                  # Set test length, tag name, and replacement values
                  if (substr($import,0,7) eq "<offer>") {
                      if ($loc eq "N") {$export = ("<offer>5% Discount</offer>" . "\n");}
                      if ($loc eq "S") {$export = ("<offer>10% Discount</offer>" . "\n");}
                      if ($loc eq "E") {$export = ("<offer>15% Discount</offer>" . "\n");}
                      if ($loc eq "W") {$export = ("<offer>20% Discount</offer>" . "\n");}
                  }
                  # The test for the <product> tag is missing from the slide text;
                  # it is reconstructed here on the pattern of the <offer> test
                  elsif (substr($import,0,9) eq "<product>") {
                      if ($loc eq "N") {$export = ("<product>Paving Stone</product>" . "\n");}
                      if ($loc eq "S") {$export = ("<product>Fertilizer</product>" . "\n");}
                      if ($loc eq "E") {$export = ("<product>Garden Tools</product>" . "\n");}
                      if ($loc eq "W") {$export = ("<product>Seed Packets</product>" . "\n");}
                  }
                  print OUTFILE $export;
              }
              &closeFiles;
          }
