Jakob Voß
Revealing digital documents
 Concealed structures in data
     http://arxiv.org/abs/1105.5832
          http://aboutdata.org


          International Conference on Theory
          and Practice in Digital Libraries (TPDL)
          Doctoral Consortium, Berlin 2011-09-25
question




           how are (digital) documents
            structured and described?



Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th     http://aboutdata.org
what is a document?

          “[...] any physical or symbolic sign, preserved
                or recorded, intended to represent, to
            reconstruct, or to demonstrate a physical or
             conceptual phenomenon” – Suzanne Briet

       “[...] consists of anything that someone wishes
       to store. A document is something designated
      by a person to be a document [...]” – Ted Nelson



scope




                digital documents
            somehow recorded (stable),
           eventually as a sequence of bits



               there is not one
           single document format

[background: wall of format and standard names, cut off at the slide edges]
CR2, AAF, AAT, ADL, AES Core Audio, AES Process History, AGLS, Alleg
SCII, ASN.1, Atom, BIBO, BibTeX, BISAC, BPEL, BPMN, BSON, CanCor
 CO, CDR, CDWA, CDWA Lite, CIDOC/CRM, CQL, CSDGM, CSV, DACS
ata Committee Content Standard, DC, DCAM, DDC, DDI, DDL, DFDL, DI
 G35, DjVU, DOM, DTD, Dublin Core, DwC, EAC, EAC-CPF, EAD, ebXM
    ECN, Ediakt, EDIFAKT, eduPerson, EML, ERM, Etch, EXIF, Federal
eographic, FOAF, FRAD, FRBR, FRSAD, FRSAR, GEM, GILS, GKD, GM
ssian, HTML, HTTP, ID3, IDL, IEEE/LOM, indecs, inetOrgPerson, INI, IPT
I, ISAAR(CPF), ISAD(G), ISBD, ISBN, ISO 19115, ISO 19119, JSON, KM
LCC, LCSH, LDAP, Linked Data, LMER, MAB2, MADS, MARC, MARC21
 RC Relator Codes, MARCXML, MathML, MEI, MESH, METS, METS Rig
MFC, MGraph, MIX, MO, MODS, MOTS, MPEG-21 , MPEG-7, MSchema
seumDat, MusicXML, MXF, NewsML, NFC, NFD, NFKC, NFKD, NIAM, O
OAI-ORE, OAI-PMH, OAIS, ODRL, ONIX, Ontology for Media, OODBMS
OpenDocument, OpenSearch, OpenURL, ORM, OWL, PB Core, PDF, PI
ca+, Pica3, PND, PREMIS, PRISM, Proto, QDC, RAD, RAK, RDA, RDBM
DF, RDFS, RDF/XML, Relax NG, RELAX NG, Resource, RIS, RSS, RSW
 Schematron, SCORM, SDXF, Seel, S-EXP, SGML, SIOC, SKOS, SMIL,
PECTRUM, SQL, SRU/SRW, SWAP, SWB, TEI, TEX, TextMD, TGM I, TG
 TGN, Thrift, Topic Maps, UCS, ULAN, UML, unAPI, UNIMARC, URI, UTF
 ard, Vorbis Comment, VRA, VSO Data Model, XDR, XMetaDiss, XML, XM
thesis



       but there are common patterns
          on all levels of description,
               independent of
            particular technologies


examples of particular technologies
     XML
      ●   Unicode
      ●   XML Infoset
      ●   XML Schema
      ●   XPath

     relational databases
      ●   Relational Model
      ●   SQL
      ●   Entity-Relationship Diagrams



                      families of related standards


method


                   not statistical
           this would limit my research to
             one level and technology of
                     description


method




              phenomenological
      data description in all of its forms
       as it appears in our experience



phenomenological method

      data description analyzed as phenomena:
       1. critical intuiting (experience)
       2. analyzing structures, free of known categories
       3. describing the essence

      [portraits: Hegel, Husserl, Merleau-Ponty*]

  * Image CC-BY Pierre-Alain Gouanvic

results
      1) Categorization
         of data structuring methods
      2) Collection
         of data structuring paradigms
      3) Pattern language
         of data patterns




result 1: categorization of methods
      ●   encodings express data
          (UTF-8 Unicode, IEEE floating point, Base64…)
      ●   file and database systems store data
      ●   identifiers and query languages refer to data
      ●   data structuring and markup languages
          structure data
      ●   schema languages constrain and validate data
      ●   conceptual models describe data

    Concrete methods appear as combinations of categories!
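
As a rough illustration of the first category, and of how methods combine, the following Python sketch expresses values in the three encodings named above. It is not taken from the talk; the sample values and variable names are illustrative assumptions.

```python
import base64
import struct

text = "Voß"

# UTF-8 expresses Unicode characters as bytes; "ß" takes two bytes.
utf8_bytes = text.encode("utf-8")

# IEEE 754 double precision expresses a number as 8 bytes (big-endian here).
float_bytes = struct.pack(">d", 1979.0)

# Base64 expresses arbitrary bytes with 64 ASCII characters.
# Applied to the UTF-8 bytes, it shows one encoding layered on another.
b64_text = base64.b64encode(utf8_bytes).decode("ascii")

print(utf8_bytes)
print(float_bytes.hex())
print(b64_text)
```

Decoding reverses each layer in turn: Base64 back to bytes, then UTF-8 back to characters.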

result 2: paradigms
      ●   Document- or Object-oriented approach
            ●   Document-oriented (e.g. ordered tree with
                tagged character strings: XML, Relax NG…)
                ⇒ descriptive data description
            ●   Object-oriented (objects with properties and
                defined value spaces: XML Schema, UML…)
                ⇒ prescriptive data description
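
The contrast between the two approaches can be sketched in Python. This is only an illustration under assumptions: the sample XML, the `Person` class, and its validation rule are hypothetical and do not come from the talk.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Document-oriented: read whatever ordered tree of tagged strings is there
# and describe it after the fact.
doc = ET.fromstring("<person><name>Jakob</name><born>1979</born></person>")
described = [(child.tag, child.text) for child in doc]


# Object-oriented: declare properties and their value spaces up front;
# data that does not fit is rejected.
@dataclass
class Person:
    name: str
    born: int

    def __post_init__(self):
        if not isinstance(self.born, int) or self.born < 0:
            raise ValueError("born must be a non-negative integer")


person = Person(name=doc.findtext("name"), born=int(doc.findtext("born")))

print(described)
print(person)
```

The tree accepts any element order or content, while the class fixes in advance what counts as a valid person.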




result 2: paradigms
      ●   Entities and connections

          [diagram with three variants]
            1. Jakob    1979
            2. Jakob    born    1979   ("born" labels the connection)
            3. Jakob    Birth   1979
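
One possible reading of the three variants, sketched in Python. The interpretation (plain connection, labeled connection, connection turned into an entity) and all names such as `reify` and the role keys are assumptions made for illustration.

```python
# Variant 1: an unlabeled connection between two entities.
plain = ("Jakob", "1979")

# Variant 2: the connection carries a label.
labeled = ("Jakob", "born", "1979")

# Variant 3: the connection becomes an entity of its own ("Birth"),
# connected to both ends; the role names are hypothetical.
birth = {"type": "Birth", "person": "Jakob", "year": "1979"}


def reify(triple, entity_type, subject_role, object_role):
    """Turn a labeled connection into an entity with two connections."""
    subject, _label, obj = triple
    return {"type": entity_type, subject_role: subject, object_role: obj}


print(reify(labeled, "Birth", "person", "year") == birth)
```

Once the connection is an entity, further properties (a place of birth, for instance) can be attached to it without changing the other two entities.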


result 2: paradigms
      ●   Layers of abstraction
      ●   Standards and rules
      ●   Collections and types
      ●   Granularity




result 3: patterns
      ●   patterns as a systematic tool for describing good design
          practice, introduced by Christopher Alexander:
          “Each pattern describes a problem which occurs over and
            over again in our environment, and then describes the
                   core of the solution to that problem […]”
      ●   Adopted as design patterns in software engineering
      ●   Collected in a pattern language with meaningful
          connections between patterns (network of patterns).




result 3: patterns
      [diagram: network of connected patterns with the nodes
       collection, separator, known size, sequence, position,
       ordered set, array]
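
Two of these patterns, separator and known size, can be compared as alternative ways to express a sequence. The sketch below is an assumption-laden illustration in Python; the function names and the `length:value` layout are hypothetical, not definitions from the pattern language.

```python
def encode_with_separator(items, sep=","):
    """Separator pattern: elements must not contain the separator."""
    if any(sep in item for item in items):
        raise ValueError("element contains the separator")
    return sep.join(items)


def decode_with_separator(data, sep=","):
    return data.split(sep) if data else []


def encode_with_known_size(items):
    """Known-size pattern: each element is prefixed with its length."""
    return "".join(f"{len(item)}:{item}" for item in items)


def decode_with_known_size(data):
    items, pos = [], 0
    while pos < len(data):
        colon = data.index(":", pos)       # end of the length prefix
        size = int(data[pos:colon])        # how many characters follow
        start = colon + 1
        items.append(data[start:start + size])
        pos = start + size
    return items


print(decode_with_separator(encode_with_separator(["a", "b", "c"])))
print(decode_with_known_size(encode_with_known_size(["a,b", "c"])))
```

The separator variant is simpler to read but restricts element content; the known-size variant accepts any content at the cost of a length prefix.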



applications
      ●   data archeology
            ●   In 200 years someone finds snapshots and
                archives of Wikipedia in different forms
                (SQL, XML, Wikitext, DBpedia, HTML…)
            ●   What are the significant parts?
                How do the parts relate to each other?




… another document




                               to give a simple example…




… another document

                                   sequence with delimiter



                     grouping of sequences with delimiter



                                   encoding (morse code)
                                   D A T A   P A T T E R N S