Revealing digital documents - concealed structures in data
Jakob Voß
http://arxiv.org/abs/1105.5832
http://aboutdata.org
International Conference on Theory
and Practice in Digital Libraries (TPDL)
Doctoral Consortium, Berlin 2011-09-25
question
how are (digital) documents
structured and described?
Jakob Voß: Revealing digital documents. Concealed structures in data. TPDL 2011, Sep. 25th http://aboutdata.org
what is a document?
“[...] any physical or symbolic sign, preserved
or recorded, intended to represent, to
reconstruct, or to demonstrate a physical or
conceptual phenomenon” – Suzanne Briet
“[...] consists of anything that someone wishes
to store. A document is something designated
by a person to be a document [...]” – Ted Nelson
scope
digital documents
somehow recorded (stable),
ultimately as a sequence of bits
thesis
but there are common patterns
on all levels of description,
independent of
particular technologies
examples of particular technologies
XML:
● Unicode
● XML Infoset
● XML Schema
● XPath
relational databases:
● Relational Model
● SQL
● Entity-Relationship Diagrams
families of related standards
method
not statistical
this would limit my research to
one level and technology of
description
method
phenomenological
data description in all of its forms
as it appears in our experience
phenomenological method
data description analyzed
as phenomena:
1. critical intuiting (experience)
2. analyzing structures, free of known categories
3. describing the essence
[Images: Hegel, Husserl, Merleau-Ponty*]
* Image CC-BY Pierre-Alain Gouanvic
results
1) Categorization
of data structuring methods
2) Collection
of data structuring paradigms
3) Pattern language
of data patterns
result 1: categorization of methods
● encodings express data
(UTF-8 Unicode, IEEE floating point, Base64…)
● file and database systems store data
● identifiers and query languages refer to data
● data structuring and markup languages
structure data
● schema languages constrain and validate data
● conceptual models describe data
Concrete methods appear as combinations of categories!
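A minimal sketch of the first category, assuming Python's standard library as the toolset (the variable names and the sample values are illustrative, not from the slides). It expresses a character string and a number as bytes using the three encodings named above, and reads them back:

```python
import base64
import struct

# Illustrative sample values (assumptions, not taken from the talk).
text = "Voß"
number = 0.5

# UTF-8 expresses Unicode characters as bytes.
utf8_bytes = text.encode("utf-8")

# IEEE 754 binary64 expresses a floating point number as 8 bytes
# (">d" = big-endian double).
float_bytes = struct.pack(">d", number)

# Base64 expresses arbitrary bytes as ASCII characters.
b64_bytes = base64.b64encode(utf8_bytes)

# Each encoding can be reversed to recover the original value.
decoded_text = base64.b64decode(b64_bytes).decode("utf-8")
decoded_number = struct.unpack(">d", float_bytes)[0]
```

Base64 applied on top of UTF-8 is itself a small instance of a concrete method combining several steps.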
result 2: paradigms
● Document- or Object-oriented approach
  ● Document-oriented (e.g. ordered tree with
    tagged character strings: XML, RELAX NG…)
    ⇒ descriptive data description
  ● Object-oriented (objects with properties and
    defined value spaces: XML Schema, UML…)
    ⇒ prescriptive data description
result 2: paradigms
● Entities and connections
[Diagram: three ways to connect the entities "Jakob" and "1979": a direct connection, a connection labeled "born", and a connection via an intermediate entity "Birth"]
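The three variants in the diagram can be sketched as plain data, assuming a hypothetical `Connection` record (the type and the helper `entities` are illustrative, not part of the talk):

```python
from typing import List, NamedTuple, Optional, Set


class Connection(NamedTuple):
    """A hypothetical link between two entities, optionally labeled."""
    source: str
    target: str
    label: Optional[str] = None


# 1) Direct, unlabeled connection.
direct = [Connection("Jakob", "1979")]

# 2) Connection carrying a label.
labeled = [Connection("Jakob", "1979", label="born")]

# 3) The connection itself becomes an entity ("Birth"),
#    linked to both original entities.
via_entity = [Connection("Jakob", "Birth"), Connection("Birth", "1979")]


def entities(connections: List[Connection]) -> Set[str]:
    """Collect every entity that appears at either end of a connection."""
    found: Set[str] = set()
    for c in connections:
        found.add(c.source)
        found.add(c.target)
    return found
```

All three describe the same fact; they differ in whether the relation is implicit, a label, or an entity in its own right.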
result 2: paradigms
● Layers of abstraction
● Standards and rules
● Collections and types
● Granularity
result 3: patterns
● patterns as a systematic tool for describing good design
practice, introduced by Christopher Alexander:
“Each pattern describes a problem which occurs over and
over again in our environment, and then describes the
core of the solution to that problem […]”
● Adopted as design patterns in software engineering
● Collected in a pattern language with meaningful
connections between patterns (network of patterns).
result 3: patterns
[Diagram: network of connected patterns: collection, separator, known size, sequence, position, ordered set, array]
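As one reading of how "separator" and "known size" relate to "sequence", the sketch below splits the same three elements out of a byte string in two ways. The function names, the comma separator, and the one-byte length prefix are assumptions made for illustration:

```python
from typing import List


def split_by_separator(data: bytes, sep: bytes = b",") -> List[bytes]:
    """Sequence via separator: elements end where the separator occurs."""
    return data.split(sep)


def split_by_known_size(data: bytes) -> List[bytes]:
    """Sequence via known size: each element is preceded by one byte
    stating its length (an illustrative layout, not a standard)."""
    items: List[bytes] = []
    i = 0
    while i < len(data):
        size = data[i]
        items.append(data[i + 1:i + 1 + size])
        i += 1 + size
    return items


with_separator = split_by_separator(b"a,bc,d")
with_size = split_by_known_size(b"\x01a\x02bc\x01d")
```

Both calls yield the same ordered elements; the separator variant cannot contain the separator inside an element, while the size variant can hold any bytes.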
applications
● data archeology
● In 200 years someone finds snapshots and
archives of Wikipedia in different forms
(SQL, XML, Wikitext, DBpedia, HTML…)
● What are the significant parts?
How do the parts relate to each other?
… another document
to give a simple example…
… another document
sequence with delimiter
grouping of sequences with delimiter
encoding (morse code)
D A T A P A T T E R N S
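The three patterns named on this slide can be sketched together in a few lines. The textual signal below is an assumption about how the pictured document could be written down: a space delimits letters, " / " delimits words, and the table covers only the International Morse letters needed here:

```python
# Subset of International Morse code, limited to the letters used below.
MORSE = {
    "-..": "D", ".-": "A", "-": "T", ".--.": "P",
    ".": "E", ".-.": "R", "-.": "N", "...": "S",
}

LETTER_DELIMITER = " "   # sequence with delimiter
WORD_DELIMITER = " / "   # grouping of sequences with delimiter


def decode(signal: str) -> str:
    """Split the signal into words, the words into letters,
    then map each Morse symbol to its character (encoding)."""
    words = []
    for group in signal.split(WORD_DELIMITER):
        letters = [MORSE[symbol] for symbol in group.split(LETTER_DELIMITER)]
        words.append("".join(letters))
    return " ".join(words)


message = decode("-.. .- - .- / .--. .- - - . .-. -. ...")
```

Without knowing the two delimiters and the code table, the same signal is only an undifferentiated run of dots and dashes, which is the situation a data archeologist starts from.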