Some Notes on Digital Data – with a suggestion
Tom Moritz / Internet Archive February, 2009
A UNIVERSE OF DATA???
What is “data”? The US NSF DataNet solicitation defines “data” as: “Any
information that can be stored in digital form and accessed electronically, including,
but not limited to, numeric data, text, publications, sensor streams, video, audio,
algorithms, software, models and simulations, images, etc.” i This definition is
technically acceptable but says little about the epistemic role of data in science. In fact, it is useful to think of
“data” in two distinct ways. “Data” refers (as in the DataNet definition) to the
computer‐readable code that is stored in, accessed from or flows between
computers. “Data” also means precise, well‐defined representations of observations,
descriptions or measurements of a referent (object or event) recorded in some
standard, well‐specified way.
The more inclusive DataNet definition has the virtue of forcing us to consider a
unified, holistic approach to knowledge and to the formal resources that inform and
express it; we are forced to confront the Web as it exists today.
HOW MUCH DATA?
In a now famous quip, Lewis Carroll noted that the perfect scale for maps was 1:1
but that farmers tend to become disgruntled when such maps are unrolled over
their fields. The notion that we could theoretically record “everything” in real time
‐‐ “1:1 capture” ‐‐ leaves us to ponder the limits of “data” collection, management
and longevity – full‐life‐cycle ii curation and stewardship. With the evolution of
satellite coverages, nanotechnology, robotics and embedded network sensors, it is
possible, for example, to systematically record presence/absence data for birds at a
nesting site – at every nesting site in a given area ‐‐ 24‐7, forever [SEE for example:
http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14 ] iii or for that
matter to record every human heartbeat. iv And to archive these data in perpetuity?
(The casual assumption that we might comprehensively save all data is belied by a
recent forecast reporting that in 2007 the total data produced on Earth exceeded, for
the first time, the available storage. v)
It is also the case that technologies, standards and methodologies, along with institutions,
organizations and professions, have evolved and become established to manage and
preserve logical domains of knowledge as well as selected technical formats of data.
The point respecting logical segments is relatively clear – natural history museums
and herbaria hold preserved (i.e. dead) organisms as specimens; zoos and gardens
and aquaria hold living organisms ex situ; protected areas hold living organisms in
situ; cryogenics facilities hold tissue samples – similarly, their libraries hold logically
corresponding published or archival works. Respecting technical formats: libraries
hold bound paper/print materials; archives hold unbound paper/ manuscript or
unbound paper/typescript materials; media repositories hold non‐print media;
computer centers hold data sets and complex models (hypothetical assemblages of
data that generate new data); art museums hold paintings and sculptures; a dance
company performs dances; and an indigenous group stewards its “old knowledge”.
Similarly, librarians and archivists, curators and zookeepers, rangers and
information technologists, dancers and shamans have all received vocational charge
for siloed segments of our “knowledge base”. But who is responsible for the whole?
Before the advent of digital technology this latter question would have been
metaphysically interesting but pointless ‐‐ no longer, it seems. Scanning our society
and culture, it seems libraries and librarians are the most eligible candidates for the
And if the received “compartments” (organizational, professional and logical structures)
are no longer dictated by operational constraints (e.g. the ability to curate a
dragonfly or to select and conserve a book), how can we most effectively organize
the management of knowledge as data? At the national level, there are prime
examples of institutions that admirably serve logical domains of our knowledge
base; the National Library of Medicine is one. vii The Library of Congress alone has
the stature and scope of interest to command our trust and expectations.
BUT DATA FOR WHAT???
Harvard biologist Richard Lewontin notes that – like the drunk looking for his keys
under a street light “because the light is better there” – research has often been
constrained to studies for which career‐oriented researchers have the apparatus
and methods to produce creditable (i.e. laudable, promotion‐worthy) results. viii Our
current era has seen an evolution of technology that challenges comfortable
“disciplinary” categories of research and conventional format‐defined codes of
fiduciary responsibility. Not only have traditional distinctions between the domains
of the arts and the humanities and the sciences been challenged but the conventions
of scientific disciplines in themselves – as foci for research and investment – are
being challenged. New possibilities for trans‐disciplinarity are emerging but the
requisite tools and methods are not yet fully formed and organizational paths for
such research are not always clear.
AND HOW DOES DATA HAVE MEANING?
When data is considered in the scientific or research context, its semantic properties
necessarily become essential. Thus our ability to contextualize data becomes
primary. Parameters of time and space are immediately relevant – some data will
have a geographic context (deriving one parameter of meaning from location ‐‐ in
situ); other data will be essentially ageographic (ex situ), experimental and
independent of geography but not of experimental frame. Time as a parameter of
data may similarly be historical or ahistorical. Agency, materials, equipment
(calibration) and operations also set primary parameters for data.
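The parameters named above can be pictured as fields attached to each recorded datum. What follows is only a minimal sketch under stated assumptions: the `Observation` class, its field names and the sample values are hypothetical illustrations, not an existing schema or standard. It shows how a record might carry its own context, and how gaps in that context could be listed.

```python
from dataclasses import dataclass, fields
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One datum plus the context that gives it meaning (hypothetical schema)."""
    referent: str                 # object or event observed
    value: str                    # observation, description or measurement
    agent: str = ""               # who or what made the record
    materials: str = ""           # materials involved
    equipment: str = ""           # instrument used
    calibration: str = ""         # calibration note for the equipment
    operations: str = ""          # procedure followed
    experimental_frame: str = ""  # frame for ex situ, experimental data
    observed_at: Optional[datetime] = None          # None: treated as ahistorical
    location: Optional[Tuple[float, float]] = None  # (lat, lon); None: ageographic

    def is_in_situ(self) -> bool:
        """True when the record derives part of its meaning from location."""
        return self.location is not None

    def missing_context(self) -> List[str]:
        """Names of text fields left empty, in declaration order."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str) and getattr(self, f.name) == ""
        ]


# Placeholder values for illustration only; not real measurements.
nest_check = Observation(
    referent="nest box 14",
    value="bird present",
    agent="webcam",
    equipment="fixed camera",
    operations="hourly frame capture",
    observed_at=datetime(2009, 2, 2, 6, 0, tzinfo=timezone.utc),
    location=(0.0, 0.0),
)

print(nest_check.is_in_situ())
print(nest_check.missing_context())
```

Leaving time and location optional reflects the distinction drawn above: a record without a location is treated as ageographic, and one without a timestamp as ahistorical.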
Huge – dare we say “exorbitant”? ‐‐ investments have been made in the “metadata
industry” – most particularly in library and archival cataloging. In the new‐media
Web environment, other solutions operating upon natural language and “native
[pre‐existent] metadata” have produced prodigious, cost‐effective (profitable)
In an era when combinations and recombinations of data are routine, “demand side”
problems occur respecting validation and certification of results and “supply side”
problems occur respecting attribution and credit for the originators of data.
Moreover scientists’ claims for discrete personal “priority” of discovery are
inevitably being challenged. Collaboration is more and more common and ‐‐ as foreseen
by Robert K. Merton ix ‐‐ an individual’s contribution to the whole corpus of
knowledge is less and less clearly attributable. Notions of “authorship” are
challenged by anonymous institutional/organizational claims to authorship. x And
“small science” (ecology, field biology, etc.) – where the individual scientist is still
seen as a single actor ‐‐ is often perceived as weakly developed – as providing no
more than “disaggregated components of an incipient network” xi.
At the same time there has been a quantum increase in the effort to isolate and to
monetize intellectual property xii. Intellectual “assets” – whether in the form of
genomic discoveries or scientific journal articles – have become increasingly
It is also the case that the digital environment has disrupted traditional economic
value chains. This has been obviously true in the publishing industry and in the
entertainment industry, where the consequences of these pressures have been
accusations, threats and lawsuits – often to the bizarre extent that natural allies in
the value chain have attacked each other, or even to the degree that customers/clients
of an industry have been attacked by the industry itself.
A GLOBAL DATA IMPERATIVE???
Perhaps neglecting Faust (?), Thomas Jefferson asserted, “The field of knowledge is
the common property of all mankind.” It seems more responsible to consider an
ethical scale of need that compels free and open public access to the results of
nondestructive research (obviously the definition of “nondestructive” requires
debate). This spectrum of common need includes: human health, pharmacology,
public health; agrarian and agricultural knowledge; environmental knowledge and
conservation and – more generally – most non‐destructive science and technology,
critical for education. The dilemma we face worldwide is that most developing
countries and developing segments of society are those least capable of clearing the
thresholds of use imposed by market controls on knowledge in all forms.xiv
In the naive exuberance that formed the League of Nations, an “International
Committee on Intellectual Cooperation” was envisioned as a forum for global focus
on common goods ‐‐ today, in a far more exact way, we have the opportunity to plan
and develop technical resources, standards and methodologies that will not deny
the benefits of human knowledge to the least privileged. A comprehensive strategy
requires that we successfully address four primary modalities of constraint:
technology, culture, economy and law.
The Internet Archive – focusing on R&D and prototyping ‐‐ has built essential
components of what could ultimately become a full service, full life cycle ‘collective
utility’ or “service cloud” ‐‐ for open digital management of human knowledge. This
evolution does not require that the Archive itself become this “service cloud” but
that it compose a comprehensive response and ‐‐ together with other institutions
and organizations, programs and initiatives ‐‐ catalyze a comprehensive response.xv
Most essential elements are in place – or at least emerging. We can and should act
i Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation
NSF 07-601, p. 5.
ii “the data management life cycle (including data creation, access, use, and preservation)” Sustainable
Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601 ,
iii Or as another instance see recent NYT article: Natalie Angier, “Tracking forest creatures on the move.”
NYT Feb 2, 2009
iv The California poet William Everson once asked poignantly: “And when the last coyote has been
v “…the amount of information created, captured or replicated exceeded available storage for the first time in
2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital
universe will not have a permanent home.” John Gantz et al. (IDC) The diverse and exploding digital
universe; an updated forecast of worldwide information growth through 2011. (March, 2008)
vi Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE:
vii HISTORIC BUDGET SUPPORT FOR NLM
viii R. Lewontin, The Triple Helix: Gene, Organism, Environment
ix “Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic.
The scientist’s claim to “his” intellectual “property” is limited to that of recognition and esteem which, if
the institution functions with a modicum of efficiency, is roughly commensurate with the significance of
the increments brought to the common fund of knowledge.” Robert K. Merton, “A Note on Science and
Democracy,” Journal of Law and Political Sociology 1 (1942): 121.
x SEE for example: Peter Galison, “The Collective Author,” in M. Biagioli and P. Galison (ed.s) Scientific
Authorship: Credit and Intellectual Property in Science. NY, Routledge, 2003.
xi SEE: THE ROLE OF SCIENTIFIC AND TECHNICAL DATA AND INFORMATION IN THE
PUBLIC DOMAIN PROCEEDINGS OF A SYMPOSIUM J.M. Esanu and P.F. Uhlir, (Ed.s) Steering
Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of
International Scientific and Technical Information Programs Board on International Scientific
Organizations Policy and Global Affairs Division, National Research Council of the National Academies,
xii SEE L. Lessig, Code
xiii SEE Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,” MIT Sloan
Management Review, 44 (2) Fall, 2002: 77.
xiv SEE for ex.:
xv A short list is relatively easy to compose…