Moritz A Universe Of Data

Some Notes on Digital Data – with a suggestion  
Tom Moritz / Internet Archive        February, 2009 
 
A UNIVERSE OF DATA??? 
 
What is “data”?  The US NSF DataNet solicitation defines “data” as: “Any 
information that can be stored in digital form and accessed electronically, including, 
but not limited to, numeric data, text, publications, sensor streams, video, audio, 
algorithms, software, models and simulations, images, etc.” [i] This definition is 
technically acceptable but not scientifically epistemic. In fact, it is useful to think of 
“data” in two distinct ways. “Data” refers (as in the DataNet definition) to the 
computer‐readable code that is stored in, accessed from or flows between 
computers. “Data” also means precise, well‐defined representations of observations, 
descriptions or measurements of a referent (object or event) recorded in some 
standard, well‐specified way.
 
The more inclusive DataNet definition has the virtue of forcing us to consider a 
unified, holistic approach to knowledge and to the formal resources that inform and 
express it; we are forced to confront the Web as it exists today. 
 
HOW MUCH DATA? 
 
In a now famous quip, Lewis Carroll noted that the perfect scale for maps was 1:1 
but that farmers tend to become disgruntled when such maps are unrolled over 
their fields. The notion that we could theoretically record “everything” in real time 
‐‐ “1:1 capture” ‐‐ leaves us to ponder the limits of “data” collection, management 
and longevity – full‐life‐cycle [ii] curation and stewardship. With the evolution of 
satellite coverages, nanotechnology, robotics and embedded network sensors, it is 
possible, for example, to systematically record presence/absence data for birds at a 
nesting site – at every nesting site in a given area ‐‐ 24‐7, forever [SEE for example: 
http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14 ] [iii] or for that 
matter to record every human heartbeat. [iv] And to archive these data in perpetuity? 
(The casual assumption that we might comprehensively save all data is belied by a 
recent forecast projecting that in 2007, the total data produced on earth for the first 
time exceeded the available storage. [v])

[Figure not reproduced; credited in note vi]

WHO’S RESPONSIBLE? 
 
It is also the case that technologies, standards and methodologies, along with 
institutions, organizations and professions, have evolved and become established to 
manage and preserve both logical domains of knowledge and selected technical 
formats of data. The point respecting logical segments is relatively clear: natural 
history museums and herbaria hold preserved (i.e. dead) organisms as specimens; 
zoos and gardens and aquaria hold living organisms ex situ; protected areas hold 
living organisms in situ; cryogenics facilities hold tissue samples. Similarly, their 
libraries hold logically corresponding published or archival works. Respecting 
technical formats: libraries hold bound paper/print materials; archives hold unbound 
paper/manuscript or unbound paper/typescript materials; media repositories hold 
non‐print media; computer centers hold data sets and complex models (hypothetical 
assemblages of data that generate new data); art museums hold paintings and 
sculptures; a dance company performs dances; and an indigenous group stewards its 
“old knowledge”.
 
Similarly, librarians and archivists, curators and zookeepers, rangers and 
information technologists, dancers and shamans have all received vocational charge 
for siloed segments of our “knowledge base”. But who is responsible for the whole? 
Before the advent of digital technology this latter question would have been 
metaphysically interesting but pointless ‐‐ no longer, it seems.  Scanning our society 
and culture, it seems libraries and librarians are the most eligible candidates for the 
role. 
 
And if the received “compartments” (organizational, professional and logical 
structures) are no longer dictated by operational constraints (e.g. the ability to 
curate a dragonfly or to select and conserve a book), how can we most effectively 
organize the management of knowledge as data? At the national level, there are 
prime examples of institutions that admirably serve logical domains of our 
knowledge base; the National Library of Medicine is one. [vii] The Library of 
Congress alone has the stature and scope of interest to command our trust and 
expectations.
 
BUT DATA FOR WHAT???

Harvard biologist Richard Lewontin notes that – like the drunk looking for his keys 
under a street light “because the light is better there” – research has often been 
constrained to studies for which career‐oriented researchers have the apparatus 
and methods to produce creditable (i.e. laudable, promotion‐worthy) results. [viii] Our 
current era has seen an evolution of technology that challenges comfortable 
“disciplinary” categories of research and conventional format‐defined codes of 
fiduciary responsibility.  Not only have traditional distinctions between the domains 
of the arts and the humanities and the sciences been challenged, but the conventions 
of scientific disciplines in themselves – as foci for research and investment – are 
being challenged. New possibilities for trans‐disciplinarity are emerging but the 
requisite tools and methods are not yet fully formed and organizational paths for 
such research are not always clear. 

AND HOW DOES DATA HAVE MEANING? 
 
When data is considered in the scientific or research context, its semantic properties 
necessarily become essential.  Thus our ability to contextualize data becomes 
primary.  Parameters of time and space are immediately relevant: some data will 
have a geographic context (deriving one parameter of meaning from location ‐‐ in 
situ); other data will be essentially ageographic (ex situ), experimental and 
independent of geography but not of experimental frame.  Time as a parameter of 
data may similarly be historical or ahistorical.  Agency, materials, equipment 
(calibration) and operations also set primary parameters for data.  
 
Huge – dare we say “exorbitant”? – investments have been made in the “metadata 
industry”, most particularly in library and archival cataloging.  In the new‐media 
Web environment, other solutions operating upon natural language and “native 
[pre‐existent] metadata” have produced prodigious, cost‐effective (profitable) 
results. 
 
WHOSE DATA? 
 
In an era when combinations and recombinations of data are routine, “demand side” 
problems occur respecting validation and certification of results and “supply side” 
problems occur respecting attribution and credit for the originators of data. 
 
Moreover, scientists’ claims for discrete personal “priority” of discovery are 
inevitably being challenged.  Collaboration is more and more common and ‐‐ as 
foreseen by Robert K. Merton [ix] ‐‐ an individual’s contribution to the whole corpus 
of knowledge is less and less clearly attributable.  Notions of “authorship” are 
challenged by anonymous institutional/organizational claims to authorship. [x]  And 
“small science” (ecology, field biology, etc.) – where the individual scientist is still 
seen as a single actor – is often perceived as weakly developed, as providing no 
more than “disaggregated components of an incipient network” [xi]. 
 
At the same time there has been a quantum increase in the effort to isolate and to 
monetize intellectual property [xii].  Intellectual “assets” – whether in the form of 
genomic discoveries or scientific journal articles – have become increasingly 
commoditized. [xiii]  
 
It is also the case that the digital environment has disrupted traditional economic 
value chains. This has been obviously true in the publishing industry and in the 
entertainment industry, where the consequences of these pressures have been 
accusations, threats and lawsuits – often to the bizarre extent that natural allies in 
the value chain have attacked each other, or even to the degree that customers/clients 
of an industry have been attacked by the industry itself.   
 
 
 
 
A GLOBAL DATA IMPERATIVE??? 
 
Perhaps neglecting Faust (?), Thomas Jefferson asserted, “The field of knowledge is 
the common property of all mankind.” It seems more responsible to consider an 
ethical scale of need that compels free and open public access to the results of 
non‐destructive research (obviously the definition of “non‐destructive” requires 
debate).  This spectrum of common need includes: human health, pharmacology, 
public health; agrarian and agricultural knowledge; environmental knowledge and 
conservation and – more generally – most non‐destructive science and technology, 
critical for education.  The dilemma we face worldwide is that most developing 
countries and developing segments of society are those least capable of clearing the 
thresholds of use imposed by market controls on knowledge in all forms. [xiv]  
 
In the naive exuberance that formed the League of Nations, an “International 
Committee on Intellectual Cooperation” was envisioned as a forum for global focus 
on common goods. Today, in a far more exact way, we have the opportunity to plan 
and develop technical resources, standards and methodologies that will not deny 
the benefits of human knowledge to the least privileged.  A comprehensive strategy 
requires that we successfully address four primary modalities of constraint: 
technology, culture, economy and law. 
 
The Internet Archive – focusing on R&D and prototyping ‐‐ has built essential 
components of what could ultimately become a full service, full life cycle ‘collective 
utility’ or  “service cloud” ‐‐ for open digital management of human knowledge.  This 
evolution does not require that the Archive itself become this “service cloud” but 
that it compose a comprehensive response and ‐‐ together with other institutions 
and organizations, programs and initiatives ‐‐ catalyze a comprehensive response. [xv]  
Most essential elements are in place – or at least emerging.   We can and should act 
now.  
                                                        
i Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation
NSF 07-601, p.5.
ii “the data management life cycle (including data creation, access, use, and preservation)” Sustainable
Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601,
p.5.
iii Or as another instance see recent NYT article: Natalie Angier, “Tracking forest creatures on the move.”
NYT Feb 2, 2009
http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
iv The California poet William Everson once asked poignantly: “And when the last coyote has been
tagged…?”
v “…the amount of information created, captured or replicated exceeded available storage for the first time in
2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital
universe will not have a permanent home.” John Gantz et al. (IDC) The diverse and exploding digital
universe; an updated forecast of worldwide information growth through 2011. (March, 2008)
www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
vi Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE:
http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
vii HISTORIC BUDGET SUPPORT FOR NLM 
viii R. Lewontin, The Triple Helix: Gene, Organism, Environment
ix “Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic.
The scientist’s claim to “his” intellectual “property” is limited to that of recognition and esteem which, if
the institution functions with a modicum of efficiency, is roughly commensurate with the significance of
the increments brought to the common fund of knowledge.” Robert K. Merton, “A Note on Science and
Democracy,” Journal of Legal and Political Sociology 1 (1942): 121.
x SEE for example: Peter Galison, “The Collective Author,” in M. Biagioli and P. Galison (eds.) Scientific
Authorship: Credit and Intellectual Property in Science. NY, Routledge, 2003.
xi SEE: The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a
Symposium. J.M. Esanu and P.F. Uhlir (eds.), Steering Committee on the Role of Scientific and Technical
Data and Information in the Public Domain, Office of International Scientific and Technical Information
Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National
Research Council of the National Academies.
xii SEE L. Lessig, Code
xiii SEE Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,” MIT Sloan
Management Review, 44 (2) Fall, 2002: 77.
xiv SEE for ex.:  
xv A short list is relatively easy to compose…
