Moritz A Universe Of Data

Some Notes on Digital Data – with a suggestion
Tom Moritz / Internet Archive
February, 2009
 
A UNIVERSE OF DATA???

What is “data”?

The US NSF DataNet solicitation defines “data” as:

“Any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” [i]


This definition is technically acceptable but not scientifically epistemic. In fact, it is useful to think of “data” in two distinct ways. First, “data” refers (as in the DataNet definition) to the computer-readable code that is stored in, accessed from or flows between computers. Second, “data” also means precise, well-defined representations of observations, descriptions or measurements of a referent (object or event) recorded in some standard, well-specified way.

 
The more inclusive DataNet definition has the virtue of forcing us to consider a unified, holistic approach to knowledge and to the formal resources that inform and express it; we are forced to confront the Web as it exists today.
 
HOW MUCH DATA?

In a now famous quip, Lewis Carroll noted that the perfect scale for maps was 1:1 but that farmers tend to become disgruntled when such maps are unrolled over their fields. The notion that we could theoretically record “everything” in real time (“1:1 capture”) leaves us to ponder the limits of “data” collection, management and longevity: full-life-cycle [ii] curation and stewardship. With the evolution of satellite coverages, nanotechnology, robotics and embedded network sensors, it is possible, for example, to systematically record presence/absence data for birds at a nesting site (at every nesting site in a given area), 24-7, forever [SEE for example: http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14 ] [iii] or for that matter to record every human heartbeat. [iv] And to archive these data in perpetuity? (The casual assumption that we might comprehensively save all data is belied by a recent forecast projecting that in 2007, the total data produced on earth for the first time exceeded the available storage. [v])

[Illustration: Serge Bloch, in NYT] [vi]
WHO’S RESPONSIBLE?

It is also the case that technology, standards and methodologies, that institutions, organizations and professions, have evolved and become established to manage and preserve logical domains of knowledge as well as selected technical formats of data.

The point respecting logical segments is relatively clear: natural history museums and herbaria hold preserved (i.e. dead) organisms as specimens; zoos and gardens and aquaria hold living organisms ex situ; protected areas hold living organisms in situ; cryogenics facilities hold tissue samples. Similarly, their libraries hold logically corresponding published or archival works. Respecting technical formats: libraries hold bound paper/print materials; archives hold unbound paper/manuscript or unbound paper/typescript materials; media repositories hold non-print media; computer centers hold data sets and complex models (hypothetical assemblages of data that generate new data); art museums hold paintings and sculptures; a dance company performs dances; an indigenous group stewards its “old knowledge”.

 
Similarly, librarians and archivists, curators and zookeepers, rangers and information technologists, dancers and shamans have all received vocational charge for siloed segments of our “knowledge base”. But who is responsible for the whole? Before the advent of digital technology this latter question would have been metaphysically interesting but pointless; no longer, it seems. Scanning our society and culture, it seems libraries and librarians are the most eligible candidates for the role.
 
And if the received “compartments” (organizational, professional, logical structures) are no longer dictated by operational constraints (e.g. the ability to curate a dragonfly or to select and conserve a book), how can we most effectively organize the management of knowledge as data? At the national level, there are prime examples of institutions that admirably serve logical domains of our knowledge base; the National Library of Medicine is one. [vii] The Library of Congress alone has the stature and scope of interest to command our trust and expectations.
 
BUT DATA FOR WHAT???

Harvard biologist Richard Lewontin notes that, like the drunk looking for his keys under a street light “because the light is better there”, research has often been constrained to studies for which career-oriented researchers have the apparatus and methods to produce creditable (i.e. laudable, promotion-worthy) results. [viii] Our current era has seen an evolution of technology that challenges comfortable “disciplinary” categories of research and conventional format-defined codes of fiduciary responsibility. Not only have traditional distinctions between the domains of the arts and the humanities and the sciences been challenged, but the conventions of scientific disciplines in themselves, as foci for research and investment, are being challenged. New possibilities for trans-disciplinarity are emerging but the requisite tools and methods are not yet fully formed and organizational paths for such research are not always clear.

AND HOW DOES DATA HAVE MEANING?

When data is considered in the scientific or research context, its semantic properties necessarily become essential. Thus our ability to contextualize data becomes primary. Parameters of time and space are immediately relevant: some data will have a geographic context (deriving one parameter of meaning from location, in situ); other data will be essentially ageographic (ex situ), experimental and independent of geography but not of experimental frame. Time as a parameter of data may similarly be historical or ahistorical. Agency, materials, equipment (calibration) and operations also set primary parameters for data.
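As a purely illustrative aside, the parameters named above (time, space or experimental frame, agency, equipment and its calibration, operations) can be pictured as fields that travel with each datum. The sketch below is a minimal one under stated assumptions: the class name Observation, every field name, and the sample values (including the coordinates) are hypothetical and are not drawn from any standard or from the Internet Archive.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    """One datum plus the context that gives it meaning (hypothetical field names)."""
    value: float
    unit: str
    recorded_at: Optional[datetime]          # None for ahistorical data
    location: Optional[Tuple[float, float]]  # (lat, lon); None for ex situ data
    experimental_frame: Optional[str]        # protocol or setup for ex situ data
    agent: str                               # who or what made the observation
    equipment: str
    calibration: str
    operation: str

    def is_in_situ(self) -> bool:
        # A datum with a location derives part of its meaning from geography.
        return self.location is not None

    def missing_context(self) -> List[str]:
        # Report which contextual parameters are absent.
        missing = []
        if self.location is None and not self.experimental_frame:
            missing.append("location or experimental_frame")
        for name in ("agent", "equipment", "calibration", "operation"):
            if not getattr(self, name):
                missing.append(name)
        return missing


# Made-up example: a presence/absence record from a nest camera.
nest_sighting = Observation(
    value=1.0,
    unit="birds present",
    recorded_at=datetime(2009, 2, 1, 6, 30, tzinfo=timezone.utc),
    location=(33.8, -116.8),      # illustrative coordinates only
    experimental_frame=None,
    agent="webcam",
    equipment="fixed camera",
    calibration="",               # deliberately left empty
    operation="presence/absence count",
)

print(nest_sighting.is_in_situ())
print(nest_sighting.missing_context())
```

In this sketch a record with no calibration note is still storable, but missing_context() flags the gap, which is the sense in which context, and not the stored value alone, makes a datum usable.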

 
Huge (dare we say “exorbitant”?) investments have been made in the “metadata industry”, most particularly in library and archival cataloging. In the new media, Web environment, other solutions operating upon natural language and “native [pre-existent] metadata” have produced prodigious, cost-effective (profitable) results.
 
WHOSE DATA?

In an era when combinations and recombinations of data are routine, “demand side” problems occur respecting validation and certification of results and “supply side” problems occur respecting attribution and credit for the originators of data.
 
Moreover, scientists’ claims for discrete personal “priority” of discovery are inevitably being challenged. Collaboration is more and more common and, as foreseen by Robert K. Merton [ix], an individual’s contribution to the whole corpus of knowledge is less and less clearly attributable. Notions of “authorship” are challenged by anonymous institutional/organizational claims to authorship. [x] And “small science” (ecology, field biology, etc.), where the individual scientist is still seen as a single actor, is often perceived as weakly developed, as providing no more than “disaggregated components of an incipient network” [xi].
 
At the same time there has been a quantum increase in the effort to isolate and to monetize intellectual property [xii]. Intellectual “assets”, whether in the form of genomic discoveries or scientific journal articles, have become increasingly commoditized. [xiii]

 
It is also the case that the digital environment has disrupted traditional economic value chains (this has been obviously true in the publishing industry and in the entertainment industry, where the consequences of these pressures have been accusations, threats and lawsuits, often to the bizarre extent that natural allies in the value chain have attacked each other or even to the degree that customers/clients of an industry have been attacked by the industry itself).


 
 
 
 
A GLOBAL DATA IMPERATIVE???

Perhaps neglecting Faust (?), Thomas Jefferson asserted, “The field of knowledge is the common property of all mankind.” It seems more responsible to consider an ethical scale of need that compels free and open public access to the results of nondestructive research (obviously the definition of “nondestructive” requires debate). This spectrum of common need includes: human health, pharmacology, public health; agrarian and agricultural knowledge; environmental knowledge and conservation; and, more generally, most nondestructive science and technology, critical for education. The dilemma we face, worldwide, is that most developing countries and developing segments of society are those least capable of clearing the thresholds of use imposed by market controls on knowledge in all forms. [xiv]

 
In the naive exuberance that formed the League of Nations, an “International Committee on Intellectual Cooperation” was envisioned as a forum for global focus on common goods. Today, in a far more exact way, we have the opportunity to plan and develop technical resources, standards and methodologies that will not deny the benefits of human knowledge to the least privileged. A comprehensive strategy requires that we successfully address four primary modalities of constraint: technology, culture, economy and law.
 
The Internet Archive, focusing on R&D and prototyping, has built essential components of what could ultimately become a full service, full life cycle ‘collective utility’ or “service cloud” for open digital management of human knowledge. This evolution does not require that the Archive itself become this “service cloud” but that it compose and, together with other institutions and organizations, programs and initiatives, catalyze a comprehensive response. [xv] Most essential elements are in place, or at least emerging. We can and should act now.

[i] Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601, p.5.
[ii] “the data management life cycle (including data creation, access, use, and preservation)” Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601, p.5.
[iii] Or as another instance see recent NYT article: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
[iv] The California poet William Everson once asked poignantly: “And when the last coyote has been tagged…?”
[v] “…the amount of information created, captured or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home.” John Gantz et al. (IDC) The diverse and exploding digital universe; an updated forecast of worldwide information growth through 2011. (March, 2008) www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
[vi] Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE: http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
[vii] HISTORIC BUDGET SUPPORT FOR NLM
[viii] R. Lewontin, The Triple Helix: Gene, Organism, Environment
[ix] “Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic. The scientist’s claim to “his” intellectual “property” is limited to that of recognition and esteem which, if the institution functions with a modicum of efficiency, is roughly commensurate with the significance of the increments brought to the common fund of knowledge.” Robert K. Merton, “A Note on Science and Democracy,” Journal of Law and Political Sociology 1 (1942): 121.
[x] SEE for example: Peter Galison, “The Collective Author,” in M. Biagioli and P. Galison (eds.) Scientific Authorship: Credit and Intellectual Property in Science, NY, Routledge, 2003.
[xi] SEE: THE ROLE OF SCIENTIFIC AND TECHNICAL DATA AND INFORMATION IN THE PUBLIC DOMAIN: PROCEEDINGS OF A SYMPOSIUM, J.M. Esanu and P.F. Uhlir (eds.), Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies.
[xii] SEE L. Lessig, Code
[xiii] SEE Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,” MIT Sloan Management Review, 44 (2) Fall, 2002: 77.
[xiv] SEE for ex.:
[xv] A short list is relatively easy to compose…