Some	Notes	on	Digital	Data	–	with	a	suggestion		
Tom	Moritz	/	Internet	Archive								February,	2009	
	
A	UNIVERSE	OF	DATA???	
	
What	is	“data”?		The	US	NSF	DataNet	solicitation	defines	“data”	as:			“Any	
information	that	can	be	stored	in	digital	form	and	accessed	electronically,	including,	
but	not	limited	to,	numeric	data,	text,	publications,	sensor	streams,	video,	audio,	
algorithms, software, models and simulations, images, etc.” [i] This definition is
technically	acceptable	but	not	scientifically	epistemic.	In	fact,	it	is	useful	to	think	of	
“data”	in	two	distinct	ways.		“Data”	refers	(as	in	the	DataNet			definition)	to	the	
computer	readable	code	that	is	stored	in,	accessed	from	or	flows	between	
computers.	“Data”	also	means	precise,	well-defined	representations	of	observations,	
descriptions	or	measurements	of	a	referent	(object	or	event)	recorded	in	some	
standard,	well-specified	way.		
	
The	more	inclusive	DataNet	definition	has	the	virtue	of	forcing	us	to	consider	a	
unified,	holistic	approach	to	knowledge	and	to	the	formal	resources	that	inform	and	
express	it;	we	are	forced	to	confront	the	Web	as	it	exists	today.	
	
HOW	MUCH	DATA?	
	
In	a	now	famous	quip,	Lewis	Carroll	noted	that	the	perfect	scale	for	maps	was	1:1	
but	that	farmers	tend	to	become	disgruntled	when	such	maps	are	unrolled	over	
their fields. The notion that we could theoretically record “everything” in real time
-- “1:1 capture” – leaves us to ponder the limits of “data” collection, management
and longevity – full-life-cycle [ii] curation and stewardship. With the expansion and
evolution of satellite coverages, nanotechnology, robotics and embedded network
sensors, it is possible, for example, to systematically record presence/absence data
for birds at a nesting site – at every nesting site in a given area -- 24-7, forever [SEE
for example: http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14 ] [iii]
or for that matter to record every human heartbeat. [iv] And to archive these data in
perpetuity? (The notion that we might comprehensively save all data is belied by a
recent forecast projecting that in 2007, the total data produced on earth for the first
time exceeded the available storage. [v])
[Illustration by Serge Bloch; see note vi]

WHO’S	RESPONSIBLE?	
	
It	is	also	the	case	that	technology,	standards	and	methodologies,	that	institutions,	
organizations	and	professions,	have	evolved	and	become	established	to	manage	and	
preserve	logical	domains	of	knowledge	as	well	as	selected	technical	formats	of	data.		
The point respecting logical segments is relatively clear – natural history museums
and herbaria hold preserved (i.e. dead) organisms as specimens; zoos and gardens
and aquaria hold living organisms ex situ; protected areas hold living organisms in
situ; cryogenics facilities hold tissue samples – similarly, the libraries of all these
institutions hold logically corresponding published or archival works.
	
Respecting technical formats: libraries hold bound paper/print materials; archives
hold unbound paper/manuscript or unbound paper/typescript materials; media
repositories hold non-print media; computer centers hold data sets and complex
models (hypothetical assemblages of data that generate new data); art museums
hold paintings and sculptures; a dance company performs dances; an indigenous
group stewards its “old knowledge”.
	
Similarly,	librarians	and	archivists,	curators	and	zookeepers,	rangers	and	
information	technologists,	dancers	and	shamans	have	all	received	vocational	charge	
for	siloed	segments	of	our	“knowledge	base”.	But	who	is	responsible	for	the	whole?	
Before	the	advent	of	digital	technology	this	latter	question	would	have	been	
metaphysically	interesting	but	pointless	--	no	longer	it	seems.		Scanning	our	society	
and	culture,	it	seems	libraries	and	librarians	are	the	most	eligible	candidates	for	the	
role.	
	
And if the received “compartments” – organizational, professional and logical
structures – are no longer dictated by operational constraints (e.g. the ability to curate a
dragonfly or to select and conserve a book), how can we most effectively organize
the management of knowledge as data? At the national level, there are prime
examples of institutions that admirably serve logical domains of our knowledge
base; the National Library of Medicine is one. [vii] The Library of Congress alone has
the stature and scope of interest to command our trust and expectations.

BUT	DATA	FOR	WHAT???	
	
Harvard	biologist	Richard	Lewontin	notes	that	–	like	the	drunk	looking	for	his	keys	
under	a	street	light	“because	the	light	is	better	there”	–	research	has	often	been	
constrained to studies for which career-oriented researchers have the apparatus
and methods to produce creditable (i.e. laudable, promotion-worthy) results. [viii] Our
current era has seen an evolution of technology that challenges comfortable
“disciplinary” categories of research and conventional format-defined codes of
fiduciary responsibility. Not only have traditional distinctions between the domains
of the arts and the humanities and the sciences been challenged but the
conventional scientific disciplines in themselves – as foci for research and
investment – are being challenged. New possibilities for trans-disciplinarity are
emerging	but	the	requisite	tools	and	methods	are	not	yet	fully	formed	and	
organizational	paths	for	such	research	are	not	always	clear.	
	
AND	HOW	DOES	DATA	HAVE	MEANING?	
	
Both “26.07” [ix] and “0.59998” [x] are actual data.
	
The	former	was	recorded	by	Henry	Cavendish	in	his	“Experiments	to	Determine	the	
Density of the Earth” [June 21, 1798]; the latter is a reading obtained by monitoring
sap flow in Manzanita at the University of California James Reserve, Mt. San Jacinto,
California [December 4, 2007 11:37].
	
It is immediately obvious that without a description of the primary context for the
capture of the data, mere numbers are meaningless. More information is necessary
to impart meaning, but it must be more than the simple “context” of agent, place,
time and general purpose just disclosed. Our ability to contextualize data effectively
becomes a primary concern. The problem thus becomes what information about data
will make it fit for use.
	
And	hence	we	are	led	to	consider	the	primary	uses	of	data.	
	
When	data	is	considered	in	the	scientific	or	research	context,	its	semantic	properties	
necessarily	become	essential.	Parameters	of	time	and	space	are	immediately	
relevant – some data will have a geographic context (deriving one parameter of
meaning from location -- in situ); other data will be essentially ageographic (ex situ),
experimental and independent of geography but not of experimental frame. Time as
a	parameter	of	data	may	similarly	be	historical	or	ahistorical.			Agency,	materials,	
equipment	(calibration)	and	operations	also	set	primary	parameters	for	data.		
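
The parameters just listed (time, place, agency, equipment) can be made concrete by carrying them alongside the value itself. What follows is only an illustrative sketch in Python, not a proposed standard: the class name `Datum` and all of its field names are assumptions introduced here, and the two records reuse only what the text and its notes disclose about “26.07” and “0.59998”. Units are left empty because the sources cited here do not state them.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass(frozen=True)
class Datum:
    """A recorded value plus the context that makes it interpretable (hypothetical schema)."""
    value: float
    agent: str                        # who or what recorded the value
    purpose: str                      # the general purpose of the observation
    recorded_at: datetime             # the time parameter
    location: Optional[str] = None    # None for ex situ (ageographic) data
    instrument: Optional[str] = None  # equipment; calibration details would belong here
    unit: Optional[str] = None        # None when the source does not state a unit

    def is_in_situ(self) -> bool:
        """True when part of the meaning derives from where the value was recorded."""
        return self.location is not None

    def missing_context(self) -> List[str]:
        """Names of the optional context fields that are still undisclosed."""
        optional_fields = ("location", "instrument", "unit")
        return [name for name in optional_fields if getattr(self, name) is None]


# The two example values from the text, with only the context the text discloses.
cavendish = Datum(
    value=26.07,
    agent="Henry Cavendish",
    purpose="Experiments to Determine the Density of the Earth",
    recorded_at=datetime(1798, 6, 21),
)

sap_flow = Datum(
    value=0.59998,
    agent="data logger",
    purpose="monitoring sap flow in Manzanita",
    recorded_at=datetime(2007, 12, 4, 11, 37),
    location="University of California James Reserve, Mt. San Jacinto, California",
    instrument="constant temperature heat dissipation probe",
)

for record in (cavendish, sap_flow):
    print(record.value, "in situ:", record.is_in_situ(),
          "missing:", record.missing_context())
```

Even in so small a record, the gaps are visible: neither value can be put to use until its unit is known, which is the “fit for use” problem in miniature.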
	
Huge	–	dare	we	say	“exorbitant”?		--	investments	have	been	made	in	the	“metadata	
industry” – most particularly in library and archival cataloging. In the new-media
Web environment, other solutions operating upon natural language and “native
[pre-existent] metadata” have produced prodigious, cost-effective (profitable)
results.

WHOSE	DATA?	
	
In	an	era	when	combinations	and	recombinations	of	data	are	routine,	“demand	side”	
problems	occur	respecting	validation	and	certification	of	results	and	“supply	side”	
problems	occur	respecting	attribution	and	credit	for	the	originators	of	data.	
	
Moreover, scientists’ claims for discrete personal “priority” of discovery are
inevitably being challenged. Collaboration is more and more common -- as foreseen
by Robert K. Merton [xi] -- and an individual’s contribution to the whole corpus of
knowledge is less and less clearly attributable. Notions of “authorship” are
challenged by anonymous institutional/organizational claims to authorship. [xii] And
“small science” (ecology, field biology, etc.) – where the individual scientist is still
seen as a single actor – is often perceived as weakly developed – as providing no
more than “disaggregated components of an incipient network”. [xiii]
	
At the same time there has been a quantum increase in the effort to isolate and to
monetize intellectual property. [xiv] Intellectual “assets” – whether in the form of
genomic discoveries or scientific journal articles – have become increasingly
commoditized. [xv]
	
It is also the case that the digital environment has disrupted traditional economic
value chains. This has been obviously true in the publishing industry and in the
entertainment industry, where the consequences of these pressures have been
accusations, threats and lawsuits – often to the bizarre extent that natural allies in
the value chain have attacked each other, or even to the degree that customers or
clients of an industry have been attacked by the industry itself.

A	GLOBAL	DATA	IMPERATIVE???	
	
Perhaps	neglecting	Faust	(?),	Thomas	Jefferson	asserted,	“The	field	of	knowledge	is	
the	common	property	of	all	mankind.”	It	seems	more	responsible	to	consider	an	
ethical	scale	of	need	that	compels	free	and	open	public	access	to	the	results	of	
nondestructive	research	(obviously	the	definition	of	“nondestructive”	requires	
debate).		This	spectrum	of	common	need	includes:	human	health,	pharmacology,	
public	health;	agrarian	and	agricultural	knowledge;	environmental	knowledge	and	
conservation and – more generally – most nondestructive science and technology,
critical for education. The dilemma we face worldwide is that most developing
countries and developing segments of society are those least capable of clearing the
thresholds of use imposed by market controls on knowledge in all forms. [xvi]
	
In	the	naive	exuberance	that	formed	the	League	of	Nations,	an		“International	
Committee	on	Intellectual	Cooperation”	was	envisioned	as	a	forum	for	global	focus
on	common	goods		--	today,	in	a	far	more	exact	way,	we	have	the	opportunity	to	plan	
and	develop	technical	resources,	standards	and	methodologies	that	will	not	deny	
the	benefits	of	human	knowledge	to	the	least	privileged.		A	comprehensive	strategy	
requires that we successfully address four primary modalities of constraint:
technology,	culture,	economy	and	law.	
	
The	Internet	Archive	–	focusing	on	R&D	and	prototyping	--	has	built	essential	
components	of	what	could	ultimately	become	a	full	service,	full	life	cycle	‘collective	
utility’	or		“service	cloud”	--	for	open	digital	management	of	human	knowledge.		This	
evolution does not require that the Archive itself become this “service cloud” but
that it help to compose and -- together with other institutions
and organizations, programs and initiatives -- catalyze a comprehensive response. [xvii]
Most	essential	elements	are	in	place	–	or	at	least	emerging.			We	can	and	should	act	
now.		
	
i	Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation
NSF 07-601 , p.5.
ii
“the data management life cycle (including data creation, access, use, and preservation)” Sustainable
Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation NSF 07-601 ,
p.5.
iii
Or as another instance see recent NYT article: Natalie Angier, “Tracking forest creatures on the move.”
NYT Feb 2, 2009
http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
iv
The California poet William Everson once asked poignantly: “And when the last coyote has been
tagged…?”
v
“…the amount of information created, captured or replicated exceeded available storage for the first time in
2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital
universe will not have a permanent home.” John Gantz et al. (IDC) The diverse and exploding digital
universe; an updated forecast of worldwide information growth through 2011. (March, 2008)
www.emc.com/collateral/analyst-reports/diverse-exploding-digital-universe.pdf
vi
Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE:
http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse
vii	HISTORIC	BUDGET	SUPPORT	FOR	NLM	
viii
R. Lewontin, The Triple Helix: Gene, Organism, Environment
ix	“Experiments	to	determine	the	density	of	the	earth,”	by	Henry	Cavendish,	ESQ.,	F.R.S.	AND	
A.S.	Read	June	21,	1798							(From	the	Philosophical	Transactions	of	the	Royal	Society	of	London	for	
the	year	1798,	Part	II.	,	pp.	469-526)	
x Personal communication. “manzanita_sapflow_12-5-07_to_7-7-08.xls”: “instantaneous sap
flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple
branches of Manzanita, collected with a data logger. used to correlate physiological activity with
below-ground measures of root growth and CO2 production.” University of California James Reserve,
Mt San Jacinto, California
xi
“Property rights in science are whittled down to a bare minimum by the rationale of the scientific ethic.
The scientist’s claim to “his” intellectual “property” is limited to that of recognition and esteem which, if
the institution functions with a modicum of efficiency, is roughly commensurate with the significance of
the increments brought to the common fund of knowledge.” Robert K. Merton, “A Note on Science and
Democracy,” Journal of Legal and Political Sociology 1 (1942): 121.
xii
SEE for example: Peter Galison, “The Collective Author,” in M. Biagioli and P. Galison (eds.) Scientific
Authorship: Credit and Intellectual Property in Science. NY, Routledge, 2003.
xiii
SEE: THE ROLE OF SCIENTIFIC AND TECHNICAL DATA AND INFORMATION IN THE
PUBLIC DOMAIN PROCEEDINGS OF A SYMPOSIUM J.M. Esanu and P.F. Uhlir, (Ed.s) Steering
Committee on the Role of Scientific and Technical Data and Information in the Public Domain Office of
International Scientific and Technical Information Programs Board on International Scientific
Organizations Policy and Global Affairs Division, National Research Council of the National Academies.
xiv
SEE L. Lessig, Code
xv
SEE Julian Birkinshaw and Tony Sheehan, “Managing the Knowledge Life Cycle,” MIT Sloan
Management Review, 44 (2) Fall, 2002: 77.
xvi	SEE	for	ex.:		
xvii
A short list is relatively easy to compose…
