University of California, Berkeley: iSchool Nov, 2009
Transcript

  • 1. Hinges and Loops? -- Data as Evidence. I-School, UC Berkeley, November 13, 2009. “Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment
  • 2. “Othello: ‘Villain: be sure thou prove my love a whore; Be sure of it; give me the ocular proof; Or by the worth of man’s eternal soul, Thou hadst been better born a dog Than answer my naked wrath!’ Iago: ‘Is’t come to this?’ Othello: ‘Make me to see’t; or at the least so prove it, That the probation bear no hinge nor loop To hang doubt on; or woe upon thy life!’” The Tragedy of Othello: The Moor of Venice (Act 3 Scene 3)
  • 3. “So the universe has always appeared to the natural mind as a kind of enigma, of which the key must be sought in the shape of some illuminating or power-bringing word or name. That word names the universe's principle, and to possess it is, after a fashion, to possess the universe itself. ‘God,’ ‘Matter,’ ‘Reason,’ ‘the Absolute,’ ‘Energy,’ are so many solving names. You can rest when you have them. You are at the end of your metaphysical quest.” William James. "What Pragmatism Means". Lecture 2 in Pragmatism: A new name for some old ways of thinking. New York: Longman Green and Co (1922): 52-52. http://www.archive.org/stream/pragmatismnewnam00jame
  • 4. Internet Archive: http://www.archive.org/stream/pragmatismnewnam00jame Note Date of Publication: 1922
  • 5. Clear definitions are good (!) We should not reflexively rely on metaphysical “solving” / “power-bringing” words… ADD to James’s list?: “Knowledge” “Information” “Data” ???
  • 6. “Data” ?
  • 7. Usage. Data: The word data is the Latin plural of datum, neuter past participle of dare, "to give", hence "something given". “Data leads a life of its own quite independent of datum, of which it was originally the plural. It occurs in two constructions: as a plural noun (like earnings), taking a plural verb and plural modifiers (as these, many, a few) but not cardinal numbers, and serving as a referent for plural pronouns; and as an abstract mass noun (like information), taking a singular verb and singular modifiers (as this, much, little), and being referred to by a singular pronoun. Both constructions are standard. The plural construction is more common in print, perhaps because the house style of some publishers mandates it.” The Merriam-Webster Online Dictionary http://www.merriam-webster.com/dictionary/data
  • 8. “Data” ? [technological] “…’data’ are defined as any information that can be stored in digital form and accessed electronically, including, but not limited to, numeric data, text, publications, sensor streams, video, audio, algorithms, software, models and simulations, images, etc.” -- Program Solicitation 07-601 “Sustainable Digital Data Preservation and Access Network Partners (DataNet)”. Taken in this broadest possible sense, “data” are thus simply electronically coded forms of information. And virtually anything can be represented as “data” so long as it is electronically machine-readable.
  • 9. “The digital universe in 2007 -- at 2.25 x 10^21 bits (281 exabytes or 281 billion gigabytes) -- was 10% bigger than we thought. The resizing comes as a result of faster growth in cameras, digital TV shipments, and better understanding of information replication. “By 2011, the digital universe will be 10 times the size it was in 2006. “As forecast, the amount of information created, captured, or replicated exceeded available storage for the first time in 2007. Not all information created and transmitted gets stored, but by 2011, almost half of the digital universe will not have a permanent home. “Fast-growing corners of the digital universe include those related to digital TV, surveillance cameras, Internet access in emerging countries, sensor-based applications, datacenters supporting “cloud computing,” and social networks.” The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
  • 10. “The diversity of the digital universe can be seen in the variability of file sizes, from 6 gigabyte movies on DVD to 128-bit signals from RFID tags. Because of the growth of VoIP, sensors, and RFID, the number of electronic information “containers” -- files, images, packets, tag contents -- is growing 50% faster than the number of gigabytes. The information created in 2011 will be contained in more than 20 quadrillion -- 20 million billion -- of such containers, a tremendous management challenge for both businesses and consumers.” The Diverse and Exploding Digital Universe: An Updated Forecast of Worldwide Information Growth through 2011 -- Executive Summary. IDC Information and Data, March, 2008 http://www.emc.com/collateral/analyst-reports/diverse-exploding-idc-exec-summary.pdf
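The magnitudes quoted from the IDC summary can be cross-checked with a few lines of arithmetic. The sketch below is not part of the talk; it assumes decimal (SI) prefixes, 8 bits per byte, and reads the exponent in the quoted figure as 10^21. All variable names are illustrative.

```python
# Cross-check of the quoted IDC magnitudes (assumes SI prefixes, 8 bits/byte).
BITS_PER_BYTE = 8
GIGA = 10**9
EXA = 10**18

bits_2007 = 225 * 10**19                 # 2.25 x 10^21 bits, as quoted
bytes_2007 = bits_2007 // BITS_PER_BYTE  # exact integer division

exabytes = bytes_2007 / EXA              # quoted as "281 exabytes"
gigabytes = bytes_2007 / GIGA            # quoted as "281 billion gigabytes"

containers_2011 = 20 * 10**15            # "20 quadrillion"
million_billion = 20 * 10**6 * 10**9     # "20 million billion"

print(f"{exabytes} EB, {gigabytes / 10**9} billion GB")
print(containers_2011 == million_billion)
```

Under these assumptions the bit count and the exabyte figure agree to the three significant digits given on the slide.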
  • 11. “Data” [epistemic] “Measurements, observations or descriptions of a referent -- such as an individual, an event, a specimen in a collection or an excavated/surveyed object -- created or collected through human interpretation (whether directly “by hand” or through the use of technologies)” -- AnthroDPA Working Group on Metadata (May, 2009)
  • 12. “The General Definition of Information (GDI)” σ is an instance of information, understood as semantic content, if and only if: • (GDI.1) σ consists of one or more data; • (GDI.2) the data in σ are well-formed; • (GDI.3) the well-formed data in σ are meaningful. Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
  • 13. “…with the corollary assumptions that they are objective -- that is, not conditioned by subjective perspectives -- and invariant -- that is, true under all circumstances.” -- Draft GBIF DPFTG Report, 2009. SEE: R. Nozick, Invariances: The Structure of the Objective World, Harvard University Press, Cambridge, 2001. AND L. Daston and P. Galison, Objectivity, Zone Books, NY, 2007.
  • 14. The Diaphoric Definition of Data (DDD): “According to GDI, information cannot be dataless but, in the simplest case, it can consist of a single datum. Now a datum is reducible to just a lack of uniformity (diaphora is the Greek word for “difference”), so a general definition of a datum is: The Diaphoric Definition of Data (DDD): A datum is a putative fact regarding some difference or lack of uniformity within some context. “Depending on philosophical inclinations, DDD can be applied at three levels: 1. data as diaphora de re, that is, as lacks of uniformity in the real world out there. There is no specific name for such “data in the wild”. A possible suggestion is to refer to them as dedomena (“data” in Greek; note that our word “data” comes from the Latin translation of a work by Euclid entitled Dedomena). Dedomena are not to be confused with environmental data (see section 1.7.1). They are pure data or proto-epistemic data, that is, data before they are epistemically interpreted. As “fractures in the fabric of being” they can only be posited as an external anchor of our information, for dedomena are never accessed or elaborated independently of a level of abstraction (more on this in section 3.2.2). They can be reconstructed as ontological requirements, like Kant's noumena or Locke's substance: they are not epistemically experienced but their presence is empirically inferred from (and required by) experience. Of course, no example can be provided, but dedomena are whatever lack of uniformity in the world is the source of (what looks to information systems like us as) data, e.g., a red light against a dark background. Note that the point here is not to argue for the existence of such pure data in the wild, but to provide a distinction that (in section 1.6) will help to clarify why some philosophers have been able to accept the thesis that there can be no information without data representation while rejecting the thesis that information requires physical implementation; …”
  • 15. The Diaphoric Definition of Data (DDD): (cont.) “2. data as diaphora de signo, that is, lacks of uniformity between (the perception of) at least two physical states, such as a higher or lower charge in a battery, a variable electrical signal in a telephone conversation, or the dot and the line in the Morse alphabet; and 3. data as diaphora de dicto, that is, lacks of uniformity between two symbols, for example the letters A and B in the Latin alphabet.” Luciano Floridi <luciano.floridi@philosophy.ox.ac.uk> “Semantic Conceptions of Information” (First published Wed Oct 5, 2005) Stanford Encyclopedia of Philosophy http://plato.stanford.edu/entries/information-semantic/ [visited 11/12/09]
  • 16. “Evidence”? “Data having probative value and authority”? i.e., well supported by scientific logic and considered technically valid
  • 17. Policy Formation and Decision Making
  • 18. Political Power and Knowledge [figure: axes “Responsibility and Power” and “Knowledge (in Western scientific terms)”, with scale labels High and Low; groups shown: “???”, Politicians, Administrators or Managers, Analysts/Technicians, Scientists] (Sutton, 1999) From: Organizaciones que aprenden, paises que aprenden: lecciones y AP en Costa Rica by Andrea Ballestero, Director, ELAP
  • 19. Wednesday, January 21st, 2009 at 12:00 am. MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES. SUBJECT: Freedom of Information Act. A democracy requires accountability, and accountability requires transparency. As Justice Louis Brandeis wrote, "sunlight is said to be the best of disinfectants." In our democracy, the Freedom of Information Act (FOIA), which encourages accountability through transparency, is the most prominent expression of a profound national commitment to ensuring an open Government. At the heart of that commitment is the idea that accountability is in the interest of the Government and the citizenry alike. The Freedom of Information Act should be administered with a clear presumption: In the face of doubt, openness prevails. The Government should not keep information confidential merely because public officials might be embarrassed by disclosure, because errors and failures might be revealed, or because of speculative or abstract fears. Nondisclosure should never be based on an effort to protect the personal interests of Government officials at the expense of those they are supposed to serve. In responding to requests under the FOIA, executive branch agencies (agencies) should act promptly and in a spirit of cooperation, recognizing that such agencies are servants of the public. All agencies should adopt a presumption in favor of disclosure, in order to renew their commitment to the principles embodied in FOIA, and to usher in a new era of open Government. The presumption of disclosure should be applied to all decisions involving FOIA… [clip] Barack Obama http://www.whitehouse.gov/the_press_office/Freedom_of_Information_Act/
  • 20. “Declaration of Scientific Principles” in “The Commonwealth of Science”: “7. The pursuit of scientific inquiry demands complete intellectual freedom. And unrestricted international exchange of knowledge…” from “The Commonwealth of Science,” Nature No. 3753, October 4, 1941.
  • 21. August 4, 2009: the White House issued a memorandum stating unequivocally “Sound science should inform policy decisions”. “Science and Technology Priorities for the FY2011 Budget,” PR Orszag and JP Holdren, August 4, 2009, Memorandum for the Heads of Executive Departments and Agencies, M-09-27. http://www.whitehouse.gov/omb/assets/memoranda_fy2009/m09-27.pdf
  • 22. The $3.6 billion Large Hadron Collider (LHC) will sample and record the results of up to 600 million proton collisions per second, producing roughly 15 petabytes (15 million gigabytes) of data annually in search of new fundamental particles. To allow thousands of scientists from around the globe to collaborate on the analysis of these data over the next 15 years (the estimated lifetime of the LHC), tens of thousands of computers located around the world are being harnessed in a distributed computing network called the Grid. Within the Grid, described as the most powerful supercomputer system in the world, the avalanche of data will be analyzed, shared, re-purposed and combined in innovative new ways designed to reveal the secrets of the fundamental properties of matter. LHC source: http://public.web.cern.ch/public/en/LHC/L Source: http://public.web.cern.ch/Public/en/LHC/L
  • 23. “The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/imag
  • 24. “The Legacy of GenBank: The DNA Sequence Database That Set a Precedent,” 1663: Los Alamos Science and Technology Magazine August 2008 http://www.lanl.gov/news/1663/images/aug08/22lg.jpg
  • 25. The (US) NCAR Research Data Archive (RDA) “The NCAR Research Data Archive (RDA) is a comparatively small (currently 246 TB, less than 5% of the MSS [Mass Storage System] total size), but very important, part of the MSS stored data. The RDA has been curated by the staff in the Computational and Information Systems Laboratory for over 40 years, [emphasis added] and as such contains reference datasets used by large numbers of scientists. The RDA contents are long-term atmospheric (surface and upper air) and oceanographic observations, grid analyses of observational datasets, operational weather prediction model output, reanalyses, satellite derived datasets, and ancillary datasets, such as topography/bathymetry, vegetation, and land use. The RDA is not a static collection; it is now over 580 datasets with about 100 routinely updated and 10-20 new ones added each year.” C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” from the 4th International Digital Curation Conference December 2008, page 5. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 26. NCAR Research Data Archive (RDA). C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” from the 4th International Digital Curation Conference December 2008, page 7. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 27. http://www.ncdc.noaa.gov/img/climate/globalwarming/ar4-fig-3-9.gif
  • 28. Facebook? “Facebook, for example, uses more than 1 petabyte of storage space to manage its users’ 40 billion photos. (A petabyte is about 1,000 times as large as a terabyte, and could store about 500 billion pages of text.)” Training to Climb an Everest of Digital Data By ASHLEE VANCE NYT Published: October 11, 2009 http://www.nytimes.com/2009/10/12/technology/12data.html?_r=1
  • 29. “Vertical section drawing of Cavendish's torsion balance instrument including the building in which it was housed.” http://en.wikipedia.org/wiki/Cavendish_experiment
  • 30. http://www.newscientist.com/articleimages/mg12016390.100/0-four-fundamental-forces.html
  • 31. “Experiments to determine the density of the earth,” by Henry Cavendish, Esq., F.R.S. and A.S. Read June 21, 1798 (From the Philosophical Transactions of the Royal Society of London for the year 1798, Part II, pp. 469-526) From: http://www.archive.org/details/lawsofgravitatio00mackrich
  • 32. DATA SETS: some examples with “native metadata”
2-d_soil_temps.csv surface, and sub-surface soil temperatures (at 2cm and 8cm depths) measured at one location for a few days in order to calibrate a model of temperature propagation. Surface temperature was measured with an infrared thermometer, subsurface temperatures with a thermocouple.
----------------------------
5-minute_light_data_for_4_continuous_days_plus_reference.xls PPF (photosynthetic photon flux = photosynthetically active radiation 400-700nm) measured with an array of photodiodes calibrated to a Licor sensor, along a linear transect for a few days. used to get an idea of how much light plants along the transect are receiving.
----------------------------
CO2_of_air_at_different_heights_July_9.xls concentration of CO2 in the air during the evening for one day, measured with a Licor infrared gas analyzer and a series of relays and tubes with a pump. used to examine the gradient of CO2 coming from the soil when the air is still during the evening.
----------------------------
Fern_light_response.xls Light response curves for bracken ferns, measured with a Licor photosynthesis system. Fronds are exposed to different light levels and their instantaneous photosynthesis and conductance is measured. used in conjunction with the induction data (below) for physiological characterization of the ferns.
----------------------------
La_Selva_species_photosyntheis_table.xls incomplete data set on instantaneous photosynthesis rates for various tropical understory and epiphytic species grown in a shade house in Costa Rica.
----------------------------
manzanita_sapflow_12-5-07_to_7-7-08.xls instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root grown and CO2 production.
----------------------------
moisture_release_curves.xls percentage of water content, water potential (in MegaPascals) and temperature of soil samples, measured in the laboratory for calibration of water content with water potential. soil is from the James Reserve in California.
----------------------------
Photosynthetic_induction.xls [start of description garbled in extraction] … m/2/s and light level is probably 1000 micromoles. used to determine physiological characteristics of bracken ferns.
----------------------------
run_2_24-h_data_for_mesh.xls measurements of micrometeorological parameters on a moving shuttle, going from a clearing across a forest edge and into the forest for about 30 meters. Pyronometers facing up and down, pyrgeometer facing up and down, PAR, air temperature, relative humidity. Also data from a station fixed in the clearing and some derived variables calculated. used for examining edge effects in forests.
----------------------------
Segment_of_wallflower_compare_colorspaces_blur.xls pixel counts from images of wallflowers that were segmented into flower/not-flower under different color spaces. segmentation was made using a probability matrix of hand-segmented images. used to automatically count flowers in images collected after this training data was collected (and used to determine the best color space for this task).
  • 33. manzanita_sapflow_12-5-07_to_7-7-08.xls instantaneous sap flow data (as temperature differences on a constant temperature heat dissipation probe) for multiple branches of Manzanita, collected with a datalogger. used to correlate physiological activity with below-ground measures of root grown and CO2 production.
sbid | battery | datetime | heater_voltage | Manz1Sap1 | Manz1Sap2 | Manz1Sap3 | Manz1Sap4 | Manz2Sap5 | Manz2Sap6 | Manz2Sap7 | Manz3Sap10 | Manz3Sap8 | Manz3Sap9 | Manz4Sap11 | timestamp | Datagap | Julian
2 | 12.365 | 1196796112 | 2018.8 | 0.5585 | 0.51029 | 0.55517 | 0.54354 | 0.6067 | 0.52858 | 0.55351 | 0.59008 | 0.59506 | 0.60337 | 0.56514 | 12/4/07 11:21 | (blank) | 4.47351
3 | 12.348 | 1196796232 | 2017.9 | 0.55682 | 0.51028 | 0.5535 | 0.54352 | 0.60669 | 0.52857 | 0.55017 | 0.59007 | 0.59505 | 0.60336 | 0.56513 | 12/4/07 11:23 | 0 | 4.47490
4 | 12.357 | 1196796352 | 2018.6 | 0.55514 | 0.51027 | 0.55348 | 0.54351 | 0.60501 | 0.52855 | 0.55016 | 0.59005 | 0.59504 | 0.60501 | 0.56512 | 12/4/07 11:25 | 0 | 4.47628
5 | 12.354 | 1196796472 | 2017.6 | 0.55514 | 0.51026 | 0.55181 | 0.5435 | 0.60334 | 0.52855 | 0.54849 | 0.59004 | 0.59503 | 0.60334 | 0.56511 | 12/4/07 11:27 | 0 | 4.47767
6 | 12.334 | 1196796592 | 2018.3 | 0.55347 | 0.51026 | 0.55015 | 0.5435 | 0.60333 | 0.52854 | 0.54682 | 0.59004 | 0.59502 | 0.605 | 0.56511 | 12/4/07 11:29 | 0 | 4.47906
7 | 12.34 | 1196796712 | 2018.5 | 0.55014 | 0.50859 | 0.55014 | 0.54349 | 0.60332 | 0.53019 | 0.54349 | 0.59003 | 0.59501 | 0.60498 | 0.56676 | 12/4/07 11:31 | 0 | 4.48045
8 | 12.337 | 1196796832 | 2017.8 | 0.55013 | 0.50692 | 0.55013 | 0.54348 | 0.60332 | 0.53019 | 0.54182 | 0.59002 | 0.59501 | 0.60498 | 0.56675 | 12/4/07 11:33 | 0 | 4.48184
9 | 12.328 | 1196796952 | 2017.5 | 0.5468 | 0.50691 | 0.5468 | 0.54347 | 0.60331 | 0.53018 | 0.53849 | 0.59001 | 0.595 | 0.60497 | 0.56674 | 12/4/07 11:35 | 0 | 4.48323
10 | 12.323 | 1196797072 | 2017 | 0.54679 | 0.50524 | 0.54679 | 0.54347 | 0.59998 | 0.53017 | 0.53682 | 0.59 | 0.59499 | 0.60496 | 0.56674 | 12/4/07 11:37 | 0 | 4.48462
11 | 12.328 | 1196797192 | 2018.9 | 0.54679 | 0.50191 | 0.54512 | 0.5418 | 0.59665 | 0.53017 | 0.53349 | 0.59 | 0.59498 | 0.60496 | 0.56673 | 12/4/07 11:39 | 0 | 4.48601
12 | 12.319 | 1196797312 | 2017.7 | 0.54345 | 0.49857 | 0.54178 | 0.54178 | 0.59663 | 0.53015 | 0.53015 | 0.58998 | 0.5933 | 0.60327 | 0.56671 | 12/4/07 11:41 | 0 | 4.48740
13 | 12.311 | 1196797432 | 2017.3 | 0.54343 | 0.4969 | 0.54011 | 0.54177 | 0.59661 | 0.53014 | 0.52848 | 0.58997 | 0.59329 | 0.6016 | 0.5667 | 12/4/07 11:43 | 0 | 4.48878
14 | 12.316 | 1196797552 | 2018.6 | 0.5401 | 0.49357 | 0.53678 | 0.54176 | 0.59328 | 0.53013 | 0.5268 | 0.58995 | 0.59328 | 0.60325 | 0.56669 | 12/4/07 11:45 | 0 | 4.49017
15 | 12.31 | 1196797672 | 2016.8 | 0.53844 | 0.4919 | 0.53511 | 0.54176 | 0.59494 | 0.53013 | 0.52514 | 0.58995 | 0.59328 | 0.60325 | 0.56503 | 12/4/07 11:47 | 0 | 4.49156
16 | 12.31 | 1196797792 | 2017.1 | 0.53676 | 0.48856 | 0.53343 | 0.54174 | 0.59326 | 0.53011 | 0.5218 | 0.58993 | 0.59326 | 0.60323 | 0.56501 | 12/4/07 11:49 | 0 | 4.49295
17 | 12.31 | 1196797912 | 2017.1 | 0.53342 | 0.48523 | 0.5301 | 0.54173 | 0.59324 | 0.5301 | 0.51846 | 0.58826 | 0.59324 | 0.60321 | 0.56499 | 12/4/07 11:51 | 0 | 4.49434
18 | 12.301 | 1196798031 | 2017.5 | 0.53174 | 0.48521 | 0.52842 | 0.53839 | 0.59156 | 0.53008 | 0.51845 | 0.58824 | 0.59323 | 0.6032 | 0.56498 | 12/4/07 11:53 | 0 | 4.49573
19 | 12.301 | 1196798151 | 2016.3 | 0.53007 | 0.48188 | 0.52509 | 0.53838 | 0.59155 | 0.53007 | 0.51512 | 0.58823 | 0.59321 | 0.60152 | 0.5633 | 12/4/07 11:55 | 0 | 4.49712
20 | 12.303 | 1196798271 | 2016.6 | 0.5284 | 0.47855 | 0.52175 | 0.53837 | 0.59154 | 0.5284 | 0.5151 | 0.58821 | 0.59154 | 0.60151 | 0.56163 | 12/4/07 11:57 | 0 | 4.49851
Datum: “0.59998”
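A bare value such as “0.59998” says little until it is tied back to its row and column. The following sketch is an illustration added here, not something from the slides: it re-attaches the sap-flow column names to one record and looks up where the datum sits. The header and row strings are copied from the table above as best they can be read from the transcript (the split between the Julian value and the next row's sbid is an assumption), and the function names are hypothetical.

```python
# Column names as they appear in the slide's sap-flow table.
HEADER = (
    "sbid battery datetime heater_voltage "
    "Manz1Sap1 Manz1Sap2 Manz1Sap3 Manz1Sap4 "
    "Manz2Sap5 Manz2Sap6 Manz2Sap7 "
    "Manz3Sap10 Manz3Sap8 Manz3Sap9 Manz4Sap11 "
    "timestamp Datagap Julian"
).split()

# Record with sbid 10, as read from the transcript (assumed row boundaries).
ROW = (
    "10 12.323 1196797072 2017 0.54679 0.50524 0.54679 0.54347 "
    "0.59998 0.53017 0.53682 0.59 0.59499 0.60496 0.56674 "
    "12/4/07 11:37 0 4.48462"
)


def parse_row(header, line):
    """Pair each whitespace-separated value with its column name."""
    parts = line.split()
    # The timestamp is printed as two tokens (date, time); rejoin them.
    i = header.index("timestamp")
    parts[i:i + 2] = [" ".join(parts[i:i + 2])]
    if len(parts) != len(header):
        raise ValueError("row does not match header")
    return dict(zip(header, parts))


def locate(record, value):
    """Return the column names whose cell text equals `value`."""
    return [name for name, cell in record.items() if cell == value]


record = parse_row(HEADER, ROW)
print(locate(record, "0.59998"), record["timestamp"])
```

With the column, probe naming, and timestamp attached, the number becomes a measurement on a particular sensor at a particular time; without them it is only a string of digits.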
  • 34. “Jim Gray on eScience: A Transformed Scientific Method” T. Hey, S. Tansley, and K. Tolle (eds) | Microsoft Research. Based on the transcript of a talk given by Jim Gray to the NRC-CSTB in Mountain View, CA, on January 11, 2007 http://research.microsoft.com/en-us/collaboration/fourthparadigm/4th_paradigm_book_jim_gray_transcript.pdf
  • 35. “Reanalyses” [or Meta-Analyses] “Atmospheric reanalyses are a main feature within the RDA and were intended to be, and have become, a very valuable data resource for a wide variety of climate and weather studies. By combining many types of atmospheric observations with advanced data assimilation and forecast models a “best possible” 3D estimate of the atmospheric state over extended time periods is achieved. “Reanalyses are supported by many historical data sources that have been curated over time. As an illustration the major sources of atmospheric profile data include wind only soundings beginning in 1920 (Figure 2). These are augmented with soundings of temperature, humidity, and wind beginning in 1948.” C.A. Jacobs, S. J. Worley, “Data Curation in Climate and Weather: Transforming our ability to improve predictions through global knowledge sharing,” from the 4th International Digital Curation Conference December 2008, page 6. www.dcc.ac.uk/events/dcc-2008/programme/papers/Data%20Curation%20in%20Climate%20and%20Weather.pdf [03 02 09]
  • 36. Fundamental Questions: • Data Specification – scientific logic of data definition • Data Creation – specification of methodology • Data Integrity – preservation -- “chain of custody”: “Chain of custody refers to the chronological documentation or paper trail, showing the seizure, custody, control, transfer, analysis, and disposition of evidence, physical or electronic.” [http://en.wikipedia.org/wiki/Chain_of_custody, clipped 11/12/09 10:30pm PST] • Data transformations – Logic – Competence / Technical Performance / Execution
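One way to picture a “chain of custody” for a data file is a log in which every entry records who did what, a checksum of the data at that moment, and the hash of the previous entry. The sketch below is an assumption-laden illustration, not a method described in the talk: the class name `CustodyLog`, its fields, and the sample actors are all hypothetical, and a hash-chained log is only one of several possible designs.

```python
import hashlib
import json


def sha256_hex(data: bytes) -> str:
    """Hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


class CustodyLog:
    """Hypothetical append-only log; each entry is chained to the previous one."""

    def __init__(self):
        self.entries = []

    def record(self, actor: str, action: str, when: str, payload: bytes) -> None:
        body = {
            "actor": actor,
            "action": action,
            "when": when,
            "data_sha256": sha256_hex(payload),
            "prev_hash": self.entries[-1]["entry_hash"] if self.entries else "",
        }
        # Hash the entry body (before the hash field itself is added).
        body["entry_hash"] = sha256_hex(json.dumps(body, sort_keys=True).encode())
        self.entries.append(body)

    def verify(self) -> bool:
        """Recompute every link; any edited or reordered entry breaks the chain."""
        prev = ""
        for entry in self.entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            expected = sha256_hex(json.dumps(body, sort_keys=True).encode())
            if entry["prev_hash"] != prev or entry["entry_hash"] != expected:
                return False
            prev = entry["entry_hash"]
        return True


log = CustodyLog()
raw = b"0.54679,0.50524,0.54679,0.54347,0.59998\n"
log.record("datalogger", "acquisition", "2007-12-04T11:37", raw)
log.record("analyst", "transfer to archive", "2009-11-12T22:30", raw)
print(log.verify())
```

Matching `data_sha256` values across entries indicate the bytes were unchanged between steps; a transformation step would record a new checksum, which is where the slide's questions about logic and competence of execution come in.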
  • 37. “Keeping Raw Data in Context” “…any initiative to share raw clinical research data must also pay close attention to sharing clear and complete information about the design of the original studies. Relying on journal articles for study design information is problematic, for three reasons. First, journal articles often provide insufficient detail when describing key study design features such as randomization (1) and intervention details (2). Second, some data sets may come from studies with no publications [only 21% of oncology trials registered in ClinicalTrials.gov before 2004 and completed by September 2007 were published (3)]. Finally, investigators cannot reliably search journal articles for methodological concepts like “double blinding” or “interrupted time series,” crucial concepts for proper interpretation of the data. A mishmash of non-standardized databases of raw results and unevenly reported study designs is not a strong foundation for clinical research data sharing.” “We believe that the effective sharing of clinical research data requires the establishment of an interoperable federated database system that includes both study design and results data. A key component of this system is a logical model of clinical study characteristics in which all the data elements are standardized to controlled vocabularies and common ontologies to facilitate cross-study comparison and synthesis.” I Sim, et al. “Keeping Raw Data in Context” [letter] Science v 323, 6 Feb 2009, p713.
  • 38. “Increasing levels of coordinate digit noise associated with repeated projection transformations” Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
  • 39. “It is well known that cartographic coordinates stored in double precision are far more precisely specified than is merited by their accuracy, even for highly-accurate global datasets. Far more coordinate digit places are stored for the sake of avoiding machine error than are needed to define the location of map objects within the necessary tolerances for both absolute and relative accuracies.” “A careful look at the coordinate digits stored as double precision variables in a GIS yields a variety of interesting patterns that are a result of previous machine error, rounding error, measurement error, and so forth. Any slight cartographic alteration (rotation/skewing, clipping/sub-setting, reprojecting, etc.) can add noise into the coordinate and can be used to characterize a vector dataset.” Rice, Matt, Michael F. Goodchild, Keith C. Clarke (2005) "Cartographic Data Precision and Information Content". In Proceedings of Auto-Carto 2005: A Research Symposium. Las Vegas, Nevada, March 18-23, 2005.
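The effect Rice, Goodchild and Clarke describe can be explored with a small experiment. The sketch below is an illustration under stated assumptions, not the authors' procedure: it uses a spherical Mercator projection with an assumed sphere radius, an arbitrary sample coordinate given to four decimals, and hypothetical function names, and it simply projects and un-projects the point many times in double precision.

```python
import math

R = 6378137.0  # assumed sphere radius in metres, for illustration only


def to_mercator(lon_deg, lat_deg):
    """Forward spherical Mercator: degrees to projected metres."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y


def from_mercator(x, y):
    """Inverse spherical Mercator: projected metres back to degrees."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat


def round_trips(lon, lat, n):
    """Project and un-project a coordinate n times."""
    for _ in range(n):
        lon, lat = from_mercator(*to_mercator(lon, lat))
    return lon, lat


lon0, lat0 = -119.8489, 34.4140            # sample point stated to 4 decimals
lon1, lat1 = round_trips(lon0, lat0, 1000)

print(repr(lon1), repr(lat1))              # trailing digits may differ from the input
print(abs(lon1 - lon0), abs(lat1 - lat0))  # size of any accumulated noise
```

Any drift shows up only in the far decimal places, well below the four decimals the coordinate was stated to, which is the gap between stored precision and actual accuracy that the quotation points at.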
  • 40. GRIDS Data International Centers Collaborative Research Effort Individual National Disciplinary Initiatives Libraries Cooperative Projects Local / Individuals Personal Archiving “Small Science” “BIG Science”
  • 41. “Small Science”?
  • 42. The “small science,” independent investigator approach traditionally has characterized a large area of experimental laboratory sciences, such as chemistry or biomedical research, and field work and studies, such as biodiversity, ecology, microbiology, soil science, and anthropology. The data or samples are collected and analyzed independently, and the resulting data sets from such studies generally are heterogeneous and unstandardized, with few of the individual data holdings deposited in public data repositories or openly shared. The data exist in various twilight states of accessibility, depending on the extent to which they are published, discussed in papers but not revealed, or just known about because of reputation or ongoing work, but kept under absolute or relative secrecy. The data are thus disaggregated components of an incipient network that is only as effective as the individual transactions that put it together. Openness and sharing are not ignored, but they are not necessarily dominant either. These values must compete with strategic considerations of self-interest, secrecy, and the logic of mutually beneficial exchange, particularly in areas of research in which commercial applications are more readily identifiable. The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. Julie M. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in the Public Domain, Office of International Scientific and Technical Information Programs, Board on International Scientific Organizations, Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
  • 43. Maria Sibylla Merian, Metamorphosis insectorum Surinamensium (Metamorphosis of the Insects of Surinam), Amsterdam, 1705, figure 46. Hand-colored engraving (123) http://www.loc.gov/exhibits/dres/dre123.jpg
  • 44. DARWIN http://darwin-online.org.uk/converted/published/1975_NaturalSelection_F1583/1975_NaturalSelection_F1583_fig03.jpg http://www.nyu.edu/projects/materialworld/images/1_Darwin%20Tree%20B%2036.jpg
  • 45. FIELD NOTES FROM THE AMERICAN MUSEUM CONGO EXPEDITION 1909-1915 http://diglib1.amnh.org/cgi-bin/database/index.cgi
  • 46. http://diglib1.amnh.org/galleries/bats/taphozous_mauritianus.html
  • 47. Rheinardia ocellata, the Crested Argus. Photographed at night by an automatic camera-trap in the Ngoc Linh foothills (Quang Nam Province). Courtesy AMNH Center for Biodiversity and Conservation
  • 48. By Serge Bloch in NYT: Natalie Angier, “Tracking forest creatures on the move.” NYT Feb 2, 2009 SEE: http://www.nytimes.com/2009/02/03/science/03angier.html?_r=1&scp=1&sq=tracking%20mammals&st=cse http://www.jamesreserve.edu/webcams.lasso?CameraID=Cam14
  • 49. How many data sources contributed to this analysis?
  • 50. The “small science,” independent investigator approach traditionally hascharacterized a large area of experimental laboratory sciences, such aschemistry or biomedical research, and field work and studies, such asbiodiversity, ecology, microbiology, soil science, and anthropology. The dataor samples are collected and analyzed independently, and the resulting data independentlysets from such studies generally are heterogeneous and unstandardized, with unstandardizedfew of the individual data holdings deposited in public data repositories oropenly shared. The data exist in various twilight states of accessibility, depending on accessibilitythe extent to which they are published, discussed in papers but not revealed, orjust known about because of reputation or ongoing work, but kept underabsolute or relative secrecy. The data are thus disaggregated components ofan incipient network that is only as effective as the individual transactionsthat put it together. Openness and sharing are not ignored, but they are not togethernecessarily dominant either. These values must compete with strategicconsiderations of self-interest, secrecy, and the logic of mutually beneficialexchange, particularly in areas of research in which commercial applicationsare more readily identifiable.The Role of Scientific and Technical Data and Information in the Public Domain: Proceedings of a Symposium. JulieM. Esanu and Paul F. Uhlir, Eds. Steering Committee on the Role of Scientific and Technical Data and Information in thePublic Domain Office of International Scientific and Technical Information Programs Board on International ScientificOrganizations Policy and Global Affairs Division, National Research Council of the National Academies, p. 8
  • 51. GRIDS Data International Centers Collaborative Research Effort Individual National Disciplinary Initiatives Libraries Cooperative Projects Local / Individuals Personal Archiving “Small Science” “BIG Science”
  • 52. Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430 http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf
  • 53. Green, T (2009), “We Need Publishing Standards for Datasets and Data Tables”, OECD Publishing White Paper, OECD Publishing. doi: 10.1787/603233448430 http://dx.doi.org/10.1787/603233448430 http://ocde.p4.siteinternet.com/publications/doifiles/publishing-standards-data-2009.pdf
  • 54. What does “Full Life-Cycle” Data Management Mean?
  • 55. US NSF “DataNet” Program: “the full data preservation and access lifecycle” • “acquisition” • “documentation” • “protection” • “access” • “analysis and dissemination” • “migration” • “disposition” “Sustainable Digital Data Preservation and Access Network Partners (DataNet) Program Solicitation” NSF 07-601, US National Science Foundation, Office of Cyberinfrastructure, Directorate for Computer & Information Science & Engineering
  • 56. Incentives?
  • 57. How do we Incentivize Change? • Individuals • Professions / Disciplines • Organizations • Institutions (Universities, Research Institutes, Museums, Gardens, Herbaria, Aquariums, Zoos) • “Memory Institutions” (Libraries, Archives) • Governments • Funders / Sponsors • Publishers!
  • 58. Individual’s willingness to share: the Core functions of Scholarly Communication • “Registration, which allows claims of precedence for a scholarly finding.” • “Certification, which establishes the validity of a registered scholarly claim.” • “Awareness, which allows participants in the scholarly system to remain aware of new claims and findings.” • “Archiving, which preserves the scholarly record over time.” • “Rewarding, which rewards participants for their performance in the communication system based on metrics derived from that system.” Roosendaal, H., Geurts, P. in Cooperative Research Information Systems in Physics (Oldenburg, Germany, 1997).
  • 59. Professional / Disciplinary Incentives?
  • 60. • Norms and standards for sharing vary by discipline • In “big science” (astrophysics / astronomy / meteorology / oceanography / genomics) sharing is expected (if not required) and contributions to a common fund of knowledge are assumed (See also: GENBANK) – Standards are relatively clear – Mechanisms for sharing are well-developed – Collective / collaborative authorship is commonplace • In “small science” such norms are weaker
  • 61. Small Science: Data Deposit and Access • Data are typically held in many formats • Discovery of data is very weakly supported by standards-development • Access to and use of data are highly variable • [However, progress has been made respecting museum specimen data in the past 20 years (SEE, for example, GBIF and many allied projects)] • Some progress has been made respecting observational and other data • Ecological and conservation field data remain highly problematic
  • 62. Some suggestions for action include: • government agencies and private foundations must both set strict requirements for effective sharing, with serious penalties (such as disqualification for future research funding) for failures to share; • peer review processes must include rigorous scrutiny of past histories of sharing and must require state-of-the-art planning for sharing (not simply a promise to “put data up on the Web”); • negotiations for “overhead” (“indirect costs”) compensation from funders must include examination of digital infrastructure adequate for sharing and maintenance of data; • accreditation bodies for educational institutions and museums must start to require demonstrated evidence of capacity to support digital access and maintenance of data; • professional societies and professional disciplines must begin to require evidence of effective sharing of data in evaluating credentials for hiring, promotion and tenure.
  • 63. http://www.mikero.com/blog/2009/02/20/more-darwin http://www.zazzle.com/darwin2009
  • 64. From: Tom Moritz [mailto:tom.moritz@gmail.com] Sent: Thursday, November 12, 2009 1:46 AM To: Donat Agosti Subject: Snapple Real Fact #134: "An ant can lift 50 times its own weight." Is this true? Tom ________________________________________________ From: Donat Agosti <agosti@amnh.org> Date: Wed, Nov 11, 2009 at 8:03 PM Subject: RE: Snapple Real Fact #134: "An ant can lift 50 times its own weight." To: Tom Moritz tom.moritz@gmail.com People says so [emphasis added] – but we once looked for the evidence, but could not find a scientific paper confirming this. D
  • 65. Iobi Ludolfi aliàs Leut-holf dicti Historia Æthiopica, sive Brevis & succincta descriptio regni Habessinorum, quod vulgò malè Presbyteri Iohannis vocatur : 2009 Cambridge University Library “They [the hippopotami] present the following appearance; four-footed, with cloven hooves like cattle; blunt-nosed; with a horse’s mane, visible tusks, a horse’s tail and voice; big as the biggest bull. Their hide is so thick that, when it is dried, spearshafts are made of it.” Herodotus, The Histories (with an English translation by A. D. Godley). Cambridge. Harvard University Press. 1920. LXXI http://old.perseus.tufts.edu/cgi-bin/ptext?doc=Perseus%3Aabo%3Atlg%2C0016%2C001&query=2%3A71%3A1 [clipped 11/12/09]
  • 66. a problem with “evidence”… “…the great trouble with the world was that which survived was held in hard evidence as to past events. A false authority clung to what persisted, as if those artifacts of the past which had endured had done so by some act of their own will.” -- Cormac McCarthy, The Crossing
  • 67. “Πάντα ῥεῖ καὶ οὐδὲν μένει” Heraclitus: “Everything flows, nothing stands still.” All data is dynamic
  • 68. From examination of elephants’ skulls the early Greeks deduced that a species of humanoid Cyclops existed… (SEE, for example, The Odyssey and Ulysses’ encounter with Polyphemus on the island of Sicily…) http://www.amnh.org/exhibitions/mythiccreatures/land/greek.php
  • 69. Another deduction from the evidence of narwhal tusks… “In the Middle Ages, narwhal tusks were widely thought to be unicorn horns with magical, curative properties. Indeed, cups made from narwhal tusks (above) were thought to neutralize poisons and were highly valued.” http://www.amnh.org/exhibitions/mythiccreatures/land/unicorns.php
  • 70. Kirtland’s Warbler / Abaco Island, The Bahamas
  • 71. “NATIVE” METADATA: DEAD HARBOR SEAL and 5 CALIFORNIA CONDORS!!!
