Databasing the World: Biodiversity and the 2000s. Written by Bowker, G. C. Presented by Chen Zhang (Mike)
Four Key Aspects. Database Infrastructure: Standards (flexible, stable); Technology (stable); Communication. Data Sharing: Ownership; Disarticulation; Data collection.
Four Key Aspects. Distributed Collective Practice: Collaborative work; New Knowledge Economy. Accounting for Life: Development of Classification; Cladistics; The Future.
Standards. Why do we need standards? Example from the air-conditioner industry: the diameter of a screw must match the hole on the panel. Reasons for databases: a 'handshake' is needed among various media, for example the MIME (Multipurpose Internet Mail Extensions) protocol; each layer of infrastructure requires its own set of standards; standardized categories are needed.
Standards. The best standard will not always win. Some best-known standards: the QWERTY keyboard.
Standards. The best standard will not always win. Some best-known standards: the VHS (Video Home System) standard.
Standards. The best standard will not always win. Some best-known standards: the DOS computing system.
Standards. The best standard will not always win. Why? The best standard may not have the best market position. Standards setting is a key site of political work. An inferior standard may be backed by a political agency (such as a standards-setting body).
Standards. Interoperability. There is a continuum of strategies for standards setting, from "one standard fits all" to "let a thousand standards bloom".
Standards. Interoperability. Some related standards: 1. ANSI/NISO Z39.50. ANSI/NISO Z39.50 is the American National Standard Information Retrieval Application Service Definition and Protocol Specification for Open Systems Interconnection. It makes it easier to use large information databases by standardizing the procedures and features for searching and retrieving information.
Standards. Interoperability. Some related standards: 1. ANSI/NISO Z39.50. It allows a single enquiry over multiple databases. It is widely adopted in the library world.
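The idea of a single enquiry over multiple databases can be sketched in a few lines of Python. This is only an illustration of the pattern, not the Z39.50 protocol itself: the catalogue names, the titles, and the `search_all` helper are all hypothetical, and a real client would send the query over the network rather than filter local lists.

```python
from typing import Dict, List

# Hypothetical stand-ins for remote catalogues (assumption for illustration).
CATALOGUES: Dict[str, List[str]] = {
    "library_a": ["Origin of Species", "Sorting Things Out"],
    "library_b": ["Species Plantarum", "Memory Practices in the Sciences"],
}


def search_all(term: str, catalogues: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Run one query against every catalogue and group the hits by source."""
    needle = term.lower()
    return {
        name: [title for title in titles if needle in title.lower()]
        for name, titles in catalogues.items()
    }


hits = search_all("species", CATALOGUES)
for source, titles in hits.items():
    print(source, titles)
```

The user writes one query; the shared search interface is what lets it reach every catalogue without knowing how each one stores its records.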
Standards. Interoperability. Some related standards: 2. XML. Extensible Markup Language (XML) is a set of rules for encoding documents in machine-readable form. Two extremes: a. Colonial model; b. Democratic model (wins out). People's established computing environment.
Technology. Technology must be stable. There is nothing to guarantee the stability of vast data sets. Examples: the failure of Paul Otlet's well-catalogued microfiches; the development of computer memory; information that is hard to retrieve.
Technology. Technology must be stable. Data must remain accessible and usable. Infrastructure will require a continued maintenance effort. Reasons: a. Data is passed from one medium to another; b. Data is analyzed by successive generations of database technology.
Issues of Communication. Problem of reliable metadata. Metadata: data about data. The blue lines are metadata.
Issues of Communication. Problem of reliable metadata. Metadata gives standard names to certain kinds of data. Searchable: easy to search over multiple databases. Issue: how detailed should the description of the data be? Too little detail: the information about the data is useless. Too much detail: longer time, more work.
Issues of Communication. Dublin Core. The Dublin Core set of metadata elements provides a small and fundamental group of text elements through which most resources can be described and cataloged. The Simple Dublin Core Metadata Element Set (DCMES) consists of 15 metadata elements: Title, Creator, Subject, Description, Publisher, Contributor, Date, Type, Format, Identifier, Source, Language, Relation, Coverage, Rights.
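As a rough sketch of how the 15 elements constrain a description, the Python below stores a record as a plain dictionary and reports any key that is not a Simple Dublin Core element. The record values, the extra `habitat` field, and the `unknown_elements` helper are made up for illustration; only the element names come from the list above.

```python
# The 15 Simple Dublin Core elements, lower-cased for comparison.
DCMES_ELEMENTS = frozenset({
    "title", "creator", "subject", "description", "publisher",
    "contributor", "date", "type", "format", "identifier",
    "source", "language", "relation", "coverage", "rights",
})


def unknown_elements(record: dict) -> set:
    """Return the keys of `record` that are not Simple Dublin Core elements."""
    return {key for key in record if key.lower() not in DCMES_ELEMENTS}


# Illustrative record; every value here is a placeholder.
record = {
    "title": "Example specimen sheet",
    "creator": "Example Herbarium",
    "date": "2000-01-01",
    "format": "image/tiff",
    "habitat": "wetland",  # domain-specific detail outside Dublin Core
}

print(unknown_elements(record))
```

The `habitat` key shows the trade-off in metadata detail: a small shared element set is easy to search across databases, but discipline-specific detail has to live somewhere else.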
Ownership. Control of knowledge. Mid-nineteenth century: knowledge came only from professionally trained scientists and doctors. New information economy: knowledge comes from many people. Example: patient groups.
Ownership. Privacy. Keeping data private is difficult. Example: data is compiled by a third-party company to generate a new, marketable form of knowledge. New patterns of ownership: science has frequently been analyzed as a "public good", but with the increasing privatization of knowledge it is unclear to what extent the vaunted openness of the scientific community will last.
Disarticulation. The ideal database should, according to most practitioners, be theory-neutral, but should serve as a common basis for a number of scientific disciplines to progress. Example: the genome databank supports a new kind of science, but constructing arguments about genetic causation is not the same as the process of mapping the genome.
Data must be reusable by scientists
The data in a database should be easily manipulated by other scientists.
Data Collection. Biodiversity. Large-scale databases are being developed for a diverse array of animal and plant groups. Worldwide effort: IUBS, CODATA, IUMS. Dealing with old data: data was rolled into a theory, which should remember all its own data and, potentially, data that had not yet been collected.
Data Collection. Dealing with old data. Difficulties: scientific papers do not in general offer enough information to allow an experiment or procedure to be repeated. The distributed database is becoming a new model form of scientific publication in its own right. Issues of update: there is no automatic update from one field to a cognate one; scientists are not able to share information across disciplinary divides.
Data Collection. International Technoscience. Purpose: narrow the gaps between countries. Issues: people do not have equal knowledge; access is never really equal; governments have doubts about the usefulness of opening databases to the internet.
Distributed Collective Practice
Collaborative Work. Management structures in universities and industry still tend to support the heroic myth of the individual researcher. What kind of value do the large publishing houses add to journal production? Great attention must be paid to the social and organizational setting of technoscientific work.
New Knowledge Economy. Three central issues: the development of flexible, stable data standards; the generation of protocols for data sharing; the restructuring of scientific careers.
Accounting For Life
Development of Classification Introduction: PANDORA taxonomic database
Development of Classification. Importance of classification. 18th-19th centuries: a botanist must know all genera and commit their names to memory, but cannot be expected to remember all specific names (A. J. Cain, 1958). Later part of the 19th century: new information technologies developed, which permitted the easy storage and coding of larger amounts of data than could previously be easily manipulated (Chandler, 1977; Yates, 1989).
Development of Classification. Example of classification: paper-based archival practice. Issues: hard to reclassify. Type specimens had to be relocated physically, and so did series of articles or books.
Development of Classification. Example of classification: multifaceted classification system. Improvement: classifications can be ordered in multiple ways rather than in a single fixed order. Example: a collection of books might be classified using an author facet, a subject facet, and a date facet.
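A minimal Python sketch of the book example: the same collection is grouped by whichever facet is requested, so no single ordering is built into the records. The book entries and the `group_by_facet` helper are placeholders invented for this illustration.

```python
from collections import defaultdict

# Illustrative collection; titles, authors and dates are placeholder values.
books = [
    {"title": "Book A", "author": "Smith", "subject": "Botany", "date": 1958},
    {"title": "Book B", "author": "Jones", "subject": "Botany", "date": 1977},
    {"title": "Book C", "author": "Smith", "subject": "Zoology", "date": 1989},
]


def group_by_facet(items, facet):
    """Group item titles under each value of the chosen facet."""
    groups = defaultdict(list)
    for item in items:
        groups[item[facet]].append(item["title"])
    return dict(groups)


by_author = group_by_facet(books, "author")
by_subject = group_by_facet(books, "subject")
print(by_author)
print(by_subject)
```

Unlike the paper-based archive, nothing has to be physically moved to reclassify: changing the facet only changes the view.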
Development of Classification. Example of classification: hierarchical classification (for reading the past). E. F. Codd, in the early 1970s: split the physical storage of data in the computer from the representation of that data. Disadvantage: it becomes awkward to introduce other levels of taxonomic category as an afterthought. Improved method: one record for every name, regardless of its taxonomic level.
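The "one record for every name" method can be sketched as follows. This is an assumed layout, not the actual PANDORA schema: the `NameRecord` fields and the `lineage` helper are hypothetical. Each name, whatever its rank, is a single record pointing at its parent, so a new rank can be added later without changing the structure.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NameRecord:
    name: str
    rank: str               # free text, so a new rank needs no schema change
    parent: Optional[str]   # name of the parent record, None at the top


def lineage(name: str, index: Dict[str, NameRecord]) -> List[str]:
    """Walk parent links upward from `name` and list rank and name at each step."""
    chain = []
    current: Optional[str] = name
    while current is not None:
        record = index[current]
        chain.append(f"{record.rank}: {record.name}")
        current = record.parent
    return chain


index = {
    r.name: r
    for r in [
        NameRecord("Rosaceae", "family", None),
        NameRecord("Rosa", "genus", "Rosaceae"),
        NameRecord("Rosa canina", "species", "Rosa"),
    ]
}
before = lineage("Rosa canina", index)

# Introduce a subfamily as an afterthought: one new record, one changed link.
index["Rosoideae"] = NameRecord("Rosoideae", "subfamily", "Rosaceae")
index["Rosa"].parent = "Rosoideae"
after = lineage("Rosa canina", index)
print(before)
print(after)
```

With one table per rank, the same change would need a new table and new links between tables; here it is a single inserted record.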
Cladistics. Definition: a method of classifying species of organisms into groups called clades, which consist of 1) all the descendants of an ancestral organism and 2) the ancestor itself. Features: gives a more regular algorithm for determining phylogeny; focuses attention on shared, derived characteristics of a set of organisms; uses 'outgroup' comparisons to develop the classification system.
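The definition of a clade (an ancestor plus all of its descendants) maps directly onto a recursive walk over a rooted tree. The sketch below uses a small hypothetical tree with placeholder labels; it illustrates the definition only and is not a method for inferring the tree.

```python
# Hypothetical rooted tree: each ancestor maps to its direct descendants.
TREE = {
    "root": ["A", "X"],
    "X": ["B", "C"],
}


def clade(ancestor: str, tree: dict) -> set:
    """Return the ancestor together with all of its descendants."""
    members = {ancestor}
    for child in tree.get(ancestor, []):
        members |= clade(child, tree)
    return members


print(sorted(clade("X", TREE)))
print(sorted(clade("root", TREE)))
```

A group such as {"A", "B"} is not a clade of this tree, because no single ancestor has exactly those two taxa as its descendants.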
Cladistics Tree of life Cladists use cladograms, diagrams which show ancestral relations between taxa, to represent the evolutionary tree of life Charles Darwin (1809–1882) was the first to produce an evolutionary tree of life
Cladistics. Computer programs in cladistics. Analyses were undertaken using Swofford's (1985) package PAUP, version 2.4 installed on a Cyber mainframe computer and version 2.4.1 on an Amstrad 1512 PC. David Swofford's PAUP is a software package for inference of evolutionary trees. Purpose: follow a given algorithm for generating and testing cladograms.
Cladistics. Computer programs in cladistics. Issues: the packages produce variable results and cannot possibly look at all the possibilities, since this is an NP-complete problem. Algorithm issues.
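One way to see why a package cannot look at all the possibilities is to count the candidate trees. The sketch below uses the standard combinatorial result that n labelled taxa admit (2n - 3)!! distinct rooted, bifurcating trees; the function name is our own, and this is offered as background rather than as a description of how PAUP searches.

```python
def rooted_tree_count(n_taxa: int) -> int:
    """Number of distinct rooted, bifurcating, labelled trees: (2n - 3)!!."""
    if n_taxa < 2:
        return 1
    count = 1
    # Multiply the odd numbers 3, 5, ..., 2n - 3.
    for odd in range(3, 2 * n_taxa - 2, 2):
        count *= odd
    return count


for n in (3, 4, 5, 10, 20):
    print(n, rooted_tree_count(n))
```

Ten taxa already give more than 34 million trees, so programs fall back on heuristic searches, which is one reason different runs or packages can return different results.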
The Future. Storing life. Life is itself described as a program, with DNA being the code. If everything is information, then life can equally well be "stored".