Towards Supporting Data-Intensive Research

What principles will we use to design and develop future systems that deal with large volumes of complex data?

  1. Towards Supporting Data-Intensive Research. Jano van Hemert, The University of Edinburgh. research.nesc.ac.uk
  2. COMPUTER SCIENCE: Beyond the Data Deluge. Gordon Bell [1], Tony Hey [1], Alex Szalay [2]

     The demands of data-intensive science represent a challenge for diverse scientific communities.

     Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.

     Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).

     Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.

     Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have ...

     Figure caption: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. Credit: Jonathan Fay/Microsoft.

     [1] Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. [2] Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu

     Source: Science, Vol 323, 6 March 2009, p. 1297. Published by AAAS. Downloaded from www.sciencemag.org on July 6, 2009.
  3. Beyond the Data Deluge (the same Science page as slide 2).
  4. NEWS FEATURE, 2020 Computing. Nature, Vol 440, 23 March 2006.

     Everything, Everywhere

     Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.

     Image credit: J. Magee.
  5. Two journal pages, shown overlapping.

     BOOKS & ARTS. Nature, Vol 455, 4 September 2008.

     Distilling meaning from data

     Buried in vast streams of data are clues to new science. But we may need to craft new lenses to see them, explain Felice Frankel and Rosalind Reid.

     It is a breathtaking time in science as masses of data pour in, promising new insights. But how can we find meaning in these terabytes? To search successfully for new science in large datasets, we must find unexpected patterns and interpret evidence in ways that frame new questions and suggest further explorations. Old habits of representing data can fail to meet these challenges, preventing us from reaching beyond the familiar questions and answers.

     To extract new meaning from the sea of data, scientists have begun to embrace the tools of visualization. Yet few appreciate that visual representation is also a form of communication. A rich body of communication expertise holds the potential to greatly improve these tools. We propose that graphic artists, communicators and visualization scientists should be brought into conversation with theorists and experimenters before all the data have been gathered. If we design experiments in ways that offer varied opportunities for representing and communicating data, techniques for extracting new understanding can be made available.

     Visual representation is familiar in data-intensive fields. Years before a detector is built for a facility such as the Large Hadron Collider near Geneva, for example, physicists will have pored over simulations. They examine how important events will ‘look’ in the displays that reveal and communicate what is going on inside the machine. Such discussions tend to take place within the visual conventions of a field. But perhaps conversations might be broadened to consider alternative representations of the same data. These might suggest other approaches to collecting, organizing and querying data that will maximize the transparency of experimental results and thus aid intuition, discovery and communication.

     Unfortunately, visualization experts and communicators are often consulted only after ... they will create effective computer displays, slides and figures for publication. Meanwhile, they may be developing their tools in isolation, kept at arm’s length by scientists who are busy getting their experiments done. Opportunities for useful dialogue are thus squandered.

     When scientists, graphic artists, writers, animators and other designers come together to discuss problems in the visual representation of science, such as at the Image and Meaning workshops run by Harvard University (www.imageandmeaning.org), it becomes clear that representations repeatedly fail to communicate understanding or address obvious questions about the underlying data. A three-dimensional volume rendering may give no hint of important uncertainties or data gaps; solid surfaces or sharp edges may suggest data where they do not exist. A graphic artist might propose ways to reveal gaps or deviations from expectation early in an experiment, guiding subsequent data collection or highlighting new avenues of enquiry. When we asked Harvard University chemist George Whitesides to change the geometry of a self-assembled monolayer with clearly delineated hydrophobic and hydrophilic areas to create an image for submission to a journal, he found himself redesigning the experiment, and unexpected science emerged.

     ... those run by the US National Science Foundation’s Picturing to Learn project (www.picturingtolearn.org), teach us that attempting to visually communicate scientific data and concepts opens a path to understanding. When science and design students collaborate, their drive to understand one another’s ideas pushes them to create new ways of seeing science. Investment in visual communication training for young scientists will pay off handsomely for any data-intensive discipline.

     The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding (by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists) we cannot be sure whether a display is accurate or misleading.

     The greatest opportunity and risk lie in that last step in the path: understanding. Whether verbal or visual, any language that is garbled and inconsistent fails to do its job. Let’s talk. Let’s all talk.

     Felice Frankel is senior research fellow in the faculty of arts and sciences at Harvard University, Cambridge, Massachusetts 02138, USA. With G. M. Whitesides, she is co-author of On the Surface of Things: Images of the Extraordinary in Science. e-mail: felice_frankel@harvard.edu
     Rosalind Reid is executive director of the Initiative in Innovative Computing at Harvard University and former Editor of American Scientist.

     Figure caption: Discussing visual communication before designing experiments may reveal new science. Credit: D. Armendariz.

     COMMENTARY. Nature, Vol 440, 23 March 2006.

     Exceeding human limits

     Scientists are turning to automated processes and technologies in a bid to cope with ever higher volumes of data. But automation offers so much more to the future of science than just data handling, says Stephen H. Muggleton.

     The collection and curation of data throughout the sciences is becoming increasingly automated. For example, a single high-throughput experiment in biology can easily generate more than a gigabyte of data per day, and in astronomy automatic data collection leads to more than a terabyte of data per night. Throughout the sciences the volumes of archived data are increasing exponentially, supported not only by low-cost digital storage but also by the growing efficiency of automated instrumentation. It is clear that the future of science involves the expansion of automation in all its aspects: data ...

     Image credit: Firefly Productions/Corbis.
  6. Exceeding human limits, Stephen H. Muggleton (Commentary, Nature, Vol 440, 23 March 2006): a further page, cropped and laid over the two pages shown on slide 5. Only the passages below are legible.

     Pull quotes:
     “Owing to the scale and rate of data generation, computational models of scientific data now require automatic construction and modification.”
     “There is a severe danger that increases in speed and volume of data generation could lead to decreases in comprehensibility.”

     ... differences in the mathematical underpinnings of, say, differential equations, bayesian networks and logic programs make integrating these various models virtually impossible. Although hybrid models can be built by simply patching two models together, the underlying differences lead to unpredictable and error-prone behaviour when changes are made.

     One encouraging development in this respect is the emergence within computer science of new formalisms (5) that integrate, in a sound fashion, two major branches of mathematics: mathematical logic and probability calculus. Mathematical logic provides a formal foundation for logic programming languages such as Prolog, whereas probability calculus provides the basic axioms of probability for ...

     On such timescales it should become easier for scientists to reproduce new experiments and refute their hypotheses.

     Today’s generation of microfluidic machines is designed to carry out a specific series of chemical reactions ... but further flexibility could be added to this tool kit by developing what one might call a ‘chemical Turing machine’. The universal Turing machine, devised by Alan Turing in 1936, was intended to mimic the pencil-and-paper operations of a mathematician. The chemical Turing machine would be a universal processor capable of performing a broad range of chemical operations on both the reagents available to it at the start and those chemicals it later generates. The machine would automatically prepare and test chemical compounds but it would also be programmable, thus allowing much the same flexibility as a real chemist has in the lab. One can think of a chemical Turing ...

     Stephen H. Muggleton is ... Computing and the Centr... Systems Biology at Imper... 2BZ, UK.
  7. To be released under Creative Commons License this Thursday
  8. Science Paradigms (Figure 1):
       • Empirical: describing natural phenomena
       • Theoretical: using models, generalizations
       • Computational: simulating complex phenomena
       • Data exploration: unify theory, experiment, and simulation

     [eScien]ce: What Is It?

     [eScien]ce is where “IT meets scientists.” Researchers are using many different meth[ods to] collect or generate data, from sensors and CCDs to supercomputers and [particl]e colliders. When the data finally shows up in your computer, what do [you do] with all this information that is now in your digital shoebox? People are [contin]ually seeking me out and saying, “Help! I’ve got all this data. What am I ...

     To be released under Creative Commons License this Thursday
  9. Principles? Two journal pages, shown overlapping.

     SCIENCE AND GOVERNMENT, POLICY FORUM

     An International Framework to Promote Access to Data

     Peter Arzberger [1]*, Peter Schroeder [2], Anne Beaulieu [3], Geof Bowker [1], Kathleen Casey [1], Leif Laaksonen [4], David Moorman [5], Paul Uhlir [6], Paul Wouters [3]

     Recent national and multinational investments (1) in networking and continued gains in information technological capability (2) have given rise to a complex cyberinfrastructure that is rapidly increasing our ability to produce, manage, and use data (3). As research becomes increasingly global (4), data-intensive, and multifaceted (5, 6), it is imperative to address national and international data access and sharing issues systematically in a policy arena that transcends national jurisdictions. Open access to publicly funded data provides greater returns from the public investment in research, generates wealth through downstream commercialization of outputs, and provides decision-makers with facts needed to address complex, often transnational, problems. This article summarizes key findings of an international group that studied these issues on behalf of the Organ[iz]ation for Economic Cooperation and Development (OECD) (7), which resulted in a ministerial-level declaration (8).

     Legitimate restrictions on open access, and strong disincentives to sharing exist, based on concerns of protecting national security, privacy and confidentiality, intellectual property, and time-limited exclusive use [by] the scientific investigator. The lack of clear funding-agency policies in the face of strong competing interests, often far removed from academic research, poses problems for scientists in developing and developed countries and inhibit the advance of science for the public good. For example, research on cholera outbreaks and their relation to environmental factors (9) or on understanding global climate change (10) requires access to data drawn from many disciplines and sources. This issue has been a topic of recent debate and its resolution is a high priority in many scientific and policy-making communities (11–17).

     Analysis of these, and other examples (18), suggests that successful data access and sharing arrangements exhibit a number of key attributes and operating principles (see table, this page). Administrative and organizational management “domains” (see figure, this page) provide a framework for locating and analyzing where improvements can be made. Diversity in science suggests that a variety of institutional models and tailored data management approaches will be needed.

     Establishing and maintaining this infrastructure requires continued and dedicated budgetary planning, with appropriate financial support. The use of research data cannot be maximized if access, management, and preservation costs (including cost of documentation and metadata creation) are an afterthought or are insufficiently or inconsistently funded in research projects (19). D. Atkins et al. (3) recommend that roughly one-third of the provisioning and operations of cyberinfrastruc...

     Appropriate professional and career reward structures are necessary (20–22). The way scientists are being evaluated and how their careers are shaped are at stake. For example, researchers who have spent years on building new databases, such as the Sloan Digital Sky Survey in astronomy, have effectively put their scientific careers on hold even though these databases are critical for the future development of the field. These considerations apply equally to those who produce, manage, and reuse research data.

     At this point there is considerable heterogeneity in policies. In the United States, federal government databases are not copyright protected, whereas in the European Union government databases are eligible for protection under several database protection laws. Even within countries, different funding agencies have different stated policies; for example, in Canada, with three major science funding agencies, one follows the principles in the OECD declaration, one states access should not be a barrier, and a third has no policy (23). National laws and international agreements can directly affect data access and sharing practices.

     At the last meeting of the OECD Committee for Scientific and Technological Policy (CSTP) at the ministerial level, ministers endorsed a declaration (8) based on the principle that research data from public funding should be openly available. Furthermore, they invited OECD to develop a set of guidelines based on commonly agreed principles (similar to those in the table) to facilitate optimal cost-effective access to digital research data from public funding. It can be expected that these future guidelines will influence national and international regulation of research data, much as the OECD Guidelines on the Protection of Privacy (24), which have been a model for legislation all around the Western world.

     Although the involvement of researchers in resolving these issues is critical, many scientists remain ignorant about existing policies at their institutions or nations, let alone those of other countries. To ...

     Table: Operating principles for data access regimes
       • Openness
       • Transparency and active data dissemination
       • Assignment and assumption of formal responsibilities
       • Technical and semantic interoperability of databases
       • Quality control, data validation, authentication, and authorization
       • Operational efficiency and flexibility
       • Respect for intellectual property and other ethical and legal requirements
       • Management accountability, including funding approaches

     Figure: Domains of a data access regime. Data access and management domains: technological; institutional and managerial; financial and budgetary; legal and policy; cultural and behavioral.

     [1] University of California, San Diego, La Jolla, CA 92093, USA. [2] Ministry of Education, Culture and Science, Zoetermeer, Netherlands. [3] Networked Research and Digital Information, Royal Netherlands Academy of Arts and Sciences, Amsterdam, Netherlands. [4] CSC-Scientific Computing Ltd., Espoo, Finland. [5] Social Sciences and Humanities Research Council, Ottawa, Canada. [6] National Research Council, Washington, DC 20418, USA.

     Downloaded from www.sciencemag.org on August 30, 2009.

     COVER FEATURE

     The Changing Paradigm of Data-Intensive Computing

     Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio, Pacific Northwest National Laboratory

     Through the development of new classes of software, algorithms, and hardware, data-intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity ...

     ... [het]erogeneous full-scale simulations will require not only petaflop capabilities but also a computational infrastructure that permits model integration. Simultaneously, it must couple to huge databases created by an ever-increasing number of high-throughput instruments.” (2)
  10. 10. Principles?

POLICY FORUM: SCIENCE AND GOVERNMENT. An International Framework to Promote Access to Data. Peter Arzberger,1* Peter Schroeder,2 Anne Beaulieu,3 Geof Bowker,1 Kathleen Casey,1 Leif Laaksonen,4 David Moorman,5 Paul Uhlir,6 Paul Wouters3. Downloaded from www.sciencemag.org on August 30, 2009.

Recent national and multinational investments (1) in networking and continued gains in information technological capability (2) have given rise to a complex cyberinfrastructure that is rapidly increasing our ability to produce, manage, and use data (3). As research becomes increasingly global (4), data-intensive, and multifaceted (5, 6), it is imperative to address national and international data access and sharing issues systematically in a policy arena that transcends national jurisdictions. Open access to publicly funded data provides greater returns from the public investment in research, generates wealth through downstream commercialization of outputs, and provides decision-makers with facts needed to address complex, often transnational, problems. This article summarizes key findings of an international group that studied these issues on behalf of the Organisation for Economic Cooperation and Development (OECD) (7), which resulted in a ministerial-level declaration (8).

Legitimate restrictions on open access, and strong disincentives to sharing exist, based on concerns of protecting national security, privacy and confidentiality, intellectual property, and time-limited exclusive use by the scientific investigator. The lack of clear funding-agency policies in the face of strong competing interests, often far removed from academic research, poses problems for scientists in developing and developed countries and inhibit the advance of science for the public good. For example, research on cholera outbreaks and their relation to environmental factors (9) or on understanding global climate change (10) requires access to data drawn from many disciplines and sources. This issue has been a topic of recent debate and its resolution is a high priority in many scientific and policy-making communities (11–17).

Analysis of these, and other examples (18), suggests that successful data access and sharing arrangements exhibit a number of key attributes and operating principles (see table, this page). Administrative and organizational management “domains” (see figure, this page) provide a framework for locating and analyzing where improvements can be made. Diversity in science suggests that a variety of institutional models and tailored data management approaches will be needed.

Table: OPERATING PRINCIPLES FOR DATA ACCESS REGIMES. Openness; Transparency and active data dissemination; Assignment and assumption of formal responsibilities; Technical and semantic interoperability of databases; Quality control, data validation, authentication, and authorization; Operational efficiency and flexibility; Respect for intellectual property and other ethical and legal requirements; Management accountability, including funding approaches.

Figure: Domains of a data access regime. Data access (center); Technological; Institutional and managerial; Financial and budgetary; Legal and policy; Cultural and behavioral.

Establishing and maintaining this infrastructure requires continued and dedicated budgetary planning, with appropriate financial support. The use of research data cannot be maximized if access, management, and preservation costs (including cost of documentation and metadata creation) are an afterthought or are insufficiently or inconsistently funded in research projects (19). D. Atkins et al. (3) recommend that roughly one-third of the provisioning and operations of cyberinfrastructure ...

Appropriate professional and career reward structures are necessary (20–22). The way scientists are being evaluated and how their careers are shaped are at stake. For example, researchers who have spent years on building new databases, such as the Sloan Digital Sky Survey in astronomy, have effectively put their scientific careers on hold even though these databases are critical for the future development of the field. These considerations apply equally to those who produce, manage, and reuse research data.

At this point there is considerable heterogeneity in policies. In the United States, federal government databases are not copyright protected, whereas in the European Union government databases are eligible for protection under several database protection laws. Even within countries, different funding agencies have different stated policies; for example, in Canada, with three major science funding agencies, one follows the principles in the OECD declaration, one states access should not be a barrier, and a third has no policy (23). National laws and international agreements can directly affect data access and sharing practices.

At the last meeting of the OECD Committee for Scientific and Technological Policy (CSTP) at the ministerial level, ministers endorsed a declaration (8) based on the principle that research data from public funding should be openly available. Furthermore, they invited OECD to develop a set of guidelines based on commonly agreed principles (similar to those in the table) to facilitate optimal cost-effective access to digital research data from public funding. It can be expected that these future guidelines will influence national and international regulation of research data, much as the OECD Guidelines on the Protection of Privacy (24), which have been a model for legislation all around the Western world.

Although the involvement of researchers in resolving these issues is critical, many scientists remain ignorant about existing policies at their institutions or nations, let alone those of other countries. To ...

1University of California, San Diego, La Jolla, CA 92093, USA. 2Ministry of Education, Culture and Science, Zoetermeer, Netherlands. 3Networked Research and Digital Information, Royal Netherlands Academy of Arts and Sciences, Amsterdam, Netherlands. 4CSC-Scientific Computing Ltd., Espoo, Finland. 5Social Sciences and Humanities Research Council, Ottawa, Canada. 6National Research Council, Washington, DC 20418, USA.

COVER FEATURE: THE CHANGING PARADIGM OF DATA-INTENSIVE COMPUTING. Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio, Pacific Northwest National Laboratory. Through the development of new classes of software, algorithms, and hardware, data-intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity ...

“... heterogeneous full-scale simulations will require not only petaflop capabilities but also a computational infrastructure that permits model integration. Simultaneously, it must couple to huge databases created by an ever-increasing number of high-throughput instruments.”2
