Data-Intensive Research

Science is witnessing a data revolution. Data are now created by faster and cheaper physical technologies, software tools and digital collaborations; examples include satellite networks, simulation models and social network data. To transform these data successfully into information, then into knowledge and finally into wisdom, we need new forms of computational thinking. These may be enabled by building "instruments" that make data comprehensible for the "naked mind", much as telescopes reveal the universe to the naked eye. These new instruments must be grounded in well-founded principles to ensure they have the fidelity and capacity to transform complex, large-scale data into comprehensible forms; this demands new data-intensive methods.

"Data-intensive" refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools fail to address data-intensive challenges effectively, for several reasons that are all aspects of scalability. I will introduce three main aspects of data-intensive research and show how we are addressing the challenges that arise from the interaction of these aspects. I will use results from our interdisciplinary collaborations as examples of solutions to specific challenges that can arise when scaling up intensity.

Published in: Technology
  • * This is not about projects, publications
  • * One of the papers that is signposting
  • * Sensors, large machines, interaction with data (software), interaction between people, interaction of software on data, ...
  • * EMBL-EBI has now reached 4.5 petabytes
    * MESUR has 1 billion records on usage data
    * PACS at 160 GB in August 2009, quadruples every year
  • * More explicit forms of demands
  • * A proposed solution
    * How do you go about implementing a solution under the fourth paradigm?
  • * Formulation = an abstract description of the data-intensive challenge
    * Execution = an implementation of the challenge that runs on a computational platform
    * Interaction = necessary to manage the formulation process and to steer the execution
  • * Research focuses on progressing computer science
    * by evaluating both generic and tailored methodologies
    * in a multidisciplinary context with
    * rich use cases to test hypotheses
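The PACS note above (160 GB in August 2009, quadrupling every year) describes simple compound growth. The following is a minimal sketch of that arithmetic, assuming the quadrupling rate stays constant (an extrapolation the notes do not claim) and decimal units (1 GB = 10^9 bytes, 1 PB = 10^15 bytes). The function and constant names are hypothetical, chosen only for illustration.

```python
import math

START_GB = 160.0        # PACS volume in August 2009 (from the note above)
GROWTH_PER_YEAR = 4.0   # "quadruples every year" (from the note above)
GB_PER_PB = 1e6         # assumption: decimal units, 1 PB = 10^15 bytes


def projected_gb(years):
    """Volume in GB after `years` years of constant quadrupling (assumed)."""
    return START_GB * GROWTH_PER_YEAR ** years


def years_to_reach(target_gb):
    """Years until the volume reaches `target_gb` under the same assumption."""
    return math.log(target_gb / START_GB) / math.log(GROWTH_PER_YEAR)


if __name__ == "__main__":
    for n in range(4):
        print(f"after {n} year(s): {projected_gb(n):,.0f} GB")
    print(f"years to reach 1 PB: {years_to_reach(GB_PER_PB):.1f}")
```

Under these assumptions the archive would pass one petabyte a little over six years after August 2009. Real growth rates rarely stay constant for that long, so the result is only an order-of-magnitude illustration of why such volumes become a scalability problem.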
  • Data-Intensive Research

    1. Data-Intensive Research. Jano van Hemert, research.nesc.ac.uk. The University of Edinburgh.
    2. COMPUTER SCIENCE: "Beyond the Data Deluge". Gordon Bell and Tony Hey (Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA) and Alex Szalay (Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA; e-mail: szalay@jhu.edu). Science, Vol 323, 6 March 2009, p. 1297. Published by AAAS. Downloaded from www.sciencemag.org on July 6, 2009.
       The demands of data-intensive science represent a challenge for diverse scientific communities.
       Since at least Newton's laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.
       Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).
       Other research fields also face major data management challenges. In almost every laboratory, "born digital" data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.
       Over the past 40 years or more, Moore's Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have [...]
       Figure: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. Credit: Jonathan Fay/Microsoft.
    3. "Beyond the Data Deluge" (the same Science article page as slide 2; DOI 10.1126/science.1171406), with the summary highlighted: "The demands of data-intensive science represent a challenge for diverse scientific communities."
    4. News feature, 2020 Computing (Nature, Vol 440, 23 March 2006): "Everything, Everywhere". Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates. Credit: J. Magee.
