1. Advanced Data Mining and Integration Research for Europe (ADMIRE)
Jano van Hemert
research.nesc.ac.uk
2. Beyond the Data Deluge
COMPUTER SCIENCE
The demands of data-intensive science represent a challenge for diverse scientific communities.
Gordon Bell,1 Tony Hey,1 Alex Szalay2

Since at least Newton's laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.

Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6). Other research fields also face major data management challenges. In almost every laboratory, "born digital" data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.

Over the past 40 years or more, Moore's Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have [...]

Figure: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. Credit: Jonathan Fay/Microsoft.

1Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu

www.sciencemag.org SCIENCE VOL 323, 6 MARCH 2009, p. 1297. Published by AAAS.
3. (Same article page as slide 2: Beyond the Data Deluge.)
4. Distilling meaning from data
Vol 455 | 4 September 2008
BOOKS & ARTS
Buried in vast streams of data are clues to new science. But we may need to craft new lenses to see them, explain Felice Frankel and Rosalind Reid.

It is a breathtaking time in science as masses of data pour in, promising new insights. But how can we find meaning in these terabytes? To search successfully for new science in large datasets, we must find unexpected patterns and interpret evidence in ways that frame new questions and suggest further explorations. Old habits of representing data can fail to meet these challenges, preventing us from reaching beyond the familiar questions and answers.

To extract new meaning from the sea of data, scientists have begun to embrace the tools of visualization. Yet few appreciate that visual representation is also a form of communication. A rich body of communication expertise holds the potential to greatly improve these tools. We propose that graphic artists, communicators and visualization scientists should be brought into conversation with theorists and experimenters before all the data have been gathered. If we design experiments in ways that offer varied opportunities for representing and communicating data, techniques for extracting new understanding can be made available.

Visual representation is familiar in data-intensive fields. Years before a detector is built for a facility such as the Large Hadron Collider near Geneva, for example, physicists will have pored over simulations. They examine how important events will 'look' in the displays that reveal and communicate what is going on inside the machine. Such discussions tend to take place within the visual conventions of a field. But perhaps conversations might be broadened to consider alternative representations of the same data. These might suggest other approaches to collecting, organizing and querying data that will maximize the transparency of experimental results and thus aid intuition, discovery and communication.

Unfortunately, visualization experts and communicators are often consulted only after [...] they will create effective computer displays, slides and figures for publication. Meanwhile, they may be developing their tools in isolation, kept at arm's length by scientists who are busy getting their experiments done. Opportunities for useful dialogue are thus squandered. When scientists, graphic artists, writers, animators and other designers come together to discuss problems in the visual representation of science, such as at the Image and Meaning workshops run by Harvard University (www.imageandmeaning.org), it becomes clear that representations repeatedly fail to communicate understanding or address obvious questions about the underlying data. A three-dimensional volume rendering may give no hint of important uncertainties or data gaps; solid surfaces or sharp edges may suggest data where they do not exist. A graphic artist might propose ways to reveal gaps or deviations from expectation early in an experiment, guiding subsequent data collection or highlighting new avenues of enquiry. When we asked Harvard University chemist George Whitesides to change the geometry of a self-assembled monolayer with clearly delineated hydrophobic and hydrophilic areas to create an image for submission to a journal, he found himself redesigning the experiment, and unexpected science emerged.

[...] those run by the US National Science Foundation's Picturing to Learn project (www.picturingtolearn.org), teach us that attempting to visually communicate scientific data and concepts opens a path to understanding. When science and design students collaborate, their drive to understand one another's ideas pushes them to create new ways of seeing science. Investment in visual communication training for young scientists will pay off handsomely for any data-intensive discipline.

The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding (by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists) we cannot be sure whether a display is accurate or misleading. The greatest opportunity and risk lie in that last step in the path: understanding. Whether verbal or visual, any language that is garbled and inconsistent fails to do its job. Let's talk. Let's all talk.

Figure: Discussing visual communication before designing experiments may reveal new science. Credit: D. Armendariz.

Felice Frankel is senior research fellow in the faculty of arts and sciences at Harvard University, Cambridge, Massachusetts 02138, USA. With G. M. Whitesides, she is co-author of On the Surface of Things: Images of the Extraordinary in Science. E-mail: felice_frankel@harvard.edu
Rosalind Reid is executive director of the Initiative in Innovative Computing at Harvard University and former Editor of American Scientist.
5. Exceeding human limits
Vol 440 | 23 March 2006
COMMENTARY
Scientists are turning to automated processes and technologies in a bid to cope with ever higher volumes of data. But automation offers so much more to the future of science than just data handling, says Stephen H. Muggleton.
Credit: Firefly Productions/Corbis

The collection and curation of data throughout the sciences is becoming increasingly automated. For example, a single high-throughput experiment in biology can easily generate more than a gigabyte of data per day, and in astronomy automatic data collection leads to more than a terabyte of data per night. Throughout the sciences the volumes of archived data are increasing exponentially, supported not only by low-cost digital storage but also by the growing efficiency of automated instrumentation. It is clear that the future of science involves the expansion of automation in all its aspects: data [...]

"There is a severe danger that increases in speed and volume of data generation could lead to decreases in comprehensibility."

Mathematical logic provides a formal foundation for logic programming languages such as Prolog, whereas probability calculus provides the basic axioms of probability for [...] bayesian networks. [...] the sound mathematical underpinnings of, say, differential equations, bayesian networks and logic programs make integrating these various models virtually impossible. Although hybrid models can be built by simply patching two models together, the underlying differences lead to unpredictable and error-prone behaviour when changes are made. One encouraging development in this respect is the emergence within computer science of new formalisms that integrate, in a sound fashion, two major branches of mathematics: mathematical logic and probability calculus.

6. Exceeding human limits (continued)
One can think of a chemical Turing machine [...] The universal Turing machine, devised by Alan Turing in 1936, was intended to mimic the pencil-and-paper operations of a mathematician. A chemical Turing machine would be a universal processor capable of performing a broad range of chemical operations on both the reagents available to it at the start and those it later generates. The machine would not only automatically prepare and test chemical compounds but also be programmable, thus allowing much the same flexibility as a real chemist has in the lab.

Today's generation of microfluidic machines is designed to carry out a specific series of chemical reactions. However, flexibility could be added to this tool kit by developing a 'chemical Turing machine'. Such chips contain chambers, ducts, gates, reagent stores [...]

Stephen H. Muggleton is in the Department of Computing and the Centre for [...] Systems Biology at Imperial College London [...] 2BZ, UK.
9. Aims
• ADMIRE aims to deliver a consistent and easy-to-use technology for extracting information and knowledge.
• The project is motivated by the difficulty of extracting meaningful information by data mining combinations of data from multiple heterogeneous and distributed resources.
• It will also provide an abstract view of data mining and integration, which will give users and developers the power to cope with complexity and heterogeneity of services, data and processes.
11. Separating concerns
Tool level: accommodating user and application diversity; supports iterative DMI process development
• Many application domains
• Many tool sets
• Many process representations
• Many working practices
Gateway interface: one model, the DMI canonical representation and abstract machine
Enactment level: mapping, optimisation and enactment; composes or hides system diversity and complexity
• Many autonomous resources & services
• Multiple enactment mechanisms
• Multiple platform implementations
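The separation of concerns above (one canonical DMI representation at the gateway, many enactment mechanisms below it) can be sketched in a few lines. This is an illustrative sketch only: every class and method name here is invented for the illustration and is not part of the actual ADMIRE gateway interface.

```python
# Sketch: one canonical DMI process description, multiple enactment backends.
# All names are hypothetical; they illustrate the separation of concerns,
# not the real ADMIRE API.

class CanonicalProcess:
    """A DMI process expressed once, as a list of (operation, parameters) steps."""
    def __init__(self, steps):
        self.steps = steps

class LocalEnactor:
    """Enacts each step in-process (one possible enactment mechanism)."""
    def enact(self, process):
        return [f"local:{op}({params})" for op, params in process.steps]

class RemoteEnactor:
    """Ships each step to a remote service (another enactment mechanism)."""
    def enact(self, process):
        return [f"remote:{op}({params})" for op, params in process.steps]

class Gateway:
    """Tools submit the one canonical representation; the gateway selects
    and hides the enactment mechanism behind it."""
    def __init__(self, enactors):
        self.enactors = enactors
    def submit(self, process, target):
        return self.enactors[target].enact(process)

process = CanonicalProcess([("integrate", "sources=A,B"), ("classify", "k=10")])
gateway = Gateway({"local": LocalEnactor(), "remote": RemoteEnactor()})
print(gateway.submit(process, "local"))
print(gateway.submit(process, "remote"))
```

Tools above the gateway only ever see `CanonicalProcess`; adding a new platform means adding an enactor, not changing any tool.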
26. Data mining results
Table 1. Preliminary classification performance using 10-fold cross-validation

Gene expression    Sensitivity   Specificity
Humerus            0.7525        0.7921
Handplate          0.7105        0.7231
Fibula             0.7273        0.7180
Tibia              0.7467        0.7451
Femur              0.7241        0.7345
Ribs               0.5614        0.7538
Petrous part       0.7903        0.7538
Scapula            0.7882        0.7099
Head mesenchyme    0.7857        0.5507

Note: Sensitivity = true positive rate; Specificity = true negative rate.
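Sensitivity and specificity, as reported in Table 1, can be computed directly from a confusion matrix. A minimal sketch follows; the toy labels are invented for illustration and are unrelated to the gene-expression data above.

```python
# Sensitivity (true positive rate) and specificity (true negative rate)
# from binary labels and predictions. Toy data for illustration only.

def sensitivity_specificity(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 0]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(sens, spec)  # 0.75 0.75
```

Under 10-fold cross-validation, these two rates are computed on each held-out fold and then averaged, which is how a single sensitivity/specificity pair per tissue arises in Table 1.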
27. Where we are
• Architecture prototype works
• Intuitive workbench created
• These will be connected next
• Two more use cases
28. Team
National e-Science Centre
Malcolm Atkinson
Jano van Hemert
Liangxiu Han
Gagarine Yaikhom
Chee-Sun Liew
EPCC
Mark Parsons et al.
University of Vienna
Peter Brezany et al.
Universidad Politécnica de Madrid
Oscar Corcho
Slovak Academy of Sciences
Ladislav Hluchý
Fujitsu Labs Europe
David Snelling
ComArch SA
Marcin Choiński
http://www.admire-project.eu/
http://research.nesc.ac.uk/
Editor's Notes
* This is not about projects, publications
* Where did we suddenly appear from?
* One of the papers that is signposting
* Sensors, large machines, interaction with data (software), interaction between people, interaction of software on data, ...
* More explicit forms of demands
* More explicit forms of demands
* A proposed solution
* How do you go about implementing a solution under the fourth paradigm?