Edinburgh Data-Intensive Research

Data-intensive refers to huge volumes of data, complex patterns of data integration and analysis, and intricate interactions between data and users. Current methods and tools are failing to address data-intensive challenges effectively. They fail for several reasons, all of which are aspects of scalability. The deluge of computational methods and the plethora of computational systems prevent effective and efficient use of resources; user interfaces are not adopted at a sufficient rate to satisfy demand for scientific computing; and data and knowledge are created outside contexts suitable for effective collaborative research. The Edinburgh Data-Intensive Research group addresses these scalability issues by providing mappings from abstract formulations to concrete and optimised executions of research challenges, by developing intuitive interfaces for accessing and steering these executions, and by developing systems to aid in creating new research challenges. In this talk I will present several exemplars where we have dealt with scalability issues in scientific scenarios.
2. Diagram labels: Computer Science Research; Efficient distributed systems; Effective algorithms; Data-intensive computing.
3. Diagram labels: Computer Science Research; Efficient distributed systems; Effective algorithms; Data-intensive computing; Interdisciplinary Applications; Reusable computational models; Intuitive interfaces; Collaborative environments; New conceptual models for systems.
4. Diagram labels: Developmental Biology; Medical Genetics; Emergency Response; Chemistry; Brain Imaging; Neuroinformatics; Quantitative Genetics; Seismology; Interdisciplinary Applications; Reusable computational models; Intuitive interfaces; Collaborative environments.
Text in embedded screenshot:
... alpha release of a combined earthquake selection and waveform selection service combining the EMSC and the ORFEUS services. The web portal also includes a first test version of the underlying software structure of the distributed archive services of the Integrated European Distributed Archive (EIDA) for waveform data. The alpha release implies that a test version of the current service is made accessible to a selected group of scientists who are willing to test it and recommend modifications. Interested seismologists, students, researchers or network operators are encouraged to contact the NERIES Project Office if they would like to test the services. A short video presentation is available (http://www.neries-eu.org/main.php/demo.wmv?fileitem=8798210). Alessandro Spinuso, Sergio Rives, Luca Trani, Phetaphone Thomy, Rémy Bossu, Torild van Eck. (See figure 2 below.)
Real-time access to European BB data successively increasing
The Virtual European Broad-band Seismograph Network (VEBSN) is steadily increasing its size. Currently more than 270 stations are contributing data to the VEBSN in near real-time. For some tens of these stations we still need to compile the instrumentation and data details (data-less Seed volumes). An example of the earthquake in Greece on February 14, 2008 illustrates the available data. The VEBSN is a joint initiative of European-Mediterranean seismological networks. More information can be obtained from www.orfeus-eu.org/Data-info/vebsn.html.
Figure 3. The Greek earthquake of February 14, 2008 as recorded by the vertical component of broadband stations of the VEBSN (mainly in the European-Mediterranean area) and made available by ORFEUS. The VEBSN is currently still expanding.
10. Figure 3: Screenshots of the DGEMap Web Portal, showing the facility for adding new project details to the database.
Page 2, Deliverable D2.8. Design Study, Contract number 011993.
16. Scaling
• More users able to join in
• Deal with more experiments
• Better reproducibility (in progress)
Want your own scientific computing portal?
Ask me!
28. Data mining results
Table 1. Preliminary classification performance using 10-fold validation.

Gene expression    Sensitivity  Specificity
Humerus            0.7525       0.7921
Handplate          0.7105       0.7231
Fibula             0.7273       0.718
Tibia              0.7467       0.7451
Femur              0.7241       0.7345
Ribs               0.5614       0.7538
Petrous part       0.7903       0.7538
Scapula            0.7882       0.7099
Head mesenchyme    0.7857       0.5507

Note: Sensitivity is the true positive rate (how well we can predict that it is there). Specificity is the true negative rate (how well we can predict that it is not there).
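The two measures in Table 1 can be illustrated with a minimal sketch. It is not the code used to produce the table; the function name and the example labels below are hypothetical, chosen only to show how a true positive rate and a true negative rate are computed from binary predictions (1 = present, 0 = absent).

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels (1 = present, 0 = absent)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # present, predicted present
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # present, predicted absent
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # absent, predicted absent
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # absent, predicted present
    # Guard against an empty class to avoid division by zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels for illustration only (not data from the study).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```

In a 10-fold setting, these two values would typically be computed on each held-out fold and then averaged; how the figures in the table were aggregated is not stated on the slide.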
29. Scaling
• Size of experiment
• Volume of data
• Available resources
Want your own (distributed) data integration & mining?
Ask me!
35. Scaling
• Larger collaborations
• Handle more & diverse knowledge
• Speed-up “Fourth Paradigm”
(http://bit.ly/dwQzYe)
Want your own 3D visualisation & annotation?
Ask me!
36. Multi-disciplinary
[1] D. Rodríguez González, T. Carpenter, J.I. van Hemert, and J. Wardlaw. An open source toolkit for medical imaging de-identification. European Radiology, first online, 2010.
[2] R.R. Kitchen, V.S. Sabine, A.H. Sims, E.J. Macaskill, L. Renshaw, J.S. Thomas, J.I. van Hemert, J.M. Dixon, and J.M.S. Bartlett. Correcting for intra-experiment variation in Illumina BeadChip data is necessary to generate robust gene-expression profiles. BMC Genomics, 11, 2010.
[3] C.A. Morrison, N. Robertson, A. Turner, J. van Hemert, and J. Koetsier. Molecular Orbital Calculations of Inorganic Compounds, chapter 3.33, pages 261–267. Wiley-VCH, 3rd edition, 2010.
[4] Ales Tichopad, Tzachi Bar, Ladislav Pecen, Robert R. Kitchen, Mikael Kubista, and Michael W. Pfaffl. Quality control for quantitative PCR based on amplification compatibility test. Methods, 50:308–312, 2010.
[5] Robert R. Kitchen, Mikael Kubista, and Ales Tichopad. Statistical aspects of quantitative real-time PCR experiment design. Methods, 50:231–236, 2010.
[6] J. Koetsier, A. Turner, P. Richardson, and J.I. van Hemert. Rapid chemistry portals through engaging researchers. In IEEE 5th International Conference on e-Science, in press, 2009.
[7] Liangxiu Han, Jano van Hemert, Richard Baldock, and Malcolm P. Atkinson. Automating gene expression annotation for mouse embryo. In Ronghuai Huang, Qiang Yang, Jian Pei, et al., editors, Advanced Data Mining and Applications, 5th International Conference, volume LNAI 5678. Springer, 2009.
[8] J. O’Donoghue and J.I. van Hemert. Using the DCC Lifecycle Model to curate a gene expression database: A case study. International Journal of Digital Curation, in press, 2009.
[9] J.D. Armstrong and J.I. van Hemert. Towards a virtual fly brain. Philosophical Transactions A, 367(1896):2387–2397, June 2009.
38. Jano van Hemert, j.vanhemert@ed.ac.uk
Academics: Malcolm Atkinson
Research Assistants: Jos Koetsier, Liangxiu Han, David Rodriguez, Gagarine Yaikhom, Laura Valkonen
PhD Students: Thomas French, Luna De Ferrari, Rob Kitchen, Chee-Sun Liew, Fan Zhu
Research Students: Gary, Vijay, Hwee, Yue, Charalampos, Jeff, Gideon, Charis, Gareth, Harika, Andrejs
IDEA Lab 29: A scientific gateway for real-time geophysical experiments
http://research.nesc.ac.uk/partners/
Editor's Notes
* Research focuses on progressing computer science
* by evaluating both generic and tailored methodologies
* in a multidisciplinary context with
* rich use cases to test hypotheses
* Formulation = an abstract description of the data-intensive challenge
* Execution = an implementation of the challenge that runs on a computational platform
* Interaction = necessary to manage the formulation process and to steer the execution
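
The three terms above can be made concrete with a toy sketch. It is an illustration under stated assumptions, not the group's system: the `Formulation` class, the `execute` function, the `steer` callback, and the two example steps are all hypothetical names introduced here. The formulation is an abstract, platform-independent list of steps; the execution runs it on one concrete platform (here, plain in-memory Python); the interaction is a callback that can inspect progress and stop the run.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Formulation:
    """Abstract description of a challenge: ordered steps, no platform details."""
    name: str
    steps: List[Callable[[list], list]] = field(default_factory=list)


def execute(formulation: Formulation,
            data: list,
            steer: Optional[Callable[[str, list], bool]] = None) -> list:
    """Run the formulation on a local, in-memory platform.

    steer(step_name, intermediate) is called after each step; returning
    False stops the execution early (a stand-in for user interaction).
    """
    current = list(data)
    for step in formulation.steps:
        current = step(current)
        if steer is not None and not steer(step.__name__, current):
            break
    return current


# Two hypothetical steps for illustration.
def drop_missing(rows: list) -> list:
    return [r for r in rows if r is not None]


def square(rows: list) -> list:
    return [r * r for r in rows]


challenge = Formulation("toy-challenge", [drop_missing, square])

# Unsteered run: both steps are applied.
result = execute(challenge, [1, None, 2, 3])

# Steered run: the callback stops the execution right after the first step.
stopped = execute(challenge, [1, None, 2, 3],
                  steer=lambda name, rows: name != "drop_missing")
print(result, stopped)
```

The point of the separation is that the same `Formulation` could be handed to a different `execute` implementation (for example one targeting a distributed platform) without changing how the challenge is described.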
* scaling 1: Rapid to portal building
* scaling 2: portal to Gaussian use (140 students)
* mention myExperiment