Data-Intensive Research
                           Jano van Hemert
                            research.nesc.ac.uk
The University of Edinburgh
Downloaded from www.sciencemag.org on July 6, 2009
COMPUTER SCIENCE

Beyond the Data Deluge

Gordon Bell,1 Tony Hey,1 Alex Szalay2

The demands of data-intensive science represent a challenge for diverse scientific communities.

Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.

Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).

Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.

Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever-increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have

Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service.
CREDIT: JONATHAN FAY/MICROSOFT

1 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2 Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu

www.sciencemag.org    SCIENCE    VOL 323    6 MARCH 2009    1297
Published by AAAS
10.1126/science.1171406
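As a rough illustration of why the article favours analysing data on site over downloading it, the sketch below estimates how long it would take to move one petabyte (10^15 bytes, as defined in the article) across a network. The link speeds and the function name are assumptions chosen for illustration only; they do not come from the article.

```python
# Rough transfer-time estimate for one petabyte (10**15 bytes, as in the article).
# The link speeds below are illustrative assumptions, not figures from the source.

PETABYTE_BYTES = 10**15
SECONDS_PER_DAY = 86_400


def transfer_days(num_bytes: int, link_gbit_per_s: float) -> float:
    """Days needed to move num_bytes over a link sustaining link_gbit_per_s gigabits/s."""
    bytes_per_second = link_gbit_per_s * 1e9 / 8  # 8 bits per byte
    return num_bytes / bytes_per_second / SECONDS_PER_DAY


if __name__ == "__main__":
    for gbit in (0.1, 1.0, 10.0):  # hypothetical sustained link speeds
        days = transfer_days(PETABYTE_BYTES, gbit)
        print(f"{gbit:>5} Gbit/s -> {days:,.1f} days per petabyte")
```

Under the assumed sustained rate of 1 Gbit/s, a single petabyte takes roughly 93 days to transfer, which suggests why keeping analysis next to the stored data can be attractive at this scale.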




NEWS FEATURE 2020 COMPUTING    NATURE | Vol 440 | 23 March 2006

EVERYTHING, EVERYWHERE
Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.
J. MAGEE
o P,;,(>.?.;90:(,1;.=(/(7.,=<G(
                                      -     40(J<.(;5.(!%@)(10=(=.<.,=D5(A0J?>(=.IJ9=.B(                                                                                                                                                                   #;'#*"(2)              !"2:/"1#("-,0+"2*0+
                                                                                                                                                                                                                                                                               1#("2*0+!/(04")2(
                                                                                                                                                                                                                                                                                                             )581,7-56@.2.35=52/

                         3000                                                                                                                                                                 (7>.26!/,+058
                                                                                                                                                                                                                                                                 #;'#*'+.'(2                                                       &0(.3"!"#//+"50(
                                                                                                                                                                                                                                                                                                                                       )0'.3"!"#//+"50(
                                                                                                                                                                                                                                                                                                                               *#&"!"50(2)."(2)
                                                                                                                                                                                                                                                                                                                                     82)."!"#//+"50(




                                                 o 45,;(;5.(=.IJ9=.F.:;<(A.=.(>.19:.>G(
                                                                                                                                                                                                                                                                                               36$(0-,0+04,#                      50(2).")*,
                                                                                                                                                                                                                                                                                 !"5,)3"-,0+
                                                                                                                                                                                                                                                                                                                                          !"50(2).
                                                                                                                                                                                              #&&"#))0*"#1"4204(
                                                                                                                                                                                                                      (24").'$
                                                                                                                                                                                                                                                                                                *#&"!"5,)3"#;'#.")*,
                                                                                                                                                                                                                                                                                                           5(2)38#.2("-,0+         50(2)."2*0+"1#&#4
                                                                                                                                                                                                ,&."!"'(-#&"(24,0&#+
                                                                                                                                                                                                               30'),&4").'$ !")/2*"2$'*
                                                                                                                                                                                                                              ,&.2(%")*3"*+,&
                                                                                                                                                                                                    .","-(,."4204(
                                                                                                                                                                                             /(04"3'1"4204                         -(,."!")0*,0+"2$'*
                                                                                                                                                                                                                             (21")/2*"2$'*
                                                                                                                                                  )5=1/56!528023                      #&&"4+#*,0+        #&.,/0$2
                                                                                                                                                                                                               '(-#&").'$
                                                                                                                                                                                                                                                                $+,-./012                                                                                        G01;139

                                                 o 45.()D0;;9<5(Q0C.=:F.:;(,:>(L.,?;5(<.=C9D.(8=0C9>.=<(A.=.(8.=<J,>.>(;5,;(9;(
                                                                                                                                                                                   !"4+#*,0+
                                                                                                                                                                    ,&."!"(210.2")2&)
                                                                                                                                                             (210.2")2&)"2&%,(0&                                               !"+2#(&"$,)#-,+                                                                                                   10+"2*0+
                                                                                                                                                                                                                                     -(,."2$'*"(2)"!                                                                                !"#//+"2*0+
                                                                                                                                                                            /(04"/36)"4204                                                                                                                                       0,70)
                                                                                                                                                                                                                                                                                                                             -23#%,0'( -,0+"*0&)2(%
                                                                                                                                                                                                                      #1"!"/)6*3,#.
Disk space (Terabytes)
                                                                                                                                                                                                                                                                                                                               -23#%"2*0+

                         2500
                                                                                                                                                                                                                                                                                                                                         02*0+04,#




                                                     A,<(9F80=;,:;(,:>(:.D.<<,=7G(
                                                                                                                                                                                                                                                                                                                                       #&,1"-23#%
                                                                                                                                                                                     !"0**'/"0(4#&"/)6*3                                                                                                                                      *0&)2(%"-,0+
                                                                                                                                                                   30+0*2&2
                                                                                                                                                                                                                      /2(*2/."/)6*30/36)
                                                                                                                                                                                                       -(,."!"2$'*"/)6*30+                                      %5,718-052-5                         &2'(0,1#42
                                                                                                                                                                                                                                                                                                                                             2*0+046 2%0+'.,0&
                                                                                                                                                                                                                         1210(6
                                                                                                                                                                              )2:"(0+2)                            ;"!"2:/"/)6*30+!#    !"*04&,.,%2"&2'(0)*,
                                                                                                                                                                                                                            /)6*30&"-"(2%
                                                                                                                                                                                         !"1#((,#42"5#1
                                                                                                                                                                                                                           121"*04&,.,0&                                                                 /36),*#"*                                                       @5+0-025
                                                                                                                                            <51;139                                                                      /)6*30+")*,
                                                                                                                                                                                                                                                                                                                                                               +#&*2.
                                                                                                                                                                                                                                                                                                                                                                  &28"2&4+"!"12$




                                                 o 45.(D0<;(01(<;0=,-.(A0J?>(:..>(;0(H.(:.-0;9,;.>G(
                                                                                                                                                                                                                                                                                                                 0/."+2..

                                                                                                                                          !"420+")0*"+0&$0&                ,&.2++,42&*2
                                                                                                                                                                                                           -(,."!"/)6*30+
                                                                                                                                                                                                         )0*"$2%
                                                                                                                                                                                                                                        "1320/0456!-052-5                                                                                              !"&2'(0)*,


                         2000                                                                                                      /"420+04,)."#))0*
                                                                                                                                               /2.(0+"420)*,
                                                                                                                                                                                           -(,."!"$2%"/)6*30+
                                                                                                                                                                                         -(,."!")0*"/)6*30+
                                                                                                                                                                                                         *3,+$"$2%
[Figure: map of abbreviated journal titles, e.g. J Pers Soc Psychol, J Autism Dev Disord, Solid State Commun, J Magn Magn Mater, Phys Rev Lett; label positions did not survive extraction.]

                                                 o The service would need to be negotiated.
                                                 o Few people were able to influence the product design to ensure that it met
                                                     requirements.
                                                                                                                                                                                                                             ,&.2($,)*,/+")*,"(2%
                                                                                                                                                                                      !"#1"42(,#.(")0* #&&"*+,&"-,0*321
                                                                                                                                                                                                                                                                         *:.7=.-1;139                                                                                                   #.10)"2&%,(0&

                         1000
                                                                                                                                                                                                                                                                                                                                         !"&#."/(0$'*.)
                                                                                                                                                                                                                           #&&".(0/"/#2$,#.(                                                                                )6&.32.,*"*011'&
                                                                                                                                                                                                                                                                                                                                          #&#+"*321               )2/"/'(,5".2*3&0+ 2&2(4"5'2+
                                                                                                                                                                                                                                                                                                                                                                                        /+#&."/36),0+
                                                                                                                                                2'("!"0/2("(2)
                                                                                                                                                                         $2(1#.0+")'(4
                                                                                                                                                                       ,&."!"/(0$"(2)                   #$$,*.,0&
                                                                                                                                                                                                      ,&."!".'-2(*"+'&4"$          &2'(0+"(2)
                                                                                                                                                                                                                                         (2$0:"(2/
                                                                                                                                                                                                                                                                                !"*0&.(0+"(2+2#)2
                                                                                                                                                                                                                                                                                    ,&."!"/3#(1
                                                                                                                                                                                                                                                                                                                                              -,0)2&)"-,02+2*.(0& 2&%,(0&")*,".2*3&0+
                                                                                                                                                                                                                                                                                                                                                                  $2)#+,&#.,0&
                                                                                                                                                                                                                                                                                                                                                                 !"3#<#($"1#.2(
                                                                                                                                                                                                                                                                                                                                                                                                         $240712=52/.;
                                                                                                                                                      ,&."!"/(0$"2*0&
                                                                                                                                                                                                       $('4"#4,&4          /(2%"12$
                                                                                                                                                                                                                                                                            !"/3#(1")*,
                                                                                                                                                                                                                                                                                                                                  /36.0*321,).(6
                                                                                                                                                                                                                                                                                                                                     .#+#&.#                                       2&%,(0&"/0++'.
                                                                                                                                                                                                                                                                                                                                                                                                            !-052-5
                                                                                                                                                                                                                                                                                                                                                                                   8#.2("#,("#&$")0,+"/0++'.,0&
                                                                                                                                                                                                                 !"3'1"&'.("$,2. !"#&#+".0:,*0+
                                                                                                                                                                                                          1'+.")*+2(                                                                                                                                                            5'2+                      2&2(4"*0&%2()"1#&#42
                                                                                                                                    A57=./1;139                    ,&."!"$2(1#.0+
                                                                                                                                                                    -(,."!"$2(1#.0+
                                                                                                                                                        !"2'("#*#$"$2(1#.0+
                                                                                                                                                                                                                *&)"$('4)
                                                                                                                                                                                                                                   !"/3#(1"/3#(1#*0+
                                                                                                                                                                                                                                                               /3#(1#*2'."(2)
                                                                                                                                                                                                                                                                        $('4"$2%",&$"/3#(1
                                                                                                                                                                                                                                                                               2'("!"/3#(1"-,0/3#(1
                                                                                                                                                                                                                                                                                                                   !"/3#(1#*2'."-,012$     2+2*.(0#&#+
                                                                                                                                                                                                                                                                                                                                  2+2*.(0/30(2),)
                                                                                                                                                                                                                                                                                                                              !"*3(01#.04("-
                                                                                                                                                                                                                                                                                                                                                                       *3210)/32(2
                                                                                                                                                                                                                                                                                                                                                               8#.2("(2)
                                                                                                                                                                                                                                                                                                                                                                              )*,".0.#+"2&%,(0&

                                                                                                                                                                                       /3#(1#*02*0&01,*)                                              *#(.04("!                                                                        #&#+"*3,1"#*.#                                 (2&28"2&2(4
                                                                                                                                              ,&."!"0/2("/(0$"1#&         #1"!"*+,&"$2(1#.0+ *+,&"$('4",&%2).                                                                                                                  #&#+6).(#/,$"*011'&"1#))")/                          1#("/0++'."-'++
                                                                                                                                                                                                                                                                                                                                                                        !"2&%,(0&"2&4!#)*2
                                                                                                                                                             *+,&"2:/"$2(1#.0+                                $('4")#52.6
                                                                                                                                                                                                         *+,&"/3#(1#*07,&2.
                                                                                                                                                                                                                    *'(("&2'(0%#)*"(2)
                                                                                                                                                                                                                     $('4)                                                                                                       !"#&#+"#.01")/2*.(01 8#.2(")*,".2*3&0+          2&%,(0&".0:,*0+"*321
                                                                                                                                                                                                                 /'-+,*"32#+.3"&'.(
                                                                                                                                                                                               2:/2(."0/,&"/3#(1#*0
                                                                                                                                                                                             -(,."!"*+,&"/3#(1#*0                                                                                                                  !"*3(01#.04("#
                                                                                                                                                                                                                                                                                                                             #&#+"-,0#&#+"*321
                                                                                                                                                                                                                                       #*.#"/#2$,#.(
                                                                                                                                                                                                                            *'(("$('4"12.#-                                   !"*321"(2)!)
                                                                                                                                                                                                                               /"&'.(")0*                                                                                                                                         /+#&."!
                                                                                                                                                                                                *'(("12$"(2)"0/,& *'(("12$"*321
                                                                                                                                                                                                        2:/2(."0/,&",&%"$('4
                                                                                                                                                                                                                   *'(("/3#(1"$2),4&
                                                                                                                                                                                                                          *'(("$('4".#(42.)                                                                                                                -,0.2*3&0+"-,02&4                  G01!525739
                                                                                                                                                                                                                                                                                                                                                                               -,01#))"9"-,02&2(46




Figure [number illegible]: Cumulative storage in the National PACS since implementation in September 200[?].
Note the largest most rapid expansion in space usage is with CT.

[Plot axes: vertical ticks 0, 500, 1000; horizontal axis 1996 to 2008]

                                 EMBL-EBI storage requirements until 2008                                                         Journal connectivity through scholarly usage data
                                                                                                                                                             Figure 3: Visualization of usage network created from MESUR’s 200M usage events.

                                         (source: Andrew Lyall)                                                                                    (source: MESUR project)
[Figure: Cumulative storage in the National PACS since implementation in September 2006. y-axis: cumulative GB storage, 0 to 35000; x-axis: month, Sep-06 to Aug-08; legend: CR, CT, DX, HC, MG, MR, NM, OT, PX, RF, SR, US, XA (source: Hamish McRitchie)]

5. USAGE-BASED METRICS

The journal usage and citation networks also enable the calculation of a variety of impact metrics. A total of 47 possible impact metrics were calculated, and the resulting rankings were analyzed to determine the degree to which usage- and citation-based metrics express similar or dissimilar aspects of scholarly impact.

5.1 Defining and validating usage-based metrics

The most common indicator of journal status is Thomson Scientific’s journal Impact Factor (IF) that is published every year for a set of about 8,000 selected journals. The IF is defined as the average citation rate for articles published in a particular journal. A similar statistical approach to journal ranking has been proposed for journal usage data on the basis of COUNTER reports, i.e. the average amount of usage recorded for the articles published in a journal [4, 22].

However, to use a social analogy, one’s importance is not solely assessed on the basis of how many people one knows. Who one knows and how one is embedded in a network of social relationships are equally important factors. Network theory has produced a rich literature on indicators to determine different facets of a person’s status (e.g. prestige, popularity, trust) on the basis of social network structure, instead of using simple counts of the number of the person’s relationships. Many of these indicators have found applications in other domains. For example, the Google search engine uses the PageRank metric to rank web pages on the basis of the WWW’s hyperlink network structure. In addition, recent proposals have been made to rank journals according to their citation PageRank [2] and a range of social network [...]
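The two kinds of indicator contrasted in Section 5.1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MESUR project's actual computation: the function names `usage_impact_factor` and `pagerank`, the input shapes, and the damping value of 0.85 (the conventional default) are all hypothetical choices made here. The first function is a simple per-article average, the usage analogue of the Impact Factor; the second is a standard power-iteration PageRank over a weighted journal-to-journal network.

```python
from collections import defaultdict


def usage_impact_factor(usage_events, article_journal):
    """Average number of recorded usage events per published article, per journal.

    usage_events: iterable of article ids, one entry per usage event (assumed shape).
    article_journal: dict mapping every published article id to its journal.
    """
    usage_per_journal = defaultdict(int)
    for article_id in usage_events:
        usage_per_journal[article_journal[article_id]] += 1
    articles_per_journal = defaultdict(int)
    for journal in article_journal.values():
        articles_per_journal[journal] += 1
    return {
        journal: usage_per_journal[journal] / count
        for journal, count in articles_per_journal.items()
    }


def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on a weighted directed graph.

    edges: dict mapping (source, target) -> positive weight, e.g. how often
    usage or citations flow from one journal to another (assumed shape).
    """
    nodes = sorted({node for edge in edges for node in edge})
    out_weight = defaultdict(float)
    for (source, _), weight in edges.items():
        out_weight[source] += weight
    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes without outgoing edges is spread uniformly.
        dangling = sum(rank[node] for node in nodes if out_weight[node] == 0)
        base = (1.0 - damping + damping * dangling) / len(nodes)
        new_rank = {node: base for node in nodes}
        for (source, target), weight in edges.items():
            new_rank[target] += damping * rank[source] * weight / out_weight[source]
        rank = new_rank
    return rank
```

The contrast in the text shows up directly: `usage_impact_factor` only counts events, whereas `pagerank` rewards a journal for being reached from journals that are themselves well connected, so two journals with equal raw usage can receive different ranks.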
Vol 455|4 September 2008

BOOKS & ARTS

Distilling meaning from data

Buried in vast streams of data are clues to new science. But we may need to craft new lenses to see them, explain Felice Frankel and Rosalind Reid.

It is a breathtaking time in science as masses of data pour in, promising new insights. But how can we find meaning in these terabytes? To search successfully for new science in large datasets, we must find unexpected patterns and interpret evidence in ways that frame new questions and suggest further explorations. Old habits of representing data can fail to meet these challenges, preventing us from reaching beyond the familiar questions and answers.

To extract new meaning from the sea of data, scientists have begun to embrace the tools of visualization. Yet few appreciate that visual representation is also a form of communication. A rich body of communication expertise holds the potential to greatly improve these tools. We propose that graphic artists, communicators and visualization scientists should be brought into conversation with theorists and experimenters before all the data have been gathered. If we design experiments in ways that offer varied opportunities for representing and communicating data, techniques for extracting new understanding can be made available.

Visual representation is familiar in data-intensive fields. Years before a detector is built for a facility such as the Large Hadron Collider near Geneva, for example, physicists will have pored over simulations. They examine how important events will ‘look’ in the displays that reveal and communicate what is going on inside the machine. Such discussions tend to take place within the visual conventions of a field. But perhaps conversations might be broadened to consider alternative representations of the same data. These might suggest other approaches to collecting, organizing and querying data that will maximize the transparency of experimental results and thus aid intuition, discovery and communication.

Unfortunately, visualization experts and communicators are often consulted only after [...] they will create effective computer displays, slides and figures for publication. Meanwhile, they may be developing their tools in isolation, kept at arm’s length by scientists who are busy getting their experiments done. Opportunities for useful dialogue are thus squandered.

When scientists, graphic artists, writers, animators and other designers come together to discuss problems in the visual representation of science, such as at the Image and Meaning workshops run by Harvard University (www.imageandmeaning.org), it becomes clear that representations repeatedly fail to communicate understanding or address obvious questions about the underlying data. A three-dimensional volume rendering may give no hint of important uncertainties or data gaps; solid surfaces or sharp edges may suggest data where they do not exist. A graphic artist might propose ways to reveal gaps or deviations from expectation early in an experiment, guiding subsequent data collection or highlighting new avenues of enquiry. When we asked Harvard University chemist George Whitesides to change the geometry of a self-assembled monolayer with clearly delineated hydrophobic and hydrophilic areas to create an image for submission to a journal, he found himself redesigning the experiment, and unexpected science emerged.

[Photo: Discussing visual communication before designing experiments may reveal new science. Credit: D. ARMENDARIZ]

[...] those run by the US National Science Foundation’s Picturing to Learn project (www.picturingtolearn.org), teach us that attempting to visually communicate scientific data and concepts opens a path to understanding. When science and design students collaborate, their drive to understand one another’s ideas pushes them to create new ways of seeing science. Investment in visual communication training for young scientists will pay off handsomely for any data-intensive discipline.

The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding (by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists) we cannot be sure whether a display is accurate or misleading.

The greatest opportunity and risk lie in that last step in the path: understanding. Whether verbal or visual, any language that is garbled and inconsistent fails to do its job. Let’s talk. Let’s all talk.

Felice Frankel is senior research fellow in the faculty of arts and sciences at Harvard University, Cambridge, Massachusetts 02138, USA. With G. M. Whitesides, she is co-author of On the Surface of Things: Images of the Extraordinary in Science.
e-mail: felice_frankel@harvard.edu
Rosalind Reid is executive director of the Initiative in Innovative Computing at Harvard University and former Editor of American Scientist.

Vol 440|23 March 2006

COMMENTARY

Exceeding human limits

Scientists are turning to automated processes and technologies in a bid to cope with ever higher volumes of data. But automation offers so much more to the future of science than just data handling, says Stephen H. Muggleton.

[Photo credit: FIREFLY PRODUCTIONS/CORBIS]

The collection and curation of data throughout the sciences is becoming increasingly automated. For example, a single high-throughput experiment in biology can easily generate more than a gigabyte of data per day, and in astronomy automatic data collection leads to more than a terabyte of data per night. Throughout the sciences the volumes of archived data are increasing exponentially, supported not only by low-cost digital storage but also by the growing efficiency of automated instrumentation. It is clear that the future of science involves the expansion of automation in all its aspects: data [...]
[...] and charge distribution of individual molecules need to be integrated with models describing the interdependency of chemical reactions. However, differences in mathematical underpinnings of, say, differential equations, bayesian networks and logic programs make integrating these various models virtually impossible. Although hybrid models can be built by simply patching two models together, the underlying differences lead to unpredictable and error-prone behaviour when changes are made.

One encouraging development in this respect is the emergence within computer science of new formalisms (ref. 5) that integrate, in a [...]

[...] On such timescales it should become easier for scientists to reproduce new experiments and refute their hypotheses.

Today’s generation of microfluidic machines is designed to carry out a specific series of chemical reactions, but further flexibility could be added to this tool kit by developing what one might call a ‘chemical Turing machine’. The universal Turing machine, devised in 1936 by Alan Turing, was intended to mimic the pencil-and-paper operations of a mathematician. The chemical Turing machine would be a universal processor capable of performing a broad range of chemical operations on both the reagents [...]

“Owing to the scale and rate of data generation, computational models of scientific data now require automatic construction and modification.”

“There is a severe danger that increases in speed and volume of data generation could lead to decreases in comprehensibility.”
                                                                                                                            ing — by conversing with
                                                                                                                            theorists, experimentalists,

                                                 available to it at the start andoffersto automated processes andof science thaninjustbid to cope with saysStephen H. Muggleton. a
                                                                                                 thoseof mathe- of input materials, with o
                                                                              Scientists are turning
                                                                                                             chemicals bothhandling, ever higher volumes of data.
                                                                                                                   technologies a
                                                                                                                                       data in the statement
    understanding can be made Discussing visual communication before designing experiments may reveal new science. instrument and software
    available.
                                                sound fashion, two major branches more to the future
                                                                              But automation         so much                developers, visualization
        Visual representation is familiar in data- that representations repeatedly fail to com- scientists, graphic artists and cognitive psy-


                                                matics: mathematical logic and probabilityauto- On such timescales it sho
                                                 it later generates. The machine would cal-                                    clear and undeniable
    intensive fields. Years before a detector is built municate understanding or address obvious chologists — we cannot be sure whether a dis-




                                                                                                                                                                                                           FIREFLY PRODUCTIONS/CORBIS
    for a facility such as the Large Hadron Collider questions about the underlying data. A three- play is accurate or misleading.                            The collection and curation
    near Geneva, for example, physicists will have dimensional volume rendering may give no                 The greatest opportunity and risk lie in that     of data throughout the

 s                                              culus. Mathematicaland test chemical com- scientists to reproduce n
                                                 matically prepare logic provides a formal                                     experimentation.
    pored over simulations. They examine how hint of important uncertainties or data gaps; last step in the path: understanding. Whether                      sciences is becoming increas-
    important events will ‘look’ in the displays solid surfaces or sharp edges may suggest data verbal or visual, any language that is garbled                ingly automated. For exam-
    that reveal and communicate what is going where they do not exist. A graphic artist might and inconsistent fails to do its job. Let’s talk.               ple, a single high-throughput

                                                 pounds but it would also be programmable, Stephen H. Muggleton is
 learning approaches foundation for logic programming languages refute their hypotheses.
    on inside the machine. Such discussions tend propose ways to reveal gaps or deviations from Let’s all talk.                                          I
                                                                                                                                                              experiment in biology can
    to take place within the visual conventions of expectation early in an experiment, guiding Felice Frankel is senior research fellow in the                easily generate more than a
                                                                                                                                             gigabyte of data per day, and in astronomy
    a field. But perhaps conversations might be subsequent data collection or highlighting new faculty of arts and sciences at Harvard University,

ng scientific models such as Prolog, much theprobability calculusa Computing and the Centr
                                                 thus allowing whereas same flexibility as
    broadened to consider alternative represen- avenues of enquiry. When we asked Harvard Cambridge, Massachusetts 02138, USA. With data collection leads to more than a
                                                                                                                                             automatic

                                                                                                                                   Today’s generation of m
    tations of the same data. These might suggest University chemist George Whitesides to G. M. Whitesides, she is co-author of terabyte of data per night. Throughout the sci-
                                                                                                                                             On the Surface
    other approaches to collecting, organizing and change the geometry of a self-assembled of Things: Images of the Extraordinary in Science. volumes of archived data are increas-
                                                                                                                                             ences the

                                                 real chemist has in the lab.
 p’ systems with no provides the basic axioms of probability for is designed to carry ou                                       Systems Biology at Imper
    querying data that will maximize the transpar- monolayer with clearly delineated hydropho- e-mail: felice_frankel@harvard.edu ing exponentially, supported not only by
    ency of experimental results and thus aid intui- bic and hydrophilic areas to create an image Rosalind Reid is executive director of the Initiative storage but also by the growing
    tion, discovery and communication.
                                                                                                                                             low-cost digital
                                                       for submission to a journal, he found himself in Innovative Computing at Harvard University of automated instrumentation. It is
                                                                                                                                             efficiency

  to the collection of                              One can think of a chemical Turing 2BZ, UK.
        Unfortunately, visualization experts and redesigning the experiment, and unexpected and former Editor of American Scientist. that the future of science involves the
    communicators are often consulted only after science emerged.
                                                                                                                                             clear
                                                                                                                                             expansion of automation in all its aspects: data
Released under Creative Commons License
Science Paradigms
          empirical
              describing natural phenomena
          theoretical
              using models, generalizations
          computational
              simulating complex phenomena
          data exploration
              unify theory, experiment, and simulation

       FIGURE 1

eSCIENCE: WHAT IS IT?

eScience is where “IT meets scientists.” Researchers are using many different methods to collect or generate data, from sensors and CCDs to supercomputers and particle colliders. When the data finally shows up in your computer, what do you do with all this information that is now in your digital shoebox? People are continually seeking me out and saying, “Help! I’ve got all this data. What am I [...]
Principles?
SCIENCE AND GOVERNMENT
POLICY FORUM

An International Framework to Promote Access to Data

Peter Arzberger,1* Peter Schroeder,2 Anne Beaulieu,3 Geof Bowker,1 Kathleen Casey,1 Leif Laaksonen,4 David Moorman,5 Paul Uhlir,6 Paul Wouters3

Downloaded from www.sciencemag.org on August 30, 2009

Recent national and multinational investments (1) in networking and continued gains in information technological capability (2) have given rise to a complex cyberinfrastructure that is rapidly increasing our ability to produce, manage, and use data (3). As research becomes increasingly global (4), data-intensive, and multifaceted (5, 6), it is imperative to address national and international data access and sharing issues systematically in a policy arena that transcends national jurisdictions. Open access to publicly funded data provides greater returns from the public investment in research, generates wealth through downstream commercialization of outputs, and provides decision-makers with facts needed to address complex, often transnational, problems. This article summarizes key findings of an international group that studied these issues on behalf of the Organization for Economic Cooperation and Development (OECD) (7), which resulted in a ministerial-level declaration (8).

Legitimate restrictions on open access and strong disincentives to sharing exist, based on concerns of protecting national security, privacy and confidentiality, intellectual property, and time-limited exclusive use by the scientific investigator. The lack of clear funding-agency policies in the face of strong competing interests, often far removed from academic research, poses problems for scientists in developing and developed countries and inhibits the advance of science for the public good. For example, research on cholera outbreaks and their relation to environmental factors (9) or on understanding global climate change (10) requires access to data drawn from many disciplines and sources. This issue has been a topic of recent debate and its resolution is a high priority in many scientific and policy-making communities (11–17).

Analysis of these, and other examples (18), suggests that successful data access and sharing arrangements exhibit a number of key attributes and operating principles (see table, this page). Administrative and organizational management “domains” (see figure, this page) provide a framework for locating and analyzing where improvements can be made. Diversity in science suggests that a variety of institutional models and tailored data management approaches will be needed.

Establishing and maintaining this infrastructure requires continued and dedicated budgetary planning, with appropriate financial support. The use of research data cannot be maximized if access, management, and preservation costs (including cost of documentation and metadata creation) are an afterthought or are insufficiently or inconsistently funded in research projects (19). D. Atkins et al. (3) recommend that roughly one-third of the provisioning and operations of cyberinfrastruc- [...]

Appropriate professional and career reward structures are necessary (20–22). The way scientists are being evaluated and how their careers are shaped are at stake. For example, researchers who have spent years on building new databases, such as the Sloan Digital Sky Survey in astronomy, have effectively put their scientific careers on hold even though these databases are critical for the future development of the field. These considerations apply equally to those who produce, manage, and reuse research data.

At this point there is considerable heterogeneity in policies. In the United States, federal government databases are not copyright protected, whereas in the European Union government databases are eligible for protection under several database protection laws. Even within countries, different funding agencies have different stated policies; for example, in Canada, with three major science funding agencies, one follows the principles in the OECD declaration, one states access should not be a barrier, and a third has no policy (23). National laws and international agreements can directly affect data access and sharing practices.

At the last meeting of the OECD Committee for Scientific and Technological Policy (CSTP) at the ministerial level, ministers endorsed a declaration (8) based on the principle that research data from public funding should be openly available. Furthermore, they invited OECD to develop a set of guidelines based on commonly agreed principles (similar to those in the table) to facilitate optimal cost-effective access to digital research data from public funding. It can be expected that these future guidelines will influence national and international regulation of research data, much as the OECD Guidelines on the Protection of Privacy (24), which have been a model for legislation all around the Western world.

Although the involvement of researchers in resolving these issues is critical, many scientists remain ignorant about existing policies at their institutions or nations, let alone those of other countries. To [...]

OPERATING PRINCIPLES FOR DATA ACCESS REGIMES
    Openness
    Transparency and active data dissemination
    Assignment and assumption of formal responsibilities
    Technical and semantic interoperability of databases
    Quality control, data validation, authentication, and authorization
    Operational efficiency and flexibility
    Respect for intellectual property and other ethical and legal requirements
    Management accountability, including funding approaches

Figure: Domains of a data access regime. Data access management domains: technological; institutional and managerial; financial and budgetary; legal and policy; cultural and behavioral.

1University of California, San Diego, La Jolla, CA 92093, USA. 2Ministry of Education, Culture and Science, Zoetermeer, Netherlands. 3Networked Research and Digital Information, Royal Netherlands Academy of Arts and Sciences, Amsterdam, Netherlands. 4CSC-Scientific Computing Ltd., Espoo, Finland. 5Social Sciences and Humanities Research Council, Ottawa, Canada. 6National Research Council, Washington, DC 20418, USA.

COVER FEATURE

THE CHANGING PARADIGM OF DATA-INTENSIVE COMPUTING

Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio, Pacific Northwest National Laboratory

Through the development of new classes of software, algorithms, and hardware, data-intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity

[...] heterogeneous full-scale simulations will require not only petaflop capabilities but also a computational infrastructure that permits model integration. Simultaneously, it must couple to huge databases created by an ever-increasing number of high-throughput instruments.”²
mperative to address national
                                                  Management accountability, including funding approaches                              protection under several data-
                                                                                                                                       base protection laws. Even with-
  d international data access and                                                                                                      in countries, different funding
                     Open access to publicly funded data pro- ciplines and sources. This issue has been a ence funding agencies
 aring issues systematically in a policy are- derstanding global climate change (10) re- agencies have different stated policies; for
    that transcends national jurisdictions. quires access to data drawn from many dis- example, in Canada, with three major sci-

                     vides greater returns from the public invest- topic of recent debate and its resolution is a principles in the OEC
 pen access to publicly funded data pro- ciplines and sources. This issue has been a ence funding agencies, one follows the
 des greater returns from the public invest- topic of recent debate and its resolution is a principles in the OECD declaration, one
 ent in research, generates wealth through high priority in many scientific and policy- states access should not be a barrier, and a
                     ment in research, generates wealth through high priority in many scientific and policy- states access should no
ownstream commercialization of outputs, making communities (11–17).
  d provides decision-makers with facts
                                                                                                                        third has no policy (23). National laws and
                                                                 Analysis of these, and other examples international agreements can directly af-

                     downstream commercialization of outputs, making communities (11–17).
  eded to address complex, often transna- (18), suggests that successful data access fect data access and sharing practices.
                                                                                                                                                                    third has no policy (23)
                                                                                                                                                                                                                                        THE CHANGING PARADIGM OF
onal, problems. This article summarizes and sharing arrangements exhibit a number                                           At the last meeting of the OECD Com-
 y findings of an international group that of key attributes and operating principles mittee for Scientific and Technological Poli-
                     and provides decision-makers with facts                       Analysis of these, and other examples international agreemen
udied these issues on behalf of the Organ- (see table, this page). Administrative and cy (CSTP) at the ministerial level, ministers
ation for Economic Cooperation and De- organizational management “domains” endorsed a declaration (8) based on the prin-

                     needed to address complex, often transna- (18), suggests that successful data access fect data access and sha
                                                                                                                                                                                                                                        DATA-INTENSIVE COMPUTING
  lopment (OECD) (7), which resulted in a (see figure, this page)                                                                                 ciple that research data
 inisterial-level declaration (8).                           provide a framework                                                                  from public funding
   Legitimate restrictions on open access, for locating and ana-                                                Technological
                                                                                                                                                  should be openly avail-
                     tional, problems. This article summarizes and sharing arrangements exhibit a number
  d strong disincentives to sharing exist, lyzing where improve-
  sed on concerns of protecting national se- ments can be made.
                                                                                                   Cultural      Data access      Institutional
                                                                                                                                                  invited OECD to devel-At the last meeting
                                                                                                                                                  able. Furthermore, they

  rity, privacy and confidentiality, intellec- Diversity in science                                                                               op a set of guidelines
                     key findings of an international group that of key attributes and operating principles mittee for Scientific and
 al property, and time-limited exclusive use suggests that a variety
    the scientific investigator. The lack of of institutional models
                                                                                                     and
                                                                                                 behaviorial
                                                                                                                management
                                                                                                                   domains
                                                                                                                                      and
                                                                                                                                  managerial      based on commonly
                                                                                                                                                  agreed principles (simi-

                     studied these issues on behalf of the Organ- (see table, this page). Administrative and cy (CSTP) at the minist
 ear funding-agency policies in the face of and tailored data man-
rong competing interests, often far re- agement approaches
 oved from academic research, poses prob- will be needed.
                                                                                                            L
                                                                                                            Legal
                                                                                                             and
                                                                                                            policy
                                                                                                                             Financial
                                                                                                                                an
                                                                                                                                and
                                                                                                                            budgetary
                                                                                                                                                  lar to those in the table)
                                                                                                                                                  to facilitate optimal
                                                                                                                                                  cost-effective access to
                     isation for Economic Cooperation and Richard T. Kouzes, GordonNorthwest National Laboratory Ian Gorton,“domains” endorsed a declaration (
 ms for scientists in developing and devel-
ped countries and inhibit the advance of maintaining this infra-
                                                                 De- organizational Stephen T. Elbert,
                                                                 Deborah K. Gracio, Pacific
                                                                                            A. Anderson,
                                                                                                         management and
                                                                 Establishing and Domains of a data access regime.                                digital research data
                                                                                                                                                  from public funding. It

                     velopment (OECD) (7), which resulted in a (see figure, this page)
 ience for the public good. For example, structure requires continued and dedicated can be expected that these future guidelines
 search on cholera outbreaks and their rela- budgetary planning, with appropriate fi- will influence national and international reg-
on to environmental factors (9) or on un- nancial support. The use of research data ulation of research data, much as the OECD
                                                                                                                                                                                          cipl
                     ministerial-level declaration (8).        Through the development of a framework
                                                                             provide new classes of erogeneous full-scale simulations will require not only                               from
                                                             cannot be maximized if access, manage- Guidelines on the Protection of Privacy (24),
University of California, San Diego, La Jolla, CA 92093,     ment, and preservation costs (including which have been a model for legislation all

                         Legitimate restrictions on open access, applications and hardware, data- ana- that permits model integration. Simultaneously, it
                                                                             for locating and
                                                             cost of documentation and metadata cre- around the Western world.                       Technological
                                                                                                                                                                                          sho
 SA.  2Ministry of Education, Culture and Science,

oetermeer, Netherlands. 3Networked Research and                software, algorithms,                        peta op capabilities but also a computational infrastruc-
                                                             ation) are an afterthought or are insuffi-                     Although the involvement of re-
 gital Information, Royal Netherlands Academy of Arts
                                                               intensive                 provide timely and ture
                                                             ciently or inconsistently funded in research searchers in resolving these issues is criti-
                                                               meaningful lyzing where improve-
                     and strong disincentives to sharing exist, analytical results in response must couple to huge databases created by an ever-in-                                       able
nd Sciences, Amsterdam, Netherlands. 4CSC-Scientific
omputing Ltd., Espoo, Finland. 5Social Sciences and Hu-      projects (19). D. Atkins et al. (3) recom- cal, many scientists remain ignorant about
 anities Research Council, Ottawa, Canada. 6National         mend that roughly one-third of the provi- existing policies at their institutions or na-
                                                                                                            creasing number of high-throughput instruments.”                                                                                                                      2
  search Council, Washington, DC 20418, USA.                 sioning and operations of cyberinfrastruc- tions, let alone those of other countries. To                                                                                  to exponentially growing data complexity
SCIENCE AND GOVERNMENT
POLICY FORUM
An International Framework to Promote Access to Data
Peter Arzberger,1* Peter Schroeder,2 Anne Beaulieu,3 Geof Bowker,1 Kathleen Casey,1 Leif Laaksonen,4 David Moorman,5 Paul Uhlir,6 Paul Wouters3

Principles?

• Approximations are currently made because of limitations on spatial scale; more accurate algorithms are needed.
• Need a universal parser to tag information for analysis.
• Need algorithms to recognize and predict intent by combining all available data.

Information analytics
• Require an integrating informatics resource manager that takes in sensor data, transforming data between heterogeneous datasets and integrating computational tools, and presents results to the human user.
• Since the scale of data and problems overwhelms visual displays, new approaches are needed to develop the appropriate level of abstraction and to condense and select data to display under the user's control.
• Collaborative sharing and analysis of datasets and observations are desirable.

Computing platforms
• Current architectures are inadequate for the problem set to access large local and distributed datasets, providing solutions with reasonable throughput to analyze and model at the required spatial scales.
• Current machines do not have the storage capacity for the amount of data that models use and produce.
• Need self-healing and intrinsically secure operating systems with high-performance networking that provides built-in encryption.
• Computational needs range from large central high-performance computing systems to portable lightweight systems such as miniaturized labs on a chip for field deployments.
                                                                                                             and
                                                                                                            policy
                                                                                                                             Financial
                                                                                                                                an
                                                                                                                                and
                                                                                                                            budgetary
                                                                                                                                                  lar to those in the table)
                                                                                                                                                  to facilitate optimal
                                                                                                                                                  cost-effective access to
                      isation for Economic Cooperation and Richard T. Kouzes, GordonNorthwest National Laboratory Ian Gorton,“domains” endorsed a declaration (
 ms for scientists in developing and devel-
ped countries and inhibit the advance of maintaining this infra-
                                                                   De- organizational Stephen T. Elbert,
                                                                   Deborah K. Gracio, Pacific
                                                                                              A. Anderson,
                                                                                                           management and
                                                                 Establishing and Domains of a data access regime.                                digital research data
                                                                                                                                                  from public funding. It
    COMPUTER          velopment (OECD) (7), which resulted in a (see figure, this page)
 ience for the public good. For example, structure requires continued and dedicated can be expected that these future guidelines
 search on cholera outbreaks and their rela- budgetary planning, with appropriate fi- will influence national and international reg-                                                          cipl
on to environmental factors (9) or on un- nancial support. The use of research data ulation of research data, much as the OECD
                      ministerial-level declaration (8).         Through the development of a framework
                                                                               provide new classes of erogeneous full-scale simulations will require not only                                 from
                                                             cannot be maximized if access, manage- Guidelines on the Protection of Privacy (24),
University of California, San Diego, La Jolla, CA 92093,     ment, and preservation costs (including which have been a model for legislation all

                          Legitimate restrictions on open access, applications and hardware, data- ana- that permits model integration. Simultaneously, it
                                                                               for locating and
                                                             cost of documentation and metadata cre- around the Western world.                           Technological
                                                                                                                                                                                              sho
 SA.  2Ministry of Education, Culture and Science,

oetermeer, Netherlands. 3Networked Research and                  software, algorithms,                          peta op capabilities but also a computational infrastruc-
                                                             ation) are an afterthought or are insuffi-                     Although the involvement of re-
 gital Information, Royal Netherlands Academy of Arts
                                                                 intensive                 provide timely and   ture
                                                             ciently or inconsistently funded in research searchers in resolving these issues is criti-
                                                                 meaningful lyzing where improve-
                      and strong disincentives to sharing exist, analytical results in response must couple to huge databases created by an ever-in-                                          able
nd Sciences, Amsterdam, Netherlands. 4CSC-Scientific
omputing Ltd., Espoo, Finland. 5Social Sciences and Hu-      projects (19). D. Atkins et al. (3) recom- cal, many scientists remain ignorant about
 anities Research Council, Ottawa, Canada. 6National         mend that roughly one-third of the provi- existing policies at their institutions or na-
                                                                                                                creasing number of high-throughput instruments.”                                                                                                                  2
  search Council, Washington, DC 20418, USA.                 sioning and operations of cyberinfrastruc- tions, let alone those of other countries. To                                                                                  to exponentially growing data complexity
[Figure: three roles and their activities.
Computational Thinkers: Formulation (data models & computational methods).
Domain Specialists: Interaction (experiments & knowledge creation).
Data-Intensive Engineers: Execution (implementations, compute & data resources).
Labels on the links between the roles: Creating, Mapping, Steering.]
[Figure: data-intensive computing between Computer Science Research (effective algorithms, efficient distributed systems) and Interdisciplinary Applications (intuitive interfaces, collaborative environments), with the labels "Reusable computational models" and "New conceptual models for systems". Application areas shown: Developmental Biology, Medical Genetics, Chemistry, Emergency Response.]
[Slide: newsletter excerpt overlaid with diagram labels: Interdisciplinary Applications, Intuitive interfaces, Collaborative environments, Brain Imaging, Seismology, Neuroinformatics, Quantitative Genetics]

... alpha release of a combined earthquake selection and waveform selection service combining the EMSC and the ORFEUS services. The web portal also includes a first test version of the underlying software structure of the distributed archive services of the Integrated European Distributed Archive (EIDA) for waveform data. The alpha release implies that a test version of the current service is made accessible for a selected group of scientists who are willing to test it and recommend modifications. Interested seismologists, whether students, researchers or network operators, are encouraged to contact the NERIES Project Office if they would like to test the services. A short video presentation is available (http://www.neries-eu.org/main.php/demo.wmv?fileitem=8798210). Alessandro Spinuso, Sergio Rives, Luca Trani, Phetaphone Thomy, Rémy Bossu, Torild van Eck. (See figure 2 below.)

Real-time access to European BB data successively increasing

The Virtual European Broad-band Seismograph Network (VEBSN) is steadily increasing its size. Currently more than 270 stations are contributing data to the VEBSN in near real-time. For some tens of these stations we still need to compile the instrumentation and data details (dataless Seed volumes). An example of the earthquake in Greece on February 14, 2008 illustrates the available data. The VEBSN is a joint initiative of European-Mediterranean seismological networks. More information can be obtained from www.orfeus-eu.org/Data-info/vebsn.html.

Figure 3. The Greek earthquake of February 14, 2008 as recorded by the vertical component of broadband stations of the VEBSN (mainly in the European-Mediterranean area) and made available by ORFEUS. The VEBSN is currently still expanding.
[Figure repeated (Computational Thinkers, Domain Specialists, Data-Intensive Engineers), then reduced to: Interaction (experiments & knowledge creation), Steering, Data-Intensive Engineers, Execution.]
[Figure: portal workflow. Actors and components: portal designer, web portal, students, teacher, researcher, compute resources, and an XML interface, task & resource description. Numbered steps: 1. specifies; 2. uses; 3. deploys; 4. performs task; 5. configures; 6. runs jobs; 7. monitors; 8. returns results; 9. analyses results.]
[Figure repeated (Computational Thinkers, Domain Specialists, Data-Intensive Engineers), then reduced to: Formulation (data models & computational methods), Mapping, Data-Intensive Engineers, Execution.]

Classification of Gene Expression Patterns
[Figure: classification workflow in three phases; boxes listed in reading order.
Inputs: manual annotations, images, image integration.
Training phase: image processing, feature generation, feature selection/extraction, classifier construction.
Testing phase: image processing, feature generation, feature selection/extraction, prediction evaluation.
Deployment phase: apply classifier, automatic annotations.]
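The three phases in the figure can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not say which features or classifier are used, so the summary-statistic features, the variance-based feature selection, the nearest-centroid classifier, and all function names and labels below are hypothetical stand-ins.

```python
from statistics import mean, pvariance


def generate_features(image):
    """Feature generation: summary statistics of a 2D grey-level image (assumed features)."""
    pixels = [p for row in image for p in row]
    return [mean(pixels), pvariance(pixels), max(pixels) - min(pixels)]


def select_features(feature_rows, keep=2):
    """Feature selection: indices of the `keep` features that vary most across images."""
    columns = list(zip(*feature_rows))
    ranked = sorted(range(len(columns)),
                    key=lambda i: pvariance(columns[i]), reverse=True)
    return sorted(ranked[:keep])


def project(features, selected):
    """Keep only the selected feature positions."""
    return [features[i] for i in selected]


def build_classifier(vectors, labels):
    """Classifier construction: one centroid per annotation label."""
    centroids = {}
    for label in set(labels):
        members = [v for v, l in zip(vectors, labels) if l == label]
        centroids[label] = [mean(col) for col in zip(*members)]
    return centroids


def train(images, annotations, keep=2):
    """Training phase: images plus manual annotations in, model out."""
    rows = [generate_features(image) for image in images]
    selected = select_features(rows, keep)
    vectors = [project(row, selected) for row in rows]
    return selected, build_classifier(vectors, annotations)


def annotate(model, image):
    """Deployment phase: apply the classifier to produce an automatic annotation."""
    selected, centroids = model
    vector = project(generate_features(image), selected)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[label]))

    return min(centroids, key=squared_distance)


def evaluate(model, images, annotations):
    """Testing phase: fraction of held-out images annotated correctly."""
    hits = sum(annotate(model, image) == label
               for image, label in zip(images, annotations))
    return hits / len(images)
```

Training and testing share the same image-processing and feature steps, which is why `annotate` reuses `generate_features` and the feature indices chosen during training.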
can be continuously and the initial input signal therefore is decomposed into
different subbands.



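[Figure: (a) the 2D-array input passes through LL, HL, LH and HH filters, each followed by a block marked 2, giving LL1out, HL1out, LH1out and HH1out; a second stage of LL, HL, LH and HH filters gives LL2out, HL2out, LH2out and HH2out. (b) Wavelet decomposition of an image into the subbands LL2, LH2, HL2, HH2, LH1, HL1 and HH1.]

(a) Wavelet decomposition on 2D-array (b) Wavelet decomposition on an image
Fig. 2. Wavelet decomposition

Liangxiu Han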
   Mathematically, for a signal f(x, y) with 2D array (M × N), the wavele
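A minimal sketch of the two-level decomposition shown in Fig. 2, assuming the Haar wavelet. The slides do not name the wavelet that is actually used, and the assignment of the names HL and LH to the horizontal and vertical detail bands follows one common convention that may differ from the author's. The function names are illustrative.

```python
def haar_level(a):
    """One level of an orthonormal 2D Haar decomposition of an even-sized 2D list.

    Each 2x2 block [[p, q], [r, s]] yields one coefficient per subband.
    """
    rows, cols = len(a), len(a[0])
    if rows % 2 or cols % 2:
        raise ValueError("both dimensions must be even")
    ll, hl, lh, hh = [], [], [], []
    for i in range(0, rows, 2):
        ll_row, hl_row, lh_row, hh_row = [], [], [], []
        for j in range(0, cols, 2):
            p, q = a[i][j], a[i][j + 1]
            r, s = a[i + 1][j], a[i + 1][j + 1]
            ll_row.append((p + q + r + s) / 2.0)  # low-pass in both directions
            hl_row.append((p - q + r - s) / 2.0)  # difference between the two columns
            lh_row.append((p + q - r - s) / 2.0)  # difference between the two rows
            hh_row.append((p - q - r + s) / 2.0)  # diagonal difference
        ll.append(ll_row)
        hl.append(hl_row)
        lh.append(lh_row)
        hh.append(hh_row)
    return ll, hl, lh, hh


def haar_decompose(a, levels=2):
    """Repeatedly decompose the LL band, collecting the detail subbands per level."""
    subbands = {}
    current = a
    for level in range(1, levels + 1):
        current, hl, lh, hh = haar_level(current)
        subbands[f"HL{level}"] = hl
        subbands[f"LH{level}"] = lh
        subbands[f"HH{level}"] = hh
    subbands[f"LL{levels}"] = current
    return subbands


def energy(subbands):
    """Sum of squared coefficients over all subbands."""
    return sum(v * v for band in subbands.values() for row in band for v in row)
```

Because each level halves both dimensions, an M × N input needs M and N divisible by 2 for one level and by 4 for two levels. With the orthonormal scaling used here the total energy of the coefficients equals that of the input.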
[Figure: DMI architecture]

User and application diversity

Tool level (iterative DMI process development), accommodating:
  Many application domains
  Many tool sets
  Many process representations
  Many working practices

Gateway interface (one model): DMI canonical representation and abstract machine

Enactment level (mapping optimisation and enactment), composing or hiding:
  Many autonomous resources & services
  Multiple enactment mechanisms
  Multiple platform implementations

System diversity and complexity
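The idea of many tools above one canonical model and many enactment mechanisms below it can be illustrated with a small sketch. Everything here is hypothetical: the two front-end notations, the two back ends, and the `Gateway` class are invented for illustration and are not part of any DMI implementation described in the slides.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Step:
    """One step of the single canonical process representation."""
    name: str


# Tool level: two hypothetical front ends with different process notations.
def from_arrow_text(text: str) -> List[Step]:
    """Parse a notation such as 'read -> clean -> classify'."""
    return [Step(part.strip()) for part in text.split("->") if part.strip()]


def from_records(records: List[dict]) -> List[Step]:
    """Convert a notation such as [{'op': 'read'}, {'op': 'clean'}]."""
    return [Step(record["op"]) for record in records]


# Enactment level: two hypothetical back ends hidden behind the gateway.
def enact_in_process(steps: List[Step]) -> List[str]:
    """Pretend to run each step locally, one at a time."""
    return [f"local:{step.name}" for step in steps]


def enact_as_batch(steps: List[Step]) -> List[str]:
    """Pretend to submit all steps as a single batch job."""
    return ["batch:" + "+".join(step.name for step in steps)]


class Gateway:
    """Accepts only the canonical form and chooses an enactment mechanism."""

    def __init__(self) -> None:
        self._enactors: Dict[str, Callable[[List[Step]], List[str]]] = {}

    def register(self, platform: str,
                 enactor: Callable[[List[Step]], List[str]]) -> None:
        self._enactors[platform] = enactor

    def submit(self, steps: List[Step], platform: str) -> List[str]:
        return self._enactors[platform](steps)
```

With one canonical form in the middle, adding a tool means writing one translator and adding a platform means registering one enactor; neither side needs to know about the other.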
[Figure repeated: classification workflow (testing, training and deployment phases), overlaid with the labels "Formulation" and "OGSA-DAI".]
Data-Intensive Systems Process Engineering Language

/* import non-universal components from the computational environment */
import uk.org.ogsadai.SQLQuery; // get definition of SQLQuery
import uk.org.ogsadai.TupleToWebRowSetCharArrays; // serialisation
import uk.org.ogsadai.DeliverToRequestStatus;

/* construct and identify instances of the PE */
SQLQuery query = new SQLQuery();
TupleToWebRowSetCharArrays wrs = new TupleToWebRowSetCharArrays();
DeliverToRequestStatus del = new DeliverToRequestStatus();

/* form connection c1 with an explicit literal stream expression as its source
and query as its destination */
String q1 = "SELECT * FROM weather";
|- q1 -| => expression->query;
String resourceID = "MySQLResource";
|- resourceID -| => resource->query;
query->data => data->wrs;
wrs->result => input->del;

[Slide label: Java]
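As a rough analogy only, the same data flow (literal streams feeding a query, whose tuples are serialised and then delivered) can be imitated with Python generators. This is not the OGSA-DAI API and not how DISPEL is enacted. The function names, the `RESOURCES` registry, the comma-separated serialisation, the in-memory SQLite database standing in for "MySQLResource", and the table contents are all assumptions made for this sketch.

```python
import sqlite3

# Hypothetical registry mapping resource IDs to open database connections.
RESOURCES = {}


def literal_stream(*values):
    """Analogue of the |- value -| literal stream expression."""
    yield from values


def sql_query(expression, resource):
    """Stand-in for the SQLQuery PE: pairs each statement with a resource ID
    and streams the resulting tuples on its 'data' output."""
    for statement, resource_id in zip(expression, resource):
        yield from RESOURCES[resource_id].execute(statement)


def tuple_to_text(data):
    """Stand-in for the serialisation PE: one comma-separated line per tuple."""
    for row in data:
        yield ",".join(str(value) for value in row)


def deliver(input_stream):
    """Stand-in for the delivery PE: drains the stream into a list."""
    return list(input_stream)


def demo():
    """Wire the stand-ins together in the same order as the DISPEL connections."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE weather (city TEXT, temp_c REAL)")
    connection.executemany(
        "INSERT INTO weather VALUES (?, ?)",
        [("Edinburgh", 11.5), ("Espoo", 4.0)],  # made-up sample rows
    )
    RESOURCES["MySQLResource"] = connection
    try:
        q1 = literal_stream("SELECT * FROM weather")
        resource_id = literal_stream("MySQLResource")
        data = sql_query(q1, resource_id)   # |- q1 -| => expression->query
        result = tuple_to_text(data)        # query->data => data->wrs
        return deliver(result)              # wrs->result => input->del
    finally:
        connection.close()
        del RESOURCES["MySQLResource"]
```

Generators are lazy, so nothing runs until `deliver` pulls on the final stream, which loosely mirrors how connected processing elements consume data as it arrives.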
                                                                                                       
                                       
                                                                                                                                                                                                                                                                                       
                                                                                                                                           
                                                                                                                                                                                                                                                                              
                                                                                                                                                                                                                                                                                                         
                                                                                                                                                                                
                                                                                                                                          
                                                                                                                                                                                                  
                                                                                                                                                                                                                                                                  
                                                                                                                                                                                                                                                      
                                                                                                                                             
                                                                                                                                                                         
                                                                                                                                                                   



                                                                                                                                                                                                                                                       
                                                                                       
                                                                                                                                                                                                                                                        
                                                                                  
                                                                                                                                                                                                                                                                   

                                                                                                                
                                                                                                                          
                                                                                                           
                                                                                                                                                                                                                                                                                                        
                                                                                                                                                                                                                                                                                 
                                                                                                                                                                                                                                                                                                      
                                                                                                                                         
                                                                                         
                                                                                                                                
                                                                                                          

                                                                                                                                                                                                                                                                                                  
                                                                                                                 
                                                                                                                                                                                                                                                        
                                                               
                                                                                                                                                                                                                                                                       
                                                                                                                                                          
                                                                                                                                            
                                                                                             
                                                                                                                                                         
                                                                                                                                                                                                                 
                                                                                                                                                                                                                    
                                                                                                                                                                                                                                                             
                                                                                              
                                                                                                                                                                     
                                                                 
                                                                                                                                                       
                                                                                           
                                                                                                                                                                                                                                                                                           
                                                                                                                                                                                                                                                     
                                                                                           
                                                                                                                                                                                                                                                                   
                                                                                                                                                            
                                                                                                                                                                             
                                                                                                                                                                                                                                                                                
                                                                                                                                                     
                                                                                                                                                                                               
                                                                                                                                                         
                                                                                                                                                                                                                                 
                                                                             
                                                                                                                                                                                               
                                                                                                                                                                                               
                                                                                                                                                                                                                                                                                                             
                                                                  
                                                                                                                                                                                                                                                                                                      
Architecture results
[Figure: speedup (y-axis, 0.5 to 5) against number of images (x-axis, 0 to 20000), one curve per cluster size from 2 to 8 nodes]
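The speedup plotted on this slide can be illustrated with a minimal sketch. It assumes the usual definition (run time on one node divided by run time on n nodes), which the slide does not spell out; the timings below are hypothetical and are not read off the figure.

```python
def speedup(t_single: float, t_parallel: float) -> float:
    """Speedup of a parallel run relative to a single-node run (assumed baseline)."""
    if t_single <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_single / t_parallel


def efficiency(t_single: float, t_parallel: float, nodes: int) -> float:
    """Parallel efficiency: speedup divided by the number of nodes used."""
    if nodes < 1:
        raise ValueError("nodes must be at least 1")
    return speedup(t_single, t_parallel) / nodes


if __name__ == "__main__":
    # Hypothetical run times in seconds, keyed by number of nodes.
    timings = {1: 4000.0, 2: 2100.0, 4: 1150.0, 8: 900.0}
    baseline = timings[1]
    for nodes, seconds in sorted(timings.items()):
        s = speedup(baseline, seconds)
        e = efficiency(baseline, seconds, nodes)
        print(f"{nodes} nodes: speedup {s:.2f}, efficiency {e:.2f}")
```

An efficiency well below 1 at higher node counts is the usual sign that coordination or data movement, rather than computation, dominates the run time.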
The ‘hump’
[Figure: processing time in seconds (y-axis, 0 to 6000) against number of computing nodes (x-axis, 1 to 8), shown for 6400, 12800 and 19200 images; series: workflow execution time and PE1 to PE9]
No ‘hump’
[Figure: processing time in seconds (y-axis, 0 to 5000) against number of computing nodes (x-axis, 1 to 8), shown for 6400, 12800 and 19200 images; series: workflow execution time and PE1 to PE9]
Data mining results
Table 1. Preliminary classification performance using 10-fold validation

Gene expression      Sensitivity   Specificity
Humerus              0.7525        0.7921
Handplate            0.7105        0.7231
Fibula               0.7273        0.718
Tibia                0.7467        0.7451
Femur                0.7241        0.7345
Ribs                 0.5614        0.7538
Petrous part         0.7903        0.7538
Scapula              0.7882        0.7099
Head mesenchyme      0.7857        0.5507

Note: Sensitivity is the true positive rate. Specificity is the true negative rate.
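As a small illustration of the two measures defined in the note, the sketch below computes them from confusion-matrix counts. The function names and the counts are hypothetical; they are not taken from the experiments behind Table 1.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """True positive rate: fraction of actual positives that were detected."""
    total = true_pos + false_neg
    if total == 0:
        raise ValueError("no positive examples")
    return true_pos / total


def specificity(true_neg: int, false_pos: int) -> float:
    """True negative rate: fraction of actual negatives that were rejected."""
    total = true_neg + false_pos
    if total == 0:
        raise ValueError("no negative examples")
    return true_neg / total


def confusion_counts(actual: list[bool], predicted: list[bool]) -> tuple[int, int, int, int]:
    """Return (true_pos, false_neg, true_neg, false_pos) for paired labels."""
    if len(actual) != len(predicted):
        raise ValueError("label lists must have the same length")
    tp = sum(a and p for a, p in zip(actual, predicted))
    fn = sum(a and not p for a, p in zip(actual, predicted))
    tn = sum(not a and not p for a, p in zip(actual, predicted))
    fp = sum(not a and p for a, p in zip(actual, predicted))
    return tp, fn, tn, fp


if __name__ == "__main__":
    # Hypothetical labels: True means the gene is expressed in the structure.
    actual = [True, True, True, True, False, False, False, False]
    predicted = [True, True, True, False, False, False, False, True]
    tp, fn, tn, fp = confusion_counts(actual, predicted)
    print(f"sensitivity {sensitivity(tp, fn):.4f}, specificity {specificity(tn, fp):.4f}")
```

Under k-fold validation, these values would typically be computed per fold and then averaged, although the slide does not say how the reported figures were aggregated.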



5   Conclusion and Future Work
[Diagram: three roles in data-intensive research
  Computational Thinkers: Formulation (data models & computational methods)
  Domain Specialists: Interaction (experiments & knowledge creation)
  Data-Intensive Engineers: Execution (implementations, compute & data resources)
  Links between the roles: Creating, Mapping, Steering]
Spatial atlases for developmental biology
Next Generation Embryology
Google Maps for Developmental Biology
Next generation technology
[Diagram: Current Repository compared with Enhanced Repository
  Current Repository: Web & Java clients; Query, Navigation & Contribution; e-MouseAtlas; Knowledge + Maps
  Enhanced Repository: Web clients; Query & Contribution; Navigation; e-MouseAtlas+; DSpace + webviewer; Knowledge; Maps]
Gagarine Yaikhom
Annotating on-line
Data-intensive research workshop
Monday, 15 March 2010 @ e-Science Edinburgh


Volume, Complexity, Interaction
Databases, Paradigms, Analysis
Principles?
Data-intensive Research Group
Academics: Malcolm Atkinson
Research Assistants: Jos Koetsier, Liangxiu Han, David Rodriguez, Gagarine Yaikhom
PhD Students: Thomas French, Luna De Ferrari, Rob Kitchen, Chee-Sun Liew, Fan Zhu
Research Students: Gary, Vijay, Hwee, Yue, Charalampos, Gideon, Jeff, Gareth, Charis, Andrejs
http://research.nesc.ac.uk/
Jano van Hemert

Data-Intensive Research

  • 1.
    Data-Intensive Research. Jano van Hemert, research.nesc.ac.uk. The University of Edinburgh.
  • 2.
    Downloaded from www.sciencemag.org on July 6, 2009. COMPUTER SCIENCE. Beyond the Data Deluge. Gordon Bell,1 Tony Hey,1 Alex Szalay2. The demands of data-intensive science represent a challenge for diverse scientific communities.
    Since at least Newton’s laws of motion in the 17th century, scientists have recognized experimental and theoretical science as the basic research paradigms for understanding nature. In recent decades, computer simulations have become an essential third paradigm: a standard tool for scientists to explore domains that are inaccessible to theory and experiment, such as the evolution of the universe, car passenger crash testing, and predicting climate change. As simulations and experiments yield ever more data, a fourth paradigm is emerging, consisting of the techniques and technologies needed to perform data-intensive science (1). For example, new types of computer clusters are emerging that are optimized for data movement and analysis rather than computing, while in astronomy and other sciences, integrated data systems allow data analysis and storage on site instead of requiring download of large amounts of data.
    Today, some areas of science are facing hundred- to thousandfold increases in data volumes from satellites, telescopes, high-throughput instruments, sensor networks, accelerators, and supercomputers, compared to the volumes generated only a decade ago (2). In astronomy and particle physics, these new experiments generate petabytes (1 petabyte = 10^15 bytes) of data per year. In bioinformatics, the increasing volume (3) and the extreme heterogeneity of the data are challenging scientists (4). In contrast to the traditional hypothesis-led approach to biology, Venter and others have argued that a data-intensive inductive approach to genomics (such as shotgun sequencing) is necessary to address large-scale ecosystem questions (5, 6).
    Other research fields also face major data management challenges. In almost every laboratory, “born digital” data proliferate in files, spreadsheets, or databases stored on hard drives, digital notebooks, Web sites, blogs, and wikis. The management, curation, and archiving of these digital data are becoming increasingly burdensome for research scientists.
    Over the past 40 years or more, Moore’s Law has enabled transistors on silicon chips to get smaller and processors to get faster. At the same time, technology improvements for disks for storage cannot keep up with the ever increasing flood of scientific data generated by the faster computers. In university research labs, Beowulf clusters (groups of usually identical, inexpensive PC computers that can be used for parallel computations) have
    Figure caption: Moon and Pleiades from the VO. Astronomy has been one of the first disciplines to embrace data-intensive science with the Virtual Observatory (VO), enabling highly efficient access to data and analysis tools at a centralized site. The image shows the Pleiades star cluster from the Digitized Sky Survey combined with an image of the moon, synthesized within the World Wide Telescope service. CREDIT: JONATHAN FAY/MICROSOFT
    1 Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA. 2 Department of Physics and Astronomy, Johns Hopkins University, 3701 San Martin Drive, Baltimore, MD 21218, USA. E-mail: szalay@jhu.edu
    www.sciencemag.org SCIENCE VOL 323, 6 MARCH 2009, p. 1297. Published by AAAS
  • 3.
    Beyond the Data Deluge (article page repeated with the standfirst highlighted): The demands of data-intensive science represent a challenge for diverse scientific communities. The page header also carries the identifier 10.1126/science.1171406.
  • 4.
    NEWS FEATURE: 2020 COMPUTING
    NATURE | Vol 440 | 23 March 2006
    EVERYTHING, EVERYWHERE
    Tiny computers that constantly monitor ecosystems, buildings and even human bodies could turn science on its head. Declan Butler investigates.
    [Image credit: J. MAGEE]
  • 5.
    [Document excerpt on the National PACS]
    o Data deletion after [?] years.
    - To use the PACS for research would require:
    o That the requirements were defined.
    o The Scottish Government and Health service providers were persuaded that it was important and necessary.
    o The cost of storage would need to be negotiated.
    o The service would need to be negotiated.
    o Key people were able to influence the product design to ensure that it met requirements.
    [Figure: Cumulative storage in the National PACS since implementation in September 2006. Axes: cumulative GB storage (0 to 35000) against month (Sep-06 to Aug-08), broken down by the codes XA, US, SR, RF, PX, OT, NM, MR, MG, HC, CT, DX, CR. (source: Hamish McRitchie)]
    Note the largest most rapid expansion in space usage is with CT.
    [Figure: EMBL-EBI storage requirements until 2008. Axes: disk space (terabytes, 0 to 3000) against year (1996 to 2008). (source: Andrew Lyall)]
    [Figure: Journal connectivity through scholarly usage data. Figure 3: Visualization of usage network created from MESUR’s 200M usage events. (source: MESUR project)]
    [Document excerpt, MESUR project]
    5. USAGE-BASED METRICS
    The journal usage and citation networks also enable the calculation of a variety of impact metrics. A total of 47 possible impact metrics were calculated, and the resulting rankings were analyzed to determine the degree to which usage- and citation-based metrics express similar or dissimilar aspect of scholarly impact.
    5.1 Defining and validating usage-based metrics
    The most common indicator of journal status is Thomson Scientific’s journal Impact Factor (IF) that is published every year for a set of about 8,000 selected journals. The IF is defined as the average citation rate for articles published in a particular journal. A similar statistical approach to journal ranking has been proposed for journal usage data on the basis of COUNTER reports, i.e. the average amount of usage recorded for the articles published in a journal [4, 22].
    However, to use a social analogy, one’s importance is not solely assessed on the basis of how many people one knows. Who one knows and how one is embedded in a network of social relationships are equally important factors. Network theory has produced a rich literature on indicators to determine different facets of a person’s status (e.g. prestige, popularity, trust) on the basis of social network structure, instead of using simple counts of the number of the person’s relationships. Many of these indicators have found applications in other domains. For example, the Google search engine uses the PageRank metric to rank web pages on the basis of the WWW’s hyperlink network structure. In addition, recent proposals have been made to rank journals according to their citation PageRank [2] and a range of social network
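    The MESUR excerpt on this slide contrasts two kinds of journal metric: a simple average rate (the Impact Factor, or average usage per article) and a network metric such as PageRank. The contrast can be sketched roughly as follows. This is a minimal illustration only, not the MESUR project's implementation: the journal names "A" to "D", the edge list, the damping value of 0.85 and the function names are all assumptions made up for the example.

```python
from collections import defaultdict

# Hypothetical toy example: journals and links are invented for illustration.

def average_rate(total_events, n_articles):
    """Impact-Factor-style score: events (citations or usage) per article."""
    return total_events / n_articles

def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank over directed (source, target) pairs."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = defaultdict(list)
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Rank held by nodes with no outgoing links is shared among all nodes.
        dangling = sum(rank[node] for node in nodes if not out_links[node])
        new_rank = {
            node: (1.0 - damping) / n + damping * dangling / n for node in nodes
        }
        for src in nodes:
            targets = out_links[src]
            for dst in targets:
                # Each node passes an equal share of its rank to every target.
                new_rank[dst] += damping * rank[src] / len(targets)
        rank = new_rank
    return rank

# An edge ("A", "C") means journal A points to (cites, or leads readers to) C.
edges = [("A", "C"), ("B", "C"), ("C", "A"), ("D", "C")]
scores = pagerank(edges)
for journal in sorted(scores, key=scores.get, reverse=True):
    print(journal, round(scores[journal], 3))
```

    In this toy network journal A receives only one incoming link, but that link comes from the best-connected journal C, so A ends up ranked far above B and D. That is the point of the social analogy in the excerpt: who points to a journal matters as well as how many do.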
  • 6.
    BOOKS & ARTS
    NATURE | Vol 455 | 4 September 2008
    Distilling meaning from data
    Buried in vast streams of data are clues to new science. But we may need to craft new lenses to see them, explain Felice Frankel and Rosalind Reid.
    It is a breathtaking time in science as masses of data pour in, promising new insights. But how can we find meaning in these terabytes? To search successfully for new science in large datasets, we must find unexpected patterns and interpret evidence in ways that frame new questions and suggest further explorations. Old habits of representing data can fail to meet these challenges, preventing us from reaching beyond the familiar questions and answers.
    To extract new meaning from the sea of data, scientists have begun to embrace the tools of visualization. Yet few appreciate that visual representation is also a form of communication. A rich body of communication expertise holds the potential to greatly improve these tools. We propose that graphic artists, communicators and visualization scientists should be brought into conversation with theorists and experimenters before all the data have been gathered. If we design experiments in ways that offer varied opportunities for representing and communicating data, techniques for extracting new understanding can be made available.
    Visual representation is familiar in data-intensive fields. Years before a detector is built for a facility such as the Large Hadron Collider near Geneva, for example, physicists will have pored over simulations. They examine how important events will ‘look’ in the displays that reveal and communicate what is going on inside the machine. Such discussions tend to take place within the visual conventions of a field. But perhaps conversations might be broadened to consider alternative representations of the same data. These might suggest other approaches to collecting, organizing and querying data that will maximize the transparency of experimental results and thus aid intuition, discovery and communication.
    Unfortunately, visualization experts and communicators are often consulted only after [...] they will create effective computer displays, slides and figures for publication. Meanwhile, they may be developing their tools in isolation, kept at arm’s length by scientists who are busy getting their experiments done. Opportunities for useful dialogue are thus squandered.
    When scientists, graphic artists, writers, animators and other designers come together to discuss problems in the visual representation of science, such as at the Image and Meaning workshops run by Harvard University (www.imageandmeaning.org), it becomes clear that representations repeatedly fail to communicate understanding or address obvious questions about the underlying data. A three-dimensional volume rendering may give no hint of important uncertainties or data gaps; solid surfaces or sharp edges may suggest data where they do not exist. A graphic artist might propose ways to reveal gaps or deviations from expectation early in an experiment, guiding subsequent data collection or highlighting new avenues of enquiry. When we asked Harvard University chemist George Whitesides to change the geometry of a self-assembled monolayer with clearly delineated hydrophobic and hydrophilic areas to create an image for submission to a journal, he found himself redesigning the experiment, and unexpected science emerged.
    [...] those run by the US National Science Foundation’s Picturing to Learn project (www.picturingtolearn.org), teach us that attempting to visually communicate scientific data and concepts opens a path to understanding. When science and design students collaborate, their drive to understand one another’s ideas pushes them to create new ways of seeing science. Investment in visual communication training for young scientists will pay off handsomely for any data-intensive discipline.
    The ingrained habits of highly trained scientists make them rarely as adventurous as these young minds. We think we are on the path to insight when shading reveals contours in 3D renderings, or when bursts of red appear on heat maps, for example. But the algorithms used to produce the graphics may create illusions or embed assumptions. The human visual system creates in the brain an apparent understanding of what a picture represents, not necessarily a picture of the underlying science. Unless we know all the steps from hypothesis to understanding (by conversing with theorists, experimentalists, instrument and software developers, visualization scientists, graphic artists and cognitive psychologists) we cannot be sure whether a display is accurate or misleading.
    The greatest opportunity and risk lie in that last step in the path: understanding. Whether verbal or visual, any language that is garbled and inconsistent fails to do its job. Let’s talk. Let’s all talk.
    Felice Frankel is senior research fellow in the faculty of arts and sciences at Harvard University, Cambridge, Massachusetts 02138, USA. With G. M. Whitesides, she is co-author of On the Surface of Things: Images of the Extraordinary in Science. e-mail: felice_frankel@harvard.edu
    Rosalind Reid is executive director of the Initiative in Innovative Computing at Harvard University and former Editor of American Scientist.
    [Figure: Discussing visual communication before designing experiments may reveal new science. Credit: D. ARMENDARIZ]
    COMMENTARY
    NATURE | Vol 440 | 23 March 2006
    Exceeding human limits
    Scientists are turning to automated processes and technologies in a bid to cope with ever higher volumes of data. But automation offers so much more to the future of science than just data handling, says Stephen H. Muggleton.
    The collection and curation of data throughout the sciences is becoming increasingly automated. For example, a single high-throughput experiment in biology can easily generate more than a gigabyte of data per day, and in astronomy automatic data collection leads to more than a terabyte of data per night. Throughout the sciences the volumes of archived data are increasing exponentially, supported not only by low-cost digital storage but also by the growing efficiency of automated instrumentation. It is clear that the future of science involves the expansion of automation in all its aspects: data
    [Image credit: FIREFLY PRODUCTIONS/CORBIS]
  • 7.
    COMMENTARY: Exceeding human limits (Stephen H. Muggleton, NATURE | Vol 440 | 23 March 2006), cropped excerpts
    “Owing to the scale and rate of data generation, computational models of scientific data now require automatic construction and modification.”
    “There is a severe danger that increases in speed and volume of data generation could lead to decreases in comprehensibility.”
    [...] mathematical underpinnings of, say, differential equations, bayesian networks and logic programs make integrating these various models virtually impossible. Although hybrid models can be built by simply patching two models together, the underlying differences lead to unpredictable and error-prone behaviour when changes are made.
    One encouraging development in this respect is the emergence within computer science of new formalisms (5) that integrate, in a sound fashion, two major branches of mathematics: mathematical logic and probability calculus. Mathematical logic provides a formal foundation for logic programming languages such as Prolog, whereas probability calculus provides the basic axioms of probability for [...]
    On such timescales it should become easier for scientists to reproduce new experiments and refute their hypotheses.
    Today’s generation of microfluidic machines is designed to carry out a specific series of chemical reactions. However, further flexibility could be added to this tool kit by developing what one might call a ‘chemical Turing machine’. The universal Turing machine, devised in 1936 by Alan Turing, was intended to mimic the pencil-and-paper operations of a mathematician. The chemical Turing machine would be a universal processor capable of performing a broad range of chemical operations on both the reagents available to it at the start and those chemicals it later generates. The machine would automatically prepare and test chemical compounds but it would also be programmable, thus allowing much the same flexibility as a real chemist has in the lab.
    One can think of a chemical Turing [...]
    Stephen H. Muggleton is [...] Computing and the Centr[...] Systems Biology at Imper[...] 2BZ, UK.
  • 8.
    Released under Creative Commons License
  • 9.
    Science Paradigms
    empirical: describing natural phenomena
    theoretical: using models, generalizations
    computational: simulating complex phenomena
    data exploration: unify theory, experiment, and simulation
    FIGURE 1
    [...]CE: WHAT IS IT?
    [...]ce is where “IT meets scientists.” Researchers are using many different meth[...] collect or generate data, from sensors and CCDs to supercomputers and [...]e colliders. When the data finally shows up in your computer, what do [...] with all this information that is now in your digital shoebox? People are [...]ually seeking me out and saying, “Help! I’ve got all this data. What am I [...]
    Released under Creative Commons License
  • 10.
    Principles?
    POLICY FORUM: SCIENCE AND GOVERNMENT
    An International Framework to Promote Access to Data
    Peter Arzberger,1* Peter Schroeder,2 Anne Beaulieu,3 Geof Bowker,1 Kathleen Casey,1 Leif Laaksonen,4 David Moorman,5 Paul Uhlir,6 Paul Wouters3
    Recent national and multinational investments (1) in networking and continued gains in information technological capability (2) have given rise to a complex cyberinfrastructure that is rapidly increasing our ability to produce, manage, and use data (3). As research becomes increasingly global (4), data-intensive, and multifaceted (5, 6), it is imperative to address national and international data access and sharing issues systematically in a policy arena that transcends national jurisdictions. Open access to publicly funded data provides greater returns from the public investment in research, generates wealth through downstream commercialization of outputs, and provides decision-makers with facts needed to address complex, often transnational, problems. This article summarizes key findings of an international group that studied these issues on behalf of the Organization for Economic Cooperation and Development (OECD) (7), which resulted in a ministerial-level declaration (8).
    Legitimate restrictions on open access, and strong disincentives to sharing exist, based on concerns of protecting national security, privacy and confidentiality, intellectual property, and time-limited exclusive use by the scientific investigator. The lack of clear funding-agency policies in the face of strong competing interests, often far removed from academic research, poses problems for scientists in developing and developed countries and inhibit the advance of science for the public good. For example, research on cholera outbreaks and their relation to environmental factors (9) or on understanding global climate change (10) requires access to data drawn from many disciplines and sources. This issue has been a topic of recent debate and its resolution is a high priority in many scientific and policy-making communities (11–17).
    Analysis of these, and other examples (18), suggests that successful data access and sharing arrangements exhibit a number of key attributes and operating principles (see table, this page). Administrative and organizational management “domains” (see figure, this page) provide a framework for locating and analyzing where improvements can be made. Diversity in science suggests that a variety of institutional models and tailored data management approaches will be needed.
    Establishing and maintaining this infrastructure requires continued and dedicated budgetary planning, with appropriate financial support. The use of research data cannot be maximized if access, management, and preservation costs (including cost of documentation and metadata creation) are an afterthought or are insufficiently or inconsistently funded in research projects (19). D. Atkins et al. (3) recommend that roughly one-third of the provisioning and operations of cyberinfrastruc-
    Appropriate professional and career reward structures are necessary (20–22). The way scientists are being evaluated and how their careers are shaped are at stake. For example, researchers who have spent years on building new databases, such as the Sloan Digital Sky Survey in astronomy, have effectively put their scientific careers on hold even though these databases are critical for the future development of the field. These considerations apply equally to those who produce, manage, and reuse research data.
    At this point there is considerable heterogeneity in policies. In the United States, federal government databases are not copyright protected, whereas in the European Union government databases are eligible for protection under several database protection laws. Even within countries, different funding agencies have different stated policies; for example, in Canada, with three major science funding agencies, one follows the principles in the OECD declaration, one states access should not be a barrier, and a third has no policy (23). National laws and international agreements can directly affect data access and sharing practices.
    At the last meeting of the OECD Committee for Scientific and Technological Policy (CSTP) at the ministerial level, ministers endorsed a declaration (8) based on the principle that research data from public funding should be openly available. Furthermore, they invited OECD to develop a set of guidelines based on commonly agreed principles (similar to those in the table) to facilitate optimal cost-effective access to digital research data from public funding. It can be expected that these future guidelines will influence national and international regulation of research data, much as the OECD Guidelines on the Protection of Privacy (24), which have been a model for legislation all around the Western world.
    Although the involvement of researchers in resolving these issues is critical, many scientists remain ignorant about existing policies at their institutions or nations, let alone those of other countries. To
    [Table: OPERATING PRINCIPLES FOR DATA ACCESS REGIMES]
    Openness
    Transparency and active data dissemination
    Assignment and assumption of formal responsibilities
    Technical and semantic interoperability of databases
    Quality control, data validation, authentication, and authorization
    Operational efficiency and flexibility
    Respect for intellectual property and other ethical and legal requirements
    Management accountability, including funding approaches
    [Figure: Domains of a data access regime. Data access and management domains: technological; cultural and behavioral; institutional and managerial; legal and policy; financial and budgetary.]
    1 University of California, San Diego, La Jolla, CA 92093, USA. 2 Ministry of Education, Culture and Science, Zoetermeer, Netherlands. 3 Networked Research and Digital Information, Royal Netherlands Academy of Arts and Sciences, Amsterdam, Netherlands. 4 CSC-Scientific Computing Ltd., Espoo, Finland. 5 Social Sciences and Humanities Research Council, Ottawa, Canada. 6 National Research Council, Washington, DC 20418, USA.
    Downloaded from www.sciencemag.org on August 30, 2009
    COVER FEATURE
    THE CHANGING PARADIGM OF DATA-INTENSIVE COMPUTING
    Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio, Pacific Northwest National Laboratory
    Through the development of new classes of software, algorithms, and hardware, data-intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity
    [...]erogeneous full-scale simulations will require not only petaflop capabilities but also a computational infrastructure that permits model integration. Simultaneously, it must couple to huge databases created by an ever-increasing number of high-throughput instruments.” (2)
  • 11.
    [Slide image: overlaid first pages of two articles.]
    Peter Arzberger, Peter Schroeder, Anne Beaulieu, Geof Bowker, Kathleen Casey, Leif Laaksonen, David Moorman, Paul Uhlir, Paul Wouters: "An International Framework to Promote Access to Data" (Policy Forum, Science and Government). Downloaded from www.sciencemag.org on August 30, 2009.
    Operating principles for data access regimes (table in the article):
      • Openness
      • Transparency and active data dissemination
      • Assignment and assumption of formal responsibilities
      • Technical and semantic interoperability of databases
      • Quality control, data validation, authentication, and authorization
      • Operational efficiency and flexibility
      • Respect for intellectual property and other ethical and legal requirements
      • Management accountability, including funding approaches
    Domains of a data access regime (figure in the article): technological; institutional and managerial; financial and budgetary; legal and policy; cultural and behavioral.
    Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio, Pacific Northwest National Laboratory: "The Changing Paradigm of Data-Intensive Computing" (cover feature).
    "Through the development of new classes of software, algorithms, and hardware, data-intensive applications provide timely and meaningful analytical results in response to exponentially growing data complexity."
  • 12.
    [Slide image: the overlaid article pages from the previous slide, with a table excerpt on top. Row labels visible in the excerpt: Information analytics; Computing platforms.]
      • Approximations are currently made because of limitations on spatial scale; more accurate algorithms are needed.
      • Need a universal parser to tag information for analysis.
      • Need algorithms to recognize and predict intent by combining all available data.
      • Require an integrating informatics resource manager that takes in sensor data, transforming between heterogeneous datasets and integrating computational tools, and presents results to the human user.
      • Since the scale of data and problems overwhelms visual displays, new approaches are needed to develop the appropriate level of abstraction and to condense and select data to display under the user's control.
      • Collaborative sharing and analysis of datasets and observations are desirable.
      • Current architectures are inadequate for the problem set to access large local and distributed datasets, providing solutions with reasonable throughput to analyze and model at the required spatial scales.
      • Current machines do not have the storage capacity for the amount of data that models use and produce.
      • Need self-healing and intrinsically secure operating systems with high-performance networking that provides built-in encryption.
      • Computational needs range from large central high-performance computing systems to portable lightweight systems such as miniaturized labs on a chip for field deployments.
  • 13.
    Roles diagram: Computational Thinkers (formulation: data models & computational methods), Domain Specialists (interaction: experiments & knowledge creation) and Data-Intensive Engineers (execution: implementations, compute & data resources), connected by the labels Creating, Mapping and Steering.
  • 14.
    Computer Science Research: efficient distributed systems; effective algorithms; data-intensive computing.
  • 15.
    Computer Science Research: efficient distributed systems; effective algorithms; data-intensive computing; new conceptual models for systems.
    Interdisciplinary Applications: reusable computational models; intuitive interfaces; collaborative environments.
  • 16.
    Interdisciplinary Applications: reusable computational models; intuitive interfaces; collaborative environments.
    Application domains: Developmental Biology; Medical Genetics; Emergency Response; Chemistry; Brain Imaging; Neuroinformatics; Quantitative Genetics; Seismology.
    [Slide image: excerpt from an ORFEUS newsletter page.]
    "... alpha release of a combined earthquake selection and waveform selection service combining the EMSC and the ORFEUS services. The web portal also includes a first test version of the underlying software structure of the distributed archive services of the Integrated European Distributed Archive (EIDA) for waveform data. The alpha release implies that a test version of the current service is made accessible for a selected group of scientists that are willing to test it and recommend modifications. Interested seismologists, students, researchers or network operators are encouraged to contact the NERIES Project Office if they are interested in testing the services. A short video presentation is available (http://www.neries-eu.org/main.php/demo.wmv?fileitem=8798210). Alessandro Spinuso, Sergio Rives, Luca Trani, Phetaphone Thomy, Rémy Bossu, Torild van Eck. (See figure 2 below.)"
    "Real-time access to European BB data successively increasing. The Virtual European Broad-band Seismograph Network (VEBSN) is steadily increasing its size. Currently more than 270 stations are contributing data to the VEBSN in near real-time. For some tens of these stations we still need to compile the instrumentation and data details (dataless Seed volumes). An example of the earthquake in Greece on February 14, 2008 illustrates the available data. The VEBSN is a joint initiative of European-Mediterranean seismological networks. More information can be obtained from www.orfeus-eu.org/Data-info/vebsn.html."
    Figure 3. The Greek earthquake of February 14, 2008 as recorded by the vertical component of broadband stations of the VEBSN (mainly in the European-Mediterranean area) and made available by ORFEUS. The VEBSN is currently still expanding.
  • 17.
    Roles diagram: Computational Thinkers (formulation: data models & computational methods), Domain Specialists (interaction: experiments & knowledge creation) and Data-Intensive Engineers (execution: implementations, compute & data resources), connected by the labels Creating, Mapping and Steering.
  • 18.
    Roles diagram, detail: Interaction (experiments & knowledge creation); Steering; Data-Intensive Engineers (execution).
  • 19.
    Roles diagram, detail: Interaction (experiments & knowledge creation); Steering; Data-Intensive Engineers (execution).
  • 20.
    Portal diagram. Actors: teacher, students, researcher, designer. Components: XML interface (task & resource description), web portal, compute resources.
    Numbered steps: 1. specifies; 2. uses; 3. deploys portal; 4. performs task; 5. configures; 6. runs jobs; 7. monitors; 8. returns results; 9. analyses results.
  • 22.
    Roles diagram: Computational Thinkers (formulation: data models & computational methods), Domain Specialists (interaction: experiments & knowledge creation) and Data-Intensive Engineers (execution: implementations, compute & data resources), connected by the labels Creating, Mapping and Steering.
  • 23.
    Roles diagram, detail: Formulation (data models & computational methods); Mapping; Data-Intensive Engineers (execution).
  • 24.
    Roles diagram, detail: Formulation (data models & computational methods); Mapping; Data-Intensive Engineers (execution).
    Example: Classification of Gene Expression Patterns.
  • 25.
    Training phase: manual annotations and images; image integration; image processing; feature generation; feature selection/extraction; classifier construction.
    Testing phase: image processing; feature generation; feature selection/extraction; prediction evaluation.
    Deployment phase: apply classifier; automatic annotations.
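    The three phases on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slide does not say which features or which classifier were used, so the feature functions, the nearest-centroid classifier, the label names and the toy "images" (short lists of pixel values) below are all hypothetical stand-ins.

```python
from statistics import mean

def generate_features(image):
    """Hypothetical feature generation: mean intensity and intensity range."""
    return [mean(image), max(image) - min(image)]

def select_features(features, keep=(0,)):
    """Hypothetical feature selection: keep only the listed feature indices."""
    return [features[i] for i in keep]

def extract(image):
    """Feature generation followed by feature selection, shared by all phases."""
    return select_features(generate_features(image))

def train(images, annotations):
    """Training phase: build one centroid per manual annotation label."""
    grouped = {}
    for image, label in zip(images, annotations):
        grouped.setdefault(label, []).append(extract(image))
    return {label: [mean(col) for col in zip(*rows)]
            for label, rows in grouped.items()}

def apply_classifier(centroids, image):
    """Deployment phase: annotate an image with the label of the closest centroid."""
    feats = extract(image)

    def squared_distance(label):
        return sum((a - b) ** 2 for a, b in zip(feats, centroids[label]))

    return min(centroids, key=squared_distance)

def evaluate(centroids, images, annotations):
    """Testing phase: fraction of held-out images predicted correctly."""
    hits = sum(apply_classifier(centroids, img) == lab
               for img, lab in zip(images, annotations))
    return hits / len(images)

# Toy data, invented for this sketch only.
train_images = [[0, 1, 1, 0], [1, 0, 0, 1], [9, 8, 9, 9], [8, 9, 8, 8]]
train_labels = ["absent", "absent", "expressed", "expressed"]

model = train(train_images, train_labels)
accuracy = evaluate(model, [[1, 1, 0, 0], [9, 9, 9, 8]], ["absent", "expressed"])
annotation = apply_classifier(model, [7, 8, 9, 8])
```

    The same `extract` function is reused in all three phases, which mirrors the repeated "feature generation" and "feature selection/extraction" boxes in the diagram.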
  • 26.
    [Slide image: excerpt from a paper page; name on slide: Liangxiu Han.]
    "... can be continuously and the initial input signal therefore is decomposed into different subbands."
    Fig. 2. Wavelet decomposition: (a) wavelet decomposition on a 2D array; (b) wavelet decomposition on an image. Subbands shown: LL2, LH2, HL2, HH2 (second level) and LH1, HL1, HH1 (first level), with LL1 as the intermediate band.
    "Mathematically, for a signal f(x, y) with 2D array (M × N), the wavele
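    A two-level decomposition into the subbands named in Fig. 2 can be sketched as follows. The slide does not say which wavelet was used, so this sketch assumes the Haar wavelet with plain averages and half-differences, and it assumes one particular naming convention (first letter: filter along rows, second letter: filter along columns); both are illustrative choices, not the authors' method.

```python
def haar_1d(values):
    """One Haar step: pairwise averages (low-pass) and half-differences (high-pass)."""
    low = [(values[i] + values[i + 1]) / 2 for i in range(0, len(values), 2)]
    high = [(values[i] - values[i + 1]) / 2 for i in range(0, len(values), 2)]
    return low, high

def transpose(matrix):
    return [list(column) for column in zip(*matrix)]

def filter_rows(matrix):
    """Apply the 1D Haar step to every row; return the low and high halves."""
    lows, highs = [], []
    for row in matrix:
        low, high = haar_1d(row)
        lows.append(low)
        highs.append(high)
    return lows, highs

def haar_2d(image):
    """One decomposition level for an image with even width and height."""
    row_low, row_high = filter_rows(image)
    # Filter along columns by transposing, filtering rows, and transposing back.
    ll, lh = (transpose(band) for band in filter_rows(transpose(row_low)))
    hl, hh = (transpose(band) for band in filter_rows(transpose(row_high)))
    return {"LL": ll, "LH": lh, "HL": hl, "HH": hh}

def decompose(image, levels=2):
    """Repeatedly decompose the LL band, keeping the detail bands of each level."""
    bands = {}
    current = image
    for level in range(1, levels + 1):
        out = haar_2d(current)
        for name in ("LH", "HL", "HH"):
            bands[f"{name}{level}"] = out[name]
        current = out["LL"]
    bands[f"LL{levels}"] = current
    return bands

flat = [[8, 8, 8, 8] for _ in range(4)]   # constant 4x4 test image
bands = decompose(flat, levels=2)
```

    For the constant test image every detail band is zero and `LL2` holds the single average value, which is a quick sanity check that the subbands separate smooth content from detail.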
  • 27.
    Accommodating user and application diversity. Iterative DMI process development.
    Tool level: many application domains; many tool sets; many process representations; many working practices.
    Gateway interface: DMI canonical representation and abstract machine (one model).
    Enactment level: many autonomous resources & services; multiple enactment mechanisms; multiple platform implementations. Mapping optimisation and enactment.
    Composing or hiding system diversity and complexity.
  • 28.
    Accommodating user and application diversity. Iterative DMI process development.
    Tool level: many application domains; many tool sets; many process representations; many working practices.
    Gateway interface: DMI canonical representation and abstract machine (one model).
    Enactment level: many autonomous resources & services; multiple enactment mechanisms; multiple platform implementations. Mapping optimisation and enactment.
    Composing or hiding system diversity and complexity.
  • 29.
    Formulation. [Diagram: the training, testing and deployment phases from slide 25, with the labels OGSA-DAI, Java, and Data-Intensive Systems Process Engineering Language.]
      /* import non-universal components from the computational environment */
      import uk.org.ogsadai.SQLQuery;                    // get definition of SQLQuery
      import uk.org.ogsadai.TupleToWebRowSetCharArrays;  // serialisation
      import uk.org.ogsadai.DeliverToRequestStatus;
      /* construct and identify instances of the PE */
      SQLQuery query = new SQLQuery();
      TupleToWebRowSetCharArrays wrs = new TupleToWebRowSetCharArrays();
      DeliverToRequestStatus del = new DeliverToRequestStatus();
      /* form connection c1 with an explicit literal stream expression as its
         source and query as its destination */
      String q1 = "SELECT * FROM weather";
      |- q1 -| => expression->query;
      String resourceID = "MySQLResource";
      |- resourceID -| => resource->query;
      query->data => data->wrs;
      wrs->result => input->del;
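    The wiring in the listing (a literal stream feeding a query element, whose data output feeds a serialiser, whose result feeds a delivery element) can be imitated with Python generators. This is a rough analogy only: it does not use OGSA-DAI or any of its APIs, the in-memory SQLite table stands in for the "MySQLResource" data resource, the function names are hypothetical, and the weather rows are made up for the sketch.

```python
import sqlite3

def sql_query(expressions, resource):
    """Stand-in for the query element: run each SQL expression, stream out rows."""
    for expression in expressions:
        cursor = resource.execute(expression)
        columns = [description[0] for description in cursor.description]
        for row in cursor:
            yield dict(zip(columns, row))

def serialise(tuples):
    """Stand-in for the serialiser: one text record per tuple (not WebRowSet)."""
    for record in tuples:
        yield ",".join(f"{key}={value}" for key, value in record.items())

def deliver(records):
    """Stand-in for delivery: collect the stream for the caller."""
    return list(records)

def build_demo_resource():
    """Create an in-memory database with a small, invented weather table."""
    connection = sqlite3.connect(":memory:")
    connection.execute("CREATE TABLE weather (station TEXT, temp_c REAL)")
    connection.executemany(
        "INSERT INTO weather VALUES (?, ?)",
        [("EDI", 11.5), ("GLA", 12.0)],
    )
    return connection

resource = build_demo_resource()
q1 = ["SELECT * FROM weather"]          # literal stream holding one expression
result = deliver(serialise(sql_query(iter(q1), resource)))
print(result)
```

    Because each stage is a generator, rows flow through the chain one at a time, which is the streaming behaviour the connections in the listing suggest.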
  • 30.
                                                         
  • 31.
                                                                                                                                                                                                                   
                                                                                                   
  • 32.
    Architecture results [Figure: speedup (0 to 5) against number of images (5,000 to 20,000), one curve each for 2 to 8 nodes]
  • 33-34.
    The ‘hump’ [Figure: workflow execution time; processing time in seconds (0 to 6000) for PE1 to PE9 against number of computing nodes (1 to 8), for 6400, 12800 and 19200 images]
  • 35-36.
    No ‘hump’ [Figure: workflow execution time; processing time in seconds (0 to 5000) for PE1 to PE9 against number of computing nodes (1 to 8), for 6400, 12800 and 19200 images]
  • 37.
    Data mining results
    Table 1. Preliminary classification performance using 10-fold validation
    Gene expression     Sensitivity   Specificity
    Humerus             0.7525        0.7921
    Handplate           0.7105        0.7231
    Fibula              0.7273        0.718
    Tibia               0.7467        0.7451
    Femur               0.7241        0.7345
    Ribs                0.5614        0.7538
    Petrous part        0.7903        0.7538
    Scapula             0.7882        0.7099
    Head mesenchyme     0.7857        0.5507
    Note: Sensitivity is the true positive rate; specificity is the true negative rate.
  • 38.
    [Diagram: three roles linked by three activities. Computational Thinkers: creating data models & computational methods. Domain Specialists: experiments & knowledge creation. Data-Intensive Engineers: implementations, compute & data resources. Activities: Formulation, Interaction, Execution. Links: Mapping, Steering]
  • 40.
    Spatial atlases for developmental biology
  • 41.
    Next Generation Embryology: Google Maps for Developmental Biology
  • 42.
    Next generation technology [Diagram comparing the current repository (e-MouseAtlas) with the enhanced repository (e-MouseAtlas+). Labels: Query, Navigation & Contribution; Query & Contribution; Navigation; Web & Java clients; Web clients; DSpace + webviewer; Knowledge + Maps; Knowledge; Maps. Credit: Gagarine Yaikhom]
  • 44.
  • 45.
    Data-intensive research workshop, Monday, 15 March 2010 @ e-Science Edinburgh. Topics: Volume, Complexity, Interaction, Databases, Paradigms, Analysis, Principles?
  • 46.
    Data-intensive Research Group. Academics: Malcolm Atkinson. Research Assistants: Jos Koetsier, Liangxiu Han, David Rodriguez, Gagarine Yaikhom. PhD Students: Thomas French, Luna De Ferrari, Rob Kitchen, Chee-Sun Liew, Fan Zhu. Research Students: Gary, Vijay, Hwee, Yue, Charalampos, Gideon, Jeff, Gareth, Charis, Andrejs. Jano van Hemert, http://research.nesc.ac.uk/
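
The data mining results slide (37) reports sensitivity and specificity, which its note defines as the true positive rate and the true negative rate. A minimal sketch of how the two figures follow from actual and predicted labels is given here, assuming binary labels (1 = expression present, 0 = absent). The function name and the sample labels are hypothetical and are not taken from the slides or from the group's software.

```python
def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for binary labels (1 positive, 0 negative)."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == 1 and p == 1)  # true positives
    fn = sum(1 for a, p in pairs if a == 1 and p == 0)  # false negatives
    tn = sum(1 for a, p in pairs if a == 0 and p == 0)  # true negatives
    fp = sum(1 for a, p in pairs if a == 0 and p == 1)  # false positives
    # Sensitivity: fraction of actual positives that were predicted positive.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: fraction of actual negatives that were predicted negative.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical labels, for illustration only.
actual = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
predicted = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.4f} specificity={spec:.4f}")
```

Under 10-fold validation, a calculation of this kind would typically be run on each held-out fold and the results averaged; the slides do not say how the reported figures were aggregated.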

Editor's Notes

  • #2 * This is not about projects, publications
  • #3 * One of the papers that is signposting the move towards data-intensive science
  • #4 * Sensors, large machines, interaction with data (software), interaction between people, interaction of software on data, ...
  • #5 * EMBL-EBI has now reached 4.5 petabytes * MESUR has 1 billion records of usage data * PACS at 160 GB in August 2009, quadruples every year
  • #6-#7 * More explicit forms of demands
  • #8 * A proposed solution * How do you go about implementing a solution under the fourth paradigm?
  • #11 * Formulation = an abstract description of the data-intensive challenge * Execution = an implementation of the challenge that runs on a computational platform * Interaction = necessary to manage the formulation process and to steer the execution
  • #17-#29 * Research focuses on progressing computer science * by evaluating both generic and tailored methodologies * in a multidisciplinary context with * rich use cases to test hypotheses
  • #30-#32, #35-#37, #50-#52 * Formulation = an abstract description of the data-intensive challenge * Execution = an implementation of the challenge that runs on a computational platform * Interaction = necessary to manage the formulation process and to steer the execution
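
Note #5 gives a growth rate ("160 GB in August 2009, quadruples every year") whose consequences are easy to underestimate. The sketch below works through the arithmetic, assuming the quadrupling continues unchanged and using decimal units (1 PB = 1,000,000 GB). Both function names are hypothetical, and the projection is an illustration, not a forecast made in the talk.

```python
def projected_size_gb(start_gb, factor_per_year, years):
    """Size after a whole number of years of constant multiplicative growth."""
    return start_gb * factor_per_year ** years


def years_to_reach(start_gb, factor_per_year, target_gb):
    """Smallest whole number of years after which the size is at least target_gb."""
    years = 0
    size = start_gb
    while size < target_gb:
        size *= factor_per_year  # one more year of growth
        years += 1
    return years


START_GB = 160          # figure quoted in note #5 for August 2009
FACTOR = 4              # "quadruples every year"
PETABYTE_GB = 1_000_000  # assumption: decimal units

for n in range(4):
    print(n, "years:", projected_size_gb(START_GB, FACTOR, n), "GB")
print("whole years to pass 1 PB:", years_to_reach(START_GB, FACTOR, PETABYTE_GB))
```

An integer loop is used instead of a logarithm so that the result is not affected by floating-point rounding at exact powers of the growth factor.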