eScience: A Transformed Scientific Method

Presentation by Jim Gray on eScience

  1. eScience: A Transformed Scientific Method
     Jim Gray, eScience Group, Microsoft Research, http://research.microsoft.com/~Gray
     in collaboration with Alex Szalay, Dept. Physics & Astronomy, Johns Hopkins University, http://www.sdss.jhu.edu/~szalay/
  2. Talk Goals
     • Explain eScience (and what I am doing)
     • Recommend that CSTB foster tools for
         • data capture (lab info management systems)
         • data curation (schemas, ontologies, provenance)
         • data analysis (workflow, algorithms, databases, data visualization)
         • data+doc publication (active docs, data-doc integration)
         • peer review (editorial services)
         • access (doc + data archives and overlay journals)
         • scholarly communication (wikis for each article and dataset)
  3. eScience: What is it?
     • Synthesis of information technology and science.
     • Science methods are evolving (tools).
     • Science is being codified/objectified. How to represent scientific information and knowledge in computers?
     • Science faces a data deluge. How to manage and analyze information?
     • Scientific communication is changing
         • publishing data & literature (curation, access, preservation)
  4. Science Paradigms
     • Thousand years ago: science was empirical
         • describing natural phenomena
     • Last few hundred years: theoretical branch
         • using models, generalizations
     • Last few decades: a computational branch
         • simulating complex phenomena
     • Today: data exploration (eScience)
         • unify theory, experiment, and simulation
         • Data captured by instruments or generated by simulator
         • Processed by software
         • Information/knowledge stored in computer
         • Scientist analyzes database / files using data management and statistics
  5. X-Info
     • The evolution of X-Info and Comp-X for each discipline X
     • How to codify and represent our knowledge
     • Data ingest
     • Managing a petabyte
     • Common schema
     • How to organize it
     • How to reorganize it
     • How to share with others
     • Query and Vis tools
     • Building and executing models
     • Integrating data and literature
     • Documenting experiments
     • Curation and long-term preservation
     [Diagram labels: The Generic Problems; Experiments & Instruments; Simulations; Literature; Other Archives; facts; questions; answers]
  7. Experiment Budgets ¼…½ Software
     • Software for
         • Instrument scheduling
         • Instrument control
         • Data gathering
         • Data reduction
         • Database
         • Analysis
         • Modeling
         • Visualization
     • Millions of lines of code
     • Repeated for experiment after experiment
     • Not much sharing or learning
     • CS can change this
     • Build generic tools
         • Workflow schedulers
         • Databases and libraries
         • Analysis packages
         • Visualizers
         • …
     Action item: Foster tools and foster tool support
  8. Project Pyramids
     • In most disciplines there are a few “giga” projects, several “mega” consortia, and then many small labs.
     • Often some instrument creates the need for a giga- or mega-project: polar station, accelerator, telescope, remote sensor, genome sequencer, supercomputer.
     • Tier 1, 2, 3 facilities to use instrument + data
     [Pyramid labels: International; Multi-Campus; Single Lab]
  9. Pyramid Funding
     • Giga projects need giga funding: Major Research Equipment Grants
     • Need projects at all scales
     • Computing example: supercomputers + departmental clusters + lab clusters
     • Technical + social issues
     • Fully fund giga projects; fund ½ of smaller projects, which get matching funds from other sources
     • “Petascale Computational Systems: Balanced Cyber-Infrastructure in a Data-Centric World,” IEEE Computer, V. 39.1, pp 110-112, January, 2006.
  10. Action item: Invest in tools at all levels
  12. Need Lab Info Management Systems (LIMSs)
     • Pipeline instrument + simulator data to archive & publish to web.
     • NASA Level 0 (raw) data, Level 1 (calibrated), Level 2 (derived)
     • Needs workflow tool to manage pipeline
     • Build prototypes.
     • Examples:
         • SDSS, LifeUnderYourFeet, MBARI Shore Side Data System.
     Action item: Foster generic LIMS
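
The Level 0 / Level 1 / Level 2 pipeline on this slide can be sketched in a few lines of Python. This is only an illustration: the `Product` record, the linear calibration constants, and the function names are hypothetical and are not taken from SDSS, MBARI, or any NASA pipeline. The point is that each step carries forward a record of what it was derived from.

```python
from dataclasses import dataclass, field


@dataclass
class Product:
    level: int                      # 0 = raw, 1 = calibrated, 2 = derived
    values: list
    provenance: list = field(default_factory=list)  # steps that produced it


def calibrate(raw, gain=2.0, offset=-1.0):
    """Level 0 -> Level 1: apply a (hypothetical) linear calibration."""
    values = [gain * v + offset for v in raw.values]
    step = f"calibrate(gain={gain}, offset={offset})"
    return Product(1, values, raw.provenance + [step])


def derive_mean(cal):
    """Level 1 -> Level 2: reduce calibrated readings to one summary value."""
    mean = sum(cal.values) / len(cal.values)
    return Product(2, [mean], cal.provenance + ["derive_mean()"])


raw = Product(0, [1.0, 2.0, 3.0], ["instrument read"])
level2 = derive_mean(calibrate(raw))
print(level2.level, level2.values, level2.provenance)
```

A workflow tool would add scheduling, retries, and archiving around steps like these; the sketch covers only the data hand-off between levels.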
  13. Science Needs Info Management
     • Simulators produce lots of data
     • Experiments produce lots of data
     • Standard practice:
         • each simulation run produces a file
         • each instrument-day produces a file
         • each process step produces a file
         • files have descriptive names
         • files have similar formats (described elsewhere)
     • Projects have millions of files (or soon will)
     • No easy way to manage or analyze the data.
  14. Data Analysis
     • Looking for
         • Needles in haystacks: the Higgs particle
         • Haystacks: dark matter, dark energy
     • Needles are easier than haystacks
     • Global statistics have poor scaling
         • Correlation functions are N², likelihood techniques N³
     • We can only do N log N
     • Must accept approximate answers: new algorithms
     • Requires combination of
         • statistics &
         • computer science
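
To make the N² versus N log N contrast concrete, here is a small Python sketch that counts pairs of points within a distance r, the basic operation behind a two-point correlation function. It is deliberately simplified to one dimension with integer coordinates, where sorting plus binary search gives an exact answer; real surveys work in two or three dimensions and typically rely on tree structures and approximations. The function names are made up for this example.

```python
import bisect
import random


def pairs_brute(xs, r):
    """Count pairs within distance r by testing all N*(N-1)/2 pairs: O(N^2)."""
    n = len(xs)
    return sum(
        1 for i in range(n) for j in range(i + 1, n) if abs(xs[i] - xs[j]) <= r
    )


def pairs_sorted(xs, r):
    """Same count using one sort plus a binary search per point: O(N log N)."""
    s = sorted(xs)
    total = 0
    for i, x in enumerate(s):
        # Index just past the last point that lies within r to the right of x.
        hi = bisect.bisect_right(s, x + r)
        total += hi - (i + 1)
    return total


random.seed(0)
xs = [random.randrange(10_000) for _ in range(500)]
print(pairs_brute(xs, 25), pairs_sorted(xs, 25))
```

Both functions return the same count; only the amount of work differs as N grows.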
  15. Analysis and Databases
     • Much statistical analysis deals with
         • Creating uniform samples
         • Data filtering
         • Assembling relevant subsets
         • Estimating completeness
         • Censoring bad data
         • Counting and building histograms
         • Generating Monte-Carlo subsets
         • Likelihood calculations
         • Hypothesis testing
     • Traditionally performed on files
     • These tasks are better done in a structured store with
         • indexing,
         • aggregation,
         • parallelism,
         • query, analysis,
         • visualization tools.
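
As a small illustration of doing these tasks in a structured store, the sketch below uses Python's built-in `sqlite3` module to censor flagged rows, select a subset, and build a histogram in a single indexed query. The table layout (`obs`, `mag`, `flag`) and the synthetic data are assumptions made for this example, not a real survey schema.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, mag REAL, flag INTEGER)")
con.execute("CREATE INDEX obs_mag ON obs (mag)")

# Synthetic observations: magnitudes 14.5 .. 18.5, every tenth row flagged bad.
rows = [(i, 14.5 + (i % 5), 1 if i % 10 == 0 else 0) for i in range(1000)]
con.executemany("INSERT INTO obs VALUES (?, ?, ?)", rows)

# Censor bad data, keep a magnitude range, and histogram by whole magnitude.
hist = con.execute(
    """
    SELECT CAST(mag AS INTEGER) AS bin, COUNT(*) AS n
    FROM obs
    WHERE flag = 0 AND mag < 17.0
    GROUP BY bin
    ORDER BY bin
    """
).fetchall()
print(hist)
```

The same filter-and-count done over flat files would need a custom script per question; here each new question is just another query.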
  16. Data Delivery: Hitting a Wall
     • You can GREP 1 MB in a second
     • You can GREP 1 GB in a minute
     • You can GREP 1 TB in 2 days
     • You can GREP 1 PB in 3 years
     • Oh!, and 1 PB ~ 4,000 disks
     • At some point you need indices to limit search, plus parallel data search and analysis
     • This is where databases can help
     • You can FTP 1 MB in 1 sec
     • FTP 1 GB / min (~1 $/GB)
     • … 2 days and 1K$
     • … 3 years and 1M$
     FTP and GREP are not adequate
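
The GREP figures on this slide are round numbers. Dividing them out shows the sequential scan rate each one implies. The sketch below assumes decimal units (1 MB = 10^6 bytes) and 365-day years, neither of which the slide states.

```python
SECONDS = {"second": 1, "minute": 60, "day": 86_400, "year": 365 * 86_400}

# (label, bytes scanned, elapsed seconds) as quoted on the slide
quoted = [
    ("1 MB in a second", 10**6, 1 * SECONDS["second"]),
    ("1 GB in a minute", 10**9, 1 * SECONDS["minute"]),
    ("1 TB in 2 days", 10**12, 2 * SECONDS["day"]),
    ("1 PB in 3 years", 10**15, 3 * SECONDS["year"]),
]


def implied_mb_per_s(nbytes, seconds):
    """Sequential rate in MB/s implied by scanning nbytes in the given time."""
    return nbytes / seconds / 10**6


rates = {label: implied_mb_per_s(b, s) for label, b, s in quoted}
for label, rate in rates.items():
    print(f"{label}: about {rate:.1f} MB/s")
```

All four rates fall between 1 and roughly 17 MB/s, so the quoted times grow essentially linearly with data size. That linear growth is why a full scan stops being practical at the terabyte and petabyte scale.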
  17. Accessing Data
     • If there is too much data to move around,
     • take the analysis to the data!
     • Do all data manipulations at the database
         • Build custom procedures and functions in the database
     • Automatic parallelism guaranteed
     • Easy to build in custom functionality
         • Databases & procedures being unified
         • Example: temporal and spatial indexing
         • Pixel processing
     • Easy to reorganize the data
         • Multiple views, each optimal for certain analyses
         • Building hierarchical summaries is trivial
     • Scalable to petabyte datasets
     Active databases!
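
A minimal sketch of "custom functions in the database", using Python's `sqlite3.Connection.create_function` to register an angular-separation function and call it from SQL. The table and column names are hypothetical. SQLite is used only because it ships with Python; it does not provide the automatic parallelism the slide attributes to server-class databases.

```python
import math
import sqlite3


def angular_sep_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (spherical law of cosines)."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    c = math.sin(d1) * math.sin(d2) + math.cos(d1) * math.cos(d2) * math.cos(r1 - r2)
    return math.degrees(math.acos(max(-1.0, min(1.0, c))))


con = sqlite3.connect(":memory:")
con.create_function("angular_sep_deg", 4, angular_sep_deg)
con.execute("CREATE TABLE objects (name TEXT, ra_deg REAL, dec_deg REAL)")
con.executemany(
    "INSERT INTO objects VALUES (?, ?, ?)",
    [("a", 10.0, 0.0), ("b", 10.5, 0.0), ("c", 50.0, 20.0)],
)

# The filter runs next to the data; only matching rows leave the database.
near = con.execute(
    "SELECT name FROM objects "
    "WHERE angular_sep_deg(ra_deg, dec_deg, 10.0, 0.0) < 1.0 ORDER BY name"
).fetchall()
print(near)
```

Only the two nearby objects are returned to the client; the distant one never crosses the wire, which is the whole argument for moving the analysis rather than the data.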
  18. Analysis and Databases
     • Much statistical analysis deals with
         • Creating uniform samples
         • Data filtering
         • Assembling relevant subsets
         • Estimating completeness
         • Censoring bad data
         • Counting and building histograms
         • Generating Monte-Carlo subsets
         • Likelihood calculations
         • Hypothesis testing
     • Traditionally performed on files
     • These tasks are better done in a structured store with
         • indexing,
         • aggregation,
         • parallelism,
         • query, analysis,
         • visualization tools.
     Action item: Foster data management, data analysis, data visualization, algorithms & tools
  19. Let 100 Flowers Bloom
     • Comp-X has some nice tools
         • Beowulf
         • Condor
         • BOINC
         • Matlab
     • These tools grew from the community
     • It’s HARD to see a common pattern
         • Linux vs FreeBSD: why was Linux more successful? Community, personality, timing, …???
     • Lesson: let 100 flowers bloom.
  20. Talk Goals
     • Explain eScience (and what I am doing)
     • Recommend that CSTB foster tools for
         • data capture (lab info management systems)
         • data curation (schemas, ontologies, provenance)
         • data analysis (workflow, algorithms, databases, data visualization)
         • data+doc publication (active docs, data-doc integration)
         • peer review (editorial services)
         • access (doc + data archives and overlay journals)
         • scholarly communication (wikis for each article and dataset)
  21. All Scientific Data Online
     • Many disciplines overlap and use data from other sciences.
     • Internet can unify all literature and data
     • Go from literature to computation to data and back to literature.
     • Information at your fingertips, for everyone, everywhere
     • Increase Scientific Information Velocity
     • Huge increase in Science Productivity
     [Diagram labels: Literature; Derived and Re-combined data; Raw Data]
  22. Unlocking Peer-Reviewed Literature
     • Agencies and foundations are mandating that research be public domain.
         • NIH (30 B$/y, 40k PIs, …) (see http://www.taxpayeraccess.org/)
         • Wellcome Trust
         • Japan, China, Italy, South Africa, …
         • Public Library of Science…
     • Other agencies will follow NIH
  23. How Does the New Library Work?
     • Who pays for storage access (unfunded mandate)?
         • It’s cheap: 1 milli-dollar per access
     • But… curation is not cheap:
         • Author/Title/Subject/Citation/…
         • Dublin Core is great but…
         • NLM has a 6,000-line XSD for documents: http://dtd.nlm.nih.gov/publishing
         • Need to capture document structure from author
             • Sections, figures, equations, citations, …
             • Automate curation
         • NCBI-PubMedCentral is doing this
             • Preparing for 1M articles/year
         • Automate it!
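
For a sense of what the simplest layer of document metadata looks like, here is a sketch that writes a few Dublin Core elements (`title`, `creator`, `subject`) with Python's standard `xml.etree.ElementTree`. The `record` wrapper element and the `dublin_core` helper are assumptions for this example. The much richer NLM document schema mentioned on the slide is not modeled here.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"   # Dublin Core element namespace
ET.register_namespace("dc", DC)


def dublin_core(title, creators, subjects):
    """Build a minimal metadata record using Dublin Core elements."""
    record = ET.Element("record")          # hypothetical wrapper element
    ET.SubElement(record, f"{{{DC}}}title").text = title
    for creator in creators:
        ET.SubElement(record, f"{{{DC}}}creator").text = creator
    for subject in subjects:
        ET.SubElement(record, f"{{{DC}}}subject").text = subject
    return record


rec = dublin_core(
    "eScience: A Transformed Scientific Method", ["Jim Gray"], ["eScience"]
)
xml_text = ET.tostring(rec, encoding="unicode")
print(xml_text)
```

Author, title, and subject are the easy part. Capturing sections, figures, equations, and citations is where curation gets expensive, which is why the slide argues for capturing that structure from the author's tools.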
  24. PubMed Central International
     • “Information at your fingertips”
     • Deployed in US, China, England, Italy, South Africa, Japan
     • UK PMCI: http://ukpmc.ac.uk/
     • Each site can accept documents
     • Archives replicated
     • Federate through web services
     • Working to integrate Word/Excel/… with PubMedCentral, e.g. WordML, XSD
     • To be clear: NCBI is doing 99.99% of the work.
  28. Overlay Journals
     • Articles and data in public archives
     • Journal title page in public archive.
     • All covered by Creative Commons License
         • permits: copy/distribute
         • requires: attribution
         • http://creativecommons.org/
     [Diagram labels: Journal Management System; Journal Collaboration System; Data Archives; articles; title page; comments; Data Sets]
     Action item: Do for other sciences what NLM has done for BIO: Genbank-PubMedCentral…
  29. Better Authoring Tools
     • Extend authoring tools to
         • capture document metadata (NLM tagset)
         • represent documents in standard format
             • WordML (ECMA standard)
         • capture references
         • make active documents (words and data).
     • Easier for authors
     • Easier for archives
  30. Conference Management Tool
     • Currently a conference peer-review system (~300 conferences)
         • Form committee
         • Accept manuscripts
         • Declare interest/recuse
         • Review
         • Decide
         • Form program
         • Notify
         • Revise
  31. Publishing Peer Review
     • Add publishing steps
         • Form committee
         • Accept manuscripts
         • Declare interest/recuse
         • Review
         • Decide
         • Form program
         • Notify
         • Revise
         • Publish
     • & improve author-reader experience
     • Manage versions
     • Capture data
     • Interactive documents
     • Capture workshop
         • presentations
         • proceedings
     • Capture classroom: ConferenceXP
     • Moderated discussions of published articles
     • Connect to archives
  33. Why Not a Wiki?
     • Peer review is different
         • It is very structured
         • It is moderated
         • There is a degree of confidentiality
     • Wiki is egalitarian
         • It’s a conversation
         • It’s completely transparent
     • Don’t get me wrong:
         • Wikis are great
         • SharePoints are great
         • But… peer review is different.
         • And, incidentally: review of proposals, projects, … is more like peer review.
     • Let’s have a moderated wiki about published literature; PLoS-One is doing this
     Action item: Foster new document authoring and publication models and tools
  34. So… What about Publishing Data?
     • The answer is 42.
     • But…
         • What are the units?
         • How precise? How accurate? 42.5 ± .01
         • Show your work: data provenance
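
One way to read this slide is that a published number should travel with its unit, its uncertainty, and a record of how it was produced. The `Measurement` class below is a hypothetical sketch of that idea. The unit `"cm"` and the provenance steps are invented for illustration, since the slide gives only the value 42.5 ± .01.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    value: float
    uncertainty: float
    unit: str
    provenance: tuple   # ordered description of how the value was produced

    def __str__(self):
        return f"{self.value} ± {self.uncertainty} {self.unit}"


# Hypothetical unit and processing history, for illustration only.
m = Measurement(42.5, 0.01, "cm", ("raw reading", "calibrated", "averaged"))
print(m)
print(" -> ".join(m.provenance))
```

Making the record immutable (`frozen=True`) is a design choice for the sketch: a published value should not change without leaving a new provenance entry.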
  35. Thought Experiment
     • You have collected some data and want to publish science based on it.
     • How do you publish the data so that others can read it and reproduce your results in 100 years?
         • How do you document the collection process?
         • How do you document data processing (scrubbing & reducing the data)?
         • Where do you put it?
  37. Objectifying Knowledge
     • This requires agreement about
         • Units: cgs
         • Measurements: who/what/when/where/how
         • CONCEPTS:
             • What’s a planet, star, galaxy, …?
             • What’s a gene, protein, pathway, …?
     • Need to objectify science:
         • What are the objects?
         • What are the attributes?
         • What are the methods (in the OO sense)?
     • This is mostly Physics/Bio/Eco/Econ/… but CS can do generic things
     Warning! Painful discussions ahead:
         • The “O” word: Ontology
         • The “S” word: Schema
         • The “CV” words: Controlled Vocabulary
         • Domain experts do not agree
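
The questions "what are the objects, attributes, and methods?" map directly onto a class definition. The sketch below is a toy example: the `ObjectType` vocabulary, the `SkyObject` attributes, and the `brighter_than` method are all hypothetical choices, exactly the kind of thing domain experts would argue about.

```python
from dataclasses import dataclass
from enum import Enum


class ObjectType(Enum):
    """A tiny controlled vocabulary: only these terms are allowed."""
    STAR = "star"
    GALAXY = "galaxy"


@dataclass
class SkyObject:
    obj_type: ObjectType   # concept, drawn from the controlled vocabulary
    ra_deg: float          # attributes, with units fixed in the name
    dec_deg: float
    magnitude: float

    def brighter_than(self, other):
        """A method in the OO sense: smaller magnitude means brighter."""
        return self.magnitude < other.magnitude


a = SkyObject(ObjectType.STAR, 10.0, 0.0, 12.0)
b = SkyObject(ObjectType("galaxy"), 50.0, 20.0, 15.0)
print(a.brighter_than(b), b.obj_type)
```

Computer science supplies the generic machinery (types, enumerations, schemas); deciding what counts as a star or a galaxy remains the discipline's job.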
  38. 38. The Best Example: Entrez-GenBank http:// www.ncbi.nlm.nih.gov / <ul><li>Sequence data deposited with Genbank </li></ul><ul><li>Literature references Genbank ID </li></ul><ul><li>BLAST searches Genbank </li></ul><ul><li>Entrez integrates and searches </li></ul><ul><ul><li>PubMedCentral </li></ul></ul><ul><ul><li>PubChem </li></ul></ul><ul><ul><li>Genbank </li></ul></ul><ul><ul><li>Proteins, SNP, </li></ul></ul><ul><ul><li>Structure,.. </li></ul></ul><ul><ul><li>Taxonomy… </li></ul></ul><ul><ul><li>Many more </li></ul></ul>Nucleotide sequences Protein sequences Taxon Phylogeny MMDB 3 -D Structure PubMed abstracts Complete Genomes PubMed Entrez Genomes Publishers Genome Centers
  39. 39. Publishing Data <ul><li>Exponential growth: </li></ul><ul><ul><li>Projects last at least 3-5 years </li></ul></ul><ul><ul><li>Data sent upwards only at the end of the project </li></ul></ul><ul><ul><li>Data will never be centralized </li></ul></ul><ul><li>More responsibility on projects </li></ul><ul><ul><li>Becoming Publishers and Curators </li></ul></ul><ul><li>Data will reside with projects </li></ul><ul><ul><li>Analyses must be close to the data </li></ul></ul>Roles Authors Publishers Curators Consumers Traditional Scientists Journals Libraries Scientists Emerging Collaborations Project www site Bigger Archives Scientists
  40. 40. Data Pyramid <ul><li>Very extended distribution of data sets: </li></ul><ul><li>data on all scales! </li></ul><ul><li>Most datasets are small, and manually maintained (Excel spreadsheets) </li></ul><ul><li>Total volume dominated by multi-TB archives </li></ul><ul><li>But, small datasets have real value </li></ul><ul><li>Most data is born digital collected via electronic sensors or generated by simulators. </li></ul>
  41. 41. Data Sharing/Publishing <ul><li>What is the business model (reward/career benefit)? </li></ul><ul><li>Three tiers (power law!!!) </li></ul><ul><ul><li>(a) big projects </li></ul></ul><ul><ul><li>(b) value added, refereed products </li></ul></ul><ul><ul><li>(c) ad-hoc data, on-line sensors, images, outreach info </li></ul></ul><ul><li>We have largely done (a) </li></ul><ul><li>Need “Journal for Data” to solve (b) </li></ul><ul><li>Need “VO-Flickr” (a simple interface) (c) </li></ul><ul><li>Mashups are emerging in science </li></ul><ul><li>Need an integrated environment for ‘ virtual excursions ’ for education (C. Wong) </li></ul>
  42. 42. The Best Example: Entrez-GenBank http:// www.ncbi.nlm.nih.gov / <ul><li>Sequence data deposited with Genbank </li></ul><ul><li>Literature references Genbank ID </li></ul><ul><li>BLAST searches Genbank </li></ul><ul><li>Entrez integrates and searches </li></ul><ul><ul><li>PubMedCentral </li></ul></ul><ul><ul><li>PubChem </li></ul></ul><ul><ul><li>Genbank </li></ul></ul><ul><ul><li>Proteins, SNP, </li></ul></ul><ul><ul><li>Structure,.. </li></ul></ul><ul><ul><li>Taxonomy… </li></ul></ul><ul><ul><li>Many more </li></ul></ul>Action item Foster Digital Data Libraries (not metadata, real data) and integration with literature Nucleotide sequences Protein sequences Taxon Phylogeny MMDB 3 -D Structure PubMed abstracts Complete Genomes PubMed Entrez Genomes Publishers Genome Centers
  43. 43. Talk Goals <ul><li>Explain eScience (and what I am doing) & </li></ul><ul><li>Recommend CSTB foster tools for </li></ul><ul><li>data capture (lab info management systems) </li></ul><ul><li>data curation (schemas, ontologies, provenance) </li></ul><ul><li>data analysis (workflow, algorithms, databases, data visualization) </li></ul><ul><li>data+doc publication (active docs, data-doc integration) </li></ul><ul><li>peer review (editorial services) </li></ul><ul><li>access (doc + data archives and overlay journals) </li></ul><ul><li>Scholarly communication (wikis for each article and dataset) </li></ul>
  44. 44. backup
  45. 45. Astronomy <ul><li>Help build world-wide telescope </li></ul><ul><ul><li>All astronomy data and literature online and cross indexed </li></ul></ul><ul><ul><li>Tools to analyze the data </li></ul></ul><ul><li>Built SkyServer.SDSS.org </li></ul><ul><li>Built Analysis system </li></ul><ul><ul><li>MyDB </li></ul></ul><ul><ul><li>CasJobs (batch job) </li></ul></ul><ul><li>OpenSkyQuery Federation of ~20 observatories. </li></ul><ul><li>Results: </li></ul><ul><ul><li>It works and is used every day </li></ul></ul><ul><ul><li>Spatial extensions in SQL 2005 </li></ul></ul><ul><ul><li>A good example of Data Grid </li></ul></ul><ul><ul><li>Good examples of Web Services. </li></ul></ul>
  46. 46. World Wide Telescope Virtual Observatory http://www.us-vo.org/ http://www.ivoa.net/ <ul><li>Premise: Most data is (or could be) online </li></ul><ul><li>So, the Internet is the world’s best telescope: </li></ul><ul><ul><li>It has data on every part of the sky </li></ul></ul><ul><ul><li>In every measured spectral band: optical, x-ray, radio.. </li></ul></ul><ul><ul><li>As deep as the best instruments (2 years ago). </li></ul></ul><ul><ul><li>It is up when you are up. The “seeing” is always great (no working at night, no clouds, no moons, no..). </li></ul></ul><ul><ul><li>It’s a smart telescope: links objects and data to literature on them. </li></ul></ul>
  47. 47. Why Astronomy Data? <ul><li>It has no commercial value </li></ul><ul><ul><li>No privacy concerns </li></ul></ul><ul><ul><li>Can freely share results with others </li></ul></ul><ul><ul><li>Great for experimenting with algorithms </li></ul></ul><ul><li>It is real and well documented </li></ul><ul><ul><li>High-dimensional data (with confidence intervals) </li></ul></ul><ul><ul><li>Spatial data </li></ul></ul><ul><ul><li>Temporal data </li></ul></ul><ul><li>Many different instruments from many different places and many different times </li></ul><ul><li>Federation is a goal </li></ul><ul><li>There is a lot of it (petabytes) </li></ul>Figure panels: IRAS 100 μm, ROSAT ~keV, DSS Optical, 2MASS 2 μm, IRAS 25 μm, NVSS 20cm, WENSS 92cm, GB 6cm.
  48. 48. Time and Spectral Dimensions The Multiwavelength Crab Nebula X-ray, optical, infrared, and radio views of the nearby Crab Nebula, which is now in a state of chaotic expansion after a supernova explosion first sighted in 1054 A.D. by Chinese astronomers. Slide courtesy of Robert Brunner @ Caltech. Crab star 1053 AD
  49. 49. SkyServer.SDSS.org <ul><li>A modern archive </li></ul><ul><ul><li>Access to Sloan Digital Sky Survey Spectroscopic and Optical surveys </li></ul></ul><ul><ul><li>Raw Pixel data lives in file servers </li></ul></ul><ul><ul><li>Catalog data (derived objects) lives in Database </li></ul></ul><ul><ul><li>Online query to any and all </li></ul></ul><ul><li>Also used for education </li></ul><ul><ul><li>150 hours of online Astronomy </li></ul></ul><ul><ul><li>Implicitly teaches data analysis </li></ul></ul><ul><li>Interesting things </li></ul><ul><ul><li>Spatial data search </li></ul></ul><ul><ul><li>Client query interface via Java Applet </li></ul></ul><ul><ul><li>Query from Emacs, Python, …. </li></ul></ul><ul><ul><li>Cloned by other surveys (a template design) </li></ul></ul><ul><ul><li>Web services are core of it. </li></ul></ul>
  50. 50. SkyServer SkyServer.SDSS.org <ul><li>Like the TerraServer, but looking the other way: a picture of ¼ of the universe </li></ul><ul><li>Sloan Digital Sky Survey Data: Pixels + Data Mining </li></ul><ul><li>About 400 attributes per “object” </li></ul><ul><li>Spectrograms for 1% of objects </li></ul>
  51. 51. Demo of SkyServer <ul><li>Shows standard web server </li></ul><ul><li>Pixel/image data </li></ul><ul><li>Point and click </li></ul><ul><li>Explore one object </li></ul><ul><li>Explore sets of objects (data mining) </li></ul>
  52. 52. SkyQuery (http://skyquery.net/) <ul><li>Distributed Query tool using a set of web services </li></ul><ul><li>Many astronomy archives from Pasadena, Chicago, Baltimore, Cambridge (England) </li></ul><ul><li>Has grown from 4 to 15 archives, now becoming an international standard </li></ul><ul><li>WebService Poster Child </li></ul><ul><li>Allows queries like: </li></ul>SELECT o.objId, o.r, o.type, t.objId FROM SDSS:PhotoPrimary o, TWOMASS:PhotoPrimary t WHERE XMATCH(o,t)<3.5 AND AREA(181.3,-0.76,6.5) AND o.type=3 and (o.I - t.m_j)>2
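The XMATCH predicate in the query above pairs objects from two archives whose positions agree. A minimal Python sketch of that idea, assuming a plain angular-separation cut in arcseconds rather than SkyQuery's actual XMATCH semantics (which the slide does not spell out). The toy catalogs and the `cross_match` and `angular_sep_arcsec` names are hypothetical.

```python
import math

def angular_sep_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation in arcseconds; inputs are (ra, dec) in degrees."""
    r1, d1, r2, d2 = map(math.radians, (ra1, dec1, ra2, dec2))
    # Haversine formula, numerically stable for small separations.
    a = (math.sin((d2 - d1) / 2) ** 2
         + math.cos(d1) * math.cos(d2) * math.sin((r2 - r1) / 2) ** 2)
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0

def cross_match(cat_a, cat_b, max_sep_arcsec):
    """Return (id_a, id_b) pairs closer than max_sep_arcsec (brute force)."""
    pairs = []
    for id_a, ra_a, dec_a in cat_a:
        for id_b, ra_b, dec_b in cat_b:
            if angular_sep_arcsec(ra_a, dec_a, ra_b, dec_b) < max_sep_arcsec:
                pairs.append((id_a, id_b))
    return pairs

# Hypothetical toy catalogs: (object id, ra in degrees, dec in degrees).
sdss = [("s1", 181.3000, -0.7600), ("s2", 181.4000, -0.7000)]
twomass = [("t1", 181.3005, -0.7601), ("t2", 182.0000, 0.5000)]

matches = cross_match(sdss, twomass, 3.5)
print(matches)
```

The brute-force double loop is only for illustration; a real archive would use a spatial index so that each object is compared with nearby candidates only.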
  53. 53. SkyQuery Structure <ul><li>Each SkyNode publishes </li></ul><ul><ul><li>Schema Web Service </li></ul></ul><ul><ul><li>Database Web Service </li></ul></ul><ul><li>Portal </li></ul><ul><ul><li>Plans Query (2 phase) </li></ul></ul><ul><ul><li>Integrates answers </li></ul></ul><ul><ul><li>Is itself a web service </li></ul></ul>Diagram labels: 2MASS, INT, SDSS, FIRST, SkyQuery Portal, Image Cutout.
  54. 54. Schema (aka metadata) <ul><li>Everyone starts with the same schema <stuff/> Then they start arguing about semantics. </li></ul><ul><li>Virtual Observatory: http://www.ivoa.net/ </li></ul><ul><li>Metadata based on Dublin Core: http://www.ivoa.net/Documents/latest/RM.html </li></ul><ul><li>Universal Content Descriptors (UCD): http://vizier.u-strasbg.fr/doc/UCD.htx Captures quantitative concepts and their units. Reduced from ~100,000 tables in literature to ~1,000 terms </li></ul><ul><li>VOTable – a schema for answers to questions http://www.us-vo.org/VOTable/ </li></ul><ul><li>Common Queries: Cone Search and Simple Image Access Protocol, SQL </li></ul><ul><li>Registry: http://www.ivoa.net/Documents/latest/RMExp.html still a work in progress. </li></ul>
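A Cone Search asks an archive for every object within a given radius of a sky position. A minimal Python sketch of building such a request follows. It assumes the `RA`, `DEC` and `SR` parameter names (all in decimal degrees) of the IVOA Simple Cone Search convention as I understand it; the base URL and the `cone_search_url` helper are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

def cone_search_url(base_url, ra_deg, dec_deg, radius_deg):
    """Build a cone-search style GET URL; all angles are decimal degrees."""
    if not 0.0 <= ra_deg <= 360.0:
        raise ValueError("RA must be in [0, 360] degrees")
    if not -90.0 <= dec_deg <= 90.0:
        raise ValueError("DEC must be in [-90, 90] degrees")
    if radius_deg < 0.0:
        raise ValueError("search radius must be non-negative")
    query = urlencode({"RA": ra_deg, "DEC": dec_deg, "SR": radius_deg})
    # Append to an existing query string if the base URL already has one.
    separator = "&" if "?" in base_url else "?"
    return f"{base_url}{separator}{query}"

# Hypothetical service endpoint, used here only to show the URL shape.
url = cone_search_url("http://example.org/cone", 181.3, -0.76, 0.1)
print(url)
```

A service following this convention would typically answer with a VOTable document, which is why a shared schema for answers matters as much as a shared query form.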
  55. 55. SkyServer/SkyQuery Evolution MyDB and Batch Jobs <ul><li>Problem: need multi-step data analysis (not just single query). </li></ul><ul><li>Solution: Allow personal databases on portal </li></ul><ul><li>Problem: some queries are monsters </li></ul><ul><li>Solution: “Batch schedule” on portal. Deposits answer in personal database. </li></ul>
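The MyDB and batch-job pattern above can be sketched in a few lines of Python with the standard `sqlite3` module. This is an illustrative toy under stated assumptions, not the CasJobs implementation: the two in-memory databases, the `submit` and `run_batch` helpers, and the sample rows are all hypothetical.

```python
import sqlite3
from collections import deque

# Stand-in for the shared archive (hypothetical sample rows).
catalog = sqlite3.connect(":memory:")
catalog.execute("CREATE TABLE PhotoPrimary (objId INTEGER, r REAL, type INTEGER)")
catalog.executemany(
    "INSERT INTO PhotoPrimary VALUES (?, ?, ?)",
    [(1, 17.2, 3), (2, 21.5, 6), (3, 18.9, 3)],
)

# Stand-in for one user's personal database on the portal.
mydb = sqlite3.connect(":memory:")

batch_queue = deque()

def submit(query, into_table):
    """Queue a long-running query; its answer will land in a personal table."""
    batch_queue.append((query, into_table))

def run_batch():
    """Run queued queries against the archive and deposit answers in mydb."""
    while batch_queue:
        query, into_table = batch_queue.popleft()
        cursor = catalog.execute(query)
        columns = [d[0] for d in cursor.description]
        rows = cursor.fetchall()
        # Table and column names are interpolated directly: fine for a
        # sketch, not safe for untrusted input.
        mydb.execute(f"CREATE TABLE {into_table} ({', '.join(columns)})")
        placeholders = ", ".join("?" for _ in columns)
        mydb.executemany(f"INSERT INTO {into_table} VALUES ({placeholders})", rows)
    mydb.commit()

# Step 1: a batch job extracts a working set into the personal database.
submit("SELECT objId, r FROM PhotoPrimary WHERE type = 3", "galaxies")
run_batch()

# Step 2: follow-up analysis runs on the small personal table only.
bright = mydb.execute("SELECT objId FROM galaxies WHERE r < 18 ORDER BY objId").fetchall()
print(bright)
```

The point of the design is that the second step never touches the big archive again, which is what makes multi-step analysis practical.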
  56. 56. Ecosystem Sensor Net LifeUnderYourFeet.Org <ul><li>Small sensor net monitoring soil </li></ul><ul><li>Sensors feed to a database </li></ul><ul><li>Helping build system to collect & organize data. </li></ul><ul><li>Working on data analysis tools </li></ul><ul><li>Prototype for other LIMS (Laboratory Information Management Systems) </li></ul>
  57. 57. RNA Structural Genomics <ul><li>Goal: Predict secondary and tertiary structure from sequence. Deduce tree of life. </li></ul><ul><li>Technique: Analyze sequence variations sharing a common structure across tree of life </li></ul><ul><li>Representing structurally aligned sequences is a key challenge </li></ul><ul><li>Creating a database-driven alignment workbench accessing public and private sequence data </li></ul>
  58. 58. VHA Health Informatics <ul><li>VHA: largest standardized electronic medical records system in US. </li></ul><ul><li>Design, populate and tune a ~20 TB Data Warehouse and Analytics environment </li></ul><ul><li>Evaluate population health and treatment outcomes </li></ul><ul><li>Support epidemiological studies </li></ul><ul><ul><li>7 million enrollees </li></ul></ul><ul><ul><li>5 million patients </li></ul></ul><ul><ul><li>Example Milestones: </li></ul></ul><ul><ul><ul><li>1 Billionth Vital Sign loaded in April ‘06 </li></ul></ul></ul><ul><ul><ul><li>30 minutes to population-wide obesity analysis (next slide) </li></ul></ul></ul><ul><ul><ul><li>Discovered seasonality in blood pressure -- NEJM fall ‘06 </li></ul></ul></ul>
  59. 59. HDR Vitals Based Body Mass Index Calculation on VHA FY04 Population. Source: VHA Corporate Data Warehouse. Patient counts by BMI category: 23,876 (0.7%); 701,089 (21.6%); 1,177,093 (36.2%); 1,347,098 (41.5%). Total Patients: 3,249,156 (100%).
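A body mass index calculation from vital signs is simple arithmetic: weight in kilograms divided by height in meters squared. The Python sketch below shows that formula and re-derives the percentages on the slide from its counts. The category cutoffs (18.5, 25, 30) are the commonly used adult thresholds and are an assumption here; the slide's own category labels did not survive, so no mapping between these labels and the four counts is claimed.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in meters squared."""
    if weight_kg <= 0 or height_m <= 0:
        raise ValueError("weight and height must be positive")
    return weight_kg / (height_m ** 2)

def bmi_category(value):
    """Bucket a BMI value using commonly used adult cutoffs (an assumption)."""
    if value < 18.5:
        return "underweight"
    if value < 25.0:
        return "normal"
    if value < 30.0:
        return "overweight"
    return "obese"

# One hypothetical patient record.
example = bmi(70.0, 1.75)
print(round(example, 1), bmi_category(example))

# Counts as they appear on the slide; check that they add up to the total
# and reproduce the quoted percentages.
counts = [23_876, 701_089, 1_177_093, 1_347_098]
total = 3_249_156
shares = [round(100.0 * c / total, 1) for c in counts]
print(sum(counts) == total, shares)
```

Run over a warehouse table of height and weight vitals, the same two functions would yield a population-wide breakdown of the kind the slide reports.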
