  • - SEC program needs to grow both discipline-specific capabilities and cross-discipline services
    - Vision is CDAWeb (or successor over time) that is a usual point of entry for multi-disciplinary data
      o With pointers to discipline or dataset specific (substantially distributed) support
    - CDAWLib is not the same as SolarSoft (PAPCO is a closer cousin in approach) but start in direction of an IDL library

Transcript

  • 1. Scientific Databases Lecture: Virtual Observatories for Space Science Dr. Kirk Borne, GMU SCS November 18, 2003 GMU CSI 710
  • 2. Outline
    • Quick Review of Astronomy Data
    • The National Virtual Observatory (NVO)
    • Other Virtual Observatories for Space Science
    • Why Virtual Observatories?
    • NVO – It’s all about the Science:
      • IT-enabled, Science-enabling
    • The Enabling Computational Science Technologies for the NVO – where you can help!
    • Distributed Data Mining in the NVO
  • 3. The Nature of Astronomical Data
    • Imaging
      • 2D map of the sky at multiple wavelengths
    • Derived catalogs
      • subsequent processing of images
      • extracting object parameters (400+ per object)
    • Spectroscopic follow-up
      • spectra: more detailed object properties
      • clues to physical state and formation history
      • lead to distances: 3D maps
    • Numerical simulations
    • All inter-related!
  • 4. NOAO Deep Wide-Field Survey: http://www.noao.edu/noao/noaodeep/
  • 5. NOAO Deep Wide-Field Survey: http://www.noao.edu/noao/noaodeep/
  • 6. NOAO Deep Wide-Field Survey: http://www.noao.edu/noao/noaodeep/
  • 7. NASA Astronomy Mission Data: the tip of the data mountain http://nssdc.gsfc.nasa.gov/astro/astrolist.html NSSDC’s astrophysics data holdings: One of many science data collections for astronomy across the US and the world! NSSDC = National Space Science Data Center @ NASA/GSFC
  • 8. “Quote of the day”
    • “It's just as unpleasant to get more than you expected as it is to get less.”
          • George Bernard Shaw
  • 9. Why so many Telescopes? … Because …
    • Many great astronomical discoveries have come from inter-comparisons of various wavelengths:
      • Quasars
      • Gamma-ray bursts
      • Ultraluminous IR galaxies
      • X-ray black-hole binaries
      • Radio galaxies
      • . . .
  • 10. Therefore, our science data archive systems should enable multi-wavelength interdisciplinary distributed database access, discovery, mining, and analysis.
  • 11. How does one integrate and use these distributed data archives? …
  • 12. Emerging Computational Environment
    • Standardizing distributed data
      • Web Services, supported on all platforms
      • Custom configure remote data dynamically
      • XML: Extensible Markup Language
      • SOAP: Simple Object Access Protocol
      • WSDL: Web Services Description Language
      • UDDI: Universal Description, Discovery and Integration
    • Standardizing distributed computing
      • Grid Services
      • Custom configure remote computing dynamically
      • Build your own remote computer, use it, then discard it
      • Virtual Data: new data sets on demand
  • 13. … The National Virtual Observatory (NVO)
    • National Academy of Sciences “Decadal Survey” recommended NVO as highest priority small (<$100M) project :
      • “Several small initiatives recommended by the committee span both ground and space. The first among them—the National Virtual Observatory (NVO)—is the committee’s top priority among the small initiatives. The NVO will provide a “virtual sky” based on the enormous data sets being created now and the even larger ones proposed for the future. It will enable a new mode of research for professional astronomers and will provide to the public an unparalleled opportunity for education and discovery.” (p.14)
  • 14. Why is it Virtual ?
    • A Virtual Data System :
      • has multiple components
      • is (geographically) distributed
      • is interoperable
      • provides seamless user access to distributed data system components
      • provides “one-stop shopping” for data end-user
  • 15. Why is it Necessary?
    • To maximize cross-enterprise multi-institutional resources
    • To minimize duplication of effort
    • To streamline operations through shared development
    • To serve multiple user communities
    • To facilitate simultaneous data mining, knowledge discovery, and information retrieval from multiple distributed data collections
    • Because data volumes are huge & growing rapidly ...
      • For example, in Astronomy :
      • a few terabytes "yesterday" (10,000 CDROMs)
      • tens of terabytes "today" (100,000 CDROMs)
      • petabytes "tomorrow" (within 10 years) (100,000,000 CDROMs)
  • 16. National Virtual Observatory http://www.us-vo.org/
    • NVO is a concept. It was recommended by the Astronomy Decadal Survey Committee to the National Academy of Sciences. Currently funded by NSF ($10M Information Technology Research grant); and NASA next year(?).
    • NVO is not just “National”. It is actually “Global”: http://www.ivoa.net/
    • Will link geographically distributed astronomical data archives and information resources = provides “one-stop shopping” for data end-user
    • Will be heterogeneous, interoperable, and federated (autonomy maintained at local sites) … therefore, we are using XML and Web Services.
    • Requires middleware standards for: metadata, resource descriptions (including the Dublin Core), queries, query results, the data (including the Data Model – see next slide), and semantics (… we are using Unified Content Descriptors = UCDs).
    • Requires innovative computational science technologies for :
      • data discovery, data mining, data fusion, distributed querying, and code-shipping (“Ship the code, not the data”)
  • 17. Virtual Observatory Data Model A data model is the structure in which a computer program stores persistent information.
  • 18.  
  • 19. VxO: becoming an operational system (high TRL)
    • What is a VxO ?
      • Virtual “anything” Observatory – where “anything” currently includes Astrophysical, Solar, Magnetospheric, Heliospheric, Ionospheric, …
    • Summary statement for any VxO …
      • Researchers should be able to find and access seamlessly all existing data relevant to the research they are considering, that data should be independently and correctly useable, and that data should be available in useful ways and in useful contexts.
    • Without exception, full VxO efforts aim in this direction by providing multi-mission data access and easy browse functionality.
  • 20. Capabilities of Space Physics Science Databases http://spdf.gsfc.nasa.gov/ The VxO Challenge: to Integrate Data, Tools, Services (Trajectories)
    [Figure labels: Tools & Services; Science Data Facility; Acquisition & Ingest; Science User Support; ModelsWeb; CDAWLib; HelioWeb]
  • 21. How do Space Science Databases Change in a Future that has an Increasingly Rich/Robust VxO Framework ?
    • One definition for this VxO framework could be …
    • "The distributed implementation of an integrated space sciences data environment"
    • The broad goals of the data centers don't fundamentally change with this definition.
      • They still must enable new science by adding unique value to the Space Science research community through strong multi-discipline and cross-discipline data resources, with unique services tied to unique databases.
    • These services (data, functions, software) should (and will) be increasingly supplied as a key element of that new, broader VxO environment.
      • Logically, the data center’s services eventually become consumers as well as providers.
      • Visible early user impact of VxO is critical.
    • VxOs should develop a good long-term hybrid solution = PIs + missions/projects + Science Data Centers + (other) specialized services
  • 22. Science Data Formats – part of the glue
    • Several key data formats are standard in space science: FITS (Astrophysics & Solar Physics), CDF and netCDF (Space Physics & Earth Science), HDF (Space Physics, Earth Science, & Computational Science).
    • Why?
      • These provide a baseline data format for all data sets in that discipline and in joint international projects.
      • They provide the base for many data center services, data analysis tools, data integration tools, visualization packages.
      • They are a key enabling technology for many different space missions and space science projects.
    • Plans:
      • Translation tools: FITS <-> CDF <-> HDF <-> netCDF
      • Substantial work on format translators via XML and XSLT.
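
A minimal sketch of the idea behind XML-based format translation, assuming that dataset metadata can be reduced to keyword/value pairs. XML acts as the neutral intermediate form that each format-specific reader and writer targets. The element names (`dataset`, `keyword`) and the sample header are hypothetical illustrations; this is not the FITSML or XDF schema, and it does not read real FITS, CDF, HDF, or netCDF files.

```python
import xml.etree.ElementTree as ET

# Hypothetical keyword/value metadata, loosely modelled on a FITS-style header.
header = {"NAXIS": 2, "NAXIS1": 2048, "EXPTIME": 30.0, "TELESCOP": "KPNO 4m"}


def header_to_xml(header):
    """Serialize keyword/value metadata into a small XML document."""
    root = ET.Element("dataset")  # hypothetical element name
    for key, value in header.items():
        # Record the Python type so the value can be restored on the way back.
        field = ET.SubElement(root, "keyword", name=key, type=type(value).__name__)
        field.text = str(value)
    return ET.tostring(root, encoding="unicode")


def xml_to_header(xml_text):
    """Rebuild the keyword/value metadata from the XML document."""
    casts = {"int": int, "float": float, "str": str}
    root = ET.fromstring(xml_text)
    return {
        el.get("name"): casts[el.get("type")](el.text)
        for el in root.findall("keyword")
    }


xml_text = header_to_xml(header)
print(xml_text)
print(xml_to_header(xml_text) == header)  # round trip preserves the metadata
```

With a neutral form in the middle, N formats need N readers and N writers rather than a dedicated translator for every pair of formats.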
  • 23. Interfaces to a VxO Environment
    • "Web Services" interface to existing data services
      • Web Services interfaces and software libraries complement existing FTP and interactive user web interfaces.
      • Web Services provide an application-to-application interface, without human intervention.
      • Web Services provide service descriptions (WSDL), a registry for data/resource discovery (UDDI), and data services (SOAP).
      • Scientific database services have unique scope and functionality that must be accessible by the VxO environment for it to gain user acceptance.
        • e.g., SOAP/XML interface for Space Physics data now enables 3-D interactive graphics of distributed multi-mission data.
      • Plans for data format translators and converters
  • 24. Why Virtual Observatories?
    • Because :
      • The data are highly distributed.
      • Multi-mission data lead to new discoveries.
      • The data volumes are HUGE and growing.
      • And maybe because of Augustine’s Law …
    “Software is like entropy; it always increases.” - Norman Augustine
  • 25. Szalay’s Law: The utility of N comparable datasets increases as N²
    • Metcalfe’s Law: The value of a network scales as n², where n is the number of nodes connected.
    • Hagel & Armstrong’s Axiom: The aggregation of resources is more important than the amount of resources owned.
    • Metcalfe’s law applies to telephones, the Internet …
    • Szalay argues as follows:
      • Each new dataset gives new information.
      • 2-way combinations give even more new information.
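
The N² scaling in Szalay's argument can be checked by counting: N datasets allow N(N-1)/2 distinct 2-way combinations, which grows as N² for large N. A short sketch of that count (the function name is only illustrative):

```python
from math import comb


def pairwise_combinations(n_datasets):
    """Number of distinct 2-way combinations among n_datasets datasets."""
    return comb(n_datasets, 2)  # equals n * (n - 1) / 2


for n in (2, 5, 10, 100):
    print(n, "datasets ->", pairwise_combinations(n), "pairs")
```

Going from 10 to 100 datasets multiplies the number of datasets by 10 but the number of possible pairings by 110 (45 to 4950).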
  • 26. Size of a Typical Archived Astronomical Data Repository
    • Size of the archived data for an all-sky survey -- 40,000 square degrees is two Trillion pixels --
      • One band 4 Terabytes
      • Multi-wavelength 10-100 Terabytes
      • Time dimension 10 Petabytes
      • LSST project (10 yrs) ~100 Petabytes @ http://www.lsst.org/
    All-sky distribution of 526,280,881 stars from the MACHO survey.
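
The survey-size figures above can be cross-checked with a few lines of arithmetic. The sky area and pixel count come from the slide. The 2 bytes per pixel is an assumption (16-bit pixels), chosen because it reproduces the quoted 4 Terabytes for one band.

```python
SKY_AREA_DEG2 = 40_000      # from the slide: all-sky survey area
N_PIXELS = 2e12             # from the slide: two trillion pixels
BYTES_PER_PIXEL = 2         # assumption: 16-bit pixels

ARCSEC_PER_DEG = 3600
sky_area_arcsec2 = SKY_AREA_DEG2 * ARCSEC_PER_DEG ** 2

# Side length of one square pixel, in arcseconds.
pixel_scale_arcsec = (sky_area_arcsec2 / N_PIXELS) ** 0.5

# Storage for a single band, in terabytes (10^12 bytes).
one_band_tb = N_PIXELS * BYTES_PER_PIXEL / 1e12

print(f"pixel scale: {pixel_scale_arcsec:.2f} arcsec/pixel")
print(f"one band:    {one_band_tb:.0f} TB")
```

Under these assumptions the implied pixel scale is about half an arcsecond, and a single band occupies 4 TB.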
  • 27. Ongoing Surveys of the Sky
    • Large number of new surveys
      • multi-TB in size, 100 million objects or more
      • individual archives planned, or under way
    • Multi-wavelength view of the sky
      • coverage in more than 13 wavelength bands within 5 years
    • Impressive early discoveries
      • finding exotic objects by unusual colors
        • L,T dwarfs, high-z quasars
      • finding objects by time variability
        • gravitational microlensing
    MACHO, 2MASS, DENIS, SDSS, GALEX, FIRST, DPOSS, GSC-II, COBE, MAP, NVSS, ROSAT, OGLE, ...
  • 28. Sloan Digital Sky Survey Data Products http://www.sdss.org/
    • Full Data Collection: ~20 TB
    • Object catalog: 400 GB (parameters of >10^8 objects)
    • Redshift Catalog: 1 GB (parameters of 10^6 objects)
    • Atlas Images: 1.5 TB (5-color cutouts of >10^8 objects)
    • Spectra: 60 GB (in a one-dimensional form)
    • Derived Catalogs: 20 GB (clusters, QSO absorption lines)
    • 4x4 Pixel All-Sky Map: 60 GB (heavily compressed)
  • 29. Large Synoptic Survey Telescope
    • Highly ranked in Decadal Review
    • Optimized for surveys
    • scan mode
    • deep mode
    • 7 square degree field
    • 6.9m effective aperture
    • 24th mag in 20 sec
    • > 20 Tbytes/night
    • Real-time analysis
    • “Celestial Cinematography”
    • Simultaneous multiple science goals
  • 30. Large Mirror Fabrication (for large telescopes, such as LSST) (Univ. of Arizona Mirror Laboratory) That’s big!
  • 31. NVO – It’s all about the Science
  • 32. Science Discovery - the Old Way
  • 33. Science Discovery - The New Way -Different!
    • Systematic data exploration
      • will have a central role
      • statistical analysis of the “typical” objects
      • automated search for the “rare” events
    The discovery process will rely heavily on distributed data access and multi-archive data mining.
  • 34. Conceptual Architecture for a Distributed Data Mining System Data Archives Discovery tools Analysis tools User Gateway
  • 35. The Discovery Process
    • Past: observations of small, carefully selected samples of objects in a narrow wavelength band
    • Future: high quality, homogeneous multi-wavelength data on millions of objects, allowing us to
      • discover significant patterns from the analysis of statistically rich and unbiased image/catalog databases
      • understand complex astrophysical systems via confrontation between data and large numerical simulations
    The discovery process will rely heavily on advanced visualization, data mining, and statistical analysis tools.
  • 36. The NVO in 5 words or less: “The archive is the sky!”
  • 37. NVO: It is all about the Science
    • There is a huge scientific interest in the new data collections --large sky surveys, multiple telescopes, multiple-wavelength coverage of the sky, time domain coverage ... And it is all available on-line from your desktop …
      • “The archive is the sky!”
    • Something is needed to help scientists access, mine, and explore these huge data collections.
      • 1 Terabyte at 10 Mbyte/s takes more than 1 day to transmit (10^5 seconds, about 28 hours)
      • Hundreds of intensive queries and thousands of casual queries per-day
      • Data will reside at multiple locations, in many different formats
      • Existing analysis tools do not scale to Terabyte data sets
    • Acute need in a few years; solution will not just happen.
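
The transmission-time bullet is simple arithmetic that can be extended to the larger volumes quoted elsewhere in the lecture. A sketch assuming decimal units (1 Terabyte = 10^12 bytes, 1 Mbyte = 10^6 bytes) and a sustained, uncontended transfer rate:

```python
TB = 1e12   # bytes, decimal terabyte (assumption)
PB = 1e15   # bytes, decimal petabyte (assumption)
MB = 1e6    # bytes, decimal megabyte (assumption)
SECONDS_PER_DAY = 86_400


def transfer_time_days(n_bytes, rate_bytes_per_s):
    """Days needed to move n_bytes at a sustained rate, ignoring overhead."""
    return n_bytes / rate_bytes_per_s / SECONDS_PER_DAY


print(f"1 TB at 10 MB/s: {transfer_time_days(1 * TB, 10 * MB):.2f} days")
print(f"1 PB at 10 MB/s: {transfer_time_days(1 * PB, 10 * MB):.0f} days")
```

At the same rate a petabyte would take more than three years to move, which is the motivation for bringing the analysis to the data.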
  • 38. NVO Enables New Science http://www.us-vo.org/
    • Rare and exotic objects
      • Very high redshift quasars
      • Dark matter in the galactic halo
      • Time-variable objects, transient events: distant supernovae and microlensing
      • Brown dwarfs
      • Variable stars
      • Asteroids...
        • ...incoming!!
      • Serendipity!
  • 39. NVO Science Cases & Drivers (from Aspen 2001 NVO Workshop)
    • Solar System : NEOs, Long-Period Comets, TNOs, Killer Asteroids!!!
    • The Digital Galaxy : Find star streams and populations -- relics of past/present assembly phase. Identify components of disk, thick disk, bulge, halo, arms, ??
    • The Low-Surface Brightness Universe : spatial filtering, multi-wavelength searches, intersection of the image and catalog domains
    • Panchromatic Census of AGN (Active Galactic Nuclei) : Complete sample of the AGN zoo, their emission mechanisms, and their environments
    • Precision Cosmology & Large-Scale Structure : Hierarchical Assembly History of Galaxies and Structure, Cosmological Parameters, Dark Matter and Galaxy Biasing as f(z)
    • Precision science of any kind that depends on very large sample sizes
    • "Survey Science Deluxe"
    • Search for rare and exotic objects (e.g., high-z QSOs, high-z SNe, L/T dwarfs)
    • Serendipity : Explore new domains of parameter space (e.g., time domain, or "color-color space" of all kinds)
  • 40. Enabling Computational Science Technologies for the NVO
  • 41. Major Functions of the NVO and the related Enabling Computational Science Technologies
    • To facilitate data mining and knowledge discovery within the very large astronomical databases -- Requires:
      • indexing for fast queries, filtering of large queries, data subsetting, visualization, parallelization (queries, access), ...
    • To facilitate linkages and cross-archive investigations -- Requires:
      • distributed computing, scalable architectures, load balancing, thin middleware layer, interoperability, code libraries, code-shipping, data-finding services, data standards & interchange formats, query/results protocols, data fusion, quality assessment, archive/metadata profiles, user profiles, intelligent agents, ...
    • To serve a broad community of users ( professionals, amateur astronomers, schools, general public) --
      • must support thousands of queries per day
  • 42. Some General Challenges for NVO (and all Virtual Data Systems)
    • Data Discovery : Finding data within distributed data systems
    • Transparent User Access to Data : across heterogeneous environments
    • (Distributed) Data Mining and Analysis : of terabytes!
    • Interoperability : of systems, data, metadata, tools
    • New Technology Infusion : across multiple distributed systems
    • Sociology : "We don't need it" or "We already have it"
  • 43. How do you get all of these distributed science databases working together?
    • Virtual Observatory team motto:
    • “It’s the middleware, stupid.”
  • 44. National Virtual Observatory http://www.us-vo.org/
    • NVO is a concept. It was recommended by the Astronomy Decadal Survey Committee to the National Academy of Sciences. Currently funded by NSF ($10M Information Technology Research grant); and NASA next year(?).
    • NVO is not just “National”. It is actually “Global”: http://www.ivoa.net/
    • Will link geographically distributed astronomical data archives and information resources = provides “one-stop shopping” for data end-user
    • Will be heterogeneous, interoperable, and federated (autonomy maintained at local sites) … therefore, we are using XML and Web Services.
    • Requires middleware standards for: metadata, resource descriptions (including the Dublin Core), queries, query results, the data (including the Data Model – see next slide), and semantics (… we are using Unified Content Descriptors = UCDs).
    • Requires innovative computational science technologies for :
      • data discovery, data mining, data fusion, distributed querying, and code-shipping (“Ship the code, not the data”)
  • 45. Tools for the NVO & other Virtual Data Systems
    • XML (eXtensible Markup Language) = "the language of interoperability" - ADC/XML Project was most comprehensive and advanced application of XML to NASA astrophysics data archives - including the XDF (eXtensible Data Format) and FITSML data standards [ http://xml.gsfc.nasa.gov/ ]
    • Comprehensive Data Mining Resource Guide for Large Scientific Databases - [follow the link at http://nvo.gsfc.nasa.gov/ ]
      • "The trouble with facts is that there are so many of them." - Samuel McChord Crothers, in "The Gentle Reader"
    • ISAIA (Interoperable Systems for Archival Information Access) : resource description profiles to enable access to distributed data providers
    • MOCHA (Middleware based On a Code-sHipping Architecture) : middleware tools for search, retrieval, & data fusion from heterogeneous databases using heterogeneous interfaces - transparently federates distributed data access -
      • "Ship the code, not the data"
    • The GRID ! …
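
A toy sketch of the "ship the code, not the data" idea: instead of downloading every row from each archive, the client sends a small function to each archive and receives only the small result. The `Archive` class, its `run` method, and the sample rows are hypothetical stand-ins for remote services. This is not the MOCHA API, and a real system would also need serialization, sandboxing, and authentication.

```python
class Archive:
    """Stand-in for a remote data archive that holds its rows locally."""

    def __init__(self, name, rows):
        self.name = name
        self._rows = rows  # never leaves the archive in this model

    def run(self, func):
        """Execute shipped code next to the data and return only its result."""
        return func(self._rows)


def count_bright(rows, limit=15.0):
    """The 'shipped code': count objects brighter than a magnitude limit."""
    return sum(1 for row in rows if row["mag"] < limit)


# Hypothetical archives with made-up rows.
archives = [
    Archive("optical", [{"mag": 14.2}, {"mag": 17.9}, {"mag": 12.5}]),
    Archive("infrared", [{"mag": 16.1}, {"mag": 13.3}]),
]

# Only one integer per archive crosses the (imagined) network.
results = {archive.name: archive.run(count_bright) for archive in archives}
print(results)
```

The payoff grows with archive size: the function stays a few hundred bytes while the data it summarizes may be terabytes.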
  • 46. What is The Grid?
    • The GRID is “a distributed computing infrastructure that facilitates resource-sharing and coordinated problem-solving in dynamic, multi-institutional virtual organizations.”
    • http://www.globus.org/datagrid/
    • http://www.gridforum.org/
    • http://www.nas.nasa.gov/About/IPG/
    • (NASA’s Information Power Grid)
  • 47. The Grid : by Foster & Kesselman (Argonne National Laboratory) Internet computing and GRID technologies promise to change the way we tackle complex problems. They will enable large-scale aggregation and sharing of computational, data and other resources across institutional boundaries …. Transform scientific disciplines ranging from high energy physics to the life sciences
  • 48. Data Grids vs. Computational Grids
  • 49. Slide shown earlier: Conceptual Architecture for a Distributed Data Mining System Data Archives Discovery tools Analysis tools User Gateway
  • 50. A Concept for a Data Grid Node for Distributed Data Mining**
    • Hardware requirements
    • Large distributed database engines
      • with few Gbyte/s aggregate I/O speed
    • High speed (>10 Gbit/s) backbones
      • cross-connecting the major archives
    • Scalable computing environment
      • with hundreds of CPUs for analysis
    [Figure: "HPC comes to the rescue!" Database layer of Objectivity/RAID servers (2 GBytes/sec); compute layer of compute nodes (200 CPUs); interconnect layer (1 Gbits/sec/node); 10 Gbits/s link to other nodes.]
    ** Slide provided by Alex Szalay (JHU)
  • 51. An HPC Application: Parallel Data Mining Figure : How Parallel Processing Speeds Up Data Mining The application of parallel computing resources and parallel data access (e.g., RAID) enables concurrent drill-downs into large data collections
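
A small sketch of the partition-and-search pattern behind parallel data mining: the collection is split into partitions, each partition is searched concurrently, and the hits are merged. The threshold search and the sample partitions are illustrative assumptions. Threads are used here only to show the structure; in CPython they help mainly when each partition scan waits on I/O (such as separate disks), and CPU-bound work would need processes or separate nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def mine_partition(partition, threshold=0.9):
    """Drill down into one partition: keep values above the threshold."""
    return [value for value in partition if value > threshold]


def parallel_mine(partitions, workers=4):
    """Search all partitions concurrently and merge the hits in order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_partition = list(pool.map(mine_partition, partitions))
    return [hit for hits in per_partition for hit in hits]


# Hypothetical data already split across three storage partitions.
partitions = [[0.1, 0.95], [0.5], [0.99, 0.2, 0.91]]
print(parallel_mine(partitions))
```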
  • 52. Distributed Data Mining in the NVO
  • 53. Data Mining: connecting the dots? Reference: http://homepage.interaccess.com/~purcellm/lcas/Cartoons/cartoons.htm
  • 54. Scaling the VO Mountain: Role of Data Mining Discoveries Data Mining Visualization Data Services Existing Data Centers and Archives You are here
  • 55. Exploration of new domains of the observable parameter space : The Time Domain -- Part 1 Moving object appears as little rainbow in multiple-color image overlays → In-coming Killer Asteroid? NVO = Data Mining in Action
  • 56. Data Mining through Data Processing: Simple Multiple-Frame Subtraction SUPERNOVA discovered !!
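
Frame subtraction can be sketched in a few lines: subtract a reference image from a newer image of the same field, pixel by pixel, and flag pixels whose brightness rose by more than a threshold. The tiny 3x3 images and the threshold below are made-up values. Real pipelines must first align the frames and match their point-spread functions and flux scales, which this sketch assumes has already been done.

```python
def subtract_frames(new, reference):
    """Pixel-by-pixel difference of two aligned images of equal shape."""
    return [
        [n - r for n, r in zip(new_row, ref_row)]
        for new_row, ref_row in zip(new, reference)
    ]


def find_transients(new, reference, threshold):
    """Return (row, column) positions that brightened by more than threshold."""
    diff = subtract_frames(new, reference)
    return [
        (y, x)
        for y, row in enumerate(diff)
        for x, value in enumerate(row)
        if value > threshold
    ]


# Hypothetical pixel counts: a steady source at (1, 1), a new one at (2, 1).
reference = [[10, 11, 10], [12, 50, 11], [10, 10, 12]]
new = [[11, 10, 10], [12, 51, 11], [10, 95, 11]]

print(find_transients(new, reference, threshold=20))
```

The steady bright source cancels in the difference image, so only the newly appeared source is reported.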
  • 57. Mega-Flares on normal Sun-like stars = a star like our Sun increased in brightness 300X one night! … say what?? The Time Domain -- Part 2 NVO = Data Mining in Action
  • 58. SETI@home searches for E.T. -- An equivalent data mining tool [email_address] on anyone’s desktop can find new comets, asteroids, exploding stars, quasars -- Chunks of data are sent to user’s screensaver, which begins to mine data for special or one-of-kind astronomical events. The Time Domain -- Part 3 NVO = Data Mining in Action
  • 59. [email_address] brings science discovery to the desktop of everyone! … a great tool for space science and computational science education. Requires : access to distributed science databases and data mining & analysis tools.
  • 60. 1. Potential tool for Distributed Data Mining: http://skyserver.pha.jhu.edu/VOconeprofile/
    • ConeSearch
    • Find all astronomical objects within a radius of a point on the sky (= cone).
    • Find cross-identifications (e.g., a radio galaxy in one catalog = an Infrared galaxy in another catalog)
    • >70 services are now queried.
    • Results are returned in XML format (VOTable).
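
The geometric core of a cone search can be sketched locally: compute the angular separation between the search position and each catalog object, and keep the objects inside the radius. The catalog entries and function names are hypothetical. The actual ConeSearch services are queried over the web and return VOTable XML, which this sketch does not attempt to reproduce.

```python
from math import asin, cos, degrees, radians, sin, sqrt


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula, inputs in degrees)."""
    ra1, dec1, ra2, dec2 = map(radians, (ra1, dec1, ra2, dec2))
    a = sin((dec2 - dec1) / 2) ** 2 + cos(dec1) * cos(dec2) * sin((ra2 - ra1) / 2) ** 2
    return degrees(2 * asin(min(1.0, sqrt(a))))  # clamp guards against rounding


def cone_search(catalog, ra, dec, radius):
    """Return catalog objects within `radius` degrees of (ra, dec)."""
    return [
        obj for obj in catalog
        if angular_separation(ra, dec, obj["ra"], obj["dec"]) <= radius
    ]


# Hypothetical catalog; positions in degrees.
catalog = [
    {"id": "A", "ra": 180.00, "dec": 0.00},
    {"id": "B", "ra": 180.05, "dec": 0.02},
    {"id": "C", "ra": 181.50, "dec": -0.50},
]

hits = cone_search(catalog, ra=180.0, dec=0.0, radius=0.1)
print([obj["id"] for obj in hits])
```

Running the same search position against two different catalogs and pairing the closest hits is the simplest form of the cross-identification described above.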
  • 61. 2. Potential Tool for Distributed Data Mining: Data Inventory Service http://us-vo.org/news/dis.html Response from the Data Inventory Service, showing links to relevant images and catalogs: Uses ConeSearch Profile Service.
  • 62. 3. Potential tool for Distributed Data Mining: http://www.skyquery.net/main.htm Submits queries to large distributed databases! 2nd place Winner in Microsoft Contest
  • 63. Summary - Applications of Data Mining to the NVO
    • Sample Data Mining Applications within the NVO:
      • Discover data stored in geographically distributed heterogeneous systems.
      • Search huge databases for trends and correlations in high-dimensional parameter spaces: identify new properties or new classes of objects.
      • Search for rare, one-of-a-kind, and exotic objects in huge databases.
      • Identify temporal variations in objects from millions or billions of observations.
      • Identify moving objects in huge survey catalogs and image databases.
      • Identify parameter glitches / anomalies / deviations either in static databases (e.g., archives) or in dynamic data (e.g., science / telemetry / engineering data streams from remote satellites).
      • Find clusters, nearest neighbors, outliers, and/or zones of avoidance in the distribution of astrophysical objects or other observables in arbitrary parameter spaces.
      • Serendipitously explore the huge databases that will be part of the NVO, through access to distributed, autonomous, federated, heterogeneous, multi-wavelength, multi-mission astrophysics data archives.
    • Data Mining Resource Guide for Space Science:
    • http://nvo.gsfc.nasa.gov/nvo_datamining.html
    • Purpose and Content -- to assist NASA scientists in data mining activities by providing comprehensive summaries of: NASA-funded data mining projects, data mining tutorials, algorithms, techniques, software, organizations, conference links, conference summaries, publications, lessons learned, related I.T. technologies, science applications, expert interviews, and applications of data mining to the new National Virtual Observatory (NVO).
    http://www.us-vo.org/
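
One of the applications listed on this slide, finding outliers in a parameter distribution, can be illustrated with a robust statistic. The sketch below flags values that lie far from the median in units of the median absolute deviation (MAD). It is one simple technique chosen for illustration, not a method named in the lecture, and the sample values and threshold are made up.

```python
from statistics import median


def find_outliers(values, threshold=5.0):
    """Indices of values more than `threshold` robust sigmas from the median."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread to measure against
    # 1.4826 * MAD approximates the standard deviation for Gaussian data.
    robust_sigma = 1.4826 * mad
    return [
        i for i, v in enumerate(values)
        if abs(v - center) / robust_sigma > threshold
    ]


# Hypothetical color indices for eight catalog objects; one is unusual.
colors = [0.51, 0.48, 0.55, 0.50, 0.47, 2.90, 0.53, 0.49]
print(find_outliers(colors))
```

The median and MAD are used instead of the mean and standard deviation because the rare objects being hunted would otherwise inflate the very statistics used to detect them.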
  • 64. Web References
    • General:
      • http://xml.gsfc.nasa.gov/
      • http://nvo.gsfc.nasa.gov/
      • http://www.us-vo.org/
    • Specific:
      • VOTable - XML language for queries and tabular query results:
          • http://www.us-vo.org/VOTable/
      • Data Mining Resource Guide:
          • http://nvo.gsfc.nasa.gov/nvo_datamining.html
      • Scientific Data Mining Workshop and Reports:
          • http://www.anc.ed.ac.uk/sdmiv/
  • 65. VO: Creating the Future of Astrophysics Data Analysis
  • 66. Summary
    • Quick Review of Astronomy Data
    • The National Virtual Observatory (NVO)
    • Other Virtual Observatories for Space Science
    • Why Virtual Observatories?
    • NVO – It’s all about the Science:
      • IT-enabled, Science-enabling
    • The Enabling Computational Science Technologies for the NVO – where you can help!
    • Distributed Data Mining in the NVO
  • 67. Next Lecture
    • November 25 – Intelligent Archives of the Future