Summary of 3DPAS


Presented at the D3Science workshop at the e-Science 2011 conference


  1. Review of 3DPAS Theme
     Daniel S. Katz, University of Chicago & Argonne National Laboratory
     Shantenu Jha, Rutgers University
     Neil Chue Hong, University of Edinburgh
     Simon Dobson, University of St. Andrews
     Andre Luckow, Louisiana State University
     Omer Rana, University of Cardiff
     Yogesh Simmhan, University of Southern California
  2. Outline
     • e-SI
     • DPA theme
     • 3DPAS theme
     • Report in-progress
       – Application Scenarios
       – Understanding Distributed Dynamic Data
       – Vectors
       – Infrastructure
       – Programming Systems and Abstractions
     • Future Steps
     3DPAS review for D3Science –
  3. e-Science Institute (e-SI)
     • A 10-year project (Aug 2001 – July 2011), located in Edinburgh
     • Aimed at, but not limited to, UK
     • Tagline – time & space to think
     • Mission: to stimulate the creation of new insights in e-Science and computing science by bringing together international experts and enabling them to successfully address significant and diverse challenges
     • Research themes formed the core of eSI’s activity
       – Theme: connected programme of visitors, workshops and events
       – Conceived and driven by Theme Leader
       – Focusing on a specific issue in e-Science that crosses boundaries and raises new research questions
       – Goals:
         o Identify research issues
         o Rally a community of researchers
         o Map a path of future research that will make best progress towards new e-Science methods and capabilities
  4. Context – Data and Science
     • Data has always been important to science
     • Some use the concept of paradigms
       – First (thousand years ago) – empirical – describe natural phenomena
       – Second (few hundred years ago) – theoretical – use models and generalizations
       – Third (few decades ago) – computational – solve complex problems
       – Fourth (few years ago) – data exploration – gain knowledge directly from data from experiment, theory, simulation
     • Problem – we cannot keep declaring new paradigms at an exponentially increasing rate
     • But it’s true that there is an emerging science of “listening to data”, as defined by Jim Gray, Google, etc.
  5. Distributed Programming Abstractions
     • DPA theme at eSI –
     • Series of workshops
     • Led to book in progress: Shantenu Jha, Daniel S. Katz, Manish Parashar, Omer Rana, and Jon Weissman, “Abstractions for Distributed Applications and Systems,” to be published by Wiley in 2012
     • And multiple papers, including: S. Jha, D. S. Katz, M. Parashar, O. Rana, and J. Weissman, "Critical Perspectives on Large-Scale Distributed Applications and Production Grids," (Best Paper Award Winner), Proceedings of the 10th IEEE/ACM International Conference on Grid Computing (Grid 2009), 2009.
     • Idea: start with distributed science and engineering applications; analyze them (determine ‘vectors’); examine interaction with infrastructures and tools; find abstractions
       – Tech report on infrastructures (much of Chapter 3) available now:
       – Vectors: Execution Unit, Coordination, Communication, Execution Environment
     • In the process, we realized that data-intensive applications had some unique challenges and issues
  6. Dynamic Distributed Data-intensive Programming Systems and Applications (3DPAS)
     • This led to 3DPAS theme at eSI –
     • Similar idea to DPA
       – Start with science and engineering applications
       – See if the DPA vectors suffice or if new vectors are needed
       – Examine what is different with respect to infrastructures and programming systems
     • Initially done through workshops at eSI
     • Continuing through weekly teleconferences
     • Driving towards a report/paper
  7. D3 (data intensive, distributed, dynamic)
     • Data intensive: order of magnitude of large data and large computing
       – Exascale data and petascale computing
       – Petascale data and exascale computing
       – Exascale data and exascale computing
     • Distributed: number, dispersion, and replication of distributed data or computation resources
       – Low in a cloud or cluster that resides in a single building
       – High in a grid that spans multiple geographically-separated administrative domains, or multiple data centers
     • Dynamic: perhaps both data and computation
       – Data may emerge at runtime
       – Mechanisms to handle data during application execution, e.g., data transfer, scheduling
       – Application components may be launched at runtime in response to data, application, or environment dynamics
     • All may vary in different stages of an application
     • Most applications have data collection, storage, analysis stages
  8. Value/Impact
     • Not all data-intensive applications have dynamic and distributed elements today
     • However, as scales increase, applications will have to be distributed and dynamic
       – And these issues will be increasingly correlated
     • Analyzing current D3 applications should impact many future applications
       – And lead to lessons about and requirements on future infrastructures and programming systems
  9. Applications Process
     • Asked questions about possible applications
       1. What is the purpose of the application?
       2. How is the application used to do this?
       3. What infrastructure is used? (including compute, data, network, instruments, etc.)
       4. What dynamic data is used in the application?
          a. What are the types of data?
          b. What is the size of the data set(s)?
       5. How does the application get the data?
       6. What are the time (or quality) constraints on the application?
       7. How much diverse data integration is involved?
       8. How diverse is the data?
       9. Please feel free to also talk about the current state of the application, if it exists today, and any specific gaps that you know need to be overcome
  10. Applications Process (2)
      • In workshops, discussed current applications, and considered whether each new application “felt” the same as a previous application in terms of the answers to the questions
      • Came to 14 applications
      • Noted they fall into different categories
        – Traditional applications: a single program that is run by a user
        – Archetypical applications: a group of applications, independent programs, written by different authors, may be competing, usually not intended to run together
        – Infrastructural applications: set of applications (or archetypical applications) that need to be run in series (perhaps in different phases), may be run by different groups that do not frequently interact
  11. Applications
      Columns: Application; Area; Type; Lead Person/Site
      • Metagenomics; Biosciences; Archetypical; Amsterdam Medical Centre, Netherlands
      • ATLAS experiment (WLCG); Particle Physics; Infrastructural; CERN & Daresbury Lab + RAL, UK
      • Large Synoptic Survey Telescope (LSST); Astrophysics; Infrastructural; University of Edinburgh, Institute of Astronomy, UK
      • Virtual Astronomy; Astrophysics; Archetypical; University of Edinburgh, Institute of Astronomy, UK
      • Cosmic Microwave Background; Astrophysics; Traditional; Lawrence Berkeley National Laboratory, USA
      • Marine (Sea Mammal) Sensors; Biosciences; Infrastructural; University of St. Andrews, UK
      • Climate; Earth Science; Infrastructural; National Center for Atmospheric Research, USA
  12. Applications (2)
      Columns: Application; Area; Type; Lead Person/Site
      • Interactive Exploration of Environmental Data; Earth Science; Archetypical; University of Reading, UK
      • Power Grids; Energy Informatics; Infrastructural; University of Southern California, USA
      • Fusion (International Thermonuclear Experimental Reactor); Chemistry/Physics; Traditional; Oak Ridge National Laboratory & Rutgers University, USA
      • Industrial Incident Notification and Response; Emergency Response; Infrastructural; THALES, The Netherlands
      • MODIS Data Processing; Earth Science; Traditional; Lawrence Berkeley National Laboratory, USA
      • Floating Sensors; Earth Science; Infrastructural; Lawrence Berkeley National Laboratory, USA
      • Distributed Network Intrusion Detection; Security; Infrastructural; University of Minnesota, USA
  13. Climate (infrastructural)
      • CMIP/IPCC process runs and analyzes climate models in 3 stages
      • Data are generated by distributed HPC centers
      • Data are stored by distributed ESGF gateways and data nodes
      • Data are analyzed by distributed researchers, who search for particular data, gather them to a site, and process them
      • Resources for analysis can be dynamic, as can data stored in data nodes
      Thanks: Don Middleton
  14. Fusion (traditional)
      • ITER needs a variety of codes
      • Codes run on a distributed set of leadership-class facilities, using advance reservations to co-schedule the simulations
      • Codes read and write data files, using ADIOS and HDF5
      • Files output by each code are transformed and transferred to be used as inputs by other codes, linking the codes into a single coupled simulation
      • Data generated are too large to be written to disk for post-run analysis; in-situ analysis and visualization tools are being developed
      Thanks: Scott Klasky
  15. Metagenomics (archetypical)
      • Analysis of genome sequence data being produced by next-generation devices
      • Sequencers are producing data at a rate increasing faster than computing capability
      • Sequencers are distributed; data produced cannot all be co-located
      • Multiple analyses (using different software) by multiple users need to make best use of available computing resources, understanding location and access issues with respect to datasets
  16. CMB (traditional)
      • The Cosmic Microwave Background (CMB) application performs data simulation and analysis to understand the Universe 400,000 years after the Big Bang
        – Detectors take O(10^12 - 10^15) time-ordered sequences
        – Observations reduced to map of O(10^6 - 10^8) sky pixels
        – Pixels reduced to O(10^3 - 10^4) angular power spectrum coefficients
        – Coefficients reduced to O(10) cosmological parameters
      • Computationally most expensive step is from map to angular power spectrum
        – Exact solution is O(pixels^3) – prohibitive
        – Approximate solution: sets of O(10^4) Monte Carlo realizations of observed sky to remove biases and quantify uncertainties, each of which involves simulating and mapping the time-ordered data
        – Map-making is applied to both real and simulated data, but O(10^4) more times to simulated data (uses on-the-fly simulation module – simulations performed when requested)
      • Currently uses a single HPC system, but would be faster with distributed systems
      • Central system that builds map would launch data simulations on available remote resources; output data from the simulations would be asynchronously delivered back to that central system as files and incorporated in the map as they are produced
      Thanks: Julian Borrill
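The distributed variant described on the CMB slide, in which a central map-builder launches simulations on remote resources and folds each result in as it is delivered, can be sketched roughly as follows. This is an illustration under stated assumptions, not the CMB team's code: `simulate_realization` and `build_map` are hypothetical names, a local thread pool stands in for remote resources, and the "map" is just a running per-pixel mean over a toy number of pixels.

```python
import random
from concurrent.futures import ThreadPoolExecutor, as_completed

N_PIXELS = 8  # toy map size; the slide quotes O(10^6 - 10^8) sky pixels


def simulate_realization(seed):
    """Hypothetical stand-in for one remote Monte Carlo sky simulation."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(N_PIXELS)]


def build_map(n_realizations, max_workers=4):
    """Launch simulations, then fold each result into a running mean map
    as soon as it arrives, in whatever order the results complete."""
    mean_map = [0.0] * N_PIXELS
    done = 0
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(simulate_realization, s)
                   for s in range(n_realizations)]
        for fut in as_completed(futures):
            sample = fut.result()
            done += 1
            # Incremental mean: no need to keep earlier samples around.
            mean_map = [m + (x - m) / done for m, x in zip(mean_map, sample)]
    return mean_map, done
```

Because the update is an incremental mean, the central system never has to hold more than one simulation output at a time, which is the property the slide relies on when results are delivered asynchronously.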
  17. Some Additional Applications
      • ATLAS/WLCG (Infrastructural)
        – Hierarchy of systems; data centrally stored, and locally cached (and copied to where they likely will be used), perhaps at various levels of the hierarchy
        – Processing is done by applications that are independent of each other
        – Processing of one data file is independent of processing of another file, but groups of processing results are collected to obtain statistical outputs about the data
      • LSST (Infrastructural)
        – Data taken by a telescope
        – Quick analysis is done at the telescope site for interesting (urgent) events (which may involve comparing new data with previous data)
        – System can get more data from other observatories if needed; request other observatories to take more data; or call a human
        – Data then transferred to an archive site, may be at observatory, where data are analyzed, reduced, and classified, some of which may be farmed out to grid resources
        – Detailed analysis of new data vs. archived data is performed
        – Reanalysis of all data is done periodically
        – Data are stored in files and databases
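The ATLAS/WLCG pattern above (each data file processed independently, with results then collected into statistical outputs) can be sketched as a small map-and-aggregate example. Everything here is assumed for illustration: `process_file`, `aggregate`, and `run` are hypothetical names, the "files" are in-memory lists of numbers, and a thread pool stands in for the distributed hierarchy.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def process_file(events):
    """Hypothetical per-file step: histogram values into integer bins.
    It depends only on its own input, so files can be processed anywhere."""
    return Counter(int(e) for e in events)


def aggregate(partials):
    """Combine the per-file histograms into one statistical summary."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run(files, max_workers=4):
    """Process every file independently, then collect the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(process_file, files))
    return aggregate(partials)
```

For example, `run([[1.2, 1.7, 3.0], [1.1, 2.5], [3.9]])` counts three values in bin 1, one in bin 2, and two in bin 3, regardless of which worker handled which file.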
  18. Some More Additional Applications
      • Virtual Astronomy (Archetypical)
        – Services are orchestrated through a pipeline, including a data retrieval service that is used to share data across VO sites
        – Data are moved through the pipeline, and intermediate and final products can be stored in Grid storage service
      • Marine (Sea Mammal) Sensors (Infrastructural)
        – Data are brought to a central site when sensors periodically transmit
        – Stored data are analyzed using statistical techniques, then visualized with tools such as Google Earth
      • Power Grids (Infrastructural)
        – Diverse streams arrive at a central utility private cloud at dynamic rates controlled by the application
        – Real-time event detection pipeline can trigger load curtailment operations
        – Data mining is performed on current and historical data for forecasting
        – Partial application execution on remote micro-grid sites is possible
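As a rough illustration of the Power Grids event detection pipeline, the sketch below flags points in a stream of load readings where a short moving average exceeds a threshold. The function name `detect_curtailment`, the window size, and the threshold are all assumptions made for this example; the slides do not say how the real pipeline detects events.

```python
from collections import deque


def detect_curtailment(readings, window=3, threshold_kw=100.0):
    """Yield (index, moving_average) whenever the moving average of the
    last `window` load readings exceeds `threshold_kw`."""
    recent = deque(maxlen=window)  # oldest reading drops out automatically
    for i, kw in enumerate(readings):
        recent.append(kw)
        if len(recent) == window:
            avg = sum(recent) / window
            if avg > threshold_kw:
                yield i, avg
```

Writing the detector as a generator means it can consume an unbounded stream and emit curtailment triggers as they occur, rather than waiting for a complete data set.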
  19. Even More Additional Applications
      • Industrial Incident Notification and Response (Infrastructural)
        – Data are streamed from diverse sources, and sometimes manually entered into the system
        – Disaster detection causes additional information sources to be requested from that region and applications to be composed based on available data
        – Some applications run on remote sites for data privacy
        – Escalation can cause more humans in the loop and additional operations
      • MODIS Data Processing (Traditional)
        – Data brought into system from various FTP servers
        – Pipeline of initial standardized processing steps on data is done on clouds or HPC resources
        – Scientists can then submit executables that do further custom processing on subsets of the data, which likely include some summarization processing (building graphs)
  20. 3DPAS Vectors
      • DPA vectors
        – Execution Unit
        – Communication
        – Coordination
        – Execution Environment
      • What changes for D3 applications?
        – DPA already assumed distributed; data-intensive is somewhat orthogonal to vectors, last D is dynamic
      • So, what can be dynamic?
        – Data (in value or type)
        – Application (for archetypical and infrastructural applications)
        – Execution Environment
      • And how can the application respond?
        – All 3 vectors can change (under user control, or autonomically)
  21. Infrastructure
      • Software infrastructure to support D3 applications and users exists at three levels:
        – System-level software capabilities (e.g., notifications, file system consistency)
        – Middleware (e.g., databases, metadata servers)
        – Programming systems, services and tools (e.g., data-centric workflows)
      • Strong connection between software infrastructure and execution units
        – Infrastructure supports the communication between and coordination of execution units, e.g., to allow co-scheduling
      • What changes for D3 applications?
        – Boundary between infrastructure and application often blurred
          o e.g., a catalog may be provided by underlying infrastructure or implemented in application
        – Sometimes infrastructure requires knowledge of data models
          o e.g., to support semantic information integration, triggers, optimized data transport
      • General need for infrastructure components to support
        – Data management: sources, storage, access, movement, discovery, notification, provenance
        – Data analysis: conversion, enrichment, analysis, workflow, calibration, integration
  22. Programming Systems
      • Pipelines/workflows are a key concept
      • Loosely, 3 stages for many applications – data collection, data storage, data analysis
        – But the order varies: sometimes analysis is done during collection to reduce storage
      • Some stages are built from legacy (heritage) applications
      • Some applications don’t include all stages (some stages happen elsewhere; data is just “there”)
      • Stream processing is also important to some applications (or some stages) – the complete data can never be stored, and can only be accessed once in time
      • Issues that programming systems should address
        – Programmatic provisioning of resources
        – Use of existing services, or building of new services
        – How to adapt to changes? Autonomics?
        – Recording provenance
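The stream-processing constraint noted above (data that can never be stored in full and can be read only once) shapes how analysis code is written. A minimal sketch, assuming a numeric stream and using Welford's well-known single-pass algorithm for mean and variance; the names `RunningStats` and `summarize` are chosen for this example and do not come from any 3DPAS application.

```python
class RunningStats:
    """Welford's single-pass algorithm for mean and sample variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        """Sample variance; 0.0 until at least two values have been seen."""
        return self._m2 / (self.n - 1) if self.n > 1 else 0.0


def summarize(stream):
    """Consume an iterable exactly once, keeping only O(1) state."""
    stats = RunningStats()
    for x in stream:  # each element is seen once and then discarded
        stats.update(x)
    return stats
```

The same shape applies to any summary that can be updated incrementally; analyses that need random access to past data do not fit a pure streaming stage and need a storage stage first.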
  23. Programming Systems (2)
      • Possible change: replace ad hoc and scripted approaches by more formal workflow tools
        – Potential benefits: efficiency, productivity, reproducibility, increased software reuse, ability to add provenance tracking
        – Potential issues: can application-specific knowledge be used by generic tools?
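One way to picture the provenance-tracking benefit of a more formal workflow layer is a wrapper that logs every step's inputs and outputs. The sketch below is purely illustrative: the `step` decorator, the in-memory `PROVENANCE` list, and the two toy steps are hypothetical and do not correspond to any particular workflow tool.

```python
import functools
import hashlib
import json

PROVENANCE = []  # hypothetical in-memory log; a real tool would persist this


def _digest(obj):
    """Short, stable fingerprint of a JSON-serializable value."""
    blob = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


def step(func):
    """Wrap a workflow step so every call appends a provenance record."""
    @functools.wraps(func)
    def wrapper(data):
        result = func(data)
        PROVENANCE.append({
            "step": func.__name__,
            "input": _digest(data),
            "output": _digest(result),
        })
        return result
    return wrapper


@step
def calibrate(values):
    """Toy calibration step: scale every value."""
    return [v * 2 for v in values]


@step
def reduce_sum(values):
    """Toy reduction step: collapse the values to one number."""
    return sum(values)
```

Calling `reduce_sum(calibrate([1, 2, 3]))` leaves two records in `PROVENANCE`, and the output fingerprint of the first matches the input fingerprint of the second, so the chain of steps can be reconstructed afterwards. An ad hoc script would need that bookkeeping added by hand at every call site.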
  24. Conclusions
      • D3 applications exist, and their number is increasing
      • There are some similarities across some applications
        – Stages, streaming, dynamism and adaptivity
        – Probably means there are generic abstractions that could be used
      • Programming systems are somewhat ad hoc
      • We want generic tools that
        – Allow applications to adapt to dynamism in various elements
          o E.g., developers can find and use available systems at runtime, applications can run in the best location with respect to data sources
        – Provide good performance
      • Further research needed
        – How do we abstract the set of distributed systems to allow this?
        – What middleware and tools are needed?