Azure Sample for Climate Analysis

Presentation I gave at the Microsoft Public Sector/Healthcare & Life Sciences Dinner and Cloud Computing Showcase held during PDC.

Published in: Technology
  • For updates to this content please download the latest Azure Services Platform Training Kit from: http://www.azure.com

    1. Large Scale Scientific Data: Notes From The Field. Rob Gillen, Computer Science Research, Oak Ridge National Laboratory / Planet Technologies, Inc.
    2. ORNL is DOE's largest science and energy laboratory: World's most powerful open scientific computing facility
    3. Nation's largest concentration of open source materials research
    4. $1.3B budget
    5. 4,350 employees
    6. 3,900 research guests annually
    7. $350 million invested in modernization
    8. Nation's most diverse energy portfolio
    9. Operating the world's most intense pulsed neutron source
    10. Managing the billion-dollar U.S. ITER project. Leading the development of ultrascale scientific computing: the Leadership Computing Facility
    11. World's most powerful open scientific computing facility
    12. Jaguar XT operating at >1.64 petaflops
    13. Exascale system by the end of the next decade
    14. Focus on computationally intensive projects of large scale and high scientific impact
    15. Just upgraded to ~225,000 cores
    16. Addressing key science and technology issues:
    17. Climate
    18. Fusion
    19. Materials
    20. Bioenergy (Managed by UT-Battelle for the Department of Energy)
    21. Initial Context
        - Studying the intersection of HPC/scientific computing and the cloud
        - Data locality is a key issue for us
        - Cloud computing looks to fill a niche in pre- and post-processing as well as in generalized mid-range compute
        - This project is a preparatory step toward the larger research project
    22. Sample Application Goals
        - Make CMIP3 data more accessible and consumable
        - Prototype the use of cloud computing for post-processing of scientific data
        - Answer two questions: Can cloud computing be used effectively for large-scale data? How accessible is the programming paradigm?
        - Note: the focus is on the mechanics, not the science (the values could just as well be counts of foobars as temperature simulations)
    23. Technologies Utilized
        - Windows Azure (tables, blobs, queues, web roles, worker roles)
        - OGDI (http://ogdisdk.cloudapp.net/)
        - C#, F#, PowerShell, DirectX, Silverlight, WPF, Bing Maps (Virtual Earth), GDI+, ADO.NET Data Services
    24. Two-Part Problem
        - Get the data into the cloud and expose it so that generic clients can consume it in Internet-friendly formats
        - Provide a visualization or sample application that gives the data context and meaning
    25. Context: 35 terabytes of numbers. How much data is that?
        - A single latitude/longitude map at typical climate model resolution represents about 40 KB.
        - If you wanted to look at all 35 TB in the form of these latitude/longitude plots, and if:
        - Every 10 seconds you displayed another map, and
        - You worked 24 hours a day, 365 days each year,
        - You could complete the task in about 200 years.
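The slide's estimate is easy to reproduce as a back-of-the-envelope calculation; a quick sketch (the 40 KB map size and 10-second viewing rate are the slide's figures, the rest is plain arithmetic, which lands in the same few-hundred-year ballpark as the slide's rounded figure):

```python
# Rough check of the slide's estimate: how long would it take to view
# 35 TB of climate output as ~40 KB lat/lon maps, one every 10 seconds?
TB = 10**12            # decimal terabytes
KB = 10**3

total_bytes = 35 * TB
map_bytes = 40 * KB
seconds_per_map = 10

maps = total_bytes / map_bytes              # ~875 million maps
total_seconds = maps * seconds_per_map
years = total_seconds / (365 * 24 * 3600)   # viewing 24 hours a day

print(f"{maps:.2e} maps, roughly {years:.0f} years of nonstop viewing")
```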
    26. Dataset Used
        - 5 GB worth of NetCDF files
        - Contributing sources:
          - NOAA Geophysical Fluid Dynamics Laboratory, CM2.0 Model
          - NASA Goddard Institute for Space Studies, C4x3
          - NCAR Parallel Climate Model (Version 1)
        - Climate of the 20th Century Experiment, run 1, daily
        - Surface Air Temperature (tas)
        - Maximum Surface Air Temperature (tasmax)
        - Minimum Surface Air Temperature (tasmin)
        - >1.1 billion unique values (lat/lon/temp pairs)
        - 0.014% of the total set
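The ">1.1 billion unique values" come from flattening each (lat, lon) temperature grid into individual lat/lon/value rows before loading. A minimal sketch of that flattening step in Python (the deck's actual code was C#/.NET, and a real run would read the grids from the NetCDF files with a NetCDF library; the tiny 3x4 grid here is synthetic stand-in data):

```python
import csv
import io

# Synthetic stand-in for one time step of a NetCDF variable such as
# "tas" (surface air temperature): a small lat x lon grid in Kelvin.
lats = [-45.0, 0.0, 45.0]
lons = [0.0, 90.0, 180.0, 270.0]
grid = [[288.1, 289.4, 290.0, 287.6],
        [299.2, 300.1, 298.7, 299.9],
        [278.3, 279.0, 277.5, 278.8]]

def grid_to_rows(lats, lons, grid):
    """Flatten a lat x lon grid into (lat, lon, value) tuples."""
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            yield (lat, lon, grid[i][j])

# Write the rows as CSV, the intermediate format the workflow keeps
# in blob storage before loading the values into Azure Tables.
buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(["lat", "lon", "tas"])
writer.writerows(grid_to_rows(lats, lons, grid))

print(buf.getvalue().splitlines()[1])  # -45.0,0.0,288.1
```

At full scale this expansion is what turns 5 GB of compact NetCDF into more than a billion individual table entities.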
    27. Application Workflow
        - Source files are uploaded to blob storage
        - Each source file is split into thousands of CSV files stored in blob storage
        - The process generates a Load Table command for each CSV created
        - Load Table workers process the jobs and load the CSV data into Azure Tables
        - Once a CSV file has been processed, a Create Image job is created
    28. Application Workflow (continued)
        - Create Image workers process the queue, generating a heat-map image for each time step
        - Once all data is loaded and the images are created, a video is rendered from the resulting images for inclusion in visualization applications
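The workflow above is a classic queue-driven pipeline: each stage posts a job message, a pool of workers drains the queue, and completing one job can enqueue the next. A sketch of that pattern in Python, using the in-process queue.Queue as a stand-in for Azure Queue storage (the job kinds mirror the slides; the file names and handler bodies are hypothetical):

```python
import queue

jobs = queue.Queue()

# The split stage posts one "load-table" job per CSV cut from a source file.
for csv_name in ["tas_1900_01.csv", "tas_1900_02.csv"]:
    jobs.put(("load-table", csv_name))

def handle(kind, payload, jobs):
    """Process one job; a completed load enqueues the follow-on image job."""
    if kind == "load-table":
        # (real worker: parse the CSV and batch-insert entities into Azure Tables)
        jobs.put(("create-image", payload))
        return f"loaded {payload}"
    elif kind == "create-image":
        # (real worker: render a heat-map frame for this time step)
        return f"rendered {payload}"

done = []
while not jobs.empty():  # a real worker role polls the queue indefinitely
    kind, payload = jobs.get()
    done.append(handle(kind, payload, jobs))

print(done)
```

With Azure queues the same loop gains message visibility timeouts and retries for free, which is what lets the worker pool scale out without coordination.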
    29. Current Data Loaded
        - >1.1 billion table entries (lat/lon/value)
        - >250,000 blobs
        - >75 GB (blob storage alone)
    30. Data Load Review
        - Results for the first subset
        - Averaged 2:30 (min:sec) per time period
        - 40,149 time periods
        - 24 time periods per worker-hour
        - 1,672.8 worker-hours
        - 14 active workers
        - 119.5 calendar hours
        - 328,900,608 total entities
        - Near-linear scale-out
        - This represents 0.003428% of the total set
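The throughput figures on this slide are internally consistent and easy to verify; the near-linear scale-out claim follows because calendar time is roughly worker-hours divided by worker count. A quick check (all input figures are from the slide):

```python
# Verify the data-load figures from the slide.
seconds_per_period = 2 * 60 + 30       # averaged 2:30 per time period
periods = 40_149
workers = 14

per_worker_hour = 3600 / seconds_per_period          # 24 periods per worker-hour
worker_hours = periods * seconds_per_period / 3600   # ~1,672.9 worker-hours
calendar_hours = worker_hours / workers              # ~119.5 hours (~5 days)

print(round(per_worker_hour), round(worker_hours, 1), round(calendar_hours, 1))
```

Doubling the worker count would halve the calendar time, as long as the queue and table-storage throughput keep up.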
    31. WPF Data Visualization Application (Demo)
