Azure: Lessons From The Field

This is a presentation I delivered at CodeMash 2.0.1.0 dealing with lessons learned while building an application for handling the post-processing of scientific data using the Windows Azure platform.

Published in: Technology

Transcript of "Azure: Lessons From The Field"

Lessons from the Field: Azure for Science
  Rob Gillen
  gillenre@ornl.gov
  rob.gillenfamily.net
  @argodev
Agenda
  • Introductions
      • Why is ORNL looking at Cloud Computing
      • Azure in 5 minutes
  • Post-Processing and Data Distribution in the Cloud
      • Using Cloud Computing for Post-Processing
      • Data hosting/distribution
  • Lessons (being) Learned
      • General Lessons
      • Performance
Oak Ridge National Laboratory is DOE’s largest science and energy lab
  • World’s most powerful open scientific computing facility
  • Nation’s largest concentration of open source materials research
  • $1.6B budget
  • 4,350 employees
  • 3,900 research guests annually
  • $350 million invested in modernization
  • Nation’s most diverse energy portfolio
  • Operating the world’s most intense pulsed neutron source
  • Managing the billion-dollar U.S. ITER project
Delivering science and technology
  • Ultrascale computing
  • Energy technologies
  • Bioenergy
  • ITER
  • Neutron sciences
  • Climate
  • Materials at the nanoscale
  • National security
  • Nuclear energy
Ultrascale Scientific Computing
  • Leadership Computing Facility:
      • World’s most powerful open scientific computing facility
      • Peak speed of 2.33 petaflops (> two thousand trillion calculations/sec)
      • 18,688 nodes, 224,526 compute cores, 299 TB RAM, 10,000 TB disk
      • 4,352 ft² floor space
      • Exascale system by the end of the next decade
  • Focus on computationally intensive projects of large scale and high scientific impact
  • Addressing key science and technology issues
      • Climate
      • Fusion
      • Materials
      • Bioenergy
  • Home of the 1st and 3rd fastest supercomputers in the world
  • The world’s most powerful system for open science
Then Why Look at Cloud Computing???
  • Science Takes Different Forms
      • Tight Simulations
      • Data-Parallelized
      • Embarrassingly Parallel
  • Dearth of Mid-Range Assets
      • 256-1,000 cores
      • 1 of many possible solutions
  • Scaling Issues
      • Power Consumption
      • Programming Struggles
      • Fault-Tolerance
  • Forward-Looking
      • Next-Generation Problems
      • Next-Generation Researchers
Types of Clouds (diagram)
  • Columns: Private (On-Premise), Infrastructure (as a Service), Platform (as a Service)
  • Layers in each column, top to bottom: Applications, Runtimes, Security & Integration, Databases, Servers, Virtualization, Server HW, Storage, Networking
  • Each column marks its layers as either “You manage” or “Managed by vendor”
Windows Azure Platform (diagram)
  • Application Services: Frameworks, Security, Connectivity, Data
  • Named components: “Dublin”, “Velocity”, “Geneva”, Access Control, Project “Sydney”, Service Bus, SQL Azure Data Sync
  • Compute
  • Storage: Table Storage, Blob Storage, Queue, Drive, Content Delivery Network
Windows Azure Compute
  • Development, service hosting, & management environment
      • .NET, Java, PHP, Python, Ruby, native code (C/C++, Win32, etc.)
      • ASP.NET providers, FastCGI, memcached, MySQL, Tomcat
      • Full-trust – supports standard languages and APIs
      • Secure certificate store
      • Management APIs, and logging and diagnostics systems
  • Multiple roles – Web, Worker, Virtual Machine (VHD)
  • Multiple VM sizes
      • 1.6 GHz CPU x64, 1.75GB RAM, 100Mbps network, 250GB volatile storage
      • Small (1X), Medium (2X), Large (4X), X-Large (8X)
  • In-place rolling upgrades, organized by upgrade domains
      • Walk each upgrade domain one at a time
Windows Azure Diagnostics
  • Configurable trace, performance counter, Windows event log, IIS log & file buffering
  • Local data buffering quota management
  • Query & modify from the cloud and from the desktop per role instance
  • Transfer to storage scheduled & on-demand
  • Filter by data type, verbosity & time range
Windows Azure Storage
  • Rich data abstractions – tables, blobs, queues, drives, CDN
  • Capacity (100TB), throughput (100MB/sec), transactions (1K req/sec)
  • High accessibility
      • Supports geo-location
      • Language & platform agnostic REST APIs
      • URL: http://<account>.<store>.core.windows.net
      • Client libraries for .NET, Java, PHP, etc.
  • High durability – data is replicated 3 times within a cluster, and (Feb 2010) across datacenters
  • High scalability – data is automatically partitioned and load balanced across servers
Windows Azure Table Storage
  • Designed for structured data, not relational data
  • Data definition is part of the application
      • A Table is a set of Entities (records)
      • An Entity is a set of Properties (fields)
  • No fixed schema
      • Each property is stored as a <name, typed value> pair
      • Two entities within the same table can have different properties
      • No schema is enforced
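The "no fixed schema" point can be illustrated with a small in-memory sketch. This is not the Azure client library: the `table` list and the `insert_entity` helper are hypothetical stand-ins, and the sample property names and values are made up for illustration. Only the idea of entities as sets of named, typed properties comes from the slide.

```python
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a table: a list of entities,
# each entity being a set of <name, typed value> properties.
table = []

def insert_entity(table, partition_key, row_key, **properties):
    """Store an entity; nothing is checked beyond the two keys."""
    entity = {"PartitionKey": partition_key, "RowKey": row_key}
    entity.update(properties)
    table.append(entity)
    return entity

# Two entities in the same table with entirely different properties.
insert_entity(table, "2010-01", "0001", lat=35.93, lon=-84.31, temp=281.4)
insert_entity(table, "2010-01", "0002", note="sensor offline",
              recorded=datetime(2010, 1, 14, tzinfo=timezone.utc))

# Property names per entity, excluding the keys.
property_names = [sorted(e.keys() - {"PartitionKey", "RowKey"}) for e in table]
```

Because the data definition lives in the application, any code that reads such a table has to tolerate missing properties rather than rely on a column list.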
Windows Azure Blob Storage
  • Storage for large, named files plus their metadata
  • Block Blob
      • Targeted at streaming workloads
      • Each blob consists of a sequence of blocks
      • Each block is identified by a Block ID
      • Size limit 200GB per blob
  • Page Blob
      • Targeted at random read/write workloads
      • Each blob consists of an array of pages
      • Each page is identified by its offset from the start of the blob
      • Size limit 1TB per blob
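The difference between the two blob types can be sketched in a few lines. This is a local illustration only, with no service calls: `split_into_blocks` and `page_range` are hypothetical helpers, the block size used is arbitrary, and the 512-byte page size is an assumption that the slide does not state.

```python
import base64

PAGE_SIZE = 512   # assumption: page blobs are addressed in 512-byte pages

def split_into_blocks(data, block_size):
    """Return [(block_id, chunk), ...]; IDs are equal-length base64 strings."""
    blocks = []
    for index, start in enumerate(range(0, len(data), block_size)):
        block_id = base64.b64encode(f"{index:08d}".encode()).decode()
        blocks.append((block_id, data[start:start + block_size]))
    return blocks

def page_range(offset, length):
    """Return the (first_page, last_page) indices touched by a byte range."""
    return offset // PAGE_SIZE, (offset + length - 1) // PAGE_SIZE

# Block blob: an ordered sequence of identified blocks.
blocks = split_into_blocks(b"x" * 10, block_size=4)
reassembled = b"".join(chunk for _, chunk in blocks)

# Page blob: a random-access write is located purely by its offset.
pages = page_range(offset=1000, length=100)
```

The block list suits sequential upload and streaming, while the offset arithmetic is what makes in-place random writes possible.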
Windows Azure Queue
  • Performance efficient, highly available, and provides reliable message delivery
      • Asynchronous work dispatch
      • Inter-role communication
      • Polling-based model; best-effort FIFO data structure
  • Queue operations
      • Create Queue
      • Delete Queue
      • List Queues
      • Get/Set Queue Metadata
  • Message operations
      • Add Message
      • Get Message(s)
      • Peek Message(s)
      • Delete Message
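A minimal sketch of the message operations listed above, using an in-memory stand-in. `SketchQueue` is hypothetical and is not the Azure API. The visibility timeout (a retrieved message is hidden for a while and reappears unless it is deleted) is an assumption about how the service behaves that the slide does not spell out; time is passed in explicitly so the example is deterministic.

```python
import itertools

class SketchQueue:
    """In-memory stand-in (hypothetical) for add/get/peek/delete."""

    def __init__(self):
        self._ids = itertools.count(1)
        self._messages = {}            # id -> [body, invisible_until]

    def add_message(self, body):
        msg_id = next(self._ids)
        self._messages[msg_id] = [body, 0.0]
        return msg_id

    def peek_message(self, now):
        """Return the first visible message without hiding it."""
        for msg_id, (body, invisible_until) in self._messages.items():
            if invisible_until <= now:
                return msg_id, body
        return None

    def get_message(self, now, visibility_timeout=30.0):
        """Return the first visible message and hide it for a while."""
        found = self.peek_message(now)
        if found is not None:
            self._messages[found[0]][1] = now + visibility_timeout
        return found

    def delete_message(self, msg_id):
        self._messages.pop(msg_id, None)

queue = SketchQueue()
queue.add_message("flatten file-001.nc")
first = queue.get_message(now=0.0)       # worker A takes the message
hidden = queue.get_message(now=10.0)     # still invisible to other workers
retried = queue.get_message(now=31.0)    # A never deleted it, so it reappears
queue.delete_message(retried[0])         # worker B finishes and deletes it
empty = queue.get_message(now=100.0)
```

Under this model a message can be delivered more than once, which is why the worker that finishes the job is the one that deletes it.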
Windows Azure Drive
  • Provides a durable NTFS volume for Windows Azure applications to use
      • Use existing NTFS APIs to access a durable drive
      • Durability and survival of data on application failover
      • Enables migrating existing NTFS applications to the cloud
      • Drives can be up to 1TB; a VM can dynamically mount up to 8 drives
  • A Windows Azure Drive is a Page Blob
      • Example: mount Page Blob as X:
      • http://<account>.blob.core.windows.net/<container>/<blob>
      • All writes to drive are made durable to the Page Blob
      • Drive made durable through standard Page Blob replication
Windows Azure Content Delivery Network
  • Provides high-bandwidth global blob content delivery
      • 18 locations globally (US, Europe, Asia, Australia and South America), and growing
  • Blob service URL vs. CDN URL
      • Blob URL: http://<account>.blob.core.windows.net/
      • CDN URL: http://<guid>.vo.msecnd.net/
      • Support for custom domain names
  • Access details
      • Blobs are cached in CDN until the TTL passes
      • Use per-blob HTTP Cache-Control policy for TTL (new)
      • CDN provides only anonymous HTTP access
Tenets of Internet-Scale Application Architecture
  • Design
      • Horizontal scaling
      • Service-oriented composition
      • Eventual consistency
      • Fault tolerant (expect failures)
  • Security
      • Claims-based authentication & access control
      • Federated identity
      • Data encryption & key mgmt.
  • Management
      • Policy-driven automation
      • Aware of application lifecycles
      • Handle dynamic data schema and configuration changes
  • Data & Content
      • De-normalization
      • Logical partitioning
      • Distributed in-memory cache
      • Diverse data storage options (persistent & transient, relational & unstructured, text & binary, read & write, etc.)
  • Processes
      • Loosely coupled components
      • Parallel & distributed processing
      • Asynchronous distributed communication
      • Idempotent (handle duplicates)
      • Isolation (separation of concerns)
Application Goals
  • Simulate Post-Processing of Scientific Data
      • Generate Visualizations from “raw” data
      • Transform data to be consumable by general processes
      • Exercise various storage mechanisms
  • Focus on Mechanics
      • The specific science problem being solved is secondary to the approach
      • Goal is to refine the approach so that it can fade, allowing the science to regain preeminence
Putting Data Into the Cloud
  • Source Data
      • NetCDF files – subset of US contribution to CMIP3 archive
  • Visualization Support
      • Flatten Source Files to CSV
      • Generate base “heat map”
      • Combine heat map and base map
      • Generate Video/Animation
  • General Consumption/Publishing
      • Expose data as a “service” (REST/XML/JSON, etc.)
      • Query-able
      • Azure Tables (OGDI) / Azure Blob
Application Patterns (diagram)
  • Pattern shown: Grid / Parallel Computing Application
  • Clients: User, Silverlight Application, Web Browser, Mobile Browser, WPF Application
  • Roles: ASP.NET (Web Role), Web Svc (Web Role), Jobs (Worker Role)
  • Private Cloud: Enterprise Application, Enterprise Web Svc, Enterprise Data, Enterprise Identity
  • Public Services: Application Service, Data Service, Table Storage Service, Blob Storage Service, Queue Service, Storage Service, Identity Service, Service Bus, Access Control Service, Workflow Service
  • Data: User Data, Application Data, Reference Data
Application Flow
  • Flatten NetCDF
      • Message From Q
      • Download Binary File
      • For each Time Period…
      • Flatten to CSV (memory)
      • Upload to Blob Storage
      • Queue Table Load Job
      • Queue Gen Image Job
      • Period in Lookup Table
  • Generate Image
      • Message From Q
      • Download CSV
      • Generate Image
      • Size Image
      • Upload to Blob Storage
      • Combine with Overlay
      • Upload to Blob Storage
  • Table Loader
      • Message From Q
      • Download CSV
      • Read In Rows
      • For each Set of 100…
      • Submit Batch To Table
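The three-stage flow can be sketched with in-memory stand-ins for the queues, blob storage, and table. This is a simplified illustration, not the application from the talk: every name here (`blobs`, `table`, `image_jobs`, `table_jobs`, and the three functions) is hypothetical, the "image" is a placeholder string, and the overlay and lookup-table steps are omitted.

```python
from collections import deque

blobs = {}                                    # stand-in for blob storage
table = []                                    # stand-in for table storage
image_jobs, table_jobs = deque(), deque()     # stand-ins for the two job queues

def flatten(source_name, periods):
    """Flatten each time period to CSV text, upload it, queue follow-up jobs."""
    for period, rows in periods.items():
        csv_name = f"{source_name}/{period}.csv"
        blobs[csv_name] = "\n".join(",".join(map(str, r)) for r in rows)
        table_jobs.append(csv_name)
        image_jobs.append(csv_name)

def generate_image(csv_name):
    """Placeholder 'image': records how many points would be drawn."""
    row_count = len(blobs[csv_name].splitlines())
    blobs[csv_name.replace(".csv", ".png")] = f"heatmap({row_count} points)"

def load_table(csv_name, batch_size=100):
    """Read rows back and submit them in batches of up to batch_size."""
    rows = blobs[csv_name].splitlines()
    batches = 0
    for start in range(0, len(rows), batch_size):
        table.extend(rows[start:start + batch_size])
        batches += 1
    return batches

# One source file with a single time period of 250 made-up data points.
periods = {"day-001": [(35.0, -84.0, 280.1)] * 250}
flatten("sample.nc", periods)
while image_jobs:
    generate_image(image_jobs.popleft())
batch_counts = [load_table(table_jobs.popleft()) for _ in range(len(table_jobs))]
```

Because the stages only communicate through queues and stored blobs, each one can be scaled out or restarted independently of the others.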
Context
  • 35 TB of numbers – How Much Data Is That?
      • A single lat/lon map at typical climate model resolution represents ~40 KB
      • If you wanted to look at all 35 TB in the form of these lat/lon plots and if…
          • Every 10 seconds you displayed another map
          • You worked 24 hours/day, 365 days/year
      • You could complete the task in about 200 years.
  • Dataset Used
      • 1 NetCDF file, approximately 92 MB, located in blob storage
      • 1,825 CSV files generated
          • 815.84 MB total
          • Average file size is around 457.76 KB
          • Each CSV represented 12,690 data points (lat/lon/temp)
      • 3,650 images generated
          • 145.03 MB total
          • Heat Maps avg. 31.25 KB
          • Combined images avg. 49 KB
      • 23,652,000 entities added to Azure table
Lessons
  • Performance Counters
      • Take advantage of the new logging infrastructure within Azure to understand how your application is behaving.
      • However, like food at the dinner table, only take what you can eat.
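The "Dataset Used" figures on the Context slide can be cross-checked with a little arithmetic. The sketch below uses only the numbers quoted there and assumes 1 MB = 1,024 KB. The average CSV size reproduces the quoted 457.76 KB, and the image count is exactly two per CSV file (presumably one heat map and one combined image). Dividing the entity count by the file count gives 12,960 rows per file rather than the quoted 12,690, so one of those two figures may contain transposed digits.

```python
# Figures quoted on the slide.
csv_files = 1_825
csv_total_mb = 815.84
images = 3_650
image_total_mb = 145.03
entities = 23_652_000
points_per_csv_quoted = 12_690

# Derived values (assumes 1 MB = 1,024 KB).
avg_csv_kb = csv_total_mb * 1024 / csv_files
avg_image_kb = image_total_mb * 1024 / images
images_per_csv = images / csv_files
entities_per_csv = entities / csv_files

print(f"average CSV size:   {avg_csv_kb:.2f} KB")
print(f"average image size: {avg_image_kb:.2f} KB")
print(f"images per CSV:     {images_per_csv:.0f}")
print(f"entities per CSV:   {entities_per_csv:.0f} (slide quotes {points_per_csv_quoted})")
```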
Image Generation – Proc utilization ~95% during active work
Table Load – Proc utilization ~57% during active work
Lessons
  • Performance Counters
      • Take advantage of the new logging infrastructure within Azure to understand how your application is behaving.
      • However, like food at the dinner table, only take what you can eat.
  • Tracing Infrastructure
      • Huge improvements from CTP to v1
      • Use categories to filter / limit what you transfer out
      • My eyes were bigger than my stomach
  • Table Maintenance
      • (nodes * counters) + (nodes * trace) == lots of data
      • Plan early for how you are going to maintain Wad* tables.
      • Remember… redundancy/availability has a cost. (Perf)
Flatten: CSV Upload Time
  • Over 40,349 attempts, 249.99 ms (79.12 ms) with a rate of 15.63 mb/s (4.74).
  • Avg file size: 457.76 KB
Flatten: CSV Upload Rate
  • Over 40,349 attempts, 249.99 ms (79.12 ms) with a rate of 15.63 mb/s (4.74).
  • Avg file size: 457.76 KB
Flatten: Queue Insert Duration
  • Over 40,345 attempts, given a msg size of 616b, insertion time averaged 254.96 ms (68.86)
Flatten: Single Table Entity Insert
  • Over 40,353 attempts, average insertion time of 248.63 ms (108.16)
ImageGen: CSV File Download Duration
  • Over 40,349 attempts, 249.99 ms (79.12 ms) with a rate of 15.63 mb/s (4.74).
  • Avg file size: 457.76 KB
ImageGen: CSV File Download Rate
  • Over 40,349 attempts, 249.99 ms (79.12 ms) with a rate of 15.63 mb/s (4.74).
  • Avg file size: 457.76 KB
ImageGen: Image Generation and Resizing
  • Over 24,687 attempts, average generation time was 3.7 s (0.283 s)
ImageGen: Image File Upload Duration
  • Over 24,688 attempts, 88.14 ms (44.84 ms) with a rate of 3.02 mb/s (0.614).
  • Avg file size: 32 KB
ImageGen: Image File Upload Rate
  • Over 24,688 attempts, 88.14 ms (44.84 ms) with a rate of 3.02 mb/s (0.614).
  • Avg file size: 32 KB
TableLoad: Batch Insert Rate
  • Over 89,202 batches (100 records each), average duration was 1.447 s (0.316 s)
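The single-insert and batch-insert timings above can be compared directly. The sketch below uses only the two averages quoted on the slides (248.63 ms for one entity, 1.447 s for a batch of 100). The `chunked` helper is a hypothetical illustration of how rows might be grouped before submission and is not part of any Azure library.

```python
import math

single_insert_s = 0.24863     # one entity per request (248.63 ms average)
batch_insert_s = 1.447        # one batch of 100 entities
batch_size = 100

per_entity_in_batch_s = batch_insert_s / batch_size
speedup = single_insert_s / per_entity_in_batch_s

# Smallest number of single inserts that costs more than one full batch.
break_even = math.ceil(batch_insert_s / single_insert_s)

def chunked(items, size=batch_size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

# 250 rows become two full batches and one partial batch.
batches = list(chunked(list(range(250))))

print(f"per-entity cost in a batch: {per_entity_in_batch_s * 1000:.2f} ms")
print(f"speedup over single inserts: {speedup:.1f}x")
print(f"single inserts that outweigh one batch: {break_even}")
```

On these averages a batch costs about as much as six individual inserts while carrying 100 entities.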
Lessons
  • Data
      • Generic formats tend to be large (92 MB NetCDF vs. 816 MB CSV)
      • Data transfer within Azure datacenter is fast (from your computer is slow)
      • Think about transport overhead (ATOM/JSON/CSV/etc. – 9x larger)
      • Use async calls for data uploads/downloads (use your CPU cycles wisely – you are paying for them)
  • Azure Tables
      • Inserts/Deletes are slow but relatively linear
      • Partition keys are not queryable… store them
      • Not well suited for “changing” data
      • If you are using the client library/ADO.NET Data Services, be careful of how you handle async calls – you can lose context
      • Use batch updates wherever possible (1 in 0.24863 s or 100 in 1.447 s); 6 individual updates take longer than 100 in a single batch.
Lessons
  • General
      • Timeouts happen – Expect/Plan for them (exponential back-off & retry policies)
      • Design for Idempotency
      • Watch your compilation model (x86 vs. x64)
      • Data transfer within Azure datacenter is fast (from your computer is slow)
      • Don’t re-invent the wheel – use the available tools when practical
      • PowerShell, PowerPivot, Logparser, and the .NET Charting Libraries are your friends.
Thank you
  gillenre@ornl.gov
  rob.gillenfamily.net
  @argodev
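The General lessons on timeouts and idempotency can be sketched together. This is a minimal illustration under stated assumptions, not the retry policy used in the talk: `retry_with_backoff`, `handle_once`, and `flaky_upload` are hypothetical names, the delay schedule (base delay doubled on each attempt, plus random jitter) is one common choice, and the `processed` dictionary stands in for whatever durable record of finished work an application keeps.

```python
import random
import time

def retry_with_backoff(operation, attempts=5, base_delay=0.5,
                       sleep=time.sleep, retry_on=(TimeoutError,)):
    """Call operation(); on a retryable error wait base_delay * 2**n plus jitter."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise                      # out of attempts: surface the error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))

processed = {}   # job id -> result; stand-in for a durable record of finished work

def handle_once(job_id, work):
    """Idempotent handler: a redelivered job returns the stored result."""
    if job_id not in processed:
        processed[job_id] = work()
    return processed[job_id]

# Demo: an operation that times out twice, then succeeds.
calls = {"n": 0}

def flaky_upload():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "uploaded"

delays = []      # collect the waits instead of actually sleeping
result = retry_with_backoff(lambda: handle_once("job-1", flaky_upload),
                            sleep=delays.append)
again = handle_once("job-1", flaky_upload)   # duplicate delivery: no new call
```

Retrying is only safe because the handler is idempotent: a job that is delivered or attempted twice produces the same stored result instead of duplicate work.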
The Microsoft Cloud
  • Data Center Infrastructure
The Microsoft Cloud
  • ~100 Globally Distributed Data Centers
  • Quincy, WA
  • Chicago, IL
  • San Antonio, TX
  • Dublin, Ireland
  • Generation 4 DCs
