Windows Azure: Notes From the Field


Presented on September 14, 2009 to the HUNTUG group

Published in: Technology
  • For updates to this content please download the latest Azure Services Platform Training Kit from:
  • This is the exploding cloud diagram
  • Windows Azure runs on Windows Server 2008 with .NET 3.5 SP1. At MIX09, we opened up support for Full Trust and FastCGI. Full Trust is starred here because, while it lets you p/invoke into native code, that code still runs in user mode (not as administrator). For most native code that is just fine, but a call into some Win32 APIs might not work in all instances because we are not running your code under a system administrator account. There are two roles in play. A web role is just a web site (WCF, images, CSS, etc.). A worker role is similar to a Windows service: it runs in the background and can be used to decouple processing. There is a diagram later that shows the architecture, so don’t worry about how it fits together just yet. Key to point out: the inbound protocols are HTTP and HTTPS; outbound can be any TCP socket (but not UDP). All servers are stateless, and all access is through load balancers.
  • This should give a short introduction to storage. Key points are that it is durable (once you write something, we write it to disk), scalable (you have multiple servers with your data), and available (the same as compute, we make sure the storage service is always running, and there are 3 instances of your data at all times). Quickly work through the different types of storage. Blobs: similar to the file system; use them to store content that changes, uploads, unstructured data, images, movies, etc. Tables: semi-structured; they provide a partitioned entity store (more on partitions in the Building Azure Services talk) and allow you to have tables containing billions of rows, partitioned across multiple servers. Queues: a simple queue for decoupling Compute Web and Worker Roles. All access is through a REST interface. You can access the storage from outside of the data center (you don’t need compute), and you can access storage via anything that can make an HTTP request. It also means table storage can be accessed via ADO.NET Data Services.
  • Remind them the cloud is all the hardware across the board. Point out the automated service management.
  • The Developer SDK is a cloud in a box, allowing you to develop and debug locally without requiring a connection to the cloud. You can do this without Visual Studio, as there are command line tools for executing the “cloud in a box” and publishing to the cloud. There is also a separate download for the Visual Studio 2008 tools, which provide the VS debugging and templates. Requirements are any version of Visual Studio (including Web Developer Express) and Vista SP1, Win7 RC or later.
  • Windows Azure: Notes From the Field

    1. Windows Azure: Notes From The Field
       • Rob Gillen
       • Computer Science Research
       • Oak Ridge National Laboratory
       • Planet Technologies, Inc.
    2. Agenda
       • Introduction to Windows Azure
       • Application Overview
       • What didn’t work
       • What is working (or, at least we think)
       • Lessons (being) Learned
       • Questions
    3. About Planet Technologies
       • Leader in integration and customization of Microsoft technologies, architecture, security, and management consulting
       • 100% Microsoft Focused Gold Partner
       • Four-time Microsoft Federal Partner of the Year (05-08)
       • Microsoft SLG Partner of the Year (08)
       • Microsoft Public Sector Partner of the Year (06)
    4. Oak Ridge National Laboratory is DOE’s largest science and energy lab
       • World’s most powerful open scientific computing facility
       • Nation’s largest concentration of open source materials research
       • $1.3B budget
       • 4,350 employees
       • 3,900 research guests annually
       • $350 million invested in modernization
       • Nation’s most diverse energy portfolio
       • Operating the world’s most intense pulsed neutron source
       • Managing the billion-dollar U.S. ITER project
       • Delivering science and technology
         • Ultrascale computing
         • Energy technologies
         • Bioenergy
         • ITER
         • Neutron sciences
         • Climate
         • Materials at the nanoscale
         • National security
         • Nuclear energy
    13. Ultrascale Scientific Computing
        • Leadership Computing Facility:
        • World’s most powerful open scientific computing facility
        • Jaguar XT operating at 1.64 petaflops
        • Exascale system by the end of the next decade
        • Focus on computationally intensive projects of large scale and high scientific impact
        • Addressing key science and technology issues
          • Climate
          • Fusion
          • Materials
          • Bioenergy
        • The world’s most powerful system for open science
    23. Unique Network Connectivity
        • 10 GB, moving to 40 GB, and higher
    24. Disclaimer
        • Windows Azure is still in CTP. There are issues. They are making it better. This talk is simply about current experiences and hopefully some tips/pointers to help you reach success faster.
        • The tests performed and referenced in this talk are not deemed scientifically accurate – simply what I have seen in my testing/usage.
        • There are (many) people (much) smarter than me.
    25. What is Windows Azure?
        • Compute
        • Storage
        • Developer SDK
    26. What is Windows Azure?
        • Compute
          • .NET 3.5 SP1
          • Server 2008 – 64bit
          • Full Trust*
          • Web Role
          • IIS7 Web Sites (ASP.NET, FastCGI)
          • Web Services (WCF)
          • Worker Role
          • Stateless Servers
          • Http(s)
        • Storage
        • Developer Tools
    35. What is Windows Azure?
        • Storage
          • Durable, scalable, available
          • Blobs
          • Tables
          • Queues
          • REST interfaces
          • Can be used without compute
        • Compute
        • Developer Tools
    41. What is Windows Azure?
        • Compute
        • Storage
        • All of the hardware
        • Hardware Load Balancers
        • Servers
        • Networks
        • DNS
        • Monitoring
        • Automated service management
        • Developer Tools
    48. What is Windows Azure?
        • Developer SDK
          • Windows Azure SDK
          • Local compute environment
          • Local Mock Storage
          • Command line tools
          • Small Managed API
          • Logging, working storage
          • Microsoft Visual Studio 2008 add-in
        • Compute
        • Storage
    55. Service Architecture
        • [Diagram: Windows Azure Datacenter hosting “Your Service”. Labels: Internet, LB (two), Web Site (ASPX, ASMX, WCF) (three instances), Queue, Worker Service (two instances), Storage (Tables, Blobs)]
    56. Initial Context
        • Studying the intersection of HPC/scientific computing and the cloud
        • Data locality is expected to be a key issue for us
        • Cloud computing looks to fill a niche in pre- and post-processing as well as generalized mid-range compute
        • This project is an introductory or preparatory step into the larger research project
    57. Sample Application Goals
        • Make CMIP3 data more accessible/consumable
        • Prototype the use of cloud computing for post-processing of scientific data
        • Answer the questions:
          • Can cloud computing be used effectively for large-scale data?
          • How accessible is the programming paradigm?
        • Note: focus is on the mechanics, not the science (could be using number of foobars in the world rather than temp simulations)
    58. Two-Part Problem
        • Get the data into the cloud, exposed in such a way as to be consumable by generic clients in Internet-friendly formats
        • Provide some sort of visualization or sample application to provide context/meaning to the data
          • Simply making the data available doesn’t solve much
          • Looking at TB of date/lat/lon/temp combinations doesn’t convey much
          • A visualization or sample application was required to make the data “grok-able”
    59. Putting Data in the Cloud
        • Source format: NetCDF is a hierarchical, n-dimensional binary format. Highly compressed and efficient. Difficult to consume in small bites over the Internet (often need to download the entire file or use OPeNDAP)
        • Libraries for interacting with NetCDF are available in C, Java, Ruby, Python, etc. Rudimentary managed wrapper available on CodePlex. File format is a hurdle for the casual observer (non-domain expert).
    60. Putting Data in the Cloud
        • Desire to expose data as a “service” (think REST, XML, JSON, etc.)
        • Decided to store in Azure Tables as “flattened” view
          • Designed to scale to billions of records
          • Consumers can query and retrieve small slices of data
          • Supports ADO.NET Data Services with no extra effort (ATOM)
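To make the “flattened” view concrete, here is a minimal Python sketch of turning one time slice of a gridded variable into flat, table-style entities. The talk does not show its schema, so the key scheme (day index as `PartitionKey`, grid indices as `RowKey`), the function name `flatten_time_slice`, and the sample values are all assumptions made for illustration.

```python
from typing import Iterator


def flatten_time_slice(day_index, lats, lons, grid) -> Iterator[dict]:
    """Yield one flat entity per (lat, lon) cell of a single time slice.

    Hypothetical layout: the day index is the partition key and the grid
    indices form the row key. The day is also stored as a normal property.
    """
    for i, lat in enumerate(lats):
        for j, lon in enumerate(lons):
            yield {
                "PartitionKey": f"{day_index:06d}",
                "RowKey": f"{i:03d}_{j:03d}",
                "Day": day_index,
                "Lat": lat,
                "Lon": lon,
                "Temp": grid[i][j],
            }


# Tiny made-up 2 x 3 grid standing in for one day of model output.
lats = [-45.0, 45.0]
lons = [0.0, 120.0, 240.0]
grid = [[280.1, 281.4, 279.9], [270.3, 268.8, 271.0]]
entities = list(flatten_time_slice(0, lats, lons, grid))
```

Each cell becomes its own small record, which is what lets a client ask for a narrow slice instead of downloading a whole file, and also why the flattened form is much larger than the binary source.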
    61. Context: 35 Terabytes of numbers - How much data is that?
        • A single latitude/longitude map at typical climate model resolution represents about 40 KB.
        • If you wanted to look at all 35 TB in the form of these latitude/longitude plots, and if
          • every 10 seconds you displayed another map, and if
          • you worked 24 hours a day, 365 days each year,
        • you could complete the task in about 200 years.
    62. Dataset Used
        • 1.2 GB NetCDF file: NCAR climate of the 20th century, run 1, daily data, air temperature, 1.4 degree grid
        • 40,149 days represented
        • Each day has 8,192 temperature values
        • Total of 328,900,608 unique values
        • 0.003428 % of total set
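The dataset figures can be reproduced with a few lines of arithmetic. This sketch assumes decimal units (1 TB = 1,000 GB) when comparing the 1.2 GB file with the 35 TB archive; the variable names are illustrative only.

```python
days = 40_149
values_per_day = 8_192

# Total number of temperature values in the sample file.
total_values = days * values_per_day

# Share of the full archive that this one file represents, in percent.
file_gb = 1.2
archive_gb = 35 * 1000  # 35 TB, assuming decimal units
share_percent = file_gb / archive_gb * 100

print(f"{total_values:,} values, {share_percent:.6f} % of the archive")
```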
    63. Data Load Approach #1
        • Local application flattened NetCDF in memory, load records directly into Azure Tables using Entity Framework
        • Initially 1 record at a time (prior to batch support)
        • 100 record batches (max/batch once batch support enabled)
        • Worked, but took *forever* (collect this time)
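A rough sketch of the batching step, in Python rather than the author's .NET code. It assumes the 100-entity limit from the slide and that every entity in a batch shares a partition key, which Azure Table batch operations require; `batches` is a hypothetical helper name.

```python
from collections import defaultdict
from itertools import islice

MAX_BATCH = 100  # per-batch limit quoted on the slide


def batches(entities, size=MAX_BATCH):
    """Group entities by PartitionKey, then yield chunks of at most `size`."""
    by_partition = defaultdict(list)
    for entity in entities:
        by_partition[entity["PartitionKey"]].append(entity)
    for group in by_partition.values():
        it = iter(group)
        while True:
            chunk = list(islice(it, size))
            if not chunk:
                break
            yield chunk


# One day of data: 8,192 entities in a single (assumed) partition.
one_day = [{"PartitionKey": "000000", "RowKey": f"{n:05d}"} for n in range(8192)]
day_batches = list(batches(one_day))
```

With 8,192 values per day this yields 82 batches per time period, which matches the batch count quoted in the Lessons slide.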
    64. Data Load Approach #2
        • Local application flattened NetCDF into CSV files (one per time unit, ~41,000 files)
        • CSV files uploaded into Azure blob storage
        • Queue populated with individual entries for each time unit
        • Worker roles would grab a time period from the queue, pull in the CSV, upload the data to the tables in 100-unit batches, and delete the entry from the queue
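The queue-driven load can be sketched with in-memory stand-ins. Nothing below talks to Azure: `blobs`, `table`, and `work` are plain Python objects playing the roles of blob storage, a table, and a queue, and all names are hypothetical. A real queue service usually hides a message while it is being processed and needs an explicit delete afterwards; `queue.Queue` has no such step, so `task_done()` simply marks where that delete would go.

```python
import csv
import io
import queue

blobs = {}            # blob name -> CSV text (stand-in for blob storage)
table = {}            # (partition, row) -> temperature (stand-in for a table)
work = queue.Queue()  # one message per time unit


def enqueue_time_unit(name, rows):
    """Store a time unit's rows as a CSV 'blob' and queue it for loading."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    blobs[name] = buf.getvalue()
    work.put(name)


def worker():
    """Drain the queue, loading each CSV into the table 100 rows at a time."""
    processed = 0
    while True:
        try:
            name = work.get_nowait()
        except queue.Empty:
            return processed
        rows = list(csv.reader(io.StringIO(blobs[name])))
        for start in range(0, len(rows), 100):
            for lat, lon, temp in rows[start:start + 100]:
                # Keyed writes: repeating a job overwrites, never duplicates.
                table[(name, f"{lat}_{lon}")] = float(temp)
        work.task_done()  # the real system would delete the message here
        processed += 1


enqueue_time_unit("day-000000", [(-45.0, 0.0, 280.1), (45.0, 0.0, 270.3)])
enqueue_time_unit("day-000001", [(-45.0, 0.0, 280.4), (45.0, 0.0, 270.9)])
done = worker()
```

Because each job is independent, adding workers divides the wall-clock time, which is consistent with the near-linear scale-out reported in the results.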
    65. Data Load Approach #2: Results
        • Averaged 2:30/time period
        • 40,149 time periods
        • 24 per worker hour
        • 1,672.8 worker-hours
        • 14 active workers
        • 119.5 calendar hours
        • ~5 days total load
        • 328,900,608 total entities
        • Near-linear scale out
        • Remember, this is 0.003428 % of total set
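The load-time figures follow directly from the average time per period. This short Python check assumes, as the slide's numbers imply, that work divides evenly across the 14 workers; the variable names are illustrative.

```python
seconds_per_period = 2 * 60 + 30  # 2:30 average per time period
periods = 40_149
workers = 14

periods_per_worker_hour = 3600 / seconds_per_period
worker_hours = periods / periods_per_worker_hour
calendar_hours = worker_hours / workers  # assumes an even split of work
calendar_days = calendar_hours / 24

print(f"{periods_per_worker_hour:.0f} periods per worker-hour")
print(f"{worker_hours:.1f} worker-hours")
print(f"{calendar_hours:.1f} calendar hours, about {calendar_days:.1f} days")
```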
    66. Data Load Approach #3
        • Similar to #2, but initial flattening to CSVs occurs in Azure rather than on the local machine
        • Same table load performance as #2, but doesn’t require local machine resources for flattening and uploading
        • Uploading a single 1.2 GB NetCDF file is much faster than uploading ~40,100 300 KB CSV files
    67. Sample Visualization Application
        • Goals
          • Generate heat maps for each time slice
          • Animate collection of heat maps
          • Allow user to compare similar time frames from various experiments to understand impact of changes
    68. Visualization Approach #1
        • Silverlight-based app, using CTP Virtual Earth control
        • Download data by time period; for each data point (lat, lon, temp), create a bounding square (polygon) and set the color on the VE control
        • Downloaded via Entity Framework (easy to write)
        • Downloaded via JSON (harder, but less verbose)
        • Store datasets in memory, allow user to select between, animate downloaded sets, batch download
    69. Visualization Approach #1: Results
        • ATOM is *very* bloated (~9 MB per time period, average of 55 seconds over 9 distinct, serial calls)
        • JSON is better (average of 18.5 seconds and 1.6 MB)
        • Client image rendering is *ok*…
        • Polygons prevented normal VE interaction
        • When interaction occurred, it was jerky
    70. Silverlight-based Client Processing
        • Demo
    71. Visualization Approach #1.5
        • Attempted to go the whole “GIS” route and create a WMS or use MapCruncher
        • Results
          • Process worked OK, but was heavy/manually intensive
          • With the resolution of the data I was using, was interactivity valuable?
    72. Visualization Approach #2
        • Pre-generate the images for each time period
        • Used fixed-size base map
        • Pre-cache images
        • Silverlight and WPF viewer would include WPF animation to cycle through image collection
    73. Visualization Approach #2: Results
        • Image generation worked fine (smoother than VE)
        • Both Silverlight and WPF desktop app choked on animations when the number of images got large (i.e. > 100)
    74. Visualization Approach #3
        • Same approach as #2, but generate video (i.e. WMV)
        • Results
          • Significantly improved rendering performance
          • Supports streaming
    75. WPF Client Image Animation and pre-rendered video
        • Demo
    76. Sidebar: Generating Heat Maps
        • Create an image using GDI+ and set the appropriate pixels to a shade of gray from 0-255
        • Apply a color map that translates from a gray to a color in a reference image
        • (Yes… you have to care about pixels…)
    77. Sidebar: Generating Heat Maps
        • Rudimentary math, but process intensive for generating each image. (There’s likely a better way…)
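The two-step heat map idea (scale each value to a gray level, then look that level up in a color map) can be sketched in plain Python. This is not the GDI+ code from the talk: the palette below is a computed blue-to-red ramp standing in for the reference image mentioned on the slide, and `to_gray`, `build_palette`, and `heat_map` are hypothetical names.

```python
def to_gray(value, lo, hi):
    """Scale a value linearly into the 0-255 gray range."""
    if hi == lo:
        return 0
    scaled = (value - lo) / (hi - lo)
    return max(0, min(255, round(scaled * 255)))


def build_palette():
    """256-entry lookup table: gray 0 maps to blue, gray 255 maps to red."""
    return [(g, 0, 255 - g) for g in range(256)]


def heat_map(grid):
    """Return a grid of (R, G, B) pixels for a grid of temperatures."""
    flat = [v for row in grid for v in row]
    lo, hi = min(flat), max(flat)
    palette = build_palette()
    return [[palette[to_gray(v, lo, hi)] for v in row] for row in grid]


# Made-up temperatures in kelvin; coldest maps to blue, warmest to red.
pixels = heat_map([[250.0, 275.0], [300.0, 262.5]])
```

Scaling against the minimum and maximum of a single grid makes each frame use its own color range; an animation across time slices would more likely need one fixed range for all frames.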
    78. Current Application Workflow
        • NetCDF file (source) uploaded to blob storage
        • NetCDF file split into thousands of CSV files stored in blob storage
        • Process generates a LoadTable command for each CSV created
        • LoadTable workers process jobs and load CSV data into Azure Tables
        • Once a CSV file has been processed, a CreateImage job is created
    79. Current Application Workflow
        • CreateImage workers process queue, generating a heat map image for each time set
        • Once all data is loaded and images are created, a video is rendered based on the resulting images and used for inclusion in visualization applications
        • Each source image is “munged” with a base map image prior to loading into the video
    80. Technologies Utilized
        • Windows Azure (tables, blobs, queues, web roles, worker roles)
        • OGDI
        • C#, F#, PowerShell, DirectX, Silverlight, WPF, Bing Maps (Virtual Earth), GDI+, ADO.NET Data Services
    81. Lessons
        • Cloud-focused data formats are large
          • Single ~1.2 GB NetCDF == ~16 GB of CSV
        • Table load time is “slow”
          • ~8,200 records, over 82 batches, average 2:30
          • However, insert time remains linear
        • Partition keys are not queryable… store them
        • Load times prevent Azure tables from being particularly well-suited for large-scale data
        • Watch your compilation model (32 vs. 64 bit)
    82. Lessons
        • Errors happen… plan for/expect them
          • Watch for timeouts when retrieving files, uploading data, etc. (Code Sample)
        • Design for idempotency
          • Multiple applications of the operation do not change the result
          • Assume your worker roles will get restarted
        • Azure deployments will fail when you least want them to (remember, it’s a CTP)
        • Stay away from dev storage (local fabric)
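The code sample referenced on the slide is not part of this transcript. As a stand-in, here is a minimal Python sketch of the two ideas together: retry on timeout with a growing delay, and make the retried write idempotent so that repeats are harmless. `with_retries`, `flaky_upsert`, and the in-memory `table` are hypothetical names, and the timeout is simulated.

```python
import time


def with_retries(operation, attempts=4, base_delay=0.01, retry_on=(TimeoutError,)):
    """Call `operation`, retrying on the listed errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the error to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))


table = {}          # in-memory stand-in for a table
calls = {"n": 0}


def flaky_upsert():
    """Keyed write that 'times out' twice after the write has been applied."""
    calls["n"] += 1
    table[("000000", "001_001")] = 268.8  # same key every time: an upsert
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout after the write was applied")
    return "ok"


result = with_retries(flaky_upsert)
```

The write lands three times but the table still holds one row, which is the property that makes restarts and duplicate queue deliveries safe.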
    83. Lessons
        • ATOM is convenient, but bloated; use JSON where possible
        • Data transfer within Azure datacenters is fast. Use web roles to format/proxy data for transfer over the Internet
        • Azure logs are very slow; use alternate reporting methods if a faster feedback loop is necessary
    84. Related Content
        • Net CDF:
        • Net CDF Wrapper for .NET:
        • OPeNDAP:
        • CMIP 3:
        • Open Government Data Initiative:
        • JSON.NET:
        • Map Cruncher:
        • Heat maps for VE:!42E1F70205EC8A96!7742.entry?wa=wsignin1.0&sa=406128337
        • Heat maps in C#:
    85. Related Content
        • Silverlight 3 and Data Paging
          • With ATOM:
          • With JSON:
        • AtomPub, JSON, Azure, and Large Datasets
          • Part 1:
          • Part 2:
    86. Questions
        • Rob Gillen
        • Email:
        • Blog:
        • Twitter: @argodev