Cloud HPC

Application of Cloud Computing for HPC problems

Published in: Technology

  • Why buy the cow when all you need is the milk.
  • http://www.slideshare.net/Ivan_datasynapse/datasynapse-and-amazon-ec2-technical-overview
  • Emerging Grid Frameworks are enabling Real-time use of the Grid
  • http://highscalability.com/are-cloud-based-memory-architectures-next-big-thing
  • Transcript

    • 1. MOVING HPC APPLICATIONS TO CLOUD
      The Practitioner's Perspective
      © 2009 Grid Dynamics — Proprietary and Confidential
      Victoria Livschitz
      CEO, Grid Dynamics
    • 2. Agenda
      What clouds & HPC are being discussed?
      HPC & Clouds: match made in heaven?
      Concerns: dealing with performance, data and security
      Strategies of moving HPC to clouds
      Overview of HPC cloudware platforms
      Case studies: Monte Carlo, Batch Analytics, Excel @ Cloud
      Conclusions: where is cloud-HPC headed?
      10/20/2009
    • 3. What are we talking about?
    • 4. HPC + CLOUD: Match made in Heaven or Hell?
    • 5. The blessings
    • 6. THE CURSE: BARRIERS OF ADOPTION
      Performance issues
      Virtualization takes a toll on CPU and especially I/O performance
      Cloud networks are not designed for low-latency communication
      Data issues
      HPC can consume or produce enormous data volumes
      Data must be moved in and out of the cloud, which adds latency and cost
      Vendor-related issues
      Memory caps (currently ~16 GB) limit some shared-memory jobs
      Legacy issues: clouds tend to support only the latest kernels and libraries
      Licensing and certifications of vendor software
      Security issues
      Data privacy, availability and integrity
      Private data moving over WAN
    • 7. RAW CLOUD PERFORMANCE: IS IT REALLY AN ISSUE?
      Cloud HPC is slower than a bare-metal cluster
      For the majority of use cases: 5% to 30% slower
      Not an issue if you can compensate by adding more VMs
      Consider time sharing and queuing on a static HPC cluster
      A slower but dedicated cloud can get things done faster
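The trade-off on this slide can be put into rough numbers. The sketch below is only an illustration: the job size, node counts and queue wait are made-up values, "30% slower" is read as a 1.3x run time, and the job is assumed to be perfectly parallel.

```python
def time_to_result_hours(work_node_hours, nodes, slowdown=0.0, queue_wait_hours=0.0):
    """Wall-clock hours until a perfectly parallel job finishes.

    work_node_hours:  total work, measured in bare-metal node-hours
    nodes:            number of nodes (or VMs) running the job
    slowdown:         fractional virtualization penalty (0.30 = 30% slower)
    queue_wait_hours: time spent waiting for a shared cluster to free up
    """
    run_hours = work_node_hours * (1.0 + slowdown) / nodes
    return queue_wait_hours + run_hours


# Hypothetical job: 640 node-hours of work (all figures are assumptions).
static_cluster = time_to_result_hours(640, nodes=64, queue_wait_hours=6.0)
cloud_same_size = time_to_result_hours(640, nodes=64, slowdown=0.30)
cloud_more_vms = time_to_result_hours(640, nodes=128, slowdown=0.30)

print(f"shared bare-metal cluster: {static_cluster:.1f} h")
print(f"cloud, same node count:    {cloud_same_size:.1f} h")
print(f"cloud, twice the VMs:      {cloud_more_vms:.1f} h")
```

Under these assumptions the queue wait on the shared cluster outweighs the virtualization penalty, and doubling the VM count shortens the cloud run further.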
    • 8. MITIGATING DATA ISSUES
    • 9. Mitigating security issues
    • 10. HYBRID Cloud architecture
      Keep your private data secure in a colo
      Perimeter firewall for internet-facing services
      LAN connection to elastic capacity
    • 11. IS CLOUD HPC ALREADY A REALITY?
      Gaia ESA mission:
      To build a catalogue of 1B stars (1% of the Galaxy)
      To be launched in 2011 for a 5-year mission
      3-8 Mbit/s downlink, 30 Gb/day
      Data reduction cycles
      Multiple observations allow star positions to be refined
      6-month observation cycle followed by a 2-week catalogue refinement cycle
      Reasons to go to cloud
      Bursty load profile
      EC2-based solution is cheaper: 350K EUR vs 720K EUR in-house (not counting power and storage)
      Risk mitigation: no need to purchase a datacenter up front for a 5-year mission, as the probe may be lost any day
    • 12. Strategies for moving HPC to cloud
    • 13. Moving a Grid to a cloud
      WHEN?
      CPU is the limiting factor
      Legacy code or black-box tasks
      Re-architecting is just not feasible or practical
      For dev and test grids
      Grid vendor is already there
      HOW?
      Build your own custom worker machine image
      Keep scheduler and data sources on premises for maximum control and security
      Consider SSH tunneling or VPN for maximum security
      Or use the vendor’s cloud adapters
      DataSynapse Federator
      Sun Grid Engine DRM
      Univa UniCloud (SGE)
      Condor – Cycle Computing CycleCloud
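As one way to picture the SSH tunneling option, the sketch below builds a standard OpenSSH local port-forward command that a cloud worker could use to reach an on-premises scheduler. The host names, port numbers and user name are hypothetical, and the command is only printed, not executed.

```python
def ssh_tunnel_command(gateway, scheduler_host, scheduler_port, local_port, user="grid"):
    """Build an OpenSSH command that forwards a local port on a cloud worker
    to the on-premises scheduler through a gateway host.

    -N  do not run a remote command, only forward ports
    -L  local_port:target_host:target_port (local port forwarding)
    """
    forward = f"{local_port}:{scheduler_host}:{scheduler_port}"
    return ["ssh", "-N", "-L", forward, f"{user}@{gateway}"]


# Hypothetical hosts and ports, for illustration only.
cmd = ssh_tunnel_command("gateway.example.com", "scheduler.internal", 7000, 7000)
print(" ".join(cmd))
```

With the tunnel up, the worker would talk to `localhost:7000` as if the scheduler were local, and the traffic crosses the WAN encrypted.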
    • 14. DATASYNAPSE FEDERATOR
      Policies for starting / stopping cloud based engines
      Secure connections to cloud based engines
      [Diagram labels: DataSynapse Manager, Grid Client, Federator; zones: On Premise, In the Cloud]
    • 15. DATASYNAPSE FEDERATOR
      [Diagram labels: Federator, Activation Policy, Proxy Service (two instances)]
      DataSynapse Engines on EC2
      SSH tunnel to communicate over WAN
      For managing engines
      For engines to access on-premise data
      Proxy is doing basic caching
      DS Engine updates grid libraries on boot
      [Diagram labels: DS Manager, On Premise, On AWS, Secure SSH Tunnel, DS S3, DS Base AMI, Client S3, Custom AMI]
    • 16. Adopting Datagrid CLOUDWARE
      WHEN?
      Data access is the limiting factor
      White box tasks
      Luxury of re-design
      HOW?
      Plenty of powerful clustered middleware:
      Oracle Coherence
      GigaSpaces XAP
      GridGain
      Terracotta
      Design application considering
      Data partitioning
      Compute-data affinity
      In-place data processing
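The three design considerations on this slide (data partitioning, compute-data affinity, in-place data processing) can be sketched with a toy in-memory grid. This is not the API of Coherence, GigaSpaces, GridGain or Terracotta; the node names, the `DataGrid` class and the CRC-based hashing are assumptions made for illustration.

```python
import zlib

NODES = ["node-0", "node-1", "node-2"]  # hypothetical grid members


def owner_of(key):
    """Data partitioning: a stable hash maps each key to exactly one node."""
    return NODES[zlib.crc32(key.encode("utf-8")) % len(NODES)]


class DataGrid:
    """Toy in-memory data grid: one dictionary per node."""

    def __init__(self):
        self.partitions = {node: {} for node in NODES}

    def put(self, key, value):
        # The entry is stored only on the node that owns its key.
        self.partitions[owner_of(key)][key] = value

    def process_in_place(self, func):
        """Compute-data affinity: run func against each node's local
        partition and return only the small per-node results."""
        return {node: func(local) for node, local in self.partitions.items()}


grid = DataGrid()
for i in range(1000):
    grid.put(f"trade-{i}", i)

# Each node sums its own entries; only three numbers travel back.
partial_sums = grid.process_in_place(lambda local: sum(local.values()))
total = sum(partial_sums.values())
print(partial_sums, total)
```

The point of the design is that the function moves to the data: the caller receives one partial result per node instead of pulling every entry across the network.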
    • 17. GIGASPACES XAP
      Full app stack
      General frameworks
      In-memory data grid
      Messaging
      Web container
      Collapsed tiers
      Processing unit as logical unit of scalability
      SLA driven container as physical unit of scalability
      Cloud adapter to provision containers on demand
    • 18. ORACLE COHERENCE
      9/18/09
      Most popular data grid product
      True dynamic scalability
      Shared common virtualized app platform
      In-memory data grid
      In-place data processing
      Explicit locking
      ACID Transactions
      [Diagram layers: Application Tier; Data Services; Oracle Coherence Data Grid; Data Sources (Databases, Web Services, Mainframes)]
    • 19. NATIVE CLOUD HPC
      WHEN?
      Innovative path-finding solutions (speed of innovation)
      True massive scale data processing
      Naturally bursty applications
      Analysis and processing of Big Data
      HOW?
      Amazon Elastic Map/Reduce (Hadoop on the cloud)
      HDFS to store large files
      MapReduce to manage workload
      HBase to manage semi-structured data on top of HDFS
      Hive for batch querying and aggregation with QL queries
      Cloudera
      RightScale RightGrid
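For readers new to the MapReduce model named on this slide, here is a minimal single-process sketch of its three phases. It does not use Hadoop or Elastic MapReduce; the function names and the word-count example are illustrative assumptions, and a real framework would run the map and reduce calls on many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record and collect (key, value) pairs."""
    pairs = []
    for record in records:
        pairs.extend(mapper(record))
    return pairs


def shuffle(pairs):
    """Group all values that share a key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Apply the reducer once per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


def word_mapper(line):
    return [(word.lower(), 1) for word in line.split()]


def count_reducer(key, values):
    return sum(values)


lines = ["cloud hpc", "hpc on the cloud", "the grid"]
counts = reduce_phase(shuffle(map_phase(lines, word_mapper)), count_reducer)
print(counts)
```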
    • 20. 10/20/2009
    • 21. MONTE-carlo @ cloud
    • 22. Analytics Applications
      Analytics applications: analyze data or perform computations based on mathematical models
      Typical usage examples
      Project sales numbers
      Estimate inventory levels
      Evaluate portfolio values
      Value at risk calculations (VAR)
      Project web site traffic
      Information helps in making better decisions
      Identify and mitigate risks
    • 23. Analytics Applications
    • 24. Cloud-Based Solution for NEAR Real-time Analytics
      Pros
      Dynamically scale it up and down based on the size of computation
      Create infrastructure on demand and dispose of it once the computation is done
      Add more machines to bring the compute time close to real-time
      Cons
      Massive data transfer in and out of cloud can be time consuming. Problems that depend on lots of dynamic data may not be suitable
      Shared memory across processors is no longer available, so share-all models are poor candidates
    • 25. Business Drivers
      Major investment bank
      Annuity calculator application
      Monte-Carlo simulation with geometric Brownian motion (GBM)
      Fully parallelizable algorithm
      Customer talks to an agent, and the agent gets back to the customer the next business day
      Currently, a nightly batch job computes the annuity amounts
      Problems with current approach
      System is constrained by the time available for the batch
      Customer satisfaction can be improved if this can be computed on the spot, in near real time
      Adding new resources to system is hard and expensive
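To show why a Monte Carlo simulation over geometric Brownian motion parallelizes so well, the sketch below estimates the mean terminal value of a GBM process in independent chunks, each of which could run on a separate worker. It is not the bank's annuity model (which the deck says was written in C++); the parameters, the chunking scheme and the per-chunk seeding are assumptions.

```python
import math
import random


def simulate_chunk(seed, draws, s0, mu, sigma, years):
    """Sum of `draws` terminal GBM values, using its own RNG stream.

    Terminal value: S_T = S_0 * exp((mu - sigma^2 / 2) * T + sigma * sqrt(T) * Z)
    with Z drawn from a standard normal distribution.
    """
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * years
    vol = sigma * math.sqrt(years)
    total = 0.0
    for _ in range(draws):
        total += s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
    return total


def estimate_mean(chunks, draws_per_chunk, **params):
    # Chunks share no state, so this loop could be farmed out to
    # separate workers and only the partial sums sent back.
    totals = [simulate_chunk(seed, draws_per_chunk, **params) for seed in range(chunks)]
    return sum(totals) / (chunks * draws_per_chunk)


# Hypothetical parameters, chosen only for illustration.
estimate = estimate_mean(chunks=8, draws_per_chunk=25_000,
                         s0=100.0, mu=0.05, sigma=0.2, years=10.0)
exact = 100.0 * math.exp(0.05 * 10.0)  # known mean of GBM: S_0 * exp(mu * T)
print(f"Monte Carlo estimate: {estimate:.2f}, closed form: {exact:.2f}")
```

Because the only result each chunk returns is a single partial sum, adding workers shortens wall-clock time almost linearly, which is what makes near-real-time answers plausible.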
    • 26. Requirements and Solution
      Business requirements
      Ability to quickly launch and shut down the application on demand
      Ability to scale up or down based on the size of the problem
      Complete the simulation in near real-time
      Model functionality should be reusable
      Security
      Re-use existing Monte Carlo models (written in C++)
      Solution
      Amazon Web Services
      GridGain cloudware
    • 27. GridGain cloudware
    • 28. Case Study: Solution Architecture
    • 29. Case Study: Highlights
      Monte Carlo simulation service that can be launched at the click of a button
      Simulation cluster up and serving in less than 4 minutes
      Scale up the cluster in under 2 minutes
      Simulation cluster can be dismissed at the click of a button
      ~1M draws in the MC simulation yield accurate results in near real time
      SOA architecture: the simulation is a web service that can be consumed by any client
      Dynamically loads the application code and reference data from S3 (storage cloud) and configures the application on boot
    • 30. BATCH analytics @ cloud
    • 31. WHY Batch Processing @ Cloud?
      Traditional batch processing limitations
      Limited by number of server resources
      Low utilization
      No way to process burst workload
      HW failure reduces capacity
      Cloud way
      Unlimited server resources
      100% utilization
      Opportunity to scale with load
      Opportunity to automatically restore capacity on failure
      Do it as quickly as you need
      Neutral cost equation: 1000 servers @ 1 hour = 10 servers @ 100 hours
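The neutral cost equation on this slide is plain server-hour arithmetic. The sketch below checks it under two assumptions: billing is per server-hour, and the job splits evenly across servers. The hourly rate is a made-up figure.

```python
def batch_cost(servers, hours, rate_per_server_hour):
    """Cost of a batch run billed per server-hour."""
    return servers * hours * rate_per_server_hour


RATE = 0.10  # hypothetical price per server-hour

wide = batch_cost(servers=1000, hours=1, rate_per_server_hour=RATE)
narrow = batch_cost(servers=10, hours=100, rate_per_server_hour=RATE)

# Both runs consume 1000 server-hours, so they cost the same,
# but the wide run delivers its result 100 times sooner.
print(wide, narrow, wide == narrow)
```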
    • 32. EXAMPLE: Log Processing @ Cloud
      Problem:
      Processing of traffic usage in large enterprise
      NetFlow logs gathered, stored and processed for reports to business
      Various analytics, like biggest traffic offender within enterprise
      Solution:
      Terracotta cloudware for cluster management, job distribution and results gathering
      Logs are served by scalable nginx web server
      Automated provisioning and dynamic scalability
      Deployed on top of Amazon EC2
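The "biggest traffic offender" report can be sketched as a per-chunk aggregation followed by a merge, which is the shape of work a master would hand out to workers. The record layout below (`source,destination,bytes` as comma-separated text) is a simplified assumption, not the real NetFlow export format, and the sample addresses are invented.

```python
from collections import Counter


def bytes_by_source(lines):
    """Worker step: total bytes per source address for one chunk of log lines."""
    totals = Counter()
    for line in lines:
        src, _dst, nbytes = line.strip().split(",")
        totals[src] += int(nbytes)
    return totals


def merge(partials):
    """Master step: combine the per-chunk totals returned by the workers."""
    combined = Counter()
    for partial in partials:
        combined.update(partial)  # Counter.update adds counts together
    return combined


# Two hypothetical log chunks, as if served to two different workers.
chunk_a = ["10.0.0.1,10.0.0.9,500", "10.0.0.2,10.0.0.9,1200"]
chunk_b = ["10.0.0.1,10.0.0.8,900", "10.0.0.3,10.0.0.9,100"]

totals = merge([bytes_by_source(chunk_a), bytes_by_source(chunk_b)])
top_offender, top_bytes = totals.most_common(1)[0]
print(top_offender, top_bytes)
```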
    • 33. Batch Processing Architecture
      [Diagram labels: Frontend, Job request, Job result, Batch processing cluster (Master, Worker Servers Array), Data source, Scale up request, Provisioning Service, Cloud API, New Server]
    • 34. Terracotta Cloudware
      [Diagram labels: Scale-out; three App Servers, each with Web App, Business Logic, Frameworks and JVM; Terracotta Server]
      Clustering the JVM
      Cluster JVM, not application
      Transparent clustering
      Network attached memory
      Separation of application from infrastructure
      No new API
      Java is the API
      Java memory model
      Java concurrency
    • 35. Terracotta Master-Worker Architecture
      [Diagram labels: Master JVM, Worker JVM (two instances), Heap, TC driver (three instances), TC server, TC communication layer]
    • 36. Scheduler Batch Processing @ Cloud
      Sun Grid Engine + AWS
      When tasks are highly heterogeneous
      For cloud bursting
      Advanced resource management capabilities
      Self-contained AMI to boot and self-organize SGE cluster
      SDM + EC2 adapters to grow and shrink cluster depending on working queue
      Univa UD
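Growing and shrinking a cluster "depending on working queue" comes down to a scaling policy. The sketch below is a generic policy, not the SDM or EC2 adapter logic; the function names, the jobs-per-worker ratio and the bounds are assumptions.

```python
import math


def target_workers(queued_jobs, jobs_per_worker, min_workers=1, max_workers=100):
    """Number of workers the queue depth calls for, clamped to the allowed range."""
    wanted = math.ceil(queued_jobs / jobs_per_worker)
    return max(min_workers, min(max_workers, wanted))


def scaling_action(current_workers, queued_jobs, jobs_per_worker):
    """Decide whether to grow, shrink or hold, and by how many workers."""
    delta = target_workers(queued_jobs, jobs_per_worker) - current_workers
    if delta > 0:
        return ("grow", delta)
    if delta < 0:
        return ("shrink", -delta)
    return ("hold", 0)


print(scaling_action(current_workers=4, queued_jobs=95, jobs_per_worker=10))
print(scaling_action(current_workers=10, queued_jobs=0, jobs_per_worker=10))
```

A production policy would also need to damp oscillation, for example by waiting for a cool-down period before shrinking; that is left out here for brevity.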
    • 37. EXAMPLE: DNA Sequencer
      Problem: DNA Sequencer tool
      produces TBs of raw data in one experiment
      Processed by in-house SGE cluster
      refined to GBs after processing
      Storage is cheap, but redundant geo-distributed storage is not cheap
      Frequent need to re-run processing of old experiments, ad-hoc
      Hard to allocate resources for ad-hoc runs, raw data may become unavailable
      Solution: SGE+AWS
      Raw data from tool is FedExed to Amazon and uploaded to S3
      Run ad-hoc SGE cluster in the cloud to re-process (same codebase as in-house)
      SGE workers process data from and store results to S3
      Consume refined results: either download directly, or FedEx back to labs
    • 38. RightGrid: Cloud Way for batch processing
      Easy way to utilize all power of cloud computing
      Dynamic SLA-based scaling of worker machines
      True scalable storage
      True scalable messaging
      RightGrid offers lightweight yet powerful framework:
      EC2 as worker pool, S3 as mediated storage, SQS as messaging
      Ruby-based framework for JobProducer, JobConsumer, message codec, etc…
      Designed to wrap and run arbitrary code on worker nodes
      Transient and persistent worker execution model
      Failover, error reporting and audit
      Custom scaling policies
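The producer/consumer pattern described on this slide (a worker pool, mediated storage, a message queue) can be sketched in a single process. This is not RightGrid's Ruby API: a dictionary stands in for S3, `queue.Queue` stands in for SQS, and the `produce` and `consume` names and the message layout are assumptions.

```python
import json
import queue

storage = {}               # stand-in for S3: key -> payload
job_queue = queue.Queue()  # stand-in for an SQS queue


def produce(job_id, payload):
    """Job producer: store the input, then enqueue a small message pointing at it."""
    input_key = f"input/{job_id}"
    storage[input_key] = payload
    job_queue.put(json.dumps({"job_id": job_id, "input_key": input_key}))


def consume(work):
    """Job consumer: drain the queue, run `work` on each input, store the result."""
    while True:
        try:
            message = json.loads(job_queue.get_nowait())
        except queue.Empty:
            return
        result = work(storage[message["input_key"]])
        storage[f"output/{message['job_id']}"] = result


for n, text in enumerate(["alpha", "beta", "gamma"]):
    produce(n, text)

consume(str.upper)  # the wrapped "arbitrary code" here is just upper-casing
print(storage["output/0"], storage["output/1"], storage["output/2"])
```

Keeping payloads in storage and only small pointers in the queue is what lets the worker pool scale independently of the data size.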
    • 39. RightGrid Architecture
    • 40. EXAMPLE: Document Converting
      Problem:
      Publishing house needs to convert its documents repository to standard format for later indexing
      All kinds of document formats to be rendered as PDF documents
      Once-in-a-blue-moon job
      Solution
      Use Amazon EC2 and RightScale’s RightGrid framework
      Document storage FedExed to Amazon, uploaded to S3
      Documents converted by application built on top of RightGrid framework
      Converted documents stored on S3
      Resulting document pack is FedExed from Amazon to customer
    • 41. EXCEL ANALYTICS @ cloud
    • 42. WHY EXCEL @ Cloud?
      Ubiquitous
      Financial analysts think in Excel
      Excel + VBA is the financial analyst's current IDE
      For many financial institutions, Excel is a main data analysis tool
      Used by analysts and engineers
      Limited Programming Model
      Single-threaded, memory-limited, not very performant
      Need to Run Large Excel Workloads
      Parallelization of workload and data is the only way out
      On-demand infrastructure to run parallel Excel
    • 43. MOVING EXCEL TO CLOUD
      Calculation Flow
      DAG of calculation units (Macro, UDF, Workbook recalc)
      Representable as “DAG table” or task dependency table
      Data flow
      Workbook as a system of record and data synchronization point
      Moving around workbooks is costly – moving data deltas is essential
      Template regions are used to capture input and output parameters
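A task dependency table of the kind mentioned on this slide can be turned into batches of tasks that are safe to run in parallel. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the task names and the table contents are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical "DAG table": each task maps to the tasks it depends on.
dag_table = {
    "load_inputs": set(),
    "price_udf": {"load_inputs"},
    "risk_macro": {"load_inputs"},
    "recalc_summary": {"price_udf", "risk_macro"},
}


def execution_waves(table):
    """Group tasks into waves; every task in a wave can run in parallel
    because all of its dependencies finished in earlier waves."""
    sorter = TopologicalSorter(table)
    sorter.prepare()  # raises CycleError if the table contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dag_table)
for number, tasks in enumerate(waves, start=1):
    print(f"wave {number}: {tasks}")
```

Each wave corresponds to a set of calculation units that a scheduler could submit to compute nodes at the same time.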
    • 44. MOVING EXCEL TO CLOUD: DEPLOYMENT
      [Deployment diagram. Zones: Customer Premises, Cloud (Private or Public), connected by Private Link or Internet. Components: User PCs (MS Windows & Excel), Web Server, Scheduler, Staging Server, HTTP or FTP Server (only for public clouds), Compute Nodes (MS Windows & Excel). Steps: 1. Submit Job; 2. Stage Workbook In; 3. Submit Tasks; 4. Stage Result Out]
    • 45. FUTURE of CLOUD HPC
      Specialized IaaS and PaaS offerings for HPC
      Bare metal with provisioning on demand
      Integrated HPC engines
      Math services
      Domain specific reference data services
    • 46. Thank You!
      Victoria Livschitz
      CEO, Grid Dynamics