Rev Up Your HPC Engine
Fritz Ferstl, CTO Univa Corp, fferstl@univa.com

In this slidecast, Fritz Ferstl from Univa presents "Rev Up Your HPC Engine." The presentation explores the challenges that workload management systems face in today's data centers, where core counts keep increasing.

See the presentation video and the full transcript: http://wp.me/p3RLHQ-cjs

Who is Univa?
• Profile
  • Based in Chicago, global reach
  • >500 customers in 3 yrs (mostly Fortune 500)
• Products / Technologies:
  • Univa Grid Engine
  • UniSight
  • Univa License Orchestrator
  • UniCloud
Data Center Automation Experts
Do more with less in Big Compute and Big Data
Help organizations play a better game of Tetris
Copyright © 2014 Univa Corporation. All Rights Reserved.

Challenges for Workload and Resource Management Systems

Scalability
• Node counts stay flat or go down, socket counts stay flat, core counts explode
• With the core explosion, the number of jobs also explodes
• Ever-shorter run times, more applications, more use cases
• Large commercial sites approach or go beyond 100K
• Throughput clusters process >150 million jobs / month
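
To put the throughput figure on the Scalability slide in perspective, a back-of-the-envelope calculation turns 150 million jobs per month into a sustained dispatch rate. This sketch assumes a 30-day month and jobs spread evenly around the clock. Real workloads are burstier, so peak rates will be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def sustained_rate(jobs_per_month: float, days_per_month: int = 30) -> float:
    """Average jobs per second, assuming arrivals spread evenly over the month."""
    return jobs_per_month / (days_per_month * SECONDS_PER_DAY)

rate = sustained_rate(150_000_000)
print(f"{rate:.1f} jobs/s, i.e. one job every {1000 / rate:.1f} ms")
# 57.9 jobs/s, i.e. one job every 17.3 ms
```

Even this idealized average means the scheduler must dispatch a job roughly every 17 ms, day and night. That is why raw scheduler throughput becomes a design concern at this scale.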

Heterogeneity
• Hardware
  • Multi-sockets, multi-cores
  • Partial cluster upgrades
  • Evolving memory, network and storage architectures
  • Accelerators: GPUs, Phi
• Job Profiles
  • Throughput
  • Array Jobs
  • Large Parallel
  • Interactive
  • Sessions
  • Reservations
  • Transactional
  • Hybrid
  • Dependencies, Workflows
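
With heterogeneous hardware, the scheduler must match each job's resource request against what each host can currently offer. The sketch below shows only this filtering step. It uses hypothetical `Host` and `Request` types and made-up host names. Production workload managers express the same idea through their own resource configuration rather than code like this.

```python
from dataclasses import dataclass

@dataclass
class Host:
    name: str
    free_cores: int
    free_mem_gb: float
    free_gpus: int = 0  # accelerators currently unallocated on this host

@dataclass
class Request:
    cores: int
    mem_gb: float
    gpus: int = 0

def eligible_hosts(req: Request, hosts: list[Host]) -> list[str]:
    """Return names of hosts that can satisfy every part of the request."""
    return [
        h.name
        for h in hosts
        if h.free_cores >= req.cores
        and h.free_mem_gb >= req.mem_gb
        and h.free_gpus >= req.gpus
    ]

# Hypothetical mixed cluster after a partial upgrade.
hosts = [
    Host("old-node", free_cores=8, free_mem_gb=32),
    Host("new-node", free_cores=32, free_mem_gb=256),
    Host("gpu-node", free_cores=16, free_mem_gb=128, free_gpus=4),
]
print(eligible_hosts(Request(cores=12, mem_gb=64), hosts))         # ['new-node', 'gpu-node']
print(eligible_hosts(Request(cores=4, mem_gb=16, gpus=1), hosts))  # ['gpu-node']
```

A real scheduler would then rank the eligible hosts. One common refinement is to prefer hosts without accelerators for CPU-only jobs, so that GPU nodes stay available for the jobs that need them.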

Policy Variety
• Automated → Transparency?
• Manual overrides
• Preferential access
• Priorities
• Reservations
• Resource Urgencies
• Quotas
• Deadlines
• Conflict Resolution
  • E.g. don't starve large parallel jobs while maintaining high utilization
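
A widely used way to resolve the conflict named on the Policy Variety slide (don't starve large parallel jobs, yet keep utilization high) is a reservation combined with backfilling, as in EASY backfilling. The job at the head of the queue gets a reserved start time. Smaller jobs may jump ahead only if they finish before that reservation or use cores the reserved job will not need. The simplified sketch below assumes user-supplied runtime estimates, a single pool of identical cores, and only one reservation. It illustrates the general technique and is not Univa Grid Engine's actual scheduling algorithm.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    cores: int
    runtime: int  # user-estimated runtime, e.g. in minutes

def schedule_step(now, free_cores, running, queue):
    """One scheduling pass with a single reservation (EASY-style backfilling).

    running: list of (end_time, cores) for jobs already executing
    queue:   pending Job objects, highest priority first
    Returns the names of the jobs started at time `now`.
    """
    started = []
    running = list(running)
    queue = list(queue)

    # 1. Start jobs strictly in priority order while they fit.
    while queue and queue[0].cores <= free_cores:
        job = queue.pop(0)
        free_cores -= job.cores
        running.append((now + job.runtime, job.cores))
        started.append(job.name)
    if not queue:
        return started

    # 2. The head job does not fit: reserve the earliest time ("shadow time")
    #    at which enough running jobs will have finished to free its cores.
    head = queue[0]
    available = free_cores
    shadow = None
    for end_time, cores in sorted(running):
        available += cores
        if available >= head.cores:
            shadow = end_time
            break
    if shadow is None:
        return started  # head job can never fit; a real system would reject it
    spare_at_shadow = available - head.cores  # cores the head job won't need

    # 3. Backfill: a later job may start now if it fits in the free cores and
    #    either ends before the reservation or only uses cores the head job
    #    will not need, so the head job's start is never delayed.
    for job in queue[1:]:
        if job.cores > free_cores:
            continue
        if now + job.runtime <= shadow:
            free_cores -= job.cores
            started.append(job.name)
        elif job.cores <= spare_at_shadow:
            free_cores -= job.cores
            spare_at_shadow -= job.cores
            started.append(job.name)
    return started

# 10-core cluster: a 6-core job runs until t=30, leaving 4 cores free.
queue = [Job("big", 8, 60), Job("long", 2, 100), Job("short", 2, 20)]
print(schedule_step(0, 4, [(30, 6)], queue))  # ['long', 'short']
```

In the example, "big" is guaranteed to start at t=30. "short" finishes before then, and "long" fits into the 2 cores that "big" will not use, so both fill otherwise idle cores without starving the large job.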

Use Case Variety
• Classical HPC (simulation) → Large parallel / many mid-size parallel
• Verification / Test → Throughput
• From single simulation to parameter study → array jobs
• Ultra-short jobs
• Big Data / Data Mining
• Exclusive usage of nodes vs. shared usage
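
The step from a single simulation to a parameter study is where array jobs pay off. One submission creates many tasks, and each task picks its own parameters from its task index. In Grid Engine, an array job is submitted with `qsub -t 1-N`, and each task sees its index in the `SGE_TASK_ID` environment variable. The sketch below is a hypothetical task script that maps that index onto a small parameter grid. The parameter names and values are invented for illustration.

```python
import itertools
import os

# Hypothetical parameter grid: 3 x 4 = 12 combinations,
# so the array job would be submitted with task range 1-12.
TEMPERATURES = [300, 350, 400]
PRESSURES = [1.0, 2.0, 5.0, 10.0]
GRID = list(itertools.product(TEMPERATURES, PRESSURES))

def params_for_task(task_id: int) -> tuple[int, float]:
    """Map a 1-based array task index to one parameter combination."""
    if not 1 <= task_id <= len(GRID):
        raise ValueError(f"task id {task_id} outside 1..{len(GRID)}")
    return GRID[task_id - 1]

if __name__ == "__main__":
    # Grid Engine sets SGE_TASK_ID for array tasks; fall back to 1 when the
    # variable is missing or non-numeric (e.g. when testing outside an array job).
    raw = os.environ.get("SGE_TASK_ID", "1")
    task_id = int(raw) if raw.isdigit() else 1
    temperature, pressure = params_for_task(task_id)
    print(f"task {task_id}: T={temperature} K, p={pressure} bar")
```

Submitting this as one 12-task array job (`qsub -t 1-12`, via whatever job script your site uses) replaces twelve separate submissions. The whole parameter study then becomes a single unit for the scheduler to track.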

Geographical Distribution / Clouds
• Resource sharing: servers, licenses, data, other
• Data access latencies
• Security
• File system dependencies
• Pre-/Post-Staging
• Data locality:
  • Bring the job to the data
  • Or bring the data to the job
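
The choice between bringing the job to the data and bringing the data to the job can be framed as a comparison of estimated completion times. The sketch below uses a deliberately crude, hypothetical cost model: transfer time (data size / bandwidth), plus queue wait, plus runtime at each site. Real placement decisions also weigh the security, licensing and file-system constraints listed on the slide.

```python
from dataclasses import dataclass

@dataclass
class Site:
    name: str
    has_data: bool          # input data already resident at this site?
    bandwidth_mb_s: float   # stage-in bandwidth to this site, MB/s
    queue_wait_s: float     # expected wait before the job starts
    runtime_s: float        # expected runtime on this site's hardware

def completion_time(site: Site, data_mb: float) -> float:
    """Estimated seconds until results are ready if the job runs at `site`.

    Pessimistically assumes stage-in cannot overlap the queue wait.
    """
    staging_s = 0.0 if site.has_data else data_mb / site.bandwidth_mb_s
    return staging_s + site.queue_wait_s + site.runtime_s

def best_site(sites: list[Site], data_mb: float) -> str:
    """Name of the site with the earliest estimated completion."""
    return min(sites, key=lambda s: completion_time(s, data_mb)).name

sites = [
    # Data lives on-premises, but the local cluster is busy.
    Site("on-prem", has_data=True, bandwidth_mb_s=0.0, queue_wait_s=7200, runtime_s=3600),
    # Idle cloud capacity, but the data must be shipped in first.
    Site("cloud", has_data=False, bandwidth_mb_s=100.0, queue_wait_s=60, runtime_s=3600),
]
print(best_site(sites, data_mb=500_000))    # cloud
print(best_site(sites, data_mb=2_000_000))  # on-prem
```

With 500 GB of input, shipping the data to idle cloud capacity wins (8,660 s vs. 10,800 s). With 2 TB, the transfer alone takes 20,000 s, so waiting for the busy local cluster is faster.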

Solutions, Approaches, Best Practices

Evolve
• Architecture Evolution
  • More cores / nodes / jobs → make it faster
  • Integration with GPUs, Phi, etc.
• New Scheduling Algorithms
  • Efficient handling of job mixes: parallel / array / sequential jobs
  • Scheduling of ultra-short jobs
• More Monitoring, Better Error Tracking
• Reporting, Accounting & Analytics

Be Street-Smart
• Simplify where possible!
• A be-all solution can be the most expensive
  • Effort
  • Poor utilization → slow ROI
• Focus on the most important goals

Think Different
• Examples:
  • Less HA, more throughput: a fast SSD RAID with regular back-ups
  • Use array jobs wherever possible
  • More, smaller jobs vs. fewer, bigger jobs
  • All things considered, preemption may be a good option
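
Preemption trades some wasted work for responsiveness. When an urgent job cannot start, the scheduler suspends or requeues lower-priority running jobs to free resources. The sketch below shows one simple victim-selection policy, using assumed priorities and core counts. It frees cores from the lowest-priority jobs first and, among equal priorities, from the most recently started jobs, since they have lost the least progress. This is a generic illustration, not a specific product's preemption policy.

```python
from dataclasses import dataclass

@dataclass
class RunningJob:
    name: str
    priority: int      # higher value = more important
    cores: int
    start_time: float  # larger = started more recently

def pick_victims(running, needed_cores, free_cores, min_priority):
    """Choose running jobs to preempt so a job needing `needed_cores` can start.

    Only jobs with priority strictly below `min_priority` are eligible.
    Greedy: may free more cores than strictly necessary.
    Returns a list of job names, or None if preemption cannot free enough cores.
    """
    shortfall = needed_cores - free_cores
    if shortfall <= 0:
        return []  # job already fits; nothing to preempt
    candidates = sorted(
        (j for j in running if j.priority < min_priority),
        key=lambda j: (j.priority, -j.start_time),  # lowest priority, newest first
    )
    victims = []
    for job in candidates:
        if shortfall <= 0:
            break
        victims.append(job.name)
        shortfall -= job.cores
    return victims if shortfall <= 0 else None

running = [
    RunningJob("A", priority=10, cores=8, start_time=0),
    RunningJob("B", priority=1, cores=4, start_time=100),
    RunningJob("C", priority=1, cores=4, start_time=50),
    RunningJob("D", priority=5, cores=4, start_time=10),
]
# An urgent priority-10 job needs 8 cores on a full cluster.
print(pick_victims(running, needed_cores=8, free_cores=0, min_priority=10))  # ['B', 'C']
```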

Accept Difference
• Simple: temporarily designate parts of the cluster
• Advanced: Cloud-share
  • Share resources across separate workload management system instances
  • Dynamically re-assign resources (servers) based on demand
  • Provides autonomy while maintaining high utilization
• But avoid meta-scheduling where you can!
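
Cloud-share re-assigns servers between separate workload management instances based on demand. A simple, hypothetical allocation rule illustrates the idea. Every cluster keeps a guaranteed minimum, which preserves its autonomy. Each remaining server then goes to whichever cluster currently has the largest unmet demand. The rule, cluster names and numbers below are assumptions chosen for illustration only.

```python
def allocate(total_servers, minimums, demand):
    """Split a shared server pool among separate cluster instances.

    minimums: servers guaranteed to each cluster, e.g. {"cfd": 20}
    demand:   extra servers each cluster could use right now
    Each spare server goes to the cluster with the largest unmet demand;
    servers nobody needs are left unassigned.
    """
    spare = total_servers - sum(minimums.values())
    if spare < 0:
        raise ValueError("guaranteed minimums exceed the server pool")
    alloc = dict(minimums)
    unmet = {name: demand.get(name, 0) for name in minimums}
    for _ in range(spare):
        # Largest unmet demand wins; ties go to the alphabetically first name.
        name = min(unmet, key=lambda n: (-unmet[n], n))
        if unmet[name] == 0:
            break  # no cluster needs more servers
        alloc[name] += 1
        unmet[name] -= 1
    return alloc

minimums = {"cfd": 20, "genomics": 20, "eda": 20}
print(allocate(100, minimums, {"cfd": 30, "genomics": 20, "eda": 0}))
# {'cfd': 45, 'genomics': 35, 'eda': 20}
```

This approach moves whole servers between instances rather than routing individual jobs across them. That is in the spirit of the slide's advice to avoid meta-scheduling where possible.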

Tailored Solutions
• Tailoring & add-ons can make all the difference
• Tailoring such as
  • Job Classes
  • Customized reports
• Add-ons such as
  • Submission portals and wrappers
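
Job classes and submission wrappers both package site policy so that users don't have to remember it. The sketch below is a hypothetical wrapper that expands a named class into Grid Engine `qsub` arguments (`-cwd`, `-N`, `-l h_rt=...`, `-pe`, `-t`). The class names, the default values and the parallel environment name `smp` are assumptions that a site would replace with its own configuration. This is a stand-in for the general idea, not Univa Grid Engine's own job class mechanism. It only builds and prints a command line and does not submit anything.

```python
import shlex

# Hypothetical site-defined job classes: defaults for common job types.
JOB_CLASSES = {
    "short": {"h_rt": "00:15:00"},
    "parallel": {"h_rt": "24:00:00", "pe": "smp", "slots": 16},
}

def build_qsub(job_class, script, name=None, tasks=None, **overrides):
    """Return a qsub argument list for `script` using the class defaults.

    `overrides` replace class defaults, e.g. h_rt="08:00:00".
    """
    if job_class not in JOB_CLASSES:
        raise ValueError(f"unknown job class {job_class!r}; choose from {sorted(JOB_CLASSES)}")
    opts = {**JOB_CLASSES[job_class], **overrides}
    args = ["qsub", "-cwd"]                      # run in the submission directory
    if name:
        args += ["-N", name]                     # job name
    args += ["-l", f"h_rt={opts['h_rt']}"]       # hard run-time limit
    if opts.get("pe"):
        args += ["-pe", opts["pe"], str(opts["slots"])]  # parallel environment
    if tasks:
        args += ["-t", f"1-{tasks}"]             # array job with `tasks` tasks
    args.append(script)
    return args

if __name__ == "__main__":
    cmd = build_qsub("parallel", "run_cfd.sh", name="cfd-case", h_rt="08:00:00")
    print(shlex.join(cmd))
    # qsub -cwd -N cfd-case -l h_rt=08:00:00 -pe smp 16 run_cfd.sh
```

A web submission portal could call the same function behind its form, so both entry points apply identical defaults.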

Conclusions
• Workload & Resource Management Systems are needed more than ever
  • Specifically in the “new” era of Cloud and Big Data
  • They let you benefit from 20+ years of experience in HPC workload orchestration, and move beyond it
• Clear-cut set of challenges → non-trivial solutions
• Build on best-in-class products, architectures and development teams
• Being “street-smart” about architecting and configuring a cluster has a big impact

Thank You
http://www.univa.com
fferstl@univa.com