1. Office of Instructional and
Research Technology
Very large computing and the real
world
a very few thoughts
Eric Marshall
Associate Director for Research Technology
Rutgers University
The real world
• Bugs, warts, and the eternal problem of hindsight
The problem of architecture
• Build as you go vs. predicting the future
Where do you put it, and for how long?
• The problem of a 2x footprint in the land of 24x7
Who is the expert?
• Is the architect, programmer, scientist, owner, vendor
or bottle washer the expert? Complex problems are hard.
“Anyone who understands the system isn’t doing
science!”
• The problem of users
Supercomputers are disposable
• 3 to 5 year ‘shelf life’
“This system sucks, the last one was better!”
(no matter how many systems)
• The problem of transition: porting, change and habits
Goldilocks paradox
• The problem of useful use: efficient programming, useful scaling,
overhead, keeping track of results, allocation, etc.
Goldilocks paradox (cont’d)
• Someone will always say the solution is just around the corner!
Scaling is deadly
• Scaling problems: OS/SAN/code/people/etc.
GFDL HPCS, July 2005
Large Scale Cluster (LSC)
• SGI Origin 3800 + 3900, 600 MHz
• 2 nodes x 512 PE + 512 GB + 2.9 TB disk
• 5 nodes x 256 PE + 256 GB + 0.9 TB disk
• 1 node x 128 PE + 128 GB + 0.9 TB disk
• SAN bandwidth: 2 GB/s per LSC node
• CXFS, PCP, Workshop Pro, GridEngine, S-Plus, TotalView, Matlab, NAG SMP, Mathematica
Analysis Cluster (ANC)
• SGI Origin 3900, 600 MHz, 2 nodes x 96 PE + 96 GB + 4.2 TB disk
• SAN bandwidth: 2 GB/s per ANC node
• GridEngine, CXFS, PCP, Workshop Pro
CCCI Cluster (IC)
• SGI Altix 3700, 1.5 GHz
• 2 nodes x 256 PE + 512 GB + 2 TB disk
• 1 node x 96 PE + 192 GB + 3 TB disk
• SAN bandwidth: 2 GigE per node, NFS mounted
• PCP, Workshop Pro, GridEngine, TotalView, NAG
MetaData Server (MDS)
• HFS & HSMS server
• SGI Origin 3800, 600 MHz, 2 nodes x 64 PE + 64 GB
• Disk SAN: 4 GB/s per MDS node
• Tape SAN: 1 GB/s per MDS node
• 2.8 TB disk, Failsafe, DMF, CXFS
Disk SAN
• 23.6 TB SAN disk
• TP9100B
• 5+P+HS RAID5 with dual controllers
• 2 Gbit/s Fibre
Tape SAN
• 4 x STK 9310 tape libraries
• 24 x 9940B drives (200 GB, 30 MB/s)
• 22 x 9840A drives (20 GB, 10 MB/s)
• 3.5 PB tape storage on-line
• 1.5 PB off-line
SAN (FC) Switch
• Brocade 2800 & 3800
• Redundant access, dual-ported Fibre Channel
LAN
• Cisco Catalyst 6509
• 4 x 16 GbE
• 2 x 48 Fast Ethernet
Onyx 3 - Infinite Reality 3
Computational Capability & Capacity
• 89 coupled climate model years per computational day
• 1 deg. ocean model
• 2 deg. atmospheric
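To give a feel for the scale of the GFDL HPCS figures listed above, the short Python sketch below adds up the processing elements (PEs) per cluster, the combined nominal tape drive rate, and the wall-clock days a run would take at the quoted 89 model years per day. The variable and function names are illustrative only. The tape total assumes every drive streams at its nominal rate at once, and the run-time estimate assumes the quoted throughput holds for the whole run; neither assumption comes from the slide.

```python
# (node count, PEs per node) copied from the LSC, ANC, MDS and IC entries.
clusters = {
    "LSC": [(2, 512), (5, 256), (1, 128)],
    "ANC": [(2, 96)],
    "MDS": [(2, 64)],
    "IC": [(2, 256), (1, 96)],
}


def total_pes(groups):
    """Sum nodes * PEs-per-node over all node groups of one cluster."""
    return sum(nodes * pes for nodes, pes in groups)


pe_totals = {name: total_pes(groups) for name, groups in clusters.items()}

# (drive count, nominal MB/s per drive) from the tape SAN entry.
tape_drives = {"9940B": (24, 30), "9840A": (22, 10)}

# Assumption: all drives stream at their nominal rate simultaneously.
tape_mb_per_s = sum(count * rate for count, rate in tape_drives.values())

MODEL_YEARS_PER_DAY = 89  # quoted coupled climate model throughput


def days_for(model_years):
    """Wall-clock days for a run, assuming the quoted rate holds throughout."""
    return model_years / MODEL_YEARS_PER_DAY


for name, total in pe_totals.items():
    print(f"{name}: {total} PEs")
print(f"Nominal aggregate tape rate: {tape_mb_per_s} MB/s")
print(f"100 model years: about {days_for(100):.2f} computational days")
```

Even this small tally shows how many separate components have to scale together: thousands of PEs across four systems, dozens of tape drives, and a throughput figure that only holds when all of them keep up.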
Questions?
Eric Marshall
Office of Instructional and Research Technology
eric.marshall@rutgers.edu
732 445-2262
Editor's Notes
Note title change
Personal intro
Humans’ ability to plan and abstract is powerful and useful, however…
Side effects happen – Aswan High Dam, Egypt -> fish populations and salinity of the Mediterranean Sea
First artificial heart lasted 50 minutes! (http://en.wikipedia.org/wiki/Artificial_heart)
Heidemarie Stefanyshyn-Piper’s tool box (http://www.google.com/hostednews/ap/article/ALeqM5h1W8dcUP9H70AmlSfDSenPteDT9gD94HJO401) Nov. 17th 2008
Solvable for repeatable tasks, not so much for the bleeding edge
Systems are complex enough that computer scientists, IT/sys admins, and domain scientists are forced into each other's domains. Most have no wish to do this! The result is ugly and wasteful.
ENIAC
My pocket has more computing power than the entire Allied forces of the Second World War. Yet supercomputers are not built to be replaced.
Computers come and go – CODE is FOREVER! Also the user experience.
Engineering moves ahead, not always in sync with the users' needs.
Changing systems is a pain – bigger systems = bigger problems. Does not help the user experience.
Humans are compelled to try big tasks
Staff does not scale.