Scaleability
Jim Gray
Gray@Microsoft.com
(with help from Gordon Bell, George Spix, Catharine van Ingen)
Course schedule (Mon Tue Wed Thur Fri; sessions at 9:00, 11:00, 1:30, 3:30, 7:00):
Overview, Faults, Tolerance, T Models, Party, TP mons, Lock Theory, Lock Techniq, Queues, Workflow, Log, ResMgr, CICS & Inet, Adv TM, Cyberbrick, Files & Buffers, COM+, Corba, Replication, Party, B-tree, Access Paths, Groupware, Benchmark
A peta-op business app?
• P&G and friends pay for the web (like they paid
for broadcast television) – no new money, but
given Moore, traditional advertising revenues can
pay for all of our connectivity - voice, video,
data…… (presuming we figure out how to & allow
them to brand the experience.)
• Advertisers pay for impressions and ability to
analyze same.
• A terabyte sort a minute – to one a second.
• Bisection bandwidth of ~20 GB/s – to ~200 GB/s.
• Really a tera-op business app (today’s portals)
Scaleability
Scale Up and Scale Out
SMP
Super Server
Departmental Server
Personal System
Grow Up with SMP: 4xP6 is now standard
Grow Out with Cluster: cluster has inexpensive parts
Cluster
of PCs
There'll be Billions Trillions Of Clients
• Every device will be “intelligent”
• Doors, rooms, cars…
• Computing will be ubiquitous
Billions Of Clients
Need Millions Of Servers
Mobile clients
Fixed clients
Server
Super server
Clients
Servers
• All clients networked to servers
• May be nomadic or on-demand
• Fast clients want faster servers
• Servers provide
  – Shared Data
  – Control
  – Coordination
  – Communication
Trillions
Billions
Thesis
Many little beat few big
• Smoking, hairy golf ball
• How to connect the many little parts?
• How to program the many little parts?
• Fault tolerance & Management?
$1 million, $100 K, $10 K
Mainframe, Mini, Micro, Nano
14", 9", 5.25", 3.5", 2.5", 1.8"
1 M SPECmarks, 1 TFLOP
10^6 clocks to bulk ram
Event-horizon on chip
VM reincarnated
Multi-program cache, On-Chip SMP
Pico Processor (1 mm³)
10 pico-second ram
10 nano-second ram
10 microsecond ram
10 millisecond disc
10 second tape archive
(capacities from 1 MB and 100 MB to 10 GB, 1 TB, and 100 TB)
4 B PC’s (1 Bips, .1GB dram, 10 GB disk 1 Gbps Net, B=G)
The Bricks of Cyberspace
• Cost 1,000 $
• Come with
– NT
– DBMS
– High speed Net
– System management
– GUI / OOUI
– Tools
• Compatible with everyone else
• CyberBricks
Computers shrink to a point
• Disks 100x in 10 years
2 TB 3.5” drive
• Shrink to 1” is 200GB
• Disk is super computer!
• This is already true of
printers and “terminals”
Kilo
Mega
Giga
Tera
Peta
Exa
Zetta
Yotta
Super Server: 4T Machine
• Array of 1,000 4B machines
  – 1 b ips processors
  – 1 B B DRAM
  – 10 B B disks
  – 1 Bbps comm lines
  – 1 TB tape robot
• A few megabucks
• Challenge:
  – Manageability
  – Programmability
  – Security
  – Availability
  – Scaleability
  – Affordability
• As easy as a single system
Future servers are CLUSTERS of processors, discs
Distributed database techniques make clusters work
Cyber Brick, a 4B machine: CPU, 5 GB RAM, 50 GB Disc
Cluster Vision
Buying Computers by the Slice
• Rack & Stack
– Mail-order components
– Plug them into the cluster
• Modular growth without limits
– Grow by adding small modules
• Fault tolerance:
– Spare modules mask failures
• Parallel execution & data search
– Use multiple processors and disks
• Clients and servers made from the same stuff
– Inexpensive: built with
commodity CyberBricks
Systems 30 Years Ago
• MegaBuck per Mega Instruction Per Second (mips)
• MegaBuck per MegaByte
• Sys Admin & Data Admin per MegaBuck
Disks of 30 Years Ago
• 10 MB
• Failed every few weeks
1988: IBM DB2 + CICS Mainframe
65 tps
• IBM 4381
• Simulated network of 800 clients
• 2m$ computer
• Staff of 6 to do benchmark
2 x 3725
network controllers
16 GB
disk farm
4 x 8 x .5GB
Refrigerator-sized CPU
1987: Tandem Mini @ 256 tps
• 14 M$ computer (Tandem)
• A dozen people (1.8M$/y)
• False floor, 2 rooms of machines
Simulate 25,600
clients
32 node processor array
40 GB
disk array (80 drives)
OS expert
Network expert
DB expert
Performance
expert
Hardware experts
Admin expert
Auditor
Manager
1997: 9 years later
1 Person and 1 box = 1250 tps
• 1 Breadbox ~ 5x 1987 machine room
• 23 GB is hand-held
• One person does all the work
• Cost/tps is 100,000x less
5 micro dollars per transaction
4x200 Mhz cpu
1/2 GB DRAM
12 x 4GB disk
Hardware expert
OS expert
Net expert
DB expert
App expert
3 x7 x 4GB
disk arrays
mainframe
mini
micro
time
price
What Happened?
Where did the 100,000x come from?
• Moore’s law: 100X (at most)
• Software improvements: 10X (at most)
• Commodity Pricing: 100X (at least)
• Total 100,000X
• 100x from commodity
– (DBMS was 100K$ to start: now 1k$ to start)
– IBM 390 MIPS is 7.5K$ today
– Intel MIPS is 10$ today
– Commodity disk is 50$/GB vs 1,500$/GB
– ...
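A quick check that the factors above multiply out (a sketch in Python; the factors are the ones on this slide):

```python
# Sanity check: the three factors quoted above compound to the ~100,000x total.
moore = 100        # Moore's law: 100X (at most)
software = 10      # software improvements: 10X (at most)
commodity = 100    # commodity pricing: 100X (at least)
print(f"{moore * software * commodity:,}x")   # -> 100,000x
```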
Per square foot:  SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2   PoPC
cpus                 2.1      4.7       7.0        4.7        5.0     13.3
specint             29.0     60.5     132.7       79.3       72.3    253.3
ram (GB)             4.1      4.7       7.0        0.6        5.0      6.8
disks                1.3      0.5       5.2        0.0        2.5     13.3
Standard package, full height, fully populated, 3.5” disks
HP, DELL, Compaq are trading places wrt rack mount lead
PoPC – Celeron NLX shoeboxes – 1000 nodes in 48 (24x2) sq ft.
$650K from Arrow (3-yr warranty!); on-chip, at-speed L2
Web & server farms, server consolidation / sqft
http://www.exodus.com (charges by mbps times sqft)
Application Taxonomy
Technical:
  General purpose, non-parallelizable codes (PCs have it!)
  Vectorizable
  Vectorizable & //able (Supers & small DSMs)
  Hand tuned, one-of
  MPP coarse grain
  MPP embarrassingly // (Clusters of PCs)
Commercial:
  Database
  Database/TP
  Web Host
  Stream Audio/Video
If central control & rich then IBM or large SMPs, else PC Clusters
Peta scale w/ traditional balance
                                 2000                       2010
1 PIPS processors (10^15 ips)    10^6 cpus @ 10^9 ips       10^4 cpus @ 10^11 ips
10 PB of DRAM                    10^8 chips @ 10^7 bytes    10^6 chips @ 10^9 bytes
10 PBps memory bandwidth
1 PBps IO bandwidth              10^8 disks @ 10^7 Bps      10^7 disks @ 10^8 Bps
100 PB of disk storage           10^5 disks @ 10^10 B       10^3 disks @ 10^12 B
10 EB of tape storage            10^7 tapes @ 10^10 B       10^5 tapes @ 10^12 B
10x every 5 years, 100x every 10 (1000x in 20 if SC)
Except --- memory & IO bandwidth
“I think there is a world market for maybe five computers.”
Thomas Watson Senior, Chairman of IBM, 1943
Microsoft.com: ~150x4 nodes: a crowd
[Network diagram: the microsoft.com site. Four FDDI rings (MIS1–MIS4), primary and secondary Gigaswitches, and many routers connect replicated server groups: www.microsoft.com (3–5 nodes per ring), home.microsoft.com (2–5), search.microsoft.com, premium.microsoft.com, register.microsoft.com, register.msn.com, support.microsoft.com, activex.microsoft.com, msid.msn.com, cdm.microsoft.com, FTP and HTTP download servers, live and reporting SQL Servers, IDC and DMZ staging servers, SQL consolidators, the MOSWest admin LAN, and internal WWW, plus European and Japan Data Centers. Internet connectivity: 13 DS3 (45 Mb/sec each), 2 OC3 (100 Mb/sec each), 2 Ethernet (100 Mb/sec each). Typical node: 4xP5 or 4xP6, 256–512 MB RAM (1 GB on SQL consolidators), 12–160 GB disk; average cost $25K–$83K per node; FY98 forecast 2–12 additional nodes per group.]
HotMail (a year ago):
~400 Computers Crowd (now 2x bigger)
[Diagram: Local Directors spread a 200 Mbps Internet link across the server groups, all on switched 10 Mbps local Ethernet via Cisco Catalyst 5000 switches: Front Door web servers (140 P-200s with 128 MB, growing +10/month, FreeBSD/Apache), Graphics servers (15xP6, FreeBSD/Hotmail), Ad servers (10xP6, FreeBSD/Apache), Ad Pacer (3 P6, FreeBSD), Incoming Mail (25xP-200, FreeBSD/hm-SMTP), Security (2xP200, FreeBSD), Member Directory and User Store (Sun E3k, xx MB, 384 GB RAID5 + DLT tape robot, Solaris/HMNNFS; ~50 machines, many old, growing 13 + 1.5/month, about 1 per million users), and M Serv (SPARC Ultra-1, 4 replicas, Solaris). Maintenance interface via Telnet.]
DB Clusters (crowds)
• 16-node Cluster
– 64 cpus
– 2 TB of disk
– Decision support
• 45-node Cluster
– 140 cpus
– 14 GB DRAM
– 4 TB RAID disk
– OLTP (Debit Credit)
• 1 B tpd (14 k tps)
The Microsoft TerraServer Hardware
• Compaq AlphaServer 8400
• 8x400Mhz Alpha cpus
• 10 GB DRAM
• 324 9.2 GB StorageWorks Disks
  – 3 TB raw, 2.4 TB of RAID5
• STK 9710 tape robot (4 TB)
• WindowsNT 4 EE, SQL Server 7.0
TerraServer: Lots of Web Hits
• A billion web hits!
• 1 TB, largest SQL DB on the Web
• 100 Qps average, 1,000 Qps peak
• 877 M SQL queries so far
              Total      Average    Peak
Hits          1,065 m    8.1 m      29 m
Queries       877 m      6.7 m      18 m
Images        742 m      5.6 m      15 m
Page Views    170 m      1.3 m      6.6 m
Sessions      10 m       77 k       125 k
Users         6.4 m      48 k       76 k
[Chart: daily counts (millions) of sessions, hits, page views, DB queries, and images, 6/22/98 to 10/26/98.]
TerraServer Availability
• Operating for 13 months
• Unscheduled outage: 2.9 hrs
• Scheduled outage: 2.0 hrs
Software upgrades
• Availability:
99.93% overall up
• No NT failures (ever)
• One SQL7 Beta2 bug
• One major operator-assisted
outage
Backup / Restore
Configuration:
StorageTek TimberWolf 9710
DEC StorageWorks UltraSCSI Raid-5 Array
Legato Networker PowerEdition 4.4a
Windows NT Server Enterprise Edition 4.0
Performance
Data Bytes Backed Up 1.2 TB
Total Time 7.25 Hours
Number of Tapes Consumed 27 tapes
Total Tape Drives 10 drives
Data ThroughPut 168 GB/Hour
Average ThroughPut Per Device 16.8 GB/Hour
Average Throughput Per Device 4.97 MB/Sec
NTFS Logical Volumes 2
Windows NT Versus UNIX
Best Results on an SMP: semi-log plot of tpmC vs time (Jan-95 to Jan-00) shows a 3x (~2 year) lead by UNIX
Does not show Oracle/Alpha Cluster at 100,000 tpmC
All these numbers are off-scale huge (40,000 active users?)
[Chart: tpmC, 0 to 90,000, for UNIX and NT, Jan-95 to Jan-00.]
[Chart: the same data on a semi-log scale, tpmC 1,000 to 100,000, UNIX and NT, Jan-95 to Jan-00.]
TPC C Improvements (MS SQL)
250%/year on Price, 100%/year performance
bottleneck is 3GB address space
[Charts: $/tpmC vs time ($10 to $1,000, log scale) and tpmC vs time (100 to 100,000, log scale), Jan-94 to Dec-98.]
40% hardware, 100% software, 100% PC Technology
UNIX (dis) Economy Of Scale
Bang for the Buck
[Chart: tpmC/k$ (0 to 50) vs tpmC (0 to 60,000) for Informix, MS SQL Server, Oracle, Sybase.]
Two different pricing regimes
This is late 1998 prices
TPC Price/tpmC
[Bar chart: price per tpmC broken down into processor, disk, software, net, and total/10 for the three systems below.]
Sequent/Oracle: 89 k tpmC @ 170 $/tpmC
Sun/Oracle: 52 k tpmC @ 134 $/tpmC
HP+NT4+MS SQL: 16.2 k tpmC @ 33 $/tpmC
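Multiplying each system's throughput by its price per tpmC recovers the rough total system price (a back-of-the-envelope sketch using only the three figures above):

```python
# Total system price ~ throughput (tpmC) x price per tpmC, for the three systems above.
systems = {
    "Sequent/Oracle": (89_000, 170),   # (tpmC, $/tpmC)
    "Sun/Oracle":     (52_000, 134),
    "HP+NT4+MS SQL":  (16_200, 33),
}
for name, (tpmc, price_per_tpmc) in systems.items():
    print(f"{name:14s} ~ ${tpmc * price_per_tpmc / 1e6:.1f}M system price")
# Sequent ~ $15.1M, Sun ~ $7.0M, HP/NT ~ $0.5M: same benchmark, ~30x price spread.
```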
Storage Latency: How far away is the data? (in clock ticks)
Registers               1        My Head (1 min)
On Chip Cache           2        This Room (10 min)
On Board Cache          10       This Resort
Memory                  100      Los Angeles (1.5 hr)
Disk                    10^6     Pluto (2 Years)
Tape /Optical Robot     10^9     Andromeda (2,000 Years)
Thesis: Performance =Storage Accesses
not Instructions Executed
• In the “old days” we counted instructions and IO’s
• Now we count memory references
• Processors wait most of the time
Where the time goes: clock ticks used by AlphaSort components
[Pie chart: Sort, Disc Wait, OS, Memory Wait, D-Cache Miss, I-Cache Miss, B-Cache Data Miss.]
Storage Hierarchy (10 levels)
Registers, Cache L1, L2
Main (1, 2, 3 if nUMA).
Disk (1 (cached), 2)
Tape (1 (mounted), 2)
Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs
[Charts: Size vs Speed, typical system size (10^3 to 10^15 bytes) against access time (10^-9 to 10^3 seconds); Price vs Speed, $/MB (10^-4 to 10^4) against access time. Levels: Cache, Main, Secondary (Disc), Online Tape, Nearline Tape, Offline Tape.]
Meta-Message:
Technology Ratios Are Important
• If everything gets faster & cheaper at
the same rate
THEN nothing really changes.
• Things getting MUCH BETTER:
– communication speed & cost 1,000x
– processor speed & cost 100x
– storage size & cost 100x
• Things staying about the same
– speed of light (more or less constant)
– people (10x more expensive)
– storage speed (only 10x better)
Storage Ratios Changed
• 10x better access time
• 10x more bandwidth
• 4,000x lower media price
• DRAM/DISK 100:1 to 10:1 to 50:1
Disk Performance vs Time
[Chart: access time (ms) and bandwidth (MB/s), each 1 to 100, 1980 to 2000.]
Disk Performance vs Time (accesses/second & capacity)
[Chart: accesses per second (1 to 100) and disk capacity (0.1 to 10 GB), 1980 to 2000.]
Storage Price vs Time
[Chart: $/MB, 0.01 to 10,000, 1980 to 2000.]
The Pico Processor (1 mm³)
1 M SPECmarks
10^6 clocks/fault to bulk ram
Event-horizon on chip.
VM reincarnated
Multi-program cache
Terror Bytes!
10 pico-second ram: 1 megabyte
10 nano-second ram: 10 gigabyte
10 microsecond ram: 1 terabyte
10 millisecond disc: 100 terabyte
10 second tape archive: 100 petabyte
Bottleneck Analysis
• Drawn to linear scale
Theoretical
Bus Bandwidth
422MBps = 66 Mhz x 64 bits
Memory
Read/Write
~150 MBps
MemCopy
~50 MBps
Disk R/W
~9MBps
Bottleneck Analysis
• NTFS Read/Write
• 18 Ultra 3 SCSI on 4 strings (2x4 and 2x5)
3 PCI 64
~ 155 MBps Unbuffered read (175 raw)
~ 95 MBps Unbuffered write
Good, but 10x down from our UNIX brethren (SGI, SUN)
[Diagram: Memory Read/Write ~250 MBps feeding 3 PCI buses (~110 MBps each) and SCSI adapters (~70 MBps each); ~155 MBps total.]
PennySort
• Hardware
– 266 Mhz Intel PPro
– 64 MB SDRAM (10ns)
– Dual Fujitsu DMA 3.2GB EIDE disks
• Software
– NT workstation 4.3
– NT 5 sort
• Performance
– sort 15 M 100-byte records (~1.5 GB)
– Disk to disk
– elapsed time 820 sec
• cpu time = 404 sec
PennySort Machine (1107 $)
[Pie chart: cost breakdown, cpu 32%, Disk 25%, board 13%, Network/Video/floppy 9%, Memory 8%, Cabinet + Assembly 7%, Software 6%, Other 22%.]
Penny Sort Ground Rules
http://research.microsoft.com/barc/SortBenchmark
• How much can you sort for a penny?
  – Hardware and Software cost
  – Depreciated over 3 years
  – 1M$ system gets about 1 second,
  – 1K$ system gets about 1,000 seconds.
  – Time (seconds) = 946,080 / SystemPrice ($) (worked example after this list)
• Input and output are disk resident
• Input is
– 100-byte records (random data)
– key is first 10 bytes.
• Must create output file
and fill with sorted version of input file.
• Daytona (product) and Indy (special) categories
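A worked example of the budget rule (a sketch assuming the 3-year depreciation stated above; the 1,107$ machine and its 820 second run are from the PennySort slides above):

```python
# A penny buys 1/100 of a dollar of a system depreciated over 3 years (~94,608,000 s).
THREE_YEARS_S = 3 * 365 * 24 * 3600          # 94,608,000 seconds

def penny_seconds(system_price_dollars: float) -> float:
    """Seconds of system time that one penny buys."""
    return 0.01 * THREE_YEARS_S / system_price_dollars

print(penny_seconds(1_000_000))   # ~0.95 s : a 1M$ system gets about 1 second
print(penny_seconds(1_000))       # ~946 s  : a 1K$ system gets about 1,000 seconds
print(penny_seconds(1_107))       # ~855 s  : budget for the 1,107$ machine above,
                                  #           whose 15M-record sort ran in 820 s
```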
How Good is NT5 Sort?
• CPU and IO not overlapped.
• System should be able to sort 2x more
• RAM has spare capacity
• Disk is space saturated
(1.5GB in, 1.5GB out on 3GB drive.)
Need an extra 3GB drive or a >6GB drive
[Chart: elapsed time split among CPU, Disk, and Fixed ram.]
Sandia/Compaq/ServerNet/NT Sort
• Sort 1.1 Terabyte
(13 Billion records)
in 47 minutes
• 68 nodes (dual 450 Mhz processors)
543 disks,
1.5 M$
• 1.2 GBps network rap
(2.8 GBps pap)
• 5.2 GBps of disk rap
(same as pap)
• (rap=real application performance,
pap= peak advertised performance)
The 72-Node 48-Switch ServerNet-I Topology Deployed at Sandia National Labs
[Diagram: each Compaq Proliant 1850R server has two 400 MHz CPUs, 512 MB SDRAM, 4 SCSI busses each with 2 data disks, and a dual-ported ServerNet-I PCI NIC attached to 6-port ServerNet-I crossbar switches on two fabrics: X (10 bidirectional bisection links) and Y (14 bidirectional bisection links). Each switch on the bisection line adds 3 links to bisection width.]
SP sort
• 2 – 4 GBps!
488 nodes, 55 racks: 1952 processors, 732 GB RAM, 2168 disks
  – 432 compute nodes in 37 racks
  – 56 storage nodes in 18 racks
Compute rack: 16 nodes, each with 4x332Mhz PowerPC604e, 1.5 GB RAM, 1 32x33 PCI bus, 9 GB scsi disk, 150MBps full duplex SP switch
Storage rack: 8 nodes, each with 4x332Mhz PowerPC604e, 1.5 GB RAM, 3 32x33 PCI buses, 30x4 GB scsi disk (4+1 RAID5), 150MBps full duplex SP switch
56 storage nodes manage 1680 4GB disks
336 4+P twin tail RAID5 arrays (30/node)
[Chart: throughput (GB/s, 0 to 4.0) vs elapsed time (0 to 900 seconds) for GPFS read, GPFS write, local read, local write.]
Progress on Sorting: NT now leads
both price and performance
• Speedup comes from Moore’s law 40%/year
• Processor/Disk/Network arrays: 60%/year
(this is a software speedup).
[Chart: sort records/second vs time, 1985 to 2000 (log scale, 10^2 to 10^8): Bitton M68000, IBM 3090, Cray YMP, Tandem, Kitsuregawa hardware sorter, Sequent, Intel HyperCube, IBM RS6000, NOW, Alpha, Ordinal+SGI, Penny NTsort, Sandia/Compaq/NT, SPsort/IBM. Records sorted per second doubles every year; GB sorted per dollar doubles every year.]
Recent Results
• NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute
• MilleniumSort: 16x Dell NT cluster: 100 MB in 1.18 Sec
(Datamation)
• Tandem/Sandia Sort: 68 CPU ServerNet
1 TB in 47 minutes
• IBM SPsort
408 nodes, 1952 cpu
2168 disks
17.6 minutes = 1057sec
(all for 1/3 of 94M$; slice price is 64k$ for 4 cpu, 2 GB ram, 6 9GB disks + interconnect)
Data Gravity
Processing Moves to Transducers
• Move Processing to data sources
• Move to where the power (and sheet metal) is
• Processor in
– Modem
– Display
– Microphones (speech recognition)
& cameras (vision)
– Storage: Data storage and analysis
• System is “distributed” (a cluster/mob)
Gbps SAN: 110 MBps
SAN:
Standard Interconnect
PCI: 70 MBps
UW Scsi: 40 MBps
FW scsi: 20 MBps
scsi: 5 MBps
• LAN faster than
memory bus?
• 1 GBps links in lab.
• 100$ port cost soon
• Port is computer
• Winsock: 110 MBps
(10% cpu utilization at each end)
RIP: FDDI, ATM, SCI, SCSI, FC, ?
Disk = Node
• has magnetic storage (100 GB?)
• has processor & DRAM
• has SAN attachment
• has execution
environment
[Diagram: the disk-node software stack, Applications; Services (DBMS, File System, RPC, ...); OS Kernel with SAN driver and Disk driver.]
Standard Storage Metrics
• Capacity:
– RAM: MB and $/MB: today at 10MB &
100$/MB
– Disk: GB and $/GB: today at 10 GB and
200$/GB
– Tape: TB and $/TB: today at .1TB and
25k$/TB (nearline)
• Access time (latency)
– RAM:100 ns
– Disk: 10 ms
– Tape: 30 second pick, 30 second position
• Transfer rate
Kaps, Maps, SCAN?
• Kaps: How many KB objects served per second
  – The file server, transaction processing metric
  – This is the OLD metric.
• Maps: How many MB objects served per second
  – The Multi-Media metric
• SCAN: How long to scan all the data
  – The data mining and utility metric
Good 1998 devices packaged in a system
(http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf)

                      DRAM      DISK     TAPE robot (X 14)
Unit capacity (GB)       1        18        35
Unit price $          4000       500     10000
$/GB                  4000        28        20
Latency (s)           1.E-7     1.E-2     3.E+1
Bandwidth (Mbps)       500        15         7
Kaps                  5.E+5     1.E+2     3.E-2
Maps                  5.E+2     13.04     3.E-2
Scan time (s/TB)         2      1200     70000
$/Kaps                9.E-11    5.E-8     3.E-3
$/Maps                8.E-8     4.E-7     3.E-3
$/TBscan              $0.08     $0.35      $211
[Chart: the same metrics ($/GB, Bandwidth, Kaps, Maps, Scan time, $/Kaps, $/Maps, $/TBscan) plotted on a log scale (10^-12 to 10^6) for DRAM, DISK, and TAPE robot (X 14).]
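The Kaps, Maps, and scan figures follow from each device's latency, bandwidth, capacity, and price; a minimal sketch of that arithmetic for the disk column (it assumes an access costs latency plus transfer time, and 3-year price amortization):

```python
# Re-derive the DISK column of the table above from raw device parameters.
# Assumptions: an access costs latency + size/bandwidth; prices amortize over
# 3 years; the table's "Mbps" bandwidth is treated as MB/s.
THREE_YEARS_S = 3 * 365 * 24 * 3600                 # ~94,608,000 seconds

price, capacity_gb, latency_s, bw_mbps = 500, 18, 1e-2, 15.0   # 1998 disk

kaps = 1 / (latency_s + 0.001 / bw_mbps)            # 1 KB objects/s  -> ~99 (1.E+2)
maps = 1 / (latency_s + 1.0 / bw_mbps)              # 1 MB objects/s  -> ~13.04
scan_s_per_tb = capacity_gb * 1000 / bw_mbps        # scan 1 TB of such disks in parallel -> 1200 s
dollars_per_kaps = price / (kaps * THREE_YEARS_S)   # -> ~5.E-8
dollars_per_tb_scan = (1000 / capacity_gb) * (price / THREE_YEARS_S) * scan_s_per_tb  # -> ~$0.35

print(kaps, maps, scan_s_per_tb, dollars_per_kaps, dollars_per_tb_scan)
```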
Maps, SCANs
• parallelism: use many little devices in parallel
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel: 100 seconds SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
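The same scan arithmetic, spelled out (a sketch assuming the 10 MB/s per-device rate above):

```python
# Time to scan 1 TB at 10 MB/s, serially and with 1,000-way parallelism.
terabyte_mb = 1_000_000
rate_mb_per_s = 10
serial_s = terabyte_mb / rate_mb_per_s                  # 100,000 s
print(serial_s / 86_400, "days serial")                 # ~1.2 days
print(serial_s / 1_000, "seconds with 1,000 devices")   # 100 s
```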
The 1 TB disc card (14")
An array of discs. Can be used as:
  100 discs
  1 striped disc
  10 Fault Tolerant discs
  ....etc
LOTS of accesses/second and bandwidth
Life is cheap, it's the accessories that cost ya.
Processors are cheap, it's the peripherals that cost ya (a 10k$ disc card).
Not Mainframe Silos: many independent tape robots (like a disc farm)
10K$ robot: 14 tapes, 500 GB, 5 MB/s, 20$/GB, 30 Maps
100 robots: 50TB, 50$/GB, 3K Maps, 1M$, Scan in 27 hours
Myth
Optical is cheap: 200 $/platter
2 GB/platter
=> 100$/GB (2x cheaper than disc)
Tape is cheap: 30 $/tape
20 GB/tape
=> 1.5 $/GB (100x cheaper than disc).
Cost
Tape needs a robot (10 k$ ... 3 m$ )
10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB
(1x…10x cheaper than disc)
Optical needs a robot (100 k$ )
100 platters = 200GB ( TODAY ) => 400 $/GB
( more expensive than mag disc )
Robots have poor access times
Not good for Library of Congress (25TB)
Data motel: data checks in but it never checks out!
The Access Time Myth
The Myth: seek or pick time dominates
The reality: (1) Queuing dominates
             (2) Transfer dominates for BLOBs
             (3) Disk seeks are often short
Implication: many cheap servers are better than one fast expensive server
[Pie charts: request time split among Wait, Seek, Rotate, Transfer.]
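A small illustration of why queuing dominates; this is not from the slide and assumes a simple M/M/1 queue, where mean response time is service time divided by (1 - utilization):

```python
# Illustration (simple M/M/1 queue): mean response = service / (1 - utilization).
def response_ms(service_s: float, utilization: float) -> float:
    return 1000 * service_s / (1.0 - utilization)

service = 0.010                                   # ~10 ms seek + rotate + transfer
for u in (0.1, 0.5, 0.8, 0.95):
    print(f"utilization {u:4.0%}: response {response_ms(service, u):6.1f} ms")
# 10% -> 11 ms, 50% -> 20 ms, 80% -> 50 ms, 95% -> 200 ms: waiting, not seeking, dominates.
```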
What To Do About HIGH Availability
• Need remote MIRRORED site to tolerate
environmental failures (power, net, fire, flood)
operations failures
• Replicate changes across the net
• Failover servers across the net (some distance)
• Allows: software upgrades, site moves, fires,...
• Tolerates: operations errors, heisenbugs, ...
[Diagram: client sends state changes to a server, which replicates them to a second server >100 feet or >100 miles away.]
Mflop/s/$K vs Mflop/s
[Chart: log-log scatter, Mflop/s/$K (0.001 to 100) vs Mflop/s (0.1 to 100,000), for LANL Loki P6 Linux, NAS Expanded Linux Cluster, Cray T3E, IBM SP, SGI Origin 2000-195, Sun Ultra Enterprise 4000, UCB NOW.]
Scaleup Has Limits
(chart courtesy of Catharine Van Ingen)
• Vector Supers ~ 10x supers
– ~3 Gflops/cpu
– bus/memory ~ 20 GBps
– IO ~ 1GBps
• Supers ~ 10x PCs
– 300 Mflops/cpu
– bus/memory ~ 2 GBps
– IO ~ 1 GBps
• PCs are slow
– ~ 30 Mflops/cpu
– and bus/memory ~ 200MBps
– and IO ~ 100 MBps
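One way to read these numbers: all three classes keep roughly the same flops to memory-bandwidth balance; a quick check using only the figures above:

```python
# Flops per byte/s of memory bandwidth, from the figures listed above.
classes = {
    "vector super": (3e9,   20e9),   # (flops per cpu, memory bytes/s)
    "super":        (300e6,  2e9),
    "PC":           (30e6, 200e6),
}
for name, (flops, mem_bw) in classes.items():
    print(f"{name:12s}: {flops / mem_bw:.2f} flops per byte/s")
# All three come out near 0.15: the balance is the same, only the scale differs.
```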
TOP500 Systems by Vendor
(courtesy of Larry Smarr NCSA)
TOP500 Reports: http://www.netlib.org/benchmark/top500.html
[Chart: number of TOP500 systems (0 to 500) by vendor, Jun-93 to Jun-98: CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese vector machines, Other.]
NCSA Super Cluster
• National Center for Supercomputing Applications
University of Illinois @ Urbana
• 512 Pentium II cpus, 2,096 disks, SAN
• Compaq + HP +Myricom + WindowsNT
• A Super Computer for 3M$
• Classic Fortran/MPI programming
• DCOM programming model
http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
Avalon: Alpha
Clusters for Science
http://cnls.lanl.gov/avalon/
140 Alpha Processors (533 Mhz) x 256 MB + 3GB disk
Fast Ethernet switches
= 45 Gbytes RAM, 550 GB disk
+ Linux
= 10 real Gflops for $313,000
=> 34 real Mflops/k$ on 150 benchmark Mflops/k$
Beowulf project is Parent
http://www.cacr.caltech.edu/beowulf/naegling.html
114 nodes, 2k$/node,
Scientists want cheap mips.
• Intel/Sandia:
9000x1 node Ppro
• LLNL/IBM:
512x8 PowerPC (SP2)
• LANL/Cray:
?
• Maui Supercomputer Center
– 512x1 SP2
Your Tax Dollars At Work
ASCI for Stockpile Stewardship
Observations
• Uniprocessor RAP << PAP
– real app performance << peak advertised performance
• Growth has slowed (Bell Prize)
  – 1987: 0.5 GFLOPS
  – 1988: 1.0 GFLOPS (1 year)
  – 1990: 14 GFLOPS (2 years)
  – 1994: 140 GFLOPS (4 years)
  – 1997: 604 GFLOPS
  – 1998: 1600 G__OPS (4 years)
Two Generic Kinds of computing
• Many little
– embarrassingly parallel
– Fit RPC model
– Fit partitioned data and computation model
– Random works OK
– OLTP, File Server, Email, Web,…..
• Few big
– sometimes not obviously parallel
– Do not fit RPC model (BIG rpcs)
– Scientific, simulation, data mining, ...
Many Little Programming Model
• many small requests
• route requests to data
• encapsulate data with procedures (objects)
• three-tier computing
• RPC is a convenient/appropriate model
• Transactions are a big help in error handling
• Auto partition (e.g. hash data and computation)
• Works fine.
• Software CyberBricks
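A minimal sketch of the "route requests to data" idea: hash the key to pick the partition that owns it and send the small request there (the class and names are illustrative, not from the slides):

```python
# Route each small request to the partition that owns its data (hash partitioning).
from hashlib import sha1

class PartitionServer:
    """Stand-in for one CyberBrick: owns one hash partition of the data."""
    def __init__(self):
        self.store = {}
    def execute(self, op, key, value=None):       # the small RPC-style request
        if op == "put":
            self.store[key] = value
        return self.store.get(key)

N = 8
servers = [PartitionServer() for _ in range(N)]

def route(key: str) -> PartitionServer:
    """Hash the data AND the computation to the same place."""
    return servers[int(sha1(key.encode()).hexdigest(), 16) % N]

route("account:42").execute("put", "account:42", 100)
print(route("account:42").execute("get", "account:42"))   # -> 100, served by its owner
```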
Object Oriented Programming
Parallelism From Many Little Jobs
• Gives location transparency
• ORB/web/tpmon multiplexes clients to servers
• Enables distribution
• Exploits embarrassingly parallel apps (transactions)
• HTTP and RPC (dcom, corba, rmi, iiop, …) are basis
Tp mon / orb/ web server
Few Big Programming Model
• Finding parallelism is hard
– Pipelines are short (3x …6x speedup)
• Spreading objects/data is easy,
but getting locality is HARD
• Mapping big job onto cluster is hard
• Scheduling is hard
– coarse grained (job) and fine grain (co-schedule)
• Fault tolerance is hard
Kinds of Parallel Execution
• Pipeline: any sequential program feeds its output to the next sequential program
• Partition: many copies of any sequential program run on partitioned data; inputs merge M ways, outputs split N ways
Why Parallel Access To Data?
At 10 MB/s: 1.2 days to scan 1 Terabyte
1,000 x parallel: 100 second SCAN.
Parallelism: divide a big problem into many smaller ones to be solved in parallel.
BANDWIDTH
Why are Relational Operators Successful for Parallelism?
The relational data model: uniform operators on uniform data streams, closed under composition
Each operator consumes 1 or 2 input streams
Each stream is a uniform collection of data
Sequential data in and out: pure dataflow
Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation
=> AUTOMATIC PARALLELISM
Database Systems
“Hide” Parallelism
• Automate system management via tools
–data placement
–data organization (indexing)
–periodic tasks (dump / recover / reorganize)
• Automatic fault tolerance
–duplex & failover
–transactions
• Automatic parallelism
–among transactions (locking)
–within a transaction (parallel execution)
SQL a Non-Procedural
Programming Language
• SQL: functional programming language
describes answer set.
• Optimizer picks best execution plan
– Picks data flow web (pipeline),
– degree of parallelism (partitioning)
– other execution parameters (process placement, memory,...)
[Diagram: GUI and Schema feed the Optimizer (Execution Planning); the resulting Plan runs on the Executors (Rivers), watched by a Monitor.]
Partitioned Execution
Spreads computation and IO among processors
[Diagram: a Table partitioned into A...E, F...J, K...N, O...S, T...Z; a Count operator runs on each partition and a final Count merges the results.]
Partitioned data gives NATURAL parallelism
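A sketch of the partitioned Count above: each partition counts its rows in parallel and a final step merges the partial counts (the range partitioning mirrors the A...E, F...J, ... picture; the sample rows are made up):

```python
# Partitioned aggregate: count each range partition in parallel, then merge.
from concurrent.futures import ThreadPoolExecutor

ranges = [("A", "E"), ("F", "J"), ("K", "N"), ("O", "S"), ("T", "Z")]
table = ["Adams", "Baker", "Garcia", "Jones", "Klein", "Olsen", "Smith", "Young"]

# Range-partition the table by first letter (one partition per processor/disk).
partitions = [[row for row in table if lo <= row[0] <= hi] for lo, hi in ranges]

def count(partition):                              # the per-partition Count operator
    return len(partition)

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(count, partitions))

print(partial_counts, "->", sum(partial_counts))   # merge step: [2, 2, 1, 2, 1] -> 8
```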
N x M way Parallelism
[Diagram: five partitions (A...E, F...J, K...N, O...S, T...Z) each run Sort then Join; their outputs feed three Merge operators.]
N inputs, M outputs, no bottlenecks.
Partitioned Data
Partitioned and Pipelined Data Flows
Partitioned and Pipelined Data Flows
Automatic Parallel Object Relational DB
Select image
from landsat
where date between 1970 and 1990
and overlaps(location, :Rockies)
and snow_cover(image) >.7;
[Diagram: the Landsat table (date, loc, image) with Temporal, Spatial, and Image access methods; rows span 1/2/72 to 4/8/95, 33N 120W to 34N 120W.]
Assign one process per processor/disk:
  find images with the right date & location
  analyze the image; if 70% snow, return it
The date, location, & image tests run in parallel; qualifying images stream back as the Answer.
Data Rivers: Split + Merge Streams
Producers add records to the river,
Consumers consume records from the river
Purely sequential programming.
River does flow control and buffering
does partition and merge of data records
River = Split/Merge in Gamma =
Exchange operator in Volcano /SQL Server.
[Diagram: N producers feed the river; the river carries N X M data streams to M consumers.]
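A minimal sketch of a river: N producers append records, the river splits them by hash into M bounded queues (its flow control), and M consumers each read a sequential stream (the queue sizes and hash split are illustrative assumptions):

```python
# Data river: N producers, hash split into M bounded streams, M consumers.
import queue, threading

N_PRODUCERS, M_CONSUMERS = 3, 2
streams = [queue.Queue(maxsize=16) for _ in range(M_CONSUMERS)]   # bounded = flow control
DONE = object()

def producer(pid):
    for i in range(5):                                   # purely sequential code
        record = (pid, i)
        streams[hash(record) % M_CONSUMERS].put(record)  # the river does the split

def consumer(cid, out):
    while True:
        record = streams[cid].get()
        if record is DONE:
            return
        out.append(record)                               # purely sequential code

outputs = [[] for _ in range(M_CONSUMERS)]
producers = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
consumers = [threading.Thread(target=consumer, args=(c, outputs[c])) for c in range(M_CONSUMERS)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for s in streams:
    s.put(DONE)                                          # end-of-stream, one per consumer
for t in consumers:
    t.join()
print([len(o) for o in outputs])                         # 15 records, split across 2 consumers
```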
Generalization: Object-oriented Rivers
• Rivers transport sub-class of record-set (= stream of objects)
– record type and partitioning are part of subclass
• Node transformers are data pumps
– an object with river inputs and outputs
– do late-binding to record-type
• Programming becomes data flow programming
– specify the pipelines
• Compiler/Scheduler does
data partitioning and
“transformer” placement
NT Cluster Sort as a Prototype
• Using
– data generation and
– sort
as a prototypical app
• “Hello world” of distributed processing
• goal: easy install & execute
Remote Install
RegConnectRegistry()
RegCreateKeyEx()
•Add Registry entry to each remote node.
Cluster Startup & Execution
MULTI_QI, COSERVERINFO
• Setup: MULTI_QI struct, COSERVERINFO struct
• CoCreateInstanceEx()
• Retrieve remote object handle from MULTI_QI struct
• Invoke methods as usual
[Diagram: one HANDLE per remote node, with Sort() invoked through each.]
Cluster Sort Conceptual Model
• Multiple Data Sources
• Multiple Data Destinations
• Multiple nodes
• Disks -> Sockets -> Disk -> Disk
[Diagram: nodes A, B, and C each start with a mix of AAA, BBB, and CCC records on disk and exchange them over sockets so that A ends up with all the AAA records, B with all the BBB records, and C with all the CCC records.]
How Do They Talk to Each Other?
• Each node has an OS
• Each node has local resources: A federation.
• Each node does not completely trust the others.
• Nodes use RPC to talk to each other
– CORBA? DCOM? IIOP? RMI?
– One or all of the above.
• Huge leverage in high-level interfaces.
• Same old distributed system story.
[Diagram: two hosts connected by the wire(s); on each, Applications sit above RPC, streams, datagrams, and "?", layered over VIAL/VIPL.]
More Related Content

What's hot

Collaborate vdb performance
Collaborate vdb performanceCollaborate vdb performance
Collaborate vdb performanceKyle Hailey
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Keisuke Takahashi
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Intel® Software
 
Recursive Grid Computing AMD on AMD
Recursive Grid Computing AMD on AMDRecursive Grid Computing AMD on AMD
Recursive Grid Computing AMD on AMDQuentin Fennessy
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013Server Density
 
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...Intel® Software
 
ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718Jinrong Ye
 
Ibm and Erb's Presentation Insider's Edition Event . September 2010
Ibm and Erb's Presentation Insider's Edition Event .  September 2010Ibm and Erb's Presentation Insider's Edition Event .  September 2010
Ibm and Erb's Presentation Insider's Edition Event . September 2010Erb's Marketing
 
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...Nguyễn Hoàng (LightJSC)
 
Cloud Storage Introduction ( CEPH )
Cloud Storage Introduction ( CEPH )  Cloud Storage Introduction ( CEPH )
Cloud Storage Introduction ( CEPH ) Alex Lau
 

What's hot (13)

Collaborate vdb performance
Collaborate vdb performanceCollaborate vdb performance
Collaborate vdb performance
 
NFS and Oracle
NFS and OracleNFS and Oracle
NFS and Oracle
 
Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5Trying and evaluating the new features of GlusterFS 3.5
Trying and evaluating the new features of GlusterFS 3.5
 
Cobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale EnvironmentsCobbler, Func and Puppet: Tools for Large Scale Environments
Cobbler, Func and Puppet: Tools for Large Scale Environments
 
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
Accelerate Game Development and Enhance Game Experience with Intel® Optane™ T...
 
OpenDBCamp Virtualization
OpenDBCamp VirtualizationOpenDBCamp Virtualization
OpenDBCamp Virtualization
 
Recursive Grid Computing AMD on AMD
Recursive Grid Computing AMD on AMDRecursive Grid Computing AMD on AMD
Recursive Grid Computing AMD on AMD
 
High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013High performance Infrastructure Oct 2013
High performance Infrastructure Oct 2013
 
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...
World of Tanks* 1.0+: Enriching Gamers Experience with Multicore Optimized Ph...
 
ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718ibbackup vs mysqldump对比测试 - 20080718
ibbackup vs mysqldump对比测试 - 20080718
 
Ibm and Erb's Presentation Insider's Edition Event . September 2010
Ibm and Erb's Presentation Insider's Edition Event .  September 2010Ibm and Erb's Presentation Insider's Edition Event .  September 2010
Ibm and Erb's Presentation Insider's Edition Event . September 2010
 
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...
Promise - Rich media storage solution- Thunderbolt3 storage solution - Pegasu...
 
Cloud Storage Introduction ( CEPH )
Cloud Storage Introduction ( CEPH )  Cloud Storage Introduction ( CEPH )
Cloud Storage Introduction ( CEPH )
 

Viewers also liked

The buransh mahotsav_report
The buransh mahotsav_report The buransh mahotsav_report
The buransh mahotsav_report Prasanna Kapoor
 
19 structured files
19 structured files19 structured files
19 structured filesashish61_scs
 
5 application serversforproject
5 application serversforproject5 application serversforproject
5 application serversforprojectashish61_scs
 
04 transaction models
04 transaction models04 transaction models
04 transaction modelsashish61_scs
 
Digital: An Inflection Point for Mankind
Digital: An Inflection Point for MankindDigital: An Inflection Point for Mankind
Digital: An Inflection Point for MankindBernard Panes
 
10 Ways to be a great Solution Architect
10 Ways to be a great Solution Architect10 Ways to be a great Solution Architect
10 Ways to be a great Solution ArchitectBernard Panes
 
Mostafa Wael Farouk -Techno-Functional
Mostafa Wael Farouk -Techno-FunctionalMostafa Wael Farouk -Techno-Functional
Mostafa Wael Farouk -Techno-FunctionalMostafa Wael
 

Viewers also liked (15)

The buransh mahotsav_report
The buransh mahotsav_report The buransh mahotsav_report
The buransh mahotsav_report
 
01 whirlwind tour
01 whirlwind tour01 whirlwind tour
01 whirlwind tour
 
19 structured files
19 structured files19 structured files
19 structured files
 
4 db recovery
4 db recovery4 db recovery
4 db recovery
 
5 application serversforproject
5 application serversforproject5 application serversforproject
5 application serversforproject
 
Ibrahim Thesis
Ibrahim ThesisIbrahim Thesis
Ibrahim Thesis
 
Jeopardy (cecilio)
Jeopardy (cecilio)Jeopardy (cecilio)
Jeopardy (cecilio)
 
04 transaction models
04 transaction models04 transaction models
04 transaction models
 
Digital: An Inflection Point for Mankind
Digital: An Inflection Point for MankindDigital: An Inflection Point for Mankind
Digital: An Inflection Point for Mankind
 
Solution6.2012
Solution6.2012Solution6.2012
Solution6.2012
 
Assignment.1
Assignment.1Assignment.1
Assignment.1
 
10 Ways to be a great Solution Architect
10 Ways to be a great Solution Architect10 Ways to be a great Solution Architect
10 Ways to be a great Solution Architect
 
Solution8 v2
Solution8 v2Solution8 v2
Solution8 v2
 
11 bpm
11 bpm11 bpm
11 bpm
 
Mostafa Wael Farouk -Techno-Functional
Mostafa Wael Farouk -Techno-FunctionalMostafa Wael Farouk -Techno-Functional
Mostafa Wael Farouk -Techno-Functional
 

Similar to 14 scaleabilty wics

Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDaehyeok Kim
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...Chester Chen
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationBigstep
 
.NET Memory Primer (Martin Kulov)
.NET Memory Primer (Martin Kulov).NET Memory Primer (Martin Kulov)
.NET Memory Primer (Martin Kulov)ITCamp
 
In-Memory Computing: Myths and Facts
In-Memory Computing: Myths and FactsIn-Memory Computing: Myths and Facts
In-Memory Computing: Myths and FactsDATAVERSITY
 
ClickOS_EE80777777777777777777777777777.pptx
ClickOS_EE80777777777777777777777777777.pptxClickOS_EE80777777777777777777777777777.pptx
ClickOS_EE80777777777777777777777777777.pptxBiHongPhc
 
Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007Wing Venture Capital
 
Appsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phoneAppsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phonemarcocjacobs
 
Flash Storage Trends
Flash Storage TrendsFlash Storage Trends
Flash Storage TrendsOsys AG
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)MongoDB
 
Sun Oracle Exadata V2 For OLTP And DWH
Sun Oracle Exadata V2 For OLTP And DWHSun Oracle Exadata V2 For OLTP And DWH
Sun Oracle Exadata V2 For OLTP And DWHMark Rabne
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward
 
Basic course
Basic courseBasic course
Basic courseSirajRock
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1Hassy Veldstra
 
BGF 2012 (Browsergames Forum)
BGF 2012 (Browsergames Forum)BGF 2012 (Browsergames Forum)
BGF 2012 (Browsergames Forum)Christof Wegmann
 

Similar to 14 scaleabilty wics (20)

Designs, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed SystemsDesigns, Lessons and Advice from Building Large Distributed Systems
Designs, Lessons and Advice from Building Large Distributed Systems
 
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
SF Big Analytics & SF Machine Learning Meetup: Machine Learning at the Limit ...
 
Memory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and VirtualizationMemory, Big Data, NoSQL and Virtualization
Memory, Big Data, NoSQL and Virtualization
 
.NET Memory Primer (Martin Kulov)
.NET Memory Primer (Martin Kulov).NET Memory Primer (Martin Kulov)
.NET Memory Primer (Martin Kulov)
 
In-Memory Computing: Myths and Facts
In-Memory Computing: Myths and FactsIn-Memory Computing: Myths and Facts
In-Memory Computing: Myths and Facts
 
Palestra IBM-Mack Zvm linux
Palestra  IBM-Mack Zvm linux  Palestra  IBM-Mack Zvm linux
Palestra IBM-Mack Zvm linux
 
ClickOS_EE80777777777777777777777777777.pptx
ClickOS_EE80777777777777777777777777777.pptxClickOS_EE80777777777777777777777777777.pptx
ClickOS_EE80777777777777777777777777777.pptx
 
Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007Nimble Storage Series A presentation 2007
Nimble Storage Series A presentation 2007
 
Appsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phoneAppsterdam talk - about the chips inside your phone
Appsterdam talk - about the chips inside your phone
 
Flash Storage Trends
Flash Storage TrendsFlash Storage Trends
Flash Storage Trends
 
Tech
TechTech
Tech
 
Basic course
Basic courseBasic course
Basic course
 
Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)Deployment Strategies (Mongo Austin)
Deployment Strategies (Mongo Austin)
 
Sun Oracle Exadata V2 For OLTP And DWH
Sun Oracle Exadata V2 For OLTP And DWHSun Oracle Exadata V2 For OLTP And DWH
Sun Oracle Exadata V2 For OLTP And DWH
 
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
Flink Forward Berlin 2017: Robert Metzger - Keep it going - How to reliably a...
 
Basic course
Basic courseBasic course
Basic course
 
Basic course
Basic courseBasic course
Basic course
 
The Smug Mug Tale
The Smug Mug TaleThe Smug Mug Tale
The Smug Mug Tale
 
The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1The Anatomy Of The Google Architecture Fina Lv1.1
The Anatomy Of The Google Architecture Fina Lv1.1
 
BGF 2012 (Browsergames Forum)
BGF 2012 (Browsergames Forum)BGF 2012 (Browsergames Forum)
BGF 2012 (Browsergames Forum)
 

More from ashish61_scs

More from ashish61_scs (20)

7 concurrency controltwo
7 concurrency controltwo7 concurrency controltwo
7 concurrency controltwo
 
Transactions
TransactionsTransactions
Transactions
 
22 levine
22 levine22 levine
22 levine
 
21 domino mohan-1
21 domino mohan-121 domino mohan-1
21 domino mohan-1
 
20 access paths
20 access paths20 access paths
20 access paths
 
18 philbe replication stanford99
18 philbe replication stanford9918 philbe replication stanford99
18 philbe replication stanford99
 
17 wics99 harkey
17 wics99 harkey17 wics99 harkey
17 wics99 harkey
 
16 greg hope_com_wics
16 greg hope_com_wics16 greg hope_com_wics
16 greg hope_com_wics
 
15 bufferand records
15 bufferand records15 bufferand records
15 bufferand records
 
14 turing wics
14 turing wics14 turing wics
14 turing wics
 
13 tm adv
13 tm adv13 tm adv
13 tm adv
 
11 tm
11 tm11 tm
11 tm
 
10b rm
10b rm10b rm
10b rm
 
10a log
10a log10a log
10a log
 
09 workflow
09 workflow09 workflow
09 workflow
 
08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick08 message and_queues_dieter_gawlick
08 message and_queues_dieter_gawlick
 
06 07 lock
06 07 lock06 07 lock
06 07 lock
 
05 tp mon_orbs
05 tp mon_orbs05 tp mon_orbs
05 tp mon_orbs
 
03 fault model
03 fault model03 fault model
03 fault model
 
02 fault tolerance
02 fault tolerance02 fault tolerance
02 fault tolerance
 

Recently uploaded

Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxJisc
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxmarlenawright1
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and ModificationsMJDuyan
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...Amil baba
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxCeline George
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17Celine George
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...Nguyen Thanh Tu Collection
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptNishitharanjan Rout
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...Nguyen Thanh Tu Collection
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxJisc
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Jisc
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsNbelano25
 
Introduction to TechSoup’s Digital Marketing Services and Use Cases
Introduction to TechSoup’s Digital Marketing  Services and Use CasesIntroduction to TechSoup’s Digital Marketing  Services and Use Cases
Introduction to TechSoup’s Digital Marketing Services and Use CasesTechSoup
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxEsquimalt MFRC
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxPooja Bhuva
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxPooja Bhuva
 

Recently uploaded (20)

VAMOS CUIDAR DO NOSSO PLANETA! .
VAMOS CUIDAR DO NOSSO PLANETA!                    .VAMOS CUIDAR DO NOSSO PLANETA!                    .
VAMOS CUIDAR DO NOSSO PLANETA! .
 
Wellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptxWellbeing inclusion and digital dystopias.pptx
Wellbeing inclusion and digital dystopias.pptx
 
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptxHMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
HMCS Vancouver Pre-Deployment Brief - May 2024 (Web Version).pptx
 
Our Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdfOur Environment Class 10 Science Notes pdf
Our Environment Class 10 Science Notes pdf
 
Understanding Accommodations and Modifications
Understanding  Accommodations and ModificationsUnderstanding  Accommodations and Modifications
Understanding Accommodations and Modifications
 
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
NO1 Top Black Magic Specialist In Lahore Black magic In Pakistan Kala Ilam Ex...
 
What is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptxWhat is 3 Way Matching Process in Odoo 17.pptx
What is 3 Way Matching Process in Odoo 17.pptx
 
How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17How to Create and Manage Wizard in Odoo 17
How to Create and Manage Wizard in Odoo 17
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
TỔNG ÔN TẬP THI VÀO LỚP 10 MÔN TIẾNG ANH NĂM HỌC 2023 - 2024 CÓ ĐÁP ÁN (NGỮ Â...
 
AIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.pptAIM of Education-Teachers Training-2024.ppt
AIM of Education-Teachers Training-2024.ppt
 
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
80 ĐỀ THI THỬ TUYỂN SINH TIẾNG ANH VÀO 10 SỞ GD – ĐT THÀNH PHỐ HỒ CHÍ MINH NĂ...
 
Towards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptxTowards a code of practice for AI in AT.pptx
Towards a code of practice for AI in AT.pptx
 
Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)Accessible Digital Futures project (20/03/2024)
Accessible Digital Futures project (20/03/2024)
 
Tatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf artsTatlong Kwento ni Lola basyang-1.pdf arts
Tatlong Kwento ni Lola basyang-1.pdf arts
 
Introduction to TechSoup’s Digital Marketing Services and Use Cases
Introduction to TechSoup’s Digital Marketing  Services and Use CasesIntroduction to TechSoup’s Digital Marketing  Services and Use Cases
Introduction to TechSoup’s Digital Marketing Services and Use Cases
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptxHMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
HMCS Max Bernays Pre-Deployment Brief (May 2024).pptx
 
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptxExploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
Exploring_the_Narrative_Style_of_Amitav_Ghoshs_Gun_Island.pptx
 
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptxOn_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
On_Translating_a_Tamil_Poem_by_A_K_Ramanujan.pptx
 

14 scaleabilty wics

  • 1. Scaleabilty Jim Gray Gray@Microsoft.com (with help from Gordon Bell, George Spix, Catharine van Ingen 9:00 11:00 1:30 3:30 7:00 Overview Faults Tolerance T Models Party TP mons Lock Theory Lock Techniq Queues Workflow Log ResMgr CICS & Inet Adv TM Cyberbrick Files &Buffers COM+ Corba Replication Party B-tree Access Paths Groupware Benchmark Mon Tue Wed Thur Fri
  • 2. A peta-op business app? • P&G and friends pay for the web (like they paid for broadcast television) – no new money, but given Moore, traditional advertising revenues can pay for all of our connectivity - voice, video, data…… (presuming we figure out how to & allow them to brand the experience.) • Advertisers pay for impressions and ability to analyze same. • A terabyte sort a minute – to one a second. • Bisection bw of ~20gbytes/s – to ~200gbytes/s. • Really a tera-op business app (today’s portals)
  • 3. Scaleability Scale Up and Scale Out SMPSMP Super ServerSuper Server DepartmentalDepartmental ServerServer PersonalPersonal SystemSystem Grow Up with SMPGrow Up with SMP 4xP6 is now standard4xP6 is now standard Grow Out with ClusterGrow Out with Cluster Cluster has inexpensive partsCluster has inexpensive parts Cluster of PCs
  • 4. There'll be Billions Trillions Of Clients • Every device will be “intelligent” • Doors, rooms, cars… • Computing will be ubiquitous
  • 5. Billions Of Clients Need Millions Of Servers MobileMobile clientsclients FixedFixed clientsclients ServerServer SuperSuper serverserver ClientsClients ServersServers  All clients networkedAll clients networked to serversto servers  May be nomadicMay be nomadic or on-demandor on-demand  Fast clients wantFast clients want fasterfaster serversservers  Servers provideServers provide  Shared DataShared Data  ControlControl  CoordinationCoordination  CommunicationCommunication Trillions Billions
  • 6. Thesis Many little beat few big  Smoking, hairy golf ballSmoking, hairy golf ball  How to connect the many little parts?How to connect the many little parts?  How to program the many little parts?How to program the many little parts?  Fault tolerance & Management?Fault tolerance & Management? $1$1 millionmillion $100 K$100 K $10 K$10 K MainframeMainframe MiniMini MicroMicro NanoNano 14"14" 9"9" 5.25"5.25" 3.5"3.5" 2.5"2.5" 1.8"1.8" 1 M SPECmarks, 1TFLOP1 M SPECmarks, 1TFLOP 101066 clocks to bulk ramclocks to bulk ram Event-horizon on chipEvent-horizon on chip VM reincarnatedVM reincarnated Multi-program cache,Multi-program cache, On-Chip SMPOn-Chip SMP 10 microsecond ram 10 millisecond disc 10 second tape archive 10 nano-second ram Pico Processor 10 pico-second ram 1 MM 3 100 TB 1 TB 10 GB 1 MB 100 MB
  • 7. The Bricks of Cyberspace: 4 B PC’s (1 Bips, .1 GB dram, 10 GB disk, 1 Gbps Net, B=G) • Cost 1,000 $ • Come with – NT – DBMS – High speed Net – System management – GUI / OOUI – Tools • Compatible with everyone else • CyberBricks
  • 8. Computers shrink to a point • Disks 100x in 10 years: 2 TB 3.5” drive • Shrink to 1” is 200GB • Disk is super computer! • This is already true of printers and “terminals” • (Scale: Kilo, Mega, Giga, Tera, Peta, Exa, Zetta, Yotta)
  • 9. Super Server: 4T Machine • Array of 1,000 4B machines: 1 Bips processors, 1 B B DRAM, 10 B B disks, 1 Bbps comm lines, 1 TB tape robot • A few megabucks • Challenge: manageability, programmability, security, availability, scaleability, affordability • As easy as a single system • Future servers are CLUSTERS of processors, discs • Distributed database techniques make clusters work • Cyber Brick: a 4B machine (CPU, 50 GB disc, 5 GB RAM)
  • 10. Cluster Vision Buying Computers by the Slice • Rack & Stack – Mail-order components – Plug them into the cluster • Modular growth without limits – Grow by adding small modules • Fault tolerance: – Spare modules mask failures • Parallel execution & data search – Use multiple processors and disks • Clients and servers made from the same stuff – Inexpensive: built with commodity CyberBricks
  • 11. Systems 30 Years Ago • MegaBuck per Mega Instruction Per Second (mips) • MegaBuck per MegaByte • Sys Admin & Data Admin per MegaBuck
  • 12. Disks of 30 Years Ago • 10 MB • Failed every few weeks
  • 13. 1988: IBM DB2 + CICS Mainframe 65 tps • IBM 4391 • Simulated network of 800 clients • 2m$ computer • Staff of 6 to do benchmark 2 x 3725 network controllers 16 GB disk farm 4 x 8 x .5GB Refrigerator-sized CPU
  • 14. 1987: Tandem Mini @ 256 tps • 14 M$ computer (Tandem) • A dozen people (1.8M$/y) • False floor, 2 rooms of machines Simulate 25,600 clients 32 node processor array 40 GB disk array (80 drives) OS expert Network expert DB expert Performance expert Hardware experts Admin expert Auditor Manager
  • 15. 1997: 9 years later 1 Person and 1 box = 1250 tps • 1 Breadbox ~ 5x 1987 machine room • 23 GB is hand-held • One person does all the work • Cost/tps is 100,000x less 5 micro dollars per transaction 4x200 Mhz cpu 1/2 GB DRAM 12 x 4GB disk Hardware expert OS expert Net expert DB expert App expert 3 x7 x 4GB disk arrays
  • 16. What Happened? Where did the 100,000x come from? • Moore’s law: 100X (at most) • Software improvements: 10X (at most) • Commodity pricing: 100X (at least) • Total: 100,000X • 100x from commodity: DBMS was 100K$ to start, now 1k$ to start; IBM 390 MIPS is 7.5K$ today; Intel MIPS is 10$ today; commodity disk is 50$/GB vs 1,500$/GB; ... (Chart: price vs time curves for mainframe, mini, micro.)
  • 17. Density per square foot (standard package, full height, fully populated, 3.5” disks):
                 SGI O2K   UE10K   DELL 6350   Cray T3E   IBM SP2    PoPC
    cpus            2.1      4.7       7.0        4.7       5.0      13.3
    specint        29.0     60.5     132.7       79.3      72.3     253.3
    ram             4.1      4.7       7.0        0.6       5.0       6.8
    gb disks        1.3      0.5       5.2        0.0       2.5      13.3
  HP, DELL, Compaq are trading places wrt the rack-mount lead. PoPC: Celeron NLX shoeboxes, 1000 nodes in 48 (24x2) sq ft, $650K from Arrow (3-yr warranty!), on-chip at-speed L2. Web & server farms, server consolidation / sqft. http://www.exodus.com (charges by mbps times sqft)
  • 18. Application Taxonomy • General purpose, non-parallelizable codes: PCs have it! • Vectorizable • Vectorizable & //able (supers & small DSMs) • Hand tuned, one-of, coarse-grain MPP • Embarrassingly // MPP (clusters of PCs) • Database, Database/TP, Web Host, Stream Audio/Video • Technical vs Commercial • If central control & rich, then IBM or large SMPs; else PC clusters
  • 19. Peta scale with traditional balance. 10x every 5 years, 100x every 10 (1000x in 20 if SC); except memory & IO bandwidth:
                                       2000                        2010
    1 PIPS processors (10^15 ips)      10^6 cpus @ 10^9 ips        10^4 cpus @ 10^11 ips
    10 PB of DRAM                      10^8 chips @ 10^7 bytes     10^6 chips @ 10^9 bytes
    10 PBps memory bandwidth
    1 PBps IO bandwidth                10^8 disks @ 10^7 Bps       10^7 disks @ 10^8 Bps
    100 PB of disk storage             10^5 disks @ 10^10 B        10^3 disks @ 10^12 B
    10 EB of tape storage              10^7 tapes @ 10^10 B        10^5 tapes @ 10^12 B
  • 20. “I think there is a world market for maybe five computers.” – Thomas Watson Senior, Chairman of IBM, 1943
  • 21. Microsoft.com: ~150x4 nodes: a crowd. (Site diagram: www.microsoft.com, home.microsoft.com, search, premium, register, support, activex, msid.msn.com, cdm, and FTP/HTTP download servers spread across four FDDI rings (MIS1-MIS4), primary and secondary Gigaswitches, and switched Ethernet; live and staging SQL Servers, SQL consolidators, internal WWW and SQL reporting, and the MOSWest admin LAN; European and Japan data centers; Internet feeds of 13 DS3 (45 Mb/sec each), 2 OC3 (100 Mb/sec each), and 2 Ethernet (100 Mb/sec each); typical node 4xP5 or 4xP6, 256-512 MB RAM (1 GB on the largest), 12-160 GB disk; average cost $25K-$83K per node.)
  • 22. HotMail (a year ago): ~400 computers, a crowd (now 2x bigger). (Site diagram: 200 Mbps Internet link into Local Directors and a Cisco Catalyst 5000 Ethernet switch, local 10 Mbps switched Ethernet; Front Doors: 140 P-200/128MB FreeBSD/Apache, growing +10/month; Graphics: 15xP6 FreeBSD/Hotmail; Ads: 10xP6 FreeBSD/Apache plus 3xP6 FreeBSD ad pacers; Incoming mail: 25xP-200 FreeBSD/hm-SMTP; Security: 2xP200 FreeBSD; Member Directory / User Store: Sun E3000-class machines with 384 GB RAID5 plus DLT tape robot, Solaris/HMNNFS, ~50 machines (many old), growing 13 + 1.5/month, 1 per million users; M Serv: 4 replicas of SPARC Ultra-1 Solaris; telnet maintenance interface.)
  • 23. DB Clusters (crowds) • 16-node Cluster – 64 cpus – 2 TB of disk – Decision support • 45-node Cluster – 140 cpus – 14 GB DRAM – 4 TB RAID disk – OLTP (Debit Credit) • 1 B tpd (14 k tps)
  • 24. The Microsoft TerraServer Hardware • Compaq AlphaServer 8400 • 8x400Mhz Alpha cpus • 10 GB DRAM • 324 9.2 GB StorageWorks disks – 3 TB raw, 2.4 TB of RAID5 • STK 9710 tape robot (4 TB) • WindowsNT 4 EE, SQL Server 7.0
  • 25. TerraServer: Lots of Web Hits • A billion web hits! • 1 TB, largest SQL DB on the Web • 100 Qps average, 1,000 Qps peak • 877 M SQL queries so far
                  Total     Average    Peak
    Sessions       10 m       77 k     125 k
    Hits        1,065 m      8.1 m      29 m
    Queries       877 m      6.7 m      18 m
    Images        742 m      5.6 m      15 m
    Page Views    170 m      1.3 m     6.6 m
    Users         6.4 m       48 k      76 k
  (Chart: daily sessions, hits, page views, DB queries, and images, 6/22/98 through 10/26/98.)
  • 26. TerraServer Availability • Operating for 13 months • Unscheduled outage: 2.9 hrs • Scheduled outage: 2.0 hrs Software upgrades • Availability: 99.93% overall up • No NT failures (ever) • One SQL7 Beta2 bug • One major operator-assisted outage
  • 27. Backup / Restore • Configuration: StorageTek TimberWolf 9710, DEC StorageWorks UltraSCSI RAID-5 array, Legato NetWorker PowerEdition 4.4a, Windows NT Server Enterprise Edition 4.0 • Performance: 1.2 TB backed up in 7.25 hours; 27 tapes consumed; 10 tape drives; 168 GB/hour aggregate throughput; 16.8 GB/hour (4.97 MB/sec) average per device; 2 NTFS logical volumes
  • 28. Windows NT versus UNIX, best results on an SMP: a semi-log plot of tpmC vs time (Jan-95 to Jan-00) shows a ~3x (~2 year) lead by UNIX. Does not show the Oracle/Alpha cluster at 100,000 tpmC. All these numbers are off-scale huge (40,000 active users?).
  • 29. TPC-C Improvements (MS SQL): 250%/year on price, 100%/year on performance; the bottleneck is the 3 GB address space. 40% hardware, 100% software, 100% PC technology. (Charts: $/tpmC vs time, $10 to $1,000 log scale; tpmC vs time, 100 to 100,000 log scale; Jan-94 to Dec-98.)
  • 30. UNIX (dis)Economy Of Scale: Bang for the Buck. (Chart: tpmC/k$ vs tpmC, 0-50 tpmC/k$ over 0-60,000 tpmC, for Informix, MS SQL Server, Oracle, and Sybase.)
  • 31. Two different pricing regimes (late-1998 prices). TPC price/tpmC by component:
                                                 processor   disk   software   net   total/10
    Sequent/Oracle  (89 k tpmC @ 170 $/tpmC)         47        53       61       9       17
    Sun/Oracle      (52 k tpmC @ 134 $/tpmC)         45        35       30       7       12
    HP+NT4+MS SQL   (16.2 k tpmC @ 33 $/tpmC)         8        17        4       5        3
  • 32. Storage Latency: How far away is the data? (in clock ticks) • Registers: 1 (my head, 1 min) • On-chip cache: 2 (this room) • On-board cache: 10 (this resort, 10 min) • Memory: 100 (Los Angeles, 1.5 hr) • Disk: 10^6 (Pluto, 2 years) • Tape / optical robot: 10^9 (Andromeda, 2,000 years)
  • 33. Thesis: Performance = storage accesses, not instructions executed • In the “old days” we counted instructions and IO’s • Now we count memory references • Processors wait most of the time. (Chart: where the clock ticks go in AlphaSort: disc wait, OS, memory wait, D-cache miss, I-cache miss, B-cache data miss.)
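To make the thesis above concrete, here is a toy cost model. All the numbers in it are illustrative assumptions, not AlphaSort's measured values: the point is only that with a ~100-clock miss penalty, even a modest miss rate makes memory stalls dominate the instruction count.

```python
# Toy model: ticks per record = compute ticks + memory-stall ticks.
def ticks_per_record(instructions, cpi_core, references, miss_rate, miss_penalty):
    compute = instructions * cpi_core                 # ticks spent executing
    stalls = references * miss_rate * miss_penalty    # ticks spent waiting on memory
    return compute, stalls

# Hypothetical record-processing loop: 200 instructions at 1.0 CPI,
# 60 memory references, 5% cache-miss rate, 100-clock miss penalty.
compute, stalls = ticks_per_record(200, 1.0, 60, 0.05, 100)
print(compute, stalls)                 # 200 compute ticks vs 300 stall ticks
print(stalls / (compute + stalls))     # the processor waits ~60% of the time
```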
  • 34. Storage Hierarchy (10 levels) Registers, Cache L1, L2 Main (1, 2, 3 if nUMA). Disk (1 (cached), 2) Tape (1 (mounted), 2)
  • 35. Today’s Storage Hierarchy: Speed & Capacity vs Cost Tradeoffs. (Charts: size vs speed, typical system size 10^3 to 10^15 bytes against access times of 10^-9 to 10^3 seconds; and price vs speed, 10^-4 to 10^4 $/MB over the same access times; for cache, main memory, secondary (disc), online tape, nearline tape, and offline tape.)
  • 36. Meta-Message: Technology Ratios Are Important • If everything gets faster & cheaper at the same rate THEN nothing really changes. • Things getting MUCH BETTER: – communication speed & cost 1,000x – processor speed & cost 100x – storage size & cost 100x • Things staying about the same – speed of light (more or less constant) – people (10x more expensive) – storage speed (only 10x better)
  • 37. Storage Ratios Changed • 10x better access time • 10x more bandwidth • 4,000x lower media price • DRAM/DISK 100:1 to 10:10 to 50:1. (Charts: disk access time and bandwidth vs time; accesses per second and capacity vs time; storage price $/MB vs time; 1980-2000.)
  • 38. The Pico Processor • 1 M SPECmarks • 10^6 clocks per fault to bulk ram • Event-horizon on chip • VM reincarnated • Multi-program cache • Terror Bytes! • (Storage pyramid: Pico Processor with 10 pico-second ram (1 MM^3), 10 nano-second ram, 10 microsecond ram, 10 millisecond disc, 10 second tape archive; capacities from a megabyte through 10 gigabyte and 1 terabyte to 100 terabyte and 100 petabyte.)
  • 39. Bottleneck Analysis • Drawn to linear scale Theoretical Bus Bandwidth 422MBps = 66 Mhz x 64 bits Memory Read/Write ~150 MBps MemCopy ~50 MBps Disk R/W ~9MBps
  • 40. Bottleneck Analysis • NTFS Read/Write: 18 Ultra 3 SCSI disks on 4 strings (2x4 and 2x5), 3 PCI 64 busses • ~155 MBps unbuffered read (175 raw), ~95 MBps unbuffered write • Good, but 10x down from our UNIX brethren (SGI, SUN) • Memory Read/Write ~250 MBps; PCI ~110 MBps; adapter ~70 MBps
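A back-of-envelope way to read the two bottleneck slides: end-to-end throughput is bounded by the slowest stage in the path. The sketch below just applies min() to the figures quoted above; the stage names are my own shorthand.

```python
# Throughput of a data path is bounded by its slowest stage.
stages_mbps = {
    "theoretical bus (66 MHz x 64 bits)": 422,   # figure from slide 39
    "memory read/write": 150,
    "memcopy": 50,
    "single-disk read/write": 9,
}
bottleneck = min(stages_mbps, key=stages_mbps.get)
print(f"bottleneck: {bottleneck} at {stages_mbps[bottleneck]} MBps")

# Striping across many disks (slide 40: 18 disks, 4 strings, 3 PCI busses)
# lifts the disk stage to ~155 MBps read, so the PCI and adapter stages
# (~110 and ~70 MBps) become the new limits.
```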
  • 41. PennySort • Hardware – 266 Mhz Intel PPro – 64 MB SDRAM (10ns) – Dual Fujitsu DMA 3.2GB EIDE disks • Software – NT workstation 4.3 – NT 5 sort • Performance – sort 15 M 100-byte records (~1.5 GB) – Disk to disk – elapsed time 820 sec • cpu time = 404 sec PennySort Machine (1107$ ) board 13% Memory 8% Cabinet + Assembly 7% Network, Video, floppy 9% Software 6% Other 22% cpu 32% Disk 25%
  • 42. Penny Sort Ground Rules http://research.microsoft.com/barc/SortBenchmark • How much can you sort for a penny? – Hardware and software cost – Depreciated over 3 years – A 1M$ system gets about 1 second – A 1K$ system gets about 1,000 seconds – Time (seconds) = 946,080 / SystemPrice ($) • Input and output are disk resident • Input is – 100-byte records (random data) – key is first 10 bytes • Must create output file and fill with sorted version of input file • Daytona (product) and Indy (special) categories
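A quick arithmetic check of the penny budget, as a sketch: the 946,080 constant is just one cent's share of a 3-year (3 x 365 days) depreciation period.

```python
# Penny Sort time budget: how long one penny runs a depreciated system.
THREE_YEARS_S = 3 * 365 * 24 * 3600          # ~94,608,000 seconds
PENNY = 0.01                                 # dollars

def penny_budget_seconds(system_price_dollars: float) -> float:
    """Seconds of system time one penny buys, with 3-year straight-line depreciation."""
    dollars_per_second = system_price_dollars / THREE_YEARS_S
    return PENNY / dollars_per_second        # = 946,080 / price

print(penny_budget_seconds(1_000_000))       # ~0.95 s: a 1M$ system gets about 1 second
print(penny_budget_seconds(1_000))           # ~946 s:  a 1K$ system gets about 1,000 seconds
print(penny_budget_seconds(1_107))           # ~855 s budget for the 1,107$ PennySort machine
```

The 1,107$ machine of slide 41 sorted its 1.5 GB in 820 elapsed seconds, comfortably inside that ~855-second budget.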
  • 43. How Good is NT5 Sort? • CPU and IO not overlapped. • System should be able to sort 2x more • RAM has spare capacity • Disk is space saturated (1.5GB in, 1.5GB out on 3GB drive.) Need an extra 3GB drive or a >6GB drive CPU DiskFixed ram
  • 44. Sandia/Compaq/ServerNet/NT Sort • Sort 1.1 Terabyte (13 Billion records) in 47 minutes • 68 nodes (dual 450 Mhz processors), 543 disks, 1.5 M$ • 1.2 GBps network rap (2.8 GBps pap) • 5.2 GBps of disk rap (same as pap) • (rap = real application performance, pap = peak advertised performance). (Diagram: the 72-node, 48-switch ServerNet-I topology deployed at Sandia National Labs; Compaq ProLiant 1850R servers with 2 400 MHz CPUs, 512 MB SDRAM, 4 SCSI busses each with 2 data disks, and dual-ported ServerNet-I PCI NICs into 6-port crossbar switches; X fabric with 10 and Y fabric with 14 bidirectional bisection links.)
  • 45. SPsort: 2-4 GBps! • 488 nodes in 55 racks: 432 compute nodes in 37 racks, 56 storage nodes in 18 racks; 1952 processors, 732 GB RAM, 2168 disks • Compute rack: 16 nodes, each with 4x332 Mhz PowerPC 604e, 1.5 GB RAM, 1 32x33 PCI bus, a 9 GB scsi disk, and a 150 MBps full-duplex SP switch link • Storage rack: 8 nodes, each with 4x332 Mhz PowerPC 604e, 1.5 GB RAM, 3 32x33 PCI busses, 30x4 GB scsi disks (4+1 RAID5), and a 150 MBps full-duplex SP switch link • The 56 storage nodes manage 1680 4 GB disks as 336 4+P twin-tail RAID5 arrays (30/node). (Chart: GPFS and local read/write rates, 0-4 GB/s, over the ~900-second run.)
  • 46. Progress on Sorting: NT now leads both price and performance • Speedup comes from Moore’s law: 40%/year • Processor/disk/network arrays: 60%/year (this is a software speedup). (Charts: records sorted per second doubles every year; GB sorted per dollar doubles every year; data points 1985-2000 include Bitton M68000, Cray YMP, IBM 3090, Tandem, the Kitsuregawa hardware sorter, Sequent, Intel Hypercube, IBM RS6000, Alpha, NOW, Ordinal+SGI, NT/PennySort, Sandia/Compaq/NT, and SPsort/IBM.)
  • 47. Recent Results • NOW Sort: 9 GB on a cluster of 100 UltraSparcs in 1 minute • MilleniumSort: 16x Dell NT cluster: 100 MB in 1.18 sec (Datamation) • Tandem/Sandia Sort: 68-CPU ServerNet, 1 TB in 47 minutes • IBM SPsort: 408 nodes, 1952 cpus, 2168 disks, 17.6 minutes = 1057 sec (all for 1/3 of 94M$; slice price is 64k$ for 4 cpus, 2 GB ram, 6 9-GB disks + interconnect)
  • 48. Data Gravity Processing Moves to Transducers • Move Processing to data sources • Move to where the power (and sheet metal) is • Processor in – Modem – Display – Microphones (speech recognition) & cameras (vision) – Storage: Data storage and analysis • System is “distributed” (a cluster/mob)
  • 49. Gbps SAN: 110 MBps SAN: Standard Interconnect PCI: 70 MBps UW Scsi: 40 MBps FW scsi: 20 MBps scsi: 5 MBps • LAN faster than memory bus? • 1 GBps links in lab. • 100$ port cost soon • Port is computer • Winsock: 110 MBps (10% cpu utilization at each end) RIP FDDI RIP ATM RIP SCI RIP SCSI RIP FC RIP ?
  • 50. Disk = Node • has magnetic storage (100 GB?) • has processor & DRAM • has SAN attachment • has execution environment OS Kernel SAN driver Disk driver File SystemRPC, ... Services DBMS Applications
  • 52. Standard Storage Metrics • Capacity: – RAM: MB and $/MB: today at 10MB & 100$/MB – Disk: GB and $/GB: today at 10 GB and 200$/GB – Tape: TB and $/TB: today at .1TB and 25k$/TB (nearline) • Access time (latency) – RAM: 100 ns – Disk: 10 ms – Tape: 30 second pick, 30 second position • Transfer rate
  • 53. Kaps, Maps, SCAN? • Kaps: how many KB objects served per second – the file server, transaction processing metric – this is the OLD metric • Maps: how many MB objects served per second – the multi-media metric • SCAN: how long to scan all the data – the data mining and utility metric
  • 54. Good 1998 devices packaged in a system (http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf):
                          DRAM      DISK     TAPE robot (14 tapes)
    Unit capacity (GB)       1        18        35
    Unit price ($)        4000       500     10000
    $/GB                  4000        28        20
    Latency (s)          1.E-7     1.E-2     3.E+1
    Bandwidth (MBps)       500        15         7
    Kaps                 5.E+5     1.E+2     3.E-2
    Maps                 5.E+2     13.04     3.E-2
    Scan time (s/TB)         2      1200     70000
    $/Kaps              9.E-11     5.E-8     3.E-3
    $/Maps               8.E-8     4.E-7     3.E-3
    $/TBscan             $0.08     $0.35      $211
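A sketch of how the derived rows follow from the raw device parameters, assuming a 1 KB object for Kaps, a 1 MB object for Maps, 3-year depreciation for the $/Kaps row, and the tape robot modeled as 14 x 35 GB tapes behind one 7 MBps drive. The outputs reproduce the table above to within rounding.

```python
# Derive Kaps, Maps, scan time, and $/Kaps from capacity, price, latency, bandwidth.
THREE_YEARS_S = 3 * 365 * 24 * 3600          # ~94.6 million seconds

def metrics(name, capacity_gb, price, latency_s, bandwidth_mbps):
    bw = bandwidth_mbps * 1e6                            # bytes per second
    kaps = 1.0 / (latency_s + 1e3 / bw)                  # 1 KB objects served per second
    maps = 1.0 / (latency_s + 1e6 / bw)                  # 1 MB objects served per second
    scan_s = capacity_gb * 1e9 / bw                      # seconds to scan the whole unit
    dollars_per_kaps = (price / THREE_YEARS_S) / kaps    # depreciated cost per KB access
    print(f"{name:10s} Kaps={kaps:9.2e}  Maps={maps:9.2e}  "
          f"scan={scan_s:8.0f}s  $/Kaps={dollars_per_kaps:8.1e}")

metrics("DRAM",          1,   4000, 1e-7, 500)
metrics("DISK",         18,    500, 1e-2,  15)
metrics("TAPE robot", 14*35, 10000, 30,     7)   # 490 GB of tape behind a 7 MBps drive
```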
  • 55. (The same good-1998 device data as the previous slide, charted on a log scale from 10^-12 to 10^6: $/GB, bandwidth, Kaps, Maps, scan time (s/TB), $/Kaps, $/Maps, and $/TBscan for DRAM, disk, and the 14-tape robot. Source: http://www.tpc.org/results/individual_results/Dell/dell.6100.9801.es.pdf)
  • 56. Maps, SCANs need parallelism: use many little devices in parallel. At 10 MB/s it takes 1.2 days to scan 1 Terabyte; with 1,000-way parallelism the SCAN takes 100 seconds. Parallelism: divide a big problem into many smaller ones to be solved in parallel.
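The scan arithmetic above in two lines of code, as a sketch (1 TB and the 10 MB/s device rate are the slide's numbers):

```python
# Scan time for 1 TB: one slow device vs. many little devices in parallel.
TB = 1e12          # bytes
RATE = 10e6        # 10 MB/s per device

def scan_seconds(total_bytes, device_rate, devices=1):
    return total_bytes / (device_rate * devices)

print(scan_seconds(TB, RATE) / 86400)        # ~1.2 days on a single 10 MB/s stream
print(scan_seconds(TB, RATE, devices=1000))  # 100 seconds with 1,000 devices in parallel
```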
  • 57. The 1 TB disc card: an array of discs on a 14" card. Can be used as 100 discs, 1 striped disc, 10 fault-tolerant discs, ....etc. LOTS of accesses/second and bandwidth. Life is cheap, it’s the accessories that cost ya. Processors are cheap, it’s the peripherals that cost ya (a 10k$ disc card).
  • 58. Tape farms, not mainframe silos: many independent tape robots (like a disc farm). One 10K$ robot: 14 tapes, 500 GB, 5 MB/s, 20$/GB, 30 Maps. 100 robots (1M$): 50 TB, 50$/GB, 3K Maps, scan in 27 hours.
  • 59. Myth Optical is cheap: 200 $/platter 2 GB/platter => 100$/GB (2x cheaper than disc) Tape is cheap: 30 $/tape 20 GB/tape => 1.5 $/GB (100x cheaper than disc).
  • 60. Cost Tape needs a robot (10 k$ ... 3 m$ ) 10 ... 1000 tapes (at 20GB each) => 20$/GB ... 200$/GB (1x…10x cheaper than disc) Optical needs a robot (100 k$ ) 100 platters = 200GB ( TODAY ) => 400 $/GB ( more expensive than mag disc ) Robots have poor access times Not good for Library of Congress (25TB) Data motel: data checks in but it never checks out!
  • 61. The Access Time Myth • The myth: seek or pick time dominates • The reality: (1) queuing dominates, (2) transfer dominates BLOBs, (3) disk seeks are often short • Implication: many cheap servers are better than one fast expensive server. (Figure: wait + seek + rotate + transfer components of an access.)
  • 62. What To Do About HIGH Availability • Need a remote MIRRORED site to tolerate environmental failures (power, net, fire, flood) and operations failures • Replicate changes across the net • Failover servers across the net (some distance) • Allows: software upgrades, site moves, fires, ... • Tolerates: operations errors, heisenbugs, ... (Figure: client and server, with state changes shipped to a mirror server >100 feet or >100 miles away.)
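A minimal sketch of the replicate-changes-then-failover idea. The class names and the in-memory "network" are invented for illustration; a real system would ship a recovery log across a WAN and coordinate the failover.

```python
# Toy primary/mirror pair: the primary applies each state change locally,
# then ships it to a remote mirror; on primary failure, clients fail over
# to the mirror, which holds the same state.
class Replica:
    def __init__(self, name):
        self.name = name
        self.state = {}

    def apply(self, change):
        key, value = change
        self.state[key] = value

class Primary(Replica):
    def __init__(self, name, mirror):
        super().__init__(name)
        self.mirror = mirror

    def update(self, key, value):
        change = (key, value)
        self.apply(change)           # local update
        self.mirror.apply(change)    # replicate across the net (synchronously, in this sketch)

mirror = Replica("remote-site")
primary = Primary("local-site", mirror)
primary.update("account:42", 100)
primary.update("account:42", 175)

# Primary site floods, burns, or is taken down for an upgrade: fail over.
survivor = mirror
print(survivor.state["account:42"])  # 175 -- no replicated change is lost
```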
  • 63. Scaleup Has Limits (chart courtesy of Catharine Van Ingen) • Vector supers ~ 10x supers: ~3 Gflops/cpu, bus/memory ~20 GBps, IO ~1 GBps • Supers ~ 10x PCs: ~300 Mflops/cpu, bus/memory ~2 GBps, IO ~1 GBps • PCs are slow: ~30 Mflops/cpu, bus/memory ~200 MBps, IO ~100 MBps. (Chart: Mflop/s/$K vs Mflop/s for LANL Loki P6 Linux, NAS expanded Linux cluster, Cray T3E, IBM SP, SGI Origin 2000-195, Sun Ultra Enterprise 4000, and UCB NOW.)
  • 64. TOP500 Systems by Vendor (courtesy of Larry Smarr, NCSA). TOP500 reports: http://www.netlib.org/benchmark/top500.html. (Chart: number of systems, Jun-93 through Jun-98, by vendor: CRI, SGI, IBM, Convex, HP, Sun, TMC, Intel, DEC, Japanese vector machines, other.)
  • 65. NCSA Super Cluster • National Center for Supercomputing Applications, University of Illinois @ Urbana • 512 Pentium II cpus, 2,096 disks, SAN • Compaq + HP + Myricom + WindowsNT • A super computer for 3M$ • Classic Fortran/MPI programming • DCOM programming model • http://access.ncsa.uiuc.edu/CoverStories/SuperCluster/super.html
  • 66. Avalon: Alpha Clusters for Science (http://cnls.lanl.gov/avalon/) • 140 Alpha processors (533 Mhz) x 256 MB + 3 GB disk + Fast Ethernet switches = 45 Gbytes RAM, 550 GB disk + Linux = 10 real Gflops for $313,000 => 34 real Mflops/k$ (150 benchmark Mflops/k$) • The Beowulf project is the parent: http://www.cacr.caltech.edu/beowulf/naegling.html, 114 nodes, 2k$/node • Scientists want cheap mips.
  • 67. Your Tax Dollars At Work: ASCI for Stockpile Stewardship • Intel/Sandia: 9000x1 node PPro • LLNL/IBM: 512x8 PowerPC (SP2) • LANL/Cray: ? • Maui Supercomputer Center: 512x1 SP2
  • 68. Observations • Uniprocessor RAP << PAP: real app performance << peak advertised performance • Growth has slowed (Bell Prize): 1987: 0.5 GFLOPS; 1988: 1.0 GFLOPS (1 year); 1990: 14 GFLOPS (2 years); 1994: 140 GFLOPS (4 years); 1997: 604 GFLOPS; 1998: 1600 G__OPS (4 years)
  • 69. Two Generic Kinds of computing • Many little – embarrassingly parallel – Fit RPC model – Fit partitioned data and computation model – Random works OK – OLTP, File Server, Email, Web,….. • Few big – sometimes not obviously parallel – Do not fit RPC model (BIG rpcs) – Scientific, simulation, data mining, ...
  • 70. Many Little Programming Model • many small requests • route requests to data • encapsulate data with procedures (objects) • three-tier computing • RPC is a convenient/appropriate model • Transactions are a big help in error handling • Auto partition (e.g. hash data and computation) • Works fine. • Software CyberBricks
  • 71. Object Oriented Programming Parallelism From Many Little Jobs • Gives location transparency • ORB/web/tpmon multiplexes clients to servers • Enables distribution • Exploits embarrassingly parallel apps (transactions) • HTTP and RPC (dcom, corba, rmi, iiop, …) are basis Tp mon / orb/ web server
  • 72. Few Big Programming Model • Finding parallelism is hard – Pipelines are short (3x …6x speedup) • Spreading objects/data is easy, but getting locality is HARD • Mapping big job onto cluster is hard • Scheduling is hard – coarse grained (job) and fine grain (co-schedule) • Fault tolerance is hard
  • 73. Kinds of Parallel Execution • Pipeline: any sequential program feeds any other sequential program, stage by stage • Partition: inputs split N ways, outputs merge M ways, with the same sequential program running on each partition
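A small sketch contrasting the two forms. Ordinary Python generators stand in for "any sequential program"; the record format, the key, and the five-way split are arbitrary choices for illustration.

```python
# Pipeline parallelism: each stage is a plain sequential program; stages
# overlap because each one consumes the previous stage's output stream.
def read_records(n):                      # stage 1: produce records
    for i in range(n):
        yield {"key": i % 5, "value": i}

def enrich(records):                      # stage 2: transform each record
    for r in records:
        r["value"] *= 10
        yield r

def total(records):                       # stage 3: aggregate
    return sum(r["value"] for r in records)

print(total(enrich(read_records(100))))   # a 3-stage pipeline

# Partition parallelism: split the input N ways by key, run the same
# sequential program on each partition, then merge the N partial results.
def partition(records, n_parts):
    parts = [[] for _ in range(n_parts)]
    for r in records:
        parts[r["key"] % n_parts].append(r)       # hash partitioning
    return parts

parts = partition(list(read_records(100)), 5)
partial_sums = [total(enrich(iter(p))) for p in parts]  # one worker per partition
print(sum(partial_sums))                                # merge: same answer
```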
  • 74. Why Parallel Access To Data? At 10 MB/s it takes 1.2 days to scan 1 Terabyte; with 1,000-way parallelism the scan takes 100 seconds. Parallelism: divide a big problem into many smaller ones to be solved in parallel. The key is BANDWIDTH.
  • 75. Why are Relational Operators Successful for Parallelism? The relational data model gives uniform operators on uniform data streams, closed under composition. Each operator consumes 1 or 2 input streams; each stream is a uniform collection of data; sequential data in and out: pure dataflow. Partitioning some operators (e.g. aggregates, non-equi-join, sort, ...) requires innovation. AUTOMATIC PARALLELISM.
  • 76. Database Systems “Hide” Parallelism • Automate system management via tools –data placement –data organization (indexing) –periodic tasks (dump / recover / reorganize) • Automatic fault tolerance –duplex & failover –transactions • Automatic parallelism –among transactions (locking) –within a transaction (parallel execution)
  • 77. SQL: a Non-Procedural Programming Language • SQL is a functional programming language: it describes the answer set • The optimizer picks the best execution plan – the data flow web (pipeline) – the degree of parallelism (partitioning) – other execution parameters (process placement, memory, ...). (Figure components: GUI, Schema, Optimizer, Plan, Execution Planning, Monitor, Rivers, Executors.)
  • 78. Partitioned Execution: partitioned data gives NATURAL parallelism and spreads computation and IO among processors. (Figure: a table partitioned into key ranges A...E, F...J, K...N, O...S, T...Z, with a Count operator on each partition and the partial counts merged into a total Count.)
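A tiny illustration of that partitioned count, as a sketch: the range boundaries, the sample rows, and the thread pool standing in for "one process per processor/disk" are all made up.

```python
# Range-partition a table on its key, count each partition in parallel,
# then merge the partial counts.
from concurrent.futures import ThreadPoolExecutor
from bisect import bisect_right

BOUNDARIES = ["F", "K", "O", "T"]        # partitions: A..E, F..J, K..N, O..S, T..Z

def partition_of(name):
    return bisect_right(BOUNDARIES, name[0].upper())

table = ["able", "fox", "kilo", "oscar", "tango", "zulu", "echo", "november"]

partitions = [[] for _ in range(len(BOUNDARIES) + 1)]
for row in table:
    partitions[partition_of(row)].append(row)        # data placement by key range

with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    partial_counts = list(pool.map(len, partitions))  # Count runs on each partition

print(partial_counts, sum(partial_counts))            # merge step: the total count
```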
  • 79. N x M way Parallelism: partitioned data, partitioned and pipelined data flows. N inputs, M outputs, no bottlenecks. (Figure: five key-range partitions A...E through T...Z each feed a Sort and a Join; the join outputs are merged by three Merge operators.)
  • 80. Automatic Parallel Object Relational DB: Select image from landsat where date between 1970 and 1990 and overlaps(location, :Rockies) and snow_cover(image) > .7; (Figure: the Landsat table (date, location, image) is scanned with one process per processor/disk: find images with the right date and location, analyze each image, and if it is 70% snow, return it. Answer: date, location, and image.)
  • 81. Data Rivers: Split + Merge Streams. N producers and M consumers share N x M data streams: producers add records to the river, consumers consume records from the river. Purely sequential programming: the river does flow control and buffering, and does the partition and merge of data records. River = Split/Merge in Gamma = Exchange operator in Volcano / SQL Server.
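A compact sketch of an N x M river built from thread-safe queues. N, M, the hash split, and the record format are arbitrary choices; the point is that producers and consumers stay purely sequential while the river handles partitioning, buffering, and flow control.

```python
# N producers write records into a river; the river hash-partitions them
# across M buffered consumer streams; each consumer is a plain sequential
# program reading its own stream.
import queue
import threading

N_PRODUCERS, M_CONSUMERS = 3, 2
DONE = object()                                    # end-of-stream marker

streams = [queue.Queue(maxsize=100) for _ in range(M_CONSUMERS)]  # buffering + flow control

def producer(pid):
    for i in range(5):
        record = (pid, i)
        streams[hash(record) % M_CONSUMERS].put(record)   # split (partition) step

def consumer(cid, results):
    count = 0
    while True:
        rec = streams[cid].get()
        if rec is DONE:
            break
        count += 1                                 # any sequential per-record work goes here
    results[cid] = count

results = {}
producers = [threading.Thread(target=producer, args=(p,)) for p in range(N_PRODUCERS)]
consumers = [threading.Thread(target=consumer, args=(c, results)) for c in range(M_CONSUMERS)]
for t in producers + consumers:
    t.start()
for t in producers:
    t.join()
for s in streams:
    s.put(DONE)                                    # one end-of-stream marker per consumer
for t in consumers:
    t.join()
print(results, sum(results.values()))              # merge: 15 records total
```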
  • 82. Generalization: Object-oriented Rivers • Rivers transport sub-class of record-set (= stream of objects) – record type and partitioning are part of subclass • Node transformers are data pumps – an object with river inputs and outputs – do late-binding to record-type • Programming becomes data flow programming – specify the pipelines • Compiler/Scheduler does data partitioning and “transformer” placement
  • 83. NT Cluster Sort as a Prototype • Using – data generation and – sort as a prototypical app • “Hello world” of distributed processing • goal: easy install & execute
  • 85. Cluster Startup / Execution • Setup: MULTI_QI struct, COSERVERINFO struct • CoCreateInstanceEx() • Retrieve the remote object handle from the MULTI_QI struct • Invoke methods as usual (Figure: three remote handles, each invoking Sort().)
  • 86. Cluster Sort Conceptual Model • Multiple data sources, multiple data destinations, multiple nodes • Disks -> sockets -> disk -> disk. (Figure: three nodes A, B, C each start with a mixed local file of A, B, and C records; each node reads its local data and sends A records to node A, B records to node B, and C records to node C; each node then writes its own partition to its disks.)
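A single-process sketch of that disk -> socket -> disk flow. Three in-memory "nodes" stand in for machines and sockets, and routing on the first character of the key is the simplest possible ownership rule.

```python
# Each node scans its local input and routes every record to the node that
# owns that key range; each node then sorts what it received and "writes" it.
NODES = ["A", "B", "C"]

# Local input files: every node starts with a mix of A, B, and C records.
local_input = {
    "A": ["A3", "B1", "C2", "A1"],
    "B": ["B2", "C3", "A2", "B3"],
    "C": ["C1", "A4", "B4", "C4"],
}

inbox = {n: [] for n in NODES}           # stands in for the sockets between nodes

# Phase 1: read local disk, split by key, send each record to its owner.
for node, records in local_input.items():
    for rec in records:
        inbox[rec[0]].append(rec)        # route on the first character of the key

# Phase 2: every node sorts its own partition and writes it to local disk.
output = {node: sorted(recs) for node, recs in inbox.items()}
print(output)    # {'A': ['A1', 'A2', 'A3', 'A4'], 'B': [...], 'C': [...]}
```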
  • 87. How Do They Talk to Each Other? • Each node has an OS • Each node has local resources: a federation • Each node does not completely trust the others • Nodes use RPC to talk to each other – CORBA? DCOM? IIOP? RMI? – one or all of the above • Huge leverage in high-level interfaces • Same old distributed system story. (Figure: two stacks of applications over RPC / datagrams / streams, one on the raw wire(s) and one on VIAL/VIPL.)

Editor's Notes

  1. This chart makes the point that processor speed is now limited by bus and memory latency. We have an ongoing effort to measure and improve this latency.