The document provides an outline for a lecture on massively parallel computing. It discusses how modeling and simulation problems require high-performance computing and are driving the development of new computing architectures. It mentions some of the world's most powerful supercomputers like Roadrunner and Tianhe-1A. It also discusses how cloud computing, data processing needs, and gaming are contributing to growth in parallel computing. The document outlines how data from science experiments and the web are exploding in size and driving the need for new parallel and distributed solutions.
The Challenges facing Libraries and Imperative Languages from Massively Paral... (Jason Hearne-McGuiness)
The document discusses challenges related to parallel processing and massive parallel architectures. It covers topics like pipeline processors, multiprocessors, processing in memory architectures like Cyclops and picoChip, and cellular architectures. It also discusses code generation issues that arise from massive parallelism and possible solutions using compilers or libraries.
This document discusses massively parallel architectures and processing in memory (PIM) as ways to overcome the memory wall problem. It describes several PIM and cellular architectures including Cyclops, Gilgamesh, Shamrock, picoChip and DIMES. DIMES is an FPGA implementation of a simplified cellular architecture that was used by Jason McGuiness to test programming approaches. The talk concludes with an invitation for questions.
- IBM's Tivoli Storage Manager (TSM) provides data protection, backup and recovery for both physical and virtual environments.
- TSM 6.4 includes enhancements like incremental 'forever' VMware backups, application-aware Microsoft backups, and SAP HANA support.
- The presentation discusses IBM's strategy to optimize storage infrastructure through virtualization, data reduction, analytics and automation.
Greenplum is the first open source Massively Parallel Processing (MPP) data warehouse, built from over two million lines of code. MPP allows a program to run across multiple processors that each use their own memory and operating system. Greenplum was released under the Apache license and differs functionally and architecturally from other open source data systems through its use of MPP to execute complex SQL analytics over large datasets at high speed. As an open source system, Greenplum assures customers that their software needs will be met long-term.
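To make the shared-nothing idea concrete, here is a minimal Python sketch (illustrative only, not Greenplum code): each worker process owns its own partition of the data in its own memory, and a coordinator merges the partial results, which is roughly how an MPP engine parallelizes an aggregate query.

    # Shared-nothing aggregation sketch: each worker aggregates a private
    # partition ("segment") in its own process; the coordinator merges partials.
    from multiprocessing import Pool

    def partial_sum(partition):
        return sum(partition)           # runs in a separate process/memory space

    if __name__ == "__main__":
        data = list(range(1_000_000))
        n_workers = 4
        partitions = [data[i::n_workers] for i in range(n_workers)]
        with Pool(n_workers) as pool:
            partials = pool.map(partial_sum, partitions)
        print("total =", sum(partials))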
Massively Parallel Processing with Procedural Python - Pivotal HAWQ (InMobi Technology)
The document discusses massively parallel processing using procedural Python. It describes EMC Corporation and its subsidiaries which provide data storage, virtualization, security, and other software solutions. It also discusses Pivotal's open source contributions and the architecture of its HAWQ database which allows Python user-defined functions to perform parallel operations across clusters.
Real-Time Distributed and Reactive Systems with Apache Kafka and Apache Accumulo (Joe Stein)
In this talk we will walk through how Apache Kafka and Apache Accumulo can be used together to orchestrate a decoupled, real-time distributed and reactive request/response system at massive scale. Multiple data pipelines can perform complex operations for each message in parallel at high volumes with low latencies. The final result arrives in line with the initiating call. The architectural gains are immense: the requesting system receives a response without needing direct integration with the data pipeline(s) its messages must pass through. By using Apache Kafka and Apache Accumulo, these gains hold at scale and allow complex operations on different messages to be applied to each response in real time.
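As a rough illustration of the decoupling described here, the sketch below uses the kafka-python client; the topic names and the correlation-id convention are illustrative assumptions, not the speaker's actual design:

    # Decoupled request/response over Kafka (kafka-python client).
    # The requester publishes to one topic and waits on another; it never
    # integrates directly with the pipeline stages in between.
    import uuid
    from kafka import KafkaProducer, KafkaConsumer

    producer = KafkaProducer(bootstrap_servers="localhost:9092")
    consumer = KafkaConsumer("responses", bootstrap_servers="localhost:9092")

    corr_id = uuid.uuid4().hex.encode()        # tag the request
    producer.send("requests", key=corr_id, value=b"payload")
    producer.flush()

    # Pipelines consume "requests", do their work in parallel, and publish
    # results to "responses" under the same key.
    for msg in consumer:
        if msg.key == corr_id:
            print("got response:", msg.value)
            break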
Parallel programming is now an unavoidable answer to performance problems. It is not the only one, but it can no longer be ignored. The many cores and CPUs that populate our servers are proof of that.
It can also be used more often than one might think, whether to reduce response times or to increase throughput.
We offer a survey of the current landscape. What are the use cases? How easy is it? How do you guard against the complexity? CPU or GPU?
With code examples, everything the up-to-date developer needs to carry in their toolkit.
"AI" for Blockchain Security (Case Study: Cosmos)npinto
This document discusses preliminary work using machine learning techniques to help improve blockchain security. It outlines initial experiments using a Cosmos SDK simulator to generate test data and identify "bug correlates" that could help predict vulnerabilities. Several bugs were already found in the simulator itself. The goal is to focus compute resources on more interesting test runs likely to produce bugs. This is an encouraging first step in exploring how AI may augment blockchain security testing.
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201... (npinto)
This document discusses using high-performance computing for machine learning tasks like analyzing large convolutional neural networks for visual object recognition. It proposes running hundreds of thousands of large neural network models in parallel on GPUs to more efficiently search the parameter space, beyond what is normally possible with a single graduate student and model. This high-throughput screening approach aims to identify better performing network architectures through exploring a vast number of possible combinations in the available parameter space.
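The screening approach is essentially an embarrassingly parallel random search. A toy Python sketch, with a placeholder score() standing in for actually training and evaluating a model (all names and parameter ranges here are made up for illustration):

    # High-throughput screening sketch: draw many random model configurations,
    # evaluate them in parallel, keep the best one.
    import random
    from multiprocessing import Pool

    def score(params):
        n_filters, lr = params
        # Placeholder objective; a real run would train and test a network.
        return -abs(n_filters - 64) - 1000 * abs(lr - 0.01)

    def draw(_):
        return (random.choice([16, 32, 64, 128]),
                random.choice([0.1, 0.01, 0.001]))

    if __name__ == "__main__":
        candidates = [draw(i) for i in range(10_000)]
        with Pool() as pool:
            scores = pool.map(score, candidates)
        print("best:", max(zip(scores, candidates)))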
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi... (npinto)
The document discusses challenges with parallel programming on GPUs, including tasks whose data dependences are not statically known, SIMD divergence, and the lack of fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on different GPU and CPU architectures, with GPUs providing much higher sorting throughput than CPUs. Parallel prefix sum is proposed as a method for allocating work in parallel tasks that require dynamic scheduling or allocation.
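For context on why a prefix sum helps here: given the number of outputs each parallel task will produce, an exclusive scan yields every task's write offset, so all tasks can write into one shared output array without locks. A small numpy illustration:

    # Exclusive prefix sum ("scan") used to allocate output space for tasks.
    import numpy as np

    counts = np.array([3, 0, 2, 5, 1])                       # outputs per task
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))  # exclusive scan
    out = np.empty(counts.sum(), dtype=np.int64)

    # Each task can now write independently at its own offset, lock-free.
    for task, (off, n) in enumerate(zip(offsets, counts)):
        out[off:off + n] = task
    print(offsets)   # [ 0  3  3  5 10]
    print(out)       # [0 0 0 2 2 3 3 3 3 3 4]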
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect... (npinto)
The document discusses changes in computer architecture and Microsoft's role in the transition to parallel computing. It notes that computer cores are increasing rapidly and that Microsoft aims to make parallelism accessible to all developers through tools like Visual Studio. It also outlines Microsoft's involvement in GPU computing through technologies like DirectX and efforts to support GPU programming across its software stack.
The document discusses dynamic compilation for massively parallel processors. It describes how execution models provide an interface between programming languages and hardware architectures. Emerging execution models like bulk-synchronous parallel and PTX aim to abstract parallelism on heterogeneous multi-core and many-core processors. The document outlines how dynamic compilers can translate between execution models and target instructions to different core architectures through techniques like thread fusion, vectorization, and subkernel extraction. This bridging of models and architectures through just-in-time compilation helps program entire processors rather than individual cores.
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr... (npinto)
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform program transformations like loop transformations, fusion, distribution and tiling to optimize for parallelism and locality. It models the target machine and uses this to inform the mapping of operations to resources like GPUs.
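As a plain-Python illustration of the kind of transformation described above (the general tiling idea, not R-Stream output), tiling a matrix multiply keeps each tile's working set in fast memory:

    # Loop tiling for locality: process the matrices one T x T tile at a time.
    import numpy as np

    def matmul_tiled(A, B, T=32):
        n = A.shape[0]
        C = np.zeros((n, n))
        for ii in range(0, n, T):
            for jj in range(0, n, T):
                for kk in range(0, n, T):
                    # Each tile product touches a small, cache-friendly working set.
                    C[ii:ii+T, jj:jj+T] += A[ii:ii+T, kk:kk+T] @ B[kk:kk+T, jj:jj+T]
        return C

    A = np.random.rand(128, 128)
    B = np.random.rand(128, 128)
    assert np.allclose(matmul_tiled(A, B), A @ B)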
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St... (npinto)
The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
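For reference, the CSR (compressed sparse row) format mentioned above stores only the nonzeros plus row extents, and SpMV becomes one independent dot product per row, which is what makes it attractive to parallelize. A minimal numpy sketch:

    # Sparse matrix-vector multiply (SpMV) over the CSR format.
    import numpy as np

    # CSR encoding of [[10, 0, 0], [0, 20, 30], [0, 0, 40]]
    data    = np.array([10., 20., 30., 40.])   # nonzero values
    indices = np.array([0, 1, 2, 2])           # column of each nonzero
    indptr  = np.array([0, 1, 3, 4])           # row r owns data[indptr[r]:indptr[r+1]]
    x = np.array([1., 2., 3.])

    y = np.zeros(len(indptr) - 1)
    for r in range(len(y)):                    # each row is an independent task
        lo, hi = indptr[r], indptr[r + 1]
        y[r] = np.dot(data[lo:hi], x[indices[lo:hi]])
    print(y)   # [ 10. 130. 120.]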
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli... (npinto)
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau... (npinto)
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le... (npinto)
This document summarizes a paper about using high-level programming languages for low-level systems programming. It discusses the needs of scientists and engineers for software that is reliable, high-performance, and customizable. The paper aims to address these needs by exploring features of high-level languages that could enable low-level programming tasks typically done in C/C++, like developing device drivers, operating systems, and embedded systems.
This document outlines Andreas Klockner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl... (npinto)
Abstract:
Machine learning researchers and practitioners develop computer algorithms that "improve performance automatically through experience". At Google, machine learning is applied to solve many problems, such as prioritizing emails in Gmail, recommending tags for YouTube videos, and identifying different aspects from online user reviews. Machine learning on big data, however, is challenging. Some "simple" machine learning algorithms with quadratic time complexity, while running fine on hundreds of records, are almost impractical to use on billions of records.
In this talk, I will describe lessons drawn from various Google projects on developing large-scale machine learning systems. These systems build on top of Google's computing infrastructure, such as GFS and MapReduce, and attack the scalability problem through massively parallel algorithms. I will present the design decisions made in these systems and strategies for scaling and speeding up machine learning systems on web-scale data.
Speaker biography:
Max Lin is a software engineer in Google Research's New York City office. He is the tech lead of the Google Prediction API, a machine learning web service in the cloud. Prior to Google, he published research work on video content analysis, sentiment analysis, machine learning, and cross-lingual information retrieval. He holds a PhD in Computer Science from Carnegie Mellon University.
    Creating cluster 'mycluster' with the following settings:
    - Master node: m1.small using ami-fce3c696
    - Number of nodes: 1
    - Node type: m1.small
    - Node AMI: ami-fce3c696
    - Storage: EBS volume of size 10 GB
    - Security group: mycluster-sg allowing SSH from anywhere
    Launching instances...
    This may take a few minutes. You can check progress with 'starcluster list'.
    When instances have started, SSH will be automatically configured.
    You can now ssh to the master with:
    starcluster ssh mycluster
    Have fun and please let us know if you have
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard) (npinto)
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
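To give a flavor of Hadoop programming, the canonical word-count job can be written as two small Python scripts for Hadoop Streaming, which pipes records through stdin/stdout. A sketch (file names are illustrative):

    # mapper.py -- emit a <word, 1> pair for every word read from stdin.
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- Hadoop delivers mapper output sorted by key, so equal words
    # arrive contiguously; sum the counts per word.
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))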
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
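In the MPI model the lecture introduces, every rank runs the same program and communicates explicitly. A minimal sketch with mpi4py (not the lecture's code), launched with e.g. mpirun -n 4 python mpi_sum.py:

    # Every MPI rank computes a partial result; rank 0 gathers the reduction.
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    size = comm.Get_size()

    # Each rank owns one slice of the problem (here, a chunk of integers).
    partial = sum(range(rank * 1000, (rank + 1) * 1000))

    total = comm.reduce(partial, op=MPI.SUM, root=0)   # explicit communication
    if rank == 0:
        print("total across", size, "ranks:", total)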
This document summarizes a lecture on CUDA Ninja Tricks given on March 1st, 2011. The lecture covered scripting GPUs with PyCUDA, meta-programming and RTCG, and a case study in brain-inspired AI. It included sections on why scripting is useful for GPUs, an introduction to GPU scripting with PyCUDA, and a hands-on example of a simple PyCUDA program that defines and runs a CUDA kernel to double the values in a GPU memory array.
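The doubling example referred to above looks roughly like the standard PyCUDA demo; sketched here from that demo rather than from the lecture's exact code:

    # Compile a CUDA kernel at run time with PyCUDA and launch it to double
    # every element of an array on the GPU.
    import numpy as np
    import pycuda.autoinit                      # creates a CUDA context
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    mod = SourceModule("""
    __global__ void double_them(float *a)
    {
        int idx = threadIdx.x + blockIdx.x * blockDim.x;
        a[idx] *= 2.0f;
    }
    """)
    double_them = mod.get_function("double_them")

    a = np.random.randn(400).astype(np.float32)
    doubled = a.copy()
    double_them(drv.InOut(doubled), block=(400, 1, 1), grid=(1, 1))
    assert np.allclose(doubled, 2 * a)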
[Harvard CS264] 05 - Advanced-level CUDA Programming (npinto)
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
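The transpose pattern described above, sketched with PyCUDA; the 16x16 tile size and the +1 padding are the usual conventions, assumed here rather than taken from the document:

    # Coalesced matrix transpose: threads cooperatively load a 16x16 tile into
    # shared memory, then write it out transposed so that both the global reads
    # and the global writes hit contiguous addresses.
    import numpy as np
    import pycuda.autoinit
    import pycuda.driver as drv
    from pycuda.compiler import SourceModule

    TILE = 16
    mod = SourceModule("""
    #define TILE 16
    __global__ void transpose(float *out, const float *in, int n)
    {
        __shared__ float tile[TILE][TILE + 1];   // +1 pad avoids bank conflicts
        int x = blockIdx.x * TILE + threadIdx.x;
        int y = blockIdx.y * TILE + threadIdx.y;
        tile[threadIdx.y][threadIdx.x] = in[y * n + x];   // coalesced read
        __syncthreads();
        x = blockIdx.y * TILE + threadIdx.x;              // swap block indices
        y = blockIdx.x * TILE + threadIdx.y;
        out[y * n + x] = tile[threadIdx.x][threadIdx.y];  // coalesced write
    }
    """)
    transpose = mod.get_function("transpose")

    n = 256
    a = np.random.rand(n, n).astype(np.float32)
    out = np.empty_like(a)
    transpose(drv.Out(out), drv.In(a), np.int32(n),
              block=(TILE, TILE, 1), grid=(n // TILE, n // TILE, 1))
    assert np.allclose(out, a.T)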
[Harvard CS264] 04 - Intermediate-level CUDA Programming (npinto)
This document provides an overview and summary of key points from a lecture on massively parallel computing using CUDA. The lecture covers CUDA language and APIs, threading and execution models, memory and communication, tools, and libraries. It discusses the CUDA programming model including host and device code, threads and blocks, and memory allocation and transfers between the host and device. It also summarizes the CUDA runtime and driver APIs for launching kernels and managing devices at different levels of abstraction.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basics (npinto)
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
"AI" for Blockchain Security (Case Study: Cosmos)npinto
This document discusses preliminary work using machine learning techniques to help improve blockchain security. It outlines initial experiments using a Cosmos SDK simulator to generate test data and identify "bug correlates" that could help predict vulnerabilities. Several bugs were already found in the simulator itself. The goal is to focus compute resources on more interesting test runs likely to produce bugs. This is an encouraging first step in exploring how AI may augment blockchain security testing.
High-Performance Computing Needs Machine Learning... And Vice Versa (NIPS 201...npinto
This document discusses using high-performance computing for machine learning tasks like analyzing large convolutional neural networks for visual object recognition. It proposes running hundreds of thousands of large neural network models in parallel on GPUs to more efficiently search the parameter space, beyond what is normally possible with a single graduate student and model. This high-throughput screening approach aims to identify better performing network architectures through exploring a vast number of possible combinations in the available parameter space.
[Harvard CS264] 16 - Managing Dynamic Parallelism on GPUs: A Case Study of Hi...npinto
The document discusses challenges with parallel programming on GPUs including tasks with statically known data dependences, SIMD divergence, lack of fine-grained synchronization and writeable coherent caches. It also presents performance results for sorting algorithms on different GPU and CPU architectures, with GPUs providing much higher sorting throughput than CPUs. Parallel prefix sum is proposed as a method for allocating work in parallel tasks that require dynamic scheduling or allocation.
[Harvard CS264] 15a - The Onset of Parallelism, Changes in Computer Architect...npinto
The document discusses changes in computer architecture and Microsoft's role in the transition to parallel computing. It notes that computer cores are increasing rapidly and that Microsoft aims to make parallelism accessible to all developers through tools like Visual Studio. It also outlines Microsoft's involvement in GPU computing through technologies like DirectX and efforts to support GPU programming across its software stack.
The document discusses dynamic compilation for massively parallel processors. It describes how execution models provide an interface between programming languages and hardware architectures. Emerging execution models like bulk-synchronous parallel and PTX aim to abstract parallelism on heterogeneous multi-core and many-core processors. The document outlines how dynamic compilers can translate between execution models and target instructions to different core architectures through techniques like thread fusion, vectorization, and subkernel extraction. This bridging of models and architectures through just-in-time compilation helps program entire processors rather than individual cores.
[Harvard CS264] 13 - The R-Stream High-Level Program Transformation Tool / Pr...npinto
The document describes the R-Stream high-level program transformation tool. It provides an overview of R-Stream, walks through the compilation process, and discusses performance results. R-Stream uses the polyhedral model to perform program transformations like loop transformations, fusion, distribution and tiling to optimize for parallelism and locality. It models the target machine and uses this to inform the mapping of operations to resources like GPUs.
[Harvard CS264] 12 - Irregular Parallelism on the GPU: Algorithms and Data St...npinto
The document discusses irregular parallelism on GPUs and presents several algorithms and data structures for handling irregular workloads efficiently in parallel. It covers sparse matrix-vector multiplication using different sparse matrix formats. It also discusses compositing of fragments in parallel and presents a nested data parallel approach. The document describes challenges with parallel hashing and presents a two-level hashing scheme. It analyzes parallel task queues and work stealing techniques for load balancing irregular work. Throughout, it focuses on managing communication in addition to computation for optimal parallel performance.
[Harvard CS264] 11b - Analysis-Driven Performance Optimization with CUDA (Cli...npinto
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 11a - Programming the Memory Hierarchy with Sequoia (Mike Bau...npinto
This document discusses performance optimization of GPU kernels. It outlines analyzing kernels to determine if they are limited by memory bandwidth, instruction throughput, or latency. The profiler can identify limiting factors by comparing memory transactions and instructions issued. Source code modifications for memory-only and math-only versions help analyze memory vs computation balance and latency hiding. The goal is to optimize kernels by addressing their most significant performance limiters.
[Harvard CS264] 10b - cl.oquence: High-Level Language Abstractions for Low-Le...npinto
This document summarizes a paper about using high-level programming languages for low-level systems programming. It discusses the needs of scientists and engineers for software that is reliable, high-performance, and customizable. The paper aims to address these needs by exploring features of high-level languages that could enable low-level programming tasks typically done in C/C++, like developing device drivers, operating systems, and embedded systems.
This document outlines Andreas Klockner's presentation on GPU programming in Python using PyOpenCL and PyCUDA. The presentation covers an introduction to OpenCL, programming with PyOpenCL, run-time code generation, and perspectives on GPU programming in Python. OpenCL provides a common programming framework for heterogeneous parallel programming across CPUs, GPUs, and other processors. PyOpenCL and PyCUDA allow GPU programming from Python.
[Harvard CS264] 09 - Machine Learning on Big Data: Lessons Learned from Googl...npinto
Abstract:
Machine learning researchers and practitioners develop computer
algorithms that "improve performance automatically through
experience". At Google, machine learning is applied to solve many
problems, such as prioritizing emails in Gmail, recommending tags for
YouTube videos, and identifying different aspects from online user
reviews. Machine learning on big data, however, is challenging. Some
"simple" machine learning algorithms with quadratic time complexity,
while running fine with hundreds of records, are almost impractical to
use on billions of records.
In this talk, I will describe lessons drawn from various Google
projects on developing large scale machine learning systems. These
systems build on top of Google's computing infrastructure such as GFS
and MapReduce, and attack the scalability problem through massively
parallel algorithms. I will present the design decisions made in
these systems, strategies of scaling and speeding up machine learning
systems on web scale data.
Speaker biography:
Max Lin is a software engineer with Google Research in New York City
office. He is the tech lead of the Google Prediction API, a machine
learning web service in the cloud. Prior to Google, he published
research work on video content analysis, sentiment analysis, machine
learning, and cross-lingual information retrieval. He had a PhD in
Computer Science from Carnegie Mellon University.
Creating cluster 'mycluster' with the following settings:
- Master node: m1.small using ami-fce3c696
- Number of nodes: 1
- Node type: m1.small
- Node AMI: ami-fce3c696
- Storage: EBS volume of size 10 GB
- Security group: mycluster-sg allowing SSH from anywhere
Launching instances...
This may take a few minutes. You can check progress with 'starcluster list'.
When instances have started, SSH will be automatically configured.
You can now ssh to the master with:
starcluster ssh mycluster
Have fun and please let us know if you have
[Harvard CS264] 08b - MapReduce and Hadoop (Zak Stone, Harvard)npinto
This document provides an introduction and overview of Hadoop, an open-source software framework for distributed storage and processing of large datasets across clusters of computers. It outlines what Hadoop is, how its core components MapReduce and HDFS work, advantages like scalability and fault tolerance, disadvantages like complexity, and resources for getting started with Hadoop installations and programming.
This document summarizes an MIT lecture on GPU cluster programming using MPI. It provides administrative details such as homework due dates and project information. It also announces various donations of computing resources for the class, including Amazon AWS credits and a Tesla graphics card for the best project. The lecture outline covers the problem of computations too large for a single CPU, an introduction to MPI, MPI basics, using MPI with CUDA, and other parallel programming approaches.
This document summarizes a lecture on CUDA Ninja Tricks given on March 1st, 2011. The lecture covered scripting GPUs with PyCUDA, meta-programming and RTCG, and a case study in brain-inspired AI. It included sections on why scripting is useful for GPUs, an introduction to GPU scripting with PyCUDA, and a hands-on example of a simple PyCUDA program that defines and runs a CUDA kernel to double the values in a GPU memory array.
[Harvard CS264] 05 - Advanced-level CUDA Programmingnpinto
The document discusses optimizations for memory and communication in massively parallel computing. It recommends caching data in faster shared memory to reduce loads and stores to global device memory. This can improve performance by avoiding non-coalesced global memory accesses. The document provides an example of coalescing writes for a matrix transpose by first loading data into shared memory and then writing columns of the tile to global memory in contiguous addresses.
[Harvard CS264] 04 - Intermediate-level CUDA Programmingnpinto
This document provides an overview and summary of key points from a lecture on massively parallel computing using CUDA. The lecture covers CUDA language and APIs, threading and execution models, memory and communication, tools, and libraries. It discusses the CUDA programming model including host and device code, threads and blocks, and memory allocation and transfers between the host and device. It also summarizes the CUDA runtime and driver APIs for launching kernels and managing devices at different levels of abstraction.
[Harvard CS264] 03 - Introduction to GPU Computing, CUDA Basicsnpinto
1. GPUs have many more cores than CPUs and are very good at processing large blocks of data in parallel.
2. GPUs can provide a significant speedup over CPUs for applications that map well to a data-parallel programming model by harnessing the power of many cores.
3. The throughput-oriented nature of GPUs makes them well-suited for algorithms where the same operation can be performed on many data elements independently.
16. Massively Parallel Computing
[word-cloud graphic: Supercomputing, Many-core Computing, MPC, High-Throughput Computing, Cloud Computing, Human "Computing"?]
17. Massively Parallel Computing
[same word-cloud graphic as slide 16]
19. Modeling & Simulation
• Physics, astronomy, molecular dynamics, finance, etc.
• Data and processing intensive
• Requires high-performance computing (HPC)
• Driving HPC architecture development
20. Top Dog (2008)
• Roadrunner, LANL
• #1 on top500.org in 2008 (now #7)
• 1.105 petaflop/s
• 3000 nodes with dual-core AMD Opteron processors
• Each node connected via PCIe to two IBM Cell processors
• Nodes are connected via Infiniband 4x DDR
22. Tianhe-1A
at NSC Tianjin
2.507 petaflop/s
7168 Tesla M2050 GPUs
1 petaflop/s = ~1M high-end laptops = ~world population with hand calculators 24/7/365 for ~16 years
Slide courtesy of Bill Dally (NVIDIA)
31. Massively Parallel Computing
[same word-cloud graphic as slide 16]
45. How much Data?
• Google processes 24 PB / day, 8 EB / year (’10)
• Wayback Machine has 3 PB,100 TB/month (’09)
• Facebook user data: 2.5 PB, 15 TB/day (’09)
• Facebook photos: 15 B, 3 TB/day (’09) - 90 B (now)
• eBay user data: 6.5 PB, 50 TB/day (’09)
• "all words ever spoken by human beings" ~ 42 ZB
Adapted from http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/
46. "640k ought to be enough for anybody."
- attributed to Bill Gates (1981); just a rumor
47. Disk Throughput
• Average Google job size: 180 GB
• 1 SATA HDD = 75 MB / sec
• Time to read 180 GB off disk: 45 mins
• Solution: parallel reads
• 1000 HDDs = 75 GB / sec (arithmetic spelled out below)
• Google’s solutions: BigTable, MapReduce, etc.
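Spelling out the slide's back-of-the-envelope arithmetic in a few lines of Python:

    # Back-of-the-envelope numbers from the slide above.
    job_gb = 180          # average Google job size
    disk_mb_s = 75        # throughput of one SATA HDD
    serial_s = job_gb * 1000 / disk_mb_s       # reading from a single disk
    print(serial_s / 60, "minutes")            # ~40 minutes on one disk
    print(serial_s / 1000, "seconds")          # ~2.4 s with 1000 disks in parallel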
48. Cloud Computing
• Clear trend: centralization of computing resources in large data centers
• Q: What do Oregon, Iceland, and abandoned mines have in common?
• A: Fiber, juice, and space
• Utility computing!
49. Massively Parallel Computing
[same word-cloud graphic as slide 16]
50. Instrument Data
Explosion
Sloan Digital Sky Survey
ATLUM / Connectome Project
61. Diesel Powered HPC... Life Support...
Murchison Widefield Array
(Slide courtesy of Hanspeter Pfister)
62. How much Data?
• NOAA has ~1 PB climate data (‘07)
• MWA radio telescope: 8 GB/sec of data
• Connectome: 1 PB / mm3 of brain tissue
(1 EB for 1 cm3)
• CERN’s LHC will generate 15 PB a year (‘08)
64. Massively Parallel Computing
[same word-cloud graphic as slide 16]
65. Computer Games
• PC gaming business:
• $15B / year market (2010)
• $22B / year in 2015 ?
• WOW: $1B / year
• NVIDIA Shipped 1B GPUs since 1993:
• 10 years to ship 200M GPUs (1993-2003)
• 1/3 of all PCs have more than one GPU
• High-end GPUs sell for around $300
• Now used for science applications
76. Massively Parallel Computing
[same word-cloud graphic as slide 16]
77. Massively Parallel Computing
[same word-cloud graphic as slide 16]
78. Massively Parallel Human
Computing ???
• “Crowdsourcing”
• Amazon Mechanical Turk
(artificial artificial intelligence)
• Wikipedia
• Stackoverflow
• etc.
80. What is this course about?
Massively parallel processors
• GPU computing with CUDA
Cloud computing
• Amazon's EC2 as an example of utility computing
• MapReduce, the "back-end" of cloud computing
93. Good News
• Moore’s Law marches on
• Chip real-estate is essentially free
• Many-core architectures are commodities
• Space for new innovations
94. Bad News
• Power limits improvements in clock speed
• Parallelism is the only route to improve performance
• Computation / communication ratio will get worse
• More frequent hardware failures?
96. A "Simple" Matter of Software
• We have to use all the cores efficiently
• Careful data and memory management
• Must rethink software design
• Must rethink algorithms
• Must learn new skills!
• Must learn new strategies!
• Must learn new tools...
107. "If you want to have good ideas you must have many ideas. Most of them will be wrong, and what you have to learn is which ones to throw away."
Linus Pauling (double Nobel Prize winner)
110. The curse of speed
...and the blessing of massively parallel computing
thousands of big models
large amounts of unsupervised
learning experience
111. The curse of speed
...and the blessing of massively parallel computing
No off-the-shelf solution? DIY!
Engineering (Hardware/SysAdmin/Software) Science
Leverage non-scientific high-tech markets and their $billions of R&D...
Gaming: Graphics Cards (GPUs), PlayStation 3
Web 2.0: Cloud Computing (Amazon, Google)
114. The blessing of GPUs
DIY GPU pr0n (since 2006) Sony Playstation 3s (since 2007)
115. Speed (in billion floating-point operations per second)
Q9450 (Matlab/C) [2008]: 0.3
Q9450 (C/SSE) [2008]: 9.0
7900GTX (OpenGL/Cg) [2006]: 68.2
PS3/Cell (C/ASM) [2007]: 111.4
8800GTX (CUDA1.x) [2007]: 192.7
GTX280 (CUDA2.x) [2008]: 339.3
GTX480 (CUDA3.x, Fermi) [2010]: 974.3
>1000X speedup is game changing...
Pinto, Doukhan, DiCarlo, Cox PLoS 2009
Pinto, Cox GPU Comp. Gems 2011
116. Tired Of Waiting For Your Computations?
Supercomputing on your desktop: programming the next generation of cheap and massively parallel hardware using CUDA.
This IAP has been designed to give students extensive hands-on experience in using a new, potentially disruptive technology. This technology enables the masses having access to supercomputing capabilities.
We will introduce the students to the CUDA programming language developed by NVIDIA Corp., which has been an essential step towards simplifying and unifying the programming of massively parallel chips.
This IAP is supported by generous contributions from NVIDIA Corp., The Rowland Institute at Harvard, and MIT (OEIT, BCS, EECS) and will be featuring talks given by experts from various fields.
6.963 (IAP 09)
136. CS 264 Goals
• Have fun!
• Learn basic principles of parallel computing
• Learn programming with CUDA
• Learn to program a cluster of GPUs (e.g. MPI)
• Learn basics of EC2 and MapReduce
• Learn new learning strategies, tools, etc.
• Implement a final project
139. Lectures “Format”
• 2x ~ 45min regular “lectures”
• ~ 15min “Clinic”
• we’ll be here to fix your problems
• ~ 5 min: Life and Code “Hacking”:
• GTD Zen
• Presentation Zen
• Ninja Programming Tricks & Tools, etc.
• Interested? email staff+spotlight@cs264.org
140. Act I: GPU Computing
• Introduction to GPU Computing
• CUDA Basics
• CUDA Advanced
• CUDA Ninja Tricks !
141. 3D Filterbank Convolution: Performance / Effort
(Performance in gflops; development time in hours)
Matlab: 0.3 gflops, 0.5 hours
C/SSE: 9.0 gflops, 10.0 hours
PS3: 111.4 gflops, 30.0 hours
GT200: 339.3 gflops, 10.0 hours
142. Empirical results... Performance (gflops)
Q9450 (Matlab/C) [2008]: 0.3
Q9450 (C/SSE) [2008]: 9.0
7900GTX (Cg) [2006]: 68.2
PS3/Cell (C/ASM) [2007]: 111.4
8800GTX (CUDA1.x) [2007]: 192.7
GTX280 (CUDA2.x) [2008]: 339.3
GTX480 (CUDA3.x) [2010]: 974.3
>1000X speedup is game changing...
143. Act II: Cloud Computing
• Introduction to utility computing
• EC2 & starcluster (Justin Riley, MIT OEIT)
• Hadoop (Zak Stone, SEAS)
• MapReduce with GPU Jobs on EC2
144. Amazon's Web Services
• Elastic Compute Cloud (EC2)
• Rent computing resources by the hour
• Basic unit of accounting = instance-hour
• Additional costs for bandwidth
• You'll be getting free AWS credits for course assignments
145. MapReduce
• Functional programming meets distributed processing
• Processing of lists with <key, value> pairs (see the sketch below)
• Batch data processing infrastructure
• Move the computation where the data is
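The <key, value> model in miniature, as a plain-Python sketch with no Hadoop machinery (here building an inverted index rather than the usual word count):

    # map emits <word, doc_id> pairs; a "shuffle" groups pairs by key; reduce
    # folds each group. On a cluster, the shuffle partitions keys across machines.
    from itertools import groupby

    docs = {0: "the quick fox", 1: "the lazy dog"}

    def mapper(doc_id, text):
        for word in text.split():
            yield (word, doc_id)

    pairs = sorted(kv for d, t in docs.items() for kv in mapper(d, t))
    index = {word: sorted({d for _, d in grp})
             for word, grp in groupby(pairs, key=lambda kv: kv[0])}
    print(index)   # {'dog': [1], 'fox': [0], 'lazy': [1], 'quick': [0], 'the': [0, 1]}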
146. Act III: Guest Lectures
• Andreas Klockner (NYU): OpenCL & PyOpenCL
• John Owens (UC Davis): fundamental algorithms/data structures and irregular parallelism
• Nathan Bell (NVIDIA): Thrust
• Duane Merrill* (University of Virginia): Ninja Tricks
• Mike Bauer* (Stanford): Sequoia
• Greg Diamos (Georgia Tech): Ocelot
• Other lecturers* from Google, Yahoo, Sun, Intel, NCSA, AMD, Cloudera, etc.
147. Labs
• Led by TF(s)
• Work on an interesting small problem
• From skeleton code to solution
• Hands-on
156. What do you need to know?
• Programming (ideally in C / C++)
• See HW 0
• Basics of computer systems
• CS 61 or similar
157. Homeworks
• Programming assignments
• “Issue Spotter” (code debug & review, Q&A)
• Contribution to the community
(OSS, Wikipedia, Stackoverflow, etc.)
• Due: Fridays at 11 pm EST
• Hard deadline - 2 “bonus” days
158. Office Hours
• Led by a TF
• 104 @ 53 Church St
(check website and news feed)
159. Participation
• HW0 (this week)
• Mandatory attendance for guest lectures
• forum.cs264.org
• Answer questions, help others
• Post relevant links and discussions (!)
160. Final Project
• Implement a substantial project
• Pick from a list of suggested projects or design
your own
• Milestones along the way (idea, proposal, etc.)
• In-class final presentations
• $500+ prize for the best project
161. Grading
• On a 0-100 scale
• Participation: 10%
• Homework: 50%
• Final project: 40%
162. www.cs264.org
• Detailed schedule (soon)
• News blog w/ RSS feed
• Video feeds
• Forum (forum.cs264.org)
• Academic honesty policy
• HW0 (due Fri 2/4)
167. This course is not for you...
• If you’re not genuinely interested in the topic
• If you can't cope with uncertainty, unpredictability, poor documentation, and immature software
• If you’re not ready to do a lot of programming
• If you’re not open to thinking about computing in
new ways
• If you can’t put in the time
Slide after Jimmy Lin, iSchool, Maryland
175. Acknowledgements
• Hanspeter Pfister & Henry Leitner, DCE
• TFs
• Rob Parrott & IT Team, SEAS
• Gabe Russell & Video Team, DCE
• NVIDIA, esp. David Luebke
• Amazon
177. Next?
• Fill out the survey: http://bit.ly/enrb1r
• Get ready for HW0 (Lab 1 & 2)
• Subscribe to http://forum.cs264.org
• Subscribe to RSS feed: http://bit.ly/eFIsqR