Deep Learning: Convergence of HPC and Hyperscale

In this deck from the 2016 Stanford HPC Conference, Sumit Sanyal from Minds.ai presents: Deep Learning: Convergence of HPC & Hyperscale.

"This talk includes a case study of convergence of HPC, Hyperscale and high-speed computational neuroscience."

Learn more: http://minds.ai

Sign up for our insideHPC Newsletter: http://insideHPC.com/newsletter


Transcript:

  1. Deep Learning: Convergence of HPC and Hyperscale (Sumit Sanyal, CEO)
  2. Technology Revolution: Deep Neural Nets
     • A new paradigm in programming
     • DNNs are remarkably effective at tackling many problems
     • Designing new NN architectures as opposed to "programming"
     • Training as opposed to "compiling"
     • Trained weights as the "new" binaries
     • Big neural nets are required to process Big Data: videos, images, speech and text
     • DNNs are significantly raising recognition accuracies that had stagnated for decades
     • Used to structure Big Data
  3. Neural network development
     [Flowchart] Make a guess at the NN architecture, then train on a labeled dataset (wait 1-7+ days). Run on the training data: does it perform well? If not, design a bigger network and iterate. If so, run on the test data: does it perform well? If not, gather more training data and iterate. If so, done!
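The same loop as a minimal Python sketch. The callbacks (train_fn, eval_fn, grow_fn, more_data_fn) are hypothetical placeholders for whatever tooling a team actually uses, not anything prescribed by the talk:

    def develop_network(arch, train_fn, eval_fn, grow_fn, more_data_fn,
                        train_set, test_set, target=0.95):
        """Iterate the flowchart above until both checks pass."""
        while True:
            model = train_fn(arch, train_set)        # may take 1-7+ days
            if eval_fn(model, train_set) < target:   # underfits the training
                arch = grow_fn(arch)                 # data: bigger network
                continue
            if eval_fn(model, test_set) < target:    # overfits: need more
                train_set = more_data_fn(train_set)  # training data
                continue
            return model                             # DONE!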
  4. Compute vs. IO
     • Deep neural nets require a lot of compute cycles!
     • An example – image classification:
       - The ratio of compute cycles to IO bandwidth is significantly higher than for non-AI algorithms
       - Training AlexNet (for image classification) requires ~27,000 flops per input data byte
       - Training VGG requires ~150,000 flops per data byte
     • R³/R² → volume (compute) over surface (IO bandwidth), and the ratio is significantly higher for deep nets
     • Power dissipation challenges
       - Compute density is limited by data-center cooling capacity
       - At ~1 µW/MHz (the current state of the art in 28 nm), this works out to 300 Watts!
     AI is no longer bored :)
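A quick back-of-the-envelope check of what those ratios imply. The 6 TFLOP/s GPU throughput below is an illustrative assumption, not a figure from the talk:

    gpu_flops = 6e12  # assumed fp32 throughput of one training GPU
    for name, flops_per_byte in [("AlexNet", 27_000), ("VGG", 150_000)]:
        # Input bandwidth needed to keep that GPU fully busy:
        mb_per_sec = gpu_flops / flops_per_byte / 1e6
        print(f"{name}: ~{mb_per_sec:.0f} MB/s of input saturates the GPU")
    # AlexNet: ~222 MB/s; VGG: ~40 MB/s. Tiny next to PCIe or InfiniBand
    # bandwidth, which is why training is compute-bound rather than IO-bound.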
  5. Neural Net Computations
     • All deep neural net implementations share the following properties:
       - A small set of non-linear transforms
       - A small set of linear algebra primitives
       - A relatively modest dynamic range of weight/data values
       - Very regular, repetitive data flows
       - The only persistent memory requirement is for the weights (updated while learning, fixed for recognition)
     • Variance in the size of the net across applications is >10^5
     Compute cycles will be commoditized; not computers!
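Those properties are easy to see in code. A minimal NumPy sketch of a single dense layer: one linear-algebra primitive (a GEMM) plus one non-linear transform (ReLU), with the weights as the only persistent state:

    import numpy as np

    def dense_layer(x, W, b):
        z = x @ W + b              # linear algebra primitive (GEMM)
        return np.maximum(z, 0.0)  # non-linear transform (ReLU)

    rng = np.random.default_rng(0)
    W = 0.05 * rng.standard_normal((256, 128))  # persistent state: weights
    b = np.zeros(128)
    x = rng.standard_normal((32, 256))          # a batch of 32 inputs
    y = dense_layer(x, W, b)                    # the same regular data flow
    print(y.shape)                              # on every call: (32, 128)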
  6. Why minds.ai?
     minds.ai was founded on the basis that Deep Learning is the next frontier for HPC – it's what we do.
     "…around 2008 my group at Stanford started advocating shifting deep learning to GPUs (this was really controversial at that time; but now everyone does it); and I'm now advocating shifting to HPC (High Performance Computing/Supercomputing) tactics for scaling up deep learning. Machine learning should embrace HPC. These methods will make researchers more efficient and help accelerate the progress of our whole field." – Andrew Ng, Feb 2016
     HPC design expertise + neural network design expertise + image and video recognition expertise + GPU programming expertise
  7. 7 levels of parallelism
     • Instruction level – SIMD, VLIW, etc. (see the sketch below)
     • Thread level – warps
     • Processor level – many cores
     • Server level – many GPUs in a server
     • Cluster level – many servers with a high-bandwidth interconnect
     • Data Center level
     • Planet level :)
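A toy illustration of the first level, assuming nothing beyond NumPy: the whole-array expression lets the underlying kernels use SIMD instructions to process many elements per instruction, where the explicit loop handles one element per iteration:

    import numpy as np

    a = np.arange(1_000_000, dtype=np.float32)
    b = np.ones_like(a)

    # One element per loop iteration:
    out_scalar = np.empty_like(a)
    for i in range(a.size):
        out_scalar[i] = a[i] * 2.0 + b[i]

    # The same multiply-add as whole-array operations, dispatched to
    # SIMD-capable kernels:
    out_vec = a * 2.0 + b
    assert np.allclose(out_scalar, out_vec)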
  8. ASICs for DNNs
     • Exponential growth in Data Centers
       - Commoditization of enterprise silicon
       - Traditional mobile players announcing ASICs for enterprise compute
     • Higher demands for compute density
       - GPGPUs have won the first round
       - Dennardian scaling is breaking down
     • Power dissipation will emerge as a major challenge
       - At the chip level, the server level and the DC level
  9. Age of dark silicon

     Transistor property            Dennardian   Post-Dennardian
     # of transistors (Q)           S²           S²
     Peak clock frequency (F)       S            S
     Capacitance (C)                1/S          1/S
     Supply voltage² (Vdd²)         1/S²         1
     Dynamic power (Q·F·C·Vdd²)     1            S²
     Active silicon                 1            1/S²

     * S is the ratio of feature sizes between successive process generations
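A sanity check of the table's last rows, assuming an illustrative linear shrink of S = 1.4 per generation: while Vdd² scales, power density stays flat; once it stops scaling, power grows as S² and the active fraction of silicon shrinks accordingly:

    S = 1.4  # assumed linear scaling factor between process generations

    Q, F, C = S**2, S, 1 / S          # identical in both regimes
    vdd2_dennard, vdd2_post = 1 / S**2, 1.0

    print(Q * F * C * vdd2_dennard)   # ~1.0  -- constant power density
    print(Q * F * C * vdd2_post)      # ~1.96 -- power doubles each node,
                                      # so only ~1/S**2 of the chip stays lit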
  10. Heterogeneity in Enterprise silicon
      • Dark silicon will drive heterogeneity
        - Multi-core architectures with different instruction sets
        - Power-aware scheduling across cores (a toy sketch follows this list)
        - A decreasing fraction of the chip can run at full clock frequency
      • Specialized silicon for server blades
        - Bridges for intra-server and inter-server communications
        - Last-level caching support, caching across MPI
        - Distributed compute in network interfaces
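A toy sketch of that scheduling constraint, with made-up numbers: under a fixed package power budget, only so many cores can run at full clock, and the affordable fraction shrinks each generation:

    def cores_at_full_clock(total_cores, budget_watts, watts_per_core):
        """How many cores a fixed power budget allows at full frequency."""
        return min(total_cores, int(budget_watts // watts_per_core))

    # Illustrative: a 64-core part with a 100 W budget at 4 W per core
    # can light up only 25 cores; the rest stay dark or run slower.
    print(cores_at_full_clock(64, 100, 4))  # 25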
  11. 7 levels of parallelism
      • Instruction level – SIMD, VLIW, etc.
      • Thread level – warps
      • Processor level – many cores
      • Server level – many GPUs in a server
      • Cluster level – many servers with a high-bandwidth interconnect
      • Data Center level
      • Planet level :)
  12. Heterogeneity in Hyperscale
      • Deep Learning is driving the convergence of High Performance Computing (HPC) and Hyperscale (Data Centers)
      • Traditional HPC ecosystems: expensive and bleeding-edge
      • DC infrastructure: commodity and homogeneous; single- or dual-CPU servers are common
      • All of this is changing
        - GPGPUs are now common in DCs, after initial resistance
        - InfiniBand penetration has reached a tipping point
        - Dense compute clusters require high-bandwidth interconnects
  13. Server Architectures
      • Intra-server vs. inter-server bandwidths
        - Inter-server bandwidths will grow faster than intra-server bandwidths
        - This will lead to larger, denser servers
      • 8 or more GPGPUs per server for DNN training jobs
        - These will co-exist with CPU-based servers for search and database operations
      • Many kinds of servers: one size fits all does not work
  14. Training Server Reference Design
      [Diagram] Two CPUs and eight GPUs attached through a PCIe root hub, with two InfiniBand NICs, an Ethernet port and two SSDs.
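Purely for illustration, the same diagram written down as data; the field names are ours, not a minds.ai specification:

    # Illustrative inventory of the reference design shown above.
    reference_server = {
        "cpus": 2,
        "gpus": 8,                      # all behind the PCIe root hub
        "nics": {"infiniband": 2, "ethernet": 1},
        "ssds": 2,
    }
    # Four GPUs per CPU socket:
    print(reference_server["gpus"] // reference_server["cpus"])  # 4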
  15. GPU Training Cluster Architecture
      [Diagram] Server nodes connected through an IB switch, plus an Ethernet switch and a RAID & access node.
      • Scalable up to 7 server nodes
      • 8 GPUs per node
      • ~2.5 kW per server node
      • 100 Gbps IB interconnect
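The aggregate capacity those figures imply, as a two-line check:

    # Totals for a fully populated cluster, from the numbers on the slide.
    nodes, gpus_per_node, kw_per_node = 7, 8, 2.5
    print(nodes * gpus_per_node)  # 56 GPUs
    print(nodes * kw_per_node)    # 17.5 kW of heat for the DC to cool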
  16. 7 levels of parallelism
      • Instruction level – SIMD, VLIW, etc.
      • Thread level – warps
      • Processor level – many cores
      • Server level – many GPUs in a server
      • Cluster level – many servers with a high-bandwidth interconnect
      • Data Center level
      • Planet level :)
  17. Data Center Architectures
      • Computing super clusters
        - 10^5 variability in the size of compute 'jobs'
        - Large numbers of co-located servers running the same job
        - High-bandwidth, low-latency interconnect
        - Clusters could grow to a significant fraction of a Data Center
      • The architecture of clusters will be hierarchical and heterogeneous
        - Edge servers for security and data management
        - Dedicated RAID servers
        - Dedicated compute servers
        - Control and management nodes
      • Multi-DC training clusters for big models are technically feasible
        - The scientific community has performed transcontinental simulations
  18. Thank you.
  19. Contacts
      • Sumit Sanyal, Founder and CEO – sumit@minds.ai
      • Steve Kuo, Co-Founder – steve@minds.ai
  20. Accelerated Deep Learning
