
Nervana and the Future of Computing

Arjun Bansal speaks at Paris ML Meetup



  1. Nervana and the Future of Computing. 26 April 2016. Arjun Bansal, Co-founder & VP Algorithms, Nervana. MAKING MACHINES SMARTER.™
  2. AI on demand using deep learning. The Nervana Platform powers image classification, object localization, video indexing, text analysis, and machine translation.
  3. Image classification and video activity detection.
     Deep learning model:
     • Trained on a public dataset¹ of 13K videos in 101 categories
     • Training was approximately 3x faster than a competitive framework
     • Can be extended to scene and object detection, action-similarity labeling, video retrieval, and anomaly detection
     Potential applications:
     • Activity detection and monitoring for security
     • Automatic editing of captured moments from video cameras
     • Facial recognition and image-based retrieval
     • Sense-and-avoid systems for autonomous driving
     • Baggage screening at airports and other public venues
     1: UCF101 dataset: http://crcv.ucf.edu/data/UCF101.php
     Demo: https://www.youtube.com/watch?v=ydnpgUOpdBw
  4. Object localization and recognition.
  5. Speech to text. Demo: https://youtu.be/NaqZkV_fBIM
  6. Question answering.
     Stories: Mary journeyed to Texas. John went to Maryland. Mary went to Iowa. John travelled to Florida.
     Question: Where is John located? Answer: Florida.
  7. Reinforcement learning: Pong and Breakout. Demos: https://youtu.be/0ZlgrQS3krg and https://youtu.be/KkIf0Ok5GCE
  8. Application areas: healthcare, agriculture, finance, online services, automotive, energy.
  9. Nervana is building the future of computing: cloud computing + custom ASIC + deep learning / AI. (The Economist, March 12, 2016)
  10. nervana cloud: import data (images, text, tabular, speech, time series, video), then build, train, and deploy models in the cloud.
  11. nervana neon:
      • Fastest library
      • Model support. Models: Convnet, RNN, LSTM, MLP, DQN, NTM. Domains: images, video, speech, text, time series
      • Cloud integration (a sketch of a neon script follows below). Running locally:
        % python rnn.py   # or: neon rnn.yaml
        Running in the nervana cloud:
        % ncloud submit --py rnn.py   # or: --yaml rnn.yaml
        % ncloud show <model_id>
        % ncloud list
        % ncloud deploy <model_id>
        % ncloud predict <model_id> <data>   # or use the REST API
      • Multiple backends: CPU, GPU, multiple GPUs, parameter server, (Xeon Phi), nervana TPU
      • Optimized at the assembler level
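For a sense of what a neon script like rnn.py looks like, here is a minimal training script patterned on neon's 1.x examples. This is a sketch: the module paths, ArrayIterator, and the Callbacks signature are recalled from that era's API and may differ between versions.

    # Minimal neon-style MLP training, patterned on neon 1.x examples.
    import numpy as np
    from neon.backends import gen_backend
    from neon.data import ArrayIterator
    from neon.initializers import Gaussian
    from neon.layers import Affine, GeneralizedCost
    from neon.models import Model
    from neon.optimizers import GradientDescentMomentum
    from neon.transforms import Rectlin, Softmax, CrossEntropyMulti
    from neon.callbacks.callbacks import Callbacks

    be = gen_backend(backend='cpu', batch_size=128)  # or 'gpu'

    # Toy data standing in for a real dataset.
    X = np.random.rand(1000, 784)
    y = np.random.randint(10, size=1000)
    train = ArrayIterator(X, y, nclass=10)

    mlp = Model(layers=[
        Affine(nout=100, init=Gaussian(scale=0.01), activation=Rectlin()),
        Affine(nout=10, init=Gaussian(scale=0.01), activation=Softmax()),
    ])
    mlp.fit(train,
            optimizer=GradientDescentMomentum(0.1, momentum_coef=0.9),
            cost=GeneralizedCost(costfunc=CrossEntropyMulti()),
            num_epochs=10,
            callbacks=Callbacks(mlp, train))

The same script runs unchanged in the cloud via ncloud submit.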
  12. nervana tensor processing unit (TPU): 10-100x gain, with an architecture optimized for:
      • Unprecedented compute density (roughly, 1 nervana engine = 10 GPUs = 200 CPUs)
      • Scalable distributed architecture
      • Memory near computation (a CPU spends area on control, ALU, and instruction-and-data memory; the nervana engine keeps data memory next to the compute)
      • Learning and inference
      • Exploiting limited precision
      • Power efficiency
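One way to see why limited precision is exploitable: with stochastic rounding, a value is rounded up with probability proportional to its remainder, so rounding is unbiased and many tiny updates survive a coarse quantization grid. A numpy illustration (my sketch of the general technique, not Nervana's actual hardware scheme):

    import numpy as np

    def stochastic_round(x, step):
        # Round to a multiple of `step`, rounding up with probability
        # (x - lower) / step, so the expected result equals x.
        lower = np.floor(x / step) * step
        return lower + step * (np.random.rand() < (x - lower) / step)

    step = 0.1          # coarse quantization grid
    acc_nearest = 0.0   # accumulator with round-to-nearest
    acc_stoch = 0.0     # accumulator with stochastic rounding
    for _ in range(10000):
        acc_nearest = np.round((acc_nearest + 0.01) / step) * step
        acc_stoch = stochastic_round(acc_stoch + 0.01, step)
    print(acc_nearest)  # 0.0: every 0.01 update rounds away
    print(acc_stoch)    # ~100 on average: the updates are preserved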
  13. Special-purpose computation. 1940s: Turing Bombe. Motivation: automating calculations, code breaking.
  14. General-purpose computation. 2000s: SoC. Motivation: reduce power and cost; fungible computing. Enabled inexpensive mobile devices.
  15. Dennard scaling has ended. (Dennard scaling kept power density constant as transistors shrank; its end means clock rates and efficiency no longer improve for free with each process generation.) What business and technology constraints do we have now?
  16. Many-core tiled architectures. 2010s: multi-core, GPGPU. Examples: Tilera TILEPro64 (64 tiles connected by a low-latency on-chip mesh with integrated memory and I/O interfaces; each tile is a full computing system with a 32-bit three-way VLIW processor engine, its own program counter, cache, and DMA subsystem, able to independently run an OS such as Linux and execute up to three operations per cycle), NVIDIA GM204, Intel Xeon Phi Knights Landing. Motivation: increased performance without clock-rate increases or smaller devices. Requires changes in the programming paradigm.
  17. FPGA architectures. Example: Altera Arria 10. Motivation: fine-grained parallelism, reconfigurability, lots of I/O, scalability. Drawbacks: slow clock speed; lacks compute density for machine learning.
  18. Neuromorphic architectures. Example: IBM TrueNorth (spikes travel between cores as addressed packets over an on-chip mesh; TrueNorth's power density is 20 mW per cm², far below that of a typical CPU).
  19. Neural network parallelism (a toy sketch follows).
      • Data parallelism: the full deep network is replicated on each of processors 1…n, the data is split into chunks 1…n, and a parameter server coordinates the parameters across processors.
      • Model parallelism: the model itself is partitioned across processors.
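A toy numpy sketch of the data-parallel scheme in the diagram. Synchronous gradient averaging stands in for the parameter server, and a linear-regression "network" keeps the example short; all names here are illustrative.

    import numpy as np

    def worker_grad(w, X, y):
        # One worker: gradient of mean squared error on its own data chunk.
        return 2 * X.T @ (X @ w - y) / len(y)

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1024, 8))
    w_true = rng.normal(size=8)
    y = X @ w_true

    chunks = np.array_split(np.arange(len(y)), 4)  # data chunks 1..n
    w = np.zeros(8)                                # full model on each worker
    for _ in range(200):
        # Each worker computes a gradient on its chunk; the "parameter
        # server" averages the gradients and updates the shared weights.
        grads = [worker_grad(w, X[idx], y[idx]) for idx in chunks]
        w -= 0.05 * np.mean(grads, axis=0)
    print(np.allclose(w, w_true, atol=1e-3))       # True: workers converge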
  20. Existing computing topologies are lacking. A typical node pairs two CPUs (with SSD, InfiniBand, and 10 GbE) with four GPUs behind PCIe switches; scaling out repeats this pattern, so GPU-to-GPU and node-to-node traffic must funnel through the PCIe switches, host CPUs, and network interfaces.
  21. nervana compute topology: a fabric of directly interconnected nervana engines (n), with host CPUs, SSDs, InfiniBand, and 10 GbE attached through PCIe switches.
  22. Distributed linear algebra and convolution.
      SUMMA distributed matrix multiply, C = A*B, on a P^(1/2) x P^(1/2) processor grid (Jim Demmel, CS267 lecture notes):
      • C[i,j] is the (n/P^(1/2)) x (n/P^(1/2)) submatrix of C on processor P_ij
      • A[i,k] is an (n/P^(1/2)) x b submatrix of A; B[k,j] is a b x (n/P^(1/2)) submatrix of B
      • C[i,j] = C[i,j] + Σ_k A[i,k]*B[k,j], with the summation over submatrices
      • The processor grid need not be square
      See also Solomonik and Demmel, "Matrix multiplication on multidimensional torus networks": blocked algorithms such as Cannon's and SUMMA have a 2-dimensional communication structure; their generalized "Split-Dimensional" Cannon (SD-Cannon) uses a higher-dimensional, bidirectional communication structure suited to torus interconnects whose injection bandwidth exceeds single-link bandwidth.
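A serial numpy sketch of that SUMMA loop, simulating the processor grid in one process (a real implementation distributes the blocks and overlaps the row and column broadcasts with compute):

    import numpy as np

    def summa(A, B, p=2, b=2):
        # Simulate SUMMA C = A @ B on a p x p grid; "processor" (i, j)
        # owns the s x s block C[i, j].
        n = A.shape[0]
        s = n // p
        C = np.zeros((n, n))
        for k in range(0, n, b):
            Acol = A[:, k:k+b]   # owners of A[:, k] broadcast along rows
            Brow = B[k:k+b, :]   # owners of B[k, :] broadcast along columns
            for i in range(p):   # every processor does a rank-b update
                for j in range(p):
                    C[i*s:(i+1)*s, j*s:(j+1)*s] += (
                        Acol[i*s:(i+1)*s] @ Brow[:, j*s:(j+1)*s])
        return C

    A, B = np.random.rand(8, 8), np.random.rand(8, 8)
    print(np.allclose(summa(A, B), A @ B))  # True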
  23. Summary.
      • Computers are tools for solving the problems of their time. Was: coding, calculation, graphics, the web. Today: learning and inference on data.
      • Deep learning is a computational paradigm.
      • A custom architecture can do vastly better.
