4th "Nangongbullak" (Impregnable) Open Source Infrastructure Seminar with IBM Korea - AI

Kwanghoon Jhin
kwanghoon.jhin@kr.ibm.com
March, 2017
PowerAI
World's Fastest Machine Learning Platform
IBM Power Systems: S822LC for HPC
IBM Storage: Elastic Storage Server
Revision E1

1950 — Alan Turing creates the “Turing Test”
1952 — Arthur Samuel wrote the first computer learning program. The program was the game of checkers,
and the IBM computer improved at the game the more it played, studying which moves made up winning
strategies and incorporating those moves into its program.
1959 – Arthur Samuel defines machine learning as the “field of study that gives computers the ability to
learn without being explicitly programmed.”
A (very short) History of AI

IBM 701 vacuum tube plug-in unit
700-series Electronic Data Processing Machine.

1957 — Frank Rosenblatt designed the first neural network for computers (the perceptron), which simulates
the thought processes of the human brain.

“… the navy revealed the embryo of an electronic computer today that it
expects will be able to walk, talk, see, write, reproduce
itself and be conscious of its existence ... Dr. Frank Rosenblatt, a
research psychologist at the Cornell Aeronautical Laboratory, Buffalo, said
Perceptrons might be fired to the planets as mechanical space explorers”
08-JUL-1958, the New York Times.
Overconfidence breeds false belief.

Perceptrons (1969) by Marvin Minsky, Founder of the MIT AI Lab
– We need to use Multi-Layer Perceptrons
– No one on earth had found a viable way to train MLPs good enough to learn such simple functions.

AI winter:
– 1974 to 1980
– 1987 to 1993

Backpropagation
– 1974 by Paul Werbos
– 1982 by Paul Werbos
– 1986 by Geoffrey Hinton

Paul Werbos (1974). Beyond regression: New tools for prediction and analysis in the behavioral sciences.
PhD thesis, Harvard University.
Paul Werbos (1982). Applications of advances in nonlinear sensitivity analysis. In System modeling and
optimization (pp. 762-770). Springer Berlin Heidelberg.
Rumelhart, David E.; Hinton, Geoffrey E.; Williams, Ronald J. (8 October 1986). "Learning representations
by back-propagating errors". Nature. 323 (6088): 533–536.
Backpropagation / 역전파(逆傳播)

1997 — IBM’s Deep Blue beats the world champion at chess.
2006 — Geoffrey Hinton coins the term “deep learning” to explain new algorithms that let computers “see”
and distinguish objects and text in images and videos.
2010 — The Microsoft Kinect can track 20 human features at a rate of 30 times per second, allowing people
to interact with the computer via movements and gestures.
2011 — IBM’s Watson beats its human competitors at Jeopardy.
2011 — Google Brain is developed, and its deep neural network can learn to discover and categorize objects
much the way a cat does.
2012 — In 2012, Hinton and two of his other U of T students, Alex Krizhevsky and Ilya Sutskever, entered an
image recognition contest. Competing teams build computer vision algorithms that learn to identify objects in
millions of pictures; the most accurate wins. The U of T model took the error rate of the best-performing
algorithms to date, and the error rate of the average human, and snapped the difference in half like a dry
twig. The trio created a company, and Google acquired it.
(https://www.thestar.com/news/world/2015/04/17/how-a-toronto-professors-research-revolutionized-artificial-intelligence.html)

2012 — Google’s X Lab develops a machine learning algorithm that is able to autonomously browse
YouTube videos to identify the videos that contain cats.
2014 — Facebook develops DeepFace, a software algorithm that is able to recognize or verify individuals on
photos to the same level as humans can.
2015 — Amazon launches its own machine learning platform.
2015 — Microsoft creates the Distributed Machine Learning Toolkit, which enables the efficient distribution of
machine learning problems across multiple computers.
2015 — Over 3,000 AI and Robotics researchers, endorsed by Stephen Hawking, Elon Musk and Steve
Wozniak (among many others), sign an open letter warning of the danger of autonomous weapons which
select and engage targets without human intervention.
2016 — Google’s artificial intelligence algorithm beats a professional player at the Chinese board game Go,
which is considered the world’s most complex board game and is many times harder than chess. The
AlphaGo algorithm developed by Google DeepMind managed to win five games out of five in the Go
competition.
http://www.forbes.com/sites/bernardmarr/2016/02/19/a-short-history-of-machine-learning-every-manager-should-read/

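PowerAI > Why now?
The long "AI Winter" has ended,
and the "AI Spring" has arrived.
Academia calls the present era the "AI Spring,"
because three things that did not exist before are now in place:
– Big Data: a flood of unstructured data, giving machine learning new possibilities and meaning
– Technology: a leap in hardware capability, so that computations once possible only on supercomputers now run on a desktop
– Algorithm: conventions and limits challenged and overcome, solving long-standing problems in AI design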

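PowerAI > Computing components required for machine learning
The computing platform this era needs
Under the name PowerAI, IBM delivers "optimized frameworks," a "GPU computing system,"
and "software-defined storage" as a single computing platform.
– Frameworks (IBM Optimized Frameworks): frameworks and libraries optimized by IBM
– GPU computing (IBM S822LC for HPC): a GPU computing system with the most innovative I/O technology
– Software Defined Storage (IBM Elastic Storage Server): software-defined storage for storing and using data without limits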

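ML/DL frameworks optimized by IBM
Through ibm.biz/powerai, IBM provides ML/DL frameworks
optimized for the S822LC for HPC.
Machine Learning / Deep Learning Frameworks and Supporting Libraries:
OpenBLAS, Bazel, DIGITS, NCCL, Distributed Frameworks
Framework optimization is the part of planning and building a machine learning
platform that demands a large investment of time.
IBM provides "optimized" frameworks that minimize this inconvenience
and unnecessary time investment in the field.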

Origin of the GPU
– 1999 – Nvidia releases the GeForce 256, calling it “the
world’s first GPU,” and describes the GeForce 256 GPU
as follows:
• “Single-chip processor with integrated
transform, lighting, triangle setup/clipping, and
rendering engines”
– 2002 – Nvidia's competitor ATI Technologies
(now AMD) announces the Radeon 9700, calling it a VPU,
"Visual Processing Unit"
– 2007 – Nvidia announces the Tesla product line, a GPGPU
Using the GPGPU as a co-processor
– HPC
– Machine Learning
S822LC for HPC > Graphics Processing Unit

How does GPU computing differ from CPU computing?
Mythbusters at NVISION 08

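The 16 input values, arranged in a 4x4 grid:
 1   2   5  12
 9   6   0   3
 5  72   9 -15
 5  16 -32 -20
Partial sums of the four 2x2 blocks: 18, 20, 98, -58
Total: 78
1 + 2 + 5 + 12 + 9 + 6 + 0 + 3 + 5 + 72 + 9 + (-15) + 5 + 16 + (-32) + (-20) = 78
Sixteen numbers need to be summed.
One approach adds the 16 numbers one after another until the total is complete (serial computation);
another groups them into sets of four and sums the groups at the same time (parallel computation).
The former is how a CPU typically computes,
and the latter is the approach a GPU takes.
A GPU has thousands of cores for handling parallel work efficiently.
S822LC for HPC > How does the GPU help computation?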
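The serial and grouped strategies described on this slide can be sketched in Python. This is an illustration only: the helper names `serial_sum`, `blocks_2x2`, and `blocked_sum` are hypothetical, the grouping into 2x2 blocks is inferred from the partial sums on the slide (18, 20, 98, -58), and a thread pool merely stands in for GPU cores. It does not reproduce how a GPU actually schedules work.

```python
from concurrent.futures import ThreadPoolExecutor

# The 4x4 grid of numbers from the slide.
GRID = [
    [1, 2, 5, 12],
    [9, 6, 0, 3],
    [5, 72, 9, -15],
    [5, 16, -32, -20],
]


def serial_sum(grid):
    """Add the 16 values one after another (the serial approach)."""
    total = 0
    for row in grid:
        for value in row:
            total += value
    return total


def blocks_2x2(grid):
    """Split the 4x4 grid into four 2x2 blocks, left to right, top to bottom."""
    blocks = []
    for r in (0, 2):
        for c in (0, 2):
            blocks.append([grid[r][c], grid[r][c + 1],
                           grid[r + 1][c], grid[r + 1][c + 1]])
    return blocks


def blocked_sum(grid):
    """Sum each block concurrently, then combine the four partial sums."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_sums = list(pool.map(sum, blocks_2x2(grid)))
    return partial_sums, sum(partial_sums)


partials, total = blocked_sum(GRID)
print("partial sums:", partials)
print("total:", total, "serial:", serial_sum(GRID))
```

Both strategies reach the same total; the grouped version simply exposes four independent pieces of work that could run at the same time.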

Although NVLink primarily focuses on connecting multi
also connect Tesla P100 GPUs with IBM Power CPUs
shows how the CPUs are connected with NVLink in th
configuration, each GPU has 180 GBps bidirectional b
and 80 GBps bidirectional bandwidth to the connected
Figure 1-12 CPU to GPU and GPU to GPU interconnect us
All the initialization of the GPU is through the PCIe inte
the side band communication for status, power manag
and running, all data communication is using the NVLi
S822LC for HPC > NVLink
NVLink breaks through the limits of PCIe.
GPU maker Nvidia has long focused on one of the major problems of GPU computing:
the bottleneck between the CPU (host) and the GPU (device), and between one
GPU (device) and another. Its answer is a bus called NVLink.
Nvidia partnered with IBM to deliver the world's first full implementation
of NVLink 1.0, which is used in the S822LC for HPC.
IBM values NVLink technology highly and has decided to adopt version 2.0
in Power9, its next-generation CPU platform.
– PCIe 3.0 x16: 32 GB/s bidirectional
– NVLink: 160 GB/s bidirectional (5x the bandwidth)
– 4 GPUs with NVLink
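As a rough back-of-the-envelope illustration of the two bandwidth figures on this slide, the sketch below estimates an idealized transfer time at each quoted peak rate. The 16 GB payload and the function name `transfer_seconds` are assumptions chosen for illustration; real transfers do not sustain peak bidirectional bandwidth.

```python
# Peak bidirectional bandwidths quoted on the slide, in GB/s.
PCIE3_X16_GBPS = 32
NVLINK_GBPS = 160


def transfer_seconds(size_gb, bandwidth_gbps):
    """Idealized transfer time: payload size divided by peak bandwidth."""
    return size_gb / bandwidth_gbps


payload_gb = 16  # hypothetical payload, chosen only for illustration
pcie_time = transfer_seconds(payload_gb, PCIE3_X16_GBPS)
nvlink_time = transfer_seconds(payload_gb, NVLINK_GBPS)

print(f"PCIe 3.0 x16: {pcie_time:.2f} s")
print(f"NVLink:       {nvlink_time:.2f} s")
print(f"Ratio:        {pcie_time / nvlink_time:.1f}x")
```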

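S822LC for HPC > Unified Memory
Removing the limits on GPU memory access.
Unified Memory
• The cap on usable GPU memory that held back the previous GPU
generations, Kepler and Maxwell (which could not go beyond the
dedicated GDDR memory built into the GPU), is removed in Pascal,
so all memory installed in the system can be used
• Developers can now focus on the computation itself rather than
on managing data movement in GPU memory
"Both Fat and Flat" design
• Designed so that no link becomes a data-transfer bottleneck
• The GPU accesses and uses system memory directly, as the CPU does
• A "Fat Pipe" between GPUs on the same socket
Optimized for common workloads and compute algorithms
• Explosive performance at startup/teardown
• Stable data transfer between GPUs
• Resolves the host-device bus transfer problem caused by
previously insufficient bandwidth
Diagram: a Pascal GPU and a Power8 CPU share Unified Memory (CUDA 8).
Memory can be allocated beyond GPU memory, up to 1TB on S822LC for HPC.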

S822LC for HPC > Performance
Performance comparison across GPU generations under similar conditions
Chart 1: Training time (minutes), AlexNet and Caffe to top-1 50% accuracy (lower is better)
– x86 with 4x M40 / PCIe vs. Power8 with 4x P100 / NVLink
– S822LC/HPC with 4 Tesla P100 GPUs is 2.2x faster than 4x Tesla M40 GPUs
Chart 2: BVLC Caffe vs IBM Caffe / VGGNet, time to top-1 50% accuracy (lower is better)
– x86 with 8x M40 / PCIe vs. Power8 with 4x P100 / NVLink
– S822LC/HPC with 4 Tesla P100 GPUs is 24% faster than 8x Tesla M40 GPUs
Test configurations:
• IBM S822LC 20-cores 2.86GHz 512GB memory
/ 4 NVIDIA Tesla P100 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / IBM Caffe 1.0.0-rc3 / Imagenet Data
• Intel Broadwell E5-2640v4 20-core 2.6 GHz 512GB memory
/ 8 NVIDIA Tesla M40 GPUs / Ubuntu 16.04 / CUDA 8.0.44 / cuDNN 5.1 / BVLC Caffe 1.0.0-rc3 / Imagenet Data

S822LC for HPC > Performance
Comparison of NVLink and PCIe under similar conditions
ImageNet / Alexnet: Minibatch size = 128
– X86 + GeForce, Pascal Architecture on PCIe 3.0 x16: 170 ms
– POWER8 + Tesla, Pascal Architecture on NVLink: 78 ms
IBM advantage (data communication and GPU performance): 54%
A direct comparison of the Tesla P100 and the GeForce GTX Titan X, both built on
the Pascal Architecture, showed that the presence or absence of NVLink has a
major effect on overall processing speed.
• Dramatic reduction in data movement time
• Shorter compute wait times, reducing overall load
• A real reduction in total compute time
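The 54% figure follows directly from the two timings on this slide. The sketch below works through the arithmetic. It assumes that 170 ms and 78 ms are times per 128-image minibatch, which the slide does not state explicitly, so the images-per-second values are illustrative only; the function name `images_per_second` is hypothetical.

```python
MINIBATCH_SIZE = 128
PCIE_MS = 170.0    # x86 + GeForce on PCIe 3.0 x16
NVLINK_MS = 78.0   # POWER8 + Tesla on NVLink

# Relative time saved, the quantity the slide reports as 54%.
reduction = (PCIE_MS - NVLINK_MS) / PCIE_MS

# The same result expressed as a speedup factor.
speedup = PCIE_MS / NVLINK_MS


def images_per_second(batch_ms):
    """Throughput, assuming batch_ms is the time for one minibatch."""
    return MINIBATCH_SIZE / (batch_ms / 1000.0)


print(f"time reduction: {reduction:.1%}")
print(f"speedup:        {speedup:.2f}x")
print(f"PCIe:   {images_per_second(PCIE_MS):.0f} images/s")
print(f"NVLink: {images_per_second(NVLINK_MS):.0f} images/s")
```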

S822LC for HPC > Internal structure and configuration
2U 19-inch rack mount, 2-socket POWER8 CPU,
4 Tesla P100 GPUs, 32 DDR4 DIMM slots, 3 PCIe slots.
POWER8 with NVLink (2x)
• 190W
• Integrated NVLink 1.0
Memory DIMM Risers (8x)
• 4 IS DDR4 DIMMs per riser
• Single Centaur per riser
• 32 IS DIMMs total
• 32-1024 GB memory capacity
PCIe slot (3x)
• Gen3 PCIe
Nvidia GPU
• SXM2 form factor
• NVLink 1.0
• 300 W
• Max of 2 per socket
Power Supplies (2x)
• 1300W
• Common Form Factor Supply
Cooling Fans (4x)
• 80mm Counter-Rotating Fans
• Hot swap
Storage Option (2x)
• 0-2 SATA HDD/SSD
• Tray design for install/removal
• Hot swap
Service Controller Card
• BMC Content

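S822LC for HPC > Internal structure and configuration > Water cooling
With the water cooling option applied
Default air cooling
Water cooling available: WATER COOLING OPTION
The S822LC for HPC offers water cooling as an option, which is
essential for building large-scale HPC computing systems.
It addresses at the root the heat that inevitably accompanies
GPU computing. Even when operating multiple racks of
S822LC for HPC systems, it secures more stable computing
power while reducing the load on the data center's
air-conditioning system.
The water cooling option is available only for the 4-GPU configuration.
As shown in the figure below, hoses that supply coolant are attached
directly to the heat sinks.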

S822LC for HPC > Nvidia Tesla product line comparison
The S822LC for HPC uses the "P100 (NVLink)".
| | P100 (NVLink) | P100 (PCIe) | M40 | K40 | K80 |
| Architecture | Pascal | Pascal | Maxwell | Kepler | Kepler |
| CUDA cores | 3584 | 3584 | 3072 | 2880 | 4992 (dual GPU) |
| Half-precision performance (FP16), TeraFLOPS with NVIDIA GPU Boost | 21.2 | 18.68 | N/A | N/A | N/A |
| Single-precision performance (FP32), TeraFLOPS with NVIDIA GPU Boost | 10.6 | 9.34 | 7 | 4.29 | 8.73 |
| Double-precision performance (FP64), TeraFLOPS with NVIDIA GPU Boost | 5.3 | 4.67 | 0.2 | 1.43 | 2.91 |
| GPU Memory | 16 GB HBM2 | 12 GB or 16 GB HBM2 | 24 GB GDDR5 | 12 GB GDDR5 | 24 GB GDDR5 (12 GB per GPU) |
| Memory Bandwidth | 720 GB/s HBM2 | 3072-bit HBM2 (12 GB), 4096-bit HBM2 (16 GB) | 288 GB/s | 288 GB/s | 480 GB/s (240 GB/s per GPU) |
| System Interface | 160 GB/s bidirectional interconnect bandwidth with NVIDIA NVLink | 32 GB/s PCI Express 3.0 x16 | 32 GB/s PCI Express 3.0 x16 | 32 GB/s PCI Express 3.0 x16 | 32 GB/s PCI Express 3.0 x16 |
| Max Power Consumption | 300 W | 250 W | 250 W | 235 W | 300 W (150 W per GPU) |

IBM Elastic Storage Server
– Compute: Power Systems S822L (Server) & S812L (Mgmt)
– Disks: EXP24 (GS) or DCS3700 (GL) or DeepFlash150 (GF)
– Software: Spectrum Scale (Software Defined Storage)
IBM Elastic Storage Server is both "software-defined storage" and a
storage appliance, built on IBM Power Systems and the Linux operating
system and centered on the IBM Spectrum Scale file system.

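IBM Elastic Storage Server > Overview
IBM Elastic Storage Server delivers all the strengths of
Spectrum Scale.
§ Spectrum Scale:
– System software designed to provide high-performance, highly
available file system services and data management services
across a variety of platforms.
– A file system solution that has matured through long-term
stabilization and version upgrades since its announcement in 1993.
Spectrum Scale limits:
| Item | Architectural Limit | Maximum Tested |
| Maximum number of nodes in a cluster | 16,384 | 9,300 |
| File system size | 2^99 bytes (524,288 YiB) | 18 PiB |
| Maximum number of files per file system | 2^64 | 9,000,000,000 |
– Maximum number of mountable file systems: 256
– Maximum I/O bandwidth tested on a single file system: 400 GB/s
– Maximum number of dependent filesets: 10,000
– Maximum number of independent filesets: 1,000
– Maximum number of global snapshots: 256
– Maximum number of snapshots per independent fileset: 256
Key characteristics:
– Parallel/distributed file system
– Global Single Name Space
– Delivered as software or as an appliance
– High-speed data processing over Ethernet/InfiniBand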
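The file system size limit is quoted in two units, and the conversion can be checked with a few lines of Python. This is only a unit-arithmetic sketch using the binary prefixes (1 YiB = 2^80 bytes, 1 PiB = 2^50 bytes); the variable names are chosen for illustration.

```python
YIB = 2 ** 80  # bytes in one yobibyte
PIB = 2 ** 50  # bytes in one pebibyte

architectural_limit_bytes = 2 ** 99
max_tested_bytes = 18 * PIB

# 2^99 bytes expressed in YiB.
limit_in_yib = architectural_limit_bytes // YIB

# How far the architectural limit sits above the largest tested size.
headroom = architectural_limit_bytes / max_tested_bytes

print(f"architectural limit: {limit_in_yib:,} YiB")
print(f"headroom over max tested: {headroom:.3e}x")
```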

IBM Elastic Storage Server > Product line
A range of configurations is supported, with models that prioritize
either speed or capacity. All-flash models are also available.
GL models:
– Model GL2: 2 Enclosures, 12U; 116 NL-SAS, 2 SSD; 5+ GB/s
– Model GL4: 4 Enclosures, 20U; 232 NL-SAS, 2 SSD; 10+ GB/s
– Model GL6: 6 Enclosures, 28U; 348 NL-SAS, 2 SSD; 15+ GB/s
GS models:
– Model GS1: 24 SSD; 6 GB/s
– Model GS2: 46 SAS + 2 SSD or 48 SSD drives; 2 GB/s SAS, 12 GB/s SSD
– Model GS4: 94 SAS + 2 SSD or 96 SSD drives; 5 GB/s SAS, 16 GB/s SSD
– Model GS6: 142 SAS + 2 SSD; 7 GB/s
GF models:
– Model GF1: 1 Enclosure, 7U; 180TB usable Flash; Max Read 13 GB/s, Max Write 9 GB/s
– Model GF2: 2 Enclosures, 10U; 360TB usable Flash; Max Read 26 GB/s, Max Write 16 GB/s
Figure: stacked FC5887 enclosures, each with 24 drive slots.
• GS models use 2U 24x2.5” JBODs or SSDs: up to 120TB
• GL models use 4U 60x3.5” JBODs: up to 2PB
• GF models use 3U 32xFlash JBOF enclosures: up to 360TB
• Supported drives: 1.8TB SAS, 400GB, 800GB, 1.6TB SSD 2.5”;
2TB, 4TB, 6TB and 8TB NL-SAS 3.5” HDDs
• Supported NICs: 10GbE, 40GbE Ethernet and EDR Infiniband

IBM Elastic Storage Server > HPC customer case study
“With conventional RAID storage arrays, rebuilding a 1TB disk
took around 12 hours. GNR makes it possible to rebuild a 2TB
disk in under an hour, and it has given us the high-speed data
access our research simulations need.”
Lothar Wollschläger
Storage Specialist @ Jülich Supercomputing Centre
IBM Elastic Storage Server
• 20 units in total
• 7 PB (usable capacity)
• 200 GB/s I/O performance
The "GPFS Native RAID" provided with IBM Elastic Storage Server dramatically shortens disk rebuild time,
which markedly reduces the risk of a second failure occurring during a rebuild.

IBM Systems Hardware
PowerAI
World's Fastest Machine Learning Platform
THANK YOU!

Architectural Overview
– Operation
– Training Layer (Compute Cluster): Compute nodes, Job Scheduler
– Data Layer (Central Storage): Storage Server, Data
– Service Layer (Service Gateway): Database, Index, Search, Ranking, API
– Networks: High Bandwidth Service Network, Management Network
– Components: IBM Elastic Storage Server, IBM Spectrum Scale, IBM S822LC for HPC, IBM Spectrum LSF, IBM IB Switch EDR
