Cloud tpu jae_180814

Oh my! TPU! You really have to use this!
A story written after struggling through Google Deep Learning Camp 2018
GITHUB REPO: https://github.com/jwkanggist/tpu-tutorial-experiment
Jaewook Kang
jwkang10@gmail.com
Aug. 2018
© 2018
Jaewook Kang
All Rights Reserved

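About the speaker: Jaewook Kang (강재욱)
▪ GIST EEC Ph.D. (2015)
▪ Research team leader (until May 2018)
▪ MoT Lab leader at Modulabs (모두의 연구소)
▪ Interests:
▪ Statistical signal processing theory / digital signal processing
▪ Implementing C++ native signal processing / linear algebra libraries for mobile
▪ Mobile machine learning
▪ Swimming enthusiast for 6 years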

Presented at SI Analytics + MoTLabs
❖ Presented at SI Analytics.

Also presented at Modulabs.

What is a TPU? A quick look!
- Revisiting "The future of computing" by John Hennessy (Google I/O)
- https://www.youtube.com/watch?v=Azt8Nc-mtKM
- Why TPU?
- About TPU

Let's go over just the key points, quickly and simply!

Why TPU?
❖ The era of domain-specific hardware
– https://www.youtube.com/watch?v=Azt8Nc-mtKM
Image credit: Kaz Sato

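Why TPU?
• The growth rate of transistor (TR) density (cost per transistor) is slowing down
– Dennard scaling
• When transistor dimensions shrink to 70% (0.7x) → 2x transistors per unit area
• → Gate length 0.7x → supply voltage (Vdd) and capacitance of a single transistor 0.7x (power consumption about ½)
• The gate oxide also gets thinner → clock 1.4x
• The processor clock can rise 1.4x with no increase in power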
– End of Dennard scaling (mid-2000s)
• Because of fabrication limits, power consumption no longer halves
• Power consumption per unit area now starts to rise as density increases
• Overheating!
• → The multicore era begins
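The Dennard scaling arithmetic can be checked in a few lines. This is a minimal sketch that assumes the classic dynamic-power model P = C · V² · f and a linear scale factor of 0.7; the function name `dennard_step` is made up for illustration.

```python
def dennard_step(s=0.7):
    """Scale every linear dimension by s and return the relative changes.

    Assumes the classic dynamic-power model P = C * V^2 * f.
    """
    density = 1.0 / (s * s)      # transistors per unit area
    capacitance = s              # capacitance shrinks with the dimensions
    voltage = s                  # supply voltage (Vdd) shrinks with the dimensions
    frequency = 1.0 / s          # shorter gates switch faster
    power_per_transistor = capacitance * voltage ** 2 * frequency
    power_density = power_per_transistor * density
    return {
        "density": density,
        "frequency": frequency,
        "power_per_transistor": power_per_transistor,
        "power_density": power_density,
    }


result = dennard_step(0.7)
for name, value in result.items():
    print(f"{name}: {value:.2f}x")
```

Under these assumptions the power per transistor roughly halves while density doubles, so power per unit area stays constant. Once the voltage can no longer be scaled down, that balance breaks.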

Why TPU?
– The multicore era?
• Much of the energy-efficiency problem moves from the hardware domain to the software domain
• In multicore systems, the sequential part of the code is the bottleneck
– Synchronization and scheduling problems
– Processing time does not shrink in proportion to the number of cores!
• Even with sequential code at only about 1% of the total, the gain obtained is only about 40%
Image credit: https://www.youtube.com/watch?v=Azt8Nc-mtKM
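The sequential-code bottleneck is usually quantified with Amdahl's law. Below is a minimal sketch; the core counts are assumptions chosen for illustration and do not come from the slide.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup over one core when serial_fraction of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


# 1% sequential code, evaluated for a few assumed core counts.
for cores in (8, 64, 1024):
    speedup = amdahl_speedup(0.01, cores)
    print(f"{cores:5d} cores: {speedup:6.1f}x speedup, {speedup / cores:.0%} of ideal")
```

With 1% sequential code the speedup can never exceed 100x, no matter how many cores are added. At 64 cores it is already only about 39x, so close to 40% of the ideal speedup is lost.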

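Why TPU?
– Transistor density can no longer be the figure of merit for processors!
• Power consumption has to be the metric!
• The processor design paradigm shifts to energy efficiency
– Multicore no longer looks like the answer either
• Even with multicore (HW) + thread optimization (SW), the performance gain is very small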

Why TPU?
– 3% performance increase per year?
– This signals the end of the general-purpose processor era
[Figure annotations: "End of Dennard scaling!", "End of multicore speedup!"]
Image credit: https://www.youtube.com/watch?v=Azt8Nc-mtKM

Why TPU?
– Not one application, but a domain of applications

Why TPU ?
• Neural network processors
• DSP processors
– No single type of processor can handle everything
• Combine several types of processors
• Machine learning example
– Small, complex tasks go to the CPU
» Non-linear operations
– Large, simple tasks go to the GPU / TPU
» Matrix, vector, and sparse matrix ops

About TPU
❖ Tensor Processing Unit
– ISCA 2017
– link

About TPU
❖ 30x faster than a CPU and 15x faster than a GPU
– Even faster with data format optimization and no CPU hosting
• TPU v1
• GPU: Nvidia Tesla K80
• CPU: Haswell Xeon E5-2699 v3
- ref: ISCA 2017 TPU paper
Image credit: https://www.youtube.com/watch?v=Azt8Nc-mtKM

About TPU
❖ Tensor Processing Unit (version 1)
– Inference only
– Efficient large-scale matrix operation capability
• 256 x 256 (64K) MACs (25 times as many as the K80)
– Removing the memory I/O bottleneck!
• 24 MiB on-chip Unified Buffer
• Systolic array
• 167 GiB/s buffering
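To put the 64K MAC figure in perspective, the peak throughput can be estimated with simple arithmetic. This is a rough sketch: the 700 MHz clock is taken from the ISCA 2017 paper rather than from the slide, and counting one MAC as two operations is a common convention assumed here.

```python
ROWS, COLS = 256, 256      # MAC array size from the slide
CLOCK_HZ = 700e6           # assumption: TPU v1 clock from the ISCA 2017 paper, not the slide
OPS_PER_MAC = 2            # convention: one multiply plus one add

macs = ROWS * COLS
peak_ops_per_second = macs * OPS_PER_MAC * CLOCK_HZ

print(f"MAC units: {macs} ({macs // 1024}K)")
print(f"Peak throughput: {peak_ops_per_second / 1e12:.2f} TOPS")
```

This is a theoretical peak; it is only reached when the array is kept fully fed, which is why the slides put so much weight on the memory I/O path.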

About TPU
❖ Operation flow (TPU v1, inference only)
- TPU instructions and data are sent over a PCIe Gen3 x16 interface.
Image credit: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view

About TPU
- Weight Memory: weights are stored in 8 GiB of off-chip DRAM (read only) and then passed to the Weight FIFO

About TPU
- Unified Buffer:
- 24 MiB on-chip buffer
- Input to the Matrix Multiply Unit
- Feeds data
- Buffers intermediate results
- Very fast buffering I/O (167 GiB/s)

About TPU
- Matrix Multiply Unit
- 64K MACs per cycle!
- 8-bit multiply-and-add units
- 16-bit products
- 32-bit accumulators
- → Optimized for 8-bit arithmetic!
- Parallel computation using a systolic array!
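The 8-bit / 16-bit / 32-bit data path of the Matrix Multiply Unit can be illustrated with plain integers. This is a minimal sketch, assuming simple symmetric quantization with one shared scale; the helper names and the `scale` value are hypothetical and not taken from the TPU design.

```python
INT8_MIN, INT8_MAX = -128, 127


def quantize(values, scale):
    """Round floats to signed 8-bit integers that share one scale factor."""
    return [max(INT8_MIN, min(INT8_MAX, round(v / scale))) for v in values]


def int8_dot(a_q, w_q):
    """Dot product with 8-bit inputs, 16-bit products, and a 32-bit accumulator."""
    acc = 0
    for a, w in zip(a_q, w_q):
        product = a * w
        assert -2**15 <= product < 2**15, "product must fit in 16 bits"
        acc += product
        assert -2**31 <= acc < 2**31, "accumulator must fit in 32 bits"
    return acc


scale = 0.01                          # hypothetical quantization step for both inputs
activations = [0.5, -1.0, 0.25]
weights = [0.2, 0.1, -0.4]

a_q = quantize(activations, scale)
w_q = quantize(weights, scale)
acc = int8_dot(a_q, w_q)
approx = acc * scale * scale          # de-quantize the integer result
exact = sum(a * w for a, w in zip(activations, weights))
print(a_q, w_q, acc, approx, exact)
```

The wide accumulator is what lets many narrow products be summed without overflow, which is the reason 8-bit inputs are enough for inference.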

About TPU
❖ Tensor Processing Unit (version 1)
– Inference only
– Efficient large-scale matrix operation capability
• 256 x 256 (64K) MACs (25 times as many as the K80)
– Removing the memory I/O bottleneck!
• 24 MiB on-chip Unified Buffer
• Systolic array
• 167 GiB/s buffering
• This is the key to the speedup!

About TPU
❖ Removing the memory I/O bottleneck is the key!
[Figure: slide image annotated with the ratios 1/8, 1/18, and 1/9]

About TPU
❖ Removing the memory I/O bottleneck is the key!
Image credit: Kaz Sato

About TPU
❖ Systolic array: minimizing the number of buffer reads
- A naive matrix multiplication has to read the input from the buffer several times.
- In the slide's example, the input has to be read twice.
[Figure: input, weights, output]

About TPU
❖ Systolic array: minimizing the number of buffer reads
- The input is read from the Unified Buffer only once!
- Values are shifted along while being processed in parallel (hard-wired!)
- Requires a huge number of shift registers
- Sacrifices flexibility → in practice only matrix multiplication is possible
Image credit: https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
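The "read once, then shift" idea can be made concrete with a toy cycle-level simulation. This is a simplified sketch of a weight-stationary systolic array, not the actual TPU microarchitecture; the function name and the read counter are hypothetical.

```python
def systolic_matmul(X, W):
    """Compute Y = X @ W on a simulated K x N weight-stationary systolic array.

    PE (k, n) permanently holds W[k][n]. Activations shift right, partial
    sums shift down, and each X element is read from the buffer exactly once.
    """
    M, K, N = len(X), len(W), len(W[0])
    act = [[0] * N for _ in range(K)]     # activation register of each PE
    psum = [[0] * N for _ in range(K)]    # partial-sum register of each PE
    Y = [[0] * N for _ in range(M)]
    buffer_reads = 0

    for t in range(M + K + N - 2):        # cycles until the last output drains
        new_act = [[0] * N for _ in range(K)]
        new_psum = [[0] * N for _ in range(K)]
        for k in range(K):
            for n in range(N):
                if n == 0:
                    m = t - k             # row k is fed with a skew of k cycles
                    if 0 <= m < M:
                        a = X[m][k]
                        buffer_reads += 1
                    else:
                        a = 0
                else:
                    a = act[k][n - 1]     # value shifted in from the left neighbor
                from_above = psum[k - 1][n] if k > 0 else 0
                new_act[k][n] = a
                new_psum[k][n] = from_above + a * W[k][n]
        act, psum = new_act, new_psum
        for n in range(N):                # the bottom row emits finished sums
            m = t - (K - 1) - n
            if 0 <= m < M:
                Y[m][n] = psum[K - 1][n]
    return Y, buffer_reads


X = [[1, 2], [3, 4]]
W = [[5, 6], [7, 8]]
Y, reads = systolic_matmul(X, W)
naive_reads = len(X) * len(W) * len(W[0])   # one input read per multiplication
print(Y, reads, naive_reads)
```

In this 2 x 2 case the naive scheme reads every input element twice (8 reads), while the simulated array reads each one once (4 reads) and reuses it by shifting.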

About TPU
❖ Systolic array: minimizing the number of buffer reads
Image credit: https://drive.google.com/file/d/0Bx4hafXDDq2EMzRNcy1vSUxtcEk/view

About TPU
❖ A processor that does matrix multiplication extremely fast!
– Speed is improved by resolving the memory I/O bottleneck
• The Unified Buffer takes 29% of the die area!! (TPU v1)
• Fast buffering! (167 GiB/s)
• The systolic array minimizes the number of buffer reads
– Everything else is minimized
– Control takes only 2%
– From TPU v2 on, training is also possible!

Other resources
❖ Effective machine learning using Cloud TPUs (Google I/O 18)
– https://www.youtube.com/watch?v=zEOtG-ChmZE
❖ The future of computing by John Hennessy (Google I/O 18)
❖ Google Cloud blog (Kaz Sato)
– https://cloud.google.com/blog/products/gcp/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
❖ Materials by 이진원
– https://drive.google.com/file/d/1yEA4l-IIdTykPy3Flp7F6ZySUEU39Amv/view
❖ Blog by 이채영
– https://blog.goodaudience.com/how-to-use-google-cloud-tpus-177c3a025067

Now, shall we bravely give the TPU a try?
GITHUB REPO: https://github.com/jwkanggist/tpu-tutorial-experiment
- Cloud TPU in Google Cloud Platform

Cloud TPU in GCP
❖ Everything about the TPU happens online
– https://cloud.google.com/tpu/docs/system-architecture?authuser=0
[Figure labels: your local machine, your dataset in TFRecords, your TF code]

Cloud TPU in GCP
[Figure labels: your local machine, your dataset in TFRecords, your TF code; online: your input pipeline, your model, your computing power]

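Cloud TPU in GCP
- Write the code with TPUEstimator on your local machine
- The computational graph is built on a GCP Compute Engine VM → uploaded to the Cloud TPU (gRPC)
- The dataset, in TFRecord format, is downloaded directly from a GCP bucket to the Cloud TPU server
- On the Cloud TPU server, the graph is split into a part that runs on the CPU and a part that runs on the TPU
- Each part is compiled with XLA → only the model part is replicated to each TPU core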

Cloud TPU in GCP
❖ Everything about the TPU happens online!
- A Cloud TPU consists of 4 chips
- Each chip has two cores
- The graph is replicated 4 x 2 = 8 times and runs on each TPU core
→ batch size per TPU core = batch size / 8
- At every training step, the 8 cores exchange information → weight update
- There is reportedly an efficient algorithm for updating weights across cores
[Figure: the graph is duplicated and sent to the TPU cores on chip0 to chip3]
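The per-core batch size rule can be sketched as a simple sharding step. This is an illustration only: the contiguous slicing and the helper name `shard_batch` are assumptions, not how TPUEstimator necessarily distributes examples.

```python
NUM_CHIPS = 4
CORES_PER_CHIP = 2
NUM_CORES = NUM_CHIPS * CORES_PER_CHIP    # 8 cores on one Cloud TPU device


def shard_batch(batch, num_cores=NUM_CORES):
    """Split a global batch into equal, contiguous per-core shards."""
    if len(batch) % num_cores != 0:
        raise ValueError("global batch size must be divisible by the core count")
    per_core = len(batch) // num_cores
    return [batch[i * per_core:(i + 1) * per_core] for i in range(num_cores)]


global_batch = list(range(1024))          # stand-in for 1024 training examples
shards = shard_batch(global_batch)
print(len(shards), "shards of", len(shards[0]), "examples each")
```

A practical consequence is that the global batch size should be chosen as a multiple of 8.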

Cloud TPU in GCP
❖ What do you need to use a TPU?
– Google Cloud Platform (GCP)
• Compute Engine
• GCP bucket
• Google Cloud SDK CLI
– TPUEstimator + knowing a little bit about how the TPU works
• TPU-specific training config
– TFRecord + tf.data
• For the online input pipeline
– Perseverance through trial and error

Setup Cloud TPU in GCP
❖ Skipped here; please refer to my Git repo
– Please see: https://github.com/jwkanggist/tpu-resnet-tutorial
– Also see Chaeyoung's blog: https://blog.goodaudience.com/how-to-use-google-cloud-tpus-177c3a025067

TPUEstimator
❖ Basically, the code structure is not very different from a regular tf.Estimator…
– https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/tpu_estimator.md

TPUEstimator
❖ …or maybe it is different after all…

TPUEstimator
❖ Executable APIs in main functions
– TPUEstimator.train()
– TPUEstimator.evaluate()
– TPUEstimator.predict()
– TPUEstimator.export_savedmodel()

TPUEstimator
❖ Component APIs
– model_fn() (MUST HAVE)
– input_fn() (MUST HAVE)
– metric_fn() (MAY NEED)
– summary_fn() (MAY NEED)

TPUEstimator
❖ Component APIs (where each one runs)
– model_fn() (TPU)
– input_fn() (CPU)
– metric_fn() (CPU)
– summary_fn() (CPU)

Oh shit! TPUEstimator
CODE CREDIT: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/README.md

TPUEstimator
❖ So is the code structure really not very different from a regular tf.Estimator?
– https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/tpu/tpu_estimator.md
❖ model_fn()
• Only the model part of the graph is replicated and placed on every TPU core
• In-graph replication (single session):
– A single client builds the whole graph
– Only the computationally intensive part is replicated and assigned to each worker
– At every step, the gradients from all cores are combined to update the weights
• Between-graph replication (multiple sessions):
– A separate client is created for each worker
– Each one builds the whole graph separately and performs its task

TPUEstimator
❖ model_fn()
✓ Multi-GPU training is also moving toward in-graph replication
– tf.contrib.distribute.MirroredStrategy
– https://github.com/tensorflow/tensorflow/blob/master/tensorflow/contrib/distribute/README.md

TPUEstimator
❖ model_fn()
– The optimizer must be wrapped with CrossShardOptimizer.
– For gradient sharing among TPU cores:
• forward pass → gradient calculation on an independent mini-batch → exchange gradients among the cores → weight update
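The gradient-sharing cycle can be simulated in plain Python. This is a conceptual sketch, not the CrossShardOptimizer implementation: it assumes a one-parameter model y = w * x with mean squared error, averages the per-core gradients, and applies the same update everywhere. All names are hypothetical.

```python
def mse_gradient(w, examples):
    """Gradient of mean squared error for the model y = w * x."""
    return sum(2.0 * (w * x - y) * x for x, y in examples) / len(examples)


def cross_shard_step(w, shards, learning_rate):
    """One data-parallel SGD step: local gradients, cross-core mean, shared update."""
    local_grads = [mse_gradient(w, shard) for shard in shards]   # one per core
    mean_grad = sum(local_grads) / len(local_grads)              # gradient exchange
    return w - learning_rate * mean_grad                         # same update on every core


# Toy data for y = 3x, split evenly over 8 simulated cores.
data = [(float(x), 3.0 * x) for x in range(1, 17)]
shards = [data[i::8] for i in range(8)]

w = 0.0
for _ in range(200):
    w = cross_shard_step(w, shards, learning_rate=0.001)
print(f"learned w = {w:.4f}")
```

Because the shards are equal in size, the mean of the per-core gradients equals the gradient on the full batch, so the 8 cores behave like one device with an 8 times larger batch.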

TPUEstimator
❖ model_fn()
– The optimizer must be wrapped with CrossShardOptimizer.
– It returns a TPUEstimatorSpec.

TPUEstimator
❖ input_fn()
– The dataset must be remote (GCP bucket)!!
– Using TFRecord is a choice made for data feeding speed (not mandatory)
• For speed, use TFRecord, keep the number of files small, and make each file about x MB
– Using tf.data is mandatory
– Shapes must be fixed in advance in input_fn() (no None dimensions)
• In train/eval mode, model_fn() builds the model to match the shapes returned by input_fn()
• In prediction mode, batch_size is taken from the arguments of model_fn
• https://github.com/tensorflow/tpu/blob/1fe0a9b8b8df3e2eb370b0ebb2f80eded6a9e2b6/models/official/resnet/resnet_main.py
– Runs on the CPU
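One common source of a None batch dimension is a final, partially filled batch. A minimal sketch of the usual remedy, dropping the remainder so every batch has the same size, is shown below in plain Python; the helper name is hypothetical and stands in for the corresponding tf.data batching option.

```python
def fixed_size_batches(examples, batch_size):
    """Yield only full batches so every batch has the same, known size."""
    for start in range(0, len(examples) - batch_size + 1, batch_size):
        yield examples[start:start + batch_size]


examples = list(range(10))
batches = list(fixed_size_batches(examples, batch_size=4))
print(batches)   # the last 2 examples do not fill a batch and are dropped
```

The trade-off is that a few examples per epoch are skipped, which is usually acceptable for training on a shuffled, repeated dataset.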

TPUEstimator
❖ Input pipeline
– All in TensorFlow!
• TFRecords + tf.data + TPUEstimator
• Preprocessor in TensorFlow
– tf.data.Dataset.map() or tf.contrib.data.map_and_batch()
• + GCP bucket!
[Figure: raw data → TFRecord converter → (upload) → GCP bucket → (download) → tf.data TFRecordDataset → preprocessor → tf.data iterator → (to trainer) → TF Estimator]

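TPUEstimator
❖ Why it is hard to apply the TF pipeline to pose estimation (as of TF 1.8)
• Pose image augmentation has to transform (input, labels) together
• Input: image of a person
• Labels: keypoint coordinates (x, y)
• The TF 1.8 image augmentation APIs do not yet return the random transformation values
• Augmentation has to be random
• The augmentation API has to return the random transformation values so that the labels can be transformed in the same way
• Examples:
• tf.contrib.image.rotate: needs to return the rotation angle
• tf.image.random_flip_left_right: needs to return whether the image was flipped or not
• tf.image.crop_and_resize: needs the coordinates of the cropped bounding box
• To build the input pipeline in TensorFlow, you have to implement all of this yourself!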
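The requirement that image and labels share the same random transformation can be shown with a horizontal flip. This is a plain-Python sketch on a nested-list image, not TensorFlow code; the function names are hypothetical. The key point is that the random decision is drawn once and returned, so the keypoints can follow the image.

```python
import random


def flip_pair(image, keypoints, do_flip):
    """Flip image rows left-right and mirror keypoint x coordinates together."""
    if not do_flip:
        return image, keypoints
    width = len(image[0])
    flipped_image = [row[::-1] for row in image]
    flipped_keypoints = [(width - 1 - x, y) for x, y in keypoints]
    return flipped_image, flipped_keypoints


def random_flip_pair(image, keypoints, rng):
    """Draw the random decision once and return it along with both results."""
    do_flip = rng.random() < 0.5
    new_image, new_keypoints = flip_pair(image, keypoints, do_flip)
    return new_image, new_keypoints, do_flip


image = [[1, 2, 3],
         [4, 5, 6]]
keypoints = [(0, 1)]                  # (x, y): points at pixel value 4
rng = random.Random(0)
new_image, new_keypoints, flipped = random_flip_pair(image, keypoints, rng)
x, y = new_keypoints[0]
print(flipped, new_image[y][x])       # the keypoint still lands on pixel value 4
```

The same pattern applies to rotation and cropping: sample the angle or the crop box first, then apply it to both the image and the coordinates.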

TPUEstimator
❖ summary_fn()
– A function that adds TensorBoard summaries
– Inside model_fn(), it is passed as the "host_call" argument of TPUEstimatorSpec.
• Returns a tuple of summary tensors
• In the official code the function is named host_call()
• https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimatorSpec

TPUEstimator
❖ metric_fn()
– The function called by TPUEstimator.evaluate() to compute the evaluation metrics
• Ex) top-1 and top-5 accuracy
– Inside model_fn(), it is passed as the "eval_metrics" argument of TPUEstimatorSpec.
• Returns a dict of metric tensors
• https://www.tensorflow.org/api_docs/python/tf/contrib/tpu/TPUEstimatorSpec
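What such a metric function computes can be sketched without TensorFlow. The version below is a plain-Python stand-in that returns numbers rather than metric tensors, and it uses top-2 instead of top-5 because the toy data has only 3 classes; both choices are assumptions for illustration.

```python
def top_k_accuracy(logits, labels, k):
    """Fraction of examples whose true label is among the k largest logits."""
    hits = 0
    for scores, label in zip(logits, labels):
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)


def metric_fn(labels, logits):
    """Return a dict of named metrics, mirroring the dict described above."""
    return {
        "top_1_accuracy": top_k_accuracy(logits, labels, k=1),
        "top_2_accuracy": top_k_accuracy(logits, labels, k=2),
    }


logits = [[0.1, 0.7, 0.2],
          [0.5, 0.3, 0.2],
          [0.2, 0.3, 0.5]]
labels = [1, 1, 2]
metrics = metric_fn(labels, logits)
print(metrics)
```

In the toy data the second example is wrong at top-1 but correct at top-2, which is why the two metrics differ.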

TPUEstimator
❖ Converting from TF Estimator to TPUEstimator
– https://cloud.google.com/tpu/docs/using-estimator-api?authuser=0
– 1) Change tf.estimator.RunConfig to tf.contrib.tpu.RunConfig.
– 2) Set TPUConfig to specify the iterations_per_loop.
• tf.contrib.tpu.RunConfig
• iterations_per_loop is the number of steps the Cloud TPU runs in a single session run
– Larger values run faster, but make the results harder to monitor
– 3) In model_fn, use tf.contrib.tpu.CrossShardOptimizer to wrap your optimizer.
– 4) Change tf.estimator.Estimator to tf.contrib.tpu.TPUEstimator.
– 5) Set effective batch size = 8 * batchsize_per_core
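The effect of iterations_per_loop and the effective batch size can be estimated with a little arithmetic. This is a back-of-the-envelope sketch; the function name, the step count, and the per-core batch size are hypothetical values chosen for illustration.

```python
import math

NUM_CORES = 8


def training_plan(train_steps, iterations_per_loop, batch_size_per_core):
    """Summarize how a training run is divided into on-device loops."""
    effective_batch = NUM_CORES * batch_size_per_core
    return {
        "effective_batch_size": effective_batch,
        "host_round_trips": math.ceil(train_steps / iterations_per_loop),
        "examples_per_loop": iterations_per_loop * effective_batch,
    }


for ipl in (1, 100, 1000):
    plan = training_plan(train_steps=10000, iterations_per_loop=ipl,
                         batch_size_per_core=128)
    print(ipl, plan)
```

Fewer host round trips mean less overhead, but control (and with it logging and checkpointing) returns to the host less often.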

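TPUEstimator!!!
❖ Impressions after using the TPU
– If you have never written custom code with tf.Estimator, be prepared for a hard time
– Because you cannot touch the inside of the session, coding feels extremely frustrating at first
• Code compatibility with NumPy becomes very poor.
• Less flexibility for summaries + logging (shit point 1)
• To be fair, this is probably my own lack of knowledge rather than something impossible
– The entire pipeline has to be expressed as tensors
• The input data pipeline is restricted to TensorFlow ops (shit point 2)
• Image augmentation for pose estimation is difficult
– For generative models, model_fn() has to run on the CPU (added by 채영)
• Generative models have many non-linear ops, so it is hard to run everything on the TPU
• In this case CPU hosting is required, so tf.Estimator has to be used in combination
– Not every TensorFlow op is supported on the TPU (shit point 3)
– Hard to debug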

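Still, you should use the TPU!
❖ A 30x speedup means a 30x gain in productivity!
– Very efficient for models that need bulk matrix multiplications
• ex) image classification / segmentation
– Use with generative models still needs qualification
– Given Google's fast pace of development, commercial availability is imminent!
– If you get the chance to use one, you should definitely try it!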

Some TPU benchmarks
❖ MobileNet v1 (run with https://github.com/jwkanggist/tpu-tutorial-experiment/blob/master/run_mobilenet_main.sh )
– Took 48 hours to reach the top-1 accuracy reported in the paper
• Reported in the paper: 70.6%
• Trained: 71.27% after 48 hours on Cloud TPU
[Figure: loss curve]

Some TPU benchmarks
❖ MobileNet v1 (run with https://github.com/jwkanggist/tpu-tutorial-experiment/blob/master/run_mobilenet_main.sh )
– Took 48 hours to reach the top-1 accuracy reported in the paper
[Figure: top-1 accuracy curve]

Some TPU benchmarks
❖ ResNet-50 (run with https://github.com/jwkanggist/tpu-tutorial-experiment/blob/master/run_resnet_main.sh )
– Took xx hours to reach the top-1 accuracy reported in the paper
[Figure: loss curve]

Some TPU benchmarks
❖ ResNet-50 (run with https://github.com/jwkanggist/tpu-tutorial-experiment/blob/master/run_resnet_main.sh )
– Took xx hours to reach the top-1 accuracy reported in the paper
[Figure: top-1 accuracy curve]

Some TPU benchmarks
❖ To be added later (if Cloud TPU access remains available..)
– SqueezeNet
– ShuffleNet v1, v2
– MobileNet v2 ..

Code References
❖ trainer_tpu.py from the MoTLabs "Don't be turtle" (forward head posture) project
– https://github.com/motlabs/dont-be-turtle/blob/develop/tfmodules/trainer_tpu.py
❖ TensorFlow official code
– https://github.com/tensorflow/tpu
– https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib/tpu
❖ Official Cloud TPU documentation
– https://cloud.google.com/tpu/docs/custom-setup

Welcome your feedback!
:) All opinions on using Google Cloud TPU are welcome :)
- Consulting
- Feedback
- Review!
- Co-work!

Reviewers
❖전태균 (SI Analytics)
❖서정훈 (SI Analytics)
❖전승현 (SI Analytics)
❖곽도영 (PNU, MoTLabs)
❖김택민 (SNU, MoTLabs)
❖이채영 (용인외대부고, MoTLabs)
❖황동현 (KETI)

About MoTLabs
❖ If you want to try putting your own model on mobile, let's do it together!
❖ Sponsorship and participation are welcome!
❖ https://motlabs.github.io/
❖ https://www.facebook.com/lab4all/posts/761099760749661
❖ Email jwkang10@gmail.com
❖ Keywords:
– CNN / RNN module
– Model Compression
– Tensorflow Lite
– CoreML
– MLkit
– Android / iOS
