This webinar by Dov Nimratz (Senior Solution Architect, Consultant, GlobalLogic) was delivered at Embedded Community Webinar #1 on July 7, 2020.
Webinar agenda:
- CPU / GPU / TPU architectures
- Historical context
- CPUs and their variations
- GPU, or a genie in a bottle for artificial intelligence tasks
- TPU architecture: a specialized artificial intelligence accelerator
- What's next in technology
More details and presentation: https://www.globallogic.com/ua/about/events/embedded-community-webinar-1/
Technical computing (high-performance computing) used to be the domain of specialists using expensive, proprietary equipment. Today, technical computing is going mainstream, becoming an indispensable competitive tool for research scientists and businesses alike.
Here's a look at Dell’s pioneering role in the evolution of technical computing, with a focus on the key industry trends and technologies that will bring the next generation of tools and functionality to research and development organizations around the world.
High performance computing tutorial, with checklist and tips to optimize clus... (Pradeep Redddy Raamana)
Introduction to high performance computing: what it is, how to use it, and when to use what. Provides a detailed checklist for building pipelines, plus tips to optimize cluster usage and reduce queue waiting time. It also gives a quick overview of the resources available on Compute Canada.
Compression Options in Hadoop - A Tale of Tradeoffs (DataWorks Summit)
Yahoo! is one of the most-visited web sites in the world. It runs one of the largest private cloud infrastructures, one that operates on petabytes of data every day. Being able to store and manage that data well is essential to the efficient functioning of Yahoo!'s Hadoop clusters. A key component that enables this efficient operation is data compression. With regard to compression algorithms, there is an underlying tension between compression ratio and compression performance. Consequently, Hadoop provides support for several compression algorithms, including gzip, bzip2, Snappy, LZ4 and others. This plethora of options can make it difficult for users to select appropriate codecs for their MapReduce jobs. This paper attempts to provide guidance in that regard. Performance results with Gridmix and with several corpora of data are presented. The paper also describes enhancements we have made to the bzip2 codec that improve its performance. This will be of particular interest to the increasing number of users operating on “Big Data” who require the best possible ratios. The impact of using the Intel IPP libraries is also investigated; these have the potential to improve performance significantly. Finally, a few proposals for future enhancements to Hadoop in this area are outlined.
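The ratio-versus-speed tension the abstract describes is easy to demonstrate. A minimal sketch using Python's standard-library codecs as stand-ins (Hadoop's Snappy and LZ4 are not in the stdlib, and real codec choices also depend on splittability, but the tradeoff has the same shape):

```python
import bz2, gzip, lzma, time

# Compare ratio vs. compression time across stdlib codecs on repetitive data.
data = b"record,2020-07-07,value=42\n" * 20000

for name, compress in [("gzip", gzip.compress),
                       ("bzip2", bz2.compress),
                       ("lzma", lzma.compress)]:
    t0 = time.perf_counter()
    out = compress(data)
    dt = time.perf_counter() - t0
    print(f"{name:6s} ratio={len(data) / len(out):7.1f}  time={dt * 1000:7.2f} ms")
```

On typical inputs the slower codecs buy a better ratio, which is exactly the tradeoff that makes codec selection workload-dependent.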
This presentation was prepared by Abdussamad Muntahi for the Seminar on High Performance Computing on 11/7/13 (Thursday), organized by the BRAC University Computer Club (BUCC) in collaboration with the BRAC University Electronics and Electrical Club (BUEEC).
How Development Teams Cut Costs with ScyllaDB.pdf (ScyllaDB)
Now that teams are increasingly being pressed to cut costs, the database can be a low-hanging fruit for sizable cost reduction – especially if you’re managing terabytes to petabytes of data with millions of read/write operations per second.
Join Tzach Livyatan, VP of Product at ScyllaDB, as he shares four ways that teams commonly cut database costs by rethinking their database strategy. We’ll cover topics including:
- Cutting admin costs by reducing node sprawl and reducing the need for tuning
- ScyllaDB as a better, compatible Amazon DynamoDB
- Options to increase price performance through new cloud instances
- Ways to safely add more workloads to your cluster without compromising the performance of your latency-sensitive workloads
Linux Kernel vs DPDK: HTTP Performance Showdown (ScyllaDB)
In this session I will use a simple HTTP benchmark to compare the performance of the Linux kernel networking stack with userspace networking powered by DPDK (kernel-bypass).
It is said that kernel-bypass technologies avoid the kernel because it is "slow", but in reality, a lot of the performance advantages that they bring just come from enforcing certain constraints.
As it turns out, many of these constraints can be enforced without bypassing the kernel. If the system is tuned just right, one can achieve performance that approaches kernel-bypass speeds while still benefiting from the kernel's battle-tested compatibility and rich ecosystem of tools.
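The "simple HTTP benchmark" the session mentions has a measurement shape like the sketch below: a toy server plus a timed client loop. This is purely illustrative; real kernel-vs-DPDK comparisons use load generators such as wrk against a tuned server, not a sequential loopback client.

```python
import threading, time, urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # silence per-request logging
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/"

n = 200
t0 = time.perf_counter()
for _ in range(n):
    with urllib.request.urlopen(url) as r:
        r.read()
elapsed = time.perf_counter() - t0
print(f"{n / elapsed:.0f} requests/sec (sequential, loopback)")
server.shutdown()
```

Even this toy shows why the benchmark design matters: request concurrency, connection reuse, and system tuning dominate the numbers long before the networking stack itself does.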
X13 Products + Intel® Xeon® CPU Max Series – An Applications & Performance View (Rebekah Rodriguez)
With Intel’s January 10 launch of the Intel® Xeon® Max CPU series, the industry’s first CPU with high bandwidth memory (HBM), Supermicro is proud to discuss its complete range of first-to-market X13 servers with high bandwidth memory. This Supermicro Systems, Applications, and Performance webinar shows how Supermicro’s Green Compute approach is the best solution for customers who want more performance per watt while lowering CAPEX and OPEX.
Join us as we highlight our server solutions optimized for customer applications and for scale-out configurations that drive higher compute density in today’s modern data centers, along with some real performance improvements.
Supermicro’s Universal GPU: Modular, Standards Based and Built for the Future (Rebekah Rodriguez)
The Universal GPU system architecture combines the latest technologies that support multiple GPU form factors, CPU choices, storage, and networking options. Together, these components are optimized to deliver high performance in a balanced architecture in a highly scalable system. Systems can be optimized for each customer’s specific Artificial Intelligence (AI), Machine Learning (ML), or High Performance Computing (HPC) applications. Organizations worldwide are demanding new options for their future computing environments, which have the thermal headroom for the next generation of CPUs and GPUs.
Join this webinar to learn how to leverage Supermicro's Universal GPU system to simplify customer deployments, deliver ultimate modularity and customization options for AI to Omniverse environments.
High Performance Computing Presentation (omar altayyan)
This presentation was delivered on 3-6-2018 in the Data Mining course, AI Specialization, at the Faculty of Information Technology Engineering, Damascus University.
Paper Link:
https://shamra.sy/academia/show/5b0c790de9fc6
Large-Scale Optimization Strategies for Typical HPC Workloads (inside-BigData.com)
In this deck from PASC 2019, Liu Yu from Inspur presents: Large-Scale Optimization Strategies for Typical HPC Workloads.
"Ensuring performance of applications running on large-scale clusters is one of the primary focuses of HPC research. In this talk, we will show our strategies for performance analysis and optimization of applications in different fields of research on large-scale HPC clusters. Our strategies comprehensively analyze applications' runtime features, the parallel mode of the physical model, the algorithm implementation, and other technical details. This three-level strategy covers platform optimization, technological innovation, and model innovation, with targeted optimization based on these features. State-of-the-art CPU instructions, network communication and other modules, and innovative parallel modes of some applications have been optimized. After optimization, these applications are expected to outperform their non-optimized counterparts with a clear increase in performance."
Watch the video: https://wp.me/p3RLHQ-kwB
Learn more: http://en.inspur.com/en/2403285/2403287/2403295/index.html
and
https://pasc19.pasc-conference.org/program/keynote-presentations/
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
This talk was given at Vizianagaram, where many engineering college faculty attended. I introduced developments in multi-core computers along with their architectural evolution, and explained high performance computing and where it is used. I also introduced the concept of pipelining, Amdahl's law, issues related to pipelining, and the MIPS architecture.
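Amdahl's law, one of the concepts covered in the talk, fits in a couple of lines: the speedup from parallelizing a fraction p of a program over n processors is capped by the remaining serial fraction.

```python
# Amdahl's law: overall speedup when fraction p of the work is parallelized
# across n processors. The serial fraction (1 - p) bounds the speedup.
def amdahl_speedup(p: float, n: int) -> float:
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, 1024 cores give under 20x:
print(round(amdahl_speedup(0.95, 1024), 1))
```

This is why the talk pairs multi-core developments with pipelining issues: throwing cores at a workload only helps up to the limit set by its serial portion.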
Morph: Flexible Acceleration for 3D CNN-based Video Understanding.
Kartik Hegde, Rohit Agrawal, Yulun Yao, Christopher W. Fletcher
University of Illinois
2018 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Application Profiling at the HPCAC High Performance Center (inside-BigData.com)
Pak Lui from the HPC Advisory Council presented this deck at the 2017 Stanford HPC Conference.
"To achieve good scalability performance on the HPC scientific applications typically involves good understanding of the workload though performing profile analysis, and comparing behaviors of using different hardware which pinpoint bottlenecks in different areas of the HPC cluster. In this session, a selection of HPC applications will be shown to demonstrate various methods of profiling and analysis to determine the bottleneck, and the effectiveness of the tuning to improve on the application performance from tests conducted at the HPC Advisory Council High Performance Center."
Watch the video presentation: http://wp.me/p3RLHQ-gpY
Learn more: http://hpcadvisorycouncil.com
Sign up for our insideHPC Newsletter: http://insidehpc.com/newsletter
Hadoop World 2011: Hadoop Network and Compute Architecture Considerations - J... (Cloudera, Inc.)
Hadoop is a popular framework for web 2.0 and enterprise businesses that are challenged to store, process, and analyze large amounts of data as part of their business requirements. Hadoop's framework brings a new set of challenges for the compute infrastructure and underlying network architectures. This session reviews the state of Hadoop enterprise environments, discusses fundamental and advanced Hadoop concepts, and reviews benchmarking analysis and projections for big data growth as they relate to data center and cluster designs. The session also discusses network architecture tradeoffs and the advantages of close integration between compute and networking.
Lightweight DNN Processor Design (based on NVDLA), by Shien-Chun Luo
https://sites.google.com/view/itri-icl-dla/
(Public information share) This is our lightweight DNN inference processor presentation, covering a system solution (from Caffe prototxt to HW control files), hardware features, and an example of object detection (Tiny YOLO) RTL simulation results. We modified the open-source NVDLA (small configuration) and developed a RISC-V MCU for this accelerating system.
#PR12 #PR366
Hello, this is the 366th paper review from the paper-reading group PR-12.
This year marks the 10th anniversary of AlexNet.
After AlexNet's meteoric debut in 2012, the 2010s were a decade in which "solve a computer vision problem = use a CNN" was treated as a given,
but in the 2020s, starting with the arrival of ViT, Transformer-based networks have threatened CNNs' position and have already taken over much of it.
So where should CNNs go in the 2020s?
Is it really true that Transformers, with their weaker inductive bias, always beat CNNs when trained on large amounts of data?
Under the title "A ConvNet for the 2020s", this paper proposes a new(?) architecture called ConvNeXt.
In truth, nothing in it is brand new: it copies over existing techniques, plus ideas applied in Transformers, into a CNN,
and the result reportedly beats Transformers in both accuracy and speed.
There has been some controversy on Twitter about the results; that discussion, along with the details, is covered in the video.
Thanks as always to everyone who watches, likes, comments, and subscribes :)
Paper link: https://arxiv.org/abs/2201.03545
Video link: https://youtu.be/Mw7IhO2uBGc
PR-355: Masked Autoencoders Are Scalable Vision Learners (Jinwon Lee)
#PR12 #PR355
Hello, this is the 355th paper review from the paper-reading group PR-12.
Why does computer vision have no model like BERT or GPT?
When will we see a model that is pretrained with self-supervised learning and then beats supervised learning on downstream tasks?
That model may well be in this paper.
Using a ViT-based autoencoder, this work achieves state-of-the-art results (among ImageNet-1K-only methods) with self-supervised pretraining on the ImageNet-1K training set.
The image is split into patches, 75% of the patches are masked, and the model directly predicts the pixel values of the masked 75% from only the remaining 25% of patches.
It also uses less compute and memory than other models, which makes scaling up to big models easier.
There are interesting ideas and a variety of experimental results, so please see the talk video for details!
Video link: https://youtu.be/mtUa3AAxPNQ
Paper link: https://arxiv.org/abs/2111.06377v1
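The patch-masking scheme described above can be sketched in a few lines of NumPy. The shapes below (224x224 image, 16x16 patches) match common ViT conventions but are illustrative only; this is not the paper's code.

```python
import numpy as np

# Split an image into 16x16 patches, keep a random 25%, and treat the
# remaining 75% as reconstruction targets, as in MAE-style pretraining.
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
P = 16                                   # patch size
patches = img.reshape(14, P, 14, P, 3).swapaxes(1, 2).reshape(196, P * P * 3)

mask_ratio = 0.75
n_keep = int(196 * (1 - mask_ratio))     # 49 visible patches
perm = rng.permutation(196)
visible, masked = perm[:n_keep], perm[n_keep:]

print(patches[visible].shape)   # encoder input: (49, 768)
print(patches[masked].shape)    # reconstruction targets: (147, 768)
```

The efficiency claim in the review follows directly: the encoder only ever sees the 25% of visible patches, so most of the compute scales with 49 tokens rather than 196.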
PR-344: A Battle of Network Structures: An Empirical Study of CNN, Transforme... (Jinwon Lee)
#PR12 #PR344
Hello, this is the 344th paper review from the TensorFlow Korea paper-reading group PR-12.
Today's paper, from USTC and MSRA, carries the striking title "A Battle of Network Structures".
As the subtitle suggests, it compares CNNs, Transformers, and MLPs for computer vision under identical conditions to find out what characterizes each.
To run a fair comparison, the authors build a unified framework called SPACH and plug CNN, Transformer, and MLP modules into it.
One experimental result: given the right conditions, all three reach similar performance, but MLPs overfit as model size grows;
CNNs generalize better than Transformers, performing well even with little data;
and Transformers, with their larger model capacity, do best when data is plentiful and compute budgets are large.
Another finding is that even Transformers and MLPs, which have global receptive fields, improve when paired with local models that perform local operations.
Building on these insights, the paper proposes a hybrid model combining a CNN and a Transformer and shows it can reach state-of-the-art performance.
Personally I did not find any startling insight here, but it is a good paper for organizing the characteristics, strengths, and weaknesses of the three network families.
Please see the video for details. Thank you!
Video link: https://youtu.be/NVLMZZglx14
Paper link: https://arxiv.org/abs/2108.13002
PR-330: How To Train Your ViT? Data, Augmentation, and Regularization in Visi... (Jinwon Lee)
Hello, this is the 330th paper review from the TensorFlow Korea paper-reading group PR-12.
Today I reviewed a very Google-scale paper that releases no fewer than 50,000 trained ViT models. ViT is gradually replacing CNNs, but unlike CNNs it has little inductive bias,
so good performance requires either a very large amount of data or heavy use of augmentation and regularization.
Until now, however, there had been no systematic comparison of ViT accuracy and speed across all these axes: different datasets, model sizes, augmentation methods, regularization schemes, data sizes, and so on.
This paper pulls off that difficult(?) feat, and its massive ViT experiments yield several important findings.
In summary:
1. With good augmentation and regularization, one-tenth of the data usually gets you close to the performance of the full dataset, but not always.
Conversely, with 10x the data you can do well without augmentation or regularization.
2. For downstream tasks, transfer learning from a model pretrained on a large dataset beats training from scratch.
3. When transferring, pretrained models trained on more data are better.
4. Augmentation/regularization helps little when data is plentiful, and of the two, augmentation is the more useful.
5. When many pretrained models are available, simply picking the one that did best upstream works reasonably well.
6. If you want more speed, increase the patch size rather than shrinking the model; that way accuracy drops much less.
There are many interesting results, so please see the video below for details!
Thank you!
Video link: https://youtu.be/A3RrAIx-KCc
Paper link: https://arxiv.org/abs/2106.10270
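Finding 6 above has a simple arithmetic explanation: a ViT's token count is (image size / patch size)^2, and self-attention cost grows with the square of the token count, so a larger patch shrinks compute far more than a slightly smaller model would. Illustrative arithmetic only:

```python
# Token count and (rough) self-attention cost for a 224x224 input at
# several patch sizes. Attention cost here is just tokens^2, ignoring
# constant factors and the per-token MLP work.
def n_tokens(image_size: int, patch_size: int) -> int:
    return (image_size // patch_size) ** 2

for patch in (8, 16, 32):
    t = n_tokens(224, patch)
    print(f"patch {patch:2d}: {t:4d} tokens, attention cost ~ {t * t:>9,d}")
```

Doubling the patch size cuts the token count by 4x and the attention term by 16x, which is why it is the cheaper lever for speed.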
PR-317: MLP-Mixer: An all-MLP Architecture for Vision (Jinwon Lee)
Can CNNs survive in computer vision?
Hello, this is the 317th paper review from the TensorFlow Korea paper-reading group PR-12.
This time I reviewed "MLP-Mixer: An all-MLP Architecture for Vision" from Google Research, Brain Team.
The attack from attention was already hard enough to fend off, and now comes an attack from the MLP (multi-layer perceptron).
It performs image classification using only MLPs, with good accuracy and fast speed.
Briefly, the architecture replaces the self-attention part of ViT (Vision Transformer) with MLPs.
Two MLP blocks are used: one operates across patches (tokens), and the other operates within each patch.
Although MLPs are used, as the paper itself notes, this part can be viewed as a kind of convolution.
Still, it is impressive that such a very simple architecture, with almost none of convolution's inductive bias, reduces the quadratic complexity inherent to Transformer-based networks to linear while achieving such good performance.
On the downside, it needs a lot of data, and like any MLP it can only accept fixed-length inputs.
I hope this work becomes the start of renewed attention to MLPs.
Similar concurrent work is also briefly introduced at the end.
Enjoy, and thank you!
Paper link: https://arxiv.org/abs/2105.01601
Video link: https://youtu.be/KQmZlxdnnuY
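The two MLP blocks described above can be sketched with plain matrix products. This is a deliberately stripped-down illustration (single linear layers standing in for the MLPs, no LayerNorm or GELU), not the paper's code:

```python
import numpy as np

# Mixer-style block on a (patches, channels) matrix: one weight matrix
# mixes across patches (token-mixing), the other within each patch
# (channel-mixing). Residual connections as in the paper.
rng = np.random.default_rng(0)
S, C = 196, 512                      # number of patches (tokens), channels
x = rng.random((S, C))

W_tok = rng.random((S, S)) * 0.01    # token-mixing weights
W_ch = rng.random((C, C)) * 0.01     # channel-mixing weights

x = x + (W_tok @ x)                  # mix information across patches
x = x + (x @ W_ch)                   # mix information within each patch
print(x.shape)                       # (196, 512)
```

Note that the token-mixing cost is S*S*C per block, linear in the number of tokens squared only through a fixed weight matrix rather than a data-dependent attention matrix, which is the structural difference from self-attention the review alludes to.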
PR-297: Training data-efficient image transformers & distillation through att... (Jinwon Lee)
Hello, this is the 297th review from the TensorFlow Korea paper-reading group PR-12.
Only three papers remain until the end of PR-12 season 3.
Once season 3 ends, recruitment of new members for season 4 will begin. Your interest and applications are welcome!
(The recruitment notice will be posted in the TensorFlow Korea group on Facebook.)
The paper I reviewed today is Facebook's "Training data-efficient image transformers & distillation through attention".
Since Google's ViT paper, interest in computer vision algorithms that use only attention, with no convolution at all, has never been higher.
The DeiT model proposed here uses the same architecture as ViT, but where ViT struggled when trained on ImageNet data alone,
DeiT combines improved training methods with a new knowledge distillation technique to outperform EfficientNet using only ImageNet data.
Will CNNs really fade away now? Will attention conquer computer vision too?
Personally, I am convinced that attention-based CV papers will pour out for a while, and that surprising things can happen there.
CNNs have advanced through ten years of intensive research, whereas Transformers have only just arrived in CV, so expectations are even higher,
and because attention is the model family with the least inductive bias, I think it can produce even more surprises.
OpenAI's recent DALL-E is a representative example. If you are curious about this latest transformation of the Transformer, please see the video below.
Video link: https://youtu.be/DjEvzeiWBTo
Paper link: https://arxiv.org/abs/2012.12877
PR-284: End-to-End Object Detection with Transformers (DETR) (Jinwon Lee)
This is the 284th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is DETR (DEtection with TRansformer) from Facebook.
It also sits at the very top of arxiv-sanity's top recent/last year list (http://www.arxiv-sanity.com/top?timefilter=year&vfilter=all).
With ViT recently submitted to ICLR 2021, there is much talk of Transformers replacing CNNs. DETR was presented at ECCV this year, and although it still uses a CNN for feature extraction, I consider it an important paper for showing how to perform object detection effectively with a Transformer. The paper points out that detection pipelines rely on heuristic, non-differentiable components such as anchor boxes and NMS (non-maximum suppression), which is why object detection, unlike most of deep learning, has resisted true end-to-end training. As a remedy, it casts bounding-box prediction as a set prediction problem (no duplicates, order-invariant) and proposes an end-to-end, Transformer-based algorithm. If you want the details of DETR, which needs neither anchor boxes nor NMS, please see the video!
Video link: https://youtu.be/lXpBcW_I54U
Paper link: https://arxiv.org/abs/2005.12872
PR-270: PP-YOLO: An Effective and Efficient Implementation of Object Detector (Jinwon Lee)
This is the 270th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is Baidu's "PP-YOLO: An Effective and Efficient Implementation of Object Detector". It applies a variety of techniques to YOLOv3 and manages to catch both rabbits(?): very high accuracy and very high speed. I took a deeper look at the many tricks used in the paper. If you are interested in object detection techniques such as deformable convolution, exponential moving average, DropBlock, IoU-aware prediction, grid sensitivity elimination, Matrix NMS, and CoordConv, the video and slides should be helpful!
Paper link: https://arxiv.org/abs/2007.12099
Video link: https://youtu.be/7v34cCE5H4k
PR-258: From ImageNet to Image Classification: Contextualizing Progress on Be... (Jinwon Lee)
This is the 258th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is MIT's "From ImageNet to Image Classification: Contextualizing Progress on Benchmarks".
Anyone doing deep learning knows ImageNet. This paper discusses the limitations and problems of ImageNet's labeling process and points out that top-1-accuracy-based evaluation can also be problematic.
More than 20% of ImageNet images contain multiple objects, yet only one is accepted as the correct answer, and due to limits of the annotation procedure, many images are labeled with a class that differs from what a person would actually say. There are also many labels that only experts can judge, such as the more than 20 breeds of terrier. Through a range of experiments with quantitative analysis and human-in-the-loop evaluation, the paper examines how far current models have actually come and what data-labeling challenges must be solved to push performance further. The paper is long but light on technical machinery, so it reads easily; if you are curious about the details, please see the video!
Paper link: https://arxiv.org/abs/2005.11295
Video link: https://youtu.be/CPMgX5ikL_8
This is the 243rd paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is "Designing Network Design Spaces" from Facebook AI Research, better known as RegNet.
When designing a CNN: are bottleneck layers really a good idea? Do more layers always mean higher accuracy? When the activation map's width and height are halved (stride 2 or pooling), we double the channels, but is that optimal? Might a network without bottleneck layers be better, is there a magic number of layers for peak performance, and when activations are halved, might tripling the channels instead of doubling them work better?
Rather than designing one good neural network, this paper is about designing a good design space: a space populated by good networks that techniques like AutoML can then search. It proposes narrowing an almost unconstrained design space into a good one through human-in-the-loop analysis. See the video below to find out which design space produced RegNet, which outperforms EfficientNet, and which of the design choices we have taken for granted turn out to be wrong.
Video link: https://youtu.be/bnbKQRae_u4
Paper link: https://arxiv.org/abs/2003.13678
PR-231: A Simple Framework for Contrastive Learning of Visual Representations (Jinwon Lee)
This is the 231st paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is Google Brain's "A Simple Framework for Contrastive Learning of Visual Representations". With Geoffrey Hinton as the last author, it has drawn even more attention recently.
It belongs to the very hot topic of self-supervised learning via contrastive learning, and it proposes an unsupervised pre-training method that matches the performance of a ResNet-50 trained with supervised learning. Using data augmentation, a non-linear projection head, large batch sizes, longer training, the NT-Xent loss, and more, it demonstrates excellent representation learning, with very strong results in semi-supervised learning and transfer learning as well. Please see the video for details.
Paper link: https://arxiv.org/abs/2002.05709
Video link: https://youtu.be/FWhM3juUM6s
PR-217: EfficientDet: Scalable and Efficient Object Detection (Jinwon Lee)
This is the 217th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is EfficientDet from Google Brain. As the follow-up to EfficientNet, it proposes an object detection method that pursues both accuracy and efficiency, introducing a weighted bi-directional feature pyramid network (BiFPN) and a compound scaling method for detection similar to EfficientNet's. Please see the video for details.
Paper link: https://arxiv.org/abs/1911.09070
Video link: https://youtu.be/11jDC8uZL0E
PR-207: YOLOv3: An Incremental Improvement (Jinwon Lee)
This is the 207th paper review from the TensorFlow Korea paper-reading group PR-12.
This paper is YOLO v3.
It is famous enough to need little introduction: among object detection algorithms, YOLO is a highly distinctive one-stage algorithm. The paper walks through the changes made since YOLO v2 (YOLO9000) to improve performance, one by one. It also criticizes MS COCO's average mAP metric and discusses how mAP should be evaluated. Please see the video for details!
Paper link: https://arxiv.org/abs/1804.02767
Video link: https://youtu.be/HMgcvgRrDcA
PR-197: One ticket to win them all: generalizing lottery ticket initializatio... (Jinwon Lee)
This is the 197th paper review from the TensorFlow Korea paper-reading group PR-12.
(Only three papers remain to reach our season-2 goal of 200.)
The paper I presented this time is "One ticket to win them all: generalizing lottery ticket initializations across datasets and optimizers" from FAIR (Facebook AI Research).
How wonderful would it be to win first prize in every lottery with a single ticket?
Conventional network pruning fine-tunes the pruned network while reusing the weights learned before pruning,
because randomly re-initializing the pruned network's weights and retraining tends to perform poorly.
Last year's "Lottery Ticket Hypothesis" paper from MIT showed how to initialize such a pruned network so that it trains to high accuracy,
and named that initialization the lottery's "winning ticket".
But can a winning ticket still work on a different dataset or with a different optimizer?
For example, would a winning ticket found on CIFAR-10 also act as a winning ticket on ImageNet?
This paper answers these questions experimentally and offers several insights about initialization.
Please see the talk video for details!
Video link: https://youtu.be/YmTNpF2OOjA
Slides link: https://www.slideshare.net/JinwonLee9/pr197-one-ticket-to-win-them-all-generalizing-lottery-ticket-initializations-across-datasets-and-optimizers
Paper link: https://arxiv.org/abs/1906.02773
PR-183: MixNet: Mixed Depthwise Convolutional Kernels (Jinwon Lee)
This is the 183rd paper review from the TensorFlow-KR paper-reading group PR-12.
This paper is MixNet from Google Brain. Depthwise convolution is widely used in efficiency-oriented CNNs; this paper proposes mixing different depthwise convolution filter sizes to improve both accuracy and efficiency. Please see the video for details.
Paper link: https://arxiv.org/abs/1907.09595
Video link: https://youtu.be/252YxqpHzsg
PR-169: EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks (Jinwon Lee)
This is the 169th paper review from the TensorFlow-KR paper-reading group PR-12.
This paper is Google's EfficientNet. Research on efficient neural networks has usually targeted small networks for edge devices with limited computing power, such as mobile. In practice, though, networks are typically scaled up to gain accuracy, and this paper studies how to scale a network up in the most efficient way. Please see the video for details.
Paper link: https://arxiv.org/abs/1905.11946
Video link: https://youtu.be/Vhz0quyvR7I
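EfficientNet's answer to "how do you scale a network up efficiently" is compound scaling: depth, width, and input resolution grow together, driven by a single coefficient phi. A minimal sketch using the base coefficients reported in the paper (alpha=1.2, beta=1.1, gamma=1.15):

```python
# EfficientNet compound scaling: one coefficient phi scales depth, width,
# and resolution together via fixed base coefficients from the paper.
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15

def compound_scale(phi: int) -> tuple:
    """Return (depth, width, resolution) multipliers for coefficient phi."""
    return ALPHA ** phi, BETA ** phi, GAMMA ** phi

for phi in range(4):
    d, w, r = compound_scale(phi)
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```

The base coefficients are chosen so that each increment of phi roughly doubles FLOPs, which is what lets the B0 through B7 family trade compute for accuracy along one axis.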
PR-155: Exploring Randomly Wired Neural Networks for Image Recognition (Jinwon Lee)
This is the 155th paper review from the TensorFlow-KR paper-reading group PR-12.
This time I reviewed Facebook AI Research's recent (April 2) "Exploring Randomly Wired Neural Networks for Image Recognition". It shocked many people with the result that randomly generated networks can perform as well as or better than the networks people have painstakingly hand-designed. Please see the slides and video for details.
Paper link: https://arxiv.org/abs/1904.01569
Video link: https://youtu.be/NrmLteQ5BC4
PR-144: SqueezeNext: Hardware-Aware Neural Network DesignJinwon Lee
Tensorfkow-KR 논문읽기모임 PR12 144번째 논문 review입니다.
이번에는 Efficient CNN의 대표 중 하나인 SqueezeNext를 review해보았습니다. SqueezeNext의 전신인 SqueezeNet도 같이 review하였고, CNN을 평가하는 metric에 대한 논문인 NetScore에서 SqueezeNext가 1등을 하여 NetScore도 같이 review하였습니다.
논문링크:
SqueezeNext - https://arxiv.org/abs/1803.10615
SqueezeNet - https://arxiv.org/abs/1602.07360
NetScore - https://arxiv.org/abs/1806.05512
영상링크: https://youtu.be/WReWeADJ3Pw
NO1 Uk Amil Baba In Lahore Kala Jadu In Lahore Best Amil In Lahore Amil In La...Amil baba
Contact with Dawood Bhai Just call on +92322-6382012 and we'll help you. We'll solve all your problems within 12 to 24 hours and with 101% guarantee and with astrology systematic. If you want to take any personal or professional advice then also you can call us on +92322-6382012 , ONLINE LOVE PROBLEM & Other all types of Daily Life Problem's.Then CALL or WHATSAPP us on +92322-6382012 and Get all these problems solutions here by Amil Baba DAWOOD BANGALI
#vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore#blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #blackmagicforlove #blackmagicformarriage #aamilbaba #kalajadu #kalailam #taweez #wazifaexpert #jadumantar #vashikaranspecialist #astrologer #palmistry #amliyaat #taweez #manpasandshadi #horoscope #spiritual #lovelife #lovespell #marriagespell#aamilbabainpakistan #amilbabainkarachi #powerfullblackmagicspell #kalajadumantarspecialist #realamilbaba #AmilbabainPakistan #astrologerincanada #astrologerindubai #lovespellsmaster #kalajaduspecialist #lovespellsthatwork #aamilbabainlahore #Amilbabainuk #amilbabainspain #amilbabaindubai #Amilbabainnorway #amilbabainkrachi #amilbabainlahore #amilbabaingujranwalan #amilbabainislamabad
MATHEMATICS BRIDGE COURSE (TEN DAYS PLANNER) (FOR CLASS XI STUDENTS GOING TO ...PinkySharma900491
Class khatm kaam kaam karne kk kabhi uske kk innings evening karni nnod ennu Tak add djdhejs a Nissan s isme sniff kaam GCC bagg GB g ghan HD smart karmathtaa Niven ken many bhej kaam karne Nissan kaam kaam Karo kaam lal mam cell pal xoxo
2. References
Most figures and slides are from:
Norman P. Jouppi, et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit", 44th IEEE/ACM International Symposium on Computer Architecture (ISCA-44), Toronto, Canada, June 2017. https://arxiv.org/abs/1704.04760
David Patterson, "Evaluation of the Tensor Processing Unit: A Deep Neural Network Accelerator for the Datacenter", NAE Regional Meeting, April 2017. https://sites.google.com/view/naeregionalsymposium
Kaz Sato, "An in-depth look at Google's first Tensor Processing Unit (TPU)", https://cloud.google.com/blog/big-data/2017/05/an-in-depth-look-at-googles-first-tensor-processing-unit-tpu
4. A Golden Age in Microprocessor Design
• Stunning progress in microprocessor design: 40 years ≈ 10^6x faster!
• Three architectural innovations (~1000x)
Width: 8 → 16 → 32 → 64 bit (~8x)
Instruction-level parallelism: from 4-10 clock cycles per instruction to 4+ instructions per clock cycle (~10-20x)
Multicore: 1 processor to 16 cores (~16x)
• Clock rate: 3 MHz to 4000 MHz (~1000x, through technology & architecture)
• Made possible by IC technology:
Moore's Law: growth in transistor count (2x every 1.5 years)
Dennard Scaling: power per transistor shrinks at the same rate as transistors are added (power per mm² of silicon stays constant)
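The compounding behind the "10^6x" claim can be checked in a few lines. A sketch, using the approximate per-factor speedups quoted on the slide (these are round numbers, not measurements):

```python
# Approximate per-factor speedups from the slide.
width = 8        # datapath width: 8 -> 64 bit, ~8x
ilp = 15         # instruction-level parallelism, ~10-20x (midpoint)
multicore = 16   # 1 processor -> 16 cores, ~16x
clock = 1000     # 3 MHz -> 4000 MHz, ~1000x

architecture = width * ilp * multicore   # the "~1000x" architectural part
total = architecture * clock             # compounded: on the order of 10^6

print(f"architecture ~{architecture}x, total ~{total:,}x")
```

Multiplying the three architectural factors lands near the slide's "~1000x", and the clock-rate factor brings the compound gain to roughly a million.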
6. What's Left?
• Since
Transistors are not getting much better
The power budget is not getting much higher
We have already switched from 1 inefficient processor per chip to N efficient processors per chip
• The only path left is Domain Specific Architectures
Just do a few tasks, but do them extremely well
7. TPU Origin
• Starting as far back as 2006, Google engineers discussed deploying GPUs, FPGAs, or custom ASICs in their data centers. They concluded that they could instead use the excess capacity of their large data centers.
• The conversation changed in 2013, when it was projected that if people used voice search for 3 minutes a day via speech-recognition DNNs, Google's data centers would have to double in order to meet the computation demand.
• Google then started a high-priority project to quickly produce a custom ASIC for inference.
• The goal was to improve cost-performance by 10x over GPUs.
• Given this mandate, the TPU was designed, verified, built, and deployed in data centers in just 15 months.
8. TPU
• Built on a 28 nm process
• Runs at 700 MHz
• Consumes 40 W when running
• Connected to its host via a PCIe Gen3 x16 bus
• Packaged as a card that can replace a disk
• Up to 4 cards per server
9. 3 Kinds of Popular NNs
• Multi-Layer Perceptrons (MLP)
Each new layer is a set of nonlinear functions of a weighted sum of all outputs (fully connected) from the prior layer.
• Convolutional Neural Networks (CNN)
Each ensuing layer is a set of nonlinear functions of weighted sums of spatially nearby subsets of outputs from the prior layer; the weights are reused across positions.
• Recurrent Neural Networks (RNN)
Each subsequent layer is a collection of nonlinear functions of weighted sums of outputs and the previous state. The most popular RNN is the Long Short-Term Memory (LSTM).
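The three layer types above can be sketched with NumPy. All shapes and the choice of nonlinearities below are arbitrary, for illustration only; nothing here is TPU-specific:

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda v: np.maximum(v, 0)    # a typical nonlinearity

# MLP layer: weighted sum of ALL outputs of the prior layer (fully connected).
x = rng.standard_normal(256)
W = rng.standard_normal((128, 256))
mlp_out = relu(W @ x)                              # shape (128,)

# CNN layer (1-D for brevity): weighted sums of spatially nearby outputs,
# with the SAME kernel weights reused at every position.
seq = rng.standard_normal(32)
kernel = rng.standard_normal(3)
cnn_out = relu(np.convolve(seq, kernel, mode="valid"))   # shape (30,)

# RNN step: weighted sum of the current input AND the previous state.
h = np.zeros(64)                     # previous state
xt = rng.standard_normal(16)
Wx = rng.standard_normal((64, 16))
Wh = rng.standard_normal((64, 64))
h = np.tanh(Wx @ xt + Wh @ h)        # new state, shape (64,)
```

Note how all three reduce to "nonlinearity applied to weighted sums", which is exactly why a matrix-multiply accelerator serves all of them.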
11. TPU Architecture and Implementation
• Added as an accelerator to existing servers
So it connects over the I/O bus ("PCIe")
TPU ≈ matrix accelerator on the I/O bus
• The host server sends it instructions, like a Floating Point Unit
Unlike a GPU, which fetches and executes its own instructions
• The goal was to run whole inference models in the TPU, to reduce interactions with the host CPU and to be flexible enough to match the NN needs of 2015 and beyond
13. TPU High Level Architecture
• The Matrix Multiply Unit is the heart of the TPU
65,536 (256x256) 8-bit MAC units
The matrix unit holds one 64 KiB tile of weights, plus one more for double-buffering
>25x as many MACs as a GPU, >100x as many MACs as a CPU
• Peak performance: 92 TOPS = 65,536 x 2 x 700M
• The 16-bit products are collected in the 4 MiB of 32-bit Accumulators below the matrix unit
The 4 MiB represents 4096 256-element 32-bit accumulators
Operations per byte needed to run at peak performance: 1350; rounded up to 2048; doubled for double-buffering: 4096
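The peak-performance and accumulator figures above are plain arithmetic and can be reproduced directly (no assumptions beyond the numbers quoted on the slide):

```python
macs = 256 * 256                 # 65,536 8-bit MAC units
clock_hz = 700e6                 # 700 MHz
peak_ops = macs * 2 * clock_hz   # a MAC = multiply + add = 2 ops
print(peak_ops / 1e12)           # ~91.75, quoted as 92 TOPS

acc_bytes = 4 * 2**20                   # 4 MiB of accumulator storage
accumulators = acc_bytes // 4 // 256    # 32-bit words, 256 elements each
print(accumulators)                     # 4096
```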
14. TPU High Level Architecture
• The weights for the matrix unit are staged through an on-chip Weight FIFO that reads from an off-chip 8 GiB DRAM called Weight Memory
Two 2133 MHz DDR3 DRAM channels
For inference, weights are read-only
8 GiB supports many simultaneously active models
• The intermediate results are held in the 24 MiB on-chip Unified Buffer, which can serve as inputs to the Matrix Unit
The 24 MiB size was picked in part to match the pitch of the Matrix Unit on the die and, given the short development schedule, in part to simplify the compiler
15. Floorplan of TPU Die
• The Unified Buffer is almost a third of the die
• The Matrix Multiply Unit is a quarter
• Control is just 2%
16. RISC, CISC and the TPU Instruction Set
• Most modern CPUs are heavily influenced by the Reduced Instruction Set Computer (RISC) design style
With RISC, the focus is to define simple instructions (e.g., load, store, add and multiply) that are commonly used by the majority of applications, and then to execute those instructions as fast as possible
• A Complex Instruction Set Computer (CISC) design focuses on implementing high-level instructions that run more complex tasks (such as calculating multiply-and-add many times) with each instruction
The average clock cycles per instruction (CPI) of these CISC instructions is typically 10 to 20
• The TPU chose the CISC style
17. TPU Instructions
• The TPU has about a dozen instructions overall, but five are the key ones: Read_Host_Memory, Read_Weights, MatrixMultiply/Convolve, Activate, and Write_Host_Memory
18. TPU Instructions
• The CISC MatrixMultiply instruction is 12 bytes
3 bytes are the Unified Buffer address; 2 are the accumulator address; 4 are the length (sometimes 2 dimensions for convolutions); and the rest are opcode and flags
• Average clock cycles per instruction: >10
• 4-stage overlapped execution, 1 instruction type per stage
Other instructions execute while the matrix multiplier is busy
• The complexity lives in SW
No branches, in-order issue, SW-controlled buffers, SW-controlled pipeline synchronization
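The 12-byte encoding can be illustrated with `struct`. The exact field order and opcode values are not public at this level of detail, so the layout below is hypothetical; it only mirrors the field widths listed above (3-byte Unified Buffer address, 2-byte accumulator address, 4-byte length, 3 bytes of opcode/flags):

```python
import struct

def pack_matmul(ub_addr, acc_addr, length, opcode=0x01, flags=0):
    """Hypothetical 12-byte MatrixMultiply encoding (field order assumed).
    3B Unified Buffer address + 2B accumulator address + 4B length
    + 1B opcode + 2B flags = 12 bytes."""
    ub = ub_addr.to_bytes(3, "little")           # 3-byte UB address
    rest = struct.pack("<HIBH", acc_addr, length, opcode, flags)
    return ub + rest

insn = pack_matmul(ub_addr=0x000100, acc_addr=0x0010, length=256)
print(len(insn))   # 12
```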
19. Systolic Execution in Matrix Array
• Problem: reading a large SRAM uses much more power than arithmetic
• Solution: use "systolic execution" to save energy by reducing reads and writes of the Unified Buffer
• A systolic array is a two-dimensional collection of arithmetic units that each independently compute a partial result as a function of inputs from other arithmetic units considered upstream of it
• It is similar to blood being pumped through the human circulatory system by the heart, which is the origin of the "systolic" name
22. TPU Systolic Array
• In the TPU, the systolic array is rotated
• Weights are loaded from the top, and the input data flows into the array from the left
• Weights are preloaded and take effect with the advancing wave alongside the first data of a new block
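The dataflow can be simulated in miniature. The sketch below is an illustrative weight-stationary systolic array (weights preloaded into the PEs, activations streaming in from the left with a one-cycle skew per row, partial sums flowing down and leaving at the bottom), not the TPU's actual hardware:

```python
import numpy as np

def systolic_matvec(W, x):
    """Simulate an n x n weight-stationary systolic array.
    PE (i, j) holds W[i, j]; y[j] = sum_i x[i] * W[i, j] emerges
    from the bottom of column j at cycle (n-1) + j."""
    n = W.shape[0]
    a_reg = np.zeros((n, n))   # activation register in each PE
    p_reg = np.zeros((n, n))   # partial-sum register in each PE
    y = np.zeros(n)
    for t in range(2 * n - 1):
        a_in = np.zeros((n, n))
        for i in range(n):
            # row i's activation enters at cycle i (one-cycle skew per row)
            a_in[i, 0] = x[i] if t == i else 0.0
        a_in[:, 1:] = a_reg[:, :-1]     # activations shift one PE right
        p_in = np.zeros((n, n))
        p_in[1:, :] = p_reg[:-1, :]     # partial sums shift one PE down
        a_reg = a_in
        p_reg = p_in + a_in * W         # each PE does one MAC per cycle
        j = t - (n - 1)                 # column j finishes at cycle n-1+j
        if 0 <= j < n:
            y[j] = p_reg[n - 1, j]
    return y

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
print(np.allclose(systolic_matvec(W, x), x @ W))   # True
```

Each input value is read from the buffer once and then reused as it marches across the array, which is exactly the energy saving the previous slide describes.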
23. Software Stack
• The software stack is split into a User Space Driver and a Kernel Driver
• The Kernel Driver is lightweight and handles only memory management and interrupts
• The User Space Driver changes frequently. It sets up and controls TPU execution, reformats data into TPU order, translates API calls into TPU instructions, and turns them into an application binary
24. Relative Performance: 3 Contemporary Chips
* The TPU is less than half the die size of the Intel Haswell processor
• The K80 and TPU are built on a 28 nm process; Haswell is fabbed in Intel's 22 nm process
• These chips and platforms were chosen for comparison because they are widely deployed in Google data centers
25. Relative Performance: 3 Platforms
• These chips and platforms were chosen for comparison because they are widely deployed in Google data centers
26. Performance Comparison
• Roofline Performance Model
This simple visual model is not perfect, yet it offers insights into the causes of performance bottlenecks
The Y-axis is performance in floating-point operations per second, so the peak computation rate forms the "flat" part of the roofline
The X-axis is operational intensity, measured as floating-point operations per DRAM byte accessed
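The roofline itself is just a min() of two terms. A minimal sketch using the TPU-like numbers from these slides (92 TOPS peak, 34 GB/s of DRAM bandwidth):

```python
def roofline(peak_ops_per_s, mem_bw_bytes_per_s, intensity):
    """Attainable performance = min(peak compute, bandwidth * intensity).
    intensity is in operations per DRAM byte accessed."""
    return min(peak_ops_per_s, mem_bw_bytes_per_s * intensity)

peak, bw = 92e12, 34e9
print(roofline(peak, bw, 100) / 1e12)    # 3.4  -> on the slanted, memory-bound part
print(roofline(peak, bw, 5000) / 1e12)   # 92.0 -> on the flat, compute-bound part
```

An application sits under the slanted part when its operational intensity is below the ridge point (peak / bandwidth), and under the flat part above it.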
27. TPU Die Roofline
• The TPU has a long "slanted" part of its roofline, where low operational intensity means that performance is limited by memory bandwidth
• Five of the six applications are happily bumping their heads against the ceiling
• MLPs and LSTMs are memory-bound, while CNNs are computation-bound
31. Why So Far Below Rooflines? (MLP0)
• Response time is the reason
• Researchers have demonstrated that small increases in response
time cause customers to use a service less
• Inference prefers latency over throughput
32. TPU & GPU Performance Relative to CPU
• GM: Geometric Mean
• WM: Weighted Mean
34. Improving TPU: Move the "Ridge Point" to the Left
• Current DRAM: 2 DDR3 channels at 2133 MHz → 34 GB/s
• Replace with GDDR5, as in the K80
BW: 34 GB/s → 180 GB/s
Moves the ridge point from 1350 down to about 250
This improvement would expand die size by about 10%. However, higher memory bandwidth reduces pressure on the Unified Buffer, so reducing the Unified Buffer to 14 MiB could gain back 10% in area
The 14 MiB figure comes from the maximum MiB of the 24 MiB Unified Buffer actually used per NN app
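Where the 1350 and ~250 ridge points come from: the ridge point is the peak MAC rate divided by memory bandwidth. Note this counts MACs, as the slides do here, rather than the 2-ops-per-MAC convention behind the 92 TOPS figure:

```python
peak_macs = 65536 * 700e6          # ~45.9 T MACs/s at peak
ddr3_bw = 34e9                     # 2 x DDR3-2133 channels
gddr5_bw = 180e9                   # K80-class GDDR5

print(peak_macs / ddr3_bw)         # ~1349 -> the ~1350 ridge point
print(peak_macs / gddr5_bw)        # ~255  -> the ~250 ridge point
```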
39. Evaluation of TPU Designs
• The table below shows the differences between the performance-model results and the hardware performance counters; the differences average below 10%
41. Weighted Mean TPU Relative Performance
• First, increasing memory bandwidth (memory) has the biggest impact: performance improves 3x on average when memory bandwidth increases 4x
• Second, the clock rate has little benefit on average, with or without more accumulators. The reason is that MLPs and LSTMs are memory-bound; only the CNNs are compute-bound
Increasing the clock rate by 4x has almost no impact on MLPs and LSTMs, but improves the performance of CNNs by about 2x
• Third, the average performance slightly degrades when the matrix unit expands from 256x256 to 512x512, for all apps
The issue is analogous to internal fragmentation of large pages
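The internal-fragmentation analogy can be made concrete: a layer whose dimension is not a multiple of the tile size wastes more of a bigger tile. A toy calculation (the 600-element layer size is an arbitrary example, not a figure from the paper):

```python
import math

def tile_utilization(layer_dim, tile):
    """Fraction of the padded tile capacity that holds real data."""
    padded = math.ceil(layer_dim / tile) * tile
    return layer_dim / padded

print(tile_utilization(600, 256))   # 600/768  ~ 0.78
print(tile_utilization(600, 512))   # 600/1024 ~ 0.59
```

Doubling the tile edge to 512 quadruples the MAC count, but for awkward layer sizes a larger share of those MACs computes padding, so average performance can drop.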