
Innovation to Accelerate DX Performance



Fujitsu Laboratories’ cutting-edge high-performance computing technologies address a wide range of DX applications using existing platform software. Mr Akaboshi gives an insight into the latest advances that are accelerating unstructured data processing, reflecting Fujitsu Laboratories’ focus on innovative computing for social and business transformation.

Published in: Technology

Innovation to Accelerate DX Performance

  1. 1. INTERNAL USE ONLY — Copyright 2019 FUJITSU LABORATORIES LTD. Naoki Akaboshi, Head of ICT Systems Laboratory, Fujitsu Laboratories. Innovation to Accelerate DX Performance
  2. 2. Chapter 1: History of Fujitsu’s Computing
  3. 3. Copyright 2016 FUJITSU — Computer performance ◼ Since ENIAC was developed 70 years ago, computer performance has doubled every 1.5 years. [Chart: computations per second per computer, 1930–2010; ENIAC, 1946, U.S. federal government; trend line: 2x / 1.5 years]
  4. 4. Fujitsu computer systems — spanning supercomputers, mainframes, and enterprise servers: FACOM100 (1954), FACOM230-10 (1965), M-190 (1976), VP-100 (1982), M-780 (1985), M-1800 (1990), DS90 (1991), VPP-500 (1992), GS21 (2002), PRIMEQUEST (2005), PRIMEHPC FX10 (2011), SPARC M10 (2013), PRIMEHPC FX100 (2014), SPARC M12 (2017)
  5. 5. Fujitsu microprocessors [Timeline, 1999–2016+. Mainframe: GS8600, GS8800, GS8800B, GS8900, GS21 600/900, GS21 M2600. UNIX: SPARC64, SPARC64 II, SPARC64 GP, SPARC64 V/V+/VI/VII, SPARC64 X/X+. Supercomputer: SPARC64 VIIIfx (K computer), SPARC64 IXfx, SPARC64 XIfx, A64FX. High-performance features: store ahead, branch history, prefetch, non-blocking cache, out-of-order execution, super-scalar, L2 cache on die, single-chip CPU, multi-core/multi-thread, hardware barrier, HPC-ACE, system-on-chip, virtual machine architecture, software on chip, high-speed interconnect. High-reliability features: cache ECC, register/ALU parity, instruction retry, dynamic cache degradation, RC/RT/History]
  6. 6. Fujitsu high performance computing ◼ Fujitsu provides many HPC solutions to satisfy various customer demands ◼ Supports both supercomputers with original CPUs and x86 cluster systems ◼ FX1000: first Arm-architecture system for large-scale HPC — Supercomputer PRIMEHPC: K computer (co-developed with RIKEN), PRIMEHPC FX10, PRIMEHPC FX100, PRIMEHPC FX1000; large-scale SMP system: RX900; x86 cluster: CX400/CX600 (KNL), BX900/BX400
  7. 7. © 2019 FUJITSU — Supercomputer Fugaku development. System characteristics: ◼ World’s top-performing general-purpose supercomputer: low power consumption, world-leading computational performance, user convenience, and the ability to produce ground-breaking results ◼ Up to 100 times greater application performance ◼ 30–40 MW power consumption ◼ Design suitable for AI applications such as deep learning [Photo: 2.2 m]
  8. 8. Chapter 2: Computing Innovation
  9. 9. Microprocessor trend ◼ Transistor counts are growing exponentially, following Moore’s law ◼ Single-thread performance: increased by 60%/year until 2005, then slowed to +20%/year ◼ Power & operating frequency: since 2005, power restrictions have limited operating frequency. Performance growth is limited by power consumption. [Chart: transistor counts (K), performance, frequency (MHz), power (W), core counts; source: Stanford, K. Rupp]
  10. 10. Moore’s law ◼ Device technology scaling has brought higher performance as well as higher power efficiency for the last 50 years ◼ The trade-off line between performance and power efficiency is determined by the device technology of each generation; as technology scales, the trade-off line moves upward: power efficiency × (performance)² = K ∝ s⁵, where s is the scaling factor ◼ The technology node has reached 7 nm, the physical limit of current transistor technology. Technology scaling will no longer be the driver for computing. [Chart: power efficiency vs. performance (a.u., log-log); Moore’s-limit line advancing across 1990, 2000, 2010, 2020; mobile and server regions]
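The slide’s trade-off relation can be sketched numerically. This is a minimal illustration of how the formula behaves, assuming the quantities combine exactly as written on the slide (the function name and normalization constant are illustrative, not from any Fujitsu material):

```python
# Trade-off line from the slide: power_efficiency * performance**2 = K,
# with K proportional to s**5, where s is the device scaling factor.

def tradeoff_power_efficiency(performance, s, k0=1.0):
    """Power efficiency on the trade-off line for a given performance
    and scaling factor s (k0 is an arbitrary normalization constant)."""
    K = k0 * s ** 5
    return K / performance ** 2

# Doubling the scaling factor raises K by 2**5 = 32x, so at the same
# performance the attainable power efficiency rises 32x -- which is why
# the limit line moves upward with each technology generation.
base = tradeoff_power_efficiency(performance=100.0, s=1.0)
scaled = tradeoff_power_efficiency(performance=100.0, s=2.0)
print(scaled / base)  # 32.0
```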
  11. 11. Data explosion ◼ As the amount of data explodes, it exceeds the capability of traditional ICT ◼ New processing is needed to create valuable information from unstructured data. The amount of data will reach 40 zettabytes (1 ZB = 10^21 bytes) by 2020 and 1 yottabyte (1 YB = 10^24 bytes) by 2030. [Chart: amount of data vs. year, 1990–2020+; structured data (business data, RDB) vs. unstructured data (IoT, sensors)]
  12. 12. Computing innovations beyond Moore’s law ◼ To overcome the limits of Moore’s law in both performance and power efficiency, realize beyond-Moore computing through two approaches: computing-architecture innovation and software/algorithm innovation. [Chart: power efficiency vs. performance (a.u.); Moore’s-limit line separating Moore’s Law from Beyond Moore’s Law]
  13. 13. New computing architecture ◼ Evolving from numeric computing to intelligence computing (processing: numeric → media → knowledge → intelligence), driven by specialization at the end of Moore’s law: conventional computing, supercomputers, accelerators, neural computing (inference and learning), brain-inspired computing, quantum computers
  14. 14. Approach for new computing architecture ◼ Evolving from numeric computing to intelligence computing — first pillar: Conventional Computing (conventional computing, supercomputers)
  15. 15. Approach for new computing architecture — second pillar: Domain Specific Computing (accelerators, neural computing for inference and learning, Digital Annealer)
  16. 16. Approach for new computing architecture — third pillar: New Computing Paradigm (brain-inspired computing, quantum computers), alongside scientific computing
  17. 17. Approach for new computing architecture — together, Conventional Computing, Domain Specific Computing, and the New Computing Paradigm form Future Computing Technologies
  18. 18. Chapter 3: What sort of computing technologies will be required in the digital era?
  19. 19. Computing demand in the digital era — Demand for AI computation is increasing explosively. Training time on the whole K computer system: AlexNet (2012) 50 sec; ResNet (2016) 1,000 sec ≈ 17 min; NMT (2017) 690,000 sec ≈ 8 days; AlphaGo Zero (2018) 15,000,000 sec ≈ 6 months — demand for AI computation up by 300,000x
  20. 20. Issues in the field — Current performance does not meet the computing demand in the field. Behavior analysis: users want ten times as many cameras, but GPU performance is insufficient. Factory operations improvement: deep learning on 4K camera streams is too heavy, making analysis impossible.
  21. 21. Computing evolution — Computing is getting more complex: from higher frequency, to parallelization (many-core), to efficient use of domain-specific computing. Cutting-edge technologies are likewise getting more complex, and advanced skills are required to use them efficiently: parallelization technologies and special programming languages are difficult to use, and it is difficult to boost performance by hardware evolution alone; advanced software technology is needed.
  22. 22. The world’s best acceleration technology — Deep-learning acceleration technologies: world’s fastest training without loss of precision, knocking 30 seconds off the previous record (108 sec → 75 sec), achieved by experts on the ABCI supercomputer at AIST (photo source: National Institute of Advanced Industrial Science and Technology). Press release, 1 April 2019.
  23. 23. Making high-speed computing available to everyone — Boost performance not only in hardware but also in software; provide users with technologies that run at high speed yet are still easy to use. Platform software with software-based acceleration technologies sits beneath existing interfaces (OSS frameworks/libraries) and above general-purpose CPUs, many-core processors, and domain-specific computing.
  24. 24. Chapter 4: Content-Aware Computing — not only fast, but also easy to use (world first)
  25. 25. Current state-of-the-art technology and its issues — Approximate computing relaxes strict precision for faster computation. Example: bit reduction — scale down the data based on its distribution so it can be packed into longer vector operations, e.g. compressing 32-bit values to 8-bit. It is difficult to use while ensuring the accuracy of results, and must be tuned through trial and error.
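The bit-reduction example on this slide can be sketched in a few lines. This is an illustrative toy, not Fujitsu’s implementation; the function names and the shared-scale scheme are assumptions chosen to show the idea and its accuracy cost:

```python
# Bit reduction: scale 32-bit floating-point values down so they fit into
# 8-bit integers, letting more elements be packed into one vector operation.

def quantize_to_int8(values):
    """Map floats onto the int8 range [-127, 127] via a shared scale."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = 127.0 / max_abs
    return [round(v * scale) for v in values], scale

def dequantize(quantized, scale):
    """Recover approximate floats; the round-off error is the accuracy cost."""
    return [q / scale for q in quantized]

data = [0.01, -0.5, 0.25, 1.0]
q, scale = quantize_to_int8(data)
restored = dequantize(q, scale)
# Each restored value differs from the original by at most 1/(2*scale) --
# this is exactly the accuracy risk that forces the trial-and-error tuning
# the slide describes.
```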
  26. 26. Novel technology: Content-Aware Computing — Accelerate based on analysis of the contents of the computation, combining high performance with ease of use. Analyzing the contents automates the trial and error: static program analysis (formally equivalent static transformations — automatic parallelization to extract parallelism, compiler optimization to generate optimal code) plus dynamic program-behavior analysis (elimination of unnecessary precision).
  27. 27. Tech 1: Data-aware bit-width reduction — Current state of the art: narrow bit-width applies only to pre-specified layers (e.g. conv1/relu1/pool1 use only 16-bit; conv50/relu50/pool50/fc50/softmax use only 32-bit). Data-aware bit-width reduction instead automatically selects the optimal bit width and layers (8/16-bit adaptively) based on the internal data distribution of each layer, increasing speed threefold.
  28. 28. Data-aware bit-width reduction (cont.) — Dynamic analysis of internal data achieves dynamic bit-width optimization for each layer of the neural network across epochs, replacing 32-bit layers with 8-bit ones while keeping round-off error under control.
  29. 29. Data-aware bit-width reduction (results) — 3-fold speed-up in elapsed training time, with accuracy matching the conventional approach. [Chart: elapsed training time and accuracy, conventional vs. proposed]
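The per-layer selection described on slides 27–29 can be sketched as follows. This is a hedged toy model: the dynamic-range statistic, the threshold, and all names are illustrative assumptions, not Fujitsu’s actual selection criterion:

```python
# Data-aware bit-width reduction: choose 8 or 16 bits per layer from the
# spread of that layer's internal data, instead of fixing narrow widths
# for pre-specified layers.

def select_bit_width(layer_values, spread_threshold=100.0):
    """Wide dynamic range -> keep 16 bits; narrow range -> 8 bits suffice."""
    max_abs = max(abs(v) for v in layer_values)
    smallest = min(abs(v) for v in layer_values if v != 0.0)
    dynamic_range = max_abs / smallest
    return 16 if dynamic_range > spread_threshold else 8

def plan_bit_widths(layers):
    """layers: dict mapping layer name -> sample of its internal values."""
    return {name: select_bit_width(vals) for name, vals in layers.items()}

plan = plan_bit_widths({
    "conv1": [0.9, -1.1, 0.7],      # narrow range -> 8-bit is enough
    "fc50":  [0.001, 250.0, -3.0],  # wide range   -> keep 16-bit
})
```

Re-running the analysis each epoch, as slide 28 suggests, would let the chosen widths track the data distribution as training progresses.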
  30. 30. Tech 2: Adaptive synchronous mitigation — Auto-terminate the synchronization, predicting the increase in training iterations caused by accuracy degradation: (1) measure the elapsed-time reduction from forced process termination; (2) predict the increase in the number of training iterations due to accuracy degradation; (3) optimize the number of terminated processes to minimize total elapsed time. Result: 3.7x faster.
  31. 31. Adaptive synchronous mitigation (cont.) — Conventional: every node waits for nodes suffering temporary performance degradation. Adaptive synchronous mitigation terminates the degraded processing, reducing total training time, and prevents accuracy degradation by taking more epochs. [Chart: elapsed time per node and accuracy, conventional vs. mitigated]
  32. 32. Adaptive synchronous mitigation (results) — Conventional training slows down due to temporary performance degradation; adaptive synchronous mitigation prevents this slowdown, giving a 3.7-fold speed-up. [Chart: accuracy vs. elapsed time]
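The core timing effect on slides 30–32 can be shown with a toy model. This sketch only illustrates why dropping stragglers shortens the synchronization barrier; the numbers and names are assumptions, and the slide’s real technique additionally predicts and budgets the extra epochs needed to recover accuracy:

```python
# Adaptive synchronous mitigation (toy model): in synchronous data-parallel
# training each epoch normally waits for the slowest node; terminating a
# few straggler updates lets the barrier close earlier.

def epoch_time_synchronous(node_times):
    """Plain synchronization: the barrier waits for the slowest node."""
    return max(node_times)

def epoch_time_mitigated(node_times, max_terminated):
    """Terminate up to max_terminated slowest nodes, then synchronize."""
    kept = sorted(node_times)[: len(node_times) - max_terminated]
    return max(kept)

times = [10.0, 11.0, 10.5, 37.0]  # one temporarily degraded node
baseline = epoch_time_synchronous(times)                    # 37.0
mitigated = epoch_time_mitigated(times, max_terminated=1)   # 11.0
# Per-epoch speed-up here is ~3.4x; the slide's 3.7x figure is the overall
# result after balancing this gain against the extra recovery epochs.
```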
  33. 33. Effects of application to deep learning — Up to 10 times faster, and easily usable by anyone: computation is accelerated by bit-width reduction and synchronous mitigation, and the optimization that previously required painstaking trial-and-error refinement is automated.
  34. 34. Apply Content-Aware Computing to business — DX applications. Deployment in service businesses: used in Azure, AWS, and FUJITSU Cloud Service for OSS. Deployment in platform businesses: “Fugaku”, FX1000 / 700, PRIMERGY series, Zinrai Deep Learning System (ZDLS).
  35. 35. Supporting our customers’ DX projects with the world’s best computing technologies ◼ Existing high-performance computing technologies, used in specialized fields: scientific simulations, trading systems ◼ Computing technologies for DX, making acceleration technology easier to use: deep learning / machine learning, digital solutions
  36. 36. Copyright 2019 FUJITSU LABORATORIES LTD.
