Shinya Morino, Sr. Solution Architect, NVIDIA, 2/14/2020
QGATE 0.3:
QUANTUM CIRCUIT SIMULATOR
NVIDIA AI TECHNOLOGY CENTER (NVAITC)
Catalyse AI transformation through Research-Centric Integrated Engagements
Established Aug 2015 in Singapore
Locations shown on map: Singapore (AP HQ), Taiwan, China, Australia, Hong Kong, Luxembourg, Thailand, London, Indonesia
Collaboration Footprint: Singapore. ASEAN. Taiwan. China. Hong Kong. Australia. Europe.
QUANTUM COMPUTING
Qubit (Quantum bit):
- The basic unit of quantum computers.
- A qubit is represented as a linear superposition of two basis states, |0> and |1>:
  $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle, \quad |\alpha|^2 + |\beta|^2 = 1$
- |0> or |1> is observed by measurement.
  The observation probabilities of |0> and |1> are $|\alpha|^2$ and $|\beta|^2$, respectively.
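[Figure: Qubit, parameterized by the angles θ and φ]
$|\psi\rangle = \cos\frac{\theta}{2}\,|0\rangle + e^{i\phi}\sin\frac{\theta}{2}\,|1\rangle$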
QUANTUM COMPUTING
Quantum circuits consist of qubits and quantum logic gates.
- With N qubits, $2^N$ states can be represented (if entangled).
- One quantum state corresponds to one complex number.
Ex. With 53 qubits, $2^{53}$ (about 9 x 10^15, roughly 10 peta) states can be represented.
Quantum states are controlled by using quantum logic gates.
- Applying one gate can change $2^N$ quantum states at the same time.
- Developing quantum circuits is the programming of quantum computers.
[Figure: Quantum circuit with Hadamard (H) gates]
QUANTUM CIRCUIT SIMULATION
State vector
- Quantum states are expanded into a vector of complex numbers.
- The vector size is $2^N$ for an N-qubit circuit.
- Each bit of the index corresponds to one qubit.
Quantum states and state vector
Index of state vector (one bit per qubit)    Quantum state (complex number)
|0...00>                                     $s_0$
|0...01>                                     $s_1$
|0...10>                                     $s_2$
...                                          ...
|1...10>                                     $s_{2^N-2}$
|1...11>                                     $s_{2^N-1}$
Qubits: q0, q1, ..., qN-1
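QUANTUM CIRCUIT SIMULATION
Quantum Logic Gate
Gate
- Represented as a 2x2 unitary matrix:
  $U = \begin{pmatrix} u_{00} & u_{01} \\ u_{10} & u_{11} \end{pmatrix}$
- Applying a quantum gate to a state vector (from step $i$ to step $i+1$, for each pair of states that differ only in the target bit):
  $\begin{pmatrix} s_{i+1,|\ldots 0 \ldots\rangle} \\ s_{i+1,|\ldots 1 \ldots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\ldots 0 \ldots\rangle} \\ s_{i,|\ldots 1 \ldots\rangle} \end{pmatrix}$
Controlled gate (Control, Target)
- Represented as a 4x4 matrix:
  $\text{Controlled-}U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & u_{00} & u_{01} \\ 0 & 0 & u_{10} & u_{11} \end{pmatrix}$
- The gate is applied when the controlling qubit is |1>:
  $\begin{pmatrix} s_{i+1,|\ldots 1 \ldots 0 \ldots\rangle} \\ s_{i+1,|\ldots 1 \ldots 1 \ldots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\ldots 1 \ldots 0 \ldots\rangle} \\ s_{i,|\ldots 1 \ldots 1 \ldots\rangle} \end{pmatrix}$
- Controlled gates can make qubits entangled.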
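The gate update on this slide can be sketched in a few lines of plain Python. This is only an illustration, not Qgate's implementation: the function name `apply_gate` and the convention that bit k of the index corresponds to qubit k are assumptions made for the example.

```python
import math

def apply_gate(state, u, target, control=None):
    """Apply a 2x2 matrix u to qubit `target` of a state vector.

    Assumption for this sketch: bit k of the index corresponds to qubit k.
    If `control` is given, only pairs whose control bit is 1 are updated.
    """
    out = list(state)
    tbit = 1 << target
    for i0 in range(len(state)):
        if i0 & tbit:
            continue  # visit each (...0..., ...1...) pair once, from its ...0... member
        if control is not None and not (i0 >> control) & 1:
            continue  # controlled gate: leave pairs with control bit 0 unchanged
        i1 = i0 | tbit
        s0, s1 = state[i0], state[i1]
        out[i0] = u[0][0] * s0 + u[0][1] * s1
        out[i1] = u[1][0] * s0 + u[1][1] * s1
    return out

R = 1 / math.sqrt(2)
H = [[R, R], [R, -R]]   # Hadamard gate
X = [[0, 1], [1, 0]]    # Pauli X gate

state = [1 + 0j, 0j, 0j, 0j]                       # |00>
state = apply_gate(state, H, target=0)             # superposition on q0
bell = apply_gate(state, X, target=1, control=0)   # controlled-X entangles q0 and q1
print(bell)
```

Every gate touches all pairs of amplitudes, which is why the cost of one gate grows with the state vector size.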
PROBLEM DEFINITION
A quantum circuit simulator is an essential tool for developing quantum circuits, but it has limitations.
It is often said that:
"Number of qubits" is the limitation,
because a vast amount of memory, proportional to $2^N$, is required for simulations.
But the actual issue as of today is:
"Simulation is very slow."
Debugging and verifying quantum circuits takes a long time.
QUANTUM CIRCUIT EXAMPLES
Estimated simulation time
Circuit                                  # qubits   # gates   Capacity of state vector   Python*1 (CPU 1 core)   CPU*2 (multi-core)
Quantum Volume*3 (width 32, depth 32)    32         5,120     64 GB                      2 days                  3 hours
iQFT*4 (Ex: 32 qubits)                   32         560       64 GB                      3 hours                 13 min
Modulo operation (5^n mod 12)            27         5,449     2 GB                       45 min                  3 min
*1: Simulation with 1 CPU core.
*2: Assuming 55 GB/sec of CPU memory bandwidth with a naïve simulation algorithm.
*3: https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
*4: iQFT, Inverse Quantum Fourier Transform
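The state vector capacities in the table follow from the vector size. A minimal sketch, assuming each amplitude is stored as two FP64 values (16 bytes); the function name `state_vector_bytes` is made up for this example:

```python
GIB = 2 ** 30

def state_vector_bytes(n_qubits, bytes_per_real=8):
    """Bytes for 2**n_qubits complex amplitudes, each stored as two reals.

    bytes_per_real=8 corresponds to FP64, bytes_per_real=4 to FP32.
    """
    return (2 ** n_qubits) * 2 * bytes_per_real

for n in (27, 32):
    print(n, "qubits:", state_vector_bytes(n) // GIB, "GiB (FP64)")
```

Each additional qubit doubles the required memory.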
QUANTUM CIRCUIT SIMULATOR
QGATE
QGATE DESIGN CONCEPT
1. Easy development of quantum circuits with fast simulations for experiments
Rich built-in gate set to quickly develop circuits
Utilizing modern computing devices for performance
2. Single node, multi GPU (multiple devices)
Utilizing a big server with a huge amount of memory.
Focusing on performance. No inter-node communication.
3. Works as a backend for other SDKs
Simulations can be accelerated from Blueqat and various other SDKs.
1. EASY DEVELOPMENT OF QUANTUM CIRCUITS
Rich built-in gate set
- Multi-bit-controlled gates, such as the Toffoli gate, are included in the built-in gate set
- Adjoint for all gates
All qubits are fully connected
IBM’s OpenQASM gate set is also supported
BUILT-IN OPERATORS
Quantum logic gate                          Symbol
Identity                                    I
Hadamard gate                               H
Pauli gates and their rotations             X, Y, Z, Rx(theta), Ry(theta), Rz(theta)
Exponential of identity and Pauli gates     Exp(I, X, Y, Z)
Global phase                                Exp(theta)
Phase shift gates                           P(theta), T, S
Measurement, Probability                    Measure(qubit), Prob(qubit)
Extensions
OpenQASM's U gates                          U3, U2, U1
Multi qubit measurement                     Measure(pauli gates)
UTILIZING MODERN COMPUTING DEVICES
FOR PERFORMANCE
Tesla V100 (SXM2)
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
32 GB HBM2 @ 900GB/s | 300GB/s NVLink
GPU CPU
CPU runtime is also implemented.
(Utilizing multi cores in one CPU socket)
TARGET HARDWARE
Requirement:
- Quantum circuit simulations need a
huge amount of memory
- Performance is important as well.
DGX-2
- 512 GB of GPU memory in 16 Tesla
V100
- By using NVLink, all memories in
GPUs are in one address space.
NVIDIA DGX-2
DGX-2
All GPUs are sharing a single address space.
All-to-all connections by NVLink
(300 GB/sec, bidirectional)
- 512 GB of ultra-fast memory
is available
- FP32: 35 qubits
FP64: 34 qubits
16 NVIDIA High-end GPUs + NVLink2
NVIDIA DGX STATION
At a Glance
GPUs                          4x NVIDIA® Tesla® V100
TFLOPS (GPU FP16)             500
GPU Memory                    32 GB per GPU
NVIDIA Tensor Cores           2,560 (total)
NVIDIA CUDA Cores             20,480 (total)
CPU                           Intel Xeon E5-2698 v4 2.2 GHz (20-core)
System Memory                 256 GB LRDIMM DDR4
Storage                       Data: 3 x 1.92 TB SSD RAID 0; OS: 1 x 1.92 TB SSD
Network                       Dual 10 Gb LAN
Display                       3x DisplayPort, 4K Resolution
Acoustics                     < 35 dB
Maximum Power Requirements    1500 W
Operating Temperature Range   10 - 30 °C
Software                      Ubuntu Desktop Linux OS, DGX Recommended GPU Driver, CUDA Toolkit
DGX STATION NVLINK NETWORK TOPOLOGY
For Efficient Application Scaling
NVIDIA NVLink Bridge
- Four NVIDIA Tesla V100 accelerators
- Each Tesla V100 GPU in DGX Station has four
NVLink connection points, each providing a point-
to-point connection to another GPU at a peak
bandwidth of 25GB/s
- Optimized for:
- The bandwidth achievable for a variety of point-
to-point and collective communications primitives
- The flexibility of the topology
- Performance with a subset of the GPUs
GPU REQUIREMENT
Qgate runs with a single GPU, and scales to multiple GPUs in a single node.
- Works with Kepler GPUs (compute capability 3.5) or later. Maxwell GPUs (compute capability 5.0) or later are recommended.
Multi GPU requirement
- NVLink: All-to-all NVLink connections between GPUs are required.
  For performance, NVLink is strongly recommended.
- PCIe: All GPUs should be connected to the same PCIe root complex.
CPU
- Running with 1 CPU socket is supported. NUMA is not taken into consideration.
PERFORMANCE MEASUREMENT
Baseline, Single GPU Performance
Quantum circuit for measurement
- 10 Hadamard gates are placed on each qubit.
- FP64 is used.
[Figure: circuit with 10 Hadamard gates on each qubit]
Device              Implementation
CPU (1 core)*1      Single thread on CPU
CPU (multi-core)    Multi-threaded*2 on CPU (40 threads, 20 physical cores)
GPU                 GPU / CUDA
*1: CPU (1 core) models a Python-based simulator, which is sometimes implemented by using 1 CPU core.
*2: Implemented by using the C++ STL's thread class.
SUMMARY
Performance Baseline (30 qubits, Single GPU)
Device              # gates applied per sec.   Memory bandwidth   Acc. (vs. CPU 1 core)   Acc. (vs. CPU multi-core)
CPU (1 core)        0.11                       3.7 GB/sec         1                       -
CPU (multi-core)    1.59                       54.8 GB/sec        14.9x                   1
GPU                 23.5                       806 GB/sec         220x                    14.7x
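The memory bandwidth column is consistent with a simple model in which every gate reads and writes the whole 30-qubit FP64 state vector once. That model is an assumption of this sketch, not something the slides state explicitly, and the function name is made up:

```python
def effective_bandwidth_gb_per_sec(gates_per_sec, n_qubits=30, bytes_per_amplitude=16):
    """GB moved per second, assuming each gate reads and writes the full state vector."""
    state_bytes = (2 ** n_qubits) * bytes_per_amplitude
    return gates_per_sec * 2 * state_bytes / 1e9

for device, rate in [("CPU (1 core)", 0.11), ("CPU (multi-core)", 1.59), ("GPU", 23.5)]:
    print(device, round(effective_bandwidth_gb_per_sec(rate), 1), "GB/sec")
```

Under this model the gate rate is bounded by memory bandwidth rather than by arithmetic throughput.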
PROCESSING PIPELINE
(0.3 RELEASE)
PROCESSING PIPELINE
Built with Python and Native Extensions
Input (Intermediate repr.)
Quantum computing specific
- Gate cancellation: Removing cancelling gates
- Operator reordering: Reordering operators (gates and measurements) in order to maximize the effects of dynamic qubit grouping
- Dynamic qubit grouping: Dynamically grouping qubits, reducing the number of variables required to represent quantum states
Device specific
- Qubit reordering: Reordering qubits to reduce data transfer between devices
- Runtime: Parallelization on computing devices, CPU (multi-core) and GPU (CUDA)
Output (state vector)
SOFTWARE DIAGRAM
qgate
Frontend
- qgate.script: Circuit definition in Python
- qgate.openqasm: Importing OpenQASM files
- Plugin: Blueqat plugin, other plugins ...
Backend
- qgate.model: Quantum circuit object model, built-in gate definitions
- OM (object model): Analyses and optimizations for quantum circuits
- qgate.simulator: Simulator
- qgate.simulator.qubits: State vector (complex number, probability)
- qgate.simulator.runtime: Runtime, accelerating numerical operations
  - pyruntime: Python, reference
  - cpuruntime: CPU, multi-core
  - cudaruntime: CUDA, GPU
GATE CANCELLATION
Quantum Circuit Optimization
Products of some gate pairs cancel out:
$I = X \cdot X = Y \cdot Y = Z \cdot Z = H \cdot H$
[Figure: pairs of X gates cancel out; U: arbitrary unitary gate]
Also works for pairs of Y, Z, H gates, whose squares are the identity.
Ex: Modulo arithmetic* (5^x mod 12, 27 qubits)
Item                   Value
Before cancellation    5449 gates
After cancellation     3885 gates
Reduction rate         71.3 % (gates after cancellation / gates before cancellation)
*This circuit was developed by Kato-san in MDR.
Ref: V. Vedral, A. Barenco, A. Ekert, https://arxiv.org/abs/quant-ph/9511018v1
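One way to cancel such gate pairs can be sketched as below. This is not Qgate's actual pass: the gate representation as `(name, qubits)` tuples, the `SELF_INVERSE` set, and the function name `cancel_gates` are all assumptions made for illustration.

```python
# Gates whose square is the identity (hypothetical names for this sketch).
SELF_INVERSE = {"X", "Y", "Z", "H", "CX"}

def cancel_gates(gates):
    """Remove pairs of identical self-inverse gates with nothing between them.

    `gates` is a list of (name, qubits) tuples in circuit order. A pair cancels
    only if no kept gate in between touches any of the same qubits.
    """
    kept = []
    for gate in gates:
        name, qubits = gate
        # Find the most recent kept gate that shares a qubit with this gate.
        prev = None
        for i in range(len(kept) - 1, -1, -1):
            if set(kept[i][1]) & set(qubits):
                prev = i
                break
        if prev is not None and kept[prev] == gate and name in SELF_INVERSE:
            del kept[prev]       # the two gates multiply to the identity
        else:
            kept.append(gate)
    return kept

circuit = [("H", (0,)), ("X", (1,)), ("H", (0,)), ("X", (1,)), ("Z", (0,))]
print(cancel_gates(circuit))
```

Removing one pair can expose another (for example H X X H), which the backward search handles because it only looks at gates that are still kept.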
DYNAMIC QUBIT GROUPING
If qubits are not entangled,
- The state vector can be factorized.
- The number of variables is reduced.
Ex. If 1 qubit of a 3-qubit state vector is not entangled:
3-qubit state vector (size: 8):
  $s_0|000\rangle + s_1|001\rangle + \cdots + s_7|111\rangle$
is factorized into 1 qubit ⊗ 2 qubits (size: 6 = 2 + 4):
  $(s_{00}|0\rangle + s_{01}|1\rangle) \otimes (s_{10}|00\rangle + s_{11}|01\rangle + s_{12}|10\rangle + s_{13}|11\rangle)$
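The factorization can be illustrated with a small tensor product. The amplitudes below are arbitrary example values, and `kron` is a helper written for this sketch rather than part of Qgate:

```python
def kron(a, b):
    """Tensor product of two state vectors; `a` supplies the high-order index bits."""
    return [x * y for x in a for y in b]

one_qubit = [0.6, 0.8j]               # 2 variables, |0.6|^2 + |0.8|^2 = 1
two_qubits = [0.5, 0.5, 0.5, -0.5]    # 4 variables, normalized
full = kron(one_qubit, two_qubits)    # 8 variables describe the same state

print(len(one_qubit) + len(two_qubits), "variables instead of", len(full))
```

As long as the two groups stay unentangled, a gate on the single qubit only needs to update the 2-element vector.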
DYNAMIC QUBIT GROUPING
30 qubit case
If the qubits are divided into 10- and 20-qubit groups:
30-qubit state vector (size: $2^{30}$):
  $s_0|0 \ldots 00\rangle + s_1|0 \ldots 01\rangle + \cdots + s_{2^{30}-1}|1 \ldots 11\rangle$
is factorized into 10 qubits ⊗ 20 qubits (size: $2^{20} + 2^{10}$, about 0.1 % of $2^{30}$):
  $(s_0|0 \ldots 00\rangle + \cdots + s_{2^{10}-1}|1 \ldots 11\rangle) \otimes (s_0|0 \ldots 00\rangle + \cdots + s_{2^{20}-1}|1 \ldots 11\rangle)$
EX. INVERSE QUANTUM FOURIER TRANSFORM
[Figure: iQFT circuit built from H gates and controlled rotation gates R1 to R4]
Qubits are grouped when a controlled gate is applied.
# Variables as the circuit proceeds:
10 (2 x 5) -> 10 (2^2 + 2 x 3) -> 12 (2^3 + 2 x 2) -> 18 (2^4 + 2) -> 32 (2^5)
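The variable counts above can be reproduced by tracking qubit groups and merging them whenever a controlled gate spans two groups. The sketch below is hypothetical (the names `num_variables` and `merge` are not Qgate's), and it assumes a 5-qubit circuit in which q0 is connected to q1, q2, q3, q4 in turn:

```python
def num_variables(groups):
    """Total number of amplitudes when each group has its own state vector."""
    return sum(2 ** len(g) for g in groups)

def merge(groups, qubits):
    """Merge every group touched by a multi-qubit gate into a single group."""
    touched = [g for g in groups if g & set(qubits)]
    rest = [g for g in groups if not (g & set(qubits))]
    return rest + [frozenset().union(*touched)]

groups = [frozenset([q]) for q in range(5)]    # every qubit starts in its own group
counts = [num_variables(groups)]
for other in range(1, 5):
    groups = merge(groups, (0, other))         # controlled gate between q0 and `other`
    counts.append(num_variables(groups))
print(counts)
```

The same counting gives 2^20 + 2^10 variables for a 30-qubit circuit split into 20- and 10-qubit groups.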
EFFECTS OF DYNAMIC QUBIT GROUPING
iQFT, Numerical Estimation
The calculation amount is reduced by applying qubit grouping.
[Figure: calculation amount (log axis, 1.0E+00 to 1.0E+12) with and without qubit grouping, and the ratio of calculation amount (qubit grouping enabled / disabled, 0 to 1), versus # qubits (0 to 32). The reduction ratio is 12.1 % at 30 qubits.]
CALCULATION AMOUNT COMPARISON
CUDA / CPU / Theoretical
In the range where # qubits is small,
- Processing overheads are observed.
In the range where # qubits is large,
- Computation time is long enough that the overhead is relatively small.
- Estimation and measurement match.
Observed overhead
- Time for analyzing the quantum circuit
- Managing grouped state vectors
[Figure: reduction ratio (0 to 1.4) versus # qubits (8 to 32) for CUDA, CPU (multi-core), and theoretical values. Processing overheads are observed for small # qubits; performance improves as expected for large # qubits.]
OPERATOR REORDERING
Maximizing effects of dynamic qubit grouping
- Reordering operators so that they are applied while qubit groups are still small
- Reducing the amount of calculation.
[Figure: circuit with gates U0 to U4, before and after operator reordering]
BENCHMARK
Phase Estimation
One of the most important algorithms in quantum computing
- Shor's algorithm
  Used for the order-finding problem (https://en.wikipedia.org/wiki/Shor%27s_algorithm)
- Quantum chemistry
  Used for obtaining matrix eigenvalues
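PHASE ESTIMATION
Without Operator Reordering
[Figure: phase estimation circuit with controlled U16, U8, U4, U2, U gates followed by iQFT (H gates and controlled rotation gates R1 to R4), in the original operator order]
PHASE ESTIMATION
Operators are Reordered
[Figure: the same phase estimation circuit with its operators reordered]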
PHASE ESTIMATION
Benchmark
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
- Running on a single Tesla V100 (32 GB)
[Figure: controlled exp(i 2^(n-1) θ), exp(i 2^(n-2) θ), ..., exp(i θ) gates on 29 qubits, followed by iQFT]
AN EXAMPLE OF CALCULATION RESULTS
1024 shots of sampling.
The initial value is 0.1.
Raw sampling results (value, number of shots):
(0.09999997168779373, 1)
(0.09999998286366463, 1)
(0.09999998472630978, 1)
(0.09999999031424522, 1)
(0.09999999217689037, 1)
(0.09999999403953552, 4)
(0.09999999590218067, 4)
(0.09999999776482582, 26)
(0.09999999962747097, 900)
(0.10000000149011612, 57)
(0.10000000335276127, 17)
(0.10000000521540642, 7)
(0.10000000707805157, 1)
(0.10000000894069672, 1)
(0.10000001080334187, 1)
(0.10000001639127731, 1)
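The sampled values are spaced about 1.86e-9 apart, which matches 2^-29, the resolution expected from 29 measurement qubits. A small sketch, assuming the readout is an integer k reported as k / 2^29 (the function name is made up for this example):

```python
N_BITS = 29  # measurement qubits in the benchmark

def nearest_grid_value(phase, n_bits=N_BITS):
    """Closest value to `phase` that an n_bits-bit readout can represent."""
    return round(phase * 2 ** n_bits) / 2 ** n_bits

peak = nearest_grid_value(0.1)   # grid point closest to the initial value 0.1
step = 1 / 2 ** N_BITS           # spacing between neighbouring readout values
print(peak, peak + step)
```

The grid point closest to 0.1 corresponds to the most frequent result in the list (900 of 1024 shots), and the next grid point up to the second most frequent (57 shots).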
PHASE ESTIMATION
Operator Reordering, Single GPU
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
Runtime / optimization      Elapsed time [s]   Acceleration
CPU / no optimization       213                1
CPU / optimized             24.7               8.6x
CUDA / no optimization      13.7               15.5x
CUDA / optimized            1.86               114x
MULTI GPU + NVLINK
IDEAL MULTI GPU PERFORMANCE
Performance Baseline (30 qubits, Single GPU)
Device              # gates applied per sec.   Memory bandwidth   Acc. (vs. CPU 1 core)   Acc. (vs. CPU multi-core)
CPU (1 core)        0.11                       3.7 GB/sec         1                       -
CPU (multi-core)    1.59                       54.8 GB/sec        14.9x                   1
GPU                 23.5                       806 GB/sec         220x                    14.7x
Ideal: 58.8x = 14.7x x 4 GPUs (DGX Station)
BOTTLENECK : DATA TRANSFER
Ex. DGX Station
NVLink is fast, but slower than GPU memory.
[Figure: four GPUs, each with 900 GB/s of GPU memory bandwidth, connected by NVLink at 50 GB/s or 100 GB/s per GPU pair]
                                   Bandwidth
GPU memory                         900 GB/s
NVLink (1 link, bidirectional)     50 GB/s
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Ex) 6 qubits (q0 to q5)
- Applying gates to q0 ~ q3 is done within each GPU.
- When q4 or q5 is included in the target qubits, data transfers between GPUs happen for each gate application.
[Figure: circuit on q0 to q5; gates on q0 ~ q3 are applied in each GPU, gates on q4 and q5 cause data transfers between GPUs]
Ref: 0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit, Thomas Häner, Damian S. Steiger, https://arxiv.org/abs/1704.01127
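Why only q4 and q5 cause transfers can be seen from how a state vector might be split across devices. The sketch assumes a layout in which the two highest index bits select one of 4 GPUs and the lower 4 bits address amplitudes inside a GPU; the names are made up for illustration.

```python
N_QUBITS = 6
N_LOCAL = 4                              # q0..q3 are addressed inside one device
N_DEVICES = 2 ** (N_QUBITS - N_LOCAL)    # q4, q5 select the device

def device_of(index):
    """Device that stores the amplitude at `index` (assumed layout)."""
    return index >> N_LOCAL

def needs_transfer(target):
    """True if some amplitude pair for a gate on `target` spans two devices."""
    return any(device_of(i) != device_of(i ^ (1 << target))
               for i in range(2 ** N_QUBITS))

print([needs_transfer(q) for q in range(N_QUBITS)])
```

A gate pairs each amplitude with the one whose index differs in the target bit, so a target bit above the local range always pairs amplitudes on different devices.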
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Reordering qubits
- Swapping q0 ~ q2 and q3 ~ q5.
- All required inter-device communications are done while reordering qubits.
- All gates are applied within each GPU.
[Figure: the qubit order q0, q1, q2, q3, q4, q5 is changed to q3, q4, q5, q0, q1, q2 by reordering qubits. Gates are applied in each GPU before and after the reordering; data transfers between GPUs happen only during the reordering.]
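Reordering qubits amounts to permuting the bits of every state vector index. A minimal single-process sketch (no actual device transfers; `reorder_qubits` and the `perm` convention are assumptions of this example):

```python
def reorder_qubits(state, perm):
    """Return a state vector in which bit position p holds old qubit perm[p]."""
    out = [0j] * len(state)
    for old_index, amp in enumerate(state):
        new_index = 0
        for new_pos, old_pos in enumerate(perm):
            new_index |= ((old_index >> old_pos) & 1) << new_pos
        out[new_index] = amp
    return out

perm = [3, 4, 5, 0, 1, 2]                 # swap q0..q2 with q3..q5
state = [complex(i) for i in range(64)]   # amplitude value = its original index
reordered = reorder_qubits(state, perm)

# The amplitude with only q4 set (index 16) now sits at bit position 1 (index 2),
# so later gates on q4 act on a low-order bit.
print(reordered[2])
```

After this one-time permutation, gates on the former q4 and q5 address low-order bits, which is what keeps subsequent gate applications local to a device in a layout like the one assumed here.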
BENCHMARK
Quantum Volume (n=32, d=32), FP64, DGX Station (4 GPUs)
https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
32 qubit circuit, 5120 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
Runtime               Optimization                                Elapsed time   Acc.
CPU                   No optimization                             3.1 hours      -
CUDA, 4 Tesla V100    No optimization                             370 sec        29.7 x
                      + Qubit reordering*                         318 sec        56.7 x
                      + Qubit grouping + Operator reordering      176 sec        62.5 x
*: Qubits are reordered 10 times during execution of the whole circuit.
BENCHMARK
Phase estimation, 32 qubit circuit
32 qubit circuit, 558 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4
Runtime               Optimization                                Elapsed time   Acc.
CPU                   No optimization                             774 sec        -
CUDA, 4 Tesla V100    No optimization                             18.4 sec       42 x
                      + Qubit reordering*                         15.4 sec       50 x
                      + Qubit grouping + Operator reordering      3.2 sec        242 x
*: Qubits are reordered 8 times during execution of the whole circuit.
PLANS FOR THE NEXT VERSION
• Supporting hyper-cube-mesh topology.
• Fully utilizing 8 GPUs on servers such as DGX-1 and the AWS p3dn.24xlarge instance
• Enabling 33-qubit circuits (float64) to run.
• Acceleration of GPU kernels.
• Qgate 0.3 implements naïve GPU kernels to apply gates; they are not optimized yet.
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR

More Related Content

What's hot

Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)
Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)
Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)Moe Moe Myint
 
برمجة الأردوينو - اليوم الثالث.
برمجة الأردوينو - اليوم الثالث. برمجة الأردوينو - اليوم الثالث.
برمجة الأردوينو - اليوم الثالث. Ahmed Sakr
 
Arm programmer's model
Arm programmer's modelArm programmer's model
Arm programmer's modelv Kalairajan
 
Introduction to arduino ppt main
Introduction to  arduino ppt mainIntroduction to  arduino ppt main
Introduction to arduino ppt maineddy royappa
 
Introducing the Arduino
Introducing the ArduinoIntroducing the Arduino
Introducing the ArduinoCharles A B Jr
 
Building IoT with Arduino Day One
Building IoT with Arduino Day One Building IoT with Arduino Day One
Building IoT with Arduino Day One Anthony Faustine
 
Design challenges in embedded systems
Design challenges in embedded systemsDesign challenges in embedded systems
Design challenges in embedded systemsmahalakshmimalini
 
MAC UNIT USING DIFFERENT MULTIPLIERS
MAC UNIT USING DIFFERENT MULTIPLIERSMAC UNIT USING DIFFERENT MULTIPLIERS
MAC UNIT USING DIFFERENT MULTIPLIERSBhamidipati Gayatri
 
Embedded system in washing machine
Embedded system in washing machineEmbedded system in washing machine
Embedded system in washing machineVignesh Suresh
 
Internet of things (IoT)
Internet of things (IoT)Internet of things (IoT)
Internet of things (IoT)Ankur Pipara
 
2. block diagram and components of embedded system
2. block diagram and components of embedded system2. block diagram and components of embedded system
2. block diagram and components of embedded systemVikas Dongre
 
Smart Camera as Embedded System
Smart Camera as Embedded SystemSmart Camera as Embedded System
Smart Camera as Embedded SystemPunnam Chandar
 
What is a Microcontroller ?
What is a Microcontroller ?What is a Microcontroller ?
What is a Microcontroller ?ShrutiVij4
 

What's hot (20)

Embedded System Networking
Embedded System NetworkingEmbedded System Networking
Embedded System Networking
 
Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)
Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)
Ch 1 introduction to Embedded Systems (AY:2018-2019--> First Semester)
 
برمجة الأردوينو - اليوم الثالث.
برمجة الأردوينو - اليوم الثالث. برمجة الأردوينو - اليوم الثالث.
برمجة الأردوينو - اليوم الثالث.
 
Arm programmer's model
Arm programmer's modelArm programmer's model
Arm programmer's model
 
Introduction to arduino ppt main
Introduction to  arduino ppt mainIntroduction to  arduino ppt main
Introduction to arduino ppt main
 
Introducing the Arduino
Introducing the ArduinoIntroducing the Arduino
Introducing the Arduino
 
Building IoT with Arduino Day One
Building IoT with Arduino Day One Building IoT with Arduino Day One
Building IoT with Arduino Day One
 
Design challenges in embedded systems
Design challenges in embedded systemsDesign challenges in embedded systems
Design challenges in embedded systems
 
IoT
IoTIoT
IoT
 
MAC UNIT USING DIFFERENT MULTIPLIERS
MAC UNIT USING DIFFERENT MULTIPLIERSMAC UNIT USING DIFFERENT MULTIPLIERS
MAC UNIT USING DIFFERENT MULTIPLIERS
 
Embedded system in washing machine
Embedded system in washing machineEmbedded system in washing machine
Embedded system in washing machine
 
Frequency counter
Frequency counterFrequency counter
Frequency counter
 
finger print based security system
finger print based security systemfinger print based security system
finger print based security system
 
Internet of things (IoT)
Internet of things (IoT)Internet of things (IoT)
Internet of things (IoT)
 
Introduction to stm32-part1
Introduction to stm32-part1Introduction to stm32-part1
Introduction to stm32-part1
 
Embedded C - Lecture 1
Embedded C - Lecture 1Embedded C - Lecture 1
Embedded C - Lecture 1
 
2. block diagram and components of embedded system
2. block diagram and components of embedded system2. block diagram and components of embedded system
2. block diagram and components of embedded system
 
Smart Camera as Embedded System
Smart Camera as Embedded SystemSmart Camera as Embedded System
Smart Camera as Embedded System
 
Communication protocols - Embedded Systems
Communication protocols - Embedded SystemsCommunication protocols - Embedded Systems
Communication protocols - Embedded Systems
 
What is a Microcontroller ?
What is a Microcontroller ?What is a Microcontroller ?
What is a Microcontroller ?
 

Similar to QGATE 0.3: QUANTUM CIRCUIT SIMULATOR

Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010John Holden
 
開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDKNVIDIA Japan
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelKoichi Shirahata
 
IQM slide pitch deck
IQM slide pitch deckIQM slide pitch deck
IQM slide pitch deckKan Yuenyong
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUsfcassier
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9inside-BigData.com
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCGanesan Narayanasamy
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learningAmgad Muhammad
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computationjtsagata
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_reportMichael Zhang
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Fisnik Kraja
 
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCarlos Reaño González
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelKoichi Shirahata
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeOfer Rosenberg
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stackHajime Tazaki
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rFerdinand Jamitzky
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiectureHaris456
 

Similar to QGATE 0.3: QUANTUM CIRCUIT SIMULATOR (20)

Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
 
開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK開発者が語る NVIDIA cuQuantum SDK
開発者が語る NVIDIA cuQuantum SDK
 
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming ModelPerformance Analysis of Lattice QCD on GPUs in APGAS Programming Model
Performance Analysis of Lattice QCD on GPUs in APGAS Programming Model
 
IQM slide pitch deck
IQM slide pitch deckIQM slide pitch deck
IQM slide pitch deck
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
Monte Carlo on GPUs
Monte Carlo on GPUsMonte Carlo on GPUs
Monte Carlo on GPUs
 
Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9Inside the Volta GPU Architecture and CUDA 9
Inside the Volta GPU Architecture and CUDA 9
 
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPCRISC-V  and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
RISC-V and OpenPOWER open-ISA and open-HW - a swiss army knife for HPC
 
CUDA and Caffe for deep learning
CUDA and Caffe for deep learningCUDA and Caffe for deep learning
CUDA and Caffe for deep learning
 
Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?Can FPGAs Compete with GPUs?
Can FPGAs Compete with GPUs?
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
 
Cuda 6 performance_report
Cuda 6 performance_reportCuda 6 performance_report
Cuda 6 performance_report
 
Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...Designing High Performance Computing Architectures for Reliable Space Applica...
Designing High Performance Computing Architectures for Reliable Space Applica...
 
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA CouplingCygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
Cygnus - World First Multi-Hybrid Accelerated Cluster with GPU and FPGA Coupling
 
Performance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming ModelPerformance Analysis of Lattice QCD with APGAS Programming Model
Performance Analysis of Lattice QCD with APGAS Programming Model
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
 
Playing BBR with a userspace network stack
Playing BBR with a userspace network stackPlaying BBR with a userspace network stack
Playing BBR with a userspace network stack
 
Lrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with rLrz kurs: gpu and mic programming with r
Lrz kurs: gpu and mic programming with r
 
Graphics processing uni computer archiecture
Graphics processing uni computer archiectureGraphics processing uni computer archiecture
Graphics processing uni computer archiecture
 

More from NVIDIA Japan

HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?NVIDIA Japan
 
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化
NVIDIA cuQuantum SDK による量子回路シミュレーターの高速化NVIDIA Japan
 
Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情
Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情
Physics-ML のためのフレームワーク NVIDIA Modulus 最新事情NVIDIA Japan
 
20221021_JP5.0.2-Webinar-JP_Final.pdf
20221021_JP5.0.2-Webinar-JP_Final.pdf20221021_JP5.0.2-Webinar-JP_Final.pdf
20221021_JP5.0.2-Webinar-JP_Final.pdfNVIDIA Japan
 
NVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワークNVIDIA Modulus: Physics ML 開発のためのフレームワーク
NVIDIA Modulus: Physics ML 開発のためのフレームワークNVIDIA Japan
 
NVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA HPC ソフトウエア斜め読み
NVIDIA HPC ソフトウエア斜め読みNVIDIA Japan
 
HPC+AI ってよく聞くけど結局なんなの
HPC+AI ってよく聞くけど結局なんなのHPC+AI ってよく聞くけど結局なんなの
HPC+AI ってよく聞くけど結局なんなのNVIDIA Japan
 
Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報Magnum IO GPUDirect Storage 最新情報
Magnum IO GPUDirect Storage 最新情報NVIDIA Japan
 
データ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラデータ爆発時代のネットワークインフラ
データ爆発時代のネットワークインフラNVIDIA Japan
 
Hopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことHopper アーキテクチャで、変わること、変わらないこと
Hopper アーキテクチャで、変わること、変わらないことNVIDIA Japan
 
GPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIAGPU と PYTHON と、それから最近の NVIDIA
GPU と PYTHON と、それから最近の NVIDIANVIDIA Japan
 
GTC November 2021 – テレコム関連アップデート サマリー
GTC November 2021 – テレコム関連アップデート サマリーGTC November 2021 – テレコム関連アップデート サマリー
GTC November 2021 – テレコム関連アップデート サマリーNVIDIA Japan
 
テレコムのビッグデータ解析 & AI サイバーセキュリティ
テレコムのビッグデータ解析 & AI サイバーセキュリティテレコムのビッグデータ解析 & AI サイバーセキュリティ
テレコムのビッグデータ解析 & AI サイバーセキュリティNVIDIA Japan
 
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~
必見!絶対におすすめの通信業界セッション 5 つ ~秋の GTC 2020~NVIDIA Japan
 
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ2020年10月29日 プロフェッショナルAI×Roboticsエンジニアへのロードマップ
2020年10月29日 プロフェッショナルAI×RoboticsエンジニアへのロードマップNVIDIA Japan
 
2020年10月29日 Jetson活用によるAI教育
2020年10月29日 Jetson活用によるAI教育2020年10月29日 Jetson活用によるAI教育
2020年10月29日 Jetson活用によるAI教育NVIDIA Japan
 
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育
2020年10月29日 Jetson Nano 2GBで始めるAI x Robotics教育NVIDIA Japan
 
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報
COVID-19 研究・対策に活用可能な NVIDIA ソフトウェアと関連情報NVIDIA Japan
 
Jetson Xavier NX クラウドネイティブをエッジに
Jetson Xavier NX クラウドネイティブをエッジにJetson Xavier NX クラウドネイティブをエッジに
Jetson Xavier NX クラウドネイティブをエッジにNVIDIA Japan
 
GTC 2020 発表内容まとめ
GTC 2020 発表内容まとめGTC 2020 発表内容まとめ
GTC 2020 発表内容まとめNVIDIA Japan
 

More from NVIDIA Japan (20)

HPC 的に H100 は魅力的な GPU なのか?
HPC 的に H100 は魅力的な GPU なのか?HPC 的に H100 は魅力的な GPU なのか?
QGATE 0.3: QUANTUM CIRCUIT SIMULATOR

  • 1. Shinya Morino, Sr. Solution Architect, NVIDIA, 2/14/2020 QGATE 0.3: QUANTUM CIRCUIT SIMULATOR
  • 2. NVIDIA AI TECHNOLOGY CENTER (NVAITC): Catalyse AI transformation through Research-Centric Integrated Engagements. Established Aug 2015 in Singapore. Collaboration footprint: Singapore, ASEAN, Taiwan, China, Hong Kong, Australia, Europe. Map labels: Singapore (AP HQ), Taiwan, China, Australia, Hong Kong, Luxembourg, Thailand, London, Indonesia.
  • 3. QUANTUM COMPUTING. Qubit (quantum bit): - The basic unit of quantum computers. - A qubit is represented as a linear superposition of two basis states, |0> and |1>: $|\psi\rangle = \alpha|0\rangle + \beta|1\rangle$ with $|\alpha|^2 + |\beta|^2 = 1$. - |0> or |1> is observed by measurement. The observation probabilities of |0> and |1> are $|\alpha|^2$ and $|\beta|^2$ respectively. (Figure: qubit on the Bloch sphere, $|\psi\rangle = \cos(\theta/2)\,|0\rangle + e^{i\phi}\sin(\theta/2)\,|1\rangle$.)
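The relation between amplitudes and observation probabilities can be checked numerically. The following is a minimal sketch in plain Python; the function names are illustrative and are not part of Qgate.

```python
import cmath
import math


def bloch_amplitudes(theta, phi):
    """Return (alpha, beta) for the Bloch-sphere angles theta and phi."""
    alpha = complex(math.cos(theta / 2))
    beta = cmath.exp(1j * phi) * math.sin(theta / 2)
    return alpha, beta


def measurement_probabilities(alpha, beta):
    """Probabilities of observing |0> and |1>."""
    return abs(alpha) ** 2, abs(beta) ** 2


# theta = pi/2 gives an equal superposition of |0> and |1>.
alpha, beta = bloch_amplitudes(math.pi / 2, 0.0)
p0, p1 = measurement_probabilities(alpha, beta)
print(p0, p1)
```

The two probabilities always sum to 1 because cos^2 + sin^2 = 1, regardless of the phase phi.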
  • 4. QUANTUM COMPUTING: Quantum circuit. Quantum circuits consist of qubits and quantum logic gates. - With N qubits, 2^N states can be represented (if entangled). - One quantum state corresponds to one complex number. Ex. With 53 qubits, 2^53 (about 10 Peta) states can be represented. Quantum states are controlled by using quantum logic gates. - Applying one gate can change 2^N quantum states at the same time. - Developing quantum circuits is the programming for quantum computing. (Figure: quantum circuit with Hadamard gates, H, on four qubits.)
  • 5. QUANTUM CIRCUIT SIMULATION: State vector. - Quantum states are expanded into a vector of complex numbers. - The vector size is 2^N for an N-qubit circuit. - Each bit of the index corresponds to one qubit. (Figure: quantum states and state vector. The complex amplitudes $s_0, s_1, s_2, \dots, s_{2^N-2}, s_{2^N-1}$ belong to the basis states |0…00>, |0…01>, |0…10>, …, |1…10>, |1…11>; the bits of the state-vector index correspond to the qubits q0, q1, …, qN-1.)
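Since the vector holds 2^N complex numbers, its memory footprint follows directly from the qubit count. A minimal sketch, assuming one complex amplitude is stored as two floating-point values (8 bytes each for FP64); the helper name is illustrative:

```python
GIB = 2 ** 30


def state_vector_bytes(n_qubits, bytes_per_real=8):
    """Bytes needed for 2**n_qubits complex amplitudes (two reals each)."""
    return (2 ** n_qubits) * 2 * bytes_per_real


for n in (27, 30, 32):
    print(n, "qubits:", state_vector_bytes(n) // GIB, "GiB")
```

For FP64 this gives 2 GiB at 27 qubits and 64 GiB at 32 qubits, which agrees with the state-vector capacities quoted in the QUANTUM CIRCUIT EXAMPLES slide.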
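  • 6. QUANTUM CIRCUIT SIMULATION: Quantum Logic Gate. A gate U is represented as a 2x2 unitary matrix, $U = \begin{pmatrix} u_{00} & u_{01} \\ u_{10} & u_{11} \end{pmatrix}$. Applying a quantum gate to a state vector: $\begin{pmatrix} s_{i+1,|\dots 0 \dots\rangle} \\ s_{i+1,|\dots 1 \dots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\dots 0 \dots\rangle} \\ s_{i,|\dots 1 \dots\rangle} \end{pmatrix}$. Controlled gate (control and target qubits): $U = \begin{pmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & u_{00} & u_{01} \\ 0 & 0 & u_{10} & u_{11} \end{pmatrix}$. The gate is applied when the control qubit is |1>: $\begin{pmatrix} s_{i+1,|\dots 1 \dots 0 \dots\rangle} \\ s_{i+1,|\dots 1 \dots 1 \dots\rangle} \end{pmatrix} = U \begin{pmatrix} s_{i,|\dots 1 \dots 0 \dots\rangle} \\ s_{i,|\dots 1 \dots 1 \dots\rangle} \end{pmatrix}$. Controlled gates can make qubits entangled.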
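The update rule above can be sketched in plain Python. This is a naive reference illustration, not Qgate's implementation; it assumes that bit k of the state-vector index corresponds to qubit k, and `apply_gate` is a hypothetical helper name.

```python
def apply_gate(state, u, target, control=None):
    """Apply the 2x2 matrix u to qubit `target` of a state vector.

    state: list of 2**N complex amplitudes.
    control: optional control qubit; the gate acts only where its bit is 1.
    """
    new_state = list(state)
    target_mask = 1 << target
    for i0 in range(len(state)):
        if i0 & target_mask:
            continue  # each pair is visited once, via the index with target bit 0
        if control is not None and not (i0 >> control) & 1:
            continue  # control qubit is |0>: amplitudes stay unchanged
        i1 = i0 | target_mask
        s0, s1 = state[i0], state[i1]
        new_state[i0] = u[0][0] * s0 + u[0][1] * s1
        new_state[i1] = u[1][0] * s0 + u[1][1] * s1
    return new_state


R = 2 ** -0.5
H = [[R, R], [R, -R]]
X = [[0, 1], [1, 0]]

state = [1, 0, 0, 0]                                # |00>
state = apply_gate(state, H, target=0)              # superposition on qubit 0
state = apply_gate(state, X, target=1, control=0)   # controlled-X entangles the qubits
print(state)
```

The result has equal amplitudes on |00> and |11> and zero elsewhere, a small example of a controlled gate producing entanglement.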
  • 7. PROBLEM DEFINITION. A quantum circuit simulator is an essential tool for developing quantum circuits, but it has limitations. It is often said that the number of qubits is the limitation, because simulation requires a vast amount of memory, proportional to 2^N. But the actual issue as of today is that simulation is very slow: debugging and verifying quantum circuits takes a long time.
  • 8. QUANTUM CIRCUIT EXAMPLES
    Circuit | # qubits | # gates | Capacity of state vector | Estimated simulation time, Python*1 (CPU 1 core) | Estimated simulation time, CPU*2 (multi-core)
    Quantum Volume*3 (width 32, depth 32) | 32 | 5,120 | 64 GB | 2 days | 3 hours
    iQFT*4 (Ex: 32 qubits) | 32 | 560 | 64 GB | 3 hours | 13 min
    Modulo operation (5^n mod 12) | 27 | 5,449 | 2 GB | 45 min | 3 min
    *1: Simulation with 1 CPU core. *2: Assuming 55 GB/sec of CPU memory bandwidth with a naïve simulation algorithm. *3: https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm *4: iQFT, Inverse Quantum Fourier Transform.
  • 10. QGATE DESIGN CONCEPT. 1. Easy development of quantum circuits with fast simulations for experiments: a rich built-in gate set to quickly develop circuits; utilizing modern computing devices for performance. 2. Single node, multi GPU (multi devices): utilizing a big server with a huge amount of memory; focusing on performance; no inter-node communication. 3. Works as a backend of other SDKs: simulations can be accelerated on Blueqat and various other SDKs.
  • 11. 1. EASY DEVELOPMENT OF QUANTUM CIRCUITS. Rich built-in gate set: - Multi-bit-controlled gates, such as the Toffoli gate, are included in the built-in gate set. - Adjoint is available for all gates. All qubits are fully connected. IBM's OpenQASM gate set is also supported.
  • 12. BUILT-IN OPERATORS
    Quantum logic gate | Symbol
    Identity | I
    Hadamard gate | H
    Pauli gates and their rotations | X, Y, Z, Rx(theta), Ry(theta), Rz(theta)
    Exponential of identity and Pauli gates | Exp(I, X, Y, Z)
    Global phase | Exp(theta)
    Phase shift gates | P(theta), T, S
    Measurement, Probability | Measure(qubit), Prob(qubit)
    Extensions:
    OpenQASM's U gates | U3, U2, U1
    Multi qubit measurement | Measure(pauli gates)
  • 13. UTILIZING MODERN COMPUTING DEVICES FOR PERFORMANCE. GPU: Tesla V100 (SXM2), 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS, 32 GB HBM2 @ 900 GB/s | 300 GB/s NVLink. CPU: a CPU runtime is also implemented (utilizing multiple cores in one CPU socket).
  • 14. TARGET HARDWARE. Requirement: - Quantum circuit simulations need a huge amount of memory. - Performance is important as well. NVIDIA DGX-2: - 512 GB of GPU memory in 16 Tesla V100 GPUs. - By using NVLink, all memory in the GPUs is in one address space.
  • 15. DGX-2: 16 NVIDIA high-end GPUs + NVLink2. All GPUs share a single address space. All-to-all connections by NVLink (300 GB/sec, bidirectional). - 512 GB of ultra-fast memory is available. - FP32: 35 qubits, FP64: 34 qubits.
  • 16. NVIDIA DGX STATION: At a Glance
    GPUs | 4x NVIDIA® Tesla® V100
    TFLOPS (GPU FP16) | 500
    GPU Memory | 32 GB per GPU
    NVIDIA Tensor Cores | 2,560 (total)
    NVIDIA CUDA Cores | 20,480 (total)
    CPU | Intel Xeon E5-2698 v4 2.2 GHz (20-core)
    System Memory | 256 GB LRDIMM DDR4
    Storage | Data: 3 x 1.92 TB SSD RAID 0; OS: 1 x 1.92 TB SSD
    Network | Dual 10 Gb LAN
    Display | 3x DisplayPort, 4K resolution
    Acoustics | < 35 dB
    Maximum Power Requirements | 1500 W
    Operating Temperature Range | 10 - 30 °C
    Software | Ubuntu Desktop Linux OS, DGX Recommended GPU Driver, CUDA Toolkit
  • 17. DGX STATION NVLINK NETWORK TOPOLOGY: For Efficient Application Scaling. NVIDIA NVLink Bridge: - Four NVIDIA Tesla V100 accelerators. - Each Tesla V100 GPU in DGX Station has four NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 25 GB/s. - Optimized for: the bandwidth achievable for a variety of point-to-point and collective communications primitives; the flexibility of the topology; performance with a subset of the GPUs.
  • 18. GPU REQUIREMENT. Qgate runs with a single GPU, and scales to multiple GPUs in a single node. - Works with Kepler GPUs (compute capability 3.5) or later. Maxwell GPUs (compute capability 5.0) or later are recommended. Multi GPU requirement: - NVLink: all-to-all NVLink connections between GPUs are required. For performance, NVLink is strongly recommended. - PCIe: all GPUs should be connected to the same PCIe root complex. CPU: - Running with 1 CPU socket is supported. NUMA is not taken into consideration.
  • 19. PERFORMANCE MEASUREMENT: Baseline, Single GPU Performance. Quantum circuit for measurement: - 10 Hadamard gates are placed on each qubit. - FP64 is used. Devices: CPU (1 core)*1: single thread on CPU. CPU (multi-core): multi-threaded*2 on CPU (40 threads, 20 physical cores). GPU: GPU / CUDA. *1: CPU (1 core) models a Python-based simulator, which is sometimes implemented using 1 CPU core. *2: Implemented using the C++ STL thread class. (Figure: circuit with 10 Hadamard gates on each qubit.)
  • 20. SUMMARY: Performance Baseline (30 qubits, Single GPU)
    Device | # gates applied per sec. | Memory bandwidth | Acc. (vs. CPU 1 core) | Acc. (vs. CPU multi-core)
    CPU (1 core) | 0.11 | 3.7 GB/sec | 1 | -
    CPU (multi-core) | 1.59 | 54.8 GB/sec | 14.9x | 1
    1 GPU | 23.5 | 806 GB/sec | 220x | 14.7x
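The bandwidth column can be related to the gate rate with a short calculation. This sketch assumes (it is not stated on the slide) that each gate application reads and writes the whole FP64 state vector once; under that assumption the results land within rounding of the figures in the table.

```python
def effective_bandwidth_gb_per_s(gates_per_sec, n_qubits, bytes_per_amplitude=16):
    """Memory traffic in GB/s if every gate reads and writes the full state vector."""
    bytes_per_gate = 2 * (2 ** n_qubits) * bytes_per_amplitude  # one read + one write
    return gates_per_sec * bytes_per_gate / 1e9


for label, rate in (("CPU (1 core)", 0.11), ("CPU (multi-core)", 1.59), ("1 GPU", 23.5)):
    print(label, round(effective_bandwidth_gb_per_s(rate, 30), 1), "GB/s")
```

The GPU figure comes out close to the 900 GB/s peak HBM2 bandwidth of the Tesla V100, which suggests the naive gate kernel is bound by memory bandwidth rather than arithmetic.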
  • 22. PROCESSING PIPELINE: Built with Python and Native Extensions. Input (intermediate representation) in Python; output (state vector) from the native extension. Quantum computing specific stages: - Gate cancellation: removing cancelling gates. - Operator reordering: reordering operators (gates and measurements) in order to maximize the effect of dynamic qubit grouping. - Dynamic qubit grouping: dynamically grouping qubits, reducing the number of variables required to represent quantum states. Device specific stages: - Qubit reordering: reordering qubits to reduce data transfer between devices. - Runtime: parallelization on computing devices, CPU (multi-core) and GPU (CUDA).
  • 23. SOFTWARE DIAGRAM. Frontend: qgate.script (circuit definition in Python), qgate.openqasm (importing OpenQASM files), plugins (Blueqat plugin, other plugins …). OM (object model), analyses and optimizations for quantum circuits: qgate.model (quantum circuit object model, built-in gate definitions). Backend: qgate.simulator (simulator), qgate.simulator.qubits (state vector, complex number, probability), qgate.simulator.runtime. Runtime, accelerating numerical operations: pyruntime (Python, reference), cpuruntime (CPU, multi-core), cudaruntime (CUDA, GPU).
  • 24. GATE CANCELLATION: Quantum Circuit Optimization. Products of some gate pairs cancel out: $I = X \cdot X = Y \cdot Y = Z \cdot Z = H \cdot H$. (Figure: circuit in which adjacent pairs of X gates cancel out; U is an arbitrary unitary gate.) This also works for pairs of Y, Z and H gates, whose squares are the identity. Ex: Modulo arithmetic* (5^x mod 12, 27 qubits):
    Item | Value
    Before cancellation | 5449 gates
    After cancellation | 3885 gates
    Ratio (after / before) | 71.3 %
    *This circuit was developed by Kato-san in MDR. Ref: V. Vedral, A. Barenco, A. Ekert, https://arxiv.org/abs/quant-ph/9511018v1
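One way to realize this kind of cancellation is a per-qubit stack that remembers the last surviving gate on each qubit. The sketch below is an assumption about how such a pass could look, not Qgate's actual algorithm; it only cancels single-qubit X, Y, Z and H gates, and the `(name, qubits)` tuple format is hypothetical.

```python
SELF_INVERSE = {"X", "Y", "Z", "H"}  # gates whose square is the identity


def cancel_gates(gates):
    """Remove pairs of identical self-inverse single-qubit gates.

    gates: list of (name, qubits) tuples in circuit order.
    A pair cancels when no other gate touches the qubit in between.
    """
    out = []      # surviving gates; cancelled slots are set to None
    stacks = {}   # qubit -> indices in `out` of surviving gates on that qubit
    for gate in gates:
        name, qubits = gate
        if name in SELF_INVERSE and len(qubits) == 1:
            stack = stacks.setdefault(qubits[0], [])
            if stack and out[stack[-1]] == gate:
                out[stack.pop()] = None  # the pair multiplies to identity
                continue
        for q in qubits:
            stacks.setdefault(q, []).append(len(out))
        out.append(gate)
    return [g for g in out if g is not None]


circuit = [("X", (0,)), ("H", (1,)), ("X", (0,)), ("CX", (0, 1)), ("H", (1,))]
print(cancel_gates(circuit))
```

Here the two X gates on qubit 0 cancel, while the two H gates on qubit 1 do not, because the controlled gate between them also acts on qubit 1.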
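  • 25. DYNAMIC QUBIT GROUPING. If qubits are not entangled, - the state vector can be factorized, - reducing the number of variables. A 3-qubit state vector has size 8: $s_0|000\rangle, s_1|001\rangle, s_2|010\rangle, s_3|011\rangle, s_4|100\rangle, s_5|101\rangle, s_6|110\rangle, s_7|111\rangle$. If 1 qubit is not entangled, it factorizes into a 1-qubit vector and a 2-qubit vector, $(s_{00}|0\rangle, s_{01}|1\rangle) \otimes (s_{10}|00\rangle, s_{11}|01\rangle, s_{12}|10\rangle, s_{13}|11\rangle)$, of size 6 = (2 + 4).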
  • 26. DYNAMIC QUBIT GROUPING: 30 qubit case. The 30-qubit state vector, $s_0|0 \dots 00\rangle, s_1|0 \dots 01\rangle, s_2|0 \dots 10\rangle, \dots, s_{2^{30}-1}|1 \dots 11\rangle$, has size $2^{30}$. If the qubits are divided into 10- and 20-qubit groups, it factorizes into a 10-qubit vector ($s_0, \dots, s_{2^{10}-1}$) ⊗ a 20-qubit vector ($s_0, \dots, s_{2^{20}-1}$), of size $2^{20} + 2^{10}$ (about 0.1 % of $2^{30}$).
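The size reduction from grouping is simple arithmetic: each group of k qubits needs 2^k amplitudes, and the groups are stored separately. A minimal sketch (the helper name is illustrative):

```python
def grouped_size(group_sizes):
    """Number of amplitudes when each qubit group has its own state vector."""
    return sum(2 ** k for k in group_sizes)


print(grouped_size([1, 2]))                    # 3 qubits as 1 + 2: 6 instead of 8
ratio = grouped_size([10, 20]) / 2 ** 30       # 30 qubits as 10 + 20
print(f"{ratio:.4%}")
```

The benefit shrinks quickly as groups merge, since the largest group dominates the sum.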
  • 27. EX. INVERSE QUANTUM FOURIER TRANSFORM. Qubits are grouped when a controlled gate is applied. (Figure: 5-qubit iQFT circuit built from H gates and controlled rotations R1 to R4.) # Variables as the circuit proceeds: 10 (2x5), 10 (2^2 + 2x3), 12 (2^3 + 2x2), 18 (2^4 + 2), 32 (2^5).
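The variable counts on this slide can be reproduced by tracking which qubits have been joined by controlled gates. The sketch below is an illustration only: the chain of controlled gates is a hypothetical sequence chosen to merge one more qubit at each step, not the exact gate order of the iQFT circuit.

```python
def variables_after_each_gate(n_qubits, controlled_gates):
    """Track the number of amplitudes as controlled gates merge qubit groups."""
    group_of = list(range(n_qubits))   # every qubit starts in its own group
    counts = [2 * n_qubits]            # n separate 1-qubit vectors
    for control, target in controlled_gates:
        old, new = group_of[target], group_of[control]
        group_of = [new if g == old else g for g in group_of]  # merge the groups
        sizes = {}
        for g in group_of:
            sizes[g] = sizes.get(g, 0) + 1
        counts.append(sum(2 ** k for k in sizes.values()))
    return counts


counts = variables_after_each_gate(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(counts)  # [10, 10, 12, 18, 32]
```

A controlled gate between two qubits that are already in the same group leaves the count unchanged, so only the first gate that links two groups costs extra memory.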
  • 28. EFFECTS OF DYNAMIC QUBIT GROUPING: iQFT, Numerical Estimation. The calculation amount is reduced by applying qubit grouping: 12.1 % at 30 qubits. (Figure: calculation amount, on a log axis, with and without qubit grouping, and the ratio of calculation amount (qubit grouping enabled / disabled), plotted against # qubits from 0 to 32.)
  • 29. CALCULATION AMOUNT COMPARISON: CUDA / CPU / Theoretical. In the range where # qubits is small, - processing overheads are observed. In the range where # qubits is big, - computation time is long enough that the overhead is relatively small, - estimation and measurement match. Observed overhead: - time for analyzing the quantum circuit, - managing grouped state vectors. (Figure: reduction ratio versus # qubits from 8 to 32 for CUDA, CPU (multi-core) and Theoretical; processing overheads are observed at small qubit counts, and performance improves as expected at large qubit counts.)
  • 30. OPERATOR REORDERING. Maximizing the effect of dynamic qubit grouping: - Reordering operators into a smaller qubit group. - Reducing the amount of calculation. (Figure: a circuit with gates U0 to U4, before and after reordering.)
  • 31. BENCHMARK: Phase Estimation. One of the most important algorithms of quantum computing. - Shor's algorithm: used for the order-finding problem (https://en.wikipedia.org/wiki/Shor%27s_algorithm). - Quantum chemistry: used for obtaining matrix eigenvalues.
  • 32. PHASE ESTIMATION: Without Operator Reordering. (Figure: phase estimation circuit with controlled gates U, U2, U4, U8, U16 followed by the iQFT built from H gates and rotations R1 to R4.)
  • 33. PHASE ESTIMATION: Operators are Reordered. (Figure: the same phase estimation circuit, with controlled gates U, U2, U4, U8, U16 and the iQFT gates H and R1 to R4, after operator reordering.)
  • 34. PHASE ESTIMATION: Benchmark. 30 qubit circuit, 493 gates, FP64. - Measuring the global phase of one qubit. - 29 qubits are used for measurements. - Running on a single Tesla V100 (32 GB). (Figure: controlled gates $\exp(i\,2^{n-1}\theta), \exp(i\,2^{n-2}\theta), \dots, \exp(i\,\theta)$ on 29 qubits, followed by the iQFT.)
  • 35. AN EXAMPLE OF CALCULATION RESULTS. 1024 shots of sampling. The initial value is 0.1. Raw sampling results, as (value, count): (0.09999997168779373, 1) (0.09999998286366463, 1) (0.09999998472630978, 1) (0.09999999031424522, 1) (0.09999999217689037, 1) (0.09999999403953552, 4) (0.09999999590218067, 4) (0.09999999776482582, 26) (0.09999999962747097, 900) (0.10000000149011612, 57) (0.10000000335276127, 17) (0.10000000521540642, 7) (0.10000000707805157, 1) (0.10000000894069672, 1) (0.10000001080334187, 1) (0.10000001639127731, 1)
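The sampled values are consistent with multiples of 2^-29, which is the resolution that 29 measurement qubits would give. The sketch below assumes the initial value 0.1 is a phase expressed as a fraction of a full turn; that interpretation and the helper name are assumptions, not statements from the slides.

```python
def nearest_phase_estimate(phase, n_bits):
    """Closest value to `phase` representable as k / 2**n_bits."""
    k = round(phase * 2 ** n_bits)
    return k / 2 ** n_bits


estimate = nearest_phase_estimate(0.1, 29)
print(estimate)                  # closest 29-bit approximation of 0.1
print(abs(estimate - 0.1))       # error is below the 2**-30 half-step
```

Under this reading, the most frequent sample (900 of 1024 shots) is the closest 29-bit approximation of 0.1, and the neighbouring outcomes are spaced 2^-29 apart.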
  • 36. PHASE ESTIMATION: Operator Reordering, Single GPU. 30 qubit circuit, 493 gates, FP64. - Measuring the global phase of one qubit. - 29 qubits are used for measurements.
    Runtime / optimization | Elapsed time [s] | Acceleration
    CPU / no optimization | 213 | 1
    CPU / optimized | 24.7 | 8.6x
    CUDA / no optimization | 13.7 | 15.5x
    CUDA / optimized | 1.86 | 114x
  • 37. MULTI GPU + NVLINK
  • 38. IDEAL MULTI GPU PERFORMANCE. Performance baseline (30 qubits, single GPU): 1 GPU applies 23.5 gates per second at 806 GB/sec, which is 220x faster than CPU (1 core) and 14.7x faster than CPU (multi-core). With 4 GPUs (DGX Station), the ideal acceleration over CPU (multi-core) is 58.8x = 14.7x x 4.
  • 39. BOTTLENECK: DATA TRANSFER. Ex. DGX Station. NVLink is fast, but slower than GPU memory. Bandwidth: GPU memory 900 GB/s; NVLink (1 link, bidirectional) 50 GB/s. (Figure: four GPUs, each with 900 GB/s of memory bandwidth, connected pairwise by NVLink at 50 GB/s or 100 GB/s.)
  • 40. QUBIT REORDERING: Multi GPU, Reducing Data Transfers. Ex) 6 qubits, q0 to q5. Applying gates to q0 ~ q3 is done within each GPU. When q4 or q5 is included in the target qubits, data transfers between GPUs happen for each gate application. Ref: 0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit, Thomas Häner, Damian S. Steiger, https://arxiv.org/abs/1704.01127
  • 41. QUBIT REORDERING: Multi GPU, Reducing Data Transfers. Reordering qubits: - Swapping q0 ~ q2 and q3 ~ q5. - All required inter-device communications are done while reordering qubits. - All gates are then applied within each GPU. Ex) The qubit order q0 q1 q2 q3 q4 q5 becomes q3 q4 q5 q0 q1 q2; data transfers between GPUs happen only during the reordering, and gates are applied within each GPU before and after it.
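The effect of reordering can be illustrated by counting how many gates touch a qubit whose index bit selects the GPU. This is a minimal sketch assuming 6 qubits on 4 GPUs, so that positions 0 to 3 are local and positions 4 and 5 are spread across devices; the gate sequence and names are hypothetical.

```python
N_LOCAL = 4  # positions 0..3 live inside one GPU; positions 4 and 5 select the GPU


def gates_needing_transfer(targets, position=None):
    """Count gates whose target qubit sits at a non-local position."""
    if position is None:
        position = {q: q for q in targets}  # identity layout
    return sum(1 for q in targets if position[q] >= N_LOCAL)


targets = [4, 5, 4, 5, 3]                       # hypothetical sequence of gate targets
swap = {0: 3, 1: 4, 2: 5, 3: 0, 4: 1, 5: 2}     # swap q0..q2 with q3..q5

before = gates_needing_transfer(targets)        # transfers on every q4/q5 gate
after = gates_needing_transfer(targets, swap)   # all targets are now local
print(before, after)
```

In this example four gates would each trigger inter-GPU traffic, whereas after one reordering step none do, so the communication cost is paid once instead of per gate.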
  • 42. BENCHMARK: Quantum Volume (n=32, d=32), FP64, DGX Station (4 GPUs). https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm 32 qubit circuit, 5120 gates, FP64. Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4.
    Runtime | Optimization | Elapsed time | Acc.
    CPU | No optimization | 3.1 hours | -
    CUDA, 4 Tesla V100 | No optimization | 370 sec | 29.7x
    CUDA, 4 Tesla V100 | + Qubit reordering* | 318 sec | 56.7x
    CUDA, 4 Tesla V100 | + Qubit grouping + Operator reordering | 176 sec | 62.5x
    *: Qubits are reordered 10 times during execution of the whole circuit.
  • 43. BENCHMARK: Phase estimation, 32 qubit circuit. 32 qubit circuit, 558 gates, FP64. Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU: Tesla V100 x 4.
    Runtime | Optimization | Elapsed time | Acc.
    CPU | No optimization | 774 sec | -
    CUDA, 4 Tesla V100 | No optimization | 18.4 sec | 42x
    CUDA, 4 Tesla V100 | + Qubit reordering* | 15.4 sec | 50x
    CUDA, 4 Tesla V100 | + Qubit grouping + Operator reordering | 3.2 sec | 242x
    *: Qubits are reordered 8 times during execution of the whole circuit.
  • 44. PLANS FOR THE NEXT VERSION. • Supporting hyper-cube-mesh topology: fully utilizing 8 GPUs on servers such as DGX-1 and the AWS p3dn.24xlarge instance, which enables running 33 qubit circuits (float64). • Acceleration of GPU kernels: Qgate 0.3 implements naïve GPU kernels to apply gates, which are not optimized yet.