2. 2
NVIDIA AI TECHNOLOGY CENTER (NVAITC)
Catalyse AI transformation through Research-Centric Integrated Engagements
Locations: Singapore (AP HQ), Taiwan, China, Australia, Hong Kong, Luxembourg, Thailand, London, Indonesia
Established Aug 2015 in Singapore
Collaboration Footprint: Singapore. ASEAN. Taiwan. China. Hong Kong. Australia. Europe.
3. 3
QUANTUM COMPUTING
Qubit (Quantum bit):
- The basic unit of quantum computers.
- A qubit is represented as a linear superposition of two basis states, |0> and |1>:
  |ψ> = α|0> + β|1>,  |α|^2 + |β|^2 = 1
- Measurement yields either |0> or |1>.
  The observation probabilities of |0> and |1> are |α|^2 and |β|^2, respectively.
Qubit: α = cos(θ/2), β = e^(iφ) sin(θ/2), i.e. |ψ> = cos(θ/2)|0> + e^(iφ) sin(θ/2)|1>
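The relation between the two angles and the measurement probabilities can be checked numerically. The following is a minimal sketch in plain Python; the function names `qubit_from_angles` and `measurement_probabilities` are hypothetical and are not part of Qgate.

```python
import cmath
import math


def qubit_from_angles(theta, phi):
    """Return (alpha, beta) for cos(theta/2)|0> + e^(i*phi) sin(theta/2)|1>."""
    alpha = complex(math.cos(theta / 2.0))
    beta = cmath.exp(1j * phi) * math.sin(theta / 2.0)
    return alpha, beta


def measurement_probabilities(alpha, beta):
    """Return (P(|0>), P(|1>)) = (|alpha|^2, |beta|^2)."""
    return abs(alpha) ** 2, abs(beta) ** 2


# Example angles chosen for illustration only.
alpha, beta = qubit_from_angles(math.pi / 3.0, math.pi / 4.0)
p0, p1 = measurement_probabilities(alpha, beta)
print(p0, p1, p0 + p1)
```

The phase factor e^(iφ) changes β but not |β|^2, so the two probabilities depend only on θ and always sum to 1.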
4. 4
QUANTUM COMPUTING
Quantum circuits consist of qubits and quantum logic gates.
- With N qubits, 2^N states can be represented (if entangled).
- One quantum state corresponds to one complex number.
Ex. With 53 qubits, 2^53 (about 9 x 10^15, roughly 10 peta) states can be represented.
Quantum states are controlled by quantum logic gates.
- Applying one gate can change all 2^N quantum states at the same time.
- Developing quantum circuits is how quantum computers are programmed.
[Figure: quantum circuit with Hadamard (H) gates]
5. 5
QUANTUM CIRCUIT SIMULATION
State vector
- Quantum states are expanded into a vector of complex numbers.
- The vector size is 2^N for an N-qubit circuit.
- Each bit of the index corresponds to one qubit.
Quantum states and state vector
Index of state vector | Quantum state (complex number)
|0...00> | s_0
|0...01> | s_1
|0...10> | s_2
... | ...
|1...10> | s_(2^N - 2)
|1...11> | s_(2^N - 1)
Each bit of the index corresponds to one of the qubits q0, q1, ..., qN-1.
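The mapping between state-vector indices and basis states can be sketched in a few lines of Python. This is an illustration only: the helper names are hypothetical, and the bit order (qubit k stored in bit k of the index, with q0 as the least significant bit) is an assumption, since the slide does not state which end of the index q0 occupies.

```python
def basis_label(index, n_qubits):
    """Return the ket label of a state-vector index, one binary digit per qubit."""
    return "|" + format(index, "0{}b".format(n_qubits)) + ">"


def qubit_bit(index, qubit):
    """Assumption: qubit k is stored in bit k of the index (q0 = least significant bit)."""
    return (index >> qubit) & 1


n_qubits = 3
size = 2 ** n_qubits  # the state vector holds 2^N complex numbers
labels = [basis_label(i, n_qubits) for i in range(size)]
print(size, labels)
```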
6. 6
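QUANTUM CIRCUIT SIMULATION
Quantum Logic Gate
A gate is represented as a 2x2 unitary matrix:
  U = [ u00 u01 ]
      [ u10 u11 ]
Applying a quantum gate to a state vector:
  [ s_(i+1, |...0...>) ]       [ s_(i, |...0...>) ]
  [ s_(i+1, |...1...>) ]  = U  [ s_(i, |...1...>) ]
Here s_i and s_(i+1) are the quantum states before and after the gate, and the 0 or 1 shown in the ket is the bit of the target qubit.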
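Controlled gate (control qubit, target qubit)
  Controlled-U = [ 1 0  0   0  ]
                 [ 0 1  0   0  ]
                 [ 0 0 u00 u01 ]
                 [ 0 0 u10 u11 ]
The gate U is applied to the target qubit when the control qubit is |1>.
Controlled gates can make qubits entangled.
  [ s_(i+1, |...1...0...>) ]       [ s_(i, |...1...0...>) ]
  [ s_(i+1, |...1...1...>) ]  = U  [ s_(i, |...1...1...>) ]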
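The update rule above can be written as a short loop over pairs of state-vector entries. The following is a plain-Python sketch, not Qgate's implementation: `apply_gate` is a hypothetical helper, and it assumes that qubit k corresponds to bit k of the state-vector index.

```python
import math


def apply_gate(state, u, target, control=None):
    """Apply the 2x2 matrix u to qubit `target` of a state vector (list of complex).

    If `control` is given, only pairs whose control bit is 1 are updated.
    Assumption: qubit k corresponds to bit k of the index.
    """
    new_state = list(state)
    t_mask = 1 << target
    for i0 in range(len(state)):
        if i0 & t_mask:
            continue  # visit each (|..0..>, |..1..>) pair once, from its 0 side
        if control is not None and not (i0 >> control) & 1:
            continue  # control qubit is |0>: leave this pair untouched
        i1 = i0 | t_mask
        a0, a1 = state[i0], state[i1]
        new_state[i0] = u[0][0] * a0 + u[0][1] * a1
        new_state[i1] = u[1][0] * a0 + u[1][1] * a1
    return new_state


r = 1.0 / math.sqrt(2.0)
H = [[r, r], [r, -r]]  # Hadamard gate
X = [[0.0, 1.0], [1.0, 0.0]]  # Pauli X gate

state = [1 + 0j, 0j, 0j, 0j]  # |00>
state = apply_gate(state, H, target=0)
state = apply_gate(state, X, target=1, control=0)  # controlled-X
print(state)
```

Starting from |00>, the Hadamard gate followed by the controlled-X gate leaves equal amplitudes on |00> and |11> only, which is a small instance of a controlled gate entangling two qubits.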
7. 7
PROBLEM DEFINITION
A quantum circuit simulator is an essential tool for developing quantum circuits, but it has limitations.
It is often said that the number of qubits is the limitation, because simulation requires a vast amount of memory, proportional to 2^N.
But the actual issue today is:
“Simulation is very slow.”
Debugging and verifying quantum circuits takes a long time.
8. 8
QUANTUM CIRCUIT EXAMPLES
Circuit | # qubits | # gates | Capacity of state vector | Estimated simulation time, Python*1 (CPU 1 core) | Estimated simulation time, CPU*2 (multi-core)
Quantum Volume*3 (width 32, depth 32) | 32 | 5,120 | 64 GB | 2 days | 3 hours
iQFT*4 (Ex: 32 qubits) | 32 | 560 | 64 GB | 3 hours | 13 min
Modulo operation (5^n mod 12) | 27 | 5,449 | 2 GB | 45 min | 3 min
*1: Simulation with 1 CPU core.
*2: Assuming 55 GB/sec of CPU memory bandwidth with a naïve simulation algorithm.
*3: https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
*4: iQFT: Inverse Quantum Fourier Transform
10. 11
QGATE DESIGN CONCEPT
1. Easy development of quantum circuits with fast simulations for experiments
Rich built-in gate set to quickly develop circuits
Utilizing modern computing devices for performance
2. Single node, Multi GPU (multi devices)
Utilizing a big server with a huge amount of memory.
Focusing on performance. No inter-node communication.
3. Works as a backend for other SDKs
Simulations in Blueqat and various other SDKs can be accelerated.
11. 12
1. EASY DEVELOPMENT OF QUANTUM CIRCUITS
Rich built-in gate set
- Multi-bit-controlled gates, such as the Toffoli gate, are included in the built-in gate set
- Adjoint for all gates
All qubits are fully connected
IBM’s OpenQASM gate set is also supported
12. 13
BUILT-IN OPERATORS
Quantum logic gate | Symbol
Identity | I
Hadamard gate | H
Pauli gates and their rotations | X, Y, Z, Rx(theta), Ry(theta), Rz(theta)
Exponential of identity and Pauli gates | Exp(I, X, Y, Z)
Global phase | Exp(theta)
Phase shift gates | P(theta), T, S
Measurement, Probability | Measure(qubit), Prob(qubit)
Extensions
OpenQASM’s U gates | U3, U2, U1
Multi qubit measurement | Measure(pauli gates)
13. 14
UTILIZING MODERN COMPUTING DEVICES
FOR PERFORMANCE
Tesla V100 (SXM2)
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS
32 GB HBM2 @ 900GB/s | 300GB/s NVLink
CPU: A CPU runtime is also implemented (using multiple cores in one CPU socket).
14. 15
TARGET HARDWARE
Requirement:
- Quantum circuit simulations need a
huge amount of memory
- Performance is important as well.
DGX-2
- 512 GB of GPU memory in 16 Tesla V100 GPUs
- With NVLink, the memory of all GPUs is in a single address space.
NVIDIA DGX-2
15. 16
DGX-2
All GPUs share a single address space.
All-to-all connections by NVLink
(300 GB/sec, bidirectional)
- 512 GB of ultra-fast memory
is available
- FP32: 35 qubits
FP64: 34 qubits
16 NVIDIA High-end GPUs + NVLink2
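The qubit counts above follow from the size of the state vector. As a rough check, assuming each quantum state is stored as one complex number made of two floating-point values (the helper name `state_vector_bytes` is hypothetical):

```python
GIB = 2 ** 30


def state_vector_bytes(n_qubits, bytes_per_real):
    """Bytes for 2^N complex numbers, each with a real and an imaginary part."""
    return (2 ** n_qubits) * 2 * bytes_per_real


for label, n_qubits, bytes_per_real in (("FP32", 35, 4), ("FP64", 34, 8)):
    size_gib = state_vector_bytes(n_qubits, bytes_per_real) // GIB
    print(label, n_qubits, "qubits:", size_gib, "GiB")
```

Under this assumption both configurations need 256 GiB for the state vector, half of the 512 GB available on DGX-2, and adding one more qubit in either precision would double the requirement.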
16. 17
At a Glance
GPUs 4x NVIDIA® Tesla® V100
TFLOPS (GPU FP16) 500
GPU Memory 32 GB per GPU
NVIDIA Tensor Cores 2,560 (total)
NVIDIA CUDA Cores 20,480 (total)
CPU Intel Xeon E5-2698 v4 2.2 GHz (20-core)
System Memory 256 GB LRDIMM DDR4
Storage
Data: 3 x 1.92 TB SSD RAID 0
OS: 1 x 1.92 TB SSD
Network Dual 10 Gb LAN
Display 3x DisplayPort, 4K Resolution
Acoustics < 35 dB
Maximum Power Requirements 1500 W
Operating Temperature Range 10 - 30 °C
Software
Ubuntu Desktop Linux OS
DGX Recommended GPU Driver
CUDA Toolkit
NVIDIA DGX STATION
17. 18
DGX STATION NVLINK NETWORK TOPOLOGY
For Efficient Application Scaling
NVIDIA NVLink Bridge
- Four NVIDIA Tesla V100 accelerators
- Each Tesla V100 GPU in DGX Station has four NVLink connection points, each providing a point-to-point connection to another GPU at a peak bandwidth of 25 GB/s
- Optimized for:
- The bandwidth achievable for a variety of point-
to-point and collective communications primitives
- The flexibility of the topology
- Performance with a subset of the GPUs
18. 19
GPU REQUIREMENT
Qgate runs with a single GPU, and scales to multiple GPUs in a single node.
- Works with Kepler GPUs (compute capability 3.5) or later. Maxwell GPUs (compute capability 5.0) or later are recommended.
Multi GPU requirement
- NVLink : All-to-all NVLink connections between GPUs are required.
For performance, NVLink is strongly recommended.
- PCIe: All GPUs should be connected to the same PCIe root complex.
CPU
- Running on 1 CPU socket is supported. NUMA is not taken into account.
19. 20
PERFORMANCE MEASUREMENT
Quantum circuit for measurement
- 10 Hadamard gates are placed on each qubit.
- FP64 is used.
Baseline, Single GPU Performance
[Figure: benchmark circuit, 10 Hadamard gates on each qubit]
Device | Implementation
CPU (1 core)*1 | Single thread on CPU
CPU (multi-core) | Multi-threaded*2 on CPU (40 threads, 20 physical cores)
GPU | GPU / CUDA
*1: CPU (1 core) models Python-based simulators, which are sometimes implemented using a single CPU core.
*2: Implemented with the C++ standard library's thread class (std::thread).
20. 21
SUMMARY
Performance Baseline (30 qubits, Single GPU)
Device | # gates applied per sec. | Memory bandwidth | Acc. vs CPU (1 core) | Acc. vs CPU (multi-core)
CPU (1 core) | 0.11 | 3.7 GB/sec | 1 | -
CPU (multi-core) | 1.59 | 54.8 GB/sec | 14.9x | 1
GPU | 23.5 | 806 GB/sec | 220x | 14.7x
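The memory bandwidth column can be related to the gate rate with simple arithmetic. The sketch below assumes that applying one gate reads and writes every entry of the 30-qubit FP64 state vector exactly once; that traffic model is an assumption made here, not something the slide states, and the function name is hypothetical.

```python
def effective_bandwidth_gb_per_sec(gates_per_sec, n_qubits, bytes_per_state=16):
    """Assumption: each gate reads and writes every complex number exactly once."""
    bytes_per_gate = 2 * (2 ** n_qubits) * bytes_per_state
    return gates_per_sec * bytes_per_gate / 1e9


rates = {"CPU (1 core)": 0.11, "CPU (multi-core)": 1.59, "GPU": 23.5}
bandwidths = {
    device: effective_bandwidth_gb_per_sec(rate, 30) for device, rate in rates.items()
}
for device, bandwidth in bandwidths.items():
    print(device, round(bandwidth, 1), "GB/sec")
```

Under this assumption the computed values land close to the figures in the table, which suggests that gate application in this benchmark is limited by memory bandwidth rather than by arithmetic.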
22. 23
PROCESSING PIPELINE
Built with Python and Native Extensions
Input (intermediate repr.) -> optimization stages -> Output (state vector)
Quantum computing specific:
- Gate cancellation: Removing cancelling gates.
- Operator reordering: Reordering operators (gates and measurements) to maximize the effect of dynamic qubit grouping.
- Dynamic qubit grouping: Dynamically grouping qubits to reduce the number of variables required to represent quantum states.
Device specific:
- Qubit reordering: Reordering qubits to reduce data transfer between devices.
- Runtime: Parallelization on computing devices, CPU (multi-core) and GPU (CUDA).
Implemented in Python and a native extension.
23. 24
SOFTWARE DIAGRAM
Frontend
- qgate.script: Circuit definition in Python
- qgate.openqasm: Importing OpenQASM files
- Plugin: Blueqat plugin, other plugins ...
OM (object model): Analyses and optimizations for quantum circuits
- qgate.model: Quantum circuit object model, built-in gate definitions
Backend
- qgate.simulator: Simulator
- qgate.simulator.qubits: State vector, complex number, probability
- qgate.simulator.runtime: Runtime, accelerating numerical operations
  - pyruntime: Python, reference
  - cpuruntime: CPU, multi-core
  - cudaruntime: CUDA, GPU
24. 25
GATE CANCELLATION
Quantum Circuit Optimization
Products of some gate pairs cancel out:
I = X · X = Y · Y = Z · Z = H · H
[Figure: circuit in which adjacent pairs of X gates cancel out; U: arbitrary unitary gate]
Ex: Modulo arithmetic* (5^x mod 12, 27 qubits)
Item | Value
Before cancellation | 5449 gates
After cancellation | 3885 gates
Gates remaining | 71.3 % (a 28.7 % reduction)
*This circuit was developed by Kato-san in MDR.
Ref: V. Vedral, A. Barenco, A. Ekert, https://arxiv.org/abs/quant-ph/9511018v1
Also works for pairs of Y, Z, H gates whose squares are Identity.
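A minimal sketch of this kind of cancellation, restricted to single-qubit gates, is shown below. It is not Qgate's optimizer: the function name and the `(name, qubit)` tuple representation are assumptions made for illustration, and controlled gates would need additional care.

```python
SELF_INVERSE = {"X", "Y", "Z", "H"}  # gates whose square is the identity


def cancel_adjacent_pairs(gates):
    """Remove pairs of identical self-inverse gates that are adjacent on a qubit.

    gates: list of (name, qubit) tuples for single-qubit gates, in circuit order.
    """
    kept = []    # gates in order; cancelled entries are replaced by None
    stacks = {}  # qubit -> indices into `kept` of surviving gates on that qubit
    for name, qubit in gates:
        stack = stacks.setdefault(qubit, [])
        if name in SELF_INVERSE and stack and kept[stack[-1]] == (name, qubit):
            kept[stack.pop()] = None  # the new gate and the previous one cancel
        else:
            stack.append(len(kept))
            kept.append((name, qubit))
    return [gate for gate in kept if gate is not None]


circuit = [("H", 0), ("X", 1), ("H", 0), ("X", 0), ("Z", 1), ("Z", 1), ("X", 1)]
optimized = cancel_adjacent_pairs(circuit)
print(len(circuit), "->", len(optimized), optimized)
```

Keeping a per-qubit stack lets a cancellation expose a new adjacent pair: once the two Z gates on qubit 1 are removed, the X gates around them become neighbours and cancel as well.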
25. 26
DYNAMIC QUBIT GROUPING
If qubits are not entangled,
- The state vector can be factorized.
- The number of variables is reduced.
3-qubit state vector (size: 8):
  s0|000>, s1|001>, s2|010>, s3|011>, s4|100>, s5|101>, s6|110>, s7|111>
If 1 qubit is not entangled (size: 6 = 2 + 4):
  (s00|0>, s01|1>) ⊗ (s10|00>, s11|01>, s12|10>, s13|11>)
  1 qubit ⊗ 2 qubits
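The saving from factorization can be illustrated with a small Python sketch. The helper names are hypothetical and the group contents are example values; the point is only that the grouped representation stores the sum of the group sizes, while any entry of the full state vector is still available as a product.

```python
import math


def grouped_size(group_qubits):
    """Number of complex values stored when each group has its own state vector."""
    return sum(2 ** n for n in group_qubits)


def full_size(group_qubits):
    """Number of complex values in one state vector over all qubits."""
    return 2 ** sum(group_qubits)


def amplitude(groups, indices):
    """Entry of the full state vector = product of the per-group entries."""
    result = 1 + 0j
    for vector, index in zip(groups, indices):
        result *= vector[index]
    return result


r = 1.0 / math.sqrt(2.0)
one_qubit = [r + 0j, r + 0j]         # example values for (s00, s01)
two_qubits = [r + 0j, 0j, 0j, r + 0j]  # example values for (s10, s11, s12, s13)

print(grouped_size([1, 2]), full_size([1, 2]))
print(amplitude([one_qubit, two_qubits], [1, 3]))
```

For the 3-qubit example this gives 6 stored values instead of 8; the gap grows quickly, since a 30-qubit circuit split into two independent 15-qubit groups would store 2 x 2^15 values instead of 2^30.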
29. 30
CALCULATION AMOUNT COMPARISON
For small numbers of qubits,
- Processing overheads are observed.
For large numbers of qubits,
- Computation time is long enough that the overhead is relatively small.
- Estimates and measurements match.
Observed overhead
- Time for analyzing the quantum circuit
- Managing grouped state vectors
[Figure: reduction ratio (y-axis, 0 to 1.4) versus # qubits (x-axis, 8 to 32) for CUDA, CPU (multi core), and Theoretical. Annotations: processing overheads observed (small # qubits); performance improved as expected (large # qubits).]
30. 31
OPERATOR REORDERING
Maximizing effects of dynamic qubit grouping
- Reordering operators so that they are applied to smaller qubit groups
- Reducing the amount of calculation
[Figure: the same circuit with gates U0 to U4, before and after operator reordering]
31. 32
BENCHMARK
Phase Estimation
One of the most important algorithms in quantum computing
- Shor’s algorithm
  Used for the order-finding problem (https://en.wikipedia.org/wiki/Shor%27s_algorithm)
- Quantum chemistry
  Used for obtaining matrix eigenvalues
34. 35
PHASE ESTIMATION
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
- Running on a single Tesla V100 (32 GB)
Benchmark
[Figure: benchmark circuit with gates exp(i 2^(n-1) θ), exp(i 2^(n-2) θ), ..., exp(i θ) on 29 qubits, followed by iQFT]
35. 36
AN EXAMPLE OF CALCULATION RESULTS
1024 shots of sampling.
The initial value is 0.1.
Raw sampling results (estimated value, count):
(0.09999997168779373, 1)
(0.09999998286366463, 1)
(0.09999998472630978, 1)
(0.09999999031424522, 1)
(0.09999999217689037, 1)
(0.09999999403953552, 4)
(0.09999999590218067, 4)
(0.09999999776482582, 26)
(0.09999999962747097, 900)
(0.10000000149011612, 57)
(0.10000000335276127, 17)
(0.10000000521540642, 7)
(0.10000000707805157, 1)
(0.10000000894069672, 1)
(0.10000001080334187, 1)
(0.10000001639127731, 1)
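The spacing of the raw samples is consistent with a 29-bit binary fraction, one bit per measurement qubit. The sketch below assumes that interpretation, which the slides do not spell out; `nearest_representable` is a hypothetical helper.

```python
N_BITS = 29  # number of qubits used for measurement
STEP = 1.0 / 2 ** N_BITS  # spacing between representable estimates


def nearest_representable(value):
    """Closest multiple of 2^-29 to `value` (assumed 29-bit binary fraction)."""
    return round(value * 2 ** N_BITS) / 2 ** N_BITS


peak = nearest_representable(0.1)
print(STEP, peak)
```

Since 0.1 is not an exact multiple of 2^-29, the closest representable estimate falls just below it, in agreement with the most frequent sample (900 of 1024 shots), while the remaining shots spread over neighbouring multiples of the step.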
36. 37
PHASE ESTIMATION
30 qubit circuit, 493 gates, FP64
- Measuring global phase of one qubit.
- 29 qubits are used for measurements.
Operator Reordering, Single GPU
Runtime / optimization | Elapsed time [s] | Acceleration
CPU / no optimization | 213 | 1
CPU / optimized | 24.7 | 8.6x
CUDA / no optimization | 13.7 | 15.5x
CUDA / optimized | 1.86 | 114x
[Figure: benchmark circuit with gates exp(i 2^(n-1) θ), exp(i 2^(n-2) θ), ..., exp(i θ) on 29 qubits, followed by iQFT]
38. 39
IDEAL MULTI GPU PERFORMANCE
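Performance Baseline (30 qubits, Single GPU)
Device | # gates applied per sec. | Memory bandwidth | Acc. vs CPU (1 core) | Acc. vs CPU (multi-core)
CPU (1 core) | 0.11 | 3.7 GB/sec | 1 | -
CPU (multi-core) | 1.59 | 54.8 GB/sec | 14.9x | 1
GPU | 23.5 | 806 GB/sec | 220x | 14.7x
Ideal acceleration over CPU (multi-core) with 4 GPUs: 58.8 = 14.7 x 4 GPUs (DGX Station)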
39. 40
BOTTLENECK : DATA TRANSFER
Ex. DGX Station
NVLink is fast, but slower than GPU memory.
[Figure: DGX Station with four GPUs. GPU memory: 900 GB/s per GPU. NVLink connections between GPUs: 50 GB/s or 100 GB/s.]
Bandwidth
GPU | 900 GB/s
NVLink (1 link, bidirectional) | 50 GB/s
40. 41
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Ex) 6 qubits, q0 to q5
- Gates on q0 to q3 are applied within each GPU.
- When q4 or q5 is among the target qubits, data transfers between GPUs happen for each gate application.
Ref: 0.5 Petabyte Simulation of a 45-Qubit Quantum Circuit, Thomas Häner, Damian S.Steiger, https://arxiv.org/abs/1704.01127
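Why only q4 and q5 trigger transfers can be seen from the index arithmetic. The sketch below assumes the 6-qubit state vector is split evenly across 4 GPUs by the two highest index bits, and that qubit k corresponds to bit k of the index; both are illustrative assumptions, and the function names are hypothetical.

```python
N_QUBITS = 6
LOCAL_QUBITS = 4  # assumption: q0..q3 index entries inside a GPU, q4 and q5 select the GPU


def device_of(index):
    """GPU that holds the state-vector entry at `index` (assumed even split)."""
    return index >> LOCAL_QUBITS


def needs_transfer(target):
    """True if a gate on `target` pairs entries held on different GPUs."""
    for i0 in range(2 ** N_QUBITS):
        i1 = i0 ^ (1 << target)  # partner entry: same index with the target bit flipped
        if device_of(i0) != device_of(i1):
            return True
    return False


transfers = [needs_transfer(q) for q in range(N_QUBITS)]
print(transfers)
```

A gate on qubit k combines entries whose indices differ only in bit k, so the pair stays on one GPU as long as bit k is one of the local bits; flipping a bit that selects the GPU always lands on another device.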
41. 42
QUBIT REORDERING
Multi GPU, Reducing Data Transfers
Reordering qubits
- Swapping q0 to q2 with q3 to q5.
- All required inter-device communication is done while reordering qubits.
- All gates are applied within each GPU.
Ex)
[Figure: the qubit order q0, q1, q2, q3, q4, q5 becomes q3, q4, q5, q0, q1, q2. Gates are applied within each GPU before and after the reordering; data transfers between GPUs happen only during the reordering.]
42. 43
BENCHMARK
https://github.com/Qiskit/openqasm/blob/master/benchmarks/quantum_volume/quantum_volume_n32_d32.qasm
32 qubit circuit, 5120 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU Tesla V100 x 4
Quantum Volume (n=32, d=32), FP64, DGX Station (4 GPUs)
Runtime | Optimization | Elapsed time | Acc.
CPU | No optimization | 3.1 hours | -
CUDA, 4 Tesla V100 | No optimization | 370 sec | 29.7 x
CUDA, 4 Tesla V100 | + Qubit reordering* | 318 sec | 56.7 x
CUDA, 4 Tesla V100 | + Qubit grouping + Operator reordering | 176 sec | 62.5 x
*: Qubits are reordered 10 times during execution of the whole circuit.
43. 44
BENCHMARK
32 qubit circuit, 558 gates, FP64
Hardware: NVIDIA DGX Station. CPU: Xeon E5-2698 v4 2.2 GHz, GPU Tesla V100 x 4
Phase estimation, 32 qubit circuit
Runtime | Optimization | Elapsed time | Acc.
CPU | No optimization | 774 sec | -
CUDA, 4 Tesla V100 | No optimization | 18.4 sec | 42 x
CUDA, 4 Tesla V100 | + Qubit reordering* | 15.4 sec | 50 x
CUDA, 4 Tesla V100 | + Qubit grouping + Operator reordering | 3.2 sec | 242 x
*: Qubits are reordered 8 times during execution of the whole circuit.
44. 45
PLANS FOR THE NEXT VERSION
• Supporting hyper-cube-mesh topology.
• Fully utilizing 8 GPUs on servers such as DGX-1 and AWS p3dn.24xlarge instance
• Enabling 33-qubit circuits (float64) to run.
• Acceleration of GPU kernels.
• Qgate 0.3 applies gates with naïve GPU kernels, which are not yet optimized.