Reading:
“Pi in the sky: Calculating a
record-breaking 31.4 trillion
digits of Archimedes’ constant
on Google Cloud”
Journal Club at AIS Lab. on April 22, 2019
Kento Aoyama, Ph.D. Student
Akiyama Laboratory, Dept. of Computer Science, Tokyo Institute of Technology
Outline
1. Abstract
2. Introduction
3. About “y-cruncher” and Pi computation
4. Computational Details
a. System Overview
b. Major Difficulties
c. Minor Difficulties
5. Summary
6. Supplementals
2
TL;DR Abstract
3
● The authors successfully computed Pi to 31.4 trillion decimal digits using "y-cruncher",
which implements the Chudnovsky formula
● Compute instances provided by Google Cloud were used over the 121-day calculation
● Storage bandwidth was the most important factor
● Error detection and checkpoint/restart functions are crucial for Pi computation
The people behind the record
● Emma Haruka Iwao (@Yuryu)
○ Pi record holder (31.4 trillion)
○ Developer Advocate for Google Cloud Platform (2015~)
○ M.Sc in Computer Science at University of Tsukuba (Prof. Tatebe Lab.)
● Alexander J. Yee (@Mysticial)
○ Author of “y-cruncher” (the program used for this computation)
○ Software Developer at Citadel LLC (2016~)
○ M.Sc in Computer Science at University of Illinois
○ More details at numberworld.org
4
Sources of This Presentation
● Google Cloud Blog:
○ ”Pi in the sky: Calculating a record-breaking 31.4 trillion digits of Archimedes’ constant on
Google Cloud” (accessed: April 18, 2019)
● Private Tech Blog (numberworld.org):
○ ”Google Cloud Topples the Pi Record” (accessed: April 18, 2019)
○ “y-cruncher - A Multi-Threaded Pi-Program” (accessed: April 18, 2019)
● Developer Keynote (Google Cloud Next’19):
○ Video: https://www.youtube.com/watch?time_continue=2971&v=W16iHlo2TuE (accessed: April
18, 2019)
● F. Bellard, “Computation of 2700 billion decimal digits of Pi using a Desktop Computer,”
technical report, bellard.org, Feb. 2010.
5
Introduction
6
Rough Introduction
● Most scientific applications don’t need Pi beyond a few hundred digits, but that isn’t
stopping anyone.
● The complexity of Chudnovsky's formula - a common algorithm for computing Pi - is
O(n (log n)^3): the time and resources necessary to calculate digits grow more
rapidly than the number of digits itself
● Using Compute Engine, Google Cloud’s high-performance infrastructure-as-a-service
offering, has a number of benefits over using dedicated physical machines
○ the live migration feature lets your application continue running while Google takes care of the
heavy lifting needed to keep the infrastructure up to date
7
From Google Cloud Blog
System Overview
8
From Google Cloud Blog
Miscellaneous Facts and Statistics
● First Pi record using a cloud service
● First Pi record using SSDs
● First Pi record using the AVX-512 instruction set
● First Pi record using network-attached storage (NAS)
● Second Pi record with y-cruncher that encountered and recovered from a silent
hardware error
● The computation racked up a total of 10 PB of file reads and 9 PB of file writes
● The computation was bottlenecked by storage bandwidth to about 1/8 of peak speed
9
From numberworld.org
About “y-cruncher” and
Pi computation
10
About y-cruncher
“The first scalable multi-threaded Pi-benchmark for multi-core systems”
● It is developed by Alexander J. Yee (@Mysticial)
● It has been used for 6 world Pi records (as of April 2019)
● It can be downloaded from its webpage ( http://www.numberworld.org/y-cruncher/ )
● It is closed source (a few parts of the code are available on GitHub under licenses)
● It supports both Windows and Linux
11
From numberworld.org
Software Features
● Able to compute Pi and other constants to trillions of digits
● Two algorithms are available for most constants: computation and verification
● Multi-Threaded - Multi-threading can be used to fully utilize modern multi-core
processors without significantly increasing memory usage
● Vectorized - Able to fully utilize the SIMD capabilities of most processors (SSE, AVX,
AVX-512, etc.)
● Swap Space - management for large computations that require more memory than
there is available
● Multi-Hard Drive - Multiple hard drives can be used for faster disk swapping
● Semi-Fault Tolerant - Able to detect and correct for minor errors that may be caused
by hardware instability or software bugs
12
From numberworld.org
Implementation (as of v0.7.7)
General Information:
● y-cruncher started off as a C99 program.
Now it is mostly C++11 with a tiny bit of
C++14
● Intel SSE and AVX compiler intrinsics are
heavily used
● Some inline assembly is used
● C++ template metaprogramming is used
extensively to reduce code duplication
13
Libraries and Dependencies:
● WinAPI (Windows Only)
● POSIX (Linux Only)
● Cilk Plus
● Thread Building Blocks (TBB)
y-cruncher has no other non-system
dependencies. No Boost. No GMP.
From numberworld.org
Formulas and Algorithms
y-cruncher provides two algorithms for each major constant: computation and verification
List of available constants (see more detail in numberworld.org/Formulas and Algorithms)
○ Square Root of n and Golden Ratio
○ e - Napier's constant
○ Pi - Archimedes’ constant
○ ArcCoth(n) - Inverse Hyperbolic Cotangent
○ Log(n)
○ Zeta(3) - Apery's Constant
○ Catalan's Constant
○ Lemniscate
○ Euler-Mascheroni Constant
14
Pi Computation - Chudnovsky formula[a]
with A = 13591409, B = 545140134, C = 640320
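The formula itself appeared as an image in the original slide; it is the standard Chudnovsky series, reproduced here:

```latex
\frac{1}{\pi} = 12 \sum_{n=0}^{\infty} \frac{(-1)^n \,(6n)!\,(A + Bn)}{(3n)!\,(n!)^3\, C^{3n + 3/2}}
```

Each term is smaller than the previous one by a factor of roughly (C/12)^3 ≈ 1.5 × 10^14, which is where the roughly 14 digits per term come from.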
Each additional term of the series yields about 14 more decimal digits of Pi.
“It was evaluated with the binary splitting algorithm. The asymptotic running time is O( M(n)
log(n)^2 ) for an n-limb result. It is worse than the asymptotic running time of the
Arithmetic-Geometric Mean algorithms, O(M(n) log(n)), but it has better locality and many
improvements can reduce its constant factor.” [b]
15
[a] D. V. Chudnovsky and G. V. Chudnovsky, “Approximations and complex multiplication according to
Ramanujan,” in Ramanujan Revisited, Academic Press, Boston, pp. 375-396 & 468-472, 1988.
[b] F. Bellard, “Computation of 2700 billion decimal digits of Pi using a Desktop Computer,”
technical report, bellard.org, Feb. 2010.
Binary Splitting Algorithm (1/N)
Let S be defined as
We define the auxiliary integers
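The equations were images in the original slide; the following is a reconstruction of one common formulation of binary splitting (as in Bellard's paper), for a series with integer sequences a, p, q (taking b(n) = 1 throughout, as holds for the Chudnovsky series):

```latex
S(n_1, n_2) = \sum_{n=n_1}^{n_2 - 1} a(n) \prod_{k=n_1}^{n} \frac{p(k)}{q(k)}, \\
P(n_1, n_2) = \prod_{k=n_1}^{n_2 - 1} p(k), \qquad
Q(n_1, n_2) = \prod_{k=n_1}^{n_2 - 1} q(k), \qquad
T(n_1, n_2) = Q(n_1, n_2)\, S(n_1, n_2).
```

P, Q and T are integers, so the whole sum can be carried out in exact integer arithmetic.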
16
Binary Splitting Algorithm (2/N)
P, Q and T can be evaluated recursively with the following relations, defined for
m such that n1 < m < n2:
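The recurrence relations (images in the original) can be reconstructed as follows; splitting at m and combining the two halves with big multiplications is what gives the O(M(n) log(n)^2) running time:

```latex
P(n_1, n_2) = P(n_1, m)\, P(m, n_2), \qquad
Q(n_1, n_2) = Q(n_1, m)\, Q(m, n_2), \\
T(n_1, n_2) = Q(m, n_2)\, T(n_1, m) + P(n_1, m)\, T(m, n_2).
```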
17
Binary Splitting Algorithm (3/N)
Algorithm 1 is deduced from these relations.
18
For the Chudnovsky series we can take:
We get then
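Reconstructing the missing equations: for the Chudnovsky series the sequences are (with the alternating sign folded into p), and after summing N terms Pi is recovered from Q and T:

```latex
a(n) = A + Bn, \qquad p(0) = q(0) = 1, \\
p(n) = -(6n-5)(2n-1)(6n-1), \qquad q(n) = \frac{n^3 C^3}{24} \quad (n \ge 1), \\
\pi \approx \frac{C^{3/2}\, Q(0, N)}{12\, T(0, N)}.
```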
Key points of Bellard’s paper
● Precomputed powers
Several constant factors can be precomputed separately, which reduces the calculations
at the cost of a reasonable amount of additional memory
● Multi-threading
Different parts of the binary splitting recursion can be executed in different threads
● Restartability
If operands are stored on disk, each step of the computation is implemented so that it is
restartable.
● Fast multiplication algorithm using DFT
19
Because the "y-cruncher" is closed-source and not published ...
Let's refer to the Bellard’s paper which implements the same formula.
Computational Details
of the record
20
System Overview (Table)
21
From Google Cloud Blog
Instance
We selected an n1-megamem-96 instance for the main computing node.
● It was the biggest virtual machine type available on Compute Engine that provided
Intel Skylake processors at the beginning of the project
● The Skylake generation of Intel processors supports AVX-512, 512-bit SIMD
extensions that can perform floating-point operations on 512 bits of data - eight
double-precision floating-point numbers - at once
22
From Google Cloud Blog
Storage
We selected n1-standard-16 for the iSCSI target machines to ensure sufficient bandwidth
between the computing node and the storage:
● network egress bandwidth and Persistent Disk throughput are determined by the
number of vCPU cores
● We used the iSCSI protocol to remotely attach Persistent Disks to add additional
capacity
● The number of nodes was decided based on y-cruncher's disk benchmark
performance
Currently, each Compute Engine virtual machine can mount up to 64 TB of Persistent Disks.
23
From Google Cloud Blog
Major Difficulties
24
Disk I/O Bottleneck
25
CPU utilizations of Pi records (latest 6):

Date            Digits          Who                              CPU Utilization
January 2019    31.4 trillion   Emma Haruka Iwao                 12%
November 2016   22.4 trillion   Peter Trueb                      22%
October 2014    13.3 trillion   Sandon Van Ness "houkouonchi"    36%
December 2013   12.1 trillion   Shigeru Kondo                    37%
October 2011    10 trillion     Shigeru Kondo                    ~77%
August 2010     5 trillion      Shigeru Kondo                    35.89%
From numberworld.org
Disk I/O Bottleneck
● The “memory wall” (bandwidth wall)
○ CPU < RAM/memory < DISK/storage
● Memory speeds are 1.5 - 3x slower
than is ideal for y-cruncher
● Storage speeds are 3 - 20x slower
than is ideal for y-cruncher
26
From Wikipedia.org, numberworld.org
Disk I/O Bottleneck
● In this latest Pi computation, the disk/storage bandwidth was about 2-3 GB/s, which
led to an average CPU utilization of 12.2% (≈ a 1/8 bottleneck)
● If we had infinite storage bandwidth:
○ The computation would have taken 2-3 weeks (≈ 121 days × 1/8)
● If we had infinite computational power:
○ The computation would still have taken around 4 months
27
From numberworld.org
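A quick sanity check of these figures; the arithmetic below is purely illustrative, with the numbers taken from the slides above:

```python
# Reported figures from the computation (see the slides above).
total_days = 121          # wall-clock duration
cpu_utilization = 0.122   # average CPU utilization

# With infinitely fast storage, the run would be roughly compute-bound:
compute_bound_days = total_days * cpu_utilization
print(f"~{compute_bound_days:.1f} days")  # about 15 days, i.e. 2-3 weeks

# 12.2% utilization is what the post calls "about a 1/8 bottleneck":
print(f"1/{1 / cpu_utilization:.1f}")     # roughly 1/8
```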
Network bandwidth (I/O) Bottleneck
● The storages are attached as NAS (Network Attached Storage)
○ “network storage bandwidth” is the limitation factor
● For more details:
○ “Write bandwidth” was artificially capped to about 1.8 GB/s by the platform
○ “Read bandwidth”, while not artificially capped, was still limited to about 3.1 GB/s by the
network hardware
● But, put it simply, 2-3 GB/s is not enough
○ Computation is effectively free: computational improvements by both of software and hardware
■ AVX512. Skylake architecture, etc.
○ GPUs aren't going to help with this kind of storage bottleneck (so there is no GPU versions)
28
From numberworld.org
Network bandwidth (I/O) Bottleneck
● We need upward of 20 GB/s of storage bandwidth for the case in high-end server
○ 20 GB/s is less than 2 x PCIe 3.0 x 16 slots, it’s technically possible
○ but it requires a level of hardware customization that we have yet to see
● if we had 20 GB/s of storage bandwidth, the computation would likely have taken
less than 1 month
● Thus in the current era, whoever has the biggest and fastest storage (without
sacrificing reliability) will win the race for the most digits of Pi
29
From numberworld.org
Machine Errors on Pi computation
“This computation is the 6th time that y-cruncher has been used to set the Pi record.
It is the 4th time that featured at least one hardware error, and the 2nd that had a
suspected silent hardware error.
Hardware errors are a thing - even on server grade hardware.”
30
From numberworld.org
Normal (non-silent) Hardware Errors
Normal (non-silent) hardware errors: not a problem
● The machine crashes, reboot it and resume the computation.
● Circuit breaker trips, turn it back on and resume the computation.
● Hard drive fails, restore from backup and resume the computation…
This is (mostly) a solved problem thanks to checkpoint-restart.
31
From numberworld.org
Silent Hardware Errors
Silent hardware errors: a frightening problem
● they are silent and do not cause a visible error
● they lead to data corruption which can propagate to the end of a long computation,
resulting in wrong digits
○ This is the worst-case scenario, because you end up wasting a months-long computation and
have no idea whether the error was a hardware fault or a software bug ...
32
From numberworld.org
Error Detection
● y-cruncher has many forms of built-in error detection that catch errors as soon as
possible, to minimize the amount of wasted resources as well as the
probability that a computation finishes with wrong results
● In previous records, error detection saved the 2nd and 4th hardware errors from a
bad ending
33
From numberworld.org
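y-cruncher's actual error-detection code is not public. As an illustration of the general idea, one common technique in this space is to verify huge multiplications redundantly in modular arithmetic: the check costs O(n) against the O(n log n) multiply. A hedged Python sketch, with names and structure of my own invention:

```python
# Illustrative sketch of modular redundancy checking for big-integer
# multiplication (NOT y-cruncher's actual implementation, which is closed
# source). A silent bit-flip in the product almost surely breaks the
# congruence below, so it is caught immediately instead of propagating.
import random

P = (1 << 61) - 1  # a Mersenne prime; cheap to reduce by

def checked_mul(a: int, b: int) -> int:
    product = a * b
    # O(n) redundant check of the O(n log n) multiplication.
    if product % P != (a % P) * (b % P) % P:
        raise RuntimeError("silent error detected in multiplication")
    return product

# A corrupted product fails the same congruence test:
a, b = random.getrandbits(4096), random.getrandbits(4096)
good = checked_mul(a, b)
corrupted = good ^ (1 << 2000)  # simulate a single silent bit-flip
assert corrupted % P != (a % P) * (b % P) % P
```

The ~90% coverage figure above reflects that not every code path can be wrapped in a check this cheap.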
Limitation of Error Detection
y-cruncher’s error detection only has about 90% coverage
● Empirical evidence: actual (unintended) hardware errors, and errors artificially induced
by overclocking
● Meaning that 1 in 10 silent hardware errors will go undetected and let the
computation finish with the wrong digits
● The two errors that have happened so far were both lucky to land in that 90%.
○ The 10% without coverage is the long tail of code where error detection is either very
difficult or would incur an unacceptably large performance overhead
34
From numberworld.org
Silent hardware errors are the most fearful
For example:
1. Someone invests a large amount of time and money into a large computation. The digits don't
pass verification.
2. The person contacts me asking for help. But I can't do anything. All that investment is lost.
3. Lots of distress on both sides. Maybe lots of finger-pointing.
For this reason, I typically discourage people from running computations that may take longer
than 6 months.
y-cruncher is currently 6/6 in world record Pi attempts that have run to completion.
But there is some amount of luck to this.
35
From numberworld.org
Minor Difficulties
36
Load Imbalance with Threading Building
Blocks (TBB)
y-cruncher lets the user choose a parallel computing framework (None, C++11 std::async(),
Thread Spawn, Windows Thread Pool, Push Pool, Cilk Plus, Threading Building Blocks)
For this computation, we decided to use Intel's Threading Building Blocks (TBB).
But it turned out that TBB suffers severe load-balancing issues under y-cruncher's workload.
By comparison, both Intel's own Cilk Plus and y-cruncher's Push Pool had no such problems.
The result was a loss of computational performance.
In the end, this didn't matter since the disk bottleneck easily absorbed any amount of
computational inefficiency.
37
From numberworld.org
Deployment Issues
There were numerous issues with deployment. Examples include:
● There were performance issues with live migration due to the memory intensiveness of
the computation (the 1.4 TB of memory would have been completely overwritten
roughly once every ~10 min. for much of the computation)
● There were timeout issues with accessing the external storage nodes
38
From numberworld.org
Summary
39
Summary
40
● The authors successfully computed Pi to 31.4 trillion decimal digits using "y-cruncher",
which implements the Chudnovsky formula
● Compute instances (1 fat compute node with 24 storage nodes) provided by Google
Cloud were used over the 121-day calculation
● Storage bandwidth is the most important factor: the limiting factor of computation
performance was the bandwidth of the network-attached storage, which is why the average
CPU utilization was only 12%
● Error detection and checkpoint/restart functions are crucial for today's
long-running Pi computations, but they are still not perfect (coverage is about 90%)
○ Some silent hardware errors may go undetected or uncorrected
My impression after reading
41
● This work just uses “y-cruncher” as conventional software with conventional
methods, so there is little novelty in terms of the HPC field
● On the other hand, in terms of the reliability of a long-running mathematical
calculation (or as an SRE challenge; Site Reliability Engineering), it’s a good tech report
● As an ad for Google Cloud and a Pi Day celebration: a great contribution
● It is a good thing to extend the digits of Pi (“the number of Pi digits is a measurement of civilization”)
Supplementals:
42
Let’s calc the Pi digits!
43
Docker image for testing “y-cruncher”
https://github.com/metaVariable/docker-y-cruncher
# clone
git clone https://github.com/metaVariable/docker-y-cruncher.git
# build
docker build . -t y-cruncher:v0.7.7.9500
# run
docker run -it y-cruncher:v0.7.7.9500 ./y-cruncher custom pi -dec:10000
Scalability
“As of v0.7.1, y-cruncher is coarse-grained parallelized. On shared and uniform memory, the
isoefficiency function is estimated to be Θ(p^2). This means that every time you double the #
of processors, the computation size would need to be 4x larger to achieve the same
parallel efficiency.”
“The Θ(p^2) heuristically comes from a non-recursive Bailey's 4-step FFT algorithm using a
sqrt(N) reduction factor. In both of the FFT stages, there are only sqrt(N) independent FFTs.
Therefore, the parallelism cannot exceed sqrt(N) for a computation of size N.”
44
Non-Uniform Memory
“As of 2018, y-cruncher is still a shared memory program and is not optimized for
non-uniform memory (NUMA) systems. So historically, y-cruncher's performance and
scalability has always been very poor on NUMA systems. While the scaling is still ok on
dual-socket systems, it all goes downhill once you put y-cruncher on anything that is
extremely heavily NUMA. (such as quad-socket Opteron systems)”
“While y-cruncher is not "NUMA optimized", it has been "NUMA aware" since v0.7.3 with
the addition of node-interleaving memory allocators.”
45