
The Rainforest Algorithm


This describes the rainforest crypto currency algorithm as a CPU-friendly and people-friendly alternative to existing algorithms that currently favor large corporations and mining facilities. Presented on 2018-04-11 in Linz, Austria.

Code uploaded on this site :

Published in: Technology

The Rainforest Algorithm

  1. 1. The Rainforest Algorithm Bill Schneider BC 2018 Linz
  2. 2. Who Am I ● Bill Schneider, 52, father of two ● previously systems architect for a large ASIC manufacturer ● specialized in hardware crypto ● now self-employed researcher in parallel computing
  3. 3. Why This Talk ● horrified by each visit to our customers' mining factories :
  4. 4. Why This Talk ● 90 % of some villages' electricity redirected to factories during daytime ● workers having to endure 45°C ambient temperature inside, 11 hours a day ● some died from electric shocks while trying to steal power directly from the dam ● outdated hardware dumped directly into nature ● I can't accept leaving such a planet to my kids
  5. 5. How Did We End Up There ● Bitcoin created in 2009 to give power back to the people ● slow adoption at first, then it started to take off and to generate large revenues for miners ● everyone wants to mine to get a chance to make money ● note: nobody "makes" money, everyone gets a smaller or larger share of the others’ money ● ASIC vendors entered the race and swept the market with increasingly powerful systems ● investors quickly realized they could make $10k/day for a $1M investment ⇒ very quick ROI ● individuals eventually stopped mining Bitcoin and started to look for currencies using other algorithms (not sha256)
  6. 6. Current State Of The Art ● thousands of new cryptocurrencies created, not that many still alive ● ~50 different crypto algorithms, some used a lot (sha256, scrypt, x11, blake, cryptonight) ● mining still dominated by factories (ASIC) and large, power-hungry GPUs, or sometimes FPGAs ● mining difficulty adapts to the fastest miners, reducing revenues for the small ones ➢ currencies created for the people but favoring large corporations instead!
  7. 7. Crypto Currencies Now Play Against Individuals ● mining costs people a lot : cf. the 2017 rise in graphics card prices ● mining becomes much less affordable for the poor, if at all ● mining becomes a power competition between big players employing tens of thousands of huge machines
  8. 8. And Now We’re In An Absurd Situation! ● mining is barely affordable anymore due to power consumption, which favors those who don't pay their bills or who steal power ● to avoid paying the power bill, miners now use your energy through malware installed on your devices (browsers, servers) ● power draw grows : ~59 TWh/yr circa Q1 2018, or ~5x US household energy consumption ● huge impact on earth : fossil energy, electronic waste
  9. 9. What Made This Possible At All ● some algos were designed to shine on ASICs (e.g. SHA256, back to this later) ● FPGAs are faster to adapt to more complex algorithms ● GPUs are massively more powerful than CPUs. Best explained here : ➢ ➢ 3200 inst/clock for a GPU vs 4 for a CPU, for only $350 ● GPUs are much more parallel but run at a lower frequency; even so, a 5-10x performance gain for the same price as a CPU is common ● it's a competition. If your CPU is too small to earn enough to pay the power bill, you double the price to get 10 times the performance.
  10. 10. Your Coins Are Cheaper Than Mine ● the lack of balance risks discouraging all individual miners and leaving only the biggest players, who will control the market ● how to rebalance the value and the mining effort ? ● some algorithms claim to be ASIC-proof (too hard to implement on ASIC), such as scrypt, x11, ethash ● GPUs shouldn't be given too much of an advantage ⇒ the performance reward should be relative to the effort and investment, not to wasted energy!
  11. 11. Could We Rebalance The Power ● making sure an ASIC/FPGA implementation is not affordable (which is different from not possible) ● making GPU implementations less of an advantage (e.g. 2x, not 10x) ● this will lower the entry barrier for CPUs ● more low-power CPUs will be usable ● more people will be able to make money from mining, thus diluting the revenue and making it even less affordable to throw in a lot of hardware ● less total energy will be spent for the same results
  12. 12. Important Benefits Of CPU Mining ● CPUs are everywhere (smartphones, set-top-boxes, TVs, PCs, NAS, servers) ● CPU consumption in small devices is not much different between idle and full power ⇒ CPU power comes for free on all such devices ● if 10 million miners mined only on their phones, total mining power usage would drop to 1%! ● so many miners would make it much more difficult for large companies to take all the revenue
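The 1% figure on this slide can be sanity-checked from numbers quoted elsewhere in the deck. The sketch below is only a back-of-the-envelope check: it assumes every one of the 10 million miners draws the 6 W listed for the smartphone in the benchmark slide, and it converts the ~59 TWh/yr figure from slide 8 into an average power. The function names are illustrative, not part of any published code.

```python
# Figures quoted in the slides (the 6 W per phone is applied to every miner,
# which is an assumption made for this check only).
ANNUAL_MINING_ENERGY_TWH = 59   # slide 8: ~59 TWh/yr circa Q1 2018
PHONE_POWER_W = 6               # benchmark slide: 6 W for the smartphone
MINERS = 10_000_000             # slide 12: 10 million miners

HOURS_PER_YEAR = 365 * 24       # 8760 hours


def average_power_watts(twh_per_year):
    """Convert a yearly energy figure (TWh) into an average power draw (W)."""
    return twh_per_year * 1e12 / HOURS_PER_YEAR


def phone_fleet_share(miners=MINERS, watts_per_phone=PHONE_POWER_W,
                      twh_per_year=ANNUAL_MINING_ENERGY_TWH):
    """Fraction of today's average mining power that a phone fleet would draw."""
    return miners * watts_per_phone / average_power_watts(twh_per_year)


if __name__ == "__main__":
    print(f"current average draw: {average_power_watts(59) / 1e9:.1f} GW")
    print(f"phone fleet share:    {phone_fleet_share():.2%}")
```

Under these assumptions the phone fleet draws about 60 MW against an average of several gigawatts today, which is in line with the order of magnitude claimed on the slide.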
  13. 13. What Makes An Algorithm Affordable on an ASIC ● many (most) of the algorithms currently in the wild were SHA3 candidates whose purpose is to be extremely fast on ASICs and very power efficient (think NFC credit cards, IoT, wearable devices) : ● look at the names of the algorithms studied there : SHA256, Fugue, Luffa, Blake, Groestl, Shabal, BMW, Hamsi, SHAvite, CubeHash, JH, SIMD, ECHO, Keccak, Skein ● sound familiar ? Of its 17 algorithms, x17 employs exactly 15 of these! ● The other two are Haval, which itself was also a SHA3 candidate not tested in this paper, and whirlpool, which was already implemented in ASICs 10 years ago :
  14. 14. What Makes An Algorithm Affordable on an ASIC ➔ the widely praised x17 (Verge,...) is easily implementable on ASICs and FPGAs: all its functions have been done already, they just have to be connected together (already done in Oct 2017 BTW). ➔ cryptonight (Monero,...) adds 4 of them at the end of its own hash (blake, groestl, jh, skein), so it's only protected by the initial work ➔ SHA256 (Bitcoin), SHA3 (Ethereum) already exist on ASICs and FPGAs
  15. 15. What Is Common To All These Algorithms ● one critical aspect of a SHA candidate is that it never reveals anything about the input text when looking at the output (think about passwords). ● it must always execute in constant time to prevent side-channel attacks ● it must not use memory or cache, to avoid leaving observable traces of intermediary states that could lead to guessing the input text ● these rules make them much faster on ASICs than in software ⇒ so, they are excellent, aren't they ? NO!
  16. 16. What Is Wrong With These Algorithms ● for crypto currencies, the hashed block is public, there's nothing to hide ● we want the output hash not to be easy to guess from an input ● we want them not to be that fast on hardware ● we want them to benefit as much as possible from software ● the trivial bit operations they involve are expensive in software and cost zero in hardware (fixed rotations, fixed bit permutations, XOR, …) ● they all offer easy restart points by using a very small state
  17. 17. Where Do The Costs Come From CPU: ● Vendor, model/design complexity (performance per cycle) ● Number of cores, cache size ● Frequency, memory bandwidth GPU: ● Number of cores ● Max frequency FPGA: ● Number of cells (limits the complexity one can implement) ● Lithography (max frequency, power draw) ASIC: ● Development time (limited reuse of full custom blocks, testing) ● Licensed blocks ● Lithography (max frequency, power draw)
  18. 18. Let’s Reverse The Engine Parallelism is : ● hard for CPUs (limited exec units & ports), often serialized ● expensive for FPGAs (requires large number of cells) ● trivial in ASICs Long chains are : ● trivial in CPUs (pipelined, very high frequencies) ● expensive for FPGAs (makes inefficient use of cells) ● a frequency blocker for ASICs (fmax=1/LongestChain)
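The relation fmax = 1/LongestChain can be made concrete with a small calculation. This is a minimal sketch, assuming a purely hypothetical delay of 0.2 ns per stage (the slides do not give a per-stage figure); the chain lengths of 5 and 320 delays are the ones quoted later for the ASIC 64-bit adder and 64-bit divider.

```python
# Hypothetical per-stage delay, chosen only for illustration.
ASSUMED_NS_PER_STAGE = 0.2


def fmax_mhz(chain_stages, ns_per_stage=ASSUMED_NS_PER_STAGE):
    """Maximum clock (MHz) when the longest chain must settle within one cycle."""
    longest_chain_ns = chain_stages * ns_per_stage
    return 1000.0 / longest_chain_ns


if __name__ == "__main__":
    for name, stages in (("64-bit add (5 delays)", 5),
                         ("64-bit divide (320 delays)", 320)):
        print(f"{name}: about {fmax_mhz(stages):.1f} MHz")
```

Whatever the real per-stage delay is, the ratio between the two results stays at 64: a chain 64 times longer caps the unpipelined clock 64 times lower.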
  19. 19. Let’s Reverse The Engine Fixed bit operations are : ● 0.5/1 cycle for a CPU ● cheap for FPGAs (1 LUT can fuse several of them) ● trivial for ASICs (shift is a wire, XOR is 4/6 transistors) Variable bit operations are : ● 0.5/1 cycle for a CPU ● slow for FPGAs (have to loop over a layer of several LUT stages) ● very expensive for ASICs (64-bit L/R rotate is 512 mux / 2048 NAND gates / 8192 transistors)
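The difference between a fixed and a variable rotation can be sketched in software. In the model below, each layer is a fixed rotation by a power of two (pure wiring in hardware) followed by a row of 64 two-input multiplexers selected by one bit of the rotation amount. The sketch uses the minimal log2(64) = 6 layers for a one-direction rotate; the slide's count of 512 mux corresponds to 8 layers of 64, so the design it has in mind evidently differs. Function names are illustrative.

```python
MASK64 = (1 << 64) - 1


def rotl64_const(x, k):
    """Rotate a 64-bit value left by a fixed amount (just wiring in hardware)."""
    k %= 64
    return ((x << k) | (x >> (64 - k))) & MASK64


def rotl64_barrel(x, amount):
    """Rotate left by a variable amount through successive mux layers.

    Returns the rotated value and the number of layers traversed.
    """
    layers = 0
    for stage in range(6):                         # 2**6 == 64 positions
        rotated = rotl64_const(x, 1 << stage)      # fixed wiring for this layer
        select = (amount >> stage) & 1             # one control bit per layer
        x = rotated if select else x               # models 64 two-input muxes
        layers += 1
    return x, layers


if __name__ == "__main__":
    value, depth = rotl64_barrel(0x0123456789ABCDEF, 12)
    print(hex(value), "after", depth, "mux layers")
```

The point of the model is that every layer sits on the critical path: the result of one layer feeds the next, which is what makes the variable rotate a long chain in hardware while remaining a single instruction on a CPU.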
  20. 20. Let’s Reverse The Engine Small storage is : ● trivial and very fast for CPUs (L1/L2 caches) ● expensive for FPGAs (very limited memory) ● complex and slow for ASICs (same as creating a cache) Variable-time operations are : ● natural for CPUs (pipeline + superscalar + out-of-order handle them gracefully) ● difficult for FPGAs (cells must be dedicated to creating a control unit) ● a bottleneck for ASICs (whole chain stalls)
  21. 21. Let’s Reverse The Engine Complex operations such as add64/div64 are : ● fast on CPUs (been optimized for decades, high frequency) ● very complex on FPGAs (a large FPGA may be fully dedicated to a single such operation) ● error-prone and very hard to get right on ASICs ⇒ current hash algos employ highly parallel fixed bit operations, so let's implement a highly serialized, iterative algorithm involving complex operations, long chains, variable bit-ops and memory accesses
  22. 22. ASIC Strengths And Weaknesses ● ASICs cost a lot to develop. ● made by assembling blocks made of gates for a single very specific purpose ● being specific to a task, they almost never benefit from the most advanced and very expensive lithography (32nm SOI at best, 65nm more affordable) ⇒ Often high power draw and low frequency
  23. 23. ASIC Strengths And Weaknesses ● but they excel at simple tasks. E.g. constants are wires. Constant bit shifting is called "wire routing". Constant XOR only requires two transistors per bit. ● it's trivial to parallelize such tasks by increasing the silicon area ● complex algorithms are very hard to implement ; no bug is allowed there or all products have to be replaced in the field (remember Pentium's FDIV bug?) ⇒ best use cases are for generic symmetric crypto algorithms (AES, SHA, ...)
  24. 24. FPGA Strengths And Weaknesses ● FPGAs are often mistakenly called ASICs because they are generally deployed for a specific purpose ● FPGAs are field programmable : gate matrices where connections are enabled or not by a bit in programmable memory. ● better suited to complex algorithms than ASICs since bugs can be fixed ● FPGAs do not scale well with frequency ; each connection adds propagation delays; limited fan-out sometimes requires cascades ● FPGAs however implement basic blocks (e.g. adder, multiplier, PCIe controller) ● since more generic, they can often use finer lithography than ASICs (typically 28nm) ● number of gates is limited, typically 100k for Spartan7, 300k for CycloneV. ● modern FPGAs often implement 1 or 2 low-power CPU cores
  25. 25. GPU Strengths And Weaknesses ● GPUs were initially designed for highly parallel graphics processing ● highly capable general purpose processors, but not "repurposable", for example, bad at bit-level operations if not implemented ● 32-bit only (large enough for anything) ● number of GPU cores typically varies between 32 and 5120 ● since general purpose, etched with best lithography available (12nm for Tesla V100) ● moderate frequencies (500-1500 MHz) ● much higher memory bandwidth than anything else (~500-900 GB/s), but high latency
  26. 26. GPU Strengths And Weaknesses ● mostly designed for image processing ; they excel at 32-bit integer and single/double precision floating point processing ● like DSPs, almost always implement MAC operations with hard-wired single-cycle multipliers ● divide not needed and not implemented, or done via floats ● small L1 cache (16-32kB) shared between 8-128 cores supposed to work on the same data set (e.g. a portion of an image) ● look-up tables can be extremely fast if they fit in this cache ● extreme power draw (100-300W)
  27. 27. CPU Strengths And Weaknesses ● CPUs were initially designed for the highest single-threaded performance on any application ● IPC matters (instructions per clock) ● deep pipelines ensure that even long operations are delivered fast (every cycle for most), even under extreme dependencies ● multiple execution queues and ports to process several independent or lightly dependent instructions in parallel (typ. 2-4 depending on operations) ● the 40-year old RISC vs CISC war ended in the middle with the best of both worlds ● vendors have built deep expertise in highly complex operations, which can run even faster than on an ASIC (e.g. 200ps 64-bit adder)
  28. 28. CPU Strengths And Weaknesses ● can further optimize processing on the fly (e.g. register renaming, instruction fusing) ● devote a huge part of the silicon to processing very complex operations very quickly ● all 64-bit now ● high frequencies are common (2-4.5 GHz) ● limited number of CPU cores ● moderate to high power consumption ● very slow at processing bit-level operations ⇒ only hope in regular crypto is to use dedicated instructions
  29. 29. Feature Comparison : ASIC/FPGA/GPU/CPU Next slides compare high-end ASICs, FPGAs, GPUs and CPUs. The ASIC is assumed to be etched at 32nm. The FPGA is assumed to be a modern model such as Virtex-6 or newer. The GPU is assumed to be clocked at 1 GHz. Two CPUs are considered here: CPU1 is a typical high-end PC x86 CPU, out-of-order, multi-issue, clocked at 4 GHz. CPU2 is a typical mid-end smartphone CPU, in-order, dual-issue, running at 2 GHz. Comparisons are made per code execution unit (i.e. core).
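Several of the following tables give costs in both cycles and time. The conversion follows directly from the clock assumptions stated on this slide (GPU at 1 GHz, CPU1 at 4 GHz, CPU2 at 2 GHz). The helper below is a small sketch of that conversion; the dictionary keys simply reuse the column names of the tables.

```python
# Clock assumptions stated on slide 29, in MHz.
CLOCK_MHZ = {"GPU": 1000, "CPU1": 4000, "CPU2": 2000}


def cycles_to_ns(cycles, device):
    """Convert a cycle count into nanoseconds for one of the compared devices."""
    return cycles * 1000 / CLOCK_MHZ[device]


if __name__ == "__main__":
    # Example: the "2 parallel 64-bit additions" row quoted on slide 38.
    gpu_ns = cycles_to_ns(4, "GPU")
    cpu2_ns = cycles_to_ns(1, "CPU2")
    print(f"GPU: {gpu_ns} ns, CPU2: {cpu2_ns} ns, ratio: {gpu_ns / cpu2_ns:.0f}x")
```

With these clocks, one cycle is 1 ns on the GPU, 250 ps on CPU1 and 500 ps on CPU2, which is how the slide-38 claim of the smartphone CPU being 8 times faster than the GPU at parallel additions is obtained.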
  30. 30. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 32-bit constant bit shift :
          device      ASIC     FPGA       GPU      CPU1      CPU2
          complexity  trivial  simple     trivial  trivial   trivial
          cost        free     32 cells   1 cycle  1 cycle   1 cycle
      ● 4 parallel 32-bit constant bit shifts :
          device      ASIC     FPGA       GPU      CPU1      CPU2
          complexity  trivial  simple     trivial  trivial   trivial
          cost        free     128 cells  1 cycle  2 cycles  2 cycles
  31. 31. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 4 parallel 32-bit constant XOR :
          device      ASIC      FPGA       GPU      CPU1      CPU2
          complexity  trivial   trivial    trivial  trivial   trivial
          cost        free      free       1 cycle  2 cycles  2 cycles
      ● 32-bit 32-constant lookup :
          device      ASIC      FPGA       GPU      CPU1      CPU2
          complexity  simple    simple     trivial  trivial   trivial
          cost        4k gates  256 cells  1 cycle  1 cycle   1 cycle
  32. 32. Feature Comparison : ASIC/FPGA/GPU/CPU ⇒ This explains why most crypto algorithms favor ASICs and FPGAs first, then GPUs second. All of them are made of these cheap bit operations exclusively. ⇒ ASIC-proof algorithms simply use lots of RAM to cancel the ASIC advantage
  33. 33. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 64-bit constant bit shift :
          device      ASIC     FPGA       GPU        CPU1      CPU2
          complexity  trivial  simple     simple     trivial   trivial
          cost        free     64 cells   3 cycles   1 cycle   1 cycle
      ● 4 parallel 64-bit constant bit shifts :
          device      ASIC     FPGA       GPU        CPU1      CPU2
          complexity  trivial  simple     trivial    trivial   trivial
          cost        free     256 cells  12 cycles  2 cycles  2 cycles
  34. 34. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 4 parallel 64-bit constant XOR :
          device      ASIC       FPGA        GPU       CPU1      CPU2
          complexity  trivial    trivial     trivial   trivial   trivial
          cost        free       free        2 cycles  2 cycles  2 cycles
      ● 64-bit 64-constant lookup :
          device      ASIC       FPGA        GPU       CPU1      CPU2
          complexity  simple     simple      trivial   trivial   trivial
          cost        16k gates  1024 cells  1 cycle   1 cycle   1 cycle
  35. 35. Feature Comparison : ASIC/FPGA/GPU/CPU ⇒ for 64-bit, CPUs start to recover their advantage over GPUs ⇒ Note: GPUs will not scale if all the constants don’t fit in the cache ⇒ ASIC/FPGA not affected except by the cost of lookup tables
  36. 36. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 64-bit barrel shifter :
          device      ASIC                 FPGA              GPU               CPU1              CPU2
          complexity  medium               simple            trivial           trivial           trivial
          cost        2k gates, 8 delays   200 cells, 8 ns   2 cycles, 4 ns    2 cycles, 250 ps  2 cycles, 500 ps
      ● 4 parallel 64-bit barrel shifters :
          device      ASIC                 FPGA              GPU               CPU1              CPU2
          complexity  simple               simple            trivial           trivial           trivial
          cost        16k gates, 8 delays  800 cells, 8 ns   16 cycles, 16 ns  2 cycles, 500 ps  4 cycles, 2 ns
      ● 64-bit variable shift involves 8 layers of 64 MUX or 2048 gates
      ● 64-bit variable shifts on 32 bit is easy but takes a few operations
  37. 37. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 64-bit addition :
          device      ASIC                  FPGA              GPU             CPU1             CPU2
          complexity  complex               simple            simple          trivial          trivial
          cost        2850 gates, 5 delays  64 cells, 2.5 ns  2 cycles, 2 ns  1 cycle, 250 ps  1 cycle, 500 ps
      ● ASIC: using
      ● FPGA: using 6-LUT cells
  38. 38. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 2 parallel 64-bit additions :
          device      ASIC                  FPGA               GPU             CPU1             CPU2
          complexity  complex               simple             simple          trivial          trivial
          cost        5700 gates, 5 delays  128 cells, 2.5 ns  4 cycles, 4 ns  1 cycle, 250 ps  1 cycle, 500 ps
      ⇒ The ASIC will take more than 1 ns to perform such an addition, and will draw lots of power. The CPU is the clear winner here. Even the low-end smartphone CPU can be 8 times faster than the GPU and 5 times faster than an FPGA at this.
  39. 39. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● 64-bit divide + modulus :
          device      ASIC                   FPGA               GPU                 CPU1                  CPU2
          complexity  very high              high               high                trivial               trivial
          cost        10k+ gates, 320 delays 128 cells, 160 ns  370 cycles, 370 ns  6-8 cycles, 1.5-2 ns  10-12 cycles, 5-6 ns
      ⇒ Divide is one of the most complex operations. It can be done using successive compare+subtract+shift (one per bit), or using dichotomic multiplies (one per bit). It absolutely requires native word size or it is severely impacted. Implementing it on an ASIC is hopeless without using a good library, and it will be iterative and slow. GPUs may exploit floats to approximate the result and save cycles.
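The "successive compare+subtract+shift (one per bit)" method mentioned on this slide is the classic restoring division. The sketch below is a plain software model of it for unsigned 64-bit operands, not code from the Rainforest implementation; it returns the iteration count to show that the loop always runs once per bit.

```python
def divmod64_shift_subtract(dividend, divisor):
    """Unsigned 64-bit divide + modulus, one compare/subtract/shift per bit.

    Returns (quotient, remainder, steps). Operands are expected to be in
    the range 0 .. 2**64 - 1.
    """
    if divisor == 0:
        raise ZeroDivisionError("division by zero")
    quotient = 0
    remainder = 0
    steps = 0
    for bit in range(63, -1, -1):
        # Shift the next dividend bit into the running remainder.
        remainder = (remainder << 1) | ((dividend >> bit) & 1)
        quotient <<= 1
        # Compare, then subtract when the divisor fits.
        if remainder >= divisor:
            remainder -= divisor
            quotient |= 1
        steps += 1
    return quotient, remainder, steps


if __name__ == "__main__":
    print(divmod64_shift_subtract(100, 7))
```

Each of the 64 iterations depends on the remainder produced by the previous one, so the work cannot be spread across parallel units; this serial dependency is what the slide relies on to penalize hardware implementations.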
  40. 40. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● CRC32 on 32-bit input :
          device      ASIC                   FPGA            GPU             CPU1            CPU2
          complexity  very high              high            low             low             trivial
          cost        100k+ gates, 10 delays 16 cells, 8 ns  4 cycles, 4 ns  4 cycles, 2 ns  1 cycle, 500 ps
      ⇒ CRC32 is an interesting operation : it's designed to operate on bits in network protocols at wire speed using very simple hardware. Its bit-serial operations do not suit parallel processing, but it can be implemented using a 32kb lookup table. It is used in the Zip compression algorithm, so modern smartphone CPUs implement it. For GPUs, it costs another 4kB in L1 cache.
      ⇒ ASIC: 16577 µm² (128x128µm) in 65nm, 974 MHz max
      ⇒ FPGA: 12 LUT, 96kb memory, 8 bits at a time
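The lookup-table approach to CRC32 mentioned above can be sketched as follows. This is the textbook byte-at-a-time variant using the reflected polynomial employed by Zip and zlib, not the code used in Rainforest. It needs one table of 256 32-bit entries (1 kB); the 4 kB (32 kb) figure on the slide would correspond to four such tables, as used by variants that consume 32 bits per step, but that reading is an assumption.

```python
import zlib

POLY = 0xEDB88320  # reflected CRC-32 polynomial used by Zip/zlib


def make_table():
    """Precompute the CRC of every possible byte value (256 entries)."""
    table = []
    for byte in range(256):
        crc = byte
        for _ in range(8):
            crc = (crc >> 1) ^ POLY if crc & 1 else crc >> 1
        table.append(crc)
    return table


TABLE = make_table()


def crc32_table(data, crc=0):
    """Table-driven CRC-32: one lookup, one shift and two XORs per input byte."""
    crc ^= 0xFFFFFFFF
    for b in data:
        crc = TABLE[(crc ^ b) & 0xFF] ^ (crc >> 8)
    return crc ^ 0xFFFFFFFF


if __name__ == "__main__":
    sample = b"123456789"
    print(hex(crc32_table(sample)), hex(zlib.crc32(sample)))
```

The table replaces eight bit-serial steps with a single memory access, which is cheap when the table sits in a CPU's L1 cache and costly wherever fast small memories are scarce.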
  41. 41. Feature Comparison : ASIC/FPGA/GPU/CPU
      ● AES128 on 16 bytes input :
          device      ASIC    FPGA              GPU                CPU1               CPU2
          complexity  high    high              medium             low                trivial
          cost        varies  3k cells, 480 ns  1000 cycles, 1 µs  360 cycles, 90 ns  700 cycles, 350 ns
      ⇒ AES provides strong cryptography. In counter mode it can benefit from parallelism, which is not the case for small messages (16 bytes). It only involves bit rotations, substitutions and XOR. It performs moderately well in software but extremely well on hardware. It is natively implemented on all modern processors. An ASIC-based implementation is a matter of trade-off between speed and area. An implementer would have to pick a good library or read various good papers on the subject. FPGA implementations are small but moderately fast and can leave enough room for higher parallelism.
  42. 42. Feature Comparison : ASIC/FPGA/GPU/CPU ⇒ Conclusion ● strong cryptography is mandatory for security, AES is everywhere, use it! ● offset the hardware advantage of AES using complex operations ● no need to use full AES : SHA-Vite and Echo use only two rounds and are secure ● maximize the efficiency of used silicon times frequency of operation ● make use of most of the available silicon in CPUs/GPUs. Operations like DIV, ROT, ADD are too painful to implement using ASICs, and will often exceed FPGAs' capacities or make them operate at low frequencies. ● use 64-bit computing exclusively to favor modern CPUs and force GPUs to take a reasonable hit, but not too much so as to protect small miners' investment ● maximize the use of operations which perform better on mobile CPUs to favor entry-level devices a bit more ● ensure that mining performance scales better with hardware and energy costs
  43. 43. Design Considerations Specific To Mining Activities ● Miners hash small blocks (80 bytes), each changing only by a 32-bit nonce which is at the end of the block. Thus what they typically do is take a snapshot of the intermediary state, copy it, and only hash the last 4 bytes. ● First idea : add lots of rounds after the last byte ⇒ very bad idea : many rounds == easily pipelinable in hardware using many ASICs ● Second idea : ⇒ make the intermediary state huge ⇒ ensure it's not faster to copy it than to initialize it ⇒ favors high bandwidth L1 caches
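The snapshot trick described on this slide can be illustrated with any incremental hash. The sketch below uses a single SHA-256 from Python's `hashlib` purely as a stand-in (it is not Rainforest, and real Bitcoin mining applies SHA-256 twice); the 76-byte fixed part, the little-endian nonce encoding and the function names are assumptions made for the example.

```python
import hashlib
import struct


def full_hash(header76, nonce):
    """Hash the whole 80-byte block from scratch (76 fixed bytes + 4-byte nonce)."""
    return hashlib.sha256(header76 + struct.pack("<I", nonce)).digest()


def scan_nonces(header76, nonces):
    """Absorb the fixed 76 bytes once, then hash only the nonce for each try."""
    midstate = hashlib.sha256(header76)      # intermediary state after the fixed part
    results = {}
    for nonce in nonces:
        h = midstate.copy()                  # cheap snapshot of the small state
        h.update(struct.pack("<I", nonce))   # only the last 4 bytes are hashed
        results[nonce] = h.digest()
    return results


if __name__ == "__main__":
    header = bytes(range(76))
    digests = scan_nonces(header, range(3))
    print(all(digests[n] == full_hash(header, n) for n in range(3)))
```

With a small internal state, `copy()` costs almost nothing compared with re-hashing the fixed part. The second idea on the slide targets exactly this: if the state is large enough that copying it is no faster than recomputing it, the snapshot brings little benefit.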
  44. 44. Proposal ● The Rainforest algorithm performs rounds, each involving : ● reading no more than 32 bits of input text at once ● unaligned 64-bit table look-ups ● 64-bit divides ● 64-bit additions ● 64-bit variable rotations ● fast cache substitutions (16 kB) ● 2 rounds of AES ● CRC32 mangling applied at multiple layers (fast on small CPUs) ● 17 kB internal RAM state, 256-bit hash state and output
  45. 45. Proposal Result: (4N+3)/4+4 rounds like this one :
  46. 46. Benchmarks For Full Hashes And Nonce Scan (mining)
      Device               Core i7 6700K  Radeon RX560  Geekbox RK3368  Xiaomi Redmi Note 4 (SD625)
      Cores (threads)      4(8)           1024          8*Cortex A53    8*Cortex A53
      Freq MHz             4000           1300          1416            2000
      Power                91W+PC         150W+PC       6W              6W
      Price                $300+PC        $180+PC       $58             $150
      kH/s (full)          390            1100          534             ~754 (estimated)
      kH/s (scan)          1642           1650          1582            ~2234 (estimated)
      Relative Perf to PC  1.0            1.0           0.96            1.36
  47. 47. Benchmarks For Full Hashes And Nonce Scan (mining) ● Tests on the Core i7 were run on a desktop PC using cpuminer under Ubuntu 16.04 ● Tests on the Cortex A53 were run on a Geekbox device under Ubuntu 16.04 with cpuminer-multi ● Tests on the Radeon were run on the Core i7 PC using sgminer and OpenCL. The OpenCL version was slightly optimized where relevant for this GPU (unaligned accesses, builtin rotates, memcpy) ● Performance on the smartphone was extrapolated from the Geekbox performance based on the CPU frequency increase ● Code successfully passes all SMHasher tests (takes one week!)
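The smartphone estimates and the "Relative Perf to PC" row of the benchmark table can be reproduced from the measured figures. The sketch below assumes, as the slide states, that performance scales linearly with CPU frequency between the two Cortex-A53 devices, and that the relative performance is computed on the nonce-scan rate against the Core i7; variable names are illustrative.

```python
# Measured figures from the benchmark table (slide 46).
GEEKBOX_MHZ = 1416
PHONE_MHZ = 2000
GEEKBOX_KHS = {"full": 534, "scan": 1582}
PC_SCAN_KHS = 1642
GPU_SCAN_KHS = 1650


def extrapolate(khs, from_mhz=GEEKBOX_MHZ, to_mhz=PHONE_MHZ):
    """Scale a hash rate linearly with clock frequency (same core type)."""
    return khs * to_mhz / from_mhz


def relative_to_pc(scan_khs):
    """Nonce-scan rate relative to the Core i7 reference."""
    return scan_khs / PC_SCAN_KHS


PHONE_KHS = {mode: extrapolate(rate) for mode, rate in GEEKBOX_KHS.items()}

if __name__ == "__main__":
    print({mode: round(rate) for mode, rate in PHONE_KHS.items()})
    for name, rate in (("GPU", GPU_SCAN_KHS),
                       ("Geekbox", GEEKBOX_KHS["scan"]),
                       ("Phone", PHONE_KHS["scan"])):
        print(name, round(relative_to_pc(rate), 2))
```

The results match the ~754 and ~2234 kH/s estimates and the 1.0 / 0.96 / 1.36 ratios in the table, which confirms how those rows were derived; the linear scaling itself remains an extrapolation, not a measurement.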
  48. 48. Performance Reports ● A cheap device like the Geekbox delivers almost the mining performance of a 4 GHz quad-core PC, costs around 7 times less and consumes around 30 times less power ● Many inexpensive set-top-boxes are built around comparable processors and are turned on all day long ● Some entry-level to mid-range smartphones like the Xiaomi Note4 provide 8 cores etched on a much finer process, delivering 40% more performance for the same power usage at a very affordable price, making mining on phones very attractive ● The GPU still performs very well, basically as fast as the high-end CPU or the set-top-box CPU, which is far below the ~28 times speedup it gets on algorithms like x11 or x15.
  49. 49. Mid-Term Hopes ● Encourage people to mine on their smartphones when plugged in for charging. It costs nothing and will bring them some money without having to buy an expensive PC for this (typically ~$500), nor pay for the associated energy (~$3-$4/day) ● encourage people to mine on their set-top-boxes all day ● see the mining farms progressively switch to lower power devices ● see the global power consumption dedicated to mining go down quite a bit ● see more people entering this activity to make a bit of money with no investment, participating in the new global economy, thus ensuring the long term reliability of the chain and that no one owns it
  50. 50. Next Steps ● Publish the code and patches for cpuminer, sgminer, yiimp ● Contribute the algorithm to existing crypto-currencies when they’re flexible enough to adopt new algorithms ● Port the code to other languages for pools ? ● Port the code to Android wallets for smartphones and STBs ● Possibly create a new currency using it (rfcoin? why not ?)
  51. 51. Questions ? Thank You!