Extending Flash Memory
Through Adaptive Parameter
Tuning
Conor Ryan
CTO – Software
Conor.Ryan@NVMdurance.com
2
Take Home Messages
 25x increase in endurance on 1X nm devices
 Machine learning software fully characterizes NAND chips before
they go into a product
 Lightweight software running on the product autonomically
manages its health, using machine learning knowledge
 40-bit ECC gives 10x increase; 120-bit ECC gives 25x increase
 Uses the least stress possible
 Raw flash runs faster
3
At a glance
 NVMdurance Pathfinder
 Delivers most of the endurance gain
 Suite of Machine Learning Algorithms
 Determines sets of optimal register values for NAND chips before they go
into a product
 “Heavy lifting” done before shipping
 NVMdurance Navigator
 Exploits endurance gains enabled by Pathfinder
 Autonomic system, runs on the controller
 Chooses register values at run-time from those discovered by Pathfinder
4
The Problem
Stress
Pass after 3K cycles
Pass after 1K cycles
Stress applied at
Time X
Number of P/E Cycles
It’s easier to pass retention tests
with less cycling done…
… but flash chips use the same
write register values throughout
their lifetime!
5
Pass after 5K cycles
Pass after 20K cycles
Solution
Stress
Pass after 30K cycles
Pass after 10K cycles
Vary stress over time
Number of P/E Cycles
6
Pass after 5K cycles
Pass after 20K cycles
Solution
Force
Pass after 30K cycles
Pass after 10K cycles
Vary force over time
Which registers to change?
What values to change to?
When to make the changes?
Each register adds two dimensions of search
Number of P/E Cycles
7
Pass after 5K cycles
Pass after 20K cycles
Solution
Stress
Pass after 30K cycles
Pass after 10K cycles
Vary stress over time
Number of P/E Cycles
8
Pass after 10K cycles
Pass after 30K cycles
Solution with variable ECC
Stress
Pass after 60K cycles
Pass after 20K cycles
Vary stress over time, spend more time at each level
Number of P/E Cycles
9
Pass after 10K cycles
Pass after 30K cycles
Solution with variable ECC
Stress
Pass after 60K cycles
Pass after 20K cycles
Vary stress over time
Which registers to change?
What values to change to?
When to make the changes?
How much ECC to use?
Number of P/E Cycles
10
The Not-So-Secret Sauce
 Reduce dimensionality of the problem
 Reduce the number of independent variables
 Understand the silicon; vary as few registers as possible
 Guide the search
 Be sensible, if not insightful
 Test only what has to be tested
 Or at least know what NOT to test
 …this is still an astronomically difficult problem!
11
The Secret Sauce
 Plot “safe” paths through massively highly dimensional space
 Each register adds two dimensions (more than 50 write registers!)
 Tune paths on the fly
 Not all blocks degrade at the same rate
 Retention may impact various blocks differently
 Anticipate health issues before they happen
 Treat the flash as though it is a dynamic, living thing
12
NVMdurance Pathfinder
Start Destination
Imagine the lifetime of a device to be a
journey through space…
… touch an “asteroid” and we have
unrecoverable data
13
NVMdurance Pathfinder
Start Destination
14
NVMdurance Pathfinder
Start Destination
15
NVMdurance Pathfinder
Start Destination
16
NVMdurance Pathfinder
Start Destination
… and
beyond!
17
NVMdurance Pathfinder
Start Destination
18
NVMdurance Pathfinder
Start Destination
19
NVMdurance Pathfinder
Start Destination
20
NVMdurance Pathfinder
Start Destination
21
Start Destination
Asteroid Belt
NVMdurance Pathfinder determines
sets of optimal register values
(analogous to safe paths through the
asteroid belt) for NAND chips before
they go into a product
NVMdurance Pathfinder
22
Start Destination
Asteroid Belt
Live on device, NVMdurance
Navigator chooses which path
through the space to use, based on
the “health” of the device.
NVMdurance Navigator
23
Start Destination
NVMdurance Navigator
Asteroid Belt
Live on device, NVMdurance
Navigator chooses which path
through the space to use, based on
the “health” of the device.
24
Start Destination
Asteroid Belt
Hit the hyperspace button!
Change the register values
NVMdurance Navigator
25
Start Destination
Asteroid Belt
NVMdurance Navigator constantly
monitors the health, and jumps
and jumps from path to path as
the use-case requires.
NVMdurance Navigator
26
Start Destination
NVMdurance Navigator
Asteroid Belt
NVMdurance Navigator constantly
monitors the health, and jumps
and jumps from path to path as
the use-case requires.
27
Start Destination
NVMdurance Navigator
Asteroid Belt
NVMdurance Navigator constantly
monitors the health, and jumps
and jumps from path to path as
the use-case requires.
28
Start Destination
NVMdurance Navigator
Asteroid Belt
NVMdurance Navigator constantly
monitors the health, and jumps
and jumps from path to path as
the use-case requires.
29
NVMdurance system
Machine Learning Parameter Discovery
NVMdurance Pathfinder: Discover routes
through multidimensional space such that
every parameter set passes retention for that
point of life (by fully characterizing NAND
chips before they go into a product)
Autonomic (runs live on the SSD controller)
NVMdurance Navigator: Observes
deterioration of the SSD; chooses when to
change parameters (using the knowledge
delivered by Pathfinder)
30
Stage
Early Life Mid Life Late Life
 Sets of Write register values for that time in life
 30-60 write registers, e.g. start voltage, step size, MSB, LSB, odd, even, etc.
 More registers means more control, but larger search space
31
Stage
Mid Life Late Life
 Sets of Write register values for that time in life
 30-60 write registers, e.g. start voltage, step size, MSB, LSB, odd, even, etc.
Early Life
32
Stage
 Each stage has multiple waypoints (each a set of read register values)
to guide it to the next stage
 Often only one set in early life – no read retry!
 First stage often lasts longer than default!
 Never more than eight waypoints
Write Register Values
Read Register Values
33
Waypoints
Mid Life Late Life
 More waypoints required later in life
 Higher wear uncovers higher variability
Early Life
34
Lifetime
 NVMdurance Pathfinder (machine learning offline)
 Automatically discovers and proves viability of stages
 Each stage passes retention test
0 5 10
P/E Cycle Count as Multiple of Default Rating
35
Lifetime
 NVMdurance Navigator (run time)
 Runs on controller (autonomic)
 Monitors “Health” of devices
 Determines when to progress to next stage
 Chooses which waypoint to use during each stage
0 5 10
P/E Cycle Count as Multiple of Default Rating
36
Increasing ECC
 System can tolerate higher BER
 Stages last longer
 Weaker stages possible earlier in life
 NVMdurance enables a truly variable/adaptive ECC
0 15 25
P/E Cycle Count as Multiple of Default Rating
37
One approach – many use cases
P/E Cycle Count as Multiple of Default Rating (120-bit ECC)
0 5 10
P/E Cycle Count as Multiple of Default Rating (40-bit ECC)
38
Experiments
Device 1X nm
ECC Up to 40 bits per sector
Retention 12 months
Baseline Endurance 1
Intrinsic Endurance (3 months retention) 1.8x longer than rated endurance
Target Device
39
Results
Number of stages 8
Stage length 1x – 2x longer than rated endurance
Retention 3 months
ECC Up to 40 bits per sector
Minimum Window Stress (stage 1) 45%
Approximately equal stress Stage 5
Maximum Window Stress (stage 8) 120%
Maximum P/E cycles 10x longer than rated endurance
Summary
40
Results
41
Results
Default
lifetime,
normalized
to one
42
Results
Initial write
stress starts
at 45% of
default
43
Results
Write stress
slowly
increases
44
Results
Write stress
exceeds
default level
8x
45
Results
10x increase
in endurance
46
Results
P/E cycle time
as % of default;
never exceeds
default.
47
Results – Increasing ECC
Stages get
longer as ECC
increases
48
Results – Increasing ECC
Stages get
longer as ECC
increases
49
Results – Increasing ECC
Stages get
longer as ECC
increases
50
Results – Increasing ECC
Stages get
longer as ECC
increases
51
Results – Increasing ECC
Stages get
longer as ECC
increases
52
Results – Increasing ECC
25x
improvement
in endurance
with the same
retention!
53
Results
Initial write
stress starts
at 25% of
default
54
Results
Write stress
increases
more slowly
55
Results
Write stress
exceeds
default level
21x
56
Results
P/E cycle time
as % of default;
never exceeds
default.
57
Results
Number of stages 8
Stage length 1.2x – 7x longer than rated endurance
Retention 3 months
ECC Up to 120 bits per sector
Minimum Window Stress (stage 1) 25%
Approximately equal stress Stage 5
Maximum Window Stress (stage 8) 120%
Maximum P/E Cycles 25x longer than rated endurance
Final Results
58
Results Summary
 25-fold increase in endurance on 1X nm
 10-fold increase on 1X with 40-bit ECC
 We avoid the problem of live optimization of parameters
 Most work done before flash is put in product
 NVMdurance Navigator can predict imminent failure
 Our “health” measure is very precise
 P/E operations run faster than defaults
 Results proven with current generation devices from multiple
manufacturers
59
Stop by and see us at booth #921
Conor.Ryan@NVMdurance.com
Conclusion
 Industry leading endurance gains
 NVMdurance's technology is synergistic to existing flash controller
technology
 Machine-learning software fully characterizes NAND chips before they go
into a product
 Lightweight software running on the product autonomically manages its
health, using that knowledge

FMS presentation 2014

  • 1.
    Extending Flash Memory ThroughAdaptive Parameter Tuning Conor Ryan CTO – Software Conor.Ryan@NVMdurance.com
  • 2.
    2 Take Home Messages 25x increase in endurance on 1X nm devices  Machine learning software fully characterizes NAND chips before they go into a product  Lightweight software running on the product autonomically manages its health, using machine learning knowledge  40-bit ECC gives 10x increase; 120-bit ECC gives 25x increase  Uses the least stress possible  Raw flash runs faster
  • 3.
    3 At a glance NVMdurance Pathfinder  Delivers most of the endurance gain  Suite of Machine Learning Algorithms  Determines sets of optimal register values for NAND chips before they go into a product  “Heavy lifting” done before shipping  NVMdurance Navigator  Exploits endurance gains enabled by Pathfinder  Autonomic system, runs on the controller  Chooses register values at run-time from those discovered by Pathfinder
  • 4.
    4 The Problem Stress Pass after3K cycles Pass after 1K cycles Stress applied at Time X Number of P/E Cycles It’s easier to pass retention tests with less cycling done… … but flash chips use the same write register values throughout their lifetime!
  • 5.
    5 Pass after 5Kcycles Pass after 20K cycles Solution Stress Pass after 30K cycles Pass after 10K cycles Vary stress over time Number of P/E Cycles
  • 6.
    6 Pass after 5Kcycles Pass after 20K cycles Solution Force Pass after 30K cycles Pass after 10K cycles Vary force over time Which registers to change? What values to change to? When to make the changes? Each register adds two dimensions of search Number of P/E Cycles
  • 7.
    7 Pass after 5Kcycles Pass after 20K cycles Solution Stress Pass after 30K cycles Pass after 10K cycles Vary stress over time Number of P/E Cycles
  • 8.
    8 Pass after 10Kcycles Pass after 30K cycles Solution with variable ECC Stress Pass after 60K cycles Pass after 20K cycles Vary stress over time, spend more time at each level Number of P/E Cycles
  • 9.
    9 Pass after 10Kcycles Pass after 30K cycles Solution with variable ECC Stress Pass after 60K cycles Pass after 20K cycles Vary stress over time Which registers to change? What values to change to? When to make the changes? How much ECC to use? Number of P/E Cycles
  • 10.
    10 The Not-So-Secret Sauce Reduce dimensionality of the problem  Reduce the number of independent variables  Understand the silicon; vary as few registers as possible  Guide the search  Be sensible, if not insightful  Test only what has to be tested  Or at least know what NOT to test  …this is still an astronomically difficult problem!
  • 11.
    11 The Secret Sauce Plot “safe” paths through massively highly dimensional space  Each register adds two dimensions (more than 50 write registers!)  Tune paths on the fly  Not all blocks degrade at the same rate  Retention may impact various blocks differently  Anticipate health issues before they happen  Treat the flash as though it is a dynamic, living thing
  • 12.
    12 NVMdurance Pathfinder Start Destination Imaginethe lifetime of a device to be a journey through space… … touch an “asteroid” and we have unrecoverable data
  • 13.
  • 14.
  • 15.
  • 16.
  • 17.
  • 18.
  • 19.
  • 20.
  • 21.
    21 Start Destination Asteroid Belt NVMdurancePathfinder determines sets of optimal register values (analogous to safe paths through the asteroid belt) for NAND chips before they go into a product NVMdurance Pathfinder
  • 22.
    22 Start Destination Asteroid Belt Liveon device, NVMdurance Navigator chooses which path through the space to use, based on the “health” of the device. NVMdurance Navigator
  • 23.
    23 Start Destination NVMdurance Navigator AsteroidBelt Live on device, NVMdurance Navigator chooses which path through the space to use, based on the “health” of the device.
  • 24.
    24 Start Destination Asteroid Belt Hitthe hyperspace button! Change the register values NVMdurance Navigator
  • 25.
    25 Start Destination Asteroid Belt NVMduranceNavigator constantly monitors the health, and jumps and jumps from path to path as the use-case requires. NVMdurance Navigator
  • 26.
    26 Start Destination NVMdurance Navigator AsteroidBelt NVMdurance Navigator constantly monitors the health, and jumps and jumps from path to path as the use-case requires.
  • 27.
    27 Start Destination NVMdurance Navigator AsteroidBelt NVMdurance Navigator constantly monitors the health, and jumps and jumps from path to path as the use-case requires.
  • 28.
    28 Start Destination NVMdurance Navigator AsteroidBelt NVMdurance Navigator constantly monitors the health, and jumps and jumps from path to path as the use-case requires.
  • 29.
    29 NVMdurance system Machine LearningParameter Discovery NVMdurance Pathfinder: Discover routes through multidimensional space such that every parameter set passes retention for that point of life (by fully characterizing NAND chips before they go into a product) Autonomic (runs live on the SSD controller) NVMdurance Navigator: Observes deterioration of the SSD; chooses when to change parameters (using the knowledge delivered by Pathfinder)
  • 30.
    30 Stage Early Life MidLife Late Life  Sets of Write register values for that time in life  30-60 write registers, e.g. start voltage, step size, MSB, LSB, odd, even, etc.  More registers means more control, but larger search space
  • 31.
    31 Stage Mid Life LateLife  Sets of Write register values for that time in life  30-60 write registers, e.g. start voltage, step size, MSB, LSB, odd, even, etc. Early Life
  • 32.
    32 Stage  Each stagehas multiple waypoints (each a set of read register values) to guide it to the next stage  Often only one set in early life – no read retry!  First stage often lasts longer than default!  Never more than eight waypoints Write Register Values Read Register Values
  • 33.
    33 Waypoints Mid Life LateLife  More waypoints required later in life  Higher wear uncovers higher variability Early Life
  • 34.
    34 Lifetime  NVMdurance Pathfinder(machine learning offline)  Automatically discovers and proves viability of stages  Each stage passes retention test 0 5 10 P/E Cycle Count as Multiple of Default Rating
  • 35.
    35 Lifetime  NVMdurance Navigator(run time)  Runs on controller (autonomic)  Monitors “Health” of devices  Determines when to progress to next stage  Chooses which waypoint to use during each stage 0 5 10 P/E Cycle Count as Multiple of Default Rating
  • 36.
    36 Increasing ECC  Systemcan tolerate higher BER  Stages last longer  Weaker stages possible earlier in life  NVMdurance enables a truly variable/adaptive ECC 0 15 25 P/E Cycle Count as Multiple of Default Rating
  • 37.
    37 One approach –many use cases P/E Cycle Count as Multiple of Default Rating (120-bit ECC) 0 5 10 P/E Cycle Count as Multiple of Default Rating (40-bit ECC)
  • 38.
    38 Experiments Device 1X nm ECCUp to 40 bits per sector Retention 12 months Baseline Endurance 1 Intrinsic Endurance (3 months retention) 1.8x longer than rated endurance Target Device
  • 39.
    39 Results Number of stages8 Stage length 1x – 2x longer than rated endurance Retention 3 months ECC Up to 40 bits per sector Minimum Window Stress (stage 1) 45% Approximately equal stress Stage 5 Maximum Window Stress (stage 8) 120% Maximum P/E cycles 10x longer than rated endurance Summary
  • 40.
  • 41.
  • 42.
  • 43.
  • 44.
  • 45.
  • 46.
    46 Results P/E cycle time as% of default; never exceeds default.
  • 47.
    47 Results – IncreasingECC Stages get longer as ECC increases
  • 48.
    48 Results – IncreasingECC Stages get longer as ECC increases
  • 49.
    49 Results – IncreasingECC Stages get longer as ECC increases
  • 50.
    50 Results – IncreasingECC Stages get longer as ECC increases
  • 51.
    51 Results – IncreasingECC Stages get longer as ECC increases
  • 52.
    52 Results – IncreasingECC 25x improvement in endurance with the same retention!
  • 53.
  • 54.
  • 55.
  • 56.
    56 Results P/E cycle time as% of default; never exceeds default.
  • 57.
    57 Results Number of stages8 Stage length 1.2x – 7x longer than rated endurance Retention 3 months ECC Up to 120 bits per sector Minimum Window Stress (stage 1) 25% Approximately equal stress Stage 5 Maximum Window Stress (stage 8) 120% Maximum P/E Cycles 25x longer than rated endurance Final Results
  • 58.
    58 Results Summary  25-foldincrease in endurance on 1X nm  10-fold increase on 1X with 40-bit ECC  We avoid the problem of live optimization of parameters  Most work done before flash is put in product  NVMdurance Navigator can predict imminent failure  Our “health” measure is very precise  P/E operations run faster than defaults  Results proven with current generation devices from multiple manufacturers
  • 59.
    59 Stop by andsee us at booth #921 Conor.Ryan@NVMdurance.com Conclusion  Industry leading endurance gains  NVMdurance's technology is synergistic to existing flash controller technology  Machine-learning software fully characterizes NAND chips before they go into a product  Lightweight software running on the product autonomically manages its health, using that knowledge