Algorithms for Big Data: Graphs and Memory Errors 5 (Lecture by Giuseppe Italiano)


The first part of my lectures will be devoted to the design of practical algorithms for very large graphs. The second part will be devoted to algorithms resilient to memory errors. Modern memory devices may suffer from faults, where some bits may arbitrarily flip and corrupt the values of the affected memory cells. Such faults can seriously compromise the correctness and performance of computations, and the larger the memory usage, the higher the probability of incurring memory errors. In recent years, many algorithms for computing in the presence of memory faults have been introduced in the literature: in particular, an algorithm or a data structure is called resilient if it is able to work correctly on the set of uncorrupted values. This part will cover recent work on resilient algorithms and data structures.



  1. 1. Resilient Algorithms and Data Structures (Work by Ferraro-Petrillo, Finocchi, I. & Grandoni)
  2. 2. Outline of the Talk 1.  Motivation and Model 2.  Resilient Algorithms: •  Sorting and Searching 3.  Resilient Data Structures •  Priority Queues •  Dictionaries 4.  Experimental Results 5.  Conclusions and Open Problems 2
  3. 3. Memory Errors Memory error: one or multiple bits read differently from how they were last written. Many possible causes: •  electrical or magnetic interference (cosmic rays) •  hardware problems (bit permanently damaged) •  corruption in data path between memories and processing units Errors in DRAM devices concern for a long time [May & Woods 79, Ziegler et al 79, Chen & Hsiao 84, Normand 96, O’Gorman et al 96, Mukherjee et al 05, … ] 3
  4. 4. Memory Errors Soft Errors: Randomly corrupt bits, but do not leave any physical damage --- cosmic rays Hard Errors: Corrupt bits in a repeatable manner because of a physical defect (e.g., stuck bits) --- hardware problems 4
  5. 5. Error Correcting Codes (ECC) Error correcting codes (ECC) allow detection and correction of one or multiple bit errors Typical ECC is SECDED (i.e., single error correct, double error detect) Chip-Kill can correct up to 4 adjacent bits at once ECC has several overheads in terms of performance (33%), size (20%) and money (10%). ECC memory chips are mostly used in memory systems for server machines rather than for client computers 5
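To make the SECDED scheme mentioned above concrete, here is a toy extended-Hamming(8,4) encoder/decoder in C++ (the experimental platform below uses gcc, hence the __builtin_parity intrinsic). It protects 4 data bits with 3 Hamming parity bits plus one overall parity bit, so any single-bit error is corrected and any double-bit error is detected; real ECC DIMMs apply the same idea to 64-bit words (e.g., (72,64) codes). This is only an illustration, not a model of the actual memory-controller hardware.

```cpp
#include <cstdint>

// Toy SECDED: extended Hamming(8,4).
// Codeword bit layout: bit 0 = overall parity, bits 1,2,4 = Hamming parity,
// bits 3,5,6,7 = data bits d0..d3.
uint8_t secded_encode(uint8_t data) {
    uint8_t d0 = (data >> 0) & 1, d1 = (data >> 1) & 1,
            d2 = (data >> 2) & 1, d3 = (data >> 3) & 1;
    uint8_t p1 = d0 ^ d1 ^ d3;                    // checks positions 3,5,7
    uint8_t p2 = d0 ^ d2 ^ d3;                    // checks positions 3,6,7
    uint8_t p4 = d1 ^ d2 ^ d3;                    // checks positions 5,6,7
    uint8_t cw = (p1 << 1) | (p2 << 2) | (d0 << 3) |
                 (p4 << 4) | (d1 << 5) | (d2 << 6) | (d3 << 7);
    return cw | (uint8_t)__builtin_parity(cw);    // bit 0: overall (even) parity
}

// Returns 0 = no error, 1 = single error corrected, 2 = double error detected.
int secded_decode(uint8_t& cw, uint8_t& data) {
    auto bit = [&](int i) { return (cw >> i) & 1; };
    int s = 0;                                    // Hamming syndrome = error position
    if (bit(1) ^ bit(3) ^ bit(5) ^ bit(7)) s |= 1;
    if (bit(2) ^ bit(3) ^ bit(6) ^ bit(7)) s |= 2;
    if (bit(4) ^ bit(5) ^ bit(6) ^ bit(7)) s |= 4;
    int overall = __builtin_parity(cw);           // 1 iff total parity is odd
    int status = 0;
    if (s != 0 && overall)       { cw ^= (uint8_t)(1u << s); status = 1; }  // single error
    else if (s != 0 && !overall) { status = 2; }                            // double error
    else if (s == 0 && overall)  { cw ^= 1u; status = 1; }   // error in parity bit itself
    data = (uint8_t)(bit(3) | (bit(5) << 1) | (bit(6) << 2) | (bit(7) << 3));
    return status;
}
```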
  6. 6. Impact of Memory Errors Consequence of a memory error is system dependent 1. Correctable errors : fixed by ECC 2. Uncorrectable errors : 2.1. Detected : Explicit failure (e.g., a machine reboot) 2.2. Undetected : 2.2.1. Induced failure (e.g., a kernel panic) 2.2.2. Unnoticed (but application corrupted, e.g., segmentation fault, file not found, file not readable, … ) 6
  7. 7. How Common are Memory Errors? 7
  8. 8. How Common are Memory Errors? [Schroeder et al 2009]: 2.5 years of experiments (Jan 06 – Jun 08) on the Google fleet (10^4 machines, ECC memory) Memory errors are NOT rare events! 8
  9. 9. How Common are Memory Errors? [Hwang et al 2012] 9 Only minority (2-20%) of nodes experiences 1 single error. Majority experiences larger number of errors (half of nodes sees > 100 errors and top 5% of nodes sees > million errors)
  10. 10. Error Distribution [Hwang et al 2012] 10 Very skewed distribution of errors across nodes: the top 5% of error nodes account for more than 95 % of all errors
  11. 11. Error Correlation [Hwang et al 2012] 11 Errors happen in a correlated fashion: even a single error on a node raises the probability of future errors to more than 80%, and after seeing just a handful of errors this probability increases to more than 95%.
  12. 12. Memory Errors Recent studies point to main memory as one of the leading hardware causes for machine crashes and component replacements in today’s data centers. As the amount of DRAM in servers keeps growing and chip densities increase, DRAM errors might pose an even larger threat to the reliability of future generations of systems. 12
  13. 13. Memory Errors Not all machines (clients) have ECC memory chips. Increased demand for larger capacities at low cost just makes the problem more serious – large clusters of inexpensive memories Need of reliable computation in the presence of memory faults 13
  14. 14. Memory Errors Other scenarios in which memory errors have impact (and seem to be modeled in an adversarial setting): •  Memory errors can cause security vulnerabilities: Fault-based cryptanalysis [Boneh et al 97, Xu et al 01, Bloemer & Seifert 03] Attacking Java Virtual Machines [Govindavajhala & Appel 03] Breaking smart cards [Skorobogatov & Anderson 02, Bar-El et al 06] •  Avionics and space electronic systems: The amount of cosmic rays increases with altitude (soft errors) 14
  15. 15. Memory Errors in Space 15
  16. 16. Memory Errors in Space 16
  17. 17. Memory Errors in Space 17
  18. 18. Memory Errors in Space 18
  19. 19. Recap on Memory Errors 1. Memory errors can be harmful: uncorrectable memory errors cause some catastrophic event (reboot, kernel panic, data corruption, …) 19 [cartoon caption: "I'm thinking of getting back into crime, Luigi. Legitimate business is too corrupt…"]
  20. 20. A small example Classical algorithms may not be correct in the presence of (even very few) memory errors An example: merging two ordered lists A and B of Θ(n) keys each; a single corrupted key can force Θ(n^2) inversions in the output [Figure: the two input lists (one containing a key corrupted to 80) and the merged output Out] 20
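A minimal, self-contained C++ demo of this slide's effect (the list contents are made up, not the figure's exact values): corrupting a single key of one sorted input list to a huge value makes a standard merge emit the entire other list first, producing Θ(n^2) inversions among the correct keys.

```cpp
#include <cstddef>
#include <cstdio>
#include <vector>

// Standard (non-resilient) two-way merge of sorted lists.
static std::vector<int> merge(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    out.reserve(a.size() + b.size());
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) out.push_back(a[i] <= b[j] ? a[i++] : b[j++]);
    while (i < a.size()) out.push_back(a[i++]);
    while (j < b.size()) out.push_back(b[j++]);
    return out;
}

// Count pairs of correct keys that end up in the wrong relative order.
static long long inversions(const std::vector<int>& v, int corrupted) {
    long long inv = 0;
    for (std::size_t i = 0; i < v.size(); ++i)
        for (std::size_t j = i + 1; j < v.size(); ++j)
            if (v[i] != corrupted && v[j] != corrupted && v[i] > v[j]) ++inv;
    return inv;
}

int main() {
    const int n = 1000, corrupted = 1000000;
    std::vector<int> a, b;
    for (int k = 0; k < n; ++k) { a.push_back(2 * k); b.push_back(2 * k + 1); }
    a[0] = corrupted;   // a single memory error: the first key of a becomes huge
    // The merge now emits all of b before the remaining (smaller) keys of a,
    // so Theta(n^2) pairs of correct keys are inverted.
    std::printf("inversions among correct keys: %lld\n",
                inversions(merge(a, b), corrupted));
    return 0;
}
```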
  21. 21. Recap on Memory Errors 2. Memory errors are NOT rare: even a small cluster of computers with a few GB per node can experience one bit error every few minutes. 21 [cartoon caption: "I know my PIN number: it's my name I can't remember…"]
  22. 22. Recap on Memory Errors 3. ECC may not be available (or may not be enough): No ECC in inexpensive memories. ECC does not guarantee complete fault coverage; expensive; system halt upon detection of uncorrectable errors; service disruption; etc… etc… 22
  23. 23. Impact of Memory Errors 23
  24. 24. Resilient Algorithms and Data Structures Resilient Algorithms and Data Structures: Capable of tolerating memory errors on data (even throughout their execution) without sacrificing correctness, performance and storage space Make sure that the algorithms and data structures we design are capable of dealing with memory errors 24
  25. 25. Faulty-Memory Model [Finocchi, I. 04] •  Memory fault = the correct data stored in a memory location gets altered (destructive faults) •  Faults can appear at any time in any memory location simultaneously •  Assumptions: –  Only O(1) words of reliable memory (safe memory) –  Corrupted values indistinguishable from correct ones Wish to produce correct output on uncorrupted data (in an adversarial model) •  Even recursion may be problematic in this model. 25
  26. 26. Terminology δ = upper bound known on the number of memory errors (may be a function of n) α = actual number of memory errors (happen during a specific execution) Note: typically α ≤ δ All the algorithms / data structures described here need to know δ in advance 26
  27. 27. Other Faulty Models Design of fault-tolerant alg’s received attention for 50+ years Liar Model [Ulam 77, Renyi 76,…] Comparison questions answered by a possibly lying adversary. Can exploit query replication strategies. Fault-tolerant sorting networks [Assaf Upfal 91, Yao Yao 85,…] Comparators can be faulty. Exploit substantial data replication using fault-free data replicators. Parallel Computations [Huang et al 84, Chlebus et al 94, …] Faults on parallel/distributed architectures: PRAM or DMM simulations (rely on fault-detection mechanisms) 27
  28. 28. Other Faulty Models   Robustness in Computational Geometry [Schirra 00, …]   Faults from unreliable computation (geometric precision) rather than from memory errors   Noisy / Unreliable Computation [Bravermann Mossel 08]   Faults (with given probability) from unreliable primitives (e.g., comparisons) rather than from memory errors   Memory Checkers [Blum et al 93, Blum et al 95, …]   Programs not reliable objects: self-testing and self-correction. Essential error detection and error correction mechanisms.   ……………………………………… 28
  29. 29. Outline of the Talk 1.  Motivation and Model 2.  Resilient Algorithms: •  Sorting and Searching 3.  Resilient Data Structures •  Priority Queues •  Dictionaries 4.  Experimental Results 5.  Conclusions and Open Problems 29
  30. 30. Resilient Sorting We are given a set of n keys that need to be sorted The value of some keys may get arbitrarily corrupted, and we cannot tell which keys are faithful and which are corrupted Q1. Can we sort the correct values efficiently in the presence of memory errors? Q2. How many memory errors can we tolerate in the worst case if we wish to maintain optimal time and space? 30
  31. 31. Terminology •  Faithfully ordered sequence = ordered except for corrupted keys •  Resilient sorting algorithm = produces a faithfully ordered sequence (i.e., wish to sort correctly all the uncorrupted keys) •  Faithful key = never corrupted •  Faulty key = corrupted [Example: the sequence 1 2 3 4 5 6 7 8 9 10 with one key corrupted to 80 is still faithfully ordered] 31
  32. 32. Trivially Resilient Resilient variable: consists of (2δ+1) copies x1, x2, …, x2δ+1 of a standard variable x Value of resilient variable given by majority of its copies: •  cannot be corrupted by faults •  can be computed in linear time and constant space [Boyer Moore 91] Trivially-resilient algorithms and data structures have Θ(δ) multiplicative overheads in terms of time and space Note: Trivially-resilient does more than ECC (SECDED, Chip-Kill, ….) 32
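A minimal sketch of such a resilient variable in C++, assuming int values (and glossing over the fact that, in the strict model, the container's own bookkeeping would also sit in unreliable memory). Reads use the Boyer-Moore majority vote cited on the slide: one pass over the 2δ+1 copies with O(1) safe-memory state; since at most δ copies can be corrupted, the majority is always the value last faithfully written.

```cpp
#include <cstddef>
#include <vector>

// Trivially resilient variable: 2*delta + 1 copies of the value.
// At most delta copies can be corrupted, so a majority of the copies
// always equals the value that was last faithfully written.
struct ResilientInt {
    std::vector<int> copies;                      // lives in unreliable memory
    explicit ResilientInt(std::size_t delta, int v = 0) : copies(2 * delta + 1, v) {}

    void write(int v) {                           // O(delta) time
        for (int& c : copies) c = v;
    }

    int read() const {                            // O(delta) time, O(1) safe memory
        // Boyer-Moore majority vote; a strict majority is guaranteed to exist,
        // so no verification pass is needed.
        int candidate = 0;
        std::size_t count = 0;
        for (int c : copies) {
            if (count == 0) { candidate = c; count = 1; }
            else if (c == candidate) ++count;
            else --count;
        }
        return candidate;
    }
};
```

Replacing every key of a standard algorithm with such a variable is exactly what gives trivially resilient solutions their Θ(δ) multiplicative overhead in time and space.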
  33. 33. Trivially Resilient Sorting Can trivially sort in O(δ n log n) time during δ memory errors, i.e., an O(n log n) sorting algorithm able to tolerate only O(1) memory errors 33
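As a hedged illustration of where the O(δ n log n) bound comes from (not the authors' code): trivially resilient sorting just sorts resilient variables by their majority values, so every comparison pays the O(δ) cost of a majority read.

```cpp
#include <algorithm>
#include <vector>

// Trivially resilient sorting (sketch): sort resilient variables
// (ResilientInt from the previous sketch) by their majority values.
// Each comparison costs O(delta), hence O(delta * n log n) overall.
void trivially_resilient_sort(std::vector<ResilientInt>& a) {
    std::sort(a.begin(), a.end(),
              [](const ResilientInt& x, const ResilientInt& y) {
                  return x.read() < y.read();     // majority read, never fooled
              });
}
```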
  34. 34. Resilient Sorting Upper Bound [Finocchi, Grandoni, I. 05]: comparison-based sorting algorithm that takes O(n log n + δ^2) time to run during δ memory errors, i.e., an O(n log n) sorting algorithm able to tolerate up to O((n log n)^(1/2)) memory errors Lower Bound [Finocchi, I. 04]: any comparison-based resilient O(n log n) sorting algorithm can tolerate the corruption of at most O((n log n)^(1/2)) keys 34
  35. 35. Resilient Sorting 35 [Babenko and Pouzyrevsky, ’12] randomized algorithm (based on quicksort) which runs in O(n log n + δ·(n log n)^(1/2)) expected time (or deterministic, in O(n log n + δ·n^(1/2)·log n) worst-case time) during δ memory errors The lower bound assumes that algorithms are not allowed to introduce replicas of existing elements.
  36. 36. Resilient Sorting (cont.) Integer Sorting [Finocchi, Grandoni, I. 05]: randomized integer sorting algorithm that takes O(n + δ^2) time to run during δ memory errors, i.e., an O(n) randomized integer sorting algorithm able to tolerate up to O(n^(1/2)) memory errors 36
  37. 37. Resilient Binary Search Wish to get correct answers at least on correct keys: search(s) either finds a key equal to s, or determines that no correct key is equal to s If only faulty keys are equal to s, the answer is uninteresting (cannot hope to get a trustworthy answer) [Figure: a faithfully ordered array 2 3 4 5 8 9 13 20 26 with some cells corrupted to 1, 7, 10, 80; a non-resilient binary search may wrongly report search(5) = false even though the correct key 5 is present] 37
  38. 38. Trivially Resilient Binary Search Can search in O(δ log n) time during δ memory errors 38
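A corresponding sketch for trivially resilient binary search, reusing the ResilientInt from the sketch above: each probe reads a majority value in O(δ) time, so no comparison can be fooled by at most δ faults, and the total cost is O(δ log n).

```cpp
#include <cstddef>
#include <vector>

// Trivially resilient binary search (sketch) over a faithfully ordered
// array of ResilientInt (previous sketch).  Indices live in safe memory;
// every probe is a majority read, so the cost is O(delta * log n).
bool trivially_resilient_search(const std::vector<ResilientInt>& a, int key) {
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {
        std::size_t mid = lo + (hi - lo) / 2;
        int v = a[mid].read();                    // O(delta) majority vote
        if (v == key) return true;
        if (v < key) lo = mid + 1; else hi = mid;
    }
    return false;
}
```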
  39. 39. Resilient Searching Upper Bounds: randomized algorithm with O(log n + δ) expected time [Finocchi, Grandoni, I. 05]; deterministic algorithm with O(log n + δ) time [Brodal et al. 07] Lower Bounds: Ω(log n + δ) lower bound (deterministic) [Finocchi, I. 04]; Ω(log n + δ) lower bound on expected time [Finocchi, Grandoni, I. 05] 39
  40. 40. Resilient Dynamic Programming d-dim. Dynamic Programming [Caminiti et al. 11]: running time O(n^d + δ^(d+1)) and space usage O(n^d + nδ); can tolerate up to δ = O(n^(d/(d+1))) memory errors 40
  41. 41. Outline of the Talk 1.  Motivation and Model 2.  Resilient Algorithms: •  Sorting and Searching 3.  Resilient Data Structures •  Priority Queues •  Dictionaries 4.  Experimental Results 5.  Conclusions and Open Problems 41
  42. 42. Resilient Data Structures Data structures are more vulnerable to memory errors than algorithms: algorithms are affected by errors only during their execution, while data structures are affected by errors during their whole lifetime 42
  43. 43. Resilient Priority Queues Maintain a set of elements under insert and deletemin insert adds an element deletemin deletes and returns either the minimum uncorrupted value or a corrupted value Consistent with resilient sorting 43
  44. 44. Resilient Priority Queues Upper Bound : Both insert and deletemin can be implemented in O(log n + δ) time [Jorgensen et al. 07] (based on cache-oblivious priority queues) Lower Bound : A resilient priority queue with n > δ elements must use Ω(log n + δ) comparisons to answer an insert followed by a deletemin [Jorgensen et al. 07] 44
  45. 45. Resilient Dictionaries Maintain a set of elements under insert, delete and search insert and delete as usual, search as in resilient searching: Again, consistent with resilient sorting search(s) either finds a key equal to s, or determines that no correct key is equal to s 45
  46. 46. Resilient Dictionaries Randomized resilient dictionary implements each operation in O(log n + δ) time [Brodal et al. 07] More complicated deterministic resilient dictionary implements each operation in O(log n + δ) time [Brodal et al. 07] 46
  47. 47. Resilient Dictionaries Pointer-based data structures Faults on pointers likely to be more problematic than faults on keys Randomized resilient dictionaries of Brodal et al. built on top of traditional (non-resilient) dictionaries Our implementation built on top of AVL trees 47
  48. 48. Outline of the Talk 1.  Motivation and Model 2.  Resilient Algorithms: •  Sorting and Searching 3.  Resilient Data Structures •  Priority Queues •  Dictionaries 4.  Experimental Results 5.  Conclusions and Open Problems 48
  49. 49. Experimental Framework For each algorithm/data structure three versions are compared: Non-Resilient O(f(n)), Trivially Resilient O(δ · f(n)), Resilient O(f(n) + g(δ)) 49 Resilient sorting from [Ferraro-Petrillo et al. 09] Resilient dictionaries from [Ferraro-Petrillo et al. 10] Implemented resilient binary search and heaps Implementations of resilient sorting and dictionaries are more engineered than those of resilient binary search and heaps
  50. 50. Experimental Platform •  2 CPUs Intel Quad-Core Xeon E5520 @ 2.26 GHz •  L1 cache 256 KB, L2 cache 1 MB, L3 cache 8 MB •  48 GB RAM •  Scientific Linux release with Linux kernel 2.6.18-164 •  gcc 4.1.2, optimization flag –O3 50
  51. 51. Fault Injection This talk: Only random faults Algorithm / data structure and fault injection implemented as separate threads (Run on different CPUs) Preliminary experiments (not here): error rates depend on memory usage and time. 51
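A rough sketch of such a two-thread setup (the names, injection rate and use of std::thread are illustrative assumptions, not the actual experimental code): the injector periodically flips a random bit in the memory region used by the algorithm under test. The unsynchronized concurrent writes are deliberate, since they model faults striking at arbitrary moments.

```cpp
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>
#include <thread>
#include <vector>

// Fault injector (sketch): flips one random bit of the monitored memory
// region at a fixed rate until asked to stop or until max_faults bits
// have been flipped.  Runs in its own thread, alongside the algorithm.
void inject_faults(std::vector<std::uint32_t>& memory, std::atomic<bool>& stop,
                   int max_faults, std::chrono::microseconds period) {
    std::mt19937_64 rng(std::random_device{}());
    std::uniform_int_distribution<std::size_t> cell(0, memory.size() - 1);
    std::uniform_int_distribution<int> bit(0, 31);
    for (int injected = 0; injected < max_faults && !stop.load(); ++injected) {
        std::this_thread::sleep_for(period);
        memory[cell(rng)] ^= (1u << bit(rng));    // one random single-bit error
    }
}

// Usage (illustrative):
//   std::vector<std::uint32_t> keys = ...;      // data used by the algorithm
//   std::atomic<bool> stop{false};
//   std::thread injector(inject_faults, std::ref(keys), std::ref(stop),
//                        1024, std::chrono::microseconds(100));
//   run_algorithm_under_test(keys);             // e.g., resilient mergesort
//   stop = true;
//   injector.join();
```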
  52. 52. Resiliency: Why should we care? What’s the impact of memory errors? Try to analyze impact of errors on mergesort, priority queues and dictionaries using a common framework (sorting) Attempt to measure error propagation: try to estimate how far the output sequence is from being sorted (because of memory errors) Heapsort implemented on array. For coherence, in AVLSort we do not induce faults on pointers Will measure faults on AVL pointers in a separate experiment 52
  53. 53. Error Propagation •  k-unordered sequence = faithfully ordered except for k (correct) keys •  k-unordered sorting algorithm = produces a k-unordered sequence, i.e., it faithfully sorts all but k correct keys [Example: 1 2 3 4 9 5 7 8 6 10 with one key corrupted to 80 is 2-unordered: removing the correct keys 9 and 6 leaves a faithfully ordered sequence] •  Resilient is 0-unordered = i.e., it faithfully sorts all correct keys 53
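One way to measure the k of this definition in an experiment is sketched below (assuming, for simplicity, that corrupted values can be recognized, e.g., because the injector records them): drop the faulty keys and compute the longest non-decreasing subsequence of the remaining correct keys; k is the number of correct keys left over.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Smallest k such that the output is k-unordered:
// (# correct keys) - (longest non-decreasing subsequence of the correct keys).
// Faulty keys are simply ignored, as in the definition of faithful order.
std::size_t k_unordered(const std::vector<int>& out,
                        const std::unordered_set<int>& corrupted_values) {
    std::vector<int> correct;
    for (int x : out)
        if (!corrupted_values.count(x)) correct.push_back(x);
    std::vector<int> tails;                       // patience method, O(m log m)
    for (int x : correct) {
        auto it = std::upper_bound(tails.begin(), tails.end(), x);
        if (it == tails.end()) tails.push_back(x);
        else *it = x;
    }
    return correct.size() - tails.size();         // correct keys that are out of place
}
```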
  54. 54. The Importance of Being Resilient n = 5,000,000; 0.01% (random) errors in input → 0.13% errors in output; 0.02% (random) errors in input → 0.22% errors in output [chart: % errors in output vs. α] 54
  55. 55. The Importance of Being Resilient n = 5,000,000; 0.01% (random) errors in input → 0.40% errors in output; 0.02% (random) errors in input → 0.47% errors in output [chart: % errors in output vs. α] 55
  56. 56. The Importance of Being Resilient n = 5,000,000; 0.01% (random) errors in input → 68.20% errors in output; 0.02% (random) errors in input → 79.62% errors in output [chart: % errors in output vs. α] 56
  57. 57. The Importance of Being Resilient [chart: % errors in output vs. α] 57
  58. 58. Error Amplification Mergesort 0.002-0.02% (random) errors in input → 24.50-79.51% errors in output AVLsort 0.002-0.02% (random) errors in input → 0.39-0.47% errors in output Heapsort 0.002-0.02% (random) errors in input → 0.01-0.22% errors in output They all show some error amplification. Large variations likely to depend on data organization Note: Those are errors on keys. Errors on pointers are more dramatic for pointer-based data structures 58
  59. 59. The Importance of Being Resilient AVL with n = 5,000,000; α errors on memory used (keys, parent pointers, pointers, etc…) 100,000 searches; around α searches fail: on average, able to complete only about (100,000/α) searches before crashing 59
  60. 60. Isn’t Trivial Resiliency Enough? Memory errors are a problem Do we need to tackle it with new algorithms / data structures? Aren’t simple-minded approaches enough? 60
  61. 61. Isn’t Trivial Resiliency Enough? δ = 1024 61
  62. 62. Isn’t Trivial Resiliency Enough?   δ = 1024   100,000 random search 62
  63. 63. Isn’t Trivial Resiliency Enough?   δ = 512   100,000 random ops 63
  64. 64. Isn’t Trivial Resiliency Enough?   δ = 1024   100.000 random ops   no errors on pointers 64
  65. 65. Isn’t Trivial Resiliency Enough? All experiments for 10^5 ≤ n ≤ 5·10^5, δ = 1024, unless specified otherwise Mergesort Trivially resilient about 100-200X slower than non-resilient Binary Search Trivially resilient about 200-300X slower than non-resilient Dictionaries Trivially resilient AVL about 300X slower than non-resilient Heaps Trivially resilient about 1000X slower than non-resilient (δ = 512) [deletemin are not random and slow] 65
  66. 66. Performance of Resilient Algorithms Memory errors are a problem Trivial approaches produce slow algorithms / data structures Need non-trivial (hopefully fast) approaches How fast can resilient algorithms / data structures be? 66
  67. 67. Performance of Resilient Algorithms α = δ = 1024 67
  68. 68. Performance of Resilient Algorithms α = δ = 1024 68
  69. 69. Performance of Resilient Algorithms   α = δ = 1024   100,000 random search 69
  70. 70. Performance of Resilient Algorithms   α = δ = 1024   100,000 random search 70
  71. 71. Performance of Resilient Algorithms   α = δ = 512   100,000 random ops 71
  72. 72. Performance of Resilient Algorithms   α = δ = 512   100,000 random ops 72
  73. 73. Performance of Resilient Algorithms   α = δ = 1024   100,000 random ops 73
  74. 74. Performance of Resilient Algorithms   α = δ = 1024   100,000 random ops 74
  75. 75. Performance of Resiliency All experiments for 10^5 ≤ n ≤ 5·10^5, α = δ = 1024, unless specified otherwise Mergesort Resilient mergesort about 1.5-2X slower than non-resilient mergesort [Trivially resilient mergesort about 100-200X slower] Binary Search Resilient binary search about 60-80X slower than non-resilient binary search [Trivially resilient binary search about 200-300X slower] Heaps Resilient heaps about 20X slower than non-resilient heaps (α = δ = 512) [Trivially resilient heaps about 1000X slower] Dictionaries Resilient AVL about 10-20X slower than non-resilient AVL [Trivially resilient AVL about 300X slower] 75
  76. 76. Larger Data Sets 76 How well does the performance of resilient algorithms / data structures scale to larger data sets? Previous experiments: 10^5 ≤ n ≤ 5·10^5 New experiment with n = 5·10^6 (no trivially resilient)
  77. 77. Larger Data Sets 77 α n = 5,000,000
  78. 78. Larger Data Sets n = 5,000,000 α 78
  79. 79. Larger Data Sets α 100,000 random search on n = 5,000,000 elements 79 log2 n ≈ 22
  80. 80. Larger Data Sets α 80 100,000 random search on n = 5,000,000 elements
  81. 81. Larger Data Sets 100,000 random ops on a heap with n = 5,000,000 α 81 log2 n ≈ 22
  82. 82. Larger Data Sets 100,000 random ops on a heap with n = 5,000,000 α 82
  83. 83. Larger Data Sets 100,000 random ops on AVL with n = 5,000,000 α 83 log2 n ≈ 22
  84. 84. Larger Data Sets 100,000 random ops on AVL with n = 5,000,000 α 84
  85. 85. Larger Data Sets All experiments for n = 5·10^6 Mergesort [was 1.5-2X for 10^5 ≤ n ≤ 5·10^5] Resilient mergesort is 1.6-2.3X slower (requires ≤ 0.04% more space) Binary Search [was 60-80X for 10^5 ≤ n ≤ 5·10^5] Resilient search is 100-1000X slower (requires ≤ 0.08% more space) Heaps [was 20X for 10^5 ≤ n ≤ 5·10^5] Resilient heap is 100-1000X slower (requires 100X more space) Dictionaries [was 10-20X for 10^5 ≤ n ≤ 5·10^5] Resilient AVL is 6.9-14.6X slower (requires about 1/3 space) 85
  86. 86. Sensitivity to δ 86 How critical is the choice of δ ? Underestimating δ (α > δ) compromises resiliency Overestimating δ (α << δ) gives some performance degradation
  87. 87. Performance Degradation Mergesort Resilient mergesort improves by 9.7% in time and degrades by 0.04% in space Binary Search Resilient search degrades to 9.8X in time and by 0.08% in space Heaps Resilient heap degrades to 13.1X in time and by 59.28% in space Dictionaries Resilient AVL degrades by 49.71% in time 87 α = 32, but algorithm overestimates δ = 1024:
  88. 88. Robustness 88 Resilient mergesort and dictionaries appear more robust than resilient search and heaps I.e., resilient mergesort and dictionaries scale better with n, less sensitive to δ (so less vulnerable to bad estimates of δ), … How much of this is due to the fact that their implementations are more engineered?
  89. 89. Outline of the Talk 1.  Motivation and Model 2.  Resilient Algorithms: •  Sorting and Searching 3.  Resilient Data Structures •  Priority Queues •  Dictionaries 4.  Experimental Results 5.  Conclusions and Open Problems 89
  90. 90. Concluding Remarks •  Need of reliable computation in the presence of memory errors •  Investigated basic algorithms and data structures in the faulty memory model: do not wish to detect / correct errors, only produce correct output on correct data •  Tight upper and lower bounds in this model •  After first tests, resilient implementations of algorithms and data structures look promising 90
  91. 91. Future Work and Open Problems •  More (faster) implementations, engineering and experimental analysis? •  Resilient graph algorithms? •  Lower bounds for resilient integer sorting? •  Better faulty memory model? •  Resilient algorithms oblivious to δ? •  Full repertoire for resilient priority queues (delete, decreasekey, increasekey)? 91
  92. 92. Thank You! 92 [cartoon caption: "My memory's terrible these days…"]
