DDN: Protecting Your Data, Protecting Your Hardware

1. DDN Confidential.
Do NOT reproduce or distribute
Protecting Your Data, Protecting Your Hardware
HPC Advisory Council, Lugano, March 2016
Jean-Thomas Acquaviva, DDN
2. Corporate Status: DDN Expands its Global Network
Advanced Technical Center established, Paris, France
● 25+ R&D engineers
Technology Development Center, Pune, India
● 10+ R&D engineers
3. I/O Acceleration Layer
Distributed Virtually Shared Coherent Array of SSDs
SSD reshuffles the parameters:
● Latency ÷ 40: 4 ms → 0.1 ms
● Bandwidth × 3: 150 → 450 MB/s
● Capacity ÷ 8: 8 TB → 1 TB
● Cost × 10: $0.005/Gbit → $0.05/Gbit
What can we do with a costly, high-bandwidth, low-latency technology?
4. Write Amplification
1) Writes are done per page (4 KB)
2) A page cannot be written unless it has been erased first
3) Erases are done per block (128 KB)
→ Overwriting a page means moving the valid pages of its block to new locations (a worked example follows)
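As a back-of-the-envelope check, here is a minimal sketch of the worst case implied by these three rules (page and block sizes taken from the slide; the FTL behavior is deliberately simplified):

```python
# Worst-case write amplification when overwriting one 4 KB page
# inside a 128 KB erase block whose other pages are still valid.
PAGE = 4 * 1024                  # program (write) unit
BLOCK = 128 * 1024               # erase unit
PAGES_PER_BLOCK = BLOCK // PAGE  # 32 pages

# The FTL must relocate the 31 valid pages, then rewrite the updated one:
physical_writes = (PAGES_PER_BLOCK - 1) + 1
write_amplification = physical_writes / 1    # per single logical page write
print(f"worst-case write amplification: {write_amplification:.0f}x")  # 32x
```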
5. RAID: Read Modify Write → Write Amplification
RAID read-modify-write compounds flash-level write amplification. Every update carries a hidden cost:
● a performance cost (which can be tolerated)
● an increased rate of cell wear
Paradox: data protection accelerates SSD wear-out!
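To see how the two effects stack, a hedged sketch (the stripe geometry is illustrative, not any specific DDN layout): each small update rewrites the data block plus every parity block, and each of those device writes then pays the flash-level amplification shown on the previous slide.

```python
def device_writes_per_update(parity_blocks: int = 1) -> int:
    """Read-modify-write: one logical update rewrites the data block
    plus each parity block of the stripe."""
    return 1 + parity_blocks

FLASH_WA = 32  # worst case from the previous slide
for p in (1, 2):  # RAID-5-like, RAID-6-like layouts
    total = device_writes_per_update(p) * FLASH_WA
    print(f"{p} parity block(s): up to {total} physical page writes per update")
```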
6. Protecting Your Data, Protecting Your Hardware
● Storing data on persistent storage wears out the storage medium
● Error-recovery schemes further increase medium wear
Paradox: data protection degrades hardware lifetime
→ Software has to embrace the whole complexity of the new technology!
7. Write and Overwrite Size Should Be Consistent
• Use an elementary storage unit that matches the SSD erase-block size:
large enough to match the erase block, so that write and erase sizes are identical
→ no write amplification (a minimal sketch follows)
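A minimal sketch of that rule, assuming a hypothetical 128 KB erase block: fragments are buffered and the device only ever sees whole, aligned units, so every write covers exactly one future erase.

```python
ERASE_BLOCK = 128 * 1024  # assumed elementary unit = SSD erase-block size

class AlignedWriter:
    """Buffers fragments and emits only whole erase-block-sized units."""
    def __init__(self, device_append):
        self.device_append = device_append  # callback that appends one unit
        self.buf = bytearray()

    def write(self, fragment: bytes) -> None:
        self.buf += fragment
        while len(self.buf) >= ERASE_BLOCK:
            unit = bytes(self.buf[:ERASE_BLOCK])
            del self.buf[:ERASE_BLOCK]
            self.device_append(unit)  # write size == erase size: no RMW in the SSD
```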
8. Copy-On-Write for Data Protection
Data protection is built on top of the log basic block:
● no update in place → no write amplification
● no block fragmentation
Unmap() is sent once a whole parity group is obsolete (when IOPS pressure is low)
(Figure: a parity group = data + recovery data, each member placed on a distinct server A, B, C, D, E.)
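A toy version of this copy-on-write log (all names are illustrative): updates append a new version instead of rewriting in place, and unmap of a fully obsolete group is deferred until the device is quiet.

```python
class CowLog:
    """Append-only log of parity groups with deferred unmap."""
    def __init__(self):
        self.groups = []         # each entry: one parity group (data + recovery data)
        self.obsolete = set()    # indices of groups that became wholly obsolete

    def append(self, group: bytes) -> int:
        self.groups.append(group)            # pure append: no write amplification
        return len(self.groups) - 1

    def update(self, new_group: bytes, old_index: int) -> int:
        new_index = self.append(new_group)   # copy-on-write: new version elsewhere
        self.obsolete.add(old_index)         # old group is now reclaimable
        return new_index

    def reclaim_when_idle(self) -> None:
        """Stand-in for sending unmap() while IOPS pressure is low."""
        for i in self.obsolete:
            self.groups[i] = None
        self.obsolete.clear()
```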
9. Data Protection: Sharing the Burden
● Imbalance between compute nodes and I/O nodes
● Data protection always has to be computed
→ offloaded onto compute nodes
→ cheap if compute nodes have good vector capabilities (see the sketch after this list)
● Error recovery occurs only sporadically
→ I/O nodes handle data recovery
● Quality of data protection / recovery
→ CRC & Hamming distance
→ data scrubbing on the server side
→ declustered RAID for graceful degradation of service
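A sketch of the offload idea (numpy's vectorized XOR stands in for the compute node's SIMD units; fragment sizes are arbitrary): parity generation runs on the client, while the rebuild at the end is the rare path left to the I/O nodes.

```python
import numpy as np

def xor_parity(fragments):
    """Vectorized XOR parity over equal-sized fragments, computed client-side."""
    parity = np.zeros_like(fragments[0])
    for frag in fragments:
        parity ^= frag
    return parity

frags = [np.random.randint(0, 256, 4096, dtype=np.uint8) for _ in range(4)]
parity = xor_parity(frags)

# Recovery (sporadic, handled by I/O nodes): XOR of parity and survivors
# regenerates the lost fragment.
lost, survivors = frags[0], frags[1:]
rebuilt = xor_parity(survivors + [parity])
assert np.array_equal(rebuilt, lost)
```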
10. Log Structured Storage and Distributed Hash Table
11. How to Build Large Enough Packets?
12. Aggregation and Network Bandwidth
(Figure: measured InfiniBand bandwidth, MB/s, vs. message size for EDR and FDR; EDR theoretical limit 100 Gb/s, with the 50% and 90% efficiency points marked.)
13. Client Side: Write Aggregation of Small I/O
• 64 KB buffering: 50% efficiency
• 1 MB buffering needed for 90% efficiency
• Pack multiple I/O fragments (data and metadata) into a single network message:
large enough to optimize bandwidth, with the ability to aggregate fragments
from different files / I/O requests to ease coalescing (a minimal sketch follows)
● Client caching is affordable:
● KNL: up to 384 GB of memory; 1% → ~4 GB
● 1% of KNL memory buys 90% of network bandwidth
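A minimal client-side aggregator under these numbers (the fragment framing and names are assumptions, not IME's actual wire format):

```python
import struct

TARGET = 1 * 1024 * 1024  # ~90% of link efficiency per the measurements above

class WriteAggregator:
    """Packs small I/O fragments from any file into ~1 MB network messages."""
    def __init__(self, send_message):
        self.send_message = send_message
        self.buf = bytearray()

    def add(self, file_id: int, offset: int, data: bytes) -> None:
        # Small header so the server can scatter fragments back: (file, offset, length)
        self.buf += struct.pack("<QQI", file_id, offset, len(data)) + data
        if len(self.buf) >= TARGET:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send_message(bytes(self.buf))  # one large, bandwidth-friendly message
            self.buf = bytearray()
```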
14. Storage Units Led to a Log Structured Storage
'Data units' are generated on the client, with content that is aligned or not.
Despite the qualitative difference in their contents, they are processed identically on the storage medium.
(Figure: sequential and non-sequential data units, stored the same way.)
15. Log Structure Removes Artificial Lock Requirements
(Figure: IOR performance vs. I/O size from 4 KB to 512 KB, comparing PFS and IME on a single shared file.)
John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. 2009. PLFS: A Checkpoint Filesystem for Parallel Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09).
IOR interleaved access on a single shared file: false sharing impact
16. An Example of Small Interleaved I/O on Shared File
Source: Storage Models: Past, Present, and Future. Dries Kimpe and Robert Ross, Argonne National Laboratory
17. Client Side: Pattern Expression for Read-Ahead
The whole purpose of the I/O proxy is to linearize resource pressure:
• Write path: coalescing and buffering
• Feeding a client: prefetching
● Stride detection is mandatory to build efficient read network buffers (see the sketch after this list)
• Feeding the application: pre-staging the data (e.g. the mesh file)
● Pre-staging is not transparent and implies better articulation between tools
● The I/O proxy should be able to pool read requests in order to perform pre-staging
→ Express read-ahead as a pattern, to build a network-friendly payload
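A minimal stride detector of the kind this slide calls mandatory (window size and names are assumptions):

```python
def detect_stride(offsets, window=3):
    """Return the constant stride of the most recent accesses, or None.
    A detected stride lets the proxy build one large read-ahead request
    instead of many small ones."""
    if len(offsets) < window + 1:
        return None
    deltas = [b - a for a, b in zip(offsets[-window - 1:], offsets[-window:])]
    return deltas[0] if len(set(deltas)) == 1 and deltas[0] != 0 else None

# Four reads, 64 KB apart: prefetch offsets[-1] + k * 65536 next.
print(detect_stride([0, 65536, 131072, 196608]))  # 65536
```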
18. From Fault Tolerance to Tolerance of Load Imbalance
19. Dealing with System Complexity
Sporadic I/O traffic leads to difficult routing.
(Figure: network topology with an undetermined route choice.)
Courtesy P. Brighten Godfrey, U. Illinois
20. NSCC / A*STAR: Storage Architecture
(Architecture overview:)
● 1 PF compute cluster on an EDR InfiniBand network
● I/O acceleration with DDN Infinite Memory Engine (IME) at 500 GB/s
● DDN EXAScaler (Lustre) for scratch: 4 PB at 200 GB/s
● DDN GRIDScaler for home & nearline: 3.5 PB at 100 GB/s
● 5 PB DDN WOS object storage archive, reached through a GS-WOS bridge, WOS over 10 GbE
● NAS gateways & data transfer nodes; PFS stats collection & monitoring
● Remote login nodes at NUS and NTU, connected via MetroX
21. Exascale as a System of Systems
I/O proxies act as traffic aggregators, so routing becomes easier.
(Figure: multicore, manycore, and GPU nodes funneling through I/O proxies to storage.)
22. Network is Non-Deterministic: QoS
(Figure: with adaptive routing the heuristic learns "quickly"; roughly 2x performance is lost with non-adaptive routing.)
→ Leverage the fault-tolerance mechanism to ensure QoS
23. Self-Monitoring System: from Fault Tolerance to Tolerance of Load Imbalance
● Numerical simulation
→ code architecture based on time-steps
→ cyclic I/O
→ multi-dimensional data mapped to a 1D address space
● Fault tolerance
→ checkpoint/restart
→ serialization of important data structures
24. Non-Uniform Bandwidth Pressure
1) Checkpointing must take less than 6 minutes per hour
2) Checkpointing means draining half of system memory
Pre-exascale system: 4 PB of memory → bandwidth requirement of 5.6 TB/s
Oak Ridge National Laboratory: Teng Wang, Weikuan Yu et al., "An Efficient Distributed Burst Buffer for Linux", LUG 2014
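The 5.6 TB/s figure follows directly from those two constraints, draining half of 4 PB within 6 minutes:

$$
\frac{4\,\text{PB}/2}{6 \times 60\,\text{s}} = \frac{2 \times 10^{15}\,\text{B}}{360\,\text{s}} \approx 5.6\,\text{TB/s}
$$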
25. 2-D Patterns: temporal and spatial locality
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
(Figure: memory address, one dot per access, plotted against time.)
26. From Monitoring to Orchestration
(Screenshot: DDN performance tool, 2016.)
27. Conclusion: Storage Evolution
Storage is getting closer to the CPU:
● mechanically, the same needs will arise
● tools will converge
● access latency puts pressure on software design
28. System of Systems: Think Out of the Box!
● Configurable
● Inter-operable
● (Self-)observable
e.g. built-in profiling, orchestration by the job scheduler
● Sustained performance vs. peak: QoS
29. Wopsss.org
● Jointly with ISC, Frankfurt, Germany, June 23, 2016
● All accepted papers will be published in the proceedings by Springer
● Extended versions of the best papers will be published in the ACM SIGOPS journal
Jalil Boukhobza, Univ. Bretagne Occidentale, France
Philippe Deniel, CEA/DIF, France
Massimo Lamanna, CERN, Switzerland
Pedro Javier García, University of Castilla-La Mancha, Spain
Allen D. Malony, University of Oregon, USA
30.
Merci !
Thank you !
Grazie
Gracias
спасибо
ありがとう
谢谢
31. Why Burst Buffers?
99% of the time, the I/O subsystem is stressed below 30% of its peak bandwidth;
70% of the time, the system is stressed below 5% of its peak bandwidth.
Argonne National Laboratory: P. Carns, K. Harms et al., "Understanding and Improving Computational Science Storage Access through Continuous Characterization", 2011
32. What is IME?
Distributed Virtually Shared Coherent Array of SSDs
(Figure panels: hardware tiering; system architecture, with IME as the acceleration tier.)
33. Rebuilding data
Distribution matrix × data = parity group:

$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & g_1 & g_1^2 & g_1^3 \\
1 & g_2 & g_2^2 & g_2^3 \\
1 & g_3 & g_3^2 & g_3^3
\end{pmatrix}
\times
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \\ D_3 \end{pmatrix}
=
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \\ D_3 \\ P_1 \\ P_2 \\ P_3 \end{pmatrix}
$$
By construction, the distribution matrix has a crucial property: any 4 of its rows form an invertible matrix, so the data can be rebuilt from any 4 surviving units.
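A toy demonstration of that property over the reals (production erasure codes work in a finite field such as GF(2^8); numpy and the chosen g values are illustrative):

```python
import numpy as np

# Distribution matrix: identity on top, Vandermonde rows [1, g, g^2, g^3] below.
g = np.array([2.0, 3.0, 4.0])
M = np.vstack([np.eye(4), np.vander(g, 4, increasing=True)])   # 7 x 4

data = np.array([5.0, 7.0, 11.0, 13.0])   # D0..D3
stored = M @ data                          # D0..D3, P1..P3 on seven servers

# Lose D1, D3 and P2: the four surviving rows still form an invertible matrix.
survivors = [0, 2, 4, 6]                   # rows for D0, D2, P1, P3
rebuilt = np.linalg.solve(M[survivors], stored[survivors])
assert np.allclose(rebuilt, data)
```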
34. Data Protection: Declustered RAID
(Figure: parity groups declustered across Server 1, Server 2, ..., Server N.)
Rebuild bandwidth exceeds the nominal bandwidth of an individual disk.
The larger the system, the lower the recovery cost (a back-of-the-envelope check follows).
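A back-of-the-envelope version of that claim (all numbers are assumptions, not measurements): declustering spreads rebuild reads over all N-1 surviving drives, so

$$
t_{\text{rebuild}} \approx \frac{C_{\text{disk}}}{(N-1)\,b_{\text{disk}}},
\qquad
N = 100,\; C = 8\,\text{TB},\; b = 200\,\text{MB/s}
\;\Rightarrow\;
t \approx \frac{8 \times 10^{12}}{99 \times 2 \times 10^{8}} \approx 400\,\text{s},
$$

versus roughly 40,000 s (about 11 hours) if a single spare drive had to absorb the whole rebuild on its own.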