DDN: Protecting Your Data, Protecting Your Hardware

1. DDN Confidential.
Do NOT reproduce or distribute
Protecting Your Data, Protecting Your Hardware
HPC Advisory Council, Lugano, March 2016
Jean-Thomas Acquaviva, DDN
2. Corporate Status: DDN Expands its Global Network
Advanced Technical Center established, Paris, France
● 25+ R&D engineers
Technology Development Center, Pune, India
● 10+ R&D engineers
3. I/O Acceleration Layer
Distributed Virtually Shared Coherent Array of SSDs
SSD reshuffles the parameters:
● Latency ÷ 40: 4 ms → 0.1 ms
● Bandwidth × 3: 150 → 450 MB/s
● Capacity ÷ 8: 8 TB → 1 TB
● Cost × 10: $0.005/Gbit → $0.05/Gbit
What can we do with a costly, high-bandwidth, low-latency technology?
4. Write Amplification
1) Writes are done per page (4 KB)
2) A page cannot be written unless it has been erased first
3) Erases are done per block (128 KB)
→ Overwriting a page means moving the valid pages of its block to new locations (a worked example follows)
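As a back-of-the-envelope check, here is a minimal sketch of the worst case implied by these three rules (page and block sizes taken from the slide; the FTL behavior is deliberately simplified):

```python
# Worst-case write amplification when overwriting one 4 KB page
# inside a 128 KB erase block whose other pages are still valid.
PAGE = 4 * 1024                  # program (write) unit
BLOCK = 128 * 1024               # erase unit
PAGES_PER_BLOCK = BLOCK // PAGE  # 32 pages

# The FTL must relocate the 31 valid pages, then rewrite the updated one:
physical_writes = (PAGES_PER_BLOCK - 1) + 1
write_amplification = physical_writes / 1    # per single logical page write
print(f"worst-case write amplification: {write_amplification:.0f}x")  # 32x
```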
5. RAID: Read Modify Write → Write Amplification
RAID read-modify-write compounds flash-level write amplification. Every update carries a hidden cost:
● a performance cost (which can be tolerated)
● an increased rate of cell wear
Paradox: data protection accelerates SSD wear-out!
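To see how the two effects stack, a hedged sketch (the stripe geometry is illustrative, not any specific DDN layout): each small update rewrites the data block plus every parity block, and each of those device writes then pays the flash-level amplification shown on the previous slide.

```python
def device_writes_per_update(parity_blocks: int = 1) -> int:
    """Read-modify-write: one logical update rewrites the data block
    plus each parity block of the stripe."""
    return 1 + parity_blocks

FLASH_WA = 32  # worst case from the previous slide
for p in (1, 2):  # RAID-5-like, RAID-6-like layouts
    total = device_writes_per_update(p) * FLASH_WA
    print(f"{p} parity block(s): up to {total} physical page writes per update")
```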
6. Protecting Your Data, Protecting Your Hardware
● Storing data on persistent storage wears out the storage medium
● Error-recovery schemes further increase medium wear
Paradox: data protection degrades hardware lifetime
→ Software has to embrace the whole complexity of the new technology!
7. Write and Overwrite Size Should Be Consistent
• Use an elementary storage unit that matches the SSD erase-block size:
large enough to match the erase block, so that write and erase sizes are identical
→ no write amplification (a minimal sketch follows)
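A minimal sketch of that rule, assuming a hypothetical 128 KB erase block: fragments are buffered and the device only ever sees whole, aligned units, so every write covers exactly one future erase.

```python
ERASE_BLOCK = 128 * 1024  # assumed elementary unit = SSD erase-block size

class AlignedWriter:
    """Buffers fragments and emits only whole erase-block-sized units."""
    def __init__(self, device_append):
        self.device_append = device_append  # callback that appends one unit
        self.buf = bytearray()

    def write(self, fragment: bytes) -> None:
        self.buf += fragment
        while len(self.buf) >= ERASE_BLOCK:
            unit = bytes(self.buf[:ERASE_BLOCK])
            del self.buf[:ERASE_BLOCK]
            self.device_append(unit)  # write size == erase size: no RMW in the SSD
```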
8. Copy-On-Write for Data Protection
Data protection is built on top of the log basic block:
● no update in place → no write amplification
● no block fragmentation
Unmap() is sent once a whole parity group is obsolete (when IOPS pressure is low)
(Figure: a parity group = data + recovery data, each member placed on a distinct server A, B, C, D, E.)
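A toy version of this copy-on-write log (all names are illustrative): updates append a new version instead of rewriting in place, and unmap of a fully obsolete group is deferred until the device is quiet.

```python
class CowLog:
    """Append-only log of parity groups with deferred unmap."""
    def __init__(self):
        self.groups = []         # each entry: one parity group (data + recovery data)
        self.obsolete = set()    # indices of groups that became wholly obsolete

    def append(self, group: bytes) -> int:
        self.groups.append(group)            # pure append: no write amplification
        return len(self.groups) - 1

    def update(self, new_group: bytes, old_index: int) -> int:
        new_index = self.append(new_group)   # copy-on-write: new version elsewhere
        self.obsolete.add(old_index)         # old group is now reclaimable
        return new_index

    def reclaim_when_idle(self) -> None:
        """Stand-in for sending unmap() while IOPS pressure is low."""
        for i in self.obsolete:
            self.groups[i] = None
        self.obsolete.clear()
```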
9. Data Protection: Sharing the Burden
● Imbalance between compute nodes and I/O nodes
● Data protection always has to be computed
→ offloaded onto compute nodes
→ cheap if compute nodes have good vector capabilities (see the sketch after this list)
● Error recovery occurs only sporadically
→ I/O nodes handle data recovery
● Quality of data protection / recovery
→ CRC & Hamming distance
→ data scrubbing on the server side
→ declustered RAID for graceful degradation of service
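A sketch of the offload idea (numpy's vectorized XOR stands in for the compute node's SIMD units; fragment sizes are arbitrary): parity generation runs on the client, while the rebuild at the end is the rare path left to the I/O nodes.

```python
import numpy as np

def xor_parity(fragments):
    """Vectorized XOR parity over equal-sized fragments, computed client-side."""
    parity = np.zeros_like(fragments[0])
    for frag in fragments:
        parity ^= frag
    return parity

frags = [np.random.randint(0, 256, 4096, dtype=np.uint8) for _ in range(4)]
parity = xor_parity(frags)

# Recovery (sporadic, handled by I/O nodes): XOR of parity and survivors
# regenerates the lost fragment.
lost, survivors = frags[0], frags[1:]
rebuilt = xor_parity(survivors + [parity])
assert np.array_equal(rebuilt, lost)
```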
10. Log Structured Storage and Distributed Hash Table
11. How to Build Large Enough Packets?
12. Aggregation and Network Bandwidth
(Figure: measured InfiniBand bandwidth, MB/s, vs. message size for EDR and FDR; EDR theoretical limit 100 Gb/s, with the 50% and 90% efficiency points marked.)
13. Client Side: Write Aggregation of Small I/O
• 64 KB buffering: 50% efficiency
• 1 MB buffering needed for 90% efficiency
• Pack multiple I/O fragments (data and metadata) into a single network message:
large enough to optimize bandwidth, with the ability to aggregate fragments
from different files / I/O requests to ease coalescing (a minimal sketch follows)
● Client caching is affordable:
● KNL: up to 384 GB of memory; 1% → ~4 GB
● 1% of KNL memory buys 90% of network bandwidth
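A minimal client-side aggregator under these numbers (the fragment framing and names are assumptions, not IME's actual wire format):

```python
import struct

TARGET = 1 * 1024 * 1024  # ~90% of link efficiency per the measurements above

class WriteAggregator:
    """Packs small I/O fragments from any file into ~1 MB network messages."""
    def __init__(self, send_message):
        self.send_message = send_message
        self.buf = bytearray()

    def add(self, file_id: int, offset: int, data: bytes) -> None:
        # Small header so the server can scatter fragments back: (file, offset, length)
        self.buf += struct.pack("<QQI", file_id, offset, len(data)) + data
        if len(self.buf) >= TARGET:
            self.flush()

    def flush(self) -> None:
        if self.buf:
            self.send_message(bytes(self.buf))  # one large, bandwidth-friendly message
            self.buf = bytearray()
```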
14. Storage Units Led to a Log Structured Storage
'Data units' are generated on the client, with content that is aligned or not.
Despite the qualitative difference in their contents, they are processed identically on the storage medium.
(Figure: sequential and non-sequential data units, stored the same way.)
15. Log Structure Removes Artificial Lock Requirements
(Figure: IOR performance vs. I/O size from 4 KB to 512 KB, comparing PFS and IME on a single shared file.)
John Bent, Garth Gibson, Gary Grider, Ben McClelland, Paul Nowoczynski, James Nunez, Milo Polte, and Meghan Wingate. 2009. PLFS: A Checkpoint Filesystem for Parallel Applications. In Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09).
IOR interleaved access on a single shared file: false sharing impact
16. An Example of Small Interleaved I/O on Shared File
Source: Storage Models: Past, Present, and Future. Dries Kimpe and Robert Ross, Argonne National Laboratory
17. Client Side: Pattern Expression for Read-Ahead
The whole purpose of the I/O proxy is to linearize resource pressure:
• Write path: coalescing and buffering
• Feeding a client: prefetching
● Stride detection is mandatory to build efficient read network buffers (see the sketch after this list)
• Feeding the application: pre-staging the data (e.g. the mesh file)
● Pre-staging is not transparent and implies better articulation between tools
● The I/O proxy should be able to pool read requests in order to perform pre-staging
→ Express read-ahead as a pattern, to build a network-friendly payload
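A minimal stride detector of the kind this slide calls mandatory (window size and names are assumptions):

```python
def detect_stride(offsets, window=3):
    """Return the constant stride of the most recent accesses, or None.
    A detected stride lets the proxy build one large read-ahead request
    instead of many small ones."""
    if len(offsets) < window + 1:
        return None
    deltas = [b - a for a, b in zip(offsets[-window - 1:], offsets[-window:])]
    return deltas[0] if len(set(deltas)) == 1 and deltas[0] != 0 else None

# Four reads, 64 KB apart: prefetch offsets[-1] + k * 65536 next.
print(detect_stride([0, 65536, 131072, 196608]))  # 65536
```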
18. From Fault Tolerance to Tolerance of Load Imbalance
19. Dealing with System Complexity
Sporadic I/O traffic leads to difficult routing.
(Figure: network topology with an undetermined route choice.)
Courtesy P. Brighten Godfrey, U. Illinois
20. NSCC / A*STAR: Storage Architecture
(Architecture overview:)
● 1 PF compute cluster on an EDR InfiniBand network
● I/O acceleration with DDN Infinite Memory Engine (IME) at 500 GB/s
● DDN EXAScaler (Lustre) for scratch: 4 PB at 200 GB/s
● DDN GRIDScaler for home & nearline: 3.5 PB at 100 GB/s
● 5 PB DDN WOS object storage archive, reached through a GS-WOS bridge, WOS over 10 GbE
● NAS gateways & data transfer nodes; PFS stats collection & monitoring
● Remote login nodes at NUS and NTU, connected via MetroX
21. Exascale as a System of Systems
I/O proxies act as traffic aggregators, so routing becomes easier.
(Figure: multicore, manycore, and GPU nodes funneling through I/O proxies to storage.)
22. Network is Non-Deterministic: QoS
(Figure: with adaptive routing the heuristic learns "quickly"; roughly 2x performance is lost with non-adaptive routing.)
→ Leverage the fault-tolerance mechanism to ensure QoS
23. Self-Monitoring System: from Fault Tolerance to Tolerance of Load Imbalance
● Numerical simulation
→ code architecture based on time-steps
→ cyclic I/O
→ multi-dimensional data mapped to a 1D address space
● Fault tolerance
→ checkpoint/restart
→ serialization of important data structures
24. Non-Uniform Bandwidth Pressure
1) Checkpointing must take less than 6 minutes per hour
2) Checkpointing means draining half of system memory
Pre-exascale system: 4 PB of memory → bandwidth requirement of 5.6 TB/s
Oak Ridge National Laboratory: Teng Wang, Weikuan Yu et al., "An Efficient Distributed Burst Buffer for Linux", LUG 2014
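The 5.6 TB/s figure follows directly from those two constraints, draining half of 4 PB within 6 minutes:

$$
\frac{4\,\text{PB}/2}{6 \times 60\,\text{s}} = \frac{2 \times 10^{15}\,\text{B}}{360\,\text{s}} \approx 5.6\,\text{TB/s}
$$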
25. 2-D Patterns: temporal and spatial locality
Donald J. Hatfield, Jeanette Gerald: Program Restructuring for Virtual Memory. IBM Systems Journal 10(3): 168-192 (1971)
(Figure: memory address, one dot per access, plotted against time.)
26. From Monitoring to Orchestration
(Screenshot: DDN performance tool, 2016.)
27. Conclusion: Storage Evolution
Storage is getting closer to the CPU:
● mechanically, the same needs will arise
● tools will converge
● access latency puts pressure on software design
28. System of Systems: Think Out of the Box!
● Configurable
● Inter-operable
● (Self-)observable
e.g. built-in profiling, orchestration by the job scheduler
● Sustained performance vs. peak: QoS
29. Wopsss.org
● Jointly with ISC, Frankfurt, Germany, June 23, 2016
● All accepted papers will be published in the proceedings by Springer
● Extended versions of the best papers will be published in the ACM SIGOPS journal
Jalil Boukhobza, Univ. Bretagne Occidentale, France
Philippe Deniel, CEA/DIF, France
Massimo Lamanna, CERN, Switzerland
Pedro Javier García, University of Castilla-La Mancha, Spain
Allen D. Malony, University of Oregon, USA
30.
Merci !
Thank you !
Grazie
Gracias
спасибо
ありがとう
谢谢
31. Why Burst Buffers?
99% of the time, the I/O subsystem is stressed below 30% of its peak bandwidth;
70% of the time, the system is stressed below 5% of its peak bandwidth.
Argonne National Laboratory: P. Carns, K. Harms et al., "Understanding and Improving Computational Science Storage Access through Continuous Characterization", 2011
32. What is IME?
Distributed Virtually Shared Coherent Array of SSDs
(Figure panels: hardware tiering; system architecture, with IME as the acceleration tier.)
33. Rebuilding data
Distribution matrix × data = parity group:

$$
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
1 & g_1 & g_1^2 & g_1^3 \\
1 & g_2 & g_2^2 & g_2^3 \\
1 & g_3 & g_3^2 & g_3^3
\end{pmatrix}
\times
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \\ D_3 \end{pmatrix}
=
\begin{pmatrix} D_0 \\ D_1 \\ D_2 \\ D_3 \\ P_1 \\ P_2 \\ P_3 \end{pmatrix}
$$
By construction, the distribution matrix has a crucial property: any 4 of its rows form an invertible matrix, so the data can be rebuilt from any 4 surviving units.
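A toy demonstration of that property over the reals (production erasure codes work in a finite field such as GF(2^8); numpy and the chosen g values are illustrative):

```python
import numpy as np

# Distribution matrix: identity on top, Vandermonde rows [1, g, g^2, g^3] below.
g = np.array([2.0, 3.0, 4.0])
M = np.vstack([np.eye(4), np.vander(g, 4, increasing=True)])   # 7 x 4

data = np.array([5.0, 7.0, 11.0, 13.0])   # D0..D3
stored = M @ data                          # D0..D3, P1..P3 on seven servers

# Lose D1, D3 and P2: the four surviving rows still form an invertible matrix.
survivors = [0, 2, 4, 6]                   # rows for D0, D2, P1, P3
rebuilt = np.linalg.solve(M[survivors], stored[survivors])
assert np.allclose(rebuilt, data)
```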
34. Data Protection: Declustered RAID
(Figure: parity groups declustered across Server 1, Server 2, ..., Server N.)
Rebuild bandwidth exceeds the nominal bandwidth of an individual disk.
The larger the system, the lower the recovery cost (a back-of-the-envelope check follows).
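A back-of-the-envelope version of that claim (all numbers are assumptions, not measurements): declustering spreads rebuild reads over all N-1 surviving drives, so

$$
t_{\text{rebuild}} \approx \frac{C_{\text{disk}}}{(N-1)\,b_{\text{disk}}},
\qquad
N = 100,\; C = 8\,\text{TB},\; b = 200\,\text{MB/s}
\;\Rightarrow\;
t \approx \frac{8 \times 10^{12}}{99 \times 2 \times 10^{8}} \approx 400\,\text{s},
$$

versus roughly 40,000 s (about 11 hours) if a single spare drive had to absorb the whole rebuild on its own.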