SlideShare a Scribd company logo
1 of 35
Scalable Multi-Access Flash
Store for Big Data Analytics
Sang-Woo Jun, Ming Liu
Kermin Fleming*, Arvind
Massachusetts Institute of Technology
*Intel, work while at MIT
February 27, 2014 1
This work is funded by Quanta and Samsung.
We also thank Xilinx for their generous hardware donations
Big Data Analytics
Analysis of previously unimaginable amount of
data provides deep insight
Examples:
 Analyzing twitter information to predict flu outbreaks
weeks before the CDC
 Analyzing personal genome to determine
predisposition to disease
 Data mining scientific datasets to extract accurate
models
February 27, 2014 2
Likely to be the biggest economic driver for the IT
industry for the next 5 or more years
Big Data Analytics
Data so large and complex, that traditional
processing methods are no longer effective
 Too large to fit reasonably in fast RAM
 Random access intensive, making prefetching and
caching ineffective
Data is often stored in the secondary storage
of multiple machines on a cluster
 Storage system and network performance become
first-order concerns
February 27, 2014 3
Designing Systems for Big Data
Due to terrible random read latency of disks,
analytic systems were optimized for reducing
disk access, or making them sequential
Widespread adoption of flash storage is
changing this picture
February 27, 2014 4
0
500
1000
1500
2000
2500
Disk 1Gb Ethernet DRAM
Latency of system components (μs)
0
500
1000
1500
2000
2500
Flash 1Gb Ethernet DRAM
Latency of system components (μs)
0
20
40
60
80
100
120
140
Flash 1Gb Ethernet DRAM
Latency of system components (μs)
Modern systems need to balance all aspects of the
system to gain optimal performance
rescaled
Designing Systems for Big Data
Network Optimizations
 High-performance network links (e.g. Infiniband)
Storage Optimizations
 High-performance storage devices (e.g. FusionIO)
 In-Store computing (e.g. Smart SSD)
 Network-attached storage device (e.g. QuickSAN)
Software Optimizations
 Hardware implementations of network protocols
 Flash-aware distributed file systems (e.g. CORFU)
 FPGA or GPU based accelerators (e.g. Netezza)
February 27, 2014 5
Cross-layer optimization is required for
optimal performance
Current Architectural Solution
Previously insignificant inter-component
communications become significant
When accessing remote storage:
 Read from flash to RAM
 Transfer from local RAM to remote RAM
 Transfer to accelerator and back
February 27, 2014 6
CPU
RAM
FusionIO
Infiniband
FPGA
CPU
RAM
FusionIO
Infiniband
FPGA
Motherboard Motherboard
Hardware and software latencies are additive
Potential bottleneck at multiple transport locations
BlueDBM Architecture Goals
Low-latency high-bandwidth access to scalable
distributed storage
Low-latency access to FPGA acceleration
Many interesting problems do not fit in DRAM
but can fit in flash
February 27, 2014 7
CPU
RAM
Flash
Network
FPGA
Motherboard
CPU
RAM
Flash
Network
FPGA
Motherboard
BlueDBM System Architecture
FPGA manages flash, network and host link
February 27, 2014 8
PCIe
…
Node 20
1TB Flash
Programmable
Controller
(Xilinx VC707)
Host PC
Ethernet
Expected to be constructed by summer, 2014
Node 19
1TB Flash
Programmable
Controller
(Xilinx VC707)
Host PC
Node 0
1TB Flash
Programmable
Controller
(Xilinx VC707)
Host PC
A proof-of-concept prototype is functional!
Controller Network
~8 High-speed (12Gb/s)
serial links (GTX)
Inter-Controller Network
BlueDBM provides a dedicated storage
network between controllers
 Low-latency high-bandwidth serial links (i.e. GTX)
 Hardware implementation of a simple packet-
switched protocol allows flexible topology
 Extremely low protocol overhead (~0.5μs/hop)
February 27, 2014 9
Controller Network
~8 High-speed (12Gb/s)
serial links (GTX)
…
1TB Flash
Programmable
Controller
(Xilinx VC707)
1TB Flash
Programmable
Controller
(Xilinx VC707)
1TB Flash
Programmable
Controller
(Xilinx VC707)
Inter-Controller Network
BlueDBM provides a dedicated storage
network between controllers
 Low-latency high-bandwidth serial links (i.e. GTX)
 Hardware implementation of a simple packet-
switched protocol allows flexible topology
 Extremely low protocol overhead (~0.5μs/hop)
February 27, 2014 10
…
1TB Flash
Programmable
Controller
(Xilinx VC707)
1TB Flash
Programmable
Controller
(Xilinx VC707)
1TB Flash
Programmable
Controller
(Xilinx VC707)
Protocol supports virtual channels to ease development
High-speed (12Gb/s)
serial link (GTX)
Controller as a
Programmable Accelerator
BlueDBM provides a platform for implementing
accelerators on the datapath of storage device
 Adds almost no latency by removing the need for
additional data transport
 Compression or filtering accelerators can allow more
throughput than hostside link (i.e. PCIe) allows
Example: Word counting in a document
 Counting is done inside accelerators, and only the
result need to be transported to host
February 27, 2014 11
Two-Part Implementation of
Accelerators
Accelerators are implemented both before and
after network transport
 Local accelerators parallelize across nodes
 Global accelerators have global view of data
Virtualized interface to storage and network
provided in these locations
February 27, 2014 12
Host
Global Accelerator
Local Accelerator Local Accelerator
Flash Flash
e.g., count words
e.g., aggregate count
Network
e.g., count words
Prototype System
(Proof-of-Concept)
Prototype system built using Xilinx ML605 and
a custom-built flash daughterboard
 16GB capacity, 80MB/s bandwidth per node
 Network over GTX high-speed transceivers
February 27, 2014 13
Flash board is 5 years old, from an earlier project
Prototype System
(Proof-of-Concept)
February 27, 2014 14
Hub nodes
Prototype Network Performance
0
1
2
3
4
5
6
0
50
100
150
200
250
300
350
400
450
500
0 2 4 6 8 10
Latency
(us)
Throughput
(MB/s)
Network Hops
Throughput (MB/s)
Latency (us)
February 27, 2014 15
Prototype Bandwidth Scaling
0
50
100
150
200
250
300
350
0 2 4
Bandwidth
(MB/s)
Node Count
Prototype performance
February 27, 2014 16
0
1
2
3
4
5
6
0 2 4
Bandwidth
(GB/s)
Node Count
Prototype performance
rescaled
0
1
2
3
4
5
6
0 2 4
Bandwidth
(GB/s)
Node Count
Prototype
performance
Projected System
performance
Maximum PCIe bandwidth
Only reachable
by accelerator
Prototype Latency Scaling
0
20
40
60
80
100
120
0 2 3
Latency
(us)
Network Hops
Software
Controller+Network
Flash Chip
February 27, 2014 17
Benefits of in-path acceleration
February 27, 2014 18
Less software overhead
Not bottlenecked by flash-host link
Experimented using hash counting
 In-path: Words are counted on the way out
 Off-path: Data copied back to FPGA for counting
 Software: Counting done in software
We’re actually building both configurations
0
20
40
60
80
100
120
140
In-datapath Off-datapath Software
Bandwidth
(MB/s)
Work In Progress
Improved flash controller and File System
 Wear leveling, garbage collection
 Distributed File System as accelerator
Application-specific optimizations
 Optimized data routing/network protocol
 Storage/Link Compression
Database acceleration
 Offload database operations to FPGA
 Accelerated filtering, aggregation and compression
could improve performance
Application exploration
 Including medical applications and Hadoop
February 27, 2014 19
Thank you
Takeaway: Analytics systems can be made
more efficient by using reconfigurable fabric to
manage data routing between system
components
February 27, 2014 20
FPGA
CPU
+
RAM
Network
Flash
FPGA
CPU
+
RAM
Network
Flash
FPGA Resource Usage
February 27, 2014 21
New System Improvements
A new, 20-node cluster will exist by summer
 New flash board with 1TB of storage at 1.6GB/s
 Total 20TBs of flash storage
 8 serial links will be pinned out as SATA ports,
allowing flexible topology
 12Gb/s per link GTX
 Rack-mounted for ease of handling
 Xilinx VC707 FPGA boards
February 27, 2014 22
Rackmount
Node 0
Node 1
Node 2
Node 3
Node 4
Node 5
…
Example: Data driven Science
Massive amounts of scientific data is being
collected
 Instead of building an hypothesis and collecting
relevant data, extract models from massive data
University of Washington N-body workshop
 Multiple snapshots of particles in simulated universe
 Interactive analysis provides insights to scientists
 ~100GBs, not very large by today’s standards
February 27, 2014 23
“Visualize clusters of particles moving faster than X”
“Visualize particles that were destroyed,
After being hotter than Y”
Workload Characteristics
Large volume of data to examine
Access pattern is unpredictable
 Often only a random subset of snapshots are
accessed
February 27, 2014 24
Row
Row
Row
Row
Row
Row
Row
Rows of interest
“Hotter than Y”
Row
Row
Row
Row
Row
Row
Row
Snapshot 1 Snapshot 2
Access at
random intervals
“Destroyed?”
Sequential
scan
Constraints in System Design
Latency is important in a random-access
intensive scenario
Components of a computer system has widely
different latency characteristics
February 27, 2014 25
0
500
1000
1500
2000
2500
Disk 1Gb
Ethernet
Flash 10Gb
Ethernet
Infiniband DRAM
Latency of system components
0
20
40
60
80
100
120
140
1Gb
Ethernet
Flash 10Gb
Ethernet
Infiniband DRAM
Latency of system components
“even when DRAM in the cluster is ~40% the size of the
entire database, [... we could only process] 900 reads per
second, which is much worse than the 30,000”
“Using HBase with ioMemory” -FusionIO
Constraints in System Design
Latency is important in a random-access
intensive scenario
Components of a computer system has widely
different latency characteristics
February 27, 2014 26
0
500
1000
1500
2000
2500
Disk 1Gb
Ethernet
Flash 10Gb
Ethernet
Infiniband DRAM
Latency of system components
“even when DRAM in the cluster is ~40% the size of the
entire database, [... we could only process] 900 reads per
second, which is much worse than the 30,000”
“Using HBase with ioMemory” -FusionIO
Existing Solutions - 1
Baseline: Single PC with DRAM and Disk
Size of dataset exceeds normal DRAM capacity
 Data on disk is accessed randomly
 Disk seek time bottlenecks performance
Solution 1: More DRAM
 Expensive, Limited capacity
Solution 2: Faster storage
 Flash storage provides fast random access
 PCIe flash cards like FusionIO
Solution 3: More machines
February 27, 2014 27
Existing Solutions - 2
Cluster of machines over a network
Network becomes a bottleneck with enough
machines
 Network fabric is slow
 Software stack adds overhead
Fitting everything on DRAM is expensive
 Trade-off between network and storage bottleneck
Solution 1: High-performance network such as
Infiniband
February 27, 2014 28
Existing Solutions - 3
Cluster of expensive machines
 Lots of DRAM
 FusionIO storage
 Infiniband network
Now computation might become the
bottleneck
Solution 1: FPGA-based Application-specific
Accelerators
February 27, 2014 29
Previous Work
Software Optimizations
 Hardware implementations of network protocols
 Flash-aware distributed file systems (e.g. CORFU)
 Accelerators on FPGAs and GPUs
Storage Optimizations
 High-performance storage devices (e.g. FusionIO)
Network Optimizations
 High-performance network links (e.g. Infiniband)
February 27, 2014 30
0
50
100
150
Flash FusionIO 1Gb
Ethernet
10Gb
Ethernet
Infiniband
Latency of system components (us)
BlueDBM Goals
BlueDBM is a distributed storage architecture
with the following goals:
 Low-latency high-bandwidth storage access
 Low-latency hardware acceleration
 Scalability via efficient network protocol & topology
 Multi-accessibility via efficient routing and controller
 Application compatibility via normal storage interface
February 27, 2014 31
CPU
RAM
Flash
Network
FPGA
Motherboard
BlueDBM Node Architecture
Address mapper determines the location of the
requested page
 Nodes must agree on mapping
 Currently programmatically defined, to maximize
parallelism by striping across nodes
February 27, 2014 32
Prototype System
Topology of prototype is restricted because
ML605 has only one SMA port
 Network requires hub nodes
February 27, 2014 33
Controller as a
Programmable Accelerator
BlueDBM provides a platform for implementing
accelerators on the datapath of storage device
 Adds almost no latency by removing the need for
additional data transport
 Compression or filtering accelerators can allow more
throughput than hostside link (i.e. PCIe) allows
Example: Word counting in a document
 Controller can be programmed to count words and
only return the result
February 27, 2014 34
Accelerator
Flash
Host Server
Accelerator
Flash
Host Server
We’re actually building both systems
Accelerator Configuration
Comparisons
February 27, 2014 35
0
20
40
60
80
100
120
140
In-datapath Off-datapath Software
Bandwidth
(MB/s)

More Related Content

Similar to fpga2014-wjun.pptx

seed block algorithm
seed block algorithmseed block algorithm
seed block algorithm
Dipak Badhe
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
anil0878
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
solarisyourep
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
Perforce
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
Vlad Ponomarev
 

Similar to fpga2014-wjun.pptx (20)

seed block algorithm
seed block algorithmseed block algorithm
seed block algorithm
 
I understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdfI understand that physics and hardware emmaded on the use of finete .pdf
I understand that physics and hardware emmaded on the use of finete .pdf
 
Network Engineering for High Speed Data Sharing
Network Engineering for High Speed Data SharingNetwork Engineering for High Speed Data Sharing
Network Engineering for High Speed Data Sharing
 
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMFGestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
Gestione gerarchica dei dati con SUSE Enterprise Storage e HPE DMF
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
Presentation architecting virtualized infrastructure for big data
Presentation   architecting virtualized infrastructure for big dataPresentation   architecting virtualized infrastructure for big data
Presentation architecting virtualized infrastructure for big data
 
(Speaker Notes Version) Architecting An Enterprise Storage Platform Using Obj...
(Speaker Notes Version) Architecting An Enterprise Storage Platform Using Obj...(Speaker Notes Version) Architecting An Enterprise Storage Platform Using Obj...
(Speaker Notes Version) Architecting An Enterprise Storage Platform Using Obj...
 
optimizing_ceph_flash
optimizing_ceph_flashoptimizing_ceph_flash
optimizing_ceph_flash
 
Storage Area Networks Unit 2 Notes
Storage Area Networks Unit 2 NotesStorage Area Networks Unit 2 Notes
Storage Area Networks Unit 2 Notes
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositories
 
Hadoop File system (HDFS)
Hadoop File system (HDFS)Hadoop File system (HDFS)
Hadoop File system (HDFS)
 
[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction[IC Manage] Workspace Acceleration & Network Storage Reduction
[IC Manage] Workspace Acceleration & Network Storage Reduction
 
Walk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoCWalk Through a Software Defined Everything PoC
Walk Through a Software Defined Everything PoC
 
Storage Area Networks Unit 1 Notes
Storage Area Networks Unit 1 NotesStorage Area Networks Unit 1 Notes
Storage Area Networks Unit 1 Notes
 
Architecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentationArchitecting virtualized infrastructure for big data presentation
Architecting virtualized infrastructure for big data presentation
 
Nas Ashok1
Nas Ashok1Nas Ashok1
Nas Ashok1
 
Fortissimo converged super_converged_hyper
Fortissimo converged super_converged_hyperFortissimo converged super_converged_hyper
Fortissimo converged super_converged_hyper
 
Actian Vector Whitepaper
 Actian Vector Whitepaper Actian Vector Whitepaper
Actian Vector Whitepaper
 
HPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journeyHPC and cloud distributed computing, as a journey
HPC and cloud distributed computing, as a journey
 
Challenges in Managing IT Infrastructure
Challenges in Managing IT InfrastructureChallenges in Managing IT Infrastructure
Challenges in Managing IT Infrastructure
 

Recently uploaded

[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
daisycvs
 
Buy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
Buy best abortion pills Doha [+966572737505 | Planned cytotec QatarBuy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
Buy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
samsungultra782445
 
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
mestb
 
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
mestb
 
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
mestb
 
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
drjose256
 
Balancing of rotating bodies questions.pptx
Balancing of rotating bodies questions.pptxBalancing of rotating bodies questions.pptx
Balancing of rotating bodies questions.pptx
joshuaclack73
 
Vibration of Continuous Systems.pjjjjjjjjptx
Vibration of Continuous Systems.pjjjjjjjjptxVibration of Continuous Systems.pjjjjjjjjptx
Vibration of Continuous Systems.pjjjjjjjjptx
joshuaclack73
 
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
mestb
 
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
mestb
 

Recently uploaded (15)

Matrix Methods.pptxhhhhhhhhhhhhhhhhhhhhh
Matrix Methods.pptxhhhhhhhhhhhhhhhhhhhhhMatrix Methods.pptxhhhhhhhhhhhhhhhhhhhhh
Matrix Methods.pptxhhhhhhhhhhhhhhhhhhhhh
 
NO1 Best Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Addre...
NO1 Best Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Addre...NO1 Best Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Addre...
NO1 Best Amil Baba In Karachi Kala Jadu In Karachi Amil baba In Karachi Addre...
 
[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
[[Jeddah]] IN RIYADH +2773-7758557]] Abortion pills in Jeddah Cytotec in Riya...
 
Buy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
Buy best abortion pills Doha [+966572737505 | Planned cytotec QatarBuy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
Buy best abortion pills Doha [+966572737505 | Planned cytotec Qatar
 
NO1 Qari Rohani Amil In Islamabad Amil Baba in Rawalpindi Kala Jadu Amil In R...
NO1 Qari Rohani Amil In Islamabad Amil Baba in Rawalpindi Kala Jadu Amil In R...NO1 Qari Rohani Amil In Islamabad Amil Baba in Rawalpindi Kala Jadu Amil In R...
NO1 Qari Rohani Amil In Islamabad Amil Baba in Rawalpindi Kala Jadu Amil In R...
 
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(AUT毕业证书)奥克兰理工大学毕业证成绩单本科硕士学位证留信学历认证
 
NO1 Pakistan Best vashikaran specialist in UK USA UAE London Dubai Canada Ame...
NO1 Pakistan Best vashikaran specialist in UK USA UAE London Dubai Canada Ame...NO1 Pakistan Best vashikaran specialist in UK USA UAE London Dubai Canada Ame...
NO1 Pakistan Best vashikaran specialist in UK USA UAE London Dubai Canada Ame...
 
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(USYD毕业证书)悉尼大学毕业证成绩单本科硕士学位证留信学历认证
 
Cyber-Security-power point presentation.
Cyber-Security-power point presentation.Cyber-Security-power point presentation.
Cyber-Security-power point presentation.
 
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(SUT毕业证书)斯威本科技大学毕业证成绩单本科硕士学位证留信学历认证
 
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
Top^Clinic Soweto ^%[+27838792658_termination in florida_Safe*Abortion Pills ...
 
Balancing of rotating bodies questions.pptx
Balancing of rotating bodies questions.pptxBalancing of rotating bodies questions.pptx
Balancing of rotating bodies questions.pptx
 
Vibration of Continuous Systems.pjjjjjjjjptx
Vibration of Continuous Systems.pjjjjjjjjptxVibration of Continuous Systems.pjjjjjjjjptx
Vibration of Continuous Systems.pjjjjjjjjptx
 
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
如何办理(OP毕业证书)奥塔哥理工学院毕业证成绩单本科硕士学位证留信学历认证
 
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UVic毕业证书)维多利亚大学毕业证成绩单本科硕士学位证留信学历认证
 

fpga2014-wjun.pptx

  • 1. Scalable Multi-Access Flash Store for Big Data Analytics Sang-Woo Jun, Ming Liu Kermin Fleming*, Arvind Massachusetts Institute of Technology *Intel, work while at MIT February 27, 2014 1 This work is funded by Quanta and Samsung. We also thank Xilinx for their generous hardware donations
  • 2. Big Data Analytics Analysis of previously unimaginable amount of data provides deep insight Examples:  Analyzing twitter information to predict flu outbreaks weeks before the CDC  Analyzing personal genome to determine predisposition to disease  Data mining scientific datasets to extract accurate models February 27, 2014 2 Likely to be the biggest economic driver for the IT industry for the next 5 or more years
  • 3. Big Data Analytics Data so large and complex, that traditional processing methods are no longer effective  Too large to fit reasonably in fast RAM  Random access intensive, making prefetching and caching ineffective Data is often stored in the secondary storage of multiple machines on a cluster  Storage system and network performance become first-order concerns February 27, 2014 3
  • 4. Designing Systems for Big Data Due to terrible random read latency of disks, analytic systems were optimized for reducing disk access, or making them sequential Widespread adoption of flash storage is changing this picture February 27, 2014 4 0 500 1000 1500 2000 2500 Disk 1Gb Ethernet DRAM Latency of system components (μs) 0 500 1000 1500 2000 2500 Flash 1Gb Ethernet DRAM Latency of system components (μs) 0 20 40 60 80 100 120 140 Flash 1Gb Ethernet DRAM Latency of system components (μs) Modern systems need to balance all aspects of the system to gain optimal performance rescaled
  • 5. Designing Systems for Big Data Network Optimizations  High-performance network links (e.g. Infiniband) Storage Optimizations  High-performance storage devices (e.g. FusionIO)  In-Store computing (e.g. Smart SSD)  Network-attached storage device (e.g. QuickSAN) Software Optimizations  Hardware implementations of network protocols  Flash-aware distributed file systems (e.g. CORFU)  FPGA or GPU based accelerators (e.g. Netezza) February 27, 2014 5 Cross-layer optimization is required for optimal performance
  • 6. Current Architectural Solution Previously insignificant inter-component communications become significant When accessing remote storage:  Read from flash to RAM  Transfer from local RAM to remote RAM  Transfer to accelerator and back February 27, 2014 6 CPU RAM FusionIO Infiniband FPGA CPU RAM FusionIO Infiniband FPGA Motherboard Motherboard Hardware and software latencies are additive Potential bottleneck at multiple transport locations
  • 7. BlueDBM Architecture Goals Low-latency high-bandwidth access to scalable distributed storage Low-latency access to FPGA acceleration Many interesting problems do not fit in DRAM but can fit in flash February 27, 2014 7 CPU RAM Flash Network FPGA Motherboard CPU RAM Flash Network FPGA Motherboard
  • 8. BlueDBM System Architecture FPGA manages flash, network and host link February 27, 2014 8 PCIe … Node 20 1TB Flash Programmable Controller (Xilinx VC707) Host PC Ethernet Expected to be constructed by summer, 2014 Node 19 1TB Flash Programmable Controller (Xilinx VC707) Host PC Node 0 1TB Flash Programmable Controller (Xilinx VC707) Host PC A proof-of-concept prototype is functional! Controller Network ~8 High-speed (12Gb/s) serial links (GTX)
  • 9. Inter-Controller Network BlueDBM provides a dedicated storage network between controllers  Low-latency high-bandwidth serial links (i.e. GTX)  Hardware implementation of a simple packet- switched protocol allows flexible topology  Extremely low protocol overhead (~0.5μs/hop) February 27, 2014 9 Controller Network ~8 High-speed (12Gb/s) serial links (GTX) … 1TB Flash Programmable Controller (Xilinx VC707) 1TB Flash Programmable Controller (Xilinx VC707) 1TB Flash Programmable Controller (Xilinx VC707)
  • 10. Inter-Controller Network BlueDBM provides a dedicated storage network between controllers  Low-latency high-bandwidth serial links (i.e. GTX)  Hardware implementation of a simple packet- switched protocol allows flexible topology  Extremely low protocol overhead (~0.5μs/hop) February 27, 2014 10 … 1TB Flash Programmable Controller (Xilinx VC707) 1TB Flash Programmable Controller (Xilinx VC707) 1TB Flash Programmable Controller (Xilinx VC707) Protocol supports virtual channels to ease development High-speed (12Gb/s) serial link (GTX)
  • 11. Controller as a Programmable Accelerator BlueDBM provides a platform for implementing accelerators on the datapath of storage device  Adds almost no latency by removing the need for additional data transport  Compression or filtering accelerators can allow more throughput than hostside link (i.e. PCIe) allows Example: Word counting in a document  Counting is done inside accelerators, and only the result need to be transported to host February 27, 2014 11
  • 12. Two-Part Implementation of Accelerators Accelerators are implemented both before and after network transport  Local accelerators parallelize across nodes  Global accelerators have global view of data Virtualized interface to storage and network provided in these locations February 27, 2014 12 Host Global Accelerator Local Accelerator Local Accelerator Flash Flash e.g., count words e.g., aggregate count Network e.g., count words
  • 13. Prototype System (Proof-of-Concept) Prototype system built using Xilinx ML605 and a custom-built flash daughterboard  16GB capacity, 80MB/s bandwidth per node  Network over GTX high-speed transceivers February 27, 2014 13 Flash board is 5 years old, from an earlier project
  • 15. Prototype Network Performance 0 1 2 3 4 5 6 0 50 100 150 200 250 300 350 400 450 500 0 2 4 6 8 10 Latency (us) Throughput (MB/s) Network Hops Throughput (MB/s) Latency (us) February 27, 2014 15
  • 16. Prototype Bandwidth Scaling 0 50 100 150 200 250 300 350 0 2 4 Bandwidth (MB/s) Node Count Prototype performance February 27, 2014 16 0 1 2 3 4 5 6 0 2 4 Bandwidth (GB/s) Node Count Prototype performance rescaled 0 1 2 3 4 5 6 0 2 4 Bandwidth (GB/s) Node Count Prototype performance Projected System performance Maximum PCIe bandwidth Only reachable by accelerator
  • 17. Prototype Latency Scaling 0 20 40 60 80 100 120 0 2 3 Latency (us) Network Hops Software Controller+Network Flash Chip February 27, 2014 17
  • 18. Benefits of in-path acceleration February 27, 2014 18 Less software overhead Not bottlenecked by flash-host link Experimented using hash counting  In-path: Words are counted on the way out  Off-path: Data copied back to FPGA for counting  Software: Counting done in software We’re actually building both configurations 0 20 40 60 80 100 120 140 In-datapath Off-datapath Software Bandwidth (MB/s)
  • 19. Work In Progress Improved flash controller and File System  Wear leveling, garbage collection  Distributed File System as accelerator Application-specific optimizations  Optimized data routing/network protocol  Storage/Link Compression Database acceleration  Offload database operations to FPGA  Accelerated filtering, aggregation and compression could improve performance Application exploration  Including medical applications and Hadoop February 27, 2014 19
  • 20. Thank you Takeaway: Analytics systems can be made more efficient by using reconfigurable fabric to manage data routing between system components February 27, 2014 20 FPGA CPU + RAM Network Flash FPGA CPU + RAM Network Flash
  • 22. New System Improvements A new, 20-node cluster will exist by summer  New flash board with 1TB of storage at 1.6GB/s  Total 20TBs of flash storage  8 serial links will be pinned out as SATA ports, allowing flexible topology  12Gb/s per link GTX  Rack-mounted for ease of handling  Xilinx VC707 FPGA boards February 27, 2014 22 Rackmount Node 0 Node 1 Node 2 Node 3 Node 4 Node 5 …
  • 23. Example: Data driven Science Massive amounts of scientific data is being collected  Instead of building an hypothesis and collecting relevant data, extract models from massive data University of Washington N-body workshop  Multiple snapshots of particles in simulated universe  Interactive analysis provides insights to scientists  ~100GBs, not very large by today’s standards February 27, 2014 23 “Visualize clusters of particles moving faster than X” “Visualize particles that were destroyed, After being hotter than Y”
  • 24. Workload Characteristics Large volume of data to examine Access pattern is unpredictable  Often only a random subset of snapshots are accessed February 27, 2014 24 Row Row Row Row Row Row Row Rows of interest “Hotter than Y” Row Row Row Row Row Row Row Snapshot 1 Snapshot 2 Access at random intervals “Destroyed?” Sequential scan
  • 25. Constraints in System Design Latency is important in a random-access intensive scenario Components of a computer system has widely different latency characteristics February 27, 2014 25 0 500 1000 1500 2000 2500 Disk 1Gb Ethernet Flash 10Gb Ethernet Infiniband DRAM Latency of system components 0 20 40 60 80 100 120 140 1Gb Ethernet Flash 10Gb Ethernet Infiniband DRAM Latency of system components “even when DRAM in the cluster is ~40% the size of the entire database, [... we could only process] 900 reads per second, which is much worse than the 30,000” “Using HBase with ioMemory” -FusionIO
  • 26. Constraints in System Design Latency is important in a random-access intensive scenario Components of a computer system has widely different latency characteristics February 27, 2014 26 0 500 1000 1500 2000 2500 Disk 1Gb Ethernet Flash 10Gb Ethernet Infiniband DRAM Latency of system components “even when DRAM in the cluster is ~40% the size of the entire database, [... we could only process] 900 reads per second, which is much worse than the 30,000” “Using HBase with ioMemory” -FusionIO
  • 27. Existing Solutions - 1 Baseline: Single PC with DRAM and Disk Size of dataset exceeds normal DRAM capacity  Data on disk is accessed randomly  Disk seek time bottlenecks performance Solution 1: More DRAM  Expensive, Limited capacity Solution 2: Faster storage  Flash storage provides fast random access  PCIe flash cards like FusionIO Solution 3: More machines February 27, 2014 27
  • 28. Existing Solutions - 2 Cluster of machines over a network Network becomes a bottleneck with enough machines  Network fabric is slow  Software stack adds overhead Fitting everything on DRAM is expensive  Trade-off between network and storage bottleneck Solution 1: High-performance network such as Infiniband February 27, 2014 28
  • 29. Existing Solutions - 3 Cluster of expensive machines  Lots of DRAM  FusionIO storage  Infiniband network Now computation might become the bottleneck Solution 1: FPGA-based Application-specific Accelerators February 27, 2014 29
  • 30. Previous Work Software Optimizations  Hardware implementations of network protocols  Flash-aware distributed file systems (e.g. CORFU)  Accelerators on FPGAs and GPUs Storage Optimizations  High-performance storage devices (e.g. FusionIO) Network Optimizations  High-performance network links (e.g. Infiniband) February 27, 2014 30 0 50 100 150 Flash FusionIO 1Gb Ethernet 10Gb Ethernet Infiniband Latency of system components (us)
  • 31. BlueDBM Goals BlueDBM is a distributed storage architecture with the following goals:  Low-latency high-bandwidth storage access  Low-latency hardware acceleration  Scalability via efficient network protocol & topology  Multi-accessibility via efficient routing and controller  Application compatibility via normal storage interface February 27, 2014 31 CPU RAM Flash Network FPGA Motherboard
  • 32. BlueDBM Node Architecture Address mapper determines the location of the requested page  Nodes must agree on mapping  Currently programmatically defined, to maximize parallelism by striping across nodes February 27, 2014 32
  • 33. Prototype System Topology of prototype is restricted because ML605 has only one SMA port  Network requires hub nodes February 27, 2014 33
  • 34. Controller as a Programmable Accelerator BlueDBM provides a platform for implementing accelerators on the datapath of storage device  Adds almost no latency by removing the need for additional data transport  Compression or filtering accelerators can allow more throughput than hostside link (i.e. PCIe) allows Example: Word counting in a document  Controller can be programmed to count words and only return the result February 27, 2014 34 Accelerator Flash Host Server Accelerator Flash Host Server We’re actually building both systems
  • 35. Accelerator Configuration Comparisons February 27, 2014 35 0 20 40 60 80 100 120 140 In-datapath Off-datapath Software Bandwidth (MB/s)

Editor's Notes

  1. Affiliation, elliott at intel , work while at mit sponsers
  2. Start with examples, description in next slide (size, randomness < db sequential)
  3. Start with examples, description in next slide (size, randomness < db sequential)
  4. Add “re-scaling” to new chart
  5. Add request packet from cpu to cpu Loop animations?
  6. Add request packet that goes from cpu to remote fpga Loop animations!
  7. Bandwidth numbers? Topology?
  8. Virtual channel? Virtual network?
  9. Virtual channel? Virtual network?
  10. We needed hub nodes because the ML605 only has one SMA port pinned out Forced to take a switched fabric topology because ML605 has only one SMA port.
  11. new graph with expected performance Animation? Show orig, scale, show compare
  12. Describe in-path vs out-path Describe the experiment (route data back to FPGA, pipeline bubbles, pcie bottlenecks)
  13. Separate in-path to out-path into separate slide ( in prototype results ) Differentiate word counting example from prev slides