FlashEmbedding:
Storing Embedding Tables in SSD for
Large-Scale Recommender Systems
Speaker: Po-Chuan, Chen
Table of contents
• Abstract
• Introduction
• Background and motivation
• Embedding
• SSD-based Recommendation Inference
• End-to-End Inference Latency
• Performance Overhead Analysis
• Flash Embedding Design
• System overview
• Embedding Vector SSD
• MMIO Manager
• EV Translator
• EV-FC
• EV-Merger
• Embedding Vector Cache
• Putting it all together
• Implementation
• Evaluation
• Experimental setup
• Effectiveness of Proposed Techniques
• End-to-End Performance
• Sensitivity Study
• Related Study
• Conclusion
Abstract
• FlashEmbedding is a hardware/software co-design for storing embedding tables on SSDs, targeting large-scale recommendation inference on systems with limited memory capacity.
• Embedding lookup latency: 17.44x lower
• End-to-end latency: 2.89x lower
Introduction
• Recommender systems exploit categorical information while preserving the natural diversity within the features: embedding tables map each category to a unique embedding vector in a lower-dimensional embedding space.
Typically, they keep all embedding tables in memory to reduce the latency of serving user requests.
But this comes at a high memory cost.
Solution
• One promising solution to this problem is storing embedding tables on
much larger solid-state drives (SSDs).
• The performance degradation of SSD-based recommendation inference is
mainly attributed to two factors:
➢ The long I/O stack
➢ The tremendous read amplification of the block-interface SSD
Embedding vector SSD (EV-SSD)
It supports embedding vector lookup requests by taking advantage of two-stage fine-grained reading and an MMIO interface.
Embedding vector cache (EV-Cache)
It reduces accesses to EV-SSD by applying a bitmap filter and Robin Hood hashing-based LRU caches.
Using the pipeline
It further hides I/O latency by overlapping CPU computation with EV-SSD handling.
Overview : Flash Embedding Design
Embedding
A typical deep learning-based personalized recommendation model comprises a bottom MLP, a top MLP, embedding layers, and feature interaction layers.
Ref : https://arxiv.org/pdf/1707.07435.pdf
Ref : Day 29 Embeddings (Embedding Architecture) - iT 邦幫忙
SSD-based Recommendation Inference
The deep learning recommendation model (DLRM) framework and the Criteo Ad dataset are used to evaluate how well the recommender system performs.
The embedding tables are stored on the NVMe SSD attached to the AWS F1 instance.
The page cache holds frequently used embeddings and serves most embedding lookup requests, as long as it contains the requested embedding.
Note that device-layer latency increases rapidly as the amount of available memory decreases.
I/O stack costs are the primary limiter on SSD-based recommender system performance.
Thrashing / page swapping
The I/O granularity of recommender systems (embedding vectors, 128 B to 2 KB) and that of the underlying SSDs (flash page size = 4 KB) do not match, leading to significant read amplification.
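The size of this mismatch can be estimated with simple arithmetic. The sketch below assumes a 4 KB flash page and embedding vectors of 128 B to 2 KB (the two sizes quoted on the slide), and further assumes that every lookup touches a different page and that vectors do not straddle page boundaries. The function name is illustrative, not taken from the paper.

```python
import math

FLASH_PAGE_BYTES = 4 * 1024  # flash page size quoted on the slide


def read_amplification(vector_bytes: int, page_bytes: int = FLASH_PAGE_BYTES) -> float:
    """Bytes fetched from flash divided by the bytes the model actually needs."""
    # Assumption: the vector is page-aligned, so it touches the minimum number of pages.
    pages_touched = math.ceil(vector_bytes / page_bytes)
    return pages_touched * page_bytes / vector_bytes


for size in (128, 512, 2048):
    print(f"{size:>5} B vector: {read_amplification(size):.0f}x read amplification")
```

Under these assumptions, the smaller the embedding vector, the more of each 4 KB page read is wasted.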
System Overview
• Software
EV-Cache efficiently caches embedding vectors in a small amount of memory, exploiting the modest temporal locality within embedding tables to avoid long SSD read latency.
• Hardware
EV-SSD provides fast access to embedding vectors on the SSD by reducing read amplification and shortening the I/O stack.
EV-SSD
EV-FC
EV Translator
EV-Cache
• EV-Cache workload
• Bitmap filter
• Cache Design
EV-Cache employs a dedicated bitmap as a filter to prune paths early.
It takes the access count recorded in the bitmap as a hint to decide whether an embedding vector should be cached or bypassed.
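A minimal sketch of such an admission filter follows. The slides do not give the bitmap layout or the admission threshold, so this version assumes one bit per embedding index and admits a vector to the cache on its second access. The class and method names are hypothetical.

```python
class BitmapFilter:
    """One bit per embedding index; a set bit means 'seen at least once'."""

    def __init__(self, num_embeddings: int) -> None:
        # Pack the bits into bytes: 8 indices per byte.
        self._bits = bytearray((num_embeddings + 7) // 8)

    def seen_before(self, index: int) -> bool:
        byte, bit = divmod(index, 8)
        return bool(self._bits[byte] & (1 << bit))

    def mark(self, index: int) -> None:
        byte, bit = divmod(index, 8)
        self._bits[byte] |= 1 << bit

    def should_cache(self, index: int) -> bool:
        """Assumed policy: bypass the first access, admit from the second on."""
        if self.seen_before(index):
            return True
        self.mark(index)
        return False


flt = BitmapFilter(num_embeddings=1000)
print(flt.should_cache(42))  # first access: bypass the cache
print(flt.should_cache(42))  # repeated access: worth caching
```

The point of the filter is that vectors accessed only once never displace reusable entries, at a cost of one bit per embedding in this sketch.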
Cache Design
The building block is an LRU (least recently used) vector cache:
➢ A hash map stores the key-value pairs
➢ A circular doubly linked list records the access order of the elements
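The two structures above can be combined as in the following sketch. It is a generic LRU cache, not the paper's C++ implementation: Python's built-in `dict` stands in for the hash map, and a sentinel node closes the doubly linked list into a circle. All names are illustrative.

```python
class _Node:
    __slots__ = ("key", "value", "prev", "next")

    def __init__(self, key=None, value=None):
        self.key, self.value = key, value
        self.prev = self.next = self  # a lone node forms a one-element circle


class LRUCache:
    def __init__(self, capacity: int) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.map = {}        # key -> node, for O(1) lookup
        self.head = _Node()  # sentinel: head.next is newest, head.prev is oldest

    def _unlink(self, node: _Node) -> None:
        node.prev.next = node.next
        node.next.prev = node.prev

    def _push_front(self, node: _Node) -> None:
        node.prev, node.next = self.head, self.head.next
        self.head.next.prev = node
        self.head.next = node

    def get(self, key):
        node = self.map.get(key)
        if node is None:
            return None
        # A hit makes the entry the most recently used one.
        self._unlink(node)
        self._push_front(node)
        return node.value

    def put(self, key, value) -> None:
        node = self.map.get(key)
        if node is not None:
            node.value = value
            self._unlink(node)
            self._push_front(node)
            return
        if len(self.map) >= self.capacity:
            victim = self.head.prev  # least recently used entry
            self._unlink(victim)
            del self.map[victim.key]
        node = _Node(key, value)
        self.map[key] = node
        self._push_front(node)
```

Both `get` and `put` run in constant time, since the hash map finds the node and the list only relinks a fixed number of pointers.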
Robin Hood hashing reduces the maximum probe sequence length and shortens the long tail of the probe sequence length distribution, which keeps lookups fast.
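To show where the shorter probe sequences come from, here is a small sketch of Robin Hood insertion with linear probing. It is a textbook version under simplifying assumptions (fixed capacity, no deletion, no resizing) and is not the paper's implementation. The class name is hypothetical.

```python
class RobinHoodMap:
    def __init__(self, capacity: int = 8) -> None:
        self.slots = [None] * capacity  # each slot holds (key, value) or None
        self.size = 0

    def _home(self, key) -> int:
        return hash(key) % len(self.slots)

    def _distance(self, slot: int, key) -> int:
        """How far the entry in `slot` sits from its home slot."""
        return (slot - self._home(key)) % len(self.slots)

    def _find(self, key):
        n = len(self.slots)
        i, dist = self._home(key), 0
        while dist < n:
            resident = self.slots[i]
            # Early exit: a resident closer to home than we are means the key is absent.
            if resident is None or self._distance(i, resident[0]) < dist:
                return None
            if resident[0] == key:
                return i
            i, dist = (i + 1) % n, dist + 1
        return None

    def get(self, key, default=None):
        idx = self._find(key)
        return default if idx is None else self.slots[idx][1]

    def put(self, key, value) -> None:
        idx = self._find(key)
        if idx is not None:
            self.slots[idx] = (key, value)  # update in place
            return
        if self.size == len(self.slots):
            raise RuntimeError("table full (this sketch does not resize)")
        n = len(self.slots)
        entry, dist, i = (key, value), 0, self._home(key)
        while True:
            resident = self.slots[i]
            if resident is None:
                self.slots[i] = entry
                self.size += 1
                return
            resident_dist = self._distance(i, resident[0])
            if resident_dist < dist:
                # "Rob the rich": the entry farther from home takes the slot.
                self.slots[i], entry = entry, resident
                dist = resident_dist
            i, dist = (i + 1) % n, dist + 1
```

Because entries far from home displace entries close to home, probe distances even out across the table, and a failed lookup can stop as soon as it meets a resident closer to home than the current probe distance.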
EV-Cache internally uses a set of LRU caches to serve embedding lookup requests, which improves performance in a multi-threaded environment.
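One common way to build "a set of LRU caches" is to shard by key so that threads touching different shards do not contend on the same lock. The slides do not say how requests are partitioned, so the modulo sharding and the per-shard locks below are assumptions, and `collections.OrderedDict` stands in for the real cache structure.

```python
import threading
from collections import OrderedDict


class ShardedLRU:
    def __init__(self, num_shards: int, capacity_per_shard: int) -> None:
        self._shards = [OrderedDict() for _ in range(num_shards)]
        self._locks = [threading.Lock() for _ in range(num_shards)]
        self._capacity = capacity_per_shard

    def _shard_of(self, index: int) -> int:
        # Assumed partitioning: embedding index modulo the number of shards.
        return index % len(self._shards)

    def get(self, index: int):
        s = self._shard_of(index)
        with self._locks[s]:
            shard = self._shards[s]
            if index not in shard:
                return None
            shard.move_to_end(index)  # mark as most recently used
            return shard[index]

    def put(self, index: int, vector) -> None:
        s = self._shard_of(index)
        with self._locks[s]:
            shard = self._shards[s]
            shard[index] = vector
            shard.move_to_end(index)
            if len(shard) > self._capacity:
                shard.popitem(last=False)  # evict the least recently used entry
```

With this layout, a lookup only takes the lock of the one shard that owns the index, so concurrent lookups for indices in different shards proceed in parallel.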
Putting it all together
➢ Prepare indices (EV-Cache)
➢ Send indices (EV-Cache)
➢ Look up embedding vectors in SSD (EV-SSD)
The pipeline exploits the parallelism between the CPU and the SSD and reduces idle time on the host side.
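The overlap can be sketched as a two-stage software pipeline: while the device handles the lookup for one batch, the host already prepares the indices for the next. In the sketch below, a single worker thread stands in for EV-SSD and both stage functions are placeholders. None of these names come from the paper.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_indices(batch):
    """Host-side stage (stand-in for EV-Cache work): deduplicate and sort indices."""
    return sorted(set(batch))


def ssd_lookup(indices):
    """Device-side stage (stand-in for EV-SSD): return one fake vector per index."""
    return {i: [float(i)] for i in indices}


def run_pipeline(batches):
    results = []
    with ThreadPoolExecutor(max_workers=1) as ssd:
        in_flight = None
        for batch in batches:
            # The CPU prepares this batch while the previous lookup is still in flight.
            indices = prepare_indices(batch)
            if in_flight is not None:
                results.append(in_flight.result())
            in_flight = ssd.submit(ssd_lookup, indices)
        if in_flight is not None:
            results.append(in_flight.result())
    return results


print(run_pipeline([[3, 1, 3], [2]]))
```

Results come back in batch order, and the host only blocks when it has nothing left to prepare.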
Implementation
• EV-SSD
Prototyped on an FPGA: the FPGA logic implements the functionality of the EV-SSD controllers, and four DDR4 banks emulate four flash channels.
• EV-Cache
Implemented in C++ and integrated into the PyTorch-based DLRM framework.
Set up
• AWS EC2 F1 instance
• Xilinx Virtex UltraScale+ XCVU9P card
• Batch size 128
• Three typical read latencies are evaluated, covering low-end to high-end SSDs
• The paper compares FlashEmbedding to a Baseline SSD implementation using the SSD-S setting and to an Ideal DRAM implementation using a DRAM-only setting.
It also evaluates the EV-Cache and EV-SSD techniques in detail.
Why ?
EV-SSD
1. bypasses most parts of the I/O stack,
allowing direct access to SSD.
2. supports fine-grained access,
and read amplification is reduced remarkably.
Embedding cache
can reduce the quantity of SSD requests by
exploiting the temporal reuse opportunity of embeddings in
the dataset
EV-SSD’s
lighter I/O stack and finer access granularity efficiently reduce the read amplification.
EV-Cache
efficiently caches the embeddings and serves embedding lookup requests directly
whenever possible, reducing traffic to SSD.
As the batch size increases, the latency of EV-SSD decreases
EV-Cache reduces the number of SSD accesses by exploiting the data locality
Related work
• Utilizing near-memory processing to accelerate recommendation models.
• Reducing the number of memory accesses and compute cycles at the cost
of additional memory space.
Conclusion
FlashEmbedding
➢ Performance problems :
➢ Long I/O stack
➢ Tremendous read amplification of the block-interface SSD
➢ Three key optimizations
➢ Embedding vector SSD
➢ Embedding vector cache
➢ Pipeline

FlashEmbedding: Storing Embedding Tables in SSD for Large-Scale Recommender Systems.pptx

  • 1. FlashEmbedding: Storing Embedding Tables in SSD for Large-Scale Recommender Systems Speaker: Po-Chuan, Chen
  • 2. Table of contents • Abstract • Introduction • Background and motivation • Embedding • SSD-based Recommendation Interence • End-to-End Inference Latency • Performance Overhead Analysis • Flash Embedding Design • System overview • Embedding Vector SSD • MMIO Manager • EV Translator • EV-FC • EV-Merger • Embedding Vector Cache • Putting it all together • Implementation • Evaluation • Experimental setup • Effectiveness of Proposed Techniques • End-to-End Performance • Sensitivity Study • Related Study • Conclusion
  • 3. Abstract • Flash Embedding, a hardware/software co-design solution for storing embedding tables on SSDs for largescale recommendation inference under memory capacity limited systems. • Lower latency (in lookups) 17.44 x • Lower end-to-end latency 2.89 x
  • 4. Introduction • Recommender systems exploit the categorical information while maintaining the natural diversity within the features by using embedding tables to map each category to a unique embedding vector within a lower-dimensional embedded space. Typically, they store all embedding tables in memory to reduce latency for serving user requests. But it costs high latency.
  • 5. Solution • One promising solution to this problem is storing embedding tables on much larger solid-state drives (SSDs). • The performance degradation of SSD-based recommendation inference is mainly attributed to two factors:  The long I/O stack  The tremendous read amplifications of the block-interface SSD
  • 6. Embedding vector SSD (EV-SSD) it support embedding vector lookup requests by taking advantage of two-stage fine- grained reading and MMIO interface. Embedding vector cache (EV-Cache) it reduce the access to EV-SSD by applying a bitmap filter and robin hood hashing– based LRU caches. Using the pipeline it hide the I/O latency further by overlapping CPU computation with EV-SSD handling. Overview : Flash Embedding Design
  • 7. • Abstract • Introduction • Background and motivation • Embedding • SSD-based Recommendation Interence • End-to-End Inference Latency • Performance Overhead Analysis • Flash Embedding Design • System overview • Embedding Vector SSD • MMIO Manager • EV Translator • EV-FC • EV-Merger • Embedding Vector Cache • Putting it all together • Implementation • Evaluation • Experimental setup • Effectiveness of Proposed Techniques • End-to-End Performance • Sensitivity Study • Related Study • Conclusion
  • 8. Embedding • A typical deep learning-based personalized recommendation model comprises a bottom MLP, a top MLP, embedding layers, and a feature interaction layer. • Ref: https://arxiv.org/pdf/1707.07435.pdf; Day 29 Embeddings (embedding architecture), iT 邦幫忙
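As a rough illustration of what the embedding layer on this slide does, the sketch below gathers rows from one toy embedding table and reduces them with an element-wise sum. The table sizes, the function names `lookup` and `sum_pool`, and the choice of sum pooling are assumptions made for illustration; the slides do not specify them.

```python
# Hypothetical toy sizes; production tables hold far more rows.
NUM_CATEGORIES = 6
DIM = 4

# One embedding table: row i is the vector for category i.
table = [[float(i * DIM + j) for j in range(DIM)] for i in range(NUM_CATEGORIES)]


def lookup(table, indices):
    """Gather the rows named by `indices` for one sparse feature."""
    return [table[i] for i in indices]


def sum_pool(vectors):
    """Reduce the gathered rows to a single vector by element-wise sum (assumed pooling)."""
    return [sum(column) for column in zip(*vectors)]


# Categories 1 and 3 are active for this sample.
pooled = sum_pool(lookup(table, [1, 3]))  # [16.0, 18.0, 20.0, 22.0]
```

Each lookup touches only a few small rows scattered across a large table, which is the access pattern the later slides try to serve efficiently from an SSD.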
  • 9. SSD-based Recommendation Inference • The deep learning recommendation model (DLRM) framework and the Criteo Ad dataset are used to evaluate how well the recommender system performs. • The embedding tables are stored on the NVMe SSD attached to the AWS F1 instance.
  • 11. • The page cache holds frequently used embeddings and serves most embedding lookup requests whenever it has the requested embedding. • Notice that device-layer latency increases rapidly as the size of available memory decreases. • I/O stack costs are the primary limiter on SSD-based recommender system performance. • Thrashing / page swapping • The I/O granularity of recommender systems (embedding vectors of 128 B to 2 KB) and that of the underlying SSDs (flash page size of 4 KB) mismatch, leading to significant read amplification.
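The granularity mismatch can be made concrete with a little arithmetic. The sketch below takes the two sizes quoted on the slide, a 4 KB flash page and requested vectors of 128 B to 2 KB, and assumes the simplest case: every lookup reads one whole page to obtain one vector, and no vector straddles a page boundary. The function name `read_amplification` is hypothetical.

```python
FLASH_PAGE_BYTES = 4096  # 4 KB flash page, as quoted on the slide


def read_amplification(vector_bytes, page_bytes=FLASH_PAGE_BYTES):
    """Bytes read from flash per byte requested, assuming one full page per vector."""
    return page_bytes / vector_bytes


# Smaller vectors waste more of each page read.
amplification = {size: read_amplification(size) for size in (128, 512, 2048)}
# {128: 32.0, 512: 8.0, 2048: 2.0}
```

Under these assumptions a 128 B vector costs 32 times the bytes actually needed, which is why finer-grained reads matter most for small vectors.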
  • 13. System Overview • Software: EV-Cache efficiently caches embedding vectors in a small amount of memory, exploiting the modest temporal locality within embedding tables to avoid long SSD read latency. • Hardware: EV-SSD provides fast access to the embedding vectors on the SSD by reducing read amplification and shortening the I/O stack.
  • 16. EV-Cache • EV-Cache workload • Bitmap filter • Cache Design • EV-Cache employs dedicated bitmaps as filters to prune paths early. • It takes access-count hints from the bitmap to decide whether an embedding vector should be cached or bypassed.
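One way to picture a bitmap that carries access-count hints is a packed array of small saturating counters, with a vector admitted to the cache only once its counter reaches a threshold. The sketch below is an illustration under assumptions, not the design from the paper: the 2-bit counter width, the threshold of 2, and the class and method names are all hypothetical.

```python
class BitmapFilter:
    """Saturating access counters packed into a bytearray (hypothetical layout)."""

    BITS = 2                     # assumed counter width per embedding index
    MAX_COUNT = (1 << BITS) - 1  # counters saturate at 3
    PER_BYTE = 8 // BITS         # four counters share one byte

    def __init__(self, num_vectors, admit_threshold=2):
        self.bits = bytearray((num_vectors + self.PER_BYTE - 1) // self.PER_BYTE)
        self.admit_threshold = admit_threshold  # assumed admission policy

    def _locate(self, index):
        """Return (byte offset, bit shift) of the counter for `index`."""
        byte, slot = divmod(index, self.PER_BYTE)
        return byte, slot * self.BITS

    def count(self, index):
        byte, shift = self._locate(index)
        return (self.bits[byte] >> shift) & self.MAX_COUNT

    def record_access(self, index):
        """Increment the counter for `index` unless saturated; return the new count."""
        byte, shift = self._locate(index)
        current = self.count(index)
        if current < self.MAX_COUNT:
            self.bits[byte] += 1 << shift
            current += 1
        return current

    def should_cache(self, index):
        """Cache vectors seen often enough; bypass the cache for the rest."""
        return self.count(index) >= self.admit_threshold
```

With this layout the filter costs 2 bits per embedding index, so rarely accessed vectors can be kept out of the cache without storing a full entry for them.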
  • 17. Cache Design • EV-Cache uses an LRU (least recently used) vector cache as its building block: • a HashMap stores the key-value pairs • a circular doubly linked list records the access order of each element • Robin Hood hashing reduces the maximum probe sequence length and the long tail in the probe sequence length distribution, achieving high performance. • EV-Cache internally uses a set of LRU caches to serve embedding lookup requests, which improves performance in a multi-threaded environment.
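The building block described on this slide can be sketched as follows. Python's `OrderedDict` stands in for the HashMap plus linked list, so Robin Hood hashing is not modeled here. The class names, the per-shard lock, and the modulo sharding by integer embedding index are assumptions for illustration.

```python
import threading
from collections import OrderedDict


class LRUCache:
    """One LRU cache; OrderedDict keeps entries in access order."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()
        self.lock = threading.Lock()

    def get(self, key):
        with self.lock:
            if key not in self.entries:
                return None
            self.entries.move_to_end(key)  # mark as most recently used
            return self.entries[key]

    def put(self, key, value):
        with self.lock:
            self.entries[key] = value
            self.entries.move_to_end(key)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict least recently used


class ShardedLRU:
    """A set of LRU caches; each integer key maps to exactly one shard."""

    def __init__(self, num_shards, capacity_per_shard):
        self.shards = [LRUCache(capacity_per_shard) for _ in range(num_shards)]

    def _shard(self, key):
        return self.shards[key % len(self.shards)]  # assumed sharding rule

    def get(self, key):
        return self._shard(key).get(key)

    def put(self, key, value):
        self._shard(key).put(key, value)
```

Splitting the cache into shards means threads that touch different shards do not contend on the same lock, which is one plausible reading of why a set of caches helps in a multi-threaded setting.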
  • 18. Putting it all together • Prepare indices (EV-Cache) • Send indices (EV-Cache) • Lookup embedding vectors in SSD (EV-SSD) • The pipeline takes advantage of the parallelism between CPU and SSD and mitigates the idle time on the host side.
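The overlap between host-side work and device-side lookups can be sketched with a single worker thread standing in for the SSD. Everything here is hypothetical scaffolding: `prepare_indices` and `ssd_lookup` are placeholder stages, and the real system talks to EV-SSD over MMIO rather than through a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor


def prepare_indices(batch):
    """CPU stage (placeholder): deduplicate and sort the raw indices."""
    return sorted(set(batch))


def ssd_lookup(indices):
    """Stand-in for EV-SSD: return a fake two-element vector per index."""
    return {i: [float(i)] * 2 for i in indices}


def run_pipelined(batches):
    """Prepare batch k on the CPU while the lookup for batch k-1 is in flight."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as ssd:
        in_flight = None
        for batch in batches:
            indices = prepare_indices(batch)  # overlaps with the previous lookup
            if in_flight is not None:
                results.append(in_flight.result())
            in_flight = ssd.submit(ssd_lookup, indices)  # send indices
        if in_flight is not None:
            results.append(in_flight.result())
    return results
```

The results match a purely sequential loop; the only difference is that the host prepares the next batch instead of idling while the device works.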
  • 20. Implementation • EV-SSD is built on an FPGA: the FPGA chip realizes the functionality of the EV-SSD controllers, and four DDR4 banks emulate four flash channels. • EV-Cache is implemented in C++ and integrated into the PyTorch-based DLRM framework.
  • 21. Setup • AWS EC2 F1 instance • Xilinx Virtex UltraScale+ XCVU9P card • Batch size 128 • Three typical read latencies, covering low-end to high-end SSDs, are evaluated. • The paper compares Flash Embedding to a Baseline SSD implementation using the SSD-S setting and an Ideal DRAM implementation using the DRAM-only setting. It also evaluates the EV-Cache and EV-SSD techniques in detail.
  • 22. Why? • EV-SSD (1) bypasses most of the I/O stack, allowing direct access to the SSD, and (2) supports fine-grained access, which reduces read amplification remarkably. • The embedding cache reduces the number of SSD requests by exploiting the temporal reuse of embeddings in the dataset. • EV-SSD’s lighter I/O stack and finer access granularity efficiently reduce read amplification. • EV-Cache efficiently caches the embeddings and serves embedding lookup requests directly whenever possible, reducing traffic to the SSD.
  • 23. • As the batch size increases, the latency of EV-SSD decreases. • EV-Cache reduces the number of SSD accesses by exploiting data locality.
  • 25. Related work • Utilizing near-memory processing to accelerate recommendation models. • Reducing the number of memory accesses and compute cycles at the cost of additional memory space.
  • 26. Conclusion • FlashEmbedding • Performance problems: • long I/O stack • tremendous read amplification of the block-interface SSD • Three key optimizations: • embedding vector SSD • embedding vector cache • pipeline