Copyright © Myrtle.ai 2020
Solving Core Recommendation
Model Challenges in Data Centers
Giles Peckham, Myrtle.ai
Myrtle.ai accelerates Machine Learning inference
• Accelerates Recommendation Models, RNNs and other DNNs with sparse structures
• Achieves maximum throughput in applications with strict latency constraints
• Addresses hyper-scale inference
• Data Centers (Cloud & On-Premise) and Embedded applications
[Partner logos: MLCommons Founding Member · Alliance Member · Gold Partner · AI Keynote 2019 · Joint White Paper]
[Application areas: Recommendation Systems · Speech Synthesis · Speech Transcription · Machine Translation]
MAU Accelerator
Low latency inference accelerator for data center ML workloads
Optimized for highest latency-bounded throughput
[Diagram: DNN Model → FPGA accelerator card in a cloud or enterprise data center server]
MAU Accelerator Benefits
Optimized for highest latency-bounded throughput
Reduced data center infrastructure required
• Lower CapEx
• Mitigates against rack space limitations
Reduced energy consumption
• Lower OpEx
• Smaller carbon footprint
• Mitigates against power constraints
Deterministic low tail-latency enables the use of higher quality models
• Improved customer experience
• Better services
Uses readily-available data center accelerator cards compatible with typical server installations
• Rapid deployment at scale
Development flow based on industry standards
• Easy to compile from popular open-source frameworks
Flexible & reprogrammable solution
• Future proof
Applications
Target Applications
• Speech transcription
• Natural language processing
• Speech synthesis
• Time series cleansing & analysis
• Payment & trading fraud detection
• Anomaly detection
• Network security
Target Model Architectures
• Fully connected linear layers
• RNN, including LSTM and GRU
• Time delay neural network (TDNN)
Target Sectors
• Finance (trading, compliance, service)
• Search, Social Media & other Ad Servers
• HPC (very large ML)
• Life science (genomics, data analytics)
• Defense, Aerospace, Security
• Telcos & Conferencing Providers
An Accelerator for Recommendation Systems
Recommendation Models
• One of the most common data center workloads
• Used for search, adverts, feeds and personalization
Demands
• Throughput / Capacity
• Need to ramp up capacity quickly to meet demand
• Months/years to commission new data center floor space
• Cost
• Data center rack server investment >$50B /yr1
• Latency / Model Accuracy / Revenue
• 5 ms latency is challenging for typical server systems
• 100 ms delay in load time can cost e-commerce companies many $B /yr2
• Energy Consumption / Carbon Footprint
• Global data center energy costs >$10B /yr3
• Global data center emissions ~100M tonnes CO2 /yr4
1. https://www.marketsandmarkets.com/Market-Reports/data-center-rack-server-market-53332315.html
2. https://www.akamai.com/uk/en/about/news/press/2017-press/akamai-releases-spring-2017-state-of-online-retail-performance-report.jsp
3. https://www.sciencedaily.com/releases/2020/02/200227144313.htm
4. https://www.comsoc.org/publications/tcn/2019-nov/energy-efficiency-data-centers
Design Challenges
• A typical Recommendation Model:
Input → Dense Features (Compute-Bound) → Sparse Features (Memory-Bound) → Dense Features (Compute-Bound) → Output
• Traditional approach:
• Put the whole model on one chip
• Myrtle.ai approach:
• Offload different features of the model to different hardware accelerators
• Make it equally practical to adopt
In the Sparse Features (Memory-Bound) stage:
• Up to 80% of time can be spent here
• Memory architecture in typical data center infrastructure is inefficient here
• Existing accelerators give a poor return here
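Why is the sparse stage memory-bound? A minimal, pure-Python sketch of a sum-pooled embedding lookup makes it concrete: each multi-hot feature gathers a handful of rows from a huge table at effectively random addresses, then does almost no arithmetic on them. Table size, dimension and IDs below are illustrative only, far smaller than production tables.

```python
import random

NUM_ROWS, DIM = 10_000, 8  # illustrative; real tables hold millions of rows
random.seed(0)
table = [[random.random() for _ in range(DIM)] for _ in range(NUM_ROWS)]

def embed_multi_hot(ids):
    """Gather rows at random addresses and sum-pool them into one dense
    vector. Memory-bound: one add per element fetched, no cache locality."""
    pooled = [0.0] * DIM
    for i in ids:
        row = table[i]              # random access into the big table
        for d in range(DIM):
            pooled[d] += row[d]     # trivial arithmetic per byte moved
    return pooled

dense = embed_multi_hot([12, 9_876, 3_141])
print(len(dense))  # one dense vector, regardless of how many IDs were pooled
```

The ratio of bytes fetched to arithmetic performed is what makes this stage a poor fit for compute-oriented accelerators: throughput is set by random-access memory bandwidth, not FLOPs.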
SEAL: An Accelerator for Recommendation Systems
• Accelerates the memory-bound sparse operations in all recommendation models
• Delivers large gains in latency-bounded throughput
• Fully preserves existing model accuracy
• Is complementary to existing compute accelerators
• Is integrated into the PyTorch Deep Learning Framework
The “Virtuous Circle”
Add SEAL modules → Offload sparse operations to SEAL → CPU freed up; latency reduced → Increase CPU batch size → Throughput increased → (repeat)
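The circle's payoff can be sketched with back-of-envelope arithmetic. All timings below are made-up illustrations, not measured SEAL figures: offloading the sparse stage cuts CPU time per item, so the batch size can grow within the same latency budget, multiplying throughput.

```python
LATENCY_BUDGET_MS = 5.0  # per-request latency bound from the deck

def throughput(batch, per_item_ms, fixed_ms):
    """Items/second for a batch served within the latency budget."""
    latency_ms = fixed_ms + batch * per_item_ms
    assert latency_ms <= LATENCY_BUDGET_MS, "batch too large for budget"
    return batch / (latency_ms / 1000.0)

# Before: CPU does dense + sparse work per item (illustrative timings).
base = throughput(batch=8, per_item_ms=0.5, fixed_ms=1.0)
# After: sparse work offloaded; per-item CPU cost drops, batch grows.
offloaded = throughput(batch=64, per_item_ms=0.06, fixed_ms=1.0)
print(f"{offloaded / base:.1f}x")  # -> 8.3x
```

With these assumed numbers the gain lands near the 8x claimed later in the deck, but the mechanism, not the constants, is the point: lower per-item CPU cost buys batch-size headroom under a fixed latency bound.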
Performance
• Vector Processing Bandwidth is the bandwidth achievable when transforming random multi-hot vectors into real-valued dense vectors
• Carrier is Glacier Point v2
Vector Processing Bandwidth:
• 16 GB version: 18 GB/s (219 GB/s per carrier)
• 32 GB version: 16 GB/s (195 GB/s per carrier)
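A bandwidth figure translates into lookups per second once you fix the bytes moved per multi-hot lookup. SEAL's internal element width and pooling factor are not stated here, so the parameters below are assumptions chosen purely to show the arithmetic.

```python
EMB_DIM = 64          # assumed embedding dimension
BYTES_PER_ELEM = 4    # assumed fp32 table entries
IDS_PER_LOOKUP = 80   # assumed pooling factor (IDs per multi-hot vector)

# Bytes of table data fetched per pooled lookup.
bytes_per_lookup = IDS_PER_LOOKUP * EMB_DIM * BYTES_PER_ELEM

bandwidth_gb_s = 18   # 16 GB version figure from the table above
lookups_per_s = bandwidth_gb_s * 1e9 / bytes_per_lookup
print(f"{lookups_per_s:,.0f} lookups/s")  # -> 878,906 lookups/s
```

Under these assumptions an 18 GB/s module sustains roughly 880k pooled lookups per second; halve the pooling factor or dimension and the rate doubles, which is why bandwidth, not compute, is the headline metric here.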
Key Benefits
Based on benchmarking using a weighted average of the mlperf.org benchmark recommendation models (Dec. 2019):
• Rapid 8x increase in latency-bounded throughput using existing infrastructure1
• Enables more recommendations to be made
• Enables better quality recommendations to be made
• Higher CTRs
• Increased revenue
• Greater consumer satisfaction
• Up to 50% CapEx savings on further capacity expansion1,2
• Up to 80% reduction in energy consumption1,2
• OpEx savings
• Smaller carbon footprint
1 Comparisons are between a Xeon D-2100 performing inference on its own and the same CPU leveraging SEAL acceleration. Performance and benefits will vary, depending on individual system configuration and model usage.
2 Based on servers + SEAL only. Excludes buildings, HVAC etc.
Highly Complementary to Existing Infrastructure
• Accelerates existing servers; easy to install
• Complementary to other accelerators
• Scalable
• Does not require any change to the recommendation model. No model retraining. No degradation in accuracy
• Supports co-location of models with no performance penalty
• Supports concurrent deployment of different versions of a model, and loading/unloading models on the fly to facilitate A/B testing
SEAL is the
• lowest power
• smallest form factor
• easiest-to-deploy
method of optimizing memory-bound recommendation models in existing infrastructure.
Contact seal@myrtle.ai to evaluate what SEAL can do for your business
For more information visit myrtle.ai/seal
Thank You
www.myrtle.ai
Giles Peckham
07785 278478
giles@myrtle.ai

Implementing AI: High Performance Architectures: Solving Core Recommendation Model Challenges in Data Centers
