Accelerate natural language processing with AWS
EC2 M7i instances featuring 4th Gen Intel Xeon
Scalable processors
Compared to AWS M7g instances with AWS Graviton3 processors,
AWS M7i instances with Intel Xeon Scalable processors analyzed up
to 10.65 times the sentences per second and delivered up to 8.62
times the performance per dollar in RoBERTa model testing
Natural language processing (NLP) supports many everyday applications, from search result
suggestions to text autocorrections to text generation chatbots. Deep learning frameworks for NLP,
such as Bidirectional Encoder Representations from Transformers (BERT) and its variations, demand
robust compute power to analyze text and make predictions. For businesses running these workloads
in the cloud, selecting the right instance can enable faster text analysis.
Using Intel® Extension for PyTorch, we tested the deep learning performance of two types of Amazon
Web Services (AWS) Elastic Compute Cloud (EC2) instances: M7i instances enabled by 4th Gen Intel
Xeon® Scalable processors and M7g instances enabled by AWS Graviton3 processors. To measure
how the instances performed at different sizes, we tested both types with 4, 16, and 64 vCPUs. We
ran tests with the RoBERTa model, a variant of BERT, using a benchmark from the Intel Model Zoo. We
optimized the model for each processor type.
In our tests, the M7i instances analyzed up to 10.65 times the sentences per second compared to
the M7g instances. These results indicate that businesses running similar RoBERTa workloads could
analyze textual data faster and complete more work per instance with M7i instances featuring 4th Gen
Intel Xeon Scalable processors.
Up to 4.66x the throughput on 16-vCPU instances
Up to 10.65x the throughput on 64-vCPU instances
Up to 8.62x the performance per dollar on 64-vCPU instances
This project was commissioned by Intel.
Accelerate natural language processing with Amazon EC2 M7i instances featuring 4th Gen Intel Xeon Scalable processors October 2023
A Principled Technologies report: Hands-on testing. Real-world results.
How we tested
We ran two types of AWS EC2 instances in the us-east-1 region (availability zone us-east-1f):
• M7i instances featuring 4th Gen Intel Xeon Scalable processors
• M7g instances featuring AWS Graviton3 processors
We optimized the RoBERTa model for each instance type. The 4th Gen Intel Xeon Scalable processors featured
Intel Advanced Matrix Extensions (Intel AMX) with 8-bit integer and bfloat16 support. To determine
the performance businesses of various sizes might expect, we ran tests on smaller, medium-size, and larger
instances, as Figure 1 shows. To account for different types of datasets businesses may analyze, we ran tests at
different batch sizes, which indicate the number of samples that go through the neural network at a time. We
chose a small batch size of 1 and a large batch size of 32. Finally, to show performance at different precisions
an organization may use, we tested with both BF16 and FP32 precision. While FP32 was “the standard type for
neural network computations for a long time,” BF16 is a newer precision type that some organizations have used
to replace FP16 and FP32 precision.1
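The throughput metric behind these comparisons is straightforward: total sentences processed divided by elapsed wall-clock time, where batch size determines how many sentences pass through the network per inference call. A minimal sketch of the calculation (the numbers below are hypothetical, not our measured results):

```python
def sentences_per_second(batch_size: int, num_batches: int, elapsed_s: float) -> float:
    """Throughput: total sentences processed divided by wall-clock seconds."""
    return batch_size * num_batches / elapsed_s

# Hypothetical run: 100 batches of 32 sentences completed in 8 seconds
print(sentences_per_second(32, 100, 8.0))  # 400.0 sentences/second
```

At batch size 1 the same elapsed time would yield far fewer sentences per second, which is why we report the two batch sizes separately.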
In addition to measuring the number of sentences per second each instance could analyze at two batch sizes
and with two precisions, we also calculated the performance per price of each instance. We based this data on
the results of our testing and instance pricing we researched on August 10, 2023. To read more about our tests,
configurations, and calculations, see the science behind the report.
What better RoBERTa performance might mean for you
BERT, pre-trained on the entirety of the English language Wikipedia, currently helps Google interpret
user searches.2
Developed by Facebook AI, the RoBERTa model we used in our testing trained on a
larger dataset, including the English language Wikipedia, 63 million news articles, and 38 GB of Reddit
content.3
According to one article, “RoBERTa has been shown to outperform BERT and other state-of-
the-art models on a variety of natural language processing tasks, including language translation, text
classification, and question answering. It has also been used as a base model for many other successful
NLP models and has become a popular choice for research and industry applications.”4
At all three
instance sizes we tested, the M7i instances enabled by 4th Gen Intel Xeon Scalable processors handled
greater throughput than the M7g instances with Graviton3 processors. And they did so using different
batch sizes and precision types. This means that selecting M7i instances could help businesses run
RoBERTa workloads faster, which could lead to a more seamless experience for users on RoBERTa-
powered apps or the ability to support more users. Additionally, with the better value that the M7i
instances delivered, organizations might accomplish more RoBERTa work per instance, improving their
bottom lines as they save on instance uptime.
Figure 1: Key specifications for each instance size we tested. Source: Principled Technologies.
•	Small instances: 4 vCPUs, 16 GB RAM
•	Medium instances: 16 vCPUs, 64 GB RAM
•	Large instances: 64 vCPUs, 256 GB RAM
Analyze text faster at BF16 and FP32 precision
We first compared RoBERTa performance at BF16 precision, measuring the sentences per second
each instance analyzed. At both batch sizes and across vCPU counts, M7i instances featuring 4th
Gen Intel Xeon Scalable processors handled a higher throughput rate than M7g instances with
Graviton3 processors. As Figures 2 and 3 show, M7i instances analyzed up to 10.65 times the
sentences per second.
Figure 2: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances (normalized, M7g = 1.00), with BF16 precision at batch size 1. M7i delivered 3.54x at 4 vCPUs, 4.66x at 16 vCPUs, and 10.65x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Figure 3: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances (normalized, M7g = 1.00), with BF16 precision at batch size 32. M7i delivered 3.62x at 4 vCPUs, 4.58x at 16 vCPUs, and 5.17x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Up to 10.65x the throughput on 64-vCPU instances
Up to 5.17x the throughput on 64-vCPU instances
Note: The graphs in this report use two different y-axis scales to keep to a consistent size. Please be mindful of each graph’s
data range as you compare.
When we compared performance using FP32 precision, we again saw performance increases from M7i
instances featuring 4th Gen Intel Xeon Scalable processors. For both batch sizes we tested, the M7i instances
outperformed the M7g instances by as much as 4.25 times (Figures 4 and 5).
Figure 4: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances (normalized, M7g = 1.00), with FP32 precision at batch size 1. M7i delivered 1.82x at 4 vCPUs, 2.16x at 16 vCPUs, and 4.25x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Figure 5: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances (normalized, M7g = 1.00), with FP32 precision at batch size 32. M7i delivered 1.93x at 4 vCPUs, 2.01x at 16 vCPUs, and 1.93x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Up to 4.25x the throughput on 64-vCPU instances
Up to 2.01x the throughput on 16-vCPU instances
About 4th Gen Intel Xeon Scalable processors
According to Intel, 4th Gen Intel Xeon Scalable processors feature
“the most built-in accelerators of any CPU on the market to improve
performance in AI, data analytics, networking, storage, and HPC.”5
Along with PCIe Gen5 technology, DDR5 memory, and CXL 1.1
capabilities, 4th Gen Intel Xeon Scalable processors also offer Intel
Accelerator Engines for a variety of workloads.6
For more information, visit https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html.
Take advantage of higher-value performance
Not only did M7i instances analyze sentences at a faster rate, but this performance improvement
made them a better value. Dividing each instance’s RoBERTa throughput by its per-hour price,
we found that M7i instances performed more RoBERTa work for every dollar they cost. At BF16
precision, M7i instances delivered up to 8.62 times the throughput per cost of M7g instances, as
Figures 6 and 7 show.
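The value calculation is simple division followed by normalization against the M7g baseline. A sketch with hypothetical throughputs and hourly prices (our actual figures come from the measured results and the instance pricing we researched on August 10, 2023):

```python
def perf_per_dollar(sentences_per_sec: float, price_per_hour: float) -> float:
    """RoBERTa work done for each dollar of instance uptime."""
    return sentences_per_sec / price_per_hour

def normalize(m7i_value: float, m7g_value: float) -> float:
    """Express the M7i result as a multiple of the M7g baseline (M7g = 1.00)."""
    return m7i_value / m7g_value

# Hypothetical: M7i analyzes 500 sentences/s at $0.40/hr;
# M7g analyzes 100 sentences/s at $0.35/hr
m7i_ppd = perf_per_dollar(500, 0.40)
m7g_ppd = perf_per_dollar(100, 0.35)
print(normalize(m7i_ppd, m7g_ppd))  # ~4.38x the performance per dollar
```

Note that a faster instance can deliver better performance per dollar even at a higher hourly price, as long as its throughput advantage outweighs the price difference.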
Figure 6: Throughput per instance cost of M7i instances compared to M7g instances (normalized, M7g = 1.00), with BF16 precision at batch size 1. M7i delivered 2.87x at 4 vCPUs, 3.77x at 16 vCPUs, and 8.62x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Figure 7: Throughput per instance cost of M7i instances compared to M7g instances (normalized, M7g = 1.00), with BF16 precision at batch size 32. M7i delivered 2.93x at 4 vCPUs, 3.71x at 16 vCPUs, and 4.19x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Up to 8.62x the performance per dollar on 64-vCPU instances
Up to 4.19x the performance per dollar on 64-vCPU instances
Using the same calculations as we did for BF16 precision, we saw that with FP32 precision, the M7i instances
again offered a better value than the M7g instances we tested. Figures 8 and 9 show that M7i instances handled
up to 3.44 times the throughput per cost of the M7g instances using FP32 precision.
Figure 8: Throughput per instance cost of M7i instances compared to M7g instances (normalized, M7g = 1.00), with FP32 precision at batch size 1. M7i delivered 1.47x at 4 vCPUs, 1.75x at 16 vCPUs, and 3.44x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Figure 9: Throughput per instance cost of M7i instances compared to M7g instances (normalized, M7g = 1.00), with FP32 precision at batch size 32. M7i delivered 1.56x at 4 vCPUs, 1.63x at 16 vCPUs, and 1.56x at 64 vCPUs. Higher is better. Source: Principled Technologies.
Up to 3.44x the performance per dollar on 64-vCPU instances
Up to 1.63x the performance per dollar on 16-vCPU instances
Conclusion
For natural language processing workloads, the instance type you choose can make an enormous difference
in performance (how quickly your instances can make sense of textual data) and value (how much work they can
accomplish at a given cost. Our RoBERTa test results indicate that AWS EC2 M7i instances enabled by 4th
Gen Intel Xeon Scalable processors outperformed M7g instances with AWS Graviton3 processors at different
vCPU counts, batch sizes, and precisions, processing up to 10.65 times as many sentences per second. These
performance gains led to M7i instances delivering a better value, achieving up to 8.62 times the throughput per
dollar. With an instance that can analyze text more quickly, your business could offer a smoother experience on
RoBERTa-supported apps or potentially condense RoBERTa workloads onto fewer instances.
1. Grigory Sapunov, “FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO,” accessed August 25, 2023,
https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407.
2. Ben Lutkevich, “BERT language model,” accessed August 18, 2023,
https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model.
3. GeeksforGeeks, “Overview of ROBERTa model,” accessed August 24, 2023,
https://www.geeksforgeeks.org/overview-of-roberta-model/.
4. GeeksforGeeks, “Overview of ROBERTa model.”
5. Intel, “4th Gen Xeon Scalable Processors,” accessed August 18, 2023, https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html.
6. Intel, “4th Gen Xeon Scalable Processors.”
Principled Technologies is a registered trademark of Principled Technologies, Inc.
All other product names are the trademarks of their respective owners.
For additional information, review the science behind this report.
Principled
Technologies®
Facts matter.®
Read the science behind this report at https://facts.pt/2aqfhRt
October 2023 | 7
Accelerate natural language processing with Amazon EC2 M7i instances featuring 4th Gen Intel Xeon Scalable processors

More Related Content

Similar to Accelerate natural language processing with AWS EC2 M7i instances featuring 4th Gen Intel Xeon Scalable processors

Similar to Accelerate natural language processing with AWS EC2 M7i instances featuring 4th Gen Intel Xeon Scalable processors (20)

Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319 Foundations of Amazon EC2 - SRV319
Foundations of Amazon EC2 - SRV319
 
Age of Language Models in NLP
Age of Language Models in NLPAge of Language Models in NLP
Age of Language Models in NLP
 
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best PracticesAmazon EC2 Instances, Featuring Performance Optimisation Best Practices
Amazon EC2 Instances, Featuring Performance Optimisation Best Practices
 
Amazon EC2 Foundations
Amazon EC2 FoundationsAmazon EC2 Foundations
Amazon EC2 Foundations
 
Get a clearer picture of potential cloud performance by looking beyond SPECra...
Get a clearer picture of potential cloud performance by looking beyond SPECra...Get a clearer picture of potential cloud performance by looking beyond SPECra...
Get a clearer picture of potential cloud performance by looking beyond SPECra...
 
Podila QCon SF 2016
Podila QCon SF 2016Podila QCon SF 2016
Podila QCon SF 2016
 
Google Cloud N2 VM instances featuring 3rd Gen Intel Xeon Scalable processors...
Google Cloud N2 VM instances featuring 3rd Gen Intel Xeon Scalable processors...Google Cloud N2 VM instances featuring 3rd Gen Intel Xeon Scalable processors...
Google Cloud N2 VM instances featuring 3rd Gen Intel Xeon Scalable processors...
 
Finish Microsoft SQL Server data analysis faster with new M5n series instance...
Finish Microsoft SQL Server data analysis faster with new M5n series instance...Finish Microsoft SQL Server data analysis faster with new M5n series instance...
Finish Microsoft SQL Server data analysis faster with new M5n series instance...
 
Elastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS PresentationElastic Compute Cloud (EC2) on AWS Presentation
Elastic Compute Cloud (EC2) on AWS Presentation
 
Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017Amazon EC2 Foundations - CMP203 - re:Invent 2017
Amazon EC2 Foundations - CMP203 - re:Invent 2017
 
AWS EC2 M6i instances with 3rd Gen Intel Xeon Scalable processors accelerated...
AWS EC2 M6i instances with 3rd Gen Intel Xeon Scalable processors accelerated...AWS EC2 M6i instances with 3rd Gen Intel Xeon Scalable processors accelerated...
AWS EC2 M6i instances with 3rd Gen Intel Xeon Scalable processors accelerated...
 
Complete online analytics processing work faster with Google Cloud Platform N...
Complete online analytics processing work faster with Google Cloud Platform N...Complete online analytics processing work faster with Google Cloud Platform N...
Complete online analytics processing work faster with Google Cloud Platform N...
 
Process data analytics queries faster with new Microsoft Azure Lsv3-series VM...
Process data analytics queries faster with new Microsoft Azure Lsv3-series VM...Process data analytics queries faster with new Microsoft Azure Lsv3-series VM...
Process data analytics queries faster with new Microsoft Azure Lsv3-series VM...
 
Complete artificial intelligence workloads faster using Microsoft Azure virtu...
Complete artificial intelligence workloads faster using Microsoft Azure virtu...Complete artificial intelligence workloads faster using Microsoft Azure virtu...
Complete artificial intelligence workloads faster using Microsoft Azure virtu...
 
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
Improve deep learning inference  performance with Microsoft Azure Esv4 VMs wi...
 
Scale Up Performance with Intel® Development
Scale Up Performance with Intel® DevelopmentScale Up Performance with Intel® Development
Scale Up Performance with Intel® Development
 
Complete more PostgreSQL work with new Microsoft Azure Lsv3-series VMs featur...
Complete more PostgreSQL work with new Microsoft Azure Lsv3-series VMs featur...Complete more PostgreSQL work with new Microsoft Azure Lsv3-series VMs featur...
Complete more PostgreSQL work with new Microsoft Azure Lsv3-series VMs featur...
 
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech TalksDeep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
Deep Dive on Amazon EC2 Instances - January 2017 AWS Online Tech Talks
 
Improve AI inference performance with HPE ProLiant DL380 Gen11 servers, power...
Improve AI inference performance with HPE ProLiant DL380 Gen11 servers, power...Improve AI inference performance with HPE ProLiant DL380 Gen11 servers, power...
Improve AI inference performance with HPE ProLiant DL380 Gen11 servers, power...
 
Re invent announcements_2016_hcls_use_cases_mchampion
Re invent announcements_2016_hcls_use_cases_mchampionRe invent announcements_2016_hcls_use_cases_mchampion
Re invent announcements_2016_hcls_use_cases_mchampion
 

More from Principled Technologies

Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
Principled Technologies
 
Improve performance and gain room to grow by easily migrating to a modern Ope...
Improve performance and gain room to grow by easily migrating to a modern Ope...Improve performance and gain room to grow by easily migrating to a modern Ope...
Improve performance and gain room to grow by easily migrating to a modern Ope...
Principled Technologies
 
Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...
Principled Technologies
 
Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...
Principled Technologies
 

More from Principled Technologies (20)

Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
Investing in GenAI: Cost‑benefit analysis of Dell on‑premises deployments vs....
 
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...
Dell APEX Cloud Platform for Red Hat OpenShift: An easily deployable and powe...
 
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
 
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
 
Realize 2.1X the performance with 20% less power with AMD EPYC processor-back...
Realize 2.1X the performance with 20% less power with AMD EPYC processor-back...Realize 2.1X the performance with 20% less power with AMD EPYC processor-back...
Realize 2.1X the performance with 20% less power with AMD EPYC processor-back...
 
Improve performance and gain room to grow by easily migrating to a modern Ope...
Improve performance and gain room to grow by easily migrating to a modern Ope...Improve performance and gain room to grow by easily migrating to a modern Ope...
Improve performance and gain room to grow by easily migrating to a modern Ope...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
Deploy with confidence: VMware Cloud Foundation 5.1 on next gen Dell PowerEdg...
 
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
Upgrade your cloud infrastructure with Dell PowerEdge R760 servers and VMware...
 
A comparison of security features in Dell, HP, and Lenovo PC systems
A comparison of security features in Dell, HP, and Lenovo PC systemsA comparison of security features in Dell, HP, and Lenovo PC systems
A comparison of security features in Dell, HP, and Lenovo PC systems
 
Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...
 
Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...Increase security, sustainability, and efficiency with robust Dell server man...
Increase security, sustainability, and efficiency with robust Dell server man...
 
Scale up your storage with higher-performing Dell APEX Block Storage for AWS ...
Scale up your storage with higher-performing Dell APEX Block Storage for AWS ...Scale up your storage with higher-performing Dell APEX Block Storage for AWS ...
Scale up your storage with higher-performing Dell APEX Block Storage for AWS ...
 
Scale up your storage with higher-performing Dell APEX Block Storage for AWS
Scale up your storage with higher-performing Dell APEX Block Storage for AWSScale up your storage with higher-performing Dell APEX Block Storage for AWS
Scale up your storage with higher-performing Dell APEX Block Storage for AWS
 
Get in and stay in the productivity zone with the HP Z2 G9 Tower Workstation
Get in and stay in the productivity zone with the HP Z2 G9 Tower WorkstationGet in and stay in the productivity zone with the HP Z2 G9 Tower Workstation
Get in and stay in the productivity zone with the HP Z2 G9 Tower Workstation
 
Improving database performance and value with an easy migration to Azure Data...
Improving database performance and value with an easy migration to Azure Data...Improving database performance and value with an easy migration to Azure Data...
Improving database performance and value with an easy migration to Azure Data...
 
Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...
 
Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...Realize better value and performance migrating from Azure Database for Postgr...
Realize better value and performance migrating from Azure Database for Postgr...
 
Set up students and teachers to excel now and in the future with Intel proces...
Set up students and teachers to excel now and in the future with Intel proces...Set up students and teachers to excel now and in the future with Intel proces...
Set up students and teachers to excel now and in the future with Intel proces...
 
Finding the path to AI success with the Dell AI portfolio - Summary
Finding the path to AI success with the Dell AI portfolio - SummaryFinding the path to AI success with the Dell AI portfolio - Summary
Finding the path to AI success with the Dell AI portfolio - Summary
 

Recently uploaded

Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Victor Rentea
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
FIDO Alliance
 
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider  Progress from Awareness to Implementation.pptxTales from a Passkey Provider  Progress from Awareness to Implementation.pptx
Tales from a Passkey Provider Progress from Awareness to Implementation.pptx
FIDO Alliance
 

Recently uploaded (20)

Introduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDMIntroduction to use of FHIR Documents in ABDM
Introduction to use of FHIR Documents in ABDM
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Overview of Hyperledger Foundation
Overview of Hyperledger FoundationOverview of Hyperledger Foundation
Overview of Hyperledger Foundation
 
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
AI mind or machine power point presentation
AI mind or machine power point presentationAI mind or machine power point presentation
AI mind or machine power point presentation
 
JohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptxJohnPollard-hybrid-app-RailsConf2024.pptx
JohnPollard-hybrid-app-RailsConf2024.pptx
 
Vector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptxVector Search @ sw2con for slideshare.pptx
Vector Search @ sw2con for slideshare.pptx
 
TEST BANK For Principles of Anatomy and Physiology, 16th Edition by Gerard J....
A Principled Technologies report: Hands-on testing. Real-world results.
How we tested

We ran two types of AWS EC2 instances in the us-east-1f region:

• M7i instances featuring 4th Gen Intel Xeon Scalable processors
• M7g instances featuring AWS Graviton3 processors

We optimized the RoBERTa model for each instance type. The 4th Gen Intel Xeon Scalable processors featured Intel Advanced Matrix Extensions (Intel AMX) with 8-bit integer, bfloat16, and float16 support.

To determine the performance businesses of various sizes might expect, we ran tests on smaller, medium-size, and larger instances, as Figure 1 shows. To account for different types of datasets businesses may analyze, we ran tests at different batch sizes, which indicate the number of samples that go through the neural network at a time. We chose a small batch size of 1 and a large batch size of 32. Finally, to show performance at different precisions an organization may use, we tested with both BF16 and FP32 precision. While FP32 was “the standard type for neural network computations for a long time,” BF16 is a newer precision type that some organizations have used to replace FP16 and FP32 precision.1

In addition to measuring the number of sentences per second each instance could analyze at two batch sizes and with two precisions, we also calculated the performance per price of each instance. We based this data on the results of our testing and instance pricing we researched on August 10, 2023. To read more about our tests, configurations, and calculations, see the science behind the report.
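The BF16 format discussed above keeps FP32’s 8 exponent bits but only 7 mantissa bits, so a bfloat16 value is essentially the top 16 bits of a float32. A minimal sketch of that relationship in plain Python (illustrative only, and truncating rather than rounding to nearest even, which real hardware typically does):

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Convert a float to bfloat16 by truncating its IEEE-754 float32
    representation to the top 16 bits (1 sign, 8 exponent, 7 mantissa)."""
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def from_bf16_bits(bits: int) -> float:
    """Re-expand bfloat16 bits to a float by zero-filling the low 16 bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", bits << 16))
    return x

# BF16 keeps FP32's full exponent range but only ~2-3 decimal digits
# of precision; pi survives as 3.140625.
pi_bf16 = from_bf16_bits(to_bf16_bits(3.14159265))
```

This range-over-precision trade-off is why BF16 can often stand in for FP32 in deep learning inference without the underflow/overflow issues FP16 can hit.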
What better RoBERTa performance might mean for you

BERT, pre-trained on the entirety of the English language Wikipedia, currently helps Google interpret user searches.2 Developed by Facebook AI, the RoBERTa model we used in our testing trained on a larger dataset, including the English language Wikipedia, 63 million news articles, and 38 GB of Reddit content.3 According to one article, “RoBERTa has been shown to outperform BERT and other state-of-the-art models on a variety of natural language processing tasks, including language translation, text classification, and question answering. It has also been used as a base model for many other successful NLP models and has become a popular choice for research and industry applications.”4

At all three instance sizes we tested, the M7i instances enabled by 4th Gen Intel Xeon Scalable processors handled greater throughput than the M7g instances with Graviton3 processors, and they did so across different batch sizes and precision types. This means that selecting M7i instances could help businesses run RoBERTa workloads faster, which could lead to a more seamless experience for users on RoBERTa-powered apps or the ability to support more users. Additionally, with the better value that the M7i instances delivered, organizations might accomplish more RoBERTa work per instance, improving their bottom lines as they save on instance uptime.

Figure 1: Key specifications for each instance size we tested. Source: Principled Technologies.
• Small instances: 4 vCPUs, 16 GB RAM
• Medium instances: 16 vCPUs, 64 GB RAM
• Large instances: 64 vCPUs, 256 GB RAM
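The throughput comparisons that follow boil down to one metric: sentences analyzed per second. A minimal timing harness shows how such a number is produced; here a hashing loop is a stand-in for the real RoBERTa forward pass, and the function names are illustrative, not from the Intel Model Zoo benchmark:

```python
import hashlib
import time

def fake_inference(batch_size: int) -> None:
    # Stand-in for one forward pass over `batch_size` sentences.
    for _ in range(batch_size):
        hashlib.sha256(b"x" * 4096).digest()

def measure_throughput(run_batch, batch_size: int, num_batches: int = 50) -> float:
    """Run `num_batches` batches through `run_batch` and return
    the sentences processed per second."""
    start = time.perf_counter()
    for _ in range(num_batches):
        run_batch(batch_size)
    elapsed = time.perf_counter() - start
    return (num_batches * batch_size) / elapsed

sentences_per_second = measure_throughput(fake_inference, batch_size=32)
```

In the real benchmark, larger batch sizes amortize per-batch overhead across more sentences, which is why batch-1 and batch-32 results are reported separately.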
Analyze text faster at BF16 and FP32 precision

We first compared RoBERTa performance at BF16 precision, measuring the sentences per second each instance analyzed. At both batch sizes and across vCPU counts, M7i instances featuring 4th Gen Intel Xeon Scalable processors handled a higher throughput rate than M7g instances with Graviton3 processors. As Figures 2 and 3 show, M7i instances analyzed up to 10.65 times the sentences per second.

Figure 2: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances. Both types of instances used BF16 precision at batch size 1. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 3.54x • 16 vCPUs: 4.66x • 64 vCPUs: 10.65x (M7g baseline = 1.00)

Figure 3: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances. Both types of instances used BF16 precision at batch size 32. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 3.62x • 16 vCPUs: 4.58x • 64 vCPUs: 5.17x (M7g baseline = 1.00)

Note: The graphs in this report use two different y-axis scales to keep to a consistent size. Please be mindful of each graph’s data range as you compare.
When we compared performance using FP32 precision, we again saw performance increases from M7i instances featuring 4th Gen Intel Xeon Scalable processors. For both batch sizes we tested, the M7i instances outperformed the M7g instances by as much as 4.25 times (Figures 4 and 5).

Figure 4: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances. Both types of instances used FP32 precision at batch size 1. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 1.82x • 16 vCPUs: 2.16x • 64 vCPUs: 4.25x (M7g baseline = 1.00)

Figure 5: Relative RoBERTa performance of M7i instances, in sentences analyzed per second, compared to M7g instances. Both types of instances used FP32 precision at batch size 32. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 1.93x • 16 vCPUs: 2.01x • 64 vCPUs: 1.93x (M7g baseline = 1.00)

About 4th Gen Intel Xeon Scalable processors

According to Intel, 4th Gen Intel Xeon Scalable processors feature “the most built-in accelerators of any CPU on the market to improve performance in AI, data analytics, networking, storage, and HPC.”5 Along with PCIe Gen5 technology, DDR5 memory, and CXL 1.1 capabilities, 4th Gen Intel Xeon Scalable processors also offer Intel Accelerator Engines for a variety of workloads.6 For more information, visit https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html.
Take advantage of higher-value performance

Not only did M7i instances analyze sentences at a faster rate, but this performance improvement made them a better value. Dividing each instance’s RoBERTa throughput by its per-hour price, we found that M7i instances performed more RoBERTa work for every dollar they cost. At BF16 precision, M7i instances delivered up to 8.62 times the throughput per cost of M7g instances, as Figures 6 and 7 show.

Figure 6: Throughput per instance cost of M7i instances compared to M7g instances. Both instances used BF16 precision at batch size 1. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 2.87x • 16 vCPUs: 3.77x • 64 vCPUs: 8.62x (M7g baseline = 1.00)

Figure 7: Throughput per instance cost of M7i instances compared to M7g instances. Both instances used BF16 precision at batch size 32. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 2.93x • 16 vCPUs: 3.71x • 64 vCPUs: 4.19x (M7g baseline = 1.00)
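The performance-per-dollar figures are a simple ratio: measured throughput divided by hourly instance price, then normalized to the M7g baseline. A sketch of that arithmetic, using placeholder numbers rather than the report’s measured values or actual AWS prices:

```python
def perf_per_dollar(throughput: float, price_per_hour: float) -> float:
    """Sentences per second delivered for each dollar of hourly cost."""
    return throughput / price_per_hour

def normalized(m7i_metric: float, m7g_metric: float) -> float:
    """Express an M7i result relative to an M7g baseline of 1.00."""
    return round(m7i_metric / m7g_metric, 2)

# Placeholder inputs, chosen only to illustrate the calculation.
m7i = {"throughput": 106.5, "price": 1.234}
m7g = {"throughput": 10.0, "price": 1.000}

rel_throughput = normalized(m7i["throughput"], m7g["throughput"])
rel_value = normalized(perf_per_dollar(m7i["throughput"], m7i["price"]),
                       perf_per_dollar(m7g["throughput"], m7g["price"]))
```

Note that a higher-priced instance can still win on this metric as long as its throughput advantage outpaces its price premium, which is the pattern the report describes.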
Using the same calculations as we did for BF16 precision, we saw that with FP32 precision, the M7i instances again offered a better value than the M7g instances we tested. Figures 8 and 9 show that M7i instances handled up to 3.44 times the throughput per cost of the M7g instances using FP32 precision.

Figure 8: Throughput per instance cost of M7i instances compared to M7g instances. Both instances used FP32 precision at batch size 1. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 1.47x • 16 vCPUs: 1.75x • 64 vCPUs: 3.44x (M7g baseline = 1.00)

Figure 9: Throughput per instance cost of M7i instances compared to M7g instances. Both instances used FP32 precision at batch size 32. Higher is better. Source: Principled Technologies.
• 4 vCPUs: 1.56x • 16 vCPUs: 1.63x • 64 vCPUs: 1.56x (M7g baseline = 1.00)
Conclusion

For natural language processing workloads, the instance type you choose can make an enormous difference in performance—how quickly it can make sense of textual data—and value—how much work it can accomplish at a given cost. Our RoBERTa test results indicate that AWS EC2 M7i instances enabled by 4th Gen Intel Xeon Scalable processors outperformed M7g instances with AWS Graviton3 processors at different vCPU counts, batch sizes, and precisions, processing up to 10.65 times as many sentences per second. These performance gains led to M7i instances delivering a better value, achieving up to 8.62 times the throughput per dollar. With an instance that can analyze text more quickly, your business could offer a smoother experience on RoBERTa-supported apps or potentially condense RoBERTa workloads onto fewer instances.

1. Grigory Sapunov, “FP64, FP32, FP16, BFLOAT16, TF32, and other members of the ZOO,” accessed August 25, 2023, https://moocaholic.medium.com/fp64-fp32-fp16-bfloat16-tf32-and-other-members-of-the-zoo-a1ca7897d407.
2. Ben Lutkevich, “BERT language model,” accessed August 18, 2023, https://www.techtarget.com/searchenterpriseai/definition/BERT-language-model.
3. GeeksforGeeks, “Overview of ROBERTa model,” accessed August 24, 2023, https://www.geeksforgeeks.org/overview-of-roberta-model/.
4. GeeksforGeeks, “Overview of ROBERTa model.”
5. Intel, “4th Gen Xeon Scalable Processors,” accessed August 18, 2023, https://www.intel.com/content/www/us/en/products/docs/processors/xeon-accelerated/4th-gen-xeon-scalable-processors.html/.
6. Intel, “4th Gen Xeon Scalable Processors.”

Principled Technologies is a registered trademark of Principled Technologies, Inc. All other product names are the trademarks of their respective owners. For additional information, review the science behind this report.

Principled Technologies® Facts matter.®

This project was commissioned by Intel.
Read the science behind this report at https://facts.pt/2aqfhRt