How to test LLMs in production?
leewayhertz.com/how-to-test-llms-in-production
In today’s AI-driven era, large language model-based solutions like ChatGPT have become
integral in diverse scenarios, promising enhanced human-machine interactions. As the
proliferation of these models accelerates, so does the need to gauge their quality and
performance in real-world production environments. Testing LLMs in production poses
significant challenges, as ensuring their reliability, accuracy, and adaptability is no
straightforward task. Approaches such as executing unit tests with an extensive test bank,
selecting appropriate evaluation metrics, and implementing regression testing when
modifications are made to prompts in a production environment are indeed beneficial.
However, scaling these operations often necessitates substantial engineering resources and
the development of dedicated internal tools. This is a complex task that requires a significant
investment of both time and manpower. The absence of a standardized testing method for
these models complicates matters further.
This article delves into the nuts and bolts of testing LLMs, primarily focusing on assessing
them in a production environment. We will explore different testing methodologies, discuss
the role of user feedback, and highlight the importance of bias and anomaly detection. This
insight aims to provide a comprehensive understanding of how we can evaluate and ensure
the reliability of these AI-powered language models in real-world settings.
What is an LLM?
Large Language Models (LLMs) represent the pinnacle of current language modeling
technology, leveraging the power of deep learning algorithms and an immense quantity of
text data. Such models have the remarkable ability to emulate human-written text and
execute a multitude of natural language processing tasks.
To understand language models in general, we can think of them as systems that assign
probabilities to word sequences based on the text corpora they have analyzed. Their complexity can
vary from straightforward n-gram models to more intricate neural network models.
The term "large language model," however, commonly denotes models that harness deep learning
techniques and contain an extensive number of parameters, ranging from
millions to billions. Such models are adept at recognizing intricate language patterns and crafting text
that often mimics human composition.
Building a large language model, typically an extensive transformer model, usually requires
resources beyond a single computer’s capabilities. Consequently, these models are often offered as a
service via APIs or web interfaces. Their training involves extensive text data from diverse
sources like books, articles, websites, and other written content forms. This exhaustive
training allows the models to understand statistical correlations between words, phrases, and
sentences, enabling them to generate relevant and cohesive responses to prompts or
inquiries.
An example of such a model is OpenAI’s GPT-3, the model family underpinning ChatGPT, which
underwent training on an enormous quantity of internet text data. This training enables it to
comprehend various languages and exhibit knowledge of a wide range of subjects.
Importance of testing LLMs in production
Testing large language models in production helps ensure their robustness, reliability, and
efficiency in serving real-world use cases, contributing to trustworthy and high-quality AI
systems. To delve deeper, we can broadly categorize the importance of testing LLMs in
production, as discussed below.
To avoid the threats associated with LLMs
Certain potential risks associated with LLMs make production testing particularly important for
ensuring the model performs optimally:
Adversarial attacks: Proactive testing of models can help identify and defend against
potential adversarial attacks. To avoid such attacks in a live environment, models can
be scrutinized with adversarial examples to enhance their resilience before
deployment.
Data authenticity and inherent bias: Typically, data sourced from various platforms
can be unstructured and may inadvertently capture human biases, which can be
reflected in the trained models. These biases may discriminate against certain groups
based on attributes such as gender, race, religion, or sexual orientation, with
repercussions varying depending on the model’s application scope. Offline evaluations may
overlook such biases, as they primarily focus on aggregate performance rather than on how
the underlying data shapes the model’s behavior.
Identification of failure points: Potential failures can occur when integrating ML
systems like LLMs into a production setting. These may stem from performance biases, a
lack of robustness, or failures on particular inputs. Aggregate evaluations might not
detect these failures, even though they indicate underlying issues. For instance, a
model with 90% accuracy indicates challenges with the remaining 10% of the data,
suggesting difficulties in generalizing this portion. This insight can trigger a closer
examination of the data for errors, leading to a deeper understanding of how to address
them. As evaluations don’t capture everything, creating structured tests for conceivable
scenarios is vital, helping identify potential failure modes.
To overcome challenges involved in moving LLMs to enterprise-scale
production
Exorbitant operational and experimental expenses: Running very large models is
inherently costly. These models require substantial compute infrastructure and must
distribute their workload across many machines. On top of that, experimentation and
iteration can become expensive quickly, and the budget may run out before the model
is even ready for use. So, it is crucial to verify early on that the model performs
as expected.
Language misappropriation concerns: Large language models draw on vast amounts of data
from many sources. A major problem is that this data can carry biases rooted in where it
comes from, such as culture and society. In addition, verifying that so much information is
accurate takes a great deal of work and time. If the model learns from data that is biased or
wrong, it can amplify these problems and produce results that are unfair or
misleading. It is also genuinely difficult to make these models grasp human reasoning and
the different meanings the same information can carry. The key is to make sure that the
models reflect the wide range of human beliefs and views.
Adaptation for specific tasks: Large language models are great at handling broad
workloads, but making them work for specific tasks can be tricky. This typically means
adapting the big models into smaller ones that focus on particular jobs. These smaller models
can retain much of the original performance, but getting them right takes time: you have
to think carefully about what data to use, how to configure the model, and which base
models to adjust. Getting these choices right is also important for making sure
you can understand how the model works.
Hardware constraints: Even with a generous budget for running large models, figuring
out the best way to provision and distribute the compute they need can be tough.
There is no one-size-fits-all setup for these models, so you need to work out the right
configuration for your own model. You also need reliable ways of making sure your
compute resources can handle changes in your large model’s size.
Given the scarcity of expertise in parallel and distributed computing, the onus falls on your
organization to find specialists adept at handling LLMs.
What sets testing LLMs in production apart from testing them in
earlier stages of the development process?
End-user feedback is the ultimate validation of model quality: it’s crucial to measure
whether users deem the responses as “good” or “bad,” and this feedback should guide your
improvement efforts. High-quality input/output pairs gathered in this way can further be
employed to fine-tune the large language models.
Explicit user feedback is gleaned when users respond with a clear indicator, like a thumbs up
or thumbs down, while interacting with the LLM output in your interface. However, actively
soliciting such feedback may not yield a large enough response volume to gauge overall
quality effectively. If the rate of explicit feedback collection is low, it may be advisable to use
implicit feedback, if feasible.
Implicit feedback, on the other hand, is inferred from the user’s reaction to the LLM output.
For instance, suppose an LLM produces the initial draft of an email for a user. If the user
dispatches the email without making any modifications, it likely indicates a satisfactory
response. Conversely, if they opt to regenerate the message or rewrite it entirely, that
probably signifies dissatisfaction. Implicit feedback may not be viable for all use-cases, but it
can be a potent tool for assessing quality.
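In practice, capturing both kinds of signal can be as simple as logging a structured feedback event alongside each model response. The sketch below is a minimal illustration in Python; the event fields and the heuristic that maps an implicit signal to satisfaction are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional

@dataclass
class FeedbackEvent:
    """One feedback record tied to a single LLM response."""
    response_id: str
    explicit_rating: Optional[str] = None   # "thumbs_up" / "thumbs_down", if the user clicked
    implicit_signal: Optional[str] = None   # e.g. "sent_unchanged", "regenerated", "edited_heavily"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    def satisfaction(self) -> Optional[bool]:
        """Collapse explicit and implicit signals into a rough satisfied/unsatisfied flag."""
        if self.explicit_rating is not None:
            return self.explicit_rating == "thumbs_up"
        if self.implicit_signal is not None:
            # Assumed heuristic: sending the draft untouched counts as satisfaction.
            return self.implicit_signal == "sent_unchanged"
        return None  # no usable signal

# Usage: log one explicit and one implicit event and compute a crude satisfaction rate.
events = [
    FeedbackEvent("resp-001", explicit_rating="thumbs_up"),
    FeedbackEvent("resp-002", implicit_signal="regenerated"),
]
flags = [e.satisfaction() for e in events if e.satisfaction() is not None]
print(f"satisfaction rate: {sum(flags) / len(flags):.0%}")
```

Aggregating these flags over time gives a rough satisfaction rate that can be tracked per prompt version or per use case.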
The importance of feedback, particularly in the context of testing in a production
environment, is underscored by the real-world and dynamic interactions users have with the
LLM. In comparison, testing in other stages, such as development or staging, often involves
predefined datasets and scenarios that may not capture the full range of potential user
interactions or uncover all the possible model shortcomings. This difference highlights why
testing in production, bolstered by user feedback, is a crucial step in deploying and
maintaining high-quality LLMs.
Testing LLMs in production allows you to understand your model better and helps identify
and rectify bugs early. There are different approaches and stages of production testing for
LLMs. Let’s get an overview.
Enumerate use cases
The first step in testing LLMs is to identify the possible use cases for your application.
Consider both the objectives of the users (what they aim to accomplish) and the various
types of input your system might encounter. This step helps you understand the broad range
of interactions your users might have with the model and the diversity of data it needs to
handle.
Define behaviors and properties, and develop test cases
Once you have identified the use cases, contemplate the high-level behaviors and properties
that can be tested for each use case. Use these behaviors and properties to write specific
test cases. You can even use the LLM to generate ideas for test cases, refining the best
ones and then asking the LLM to generate more ideas based on your selection. However, for
practicality, choose a few easy use cases to test the fundamental properties. While some use
cases might need more comprehensive testing, starting with basic properties can provide
initial insights.
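To make this concrete, here is a minimal sketch of what such property-based test cases might look like for a "make this draft more concise" use case, using pytest. The `shorten_draft` function is a hypothetical stand-in for your own prompt-plus-model call, stubbed out here so the example is self-contained.

```python
import pytest

def shorten_draft(text: str) -> str:
    """Hypothetical wrapper around the LLM call; replaced here with a trivial stub."""
    # In a real test suite this would call the deployed prompt/model.
    words = text.split()
    return " ".join(words[: max(1, len(words) // 2)])

DRAFTS = [
    "The meeting that we had scheduled for Tuesday will now be moved to Thursday afternoon.",
    "Please find attached the quarterly report along with the supporting spreadsheets and notes.",
]

@pytest.mark.parametrize("draft", DRAFTS)
def test_output_is_shorter_than_input(draft):
    result = shorten_draft(draft)
    assert 0 < len(result) < len(draft)

@pytest.mark.parametrize("draft", DRAFTS)
def test_output_is_not_empty_or_whitespace(draft):
    assert shorten_draft(draft).strip()
```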
Investigate discovered bugs
Once you identify errors in the initial tests, delve deeper into these bugs. For example,
suppose that in a use case where the LLM is tasked with making a draft more concise, you
notice an error rate of 8.3%; inspecting those errors closely will often reveal patterns that
provide insights into the underlying issues. A dedicated prompt can be developed to
facilitate this process, mimicking the AdaTest approach, where prompt/UI optimization is
prioritized.
Unit testing
Unit testing involves testing individual components of a software system or application. In
the context of LLMs, this could include various elements of the model, such as:
Input data quality checks: Testing to ensure that the inputs are correct and in the
right format and that the parameters used are accurate. This will involve validating the
format and content of the dataset used in the model.
Algorithms: Testing the underlying algorithms in the LLMs, such as sorting and
searching algorithms, machine learning algorithms, etc. This is done to verify the
accuracy of the output, given the input.
Architecture: Testing the architecture of the LLM to validate that it is working correctly.
This could involve the layers of a deep learning model, the features in a decision tree,
the weights in a neural network, etc.
Configuration: Validating the configuration settings of the model.
Model evaluation: The output of the models should be tested against known answers
to ensure accuracy.
Performance: The performance of the LLM model in terms of speed and efficiency
needs to be tested.
Memory: Memory usage of the model should be tested and optimized.
Parameters: Testing the parameters used in the LLM, such as the learning rate,
momentum, and weight decay in a neural network.
These components might be tested individually or in combinations, depending on the
requirements of the model and the results of previous tests. Each component may have a
different effect on the model’s overall performance, so it is important to examine them
individually to identify any issues that may impact the LLM’s performance.
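As a small illustration of the input data quality and parameter checks listed above, the sketch below validates a hypothetical inference request before it reaches the model; the field names, character limit, and temperature range are assumptions you would replace with your own constraints.

```python
def validate_request(payload: dict) -> list[str]:
    """Return a list of problems found in an inference request; an empty list means valid."""
    problems = []
    prompt = payload.get("prompt")
    if not isinstance(prompt, str) or not prompt.strip():
        problems.append("prompt must be a non-empty string")
    if len(prompt or "") > 8000:
        problems.append("prompt exceeds the assumed 8,000-character limit")

    params = payload.get("parameters", {})
    temperature = params.get("temperature", 0.7)
    if not (0.0 <= temperature <= 2.0):
        problems.append("temperature must be between 0.0 and 2.0")
    return problems

# Simple unit checks: a well-formed request passes, a malformed one is rejected.
assert validate_request({"prompt": "Summarize this email.", "parameters": {"temperature": 0.2}}) == []
assert validate_request({"prompt": "", "parameters": {"temperature": 5.0}}) == [
    "prompt must be a non-empty string",
    "temperature must be between 0.0 and 2.0",
]
print("input validation unit tests passed")
```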
Integration testing
After validating individual components, test how different parts of the LLM interact.
Integration testing involves testing the various parts of a system in an integrated manner to
assess whether they function together as intended. Here is how the process works for a
language model:
Data integrity: Check the flow of data in the system. For instance, if a language model
is fed data, check whether the right kind of data is being processed correctly and the
output is as expected.
Layer interaction: In the case of a deep learning model like a neural network, it’s
important to test how information is processed and passed from one layer to the next.
This involves checking the weight and bias values and ensuring data transfer is
happening correctly. This could be as simple as checking to see if the data from one
layer is correctly passed to the next layer without any loss or distortion.
Feature testing: Test the feature extraction capability of the model. Good features are
essential for good performance in a deep learning model. You might need to test
whether the features extracted by the model are appropriate and contribute to the
overall performance of the model.
Model performance: The performance of the model is critical. Once trained, you need
to test whether the model can correctly classify, regress, or perform whatever task it is
designed for. This typically involves a substantial amount of testing.
Output testing: This is about testing the output of the whole system. You have an
input, and you know what the output should be. Give the system the input and compare
the output to the expected result.
Interface testing: Here, you will look at how the different components of the system
work together. For instance, how well does the user interface work with the database?
Or how well does the front-end web interface work with the back-end processing
scripts?
Remember that most of these tests are about a single function or feature of the whole
system. Once you’ve ensured that each feature works correctly, you can move on to
testing how those features work together, which is the ultimate goal of integration
testing.
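A minimal sketch of output and interface testing at the integration level might look like the following, where `generate_reply` is a hypothetical stand-in for the full pipeline (prompt template, model call, and post-processing). The checks are deliberately loose, since exact string matching rarely works for generative output.

```python
import re

def generate_reply(prompt: str) -> str:
    """Hypothetical end-to-end pipeline (prompt template + model call + post-processing)."""
    return f"Thank you for your message about '{prompt}'. We will follow up within 2 business days."

def test_end_to_end_support_reply():
    prompt = "late delivery"
    reply = generate_reply(prompt)

    # Interface-level check: the pipeline returned a usable string.
    assert isinstance(reply, str) and reply.strip()

    # Output-level checks: the reply mentions the topic and a follow-up window.
    assert "late delivery" in reply.lower()
    assert re.search(r"\d+\s+business days", reply)

test_end_to_end_support_reply()
print("integration test passed")
```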
Regression testing
For an LLM, regression testing involves running a suite of tests to ensure that changes such
as those added through feature engineering, hyperparameter tuning, or changes in the input
data have not adversely affected performance. These can include re-running the model and
comparing the results to the original, checking for differences in the results, or running new
tests to verify that the model’s performance metrics have not changed.
As you can see, regression testing is an essential part of the model development process,
and its primary function is to catch any problems that may arise during the upgrade process.
This involves comparing the model’s current performance with the results obtained when the
model was first developed. Regression testing ensures that new updates, patches or
improvements do not cause problems with the existing functionality, and it can help detect
any problems that may arise in the future.
It’s important to note that regression testing can also be done after the model is deployed to
production. This can be achieved by re-running the same tests on the upgraded model to
see how it performs. Regression testing can also be done by comparing the model’s
performance metrics with those obtained from a suite of tests. If the metrics are not
significantly different, then the model is considered to be in good health.
While regression testing is a very important part of the model development process, it is not
the only way to test a model; other methods, such as unit testing, functional testing, and load
testing, can also be used to check a model’s performance. Regression testing, however, can
be performed at any point in the model’s life cycle, and it is key to ensuring that your model
keeps performing at its best without introducing any new bugs or problems.
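A lightweight regression check often amounts to re-running a fixed test bank after every prompt or model change and comparing an aggregate metric against a stored baseline. The sketch below assumes a hard-coded baseline score and tolerance, and uses a trivial stand-in generator; in practice the generator would call your updated model or prompt.

```python
BASELINE_EXACT_MATCH = 0.80  # assumed score recorded when the production prompt was approved
TOLERANCE = 0.03             # assumed allowed absolute drop before we flag a regression

def exact_match_rate(test_bank, generate) -> float:
    """Re-run the fixed test bank and compute the share of exact matches."""
    hits = sum(
        generate(case["input"]).strip().lower() == case["expected"].strip().lower()
        for case in test_bank
    )
    return hits / len(test_bank)

# Usage with a trivial stand-in generator for the updated model/prompt.
test_bank = [
    {"input": "2 + 2 =", "expected": "4"},
    {"input": "Capital of France?", "expected": "Paris"},
]
current = exact_match_rate(test_bank, generate=lambda p: "4" if "2 + 2" in p else "Paris")
if current < BASELINE_EXACT_MATCH - TOLERANCE:
    print(f"regression detected: {current:.2f} < baseline {BASELINE_EXACT_MATCH:.2f}")
else:
    print(f"no regression: {current:.2f} vs baseline {BASELINE_EXACT_MATCH:.2f}")
```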
Load testing
Load testing for LLMs examines how the system behaves when it must process a high
volume of requests or data in a short amount of time. A typical load-testing process looks
like this:
Identify the key scenarios: Load testing should begin by identifying the scenarios
where the system may face high demand. These might be common situations that the
system will face or be worst-case scenarios. The load testing should consider how the
system will behave in these situations.
Design and implement the test: Once the scenarios are identified, tests should be
designed to simulate these scenarios. The tests may need to account for various
factors, such as the volume of data, the speed of data input, and the complexity of the
data.
Execute the test: The system should be monitored closely during the test to see how it
behaves. This might involve checking the server load, the response times, and the
error rates. It may also be necessary to perform the test multiple times to ensure
reliable results.
Analyze the results: Once the test is completed, the results should be analyzed to see
how the system behaves. This can involve looking at metrics such as the number of
users, the response time, the error rate, and the server load. These results can help to
identify any issues that need to be addressed.
Repeat the process: Load testing should be repeated regularly to ensure the system
can still handle the expected load. As the system evolves and the scenarios change,
the tests may need to be updated.
Load testing is crucial to ensuring that a system can handle the load it is expected to face.
By understanding how a system behaves under load, it is possible to design and build more
resilient systems that can handle high volumes of data. This can help to ensure that a
system can continue to provide a high level of service, even under heavy load.
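The sketch below shows the basic structure of such a test using only the Python standard library: fire concurrent requests at the service, record latencies and errors, and inspect the distribution. The endpoint URL, payload shape, and request volumes are assumptions; for serious load testing you would more likely reach for a dedicated tool such as Locust or k6, but the shape of the exercise is the same.

```python
import json
import statistics
import time
from concurrent.futures import ThreadPoolExecutor
from urllib import error, request

ENDPOINT = "http://localhost:8000/generate"  # hypothetical LLM service endpoint
CONCURRENCY = 20
TOTAL_REQUESTS = 200

def one_request(prompt: str) -> tuple[bool, float]:
    """Send a single request and return (success, latency_in_seconds)."""
    payload = json.dumps({"prompt": prompt}).encode()
    req = request.Request(ENDPOINT, data=payload, headers={"Content-Type": "application/json"})
    start = time.perf_counter()
    try:
        with request.urlopen(req, timeout=30) as resp:
            resp.read()
        return True, time.perf_counter() - start
    except (error.URLError, TimeoutError):
        return False, time.perf_counter() - start

with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    results = list(pool.map(one_request, ["Summarize our refund policy."] * TOTAL_REQUESTS))

latencies = sorted(lat for ok, lat in results if ok)
errors = sum(1 for ok, _ in results if not ok)
if latencies:
    print(f"p50={statistics.median(latencies):.2f}s  p95={latencies[int(0.95 * len(latencies))]:.2f}s")
print(f"error rate: {errors / TOTAL_REQUESTS:.1%}")
```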
Feedback loop
Implement a feedback loop system where users can provide explicit or implicit feedback on
the model’s responses. This allows you to collect real-world user feedback, which is
invaluable for improving the model’s performance.
User feedback is instrumental in the iterative process of model refinement and plays a
crucial role in the performance of machine learning models. It can be considered a direct
communication channel with users and is useful in the following ways:
User needs understanding: Feedback from users can provide critical information
about what users want, what they find useful, and the areas where the machine
learning model might improve. Understanding these requirements can help tailor the
machine learning model’s functionality more closely to users’ needs.
Model refinement: User feedback can guide the model refinement process, helping
developers understand where the model falls short and what improvements can be
made. This is especially true in the case of machine learning models, where user
feedback can directly impact the model’s ability to ‘learn.’
Model validation: User feedback can also play a key role in model validation. For
instance, if a user flags a certain response as inaccurate, this can be considered when
updating and training the model.
Detection of shortcomings: User feedback can also help to detect any shortcomings
or gaps in the model. These can be areas where the model is weak or does not meet
user needs. By identifying these gaps, developers can work to improve the model and
its outputs.
Improving accuracy: By using user feedback, developers can work to improve the
accuracy of the model’s responses. For instance, if a model consistently receives
negative feedback on a particular type of response, the developers can investigate this
and make adjustments to improve the accuracy.
A/B testing
If you have multiple versions of a model or different models, use A/B testing to compare their
performance in the production environment. This involves serving different model versions to
different user groups and comparing their performance metrics. A/B testing, also known as
split testing, is a technique used to compare two versions of a system to determine which
one performs better. In the context of large language models, A/B testing can compare
different versions of the same or entirely different models.
Here is a detailed description of how A/B testing can be employed for LLMs:
Model comparison: If you have two versions of a language model (for example, two
different training runs or the same model trained with two different sets of
hyperparameters), you can use A/B testing to determine which performs better in a
production environment.
Feature testing: You can use A/B testing to evaluate the impact of new features. For
instance, if you introduce a new preprocessing step or incorporate additional training
data, you can run an A/B test to compare the model’s performance with and without the
new feature.
Error analysis: A/B testing can also be used for error analysis. If users report an issue
with the LLM’s responses, you can run an A/B test with the fix in place to verify whether
the issue has been resolved.
User preference: A/B testing can help understand user preferences. By presenting a
group of users with responses generated by two different models or model versions,
you can gather feedback on which model’s responses are preferred.
Deployment decisions: The results of A/B testing can inform decisions about which
version of a model to deploy in a production environment. If one model version
consistently outperforms another in A/B tests, it is likely a good candidate for
deployment.
During A/B testing, it’s important to ensure that the test is fair and that any differences in
performance can be attributed to the differences between the models rather than to external
factors. This typically involves randomly assigning users or requests to the different models
and controlling for variables that could influence the results.
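Once users have been randomly assigned and feedback has been logged for each variant, the comparison itself can be a standard statistical test. The sketch below uses a two-proportion z-test on made-up thumbs-up counts; the counts and the 0.05 significance level are assumptions, and other tests (for example, a chi-squared test) would be equally reasonable.

```python
from math import erf, sqrt

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Return (z statistic, two-sided p-value) for the difference in success rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # two-sided, via the normal CDF
    return z, p_value

# Made-up results: thumbs-up counts out of all rated responses for each model version.
z, p = two_proportion_z_test(successes_a=412, n_a=1000, successes_b=371, n_b=1000)
print(f"z = {z:.2f}, p = {p:.3f}")
if p < 0.05:
    print("Version A's feedback rate is significantly different from version B's.")
else:
    print("No significant difference detected; keep collecting data.")
```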
Bias and fairness testing
Conduct tests to identify and mitigate potential biases in the model’s outputs. This involves
using fairness metrics and bias evaluation tools to measure the model’s equity across
different demographic groups.
Bias and fairness are important considerations when testing and deploying LLMs. They are
crucial because biased responses or decisions the model makes can have serious
consequences, leading to unfair treatment or discrimination.
Bias and fairness testing for LLMs typically involves the following steps:
Data audit: The data used must be audited for potential biases before training an LLM.
This includes understanding the sources of the data, its demographics, and any
potential areas of bias it might contain. The model will often learn biases in the training
data, so it’s important to identify and address these upfront.
Bias metrics: Implement metrics to quantify bias in the model’s outputs. These could
include metrics that measure disparity in error rates or the model’s performance across
different demographic groups.
Test case generation: Generate test cases that help uncover biases. This could
involve creating synthetic examples covering a range of demographics and situations,
particularly those prone to bias.
Model evaluation: The LLM should be evaluated using the test cases and bias
metrics. If bias is found, the developers need to understand why it is happening. Is it
due to the training data or due to some aspect of the model’s architecture or learning
algorithm?
Model refinement: If biases are detected, the model may need to be refined or
retrained to minimize them. This could involve changes to the model or require
collecting more balanced or representative training data.
Iterative process: Bias and fairness testing is an iterative process. As new versions of
the model are developed, or the model is exposed to new data in a production
environment, the tests should be repeated to ensure that the model continues to
behave fairly and without bias.
User feedback: Allow users to provide feedback about the model’s outputs. This can
help detect biases that the testing process may have missed. User feedback is
especially valuable as it provides real-world insights into how the model is performing.
Ensuring bias and fairness in LLMs is a challenging and ongoing task. However, it’s a crucial
part of the model’s development process, as it can significantly affect its performance and
impact on users. By systematically testing for bias and fairness, developers can work
towards creating fair and unbiased models, which leads to better, more equitable outcomes.
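As one concrete form of the bias metrics step described above, the sketch below computes a per-group error rate from labeled evaluation records and flags the largest gap between groups. The records, group labels, and disparity tolerance are all made-up values for illustration.

```python
from collections import defaultdict

# Assumed evaluation records: each has a demographic group label and whether the output was judged correct.
records = [
    {"group": "group_a", "correct": True},
    {"group": "group_a", "correct": True},
    {"group": "group_a", "correct": False},
    {"group": "group_b", "correct": True},
    {"group": "group_b", "correct": False},
    {"group": "group_b", "correct": False},
]

def error_rate_by_group(records) -> dict[str, float]:
    """Compute the share of incorrect outputs per demographic group."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        errors[r["group"]] += int(not r["correct"])
    return {group: errors[group] / totals[group] for group in totals}

rates = error_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates)
if gap > 0.10:  # assumed tolerance for the allowed disparity between groups
    print(f"disparity of {gap:.0%} exceeds tolerance; investigate the data and model behavior")
```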
Anomaly detection
Implement anomaly detection systems to alert you when the model’s behavior deviates from
what is expected. This can help identify issues in real time, allowing you to respond quickly.
Anomaly detection, also known as outlier detection, identifies items, events, or observations
that differ significantly from most of the data. In the context of LLMs, anomaly detection can
be essential to ensuring the model’s responses are within expected parameters and
identifying any unusual or potentially problematic output.
Here’s a detailed breakdown of how anomaly detection can be performed in LLMs:
Define normal behavior: Anomaly detection starts with defining what is “normal” for
the LLM’s output. This could be based on past responses, training data, or defined
constraints. For example, the length of the generated text, the topic, the sentiment, or
the type of language used can be factors that define normal behavior.
Set thresholds: Once the normal behavior is defined, thresholds need to be set to
determine when a response is considered an anomaly. These thresholds could be
based on statistical methods (e.g., anything beyond three standard deviations from the
mean might be considered an outlier) or domain-specific rules (e.g., a response
containing explicit language might be considered an anomaly).
Monitor model outputs: As the model generates responses, these should be
monitored and compared to the defined thresholds. Any response that falls outside
these thresholds is flagged as a potential anomaly.
Investigate anomalies: Any identified anomalies should be investigated to understand
why they occurred. This can help in identifying whether the anomaly is due to an issue
with the model (e.g., bias in the training data, a bug in the model, or an unexpected
interaction between different parts of the model) or whether it’s an acceptable response
that just happens to be unusual.
Update model or thresholds: Depending on the findings of the investigation, you may
need to update the model or the thresholds. For example, if an anomaly is due to a bug
in the model, you would need to fix the bug. If the anomaly is due to bias in the training
data, you may need to retrain the model with more balanced data. Alternatively, if the
anomaly is an acceptable but unusual response, you may need to adjust your
thresholds to accommodate these responses.
Remember that anomaly detection is an ongoing process. As the LLM continues to learn and
adapt to new data, what is considered “normal” may change, and the thresholds may need to
be adjusted accordingly. By continuously monitoring the model’s outputs and investigating
any anomalies, you can ensure that the model continues performing as expected and
delivers high-quality responses.
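Here is a minimal sketch of the threshold-based approach, assuming response length is one of the monitored properties and using a simple three-standard-deviation rule; in production you would typically track several properties (length, sentiment, flagged terms) and update the statistics continuously rather than recompute them from a fixed history.

```python
import statistics

# Assumed history of response lengths (in tokens) observed under normal operation.
history = [62, 58, 71, 65, 60, 68, 64, 59, 66, 63]
mean = statistics.mean(history)
std = statistics.stdev(history)

def is_anomalous(length: int, threshold: float = 3.0) -> bool:
    """Flag a response whose length is more than `threshold` standard deviations from the mean."""
    return abs(length - mean) > threshold * std

for new_length in [67, 240, 3]:
    status = "ANOMALY" if is_anomalous(new_length) else "ok"
    print(f"response length {new_length}: {status}")
```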
Key metrics for evaluating LLMs in production
There are several key metrics to assess the performance of a large language model in
production.
Interaction and user engagement
This metric quantifies the model’s proficiency in maintaining user engagement throughout a
conversation. It explores the model’s propensity to ask pertinent follow-up questions, clarify
ambiguities, and foster a fluid dialogue. Established usage metrics gathered through user
surveys or other tools can be used to gauge engagement, including average query volume,
average query size, response feedback rating, and average session duration.
Response coherence
This metric focuses on the model’s capacity to generate coherent and contextually
appropriate responses. It verifies the model’s proficiency in producing relevant and
meaningful answers. Language scoring techniques such as Bilingual Evaluation Understudy
(BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) can be used to
measure this aspect against reference responses.
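For reference, here is a brief sketch of computing both scores against a reference response, assuming the `nltk` and `rouge-score` packages are installed; the example strings are made up, and in practice you would average these scores over a labeled evaluation set.

```python
from nltk.translate.bleu_score import SmoothingFunction, sentence_bleu
from rouge_score import rouge_scorer

reference = "Your order has been shipped and should arrive within three business days."
candidate = "Your order was shipped and is expected to arrive in three business days."

# BLEU compares n-gram overlap against the reference; smoothing avoids zero scores on short texts.
bleu = sentence_bleu(
    [reference.split()], candidate.split(),
    smoothing_function=SmoothingFunction().method1,
)

# ROUGE-L measures longest-common-subsequence overlap with the reference.
scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = scorer.score(reference, candidate)["rougeL"].fmeasure

print(f"BLEU: {bleu:.3f}  ROUGE-L F1: {rouge_l:.3f}")
```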
Fluency
Fluency evaluates the model’s responses’ structural integrity, grammatical correctness, and
linguistic coherence. It assesses the model’s competency in producing language that sounds
natural and fluid. The perplexity metric, the inverse probability of the test set
normalized by the number of words, can be used to measure fluency.
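Perplexity can be computed directly from per-token log-probabilities, which many inference APIs can return alongside the generated text. The sketch below assumes you already have those log-probabilities; the sample values are made up.

```python
import math

def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity is the exponential of the average negative log-probability per token."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Assumed per-token natural-log probabilities returned by the model for one response.
logprobs = [-0.21, -1.35, -0.07, -2.40, -0.55, -0.90]
print(f"perplexity: {perplexity(logprobs):.2f}")
```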
Relevance
Relevance assesses the alignment of the model’s responses with the user’s input or query. It
checks whether the model accurately grasps the user’s intention and provides suitable, on-
topic responses. Metrics such as the F1-score and embedding-based techniques built on
models like BERT can be used to measure relevance.
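One common way to approximate relevance is to embed the user query and the model's response with a BERT-style sentence encoder and compare them by cosine similarity. The sketch below assumes the `sentence-transformers` package and the `all-MiniLM-L6-v2` model; the 0.5 threshold is an assumption you would calibrate against human relevance judgments.

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed lightweight BERT-style encoder

query = "How do I reset my account password?"
response = "You can reset your password from the login page by clicking 'Forgot password'."

# Embed both texts and compare them with cosine similarity.
embeddings = model.encode([query, response], convert_to_tensor=True)
similarity = util.cos_sim(embeddings[0], embeddings[1]).item()

print(f"cosine similarity: {similarity:.2f}")
if similarity < 0.5:  # assumed threshold; calibrate against human relevance judgments
    print("response may be off-topic for the query")
```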
Contextual awareness
This metric gauges the model’s capacity to understand the conversation’s context. It verifies
the model’s ability to reference prior messages, track dialogue history, and deliver consistent
responses. Cross Mutual Information (XMI) can help measure context awareness.
Sensibleness and specificity
This metric evaluates the sensibility and specificity of the model’s responses. It checks
whether the model provides sensible, detailed answers rather than generic or illogical
responses. To measure this, one could compute the average sensibleness and specificity
scores given by human evaluators for the model’s responses across the entire dataset,
yielding an overall measurement of both qualities.
Endnote
While the process of testing may be demanding, particularly when using large language
models, the alternatives present their own sets of challenges. Benchmarking tasks that
involve generation, where there are multiple correct answers, can be inherently complex,
leading to a lack of confidence in the results. Obtaining human evaluations of a model’s
output can be even more time-consuming and may lose relevance as the model evolves,
rendering the collected labels less useful.
Choosing not to test could result in a lack of understanding of the model’s behavior, a
situation that could pave the way for potential failures. On the other hand, a well-structured
testing approach can unearth bugs, provide deeper insights into the task at hand, and reveal
serious specification issues early in the process, thereby allowing time for course correction.
In weighing the pros and cons, it becomes evident that investing time in rigorous testing is a
judicious choice. This not only ensures a deep understanding of the model’s performance
and behavior but also guarantees that any potential issues are identified and addressed
promptly, contributing to the successful deployment of the LLM in a production environment.
For your large language models to excel, ongoing testing is indispensable, with a specific
focus on production testing. Partnering with LeewayHertz means gaining access to custom
models and solutions tailored to your business needs, all fortified with rigorous testing to
ensure resilience, security, and accuracy.
14/14
Start a conversation by filling the form

More Related Content

Similar to How to test LLMs in production.pdf

AI.pdf
AI.pdfAI.pdf
AI.pdf
Tariqqandeel
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
JamieDornan2
 
What is the Role of Machine Learning in Software Development.pdf
What is the Role of Machine Learning in Software Development.pdfWhat is the Role of Machine Learning in Software Development.pdf
What is the Role of Machine Learning in Software Development.pdf
JPLoft Solutions
 
B potential pitfalls_of_process_modeling_part_b-2
B potential pitfalls_of_process_modeling_part_b-2B potential pitfalls_of_process_modeling_part_b-2
B potential pitfalls_of_process_modeling_part_b-2
Jean-François Périé
 
Model validation techniques in machine learning.pdf
Model validation techniques in machine learning.pdfModel validation techniques in machine learning.pdf
Model validation techniques in machine learning.pdf
AnastasiaSteele10
 
12 considerations for mobile testing (march 2017)
12 considerations for mobile testing (march 2017)12 considerations for mobile testing (march 2017)
12 considerations for mobile testing (march 2017)
Antoine Aymer
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
StephenAmell4
 
Technovision
TechnovisionTechnovision
Technovision
SayantanGhosh58
 
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
ijsc
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
StephenAmell4
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
AnastasiaSteele10
 
Argument Papers (5-7 pages in length)1. Do schools perpe.docx
Argument Papers (5-7 pages in length)1. Do schools perpe.docxArgument Papers (5-7 pages in length)1. Do schools perpe.docx
Argument Papers (5-7 pages in length)1. Do schools perpe.docx
fredharris32
 
Course 2 Machine Learning Data LifeCycle in Production - Week 1
Course 2   Machine Learning Data LifeCycle in Production - Week 1Course 2   Machine Learning Data LifeCycle in Production - Week 1
Course 2 Machine Learning Data LifeCycle in Production - Week 1
Ajay Taneja
 
Applying user modelling to human computer interaction design
Applying user modelling to human computer interaction designApplying user modelling to human computer interaction design
Applying user modelling to human computer interaction designNika Stuard
 
4 why bad_things_happen_to_goog_projects
4 why bad_things_happen_to_goog_projects4 why bad_things_happen_to_goog_projects
4 why bad_things_happen_to_goog_projects
Robert Nuñez
 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
PhD Assistance
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
AnastasiaSteele10
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
JamieDornan2
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
StephenAmell4
 

Similar to How to test LLMs in production.pdf (20)

AI.pdf
AI.pdfAI.pdf
AI.pdf
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
 
10.1.1.9.5971 (1)
10.1.1.9.5971 (1)10.1.1.9.5971 (1)
10.1.1.9.5971 (1)
 
What is the Role of Machine Learning in Software Development.pdf
What is the Role of Machine Learning in Software Development.pdfWhat is the Role of Machine Learning in Software Development.pdf
What is the Role of Machine Learning in Software Development.pdf
 
B potential pitfalls_of_process_modeling_part_b-2
B potential pitfalls_of_process_modeling_part_b-2B potential pitfalls_of_process_modeling_part_b-2
B potential pitfalls_of_process_modeling_part_b-2
 
Model validation techniques in machine learning.pdf
Model validation techniques in machine learning.pdfModel validation techniques in machine learning.pdf
Model validation techniques in machine learning.pdf
 
12 considerations for mobile testing (march 2017)
12 considerations for mobile testing (march 2017)12 considerations for mobile testing (march 2017)
12 considerations for mobile testing (march 2017)
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
 
Technovision
TechnovisionTechnovision
Technovision
 
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
AI TESTING: ENSURING A GOOD DATA SPLIT BETWEEN DATA SETS (TRAINING AND TEST) ...
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
 
A comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdfA comprehensive guide to prompt engineering.pdf
A comprehensive guide to prompt engineering.pdf
 
Argument Papers (5-7 pages in length)1. Do schools perpe.docx
Argument Papers (5-7 pages in length)1. Do schools perpe.docxArgument Papers (5-7 pages in length)1. Do schools perpe.docx
Argument Papers (5-7 pages in length)1. Do schools perpe.docx
 
Course 2 Machine Learning Data LifeCycle in Production - Week 1
Course 2   Machine Learning Data LifeCycle in Production - Week 1Course 2   Machine Learning Data LifeCycle in Production - Week 1
Course 2 Machine Learning Data LifeCycle in Production - Week 1
 
Applying user modelling to human computer interaction design
Applying user modelling to human computer interaction designApplying user modelling to human computer interaction design
Applying user modelling to human computer interaction design
 
4 why bad_things_happen_to_goog_projects
4 why bad_things_happen_to_goog_projects4 why bad_things_happen_to_goog_projects
4 why bad_things_happen_to_goog_projects
 
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
Machine Learning On Big Data: Opportunities And Challenges- Future Research D...
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
 
Unlock the power of MLOps.pdf
Unlock the power of MLOps.pdfUnlock the power of MLOps.pdf
Unlock the power of MLOps.pdf
 

More from AnastasiaSteele10

How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
AnastasiaSteele10
 
Action Transformer.pdf
Action Transformer.pdfAction Transformer.pdf
Action Transformer.pdf
AnastasiaSteele10
 
How to build an AI-powered chatbot.pdf
How to build an AI-powered chatbot.pdfHow to build an AI-powered chatbot.pdf
How to build an AI-powered chatbot.pdf
AnastasiaSteele10
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdf
AnastasiaSteele10
 
Build an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdfBuild an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdf
AnastasiaSteele10
 
How to build an AI app.pdf
How to build an AI app.pdfHow to build an AI app.pdf
How to build an AI app.pdf
AnastasiaSteele10
 
How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
AnastasiaSteele10
 
Action Transformer - The next frontier in AI development.pdf
Action Transformer - The next frontier in AI development.pdfAction Transformer - The next frontier in AI development.pdf
Action Transformer - The next frontier in AI development.pdf
AnastasiaSteele10
 

More from AnastasiaSteele10 (8)

How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
 
Action Transformer.pdf
Action Transformer.pdfAction Transformer.pdf
Action Transformer.pdf
 
How to build an AI-powered chatbot.pdf
How to build an AI-powered chatbot.pdfHow to build an AI-powered chatbot.pdf
How to build an AI-powered chatbot.pdf
 
What are neural networks.pdf
What are neural networks.pdfWhat are neural networks.pdf
What are neural networks.pdf
 
Build an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdfBuild an LLM-powered application using LangChain.pdf
Build an LLM-powered application using LangChain.pdf
 
How to build an AI app.pdf
How to build an AI app.pdfHow to build an AI app.pdf
How to build an AI app.pdf
 
How to build machine learning apps.pdf
How to build machine learning apps.pdfHow to build machine learning apps.pdf
How to build machine learning apps.pdf
 
Action Transformer - The next frontier in AI development.pdf
Action Transformer - The next frontier in AI development.pdfAction Transformer - The next frontier in AI development.pdf
Action Transformer - The next frontier in AI development.pdf
 

Recently uploaded

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
Product School
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
Alison B. Lowndes
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
g2nightmarescribd
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
Elena Simperl
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
91mobiles
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
Product School
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
Jemma Hussein Allen
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
Frank van Harmelen
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
Thijs Feryn
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Ramesh Iyer
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
AI for Every Business: Unlocking Your Product's Universal Potential by VP of ...
 
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdfFIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
 
Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........Bits & Pixels using AI for Good.........
Bits & Pixels using AI for Good.........
 
How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...How world-class product teams are winning in the AI era by CEO and Founder, P...
How world-class product teams are winning in the AI era by CEO and Founder, P...
 
Assuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyesAssuring Contact Center Experiences for Your Customers With ThousandEyes
Assuring Contact Center Experiences for Your Customers With ThousandEyes
 
Generating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using SmithyGenerating a custom Ruby SDK for your web service or Rails API using Smithy
Generating a custom Ruby SDK for your web service or Rails API using Smithy
 
When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...When stars align: studies in data quality, knowledge graphs, and machine lear...
When stars align: studies in data quality, knowledge graphs, and machine lear...
 
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdfSmart TV Buyer Insights Survey 2024 by 91mobiles.pdf
Smart TV Buyer Insights Survey 2024 by 91mobiles.pdf
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
 
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
From Siloed Products to Connected Ecosystem: Building a Sustainable and Scala...
 
The Future of Platform Engineering
The Future of Platform EngineeringThe Future of Platform Engineering
The Future of Platform Engineering
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
 
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdfFIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
FIDO Alliance Osaka Seminar: The WebAuthn API and Discoverable Credentials.pdf
 
Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*Neuro-symbolic is not enough, we need neuro-*semantic*
Neuro-symbolic is not enough, we need neuro-*semantic*
 
Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...Designing Great Products: The Power of Design and Leadership by Chief Designe...
Designing Great Products: The Power of Design and Leadership by Chief Designe...
 
Accelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish CachingAccelerate your Kubernetes clusters with Varnish Caching
Accelerate your Kubernetes clusters with Varnish Caching
 
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
Builder.ai Founder Sachin Dev Duggal's Strategic Approach to Create an Innova...
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

How to test LLMs in production.pdf

  • 1. 1/14 How to test LLMs in production? leewayhertz.com/how-to-test-llms-in-production In today’s AI-driven era, large language model-based solutions like ChatGPT have become integral in diverse scenarios, promising enhanced human-machine interactions. As the proliferation of these models accelerates, so does the need to gauge their quality and performance in real-world production environments. Testing LLMs in production poses significant challenges, as ensuring their reliability, accuracy, and adaptability is no straightforward task. Approaches such as executing unit tests with an extensive test bank, selecting appropriate evaluation metrics, and implementing regression testing when modifications are made to prompts in a production environment are indeed beneficial. However, scaling these operations often necessitates substantial engineering resources and the development of dedicated internal tools. This is a complex task that requires a significant investment of both time and manpower. The absence of a standardized testing method for these models complicates matters further. This article delves into the nuts and bolts of testing LLMs, primarily focusing on assessing them in a production environment. We will explore different testing methodologies, discuss the role of user feedback, and highlight the importance of bias and anomaly detection. This insight aims to provide a comprehensive understanding of how we can evaluate and ensure the reliability of these AI-powered language models in real-world settings. What is an LLM?
  • 2. 2/14 Large Language Models (LLMs) represent the pinnacle of current language modeling technology, leveraging the power of deep learning algorithms and an immense quantity of text data. Such models have the remarkable ability to emulate human-written text and execute a multitude of natural language processing tasks. To comprehend language models in general, we can think of them as systems that confer probabilities to word sequences predicated on scrutinizing text corpora. Their complexity can vary from straightforward n-gram models to more intricate neural network models. Nevertheless, large language models commonly denote models harnessing deep learning techniques and boasting an extensive array of parameters, potentially amounting from millions to billions. They are adept at recognizing intricate language patterns and crafting text that often mimics human composition. Building a ” large language model,” an extensive transformer model, usually requires resources beyond a single computer’s capabilities. Consequently, they are often offered as a service via APIs or web interfaces. Their training involves extensive text data from diverse sources like books, articles, websites, and other written content forms. This exhaustive training allows the models to understand statistical correlations between words, phrases, and sentences, enabling them to generate relevant and cohesive responses to prompts or inquiries. An example of such a model is ChatGPT’s GPT-3 model, which underwent training on an enormous quantity of internet text data. This process enables it to comprehend various languages and exhibit knowledge of a wide range of subjects. Importance of testing LLMs in production Testing large language models in production helps ensure their robustness, reliability, and efficiency in serving real-world use cases, contributing to trustworthy and high-quality AI systems. To delve deeper, we can broadly categorize the importance of testing LLMs in production, as discussed below. To avoid the threats associated with LLMs There are a certain potential risks associated with LLMs that significantly make production testing important for the optimum performance of the model: Adversarial attacks: Proactive testing of models can help identify and defend against potential adversarial attacks. To avoid such attacks in a live environment, models can be scrutinized with adversarial examples to enhance their resilience before deployment.
Data authenticity and inherent bias: Data sourced from various platforms is often unstructured and may inadvertently capture human biases, which are then reflected in the trained model. These biases may discriminate against certain groups based on attributes such as gender, race, religion, or sexual orientation, with repercussions that vary depending on the model's application scope. Standard evaluations may overlook such biases because they focus primarily on performance rather than on how the data shapes the model's behavior.

Identification of failure points: Failures can occur when integrating ML systems like LLMs into a production setting, whether due to biased performance, lack of robustness, or malformed inputs. Aggregate evaluations might not detect these failures even though they signal underlying issues. For instance, a model with 90% accuracy still struggles with the remaining 10% of the data, suggesting difficulty generalizing to that portion. This insight can trigger a closer examination of the data for errors and a deeper understanding of how to address them. Because evaluations don't capture everything, creating structured tests for conceivable scenarios is vital for identifying potential failure modes.

To overcome challenges involved in moving LLMs to enterprise-scale production

Exorbitant operational and experimental expenses: Very large models are expensive to run. They need substantial compute infrastructure, and their workload must be distributed across many machines. On top of that, experimentation and iteration can become costly quickly, and the budget may run out before the model is even ready for use. It is therefore crucial to ensure the model performs as expected.

Language misappropriation concerns: Large language models draw on data from many different sources, and that data can carry biases rooted in culture and society. Verifying the accuracy of such a large volume of information also takes considerable work and time. If the model learns from biased or incorrect data, it can amplify these problems and produce unfair or misleading results. It is also hard to make these models grasp human reasoning and the different meanings the same information can carry. The key is to make sure the models reflect the wide range of human beliefs and views.

Adaptation for specific tasks: Large language models handle broad data well, but adapting them to specific tasks can be tricky. This usually means refining the large model into smaller ones that focus on particular jobs while retaining the original's performance. Getting this right takes time: you have to think carefully about what data to use, how to configure the model, and which base models to adapt. These choices are important for ensuring the resulting model's behavior can be understood.
Hardware constraints: Even with a generous budget, working out the best way to provision and distribute the compute these models need is hard. There is no one-size-fits-all setup, so you need to determine the right configuration for your own model and have reliable ways to scale resources as the model grows. Given the scarcity of expertise in parallel and distributed computing, the onus falls on your organization to acquire specialists adept at handling LLMs.

What sets testing LLMs in production apart from testing them in earlier stages of the development process?

End-user feedback is the ultimate validation of model quality: it is crucial to measure whether users deem the responses "good" or "bad," and this feedback should guide your improvement efforts. High-quality input/output pairs gathered this way can also be employed to fine-tune the large language models.

Explicit user feedback is gleaned when users respond with a clear indicator, like a thumbs up or thumbs down, while interacting with the LLM output in your interface. However, actively soliciting such feedback may not yield a large enough response volume to gauge overall quality effectively. If the rate of explicit feedback collection is low, it may be advisable to rely on implicit feedback where feasible.

Implicit feedback, on the other hand, is inferred from the user's reaction to the LLM output. For instance, suppose an LLM produces the initial draft of an email for a user. If the user dispatches the email without making any modifications, that likely indicates a satisfactory response; if they regenerate the message or rewrite it entirely, that probably signifies dissatisfaction. Implicit feedback may not be viable for every use case, but it can be a potent tool for assessing quality.

The importance of feedback, particularly in the context of testing in a production environment, is underscored by the real-world, dynamic interactions users have with the LLM. Testing in earlier stages, such as development or staging, often relies on predefined datasets and scenarios that may not capture the full range of potential user interactions or uncover all of the model's shortcomings. This difference is why testing in production, bolstered by user feedback, is a crucial step in deploying and maintaining high-quality LLMs.
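As an illustration only (not tooling prescribed by this article), below is a minimal sketch of how explicit and implicit feedback signals like those described above might be recorded for later analysis; the event names, fields, and JSONL storage are assumptions.

```python
# Hypothetical sketch: capturing explicit (thumbs up/down) and implicit
# (user kept vs. regenerated the draft) feedback on LLM responses.
from dataclasses import dataclass, asdict
from datetime import datetime, timezone
import json

@dataclass
class FeedbackEvent:
    response_id: str  # ID of the LLM response being rated
    kind: str         # "explicit" or "implicit"
    signal: str       # e.g. "thumbs_up", "thumbs_down", "sent_unedited", "regenerated"
    timestamp: str

def record_feedback(response_id: str, kind: str, signal: str,
                    log_path: str = "feedback.jsonl") -> None:
    """Append one feedback event to a JSONL log for later aggregation."""
    event = FeedbackEvent(
        response_id=response_id,
        kind=kind,
        signal=signal,
        timestamp=datetime.now(timezone.utc).isoformat(),
    )
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Explicit feedback: the user clicked a thumbs-down button in the UI.
record_feedback("resp-123", "explicit", "thumbs_down")

# Implicit feedback: the user sent the generated email without edits.
record_feedback("resp-124", "implicit", "sent_unedited")
```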
Testing LLMs in production allows you to understand your model better and helps identify and rectify bugs early. There are different approaches and stages of production testing for LLMs. Let's get an overview.

Enumerate use cases

The first step in testing LLMs is to identify the possible use cases for your application. Consider both the objectives of the users (what they aim to accomplish) and the various types of input your system might encounter. This step helps you understand the broad range of interactions your users might have with the model and the diversity of data it needs to handle.

Define behaviors and properties, and develop test cases

Once you have identified the use cases, contemplate the high-level behaviors and properties that can be tested for each one, and use them to write specific test cases. You can even use the LLM itself to generate ideas for test cases, refine the best ones, and then ask the LLM to generate more ideas based on your selection. For practicality, start with a few easy use cases that test fundamental properties; some use cases will need more comprehensive testing, but basic properties provide useful initial insights.

Investigate discovered bugs

Once the initial tests surface errors, delve deeper into these bugs. For example, in a use case where the LLM is tasked with making a draft more concise, you might notice an error rate of 8.3%; inspecting these errors closely often reveals patterns that point to the underlying issues. A prompt can be developed to facilitate this process, mimicking the AdaTest approach, where prompt/UI optimization is prioritized.

Unit testing

Unit testing involves testing individual components of a software system or application. In the context of LLMs, this could include various elements of the model, such as:

Input data quality checks: Testing to ensure that the inputs are correct and in the right format and that the parameters used are accurate. This involves validating the format and content of the dataset used in the model.

Algorithms: Testing the underlying algorithms in the LLM, such as sorting and searching algorithms, machine learning algorithms, etc., to verify the accuracy of the output given the input.

Architecture: Testing the architecture of the LLM to validate that it is working correctly. This could involve the layers of a deep learning model, the features in a decision tree, the weights in a neural network, etc.

Configuration: Validating the configuration settings of the model.

Model evaluation: Testing the output of the model against known answers to ensure accuracy.

Performance: Testing the performance of the LLM in terms of speed and efficiency.

Memory: Testing and optimizing the model's memory usage.

Parameters: Testing the parameters used in the LLM, such as the learning rate, momentum, and weight decay in a neural network.

These components might be tested individually or in combination, depending on the requirements of the model and the results of previous tests. Each component can affect the model's overall performance differently, so it is important to examine them individually to identify any issues that may impact the LLM's performance.
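To make this concrete, here is a minimal unit-test sketch in the pytest style covering two of the components above; the `generate_summary` wrapper, its module, and the inline test cases are hypothetical stand-ins, not part of the original article.

```python
# Illustrative unit tests (pytest style) for an input data quality check
# and for model-output evaluation against known answers.
# `generate_summary` is a hypothetical wrapper around the deployed LLM.
import pytest

from my_llm_app import generate_summary  # hypothetical module under test

# In practice this test bank would come from a maintained dataset;
# two inline cases keep the sketch self-contained.
TEST_BANK = [
    {"document": "The quarterly report shows revenue grew 12% while costs fell 3%.",
     "required_facts": ["12%", "revenue"]},
    {"document": "The server migration is scheduled for Saturday night at 11 pm UTC.",
     "required_facts": ["Saturday", "11 pm"]},
]

def test_input_validation_rejects_empty_prompt():
    # Input data quality check: the wrapper should refuse malformed input
    # rather than silently calling the model.
    with pytest.raises(ValueError):
        generate_summary("")

@pytest.mark.parametrize("case", TEST_BANK)
def test_summary_keeps_required_facts(case):
    # Model evaluation: compare the output against known expectations.
    summary = generate_summary(case["document"])
    assert len(summary) < len(case["document"]), "summary should be shorter than the source"
    for fact in case["required_facts"]:
        assert fact.lower() in summary.lower(), f"missing required fact: {fact}"
```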
Integration testing

After validating individual components, test how the different parts of the LLM system interact. Integration testing exercises the various parts of a system together to assess whether they function as intended. Here is how the process works for a language model:

Data integrity: Check the flow of data through the system. For instance, when the language model is fed data, check whether the right kind of data is being processed correctly and the output is as expected.

Layer interaction: In a deep learning model such as a neural network, test how information is processed and passed from one layer to the next. This involves checking weight and bias values and ensuring data transfer is happening correctly, which can be as simple as verifying that the output of one layer reaches the next without loss or distortion.

Feature testing: Test the feature extraction capability of the model. Good features are essential for good performance, so you may need to verify that the features the model extracts are appropriate and contribute to the overall performance of the model.

Model performance: Once trained, test whether the model can correctly classify, regress, or perform whatever task it is designed for. This requires substantial testing to confirm that the trained model behaves as expected.

Output testing: Test the output of the whole system. Given an input with a known expected output, feed it to the system and compare the actual output to the expected result.

Interface testing: Look at how the different components of the system work together. For instance, how well does the user interface work with the database? How well does the front-end web interface work with the back-end processing scripts?

Remember that most of these tests cover a single function or feature of the whole system. Once you have ensured that each feature works correctly, you can move on to testing how those features work together, which is the ultimate goal of integration testing.
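For illustration, the sketch below shows one way an end-to-end output and interface check might look when the model sits behind an HTTP API; the endpoint URL, payload shape, and expected content are assumptions rather than anything defined in this article.

```python
# Illustrative integration test: send a known input through the full
# request path (front end -> API -> model) and check the response shape.
# The endpoint URL and JSON schema are hypothetical assumptions.
import requests

API_URL = "https://staging.example.com/v1/generate"  # assumed endpoint

def test_end_to_end_generation():
    payload = {"prompt": "Summarize: The meeting is moved from 3 pm to 4 pm on Friday.",
               "max_tokens": 64}
    resp = requests.post(API_URL, json=payload, timeout=30)

    # Interface/data-integrity checks: status code, schema, and non-empty output.
    assert resp.status_code == 200
    body = resp.json()
    assert "text" in body and isinstance(body["text"], str)
    assert body["text"].strip(), "model returned an empty completion"

    # Output check: the key fact should survive the pipeline end to end.
    assert "4" in body["text"] and "friday" in body["text"].lower()
```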
Regression testing

For an LLM, regression testing involves running a suite of tests to ensure that changes, such as those introduced through feature engineering, hyperparameter tuning, or changes in the input data, have not adversely affected performance. This can include re-running the model and comparing the results to the original, checking for differences in the outputs, or running new tests to verify that the model's performance metrics have not degraded.

Regression testing is an essential part of the model development process; its primary function is to catch problems that arise during upgrades. It compares the model's current performance with the results obtained when the model was first developed, ensuring that new updates, patches, or improvements do not break existing functionality and helping to detect problems before they reach users.

Regression testing can also be performed after the model is deployed to production, by re-running the same tests on the upgraded model or by comparing its performance metrics against those obtained from the test suite. If the metrics are not significantly different, the model is considered to be in good health.

Regression testing is not the only way to check a model's performance; unit testing, functional testing, and load testing all have their place. But it can be applied at any point in the model's life cycle, and it is key to ensuring that the model keeps performing at its best without introducing new bugs or problems.
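As a rough illustration, assuming you persist evaluation metrics for each release, the sketch below compares a candidate model's metrics against a stored baseline and fails when any metric regresses beyond a tolerance; the file paths, metric names, and tolerance are assumptions.

```python
# Illustrative regression check: compare current evaluation metrics against
# a stored baseline and flag any metric that regressed beyond a tolerance.
# The baseline file and metric names are hypothetical assumptions.
import json

TOLERANCE = 0.02  # allow up to 2 percentage points of degradation

def load_metrics(path: str) -> dict:
    with open(path, encoding="utf-8") as f:
        return json.load(f)  # e.g. {"accuracy": 0.91, "rouge_l": 0.44}

def find_regressions(baseline: dict, current: dict, tolerance: float = TOLERANCE) -> dict:
    """Return metrics whose new value is worse than baseline - tolerance.
    (Assumes higher is better for every metric listed in the baseline.)"""
    regressions = {}
    for name, old_value in baseline.items():
        new_value = current.get(name)
        if new_value is not None and new_value < old_value - tolerance:
            regressions[name] = (old_value, new_value)
    return regressions

if __name__ == "__main__":
    baseline = load_metrics("metrics/baseline.json")
    current = load_metrics("metrics/candidate.json")
    regressed = find_regressions(baseline, current)
    if regressed:
        for name, (old, new) in regressed.items():
            print(f"REGRESSION {name}: {old:.3f} -> {new:.3f}")
        raise SystemExit(1)  # fail the pipeline so the candidate is not promoted
    print("No regressions detected.")
```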
Load testing

Load testing for LLMs involves having the model process a large amount of data, as happens when a system must handle a high volume of requests in a short amount of time.

Identify the key scenarios: Load testing should begin by identifying the scenarios where the system may face high demand. These might be common situations the system will face or worst-case scenarios, and the load testing should consider how the system will behave in each of them.

Design and implement the test: Once the scenarios are identified, design tests that simulate them. The tests may need to account for factors such as the volume of data, the speed of data input, and the complexity of the data.

Execute the test: Monitor the system closely during the test to see how it behaves, for example by checking server load, response times, and error rates. It may also be necessary to run the test multiple times to obtain reliable results.

Analyze the results: Once the test is completed, analyze the results to see how the system behaved, looking at metrics such as the number of users, response time, error rate, and server load. These results help identify any issues that need to be addressed.

Repeat the process: Load testing should be repeated regularly to ensure the system can still handle the expected load. As the system evolves and the scenarios change, the tests may need to be updated.

Load testing is crucial to ensuring that a system can handle the load it is expected to face. By understanding how a system behaves under load, it is possible to design and build more resilient systems that can handle high volumes of data and continue to provide a high level of service, even under heavy load.
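One possible sketch of such a test, using Python's asyncio with the aiohttp client against an assumed endpoint, fires a burst of concurrent requests and reports the error rate and latency; the URL, payload, and request counts are assumptions.

```python
# Illustrative load test: send many concurrent requests to an assumed
# LLM endpoint and report error rate and latency.
import asyncio
import statistics
import time
import aiohttp

API_URL = "https://staging.example.com/v1/generate"  # assumed endpoint
CONCURRENCY = 50
TOTAL_REQUESTS = 500

async def one_request(session: aiohttp.ClientSession, sem: asyncio.Semaphore):
    payload = {"prompt": "Write a one-sentence status update.", "max_tokens": 32}
    async with sem:
        start = time.perf_counter()
        try:
            async with session.post(API_URL, json=payload,
                                    timeout=aiohttp.ClientTimeout(total=60)) as resp:
                await resp.read()
                ok = resp.status == 200
        except Exception:
            ok = False
        return ok, time.perf_counter() - start

async def main():
    sem = asyncio.Semaphore(CONCURRENCY)  # cap in-flight requests
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(one_request(session, sem)
                                         for _ in range(TOTAL_REQUESTS)))
    latencies = sorted(lat for ok, lat in results if ok)
    errors = sum(1 for ok, _ in results if not ok)
    print(f"error rate: {errors / TOTAL_REQUESTS:.1%}")
    if latencies:
        p95 = latencies[max(0, int(0.95 * len(latencies)) - 1)]
        print(f"median latency: {statistics.median(latencies):.2f}s, p95: {p95:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
```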
Feedback loop

Implement a feedback loop system where users can provide explicit or implicit feedback on the model's responses. This allows you to collect real-world user feedback, which is invaluable for improving the model's performance. User feedback is instrumental in the iterative process of model refinement and acts as a direct communication channel with users. It is useful for the machine learning model in the following ways:

User needs understanding: Feedback from users provides critical information about what users want, what they find useful, and where the model might improve. Understanding these requirements helps tailor the model's functionality more closely to users' needs.

Model refinement: User feedback can guide the refinement process, helping developers understand where the model falls short and what improvements can be made. This is especially true for machine learning models, where user feedback can directly shape what the model 'learns.'

Model validation: User feedback can also play a key role in model validation. For instance, if a user flags a certain response as inaccurate, that can be taken into account when updating and retraining the model.

Detection of shortcomings: User feedback can help detect shortcomings or gaps where the model is weak or does not meet user needs. By identifying these gaps, developers can work to improve the model and its outputs.

Improving accuracy: Using feedback, developers can improve the accuracy of the model's responses. For instance, if a model consistently receives negative feedback on a particular type of response, developers can investigate and make adjustments.

A/B testing

If you have multiple versions of a model, or different models altogether, use A/B testing to compare their performance in the production environment. This involves serving different model versions to different user groups and comparing their performance metrics. A/B testing, also known as split testing, is a technique used to compare two versions of a system to determine which one performs better. In the context of large language models, A/B testing can compare different versions of the same model or entirely different models. Here is how it can be employed for LLMs:

Model comparison: If you have two versions of a language model (for example, two different training runs, or the same model trained with two different sets of hyperparameters), A/B testing can determine which performs better in a production environment.

Feature testing: A/B testing can evaluate the impact of new features. For instance, if you introduce a new preprocessing step or incorporate additional training data, you can compare the model's performance with and without the new feature.

Error analysis: A/B testing can also be used for error analysis. If users report an issue with the LLM's responses, you can run an A/B test with the fix in place to verify whether the issue has been resolved.

User preference: A/B testing can help understand user preferences. By presenting a group of users with responses generated by two different models or model versions, you can gather feedback on which model's responses are preferred.

Deployment decisions: The results of A/B testing can inform decisions about which version of a model to deploy in a production environment. If one version consistently outperforms another in A/B tests, it is likely a good candidate for deployment.

During A/B testing, it is important to ensure that the test is fair and that any differences in performance can be attributed to the differences between the models rather than to external factors. This typically involves randomly assigning users or requests to the different models and controlling for variables that could influence the results.
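As an illustrative sketch (not prescribed by this article), the snippet below shows deterministic assignment of users to two model variants and a simple two-proportion z-test on thumbs-up rates; the variant names, counts, and significance threshold are assumptions.

```python
# Illustrative A/B test helpers: assign users to a model variant
# deterministically, then compare thumbs-up rates with a two-proportion z-test.
import hashlib
import math

VARIANTS = ("model_a", "model_b")  # hypothetical variant names

def assign_variant(user_id: str) -> str:
    """Hash-based assignment: stable per user, roughly a 50/50 split."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 2
    return VARIANTS[bucket]

def two_proportion_z_test(successes_a: int, n_a: int,
                          successes_b: int, n_b: int) -> float:
    """Return the two-sided p-value for a difference in thumbs-up rates."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Example: 480/1200 thumbs-up for model_a vs 540/1180 for model_b.
p_value = two_proportion_z_test(480, 1200, 540, 1180)
print("variant for user 42:", assign_variant("user-42"))
print(f"p-value: {p_value:.4f} ->", "significant" if p_value < 0.05 else "not significant")
```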
Bias and fairness testing

Conduct tests to identify and mitigate potential biases in the model's outputs. This involves using fairness metrics and bias evaluation tools to measure the model's equity across different demographic groups. Bias and fairness are important considerations when testing and deploying LLMs because biased responses or decisions can have serious consequences, leading to unfair treatment or discrimination. Bias and fairness testing for LLMs typically involves the following steps:

Data audit: Before training an LLM, audit the data for potential biases. This includes understanding the sources of the data, its demographics, and any areas of bias it might contain. The model will often learn biases present in the training data, so it is important to identify and address these upfront.

Bias metrics: Implement metrics to quantify bias in the model's outputs. These could include metrics that measure disparities in error rates or in the model's performance across different demographic groups.

Test case generation: Generate test cases that help uncover biases. This could involve creating synthetic examples covering a range of demographics and situations, particularly those prone to bias.

Model evaluation: Evaluate the LLM using the test cases and bias metrics. If bias is found, developers need to understand why it is happening: is it due to the training data, or to some aspect of the model's architecture or learning algorithm?

Model refinement: If biases are detected, the model may need to be refined or retrained to minimize them. This could involve changes to the model or collecting more balanced and representative training data.

Iterative process: Bias and fairness testing is an iterative process. As new versions of the model are developed, or the model is exposed to new data in a production environment, the tests should be repeated to ensure that the model continues to behave fairly and without bias.

User feedback: Allow users to provide feedback about the model's outputs. This can help detect biases that the testing process may have missed, and it provides real-world insight into how the model is performing.

Ensuring bias and fairness in LLMs is a challenging and ongoing task, but it is a crucial part of the development process because it significantly affects the model's performance and its impact on users. By systematically testing for bias and fairness, developers can work towards fair and unbiased models, which leads to better, more equitable outcomes.
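To illustrate the "bias metrics" step above, here is a minimal sketch that computes per-group error rates and the largest gap between any two groups from labeled evaluation records; the field names, toy records, and alerting threshold are assumptions.

```python
# Illustrative bias metric: per-group error rates and the maximum gap
# between any two demographic groups. Field names are hypothetical.
from collections import defaultdict

def group_error_rates(records):
    """records: iterable of dicts with keys 'group', 'prediction', 'label'."""
    totals, errors = defaultdict(int), defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        if r["prediction"] != r["label"]:
            errors[r["group"]] += 1
    return {g: errors[g] / totals[g] for g in totals}

def max_error_rate_gap(rates: dict) -> float:
    return max(rates.values()) - min(rates.values())

# Toy example with two groups; in practice these records would come from an
# evaluation set annotated with demographic attributes.
records = [
    {"group": "group_x", "prediction": "approve", "label": "approve"},
    {"group": "group_x", "prediction": "deny", "label": "approve"},
    {"group": "group_y", "prediction": "approve", "label": "approve"},
    {"group": "group_y", "prediction": "approve", "label": "approve"},
]
rates = group_error_rates(records)
print("error rates by group:", rates)
print("max gap:", max_error_rate_gap(rates))  # flag if above an agreed threshold, e.g. 0.05
```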
Anomaly detection

Implement anomaly detection systems to alert you when the model's behavior deviates from what is expected. This can help identify issues in real time, allowing you to respond quickly. Anomaly detection, also known as outlier detection, identifies items, events, or observations that differ significantly from the majority of the data. For LLMs, anomaly detection is essential for ensuring the model's responses stay within expected parameters and for identifying unusual or potentially problematic output. Here is how it can be performed:

Define normal behavior: Anomaly detection starts with defining what is "normal" for the LLM's output, based on past responses, training data, or defined constraints. Factors such as the length of the generated text, the topic, the sentiment, or the type of language used can define normal behavior.

Set thresholds: Once normal behavior is defined, set thresholds that determine when a response counts as an anomaly. These could be based on statistical methods (for example, anything beyond three standard deviations from the mean might be considered an outlier) or on domain-specific rules (for example, a response containing explicit language might be flagged).

Monitor model outputs: As the model generates responses, monitor them against the defined thresholds. Any response that falls outside the thresholds is flagged as a potential anomaly.

Investigate anomalies: Investigate flagged anomalies to understand why they occurred. This helps determine whether the anomaly stems from an issue with the model (such as bias in the training data, a bug, or an unexpected interaction between different parts of the model) or whether it is an acceptable response that simply happens to be unusual.

Update model or thresholds: Depending on the findings of the investigation, you may need to update the model or the thresholds. If an anomaly is due to a bug in the model, fix the bug; if it is due to bias in the training data, retrain with more balanced data; if it is an acceptable but unusual response, adjust the thresholds to accommodate such responses.

Remember that anomaly detection is an ongoing process. As the LLM continues to learn and adapt to new data, what counts as "normal" may change, and the thresholds may need to be adjusted accordingly. By continuously monitoring the model's outputs and investigating any anomalies, you can ensure that the model continues performing as expected and delivers high-quality responses.
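As a minimal sketch of the statistical-threshold approach described above, using response length as the monitored property, the class below flags outputs that fall more than three standard deviations from a rolling baseline; the window size, warm-up count, and z-score rule are assumptions.

```python
# Illustrative anomaly check: flag responses whose length deviates more than
# three standard deviations from the mean of a rolling baseline window.
from collections import deque
import statistics

class LengthAnomalyDetector:
    def __init__(self, window: int = 500, z_threshold: float = 3.0):
        self.history = deque(maxlen=window)  # recent "normal" response lengths
        self.z_threshold = z_threshold

    def check(self, response: str) -> bool:
        """Return True if the response looks anomalous by length."""
        length = len(response.split())
        if len(self.history) < 30:           # not enough data to judge yet
            self.history.append(length)
            return False
        mean = statistics.fmean(self.history)
        stdev = statistics.stdev(self.history) or 1.0
        is_anomaly = abs(length - mean) / stdev > self.z_threshold
        if not is_anomaly:                    # only fold normal responses into the baseline
            self.history.append(length)
        return is_anomaly

detector = LengthAnomalyDetector()
if detector.check("an unusually long or short completion ..."):
    print("flag for investigation")  # e.g. route to an alerting or review queue
```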
Key metrics for evaluating LLMs in production

There are several key metrics for assessing the performance of a large language model in production.

Interaction and user engagement

This metric quantifies the model's proficiency in maintaining user engagement throughout a conversation. It explores the model's propensity to ask pertinent follow-up questions, clarify ambiguities, and foster a fluid dialogue. Established usage metrics, gathered through user surveys or other tools, can be used to gauge engagement, including average query volume, average query size, response feedback rating, and average session duration.

Response coherence

This metric focuses on the model's capacity to generate coherent and contextually appropriate responses, verifying its proficiency in producing relevant and meaningful answers. Language scoring techniques such as Bilingual Evaluation Understudy (BLEU) and Recall-Oriented Understudy for Gisting Evaluation (ROUGE) can be used to measure this aspect.

Fluency

Fluency evaluates the structural integrity, grammatical correctness, and linguistic coherence of the model's responses, assessing its competency in producing language that sounds natural and fluid. Perplexity, the inverse probability of the test set normalized by the number of words, can be used to measure fluency.

Relevance

Relevance assesses the alignment of the model's responses with the user's input or query. It checks whether the model accurately grasps the user's intention and provides suitable, on-topic responses. Metrics such as the F1 score and techniques like BERT-based similarity can measure relevance.

Contextual awareness

This metric gauges the model's capacity to understand the conversation's context. It verifies the model's ability to reference prior messages, track dialogue history, and deliver consistent responses. Cross-mutual information (XMI) can help measure context awareness.

Sensibleness and specificity

This metric evaluates whether the model provides sensible, detailed answers rather than generic or illogical responses. To measure sensibleness and specificity, one can compute the average scores given by human evaluators for the model's responses across the entire dataset; these averages give an overall measurement of the sensibility and specificity of the model's responses.
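To make two of these metrics concrete, here is a small sketch that computes perplexity from per-token log-probabilities (fluency) and a simple unigram-overlap F1 between a response and a reference; it is a simplified stand-in for full BLEU/ROUGE tooling, and the inputs shown are assumed examples.

```python
# Illustrative metric helpers: perplexity from per-token log-probabilities
# (fluency) and a unigram-overlap F1 (a rough stand-in for BLEU/ROUGE-style
# coherence/relevance scoring).
import math
from collections import Counter

def perplexity(token_logprobs):
    """token_logprobs: natural-log probabilities of each generated token,
    e.g. as exposed by an LLM API that returns log-probs."""
    n = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / n)

def unigram_f1(candidate: str, reference: str) -> float:
    cand, ref = Counter(candidate.lower().split()), Counter(reference.lower().split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)

print(f"perplexity: {perplexity([-0.2, -1.1, -0.4, -0.9]):.2f}")
print(f"unigram F1: {unigram_f1('the meeting moved to 4 pm', 'meeting moved to 4 pm friday'):.2f}")
```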
Endnote

While the process of testing may be demanding, particularly when using large language models, the alternatives present their own sets of challenges. Benchmarking tasks that involve generation, where there are multiple correct answers, can be inherently complex, leading to a lack of confidence in the results. Obtaining human evaluations of a model's output can be even more time-consuming and may lose relevance as the model evolves, rendering the collected labels less useful.

Choosing not to test could result in a lack of understanding of the model's behavior, a situation that could pave the way for potential failures. On the other hand, a well-structured testing approach can unearth bugs, provide deeper insights into the task at hand, and reveal serious specification issues early in the process, thereby allowing time for course correction.

In weighing the pros and cons, it becomes evident that investing time in rigorous testing is a judicious choice. This not only ensures a deep understanding of the model's performance and behavior but also guarantees that any potential issues are identified and addressed promptly, contributing to the successful deployment of the LLM in a production environment. For your large language models to excel, ongoing testing is indispensable, with a specific focus on production testing. Partnering with LeewayHertz means gaining access to custom models and solutions tailored to your business needs, all fortified with rigorous testing to ensure resilience, security, and accuracy.