How to use LLMs in synthesizing training data?
There is an unending quest for rich, diverse, and bias-free data in the dynamic realm of machine learning and artificial intelligence. However, data, as indispensable as it is, often comes with its share of pitfalls: scarcity, privacy concerns, and biases, to name a few. Now, imagine a world where data is abundant, unbiased, and unencumbered by privacy issues. Welcome to the world of synthetic data, a game-changing innovation that is reshaping the data science landscape.
Harnessing the power of Large Language Models (LLMs), tools capable of understanding, generating, and even refining human-like text, we can generate high-quality synthetic training data and train our models more efficiently. This article delves into how you can utilize LLMs to synthesize training data, offering a unique solution to real-world data challenges. Through this comprehensive guide, we aim to provide you with a deep understanding of LLMs, elucidate the benefits of synthetic data, and, most importantly, guide you on how to use LLMs for synthesizing your own training data.
What are LLMs?
What is training data in ML and its importance?
What is synthetic data?
Synthetic data use cases
Training machine learning models
Mitigating data bias
Safeguarding personal information
Benefits of synthesizing training data
Step-by-step guide on using LLMs for synthesizing training data
Step 1: Choosing the right LLM for your specific application
Step 2: Training the model with LLM-generated synthetic data
How to evaluate the quality of synthesized training data?
Evaluating fidelity
Evaluating utility
Evaluating privacy
What are LLMs?
[Figure: Building blocks of LLMs, including tokenization, embedding, attention, pretraining, and transfer learning]
Before we discuss what synthetic data is, it is important to understand what LLMs are. Large Language Models (LLMs) are intricate and sophisticated artificial intelligence tools that learn and generate text-based responses that mimic human language. These AI models are trained on extensive volumes of text data – books, articles, web pages, and more, enabling them to decode and grasp the structure and patterns of language. With deep learning at their core, they perform an array of complex language tasks, generating top-tier results.
The potential of popular LLMs like Google’s BERT, Facebook’s RoBERTa, and OpenAI’s GPT series has been leveraged for various tasks like language translation, content creation, and more, showcasing their versatility and effectiveness.
Talking about their applications, LLMs boast an impressive range:
Language translation: LLMs are great at translating text from one language to another, ensuring accuracy and proficiency.
Chatbots and conversational AI: They form the backbone of advanced chatbots and conversational AI systems, enabling fluent conversations with users.
Content creation: Be it articles, summaries, or product descriptions, LLMs can generate contextually relevant and grammatically precise content.
Text summarization: LLMs have the capacity to condense vast text content into shorter, more manageable summaries.
Question answering: These models excel in identifying pertinent information from large text bodies to answer questions.
Sentiment analysis: LLMs can decipher the underlying sentiment in a text, helping companies comprehend customer sentiment towards their offerings.
Speech recognition: LLMs enhance speech recognition systems by understanding the context and meaning of spoken words more accurately.
LLMs also excel in natural language processing, enhancing the accuracy of search engines, improving customer service, and automating content creation. They even facilitate personalized user experiences and make digital content more accessible.
The hype around LLMs is justified, given their versatile applications. They have redefined chatbot technology by simplifying its creation and maintenance, and their ability to generate varied and unexpected texts is noteworthy. LLMs can even be fine-tuned to perform specific NLP tasks, offering the possibility of building NLP models more cost-effectively and efficiently. Their unique capabilities have rightfully earned them the title of ‘Foundation Models.’ However, these models come with their own set of challenges: they require extensive computational resources, custom hardware, and a vast quantity of training data, making their development and maintenance a costly affair.
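As a quick illustration of how an off-the-shelf LLM can be applied to tasks like sentiment analysis or text generation, here is a minimal sketch using the Hugging Face transformers library. The library choice and model names are our own assumptions for illustration; the article does not prescribe a specific toolkit.
# Illustrative sketch (not from the article): using pretrained transformer models
# via the Hugging Face `transformers` pipeline API.
# pip install transformers torch
from transformers import pipeline

# Sentiment analysis with a default pretrained classifier
sentiment = pipeline("sentiment-analysis")
print(sentiment("The new espresso blend is fantastic!"))

# Text generation with a small GPT-2 model (illustrative model choice)
generator = pipeline("text-generation", model="gpt2")
print(generator("Synthetic data is useful because", max_length=30, num_return_sequences=1))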
What is training data in ML and its importance?
[Figure: How synthetic datasets feed a machine learning system, from modeling and configuring parametric scenarios, through simulation of concrete scenario instances, to training, validation, and adaptation against reality]
Training data is the lifeblood of Machine Learning (ML) systems. It serves as
the foundation upon which these systems learn, understand, and eventually,
make predictions. It’s a crucial piece in the complex jigsaw of ML
development, without which essential tasks would be impossible to carry
out.
At the heart of every successful AI and ML project lies quality training data. It is the key that enables a machine to learn human-like behavior and predict outcomes with higher accuracy. The role of training data in machine learning cannot be overstated; it dictates the performance and accuracy of the AI model. Hence, understanding the value of a robust training dataset is pivotal to acquiring the right quantity and quality of data for your machine learning models. The correlation between the quality of training data and the accuracy of the model is direct. As a practitioner, you must recognize its importance and how it influences the selection of an algorithm based on the availability and compatibility of your training dataset.
Prioritizing training data in any AI or ML project is not just a good practice, but a necessary one. Investing in acquiring high-quality datasets will invariably lead to improved outcomes. To illustrate, consider a model being trained to recognize images of cats. The model’s ability to accurately identify a cat in a new image directly depends on the variety, quality, and quantity of cat images it was trained on. Understanding the significance of training data, ensuring its quality, and choosing the right quantity form the basis of successful AI and ML projects. Always remember, the efficacy of your models is inextricably tied to the quality of your training data.
Here are some of the areas where training data plays a vital role. The quality
and success of these applications and processes are directly related to the
quality and quantity of the training data.
Object recognition and categorization
Training data serves a critical role in supervised machine learning,
particularly in the recognition and categorization of objects. Consider a
scenario where an algorithm must distinguish between images of cats and
dogs. In this case, labeled images of both species are required. The algorithm
learns to discern the distinctive features between the two species based on
this training data, enabling it to recognize and categorize similar objects in
the future. If the training data is inaccurate or poor in quality, it could lead to
inaccurate results, potentially derailing the success of an AI project.
Crucial input for machine learning algorithms
Training data is indispensable to the operation of machine learning
algorithms. It’s the primary input that provides the algorithm with the
information necessary to make decisions akin to human intelligence. In
supervised machine learning, the algorithm requires labeled training data as
an additional input. If the training data is not appropriately labeled, it
diminishes its value for supervised learning. For instance, images must be
annotated with precise metadata to be recognizable to machines through
computer vision. Therefore, accuracy in labeling the training data is of
paramount importance.
Machine learning model validation
Developing an AI model is just part of the process; it’s equally vital to validate
the model to assess its accuracy and ensure its performance in real-life
scenarios. Validation or evaluation data is another form of training data,
often set aside to test the model’s performance under different
circumstances. This data helps verify the model’s ability to make accurate
predictions, solidifying its reliability before deployment. Hence, the
importance of training data extends beyond just learning; it also plays a key
role in ensuring the overall quality and accuracy of the AI model.
What is synthetic data?
Synthetic data refers to data that isn’t collected from the real world but is created artificially using computer programs or simulations. Like an artist making a replica of a real painting, these computer programs replicate patterns found in real data, but without actually containing any real information.
It’s often used in fields like artificial intelligence and machine learning because it helps overcome certain issues associated with real data. For example, real data can be biased, incomplete, or not diverse enough.
Synthetic data offers a tailored or customized environment for the purpose of training and improving AI and machine learning algorithms. It serves as a simulated practice area where these algorithms can learn, adapt, and develop their capabilities before being deployed in real-world scenarios. Synthetic data is generated to mimic real data but can be controlled and manipulated to provide specific training scenarios and test different conditions, making it a valuable tool for refining AI and machine learning models. Because it is synthetic, you can create as much data as you need and customize it to your needs. This is especially helpful when real-world data is hard to come by.
Additionally, synthetic data is a superior choice when it comes to privacy. It
can be created in a way that resembles real data, but without including any
personal information, such as someone’s name or address. This means AI
researchers can use it to train their models without risking anyone’s privacy.
Synthetic data use cases
Training machine learning models
Synthetic data is a valuable resource in the world of artificial intelligence, specifically when we are training machine learning models. Collecting real-world data can be tricky and time-consuming, especially when the data is sensitive or regulated by laws like GDPR. Real-world data may also have biases, be incomplete, or have errors. Here, synthetic data can step in as a substitute or complement to real-world data for training machine learning models.
Using synthetic data gives machine learning models the chance to learn from
larger and more varied data sets, which can bolster their performance and
ability to adapt.
Mitigating data bias
Another fantastic application of synthetic data is its ability to help minimize bias and enhance fairness in data sets. It’s not unusual for real-world data to show bias or lack balance, leading to machine learning models that reflect these unfair leanings.
When a data set isn’t an accurate representation of the population it aims to study (for instance, if it predominantly includes data from a specific race or gender), it fails to capture the true diversity of experiences and behaviors across all groups. This can result in machine learning models that don’t accurately serve their intended population.
By using synthetic data, we can more precisely mirror the population we are studying, as it gives us control over how characteristics like race, gender, and other demographics are distributed across the data set. This ensures that our data set better reflects the people it aims to represent.
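To make this concrete, here is a small, hypothetical sketch (not from the article) showing how synthetic records can be sampled with explicitly balanced demographic attributes; the category names and proportions are illustrative assumptions.
# Illustrative only: sampling synthetic records with controlled demographic balance
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_records = 1000

# Proportions are chosen explicitly (hypothetical values), rather than inherited from a skewed real dataset
genders = rng.choice(["female", "male", "non-binary"], size=n_records, p=[0.45, 0.45, 0.10])
age_groups = rng.choice(["18-29", "30-44", "45-59", "60+"], size=n_records, p=[0.25, 0.25, 0.25, 0.25])

synthetic_df = pd.DataFrame({"gender": genders, "age_group": age_groups})
print(synthetic_df["gender"].value_counts(normalize=True))  # verify the distribution we specified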
Safeguarding personal information
Synthetic data plays a crucial role in protecting privacy and security. Real-world data often holds private or sensitive details that shouldn’t be made public. Synthetic data can be used as a stand-in for real data, allowing researchers to perform their analyses without compromising individuals’ privacy.
Personally Identifiable Information (PII) can include anything from names, addresses, phone numbers, and email addresses to social security numbers, financial records, medical information, and biometric data. Under GDPR, it is mandatory for organizations to protect such information and obtain explicit consent from individuals before gathering, utilizing, or sharing it. Synthetic data allows us to bypass these issues while still gaining valuable insights.
Benefits of synthesizing training data
[Figure: Applications of synthetic data across industries, including healthcare, agriculture, banking and finance, manufacturing, disaster prediction and risk management, automotive and robotics, and eCommerce]
The use of synthesized data allows for the augmentation of existing datasets with synthetic versions, enhancing the training of various models and algorithms. Essentially, synthesized data acts as a fabricated data pool, aiding in the verification of mathematical models or the training of machine learning models.
Synthesized data is employed across different sectors as a means to omit certain sensitive elements from the original data. In some cases, datasets encompass confidential information that, due to privacy concerns, cannot be publicly shared. Synthetic data provides a solution by producing artificial data that mirrors the original but does not retain any personally identifiable information. This circumvents privacy issues related to using real consumer data without consent or remuneration. Synthetic data offers a privacy-respecting alternative, facilitating the development and testing of algorithms and models without violating confidentiality.
The use of synthesized training data offers numerous benefits, rendering it an invaluable resource for organizations:
Abiding by privacy laws: Besides helping companies navigate privacy laws that restrict them from handling sensitive data, synthesized data also reduces the risk of customer data breaches or unauthorized sharing, which could result in costly legal battles and harm to brand reputation.
Minimizing privacy concerns: Addressing privacy concerns is a significant reason why organizations are increasingly employing synthetic data generation techniques.
Enabling data generation when historical data is unavailable: For completely new products or services, historical data might not be available. Furthermore, procuring human-annotated data can be costly and time-consuming. By generating synthetic data swiftly, these hurdles can be bypassed, facilitating the creation of reliable machine learning models.
Cost-effectiveness and efficiency: Synthetic data generation emerges as a cost-effective and efficient solution for new product development and machine learning model training.
Synthesized training data, thus, emerges as a pivotal tool in modern data handling, ensuring privacy, expanding datasets, and aiding in efficient and cost-effective model training.
Step-by-step guide on using LLMs for synthesizing training data
In this example, we will create synthetic sales data for training a sales prediction model. Imagine we have a new sales prediction app for coffee shops that we want to check using this synthetic data. The following steps explain how an LLM can be used to synthesize this data for the model’s training:
Step 1: Choosing the right LLM for your specific application
Selecting the right Large Language Model (LLM) for synthesizing training data requires careful consideration of a few factors:
Task requirements: The type of task you want to accomplish greatly influences your choice of LLM. For example, if your task is related to text generation, a sequence-to-sequence model might be the best fit. On the other hand, for classification tasks, a simpler model might suffice.
Data availability: The amount and quality of data you have at your disposal can influence the complexity of the LLM that you choose. More complex models may require more data for training.
Computational resources: More sophisticated LLMs require more computational power and memory for training and inference. You need to consider your available resources when choosing a model.
Privacy concerns: If your data includes sensitive information, you may need to consider a model that can provide better data privacy.
Accuracy vs. explainability: Some LLMs can offer high accuracy but have low explainability. If your project requires understanding the reasoning behind the model’s predictions, you might need to choose a simpler, more interpretable model.
Model training time: Training a complex LLM can take a significant amount of time. Depending on the time constraints of your project, you might need to opt for a less complex model that can be trained more quickly.
Finally, it’s a good practice to experiment with a few different models and compare their performance on your specific task. This empirical evaluation can help you find the most suitable LLM for your use case.
We can use various AI tools to create synthetic data for testing apps, building
data analysis processes, and making machine learning models. ChatGPT is
one powerful LLM among them. We could begin by asking ChatGPT to
produce some data for us using the below prompt:
Create a CSV file with 25 random sales records for a coffee shop.
Each record should include the following fields:
- id (incrementing integer starting at 1)
- date (random date between 1/1/2022 and 12/31/2022)
- time (random time between 6:00am and 9:00pm in 1-minute increments)
- product_id (incrementing integer starting at 1)
- product
- calories
- price in USD
- type (drink or food)
- quantity (random integer between 1 and 3)
- amount (price * quantity)
- payment type (cash, credit, debit, or gift card)
The way in which you phrase your query to ChatGPT can significantly alter the output it provides. From the query above, we can expect correct results, but they won’t be very helpful for checking our app because they do not come in the right format. Also, ChatGPT can’t make a real CSV file. Plus, it can only give back a certain amount of data. As a rule, ChatGPT can return responses that are up to 2,048 tokens long – that’s the most the GPT-3 model allows.
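If you want to go beyond the ChatGPT web interface, the same prompt can be sent programmatically. The snippet below is a minimal sketch, assuming the official openai Python package (v1 client style) and an OPENAI_API_KEY environment variable; the model name and output file are illustrative choices, not something prescribed by the article.
# Illustrative sketch: requesting synthetic CSV rows from an LLM via the OpenAI API
# pip install openai   (and set the OPENAI_API_KEY environment variable)
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompt = (
    "Create 25 random sales records for a coffee shop as CSV rows with the fields: "
    "id, date, time, product_id, product, calories, price, type, quantity, amount, payment type. "
    "Return only the CSV content, including a header row."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)

csv_text = response.choices[0].message.content
with open("llm_sales_sample.csv", "w") as f:
    f.write(csv_text)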
Instead of making ChatGPT create the synthetic data directly, we could ask it
to write a Python script. This script would then generate the synthetic sales
data we need. This is a better approach because it can be used on a larger
scale. So, let’s instruct ChatGPT to write a Python program to create synthetic
sales data similar to our previous requirements.
Write Python 3 code to make a CSV file with 100 sales records for a typical coffee shop. Each record should include:
id (a series of numbers starting from 1)
date (any random date between 1/1/2022 and 12/31/2022)
time (any random time from 6:00am to 9:00pm, changing every minute)
product_id (a series of numbers starting from 1)
product
calories
price (in USD)
type (either a drink or food)
quantity (a random number between 1 and 3)
amount (price times quantity)
payment type (could be cash, credit, debit, or gift card)
The generated code is as follows:
import csv
import random
from random import randrange
from datetime import timedelta, datetime

# List of possible products and their corresponding details
# NOTE: the price/type values below are illustrative; the original listing was truncated here
products = [
    {"product_id": 1, "product": "Espresso", "calories": 10, "price": 2.50, "type": "drink"},
    {"product_id": 2, "product": "Cappuccino", "calories": 120, "price": 3.75, "type": "drink"},
    {"product_id": 3, "product": "Croissant", "calories": 300, "price": 2.95, "type": "food"},
    # add more products as needed...
]

payment_types = ["cash", "credit", "debit", "gift card"]

start_date = datetime.strptime("1/1/2022", "%m/%d/%Y")
end_date = datetime.strptime("12/31/2022", "%m/%d/%Y")

# Return a random datetime between start and end
def random_date(start, end):
    return start + timedelta(seconds=random.randint(0, int((end - start).total_seconds())))

# Return a random time between 06:00 and 20:59 as HH:MM
def random_time():
    return "{:02d}:{:02d}".format(randrange(6, 21), randrange(0, 60))

with open('sales_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["id", "date", "time", "product_id", "product", "calories",
                     "price", "type", "quantity", "amount", "payment type"])
    for i in range(1, 101):
        product = random.choice(products)
        quantity = randrange(1, 4)
        writer.writerow([
            i,
            random_date(start_date, end_date).strftime("%m/%d/%Y"),
            random_time(),
            product["product_id"],
            product["product"],
            product["calories"],
            product["price"],
            product["type"],
            quantity,
            quantity * product["price"],
            random.choice(payment_types)
        ])
ChatGPT was able to generate a full Python program to make synthetic sales data with just one short prompt. However, because of ChatGPT’s response size limit, we could only list six items on the coffee shop menu. If we asked for more items, the output and the program would be cut off, which means the program wouldn’t run.
You can provide additional prompts as below:
# Additional prompts:
# Create a function to return one random item from a list of dictionaries
# containing 15 common drink items sold in a coffee shop, including the name,
# calories, price, and type of each item.
# Capitalize the first letter of each product name. Start product id at 1.
# Create a function to return one random item from a list of dictionaries
# containing 10 common food items sold in a coffee shop, including the name,
# calories, price, and type of each item.
# Capitalize the first letter of each product name. Start product id at 16.
To install the Faker dependency, execute the command below:
pip install faker
The complete generated code will be like this:
import random
import csv
from datetime import datetime, timedelta
from faker import Faker

fake = Faker()

# Function to generate a random datetime between start and end
def random_time(start, end):
    return start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

# Generate drink items
# NOTE: the calorie/price ranges below are illustrative; the original listing was truncated here
def generate_drink_items():
    drinks = [{'id': i + 1, 'name': fake.words(nb=1, unique=True)[0].capitalize(),
               'calories': random.randint(5, 300),
               'price': round(random.uniform(2.0, 6.0), 2),
               'type': 'drink'} for i in range(15)]
    return random.choice(drinks)

# Generate food items
def generate_food_items():
    foods = [{'id': i + 16, 'name': fake.words(nb=1, unique=True)[0].capitalize(),
              'calories': random.randint(100, 600),
              'price': round(random.uniform(2.0, 8.0), 2),
              'type': 'food'} for i in range(10)]
    return random.choice(foods)

# Generate payment types
def generate_payment_types():
    return random.choice(['cash', 'credit', 'debit', 'gift card'])

# Open the CSV file
with open('coffee_shop_sales_chatgpt_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["id", "date", "time", "product_id", "product", "calories",
                     "price", "type", "quantity", "amount", "payment type"])
    # Generate sales records
    for i in range(100):
        item = generate_drink_items() if random.choice([True, False]) else generate_food_items()
        date = fake.date_between_dates(date_start=datetime(2022, 1, 1),
                                       date_end=datetime(2022, 12, 31)).strftime("%m/%d/%Y")
        time = random_time(datetime.strptime('6:00 AM', '%I:%M %p'),
                           datetime.strptime('9:00 PM', '%I:%M %p')).strftime('%I:%M %p')
        quantity = random.randint(1, 3)
        amount = round(quantity * item['price'], 2)
        writer.writerow([i + 1, date, time, item['id'], item['name'], item['calories'],
                         item['price'], item['type'], quantity, amount, generate_payment_types()])
This code selects from 15 different drink items and 10 different food items, all with unique names, and writes 100 sales records to a CSV file named “coffee_shop_sales_chatgpt_data.csv”.
You can find sample reference code in this GitHub location. Copy and paste the code into VS Code and run it.
The generated synthetic data will look like this:
https://github.com/garystafford/ten-ways-gen-ai-code-gen/blob/main/data/output/coffee_shop_sales_data_chatgpt.csv
Step 2: Training the model with LLM-generated synthetic data
We will train a sales prediction model with the generated synthesized data. This involves several steps:
Preprocessing the data
Splitting it into training and testing datasets
Selecting a model
Training the model
Evaluating its performance
Using it for prediction
Below is a high-level description of these steps using Python and popular libraries like pandas and scikit-learn:
Note: The exact code and approach would depend on the specifics of your application, data, and prediction task. The description below assumes a regression task, where you are trying to predict a continuous value like the amount of sales.
Load the data: Start by loading your synthetic data into a pandas DataFrame:
import pandas as pd
data = pd.read_csv('coffee_shop_sales_chatgpt_data.csv')
Preprocess the data: Before training your model, you will need to preprocess
your data. This could include:
Converting categorical variables into numeric variables using techniques
like one-hot encoding.
Normalizing numeric variables so they’re on the same scale.
Here is an example using pandas and scikit-learn:
import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encoding for categorical features
# handle_unknown='ignore' keeps transform from failing on categories unseen during fit
encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = ['product', 'type', 'payment type']
encoded_features = encoder.fit_transform(data[categorical_features]).toarray()

# Normalizing numeric features
scaler = StandardScaler()
numeric_features = ['calories', 'price', 'quantity']
scaled_features = scaler.fit_transform(data[numeric_features])

# Combine the processed features back into a single array
X = np.concatenate([scaled_features, encoded_features], axis=1)
y = data['amount']
Split the data: You should split your data into a training set and a testing set.
This allows you to evaluate your model’s performance on unseen data:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)  # test_size/random_state assumed; the original line was truncated
Train the model: Choose a suitable machine learning model for sales
prediction. For instance, a RandomForestRegressor or
GradientBoostingRegressor can be used:
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor()
model.fit(X_train, y_train)
Evaluate the model: Use the testing set to evaluate the model’s performance.
A common metric for regression tasks is the mean absolute error (MAE):
from sklearn.metrics import mean_absolute_error
y_pred = model.predict(X_test)
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
Predict sales: With the model trained, you can now use it to predict sales:
# Let's say `new_data` is your new sales data for prediction.
new_data = pd.read_csv('sales_data.csv')

def preprocess(new_data):
    # Perform the same preprocessing steps as before
    encoded_features = encoder.transform(new_data[categorical_features]).toarray()
    scaled_features = scaler.transform(new_data[numeric_features])
    # Combine the processed features back into a single array
    preprocessed_data = np.concatenate([scaled_features, encoded_features], axis=1)
    return preprocessed_data

# Preprocess the new data
preprocessed_data = preprocess(new_data)

# Use the preprocessed data to make predictions
predictions = model.predict(preprocessed_data)
Please note: The feature names should match those that were passed during fit.
This example provides a basic outline of the process. Each step has many possible variations, and the best approach will depend on your specific data, problem, and requirements. For example, you might want to try different preprocessing techniques, machine learning models, and evaluation metrics.
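As one possible way to compare alternatives, the sketch below (an illustration under our own assumptions, not part of the original walkthrough) uses scikit-learn's cross_val_score to benchmark two candidate regressors on the same synthetic training set produced above.
# Illustrative sketch: comparing candidate models with 5-fold cross-validation
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.model_selection import cross_val_score

candidates = {
    "random_forest": RandomForestRegressor(random_state=42),
    "gradient_boosting": GradientBoostingRegressor(random_state=42),
}

for name, candidate in candidates.items():
    # scikit-learn returns negative MAE by convention, so flip the sign for readability
    scores = -cross_val_score(candidate, X_train, y_train,
                              scoring="neg_mean_absolute_error", cv=5)
    print(f"{name}: mean MAE = {scores.mean():.3f} (+/- {scores.std():.3f})")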
How to evaluate the quality of synthesized
training data?
For the adoption of synthetic data in machine learning and analytics projects, it’s not only essential to ensure the synthetic data serves its intended purpose and meets application requirements, but it’s also crucial to measure and ensure the quality of the produced data.
In light of growing legal and ethical mandates for privacy protection,
synthetic data’s capability to eliminate sensitive and original information
during its generation is one of its key strengths. Therefore, alongside quality,
we require metrics to assess the risk of potential privacy breaches, if any, and
to ensure that the generation process does not merely replicate the original
data.
To address these needs, we can evaluate the quality of synthetic data across
multiple dimensions, facilitating a better understanding of the generated
data for users, stakeholders, and ourselves.
The quality of generated synthetic data is evaluated across three primary
dimensions:
1. Fidelity
2. Utility
3. Privacy
A synthetic data quality report should be able to answer the following
questions about the generated synthetic data:
How does this synthetic data compare with the original training set?
What is the usefulness of this synthetic data in our downstream
applications?
Has any information been inadvertently leaked from the original training
set into the synthetic data?
Has any sensitive information from other datasets (not used for model
training) been unintentionally synthesized by our model?
The metrics translating these dimensions for the end-users can be quite flexible, as the data to be generated can have varying distributions, sizes, and behaviors. They should also be easy to comprehend and interpret.
In essence, the metrics should be completely data-driven, not requiring any pre-existing knowledge or domain-specific information. However, if users wish to implement certain rules and constraints relevant to a particular business domain, they should be able to specify them during the synthesis process to ensure domain-specific fidelity is maintained.
Let’s delve deeper into each of these metrics.
Evaluating fidelity
When we talk about the quality of synthetic data, one of the key aspects we consider is ‘fidelity’, which basically means how closely the synthetic data matches the original data. We want to make sure the synthetic data is similar enough to the original that it can serve its purpose well.
Let’s break down the ways we measure fidelity:
Statistical comparisons: This is a way to compare the key features of the original and synthetic data sets. We look at things like the average (mean), middle value (median), spread of data (standard deviation), number of distinct values, missing values, and range of values. We do this for each category of data to see if the synthetic data is statistically similar to the original. If it’s not, we might need to generate the synthetic data again with different settings.
Histogram similarity score: This score helps us understand how similar the distribution of each feature (or category of data) is in the synthetic and original datasets. If the score is 1, it means the distributions in the synthetic data perfectly match the original.
Mutual information score: This score tells us how dependent two features are on each other. In other words, it shows us how much information about one feature you can get by looking at another.
Correlation score: This score tells us how well relationships between two or more columns of data have been preserved in the synthetic data. These relationships are important because they can reveal connections between different pieces of data.
For certain types of data, like time-series or sequential data, we also use additional metrics to measure the quality. For instance, we can look at autocorrelation and partial autocorrelation scores to see how well the synthetic data has preserved significant correlations from the original dataset.
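To make these fidelity checks concrete, here is a minimal sketch (our own illustration, not a prescribed implementation) that compares summary statistics, histogram similarity, and pairwise correlations between a real and a synthetic pandas DataFrame that share the same numeric columns.
# Illustrative fidelity checks for numeric columns shared by real_df and synthetic_df
import numpy as np
import pandas as pd

def fidelity_report(real_df: pd.DataFrame, synthetic_df: pd.DataFrame, columns):
    for col in columns:
        # Basic statistical comparison of mean and spread
        print(f"{col}: real mean={real_df[col].mean():.2f}, synthetic mean={synthetic_df[col].mean():.2f}, "
              f"real std={real_df[col].std():.2f}, synthetic std={synthetic_df[col].std():.2f}")

        # Histogram similarity: overlap of normalized histograms (1.0 = identical distributions)
        bins = np.histogram_bin_edges(pd.concat([real_df[col], synthetic_df[col]]), bins=20)
        real_hist, _ = np.histogram(real_df[col], bins=bins)
        syn_hist, _ = np.histogram(synthetic_df[col], bins=bins)
        overlap = np.minimum(real_hist / real_hist.sum(), syn_hist / syn_hist.sum()).sum()
        print(f"  histogram similarity: {overlap:.2f}")

    # Correlation score: how closely pairwise correlations are preserved (0 = perfectly preserved)
    corr_diff = (real_df[columns].corr() - synthetic_df[columns].corr()).abs().values.mean()
    print(f"mean absolute correlation difference: {corr_diff:.3f}")

# Example usage (column names assumed to exist in both frames):
# fidelity_report(real_sales_df, synthetic_sales_df, ["calories", "price", "quantity", "amount"])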
Evaluating utility
In addition to fidelity, the ‘utility’ or usefulness of synthetic data is also important. We need to ensure that the synthetic data performs well on common tasks in data science.
Prediction score: This is a measure of how well models trained on synthetic data perform compared to models trained on the original data. We compare the outcomes of these models against a testing set of data that hasn’t been seen before. This gives us an idea of how good the synthetic data is in terms of training effective models.
Feature importance score: This score looks at the importance of different features (or categories of data) and checks if this order of importance is the same in the synthetic and original data. If the order is the same, it means the synthetic data has high utility.
QScore: This score is used to check if a model trained on synthetic data will give the same results as a model trained on original data. It does this by running random aggregation-based queries on both datasets and comparing the results. If the results are similar, it means the synthetic data has good utility.
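A common way to compute the prediction score described above is a "train on synthetic, test on real" comparison. The sketch below is illustrative only and assumes you already have preprocessed real and synthetic feature matrices plus a held-out real test set.
# Illustrative utility check: train on real vs. train on synthetic, both tested on real data
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

def utility_score(X_real_train, y_real_train, X_syn_train, y_syn_train, X_real_test, y_real_test):
    model_real = RandomForestRegressor(random_state=42).fit(X_real_train, y_real_train)
    model_syn = RandomForestRegressor(random_state=42).fit(X_syn_train, y_syn_train)

    mae_real = mean_absolute_error(y_real_test, model_real.predict(X_real_test))
    mae_syn = mean_absolute_error(y_real_test, model_syn.predict(X_real_test))

    print(f"MAE (trained on real data): {mae_real:.3f}")
    print(f"MAE (trained on synthetic data): {mae_syn:.3f}")
    # The closer the two errors are, the more useful the synthetic data is for this task
    return mae_syn / mae_real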
Evaluating privacy
Privacy is a significant concern when it comes to synthesizing data. It’s important to protect sensitive information to meet ethical and legal requirements.
Exact match score: This is a measure of how many original data records can be found in the synthetic dataset. We want this score to be low to ensure privacy.
Neighbors’ privacy score: This score indicates how many synthetic records are very similar to the real ones, which could be a potential privacy concern. A lower score means better privacy.
Membership inference score: This score tells us how likely it is that someone could correctly guess if a specific data record was part of the original dataset. The lower this score, the better the privacy.
Holdout concept: It’s important to prevent the synthetic data from simply copying the original data. To avoid this, a portion of the original data is set aside and used to evaluate the synthetic data. This helps to maintain the balance between data utility and privacy.
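As a rough illustration of the first two checks, the sketch below (our own example, with assumed DataFrame inputs containing the same scaled numeric columns) computes an exact match rate and a nearest-neighbor-based count of suspiciously close synthetic records.
# Illustrative privacy checks between real_df and synthetic_df (numeric features only)
import pandas as pd
from sklearn.neighbors import NearestNeighbors

def exact_match_rate(real_df: pd.DataFrame, synthetic_df: pd.DataFrame) -> float:
    # Rough fraction of real records that also appear verbatim in the synthetic dataset
    merged = real_df.merge(synthetic_df.drop_duplicates(), how="inner")
    return len(merged) / len(real_df)

def close_neighbor_rate(real_df: pd.DataFrame, synthetic_df: pd.DataFrame, threshold: float = 0.1) -> float:
    # Fraction of synthetic records lying very close to some real record (potential leakage)
    nn = NearestNeighbors(n_neighbors=1).fit(real_df.values)
    distances, _ = nn.kneighbors(synthetic_df.values)
    return float((distances < threshold).mean())

# Example usage (assumes both frames contain the same scaled numeric columns):
# print(exact_match_rate(real_scaled_df, synthetic_scaled_df))
# print(close_neighbor_rate(real_scaled_df, synthetic_scaled_df))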
Endnote
LLMs have become an increasingly valuable asset for data scientists and researchers, especially when it comes to synthesizing training data. Through the power of advanced machine learning algorithms and generative models, synthetic data can be created to mimic real-world data while maintaining privacy and confidentiality.
While synthetic data generation presents challenges, such as modeling complex data distributions, its potential benefits are evident. Synthetic data finds application in various domains, including training and testing machine learning models, conducting simulations, and performing experiments. As the field of synthetic data generation continues to progress, we can anticipate the emergence of innovative techniques and tools, fueling further advancements in this exciting area of research and development.
Looking to explore LLMs for synthesizing training data? Use our Large Language
Model (LLM) development service for easy, unbiased data creation. Contact
LeewayHertz’s experts to start your AI journey today!
Insights
Question User
Knowledgebase
Answer
LLM
How to train an open­source foundation model into
a domain­specific LLM?
A domain-speci몭c language model constitutes a specialized subset of large
language models (LLMs), dedicated to producing highly accurate results
within a particular domain.
Build an LLM­powered application using
LangChain: A comprehensive step­by­step guide
LangChain is a framework that provides a set of tools, components, and
interfaces for developing LLM-powered applications.
Knowledgebase
Read More
Prompt LLM Completion
‘ Jack has a cat.
What animal is
Jack’s pet?’
‘cat’
Read More
LEEWAYHERTZPORTFOLIO
SERVICES GENERATIVE AI
About Us
Global AI Club
Careers
Case Studies
Work
Community
TraceRx
ESPN
Filecoin
Lottery of People
World Poker Tour
Chrysallis.AI
How to build a private LLM?
Language models are the backbone of natural language processing (NLP)
and have changed how we interact with language and technology.
User
Prompt
Model
Output
LLM
Training
Data
Read More
Show all Insights
Privacy & Cookies Policy
INDUSTRIES PRODUCTS
CONTACT US
Get In Touch
415-301-2880
info@leewayhertz.com
jobs@leewayhertz.com
388 Market Street
Suite 1300
San Francisco, California 94111
Sitemap
Generative AI
Arti몭cial Intelligence & ML
Web3
Blockchain
Software Development
Hire Developers
Generative AI Development
Generative AI Consulting
Generative AI Integration
LLM Development
Prompt Engineering
ChatGPT Developers
Consumer Electronics
Financial Markets
Healthcare
Logistics
Manufacturing
Startup
Whitelabel Crypto Wallet
Whitelabel Blockchain Explorer
Whitelabel Crypto Exchange
Whitelabel Enterprise Crypto Wallet
Whitelabel DAO
 
©2023 LeewayHertz. All Rights Reserved.

How to use LLMs in synthesizing training data?

  • 1.
    HOW TO USELLMS IN SYNTHESIZING TRAINING DATA? Talk to our Consultant   Listen to the article There is an unending quest for rich, diverse, and bias-free data in the dynamic realm of machine learning and arti몭cial intelligence. However, data, as indispensable as it is, often comes with its share of pitfalls — scarcity, privacy concerns, and biases, to name a few. Now, imagine a world where  
  • 2.
    data is abundant,unbiased, and unencumbered by privacy issues? Welcome to the world of synthetic data, a game-changing innovation that is reshaping the data science landscape. Harnessing the power of Large Language Models (LLMs), a powerful tool capable of understanding, generating, and even re몭ning human-like text, we can generate high-quality synthetic training data and train our models more e몭ciently. This article delves into how you can utilize LLMs to synthesize training data, o몭ering a unique solution to real-world data challenges. Through this comprehensive guide, we aim to provide you with a deep understanding of LLMs, elucidate the bene몭ts of synthetic data, and most importantly, guide you on how to use LLMs for synthesizing your own training data. What are LLMs? What is training data in ML and its importance? What is synthetic data? Synthetic data use cases Training machine learning models Mitigating data bias Safeguarding personal information Bene몭ts of synthesizing training data Step-by-step guide on using LLMs for synthesizing training data Step 1: Choosing the right LLM for your speci몭c application Step 2: Training the model with LLM generated synthetic data How to evaluate the quality of synthesized training data? Evaluating 몭delity Evaluating utility Evaluating privacy What are LLMs?
  • 3.
    Tokenization Embedding Transfer Learning Pretraining Attention LLMs Building Blocks LeewayHertz Before wediscuss what synthetic data is, it is important to understand what LLMs are. Large Language Models (LLMs) are intricate and sophisticated arti몭cial intelligence tools that learn and generate text-based responses that mimic human language. These AI models are trained on extensive volumes of text data – books, articles, web pages, and more, enabling them to decode and grasp the structure and patterns of language. With deep learning at their core, they perform an array of complex language tasks, generating top-tier results. The potential of popular LLMs like Google’s BERT, Facebook’s RoBERTa and OpenAI’s GPT series has been leveraged for various tasks like language translation, content creation, and more, showcasing their versatility and e몭ectiveness. Talking about their applications, LLMs boast an impressive range:
  • 4.
    Language translation: LLMsare great at translating text from one language to another, ensuring accuracy and pro몭ciency. Chatbots and conversational AI: They form the backbone of advanced chatbots and conversational AI systems, enabling 몭uent conversations with users. Content creation: Be it articles, summaries, or product descriptions, LLMs can generate contextually relevant and grammatically precise content. Text summarization: LLMs have the capacity to condense vast text content into shorter, more manageable summaries. Question answering: These models excel in identifying pertinent information from large text bodies to answer questions. Sentiment analysis: LLMs can decipher the underlying sentiment in a text, helping companies comprehend customer sentiment towards their o몭erings. Speech recognition: LLMs enhance speech recognition systems by understanding the context and meaning of spoken words more accurately. LLMs also excel in natural language processing enhancing the accuracy of search engines, improving customer service, and automated content creation. They even facilitate personalizing user experiences and make digital content more accessible. The hype around LLMs is justi몭ed, given their versatile applications. They have rede몭ned chatbot technology by simplifying its creation and maintenance, and their ability to generate varied and unexpected texts is noteworthy. LLMs can even be 몭ne-tuned to perform speci몭c NLP tasks, o몭ering the possibility of building NLP models more cost-e몭ectively and e몭ciently. Their unique capabilities have rightfully earned them the title of ‘Foundation Models.’ However, these models come with their own set of challenges – they require extensive computational resources, custom hardware, and a vast quantity of training data, making their development and maintenance a costly a몭air. What is training data in ML and its
  • 5.
    What is trainingdata in ML and its importance? Parametric Scenarios Partial Models Validation & Adaptation Reality (1) Modeling Apply Training (5) (2) Configuring Simulation (4) Concrete Instances Of Scenarios Synthetic Datasets Machine Learning System LeewayHertz Training data is the lifeblood of Machine Learning (ML) systems. It serves as the foundation upon which these systems learn, understand, and eventually, make predictions. It’s a crucial piece in the complex jigsaw of ML development, without which essential tasks would be impossible to carry out. At the heart of every successful AI and ML project lies quality training data. It is the key that enables a machine to learn human-like behavior and predict outcomes with higher accuracy. The role of training data in machine learning cannot be understated; it dictates the performance and accuracy of the AI model. Hence, understanding the value of a robust training dataset is pivotal to acquiring the right quantity and quality of data for your machine learning models. The correlation between the quality of training data and the accuracy of the model is direct. As a practitioner, you must realize its importance, and how it in몭uences the selection of an algorithm based on the availability and compatibility of your training dataset. Prioritizing training data in any AI or ML project is not just a good practice, but a necessary one. Investing in acquiring high-quality datasets will invariably lead to improved outcomes. To illustrate, consider a model being trained to recognize images of cats. The model’s ability to accurately identify a cat in a new image directly depends on the variety, quality, and quantity of
  • 6.
    cat images itwas trained on. Understanding its signi몭cance, ensuring its quality, and choosing the right quantity forms the basis of successful AI and ML projects. Always remember, the e몭cacy of your models is inextricably tied to the quality of your training data. Here are some of the areas where training data plays a vital role. The quality and success of these applications and processes are directly related to the quality and quantity of the training data. Object recognition and categorization Training data serves a critical role in supervised machine learning, particularly in the recognition and categorization of objects. Consider a scenario where an algorithm must distinguish between images of cats and dogs. In this case, labeled images of both species are required. The algorithm learns to discern the distinctive features between the two species based on this training data, enabling it to recognize and categorize similar objects in the future. If the training data is inaccurate or poor in quality, it could lead to inaccurate results, potentially derailing the success of an AI project. Crucial input for machine learning algorithms Training data is indispensable to the operation of machine learning algorithms. It’s the primary input that provides the algorithm with the information necessary to make decisions akin to human intelligence. In supervised machine learning, the algorithm requires labeled training data as an additional input. If the training data is not appropriately labeled, it diminishes its value for supervised learning. For instance, images must be annotated with precise metadata to be recognizable to machines through computer vision. Therefore, accuracy in labeling the training data is of paramount importance. Machine learning model validation
  • 7.
    Developing an AImodel is just part of the process; it’s equally vital to validate the model to assess its accuracy and ensure its performance in real-life scenarios. Validation or evaluation data is another form of training data, often set aside to test the model’s performance under di몭erent circumstances. This data helps verify the model’s ability to make accurate predictions, solidifying its reliability before deployment. Hence, the importance of training data extends beyond just learning; it also plays a key role in ensuring the overall quality and accuracy of the AI model. What is synthetic data? Synthetic data refers to data that isn’t collected from the real world, and is created arti몭cially using computer programs or simulations. Like an artist making a replica of a real painting, these computer programs replicate patterns found in real data, but without actually containing any real information. It’s used often in 몭elds like arti몭cial intelligence and machine learning because it helps overcome certain issues associated with real data. For example, real data can be biased, incomplete, or not diverse enough. Synthetic data o몭ers a tailored or customized environment for the purpose of training and improving AI and machine learning algorithms. It serves as a simulated practice area where these algorithms can learn, adapt, and develop their capabilities before being deployed in real-world scenarios. Synthetic data is generated to mimic real data but can be controlled and manipulated to provide speci몭c training scenarios and test di몭erent scenarios, making it a valuable tool for re몭ning AI and machine learning models. Because it is synthetic, you can create as much data as you need, and customize it to your needs. This is especially helpful when real-world data is hard to come by. Additionally, synthetic data is a superior choice when it comes to privacy. It can be created in a way that resembles real data, but without including any
  • 8.
    personal information, suchas someone’s name or address. This means AI researchers can use it to train their models without risking anyone’s privacy. Synthetic data use cases Training machine learning models Synthetic data is a valuable resource in the world of arti몭cial intelligence, speci몭cally when we are training machine learning models. Collecting real- world data can be tricky and time-consuming, especially when the data is sensitive or regulated by laws like GDPR. Real-world data may also have biases, be incomplete, or have errors. Here, synthetic data can step in as a substitute or complement to real-world data for training machine learning models. Using synthetic data gives machine learning models the chance to learn from larger and more varied data sets, which can bolster their performance and ability to adapt. Mitigating data bias Another fantastic application of synthetic data is its ability to help minimize bias and enhance fairness in data sets. It’s not unusual for real-world data to show bias or lack balance, leading to machine learning models that re몭ect these unfair leanings. When a data set isn’t an accurate representation of the population it aims to study—for instance, if it predominantly includes data from a speci몭c race or gender—it fails to capture the true diversity of experiences and behaviors across all groups. This can result in machine learning models that don’t accurately serve their intended population. By using synthetic data, we can more precisely mirror the population we are studying, as it gives us control over how characteristics like race, gender, and other demographics are distributed across the data set. This ensures that our data set better re몭ects the people it aims to represent.
  • 9.
    Safeguarding personal information Syntheticdata plays a crucial role in protecting privacy and security. Real- world data often holds private or sensitive details that shouldn’t be made public. Synthetic data can be used as a stand-in for real data, allowing researchers to perform their analyses without compromising individuals’ privacy. Personally Identi몭able Information (PII) can include anything from names, addresses, phone numbers, and email addresses to social security numbers, 몭nancial records, medical information, and biometric data. Under GDPR, it is mandatory for organizations to protect such information and obtain explicit consent from individuals before gathering, utilizing, or sharing it. Synthetic data allows us to bypass these issues while still gaining valuable insights. Benefits of synthesizing training data LeewayHertz Healthcare Agriculture Banking & Finance Manufacturing Disaster Prediction and Risk Management Automotive & Robotics eCommerce Application of Synthetic Data The use of synthesized data allows for the augmentation of existing datasets with synthetic versions, enhancing the training of various models and algorithms. Essentially, synthesized data acts as a fabricated data pool, aiding in the veri몭cation of mathematical models or the training of machine learning models. Synthesized data is employed across di몭erent sectors as a means to omit
  • 10.
    Synthesized data isemployed across di몭erent sectors as a means to omit certain sensitive elements from the original data. In some cases, datasets encompass con몭dential information that, due to privacy concerns, cannot be publicly shared. Synthetic data provides a solution by producing arti몭cial data that mirrors the original but does not retain any personally identi몭able information. This circumvents privacy issues related to using real consumer data without consent or remuneration. Synthetic data o몭ers a privacy- respecting alternative, facilitating the development and testing of algorithms and models without violating con몭dentiality. The use of synthesized training data o몭ers numerous bene몭ts, rendering it an invaluable resource for organizations: Abiding by privacy laws: Besides helping companies navigate privacy laws that restrict them from handling sensitive data, synthesized data also reduces the risk of customer data breaches or unauthorized sharing, which could result in costly legal battles and harm to brand reputation. Minimizes privacy concerns: Addressing privacy concerns forms a signi몭cant reason why organizations are increasingly employing synthetic data generation techniques. Enables data generation when historical data is unavailable: For completely new products or services, historical data might not be available. Furthermore, procuring human-annotated data can be costly and time- consuming. By generating synthetic data swiftly, these hurdles can be bypassed, facilitating the creation of reliable machine learning models. Cost-e몭ective and e몭cient: Synthetic data generation emerges as a cost- e몭ective and e몭cient solution for new product development and machine learning model training. Synthesized training data, thus, emerges as a pivotal tool in modern data handling, ensuring privacy, expanding datasets, and aiding in e몭cient and cost-e몭ective model training. Step­by­step guide on using LLMs for
  • 11.
    Step­by­step guide onusing LLMs for synthesizing training data? In this example, we will create synthetic sales data for training a sales prediction model. Imagine, we have a new sales prediction app for co몭ee shops that we want to check using this synthetic data. We explain in the following steps how an LLM can be used to synthesize this data for the model’s training: Step 1: Choosing the right LLM for your speci몭c application Selecting the right Language Learning Model (LLM) when synthesizing training data requires careful consideration of a few factors: Task requirements: The type of task you want to accomplish greatly in몭uences your choice of LLM. For example, if your task is related to text generation, a sequence-to-sequence model might be the best 몭t. On the other hand, for classi몭cation tasks, a simpler model might su몭ce. Data availability: The amount and quality of data you have at your disposal can in몭uence the complexity of the LLM that you choose. More complex models may require more data for training. Computational resources: More sophisticated LLMs require more computational power and memory for training and inference. You need to consider your available resources when choosing a model. Privacy concerns: If your data includes sensitive information, you may need to consider a model that can provide better data privacy. Accuracy vs. explainability: Some LLMs can o몭er high accuracy but have low explainability. If your project requires understanding the reasoning behind the model’s predictions, you might need to choose a simpler, more interpretable model. Model training time: Training a complex LLM can take a signi몭cant amount of time. Depending on the time constraints of your project, you might need to opt for a less complex model that can be trained more quickly.
  • 12.
    Finally, it’s agood practice to experiment with a few di몭erent models and compare their performance on your speci몭c task. This empirical evaluation can help you 몭nd the most suitable LLM for your use case. We can use various AI tools to create synthetic data for testing apps, building data analysis processes, and making machine learning models. ChatGPT is one powerful LLM among them. We could begin by asking ChatGPT to produce some data for us using the below prompt: Create a CSV file with 25 random sales records for a coffee shop. Each record should include the following fields: ­ id (incrementing integer starting at 1) ­ date (random date between 1/1/2022 and 12/31/2022) ­ time (random time between 6:00am and 9:00pm in 1­minute increments ­ product_id (incrementing integer starting at 1) ­ product ­ calories ­ price in USD ­ type (drink or food) ­ quantity (random integer between 1 and 3) ­ amount (price * quantity) ­ payment type (cash, credit, debit, or gift card) The way in which you phrase your query to ChatGPT can signi몭cantly alter the output it provides. From the query above, we can expect correct results but they won’t be very helpful for checking our app because they do not come in the right format. Also, ChatGPT can’t make a real CSV 몭le. Plus, it can only give back a certain amount of data. As a rule, ChatGPT can give back answers that are up to 2,048 parts long – that’s the most the GPT-3 model allows. Instead of making ChatGPT create the synthetic data directly, we could ask it
  • 13.
    Instead of makingChatGPT create the synthetic data directly, we could ask it to write a Python script. This script would then generate the synthetic sales data we need. This is a better approach because it can be used on a larger scale. So, let’s instruct ChatGPT to write a Python program to create synthetic sales data similar to our previous requirements. Write a Python3 code to make a CSV file with 100 sales records for typ id (a series of numbers starting from 1) date (any random date between 1/1/2022 and 12/31/2022) time (any random time from 6:00am to 9:00pm, changing every minute) product_id (a series of numbers starting from 1) product calories price (in USD) type (either a drink or food) quantity (a random number between 1 and 3) amount (price times quantity) payment type (could be cash, credit, debit, or gift card) The generated code is as follows: import csv import random from random import randrange from datetime import timedelta, datetime # List of possible products and their corresponding details products = [ {"product_id": 1, "product": "Espresso", "calories": 10, "price": {"product_id": 2, "product": "Cappuccino", "calories": 120, "price
  • 14.
    {"product_id": 3, "product":"Croissant", "calories": 300, "price" # add more products as needed... ] payment_types = ["cash", "credit", "debit", "gift card"] start_date = datetime.strptime("1/1/2022", "%m/%d/%Y") end_date = datetime.strptime("12/31/2022", "%m/%d/%Y") def random_date(start, end): return start + timedelta(seconds=random.randint(0, int((end ­ star def random_time(): return "{:02d}:{:02d}".format(randrange(6, 21), randrange(0, 60)) with open('sales_data.csv', 'w', newline='') as file: writer = csv.writer(file) writer.writerow(["id", "date", "time", "product_id", "product", "c for i in range(1, 101): product = random.choice(products) quantity = randrange(1, 4) writer.writerow([ i, random_date(start_date, end_date).strftime("%m/%d/%Y"), random_time(), product["product_id"], product["product"], product["calories"], product["price"], product["type"], quantity,
            quantity * product["price"],
            random.choice(payment_types)
        ])

ChatGPT was able to generate a full Python program to produce synthetic sales data from just one short prompt. However, because of ChatGPT's response size limit, we could only list six items on the coffee shop menu. If we asked for more items, the output and the program would be cut off, which means the program wouldn't run. You can provide additional prompts like the ones below:

# Additional prompts:
# Create a function to return one random item from a list of dictionaries
# containing 15 common drink items sold in a coffee shop, including the
# price and calories for each item. Capitalize the first letter of each
# product name. Start product id at 1.
# Create a function to return one random item from a list of dictionaries
# containing 10 common food items sold in a coffee shop, including the
# price and calories for each item. Capitalize the first letter of each
# product name. Start product id at 16.

To install the faker dependency, execute the command below:

pip install faker

The complete generated code will look like this:

import random
import csv
from datetime import datetime, timedelta
from faker import Faker

# Function to generate a random time between start and end
def random_time(start, end):
    return start + timedelta(
        seconds=random.randint(0, int((end - start).total_seconds())),
    )

# Generate drink items
# (calories and prices are illustrative placeholder values)
def generate_drink_items():
    drinks = [{'id': i + 1,
               'name': Faker().words(nb=1, unique=True)[0].capitalize(),
               'calories': random.randint(5, 250),
               'price': round(random.uniform(2.0, 6.0), 2),
               'type': 'drink'} for i in range(15)]
    return random.choice(drinks)

# Generate food items
def generate_food_items():
    foods = [{'id': i + 16,
              'name': Faker().words(nb=1, unique=True)[0].capitalize(),
              'calories': random.randint(100, 600),
              'price': round(random.uniform(2.0, 8.0), 2),
              'type': 'food'} for i in range(10)]
    return random.choice(foods)

# Generate payment types
def generate_payment_types():
    return random.choice(['cash', 'credit', 'debit', 'gift card'])

# Open the CSV file
with open('coffee_shop_sales_chatgpt_data.csv', 'w', newline='') as file:
    writer = csv.writer(file)
    writer.writerow(["id", "date", "time", "product_id", "product", "calories",
                     "price", "type", "quantity", "amount", "payment type"])

    # Generate sales records
    for i in range(100):
        item = generate_drink_items() if random.choice([True, False]) else generate_food_items()
        date = Faker().date_between_dates(date_start=datetime(2022, 1, 1),
                                          date_end=datetime(2022, 12, 31))
        time = random_time(datetime.strptime('6:00 AM', '%I:%M %p'),
                           datetime.strptime('9:00 PM', '%I:%M %p'))
        quantity = random.randint(1, 3)
        amount = round(quantity * item['price'], 2)
        writer.writerow([i + 1, date, time.strftime('%H:%M'), item['id'], item['name'],
                         item['calories'], item['price'], item['type'], quantity, amount,
                         generate_payment_types()])

This code selects from 15 different drink items and 10 different food items, all with unique names, and writes 100 sales records to a CSV file named "coffee_shop_sales_chatgpt_data.csv". You can find sample reference code in this GitHub location. Copy and paste the code into VS Code and run it. The generated synthetic data will look like this: https://github.com/garystafford/ten-ways-gen-ai-code-gen/blob/main/data/output/coffee_shop_sales_data_chatgpt.csv

Step 2: Training the model with LLM-generated synthetic data

We will train a sales prediction model with the generated synthetic data. This involves several steps:

Preprocessing the data
Splitting it into training and testing datasets
Selecting a model
Training the model
Evaluating its performance
Using it for prediction

Below is a high-level description of these steps using Python and popular libraries like pandas and scikit-learn.

Note: The exact code and approach would depend on the specifics of your application, data, and prediction task. The description below assumes a
regression task, where you are trying to predict a continuous value such as the sales amount.

Load the data: Start by loading your synthetic data into a pandas DataFrame:

import pandas as pd

data = pd.read_csv('coffee_shop_sales_chatgpt_data.csv')

Preprocess the data: Before training your model, you will need to preprocess your data. This could include:

Converting categorical variables into numeric variables using techniques like one-hot encoding.
Normalizing numeric variables so they are on the same scale.

Here is an example using pandas and scikit-learn:

import numpy as np
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# One-hot encoding for categorical features
# (handle_unknown='ignore' lets the same encoder be reused on new data
# that contains categories not seen during fitting)
encoder = OneHotEncoder(handle_unknown='ignore')
categorical_features = ['product', 'type', 'payment type']
encoded_features = encoder.fit_transform(data[categorical_features]).toarray()

# Normalizing numeric features
scaler = StandardScaler()
numeric_features = ['calories', 'price', 'quantity']
scaled_features = scaler.fit_transform(data[numeric_features])

# Combine the processed features back into a single array
X = np.concatenate([scaled_features, encoded_features], axis=1)
y = data['amount']
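As a side note, the same preprocessing can be expressed more compactly with scikit-learn's ColumnTransformer, which keeps the encoder and scaler bundled together so they are applied consistently to any new data later. This is only a sketch of an equivalent alternative, assuming the same column names as above:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Bundle the per-column transformations into a single object
preprocessor = ColumnTransformer(transformers=[
    ('cat', OneHotEncoder(handle_unknown='ignore'), ['product', 'type', 'payment type']),
    ('num', StandardScaler(), ['calories', 'price', 'quantity']),
])

# Equivalent to the X built above; later, preprocessor.transform(new_data)
# applies the identical steps to unseen data
X_alt = preprocessor.fit_transform(data)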
Split the data: You should split your data into a training set and a testing set. This allows you to evaluate your model's performance on unseen data:

from sklearn.model_selection import train_test_split

# Hold out a portion of the data (here 20%) for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Train the model: Choose a suitable machine learning model for sales prediction. For instance, a RandomForestRegressor or GradientBoostingRegressor can be used:

from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(X_train, y_train)

Evaluate the model: Use the testing set to evaluate the model's performance. A common metric for regression tasks is the mean absolute error (MAE):

from sklearn.metrics import mean_absolute_error

y_pred = model.predict(X_test)
print('Mean Absolute Error:', mean_absolute_error(y_test, y_pred))
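If you want a more robust estimate than a single train/test split provides, you could optionally cross-validate the model as well. This short sketch assumes the same X, y, and model objects defined above:

from sklearn.model_selection import cross_val_score

# 5-fold cross-validation; scikit-learn reports MAE as a negative score,
# so the sign is flipped for readability
scores = -cross_val_score(model, X, y, cv=5, scoring='neg_mean_absolute_error')
print('Cross-validated MAE: %.3f (+/- %.3f)' % (scores.mean(), scores.std()))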
Predict sales: With the model trained, you can now use it to predict sales:

# Let's say `new_data` is your new sales data for prediction.
new_data = pd.read_csv('sales_data.csv')

def preprocess(new_data):
    # Perform the same preprocessing steps as before
    encoded_features = encoder.transform(new_data[categorical_features]).toarray()
    scaled_features = scaler.transform(new_data[numeric_features])
    # Combine the processed features back into a single array
    preprocessed_data = np.concatenate([scaled_features, encoded_features], axis=1)
    return preprocessed_data

# Preprocess the new data
preprocessed_data = preprocess(new_data)

# Use the preprocessed data to make predictions
predictions = model.predict(preprocessed_data)

Please note: the feature names should match those that were passed during fit.

This example provides a basic outline of the process. Each step has many possible variations, and the best approach will depend on your specific data, problem, and requirements. For example, you might want to try different preprocessing techniques, machine learning models, and evaluation metrics.

How to evaluate the quality of synthesized training data?

For the adoption of synthetic data in machine learning and analytics projects, it's not only essential to ensure the synthetic data serves its intended
purpose and meets application requirements, but it's also crucial to measure and ensure the quality of the produced data.

In light of growing legal and ethical mandates for privacy protection, synthetic data's capability to eliminate sensitive and original information during its generation is one of its key strengths. Therefore, alongside quality, we require metrics to assess the risk of potential privacy breaches, if any, and to ensure that the generation process does not merely replicate the original data.

To address these needs, we can evaluate the quality of synthetic data across multiple dimensions, facilitating a better understanding of the generated data for users, stakeholders, and ourselves. The quality of generated synthetic data is evaluated across three primary dimensions:

1. Fidelity
2. Utility
3. Privacy

A synthetic data quality report should be able to answer the following questions about the generated synthetic data:

How does this synthetic data compare with the original training set?
What is the usefulness of this synthetic data in our downstream applications?
Has any information been inadvertently leaked from the original training set into the synthetic data?
Has any sensitive information from other datasets (not used for model training) been unintentionally synthesized by our model?

The metrics translating these dimensions for end users can be quite flexible, as the data to be generated can have varying distributions, sizes, and behaviors. They should also be easy to comprehend and interpret.
In essence, the metrics should be completely data-driven, not requiring any pre-existing knowledge or domain-specific information. However, if users wish to implement certain rules and constraints relevant to a particular business domain, they should be able to specify them during the synthesis process to ensure domain-specific fidelity is maintained. Let's delve deeper into each of these metrics.

Evaluating fidelity

When we talk about the quality of synthetic data, one of the key aspects we consider is 'fidelity', which basically means how closely the synthetic data matches the original data. We want to make sure the synthetic data is similar enough to the original that it can serve its purpose well. Let's break down the ways we measure fidelity:

Statistical comparisons: This is a way to compare the key features of the original and synthetic data sets. We look at things like the average (mean), middle value (median), spread of data (standard deviation), number of distinct values, missing values, and range of values. We do this for each category of data to see if the synthetic data is statistically similar to the original. If it's not, we might need to generate the synthetic data again with different settings.

Histogram similarity score: This score helps us understand how similar the distribution of each feature (or category of data) is in the synthetic and original datasets. If the score is 1, it means the distributions in the synthetic data perfectly match the original.

Mutual information score: This score tells us how dependent two features are on each other. In other words, it shows us how much information about one feature you can get by looking at another.

Correlation score: This score tells us how well relationships between two or more columns of data have been preserved in the synthetic data. These relationships are important because they can reveal connections between different pieces of data.

For certain types of data, like time-series or sequential data, we also use additional metrics to measure the quality. For instance, we can look at autocorrelation and partial autocorrelation scores to see how well the synthetic data has preserved significant correlations from the original dataset. A short code sketch of some of these fidelity checks follows below.
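To make the fidelity checks concrete, here is a minimal sketch of how the statistical-comparison, distribution-similarity, and correlation checks could be computed with pandas and SciPy. It assumes two DataFrames, real_df and synthetic_df, with the same columns; the function names and the similarity formulas are ours, chosen for illustration rather than taken from any standard library.

import numpy as np
import pandas as pd
from scipy.stats import wasserstein_distance

def compare_statistics(real_df, synthetic_df, numeric_cols):
    # Side-by-side summary statistics (mean, std, min, max, ...) per column
    return pd.concat({'real': real_df[numeric_cols].describe(),
                      'synthetic': synthetic_df[numeric_cols].describe()}, axis=1)

def histogram_similarity(real_col, synthetic_col):
    # Simple distribution-similarity proxy: 1 means identical distributions,
    # values near 0 mean very different distributions
    distance = wasserstein_distance(real_col, synthetic_col)
    scale = real_col.std() if real_col.std() > 0 else 1.0
    return 1.0 / (1.0 + distance / scale)

def correlation_score(real_df, synthetic_df, numeric_cols):
    # How closely the pairwise correlation structure is preserved:
    # 1 minus the mean absolute difference of the two correlation matrices
    diff = (real_df[numeric_cols].corr() - synthetic_df[numeric_cols].corr()).abs()
    return 1.0 - diff.values[np.triu_indices_from(diff, k=1)].mean()

# Example usage with the coffee shop columns defined earlier:
# numeric_cols = ['calories', 'price', 'quantity', 'amount']
# print(compare_statistics(real_df, synthetic_df, numeric_cols))
# print(histogram_similarity(real_df['amount'], synthetic_df['amount']))
# print(correlation_score(real_df, synthetic_df, numeric_cols))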
    di몭erent pieces ofdata. For certain types of data, like time-series or sequential data, we also use additional metrics to measure the quality. For instance, we can look at autocorrelation and partial autocorrelation scores to see how well the synthetic data has preserved signi몭cant correlations from the original dataset. Evaluating utility In addition to 몭delity, the ‘utility’ or usefulness of synthetic data is also important. We need to ensure that the synthetic data performs well on common tasks in data science. Prediction score: This is a measure of how well models trained on synthetic data perform compared to models trained on the original data. We compare the outcomes of these models against a testing set of data that hasn’t been seen before. This gives us an idea of how good the synthetic data is in terms of training e몭ective models. Feature importance score: This score looks at the importance of di몭erent features (or categories of data) and checks if this order of importance is the same in the synthetic and original data. If the order is the same, it means the synthetic data has high utility. QScore: This score is used to check if a model trained on synthetic data will give the same results as a model trained on original data. It does this by running random aggregation-based queries on both datasets and comparing the results. If the results are similar, it means the synthetic data has good utility. Evaluating privacy Privacy is a signi몭cant concern when it comes to synthesizing data. It’s important to protect sensitive information to meet ethical and legal requirements.
Evaluating privacy

Privacy is a significant concern when it comes to synthesizing data. It's important to protect sensitive information to meet ethical and legal requirements.

Exact match score: This is a measure of how many original data records can be found in the synthetic dataset. We want this score to be low to ensure privacy (a short sketch follows this list).

Neighbors' privacy score: This score indicates how many synthetic records are very similar to the real ones, which could be a potential privacy concern. A lower score means better privacy.

Membership inference score: This score tells us how likely it is that someone could correctly guess whether a specific data record was part of the original dataset. The lower this score, the better the privacy.

Holdout concept: It's important to prevent the synthetic data from simply copying the original data. To avoid this, a portion of the original data is set aside and used to evaluate the synthetic data. This helps to maintain the balance between data utility and privacy.
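The exact match score is the simplest of these to compute: count how many synthetic rows are identical to a row in the original data. Here is a minimal pandas sketch, again assuming real_df and synthetic_df share the same columns; the function name is ours.

import pandas as pd

def exact_match_score(real_df, synthetic_df):
    # Inner-join on all columns: each surviving row is a distinct synthetic
    # record that exactly matches at least one original record
    synth_unique = synthetic_df.drop_duplicates()
    matches = synth_unique.merge(real_df.drop_duplicates(), how='inner')
    # Share of distinct synthetic records that are verbatim copies (lower is better)
    return len(matches) / len(synth_unique)

# Example: a score of 0.0 means no synthetic record is a verbatim copy
# print(exact_match_score(real_df, synthetic_df))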
Endnote

LLMs have become an increasingly valuable asset for data scientists and researchers, especially when it comes to synthesizing training data. Through the power of advanced machine learning algorithms and generative models, synthetic data can be created to mimic real-world data while maintaining privacy and confidentiality. While synthetic data generation presents challenges, such as modeling complex data distributions, its potential benefits are evident. Synthetic data finds application in various domains, including training and testing machine learning models, conducting simulations, and performing experiments. As the field of synthetic data generation continues to progress, we can anticipate the emergence of innovative techniques and tools, fueling further advancements in this exciting area of research and development.

Looking to explore LLMs for synthesizing training data? Use our Large Language Model (LLM) development service for easy, unbiased data creation. Contact LeewayHertz's experts to start your AI journey today!

Author's Bio

Akash Takyar
CEO, LeewayHertz

Akash Takyar is the founder and CEO of LeewayHertz. The experience of building over 100 platforms for startups and enterprises allows Akash to rapidly architect and design solutions that are scalable and beautiful. Akash's ability to build enterprise-grade technology solutions has attracted over 30 Fortune 500 companies, including Siemens, 3M, P&G, and Hershey's. Akash is an early adopter of new technology, a passionate technology enthusiast, and an investor in AI and IoT startups.