Discover how to train open-source foundation models into domain-specific LLMs, while exploring the benefits, challenges, and a detailed case study of the BloombergGPT model.
Small Language Models Explained A Beginners Guide.pdf (ChristopherTHyatt)
Small language models are compact AI systems designed for efficient processing of text data, suitable for various applications such as chatbots, text generation, and language understanding in constrained environments.
Explore the transformative power of generative AI in our latest E42 blog post, diving deep into its capabilities for enterprise-level process automation. The article explains the core principles of generative AI, uncovers the crucial role played by on-premises Large Language Models (LLMs) in facilitating secure and compliant digital transformations across industry verticals, and offers a glimpse into the future of AI, where multimodal enhancements and breakthroughs in bias mitigation promise to reshape the landscape of process automation.
Crafting Your Customized Legal Mastery: A Guide to Building Your Private LLM (ChristopherTHyatt)
Create a tailored legal education with a private LLM. Identify your specialization, research courses from reputable institutions, and leverage online platforms for flexibility. Craft a unique curriculum combining law with interdisciplinary studies, enhancing your expertise. Network with professionals, balance theory with practical experience, and stay updated on legal trends. Build a personalized learning journey to unlock your full potential in the legal landscape.
Explore the leading Large Language Models (LLMs) and their capabilities with a comprehensive evaluation. Dive into their performance, architecture, and applications to gain insights into the state-of-the-art in natural language processing. Discover which LLM best suits your needs and stay ahead in the world of AI-driven language understanding.
Artificial Intelligence has unleashed a wave of innovation, from effortlessly summarizing articles to engaging in deep, thought-provoking conversations, with large language models taking on the primary workload.
Enter the extraordinary realm of large language models (LLMs), the brainchild of deep learning algorithms. These powerhouses not only decipher and grasp massive amounts of data but also possess the uncanny ability to recognize, summarize, translate, predict, and even generate a diverse range of textual and coding content.
A comprehensive guide to prompt engineering.pdf (StephenAmell4)
Prompt engineering is the practice of designing and refining specific text prompts to guide transformer-based language models, such as Large Language Models (LLMs), in generating desired outputs. It involves crafting clear and specific instructions and allowing the model sufficient time to process information. By carefully engineering prompts, practitioners can harness the capabilities of LLMs to achieve different goals.
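The "clear and specific instructions" practice described above can be sketched as a simple prompt template. The `build_prompt` helper below is a hypothetical illustration, not part of any particular library:

```python
# Toy illustration of prompt engineering: a template that encodes
# clear instructions, context, and an explicit output format.
# build_prompt is a hypothetical helper, not from any specific library.

def build_prompt(task: str, context: str, output_format: str) -> str:
    """Assemble a structured prompt with explicit instructions."""
    return (
        "You are a careful assistant.\n"
        f"Task: {task}\n"
        f"Context:\n{context}\n"
        f"Respond strictly in this format: {output_format}\n"
        "Think step by step before answering."
    )

prompt = build_prompt(
    task="Summarize the quarterly report in two sentences.",
    context="Revenue rose 12% year over year; costs were flat.",
    output_format="a JSON object with a single 'summary' key",
)
print(prompt)
```

The string produced here would be sent to an LLM API; the value of the pattern is that task, context, and output format are stated explicitly rather than left implicit.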
A comprehensive guide to prompt engineering.pdf (JamieDornan2)
Prompt engineering is the practice of designing and refining specific text prompts to guide transformer-based language models, such as Large Language Models (LLMs), in generating desired outputs. It involves crafting clear and specific instructions and allowing the model sufficient time to process information. By carefully engineering prompts, practitioners can harness the capabilities of LLMs to achieve different goals.
Foundation models represent formidable tools that have transformed the realms of AI and NLP. They form the core of diverse applications, empowering developers and researchers to enhance existing language understanding and generation capabilities.
Conversational AI Transforming human-machine interaction.pdf (JamieDornan2)
Conversational AI is a subset of artificial intelligence that enables human-like interactions between computers and humans using natural language. It leverages natural language processing (NLP) and machine learning to allow machines to understand, process, and respond to human language in a way that mimics natural conversation.
These systems combine techniques from several domains, including NLP for understanding textual or spoken inputs, machine learning to improve response accuracy over time, and speech recognition to handle voice interactions.
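As a rough illustration of how such a system routes an input to a response, here is a toy intent matcher; a real conversational AI system replaces the keyword overlap below with learned NLP models and speech recognition:

```python
# A minimal sketch of the understand-then-respond loop described above,
# using keyword matching as a stand-in for real NLP intent classification.

INTENTS = {
    "greeting": {"hello", "hi", "hey"},
    "hours":    {"open", "hours", "close"},
    "goodbye":  {"bye", "goodbye"},
}

RESPONSES = {
    "greeting": "Hello! How can I help you today?",
    "hours":    "We are open 9am-5pm, Monday to Friday.",
    "goodbye":  "Goodbye! Have a great day.",
    "fallback": "Sorry, I didn't understand that.",
}

def classify(utterance: str) -> str:
    """Pick the intent whose keyword set overlaps the utterance the most."""
    words = set(utterance.lower().replace("?", "").replace("!", "").split())
    best, score = "fallback", 0
    for intent, keywords in INTENTS.items():
        overlap = len(words & keywords)
        if overlap > score:
            best, score = intent, overlap
    return best

def respond(utterance: str) -> str:
    return RESPONSES[classify(utterance)]

print(respond("Hi there!"))
print(respond("When do you open?"))
```

The "machine learning to improve response accuracy over time" part corresponds to replacing `classify` with a trained model that is periodically retrained on conversation logs.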
Automate your Job and Business with ChatGPT #3 - Fundamentals of LLM/GPT (Anant Corporation)
Episode 3: The LLM / GPT / AI Prompt / Data Engineer Roadmap
In this episode, we'll discuss the history, fundamentals, and the different flavors of LLMs available beyond GPT/ChatGPT. This is a dry run of a session that will be part of an LLM Bootcamp (fill out the survey at the link if you are interested in an in-person vs. virtual session).
Intro / Fundamentals of LLM
LLM Foundations
History of LLMs
Tuning, Training, or "In Context Learning" with LLMs
What is "Prompt Engineering"
Case for Augmenting LLMs
Series: Using AI / ChatGPT at Work - GPT Automation
Are you a small business owner or web developer interested in leveraging the power of GPT (Generative Pre-trained Transformer) technology to enhance your business processes? If so, join us for a series of events focused on using GPT in business. Whether you're a small business owner or a web developer, you'll learn how to leverage GPT to improve your workflow and provide better services to your customers.
GPT Automation: What it is and How it Works
How Time-Saving GPT Automation Can Improve Your Business
Cost-Effective GPT Automation: How it Can Save Your Business Money
Using GPT Automation for Customer Service: Benefits and Best Practices
The Power of GPT Automation for Content Creation
Data Analysis Made Easy with GPT Automation
Top GPT-3 Automation Tools for Businesses
The Ethical Considerations of GPT Automation
Overcoming Bias in GPT Automation: Best Practices
The Future of GPT Automation: Trends and Predictions
Since we focus on "no code" here, we'll explore the tools that are already out there such as ChatGPT plugins for Chrome, OpenAI GPT API, low-code/no-code platforms like Make/Integromat and Zapier, existing apps like Jasper/Rytr, and ecosystem tools like Everyprompt. We'll also discuss the resources available for those interested in learning more about GPT, including other people’s prompts.
How to Enhance NLP’s Accuracy with Large Language Models_ A Comprehensive Gui... (Nexgits Private Limited)
Learn how LLMs can significantly improve the accuracy of natural language processing tasks, and how to leverage them in your own NLP models, in this comprehensive guide by Nexgits.
Unlocking the Power of Generative AI An Executive's Guide.pdf (PremNaraindas1)
Generative AI is here, and it can revolutionize your business. With its powerful capabilities, this technology can help companies create more efficient processes, unlock new insights from data, and drive innovation. But how do you make the most of these opportunities?
This guide will provide you with the information and resources needed to understand the ins and outs of Generative AI, so you can make informed decisions and capitalize on the potential. It covers important topics such as strategies for leveraging large language models, optimizing MLOps processes, and best practices for building with Generative AI.
The financial industry is witnessing an emerging trend of Large Language Models (LLMs) applications to improve operational efficiency. This article, based on a round table discussion hosted by TruEra and QuantUniversity in New York in May 2023, explores the potential use cases of LLMs in financial institutions (FIs), the risks to consider, approaches to manage these risks, and the implications for people, skills, and ways of working. Frontline personnel from Data and Analytics/AI teams, Model Risk, Data Management, and other roles from fifteen financial institutions devoted over two hours to discussing the LLM opportunities within their industry, as well as strategies for mitigating associated risks.
The discussions revealed a preference for discriminative use cases over generative ones, with a focus on information retrieval and operational automation. The necessity for a human-in-the-loop was emphasized, along with a detailed discourse on risks and their mitigation.
Testing LLMs in production allows you to understand your model better and helps identify and rectify bugs early. There are different approaches and stages of production testing for LLMs. Let’s get an overview.
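One common stage of such production testing is a golden-set regression check. The sketch below assumes a stand-in `model` function where a real deployment would call the actual LLM API:

```python
# A hedged sketch of one production-testing stage: a golden-set regression
# check that flags output drift. `model` is a stand-in for a real LLM call.

def model(prompt: str) -> str:
    # Placeholder: a deployed system would call an actual LLM API here.
    canned = {
        "capital of France?": "Paris",
        "2 + 2 =": "4",
    }
    return canned.get(prompt, "unknown")

# Prompts with known-good answers, curated by humans.
GOLDEN_SET = [
    ("capital of France?", "Paris"),
    ("2 + 2 =", "4"),
]

def run_regression(golden):
    """Return the (prompt, expected, got) triples that failed."""
    failures = []
    for prompt, expected in golden:
        got = model(prompt)
        if got != expected:
            failures.append((prompt, expected, got))
    return failures

failures = run_regression(GOLDEN_SET)
print(f"{len(GOLDEN_SET) - len(failures)}/{len(GOLDEN_SET)} cases passed")
```

In practice the exact-match comparison is usually replaced with fuzzier scoring (semantic similarity, rubric-based grading), but the structure of the check is the same.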
DataScientist Job: Between Myths and Reality.pdf (Jedha Bootcamp)
Swipe through the smoke and mirrors and learn about the "sexiest job of the 21st century" with Nicola, Machine Learning Scientist @ Bumble
✨ Artificial Intelligence? Business Intelligence? Data Science? What do these terms sound like when put into action at one of the world's foremost dating platforms? Jedha is proud to host an evening with Nicola Ghio, Senior Machine Learning Scientist at Bumble, who will give us a "peek behind the curtain" into what this enviable job title looks like in practice.
😎 Nicola will share some of his experiences working at Bumble. 🎯 Hear first-hand about Bumble's harassment and toxic imaging detector as well as the real skills required to work in the industry. We also look forward to hearing about Nicola's personal story, his background and his advice for those that want to dive deeper into the world of tech.
Meet Jedha 😍 Your Data and Cyber Security Bootcamp, ranked #1 in Europe (Switch Up). Our mission is to demystify the world of tech and to make its skills accessible to anyone who desires to learn. We have courses suited to all ambitions and skill levels: From beginners who have never typed a line of code in their lives right through to skilled tech professionals who want to achieve mastery. Our methods and teachers help to unlock human potential in the unlimited world of tech.
Improving the Capabilities of Large Language Model based Marketing Analytics ... (IJCI JOURNAL)
Artificial intelligence (AI) is widely deployed to solve problems related to marketing attribution and budget optimization. However, AI models can be quite complex, and it can be difficult to understand model workings and insights without extensive implementation teams. In principle, recently developed large language models (LLMs), like GPT-4, can be deployed to provide marketing insights, reducing the time and effort required to make critical decisions. In practice, there are substantial challenges that need to be overcome to reliably use such models. We focus on domain-specific question-answering, SQL generation needed for data retrieval, and tabular analysis and show how a combination of semantic search, prompt engineering, and fine-tuning can be applied to dramatically improve the ability of LLMs to execute these tasks accurately. We compare both proprietary models, like GPT-4, and open-source models, like Llama-2-70b, as well as various embedding methods. These models are tested on sample use cases specific to marketing mix modeling and attribution.
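The retrieval step of the pipeline described above can be sketched as follows; bag-of-words overlap stands in for real semantic embeddings, and all table and column names are hypothetical:

```python
# A toy sketch of semantic-search-assisted SQL generation: score schema
# snippets against the question (word overlap standing in for embedding
# similarity) and put the best match into an SQL-generation prompt.
# All table and column names here are hypothetical.

SCHEMAS = {
    "spend":   "TABLE spend(channel TEXT, week DATE, dollars REAL)",
    "outcome": "TABLE outcome(week DATE, conversions INTEGER)",
    "geo":     "TABLE geo(region TEXT, population INTEGER)",
}

def score(question: str, snippet: str) -> int:
    """Count shared words between the question and a schema snippet."""
    q = set(question.lower().split())
    s = set(snippet.lower().replace("(", " ").replace(",", " ").split())
    return len(q & s)

def retrieve_schema(question: str) -> str:
    return max(SCHEMAS.values(), key=lambda s: score(question, s))

def sql_prompt(question: str) -> str:
    return (
        f"Given the schema:\n{retrieve_schema(question)}\n"
        f"Write a SQL query answering: {question}"
    )

print(sql_prompt("total dollars by channel"))
```

A production system would compute `score` with dense embeddings and retrieve several candidate tables, but the shape of the prompt assembly is the same.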
AI and ML Series - Introduction to Generative AI and LLMs - Session 1 (DianaGray10)
Session 1
👉This first session will cover an introduction to Generative AI & harnessing the power of large language models. The following topics will be discussed:
Introduction to Generative AI & harnessing the power of large language models.
What's generative AI & what's an LLM?
How are we using it in our document understanding & communication mining models?
How to develop a trustworthy and unbiased AI model using LLM & GenAI.
Personal Intelligent Assistant
Speakers:
📌George Roth - AI Evangelist at UiPath
📌Sharon Palawandram - Senior Machine Learning Consultant @ Ashling Partners & UiPath MVP
📌Russel Alfeche - Technology Leader RPA @qBotica & UiPath MVP
This infographic features:
- The anatomy of ChatGPT
- The key benefits of language models for businesses
- Top use cases for conversational AI in business
- The current state of conversational tech.
Download your free copy and keep up with the latest machine learning developments!
Leverage generative AI's capabilities to unlock your enterprise application's full potential. Here is a detailed guide on how to build generative AI solutions.
How AI is transforming travel and logistics operations for the better (Benjaminlapid1)
Discover how AI revolutionizes the Travel and Logistics industry through efficient operations, optimized supply chains, and enhanced customer experience.
Explore the importance of data security in AI systems. Learn about data security regulations, principles, strategies, best practices, and future trends.
Delve into this insightful article to explore the current state of generative AI, its ethical implications, and the power of generative AI models across various industries.
Natural Language Processing: A comprehensive overview (Benjaminlapid1)
Natural language processing enhances human-computer interaction by bridging the language gap. Uncover its applications and techniques in this comprehensive overview. Dive in now!
Generative AI: A Comprehensive Tech Stack Breakdown (Benjaminlapid1)
Build a reliable and effective generative AI system with the right generative AI tech stack that helps create smarter solutions and drive growth.
Click here for more information: https://www.leewayhertz.com/generative-ai-tech-stack/
UiPath Test Automation using UiPath Test Suite series, part 3 (DianaGray10)
Welcome to part 3 of the UiPath Test Automation using UiPath Test Suite series. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation introduction
UI automation sample
Desktop automation flow
Speakers:
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality (Inflectra)
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure OpenAI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
GraphRAG is All You need? LLM & Knowledge Graph (Guy Korland)
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
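As a minimal sketch of the GraphRAG idea covered in these articles (retrieve facts connected to the query entity from a knowledge graph, then prepend them to the prompt), consider this toy example; the triples are illustrative only:

```python
# A minimal sketch of graph-augmented retrieval: pull facts connected to
# the query entity out of a small knowledge graph and prepend them to the
# prompt. The graph triples here are illustrative only.

TRIPLES = [
    ("FalkorDB", "is_a", "graph database"),
    ("FalkorDB", "founded_by", "Guy Korland"),
    ("GraphRAG", "combines", "knowledge graphs"),
    ("GraphRAG", "combines", "large language models"),
]

def neighbors(entity: str):
    """Return all triples touching the entity (1-hop retrieval)."""
    return [t for t in TRIPLES if entity in (t[0], t[2])]

def augmented_prompt(question: str, entity: str) -> str:
    facts = "\n".join(f"{s} {p} {o}" for s, p, o in neighbors(entity))
    return f"Facts:\n{facts}\n\nQuestion: {question}"

print(augmented_prompt("What is GraphRAG?", "GraphRAG"))
```

Real systems replace the triple list with a graph database query (e.g. Cypher over a property graph) and traverse more than one hop, but the augmentation pattern is the same.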
Generating a custom Ruby SDK for your web service or Rails API using Smithy (g2nightmarescribd)
Have you ever wanted a Ruby client API to communicate with your web service? Smithy is a protocol-agnostic language for defining services and SDKs. Smithy Ruby is an implementation of Smithy that generates a Ruby SDK using a Smithy model. In this talk, we will explore Smithy and Smithy Ruby to learn how to generate custom feature-rich SDKs that can communicate with any web service, such as a Rails JSON API.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... (UiPathCommunity)
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
The Art of the Pitch: WordPress Relationships and Sales (Laura Byrne)
Clients don't know what they don't know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
Accelerate your Kubernetes clusters with Varnish Caching (Thijs Feryn)
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology is pushing into IT I was wondering myself, as an “infrastructure container kubernetes guy”, how get this fancy AI technology get managed from an infrastructure operational view? Is it possible to apply our lovely cloud native principals as well? What benefit’s both technologies could bring to each other?
Let me take this questions and provide you a short journey through existing deployment models and use cases for AI software. On practical examples, we discuss what cloud/on-premise strategy we may need for applying it to our own infrastructure to get it to work from an enterprise perspective. I want to give an overview about infrastructure requirements and technologies, what could be beneficial or limiting your AI use cases in an enterprise environment. An interactive Demo will give you some insides, what approaches I got already working for real.
Essentials of Automations: Optimizing FME Workflows with ParametersSafe Software
Are you looking to streamline your workflows and boost your projects’ efficiency? Do you find yourself searching for ways to add flexibility and control over your FME workflows? If so, you’re in the right place.
Join us for an insightful dive into the world of FME parameters, a critical element in optimizing workflow efficiency. This webinar marks the beginning of our three-part “Essentials of Automation” series. This first webinar is designed to equip you with the knowledge and skills to utilize parameters effectively: enhancing the flexibility, maintainability, and user control of your FME projects.
Here’s what you’ll gain:
- Essentials of FME Parameters: Understand the pivotal role of parameters, including Reader/Writer, Transformer, User, and FME Flow categories. Discover how they are the key to unlocking automation and optimization within your workflows.
- Practical Applications in FME Form: Delve into key user parameter types including choice, connections, and file URLs. Allow users to control how a workflow runs, making your workflows more reusable. Learn to import values and deliver the best user experience for your workflows while enhancing accuracy.
- Optimization Strategies in FME Flow: Explore the creation and strategic deployment of parameters in FME Flow, including the use of deployment and geometry parameters, to maximize workflow efficiency.
- Pro Tips for Success: Gain insights on parameterizing connections and leveraging new features like Conditional Visibility for clarity and simplicity.
We’ll wrap up with a glimpse into future webinars, followed by a Q&A session to address your specific questions surrounding this topic.
Don’t miss this opportunity to elevate your FME expertise and drive your projects to new heights of efficiency.
FIDO Alliance Osaka Seminar: Passkeys at Amazon.pdf
Train a foundation model into a domain-specific language model
HOW TO TRAIN AN OPEN-SOURCE FOUNDATION MODEL INTO A DOMAIN-SPECIFIC LLM?
In the fast-paced world of corporate environments, technological innovations have always been a catalyst, driving business growth with remarkable velocity. One such technological advancement that has lately become a talking point and brought about a paradigm shift is generative AI. Today, every industry is being profoundly influenced by generative AI, which has heralded the dawn of a new era defined by the widespread use of Large Language Models (LLMs). Corporate entities across diverse sectors are increasingly recognizing the immense potential of customized large language models to enhance their business operations in unprecedented ways.
Businesses are progressively pivoting away from generic, universal models due to growing needs for enhanced control, greater data privacy, and cost-efficient methodologies. As organizations increasingly leverage the power of customization, they open the doors to a world where intelligent systems stand poised to revolutionize everything from innovation and efficiency to the very core of their operational processes.
While general-purpose LLMs are trained on large and diverse datasets, domain-specific LLMs are specifically designed for a particular domain and trained on exclusive datasets that cater to an organization’s specific requirements. General-purpose LLMs can be useful for many businesses, but domain-specific LLMs are more suitable for situations that require accurate and contextually appropriate outputs within a specific domain. These models, which often outperform extensive models like GPT-3.5 that serve as the basis for ChatGPT, are highly targeted and potentially more efficient. This significant change is further propelled by the availability of numerous open-source foundation models, making it feasible for businesses to craft their personalized LLMs to meet their unique demands and achieve optimal results.
Domain-specific LLMs are witnessing widespread adoption as they can adapt to specific requirements, utilize available resources more efficiently, and deliver powerful language processing capabilities that address specific challenges and tasks. They are poised to bring about a new era of advanced language processing solutions with enhanced adaptability, resourcefulness, and potency.
This article provides detailed insights into domain-specific LLMs, covering aspects like their benefits, the challenges involved in their development, and how to create an industry-specific LLM by leveraging the potential of an open-source foundation model. All this will help you understand the intricacies of this fascinating progression in AI.
What are foundation models?
What are LLMs?
What are domain-specific LLMs?
What are the benefits of using domain-specific LLMs?
Challenges of building domain-specific custom LLMs
A detailed case study of BloombergGPT, a finance-specific LLM
What tasks can BloombergGPT perform?
The architecture of a domain-specific LLM like BloombergGPT
How was BloombergGPT trained?
How are ethical and privacy issues handled in BloombergGPT?
How to train an open-source foundation model into a domain-specific LLM?
What are foundation models?
[Figure: Foundation Models (FMs) encompass Large Language Models (LLMs), e.g., ChatGPT, Chinchilla, GPT-3. Key traits: pretrained, generalized, adaptable, large, self-supervised. Source: LeewayHertz]
The term “Foundation Models” was put forward by Stanford researchers to describe a new breed of machine learning models. Rather than being designed for a specific task such as image recognition, these models are trained on extensive, diverse datasets using self-supervised learning at scale, allowing them to be fine-tuned for various downstream tasks.
Contrary to what the name might suggest, Foundation Models (FMs) are not the bedrock of AI, nor are they suggestive of AGI (Artificial General Intelligence). These models are unique and remarkable in their own right, characterized by five main traits:
1. Pre-training: FMs are pre-trained with vast data and substantial computing power, making them ready for use without further training.
2. Generalization: Unlike traditional AI models, which are task-specific, FMs are versatile and designed to tackle numerous tasks.
3. Adaptability: FMs are adjustable through prompting, meaning they respond to user-defined inputs like text.
4. Large-scale: FMs are large in both model size and data size. For instance, GPT-3 boasts 175 billion parameters and was trained on roughly 500 billion words.
5. Self-supervision: FMs learn from data patterns without explicit labels.
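The self-supervision trait can be made concrete: the training targets are derived from the raw text itself, with no human labels. Here is a minimal illustrative sketch that builds next-token prediction pairs, using a naive whitespace tokenizer in place of the subword tokenizers real models use:

```python
def next_token_pairs(text):
    """Build (context, target) training pairs from raw text alone.

    The "label" for each position is simply the next token, so no
    human annotation is needed -- this is self-supervised learning.
    """
    tokens = text.split()  # real models use subword tokenizers instead
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("large language models predict the next token")
# Each pair: (all tokens seen so far, the token the model must predict)
```

Every sentence of raw text thus yields many training examples for free, which is what makes training on hundreds of billions of words feasible.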
Prominent examples of FMs include GPT-3 and DALL-E 2. These models enable everyday users and non-developers to carry out remarkable tasks by providing simple “prompts.”
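Adaptability through prompting means one frozen model can be pointed at different tasks purely by how the input is phrased. A minimal sketch of such prompt templates (the task names and wording are illustrative, not any particular provider’s API):

```python
# Hypothetical prompt templates: the same frozen foundation model handles
# different tasks depending only on how the input text is framed.
TEMPLATES = {
    "summarize": "Summarize the following text in one sentence:\n{text}",
    "translate": "Translate the following text to French:\n{text}",
    "classify":  "Label the sentiment of this text as positive or negative:\n{text}",
}

def build_prompt(task, text):
    """Wrap user text in the instruction for the chosen task."""
    return TEMPLATES[task].format(text=text)

prompt = build_prompt("classify", "Revenue beat expectations this quarter.")
```

No weights change between tasks; only the instruction wrapped around the input does.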
Once trained, FMs can manage a plethora of data types and tasks, making them highly adaptable. They can automate complex workflows when paired with an appropriate operational chain. Beyond generating content such as text, images, audio, and video, FMs can also be employed for prediction and classification tasks, a concept known as discriminative modeling.
FMs have initiated a transformative era in knowledge work, fostering improvements in language-related tasks, reasoning, search, and computer vision, especially when combined with other machine learning approaches like Reinforcement Learning from Human Feedback. The development of multimodal FMs, such as CLIP and ViLBERT, is underway, and these models could significantly enhance robotics. By using approaches such as few-shot learning and domain specialization, FMs can address a multitude of challenges.
Despite their remarkable potential, FMs do have their limitations. These include fabricating answers (known as hallucination), issues with temporal shifting or continual adaptation, the requirement of large datasets and considerable computing resources for training, and the need for human evaluation for fine-tuning. Other hurdles include the tension between specialization and diversity in FM training data, uncertainty about optimal adaptation methods, and limited understanding of the model’s inner workings: what a model can do, why it produces certain behaviors, and how it accomplishes this.
Leverage LeewayHertz’s expertise in domain-specific LLMs!
We train your preferred foundation model on your proprietary data to create a domain-specific LLM that aligns perfectly with your business needs.
Learn More
What are LLMs?
Large Language Models (LLMs) represent a subclass of Foundation Models designed to comprehend and generate text. By ingesting immense amounts of textual data and boasting billions of parameters, these models can perform a variety of tasks based on a user-provided prompt. Functions we encounter daily, such as autocomplete in search engines or Smart Compose in email platforms, are real-world applications of these language models. They excel at predicting the most likely subsequent word or phrase based on the provided text.
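The “predict the most likely next word” behavior can be demonstrated with a toy bigram model, a drastic simplification of the transformer-based LLMs described below, but driven by the same predict-the-continuation objective:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count, for each word, which word follows it in the training text."""
    counts = defaultdict(Counter)
    words = corpus.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def predict_next(counts, word):
    """Return the most frequent continuation seen during training."""
    return counts[word].most_common(1)[0][0]

model = train_bigram(
    "the market closed higher . the market closed lower . the market rallied"
)
predict_next(model, "market")  # "closed" follows "market" twice, "rallied" once
```

An LLM does the same thing with a learned probability distribution over tens of thousands of subword tokens, conditioned on the entire preceding context rather than a single word.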
To understand LLMs, we should recognize that they are built upon advanced neural networks known as transformers. These transformers extract patterns from vast amounts of text data, ranging from strings to numbers to code. It is worth noting that the creation and utilization of LLMs involve substantial hurdles: massive datasets, considerable computational power, and specialized skills are prerequisites for training, fine-tuning, and extracting insights from these models. This explains why only a few models, like BERT and GPT-3, are readily accessible and why many models remain closed-source.
LLMs should not be conflated with Artificial General Intelligence (AGI) despite their potential. Current LLMs don’t inherently comprehend concepts and abstractions. They have limitations but are still the most effective tools in Natural Language Processing (NLP) for a broad range of tasks. Their potential applications span from coding to assisting with task completion to composing coherent and engaging narratives.
The output of an LLM typically undergoes rigorous filtering before a response is generated. This process is integral to ensuring the generated content is factual and safe, which is crucial considering the risks associated with these models. As businesses aim to leverage LLMs for specific products or systems, a high degree of oversight and governance becomes essential.
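The filtering stage described above can be sketched as a small post-processing pipeline. The blocked-term list and the replacement message here are made-up placeholders; production systems chain trained safety classifiers and fact-checking services rather than keyword rules:

```python
# Illustrative output filter: each check inspects the raw model output
# before it reaches the user. Real systems use trained classifiers
# (safety, policy, factuality) instead of a simple deny-list.
BLOCKED_TERMS = {"ssn", "password"}  # placeholder deny-list

def safety_check(text):
    """Return True if no blocked term appears in the text."""
    return not any(term in text.lower() for term in BLOCKED_TERMS)

def filter_output(raw_output, checks=(safety_check,)):
    """Pass the raw output through every check; withhold it on failure."""
    if all(check(raw_output) for check in checks):
        return raw_output
    return "[response withheld by safety filter]"

filter_output("The quarterly report is attached.")  # passes through unchanged
filter_output("Here is the user's password: ...")   # withheld
```

Adding a new policy is then a matter of appending another check function to the chain.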
The historical journey of LLMs can be traced back to the early 2010s. Initially, transfer learning with annotated or labeled data was a prevalent practice. However, this approach was costly, challenging to implement consistently, and posed privacy concerns. The advent of transformer-based models around 2017 marked a significant stride, with models like GPT-3 and ChatGPT making their debut and gaining swift popularity.
In the post-ChatGPT world, LLMs continually show promise for various applications. They can explain complex subjects in simple terms, translate languages, generate creative content, brainstorm, assist with online tasks, provide customer service, and even perform technical functions. However, they are not without flaws. The reliability of answers and the risk of fabricated responses, often referred to as “model hallucination,” highlight the need for further refinement of these models.
Thus, while LLMs have transformative potential and are reshaping the complex AI landscape, they require careful management and continual refinement to ensure they realize their full potential without compromising on safety and accuracy.
[Figure: LLM stack, simple view — layers of a large language model: training and compute produce the raw model output; filters (fact checking, safety, policy, etc.) sit above it; then fine-tuning (with labeled data), refined with human feedback via reinforcement learning and other approaches; inference based on prompts at the top. Source: LeewayHertz]
What are domain-specific LLMs?
A domain-specific language model constitutes a specialized subset of large language models dedicated to producing highly accurate results within a particular domain. Unlike a generalized LLM, which aims to process a diverse range of topics with satisfactory proficiency, a domain-specific language model focuses on optimizing its performance within a predefined sector, which could range from technical domains such as law or medicine to more casual areas like culinary arts.
The methodology of training a domain-specific language model hinges on acquiring a substantial volume of domain-specific data. This data serves as the training set, ensuring the model is immersed in the contextual intricacies and specific knowledge pertinent to the chosen domain.
There are two primary strategies to undertake domain-specific pre-training. The first strategy involves initializing a pre-trained LLM and fine-tuning it with the domain-specific data to adapt its capabilities to the new context. This method typically allows faster convergence and often superior performance. The alternative strategy is to construct a new LLM from scratch using the domain-specific data. The optimal strategy may vary depending on the application and the domain in question.
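The claim that initializing from a pre-trained model typically allows faster convergence can be illustrated with a toy one-parameter regression: gradient descent started from a weight already near the optimum (a stand-in for pre-trained weights) reaches a given tolerance in far fewer steps than a cold start. This is a cartoon of the dynamics, not a model of real LLM training:

```python
def steps_to_converge(w_init, lr=0.1, target=3.0, tol=1e-3, max_steps=10_000):
    """Gradient descent on loss (w - target)^2; count steps until |w - target| < tol."""
    w = w_init
    for step in range(max_steps):
        if abs(w - target) < tol:
            return step
        w -= lr * 2 * (w - target)  # gradient of (w - target)^2 is 2(w - target)
    return max_steps

scratch_steps = steps_to_converge(w_init=0.0)   # "train from scratch"
finetune_steps = steps_to_converge(w_init=2.8)  # "start from pre-trained weights"
# finetune_steps < scratch_steps: the closer start converges sooner
```

In real training the gap is far larger, because pre-training has already paid the enormous cost of learning general language structure.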
However, this specialization entails a fundamental trade-off. A domain-specific language model, such as a model trained exclusively on culinary recipes, can comprehend the subtleties of its specialized domain more proficiently than a generalized model. Nevertheless, its performance deteriorates significantly when faced with tasks outside its trained domain. On the other hand, a generic LLM possesses broader knowledge, albeit without the fine-grained expertise of a domain-specific language model.
This difference is similar to what we see in real-life expertise. It is exceedingly rare to encounter a professional chef who can also provide proficient stock market analysis or a mathematician who doubles as an accomplished artist. Correspondingly, developing an LLM that can seamlessly transition between domains while maintaining a high degree of proficiency is a challenge. The trade-off between breadth and depth of knowledge is intrinsic to the design of language models, necessitating judicious consideration in the development process.
Here is a table comparing general LLMs and domain-specific LLMs:

Criteria | General LLMs (e.g., GPT-3) | Domain-specific LLMs (e.g., FoodUDT-1B)
Purpose | Developed to understand and generate text across a broad range of topics and contexts. | Specifically designed to understand and generate text in a particular niche or domain.
Training Data | Trained on diverse internet text, encompassing a wide range of topics. | Trained on domain-specific data (in this case, 500K recipes).
Knowledge Depth | Has a general understanding of a wide range of topics, including domain-specific topics at a high level. | Has a deep understanding of a specific domain.
Representation of Context | Represents text and context based on a broad understanding of language and world knowledge. For example, “apple” is closer to “iPhone” than to “apple pie.” | Represents text and context based on deep, specialized knowledge. For example, “apple” is closer to “apple pie” than to “iPhone.”
Word Associations | Forms associations based on generic context. For example, “meat,” “sugar,” and “dessert” are equally related. | Forms associations based on domain-specific context. For example, “dessert” and “sugar” are more related than “meat” and “sugar.”
Advantages | Versatile and can handle a wide range of tasks and queries. | Highly accurate and efficient in its specific domain due to targeted training.
Limitations | May not possess in-depth understanding or generate accurate outputs in specialized domains. | Limited to its specific domain and may not generate accurate outputs outside that domain.
Use Cases | General conversational AI, summarization, translation, and other generic tasks. | Highly specialized tasks within the domain (in this case, recipe search and suggestions).
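The word-association contrast can be mimicked with a toy co-occurrence model: the same similarity measure, trained on a recipe corpus, pulls “sugar” toward “dessert” and away from “meat.” The corpus below is a made-up miniature, so only the direction of the effect is meaningful:

```python
from collections import Counter

def cooccurrence(corpus, window=2):
    """Build a co-occurrence vector (a Counter of nearby words) per word."""
    words = corpus.split()
    vectors = {}
    for i, w in enumerate(words):
        ctx = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        vectors.setdefault(w, Counter()).update(ctx)
    return vectors

def similarity(vectors, a, b):
    """Cosine similarity between two co-occurrence vectors."""
    va, vb = vectors[a], vectors[b]
    dot = sum(va[k] * vb[k] for k in va)
    na = sum(v * v for v in va.values()) ** 0.5
    nb = sum(v * v for v in vb.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

recipe_corpus = (
    "whisk sugar into the dessert batter . fold sugar through the dessert cream . "
    "sear the meat with salt . rest the meat before carving"
)
vecs = cooccurrence(recipe_corpus)
# In this recipe-trained toy model, "sugar" sits closer to "dessert" than to "meat"
```

A general-corpus version of the same code would shift the associations, which is exactly the effect the table describes.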
It is important to remember that while domain-specific LLMs can provide better performance within their domain, general LLMs are more versatile and can handle a wider variety of tasks. The choice between the two depends on the specific needs and objectives of the application.
What are the benefits of using domain-specific LLMs?
In the rapidly evolving field of artificial intelligence, domain-specific language models have emerged as a critical tool for enhancing task proficiency and model efficiency. Tailored to excel in specific fields, these models leverage the extensive knowledge base of large language models and fine-tune it to provide superior generalization capabilities and augmented interpretability in specialized domains.
Here are some benefits:
Enhanced task proficiency
Large language models, ingrained with extensive textual data, demonstrate a comprehensive understanding of language and an abundance of general knowledge, serving as an excellent springboard for domain-specific fine-tuning. While an LLM is suitable for generalized linguistic tasks, it may not perform optimally in specialized domains such as healthcare. Leveraging an LLM as a foundational model and fine-tuning it for specific domains can considerably improve task-related performance.
Amplified efficiency
Domain-specific models are highly tailored to execute a particular set of tasks and concentrate solely on the information relevant to those tasks. This specificity reduces computational overhead and processing requirements, thereby enhancing model efficiency and expediting task completion.
Superior generalization capabilities
Training models to comprehend the particular language and domain-specific knowledge allows them to generalize more proficiently to novel instances within the domain. This becomes a crucial asset when dealing with limited datasets or when the domain incorporates its own unique terminology and concepts. This ability is especially advantageous in rapidly evolving fields such as precision medicine in healthcare.
Augmented interpretability
The interpretability of domain-specific models is notably enhanced due to their task-specific focus and domain-specific training data. These models yield more relevant outcomes, regardless of the task, be it chat, text search, prediction, or information extraction, thus making them more comprehensible and meaningful.
Challenges of building domain-specific custom LLMs
Creating custom large language models introduces various challenges for
organizations, broadly segmented into data-, technical-, ethical-, and
resource-related issues.
Data challenges
Organizations striving to establish custom LLMs grapple with data-associated challenges, including data acquisition, quality, and privacy. The procurement of substantial, niche-specific data could be demanding, especially when dealing with specialized or confidential data. Ensuring data integrity during collection is paramount, and so is addressing data privacy and protection, which requires deploying effective measures to anonymize data and safeguard it throughout the training and deployment phases.
Technical challenges
Technical hurdles in the development of custom LLMs revolve around aspects such as model architecture, training, assessment, and validation. Selection of suitable architecture and parameters necessitates specialized knowledge, while training custom LLMs requires advanced competencies in machine learning. Evaluation becomes complex owing to the lack of established benchmarks for niche-specific tasks, and accuracy, safety, and compliance validation of model responses present additional challenges.
Ethical challenges
It is essential to confront ethical issues related to bias, fairness, content moderation, and safety while creating custom LLMs. LLMs may inadvertently incorporate and perpetuate biases from training data, making meticulous auditing and mitigation strategies a necessity. Robust content moderation mechanisms must be in place to prevent potentially inappropriate or harmful content generated by custom LLMs.
Resource challenges
The development of custom LLMs poses resource-related issues, predominantly in terms of computational resources and specialized skills. Training LLMs necessitates substantial computational resources, which could be expensive and inaccessible for some organizations. It also requires a team proficient in machine learning, natural language processing, and software engineering, which could be challenging to recruit and retain, thereby increasing the complexity and cost of the project.
Despite the inherent challenges, they are not unconquerable. Organizations can successfully develop and deploy custom LLMs with strategic planning, apt resources, and expert guidance to meet their unique requirements. As the market begins to witness the advent of commercially viable open-source foundation models, the trend of building domain-specific LLMs using these models is set to intensify.
In the following sections, we describe the well-known BloombergGPT model. By providing an overview of BloombergGPT, we help you understand the process, techniques, and strategies to be employed while developing your own domain-specific LLM using a foundation model.
A detailed case study of BloombergGPT, a finance-specific LLM
Bloomberg has unveiled BloombergGPT, a state-of-the-art large language model, primed with extensive financial data for superior performance in financial sector-specific NLP tasks. This powerful AI tool expedites the evaluation of financial data, assists in risk assessments, measures financial sentiment, and could potentially automate accounting and auditing functions.
BloombergGPT is a highly specialized language model developed for the financial domain. It’s a sophisticated construct, equipped with 50 billion parameters. The model’s training utilized a considerable volume of data: 363 billion tokens sourced from Bloomberg’s unique databases and an additional 345 billion tokens derived from generalized datasets. This combination of financial and general-purpose data has resulted in a model that performs exceptionally well in its specialized domain. BloombergGPT has demonstrated substantial improvements over existing models when tackling financial tasks. Notably, despite its specialization, the model sustains commendable performance levels on standard large language model benchmarks, asserting its versatile applicability.
According to Bloomberg, the financial industry’s intricate nature necessitates an AI specifically trained on financial data. In this context, BloombergGPT has been designed to interact with the Bloomberg Terminal, a software platform that provides real-time market data, breaking news, in-depth financial research, and sophisticated analytics to financial professionals.
BloombergGPT is a pioneering stride towards leveraging this novel technology for the financial sector. It is projected to enhance Bloomberg’s existing NLP functionalities, including sentiment analysis, named entity recognition, news classification, and question answering. Moreover, the model’s capability to structure the vast data accessible on the Bloomberg Terminal can significantly improve client services and unlock AI’s immense potential within the financial industry.
Bloomberg’s research emphasizes that while general models eliminate the need for specialized training due to their wide-ranging capabilities, results from existing domain-specific models suggest that such models are irreplaceable. Despite the financial domain being Bloomberg’s primary focus, the company supports diverse tasks that benefit from a general model. Hence, Bloomberg aimed to construct a model like BloombergGPT, which performs commendably on general-purpose LLM benchmarks while also excelling in financial tasks.
BloombergGPT stands as a potent Large Language Model (LLM), specifically optimized for financial Natural Language Processing (NLP) tasks. This optimization is achieved by integrating both domain-specific and general-purpose data during its training phase.
Bloomberg Query Language (BQL) functions as a sophisticated query language, enabling access to and analysis of financial data within Bloomberg’s platform. BQL, although a powerful tool, presents complexities and can be applied to diverse tasks including data search, data analysis, report generation, and insight derivation. BloombergGPT streamlines this process by converting natural language queries into valid BQL, thereby facilitating more intuitive interactions with financial data.
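Such natural-language-to-BQL translation is commonly framed as guarded generation: the user’s question is wrapped in a fixed instruction, and the model’s reply is validated before it is ever executed. The prompt wording and the validation rule below are illustrative assumptions; Bloomberg’s actual prompts and pipeline are not public in detail:

```python
def bql_translation_prompt(question):
    """Wrap a user question in an instruction asking for a BQL query.

    The instruction text is a hypothetical example, not Bloomberg's
    actual prompt; the model's reply would be validated before running.
    """
    return (
        "Translate the analyst's question into a single valid BQL query.\n"
        "Return only the query, with no explanation.\n"
        f"Question: {question}\n"
        "BQL:"
    )

def looks_like_single_query(model_reply):
    """Cheap illustrative guard: accept only a single non-empty line."""
    lines = [ln for ln in model_reply.strip().splitlines() if ln.strip()]
    return len(lines) == 1

prompt = bql_translation_prompt("What was Apple's closing price last Friday?")
```

A real deployment would replace the guard with a proper BQL parser, so that only syntactically valid queries reach the data platform.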
Additionally, BloombergGPT holds potential applications in news settings, serving as a tool for suggesting news headlines. This functionality proves valuable for news applications and assists journalists in curating newsletters. The model accepts paragraphs as input and proposes an appropriate title in return. This capability enhances the ease and speed of news article creation, contributing further to BloombergGPT’s utility in the areas of finance and news.
BloombergGPT has demonstrated competitive performance compared to GPT-3 and other LLMs on general tasks, outpacing them in several finance-specific tasks. The introduction of BloombergGPT represents a significant breakthrough in AI-driven financial analysis, amplifying the application scope of NLP within the financial technology landscape.
What tasks can BloombergGPT perform?
Financial tasks in natural language processing
When BloombergGPT approaches financial tasks within the NLP discipline, it handles them similarly to general NLP tasks. However, these financial tasks carry unique features and challenges that differentiate them from their general counterparts. A prime example of this dynamic can be seen in sentiment analysis.
Sentiment analysis is a common NLP task which aims to identify the tone or emotional context of a given text. Take, for instance, a headline such as “COMPANY to cut 10,000 jobs”. In general, this would typically suggest a negative sentiment, as it implies job loss and potential hardship for many individuals. However, when seen through the lens of financial context, the sentiment can be perceived differently.
In the world of finance, job cuts often indicate a move towards operational efficiency by a company. Such a move can increase the company’s stock price or bolster investor confidence in the company. This clearly demonstrates the multifaceted nature of sentiment analysis when applied within the financial domain, as understood and implemented by BloombergGPT.
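The headline example can be mimicked with two keyword lexicons: the same scoring function labels “COMPANY to cut 10,000 jobs” negative under a general lexicon but positive under a finance-oriented one, where cost cutting reads as an efficiency signal. Real systems learn these associations from domain data rather than hand-written word lists, so treat this purely as an illustration of the domain shift:

```python
# Illustrative hand-built lexicons; a trained model would learn these
# word-level associations from general vs. financial training data.
GENERAL_LEXICON = {"cut": -1, "loss": -1, "growth": +1}
FINANCE_LEXICON = {"cut": +1, "efficiency": +1, "loss": -1, "growth": +1}

def sentiment(headline, lexicon):
    """Sum per-word scores and map the total to a sentiment label."""
    score = sum(lexicon.get(word, 0) for word in headline.lower().split())
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

headline = "COMPANY to cut 10,000 jobs"
sentiment(headline, GENERAL_LEXICON)  # negative under the general reading
sentiment(headline, FINANCE_LEXICON)  # positive under the finance reading
```

The two lexicons are a stand-in for the different co-occurrence statistics a general corpus and a financial corpus would impose on the word “cut.”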
External financial benchmarks
When it comes to external financial benchmarks, BloombergGPT utilizes a variety of resources. These include four tasks drawn from the FLUE benchmark as well as the ConvFinQA dataset. The performance of large language models on these financial tasks is a relatively unexplored territory, making the assessment complex.
BloombergGPT operates in a unique fashion, utilizing a few-shot learning approach when dealing with these tasks. This technique involves presenting the model with a small number of examples, or “shots”, in its prompt. The purpose behind this method is to optimize the average performance across all models.
This approach demonstrates how BloombergGPT addresses the challenges of evaluating performance on tasks where there isn’t a broad consensus or standard testing framework. Despite the intricate nature of these tasks, BloombergGPT remains committed to performing efficiently, abiding by its main principle of choosing a number of shots that maximizes average performance.
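The few-shot setup described above amounts to prepending a handful of labeled exemplars to the query, so the model infers the task format from them. A minimal sketch of how such a prompt might be assembled (the exemplars are invented):

```python
def few_shot_prompt(examples, query):
    """Prepend labeled exemplars ("shots") so the model can infer the
    task format, then append the unlabeled query for it to complete."""
    shots = "\n".join(f"Headline: {text}\nSentiment: {label}"
                      for text, label in examples)
    return f"{shots}\nHeadline: {query}\nSentiment:"

examples = [
    ("Shares surge after record earnings", "positive"),
    ("Regulator fines bank over disclosure failures", "negative"),
]
prompt = few_shot_prompt(examples, "Company raises full-year guidance")
```

Varying the number of shots simply changes how many exemplars are prepended, which is the knob the evaluation above tunes for average performance.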
Internal task: Sentiment analysis
When dealing with internal tasks, BloombergGPT focuses heavily on aspect-specific sentiment analysis, a practice common in financial literature. The process involves a methodical approach, structured around a distinct discovery phase.
The discovery phase is instrumental in setting up the groundwork for sentiment analysis. It establishes the annotation and sampling procedures, which are essential components of the process. Additionally, this phase helps to gauge the required number of annotators for each example.
The training requirements for the annotators are also determined during this discovery phase. BloombergGPT relies on this structured procedure to maintain a consistent and effective approach to sentiment analysis, ensuring accurate and relevant results. This method underscores BloombergGPT’s commitment to performing complex tasks with diligence and precision.
Exploratory task: Named Entity Recognition (NER)
Named Entity Recognition (NER) is a well-established task in the broader scope of natural language processing. However, this area has not been deeply investigated when it comes to generative language models. Recognizing the relevance of NER in the financial sector, BloombergGPT embarked on an exploratory journey into this task, aiming to uncover new insights and approaches.
The team behind BloombergGPT reported preliminary results from their exploration into NER. As part of this endeavor, they utilized seven distinct NER datasets originating from various domains within Bloomberg’s internal resources.
By venturing into this relatively uncharted territory within generative LLMs, BloombergGPT is enhancing its capabilities and contributing valuable research and findings to the broader field of natural language processing.
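NER output is conventionally represented with BIO tags over tokens (B- beginning an entity, I- inside it, O outside). A toy illustration of the format, with a hand-written gazetteer rule standing in for a trained model:

```python
def bio_tag(tokens, known_orgs):
    """Toy NER: tag known multi-word organization names with B-ORG/I-ORG.

    `known_orgs` is a lookup list standing in for a trained model's
    learned entity recognition.
    """
    tags = ["O"] * len(tokens)
    for org in known_orgs:
        parts = org.split()
        for i in range(len(tokens) - len(parts) + 1):
            if tokens[i:i + len(parts)] == parts:
                tags[i] = "B-ORG"                      # entity start
                for j in range(1, len(parts)):
                    tags[i + j] = "I-ORG"              # entity continuation
    return tags

tokens = "Bloomberg L.P. reported quarterly revenue growth".split()
bio_tag(tokens, known_orgs=["Bloomberg L.P."])
# → ["B-ORG", "I-ORG", "O", "O", "O", "O"]
```

Evaluating a generative LLM on NER then means coaxing it to emit exactly this kind of structured tagging, which is part of why the task is less explored for generative models.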
Evaluation on standard general-purpose NLP tasks
The performance of BloombergGPT was critically examined on a variety of
standard general-purpose NLP tasks. This evaluation process utilized BIG-
bench Hard, representing a subset of the most demanding tasks within the
broader BIG-bench benchmark.
Although the main focus of BloombergGPT is distinctly finance-oriented tasks, the inclusion of general-purpose training data in its learning process was considered essential. This approach is based on the possibility that such comprehensive data might augment BloombergGPT’s efficacy. This could not only enhance its capabilities in dealing with financial tasks but could potentially improve its performance on conventional NLP tasks as well.
Therefore, even as BloombergGPT is fundamentally finance-focused, it appreciates the value of well-rounded NLP competence, actively seeking to excel in not just specialized but also universal NLP tasks.
Knowledge assessments
BloombergGPT’s knowledge capacity, speci몭cally its capability to recall
information learned during its training phase, was put under the microscope
by the research team. This evaluation was conducted through various
scenarios where the model was required to provide answers to posed
questions, without the aid of additional context or resources.
These scenarios comprised multiple-choice questions, along with
benchmarks based on reading comprehension. Thus, the assessment was
designed to evaluate not just the factual knowledge retained by
BloombergGPT during its training, but also its ability to interpret, understand
and utilize that knowledge effectively. It provided valuable insight into
BloombergGPT’s understanding and recall capabilities, which are critical
for any language model, especially in the financial domain, where
accuracy and reliability of information are paramount.
Linguistic tasks
In the evaluation of BloombergGPT, certain tasks were not directly linked to
end-user applications. These were essentially linguistic tasks, put in place to
measure the model’s proficiency in understanding language at its core.
Such tasks scrutinized BloombergGPT’s capability in tackling disambiguation,
parsing grammar, and understanding entailment. These areas are
fundamental to language understanding and were chosen to directly assess
BloombergGPT’s linguistic competence. By examining these abilities, the
team aimed to ensure that the model could adeptly navigate language
complexities, a skill that proves invaluable even in its primary focus
area of financial language processing.
The architecture of a domain-specific LLM like
BloombergGPT
The creation of BloombergGPT was a joint endeavor between the Machine
Learning (ML) Product and Research group and the AI Engineering team at
Bloomberg. Their objective was to compile one of the largest domain-specific
datasets to date. This involved leveraging Bloomberg’s vast data creation,
collection, and curation resources. The team utilized Bloomberg’s expansive
archive of financial data, assembling a comprehensive dataset of 363 billion
tokens composed of English financial documents. This corpus was further
augmented with a public dataset of 345 billion tokens, resulting in a
combined training corpus surpassing 700 billion tokens.
To operationalize this vast dataset, the team elected to train a 50-billion
parameter decoder-only causal language model. This type of model is adept
at generating text in a sequential manner, making it ideal for processing
natural language data. The resulting model was then tested on several
validation sets: finance-specific natural language processing benchmarks,
Bloomberg’s internal benchmarks, and a variety of generic NLP tasks drawn
from widely recognized benchmarks.
BloombergGPT demonstrated exceptional performance, surpassing existing
open-source models of comparable size on finance-specific tasks by
substantial margins. Importantly, this specialization did not impede its
ability to tackle broader tasks: the model also matched or exceeded the
performance of comparable models on general NLP benchmarks. This
underscores the potential of BloombergGPT to enhance financial NLP tasks
without sacrificing its utility for more general applications.
BloombergGPT’s architecture is based on a language model called BLOOM,
which operates as a causal or “decoder-based” model. “Causal” essentially
means that the model generates text sequentially, predicting the next word
in a sequence based on the words that came before it.
The core of the BLOOM model consists of 70 layers of transformer decoder
blocks. Here’s a simplified explanation of what this means:
Transformer decoder blocks: These are a type of model structure used in
natural language processing. They are designed to handle sequential data,
like sentences, where the order of the elements is important. The
transformer decoder block uses the following components:
SA (Self-attention): This mechanism helps the model pay varying
amounts of ‘attention’ to different words in the input when generating
the output. It allows the model to focus on the important words and
understand the context better.
LN (Layer Normalization): This is a technique to stabilize and speed up
the learning process of the model. It ensures that the calculations the
model performs stay within a manageable range, preventing numerical
issues.
FFN (Feed-forward Network): This is a type of artificial neural network
where information moves in one direction – from input to output. It
does not have any cycles or loops, and each layer of neurons is fully
connected to the next layer.
The model also ties the input token embeddings (the numerical
representation of words) to the linear mapping that occurs before the final
output is decided. This step helps reduce the model’s complexity and can
improve the efficiency of the model’s learning process.
Finally, an extra layer normalization step is added immediately after the
token embeddings. Together with the normalization applied within the
first transformer block, this means the embeddings pass through two
consecutive layer normalization steps, which improves the model’s
stability and assists in efficient learning.
In summary, the architecture of the BloombergGPT model is designed to
efficiently learn from and generate text data, with several features intended
to enhance the model’s stability and learning speed.
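To make the structure concrete, here is a minimal, single-head sketch of a decoder block in NumPy: layer normalization, causally masked self-attention, and a feed-forward network, each wrapped in a residual connection. This is purely illustrative, not the actual implementation (the real model uses 40 learned attention heads, GELU, and a hidden size of 7,680):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN: rescale each token's features to zero mean and unit variance
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def causal_self_attention(x):
    # single illustrative head with queries = keys = values = x (no learned weights)
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    scores[np.triu_indices(T, k=1)] = -np.inf      # mask future tokens: "causal"
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)      # softmax over visible positions
    return weights @ x

def ffn(x, W1, W2):
    # feed-forward network: information flows one way through a single hidden layer
    return np.maximum(x @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                        # 4 tokens, 8-dim embeddings
W1, W2 = rng.normal(size=(8, 32)), rng.normal(size=(32, 8))
out = x + causal_self_attention(layer_norm(x))     # residual around SA
out = out + ffn(layer_norm(out), W1, W2)           # residual around FFN
print(out.shape)                                   # (4, 8): shape is preserved
```

The causal mask is what makes the model "decoder-only": each position can attend only to itself and earlier positions, which is exactly the property needed for left-to-right text generation.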
How was BloombergGPT trained?
Step 1: Constructing the FinPile dataset
The research team responsible for the creation of BloombergGPT made use
of an extensive and diverse dataset titled “FinPile” for its training process.
FinPile comprises various English financial documents from multiple sources,
namely news articles, filings, press releases, and social media content, all
from the extensive Bloomberg archives.
This wealth of financial data includes elements like company filings, financial
news, and other information relevant to market trends. Some of this data is
publicly accessible, while other parts can only be accessed through purchase
or exclusive access via the Bloomberg Terminal. This data is meticulously
cleaned to remove any unnecessary elements such as markup, special
formatting, and templates. Moreover, each document is time-stamped,
spanning from March 1, 2007, to July 31, 2022, with data quality and volume
gradually improving over time.
While the team can’t release the FinPile dataset, they aim to share their
insights and experiences in building this highly domain-specific model. They
also intend to leverage date information in future research work.
FinPile’s financial data is paired with public data commonly used for LLM
training to create a balanced training corpus, having an equal representation
of domain-specific and general-purpose text. The team undertook a
de-duplication process for each dataset in order to maintain high data
quality, which could result in differing statistical reports compared to other
studies.
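De-duplication at this scale is typically hash-based. The sketch below is a toy illustration of the idea, not Bloomberg’s actual pipeline: it keeps the first occurrence of each whitespace-normalized document.

```python
import hashlib

def deduplicate(docs):
    """Keep the first occurrence of each document, comparing documents
    by a hash of their whitespace-normalized text."""
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

corpus = [
    "Q2 revenue rose 8%.",
    "Q2   revenue rose 8%.",         # near-duplicate: extra whitespace only
    "Shares fell after the filing.",
]
print(deduplicate(corpus))           # the near-duplicate is dropped
```

Production pipelines usually go further (fuzzy matching such as MinHash to catch near-duplicates with small edits), but the keep-first-occurrence principle is the same.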
Step 2: Tokenization
The research team behind BloombergGPT decided to employ the Unigram
tokenizer, in preference to other greedy merge-based sub-word tokenizers
like BPE or Wordpiece. This decision was based on the Unigram tokenizer’s
promising performance in prior research studies. They adopted a byte-based
approach to data handling instead of the more common practice of treating
data as a sequence of Unicode characters.
Additionally, the team employed a pre-tokenization process similar to that
used in GPT-2, which breaks the input into distinct chunks. This method
facilitates the learning of multi-word tokens, thereby increasing information
density and reducing context lengths.
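The effect of such pre-tokenization can be illustrated with a simplified regular expression that chunks text into words, numbers, and punctuation. The real GPT-2 pattern is more elaborate (it handles Unicode categories and English contractions via the `regex` package), so treat this as a sketch:

```python
import re

# simplified stand-in for GPT-2's pre-tokenization pattern
PRETOK = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^A-Za-z0-9\s]+|\s+")

def pre_tokenize(text):
    # split text into word, number, and punctuation chunks; sub-word units
    # are then learned within chunks, never across their boundaries
    return PRETOK.findall(text)

print(pre_tokenize("Shares rose 8%!"))  # → ['Shares', ' rose', ' 8', '%!']
```

Note that the leading space stays attached to a chunk, so the tokenizer can learn frequent space-prefixed tokens (effectively whole words), which is what raises information density per token.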
The team used the Pile dataset to train their tokenizer. Given its diverse
content, The Pile was considered a suitable choice for their purposes.
Handling such a large dataset required a parallel tokenizer training method,
and the team adopted a split and merge strategy for this. They trained a
Unigram tokenizer on each chunk of data and then merged them in a
hierarchical manner to create the final tokenizer.
The resulting tokenizer initially consisted of 7 million tokens, which was
further condensed to 131,072 tokens. The choice of vocabulary size was
influenced by several factors, including the need to fit more information into
the context window and considerations related to overhead. The team
arrived at 131,072 tokens following experiments with different vocabulary
sizes, selecting the one that yielded the smallest encoded representation of
the C4 dataset. Consequently, the tokenizer used for BloombergGPT is
relatively large compared to the standard vocabulary size of approximately
50,000 tokens.
Step 3: Building the model
The Bloomberg team builds the BloombergGPT model upon the structural
foundation provided by the BLOOM model. This involves utilizing a decoder-
only causal language model comprising 70 layers of transformer decoder
blocks. These blocks consist of multi-head self-attention, layer normalization,
and a feed-forward network with a single hidden layer. A GELU function is
employed as the non-linear element within the network.
Additionally, ALiBi positional encoding, or Attention with Linear Biases, is
implemented and the input token embeddings are connected to the linear
mapping preceding the final softmax application. An additional layer
normalization is executed following the token embeddings.
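ALiBi adds no learned position embeddings; instead, each attention head subtracts a fixed per-head slope times the query-key distance from its attention scores. A small sketch of that computation (the slopes follow the geometric sequence from the ALiBi paper, assuming a power-of-two head count):

```python
def alibi_slopes(n_heads):
    # per-head slopes: the geometric sequence 2^(-8/n), 2^(-16/n), ...
    # (exact for head counts that are powers of two)
    start = 2.0 ** (-8.0 / n_heads)
    return [start ** (i + 1) for i in range(n_heads)]

def alibi_bias(slope, seq_len):
    # bias added to a query's attention scores: 0 for the current token,
    # growing more negative linearly with distance to earlier tokens
    return [[slope * (k - q) for k in range(q + 1)] for q in range(seq_len)]

print(alibi_slopes(4))         # [0.25, 0.0625, 0.015625, 0.00390625]
print(alibi_bias(0.25, 3)[2])  # [-0.5, -0.25, 0.0]
```

Because the penalty depends only on distance, not absolute position, ALiBi lets the model extrapolate to sequences longer than those seen during training.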
Leveraging the scaling laws from Chinchilla, the Bloomberg team determines
the size of the BloombergGPT model. They set their total compute budget at
1.3 million GPU hours on 40GB A100 GPUs. The costs associated with
activation checkpointing are accounted for and the Chinchilla equations are
used to ascertain the optimum balance between the number of parameters
and tokens.
Despite possessing a significant dataset of approximately 700 billion tokens,
the Bloomberg team found it too limited for the optimal configuration given
their compute budget. To circumvent this constraint, they opted for the
largest possible model, having 50 billion parameters, while preserving
around 30% of the total compute budget as a safeguard for potential
challenges that might emerge.
To set the structure of the BloombergGPT model, which has 50 billion
parameters, the Bloomberg team used recommendations from Levine et al.
(2020). According to these recommendations, the ideal size (D) for the hidden
layers depends on the model’s number of self-attention layers (L). The team
chose 70 self-attention layers (L = 70) and a target size of 7510 for the hidden
layers (D = 7510).
To enhance the performance of the model’s Tensor Core operations, the
team used a design with 40 heads, each having a dimension of 192. This
resulted in a total hidden size (D) of 7680, bringing the overall parameter
count to 50.6 billion.
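These numbers can be sanity-checked with the standard back-of-the-envelope estimate for decoder-only transformers: roughly 12·L·D² parameters across the blocks, plus V·D for the tied embedding matrix. This approximation ignores biases and layer-norm parameters, but it lands on the stated figure:

```python
L, D = 70, 7680                # number of layers and hidden size
heads, head_dim = 40, 192
V = 131072                     # tokenizer vocabulary size

assert heads * head_dim == D   # 40 heads of dimension 192 give D = 7680

block_params = 12 * L * D * D  # ~4*D^2 (attention) + ~8*D^2 (FFN) per layer
embed_params = V * D           # input embeddings, tied to the output projection
total = block_params + embed_params
print(round(total / 1e9, 1))   # → 50.6 (billion parameters, matching the text)
```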
Step 4: Training the model
The BloombergGPT model, built on PyTorch, is trained following a standard
causal language modeling objective, with a directionality from left to right.
The team has strategically set the length of the training sequences at 2,048
tokens to make optimal use of GPU resources.
The AdamW optimizer is utilized in the optimization process, and the
learning rate aligns with a cosine decay schedule that includes a linear
warmup. Additionally, a batch size warmup is applied for enhanced
performance.
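Such a schedule, linear warmup followed by cosine decay, can be expressed in a few lines. The peak rate and warmup length below are illustrative placeholders, not the paper’s actual values:

```python
import math

def lr_at(step, max_steps, peak_lr, warmup_steps):
    """Linear warmup to peak_lr, then cosine decay to zero at max_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps          # linear ramp-up
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

print(lr_at(50, 1000, 1e-4, 100))    # mid-warmup: half the peak rate
print(lr_at(100, 1000, 1e-4, 100))   # warmup complete: the peak rate itself
print(lr_at(1000, 1000, 1e-4, 100))  # end of training: fully decayed
```

The warmup avoids destabilizing the freshly initialized model with a large learning rate, while the cosine tail lets the loss settle smoothly at the end of training.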
The training and evaluation stages of the model are conducted on Amazon
SageMaker, leveraging 64 instances of p4d.24xlarge, each equipped with 8
NVIDIA 40GB A100 GPUs. To ensure efficient data access and high-speed
throughput, the team employs Amazon FSx for Lustre, which supports up to
1,000 MB/s of read and write throughput per TiB of storage.
Step 5: Large-scale optimization
To train the substantial BloombergGPT model within the constraints of GPU
memory, the Bloomberg team implements a series of optimization
techniques:
ZeRO optimization (stage 3): This approach involves dividing the training
state across a group of GPUs. In the case of BloombergGPT, the team
deploys 128 GPUs for the purpose of sharding and maintains four replicas
of the model throughout the training process.
MiCS: This system is designed to lower the communication overhead and
memory requirements for training clusters in the cloud. It utilizes
hierarchical communication, a 2-hop gradient update mechanism, and a
scale-aware model partitioning scheme.
Activation checkpointing: This technique discards activations and
recalculates intermediate tensors during backward passes, as and when
necessary to minimize the memory consumed during the training process.
Mixed precision training: This strategy helps save memory by executing
forward and backward passes in BFloat16 precision, while storing and
updating the parameters in full precision (FP32).
Fused kernels: This method amalgamates multiple operations into a
singular GPU operation. The result is a reduction in peak memory usage
due to the avoidance of storage of intermediate results and an
enhancement in speed.
By applying these optimizations, the Bloomberg team successfully manages
to yield an average computational performance of 102 TFLOPs, with each
training step taking around 32.5 seconds.
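Rough arithmetic shows why this sharding is necessary. Assuming the common mixed-precision Adam accounting of about 16 bytes of training state per parameter (BF16 weights and gradients plus FP32 master weights and two Adam moments; an approximation, not figures from the paper), the full training state of a 50.6B-parameter model is far larger than one 40 GB A100:

```python
params = 50.6e9
# approximate bytes of training state per parameter:
# 2 (BF16 weights) + 2 (BF16 gradients) + 4 (FP32 master weights)
# + 4 + 4 (FP32 Adam first and second moments)
bytes_per_param = 2 + 2 + 4 + 4 + 4
total_gb = params * bytes_per_param / 1e9
print(round(total_gb))           # ~810 GB of training state in total
print(total_gb / 128)            # ~6.3 GB per GPU when sharded 128 ways
```

Sharding the state across 128 GPUs brings the per-GPU footprint down to a few gigabytes, leaving headroom for activations, which checkpointing then keeps in check.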
How are ethical and privacy issues handled in
BloombergGPT?
BloombergGPT, designed for financial applications, is developed with a
thorough focus on ethical and privacy considerations. These aspects are
meticulously addressed in the development process through a series of
well-structured measures.
Risk assessment and compliance procedures
Recognizing the sensitivity of the financial sector, Bloomberg has established
a robust risk assessment and testing process. Accuracy and factual
information are deemed crucial for the firm’s reputation and the satisfaction
of its clients, who are increasingly seeking to incorporate advanced
technology into their operations.
This process involves stringent annotation guidelines, multi-level pre-launch
reviews involving central risk and compliance units, and continuous post-
launch monitoring. All research, development, and deployment procedures
for BloombergGPT align with relevant regulations, maintaining high
standards for compliance.
Toxicity and bias control
Bloomberg exerts extraordinary effort to control any potential toxicity and
bias in its content, produced either by humans or AI models. The team
behind BloombergGPT maintains an ongoing interest in evaluating and
quantifying potential generation of harmful language in various application
contexts. Particular attention is given to studying the influence of the FinPile
corpus, which is characterized by a lower incidence of biased or toxic
language, on the model’s tendencies. As BloombergGPT is integrated into
new products, these rigorous testing procedures and compliance controls
will be applied to ensure safe usage.
Model release and openness
The question of how to release LLMs is a subject of ongoing debate in the AI
community. Given the potential for misuse of LLMs, especially ones like
BloombergGPT that are trained on a wealth of sensitive data, Bloomberg
employs a strategic approach to this issue. Various strategies are considered,
ranging from releasing trained models under specific usage licenses, to
granting API access without revealing underlying model parameters or
detailed training data.
In light of the significant risk of data leakage attacks, Bloomberg has opted
for a conservative approach, withholding the release of the BloombergGPT
model parameters. This decision is made in alignment with the company’s
mission to provide access to data collected over decades, while also ensuring
its protection from potential misuse.
Contribution to the field
Despite not releasing the model, BloombergGPT’s development journey
provides valuable insights to the broader NLP community, especially those
creating domain-specific models. The experience gained in training and
evaluating BloombergGPT can help better understand these models,
contributing to the shared knowledge in the field of NLP and LLMs.
Bloomberg acknowledges the influence of various other models and projects
in shaping the development of BloombergGPT, and continues its efforts in
contributing to this collective intelligence.
How to train an open-source foundation
model into a domain-specific LLM?
The objective here is to develop a classification model leveraging the robust
capabilities of BERT, a pre-trained large language model. To customize this
model to fit a specific domain, it will be fine-tuned using domain-specific
data, specifically the Toxic Comment Classification Challenge data, sourced
from the realm of social media.
The dataset essential for executing this project is available on Kaggle. Only
the ‘train.csv’ file among the available datasets will be utilized for this
particular case. The complete code is available on GitHub.
Let’s code!
Step 1: Loading the data set
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import transformers
from transformers import AutoModel, BertTokenizerFast
# use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# Load the dataset
df = pd.read_csv("train.csv")
# keep only the text column and the binary toxicity label, on a 6,400-row sample
df = df.sample(6400)[['comment_text', 'toxic']]
df.columns = ['text', 'label']
# print the data and take a look at it
df.head()
# check the class distribution
print(df['label'].value_counts(normalize=True))
# Split data into train and test
train_text, temp_text, train_labels, temp_labels = train_test_split(
df['text'],
df['label'],
random_state=1234,
test_size=0.3,
stratify=df['label'])
# Using temp_text and temp_labels created in the previous step, generate validation and test sets
val_text, test_text, val_labels, test_labels = train_test_split(
temp_text,
temp_labels,
random_state=1234,
test_size=0.5,
stratify=temp_labels)
# This yields 70% training data, 15% validation, and 15% test
Step 2: Tokenization
# Load the pretrained BERT model and its tokenizer
bert = AutoModel.from_pretrained('bert-base-uncased')
tokenizer = BertTokenizerFast.from_pretrained('bert-base-uncased')
# Get length of all the sentences in the data
seq_len = [len(i.split()) for i in train_text]
pd.Series(seq_len).hist(bins=100)
# Select the max_seq_len to keep, based on the distribution plotted above
max_seq_len = 100
# Tokenize and encode the sequences, padded/truncated to the max length determined above
tokens_train = tokenizer.batch_encode_plus(train_text.tolist(),
max_length=max_seq_len,
padding='max_length',
truncation=True,
return_token_type_ids=False)
tokens_val = tokenizer.batch_encode_plus(val_text.tolist(),
max_length=max_seq_len,
padding='max_length',
truncation=True,
return_token_type_ids=False)
tokens_test = tokenizer.batch_encode_plus(test_text.tolist(),
max_length=max_seq_len,
padding='max_length',
truncation=True,
return_token_type_ids=False)
# Convert tokens to tensors for the train, validation, and test sets
train_seq = torch.tensor(tokens_train['input_ids'])
train_mask = torch.tensor(tokens_train['attention_mask'])
train_y = torch.tensor(train_labels.tolist())
val_seq = torch.tensor(tokens_val['input_ids'])
val_mask = torch.tensor(tokens_val['attention_mask'])
val_y = torch.tensor(val_labels.tolist())
test_seq = torch.tensor(tokens_test['input_ids'])
test_mask = torch.tensor(tokens_test['attention_mask'])
test_y = torch.tensor(test_labels.tolist())
Step 3: Creating a data loader to prepare the data for training
from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
# define a batch size (you may want to tweak this based on your system's memory)
batch_size = 32
# wrap tensors, random sampler, and prep dataloader for train data
train_data = TensorDataset(train_seq, train_mask, train_y)
train_sampler = RandomSampler(train_data)
train_dataloader = DataLoader(train_data,
sampler=train_sampler,
batch_size=batch_size)
# wrap tensors, sequential sampler, and prep dataloader for validation data
val_data = TensorDataset(val_seq, val_mask, val_y)
val_sampler = SequentialSampler(val_data)
val_dataloader = DataLoader(val_data,
sampler=val_sampler,
batch_size=batch_size)
# freeze all the parameters by setting requires_grad = False
for param in bert.parameters():
param.requires_grad = False
Step 4: Defining the model architecture: Adding 3 layers and 1 binary
output layer to the NN
class BERT_Arch(nn.Module):
def __init__(self, bert):
super(BERT_Arch, self).__init__()
self.bert = bert
# dropout layer
self.dropout = nn.Dropout(0.1)
# relu activation function
self.relu = nn.ReLU()
# dense layer 1 (768 is fixed by BERT's hidden size; the output dimension is your choice)
self.fc1 = nn.Linear(768, 512)
# dense layer 2
self.fc2 = nn.Linear(512, 128)
# dense layer 3
self.fc3 = nn.Linear(128, 32)
# dense layer 4 (output layer, 2 classes)
self.fc4 = nn.Linear(32, 2)
#softmax activation function
self.softmax = nn.LogSoftmax(dim=1)
#define the forward pass
def forward(self, sent_id, mask):
#pass the inputs to the model
_, cls_hs = self.bert(sent_id, attention_mask=mask, return_dict=False)
# Execute 1st layer
x = self.fc1(cls_hs)
x = self.relu(x)
x = self.dropout(x)
# Execute 2nd layer
x = self.fc2(x)
x = self.relu(x)
x = self.dropout(x)
# Execute 3rd layer
x = self.fc3(x)
x = self.relu(x)
x = self.dropout(x)
# output layer
x = self.fc4(x)
# apply softmax activation
x = self.softmax(x)
return x
Step 5: Passing the pre-trained BERT to prepare the loss function
model = BERT_Arch(bert)
# push the model to GPU
model = model.to(device)
# optimizer from hugging face transformers
from transformers import AdamW
# define the optimizer
optimizer = AdamW(model.parameters(), lr=1e-3)
# Find class weights and push those to GPU
from sklearn.utils.class_weight import compute_class_weight
class_weights = compute_class_weight(class_weight="balanced",
classes=np.unique(train_labels),
y=train_labels)
# class_weights = dict(zip(np.unique(train_labels), class_weights))
# convert class weights to tensor
weights = torch.tensor(class_weights, dtype=torch.float)
weights = weights.to(device)
cross_entropy = nn.NLLLoss(weight=weights)
# number of training epochs
epochs = 100
print(class_weights)
Step 6: Train the model
def train():
# Train model
model.train()
# Initialize the loss and accuracy as zero (they will be updated every batch)
total_loss, total_accuracy = 0, 0
# empty list to save model predictions
total_preds = []
# iterate over batches
for step, batch in enumerate(train_dataloader):
# push the batch to gpu
batch = [r.to(device) for r in batch]
sent_id, mask, labels = batch
# clear previously calculated gradients
model.zero_grad()
# get model predictions for the current batch
preds = model(sent_id, mask)
# compute the loss between actual and predicted values
loss = cross_entropy(preds, labels)
# add on to the total loss
total_loss = total_loss + loss.item()
# backward pass to calculate the gradients
loss.backward()
# clip the gradients to 1.0; this helps prevent the exploding gradient problem
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
# update parameters
optimizer.step()
# model predictions are stored on GPU. So, push it to CPU
preds = preds.detach().cpu().numpy()
# append the model predictions
total_preds.append(preds)
# compute the training loss of the epoch
avg_loss = total_loss / len(train_dataloader)
# predictions are in the form (no. of batches, batch size, no. of classes);
# reshape them to (number of samples, no. of classes)
total_preds = np.concatenate(total_preds, axis=0)
#returns the loss and predictions
return avg_loss, total_preds
Step 7: Evaluate the model
def evaluate():
print("\nEvaluating...")
# deactivate dropout layers
model.eval()
# Again starting loss and accuracy as zero
total_loss, total_accuracy = 0, 0
# empty list to save the model predictions
total_preds = []
# iterate over batches
for step, batch in enumerate(val_dataloader):
# push the batch to gpu
batch = [t.to(device) for t in batch]
sent_id, mask, labels = batch
# deactivate autograd
with torch.no_grad():
# model predictions
preds = model(sent_id, mask)
# compute the validation loss between actual and predicted values
loss = cross_entropy(preds, labels)
total_loss = total_loss + loss.item()
preds = preds.detach().cpu().numpy()
total_preds.append(preds)
# compute the validation loss of the epoch
avg_loss = total_loss / len(val_dataloader)
# reshape the predictions in the form (number of samples, no. of classes)
total_preds = np.concatenate(total_preds, axis=0)
return avg_loss, total_preds
Step 8: Run the pipeline to iteratively train and evaluate model and
get the output for every epoch
# set initial loss to infinite
best_valid_loss = float('inf')
# empty lists to store training and validation loss of each epoch
train_losses = []
valid_losses = []
#for each epoch
for epoch in range(epochs):
#train model
train_loss, _ = train()
#evaluate model
valid_loss, _ = evaluate()
#save the best model
if valid_loss < best_valid_loss:
best_valid_loss = valid_loss
torch.save(model.state_dict(), 'saved_weights.pt')
# append training and validation loss
train_losses.append(train_loss)
valid_losses.append(valid_loss)
print(
f'Epoch {epoch + 1}/{epochs}\tTraining Loss: {train_loss:.3f}\tValidation Loss: {valid_loss:.3f}'
)
#load weights of best model
path = 'saved_weights.pt'
model.load_state_dict(torch.load(path))
# get predictions for test data
with torch.no_grad():
preds = model(test_seq.to(device), test_mask.to(device))
preds = preds.detach().cpu().numpy()
# model's performance
preds = np.argmax(preds, axis=1)
print(classification_report(test_y, preds))
pd.crosstab(test_y, preds)
Endnote
As we conclude, it’s clear that domain-specific large language models are not
just beneficial, but indeed necessary for modern organizations. The shift
from generic models to more tailored language processing models addresses
unique industry needs, enhances performance, enables personalized
interactions, and streamlines operations. Moreover, these custom models
provide improved data privacy control and financial efficiency.
In the rapidly evolving world of AI and natural language processing, adopting
domain-specific LLMs represents a crucial step towards staying competitive
and innovative. These models, fine-tuned to understand specific industry
language and context, transform how organizations interact with data and
digital interfaces. The need for domain-specific LLMs is undeniable, and it is
now a question of how quickly organizations can adapt and integrate this
powerful tool into their processes to unlock its full potential.
Leverage the power of customized LLMs! Contact LeewayHertz to create your own
domain-specific LLM for highly accurate results and unparalleled performance
within your domain.
Author’s Bio
Akash Takyar
CEO LeewayHertz
Akash Takyar is the founder and CEO at LeewayHertz. The experience of
building over 100 platforms for startups and enterprises allows Akash to
rapidly architect and design solutions that are scalable and beautiful.
Akash's ability to build enterprise-grade technology solutions has attracted
over 30 Fortune 500 companies, including Siemens, 3M, P&G and Hershey’s.
Akash is an early adopter of new technology, a passionate technology
enthusiast, and an investor in AI and IoT startups.