Embeddings have become a crucial component in contemporary vector search and Retrieval-Augmented Generation (RAG) systems. In this talk, I aim to provide a comprehensive overview of training a versatile embedding model and of strategies for encoding longer inputs with such models, along with their benefits and limitations. Additionally, I'll delve into various forms of deep-learning-powered retrievers.
1. Beyond 512 Tokens
Training State-of-the-Art Text Embeddings at Jina
Team lead, model training
bo.wang@jina.ai
2. A lot of hype in the past few months
1. jina-embeddings-v2 was the #1 trending model on Hugging Face for a week and
got over 3 million downloads.
2. Integrated into a lot of frameworks and databases, including
LangChain, LlamaIndex, etc.
3. Number 1 on Hacker News, with a lot of debate and discussion
about long-context encoders.
4. …
4. Embedding model fine-tuning with Finetuner
1. One of the earliest frameworks for embedding model fine-tuning, released around the same time as
sentence-transformers 1.0.
2. We believe that embeddings will be the future of search, and that the quality of embeddings will
determine the future of vector search.
3. We worked on text embedding fine-tuning, vision embedding fine-tuning and cross-modality
(CLIP) fine-tuning.
4. With Finetuner we try to solve how to leverage “small” data to achieve “big” improvement (a minimal sketch of this idea follows this list).
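For illustration, here is a minimal sketch of what small-data contrastive fine-tuning can look like, using sentence-transformers' MultipleNegativesRankingLoss. This is a generic example of the technique, not Finetuner's own API; the base model and the tiny in-domain pairs are placeholders.

```python
# Minimal sketch of small-data contrastive fine-tuning (illustration only,
# not Finetuner's API). The base model and example pairs are placeholders.
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

# A handful of (query, relevant passage) pairs from your own domain.
train_examples = [
    InputExample(texts=["how to reset my password",
                        "Go to Settings > Account > Reset password."]),
    InputExample(texts=["refund policy for damaged items",
                        "Damaged items can be returned within 30 days for a full refund."]),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=2)

# In-batch negatives: every other passage in the batch acts as a negative.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(train_objectives=[(train_dataloader, train_loss)],
          epochs=1, warmup_steps=10)
```

Even a few thousand such pairs can shift a general encoder noticeably towards a specific domain, which is the “small data, big improvement” idea above.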
7. Embedding model fine-tuning with Finetuner
1. The industry, even the search industry, was moving towards vectors only slowly; most teams had
just started to use pre-trained embedding models and were not ready for embedding model fine-tuning.
2. There were no LLMs and no RAG at the time, and embedding models were not widely adopted in other industries.
9. jina-embeddings-v1
1. Researched the SOTA models, including MiniLM, MPNet, Sentence-T5, GTR, Instructor, etc.
2. Collected 2 billion records of English training data.
3. Engineering-heavy data cleaning, including deduplication, language detection and quality filtering,
left us with 400 million records of high-quality pre-training data and 5 million lines of high-quality
human-annotated fine-tuning data (an illustrative cleaning pass is sketched after this list).
4. Completely refactored the Finetuner codebase to handle distributed training.
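As a rough illustration of the kind of cleaning in item 3 (not Jina's actual pipeline), the sketch below combines hash-based exact deduplication with a length filter and a language filter; the use of the langdetect package and the thresholds are assumptions for the example.

```python
# Illustrative cleaning pass (not Jina's actual pipeline): exact dedup via
# hashing plus simple length and language filters. Assumes `langdetect`.
import hashlib
from langdetect import detect

def clean(records):
    seen = set()
    for text in records:
        text = text.strip()
        if len(text) < 20:                       # drop near-empty records
            continue
        digest = hashlib.md5(text.lower().encode("utf-8")).hexdigest()
        if digest in seen:                       # exact duplicate
            continue
        seen.add(digest)
        try:
            if detect(text) != "en":             # keep English only
                continue
        except Exception:                        # undetectable language
            continue
        yield text

corpus = ["Hello world, this is a sample training record.",
          "Hello world, this is a sample training record.",
          "Ceci est une phrase en français."]
print(list(clean(corpus)))                        # one English record survives
```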
11. Targeting OpenAI's text-embedding-ada-002
jina-embeddings-v1 proved our capability to train general embedding models from scratch, but our
goal never stopped there. We want to train the best embeddings in the world. How do we improve from
here? We measure two factors:
1. Our v2 model should perform well on the MTEB leaderboard (a minimal evaluation sketch follows this list).
2. Our v2 model should handle a longer context, matching text-embedding-ada-002's 8192-token window.
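For reference, the sketch below shows how a model can be scored on a couple of MTEB tasks with the open-source mteb package. The task selection and output folder are placeholders, and the exact API may differ between mteb releases; loading the v2 model with trust_remote_code=True follows its Hugging Face model card.

```python
# Sketch of an MTEB evaluation run (task choice and paths are placeholders;
# the API shown matches older releases of the `mteb` package).
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en",
                            trust_remote_code=True)

# Start with a small subset of English tasks before a full leaderboard run.
evaluation = MTEB(tasks=["STSBenchmark", "SciFact"])
evaluation.run(model, output_folder="results/jina-embeddings-v2-base-en")
```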
12. Why do all models only handle 512 tokens?
Almost all embedding models are fine-tuned (or continually pre-trained) from a foundation model.
The most widely used foundation model is BERT or one of its variants.
The Transformer architecture takes in all token embeddings at once rather than sequentially, so it is
important to let the model “know” the word order. This word-order information is kept in a layer called
position embeddings, which in BERT is a learned lookup table with a fixed number of positions (see the sketch below).
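Concretely, the 512-token ceiling is the size of that learned table, fixed at pre-training time; any position beyond it simply has no vector to look up. A quick check against the standard bert-base-uncased checkpoint:

```python
# Why 512: BERT stores one learned vector per position, fixed at pre-training.
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("bert-base-uncased")
print(config.max_position_embeddings)                      # -> 512

model = AutoModel.from_pretrained("bert-base-uncased")
# The lookup table itself: one row per position; token 513 has no row.
print(model.embeddings.position_embeddings.weight.shape)   # torch.Size([512, 768])
```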
13. Devlin, Jacob, et al. "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." arXiv preprint
arXiv:1810.04805 (2018).
16. Modifications
1. Completely removed position embeddings and replaced them with Attention with Linear Biases
(ALiBi).
2. Adapted ALiBi to the bidirectional transformer architecture (see the sketch after this list).
3. Completely retrained BERT (JinaBERT) with SOTA tricks, including whole-word masking, the
RoBERTa recipe, better activations (GeGLU), aggressive masking, full 512-sequence-length
training and, of course, ALiBi to support longer sequences.
4. Using JinaBERT as the backbone, we trained jina-embeddings-v2 on an improved dataset and
training recipe without overfitting to the MTEB training data.
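A minimal sketch of a bidirectional ALiBi bias: each attention head adds a per-head slope times the token distance to its raw attention scores, so relative position comes from the bias rather than from a learned 512-row table. The slope schedule follows the geometric sequence from the ALiBi paper; the symmetric |i - j| variant shown here illustrates the bidirectional adaptation, not necessarily JinaBERT's exact implementation.

```python
# Sketch of bidirectional ALiBi: a distance-based bias added to attention
# scores instead of learned position embeddings. Symmetric |i - j| variant;
# illustrative, not necessarily JinaBERT's exact implementation.
import torch

def alibi_slopes(num_heads: int) -> torch.Tensor:
    # Geometric slope schedule from the ALiBi paper (power-of-two head counts).
    start = 2.0 ** (-8.0 / num_heads)
    return torch.tensor([start ** (i + 1) for i in range(num_heads)])

def bidirectional_alibi_bias(seq_len: int, num_heads: int) -> torch.Tensor:
    positions = torch.arange(seq_len)
    distance = (positions[None, :] - positions[:, None]).abs()   # |i - j|
    slopes = alibi_slopes(num_heads)                             # (heads,)
    return -slopes[:, None, None] * distance[None, :, :]         # (heads, L, L)

# Added to raw attention scores before softmax; since there is no fixed-size
# position table, the same model can be run at longer sequence lengths.
scores = torch.randn(8, 1024, 1024)                # (heads, L, L) for one example
attn = (scores + bidirectional_alibi_bias(1024, 8)).softmax(dim=-1)
```

Because nearby tokens get a smaller penalty than distant ones, the bias encodes relative order directly in attention, which is what allows extrapolation beyond the training length.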
21. Different cases
1. If your documents are always shorter than 512 tokens, jina-embeddings-v2 is yet another average
encoder, the same as e5, bge or others.
2. If your documents are always longer than 512 tokens and the relevant information is at the
beginning of the document, jina-embeddings-v2 is likely to perform worse.
3. If your documents are always longer than 512 tokens and the relevant information is in the middle
or at the end of the document, jina-embeddings-v2 could boost your search performance.
Keep in mind that jina-embeddings-v2 offers you the flexibility to go beyond the 512-token constraint
and adapt to your own needs, at any sequence length up to 8192 (see the sketch below).
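A minimal sketch of choosing the sequence length at encode time, assuming the sentence-transformers loading path with trust_remote_code=True described on the v2 model card and its max_seq_length attribute; the 8192-token ceiling is the model's, everything below it is your choice.

```python
# Sketch: pick your own sequence length (up to 8192) at encode time.
# Assumes the sentence-transformers loading path from the v2 model card.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("jinaai/jina-embeddings-v2-base-en",
                            trust_remote_code=True)

model.max_seq_length = 1024          # anything up to 8192; longer is slower
embeddings = model.encode([
    "a short query",
    "a multi-page document whose relevant passage sits near the end ...",
], normalize_embeddings=True)
print(embeddings.shape)              # (2, 768)
```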
28. What’s Next?
1. jina-embeddings-v3 is coming.
2. jina-embeddings-v3 will be extremely fast and memory-efficient,
especially on longer sequences.
3. jina-embeddings-v3 will be multilingual, with a much more optimized
language distribution.
4. jina-embeddings-v3 will solve real-world problems, informed by
comprehensive failure analysis of v2 and other embedding models on
real-world data.
5. jina-embeddings-v3 will handle different tasks better, with
carefully designed task instructions, task heads and clever
routing.
6. jina-embeddings-v3 will be chunk-aware and schema-aware:
better understanding of semi-structured data and of different
perspectives of a document, with hierarchical embeddings.