Deep Dive: Model Merging
Companion videos: https://youtu.be/cvOpX75Kz4M + https://youtu.be/qbAvOgGmFuE
Julien Simon
https://www.linkedin.com/in/juliensimon
https://www.youtube.com/juliensimonfr
The author of this material is Julien Simon https://www.linkedin.com/in/juliensimon unless explicitly mentioned.
This material is shared under the CC BY-NC 4.0 license https://creativecommons.org/licenses/by-nc/4.0/
You are free to share and adapt this material, provided that you give appropriate credit, provide a link to the license, and indicate if changes were made.
You may not use the material for commercial purposes. You may not apply any restriction on what the license permits.
What is model merging?
• Building a "great" model is challenging, time-consuming and compute-intensive
• Instead, can we build one by merging several models based on the same architecture?
• Combine multiple task-specific models into a single multitask model without any additional training
• Not an ensembling technique: there's only one model at the end
• Merging only requires lightweight compute
• Fast process: no additional training cost, no extra inference cost or latency
• Mergekit: https://github.com/arcee-ai/mergekit
Merging techniques
• Model Soups https://arxiv.org/abs/2203.05482 (03/2022)
• Spherical Linear Interpolation (SLERP) https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
• Task Arithmetic https://arxiv.org/abs/2212.04089 (12/2022)
• Trim, Elect Sign, and Merge (TIES) https://arxiv.org/abs/2306.01708 (06/2023)
• Drop and Rescale (DARE) https://arxiv.org/abs/2311.03099 (11/2023)
• Franken-merging
Merging techniques, continued
• Model Breadcrumbs https://arxiv.org/abs/2312.06795 (12/2023)
• Model Stock https://arxiv.org/abs/2403.19522 (03/2024)
• DELLA https://arxiv.org/abs/2406.11617 (06/2024)
Model soups
https://arxiv.org/abs/2203.05482 (03/2022) + https://github.com/mlfoundations/model-soups
• Average many variants of the same model, trained on the same dataset with different hyper-parameters, aka linear interpolation
• Optionally: weighted average, normalization
• Uniform soup: average all models
• Greedy soup: add models to the average one by one, keeping only the ones that improve test accuracy (a minimal sketch follows the snippet below)
• Generally, model soups perform a little worse than ensembles, but are more resilient to out-of-distribution data
(Results in the paper: fine-tuning a CLIP ViT-B/32 model on ImageNet; BERT and T5 on four text classification datasets from the GLUE benchmark.)
# mergekit linear merge: weighted average of the stacked model tensors
res = (weights * tensors).sum(dim=0)
if self.normalize:
    res /= weights.sum(dim=0)
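Below is a minimal greedy-soup sketch. The candidate state dicts and the evaluate() callback (accuracy on a held-out set) are assumptions for illustration; this is not the authors' reference implementation.

import torch

def average_state_dicts(state_dicts):
    # Uniform soup: element-wise average of all candidate checkpoints
    return {k: torch.stack([sd[k].float() for sd in state_dicts]).mean(dim=0)
            for k in state_dicts[0]}

def greedy_soup(candidates, evaluate):
    # Candidates are state dicts, sorted by decreasing held-out accuracy
    soup, best_acc = [candidates[0]], evaluate(candidates[0])
    for sd in candidates[1:]:
        trial = average_state_dicts(soup + [sd])
        acc = evaluate(trial)
        # Keep the candidate only if it improves the averaged soup
        if acc >= best_acc:
            soup.append(sd)
            best_acc = acc
    return average_state_dicts(soup)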
Spherical Linear IntERPolation (SLERP)
https://dl.acm.org/doi/10.1145/325334.325242 (07/1985)
• Originally designed for 3D CGI to find smooth transitions between two (camera) rotations
• Only works with two models (one of them can be favored)
• Helps preserve the magnitude of weights
• Helps preserve the "curvature" of the embeddings space
Linear (red) vs. spherical (blue) interpolation
Images: https://www.researchgate.net/figure/Slerp-and-linear-interpolation-comparison_fig2_318221910, https://allenchou.net/2018/05/game-math-deriving-the-slerp-formula/
import numpy as np

def normalize(v, eps):
    norm = np.linalg.norm(v)
    return v if norm < eps else v / norm

def slerp(t, v0, v1, eps=1e-8):
    # Copy the vectors to reuse them later
    v0_copy = np.copy(v0)
    v1_copy = np.copy(v1)
    # Normalize the vectors to get the directions and angles
    v0 = normalize(v0, eps)
    v1 = normalize(v1, eps)
    # Dot product with the normalized vectors
    dot = np.clip(np.sum(v0 * v1), -1.0, 1.0)
    # If the inputs are almost colinear, fall back to linear interpolation
    if np.abs(dot) > 1.0 - eps:
        return (1.0 - t) * v0_copy + t * v1_copy
    # Calculate initial angle between v0 and v1
    theta_0 = np.arccos(dot)
    sin_theta_0 = np.sin(theta_0)
    # Angle at timestep t
    theta_t = theta_0 * t
    sin_theta_t = np.sin(theta_t)
    # Finish the slerp algorithm
    s0 = np.sin(theta_0 - theta_t) / sin_theta_0
    s1 = sin_theta_t / sin_theta_0
    return s0 * v0_copy + s1 * v1_copy
(Diagram: v0 and v1 separated by angle 𝛳; the interpolated vector forms angle t𝛳 with v0 and (1-t)𝛳 with v1.)
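For illustration, here is how the slerp() function above could be applied tensor by tensor to two checkpoints sharing the same architecture. The slerp_state_dicts name and the interpolation factor t are assumptions for this sketch, not mergekit's API.

import torch

def slerp_state_dicts(state_dict_a, state_dict_b, t=0.5):
    merged = {}
    for name, tensor_a in state_dict_a.items():
        tensor_b = state_dict_b[name]
        # Flatten each pair of tensors, interpolate on the sphere, restore the shape
        a = tensor_a.float().cpu().numpy().ravel()
        b = tensor_b.float().cpu().numpy().ravel()
        merged[name] = torch.from_numpy(slerp(t, a, b)).reshape(tensor_a.shape)
    return merged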
Task Arithmetic
https://arxiv.org/abs/2212.04089 (12/2022)
• Pre-trained models can be fine-tuned for a particular task: text classification, summarization, etc.
• Task vector: the tensor updates applied to the pre-trained model during fine-tuning
• A pre-trained model can be fine-tuned for many tasks on many datasets, leading to many task vectors
• Task vectors can be added to or subtracted from the base model to add or remove capabilities
• Many single-task models can be merged to build a multi-task model (see the sketch below)
• However, as the number of tasks grows, finding the right mix becomes a computationally expensive problem (hyper-parameter tuning on a validation set)
(Figures in the paper: adding pairs of task vectors to T5-Base; adding task vectors to CLIP.)
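A minimal sketch of task arithmetic on state dicts; the function names and the scaling value are illustrative assumptions, not the paper's reference code.

def task_vector(base, finetuned):
    # Task vector = fine-tuned weights minus base weights
    return {k: finetuned[k] - base[k] for k in base}

def apply_task_vectors(base, task_vectors, scaling=0.3):
    # Add (or, with a negative scaling, subtract) task vectors to the base model
    merged = {k: v.clone() for k, v in base.items()}
    for tv in task_vectors:
        for k in merged:
            merged[k] += scaling * tv[k]
    return merged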
TrIm, Elect Sign and Merge (TIES)
https://arxiv.org/abs/2306.01708 (06/2023)
• Parameter interference can degrade merged performance
• Influential vs. redundant parameters
• Sign conflicts
• Trim, Elect Sign & Merge:
• Trim each task vector to retain only the influential parameter values (top-k % largest magnitudes)
• Resolve the sign conflicts between the remaining values
• Average the parameters whose sign agrees with the direction of the largest total movement
• Add the averaged parameters to the original model (with a scale factor), as sketched below
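A simplified per-tensor sketch of the three TIES steps; the density and scaling values are illustrative assumptions, not the paper's defaults.

import torch

def ties_merge(base, task_vectors, density=0.2, scaling=1.0):
    trimmed = []
    for tv in task_vectors:
        # 1. Trim: keep only the top-k% largest-magnitude updates
        k = max(1, int(density * tv.numel()))
        threshold = tv.abs().flatten().kthvalue(tv.numel() - k + 1).values
        trimmed.append(torch.where(tv.abs() >= threshold, tv, torch.zeros_like(tv)))
    stacked = torch.stack(trimmed)
    # 2. Elect sign: per parameter, the sign of the largest total movement
    elected_sign = torch.sign(stacked.sum(dim=0))
    # 3. Merge: average only the values whose sign agrees with the elected sign
    agrees = torch.sign(stacked) == elected_sign
    counts = agrees.sum(dim=0).clamp(min=1)
    merged_delta = (stacked * agrees).sum(dim=0) / counts
    return base + scaling * merged_delta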
Drop And REscale (DARE)
https://arxiv.org/abs/2311.03099 (11/2023) + https://github.com/yule-BUAA/MergeLM
• During fine-tuning, many LLM parameter updates are redundant
• DARE randomly eliminates up to 99% of the updates, with minimal impact on model performance
• Drop p% of the delta parameters and rescale the remaining ones by 1 / (1 − p)
• The larger the model, the more limited the impact
• This drastically reduces the size of task vectors
• Task vectors can then be applied to a pre-trained model using previous methods like TIES (see the sketch below)
(Results: WizardLM-13B (LM), WizardMath-13B (Math), and llama-2-13b-code-alpaca (Code).)
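A minimal sketch of the drop-and-rescale step on a single delta tensor; the function name and the drop rate are assumptions for illustration.

import torch

def dare_delta(delta, p=0.9):
    # Bernoulli mask: each delta parameter is kept with probability 1 - p
    keep_mask = torch.bernoulli(torch.full_like(delta, 1.0 - p))
    # Rescale the survivors by 1 / (1 - p) so the expected update is preserved
    return delta * keep_mask / (1.0 - p)

# Example: apply a DARE-sparsified task vector to the base weights of one layer
# merged = base + dare_delta(finetuned - base, p=0.99)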
Franken-merging
• All previous methods require models that share a common architecture (T5, Llama, etc.)
• We can build a new model by stitching layers from different models, possibly with different architectures
• Weights are left untouched (aka passthrough merging)
• Very experimental, but some interesting models are popping up
• https://huggingface.co/models?sort=downloads&search=franken
Example: Goliath-120B (https://huggingface.co/alpindale/goliath-120b), a passthrough merge of two Llama-2-70B models:
Xwin-LM-70B-V0.1 (https://huggingface.co/Xwin-LM/Xwin-LM-70B-V0.1) and Euryale-1.3-L2-70B (https://huggingface.co/Sao10K/Euryale-1.3-L2-70B)
Layer ranges, in order: Xwin 0-16, Euryale 8-24, Xwin 17-32, Euryale 25-40, Xwin 33-48, Euryale 41-56, Xwin 49-64, Euryale 57-72, Xwin 65-80 (a layer-stitching sketch follows below)
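A rough state-dict-level sketch of a passthrough merge, assuming Llama-style checkpoints whose decoder layers live under keys of the form model.layers.<i>. This illustrates the idea only; it is not mergekit's implementation, and the slice boundaries are copied from the slide without guaranteeing inclusive/exclusive semantics.

def frankenmerge(state_dicts, slices):
    merged, new_idx = {}, 0
    for model_idx, start, end in slices:
        source = state_dicts[model_idx]
        for layer in range(start, end):
            prefix = f"model.layers.{layer}."
            new_prefix = f"model.layers.{new_idx}."
            # Copy every tensor of this decoder layer under its new index
            for key, tensor in source.items():
                if key.startswith(prefix):
                    merged[new_prefix + key[len(prefix):]] = tensor
            new_idx += 1
    # Embeddings, final norm and lm_head are taken unchanged from the first model
    for key, tensor in state_dicts[0].items():
        if not key.startswith("model.layers."):
            merged[key] = tensor
    return merged

# Goliath-style recipe: (source model index, first layer, last layer), 0 = Xwin, 1 = Euryale
goliath_slices = [(0, 0, 16), (1, 8, 24), (0, 17, 32), (1, 25, 40), (0, 33, 48),
                  (1, 41, 56), (0, 49, 64), (1, 57, 72), (0, 65, 80)]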
Model breadcrumbs
https://arxiv.org/abs/2312.06795 (12/2023) + https://github.com/rezazzr/breadcrumbs
• An improvement of the Task Arithmetic technique
• Compute a task direction for each fine-tuned model (fine-tuned weights minus base weights)
• Mask (set to zero) a percentage of the tiny and large outliers in each task direction, resp. β and ɣ (this is equivalent to using the parameter from the base model)
• Sum the base model and the masked task directions, with a task weight (⍺), as sketched below
• This method outperforms Task Vectors. It also scales much better to a large number of tasks: « After finding optimal hyperparameters for both Model Breadcrumbs and Task Vectors using 10 tasks, we kept these hyperparameters and incrementally merged all 200 tasks to create a multi-task model. […] the observed trend remains, with Model Breadcrumbs (85% sparsity) consistently outperforming Task Vectors by a significant margin as the number of tasks increases. »
• Merging also improves the performance of the individual fine-tuned models
• Optionally, add sign election before merging (TIES method)
(Results: T5-base fine-tuned on 4 GLUE tasks, then merged with six other T5 variants (IMDB, etc.).)
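A simplified per-tensor sketch of the masking and merging steps; the β, ɣ and ⍺ values and the helper names are illustrative assumptions.

import torch

def breadcrumb_direction(direction, beta=0.05, gamma=0.01):
    # Zero out the smallest (beta) and largest (gamma) fractions of a task direction
    magnitudes = direction.abs().flatten()
    lo = torch.quantile(magnitudes, beta)         # below this threshold, weights are masked
    hi = torch.quantile(magnitudes, 1.0 - gamma)  # above this threshold, weights are masked
    mask = (direction.abs() >= lo) & (direction.abs() <= hi)
    return direction * mask

def merge_breadcrumbs(base, finetuned_models, alpha=0.3):
    # Merged weights = base + alpha * sum of masked task directions
    merged = {k: v.clone() for k, v in base.items()}
    for model in finetuned_models:
        for k in merged:
            merged[k] += alpha * breadcrumb_direction(model[k] - base[k])
    return merged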
Model stock
https://arxiv.org/abs/2403.19522 (03/2024) + https://github.com/naver-ai/model-stock
• Observation 1: « Fine-tuned weights with different random seeds reside on a very thin shell layer-wise »
3D intuition: for each layer, the weight vectors of all fine-tuned models live on a sphere
• Observation 2: « Going closer to the center by averaging the weights leads to improving both performances. » [in-distribution and out-of-distribution]
• Model soups work because averaging many single-task models gets us closer to the center.
• However, building a large collection of single-task models can be computationally costly.
• By understanding the geometry of fine-tuned weights, Model Stock approximates the center coordinates from just a few models (see the sketch below).
• Periodic merging during fine-tuning yields even better results.
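A per-layer sketch of the geometric approximation, assuming an interpolation ratio t = N·cos𝛳 / ((N−1)·cos𝛳 + 1) between the fine-tuned average and the pre-trained weights, with cos𝛳 the average pairwise cosine similarity of the task vectors; treat this exact formula as an assumption to check against the paper.

import torch

def model_stock_layer(w0, finetuned, eps=1e-8):
    # Approximate the center of the fine-tuned weights for one layer (sketch)
    n = len(finetuned)
    deltas = [(w - w0).flatten() for w in finetuned]
    # Average pairwise cosine similarity between task vectors of this layer
    cosines = [torch.dot(deltas[i], deltas[j]) / (deltas[i].norm() * deltas[j].norm() + eps)
               for i in range(n) for j in range(i + 1, n)]
    cos_theta = torch.stack(cosines).mean()
    # Interpolation ratio between the fine-tuned average and the pre-trained weights
    t = n * cos_theta / ((n - 1) * cos_theta + 1)
    w_avg = torch.stack(finetuned).mean(dim=0)
    return t * w_avg + (1 - t) * w0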
Drop and rEscaLe via sampLing with mAgnitude (DELLA)
https://arxiv.org/abs/2406.11617 (06/2024) + https://github.com/declare-lab/della
1. Compute delta parameters for each fine-tuned model (fine-tuned weights minus base weights)
2. Drop a fraction of the deltas:
• The drop probability of each parameter is inversely proportional to its magnitude
• For each parameter, flip a coin (Bernoulli distribution) to decide whether to keep it or drop it
• If a parameter is dropped, set it to zero
3. Elect weights for merging, according to the dominant direction (same as TIES; optional in mergekit)
4. Fuse (average) the elected weights, as sketched below
(Results: WizardLM-13B, WizardMath-13B, llama-2-13b-code-alpaca.)
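A simplified sketch of the magnitude-aware drop step on one delta tensor; the linear probability schedule and its bounds are illustrative assumptions, not the paper's exact scheme.

import torch

def della_drop(delta, min_keep=0.1, max_keep=0.5):
    flat = delta.abs().flatten()
    # Rank parameters by magnitude: 0 = smallest, 1 = largest
    ranks = flat.argsort().argsort().float() / max(flat.numel() - 1, 1)
    # Keep probability grows with magnitude (assumed linear schedule)
    keep_prob = (min_keep + (max_keep - min_keep) * ranks).reshape(delta.shape)
    keep_mask = torch.bernoulli(keep_prob)
    # Rescale the survivors so the expected update is preserved
    return delta * keep_mask / keep_prob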
