NISHIO Hirokazu[English][日本語]

Evolutionary Optimization of Model Merging Recipes

This paper proposes a new approach to optimizing model merging recipes using evolutionary algorithms. The main points are as follows:

Model merging is a promising approach that combines models with different expertise to create new and powerful models, but currently relies on human intuition and expertise. Evolutionary methods could automate this.
An integrated framework is proposed to optimize model merging in both parameter space and data flow space. This allows simultaneous optimization of inference paths as well as model weights.
Models from different fields (e.g., a Japanese LLM and math reasoning models) were successfully merged to automatically generate models that can solve mathematical problems in Japanese. Surprisingly, SOTA was achieved even on tasks that were not explicitly optimized for.
A Japanese VLM that handles Japanese culture-specific content appropriately was also generated, and it outperformed conventional Japanese VLMs. The models generated by this method are open-sourced. The work also demonstrates the possibility of efficiently creating powerful models by combining existing models rather than training large models from scratch.

In short, this is an ambitious study that presents a new paradigm for automating and optimizing the process of model merging using evolutionary algorithms. It is particularly impressive in that it demonstrates the possibility of merging models from different fields to bring out new capabilities.

What is "Simultaneous optimization of inference paths as well as model weights"?

The evolutionary model merging approach proposed in the paper uses two different spaces for optimization when combining models.

Parameter Space (PS): Optimizes the ratio in which the parameters (weights) of each layer are combined. In other words, it finds the optimal mixing ratio when mixing the weights of the source models.
Data Flow Space (DFS): Optimizes the path that data (tokens) takes through the model during inference. Specifically, it optimizes the connections between layers that determine the flow of data, such as passing through the i-th layer of one model and then going to the j-th layer of another model.

By simultaneously optimizing the merge settings in these two spaces, we can create a model that not only combines weights, but also takes into account the flow of data during inference. This allows us to find a model in which both the mixing ratio of the weights and the inference path are optimal.

This exploration of both parameter space and data flow space is a key feature of the method and one of the reasons for its strong performance. Compared to simply averaging weights, it allows more flexibility in combining models.
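To make the parameter-space idea concrete, here is a minimal sketch of per-layer weight mixing. It assumes plain linear interpolation between two toy models represented as dictionaries of flat weight lists; the paper itself builds on TIES-Merging with DARE rather than plain interpolation, and the names `merge_parameter_space`, `model_a`, `model_b`, and `ratios` are hypothetical.

```python
from typing import Dict, List

# Toy representation (an assumption): layer name -> flattened weights.
Weights = Dict[str, List[float]]


def merge_parameter_space(model_a: Weights, model_b: Weights,
                          ratios: Dict[str, float]) -> Weights:
    """Mix each layer as ratio * A + (1 - ratio) * B, with one ratio per layer."""
    merged: Weights = {}
    for name, w_a in model_a.items():
        w_b = model_b[name]
        r = ratios[name]
        merged[name] = [r * a + (1.0 - r) * b for a, b in zip(w_a, w_b)]
    return merged


# Hypothetical two-layer models with two weights per layer.
model_a = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
model_b = {"layer0": [3.0, 0.0], "layer1": [2.0, 0.0]}

# The per-layer ratios are what an evolutionary search would tune.
ratios = {"layer0": 0.5, "layer1": 0.25}

merged = merge_parameter_space(model_a, model_b, ratios)
print(merged)
```

Giving every layer its own ratio is what makes this more flexible than a single global average: the search can, for example, keep early layers close to one model and later layers close to the other.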

How do you find the best mixing ratio when mixing weights?

The paper uses one of the evolutionary algorithms, CMA-ES (Covariance Matrix Adaptation Evolution Strategy), to find the optimal mixing ratio of the weights.

CMA-ES is a powerful method for solving continuous optimization problems. The general flow is as follows:

1. Initialization: Set the initial values of the mixing weights (e.g., all 0.5). Also initialize the parameters that determine the range of the search (e.g., the covariance matrix).
2. Evaluation: Combine the models using the current mixing weights and evaluate the performance of the result (e.g., accuracy). This performance measure is called fitness.
3. Update: Generate the next search points based on the evaluation results. Specifically, the covariance matrix is updated to focus the search on the neighborhood of the points with the highest fitness.
4. Repeat steps 2 and 3 until the termination condition is met.

In this way, CMA-ES searches for the optimal mixing weights through repeated evaluation and updating. Because the search concentrates on the vicinity of points with high fitness, good solutions can be found efficiently.
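The initialize, evaluate, update loop can be sketched with a much simpler evolution strategy. Note that this is not CMA-ES: it keeps only a mean and a single step size that shrinks on a fixed schedule, whereas CMA-ES adapts a full covariance matrix. The fitness function below is a hypothetical stand-in for "merge the models and measure accuracy", and all names (`evolve_ratios`, `toy_fitness`, `target`) are assumptions for illustration.

```python
import random
from typing import Callable, List


def evolve_ratios(fitness: Callable[[List[float]], float], dim: int,
                  generations: int = 60, population: int = 16,
                  elite: int = 4, seed: int = 0) -> List[float]:
    """Simplified evolution strategy (not CMA-ES) over mixing ratios in [0, 1]."""
    rng = random.Random(seed)
    mean = [0.5] * dim   # initialization: every mixing ratio starts at 0.5
    sigma = 0.3          # initial search radius
    for _ in range(generations):
        # Sample candidates around the current mean, clipped to [0, 1].
        candidates = [
            [min(1.0, max(0.0, rng.gauss(m, sigma))) for m in mean]
            for _ in range(population)
        ]
        # Evaluation: rank candidates by fitness (higher is better).
        candidates.sort(key=fitness, reverse=True)
        best = candidates[:elite]
        # Update: move the mean to the elite average and narrow the search.
        mean = [sum(c[i] for c in best) / elite for i in range(dim)]
        sigma *= 0.95
    return mean


# Hypothetical fitness: pretend the ideal ratios are known, and score
# candidates by negative squared distance to them.
target = [0.8, 0.2, 0.6]


def toy_fitness(ratios: List[float]) -> float:
    return -sum((r - t) ** 2 for r, t in zip(ratios, target))


best_ratios = evolve_ratios(toy_fitness, dim=3)
print([round(r, 2) for r in best_ratios])
```

In a real setting each fitness call would require building a merged model and running a benchmark, so the population size and number of generations directly determine the compute cost.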

CMA-ES also has theoretical underpinnings and, because it does not require gradients, can be applied to non-convex problems, so it is used in some optimization problems involving neural networks.

In the paper, this method is combined with existing merging methods such as TIES-Merging and DARE to further improve performance: CMA-ES finds the optimal mixing weights, and the models are combined with these weights to create a powerful model suited to the task at hand.

How do you optimize in the data flow space?

Optimizing model merging in the data flow space (DFS) is done by using evolutionary algorithms to explore the connections between layers that determine the flow of data during inference. In the paper, the following approach is taken:

Defining the search space: First, we define the search space that represents the layer connections. Specifically, we introduce an "indicator array", a binary vector representing the use/non-use of the layers. For example, the vector [1, 0, 1, 1, 0] represents the use of the first, third and fourth layers.
Reducing the search space: If the number of layers is M and the number of repetitions is r, the indicator array has M*r entries, so there are 2^(M*r) possible layer connection patterns. Since this is a huge number, we reduce the search space by eliminating patterns that do not look good empirically (e.g., using the same layer in succession).
Introducing weight matrices: When connecting the layers of different models, performance can be degraded due to changes in the distribution of inputs. To mitigate this, a weight matrix is introduced for the connections between the layers. This weight matrix is also optimized using an evolutionary algorithm.
Performing the evolutionary search: The indicator array and the weight matrix together are treated as one individual, and an evolutionary algorithm (CMA-ES is used in the paper) searches for the best values. Performance on the validation set is used as the evaluation metric.
Selecting the best individual: After the search, the individual with the best performance on the validation set is selected as the final model.
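As a rough illustration of the indicator array described above, the sketch below decodes a binary array of length M*r into the list of layers that are actually used, and computes the size of the search space. It assumes the M layers are simply numbered 0 to M-1 and laid out r times in a row; the function name `decode_path` and the example values are hypothetical.

```python
from typing import List, Tuple


def decode_path(indicator: List[int], num_layers: int) -> List[Tuple[int, int]]:
    """Turn a binary indicator array of length M*r into (repetition, layer) steps.

    Position p in the array corresponds to layer p % M in repetition p // M.
    Layers whose indicator is 0 are skipped.
    """
    path = []
    for pos, used in enumerate(indicator):
        if used:
            path.append(divmod(pos, num_layers))
    return path


M, r = 5, 2  # 5 candidate layers, repeated twice (toy values)
indicator = [1, 0, 1, 1, 0,
             0, 1, 0, 0, 1]

path = decode_path(indicator, M)
print(path)

# Each of the M*r positions is either on or off.
search_space = 2 ** (M * r)
print(search_space)
```

Even this toy setting has over a thousand patterns; with dozens of layers per model the count becomes astronomically large, which is why the search space has to be restricted and explored heuristically.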

This process allows for optimal model merging in the data flow space. Importantly, layer connections are directly incorporated into the search space. This allows the layers of the source model to be flexibly combined to find the best inference path for the task.

Another unique feature is the introduction of a weight matrix for the connections between layers. This makes it possible not only to connect the layers but also to absorb differences in the input distributions, thus mitigating performance degradation.
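The role of the weight matrix can be sketched as follows: when data moves from layer i to layer j, it is first multiplied by a scalar W[i][j]. This is a toy sketch under stated assumptions: layers are plain Python functions on lists of floats, the scaling factors are stored in a dictionary keyed by (i, j), and missing entries default to 1.0. The names `run_path`, `layers`, and `scale` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Layer = Callable[[List[float]], List[float]]


def run_path(x: List[float], layers: List[Layer], path: List[int],
             scale: Dict[Tuple[int, int], float]) -> List[float]:
    """Apply layers in the order given by path, scaling the input of each hop."""
    prev = None
    for idx in path:
        if prev is not None:
            s = scale.get((prev, idx), 1.0)  # W[prev][idx], default: no scaling
            x = [s * v for v in x]
        x = layers[idx](x)
        prev = idx
    return x


# Hypothetical toy layers standing in for transformer blocks.
layers: List[Layer] = [
    lambda v: [e + 1.0 for e in v],   # layer 0
    lambda v: [e * 2.0 for e in v],   # layer 1
    lambda v: [e - 1.0 for e in v],   # layer 2
]

# Going from layer 0 directly to layer 2 skips layer 1, so the input is rescaled.
scale = {(0, 2): 0.5}

out = run_path([1.0, 3.0], layers, path=[0, 2, 1], scale=scale)
print(out)
```

In the evolutionary search, the entries of the scaling dictionary would be optimized together with the indicator array rather than set by hand.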

I'm not sure what you mean by "data flow during inference".

Yes, "data flow during inference" may be a bit confusing. In the context of a neural network, it can be described as "the order of the layers that the input data passes through during inference."

Neural networks are usually composed of multiple layers connected in sequence. For example, input layer, intermediate layer 1, intermediate layer 2, output layer, and so on. Usually, input data passes through each layer in turn starting from the input layer, and finally the result is output from the output layer.

However, the model merging in the data flow space proposed in this paper contemplates changing the connection order of these layers. For example, after using the input layer and intermediate layer 1 of model A, connect to intermediate layer 2 of model B, and so on, combining the layers of different models to create a new network structure.

In this new network structure, the order of the layers through which data passes is different from the original model. In other words, the "flow of data during inference" changes.
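A tiny example may help here. The sketch below treats each model as a list of layers and an inference path as a list of (model, layer) pairs, so that data can leave model A and continue in model B. The layers are hypothetical one-line functions on numbers, not real network layers, and the names `models`, `infer`, and the paths are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

# Two hypothetical models, each a list of toy layers.
models: Dict[str, List[Callable[[float], float]]] = {
    "A": [lambda x: x + 1, lambda x: x * 2],
    "B": [lambda x: x - 3, lambda x: x * 10],
}


def infer(x: float, path: List[Tuple[str, int]]) -> float:
    """Pass x through the layers named in path, in order."""
    for model_name, layer_idx in path:
        x = models[model_name][layer_idx](x)
    return x


original_a = [("A", 0), ("A", 1)]            # the usual order inside model A
merged = [("A", 0), ("A", 1), ("B", 1)]      # continue into layer 1 of model B

print(infer(1, original_a))
print(infer(1, merged))
```

The weights of every layer stay exactly as they were; only the route through them changes, which is the sense in which the "flow of data during inference" is being optimized.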

The paper attempts to find a network structure suitable for the task by using evolutionary algorithms to optimize the way these layers are connected, that is, "the flow of data during inference."

"Change the connection order of the layers"? Is that right? Is this common? Where is this written in the paper?

Thank you for pointing this out. Changing the connection order of the layers is not a common neural network training method. I guess my explanation was insufficient.

The approach proposed in this paper is different from the usual neural network training in that it combines layers of different models to create a new network structure.

The "3.2 Merging in the Data Flow Space" section of the paper states the following in this regard:

In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of N models and a budget T, our method searches for a sequence of layer indices L_{i,j}^{(t)} that delineates the path all the tokens should follow for a specific task. Here L_{i,j} denotes the j-th layer in the i-th model, with t ∈ [1, T] marking the step in the inference path.

In other words, we are considering combining the layers of the N models to create a new network structure consisting of T layers. In this new network, data passes through the layers in the order specified by L_{i,j}^{(t)}.

Also, at the end of "3.2 Merging in the Data Flow Space" you will find the following statement:

In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies show that swapping a pair of neighboring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically we find that appropriately scaling an input that wishes to go from layer i to j by W_{ij} help alleviate the problem.

Here, the paper refers to the possibility of performance degradation caused by changes in the distribution of inputs when layers are connected in an order different from that of their original model.

Thus, the paper proposes an uncommon approach to optimizing inference paths by combining different layers of models. In my initial attempt to explain this point clearly, I used inaccurate wording. I apologize for this.

(Read the paper here)

One paragraph before the passage the AI cited, the paper gives the example of "using layer i of model A, then connecting to layer j of model B":
Recent analysis and discoveries imply that knowledge is stored distributedly in language models [14, 35, 36], suggesting simple yet novel model merging possibilities in the data flow space (DFS). Unlike merging in PS, model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the i-th layer in model A, a token may be directed to the j-th layer in model B.

This page is auto-translated from [/nishio/Evolutionary Optimization of Model Merging Recipes](https://scrapbox.io/nishio/Evolutionary Optimization of Model Merging Recipes) using DeepL. If you find something interesting but the auto-translated English is not good enough to understand it, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.

(C)NISHIO Hirokazu / Converted from Markdown (en)
Source: [GitHub] / [Scrapbox]