Evolutionary Optimization of Model Merging Recipes
This paper proposes a new approach to optimize model merging recipes using evolutionary algorithms. The main points are as follows:
In short, this is an ambitious study that presents a new paradigm for automating and optimizing model merging with evolutionary algorithms. It is particularly impressive in demonstrating that models from different fields can be merged to bring out new capabilities.
What is "Simultaneous optimization of inference paths as well as model weights"?
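The evolutionary model merging approach proposed in the paper optimizes in two different spaces when combining models: the parameter space (PS), which covers the weights, and the data flow space (DFS), which covers the inference path.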
By simultaneously optimizing the merge settings in these two spaces, we can create a model that not only combines weights, but also takes into account the flow of data during inference. This allows us to find a model in which both the mixing ratio of the weights and the inference path are optimal.
This exploration of both the parameter space and the data flow space is a key feature of the method and one of the reasons for its strong performance. Compared to simply averaging weights, it allows more flexibility in combining models.
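As a point of reference for the parameter-space side, here is a minimal sketch of mixing weights with a given ratio. It is plain linear interpolation on toy data, not the paper's recipe; `merge_weights`, `model_a`, and `model_b` are hypothetical names introduced only for this illustration.

```python
def merge_weights(models, ratios):
    """Mix parameters of same-shaped models using the given mixing ratios.

    `models` is a list of dicts mapping a layer name to a flat list of floats.
    This is an illustrative assumption, not the data layout used in the paper.
    """
    if len(models) != len(ratios):
        raise ValueError("need exactly one ratio per model")
    total = sum(ratios)
    normalized = [r / total for r in ratios]  # make the ratios sum to 1
    merged = {}
    for name in models[0]:
        size = len(models[0][name])
        merged[name] = [
            sum(w * m[name][k] for w, m in zip(normalized, models))
            for k in range(size)
        ]
    return merged


# Toy "models" with two layers of two parameters each (made-up numbers).
model_a = {"layer0": [1.0, 2.0], "layer1": [0.0, 4.0]}
model_b = {"layer0": [3.0, 0.0], "layer1": [2.0, 0.0]}

uniform = merge_weights([model_a, model_b], [0.5, 0.5])   # simple averaging
skewed = merge_weights([model_a, model_b], [0.75, 0.25])  # favors model_a
```

Simple averaging is the special case where every ratio is equal; the search described in this section is about choosing those ratios instead of fixing them by hand.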
How do you find the best mixing ratio when mixing weights?
The paper uses one of the evolutionary algorithms, CMA-ES (Covariance Matrix Adaptation Evolution Strategy), to find the optimal mixing ratio of the weights.
CMA-ES is a powerful method for solving continuous optimization problems. The general flow is as follows:
In this way, CMA-ES searches for good mixing weights through repeated evaluation and updating. The search is efficient because it concentrates on the vicinity of points with high fitness.
CMA-ES also has theoretical underpinnings and can be applied to non-convex problems, so it is often used in neural network optimization.
In the paper, this method is combined with existing merging methods such as TIES-Merging and DARE to further improve performance: CMA-ES finds the mixing weights, and the models are combined with these weights to create a powerful model suited to the task at hand.
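The sample, evaluate, update loop can be sketched as follows. This is a deliberately simplified evolution strategy, not CMA-ES itself: real CMA-ES also adapts a full covariance matrix and its step size, which this sketch replaces with a fixed decay. The fitness function is a hypothetical stand-in for "task score of the merged model", and all names and constants are assumptions made for illustration.

```python
import random


def simple_es(fitness, dim, generations=60, population=16, seed=0):
    """Simplified evolution strategy (NOT full CMA-ES): no covariance adaptation."""
    rng = random.Random(seed)
    mean = [0.5] * dim          # starting guess for the mixing ratios
    sigma = 0.3                 # sampling spread around the mean
    n_elite = population // 4
    best_x, best_f = list(mean), fitness(mean)
    for _ in range(generations):
        # 1. Sample candidate solutions around the current mean.
        candidates = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(population)
        ]
        # 2. Evaluate every candidate and rank them (higher fitness is better).
        ranked = sorted(candidates, key=fitness, reverse=True)
        elite = ranked[:n_elite]
        # 3. Move the mean toward the best candidates.
        mean = [sum(x[d] for x in elite) / n_elite for d in range(dim)]
        # 4. Narrow the search (a crude substitute for CMA-ES step-size control).
        sigma *= 0.95
        top_f = fitness(ranked[0])
        if top_f > best_f:
            best_x, best_f = ranked[0], top_f
    return best_x, best_f


# Hypothetical fitness: pretend the merged model scores best at ratios (0.7, 0.2).
TARGET = [0.7, 0.2]


def toy_fitness(ratios):
    return -sum((r - t) ** 2 for r, t in zip(ratios, TARGET))


best_ratios, best_score = simple_es(toy_fitness, dim=2)
```

In an actual merging run, each fitness evaluation would mean building a merged model and scoring it on the task, which is far more expensive than this toy function.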
How do you optimize in the data flow space?
Optimizing model merging in the data flow space (DFS) is done by using evolutionary algorithms to explore the layer connections that determine the flow of data during inference. In the paper, the following approach is taken:
[1, 0, 1, 1, 0] represents the use of the first, third and fourth layers.
This process allows for optimal model merging in the data flow space. Importantly, layer connections are directly incorporated into the search space. This allows the layers of the source models to be flexibly combined to find the best inference path for the task.
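To make the binary encoding concrete, the following sketch decodes such a vector and runs an input through only the selected layers. The layers are toy functions on numbers, and `active_layers` and `run_selected` are hypothetical helper names, not anything taken from the paper.

```python
def active_layers(mask):
    """Return the 1-based positions of the layers switched on in the mask."""
    return [pos + 1 for pos, bit in enumerate(mask) if bit == 1]


def run_selected(layers, mask, x):
    """Apply, in order, only the layers whose mask bit is 1."""
    for layer, bit in zip(layers, mask):
        if bit == 1:
            x = layer(x)
    return x


# Toy stand-ins for network layers (assumption: plain functions on a number).
layers = [
    lambda x: x + 1,    # layer 1
    lambda x: x * 2,    # layer 2 (skipped below)
    lambda x: x - 3,    # layer 3
    lambda x: x * 10,   # layer 4
    lambda x: x + 100,  # layer 5 (skipped below)
]

mask = [1, 0, 1, 1, 0]
used = active_layers(mask)              # [1, 3, 4]
output = run_selected(layers, mask, 5)  # ((5 + 1) - 3) * 10 = 30
```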
Another unique feature is the introduction of a weight matrix for connecting layers between different models. This makes it possible not only to connect the layers but also to absorb differences in the distribution of the data, which helps prevent performance degradation.
I'm not sure what you mean by "data flow during inference".
Yes, "data flow during inference" may be a bit confusing. In the context of a neural network, it means "the order of the layers that the input data passes through during inference."
Neural networks are usually composed of multiple layers connected in sequence. For example, input layer, intermediate layer 1, intermediate layer 2, output layer, and so on. Usually, input data passes through each layer in turn starting from the input layer, and finally the result is output from the output layer.
However, the model merging in the data flow space proposed in this paper contemplates changing the connection order of these layers. For example, after using the input layer and intermediate layer 1 of model A, connect to intermediate layer 2 of model B, and so on, combining the layers of different models to create a new network structure.
In this new network structure, the order of the layers through which data passes is different from the original model. In other words, the "flow of data during inference" changes.
The paper attempts to find a network structure suitable for the task by using evolutionary algorithms to optimize the way these layers are connected, that is, "the flow of data during inference."
"Change the connection order of the layers"? Is that right? Is this common? Where is this written in the paper?
Thank you for pointing this out. Changing the connection order of the layers is not a common neural network training method. I guess my explanation was insufficient.
The approach proposed in this paper is different from the usual neural network training in that it combines layers of different models to create a new network structure.
The "3.2 Merging in the Data Flow Space" section of the paper states the following in this regard:
In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of N models and a budget T, our method searches for a sequence of layer indices L_{i,j}^{(t)} that delineates the path all the tokens should follow for a specific task. Here L_{i,j} denotes the j-th layer in the i-th model, with t ∈ [1, T] marking the step in the inference path.
In other words, we are considering combining the layers of the N models to create a new network structure consisting of T layers. In this new network, data passes through the layers in the order specified by L_{i,j}^{(t)}.
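One way to picture such a path is as a list of (model index, layer index) pairs, one per step, with at most T steps. The sketch below uses zero-based indices and toy layers; `run_path` and the two toy models are assumptions made for illustration, not the paper's implementation.

```python
# Hypothetical toy models: each model is a list of layers (plain functions).
model_0 = [lambda x: x + 1, lambda x: x * 2]
model_1 = [lambda x: x - 1, lambda x: x * 3]
models = [model_0, model_1]  # N = 2


def run_path(models, path, x, budget):
    """Send x through the layers named by path, a list of (i, j) pairs.

    (i, j) selects the j-th layer of the i-th model (zero-based here).
    The path may not be longer than the budget T.
    """
    if len(path) > budget:
        raise ValueError("path exceeds the budget T")
    for i, j in path:
        x = models[i][j](x)
    return x


# Original order of model_0 alone: (2 + 1) * 2 = 6
original = run_path(models, [(0, 0), (0, 1)], 2, budget=3)

# Mixed path across both models: ((2 + 1) * 3) * 2 = 18
mixed = run_path(models, [(0, 0), (1, 1), (0, 1)], 2, budget=3)
```

The weights inside each layer never change here; only the route through the layers does, which is the distinction the section draws between the data flow space and the parameter space.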
Also, at the end of "3.2 Merging in the Data Flow Space" you will find the following statement:
In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies show that swapping a pair of neighboring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically we find that appropriately scaling an input that wishes to go from layer i to j by W_{ij} help alleviate the problem.
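The scaling idea in this passage can be sketched as multiplying the activations by a per-transition factor before they enter the next layer. The values in `W` below are arbitrary, the layers are toy functions, and how the scaling factors are actually obtained is outside this sketch; `run_scaled_path` is a hypothetical name.

```python
def run_scaled_path(layers, path, W, x):
    """Run x through layers in path order, scaling by W[i][j] when moving i -> j."""
    prev = None
    for j in path:
        if prev is not None:
            # Rescale the activations leaving layer `prev` before entering layer `j`.
            x = [W[prev][j] * v for v in x]
        x = layers[j](x)
        prev = j
    return x


# Toy layers acting on a small activation vector (assumption for illustration).
layers = [
    lambda v: [a + 1.0 for a in v],  # layer 0
    lambda v: [a * 2.0 for a in v],  # layer 1
    lambda v: [a - 1.0 for a in v],  # layer 2
]

# W[i][j] is the scale applied when going from layer i to layer j (made-up values).
W = [
    [1.0, 1.0, 0.5],
    [1.0, 1.0, 1.0],
    [1.0, 1.0, 1.0],
]

# Path 0 -> 2 -> 1:
#   layer 0: [2.0, 4.0]; scale by W[0][2] = 0.5: [1.0, 2.0]
#   layer 2: [0.0, 1.0]; scale by W[2][1] = 1.0: [0.0, 1.0]
#   layer 1: [0.0, 2.0]
out = run_scaled_path(layers, [0, 2, 1], W, [1.0, 3.0])
```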
Here, the paper notes that performance may degrade because the distribution of the input a layer receives changes when layers are connected in a new order.
Thus, the paper proposes an uncommon approach to optimizing inference paths by combining different layers of models. In my initial attempt to explain this point clearly, I used inaccurate wording. I apologize for this.
(Read the paper here)
Recent analysis and discoveries imply that knowledge is stored distributedly in language models [14, 35, 36], suggesting simple yet novel model merging possibilities in the data flow space (DFS). Unlike merging in PS, model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the i-th layer in model A, a token may be directed to the j-th layer in model B.
This page is auto-translated from [/nishio/Evolutionary Optimization of Model Merging Recipes](https://scrapbox.io/nishio/Evolutionary Optimization of Model Merging Recipes) using DeepL. If something looks interesting but the auto-translated English is not good enough to understand, feel free to let me know at @nishio_en. I'm very happy to spread my thoughts to non-Japanese readers.