The paper's section "3.2 Merging in the Data Flow Space" says the following on this point:
>In our initial effort in this domain, we limit ourselves to serial connections and non-adaptive configurations, deferring the investigation of more flexible model merging to future work. Concretely, with a collection of $N$ models and a budget $T$, our method searches for a sequence of layer indices $L^{(t)}_{i,j}$ that delineates the path all the tokens should follow for a specific task. Here $L_{i,j}$ denotes the $j$-th layer in the $i$-th model, with $t \in [1, T]$ marking the step in the inference path.
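
To make the quoted definition concrete, here is a minimal sketch of what "a sequence of layer indices" means in practice. It is only an illustration, not the paper's implementation: the layers are toy functions on a single number, `run_path` is a hypothetical helper name, and indices are 0-based here whereas the paper counts from 1.

```python
from typing import Callable, List, Sequence, Tuple

# A "layer" here is just a function on a float (a stand-in for a transformer block).
Layer = Callable[[float], float]


def run_path(models: Sequence[Sequence[Layer]],
             path: Sequence[Tuple[int, int]],
             x: float) -> float:
    """Apply layers in the order given by `path`.

    Each entry (i, j) selects layer j of model i; the position of the
    entry in `path` plays the role of the step t.
    """
    for i, j in path:
        x = models[i][j](x)
    return x


# Two toy "models" with two layers each (hypothetical, for illustration only).
model_a: List[Layer] = [lambda x: x + 1.0, lambda x: x * 2.0]
model_b: List[Layer] = [lambda x: x - 3.0, lambda x: x * x]
models = [model_a, model_b]

# A path with T = 3 steps: A's layer 0, then B's layer 1, then A's layer 1.
path = [(0, 0), (1, 1), (0, 1)]
result = run_path(models, path, 2.0)
print(result)  # 18.0
```

The layer weights are never modified; only the order in which layers are visited changes. The search described in the quote would then look for the `path` that works best for a given task.
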
In addition, toward the end of "3.2 Merging in the Data Flow Space" there is the following passage:
>In this setting, a layer may face an input whose distribution is different from what it is used to (from its original model), leading to unexpected outputs. For example, our preliminary studies show that swapping a pair of neighboring layers in a language model makes its performance drop. Although more theoretical studies are needed to model the distribution shift, empirically we find that appropriately scaling an input that wishes to go from layer $i$ to $j$ by $W_{ij}$ help alleviate the problem.
>Recent analysis and discoveries imply that knowledge is stored distributedly in language models [14, 35, 36], suggesting simple yet novel model merging possibilities in the data flow space (DFS). Unlike merging in PS, model merging in DFS preserves the original weights of each layer intact. Instead, it optimizes the inference path that tokens follow as they traverse through the neural network. For example, after the i-th layer in model A, a token may be directed to the j-th layer in model B.
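
The scaling idea in the quotes above can be sketched roughly as follows. This is an assumption-laden illustration rather than the paper's code: `run_scaled_path` is a hypothetical name, the layers are toy functions on small vectors, all layers from all models are assumed to be flattened into one list so that a single index identifies a layer, and the scale values are made up. The quoted passages do not say how $W_{ij}$ is obtained, so the sketch simply takes it as given.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def run_scaled_path(layers: Sequence[Layer],
                    path: Sequence[int],
                    scales: Dict[Tuple[int, int], float],
                    x: Vector) -> Vector:
    """Follow `path`, scaling the hidden state on each hop from layer i to layer j.

    `scales[(i, j)]` stands in for W_ij; hops without an entry use 1.0
    (no scaling). The layers themselves are left untouched.
    """
    prev = None
    for nxt in path:
        if prev is not None:
            w = scales.get((prev, nxt), 1.0)
            x = [w * v for v in x]
        x = layers[nxt](x)
        prev = nxt
    return x


# Three toy layers (hypothetical); imagine 0 and 1 come from model A, 2 from model B.
layers: List[Layer] = [
    lambda v: [a + 1.0 for a in v],  # layer 0
    lambda v: [a * 2.0 for a in v],  # layer 1
    lambda v: [a - 1.0 for a in v],  # layer 2
]

# Made-up scale: shrink the input when jumping from layer 0 to layer 2.
scales = {(0, 2): 0.5}

out = run_scaled_path(layers, [0, 2, 1], scales, [1.0, 3.0])
print(out)  # [0.0, 2.0]
```

The point of the design is that the jump from one model's layer into another model's layer is the only place where anything is adjusted, which matches the quote's statement that DFS merging keeps the original weights of each layer intact.
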