Soup-of-Experts: Pretraining Specialist Models via Parameters Averaging

Large-scale models are routinely trained on a mixture of different data sources.
Different data mixtures can yield very different downstream performance.
We propose a novel architecture that can instantiate one model for each data mixture without having to re-train the model.
Our architecture consists of a bank of expert weights, which are linearly combined to instantiate one model.
We learn the linear combination coefficients as a function of the histogram that specifies the desired data mixture.
To train this architecture, we sample a random histogram, instantiate the corresponding model, and backpropagate through one batch of data…
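
Below is a minimal PyTorch sketch of this idea, not the authors' implementation. It uses a tiny two-layer MLP whose parameters live in per-expert banks, a small coefficient network that maps a data-mixture histogram to combination coefficients, and a training step that samples a random histogram, instantiates the model, and backpropagates through one batch. All names and dimensions here (`SoupOfExperts`, `coef_net`, `n_experts`, `n_domains`, the MLP shapes) are illustrative assumptions.

```python
# Minimal sketch of a soup-of-experts: a bank of expert weights is linearly
# combined into one model, with coefficients predicted from a mixture histogram.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SoupOfExperts(nn.Module):
    """Bank of expert weights, linearly combined to instantiate one model."""

    def __init__(self, in_dim, hidden_dim, out_dim, n_domains, n_experts):
        super().__init__()
        # Expert banks: each parameter gets a leading dimension of size n_experts.
        self.w1 = nn.Parameter(torch.randn(n_experts, hidden_dim, in_dim) * 0.02)
        self.b1 = nn.Parameter(torch.zeros(n_experts, hidden_dim))
        self.w2 = nn.Parameter(torch.randn(n_experts, out_dim, hidden_dim) * 0.02)
        self.b2 = nn.Parameter(torch.zeros(n_experts, out_dim))
        # Coefficient network: maps a data-mixture histogram to expert coefficients.
        self.coef_net = nn.Sequential(
            nn.Linear(n_domains, 64), nn.ReLU(), nn.Linear(64, n_experts)
        )

    def instantiate(self, histogram):
        """Combine the expert banks with coefficients predicted from the histogram."""
        alpha = self.coef_net(histogram)  # shape: (n_experts,)
        combine = lambda bank: torch.einsum("e,e...->...", alpha, bank)
        return combine(self.w1), combine(self.b1), combine(self.w2), combine(self.b2)

    def forward(self, x, histogram):
        w1, b1, w2, b2 = self.instantiate(histogram)
        h = F.relu(F.linear(x, w1, b1))
        return F.linear(h, w2, b2)


def train_step(model, optimizer, batch_x, batch_y, n_domains):
    """One step: sample a random histogram, instantiate, backprop through one batch.

    In the setup described above the batch would be drawn according to the sampled
    histogram; here the batch is passed in for brevity.
    """
    histogram = torch.distributions.Dirichlet(torch.ones(n_domains)).sample()
    logits = model(batch_x, histogram)
    loss = F.cross_entropy(logits, batch_y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At deployment time, under these assumptions, a specialist model for any target mixture is obtained by calling `instantiate` once with that mixture's histogram, with no further training.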