Microsoft Translator enhanced with Z-code Mixture of Experts models

Z-code multilingual model representation diagram

Translator, a Microsoft Azure Cognitive Service, is adopting Z-code Mixture of Experts models, a breakthrough AI technology that significantly improves the quality of production translation models. As a component of Microsoft’s larger XYZ-code initiative to combine AI models for text, vision, audio, and language, Z-code supports the creation of AI systems that can speak, see, hear, and understand. This effort is a part of Azure AI and Project Turing, focusing on building multilingual, large-scale language models that support various production teams. Translator is using NVIDIA GPUs and Triton Inference Server to deploy and scale these models efficiently for high-performance inference. Translator is the first machine translation provider to introduce this technology live for customers.

Z-code MoE boosts efficiency and quality

Z-code models utilize a new architecture called Mixture of Experts (MoE), in which different parts of the model can learn different tasks. The models learn to translate between multiple languages at the same time. A Z-code MoE model has many more parameters but dynamically selects which of them to use for a given input. This enables the model to specialize subsets of its parameters (experts) during training. At runtime, the model activates only the experts relevant to the task, which is more computationally efficient than using all of the model's parameters.

Figure 1: Z-code MoE model translating from English to French. The model dynamically selects subsets of its parameters to be utilized for each input.
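
To make the expert-routing mechanism described above concrete, here is a minimal, self-contained sketch of a Mixture-of-Experts feed-forward layer with top-k gating, written in PyTorch. It is illustrative only; the layer sizes, gating function, and expert structure are assumptions and not the actual Z-code implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative Mixture-of-Experts feed-forward layer with top-k gating."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, num_experts)  # learned router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)             # routing probabilities
        topk_probs, topk_idx = probs.topk(self.k, dim=-1)   # pick k experts per token
        out = torch.zeros_like(x)
        for expert_id, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = topk_idx[:, slot] == expert_id        # tokens routed here
                if mask.any():
                    weight = topk_probs[mask, slot].unsqueeze(-1)
                    out[mask] += weight * expert(x[mask])    # only chosen experts run
        return out

# Toy usage: 8 tokens routed across 4 experts, 2 experts active per token.
layer = TopKMoELayer(d_model=16, d_ff=64, num_experts=4, k=2)
print(layer(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```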

Newly introduced Z-code MoE models leverage transfer learning, which enables efficient knowledge sharing across similar languages. Moreover, the models utilize both parallel and monolingual data during training. This opens the way to high-quality machine translation beyond the high-resource languages and improves the quality of low-resource languages that lack significant training data. This approach can have a positive impact on AI fairness, since both high-resource and low-resource languages see improvements.
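
The post does not detail how monolingual data enters training, but one standard technique in neural machine translation, shown here purely as a hedged illustration, is back-translation: monolingual target-language sentences are translated into the source language by a reverse model to create synthetic parallel pairs.

```python
# Hypothetical sketch of back-translation, one common way to turn monolingual
# target-language text into extra parallel training data. `reverse_translate`
# is a placeholder for a target->source translation model, not part of Z-code.
from typing import Callable, List, Tuple

def back_translate(
    monolingual_targets: List[str],
    reverse_translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Return synthetic (source, target) pairs: real targets, generated sources."""
    return [(reverse_translate(tgt), tgt) for tgt in monolingual_targets]

# Toy stand-in reverse model, used only so the sketch runs end to end.
toy_reverse = lambda sentence: f"<synthetic source for: {sentence}>"
print(back_translate(["Bonjour tout le monde."], toy_reverse))
```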

We have trained translation systems for research purposes with 200 billion parameters supporting 100 language pairs. While such large systems significantly improved translation quality, they also introduced challenges for deploying them cost-effectively in a production environment. For our production deployment, we opted to train a set of 5-billion-parameter models, which are 80 times larger than our currently deployed models. We trained one multilingual model per set of languages, where each model can serve up to 20 language pairs and therefore replace up to 20 of the current systems. This let each model maximize transfer learning among languages while remaining deployable at a reasonable runtime cost. We compared the quality of the new MoE models to the current production systems using human evaluation. The figure below shows the results on various language pairs. The Z-code MoE systems outperformed the individual bilingual systems, with an average improvement of 4 percent. For instance, the models improved English to French translations by 3.2 percent, English to Turkish by 5.8 percent, Japanese to English by 7.6 percent, English to Arabic by 9.3 percent, and English to Slovenian by 15 percent.

Figure 2: Quality gains of Z-code MoE models over existing models. Languages are ordered by training data sizes.
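
The paragraph above notes that a single multilingual model can serve up to 20 language pairs. A common way to do this, shown here as an assumption rather than a confirmed detail of Z-code, is to prepend a target-language tag to each input so one deployed model can be steered to any supported direction.

```python
# Hypothetical sketch: routing many language pairs through one multilingual
# model by prepending a target-language tag, a common multilingual NMT recipe.
SUPPORTED_TARGETS = {"fr", "tr", "ar", "sl", "en"}  # illustrative subset

def tag_for_multilingual_model(text: str, target_lang: str) -> str:
    """Prefix the input with the target-language token the model was trained on."""
    if target_lang not in SUPPORTED_TARGETS:
        raise ValueError(f"Unsupported target language: {target_lang}")
    return f"<2{target_lang}> {text}"

# One deployed model can then serve every supported direction.
print(tag_for_multilingual_model("Hello, world", "fr"))  # "<2fr> Hello, world"
print(tag_for_multilingual_model("Hello, world", "tr"))  # "<2tr> Hello, world"
```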

Training large models with billions of parameters is challenging. The Translator team collaborated with the Microsoft DeepSpeed team to develop a high-performance training system for massive-scale Z-code MoE models, enabling us to efficiently scale and deploy Z-code models for translation.
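
As a hedged sketch of what such training code can look like, DeepSpeed exposes a Mixture-of-Experts layer that wraps an ordinary feed-forward expert. The sizes and expert counts below are placeholders, not the actual Z-code training configuration, and in practice the layer sits inside a model trained under the DeepSpeed runtime.

```python
import torch.nn as nn
from deepspeed.moe.layer import MoE  # DeepSpeed's Mixture-of-Experts layer

d_model, d_ff = 1024, 4096  # placeholder sizes, not the production configuration

# A single feed-forward "expert"; DeepSpeed replicates it num_experts times
# and adds a learned gate that routes each token to k experts.
expert = nn.Sequential(
    nn.Linear(d_model, d_ff),
    nn.ReLU(),
    nn.Linear(d_ff, d_model),
)

# Typically constructed inside a transformer layer of a model launched with
# the DeepSpeed engine (deepspeed.initialize), which provides the
# distributed/expert-parallel setup.
moe_ffn = MoE(
    hidden_size=d_model,
    expert=expert,
    num_experts=64,  # placeholder expert count
    k=1,             # experts activated per token
)
# Its forward pass returns the layer output plus an auxiliary
# load-balancing loss used during training.
```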

We partnered with NVIDIA to build optimized inference engines for running the new Z-code MoE models on GPUs at runtime. NVIDIA developed custom CUDA kernels and leveraged the CUTLASS and FasterTransformer libraries to implement MoE layers efficiently on a single V100 GPU. This implementation achieved up to 27x throughput improvements over standard GPU (PyTorch) runtimes. We used NVIDIA's open-source Triton Inference Server to serve Z-code MoE models, relying on Triton's dynamic batching feature to pool several requests into one large batch for higher throughput, which enabled us to ship large models with relatively low runtime costs.
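
For context on the serving side, below is a hedged sketch of sending a request to a Triton Inference Server with the Python client. The model name, tensor names, and shapes are placeholders, and dynamic batching itself is enabled server-side in the model's configuration (for example via preferred batch sizes and a maximum queue delay), not in client code.

```python
import numpy as np
import tritonclient.http as httpclient

# Hypothetical model and tensor names; the real deployment details differ.
client = httpclient.InferenceServerClient(url="localhost:8000")

text = np.array([[b"Hello, world"]], dtype=np.object_)         # one request
inp = httpclient.InferInput("INPUT_TEXT", text.shape, "BYTES")
inp.set_data_from_numpy(text)

# Triton's dynamic batcher (configured in the model's config.pbtxt) can pool
# many such single-sentence requests into one large batch before inference.
result = client.infer(model_name="zcode_moe_en_fr", inputs=[inp])
print(result.as_numpy("OUTPUT_TEXT"))
```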

How can you use the new Z-code models?

Z-code models are available now by invitation to customers using Document Translation, a feature that translates entire documents, or volumes of documents, in a variety of file formats while preserving their original formatting. Z-code models will be made available to all customers and to other Translator products in phases. Please fill out this form to request access to Document Translation using Z-code models.
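
Below is a hedged sketch of submitting a Document Translation batch job with Python. The resource endpoint, key, and SAS-protected storage container URLs are placeholders you would replace with your own.

```python
import requests

# Placeholders: substitute your own Translator resource, key, and
# SAS-protected Azure Blob Storage container URLs.
endpoint = "https://<your-translator-resource>.cognitiveservices.azure.com"
route = "/translator/text/batch/v1.0/batches"
headers = {
    "Ocp-Apim-Subscription-Key": "<your-key>",
    "Content-Type": "application/json",
}
body = {
    "inputs": [
        {
            "source": {"sourceUrl": "<source-container-url-with-SAS>"},
            "targets": [
                {"targetUrl": "<target-container-url-with-SAS>", "language": "fr"}
            ],
        }
    ]
}

response = requests.post(endpoint + route, headers=headers, json=body)
# A successful submission returns 202 Accepted with an Operation-Location
# header that can be polled for the batch job's status.
print(response.status_code, response.headers.get("Operation-Location"))
```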

Acknowledgements

The following people contributed to this work: Abdelrahman Abouelenin, Ahmed Salah, Akiko Eriguchi, Alex Cheng, Alex Muzio, Amr Hendy, Arul Menezes, Brad Ballinger, Christophe Poulain, Evram Narouz, Fai Sigalov, Hany Hassan Awadalla, Hitokazu Matsushita, Mohamed Afify, Raffy Bekhit, Rohit Jain, Steven Nguyen, Vikas Raunak, Vishal Chowdhary, and Young Jin Kim.
