Mixture of Experts - Has open-source caught up with OpenAI?

This is a short post aiming to provide some accessible context on the concept of MoEs, not a fully comprehensive or technical look at the technology. See the last section for further reading. That being said, please do reach out if you spot any issues with this or any other posts.

We’re less than two weeks into 2024 and there’s already a ton of AI news and published research. One recurring theme has been the resurfacing of Mixture of Experts (MoE) models. On December 8th 2023, Paris-based open-source AI company MistralAI tweeted a magnet link to a torrent for their new model.

/images/posts/moe/tweet.png

The Mixtral of Experts Model

This model, Mixtral-8x7B, is a pretrained generative ‘Sparse Mixture of Experts’ large language model, trained with a context window of 32k tokens and totalling 46.7B parameters (of which only 12.9B are used per token during inference). MistralAI’s “Mixtral of Experts” paper, published 8th Jan 2024, reports that the model outperforms or matches Llama 2 70B and GPT3.5 across all of their evaluated benchmarks. Although we don’t know the architecture of GPT3.5, outperforming the 70-billion-parameter Llama 2 model while using only 12.9B parameters per token during inference is an incredible feat.

Alongside the foundational model, MistralAI also released an instruct-tuned version.

How does a basic Mixture of Experts model work?

A Mixture of Experts (MoE) model has a number of ‘experts’: think of these as distinct, trained neural networks, where each expert can be trained to specialise in a different task. During inference, each token first goes through a routing/gating layer that decides which of the experts is best suited to the current input (a minimal sketch of such a routing layer follows the figure below).

/images/posts/moe/moe-rlm.png Figure from ‘Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer’
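To make this concrete, here is a minimal sketch of a top-k routing/gating layer written in PyTorch. The class name `SimpleRouter`, the dimensions and the top-2 choice are illustrative assumptions rather than any particular implementation.

```python
import torch
import torch.nn as nn

class SimpleRouter(nn.Module):
    """Scores each token against every expert and keeps the top-k."""
    def __init__(self, hidden_dim: int, num_experts: int, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # One linear layer produces a score for each expert.
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (num_tokens, hidden_dim)
        logits = self.gate(x)                        # (num_tokens, num_experts)
        top_logits, top_idx = logits.topk(self.top_k, dim=-1)
        weights = torch.softmax(top_logits, dim=-1)  # normalise over the chosen experts only
        return weights, top_idx

router = SimpleRouter(hidden_dim=4096, num_experts=8, top_k=2)
tokens = torch.randn(5, 4096)
weights, expert_ids = router(tokens)
print(expert_ids)  # e.g. tensor([[3, 1], [7, 0], ...]) - two experts chosen per token
```

Each token ends up with its own pair of experts plus a pair of weights saying how much each expert’s output should count; the remaining experts never run for that token.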

During training, we are optimising so that each individual expert gets better at its own task AND the routing layer gets better at allocating each input to the correct expert. The intuition is that a dense LLM (e.g. Llama 2 70B) computes over all of its parameters for every token, including ones unrelated to the task at hand - wasting compute. In contrast, a MoE model will only invoke the parameters required for the task at hand, increasing resource efficiency, with the additional benefit of allowing experts to be trained or finetuned separately and in a distributed fashion.

Note: Although compute is more efficient, memory (VRAM) usage remains high. This is because every expert has to be loaded and ready, yet during inference only the selected experts’ matrix multiplications are carried out - e.g. roughly two 7B-sized sets of feed-forward matmuls per token.
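As a rough illustration of this trade-off, assuming fp16/bf16 weights at 2 bytes per parameter and using the parameter counts quoted above (treat the output as an estimate, not a measurement):

```python
# Back-of-envelope memory vs compute for a Mixtral-sized MoE model.
# Assumes fp16/bf16 weights (2 bytes per parameter); parameter counts
# are the figures quoted above, so the results are approximate.
total_params = 46.7e9    # every expert must be resident in VRAM
active_params = 12.9e9   # parameters actually used per token (2 of 8 experts)
bytes_per_param = 2      # fp16 / bf16

vram_gb = total_params * bytes_per_param / 1e9
print(f"Weights to hold in VRAM: ~{vram_gb:.0f} GB")                                     # ~93 GB
print(f"Fraction of parameters computed per token: {active_params / total_params:.0%}")  # ~28%
```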

Modality Experts

MoE models also have the benefit that experts can be trained on different modalities such as code, images, video, etc. One example is ‘LIMoE’, a language-image MoE model. Another is MoE-Fusion, a model used for infrared and visible image fusion, integrating information from multiple sources to achieve strong performance on a range of practical tasks.

What was MistralAI’s approach?

MistralAI’s Mixtral-8x7B model is a decoder-only LLM that uses the same structure as a standard MoE model, but with eight experts per layer, each of which is a distinct set of feed-forward weights built on the Mistral-7B architecture. Mixtral-8x7B routes each token to two experts during inference - let’s step through the process.

/images/posts/moe/switchtransformer.png Figure from ‘Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity’

First, the embedded input tokens go through a standard self-attention mechanism, whose output is added to the residual and normalised (the same as in a standard transformer). Our embedding then goes into the routing layer, which selects two of the eight experts (two different sets of feed-forward weights) for each token. The outputs of these experts are combined as a weighted sum, once again added and normalised, to finally produce our output.
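Below is a rough sketch of what such a sparse MoE feed-forward block could look like in PyTorch. It is a simplification under a few assumptions: the expert is a SwiGLU-style MLP with Mistral-7B-like dimensions, class names like `SparseMoEBlock` are made up for illustration, and the routing loop is written for readability rather than speed (real implementations batch tokens per expert).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block (three weight matrices)."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w1 = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.w2 = nn.Linear(ffn_dim, hidden_dim, bias=False)
        self.w3 = nn.Linear(hidden_dim, ffn_dim, bias=False)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

class SparseMoEBlock(nn.Module):
    def __init__(self, hidden_dim=4096, ffn_dim=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(hidden_dim, num_experts, bias=False)
        self.experts = nn.ModuleList(
            SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(num_experts)
        )

    def forward(self, x):                               # x: (num_tokens, hidden_dim)
        scores = F.softmax(self.gate(x), dim=-1)        # router score per expert
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only the two best experts
        weights = weights / weights.sum(-1, keepdim=True)  # renormalise over the chosen two
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e                   # tokens whose k-th pick is expert e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return out                                      # weighted sum of two experts' outputs
```

Stacking this block after the attention sub-layer, with the usual residual connections and normalisation, gives one Mixtral-style transformer layer.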

Confusingly, the new model has access to 46.7B parameters rather than the 56B you might expect from eight lots of 7B. This is because the experts only replace the feed-forward layers - the attention layers, embeddings and other parameters are shared.
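As a rough back-of-envelope, assuming the published Mistral-7B dimensions (hidden size 4096, SwiGLU feed-forward size 14336, 32 layers) - approximate numbers, not an official breakdown:

```python
# Rough breakdown of where Mixtral-8x7B's 46.7B parameters live, assuming
# Mistral-7B-like dimensions. Treat the results as approximate.
hidden, ffn, layers, experts = 4096, 14336, 32, 8

ffn_per_expert_per_layer = 3 * hidden * ffn                   # SwiGLU has three weight matrices
expert_params = layers * experts * ffn_per_expert_per_layer   # duplicated per expert, ~45.1B
shared_params = 46.7e9 - expert_params                        # attention, embeddings, norms, routers

active = shared_params + layers * 2 * ffn_per_expert_per_layer  # only 2 experts run per layer
print(f"Expert FFN params: {expert_params / 1e9:.1f}B, shared params: {shared_params / 1e9:.1f}B")
print(f"Active per token:  {active / 1e9:.1f}B")              # ~12.9B - matches the quoted figure
```

Almost all of the extra parameters come from duplicating the feed-forward layers eight times, which is why only around 12.9B of the 46.7B are touched for any given token.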

Is GPT4 a MoE model?

OpenAI has been very secretive about the architecture of their models, especially from GPT3.5 onwards. The open-source community has been working hard to replicate the performance of even GPT3.5, with minimal success. However, a number of rumours have been circulating on the internet suggesting that GPT4 is a MoE model made up of eight 220B-parameter FFN experts (totalling around 1.7T accessible parameters!!). The rumour was likely started by George Hotz, founder of Comma.ai, and was later reaffirmed by Soumith Chintala, co-creator of PyTorch at Meta, and other experts.

Whilst we will likely never hear from OpenAI directly, a MoE structure appears to be one component that has considerably helped OpenAI’s models balance performance against compute cost.

Fin

Thanks for reading my post! I hope it’s helped introduce you to mixture of experts models and the intuition behind them. I’m working on a long, beginner-friendly post on the intuition behind LLMs and AI, which I hope to publish soon. I’ll mostly be focusing on my PhD research, but I hope to post more consistently this year. If you’re looking for something else to read in the meantime, check out my Understanding Neural Networks with Iris Dataset post, or see the further reading section below for more about MoEs :).

Questions, or want to chat about this post? Shoot me an email or message/tweet me over on X: @LewisNWatson

Further Reading:

Please get in touch if you have any recommendations to add here!