Introduction to Mixture of Experts

Experts

Routing

Multi-device parallelism at inference (individual experts placed on different devices).

Reduce the active weights during inference.

MoEs don’t reduce the total number of weights loaded onto the devices. They reduce the number of active parameters required to process a single token by routing it to a subset of experts, with the load balanced across multiple experts/devices.
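
The total-versus-active distinction can be made concrete with a little arithmetic. This is a minimal sketch, assuming each expert is a two-matrix feedforward block and ignoring biases and the router; the function name and the sizes (d_model, d_ff, num_experts, top_k) are made up for illustration.

```python
def moe_ffn_params(d_model, d_ff, num_experts, top_k):
    """Return (total, active) FFN weight counts for one MoE layer."""
    per_expert = 2 * d_model * d_ff    # up-projection + down-projection
    total = num_experts * per_expert   # every expert stays loaded in memory
    active = top_k * per_expert        # only the selected experts run per token
    return total, active


total, active = moe_ffn_params(d_model=1024, d_ff=4096, num_experts=8, top_k=2)
print(total, active)  # 67108864 16777216
```

With these assumed sizes, all 8 experts must be held in memory, but each token only touches a quarter of those weights.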

Feedforward layers do not combine information across tokens. They transform each token in a “point-wise” manner.
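
A small check of the point-wise property, as a sketch with hypothetical toy weights (W1, W2) and plain Python lists standing in for tensors: the feedforward block is applied to each token separately, so changing one token cannot affect the outputs for the others.

```python
def relu(x):
    return [max(0.0, v) for v in x]


def matvec(w, x):
    # w is a list of rows; returns the matrix-vector product w @ x
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def ffn(token, w1, w2):
    """Two-layer feedforward block applied to a single token vector."""
    return matvec(w2, relu(matvec(w1, token)))


W1 = [[1.0, -1.0], [0.5, 2.0], [-1.0, 0.0]]  # 3x2, hypothetical weights
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]     # 2x3, hypothetical weights

seq = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
out = [ffn(t, W1, W2) for t in seq]

# Replace token 0 only; tokens 1 and 2 produce exactly the same outputs.
seq_changed = [[9.0, 9.0]] + seq[1:]
out_changed = [ffn(t, W1, W2) for t in seq_changed]
```

This independence is what lets an MoE layer send different tokens to different experts without changing what each expert computes.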

The attention layer does combine information across all the tokens in the sequence.
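
For contrast, a stripped-down self-attention sketch. It assumes a single head with no learned projections and no causal mask (queries, keys and values are the raw token vectors), which is a simplification of a real attention layer. Here, changing one token does change the outputs at every other position.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def self_attention(seq):
    """Unprojected single-head self-attention over a list of token vectors."""
    d = len(seq[0])
    out = []
    for q in seq:
        # Scaled dot-product score of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Output is a weighted average of all token vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


seq = [[1.0, 2.0], [3.0, -1.0], [0.0, 0.5]]
out = self_attention(seq)

# Replace token 0 only; the outputs at positions 1 and 2 move as well.
seq_changed = [[9.0, 9.0]] + seq[1:]
out_changed = self_attention(seq_changed)
```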

Dense MoEs

Token-to-Expert Probability Score: tells us how much influence each expert neural network should have on each token.

Router: The router network takes the token embeddings as input and outputs, for each token, a probability for each of the n experts.
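
A sketch of the router and a dense MoE forward pass for one token, assuming the router is a single linear layer followed by a softmax and that "dense" means every expert runs and the outputs are averaged with the router probabilities. The weights and the three toy experts are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]


def router_probs(token, router_w):
    """One logit per expert (one row of router_w each), turned into probabilities."""
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_w]
    return softmax(logits)


def dense_moe(token, router_w, experts):
    """Run every expert and mix the outputs using the token-to-expert probabilities."""
    probs = router_probs(token, router_w)
    outputs = [expert(token) for expert in experts]
    d = len(outputs[0])
    return [sum(p * o[j] for p, o in zip(probs, outputs)) for j in range(d)]


# Hypothetical toy experts standing in for feedforward networks.
experts = [
    lambda x: [2.0 * v for v in x],
    lambda x: [v + 1.0 for v in x],
    lambda x: [-v for v in x],
]
router_w = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]  # 3 experts, 2-dim tokens

token = [0.5, 1.5]
probs = router_probs(token, router_w)
y = dense_moe(token, router_w, experts)
```

Because all experts are evaluated, this form saves no compute. It only shows how the probability scores weight each expert's contribution.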

Router collapse is a failure mode of MoE models in which the router ends up picking the same experts for all types of tokens. It is bad because the other experts available to the network are never explored or utilized (exploration vs. exploitation).
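
One simple way to spot collapse is to look at the fraction of tokens each expert receives. The sketch below is only a diagnostic, not a fix; the helper name expert_load and the two assignment lists are made up for illustration.

```python
from collections import Counter


def expert_load(assignments, num_experts):
    """Fraction of tokens routed to each expert."""
    counts = Counter(assignments)
    n = len(assignments)
    return [counts.get(e, 0) / n for e in range(num_experts)]


balanced = [0, 1, 2, 3, 0, 1, 2, 3]    # expert index chosen for each token
collapsed = [2, 2, 2, 2, 2, 2, 2, 1]   # almost everything goes to expert 2

print(expert_load(balanced, 4))   # [0.25, 0.25, 0.25, 0.25]
print(expert_load(collapsed, 4))  # [0.0, 0.125, 0.875, 0.0]
```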

Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer (each token is routed to at least two experts).
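
A minimal top-k gating sketch with k = 2. It assumes the router logits are already computed and omits the noise term that the paper's gating adds before selecting the top k. The function names and toy experts are hypothetical.

```python
import math


def top_k_gate(logits, k=2):
    """Softmax over the k largest logits; returns {expert_index: weight}."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = [math.exp(logits[i] - m) for i in top]
    z = sum(exps)
    return {i: e / z for i, e in zip(top, exps)}


def sparse_moe(token, logits, experts, k=2):
    """Evaluate only the selected experts and mix their outputs by gate weight."""
    out = [0.0] * len(token)  # assumes experts preserve the token dimension
    for i, g in top_k_gate(logits, k).items():
        y = experts[i](token)
        out = [o + g * v for o, v in zip(out, y)]
    return out


calls = []  # records which experts actually ran


def make_expert(idx, scale):
    def expert(x):
        calls.append(idx)
        return [scale * v for v in x]
    return expert


experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
logits = [0.1, 2.0, -1.0, 2.0]

gates = top_k_gate(logits, k=2)
y = sparse_moe([1.0, -1.0], logits, experts, k=2)
```

Only the two selected experts are ever called, which is where the savings in active parameters come from.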

Sparse MoE

Shared MoE

Switch Transformer

Expert Capacity.
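
A sketch of expert capacity, assuming the commonly cited Switch Transformer definition: capacity = (tokens per batch / number of experts) x capacity factor. Rounding up and the first-come-first-served dispatch are assumptions made here for illustration, as are the function names.

```python
import math


def expert_capacity(tokens_per_batch, num_experts, capacity_factor=1.0):
    """Maximum number of tokens a single expert may process in one batch."""
    return math.ceil(tokens_per_batch / num_experts * capacity_factor)


def dispatch(assignments, num_experts, capacity):
    """Fill experts in token order; tokens that hit a full expert overflow."""
    buckets = [[] for _ in range(num_experts)]
    overflow = []
    for tok, e in enumerate(assignments):
        if len(buckets[e]) < capacity:
            buckets[e].append(tok)
        else:
            overflow.append(tok)
    return buckets, overflow


assignments = [0, 0, 0, 1, 2, 0, 3, 1]  # expert chosen for each of 8 tokens
cap = expert_capacity(len(assignments), num_experts=4, capacity_factor=1.25)
buckets, overflow = dispatch(assignments, 4, cap)

print(cap)       # 3
print(buckets)   # [[0, 1, 2], [3, 7], [4], [6]]
print(overflow)  # [5]
```

A larger capacity factor leaves more headroom for uneven routing at the cost of more padding and memory. Tokens that overflow are not processed by the full expert.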