Introduction

In 2025, Mixture of Experts (MoE) has become the dominant architecture for scaling large language models efficiently. Models like Mixtral 8x7B, DeepSeek-V2, and rumored GPT-4 variants leverage MoE to achieve massive parameter counts while keeping inference costs manageable. Meanwhile, Product of Experts (PoE) continues to play a crucial role in multi-modal learning and ensemble methods.

Mixture of Experts (MoE)

Why MoE Dominates in 2025

Modern LLMs face a fundamental challenge: in a dense model, compute per token grows in direct proportion to parameter count, yet we want trillion-parameter models. MoE solves this with conditional computation - each token activates only the parameters it needs.

The Economics:

  • Dense 70B model: All 70B parameters active per token
  • MoE 8x70B model: Only 2x70B = 140B parameters active per token, but 560B total capacity

This roughly 4x ratio of total capacity to per-token compute (ignoring the attention and embedding weights shared across experts) is why MoE won; a quick sanity check of the arithmetic is sketched below.
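The following back-of-the-envelope helper reproduces the numbers above (a minimal sketch; the function name moe_capacity is illustrative, and real models also carry shared non-expert weights that this ignores):

def moe_capacity(expert_params_b: float, num_experts: int, top_k: int):
    """Capacity vs. active parameters for an MoE expert stack (illustrative only)."""
    total = expert_params_b * num_experts   # parameters the model can store
    active = expert_params_b * top_k        # parameters touched per token
    return total, active, total / active

total, active, ratio = moe_capacity(expert_params_b=70, num_experts=8, top_k=2)
print(f"total={total}B, active={active}B, capacity/active={ratio:.1f}x")
# total=560B, active=140B, capacity/active=4.0x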

Basic Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

# FeedForwardExpert: a standard FFN block (definition omitted in this article)

class ModernMoE(nn.Module):
    """
    Standard MoE layer used in production models (2025)
    """
    def __init__(self, d_model, num_experts=8, top_k=2, expert_capacity=None):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.expert_capacity = expert_capacity  # For capacity factor routing

        # The router (gating network)
        self.router = nn.Linear(d_model, num_experts)

        # The expert networks (typically FFN layers)
        self.experts = nn.ModuleList([
            FeedForwardExpert(d_model, ffn_dim=d_model * 4)
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x shape: [batch, seq_len, d_model]
        batch_size, seq_len, d_model = x.shape

        # Step 1: Routing decision
        router_logits = self.router(x)  # [batch, seq_len, num_experts]

        # Step 2: Select top-k experts per token (see routing strategies below)
        routing_weights, selected_experts = self.route_tokens(router_logits)

        # Step 3: Execute experts (the core innovation is HERE)
        output = self.execute_experts(x, routing_weights, selected_experts)

        return output

Routing Mechanisms: The Critical Innovation

The routing strategy is what differentiates modern MoE implementations. Let’s cover the main approaches used in 2025.

1. Top-K Token Choice Routing (Classic)

Used by: Switch Transformer, GLaM, early MoE models

def token_choice_routing(self, router_logits, top_k=2):
    """
    Each token chooses its top-k experts
    Problem: Load imbalance - popular experts get overloaded
    """
    # Get top-k experts per token
    routing_weights, selected_experts = torch.topk(
        F.softmax(router_logits, dim=-1),
        k=top_k,
        dim=-1
    )

    # Normalize weights of selected experts
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

    return routing_weights, selected_experts

# Example output per token:
# Token 1: Expert 3 (weight=0.7), Expert 7 (weight=0.3)
# Token 2: Expert 3 (weight=0.9), Expert 1 (weight=0.1)  # Expert 3 again!
# Token 3: Expert 3 (weight=0.8), Expert 2 (weight=0.2)  # Overload!

Problem: Load imbalance - some experts become popular and get overloaded.

2. Expert Choice Routing (Modern Standard)

Used by: DeepSeek-V2, Mixtral 8x22B, cutting-edge models

def expert_choice_routing(self, router_logits, capacity_factor=1.25):
    """
    Experts choose their top tokens (2025 state-of-the-art)
    Advantage: Perfect load balancing, no dropped tokens
    """
    batch_size, seq_len, num_experts = router_logits.shape
    total_tokens = batch_size * seq_len

    # Each expert selects top tokens up to its capacity
    expert_capacity = int(total_tokens / num_experts * capacity_factor)

    # Transpose: now shape is [num_experts, total_tokens]
    router_logits_per_expert = router_logits.view(-1, num_experts).T

    assignments = []
    for expert_idx in range(num_experts):
        # Each expert picks its top-capacity tokens
        expert_scores = router_logits_per_expert[expert_idx]
        top_tokens = torch.topk(expert_scores, k=expert_capacity).indices
        assignments.append((expert_idx, top_tokens))

    return assignments

# Example: 8 experts, 1000 tokens, capacity_factor=1.25
# Each expert processes: 1000/8 * 1.25 = 156 tokens
# Expert 0 picks its top 156 tokens
# Expert 1 picks its top 156 tokens
# ...
# Total: 8 * 156 = 1248 tokens processed (some tokens go to multiple experts)

Advantage: Guaranteed load balance, no dropped tokens, better hardware utilization.
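To see how those per-expert assignments could be folded back into a layer output, here is a minimal sketch (the helper apply_expert_assignments and the softmax-based weighting are illustrative assumptions, not a specific model's implementation):

def apply_expert_assignments(x_flat, assignments, experts, router_logits_flat):
    """
    x_flat: [num_tokens, d_model] flattened token representations
    assignments: list of (expert_idx, token_indices) from expert_choice_routing
    experts: nn.ModuleList of expert networks
    router_logits_flat: [num_tokens, num_experts]
    """
    output = torch.zeros_like(x_flat)
    gates = F.softmax(router_logits_flat, dim=-1)  # per-token expert probabilities
    for expert_idx, token_indices in assignments:
        expert_out = experts[expert_idx](x_flat[token_indices])
        # Weight each token's contribution by its routing score for this expert
        weight = gates[token_indices, expert_idx].unsqueeze(-1)
        output[token_indices] += weight * expert_out
    return output

Tokens selected by several experts receive a weighted sum of their outputs; tokens selected by none contribute zero here (in practice the residual connection carries them through).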

3. Soft Routing (Emerging)

Used by: Research models, Mixture-of-Depths

def soft_routing(self, x, router_logits):
    """
    All experts process all tokens, but with soft weights
    No discrete selection - fully differentiable
    """
    # Compute soft weights for all experts
    routing_weights = F.softmax(router_logits, dim=-1)
    # Shape: [batch, seq_len, num_experts]

    # Apply all experts with weights
    outputs = []
    for expert_idx in range(self.num_experts):
        expert_output = self.experts[expert_idx](x)  # [batch, seq_len, d_model]
        weighted_output = expert_output * routing_weights[:, :, expert_idx:expert_idx+1]
        outputs.append(weighted_output)

    final_output = sum(outputs)
    return final_output

# Note: Not truly sparse, but allows smooth gradients
# Used in combination with distillation to train sparse student models

4. Auxiliary Loss for Load Balancing

Critical for training stable MoE models:

def compute_load_balancing_loss(self, router_logits, selected_experts):
    """
    Encourages uniform expert usage
    Without this, one expert takes all tokens (collapse)
    """
    num_tokens = router_logits.shape[0] * router_logits.shape[1]

    # Compute how many tokens were routed to each expert
    expert_usage = torch.zeros(self.num_experts, device=router_logits.device)
    for expert_idx in range(self.num_experts):
        expert_usage[expert_idx] = (selected_experts == expert_idx).float().sum()

    # Ideal usage: uniform distribution
    target_usage = num_tokens / self.num_experts

    # Loss: penalize deviation from uniform
    # Modern approach: use coefficient of variation
    mean_usage = expert_usage.mean()
    std_usage = expert_usage.std()
    load_balance_loss = std_usage / (mean_usage + 1e-10)

    return load_balance_loss

# Add to total loss
total_loss = task_loss + alpha * load_balance_loss  # alpha typically 0.01

5. Hierarchical Routing (2024-2025)

Used by: Very large MoE models (1000+ experts)

def hierarchical_routing(self, router_logits, num_groups=8):
    """
    Two-level routing for massive expert counts
    First: choose expert group
    Then: choose expert within group
    """
    # Level 1: Route to expert group
    group_logits = router_logits.view(*router_logits.shape[:-1], num_groups, -1)
    group_logits = group_logits.max(dim=-1).values  # [batch, seq, num_groups]

    top_group = torch.argmax(group_logits, dim=-1)  # [batch, seq]

    # Level 2: Route within selected group
    experts_per_group = self.num_experts // num_groups
    group_start = top_group * experts_per_group  # [batch, seq]

    # Gather expert scores within each token's selected group
    local_offsets = torch.arange(experts_per_group, device=router_logits.device)
    local_expert_ids = group_start.unsqueeze(-1) + local_offsets  # [batch, seq, experts_per_group]
    local_logits = torch.gather(router_logits, -1, local_expert_ids)
    top_k_weights, top_k_local = torch.topk(local_logits, k=self.top_k, dim=-1)

    # Convert local indices to global expert indices
    global_indices = torch.gather(local_expert_ids, -1, top_k_local)

    return top_k_weights, global_indices

# Example: 128 experts, 8 groups
# Step 1: Choose from 8 groups -> Group 3
# Step 2: Choose from 16 experts in Group 3 -> Experts 48, 52

Modern MoE Configurations (2025)

# Mixtral 8x7B style (2024-2025 standard)
class MixtralMoELayer(nn.Module):
    """
    Configuration:
    - 8 experts per layer
    - Top-2 routing (each token to 2 experts)
    - Token choice routing with load balancing
    - Expert: gated FFN (up -> gate -> down)
    """
    config = {
        'num_experts': 8,
        'top_k': 2,
        'routing': 'token_choice',
        'expert_type': 'gated_ffn',
        'load_balance_weight': 0.01
    }

# DeepSeek-V2 style (2025 cutting-edge)
class DeepSeekV2MoELayer(nn.Module):
    """
    Configuration:
    - 160 experts per layer
    - Top-6 routing
    - Expert choice routing (2025 innovation)
    - Separate MoE for different heads
    - Shared experts + routed experts hybrid
    """
    config = {
        'num_experts': 160,
        'top_k': 6,
        'routing': 'expert_choice',
        'capacity_factor': 1.25,
        'num_shared_experts': 2,  # Always active
        'expert_type': 'low_rank_ffn'  # Efficiency
    }

Product of Experts (PoE)

Hinton (1999) PoE:

  • Goal: Combine simple, tractable “expert” models to form a complex, sharp probability distribution over data (like images).
  • How: Multiplies the probabilities from all experts and normalizes, so only data that passes all experts’ “rules” gets high probability (a toy numeric example follows this list).
  • Sampling/Training: Primarily focused on probability models like Boltzmann machines, trained using things like Gibbs sampling and contrastive divergence, with both positive and negative phases.
  • Scope: Original “experts” are usually simple neural nets or statistical models, trained together on the same data and modality.
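To make the “multiply and normalize” rule concrete, here is a toy numeric example (a minimal sketch of the combination step only, not Hinton's training procedure):

import torch

# Two "experts" each give a distribution over the same 4 outcomes
expert_a = torch.tensor([0.50, 0.30, 0.15, 0.05])
expert_b = torch.tensor([0.10, 0.40, 0.40, 0.10])

# Product of experts: multiply elementwise, then renormalize
product = expert_a * expert_b
poe = product / product.sum()
print(poe)  # tensor([0.2128, 0.5106, 0.2553, 0.0213])

# The second outcome dominates: it is the only one both experts rate reasonably highly.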

Product of Experts (PoE): Multi-Modal Fusion

Why PoE Still Matters in 2025

While MoE dominates LLMs, PoE remains the architecture of choice for:

  • Multi-modal models (vision + language)
  • Ensemble learning
  • Uncertainty estimation
  • Any scenario where all viewpoints must contribute

Recent applications: CLIP-like models, medical AI (multiple diagnostic tests), anomaly detection systems.

Core Mechanism

class ModernPoE(nn.Module):
    """
    Product of Experts for multi-modal fusion
    Used in vision-language models, ensemble systems
    """
    def __init__(self, modality_encoders):
        super().__init__()
        self.modality_encoders = nn.ModuleDict(modality_encoders)
        # e.g., {'vision': VisionEncoder(), 'text': TextEncoder()}

    def forward(self, inputs):
        """
        inputs: dict with keys matching modality_encoders
        e.g., {'vision': image_tensor, 'text': text_tensor}
        """
        # Step 1: Each expert produces log probabilities
        log_probs = []
        for modality, encoder in self.modality_encoders.items():
            if modality in inputs and inputs[modality] is not None:
                # Get log probability distribution
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

        # Step 2: Product of experts = sum in log space
        combined_log_prob = torch.stack(log_probs).sum(dim=0)

        # Step 3: Normalize (becomes a valid distribution)
        # Use log-sum-exp for numerical stability
        log_Z = torch.logsumexp(combined_log_prob, dim=-1, keepdim=True)
        normalized_log_prob = combined_log_prob - log_Z

        return normalized_log_prob

# Example: Multi-modal classification
vision_encoder = VisionTransformer(...)
text_encoder = TextTransformer(...)
audio_encoder = AudioTransformer(...)

poe_model = ModernPoE({
    'vision': vision_encoder,
    'text': text_encoder,
    'audio': audio_encoder
})

# During inference - all modalities contribute
output = poe_model({
    'vision': image,
    'text': caption,
    'audio': sound
})

A Product of Experts becomes a sum in log space, since log(a) + log(b) = log(ab).

PoE for Missing Modalities

A key advantage: gracefully handles missing inputs

class RobustPoE(ModernPoE):
    """
    Handles missing modalities at inference time
    Critical for real-world deployment
    (Reuses the modality encoders set up in ModernPoE)
    """
    def forward(self, inputs, available_modalities):
        log_probs = []

        for modality in available_modalities:
            if inputs.get(modality) is not None:
                encoder = self.modality_encoders[modality]
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

        if len(log_probs) == 0:
            raise ValueError("At least one modality required")

        # Product over available modalities only
        combined = torch.stack(log_probs).sum(dim=0)
        return F.log_softmax(combined, dim=-1)

# Example: Image classification with optional text
# Training: both modalities
output = model({'vision': img, 'text': caption}, ['vision', 'text'])

# Inference: image only (text missing)
output = model({'vision': img, 'text': None}, ['vision'])
# Still works! Uses only vision expert

Modern PoE Applications (2025)

1. Ensemble Distillation

class EnsemblePoE(nn.Module):
    """
    Combine multiple teacher models via PoE
    Then distill to single student model
    """
    def __init__(self, teacher_models):
        super().__init__()
        self.teachers = nn.ModuleList(teacher_models)

    def get_ensemble_targets(self, x):
        # Get predictions from all teachers
        teacher_logits = [teacher(x) for teacher in self.teachers]

        # PoE combination (product of distributions)
        log_probs = [F.log_softmax(logits, dim=-1) for logits in teacher_logits]
        combined = torch.stack(log_probs).sum(dim=0)

        # These become soft targets for student
        return F.softmax(combined, dim=-1)

# Used in: Model compression, knowledge distillation

2. Multi-View Learning

class MultiViewPoE(nn.Module):
    """
    Different views of same data (e.g., multiple camera angles)
    All views must contribute to decision
    """
    def __init__(self, view_encoders):
        super().__init__()
        self.encoders = nn.ModuleList(view_encoders)

    def forward(self, views):
        # views: list of tensors [view1, view2, view3, ...]
        log_probs = []

        for view, encoder in zip(views, self.encoders):
            logits = encoder(view)
            log_probs.append(F.log_softmax(logits, dim=-1))

        # Agreement across views increases confidence
        combined = torch.stack(log_probs).sum(dim=0)
        return combined

# Application: Multi-camera surveillance, 3D reconstruction

MoE vs PoE: Architecture Comparison (2025)

Aspect               | Mixture of Experts (MoE)      | Product of Experts (PoE)
---------------------|-------------------------------|-------------------------------
Primary Use Case     | Scaling LLMs efficiently      | Multi-modal fusion, ensembles
Activation           | Sparse (top-k experts)        | Dense (all experts)
Routing              | Learned gating network        | No routing - all participate
Combination          | Weighted sum                  | Product (multiplication)
Training Challenge   | Load balancing                | Gradient collapse
Inference Cost       | O(k) where k << N experts     | O(N) - all experts compute
Specialization       | Strong - experts diverge      | Moderate - experts must agree
2025 Models          | Mixtral, DeepSeek-V2, GPT-4   | Multi-modal VAE, ensembles
Hardware Friendly    | Very (sparse)                 | Less (dense)
Missing Inputs       | Can’t handle well             | Gracefully degrades

When to Use Each (2025 Decision Guide)

def choose_architecture(use_case):
    """
    Decision tree for 2025
    """
    if use_case == "scaling_llm":
        return "MoE with expert-choice routing"
        # Examples: Next-gen GPT, Claude, Gemini

    elif use_case == "multi_modal_fusion":
        return "PoE"
        # Examples: Vision+Text, Audio+Video+Text

    elif use_case == "efficient_inference":
        return "MoE with top-k routing"
        # Only activate what you need

    elif use_case == "ensemble_models":
        return "PoE"
        # All models must contribute

    elif use_case == "handling_missing_inputs":
        return "PoE"
        # Missing modality? Use remaining ones

    elif use_case == "trillion_parameter_model":
        return "MoE with hierarchical routing"
        # Scale beyond what's possible with dense models

    else:
        return "Hybrid MoE-PoE"
        # Best of both worlds

Training Considerations (2025 Best Practices)

MoE Training Challenges

1. Load Balancing (Critical!)

class MoEWithLoadBalancing(nn.Module):
    def forward(self, x):
        router_logits = self.router(x)

        # Main routing
        top_k_weights, top_k_experts = self.route_topk(router_logits)
        output = self.execute_experts(x, top_k_weights, top_k_experts)

        # Compute load balancing loss
        # Modern approach: use auxiliary loss + router z-loss
        lb_loss = self.load_balance_loss(router_logits, top_k_experts)
        router_z_loss = self.router_z_loss(router_logits)

        # Store for backward pass
        self.aux_loss = lb_loss + 0.001 * router_z_loss

        return output

    def load_balance_loss(self, router_logits, selected_experts):
        """
        Encourages uniform expert usage
        (Switch Transformer-style auxiliary loss: N * sum_i f_i * P_i)
        """
        # f_i: fraction of routed tokens that land on each expert
        expert_counts = torch.bincount(
            selected_experts.view(-1),
            minlength=self.num_experts
        ).float()
        expert_fractions = expert_counts / expert_counts.sum()

        # P_i: mean routing probability assigned to each expert
        routing_probs = F.softmax(router_logits, dim=-1).mean(dim=[0, 1])

        # Loss is minimized when both are uniform (1 / num_experts)
        balance_loss = (expert_fractions * routing_probs).sum() * self.num_experts

        return balance_loss

    def router_z_loss(self, router_logits):
        """
        Prevents router logits from growing too large
        Improves training stability
        """
        return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

# Training loop
for batch in dataloader:
    output = moe_model(batch)
    task_loss = criterion(output, labels)

    # Add auxiliary losses
    total_loss = task_loss + 0.01 * moe_model.aux_loss
    total_loss.backward()

2. Expert Capacity and Dropped Tokens

class CapacityFactorRouting(nn.Module):
    """
    Handles token overflow when too many route to one expert
    """
    def forward(self, x, router_weights, selected_experts):
        # router_weights, selected_experts: [num_tokens, top_k] (flattened over batch and sequence)
        batch_size, seq_len, d_model = x.shape
        num_tokens = batch_size * seq_len
        x_flat = x.view(num_tokens, d_model)

        # Calculate expert capacity
        tokens_per_expert = (num_tokens * self.top_k) // self.num_experts
        expert_capacity = int(tokens_per_expert * self.capacity_factor)  # 1.25 typical

        # Track how many tokens each expert has received
        expert_counts = torch.zeros(self.num_experts, device=x.device)

        # Process tokens, drop if expert is at capacity
        outputs = torch.zeros_like(x_flat)
        dropped_tokens = 0

        for token_idx in range(num_tokens):
            for k in range(self.top_k):
                expert_idx = selected_experts[token_idx, k]

                if expert_counts[expert_idx] < expert_capacity:
                    # Expert has capacity - process token
                    weight = router_weights[token_idx, k]
                    expert_out = self.experts[expert_idx](x_flat[token_idx])
                    outputs[token_idx] += weight * expert_out
                    expert_counts[expert_idx] += 1
                else:
                    # Expert at capacity - drop token
                    dropped_tokens += 1

        # Log dropped tokens (important metric!)
        self.dropped_tokens_ratio = dropped_tokens / (num_tokens * self.top_k)

        return outputs.view(batch_size, seq_len, d_model)

# Monitor during training
if model.dropped_tokens_ratio > 0.01:  # More than 1% dropped
    print(f"Warning: {model.dropped_tokens_ratio:.1%} tokens dropped")
    # Consider: increase capacity_factor or improve load balancing

3. Expert Initialization

def initialize_moe_experts(moe_layer, pretrained_ffn=None):
    """
    2025 best practice: initialize experts from pretrained FFN
    Then add noise for diversity
    """
    if pretrained_ffn is not None:
        # Start from dense model
        base_params = pretrained_ffn.state_dict()

        for expert_idx in range(moe_layer.num_experts):
            # Copy base parameters
            moe_layer.experts[expert_idx].load_state_dict(base_params)

            # Add small random noise for diversity
            with torch.no_grad():
                for param in moe_layer.experts[expert_idx].parameters():
                    param.add_(torch.randn_like(param) * 0.01)
    else:
        # Standard initialization
        for expert in moe_layer.experts:
            expert.apply(lambda m: nn.init.xavier_uniform_(m.weight)
                         if isinstance(m, nn.Linear) else None)

PoE Training Challenges

1. Gradient Imbalance

class PoEWithGradientBalancing(nn.Module):
    """
    Ensure all experts receive useful gradients
    Assumes self.encoders (nn.ModuleDict) and self.modality_importance
    (a learnable nn.Parameter of shape [num_modalities]) are set in __init__
    """
    def forward(self, inputs):
        log_probs = []
        expert_confidences = []

        for modality, encoder in self.encoders.items():
            if inputs.get(modality) is not None:
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

                # Track expert confidence (for weighting gradients)
                confidence = F.softmax(logits, dim=-1).max(dim=-1).values
                expert_confidences.append(confidence.mean())

        # Combine with learned modality weights
        modality_weights = F.softmax(self.modality_importance, dim=0)

        weighted_log_probs = []
        for i, log_prob in enumerate(log_probs):
            # Weight by modality importance (learnable)
            weighted = log_prob + torch.log(modality_weights[i] + 1e-10)
            weighted_log_probs.append(weighted)

        combined = torch.stack(weighted_log_probs).sum(dim=0)

        # Store for analysis
        self.expert_confidences = expert_confidences

        return combined

2. Preventing Expert Collapse

def train_poe_with_diversity_loss(model, batch, labels):
    """
    Add diversity loss to prevent experts from becoming identical
    """
    # Forward pass
    outputs = model(batch)
    task_loss = F.cross_entropy(outputs, labels)

    # Diversity loss: penalize similarity between experts so they specialize
    expert_outputs = [encoder(batch[mod]) for mod, encoder in model.encoders.items()]
    batch_size = labels.shape[0]

    # Compute pairwise similarities
    similarity_loss = 0
    num_pairs = 0
    for i in range(len(expert_outputs)):
        for j in range(i + 1, len(expert_outputs)):
            # We want experts to be different (high diversity)
            similarity = F.cosine_similarity(
                expert_outputs[i].view(batch_size, -1),
                expert_outputs[j].view(batch_size, -1),
                dim=1
            ).mean()
            # Penalize high similarity
            similarity_loss += similarity
            num_pairs += 1

    similarity_loss = similarity_loss / num_pairs

    # Total loss: adding the similarity term penalizes redundancy, encouraging diversity
    total_loss = task_loss + 0.1 * similarity_loss

    return total_loss

PoE for Visual Generation (2025):

  • Goal: Compose the outputs or “opinions” of modern pretrained models—like diffusion models, video generators, Vision-Language Models (VLMs), and even physics engines or simulators—for controllable visual (image/video) synthesis.
  • How:
    • Each “expert” can be a large pretrained model (not just a small probabilistic model).
    • Some experts are generative models (e.g., a diffusion model for images, or a video generator), others are discriminators like VLMs or physics simulators that function as reward/constraint functions.
    • They sample from the product of these experts to generate a final output that satisfies all constraints—e.g., matching realism, language description, physics validity, and user control.
  • Efficient Sampling: Introduces a new, practical inference-time framework combining Annealed Importance Sampling (AIS) and Sequential Monte Carlo (SMC) to efficiently generate samples from these product distributions—important since simple approaches (like Gibbs sampling) are too slow or impractical with high-dimensional data and large models. A toy, single-step version of this resampling idea is sketched after this list.
  • No Retraining Needed: Assembles these powerful experts at inference time—allowing flexible composition without re-training a huge new model.
  • Applications: Enables user-controllable, physically-correct, prompt-following image/video generation—e.g., “put this object here per the graphics engine, make it look realistic per the Diffusion Model, obey physics per the simulator, and follow this text instruction per the VLM.”
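To make the inference-time composition concrete, here is a deliberately simplified, hypothetical sketch of a single importance-resampling step over a product of experts (the names generator and score_fns are placeholders; the frameworks described above run annealed importance sampling / SMC over many intermediate steps rather than one resampling pass):

import torch

def sample_from_product(generator, score_fns, num_candidates=64, num_samples=4):
    """
    generator: callable returning a batch of candidate samples, shape [N, ...]
    score_fns: list of callables mapping candidates to per-sample log-scores
               (e.g., a VLM text-match score, a physics-validity score)
    """
    # Proposal: draw candidates from the generative expert
    candidates = generator(num_candidates)

    # Product of experts in log space: sum the experts' log-scores
    total_log_score = torch.zeros(num_candidates)
    for score_fn in score_fns:
        total_log_score = total_log_score + score_fn(candidates)

    # Resample candidates in proportion to the combined score
    weights = torch.softmax(total_log_score, dim=0)
    idx = torch.multinomial(weights, num_samples, replacement=True)
    return candidates[idx]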

Hybrid Architectures (2025 Frontier)

Shared + Routed Experts (DeepSeek-V2 Style)

The 2025 innovation: combine dense (shared) and sparse (routed) experts.

class SharedRoutedMoE(nn.Module):
    """
    Architecture used in DeepSeek-V2, Snowflake Arctic
    - Always-on shared experts (dense)
    - Conditionally-activated routed experts (sparse)
    Best of both worlds!
    """
    def __init__(self, d_model, num_shared=2, num_routed=64, top_k=6):
        super().__init__()

        # Always-active shared experts
        self.shared_experts = nn.ModuleList([
            FeedForwardExpert(d_model) for _ in range(num_shared)
        ])

        # Conditionally-active routed experts
        self.routed_experts = nn.ModuleList([
            FeedForwardExpert(d_model) for _ in range(num_routed)
        ])

        self.router = nn.Linear(d_model, num_routed)
        self.top_k = top_k

    def forward(self, x):
        # Step 1: Shared experts (always computed)
        shared_output = sum(expert(x) for expert in self.shared_experts)
        shared_output = shared_output / len(self.shared_experts)

        # Step 2: Route to top-k experts
        router_logits = self.router(x)
        top_k_weights, top_k_indices = torch.topk(
            F.softmax(router_logits, dim=-1),
            k=self.top_k,
            dim=-1
        )

        # Step 3: Compute selected routed experts
        routed_output = self.execute_routed_experts(
            x, top_k_weights, top_k_indices
        )

        # Step 4: Combine shared and routed
        output = shared_output + routed_output

        return output

# Advantage: Shared experts handle common patterns
# Routed experts handle specialized cases

MoE with PoE Groups

class GroupedMoEPoE(nn.Module):
    """
    Multi-modal MoE where each modality has PoE experts
    Used in: Advanced vision-language models
    """
    def __init__(self, modalities, hidden_dim, experts_per_modality=4, top_k_modalities=2):
        super().__init__()
        self.modalities = list(modalities)

        # For each modality, create a group of experts (use PoE within)
        self.modality_groups = nn.ModuleDict({
            modality: nn.ModuleList([
                ModalityExpert(modality) for _ in range(experts_per_modality)
            ]) for modality in self.modalities
        })

        # Router selects which modality groups to use
        self.modality_router = nn.Linear(hidden_dim, len(self.modalities))
        self.top_k = top_k_modalities

    def forward(self, multi_modal_input):
        # Step 1: Route to top-k modality groups (MoE)
        # Assumes a single pooled context vector, so routing_scores is 1-D
        routing_scores = self.modality_router(multi_modal_input['context'])
        top_k_weights, top_k_modalities = torch.topk(
            F.softmax(routing_scores, dim=-1),
            k=self.top_k,
            dim=-1
        )

        # Step 2: Within each selected modality, use PoE
        group_outputs = []
        for modality_idx in top_k_modalities:
            modality_name = self.modalities[int(modality_idx)]
            experts = self.modality_groups[modality_name]

            # PoE within this modality group
            log_probs = []
            for expert in experts:
                logits = expert(multi_modal_input[modality_name])
                log_probs.append(F.log_softmax(logits, dim=-1))

            # Product of experts
            group_output = torch.stack(log_probs).sum(dim=0)
            group_outputs.append(group_output)

        # Step 3: MoE combination of groups
        final_output = sum(
            w * out for w, out in zip(top_k_weights, group_outputs)
        )

        return final_output

# Use case: Video understanding (vision + audio + text)
# - Route to relevant modalities (MoE efficiency)
# - Within each modality, all aspects must agree (PoE robustness)

Deployment Considerations (2025)

MoE Inference Optimization

class OptimizedMoEInference(nn.Module):
    """
    Production-ready MoE with inference optimizations
    """
    def __init__(self, base_moe):
        super().__init__()
        self.moe = base_moe

        # Optimization 1: Expert caching
        self.expert_cache = {}

        # Optimization 2: Batch expert execution
        self.use_batched_experts = True

        # Optimization 3: Dynamic expert loading (for very large MoE)
        self.expert_on_device = set(range(min(8, len(self.moe.experts))))

    def forward(self, x):
        # Get routing decisions
        router_logits = self.moe.router(x)
        top_k_weights, top_k_experts = torch.topk(
            F.softmax(router_logits, dim=-1),
            k=self.moe.top_k,
            dim=-1
        )

        # Optimization: Group tokens by expert
        # This enables better batching and memory access
        expert_to_tokens = self.group_tokens_by_expert(
            top_k_experts, top_k_weights
        )

        # Execute experts in batches
        outputs = torch.zeros_like(x)
        for expert_idx, (token_indices, weights) in expert_to_tokens.items():
            # Optimization: Only load expert if needed
            if expert_idx not in self.expert_on_device:
                self.load_expert(expert_idx)

            # Batch process all tokens going to this expert
            # token_indices holds (batch, seq, k) coordinates
            batch_idx, seq_idx = token_indices[:, 0], token_indices[:, 1]
            expert_input = x[batch_idx, seq_idx]
            expert_output = self.moe.experts[expert_idx](expert_input)

            # Apply weights and accumulate
            outputs[batch_idx, seq_idx] += weights.unsqueeze(-1) * expert_output

        return outputs

    def group_tokens_by_expert(self, top_k_experts, top_k_weights):
        """
        Critical optimization: group by expert for batched execution
        Instead of token-by-token, process all tokens for expert X together
        """
        expert_to_tokens = {}
        batch_size, seq_len, top_k = top_k_experts.shape

        for expert_idx in range(self.moe.num_experts):
            # Find all tokens routed to this expert
            mask = (top_k_experts == expert_idx)
            token_indices = mask.nonzero(as_tuple=False)

            if len(token_indices) > 0:
                weights = top_k_weights[mask]
                expert_to_tokens[expert_idx] = (token_indices, weights)

        return expert_to_tokens

# Typical speedup: 2-3x over naive implementation

Serving Large MoE Models

"""
2025 Production Setup for 1T+ parameter MoE models
"""

# Strategy 1: Expert Parallelism
# Different GPUs host different experts
# Token dispatching across GPUs

class DistributedMoE:
def __init__(self, num_experts=128, experts_per_gpu=8):
self.num_gpus = num_experts // experts_per_gpu

# Assign experts to GPUs
self.expert_placement = {
expert_idx: expert_idx // experts_per_gpu
for expert_idx in range(num_experts)
}

def forward_distributed(self, x, top_k_experts):
# Group tokens by which GPU they need
gpu_to_tokens = self.group_by_gpu(top_k_experts)

# Send tokens to appropriate GPUs
results = []
for gpu_id, token_data in gpu_to_tokens.items():
result = self.send_to_gpu(gpu_id, token_data)
results.append(result)

# Gather and combine results
return self.gather_results(results)

# Strategy 2: Shared Expert on All GPUs, Route Experts Distributed
# Mixtral-style: Shared expert (2) on every GPU
# Routed experts (64) split across GPUs

Real-World Examples (2025)

Case Study 1: Mixtral 8x7B

"""
Mixtral Architecture (Simplified)
- 8 experts per MoE layer
- Top-2 routing
- 32 layers with MoE
"""

class MixtralMoELayer(nn.Module):
def __init__(self, hidden_size=4096, num_experts=8, top_k=2):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k

# Router
self.gate = nn.Linear(hidden_size, num_experts, bias=False)

# 8 expert FFNs
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(hidden_size, 14336), # Expand
nn.SiLU(),
nn.Linear(14336, hidden_size), # Project back
) for _ in range(num_experts)
])

def forward(self, hidden_states):
batch_size, seq_len, hidden_size = hidden_states.shape
hidden_states_flat = hidden_states.view(-1, hidden_size)

# Routing
router_logits = self.gate(hidden_states_flat)
routing_weights = F.softmax(router_logits, dim=-1)

# Top-2 experts per token
routing_weights, selected_experts = torch.topk(
routing_weights, self.top_k, dim=-1
)
routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

# Execute
final_hidden_states = torch.zeros_like(hidden_states_flat)

for expert_idx in range(self.num_experts):
expert_mask = (selected_experts == expert_idx)
if expert_mask.any():
# Tokens going to this expert
token_indices = expert_mask.nonzero()[:, 0]
k_indices = expert_mask.nonzero()[:, 1]

expert_input = hidden_states_flat[token_indices]
expert_output = self.experts[expert_idx](expert_input)

# Weight by routing score
weights = routing_weights[token_indices, k_indices].unsqueeze(-1)
final_hidden_states[token_indices] += weights * expert_output

return final_hidden_states.view(batch_size, seq_len, hidden_size)

# Configuration
# - 8 experts × 7B params each = 56B total params
# - But only 2 experts active per token = 14B active params
# - Effective: 4x parameter efficiency vs dense 14B model

Case Study 2: DeepSeek-V2 Innovations

"""
DeepSeek-V2 Architecture (2025)
Key innovations:
- 160 routed experts + 2 shared experts
- Expert-choice routing (not token-choice)
- Low-rank decomposition for expert efficiency
"""

class DeepSeekV2MoELayer(nn.Module):
def __init__(self, hidden_size=4096,
num_shared_experts=2,
num_routed_experts=160,
top_k=6,
capacity_factor=1.25):
super().__init__()

# Innovation 1: Shared experts (always active)
self.shared_experts = nn.ModuleList([
LowRankFFNExpert(hidden_size, rank=1024)
for _ in range(num_shared_experts)
])

# Innovation 2: Many routed experts (sparse activation)
self.routed_experts = nn.ModuleList([
LowRankFFNExpert(hidden_size, rank=512) # Smaller rank
for _ in range(num_routed_experts)
])

self.num_routed_experts = num_routed_experts
self.top_k = top_k
self.capacity_factor = capacity_factor

# Router
self.gate = nn.Linear(hidden_size, num_routed_experts, bias=False)

def forward(self, hidden_states):
# Step 1: Shared experts (always computed)
shared_output = torch.stack([
expert(hidden_states) for expert in self.shared_experts
]).mean(dim=0)

# Step 2: Expert-choice routing
router_logits = self.gate(hidden_states)
routed_output = self.expert_choice_route(
hidden_states, router_logits
)

# Step 3: Combine
return shared_output + routed_output

def expert_choice_route(self, hidden_states, router_logits):
"""
Each expert selects top tokens (not tokens selecting experts)
Guarantees load balance!
"""
batch_size, seq_len, hidden_size = hidden_states.shape
num_tokens = batch_size * seq_len

# Capacity per expert
expert_capacity = int(
(num_tokens / self.num_routed_experts) * self.capacity_factor
)

# Flatten
hidden_flat = hidden_states.view(num_tokens, hidden_size)
router_logits_flat = router_logits.view(num_tokens, self.num_routed_experts)

# Each expert picks its top tokens
output = torch.zeros_like(hidden_flat)

for expert_idx in range(self.num_routed_experts):
# This expert's scores for all tokens
expert_scores = router_logits_flat[:, expert_idx]

# Select top-capacity tokens
top_scores, top_token_indices = torch.topk(
expert_scores, k=expert_capacity
)

# Process selected tokens
selected_tokens = hidden_flat[top_token_indices]
expert_output = self.routed_experts[expert_idx](selected_tokens)

# Add to output (normalized by score)
weights = F.softmax(top_scores, dim=-1).unsqueeze(-1)
output[top_token_indices] += weights * expert_output

return output.view(batch_size, seq_len, hidden_size)

class LowRankFFNExpert(nn.Module):
"""
Low-rank decomposition for memory efficiency
Instead of: hidden_size -> ffn_size -> hidden_size
Use: hidden_size -> rank -> ffn_size -> rank -> hidden_size
Saves parameters!
"""
def __init__(self, hidden_size, rank):
super().__init__()
self.down_proj = nn.Linear(hidden_size, rank)
self.gate_proj = nn.Linear(rank, rank * 4)
self.up_proj = nn.Linear(rank * 4, rank)
self.output_proj = nn.Linear(rank, hidden_size)

def forward(self, x):
x = self.down_proj(x) # Reduce dimension
x = F.silu(self.gate_proj(x)) # Expand and activate
x = self.up_proj(x) # Project back
x = self.output_proj(x) # To hidden size
return x

Conclusion

Mixture of Experts (MoE): “Let the specialists handle their specialties”

  • Routing mechanism selects experts
  • Weighted averaging of outputs
  • Sparse activation for efficiency
  • Great for scaling and diverse inputs

Product of Experts (PoE): “All experts must agree”

  • All experts always active
  • Multiplicative combination
  • Agreement amplifies confidence, disagreement reduces it
  • Great for consensus and multi-modal learning

Both are powerful techniques for combining multiple models, just with different philosophies. MoE is about efficient specialization, while PoE is about robust consensus. Choose based on your problem’s needs!