Introduction

In 2025, Mixture of Experts (MoE) has become the dominant architecture for scaling large language models efficiently. Models like Mixtral 8x7B, DeepSeek-V2, and rumored GPT-4 variants leverage MoE to achieve massive parameter counts while keeping inference costs manageable. Meanwhile, Product of Experts (PoE) continues to play a crucial role in multi-modal learning and ensemble methods.

Mixture of Experts (MoE)

Why MoE Dominates in 2025

Modern LLMs face a fundamental challenge: in a dense model, compute per token grows in direct proportion to parameter count, yet we want trillion-parameter models. MoE solves this with conditional computation - each token activates only the parameters it needs.

The Economics:

  • Dense 70B model: All 70B parameters active per token
  • MoE 8x70B model: Only 2x70B = 140B parameters active per token, but 560B total capacity

This roughly 4x ratio of total capacity to per-token compute (ignoring the attention and embedding weights shared across experts) is why MoE won; a quick sanity check of the arithmetic is sketched below.
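The following back-of-the-envelope helper reproduces the numbers above (a minimal sketch; the function name moe_capacity is illustrative, and real models also carry shared non-expert weights that this ignores):

def moe_capacity(expert_params_b: float, num_experts: int, top_k: int):
    """Capacity vs. active parameters for an MoE expert stack (illustrative only)."""
    total = expert_params_b * num_experts   # parameters the model can store
    active = expert_params_b * top_k        # parameters touched per token
    return total, active, total / active

total, active, ratio = moe_capacity(expert_params_b=70, num_experts=8, top_k=2)
print(f"total={total}B, active={active}B, capacity/active={ratio:.1f}x")
# total=560B, active=140B, capacity/active=4.0x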

Basic Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

# FeedForwardExpert: a standard FFN block (definition omitted in this article)

class ModernMoE(nn.Module):
    """
    Standard MoE layer used in production models (2025)
    """
    def __init__(self, d_model, num_experts=8, top_k=2, expert_capacity=None):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        self.expert_capacity = expert_capacity  # For capacity factor routing

        # The router (gating network)
        self.router = nn.Linear(d_model, num_experts)

        # The expert networks (typically FFN layers)
        self.experts = nn.ModuleList([
            FeedForwardExpert(d_model, ffn_dim=d_model * 4)
            for _ in range(num_experts)
        ])

    def forward(self, x):
        # x shape: [batch, seq_len, d_model]
        batch_size, seq_len, d_model = x.shape

        # Step 1: Routing decision
        router_logits = self.router(x)  # [batch, seq_len, num_experts]

        # Step 2: Select top-k experts per token (see routing strategies below)
        routing_weights, selected_experts = self.route_tokens(router_logits)

        # Step 3: Execute experts (the core innovation is HERE)
        output = self.execute_experts(x, routing_weights, selected_experts)

        return output

Routing Mechanisms: The Critical Innovation

The routing strategy is what differentiates modern MoE implementations. Let’s cover the main approaches used in 2025.

1. Top-K Token Choice Routing (Classic)

Used by: Switch Transformer, GLaM, early MoE models

def token_choice_routing(self, router_logits, top_k=2):
    """
    Each token chooses its top-k experts
    Problem: Load imbalance - popular experts get overloaded
    """
    # Get top-k experts per token
    routing_weights, selected_experts = torch.topk(
        F.softmax(router_logits, dim=-1),
        k=top_k,
        dim=-1
    )

    # Normalize weights of selected experts
    routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

    return routing_weights, selected_experts

# Example output per token:
# Token 1: Expert 3 (weight=0.7), Expert 7 (weight=0.3)
# Token 2: Expert 3 (weight=0.9), Expert 1 (weight=0.1)  # Expert 3 again!
# Token 3: Expert 3 (weight=0.8), Expert 2 (weight=0.2)  # Overload!

Problem: Load imbalance - some experts become popular and get overloaded.

2. Expert Choice Routing (Modern Standard)

Used by: DeepSeek-V2, Mixtral 8x22B, cutting-edge models

def expert_choice_routing(self, router_logits, capacity_factor=1.25):
    """
    Experts choose their top tokens (2025 state-of-the-art)
    Advantage: Perfect load balancing, no dropped tokens
    """
    batch_size, seq_len, num_experts = router_logits.shape
    total_tokens = batch_size * seq_len

    # Each expert selects top tokens up to its capacity
    expert_capacity = int(total_tokens / num_experts * capacity_factor)

    # Transpose: now shape is [num_experts, total_tokens]
    router_logits_per_expert = router_logits.view(-1, num_experts).T

    assignments = []
    for expert_idx in range(num_experts):
        # Each expert picks its top-capacity tokens
        expert_scores = router_logits_per_expert[expert_idx]
        top_tokens = torch.topk(expert_scores, k=expert_capacity).indices
        assignments.append((expert_idx, top_tokens))

    return assignments

# Example: 8 experts, 1000 tokens, capacity_factor=1.25
# Each expert processes: 1000/8 * 1.25 = 156 tokens
# Expert 0 picks its top 156 tokens
# Expert 1 picks its top 156 tokens
# ...
# Total: 8 * 156 = 1248 tokens processed (some tokens go to multiple experts)

Advantage: Guaranteed load balance, no dropped tokens, better hardware utilization.
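To see how those per-expert assignments could be folded back into a layer output, here is a minimal sketch (the helper apply_expert_assignments and the softmax-based weighting are illustrative assumptions, not a specific model's implementation):

def apply_expert_assignments(x_flat, assignments, experts, router_logits_flat):
    """
    x_flat: [num_tokens, d_model] flattened token representations
    assignments: list of (expert_idx, token_indices) from expert_choice_routing
    experts: nn.ModuleList of expert networks
    router_logits_flat: [num_tokens, num_experts]
    """
    output = torch.zeros_like(x_flat)
    gates = F.softmax(router_logits_flat, dim=-1)  # per-token expert probabilities
    for expert_idx, token_indices in assignments:
        expert_out = experts[expert_idx](x_flat[token_indices])
        # Weight each token's contribution by its routing score for this expert
        weight = gates[token_indices, expert_idx].unsqueeze(-1)
        output[token_indices] += weight * expert_out
    return output

Tokens selected by several experts receive a weighted sum of their outputs; tokens selected by none contribute zero here (in practice the residual connection carries them through).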

3. Soft Routing (Emerging)

Used by: Research models, Mixture-of-Depths

def soft_routing(self, x, router_logits):
    """
    All experts process all tokens, but with soft weights
    No discrete selection - fully differentiable
    """
    # Compute soft weights for all experts
    routing_weights = F.softmax(router_logits, dim=-1)
    # Shape: [batch, seq_len, num_experts]

    # Apply all experts with weights
    outputs = []
    for expert_idx in range(self.num_experts):
        expert_output = self.experts[expert_idx](x)  # [batch, seq_len, d_model]
        weighted_output = expert_output * routing_weights[:, :, expert_idx:expert_idx+1]
        outputs.append(weighted_output)

    final_output = sum(outputs)
    return final_output

# Note: Not truly sparse, but allows smooth gradients
# Used in combination with distillation to train sparse student models

4. Auxiliary Loss for Load Balancing

Critical for training stable MoE models:

def compute_load_balancing_loss(self, router_logits, selected_experts):
    """
    Encourages uniform expert usage
    Without this, one expert takes all tokens (collapse)
    """
    num_tokens = router_logits.shape[0] * router_logits.shape[1]

    # Compute how many tokens were routed to each expert
    expert_usage = torch.zeros(self.num_experts, device=router_logits.device)
    for expert_idx in range(self.num_experts):
        expert_usage[expert_idx] = (selected_experts == expert_idx).float().sum()

    # Ideal usage: uniform distribution
    target_usage = num_tokens / self.num_experts

    # Loss: penalize deviation from uniform
    # Modern approach: use coefficient of variation
    mean_usage = expert_usage.mean()
    std_usage = expert_usage.std()
    load_balance_loss = std_usage / (mean_usage + 1e-10)

    return load_balance_loss

# Add to total loss
total_loss = task_loss + alpha * load_balance_loss  # alpha typically 0.01

5. Hierarchical Routing (2024-2025)

Used by: Very large MoE models (1000+ experts)

def hierarchical_routing(self, router_logits, num_groups=8):
    """
    Two-level routing for massive expert counts
    First: choose expert group
    Then: choose expert within group
    """
    # Level 1: Route to expert group
    group_logits = router_logits.view(*router_logits.shape[:-1], num_groups, -1)
    group_logits = group_logits.max(dim=-1).values  # [batch, seq, num_groups]

    top_group = torch.argmax(group_logits, dim=-1)  # [batch, seq]

    # Level 2: Route within selected group
    experts_per_group = self.num_experts // num_groups
    group_start = top_group * experts_per_group  # [batch, seq]

    # Gather expert scores within each token's selected group
    local_offsets = torch.arange(experts_per_group, device=router_logits.device)
    local_expert_ids = group_start.unsqueeze(-1) + local_offsets  # [batch, seq, experts_per_group]
    local_logits = torch.gather(router_logits, -1, local_expert_ids)
    top_k_weights, top_k_local = torch.topk(local_logits, k=self.top_k, dim=-1)

    # Convert local indices to global expert indices
    global_indices = torch.gather(local_expert_ids, -1, top_k_local)

    return top_k_weights, global_indices

# Example: 128 experts, 8 groups
# Step 1: Choose from 8 groups -> Group 3
# Step 2: Choose from 16 experts in Group 3 -> Experts 48, 52

Modern MoE Configurations (2025)

# Mixtral 8x7B style (2024-2025 standard)
class MixtralMoELayer(nn.Module):
    """
    Configuration:
    - 8 experts per layer
    - Top-2 routing (each token to 2 experts)
    - Token choice routing with load balancing
    - Expert: gated FFN (up -> gate -> down)
    """
    config = {
        'num_experts': 8,
        'top_k': 2,
        'routing': 'token_choice',
        'expert_type': 'gated_ffn',
        'load_balance_weight': 0.01
    }

# DeepSeek-V2 style (2025 cutting-edge)
class DeepSeekV2MoELayer(nn.Module):
    """
    Configuration:
    - 160 experts per layer
    - Top-6 routing
    - Expert choice routing (2025 innovation)
    - Separate MoE for different heads
    - Shared experts + routed experts hybrid
    """
    config = {
        'num_experts': 160,
        'top_k': 6,
        'routing': 'expert_choice',
        'capacity_factor': 1.25,
        'num_shared_experts': 2,  # Always active
        'expert_type': 'low_rank_ffn'  # Efficiency
    }

Product of Experts (PoE)

Hinton (1999) PoE:

  • Goal: Combine simple, tractable “expert” models to form a complex, sharp probability distribution over data (like images).
  • How: Multiplies the probabilities from all experts and normalizes, so only data that passes all experts’ “rules” gets high probability (a toy numeric example follows this list).
  • Sampling/Training: Primarily focused on probability models like Boltzmann machines, trained using things like Gibbs sampling and contrastive divergence, with both positive and negative phases.
  • Scope: Original “experts” are usually simple neural nets or statistical models, trained together on the same data and modality.
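To make the “multiply and normalize” rule concrete, here is a toy numeric example (a minimal sketch of the combination step only, not Hinton's training procedure):

import torch

# Two "experts" each give a distribution over the same 4 outcomes
expert_a = torch.tensor([0.50, 0.30, 0.15, 0.05])
expert_b = torch.tensor([0.10, 0.40, 0.40, 0.10])

# Product of experts: multiply elementwise, then renormalize
product = expert_a * expert_b
poe = product / product.sum()
print(poe)  # tensor([0.2128, 0.5106, 0.2553, 0.0213])

# The second outcome dominates: it is the only one both experts rate reasonably highly.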

Product of Experts (PoE): Multi-Modal Fusion

Why PoE Still Matters in 2025

While MoE dominates LLMs, PoE remains the architecture of choice for:

  • Multi-modal models (vision + language)
  • Ensemble learning
  • Uncertainty estimation
  • Any scenario where all viewpoints must contribute

Recent applications: CLIP-like models, medical AI (multiple diagnostic tests), anomaly detection systems.

Core Mechanism

class ModernPoE(nn.Module):
    """
    Product of Experts for multi-modal fusion
    Used in vision-language models, ensemble systems
    """
    def __init__(self, modality_encoders):
        super().__init__()
        self.modality_encoders = nn.ModuleDict(modality_encoders)
        # e.g., {'vision': VisionEncoder(), 'text': TextEncoder()}

    def forward(self, inputs):
        """
        inputs: dict with keys matching modality_encoders
        e.g., {'vision': image_tensor, 'text': text_tensor}
        """
        # Step 1: Each expert produces log probabilities
        log_probs = []
        for modality, encoder in self.modality_encoders.items():
            if modality in inputs and inputs[modality] is not None:
                # Get log probability distribution
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

        # Step 2: Product of experts = sum in log space
        combined_log_prob = torch.stack(log_probs).sum(dim=0)

        # Step 3: Normalize (becomes a valid distribution)
        # Use log-sum-exp for numerical stability
        log_Z = torch.logsumexp(combined_log_prob, dim=-1, keepdim=True)
        normalized_log_prob = combined_log_prob - log_Z

        return normalized_log_prob

# Example: Multi-modal classification
vision_encoder = VisionTransformer(...)
text_encoder = TextTransformer(...)
audio_encoder = AudioTransformer(...)

poe_model = ModernPoE({
    'vision': vision_encoder,
    'text': text_encoder,
    'audio': audio_encoder
})

# During inference - all modalities contribute
output = poe_model({
    'vision': image,
    'text': caption,
    'audio': sound
})

A Product of Experts becomes a sum in log space, since log(a) + log(b) = log(ab).

PoE for Missing Modalities

A key advantage: gracefully handles missing inputs

class RobustPoE(ModernPoE):
    """
    Handles missing modalities at inference time
    Critical for real-world deployment
    (Reuses the modality encoders set up in ModernPoE)
    """
    def forward(self, inputs, available_modalities):
        log_probs = []

        for modality in available_modalities:
            if inputs.get(modality) is not None:
                encoder = self.modality_encoders[modality]
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

        if len(log_probs) == 0:
            raise ValueError("At least one modality required")

        # Product over available modalities only
        combined = torch.stack(log_probs).sum(dim=0)
        return F.log_softmax(combined, dim=-1)

# Example: Image classification with optional text
# Training: both modalities
output = model({'vision': img, 'text': caption}, ['vision', 'text'])

# Inference: image only (text missing)
output = model({'vision': img, 'text': None}, ['vision'])
# Still works! Uses only vision expert

Modern PoE Applications (2025)

1. Ensemble Distillation

class EnsemblePoE(nn.Module):
    """
    Combine multiple teacher models via PoE
    Then distill to single student model
    """
    def __init__(self, teacher_models):
        super().__init__()
        self.teachers = nn.ModuleList(teacher_models)

    def get_ensemble_targets(self, x):
        # Get predictions from all teachers
        teacher_logits = [teacher(x) for teacher in self.teachers]

        # PoE combination (product of distributions)
        log_probs = [F.log_softmax(logits, dim=-1) for logits in teacher_logits]
        combined = torch.stack(log_probs).sum(dim=0)

        # These become soft targets for student
        return F.softmax(combined, dim=-1)

# Used in: Model compression, knowledge distillation

2. Multi-View Learning

class MultiViewPoE(nn.Module):
    """
    Different views of same data (e.g., multiple camera angles)
    All views must contribute to decision
    """
    def __init__(self, view_encoders):
        super().__init__()
        self.encoders = nn.ModuleList(view_encoders)

    def forward(self, views):
        # views: list of tensors [view1, view2, view3, ...]
        log_probs = []

        for view, encoder in zip(views, self.encoders):
            logits = encoder(view)
            log_probs.append(F.log_softmax(logits, dim=-1))

        # Agreement across views increases confidence
        combined = torch.stack(log_probs).sum(dim=0)
        return combined

# Application: Multi-camera surveillance, 3D reconstruction

MoE vs PoE: Architecture Comparison (2025)

Aspect               | Mixture of Experts (MoE)      | Product of Experts (PoE)
---------------------|-------------------------------|-------------------------------
Primary Use Case     | Scaling LLMs efficiently      | Multi-modal fusion, ensembles
Activation           | Sparse (top-k experts)        | Dense (all experts)
Routing              | Learned gating network        | No routing - all participate
Combination          | Weighted sum                  | Product (multiplication)
Training Challenge   | Load balancing                | Gradient collapse
Inference Cost       | O(k) where k << N experts     | O(N) - all experts compute
Specialization       | Strong - experts diverge      | Moderate - experts must agree
2025 Models          | Mixtral, DeepSeek-V2, GPT-4   | Multi-modal VAE, ensembles
Hardware Friendly    | Very (sparse)                 | Less (dense)
Missing Inputs       | Can’t handle well             | Gracefully degrades

When to Use Each (2025 Decision Guide)

def choose_architecture(use_case):
    """
    Decision tree for 2025
    """
    if use_case == "scaling_llm":
        return "MoE with expert-choice routing"
        # Examples: Next-gen GPT, Claude, Gemini

    elif use_case == "multi_modal_fusion":
        return "PoE"
        # Examples: Vision+Text, Audio+Video+Text

    elif use_case == "efficient_inference":
        return "MoE with top-k routing"
        # Only activate what you need

    elif use_case == "ensemble_models":
        return "PoE"
        # All models must contribute

    elif use_case == "handling_missing_inputs":
        return "PoE"
        # Missing modality? Use remaining ones

    elif use_case == "trillion_parameter_model":
        return "MoE with hierarchical routing"
        # Scale beyond what's possible with dense models

    else:
        return "Hybrid MoE-PoE"
        # Best of both worlds

Training Considerations (2025 Best Practices)

MoE Training Challenges

1. Load Balancing (Critical!)

class MoEWithLoadBalancing(nn.Module):
    def forward(self, x):
        router_logits = self.router(x)

        # Main routing
        top_k_weights, top_k_experts = self.route_topk(router_logits)
        output = self.execute_experts(x, top_k_weights, top_k_experts)

        # Compute load balancing loss
        # Modern approach: use auxiliary loss + router z-loss
        lb_loss = self.load_balance_loss(router_logits, top_k_experts)
        router_z_loss = self.router_z_loss(router_logits)

        # Store for backward pass
        self.aux_loss = lb_loss + 0.001 * router_z_loss

        return output

    def load_balance_loss(self, router_logits, selected_experts):
        """
        Encourages uniform expert usage
        (Switch Transformer-style auxiliary loss: N * sum_i f_i * P_i)
        """
        # f_i: fraction of routed tokens that land on each expert
        expert_counts = torch.bincount(
            selected_experts.view(-1),
            minlength=self.num_experts
        ).float()
        expert_fractions = expert_counts / expert_counts.sum()

        # P_i: mean routing probability assigned to each expert
        routing_probs = F.softmax(router_logits, dim=-1).mean(dim=[0, 1])

        # Loss is minimized when both are uniform (1 / num_experts)
        balance_loss = (expert_fractions * routing_probs).sum() * self.num_experts

        return balance_loss

    def router_z_loss(self, router_logits):
        """
        Prevents router logits from growing too large
        Improves training stability
        """
        return torch.logsumexp(router_logits, dim=-1).pow(2).mean()

# Training loop
for batch in dataloader:
    output = moe_model(batch)
    task_loss = criterion(output, labels)

    # Add auxiliary losses
    total_loss = task_loss + 0.01 * moe_model.aux_loss
    total_loss.backward()

2. Expert Capacity and Dropped Tokens

class CapacityFactorRouting(nn.Module):
    """
    Handles token overflow when too many route to one expert
    """
    def forward(self, x, router_weights, selected_experts):
        # router_weights, selected_experts: [num_tokens, top_k] (flattened over batch and sequence)
        batch_size, seq_len, d_model = x.shape
        num_tokens = batch_size * seq_len
        x_flat = x.view(num_tokens, d_model)

        # Calculate expert capacity
        tokens_per_expert = (num_tokens * self.top_k) // self.num_experts
        expert_capacity = int(tokens_per_expert * self.capacity_factor)  # 1.25 typical

        # Track how many tokens each expert has received
        expert_counts = torch.zeros(self.num_experts, device=x.device)

        # Process tokens, drop if expert is at capacity
        outputs = torch.zeros_like(x_flat)
        dropped_tokens = 0

        for token_idx in range(num_tokens):
            for k in range(self.top_k):
                expert_idx = selected_experts[token_idx, k]

                if expert_counts[expert_idx] < expert_capacity:
                    # Expert has capacity - process token
                    weight = router_weights[token_idx, k]
                    expert_out = self.experts[expert_idx](x_flat[token_idx])
                    outputs[token_idx] += weight * expert_out
                    expert_counts[expert_idx] += 1
                else:
                    # Expert at capacity - drop token
                    dropped_tokens += 1

        # Log dropped tokens (important metric!)
        self.dropped_tokens_ratio = dropped_tokens / (num_tokens * self.top_k)

        return outputs.view(batch_size, seq_len, d_model)

# Monitor during training
if model.dropped_tokens_ratio > 0.01:  # More than 1% dropped
    print(f"Warning: {model.dropped_tokens_ratio:.1%} tokens dropped")
    # Consider: increase capacity_factor or improve load balancing

3. Expert Initialization

def initialize_moe_experts(moe_layer, pretrained_ffn=None):
    """
    2025 best practice: initialize experts from pretrained FFN
    Then add noise for diversity
    """
    if pretrained_ffn is not None:
        # Start from dense model
        base_params = pretrained_ffn.state_dict()

        for expert_idx in range(moe_layer.num_experts):
            # Copy base parameters
            moe_layer.experts[expert_idx].load_state_dict(base_params)

            # Add small random noise for diversity
            with torch.no_grad():
                for param in moe_layer.experts[expert_idx].parameters():
                    param.add_(torch.randn_like(param) * 0.01)
    else:
        # Standard initialization
        for expert in moe_layer.experts:
            expert.apply(lambda m: nn.init.xavier_uniform_(m.weight)
                         if isinstance(m, nn.Linear) else None)

PoE Training Challenges

1. Gradient Imbalance

class PoEWithGradientBalancing(nn.Module):
    """
    Ensure all experts receive useful gradients
    Assumes self.encoders (nn.ModuleDict) and self.modality_importance
    (a learnable nn.Parameter of shape [num_modalities]) are set in __init__
    """
    def forward(self, inputs):
        log_probs = []
        expert_confidences = []

        for modality, encoder in self.encoders.items():
            if inputs.get(modality) is not None:
                logits = encoder(inputs[modality])
                log_prob = F.log_softmax(logits, dim=-1)
                log_probs.append(log_prob)

                # Track expert confidence (for weighting gradients)
                confidence = F.softmax(logits, dim=-1).max(dim=-1).values
                expert_confidences.append(confidence.mean())

        # Combine with learned modality weights
        modality_weights = F.softmax(self.modality_importance, dim=0)

        weighted_log_probs = []
        for i, log_prob in enumerate(log_probs):
            # Weight by modality importance (learnable)
            weighted = log_prob + torch.log(modality_weights[i] + 1e-10)
            weighted_log_probs.append(weighted)

        combined = torch.stack(weighted_log_probs).sum(dim=0)

        # Store for analysis
        self.expert_confidences = expert_confidences

        return combined

2. Preventing Expert Collapse

def train_poe_with_diversity_loss(model, batch, labels):
    """
    Add diversity loss to prevent experts from becoming identical
    """
    # Forward pass
    outputs = model(batch)
    task_loss = F.cross_entropy(outputs, labels)

    # Diversity loss: penalize similarity between experts so they specialize
    expert_outputs = [encoder(batch[mod]) for mod, encoder in model.encoders.items()]
    batch_size = labels.shape[0]

    # Compute pairwise similarities
    similarity_loss = 0
    num_pairs = 0
    for i in range(len(expert_outputs)):
        for j in range(i + 1, len(expert_outputs)):
            # We want experts to be different (high diversity)
            similarity = F.cosine_similarity(
                expert_outputs[i].view(batch_size, -1),
                expert_outputs[j].view(batch_size, -1),
                dim=1
            ).mean()
            # Penalize high similarity
            similarity_loss += similarity
            num_pairs += 1

    similarity_loss = similarity_loss / num_pairs

    # Total loss: adding the similarity term penalizes redundancy, encouraging diversity
    total_loss = task_loss + 0.1 * similarity_loss

    return total_loss

PoE for Visual Generation (2025):

  • Goal: Compose the outputs or “opinions” of modern pretrained models—like diffusion models, video generators, Vision-Language Models (VLMs), and even physics engines or simulators—for controllable visual (image/video) synthesis.
  • How:
    • Each “expert” can be a large pretrained model (not just a small probabilistic model).
    • Some experts are generative models (e.g., a diffusion model for images, or a video generator), others are discriminators like VLMs or physics simulators that function as reward/constraint functions.
    • They sample from the product of these experts to generate a final output that satisfies all constraints—e.g., matching realism, language description, physics validity, and user control.
  • Efficient Sampling: Introduces a new, practical inference-time framework combining Annealed Importance Sampling (AIS) and Sequential Monte Carlo (SMC) to efficiently generate samples from these product distributions—important since simple approaches (like Gibbs sampling) are too slow or impractical with high-dimensional data and large models. A toy, single-step version of this resampling idea is sketched after this list.
  • No Retraining Needed: Assembles these powerful experts at inference time—allowing flexible composition without re-training a huge new model.
  • Applications: Enables user-controllable, physically-correct, prompt-following image/video generation—e.g., “put this object here per the graphics engine, make it look realistic per the Diffusion Model, obey physics per the simulator, and follow this text instruction per the VLM.”
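To make the inference-time composition concrete, here is a deliberately simplified, hypothetical sketch of a single importance-resampling step over a product of experts (the names generator and score_fns are placeholders; the frameworks described above run annealed importance sampling / SMC over many intermediate steps rather than one resampling pass):

import torch

def sample_from_product(generator, score_fns, num_candidates=64, num_samples=4):
    """
    generator: callable returning a batch of candidate samples, shape [N, ...]
    score_fns: list of callables mapping candidates to per-sample log-scores
               (e.g., a VLM text-match score, a physics-validity score)
    """
    # Proposal: draw candidates from the generative expert
    candidates = generator(num_candidates)

    # Product of experts in log space: sum the experts' log-scores
    total_log_score = torch.zeros(num_candidates)
    for score_fn in score_fns:
        total_log_score = total_log_score + score_fn(candidates)

    # Resample candidates in proportion to the combined score
    weights = torch.softmax(total_log_score, dim=0)
    idx = torch.multinomial(weights, num_samples, replacement=True)
    return candidates[idx]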

Hybrid Architectures (2025 Frontier)

Shared + Routed Experts (DeepSeek-V2 Style)

The 2025 innovation: combine dense (shared) and sparse (routed) experts.

class SharedRoutedMoE(nn.Module):
    """
    Architecture used in DeepSeek-V2, Snowflake Arctic
    - Always-on shared experts (dense)
    - Conditionally-activated routed experts (sparse)
    Best of both worlds!
    """
    def __init__(self, d_model, num_shared=2, num_routed=64, top_k=6):
        super().__init__()

        # Always-active shared experts
        self.shared_experts = nn.ModuleList([
            FeedForwardExpert(d_model) for _ in range(num_shared)
        ])

        # Conditionally-active routed experts
        self.routed_experts = nn.ModuleList([
            FeedForwardExpert(d_model) for _ in range(num_routed)
        ])

        self.router = nn.Linear(d_model, num_routed)
        self.top_k = top_k

    def forward(self, x):
        # Step 1: Shared experts (always computed)
        shared_output = sum(expert(x) for expert in self.shared_experts)
        shared_output = shared_output / len(self.shared_experts)

        # Step 2: Route to top-k experts
        router_logits = self.router(x)
        top_k_weights, top_k_indices = torch.topk(
            F.softmax(router_logits, dim=-1),
            k=self.top_k,
            dim=-1
        )

        # Step 3: Compute selected routed experts
        routed_output = self.execute_routed_experts(
            x, top_k_weights, top_k_indices
        )

        # Step 4: Combine shared and routed
        output = shared_output + routed_output

        return output

# Advantage: Shared experts handle common patterns
# Routed experts handle specialized cases

MoE with PoE Groups

class GroupedMoEPoE(nn.Module):
    """
    Multi-modal MoE where each modality has PoE experts
    Used in: Advanced vision-language models
    """
    def __init__(self, modalities, hidden_dim, experts_per_modality=4, top_k_modalities=2):
        super().__init__()
        self.modalities = list(modalities)

        # For each modality, create a group of experts (use PoE within)
        self.modality_groups = nn.ModuleDict({
            modality: nn.ModuleList([
                ModalityExpert(modality) for _ in range(experts_per_modality)
            ]) for modality in self.modalities
        })

        # Router selects which modality groups to use
        self.modality_router = nn.Linear(hidden_dim, len(self.modalities))
        self.top_k = top_k_modalities

    def forward(self, multi_modal_input):
        # Step 1: Route to top-k modality groups (MoE)
        # Assumes a single pooled context vector, so routing_scores is 1-D
        routing_scores = self.modality_router(multi_modal_input['context'])
        top_k_weights, top_k_modalities = torch.topk(
            F.softmax(routing_scores, dim=-1),
            k=self.top_k,
            dim=-1
        )

        # Step 2: Within each selected modality, use PoE
        group_outputs = []
        for modality_idx in top_k_modalities:
            modality_name = self.modalities[int(modality_idx)]
            experts = self.modality_groups[modality_name]

            # PoE within this modality group
            log_probs = []
            for expert in experts:
                logits = expert(multi_modal_input[modality_name])
                log_probs.append(F.log_softmax(logits, dim=-1))

            # Product of experts
            group_output = torch.stack(log_probs).sum(dim=0)
            group_outputs.append(group_output)

        # Step 3: MoE combination of groups
        final_output = sum(
            w * out for w, out in zip(top_k_weights, group_outputs)
        )

        return final_output

# Use case: Video understanding (vision + audio + text)
# - Route to relevant modalities (MoE efficiency)
# - Within each modality, all aspects must agree (PoE robustness)

Deployment Considerations (2025)

MoE Inference Optimization

class OptimizedMoEInference(nn.Module):
    """
    Production-ready MoE with inference optimizations
    """
    def __init__(self, base_moe):
        super().__init__()
        self.moe = base_moe

        # Optimization 1: Expert caching
        self.expert_cache = {}

        # Optimization 2: Batch expert execution
        self.use_batched_experts = True

        # Optimization 3: Dynamic expert loading (for very large MoE)
        self.expert_on_device = set(range(min(8, len(self.moe.experts))))

    def forward(self, x):
        # Get routing decisions
        router_logits = self.moe.router(x)
        top_k_weights, top_k_experts = torch.topk(
            F.softmax(router_logits, dim=-1),
            k=self.moe.top_k,
            dim=-1
        )

        # Optimization: Group tokens by expert
        # This enables better batching and memory access
        expert_to_tokens = self.group_tokens_by_expert(
            top_k_experts, top_k_weights
        )

        # Execute experts in batches
        outputs = torch.zeros_like(x)
        for expert_idx, (token_indices, weights) in expert_to_tokens.items():
            # Optimization: Only load expert if needed
            if expert_idx not in self.expert_on_device:
                self.load_expert(expert_idx)

            # Batch process all tokens going to this expert
            # token_indices holds (batch, seq, k) coordinates
            batch_idx, seq_idx = token_indices[:, 0], token_indices[:, 1]
            expert_input = x[batch_idx, seq_idx]
            expert_output = self.moe.experts[expert_idx](expert_input)

            # Apply weights and accumulate
            outputs[batch_idx, seq_idx] += weights.unsqueeze(-1) * expert_output

        return outputs

    def group_tokens_by_expert(self, top_k_experts, top_k_weights):
        """
        Critical optimization: group by expert for batched execution
        Instead of token-by-token, process all tokens for expert X together
        """
        expert_to_tokens = {}
        batch_size, seq_len, top_k = top_k_experts.shape

        for expert_idx in range(self.moe.num_experts):
            # Find all tokens routed to this expert
            mask = (top_k_experts == expert_idx)
            token_indices = mask.nonzero(as_tuple=False)

            if len(token_indices) > 0:
                weights = top_k_weights[mask]
                expert_to_tokens[expert_idx] = (token_indices, weights)

        return expert_to_tokens

# Typical speedup: 2-3x over naive implementation

Serving Large MoE Models

"""
2025 Production Setup for 1T+ parameter MoE models
"""

# Strategy 1: Expert Parallelism
# Different GPUs host different experts
# Token dispatching across GPUs

class DistributedMoE:
def __init__(self, num_experts=128, experts_per_gpu=8):
self.num_gpus = num_experts // experts_per_gpu

# Assign experts to GPUs
self.expert_placement = {
expert_idx: expert_idx // experts_per_gpu
for expert_idx in range(num_experts)
}

def forward_distributed(self, x, top_k_experts):
# Group tokens by which GPU they need
gpu_to_tokens = self.group_by_gpu(top_k_experts)

# Send tokens to appropriate GPUs
results = []
for gpu_id, token_data in gpu_to_tokens.items():
result = self.send_to_gpu(gpu_id, token_data)
results.append(result)

# Gather and combine results
return self.gather_results(results)

# Strategy 2: Shared Expert on All GPUs, Route Experts Distributed
# Mixtral-style: Shared expert (2) on every GPU
# Routed experts (64) split across GPUs

Real-World Examples (2025)

Case Study 1: Mixtral 8x7B

"""
Mixtral Architecture (Simplified)
- 8 experts per MoE layer
- Top-2 routing
- 32 layers with MoE
"""

class MixtralMoELayer(nn.Module):
def __init__(self, hidden_size=4096, num_experts=8, top_k=2):
super().__init__()
self.num_experts = num_experts
self.top_k = top_k

# Router
self.gate = nn.Linear(hidden_size, num_experts, bias=False)

# 8 expert FFNs
self.experts = nn.ModuleList([
nn.Sequential(
nn.Linear(hidden_size, 14336), # Expand
nn.SiLU(),
nn.Linear(14336, hidden_size), # Project back
) for _ in range(num_experts)
])

def forward(self, hidden_states):
batch_size, seq_len, hidden_size = hidden_states.shape
hidden_states_flat = hidden_states.view(-1, hidden_size)

# Routing
router_logits = self.gate(hidden_states_flat)
routing_weights = F.softmax(router_logits, dim=-1)

# Top-2 experts per token
routing_weights, selected_experts = torch.topk(
routing_weights, self.top_k, dim=-1
)
routing_weights = routing_weights / routing_weights.sum(dim=-1, keepdim=True)

# Execute
final_hidden_states = torch.zeros_like(hidden_states_flat)

for expert_idx in range(self.num_experts):
expert_mask = (selected_experts == expert_idx)
if expert_mask.any():
# Tokens going to this expert
token_indices = expert_mask.nonzero()[:, 0]
k_indices = expert_mask.nonzero()[:, 1]

expert_input = hidden_states_flat[token_indices]
expert_output = self.experts[expert_idx](expert_input)

# Weight by routing score
weights = routing_weights[token_indices, k_indices].unsqueeze(-1)
final_hidden_states[token_indices] += weights * expert_output

return final_hidden_states.view(batch_size, seq_len, hidden_size)

# Configuration
# - 8 experts × 7B params each = 56B total params
# - But only 2 experts active per token = 14B active params
# - Effective: 4x parameter efficiency vs dense 14B model

Case Study 2: DeepSeek-V2 Innovations

"""
DeepSeek-V2 Architecture (2025)
Key innovations:
- 160 routed experts + 2 shared experts
- Expert-choice routing (not token-choice)
- Low-rank decomposition for expert efficiency
"""

class DeepSeekV2MoELayer(nn.Module):
def __init__(self, hidden_size=4096,
num_shared_experts=2,
num_routed_experts=160,
top_k=6,
capacity_factor=1.25):
super().__init__()

# Innovation 1: Shared experts (always active)
self.shared_experts = nn.ModuleList([
LowRankFFNExpert(hidden_size, rank=1024)
for _ in range(num_shared_experts)
])

# Innovation 2: Many routed experts (sparse activation)
self.routed_experts = nn.ModuleList([
LowRankFFNExpert(hidden_size, rank=512) # Smaller rank
for _ in range(num_routed_experts)
])

self.num_routed_experts = num_routed_experts
self.top_k = top_k
self.capacity_factor = capacity_factor

# Router
self.gate = nn.Linear(hidden_size, num_routed_experts, bias=False)

def forward(self, hidden_states):
# Step 1: Shared experts (always computed)
shared_output = torch.stack([
expert(hidden_states) for expert in self.shared_experts
]).mean(dim=0)

# Step 2: Expert-choice routing
router_logits = self.gate(hidden_states)
routed_output = self.expert_choice_route(
hidden_states, router_logits
)

# Step 3: Combine
return shared_output + routed_output

def expert_choice_route(self, hidden_states, router_logits):
"""
Each expert selects top tokens (not tokens selecting experts)
Guarantees load balance!
"""
batch_size, seq_len, hidden_size = hidden_states.shape
num_tokens = batch_size * seq_len

# Capacity per expert
expert_capacity = int(
(num_tokens / self.num_routed_experts) * self.capacity_factor
)

# Flatten
hidden_flat = hidden_states.view(num_tokens, hidden_size)
router_logits_flat = router_logits.view(num_tokens, self.num_routed_experts)

# Each expert picks its top tokens
output = torch.zeros_like(hidden_flat)

for expert_idx in range(self.num_routed_experts):
# This expert's scores for all tokens
expert_scores = router_logits_flat[:, expert_idx]

# Select top-capacity tokens
top_scores, top_token_indices = torch.topk(
expert_scores, k=expert_capacity
)

# Process selected tokens
selected_tokens = hidden_flat[top_token_indices]
expert_output = self.routed_experts[expert_idx](selected_tokens)

# Add to output (normalized by score)
weights = F.softmax(top_scores, dim=-1).unsqueeze(-1)
output[top_token_indices] += weights * expert_output

return output.view(batch_size, seq_len, hidden_size)

class LowRankFFNExpert(nn.Module):
"""
Low-rank decomposition for memory efficiency
Instead of: hidden_size -> ffn_size -> hidden_size
Use: hidden_size -> rank -> ffn_size -> rank -> hidden_size
Saves parameters!
"""
def __init__(self, hidden_size, rank):
super().__init__()
self.down_proj = nn.Linear(hidden_size, rank)
self.gate_proj = nn.Linear(rank, rank * 4)
self.up_proj = nn.Linear(rank * 4, rank)
self.output_proj = nn.Linear(rank, hidden_size)

def forward(self, x):
x = self.down_proj(x) # Reduce dimension
x = F.silu(self.gate_proj(x)) # Expand and activate
x = self.up_proj(x) # Project back
x = self.output_proj(x) # To hidden size
return x

Conclusion

Mixture of Experts (MoE): “Let the specialists handle their specialties”

  • Routing mechanism selects experts
  • Weighted averaging of outputs
  • Sparse activation for efficiency
  • Great for scaling and diverse inputs

Product of Experts (PoE): “All experts must agree”

  • All experts always active
  • Multiplicative combination
  • Agreement amplifies confidence, disagreement reduces it
  • Great for consensus and multi-modal learning

Both are powerful techniques for combining multiple models, just with different philosophies. MoE is about efficient specialization, while PoE is about robust consensus. Choose based on your problem’s needs!