bird-of-paradise committed
Commit 354a706 · 0 Parent(s)

Initial commit

.DS_Store ADDED
Binary file (6.15 kB)
 
README.md ADDED
---
library_name: deepseek-moe
tags:
- mixture-of-experts
- transformers
- pytorch
- moe
- efficient-mixture-of-experts
pipeline_tag: text-generation
language: en
license: apache-2.0
---


# DeepSeek MoE Implementation
[![License](https://img.shields.io/badge/License-Apache%202.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)

*Note: This repository contains a modular implementation of the DeepSeek MoE architecture, not trained model weights.*

A clean, efficient implementation of DeepSeek's Mixture of Experts (MoE) architecture in PyTorch. This repository provides a simplified version of the architecture described in the DeepSeek paper, focusing on the core innovations that make their MoE approach unique.

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper. See the 'Related Implementations' section for the complete series.

<p align="center">
  <img src="./assets/moe_architecture.png" alt="DeepSeek MoE Architecture" width="600"/>
</p>

## Overview

Mixture of Experts (MoE) architectures enable dramatic scaling of model parameters while maintaining computational efficiency by activating only a subset of parameters for any given input. DeepSeek's approach introduces several key innovations to the MoE architecture that improve performance and efficiency.

Key features of this implementation:

- **Hybrid Expert Structure**: Combines shared experts (processing all tokens) with routed experts (processing specific tokens)
- **Efficient Top-K Routing**: Token-to-expert affinity calculation based on dot-product similarity
- **Multi-Level Load Balancing**: Cascading auxiliary losses at the expert, device, and communication levels
- **Device-Limited Routing**: Bounds communication costs in distributed training scenarios
- **Token-Dropping Strategy**: Optimizes computation by dropping tokens with low affinity scores

## Quick Start

```python
import torch
from src.moe import MixtureOfExperts

# Create input tensor
batch_size = 8
seq_length = 16
d_model = 512
inputs = torch.randn(batch_size, seq_length, d_model)

# Create MoE layer
moe = MixtureOfExperts(
    d_model=512,    # Input dimension
    d_expert=1024,  # Expert hidden dimension
    K=2,            # Top-K experts per token
    N_s=2,          # Number of shared experts
    N_r=8,          # Number of routed experts
    alpha1=0.01,    # Expert balance factor
    alpha2=0.01,    # Device balance factor
    alpha3=0.01,    # Communication balance factor
    D=4,            # Number of devices
    M=3             # Device limit for routing
)

# Forward pass
outputs, expert_loss, device_loss, commu_loss = moe(inputs)
```
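The layer returns its auxiliary balance losses alongside the output, so they can be added directly to the task loss during training. Below is a minimal sketch of that pattern; the optimizer choice and the placeholder task loss are illustrative assumptions, not part of this repository.

```python
import torch
from src.moe import MixtureOfExperts

moe = MixtureOfExperts(d_model=512, d_expert=1024, K=2, N_s=2, N_r=8,
                       alpha1=0.01, alpha2=0.01, alpha3=0.01, D=4, M=3)
optimizer = torch.optim.AdamW(moe.parameters(), lr=1e-4)  # illustrative choice

inputs = torch.randn(8, 16, 512)
outputs, expert_loss, device_loss, commu_loss = moe(inputs)

# Placeholder task loss; in a real model this would come from the LM head.
task_loss = outputs.pow(2).mean()

# The alpha factors are already applied inside the module, so the auxiliary
# losses can simply be added to the task loss.
loss = task_loss + expert_loss + device_loss + commu_loss
loss.backward()
optimizer.step()
```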

## Architecture Details

For a detailed explanation of the architecture, see [architecture.md](insights/architecture.md).

### DeepSeek MoE Key Innovations

The DeepSeek MoE architecture introduces several elegant design choices:

1. **Hybrid Expert Structure**: Using both shared experts and routed experts with residual connections maintains global information flow while allowing for specialization.

2. **Token-Expert Affinity**: Calculating token-to-expert similarity through dot products with expert centroids, similar to attention mechanisms.

3. **Multi-Level Balancing**: Cascading auxiliary losses that enforce balance at the expert, device, and communication levels, creating a holistic approach to load distribution.

4. **Device-Limited Routing**: Constraining each token to experts on at most M devices to bound communication costs.

## Implementation Details

The implementation consists of two main classes:

### 1. Expert

A feed-forward network with two linear transformations and a ReLU activation in between.

```python
Expert(x) = max(0, xW1 + b1)W2 + b2
```

### 2. MixtureOfExperts

The main MoE implementation that:
- Combines shared and routed experts
- Calculates token-to-expert affinities
- Applies top-K routing
- Calculates auxiliary balance losses

```python
MoE(x) = x + ∑ Expert^s_i(x) + ∑ gate(x;K)*Expert^r_i(x)
```

## Testing

Unit tests are provided to verify the correct functioning of:
- Expert computations
- MoE routing mechanisms
- Load balancing losses
- Residual connections

Run the tests with:

```bash
python -m src.tests.test_moe
```

## Related Implementations

This repository is part of a series implementing the key architectural innovations from the DeepSeek paper:

1. **[DeepSeek MoE](https://huggingface.co/bird-of-paradise/deepseek-moe)** (This Repository): Implementation of DeepSeek's Mixture of Experts architecture that enables efficient scaling of model parameters.

2. **[DeepSeek Multi-head Latent Attention](https://huggingface.co/bird-of-paradise/deepseek-mla)**: Implementation of DeepSeek's MLA mechanism for efficient KV cache usage during inference.

3. **[Transformer Implementation Tutorial](https://huggingface.co/datasets/bird-of-paradise/transformer-from-scratch-tutorial)**: A detailed tutorial on implementing the transformer architecture with explanations of key components.

Together, these implementations cover the core innovations that power DeepSeek's state-of-the-art performance. By combining the MoE architecture with Multi-head Latent Attention, you can build a complete DeepSeek-style model with improved training efficiency and inference performance.

## Contributing

Contributions are welcome! Feel free to:
- Report bugs and issues
- Submit pull requests for improvements
- Add additional test cases
- Provide documentation clarifications

Please ensure all tests pass before submitting pull requests.


## Citation

If you use this implementation in your research, please cite:

```bibtex
@misc{deepseek-moe-2025,
  author       = {Jen Wei},
  title        = {DeepSeek MoE Implementation},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/bird-of-paradise/deepseek-moe}}
}
```

## License

This project is licensed under the Apache License 2.0.


## Acknowledgements

This implementation is inspired by the DeepSeek paper and other open-source MoE implementations:

- [DeepSeek](https://github.com/deepseek-ai)
- [Switch Transformers](https://arxiv.org/abs/2101.03961)
- [GShard](https://arxiv.org/abs/2006.16668)
assets/.DS_Store ADDED
Binary file (6.15 kB)
 
assets/moe_architecture.png ADDED
insights/architecture.md ADDED
# DeepSeek MoE Architecture

This document provides a detailed explanation of the DeepSeek Mixture of Experts (MoE) architecture and what makes it unique compared to other MoE implementations.

## Core Concepts of MoE

At a high level, Mixture of Experts (MoE) is a neural network architecture that divides computation across specialized "expert" networks. Rather than passing all inputs through the entire network, MoE selectively activates only a subset of experts for each input. This approach enables scaling models to have significantly more parameters while maintaining reasonable computational costs, as only a fraction of the network is active for any given input.

## DeepSeek MoE Architecture: Key Innovations

DeepSeek's MoE implementation has several key innovations that distinguish it from previous MoE approaches such as GShard and Switch Transformers:

### 1. Hybrid Expert Structure: Shared + Routed Experts

One of the most distinctive features of DeepSeek MoE is its hybrid architecture that combines:

- **Shared Experts**: Process all tokens, providing a baseline transformation
- **Routed Experts**: Process only specific tokens they specialize in

The feed-forward network (FFN) output for token $t$ is calculated as:

$$\hat{h}_t = u_t + \sum^{N_s}_{i=1} FFN^s_i (u_t) + \sum^{N_r}_{i=1} g_{i,t} \, FFN^r_i (u_t)$$

Where:
- $u_t$ is the original input token representation
- $FFN^s_i$ is the i-th shared expert
- $FFN^r_i$ is the i-th routed expert
- $g_{i,t}$ is the gate value determining how much each routed expert contributes

This hybrid approach has several advantages (a small code sketch of this combination follows below):
- Shared experts maintain global information flow
- Routed experts can specialize in specific patterns
- The residual connection (the $u_t$ term) preserves the original token information
- Reduces knowledge redundancy among experts

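To make the formula concrete, here is a minimal, self-contained sketch of how the three terms combine for a single token. The experts are plain two-layer FFNs and the gate values are assumed to have been computed elsewhere; for the actual routing logic, see `src/moe.py`.

```python
import torch
import torch.nn as nn

d_model, d_expert, N_s, N_r = 64, 128, 2, 4

# Plain two-layer FFN experts (same structure as Expert in src/moe.py).
ffn = lambda: nn.Sequential(nn.Linear(d_model, d_expert), nn.ReLU(), nn.Linear(d_expert, d_model))
shared_experts = nn.ModuleList(ffn() for _ in range(N_s))
routed_experts = nn.ModuleList(ffn() for _ in range(N_r))

u_t = torch.randn(1, d_model)              # one token representation
g = torch.tensor([[0.6, 0.0, 0.4, 0.0]])   # gate values g_{i,t}; zeros for non-selected experts

# h_t = u_t + sum_i FFN^s_i(u_t) + sum_i g_{i,t} * FFN^r_i(u_t)
h_t = u_t.clone()
for expert in shared_experts:
    h_t = h_t + expert(u_t)
for i, expert in enumerate(routed_experts):
    h_t = h_t + g[:, i:i+1] * expert(u_t)
```
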
### 2. Token-Expert Affinity Calculation

The router determines which experts should process each token using a similarity-based mechanism:

$$ s_{i,t} = \mathrm{Softmax}_i(u_t^T e_i) $$

Where:
- $s_{i,t}$ is the token-to-expert affinity
- $e_i$ is the centroid of the i-th routed expert
- $u_t$ is the token representation

This is conceptually similar to the attention mechanism's query-key dot product ($QK^T$). It measures the similarity between a token vector and expert centroids:
- Similar vectors (token and expert specialty) → large dot product
- Different vectors → small dot product
- Softmax converts these similarities into a probability distribution

The router then selects the top-K experts for each token:

$$ g_{i,t} = \begin{cases} s_{i,t}, & \text{if } s_{i,t} \in \mathrm{TopK}(\{s_{j,t} \mid 1 \le j \le N_r\}, K_r) \\ 0, & \text{otherwise} \end{cases} $$

This approach combines soft routing (through the affinity scores) and hard routing (through the TopK selection), allowing for more nuanced expert specialization. A small numerical example of this gating step is shown below.

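For instance, with $N_r = 4$ routed experts and $K = 2$, the gating keeps only the two largest affinities and zeroes the rest. The affinity numbers below are made up purely for illustration; `src/moe.py` additionally renormalizes the kept values so they sum to one.

```python
import torch
import torch.nn.functional as F

# Affinities s_{i,t} for one token over N_r = 4 routed experts (illustrative numbers).
affinity = torch.tensor([0.10, 0.45, 0.05, 0.40])
K = 2

values, indexes = torch.topk(affinity, K)       # values = [0.45, 0.40], indexes = [1, 3]
gate = torch.zeros_like(affinity).scatter_(0, indexes, values)
print(gate)                                     # tensor([0.0000, 0.4500, 0.0000, 0.4000])

# src/moe.py renormalizes the kept values so they sum to 1:
renorm = torch.zeros_like(affinity).scatter_(0, indexes, F.softmax(values, dim=-1))
```
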
### 3. Multi-Level Load Balancing

DeepSeek MoE implements a cascading auxiliary loss structure to ensure balance at three different levels:

#### Expert-Level Balance Loss

$$ \mathcal{L}_{ExpBal} = \alpha_1 \sum_i f_i P_i $$

Where:
- $f_i$ is the fraction of tokens routed to expert i
- $P_i$ is the average routing probability for expert i

This prevents "expert collapse", where only a few experts get consistently used. A short sketch of this calculation follows below.

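As a rough sketch of how this loss can be computed from the gate and affinity tensors (mirroring the approach in `src/moe.py`, where $T$ is the total number of tokens in the batch):

```python
import torch

# gate: sparse top-K gate values; affinity: full softmax routing probabilities
batch, seq, N_r, K, alpha1 = 8, 16, 4, 2, 0.01
affinity = torch.softmax(torch.randn(batch, seq, N_r), dim=-1)
values, idx = torch.topk(affinity, K)
gate = torch.zeros_like(affinity).scatter_(-1, idx, values)

T = batch * seq                                        # total number of tokens
f = N_r / (K * T) * torch.count_nonzero(gate, (0, 1))  # scaled fraction of tokens hitting each expert
P = affinity.sum((0, 1)) / T                           # average routing probability per expert
expert_loss = alpha1 * (f * P).sum()
```
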
#### Device-Level Balance Loss

$$ \mathcal{L}_{DevBal} = \alpha_2 \sum_i f'_i P'_i $$

Where:
- $f'_i$ is the average fraction of tokens routed to experts on device i
- $P'_i$ is the sum of routing probabilities for experts on device i

This ensures computation is evenly distributed across hardware devices.

#### Communication Balance Loss

$$ \mathcal{L}_{CommBal} = \alpha_3 \sum_i f''_i P''_i $$

Where:
- $f''_i$ measures the fraction of tokens sent to device i
- $P''_i$ is the sum of routing probabilities for experts on device i

This manages network traffic patterns between devices, which is critical for distributed training.

The multi-level approach is particularly effective because imbalance at any level causes inefficiency:
- Expert imbalance → wasted model capacity
- Device imbalance → some hardware sits idle
- Communication imbalance → network congestion

### 4. Device-Limited Routing

For distributed training, DeepSeek MoE implements a device-limited routing mechanism that bounds communication costs:

1. For each token, select M devices that have experts with the highest affinity scores
2. Perform top-K selection only among experts on these M devices

This approach ensures that each token's computation is limited to a manageable number of devices, reducing cross-device communication overhead. Empirically, setting M ≈ 3 achieves performance comparable to unrestricted routing (see the sketch below).

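The note in `src/moe.py` states that device-limited routing is left out of the simplified implementation, so the following is only a minimal sketch of one way to realize the two-stage selection, assuming routed experts are split evenly and contiguously across `D` devices.

```python
import torch

batch, seq, N_r, D, M, K = 2, 4, 8, 4, 2, 2           # 8 routed experts on 4 devices, M=2, top-K=2
affinity = torch.softmax(torch.randn(batch, seq, N_r), dim=-1)

experts_per_device = N_r // D
# Highest affinity any expert on each device achieves for each token: [batch, seq, D]
per_device = affinity.view(batch, seq, D, experts_per_device).max(dim=-1).values

# Stage 1: keep the M best devices per token, mask out experts on all other devices.
_, top_devices = torch.topk(per_device, M, dim=-1)                       # [batch, seq, M]
device_mask = torch.zeros(batch, seq, D).scatter_(-1, top_devices, 1.0)
expert_mask = device_mask.repeat_interleave(experts_per_device, dim=-1)  # [batch, seq, N_r]

# Stage 2: ordinary top-K gating over the surviving experts.
masked_affinity = affinity * expert_mask
values, idx = torch.topk(masked_affinity, K, dim=-1)
gate = torch.zeros_like(affinity).scatter_(-1, idx, values)
```
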
### 5. Token-Dropping Strategy

To further optimize computation, DeepSeek MoE implements a device-level token-dropping strategy:

1. Compute the average computational budget for each device (capacity factor = 1.0)
2. Drop tokens with the lowest affinity scores on each device until reaching the budget
3. Ensure tokens from approximately 10% of training sequences are never dropped

This approach provides flexibility to adjust computation vs. quality tradeoffs during inference while maintaining consistency between training and inference. A sketch of the per-device capacity check is shown below.

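Token dropping is also not part of the simplified `src/moe.py`. As an illustration only, the sketch below drops the lowest-affinity assignments routed to a single device once its budget is exceeded; the tensor values and sizes are made up.

```python
import torch

# Affinity of the token assignments routed to one device (flattened across the batch).
assignment_affinity = torch.tensor([0.42, 0.05, 0.31, 0.12, 0.27, 0.08])

num_tokens, capacity_factor, num_devices = 16, 1.0, 4
# Average per-device budget with capacity factor 1.0 (here: 16 tokens / 4 devices = 4 slots).
budget = int(capacity_factor * num_tokens / num_devices)

if assignment_affinity.numel() > budget:
    # Keep the `budget` assignments with the highest affinity, drop the rest.
    keep_idx = torch.topk(assignment_affinity, budget).indices
    keep_mask = torch.zeros_like(assignment_affinity, dtype=torch.bool)
    keep_mask[keep_idx] = True
else:
    keep_mask = torch.ones_like(assignment_affinity, dtype=torch.bool)

# keep_mask now marks which assignments this device actually processes.
```
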
## Sequence Understanding with Token-Level Routing

Despite routing happening at the token level, DeepSeek MoE maintains sequence understanding through several mechanisms:

1. Self-attention layers before and after the MoE layer process the whole sequence together
2. The residual connection preserves the original token information
3. Shared experts process all tokens, providing a base transformation
4. Layer normalization helps integrate the different expert contributions

This design allows each token to get specialized processing from relevant experts while the attention layers ensure these individually-processed tokens still work together to understand the sequence as a whole.

## Comparison with Other MoE Implementations

### vs. Switch Transformers
- **Routing Granularity**: Switch routes each token to exactly one expert; DeepSeek routes to multiple experts (top-K)
- **Expert Structure**: Switch uses standard FFNs; DeepSeek uses both shared and routed experts
- **Load Balancing**: DeepSeek uses a more sophisticated multi-level balancing approach

### vs. GShard
- **Expert Specialization**: DeepSeek uses finer granularity for better specialization
- **Knowledge Sharing**: DeepSeek's shared experts reduce redundancy
- **Load Balancing**: DeepSeek's cascade of balance losses provides more robust load distribution
- **Token Handling**: DeepSeek uses a simplified but effective token-dropping strategy

## Integration in the Transformer Architecture

DeepSeek MoE layers replace the standard feed-forward networks in transformer blocks, while keeping the attention mechanism intact (a wiring sketch follows the diagram):

```
Transformer Block
├── RMS Norm
├── Attention
├── Residual Connection
├── RMS Norm
├── DeepSeekMoE Layer
│   ├─┬─ Shared Experts (process all tokens)
│   │ │
│   │ ├─ Router → Top-K Selection
│   │ │
│   │ └─ Routed Experts (process tokens via routing)
│   │
│   └── Combine outputs (residual + shared + routed)
└── Residual Connection
```

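As an illustration of that wiring, here is a minimal pre-norm transformer block that swaps the usual FFN for the `MixtureOfExperts` layer. The `RMSNorm` class and the attention configuration are simplified assumptions for this sketch, not code from this repository.

```python
import torch
import torch.nn as nn
from src.moe import MixtureOfExperts

class RMSNorm(nn.Module):
    """Minimal RMSNorm, a simplified stand-in for the normalization in the diagram."""
    def __init__(self, d_model, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(d_model))
        self.eps = eps

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * x * rms

class MoETransformerBlock(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.norm1 = RMSNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = RMSNorm(d_model)
        self.moe = MixtureOfExperts(d_model=d_model, d_expert=4 * d_model, K=2,
                                    N_s=2, N_r=8, alpha1=0.01, alpha2=0.01,
                                    alpha3=0.01, D=4, M=3)

    def forward(self, x):
        # Attention sub-layer with residual connection
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)
        x = x + attn_out
        # MoE sub-layer; the module already adds its own residual of h internally,
        # so subtract h to keep only the expert contribution on top of x.
        h = self.norm2(x)
        moe_out, expert_loss, device_loss, commu_loss = self.moe(h)
        x = x + (moe_out - h)
        return x, expert_loss + device_loss + commu_loss
```
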
## Conclusion

The DeepSeek MoE architecture represents a sophisticated approach to building large-scale language models that balance parameter count and computational efficiency. By using a hybrid expert structure, intelligent routing, and multi-level load balancing, DeepSeek MoE achieves better performance than previous MoE implementations with the same computational budget.

The design reflects careful consideration of both theoretical aspects (how experts specialize and share knowledge) and practical engineering challenges (distributed training efficiency, communication patterns). This makes DeepSeek MoE not just an academic advancement but a practical approach for deploying large language models efficiently.
src/.DS_Store ADDED
Binary file (6.15 kB)
 
src/__init__.py ADDED
"""
DeepSeek Mixture of Experts Implementation
Copyright (c) 2025
Implementation of the Mixture of Experts mechanism from the DeepSeek-V2 paper.
"""

from .moe import Expert, MixtureOfExperts

__version__ = "0.1.0"
__all__ = ["Expert", "MixtureOfExperts"]
src/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (492 Bytes)
 
src/__pycache__/moe.cpython-311.pyc ADDED
Binary file (9.01 kB)
 
src/moe.py ADDED
import torch
import torch.nn as nn
import torch.nn.functional as F

# Note: This is a simplified version of the communication balance loss.
# For the complete implementation with proper token-device mapping,
# the device-limited routing implementation,
# and more efficient calculations, please contact the author.

class Expert(nn.Module):
    """
    Position-wise feed-forward network.
    This consists of two linear transformations with a ReLU activation in between.

    FFN(x) = max(0, xW1 + b1)W2 + b2
    d_model: embedding dimension (e.g., 512)
    d_expert: expert hidden dimension (e.g., 256)
    """
    def __init__(self, d_model, d_expert):
        super().__init__()
        self.d_model = d_model
        self.d_expert = d_expert

        # Linear transformation y = xW + b
        self.fc1 = nn.Linear(self.d_model, self.d_expert, bias=True)
        self.fc2 = nn.Linear(self.d_expert, self.d_model, bias=True)

        # Xavier initialization (can help with training stability)
        nn.init.xavier_uniform_(self.fc1.weight)
        nn.init.xavier_uniform_(self.fc2.weight)

    def forward(self, input):
        # Check that the input matches the first FF layer dimension
        batch_size, seq_length, d_input = input.size()
        assert self.d_model == d_input, "d_model must be the same dimension as the input"

        # max(0, xW_1 + b_1)W_2 + b_2
        return self.fc2(F.relu(self.fc1(input)))


class MixtureOfExperts(nn.Module):
    """
    Mixture of Experts as in DeepSeek.

    MoE(x) = x + \sum Expert^s_i(x) + \sum gate(x;K)*Expert^r_i(x)
    d_model: embedding dimension (e.g., 512)
    d_expert: expert hidden dimension (e.g., 256)
    K : top-K gate
    N_s: number of shared experts
    N_r: number of routed experts
    alpha1: hyper-parameter; expert-level balance factor
    alpha2: hyper-parameter; device-level balance factor
    alpha3: hyper-parameter; communication balance factor

    D: number of devices in the distributed system
    M: number of devices for Device-Limited Routing
    """
    def __init__(self, d_model, d_expert, K, N_s, N_r, alpha1, alpha2, alpha3, D=4, M=3):
        super().__init__()

        assert D < N_r, "Number of partitions needs to be less than the number of routed experts"
        assert M <= D, "Number of devices for Device-Limited Routing cannot exceed the total number of devices"

        self.d_model = d_model
        self.d_expert = d_expert

        self.K = K
        self.N_s = N_s
        self.N_r = N_r
        self.alpha1 = alpha1
        self.alpha2 = alpha2
        self.alpha3 = alpha3

        self.D = D  # number of devices available
        self.M = M  # for Device-Limited Routing

        # Initialize shared experts and routed experts
        self.shared_experts = nn.ModuleList([
            Expert(self.d_model, self.d_expert)
            for _ in range(N_s)
        ])

        self.routed_experts = nn.ModuleList([
            Expert(self.d_model, self.d_expert)
            for _ in range(N_r)
        ])

        # Initialize centroids: learnable parameters, one vector per routed expert
        self.expert_centroids = nn.Parameter(
            torch.randn(N_r, d_model)  # [num_routed_experts, d_model]
        )
        nn.init.xavier_uniform_(self.expert_centroids)

    def forward(self, input):
        # Check that the input matches the model dimension
        batch_size, seq_length, d_input = input.size()
        assert self.d_model == d_input, "d_model must be the same dimension as the input"

        # Shared experts process every token
        shared_output = torch.zeros_like(input)
        for expert in self.shared_experts:
            shared_output += expert(input)  # [batch, seq, d_model]

        # Calculate similarity between input tokens and expert centroids
        self.similarities = torch.matmul(input, self.expert_centroids.transpose(0, 1))  # [batch, seq, N_r]
        assert self.similarities.size(dim=-1) == self.N_r, \
            "last dimension of similarities must be the same as the number of routed experts"
        affinity = F.softmax(self.similarities, dim=-1)  # [batch, seq, N_r]

        # Apply top-K to calculate the gate
        values, indexes = torch.topk(affinity, self.K)
        values = F.softmax(values, dim=-1)  # Renormalize the top-K values
        gate = torch.zeros_like(affinity).scatter_(2, indexes, values)  # [batch, seq, N_r]
        self.last_gate = gate  # stored for testing

        routed_output = torch.zeros_like(input)
        for i in range(self.N_r):
            routed_output += gate[:, :, i].unsqueeze(-1) * self.routed_experts[i](input)

        ## Auxiliary losses for load balance
        # Expert-Level Balance Loss
        T = batch_size * seq_length  # total number of tokens
        f = self.N_r / (self.K * T) * torch.count_nonzero(gate, (0, 1))
        P = 1 / T * affinity.sum((0, 1))
        expert_loss = self.alpha1 * torch.matmul(f, P)

        # Device-Level Balance Loss
        # torch.stack (rather than torch.tensor) keeps the graph so the loss can backpropagate through P
        f1 = torch.stack([partition.mean() for partition in torch.tensor_split(f, self.D)])
        P1 = torch.stack([partition.sum() for partition in torch.tensor_split(P, self.D)])
        device_loss = self.alpha2 * torch.matmul(f1, P1)

        # Communication Balance Loss
        f2 = self.D / (self.M * T) * torch.stack(
            [torch.count_nonzero(partition, (0, 1)).sum() for partition in torch.tensor_split(gate, self.D, dim=-1)]
        )
        P2 = P1
        commu_loss = self.alpha3 * torch.matmul(f2, P2)

        return input + shared_output + routed_output, expert_loss, device_loss, commu_loss
src/tests/__init__.py ADDED
File without changes
src/tests/__pycache__/__init__.cpython-311.pyc ADDED
Binary file (182 Bytes)
 
src/tests/__pycache__/test_moe.cpython-311.pyc ADDED
Binary file (15.3 kB)
 
src/tests/test_moe.py ADDED
import unittest
import os
import sys

import torch
import torch.nn as nn
import torch.nn.functional as F

try:
    # Preferred: run as a module from the repo root, e.g. `python -m src.tests.test_moe`
    from ..moe import Expert, MixtureOfExperts
except ImportError:
    # Fallback: allow running this file directly by adding the parent directory to the path
    sys.path.insert(0, os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
    from moe import Expert, MixtureOfExperts


class TestExpert(unittest.TestCase):
    """Test the Expert module of the DeepSeek MoE implementation."""

    def setUp(self):
        # Set random seed for reproducibility
        torch.manual_seed(42)

        # Common parameters for tests
        self.batch_size = 8
        self.seq_len = 16
        self.d_model = 64
        self.d_expert = 128

        # Create sample input tensor
        self.inputs = torch.randn(self.batch_size, self.seq_len, self.d_model)

        # Create expert
        self.expert = Expert(self.d_model, self.d_expert)

    def test_expert_init(self):
        """Test expert initialization."""
        # Check layer parameters
        self.assertEqual(self.expert.fc1.in_features, self.d_model)
        self.assertEqual(self.expert.fc1.out_features, self.d_expert)
        self.assertEqual(self.expert.fc2.in_features, self.d_expert)
        self.assertEqual(self.expert.fc2.out_features, self.d_model)

        # Check if Xavier initialization was applied
        # Just check if weights are within a reasonable range
        self.assertTrue(torch.all(self.expert.fc1.weight < 1.0))
        self.assertTrue(torch.all(self.expert.fc1.weight > -1.0))

    def test_expert_forward(self):
        """Test the forward pass of the expert module."""
        output = self.expert(self.inputs)

        # Check output shape
        self.assertEqual(output.shape, self.inputs.shape)

        # Ensure output is different from input (transformation happened)
        self.assertFalse(torch.allclose(output, self.inputs))

        # Test the expert with a single example (easier to verify calculations)
        single_input = torch.randn(1, 1, self.d_model)

        # Step-by-step execution to verify correctness
        fc1_output = self.expert.fc1(single_input)
        relu_output = F.relu(fc1_output)
        expected_output = self.expert.fc2(relu_output)

        actual_output = self.expert(single_input)

        # Verify that the output matches our manual calculation
        self.assertTrue(torch.allclose(actual_output, expected_output))


class TestMixtureOfExperts(unittest.TestCase):
    """Test the MixtureOfExperts module."""

    def setUp(self):
        # Set random seed for reproducibility
        torch.manual_seed(42)

        # Common parameters for tests
        self.batch_size = 8
        self.seq_len = 16
        self.d_model = 64
        self.d_expert = 128
        self.K = 2          # Top-K experts per token
        self.N_s = 2        # Number of shared experts
        self.N_r = 8        # Number of routed experts
        self.alpha1 = 0.01  # Expert balance factor
        self.alpha2 = 0.01  # Device balance factor
        self.alpha3 = 0.01  # Communication balance factor
        self.D = 4          # Number of devices
        self.M = 3          # Device limit for routing

        # Create sample input tensor
        self.inputs = torch.randn(self.batch_size, self.seq_len, self.d_model)

        # Create MoE layer
        self.moe = MixtureOfExperts(
            d_model=self.d_model,
            d_expert=self.d_expert,
            K=self.K,
            N_s=self.N_s,
            N_r=self.N_r,
            alpha1=self.alpha1,
            alpha2=self.alpha2,
            alpha3=self.alpha3,
            D=self.D,
            M=self.M
        )

    def test_moe_init(self):
        """Test MoE initialization."""
        # Check expert counts
        self.assertEqual(len(self.moe.shared_experts), self.N_s)
        self.assertEqual(len(self.moe.routed_experts), self.N_r)

        # Check centroid initialization
        self.assertEqual(self.moe.expert_centroids.shape, (self.N_r, self.d_model))

    def test_moe_forward(self):
        """Test the forward pass of the MoE layer."""
        output, expert_loss, device_loss, commu_loss = self.moe(self.inputs)

        # Check output shape
        self.assertEqual(output.shape, self.inputs.shape)

        # Check that losses are scalars
        self.assertEqual(expert_loss.dim(), 0)
        self.assertEqual(device_loss.dim(), 0)
        self.assertEqual(commu_loss.dim(), 0)

        # Check that losses are non-negative
        self.assertGreaterEqual(expert_loss.item(), 0.0)
        self.assertGreaterEqual(device_loss.item(), 0.0)
        self.assertGreaterEqual(commu_loss.item(), 0.0)

    def test_topk_routing(self):
        """Test the top-K routing mechanism."""
        # Forward pass to compute gate values
        self.moe(self.inputs)

        # Check gate shape
        self.assertEqual(self.moe.last_gate.shape, (self.batch_size, self.seq_len, self.N_r))

        # Check that exactly K experts are activated per token
        for b in range(self.batch_size):
            for s in range(self.seq_len):
                # Count non-zero gate values for this token
                active_experts = torch.count_nonzero(self.moe.last_gate[b, s])
                self.assertEqual(active_experts, self.K)

                # Check that gate values sum to approximately 1.0
                gate_sum = self.moe.last_gate[b, s].sum().item()
                self.assertAlmostEqual(gate_sum, 1.0, places=5)

    def test_expert_contribution(self):
        """Test that both shared and routed experts contribute to the output."""
        # Create an input where we can track contributions
        special_input = torch.zeros_like(self.inputs)
        special_input[:, 0, 0] = 1.0  # Set a specific element to 1.0

        # Process with shared experts only (zero out routed expert centroids)
        with torch.no_grad():
            self.moe.expert_centroids.data.fill_(0.0)
            shared_only_output, _, _, _ = self.moe(special_input)

        # Process with both shared and routed experts
        with torch.no_grad():
            # Reset centroids
            nn.init.xavier_uniform_(self.moe.expert_centroids)
            full_output, _, _, _ = self.moe(special_input)

        # Check that outputs are different, indicating routed experts contributed
        self.assertFalse(torch.allclose(shared_only_output, full_output))

    def test_residual_connection(self):
        """Test that the residual connection is properly implemented."""
        # Zero out all expert weights to isolate residual behavior
        with torch.no_grad():
            for expert in self.moe.shared_experts:
                expert.fc1.weight.fill_(0.0)
                expert.fc1.bias.fill_(0.0)
                expert.fc2.weight.fill_(0.0)
                expert.fc2.bias.fill_(0.0)

            for expert in self.moe.routed_experts:
                expert.fc1.weight.fill_(0.0)
                expert.fc1.bias.fill_(0.0)
                expert.fc2.weight.fill_(0.0)
                expert.fc2.bias.fill_(0.0)

            # Reset centroids to ensure routing still happens
            nn.init.xavier_uniform_(self.moe.expert_centroids)

        # Process input
        output, _, _, _ = self.moe(self.inputs)

        # With zero weights, output should match input (residual connection)
        self.assertTrue(torch.allclose(output, self.inputs))


class TestLoadBalancing(unittest.TestCase):
    """Test the load balancing mechanisms of the MixtureOfExperts."""

    def setUp(self):
        # Set random seed for reproducibility
        torch.manual_seed(42)

        # Common parameters for tests
        self.batch_size = 16
        self.seq_len = 32
        self.d_model = 64
        self.d_expert = 128
        self.K = 2
        self.N_s = 2
        self.N_r = 8

        # Create sample input tensor
        self.inputs = torch.randn(self.batch_size, self.seq_len, self.d_model)

    def test_expert_balance_loss(self):
        """Test that the expert balance loss penalizes imbalanced routing."""
        # Create two MoE layers with different alpha1 values
        moe_balanced = MixtureOfExperts(
            d_model=self.d_model,
            d_expert=self.d_expert,
            K=self.K,
            N_s=self.N_s,
            N_r=self.N_r,
            alpha1=1.0,  # High expert balance factor
            alpha2=0.0,
            alpha3=0.0,
            D=2,
            M=2
        )

        moe_unbalanced = MixtureOfExperts(
            d_model=self.d_model,
            d_expert=self.d_expert,
            K=self.K,
            N_s=self.N_s,
            N_r=self.N_r,
            alpha1=0.0,  # No expert balance factor
            alpha2=0.0,
            alpha3=0.0,
            D=2,
            M=2
        )

        # Create highly skewed inputs to test balancing
        skewed_inputs = torch.randn(self.batch_size, self.seq_len, self.d_model)

        # Force skewed routing by manipulating centroids
        with torch.no_grad():
            # Make first expert's centroid very similar to all inputs
            prototype = skewed_inputs.mean(dim=(0, 1))
            moe_unbalanced.expert_centroids[0] = prototype * 10

            # Copy the same centroids to the balanced MoE
            moe_balanced.expert_centroids.data.copy_(moe_unbalanced.expert_centroids.data)

        # Process with both MoEs
        _, unbalanced_loss, _, _ = moe_unbalanced(skewed_inputs)
        _, balanced_loss, _, _ = moe_balanced(skewed_inputs)

        # The balanced MoE should produce a higher loss to penalize imbalance
        self.assertGreater(balanced_loss.item(), unbalanced_loss.item())

    def test_device_balance_loss(self):
        """Test that the device balance loss works as expected."""
        # Create MoE with high device balance factor
        moe = MixtureOfExperts(
            d_model=self.d_model,
            d_expert=self.d_expert,
            K=self.K,
            N_s=self.N_s,
            N_r=self.N_r,
            alpha1=0.0,
            alpha2=1.0,  # High device balance factor
            alpha3=0.0,
            D=2,  # Two devices
            M=2
        )

        # Process input
        _, _, device_loss, _ = moe(self.inputs)

        # Check that device loss is calculated and non-zero
        self.assertGreater(device_loss.item(), 0.0)

    def test_communication_balance_loss(self):
        """Test that the communication balance loss works as expected."""
        # Create MoE with high communication balance factor
        moe = MixtureOfExperts(
            d_model=self.d_model,
            d_expert=self.d_expert,
            K=self.K,
            N_s=self.N_s,
            N_r=self.N_r,
            alpha1=0.0,
            alpha2=0.0,
            alpha3=1.0,  # High communication balance factor
            D=2,  # Two devices
            M=1   # Limited to one device
        )

        # Process input
        _, _, _, commu_loss = moe(self.inputs)

        # Check that communication loss is calculated and non-zero
        self.assertGreater(commu_loss.item(), 0.0)


if __name__ == '__main__':
    unittest.main()