Cell: nvidia_dump | deps: torch | 33.33s
NVIDIA GPU Information:
Mon Sep 15 16:41:01 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.05              Driver Version: 560.35.05      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L4                      Off |   00000000:38:00.0 Off |                    0 |
| N/A   46C    P0             28W /   72W |       1MiB /  23034MiB |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L4                      Off |   00000000:3A:00.0 Off |                    0 |
| N/A   46C    P0             28W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L4                      Off |   00000000:3C:00.0 Off |                    0 |
| N/A   49C    P0             31W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L4                      Off |   00000000:3E:00.0 Off |                    0 |
| N/A   48C    P0             29W /   72W |       1MiB /  23034MiB |      2%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
Cell: utils | deps: torch, numpy | 31.77s
Cell: config | deps: torch, numpy | 37.88s
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
Cell: save_data | deps: torch, numpy | 44.56s
"""Generate and save shared weights for consistent comparison."""
import torch
import numpy as np

# Model configuration
NUM_EXPERTS = 128
HIDDEN_SIZE = 1152
INTERMEDIATE_SIZE = 3072
TOP_K = 4

# Input configuration
BATCH_SIZE = 1
SEQ_LEN = 100
DTYPE = "float32"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Seeds for reproducibility
WEIGHT_SEED = 999
EXPERT_SEED = 777
INPUT_SEED = 123
GENERAL_SEED = 42

def set_seed(seed: int):
    """Set seeds for reproducibility."""
    torch.manual_seed(seed)
    np.random.seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)

# Generate shared weights for all implementations
print("Generating shared weights...")

# Router weights
set_seed(WEIGHT_SEED)
router_weight = torch.empty(NUM_EXPERTS, HIDDEN_SIZE)
torch.nn.init.kaiming_uniform_(router_weight)
router_bias = torch.zeros(NUM_EXPERTS)

# Expert weights - gate and up projections are fused, hence the 2 * HIDDEN_SIZE dimension
set_seed(EXPERT_SEED)
gate_up_proj = torch.empty(NUM_EXPERTS, HIDDEN_SIZE, 2 * HIDDEN_SIZE).normal_(mean=0.0, std=0.02)
gate_up_proj_bias = torch.zeros(NUM_EXPERTS, 2 * HIDDEN_SIZE)
down_proj = torch.empty(NUM_EXPERTS, HIDDEN_SIZE, HIDDEN_SIZE).normal_(mean=0.0, std=0.02)
down_proj_bias = torch.zeros(NUM_EXPERTS, HIDDEN_SIZE)

# Save weights
torch.save(router_weight, 'router_weight.pt')
torch.save(router_bias, 'router_bias.pt')
torch.save(gate_up_proj, 'gate_up_proj.pt')
torch.save(gate_up_proj_bias, 'gate_up_proj_bias.pt')
torch.save(down_proj, 'down_proj.pt')
torch.save(down_proj_bias, 'down_proj_bias.pt')

print("Saved weights:")
print(f"  Router: {tuple(router_weight.shape)}")
print(f"  Gate/Up proj: {tuple(gate_up_proj.shape)}")
print(f"  Down proj: {tuple(down_proj.shape)}")
print(f"  Hidden size: {HIDDEN_SIZE}")
Generating shared weights...
Saved weights:
  Router: (128, 1152)
  Gate/Up proj: (128, 1152, 2304)
  Down proj: (128, 1152, 1152)
  Hidden size: 1152
GPT-OSS Implementation
This section benchmarks the GPT-OSS MoE implementation in non-training mode.
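The latency and throughput figures reported below follow the usual warmup-then-average pattern. The benchmark cell's own code is not reproduced here; the sketch below is a minimal stand-in (`benchmark_ms` and `throughput_tok_s` are illustrative names, not functions from the cell) showing how the reported numbers relate: with batch 8 and sequence length 512, each forward pass processes 4096 tokens.

```python
import time

def benchmark_ms(fn, warmup=10, iters=50):
    """Average latency of fn in milliseconds after a warmup phase.

    On GPU, fn should end with torch.cuda.synchronize() so the timer
    measures the kernels, not just the asynchronous launches.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    return (time.perf_counter() - start) / iters * 1000.0

def throughput_tok_s(tokens_per_pass, avg_ms):
    """Tokens processed per second given average per-pass latency."""
    return tokens_per_pass / (avg_ms / 1000.0)

# batch 8 x seq 512 = 4096 tokens/pass; 62.308 ms -> ~65737 tokens/sec
print(round(throughput_tok_s(8 * 512, 62.308485), 1))
```

Plugging the measured 62.308 ms into this formula reproduces the ~65737 tokens/sec reported in the output below, which confirms the throughput is derived from latency rather than measured independently.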
Cell: gptoss_run | deps: torch, numpy | 43.29s
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
Loading weights from: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/f8d80463181591394d703c9cd286c7929b4a261ab3157d791f92a5933e5a011e
Files in directory: down_proj.pt, down_proj_bias.pt, stderr.txt, gate_up_proj.pt, gate_up_proj_bias.pt, result.json, stdout.txt, router_weight.pt, router_bias.pt
Loaded shared weights from artifacts
Router weight sum: 12.588732
Gate/up sum: 1026.601807
Down sum: 206.729263

=== GPT-OSS Implementation ===
Router weight sum: 12.562500
Gate/up proj sum: 1024.000000
Down proj sum: 207.000000
Average time: 62.308 ms
Throughput: 65737 tokens/sec
Memory allocated: 1.330 GB
Memory increase: 0.380 GB

Output sum: -4.968750
Artifacts:
gptoss_results.json
MegaBlocks Implementation
This section benchmarks the MegaBlocks MoE implementation.
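Both implementations share the routing scheme from the configuration: a softmax over the 128 expert logits followed by top-4 selection. As a reference for what that step computes, here is a simplified single-token sketch in plain Python (real kernels batch this across all tokens, and whether the probabilities are renormalized over the selected experts, as done here, is an implementation detail that varies between MoE codebases):

```python
import math

def topk_route(logits, k):
    """Softmax over expert logits, keep the k largest, renormalize.

    Returns a list of (expert_index, routing_weight) pairs whose
    weights sum to 1. Single-token sketch for illustration only.
    """
    m = max(logits)                                   # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: -probs[i])[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

# experts 1 and 3 win; their renormalized weights sum to 1
print(topk_route([0.1, 2.0, -1.0, 1.5], k=2))
```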
Cell: megablocks_run | deps: torch, numpy, kernels | 49.81s
Configuration:
  Experts: 128
  Hidden size: 1152
  Top-k: 4
  Batch size: 8
  Sequence length: 512
  Device: cuda
  Dtype: bfloat16
Loading weights from: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/f8d80463181591394d703c9cd286c7929b4a261ab3157d791f92a5933e5a011e
Loaded shared weights from artifacts
Router weight sum: 12.588732
Gate/up sum: 1026.601807
Down sum: 206.729263

=== MegaBlocks Implementation ===
[MegaBlocks] Router weight sum: 12.562500
[MegaBlocks] Gate/up projection shape: (128, 1152, 2304), sum: 1024.000000
[MegaBlocks] Down projection shape: (128, 1152, 1152), sum: 207.000000
Average time: 26.933 ms
Throughput: 152084 tokens/sec
Memory allocated: 2.243 GB
Memory increase: 1.292 GB

Output sum: -4.968750
Fetching 66 files: 0%| | 0/66 [00:00<?, ?it/s] +Fetching 66 files: 2%|▏ | 1/66 [00:00<00:18, 3.49it/s] +Fetching 66 files: 26%|██▌ | 17/66 [00:01<00:03, 15.86it/s] +Fetching 66 files: 100%|██████████| 66/66 [00:01<00:00, 57.94it/s]
Artifacts:
megablocks_results.json
Performance Comparison
This section loads the benchmark results and creates visualizations comparing the two implementations.
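The headline ratios in the summary can be rederived directly from the two result JSONs. A minimal sketch (the dict values below are copied from the artifacts in this section; note that with a fixed token count per pass, the latency speedup and the throughput ratio are necessarily the same number):

```python
# Values copied from gptoss_results.json and megablocks_results.json
gptoss = {"avg_time_ms": 62.308485079556704,
          "throughput_tokens_per_sec": 65737.43519474348}
mega = {"avg_time_ms": 26.93254135781899,
        "throughput_tokens_per_sec": 152083.67994618745}

# Speedup from latency, and the equivalent ratio from throughput
speedup = gptoss["avg_time_ms"] / mega["avg_time_ms"]
tput_ratio = mega["throughput_tokens_per_sec"] / gptoss["throughput_tokens_per_sec"]

print(f"{speedup:.2f}x faster, {tput_ratio:.2f}x higher throughput")
# -> 2.31x faster, 2.31x higher throughput
```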
Cell: visualization | deps: matplotlib | 3.33s
Loading benchmark results from:
  GPT-OSS dir: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/fc17d5998a27217e1676a638ddeceb18cab662c6e9b30c9a62218784604c9a26
  MegaBlocks dir: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/6e3545a8e3c2ca65ca800a7e1c1824fded11e28258efcd83355514bb0646e166
Loading results from:
  GPT-OSS: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/fc17d5998a27217e1676a638ddeceb18cab662c6e9b30c9a62218784604c9a26/gptoss_results.json
  MegaBlocks: /home/ubuntu/Projects/uvnote-megablocks-bench/.uvnote/cache/6e3545a8e3c2ca65ca800a7e1c1824fded11e28258efcd83355514bb0646e166/megablocks_results.json
GPT-OSS results keys: ['avg_time_ms', 'throughput_tokens_per_sec', 'memory_allocated_gb', 'memory_cached_gb', 'memory_increase_gb', 'device', 'dtype', 'tokens', 'warmup_iters', 'timing_iters']
MegaBlocks results keys: ['avg_time_ms', 'throughput_tokens_per_sec', 'memory_allocated_gb', 'memory_cached_gb', 'memory_increase_gb', 'device', 'dtype', 'tokens', 'warmup_iters', 'timing_iters']
Extracted metrics:
  Times (ms): [62.308485079556704, 26.93254135781899]
  Throughputs: [65737.43519474348, 152083.67994618745]
  Memory usage (GB): [1.329831600189209, 2.2425241470336914]
  Memory increase (GB): [0.3795137405395508, 1.2922062873840332]

============================================================
PERFORMANCE COMPARISON SUMMARY
============================================================
Metric                    GPT-OSS      MegaBlocks   Winner
------------------------------------------------------------
Latency (ms)              62.31        26.93        MegaBlocks
Throughput (tok/s)        65737        152084       MegaBlocks
Memory Usage (GB)         1.330        2.243        GPT-OSS
Memory Increase (GB)      0.380        1.292        GPT-OSS

MegaBlocks is 2.31x faster
MegaBlocks has 2.31x higher throughput
============================================================
Artifacts:
small_moe_comparison.png
Conclusion
This focused benchmark compares the GPT-OSS (non-training mode) and MegaBlocks MoE implementations on the same hardware with identical weights and inputs. The comparison focuses on:
1. Latency: Average forward pass time
2. Throughput: Tokens processed per second
3. Memory Usage: GPU memory consumption
4. Memory Efficiency: Memory increase during execution
Both implementations use:
- 128 experts with top-4 routing
- 1152 hidden dimensions
- Batch size of 8, sequence length of 512
- bfloat16 precision
- Identical pre-generated weights for fair comparison
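This configuration implies a substantial expert weight footprint on its own, which puts the reported memory numbers in context. A back-of-envelope count of the expert parameters (activations, the router, and any workspace buffers are excluded, so this is a lower bound, not a full accounting of either implementation's memory use):

```python
# Back-of-envelope expert parameter count for this configuration
NUM_EXPERTS, HIDDEN = 128, 1152
gate_up = HIDDEN * (2 * HIDDEN) + 2 * HIDDEN   # fused gate/up weight + bias per expert
down = HIDDEN * HIDDEN + HIDDEN                # down projection weight + bias per expert
params = NUM_EXPERTS * (gate_up + down)
gib_bf16 = params * 2 / 2**30                  # bfloat16 = 2 bytes per parameter

print(f"{params:,} params, {gib_bf16:.2f} GiB in bf16")
# -> 510,050,304 params, 0.95 GiB in bf16
```

That ~0.95 GiB of expert weights accounts for most of the 1.33 GB the GPT-OSS run reports as allocated; the gap up to MegaBlocks' 2.24 GB is additional runtime state rather than model weights.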
The results show a clear speed/memory trade-off: MegaBlocks runs about 2.3x faster but allocates roughly 0.9 GB more GPU memory, so it is the better choice when throughput matters more than memory headroom, while the GPT-OSS path is preferable under tight memory budgets.