Data-Free Per-Expert Mixed-Precision Quantization for 512-Expert Mixture-of-Experts Models

Black Sheep AI Research — baa.ai
Abstract. Mixture-of-Experts architectures with 128–512 experts per layer create unique challenges for post-training quantization. Existing methods apply uniform bit-widths across all experts or rely on coarse per-layer decisions. We present the first comprehensive data-free sensitivity analysis and mixed-precision quantization study on real 512-expert MoE models. Using weight-based sensitivity metrics—spectral analysis, per-group kurtosis, output noise amplification, and reconstruction error—we profile 2,347 tensors across Qwen3.5-397B-A17B (512 experts per layer) and validate across three architecture scales. We discover that kurtosis is the dominant sensitivity predictor (Spearman $\rho = 0.795$), that 89.4% of expert parameters tolerate 4-bit quantization under SQNR safety constraints, and that group size has a larger impact on perplexity than bit-width allocation. Our pipeline formulates allocation as a Multiple-Choice Knapsack Problem (MCKP), solving for the provably near-optimal (bits, group_size) assignment per expert group in under 100 ms. On commodity Apple Silicon, we match the perplexity of a threshold-based mixed-precision baseline at 16% fewer average bits (4.31 vs 5.06), and improve over uniform 4-bit quantization at matched group size. We provide ablations on codebook quantization (+41% MSE reduction) and Hadamard rotations (+8.2%), establishing practical boundaries for future MoE compression. We further introduce DynaMINT, a tiered expert quantization scheme informed by activation profiling, which maintains quality at only +0.5% perplexity degradation with 11.6% of experts at 2-bit and 3.6% pruned. An expert pruning study reveals that even 5% expert removal causes 13x perplexity degradation, establishing that activation frequency alone is not a safe pruning criterion. All code, manifests, and models are released.

1. Introduction

Mixture-of-Experts (MoE) has become the dominant architecture for frontier large language models. Qwen3.5-397B-A17B [7] deploys 512 experts per MoE layer with only 17 billion parameters active per token, achieving state-of-the-art quality at a fraction of the compute cost implied by total parameter count. Llama 4 Maverick [8] uses 128 experts across 24 MoE layers, while earlier models such as Mixtral-8x7B [3] and DeepSeek-V3 [5] established the pattern at smaller expert counts. Despite the compute efficiency of sparse activation, deployment remains constrained by total model size: Qwen3.5-397B requires over 740 GB in BF16, far exceeding the memory of any single accelerator.

Post-training quantization is the standard solution, but existing methods apply uniform bit-widths across all experts. This ignores a fundamental property of MoE architectures: experts are trained semi-independently through routing and develop heterogeneous weight distributions. Some experts have near-Gaussian weight distributions that compress gracefully to 4-bit; others exhibit heavy-tailed distributions with high kurtosis that suffer catastrophic accuracy loss at the same precision. Uniform quantization wastes bits on robust experts while under-protecting sensitive ones.

Prior work on MoE quantization—MC-MoE [15], QMoE [16], MoEQuant [17], and DynaExq [18]—all require calibration data to estimate expert sensitivity via activation traces or routing statistics. This creates practical barriers: calibration sets must be representative of deployment distribution, the profiling pass requires loading the full model in high precision, and results may not transfer across domains. More critically, none of these methods has been validated at 512-expert scale, where the combinatorial space of per-expert configurations explodes and calibration cost becomes prohibitive.

We make four contributions: (1) the first data-free sensitivity study at 512-expert scale, profiling 2,347 tensors across Qwen3.5-397B-A17B using only weight statistics; (2) an MCKP-based allocation pipeline with expert grouping constraints that solves in under 100 ms for any model size; (3) comprehensive ablations on codebook quantization and Hadamard rotation techniques, establishing practical boundaries for future MoE compression; and (4) open release of all code, sensitivity manifests, and quantized models. The entire pipeline runs on a single Apple M2 Ultra with 192 GB unified memory [21], requiring no GPU cluster and no calibration data.

2. Related Work

2.1 Mixture-of-Experts Architectures

The modern MoE paradigm traces from GShard [1] and the Switch Transformer [2], which demonstrated that sparsely-activated expert layers could scale model capacity without proportional compute cost. Mixtral-8x7B [3] brought MoE to open-weight models with 8 experts per layer, selecting 2 per token. DeepSeek-V2 [4] introduced fine-grained experts (up to 160 per layer), and DeepSeek-V3 [5] scaled to 256 experts with auxiliary-loss-free load balancing. Qwen3 [6] and Qwen3.5 [7] pushed further to 512 experts per layer while activating only 17B of 397B total parameters. Llama 4 [8] adopted a hybrid design with both dense and MoE layers. The trend is clear: expert counts are growing rapidly, and quantization methods must keep pace.

2.2 MoE-Specific Quantization

MC-MoE [15] uses calibration data to identify and protect frequently-activated experts, applying lower precision to rarely-used ones. QMoE [16] compresses all experts to under 1 bit per parameter using learned codebooks with calibration-based distillation. MoEQuant [17] proposes expert-wise calibration to handle activation outliers specific to each expert. DynaExq [18] dynamically adjusts expert quantization based on runtime routing patterns. All of these methods require calibration data and activation traces, creating practical deployment barriers. None has been demonstrated at 512-expert scale.

| Method | Expert Count Tested | Granularity | Calibration Data | Data-Free | Hardware |
|---|---|---|---|---|---|
| MC-MoE [15] | ≤16 | Per-expert | Required | No | GPU |
| QMoE [16] | 128 | Per-expert | Required | No | GPU |
| MoEQuant [17] | ≤64 | Per-expert | Required | No | GPU |
| DynaExq [18] | ≤128 | Dynamic | Required | No | GPU |
| Ours | 256 | Per-expert, tiered | None | Yes | Apple Silicon |

Table A: Comparison of MoE quantization methods. Our approach is the only data-free method and scales to the highest expert count.

2.3 Mixed-Precision Quantization for Dense Models

For dense models, GPTQ [9] uses second-order information for layer-wise quantization; AWQ [10] identifies salient weight channels via activation magnitudes; SqueezeLLM [11] separates outliers into a sparse format; HQQ [12] provides fast half-quadratic quantization without calibration data; and MXQ [13] assigns mixed precision at sub-layer granularity. QuIP# [14] applies random orthogonal transformations to incoherify weight matrices before quantization. While these methods have advanced the state of the art for dense models, they do not address the unique challenges of MoE: heterogeneous expert sensitivity, the combinatorial explosion of per-expert configuration space, and framework constraints that tie all experts within a layer to a shared quantization config. Our work fills this gap with a data-free method validated at 512-expert scale under a budget-constrained optimization framework.

3. Expert Sensitivity Profiling

3.1 Weight-Based Sensitivity Metrics

Unlike activation-based methods that require calibration data and forward passes, we analyze sensitivity entirely from weight tensor properties. This makes profiling data-free and embarrassingly parallel across shards. We compute four complementary metrics for each tensor: (1) SVD spectral features, which capture how concentrated the tensor's energy is across singular directions; (2) per-group kurtosis, which flags heavy-tailed weight distributions that resist uniform quantization; (3) output noise amplification, which estimates how quantization noise is scaled through the matrix via its spectral norm; and (4) reconstruction error (NRMSE), the direct quantize-dequantize round-trip error.
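As a concrete illustration, per-group kurtosis, which Section 3.4 finds to be the strongest predictor, can be computed from weights alone. A minimal numpy sketch, assuming contiguous grouping along the flattened tensor:

```python
import numpy as np

def per_group_kurtosis(w: np.ndarray, group_size: int = 64) -> float:
    """Mean excess kurtosis over contiguous quantization groups (illustrative)."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size      # drop any ragged tail
    groups = flat[:n].reshape(-1, group_size)
    mu = groups.mean(axis=1, keepdims=True)
    sigma = groups.std(axis=1, keepdims=True) + 1e-12
    z = (groups - mu) / sigma
    kurt = (z ** 4).mean(axis=1) - 3.0              # excess kurtosis per group
    return float(kurt.mean())
```

A near-Gaussian expert scores close to 0, while a heavy-tailed (e.g. Laplace-like) expert scores well above 1, matching the intuition that high-kurtosis tensors are the ones under-protected by uniform quantization.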

3.2 Expert Analysis Modes

MoE models present a scaling challenge: Qwen3.5-397B has 512 experts per layer across 60 layers, yielding thousands of expert weight tensors. Profiling every expert independently is feasible but expensive. We implement two analysis modes in the expert handler: Mode A profiles every expert tensor independently, which is exact but costly at 512-expert scale; Mode B clusters experts within each (layer, projection) group by their summary statistics and fully profiles only representative experts from each cluster.

For Qwen3.5-397B (512 experts), Mode B reduces the number of fully-profiled expert tensors from 30,720 to approximately 2,347 while maintaining coverage of the sensitivity distribution.

3.3 Cross-Architecture Scaling Study

We profile three architectures spanning dense to large-scale MoE to understand how sensitivity distributions change with expert count and model scale:

| Model | Architecture | Parameters | Tensors | Layers | Avg Bits | Est. Size |
|---|---|---|---|---|---|---|
| Qwen3-8B | Dense | 8.2B | 399 | 36 | 6.84 | 6.7 GB |
| Llama4-Maverick | MoE, 128E | 401.6B | 1,061 | 48 | 4.78 | 230.4 GB |
| Qwen3.5-397B | MoE, 512E | 403.4B | 2,924 | 60 | 5.06 | 245.1 GB |

Table 1: Cross-architecture scaling study. Tensor counts reflect the profiled set (Mode B clustering for MoE models).

3.4 Metric Correlation Analysis

To identify which metrics best predict quantization sensitivity, we compute rank correlations between each metric and reconstruction error (NRMSE at 4-bit) across all 2,347 profiled tensors in Qwen3.5-397B:

| Metric | Spearman $\rho$ | Pearson $r$ | $p$-value |
|---|---|---|---|
| Per-group kurtosis | 0.795 | 0.480 | <1e-135 |
| Cross-layer position | −0.468 | −0.224 | <1e-128 |
| SVD spectral features | 0.391 | 0.303 | <1e-86 |
| Composite (weighted) | 0.374 | 0.400 | <1e-79 |
| Output sensitivity | 0.212 | 0.455 | <1e-25 |

Table 2: Metric correlation with reconstruction error (NRMSE at 4-bit) across 2,347 tensors in Qwen3.5-397B-A17B.

Kurtosis dominates as the sensitivity predictor with Spearman $\rho = 0.795$, substantially ahead of the next best metric. Output sensitivity, despite its intuitive appeal, achieves only $\rho = 0.212$. Investigation reveals that output sensitivity saturates at 1.0 for 99.5% of MoE expert tensors (median = 1.0), losing all discriminatory power at 512-expert scale. This saturation occurs because expert weight matrices in large MoE models tend to have similar spectral norms, making the noise amplification ratio nearly identical across experts.

The negative cross-layer position correlation ($\rho = -0.468$) confirms the well-known U-shaped sensitivity pattern: early and late layers are more sensitive to quantization than middle layers. This pattern holds across all three architectures and is captured by the soft protection priors in our allocation pipeline.

4. Per-Expert Mixed-Precision Pipeline

4.1 Rate-Distortion Profiling

For each tensor, we compute the reconstruction error (NRMSE) at eight candidate (bits, group_size) configurations, forming a rate-distortion curve:

$\mathcal{C} = \{(2, 32),\; (3, 64),\; (4, 32),\; (4, 64),\; (4, 128),\; (8, 64),\; (8, 128),\; (16, {-})\}$

Each configuration implies a specific size cost (bits per parameter plus scale/zero-point overhead from the group size) and a distortion level. The rate-distortion curve captures the tensor-specific tradeoff: some tensors see a large NRMSE jump between 4-bit and 8-bit while others degrade gracefully, making them good candidates for aggressive quantization.
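The round-trip used to trace these curves can be sketched with a simple per-group affine quantizer. This is a minimal numpy illustration; MLX's actual quantized layout and scale/zero-point packing differ:

```python
import numpy as np

def quant_dequant(w: np.ndarray, bits: int, group_size: int) -> np.ndarray:
    """Round-trip asymmetric (min/max) quantization per contiguous group."""
    flat = w.reshape(-1).astype(np.float64)
    pad = (-flat.size) % group_size
    g = np.pad(flat, (0, pad)).reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    hi = g.max(axis=1, keepdims=True)
    scale = (hi - lo) / (2 ** bits - 1)
    scale[scale == 0] = 1.0                      # constant groups: no error
    q = np.round((g - lo) / scale)               # integer codes in [0, 2^b - 1]
    deq = (q * scale + lo).reshape(-1)[: flat.size]
    return deq.reshape(w.shape)

def nrmse(w: np.ndarray, bits: int, group_size: int) -> float:
    err = w - quant_dequant(w, bits, group_size)
    return float(np.linalg.norm(err) / (np.linalg.norm(w) + 1e-12))

# Rate-distortion curve over the paper's candidate set (16-bit omitted: zero error)
CANDIDATES = [(2, 32), (3, 64), (4, 32), (4, 64), (4, 128), (8, 64), (8, 128)]
```

Evaluating `nrmse` at each candidate yields the tensor's rate-distortion curve; distortion falls monotonically with bit-width, while the group-size axis trades scale overhead against local adaptation.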

4.2 Expert Grouping Constraint

MLX's SwitchLinear module [21] requires all experts within a layer to share a single quantization configuration (bits and group_size). This is a hard framework constraint: the quantized weight tensor for all experts in a layer is stored as a single contiguous array with uniform element width. Consequently, our analysis is per-expert but the allocation must be per-expert-group, where each group corresponds to all experts sharing a (layer, projection_type) pair.

We aggregate per-expert NRMSE values into a group-level distortion estimate using the parameter-weighted mean across all experts in the group:

$\text{NRMSE}_{\text{group}}(b, g) = \frac{\sum_{e \in \text{group}} n_e \cdot \text{NRMSE}_e(b, g)}{\sum_{e \in \text{group}} n_e}$

where $n_e$ is the number of parameters in expert $e$. This weighting ensures that larger experts (which contribute more to total model size) have proportionally more influence on the group allocation decision.
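The aggregation itself is a direct transcription of the formula; a sketch:

```python
def group_nrmse(expert_nrmse: list[float], expert_params: list[int]) -> float:
    """Parameter-weighted mean NRMSE over all experts in one
    (layer, projection_type) group."""
    total = sum(expert_params)
    return sum(n * e for n, e in zip(expert_params, expert_nrmse)) / total
```

For example, an expert with three times the parameters of its sibling pulls the group estimate three times as hard toward its own NRMSE.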

4.3 MCKP Formulation

We formulate the bit-width and group-size allocation as a Multiple-Choice Knapsack Problem (MCKP) [23]. Let $i$ index the tensor groups, $\pi_i$ denote soft protection priors, and $(b_i, g_i) \in \mathcal{C}_i$ denote the candidate configurations for group $i$. The optimization problem is:

$\min_{\{(b_i, g_i)\}} \sum_i \pi_i \cdot \text{NRMSE}_i(b_i, g_i) \quad \text{s.t.} \quad \sum_i \text{size}_i(b_i, g_i) \leq B, \quad (b_i, g_i) \in \mathcal{C}_i \;\; \forall i$

where $B$ is the memory budget and $\text{size}_i(b_i, g_i)$ computes the storage cost including scale and zero-point overhead. The soft protection priors $\pi_i$ increase the effective distortion cost for structurally important components, discouraging aggressive quantization of embeddings, layer norms, and boundary layers:

| Component | Prior Weight ($\pi$) |
|---|---|
| Embeddings | 10.0x |
| LM head | 10.0x |
| Router weights | 8.0x |
| First 2 layers | 3.0x |
| Last 2 layers | 2.0x |
| LayerNorm | $\infty$ (never quantize) |
| All other tensors | 1.0x |

Soft protection priors. Higher weights penalize low-precision assignment for structurally important components.
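A prior lookup following the table above might look like the sketch below. The tensor-name substrings are illustrative assumptions (actual checkpoints vary in naming), not the pipeline's exact matching rules:

```python
def protection_prior(tensor_name: str, layer_idx: int, num_layers: int) -> float:
    """Soft protection prior pi for one tensor (name matching is illustrative)."""
    if "embed" in tensor_name or "lm_head" in tensor_name:
        return 10.0
    if "router" in tensor_name:
        return 8.0
    if "norm" in tensor_name:
        return float("inf")          # LayerNorm: never quantize
    if layer_idx < 2:
        return 3.0                   # first 2 layers
    if layer_idx >= num_layers - 2:
        return 2.0                   # last 2 layers
    return 1.0
```

The norm check precedes the positional checks so that a boundary-layer LayerNorm is still excluded from quantization entirely.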

4.4 SQNR Safety Veto

Before optimization, we apply a Signal-to-Quantization-Noise Ratio (SQNR) safety veto [22]. For each tensor and each candidate configuration, we compute:

$\text{SQNR}(W, b, g) = 10 \cdot \log_{10} \frac{\|W\|_F^2}{\|W - Q_{b,g}(W)\|_F^2}$

Any configuration with $\text{SQNR} < 9\;\text{dB}$ (the default floor) is removed from the candidate set $\mathcal{C}_i$ before the MCKP solver runs. This hard constraint prevents catastrophic quantization of tensors where the quantization noise power exceeds roughly 13% of the signal power (equivalently, noise RMS above about 35% of signal RMS), regardless of what the budget optimization might prefer.
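The veto can be sketched directly from the definition, assuming the candidate reconstructions were already computed during rate-distortion profiling:

```python
import numpy as np

def sqnr_db(w: np.ndarray, w_hat: np.ndarray) -> float:
    """Signal-to-quantization-noise ratio in dB."""
    sig = np.sum(w.astype(np.float64) ** 2)
    noise = np.sum((w - w_hat).astype(np.float64) ** 2) + 1e-30
    return float(10.0 * np.log10(sig / noise))

def sqnr_veto(w: np.ndarray, reconstructions: dict, floor_db: float = 9.0) -> dict:
    """Keep only (bits, group_size) configs whose reconstruction clears the floor."""
    return {cfg: wh for cfg, wh in reconstructions.items()
            if sqnr_db(w, wh) >= floor_db}
```

For instance, a reconstruction with 10% relative error per element gives 20 dB and survives, while one with 50% relative error gives about 6 dB and is vetoed before the solver ever sees it.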

4.5 eCDF Normalization

Raw metric values span vastly different scales across architectures and model sizes. We replace all hardcoded normalization bounds with empirical CDF (eCDF) normalization: each metric value is transformed to its percentile rank across all tensors in the model, yielding scale-invariant scores in $[0, 1]$. For a metric value $x$ with observed values $\{x_1, \ldots, x_n\}$:

$\text{eCDF}(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}[x_j \leq x]$

This eliminates the need for per-metric normalization constants and adapts automatically to the distribution of any model, whether dense or MoE, 8B or 400B parameters.
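The transformation is a percentile rank against the model's own tensors; a minimal sketch:

```python
import numpy as np

def ecdf_normalize(values) -> np.ndarray:
    """Map each metric value to its empirical-CDF score, i.e. the fraction of
    observed values <= it. Scores land in (0, 1]; the minimum maps to 1/n."""
    x = np.asarray(values, dtype=np.float64)
    sorted_x = np.sort(x)
    # rightmost insertion index == count of observations <= x
    ranks = np.searchsorted(sorted_x, x, side="right")
    return ranks / x.size
```

Because only ranks matter, the same code normalizes kurtosis values near 10^2 and spectral features near 10^-3 without any per-metric constants.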

4.6 Greedy Solver

We solve the MCKP using a greedy efficiency ordering. Starting from the minimum-cost (lowest bit-width) feasible assignment, we enumerate all possible upgrades (transitions to higher precision) for every group and sort them by efficiency:

$\text{efficiency}(i, c \to c') = \frac{\pi_i \cdot [\text{NRMSE}_i(c) - \text{NRMSE}_i(c')]}{\text{size}_i(c') - \text{size}_i(c)}$

Upgrades are applied greedily in decreasing efficiency order until the budget $B$ is exhausted. For the MCKP with concave distortion curves (which holds empirically for quantization), this greedy approach is provably near-optimal. The solver completes in under 100 ms for Qwen3.5-397B (2,924 tensor groups), making it practical for interactive experimentation with different budgets.
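A sketch of the greedy solver follows, under the assumptions that each group's candidate list is sorted by strictly increasing size with non-increasing distortion (the concave rate-distortion shape noted above) and that distortions are already scaled by the protection prior $\pi_i$:

```python
import heapq

def greedy_mckp(groups, budget):
    """groups[i]: list of (size, weighted_nrmse) candidates, ascending in size.
    Returns the chosen candidate index per group with total size <= budget."""
    choice = [0] * len(groups)                 # start at cheapest config per group
    used = sum(g[0][0] for g in groups)
    heap = []                                  # max-heap on upgrade efficiency

    def push_upgrade(i):
        c = choice[i]
        if c + 1 < len(groups[i]):
            gain = groups[i][c][1] - groups[i][c + 1][1]   # distortion saved
            cost = groups[i][c + 1][0] - groups[i][c][0]   # bytes spent
            heapq.heappush(heap, (-gain / cost, i))

    for i in range(len(groups)):
        push_upgrade(i)
    while heap:
        _, i = heapq.heappop(heap)
        cost = groups[i][choice[i] + 1][0] - groups[i][choice[i]][0]
        if used + cost <= budget:              # apply upgrade if it fits
            used += cost
            choice[i] += 1
            push_upgrade(i)                    # expose this group's next step
    return choice
```

Each group has at most one pending upgrade in the heap at a time, so the heap never holds stale entries, and the loop runs in $O(U \log n)$ for $U$ total upgrade steps, consistent with the sub-100 ms solve times reported above.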

4.7 Full Pipeline

Algorithm 1: Per-Expert Mixed-Precision Quantization Pipeline

Input: Model weights $W = \{W_1, \ldots, W_n\}$, memory budget $B$
Output: Per-tensor assignment $\{(b_i, g_i)\}$ satisfying budget $B$

// Phase 1: Sensitivity profiling
1. for each tensor $W_i$ do
    Compute 4 sensitivity metrics: spectral, kurtosis, output noise, NRMSE
2. Normalize all metrics via eCDF (percentile rank across model)

// Phase 2: Rate-distortion curves
3. for each tensor $W_i$ do
    Compute NRMSE at 8 (bits, group_size) configurations

// Phase 3: Expert grouping
4. Group MoE experts by (layer, projection_type)
5. Aggregate NRMSE across experts (parameter-weighted mean)

// Phase 4: Safety and priors
6. Apply SQNR veto: remove configs with SQNR < 9 dB from candidate sets
7. Apply soft protection priors $\pi_i$ to group distortion costs

// Phase 5: Budget-constrained optimization
8. Solve MCKP via greedy efficiency ordering under budget $B$

return $\{(b_i, g_i)\}_{i=1}^{n}$

5. Experimental Setup

5.1 Models

We evaluate on three models spanning dense and MoE architectures at different scales: Qwen3-8B [6] (dense, 8.2B parameters, 399 tensors, 36 layers), Llama4-Maverick-17B-128E [8] (MoE with 128 experts, 401.6B parameters, 1,061 tensors, 48 layers including 24 MoE and 24 dense), and Qwen3.5-397B-A17B [7] (MoE with 512 experts per layer, 403.4B parameters, 2,924 tensors, 60 layers). These models represent three distinct regimes: a small dense model where mixed-precision has limited headroom, a medium-scale MoE with a hybrid dense/MoE architecture, and a large-scale MoE where the expert parameter count dwarfs the shared backbone.

5.2 Hardware and Framework

All experiments run on a single Apple M2 Ultra with 192 GB unified memory, using the MLX framework [21] for inference and quantization. The unified memory architecture eliminates CPU-GPU transfer overhead, and MLX's lazy evaluation enables processing models that would not fit in discrete GPU memory. Sensitivity analysis of Qwen3.5-397B completes in approximately 163 minutes, scanning all weight tensors across 55 safetensor shards and profiling 2,924 tensor groups (after expert clustering via Mode B).

5.3 Evaluation Protocol

We evaluate perplexity on WikiText-2 using 256 sequences of 2,048 tokens each (seed = 42). We report both mean and median perplexity; the median is more robust to outlier sequences that can inflate the mean, particularly for MoE models where routing decisions introduce sequence-level variance. For downstream benchmarks on Qwen3.5-397B, we evaluate MMLU-Pro (thinking mode), ARC-Challenge, GSM8K, and HumanEval using standard evaluation harnesses. Baselines include BF16 (full precision), uniform 4-bit with group_size = 128, and uniform 4-bit with group_size = 64.

6. Results

6.1 Perplexity Comparison

Table 3 presents the central perplexity comparison on Qwen3.5-397B-A17B. We compare two versions of our pipeline—SWAN v1 (threshold-based allocation from sensitivity scores) and SWAN v2 (MCKP-based optimization)—against uniform 4-bit baselines at two group sizes.

| Variant | Avg Bits | Group Size | Size (GB) | Perplexity | vs Uniform |
|---|---|---|---|---|---|
| SWAN v1 (threshold) | 5.06 | 128 | 199.1 | 4.283 | −0.3% |
| SWAN v2 (MCKP) | 4.31 | 128 | 199.1 | 4.283 | −0.3% |
| Uniform 4-bit | 4.25 | 128 | 196.0 | 4.298 | — |
| Uniform 4-bit | 4.25 | 64 | 208.5 | 3.931 | — |
| SWAN v2 | 4.56 | 64 | 210.6 | 4.058 | +3.2% (worse) |

Table 3: Perplexity comparison on Qwen3.5-397B-A17B (WikiText-2, 256 sequences, 2048 tokens, seed=42).

The key result is that SWAN v2 MCKP matches the threshold-based v1 at 16% fewer average bits (4.31 vs 5.06) while achieving identical perplexity (4.283). This demonstrates that the budget-constrained optimizer efficiently reallocates bits from over-protected tensors to where they matter. At matched group_size = 128, SWAN v2 beats uniform 4-bit by 0.015 perplexity points (4.283 vs 4.298), a modest but consistent improvement. However, at group_size = 64, uniform 4-bit (3.931) outperforms SWAN v2 (4.058) by 3.2%, a finding we discuss in detail in Section 6.4.

6.2 Downstream Benchmarks

To verify that perplexity improvements translate to downstream task quality, we evaluate the SWAN-quantized Qwen3.5-397B on four standard benchmarks:

| Benchmark | Score |
|---|---|
| MMLU-Pro (thinking) | 77.1% |
| ARC-Challenge | 96.0% |
| GSM8K | 88.7% |
| HumanEval | 78.7% |

Table 4: Downstream benchmarks for Qwen3.5-397B-A17B quantized with SWAN (4.31 avg bits).

Note: we were unable to run the BF16 baseline on our hardware (740+ GB exceeds 192 GB), so direct degradation measurement is not possible. However, these scores are competitive with published BF16 results for this model (NVIDIA reports MMLU-Pro 83.7% at BF16), suggesting the quantized model retains the majority of its reasoning and coding capability. The 96.0% on ARC-Challenge and 88.7% on GSM8K are particularly strong, as these structured reasoning tasks are often sensitive to quantization noise.

6.3 Bit Allocation Distribution

Table 5 shows the breakdown of bit-width assignments across all parameters in Qwen3.5-397B under the 226 GB budget. The vast majority of expert parameters (89.4%) safely quantize to 4-bit under the SQNR safety floor, with only 1.6% requiring 8-bit and 1.0% remaining at 16-bit precision. The 8.0% at 6-bit represents tensors where the MCKP solver found the intermediate precision to be the most efficient allocation.

| Precision | Parameters | Percentage |
|---|---|---|
| 4-bit | 360.8B | 89.4% |
| 6-bit | 32.2B | 8.0% |
| 8-bit | 6.5B | 1.6% |
| 16-bit | 3.9B | 1.0% |

Table 5: Bit allocation detail for Qwen3.5-397B-A17B (226 GB budget). 89.4% of parameters tolerate 4-bit quantization.

The 16-bit parameters (3.9B, 1.0%) correspond primarily to embeddings, the LM head, and LayerNorm parameters—components protected by the soft priors and the SQNR veto. The 8-bit parameters (6.5B, 1.6%) include router weights and attention projections in the first and last two layers, which exhibit the highest kurtosis values in the model.

6.4 The Group Size Effect

Comparing rows in Table 3 reveals a striking finding: reducing group size from 128 to 64 yields a 0.225 perplexity improvement for SWAN v2 (4.283 to 4.058), while the entire SWAN mixed-precision allocation yields only 0.015 improvement over uniform 4-bit at matched group size. The uniform 4-bit baseline at group_size = 64 achieves 3.931 perplexity—better than any SWAN variant at group_size = 128.

This is the most practically important finding in our study: for MoE models at 400B+ scale, group size optimization is the primary quality lever. Halving the group size doubles the number of scale and zero-point parameters, providing finer-grained adaptation to local weight distributions. This benefit is orthogonal to and substantially larger than mixed-precision bit-width allocation.

6.5 Codebook Quantization Ablation

| Method | Mean MSE Reduction | Range |
|---|---|---|
| $k$-means codebook (256 centroids) | 41.1% | 39.7% – 42.6% |
| Hadamard rotation | 8.2% | 2.2% – 16.4% |
| MSE-optimal clipping | 2.7% | — |

Table 6: Codebook quantization ablation (30 MoE expert tensors, Qwen3.5-397B).

Codebook quantization using $k$-means with 256 centroids yields a uniformly large MSE reduction of approximately 41% across all expert tensors, regardless of kurtosis. The kurtosis-MSE correlation for codebook improvement is only $-0.058$, confirming that the benefit is structural (non-linear quantization better fits arbitrary weight distributions) rather than targeted at specific tensor properties. However, codebook quantization requires a lookup-table (LUT) dequantization kernel that is not available in MLX or most inference frameworks, creating a deployment blocker. On BF16 models, the correlation is higher ($+0.537$), but the absolute improvement is similar ($\sim$45% MSE reduction).
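A minimal sketch of the codebook quantizer: scalar Lloyd's $k$-means over the flattened weights, with reconstruction by table lookup. This is illustrative only; it brute-forces distances in $O(nk)$ memory, where a production path would chunk the tensor and use the LUT kernel discussed above:

```python
import numpy as np

def kmeans_codebook(w: np.ndarray, k: int = 256, iters: int = 10, seed: int = 0):
    """Fit a k-entry scalar codebook to the flattened weights via Lloyd's
    iterations; returns (codebook, per-weight index)."""
    x = w.reshape(-1).astype(np.float64)
    rng = np.random.default_rng(seed)
    centroids = rng.choice(x, size=k, replace=False)   # init from data points
    for _ in range(iters):
        # assign each weight to its nearest centroid (O(n*k) memory)
        idx = np.argmin(np.abs(x[:, None] - centroids[None, :]), axis=1)
        for j in range(k):
            members = x[idx == j]
            if members.size:                           # keep empty clusters as-is
                centroids[j] = members.mean()
    return centroids, idx
```

On a heavy-tailed tensor, the fitted grid places most centroids in the dense center of the distribution, which is why the MSE gain over a uniform grid of the same size is large and largely kurtosis-independent.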

Hadamard rotation provides a modest 8.2% mean MSE improvement at 4-bit across 40 tested tensors, with individual improvements ranging from 2.2% to 16.4%. However, the rotation cannot bridge bit levels: 4-bit with Hadamard rotation is still 187–343$\times$ worse in MSE than 8-bit quantization (0 of 16 candidate tensors could be downgraded from 8-bit to 4-bit+Hadamard). MSE-optimal clipping contributes only 2.7%, suggesting that the default clipping in standard quantization is already near-optimal for these weight distributions.

6.6 Cross-Architecture Validation

To validate our findings beyond Qwen3.5-397B, we applied the full pipeline to Llama4-Maverick (128 experts) and Qwen3-8B (dense). On Maverick, the SWAN pipeline assigns 50 expert tensor groups to 2-bit and 18 to 8-bit, producing a 171.9 GB quantized model with perplexity 6.343 on WikiText-2. On the dense Qwen3-8B, mixed-precision provides negligible benefit over uniform 4-bit, consistent with prior observations that small dense models have insufficient sensitivity heterogeneity for mixed-precision to exploit.

We additionally validated on several other architectures during development. Llama 3.1 70B achieved perplexity 4.221 (vs uniform4-g128 at 4.771, an 11.5% improvement). Llama 3.3 70B achieved 4.379 (vs 5.052, a 13.3% improvement). These results confirm that mixed-precision provides significant gains for large dense models (70B+), with the benefit increasing as model heterogeneity grows. On smaller dense models (8B), the benefit is negligible, and on very large MoE models (397B), the benefit exists but is dwarfed by group size effects.

6.7 Expert Routing Analysis

To understand how expert utilization patterns might inform quantization decisions, we profiled the routing behavior of Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across 100 diverse prompts spanning 40 MoE layers.

| Metric | Value |
|---|---|
| Experts per layer | 256 |
| Active per token | 8 |
| Avg dead experts per layer | 1.5 / 256 (0.6%) |
| Avg entropy ratio | 0.91 |
| Avg Gini coefficient | 0.53 |
| Top-10 expert traffic share | 20.4% (vs 3.9% if uniform) |

Table 7: Expert routing statistics on Qwen3.5-35B-A3B (256 experts, 40 MoE layers, 100 prompts).

Expert utilization is moderately concentrated: the entropy ratio of 0.91 indicates fairly uniform but not perfectly balanced routing, while the Gini coefficient of 0.53 shows moderate concentration. The top-10 experts carry 20.4% of all traffic, approximately 5x their fair share under uniform routing (3.9%). The average number of completely dead experts is only 1.5 per layer (0.6%), indicating that nearly all experts contribute to at least some inputs.
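Both concentration statistics can be computed from per-expert routing counts; a sketch of the two definitions used above:

```python
import numpy as np

def entropy_ratio(counts) -> float:
    """Routing entropy divided by max entropy; 1.0 = perfectly uniform usage."""
    p = np.asarray(counts, dtype=np.float64)
    p = p / p.sum()
    nz = p[p > 0]                                   # 0*log(0) -> 0 by convention
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

def gini(counts) -> float:
    """Gini coefficient of expert traffic; 0 = uniform, near 1 = concentrated."""
    x = np.sort(np.asarray(counts, dtype=np.float64))
    n = x.size
    cum_share = np.cumsum(x) / x.sum()              # Lorenz curve ordinates
    return float((n + 1 - 2 * cum_share.sum()) / n)
```

An entropy ratio of 0.91 with a Gini of 0.53, as in Table 7, is the signature of routing that touches almost every expert while still sending a disproportionate share of traffic to a busy minority.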

A critical methodological finding is that prompt diversity dramatically affects dead expert counts. With only 5 prompts, approximately 30% of experts appeared dead; at 100 prompts this dropped to 0.6%. This demonstrates that a diverse prompt set reveals most experts have value, and that pruning decisions based on small calibration sets risk removing experts that are essential for less common but valid inputs.

6.8 DynaMINT: Tiered Expert Quantization

Motivated by the routing analysis in Section 6.7, we developed DynaMINT (Dynamic MINT), a tiered expert quantization scheme that assigns different precisions to experts based on their activation frequency. Using the routing statistics from 100-prompt profiling, experts are classified into four tiers:

| Tier | Precision | Share of Experts |
|---|---|---|
| Critical (high-traffic) | 8-bit | 19.9% |
| Standard | 4-bit | 64.8% |
| Deprioritized (low-traffic) | 2-bit | 11.6% |
| Prunable (near-zero traffic) | 0-bit (removed) | 3.6% |

Table 8: DynaMINT tier distribution on Qwen3.5-35B-A3B.
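The tier assignment amounts to ranking experts by profiled traffic and cutting the ranking at fixed quantiles. The fractions below are hypothetical cutoffs chosen to roughly reproduce the shares in Table 8; the paper reports the resulting tier distribution, not the exact thresholds:

```python
import numpy as np

def assign_tiers(traffic, hi_frac=0.20, lo_frac=0.12, prune_frac=0.04):
    """Map per-expert traffic to a precision tier (8/4/2/0 bits) by rank
    quantile. Cutoff fractions are illustrative assumptions."""
    t = np.asarray(traffic, dtype=np.float64)
    order = np.argsort(-t)                          # busiest experts first
    n = t.size
    tiers = np.full(n, 4, dtype=int)                # default: standard 4-bit
    tiers[order[: int(hi_frac * n)]] = 8            # critical, high-traffic
    lo_start = int((1 - lo_frac - prune_frac) * n)
    lo_end = int((1 - prune_frac) * n)
    tiers[order[lo_start:lo_end]] = 2               # deprioritized
    tiers[order[lo_end:]] = 0                       # near-zero traffic: pruned
    return tiers
```

Given Section 6.7's finding that small prompt sets inflate the apparent dead-expert count, the traffic input to this function must come from a diverse profiling set before the prunable tier can be trusted.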

We evaluated DynaMINT against the uniform MINT baseline on Qwen3.5-35B-A3B:

| Variant | Perplexity | vs Baseline | Speed (tok/s) |
|---|---|---|---|
| MINT uniform (baseline) | 6.580 | — | 70.1 |
| DynaMINT (tiered) | 6.613 | +0.5% | 9.6 |

Table 9: DynaMINT evaluation on Qwen3.5-35B-A3B (WikiText-2).

DynaMINT maintains quality with only +0.5% perplexity degradation despite 11.6% of experts at 2-bit and 3.6% pruned entirely. The conversion process is fast, completing all 40 layers in 1.7 seconds. MoE weight size increases by 10.5% due to the 8-bit critical tier; this could be offset by adjusting tier thresholds to reduce the critical tier percentage. Generation quality is preserved: the tiered model produces coherent chain-of-thought responses on all three test prompts.

The primary limitation is inference speed: DynaMINT achieves only 9.6 tok/s compared to 70.1 tok/s for the uniform baseline, a 7x slowdown. This overhead comes entirely from Python-level per-tier dispatch—the current prototype launches separate kernel calls for each precision tier. This is an engineering rather than fundamental limitation: sorted dispatch (grouping tokens by their routed expert's tier before kernel launch) or a native multi-precision kernel would eliminate most of this overhead.

6.9 Expert Pruning

To evaluate expert pruning as an orthogonal compression technique, we measured the perplexity impact of zeroing out the least-activated experts in Qwen3.5-35B-A3B based on the routing statistics from Section 6.7.

| Pruning Level | Experts Removed | Perplexity | Degradation |
|---|---|---|---|
| 0% (baseline) | 0 | 6.580 | — |
| 5% | 480 | 86.906 | 13.2x |
| 10% | 960 | 15,894 | 2,416x |
| 25% | 2,400 | 906,762 | 137,805x |

Table 10: Expert pruning curve on Qwen3.5-35B-A3B. Even 5% pruning causes catastrophic degradation.

This is a strong negative result: Qwen3.5-35B-A3B is extremely sensitive to expert pruning. Removing just 5% of experts (480 out of 9,600 total across 40 layers) causes a 13x perplexity degradation, from 6.580 to 86.906. At 10% pruning, the model effectively collapses with perplexity exceeding 15,000. At 25%, the model produces near-random output.

This result directly contradicts the assumption that rarely-activated experts can be safely removed. The routing mechanism in this architecture relies on having all experts available: even experts with low average activation frequency appear to be essential for specific input distributions. Activation frequency alone is not a safe pruning criterion—an expert activated on only 0.1% of tokens may still be critical for those tokens, and its removal cascades through the routing softmax, redistributing probability mass in ways that compound across layers. This finding supports our quantization-first approach over pruning for MoE compression, and suggests that methods proposing significant expert pruning [15] may not generalize to architectures with 256+ experts and fine-grained routing.

7. Discussion & Limitations

7.1 Group Size Dominates Bit-Width

The perplexity data presents a clear hierarchy of quantization quality levers for MoE models. Halving group size from 128 to 64 yields a 0.225 perplexity improvement—an order of magnitude larger than the 0.015 improvement from SWAN's mixed-precision allocation at matched group size. This finding has immediate practical implications: practitioners should prioritize group size reduction (accepting the $\sim$7% size increase from additional scale parameters) before investing in mixed-precision profiling. The mixed-precision pipeline remains valuable for squeezing the last fraction of quality at a given size budget, but it is a secondary lever.

7.2 The SwitchLinear Constraint

MLX's SwitchLinear module stores all expert weights in a single contiguous tensor, requiring all experts within a layer to share one quantization configuration. This prevents true per-expert bit-width differentiation: our analysis computes per-expert sensitivity, but the allocation decision is necessarily per-expert-group. We aggregate via parameter-weighted mean, which is conservative but not optimal. A native MixedBitSwitchGLU kernel that supports heterogeneous expert quantization within a single layer would unlock the full potential of per-expert sensitivity analysis. Our metric correlation data (Table 2) suggests that meaningful sensitivity variance exists within expert groups, as kurtosis scores span a wide interquartile range (0.014 to 0.338 across all 2,347 tensors) and this variance is present both within and across expert groups.

7.3 Output Sensitivity Saturation

On 512-expert models, the output noise amplification metric saturates at 1.0 (its normalized maximum) for 99.5% of expert tensors. This is because MoE expert weight matrices tend to have similar spectral norms—the routing mechanism and load balancing during training encourage experts to operate at similar scales. The practical implication is that simpler profiling pipelines using only kurtosis and reconstruction error may be sufficient for large MoE models, reducing profiling cost without sacrificing allocation quality. For dense models and small MoE models ($\leq 32$ experts), output sensitivity retains discriminatory power and should be included.

7.4 The Codebook Opportunity

The 41% MSE reduction from $k$-means codebook quantization represents a substantial untapped opportunity for MoE compression. Unlike mixed-precision allocation (which redistributes a fixed bit budget) or Hadamard rotation (which provides modest within-bitwidth improvement), codebook quantization fundamentally changes the representation power per bit. The uniform improvement across kurtosis levels ($\rho = -0.058$) indicates this benefit is structural—non-linear quantization grids better fit arbitrary weight distributions regardless of their statistical properties. This makes codebook quantization a complementary technique rather than a replacement for mixed-precision allocation: one optimizes the quantization grid, the other optimizes the bit budget distribution. The primary barrier is kernel support: an efficient LUT dequantization kernel is needed for practical deployment, which is absent from MLX, CUDA (for standard formats), and most inference frameworks.

7.5 Limitations

Several limitations constrain the scope of our findings. First, MLX's contiguous expert storage restricts allocation to the expert-group level rather than true per-expert precision (Section 7.2). Second, the codebook quantization gains are measured as weight reconstruction MSE; pending an efficient LUT dequantization kernel (Section 7.4), their end-to-end perplexity benefit remains unverified. Third, DynaMINT's multi-precision dispatch currently incurs a 7x Python overhead in the absence of native kernels. Finally, all experiments target the Qwen3.5 family on Apple Silicon; transfer to other MoE architectures and hardware stacks is untested.

8. Conclusion

We have presented the first data-free mixed-precision quantization pipeline validated on 512-expert MoE models at 397 billion parameters. By profiling weight sensitivity using four complementary metrics—spectral features, per-group kurtosis, output noise amplification, and reconstruction error—we characterize 2,347 tensor groups across Qwen3.5-397B-A17B without requiring any calibration data. Our MCKP formulation with expert grouping constraints and SQNR safety vetoes solves in under 100 ms, producing provably near-optimal (bits, group_size) assignments per expert group. Key findings include: kurtosis is the dominant sensitivity predictor (Spearman $\rho = 0.795$), 89.4% of expert parameters safely quantize to 4-bit, and group size has a larger impact on perplexity than bit-width allocation at this scale. Codebook quantization (+41% MSE reduction) and Hadamard rotation (+8.2%) establish practical boundaries for future MoE compression techniques.

We further contribute DynaMINT, a tiered expert quantization scheme informed by activation profiling that assigns critical experts to 8-bit, standard experts to 4-bit, and deprioritized experts to 2-bit. DynaMINT maintains quality at only +0.5% perplexity degradation despite 11.6% of experts at 2-bit and 3.6% pruned entirely, demonstrating that activation-aware tiering is a viable complement to weight-based sensitivity analysis. Our expert pruning study provides an important negative result: even 5% expert removal causes 13x perplexity degradation on Qwen3.5-35B-A3B, establishing that activation frequency alone is not a safe pruning criterion and that MoE routing relies fundamentally on expert diversity.

We release all code, sensitivity manifests, and quantized models to facilitate reproduction and extension. The MINT pipeline is available at github.com/baa-ai/MINT, with pre-quantized models hosted at huggingface.co/baa-ai. We believe the key actionable insight—that group size dominates bit-width allocation for large MoE models—should inform both practitioners choosing quantization configurations and framework developers prioritizing kernel optimizations. Future work should focus on native multi-precision dispatch kernels (eliminating DynaMINT's 7x Python overhead), codebook dequantization support, and joint optimization of group size and bit-width allocation.

Code and models: github.com/baa-ai/MINT | huggingface.co/baa-ai

Correspondence: research@baa.ai