Mixture-of-Experts (MoE) has become the dominant architecture for frontier large language models. Qwen3.5-397B-A17B [7] deploys 512 experts per MoE layer with only 17 billion parameters active per token, achieving state-of-the-art quality at a fraction of the compute cost implied by total parameter count. Llama 4 Maverick [8] uses 128 experts across 24 MoE layers, while earlier models such as Mixtral-8x7B [3] and DeepSeek-V3 [5] established the pattern at smaller expert counts. Despite the compute efficiency of sparse activation, deployment remains constrained by total model size: Qwen3.5-397B requires over 740 GB in BF16, far exceeding the memory of any single accelerator.
Post-training quantization is the standard solution, but existing methods apply uniform bit-widths across all experts. This ignores a fundamental property of MoE architectures: experts are trained semi-independently through routing and develop heterogeneous weight distributions. Some experts have near-Gaussian weight distributions that compress gracefully to 4-bit; others exhibit heavy-tailed distributions with high kurtosis that suffer catastrophic accuracy loss at the same precision. Uniform quantization wastes bits on robust experts while under-protecting sensitive ones.
Prior work on MoE quantization—MC-MoE [15], QMoE [16], MoEQuant [17], and DynaExq [18]—all require calibration data to estimate expert sensitivity via activation traces or routing statistics. This creates practical barriers: calibration sets must be representative of deployment distribution, the profiling pass requires loading the full model in high precision, and results may not transfer across domains. More critically, none of these methods has been validated at 512-expert scale, where the combinatorial space of per-expert configurations explodes and calibration cost becomes prohibitive.
We make four contributions: (1) the first data-free sensitivity study at 512-expert scale, profiling 2,347 tensors across Qwen3.5-397B-A17B using only weight statistics; (2) an MCKP-based allocation pipeline with expert grouping constraints that solves in under 100 ms for any model size; (3) comprehensive ablations on codebook quantization and Hadamard rotation techniques, establishing practical boundaries for future MoE compression; and (4) open release of all code, sensitivity manifests, and quantized models. The entire pipeline runs on a single Apple M2 Ultra with 192 GB unified memory [21], requiring no GPU cluster and no calibration data.
The modern MoE paradigm traces from GShard [1] and the Switch Transformer [2], which demonstrated that sparsely-activated expert layers could scale model capacity without proportional compute cost. Mixtral-8x7B [3] brought MoE to open-weight models with 8 experts per layer, selecting 2 per token. DeepSeek-V2 [4] introduced fine-grained experts (up to 160 per layer), and DeepSeek-V3 [5] scaled to 256 experts with auxiliary-loss-free load balancing. Qwen3 [6] and Qwen3.5 [7] pushed further to 512 experts per layer while activating only 17B of 397B total parameters. Llama 4 [8] adopted a hybrid design with both dense and MoE layers. The trend is clear: expert counts are growing rapidly, and quantization methods must keep pace.
MC-MoE [15] uses calibration data to identify and protect frequently-activated experts, applying lower precision to rarely-used ones. QMoE [16] compresses all experts to under 1 bit per parameter using learned codebooks with calibration-based distillation. MoEQuant [17] proposes expert-wise calibration to handle activation outliers specific to each expert. DynaExq [18] dynamically adjusts expert quantization based on runtime routing patterns. All of these methods require calibration data and activation traces, creating practical deployment barriers. None has been demonstrated at 512-expert scale.
| Method | Expert Count Tested | Granularity | Calibration Data | Data-Free | Hardware |
|---|---|---|---|---|---|
| MC-MoE [15] | ≤16 | Per-expert | Required | No | GPU |
| QMoE [16] | 128 | Per-expert | Required | No | GPU |
| MoEQuant [17] | ≤64 | Per-expert | Required | No | GPU |
| DynaExq [18] | ≤128 | Dynamic | Required | No | GPU |
| Ours | 512 | Per-expert tiered | None | Yes | Apple Silicon |
For dense models, GPTQ [9] uses second-order information for layer-wise quantization; AWQ [10] identifies salient weight channels via activation magnitudes; SqueezeLLM [11] separates outliers into a sparse format; HQQ [12] provides fast half-quadratic quantization without calibration data; and MXQ [13] assigns mixed precision at sub-layer granularity. QuIP# [14] applies random orthogonal transformations to incoherify weight matrices before quantization. While these methods have advanced the state of the art for dense models, they do not address the unique challenges of MoE: heterogeneous expert sensitivity, the combinatorial explosion of per-expert configuration space, and framework constraints that tie all experts within a layer to a shared quantization config. Our work fills this gap with a data-free method validated at 512-expert scale under a budget-constrained optimization framework.
Unlike activation-based methods that require calibration data and forward passes, we analyze sensitivity entirely from weight tensor properties. This makes profiling data-free and embarrassingly parallel across shards. We compute four complementary metrics for each tensor: SVD spectral features, per-group kurtosis, output noise amplification (output sensitivity), and reconstruction error (NRMSE) at candidate configurations.
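As a minimal sketch of two of these metrics, the snippet below computes per-group excess kurtosis and 4-bit reconstruction NRMSE from a weight tensor alone. The function names and the affine min/max quantizer are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def group_kurtosis(w: np.ndarray, group_size: int = 128) -> float:
    """Mean excess kurtosis over contiguous groups of weights.
    High values flag heavy-tailed groups that quantize poorly."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size
    groups = flat[:n].reshape(-1, group_size)
    mu = groups.mean(axis=1, keepdims=True)
    sigma = groups.std(axis=1, keepdims=True) + 1e-12
    z = (groups - mu) / sigma
    return float(((z ** 4).mean(axis=1) - 3.0).mean())

def nrmse_4bit(w: np.ndarray, group_size: int = 128) -> float:
    """Reconstruction NRMSE of 4-bit affine (min/max) group quantization."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size
    groups = flat[:n].reshape(-1, group_size)
    lo = groups.min(axis=1, keepdims=True)
    scale = (groups.max(axis=1, keepdims=True) - lo) / 15.0 + 1e-12
    deq = np.round((groups - lo) / scale) * scale + lo
    return float(np.sqrt(((deq - groups) ** 2).mean()) /
                 (np.sqrt((groups ** 2).mean()) + 1e-12))
```

A near-Gaussian tensor yields excess kurtosis near zero, while a heavy-tailed one scores much higher and also shows larger NRMSE, which is the correlation the next section quantifies.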
MoE models present a scaling challenge: Qwen3.5-397B has 512 experts per layer across 60 layers, yielding thousands of expert weight tensors. Profiling every expert independently is feasible but expensive. We implement two analysis modes in the expert handler:
For Qwen3.5-397B (512 experts), Mode B reduces the number of fully-profiled expert tensors from 30,720 to approximately 2,347 while maintaining coverage of the sensitivity distribution.
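The excerpt does not spell out Mode B's algorithm. Purely as an illustration of how such a reduction could be realized, the hypothetical sketch below clusters experts by cheap summary statistics and fully profiles only one representative (medoid) per cluster; every name and design choice here is an assumption:

```python
import numpy as np

def select_representatives(expert_stats: np.ndarray, n_clusters: int) -> np.ndarray:
    """Pick one representative expert per cluster of cheap summary statistics.
    expert_stats: (n_experts, n_features) z-scored feature matrix."""
    rng = np.random.default_rng(0)
    n = expert_stats.shape[0]
    centers = expert_stats[rng.choice(n, n_clusters, replace=False)]
    for _ in range(20):  # plain Lloyd iterations
        dist = ((expert_stats[:, None, :] - centers[None]) ** 2).sum(-1)
        labels = dist.argmin(1)
        for k in range(n_clusters):
            members = expert_stats[labels == k]
            if len(members):
                centers[k] = members.mean(0)
    # medoid of each cluster = the expert that gets the full (expensive) profile
    dist = ((expert_stats[:, None, :] - centers[None]) ** 2).sum(-1)
    reps = []
    for k in range(n_clusters):
        members = np.where(dist.argmin(1) == k)[0]
        if len(members):
            reps.append(int(members[dist[members, k].argmin()]))
    return np.array(sorted(set(reps)))
```

Under this kind of scheme, only the representatives are run through the full metric suite, while cluster assignments propagate their scores to the remaining experts.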
We profile three architectures spanning dense to large-scale MoE to understand how sensitivity distributions change with expert count and model scale:
| Model | Architecture | Parameters | Tensors | Layers | Avg Bits | Est. Size |
|---|---|---|---|---|---|---|
| Qwen3-8B | Dense | 8.2B | 399 | 36 | 6.84 | 6.7 GB |
| Llama4-Maverick | MoE 128E | 401.6B | 1,061 | 48 | 4.78 | 230.4 GB |
| Qwen3.5-397B | MoE 512E | 403.4B | 2,924 | 60 | 5.06 | 245.1 GB |
To identify which metrics best predict quantization sensitivity, we compute rank correlations between each metric and reconstruction error (NRMSE at 4-bit) across all 2,347 profiled tensors in Qwen3.5-397B:
| Metric | Spearman $\rho$ | Pearson $r$ | $p$-value |
|---|---|---|---|
| Per-group kurtosis | 0.795 | 0.480 | <1e-135 |
| Cross-layer position | −0.468 | −0.224 | <1e-128 |
| SVD spectral features | 0.391 | 0.303 | <1e-86 |
| Composite (weighted) | 0.374 | 0.400 | <1e-79 |
| Output sensitivity | 0.212 | 0.455 | <1e-25 |
Kurtosis dominates as the sensitivity predictor with Spearman $\rho = 0.795$, substantially ahead of the next best metric. Output sensitivity, despite its intuitive appeal, achieves only $\rho = 0.212$. Investigation reveals that output sensitivity saturates at 1.0 for 99.5% of MoE expert tensors (median = 1.0), losing all discriminatory power at 512-expert scale. This saturation occurs because expert weight matrices in large MoE models tend to have similar spectral norms, making the noise amplification ratio nearly identical across experts.
The negative cross-layer position correlation ($\rho = -0.468$) confirms the well-known U-shaped sensitivity pattern: early and late layers are more sensitive to quantization than middle layers. This pattern holds across all three architectures and is captured by the soft protection priors in our allocation pipeline.
For each tensor, we compute the reconstruction error (NRMSE) at eight candidate (bits, group_size) configurations, forming a rate-distortion curve:
Each configuration implies a specific size cost (bits per parameter plus scale/zero-point overhead from the group size) and a distortion level. The rate-distortion curve captures the tensor-specific tradeoff: some tensors see a large NRMSE jump between 4-bit and 8-bit while others degrade gracefully, making them good candidates for aggressive quantization.
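The rate-distortion sweep can be sketched as follows, assuming affine min/max group quantization and an fp16 scale plus zero-point per group; the excerpt does not enumerate the paper's eight candidate configurations, so the candidate set below is illustrative:

```python
import numpy as np

# Illustrative candidate set; the paper's actual eight configs are not listed here.
CANDIDATES = ((2, 64), (3, 64), (4, 64), (4, 128), (6, 64), (8, 64), (8, 128), (16, 128))

def quantize_nrmse(w, bits, group_size):
    """NRMSE of affine (min/max) group quantization at the given config."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size
    g = flat[:n].reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / (2 ** bits - 1) + 1e-12
    deq = np.round((g - lo) / scale) * scale + lo
    return float(np.sqrt(((deq - g) ** 2).mean()) / np.sqrt((g ** 2).mean()))

def rate_distortion_curve(w, candidates=CANDIDATES):
    """Map each (bits, group_size) to (bits/param incl. fp16 scale+zero, NRMSE)."""
    return {(b, gs): (b + 32.0 / gs, quantize_nrmse(w, b, gs)) for b, gs in candidates}
```

Distortion falls monotonically with bits at a fixed group size, while the size cost rises with bits and with smaller groups, which is exactly the tradeoff the allocator consumes.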
MLX's SwitchLinear module [21] requires all experts within a layer to share a single quantization configuration (bits and group_size). This is a hard framework constraint: the quantized weight tensor for all experts in a layer is stored as a single contiguous array with uniform element width. Consequently, our analysis is per-expert, but the allocation must be per-expert-group, where each group corresponds to all experts sharing a (layer, projection_type) pair.
We aggregate per-expert NRMSE values into a group-level distortion estimate using the parameter-weighted mean across all experts in the group:

$$D_g = \frac{\sum_{e \in g} n_e \,\text{NRMSE}_e}{\sum_{e \in g} n_e},$$

where $n_e$ is the number of parameters in expert $e$. This weighting ensures that larger experts (which contribute more to total model size) have proportionally more influence on the group allocation decision.
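The parameter-weighted aggregation is a one-liner; this sketch makes the weighting explicit:

```python
def group_distortion(nrmse_per_expert, params_per_expert):
    """Parameter-weighted mean NRMSE across all experts in a (layer, projection) group."""
    total = sum(params_per_expert)
    return sum(n * d for d, n in zip(nrmse_per_expert, params_per_expert)) / total
```

With equal parameter counts this reduces to the plain mean; a larger expert pulls the group estimate toward its own distortion.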
We formulate the bit-width and group-size allocation as a Multiple-Choice Knapsack Problem (MCKP) [23]. Let $i$ index the tensor groups, $\pi_i$ denote soft protection priors, and $(b_i, g_i) \in \mathcal{C}_i$ denote the candidate configurations for group $i$. The optimization problem is:

$$\min_{\{(b_i, g_i) \in \mathcal{C}_i\}} \; \sum_i \pi_i \, D_i(b_i, g_i) \quad \text{subject to} \quad \sum_i \text{size}_i(b_i, g_i) \le B,$$

where $B$ is the memory budget, $D_i$ is the group-level distortion, and $\text{size}_i(b_i, g_i)$ computes the storage cost including scale and zero-point overhead. The soft protection priors $\pi_i$ increase the effective distortion cost for structurally important components, discouraging aggressive quantization of embeddings, layer norms, and boundary layers:
| Component | Prior Weight ($\pi$) |
|---|---|
| Embeddings | 10.0x |
| LM head | 10.0x |
| Router weights | 8.0x |
| First 2 layers | 3.0x |
| Last 2 layers | 2.0x |
| LayerNorm | $\infty$ (never quantize) |
| All other tensors | 1.0x |
Before optimization, we apply a Signal-to-Quantization-Noise Ratio (SQNR) safety veto [22]. For each tensor $W$ and each candidate configuration, we compute:

$$\text{SQNR} = 10 \log_{10} \frac{\|W\|_F^2}{\|W - \hat{W}\|_F^2},$$

where $\hat{W}$ is the dequantized reconstruction. Any configuration with $\text{SQNR} < 9\;\text{dB}$ (the default floor) is removed from the candidate set $\mathcal{C}_i$ before the MCKP solver runs. This hard constraint prevents catastrophic quantization of tensors where the quantization noise exceeds approximately 35% of the signal amplitude (about 13% of the signal power), regardless of what the budget optimization might prefer.
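A minimal sketch of the veto, again assuming an affine min/max group quantizer (the function names are ours, not the pipeline's):

```python
import numpy as np

def _groupwise_dequant(w, bits, group_size):
    """Quantize-then-dequantize with per-group affine (min/max) parameters."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size
    g = flat[:n].reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / (2 ** bits - 1) + 1e-12
    return np.round((g - lo) / scale) * scale + lo

def sqnr_db(w, bits, group_size):
    """10·log10(signal power / quantization noise power), in dB."""
    flat = w.reshape(-1)
    n = (flat.size // group_size) * group_size
    g = flat[:n].reshape(-1, group_size)
    noise = g - _groupwise_dequant(w, bits, group_size)
    return float(10.0 * np.log10((g ** 2).sum() / ((noise ** 2).sum() + 1e-30)))

def apply_sqnr_veto(w, candidates, floor_db=9.0):
    """Drop (bits, group_size) configs whose reconstruction falls below the floor."""
    return [c for c in candidates if sqnr_db(w, *c) >= floor_db]
```

On a near-Gaussian tensor, 2-bit configurations typically land below the 9 dB floor and are vetoed, while 4-bit and 8-bit survive.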
Raw metric values span vastly different scales across architectures and model sizes. We replace all hardcoded normalization bounds with empirical CDF (eCDF) normalization: each metric value is transformed to its percentile rank across all tensors in the model, yielding scale-invariant scores in $[0, 1]$. For a metric value $x$ with observed values $\{x_1, \ldots, x_n\}$:

$$\hat{F}(x) = \frac{1}{n} \sum_{j=1}^{n} \mathbf{1}\!\left[x_j \le x\right].$$
This eliminates the need for per-metric normalization constants and adapts automatically to the distribution of any model, whether dense or MoE, 8B or 400B parameters.
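The eCDF transform can be sketched in a few lines (ties broken by sort order in this simplified version):

```python
import numpy as np

def ecdf_normalize(values):
    """Replace each metric value by its percentile rank in [0, 1]."""
    v = np.asarray(values, dtype=float)
    ranks = np.empty(v.size)
    ranks[v.argsort()] = np.arange(1, v.size + 1)
    return ranks / v.size
```

Because only ranks matter, any monotone rescaling of a metric (a different unit, a log transform) leaves the normalized scores unchanged, which is exactly the scale invariance the allocator needs.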
We solve the MCKP using a greedy efficiency ordering. Starting from the minimum-cost (lowest bit-width) feasible assignment, we enumerate all possible upgrades (transitions to higher precision) for every group and sort them by efficiency:

$$\text{eff}(i: c \to c') = \frac{\pi_i \left[ D_i(c) - D_i(c') \right]}{\text{size}_i(c') - \text{size}_i(c)},$$

the prior-weighted distortion reduction per unit of additional storage.
Upgrades are applied greedily in decreasing efficiency order until the budget $B$ is exhausted. For the MCKP with concave distortion curves (which holds empirically for quantization), this greedy approach is provably near-optimal. The solver completes in under 100 ms for Qwen3.5-397B (2,924 tensor groups), making it practical for interactive experimentation with different budgets.
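The greedy loop above can be sketched as follows; option lists and the heap-based upgrade queue are our illustrative structure, not the pipeline's exact code:

```python
import heapq

def solve_mckp(groups, budget):
    """Greedy MCKP solver.
    groups: per group, a list of (size, distortion) options sorted by increasing size.
    Starts from each group's cheapest option, then applies upgrades in decreasing
    efficiency (distortion drop per unit of extra size) until the budget is spent."""
    choice = [0] * len(groups)
    spent = sum(opts[0][0] for opts in groups)
    heap = []

    def push_best_upgrade(i):
        s0, d0 = groups[i][choice[i]]
        best = None
        for j in range(choice[i] + 1, len(groups[i])):
            s, d = groups[i][j]
            if s > s0 and d < d0:
                eff = (d0 - d) / (s - s0)
                if best is None or eff > best[0]:
                    best = (eff, j)
        if best is not None:
            heapq.heappush(heap, (-best[0], i, best[1]))  # max-heap via negation

    for i in range(len(groups)):
        push_best_upgrade(i)
    while heap:
        _, i, j = heapq.heappop(heap)
        extra = groups[i][j][0] - groups[i][choice[i]][0]
        if spent + extra <= budget:
            spent += extra
            choice[i] = j
            push_best_upgrade(i)  # this group may have further upgrades
    return choice, spent
```

With concave distortion curves, the first (most efficient) upgrade per group dominates later ones, which is why this greedy order tracks the optimum closely.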
We evaluate on three models spanning dense and MoE architectures at different scales: Qwen3-8B [6] (dense, 8.2B parameters, 399 tensors, 36 layers), Llama4-Maverick-17B-128E [8] (MoE with 128 experts, 401.6B parameters, 1,061 tensors, 48 layers including 24 MoE and 24 dense), and Qwen3.5-397B-A17B [7] (MoE with 512 experts per layer, 403.4B parameters, 2,924 tensors, 60 layers). These models represent three distinct regimes: a small dense model where mixed-precision has limited headroom, a medium-scale MoE with a hybrid dense/MoE architecture, and a large-scale MoE where the expert parameter count dwarfs the shared backbone.
All experiments run on a single Apple M2 Ultra with 192 GB unified memory, using the MLX framework [21] for inference and quantization. The unified memory architecture eliminates CPU-GPU transfer overhead, and MLX's lazy evaluation enables processing models that would not fit in discrete GPU memory. Sensitivity analysis of Qwen3.5-397B completes in approximately 163 minutes, scanning all weight tensors across 55 safetensor shards and profiling 2,924 tensor groups (after expert clustering via Mode B).
We evaluate perplexity on WikiText-2 using 256 sequences of 2,048 tokens each (seed = 42). We report both mean and median perplexity; the median is more robust to outlier sequences that can inflate the mean, particularly for MoE models where routing decisions introduce sequence-level variance. For downstream benchmarks on Qwen3.5-397B, we evaluate MMLU-Pro (thinking mode), ARC-Challenge, GSM8K, and HumanEval using standard evaluation harnesses. Baselines include BF16 (full precision), uniform 4-bit with group_size = 128, and uniform 4-bit with group_size = 64.
Table 3 presents the central perplexity comparison on Qwen3.5-397B-A17B. We compare two versions of our pipeline—SWAN v1 (threshold-based allocation from sensitivity scores) and SWAN v2 (MCKP-based optimization)—against uniform 4-bit baselines at two group sizes.
| Variant | Avg Bits | Group Size | Size (GB) | Perplexity | vs Uniform |
|---|---|---|---|---|---|
| SWAN v1 (threshold) | 5.06 | 128 | 199.1 | 4.283 | −0.3% |
| SWAN v2 (MCKP) | 4.31 | 128 | 199.1 | 4.283 | −0.3% |
| Uniform 4-bit | 4.25 | 128 | 196.0 | 4.298 | — |
| Uniform 4-bit | 4.25 | 64 | 208.5 | 3.931 | — |
| SWAN v2 | 4.56 | 64 | 210.6 | 4.058 | +3.2% worse |
The key result is that SWAN v2 (MCKP) matches the threshold-based v1 at 15% fewer average bits (4.31 vs 5.06) while achieving identical perplexity (4.283). This demonstrates that the budget-constrained optimizer efficiently reallocates bits from over-protected tensors to where they matter. At matched group_size = 128, SWAN v2 beats uniform 4-bit by 0.015 perplexity points (4.283 vs 4.298), a modest but consistent improvement. However, at group_size = 64, uniform 4-bit (3.931) outperforms SWAN v2 (4.058) by 3.2%, a finding we discuss in detail in Section 6.4.
To verify that perplexity improvements translate to downstream task quality, we evaluate the SWAN-quantized Qwen3.5-397B on four standard benchmarks:
| Benchmark | Score |
|---|---|
| MMLU-Pro (thinking) | 77.1% |
| ARC-Challenge | 96.0% |
| GSM8K | 88.7% |
| HumanEval | 78.7% |
Note: we were unable to run the BF16 baseline on our hardware (740+ GB exceeds 192 GB), so direct degradation measurement is not possible. However, these scores are competitive with published BF16 results for this model (NVIDIA reports MMLU-Pro 83.7% at BF16), suggesting the quantized model retains the majority of its reasoning and coding capability. The 96.0% on ARC-Challenge and 88.7% on GSM8K are particularly strong, as these structured reasoning tasks are often sensitive to quantization noise.
Table 5 shows the breakdown of bit-width assignments across all parameters in Qwen3.5-397B under the 226 GB budget. The vast majority of expert parameters (89.4%) safely quantize to 4-bit under the SQNR safety floor, with only 1.6% requiring 8-bit and 1.0% remaining at 16-bit precision. The 8.0% at 6-bit represents tensors where the MCKP solver found the intermediate precision to be the most efficient allocation.
| Precision | Parameters | Percentage |
|---|---|---|
| 4-bit | 360.8B | 89.4% |
| 6-bit | 32.2B | 8.0% |
| 8-bit | 6.5B | 1.6% |
| 16-bit | 3.9B | 1.0% |
The 16-bit parameters (3.9B, 1.0%) correspond primarily to embeddings, the LM head, and LayerNorm parameters—components protected by the soft priors and the SQNR veto. The 8-bit parameters (6.5B, 1.6%) include router weights and attention projections in the first and last two layers, which exhibit the highest kurtosis values in the model.
Comparing rows in Table 3 reveals a striking finding: reducing group size from 128 to 64 yields a 0.225 perplexity improvement for SWAN v2 (4.283 to 4.058), while the entire SWAN mixed-precision allocation yields only 0.015 improvement over uniform 4-bit at matched group size. The uniform 4-bit baseline at group_size = 64 achieves 3.931 perplexity—better than any SWAN variant at group_size = 128.
This is the most practically important finding in our study: for MoE models at 400B+ scale, group size optimization is the primary quality lever. Halving the group size doubles the number of scale and zero-point parameters, providing finer-grained adaptation to local weight distributions. This benefit is orthogonal to and substantially larger than mixed-precision bit-width allocation.
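The size cost of finer groups is simple to quantify. Assuming an fp16 scale and fp16 zero-point per group (which reproduces the 4.25 bits/param reported for the uniform 4-bit, group_size = 128 baseline), the effective bits per weight are:

```python
def effective_bits(bits, group_size, scale_bits=16, zero_bits=16):
    """Bits per weight: payload plus amortized per-group scale and zero-point."""
    return bits + (scale_bits + zero_bits) / group_size

# 4-bit payload at two group sizes
cost_g128 = effective_bits(4, 128)          # 4.25 bits/param
cost_g64 = effective_bits(4, 64)            # 4.50 bits/param
growth = cost_g64 / cost_g128 - 1.0         # ~6% larger model
```

Halving the group size from 128 to 64 raises storage by roughly 6% under these assumptions, in line with the ~7% increase quoted above, in exchange for the large perplexity gain.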
| Method | Mean MSE Reduction | Range |
|---|---|---|
| $k$-means codebook (256 centroids) | 41.1% | 39.7% – 42.6% |
| Hadamard rotation | 8.2% | 2.2% – 16.4% |
| MSE-optimal clipping | 2.7% | — |
Codebook quantization using $k$-means with 256 centroids yields a uniformly large MSE reduction of approximately 41% across all expert tensors, regardless of kurtosis. The kurtosis-MSE correlation for codebook improvement is only $-0.058$, confirming that the benefit is structural (non-linear quantization better fits arbitrary weight distributions) rather than targeted at specific tensor properties. However, codebook quantization requires a lookup-table (LUT) dequantization kernel that is not available in MLX or most inference frameworks, creating a deployment blocker. On BF16 models, the correlation is higher ($+0.537$), but the absolute improvement is similar ($\sim$45% MSE reduction).
Hadamard rotation provides a modest 8.2% mean MSE improvement at 4-bit across 40 tested tensors, with individual improvements ranging from 2.2% to 16.4%. However, the rotation cannot bridge bit levels: 4-bit with Hadamard rotation is still 187–343$\times$ worse in MSE than 8-bit quantization (0 of 16 candidate tensors could be downgraded from 8-bit to 4-bit+Hadamard). MSE-optimal clipping contributes only 2.7%, suggesting that the default clipping in standard quantization is already near-optimal for these weight distributions.
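The rotation experiment can be sketched as follows: multiply by an orthogonal Hadamard matrix, quantize in the rotated basis, and rotate back. Since the rotation is orthogonal, the reconstruction MSE in the original basis equals the quantization MSE in the rotated one (this sketch uses a fixed Sylvester Hadamard rather than the randomized rotations of QuIP#-style methods):

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via Sylvester construction (n a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)

def quant4_dequant(w, group_size=64):
    """4-bit affine group quantize-dequantize."""
    g = w.reshape(-1, group_size)
    lo = g.min(axis=1, keepdims=True)
    scale = (g.max(axis=1, keepdims=True) - lo) / 15.0 + 1e-12
    return (np.round((g - lo) / scale) * scale + lo).reshape(w.shape)

def mse_plain(w):
    return float(((quant4_dequant(w) - w) ** 2).mean())

def mse_rotated(w):
    """Rotate into an incoherent basis, quantize, rotate back."""
    H = hadamard(w.shape[1])
    return float(((quant4_dequant(w @ H) @ H.T - w) ** 2).mean())
```

Mixing heavy-tailed weights through the rotation gaussianizes them, shrinking per-group ranges and hence quantization error, which is the mechanism behind the 8.2% mean improvement reported above.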
To validate our findings beyond Qwen3.5-397B, we applied the full pipeline to Llama4-Maverick (128 experts) and Qwen3-8B (dense). On Maverick, the SWAN pipeline assigns 50 expert tensor groups to 2-bit and 18 to 8-bit, producing a 171.9 GB quantized model with perplexity 6.343 on WikiText-2. On the dense Qwen3-8B, mixed-precision provides negligible benefit over uniform 4-bit, consistent with prior observations that small dense models have insufficient sensitivity heterogeneity for mixed-precision to exploit.
We additionally validated on several other architectures during development. Llama 3.1 70B achieved perplexity 4.221, versus 4.771 for uniform 4-bit at group_size = 128, an 11.5% improvement. Llama 3.3 70B achieved 4.379 versus 5.052, a 13.3% improvement. These results confirm that mixed-precision provides significant gains for large dense models (70B+), with the benefit increasing as model heterogeneity grows. On smaller dense models (8B), the benefit is negligible, and on very large MoE models (397B), the benefit exists but is dwarfed by group size effects.
To understand how expert utilization patterns might inform quantization decisions, we profiled the routing behavior of Qwen3.5-35B-A3B (256 experts per layer, 8 active per token) across 100 diverse prompts spanning 40 MoE layers.
| Metric | Value |
|---|---|
| Experts per layer | 256 |
| Active per token | 8 |
| Avg dead experts per layer | 1.5 / 256 (0.6%) |
| Avg entropy ratio | 0.91 |
| Avg Gini coefficient | 0.53 |
| Top-10 expert traffic share | 20.4% (vs 3.9% if uniform) |
Expert utilization is moderately concentrated: the entropy ratio of 0.91 indicates fairly uniform but not perfectly balanced routing, while the Gini coefficient of 0.53 shows moderate concentration. The top-10 experts carry 20.4% of all traffic, approximately 5x their fair share under uniform routing (3.9%). The average number of completely dead experts is only 1.5 per layer (0.6%), indicating that nearly all experts contribute to at least some inputs.
A critical methodological finding is that prompt diversity dramatically affects dead expert counts. With only 5 prompts, approximately 30% of experts appeared dead; at 100 prompts this dropped to 0.6%. This demonstrates that a diverse prompt set reveals most experts have value, and that pruning decisions based on small calibration sets risk removing experts that are essential for less common but valid inputs.
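The two concentration statistics in the table above have standard definitions and can be computed directly from per-expert token counts; this sketch uses natural-log entropy normalized by its uniform-routing maximum:

```python
import numpy as np

def entropy_ratio(counts):
    """Routing entropy normalized by the uniform-routing maximum log(n)."""
    p = np.asarray(counts, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]
    return float(-(nz * np.log(nz)).sum() / np.log(p.size))

def gini(counts):
    """Gini coefficient of expert traffic (0 = perfectly balanced, ->1 = concentrated)."""
    x = np.sort(np.asarray(counts, dtype=float))
    n = x.size
    cum = np.cumsum(x)
    return float((n + 1 - 2 * (cum / cum[-1]).sum()) / n)
```

Perfectly uniform routing gives an entropy ratio of 1.0 and a Gini of 0; a single dominant expert pushes the ratio toward 0 and the Gini toward 1, bracketing the 0.91 / 0.53 values observed.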
Motivated by the routing analysis in Section 6.7, we developed DynaMINT (Dynamic MINT), a tiered expert quantization scheme that assigns different precisions to experts based on their activation frequency. Using the routing statistics from 100-prompt profiling, experts are classified into four tiers:
| Tier | Precision | Share of Experts |
|---|---|---|
| Critical (high-traffic) | 8-bit | 19.9% |
| Standard | 4-bit | 64.8% |
| Deprioritized (low-traffic) | 2-bit | 11.6% |
| Prunable (near-zero traffic) | 0-bit (removed) | 3.6% |
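A rank-based tiering rule along these lines can be sketched as follows; the rank-fraction thresholds are illustrative stand-ins mirroring the tier shares in the table, not DynaMINT's actual classifier:

```python
import numpy as np

def assign_tiers(traffic, critical_frac=0.20, deprior_frac=0.12, prune_frac=0.04):
    """Rank experts by traffic and split into 8/4/2/0-bit tiers by rank fraction.
    Fractions are illustrative, approximating the tier shares in the table above."""
    traffic = np.asarray(traffic, dtype=float)
    n = traffic.size
    order = traffic.argsort()  # ascending traffic
    bits = np.full(n, 4)       # default: standard tier
    n_prune = int(n * prune_frac)
    n_low = int(n * (prune_frac + deprior_frac))
    n_crit = int(n * critical_frac)
    bits[order[:n_prune]] = 0          # near-zero traffic: pruned
    bits[order[n_prune:n_low]] = 2     # low traffic: deprioritized
    if n_crit:
        bits[order[-n_crit:]] = 8      # high traffic: critical
    return bits
```

Given per-expert token counts from the 100-prompt profile, this produces a per-expert bit assignment that a tiered dispatcher can consume.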
We evaluated DynaMINT against the uniform MINT baseline on Qwen3.5-35B-A3B:
| Variant | Perplexity | vs Baseline | Speed (tok/s) |
|---|---|---|---|
| MINT uniform (baseline) | 6.580 | — | 70.1 |
| DynaMINT (tiered) | 6.613 | +0.5% | 9.6 |
DynaMINT maintains quality with only +0.5% perplexity degradation despite 11.6% of experts at 2-bit and 3.6% pruned entirely. The conversion process is fast, completing all 40 layers in 1.7 seconds. MoE weight size increases by 10.5% due to the 8-bit critical tier; this could be offset by adjusting tier thresholds to reduce the critical tier percentage. Generation quality is preserved: the tiered model produces coherent chain-of-thought responses on all three test prompts.
The primary limitation is inference speed: DynaMINT achieves only 9.6 tok/s compared to 70.1 tok/s for the uniform baseline, a 7x slowdown. This overhead comes entirely from Python-level per-tier dispatch—the current prototype launches separate kernel calls for each precision tier. This is an engineering rather than fundamental limitation: sorted dispatch (grouping tokens by their routed expert's tier before kernel launch) or a native multi-precision kernel would eliminate most of this overhead.
To evaluate expert pruning as an orthogonal compression technique, we measured the perplexity impact of zeroing out the least-activated experts in Qwen3.5-35B-A3B based on the routing statistics from Section 6.7.
| Pruning Level | Experts Removed | Perplexity | Degradation |
|---|---|---|---|
| 0% (baseline) | 0 | 6.580 | — |
| 5% | 480 | 86.906 | 13.2x |
| 10% | 960 | 15,894 | 2,416x |
| 25% | 2,400 | 906,762 | 137,805x |
This is a strong negative result: Qwen3.5-35B-A3B is extremely sensitive to expert pruning. Removing just 5% of experts (480 out of 9,600 total across 40 layers) causes a 13x perplexity degradation, from 6.580 to 86.906. At 10% pruning, the model effectively collapses with perplexity exceeding 15,000. At 25%, the model produces near-random output.
This result directly contradicts the assumption that rarely-activated experts can be safely removed. The routing mechanism in this architecture relies on having all experts available: even experts with low average activation frequency appear to be essential for specific input distributions. Activation frequency alone is not a safe pruning criterion—an expert activated on only 0.1% of tokens may still be critical for those tokens, and its removal cascades through the routing softmax, redistributing probability mass in ways that compound across layers. This finding supports our quantization-first approach over pruning for MoE compression, and suggests that methods proposing significant expert pruning [15] may not generalize to architectures with 256+ experts and fine-grained routing.
The perplexity data presents a clear hierarchy of quantization quality levers for MoE models. Halving group size from 128 to 64 yields a 0.225 perplexity improvement—an order of magnitude larger than the 0.015 improvement from SWAN's mixed-precision allocation at matched group size. This finding has immediate practical implications: practitioners should prioritize group size reduction (accepting the $\sim$7% size increase from additional scale parameters) before investing in mixed-precision profiling. The mixed-precision pipeline remains valuable for squeezing the last fraction of quality at a given size budget, but it is a secondary lever.
MLX's SwitchLinear module stores all expert weights in a single contiguous tensor, requiring all experts within a layer to share one quantization configuration. This prevents true per-expert bit-width differentiation: our analysis computes per-expert sensitivity, but the allocation decision is necessarily per-expert-group. We aggregate via parameter-weighted mean, which is conservative but not optimal. A native MixedBitSwitchGLU kernel that supports heterogeneous expert quantization within a single layer would unlock the full potential of per-expert sensitivity analysis. Our metric correlation data (Table 2) suggests that meaningful sensitivity variance exists within expert groups, as kurtosis scores span a wide interquartile range (0.014 to 0.338 across all 2,347 tensors) and this variance is present both within and across expert groups.
On 512-expert models, the output noise amplification metric saturates at 1.0 (its normalized maximum) for 99.5% of expert tensors. This is because MoE expert weight matrices tend to have similar spectral norms—the routing mechanism and load balancing during training encourage experts to operate at similar scales. The practical implication is that simpler profiling pipelines using only kurtosis and reconstruction error may be sufficient for large MoE models, reducing profiling cost without sacrificing allocation quality. For dense models and small MoE models ($\leq 32$ experts), output sensitivity retains discriminatory power and should be included.
The 41% MSE reduction from $k$-means codebook quantization represents a substantial untapped opportunity for MoE compression. Unlike mixed-precision allocation (which redistributes a fixed bit budget) or Hadamard rotation (which provides modest within-bitwidth improvement), codebook quantization fundamentally changes the representation power per bit. The uniform improvement across kurtosis levels ($\rho = -0.058$) indicates this benefit is structural—non-linear quantization grids better fit arbitrary weight distributions regardless of their statistical properties. This makes codebook quantization a complementary technique rather than a replacement for mixed-precision allocation: one optimizes the quantization grid, the other optimizes the bit budget distribution. The primary barrier is kernel support: an efficient LUT dequantization kernel is needed for practical deployment, which is absent from MLX, CUDA (for standard formats), and most inference frameworks.
Several limitations constrain the scope of our findings. Chief among them is the lack of a native MixedBitSwitchGLU kernel that dispatches tokens to experts quantized at different bit-widths. DynaMINT demonstrates a pure-Python prototype achieving only 0.5% perplexity degradation with tiered quantization, but with 7x speed overhead from per-tier kernel launches. Sorted dispatch or native kernel support would eliminate this overhead.

We have presented the first data-free mixed-precision quantization pipeline validated on 512-expert MoE models at 403 billion parameters. By profiling weight sensitivity using four complementary metrics—spectral features, per-group kurtosis, output noise amplification, and reconstruction error—we characterize 2,347 tensor groups across Qwen3.5-397B-A17B without requiring any calibration data. Our MCKP formulation with expert grouping constraints and SQNR safety vetoes solves in under 100 ms for any model size, producing provably near-optimal (bits, group_size) assignments per expert group. Key findings include: kurtosis is the dominant sensitivity predictor (Spearman $\rho = 0.795$), 89.4% of expert parameters safely quantize to 4-bit, and group size has a larger impact on perplexity than bit-width allocation at this scale. Codebook quantization (41% MSE reduction) and Hadamard rotation (8.2%) establish practical boundaries for future MoE compression techniques.
We further contribute DynaMINT, a tiered expert quantization scheme informed by activation profiling that assigns critical experts to 8-bit, standard experts to 4-bit, and deprioritized experts to 2-bit. DynaMINT maintains quality at only +0.5% perplexity degradation despite 11.6% of experts at 2-bit and 3.6% pruned entirely, demonstrating that activation-aware tiering is a viable complement to weight-based sensitivity analysis. Our expert pruning study provides an important negative result: even 5% expert removal causes 13x perplexity degradation on Qwen3.5-35B-A3B, establishing that activation frequency alone is not a safe pruning criterion and that MoE routing relies fundamentally on expert diversity.
We release all code, sensitivity manifests, and quantized models to facilitate reproduction and extension. The MINT pipeline is available at github.com/baa-ai/MINT, with pre-quantized models hosted at huggingface.co/baa-ai. We believe the key actionable insight—that group size dominates bit-width allocation for large MoE models—should inform both practitioners choosing quantization configurations and framework developers prioritizing kernel optimizations. Future work should focus on native multi-precision dispatch kernels (eliminating DynaMINT's 7x Python overhead), codebook dequantization support, and joint optimization of group size and bit-width allocation.
[1] Lepikhin, D., Lee, H., Xu, Y., Chen, D., Firat, O., Huang, Y., Krikun, M., Shazeer, N., and Chen, Z. GShard: Scaling giant models with conditional computation and automatic sharding. ICLR, 2021.
[2] Fedus, W., Zoph, B., and Shazeer, N. Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity. JMLR, 23(120):1–39, 2022.
[3] Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., de las Casas, D., Hanna, E.B., Bressand, F., et al. Mixtral of Experts. arXiv preprint arXiv:2401.04088, 2024.
[4] DeepSeek-AI. DeepSeek-V2: A strong, economical, and efficient Mixture-of-Experts language model. arXiv preprint arXiv:2405.04434, 2024.
[5] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2025.
[6] Yang, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Li, C., Liu, D., Huang, F., Wei, H., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025.
[7] Alibaba Cloud. Qwen3.5: Advancing the frontier with 512 experts. Technical report, 2025.
[8] Meta AI. Llama 4: Open-weight Mixture-of-Experts models. Technical report, 2025.
[9] Frantar, E., Ashkboos, S., Hoefler, T., and Alistarh, D. GPTQ: Accurate post-training quantization for generative pre-trained transformers. ICLR, 2023.
[10] Lin, J., Tang, J., Tang, H., Yang, S., Chen, W.-M., Wang, W.-C., Xiao, G., Dang, X., Gan, C., and Han, S. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. MLSys, 2024.
[11] Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M.W., and Keutzer, K. SqueezeLLM: Dense-and-sparse quantization. ICML, 2024.
[12] Badri, H. and Shaji, A. HQQ: Half-quadratic quantization of large language models. NeurIPS Workshop on Efficient Natural Language and Speech Processing, 2024.
[13] Guo, C., Chen, J., Li, J., Zhou, Y., Chen, T., Xie, L., and Zhang, B. MXQ: Mixed-precision quantization for efficient LLM deployment. arXiv preprint arXiv:2401.12917, 2024.
[14] Tseng, A., Chee, J., Sun, Q., Kuleshov, V., and De Sa, C. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. ICML, 2024.
[15] Li, W., Zhang, Y., Sun, H., Wang, X., and Qiu, X. MC-MoE: Mixture compressor for Mixture-of-Experts LLMs gains more. arXiv preprint arXiv:2410.06270, 2024.
[16] Frantar, E. and Alistarh, D. QMoE: Practical sub-1-bit compression of trillion-parameter models. arXiv preprint arXiv:2310.16795, 2023.
[17] Kim, Y., Lee, J., Park, S., and Shin, J. MoEQuant: Expert-wise quantization for Mixture-of-Experts models. arXiv preprint arXiv:2406.02279, 2024.
[18] Chen, Z., Qin, K., Zhang, Y., Li, P., Zhao, J., and Liang, X. DynaExq: Dynamic expert-level mixed-precision quantization for Mixture-of-Experts. arXiv preprint arXiv:2405.11009, 2024.
[19] Black Sheep AI. SWAN: SmartQuant data-free per-tensor mixed-precision quantization for LLMs on Apple Silicon. Technical report, baa.ai, 2026.
[20] Black Sheep AI. MINT: Memory-Informed N-bit Tuning — compute-optimal data-free mixed-precision quantization for LLMs. Technical report, baa.ai, 2026.
[21] Apple. MLX: An array framework for Apple Silicon. GitHub repository, github.com/ml-explore/mlx, 2024.
[22] Gray, R.M. and Neuhoff, D.L. Quantization. IEEE Transactions on Information Theory, 44(6):2325–2383, 1998.
[23] Kellerer, H., Pferschy, U., and Pisinger, D. Knapsack Problems. Springer, Berlin, 2004.
[24] Barlow, R.E., Bartholomew, D.J., Bremner, J.M., and Brunk, H.D. Statistical Inference under Order Restrictions: The Theory and Application of Isotonic Regression. Wiley, New York, 1972.
[25] Hampel, F.R. The influence curve and its role in robust estimation. Journal of the American Statistical Association, 69(346):383–393, 1974.
Code and models: github.com/baa-ai/MINT | huggingface.co/baa-ai
Correspondence: research@baa.ai