TensorRT-LLM PR #10672

Cut MoE Inference Costs by 25-50%

Entropy-guided dynamic expert selection for Mixture-of-Experts models. Near-identical accuracy, dramatically lower compute. Validated on Mixtral, Qwen-MoE, and OLMoE.

52.5%
Mixtral savings
32.4%
Qwen-MoE savings
24.7%
OLMoE savings
adaptive_k_routing.py
import torch
import torch.nn.functional as F

def select_experts(router_logits: torch.Tensor):
    # Compute routing entropy for one token's router logits
    probs = F.softmax(router_logits, dim=-1)
    H = -(probs * probs.clamp_min(1e-9).log()).sum()

    # Low entropy = confident routing
    # Use fewer experts!
    if H < 0.6:
        K = 1  # 87.5% compute saved
    elif H < 1.2:
        K = 2  # 75% compute saved
    else:
        K = 4  # Full routing

    return torch.topk(probs, K)  # (expert weights, expert indices)

Validated Results

Real compute savings on production MoE models. Accuracy measured relative to full Top-K routing baseline.

Mixtral 8x7B

52.5% compute reduction, 99.8% accuracy retained
K=1 used 78% of the time with minimal quality loss

Qwen-MoE

32.4% compute reduction, 99.9% accuracy retained
Effective across all entropy thresholds

OLMoE-1B-7B

24.7% compute reduction, 99.7% accuracy retained
Consistent savings on smaller MoE architecture

Results validated via WikiText-2 perplexity benchmarks
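
For reference, a minimal sketch of the measurement itself (not the PR's evaluation harness): given next-token logits and labels from any causal LM pass over WikiText-2, perplexity is exp(mean negative log-likelihood). Running the same text once with full Top-K routing and once with Adaptive-K and comparing the two numbers gives the kind of retention comparison quoted above.

import math
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, labels: torch.Tensor) -> float:
    # logits: [num_tokens, vocab_size], labels: [num_tokens]
    nll = F.cross_entropy(logits, labels, reduction="mean")
    return math.exp(nll.item())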

How It Works

Adaptive-K uses information theory to make intelligent routing decisions. The key insight: routing entropy predicts when fewer experts are sufficient.

01. Compute Router Entropy

For each token, calculate the entropy H of the router softmax distribution. Low entropy = confident routing.

H = -sum(p * log(p))
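
A minimal batched sketch of this step (the function name and shapes are illustrative, not taken from the PR): router_logits is [num_tokens, num_experts] and one entropy value comes back per token.

import torch
import torch.nn.functional as F

def routing_entropy(router_logits: torch.Tensor) -> torch.Tensor:
    # Softmax over the expert dimension, then Shannon entropy per token
    probs = F.softmax(router_logits, dim=-1)
    return -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)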

02. Dynamic K Selection

Based on entropy thresholds, select fewer experts for confident tokens, more for uncertain ones.

K = 1 if H < 0.6 else (2 if H < 1.2 else 4)
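
The same rule can be vectorized for batched inference; this hypothetical variant (THRESHOLDS, K_CHOICES, and select_k are illustrative names) maps each token's entropy to its K in a single call.

import torch

THRESHOLDS = torch.tensor([0.6, 1.2])  # entropy boundaries
K_CHOICES = torch.tensor([1, 2, 4])    # experts per entropy bucket

def select_k(H: torch.Tensor) -> torch.Tensor:
    # right=True yields bucket 0 for H < 0.6, 1 for 0.6 <= H < 1.2, 2 otherwise
    return K_CHOICES[torch.bucketize(H, THRESHOLDS, right=True)]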

03. Sparse Expert Execution

Only execute the selected K experts. Skip unnecessary computation entirely.

output = sum(expert[i](x) * w[i] for i in top_k)
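
A simplified per-token sketch of that mixture, assuming experts is a list of expert modules and the kept router weights are renormalized (as Mixtral-style routers do); this is illustrative, not the fused TensorRT-LLM kernel.

import torch
import torch.nn.functional as F

def moe_forward(x, router_logits, experts, K):
    probs = F.softmax(router_logits, dim=-1)
    weights, idx = torch.topk(probs, K)
    weights = weights / weights.sum()  # renormalize over the K kept experts
    # Only the K selected experts run; the rest are skipped entirely
    return sum(w * experts[i](x) for w, i in zip(weights, idx.tolist()))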

💡 The Key Insight

When the router is confident (low entropy), it has already identified the "right" expert. Running additional experts adds compute cost but minimal value. By dynamically adjusting K based on entropy, we skip unnecessary work while maintaining output quality.

Professional Services

Bring Adaptive-K savings to your production MoE deployments. All services include documentation and knowledge transfer.

Feasibility Assessment

From €2,500
Typical duration: 1-2 weeks

Analyze your MoE deployment to estimate potential savings

  • Router entropy analysis
  • Savings projection report
  • Implementation roadmap
  • Risk assessment
Most Popular

Implementation Package

From €8,000
Typical duration: 4-6 weeks

Full Adaptive-K integration into your inference pipeline

  • Custom threshold calibration
  • Production-ready code
  • Performance benchmarks
  • Integration support
  • 30-day warranty

Expert Consulting

€1,000/day
Typical duration: Flexible

On-demand expertise for your AI optimization needs

  • Architecture review
  • Performance tuning
  • Team training
  • Code review

Need a custom solution?

Open Resources

The research is open. The code is open. Start exploring today.

Citation

@article{balsamo2025adaptivek,
  title={Entropy-Guided Dynamic Expert Selection in Mixture-of-Experts Models},
  author={Balsamo, Gabriele},
  year={2025},
  url={https://github.com/Gabrobals/sbm-efficient}
}

Get In Touch

Ready to reduce your MoE inference costs? Let's discuss how Adaptive-K can help.

Email us directly at: amministrazione@vertexdata.it