TensorRT-LLM PR #10672
DOI: 10.5281/zenodo.18282008

Cut MoE Inference Costs by 30-50%

Entropy-guided dynamic expert selection for Mixture-of-Experts models. Same accuracy, dramatically lower compute. Validated on Nemotron 3 Nano, Mixtral, Qwen-MoE, and OLMoE.

31.0% Mixtral savings
32.4% Qwen-MoE savings
24.7% OLMoE savings
adaptive_k_routing.py
import numpy as np

def select_experts(router_logits):
    # Compute routing entropy of the softmax distribution
    probs = np.exp(router_logits - np.max(router_logits))
    probs /= probs.sum()
    H = -np.sum(probs * np.log(probs + 1e-9))

    # Low entropy = confident routing
    # Use fewer experts!
    if H < 0.6:
        K = 1  # 87.5% compute saved
    elif H < 1.2:
        K = 2  # 75% compute saved
    else:
        K = 4  # Full routing

    # Indices of the top-K experts, highest probability first
    top_k = np.argsort(probs)[::-1][:K]
    return top_k, probs[top_k]

Validated Results

Real compute savings on production MoE models. Accuracy measured relative to full Top-K routing baseline.

Nemotron 3 Nano (MoE): 33.3% compute reduction, 99.9% accuracy retained
NVIDIA Nemotron 3 Nano: 128 experts, validated Jan 2026

Mixtral 8x7B (MoE): 31.0% compute reduction, 99.8% accuracy retained
K=1 used 78% of the time with minimal quality loss

Qwen-MoE (MoE): 32.4% compute reduction, 99.9% accuracy retained
Effective across all entropy thresholds

OLMoE-1B-7B (MoE): 24.7% compute reduction, 99.7% accuracy retained
Consistent savings on smaller MoE architecture

🔬 Multiplicative Savings: Technique Combinations

Adaptive-K stacks with other optimizations: each technique trims a different axis of compute, so the savings compound multiplicatively rather than just adding.

Adaptive-K + Early Exit (COMBO): 68.0% compute reduction (only 32.0% compute used)

Adaptive-K + ToMe (COMBO): 51.9% compute reduction (only 48.1% compute used) 🏆 BEST

Triple Combo (MAX): 90.7% compute reduction (only 9.3% compute used)

💡 Key Insight: Adaptive-K reduces experts per token, Early Exit skips layers, Token Pruning (ToMe) reduces sequence length. Because each technique trims a different axis of compute, the remaining compute fractions multiply rather than add, leaving 0.093 of baseline compute (90.7% savings). See Whitepaper Proposition 7.1.
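To see why remaining fractions multiply, a few lines suffice. This is an illustrative sketch: the `combined_savings` helper and the example factors are not part of the SDK or the whitepaper.

```python
from math import prod

def combined_savings(remaining_fractions):
    # Each technique leaves a fraction of the original compute.
    # Independent techniques act on different axes (experts, layers,
    # tokens), so the remaining fractions multiply.
    return 1.0 - prod(remaining_fractions)

# Illustrative factors only: two techniques that each halve
# compute combine to 75% savings, not 100%.
print(combined_savings([0.5, 0.5]))  # 0.75
```

Savings never add linearly: the second technique only acts on the compute the first one left behind.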

Results validated via WikiText-2 perplexity benchmarks

How It Works

Adaptive-K uses information theory to make intelligent routing decisions. The key insight: routing entropy predicts when fewer experts are sufficient.

01. Compute Router Entropy

For each token, calculate the entropy H of the router softmax distribution. Low entropy = confident routing.

H = -sum(p * log(p))

02. Dynamic K Selection

Based on entropy thresholds, select fewer experts for confident tokens, more for uncertain ones.

K = 1 if H < 0.6 else (2 if H < 1.2 else 4)

03. Sparse Expert Execution

Only execute the selected K experts. Skip unnecessary computation entirely.

output = sum(expert[i](x) * w[i] for i in top_k)
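The three steps can be sketched end to end in NumPy. This is a toy illustration, not the production implementation: the experts are random linear maps, and the thresholds are the ones quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, num_experts = 16, 8
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(num_experts)]
router_w = rng.standard_normal((d, num_experts)) * 0.1

def adaptive_moe(x):
    # Step 01: router entropy of the softmax distribution
    logits = x @ router_w
    p = np.exp(logits - logits.max())
    p /= p.sum()
    H = -np.sum(p * np.log(p + 1e-9))
    # Step 02: dynamic K from entropy thresholds
    K = 1 if H < 0.6 else (2 if H < 1.2 else 4)
    # Step 03: execute only the top-K experts, renormalizing weights
    top = np.argsort(p)[::-1][:K]
    w = p[top] / p[top].sum()
    return sum(wi * (x @ experts[i]) for i, wi in zip(top, w)), K

y, K = adaptive_moe(rng.standard_normal(d))
```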
💡 The Key Insight

When the router is confident (low entropy), it has already identified the "right" expert. Running additional experts adds compute cost but minimal value. By dynamically adjusting K based on entropy, we skip unnecessary work while maintaining output quality.

NEW IN v0.1.4

Production Observability

Monitor, debug, and optimize your Adaptive-K deployment with built-in observability tools.

📊 Prometheus Metrics

Production-ready metrics: latency, throughput, K distribution, compute savings

metrics.start_http_server(9090)
# adaptive_k_latency_seconds
# adaptive_k_avg_k
# adaptive_k_compute_saved_ratio
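Under the hood, a Prometheus exporter just serves metrics in a plain-text exposition format. The `render_metrics` helper below is an illustrative stand-in for what the endpoint emits; production deployments would use the prometheus_client library.

```python
def render_metrics(metrics):
    # Prometheus text exposition format: one "name value" line per metric
    return "".join(f"{name} {value}\n" for name, value in metrics.items())

snapshot = {
    "adaptive_k_avg_k": 1.5,
    "adaptive_k_compute_saved_ratio": 0.31,
}
text = render_metrics(snapshot)
```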
📝 Structured Logging

JSON-formatted logs for ELK, Datadog, or any log aggregator

logger = get_logger("inference")
logger.log_inference(trace)
# {"ts":"...", "avg_k":1.5, "latency_ms":45}
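The same JSON-lines shape can be reproduced with the standard library alone; `log_inference` here is an illustrative stand-in for the SDK's logger, not its actual signature.

```python
import json
import time

def log_inference(avg_k, latency_ms):
    # One JSON object per line: trivially ingested by ELK or Datadog
    record = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "avg_k": avg_k,
        "latency_ms": latency_ms,
    }
    line = json.dumps(record)
    print(line)
    return line
```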
🔍 Tracing & Debug

Per-layer entropy analysis and K selection visualization

debugger.trace_k_selection(entropies)
# Layer 0 | H=0.42 ████░░░░ | K=1
# Layer 1 | H=1.23 ████████ | K=4

A/B Testing

Built-in framework to compare Adaptive-K vs Full-K in production

ab_test.assign_variant(request_id)
ab_test.compute_results()
# Latency: -32%, Quality: -0.1%
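Variant assignment in such a framework is typically a deterministic hash of the request ID, so the same request always lands in the same arm. A minimal sketch; `assign_variant` below is illustrative, not the SDK's API.

```python
import hashlib

def assign_variant(request_id, adaptive_fraction=0.5):
    # Deterministic bucket in [0, 1) from a stable hash of the ID
    h = int(hashlib.sha256(request_id.encode()).hexdigest(), 16)
    bucket = (h % 10_000) / 10_000
    return "adaptive_k" if bucket < adaptive_fraction else "full_k"
```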

Install with observability support:

pip install adaptive-k-routing[observability]

What We Do

Bring Adaptive-K savings to your production MoE deployments. All services include documentation and knowledge transfer.

Feasibility Assessment

Analyze your MoE deployment to estimate potential savings and create an implementation roadmap.

  • Router entropy analysis
  • Savings projection report
  • Implementation roadmap
  • Risk assessment

Full Implementation

Complete Adaptive-K integration into your inference pipeline with production-ready code.

  • Custom threshold calibration
  • Production-ready code
  • Performance benchmarks
  • Integration support
  • 30-day warranty

Expert Consulting

On-demand expertise for AI optimization, architecture review, and team training.

  • Architecture review
  • Performance tuning
  • Team training
  • Code review

Services & Pricing

Expert consulting and integration services. The SDK is free and open source; you pay for our expertise and time.

Discovery Call

45-minute expert consultation

  • Analyze your MoE stack
  • ROI estimation
  • Implementation roadmap
  • Q&A session
  • Recording provided
Integration Package (Most Popular)

Complete Adaptive-K implementation

  • Full implementation
  • Custom threshold calibration
  • Production-ready code
  • Performance benchmarks
  • 30-day support included
  • Documentation & training

Enterprise Support

Ongoing support & priority access

  • SLA guarantees
  • Dedicated Slack channel
  • Priority bug fixes
  • Monthly review calls
  • Custom feature requests
  • Invoice billing
All services tailored to your needs. Get in touch.

Get In Touch

Ready to reduce your MoE inference costs? Let's discuss how Adaptive-K can help.

Or email us directly at: amministrazione@vertexdata.it