Best Practices for Benchmarking

Benchmarking is an engineering task that needs stability and reproducibility. You'll be calling the model thousands of times; even tiny drifts in system setup or network latency can compromise result accuracy. Here's what we've learned to keep things reproducible and trustworthy.

Quick notes

  • For any unlisted or closed-source benchmark: set temperature = 1.0, stream = true, top_p = 0.95

  • Reasoning benchmarks: max_tokens = 128k, and run at least 500–1000 samples to get low variance (e.g. AIME 2025: 32 runs → 30 × 32 = 960 questions)

  • Coding benchmarks: max_tokens = 256k

  • Agentic task benchmarks:

    • For multi-hop search: max_tokens = 256k + context management

    • Others: max_tokens ≥ 16k–64k
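The quick notes above translate into a request payload roughly like this. A minimal sketch, assuming an OpenAI-compatible chat-completions API; the model name and token budget are placeholders to adjust per benchmark:

```python
# Sketch: sampling settings for a reasoning-benchmark run, using the values
# from the quick notes above (assumes an OpenAI-compatible /chat/completions
# request body; adjust max_tokens per benchmark category).
def reasoning_benchmark_params(model: str = "kimi-k2.5") -> dict:
    return {
        "model": model,
        "temperature": 1.0,     # default for unlisted/closed-source benchmarks
        "top_p": 0.95,
        "stream": True,         # keeps long generations from being cut off
        "max_tokens": 128_000,  # reasoning benchmarks: 128k budget
    }
```

For coding benchmarks you would raise `max_tokens` to 256k, per the notes above.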

K2.5 Models Benchmark Recommended Settings

| Benchmark Category | Benchmark | Temperature | Recommended max tokens | Recommended runs | Top-p | Others (e.g. test log) |
| --- | --- | --- | --- | --- | --- | --- |
| Multi-modal | MMMU-Pro | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | CharXiv (RQ) | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | MathVision | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | MathVista | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | OCRBench | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | ZeroBench | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | WorldVQA | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | InfoVQA (val) | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | SimpleVQA | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | ZeroBench w/ tools | 1.0 | max tokens = 64k | 3 | top_p = 0.95 | Recommended max steps = 30; thinking = {"type": "enabled"} |
| Code | SWE Series | 1.0 | per step tokens = 16k; total max tokens = 256k | 5 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | AIME2025 w/ tools | 1.0 | per turn tokens = 96k; total max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 200; parallel n = 8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | top_p = 0.95 | thinking = {"type": "enabled"} |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | top_p = 0.95 | thinking = {"type": "enabled"} |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 250; use a context-management mechanism to prevent overly long context and ensure enough tool calls; include today's date in the system prompt and let the model search when it is uncertain |
| Agentic Task | Tau | 1.0 | >= 16k | 4 | top_p = 0.95 | thinking = {"type": "enabled"}; recommended max steps = 100 |

For third-party providers, refer to Kimi Vendor Verifier (KVV) to choose high-accuracy services. Details: https://kimi.com/blog/kimi-vendor-verifier.html
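The agentic search rows above recommend a context-management mechanism. One common shape is to drop the oldest non-system turns once the conversation exceeds a token budget. A minimal sketch; the 4-characters-per-token estimate below is a rough placeholder, not the model's real tokenizer:

```python
# Sketch: trim the oldest non-system messages once the context exceeds a
# token budget, so long agentic runs keep room for further tool calls.
def estimate_tokens(message: dict) -> int:
    # Crude placeholder: ~4 characters per token. Use the serving stack's
    # real tokenizer for production accounting.
    return max(1, len(str(message.get("content") or "")) // 4)

def trim_context(messages: list[dict], budget_tokens: int) -> list[dict]:
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    while rest and sum(map(estimate_tokens, system + rest)) > budget_tokens:
        rest.pop(0)  # discard the oldest turn first; keep system prompt intact
    return system + rest
```

More sophisticated schemes summarize evicted turns instead of dropping them; the eviction order shown here is only one reasonable choice.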

Tool Use Compatibility

When using tools, if the thinking parameter is set to {"type": "enabled"}, please note the following constraints to ensure model performance:

  • tool_choice can only be set to "auto" or "none" (the default is "auto"), to avoid conflicts between reasoning content and a forced tool choice; any other value will result in an error.
  • During multi-step tool calling, you must keep the reasoning_content from the current turn's assistant tool-call message in the context; otherwise an error will be thrown.
  • The official built-in $web_search tool is temporarily incompatible with Kimi K2.5 thinking mode; disable thinking mode first if you need to use $web_search.

You can refer to Use Thinking Models for correct usage of tool calling.
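The second constraint above can be sketched as follows. The message shapes mimic OpenAI-style tool calling and are assumptions to verify against the API reference, but the point is fixed: the assistant turn that made the tool call, including its reasoning_content, stays in the message list:

```python
# Sketch: feed a tool result back while keeping the assistant tool-call turn
# (with its reasoning_content) in context, as thinking mode requires.
def append_tool_round(messages: list, assistant_msg: dict,
                      tool_result: str, tool_call_id: str) -> list:
    # Do NOT strip reasoning_content from assistant_msg before appending;
    # the next request in thinking mode will error out without it.
    messages.append(assistant_msg)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": tool_result,
    })
    return messages
```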

K2-Thinking Series Models Benchmark Recommended Settings

| Category | Benchmark | Temperature | Max tokens | Suggested runs | Notes |
| --- | --- | --- | --- | --- | --- |
| Code | SWE | 0.7 (recommended), 1.0 (ok) | per step tokens = 16k; total max tokens = 256k | 5 | |
| | Lcb + OJBench | 1.0 | max tokens = 128k | 1 | |
| | TerminalBench | 1.0 | max tokens = 128k | 3 | |
| Reasoning | AIME2025 no tools | 1.0 | total max tokens = 96k | 32 | |
| | AIME2025 w/ tools | 1.0 | per step tokens = 48k; total max tokens = 128k | 16 | max steps = 120 |
| | HLE no tools | 1.0 | max tokens = 96k | 1 | |
| | HLE w/ tools | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 120 |
| | HLE heavy | 1.0 | total max tokens = 128k; per step tokens = 48k | 1 | max steps = 200; parallel n = 8 |
| | HMMT2025 no tools | 1.0 | max tokens = 96k | 32 | |
| | HMMT2025 w/ tools | 1.0 | per step tokens = 96k; total tokens = 96k | 32 | max steps = 120 |
| | IMO-AnswerBench | 1.0 | max tokens = 96k | 3 | |
| | GPQA-Diamond | 1.0 | max tokens = 96k | 8 | |
| Agentic Search Task | BrowseComp / BrowseComp-ZH / Seal-0 / Frames | 1.0 | per step tokens = 24k; total max tokens = 256k | 4 | max steps = 250; enable context management to prevent context overflow and ensure enough tool calls; include today's date in the system prompt and tell the model to search when unsure |
| Agentic Task | Tau | 0.0 | >= 16k | 4 | max steps = 100 |

API Recommendations & Notes

  • Use the official API: some third-party endpoints show noticeable accuracy drift.

  • Use the recommended models for testing

    • For K2 series: use kimi-k2-thinking-turbo for faster inference
    • For K2.5: use kimi-k2.5 for testing
  • Must set: stream = true

    • Non-streaming mode can lead to random mid-connection interruptions that are hard to control.
  • Current API default settings:

    • Kimi K2 Thinking:
      • default temperature = 1.0
      • default max_tokens = 64000
    • Kimi K2.5:
      • default max_tokens = 32768
      • default thinking = {"type": "enabled"}
      • default temperature = 1.0
      • default top_p = 0.95
      • default n = 1
      • default presence_penalty = 0.0
      • default frequency_penalty = 0.0
  • Timeouts:

    • With stream = false, the api.moonshot.ai timeout is 2 hours, but some ISPs may terminate the connection earlier.

    • This is one more reason to set stream = true.

  • Concurrency:

    • Keep concurrency low to avoid rate limiting
  • Retry logic is not optional:

    • handle overload errors

    • handle unexpected finish reasons caused by transient server issues

    • handle errors caused by network problems
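The three retry cases above can be sketched as a single wrapper. The exception handling and jittered backoff schedule below are illustrative choices, not part of the API, and the expected finish reasons are assumptions to check against your response schema:

```python
import random
import time

# Sketch: retry a request on transient failures (overload, odd finish reasons,
# network errors) with jittered exponential backoff.
def call_with_retry(request_fn, max_attempts: int = 5, base_delay: float = 1.0):
    for attempt in range(1, max_attempts + 1):
        try:
            result = request_fn()
            # Treat a truncated or unknown finish reason as a transient fault.
            if result.get("finish_reason") not in ("stop", "tool_calls"):
                raise RuntimeError(f"unexpected finish: {result.get('finish_reason')}")
            return result
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** (attempt - 1) * random.uniform(0.5, 1.5))
```

In practice you would retry only on retryable errors (429s, 5xx, connection resets) and fail fast on authentication or validation errors.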

FAQ

Q1. Is the temperature setting consistent across models?

A. No. Different model families use different recommended temperatures:

  • K2.5: temperature = 1.0

  • K2-Thinking series: temperature = 1.0

  • other K2 models: temperature = 0.6

Q2. Why use stream = true?

A. Long outputs can take minutes. Idle TCP connections may be terminated by firewalls, load balancers, or NAT gateways. Streaming keeps the connection alive and significantly improves reliability. In production, requests with stream = false fail far more often than with stream = true.
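With stream = true, the reply arrives as deltas that the client assembles itself. A sketch of the accumulation loop; the chunk shape mimics OpenAI-style streaming deltas and should be verified against the SDK you use:

```python
# Sketch: assemble a streamed completion from OpenAI-style delta chunks.
# Each chunk is assumed to look like {"choices": [{"delta": {"content": "..."}}]}.
def collect_stream(chunks) -> str:
    parts = []
    for chunk in chunks:
        delta = chunk["choices"][0]["delta"]
        if delta.get("content"):  # some chunks carry no content (e.g. role-only)
            parts.append(delta["content"])
    return "".join(parts)
```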

Q3. How much concurrency should I use?

A. Your API account has specific rate limits (see Recharge and Rate Limits). Start low. If you hit HTTP 429 (rate limit), your concurrency is too high. Accuracy > speed, so tune concurrency to stay within limits.
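Capping concurrency is simplest with a small worker pool. A minimal sketch; the limit of 4 is an arbitrary starting point, not a documented quota:

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch: cap in-flight benchmark requests with a small thread pool so you
# stay under your account's rate limits. Tune max_workers to your tier,
# lowering it if you start seeing HTTP 429 responses.
def run_with_limit(tasks, worker, max_workers: int = 4):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, tasks))  # preserves input order
```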

Q4. Why should I add retry?

A. Even with streaming, requests can fail due to transient network issues. Retry on temporary faults (network jitter, server overload, rate limiting) so one-off errors don't invalidate a run.

Q5. Why should multi-turn or multi-step tasks include full context and reasoning?

A. The model needs full context to stay logically consistent. Without previous reasoning steps, later turns can go off track or produce incomplete answers.

Contact Us

Hit any issues? Drop us an email at [email protected] with your logs. We'll take a look!