🎉 New kimi-k2.5 multimodal model released! It supports multimodal understanding and processing.
User Manual

Main Concepts

Text Generation Model

Moonshot's text generation model (referred to as moonshot-v1) is trained to understand natural language and written text. It generates text output based on the input you provide, known as a "prompt." We generally recommend providing clear instructions and a few examples so the model can complete the intended task; designing a prompt is essentially learning how to "train" the model. The moonshot-v1 model can be used for a variety of tasks, including content or code generation, summarization, conversation, and creative writing.

Language Model Inference Service

The language model inference service is an API service based on pre-trained models developed and trained by us (Moonshot AI). By design, we primarily expose a Chat Completions interface, which can be used to generate text. It does not, however, support access to external resources such as the internet or databases, nor does it execute any code.

Token

Text generation models process text in units called Tokens. A Token represents a common sequence of characters. For example, a long English word like "antidisestablishmentarianism" may be broken down into several Tokens, while a short, common word like "word" is typically represented by a single Token. Generally speaking, for typical English text, 1 Token is roughly equivalent to 3-4 English characters.

It is important to note that for our text model, the total length of Input and Output cannot exceed the model's maximum context length.
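As a rough illustration of this budget, assuming the ~4-characters-per-token heuristic above, a sketch like the following can check whether a prompt plus the requested completion fits within a model's context window (the function names and the 8k default are illustrative, not part of the API):

```python
# Rough sketch: estimate whether a prompt plus the requested completion
# fits within a model's context window. The 4-characters-per-token ratio
# is only a heuristic for English text; actual tokenization may differ.

def estimate_tokens(text: str) -> int:
    """Very rough estimate: ~4 English characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, max_tokens: int, context_length: int = 8192) -> bool:
    """Input tokens + requested output tokens must not exceed the context."""
    return estimate_tokens(prompt) + max_tokens <= context_length

# A short prompt easily fits in an 8k context; a very long one does not.
print(fits_in_context("Summarize this paragraph.", max_tokens=500))
```

For precise counts, rely on the token usage reported by the API response rather than this heuristic.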

Rate Limits

How do these rate limits work?

Rate limits are measured in four ways: concurrency, RPM (requests per minute), TPM (Tokens per minute), and TPD (Tokens per day). A limit can be hit in any of these categories, whichever threshold is reached first. For example, if your RPM limit is 20 and your TPM limit is 200k, sending 20 ChatCompletions requests of only 100 Tokens each would exhaust your RPM limit, even though those 20 requests total far fewer than 200k Tokens.

For convenience, the gateway calculates rate limits based on the max_tokens parameter in the request: if your request includes max_tokens, we use that value; otherwise, we use the default max_tokens value. After you make a request, we determine whether you have reached the rate limit based on the number of Tokens in your request plus the max_tokens value, regardless of how many Tokens are actually generated.

In the billing process, we calculate the cost based on the number of Tokens in your request plus the actual number of Tokens generated.

Other Important Notes:

  • Rate limits are enforced at the user level, not the key level.
  • Currently, we share rate limits across all models.

Model List

You can use our List Models API to get a list of currently available models.
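A minimal sketch of calling the List Models API with only the standard library, assuming the usual OpenAI-compatible endpoint layout (the base URL and the MOONSHOT_API_KEY environment variable name are assumptions; confirm them in the API documentation):

```python
# Hedged sketch: list available models over HTTP using only the standard
# library. The base URL follows the common OpenAI-compatible convention;
# confirm the exact endpoint in the API documentation.
import json
import os
import urllib.request

BASE_URL = "https://api.moonshot.ai/v1"  # assumed endpoint

def build_request(path: str, api_key: str) -> urllib.request.Request:
    """Build an authenticated GET request for the given API path."""
    return urllib.request.Request(
        f"{BASE_URL}{path}",
        headers={"Authorization": f"Bearer {api_key}"},
    )

def list_model_ids(api_key: str) -> list[str]:
    """Fetch /models and return the available model IDs."""
    with urllib.request.urlopen(build_request("/models", api_key)) as resp:
        payload = json.load(resp)
    return [model["id"] for model in payload["data"]]

if __name__ == "__main__":
    for model_id in list_model_ids(os.environ["MOONSHOT_API_KEY"]):
        print(model_id)
```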

Multi-modal Model kimi-k2.5

| Model Name | Description |
| --- | --- |
| kimi-k2.5 | Kimi's most intelligent model to date, achieving open-source SoTA performance in Agent, code, visual understanding, and a range of general intelligence tasks. It is also Kimi's most versatile model to date, featuring a native multimodal architecture that supports both visual and text input, thinking and non-thinking modes, and both dialogue and Agent tasks. Context length 256k |

kimi-k2 Model

| Model Name | Description |
| --- | --- |
| kimi-k2-0905-preview | Context length 256k; enhanced Agentic Coding capabilities, front-end code aesthetics and practicality, and context understanding, building on the 0711 version |
| kimi-k2-0711-preview | Context length 128k; MoE base model with 1T total parameters and 32B activated parameters. Features powerful code and Agent capabilities. See the technical blog |
| kimi-k2-turbo-preview | High-speed version of K2, tracking the latest version (0905). Output speed increased to 60-100 tokens per second; context length 256k |
| kimi-k2-thinking | K2 long-horizon thinking model; supports 256k context, multi-step tool use, and reasoning; excels at solving more complex problems |
| kimi-k2-thinking-turbo | High-speed version of the K2 thinking model; supports 256k context; excels at deep reasoning; output speed increased to 60-100 tokens per second |

Generation Model moonshot-v1

| Model Name | Description |
| --- | --- |
| moonshot-v1-8k | Suitable for generating short texts; context length 8k |
| moonshot-v1-32k | Suitable for generating long texts; context length 32k |
| moonshot-v1-128k | Suitable for generating very long texts; context length 128k |
| moonshot-v1-8k-vision-preview | Vision model; understands image content and outputs text; context length 8k |
| moonshot-v1-32k-vision-preview | Vision model; understands image content and outputs text; context length 32k |
| moonshot-v1-128k-vision-preview | Vision model; understands image content and outputs text; context length 128k |

Note: The only difference between these moonshot-v1 models is their maximum context length (covering both input and output); their capabilities are otherwise identical.

Deprecated Models

kimi-latest was officially discontinued on January 28, 2026 and is no longer maintained or supported. Please use the latest Kimi model kimi-k2.5 for continued support and enhanced reasoning capabilities.

kimi-thinking-preview was officially discontinued on November 11, 2025 and is no longer maintained or supported. We recommend upgrading to the latest model kimi-k2.5 for continued support and enhanced reasoning capabilities.

For further assistance, please contact sales.

Usage Guide

Getting an API Key

You need an API key to use our service. You can create an API key in our Console.

Sending Requests

You can use our Chat Completions API to send requests. You need to provide an API key and a model name. You can either rely on the default max_tokens parameter or set it explicitly. Refer to the API documentation for details on how to call the interface.

Handling Responses

Generally, we set a 2-hour timeout. If a single request exceeds this time, we return a 504 error. If your request exceeds the rate limit, we return a 429 error. If your request is successful, we return a response in JSON format.
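One way to handle these status codes is to retry 429s with exponential backoff while letting 504s and other errors surface to the caller; a sketch, where `call_api` stands in for any request helper:

```python
# Sketch of handling the error codes above: retry transient 429 (rate
# limit) errors with exponential backoff; treat 504 (timeout) and other
# errors as hard failures that propagate to the caller.
import time
import urllib.error

def call_with_retries(call_api, max_retries: int = 3, base_delay: float = 1.0):
    """Invoke call_api, retrying only on HTTP 429 up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return call_api()
        except urllib.error.HTTPError as err:
            if err.code == 429 and attempt < max_retries:
                time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
                continue
            raise  # 504s, other errors, and exhausted retries surface
```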

If you need to process tasks quickly, you can use the non-streaming mode of our Chat Completions API, in which we return all the generated text in a single response. If you need more control, you can use the streaming mode, in which we return an SSE stream from which you can read the generated text as it is produced. This provides a better user experience, and you can also interrupt the request at any time without wasting resources.
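In streaming mode, each SSE event carries a JSON chunk prefixed with `data: `, ending with a `data: [DONE]` sentinel (the common OpenAI-compatible convention; confirm the exact chunk shape in the API documentation). A sketch of extracting the generated text from the raw event lines:

```python
# Sketch of consuming an SSE stream: each "data: " line carries a JSON
# chunk whose choices[0].delta may hold a text fragment; "data: [DONE]"
# marks the end of the stream. Chunk shape assumed per the usual
# OpenAI-compatible convention.
import json
from typing import Iterable, Iterator

def iter_stream_text(lines: Iterable[str]) -> Iterator[str]:
    """Yield generated text fragments from raw SSE lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data: "):
            continue  # skip blank keep-alive lines and other fields
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        delta = chunk["choices"][0]["delta"]
        if "content" in delta:
            yield delta["content"]

# Example with hand-written events:
events = [
    'data: {"choices": [{"delta": {"content": "Hel"}}]}',
    'data: {"choices": [{"delta": {"content": "lo"}}]}',
    "data: [DONE]",
]
print("".join(iter_stream_text(events)))  # → Hello
```

Because the generator yields fragments as they arrive, a client can render partial output immediately and simply stop iterating to interrupt the request.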