
Use the Kimi API for Multi-turn Chat

The Kimi API is different from the Kimi intelligent assistant. The API itself doesn't have a memory function; it's stateless. This means that when you make multiple requests to the API, the Kimi large language model doesn't remember what you asked in the previous request. For example, if you tell the Kimi large language model that you are 27 years old in one request, it won't remember that you are 27 years old in the next request.

So, we need to manually keep track of the context for each request. In other words, we have to manually add the content of the previous request to the next one so that the Kimi large language model can see what we have talked about before. We will modify the example used in the previous chapter to show how to maintain a list of messages to give the Kimi large language model a memory and enable multi-turn conversation functionality.

Note: We have added the key points for implementing multi-turn conversations as comments in the code.

from openai import OpenAI
 
client = OpenAI(
    api_key = "MOONSHOT_API_KEY", # Replace MOONSHOT_API_KEY with the API Key you obtained from the Kimi Open Platform
    base_url = "https://api.moonshot.ai/v1",
)
 
# We define a global variable messages to keep track of the historical conversation messages between us and the Kimi large language model
# The messages include both the questions we ask the Kimi large language model (role=user) and the replies it gives us (role=assistant)
# Of course, it also includes the initial System Prompt (role=system)
# The messages in the list are arranged in chronological order
messages = [
    {"role": "system", "content": "You are Kimi, an artificial intelligence assistant provided by Moonshot AI. You are better at conversing in Chinese and English. You provide users with safe, helpful, and accurate answers. At the same time, you refuse to answer any questions involving terrorism, racism, pornography, or violence. Moonshot AI is a proper noun and should not be translated into other languages."},
]
 
def chat(input: str) -> str:
    """
    The chat function supports multi-turn conversations. Each time the chat function is called to converse with the Kimi large language model, the model will 'see' the historical conversation messages that have already been generated. In other words, the Kimi large language model has a memory.
    """
 
    global messages
 
    # We construct the user's latest question as a message (role=user) and add it to the end of the messages list
    messages.append({
        "role": "user",
        "content": input,
    })
 
    # We converse with the Kimi large language model, carrying the messages along
    completion = client.chat.completions.create(
        model="kimi-k2.5",
        messages=messages
    )
 
    # Through the API, we receive the reply message (role=assistant) from the Kimi large language model
    assistant_message = completion.choices[0].message
 
    # To give the Kimi large language model a complete memory, we must also add the message it returns to us to the messages list
    messages.append(assistant_message)
 
    return assistant_message.content
 
print(chat("Hello, I am 27 years old this year."))
print(chat("Do you know how old I am this year?")) # Here, based on the previous context, the Kimi large language model will know that you are 27 years old

Let's review the key points in the code above:

  • The Kimi API itself doesn't have a context memory function. We need to manually inform the Kimi large language model of what we have talked about before through the messages parameter in the API;
  • In the messages, we need to store both the question messages we ask the Kimi large language model (role=user) and the reply messages it gives us (role=assistant);
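To make the bookkeeping concrete, here is a minimal offline sketch (no API call) of how the messages list grows over two turns. The assistant replies are hard-coded stand-ins for illustration; a real run would take them from `completion.choices[0].message` instead:

```python
# Offline illustration of how the messages list grows turn by turn.
messages = [
    {"role": "system", "content": "You are Kimi, an artificial intelligence assistant provided by Moonshot AI."},
]

def record_turn(question: str, reply: str) -> None:
    """Append one user/assistant exchange, oldest first."""
    messages.append({"role": "user", "content": question})
    messages.append({"role": "assistant", "content": reply})

record_turn("Hello, I am 27 years old this year.", "Nice to meet you!")
record_turn("Do you know how old I am this year?", "You told me you are 27.")

# After two turns: 1 system message + 2 user + 2 assistant = 5 messages,
# arranged in chronological order.
assert len(messages) == 5
assert [m["role"] for m in messages] == ["system", "user", "assistant", "user", "assistant"]
```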

It's important to note that in the code above, as the number of chat calls increases, the length of the messages list also keeps growing. This means that the number of Tokens consumed by each request is also increasing. Eventually, the Tokens occupied by the messages in the messages list will exceed the context window size supported by the Kimi large language model. We recommend that you use some strategy to keep the number of messages in the messages list within a manageable range. For example, you could only keep the latest 20 messages as the context for each request.

We provide an example below to help you understand how to control the context length. Pay attention to how the make_messages function works:

from openai import OpenAI 
 
client = OpenAI(
    api_key = "MOONSHOT_API_KEY", # Replace MOONSHOT_API_KEY with the API Key you obtained from the Kimi Open Platform
    base_url = "https://api.moonshot.ai/v1",
)
 
# We place the System Messages in a separate list because every request should carry the System Messages.
system_messages = [
    {"role": "system", "content": "You are Kimi, an AI assistant provided by Moonshot AI. You are more proficient in conversing in Chinese and English. You provide users with safe, helpful, and accurate responses. You also reject any questions involving terrorism, racism, pornography, or violence. Moonshot AI is a proper noun and should not be translated into other languages."},
]
 
# We define a global variable messages to record the historical conversation messages between us and the Kimi large language model.
# The messages include both the questions we pose to the Kimi large language model (role=user) and the replies from the Kimi large language model (role=assistant).
# The messages are arranged in chronological order.
messages = []
 
 
def make_messages(input: str, n: int = 20) -> list[dict]:
    """
    The make_messages function controls the number of messages in each request to keep it within a reasonable range, such as the default value of 20. When building the message list, we first add the System Prompt because it is essential no matter how the messages are truncated. Then, we obtain the latest n messages from the historical records as the messages for the request. In most scenarios, this ensures that the number of Tokens occupied by the request messages does not exceed the model's context window.
    """
    global messages
 
    # First, we construct the user's latest question into a message (role=user) and add it to the end of the messages list.
    messages.append({
        "role": "user",
        "content": input,
    })
 
    # new_messages is the list of messages we will use for the next request. Let's build it now.
    new_messages = []
 
    # Every request must carry the System Messages, so we need to add the system_messages to the message list first.
    # Note that even if the messages are truncated, the System Messages should still be in the messages list.
    new_messages.extend(system_messages)
 
    # Here, when the historical messages exceed n, we only keep the latest n messages.
    if len(messages) > n:
        messages = messages[-n:]
 
    new_messages.extend(messages)
    return new_messages
 
 
def chat(input: str) -> str:
    """
    The chat function supports multi-turn conversations. Each time the chat function is called to converse with the Kimi large language model, the model can "see" the historical conversation messages that have already been generated. In other words, the Kimi large language model has memory.
    """
 
    # We converse with the Kimi large language model carrying the messages.
    completion = client.chat.completions.create(
        model="kimi-k2.5",
        messages=make_messages(input)
    )
 
    # Through the API, we obtain the reply message from the Kimi large language model (role=assistant).
    assistant_message = completion.choices[0].message
 
    # To ensure the Kimi large language model has a complete memory, we must add the message returned by the model to the messages list.
    messages.append(assistant_message)
 
    return assistant_message.content
 
print(chat("Hello, I am 27 years old this year."))
print(chat("Do you know how old I am this year?")) # Here, based on the previous context, the Kimi large language model will know that you are 27 years old this year.
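The truncation performed by make_messages can be exercised without calling the API. The sketch below extracts the same keep-the-latest-n logic into a pure function; the name `trim_history` is invented for this illustration and is not part of the Kimi API:

```python
def trim_history(system_messages: list[dict], history: list[dict], n: int = 20) -> list[dict]:
    """Return the request messages: all system messages first, then only the latest n history messages."""
    return system_messages + history[-n:]

system_messages = [{"role": "system", "content": "You are Kimi."}]

# Simulate 30 alternating user/assistant messages in chronological order.
history = [
    {"role": "user" if i % 2 == 0 else "assistant", "content": f"message {i}"}
    for i in range(30)
]

request_messages = trim_history(system_messages, history, n=20)
assert len(request_messages) == 21                     # 1 system message + latest 20
assert request_messages[0]["role"] == "system"         # System Prompt is always kept
assert request_messages[1]["content"] == "message 10"  # the oldest 10 messages were dropped
```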

Please note that the above code examples only consider the simplest invocation scenarios. In actual business code logic, you may need to consider more scenarios and boundaries, such as:

  • In concurrent scenarios, additional read-write locks may be needed;
  • For multi-user scenarios, a separate messages list should be maintained for each user;
  • You may need to persist the messages list;
  • You may still need a more precise way to determine how many messages to retain in the messages list;
  • You may want to summarize the discarded messages and generate a new message to add to the messages list;
  • ……
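As one illustration of the first two points above, the sketch below keeps a separate messages list per user and guards access with a lock. The `ConversationStore` class and its methods are invented for this example and are not part of any SDK; a real service might instead use one lock per user, or an external store such as a database:

```python
import threading
from collections import defaultdict

class ConversationStore:
    """Per-user message histories, safe for concurrent access."""

    def __init__(self, max_messages: int = 20):
        self._lock = threading.Lock()
        self._histories: dict[str, list[dict]] = defaultdict(list)
        self._max_messages = max_messages

    def append(self, user_id: str, message: dict) -> None:
        with self._lock:
            history = self._histories[user_id]
            history.append(message)
            # Keep only the latest max_messages entries for this user.
            if len(history) > self._max_messages:
                del history[: len(history) - self._max_messages]

    def get(self, user_id: str) -> list[dict]:
        with self._lock:
            # Return a copy so callers cannot mutate the stored history.
            return list(self._histories[user_id])

store = ConversationStore(max_messages=3)
for i in range(5):
    store.append("alice", {"role": "user", "content": f"message {i}"})
store.append("bob", {"role": "user", "content": "hello"})

assert len(store.get("alice")) == 3                    # oldest messages were dropped
assert store.get("alice")[0]["content"] == "message 2"
assert len(store.get("bob")) == 1                      # histories are kept per user
```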