Engine Reference

Model Name             Extra           Capabilities      Engine
---------------------  --------------  ----------------  ----------------------------------------------------------------
All OpenAI Models      openai          🛠️ 🖼             kani.engines.openai.OpenAIEngine
All Anthropic Models   anthropic       🛠️ 🖼             kani.engines.anthropic.AnthropicEngine
All Google AI Models   google          🛠️ 🖼             kani.engines.google.GoogleAIEngine
🤗 transformers[3]     huggingface[1]  (model-specific)  kani.engines.huggingface.HuggingEngine
llama.cpp[2]           cpp             (model-specific)  kani.engines.llamacpp.LlamaCppEngine
vLLM[2]                vllm            (model-specific)  kani.ext.vllm.VLLMEngine, VLLMServerEngine, or VLLMOpenAIEngine

Additional models using the classes above are also supported - see the model zoo for a more comprehensive list of models!

Legend

  • 🛠️: Supports function calling.

  • 🖼: Supports multimodal inputs.

OpenAI

class kani.engines.openai.OpenAIEngine(
api_key: str = None,
model='gpt-4.1-nano',
max_context_size: int = None,
*,
api_type: Literal['chat_completions', 'responses'] = None,
organization: str = None,
retry: int = 5,
api_base: str = 'https://api.openai.com/v1',
headers: dict = None,
client: AsyncOpenAI = None,
tokenizer=None,
**hyperparams,
)[source]

Engine for using the OpenAI API.

This engine supports all chat-based models and fine-tunes.

Multimodal support: images, audio.

Message Extras

  • "openai_completion": The ChatCompletion (raw response) returned by the OpenAI servers, as a dictionary. Non-streaming responses only.

  • "openai_usage": The usage data (raw response) returned by the OpenAI servers, as a dictionary.

Parameters:
  • api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.

  • model – The id of the model to use (e.g. “gpt-4o-mini”, “ft:gpt-3.5-turbo:my-org:custom_suffix:id”).

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • api_type – Whether to use the Chat Completions API (default for most models) or Responses API (default for “deep-reasoning” style models). If unset, the best API type for the given model will be chosen.

  • organization – The OpenAI organization to use in requests. By default, the org ID will be read from the OPENAI_ORG_ID environment variable (defaults to the API key’s default org if not set).

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).

  • api_base – The base URL of the OpenAI API to use.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the organization, retry, api_base, and headers params will be ignored.

  • tokenizer – The tokenizer to use for token estimation; for OpenAI models this will be loaded automatically. If given, it must provide a .encode(text: str) method that returns a list (usually of token ids).

  • hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
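As a rough illustration of the parameters above, the sketch below builds two engines that reuse one openai.AsyncOpenAI client. The helper name build_engines, the context size, and the temperature hyperparameter are assumptions chosen for the example. Imports are deferred into the function so the snippet loads without kani or openai installed, and calling it requires both packages and a valid key.

```python
import os


def build_engines():
    """Build two OpenAI engines that share a single AsyncOpenAI client.

    Hypothetical helper for illustration; imports are deferred so this
    module can be loaded even when kani/openai are not installed.
    """
    from openai import AsyncOpenAI
    from kani.engines.openai import OpenAIEngine

    # One client, configured once. When a client is passed, the engine's own
    # organization/retry/api_base/headers parameters are ignored.
    client = AsyncOpenAI(api_key=os.environ["OPENAI_API_KEY"], max_retries=3)

    # Pass client OR api_key, never both.
    chat = OpenAIEngine(client=client, model="gpt-4o-mini", max_context_size=8192)

    # Extra keyword arguments are hyperparams sent along with each request.
    precise = OpenAIEngine(client=client, model="gpt-4.1-nano", temperature=0.2)
    return chat, precise
```

Sharing the client keeps retry and header configuration in one place, at the cost of having to close it yourself if the engines do not own it.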

disable_function_calling_kwargs = {'tool_choice': 'none'}

Kwargs to set in the Kani._full_round loop when the model should disable function calling. Mostly this is useful for API models, where we want to still define functions in the prompt but disallow calling them.
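To make the role of these kwargs concrete, here is a minimal sketch of how a chat loop might layer them over the caller's hyperparameters when a round must not call functions. The function final_round_kwargs is hypothetical and is not taken from Kani._full_round; only the dictionary value comes from the attribute above.

```python
# Value of OpenAIEngine.disable_function_calling_kwargs as documented above.
OPENAI_DISABLE = {"tool_choice": "none"}


def final_round_kwargs(user_kwargs: dict, disable_kwargs: dict, allow_functions: bool) -> dict:
    """Return the kwargs for one prediction call.

    Hypothetical helper: when function calling is disallowed, the engine's
    disable kwargs are merged on top of the caller's kwargs.
    """
    if allow_functions:
        return dict(user_kwargs)
    return {**user_kwargs, **disable_kwargs}


normal = final_round_kwargs({"temperature": 0.2}, OPENAI_DISABLE, allow_functions=True)
last = final_round_kwargs({"temperature": 0.2}, OPENAI_DISABLE, allow_functions=False)
print(normal)  # {'temperature': 0.2}
print(last)    # {'temperature': 0.2, 'tool_choice': 'none'}
```

The functions stay defined in the prompt either way; only the extra kwargs tell the API not to call them.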

message_len(message: ChatMessage) int[source]

Returns the estimated number of tokens used by a single given message.

Note

The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

function_token_reserve(functions: list[AIFunction]) int[source]

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

translate_functions(functions: list[AIFunction]) list[dict][source]

Translate a list of Kani AIFunctions to a list of OpenAI tool definitions.

translate_messages(
messages: list[ChatMessage],
) list[ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam] | List[EasyInputMessageParam | Message | ResponseOutputMessageParam | ResponseFileSearchToolCallParam | ResponseComputerToolCallParam | ComputerCallOutput | ResponseFunctionWebSearchParam | ResponseFunctionToolCallParam | FunctionCallOutput | ToolSearchCall | ResponseToolSearchOutputItemParamParam | AdditionalTools | ResponseReasoningItemParam | ResponseCompactionItemParamParam | ImageGenerationCall | ResponseCodeInterpreterToolCallParam | LocalShellCall | LocalShellCallOutput | ShellCall | ShellCallOutput | ApplyPatchCall | ApplyPatchCallOutput | McpListTools | McpApprovalRequest | McpApprovalResponse | McpCall | ResponseCustomToolCallOutputParam | ResponseCustomToolCallParam | CompactionTrigger | ItemReference][source]

Translate a list of Kani ChatMessages to a list of OpenAI messages.

static translate_kani_message_to_openai(
message: ChatMessage,
) ChatCompletionDeveloperMessageParam | ChatCompletionSystemMessageParam | ChatCompletionUserMessageParam | ChatCompletionAssistantMessageParam | ChatCompletionToolMessageParam | ChatCompletionFunctionMessageParam[source]

Translate a single Kani ChatMessage to a single OpenAI message.

static translate_kani_message_to_openai_responses(
message: ChatMessage,
) list[EasyInputMessageParam | Message | ResponseOutputMessageParam | ResponseFileSearchToolCallParam | ResponseComputerToolCallParam | ComputerCallOutput | ResponseFunctionWebSearchParam | ResponseFunctionToolCallParam | FunctionCallOutput | ToolSearchCall | ResponseToolSearchOutputItemParamParam | AdditionalTools | ResponseReasoningItemParam | ResponseCompactionItemParamParam | ImageGenerationCall | ResponseCodeInterpreterToolCallParam | LocalShellCall | LocalShellCallOutput | ShellCall | ShellCallOutput | ApplyPatchCall | ApplyPatchCallOutput | McpListTools | McpApprovalRequest | McpApprovalResponse | McpCall | ResponseCustomToolCallOutputParam | ResponseCustomToolCallParam | CompactionTrigger | ItemReference][source]

Translate a single Kani ChatMessage to its corresponding OpenAI responses input items.

async prompt_len(messages, functions=None, **kwargs) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
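Because prompt_len may or may not be a coroutine, a caller that talks to the engine directly has to handle both cases. The sketch below shows one way to do that with the standard library. The two engine classes are stand-ins that count whitespace-separated words, not kani classes, and count_tokens is a hypothetical helper rather than the implementation of Kani.prompt_token_len().

```python
import asyncio
import inspect


async def count_tokens(engine, messages, functions=None, **kwargs) -> int:
    """Call engine.prompt_len and await the result only if it is awaitable."""
    result = engine.prompt_len(messages, functions, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result


class SyncEngine:  # hypothetical stand-in, not a kani class
    def prompt_len(self, messages, functions=None, **kwargs):
        return sum(len(m.split()) for m in messages)


class AsyncEngine:  # hypothetical stand-in, not a kani class
    async def prompt_len(self, messages, functions=None, **kwargs):
        return sum(len(m.split()) for m in messages)


msgs = ["hello there", "general kenobi"]
sync_count = asyncio.run(count_tokens(SyncEngine(), msgs))
async_count = asyncio.run(count_tokens(AsyncEngine(), msgs))
```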

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) ChatCompletion[source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) AsyncIterable[str | BaseCompletion][source]

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.
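The streaming contract described above (text chunks first, then at most one completion object as the final item) can be sketched with a toy async generator and a consumer that enforces the ordering. FakeCompletion and fake_stream are stand-ins invented for this example; a real engine yields a BaseCompletion.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeCompletion:  # hypothetical stand-in for a BaseCompletion
    text: str
    completion_tokens: int


async def fake_stream():
    """Yield text chunks, then one completion object as the final item."""
    chunks = ["Hel", "lo", "!"]
    for chunk in chunks:
        yield chunk
    yield FakeCompletion(text="".join(chunks), completion_tokens=len(chunks))


async def consume(stream):
    """Collect streamed text and the optional trailing completion."""
    parts, completion = [], None
    async for item in stream:
        if isinstance(item, str):
            if completion is not None:
                # The completion, if any, must be the last item yielded.
                raise RuntimeError("text yielded after the completion")
            parts.append(item)
        else:
            completion = item
    return "".join(parts), completion


text, completion = asyncio.run(consume(fake_stream()))
```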

async close()[source]

Optional: Clean up any resources the engine might need.

class kani.engines.openai.translation.ChatCompletion(openai_completion: ChatCompletion)[source]

A wrapper around the OpenAI ChatCompletion to make it compatible with the Kani interface.

Anthropic

class kani.engines.anthropic.AnthropicEngine(
api_key: str = None,
model: str = 'claude-sonnet-4-0',
max_tokens: int = 2048,
max_context_size: int = None,
*,
retry: int = 2,
api_base: str = None,
headers: dict = None,
client: AsyncAnthropic = None,
**hyperparams,
)[source]

Engine for using the Anthropic API.

This engine supports all Claude models. See https://docs.anthropic.com/claude/docs/getting-access-to-claude for information on accessing the Claude API.

See https://docs.anthropic.com/en/docs/about-claude/models/overview for a list of available models.

Multimodal support: images.

Additional capabilities: PDF document processing. Use kani.ext.multimodal_core.BinaryFilePart.

Message Extras: "anthropic_message": The Message (raw response) returned by the Anthropic servers.

Parameters:
  • api_key – Your Anthropic API key. By default, the API key will be read from the ANTHROPIC_API_KEY environment variable.

  • model – The id of the model to use (e.g. “claude-opus-4-0”). See https://docs.anthropic.com/en/docs/about-claude/models/overview for a list of models.

  • max_tokens – The maximum number of tokens to sample at each generation (defaults to 2048). Generally, you should set this to the same number as your Kani’s desired_response_tokens.

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).

  • api_base – The base URL of the Anthropic API to use.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of anthropic.AsyncAnthropic (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the retry, api_base, and headers params will be ignored.

  • hyperparams – Any additional parameters to pass to the underlying API call (see https://docs.claude.com/en/api/messages).
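The advice on max_tokens above can be followed by passing one value to both the engine and the Kani that wraps it. The sketch below assumes a helper named build_claude_kani and a response budget of 1024 tokens, both chosen for illustration. Imports are deferred so the snippet loads without kani installed, and calling it needs ANTHROPIC_API_KEY to be set.

```python
def build_claude_kani(response_tokens: int = 1024):
    """Create a Kani whose response budget matches the engine's max_tokens.

    Hypothetical helper for illustration; requires the kani and anthropic
    packages and an ANTHROPIC_API_KEY environment variable when called.
    """
    from kani import Kani
    from kani.engines.anthropic import AnthropicEngine

    # max_tokens caps what the API samples per generation.
    engine = AnthropicEngine(model="claude-sonnet-4-0", max_tokens=response_tokens)

    # desired_response_tokens tells Kani how much room to leave in the context.
    return Kani(engine, desired_response_tokens=response_tokens)
```

Keeping the two numbers in one variable avoids the case where Kani reserves less room than the engine is allowed to generate.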

disable_function_calling_kwargs = {'tool_choice': {'type': 'none'}}

Kwargs to set in the Kani._full_round loop when the model should disable function calling. Mostly this is useful for API models, where we want to still define functions in the prompt but disallow calling them.

async prompt_len(messages, functions=None, **kwargs) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) Completion[source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) AsyncIterable[str | BaseCompletion][source]

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async close()[source]

Optional: Clean up any resources the engine might need.

token_reserve: int = 500

Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). The base default is 0; this engine sets it to 500.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

message_len(message: ChatMessage) int[source]

Returns the estimated number of tokens used by a single given message.

Note

The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

function_token_reserve(functions: list[AIFunction]) int[source]

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

class kani.engines.anthropic.AnthropicUnknownPart(*, extra: dict = {})[source]

A generic unknown response part from the server.

This generally corresponds to an Anthropic-specific feature. The raw response data is accessible in data, and will be sent back to the language model in future rounds correctly. Will not be sent to other engines.

data: dict

The raw content of the part returned by the Anthropic API.

Google AI

class kani.engines.google.GoogleAIEngine(
api_key: str = None,
model: str = 'gemini-2.5-flash',
max_context_size: int = None,
*,
retry: int = 2,
api_base: str = None,
headers: dict = None,
client: Client = None,
multimodal_upload_bytes_threshold: int = 512000,
**hyperparams,
)[source]

Engine for using the Google AI Studio API (aka Gemini Developer API, Google AI API) and Google Vertex AI API (aka Google Cloud API).

This engine supports all Google AI models.

See https://ai.google.dev/gemini-api/docs/models for a list of available models.

Multimodal support: images, audio, video.

Message Extras: "google_response": The raw response returned by the Google AI API.

Parameters:
  • api_key – Your Gemini Developer API key. By default, the API key will be read from the GEMINI_API_KEY environment variable.

  • model – The id of the model to use (e.g. “gemini-2.5-flash”). See https://ai.google.dev/gemini-api/docs/models for a list of models.

  • max_tokens – The maximum number of tokens to sample at each generation (defaults to 512). Generally, you should set this to the same number as your Kani’s desired_response_tokens.

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).

  • api_base – The base URL of the Google AI API to use. If not specified, the default URL for the specified API (AI Studio/Vertex) will be used.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of genai.Client (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client).

  • multimodal_upload_bytes_threshold – If a multimodal object (audio, image, video) is larger than this number of bytes, upload it as a file instead of passing it inline in a request. Default 512kB.

  • hyperparams – Any additional parameters to pass to the underlying API call (see https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateContentConfig).
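The multimodal_upload_bytes_threshold rule above amounts to a size check. The sketch below mirrors the stated rule only; delivery_mode is a hypothetical helper and not the engine's code. Treating a payload of exactly the threshold size as inline is an assumption read from the phrase "larger than".

```python
UPLOAD_THRESHOLD = 512_000  # default value from the constructor signature above


def delivery_mode(payload: bytes, threshold: int = UPLOAD_THRESHOLD) -> str:
    """Return "upload" for payloads strictly larger than the threshold, else "inline"."""
    return "upload" if len(payload) > threshold else "inline"


small = delivery_mode(b"\x00" * 1_000)
edge = delivery_mode(b"\x00" * 512_000)
big = delivery_mode(b"\x00" * 512_001)
print(small, edge, big)  # inline inline upload
```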

async prompt_len(messages, functions=None, **kwargs) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) Completion[source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) AsyncIterable[str | BaseCompletion][source]

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

token_reserve: int = 500

Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). The base default is 0; this engine sets it to 500.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

message_len(message: ChatMessage) int[source]

Returns the estimated number of tokens used by a single given message.

Note

The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

function_token_reserve(functions: list[AIFunction]) int[source]

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

Hugging Face

class kani.engines.huggingface.HuggingEngine(
model_id: str,
max_context_size: int = None,
prompt_pipeline: ~kani.prompts.pipeline.PromptPipeline[str | ~torch.Tensor] = None,
*,
token=None,
device: str | None = None,
tokenizer_cls=None,
tokenizer_kwargs: dict = None,
model_cls=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,
model_load_kwargs: dict = None,
chat_template_reasoning_content_key: str = None,
chat_template_kwargs: dict = None,
mm_audio_sample_rate: int = None,
mm_video_fps: float = 1,
token_reserve: int = 0,
**hyperparams,
)[source]

Base engine for all HuggingFace text-generation models.

This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.

Added in version 1.2.0: By default, the HuggingEngine uses models’ bundled chat template to build the prompt for chat-based models available on Hugging Face. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu" or device="cuda" to the constructor.

Multimodal support: audio, images, video (depending on model).

Tip

See Quantization With BitsAndBytes for information about loading a quantized model for lower memory usage.

Parameters:
  • model_id – The ID of the model to load from HuggingFace.

  • max_context_size – The context size of the model. If not given, will be set from the model’s config.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline). If not passed, uses the Hugging Face chat template if available.

  • token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.

  • device – The hardware device to use. If not specified, uses CUDA or MPS if available; otherwise uses CPU.

  • tokenizer_cls – Advanced use cases: The HF tokenizer class to use. Defaults to AutoProcessor (if no processing config is available or this raises an error, this will fall back to AutoTokenizer).

  • tokenizer_kwargs – Additional arguments to pass to AutoProcessor.from_pretrained().

  • model_cls – Advanced use cases: The HF model class to use. Defaults to AutoModelForCausalLM.

  • model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().

  • chat_template_reasoning_content_key – The key of each message dict that any reasoning content should be set at.

  • chat_template_kwargs – The keyword arguments to pass to tokenizer.apply_chat_template if using a chat template prompt pipeline.

  • mm_audio_sample_rate – The sample rate to remux audio inputs to. Check your model’s documentation for the expected sample rate. By default, does not change the sample rate of the input file.

  • mm_video_fps – The number of image frames to sample per second of video input.

  • hyperparams – Additional arguments to supply the model during generation.

  • token_reserve – DEPRECATED: The number of tokens to reserve for internal engine mechanisms (e.g. if there is a generation template after the last user message). If not passed, kani will attempt to infer this from a prompt pipeline.
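As an illustration of how the parameters above fit together, the sketch below loads a model on CPU with explicit load and generation settings. The helper name build_hf_engine and the specific values are assumptions for the example. trust_remote_code is a from_pretrained() option, and max_new_tokens and do_sample are generate() options from transformers. Imports are deferred so the snippet loads without kani or transformers installed.

```python
def build_hf_engine(model_id: str):
    """Load a Hugging Face chat model on CPU.

    Hypothetical helper for illustration; requires kani's huggingface extra
    when called, and downloads the model weights on first use.
    """
    from kani.engines.huggingface import HuggingEngine

    return HuggingEngine(
        model_id=model_id,
        device="cpu",                 # override the default CUDA/MPS detection
        max_context_size=4096,        # otherwise read from the model's config
        model_load_kwargs={"trust_remote_code": False},  # passed to from_pretrained()
        max_new_tokens=256,           # hyperparams: supplied to the model at generation
        do_sample=False,
    )
```

No prompt_pipeline is passed here, so the engine would fall back to the model's bundled chat template if one is available.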

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) str | Tensor | BatchEncoding | BatchFeature[source]

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.

The default behaviour is to call the supplied pipeline.

async prompt_len(messages, functions=None, **kwargs) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
*,
decode_kwargs: dict = None,
**hyperparams,
) Completion[source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • decode_kwargs – Any arguments to pass to AutoTokenizer.decode().

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
*,
streamer_timeout: float | None = None,
decode_kwargs: dict = None,
**hyperparams,
) AsyncIterable[str | BaseCompletion][source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • streamer_timeout – The maximum number of seconds to wait for the next token when streaming.

  • decode_kwargs – Any arguments to pass to AutoTokenizer.decode().

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

property token_reserve

message_len(message: ChatMessage) int[source]

Return the length, in tokens, of the given chat message.

The HuggingEngine’s default implementation renders the message with apply_chat_template if no prompt_pipeline is supplied.

function_token_reserve(functions: list[AIFunction]) int[source]

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

llama.cpp

class kani.engines.llamacpp.LlamaCppEngine(
repo_id: str | None = None,
filename: str | None = None,
model_path: str | None = None,
max_context_size: int = 0,
prompt_pipeline: PromptPipeline[str | list[int]] = None,
*,
model_load_kwargs: dict = None,
**hyperparams,
)[source]

This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).

GPU Support

llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.

To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.

Parameters:
  • repo_id – The ID of the model repo to load from Hugging Face. If this is set, filename must be set and model_path may not be set.

  • filename – A filename or glob pattern to match the model file in the Hugging Face repo. If this is set, repo_id must be set and model_path may not be set.

  • model_path – A path to the model files on local disk. If this is set, neither repo_id nor filename may be set.

  • max_context_size – The context size of the model.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See the llama-cpp-python documentation for more info.

  • hyperparams – Additional arguments to supply the model during generation.
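The rules for repo_id, filename, and model_path above can be written as a small check, followed by a constructor call that offloads all layers to GPU. Both functions are hypothetical sketches, not the engine's own validation. The repo id and filename pattern are placeholders, and rejecting a call with no source at all is an assumption. The engine import is deferred so the snippet loads without llama-cpp-python installed.

```python
def check_model_source(repo_id=None, filename=None, model_path=None) -> str:
    """Validate the source arguments per the rules above; return "hub" or "local"."""
    if model_path is not None:
        if repo_id is not None or filename is not None:
            raise ValueError("model_path excludes repo_id and filename")
        return "local"
    if repo_id is None or filename is None:
        raise ValueError("repo_id and filename must be set together")
    return "hub"


def build_llamacpp_engine():
    """Load a GGUF model from the Hugging Face Hub with every layer on GPU.

    Hypothetical helper; the repo id and filename below are placeholders.
    """
    from kani.engines.llamacpp import LlamaCppEngine

    source = {"repo_id": "your-org/your-model-GGUF", "filename": "*Q4_K_M.gguf"}
    check_model_source(**source)
    # n_gpu_layers=-1 requests that all layers be loaded on GPU.
    return LlamaCppEngine(**source, model_load_kwargs={"n_gpu_layers": -1})


hub = check_model_source(repo_id="org/model", filename="*.gguf")
local = check_model_source(model_path="/models/model.gguf")
```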

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) str | list[int][source]

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.

The default behaviour is to call the supplied pipeline.

async prompt_len(messages, functions=None, **kwargs) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) Completion[source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

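Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async stream(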
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) AsyncIterable[str | BaseCompletion][source]

Given the current context of messages and available functions, get the next predicted chat message from the LM.

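Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

async close()[source]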

Optional: Clean up any resources the engine might need.

property token_reserve

message_len(message: ChatMessage) int[source]

Returns the estimated number of tokens used by a single given message.

Note

The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

function_token_reserve(functions: list[AIFunction]) int[source]

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

vLLM

See the kani-ext-vllm documentation at https://github.com/zhudotexe/kani-ext-vllm.

Base

class kani.engines.BaseEngine[source]

Base class for all LM engines.

To add support for a new LM, make a subclass of this and implement the abstract methods below.

max_context_size: int

The maximum context size supported by this engine’s LM.

abstract prompt_len(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**kwargs,
) int[source]

Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.

This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.

Parameters:
  • messages – The messages in the prompt.

  • functions – The functions included in the prompt.

  • kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
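
Because prompt_len may be either a plain method or a coroutine, calling code has to handle both cases. The following is a minimal sketch of that pattern using only the standard library. SyncEngine, AsyncEngine, and resolve_prompt_len are hypothetical stand-ins for illustration, not part of kani, and messages are reduced to plain strings.

```python
import asyncio
import inspect


class SyncEngine:
    """Hypothetical engine whose prompt_len is a plain method."""

    def prompt_len(self, messages, functions=None, **kwargs):
        # Toy estimate: one token per whitespace-separated word.
        return sum(len(m.split()) for m in messages)


class AsyncEngine:
    """Hypothetical engine whose prompt_len is a coroutine (e.g. a remote call)."""

    async def prompt_len(self, messages, functions=None, **kwargs):
        await asyncio.sleep(0)  # stands in for network I/O
        return sum(len(m.split()) for m in messages)


async def resolve_prompt_len(engine, messages, functions=None, **kwargs):
    """Call engine.prompt_len and await the result only if it is awaitable."""
    result = engine.prompt_len(messages, functions, **kwargs)
    if inspect.isawaitable(result):
        result = await result
    return result


async def main():
    msgs = ["hello there", "how are you today"]
    return [await resolve_prompt_len(e, msgs) for e in (SyncEngine(), AsyncEngine())]


lengths = asyncio.run(main())
```

In application code, Kani.prompt_token_len() is the documented higher-level interface and should be preferred over a hand-written helper like this one.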

abstract async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) -> BaseCompletion

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.
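
The guarantee on messages implies that the chat history is trimmed before predict is called. The sketch below shows one simple way to satisfy such a guarantee by dropping the oldest messages first. It is an illustration under assumptions, not kani's actual trimming logic: fit_to_context and the word-counting prompt_len are hypothetical, and messages are plain strings.

```python
def prompt_len(messages, functions=None):
    """Toy token counter (hypothetical): one token per whitespace-separated word."""
    return sum(len(m.split()) for m in messages)


def fit_to_context(messages, max_context_size, functions=None):
    """Return the longest suffix of messages whose prompt length is below the limit."""
    kept = list(messages)
    # Drop the oldest message until the prompt is strictly less than the limit.
    while kept and prompt_len(kept, functions) >= max_context_size:
        kept.pop(0)
    return kept


history = ["first message here", "second one", "third and final message"]
trimmed = fit_to_context(history, max_context_size=7)
```

Here the full history counts 9 toy tokens, so the oldest message is dropped and the remaining two (6 tokens) are kept.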

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) -> AsyncIterable[str | BaseCompletion]

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.
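
The streaming contract described above (strings first, then at most one completion object as the final item) can be sketched as follows. FakeCompletion, fake_stream, and consume are hypothetical names used only for illustration; a real engine would yield a BaseCompletion rather than this stand-in.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    """Hypothetical stand-in for a BaseCompletion; holds only the final text."""

    text: str


async def fake_stream():
    """Yield text chunks, then a completion object as the last item."""
    chunks = ["Hel", "lo", "!"]
    for chunk in chunks:
        yield chunk
    yield FakeCompletion(text="".join(chunks))


async def consume(stream):
    """Collect streamed text and the optional trailing completion."""
    parts, completion = [], None
    async for item in stream:
        if isinstance(item, str):
            if completion is not None:
                # The completion must be the last item yielded.
                raise RuntimeError("text yielded after the completion")
            parts.append(item)
        else:
            completion = item
    return "".join(parts), completion


text, completion = asyncio.run(consume(fake_stream()))
```

A consumer written this way also works with a stream that never yields a completion, in which case the second return value stays None.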

async close()

Optional: Clean up any resources the engine might need.

disable_function_calling_kwargs = {'include_functions': False}

Keyword arguments to set in the Kani._full_round loop when function calling should be disabled. This is mostly useful for API models, where the functions should still be defined in the prompt but the model should not be allowed to call them.

message_len(message: ChatMessage) -> int

Returns the estimated number of tokens used by a single given message.

Note

The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

token_reserve: int = 0

Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). Default: 0

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

function_token_reserve(functions: list[AIFunction]) -> int

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.

class kani.engines.Completion(message: ChatMessage, prompt_tokens: int | None = None, completion_tokens: int | None = None)

property message

The message returned by the LM.

property prompt_tokens

How many tokens are in the prompt. May be None, in which case kani estimates the count using the tokenizer.

property completion_tokens

How many tokens are in the completion. May be None, in which case kani estimates the count using the tokenizer.

class kani.engines.WrapperEngine(engine: BaseEngine, *args, **kwargs)

A base class for engines that are meant to wrap other engines. By default, this class takes another engine as the first parameter of its constructor and passes through all non-overridden attributes to the wrapped engine.

Parameters:

engine – The engine to wrap.

engine

The wrapped engine.
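
The pass-through behaviour described for wrapper engines is commonly built on attribute delegation. The following is a minimal, self-contained sketch of that general technique. It does not reproduce kani's implementation: InnerEngine and LoggingWrapper are hypothetical classes, and predict is simplified to a synchronous method taking a string.

```python
class InnerEngine:
    """Hypothetical wrapped engine."""

    max_context_size = 4096

    def predict(self, prompt):
        return f"inner: {prompt}"


class LoggingWrapper:
    """Hypothetical wrapper that overrides predict and delegates everything else."""

    def __init__(self, engine):
        self.engine = engine
        self.calls = []

    def predict(self, prompt):
        # Overridden behaviour: record the call, then defer to the wrapped engine.
        self.calls.append(prompt)
        return self.engine.predict(prompt)

    def __getattr__(self, name):
        # Invoked only when normal lookup fails, so overridden attributes win.
        if name == "engine":
            # Avoid infinite recursion if the wrapper is not fully initialised.
            raise AttributeError(name)
        return getattr(self.engine, name)


wrapped = LoggingWrapper(InnerEngine())
reply = wrapped.predict("hi")
```

Reading wrapped.max_context_size falls through to the inner engine because the wrapper does not define that attribute itself.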

class kani.engines.base.BaseCompletion

Base class for all LM engine completions.

abstract property message: ChatMessage

The message returned by the LM.

abstract property prompt_tokens: int | None

How many tokens are in the prompt. May be None, in which case kani estimates the count using the tokenizer.

abstract property completion_tokens: int | None

How many tokens are in the completion. May be None, in which case kani estimates the count using the tokenizer.
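
As a rough illustration of the three abstract properties and the None fallback, the sketch below mirrors the interface with standard-library classes. CompletionLike, TextCompletion, completion_token_count, and word_count are hypothetical names, the message is a plain string, and the word-count estimator is only a placeholder for a real tokenizer.

```python
from abc import ABC, abstractmethod


class CompletionLike(ABC):
    """Hypothetical mirror of the three documented abstract properties."""

    @property
    @abstractmethod
    def message(self): ...

    @property
    @abstractmethod
    def prompt_tokens(self): ...

    @property
    @abstractmethod
    def completion_tokens(self): ...


class TextCompletion(CompletionLike):
    """Concrete completion whose token counts are optional."""

    def __init__(self, message, prompt_tokens=None, completion_tokens=None):
        self._message = message
        self._prompt_tokens = prompt_tokens
        self._completion_tokens = completion_tokens

    @property
    def message(self):
        return self._message

    @property
    def prompt_tokens(self):
        return self._prompt_tokens

    @property
    def completion_tokens(self):
        return self._completion_tokens


def word_count(text):
    """Placeholder estimator: one token per whitespace-separated word."""
    return len(text.split())


def completion_token_count(completion, estimate):
    """Use the reported count if present, else fall back to estimate(message)."""
    if completion.completion_tokens is not None:
        return completion.completion_tokens
    return estimate(completion.message)


reported = completion_token_count(TextCompletion("a b c", completion_tokens=10), word_count)
estimated = completion_token_count(TextCompletion("a b c"), word_count)
```

The explicit `is not None` check matters because a reported count of 0 is a valid value and should not trigger the fallback.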