Engine Reference¶
| Model Name | Extra | Capabilities | Engine |
|---|---|---|---|
| All OpenAI Models | | 🛠️ 🖼 | |
| All Anthropic Models | | 🛠️ 🖼 | |
| All Google AI Models | | 🛠️ 🖼 | |
| 🤗 transformers[3] | | (model-specific) | |
| llama.cpp[2] | | (model-specific) | |
| vLLM[2] | | (model-specific) | |
Additional models using the classes above are also supported - see the model zoo for a more comprehensive list of models!
Legend
🛠️: Supports function calling.
🖼: Supports multimodal inputs.
OpenAI¶
- class kani.engines.openai.OpenAIEngine(
- api_key: str = None,
- model='gpt-4.1-nano',
- max_context_size: int = None,
- *,
- api_type: Literal['chat_completions', 'responses'] = None,
- organization: str = None,
- retry: int = 5,
- api_base: str = 'https://api.openai.com/v1',
- headers: dict = None,
- client: AsyncOpenAI = None,
- tokenizer=None,
- **hyperparams,
Engine for using the OpenAI API.
This engine supports all chat-based models and fine-tunes.
Multimodal support: images, audio.
Message Extras
"openai_completion": The ChatCompletion (raw response) returned by the OpenAI servers, as a dictionary. Non-streaming responses only.
"openai_usage": The usage data (raw response) returned by the OpenAI servers, as a dictionary.
- Parameters:
api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.
model – The id of the model to use (e.g. “gpt-4o-mini”, “ft:gpt-3.5-turbo:my-org:custom_suffix:id”).
max_context_size – The maximum amount of tokens allowed in the chat prompt. If None, uses the given model’s full context size.
api_type – Whether to use the Chat Completions API (default for most models) or Responses API (default for “deep-reasoning” style models). If unset, the best API type for the given model will be chosen.
organization – The OpenAI organization to use in requests. By default, the org ID will be read from the OPENAI_ORG_ID environment variable (defaults to the API key’s default org if not set).
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).
api_base – The base URL of the OpenAI API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the organization, retry, api_base, and headers params will be ignored.
tokenizer – The tokenizer to use for token estimation - for OpenAI models this will be loaded automatically. A class with a .encode(text: str) method that returns a list (usually of token ids).
hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
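The tokenizer parameter above is described only by its shape: an object with an .encode(text: str) method that returns a list. As a minimal sketch of that shape, assuming a crude whitespace split is good enough for a rough estimate, one could write the following. WhitespaceTokenizer is a hypothetical name used for illustration and is not part of kani.

```python
class WhitespaceTokenizer:
    """Hypothetical stand-in tokenizer: one token per whitespace-separated word."""

    def encode(self, text: str) -> list[str]:
        # The length of the returned list serves as a rough token estimate.
        return text.split()


tokenizer = WhitespaceTokenizer()
estimate = len(tokenizer.encode("Hello there, kani!"))
```

An object like this would presumably be passed as tokenizer=WhitespaceTokenizer() when the model's own tokenizer cannot be loaded automatically; a real tokenizer will give much closer counts.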
- disable_function_calling_kwargs = {'tool_choice': 'none'}¶
Kwargs to set in the Kani._full_round loop when the model should disable function calling. Mostly this is useful for API models, where we want to still define functions in the prompt but disallow calling them.
- message_len(message: ChatMessage) int[source]¶
Returns the estimated number of tokens used by a single given message.
Note
The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- translate_functions(functions: list[AIFunction]) list[dict][source]¶
Translate a list of Kani
AIFunctions to a list of OpenAI tool definitions.
- translate_messages(
- messages: list[ChatMessage],
Translate a list of Kani
ChatMessages to a list of OpenAI messages.
- static translate_kani_message_to_openai(
- message: ChatMessage,
Translate a single Kani ChatMessage to a single OpenAI message.
- static translate_kani_message_to_openai_responses(
- message: ChatMessage,
Translate a single Kani ChatMessage to its corresponding OpenAI responses input items.
- async prompt_len(messages, functions=None, **kwargs) int[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
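Because prompt_len MAY be asynchronous, a caller that wants to work with any engine has to handle both a plain int and an awaitable. The following is a minimal sketch of that pattern using only the standard library. The two counting functions are hypothetical stand-ins for engine implementations, and this is not a description of how Kani.prompt_token_len() is implemented.

```python
import asyncio
import inspect


async def resolve_len(result):
    """Await the result if it is awaitable; otherwise return it unchanged."""
    if inspect.isawaitable(result):
        return await result
    return result


def sync_prompt_len(messages, functions=None):
    # Hypothetical synchronous counter: one "token" per word.
    return sum(len(m.split()) for m in messages)


async def async_prompt_len(messages, functions=None):
    # Hypothetical asynchronous counter, e.g. one that would call a remote API.
    await asyncio.sleep(0)
    return sum(len(m.split()) for m in messages)


async def main():
    msgs = ["hello world", "how are you"]
    a = await resolve_len(sync_prompt_len(msgs))
    b = await resolve_len(async_prompt_len(msgs))
    return a, b


counts = asyncio.run(main())
```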
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method’s signature is the same as BaseEngine.predict().
This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
Anthropic¶
- class kani.engines.anthropic.AnthropicEngine(
- api_key: str = None,
- model: str = 'claude-sonnet-4-0',
- max_tokens: int = 2048,
- max_context_size: int = None,
- *,
- retry: int = 2,
- api_base: str = None,
- headers: dict = None,
- client: AsyncAnthropic = None,
- **hyperparams,
Engine for using the Anthropic API.
This engine supports all Claude models. See https://docs.anthropic.com/claude/docs/getting-access-to-claude for information on accessing the Claude API.
See https://docs.anthropic.com/en/docs/about-claude/models/overview for a list of available models.
Multimodal support: images.
Additional capabilities: PDF document processing. Use kani.ext.multimodal_core.BinaryFilePart.
Message Extras:
"anthropic_message": The Message (raw response) returned by the Anthropic servers.
- Parameters:
api_key – Your Anthropic API key. By default, the API key will be read from the ANTHROPIC_API_KEY environment variable.
model – The id of the model to use (e.g. “claude-opus-4-0”). See https://docs.anthropic.com/en/docs/about-claude/models/overview for a list of models.
max_tokens – The maximum number of tokens to sample at each generation (defaults to 2048). Generally, you should set this to the same number as your Kani’s desired_response_tokens.
max_context_size – The maximum amount of tokens allowed in the chat prompt. If None, uses the given model’s full context size.
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).
api_base – The base URL of the Anthropic API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of anthropic.AsyncAnthropic (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the retry, api_base, and headers params will be ignored.
hyperparams – Any additional parameters to pass to the underlying API call (see https://docs.claude.com/en/api/messages).
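The retry parameter is described as a number of retries with exponential backoff. The sketch below illustrates the general technique with the standard library only. The base delay, the choice of ConnectionError as the retried exception, and the reading of retry as "extra attempts after the first" are all assumptions for illustration; the engine delegates retries to its HTTP client and may behave differently.

```python
import asyncio


async def call_with_retry(fn, retry=2, base_delay=0.01):
    """Call an async function, retrying on failure with doubling delays."""
    for attempt in range(retry + 1):
        try:
            return await fn()
        except ConnectionError:
            if attempt == retry:
                raise  # out of retries, surface the last error
            # Delay doubles on each failed attempt: base, 2*base, 4*base, ...
            await asyncio.sleep(base_delay * 2 ** attempt)


attempts = []


async def flaky():
    # Hypothetical request that fails twice, then succeeds.
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("simulated failure")
    return "ok"


result = asyncio.run(call_with_retry(flaky, retry=2))
```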
- disable_function_calling_kwargs = {'tool_choice': {'type': 'none'}}¶
Kwargs to set in the Kani._full_round loop when the model should disable function calling. Mostly this is useful for API models, where we want to still define functions in the prompt but disallow calling them.
- async prompt_len(messages, functions=None, **kwargs) int[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method’s signature is the same as BaseEngine.predict().
This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- token_reserve: int = 500¶
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). Default: 0
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- message_len(message: ChatMessage) int[source]¶
Returns the estimated number of tokens used by a single given message.
Note
The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- class kani.engines.anthropic.AnthropicUnknownPart(*, extra: dict = {})[source]¶
A generic unknown response part from the server.
This generally corresponds to an Anthropic-specific feature. The raw response data is accessible in
data, and will be sent back to the language model in future rounds correctly. Will not be sent to other engines.
Google AI¶
- class kani.engines.google.GoogleAIEngine(
- api_key: str = None,
- model: str = 'gemini-2.5-flash',
- max_context_size: int = None,
- *,
- retry: int = 2,
- api_base: str = None,
- headers: dict = None,
- client: Client = None,
- multimodal_upload_bytes_threshold: int = 512000,
- **hyperparams,
Engine for using the Google AI Studio API (aka Gemini Developer API, Google AI API) and Google Vertex AI API (aka Google Cloud API).
This engine supports all Google AI models.
See https://ai.google.dev/gemini-api/docs/models for a list of available models.
Multimodal support: images, audio, video.
Message Extras:
"google_response": The raw response returned by the Google AI API.
- Parameters:
api_key – Your Gemini Developer API key. By default, the API key will be read from the GEMINI_API_KEY environment variable.
model – The id of the model to use (e.g. “gemini-2.5-flash”). See https://ai.google.dev/gemini-api/docs/models for a list of models.
max_tokens – The maximum number of tokens to sample at each generation (defaults to 512). Generally, you should set this to the same number as your Kani’s desired_response_tokens.
max_context_size – The maximum amount of tokens allowed in the chat prompt. If None, uses the given model’s full context size.
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).
api_base – The base URL of the Google AI API to use. If not specified, the default URL for the specified API (AI Studio/Vertex) will be used.
headers – A dict of HTTP headers to include with each request.
client – An instance of genai.Client (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client).
multimodal_upload_bytes_threshold – If a multimodal object (audio, image, video) is larger than this number of bytes, upload it as a file instead of passing it inline in a request. Default 512kB.
hyperparams – Any additional parameters to pass to the underlying API call (see https://googleapis.github.io/python-genai/genai.html#genai.types.GenerateContentConfig).
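The multimodal_upload_bytes_threshold rule above amounts to a size comparison. As a small sketch of that decision, assuming a strict "larger than" comparison as the parameter description words it, choose_transport below is a hypothetical helper and not a kani function.

```python
def choose_transport(payload: bytes, threshold: int = 512000) -> str:
    """Return "upload" for payloads larger than the threshold, else "inline"."""
    return "upload" if len(payload) > threshold else "inline"


small = choose_transport(b"\x00" * 1000)       # well under the default threshold
large = choose_transport(b"\x00" * 600_000)    # over the default threshold
```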
- async prompt_len(messages, functions=None, **kwargs) int[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method’s signature is the same as BaseEngine.predict().
This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- token_reserve: int = 500¶
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). Default: 0
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- message_len(message: ChatMessage) int[source]¶
Returns the estimated number of tokens used by a single given message.
Note
The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
Hugging Face¶
- class kani.engines.huggingface.HuggingEngine(
- model_id: str,
- max_context_size: int = None,
- prompt_pipeline: ~kani.prompts.pipeline.PromptPipeline[str | ~torch.Tensor] = None,
- *,
- token=None,
- device: str | None = None,
- tokenizer_cls=None,
- tokenizer_kwargs: dict = None,
- model_cls=<class 'transformers.models.auto.modeling_auto.AutoModelForCausalLM'>,
- model_load_kwargs: dict = None,
- chat_template_reasoning_content_key: str = None,
- chat_template_kwargs: dict = None,
- mm_audio_sample_rate: int = None,
- mm_video_fps: float = 1,
- token_reserve: int = 0,
- **hyperparams,
Base engine for all HuggingFace text-generation models.
This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.
Added in version 1.2.0: By default, the HuggingEngine uses models’ bundled chat template to build the prompt for chat-based models available on Hugging Face. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.
Multimodal support: audio, images, video (depending on model).
Tip
See Quantization With BitsAndBytes for information about loading a quantized model for lower memory usage.
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model’s config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline). If not passed, uses the Hugging Face chat template if available.
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA or MPS if available; otherwise uses CPU.
tokenizer_cls – Advanced use cases: The HF tokenizer class to use. Defaults to AutoProcessor (if no processing config is available or this raises an error, this will fall back to AutoTokenizer).
tokenizer_kwargs – Additional arguments to pass to AutoProcessor.from_pretrained().
model_cls – Advanced use cases: The HF model class to use. Defaults to AutoModelForCausalLM.
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
chat_template_reasoning_content_key – The key of each message dict that any reasoning content should be set at.
chat_template_kwargs – The keyword arguments to pass to tokenizer.apply_chat_template if using a chat template prompt pipeline.
mm_audio_sample_rate – The sample rate to remux audio inputs to. Check your model’s documentation for the expected sample rate. By default, does not change the sample rate of the input file.
mm_video_fps – The number of image frames to sample per second of video input.
hyperparams – Additional arguments to supply the model during generation.
token_reserve – DEPRECATED: The number of tokens to reserve for internal engine mechanisms (e.g. if there is a generation template after the last user message). If not passed, kani will attempt to infer this from a prompt pipeline.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.
The default behaviour is to call the supplied pipeline.
- async prompt_len(messages, functions=None, **kwargs) int[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- decode_kwargs: dict = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
decode_kwargs – Any arguments to pass to AutoTokenizer.decode().
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- streamer_timeout: float | None = None,
- decode_kwargs: dict = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
streamer_timeout – The maximum number of seconds to wait for the next token when streaming.
decode_kwargs – Any arguments to pass to AutoTokenizer.decode().
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- property token_reserve¶
- message_len(message: ChatMessage) int[source]¶
Return the length, in tokens, of the given chat message.
The HuggingEngine’s default implementation renders the message with apply_chat_template if no prompt_pipeline is supplied.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
llama.cpp¶
- class kani.engines.llamacpp.LlamaCppEngine(
- repo_id: str | None = None,
- filename: str | None = None,
- model_path: str | None = None,
- max_context_size: int = 0,
- prompt_pipeline: PromptPipeline[str | list[int]] = None,
- *,
- model_load_kwargs: dict = None,
- **hyperparams,
This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).
GPU Support
llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.
To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.
- Parameters:
repo_id – The ID of the model repo to load from Hugging Face. If this is set, filename must be set and model_path may not be set.
filename – A filename or glob pattern to match the model file in the Hugging Face repo. If this is set, repo_id must be set and model_path may not be set.
model_path – A path to the model files on local disk. If this is set, neither repo_id nor filename may be set.
max_context_size – The context size of the model.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See this link for more info.
hyperparams – Additional arguments to supply the model during generation.
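The repo_id, filename, and model_path rules above describe two mutually exclusive ways of locating a model. The sketch below restates those rules as a standalone check. validate_model_source is a hypothetical helper, and its behavior when nothing is passed (raising ValueError) is an assumption, not documented engine behavior.

```python
def validate_model_source(repo_id=None, filename=None, model_path=None) -> str:
    """Return "hub" or "local", or raise ValueError for an invalid combination."""
    if model_path is not None:
        # A local path excludes both Hugging Face arguments.
        if repo_id is not None or filename is not None:
            raise ValueError("model_path cannot be combined with repo_id or filename")
        return "local"
    # Without a local path, repo_id and filename must be given together.
    if repo_id is not None and filename is not None:
        return "hub"
    raise ValueError("pass either model_path, or both repo_id and filename")


# Placeholder identifiers, for illustration only.
source = validate_model_source(repo_id="example-org/example-GGUF", filename="*.gguf")
```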
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.
The default behaviour is to call the supplied pipeline.
- async prompt_len(messages, functions=None, **kwargs) int[source]¶
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to
Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to
Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)
- property token_reserve¶
- message_len(message: ChatMessage) int[source]¶
Returns the estimated number of tokens used by a single given message.
Note
The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
vLLM¶
See the kani-ext-vllm documentation at https://github.com/zhudotexe/kani-ext-vllm.
Base¶
- class kani.engines.BaseEngine[source]¶
Base class for all LM engines.
To add support for a new LM, make a subclass of this and implement the abstract methods below.
- abstract prompt_len(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **kwargs,
Returns the number of tokens used by the given prompt (i.e., list of messages and functions), or a best estimate if the exact count is unavailable.
This method MAY be asynchronous. Use Kani.prompt_token_len() for a higher-level interface that handles asynchrony.
- Parameters:
messages – The messages in the prompt.
functions – The functions included in the prompt.
kwargs – Any additional parameters to pass to the underlying token counting implementation (engine-specific).
- abstract async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method’s signature is the same as BaseEngine.predict().
This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. prompt_len(messages, functions) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
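The streaming contract described above (yield strings, then optionally one completion object as the final item) can be sketched with a plain async generator. This is a standalone illustration that does not import kani: FakeCompletion is a hypothetical stand-in for a BaseCompletion-like object, and the module-level stream function stands in for an engine method.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeCompletion:
    """Hypothetical stand-in for a BaseCompletion-like object."""
    text: str


async def stream(messages, functions=None, **hyperparams):
    # Yield the text chunks first...
    chunks = ["Hello", ", ", "world"]
    for chunk in chunks:
        yield chunk
    # ...then, optionally, the completion object as the last item.
    yield FakeCompletion(text="".join(chunks))


async def consume():
    parts, completion = [], None
    async for item in stream(["hi"]):
        if isinstance(item, str):
            parts.append(item)
        else:
            completion = item  # by contract, nothing follows this item
    return "".join(parts), completion


text, completion = asyncio.run(consume())
```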
- disable_function_calling_kwargs = {'include_functions': False}¶
Kwargs to set in the Kani._full_round loop when the model should disable function calling. Mostly this is useful for API models, where we want to still define functions in the prompt but disallow calling them.
- message_len(message: ChatMessage) int[source]¶
Returns the estimated number of tokens used by a single given message.
Note
The token count returned by this may not exactly reflect the actual token count (e.g., due to prompt formatting or not having access to the tokenizer). It should, however, be a safe overestimate to use as an upper bound.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- token_reserve: int = 0¶
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token). Default: 0
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- function_token_reserve(functions: list[AIFunction]) int[source]¶
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
Deprecated since version 1.7.0: Use BaseEngine.prompt_len() instead.
- class kani.engines.Completion(message: ChatMessage, prompt_tokens: int | None = None, completion_tokens: int | None = None)[source]¶
- property message¶
The message returned by the LM.
- property prompt_tokens¶
How many tokens are in the prompt. Can be None for kani to estimate using tokenizer.
- property completion_tokens¶
How many tokens are in the completion. Can be None for kani to estimate using tokenizer.
- class kani.engines.WrapperEngine(engine: BaseEngine, *args, **kwargs)[source]¶
A base class for engines that are meant to wrap other engines. By default, this class takes in another engine as the first parameter in its constructor and will pass through all non-overridden attributes to the wrapped engine.
- Parameters:
engine – The engine to wrap.
- engine¶
The wrapped engine.
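The pass-through behavior described for WrapperEngine is commonly achieved in Python with __getattr__, which is only consulted when normal attribute lookup fails. The following is a minimal sketch of that general technique, assuming nothing about how kani implements it. Wrapper, Inner, and Loud are hypothetical classes for illustration.

```python
class Wrapper:
    """Hypothetical wrapper that forwards unknown attributes to the wrapped object."""

    def __init__(self, engine):
        self.engine = engine

    def __getattr__(self, name):
        # Only called when normal lookup fails, so overrides take precedence.
        return getattr(self.engine, name)


class Inner:
    max_context_size = 4096

    def describe(self):
        return "inner"


class Loud(Wrapper):
    def describe(self):
        # Overridden: post-process the wrapped object's result.
        return self.engine.describe().upper()


wrapped = Loud(Inner())
```

Here wrapped.max_context_size falls through to the inner object, while wrapped.describe() uses the override.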