Engine Reference#
| Model Name | Extra | Capabilities | Engine |
|---|---|---|---|
| GPT-3.5-turbo, GPT-4 | | 🛠️ 📡 | OpenAIEngine |
| Claude, Claude Instant | | 🛠️ 📡 | AnthropicEngine |
| 🤗 transformers | | (runtime) | HuggingEngine |
| 🤗 🦙 LLaMA 3 | | 🔓 💻 🚀 | |
| 🤗 Mistral, Mixtral | | 🛠️ 🔓 💻 🚀 | |
| 🤗 Command R, Command R+ | | 🛠️ 🔓 💻 🚀 | CommandREngine |
| 🤗 🦙 LLaMA v2 | | 🔓 💻 🚀 | LlamaEngine |
| 🤗 🦙 Vicuna v1.3 | | 🔓 💻 🚀 | VicunaEngine |
| llama.cpp | | (runtime) | LlamaCppEngine |
| 🦙 LLaMA v2 (GGUF) | | 🔓 💻 🚀 | LlamaCppEngine |
Additional models using the classes above are also supported; see the model zoo for a more comprehensive list of models!
Legend
🛠️: Supports function calling.
🔓: Open source model.
💻: Runs locally on CPU.
🚀: Runs locally on GPU.
📡: Hosted API.
Base#
- class kani.engines.BaseEngine[source]#
Base class for all LM engines.
To add support for a new LM, make a subclass of this and implement the abstract methods below.
- abstract message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- abstract async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- token_reserve: int = 0#
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model's reply with a delimiting token).
Default: 0
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method's signature is the same as BaseEngine.predict(). This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
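For illustration, here is a minimal sketch of a custom engine. EchoEngine and its naive whitespace-based token count are hypothetical stand-ins for a real model integration; they are not part of kani.

```python
# A minimal sketch of a BaseEngine subclass. EchoEngine and its naive
# whitespace "tokenizer" are hypothetical illustrations, not part of kani.
from kani import ChatMessage
from kani.engines import BaseEngine, Completion


class EchoEngine(BaseEngine):
    # required so kani can manage the chat context window
    max_context_size = 1024

    def message_len(self, message: ChatMessage) -> int:
        # toy token estimate: count whitespace-separated words
        return len((message.text or "").split())

    async def predict(self, messages, functions=None, **hyperparams):
        # a real engine would call a model here; we just echo the last message
        last = messages[-1].text if messages else ""
        return Completion(message=ChatMessage.assistant(last))
```

Because stream() is not overridden here, kani falls back to yielding the entire completion text in a single chunk, as described above.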
- class kani.engines.Completion(message: ChatMessage, prompt_tokens: int | None = None, completion_tokens: int | None = None)[source]#
- property message#
The message returned by the LM.
- property prompt_tokens#
How many tokens are in the prompt. Can be None, in which case kani will estimate the count using the tokenizer.
- property completion_tokens#
How many tokens are in the completion. Can be None, in which case kani will estimate the count using the tokenizer.
- class kani.engines.WrapperEngine(engine: BaseEngine, *args, **kwargs)[source]#
A base class for engines that are meant to wrap other engines. By default, this class takes in another engine as the first parameter in its constructor and passes through all non-overridden attributes to the wrapped engine.
- Parameters:
engine โ The engine to wrap.
- engine#
The wrapped engine.
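As a sketch of the pattern, a wrapper that logs token usage around any other engine might look like this (TokenLoggingEngine is a hypothetical name, not part of kani):

```python
# A minimal sketch of a WrapperEngine subclass that logs token usage.
import logging

from kani.engines import WrapperEngine

log = logging.getLogger(__name__)


class TokenLoggingEngine(WrapperEngine):
    async def predict(self, messages, functions=None, **hyperparams):
        # delegate to the wrapped engine, then log its reported token counts
        completion = await self.engine.predict(messages, functions, **hyperparams)
        log.info(
            "prompt_tokens=%s completion_tokens=%s",
            completion.prompt_tokens,
            completion.completion_tokens,
        )
        return completion
```

All other attributes (message_len, stream, and so on) pass through to the wrapped engine unchanged.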
- class kani.engines.base.BaseCompletion[source]#
Base class for all LM engine completions.
- abstract property message: ChatMessage#
The message returned by the LM.
- class kani.engines.httpclient.BaseClient(http: ClientSession | None = None)[source]#
aiohttp-based HTTP client to help implement HTTP-based engines.
Deprecated since version 1.0.0: We recommend using httpx.AsyncClient instead. This aiohttp-based client will be removed in a future version.
- Parameters:
http – The aiohttp.ClientSession to use; if not provided, creates a new session.
- async request(method: str, route: str, **kwargs) → ClientResponse [source]#
Makes an HTTP request to the given route (relative to the base route).
- Parameters:
method – The HTTP method to use (e.g. "GET", "POST").
route – The route to make the request to (relative to the SERVICE_BASE).
- Raises:
HTTPStatusException – The request returned a non-2xx response.
HTTPTimeout – The request timed out.
HTTPException – The response could not be deserialized.
- async get(route: str, **kwargs)[source]#
Convenience method; equivalent to self.request("GET", route, **kwargs).json().
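Although this client is deprecated, a short sketch shows the intended subclass pattern; ExampleClient and the /models route are hypothetical:

```python
# Sketch of a BaseClient subclass; ExampleClient and its routes are hypothetical.
from kani.engines.httpclient import BaseClient


class ExampleClient(BaseClient):
    # all requests are made relative to this base URL
    SERVICE_BASE = "https://api.example.com/v1"


async def list_models():
    client = ExampleClient()
    # GET https://api.example.com/v1/models and deserialize the JSON body
    return await client.get("/models")
```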
OpenAI#
- class kani.engines.openai.OpenAIEngine(
- api_key: str | None = None,
- model='gpt-3.5-turbo',
- max_context_size: int | None = None,
- *,
- organization: str | None = None,
- retry: int = 5,
- api_base: str = 'https://api.openai.com/v1',
- headers: dict | None = None,
- client: AsyncOpenAI | None = None,
- **hyperparams,
Engine for using the OpenAI API.
This engine supports all chat-based models and fine-tunes.
- Parameters:
api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.
model – The ID of the model to use (e.g. "gpt-3.5-turbo", "ft:gpt-3.5-turbo:my-org:custom_suffix:id").
max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model's full context size.
organization – The OpenAI organization to use in requests. By default, the org ID will be read from the OPENAI_ORG_ID environment variable (defaults to the API key's default org if not set).
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).
api_base – The base URL of the OpenAI API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the organization, retry, api_base, and headers params will be ignored.
hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
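As a usage sketch (assuming the OPENAI_API_KEY environment variable is set):

```python
# Minimal OpenAI chat setup; the system prompt text is illustrative.
from kani import Kani
from kani.engines.openai import OpenAIEngine

engine = OpenAIEngine(model="gpt-3.5-turbo")
ai = Kani(engine, system_prompt="You are a helpful assistant.")
```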
Anthropic#
- class kani.engines.anthropic.AnthropicEngine(
- api_key: str | None = None,
- model: str = 'claude-3-haiku-20240307',
- max_tokens: int = 512,
- max_context_size: int | None = None,
- *,
- retry: int = 2,
- api_base: str | None = None,
- headers: dict | None = None,
- client: AsyncAnthropic | None = None,
- **hyperparams,
Engine for using the Anthropic API.
This engine supports all Claude models. See https://docs.anthropic.com/claude/docs/getting-access-to-claude for information on accessing the Claude API.
See https://docs.anthropic.com/claude/docs/models-overview for a list of available models.
- Parameters:
api_key – Your Anthropic API key. By default, the API key will be read from the ANTHROPIC_API_KEY environment variable.
model – The ID of the model to use (e.g. "claude-2.1", "claude-instant-1.2").
max_tokens – The maximum number of tokens to sample at each generation (defaults to 512). Generally, you should set this to the same number as your Kani's desired_response_tokens.
max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model's full context size.
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).
api_base – The base URL of the Anthropic API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of anthropic.AsyncAnthropic (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the retry, api_base, and headers params will be ignored.
hyperparams – Any additional parameters to pass to the underlying API call (see https://docs.anthropic.com/claude/reference/complete_post).
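As a usage sketch (assuming the ANTHROPIC_API_KEY environment variable is set):

```python
# Minimal Anthropic chat setup using the documented defaults.
from kani import Kani
from kani.engines.anthropic import AnthropicEngine

engine = AnthropicEngine(model="claude-3-haiku-20240307", max_tokens=512)
ai = Kani(engine)
```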
Hugging Face#
- class kani.engines.huggingface.HuggingEngine(
- model_id: str,
- max_context_size: int | None = None,
- prompt_pipeline: PromptPipeline[str | Tensor] | None = None,
- *,
- token=None,
- device: str | None = None,
- tokenizer_kwargs: dict | None = None,
- model_load_kwargs: dict | None = None,
- **hyperparams,
Base engine for all HuggingFace text-generation models.
This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model's config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
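A usage sketch follows; it assumes the prebuilt LLAMA2_PIPELINE is importable from kani.prompts.impl (the same pipeline shown in the LlamaEngine equivalence example below) and that you have access to the gated model.

```python
from kani import Kani
from kani.engines.huggingface import HuggingEngine
from kani.prompts.impl import LLAMA2_PIPELINE  # assumed import path for the prebuilt pipeline

engine = HuggingEngine(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    prompt_pipeline=LLAMA2_PIPELINE,
)
ai = Kani(engine)
```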
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.
The default behaviour is to call the supplied pipeline.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- streamer_timeout=None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
streamer_timeout – The maximum number of seconds to wait for the next token when streaming.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- class kani.engines.huggingface.llama2.LlamaEngine(model_id: str = 'meta-llama/Llama-2-7b-chat-hf', *args, **kwargs)[source]#
Implementation of LLaMA v2 using huggingface transformers.
You may also use the 13b, 70b, or other LLaMA models that use the LLaMA prompt by passing the HuggingFace model ID to the initializer.
Model IDs:
meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
meta-llama/Llama-2-70b-chat-hf
In theory you could also use the non-chat-tuned variants as well.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

```python
engine = LlamaEngine("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)
ai = Kani(engine)
```
Attention
You will need to accept Meta's license in order to download the LLaMA v2 weights. Visit https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to request access. Then, run huggingface-cli login to authenticate with Hugging Face.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
Tip
This engine is equivalent to the following usage of the base HuggingEngine.

```python
LLAMA2_PIPELINE = (
    PromptPipeline()
    .wrap(role=ChatRole.SYSTEM, prefix="<<SYS>>\n", suffix="\n<</SYS>>\n")
    .translate_role(role=ChatRole.SYSTEM, to=ChatRole.USER)
    .merge_consecutive(role=ChatRole.USER, sep="\n")
    .merge_consecutive(role=ChatRole.ASSISTANT, sep=" ")
    .conversation_fmt(
        user_prefix="<s>[INST] ",
        user_suffix=" [/INST]",
        assistant_prefix=" ",
        assistant_suffix=" </s>",
        assistant_suffix_if_last="",
    )
)

engine = HuggingEngine(
    "meta-llama/Llama-2-7b-chat-hf",
    prompt_pipeline=LLAMA2_PIPELINE,
)
```

See PromptPipeline for more information on reusable prompt pipelines.

- Parameters:
model_id – The ID of the model to load from HuggingFace.
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
max_context_size – The context size of the model.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
- class kani.engines.huggingface.cohere.CommandREngine(model_id: str = 'CohereForAI/c4ai-command-r-v01', *args, **kwargs)[source]#
Implementation of Command R (35B) and Command R+ (104B) using huggingface transformers.
Model IDs:
CohereForAI/c4ai-command-r-v01
CohereForAI/c4ai-command-r-plus
GPU Support
By default, the CommandREngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

```python
engine = CommandREngine("CohereForAI/c4ai-command-r-v01")
ai = KaniWithFunctions(engine)
```
Configuration
Command R has many configurations that enable function calling and/or RAG, and it is poorly documented exactly how certain prompts affect the model. In this implementation, we default to the Cohere-supplied "preamble" if function definitions are supplied, and assume that we pass every generated function call and result each turn.
When generating the result of a tool call turn, this implementation does NOT request the model to generate citations by default (unlike the Cohere API). You can enable citations by setting the rag_prompt_instructions parameter to DEFAULT_RAG_INSTRUCTIONS_ACC or DEFAULT_RAG_INSTRUCTIONS_FAST (imported from kani.prompts.impl.cohere).
See the constructor's available parameters for more information.
Caution
Command R requires transformers>=4.39.1 as a dependency. If you see warnings about a missing CohereTokenizerFast, please update your version with pip install transformers>=4.39.1.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model (defaults to Command R's size of 128k).
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
tool_prompt_include_function_calls – Whether to include previous turns' function calls or just the model's answers when it is the model's generation turn and the last message is not FUNCTION.
tool_prompt_include_function_results – Whether to include the results of previous turns' function calls in the context when it is the model's generation turn and the last message is not FUNCTION.
tool_prompt_instructions – The system prompt to send just before the model's generation turn that includes instructions on the format to generate tool calls in. Generally you shouldn't change this.
rag_prompt_include_function_calls – Whether to include previous turns' function calls or just the model's answers when it is the model's generation turn and the last message is FUNCTION.
rag_prompt_include_function_results – Whether to include the results of previous turns' function calls in the context when it is the model's generation turn and the last message is FUNCTION.
rag_prompt_instructions – The system prompt to send just before the model's generation turn that includes instructions on the format to generate the result in. Can be None to only generate a model turn. Defaults to None for maximum interoperability between models. Options:
- DEFAULT_RAG_INSTRUCTIONS_ACC (from kani.prompts.impl.cohere)
- DEFAULT_RAG_INSTRUCTIONS_FAST (from kani.prompts.impl.cohere)
- None (default)
- another user-supplied string
hyperparams – Additional arguments to supply the model during generation.
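For example, enabling citation generation with the accurate-variant instructions named above might look like this sketch:

```python
from kani.engines.huggingface.cohere import CommandREngine
from kani.prompts.impl.cohere import DEFAULT_RAG_INSTRUCTIONS_ACC

# ask the model to generate citations (accurate variant) on RAG turns
engine = CommandREngine(
    "CohereForAI/c4ai-command-r-v01",
    rag_prompt_instructions=DEFAULT_RAG_INSTRUCTIONS_ACC,
)
```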
- token_reserve: int = 200#
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model's reply with a delimiting token).
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- streamer_timeout=None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
streamer_timeout – The maximum number of seconds to wait for the next token when streaming.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- class kani.engines.huggingface.vicuna.VicunaEngine(model_id: str = 'lmsys/vicuna-7b-v1.3', *args, **kwargs)[source]#
Implementation of Vicuna (a LLaMA v1 fine-tune) using huggingface transformers.
You may also use the 13b, 33b, or other LLaMA models that use the Vicuna prompt by passing the HuggingFace model ID to the initializer.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Usage

```python
engine = VicunaEngine("lmsys/vicuna-7b-v1.3")
ai = Kani(engine)
```
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model's config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
llama.cpp#
- class kani.engines.llamacpp.LlamaCppEngine(
- repo_id: str,
- filename: str | None = None,
- max_context_size: int = 0,
- prompt_pipeline: ~kani.prompts.pipeline.PromptPipeline[str | list[int]] = PromptPipeline([Wrap(role=<ChatRole.SYSTEM: 'system'>,
- predicate=None,
- prefix='<<SYS>>\n',
- suffix='\n<</SYS>>\n'),
- TranslateRole(role=<ChatRole.SYSTEM: 'system'>,
- predicate=None,
- to=<ChatRole.USER: 'user'>,
- warn=None),
- MergeConsecutive(role=<ChatRole.USER: 'user'>,
- predicate=None,
- sep='\n',
- joiner=None,
- out_role=<ChatRole.USER: 'user'>),
- MergeConsecutive(role=<ChatRole.ASSISTANT: 'assistant'>,
- predicate=None,
- sep=' ',
- joiner=None,
- out_role=<ChatRole.ASSISTANT: 'assistant'>),
- ConversationFmt(prefix='',
- sep='',
- suffix='',
- generation_suffix='',
- user_prefix='<s>[INST] ',
- user_suffix=' [/INST]',
- assistant_prefix=' ',
- assistant_suffix=' </s>',
- assistant_suffix_if_last='',
- system_prefix='',
- system_suffix='',
- function_prefix='<s>[INST] ',
- function_suffix=' [/INST]')]),
- *,
- model_load_kwargs: dict | None = None,
- **hyperparams,
This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).
This engine defaults to LLaMA 2 Chat 7B with 4-bit quantization.
GPU Support
llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.
To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.

- Parameters:
repo_id – The ID of the model repo to load from Hugging Face.
filename – A filename or glob pattern to match the model file in the repo.
max_context_size – The context size of the model.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See the llama-cpp-python documentation for more info.
hyperparams – Additional arguments to supply the model during generation.
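Putting this together, a usage sketch; the GGUF repo and quantization filename here are illustrative, not defaults, and the GPU offload assumes llama-cpp-python was installed with a GPU backend:

```python
from kani import Kani
from kani.engines.llamacpp import LlamaCppEngine

# hypothetical GGUF repo/file; n_gpu_layers=-1 offloads all layers to GPU
engine = LlamaCppEngine(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="*.Q4_K_M.gguf",
    model_load_kwargs={"n_gpu_layers": -1},
)
ai = Kani(engine)
```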
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.
The default behaviour is to call the supplied pipeline.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)