Engine Reference#

| Model Name | Extra | Capabilities | Engine |
|---|---|---|---|
| GPT-3.5-turbo, GPT-4 | openai | 🛠️ 📡 | kani.engines.openai.OpenAIEngine |
| Claude, Claude Instant | anthropic | 🛠️ 📡 | kani.engines.anthropic.AnthropicEngine |
| 🤗 transformers[4] | huggingface[2] | (runtime) | kani.engines.huggingface.HuggingEngine |
| 🤗 🦙 LLaMA 3 | huggingface, llama[2] | 🔓 🖥 🚀 | kani.engines.huggingface.HuggingEngine[1] |
| 🤗 Mistral, Mixtral | huggingface[2] | 🛠️ 🔓 🖥 🚀 | kani.engines.huggingface.HuggingEngine[1] |
| 🤗 Command R, Command R+ | huggingface[2] | 🛠️ 🔓 🖥 🚀 | kani.engines.huggingface.cohere.CommandREngine |
| 🤗 🦙 LLaMA v2 | huggingface, llama[2] | 🔓 🖥 🚀 | kani.engines.huggingface.llama2.LlamaEngine |
| 🤗 🦙 Vicuna v1.3 | huggingface, llama[2] | 🔓 🖥 🚀 | kani.engines.huggingface.vicuna.VicunaEngine |
| llama.cpp[4] | cpp | (runtime) | kani.engines.llamacpp.LlamaCppEngine |
| 🦙 LLaMA v2 (GGUF) | cpp | 🔓 🖥 🚀 | kani.engines.llamacpp.LlamaCppEngine |

Additional models using the classes above are also supported - see the model zoo for a more comprehensive list of models!

Legend

  • 🛠️: Supports function calling.

  • 🔓: Open source model.

  • 🖥: Runs locally on CPU.

  • 🚀: Runs locally on GPU.

  • 📡: Hosted API.

Base#

class kani.engines.BaseEngine[source]#

Base class for all LM engines.

To add support for a new LM, make a subclass of this and implement the abstract methods below.
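
For illustration, a minimal sketch of a custom engine follows. It is not part of the library; the echo behaviour and the rough 4-characters-per-token estimate are assumptions made only for this example.

# hypothetical toy engine: echoes the last message instead of calling a real LM
from kani import ChatMessage
from kani.engines import BaseEngine, Completion


class EchoEngine(BaseEngine):
    max_context_size = 4096  # required attribute: max tokens the "LM" accepts

    def message_len(self, message: ChatMessage) -> int:
        # crude token estimate (~4 chars per token) plus a small per-message overhead
        return len(message.text or "") // 4 + 5

    async def predict(self, messages, functions=None, **hyperparams) -> Completion:
        # return the last message's text as the assistant's reply
        last_text = messages[-1].text if messages else ""
        return Completion(message=ChatMessage.assistant(last_text))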

max_context_size: int#

The maximum context size supported by this engine’s LM.

abstract message_len(message: ChatMessage) → int[source]#

Return the length, in tokens, of the given chat message.

abstract async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → BaseCompletion[source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.

token_reserve: int = 0#

Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token).

Default: 0

function_token_reserve(functions: list[AIFunction]) → int[source]#

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → AsyncIterable[str | BaseCompletion][source]#

Optional: Stream a completion from the engine, token-by-token.

This method’s signature is the same as BaseEngine.predict().

This method should yield strings as an asynchronous iterable.

Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.

If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.
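
A sketch of consuming this contract directly: string chunks are partial text, and a final BaseCompletion (if yielded) always comes last. The engine, model, and prompt below are illustrative and assume OPENAI_API_KEY is set in the environment.

import asyncio

from kani import ChatMessage
from kani.engines.openai import OpenAIEngine


async def main():
    engine = OpenAIEngine(model="gpt-3.5-turbo")  # any engine works here
    history = [ChatMessage.user("Tell me a short joke.")]
    async for elem in engine.stream(history):
        if isinstance(elem, str):
            print(elem, end="", flush=True)  # partial text chunk
        else:
            print("\ncompletion tokens:", elem.completion_tokens)  # final BaseCompletion
    await engine.close()

asyncio.run(main())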

async close()[source]#

Optional: Clean up any resources the engine might need.

class kani.engines.Completion(message: ChatMessage, prompt_tokens: int | None = None, completion_tokens: int | None = None)[source]#
property message#

The message returned by the LM.

property prompt_tokens#

How many tokens are in the prompt. Can be None for kani to estimate using tokenizer.

property completion_tokens#

How many tokens are in the completion. Can be None for kani to estimate using tokenizer.

class kani.engines.WrapperEngine(engine: BaseEngine, *args, **kwargs)[source]#

A base class for engines that are meant to wrap other engines. By default, this class takes in another engine as the first parameter in its constructor and will pass through all non-overridden attributes to the wrapped engine.

Parameters:

engine – The engine to wrap.

engine#

The wrapped engine.
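
A minimal sketch of a wrapper that only observes completions; attribute lookups such as max_context_size and message_len fall through to the wrapped engine, so only the overridden method changes behaviour.

from kani.engines import WrapperEngine


class LoggingEngine(WrapperEngine):
    async def predict(self, messages, functions=None, **hyperparams):
        completion = await self.engine.predict(messages, functions, **hyperparams)
        # log a one-line summary, then pass the completion through unchanged
        print(f"predicted {completion.completion_tokens} tokens from {len(messages)} messages")
        return completion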

class kani.engines.base.BaseCompletion[source]#

Base class for all LM engine completions.

abstract property message: ChatMessage#

The message returned by the LM.

abstract property prompt_tokens: int | None#

How many tokens are in the prompt. Can be None for kani to estimate using tokenizer.

abstract property completion_tokens: int | None#

How many tokens are in the completion. Can be None for kani to estimate using tokenizer.

class kani.engines.httpclient.BaseClient(http: ClientSession | None = None)[source]#

aiohttp-based HTTP client to help implement HTTP-based engines.

Deprecated since version 1.0.0: We recommend using httpx.AsyncClient instead. This aiohttp-based client will be removed in a future version.

Parameters:

http – The aiohttp.ClientSession to use; if not provided, creates a new session.

SERVICE_BASE: str#

The base route of the HTTP API.

async request(method: str, route: str, **kwargs) → ClientResponse[source]#

Makes an HTTP request to the given route (relative to the base route).

Parameters:
  • method – The HTTP method to use (e.g. ‘GET’, ‘POST’).

  • route – The route to make the request to (relative to the SERVICE_BASE).

Raises:
async get(route: str, **kwargs)[source]#

Convenience method; equivalent to self.request("GET", route, **kwargs).json().

async post(route: str, **kwargs)[source]#

Convenience method; equivalent to self.request("POST", route, **kwargs).json().

async close()[source]#

Close the underlying aiohttp session.

OpenAI#

class kani.engines.openai.OpenAIEngine(
api_key: str | None = None,
model='gpt-3.5-turbo',
max_context_size: int | None = None,
*,
organization: str | None = None,
retry: int = 5,
api_base: str = 'https://api.openai.com/v1',
headers: dict | None = None,
client: AsyncOpenAI | None = None,
**hyperparams,
)[source]#

Engine for using the OpenAI API.

This engine supports all chat-based models and fine-tunes.

Parameters:
  • api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.

  • model – The ID of the model to use (e.g. “gpt-3.5-turbo”, “ft:gpt-3.5-turbo:my-org:custom_suffix:id”).

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • organization – The OpenAI organization to use in requests. By default, the org ID is read from the OPENAI_ORG_ID environment variable (defaults to the API key’s default org if not set).

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).

  • api_base – The base URL of the OpenAI API to use.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the organization, retry, api_base, and headers params will be ignored.

  • hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
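
A typical construction, assuming OPENAI_API_KEY is set in the environment; the temperature value shown is an illustrative hyperparam that is simply forwarded with each request.

from kani import Kani
from kani.engines.openai import OpenAIEngine

engine = OpenAIEngine(model="gpt-3.5-turbo", temperature=0.2)
ai = Kani(engine, system_prompt="You are a helpful assistant.")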

Anthropic#

class kani.engines.anthropic.AnthropicEngine(
api_key: str | None = None,
model: str = 'claude-3-haiku-20240307',
max_tokens: int = 512,
max_context_size: int | None = None,
*,
retry: int = 2,
api_base: str | None = None,
headers: dict | None = None,
client: AsyncAnthropic | None = None,
**hyperparams,
)[source]#

Engine for using the Anthropic API.

This engine supports all Claude models. See https://docs.anthropic.com/claude/docs/getting-access-to-claude for information on accessing the Claude API.

See https://docs.anthropic.com/claude/docs/models-overview for a list of available models.

Parameters:
  • api_key – Your Anthropic API key. By default, the API key will be read from the ANTHROPIC_API_KEY environment variable.

  • model – The ID of the model to use (e.g. “claude-2.1”, “claude-instant-1.2”).

  • max_tokens – The maximum number of tokens to sample at each generation (defaults to 512). Generally, you should set this to the same number as your Kani’s desired_response_tokens.

  • max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model’s full context size.

  • retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).

  • api_base – The base URL of the Anthropic API to use.

  • headers – A dict of HTTP headers to include with each request.

  • client – An instance of anthropic.AsyncAnthropic (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the retry, api_base, and headers params will be ignored.

  • hyperparams – Any additional parameters to pass to the underlying API call (see https://docs.anthropic.com/claude/reference/complete_post).
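
A similar construction for Claude, assuming ANTHROPIC_API_KEY is set in the environment; the max_tokens value mirrors the default noted above.

from kani import Kani
from kani.engines.anthropic import AnthropicEngine

engine = AnthropicEngine(model="claude-3-haiku-20240307", max_tokens=512)
ai = Kani(engine)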

Hugging Face#

class kani.engines.huggingface.HuggingEngine(
model_id: str,
max_context_size: int | None = None,
prompt_pipeline: PromptPipeline[str | Tensor] | None = None,
*,
token=None,
device: str | None = None,
tokenizer_kwargs: dict | None = None,
model_load_kwargs: dict | None = None,
**hyperparams,
)[source]#

Base engine for all HuggingFace text-generation models.

This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Parameters:
  • model_id – The ID of the model to load from HuggingFace.

  • max_context_size – The context size of the model. If not given, will be set from the model’s config.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.

  • device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.

  • tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().

  • model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().

  • hyperparams – Additional arguments to supply the model during generation.
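
A hedged usage sketch: load a chat-tuned model by ID, assuming its tokenizer ships a chat template (so no explicit prompt_pipeline is needed) and that you have accepted the model’s license. The model ID shown is illustrative.

from kani import Kani
from kani.engines.huggingface import HuggingEngine

# token=True reuses your huggingface-cli credentials for gated repos
engine = HuggingEngine(model_id="meta-llama/Meta-Llama-3-8B-Instruct", token=True)
ai = Kani(engine)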

message_len(message: ChatMessage) → int[source]#

Return the length, in tokens, of the given chat message.

function_token_reserve(functions: list[AIFunction]) → int[source]#

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) → str | Tensor[source]#

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.

The default behaviour is to call the supplied pipeline.
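
As a sketch of how a subclass might hook into this step, the override below appends a hypothetical generation cue to whatever the pipeline produces; the cue itself is an assumption made only for illustration.

from kani.engines.huggingface import HuggingEngine


class MyEngine(HuggingEngine):
    def build_prompt(self, messages, functions=None):
        prompt = super().build_prompt(messages, functions)
        if isinstance(prompt, str):
            # hypothetical extra cue appended to the string prompt
            prompt += "\nAssistant:"
        return prompt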

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → Completion[source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
*,
streamer_timeout=None,
**hyperparams,
) → AsyncIterable[str | BaseCompletion][source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • streamer_timeout – The maximum number of seconds to wait for the next token when streaming.

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

class kani.engines.huggingface.llama2.LlamaEngine(model_id: str = 'meta-llama/Llama-2-7b-chat-hf', *args, **kwargs)[source]#

Implementation of LLaMA v2 using huggingface transformers.

You may also use the 13b, 70b, or other LLaMA models that use the LLaMA prompt by passing the HuggingFace model ID to the initializer.

Model IDs:

  • meta-llama/Llama-2-7b-chat-hf

  • meta-llama/Llama-2-13b-chat-hf

  • meta-llama/Llama-2-70b-chat-hf

In theory, you could also use the non-chat-tuned variants.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

from kani import Kani
from kani.engines.huggingface.llama2 import LlamaEngine

engine = LlamaEngine("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)
ai = Kani(engine)

Attention

You will need to accept Meta’s license in order to download the LLaMA v2 weights. Visit https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to request access.

Then, run huggingface-cli login to authenticate with Hugging Face.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Tip

This engine is equivalent to the following usage of the base HuggingEngine.

from kani import ChatRole
from kani.engines.huggingface import HuggingEngine
from kani.prompts.pipeline import PromptPipeline

LLAMA2_PIPELINE = (
    PromptPipeline()
    .wrap(role=ChatRole.SYSTEM, prefix="<<SYS>>\n", suffix="\n<</SYS>>\n")
    .translate_role(role=ChatRole.SYSTEM, to=ChatRole.USER)
    .merge_consecutive(role=ChatRole.USER, sep="\n")
    .merge_consecutive(role=ChatRole.ASSISTANT, sep=" ")
    .conversation_fmt(
        user_prefix="<s>[INST] ",
        user_suffix=" [/INST]",
        assistant_prefix=" ",
        assistant_suffix=" </s>",
        assistant_suffix_if_last="",
    )
)

engine = HuggingEngine(
    "meta-llama/Llama-2-7b-chat-hf",
    prompt_pipeline=LLAMA2_PIPELINE
)

See PromptPipeline for more information on reusable prompt pipelines.

Parameters:
  • model_id – The ID of the model to load from HuggingFace.

  • token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.

  • max_context_size – The context size of the model.

  • device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.

  • tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().

  • model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().

  • hyperparams – Additional arguments to supply the model during generation.

class kani.engines.huggingface.cohere.CommandREngine(model_id: str = 'CohereForAI/c4ai-command-r-v01', *args, **kwargs)[source]#

Implementation of Command R (35B) and Command R+ (104B) using huggingface transformers.

Model IDs:

  • CohereForAI/c4ai-command-r-v01

  • CohereForAI/c4ai-command-r-plus

GPU Support

By default, the CommandREngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

from kani.engines.huggingface.cohere import CommandREngine

engine = CommandREngine("CohereForAI/c4ai-command-r-v01")
# KaniWithFunctions: a user-defined Kani subclass exposing @ai_function tools
ai = KaniWithFunctions(engine)

Configuration

Command R has many configurations that enable function calling and/or RAG, and it is poorly documented exactly how certain prompts affect the model. In this implementation, we default to the Cohere-supplied “preamble” if function definitions are supplied, and assume that every generated function call and its results are passed back to the model each turn.

When generating the result of a tool call turn, this implementation does NOT request the model to generate citations by default (unlike the Cohere API). You can enable citations by setting the rag_prompt_instructions parameter to DEFAULT_RAG_INSTRUCTIONS_ACC or DEFAULT_RAG_INSTRUCTIONS_FAST (imported from kani.prompts.impl.cohere).

See the constructor’s available parameters for more information.

Caution

Command R requires transformers>=4.39.1 as a dependency. If you see warnings about a missing CohereTokenizerFast, please update your version with pip install transformers>=4.39.1.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Parameters:
  • model_id – The ID of the model to load from HuggingFace.

  • max_context_size – The context size of the model (defaults to Command R’s size of 128k).

  • device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.

  • tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().

  • model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().

  • tool_prompt_include_function_calls – Whether to include previous turns’ function calls or just the model’s answers when it is the model’s generation turn and the last message is not FUNCTION.

  • tool_prompt_include_function_results – Whether to include the results of previous turns’ function calls in the context when it is the model’s generation turn and the last message is not FUNCTION.

  • tool_prompt_instructions – The system prompt to send just before the model’s generation turn that includes instructions on the format to generate tool calls in. Generally you shouldn’t change this.

  • rag_prompt_include_function_calls – Whether to include previous turns’ function calls or just the model’s answers when it is the model’s generation turn and the last message is FUNCTION.

  • rag_prompt_include_function_results – Whether to include the results of previous turns’ function calls in the context when it is the model’s generation turn and the last message is FUNCTION.

  • rag_prompt_instructions –

    The system prompt to send just before the model’s generation turn that includes instructions on the format to generate the result in. Can be None to only generate a model turn. Defaults to None for maximum interoperability between models. Options (a usage example follows this list):

    • from kani.prompts.impl.cohere import DEFAULT_RAG_INSTRUCTIONS_ACC

    • from kani.prompts.impl.cohere import DEFAULT_RAG_INSTRUCTIONS_FAST

    • None (default)

    • another user-supplied string

  • hyperparams – Additional arguments to supply the model during generation.
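
For example, to opt into the citation-style instructions listed above, pass one of the constants to the constructor; this is a sketch using only parameters documented above.

from kani.engines.huggingface.cohere import CommandREngine
from kani.prompts.impl.cohere import DEFAULT_RAG_INSTRUCTIONS_ACC

engine = CommandREngine(
    "CohereForAI/c4ai-command-r-v01",
    rag_prompt_instructions=DEFAULT_RAG_INSTRUCTIONS_ACC,
)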

token_reserve: int = 200#

Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model’s reply with a delimiting token).

Default: 0

message_len(message: ChatMessage) → int[source]#

Return the length, in tokens, of the given chat message.

function_token_reserve(functions: list[AIFunction]) → int[source]#

Optional: How many tokens are required to build a prompt to expose the given functions to the model.

Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → Completion[source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
*,
streamer_timeout=None,
**hyperparams,
) → AsyncIterable[str | Completion][source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • streamer_timeout – The maximum number of seconds to wait for the next token when streaming.

  • hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)

class kani.engines.huggingface.vicuna.VicunaEngine(model_id: str = 'lmsys/vicuna-7b-v1.3', *args, **kwargs)[source]#

Implementation of Vicuna (a LLaMA v1 fine-tune) using huggingface transformers.

You may also use the 13b, 33b, or other LLaMA models that use the Vicuna prompt by passing the HuggingFace model ID to the initializer.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Usage

from kani import Kani
from kani.engines.huggingface.vicuna import VicunaEngine

engine = VicunaEngine("lmsys/vicuna-7b-v1.3")
ai = Kani(engine)

Parameters:
  • model_id – The ID of the model to load from HuggingFace.

  • max_context_size – The context size of the model. If not given, will be set from the model’s config.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.

  • device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.

  • tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().

  • model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().

  • hyperparams – Additional arguments to supply the model during generation.

llama.cpp#

class kani.engines.llamacpp.LlamaCppEngine(
repo_id: str,
filename: str | None = None,
max_context_size: int = 0,
prompt_pipeline: ~kani.prompts.pipeline.PromptPipeline[str | list[int]] = PromptPipeline([Wrap(role=<ChatRole.SYSTEM: 'system'>,
predicate=None,
prefix='<<SYS>>\n',
suffix='\n<</SYS>>\n'),
TranslateRole(role=<ChatRole.SYSTEM: 'system'>,
predicate=None,
to=<ChatRole.USER: 'user'>,
warn=None),
MergeConsecutive(role=<ChatRole.USER: 'user'>,
predicate=None,
sep='\n',
joiner=None,
out_role=<ChatRole.USER: 'user'>),
MergeConsecutive(role=<ChatRole.ASSISTANT: 'assistant'>,
predicate=None,
sep=' ',
joiner=None,
out_role=<ChatRole.ASSISTANT: 'assistant'>),
ConversationFmt(prefix='',
sep='',
suffix='',
generation_suffix='',
user_prefix='<s>[INST] ',
user_suffix=' [/INST]',
assistant_prefix=' ',
assistant_suffix=' </s>',
assistant_suffix_if_last='',
system_prefix='',
system_suffix='',
function_prefix='<s>[INST] ',
function_suffix=' [/INST]')]),
*,
model_load_kwargs: dict | None = None,
**hyperparams,
)[source]#

This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).

This engine defaults to LLaMA 2 Chat 7B with 4-bit quantization.

GPU Support

llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.

To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.

Parameters:
  • repo_id – The ID of the model repo to load from Hugging Face.

  • filename – A filename or glob pattern to match the model file in the repo.

  • max_context_size – The context size of the model.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See the llama-cpp-python documentation for more info.

  • hyperparams – Additional arguments to supply the model during generation.
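
A hedged usage sketch: the repo ID and filename pattern below are illustrative (any GGUF model on the Hugging Face Hub should work), and n_gpu_layers=-1 offloads all layers to the GPU as described above.

from kani import Kani
from kani.engines.llamacpp import LlamaCppEngine

engine = LlamaCppEngine(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",   # illustrative GGUF repo
    filename="*.Q4_K_M.gguf",                  # glob pattern for the model file
    model_load_kwargs={"n_gpu_layers": -1},    # offload all layers to GPU
)
ai = Kani(engine)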

message_len(message: ChatMessage) → int[source]#

Return the length, in tokens, of the given chat message.

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) → str | list[int][source]#

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.

The default behaviour is to call the supplied pipeline.

async predict(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → Completion[source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters:
  • messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.

  • functions – The functions the LM is allowed to call.

  • hyperparams – Any additional parameters to pass to the engine.
async stream(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
**hyperparams,
) → AsyncIterable[str | BaseCompletion][source]#

Given the current context of messages and available functions, get the next predicted chat message from the LM.

Parameters: