Engine Reference#
| Model Name | Extra | Capabilities | Engine |
|---|---|---|---|
| GPT-3.5-turbo, GPT-4 | | 🛠️ 📡 | OpenAIEngine |
| Claude, Claude Instant | | 🛠️ 📡 | AnthropicEngine |
| 🤗 transformers | | (runtime) | HuggingEngine |
| 🤗 🦙 LLaMA 3 | | 🔓 💻 🚀 | |
| 🤗 Mistral, Mixtral | | 🛠️ 🔓 💻 🚀 | |
| 🤗 Command R, Command R+ | | 🛠️ 🔓 💻 🚀 | CommandREngine |
| 🤗 🦙 LLaMA v2 | | 🔓 💻 🚀 | LlamaEngine |
| 🤗 🦙 Vicuna v1.3 | | 🔓 💻 🚀 | VicunaEngine |
| llama.cpp | | (runtime) | LlamaCppEngine |
| 🦙 LLaMA v2 (GGUF) | | 🔓 💻 🚀 | LlamaCppEngine |
Additional models using the classes above are also supported; see the model zoo for a more comprehensive list of models!
Legend
🛠️: Supports function calling.
🔓: Open source model.
💻: Runs locally on CPU.
🚀: Runs locally on GPU.
📡: Hosted API.
Base#
- class kani.engines.BaseEngine[source]#
Base class for all LM engines.
To add support for a new LM, make a subclass of this and implement the abstract methods below.
- abstract message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- abstract async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
- token_reserve: int = 0#
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model's reply with a delimiting token).
Default: 0
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Optional: Stream a completion from the engine, token-by-token.
This method's signature is the same as BaseEngine.predict(). This method should yield strings as an asynchronous iterable.
Optionally, this method may also yield a BaseCompletion. If it does, it MUST be the last item yielded by this method.
If an engine does not implement streaming, this method will yield the entire text of the completion in a single chunk by default.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to the engine.
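For illustration, here is a minimal sketch of a custom engine. EchoEngine and its naive whitespace-based token count are hypothetical stand-ins for a real model integration; they are not part of kani.

```python
# A minimal sketch of a BaseEngine subclass. EchoEngine and its naive
# whitespace "tokenizer" are hypothetical illustrations, not part of kani.
from kani import ChatMessage
from kani.engines import BaseEngine, Completion


class EchoEngine(BaseEngine):
    # required so kani can manage the chat context window
    max_context_size = 1024

    def message_len(self, message: ChatMessage) -> int:
        # toy token estimate: count whitespace-separated words
        return len((message.text or "").split())

    async def predict(self, messages, functions=None, **hyperparams):
        # a real engine would call a model here; we just echo the last message
        last = messages[-1].text if messages else ""
        return Completion(message=ChatMessage.assistant(last))
```

Because stream() is not overridden here, kani falls back to yielding the entire completion text in a single chunk, as described above.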
- class kani.engines.Completion(message: ChatMessage, prompt_tokens: int | None = None, completion_tokens: int | None = None)[source]#
- property message#
The message returned by the LM.
- property prompt_tokens#
How many tokens are in the prompt. Can be None, in which case kani will estimate the count using the tokenizer.
- property completion_tokens#
How many tokens are in the completion. Can be None, in which case kani will estimate the count using the tokenizer.
- class kani.engines.WrapperEngine(engine: BaseEngine, *args, **kwargs)[source]#
A base class for engines that are meant to wrap other engines. By default, this class takes in another engine as the first parameter in its constructor and passes through all non-overridden attributes to the wrapped engine.
- Parameters:
engine โ The engine to wrap.
- engine#
The wrapped engine.
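As a sketch of the pattern, a wrapper that logs token usage around any other engine might look like this (TokenLoggingEngine is a hypothetical name, not part of kani):

```python
# A minimal sketch of a WrapperEngine subclass that logs token usage.
import logging

from kani.engines import WrapperEngine

log = logging.getLogger(__name__)


class TokenLoggingEngine(WrapperEngine):
    async def predict(self, messages, functions=None, **hyperparams):
        # delegate to the wrapped engine, then log its reported token counts
        completion = await self.engine.predict(messages, functions, **hyperparams)
        log.info(
            "prompt_tokens=%s completion_tokens=%s",
            completion.prompt_tokens,
            completion.completion_tokens,
        )
        return completion
```

All other attributes (message_len, stream, and so on) pass through to the wrapped engine unchanged.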
- class kani.engines.base.BaseCompletion[source]#
Base class for all LM engine completions.
- abstract property message: ChatMessage#
The message returned by the LM.
- class kani.engines.httpclient.BaseClient(http: ClientSession | None = None)[source]#
aiohttp-based HTTP client to help implement HTTP-based engines.
Deprecated since version 1.0.0: We recommend using httpx.AsyncClient instead. This aiohttp-based client will be removed in a future version.
- Parameters:
http – The aiohttp.ClientSession to use; if not provided, creates a new session.
- async request(method: str, route: str, **kwargs) → ClientResponse [source]#
Makes an HTTP request to the given route (relative to the base route).
- Parameters:
method – The HTTP method to use (e.g. "GET", "POST").
route – The route to make the request to (relative to the SERVICE_BASE).
- Raises:
HTTPStatusException – The request returned a non-2xx response.
HTTPTimeout – The request timed out.
HTTPException – The response could not be deserialized.
- async get(route: str, **kwargs)[source]#
Convenience method; equivalent to self.request("GET", route, **kwargs).json().
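Although this client is deprecated, a short sketch shows the intended subclass pattern; ExampleClient and the /models route are hypothetical:

```python
# Sketch of a BaseClient subclass; ExampleClient and its routes are hypothetical.
from kani.engines.httpclient import BaseClient


class ExampleClient(BaseClient):
    # all requests are made relative to this base URL
    SERVICE_BASE = "https://api.example.com/v1"


async def list_models():
    client = ExampleClient()
    # GET https://api.example.com/v1/models and deserialize the JSON body
    return await client.get("/models")
```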
OpenAI#
- class kani.engines.openai.OpenAIEngine(
- api_key: str | None = None,
- model='gpt-3.5-turbo',
- max_context_size: int | None = None,
- *,
- organization: str | None = None,
- retry: int = 5,
- api_base: str = 'https://api.openai.com/v1',
- headers: dict | None = None,
- client: AsyncOpenAI | None = None,
- **hyperparams,
Engine for using the OpenAI API.
This engine supports all chat-based models and fine-tunes.
- Parameters:
api_key – Your OpenAI API key. By default, the API key will be read from the OPENAI_API_KEY environment variable.
model – The ID of the model to use (e.g. "gpt-3.5-turbo", "ft:gpt-3.5-turbo:my-org:custom_suffix:id").
max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model's full context size.
organization – The OpenAI organization to use in requests. By default, the org ID will be read from the OPENAI_ORG_ID environment variable (defaults to the API key's default org if not set).
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 5).
api_base – The base URL of the OpenAI API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of openai.AsyncOpenAI (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the organization, retry, api_base, and headers params will be ignored.
hyperparams – The arguments to pass to the create_chat_completion call with each request. See https://platform.openai.com/docs/api-reference/chat/create for a full list of params.
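As a usage sketch (assuming the OPENAI_API_KEY environment variable is set):

```python
# Minimal OpenAI chat setup; the system prompt text is illustrative.
from kani import Kani
from kani.engines.openai import OpenAIEngine

engine = OpenAIEngine(model="gpt-3.5-turbo")
ai = Kani(engine, system_prompt="You are a helpful assistant.")
```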
Anthropic#
- class kani.engines.anthropic.AnthropicEngine(
- api_key: str | None = None,
- model: str = 'claude-3-haiku-20240307',
- max_tokens: int = 512,
- max_context_size: int | None = None,
- *,
- retry: int = 2,
- api_base: str | None = None,
- headers: dict | None = None,
- client: AsyncAnthropic | None = None,
- **hyperparams,
Engine for using the Anthropic API.
This engine supports all Claude models. See https://docs.anthropic.com/claude/docs/getting-access-to-claude for information on accessing the Claude API.
See https://docs.anthropic.com/claude/docs/models-overview for a list of available models.
- Parameters:
api_key – Your Anthropic API key. By default, the API key will be read from the ANTHROPIC_API_KEY environment variable.
model – The ID of the model to use (e.g. "claude-2.1", "claude-instant-1.2").
max_tokens – The maximum number of tokens to sample at each generation (defaults to 512). Generally, you should set this to the same number as your Kani's desired_response_tokens.
max_context_size – The maximum number of tokens allowed in the chat prompt. If None, uses the given model's full context size.
retry – How many times the engine should retry failed HTTP calls with exponential backoff (default 2).
api_base – The base URL of the Anthropic API to use.
headers – A dict of HTTP headers to include with each request.
client – An instance of anthropic.AsyncAnthropic (for reusing the same client in multiple engines). You must specify exactly one of (api_key, client). If this is passed, the retry, api_base, and headers params will be ignored.
hyperparams – Any additional parameters to pass to the underlying API call (see https://docs.anthropic.com/claude/reference/complete_post).
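As a usage sketch (assuming the ANTHROPIC_API_KEY environment variable is set):

```python
# Minimal Anthropic chat setup using the documented defaults.
from kani import Kani
from kani.engines.anthropic import AnthropicEngine

engine = AnthropicEngine(model="claude-3-haiku-20240307", max_tokens=512)
ai = Kani(engine)
```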
Hugging Face#
- class kani.engines.huggingface.HuggingEngine(
- model_id: str,
- max_context_size: int | None = None,
- prompt_pipeline: PromptPipeline[str | Tensor] | None = None,
- *,
- token=None,
- device: str | None = None,
- tokenizer_kwargs: dict | None = None,
- model_load_kwargs: dict | None = None,
- **hyperparams,
Base engine for all HuggingFace text-generation models.
This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.

GPU Support

By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model's config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
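A usage sketch follows; it assumes the prebuilt LLAMA2_PIPELINE is importable from kani.prompts.impl (the same pipeline shown in the LlamaEngine equivalence example below) and that you have access to the gated model.

```python
from kani import Kani
from kani.engines.huggingface import HuggingEngine
from kani.prompts.impl import LLAMA2_PIPELINE  # assumed import path for the prebuilt pipeline

engine = HuggingEngine(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    prompt_pipeline=LLAMA2_PIPELINE,
)
ai = Kani(engine)
```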
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.
The default behaviour is to call the supplied pipeline.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- streamer_timeout=None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
streamer_timeout – The maximum number of seconds to wait for the next token when streaming.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- class kani.engines.huggingface.llama2.LlamaEngine(model_id: str = 'meta-llama/Llama-2-7b-chat-hf', *args, **kwargs)[source]#
Implementation of LLaMA v2 using huggingface transformers.
You may also use the 13b, 70b, or other LLaMA models that use the LLaMA prompt by passing the HuggingFace model ID to the initializer.
Model IDs:
meta-llama/Llama-2-7b-chat-hf
meta-llama/Llama-2-13b-chat-hf
meta-llama/Llama-2-70b-chat-hf
In theory you could also use the non-chat-tuned variants as well.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

```python
engine = LlamaEngine("meta-llama/Llama-2-7b-chat-hf", use_auth_token=True)
ai = Kani(engine)
```
Attention
You will need to accept Meta's license in order to download the LLaMA v2 weights. Visit https://ai.meta.com/resources/models-and-libraries/llama-downloads/ and https://huggingface.co/meta-llama/Llama-2-7b-chat-hf to request access. Then, run huggingface-cli login to authenticate with Hugging Face.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
Tip
This engine is equivalent to the following usage of the base HuggingEngine.

```python
LLAMA2_PIPELINE = (
    PromptPipeline()
    .wrap(role=ChatRole.SYSTEM, prefix="<<SYS>>\n", suffix="\n<</SYS>>\n")
    .translate_role(role=ChatRole.SYSTEM, to=ChatRole.USER)
    .merge_consecutive(role=ChatRole.USER, sep="\n")
    .merge_consecutive(role=ChatRole.ASSISTANT, sep=" ")
    .conversation_fmt(
        user_prefix="<s>[INST] ",
        user_suffix=" [/INST]",
        assistant_prefix=" ",
        assistant_suffix=" </s>",
        assistant_suffix_if_last="",
    )
)

engine = HuggingEngine(
    "meta-llama/Llama-2-7b-chat-hf",
    prompt_pipeline=LLAMA2_PIPELINE,
)
```

See PromptPipeline for more information on reusable prompt pipelines.

- Parameters:
model_id – The ID of the model to load from HuggingFace.
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
max_context_size – The context size of the model.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
- class kani.engines.huggingface.cohere.CommandREngine(model_id: str = 'CohereForAI/c4ai-command-r-v01', *args, **kwargs)[source]#
Implementation of Command R (35B) and Command R+ (104B) using huggingface transformers.
Model IDs:
CohereForAI/c4ai-command-r-v01
CohereForAI/c4ai-command-r-plus
GPU Support
By default, the CommandREngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Usage

```python
engine = CommandREngine("CohereForAI/c4ai-command-r-v01")
ai = KaniWithFunctions(engine)
```
Configuration
Command R has many configurations that enable function calling and/or RAG, and it is poorly documented exactly how certain prompts affect the model. In this implementation, we default to the Cohere-supplied "preamble" if function definitions are supplied, and assume that we pass every generated function call and result each turn.
When generating the result of a tool call turn, this implementation does NOT request the model to generate citations by default (unlike the Cohere API). You can enable citations by setting the rag_prompt_instructions parameter to DEFAULT_RAG_INSTRUCTIONS_ACC or DEFAULT_RAG_INSTRUCTIONS_FAST (imported from kani.prompts.impl.cohere).
See the constructor's available parameters for more information.
Caution
Command R requires transformers>=4.39.1 as a dependency. If you see warnings about a missing CohereTokenizerFast, please update your version with pip install transformers>=4.39.1.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model (defaults to Command R's size of 128k).
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
tool_prompt_include_function_calls – Whether to include previous turns' function calls or just the model's answers when it is the model's generation turn and the last message is not FUNCTION.
tool_prompt_include_function_results – Whether to include the results of previous turns' function calls in the context when it is the model's generation turn and the last message is not FUNCTION.
tool_prompt_instructions – The system prompt to send just before the model's generation turn that includes instructions on the format to generate tool calls in. Generally you shouldn't change this.
rag_prompt_include_function_calls – Whether to include previous turns' function calls or just the model's answers when it is the model's generation turn and the last message is FUNCTION.
rag_prompt_include_function_results – Whether to include the results of previous turns' function calls in the context when it is the model's generation turn and the last message is FUNCTION.
rag_prompt_instructions – The system prompt to send just before the model's generation turn that includes instructions on the format to generate the result in. Can be None to only generate a model turn. Defaults to None for maximum interoperability between models. Options:
- DEFAULT_RAG_INSTRUCTIONS_ACC (from kani.prompts.impl.cohere)
- DEFAULT_RAG_INSTRUCTIONS_FAST (from kani.prompts.impl.cohere)
- None (default)
- another user-supplied string
hyperparams – Additional arguments to supply the model during generation.
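For example, enabling citation generation with the accurate-variant instructions named above might look like this sketch:

```python
from kani.engines.huggingface.cohere import CommandREngine
from kani.prompts.impl.cohere import DEFAULT_RAG_INSTRUCTIONS_ACC

# ask the model to generate citations (accurate variant) on RAG turns
engine = CommandREngine(
    "CohereForAI/c4ai-command-r-v01",
    rag_prompt_instructions=DEFAULT_RAG_INSTRUCTIONS_ACC,
)
```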
- token_reserve: int = 200#
Optional: The number of tokens to reserve for internal engine mechanisms (e.g. if an engine has to set up the model's reply with a delimiting token).
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- function_token_reserve(functions: list[AIFunction]) → int [source]#
Optional: How many tokens are required to build a prompt to expose the given functions to the model.
Default: If this is not implemented and the user passes in functions, log a warning that the engine does not support function calling.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- *,
- streamer_timeout=None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
streamer_timeout – The maximum number of seconds to wait for the next token when streaming.
hyperparams – Any additional parameters to pass to GenerationMixin.generate(). (See https://huggingface.co/docs/transformers/main_classes/text_generation)
- class kani.engines.huggingface.vicuna.VicunaEngine(model_id: str = 'lmsys/vicuna-7b-v1.3', *args, **kwargs)[source]#
Implementation of Vicuna (a LLaMA v1 fine-tune) using huggingface transformers.
You may also use the 13b, 33b, or other LLaMA models that use the Vicuna prompt by passing the HuggingFace model ID to the initializer.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu|cuda" to the constructor.

Tip

See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.

Usage

```python
engine = VicunaEngine("lmsys/vicuna-7b-v1.3")
ai = Kani(engine)
```
- Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model's config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
llama.cpp#
- class kani.engines.llamacpp.LlamaCppEngine(
- repo_id: str,
- filename: str | None = None,
- max_context_size: int = 0,
- prompt_pipeline: ~kani.prompts.pipeline.PromptPipeline[str | list[int]] = PromptPipeline([Wrap(role=<ChatRole.SYSTEM: 'system'>,
- predicate=None,
- prefix='<<SYS>>\n',
- suffix='\n<</SYS>>\n'),
- TranslateRole(role=<ChatRole.SYSTEM: 'system'>,
- predicate=None,
- to=<ChatRole.USER: 'user'>,
- warn=None),
- MergeConsecutive(role=<ChatRole.USER: 'user'>,
- predicate=None,
- sep='\n',
- joiner=None,
- out_role=<ChatRole.USER: 'user'>),
- MergeConsecutive(role=<ChatRole.ASSISTANT: 'assistant'>,
- predicate=None,
- sep=' ',
- joiner=None,
- out_role=<ChatRole.ASSISTANT: 'assistant'>),
- ConversationFmt(prefix='',
- sep='',
- suffix='',
- generation_suffix='',
- user_prefix='<s>[INST] ',
- user_suffix=' [/INST]',
- assistant_prefix=' ',
- assistant_suffix=' </s>',
- assistant_suffix_if_last='',
- system_prefix='',
- system_suffix='',
- function_prefix='<s>[INST] ',
- function_suffix=' [/INST]')]),
- *,
- model_load_kwargs: dict | None = None,
- **hyperparams,
This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).
This engine defaults to LLaMA 2 Chat 7B with 4-bit quantization.
GPU Support
llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.
To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.

- Parameters:
repo_id – The ID of the model repo to load from Hugging Face.
filename – A filename or glob pattern to match the model file in the repo.
max_context_size – The context size of the model.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See the llama-cpp-python documentation for more info.
hyperparams – Additional arguments to supply the model during generation.
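Putting this together, a usage sketch; the GGUF repo and quantization filename here are illustrative, not defaults, and the GPU offload assumes llama-cpp-python was installed with a GPU backend:

```python
from kani import Kani
from kani.engines.llamacpp import LlamaCppEngine

# hypothetical GGUF repo/file; n_gpu_layers=-1 offloads all layers to GPU
engine = LlamaCppEngine(
    repo_id="TheBloke/Llama-2-7B-Chat-GGUF",
    filename="*.Q4_K_M.gguf",
    model_load_kwargs={"n_gpu_layers": -1},
)
ai = Kani(engine)
```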
- message_len(message: ChatMessage) → int [source]#
Return the length, in tokens, of the given chat message.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.
The default behaviour is to call the supplied pipeline.
- async predict(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)
- async stream(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
- **hyperparams,
Given the current context of messages and available functions, get the next predicted chat message from the LM.
- Parameters:
messages – The messages in the current chat context. sum(message_len(m) for m in messages) is guaranteed to be less than max_context_size.
functions – The functions the LM is allowed to call.
hyperparams – Any additional parameters to pass to Llama.create_completion(). (See https://llama-cpp-python.readthedocs.io/en/latest/api-reference/#llama_cpp.Llama.create_completion)