LlamaCppEngine

If your language model backend is available with GGUF, kani includes a base engine that implements a prediction pipeline.

TL;DR

from kani.engines.huggingface import ChatTemplatePromptPipeline
from kani.engines.llamacpp import LlamaCppEngine
pipeline = ChatTemplatePromptPipeline.from_pretrained("org-id/base-model-id")
engine = LlamaCppEngine(repo_id="org-id/quant-model-id", filename="*.your-quant-type.gguf", prompt_pipeline=pipeline)

Important

Added in version 1.4.0: For most models that use a chat template, you won’t need to create a new engine class - kani will automatically use a Chat Template if a model has one included.

This means you can safely ignore this section of the documentation for most use cases! Just use:

from kani.engines.llamacpp import LlamaCppEngine
engine = LlamaCppEngine(repo_id="your-org/your-model-id", filename="*Q4_K_M.gguf")

kani uses llama-cpp-python for binding to the llama.cpp runtime.

Added in version 1.0.0: For most models that use a chat format, you won’t even need to create a new engine class - instead, you can pass a PromptPipeline to the LlamaCppEngine.

If you do create a new engine, instead of having to implement the prediction logic, all you have to do is subclass LlamaCppEngine and implement build_prompt().

class kani.engines.llamacpp.LlamaCppEngine(
repo_id: str | None = None,
filename: str | None = None,
model_path: str | None = None,
max_context_size: int = 0,
prompt_pipeline: PromptPipeline[str | list[int]] = None,
*,
model_load_kwargs: dict = None,
**hyperparams,
)[source]

This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).

GPU Support

llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.

To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.

Parameters:
  • repo_id – The ID of the model repo to load from Hugging Face. If this is set, filename must be set and model_path may not be set.

  • filename – A filename or glob pattern to match the model file in the Hugging Face repo. If this is set, repo_id must be set and model_path may not be set.

  • model_path – A path to the model files on local disk. If this is set, neither repo_id nor filename may be set.

  • max_context_size – The context size of the model.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See this link for more info.

  • hyperparams – Additional arguments to supply the model during generation.

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) str | list[int][source]

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.

The default behaviour is to call the supplied pipeline.