LlamaCppEngine¶
If your language model backend is available with GGUF, kani includes a base engine that implements a prediction pipeline.
TL;DR
from kani.engines.huggingface import ChatTemplatePromptPipeline
from kani.engines.llamacpp import LlamaCppEngine
pipeline = ChatTemplatePromptPipeline.from_pretrained("org-id/base-model-id")
engine = LlamaCppEngine(repo_id="org-id/quant-model-id", filename="*.your-quant-type.gguf", prompt_pipeline=pipeline)
Important
Added in version 1.4.0: For most models that use a chat template, you won’t need to create a new engine class - kani will automatically use a Chat Template if a model has one included.
This means you can safely ignore this section of the documentation for most use cases! Just use:
from kani.engines.llamacpp import LlamaCppEngine
engine = LlamaCppEngine(repo_id="your-org/your-model-id", filename="*Q4_K_M.gguf")
kani uses llama-cpp-python for binding to the llama.cpp runtime.
Added in version 1.0.0: For most models that use a chat format, you won’t even need to create a new engine class - instead, you can pass
a PromptPipeline to the LlamaCppEngine.
If you do create a new engine, instead of having to implement the prediction logic, all you have to do is subclass
LlamaCppEngine and implement build_prompt().
- class kani.engines.llamacpp.LlamaCppEngine(
- repo_id: str | None = None,
- filename: str | None = None,
- model_path: str | None = None,
- max_context_size: int = 0,
- prompt_pipeline: PromptPipeline[str | list[int]] = None,
- *,
- model_load_kwargs: dict = None,
- **hyperparams,
This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).
GPU Support
llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. To see the full list of backends, see their README at https://github.com/abetlen/llama-cpp-python.
To load some or all of the model layers on GPU, pass
n_gpu_layers=...in themodel_load_kwargs. Use-1to specify all layers.- Parameters:
repo_id – The ID of the model repo to load from Hugging Face. If this is set,
filenamemust be set andmodel_pathmay not be set.filename – A filename or glob pattern to match the model file in the Hugging Face repo. If this is set,
repo_idmust be set andmodel_pathmay not be set.model_path – A path to the model files on local disk. If this is set, neither
repo_idnorfilenamemay be set.max_context_size – The context size of the model.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see
PromptPipeline).model_load_kwargs – Additional arguments to pass to
Llama.from_pretrained(). See this link for more info.hyperparams – Additional arguments to supply the model during generation.
- build_prompt(
- messages: list[ChatMessage],
- functions: list[AIFunction] | None = None,
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.
The default behaviour is to call the supplied pipeline.