llama.cpp

If your language model is available as a GGUF file, kani includes a base engine that implements a prediction pipeline.

kani uses llama-cpp-python for binding to the llama.cpp runtime.

New in version 1.0.0: For most models that use a chat format, you won’t even need to create a new engine class - instead, you can pass a PromptPipeline to the LlamaCppEngine.

If you do create a new engine, you don't need to reimplement the prediction logic; just subclass LlamaCppEngine and implement build_prompt() and message_len().
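For example, here is a minimal sketch of the pipeline route. The Hugging Face repo ID, filename pattern, and chat-format control tokens below are placeholders rather than any real model's format, and the builder-style pipeline methods mirror the steps visible in the default signature further down; check the PromptPipeline documentation for the exact API.

from kani import ChatRole, Kani, PromptPipeline, chat_in_terminal
from kani.engines.llamacpp import LlamaCppEngine

# Describe the model's chat format as a pipeline of transformations on ChatMessages.
pipeline = (
    PromptPipeline()
    .merge_consecutive(role=ChatRole.USER, sep="\n")
    .conversation_fmt(
        user_prefix="<|user|>\n",
        user_suffix="<|end|>\n",
        assistant_prefix="<|assistant|>\n",
        assistant_suffix="<|end|>\n",
        assistant_suffix_if_last="",
        system_prefix="<|system|>\n",
        system_suffix="<|end|>\n",
    )
)

# Pass the pipeline to the base engine instead of subclassing it.
engine = LlamaCppEngine(
    repo_id="your-org/your-model-GGUF",  # placeholder repo ID
    filename="*.Q4_K_M.gguf",            # glob matching the quantized model file
    prompt_pipeline=pipeline,
)

ai = Kani(engine)
chat_in_terminal(ai)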

class kani.engines.llamacpp.LlamaCppEngine(
repo_id: str,
filename: str | None = None,
max_context_size: int = 0,
prompt_pipeline: PromptPipeline[str | list[int]] = PromptPipeline([Wrap(role=<ChatRole.SYSTEM: 'system'>,
predicate=None,
prefix='<<SYS>>\n',
suffix='\n<</SYS>>\n'),
TranslateRole(role=<ChatRole.SYSTEM: 'system'>,
predicate=None,
to=<ChatRole.USER: 'user'>,
warn=None),
MergeConsecutive(role=<ChatRole.USER: 'user'>,
predicate=None,
sep='\n',
joiner=None,
out_role=<ChatRole.USER: 'user'>),
MergeConsecutive(role=<ChatRole.ASSISTANT: 'assistant'>,
predicate=None,
sep=' ',
joiner=None,
out_role=<ChatRole.ASSISTANT: 'assistant'>),
ConversationFmt(prefix='',
sep='',
suffix='',
generation_suffix='',
user_prefix='<s>[INST] ',
user_suffix=' [/INST]',
assistant_prefix=' ',
assistant_suffix=' </s>',
assistant_suffix_if_last='',
system_prefix='',
system_suffix='',
function_prefix='<s>[INST] ',
function_suffix=' [/INST]')]),
*,
model_load_kwargs: dict | None = None,
**hyperparams,
)

This class implements the main decoding logic for any GGUF model (not just LLaMA as the name might suggest).

This engine defaults to LLaMA 2 Chat 7B with 4-bit quantization.

GPU Support

llama.cpp supports multiple acceleration backends, which may require different flags to be set during installation. For the full list of backends and their installation flags, see the llama-cpp-python README at https://github.com/abetlen/llama-cpp-python.

To load some or all of the model layers on GPU, pass n_gpu_layers=... in the model_load_kwargs. Use -1 to specify all layers.
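For example, a sketch that offloads every layer to the GPU (the repo ID and filename are placeholders):

from kani.engines.llamacpp import LlamaCppEngine

engine = LlamaCppEngine(
    repo_id="your-org/your-model-GGUF",      # placeholder repo ID
    filename="*.Q4_K_M.gguf",
    model_load_kwargs={"n_gpu_layers": -1},  # -1 offloads all layers to the GPU
)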

Parameters:
  • repo_id – The ID of the model repo to load from Hugging Face.

  • filename – A filename or glob pattern to match the model file in the repo.

  • max_context_size – The context size of the model.

  • prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).

  • model_load_kwargs – Additional arguments to pass to Llama.from_pretrained(). See the llama-cpp-python documentation for more info.

  • hyperparams – Additional arguments to supply the model during generation.
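As an illustration of the last two parameters, here is a sketch that caps the context size and forwards one sampling hyperparameter; temperature is a llama-cpp-python generation argument used here as an assumption, and the repo ID is a placeholder.

from kani.engines.llamacpp import LlamaCppEngine

engine = LlamaCppEngine(
    repo_id="your-org/your-model-GGUF",  # placeholder repo ID
    filename="*.Q4_K_M.gguf",
    max_context_size=4096,               # token budget kani manages for the chat history
    temperature=0.7,                     # forwarded to the underlying generation call
)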

build_prompt(
messages: list[ChatMessage],
functions: list[AIFunction] | None = None,
) → str | list[int]

Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token list.

The default behaviour is to call the supplied pipeline.

message_len(message: ChatMessage) → int

Return the length, in tokens, of the given chat message.
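Putting both methods together, here is a minimal subclass sketch. The chat-format tokens are invented for illustration, and accessing the underlying llama_cpp.Llama instance as self.model is an assumption about the engine's internals; verify both against your model's actual format and the engine source.

from kani import ChatMessage
from kani.engines.llamacpp import LlamaCppEngine


class MyChatFormatEngine(LlamaCppEngine):
    def _format_message(self, message: ChatMessage) -> str:
        # Invented chat format, purely for illustration.
        return f"<|{message.role.value}|>\n{message.text}<|end|>\n"

    def build_prompt(self, messages, functions=None):
        # Concatenate every message, then end with the assistant prefix so the
        # model generates the next assistant turn.
        return "".join(self._format_message(m) for m in messages) + "<|assistant|>\n"

    def message_len(self, message: ChatMessage) -> int:
        # self.model is assumed to be the underlying llama_cpp.Llama instance;
        # its tokenizer operates on bytes.
        formatted = self._format_message(message)
        return len(self.model.tokenize(formatted.encode(), add_bos=False, special=True))

An instance of the subclass is then constructed exactly like the base engine, e.g. MyChatFormatEngine(repo_id=..., filename=...).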