HuggingEngine
If your language model backend is available on HuggingFace or is compatible with transformers’
AutoModelForCausalLM interface, kani includes a base engine that implements a prediction pipeline.
TL;DR
from kani.engines.huggingface import HuggingEngine
engine = HuggingEngine(model_id="org-id/model-id")
Important
Added in version 1.2.0: For most models that use a chat template, you won’t need to create a new engine class - kani will automatically use a Chat Template if a model has one included.
This means you can safely ignore this section of the documentation for most use cases! Just use:
from kani.engines.huggingface import HuggingEngine
engine = HuggingEngine(model_id="your-org/your-model-id")
Added in version 1.0.0: For more control over the prompting of a chat model, you can pass a PromptPipeline to
the HuggingEngine.
If you do create a new engine, instead of having to implement the prediction logic, all you have to do is subclass
HuggingEngine and implement build_prompt().
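As a rough illustration of that subclassing approach, the sketch below renders messages in a ChatML-style format. The helper format_chatml, the factory make_engine, and the class ChatMLEngine are hypothetical names, and the sketch assumes each kani ChatMessage exposes role.value and text. The special tokens depend on your model, so treat them as placeholders.

```python
def format_chatml(turns):
    """Render (role, text) pairs as a ChatML-style prompt string (hypothetical helper)."""
    parts = [f"<|im_start|>{role}\n{text}<|im_end|>\n" for role, text in turns]
    # Leave the prompt open so the model generates the assistant's reply.
    parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def make_engine(model_id: str):
    """Build an engine whose build_prompt() uses the formatter above."""
    # Imported lazily so format_chatml can be tried without kani installed.
    from kani.engines.huggingface import HuggingEngine

    class ChatMLEngine(HuggingEngine):
        def build_prompt(self, messages, functions=None):
            # Assumption: ChatMessage has .role (an enum with .value) and .text.
            return format_chatml([(m.role.value, m.text or "") for m in messages])

    return ChatMLEngine(model_id=model_id)
```

Keeping the string formatting in a plain function makes it easy to check the prompt layout before loading any model weights.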
Multimodal Support
The HuggingEngine will attempt to load a multimodal model’s AutoProcessor if available, and format
any multimodal parts found in the input correctly for the multimodal model.
For audio/video models, you should specify mm_audio_sample_rate based on the sampling rate expected by the model.
For certain models, you may need to override tokenizer_cls or model_cls. For example, to load the
Qwen/Qwen3-Omni-30B-A3B-Instruct model:
from kani.engines.huggingface import HuggingEngine
from transformers import Qwen3OmniMoeProcessor, Qwen3OmniMoeThinkerForConditionalGeneration

engine = HuggingEngine(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    max_context_size=32000,
    mm_audio_sample_rate=16000,  # the sample rate the model expects for audio input
    model_cls=Qwen3OmniMoeThinkerForConditionalGeneration,
    tokenizer_cls=Qwen3OmniMoeProcessor,
    eos_token_id=[151645],  # <|im_end|>
)
Quantization With BitsAndBytes
If you’re running your model locally, you might run into issues because large language models are, well, large! Unless you pay for a massive compute cluster (💸) or have access to one at your institution, you might not be able to fit models with billions of params on your GPU. That’s where model quantization comes into play.
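To make the size problem concrete, here is a back-of-the-envelope estimate of the memory needed just to hold a model's weights at different precisions. The function name weight_memory_gib is hypothetical. The estimate ignores activations, the KV cache, and quantization metadata, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB."""
    bytes_total = n_params * bits_per_param / 8
    return bytes_total / 1024**3


# Compare a 7-billion-parameter model at common precisions.
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit: {weight_memory_gib(7e9, bits):.1f} GiB")
```

Under these assumptions, a 7B model needs roughly 13 GiB at 16-bit precision but only about 3.3 GiB at 4-bit.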
Tip
Thanks to the hard work of the LLM community, many models on Hugging Face also have quantized versions available
in the GGUF format. GGUF is the format for llama.cpp, a low-level optimized LLM runtime. Unlike the name
suggests, it supports many more models than LLaMA. If your model has a GGUF version available, consider using the
LlamaCppEngine instead of the HuggingEngine to load a pre-quantized version.
In this section, we’ll show how to load HuggingFace models in FP4.
See also
We’re mostly going to follow the HuggingFace documentation found here: https://huggingface.co/docs/transformers/perf_infer_gpu_one
Install Dependencies
First, you’ll need to install kani with the huggingface extra (and any other extras necessary for your engine;
we’ll use LLaMA v2 in this example, so you’ll want pip install 'kani[huggingface,llama]'.)
After that, you’ll need to install bitsandbytes and accelerate:
$ pip install 'bitsandbytes>=0.39.0' accelerate
Caution
The bitsandbytes library is currently only UNIX-compatible.
Set Load Arguments
Then, you’ll need to set the model_load_kwargs when initializing your model, and use the engine as normal!
from transformers import BitsAndBytesConfig
from kani.engines.huggingface import HuggingEngine

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
engine = HuggingEngine(
    model_id="meta-llama/Llama-2-7b-chat-hf",
    model_load_kwargs={
        "quantization_config": quantization_config,
    },
)
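Using the engine "as normal" might look like the sketch below, which wraps the quantized engine in a Kani and asks a single question. The helper names bnb_config_kwargs, build_kani, and ask are hypothetical. The sketch assumes kani's Kani class and its chat_round_str() coroutine, and the "nf4" option is bitsandbytes' alternative 4-bit type rather than something this guide requires.

```python
def bnb_config_kwargs(quant_type: str = "fp4") -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig (4-bit loading)."""
    if quant_type not in ("fp4", "nf4"):
        raise ValueError(f"unsupported 4-bit quantization type: {quant_type!r}")
    return {"load_in_4bit": True, "bnb_4bit_quant_type": quant_type}


def build_kani(model_id: str = "meta-llama/Llama-2-7b-chat-hf", quant_type: str = "fp4"):
    """Load the model in 4-bit and wrap the engine in a Kani."""
    # Heavy imports are deferred so the helper above can be used on its own.
    from kani import Kani
    from kani.engines.huggingface import HuggingEngine
    from transformers import BitsAndBytesConfig

    quantization_config = BitsAndBytesConfig(**bnb_config_kwargs(quant_type))
    engine = HuggingEngine(
        model_id=model_id,
        model_load_kwargs={"quantization_config": quantization_config},
    )
    return Kani(engine)


async def ask(question: str) -> str:
    """Run one chat round against the quantized model and return the reply text."""
    ai = build_kani()
    return await ai.chat_round_str(question)
```

To try it, run asyncio.run(ask("Hello!")) from a script. Note that this loads the model weights, so it needs the dependencies above and enough GPU memory.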
Reference
class kani.engines.huggingface.HuggingEngine(
    model_id: str,
    max_context_size: int = None,
    prompt_pipeline: PromptPipeline[str | torch.Tensor] = None,
    *,
    token=None,
    device: str | None = None,
    tokenizer_cls=None,
    tokenizer_kwargs: dict = None,
    model_cls=AutoModelForCausalLM,
    model_load_kwargs: dict = None,
    chat_template_reasoning_content_key: str = None,
    chat_template_kwargs: dict = None,
    mm_audio_sample_rate: int = None,
    mm_video_fps: float = 1,
    token_reserve: int = 0,
    **hyperparams,
)
Base engine for all HuggingFace text-generation models.
This class implements the main decoding logic for any HuggingFace model based on a pretrained
AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.
Added in version 1.2.0: By default, the HuggingEngine uses models’ bundled chat template to build the prompt for chat-based models available on Hugging Face. See https://huggingface.co/docs/transformers/main/en/chat_templating for more information.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device to the constructor (for example, device="cpu" or device="cuda").
Multimodal support: audio, images, video (depending on model).
Tip
See Quantization With BitsAndBytes for information about loading a quantized model for lower memory usage.
Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model’s config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline). If not passed, uses the Hugging Face chat template if available.
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA or MPS if available; otherwise uses CPU.
tokenizer_cls – Advanced use cases: The HF tokenizer class to use. Defaults to AutoProcessor (if no processing config is available or this raises an error, this will fall back to AutoTokenizer).
tokenizer_kwargs – Additional arguments to pass to AutoProcessor.from_pretrained().
model_cls – Advanced use cases: The HF model class to use. Defaults to AutoModelForCausalLM.
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
chat_template_reasoning_content_key – The key of each message dict that any reasoning content should be set at.
chat_template_kwargs – The keyword arguments to pass to tokenizer.apply_chat_template if using a chat template prompt pipeline.
mm_audio_sample_rate – The sample rate to remux audio inputs to. Check your model’s documentation for the expected sample rate. By default, does not change the sample rate of the input file.
mm_video_fps – The number of image frames to sample per second of video input.
hyperparams – Additional arguments to supply the model during generation.
token_reserve – DEPRECATED: The number of tokens to reserve for internal engine mechanisms (e.g. if there is a generation template after the last user message). If not passed, kani will attempt to infer this from a prompt pipeline.
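The documented default for device (CUDA, then MPS, then CPU) can be mirrored in your own code if you want to choose the device explicitly. This is a sketch of that fallback order, not kani's actual implementation. The names pick_device and detect_device are hypothetical, and the probing calls assume a PyTorch version that includes the MPS backend.

```python
def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Return a device string following the CUDA > MPS > CPU order."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


def detect_device() -> str:
    """Probe the local machine with PyTorch and pick a device."""
    # Imported lazily so pick_device can be used without PyTorch installed.
    import torch

    return pick_device(torch.cuda.is_available(), torch.backends.mps.is_available())
```

The result can then be passed to the constructor, e.g. HuggingEngine(model_id="org-id/model-id", device=detect_device()).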
build_prompt(
    messages: list[ChatMessage],
    functions: list[AIFunction] | None = None,
)
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.
The default behaviour is to call the supplied pipeline.