HuggingFace
If your language model backend is available on HuggingFace or is compatible with transformers' AutoModelForCausalLM interface, kani includes a base engine that implements a prediction pipeline.
New in version 1.0.0: For most models that use a chat format, you won’t even need to create a new engine class. Instead, you can pass a PromptPipeline to the HuggingEngine.
If you do create a new engine, you don’t have to implement the prediction logic yourself: subclass HuggingEngine and implement build_prompt() and message_len().
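As a rough illustration of the subclassing route, the sketch below overrides the two methods named above. The class name SimpleChatEngine, the helper format_prompt, and the "role: text" chat format are all hypothetical. It also assumes that a ChatMessage exposes role.value and text, and that the engine keeps its tokenizer on self.tokenizer; check the kani source before relying on either.

```python
from types import SimpleNamespace


def format_prompt(messages) -> str:
    """Render messages as 'role: text' lines, ending with an open assistant turn.

    Hypothetical chat format for illustration; real models define their own.
    Each message only needs a string `role` and a `text` attribute here.
    """
    lines = [f"{m.role}: {m.text}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)


def make_engine_class():
    """Build the subclass lazily so this file imports without kani installed."""
    from kani.engines.huggingface.base import HuggingEngine

    class SimpleChatEngine(HuggingEngine):
        def build_prompt(self, messages, functions=None):
            # Assumption: ChatMessage.role is an enum whose .value is the role name.
            plain = [SimpleNamespace(role=m.role.value, text=m.text or "") for m in messages]
            return format_prompt(plain)

        def message_len(self, message):
            # Assumption: the engine exposes its tokenizer as `self.tokenizer`.
            rendered = f"{message.role.value}: {message.text or ''}\n"
            return len(self.tokenizer.encode(rendered, add_special_tokens=False))

    return SimpleChatEngine


# The formatting helper can be tried on its own with stand-in messages.
demo = [SimpleNamespace(role="user", text="Hello!")]
print(format_prompt(demo))
```

Keeping the formatting in a plain function makes the chat format easy to check before any model weights are downloaded.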
See also
The source code of the LlamaEngine, which uses the HuggingEngine.
class kani.engines.huggingface.base.HuggingEngine(
    model_id: str,
    max_context_size: int | None = None,
    prompt_pipeline: PromptPipeline[str | Tensor] | None = None,
    *,
    token=None,
    device: str | None = None,
    tokenizer_kwargs: dict | None = None,
    model_load_kwargs: dict | None = None,
    **hyperparams,
)
Base engine for all HuggingFace text-generation models.
This class implements the main decoding logic for any HuggingFace model based on a pretrained AutoModelForCausalLM. As most models use model-specific chat templates, this base class accepts a PromptPipeline to translate kani ChatMessages into a model-specific string.
GPU Support
By default, the HuggingEngine loads the model on GPU if CUDA is detected on your system. To override the device the model is loaded on, pass device="cpu" or device="cuda" to the constructor.
Tip
See 4-bit Quantization (🤗) for information about loading a quantized model for lower memory usage.
Parameters:
model_id – The ID of the model to load from HuggingFace.
max_context_size – The context size of the model. If not given, will be set from the model’s config.
prompt_pipeline – The pipeline to translate a list of kani ChatMessages into the model-specific chat format (see PromptPipeline).
token – The Hugging Face access token (for gated models). Pass True to load from huggingface-cli.
device – The hardware device to use. If not specified, uses CUDA if available; otherwise uses CPU.
tokenizer_kwargs – Additional arguments to pass to AutoTokenizer.from_pretrained().
model_load_kwargs – Additional arguments to pass to AutoModelForCausalLM.from_pretrained().
hyperparams – Additional arguments to supply the model during generation.
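To make the parameter list concrete, here is a minimal sketch of a constructor call. The model ID and every value are placeholders chosen for illustration, and load_engine is a hypothetical helper. The import sits inside the function so the settings can be inspected without kani or transformers installed.

```python
# Placeholder values for illustration; replace them with your own model and settings.
ENGINE_KWARGS = {
    "model_id": "your-org/your-chat-model",  # hypothetical model ID
    "max_context_size": 4096,  # omit to read it from the model's config
    "device": "cpu",  # force CPU even if CUDA is available
    "tokenizer_kwargs": {"use_fast": True},  # forwarded to AutoTokenizer.from_pretrained()
    "model_load_kwargs": {"trust_remote_code": False},  # forwarded to AutoModelForCausalLM.from_pretrained()
    # Anything else is collected into **hyperparams and used during generation.
    "do_sample": True,
    "temperature": 0.7,
}


def load_engine(prompt_pipeline=None, **overrides):
    """Construct the engine; kani and transformers are imported only when called."""
    from kani.engines.huggingface.base import HuggingEngine

    kwargs = {**ENGINE_KWARGS, **overrides}
    return HuggingEngine(prompt_pipeline=prompt_pipeline, **kwargs)
```

Calling load_engine(device="cuda") would override the placeholder device while keeping the other settings.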
build_prompt(
    messages: list[ChatMessage],
    functions: list[AIFunction] | None = None,
)
Given the list of messages from kani, build either a single string representing the prompt for the model, or build the token tensor.
The default behaviour is to call the supplied pipeline.
message_len(message: ChatMessage) -> int
Return the length, in tokens, of the given chat message.
4-bit Quantization (🤗)
If you’re running your model locally, you might run into issues because large language models are, well, large! Unless you pay for a massive compute cluster (💸) or have access to one at your institution, you might not be able to fit models with billions of params on your GPU. That’s where model quantization comes into play.
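A quick back-of-the-envelope calculation shows why precision matters. The sketch below estimates the memory needed for the weights alone (number of parameters times bits per parameter); it ignores activations, the KV cache, and framework overhead, so real usage will be higher. The function name weight_memory_gb is made up for this example.

```python
def weight_memory_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Compare common precisions for a model with 7 billion parameters.
for bits in (32, 16, 8, 4):
    print(f"7B parameters at {bits}-bit: {weight_memory_gb(7_000_000_000, bits):.1f} GB")
```

By this estimate, a 7-billion-parameter model needs roughly 14 GB for its weights in 16-bit precision but only about 3.5 GB in 4-bit.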
Tip
Thanks to the hard work of the LLM community, many models on Hugging Face also have quantized versions available in the GGUF format. GGUF is the format for llama.cpp, a low-level optimized LLM runtime. Despite its name, it supports many more models than LLaMA. If your model has a GGUF version available, consider using the LlamaCppEngine instead of the HuggingEngine to load a pre-quantized version.
In this section, we’ll show how to load HuggingFace models in FP4.
See also
We’re mostly going to follow the HuggingFace documentation found here: https://huggingface.co/docs/transformers/perf_infer_gpu_one
Install Dependencies
First, you’ll need to install kani with the huggingface extra (and any other extras necessary for your engine; we’ll use LLaMA v2 in this example, so you’ll want pip install 'kani[huggingface,llama]').
After that, you’ll need to install bitsandbytes and accelerate. Quote the version specifier so the shell does not treat >= as an output redirection:
$ pip install 'bitsandbytes>=0.39.0' accelerate
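If you want to confirm that the install worked, a short check like the following may help. It is only a sketch using the standard library: parse_version and check_package are hypothetical helpers, and the version comparison is deliberately simple (it stops at the first non-numeric component rather than implementing full version semantics).

```python
from importlib import metadata


def parse_version(text: str) -> tuple:
    """Turn '0.39.0' into (0, 39, 0); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def check_package(name: str, minimum: str = "0") -> str:
    """Report whether `name` is installed and at least version `minimum`."""
    try:
        installed = metadata.version(name)
    except metadata.PackageNotFoundError:
        return f"{name}: not installed"
    ok = parse_version(installed) >= parse_version(minimum)
    return f"{name}: {installed} ({'ok' if ok else 'too old, need >= ' + minimum})"


# Check the two packages this section relies on.
for pkg, minimum in [("bitsandbytes", "0.39.0"), ("accelerate", "0")]:
    print(check_package(pkg, minimum))
```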
Caution
The bitsandbytes library is currently only UNIX-compatible.
Set Load Arguments
Then, you’ll need to set the model_load_kwargs when initializing your model, and use the engine as normal! This example shows the LlamaEngine, but the same arguments should apply to any subclass of the HuggingEngine.
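from transformers import BitsAndBytesConfig

from kani.engines.huggingface.llama2 import LlamaEngine

# Quantize the weights to 4 bits as they are loaded.
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
engine = LlamaEngine(
    token=True,  # reuse the access token saved by huggingface-cli
    model_load_kwargs={
        "device_map": "auto",  # let accelerate place the layers on the available devices
        "quantization_config": quantization_config,
    },
)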
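The example above relies on the BitsAndBytesConfig defaults for the 4-bit data type. If you would rather spell the choice out, a variant might look like the sketch below. It assumes the bnb_4bit_quant_type option of transformers' BitsAndBytesConfig, which accepts "fp4" or "nf4"; the helper names are hypothetical, and the imports are deferred so the settings can be read without a GPU or the libraries installed.

```python
# Keyword arguments for transformers' BitsAndBytesConfig, kept as plain data so
# they can be inspected without importing transformers.
QUANT_SETTINGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "fp4",  # "nf4" is the other supported 4-bit type
}


def make_model_load_kwargs() -> dict:
    """Build the model_load_kwargs dict; imports transformers only when called."""
    from transformers import BitsAndBytesConfig

    return {
        "device_map": "auto",
        "quantization_config": BitsAndBytesConfig(**QUANT_SETTINGS),
    }


def load_quantized_engine():
    """Load the quantized engine (needs a CUDA GPU, bitsandbytes, and accelerate)."""
    from kani.engines.huggingface.llama2 import LlamaEngine

    return LlamaEngine(token=True, model_load_kwargs=make_model_load_kwargs())
```

Switching the quantization type then only requires changing one value in QUANT_SETTINGS.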