HuggingFace

If your language model backend is available on HuggingFace or is compatible with transformers’ AutoModelForCausalLM interface, kani includes a base engine that implements a prediction pipeline.

New in version 1.0.0: For most models that use a chat format, you won’t even need to create a new engine class. Instead, you can pass a PromptPipeline to the HuggingEngine.

If you do create a new engine, you don’t have to implement the prediction logic yourself: all you have to do is subclass HuggingEngine and implement build_prompt() and message_len().
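To give a feel for what those two methods are responsible for, here is a rough, library-free sketch. Everything in it is an assumption made for illustration: the Message class is a hypothetical stand-in for a chat message, the "role: content" prompt format is made up, the token count is approximated by whitespace splitting, and the functions are free functions rather than methods. Check kani's API reference for the real method signatures before writing an engine.

```python
from dataclasses import dataclass


@dataclass
class Message:
    # Hypothetical stand-in for a chat message; not kani's own message class.
    role: str
    content: str


def build_prompt(messages: list[Message]) -> str:
    """Join the chat history into one prompt string (made-up 'role: content' format)."""
    lines = [f"{m.role}: {m.content}" for m in messages]
    # End with an open assistant turn so the model continues from there.
    return "\n".join(lines) + "\nassistant:"


def message_len(message: Message) -> int:
    """Approximate a message's token length.

    Whitespace splitting is only a placeholder; a real engine would count
    tokens with the model's tokenizer.
    """
    return len(f"{message.role}: {message.content}".split())


if __name__ == "__main__":
    history = [Message("system", "Be brief."), Message("user", "Hi there")]
    print(build_prompt(history))
    print([message_len(m) for m in history])
```

The split mirrors the division of labor described above: one function decides how the conversation is laid out for the model, and the other reports how much of the context window each message uses.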

4-bit Quantization (🤗)

If you’re running your model locally, you might run into issues because large language models are, well, large! Unless you pay for a massive compute cluster (💸) or have access to one at your institution, you might not be able to fit models with billions of params on your GPU. That’s where model quantization comes into play.
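To make the size problem concrete, the back-of-the-envelope calculation below estimates how much memory the weights alone take at different precisions. It is only a sketch: the 7-billion-parameter model is a hypothetical example, and the estimate ignores activations, the KV cache, and any per-block overhead the quantization scheme adds, so real usage will be higher.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Return the approximate size of the model weights in GiB."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 1024**3


if __name__ == "__main__":
    n_params = 7e9  # hypothetical 7-billion-parameter model
    for label, bits in [("FP32", 32), ("FP16", 16), ("4-bit", 4)]:
        print(f"{label:>5}: {weight_memory_gib(n_params, bits):.1f} GiB")
```

Under these assumptions, the weights drop from roughly 13 GiB in FP16 to a little over 3 GiB at 4 bits, which is often the difference between fitting on a single consumer GPU and not fitting at all.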

Tip

Thanks to the hard work of the LLM community, many models on Hugging Face also have quantized versions available in the GGUF format. GGUF is the format for llama.cpp, a low-level optimized LLM runtime. Despite what the name suggests, it supports many more models than LLaMA. If your model has a GGUF version available, consider using the LlamaCppEngine instead of the HuggingEngine to load a pre-quantized version.

In this section, we’ll show how to load HuggingFace models in FP4.

See also

We’re mostly going to follow the HuggingFace documentation found here: https://huggingface.co/docs/transformers/perf_infer_gpu_one

Install Dependencies

First, you’ll need to install kani with the huggingface extra, plus any other extras necessary for your engine. We’ll use LLaMA v2 in this example, so you’ll want pip install 'kani[huggingface,llama]'.

After that, you’ll need to install bitsandbytes and accelerate:

$ pip install 'bitsandbytes>=0.39.0' accelerate

Caution

The bitsandbytes library is currently only UNIX-compatible.
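Before loading a model, it can help to confirm that both packages actually landed in the active environment. The snippet below is a small sketch using only the standard library; the helper names installed_version and check_quantization_deps are made up for this example and are not part of kani, bitsandbytes, or accelerate.

```python
import importlib.metadata
import platform
from typing import Dict, Optional, Sequence


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return importlib.metadata.version(package)
    except importlib.metadata.PackageNotFoundError:
        return None


def check_quantization_deps(
    packages: Sequence[str] = ("bitsandbytes", "accelerate"),
) -> Dict[str, Optional[str]]:
    """Map each required package name to its installed version (or None)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    # Print the OS too, since platform support for bitsandbytes is limited.
    print("platform:", platform.system())
    for name, version in check_quantization_deps().items():
        print(f"{name}: {version or 'not installed'}")
```

If either package shows up as not installed, rerun the pip command above in the same environment you use to run kani.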

Set Load Arguments

Then, you’ll need to set the model_load_kwargs when initializing your model, and use the engine as normal! This example shows the LlamaEngine, but the same arguments should apply to any subclass of the HuggingEngine.

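from transformers import BitsAndBytesConfig
from kani.engines.huggingface.llama2 import LlamaEngine

# Quantize the weights to 4 bits with bitsandbytes (FP4 is the default 4-bit type).
quantization_config = BitsAndBytesConfig(load_in_4bit=True)

engine = LlamaEngine(
    use_auth_token=True,  # authenticate with your Hugging Face account
    # These kwargs are passed through when the underlying model is loaded.
    model_load_kwargs={
        "device_map": "auto",  # let accelerate place the weights on available devices
        "quantization_config": quantization_config,
    },
)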