bitsandbytes

 ✒️ @halomaster 📅 28 May 2023, 06:47 GMT⋮ 【AI】 

`bitsandbytes` is a lightweight wrapper around CUDA custom functions, focusing on 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

## Resources
- 8-bit Optimizer: [Paper](#) | [Video](#) | [Docs](#)
- LLM.int8(): [Paper](#) | [Software Blog Post](#) | [Emergent Features Blog Post](#)

## TL;DR
**Requirements**: Python >=3.8, Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0. (Deprecated: CUDA 10.0 is deprecated and only CUDA >= 11.0 will be supported with release 0.39.0.)

**Installation**:
```bash
pip install bitsandbytes
```

## Compilation Quickstart
```bash
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes

# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know what CUDA you have, try looking at the output of: python -m bitsandbytes

CUDA_VERSION=117 make cuda11x
python setup.py install
```

## Using Int8 Inference with HuggingFace Transformers
```python
import torch
from transformers import AutoModelForCausalLM

# max_memory expects a dict mapping device id to a memory cap; extend it for multi-GPU setups
model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    device_map='auto',
    load_in_8bit=True,
    max_memory={0: f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB'})
```
A more detailed example can be found in `examples/int8_inference_huggingface.py`; a minimal generation sketch is also included at the end of this post.

## Using 8-bit Optimizer
1. Comment out optimizer: `#torch.optim.Adam(....)`
2. Add 8-bit optimizer of your choice `bnb.optim.Adam8bit(....)` (arguments stay the same)
3. Replace embedding layer if necessary: `torch.nn.Embedding(..) -> bnb.nn.Embedding(..)` (a short Stable Embedding sketch is included at the end of this post)

## Using 8-bit Inference
1. Comment out `torch.nn.Linear`: `#linear = torch.nn.Linear(...)`
2. Add `bnb` 8-bit linear light module: `linear = bnb.nn.Linear8bitLt(...)` (base arguments stay the same)

There are two modes:
1. Mixed 8-bit training with 16-bit main weights. Pass the argument `has_fp16_weights=True` (default)
2. Int8 inference. Pass the argument `has_fp16_weights=False`

To use the full LLM.int8() method, use the `threshold=k` argument. We recommend `k=6.0`.
```python
# LLM.int8()
linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
# inputs need to be fp16
out = linear(x.to(torch.float16))
```

## Features
- 8-bit Matrix multiplication with mixed precision decomposition
- LLM.int8() inference
- 8-bit Optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
- Stable Embedding Layer: Improved stability through better initialization and normalization
- 8-bit quantization: Quantile, Linear, and Dynamic quantization
- Fast quantile estimation: Up to 100x faster than other algorithms

## Requirements & Installation
Requirements: anaconda, cudatoolkit, pytorch

**Hardware requirements**:
- LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
- 8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).
- Supported CUDA versions: 10.2 - 12.0

`bitsandbytes` is currently only supported on Linux distributions. Windows is not supported at the moment.

The requirements can best be fulfilled by installing PyTorch via anaconda. You can install PyTorch by following the "Get Started" instructions on the official website.

To install, run:
```bash
pip install bitsandbytes
```

## Using bitsandbytes

### Using Int8 Matrix Multiplication
For straight Int8 matrix multiplication with mixed precision decomposition, you can use `bnb.matmul(...)`.
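As a rough illustration, here is a minimal sketch of such a call. It assumes the same conventions as `Linear8bitLt`: fp16 CUDA tensors on a GPU supported by LLM.int8(), with the second argument laid out as an `(out_features, in_features)` weight matrix; treat the shape handling as an assumption rather than documented behaviour.
```python
import torch
import bitsandbytes as bnb

# fp16 inputs on the GPU (the int8 kernels require CUDA)
A = torch.randn(4, 1024, dtype=torch.float16, device='cuda')     # activations: (batch, in_features)
W = torch.randn(2048, 1024, dtype=torch.float16, device='cuda')  # weight: (out_features, in_features)

# 8-bit matrix multiplication; the result has shape (batch, out_features)
out = bnb.matmul(A, W)
```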
To enable mixed precision decomposition, use the `threshold` parameter:
```python
bnb.matmul(..., threshold=6.0)
```
For instructions on how to use LLM.int8() inference layers in your PyTorch models, please refer to the [Using 8-bit Inference](#using-8-bit-inference) section above.

### Using 8-bit Optimizers
To use an 8-bit optimizer, simply replace your current PyTorch optimizer with the corresponding 8-bit optimizer from `bitsandbytes`. For example, if you're currently using the Adam optimizer:
```python
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
```
Replace it with the 8-bit version from `bitsandbytes`:
```python
import bitsandbytes as bnb
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-3)
```

### Using Quantization Functions
`bitsandbytes` offers several quantization functions through the `bitsandbytes.functional` module. The most commonly used schemes are:
1. Linear quantization
2. Quantile quantization
3. Dynamic quantization

(The function names below follow `bitsandbytes.functional`; the exact API may differ between versions, so treat these snippets as a sketch rather than a stable interface.)

Here's an example of blockwise 8-bit quantization with a linear (uniform) quantization map:
```python
import torch
import bitsandbytes.functional as F

x = torch.tensor([0.0, 0.5, 1.0, 1.5, 2.0], dtype=torch.float32)

# linear (uniform) 8-bit quantization map
code = F.create_linear_map(signed=True)
q_x, quant_state = F.quantize_blockwise(x, code=code)
```
And here's an example of quantile quantization, where the 8-bit code is estimated from the data itself (this runs on the GPU):
```python
x_gpu = x.cuda()
code = F.estimate_quantiles(x_gpu)       # 256 quantiles of the input
q_x = F.quantize_no_absmax(x_gpu, code)  # dequantize with F.dequantize_no_absmax(q_x, code)
```
You can also use dynamic quantization, a non-uniform 8-bit code that is the default used by `quantize_blockwise`:
```python
code = F.create_dynamic_map()
q_x, quant_state = F.quantize_blockwise(x, code=code)
```
To dequantize the tensor and recover the original values (with some loss of precision), pass the state returned by `quantize_blockwise` to `dequantize_blockwise`:
```python
x_dequantized = F.dequantize_blockwise(q_x, quant_state)
```

## Notes and Limitations
1. `bitsandbytes` is currently only supported on Linux distributions. Windows is not supported.
2. The library is designed to work with NVIDIA GPUs: the LLM.int8() inference feature requires the Turing (RTX 20xx; T4) or Ampere (RTX 30xx; A4-A100) architecture, while the 8-bit optimizers and quantization require an NVIDIA Kepler GPU or newer (>=GTX 78X).
3. The library is compatible with CUDA versions 10.2 - 12.0. Note that CUDA 10.0 is deprecated and support will be dropped with release 0.39.0.
4. The tensor inputs to the 8-bit inference layers should be in FP16 format.
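To round out the HuggingFace Int8 example above, here is a minimal generation sketch. The prompt and generation settings are illustrative assumptions (not taken from `examples/int8_inference_huggingface.py`), and this particular checkpoint may need a tokenizer workaround depending on your `transformers` version.
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = 'decapoda-research/llama-7b-hf'
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map='auto',
    load_in_8bit=True)

# Move the prompt to the model's device and generate a short continuation.
inputs = tokenizer('Quantization reduces memory usage because', return_tensors='pt').to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```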
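Regarding the embedding-layer swap mentioned in the "Using 8-bit Optimizer" steps above, here is a minimal sketch of the Stable Embedding layer; it assumes `bnb.nn.StableEmbedding` accepts the usual `torch.nn.Embedding` constructor arguments.
```python
import torch
import bitsandbytes as bnb

vocab_size, hidden_dim = 32000, 1024

# Drop-in replacement for torch.nn.Embedding with better initialization and
# layer normalization, intended to be trained together with 8-bit optimizers.
emb = bnb.nn.StableEmbedding(vocab_size, hidden_dim)

token_ids = torch.randint(0, vocab_size, (2, 16))
hidden_states = emb(token_ids)  # shape: (2, 16, hidden_dim)
```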


[1] @halomaster • 28 May 2023, 07:11 GMT 
# bitsandbytes

`bitsandbytes` is a lightweight wrapper around CUDA custom functions, focusing on 8-bit optimizers, matrix multiplication (LLM.int8()), and quantization functions.

## TL;DR
**Requirements**: Python >= 3.8, a Linux distribution (Ubuntu, MacOS, etc.) + CUDA > 10.0. (Deprecated: CUDA 10.0 is deprecated; from release 0.39.0 only CUDA >= 11.0 will be supported.)

**Installation**:
```bash
pip install bitsandbytes
```

## Compilation Quickstart
```bash
git clone https://github.com/timdettmers/bitsandbytes.git
cd bitsandbytes

# CUDA_VERSIONS in {110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120}
# make argument in {cuda110, cuda11x, cuda12x}
# if you do not know your CUDA version, check the output of: python -m bitsandbytes

CUDA_VERSION=117 make cuda11x
python setup.py install
```

## Int8 Inference with HuggingFace Transformers
```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'decapoda-research/llama-7b-hf',
    device_map='auto',
    load_in_8bit=True,
    max_memory={0: f'{int(torch.cuda.mem_get_info()[0]/1024**3)-2}GB'})
```
A more detailed example can be found in `examples/int8_inference_huggingface.py`.

## Using 8-bit Optimizers
1. Comment out the optimizer: `#torch.optim.Adam(....)`
2. Add the 8-bit optimizer of your choice: `bnb.optim.Adam8bit(....)` (arguments stay the same)
3. Replace the embedding layer if necessary: `torch.nn.Embedding(..) -> bnb.nn.Embedding(..)`

## Using 8-bit Inference
1. Comment out `torch.nn.Linear`: `#linear = torch.nn.Linear(...)`
2. Add the `bnb` 8-bit linear module: `linear = bnb.nn.Linear8bitLt(...)` (base arguments stay the same)

There are two modes:
1. Mixed 8-bit training with 16-bit main weights. Pass the argument `has_fp16_weights=True` (default)
2. Int8 inference. Pass the argument `has_fp16_weights=False`

To use the full LLM.int8() method, use the `threshold=k` argument. We recommend `k=6.0`.
```python
# LLM.int8()
linear = bnb.nn.Linear8bitLt(dim1, dim2, bias=True, has_fp16_weights=False, threshold=6.0)
# inputs need to be fp16
out = linear(x.to(torch.float16))
```

## Features
- 8-bit matrix multiplication with mixed precision decomposition
- LLM.int8() inference
- 8-bit optimizers: Adam, AdamW, RMSProp, LARS, LAMB, Lion (saves 75% memory)
- Stable Embedding Layer: improved stability through better initialization and normalization
- 8-bit quantization: quantile, linear, and dynamic quantization
- Fast quantile estimation: up to 100x faster than other algorithms

## Requirements & Installation
Requirements: anaconda, cudatoolkit, pytorch

**Hardware requirements**:
- LLM.int8(): NVIDIA Turing (RTX 20xx; T4) or Ampere GPU (RTX 30xx; A4-A100); (a GPU from 2018 or newer).
- 8-bit optimizers and quantization: NVIDIA Kepler GPU or newer (>=GTX 78X).

If you have any other questions, feel free to ask.

