Add option to keep output and embed tensors at f16 #8715


Closed · wants to merge 3 commits
13 changes: 10 additions & 3 deletions convert_hf_to_gguf.py
@@ -63,14 +63,15 @@ class Model:
model_name: str | None
metadata_override: Path | None
dir_model_card: Path
z: bool

# subclasses should define this!
model_arch: gguf.MODEL_ARCH

def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path, is_big_endian: bool = False,
use_temp_file: bool = False, eager: bool = False,
metadata_override: Path | None = None, model_name: str | None = None,
split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False, small_first_shard: bool = False):
split_max_tensors: int = 0, split_max_size: int = 0, dry_run: bool = False, small_first_shard: bool = False, z: bool = False):
if type(self) is Model:
raise TypeError(f"{type(self).__name__!r} should not be directly instantiated")

@@ -92,6 +93,7 @@ def __init__(self, dir_model: Path, ftype: gguf.LlamaFileType, fname_out: Path,
self.metadata_override = metadata_override
self.model_name = model_name
self.dir_model_card = dir_model # overridden in convert_lora_to_gguf.py
self.z = z

# Apply heuristics to figure out typical tensor encoding based on first layer tensor encoding type
if self.ftype == gguf.LlamaFileType.GUESSED:
@@ -319,7 +321,7 @@ def prepare_tensors(self):
assert data.dtype == np.int16
data_qtype = gguf.GGMLQuantizationType.BF16

elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data):
elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
Collaborator:
This will prevent conversion to Q8_0 altogether for BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet, because they do not have MODEL_TENSOR.OUTPUT in their tensor type list.

Use self.match_model_tensor_name instead.

Author:
Feel free to modify my idea and code for your needs. As it is, I can use it, but sure, it would be better if done more properly. I just wish not to use quantize if it's not strictly needed.

Collaborator:
> As it is, I can use it.

Not on the models I mentioned.

> BERT, Gemma, Gemma2, Command-R, OpenELM, and BitNet

Try it, and you should see an error with --outtype q8_0, both with and without --z.

$ python3 convert_hf_to_gguf.py models/OpenELM-270M-Instruct/ --dry-run --outtype q8_0 --z
INFO:hf-to-gguf:Loading model: OpenELM-270M-Instruct
INFO:gguf.gguf_writer:gguf: This GGUF file is for Little Endian only
INFO:hf-to-gguf:Exporting model...
INFO:hf-to-gguf:gguf: loading model part 'model.safetensors'
INFO:hf-to-gguf:blk.0.attn_k_norm.weight,  torch.bfloat16 --> F32, shape = {64}
Traceback (most recent call last):
...
  File "convert_hf_to_gguf.py", line 324, in prepare_tensors
    elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and (self.z and new_name not in (self.format_tensor_name(gguf.MODEL_TENSOR.OUTPUT), self.format_tensor_name(gguf.MODEL_TENSOR.TOKEN_EMBD))):
                                                                                                                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

  File "convert_hf_to_gguf.py", line 179, in format_tensor_name
    raise ValueError(f"Missing {key!r} for MODEL_TENSORS of {self.model_arch!r}")
ValueError: Missing <MODEL_TENSOR.OUTPUT: 5> for MODEL_TENSORS of <MODEL_ARCH.OPENELM: 34>

Using self.match_model_tensor_name instead of self.format_tensor_name would avoid this problem (self.format_tensor_name is only intended to be used for tensors which will actually be in the model).
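A rough sketch of what the check could look like with self.match_model_tensor_name, following the intent described in this PR (an editor's illustration, not code from the PR; the helper name _keep_at_f16 is made up):

```python
# Sketch only: match_model_tensor_name returns False when the architecture has
# no such tensor instead of raising, so architectures without
# MODEL_TENSOR.OUTPUT (BERT, Gemma, OpenELM, ...) still convert to Q8_0.
def _keep_at_f16(self, new_name: str) -> bool:
    # True when --z asks to leave this tensor unquantized (output / embeddings)
    return self.z and (
        self.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.OUTPUT, bid=None)
        or self.match_model_tensor_name(new_name, gguf.MODEL_TENSOR.TOKEN_EMBD, bid=None)
    )

# ...and in prepare_tensors():
# elif self.ftype == gguf.LlamaFileType.MOSTLY_Q8_0 and gguf.can_quantize_to_q8_0(data) and not self._keep_at_f16(new_name):
```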

> I just wish not to use quantize if it's not strictly needed.

I understand; reducing the steps needed to get a desired quant is partly why q8_0 conversion was added in #7234.

I think that a more general type override for token_embd.weight and output.weight might be useful for your purpose, but I'm not yet sure how to implement that in a simple enough way. I think it might only be possible after a refactor of the type selection logic, which should be done relatively soon anyway to better support types for ternary models.

The currently proposed flag only has an effect along with --outtype q8_0, which can be confusing UX-wise.

Author:
Well, I could add options like in the quantize program (--emdt, --outt, or something like that)? I don't know.
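For reference, a hypothetical sketch of what such options could look like in parse_args(), loosely modelled on the quantize tool's --token-embedding-type / --output-tensor-type flags (the flag names and accepted values here are assumptions, not part of this PR):

```python
# Hypothetical sketch (not part of this PR): per-tensor type overrides for the
# convert script, similar in spirit to llama-quantize's per-tensor options.
# Names and choices are illustrative only.
parser.add_argument(
    "--token-embedding-type", type=str, choices=["f32", "f16", "bf16", "q8_0"],
    help="override the type used for token_embd.weight",
)
parser.add_argument(
    "--output-tensor-type", type=str, choices=["f32", "f16", "bf16", "q8_0"],
    help="override the type used for output.weight",
)
```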

Author:
It's also true that if you quantize to q8_0 without --z, you get a q8_0 as before. If you convert to f16 you don't need --z, and if for some reason you convert to f32, --z has no reason to exist either. That's why I applied it only to q8_0.
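A tiny illustration of the behaviour described above (an editor's sketch of the intended effect, not code from the PR):

```python
# Sketch of the described behaviour: --z only changes the result for q8_0,
# because f16/f32 conversions already leave these tensors unquantized.
def effective_type(outtype: str, z: bool, is_output_or_embd: bool) -> str:
    if outtype == "q8_0" and z and is_output_or_embd:
        return "f16"   # --z keeps output.weight / token_embd.weight at F16
    return outtype     # q8_0 without --z behaves as before; f16/f32 unaffected

assert effective_type("q8_0", True, True) == "f16"
assert effective_type("q8_0", False, True) == "q8_0"
assert effective_type("f16", True, True) == "f16"
```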

data = gguf.quantize_q8_0(data)
assert data.dtype == np.uint8
data_qtype = gguf.GGMLQuantizationType.Q8_0
@@ -3598,6 +3600,10 @@ def parse_args() -> argparse.Namespace:
"--metadata", type=Path,
help="Specify the path for an authorship metadata override file"
)
parser.add_argument(
"--z", action="store_true",
help="Keep output and embed tensors at F16"
)

return parser.parse_args()

@@ -3672,7 +3678,8 @@ def main() -> None:
metadata_override=args.metadata, model_name=args.model_name,
split_max_tensors=args.split_max_tensors,
split_max_size=split_str_to_n_bytes(args.split_max_size), dry_run=args.dry_run,
small_first_shard=args.no_tensor_first_split)
small_first_shard=args.no_tensor_first_split,
z=args.z)

if args.vocab_only:
logger.info("Exporting model vocab...")