[gguf] Add descriptions to quantization types #615
[GGMLQuantizationType.Q5_K]: `"type-1" 5-bit quantization. Same super-block structure as Q4_K resulting in 5.5 bpw. In "type-1", weights are given by w = d * q + m, where m is the block minimum.`, // src: https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305
[GGMLQuantizationType.Q6_K]: `"type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights. Scales are quantized with 8 bits. This ends up using 6.5625 bpw. In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale.`, // src: https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305
[GGMLQuantizationType.Q8_K]: `"type-0" 8-bit quantization. Only used for quantizing intermediate results. The difference to the existing Q8_0 is that the block size is 256. All 2-6 bit dot products are implemented for this quantization type. In "type-0", weights w are obtained from quants q using w = d * q, where d is the block scale.`, // src: https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305
[GGMLQuantizationType.IQ2_XXS]: "", // todo: add description
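The 6.5625 bpw figure in the Q6_K description can be checked with a little arithmetic. The sketch below is not part of the PR: `SuperBlockLayout` and `bitsPerWeight` are hypothetical names, and the 16 bits of per-super-block metadata is an assumption (one fp16 super-block scale) that the description itself does not spell out.

```typescript
// Hypothetical helper for illustration; not part of the gguf package.
interface SuperBlockLayout {
  blocks: number; // blocks per super-block
  weightsPerBlock: number; // weights in each block
  quantBits: number; // bits per quantized weight
  perBlockMetaBits: number; // bits of quantized scale (and min, if any) per block
  superBlockMetaBits: number; // ASSUMPTION: fp16 scale(s) stored once per super-block
}

function bitsPerWeight(layout: SuperBlockLayout): number {
  const weights = layout.blocks * layout.weightsPerBlock;
  const totalBits =
    weights * layout.quantBits +
    layout.blocks * layout.perBlockMetaBits +
    layout.superBlockMetaBits;
  return totalBits / weights;
}

// Q6_K as described above: 16 blocks x 16 weights, 6-bit quants, 8-bit scales.
const q6k = bitsPerWeight({
  blocks: 16,
  weightsPerBlock: 16,
  quantBits: 6,
  perBlockMetaBits: 8,
  superBlockMetaBits: 16, // assumed single fp16 super-block scale
});

console.log(q6k); // 6.5625
```

Under that assumption the total is 1536 + 128 + 16 = 1680 bits for 256 weights, which matches the quoted figure.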
@ikawrakow @ggerganov @younesbelkada @FL33TW00D or anyone, I'd greatly appreciate it if you could supply any of the missing descriptions.
You can just post them as a comment and I can add/commit them to the file.
According to ggerganov/llama.cpp#5063 + offline discussion with @FL33TW00D I would say:
Q4_0: Round-to-Nearest group-wise quantization with a block size of 32 and 4-bit quantized weights. Block weights are simply given by w = q * s. Legacy quantization method, and not really used by the community as of today.
I would say Q5_0 / Q8_0 are also RTN but for 5 / 8-bit. I'm not sure yet what the _1 in Q4_1 stands for, so I will let others comment on this.
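To make the "w = q * s" description concrete, here is a minimal round-to-nearest sketch for one block of 32 weights. It is an illustration only, not the ggml implementation: `Type0Block`, `quantizeType0`, and `dequantizeType0` are hypothetical names, the choice of scale (`maxAbs / 7`) and the signed range [-8, 7] are assumptions, and the quants are kept one per byte instead of being packed into nibbles.

```typescript
const BLOCK_SIZE = 32; // block size from the description above

// Hypothetical container for one quantized block (unpacked for readability).
interface Type0Block {
  scale: number; // s in w = q * s
  quants: Int8Array; // one 4-bit value per entry, range [-8, 7]
}

function quantizeType0(weights: Float32Array): Type0Block {
  if (weights.length !== BLOCK_SIZE) {
    throw new Error(`expected ${BLOCK_SIZE} weights, got ${weights.length}`);
  }
  let maxAbs = 0;
  for (let i = 0; i < weights.length; i++) {
    maxAbs = Math.max(maxAbs, Math.abs(weights[i]));
  }
  // ASSUMPTION: symmetric scale so that the largest magnitude maps to +/-7.
  const scale = maxAbs / 7;
  const quants = new Int8Array(BLOCK_SIZE); // stays all zero if scale is 0
  if (scale > 0) {
    for (let i = 0; i < weights.length; i++) {
      const q = Math.round(weights[i] / scale); // round to nearest level
      quants[i] = Math.max(-8, Math.min(7, q));
    }
  }
  return { scale, quants };
}

function dequantizeType0(block: Type0Block): Float32Array {
  // w = q * s
  return Float32Array.from(block.quants, (q) => q * block.scale);
}
```

With this choice of scale, each reconstructed weight differs from the original by at most half a quantization step (`scale / 2`), up to floating-point rounding.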
I might have it right for QK_1:
Q4_1: Round-to-Nearest group-wise quantization with a blocksize of 32 and 4-bit quantized weights with an additional term that is added after the de-quantization step. Block weights are simply given by w = q * s + m with m being the minimum of the block. Legacy quantization method, and not really used by the community as of today.
Same comment applies for Q5_1 and Q8_1 I think
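The "w = q * s + m" form can be sketched the same way. Again this is only an illustration under stated assumptions, not the ggml code: `Type1Block`, `quantizeType1`, and `dequantizeType1` are hypothetical names, the scale is assumed to be `(max - min) / 15` for 4 bits, and the quants are stored unpacked.

```typescript
// Hypothetical container for one quantized block with a stored minimum.
interface Type1Block {
  scale: number; // s in w = q * s + m
  min: number; // m, the minimum of the block
  quants: Uint8Array; // unsigned levels in [0, 2^bits - 1]
}

function quantizeType1(weights: Float32Array, bits = 4): Type1Block {
  if (weights.length === 0) throw new Error("empty block");
  const levels = (1 << bits) - 1; // 15 for 4-bit quants
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < weights.length; i++) {
    min = Math.min(min, weights[i]);
    max = Math.max(max, weights[i]);
  }
  // ASSUMPTION: the block's [min, max] range is split into `levels` equal steps.
  const scale = (max - min) / levels;
  const quants = new Uint8Array(weights.length); // all zero if the block is constant
  if (scale > 0) {
    for (let i = 0; i < weights.length; i++) {
      const q = Math.round((weights[i] - min) / scale); // round to nearest level
      quants[i] = Math.max(0, Math.min(levels, q));
    }
  }
  return { scale, min, quants };
}

function dequantizeType1(block: Type1Block): Float32Array {
  // w = q * s + m
  return Float32Array.from(block.quants, (q) => q * block.scale + block.min);
}
```

Storing the minimum lets all levels cover the block's actual range, which helps when the weights in a block are not centered around zero.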
[GGMLQuantizationType.Q5_1]: "", // todo: add description
[GGMLQuantizationType.Q8_0]: "", // todo: add description
[GGMLQuantizationType.Q8_1]: "", // todo: add description
[GGMLQuantizationType.Q2_K]: `"type-1" 2-bit quantization in super-blocks containing 16 blocks, each block having 16 weights. Block scales and mins are quantized with 4 bits. This ends up effectively using 2.5625 bits per weight (bpw). In "type-1", weights are given by w = d * q + m, where m is the block minimum.`, // src: https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305
Should you encode the src link in the code itself (so a Record<GGMLQuantizationType, { txt: string; url: string }>) to be able to link to a reference from the UI?
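One possible shape for that suggestion is sketched below. Everything except the `{ txt: string; url: string }` record type is an assumption made for illustration: the two-member enum is a stand-in for the real `GGMLQuantizationType`, and `QuantDescription`, `DESCRIPTIONS_WITH_SRC`, and `renderDescription` are hypothetical names. The descriptions are shortened from the diff.

```typescript
// Stand-in enum for illustration; the real GGMLQuantizationType has many more
// members, and these numeric values are not meant to match it.
enum GGMLQuantizationType {
  Q5_K,
  Q6_K,
}

interface QuantDescription {
  txt: string; // human-readable description
  url: string; // reference the description was taken from
}

const K_QUANTS_SRC = "https://github.com/ggerganov/llama.cpp/pull/1684#issue-1739619305";

// Hypothetical table: the source link moves from a code comment into the data.
const DESCRIPTIONS_WITH_SRC: Record<GGMLQuantizationType, QuantDescription> = {
  [GGMLQuantizationType.Q5_K]: {
    txt: `"type-1" 5-bit quantization. Same super-block structure as Q4_K resulting in 5.5 bpw.`,
    url: K_QUANTS_SRC,
  },
  [GGMLQuantizationType.Q6_K]: {
    txt: `"type-0" 6-bit quantization. Super-blocks with 16 blocks, each block having 16 weights.`,
    url: K_QUANTS_SRC,
  },
};

// A UI could then render the text together with a link to its source.
function renderDescription(type: GGMLQuantizationType): string {
  const { txt, url } = DESCRIPTIONS_WITH_SRC[type];
  return `${txt} (source: ${url})`;
}
```

Because the record is keyed by the enum, the compiler reports any quantization type that is missing an entry, which a trailing `// src:` comment cannot do.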
Another potential idea: indicate a few of them with a featured or popular flag so we can showcase them in a UI or something.
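A rough sketch of how such a flag might look, assuming an optional boolean on each entry. The enum is again a stand-in, `QuantInfo`, `QUANT_INFO`, and `featuredTypes` are hypothetical names, and which types are marked `featured` here is purely illustrative, not a statement about actual popularity.

```typescript
// Stand-in enum for illustration; not the real GGMLQuantizationType.
enum GGMLQuantizationType {
  Q4_0,
  Q5_K,
  Q6_K,
}

interface QuantInfo {
  txt: string;
  featured?: boolean; // hypothetical flag; absent means "not featured"
}

const QUANT_INFO: Record<GGMLQuantizationType, QuantInfo> = {
  [GGMLQuantizationType.Q4_0]: { txt: "Legacy round-to-nearest 4-bit quantization." },
  [GGMLQuantizationType.Q5_K]: { txt: '"type-1" 5-bit quantization.', featured: true },
  [GGMLQuantizationType.Q6_K]: { txt: '"type-0" 6-bit quantization.', featured: true },
};

// Names of the types a UI might showcase.
function featuredTypes(): string[] {
  return Object.keys(QUANT_INFO)
    .map(Number) // object keys are strings; numeric enum members are numbers
    .filter((t) => QUANT_INFO[t as GGMLQuantizationType].featured === true)
    .map((t) => GGMLQuantizationType[t]); // reverse mapping: value -> name
}
```

Keeping the flag optional means existing entries need no change; only the showcased types gain a field.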
handled in 240f0df
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: FL33TW00D <chris@fleetwood.dev>
[gguf] rename QUANT_DESCRIPTIONS -> GGUF_QUANT_DESCRIPTIONS follow up to #615
I have not found a single place where all the different data/quant types of gguf are documented. Therefore, I am creating this description object, which should help the community understand the different data/quant types.
Afterwards, I plan to make the description available at: