@nroggendorff on Hugging Face: "Updated https://huggingface.co/blog/nroggendorff/train-with-llama-architecture…"

@osanseviero\n\n\t

\n","updatedAt":"2024-07-14T18:09:16.540Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/659f000b83abded48e190901/cUwwELi9LbFqkI_11Cjk6.png","fullname":"Noa Roggendorff","name":"nroggendorff","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["nroggendorff"],"reactions":[],"identifiedLanguage":{"language":"es","probability":0.5235151052474976},"isReport":false}},{"id":"6695427f5130ff34b9c07877","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false},"createdAt":"2024-07-15T15:38:39.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"very good ! \nmaybe a colab ! \ncould this be used to extend a tokenizer model with training ? \nas i would like to update my mistral tokenizer to include forign chars, such as hebrew and amaric, and hindi","html":"

very good !
maybe a colab !
could this be used to extend a tokenizer model with training ?
as i would like to update my mistral tokenizer to include forign chars, such as hebrew and amaric, and hindi

\n","updatedAt":"2024-07-15T15:38:39.129Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["LeroyDyer"],"reactions":[],"identifiedLanguage":{"language":"en","probability":0.9849752187728882},"isReport":false},"replies":[{"id":"6695655a490f463859b781ab","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/659f000b83abded48e190901/cUwwELi9LbFqkI_11Cjk6.png","fullname":"Noa Roggendorff","name":"nroggendorff","type":"user","isPro":false,"isHf":false,"isMod":false},"createdAt":"2024-07-15T18:07:22.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Im pretty sure you can add additional tokens and special tokens, so I suppose","html":"

Im pretty sure you can add additional tokens and special tokens, so I suppose

\n","updatedAt":"2024-07-15T18:07:22.925Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/659f000b83abded48e190901/cUwwELi9LbFqkI_11Cjk6.png","fullname":"Noa Roggendorff","name":"nroggendorff","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["nroggendorff"],"reactions":[],"identifiedLanguage":{"language":"en","probability":0.8821778297424316},"isReport":false,"parentCommentId":"6695427f5130ff34b9c07877"}},{"id":"6695657cd4ca2767b9b94577","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/659f000b83abded48e190901/cUwwELi9LbFqkI_11Cjk6.png","fullname":"Noa Roggendorff","name":"nroggendorff","type":"user","isPro":false,"isHf":false,"isMod":false},"createdAt":"2024-07-15T18:07:56.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"> Im pretty sure you can add additional tokens and special tokens, so I suppose\n\nbut that doesnt really have anything to do with this article, thats just a transformers feature","html":"

\n
Im pretty sure you can add additional tokens and special tokens, so I suppose
\n

but that doesnt really have anything to do with this article, thats just a transformers feature

\n","updatedAt":"2024-07-15T18:07:56.240Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/659f000b83abded48e190901/cUwwELi9LbFqkI_11Cjk6.png","fullname":"Noa Roggendorff","name":"nroggendorff","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["nroggendorff"],"reactions":[],"identifiedLanguage":{"language":"en","probability":0.9666891694068909},"isReport":false,"parentCommentId":"6695427f5130ff34b9c07877"}},{"id":"6695e768496ec7b16032fa62","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false},"createdAt":"2024-07-16T03:22:16.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Yes when you add tokens they just slot into the next numbers on the list . I suppose these can be considered special tokens . But to improve the model , it is suggested to train new tokens into the tokenizer instead , IE reoptimizing the potential merges , for the bPE encoding .. \n\nIE : by training the tokenizer using pure code files , the tokenizer would be able to optimise the tokenizer for code segmentation . \n\nWhen the models select the next token , the next token could in fact be a code segment . So these segments (chunks) have value in the series of tokens being predicted . \nThese segments such as code segments or even DNA segments may not be optimiWr in training by a standard Corpus of literature . Hence the requirement to optimize and existing tokenizer updating the model such the same way we train embeddings models to improve . These are all part of the predictive input and output of the network . \nI'm sure you understand . ","html":"

Yes when you add tokens they just slot into the next numbers on the list . I suppose these can be considered special tokens . But to improve the model , it is suggested to train new tokens into the tokenizer instead , IE reoptimizing the potential merges , for the bPE encoding ..

IE : by training the tokenizer using pure code files , the tokenizer would be able to optimise the tokenizer for code segmentation .

When the models select the next token , the next token could in fact be a code segment . So these segments (chunks) have value in the series of tokens being predicted .
These segments such as code segments or even DNA segments may not be optimiWr in training by a standard Corpus of literature . Hence the requirement to optimize and existing tokenizer updating the model such the same way we train embeddings models to improve . These are all part of the predictive input and output of the network .
I'm sure you understand .

\n","updatedAt":"2024-07-16T03:22:16.722Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["LeroyDyer"],"reactions":[],"identifiedLanguage":{"language":"en","probability":0.8797832131385803},"isReport":false,"parentCommentId":"6695427f5130ff34b9c07877"}},{"id":"6695e8fdca566116a652c20f","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false},"createdAt":"2024-07-16T03:29:01.000Z","type":"comment","data":{"edited":false,"hidden":false,"latest":{"raw":"Recently I swapped the hugging face trainer for the unsloth trainer .. it also trained the embeddings making the model much more predictive . As well as retaining more documents and accepting of new documents , such as training ancient documents and books which often have Thier own vocabulary a or grammar .. so the predictive sequence are different .. hence its hard to train for English of the model has been trained on tavlong etc as the inner segments or chunks or tokens are not optimal it's like making a puzzle with the wrong shapes at some point it's jagged and disjointed ... Or garbelled output .. bad phrasing or grammar or just incorrect . Lol . \nHence great stuff as there are some gaps In the documentation on hugging face .. \nI think mainly to do with instantiating networks and configuration files etc . As obviously we don't need to download a model of we can generate a new one locally and pretrain it ourselves ... But the docs often leave this section out !! \nAs I think all the multimodals are actually there and a good model can be created mixing etc .. but the docs are not right ...\n\nSo all full examples are always a great blessing ! ","html":"

Recently I swapped the hugging face trainer for the unsloth trainer .. it also trained the embeddings making the model much more predictive . As well as retaining more documents and accepting of new documents , such as training ancient documents and books which often have Thier own vocabulary a or grammar .. so the predictive sequence are different .. hence its hard to train for English of the model has been trained on tavlong etc as the inner segments or chunks or tokens are not optimal it's like making a puzzle with the wrong shapes at some point it's jagged and disjointed ... Or garbelled output .. bad phrasing or grammar or just incorrect . Lol .
Hence great stuff as there are some gaps In the documentation on hugging face ..
I think mainly to do with instantiating networks and configuration files etc . As obviously we don't need to download a model of we can generate a new one locally and pretrain it ourselves ... But the docs often leave this section out !!
As I think all the multimodals are actually there and a good model can be created mixing etc .. but the docs are not right ...

So all full examples are always a great blessing !

\n","updatedAt":"2024-07-16T03:29:01.569Z","author":{"avatarUrl":"https://cdn-avatars.huggingface.co/v1/production/uploads/65d883893a52cd9bcd8ab7cf/tRsCJlHNZo1D02kBTmfy9.jpeg","fullname":"leroy Samuel Dyer","name":"LeroyDyer","type":"user","isPro":false,"isHf":false,"isMod":false}},"numEdits":0,"editors":["LeroyDyer"],"reactions":[],"identifiedLanguage":{"language":"en","probability":0.9557238221168518},"isReport":false,"parentCommentId":"6695427f5130ff34b9c07877"}}]}],"numComments":6},"theme":"light","acceptLanguages":["*"],"primaryEmailConfirmed":false}">

Join the conversation

Join the community of Machine Learners and AI enthusiasts.

Back to feed

nroggendorff

posted an update Jul 14

Post

2041

Updated https://huggingface.co/blog/nroggendorff/train-with-llama-architecture so you can "train" your own tokenizer from your dataset.

nroggendorff

Jul 14

cc @osanseviero

LeroyDyer

Jul 15

very good !
maybe a colab !
could this be used to extend a tokenizer model with training ?
as i would like to update my mistral tokenizer to include forign chars, such as hebrew and amaric, and hindi

nroggendorff

Jul 15

Im pretty sure you can add additional tokens and special tokens, so I suppose

In this post