An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)
Reproduction
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)
print(tok.convert_tokens_to_string(["café"]))  # error 1: prints 'caf�' instead of the expected `café`
print(tok.encode(tok.decode([113]))[0])  # error 2: prints 4210 (the id of '�') instead of the expected 113 (related to error 1)
Expected behavior
The expected behavior is mentioned above inside Reproduction. The same behavior appears for a few other models, which use similar code to gpt2.
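Cause
The relevant line uses errors="replace", which replaces é with �. The string "café" is not in the byte-level form that self.byte_encoder produces, so é is decoded as a lone byte that is not valid UTF-8.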
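The replacement can be reproduced without downloading a model. The sketch below re-implements the widely used GPT-2 byte-to-character table. The helper names `tokens_to_string` and `to_byte_symbols` are hypothetical, and the code only approximates the byte-level decoding step; it is not the library's implementation. In this sketch, é is looked up as the symbol for the single byte 0xE9, which is not valid UTF-8 on its own, so errors="replace" turns it into �.

```python
def bytes_to_unicode():
    """Map every byte value (0-255) to a printable character, GPT-2 style."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Non-printable bytes are shifted to code points above 255.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()                        # byte -> symbol
byte_decoder = {v: k for k, v in byte_encoder.items()}   # symbol -> byte


def tokens_to_string(tokens, errors="replace"):
    """Hypothetical stand-in for the byte-level decoding step."""
    text = "".join(tokens)
    return bytearray(byte_decoder[c] for c in text).decode("utf-8", errors=errors)


def to_byte_symbols(s):
    """Hypothetical helper: turn a plain string into byte-level symbols."""
    return "".join(byte_encoder[b] for b in s.encode("utf-8"))


# A plain string is not a token: 'é' is read as the lone byte 0xE9.
broken = tokens_to_string(["café"])

# Converting to byte-level symbols first gives a string that decodes cleanly.
symbols = to_byte_symbols("café")
fixed = tokens_to_string([symbols])

print(repr(broken), repr(symbols), repr(fixed))
```

Under these assumptions, only strings that are already in byte-level form (such as the output of `to_byte_symbols`) survive the decoding step unchanged.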
Possible solutions
If others also feel that this is a problem, then there are a few ways to improve this behavior. I can create a PR if you like:
Improve the docstring of tok.decode(), similarly to how the library tiktoken includes a WARNING that this operation is lossy
Improve the docstring of tok.convert_tokens_to_string() to have a warning that you should ONLY input tokens (!) (i.e. strings which are represented by a specific token_id)
Finally, I can refactor tok.convert_tokens_to_string() to work for all strings, but I am not sure that's worth it, because it would need to be done for a lot of models, and it might break production code that relies on the above (wrong?) behavior.
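To illustrate why decoding is lossy, here is a minimal sketch that works on raw bytes only. It assumes that a byte-level tokenizer may place the two UTF-8 bytes of é in separate tokens. The helpers `decode_bytes` and `is_lossless` are hypothetical names, not part of transformers or tiktoken.

```python
def decode_bytes(token_bytes, errors="replace"):
    """Join the byte strings of several tokens and decode them as UTF-8."""
    return b"".join(token_bytes).decode("utf-8", errors=errors)


def is_lossless(token_bytes):
    """Return True if the joined bytes decode without any replacement."""
    try:
        b"".join(token_bytes).decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False


e_acute = "é".encode("utf-8")           # two bytes: 0xC3 0xA9
first, second = e_acute[:1], e_acute[1:]

whole = decode_bytes([first, second])   # both halves together decode to 'é'
half = decode_bytes([second])           # a lone continuation byte becomes '�'

# Re-encoding the replaced text does not give back the original byte,
# which is why an encode(decode(...)) round trip can return a different id.
roundtrip = half.encode("utf-8")

print(repr(whole), repr(half), roundtrip != second)
print(is_lossless([first, second]), is_lossless([second]))
```

A check like `is_lossless` could back a docstring warning with a concrete test, or callers could pass errors="strict" to get an exception instead of a silent replacement.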
trianxy changed the title from "Unexpected result tok.convert_tokens_to_string(["café"]) == 'mycaf�' for some tokenizers like gpt2`" to "Unexpected result `tok.convert_tokens_to_string(["café"]) == 'mycaf�' for some tokenizers like gpt2" on Sep 18, 2024
System Info
transformers version: 4.44.2
Who can help?
@ArthurZucker @itazap