Unexpected result `tok.convert_tokens_to_string(["café"]) == 'mycaf�'` for some tokenizers like gpt2 #33577

trianxy · 2024-09-18T18:25:18Z

System Info

transformers version: 4.44.2
Platform: macOS-14.7-arm64-arm-64bit
Python version: 3.11.5
Huggingface_hub version: 0.25.0
Safetensors version: 0.4.5
Accelerate version: not installed
Accelerate config: not found
PyTorch version (GPU?): 2.4.1 (False)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using distributed or parallel set-up in script?: NO

Who can help?

@ArthurZucker @itazap

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

from transformers import AutoTokenizer
tok = AutoTokenizer.from_pretrained("gpt2", use_fast=False)

print(tok.convert_tokens_to_string(["café"]))  # error 1: prints 'caf�' instead of the expected 'café'

print(tok.encode(tok.decode([113]))[0])  # error 2: prints 4210 (the id of '�') instead of the expected 113 (related to error 1)

Expected behavior

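As noted in the comments under Reproduction, `tok.convert_tokens_to_string(["café"])` should return `café`, and `tok.encode(tok.decode([113]))[0]` should return 113. The same behavior appears in a few other models whose tokenizers use code similar to gpt2's.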

Cause

The line

        text = bytearray([self.byte_decoder[c] for c in text]).decode("utf-8", errors=self.errors)

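uses errors="replace" (the default value of self.errors). self.byte_decoder maps é to the single byte 0xE9, which is not a valid UTF-8 sequence on its own, so the decode step substitutes � for it. The byte-level form of café that the tokenizer itself produces is cafÃ©, and that form decodes correctly.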
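To make the mechanism concrete without downloading a model, here is a minimal standalone sketch. It rebuilds the GPT-2 style byte-to-character table as I understand it and applies the same decode line. The function names mirror the library's, but this is an illustration, not the library code itself.

```python
def bytes_to_unicode():
    """Rebuild a GPT-2 style table mapping each byte (0-255) to a printable character."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            # Bytes without a printable form are shifted to code points >= 256.
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, (chr(c) for c in cs)))


byte_encoder = bytes_to_unicode()
byte_decoder = {v: k for k, v in byte_encoder.items()}


def convert_tokens_to_string(tokens, errors="replace"):
    """Standalone copy of the decode step quoted above."""
    text = "".join(tokens)
    return bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors=errors)


# Plain text passed as if it were a token: 'é' becomes the lone byte 0xE9.
raw = convert_tokens_to_string(["café"])

# The byte-level form: UTF-8 encode first, then map each byte through the table.
token_form = "".join(byte_encoder[b] for b in "café".encode("utf-8"))
round_trip = convert_tokens_to_string([token_form])

print(repr(raw), repr(token_form), repr(round_trip))
```

The plain string loses its last character, while the byte-level form survives the round trip, which is why the function only behaves as expected on strings that are real tokens.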

Possible solutions

If others also feel that this is a problem, then there are a few ways to improve this behavior. I can create a PR if you like:

Improve the docstring of tok.decode(), similar to how the tiktoken library includes a WARNING that this operation is lossy.
Improve the docstring of tok.convert_tokens_to_string() with a warning that the input must consist ONLY of tokens (!), i.e. strings that correspond to a specific token_id.
Finally, I can refactor tok.convert_tokens_to_string() to work for all strings, but I am not sure that's worth it: the change would have to be made in many models, and it might break production code that relies on the current (wrong?) behavior.
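As a rough sketch of what the third option could look like: the helper below decodes strictly and returns the input unchanged when it is not a valid byte-level encoding. Both `build_byte_decoder` and `convert_any_string` are hypothetical names for this illustration and are not part of transformers. The table is rebuilt locally so the snippet runs on its own.

```python
def build_byte_decoder():
    """Hypothetical helper: rebuild a GPT-2 style character -> byte table."""
    bs = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    cs = bs[:]
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return {chr(c): b for b, c in zip(bs, cs)}


def convert_any_string(tokens, byte_decoder):
    """Hypothetical variant: strict decode, with a fallback to the raw text."""
    text = "".join(tokens)
    try:
        # errors="strict" raises instead of silently inserting U+FFFD.
        return bytearray([byte_decoder[c] for c in text]).decode("utf-8", errors="strict")
    except (KeyError, UnicodeDecodeError):
        # Not a valid byte-level encoding: assume the caller passed plain text.
        return text


bd = build_byte_decoder()
plain = convert_any_string(["café"], bd)     # invalid UTF-8 bytes, returned as-is
tokens = convert_any_string(["cafÃ©"], bd)   # real byte-level form, decoded
outside = convert_any_string(["日本"], bd)   # characters missing from the table
print(plain, tokens, outside)
```

The fallback is only a heuristic. A plain string such as cafÃ© is indistinguishable from the byte-level form of café, so this approach changes which inputs are lossy rather than removing the ambiguity, and that is one reason the docstring options may be the safer choice.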


trianxy added the bug label Sep 18, 2024

trianxy changed the title ~~Unexpected result tok.convert_tokens_to_string(["café"]) == 'mycaf�' for some tokenizers like gpt2`~~ Unexpected result `tok.convert_tokens_to_string(["café"]) == 'mycaf�' for some tokenizers like gpt2 Sep 18, 2024
