Add tests for chunking module #4

Merged
merged 9 commits into from
Sep 20, 2024

Conversation

violenil
Member

@violenil violenil commented Sep 16, 2024

This PR adds some more tests to the chunking methods, and adds back the semantic chunking logic.

Member

@guenthermi guenthermi left a comment

The tests already look good, but it would be nice to have something more explicit, for example some text snippets together with their expected chunk boundaries.

As for the modifications to the chunking: we probably need to clean this up a bit further. I want to get rid of all the default values.
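A test with explicit snippets and expected boundaries, as suggested above, could look roughly like the following. This is only a sketch: `chunk_by_tokens` is a hypothetical stand-in that splits on whitespace, not the chunking code in `chunked_pooling/chunking.py`, and the cases are made up for illustration.

```python
import re


def chunk_by_tokens(text: str, chunk_size: int) -> list[tuple[int, int]]:
    """Hypothetical helper (not the repo's chunker): group whitespace-separated
    tokens into chunks of at most `chunk_size` tokens and return one
    (start, end) character span per chunk."""
    token_spans = [m.span() for m in re.finditer(r"\S+", text)]
    spans = []
    for i in range(0, len(token_spans), chunk_size):
        group = token_spans[i:i + chunk_size]
        # A chunk starts at its first token and ends at its last token.
        spans.append((group[0][0], group[-1][1]))
    return spans


# Each case spells out the snippet and the exact boundaries it should produce.
CASES = [
    ("one two three four", 2, [(0, 7), (8, 18)]),
    ("one two three", 2, [(0, 7), (8, 13)]),
    ("", 2, []),
]


def test_expected_boundaries():
    for text, size, expected in CASES:
        assert chunk_by_tokens(text, size) == expected
```

Passing the chunk size explicitly in every case, rather than relying on a default, also fits the goal of removing default values.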

chunked_pooling/chunking.py (outdated, resolved)
@violenil violenil marked this pull request as ready for review September 20, 2024 08:44
Member

@guenthermi guenthermi left a comment

I found two small things; other than that, it looks good.

text,
return_offsets_mapping=True,
add_special_tokens=False,
max_length=512,
Member

Is it correct to set max_length=512 here?

Member Author

I can probably remove this.
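The thread does not say why the cap is questionable. One plausible concern, stated here purely as an assumption, is that a length cap applied while computing offsets leaves the offsets covering only a prefix of the text, so any chunk boundaries derived from them would miss the rest. The sketch below illustrates that effect with a toy whitespace tokenizer; `toy_offsets` is hypothetical and is not the tokenizer call used in the PR.

```python
import re
from typing import Optional


def toy_offsets(text: str, max_length: Optional[int] = None) -> list[tuple[int, int]]:
    """Toy stand-in for an offsets mapping: one (start, end) character span per
    whitespace-separated token, optionally capped at `max_length` tokens."""
    offsets = [m.span() for m in re.finditer(r"\S+", text)]
    if max_length is not None:
        # The cap silently drops every token past the limit.
        offsets = offsets[:max_length]
    return offsets


text = "alpha beta gamma delta"
full = toy_offsets(text)
capped = toy_offsets(text, max_length=2)

# Without a cap the last offset reaches the end of the text;
# with the cap it stops after the second token.
print(full[-1][1] == len(text))    # True
print(capped[-1][1] == len(text))  # False
```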

@@ -104,7 +99,6 @@ def main(model_name, strategy, task_name, eval_split):
output_folder='results-normal-pooling',
eval_splits=[eval_split],
overwrite_results=True,
batch_size=BATCH_SIZE,
Member

I think this should not be removed, as there is another batch size argument here:


I know it is a bit messy ...

Member Author

That's annoying, but OK.

@violenil violenil merged commit 08bd4d5 into main Sep 20, 2024
1 check passed
2 participants