
Blog post: Moving workflows from single files to collections - a case study #2050

Open · wants to merge 11 commits into master

Conversation

@paulzierep (Contributor):

Collections in workflows pitfalls and strategies from foodborne pathogen detection

… were based on processing single files to use collections as well. However, when applying this idea to our latest metagenomics workflow, [Foodborne pathogen detection](https://training.galaxyproject.org/training-material/topics/metagenomics/tutorials/pathogen-detection-from-nanopore-foodborne-data/tutorial.html), we encountered some problems that arise when switching from single files to collections. In the following we would like to present some of those issues and how we solved them, in the hope that these strategies can help other users facing similar problems.

Member:

If you also want to write them up as FAQs in the GTN, that could be useful: just short "how to do X" style FAQs. Then we can easily link users to them when they encounter those issues later.

@paulzierep (Contributor Author):

When the post is finalized we will condense it into a FAQ. Thanks for the idea!

@paulzierep paulzierep changed the title init Collections in workflows pitfalls and strategies from foodborne pathogen detection Jun 21, 2023
@paulzierep paulzierep changed the title Collections in workflows pitfalls and strategies from foodborne pathogen detection Plog post: Moving workflows from single files to collections - a case study Jun 21, 2023

However, if this logic needs to be applied to multiple collections, or again for multiple steps, the workflow becomes even more unreasonably large and complex.

Member:

You can also run the workflow again with the job cache enabled; this might be the easiest and cleanest solution.

@paulzierep (Contributor Author) commented Jun 22, 2023:

@mvdbeek do you mean to run the workflow again without the failed elements? Then, if another collection fails, run the complete workflow again, and so on? How do you enable the job cache, and what does it do?

Member:

No, just run the whole workflow again. Everything that had already run successfully will be skipped. If enabled by the admin, go to User > Preferences > Manage Information. There were some additional fixes in 23.1, and usegalaxy.eu isn't running that yet. If you want to test this you can try it on usegalaxy.org.

@paulzierep (Contributor Author):

Thank you for this info; cached jobs are indeed really cool, but I don't quite see how they solve the current issue.
The same elements will still fail again. Only in the case of random failures could this help, which should not happen anyway.

Therefore, an additional step would still be required, by:

  • Filtering the elements in the collection (-> re-indexing problem)
  • Running the workflow again without the datasets that fail (-> laborious)
  • Working on the tool to overcome the issue

Or am I missing something?

@mvdbeek (Member) commented Jun 28, 2023:

Your example said:

> Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since it requires large amounts of memory (>70 GB), which was not assigned to every run of the tool by the server.

so it should help with that?

@paulzierep (Contributor Author):

Ah OK, sorry, I thought you were referring to the complete paragraph. For that particular example your solution can help, although since the failures are random, the new run could produce some failed elements as well, and one would ultimately need to repeat the workflow multiple times, which is also not ideal.

@mvdbeek (Member) commented Jun 28, 2023:

No, those will be picked up by the cache. And it's generally a good thing if you have a single complete invocation without partially failed items.


# Case 3 - Collection workflow logic does not fully comply with single file logic

`TODO explain the unzip problem and how it was solved`

Member:

That sounds like a bug, would really welcome a reproduction here.

@paulzierep (Contributor Author):

We think so too, and we also double-checked. We will write an issue, but I think it's still a good finding for the blog post, and we can link the issue here. The idea is to motivate users to write issues for such cases.

@paulzierep (Contributor Author):

We will write a PR for that, but for the blog post I think it's enough to state that it is being investigated.

@hexylena hexylena changed the title Plog post: Moving workflows from single files to collections - a case study Blog post: Moving workflows from single files to collections - a case study Jun 21, 2023
…ion workflows and explaining the problem of the decompressing

@EngyNasr (Contributor):

I have added some parts to this PR.

@paulzierep paulzierep marked this pull request as ready for review June 30, 2023 08:40

Even if a workflow is well designed, in some cases it can happen that only a few elements of a collection fail. This happened to us rather randomly in the case of Kraken2, since it requires large amounts of memory (>70 GB), which was not assigned to every run of the tool by the server. That issue was solved by increasing the minimum memory requested by the tool on the EU server. But there are various other scenarios where the failure of a tool can be attributed, for example, to specific input data. In other cases only a few elements of a collection are empty (e.g. if an assembly cannot be made due to non-overlapping reads).

If an element of a collection is failed or empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.

Member:

This is not correct, generally speaking. As long as you're not reducing the elements the workflow will proceed until the end.

### Rerun the workflow with rerun activated

In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failures were rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.

Member:

Is this meant to refer to the job cache? Re-running single elements has been possible for many years.

@paulzierep (Contributor Author):

Is this not the cache option, "Do you want to be able to re-use equivalent jobs?" That's what I found on .org ... or did you mean another option?

Member:

"Do you want to be able to re-use equivalent jobs?" -> this is the job cache.
I don't see how that is related to "Rerun the workflow with rerun activated"?


Some of the tools used in the *Foodborne pathogen detection workflow using collections* failed when we ran the workflow but succeeded when the tool was run alone, outside a workflow, for example `Krakentools: Extract Kraken Reads By ID` and `Filter Sequence by ID`.

After a bit of investigation we noticed that when these tools run alone (without being in a workflow), or in a workflow with single files, Galaxy performs an implicit decompression of the zipped input files, which channels the correct file to the underlying wrapped tool. However, when the same tools run on a collection in a workflow, this implicit decompression does not take place, which causes the jobs to fail.

@mvdbeek (Member) commented Jun 30, 2023:

It's unclear what you're doing there. Implicit converters do run as part of workflows, whether using collections or not. Overall I don't think this is a statement that should be published.

@paulzierep (Contributor Author):

Removed, but still an issue, will raise next week.

However, this initial solution is not optimal, since running the workflow will increase the data size in the user's history (due to file decompression). Therefore, we have proposed another solution: updating the tool wrappers themselves to handle the compressed input internally, without the need for the `Convert compressed file to uncompressed` tool (https://github.com/galaxyproject/tools-iuc/pull/5360). In fact, the tool did not require an internal decompression step at all, since it can indeed work with zipped files; it only needed to know that the input is zipped.
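
To illustrate the kind of wrapper change involved, here is a simplified sketch (not the actual diff of the linked PR; `example_tool` and its `--gzip-input` flag are hypothetical). Declaring the compressed datatype on the input means Galaxy hands the gzipped file over as-is, and the wrapper only has to signal the compression to the underlying program:

```xml
<tool id="example_filter" name="Example filter" version="0.1.0">
    <command><![CDATA[
example_tool
    ## the dataset arrives unchanged, so just tell the program it is gzipped
    #if $input.is_of_type('fastqsanger.gz')
        --gzip-input
    #end if
    --in '$input'
    --out '$output'
    ]]></command>
    <inputs>
        <!-- listing both the plain and the .gz datatype means Galaxy does not
             schedule an implicit (or explicit) decompression step -->
        <param name="input" type="data" format="fastqsanger,fastqsanger.gz" label="Input reads" />
    </inputs>
    <outputs>
        <data name="output" format="fastqsanger" />
    </outputs>
</tool>
```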

Although the problem could be solved on the tool wrapper level again, the general problem of inconsistent conversion logic between single files and collections is most likely a bug, which is currently being investigated for upcoming Galaxy releases.

Member:

Please, please open issues for this when you encounter them, and before writing a blog post. We do have test cases that exercise this, so this is rather surprising.

@paulzierep (Contributor Author):

I will remove that part and we will do the PR


If an element of a collection is failed or empty, the entire downstream processing is stopped, which can be rather annoying if one wants to process a large amount of data and gets stuck due to a few elements. Two solutions are proposed to handle such cases.

## Intermediate workflow specific solution

Member:

I do want to point out, though, that you could have taken your single file workflow and mapped it over a collection, and I think that would have solved all of the issues mentioned in this section. The workarounds here are nice, but they make the problem appear much more complicated than it should be. I would also appreciate a qualifier saying that you can ask these sorts of questions in the IWC channel.

Furthermore, it needs to be considered that filtering empty or failed elements from a collection can hide the root cause of the problem. The question of why some tool runs fail or produce empty output should always be investigated further.
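
For reference, Galaxy ships built-in collection operations for exactly this kind of filtering ("Filter failed datasets" and "Filter empty datasets"). A minimal BioBlend sketch of the failed-element filter is shown below; the server URL, API key, and IDs are placeholders, and the `input` parameter name is our assumption about the built-in tool:

```python
from bioblend.galaxy import GalaxyInstance

# Placeholders: substitute your server, API key, history and collection IDs.
gi = GalaxyInstance("https://usegalaxy.org", key="<your-api-key>")

# Run the built-in "Filter failed datasets" collection operation on a
# dataset collection (src "hdca"); the result is a new collection that
# keeps only the successfully produced elements.
gi.tools.run_tool(
    history_id="<history-id>",
    tool_id="__FILTER_FAILED_DATASETS__",
    tool_inputs={"input": {"src": "hdca", "id": "<collection-id>"}},
)
```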

### Rerun the workflow with rerun activated

Member:

Suggested change
### Rerun the workflow with rerun activated
### Rerun the workflow with the re-use equivalent jobs option

is that what you meant? I read this as the normal replace-element-in-collection option


In some cases a rerun of the workflow can also solve the issue, for example in our use case, where the failures were rather random.
For this strategy it is beneficial to activate the rerun option of Galaxy (User/Preferences/Manage Information) on instances with version > 23.0.
This allows rerunning only the failed elements, which saves computing resources and time.
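
For readers who script such reruns, a minimal sketch against Galaxy's workflow invocation API is shown below. The URL, API key, and IDs are placeholders, and `use_cached_job` is the invocation payload flag corresponding to the "re-use equivalent jobs" preference:

```python
import requests

# Placeholders: substitute your server, API key, workflow and history IDs.
GALAXY_URL = "https://usegalaxy.org"
API_KEY = "<your-api-key>"
WORKFLOW_ID = "<workflow-id>"
HISTORY_ID = "<history-id>"

# Re-invoke the workflow with the same inputs. With the job cache enabled
# ("Do you want to be able to re-use equivalent jobs?" under
# User/Preferences/Manage Information), equivalent successful jobs are
# reused and only the failed ones actually run again.
response = requests.post(
    f"{GALAXY_URL}/api/workflows/{WORKFLOW_ID}/invocations",
    headers={"x-api-key": API_KEY},
    json={
        "history_id": HISTORY_ID,
        # "inputs": {...},  # the same input datasets/collections as before
        "use_cached_job": True,
    },
)
response.raise_for_status()
print("new invocation:", response.json()["id"])
```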

Member:

That's also true if you click on rerun and replace the collection element. I think I would say:

This creates a new workflow run where successful outputs are copied from the previous execution instead of being run again, and only failed jobs and jobs depending on the failed outputs will be submitted again.

@EngyNasr (Contributor) commented Jul 2, 2023:

I have created an issue on tools-iuc for the implicit dataset conversion that does not occur when using the tool in a workflow with an input collection of zipped files.

We can now link it to the blog post

@EngyNasr (Contributor) commented Aug 7, 2023:

@paulzierep is it ready to be merged?
