import existing CDX files #410

Open
anarcat opened this issue Nov 13, 2018 · 0 comments

anarcat commented Nov 13, 2018

Describe the solution you'd like

It should be possible to pass CDX files along with WARC files to wb-manager add. As things stand, that command can take needlessly long because it creates its own CDX file, even though I already have one that was generated during the crawl!

Describe alternatives you've considered

I know about the migrate-index command introduced in #80 (now cdx-convert?), but that's a separate command: once wb-manager add has run, it's too late, and it's unclear whether it can work before the add command is issued. In my tests, it only yields this message, regardless of what I feed it:

Index files up-to-date, nothing to convert

I also suspect custom user-defined collections might fit the bill, but I haven't figured out how to use those just yet.

Additional context

This, like #408, is a problem when processing large WARC files.
