
[Bug]: Scan time increases quadratically with page count #1378

Open
aliemjay opened this issue Aug 15, 2024 · 8 comments

@aliemjay

When using OCRMyPDF with the --redo-ocr option, I noticed that the "Scanning contents" phase can become excessively long for large input files.

Surprisingly, the total processing time seems to scale non-linearly with the page count, which makes the estimated time-to-complete feature useless.

Here are the progress bar timings when scanning a 1000-page file on a Core i5 12400 under v16.4.2:

$ ocrmypdf --redo-ocr in.pdf out.pdf

progress -> time -> diff
 20% -> 20.8s -> 20.8s
 40% -> 60.8s -> 40.0s
 60% -> 121s  -> 60.5s
 80% -> 200s  -> 78.7s
100% -> 300s  -> 100s
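These increments are consistent with quadratic growth: if scanning page k costs time roughly proportional to k (because each lookup re-walks pages 0..k-1), the total after n pages is proportional to n(n+1)/2, so the per-20% increments (20.8s, 40.0s, 60.5s, 78.7s, 100s) grow roughly linearly while the cumulative time grows quadratically.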

Upon investigation, I traced the issue back to

page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)

The implementation of PDFPage.get_pages iterates over all preceding pages before reaching pageno, so calling it once per page makes the whole scan quadratic in the page count.
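For illustration, here is a minimal sketch (not the actual ocrmypdf call sites; the file name and page count are invented) of the access pattern that produces the quadratic behaviour:

from pathlib import Path
from pdfminer.pdfpage import PDFPage

infile = Path('in.pdf')   # hypothetical input
num_pages = 1000

for pageno in range(num_pages):
    with infile.open('rb') as f:
        # get_pages() starts from page 0 on every call, so fetching page k
        # walks over pages 0..k-1 again; summed over all pages this is O(n^2).
        page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
        page = next(page_iter, None)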

A quick-and-dirty patch reduced the scan time from 300s to 10s for the same file; the patch is posted in a comment below.
@jbarlow83
Collaborator

Note that get_page_analysis runs in a worker thread/process context, so a fix that passes all tests may not be straightforward.
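As one illustration of why worker context complicates a cache (plain Python below, not ocrmypdf code): with process-based workers, module-level state such as a cache dict is not shared across workers, so each process builds and must clean up its own copy; with thread-based workers the same dict is shared and would need locking instead.

import os
from concurrent.futures import ProcessPoolExecutor

CACHE = {}

def worker(i):
    CACHE[i] = True
    # Each worker process mutates its own private copy of CACHE.
    return os.getpid(), len(CACHE)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(worker, range(4))))
    print(len(CACHE))  # still 0 in the parent process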

@aliemjay
Author

Here is my patch; it passed all tests locally.

diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield
 
+class PageCache:
+    cache = {}
+
+    @classmethod
+    def get_cached(cls, infile: PathLike, pageno: int):
+        if infile in cls.cache:
+            for cache_pageno, cache_page in cls.cache[infile]:
+                if cache_pageno == pageno:
+                    return cache_page
+                elif cache_pageno > pageno:
+                    break
+            else:
+                return None
+
+        f = Path(infile).open('rb')
+        cls.cache[infile] = enumerate(PDFPage.get_pages(f))
+        return cls.get_cached(infile, pageno)
 
 def get_page_analysis(
     infile: PathLike, pageno: int, pscript5_mode: bool
@@ -306,8 +323,7 @@ def get_page_analysis(
     with patch_pdfminer(pscript5_mode):
         try:
             with Path(infile).open('rb') as f:
-                page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
-                page = next(page_iter, None)
+                page = PageCache.get_cached(infile, pageno)
                 if page is None:
                     raise InputFileError(
                         f"pdfminer could not process page {pageno} (counting from 0)."

@aliemjay
Author

aliemjay commented Aug 16, 2024

For testing, this is a file with 10k blank pages:
blank-10k.pdf

Real-world files need far fewer pages for this bug to be noticeable.

The test was done under v16.4.2, because v16.4.3 introduced an unrelated regression to scan times. I'll file a separate issue soon.

$ ocrmypdf blank-10k.pdf out.pdf --redo-ocr

@jbarlow83
Collaborator

I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed - this could be problematic for PyPy.

@aliemjay
Author

> I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed - this could be problematic for PyPy.

Can we add a new method, PageCache.clear, which closes all open files and is called somewhere in the pipeline upon completion of the scan? I think this is the only way to fix it without disturbing the public API.
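For concreteness, a rough sketch of what such a clear() might look like, assuming the cache were changed to also keep the open file handle next to the page iterator (the patch above currently stores only the iterator):

    @classmethod
    def clear(cls):
        # Hypothetical: assumes cls.cache maps infile -> (file handle, page iterator).
        for f, _pages in cls.cache.values():
            f.close()
        cls.cache.clear()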

> this could be problematic for PyPy.

I think you mean PyPi the repository, as the issue is cross-platform compatibility with Windows?

@jbarlow83
Collaborator

I think that PageCache needs to be attached to PdfInfo and created as a distinct object rather than using global state. PdfInfo should create a PageCache and pass it on to each PageInfo; after page information is gathered, PageCache can be cleared.
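A rough sketch of that shape (the names and integration points here are illustrative, not the actual ocrmypdf API):

from os import PathLike
from pathlib import Path

from pdfminer.pdfpage import PDFPage


class PageCache:
    """Per-input-file page cache owned by PdfInfo, rather than global state."""

    def __init__(self, infile: PathLike):
        self._infile = infile
        self._file = None
        self._pages = None
        self._next = 0

    def get(self, pageno: int):
        # Reopen and restart the iterator only if pages are requested out of order.
        if self._pages is None or pageno < self._next:
            self.close()
            self._file = Path(self._infile).open('rb')
            self._pages = PDFPage.get_pages(self._file)
            self._next = 0
        for page in self._pages:
            idx, self._next = self._next, self._next + 1
            if idx == pageno:
                return page
        return None  # pageno is past the end of the document

    def close(self):
        # PdfInfo calls this once page information has been gathered.
        if self._file is not None:
            self._file.close()
        self._file = None
        self._pages = None

PdfInfo would create one PageCache per input file, hand it to each PageInfo, and call close() after the scan completes.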

> PyPi?

No, PyPy, the alternative Python implementation, which ocrmypdf supports.

@dhdaines

> Upon investigation, I traced the issue back to
>
> page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
>
> The implementation of PDFPage.get_pages iterates over all preceding pages before reaching pageno, so calling it once per page makes the whole scan quadratic in the page count.

Hmm, this is a bug in pdfminer.six. As far as I know, one does have to traverse the page tree in order to get to a page with a particular index (will have to go check the spec), but the problem is that the implementation of get_pages is quite lazy and actually parses all of the pages even when a subset of pages is specified for extraction.

@dhdaines

Oh, well, it turns out that parsing the pages after traversing the page tree isn't any more expensive than just traversing the tree itself, as initializing a PDFPage object just reads the attributes from the page tree and constructs an object. So, your fix is the right one and this doesn't actually need a fix to pdfminer.six.
