
[Bug]: Scan time increases quadratically with page count #1378

Open
aliemjay opened this issue Aug 15, 2024 · 8 comments

@aliemjay

When using OCRMyPDF with the --redo-ocr option, I noticed that the "Scanning contents" phase can become excessively long for large input files.

Surprisingly, the total processing time seems to scale non-linearly with the page count, which makes the estimated time-to-complete feature useless.

Here are the progress bar timings when scanning a 1000-page file on a Core i5 12400 under v16.4.2:

$ ocrmypdf --redo-ocr in.pdf out.pdf

progress -> time -> diff
 20% -> 20.8s -> 20.8s
 40% -> 60.8s -> 40.0s
 60% -> 121s  -> 60.5s
 80% -> 200s  -> 78.7s
100% -> 300s  -> 100s
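These increments are consistent with quadratic growth: if scanning page k costs time roughly proportional to k (because each lookup re-walks pages 0..k-1), the total after n pages is proportional to n(n+1)/2, so the per-20% increments (20.8s, 40.0s, 60.5s, 78.7s, 100s) grow roughly linearly while the cumulative time grows quadratically.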

Upon investigation, I traced the issue back to

page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)

The implementation of PDFPage.get_pages iterates over all preceding pages before reaching pageno, so calling it once per page makes the whole scan quadratic in the page count.
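For illustration, here is a minimal sketch (not the actual ocrmypdf call sites; the file name and page count are invented) of the access pattern that produces the quadratic behaviour:

from pathlib import Path
from pdfminer.pdfpage import PDFPage

infile = Path('in.pdf')   # hypothetical input
num_pages = 1000

for pageno in range(num_pages):
    with infile.open('rb') as f:
        # get_pages() starts from page 0 on every call, so fetching page k
        # walks over pages 0..k-1 again; summed over all pages this is O(n^2).
        page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
        page = next(page_iter, None)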

A quick-and-dirty patch reduced the scan time from 300s to 10s for the same file; the patch is posted in a comment below.
@jbarlow83
Collaborator

Note that get_page_analysis runs in a worker thread/process context, so a fix that passes all tests may not be straightforward.
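As one illustration of why worker context complicates a cache (plain Python below, not ocrmypdf code): with process-based workers, module-level state such as a cache dict is not shared across workers, so each process builds and must clean up its own copy; with thread-based workers the same dict is shared and would need locking instead.

import os
from concurrent.futures import ProcessPoolExecutor

CACHE = {}

def worker(i):
    CACHE[i] = True
    # Each worker process mutates its own private copy of CACHE.
    return os.getpid(), len(CACHE)

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=2) as pool:
        print(list(pool.map(worker, range(4))))
    print(len(CACHE))  # still 0 in the parent process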

@aliemjay
Author

Here is my patch; it passed all tests locally.

diff --git a/src/ocrmypdf/pdfinfo/layout.py b/src/ocrmypdf/pdfinfo/layout.py
index 83d5d32b..fbef6a0d 100644
--- a/src/ocrmypdf/pdfinfo/layout.py
+++ b/src/ocrmypdf/pdfinfo/layout.py
@@ -288,6 +288,23 @@ def patch_pdfminer(pscript5_mode: bool):
     else:
         yield
 
+class PageCache:
+    cache = {}
+
+    @classmethod
+    def get_cached(cls, infile: PathLike, pageno: int):
+        if infile in cls.cache:
+            for cache_pageno, cache_page in cls.cache[infile]:
+                if cache_pageno == pageno:
+                    return cache_page
+                elif cache_pageno > pageno:
+                    break
+            else:
+                return None
+
+        f = Path(infile).open('rb')
+        cls.cache[infile] = enumerate(PDFPage.get_pages(f))
+        return cls.get_cached(infile, pageno)
 
 def get_page_analysis(
     infile: PathLike, pageno: int, pscript5_mode: bool
@@ -306,8 +323,7 @@ def get_page_analysis(
     with patch_pdfminer(pscript5_mode):
         try:
             with Path(infile).open('rb') as f:
-                page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
-                page = next(page_iter, None)
+                page = PageCache.get_cached(infile, pageno)
                 if page is None:
                     raise InputFileError(
                         f"pdfminer could not process page {pageno} (counting from 0)."

@aliemjay
Author

aliemjay commented Aug 16, 2024

For testing, this is a file with 10k blank pages:
blank-10k.pdf

Real-world files need far fewer pages for this bug to be noticeable.

The test was done under v16.4.2, because v16.4.3 introduced an unrelated regression to scan times. I'll file a separate issue soon.

$ ocrmypdf blank-10k.pdf out.pdf --redo-ocr

@jbarlow83
Collaborator

I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed - this could be problematic for PyPy.

@aliemjay
Author

> I see a few issues with the approach taken in the patch. In particular, the opened file handle is not explicitly closed - this could be problematic for PyPy.

Can we add a new method, PageCache.clear, which closes all open files and is called somewhere in the pipeline upon completion of the scan? I think this is the only way to fix it without disturbing the public API.
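For concreteness, a rough sketch of what such a clear() might look like, assuming the cache were changed to also keep the open file handle next to the page iterator (the patch above currently stores only the iterator):

    @classmethod
    def clear(cls):
        # Hypothetical: assumes cls.cache maps infile -> (file handle, page iterator).
        for f, _pages in cls.cache.values():
            f.close()
        cls.cache.clear()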

> this could be problematic for PyPy.

I think you mean PyPi the repository, as the issue is cross-platform compatibility with Windows?

@jbarlow83
Collaborator

I think that PageCache needs to be attached to PdfInfo and created as a distinct object rather than using global state. PdfInfo should create a PageCache and pass it on to each PageInfo; after page information is gathered, PageCache can be cleared.
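A rough sketch of that shape (the names and integration points here are illustrative, not the actual ocrmypdf API):

from os import PathLike
from pathlib import Path

from pdfminer.pdfpage import PDFPage


class PageCache:
    """Per-input-file page cache owned by PdfInfo, rather than global state."""

    def __init__(self, infile: PathLike):
        self._infile = infile
        self._file = None
        self._pages = None
        self._next = 0

    def get(self, pageno: int):
        # Reopen and restart the iterator only if pages are requested out of order.
        if self._pages is None or pageno < self._next:
            self.close()
            self._file = Path(self._infile).open('rb')
            self._pages = PDFPage.get_pages(self._file)
            self._next = 0
        for page in self._pages:
            idx, self._next = self._next, self._next + 1
            if idx == pageno:
                return page
        return None  # pageno is past the end of the document

    def close(self):
        # PdfInfo calls this once page information has been gathered.
        if self._file is not None:
            self._file.close()
        self._file = None
        self._pages = None

PdfInfo would create one PageCache per input file, hand it to each PageInfo, and call close() after the scan completes.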

> PyPi?

No, PyPy, the alternative Python implementation, which ocrmypdf supports.

@dhdaines

> Upon investigation, I traced the issue back to
>
> page_iter = PDFPage.get_pages(f, pagenos=[pageno], maxpages=0)
>
> The implementation of PDFPage.get_pages iterates over all preceding pages before reaching pageno, so calling it once per page makes the whole scan quadratic in the page count.

Hmm, this is a bug in pdfminer.six. As far as I know, one does have to traverse the page tree in order to get to a page with a particular index (will have to go check the spec), but the problem is that the implementation of get_pages is quite lazy and actually parses all of the pages even when a subset of pages is specified for extraction.

@dhdaines

Oh, well, it turns out that parsing the pages after traversing the page tree isn't any more expensive than just traversing the tree itself, as initializing a PDFPage object just reads the attributes from the page tree and constructs an object. So, your fix is the right one and this doesn't actually need a fix to pdfminer.six.
