PyMuPDF inserting newline characters mid-word #3725

Open
brandenkmurray opened this issue Jul 25, 2024 · 1 comment

@brandenkmurray

Description of the bug

In some cases PyMuPDF inserts newline characters in the middle of words. These newlines do not appear if you copy/paste the text from the PDF or extract it with other libraries.

How to reproduce the bug

wellsfargo-2022-annual-report.pdf

import fitz
import pdftotext
import pdfplumber

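def print_comparison(fn, page):
    # Open the same file with each library; page is a 0-based index in all three.
    # PyMuPDF
    pymupdf_doc = fitz.open(fn)

    # pdftotext reads from a binary file handle
    with open(fn, "rb") as f:
        pdftotext_doc = pdftotext.PDF(f)

    # pdfplumber
    pdfplumber_doc = pdfplumber.open(fn)

    # repr() is used below so that newlines and other whitespace stay visible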

    print("PyMuPDF:\n")
    print(repr(pymupdf_doc[page].get_text()))
    print("\npdftotext:\n")
    print(repr(pdftotext_doc[page]))
    print("\npdfplumber:\n")
    print(repr(pdfplumber_doc.pages[page].extract_text()))


print_comparison('wellsfargo-2022-annual-report.pdf', 132)

PyMuPDF:
' \n \n \n \n \nTable 5.14: TDR Modifications \nPrimary modification type (1) \n \n \n \nFinancial effects of modifications \n \n($ in millions) \n \nPrincipal \nforgiveness \nInterest \nrate \nreduction \nOther \nconcessions \n (2) \nTotal \nCharge-\noffs \n (3) \nWeighted \naverage \ninterest \nrate \nreduction \nRecorded \ninvestment \n \nrelated to \n \ninterest rate \n \nreduction (4) \n \n \n \n \nYear Ended December 31, 2022 \n \n \nCommercial and industrial \n \n$ \n \n24 \n \n24 \n \n349 \n \n397 \n \n— \n \n \n10.69% \n \n$ \n \n24 \n \n \nCommercial real estate \n \n— \n \n12 \n \n112 \n \n124 \n \n— \n \n \n0.92 \n \n12 \n \nLease financing \n \n— \n \n— \n \n2 \n \n2 \n \n— \n \n— \n \n— \n \nTotal commercial \n \n24 \n \n36 \n \n463 \n \n523 \n \n— \n \n \n7.51 \n \n36 \nResidential \n mortgage \n \n1 \n \n \n369 \n \n \n1,357 \n \n \n1,727 \n \n \n6 \n \n \n1.61 \n \n \n369 \n \n \nCredit card \n \n— \n \n311 \n \n— \n \n311 \n \n— \n \n \n20.33 \n \n311 \nAuto \n \n2 \n \n7 \n \n63 \n \n72 \n \n16 \n \n \n4.33 \n \n7 \n \nOther consumer \n \n— \n \n19 \n \n3 \n \n22 \n \n1 \n \n \n11.48 \n \n19 \n \n \nTrial modifications (5) \n \n— \n \n— \n \n228 \n \n228 \n \n— \n \n— \n \n— \n \nTotal consumer \n \n3 \n \n706 \n \n1,651 \n \n2,360 \n \n23 \n \n \n10.14 \n \n706 \nTotal \n \n$ \n \n27 \n \n742 \n \n2,114 \n \n2,883 \n \n23 \n \n \n10.02% \n \n$ \n \n742 \n \n \n \n \nYear Ended December 31, 2021 \n \n \nCommercial and industrial \n \n$ \n \n2 \n \n9 \n \n879 \n \n890 \n \n20 \n \n \n0.81% \n \n$ \n9\n \n \nCommercial real estate \n \n41 \n \n15 \n \n259 \n \n315 \n \n— \n \n \n1.28 \n14\n \nLease financing \n \n— \n \n— \n \n7 \n \n7 \n \n— \n \n— \n—\n \nTotal commercial \n \n43 \n \n24 \n \n1,145 \n \n1,212 \n \n20 \n \n \n1.11 \n23\n \nResidential mortgage \n \n— \n \n70 \n \n1,324 \n \n1,394 \n \n3 \n \n \n1.80 \n70\n \nCredit card \n \n— \n \n106 \n \n— \n \n106 \n \n— \n \n \n19.12 \n106\nAuto \n \n1 \n \n4 \n \n131 \n \n136 \n \n54 \n \n 
\n3.82 \n4\n \nOther consumer \n \n— \n \n18 \n \n1 \n \n19 \n \n— \n \n \n11.83 \n18\n \n \nTrial modifications (5) \n \n— \n \n— \n \n(3) \n \n(3) \n \n— \n \n— \n—\n \nTotal consumer \n \n1 \n \n198 \n \n1,453 \n \n1,652 \n \n57 \n \n \n12.01 \n198\nTotal \n \n$ \n \n44 \n \n222 \n \n2,598 \n \n2,864 \n \n77 \n \n \n10.84% \n \n$ \n221\n \n \n \n \nYear Ended December 31, 2020 \n \n \nCommercial and industrial \n \n$ \n \n24 \n \n47 \n \n2,971 \n \n3,042 \n \n162 \n \n \n0.74% \n \n$ \n48\n \n \nCommercial real estate \n \n10 \n \n35 \n \n684 \n \n729 \n \n5 \n \n \n1.11 \n35\n \nLease financing \n \n— \n \n— \n \n1 \n \n1 \n \n— \n \n— \n—\n \nTotal commercial \n \n34 \n \n82 \n \n3,656 \n \n3,772 \n \n167 \n \n \n0.90 \n83\n \nResidential mortgage \n \n— \n \n25 \n \n4,277 \n \n4,302 \n \n7 \n \n \n1.93 \n51\n \nCredit card \n \n— \n \n272 \n \n— \n \n272 \n \n— \n \n \n14.12 \n272\nAuto \n \n4 \n \n6 \n \n166 \n \n176 \n \n93 \n \n \n4.65 \n6\n \nOther consumer \n \n— \n \n23 \n \n34 \n \n57 \n \n1 \n \n \n8.28 \n23\n \n \nTrial modifications (5) \n \n— \n \n— \n \n3 \n \n3 \n \n— \n \n— \n—\n \nTotal consumer \n \n4 \n \n326 \n \n4,480 \n \n4,810 \n \n101 \n \n \n11.80 \n352\nTotal \n \n$ \n \n38 \n \n408 \n \n8,136 \n \n8,582 \n \n268 \n \n \n9.73% \n \n$ \n435\n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n \n(1) \nAmounts r\n epresent \n 
the r\n ecorded \n investment \n in loa\n \nns a\n fter \n recognizing \n the effect\n \ns of t\n \n he TD\n \nR, \n if a\n ny. \n TDRs m\n \nay \n have m\n \nultiple t\n ypes of concessions, \n \n \n but \n are p\n resented \n only \n once in t\n he fir\n \nst \n \nmodification t\n ype b\n ased \n on t\n he or\n \nder \n presented \n in t\n he t\n able a\n bove. \n The r\n eported \n amounts inclu\n \nde loa\n \nns r\n emodified \n of \n $445 m\n \nillion, \n $737 m\n \nillion, \n and \n $1.5 b\n illion \n for \n the y\n ears end\n \ned \n December \n 31, \n \n2022, \n 2021 a\n nd \n 2020, \n respectively. \n(2) \nOther \n concessions inclu\n \nde loa\n \nns wit\n \nh \n payment \n (principal a\n nd/or \n interest) d\n eferral, \n loans d\n ischarged \n in b\n ankruptcy, \n loan r\n enewals, \n term \n extensions a\n nd \n other \n interest \n and \n noninterest \n adjustments, \n but \n \nexclude m\n \nodifications t\n hat \n also for\n \ngive p\n rincipal a\n nd/or \n reduce t\n he cont\n \nractual int\n \nerest \n rate. \n The r\n eported \n amounts inclu\n \nde loa\n \nns t\n hat \n are new TD\n \n \nRs t\n hat \n may \n have C\n OVID-19-related \n payment \n \ndeferrals a\n nd \n exclude C\n OVID-19-related \n payment \n deferrals on loa\n \n \nns p\n reviously \n reported \n as TD\n \nRs g\n iven lim\n \nited \n current \n financial effect\n \ns ot\n \nher \n than p\n ayment \n deferral. \n \n(3) \nCharge-offs inclu\n \nde wr\n \nite-downs of t\n \n he inv\n \nestment \n in t\n he loa\n \nn in t\n he p\n eriod \n it \n is cont\n \nractually \n modified. \n The a\n mount \n of ch\n \narge-off will d\n \n iffer \n from \n the m\n \nodification t\n erms if t\n he loa\n \nn h\n as b\n een ch\n \narged \n \ndown p\n rior \n to t\n he m\n \nodification b\n ased \n on ou\n \nr \n policies. \n In a\n ddition, \n there m\n \nay \n be ca\n \nses wh\n \nere we h\n \n ave a \n charge-off/down wit\n \nh \n no leg\n \nal p\n rincipal m\n \nodification. 
\n \n(4) \nRecorded \n investment \n related \n to int\n \nerest \n rate r\n eduction r\n eflects t\n he effect \n \n of r\n educed \n interest \n rates on loa\n \n \nns wit\n \nh \n an int\n \nerest \n rate concession a\n \n s one of t\n \n \n heir \n concession t\n ypes, \n which \n includes loa\n \nns \n \nreported \n as a \n principal p\n rimary \n modification t\n ype t\n hat \n also h\n ave a\n n int\n \nerest \n rate concession. \n \n(5) \nTrial m\n \nodifications a\n re g\n ranted \n a \n delay \n in p\n ayments d\n ue u\n nder \n the or\n \niginal t\n erms d\n uring \n the t\n rial p\n ayment \n period. \n However, \n these loa\n \nns cont\n \ninue t\n o a\n dvance t\n hrough \n delinquency \n status a\n nd \n accrue \n \ninterest \n according \n to t\n heir \n original t\n erms. \n Any \n subsequent \n permanent \n modification g\n enerally \n includes int\n \nerest \n rate r\n elated \n concessions; \n however, \n the exa\n \nct \n concession t\n ype a\n nd \n resulting \n financial \n \neffect \n are u\n sually \n not \n known u\n ntil t\n he loa\n \nn is p\n ermanently \n modified. \n Trial m\n \nodifications for \n \n the p\n eriod \n are p\n resented \n net \n of p\n reviously \n reported \n trial m\n \nodifications t\n hat \n became p\n ermanent \n in t\n he \n \ncurrent \n period. \nWells Fargo & Company \n123 \n'

pdftotext:
'Table 5.14: TDR Modifications\nPrimary modification type (1)\n\nPrincipal\nforgiveness\n\n($ in millions)\n\nInterest\nrate\nreduction\n\nOther\nconcessions (2)\n\nTotal\n\nFinancial effects of modifications\n\nChargeoffs (3)\n\nWeighted\naverage\ninterest\nrate\nreduction\n\nRecorded\ninvestment\nrelated to\ninterest rate\nreduction (4)\n\nYear Ended December 31, 2022\n24\n\n24\n\n349\n\n397\n\n—\n\n10.69%\n\nCommercial real estate\n\nCommercial and industrial\n\n$\n\n—\n\n12\n\n112\n\n124\n\n—\n\n0.92\n\nLease financing\n\n—\n\n—\n\n2\n\n2\n\n—\n\n—\n\n—\n\n24\n\n36\n\n463\n\n523\n\n—\n\n7.51\n\n36\n\nTotal commercial\n\n$\n\n24\n12\n\nResidential mortgage\n\n1\n\n369\n\n1,357\n\n1,727\n\n6\n\n1.61\n\n369\n\nCredit card\n\n—\n\n311\n\n—\n\n311\n\n—\n\n20.33\n\n311\n\nAuto\n\n2\n\n7\n\n63\n\n72\n\n16\n\n4.33\n\n7\n\nOther consumer\n\n—\n\n19\n\n3\n\n22\n\n1\n\n11.48\n\n19\n\nTrial modifications (5)\n\n—\n\n—\n\n228\n\n228\n\n—\n\n—\n\n—\n\nTotal consumer\n\n3\n\n706\n\n1,651\n\n2,360\n\n23\n\n10.14\n\n706\n\n27\n\n742\n\n2,114\n\n2,883\n\n23\n\n10.02%\n\n$\n\n2\n\n9\n\n879\n\n890\n\n20\n\n0.81%\n\n$\n\n41\n\n15\n\n259\n\n315\n\n—\n\n1.28\n\nTotal\n\n$\n\n742\n\nYear Ended December 31, 2021\nCommercial and industrial\n\n$\n\nCommercial real estate\nLease financing\n\n9\n14\n\n—\n\n—\n\n7\n\n7\n\n—\n\n—\n\n—\n\nTotal commercial\n\n43\n\n24\n\n1,145\n\n1,212\n\n20\n\n1.11\n\n23\n\nResidential mortgage\n\n—\n\n70\n\n1,324\n\n1,394\n\n3\n\n1.80\n\n70\n\nCredit card\n\n—\n\n106\n\n—\n\n106\n\n—\n\n19.12\n\n106\n\nAuto\n\n1\n\n4\n\n131\n\n136\n\n54\n\n3.82\n\n4\n\nOther consumer\n\n—\n\n18\n\n1\n\n19\n\n—\n\n11.83\n\n18\n\nTrial modifications (5)\n\n—\n\n—\n\n(3)\n\n(3)\n\n—\n\n—\n\n—\n\nTotal consumer\n\n1\n\n198\n\n1,453\n\n1,652\n\n57\n\n12.01\n\n198\n\n44\n\n222\n\n2,598\n\n2,864\n\n77\n\n10.84%\n\n$\n\n221\n\n$\n\n48\n\nTotal\n\n$\n\nYear Ended December 31, 2020\n24\n\n47\n\n2,971\n\n3,042\n\n162\n\n0.74%\n\nCommercial real estate\n\nCommercial and 
industrial\n\n10\n\n35\n\n684\n\n729\n\n5\n\n1.11\n\nLease financing\n\n—\n\n—\n\n1\n\n1\n\n—\n\n—\n\n—\n\nTotal commercial\n\n34\n\n82\n\n3,656\n\n3,772\n\n167\n\n0.90\n\n83\n\nResidential mortgage\n\n—\n\n25\n\n4,277\n\n4,302\n\n7\n\n1.93\n\n51\n\nCredit card\n\n—\n\n272\n\n—\n\n272\n\n—\n\n14.12\n\n272\n\n(2)\n(3)\n(4)\n(5)\n\n35\n\nAuto\n\n4\n\n6\n\n166\n\n176\n\n93\n\n4.65\n\n6\n\nOther consumer\n\n—\n\n23\n\n34\n\n57\n\n1\n\n8.28\n\n23\n\nTrial modifications (5)\n\n—\n\n—\n\n3\n\n3\n\n—\n\n—\n\n—\n\nTotal consumer\n\n4\n\n326\n\n4,480\n\n4,810\n\n101\n\n11.80\n\n352\n\n38\n\n408\n\n8,136\n\n8,582\n\n268\n\n9.73%\n\nTotal\n(1)\n\n$\n\n$\n\n$\n\n435\n\nAmounts represent the recorded investment in loans after recognizing the effects of the TDR, if any. TDRs may have multiple types of concessions, but are presented only once in the first\nmodification type based on the order presented in the table above. The reported amounts include loans remodified of $445 million, $737 million, and $1.5 billion for the years ended December 31,\n2022, 2021 and 2020, respectively.\nOther concessions include loans with payment (principal and/or interest) deferral, loans discharged in bankruptcy, loan renewals, term extensions and other interest and noninterest adjustments, but\nexclude modifications that also forgive principal and/or reduce the contractual interest rate. The reported amounts include loans that are new TDRs that may have COVID-19-related payment\ndeferrals and exclude COVID-19-related payment deferrals on loans previously reported as TDRs given limited current financial effects other than payment deferral.\nCharge-offs include write-downs of the investment in the loan in the period it is contractually modified. The amount of charge-off will differ from the modification terms if the loan has been charged\ndown prior to the modification based on our policies. 
In addition, there may be cases where we have a charge-off/down with no legal principal modification.\nRecorded investment related to interest rate reduction reflects the effect of reduced interest rates on loans with an interest rate concession as one of their concession types, which includes loans\nreported as a principal primary modification type that also have an interest rate concession.\nTrial modifications are granted a delay in payments due under the original terms during the trial payment period. However, these loans continue to advance through delinquency status and accrue\ninterest according to their original terms. Any subsequent permanent modification generally includes interest rate related concessions; however, the exact concession type and resulting financial\neffect are usually not known until the loan is permanently modified. Trial modifications for the period are presented net of previously reported trial modifications that became permanent in the\ncurrent period.\n\nWells Fargo & Company\n\n123\n\n\x0c'

pdfplumber:
'Table 5.14: TDR Modifications\nPrimary modification type (1) Financial effects of modifications\nWeighted Recorded\naverage investment\nInterest interest related to\nPrincipal rate Other Charge- rate interest rate\n($ in millions) forgiveness reduction concessions (2) Total offs (3) reduction reduction (4)\nYear Ended December 31, 2022\nCommercial and industrial $ 24 24 349 397 — 10.69% $ 24\nCommercial real estate — 12 112 124 — 0.92 12\nLease financing — — 2 2 — — —\nTotal commercial 24 36 463 523 — 7.51 36\nResidential mortgage 1 369 1,357 1,727 6 1.61 369\nCredit card — 311 — 311 — 20.33 311\nAuto 2 7 63 72 16 4.33 7\nOther consumer — 19 3 22 1 11.48 19\nTrial modifications (5) — — 228 228 — — —\nTotal consumer 3 706 1,651 2,360 23 10.14 706\nTotal $ 27 742 2,114 2,883 23 10.02% $ 742\nYear Ended December 31, 2021\nCommercial and industrial $ 2 9 879 890 20 0.81% $ 9\nCommercial real estate 41 15 259 315 — 1.28 14\nLease financing — — 7 7 — — —\nTotal commercial 43 24 1,145 1,212 20 1.11 23\nResidential mortgage — 70 1,324 1,394 3 1.80 70\nCredit card — 106 — 106 — 19.12 106\nAuto 1 4 131 136 54 3.82 4\nOther consumer — 18 1 19 — 11.83 18\nTrial modifications (5) — — (3) (3) — — —\nTotal consumer 1 198 1,453 1,652 57 12.01 198\nTotal $ 44 222 2,598 2,864 77 10.84% $ 221\nYear Ended December 31, 2020\nCommercial and industrial $ 24 47 2,971 3,042 162 0.74% $ 48\nCommercial real estate 10 35 684 729 5 1.11 35\nLease financing — — 1 1 — — —\nTotal commercial 34 82 3,656 3,772 167 0.90 83\nResidential mortgage — 25 4,277 4,302 7 1.93 51\nCredit card — 272 — 272 — 14.12 272\nAuto 4 6 166 176 93 4.65 6\nOther consumer — 23 34 57 1 8.28 23\nTrial modifications (5) — — 3 3 — — —\nTotal consumer 4 326 4,480 4,810 101 11.80 352\nTotal $ 38 408 8,136 8,582 268 9.73% $ 435\n(1) Amounts represent the recorded investment in loans after recognizing the effects of the TDR, if any. 
TDRs may have multiple types of concessions, but are presented only once in the first\nmodification type based on the order presented in the table above. The reported amounts include loans remodified of $445 million, $737 million, and $1.5 billion for the years ended December 31,\n2022, 2021 and 2020, respectively.\n(2) Other concessions include loans with payment (principal and/or interest) deferral, loans discharged in bankruptcy, loan renewals, term extensions and other interest and noninterest adjustments, but\nexclude modifications that also forgive principal and/or reduce the contractual interest rate. The reported amounts include loans that are new TDRs that may have COVID-19-related payment\ndeferrals and exclude COVID-19-related payment deferrals on loans previously reported as TDRs given limited current financial effects other than payment deferral.\n(3) Charge-offs include write-downs of the investment in the loan in the period it is contractually modified. The amount of charge-off will differ from the modification terms if the loan has been charged\ndown prior to the modification based on our policies. In addition, there may be cases where we have a charge-off/down with no legal principal modification.\n(4) Recorded investment related to interest rate reduction reflects the effect of reduced interest rates on loans with an interest rate concession as one of their concession types, which includes loans\nreported as a principal primary modification type that also have an interest rate concession.\n(5) Trial modifications are granted a delay in payments due under the original terms during the trial payment period. However, these loans continue to advance through delinquency status and accrue\ninterest according to their original terms. Any subsequent permanent modification generally includes interest rate related concessions; however, the exact concession type and resulting financial\neffect are usually not known until the loan is permanently modified. 
Trial modifications for the period are presented net of previously reported trial modifications that became permanent in the\ncurrent period.\nWells Fargo & Company 123'

The footnote text in this example looks fine with pdfplumber and pdftotext, but PyMuPDF outputs text like (1) \nAmounts r\n epresent \n the r\n ecorded \n investment \n in loa\n \nns a\n fter \n recognizing \n the effect\n \ns of t\n \n he TD\n \nR, \n if a\n ny. with \n scattered throughout.
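To make the symptom measurable, here is a minimal sketch of a heuristic that counts and rejoins the suspected mid-word breaks. It rests on an assumption drawn only from the sample output above: genuine line ends carry a trailing space before the newline, while a mid-word split has a letter immediately before the newline and a letter after any following whitespace. The names `count_midword_breaks` and `repair_midword_breaks` are hypothetical. This is not PyMuPDF's logic and not the contents of any maintainer script.

```python
import re

# Assumed pattern (from the sample output only): a newline directly after a
# letter, then optional whitespace, then another letter, marks a split word.
_MIDWORD_BREAK = re.compile(r"(?<=[A-Za-z])\n\s*(?=[A-Za-z])")


def count_midword_breaks(text: str) -> int:
    """Count newlines that appear to fall inside a word."""
    return len(_MIDWORD_BREAK.findall(text))


def repair_midword_breaks(text: str) -> str:
    """Rejoin split words, then collapse all remaining whitespace to single
    spaces. Note that this discards the original line structure."""
    joined = _MIDWORD_BREAK.sub("", text)
    return " ".join(joined.split())


sample = "(1) \nAmounts r\n epresent \n the r\n ecorded \n investment \n in loa\n \nns a\n fter \n"
print(count_midword_breaks(sample))   # -> 4
print(repair_midword_breaks(sample))  # -> (1) Amounts represent the recorded investment in loans after
```

The heuristic would wrongly join two separate words if a real line ended in a letter with no trailing space, so it is only suited to diagnosing output that looks like the sample above.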

PyMuPDF version

1.24.9

Operating system

Linux

Python version

3.10

@JorjMcKie
Collaborator

As announced in my e-mail, here is a script that can be used as a workaround while the team is working on a final solution.
repair-words.zip
