insert, delete pages and manage bookmarks #153

Open: wants to merge 2 commits into base: master

plasticassius

It took me a while to figure out how to use pdfrw, so I thought this example would be useful to show how pages and bookmarks can be manipulated.

If you're interested in this, I can add some documentation to it.

@PeterSlezak

@plasticassius I find your code very useful and have used it in my project, but I can imagine cases where it would fail (see below if you're interested).
It doesn't handle all the ways outlines can be implemented in a PDF. (I can provide some example PDF files.) import_outlines only looks for /Dest (the destination to be displayed when the item is activated), but the same result can be achieved with the /A entry in the outline item dictionary. /Dest and /A are mutually exclusive: when /Dest is present, /A is not permitted, and vice versa.
There are several action (/A) types (see table 8.48 on page 653 of the PDF Reference 1.7). Fortunately, only the /GoTo action refers to a destination in the current document, so that's the only one that needs to be checked when creating the TOC. (JavaScript can also be used to jump to a page, but JavaScript can do virtually anything with a PDF, so we can neglect that.)

Currently you are only checking:
if reader.pages[page] is elt.Dest[0]

It could also add a condition to see whether there is a GoTo action:
if elt.A.S == PdfName.GoTo:

and then check whether it refers to a page in the document:
reader.pages[page] is elt.A.D[0]

However, /D (the destination to jump to) can be not only an explicit destination array whose first element is an indirect reference to a page object, but also a named destination (stored in the Root.Names.Dests ... arrays; the ellipsis means that the actual structure can vary, since the array can be spread across multiple objects).

I probably haven't covered everything, as I would need to brush up on my knowledge of PDF outlines. Anyway, I would greatly appreciate it if you continued to develop this code.
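
The checks described above could be combined roughly as follows. This is a minimal sketch, not the code in the pull request: plain Python dicts and lists stand in for pdfrw objects so that it runs on its own, and the helper names `outline_target` and `page_index` are hypothetical.

```python
def outline_target(item):
    """Return the raw destination of an outline item, or None.

    `item` is a plain dict standing in for an outline item dictionary.
    The result is either an explicit destination (a list whose first
    element is the page object) or a named destination (a string).
    """
    dest = item.get('/Dest')
    if dest is not None:
        return dest
    action = item.get('/A')
    # /Dest and /A are mutually exclusive; only GoTo actions point
    # at a destination in the current document.
    if action is not None and action.get('/S') == '/GoTo':
        return action.get('/D')
    return None


def page_index(item, pages):
    """Return the index of the page an outline item points to, or None."""
    dest = outline_target(item)
    if isinstance(dest, list) and dest:
        for index, page in enumerate(pages):
            if page is dest[0]:  # identity check, as in the snippets above
                return index
    return None  # named destination, or nothing resolvable


# Hypothetical stand-in data: two pages and three outline items.
pages = [{'/Type': '/Page'}, {'/Type': '/Page'}]
via_dest = {'/Title': 'One', '/Dest': [pages[0], '/Fit']}
via_goto = {'/Title': 'Two', '/A': {'/S': '/GoTo', '/D': [pages[1], '/Fit']}}
via_name = {'/Title': 'Three', '/A': {'/S': '/GoTo', '/D': 'section.1.1'}}

print(page_index(via_dest, pages))  # 0
print(page_index(via_goto, pages))  # 1
print(page_index(via_name, pages))  # None
```

The named-destination case returns None here on purpose: resolving the string to a page needs a lookup in the document's name tree, which this sketch does not attempt.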

@plasticassius
Author

I haven't come across a PDF with this type of bookmark, but if you have one, I can add this check and see whether it works.

One issue I would like to handle more gracefully is merging documents with form fields. In particular when I merge documents with digital signatures, the resulting document doesn't play well with PDF-Xchange. The approach I've taken is to manually remove the digital signatures with a text editor and to check for the presence of form fields in the utility.

I have to admit that reading and understanding PDF reference 1.7 is more than I want to handle, particularly since I don't run into any examples of most of it.

@PeterSlezak

@plasticassius Here are example PDF files with such outlines:

  1. outlines_action_goto.pdf - I took it from the "[Feature Request] Ability to add/remove bookmarks #52" discussion here.
  2. sample_nametrees.pdf - the original is from here. It is similar to the previous one, but with named destinations.

> One issue I would like to handle more gracefully is merging documents with form fields. In particular when I merge documents with digital signatures, the resulting document doesn't play well with PDF-Xchange. The approach I've taken is to manually remove the digital signatures with a text editor and to check for the presence of form fields in the utility.

I have only a little experience with signatures, but if you provide more details I can see whether I can be of any help. Do you have the issue only with PDF-Xchange? Can you share the PDF file that causes the issue?

@plasticassius
Author

I looked at your sample_nametrees.pdf file, and reader.pages[page] is elt.A.D[0] doesn't pick up anything. It seems that there is another layer of indirection, so that elt.A.D[0] doesn't refer to a page. It seems that there is more complexity involved in these actions.

http://www.tecxoft.com/samples.php has some examples of signed PDFs. The problem is that some tools, including PDF-Xchange, treat them as read-only, which makes them problematic to work with.

@PeterSlezak

The files probably look read-only because some document changes are not permitted.


What exactly is your goal? If you want to modify the file but preserve a valid signature, I don't think that's possible. If you don't care about the signature, it should work: you can concatenate two files or remove the signature field. See the example code below.

Concatenating works for me, although the signature becomes invalid. I used the following files (1, 2):


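```python
from pdfrw import PdfReader, PdfWriter

inpfn = 'pdf_digital_signature_timestamp.pdf'
inpfn1 = 'sample01.pdf'
outfn = 'merged.pdf'

# Append the pages of both input files to a single output document.
out = PdfWriter()
out.addpages(PdfReader(inpfn).pages)
out.addpages(PdfReader(inpfn1).pages)
out.write(outfn)
```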

If you don't care about the signature, you can remove it (I used the following PDF):


```python
from pdfrw import PdfArray, PdfReader, PdfWriter

inpfn = 'pdf_digital_signature_timestamp.pdf'
outfn = 'out.pdf'

pages = PdfReader(inpfn).pages
# Remove the signature field. The first page contains two annotations:
# (5, 0) is a link annotation and (27, 0) is the signature field,
# so keep just the first one.
pages[0].Annots = PdfArray(pages[0].Annots[:1])

out = PdfWriter()
out.addpages(pages)
out.write(outfn)
```

@PeterSlezak

> I looked at your sample_nametrees.pdf file, and reader.pages[page] is elt.A.D[0] doesn't pick up anything. It seems that there is another layer of indirection, so that elt.A.D[0] doesn't refer to a page. It seems that there is more complexity involved in these actions.

sample_nametrees.pdf contains named destinations. So elt.A.D holds a literal string (the name of the destination) rather than a destination array, and the correspondence between strings and destinations is defined by the /Dests entry in the document's name dictionary (Root.Names.Dests). The value of this entry is a name tree mapping name strings to destinations.

In sample_nametrees.pdf, Root.Names.Dests has 5 kids, each containing a PDF array of name/object pairs (the actual structure can vary between files). Selected objects from sample_nametrees.pdf:

trailer
<<
/Size 223
/Root 221 0 R
/Info 222 0 R
>>

221 0 obj <<
/Type /Catalog
/Pages 212 0 R
/Outlines 213 0 R
/Names 220 0 R
/ViewerPreferences << >>
/OpenAction 58 0 R
>> endobj

220 0 obj <<
/Dests 219 0 R
>> endobj

219 0 obj <<
/Kids [214 0 R 215 0 R 216 0 R 217 0 R 218 0 R]
>> endobj

214 0 obj <<
/Names [(Doc-Start) 59 0 R (Item.1) 101 0 R (Item.2) 102 0 R (Item.3) 103 0 R (Item.4) 104 0 R (Item.5) 105 0 R]
/Limits [(Doc-Start) (Item.5)]
>> endobj
215 0 obj <<
/Names [(chapter.1) 5 0 R (name1) 126 0 R (page.1) 60 0 R (page.10) 156 0 R (page.2) 67 0 R (page.3) 73 0 R]
/Limits [(chapter.1) (page.3)]
>> endobj
216 0 obj <<
/Names [(page.4) 95 0 R (page.5) 106 0 R (page.6) 118 0 R (page.7) 127 0 R (page.8) 136 0 R (page.9) 142 0 R]
/Limits [(page.4) (page.9)]
>> endobj
217 0 obj <<
/Names [(section.1.1) 9 0 R (section.1.2) 21 0 R (section.1.3) 37 0 R (subsection.1.1.1) 13 0 R (subsection.1.1.2) 17 0 R (subsection.1.2.1) 25 0 R]
/Limits [(section.1.1) (subsection.1.2.1)]
>> endobj

218 0 obj <<
/Names [(subsection.1.2.2) 29 0 R (subsection.1.2.3) 33 0 R (subsection.1.3.1) 41 0 R (subsection.1.3.2) 45 0 R (subsection.1.3.3) 49 0 R (subsection.1.3.4) 53 0 R]
/Limits [(subsection.1.2.2) (subsection.1.3.4)]
>> endobj

A name tree node may have three different entries:

  • Kids - array - (Root and intermediate nodes only; required in intermediate nodes; present in the root node if and only if Names is not present) An array of indirect references to the immediate children of this node. The children may be intermediate or leaf nodes.
  • Names - array - (Root and leaf nodes only; required in leaf nodes; present in the root node if and only if Kids is not present) An array of the form
    [ key_1 value_1 key_2 value_2 … key_n value_n ]
    where each key_i is a string and the corresponding value_i is the object associated with that key. The keys are sorted in lexical order, as described below.
  • Limits - array - (Intermediate and leaf nodes only; required) An array of two strings, specifying the (lexically) least and greatest keys included in the Names array of a leaf node or in the Names arrays of any leaf nodes that are descendants of an intermediate node.
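
A lookup over the Kids / Names / Limits entries described above might be sketched like this. Plain dicts and strings stand in for the PDF objects, the function name `lookup_name` is hypothetical, and the sample tree is an abbreviated stand-in loosely modelled on the dump above (its Limits are adjusted to match the shortened Names arrays).

```python
def lookup_name(node, key):
    """Look up `key` in a name tree; return its value or None.

    `node` is a plain dict standing in for a name tree node. Leaf nodes
    carry '/Names' ([key1, value1, key2, value2, ...]); root and
    intermediate nodes may carry '/Kids' instead.
    """
    names = node.get('/Names')
    if names is not None:
        for i in range(0, len(names) - 1, 2):
            if names[i] == key:
                return names[i + 1]
        return None
    for kid in node.get('/Kids', []):
        limits = kid.get('/Limits')
        # Skip subtrees whose key range cannot contain the key.
        if limits and not (limits[0] <= key <= limits[1]):
            continue
        found = lookup_name(kid, key)
        if found is not None:
            return found
    return None


# Abbreviated stand-in tree; the values are placeholder strings,
# not real object references.
dests = {'/Kids': [
    {'/Names': ['Doc-Start', '59 0 R', 'Item.1', '101 0 R'],
     '/Limits': ['Doc-Start', 'Item.1']},
    {'/Names': ['section.1.1', '9 0 R', 'section.1.2', '21 0 R'],
     '/Limits': ['section.1.1', 'section.1.2']},
]}

print(lookup_name(dests, 'section.1.2'))  # 21 0 R
print(lookup_name(dests, 'missing'))      # None
```

The linear scan of each leaf keeps the sketch short; because the keys are sorted, a binary search would also work.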

@plasticassius
Author

I don't want to modify signatures, rather I combine pdfs into other pdfs. However, when one of the component pdfs has a signature, the entire result gets marked as read only by some tools. This can be fixed by removing the signatures in the first place. This is what I mean by handling signatures more gracefully.

At this point I don't have a need to handle destinations, named tree objects, ...

@PeterSlezak

PeterSlezak commented Dec 15, 2018

I think I get your point now. The issue may be caused by pdfrw.PdfWriter. It changes the content of the PDF file (by "file" I mean the bytes of the binary PDF file) even when you have made no changes to the document (by "document" I mean the content you see when you open the PDF file in a viewer). Try reading a PDF that contains a signature field with PdfReader and then writing it back to a new PDF file with PdfWriter. The file size may change, as may the order of the objects in the file, because the writing process does not follow the original layout. That makes the byte values in the signature dictionary (in its ByteRange entry) incorrect. I assume that applications cannot handle that correctly. I tested with Adobe Acrobat Pro 11, which became unresponsive when I tried to check the advanced signature properties, so I had to kill it. PDF-Xchange Viewer just says that the signature is invalid because the file was modified or corrupted. Foxit doesn't show the fields at all. So it depends on the implementation.
To work correctly in your setup, PdfWriter would need to be modified to preserve the original file and write changes as an incremental update at the end of the file. That applies at least to the Tecxoft example files, since they implement the digital signature using a "byte range digest". pdfrw might work with an "object digest" (I haven't tested).
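
To illustrate why rewriting the file breaks a byte-range digest, here is a small sketch. It assumes ByteRange is a flat list of (offset, length) pairs naming the bytes that are hashed. The byte strings are made up, the function name `byte_range_digest` is hypothetical, and SHA-256 is an arbitrary choice for the example rather than what any particular signature uses.

```python
import hashlib


def byte_range_digest(data, byte_range):
    """Hash the bytes selected by a ByteRange-style list.

    `byte_range` is a flat list of (offset, length) pairs,
    e.g. [0, 10, 20, 5].
    """
    digest = hashlib.sha256()
    for i in range(0, len(byte_range), 2):
        offset, length = byte_range[i], byte_range[i + 1]
        digest.update(data[offset:offset + length])
    return digest.hexdigest()


# Made-up stand-in for a signed file: the gap between the two ranges
# is where the signature value itself would live.
original = b'HEADER-OBJ1-OBJ2-<SIGNATURE>-OBJ3-TRAILER'
gap_start = original.index(b'<')
gap_end = original.index(b'>') + 1
byte_range = [0, gap_start, gap_end, len(original) - gap_end]

# The same objects written in a different order, as a rewriting tool
# might do.
rewritten = b'HEADER-OBJ2-OBJ1-<SIGNATURE>-OBJ3-TRAILER'

print(byte_range_digest(original, byte_range) ==
      byte_range_digest(rewritten, byte_range))  # False
```

Even though the two byte strings contain the same pieces, the digest over the recorded ranges no longer matches once the bytes move.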

@plasticassius
Author

Actually, I think there's no practical way to modify a pdf and preserve a functional signature. The whole point of a signature is to be able to identify the file as original and unmodified. The problem is rather that if multiple documents are merged and it turns out that one or more of them have signatures, the invalid signatures in the merged document often cause failures in other software. The best solution I can think of is to remove the signatures. Since they've been invalidated, they have no useful purpose.
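
Dropping signature fields from a page's annotations before merging could look roughly like this. It is a sketch under stated assumptions: plain dicts stand in for annotation dictionaries, the helper name `without_signatures` is hypothetical, and the field type /FT is assumed to be set on the widget itself (a type inherited from a parent field is not handled).

```python
def without_signatures(annots):
    """Return the annotations that are not signature fields.

    `annots` is a list of plain dicts standing in for annotation
    dictionaries; a signature field is recognised by /FT == /Sig.
    """
    return [a for a in annots if a.get('/FT') != '/Sig']


# Hypothetical stand-in for a page with a link and a signature widget.
page_annots = [
    {'/Subtype': '/Link'},
    {'/Subtype': '/Widget', '/FT': '/Sig'},
]

print(without_signatures(page_annots))  # [{'/Subtype': '/Link'}]
```

Filtering by field type avoids hard-coding object numbers, so the same check can run on every page of every input file.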

@PeterSlezak

The only use I can think of is that when a signed document is saved with incremental updates, it is possible to undo the changes and recreate the document as it existed at the time it was signed.
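
As a rough sketch of that idea: if the signature's ByteRange has the usual form [0, len1, offset2, len2], the signed revision is assumed to end at offset2 + len2, and anything after that point is a later incremental update. The function name `signed_revision` and the byte strings below are made up for illustration.

```python
def signed_revision(data, byte_range):
    """Return the bytes of the revision covered by a signature.

    Assumes `byte_range` is [0, len1, offset2, len2] and that the
    signed revision ends at offset2 + len2.
    """
    end = byte_range[2] + byte_range[3]
    return data[:end]


# Made-up stand-in for a signed file and its ByteRange.
signed = b'%PDF-REV1 <SIG> %%EOF\n'
sig_start = signed.index(b'<')
sig_end = signed.index(b'>') + 1
byte_range = [0, sig_start, sig_end, len(signed) - sig_end]

# An incremental update appended after signing.
updated = signed + b'REV2 %%EOF\n'

print(signed_revision(updated, byte_range) == signed)  # True
```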

@plasticassius
Author

It took me a while to get around to it, but I expanded the ability to read outlines to include GoTo types like those @PeterSlezak mentioned previously. Have a look if it's not too late for your purposes.
