Using Python to add bookmarks (a table of contents) to a PDF


Downloaded PDFs that are scanned copies often have no bookmarks (outline), which makes them inconvenient to read. Here is a semi-automatic Python script for adding bookmarks to such a PDF.

1.1. Install PyPDF2

pip install pypdf2

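To avoid errors when running the script later, use a Python version earlier than 3.7 (for example 3.6). The script also uses the older PyPDF2 API (PdfFileReader, PdfFileWriter, addBookmark), which was removed in PyPDF2 3.0, so it needs a PyPDF2 release earlier than 3.0.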

1.2. Extract the PDF's table of contents and save it to a txt file

This step is tedious and has to be done by hand. Typically you can run an OCR tool on the table-of-contents pages, or convert those pages to Word, and then tidy the result into the following txt format:

Each line contains three items separated by spaces: level, title, and page
The number of "." characters in the level determines the bookmark depth, for example:
- "Chapter 1" contains 0 "." and is a first-level title
- "1.1" contains 1 "." and is a second-level title
- "1.1.1" contains 2 "." and is a third-level title
- ... (and so on)
Do not leave any extra blank lines

Here is my txt:

Chapter 1 Introduction 1 
1.1 Purpose of this book 1 
1.2 Main challenges of information fusion 5 
1.3 Why random sets or FISST 5 
1.3.1 Complexity of multi-objective filtering 6 
1.3.2 Transcendental heuristics 7 
1.3.3 Differences between single objective and multi-objective statistics 7 
1.3.4 Differences between conventional data and fuzzy data 7 
1.3.5 Formal Bayesian modeling 8 
1.3.6 Fuzzy information modeling 8 
1.3.7 Multi source and multi-objective formal modeling 9
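Before running the full script, it can help to check that every line of the txt file parses as expected. The following is a minimal sketch, not part of the original script: the names TocEntry, parse_toc_line and parse_toc are hypothetical, and it assumes the first item on a line is the level, the last item is the page, and everything in between belongs to the title.

```python
from typing import List, NamedTuple


class TocEntry(NamedTuple):
    depth: int   # 1 for "Chapter 1", 2 for "1.1", 3 for "1.1.1", ...
    title: str   # full bookmark text, e.g. "1.1 Purpose of this book"
    page: int    # page number as printed in the table of contents


def parse_toc_line(line: str, levelmark: str = ".") -> TocEntry:
    """Parse one 'level title page' line; the title may contain spaces."""
    parts = line.split()  # splits on any whitespace, ignores trailing spaces
    if len(parts) < 3:
        raise ValueError("expected 'level title page', got: %r" % line)
    level = parts[0]
    page = int(parts[-1])                 # raises ValueError if not a number
    depth = level.count(levelmark) + 1    # 0 marks -> depth 1, 1 mark -> depth 2
    return TocEntry(depth, " ".join(parts[:-1]), page)


def parse_toc(text: str, levelmark: str = ".") -> List[TocEntry]:
    """Parse the whole file content, skipping blank lines."""
    return [parse_toc_line(line, levelmark)
            for line in text.splitlines() if line.strip()]


sample = """Chapter 1 Introduction 1
1.1 Purpose of this book 1
1.3.1 Complexity of multi-objective filtering 6
"""
for entry in parse_toc(sample):
    print(entry)
```

A line that does not end in a page number raises a ValueError here, which makes typos left over from OCR easier to spot before any PDF is touched.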

1.3. Programming implementation


import PyPDF2
import sys

class PdfDirGenerator:

    def __init__(self, pdf_path:str, txt_path:str, offset:int, out_path:str=None, levelmark:str='.'):
        
        self.pdf_path = pdf_path    # pdf path
        self.txt_path = txt_path    # txt containing pdf directory information
        self.offset = offset        # Table of contents page offset
        self.out_path = out_path    # Output path
        self.levelmark = levelmark  # Flag used to determine bookmark level
    
          
        self.dir_parent = [None]    

    def getLevelId(self, level):
        """Return the bookmark level (counted from 1) based on the number of level marks (a dot by default).
        Level 1: 0 marks, for example: Chapter 1, Appendix A
        Level 2: 1 mark, for example: 1.1, A.1
        Level 3: 2 marks, for example: 2.1.3
        """
        mark_num = 0
        for c in level:
            if c == self.levelmark:
                mark_num += 1
        return mark_num + 1

    def run(self):
        
        print("--------------------------- Adding the bookmark ---------------------------")
        print(" * PDF Source: %s" % self.pdf_path)
        print(" * TXT Source: %s" % self.txt_path)
        print(" * Offset: %d" % self.offset)
        print("---------------------------------------------------------------------------")
        with open(self.txt_path, 'r', encoding='utf-8') as txt:
            
            pdf_reader = PyPDF2.PdfFileReader(self.pdf_path)
            pdf_writer = PyPDF2.PdfFileWriter()
            
            pdf_writer.cloneDocumentFromReader(pdf_reader)
            # BUG: ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list
            # Fix: modify the getOutlineRoot function in ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py
            # Reference: https://www.codetd.com/en/article/11823498

            lines = txt.readlines()
            num_all_lines = len(lines)
            for i, line in enumerate(lines):
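                # Split on any whitespace: the first item is the level, the last is the page,
                # and everything in between is the title (which may itself contain spaces)
                pline = line.split()
                level = pline[0]; title = ' '.join(pline[1:-1]); page = int(pline[-1]) + self.offset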

                # 1. Calculate the level id of the current line
                # 2. The parent of the current bookmark is stored in dir_parent[id-1]
                # 3. Update / insert dir_parent[id]
                id = self.getLevelId(level)
                if id >= len(self.dir_parent):
                    self.dir_parent.append(None)
                self.dir_parent[id] = pdf_writer.addBookmark(level+' '+title, page-1, self.dir_parent[id-1])
                
                print(" * [%d/%d finished] level: %s(%d), title: %s, page: %d" % (i+1, num_all_lines, level, id, title, page))
            
            if self.out_path is None:
                self.out_path = self.pdf_path[:-4] + '(bookmark).pdf'
            with open(self.out_path, 'wb') as out_pdf:
                pdf_writer.write(out_pdf)
                print("---------------------------------------------------------------------------")
                print(" * Save: %s" % self.out_path)
                print("---------------------------------- Done! ----------------------------------")

if __name__ == '__main__':
    
    input_num = len(sys.argv)
    assert(input_num > 3)
    
    opath = None
    if input_num > 4:
        opath = sys.argv[4]

    mark='.'
    if input_num > 5:
        mark = sys.argv[5]

    pdg = PdfDirGenerator(
        pdf_path=sys.argv[1],
        txt_path=sys.argv[2],
        offset=int(sys.argv[3]), # Usually the page number of the last table-of-contents page
        out_path=opath,
        levelmark=mark
    )

    pdg.run()

Save the above code as PdfDirGenerator.py. It takes 3 required parameters and 2 optional ones:

The first parameter: the path of the PDF to be bookmarked
The second parameter: the path of the txt file containing the table-of-contents information
The third parameter: the page offset of the body content (usually the page number of the last table-of-contents page)
The fourth parameter (optional): the output path
The fifth parameter (optional): the level mark, default is "."

For example, on the command line, enter:

python .\PdfDirGenerator.py ".\Multi-source and multi-target statistical information fusion Mahler.pdf" .\dir.txt 27
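To make the offset and the parent tracking concrete, here is a small sketch that runs without any PDF library. It is an illustration under stated assumptions, not the script itself: add_outline and RecordingWriter are hypothetical names, and the writer is only assumed to offer an add_outline_item(title, page_index, parent) method that returns a handle for the new bookmark (the method name follows newer pypdf releases, but here it is just a stand-in interface). With an offset of 27, printed page 5 is the 32nd page of the file, which is page index 31 because indices start at 0.

```python
from typing import Any, List, Optional, Tuple


def add_outline(writer: Any, entries: List[Tuple[int, str, int]], offset: int) -> None:
    """Add (depth, title, printed_page) entries to a writer-like object."""
    # parents[d] is the most recent bookmark at depth d; parents[0] is the root
    parents: List[Optional[Any]] = [None]
    for depth, title, printed_page in entries:
        page_index = printed_page + offset - 1      # PDF page indices start at 0
        depth = max(1, min(depth, len(parents)))    # clamp if a level was skipped
        del parents[depth:]                         # forget deeper, now stale, parents
        handle = writer.add_outline_item(title, page_index, parents[depth - 1])
        parents.append(handle)                      # becomes parents[depth]


class RecordingWriter:
    """Stand-in for a real PDF writer: it only records the calls it receives."""

    def __init__(self) -> None:
        self.calls: List[Tuple[str, int, Optional[str]]] = []

    def add_outline_item(self, title: str, page_index: int, parent: Optional[str] = None) -> str:
        self.calls.append((title, page_index, parent))
        return title  # the title doubles as the bookmark handle in this sketch


demo = RecordingWriter()
add_outline(demo, [
    (1, "Chapter 1 Introduction", 1),
    (2, "1.1 Purpose of this book", 1),
    (2, "1.3 Why random sets or FISST", 5),
    (3, "1.3.1 Complexity of multi-objective filtering", 6),
], offset=27)
for call in demo.calls:
    print(call)
```

Each printed tuple shows the bookmark title, the 0-based page index it points to, and the title of its parent bookmark (None for first-level entries).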


1.4. Possible errors

This section mainly follows https://www.codetd.com/en/article/11823498

1.4.1. Question 1: ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list

If the PDF has previously been modified by other software, the following error may occur:

Traceback (most recent call last):
  File ".\PDFbookmark.py", line 70, in <module>
    print(addBookmark(args[1], args[2], int(args[3])))
  File ".\PDFbookmark.py", line 55, in addBookmark
    new_bookmark = writer.addBookmark(title, page + page_offset, parent=parent)
  File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 732, in addBookmark
    outlineRef = self.getOutlineRoot()
  File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 607, in getOutlineRoot
    idnum = self._objects.index(outline) + 1
ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list

Solution: modify the getOutlineRoot() function in pdf.py (the path of pdf.py is ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py)

def getOutlineRoot(self):
    if '/Outlines' in self._root_object:
        outline = self._root_object['/Outlines']
        try:
            idnum = self._objects.index(outline) + 1
        except ValueError:
            if not isinstance(outline, TreeObject):
                def _walk(node):
                    node.__class__ = TreeObject
                    for child in node.children():
                        _walk(child)
                _walk(outline)
            outlineRef = self._addObject(outline)
            self._addObject(outlineRef.getObject())
            self._root_object[NameObject('/Outlines')] = outlineRef
            idnum = self._objects.index(outline) + 1
        outlineRef = IndirectObject(idnum, 0, self)
        assert outlineRef.getObject() == outline
    else:
        outline = TreeObject()
        outline.update({ })
        outlineRef = self._addObject(outline)
        self._root_object[NameObject('/Outlines')] = outlineRef

    return outline

1.4.2. Question 2: RuntimeError: generator raised StopIteration

If you have made the above modification and the script fails with RuntimeError: generator raised StopIteration, check whether your Python version is 3.7 or higher. Since Python 3.7, a StopIteration raised inside a generator is converted into a RuntimeError (see PEP 479 for details). To avoid the error, use a Python version earlier than 3.7, such as 3.6.
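The behaviour change described by PEP 479 can be reproduced without PyPDF2. The sketch below is a standalone illustration (the generator first_items is a hypothetical example, not code from the library): a StopIteration that escapes inside a generator body used to end the iteration silently, whereas Python 3.7 and later turn it into a RuntimeError.

```python
import sys


def first_items(iterables):
    """Yield the first item of each iterable (deliberately fragile)."""
    for it in iterables:
        # next() on an empty iterator raises StopIteration inside the generator
        yield next(iter(it))


try:
    result = list(first_items([[1, 2], [], [3]]))
    outcome = "stopped silently with %r" % (result,)
except RuntimeError as exc:
    outcome = "RuntimeError: %s" % exc

print("Python %d.%d: %s" % (sys.version_info[0], sys.version_info[1], outcome))
```

On Python 3.7 or later the message is the same one shown in the heading of this section; on 3.6 the list simply ends early at the empty iterable.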

1.5. Code download

https://gitee.com/iam002/add_pdf_bookmarker
The PDF used here is "Multi-source and multi-target statistical information fusion" by Mahler; readers who need it can download it from Alibaba Cloud Drive.


Added by sdyates2001 on Wed, 02 Feb 2022 14:01:06 +0200
