Adding bookmarks to a PDF with Python
Scanned PDFs that you download often have no bookmarks (outline), which makes them very inconvenient to read. Here is a semi-automatic Python script for adding a bookmark outline.
1.1. Install PyPDF2
pip install pypdf2
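To avoid errors when running the script later, use a Python version earlier than 3.7 (for example 3.6). The script also uses the legacy PdfFileReader and PdfFileWriter classes, which were removed in PyPDF2 3.0, so an older PyPDF2 release is required.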
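The version requirement above can be checked before running anything else. This is only a minimal sketch: `python_is_supported` is a hypothetical helper name that is not part of the original script, and the 3.7 cutoff is simply the one stated in this article.

```python
import sys


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter is older than 3.7, as this article requires."""
    # Compare only (major, minor); the patch level does not matter here.
    return tuple(version_info[:2]) < (3, 7)


if __name__ == "__main__":
    if not python_is_supported():
        print("Warning: this article's script expects Python earlier than 3.7")
```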
1.2. Extract the table of contents of the PDF and save it to a txt file
This step is tedious and has to be done by hand. Typically you can use an OCR tool, or convert the table-of-contents pages to Word, and then arrange the result in the following txt format:
- Each line contains three items: level, title and page, separated by spaces
- The number of "." characters determines the bookmark level, for example:
- "Chapter 1" contains 0 "." and is a first-level title
- "1.1" contains 1 "." and is a second-level title
- "1.1.1" contains 2 "." and is a third-level title
- ... (and so on)
- Do not leave any extra blank lines
Here is my txt:
    Chapter 1 Introduction 1
    1.1 Purpose of this book 1
    1.2 Main challenges of information fusion 5
    1.3 Why random sets or FISST 5
    1.3.1 Complexity of multi-objective filtering 6
    1.3.2 Transcendental heuristics 7
    1.3.3 Differences between single objective and multi-objective statistics 7
    1.3.4 Differences between conventional data and fuzzy data 7
    1.3.5 Formal Bayesian modeling 8
    1.3.6 Fuzzy information modeling 8
    1.3.7 Multi source and multi-objective formal modeling 9
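The format rules above can be checked before running the real script. The following is a minimal sketch, not part of the original script: `parse_toc_line` and `parse_toc` are hypothetical helper names, and it assumes that the first item on a line is the level, the last item is the page number, and everything in between is the title.

```python
def parse_toc_line(line, levelmark="."):
    """Split one 'level title page' line into (depth, level, title, page)."""
    # First whitespace-separated item is the level, e.g. "1.3.1" or "Chapter".
    level, rest = line.strip().split(None, 1)
    # Last item is the printed page number; the remainder is the title.
    title, page = rest.rsplit(None, 1)
    # Depth 1 has no level mark, depth 2 has one, and so on.
    depth = level.count(levelmark) + 1
    return depth, level, title, int(page)


def parse_toc(text, levelmark="."):
    """Parse every non-blank line of a table-of-contents text."""
    return [parse_toc_line(line, levelmark)
            for line in text.splitlines() if line.strip()]
```

A malformed line (for example one without a page number) raises a `ValueError`, which makes it easy to find the entry that needs fixing.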
1.3. Programming implementation
    import sys

    import PyPDF2


    class PdfDirGenerator:

        def __init__(self, pdf_path: str, txt_path: str, offset: int,
                     out_path: str = None, levelmark: str = '.'):
            self.pdf_path = pdf_path    # path of the PDF to bookmark
            self.txt_path = txt_path    # txt file containing the table of contents
            self.offset = offset        # page offset of the body content
            self.out_path = out_path    # output path
            self.levelmark = levelmark  # character used to determine the bookmark level
            self.dir_parent = [None]    # dir_parent[id]: latest bookmark at level id

        def getLevelId(self, level):
            """Compute the bookmark level from the number of level marks (".").
            Level 1: no ".", for example: Chapter 1, Appendix A
            Level 2: one ".", for example: 1.1, A.1
            Level 3: two ".", for example: 2.1.3
            """
            mark_num = 0
            for c in level:
                if c == self.levelmark:
                    mark_num += 1
            return mark_num + 1

        def run(self):
            print("--------------------------- Adding the bookmark ---------------------------")
            print(" * PDF Source: %s" % self.pdf_path)
            print(" * TXT Source: %s" % self.txt_path)
            print(" * Offset: %d" % self.offset)
            print("---------------------------------------------------------------------------")
            with open(self.txt_path, 'r', encoding='utf-8') as txt:
                pdf_reader = PyPDF2.PdfFileReader(self.pdf_path)
                pdf_writer = PyPDF2.PdfFileWriter()
                pdf_writer.cloneDocumentFromReader(pdf_reader)
                # BUG: ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list
                # Fix: modify the getOutlineRoot function in
                # ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py
                # Reference: https://www.codetd.com/en/article/11823498
                lines = txt.readlines()
                num_all_lines = len(lines)
                for i, line in enumerate(lines):
                    # The first item is the level and the last item is the page;
                    # everything in between is the title, which may contain spaces
                    level, rest = line.strip().split(None, 1)
                    title, page = rest.rsplit(None, 1)
                    page = int(page) + self.offset
                    # 1. Compute the level id of the current entry
                    # 2. The parent of the current bookmark is stored in dir_parent[id-1]
                    # 3. Update / insert dir_parent[id]
                    id = self.getLevelId(level)
                    while id >= len(self.dir_parent):
                        self.dir_parent.append(None)
                    self.dir_parent[id] = pdf_writer.addBookmark(level+' '+title, page-1, self.dir_parent[id-1])
                    print(" * [%d/%d finished] level: %s(%d), title: %s, page: %d" % (i+1, num_all_lines, level, id, title, page))
            if self.out_path is None:
                self.out_path = self.pdf_path[:-4] + '(bookmark).pdf'
            with open(self.out_path, 'wb') as out_pdf:
                pdf_writer.write(out_pdf)
            print("---------------------------------------------------------------------------")
            print(" * Save: %s" % self.out_path)
            print("---------------------------------- Done! ----------------------------------")


    if __name__ == '__main__':
        input_num = len(sys.argv)
        assert(input_num > 3)
        opath = None
        if input_num > 4:
            opath = sys.argv[4]
        mark = '.'
        if input_num > 5:
            mark = sys.argv[5]
        pdg = PdfDirGenerator(
            pdf_path=sys.argv[1],
            txt_path=sys.argv[2],
            offset=int(sys.argv[3]),  # usually the page number of the last page of the table of contents
            out_path=opath,
            levelmark=mark
        )
        pdg.run()
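The parent-tracking idea in steps 1 to 3 of the script can be tried without PyPDF2. The following is a minimal sketch under stated assumptions: `build_outline` and the dict layout are hypothetical, plain dicts stand in for the PDF bookmark objects, and level ids start at 1 as in `getLevelId`. Unlike the script, it truncates the parent list after each entry so that stale deeper entries are dropped, and it rejects an entry that skips a level.

```python
def build_outline(entries):
    """Build a nested outline from (depth, title) pairs, depth 1 = top level."""
    root = {"title": None, "children": []}
    # parents[d] is the most recent node at depth d; parents[0] is the root.
    parents = [root]
    for depth, title in entries:
        if depth < 1 or depth > len(parents):
            raise ValueError("entry %r skips a level" % title)
        node = {"title": title, "children": []}
        # The parent of a depth-d entry is the latest node at depth d-1.
        parents[depth - 1]["children"].append(node)
        # Forget deeper nodes, then record this node as the latest at its depth.
        del parents[depth:]
        parents.append(node)
    return root
```

For example, passing `[(1, "Chapter 1"), (2, "1.1"), (3, "1.1.1"), (1, "Chapter 2")]` yields a root with two top-level children, the first of which contains "1.1", which in turn contains "1.1.1".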
Save the code above as PdfDirGenerator.py. It takes 3 required arguments and 2 optional arguments:
- The first argument: the path of the PDF to be bookmarked
- The second argument: the path of the txt file containing the table of contents
- The third argument: the page offset of the body content (usually the PDF page number of the last page of the table of contents)
- The fourth argument (optional): the output path
- The fifth argument (optional): the level mark, "." by default
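The role of the offset argument can be written out as simple arithmetic. Reading the script, the page passed to the bookmark call is the printed page number plus the offset, minus 1 because that call takes a 0-based page index. The sketch below only restates this relation; `bookmark_page_index` and `offset_from_known_page` are hypothetical helper names that do not appear in the script.

```python
def bookmark_page_index(printed_page, offset):
    """0-based PDF page index that a bookmark for `printed_page` points to."""
    return printed_page + offset - 1


def offset_from_known_page(pdf_page_number, printed_page):
    """Derive the offset from one known pair of page numbers.

    pdf_page_number is the 1-based position in the PDF file of the page
    whose printed page number is printed_page.
    """
    return pdf_page_number - printed_page
```

With an offset of 27, printed page 1 maps to index 27, which is the 28th page of the file. Conversely, if printed page 1 is the 28th page of the PDF, the offset to pass is 27.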
For example, on the command line, enter:
python .\PdfDirGenerator.py ".\Multi-source and multi-target statistical information fusion Mahler.pdf" .\dir.txt 27
1.4. Possible errors
The fixes below mainly follow https://www.codetd.com/en/article/11823498
1.4.1. Question 1: ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list
If the PDF was previously modified by other software, the following error may occur:
    Traceback (most recent call last):
      File ".\PDFbookmark.py", line 70, in <module>
        print(addBookmark(args[1], args[2], int(args[3])))
      File ".\PDFbookmark.py", line 55, in addBookmark
        new_bookmark = writer.addBookmark(title, page + page_offset, parent=parent)
      File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 732, in addBookmark
        outlineRef = self.getOutlineRoot()
      File "C:\Anaconda3\lib\site-packages\PyPDF2\pdf.py", line 607, in getOutlineRoot
        idnum = self._objects.index(outline) + 1
    ValueError: {'/Type': '/Outlines', '/Count': 0} is not in list
Solution: modify the getOutlineRoot() function in pdf.py (the path of pdf.py is ${PYTHON_PATH}/site-packages/PyPDF2/pdf.py):
    def getOutlineRoot(self):
        if '/Outlines' in self._root_object:
            outline = self._root_object['/Outlines']
            try:
                idnum = self._objects.index(outline) + 1
            except ValueError:
                if not isinstance(outline, TreeObject):
                    def _walk(node):
                        node.__class__ = TreeObject
                        for child in node.children():
                            _walk(child)
                    _walk(outline)
                outlineRef = self._addObject(outline)
                self._addObject(outlineRef.getObject())
                self._root_object[NameObject('/Outlines')] = outlineRef
                idnum = self._objects.index(outline) + 1
            outlineRef = IndirectObject(idnum, 0, self)
            assert outlineRef.getObject() == outline
        else:
            outline = TreeObject()
            outline.update({ })
            outlineRef = self._addObject(outline)
            self._root_object[NameObject('/Outlines')] = outlineRef
        return outline
1.4.2. Question 2: RuntimeError: generator raised StopIteration
If you applied the modification above and the script fails with RuntimeError: generator raised StopIteration, check whether your Python version is 3.7 or later. Since Python 3.7, a StopIteration raised inside a generator is converted into a RuntimeError (see PEP 479 for details). To avoid the error, use a Python version earlier than 3.7, such as 3.6.
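The behavior change described by PEP 479 can be reproduced in a few lines. This is an illustrative sketch only: the generator names are hypothetical and this is not PyPDF2's code. It shows one way such an error can arise, namely a generator that ends with an explicit `raise StopIteration` instead of `return`.

```python
def children_old_style(items):
    """Generator written in the pre-PEP 479 style."""
    for item in items:
        yield item
    # Before Python 3.7 this silently ended the iteration;
    # from 3.7 on it is converted into a RuntimeError.
    raise StopIteration


def children_new_style(items):
    """Same generator, ended with a plain return, valid on all versions."""
    for item in items:
        yield item
    return


def consume(gen):
    """Exhaust a generator and report a RuntimeError instead of crashing."""
    try:
        return list(gen)
    except RuntimeError as exc:
        return "RuntimeError: %s" % exc
```

On Python 3.7 or later, `consume(children_old_style([1, 2]))` reports the RuntimeError, while `consume(children_new_style([1, 2]))` returns `[1, 2]` on every version.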
1.5. Code download
The PDF used here is "Multi source and multi-objective statistical information fusion" by Mahler; readers who need it can download it from Alicloud disk.