Pay? No way! A permanently free PDF editing tool in 20 lines of Python

PDF (Portable Document Format) is a file format we come across all the time, and many documents are distributed as PDF. Because its layout is stable, it preserves the original colors and formatting when a document is printed, shared or transmitted.

 

PDF is a document format based on the PostScript imaging model. Its stable layout is a great advantage, but it comes at a cost for users: PDFs are hard to edit.

 

For example, splitting, merging, cropping, converting and editing PDF documents are all awkward.

Commonly used PDF tools such as Adobe Reader, Foxit Reader and Panda PDF can read documents, but their free versions cannot edit them. Web-based tools such as Smallpdf and iLovePDF can edit PDFs, but they limit the document size.

Once, in order to replace a single page in a PDF, I tried almost all the mainstream PDF tools on the market, and in the end had to use a paid tool to solve the problem.

After thinking it over: since this commercial software is unreliable, why not write a tool yourself? Why go to the trouble of downloading and installing untrustworthy software when a few dozen lines of code can solve the problem?

This article shows how to build a simple PDF editing tool in Python that can convert PDF to TXT and split, merge, crop and convert PDF files.

PyPDF2

PyPDF2 is a third-party Python PDF library that can split, merge, crop and convert PDF files.

It can also add custom data, watermarks and passwords to PDF files, and retrieve text and metadata from them.

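Install

Install it with pip. The examples in this article use the older camelCase API (PdfFileReader, PdfFileWriter, getPage and so on), which was removed in PyPDF2 3.0, so pin an earlier release:

$ pip install "PyPDF2<3.0"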

The sections below demonstrate a few PDF editing functions and explain the key lines of each.

Delete PDF page

Here is the implementation first:

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()                             # 1
input1 = PdfFileReader(open("example.pdf", "rb"))    # 2

def delete_pdf(index):
    pages = input1.getNumPages()                     # 3

    for i in range(pages):
        # index holds 1-based page numbers to drop; i is 0-based
        if i + 1 in index:
            continue
        output.addPage(input1.getPage(i))            # 4

    with open("PyPDF2-output.pdf", "wb") as output_stream:
        output.write(output_stream)                  # 5

delete_pdf([2, 3, 4])

Here are the key points in the code:

  1. Create an instance for writing the output PDF.
  2. Read the local PDF file.
  3. Get the number of pages in the PDF document.
  4. Read page i of the PDF and add it to the output instance.
  5. Save the edited document locally.
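
The delete_pdf function above works on the module-level output and input1 objects, so it can only be used once per run. A minimal sketch of a reusable variant might look like the following. It assumes the same pre-3.0 PyPDF2 API as the rest of this article; the names pages_to_keep and delete_pages are made up for illustration and are not part of PyPDF2.

```python
def pages_to_keep(total_pages, delete):
    """Return the 0-based indices of the pages that survive.

    `delete` holds 1-based page numbers, as in delete_pdf([2, 3, 4]).
    """
    doomed = set(delete)
    return [i for i in range(total_pages) if i + 1 not in doomed]


def delete_pages(src_path, dst_path, delete):
    """Copy src_path to dst_path without the pages listed in `delete`."""
    # Imported lazily so pages_to_keep can be used without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src)
        writer = PdfFileWriter()
        for i in pages_to_keep(reader.getNumPages(), delete):
            writer.addPage(reader.getPage(i))
        # Write while the source file is still open, since the writer
        # still refers to page data held by the reader.
        with open(dst_path, "wb") as dst:
            writer.write(dst)


# For a 5-page document, deleting pages 2, 3 and 4 keeps the first and last.
print(pages_to_keep(5, [2, 3, 4]))  # [0, 4]
```

Keeping the page arithmetic in its own function makes the off-by-one conversion between 1-based page numbers and 0-based indices easy to check without touching a real PDF.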

Merge PDF

With page deletion in place, let's look at how to merge pages from another PDF into the current one.

Method 1:

The page-deletion approach can be extended to merge pages as well.

from PyPDF2 import PdfFileWriter, PdfFileReader

output = PdfFileWriter()
input1 = PdfFileReader(open("example.pdf", "rb"))
input2 = PdfFileReader(open("simple2.pdf", "rb"))    # 1

def merge_pdf(add_index, origin_index):
    pages = input1.getNumPages()
    k = 0
    for i in range(pages):
        if i + 1 in add_index:
            # Insert a page of input2 in front of page i+1 of input1
            output.addPage(input2.getPage(origin_index[k]))    # 2
            k += 1
        output.addPage(input1.getPage(i))

    with open("PyPDF2-output.pdf", "wb") as output_stream:
        output.write(output_stream)

merge_pdf([2, 3, 4], [0, 0, 0])

Key points in the code:

  1. Read the source file whose pages will be merged in.
  2. On reaching a specified page, insert the chosen page of the source PDF in front of it.
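
To see what order the pages end up in, the loop in merge_pdf can be replayed without any PDF files. The sketch below is only an illustration: merge_plan is a hypothetical helper, and the 4-page length of the main document is an assumption. Note that the pairing of add_index with origin_index relies on the loop visiting pages in ascending order, so the k-th entry of origin_index goes with the k-th smallest page number in add_index.

```python
def merge_plan(total_pages, add_index, origin_index):
    """List (source, page_index) pairs in the order they are written.

    add_index: 1-based page numbers of input1 that get a page inserted
    in front of them.
    origin_index: 0-based page of input2 to insert at each of those spots.
    """
    plan = []
    k = 0
    for i in range(total_pages):
        if i + 1 in add_index:
            plan.append(("input2", origin_index[k]))
            k += 1
        plan.append(("input1", i))
    return plan


# Same arguments as merge_pdf([2, 3, 4], [0, 0, 0]), for a 4-page input1.
for source, page in merge_plan(4, [2, 3, 4], [0, 0, 0]):
    print(source, page)
```

With these arguments the first page of input2 is inserted three times, once in front of each of pages 2, 3 and 4 of input1.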

Method 2:

Besides method 1, there is another way to merge PDFs:

from PyPDF2 import PdfFileMerger    # 1

merger = PdfFileMerger()

input1 = open("document1.pdf", "rb")    # 2
input2 = open("document2.pdf", "rb")
input3 = open("document3.pdf", "rb")

merger.append(fileobj=input1, pages=(0, 3))    # 3

merger.merge(position=2, fileobj=input2, pages=(0, 1))    # 4

merger.append(input3)    # 5

with open("document-output.pdf", "wb") as output:
    merger.write(output)

# Release the merger and the input files once the output is written
merger.close()
for f in (input1, input2, input3):
    f.close()

Key points in the code:

  1. Import PyPDF2's merge class, PdfFileMerger.
  2. Open the PDF documents to be merged.
  3. Take the first 3 pages of the first PDF document.
  4. Insert the first page of the second PDF document at position 2 of the output, that is, after its first two pages.
  5. Append the whole third PDF document to the end of the output document.
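
The pages and position arguments are easy to misread, so here is a rough model of the three calls using plain Python lists. It assumes, as I understand the PyPDF2 behaviour, that a (start, stop) tuple is treated like range(start, stop) and that position is a list-style insertion index. The page counts of the three documents (5, 2 and 2) are invented for the example.

```python
# Label every page as "<document>:<0-based page index>".
doc1 = [f"doc1:{i}" for i in range(5)]
doc2 = [f"doc2:{i}" for i in range(2)]
doc3 = [f"doc3:{i}" for i in range(2)]

result = []
result.extend(doc1[0:3])   # merger.append(fileobj=input1, pages=(0, 3))
result[2:2] = doc2[0:1]    # merger.merge(position=2, fileobj=input2, pages=(0, 1))
result.extend(doc3)        # merger.append(input3)

print(result)
# ['doc1:0', 'doc1:1', 'doc2:0', 'doc1:2', 'doc3:0', 'doc3:1']
```

In this model the stop value of the tuple is exclusive, which is why (0, 3) yields three pages and (0, 1) yields one.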

In addition to the two main functions described above, PyPDF2 also has some other small functions:

Rotate

input1.getPage(1).rotateClockwise(90)

Rotate the second page (index 1, since page indices start at 0) 90 degrees clockwise.

Add watermark

page = input1.getPage(3)
watermark = PdfFileReader(open("watermark.pdf", "rb"))
page.mergePage(watermark.getPage(0))

The watermark is stored in a separate PDF document, watermark.pdf, and its first page is merged onto the fourth page (index 3) of the input.

Encryption

password = "secret"
output.encrypt(password)

Choose a password first, then call encrypt on the output instance before writing it to disk.

pdfminer

PyPDF2, described above, is good at page-level PDF editing but weak at editing on the text and metadata level.

So here is another Python library that makes up for this shortcoming.

PDFMiner is a text extraction tool for PDF documents. It has the following features:

  • Obtains the exact location and layout information of text;
  • Converts PDF to HTML, XML and other formats;
  • Extracts the outline (table of contents);
  • Extracts tagged contents;
  • Supports various font types (Type1, TrueType, Type3 and CID);
  • Supports Chinese, Japanese and Korean as well as vertical writing;

Install

$ pip install pdfminer

PDF to TxT

In its GitHub repository, pdfminer ships some practical command-line tools under the tools directory, for example for converting PDF to HTML or PDF to TXT. The following command prints the text of a PDF document directly:

$ pdf2txt.py samples/simple1.pdf
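
If you want to drive this command from your own tool, one option is to wrap it with subprocess. The sketch below is only a starting point: pdf2txt_command and pdf_to_text are hypothetical helper names, the -t (output type) and -o (output file) flags are the ones I understand pdf2txt.py to accept, so confirm them against the script's usage message, and the code assumes pdf2txt.py is on your PATH and directly executable.

```python
import subprocess


def pdf2txt_command(pdf_path, out_path=None, output_type="text"):
    """Build the argument list for pdfminer's pdf2txt.py script."""
    if output_type not in ("text", "html", "xml", "tag"):
        raise ValueError(f"unsupported output type: {output_type}")
    cmd = ["pdf2txt.py", "-t", output_type]
    if out_path is not None:
        cmd += ["-o", out_path]
    cmd.append(pdf_path)
    return cmd


def pdf_to_text(pdf_path):
    """Run pdf2txt.py and return whatever it prints to standard output."""
    done = subprocess.run(
        pdf2txt_command(pdf_path),
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the script fails
    )
    return done.stdout


print(pdf2txt_command("samples/simple1.pdf", "simple1.html", "html"))
# ['pdf2txt.py', '-t', 'html', '-o', 'simple1.html', 'samples/simple1.pdf']
```

Building the command as a list rather than a single string avoids shell quoting problems with file names that contain spaces.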

Summary

With these two Python libraries you can edit PDFs at every level, from pages down to text and metadata. This article only introduces the basic usage of each. For a detailed list of features and usage, read the official documentation or the project source code on GitHub. Building on these basics, you can come up with more valuable applications. For example, after extracting the text you could call a translation API to translate the document. You could also package the code and develop it into a general-purpose PDF editing tool.


Added by aniesh82 on Fri, 18 Feb 2022 04:26:34 +0200