The ultimate trick for dynamically extracting PDF content, plus a great website recommendation | PA essential resources

- 1 -

The previous article, "Extract PDF content automatically, taking whichever pages you want | PA hands-on case", explained how to automatically extract the PDF content of specified pages. It also mentioned a dynamic case: extracting everything in the file except a fixed number of trailing pages (for example, the last 5).

For example, in the PDF reports of many companies, the number of leading pages that contain data is not fixed, while the last few pages are just routine notes. To extract the leading data pages dynamically, the key is to obtain the total page count of the PDF report.

However, at present Power Automate has no action for getting the page count of a PDF file. So we need Power Automate to call a third-party tool automatically, for example the super-powerful PDF batch-processing tool pdftk!

- 2 -

What is pdftk? In short, it is a toolkit for working with PDF files (full name: PDF Toolkit). For a complete introduction and documentation, see the official PDF Labs website: https://www.pdflabs.com/tools/pdftk-the-pdf-toolkit/

However, the official site is heavy going for most ordinary users, so it is not the focus of this recommendation. The site I really want to recommend is "Batch Home":

http://bbs.bathome.net/

First, what can pdftk do?

In short, by writing a few simple commands at the DOS prompt, you can perform many PDF processing tasks. Here are some examples:

Merge PDFs:
pdftk 1.pdf 2.pdf 3.pdf cat output 123.pdf
Or, using wildcards:
pdftk *.pdf cat output combined.pdf
Combine selected pages from several PDFs into a new PDF (pages 1-7 of one.pdf, pages 1-5 of two.pdf, then page 8 of one.pdf, merged in that order into combined.pdf):
pdftk A=one.pdf B=two.pdf cat A1-7 B1-5 A8 output combined.pdf
Rotate the first page of a PDF 90 degrees clockwise (east) and leave the other pages unchanged (2-end means page 2 through the last page):
pdftk in.pdf cat 1E 2-end output out.pdf
Rotate the first page of a PDF 90 degrees counterclockwise (west) and extract only that page:
pdftk in.pdf cat 1W output out.pdf
Rotate all pages of a PDF 180 degrees:
pdftk in.pdf cat 1-endS output out.pdf
If your pdftk version rejects the single-letter rotation codes E, W and S, use the spelled-out forms east, west and south instead (for example 1east).
Encrypt a PDF with 128-bit strength, setting only an owner password (protected mode: the file can be opened but is read-only):
pdftk in.pdf output mydoc.128.pdf owner_pw foopass
Same as above, but also set a user password, so a password prompt appears when the PDF is opened:
pdftk in.pdf output mydoc.128.pdf owner_pw foo user_pw baz
Same as above, but allow printing:
pdftk in.pdf output mydoc.128.pdf owner_pw foo user_pw baz allow printing
Decrypt a PDF (replace foopass with the PDF's owner password). Note: this works only if you already know the password; it simply removes the password so that readers no longer have to enter it:
pdftk secured.pdf input_pw foopass output unsecured.pdf
Merge two PDFs, one of which is encrypted; the resulting file is not encrypted:
pdftk A=secured.pdf mydoc.pdf input_pw A=foopass cat output combined.pdf
Uncompress a PDF's streams so that the file can be edited as text:
pdftk mydoc.pdf output mydoc.clear.pdf uncompress
Compress a PDF:
pdftk mydoc.pdf output mydoc.clear.pdf compress
Repair a damaged PDF:
pdftk broken.pdf output fixed.pdf
Split a PDF into single pages (the output file names begin with pg_):
pdftk mydoc.pdf burst
Report PDF information and write it to a text file:
pdftk mydoc.pdf dump_data output report.txt

That is a lot of examples, and you don't need to understand them all at once. Look at the last one first: the PDF information report includes a NumberOfPages entry, which tells us how many pages the PDF file has!
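To make the idea concrete, here is a minimal sketch in Python (a language this article does not otherwise use) that searches a report for the NumberOfPages line with a regular expression. The sample text is made up to show the general shape of dump_data output and is not a real report; the helper name find_page_count is hypothetical.

```python
import re
from typing import Optional

# Illustrative text only: the general shape of a dump_data report, not real output.
SAMPLE_REPORT = """InfoBegin
InfoKey: Producer
InfoValue: Example PDF Writer
NumberOfPages: 12
PageMediaBegin
PageMediaNumber: 1
"""


def find_page_count(report: str) -> Optional[int]:
    """Return the value of the NumberOfPages line, or None if the line is missing."""
    match = re.search(r"^NumberOfPages:[ \t]*(\d+)", report, flags=re.MULTILINE)
    return int(match.group(1)) if match else None


print(find_page_count(SAMPLE_REPORT))
```

Returning None instead of raising an error lets the caller decide how to handle a file whose report has no page count line.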

- 3 -

Back to the key point of this article: since pdftk can report the total page count of a PDF file, we can call it from Power Automate to implement the dynamic extraction.

First, let's download the tool at: https://share.weiyun.com/uHScXQNP

Unzip it into a folder that is easy to reference from a command. The folder should contain two files.

Then, the implementation process in Power Automate is as follows:

Step-01 Get the files in the folder

Step-02 Add a "For each" loop

Step-03 Add a "Run DOS command" action to obtain the PDF file information (including the page count)

In the action's general settings, use the path selection button to pick the location of pdftk, use the variable selection button to pick the full name of the current PDF file (%CurrentItem.FullName%), and finally append the dump_data parameter to obtain the information of the PDF file.

Note that some PDF file names may contain spaces, so enclose %CurrentItem.FullName% in double quotation marks to avoid errors when the DOS command runs!
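The quoting issue can be illustrated outside Power Automate with a small Python sketch. The paths below are hypothetical examples and both function names are made up for this illustration. The first function builds the command line the way the DOS step sees it, with both paths quoted; the second runs pdftk with a list of arguments, in which case Python handles the quoting itself.

```python
import subprocess


def dump_data_command_line(pdftk_path: str, pdf_path: str) -> str:
    """Build the command line as typed at the DOS prompt, with both paths quoted."""
    return f'"{pdftk_path}" "{pdf_path}" dump_data'


def run_dump_data(pdftk_path: str, pdf_path: str) -> str:
    """Run pdftk and return the report text; needs pdftk to be installed."""
    result = subprocess.run(
        [pdftk_path, pdf_path, "dump_data"],  # a list needs no manual quoting
        capture_output=True,
        text=True,
        check=True,  # raise an error if pdftk exits with a failure code
    )
    return result.stdout


# Hypothetical paths, chosen only because the file name contains spaces.
print(dump_data_command_line(r"C:\tools\pdftk\pdftk.exe",
                             r"C:\reports\annual report 2021.pdf"))
```

Without the quotes, the shell would treat "annual", "report" and "2021.pdf" as three separate arguments, and pdftk would fail to find the file.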

After obtaining the PDF file information with the steps above, we first use basic text splitting to isolate the page count. Other, more convenient methods (which involve regular expressions or other DOS commands) will be covered later.

Step-04 Add a "Split text" action to split the PDF file information obtained in the previous step on "NumberOfPages:"

This splits the PDF file information into two parts. In the result, TextList[1] is the part that contains the page count.

Step-05 Add another "Split text" action to split TextList[1] on the newline character

This gives us multiple lines, of which the first (index 0) is the page count. Note, however, that it is still text. Next, we need to convert the text to a number:

Step-06 Add a "Convert text to number" action to convert TextList2[0] to a number
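Steps 04 to 06 can be sketched outside Power Automate as follows. This is a minimal Python illustration, not part of the flow itself: the variable names mirror TextList and TextList2 from the steps, the function name is hypothetical, and the sample report text is made up.

```python
def page_count_by_splitting(dump_data_text: str) -> int:
    """Isolate the page count with two splits and one conversion."""
    # Step-04: split on the marker; index 1 is everything after it.
    text_list = dump_data_text.split("NumberOfPages:")
    if len(text_list) < 2:
        raise ValueError("NumberOfPages: not found in the pdftk output")
    # Step-05: split the remainder into lines; index 0 holds the page count.
    text_list2 = text_list[1].splitlines()
    # Step-06: convert the text to a number, ignoring surrounding spaces.
    return int(text_list2[0].strip())


# Made-up report text, used only to exercise the function.
sample = (
    "InfoKey: Producer\n"
    "InfoValue: Example PDF Writer\n"
    "NumberOfPages: 12\n"
    "PageMediaBegin\n"
    "PageMediaNumber: 1\n"
)
print(page_count_by_splitting(sample))
```

The strip call matters because the text after the marker starts with a space, just as TextList2[0] may in the flow.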

Step-07 Add an "Extract text from PDF" action to extract a range of pages, from page 1 to "page count - 5"

Step-08 Write the extracted PDF text to a text file
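The page range used in Step-07 is simple arithmetic, but it may be worth guarding against files with 5 pages or fewer, where "page count - 5" would be zero or negative. The guard is an addition that is not part of the flow described above. A minimal Python sketch of the calculation, with a hypothetical helper name and the 5 trailing pages from this article as the default:

```python
from typing import Optional, Tuple


def front_page_range(total_pages: int, trailing_pages: int = 5) -> Optional[Tuple[int, int]]:
    """Return (first page, last page) of the data pages, or None if none remain."""
    last_page = total_pages - trailing_pages
    if last_page < 1:
        return None  # the file has nothing but trailing pages
    return (1, last_page)


for total in (12, 5):
    print(total, front_page_range(total))
```

In the flow, the same check could be made with a condition before the extraction step, so that short files are skipped instead of causing an error.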

- 4 -

With these simple steps, we obtained the page count of the PDF file and then reached our goal by using "page count - 5".

Here the pdftk tool plays the key role, and it is an important supplement to Power Automate for working with PDF files. Later, I will continue to cover more PDF automation problems encountered in daily work.

Finally, let me recommend the website once more: Batch Home (http://bbs.bathome.net/), which has a large number of batch-processing tools and methods (DOS, PowerShell or VB).

Of course, most readers don't need to learn many DOS or PowerShell commands. But with a little understanding, you will know that such an approach exists when you need it, and even if you ask someone else to implement it for you, you will have more ideas to offer.

Added by Kathy on Wed, 15 Dec 2021 21:32:31 +0200