Batch-recognizing invoices with Python and saving them to an Excel sheet (step-by-step tutorial)

Official account: yk Kun Emperor
Reply "invoice identification" in the account's chat to get the full source code

1. Scene description

2. Prepare the environment

3. Extract content

1. Extract the amount
2. Extract the name
3. Extract the taxpayer identification number
4. Extract the drawer

4. Batch-identify invoices and save them to Excel

5. Invoice verification

1. Apply for a Baidu AI application
2. Get a token
3. Check invoices
4. Check the invoice with the tax bureau

6. Summary

Hello, I'm Brother Kun!

Today Brother Kun shares a practical office article: batch-recognizing invoices with Python and entering them into Excel. For finance students and company accounting staff, summarizing reimbursement invoices into Excel is simply torture. Brother Kun studied finance and knows the feeling well. Today I'll walk you through the specific process and methods.

At the end of the year in particular, a company's finance staff suffer under piles of invoices. Since we have just learned Python, we should put its advantages to use. There are two approaches: batch invoice recognition based on Python and invoice recognition based on machine learning. Those who are interested can explore the second one on their own. Here, let's look at batch invoice recognition based on Python.

1. Scene description

Here we take four invoices as an example (found online by Brother Kun). Put the invoice pictures in the pic folder (any folder will do). We start by batch-recognizing these four invoices.

Open any one of the invoices.

Extraction target: invoice amount, seller name, taxpayer identification number, and invoice drawer.

Finally, save these four fields of each invoice to Excel. If you have learned databases, it is recommended to store them directly in a database, which also serves as a database review. After all, databases are still widely used.
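As a rough illustration of the database option, here is a minimal sketch using Python's built-in sqlite3 module. The table name invoices, the column names, and the function name save_invoices are assumptions made for this example, and the row in the usage line is placeholder data, not a real invoice.

```python
import sqlite3


def save_invoices(db_path, rows):
    """Insert (name, tax_id, amount, drawer) tuples and return the total row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success, rolls back on error
            # Hypothetical schema: one text column per extracted field
            conn.execute(
                "CREATE TABLE IF NOT EXISTS invoices ("
                "id INTEGER PRIMARY KEY, name TEXT, tax_id TEXT, "
                "amount TEXT, drawer TEXT)"
            )
            conn.executemany(
                "INSERT INTO invoices (name, tax_id, amount, drawer) "
                "VALUES (?, ?, ?, ?)",
                rows,
            )
        return conn.execute("SELECT COUNT(*) FROM invoices").fetchone()[0]
    finally:
        conn.close()


# Placeholder row for illustration only
print(save_invoices(":memory:", [("Example seller", "000000000000000", "10.00", "Example drawer")]))
```

The amount is stored as text here so that whatever the OCR step returns is kept unchanged. Converting it to a number is a separate decision.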

2. Prepare the environment

The required libraries are as follows:

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr


The installation commands are as follows (Pillow provides the PIL module, and openpyxl is used later to write the Excel file):

pip install pyocr
pip install cnocr
pip install pillow openpyxl


The invoice contains Chinese text, so we need to recognize the Chinese characters in the picture. cnocr is a good choice for this. It is built on machine learning, but you only need to know how to use it.

Tip: in addition to the libraries above, you also need to install two extra programs (exe files); otherwise errors will occur. Install the related configuration as completely as you can.


exe files to install:

ImageMagick
Tesseract-OCR

The installation process for these two programs is not repeated here. You can search for a tutorial and install them yourself (tutorials are available on CSDN).

It is best to install both programs. Previously the data seemed to display even without them, and an error may not stop the program from running, but you should still install them; otherwise you may not know where to start when an error is reported elsewhere.
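Before running the OCR code, it can help to confirm that the external programs are actually reachable. The sketch below only checks whether an executable name is on PATH. The function name missing_executables is made up for this example, and the names "tesseract" and "magick" are the usual command names for Tesseract-OCR and ImageMagick 7, so adjust them if your installation uses different ones.

```python
import shutil


def missing_executables(names):
    """Return the names from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


# Assumed executable names; they may differ on your system
missing = missing_executables(["tesseract", "magick"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external programs were found.")
```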

3. Extract content

Next, take one of the pictures as an example to explain how to extract the target content: amount, name, taxpayer identification number and drawer.


Read the picture pic1.png:

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

1. Extract the amount

The region of the invoice that contains the amount has to be cropped out. The coordinates may differ from one computer to another.

The commented-out coordinates in the code are positions found while debugging on a particular computer.

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))


# Extract the amount
# Coordinates from an earlier debugging run:
# left = 741
# top = 420
# right = 850
# bottom = 445
left = 645
top = 360
right = 719
bottom = 385
image_text1 = new_img.crop((left, top, right, bottom))
# Show the cropped picture
image_text1.show()
txt1 = tool.image_to_string(image_text1)
print(txt1)

The values of left, top, right and bottom here were found by adjusting the position many times. Locate the region according to your own invoice. Don't think of it as a chore: if you hit the right position in one step, you learn nothing. Make sure you understand the code and type it out yourself.
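If your invoice pictures share the same layout and differ only in resolution, one way to avoid re-tuning the coordinates is to scale a box that was tuned on a reference picture. This is only a sketch under that assumption. The function name scale_box and the reference size (1000, 600) are hypothetical, so use the real size of the picture you tuned on.

```python
def scale_box(box, ref_size, size):
    """Scale a (left, top, right, bottom) box tuned on ref_size to a picture of size."""
    ref_w, ref_h = ref_size
    w, h = size
    left, top, right, bottom = box
    return (
        round(left * w / ref_w),
        round(top * h / ref_h),
        round(right * w / ref_w),
        round(bottom * h / ref_h),
    )


# Amount box from the code above; the sizes are assumed for illustration
amount_box = (645, 360, 719, 385)
print(scale_box(amount_box, (1000, 600), (2000, 1200)))
```

With Pillow, the current picture's size is available as new_img.size, and the scaled box can be passed to new_img.crop.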

The OCR call then extracts the digits in the cropped picture.
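OCR output for an amount often carries a currency sign, thousands separators, or stray spaces. The sketch below shows one way to turn such a string into a number. The function name parse_amount is made up for this example, and it assumes that commas and spaces inside the text are noise rather than separators between different numbers.

```python
import re
from decimal import Decimal


def parse_amount(text):
    """Return the first number found in OCR text as a Decimal, or None."""
    # Assumption: commas and spaces are OCR noise or thousands separators
    cleaned = text.replace(",", "").replace(" ", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return Decimal(match.group()) if match else None


print(parse_amount("¥ 1,234.50"))
```

Decimal is used instead of float so that amounts keep their exact two decimal places when summed later.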

In the same way, continue with the next field: the name.

2. Extract the name

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr


tool = pyocr.get_available_tools()[0]
img_url = "Invoice identification/pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

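# Extract the name
left = 135
top = 390
right = 450
bottom = 410
image_obj2 = new_img.crop((left, top, right, bottom))
# image_obj2.show()

# Save the cropped picture so that cnocr can read it from a file
image_obj2.save("tmp.png")
ocr = CnOcr()
res = ocr.ocr("tmp.png")
# The shape of res depends on the installed cnocr version;
# here res[0][0] holds the characters of the first recognized line
print(res[0][0])

print("".join(res[0][0]))

# print("".join('%s' % a for a in res))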


The name is in Chinese, so we cannot extract it the same way as the amount (digits). Use cnocr to recognize the Chinese text in the cropped picture. Depending on the cnocr version, the characters are found in res[0] or in res[0][0]:

image_obj2.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))


3. Extract the taxpayer identification number

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

#Taxpayer identification number
left = 135
top = 405
right = 450
bottom = 425
image_text3 = new_img.crop((left, top, right, bottom))
#Show pictures
#image_text3.show()

txt3 = tool.image_to_string(image_text3)
print(txt3)


This prints the taxpayer identification number recognized in the picture.
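Because the OCR result may contain spaces, line breaks, or lowercase letters, a small clean-up step can make the number easier to compare and store. The following is a sketch only. The function names are made up, and the accepted lengths (15, 18 and 20 characters) are an assumption about common taxpayer identification number formats that you should adjust to the invoices you actually handle.

```python
import re


def clean_tax_id(text):
    """Uppercase the OCR text and keep only ASCII letters and digits."""
    return re.sub(r"[^0-9A-Z]", "", text.upper())


def looks_like_tax_id(tax_id, lengths=(15, 18, 20)):
    """Rough plausibility check; `lengths` is an assumed set of valid lengths."""
    return len(tax_id) in lengths


# Placeholder string for illustration, not a real taxpayer number
raw = " 9111 0108-mA01abcd5x\n"
cleaned = clean_tax_id(raw)
print(cleaned, looks_like_tax_id(cleaned))
```

A failed check is a good reason to look at the cropped picture again, since it usually means the crop box cut off part of the number.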

4. Extract the drawer

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

# Extract the drawer

left = 458
top = 470
right = 570
bottom = 490
image_obj4 = new_img.crop((left, top, right, bottom))
#image_obj4.show()

image_obj4.save("tmp.png")
ocr = CnOcr()
res = ocr.ocr("tmp.png")
#print("".join(res[0]))

print("".join(res[0][0]))



Because the drawer's name is in Chinese, we again use cnocr to extract it from the picture, just as we did for the name.

OK, we have now extracted the four target fields from one invoice. Next, we recognize all the invoices in the pic folder and save the contents to Excel.

4. Batch-identify invoices and save them to Excel

Before reading the pictures, wrap the four operations above in functions so that they can be called for each invoice.
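One possible way to do this wrapping is sketched below. It is not the author's original code: the name extract_fields and the idea of passing the two OCR readers in as parameters are choices made for this example. The crop boxes are the ones from the four steps above.

```python
# Crop boxes (left, top, right, bottom) copied from the four steps above
BOXES = {
    "amount": (645, 360, 719, 385),
    "name": (135, 390, 450, 410),
    "tax_id": (135, 405, 450, 425),
    "drawer": (458, 470, 570, 490),
}


def extract_fields(img, read_digits, read_chinese):
    """Crop each field from img and pass the crop to the matching reader.

    img only needs a crop(box) method, as a PIL image has.
    read_digits and read_chinese are callables that turn a crop into text.
    """
    return {
        "amount": read_digits(img.crop(BOXES["amount"])),
        "name": read_chinese(img.crop(BOXES["name"])),
        "tax_id": read_digits(img.crop(BOXES["tax_id"])),
        "drawer": read_chinese(img.crop(BOXES["drawer"])),
    }
```

With the libraries used in this article, read_digits could be tool.image_to_string, and read_chinese could be a small function that saves the crop to a temporary file and calls CnOcr as in the name step. Presumably, functions named text1 to text4 correspond to the amount, name, taxpayer identification number and drawer entries respectively.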


Read all pictures in the folder.

from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))
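To collect every picture in a folder rather than a single file, a listing step like the following can be used. This is a minimal sketch with the standard library. The function name list_invoice_images and the set of accepted file extensions are assumptions for this example.

```python
from pathlib import Path

# Assumed set of picture extensions; extend it if your scans use other formats
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg"}


def list_invoice_images(folder):
    """Return the picture files directly inside folder, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in IMAGE_SUFFIXES
    )
```

Each returned path can be opened with open(path, 'rb') exactly like img_url in the code above.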


Start recognition and write the results to Excel.

# Batch identification: write the results to Excel
import os
import openpyxl

outwb = openpyxl.Workbook()  # Create the workbook that will be written
outws = outwb.create_sheet(index=0)  # Create a sheet at the first position

# Header row
outws.cell(row=1, column=1, value="name")
outws.cell(row=1, column=2, value="Taxpayer identification number")
outws.cell(row=1, column=3, value="amount of money")
outws.cell(row=1, column=4, value="Drawer")

count = 2  # Next row to write (row 1 is the header)
filePath = 'Invoice identification'
for i in sorted(os.listdir(filePath)):
    # Skip anything that is not a picture
    if not i.lower().endswith((".png", ".jpg", ".jpeg")):
        continue
    img_url = filePath + "/" + i
    with open(img_url, 'rb') as f:
        a = f.read()
    new_img = PI.open(io.BytesIO(a))
    # text1..text4 are the four extraction steps above wrapped as functions
    outws.cell(row=count, column=1, value=text2(new_img))
    outws.cell(row=count, column=2, value=text3(new_img))
    outws.cell(row=count, column=3, value=text1(new_img))
    outws.cell(row=count, column=4, value=text4(new_img))
    count = count + 1

# openpyxl writes the .xlsx format, so save with the .xlsx extension
outwb.save("Invoice summary-Brother Kun.xlsx")  # Save results

The results are saved to Invoice summary-Brother Kun.xlsx.

5. Invoice verification

In Brother Kun's discussion group, a friend who was chatting about this content suggested adding one more function: invoice verification.

Before the recognition above, first call a third-party interface to verify the invoice, and extract the target content only after the invoice passes verification. (Invoices issued by your own company may not need to be checked.)

1. Apply for a Baidu AI application


2. Get a token

Here client_id is the AK (API Key) obtained on the official website, and client_secret is the SK (Secret Key) obtained on the official website. Both are available after applying for the application above.
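The token request can be sketched with the standard library as follows. It assumes Baidu's OAuth endpoint at https://aip.baidubce.com/oauth/2.0/token with the client_credentials grant type. The function names are made up for this example, and YOUR_AK and YOUR_SK are placeholders for your own keys.

```python
import json
import urllib.parse
import urllib.request

TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(client_id, client_secret):
    """Build the token request URL from the AK (client_id) and SK (client_secret)."""
    params = {
        "grant_type": "client_credentials",
        "client_id": client_id,
        "client_secret": client_secret,
    }
    return TOKEN_URL + "?" + urllib.parse.urlencode(params)


def get_token(client_id, client_secret, timeout=10):
    """Request an access token; needs network access and valid keys."""
    request = urllib.request.Request(
        build_token_url(client_id, client_secret), data=b"", method="POST"
    )
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        data = json.load(resp)
    return data["access_token"]


# Usage (placeholders, not real keys):
# token = get_token("YOUR_AK", "YOUR_SK")
```

Keep the AK and SK out of shared source code, for example by reading them from environment variables.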

3. Check invoices

Let's take this picture as an example and check it.

The corresponding invoice types are as follows:

The results are as follows:

4. Check the invoice with the tax bureau

Again, take this picture as an example.

Fill in the information and click Check. The results are as follows:

The tax bureau's check gives clearer results. Readers can choose whichever way of checking suits their own situation.

6. Summary

This article largely achieves its goals and requirements, and the results are very good! The complete source code can be assembled from the code in the article (all of it has been shared above). Interested readers can try it for themselves!

Be sure to try! Be sure to try! Be sure to try!

Finally, Brother Kun mentioned two approaches: invoice recognition based on Python and invoice recognition based on machine learning. If you are interested, try them out and draw a mind map.

Original content takes effort, so please like and bookmark!


Keywords: Python Back-end Project

Added by listenmirndt on Tue, 11 Jan 2022 16:01:15 +0200