Official account: yk Kun Emperor
Reply "invoice identification" in the official account's chat to get the complete source code.
1. Scene description
2. Prepare the environment
3. Extract content
   1. Extract the invoice amount
   2. Name of the seller
   3. Extraction of taxpayer identification number
   4. Drawer
4. Batch-identify invoices and save them to Excel
5. Invoice verification
   1. Apply for a Baidu AI application
   2. Get token
   3. Check invoices
   4. Query the invoice with the tax bureau
6. Summary
Hello, I'm Brother Kun!
Today Brother Kun is sharing a practical office tip: batch-identifying invoices with Python and entering them into Excel. For finance students and company finance staff, typing reimbursement invoices into Excel one by one is torture. Brother Kun studies finance himself and knows the feeling, so today I'll walk you through the specific process and methods.
At year end especially, a company's finance staff are buried under piles of invoices. We have just learned Python, so let's put it to good use. Here I'll only briefly cover two approaches: batch invoice identification based on Python, and invoice identification based on machine learning. Interested readers can dig into them on their own. Let's start with batch invoice identification based on Python.
1. Scene description
We take four invoices as an example (Brother Kun found them online). Put the invoice pictures in the pic folder (any folder will do). We start by batch-identifying these four invoices.
Open any one of the invoices to see what it looks like.
Extraction targets: the invoice amount, the seller's name, the taxpayer identification number, and the drawer (the person who issued the invoice).
Finally, save these four fields of each invoice to Excel. If you have studied databases, I recommend storing them directly in a database instead, which doubles as database review, since databases are used in many real projects.
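For readers who take the database route, here is a minimal sketch using Python's built-in sqlite3 module. The table name invoices, the column names, and the sample values are illustrative assumptions, not part of the original code.

```python
import sqlite3

def save_invoices(conn, rows):
    """Create the table if needed and insert (name, tax_id, amount, drawer) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS invoices ("
        "id INTEGER PRIMARY KEY AUTOINCREMENT, "
        "name TEXT, tax_id TEXT, amount TEXT, drawer TEXT)"
    )
    conn.executemany(
        "INSERT INTO invoices (name, tax_id, amount, drawer) VALUES (?, ?, ?, ?)",
        rows,
    )
    conn.commit()

# Placeholder values, not real invoice data
conn = sqlite3.connect(":memory:")  # pass a file path such as "invoices.db" to persist
save_invoices(conn, [("Example Seller Co.", "000000000000000000", "100.00", "Zhang San")])
stored = conn.execute("SELECT name, tax_id, amount, drawer FROM invoices").fetchall()
print(stored)
```

The amount is stored as text here so that whatever the OCR returns is kept unchanged; converting it to a number is a separate step.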
2. Prepare the environment
The required libraries are as follows:
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr
The installation commands are as follows:
pip install pyocr
pip install cnocr
These are all the libraries and modules used for recognition in this article.
The invoices contain Chinese text, so we need to recognize the Chinese characters in the picture. cnocr is a good choice for this. It relies on machine learning internally, but you only need to know how to use it; its import is shown above.
Tip: besides the Python libraries above, you also need to install two external programs. Otherwise errors will occur, so install the related components as completely as you can.
Programs to install:
ImageMagick
Tesseract-OCR
The installation of these two programs is not covered here. You can search for a tutorial (CSDN has several) and install them yourself.
It is best to install both. In my earlier tests the data could still be displayed without them, and an error may not stop the program from running, but you should install them anyway; otherwise the program may fail to start when an error surfaces somewhere else.
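To see quickly whether the two programs can be found, a small check like the one below may help. It is only a sketch: the executable names are assumptions (Tesseract usually installs as tesseract; ImageMagick installs as magick in version 7 and as convert in older versions), and find_missing_programs is a helper name made up for this example.

```python
import shutil

def find_missing_programs(candidates):
    """Return the names in `candidates` that are not found on PATH."""
    return [name for name in candidates if shutil.which(name) is None]

# Assumed executable names; adjust them to your installation.
missing = find_missing_programs(["tesseract", "magick"])
if missing:
    print("Not found on PATH:", ", ".join(missing))
else:
    print("All external programs were found.")
```

This matters because pyocr.get_available_tools() returns an empty list when it finds no OCR program, and indexing that list with [0] then raises an IndexError.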
3. Extract content
Next, take one of the pictures as an example to explain how to extract the target content: amount, name, taxpayer identification number and drawer.
Read the picture pic1.png:
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))
1. Extract the invoice amount
First crop the region of the invoice that contains the amount. The coordinates may differ from one computer to another.
The commented-out values in the code are coordinates found while debugging on a particular computer.
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

# Extract the amount
# left = 741
# top = 420
# right = 850
# bottom = 445
left = 645
top = 360
right = 719
bottom = 385
image_text1 = new_img.crop((left, top, right, bottom))
# Show the cropped picture
image_text1.show()
txt1 = tool.image_to_string(image_text1)
print(txt1)
The values of left, top, right and bottom were found by adjusting the crop box many times. Locate the fields according to your own invoices. Don't think of it as a chore: if everything worked on the first try you would learn nothing, so read the code and type it out yourself.
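If your invoices share the same layout and differ only in resolution, one way to reuse a tuned crop box is to scale it by the ratio of the image sizes. The sketch below rests on that assumption; scale_box and the two image sizes are made up for illustration.

```python
def scale_box(box, tuned_size, actual_size):
    """Scale a (left, top, right, bottom) box tuned on an image of `tuned_size`
    so that it fits an image of `actual_size`. Sizes are (width, height)."""
    left, top, right, bottom = box
    sx = actual_size[0] / tuned_size[0]
    sy = actual_size[1] / tuned_size[1]
    return (round(left * sx), round(top * sy), round(right * sx), round(bottom * sy))

amount_box = (645, 360, 719, 385)  # crop box for the amount, from the article
# Hypothetical sizes: the box was tuned on a 1000x600 scan, the new scan is 2000x1200.
scaled = scale_box(amount_box, (1000, 600), (2000, 1200))
print(scaled)
```

With Pillow, new_img.size gives the (width, height) of the opened picture, and the scaled tuple can be passed straight to new_img.crop().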
The OCR tool then reads the digits in the cropped picture.
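OCR output for an amount often carries a currency sign, spaces, or a trailing newline. A small clean-up step, sketched below, can turn it into a number. The sample strings are made up, and parse_amount is a hypothetical helper, not part of pyocr.

```python
import re
from decimal import Decimal

def parse_amount(ocr_text):
    """Return the first number in the OCR text as a Decimal, or None if there is none."""
    # Drop thousands separators and spaces that OCR may insert between digits.
    cleaned = ocr_text.replace(",", "").replace(" ", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return Decimal(match.group()) if match else None

print(parse_amount("¥ 1,234.50\n"))
print(parse_amount("no digits"))
```

Decimal is used instead of float so that amounts such as 1234.50 keep their exact value.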
In the same way, we go on to extract the next field: the name.
2. Name of the seller
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "Invoice identification/pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

# Extract the name
left = 135
top = 390
right = 450
bottom = 410
image_obj2 = new_img.crop((left, top, right, bottom))
# image_obj2.show()
image_obj2.save("tmp.png")
ocr = CnOcr()
res = ocr.ocr("tmp.png")
print(res[0][0])
print("".join(res[0][0]))
# print("".join('%s' %a for a in res))
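The name is in Chinese, so we can no longer handle it the way we extracted the amount (digits); we need cnocr to read the Chinese characters in the picture. The structure of the value returned by ocr.ocr() may differ between cnocr versions, so the indexing can be res[0] or res[0][0]; print res once to see which one your version needs.
image_obj2.save("tmp.jpg")
ocr = CnOcr()
res = ocr.ocr("tmp.jpg")
print("".join(res[0]))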
3. Extraction of taxpayer identification number
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

# Taxpayer identification number
left = 135
top = 405
right = 450
bottom = 425
image_text3 = new_img.crop((left, top, right, bottom))
# Show the cropped picture
# image_text3.show()
txt3 = tool.image_to_string(image_text3)
print(txt3)
This reads the taxpayer identification number from the picture and prints it.
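Since the OCR may insert spaces or line breaks into the number, a light clean-up and plausibility check can catch bad reads before they reach Excel. The sketch below is an assumption-laden example: the set of accepted lengths is my assumption and should be adjusted to the invoices you handle, and the sample string is made up.

```python
import re

# Assumption: plausible lengths for a taxpayer identification number.
ASSUMED_LENGTHS = {15, 17, 18, 20}

def normalize_tax_id(ocr_text):
    """Remove everything except letters and digits, then upper-case the rest."""
    return re.sub(r"[^0-9A-Za-z]", "", ocr_text).upper()

def looks_like_tax_id(tax_id):
    """Rough plausibility check based on length only."""
    return len(tax_id) in ASSUMED_LENGTHS

raw = " 91110000 ab12345678\n"  # made-up OCR output, not a real number
tax_id = normalize_tax_id(raw)
print(tax_id, looks_like_tax_id(tax_id))
```

A number that fails the check is a good candidate for re-cropping with adjusted coordinates or for manual review.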
4. Drawer
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))

# Extract the drawer
left = 458
top = 470
right = 570
bottom = 490
image_obj4 = new_img.crop((left, top, right, bottom))
# image_obj4.show()
image_obj4.save("tmp.png")
ocr = CnOcr()
res = ocr.ocr("tmp.png")
# print("".join(res[0]))
print("".join(res[0][0]))
Because the drawer's name is in Chinese, we again use cnocr to read it, just as we did for the seller's name.
OK, we have now extracted the four target fields from one invoice. Next we identify every invoice in the pic folder and save the results to Excel.
4. Batch-identify invoices and save them to Excel
Before reading the pictures, wrap the four operations above in functions so that they can be called for each invoice.
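One possible way to do this wrapping is sketched below. It is not the author's original code: extract_fields, FIELD_BOXES, and FakeImage are names made up for this example, and the OCR steps are passed in as functions so that the sketch runs even without pyocr or cnocr installed. The crop boxes are the ones used in the snippets above.

```python
# Crop boxes (left, top, right, bottom) taken from the snippets above.
FIELD_BOXES = {
    "amount": (645, 360, 719, 385),
    "name": (135, 390, 450, 410),
    "tax_id": (135, 405, 450, 425),
    "drawer": (458, 470, 570, 490),
}
CHINESE_FIELDS = {"name", "drawer"}  # fields read with the Chinese OCR

def extract_fields(img, read_digits, read_chinese):
    """Crop each field from `img` and run the matching reader on the crop.

    `img` only needs a PIL-style crop(box) method. `read_digits` and
    `read_chinese` are callables that take the crop and return text.
    """
    result = {}
    for field, box in FIELD_BOXES.items():
        crop = img.crop(box)
        reader = read_chinese if field in CHINESE_FIELDS else read_digits
        result[field] = reader(crop).strip()
    return result

class FakeImage:
    """Stand-in for a PIL image so the sketch runs without OCR installed."""
    def crop(self, box):
        return box  # a real image would return the cropped picture

fields = extract_fields(
    FakeImage(),
    read_digits=lambda crop: f"digits@{crop[0]},{crop[1]} ",
    read_chinese=lambda crop: f"chinese@{crop[0]},{crop[1]}",
)
print(fields)
```

With the real libraries, read_digits could be tool.image_to_string, and read_chinese could be a small function that saves the crop to a temporary file and joins the characters returned by CnOcr, as in the earlier snippets. The functions text1 to text4 called in the Excel code would then be four thin wrappers around the same idea.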
Read all pictures in the folder.
from PIL import Image as PI
import pyocr
import io
import pyocr.builders
from cnocr import CnOcr

tool = pyocr.get_available_tools()[0]
img_url = "pic1.png"
with open(img_url, 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))
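The snippet above opens a single picture. To collect every picture in a folder, something like the following sketch could be used. The list of extensions and the helper name list_invoice_images are assumptions, and the demonstration runs on a temporary folder with empty placeholder files.

```python
import os
import tempfile

IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")  # assumed picture formats

def list_invoice_images(folder):
    """Return sorted paths of the picture files directly inside `folder`."""
    paths = []
    for name in sorted(os.listdir(folder)):
        if name.lower().endswith(IMAGE_EXTENSIONS):
            paths.append(os.path.join(folder, name))
    return paths

# Demonstration on a temporary folder with empty placeholder files.
with tempfile.TemporaryDirectory() as demo_folder:
    for name in ("pic2.png", "pic1.png", "notes.txt", "pic3.JPG"):
        open(os.path.join(demo_folder, name), "wb").close()
    found = [os.path.basename(p) for p in list_invoice_images(demo_folder)]
print(found)
```

Each returned path can then be opened exactly like pic1.png above.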
Start the recognition and write the results to Excel.
# Batch identification
# Write to Excel
import os
import openpyxl

outwb = openpyxl.Workbook()  # Create the workbook that will be written
outws = outwb.create_sheet(index=0)  # Create a sheet in the workbook
outws.cell(row=1, column=1, value="name")
outws.cell(row=1, column=2, value="Taxpayer identification number")
outws.cell(row=1, column=3, value="amount of money")
outws.cell(row=1, column=4, value="Drawer")
count = 2
filePath = 'Invoice identification'
pic_name = []
# for i,j,name in os.walk(filePath):
#     pic_name = name
# for i in pic_name[0:1]:
#     img_url = filePath+"/"+i
#     with open(img_url, 'rb') as f:
#         a = f.read()
#     new_img = PI.open(io.BytesIO(a))
#     # Write one row per invoice
#     outws.cell(row=count, column=1, value=text2(new_img))
#     outws.cell(row=count, column=2, value=text3(new_img))
#     outws.cell(row=count, column=3, value=text1(new_img))
#     outws.cell(row=count, column=4, value=text4(new_img))
#     count = count + 1
# outwb.save("Invoice summary-Brother Kun.xlsx")  # Save results
with open("Invoice identification/pic1.png", 'rb') as f:
    a = f.read()
new_img = PI.open(io.BytesIO(a))
# Write one row for this invoice
outws.cell(row=count, column=1, value=text2(new_img))
outws.cell(row=count, column=2, value=text3(new_img))
outws.cell(row=count, column=3, value=text1(new_img))
outws.cell(row=count, column=4, value=text4(new_img))
outwb.save("Invoice summary-Brother Kun.xlsx")  # Save results; openpyxl writes the .xlsx format, not .xls
The results are saved in the file Invoice summary-Brother Kun.xlsx.
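If openpyxl is not at hand, a CSV file, which Excel also opens, is a simple alternative. The sketch below swaps in Python's standard csv module for openpyxl and skips pictures that fail instead of stopping the whole batch. write_summary and fake_extract are hypothetical names, and the row values are placeholders rather than OCR results.

```python
import csv
import io

HEADER = ["name", "Taxpayer identification number", "amount of money", "Drawer"]

def write_summary(out_file, image_paths, extract):
    """Write one CSV row per invoice. `extract(path)` returns the four
    fields in header order. Returns a list of (path, error) for failures."""
    writer = csv.writer(out_file)
    writer.writerow(HEADER)
    failed = []
    for path in image_paths:
        try:
            writer.writerow(extract(path))
        except Exception as exc:  # keep going if one picture cannot be read
            failed.append((path, str(exc)))
    return failed

def fake_extract(path):
    """Hypothetical extractor returning placeholder values instead of OCR results."""
    if path == "bad.png":
        raise ValueError("unreadable picture")
    return ["Example Seller Co.", "000000000000000000", "100.00", "Zhang San"]

buffer = io.StringIO()
failed = write_summary(buffer, ["pic1.png", "bad.png"], fake_extract)
print(buffer.getvalue())
print(failed)
```

To write a real file, open it with open("summary.csv", "w", newline="", encoding="utf-8-sig") and pass it as out_file; the utf-8-sig encoding helps Excel display Chinese text correctly.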
5. Invoice verification
In Brother Kun's discussion group, a friend who was chatting with me about this article suggested adding one more function: invoice verification.
Before the identification above, first call a third-party interface to verify the invoice, and extract the target content only after the invoice passes verification (invoices issued by your own company may not need checking).
1. Apply for a Baidu AI application
2. Get token
Here client_id is the AK (API Key) and client_secret is the SK (Secret Key) obtained on the official website. Both become available once you have applied for the application described above.
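As a rough sketch of the token step, the code below builds the request from the AK and SK. The endpoint and parameter names follow Baidu AI's public documentation as I understand it, so treat them as assumptions and verify them against the current docs. YOUR_AK and YOUR_SK are placeholders, and build_token_url and get_token are names made up for this example.

```python
import json
import urllib.parse
import urllib.request

# Assumed token endpoint; check Baidu AI's documentation before use.
TOKEN_URL = "https://aip.baidubce.com/oauth/2.0/token"

def build_token_url(api_key, secret_key):
    """Build the token request URL from the AK and SK."""
    query = urllib.parse.urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return f"{TOKEN_URL}?{query}"

def get_token(api_key, secret_key):
    """Request a token over the network and return its access_token field."""
    url = build_token_url(api_key, secret_key)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)["access_token"]

# Placeholders only; replace them with the AK and SK of your own application.
token_url = build_token_url("YOUR_AK", "YOUR_SK")
print(token_url)
```

Only build_token_url runs here; get_token makes a real network call and needs valid credentials.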
3. Check invoices
Let's check this picture as an example.
The corresponding invoice types are as follows:
The results are as follows:
4. Query the invoice with the tax bureau
We use the same picture as the example.
Fill in the information and click Check. The results are as follows:
The tax bureau's query gives a clearer result. Readers can choose whichever checking method suits their situation.
6. Summary
This article has largely achieved its goals, and the results are good. The complete source code can be assembled from the snippets in the article (all of it has been shared here). Interested readers can try it themselves!
Be sure to try! Be sure to try! Be sure to try!
Brother Kun has now covered invoice recognition based on Python and touched on invoice recognition based on machine learning. If you are interested, try it out and draw a mind map of the process.
Finally, original writing isn't easy. Please like and bookmark!