python realizes batch VAT invoice character recognition (ocr) and writes it into excel

Background:

Today, I found that someone made an invoice ocr to realize invoice character recognition, so as to solve the problem of inefficient and cumbersome manual input and greatly improve the work efficiency. However, his ocr doesn't seem to be very mature. I just want to call Baidu api to realize the invoice character recognition part, select what I want and write it into excel. It took some time, but it did, Share it and hope to help people in need

First on the renderings, design privacy, mosaic
Write table

preparation

1. For environment configuration, you need to have the following four Libraries

import requests
import base64
import os
import xlwt

2. you need to register an account in Baidu intelligent cloud, then create a text recognition application in the management console, so that the website will give you a API key and a Secret Key. With these two strings of characters, we can proceed with the next operation and get the identity token access_. token

This is Baidu Intelligent Cloud management console.


Then create the application here


Select the character recognition direction and remember to check the VAT invoice. It should be selected by default


Here are the api key and Secret Key. At that time, the one in the code also needs to be replaced with its own to obtain access_token also needs

Then open the technical document, which has an api document to teach you how to obtain access_token


Get access_ The code of token is as follows. Please replace api key and Secret Key with your own

# encoding:utf-8
import requests 

# client_id is the AK and client obtained on the official website_ Secret is the SK obtained on the official website
host = 'https://aip. baidubce. com/oauth/2.0/token? grant_ type=client_ credentials&client_ Id = [AK obtained on official website] & Client_ Secret = [SK obtained on official website] '
response = requests.get(host)
if response:
    print(response.json())

Here we have access_ After the token is, we need to copy and paste it into the code, and our preparations are completed

Modify the invoice storage address path and access in the code_ The token can be used
There are comments in the code
Upper code

# encoding:utf-8

import requests
import base64
import os
import xlwt
'''
VAT invoice identification
'''
# Get invoice body content
def get_context(pic):
    # print('getting picture body content! ')
    data = {}
    try:
        request_url = "https://aip.baidubce.com/rest/2.0/ocr/v1/vat_invoice"
        # Open picture file in binary mode
        f = open(pic, 'rb')
        img = base64.b64encode(f.read())
        params = {"image":img}

        # You need to replace it with your own access_token
        access_token = 'Yours access_token'

        request_url = request_url + "?access_token=" + access_token
        headers = {'content-type': 'application/x-www-form-urlencoded'}
        response = requests.post(request_url, data=params, headers=headers)
        if response:
            # print (response.json())
            json1 = response.json()
            data['SellerRegisterNum'] = json1['words_result']['SellerRegisterNum']
            data['InvoiceDate'] = json1['words_result']['InvoiceDate']
            data['PurchasserName'] = json1['words_result']['PurchaserName']
            data['SellerName'] = json1['words_result']['SellerName']
            data['AmountInFiguers'] = json1['words_result']['AmountInFiguers']
            # print(data['AmountInFiguers'])
        # print('body content obtained successfully! ')
        return data

    except Exception as e:
        print(e)
    return data

# Defines the function that generates the image path
def pics(path):
    print('Generating picture path')
    #Generate an empty list for storing picture paths
    pics = []
    # Traverse the folder, find the files with suffixes jpg and png, and add them to the list after sorting
    for filename in os.listdir(path):
        if filename.endswith('jpg') or filename.endswith('png'):
            pic = path + '/' + filename
            pics.append(pic)
    print('Image path generation succeeded!')
    return pics
# Define a function to get the body content of all files in the folder, return one dictionary at a time, and store all the returned dictionaries in a list
def datas(pics):
    datas = []
    for p in pics:
        data = get_context(p)
        datas.append(data)
    return datas

# Define a function that writes data to excel
def save(datas):
    print('Writing data!')
    book = xlwt.Workbook(encoding='utf-8', style_compression=0)
    sheet = book.add_sheet('VAT invoice content registration', cell_overwrite_ok=True)
    # Set the header, which can be set according to your own needs. I have set five here
    title = ['Billing date', 'Taxpayer identification number', 'Name of purchaser', 'Name of seller', 'Purchase amount']
    for i in range(len(title)):
        sheet.write(0, i, title[i])
    for d in range(len(datas)):
        for j in range(len(title)):
            sheet.write(d + 1, 0, datas[d]['InvoiceDate'])
            sheet.write(d + 1, 1, datas[d]['SellerRegisterNum'])
            sheet.write(d + 1, 2, datas[d]['PurchasserName'])
            sheet.write(d + 1, 3, datas[d]['SellerName'])
            sheet.write(d + 1, 4, datas[d]['AmountInFiguers'])
    print('Data writing succeeded!')
    book.save('VAT invoice.xls')

def main():
    print('Start execution!!!')
    # This is the storage address of your invoice, which can be changed by yourself
    path = 'D:/fapiao'
    Pics = pics(path)
    Datas = datas(Pics)
    save(Datas)
    print('End of execution!')


if __name__ == '__main__':
    main()


be careful:

This project can realize suffix jpg and png's value-added tax invoice text recognition and write it into excel. If other formats are required, it can be modified slightly

Baidu Intelligent Cloud return to the field is very rich, on demand for use, this battle I choose five fields, please use your own modification, are included in the json file returned.

To sum up, path and access have been modified_ Token can be used. Isn't it very convenient

Thank you for reading. See you next time!!!

Keywords: Python OCR

Added by Sgarissta on Sat, 29 Jan 2022 15:52:32 +0200