Python office automation from Word to Excel

Yunqi information:[ Click to see more industry information]
Here you can find the first-hand cloud information of different industries. What are you waiting for? Come on!

Hello, everyone. Today, I was entrusted by a small partner of a civil servant to help him. There is a word like this (the specific content of the document has been modified due to the privacy of the document)

There are nearly 2600 table columns with similar format in total. Each column includes the following information:

date
Issuing unit
Symbol
Title
Sign in column
The three bold items need to be extracted and stored in the Excel table. The table style is as follows:

That is to say, you need to fill in the receiving time, document title and document number to the designated location, and at the same time, you need to change the time to the standard format. If you want to copy and modify the time completely manually, according to the time of 10s for an entry, you can complete 6 items in one minute, then you need to:

And this kind of file arrangement with regular format is very suitable for Python execution. OK, let's invite Python to appear next. I present the necessary information in the code with annotation information.

First use Python to import Word files

# Import the required library docx 
from docx import Document 
 
# Specify the path of the file 
path = r'C:\Users\word.docx'  
 
# read file 
document = Document(path) 
 
# Read all tables in word 
tables = document.tables 

Then divide the questions one by one. First, try to get the three required information of the first file entry in the first table

# Get the first table 
table0 = tables[0] 

If you look carefully, you can see that a file entry occupies 3 rows, so you can set the step size to 3 when iterating through all rows of the table

Pay attention to the observation table, and analyze the required content clearly according to row and cell

# Put a variable in the global to count and number 
n = 0 
for i in range(0, len(table0.rows) + 1, 3): 
    # date 
    date = table0.cell(i, 1).text 
    # Title 
    title = table0.cell(i + 1, 1).text.strip() 
    # Symbol 
    dfn = tables[j].cell(i, 3).text.strip() 
    print(n, date, tite, dfn) 

What we need to solve next is that time we get is in the form of 2 / 1 day / month. We need to convert to YYYY-MM-DD format, which takes advantage of strptime and strftime functions of datetime package:

strptime: parsing time contained in string
strftime: convert to required time format

import datetime 
 
n = 0 
for i in range(0, len(table0.rows) + 1, 3): 
    # date 
    date = table0.cell(i, 1).text 
    # Some entry times are empty, so we don't make too many judgments here 
    if '/' in date: 
        date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') 
    else: 
        date = '-' 
    # Title 
    title = table0.cell(i + 1, 1).text.strip() 
    # Symbol 
    dfn = tables[j].cell(i, 3).text.strip() 
    print(n, date, tite, dfn) 

In this way, the content of a table is parsed. Note that table[0] is used here, that is, the first table. You can traverse all tables and add a nested loop. In addition, you can catch exceptions to increase program flexibility

n = 0 
for j in range(len(tables)): 
    for i in range(0, len(tables[j].rows)+1, 3): 
        try: 
            # date 
            date = tables[j].cell(i, 1).text 
            if '/' in date: 
                date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') 
            else: 
                date = '-' 
            # Title 
            title = tables[j].cell(i + 1, 1).text.strip() 
            # Symbol 
            dfn = tables[j].cell(i, 3).text.strip() 
            n += 1 
            print(n, date, title, dfn) 
        except Exception as error: 
            # Catch exceptions, or write them to the log for easy viewing and management 
            print(error) 
            continue 

The information can be exported after parsing and obtaining. The package used is openpyxl

from openpyxl import Workbook 
 
# instantiation  
wb = Workbook() 
# Get current sheet 
sheet = wb.active 
# Set header 
header = ['Serial number', 'Receiving time', 'Document No', 'Document title', 'Symbol', 'Remarks'] 
sheet.append(header) 

Add the following code at the end of the innermost parsing loop

row = [n, date, ' ', title, dfn, ' '] 
sheet.append(row) 

Last remember save of thread

wb.save(r'C:\Users\20200420.xlsx') 

The running time is about 10 minutes. After a while, the program is finished

Finally, the complete code is attached. The code is very simple, and it is most important to clarify the thinking

from docx import Document 
import datetime 
from openpyxl import Workbook 
 
wb = Workbook() 
sheet = wb.active 
header = ['Serial number', 'Receiving time', 'Document No', 'Document title', 'Symbol', 'Remarks'] 
sheet.append(header) 
 
 
path = r'C:\Users\word.docx' 
document = Document(path) 
tables = document.tables 
 
n = 0 
for j in range(len(tables)): 
    for i in range(0, len(tables[j].rows)+1, 3): 
        try: 
            # date 
            date = tables[j].cell(i, 1).text 
            if '/' in date: 
                date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') 
            else: 
                date = '-' 
            # Title 
            title = tables[j].cell(i + 1, 1).text.strip() 
            # Symbol 
            dfn = tables[j].cell(i, 3).text.strip() 
            n += 1 
            print(n, date, title, dfn) 
            row = [n, date, ' ', title, dfn, ' '] 
            sheet.append(row) 
        except Exception as error: 
            # Catch exceptions, or write them to the log for easy viewing and management 
            print(error) 
            continue 
 
wb.save(r'C:\Users\20200420.xlsx') 

[yunqi online class] product technology experts share every day!
Course address: https://yqh.aliyun.com/live

Join the community immediately, face to face with experts, and keep abreast of the latest news of the course!
[yunqi online classroom community] https://c.tb.cn/F3.Z8gvnK

Original release time: April 21, 2020
Author: Chen Xi
This article comes from "51CTO". You can pay attention to“ 51CTO"

Keywords: Python Excel

Added by shoombooltala on Tue, 21 Apr 2020 10:49:36 +0300