Yunqi information:[ Click to see more industry information]
Here you can find the first-hand cloud information of different industries. What are you waiting for? Come on!
Hello, everyone. Today, I was entrusted by a small partner of a civil servant to help him. There is a word like this (the specific content of the document has been modified due to the privacy of the document)
There are nearly 2600 table columns with similar format in total. Each column includes the following information:
date
Issuing unit
Symbol
Title
Sign in column
The three bold items need to be extracted and stored in the Excel table. The table style is as follows:
That is to say, you need to fill in the receiving time, document title and document number to the designated location, and at the same time, you need to change the time to the standard format. If you want to copy and modify the time completely manually, according to the time of 10s for an entry, you can complete 6 items in one minute, then you need to:
And this kind of file arrangement with regular format is very suitable for Python execution. OK, let's invite Python to appear next. I present the necessary information in the code with annotation information.
First use Python to import Word files
# Import the required library docx from docx import Document # Specify the path of the file path = r'C:\Users\word.docx' # read file document = Document(path) # Read all tables in word tables = document.tables
Then divide the questions one by one. First, try to get the three required information of the first file entry in the first table
# Get the first table table0 = tables[0]
If you look carefully, you can see that a file entry occupies 3 rows, so you can set the step size to 3 when iterating through all rows of the table
Pay attention to the observation table, and analyze the required content clearly according to row and cell
# Put a variable in the global to count and number n = 0 for i in range(0, len(table0.rows) + 1, 3): # date date = table0.cell(i, 1).text # Title title = table0.cell(i + 1, 1).text.strip() # Symbol dfn = tables[j].cell(i, 3).text.strip() print(n, date, tite, dfn)
What we need to solve next is that time we get is in the form of 2 / 1 day / month. We need to convert to YYYY-MM-DD format, which takes advantage of strptime and strftime functions of datetime package:
strptime: parsing time contained in string
strftime: convert to required time format
import datetime n = 0 for i in range(0, len(table0.rows) + 1, 3): # date date = table0.cell(i, 1).text # Some entry times are empty, so we don't make too many judgments here if '/' in date: date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') else: date = '-' # Title title = table0.cell(i + 1, 1).text.strip() # Symbol dfn = tables[j].cell(i, 3).text.strip() print(n, date, tite, dfn)
In this way, the content of a table is parsed. Note that table[0] is used here, that is, the first table. You can traverse all tables and add a nested loop. In addition, you can catch exceptions to increase program flexibility
n = 0 for j in range(len(tables)): for i in range(0, len(tables[j].rows)+1, 3): try: # date date = tables[j].cell(i, 1).text if '/' in date: date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') else: date = '-' # Title title = tables[j].cell(i + 1, 1).text.strip() # Symbol dfn = tables[j].cell(i, 3).text.strip() n += 1 print(n, date, title, dfn) except Exception as error: # Catch exceptions, or write them to the log for easy viewing and management print(error) continue
The information can be exported after parsing and obtaining. The package used is openpyxl
from openpyxl import Workbook # instantiation wb = Workbook() # Get current sheet sheet = wb.active # Set header header = ['Serial number', 'Receiving time', 'Document No', 'Document title', 'Symbol', 'Remarks'] sheet.append(header)
Add the following code at the end of the innermost parsing loop
row = [n, date, ' ', title, dfn, ' '] sheet.append(row)
Last remember save of thread
wb.save(r'C:\Users\20200420.xlsx')
The running time is about 10 minutes. After a while, the program is finished
Finally, the complete code is attached. The code is very simple, and it is most important to clarify the thinking
from docx import Document import datetime from openpyxl import Workbook wb = Workbook() sheet = wb.active header = ['Serial number', 'Receiving time', 'Document No', 'Document title', 'Symbol', 'Remarks'] sheet.append(header) path = r'C:\Users\word.docx' document = Document(path) tables = document.tables n = 0 for j in range(len(tables)): for i in range(0, len(tables[j].rows)+1, 3): try: # date date = tables[j].cell(i, 1).text if '/' in date: date = datetime.datetime.strptime(date, '%d/%m').strftime('2020-%m-%d') else: date = '-' # Title title = tables[j].cell(i + 1, 1).text.strip() # Symbol dfn = tables[j].cell(i, 3).text.strip() n += 1 print(n, date, title, dfn) row = [n, date, ' ', title, dfn, ' '] sheet.append(row) except Exception as error: # Catch exceptions, or write them to the log for easy viewing and management print(error) continue wb.save(r'C:\Users\20200420.xlsx')
[yunqi online class] product technology experts share every day!
Course address: https://yqh.aliyun.com/live
Join the community immediately, face to face with experts, and keep abreast of the latest news of the course!
[yunqi online classroom community] https://c.tb.cn/F3.Z8gvnK
Original release time: April 21, 2020
Author: Chen Xi
This article comes from "51CTO". You can pay attention to“ 51CTO"