Finding literature for my graduation thesis was a pain, so I crawled a whole literature site directly with Python

1, Preface

Graduation season is almost here, brothers, and the graduation thesis is a troublesome thing. Looking up references online one by one is a waste of time, so let's just write a crawler, download the literature in batches, and read it at our leisure. How comfortable is that?

2, Preparatory work

Software
Python and PyCharm will do; any version works, as long as it isn't Python 2.

Modules

requests   # simulate HTTP requests
selenium   # browser automation

 

Press Win+R to open the Run box, type cmd, and hit OK to open a command prompt. Type pip install followed by the name of the module you want and press Enter to install it. If downloads are slow, switch to a domestic mirror source.
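For example, both modules in one command (the mirror URL shown is the widely used Tsinghua PyPI mirror; any domestic mirror works):

pip install requests selenium

# With a domestic mirror, if the default index is slow:
pip install requests selenium -i https://pypi.tuna.tsinghua.edu.cn/simple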

Then download the Chrome driver (chromedriver) whose version is closest to your browser's.
If you don't know how, see my pinned article.
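If chromedriver isn't on your PATH, you can also point Selenium at it explicitly. A minimal Selenium 4 sketch; the path is a placeholder:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Placeholder path -- change it to wherever you saved chromedriver
service = Service("C:/path/to/chromedriver.exe")
driver = webdriver.Chrome(service=service)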

3, Start crawling

Page analysis

First, analyze the elements of the site's pages. Normally, you type what you want to search for into the input box on the home page, and the site jumps to the search results page.

Through the browser's inspect tool, we get the XPaths of the input box and the search button respectively:

input_xpath = '/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/input[1]'    # search input box
button_xpath = '/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/input[2]'   # search button

 

Type the search terms into the input box and click the search button to reach the results page.
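You can sanity-check these two XPaths in a quick interactive session before writing the full script. A minimal sketch, assuming chromedriver is on your PATH and the masked URL is filled in with the real site:

from selenium import webdriver
from selenium.webdriver.common.by import By

input_xpath = '/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/input[1]'
button_xpath = '/html[1]/body[1]/div[1]/div[2]/div[1]/div[1]/input[2]'

driver = webdriver.Chrome()
driver.get("https://www.****.net")   # masked URL, same as in the full script below
driver.find_element(By.XPATH, input_xpath).send_keys("Python")
driver.find_element(By.XPATH, button_xpath).click()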

Taking Python as an example, the search returns 15,925 results over 300 pages (the site displays at most 300 pages of results, which is why that's fewer than 15,925/20 would suggest). Each page holds 20 entries, and each entry includes the title, authors, source, and so on.

By analyzing the results page, we can find the pattern in the XPath of each entry:

/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[1]/td[2]

 

The second-to-last tag index (tr[1], tr[2], ...) selects the entry on the page, while the final index, td[2] through td[6], selects the title, authors, source, publication date, and database respectively. The abstract and the document itself can't be downloaded from the results page; you have to click through into each entry's detail page.

On the detail page, you can easily locate the abstract by its class name abstract-text, and the download link by the class name btn-dlcaj. Other elements work the same way.

With the page analysis done, we can start writing code!

Import the libraries we'll use

import time 
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from urllib.parse import urljoin

 

Create the browser object and set the relevant parameters

Make get() return immediately instead of waiting for the whole page to load

desired_capabilities = DesiredCapabilities.CHROME
desired_capabilities["pageLoadStrategy"] = "none"
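Note: on Selenium 4.10+ the desired_capabilities keyword was removed from webdriver.Chrome. If you're on a newer version (an assumption about your setup, not the original code), the equivalent is to set the strategy on the options object instead:

from selenium import webdriver

options = webdriver.ChromeOptions()
options.page_load_strategy = "none"   # same effect as the pageLoadStrategy capability above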

 

Set up the Chrome driver options

options = webdriver.ChromeOptions()

 

Tell Chrome not to load images, to speed things up.

options.add_experimental_option("prefs", {"profile.managed_default_content_settings.images": 2})

 

Run headless (don't show a browser window)

options.add_argument('--headless')

 

Create the Chrome driver

driver = webdriver.Chrome(options=options, desired_capabilities=desired_capabilities)   # pass the capabilities in, otherwise the page-load strategy set above has no effect

 

Set search subject

theme = "Python"

 

Set the required number of articles

papers_need = 100

 

Open the page and search for the keyword

Open the home page

driver.get("https://www.****.net")

 

I've masked the site URL here; swap in the biggest Chinese literature-search site yourself.

Pass in the keyword

WebDriverWait( driver, 100 ).until( EC.presence_of_element_located( (By.XPATH ,'''//*[@id="txt_SearchText"]''') ) ).send_keys(theme)

 

Click search

WebDriverWait( driver, 100 ).until( EC.presence_of_element_located( (By.XPATH ,"/html/body/div[1]/div[2]/div/div[1]/input[2]") ) ).click()
time.sleep(3)

 

Click to switch to Chinese Literature

WebDriverWait( driver, 100 ).until( EC.presence_of_element_located( (By.XPATH ,"/html/body/div[5]/div[1]/div/div/div/a[1]") ) ).click()
time.sleep(1)

 

Get the total number of documents and pages

res_unm = WebDriverWait( driver, 100 ).until( EC.presence_of_element_located( (By.XPATH ,"/html/body/div[5]/div[2]/div[2]/div[2]/form/div/div[1]/div[1]/span[1]/em") ) ).text

 

Strip the commas used as thousands separators

res_unm = int(res_unm.replace(",", ""))
page_unm = (res_unm + 19) // 20   # ceiling division: 20 entries per page (int(res_unm/20)+1 overcounts when divisible by 20)
print(f"Found {res_unm} results across {page_unm} pages.")

 

Parse result page

A running counter controls the number of articles crawled.

count = 1

 

While we've crawled fewer articles than needed, loop over the result pages.

while count <= papers_need:

 

Wait until the page has loaded, then sleep for 3 seconds. Adding time.sleep(3) delays in the right places not only gives pages time to load, it also keeps your IP from being blocked for hitting the site too fast (a randomized variant is sketched after the next block).

time.sleep(3)

title_list = WebDriverWait( driver, 10 ).until( EC.presence_of_all_elements_located( (By.CLASS_NAME  ,"fz14") ) )
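If you'd rather the delays not be perfectly regular, a tiny helper like the following could replace the fixed sleeps. This is a hypothetical addition, not part of the original script:

import random
import time

def polite_sleep(base=3, jitter=2):
    # Sleep for `base` seconds plus up to `jitter` extra seconds,
    # so requests don't arrive at perfectly regular intervals.
    time.sleep(base + random.uniform(0, jitter))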

 

Looping through entries on a page

for i in range(len(title_list)):
    try:
        term = (count - 1) % 20 + 1   # position of this entry on the page (1-20); count % 20 would yield 0 for every 20th entry
        title_xpath = f"/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[{term}]/td[2]"
        author_xpath = f"/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[{term}]/td[3]"
        source_xpath = f"/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[{term}]/td[4]"
        date_xpath = f"/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[{term}]/td[5]"
        database_xpath = f"/html[1]/body[1]/div[5]/div[2]/div[2]/div[2]/form[1]/div[1]/table[1]/tbody[1]/tr[{term}]/td[6]"
        title = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,title_xpath) ) ).text
        authors = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,author_xpath) ) ).text
        source = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,source_xpath) ) ).text
        date = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,date_xpath) ) ).text
        database = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,database_xpath) ) ).text

 

Click entry

title_list[i].click()

 

Get all of the driver's window handles

n = driver.window_handles 

 

Switch the driver to the newly opened tab

driver.switch_to.window(n[-1])   # switch_to_window() was deprecated and removed in Selenium 4

 

Start getting page information

# title = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,"/html/body/div[2]/div[1]/div[3]/div/div/div[3]/div/h1") ) ).text
# authors = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,"/html/body/div[2]/div[1]/div[3]/div/div/div[3]/div/h3[1]") ) ).text
institute = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.XPATH ,"/html[1]/body[1]/div[2]/div[1]/div[3]/div[1]/div[1]/div[3]/div[1]/h3[2]") ) ).text
abstract = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.CLASS_NAME  ,"abstract-text") ) ).text
try:
    keywords = WebDriverWait( driver, 10 ).until( EC.presence_of_element_located((By.CLASS_NAME  ,"keywords") ) ).text[:-1]
except:
    keywords = 'nothing'
url = driver.current_url

 

Get download link

link = WebDriverWait( driver, 10 ).until( EC.presence_of_all_elements_located((By.CLASS_NAME  ,"btn-dlcaj") ) )[0].get_attribute('href')
link = urljoin(driver.current_url, link)
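For reference, urljoin just resolves a relative href against the current page's URL; a tiny illustration with made-up values:

from urllib.parse import urljoin

# A root-relative link is resolved against the page's scheme and host.
print(urljoin("https://example.com/detail/abstract", "/download?id=42"))
# -> https://example.com/download?id=42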

 

Write to file

res = f"{count}\t{title}\t{authors}\t{institute}\t{date}\t{source}\t{database}\t{keywords}\t{abstract}\t{url}".replace("\n","")+"\n"
print(res)
with open('CNKI_res.tsv', 'a', encoding='utf-8') as f:   # utf-8 is safer than gbk, which can't encode some characters
    f.write(res)
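Optionally, you could write a header row once before the loop so the TSV has column names. A small sketch, not in the original script:

import os

# Write the header only if the file doesn't exist yet.
if not os.path.exists('CNKI_res.tsv'):
    with open('CNKI_res.tsv', 'w', encoding='utf-8') as f:
        f.write("count\ttitle\tauthors\tinstitute\tdate\tsource\tdatabase\tkeywords\tabstract\turl\n")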

 

If an item fails, skip it and move on to the next one. Either way, if multiple windows are open, close the extra one and switch back to the results page.

    except:
        print(f"Crawling item {count} failed\n")
        continue
    finally:
        n2 = driver.window_handles
        if len(n2) > 1:
            driver.close()
            driver.switch_to.window(n2[0])

 

Bump the counter and check whether we've collected enough.

count += 1
if count > papers_need:
    break   # the original '== papers_need' would stop one article early

 

Switch to the next page

WebDriverWait( driver, 10 ).until( EC.presence_of_element_located( (By.XPATH ,"//a[@id='PageNext']") ) ).click()
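One caveat: on the last results page there is no clickable next-page control, so the line above will eventually raise a timeout. A minimal guard to drop in inside the while loop, assuming the control keeps the id PageNext on every page:

from selenium.common.exceptions import TimeoutException

try:
    WebDriverWait(driver, 10).until(
        EC.element_to_be_clickable((By.XPATH, "//a[@id='PageNext']"))
    ).click()
except TimeoutException:
    break   # no next page left, so leave the while loop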

 

Close browser

driver.quit()   # quit() shuts down the whole browser; close() would only close the current window

 

All of the functionality has now been implemented.


 

4, Results

Brothers, remember to like, bookmark, and follow. Your support is what keeps me updating~
