Scraping the School's Official Website for Grades

Crawler in Practice (1) -- Scraping the ZhengFang Academic System

Preface

Some time ago I took part in the Software Cup competition, which was my first real contact with Python, so I decided to study it systematically over the summer vacation. Learning through practice leaves a deeper impression, so I chose to learn crawling alongside the language, and the school's official website became my target. (Every year people complain how painful it is -- the school runs a potato of a server.) Below is a brief description of the implementation.

Use tools and third-party libraries

Language: Python 3.6
Development tool: PyCharm
Third-party libraries:

  1. requests (a powerful HTTP library for crawling)
  2. Beautiful Soup (for parsing the crawled pages)
  3. PIL (image processing; used in this project to display the verification-code image)
  4. Installation
    pip install ***
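
For this project specifically, that amounts to something like the line below (Pillow is the package that provides PIL on Python 3, and lxml is needed because the code later uses it as the Beautiful Soup parser):

pip install requests beautifulsoup4 pillow lxml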

Login page simulation

First of all, open the login page (if your school uses the ZhengFang system, the interface will look familiar).

Getting the POST Parameters for Login

To see which page is requested and with what parameters, press F12 to open the browser's developer tools.

Select the Network tab and attempt a login on the page; you will find that

the login request goes to the page default2.aspx.

Click on it to view its Form Data, i.e. the parameters sent with the request, and you will find the following:

__VIEWSTATE: (a long, page-generated value)
TextBox1: (at our school, the student number)
TextBox2: (at our school, the password)
TextBox3: 8cx7 (the verification code entered on the page)
RadioButtonList1: %D1%A7%C9%FA (testing shows this is a fixed value; it is the GB2312 URL-encoding of "学生", i.e. "student")
Button1: (the value is empty, but the field must still be included when the request is sent)

We found that the POST request to default2.aspx sends the parameters above, and only the value of __VIEWSTATE is uncertain. Analyzing further, we find the following hidden input element on the login page:

<input type="hidden" name="__VIEWSTATE" value="dDw3OTkxMjIwNTU7Oz5vJ/yYUi9dD4fEnRUKesDFl8hEKA==" />

So the preliminary conclusion is that the value of __VIEWSTATE just needs to be read off the login page itself. The code so far:

import requests
from bs4 import BeautifulSoup

# Address of the school's official website
url = ''
# Create a session so cookies persist across requests
s = requests.session()
# Fetch the login page
login_page = s.get(url+'default2.aspx')
# Parse the page and extract the hidden __VIEWSTATE value
soup = BeautifulSoup(login_page.text, 'lxml')
__VIEWSTATE = soup.find('input', attrs={'name': '__VIEWSTATE'}).get('value')

The code above fetches the HTML of the login page and then extracts the hidden input's value with bs4 -- that is the __VIEWSTATE parameter.

Verification Code Processing

A quick Baidu search shows that it used to be possible to bypass the ZhengFang system's verification code, but that hole now appears to be patched. So the first thing to try is handling the code rather than bypassing it. Two ideas:

First: get the page cookies, download the verification-code image locally, display it to the user, and have them type it in.
Second: machine learning. The background of this verification code is relatively clean, so training a model should not be too hard; CTPN is recommended. (A lower-effort OCR variation is sketched right after this list.)
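
For the second idea, short of training a model, an off-the-shelf OCR pass can serve as a quick baseline. To be clear, this is not the CTPN approach -- just a minimal sketch using pytesseract, which assumes the Tesseract binary is installed locally; accuracy on captcha fonts varies a lot.

import pytesseract
from PIL import Image

# OCR the saved verification-code image (ver_pic.png, as downloaded by the
# code below). Captcha fonts often defeat plain OCR, so treat this as a guess.
guess = pytesseract.image_to_string(Image.open('ver_pic.png')).strip()
print('OCR guess:', guess)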

This time we use the first method; the code is as follows:

import requests
from bs4 import BeautifulSoup
from PIL import Image

# Address of the school's official website
url = ''
# Create a session so cookies persist across requests
s = requests.session()
# Fetch the login page
login_page = s.get(url+'default2.aspx')
# Grab the cookies issued with the login page
cookies = login_page.cookies
# Download the verification-code image using the same cookies
pic = s.get(url+'CheckCode.aspx', cookies=cookies).content
with open('ver_pic.png', 'wb') as f:
    f.write(pic)
# Open the saved image and show it to the user
image = Image.open("ver_pic.png")
image.show()    # display
checkcode = input("Please enter the verification code:")

One thing to note here: if the verification code keeps being rejected, the most likely cause is mishandled cookies -- check that first.
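
To confirm the cookies are flowing correctly, a minimal sanity check is to inspect the session's cookie jar after fetching the login page. (This assumes the server issues the standard ASP.NET_SessionId cookie; the exact name can vary.)

# Because we reuse one requests.Session, the cookie set by the login page is
# sent automatically on later requests, even without passing cookies= explicitly.
print(s.cookies.get_dict())    # expect something like {'ASP.NET_SessionId': '...'}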

That completes the login flow. Here is the complete login code:

import requests
from bs4 import BeautifulSoup
from PIL import Image

# Address of the school's official website
url = ''
# Create a session so cookies persist across requests
s = requests.session()
# Fetch the login page
login_page = s.get(url+'default2.aspx')
# Grab the cookies issued with the login page
cookies = login_page.cookies
soup = BeautifulSoup(login_page.text, 'lxml')
__VIEWSTATE = soup.find('input', attrs={'name': '__VIEWSTATE'}).get('value')
# Download the verification-code image using the same cookies
pic = s.get(url+'CheckCode.aspx', cookies=cookies).content
with open('ver_pic.png', 'wb') as f:
    f.write(pic)
# Open the saved image and show it to the user
image = Image.open("ver_pic.png")
image.show()    # display
checkcode = input("Please enter the verification code:")
your_id = input("Please enter your student number:")
your_password = input("Please enter your password:")
# Build the form data and send the login request
data = {'__VIEWSTATE': __VIEWSTATE,
        'TextBox1': your_id,
        'TextBox2': your_password,
        'TextBox3': checkcode,
        'RadioButtonList1': r'%D1%A7%C9%FA',
        'Button1': ''}
r = s.post(url+'default2.aspx', data=data, cookies=cookies)
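
Before moving on, it is worth checking that the login actually succeeded. A rough heuristic, on the assumption (consistent with the Referer header used below) that a successful login redirects to the student's main page xs_main.aspx:

# requests follows the post-login redirect, so on success the final URL should
# be the student's main page. The failure message varies by school, so this is
# only a rough check.
if 'xs_main.aspx' in r.url:
    print('Login succeeded')
else:
    print('Login may have failed -- re-check the verification code and password')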

Scraping the Grades

The idea in this part is the same as for login, and with no verification code it is relatively simple, so I will just paste the code, with a few points to note:

First: if the scraped content says "Object moved to here", you need to add a Referer request header to tell the server which page you came from.
Second: the value of __VIEWSTATE here can be hard-coded; in my tests a fixed value was accepted.

The code is as follows:

# Set the request headers; Referer tells the server which page we came from
mark_head = {
    'Referer': url+'xs_main.aspx?xh='+your_id
}
# Form data for the grade query. ddlXN/ddlXQ are the academic-year and
# semester dropdowns (left empty here); Button2 is the GB2312 URL-encoding
# of the grade-query button's caption (see the snippet after this block)
find_data = {
    '__VIEWSTATE': '',
    'ddlXN': '',
    'ddlXQ': '',
    'Button2': '%D4%DA%D0%A3%D1%A7%CF%B0%B3%C9%BC%A8%B2%E9%D1%AF'
}

# your_name is the student's name (the xm URL parameter); a Chinese name must
# be GB2312 URL-encoded, e.g. urllib.parse.quote(name, encoding='gb2312')
my_cj = s.post(url+'xscj_gc.aspx?xh='+your_id+'&xm='+your_name+'&gnmkdm=N121605',
               find_data, headers=mark_head)
# Formatted output: pull out the grade table (Datagrid1) and print it row by row
soup3 = BeautifulSoup(my_cj.text, 'lxml')
marktable = soup3.find(id="Datagrid1")
for tr in marktable.find_all('tr'):
    print(tr.get_text('\t'))
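
As mentioned in the comments above, those percent-encoded values are simply the GB2312 URL-encoding of the Chinese text on the page. A quick way to produce or decode them:

from urllib.parse import quote, unquote

# RadioButtonList1's fixed value is the GB2312 URL-encoding of '学生' ("student")
print(quote('学生', encoding='gb2312'))      # -> %D1%A7%C9%FA
# Button2 decodes back to the caption of the grade-query button
print(unquote('%D4%DA%D0%A3%D1%A7%CF%B0%B3%C9%BC%A8%B2%E9%D1%AF', encoding='gb2312'))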

That is all the code. If anything is unclear, feel free to ask me. Relatively speaking, this site is quite easy to scrape.

Finally, happy coding, everyone!~
