Dynamic web pages - reverse analysis and a case study

Introduction: this chapter covers techniques for crawling dynamic web pages. There are two main approaches: the reverse analysis method and the simulation method. Today we introduce the reverse analysis method. A later part will focus on using the Selenium library for the simulation method.




Dynamic web page

1, Overview of dynamic web pages

1.1 What is a dynamic web page


A dynamic web page combines basic HTML with a high-level programming language such as Python, Java, or C#, together with database programming and related technologies, in order to manage a website's content and style efficiently, dynamically, and interactively. In this sense, any web page generated by a high-level programming language and database technology, rather than written as plain HTML, is a dynamic web page.

That is the definition you will find on the Internet. Put simply, from a crawler's point of view: the data you are looking for may not appear in the web page source code.


1.2 Common technologies of dynamic web pages


Dynamic web pages often use Ajax, dynamic HTML, and related technologies to exchange data between the front end and the back end. In a traditional web application, when we submit a form, the server receives the request and returns a whole new page to the browser. This wastes network bandwidth and hurts the user experience, because most of the HTML in the original page and in the new page is identical, yet every user interaction sends a request to the server and refreshes the entire page. This problem gave rise to Ajax.

Ajax stands for Asynchronous JavaScript and XML. It combines asynchronous JavaScript loading, XML and the DOM, and the presentation technologies XHTML and CSS. With Ajax, the browser does not need to refresh the whole page; it only updates the part of the page that changed. Ajax retrieves only the data it needs, typically through a web service interface based on SOAP, XML, or JSON, and JavaScript on the client processes the response from the server. This reduces the amount of data exchanged between the client and the server and improves both access speed and user experience. For example, Ajax is widely used to check whether a user name is already taken when registering a mailbox. The data returned by the server is usually JSON or XML rather than HTML.
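To make the last point concrete, here is a minimal sketch of what an Ajax response looks like from the crawler's side: a small JSON body that is decoded into Python data instead of being parsed as HTML. The payload and its field names (`available`, `message`) are made up for illustration; a real endpoint defines its own.

```python
import json

# Hypothetical JSON body, of the kind a user-name check endpoint might return.
response_body = '{"available": false, "message": "user name already taken"}'


def parse_check_response(body):
    """Decode a JSON response body into an (available, message) tuple."""
    data = json.loads(body)
    return bool(data.get("available")), data.get("message", "")


available, message = parse_check_response(response_body)
print(available, message)
```

The same idea carries over to crawling: once the JSON endpoint is known, `json.loads` gives direct access to the data, with no HTML parsing needed.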


1.3 How to identify dynamic web pages

Static web pages:

Static web pages usually have a suffix such as .htm, .html, .shtml, or .xml. The content of a static web page is fixed. Each page is independent and does not change according to the needs of different visitors.

Dynamic web pages:

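Dynamic web pages often have a suffix such as .asp, .php, or .jsp. Because they are built on database technology, dynamic web pages can greatly reduce the workload of website maintenance. The suffix is only a hint, though; the checks below are more reliable for crawling purposes.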

Judgment method:
1. Right-click and view the web page source code. If the data you want is in the source, the page is static; otherwise it is dynamic.
2. You can also fetch the page with requests.get() and inspect r.text. If all the data is in the text, the page is static. Otherwise it is either a dynamic page or a mix of static and dynamic content, where part of the data is in the page source and part is not.
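The second check can be sketched as a small helper. This is only an illustration under assumptions: the function name `classify_page` and the sample HTML are made up, and in practice `html_text` would be the `r.text` returned by `requests.get()`, with `expected_values` being a few pieces of data you can see in the browser.

```python
def classify_page(html_text, expected_values):
    """Classify a page by how many expected values appear in its raw HTML.

    Returns "static" if every value is present, "dynamic" if none is,
    and "mixed" if only some of them are.
    """
    found = [value for value in expected_values if value in html_text]
    if len(found) == len(expected_values):
        return "static"
    if not found:
        return "dynamic"
    return "mixed"


# Made-up HTML standing in for r.text: the heading is in the source,
# but the list of doctors would be filled in later by JavaScript.
sample_html = "<html><body><h1>Doctors</h1><div id='list'></div></body></html>"
print(classify_page(sample_html, ["Doctors", "Dr. Example"]))
```

A "mixed" or "dynamic" result is the signal to open the browser's developer tools and look for the request that actually delivers the data.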

1.4 Crawling methods for dynamic web pages

Dynamic web pages are generally crawled with either the reverse analysis method or the simulation method. The reverse analysis method is the harder of the two: you intercept the requests the website sends and find the real request address, which requires familiarity with front-end technologies, especially JavaScript. The simulation method uses a third-party library such as Selenium to simulate browser behavior, so that page loading and rendering are handled for you.

2, Case

Case website: Chongqing famous medical Hall

The procedure is the same as before: press F12 to open the developer tools and inspect the requests the page sends. This time, the request type to look for is XHR. Each site needs its own analysis.
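Once the XHR request has been located, it helps to take its URL apart and see which query parameters control paging. The sketch below uses the getDoctorList URL from this case and the standard library's `urllib.parse`; the helper name `page_url` is an assumption made for illustration.

```python
from urllib.parse import urlsplit, parse_qs, urlencode, urlunsplit

xhr_url = ("https://www.jkwin.com.cn/yst_web/doctor/getDoctorList"
           "?areaId=42&depId=&hasDuty=&pageNo=1&pageSize=10")


def page_url(url, page_no):
    """Return a copy of url with its pageNo query parameter replaced."""
    parts = urlsplit(url)
    # keep_blank_values=True preserves empty parameters such as depId=
    params = parse_qs(parts.query, keep_blank_values=True)
    params["pageNo"] = [str(page_no)]
    query = urlencode(params, doseq=True)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, query, parts.fragment))


print(parse_qs(urlsplit(xhr_url).query, keep_blank_values=True))
print(page_url(xhr_url, 2))
```

Here `pageNo` and `pageSize` are the parameters that drive pagination, which is why the crawler only has to change `pageNo` to walk through the result pages.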

import requests
import json
import pymysql


def get_html(url, headers, timeout=10):  # General GET helper; headers (including User-Agent) are passed in to keep the code simple
    try:
        r = requests.get(url, headers=headers, timeout=timeout)  # Send the request
        r.raise_for_status()  # Raise an exception for 4xx/5xx status codes
        r.encoding = r.apparent_encoding  # Use the encoding detected from the content
        return r.text  # Return the response body as text
    except Exception as error:
        print(error)
        return None  # The request failed; callers must handle None
out_list = []  # Parsed rows collected across all pages
row_count = 0  # Total number of rows reported by the server


def parser(json_txt):
    global row_count
    if not json_txt:  # The request failed, so there is nothing to parse
        return
    txt = json.loads(json_txt)
    row_count = txt["doctorCount"]  # Total number of rows
    for row in txt["doctors"]:  # List of doctor records
        # Missing (None) fields are replaced with empty strings
        staff_name = row.get("STAFF_NAME") or ""  # Doctor's name
        staff_type = row.get("STAFF_TYPE") or ""  # Title
        remark = row.get("STAFF_REMARK") or ""  # Brief introduction
        # Simple cleaning: remove the <p> tags from the introduction
        remark = remark.replace("<p>", "").replace("</p>", "")
        # Remove leading and trailing whitespace
        remark = remark.strip()
        org_name = row.get("ORG_NAME") or ""  # Affiliated hospital
        org_grade_name = row.get("ORG_GRADE_NAME") or ""  # Hospital grade
        good_at = row.get("GOOT_AT") or ""  # Areas of expertise (key spelled as in the original code)
        row_list = (
            staff_name,
            staff_type,
            remark,
            org_name,
            org_grade_name,
            good_at
        )
        out_list.append(row_list)
def save_mysql(sql, val, **dbinfo):  # General helper for batch inserts into MySQL
    connect = None
    cursor = None
    try:
        connect = pymysql.connect(**dbinfo)  # Open the database connection
        cursor = connect.cursor()  # Get a cursor object
        cursor.executemany(sql, val)  # Run the statement once per row
        connect.commit()  # Commit the transaction
    except Exception as err:
        if connect is not None:
            connect.rollback()  # Roll back only if a connection was opened
        print(err)
    finally:
        if cursor is not None:
            cursor.close()
        if connect is not None:
            connect.close()
if __name__ == '__main__':
    # Adjacent string literals are joined, so no stray whitespace ends up in the header value
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/89.0.4389.82 Safari/537.36"
    }  # Set a User-Agent to get past simple anti-crawler checks

    # Fetch page 1 first so that parser() stores the total row count in the global row_count
    page_size = 10  # Rows per page
    url = 'https://www.jkwin.com.cn/yst_web/doctor/getDoctorList?areaId=42&depId=&hasDuty=&pageNo=1&pageSize=10'
    json_txt = get_html(url, head)  # Send the request
    print(json_txt)  # Check that data came back
    parser(json_txt)  # Parse the response
    print(out_list)  # Check the parsed rows
    page = (row_count + page_size - 1) // page_size  # Total number of pages, rounded up
    # Crawling every page is slow, so only pages 2 to 5 are fetched here.
    # To crawl all pages, change the range to range(2, page + 1).
    for i in range(2, 6):
        url = "https://www.jkwin.com.cn/yst_web/doctor/getDoctorList?areaId=42&depId=&hasDuty=&pageNo={0}&pageSize={1}".format(i, page_size)
        json_txt = get_html(url, head)  # Send the request
        parser(json_txt)  # Parse the response
    # After parsing, save all rows to the MySQL database in one batch
    parms = {                      # Database connection parameters
              "host": "--",
              "user": "root",
              "password": "123456",
              "db": "---",
              "charset": "utf8",
            }
    sql = ("INSERT INTO doctorinfo(staff_name,staff_type,remark,"
           "org_name,org_grade_name,good_at) "
           "VALUES(%s,%s,%s,%s,%s,%s)")  # SQL with placeholders
    save_mysql(sql, out_list, **parms)  # Call the function. Note that ** cannot be omitted
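One detail worth checking when paging through results: if the total row count is not a multiple of the page size, floor division drops the final, partially filled page. A minimal sketch of a rounded-up page count follows; the function name `total_pages` and the numbers are illustrative and not taken from the site.

```python
def total_pages(row_count, page_size):
    """Number of pages needed to hold row_count rows, rounding up."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return (row_count + page_size - 1) // page_size


# Illustrative numbers only.
print(total_pages(95, 10))   # nine full pages plus one partial page
print(total_pages(100, 10))  # exactly ten full pages
```

With this value, a loop over `range(2, total_pages(row_count, page_size) + 1)` would cover every page after the first.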

SQL statement used to create the doctorinfo table:

CREATE TABLE `doctorinfo` (
  `staff_name` varchar(255) DEFAULT NULL,
  `staff_type` varchar(255) DEFAULT NULL,
  `remark` varchar(10000) DEFAULT NULL,
  `org_name` varchar(255) DEFAULT NULL,
  `org_grade_name` varchar(255) DEFAULT NULL,
  `good_at` varchar(255) DEFAULT NULL
);
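To try the batch-insert pattern without a MySQL server, the same idea can be sketched with `sqlite3` from the standard library. This is a stand-in, not the setup used above: SQLite uses `?` placeholders where pymysql uses `%s`, the column types are simplified to TEXT, and the rows are placeholder data.

```python
import sqlite3

# Placeholder rows with the same six columns as the doctorinfo table.
rows = [
    ("Name A", "Title A", "Intro A", "Hospital A", "Grade A", "Field A"),
    ("Name B", "Title B", "Intro B", "Hospital B", "Grade B", "Field B"),
]


def save_rows(conn, rows):
    """Insert all rows in one transaction; roll back if any insert fails."""
    sql = ("INSERT INTO doctorinfo(staff_name, staff_type, remark, "
           "org_name, org_grade_name, good_at) VALUES (?, ?, ?, ?, ?, ?)")
    try:
        conn.executemany(sql, rows)
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


conn = sqlite3.connect(":memory:")  # In-memory database, discarded on close
conn.execute(
    "CREATE TABLE doctorinfo (staff_name TEXT, staff_type TEXT, remark TEXT, "
    "org_name TEXT, org_grade_name TEXT, good_at TEXT)"
)
save_rows(conn, rows)
saved_count = conn.execute("SELECT COUNT(*) FROM doctorinfo").fetchone()[0]
print(saved_count)
conn.close()
```

Passing the values separately from the SQL text, rather than formatting them into the string, lets the database driver handle quoting, which matters here because the scraped introductions can contain arbitrary characters.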

Bye~

Keywords: Python Database MySQL html

Added by JonathanS on Fri, 11 Feb 2022 09:32:07 +0200