Introduction: this chapter explains the technologies involved in crawling dynamic web pages. Dynamic pages are crawled mainly in two ways: the reverse-analysis method and the simulation method. This article introduces the reverse-analysis method; a later article will focus on the Selenium library used in the simulation method.
Dynamic web page
1, Overview of dynamic web pages
1.1 What is a dynamic web page
A dynamic web page integrates the basic HTML syntax specification with high-level programming languages such as Python, Java, and C#, together with database programming and other technologies, in order to manage website content and presentation efficiently, dynamically, and interactively. In this sense, any web page generated by combining web programming technology with a high-level language other than HTML and with database technology is a dynamic web page.
That is the definition you will find online. In plain terms: the data you are looking for may not appear in the page's HTML source code.
1.2 Common technologies of dynamic web pages
Dynamic web pages often use Ajax, dynamic HTML, and related technologies to exchange data between the front end and the back end. In a traditional Web application, when we submit a form to the server, the server processes the request and returns an entirely new page to the browser. This wastes network bandwidth and hurts the user experience, because most of the HTML of the new page is identical to that of the original page, and every interaction requires a round trip to the server and a refresh of the whole page. Ajax was born to solve this problem.
Ajax stands for Asynchronous JavaScript and XML. It combines asynchronous JavaScript loading, XML, and the DOM with presentation technologies such as XHTML and CSS. With Ajax there is no need to refresh the whole page; only part of the page is updated. Ajax retrieves only the data that is actually needed, through Web Service interfaces supporting SOAP, XML, or JSON, and JavaScript on the client processes the server's response. This reduces the amount of data exchanged between client and server and improves both loading speed and user experience. A typical use is checking user-name uniqueness when registering an email account; the data format returned by the server is usually JSON or XML rather than HTML.
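Because an Ajax endpoint returns structured data rather than HTML, a crawler can often call it directly and skip HTML parsing entirely. A minimal sketch, using httpbin.org/json purely as a stand-in for a real Ajax endpoint:

```python
import requests

# probe a JSON endpoint the way a browser's Ajax call would;
# httpbin.org/json is a public test endpoint used only for illustration
url = "https://httpbin.org/json"
r = requests.get(url, timeout=10)
print(r.headers.get("Content-Type"))  # typically application/json for Ajax endpoints
print(r.json())                       # the structured data, ready to parse
```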
1.3 How to tell whether a web page is dynamic
Static web pages:
Static web pages usually end with a suffix such as .html, .htm, .shtml, or .xml. Their content is fixed; each page is independent and does not change according to the needs of different visitors.
Dynamic web pages:
Dynamic web pages usually end with a suffix such as .asp, .php, or .jsp. Built on database technology, they can greatly reduce the workload of maintaining a website.
Judgment methods:
1. Right-click and view the page source. If the data you want appears in the source, the page is static; otherwise it is dynamic.
2. You can also fetch the page with requests.get() and inspect the returned r.text. If all the data is in the text, the page is static; otherwise it is dynamic, or a mixture of the two, with part of the data in the page source and part loaded separately. A quick check is sketched below.
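A minimal sketch of check 2, assuming you pick a string that is visible in the rendered page (the URL and probe string below are placeholders):

```python
import requests

# hypothetical target page and a string you can see in the rendered page
url = "https://example.com/some-page"
probe = "text you can see in the browser"

r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
r.encoding = r.apparent_encoding

if probe in r.text:
    print("found in source: this part of the page is static")
else:
    print("not in source: likely loaded dynamically, e.g. via Ajax")
```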
1.4 Crawling methods for dynamic web pages
Crawling methods for dynamic pages fall into two categories: the reverse-analysis method and the simulation method. The reverse-analysis method is the harder of the two: you intercept the requests the page sends and find the real request address, which requires familiarity with front-end technologies, especially JavaScript. The simulation method uses a third-party library such as Selenium to imitate the behavior of a real browser, which takes care of page loading and rendering.
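As a preview of the simulation method (covered in detail in a later article), a minimal Selenium sketch, assuming Selenium 4 and a local Chrome installation:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()  # needs a local Chrome; Selenium Manager fetches the driver
try:
    driver.get("https://example.com")  # the browser loads and renders the page
    html = driver.page_source          # page source after JavaScript has executed
    print(driver.find_element(By.TAG_NAME, "h1").text)
finally:
    driver.quit()
```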
2, Case
Case website: Chongqing Famous Medical Hall
The procedure is the same as before: press F12, open the developer tools, and watch the requests the page makes (in the original screenshot the relevant entry is circled in red; the screenshot is omitted here). This time the data request appears under the XHR tab. Exactly which request carries the data has to be analyzed case by case.
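Once the XHR request has been located, a quick probe confirms that the endpoint returns JSON containing the doctorCount and doctors fields that the parser below relies on (assuming the endpoint still responds as it did when this was written):

```python
import requests
import json

url = ("https://www.jkwin.com.cn/yst_web/doctor/getDoctorList"
       "?areaId=42&depId=&hasDuty=&pageNo=1&pageSize=10")
r = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
data = json.loads(r.text)
print(data["doctorCount"])  # total number of doctors
print(data["doctors"][0])   # first doctor record
```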
The complete crawler, from request to MySQL storage:

```python
import requests
import json
import pymysql


def get_html(url, headers, timeout=10):
    """General GET function; the User-Agent is passed in via headers."""
    try:
        r = requests.get(url, headers=headers, timeout=timeout)  # send the request
        r.encoding = r.apparent_encoding  # set the character encoding of the response
        r.raise_for_status()  # raise an exception if the status code is not 200
        return r.text  # return the text content of the page
    except Exception as error:
        print(error)


out_list = []  # parsed rows, collected across pages


def parser(json_txt):
    txt = json.loads(json_txt)
    global row_count
    row_count = txt["doctorCount"]  # total number of rows
    for row in txt["doctors"]:  # list of doctor records
        staff_name = row.get("STAFF_NAME")  # doctor's name
        if staff_name is None:
            staff_name = ""
        staff_type = row.get("STAFF_TYPE")  # professional title
        if staff_type is None:
            staff_type = ""
        remark = row.get("STAFF_REMARK")  # brief introduction
        if remark is None:
            remark = ""
        # simple cleaning: strip the HTML tags from the introduction
        remark = remark.replace("<p>", "").replace("</p>", "")
        remark = remark.strip()  # remove surrounding whitespace
        org_name = row.get("ORG_NAME")  # affiliated hospital
        org_name = org_name if org_name is not None else ""
        org_grade_name = row.get("ORG_GRADE_NAME")  # hospital grade
        org_grade_name = org_grade_name if org_grade_name is not None else ""
        good_at = row.get("GOOT_AT")  # areas of expertise (key name as the API returns it)
        good_at = good_at if good_at is not None else ""
        row_list = (staff_name, staff_type, remark,
                    org_name, org_grade_name, good_at)
        out_list.append(row_list)


def save_mysql(sql, val, **dbinfo):
    """General function for saving data to MySQL."""
    try:
        connect = pymysql.connect(**dbinfo)  # create the database connection
        cursor = connect.cursor()  # get a cursor object
        cursor.executemany(sql, val)  # execute the SQL once per row
        connect.commit()  # commit the transaction
    except Exception as err:
        connect.rollback()  # roll back the transaction on failure
        print(err)
    finally:
        cursor.close()
        connect.close()


if __name__ == '__main__':
    # set a User-Agent to get past simple anti-crawler checks
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 6.3; Win64; x64) "
                      "AppleWebKit/537.36 (KHTML, like Gecko) "
                      "Chrome/89.0.4389.82 Safari/537.36"
    }
    # crawl the first page once to load the total row count into row_count
    page_size = 10  # data rows per page
    url = "https://www.jkwin.com.cn/yst_web/doctor/getDoctorList?areaId=42&depId=&hasDuty=&pageNo=1&pageSize=10"
    json_txt = get_html(url, head)  # send the request
    print(json_txt)  # check that data came back
    parser(json_txt)  # parse the first page
    print(out_list)  # check the parsed rows
    page = row_count // page_size  # total number of pages
    # crawling every page is slow; replace 6 with page + 1 to crawl them all
    for i in range(2, 6):
        url = ("https://www.jkwin.com.cn/yst_web/doctor/getDoctorList"
               "?areaId=42&depId=&hasDuty=&pageNo={0}&pageSize={1}").format(i, page_size)
        json_txt = get_html(url, head)  # send the request
        parser(json_txt)  # parse this page
    # after parsing, save all rows to the MySQL database in one batch
    parms = {  # database connection parameters
        "host": "--",
        "user": "root",
        "password": "123456",
        "db": "---",
        "charset": "utf8",
    }
    sql = ("INSERT INTO doctorinfo(staff_name,staff_type,remark,"
           "org_name,org_grade_name,good_at) "
           "VALUES(%s,%s,%s,%s,%s,%s)")  # SQL with placeholders
    save_mysql(sql, out_list, **parms)  # call the function; note the ** cannot be omitted
```
Result: (screenshot of the crawled data omitted)

SQL statement for creating the table:
```sql
CREATE TABLE `doctorinfo` (
  `staff_name`     varchar(255)   DEFAULT NULL,
  `staff_type`     varchar(255)   DEFAULT NULL,
  `remark`         varchar(10000) DEFAULT NULL,
  `org_name`       varchar(255)   DEFAULT NULL,
  `org_grade_name` varchar(255)   DEFAULT NULL,
  `good_at`        varchar(255)   DEFAULT NULL
)
```
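If the table does not exist yet, the statement above can be executed once through pymysql, reusing the placeholder connection parameters from the script (a minimal sketch):

```python
import pymysql

# one-time table creation; "--" and "---" are the same placeholder
# host and database names used in the main script
ddl = """CREATE TABLE IF NOT EXISTS `doctorinfo` (
  `staff_name`     varchar(255)   DEFAULT NULL,
  `staff_type`     varchar(255)   DEFAULT NULL,
  `remark`         varchar(10000) DEFAULT NULL,
  `org_name`       varchar(255)   DEFAULT NULL,
  `org_grade_name` varchar(255)   DEFAULT NULL,
  `good_at`        varchar(255)   DEFAULT NULL
)"""

conn = pymysql.connect(host="--", user="root", password="123456",
                       db="---", charset="utf8")
try:
    with conn.cursor() as cursor:
        cursor.execute(ddl)  # DDL statements are auto-committed by MySQL
finally:
    conn.close()
```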
Bye~