Python Web Scraping 5: Handling JavaScript and Redirects
So far, the only way we have communicated with a web server is to send an HTTP request and get a page back. Some pages, however, exchange information with the server without requesting a whole new page; such pages are probably using Ajax to load their data. With the scraping approach used so far, we may capture only the content that exists before that data loads and miss the important part.
Like Ajax, dynamic HTML (DHTML) is a set of technologies used together to solve a problem on the web. DHTML uses a client-side language such as JavaScript to control a page's HTML elements. When scraping a site, we often find that the content shown in the browser differs from the content we scraped, or that a loading page takes us to another page while the URL stays the same throughout.
All of this is JavaScript at work. A browser executes the page's JavaScript, but a scraper that only downloads the HTML never runs it, so what you see in the browser differs from what you scrape.
Ajax and DHTML make life hard for scrapers, but Selenium makes it easy to handle the JavaScript in a page.
For example, the page below loads its content with Ajax: the content changes after about 2 seconds, while the URL in the address bar stays the same.
    import requests
    from bs4 import BeautifulSoup

    url = 'http://pythonscraping.com/pages/javascript/ajaxDemo.html'
    r = requests.get(url)
    soup = BeautifulSoup(r.text, 'lxml')
    content = soup.find('div', id='content')
    print(content.string)
This is some content that will appear on the page while it's loading. You don't care about scraping this.
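The same effect can be reproduced offline. The sketch below is only an illustration: the PAGE string and the ContentDivText class are made up for this example (they are not the real ajaxDemo.html), and it uses the standard library's html.parser instead of BeautifulSoup. It shows that a static parser sees the placeholder text, because the script that would replace it is never executed.

```python
from html.parser import HTMLParser

# Hypothetical page: a placeholder div plus a script that would replace it later.
PAGE = """
<html><body>
<div id="content">Placeholder text shown while loading.</div>
<script>
  setTimeout(function () {
    document.getElementById('content').innerHTML = 'Important text loaded by Ajax.';
  }, 2000);
</script>
</body></html>
"""


class ContentDivText(HTMLParser):
    """Collect the text inside <div id="content"> as a static parser sees it."""

    def __init__(self):
        super().__init__()
        self.inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "div" and dict(attrs).get("id") == "content":
            self.inside = True

    def handle_endtag(self, tag):
        if tag == "div":
            self.inside = False

    def handle_data(self, data):
        # Script source also arrives here, but only text inside the div is kept.
        if self.inside:
            self.chunks.append(data)


parser = ContentDivText()
parser.feed(PAGE)
static_text = "".join(parser.chunks).strip()
print(static_text)  # Placeholder text shown while loading.
```

The replacement text exists only as a string inside the script; nothing in a plain HTTP fetch and parse ever runs that script.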
Handling JavaScript with Selenium
In fact, if you open this page in a browser, this is not what you end up seeing. The text above is displayed at first, but new content replaces it almost immediately; wait a few seconds and you will see it. The example above uses requests, which returns the response right away, so it can only retrieve the pre-load content. Waiting does not help either, because requests never runs the JavaScript that loads the new content. Time for Selenium!
    import time
    from selenium import webdriver

    driver = webdriver.PhantomJS(executable_path=r'C:\Program Files (x86)\phantomjs\bin\phantomjs.exe')
    driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
    # Waiting for loading to complete
    time.sleep(5)
    content = driver.find_element_by_id('content').text
    print(content)
    driver.quit()
Here is some important text you want to retrieve! A button to click!
PhantomJS is a headless browser (one without a graphical interface) that is convenient to use together with Selenium. It must be downloaded separately.
A WebElement has a text attribute that returns the text inside the tag. The printed output above confirms that the new content has been loaded. When creating the driver, you need to specify the path to the phantomjs executable. Also, because the browser has no window to remind you that it is still running, remember to call close or quit when you are done.
The code above always waits five seconds before looking up the element, but there is no telling when the page will actually finish loading. A better approach is to keep checking until a particular piece of content has loaded.
    from selenium import webdriver
    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    driver = webdriver.PhantomJS(executable_path=r'C:\Program Files (x86)\phantomjs\bin\phantomjs.exe')
    driver.get('http://pythonscraping.com/pages/javascript/ajaxDemo.html')
    try:
        element = WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.ID, 'loadedButton')))
        print(element)
    finally:
        print(driver.find_element_by_id('content').text)
        driver.close()
<selenium.webdriver.remote.webelement.WebElement (session="a300db00-6afa-11e7-9f0f-2189a7b4630b", element=":wdc:1500301053057")>
Here is some important text you want to retrieve! A button to click!
WebDriverWait and expected_conditions together form what Selenium calls an explicit wait: the code waits until some state in the DOM is reached before it continues. No fixed waiting time is given, only a maximum (10 seconds in the example above). This contrasts with the earlier example, which always pauses for a fixed time with sleep(5). (Selenium's implicit wait is a different feature, set with driver.implicitly_wait().) The expected condition says what to wait for; in the example above it is the presence in the DOM of the element whose id is loadedButton. By specifies the locator strategy used to find the element; the available strategies are:
    ID = "id"
    XPATH = "xpath"
    LINK_TEXT = "link text"
    PARTIAL_LINK_TEXT = "partial link text"
    NAME = "name"
    TAG_NAME = "tag name"
    CLASS_NAME = "class name"
    CSS_SELECTOR = "css selector"
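In fact, the following two statements are equivalent. Note that the find_element_by_* helpers were deprecated and later removed in Selenium 4, so the first form is the one to prefer with newer releases.
    driver.find_element(By.ID, 'loadedButton')
    driver.find_element_by_id('loadedButton')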
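To make the waiting behaviour concrete, here is a minimal sketch of a polling wait in plain Python. It is not Selenium's implementation: wait_until, WaitTimeout and button_present are hypothetical names invented for this illustration, and the "page" is simulated by a counter. It only shows the idea of checking a condition repeatedly until it succeeds or a maximum time is exceeded.

```python
import time


class WaitTimeout(Exception):
    """Hypothetical stand-in for the timeout error a real wait would raise."""


def wait_until(condition, timeout=10.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises WaitTimeout once `timeout` seconds have passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout}s")
        time.sleep(poll_interval)


# Simulated page: the "button" only shows up on the third check.
calls = {"n": 0}


def button_present():
    calls["n"] += 1
    return "loadedButton" if calls["n"] >= 3 else None


found = wait_until(button_present, timeout=2.0, poll_interval=0.05)
print(found, calls["n"])  # loadedButton 3
```

Unlike a fixed sleep(5), the loop returns as soon as the condition holds, and the timeout only bounds the worst case.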
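XPath syntax
You can also use XPath expressions for lookups. Here are some common ones.
- /div Selects the root node if it is a div element
- //a Selects all a nodes in the document, at any depth
- //@href Selects every href attribute in the document (//*[@href] selects the elements that have one)
- //a[@href='https://www.google.com'] Selects all a tags whose href is the Google URL
- //a[3] Selects every a tag that is the third a child of its parent; (//a)[3] selects the third a tag in the document
- //table[last()] Selects every table that is the last table child of its parent; (//table)[last()] selects the last table in the document
- //a[position() < 3] Selects a tags that are the first or second a child of their parent; (//a)[position() < 3] selects the first two a tags in the document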
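Some of these expressions can be tried without a browser. The sketch below uses the standard library's xml.etree.ElementTree, which implements only a subset of XPath, so it covers the attribute and position predicates but not position() < 3. Paths start with . because they are evaluated relative to the root element, and the DOC string is a made-up document for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document used only for this illustration.
DOC = """
<html>
  <body>
    <div id="nav">
      <a href="https://www.google.com">Google</a>
      <a href="https://example.com/a">A</a>
      <a href="https://example.com/b">B</a>
    </div>
    <table id="first"/>
    <table id="last"/>
  </body>
</html>
"""

root = ET.fromstring(DOC)

all_links = root.findall(".//a")              # every a element, at any depth
with_href = root.findall(".//*[@href]")       # every element that has an href attribute
google = root.findall(".//a[@href='https://www.google.com']")
third = root.findall(".//a[3]")               # a elements that are the 3rd a child of their parent
last_table = root.findall(".//table[last()]") # tables that are the last table child of their parent

print(len(all_links), len(with_href))   # 3 3
print(google[0].text)                   # Google
print([a.text for a in third])          # ['B']
print(last_table[0].get("id"))          # last
```

With Selenium, the full expressions can be passed as the value of an XPATH locator instead.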
Handling redirection
Redirects are divided into client-side redirects (Redirect) and server-side ones (Dispatch, usually called forwarding). A forward is handled inside the server within a single request, so Python's requests deals with it easily. A client-side redirect involves two requests and the URL generally changes; when JavaScript in the page triggers it, requests never sees it, and it is time to use Selenium. The following example detects whether the page has been redirected. The approach is to grab an element from the DOM when the page first loads, then look it up again repeatedly until it is no longer the same element. At that point the code raises a StaleElementReferenceException: the original element is gone from the page's DOM, so the page has jumped.
    import time

    from selenium import webdriver
    # "Stale" means the element is no longer attached to the DOM of the page
    from selenium.common.exceptions import StaleElementReferenceException


    def wait_for_load(a_driver):
        element = a_driver.find_element_by_tag_name('html')
        print('content', element)
        count = 0
        while True:
            count += 1
            # Give up after 20 polls of 0.5 s each (about 10 seconds)
            if count > 20:
                print('Timing out after 10s and returning')
                return
            time.sleep(0.5)
            # Look up the html tag again. If it is not the same element as before,
            # the original html tag is no longer in the DOM, so raise an exception.
            new = a_driver.find_element_by_tag_name('html')
            print('new', new)
            if element != new:
                raise StaleElementReferenceException('Just redirected!')


    driver = webdriver.PhantomJS(r'C:\Program Files (x86)\phantomjs\bin\phantomjs.exe')
    driver.get('https://pythonscraping.com/pages/javascript/redirectDemo1.html')
    try:
        wait_for_load(driver)
    except StaleElementReferenceException as e:
        print(e.msg)
    finally:
        print(driver.page_source)
        # Shut the headless browser down once the result has been printed
        driver.quit()
content <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305464563")>
new <selenium.webdriver.remote.webelement.WebElement (session="e9c5e030-6b04-11e7-9cea-c913b202710e", element=":wdc:1500305468142")>
//Just redirected!
<html><head>
<title>The Destination Page!</title>
</head>
<body>
This is the page you are looking for!
</body></html>
We print the WebElement for the html element as soon as the page is entered. Its identifier, wdc:1500305464563, works like an id.
Inside the loop, the code keeps checking whether the html element is still the original WebElement; if it is not, a redirect has occurred. After the redirect the identifier changes to wdc:1500305468142: the browser has left redirectDemo1.html for the destination page.
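The detection logic can be exercised without a browser. In the sketch below, FakeDriver, find_root and wait_for_redirect are hypothetical names made up for this illustration (none of them is part of Selenium). The fake driver hands back a different root object after a few lookups, much as the html element changes after the redirect above, and the function reports whether a change was seen before the timeout.

```python
import time


class FakeDriver:
    """Hypothetical stand-in for a WebDriver.

    find_root() returns the same object until it has been called
    `redirect_after_calls` times, then a different one, imitating a page
    whose html element is replaced by a redirect.
    """

    def __init__(self, redirect_after_calls):
        self._calls = 0
        self._redirect_after = redirect_after_calls
        self._old_root = object()
        self._new_root = object()

    def find_root(self):
        self._calls += 1
        if self._calls <= self._redirect_after:
            return self._old_root
        return self._new_root


def wait_for_redirect(driver, timeout=10.0, poll_interval=0.5):
    """Return True once the root element changes, False if the timeout expires first."""
    original = driver.find_root()
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        time.sleep(poll_interval)
        if driver.find_root() is not original:
            return True
    return False


redirected = wait_for_redirect(FakeDriver(redirect_after_calls=3), timeout=2.0, poll_interval=0.05)
never = wait_for_redirect(FakeDriver(redirect_after_calls=10**6), timeout=0.2, poll_interval=0.05)
print(redirected, never)  # True False
```

Returning a boolean instead of raising keeps the timeout case and the redirect case on the same code path; whether that is preferable to raising an exception is a design choice.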
by @sunhaiyu
2017.7.17