Back when I first started learning to write crawlers, one of my projects was to scrape the pictures from a website: [Unsplash](https://unsplash.com/). That site came to mind again today while I was thinking about projects, so this time I want to use Selenium to scroll the page down and crawl content across multiple pages. This run grabbed 160 pictures (scrolling down 10 pages) in about 50 lines of code.
# coding: utf-8
from selenium import webdriver
import time
import requests
from bs4 import BeautifulSoup

driver = webdriver.Chrome()


def get_page(driver):  # Open the target homepage
    driver.get('https://unsplash.com/')


def scroll_page(driver):  # Scroll the page and return its source
    get_page(driver)
    # JavaScript snippet that scrolls to the bottom of the page
    js = 'window.scrollTo(0, document.body.scrollHeight);'
    for i in range(10):  # Scroll the page ten times
        # Execute the js snippet
        driver.execute_script(js)
        # Each scroll needs a pause so the newly loaded content can finish
        # loading. My connection is slow, so I wait ten seconds before
        # scrolling again (an adaptive alternative is sketched after the
        # listing).
        time.sleep(10)
    return driver.page_source  # Return the page source


# Parse the page and collect the download links
def parse_page(html):
    url_list = []
    soup = BeautifulSoup(html, 'lxml')
    divs = soup.find_all('div', class_="_1OvAL _2T3hc _27nWV")
    for i in divs:
        # I usually wrap scraping loops in try/except so one or two
        # malformed elements can't stop the whole program; this site's
        # markup is clean, so nothing is actually caught here.
        try:
            links = i.find_all('a', itemprop="contentUrl")
            try:
                for z in links:
                    url_raw = z.get('href')  # href already starts with '/'
                    url = 'https://unsplash.com' + url_raw + '/download?force=true'
                    url_list.append(url)
            except Exception as e:
                print('Inner link error', e)
        except Exception as f:
            print('Outer link error', f)
    return url_list


# Download the pictures
def download_pic(url_list):
    for i in range(len(url_list)):
        # The D:/Picture folder must already exist
        address = 'D:/Picture/{0}.png'.format(i)
        # verify=False skips TLS certificate checks (see the note at the end)
        html = requests.get(url_list[i], verify=False)
        with open(address, 'wb') as f:
            print('Downloading picture {0}'.format(i + 1))
            f.write(html.content)
            print('Picture {0} written successfully'.format(i + 1))


def main(driver):
    html = scroll_page(driver)
    urls_list = parse_page(html)
    download_pic(urls_list)


if __name__ == '__main__':
    main(driver)
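One note on the scrolling loop: the fixed time.sleep(10) is just a guess at how long loading takes. A minimal adaptive alternative, assuming that newly loaded photos grow document.body.scrollHeight, is to poll the height after each scroll. This is only a sketch that reuses the imports from the listing above; max_wait and poll are my own assumed values, not part of the original code.

# Sketch: scroll once, then wait until the page height grows or max_wait
# seconds pass. max_wait and poll are assumed values.
def scroll_once_and_wait(driver, max_wait=30, poll=0.5):
    last_height = driver.execute_script('return document.body.scrollHeight;')
    driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    waited = 0
    while waited < max_wait:
        time.sleep(poll)
        waited += poll
        height = driver.execute_script('return document.body.scrollHeight;')
        if height > last_height:
            return True  # new content has loaded
    return False  # height never grew, so we are probably at the real bottom

With this helper, the body of the for loop in scroll_page reduces to a single scroll_once_and_wait(driver) call, and a fast connection no longer pays the full ten seconds per scroll.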
The above is the complete source code. Since it pulls together quite a few topics at once (Selenium, JavaScript scrolling, BeautifulSoup, requests), it's probably not the friendliest example for complete beginners.
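One last note on the download step: verify=False turns off TLS certificate verification, which is handy when a local proxy breaks certificate checks but is otherwise better avoided. A more defensive variant, with verification on, streamed writes, and an explicit timeout, could look like the sketch below; the folder default, the timeout, and the chunk size are my own assumptions.

import os

# Sketch: download with certificate verification, streaming, and basic
# error handling. folder, timeout, and chunk_size are assumed values.
def download_pic_safe(url_list, folder='D:/Picture'):
    os.makedirs(folder, exist_ok=True)  # create the folder if it is missing
    for i, url in enumerate(url_list):
        response = requests.get(url, stream=True, timeout=30)
        response.raise_for_status()  # fail loudly on HTTP errors
        address = '{0}/{1}.png'.format(folder, i)
        with open(address, 'wb') as f:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
        print('Picture {0} written successfully'.format(i + 1))

Either version plugs into main() unchanged.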