Python multithreaded crawler: example code for grabbing web page images

This article walks through example code for a multithreaded Python crawler that grabs images from web pages. I think it is quite useful, so I am sharing it here for your reference. Let's take a look.

Goal

When searching or browsing websites, we often come across pages full of beautiful pictures.

To save them, we normally have to click and download them one at a time, paging through by hand.

So, is there a way to identify and download the pictures automatically, without the manual work? That would be lovely.

Let's build a crawler in Python that finds and downloads the images on a web page.

And, to improve efficiency, we will download in parallel with multiple threads.

Approach

Python has many third-party libraries that can help us implement all kinds of functionality. The question is which capabilities we actually need:

1) An HTTP request library that, given a URL, fetches the page source. It should also be able to download an image and write it to disk.

2) A way to parse the page source and pick out the image link addresses, such as regular expressions or a simple third-party parser.

3) Support for building multiple threads or a thread pool.

4) Ideally, a way to masquerade as a browser or get past site checks (the site may have anti-crawler measures 😉).

5) Ideally, helpers for creating directories automatically, generating random numbers, formatting dates and times, and so on.

With that settled, let's get to work. O(∩_∩)O~

Environment configuration

Operating system: Windows or Linux

Python version: Python 3.6 (not Python 2.x)

Libraries and modules

urllib.request: HTTP requests (built-in)

threading or concurrent.futures: multithreading or thread pools (Python 3.2+)

re: regular expressions (built-in)

os: operating system interface (built-in)

chardet: character-encoding detection (optional third-party; the code below uses it only when available)

Coding process

Let's break the process down into pieces. The complete source code is provided at the end of the post.
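The snippets below also rely on a handful of module-level imports and switches (collectHtmlEnabled, chardetSupport, printLogEnabled, poolSupport, thePoolSize) that are defined in the complete source. A minimal sketch of that preamble, with assumed default values:

import os
import re
import datetime
import threading
import concurrent.futures
import http.cookiejar
import urllib.request

# Switches referenced by the snippets below (default values are assumptions)
collectHtmlEnabled = False # save the fetched HTML to html.txt for analysis
printLogEnabled = True     # print per-picture progress logs
poolSupport = True         # True: thread pool mode; False: plain threads
thePoolSize = 10           # maximum number of workers in the thread pool

# chardet is an optional third-party library (pip install chardet)
try:
  import chardet
  chardetSupport = True
except ImportError:
  chardetSupport = False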

Disguised as a browser

import http.cookiejar
import urllib.request

# ------Disguise the crawler as a browser------
def makeOpener(head={
  'Connection': 'keep-alive',
  'Accept': 'text/html, application/xhtml+xml, */*',
  'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
  'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0'
  }):
  # An opener with a cookie jar, so the session keeps cookies between requests
  cj = http.cookiejar.CookieJar()
  opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
  # Attach the browser-like headers to every request made through this opener
  opener.addheaders = list(head.items())
  return opener
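A quick usage sketch (the URL is a placeholder): build the opener once, then fetch pages through it so the disguised headers and the cookie jar apply to every request.

opener = makeOpener()
with opener.open('https://example.com') as page: # placeholder URL
  html = page.read()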

Get web page source code

# ------Get the web page source code------
# url: the page's link address
def getHtml(url):
  print('url=' + url)
  oper = makeOpener()
  if oper is not None:
    page = oper.open(url)
  else:
    # Fallback: plain request, still disguised as a browser via User-Agent
    req = urllib.request.Request(url)
    req.add_header('User-Agent', 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:57.0) Gecko/20100101 Firefox/57.0')
    page = urllib.request.urlopen(req)
  html = page.read()
  if collectHtmlEnabled: # Collect the raw HTML
    with open('html.txt', 'wb') as f:
      f.write(html) # Keep a local copy for analysis
  # ------Decode the raw bytes in the html object to a string------
  if chardetSupport:
    cdt = chardet.detect(html)
    charset = cdt['encoding'] # Detect the encoding with chardet
  else:
    charset = 'utf8'
  try:
    result = html.decode(charset)
  except (UnicodeDecodeError, LookupError, TypeError):
    # Detection failed or was wrong; many Chinese pages are GBK-encoded
    result = html.decode('gbk')
  return result
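The try/except fallback exists because encoding detection can fail and many Chinese pages use GBK rather than UTF-8. A small self-contained illustration of the fallback path:

raw = '中文'.encode('gbk') # GBK bytes that are not valid UTF-8
try:
  text = raw.decode('utf8')
except UnicodeDecodeError:
  text = raw.decode('gbk') # the fallback recovers the text
print(text) # 中文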

Download a single picture

# ------Download one picture from its URL------
# folderPath: directory to store pictures in; imgUrl: the picture's link address;
# index: the picture's ordinal number
def downloadImg(folderPath, imgUrl, index):
  # ------Exception handling------
  try:
    imgContent = urllib.request.urlopen(imgUrl).read()
  except urllib.error.HTTPError as e:
    # HTTPError must be caught before URLError, since it is a subclass of it
    if printLogEnabled: print('[Error] HTTP error while downloading the current picture')
    return False
  except urllib.error.URLError as e:
    if printLogEnabled: print('[Error] the current picture cannot be downloaded')
    return False
  else:
    imgNameFromUrl = os.path.basename(imgUrl)
    if printLogEnabled: print('Downloading picture ' + str(index+1) + ', picture address: ' + str(imgUrl))
    # ------IO handling------
    if not os.path.exists(folderPath): # Create the directory if it does not exist
      os.makedirs(folderPath)
    # Picture naming rule: pad very short names with a random string
    imgName = imgNameFromUrl
    if len(imgNameFromUrl) < 8:
      imgName = (random_str(4) + random_str(1, '123456789')
                 + random_str(2, '0123456789') + '_' + imgNameFromUrl)
    filename = os.path.join(folderPath, str(imgName))
    if not filename.endswith('.jpg'): filename += '.jpg'
    try:
      with open(filename, 'wb') as f:
        f.write(imgContent) # Write to local disk
      # if printLogEnabled: print('Finished downloading picture ' + str(index+1))
    except OSError:
      return False
    return True
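downloadImg calls random_str, which is not shown in this excerpt (it is part of the complete source). A minimal sketch of what such a helper might look like, assuming it returns num characters drawn from chars:

import random
import string

# Assumed helper: num random characters drawn from chars (lowercase letters by default)
def random_str(num, chars=string.ascii_lowercase):
  return ''.join(random.choice(chars) for _ in range(num))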

Download a batch of pictures (both multithreading and thread pool modes are supported)

# ------Download pictures in batch------
# folderPath: directory to store pictures in; imgList: the pictures' link addresses
def downloadImgList(folderPath, imgList):
  index = 0
  if not poolSupport:
    # ------Plain multithreading------
    threads = []
    for imgUrl in imgList:
      # if printLogEnabled: print('Preparing to download picture ' + str(index+1))
      threads.append(threading.Thread(target=downloadImg, args=(folderPath, imgUrl, index,)))
      index += 1
    for t in threads:
      t.daemon = True
      t.start()
    for t in threads:
      t.join() # Wait in the parent thread until every worker has finished
    if len(imgList) > 0: print('Download finished, picture directory: ' + str(folderPath))
  else:
    # ------Thread pool------
    futures = []
    # A pool holding at most thePoolSize concurrent tasks (thePoolSize is a global)
    with concurrent.futures.ThreadPoolExecutor(max_workers=thePoolSize) as pool:
      for imgUrl in imgList:
        # if printLogEnabled: print('Preparing to download picture ' + str(index+1))
        futures.append(pool.submit(downloadImg, folderPath, imgUrl, index))
        index += 1
      result = concurrent.futures.wait(futures, timeout=None,
                                       return_when=concurrent.futures.ALL_COMPLETED)
      suc = 0
      for f in result.done:
        if f.result(): suc += 1
      print('Download finished, total: ' + str(len(imgList))
            + ', succeeded: ' + str(suc)
            + ', picture directory: ' + str(folderPath))
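A hypothetical call, with placeholder image URLs, just to show the signature:

# Placeholder URLs for illustration only
downloadImgList('imgs', [
  'https://example.com/a.jpg',
  'https://example.com/b.jpg',
])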

Call example

Take Baidu Tieba as an example

# ------Download all pictures in a Baidu Tieba post------
# folderPath: directory to store pictures in; url: the Tieba post's link
def downloadImgFromBaidutieba(folderPath='tieba', url='https://tieba.baidu.com/p/5256331871'):
  html = getHtml(url)
  # ------Match the page content with a regular expression to find image addresses------
  #reg = r'src="(.*?\.jpg)"'
  reg = r'src="(.*?/sign=.*?\.jpg)"' # Tieba image URLs contain a /sign= segment
  imgre = re.compile(reg)
  imgList = re.findall(imgre, html)
  print('Number of pictures found: ' + str(len(imgList)))
  # Download the pictures
  if len(imgList) > 0: downloadImgList(folderPath, imgList)
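To see what the pattern matches, here is a quick demo on a synthetic HTML snippet. The URL is made up; real Tieba image addresses may differ, but the /sign= segment is what the pattern keys on:

import re
sample = '<img src="https://imgsa.baidu.com/forum/pic/item/sign=abc123/photo.jpg">'
print(re.findall(r'src="(.*?/sign=.*?\.jpg)"', sample))
# ['https://imgsa.baidu.com/forum/pic/item/sign=abc123/photo.jpg']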

# Program entry
if __name__ == '__main__':
  now = datetime.datetime.now().strftime('%Y-%m-%d %H-%M-%S')
  # Download all pictures in the Baidu Tieba post, into a timestamped subdirectory
  downloadImgFromBaidutieba(os.path.join('tieba', now), 'https://tieba.baidu.com/p/5256331871')

The effect is shown in the figure
