Rapidly Building a Distributed Search Engine with Python and Scrapy - Configuring a Scrapy Startup File - XPath Expressions

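We create a custom main.py to serve as the startup file, so the crawler can be launched by running a Python script instead of typing the scrapy command by hand.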

main.py

#!/usr/bin/env python
# -*- coding: utf-8 -*-

import os
import sys

from scrapy.cmdline import execute  # Runs a scrapy command as if it were typed in a shell

# Add the directory that contains main.py to the Python module search path
sys.path.append(os.path.dirname(os.path.abspath(__file__)))

# Equivalent to running "scrapy crawl pach --nolog" on the command line
execute(['scrapy', 'crawl', 'pach', '--nolog'])
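The list passed to execute() is simply the argument vector the scrapy command-line tool would receive. As a minimal sketch (build_crawl_argv is a hypothetical helper introduced here for illustration, not part of Scrapy), the list can be built from a spider name, with logging left on while debugging, since --nolog also suppresses logged errors:

```python
def build_crawl_argv(spider_name, show_log=False):
    """Return the argument list for "scrapy crawl <spider_name>".

    Hypothetical helper for illustration; the result is meant to be
    passed to scrapy.cmdline.execute().
    """
    argv = ["scrapy", "crawl", spider_name]
    if not show_log:
        # --nolog disables Scrapy's logging output entirely
        argv.append("--nolog")
    return argv


quiet_argv = build_crawl_argv("pach")
debug_argv = build_crawl_argv("pach", show_log=True)
```

In main.py the call would then read execute(build_crawl_argv('pach', show_log=True)) during development.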

Crawler file

# -*- coding: utf-8 -*-
import scrapy
from scrapy.http import Request
import urllib.response
from lxml import etree
import re

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        pass

xpath expression
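A few common XPath patterns can be tried without Scrapy at all. The sketch below uses Python's standard-library xml.etree.ElementTree, which supports only a subset of XPath; the sample markup and its URLs are made up for illustration. In Scrapy, response.xpath() accepts full XPath 1.0, so the text and attribute steps can be written directly in the expression, for example //a[@class="archive-title"]/text() and //a[@class="archive-title"]/@href.

```python
import xml.etree.ElementTree as ET

# Made-up, well-formed sample markup (ElementTree needs valid XML)
SAMPLE = """
<html>
  <body>
    <div id="archive">
      <a class="archive-title" href="/post/1/">First post</a>
      <a class="archive-title" href="/post/2/">Second post</a>
      <a class="other" href="/about/">About</a>
    </div>
  </body>
</html>
"""

root = ET.fromstring(SAMPLE)

# .//a : every <a> element at any depth below the root
all_links = root.findall(".//a")

# ./body/div/a : walk down child by child from the root
direct_links = root.findall("./body/div/a")

# .//a[@class='archive-title'] : filter on an attribute value
title_links = root.findall(".//a[@class='archive-title']")

# ElementTree has no /text() or /@href steps, so read them from the elements
titles = [a.text for a in title_links]
hrefs = [a.get("href") for a in title_links]
```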

Basic use

allowed_domains lists the domains the crawler is allowed to visit
start_urls sets the URLs the crawler starts from
parse(response) is the crawler's default callback function. Scrapy passes it a response object that wraps the downloaded HTML and provides methods and attributes for working with it.
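To make the role of allowed_domains concrete, here is a rough sketch of the kind of check it implies: a URL is accepted when its host is one of the listed domains or a subdomain of one. The function is_allowed is a hypothetical name used only for illustration; Scrapy performs its own filtering internally, and this is not its implementation.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is a listed domain or a subdomain of one.

    Illustrative approximation only, not Scrapy's actual code.
    """
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


domains = ["blog.jobbole.com"]
in_scope = is_allowed("http://blog.jobbole.com/all-posts/", domains)
out_of_scope = is_allowed("http://www.jobbole.com/", domains)
```

Here in_scope is True and out_of_scope is False, because www.jobbole.com is neither blog.jobbole.com nor one of its subdomains.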

Methods and attributes of the response object
response.url is the URL that was fetched
response.body is the raw page content, as bytes
response.body_as_unicode() returns the page content decoded to a Unicode string (newer Scrapy releases deprecate it in favor of response.text)
xpath() selects nodes that match an XPath expression
extract() returns the selected data as a list of strings

# -*- coding: utf-8 -*-
import scrapy

class PachSpider(scrapy.Spider):
    name = 'pach'
    allowed_domains = ['blog.jobbole.com']
    start_urls = ['http://blog.jobbole.com/all-posts/']

    def parse(self, response):
        # Text of every <a class="archive-title"> element (the post titles)
        leir = response.xpath('//a[@class="archive-title"]/text()').extract()
        # href attribute of the same elements (the post URLs)
        leir2 = response.xpath('//a[@class="archive-title"]/@href').extract()

        print(response.url)    # URL that was fetched
        print(response.body)   # Raw page content (bytes)
        print(response.text)   # Decoded page content; older Scrapy versions used response.body_as_unicode()

        for i in leir:
            print(i)
        for i in leir2:
            print(i)

Keywords: Python encoding

Added by Hamlets666 on Mon, 07 Oct 2019 06:07:10 +0300