The Fourth Major Assignment of Data Acquisition

Assignment 1

1.1 experimental topic

  • Requirements: master the serialization output of Item and Pipeline data in Scrapy; use the Scrapy + XPath + MySQL database storage technical route to crawl book data from the Dangdang website

  • Candidate sites: http://search.dangdang.com/?key=python&act=input

  • Keywords: students may choose freely

  • Output information: the MySQL output is as follows

1.2 ideas

1.2.1 analysis and search process

By inspecting the page, we can find that the information for each book is stored under an li tag.

Therefore the content of each li tag is crawled and stored in lis:

lis = selector.xpath("//li[@ddt-pit][starts-with(@class,'line')]")  # each li with a ddt-pit attribute and a class starting with 'line'

Then analyze the content under each li tag and extract the required fields through XPath:

title = li.xpath("./a[position()=1]/@title").extract_first()  # title
price = li.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()  # price
author = li.xpath("./p[@class='search_book_author']/span[position()=1]/a/@title").extract_first()  # author
date = li.xpath("./p[@class='search_book_author']/span[position()=last()-1]/text()").extract_first()  # date
publisher = li.xpath("./p[@class='search_book_author']/span[position()=last()]/a/@title").extract_first()  # publisher
detail = li.xpath("./p[@class='detail']/text()").extract_first()  # brief introduction

After crawling one page of information, turn to the next page. The number of pages is limited by the flag counter; here only three pages are crawled:

if self.flag < 2:
    self.flag += 1
    url = response.urljoin(link)  # link is the next-page href extracted from the page (not shown here)
    yield scrapy.Request(url=url, callback=self.parse)

1.2.2 writing items.py

import scrapy

# the Item class name here is assumed; match it to the one generated in your project
class DangdangItem(scrapy.Item):
    title = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()
    price = scrapy.Field()

1.2.3 pipelines.py preparation

First connect to the database. The table is dropped if it already exists and then created fresh:

# in the pipeline class (requires: import sqlite3)
def open_spider(self, spider):
    self.con = sqlite3.connect("dangdang.db")
    self.cursor = self.con.cursor()
    # drop the table if it already exists, then create it
    try:
        self.cursor.execute("drop table book")
    except:
        pass
    self.cursor.execute(
        "create table book(bTitle varchar(512), bAuthor varchar(256), bPublisher varchar(256), bDate varchar(32), bPrice varchar(16), bDetail varchar(1024))")
    self.opened = True  # flag checked later when inserting and closing
    self.count = 0      # number of rows inserted

Close the database when the spider closes:

def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False

Insert each item into the table:

def process_item(self, item, spider):
    if self.opened:
        # insert one row per book
        self.cursor.execute("insert into book(bTitle,bAuthor,bPublisher,bDate,bPrice,bDetail) values(?,?,?,?,?,?)",
                            (item["title"], item["author"], item["publisher"], item["date"], item["price"], item["detail"]))
        self.count += 1
    return item

1.2.4 running the Scrapy framework code

Method 1: write a run script inside the Scrapy project:

from scrapy import cmdline
cmdline.execute("scrapy crawl Myspider -s LOG_ENABLED=False".split())

Here, Myspider should be changed to match the name defined in the Python file of the spider you created:

 name = 'Myspider'

That is, it follows the name attribute defined in this line, not the spider's class name.

Method 2: on the command line, change into the Scrapy project directory and run scrapy crawl Myspider. As before, Myspider should be replaced with your own spider's name.

1.2.5 operation results

Results in database

Command line output results

1.3 complete code

https://gitee.com/q_kj/crawl_project/tree/master/dangdang

1.4 summary

The first problem was reproducing the code. At the beginning, the start_urls = [''] generated in the spider template raised errors when being spliced into a complete URL; removing the [] (keeping the URL as a plain string) let it run normally. There were no other problems.
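A minimal sketch of the fix, assuming a custom start_requests (only the start-URL handling is shown; the spider name follows 1.2.4):

import scrapy

class MySpider(scrapy.Spider):
    name = 'Myspider'
    # keep the URL as a plain string rather than the generated start_urls = [''] list,
    # so it can be spliced into a complete URL without errors
    start_url = 'http://search.dangdang.com/?key=python&act=input'

    def start_requests(self):
        yield scrapy.Request(url=self.start_url, callback=self.parse)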

Assignment 2

2.1 experimental topic

  • Requirements: master the serialization output of Item and Pipeline data in Scrapy; crawl the foreign exchange website data using the Scrapy framework + XPath + MySQL database storage technical route.

  • Candidate website: China Merchants Bank Network: http://fx.cmbchina.com/hq/

  • Output information: MySQL database storage and output format

    Id  Currency  TSP    CSP    TBP    CBP    Time
    1   HKD       86.60  86.60  86.26  85.65  15:36:30
    2   ......

2.2 ideas

2.2.1 analysis and search process

By inspecting the page, you can find that the information for each currency is stored under a tr tag.

During inspection the rows appear to sit in tr tags under a tbody, so //*[@id="rightbox"]/table/tbody/tr was used to match them, but the result was empty. Looking at the page source then shows that there is no tbody tag, so the following code is used to match the tr tags instead:

trs = selector.xpath('//div[@id="realRateInfo"]/table/tr')

Then match the required content under each tr tag:

name = tr.xpath('.//td[1]/text()').extract()  # transaction currency
TSP = tr.xpath('.//td[4]/text()').extract()   # spot exchange selling price
CSP = tr.xpath('.//td[5]/text()').extract()   # cash selling price
TBP = tr.xpath('.//td[6]/text()').extract()   # spot exchange purchase price
CBP = tr.xpath('.//td[7]/text()').extract()   # cash purchase price
time = tr.xpath('.//td[8]/text()').extract()  # time

However, the first tr holds the table header, which is not needed, so it has to be filtered out: initialize count = 0, skip the row when count == 0 and set count to 1, and store the remaining rows in the item for the later stages, as sketched below.
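A minimal sketch of this filtering inside the spider's parse method, assuming the Item class is named BankItem as in items.py below (the field extraction is the code shown above):

count = 0
for tr in trs:
    if count == 0:
        # the first tr is the table header: skip it once
        count = 1
        continue
    item = BankItem()
    item["currency"] = tr.xpath('.//td[1]/text()').extract()
    item["TSP"] = tr.xpath('.//td[4]/text()').extract()
    # ... the remaining fields are extracted the same way ...
    yield item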

2.2.2 settings.py settings

View the User-Agent via F12 and modify the corresponding entry in settings.py.

Set the robots protocol (ROBOTSTXT_OBEY) to False.

Uncomment the pipeline section (ITEM_PIPELINES) so the pipeline runs, as sketched below.
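A minimal sketch of the resulting settings.py entries (the User-Agent string and the pipeline path are placeholders; copy the real ones from your browser and project):

# paste the User-Agent string shown in your browser's F12 network panel (placeholder below)
USER_AGENT = "Mozilla/5.0 ..."
# do not obey robots.txt
ROBOTSTXT_OBEY = False
# uncomment ITEM_PIPELINES so the pipeline runs (the path here is a placeholder)
ITEM_PIPELINES = {
    "zsBank.pipelines.BankPipeline": 300,
}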

2.2.3 writing items.py

import scrapy

# the Item class name here is assumed; match it to the one generated in your project
class BankItem(scrapy.Item):
    currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    Time = scrapy.Field()

2.2.4 pipelines.py preparation

First connect to the database. The table is dropped if it already exists and then created fresh:

# in the pipeline class (requires: import sqlite3)
def open_spider(self, spider):
    self.con = sqlite3.connect("bank.db")
    self.cursor = self.con.cursor()
    # drop the table if it already exists, then create it
    try:
        self.cursor.execute("drop table bank")
    except:
        pass
    self.cursor.execute("create table bank(Id int, Currency varchar(32), TSP varchar(32), CSP varchar(32), TBP varchar(32), CBP varchar(32), Time varchar(32))")
    self.opened = True  # flag checked later when inserting and closing
    self.count = 1      # Id counter (assumed starting value)

Close the database when the spider closes:

def close_spider(self, spider):
    if self.opened:
        self.con.commit()
        self.con.close()
        self.opened = False

Insert each item into the table:

def process_item(self, item, spider):
    if self.opened:
        # insert one row per currency, numbering rows with self.count as the Id
        self.cursor.execute("insert into bank(Id,Currency,TSP,CSP,TBP,CBP,Time) values(?,?,?,?,?,?,?)",
                            (self.count, item["currency"], item["TSP"], item["CSP"], item["TBP"], item["CBP"], item["Time"]))
        self.count += 1
    return item

2.2.5 results

Database result information

Console output information

2.3 complete code

https://gitee.com/q_kj/crawl_project/tree/master/zsBank

2.4 summary

The main issue in this task was that, when looking for the tr tags, the inspector shows a tbody tag, yet matching through it returns nothing; the page source contains no tbody tag, and removing it from the XPath matches the rows correctly. Viewing the page source therefore reveals the real hierarchy: what you see in the inspector may not be the content actually served.

Assignment 3

3.1 experimental topic

  • Requirements: be familiar with Selenium's methods for finding HTML elements, crawling Ajax web page data, waiting for HTML elements, etc.; use the Selenium framework + MySQL database storage technical route to crawl the stock data of the "Shanghai and Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares" boards.

  • Candidate website: Eastmoney: http://quote.eastmoney.com/center/gridlist.html#hs_a_board

  • Output information: the MySQL storage and output format is as follows. The headers should be named in English and defined by the students themselves, e.g. serial number: id, stock code: bStockNo, ...:

    No.  Stock code  Stock name  Latest price  Change %  Change  Volume   Turnover     Amplitude  High  Low    Open  Prev. close
    1    688093      N Shihua    28.47         62.22%    10.92   261,300  760 million  22.34      32.0  28.08  30.2  17.55
    2    ......

3.2 ideas

3.2.1 Analyzing the page

By inspecting the page, you can find that the information for each stock is stored under a tr tag.

Then analyze each tr tag to get the required content:

try:
    # requires: from selenium.webdriver.common.by import By
    id = tr.find_element(by=By.XPATH, value='.//td[1]').text
except:
    id = 0

Because many fields are required, only one example is given here; the rest are in the full code. A fuller sketch of the row loop follows.
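A minimal sketch of the row loop under stated assumptions (the table XPath and the column indices are placeholders; adjust them to the page):

# requires: from selenium.webdriver.common.by import By
for tr in self.driver.find_elements(by=By.XPATH, value='//table/tbody/tr'):  # placeholder XPath
    try:
        id = tr.find_element(by=By.XPATH, value='.//td[1]').text  # serial number
    except:
        id = 0
    try:
        bStockNo = tr.find_element(by=By.XPATH, value='.//td[2]').text  # stock code (index assumed)
    except:
        bStockNo = ''
    # ... bStockName, bNewPirce, and the remaining columns follow the same pattern ...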

3.2.2 creating the tables

Because three boards need to be crawled, three tables are created, one to store each board's information:

# create the connection to the database
self.con = sqlite3.connect("stock.db")
self.cursor = self.con.cursor()
# one table per board: drop each table if it already exists, then (re)create it
for table in ("hushen", "shangzheng", "shenzheng"):
    try:
        self.cursor.execute("drop table " + table)
    except:
        pass
    try:
        self.cursor.execute("create table " + table + "(Id varchar(16), bStockNo varchar(32), bStockName varchar(32), bNewPirce varchar(32), bUP varchar(32), bDowm varchar(32), bNum varchar(32), bCount varchar(32), bampl varchar(32), bHighest varchar(32), bMin varchar(32), bTodayOpen varchar(32), bYClose varchar(32))")
    except:
        pass

3.2.3 closing the database

try:
    # commit the data, then close the database connection and the browser
    self.con.commit()
    self.con.close()
    self.driver.close()
except Exception as err:
    print(err)

3.2.4 inserting data into tables

Because there are three tables, a flag is used to select which table each row is inserted into:

# map the flag (1/2/3) to its board table and insert one row into it
table = {1: "hushen", 2: "shangzheng", 3: "shenzheng"}.get(flag)
if table:
    try:
        self.cursor.execute(
            "insert into " + table + "(Id,bStockNo,bStockName,bNewPirce,bUP,bDowm,bNum,bCount,bampl,bHighest,bMin,bTodayOpen,bYClose) values (?,?,?,?,?,?,?,?,?,?,?,?,?)",
            (Id, bStockNo, bStockName, bNewPirce, bUP, bDowm, bNum, bCount, bampl, bHighest, bMin, bTodayOpen, bYClose))
        print("Insert data succeeded")
    except Exception as err:
        print(err)

3.2.5 implementing page turning

Only two pages are crawled here, just to demonstrate that page turning works:

if self.count < 1:
    self.count += 1
    nextPage = self.driver.find_element(by=By.XPATH, value='//*[@id="main-table_paginate"]/a[2]')
    time.sleep(10)  # give the page time to load before clicking (requires: import time)
    nextPage.click()
    self.processSpider(flag)

3.2.6 crawling the different boards

nextStock = self.driver.find_element(by=By.XPATH, value='//*[@id="nav_sh_a_board"]/a')
# click via JavaScript to avoid the "element click intercepted" error (see the summary)
self.driver.execute_script("arguments[0].click();", nextStock)

3.2.7 results

hushen table

shangzheng table

shenzheng table

3.3 complete code

https://gitee.com/q_kj/crawl_project/tree/master/forth

3.4 summary

An "element click intercepted" error was encountered when calling click(). Following https://laowangblog.com/selenium-element-click-intercepted.html, changing next.click() to self.driver.execute_script("arguments[0].click();", next) solved the problem. Selenium crawls also often fail simply because the page has not finished loading yet, so sleep before interacting and the crawl can usually retrieve the results normally.
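As a more robust alternative to fixed time.sleep() calls, Selenium's explicit waits block only until the element actually appears. A minimal sketch using this assignment's next-page button (the 10-second timeout is an arbitrary choice):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# wait until the next-page button is present, then click it via JavaScript
nextPage = WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//*[@id="main-table_paginate"]/a[2]')))
self.driver.execute_script("arguments[0].click();", nextPage)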
