Data acquisition and fusion technology - the fifth operation

Operation ①

1.1 operation content
- Requirements:
  - Become proficient with Selenium: locating HTML elements, crawling Ajax-loaded page data, waiting for HTML elements, and so on.
  - Use the Selenium framework to crawl the information and pictures of a chosen kind of product on Jingdong Mall (JD.com).
- Candidate site: http://www.jd.com/
- Output information: the records stored in MySQL should look as follows
  
  mNo	mMark	mPrice	mNote	mFile
  000001	Samsung Galaxy	9199.00	Samsung Galaxy Note20 Ultra 5G	000001.jpg
  000002......
1.2 code and experimental steps
- 1.2.1 experimental steps:

Copy the XPath of the search box and send the search keyword to it

Click the search button

but = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
but.click()
time.sleep(10)
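A fixed `time.sleep(10)` is either longer than needed or too short on a slow connection. Selenium ships `WebDriverWait` for explicit waits; the sketch below only illustrates the underlying polling idea with the standard library, so it runs without a browser. `wait_for_elements` and `FakeDriver` are hypothetical names introduced here. The only driver method assumed is `find_elements(by, value)`, which exists in Selenium 3 and 4 (`"xpath"` is the value of `By.XPATH`); the `find_element_by_xpath` helpers used in this write-up were removed in Selenium 4.3.

```python
import time


def wait_for_elements(driver, xpath, timeout=10.0, poll=0.5):
    """Poll until the XPath matches at least one element or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        found = driver.find_elements("xpath", xpath)
        if found:
            return found
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {xpath!r} within {timeout}s")
        time.sleep(poll)


class FakeDriver:
    """Stand-in for a WebDriver (assumption): the list appears on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find_elements(self, by, value):
        self.calls += 1
        return ["li-1", "li-2"] if self.calls >= 3 else []


fake = FakeDriver()
items = wait_for_elements(fake, '//*[@id="J_goodsList"]/ul/li', timeout=2.0, poll=0.01)
print(len(items), fake.calls)
```

With a real driver the call returns as soon as the product list is present instead of always sleeping for the full ten seconds.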

Scroll down the page in steps so that all products are loaded

for i in range(33):
  self.driver.execute_script("var a = window.innerHeight;window.scrollBy(0,a*0.5);")
  time.sleep(0.5)
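The loop above scrolls a fixed 33 times, which assumes a particular page length. One possible alternative, sketched here, is to stop as soon as the bottom of the page is reached. `scroll_to_bottom` and `FakePage` are hypothetical names; the JavaScript relies only on standard DOM properties, and the fake page exists purely so the sketch runs without a browser.

```python
import time

AT_BOTTOM_JS = (
    "return window.pageYOffset + window.innerHeight"
    " >= document.documentElement.scrollHeight - 1;"
)
SCROLL_JS = "window.scrollBy(0, window.innerHeight * 0.5);"


def scroll_to_bottom(driver, pause=0.5, max_steps=200):
    """Scroll half a viewport at a time; return the number of scroll steps taken."""
    for step in range(max_steps):
        if driver.execute_script(AT_BOTTOM_JS):
            return step
        driver.execute_script(SCROLL_JS)
        time.sleep(pause)  # give lazily loaded items time to render
    return max_steps


class FakePage:
    """Simulates a page 5 viewports tall; each scroll moves half a viewport."""

    def __init__(self):
        self.offset, self.viewport, self.height = 0.0, 1.0, 5.0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.offset + self.viewport >= self.height
        self.offset = min(self.offset + 0.5 * self.viewport, self.height - self.viewport)
        return None


print(scroll_to_bottom(FakePage(), pause=0))
```

Because the bottom check is re-evaluated on every step, a page that grows while scrolling is still followed to its end, up to `max_steps`.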

Inspecting the product page shows that the information for each product sits in its own li tag

html = self.driver.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')  # collect all li tags

Iterate over the li nodes and extract the fields of each product

for item in html:
    try:
        # mMark: text of the first <font> in the title; fall back to a blank if it is missing
        mMark = item.find_element_by_xpath('./div//div[@class="p-name"]/a/em/font[1]').text
        print(mMark)
    except Exception:
        mMark = " "
    mPrice = item.find_element_by_xpath('./div//div[@class="p-price"]/strong/i').text
    mNote = item.find_element_by_xpath('./div//div[@class="p-name"]/a/em').text
    src = item.find_element_by_xpath('./div//div[@class="p-img"]/a/img').get_attribute('src')
    self.picSave(src)  # download the product picture
    self.db.insert(self.count, mMark, mPrice, mNote, str(self.count)+".jpg")
    self.count += 1
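The `self.db.insert` helper is not shown in this write-up. As a minimal sketch of what such a storage layer could look like, the block below uses `sqlite3` from the standard library as a stand-in for MySQL so that it runs anywhere. The class name `GoodsDB`, the table name `goods`, and the four-argument `insert` signature are assumptions made for illustration. The number is zero-padded so that it matches the `000001` / `000001.jpg` format of the expected output.

```python
import sqlite3


class GoodsDB:
    """Stores crawled products; sqlite3 stands in for MySQL in this sketch."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS goods ("
            "mNo TEXT PRIMARY KEY, mMark TEXT, mPrice TEXT, mNote TEXT, mFile TEXT)"
        )

    def insert(self, count, mMark, mPrice, mNote):
        mNo = f"{count:06d}"  # 1 -> "000001", as in the expected output
        # Placeholders keep quotes in product names from breaking the SQL.
        self.con.execute(
            "INSERT INTO goods VALUES (?, ?, ?, ?, ?)",
            (mNo, mMark, mPrice, mNote, mNo + ".jpg"),
        )
        self.con.commit()

    def rows(self):
        return self.con.execute("SELECT * FROM goods ORDER BY mNo").fetchall()


db = GoodsDB()
db.insert(1, "Samsung Galaxy", "9199.00", "Samsung Galaxy Note20 Ultra 5G")
print(db.rows())
```

With a MySQL driver such as PyMySQL the same pattern applies, except that the placeholder is `%s` instead of `?`.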

Turn to the next page

if self.page < 2:
    self.page += 1
    nextPage = self.driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[9]')
    nextPage.click()
    # Run the crawl function again on the new page
    self.Mining()

1.3 operation results:

1.4 experience
- When crawling, first collect the li node of each product, then loop over those nodes and extract the information inside each one
- Learned how to simulate a search
- While crawling, set suitable sleep times so that the web page has time to load
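The first point, collecting one node per product and then querying relative to it, can be tried offline. The sketch below applies the same pattern to a made-up, simplified snippet with `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, so it is an illustration rather than a replacement for Selenium. The snippet, the second product, and the function name `extract` are assumptions.

```python
import xml.etree.ElementTree as ET

# Made-up, simplified markup that mimics the structure of the product list.
SNIPPET = """
<ul>
  <li><div>
    <div class="p-price"><strong><i>9199.00</i></strong></div>
    <div class="p-name"><a><em><font>Samsung Galaxy</font> Note20 Ultra 5G</em></a></div>
  </div></li>
  <li><div>
    <div class="p-price"><strong><i>10.00</i></strong></div>
    <div class="p-name"><a><em>Placeholder item</em></a></div>
  </div></li>
</ul>
"""


def extract(root):
    rows = []
    for li in root.findall("./li"):  # one node per product
        # Paths start with "." so they only search inside the current li.
        em = li.find("./div//div[@class='p-name']/a/em")
        mark = li.findtext("./div//div[@class='p-name']/a/em/font", default=" ")
        price = li.findtext("./div//div[@class='p-price']/strong/i", default="")
        note = "".join(em.itertext()).strip() if em is not None else ""
        rows.append((mark, price, note))
    return rows


for row in extract(ET.fromstring(SNIPPET)):
    print(row)
```

The second product has no `font` element, so its mark falls back to a blank, mirroring the `try`/`except` in the crawl loop.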

Operation ②

2.1 operation contents

Requirements:
- Become proficient with Selenium: locating HTML elements, simulating user login, crawling Ajax-loaded page data, waiting for HTML elements, and so on.
- Use Selenium + MySQL to simulate logging in to the China University MOOC site, fetch the courses already taken in your own account, and save them to MySQL (course number, course name, teaching unit, teaching progress, course status, and course picture address). Store the pictures in the imgs folder under the root directory of the local project, named after the course.
Candidate website: China University MOOC: https://www.icourse163.org

Output information: storage and output format in the MySQL database

Id	cCourse	cCollege	cSchedule	cCourseStatus	cImgUrl
1	Python web crawler and information extraction	Beijing University of Technology	3 / 18 class hours learned	Completed on May 18, 2021	http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg
2......

2.2 code and experimental steps
- 2.2.1 experimental steps

      # Open the login dialog
      DL = self.driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div')
      DL.click()
      # Choose "other login methods" instead of the QR code
      QTDL = self.driver.find_element_by_xpath('//span[@class="ux-login-set-scan-code_ft_back"]')
      QTDL.click()
      # Select the mobile phone login tab
      phoneDL = self.driver.find_element_by_xpath('//ul[@class="ux-tabs-underline_hd"]/li[2]')
      phoneDL.click()
      # The login form lives in an iframe, so switch into it first
      phoneI = self.driver.find_element_by_xpath('//div[@class="ux-login-set-container"][@id="j-ursContainer-1"]/iframe')
      self.driver.switch_to.frame(phoneI)
      # Enter the phone number
      phoneNum = self.driver.find_element_by_xpath('//*[@id="phoneipt"]')
      phoneNum.send_keys(USERNAME)
      # Enter the password
      phonePassword = self.driver.find_element_by_xpath('//div[@class="u-input box"]/input[2]')
      phonePassword.send_keys(PASSWORD)
      # Click login and wait for the login to succeed
      DlClick = self.driver.find_element_by_xpath('//*[@id="submitBtn"]')
      DlClick.click()
      time.sleep(10)

      # Open the "my courses" page
      myClass = self.driver.find_element_by_xpath('//div[@class="_1Y4Ni"]/div')
      myClass.click()
      time.sleep(5)

Inspecting the course page shows that the information for each course sits in its own div tag

  # Collect the div tag of each course
  html = self.driver.find_elements_by_xpath('//*[@id="j-coursewrap"]/div/div/div')

Iterate over the div nodes and extract the fields of each course

# Extract the fields
for item in html:
    print(self.count)
    cCourse = item.find_element_by_xpath('./div//div[@class="text"]/span[@class="text"]').text
    print(cCourse)
    cCollege = item.find_element_by_xpath('./div//div[@class="school"]/a').text
    print(cCollege)
    cSchedule = item.find_element_by_xpath('./div//div[@class="text"]/a/span').text
    print(cSchedule)
    cCourseStatus = item.find_element_by_xpath('./div//div[@class="course-status"]').text
    print(cCourseStatus)
    src = item.find_element_by_xpath('./div//div[@class="img"]/img').get_attribute("src")
    print(src)
    self.picSave(src, cCourse)
    self.db.insert(self.count, cCourse, cCollege, cSchedule, cCourseStatus, src)
    self.count += 1
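The `picSave(src, cCourse)` helper is not shown here. Since the requirement is to store each picture in the imgs folder under the course name, one detail worth handling is that a course name may contain characters that are not valid in file names. The sketch below builds such a path; `image_path` is a hypothetical helper, and creating the folder and downloading the file are left out.

```python
import re
from pathlib import Path
from urllib.parse import urlparse


def image_path(course_name, src, folder="imgs"):
    """Build imgs/<course name><ext>, replacing characters unsafe in file names."""
    safe = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip() or "unnamed"
    ext = Path(urlparse(src).path).suffix or ".jpg"  # ignore any ?query part
    return Path(folder) / (safe + ext)


p = image_path(
    "Python web crawler and information extraction",
    "http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg",
)
print(p.name)
```

Taking the extension from the URL path rather than from the whole URL keeps query strings out of the file name.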

Turn to the next page

nextPage = self.driver.find_element_by_xpath('//*[@id="j-coursewrap"]/div/div[2]/ul/li[4]/a')
# Keep going until the "next page" button is disabled
if nextPage.get_attribute('class') != "th-bk-disable-gh":
    nextPage.click()
    time.sleep(5)
    # Run the crawl function again on the new page
    self.Mining()
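Calling `self.Mining()` from inside itself adds one stack frame per page. A loop is one way to avoid that, sketched below under a few assumptions: `crawl_all_pages`, `FakeSite`, and `FakeButton` are hypothetical names, `"some-other-class"` is a made-up class value, and the fake objects exist only so the sketch runs without a browser. The sketch also checks whether the disabled class is among the element's classes instead of comparing the whole attribute, which still works if the button carries more than one class.

```python
import time


def crawl_all_pages(driver, crawl_page, next_xpath, disabled_class, pause=5.0, max_pages=100):
    """Crawl the current page, then click "next" until it is disabled."""
    pages = 0
    while pages < max_pages:
        crawl_page()
        pages += 1
        next_btn = driver.find_element("xpath", next_xpath)
        classes = (next_btn.get_attribute("class") or "").split()
        if disabled_class in classes:
            break  # last page reached
        next_btn.click()
        time.sleep(pause)  # wait for the next page to load
    return pages


class FakeButton:
    def __init__(self, site):
        self.site = site

    def get_attribute(self, name):
        at_end = self.site.page >= self.site.last
        return "th-bk-disable-gh" if at_end else "some-other-class"

    def click(self):
        self.site.page += 1


class FakeSite:
    """Stand-in for the driver: three pages, next button disabled on the last."""

    def __init__(self, last=3):
        self.page, self.last = 1, last

    def find_element(self, by, value):
        return FakeButton(self)


site = FakeSite()
seen = []
total = crawl_all_pages(site, lambda: seen.append(site.page), "//a", "th-bk-disable-gh", pause=0)
print(total, seen)
```

The `max_pages` cap is a safety net in case the disabled state is never detected.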

2.3 operation results:

Operation ③

3.1 operation contents
- Requirements: become proficient with big-data-related services and familiar with using Xshell
  - Complete the following five tasks from the document "Huawei Cloud_Big data real-time analysis and processing experiment manual - Flume log collection experiment (part) v2.docx". See the document for the specific operations.
  - Environment construction
    - Task 1: open MapReduce service
  - Real time analysis and development practice:
    - Task 1: generate test data from Python script
    - Task 2: configure Kafka
    - Task 3: install Flume client
    - Task 4: configure Flume to collect data
3.2 results
- Task 1: generate test data from Python script
  Run the Python file
  
  View the generated data
- Task 2: configure Kafka
  Run the source command
- Task 3: install Flume client
  Finally, install Flume
  
  Restart the service
- Task 4: configure Flume to collect data

Added by horsleyaa on Tue, 04 Jan 2022 08:31:27 +0200
