Data Acquisition and Fusion Technology: Assignment 5

Assignment ①

  • 1.1 Assignment content

    • Requirements:

      • Master using Selenium to find HTML elements, crawl Ajax page data, wait for HTML elements, etc.
      • Use the Selenium framework to crawl the information and pictures of a chosen kind of product from Jingdong Mall.
    • Candidate sites:

    • Output information: the records stored in MySQL look as follows

      000001 Samsung Galaxy 9199.00 Samsung Galaxy Note20 Ultra 5G 000001.jpg
  • 1.2 Code and experimental steps

    • 1.2.1 Experimental steps:

Copy the XPath of the search box and pass in the keyword.

Locate the search button and click it:

but = self.driver.find_element_by_xpath('//*[@id="search"]/div/div[2]/button')
but.click()

Scroll down the page step by step (half a window height each time):

for i in range(33):
    self.driver.execute_script("var a = window.innerHeight;window.scrollBy(0,a*0.5);")
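
The scrolling step can be wrapped in a small helper that pauses between scrolls so that content loaded on demand has time to appear. This is only a minimal sketch: `scroll_page`, `steps`, and `pause` are illustrative names that do not appear in the original code, and the default values are assumptions rather than tuned settings. The only driver method it relies on is `execute_script`, as used above.

```python
import time


def scroll_page(driver, steps=33, pause=0.2):
    """Scroll down half a window height at a time, pausing after each step.

    `driver` is any object exposing Selenium's `execute_script` method.
    `steps` and `pause` are illustrative defaults (assumptions).
    """
    script = "var a = window.innerHeight; window.scrollBy(0, a * 0.5);"
    for _ in range(steps):
        driver.execute_script(script)
        time.sleep(pause)  # give lazily loaded items time to appear
```

Inside the crawler class this would be called as `scroll_page(self.driver)` before collecting the li nodes.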

Inspecting the product page shows that the information for each product sits in its own li tag:

html = self.driver.find_elements_by_xpath('//*[@id="J_goodsList"]/ul/li')  # all li tags in the product list

Iterate over each li and extract the fields:

for item in html:
    try:
        # Not every product has this font node; fall back to a blank
        mMark = item.find_element_by_xpath('./div//div[@class="p-name"]/a/em/font[1]').text
    except Exception:
        mMark = " "
    mPrice = item.find_element_by_xpath('./div//div[@class="p-price"]/strong/i').text
    mNote = item.find_element_by_xpath('./div//div[@class="p-name"]/a/em').text
    src = item.find_element_by_xpath('./div//div[@class="p-img"]/a/img').get_attribute('src')
    self.db.insert(self.count, mMark, mPrice, mNote, str(self.count)+".jpg")
    self.count += 1
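
The `self.db.insert(...)` helper itself is not shown in the post. One way such a helper might look is sketched below, using the standard-library `sqlite3` module as a stand-in for MySQL so that the example runs without a database server. The class name `GoodsDB`, the table name `goods`, and the column names `mNo` and `mFile` are assumptions; with a MySQL driver the placeholder style would typically be `%s` instead of `?`.

```python
import sqlite3


class GoodsDB:
    """Hypothetical storage helper; sqlite3 stands in for MySQL here."""

    def __init__(self, path=":memory:"):
        self.con = sqlite3.connect(path)
        self.con.execute(
            "CREATE TABLE IF NOT EXISTS goods ("
            "mNo TEXT PRIMARY KEY, mMark TEXT, mPrice TEXT, "
            "mNote TEXT, mFile TEXT)"
        )

    def insert(self, count, mMark, mPrice, mNote, mFile):
        # Zero-pad the running number to six digits (1 becomes "000001"),
        # matching the sample output row.
        mNo = f"{count:06d}"
        self.con.execute(
            "INSERT INTO goods VALUES (?, ?, ?, ?, ?)",
            (mNo, mMark, mPrice, mNote, mFile),
        )
        self.con.commit()

    def rows(self):
        """Return all stored rows ordered by product number."""
        return self.con.execute("SELECT * FROM goods ORDER BY mNo").fetchall()
```

Parameterized queries are used instead of string formatting so that product names containing quotes do not break the SQL statement.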

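Turn the page:

if self.page < 2:  # self.page: page counter (name assumed; lost in the original)
    self.page += 1
    nextPage = self.driver.find_element_by_xpath('//*[@id="J_bottomPage"]/span[1]/a[9]')
    nextPage.click()
    # Run the crawl function again on the new page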
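
The same page limit can also be expressed as a loop instead of having the crawl function call itself. The sketch below is one possible arrangement, not the original implementation: `crawl_pages`, `process_page`, and `max_pages` are hypothetical names, and `driver.find_element("xpath", ...)` is the generic form of the XPath lookup (the string `"xpath"` is the value behind Selenium's `By.XPATH`).

```python
def crawl_pages(driver, process_page, next_xpath, max_pages=2):
    """Call process_page(driver) on up to max_pages pages.

    Between pages, the element at next_xpath is clicked.
    Returns the number of pages processed.
    """
    pages = 0
    while pages < max_pages:
        process_page(driver)
        pages += 1
        if pages < max_pages:
            # "xpath" is the locator strategy string behind By.XPATH
            driver.find_element("xpath", next_xpath).click()
    return pages
```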
  • 1.3 Results:


  • 1.4 Takeaways

    • When crawling, first collect the li node of each product, then loop over those nodes to extract the individual fields.
    • Learned how to simulate a search.
    • While crawling, remember to set a sleep time so that the web page has time to load.
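
As an alternative to fixed sleep times, the wait can poll for a condition and return as soon as the page is ready. The helper below is a minimal standard-library sketch of that idea; `wait_until` and its parameters are hypothetical names, not part of the original code. Selenium ships its own version of this pattern as `WebDriverWait` in `selenium.webdriver.support.ui`.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it returns a truthy value or timeout expires.

    Returns the truthy value; raises TimeoutError otherwise.
    Exceptions raised by condition() are treated as "not ready yet".
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except Exception:
            pass  # e.g. the element is not in the page yet
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(interval)
```

For example, `wait_until(lambda: driver.find_elements("xpath", '//*[@id="J_goodsList"]/ul/li'))` would return the list of li nodes once at least one is present.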

Assignment ②

  • 2.1 Assignment content

    • Requirements:

      • Become proficient with Selenium: finding HTML elements, simulating user login, crawling Ajax page data, waiting for HTML elements, etc.
      • Use the Selenium framework plus MySQL to simulate logging in to the China MOOC website, obtain the information of the courses studied in the student's own account, and save it in MySQL (course number, course name, teaching unit, teaching progress, course status, and course picture address). Store the pictures in the imgs folder under the root directory of the local project, naming each picture after its course.
    • Candidate website: China MOOC website

    • Output information: MySQL database storage and output format

      1 Python web crawler and information extraction Beijing University of Technology 3 / 18 class hours learned Completed on May 18, 2021
  • 2.2 Code and experimental steps

    • 2.2.1 Experimental steps

      # Login entry on the home page
      DL = self.driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div')
      # "Other login methods" link
      QTDL = self.driver.find_element_by_xpath('//span[@class="ux-login-set-scan-code_ft_back"]')
      # Mobile phone login tab
      phoneDL = self.driver.find_element_by_xpath('//ul[@class="ux-tabs-underline_hd"]/li[2]')
      # iframe that contains the login form (switch into it before typing)
      phoneI = self.driver.find_element_by_xpath('//div[@class="ux-login-set-container"][@id="j-ursContainer-1"]/iframe')
      # Phone number input
      phoneNum = self.driver.find_element_by_xpath('//*[@id="phoneipt"]')
      # Password input
      phonePassword = self.driver.find_element_by_xpath('//div[@class="u-input box"]/input[2]')
      # Login button; click it and wait for the login to succeed
      DlClick = self.driver.find_element_by_xpath('//*[@id="submitBtn"]')
      # Entry to "My courses"
      myClass = self.driver.find_element_by_xpath('//div[@class="_1Y4Ni"]/div')
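
The snippet above only locates the elements. The sketch below shows one way the clicks, the iframe switch, and the typing could be chained together. It is an illustration under assumptions, not the original code: `login_by_phone`, `phone`, `password`, and the fixed `pause` are hypothetical, and `driver.find_element("xpath", ...)` is used as the generic lookup (the string `"xpath"` is the value behind Selenium's `By.XPATH`).

```python
import time


def login_by_phone(driver, phone, password, pause=1.0):
    """Sketch of the click/type sequence behind the locators above.

    `phone` and `password` are supplied by the caller; `pause` is an
    assumed fixed delay between steps, not a tuned value.
    """
    def find(xpath):
        # "xpath" is the locator strategy string behind By.XPATH
        return driver.find_element("xpath", xpath)

    find('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div').click()  # login entry
    time.sleep(pause)
    find('//span[@class="ux-login-set-scan-code_ft_back"]').click()  # other login methods
    time.sleep(pause)
    find('//ul[@class="ux-tabs-underline_hd"]/li[2]').click()  # phone login tab
    time.sleep(pause)
    # The form lives in an iframe, so switch into it before typing
    frame = find('//div[@class="ux-login-set-container"][@id="j-ursContainer-1"]/iframe')
    driver.switch_to.frame(frame)
    find('//*[@id="phoneipt"]').send_keys(phone)
    find('//div[@class="u-input box"]/input[2]').send_keys(password)
    find('//*[@id="submitBtn"]').click()
    time.sleep(pause)
    # Return to the main document before touching elements outside the iframe
    driver.switch_to.default_content()
    find('//div[@class="_1Y4Ni"]/div').click()  # "My courses"
```

Switching back with `default_content()` matters because lookups issued while still inside the iframe cannot see elements of the outer page.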

Inspecting the course page shows that the information for each course sits in its own div tag:

  # Collect the div tag of each course
  html = self.driver.find_elements_by_xpath('//*[@id="j-coursewrap"]/div/div/div')

Iterate over each div and extract the fields:

# Extract the fields of each course
for item in html:
    cCourse = item.find_element_by_xpath('./div//div[@class="text"]/span[@class="text"]').text
    cCollege = item.find_element_by_xpath('./div//div[@class="school"]/a').text
    cSchedule = item.find_element_by_xpath('./div//div[@class="text"]/a/span').text
    cCourseStatus = item.find_element_by_xpath('./div//div[@class="course-status"]').text
    src = item.find_element_by_xpath('./div//div[@class="img"]/img').get_attribute("src")
    self.picSave(src, cCourse)
    self.db.insert(self.count, cCourse, cCollege, cSchedule, cCourseStatus, src)
    self.count += 1
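
The `self.picSave(src, cCourse)` helper is not shown in the post. A minimal sketch of what it might do, based on the requirement that pictures go into the imgs folder and are named after the course, follows. The function names `image_path` and `pic_save`, the character-replacement rule, and the `.jpg` fallback are all assumptions.

```python
import os
import re
import urllib.request


def image_path(course_name, src, folder="imgs"):
    """Build <folder>/<course name><ext> for an image URL.

    Characters that are unsafe in file names are replaced with "_".
    Falls back to ".jpg" when the URL path has no extension.
    """
    safe = re.sub(r'[\\/:*?"<>|]', "_", course_name).strip()
    ext = os.path.splitext(src.split("?", 1)[0])[1] or ".jpg"
    return os.path.join(folder, safe + ext)


def pic_save(src, course_name, folder="imgs"):
    """Download src into folder, named after the course; return the path."""
    os.makedirs(folder, exist_ok=True)
    path = image_path(course_name, src, folder)
    urllib.request.urlretrieve(src, path)
    return path
```

Sanitizing the name is worthwhile here because course titles often contain characters such as `:` or `/` that are not allowed in file names on every operating system.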

Turn the page:

nextPage = self.driver.find_element_by_xpath('//*[@id="j-coursewrap"]/div/div[2]/ul/li[4]/a')
# Turn the page unless the "next" link is disabled
if nextPage.get_attribute('class') != "th-bk-disable-gh":
    nextPage.click()
    # Run the crawl function again on the new page
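
The disabled-class check can also drive a loop that keeps crawling until the last page is reached. The sketch below is one possible arrangement under assumptions: `crawl_until_last_page`, `process_page`, and the `max_pages` safety cap are hypothetical names, while the class value `th-bk-disable-gh` comes from the code above.

```python
def crawl_until_last_page(driver, process_page, next_xpath,
                          disabled_class="th-bk-disable-gh", max_pages=100):
    """Process pages until the "next" link carries the disabled class.

    max_pages is a safety cap (assumption) against endless paging.
    Returns the number of pages processed.
    """
    pages = 0
    while pages < max_pages:
        process_page(driver)
        pages += 1
        # "xpath" is the locator strategy string behind By.XPATH
        next_link = driver.find_element("xpath", next_xpath)
        if next_link.get_attribute("class") == disabled_class:
            break  # last page reached
        next_link.click()
    return pages
```

A loop avoids the growing call stack that results when the crawl function calls itself once per page.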

  • 2.3 Results:





Assignment ③

  • 3.1 Assignment content

    • Requirements: master big data related services and become familiar with using Xshell
      • Complete the following five tasks from the document "Huawei Cloud_Big Data Real-Time Analysis and Processing Experiment Manual - Flume Log Collection Experiment (Part) v2.docx". See the document for the specific steps.
      • Environment setup
        • Task 1: activate the MapReduce service
      • Real-time analysis development practice:
        • Task 1: generate test data with a Python script
        • Task 2: configure Kafka
        • Task 3: install the Flume client
        • Task 4: configure Flume to collect data
  • 3.2 Results

    • Task 1: generate test data with a Python script
      Run the Python script

      View the generated data

    • Task 2: configure Kafka
      Run the source command

    • Task 3: install the Flume client
      Finally, install Flume

      Restart the service

    • Task 4: configure Flume to collect data

Added by horsleyaa on Tue, 04 Jan 2022 08:31:27 +0200