Last month, my tutor asked me to write a small crawler; here I'd like to share the process.
Requirement analysis
Target URL: AWS Fargate pricing - serverless container service - AWS cloud service (amazon.com)
The task is to crawl the Fargate Spot price from this page. The price differs by region and changes every few hours, so the page needs to be crawled on a schedule.
The crawled data is stored in an Excel file so that price changes can be analyzed later.
Web page analysis
By analyzing the structure of the web page, I found that the prices for the different regions are not shown directly in the HTML; they only appear in the HTML after a click.
I then checked whether the data comes from an asynchronous (AJAX) request and found that it does not.
Continuing to analyze the JS function triggered by the button click, I found that all the prices are obfuscated in a JS file, which is hard to crack.
Therefore we can't crawl the data with simple page parsing, and can only use Selenium (it's actually an automated testing tool, but it works fine for crawling too).
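To give a concrete idea of the approach before the full script below, here is a minimal sketch of the Selenium workflow: open the page, click a region in the drop-down, then read the price block that the page's JS renders. The locators in this sketch are simplified placeholders; the real XPath and class name used for the AWS page appear in the source code section.

from selenium import webdriver

# minimal sketch; '//ul' is a placeholder locator, not the real one
driver = webdriver.Chrome()
driver.get('https://aws.amazon.com/cn/fargate/pricing/')
dropdown = driver.find_element_by_xpath('//ul')        # region drop-down (placeholder)
regions = dropdown.find_elements_by_tag_name('li')
regions[0].click()                                     # clicking triggers the page's JS
price_block = driver.find_element_by_class_name('aws-plc-content')
print(price_block.text)                                # the price text is now in the DOM
driver.quit()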
Related technology
Based on the requirements and the web page structure analysis, the required technologies can be determined
Environment:
An Ubuntu ECS instance (other distributions are fine too)
python3
Chrome (installed on the server)
chromedriver (downloaded, matching the Chrome version)
Third-party libraries:
selenium (for driving the web page)
openpyxl (for working with Excel)
Basic usage of the related libraries
openpyxl: [python] Common methods of openpyxl (Hurpe, CSDN blog)
selenium: Introduction and practice of Python + Selenium basics (Jianshu, jianshu.com)
Some notes on python + selenium + headless Chrome (SegmentFault)
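The links above cover the details; as a tiny illustration of the openpyxl side (a sketch with example values only), creating a workbook and appending a row looks like this:

from openpyxl import Workbook

wb = Workbook()              # a new workbook with one empty sheet
ws = wb.active
ws.append(['time', 'vCPU price', 'GB price'])              # header row
ws.append(['2021-01-01 00:00:00', 0.012144, 0.0013335])    # one data row (example values)
print(ws['A2'].value)        # read a cell back
wb.save('demo.xlsx')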
Deployment
Using Selenium on Linux (environment deployment) - kaishuai (cnblogs.com)
During deployment you will run into quite a few Linux problems; look them up as you go, it's not difficult.
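Once Chrome, chromedriver and the Python libraries are installed on the server, a quick headless smoke test like the following (just a sketch) helps confirm the environment works before running the real crawler:

from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')
option.add_argument('--disable-gpu')
option.add_argument('--no-sandbox')   # often needed when Chrome runs as root on a server

driver = webdriver.Chrome(options=option)
driver.get('https://www.example.com')
print(driver.title)                   # if a title prints, headless Chrome is working
driver.quit()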
Source code
import datetime
import time
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium.webdriver.support.ui import Select
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from openpyxl import Workbook
from openpyxl import load_workbook
import re


def get_driver(url):
    # If you run a standalone Selenium server, use Remote instead:
    # driver = webdriver.Remote(
    #     command_executor='http://127.0.0.1:4444/wd/hub',
    #     desired_capabilities={'browserName': 'chrome',
    #                           'version': '2',
    #                           'javascriptEnabled': True})
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    # Without the next options, element positioning sometimes fails
    option.add_argument("--start-maximized")
    option.add_argument('--window-size=2560,1440')
    chrome = webdriver.Chrome(options=option)
    chrome.implicitly_wait(3)  # seconds
    chrome.get(url)
    # print(chrome.get_window_size())
    # chrome.maximize_window()
    # print(chrome.get_window_size())
    # chrome.set_window_size(2560, 1440)
    # print(chrome.get_window_size())
    return chrome


def get_data(driver):
    region_data = []
    price_data = []
    try:
        select = driver.find_element_by_xpath('/html/body/div[2]/main/div[2]/div/div[5]/div/div[1]/div/ul')
        region_list = select.find_elements_by_tag_name('li')
    except Exception as e:
        print(e)
        return None, None
    else:
        print('Number of regions:', len(region_list))
        for i in range(23):  # the page listed 23 regions at the time
            select.click()
            element = driver.find_element_by_class_name("aws-plc-content")
            # print(region_list[i].text)
            # print(element.text)
            region_data.append(region_list[i].text)
            price_data.append(element.text)
            # print('---')
            # time.sleep(1)
            region_list[i].click()
        return region_data, price_data


def save(data1, data2):
    # Data cleaning
    title = []
    price_data = []
    price_data.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))
    # The title row only needs to be written once
    # for region in data1:
    #     title.append('')
    #     title.append(region)
    #     print(region)
    for price in data2:
        p = price.split('\n')
        price1 = re.search(r'[\d.]+', p[1])
        price2 = re.search(r'[\d.]+', p[2])
        price_data.append(price1.group())
        price_data.append(price2.group())
    # print(len(title))
    print('Data written this time:', len(price_data))
    # Start writing data
    wb = load_workbook('price.xlsx')
    table = wb.active
    # table.append(title)
    table.append(price_data)
    wb.save('price.xlsx')


if __name__ == '__main__':
    while True:
        url = 'https://aws.amazon.com/cn/fargate/pricing/'  # https must be included
        driver = get_driver(url)
        data1, data2 = get_data(driver)
        # print(data1)
        # print(data2)
        if data1 is not None and data2 is not None:
            save(data1, data2)
            print('This crawl succeeded!')
            time.sleep(3600)  # crawl once per hour
        else:
            print('This crawl failed, retrying in 5 minutes')
            time.sleep(300)
        driver.quit()

# Sample of the scraped price text for one region:
# Eastern United States (Northern Virginia)
# Price
# 0.012144 USD per vCPU per hour
# 0.0013335 USD per GB per hour
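One thing to note: save() calls load_workbook('price.xlsx'), so the workbook must already exist before the first run. A one-off snippet like the following (header row optional) can create it:

from openpyxl import Workbook

wb = Workbook()
# the crawler only appends rows, so an empty sheet is enough
wb.save('price.xlsx')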
Result display
Finally, I use nginx as a web server so the Excel file can be accessed at any time.
Accessing it directly this way makes it easy to check whether the crawler has died (it crashed several times along the way, which is why the try/except was added).