Getting Amazon reviews with Python

Last time we scraped Amazon's product information, so naturally the next step is the reviews. User reviews directly reflect whether a product is worth buying, and Amazon's star ratings show how those opinions are weighted.

Amazon's review section consists of a user ID, a star rating, a review title, the region and date, and the review body. That is what we will extract this time.
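For reference, this is the shape of the record we will build for each review (the keys match the code in section 3; the values here are placeholders):

# One record per review (placeholder values)
data = {
    'username': '...',  # reviewer's display name
    'point': '...',     # star rating text
    'date': '...',      # region and date line
    'review': '...',    # review title
    'detail': '...',    # review body HTML
    'image': '...',     # review image block, or "no image"
}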

Test link: https://www.amazon.it/product-reviews/B08GHGTGQ2/ref=cm_cr_arp_d_paging_btm_14?ie=UTF8&pageNumber=14&reviewerType=all_reviews&pageSize=10&sortBy=recent
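The ASIN embedded in that link (B08GHGTGQ2) identifies the product. If you want to point the scraper at a different product, here is a quick sketch for pulling the ASIN out of such a URL (the regex is my own, not from the original code):

import re

def extract_asin(url):
    # ASINs are 10 alphanumeric characters after /product-reviews/ or /dp/
    m = re.search(r'/(?:product-reviews|dp)/([A-Z0-9]{10})', url)
    return m.group(1) if m else None

print(extract_asin('https://www.amazon.it/product-reviews/B08GHGTGQ2/'))  # B08GHGTGQ2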

1. Analyzing Amazon's review requests

First, open the Network tab in the browser's developer tools, clear it, and load the page.

You will see a GET request under the Doc tab that contains the review information we want.

But not all of the review data is here. Scroll down the page and there is a paging button.

Click the paging button to request the next page: a new request appears under the Fetch/XHR tab, while no new GET request shows up under Doc. So all the reviews are loaded through XHR requests.

Grab the URL and payload of this POST request; the payload contains the parameters that control paging. We have found the real review request.

The response is a pile of raw, unprocessed text. Within it, the segments containing data-hook=\"review\" are the ones that hold actual reviews. With the analysis done, let's write the request step by step.

2. Getting the Amazon reviews

First, assemble the POST parameters the request needs, so that paging can be automated later, then POST to the link with those parameters:

import requests

headers = {
    'authority': 'www.amazon.it',
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
}

page = 1
post_data = {
    "sortBy": "recent",
    "reviewerType": "all_reviews",
    "formatType": "",
    "mediaType": "",
    "filterByStar": "",
    "filterByLanguage": "",
    "filterByKeyword": "",
    "shouldAppend": "undefined",
    "deviceType": "desktop",
    "canShowIntHeader": "undefined",
    "pageSize": "10",
    "asin": "B08GHGTGQ2",
}
# Assign the paging payload parameters (note: no trailing commas, which would turn the values into tuples)
post_data["pageNumber"] = page
post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{page}"
post_data["scope"] = f"reviewsAjax{page}"
# Build the paging link
spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{page}'
res = requests.post(spiderurl, headers=headers, data=post_data)
if res and res.status_code == 200:
    res = res.content.decode('utf-8')
    print(res)

Now that we have this pile of raw text, let's process it.

3. Processing the reviews

Looking at the raw output, the sections are separated by "&&&", and within each section the fields are separated by "," (a quote-comma-quote sequence).

So we use Python's split method to break the string into lists:

# Return value string processing
contents = res.split('&&&')
for content in contents:
    infos = content.split('","')

Splitting on the quote-comma-quote sequence produces a new list whose last element is the review content. Strip the trailing "], the escaped \n sequences, and the leftover backslashes, and the fragment can then be parsed with CSS/XPath selectors:

from scrapy import Selector

for content in contents:
    infos = content.split('","')
    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
    # Keep only the segments that actually contain a review
    if 'data-hook="review"' in info:
        sel = Selector(text=info)
        data = {}
        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # star rating
        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # region and date
        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
        data['image'] = image if image else "no image"  # review image
        print(data)
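To sanity-check the selectors, here is a made-up, heavily simplified review fragment run through the same XPath expressions (real Amazon markup has many more wrappers and attributes):

from scrapy import Selector

# Hypothetical fragment, for illustration only
html = '''<div data-hook="review">
  <span class="a-profile-name">Mario</span>
  <span class="a-icon-alt">5,0 su 5 stelle</span>
  <span data-hook="review-title"><span>Ottimo prodotto</span></span>
  <span data-hook="review-date">Recensito in Italia il 1 gennaio 2022</span>
  <span data-hook="review-body">Funziona benissimo.</span>
</div>'''

sel = Selector(text=html)
print(sel.xpath('//span[@class="a-profile-name"]/text()').extract_first())         # Mario
print(sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first())  # Ottimo prodotto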

4. Code integration

4.1 Proxy settings

A stable IP proxy is the most powerful tool for data acquisition. At the moment there is no stable access to Amazon from within China, and connections often fail. Here I use ipidea proxies for Amazon Italy; they can be obtained through username/password authentication or through an API, and the speed is very stable.

Address: http://www.ipidea.net/?utm-source=csdn&utm-keyword=?wb

The proxy is acquired with the following method:

    # Get a proxy IP from the API
    def getApiIp(self):
        # Fetch exactly one IP, located in Italy
        api_url = 'http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=it&port=1'
        try:
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            else:
                print('Failed to acquire a proxy')
        except Exception:
            print('Failed to acquire a proxy')
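To confirm the proxy is actually being used, fetch your apparent IP through it; httpbin.org is just one convenient echo service, and the proxy address below is a placeholder:

import requests

# Placeholder proxy -- in practice, pass the dict returned by getApiIp()
proxies = {
    'http': 'http://203.0.113.10:8080',
    'https': 'http://203.0.113.10:8080',
}
res = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=5)
print(res.json())  # should print the proxy's IP, not your own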

4.2 Paging with a while loop

A while loop turns the pages. The reviews run to at most 99 pages, so after page 99 we break out of the loop:

    def getPLPage(self):
        while True:
            # Assign the paging payload parameters (note: no trailing commas, which would create tuples)
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # Build the paging link
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST', '1')  # self-encapsulated request method; '1' selects the API proxy
            if res:
                res = res.content.decode('utf-8')
                # Process the raw response string
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
                    # Keep only the segments that actually contain a review
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # star rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # region and date
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "no image"  # review image
                        print(data)
            if self.page <= 99:
                print('Next Page')
                self.page += 1
            else:
                break
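The loop above only prints each data dict. If you want to keep the results, here is a minimal sketch that writes the collected dicts to a CSV file (the reviews list and file name are my own additions, not part of the original code):

import csv

# Assumption: reviews is a list of the data dicts collected in the loop above,
# e.g. via reviews.append(data) instead of print(data)
reviews = []
fieldnames = ['username', 'point', 'date', 'review', 'detail', 'image']
with open('reviews.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(reviews)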

The final integrated code:

# coding=utf-8
import requests
from scrapy import Selector

class getReview():
    page = 1
    headers = {
        'authority': 'www.amazon.it',
        "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
        "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/97.0.4692.71 Safari/537.36",
    }
    post_data = {
        "sortBy": "recent",
        "reviewerType": "all_reviews",
        "formatType": "",
        "mediaType": "",
        "filterByStar": "",
        "filterByLanguage": "",
        "filterByKeyword": "",
        "shouldAppend": "undefined",
        "deviceType": "desktop",
        "canShowIntHeader": "undefined",
        "pageSize": "10",
        "asin": "B08GHGTGQ2",
    }
    # The asin value in post_data is currently hard-coded; it comes from the link
    # "https://www.amazon.it/product-reviews/B08GHGTGQ2?ie=UTF8&pageNumber=1&reviewerType=all_reviews&pageSize=10&sortBy=recent"
    # Since the asin may change per product, it can also be taken from the product page's GET request.

    def getPLPage(self):
        while True:
            # Assign the paging payload parameters (note: no trailing commas, which would create tuples)
            self.post_data["pageNumber"] = self.page
            self.post_data["reftag"] = f"cm_cr_getr_d_paging_btm_next_{self.page}"
            self.post_data["scope"] = f"reviewsAjax{self.page}"
            # Build the paging link
            spiderurl = f'https://www.amazon.it/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_next_{self.page}'
            res = self.getRes(spiderurl, self.headers, '', self.post_data, 'POST', '1')  # self-encapsulated request method; '1' selects the API proxy
            if res:
                res = res.content.decode('utf-8')
                # Process the raw response string
                contents = res.split('&&&')
                for content in contents:
                    infos = content.split('","')
                    info = infos[-1].replace('"]', '').replace('\\n', '').replace('\\', '')
                    # Keep only the segments that actually contain a review
                    if 'data-hook="review"' in info:
                        sel = Selector(text=info)
                        data = {}
                        data['username'] = sel.xpath('//span[@class="a-profile-name"]/text()').extract_first()  # username
                        data['point'] = sel.xpath('//span[@class="a-icon-alt"]/text()').extract_first()  # star rating
                        data['date'] = sel.xpath('//span[@data-hook="review-date"]/text()').extract_first()  # region and date
                        data['review'] = sel.xpath('//span[@data-hook="review-title"]/span/text()').extract_first()  # review title
                        data['detail'] = sel.xpath('//span[@data-hook="review-body"]').extract_first()  # review body
                        image = sel.xpath('//div[@class="review-image-tile-section"]').extract_first()
                        data['image'] = image if image else "no image"  # review image
                        print(data)
            if self.page <= 99:
                print('Next Page')
                self.page += 1
            else:
                break

    # Get a proxy IP from the API
    def getApiIp(self):
        # Fetch exactly one IP, located in Italy
        api_url = 'http://tiqu.ipidea.io:81/abroad?num=1&type=2&lb=1&sb=0&flow=1&regions=it&port=1'
        try:
            res = requests.get(api_url, timeout=5)
            if res.status_code == 200:
                api_data = res.json()['data'][0]
                proxies = {
                    'http': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                    'https': 'http://{}:{}'.format(api_data['ip'], api_data['port']),
                }
                print(proxies)
                return proxies
            else:
                print('Failed to acquire a proxy')
        except Exception:
            print('Failed to acquire a proxy')

    # The dedicated request method: tries up to three times through a proxy, returns None after three failures
    def getRes(self, url, headers, proxies, post_data, method, check):
        for i in range(3):
            # If no proxy was passed in, fetch one (check == '1' selects the API proxy;
            # getAccountIp, the username/password variant, is not shown in this article)
            if not proxies:
                proxies = self.getApiIp() if check == '1' else self.getAccountIp()
            try:
                # POST through the proxy
                if method == 'POST':
                    res = requests.post(url, headers=headers, data=post_data, proxies=proxies)
                # GET through the proxy
                else:
                    res = requests.get(url, headers=headers, proxies=proxies)
                if res:
                    return res
            except Exception:
                print(f'Error on request attempt {i + 1}')
        return None

if __name__ == '__main__':
    getReview().getPLPage()

Summary

There were two pitfalls in collecting these Amazon reviews: finding the XHR request that actually loads the review data, and processing the raw response. Once analyzed, the acquisition itself is simple: find the correct request, and a stable IP proxy gives twice the result for half the effort; find the common pattern in the raw data, and the processing problem solves itself.
