Setting the scene
The weather is fine today. My girlfriend suddenly wants to buy a bra, but she has no idea which color looks best, so she ordered me to analyze which colors buyers generally favor, look into it, and promised a reward afterwards.
Keywords in this post
Coroutines 😊, IP blocked 😳, IP proxies 😏, proxies blocked 😭, a plant 🌿
Pick a "soft persimmon"
Open jd.com, directly search [bra] and select the one with the most comments
Enter the details page and slide down to see the product introduction and product evaluation.
Next, F12 opens the developer tool, selects Network, and then clicks all comments to grab the data package.
Open the url and find that it is really comment data.
Single page crawling
Let's write a small demo that crawls just this one page to see whether anything goes wrong.
import requests
import pandas as pd

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
}
params = {
    'callback': 'fetchJSON_comment98',
    'productId': '35152509650',
    'score': '0',
    'sortType': '6',
    'page': '5',
    'pageSize': '10',
    'isShadowSku': '0',
    'rid': '0',
    'fold': '1'
}
url = 'https://club.jd.com/comment/productPageComments.action?'
page_text = requests.get(url=url, headers=headers, params=params).text
page_text
Data processing
We get the data, but there are some useless characters at the front (and a couple at the end): the response is wrapped in a JSONP callback, so it cannot be parsed as JSON directly and needs to be trimmed first.
import json

# Strip the 'fetchJSON_comment98(' prefix (20 characters) and the trailing ');'
page_text = page_text[20: len(page_text) - 2]
data = json.loads(page_text)
data
Now that the data is in a proper format, we can start analyzing it and pull out the parts we need. Here we only extract id, color, comment, and time.
import pandas as pd

# DataFrame.append is deprecated (and removed in pandas 2.0), so collect the rows in a list first
rows = []
for info in data['comments']:
    rows.append({'id': info['id'],
                 'color': info['productColor'],
                 'comment': info['content'],
                 'time': info['creationTime']})
df = pd.DataFrame(rows, columns=['id', 'color', 'comment', 'time'])
df
Pagination
Next, we need to find the parameter that controls paging. Using the same method, grab the URLs of the second and third pages and compare them.
A quick comparison shows that the page field is the page number, which is what we will change when paging. It is also worth noting sortType, which literally means "sort type"; presumably the options are popularity, time, and so on. Testing shows that sortType=5 is definitely not sorted by time (it looks like popularity). We want the comments sorted by time, which makes later processing easier, so after trying a few values we find that sortType=6 sorts by review time. At the end of the URL there is rid=0; its purpose is unclear, but crawling the same URL with and without it returns identical results, so we can ignore it.
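For reference, this is roughly what the comparison looks like when the captured query strings are lined up (a sketch reconstructed from the parameters in the single-page demo; only the page value differs):

https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=35152509650&score=0&sortType=6&page=2&pageSize=10&isShadowSku=0&rid=0&fold=1
https://club.jd.com/comment/productPageComments.action?callback=fetchJSON_comment98&productId=35152509650&score=0&sortType=6&page=3&pageSize=10&isShadowSku=0&rid=0&fold=1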
Writing the code
Let me state the crawling results up front: I originally wanted to scrape 10,000 reviews, but too many requests got my IP blocked. I set up an IP pool and proxies, but they didn't hold out either, and with some effort I managed 1,000. There simply wasn't enough time; with enough time and enough IPs, scraping more would be easy. Testing showed that an IP block lasts no more than a day: the next day the script ran again and returned data. Now let's look at the main code.
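The grab function further down draws proxies from a proxy_list with random.choice, but the article does not show how that list is built. Here is a minimal sketch with placeholder addresses; in practice the entries would come from a proxy pool or a paid proxy service.

from random import choice

# Hypothetical proxy pool -- placeholder addresses, not real proxies
proxy_list = [
    '114.xxx.xxx.xxx:8080',
    '118.xxx.xxx.xxx:3128',
]

# Example: pick a random proxy for one request
# proxy = 'http://' + choice(proxy_list)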
Main scheduling function
Build the list of pages to crawl. Remember to limit the concurrency (especially on Windows, or an error will be raised), add the crawl tasks to tasks, and await them.
import asyncio

async def main(loop):
    # Pages to crawl
    page_list = list(range(0, 1000))
    # Limit concurrency
    semaphore = asyncio.Semaphore(500)
    # Create a task object for each page and add it to the task list
    tasks = [loop.create_task(get_page_text(page, semaphore)) for page in page_list]
    # Wait until all tasks complete
    await asyncio.wait(tasks)
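The article does not show the entry point; a minimal sketch of how main could be driven, matching the main(loop) signature above:

if __name__ == '__main__':
    loop = asyncio.get_event_loop()
    loop.run_until_complete(main(loop))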
Page grab function
The fetch logic is essentially the same as above, except that aiohttp issues the request, each request goes out through a random proxy from proxy_list, and SSL certificate verification is disabled. Once a page is fetched, it is parsed and saved straight away.
import aiohttp

async def get_page_text(page, semaphore):
    async with semaphore:
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.54 Safari/537.36'
        }
        params = {
            'callback': 'fetchJSON_comment98',
            'productId': '35152509650',
            'score': '0',
            'sortType': '6',
            'page': f'{page}',
            'pageSize': '10',
            'isShadowSku': '0',
            # 'rid': '0',
            'fold': '1'
        }
        url = 'https://club.jd.com/comment/productPageComments.action?'
        async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(ssl=False), trust_env=True) as session:
            while True:
                try:
                    async with session.get(url=url, proxy='http://' + choice(proxy_list),
                                           headers=headers, params=params, timeout=4) as response:
                        # The coroutine is suspended here while waiting for the I/O;
                        # the event loop can run other tasks in the meantime
                        page_text = await response.text()
                        # If the request did not succeed, retry (with another random proxy)
                        if response.status != 200:
                            continue
                        print(f"Page {page} crawled!")
                        break
                except Exception as e:
                    print(e)
                    # Exception caught, retry the request
                    continue
    return parse_page_text(page_text)
Parse and save function
Parse the JSON data and append it to a CSV file.
import json
from pathlib import Path

import pandas as pd

def parse_page_text(page_text):
    # Strip the JSONP wrapper, as in the single-page demo
    page_text = page_text[20: len(page_text) - 2]
    data = json.loads(page_text)
    rows = []
    for info in data['comments']:
        rows.append({'id': info['id'],
                     'color': info['productColor'],
                     'comment': info['content'],
                     'time': info['creationTime']})
    df = pd.DataFrame(rows, columns=['id', 'color', 'comment', 'time'])
    # Only write the header the first time the file is created
    header = not Path('Evaluation information.csv').exists()
    df.to_csv('Evaluation information.csv', index=False, mode='a', header=header)
    print('Saved')
Visualization
Color distribution
The top three are greyish pink, black, and nude. I'll say no more; draw your own conclusions.
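The article does not include the plotting code; here is a minimal sketch of how the color distribution could be derived from the saved CSV with pandas and matplotlib (the original chart may have been made differently).

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv('Evaluation information.csv')

# Count occurrences of each color and plot the ten most common
# (the color names are Chinese, so a font with Chinese support may be needed for the labels)
color_counts = df['color'].value_counts().head(10)
color_counts.plot(kind='bar')
plt.title('Color distribution')
plt.tight_layout()
plt.show()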
Word cloud of the reviews
The key words in the reviews are mostly descriptions of how it feels on, and "comfortable to wear" naturally comes first~
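Again, the word-cloud code is not shown in the article; a minimal sketch using jieba for Chinese word segmentation and the wordcloud package (library choice and font path are assumptions):

import jieba
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv('Evaluation information.csv')
text = ' '.join(df['comment'].astype(str))

# Segment the Chinese text into words, then build the cloud
words = ' '.join(jieba.lcut(text))
wc = WordCloud(font_path='simhei.ttf',  # a font that supports Chinese; the path is an assumption
               background_color='white',
               width=800, height=600).generate(words)
wc.to_file('comment_wordcloud.png')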
Confetti! Now it's time to report back to my girlfriend~
⭐ Past highlights you can't miss ⭐
Summary posts
❤️ 20,000 words on 50 high-frequency pandas operations [with pictures and text, worth bookmarking] ❤️
❤️ A painstaking MySQL summary, from getting started to obsession, with pictures and text (bookmarking recommended) ❤️
Tool posts
⭐ Python utilities: making cool QR codes (with GUI and source code) ⭐
❤️ Python utilities: making ID photos (with GUI and source code) ❤️
❤️ Girlfriend's desktop a mess of files? So I built her a file-organizing tool in Python ❤️
❤️ How to get the source code ❤️
Don't forget to like it~