English translation too hard? In a fit of rage, I wrote two translation scripts with web crawlers

📋 Personal profile

  • 💖 About the author: Hello, I'm Daniel 😜
  • Personal homepage: Hall owner a Niu 🔥
  • 🎉 Support me: like, favorite ⭐, and leave a comment
  • 📣 Series column: Python web crawler 🍁
  • 💬 Motto: so far my whole life has been written in failures, but that doesn't stop me from moving forward! 🔥


preface

Here it comes! Here it comes! As a programmer, I can't stand being unable to translate English sentences, so I had to write some scripts to do it for me!!!

Baidu Translate version (simple)

analysis

Open Baidu Translate, press F12, and select the All filter in the Network tab. As you type the text you want to translate, a request to sug shows up in the list. That is our interface URL, and its parameter is kw.

code

import requests

# Suggestion interface found in the Network tab
post_url = 'https://fanyi.baidu.com/sug'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
word = input('Please enter the text you want to translate (any language): ')
data = {
    'kw': word  # kw carries the text to look up
}
response = requests.post(url=post_url, data=data, headers=headers)
dic_obj = response.json()  # Parse the JSON body into a dictionary
results = dic_obj.get('data')
if results:
    print(results[0]['v'])
else:
    print('No translation found.')
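
The script above only reads the first entry of the reply. As a minimal offline sketch of handling the whole list, the helper below collects every entry and returns an empty list when nothing matches. The name `extract_meanings` and the sample payload are made up for illustration; only the `data` list and the `v` field appear in the script above, and the `k` field is an assumption about the reply shape.

```python
def extract_meanings(payload):
    """Return (keyword, meaning) pairs from a sug-style reply.

    Assumes the shape {"data": [{"k": ..., "v": ...}, ...]}. Only "data"
    and "v" are confirmed by the script above; "k" is an assumption.
    """
    entries = payload.get("data") or []
    return [(entry.get("k", ""), entry.get("v", "")) for entry in entries]


# Hypothetical sample payload, written by hand for illustration only.
sample = {"errno": 0, "data": [{"k": "dog", "v": "n. 狗"}]}

for keyword, meaning in extract_meanings(sample):
    print(keyword, meaning)
```

Returning a list rather than printing directly makes it easy to show every suggestion, or to fall back to another translator when the list is empty.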

result


Youdao Translate version (difficult)

Analysis (js reverse)

Press F12 to open the developer tools and, in the Network tab, look under XHR (where AJAX requests are listed) to find the following interface.

Then let's look at the parameters:


Comparing the two figures shows that i is the sentence we want to translate. The parameters underlined in green change from request to request, so we have to generate them ourselves. lts is a 13-digit timestamp. salt is one digit longer than lts and shares its first 13 digits, so it should be a salted timestamp (appending extra digits or characters to a value before hashing or encrypting it is called salting). Both parameters could be reproduced directly in Python, but to avoid unnecessary trouble, and for readers who are not sure how, we will look up the JS statements that produce them later and run that JS from Python.
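
For readers who would rather build these two parameters in Python, here is a minimal sketch. It assumes, as the analysis above suggests, that lts is a millisecond timestamp and that salt is lts with one random digit appended. The function name `make_lts_and_salt` is made up for illustration.

```python
import random
import time


def make_lts_and_salt():
    """Build the lts and salt form fields as strings."""
    # 13-digit millisecond timestamp, like (new Date).getTime() in JS.
    lts = str(int(time.time() * 1000))
    # Assumption: salt is lts followed by one random digit from 0 to 9.
    salt = lts + str(random.randint(0, 9))
    return lts, salt


lts, salt = make_lts_and_salt()
print(lts, salt)
```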

The sign here is 32 characters long, so it is probably produced by some hashing or encryption algorithm. The most common candidates are MD5 and RSA, and 32 characters is exactly the length of an MD5 hex digest. Let's do a global search in the JS files and reverse it:


The search turns up our old friend MD5, along with how the parameters are generated. In the figure, r in the JS is the timestamp, i in the JS is the salted timestamp, and sign is the MD5 hash of the string in parentheses. We still need to work out where e comes from, which we can do by setting a breakpoint and debugging.

We can see that e is the text we want to translate. Now all the parameters are clear. We could in fact get sign by calling MD5 from Python's hashlib module, but here we deliberately make things a little harder in order to practice JS reversing. I extracted the JS code for the MD5 process and put it on a net disk. You can download it yourself and use it in the code.

Link: https://pan.baidu.com/s/1aV1tEo35Oyw4TUExhJoXUA
Extraction code: waan
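
As mentioned above, sign can also be computed with hashlib instead of running the extracted JS. The sketch below only illustrates the idea. The concatenation order (client + text + salt + secret) and the `secret` placeholder are assumptions; the real string must be read from the parentheses in the site's JS, and the names `md5_hex` and `make_sign` are made up.

```python
import hashlib


def md5_hex(text):
    """Return the 32-character hex MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def make_sign(text, salt, client="fanyideskweb", secret="<secret-from-js>"):
    """Hash the joined parts the way the site's JS is assumed to.

    Assumption: the JS hashes client + text + salt + secret. Check the
    actual order and the secret string in the JS before relying on this.
    """
    return md5_hex(client + text + salt + secret)


# The salt here is an arbitrary 14-digit example value.
print(make_sign("hello", "16266621991925"))
```

If the output differs from the sign the browser sends for the same text and salt, the assumed layout or secret does not match the JS and needs adjusting.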

At the same time, to get past the anti-crawling checks, we need to send not only a User-Agent but also Cookie and Referer headers.

code

import requests
import execjs  # PyExecJS: runs JavaScript code from Python
import json
import jsonpath

class Youdao():
    def __init__(self,msg):
        # url
        self.url = 'https://fanyi.youdao.com/translate_o?smartresult=dict&smartresult=rule'
        # headers
        self.headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36',
            'Cookie': 'OUTFOX_SEARCH_USER_ID=-1032338096@10.169.0.102; OUTFOX_SEARCH_USER_ID_NCOO=39238000.072458096; JSESSIONID=aaak-QLUNaabh_wFWK8Qx; ___rl__test__cookies=1626662199192',
            'Referer': 'https://fanyi.youdao.com/'
        }
        self.msg = msg
        self.Formdata = None

    def js_Formdata(self):
        # lts: millisecond timestamp as a string
        r = execjs.eval('"" + (new Date).getTime()')
        # salt: the timestamp with one random digit (0-9) appended
        i = r + str(execjs.eval('parseInt(10 * Math.random(), 10)'))
        # Load the extracted MD5 script; the file is closed once it has been read
        with open('./youdao.js', 'r', encoding='utf-8') as f:
            ctx = execjs.compile(f.read())
        # Call getsign in youdao.js with the text to translate and the salt
        sign = ctx.call('getsign', self.msg, i)
        self.Formdata = {
            'i': self.msg,
            'from': 'AUTO',
            'to': 'AUTO',
            'smartresult': 'dict',
            'client': 'fanyideskweb',
            'salt': i,
            'sign': sign,
            'lts': r,
            'bv': 'f46e446c6db49492797b7d03ea1e82da',
            'doctype': 'json',
            'version': '2.1',
            'keyfrom': 'fanyi.web',
            'action': 'FY_BY_REALTlME',
        }

    def response(self):
        resp = requests.post(url=self.url, data=self.Formdata, headers=self.headers).text
        data = json.loads(resp)  # Parse the JSON body into a dictionary

        # Use jsonpath to extract the main translation
        if "translateResult" in data:
            k = jsonpath.jsonpath(data, '$..translateResult')[0][0][0]['tgt']
            print(k)

        # Print the extra entries only when the reply contains them
        if "smartResult" in data:
            print("Other translations:")
            lst = jsonpath.jsonpath(data, '$..entries')[0]
            for k in lst[1:]:
                k = k.replace("\r\n", "")
                print(k)

    def main(self):
        # Build the form data
        self.js_Formdata()
        # print(self.Formdata)
        # Send the request and print the response
        self.response()

if __name__ == '__main__':
    msg = input('Please enter the word or sentence you want to translate:')
    youdao = Youdao(msg)
    youdao.main()
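
The extraction step in response() can also be tried offline with plain dictionary access. The sketch below mirrors the paths the script reads (translateResult[0][0]['tgt'] and the entries list). The sample payload is written by hand for illustration, placing entries under smartResult is an assumption, and the name `parse_translation` is made up.

```python
def parse_translation(data):
    """Return (main_translation, extra_entries) from a reply dictionary."""
    main = None
    if "translateResult" in data:
        main = data["translateResult"][0][0]["tgt"]

    extras = []
    if "smartResult" in data:
        # Assumption: the entries list sits directly under smartResult.
        entries = data["smartResult"].get("entries", [])
        # Drop blank entries and strip the trailing line breaks.
        extras = [e.replace("\r\n", "") for e in entries if e.strip()]
    return main, extras


# Hypothetical sample payload, not a captured response.
sample = {
    "translateResult": [[{"src": "dog", "tgt": "狗"}]],
    "smartResult": {"type": 1, "entries": ["", "n. 狗\r\n", "vt. 跟踪\r\n"]},
}

main, extras = parse_translation(sample)
print(main)
for entry in extras:
    print(entry)
```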

result


epilogue

If you think this post is well written, please give it the triple: like, favorite, and comment!!! 💖💖💖

Keywords: Python JSON crawler Python crawler

Added by Snooble on Sat, 12 Feb 2022 13:16:57 +0200