At the end of the previous post, I mentioned that the port numbers on the Whole-Network Proxy IP site are obfuscated. This post shows how to decode them. If you find it useful, feel free to recommend it.
1. JS obfuscation on the Whole-Network Proxy IP site
First, open the Whole-Network Proxy IP site, open the browser's developer tools, and inspect a port number. Nothing looks wrong.
If you have crawled the proxies on this site before, you know it is not that simple. If you haven't, it is easy to check: right-click, view the page source, and search for "port". You will find the following:
Obviously this is not the port number displayed on the page, so how do we get the real one?
Solution:
First, find this JS file: http://www.goubanjia.com/theme/goubanjia/javascript/pde.js?v=1.0 Opening it shows the following:
JS code this dense is a headache to read, but notice that the whole thing is a single eval call, so can we unpack it? For that you need a tool: the Script House online tool. Paste the JS code in:
Then click Decode:
The result is another eval call, so decode again:
At this point the code is much simpler than the original, but it is still not very readable, so we format it first.
var _$ = [
    "\x2e\x70\x6f\x72\x74",
    "\x65\x61\x63\x68",
    "\x68\x74\x6d\x6c",
    "\x69\x6e\x64\x65\x78\x4f\x66",
    "\x2a",
    "\x61\x74\x74\x72",
    "\x63\x6c\x61\x73\x73",
    "\x73\x70\x6c\x69\x74",
    "\x20",
    "",
    "\x6c\x65\x6e\x67\x74\x68",
    "\x70\x75\x73\x68",
    "\x41\x42\x43\x44\x45\x46\x47\x48\x49\x5a",
    "\x70\x61\x72\x73\x65\x49\x6e\x74",
    "\x6a\x6f\x69\x6e",
    ""
];
$(function() {
    $(_$[0])[_$[1]](function() {
        var a = $(this)[_$[2]]();
        if (a[_$[3]](_$[4]) != -0x1) {
            return;
        }
        var b = $(this)[_$[5]](_$[6]);
        try {
            b = b[_$[7]](_$[8])[0x1];
            var c = b[_$[7]](_$[9]);
            var d = c[_$[10]];
            var f = [];
            for (var g = 0x0; g < d; g++) {
                f[_$[11]](_$[12][_$[3]](c[g]));
            }
            $(this)[_$[2]](window[_$[13]](f[_$[14]](_$[15])) >> 0x3);
        } catch (e) {}
    });
});
You can see a list and a function. The function is presumably the one that does the obfuscation, but the strings in the list are written as hexadecimal escapes and need to be decoded first (Python can do this step):
_ = ["\x2e\x70\x6f\x72\x74", "\x65\x61\x63\x68", "\x68\x74\x6d\x6c", "\x69\x6e\x64\x65\x78\x4f\x66", "\x2a",
"\x61\x74\x74\x72", "\x63\x6c\x61\x73\x73", "\x73\x70\x6c\x69\x74", "\x20", "", "\x6c\x65\x6e\x67\x74\x68",
"\x70\x75\x73\x68", "\x41\x42\x43\x44\x45\x46\x47\x48\x49\x5a", "\x70\x61\x72\x73\x65\x49\x6e\x74",
"\x6a\x6f\x69\x6e", ""
]
# Python resolves the \x escapes when it parses the string literals above,
# so printing the list is enough; no encode/decode round trip is needed.
print(_)
# ['.port', 'each', 'html', 'indexOf', '*', 'attr', 'class', 'split', ' ', '', 'length', 'push', 'ABCDEFGHIZ', 'parseInt', 'join', '']
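The listing above works because the array was pasted into Python as string literals. When the array is still raw text read from the downloaded JS file, the escapes have to be decoded explicitly. The following is a minimal sketch of that case; the helper name `decode_hex_escapes` and the shortened `raw` sample are assumptions made for illustration.

```python
import re


def decode_hex_escapes(js_text):
    """Replace every \\xNN escape in raw JavaScript text with its character."""
    return re.sub(
        r"\\x([0-9a-fA-F]{2})",
        lambda m: chr(int(m.group(1), 16)),
        js_text,
    )


# Raw text as it would appear in the file; the r prefix keeps the
# backslashes literal instead of letting Python decode them.
raw = r'"\x2e\x70\x6f\x72\x74", "\x41\x42\x43\x44\x45\x46\x47\x48\x49\x5a"'
decoded = decode_hex_escapes(raw)
print(decoded)  # ".port", "ABCDEFGHIZ"
```

This only handles two-digit `\xNN` escapes, which is the only kind that appears in this script.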
Then substitute the elements of this list into the JS function above, which gives the following:
$(function() {
    $(".port")["each"](function() {
        var a = $(this)["html"]();
        if (a["indexOf"]("*") != -0x1) {
            return;
        }
        var b = $(this)["attr"]("class");
        try {
            b = b["split"](" ")[0x1];
            var c = b["split"]("");
            var d = c["length"];
            var f = [];
            for (var g = 0x0; g < d; g++) {
                f["push"]("ABCDEFGHIZ"["indexOf"](c[g]));
            }
            $(this)["html"](window["parseInt"](f["join"]("")) >> 0x3);
        } catch (e) {}
    });
});
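Substituting the list elements by hand is error-prone, so the step can also be scripted. This is a rough sketch, assuming the references always have the form `_$[n]` with a decimal index as in the formatted code; `inline_lookups` is a hypothetical helper name.

```python
import json
import re

# Decoded lookup table from the previous step.
table = ['.port', 'each', 'html', 'indexOf', '*', 'attr', 'class', 'split',
         ' ', '', 'length', 'push', 'ABCDEFGHIZ', 'parseInt', 'join', '']


def inline_lookups(js_text, table):
    """Replace each _$[n] reference with the quoted string it stands for."""
    def repl(match):
        index = int(match.group(1))
        # json.dumps yields a double-quoted string literal that is valid JS.
        return json.dumps(table[index])
    return re.sub(r"_\$\[(\d+)\]", repl, js_text)


line = "f[_$[11]](_$[12][_$[3]](c[g]));"
readable = inline_lookups(line, table)
print(readable)  # f["push"]("ABCDEFGHIZ"["indexOf"](c[g]));
```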
As you can see, this JS code finds every .port node, reads its class value, and takes the second class token. It splits that token into single letters, looks up the index of each letter in "ABCDEFGHIZ", joins the indices into a string, converts the string to an integer, and finally shifts the integer right by 3 bits. For example, "GEA" maps to the index string "640", and 640 shifted right by 3 bits is 80, which is the real port. Finally, here is the Python code that decodes the port numbers:
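from lxml import etree

# html is the page source of the proxy list, fetched beforehand
et = etree.HTML(html)
port_list = et.xpath('//*[contains(@class,"port")]/@class')
for port in port_list:
    # The second class token holds the encoded port, e.g. "GEA"
    port = port.split(' ')[1]
    num = ""
    for i in port:
        # Map each letter to its index in "ABCDEFGHIZ"
        num += str("ABCDEFGHIZ".index(i))
    # Shift right by 3 bits to recover the real port
    print(int(num) >> 3)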
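The decoding rule can be checked without fetching the page. The sketch below wraps it in a function and adds a hypothetical inverse, `encode_port`, purely to round-trip the arithmetic. The class value "port GEA" and the assumption that the three low bits are zero are illustrative; the site may fill those bits differently.

```python
ALPHABET = "ABCDEFGHIZ"  # the letter at position n stands for the digit n


def decode_port(class_attr):
    """Decode a class value such as 'port GEA' into the real port number."""
    token = class_attr.split(" ")[1]
    digits = "".join(str(ALPHABET.index(ch)) for ch in token)
    return int(digits) >> 3


def encode_port(port):
    """Hypothetical inverse: shift left by 3 bits, then map digits to letters."""
    return "".join(ALPHABET[int(d)] for d in str(port << 3))


print(decode_port("port GEA"))  # 80
print(encode_port(8080))        # GEGEA
```

Because the right shift discards the three lowest bits, several encoded tokens decode to the same port, so `encode_port` yields only one of the possible encodings.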
2. Replacing Text with Images
Some readers commented that certain websites render text as images as an anti-crawler measure. This time I found such a site: Newegg. Click any product to see:
Open the developer tools and inspect the price. Sure enough, the price is rendered as an image:
Solution:
I found two ways to get the price, one simple and one harder. The simple way uses a regular expression: other parts of the page source contain the product's basic information, such as its name and price, so we can match them directly. The code is as follows:
import re
import requests


url = "https://www.newegg.cn/Product/A36-125-E5L.htm?neg_sp=Home-_-A36-125-E5L-_-CountdownV1"
res = requests.get(url)
# Capture the product name and price from the inline data in the page source
result = re.findall(r"name:'(.+?)', price:'(.+?)'", res.text)
print(result)
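To see what the pattern captures without hitting the site, it can be run against a small string. The `sample` fragment below is hypothetical and only imitates the `name:'...', price:'...'` shape that the regular expression expects; the real page source may differ. `extract_name_price` is likewise an illustrative helper.

```python
import re

PATTERN = re.compile(r"name:'(.+?)', price:'(.+?)'")


def extract_name_price(page_source):
    """Return (name, price) from the first match, or None if nothing matches."""
    match = PATTERN.search(page_source)
    return match.groups() if match else None


# Hypothetical fragment standing in for the inline product data.
sample = "var product = {name:'Example Laptop', price:'6999.00'};"
print(extract_name_price(sample))  # ('Example Laptop', '6999.00')
```

Returning None on a miss makes it easy to notice when the site changes its markup and the pattern stops matching.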
The harder way is to download the image and recognize the text in it. Because the image is sharp, with no distortion or interference lines, OCR can read it directly. This approach requires Tesseract-OCR, though, and installing that tool is more troublesome. The code for this approach is as follows:
import requests
import pytesseract
from PIL import Image
from lxml import etree


url = "https://www.newegg.cn/Product/A36-125-E5L.htm?neg_sp=Home-_-A36-125-E5L-_-CountdownV1"
res = requests.get(url)
et = etree.HTML(res.text)
# Locate the price image and save it locally
img_url = et.xpath('//*[@id="priceValue"]/span/strong/img/@src')[0]
with open('price.png', 'wb') as f:
    f.write(requests.get(img_url).content)
# Point pytesseract at the local Tesseract-OCR executable (adjust the path for your machine)
pytesseract.pytesseract.tesseract_cmd = 'E:/Python/Tesseract-OCR/tesseract.exe'
text = pytesseract.image_to_string(Image.open('price.png'))
print(text)
# 6999.00
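OCR output is plain text and may carry stray whitespace or separators, so it usually needs a small cleanup step before it can be used as a number. The following is a minimal sketch of such a step; `parse_price` is a hypothetical helper, and the noisy inputs are assumptions rather than output observed from this site.

```python
import re
from decimal import Decimal


def parse_price(ocr_text):
    """Pull the first decimal number out of OCR output, or return None."""
    # Drop thousands separators and spaces that OCR may insert between digits.
    cleaned = ocr_text.replace(",", "").replace(" ", "")
    match = re.search(r"\d+(?:\.\d+)?", cleaned)
    return Decimal(match.group()) if match else None


print(parse_price(" 6999.00\n"))  # 6999.00
```

Decimal is used instead of float so that the two decimal places of the price are preserved exactly.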