Calling a Python crawler from Java is an interesting exercise, but most of the solutions you can find online are unreliable. I spent two days of hands-on practice and finally solved the problem completely.
Three problems have to be solved when Java calls a Python crawler:
The parameter-passing problem
The Python script reads the parameter that Java passes on the command line via sys.argv[1], as sketched below.
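For illustration, here is a minimal, hypothetical Java-side sketch of that handoff (the interpreter path, script name, and keyword are placeholders): each command-array element after the script path becomes one sys.argv entry inside the script, so a keyword containing spaces arrives as a single parameter.

import java.io.IOException;

public class ArgvSketch {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Placeholder paths: adjust to your interpreter and script location.
        ProcessBuilder pb = new ProcessBuilder(
                "/usr/bin/python3",  // interpreter
                "bdindex.py",        // its path becomes sys.argv[0] inside the script
                "one two three");    // becomes sys.argv[1], spaces intact
        pb.start().waitFor();
    }
}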
The dependency-package problem
Build a virtual environment with virtualenv, install all of the crawler's dependencies into it (requests and lxml here, using that environment's pip), and execute the script with the virtual environment's own Python interpreter. This cleanly solves the dependency problem.
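The Java side then only needs to point at that environment's interpreter. As a sketch (the ~/.virtualenvs/py3 location matches my setup and is an assumption; adjust it to yours), you can resolve the path once and fail fast if the environment is missing:

import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class VenvCheck {
    public static void main(String[] args) {
        // Assumed virtualenv location, created beforehand with virtualenv + pip
        Path venvPython = Paths.get(System.getProperty("user.home"),
                ".virtualenvs", "py3", "bin", "python3");
        if (!Files.isExecutable(venvPython)) {
            throw new IllegalStateException(
                    "virtualenv interpreter not found: " + venvPython);
        }
        System.out.println("using interpreter: " + venvPython);
    }
}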
Passing data between Java and Python
The Python script saves the crawled content to a file; after the file has been written, the Java program reads the content back.
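On the Java side this is a plain file read. A minimal sketch, assuming the naming convention of the script further below (the keyword is reused as the file name under ./data/):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadResult {
    public static void main(String[] args) throws IOException {
        // The python script writes its results to ./data/<keyword>.txt
        String keyword = "Hello"; // placeholder: use the keyword you passed
        String content = new String(
                Files.readAllBytes(Paths.get("data", keyword + ".txt")),
                StandardCharsets.UTF_8);
        System.out.println(content);
    }
}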
The complete Java caller (BPython.java):

import java.io.File;
import java.io.IOException;

public class BPython {
    public static void main(String[] args) {
        // Get the current path
        File directory = new File(""); // set to the current folder
        String dirPath = directory.getAbsolutePath(); // absolute path
        Process proc;
        try {
            // python interpreter path (inside the virtual environment)
            String pyPath = "/Users/lijianzhao/.virtualenvs/py3/bin/python3";
            // python script file path
            String pyFilePath = dirPath + "/bdindex.py";
            System.out.println(pyFilePath);
            // Parameter passed to python; the String[] form of exec keeps a
            // keyword containing spaces as a single sys.argv entry
            String argv1 = "Under one man";
            proc = Runtime.getRuntime().exec(new String[]{pyPath, pyFilePath, argv1});
            // Wait until the crawler has finished writing its output file
            proc.waitFor();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
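One caveat with this approach: everything the script print()s goes to the child process's stdout, which the code above never reads. A sketch of draining it (the command is a placeholder; reuse the interpreter and script paths from above), so the crawler's output becomes visible and the pipe buffer cannot fill up on chattier scripts:

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

public class DrainOutput {
    public static void main(String[] args) throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "/usr/bin/python3", "bdindex.py", "Hello"); // placeholder command
        pb.redirectErrorStream(true); // fold stderr into stdout
        Process proc = pb.start();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(proc.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line); // echo everything the crawler prints
            }
        }
        int exit = proc.waitFor(); // non-zero normally signals a failure
        System.out.println("python exited with code " + exit);
    }
}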
The Python crawler script (bdindex.py):

# coding=utf-8
import os
import sys

import requests
from lxml import etree


def getData(wd):
    # Set the user-agent header (put sheepskin on the wolf)
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }
    # Construct the target URL
    target_url = "https://www.baidu.com/s?wd=" + str(wd)
    # Fetch the response
    data = requests.get(target_url, headers=headers)
    # Parse into an xpath-queryable tree
    data_etree = etree.HTML(data.content)
    # Extract the list of result blocks
    content_list = data_etree.xpath('//div[@id="content_left"]/div[contains(@class, "result c-container")]')
    # The string to return
    result = ""
    # Collect title, abstract and link for each result
    for content in content_list:
        result_title = "<Title> "
        bd_title = content.xpath('.//h3/a')
        for bd_t in bd_title:
            result_title += bd_t.xpath('string(.)')
        result_content = "<content>"
        bd_content = content.xpath('.//div[@class="c-abstract"]')
        for bd_c in bd_content:
            result_content += bd_c.xpath('string(.)')
        try:
            result_link = "<link>" + str(list(content.xpath('.//div[@class="f13"]/a[@class="c-showurl"]/@href'))[0])
        except IndexError:
            result_link = "<link>: empty"
        result_list = [result_title, "\n", result_content, "\n", result_link]
        for result_l in result_list:
            result += str(result_l)
    return result


# Save the crawled content as a file
def saveDataToFile(file_name, data):
    # Create the output folder if it does not exist yet
    if not os.path.exists("./data/"):
        os.makedirs("./data/")
    with open("./data/" + file_name + ".txt", "wb+") as f:
        f.write(data.encode())


def main():
    # Read the keyword passed by Java; fall back to a default
    wd = ""
    try:
        wd = sys.argv[1]
    except IndexError:
        pass
    if len(wd) == 0:
        wd = "Hello"
    str_data = getData(wd)
    saveDataToFile(wd, str_data)
    print("end")


if __name__ == '__main__':
    main()
Summary
Python is probably the best language there is for writing crawlers. The next time you need to collect data, you can simply drive a Python crawler from Java. Life is short, I use Python.