Calling a Python crawler from Java

Calling a Python crawler from Java is an interesting exercise, but most of the solutions you find online are unreliable. I spent two days of hands-on practice and finally solved the problem completely.


[Figure: java-python]

Three problems have to be solved when Java calls a Python crawler:

Passing parameters from Java to Python

The Python script reads the parameter from sys.argv[1]; a minimal Java-side sketch follows.
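
As a minimal sketch of the launching side (interpreter path, script name, and keyword are reused from the full program below; adjust to your setup), ProcessBuilder takes the command as a list, so a keyword containing spaces still arrives as a single sys.argv[1] value:

import java.io.IOException;

public class RunCrawler {
    public static void main(String[] args) throws IOException, InterruptedException {
        // Each list element becomes one argv entry, so a keyword with
        // spaces is not split apart by the shell-style string parsing.
        ProcessBuilder pb = new ProcessBuilder(
                "/Users/lijianzhao/.virtualenvs/py3/bin/python3",  // venv interpreter
                "bdindex.py",                                      // crawler script
                "Under one man");                                  // sys.argv[1]
        pb.inheritIO();  // let the script's output show up on our console
        int exitCode = pb.start().waitFor();
        System.out.println("python exited with code " + exitCode);
    }
}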

Handling dependency packages

Create a virtual environment with virtualenv, install all the required packages into it, and run the script with that environment's Python interpreter. This solves the dependency problem cleanly; a fail-fast check is sketched below.
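
For example, assuming the environment was created with virtualenv py3 and the crawler's packages installed inside it with pip install requests lxml, the Java side only needs the path to that environment's interpreter. A small, hypothetical check before launching:

import java.io.File;

public class CheckInterpreter {
    public static void main(String[] args) {
        // Path to the virtualenv's interpreter; adjust to your own environment
        File venvPython = new File(System.getProperty("user.home"),
                ".virtualenvs/py3/bin/python3");
        if (!venvPython.canExecute()) {
            throw new IllegalStateException(
                    "virtualenv interpreter not found: " + venvPython);
        }
        System.out.println("Using interpreter: " + venvPython.getAbsolutePath());
    }
}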

Passing data between Java and Python

The Python script saves the crawled content to a file; once the file is written, the Java program reads it back.
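
A sketch of the reading side (the ./data/<keyword>.txt path mirrors what the Python script below writes):

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

public class ReadResult {
    public static void main(String[] args) throws IOException {
        String keyword = "Under one man";
        // The Python script saves its results to ./data/<keyword>.txt
        Path resultFile = Paths.get("data", keyword + ".txt");
        String content = new String(Files.readAllBytes(resultFile),
                StandardCharsets.UTF_8);
        System.out.println(content);
    }
}

The full Java program that launches the crawler: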

import java.io.File;
import java.io.IOException;

public class BPython {
    public static void main(String[] args) {
        // Get the current working directory
        File directory = new File("");                 // current folder
        String dirPath = directory.getAbsolutePath();  // absolute path
        Process proc;
        try {
            // Path to the virtualenv's Python interpreter
            String pyPath = "/Users/lijianzhao/.virtualenvs/py3/bin/python3";
            // Path to the Python script
            String pyFilePath = dirPath + "/bdindex.py";
            System.out.println(pyFilePath);
            // Parameter passed to Python; the String[] form of exec keeps
            // "Under one man" as a single argument despite its spaces
            String argv1 = "Under one man";
            proc = Runtime.getRuntime().exec(new String[]{pyPath, pyFilePath, argv1});
            proc.waitFor();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (InterruptedException e) {
            e.printStackTrace();
        }
    }
}
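
And the Python crawler itself, bdindex.py, which scrapes Baidu search results for the keyword and writes them to a file:
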
# coding=utf-8

import requests
from lxml import etree
import os
import sys

def getData(wd):
    # Request headers: a desktop-browser User-Agent so Baidu serves the normal page
    headers = {
        "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36",
    }
    # Fetch the results page; passing the keyword via params URL-encodes it safely
    data = requests.get("https://www.baidu.com/s", params={"wd": wd}, headers=headers)
    # Parse the response into an element tree for XPath queries
    data_etree = etree.HTML(data.content)
    # Extract data list
    content_list = data_etree.xpath('//div[@id="content_left"]/div[contains(@class, "result c-container")]')
    # Define the returned string
    result = ""
    # Get title, content, link
    for content in content_list:
        result_title = "<Title>  "
        bd_title = content.xpath('.//h3/a')
        for bd_t in bd_title:
            result_title += bd_t.xpath('string(.)')

        result_content = "<content>"
        bd_content = content.xpath('.//div[@class="c-abstract"]')
        for bd_c in bd_content:
            result_content += bd_c.xpath('string(.)')
        try:
            # xpath returns a list; take the first matching link if present
            result_link = "<link>" + content.xpath('.//div[@class="f13"]/a[@class="c-showurl"]/@href')[0]
        except IndexError:
            result_link = "<link>: empty"

        result_list = [result_title, "\n" , result_content , "\n", result_link]
        for result_l in result_list:
            result += str(result_l)
    return result


# Save the result string to ./data/<file_name>.txt
def saveDataToFile(file_name, data):
    # Create the output folder if it does not exist yet
    os.makedirs("./data/", exist_ok=True)
    with open("./data/" + file_name + ".txt", "w", encoding="utf-8") as f:
        f.write(data)

def main():
    # Read the keyword from the command line; fall back to a default
    try:
        wd = sys.argv[1]
    except IndexError:
        wd = ""

    if not wd:
        wd = "Hello"

    str_data = getData(wd)
    saveDataToFile(wd, str_data)
    print("end")


if __name__ == '__main__':
    main()
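
To test the Python side on its own, run it with the virtual environment's interpreter and a keyword, for example /Users/lijianzhao/.virtualenvs/py3/bin/python3 bdindex.py "Under one man". The results land in ./data/Under one man.txt, which is exactly the file the Java program reads afterwards.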

Summary

Python is probably the best language for writing crawlers. From now on, whenever I need to collect data, I can call a Python crawler directly from Java. Life is short; I use Python.
