A newbie's happy little crawler

Target: 51job

Don't ask me why I'm crawling this site: my technical ability is limited. Very limited. Very, very limited.

1. First, see what useful information this website has.


The region is a must. How else would you know where to look for a job?

What kind of job? Needless to say: go straight for the ones paying over 50,000 (just kidding).

The total number of pages and the total number of job postings. (I ignored these at first.)

Every individual job posting is useful information.

2. Analyze the source code

The first thing that caught my eye was this. Click into any one of them and look for useful information.

This string of digits doesn't look simple at first glance. Call it a man's sixth sense.

Go back and check the website. Sure enough, it's not simple. It's as smart as I am..

So the number in this position is the area code. Keep reading down for anything else that might be useful.
Wow, isn't that a bit careless? All the information we want to crawl is sitting right here.

It's too messy. Pull out one piece to analyze.

Every field has a fixed identifier, which makes it much easier for us to filter the information.
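
To make the "fixed identifier" idea concrete, here is a minimal sketch that pulls one field out of a record by its identifier. It assumes each record is written as `"key":"value"` pairs and that values contain no escaped double quotes. `FieldExtractor` and the sample record are hypothetical and only for illustration; the crawler in section 3 uses `substring` offsets instead.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class FieldExtractor {

    // Returns the value of "key":"value" inside record, or "" when the key is absent.
    // Assumption: values never contain an escaped double quote.
    static String field(String record, String key) {
        Pattern p = Pattern.compile("\"" + Pattern.quote(key) + "\":\"(.*?)\"");
        Matcher m = p.matcher(record);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        // Made-up record in the assumed shape; not real 51job data.
        String record = "{\"job_name\":\"Java Developer\",\"company_name\":\"Example Co\","
                + "\"providesalary_text\":\"10-15k\",\"workarea_text\":\"Beijing\"}";
        System.out.println(field(record, "job_name"));
        System.out.println(field(record, "providesalary_text"));
        System.out.println(field(record, "missing_key").isEmpty());
    }
}
```

Because the pattern includes the opening quote of the key, asking for `providesalary_text` cannot accidentally match another key that merely ends in `_text`.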


At the bottom of the page is the information I ignored at the beginning: the total number of pages in the search results.
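
As a rough sketch of how those totals could be read, the snippet below looks for `"total_page":"N"` and `"jobid_count":"N"` in the script text. That shape is an assumption inferred from the identifiers the crawler in section 3 searches for. `PageCounts` is a hypothetical name, and the sample string is made up around the numbers from the sample run.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class PageCounts {

    // Reads an integer written as "key":"123"; returns -1 when the key is not found.
    static int intField(String script, String key) {
        Matcher m = Pattern.compile("\"" + Pattern.quote(key) + "\":\"(\\d+)\"").matcher(script);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Made-up script fragment in the assumed shape.
        String script = "{\"total_page\":\"156\",\"keyword_ads\":\"\","
                + "\"jobid_count\":\"7775\",\"banner_ads\":\"\"}";
        System.out.println("pages: " + intField(script, "total_page"));
        System.out.println("postings: " + intField(script, "jobid_count"));
    }
}
```

Returning numbers instead of a display string makes it easy to cap the page count a user asks for at the real total.
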
How does this website turn pages? Click to the second page and watch how the URL changes.

Obviously, this position in the URL is the page number. As for the job keyword, I won't say more.
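
Putting the three observations together (area code, keyword, page number), the search URL can be assembled in one place. This is a sketch that mirrors the URL pattern used by the crawler code in section 3. `SearchUrl` is a hypothetical name, `010000` is only a placeholder area code, and the keyword is inserted as-is without URL encoding, which is an assumption that may not hold for keywords with special characters.

```java
// Hypothetical helper, not part of the original crawler.
final class SearchUrl {

    private static final String BASE = "https://search.51job.com/list/";
    private static final String QUERY = "?lang=c&postchannel=0000&workyear=99&cotype=99"
            + "&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=";

    // areaCode: numeric area code; salaryCode: salary range code such as "99";
    // keyword: job keyword; page: 1-based page number.
    static String build(String areaCode, String salaryCode, String keyword, int page) {
        if (page < 1) {
            throw new IllegalArgumentException("page must be >= 1");
        }
        return BASE + areaCode + ",000000,0000,00,9," + salaryCode + "," + keyword
                + ",2," + page + ".html" + QUERY;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(build("010000", "99", "java", 1));
        System.out.println(build("010000", "99", "java", 2));
    }
}
```

Only the number before `.html` differs between the two printed URLs, which is exactly the change you see when clicking to the second page.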

Conclusion: it's not difficult. Just observe and analyze more.

3. Write the crawler code.

Required classes: Excel processing, job information object, web page parsing, thread, and test.

Excel processing class

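import java.io.File;
import java.io.IOException;
import java.util.List;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;

public class ExcelTools {
	
	/**
	 * @desc  Write Job objects to an Excel file
	 * @param biglist	One List<Job> per crawled page
	 * @param filePath	Output file path
	 * @param key		Search keyword, used in the sheet name
	 * @throws IOException	
	 * @throws RowsExceededException
	 * @throws WriteException
	 * @throws InterruptedException
	 */
		public void writeExcel(List<List> biglist, String filePath, String key) throws IOException, RowsExceededException, WriteException, InterruptedException{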
			File file=new File(filePath);
			//Create Workbook
			WritableWorkbook excel=Workbook.createWorkbook(file);
			//Create sheet page
			WritableSheet sheet=excel.createSheet("51job-"+key, 0);
			//Create header row
			sheet.addCell(new Label(0,0,"Job name"));
			sheet.addCell(new Label(1,0,"Benefits"));
			sheet.addCell(new Label(2,0,"Company name"));
			sheet.addCell(new Label(3,0,"Salary"));
			sheet.addCell(new Label(4,0,"Region"));
			sheet.addCell(new Label(5,0,"Release time"));
			sheet.addCell(new Label(6,0,"Nature of company"));
			sheet.addCell(new Label(7,0,"Recruitment requirements"));
			//Write one row per job, starting below the header
			int row=1;
			for (int i = 0; i < biglist.size(); i++) {
				List list=biglist.get(i);
				for (int j = 0; j < list.size(); j++) {
					Job job=(Job) list.get(j);
					sheet.addCell(new Label(0,row,job.getJob_name()));
					sheet.addCell(new Label(1,row,job.getJob_welf()));
					sheet.addCell(new Label(2,row,job.getCompany_name()));
					sheet.addCell(new Label(3,row,job.getProvidesalary_text()));
					sheet.addCell(new Label(4,row,job.getWorkarea_text()));
					sheet.addCell(new Label(5,row,job.getUpdatedate()));
					sheet.addCell(new Label(6,row,job.getCompanytype_text()));
					sheet.addCell(new Label(7,row,job.getAttribute_text()));
					row++;
				}
			}
			//Flush the buffered cells to the file
			excel.write();
			//Close the workbook
			excel.close();	
		}
	
	
}

Job class

public class Job {

	// Job name
	private String job_name;

	// Benefits
	private String job_welf;

	// Company name
	private String company_name;

	// Salary
	private String providesalary_text;

	// Region
	private String workarea_text;

	// Release time
	private String updatedate;

	// Nature of company
	private String companytype_text;

	// Recruitment requirements
	private String attribute_text;

	public String getJob_name() {
		return job_name;
	}


	public void setJob_name(String job_name) {
		this.job_name = job_name;
	}


	public String getJob_welf() {
		return job_welf;
	}


	public void setJob_welf(String job_welf) {
		this.job_welf = job_welf;
	}


	public String getCompany_name() {
		return company_name;
	}


	public void setCompany_name(String company_name) {
		this.company_name = company_name;
	}


	public String getProvidesalary_text() {
		return providesalary_text;
	}


	public void setProvidesalary_text(String providesalary_text) {
		this.providesalary_text = providesalary_text;
	}


	public String getWorkarea_text() {
		return workarea_text;
	}


	public void setWorkarea_text(String workarea_text) {
		this.workarea_text = workarea_text;
	}


	public String getUpdatedate() {
		return updatedate;
	}


	public void setUpdatedate(String updatedate) {
		this.updatedate = updatedate;
	}


	public String getCompanytype_text() {
		return companytype_text;
	}


	public void setCompanytype_text(String companytype_text) {
		this.companytype_text = companytype_text;
	}


	
	public String getAttribute_text() {
		return attribute_text;
	}


	public void setAttribute_text(String attribute_text) {
		this.attribute_text = attribute_text;
	}


	@Override
	public String toString() {
		return "Job [job_name=" + job_name + ", job_welf=" + job_welf + ", company_name=" + company_name
				+ ", providesalary_text=" + providesalary_text + ", workarea_text=" + workarea_text + ", updatedate="
				+ updatedate + ", companytype_text=" + companytype_text + ", attribute_text=" + attribute_text + "]";
	}
	
	
	
}

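Web page parsing class

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupHtml {

	String city; // City name in pinyin, e.g. beijing

	// No-argument constructor
	public JsoupHtml() {
		super();
	}
	// Constructor that takes the city name
	public JsoupHtml(String city) {
		this.city = city;
	}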

	/**
	 * @desc	Fetch and parse an HTML document
	 * @param 	html	URL of the page to fetch
	 * @return 	document object
	 * @throws 	IOException
	 */
	public Document getDocument(String html) throws IOException {
		return Jsoup.connect(html).get();
	}

	/**
	 * @desc	Get the numeric area code for the city
	 * @return	area code read from the areaid= parameter in the city page's fifth meta tag
	 * @throws IOException
	 */
	public String getCity() throws IOException {
		String cityHtml = "https://www.51job.com/" + city + "/";
		Document doc = getDocument(cityHtml);
		String citynumber = doc.getElementsByTag("meta").get(4).attr("content");
		String citynumber2 = citynumber.substring(citynumber.indexOf("areaid=") + 7, citynumber.indexOf("&"));
		return citynumber2;
	}
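	/**
	 * @desc	Get the total number of pages and the total number of job postings
	 * @param city		area code
	 * @param key		search keyword
	 * @param moneyNum	salary range code
	 * @return	text such as "total_page : 156 Page data	jobid_count : 7775"
	 * @throws IOException
	 */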
	public String getPages(String city, String key, String moneyNum) throws IOException {
		Document doc = getDocument("https://search.51job.com/list/" + city + ",000000,0000,00,9," + moneyNum + "," + key
				+ ",2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=");
		Elements e1 = doc.getElementsByAttributeValue("type", "text/javascript");
		String str = e1.get(2).toString();
		return str.substring(str.indexOf("total_page"), str.indexOf("keyword_ads") - 3).replace('\"', ' ') + "Page data"
				+ "\t" + str.substring(str.indexOf("jobid_count"), str.indexOf("banner_ads") - 3).replace('\"', ' ');
	}

	/**
	 * @desc	Parse the job postings on a page into Job objects
	 * @param 	doc	search result page
	 * @return  List<Job>
	 * @throws IOException
	 */
	public List<Job> parseHtml(Document doc) throws IOException {
		Elements element = doc.getElementsByAttributeValue("type", "text/javascript");
		// Split the third script block on "job_name": each chunk after the first holds one posting
		String[] joblist = element.get(2).toString().split("job_name");
		List<Job> jobList = new ArrayList<Job>();
		// Start at 1 to skip the text that precedes the first posting
		for (int i = 1; i < joblist.length; i++) {
			String string = joblist[i];
			Job job = new Job();
			// Job name
			String job_name = string.substring(3, string.indexOf("job_title") - 3);
			job.setJob_name(job_name);
			// fringe benefits
			String job_welf = string.substring(string.indexOf("jobwelf") + 10, string.indexOf("jobwelf_list") - 3);
			job.setJob_welf(job_welf);
			// Company name
			String company_name = string.substring(string.indexOf("company_name") + 15, string.indexOf("provi") - 3);
			job.setCompany_name(company_name);
			// wages
			String providesalary_text = string.substring(string.indexOf("text") + 7, string.indexOf("workarea") - 3)
					.replace('/', ' ');
			job.setProvidesalary_text(providesalary_text);
			// region
			String workarea_text = string.substring(string.indexOf("workarea_text") + 16,
					string.indexOf("updatedate") - 3);
			job.setWorkarea_text(workarea_text);
			// Release time
			String updatedate = string.substring(string.indexOf("updatedate") + 13,
					string.indexOf("iscommunicate") - 3);
			job.setUpdatedate(updatedate);
			// Nature of company
			String companytype_text = string.substring(string.indexOf("companytype_text") + 19,
					string.indexOf("degreefrom") - 3);
			job.setCompanytype_text(companytype_text);
			// Recruitment requirements
			String attribute_text = string.substring(string.indexOf("attribute_text") + 17,
					string.indexOf("companysize_text") - 3);
			job.setAttribute_text(attribute_text);
			jobList.add(job);
		}
		return jobList;
	}
}
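
A note on `getCity`: it takes the text between `areaid=` and the first `&` in the whole string, so it would fail if any `&` appeared before `areaid=`. The sketch below is a possible alternative that matches the digits directly after `areaid=`. It assumes the area code is purely numeric. `AreaId` and the sample strings are hypothetical, not real 51job markup.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the original crawler.
final class AreaId {

    // Assumption: the area code consists of digits only.
    private static final Pattern AREA_ID = Pattern.compile("areaid=(\\d+)");

    // Returns the digits that follow "areaid=", wherever the parameter sits in the string.
    static String parse(String content) {
        Matcher m = AREA_ID.matcher(content);
        if (!m.find()) {
            throw new IllegalArgumentException("no areaid= parameter in: " + content);
        }
        return m.group(1);
    }

    public static void main(String[] args) {
        // Made-up meta content values for illustration.
        System.out.println(parse("url=https://example.com/?areaid=010000&from=pc"));
        // An '&' before areaid= is fine here; the index-based version would throw.
        System.out.println(parse("url=https://example.com/?from=pc&areaid=010000"));
    }
}
```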

Thread class

import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;

public class MyThread implements Runnable {

	int pages; // Number of pages the user wants to crawl
	String key; // Search keyword
	String moneyNum; // Salary range code
	String city; // Area code
	public MyThread(String city, int pages, String key, String moneyNum) {
		this.pages = pages;
		this.key = key;
		this.moneyNum = moneyNum;
		this.city = city;

	}

	JsoupHtml jsouphtml = new JsoupHtml();
	ExcelTools excel = new ExcelTools();

	public void run() {
		List<List> big = new ArrayList<List>();
		long start=System.currentTimeMillis();
		System.out.println(Thread.currentThread().getName() + " Start writing excel.....");
		for (int i = 1; i <= pages; i++) {
			try {
				String html = "https://search.51job.com/list/" + city + ",000000,0000,00,9," + moneyNum + "," + key
						+ ",2," + i
						+ ".html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=";
				Document document = jsouphtml.getDocument(html);
				// Container for storing Recruitment Information
				List list = jsouphtml.parseHtml(document);
				// Determine how many work information objects match
				if (list.size() == 0) {
					System.out.println("There is no job that meets the requirements");
					break;
				}

				big.add(list);
				excel.writeExcel(big, "D:/study/Reptile/" + key + ".xls", key);
			} catch (Exception e) {
				// getStackTrace() only returns the trace; printStackTrace() actually reports the error
				e.printStackTrace();
			}
		}
		System.out.println(Thread.currentThread().getName() + " Execution complete!");
		System.out.println(System.currentTimeMillis()-start+"/ms");
	}

}
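
For comparison, here is a minimal sketch of the same "one task per keyword" idea that also waits for every task and surfaces failures. It uses only `java.util.concurrent`. `PoolSketch` is a hypothetical name, the pool size cap of 4 is arbitrary, and the task body is a stand-in that does no crawling at all.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch, not part of the original crawler.
final class PoolSketch {

    // Runs one task per keyword and returns the results in input order.
    static List<String> crawlAll(List<String> keywords) {
        int threads = Math.max(1, Math.min(keywords.size(), 4));
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<String> results = new ArrayList<>();
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String keyword : keywords) {
                // Stand-in task; a real one would fetch and parse pages for the keyword.
                futures.add(pool.submit(() -> keyword + ": done"));
            }
            for (Future<String> f : futures) {
                results.add(f.get()); // blocks until that task has finished
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("crawl task failed", e.getCause());
        } finally {
            pool.shutdown(); // no new tasks; lets the JVM exit once workers are idle
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(crawlAll(Arrays.asList("java", "python")));
    }
}
```

Submitting a `Callable` and calling `get()` means an exception inside a task reaches the caller instead of being lost inside the worker thread.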

Test class

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class Test {

	public static void main(String[] args) throws IOException {
		// Share one Scanner: a new Scanner(System.in) per prompt can lose buffered input
		Scanner in = new Scanner(System.in);

		System.out.println("Please enter city name:");
		String city = in.next();
		// Resolve the pinyin city name (e.g. beijing) to the site's numeric area code
		String city1 = new JsoupHtml(city).getCity();

		System.out.println("Enter the number of query jobs:");
		int jobNum = in.nextInt();

		System.out.println("------------Select monthly salary range-----------");
		System.out.println("99-----------All");
		System.out.println("01-----------below 2k\t02-----------2-3k");
		System.out.println("03-----------3-4.5k\t04-----------4.5-6k");
		System.out.println("05-----------6-8k\t06-----------8-10k");
		System.out.println("07-----------10-15k\t08-----------15-20k");
		System.out.println("09-----------20-30k\t10-----------30-40k");
		System.out.println("11-----------40-50k\t12-----------above 50k");
		String moneyNum = in.next();

		JsoupHtml jsouphtml = new JsoupHtml();
		ExecutorService executorService = Executors.newCachedThreadPool();

		// Ask for every occupation first, then start one task per occupation.
		// The loop replaces the three copy-pasted switch cases (which stopped at 3).
		List<MyThread> tasks = new ArrayList<MyThread>();
		for (int n = 1; n <= jobNum; n++) {
			System.out.println("Enter occupation " + n + ":");
			String key = in.next();
			System.out.println(jsouphtml.getPages(city1, key, moneyNum));
			System.out.println("Please enter the number of pages you print:");
			int pages = in.nextInt();
			tasks.add(new MyThread(city1, pages, key, moneyNum));
		}
		for (MyThread task : tasks) {
			executorService.execute(task);
		}
		// Stop accepting new tasks; tasks already submitted still run to completion
		executorService.shutdown();
	}
}
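
The program passes whatever salary code the user types straight into the URL. A small lookup table could reject unknown codes before any request is made. This is only a sketch: `SalaryCodes` is a hypothetical name, the table is copied from the menu above, and code `06` is assumed to mean 8-10k.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the original crawler.
final class SalaryCodes {

    // Code -> range label, in menu order. "06" is assumed to mean 8-10k.
    private static final Map<String, String> RANGES = new LinkedHashMap<>();
    static {
        RANGES.put("99", "All");
        RANGES.put("01", "below 2k");
        RANGES.put("02", "2-3k");
        RANGES.put("03", "3-4.5k");
        RANGES.put("04", "4.5-6k");
        RANGES.put("05", "6-8k");
        RANGES.put("06", "8-10k");
        RANGES.put("07", "10-15k");
        RANGES.put("08", "15-20k");
        RANGES.put("09", "20-30k");
        RANGES.put("10", "30-40k");
        RANGES.put("11", "40-50k");
        RANGES.put("12", "above 50k");
    }

    static boolean isValid(String code) {
        return RANGES.containsKey(code);
    }

    static String label(String code) {
        String label = RANGES.get(code);
        if (label == null) {
            throw new IllegalArgumentException("unknown salary code: " + code);
        }
        return label;
    }

    public static void main(String[] args) {
        System.out.println(isValid("07") + " " + label("07"));
        System.out.println(isValid("7")); // codes are two digits, so "7" is rejected
    }
}
```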

4. Crawl results

Please enter city name:
beijing
Enter the number of query jobs:
3
------------Select monthly salary range-----------
99-----------All
01-----------below 2k	02-----------2-3k
03-----------3-4.5k	04-----------4.5-6k
05-----------6-8k	06-----------8-10k
07-----------10-15k	08-----------15-20k
09-----------20-30k	10-----------30-40k
11-----------40-50k	12-----------above 50k
99
Enter occupation 1:
java
total_page : 156 Page data	jobid_count : 7775
Please enter the number of pages you print:
10
Enter occupation 2:
python
total_page : 106 Page data	jobid_count : 5288
Please enter the number of pages you print:
10
Enter occupation 3:
front end
total_page : 85 Page data	jobid_count : 4231
Please enter the number of pages you print:
10
pool-1-thread-1 Start writing excel.....
pool-1-thread-3 Start writing excel.....
pool-1-thread-2 Start writing excel.....
pool-1-thread-3 Execution complete!
1710/ms
pool-1-thread-1 Execution complete!
2321/ms
pool-1-thread-2 Execution complete!
3520/ms

Crawled file

5. Summary

I have only been learning jsoup for a very short time, my skills are not there yet, and my ability is limited, so this is the best I can write for now. Anyway, the data did get crawled. I will probably rework a lot of this code later, and I'm still not comfortable with thread pools. If you spot any problems, please point them out!
But be gentle!!!!!!

Keywords: Java crawler

Added by rvdb86 on Fri, 18 Feb 2022 20:58:45 +0200