Target: 51job
Don't ask me why I'm crawling this site; my technical ability is limited. Very limited. Very, very limited.
1. First, see what useful information the site has.
Region: obviously necessary. Where else would you look for a job?
Salary: what is a job search for, if not that? Just crawl the postings above 50,000 (joking).
Total number of pages and total number of postings. (I ignored these at first.)
Every job posting is useful information.
2. Analyze the source code
This was the first thing I noticed. Click any one of them and look for useful information.
This string of digits looks significant at first glance. Call it a sixth sense.
Go back and look at the URL. Sure enough, it is significant. Clever me.
So the number in this position of the URL is the area code. Next, keep looking further down for other information that might be useful.
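To make the area-code observation concrete, here is a minimal sketch of pulling an `areaid=` value out of a string. The `areaid=` marker is the one the crawler code later in this post looks for; the class name `AreaCode`, the method name, and the sample string are made up for illustration and are not taken from the site.

```java
final class AreaCode {
    private static final String MARKER = "areaid=";

    // Returns the value that follows "areaid=" up to the next '&',
    // or an empty string when the marker is absent.
    static String fromContent(String content) {
        int start = content.indexOf(MARKER);
        if (start < 0) {
            return "";
        }
        start += MARKER.length();
        // Search for '&' only after the marker, so an earlier '&' cannot confuse the cut.
        int end = content.indexOf('&', start);
        return end < 0 ? content.substring(start) : content.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical sample value; the real page content may look different.
        String sample = "https://example.invalid/list?areaid=010000&lang=c";
        System.out.println(fromContent(sample));
    }
}
```

Searching for the `&` only after the marker is a small safety choice: it still works if another `&` happens to appear earlier in the string.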
Wow, is it really that careless? The information we want to crawl is all sitting right here.
It's too messy to read as a whole, so pull out a piece to analyze.
Every field has a fixed identifier, which makes it much easier to filter the information.
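Because each field sits between fixed identifiers, extracting a value comes down to cutting the text between two known markers. The following is only a sketch of that idea: the helper name `between` and the sample fragment are hypothetical, and the quoted key/value layout is an assumption inferred from the field names used later in this post.

```java
final class FieldExtractor {
    // Returns the text between startMarker and the next endMarker,
    // or an empty string if either marker is missing.
    static String between(String text, String startMarker, String endMarker) {
        int start = text.indexOf(startMarker);
        if (start < 0) {
            return "";
        }
        start += startMarker.length();
        int end = text.indexOf(endMarker, start);
        if (end < 0) {
            return "";
        }
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        // Hypothetical fragment shaped like the embedded data; not copied from the site.
        String sample = "{\"job_name\":\"Java Developer\",\"company_name\":\"Example Co\"}";
        System.out.println(between(sample, "\"job_name\":\"", "\""));
        System.out.println(between(sample, "\"company_name\":\"", "\""));
    }
}
```

Returning an empty string for a missing marker avoids the `StringIndexOutOfBoundsException` that a bare `substring` call would throw when a field is absent.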
At the bottom of the data is the information I ignored at the beginning: the total number of pages in the search results.
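As a sketch of reading that page count, a regular expression can pick the number that follows the `total_page` key. The key name comes from the crawler code in this post; the class name `TotalPages`, the quoted layout of the value, and the sample string are assumptions for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TotalPages {
    // Assumes the value is stored as a quoted number: "total_page":"156"
    private static final Pattern TOTAL_PAGE = Pattern.compile("\"total_page\":\"(\\d+)\"");

    // Returns the page count, or -1 when the key is not found.
    static int parse(String script) {
        Matcher matcher = TOTAL_PAGE.matcher(script);
        return matcher.find() ? Integer.parseInt(matcher.group(1)) : -1;
    }

    public static void main(String[] args) {
        // Hypothetical fragment; the real script content is much longer.
        String sample = "{\"total_page\":\"156\",\"keyword_ads\":[]}";
        System.out.println(parse(sample));
    }
}
```

Getting the count as an `int` makes it easy to check a requested page number against the real maximum before crawling.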
How does the site paginate? Click through to the second page and watch how the URL changes.
Clearly, this position holds the page number. The job keyword position speaks for itself.
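Putting the observations together, the search URL can be assembled from the area code, salary code, keyword, and page number. This sketch follows the URL pattern used by the crawler code in this post but leaves out the long query string for brevity; the class and method names are hypothetical, and `010000` is only an example area code.

```java
final class SearchUrl {
    // Builds the list URL; the fixed segments are copied from the pattern observed on the site.
    static String forPage(String areaCode, String salaryCode, String keyword, int page) {
        return "https://search.51job.com/list/" + areaCode + ",000000,0000,00,9,"
                + salaryCode + "," + keyword + ",2," + page + ".html";
    }

    public static void main(String[] args) {
        // Print the first three page URLs for one keyword.
        for (int page = 1; page <= 3; page++) {
            System.out.println(forPage("010000", "99", "java", page));
        }
    }
}
```

Only the last number before `.html` changes from page to page, so a simple loop counter is enough to walk through the results.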
Conclusion: it's not hard. Just observe and analyze carefully.
3. Write the crawler code.
Required classes: Excel processing, job information object, web page analysis, thread, and test.
Excel processing class
import java.io.File;
import java.io.IOException;
import java.util.List;

import jxl.Workbook;
import jxl.write.Label;
import jxl.write.WritableSheet;
import jxl.write.WritableWorkbook;
import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;

public class ExcelTools {
    /**
     * @desc Write Job objects to an Excel file
     * @param biglist  one List of Job objects per crawled page
     * @param filePath output file path
     * @param key      search keyword, used in the sheet name
     * @throws IOException
     * @throws RowsExceededException
     * @throws WriteException
     * @throws InterruptedException
     */
    public void writeExcel(List<List> biglist, String filePath, String key)
            throws IOException, RowsExceededException, WriteException, InterruptedException {
        File file = new File(filePath);
        // Create the workbook
        WritableWorkbook excel = Workbook.createWorkbook(file);
        // Create the sheet
        WritableSheet sheet = excel.createSheet("51job-" + key, 0);
        // Create the header row
        sheet.addCell(new Label(0, 0, "Job name"));
        sheet.addCell(new Label(1, 0, "Benefits"));
        sheet.addCell(new Label(2, 0, "Company name"));
        sheet.addCell(new Label(3, 0, "Salary"));
        sheet.addCell(new Label(4, 0, "Region"));
        sheet.addCell(new Label(5, 0, "Release time"));
        sheet.addCell(new Label(6, 0, "Nature of company"));
        sheet.addCell(new Label(7, 0, "Recruitment requirements"));
        // Write one row per job, starting below the header
        int row = 1;
        for (int i = 0; i < biglist.size(); i++) {
            List list = biglist.get(i);
            for (int j = 0; j < list.size(); j++) {
                Job job = (Job) list.get(j);
                sheet.addCell(new Label(0, row, job.getJob_name()));
                sheet.addCell(new Label(1, row, job.getJob_welf()));
                sheet.addCell(new Label(2, row, job.getCompany_name()));
                sheet.addCell(new Label(3, row, job.getProvidesalary_text()));
                sheet.addCell(new Label(4, row, job.getWorkarea_text()));
                sheet.addCell(new Label(5, row, job.getUpdatedate()));
                sheet.addCell(new Label(6, row, job.getCompanytype_text()));
                sheet.addCell(new Label(7, row, job.getAttribute_text()));
                row++;
            }
        }
        // Flush the cells to the file
        excel.write();
        // Close the workbook
        excel.close();
    }
}
Job class
public class Job {
    // Job name
    public String job_name;
    // Benefits
    public String job_welf;
    // Company name
    public String company_name;
    // Salary
    public String providesalary_text;
    // Region
    public String workarea_text;
    // Release time
    public String updatedate;
    // Nature of company
    public String companytype_text;
    // Recruitment requirements
    public String attribute_text;

    public String getJob_name() { return job_name; }
    public void setJob_name(String job_name) { this.job_name = job_name; }

    public String getJob_welf() { return job_welf; }
    public void setJob_welf(String job_welf) { this.job_welf = job_welf; }

    public String getCompany_name() { return company_name; }
    public void setCompany_name(String company_name) { this.company_name = company_name; }

    public String getProvidesalary_text() { return providesalary_text; }
    public void setProvidesalary_text(String providesalary_text) { this.providesalary_text = providesalary_text; }

    public String getWorkarea_text() { return workarea_text; }
    public void setWorkarea_text(String workarea_text) { this.workarea_text = workarea_text; }

    public String getUpdatedate() { return updatedate; }
    public void setUpdatedate(String updatedate) { this.updatedate = updatedate; }

    public String getCompanytype_text() { return companytype_text; }
    public void setCompanytype_text(String companytype_text) { this.companytype_text = companytype_text; }

    public String getAttribute_text() { return attribute_text; }
    public void setAttribute_text(String attribute_text) { this.attribute_text = attribute_text; }

    @Override
    public String toString() {
        return "Job [job_name=" + job_name + ", job_welf=" + job_welf
                + ", company_name=" + company_name
                + ", providesalary_text=" + providesalary_text
                + ", workarea_text=" + workarea_text + ", updatedate=" + updatedate
                + ", companytype_text=" + companytype_text
                + ", attribute_text=" + attribute_text + "]";
    }
}
Web page analysis class
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

public class JsoupHtml {
    String city; // City name in pinyin

    // No-argument constructor
    public JsoupHtml() {
        super();
    }

    // Constructor taking the city name
    public JsoupHtml(String city) {
        this.city = city;
    }

    /**
     * @desc Fetch the HTML document
     * @param html URL of the page
     * @return document object
     * @throws IOException
     */
    public Document getDocument(String html) throws IOException {
        return Jsoup.connect(html).get();
    }

    /**
     * @desc Get the city code
     * @return area code of the city
     * @throws IOException
     */
    public String getCity() throws IOException {
        String cityHtml = "https://www.51job.com/" + city + "/";
        Document doc = getDocument(cityHtml);
        // The fifth <meta> tag carries a URL that contains areaid=<code>&...
        String citynumber = doc.getElementsByTag("meta").get(4).attr("content");
        String citynumber2 = citynumber.substring(citynumber.indexOf("areaid=") + 7, citynumber.indexOf("&"));
        return citynumber2;
    }

    /**
     * @desc Get the total number of pages and the total number of postings
     * @param city     area code
     * @param key      search keyword
     * @param moneyNum salary range code
     * @return summary string with total_page and jobid_count
     * @throws IOException
     */
    public String getPages(String city, String key, String moneyNum) throws IOException {
        Document doc = getDocument("https://search.51job.com/list/" + city + ",000000,0000,00,9," + moneyNum + ","
                + key
                + ",2,1.html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=");
        Elements e1 = doc.getElementsByAttributeValue("type", "text/javascript");
        // The third script element holds the search result data
        String str = e1.get(2).toString();
        return str.substring(str.indexOf("total_page"), str.indexOf("keyword_ads") - 3).replace('\"', ' ')
                + "Page data" + "\t"
                + str.substring(str.indexOf("jobid_count"), str.indexOf("banner_ads") - 3).replace('\"', ' ');
    }

    /**
     * @desc Parse the Job objects and store them in a list
     * @param doc search result page
     * @return List<Job>
     * @throws IOException
     */
    public List<Job> parseHtml(Document doc) throws IOException {
        Elements element = doc.getElementsByAttributeValue("type", "text/javascript");
        // Each chunk after a "job_name" marker describes one posting
        String[] joblist = element.get(2).toString().split("job_name");
        List<String> list = new ArrayList<String>();
        for (String string : joblist) {
            list.add(string);
        }
        // The first chunk is the text before the first posting
        list.remove(0);
        // Convert the list of chunks back into a string array for parsing
        String[] joblist2 = (String[]) list.toArray(new String[list.size()]);
        List<Job> jobList = new ArrayList<Job>();
        for (int i = 0; i < joblist2.length; i++) {
            String string = joblist2[i];
            Job job = new Job();
            // Job name
            String job_name = string.substring(3, string.indexOf("job_title") - 3);
            job.setJob_name(job_name);
            // Benefits
            String job_welf = string.substring(string.indexOf("jobwelf") + 10, string.indexOf("jobwelf_list") - 3);
            job.setJob_welf(job_welf);
            // Company name
            String company_name = string.substring(string.indexOf("company_name") + 15,
                    string.indexOf("provi") - 3);
            job.setCompany_name(company_name);
            // Salary
            String providesalary_text = string.substring(string.indexOf("text") + 7, string.indexOf("workarea") - 3)
                    .replace('/', ' ');
            job.setProvidesalary_text(providesalary_text);
            // Region
            String workarea_text = string.substring(string.indexOf("workarea_text") + 16,
                    string.indexOf("updatedate") - 3);
            job.setWorkarea_text(workarea_text);
            // Release time
            String updatedate = string.substring(string.indexOf("updatedate") + 13,
                    string.indexOf("iscommunicate") - 3);
            job.setUpdatedate(updatedate);
            // Nature of company
            String companytype_text = string.substring(string.indexOf("companytype_text") + 19,
                    string.indexOf("degreefrom") - 3);
            job.setCompanytype_text(companytype_text);
            // Recruitment requirements
            String attribute_text = string.substring(string.indexOf("attribute_text") + 17,
                    string.indexOf("companysize_text") - 3);
            job.setAttribute_text(attribute_text);
            jobList.add(job);
        }
        return jobList;
    }
}
Thread class
import java.util.ArrayList;
import java.util.List;

import org.jsoup.nodes.Document;

public class MyThread implements Runnable {
    int pages;       // Number of pages the user wants to crawl
    String key;      // Search keyword
    String moneyNum; // Salary range code
    String city;     // City code

    public MyThread(String city, int pages, String key, String moneyNum) {
        this.pages = pages;
        this.key = key;
        this.moneyNum = moneyNum;
        this.city = city;
    }

    JsoupHtml jsouphtml = new JsoupHtml();
    ExcelTools excel = new ExcelTools();

    public void run() {
        List big = new ArrayList<List>();
        long start = System.currentTimeMillis();
        System.out.println(Thread.currentThread().getName() + "Start writing excel.....");
        for (int i = 1; i <= pages; i++) {
            try {
                String html = "https://search.51job.com/list/" + city + ",000000,0000,00,9," + moneyNum + "," + key
                        + ",2," + i
                        + ".html?lang=c&postchannel=0000&workyear=99&cotype=99&degreefrom=99&jobterm=99&companysize=99&ord_field=0&dibiaoid=0&line=&welfare=";
                Document document = jsouphtml.getDocument(html);
                // List holding the postings parsed from this page
                List list = jsouphtml.parseHtml(document);
                // Stop when a page has no matching postings
                if (list.size() == 0) {
                    System.out.println("There is no job that meets the requirements");
                    break;
                }
                big.add(list);
                // The whole workbook is rewritten after every page
                excel.writeExcel(big, "D:/study/Reptile/" + key + ".xls", key);
            } catch (Exception e) {
                // Print the failure instead of silently discarding it
                e.printStackTrace();
            }
        }
        System.out.println(Thread.currentThread().getName() + "Execution complete!");
        System.out.println(System.currentTimeMillis() - start + "/ms");
    }
}
Test class
import java.io.IOException;
import java.util.Scanner;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import jxl.write.WriteException;
import jxl.write.biff.RowsExceededException;

public class Test {
    public static void main(String[] args)
            throws IOException, RowsExceededException, WriteException, InterruptedException {
        System.out.println("Please enter city name:");
        String city = new Scanner(System.in).next();
        JsoupHtml js = new JsoupHtml(city);
        String city1 = js.getCity();
        System.out.println("Enter the number of query jobs:");
        int jobNum = new Scanner(System.in).nextInt();
        System.out.println("------------Select monthly salary range-----------");
        System.out.println("99-----------All");
        System.out.print("01-----------2k following" + "\t");
        System.out.println("02-----------2-3k");
        System.out.print("03-----------3-4.5k" + "\t");
        System.out.println("04-----------4.5-6k");
        System.out.print("05-----------6-8k" + "\t");
        System.out.println("06-----------8-10k");
        System.out.print("07-----------10-15k" + "\t");
        System.out.println("08-----------15-20k");
        System.out.print("09-----------20-30k" + "\t");
        System.out.println("10-----------30-40k");
        System.out.print("11-----------40-50k" + "\t");
        System.out.println("12-----------50k above" + "\t");
        String moneyNum = new Scanner(System.in).next();
        JsoupHtml jsouphtml = new JsoupHtml();
        ExecutorService executorService = Executors.newCachedThreadPool();
        switch (jobNum) {
        case 1: {
            System.out.println("Enter occupation 1:");
            String key1 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key1, moneyNum));
            System.out.println("Please enter the number of pages you print");
            int pages1 = new Scanner(System.in).nextInt();
            executorService.execute(new MyThread(city1, pages1, key1, moneyNum));
            break;
        }
        case 2: {
            System.out.println("Enter occupation 1:");
            String key1 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key1, moneyNum));
            System.out.println("Please enter the number of pages you print:");
            int pages1 = new Scanner(System.in).nextInt();
            System.out.println("Enter occupation 2:");
            String key2 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key2, moneyNum));
            System.out.println("Please enter the number of pages you print:");
            int pages2 = new Scanner(System.in).nextInt();
            executorService.execute(new MyThread(city1, pages1, key1, moneyNum));
            executorService.execute(new MyThread(city1, pages2, key2, moneyNum));
            break;
        }
        case 3: {
            System.out.println("Enter occupation 1:");
            String key1 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key1, moneyNum));
            System.out.println("Please enter the number of pages you print:");
            int pages1 = new Scanner(System.in).nextInt();
            System.out.println("Enter occupation 2:");
            String key2 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key2, moneyNum));
            System.out.println("Please enter the number of pages you print:");
            int pages2 = new Scanner(System.in).nextInt();
            System.out.println("Enter occupation 3:");
            String key3 = new Scanner(System.in).next();
            System.out.println(jsouphtml.getPages(city1, key3, moneyNum));
            System.out.println("Please enter the number of pages you print:");
            int pages3 = new Scanner(System.in).nextInt();
            executorService.execute(new MyThread(city1, pages1, key1, moneyNum));
            executorService.execute(new MyThread(city1, pages2, key2, moneyNum));
            executorService.execute(new MyThread(city1, pages3, key3, moneyNum));
        }
        }
        // Stop accepting new tasks; crawls already submitted still run to completion
        executorService.shutdown();
    }
}
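The three switch cases in the test class repeat the same steps for one, two, or three occupations. One possible simplification, sketched here with no claim that it is the best design, is to submit one task per keyword in a loop and wait for all of them. The class `CrawlLauncher` and its `crawl` method are hypothetical stand-ins: `crawl` only returns a label and does not touch the network.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class CrawlLauncher {
    // Hypothetical placeholder for the real per-keyword crawl.
    static String crawl(String keyword, int pages) {
        return keyword + ":" + pages;
    }

    // Submits one task per keyword and returns the results in submission order.
    static List<String> runAll(List<String> keywords, List<Integer> pages) {
        ExecutorService pool = Executors.newCachedThreadPool();
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (int i = 0; i < keywords.size(); i++) {
                final String keyword = keywords.get(i);
                final int pageCount = pages.get(i);
                Callable<String> task = () -> crawl(keyword, pageCount);
                futures.add(pool.submit(task));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : futures) {
                results.add(future.get()); // blocks until that task has finished
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("Interrupted while waiting for crawls", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("A crawl task failed", e);
        } finally {
            pool.shutdown(); // lets the JVM exit once the tasks are done
        }
    }

    public static void main(String[] args) {
        System.out.println(runAll(Arrays.asList("java", "python"), Arrays.asList(10, 5)));
    }
}
```

With this shape, the number of occupations is no longer limited to three, and the pool is always shut down even if a task fails.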
4. Crawl results
Please enter city name:
beijing
Enter the number of query jobs:
3
------------Select monthly salary range-----------
99-----------All
01-----------2k following	02-----------2-3k
03-----------3-4.5k	04-----------4.5-6k
05-----------6-8k	06-----------8-10k
07-----------10-15k	08-----------15-20k
09-----------20-30k	10-----------30-40k
11-----------40-50k	12-----------50k above
99
Enter occupation 1:
java
total_page : 156 Page data	jobid_count : 7775
Please enter the number of pages you print:
10
Enter occupation 2:
python
total_page : 106 Page data	jobid_count : 5288
Please enter the number of pages you print:
10
Enter occupation 3:
front end
total_page : 85 Page data	jobid_count : 4231
Please enter the number of pages you print:
10
pool-1-thread-1 Start writing excel.....
pool-1-thread-3 Start writing excel.....
pool-1-thread-2 Start writing excel.....
pool-1-thread-3 Execution complete!
1710/ms
pool-1-thread-1 Execution complete!
2321/ms
pool-1-thread-2 Execution complete!
3520/ms
Crawled file
5. Summary
I have only been learning jsoup for a short time, my technique is not polished, and my ability is limited, so this is all I can write for now. Still, the crawl works. I may revise the code heavily later, and I don't really know how to use thread pools yet. If you spot any problems, please point them out!
But be gentle!!!!!!