Learning Web Crawling from Zero: Collecting Second-Hand Housing Listings from Fangtianxia

Collection website

[Scenario description] Collect the latest second-hand housing listings from Fangtianxia.

[website] https://tj.esf.fang.com/

[collection content]

For every listing in the second-hand housing section of Fangtianxia's Tianjin site, collect the title, price, house type, area, unit price, orientation, floor, decoration, community, region, contact person, and telephone number.

Analysis of the approach

Overview of configuration ideas:

Configuration steps

1. Create a new acquisition task

Select [acquisition configuration] and click [+] at the top right of the task list to create a new acquisition task. Enter the entry URL in the [acquisition address] box, choose a [task name], and click Next.

2. Page turning configuration

Look at the page-turning links on the second-hand house listing page. They follow a regular pattern:

https://tj.esf.fang.com/house/i32/ (page 2)

https://tj.esf.fang.com/house/i33/ (page 3)

https://tj.esf.fang.com/house/i34/ (page 4)

Each page-turning link therefore has the form:

https://tj.esf.fang.com/house/i3 + page number + /

① Therefore, the added script is as follows:

② Collection Preview
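The page-turning rule from this step can be checked outside the tool with a short Python sketch. This is only an illustration, not the tool's own script: the names `BASE`, `page_url`, and `page_urls` are made up for the example, and the pattern is assumed to hold only for the pages where it was observed (page 2 and later).

```python
# Illustrative sketch only: rebuilds the page-turning links from the
# pattern observed above ("i3" + page number + "/").
BASE = "https://tj.esf.fang.com/house/"


def page_url(page):
    """Return the listing URL for a given page number (2 or higher)."""
    if page < 2:
        # The pattern was only observed for pages 2, 3 and 4.
        raise ValueError("the pattern was only observed for page 2 and later")
    return f"{BASE}i3{page}/"


def page_urls(last_page):
    """Return the page-turning links for pages 2 through last_page."""
    return [page_url(p) for p in range(2, last_page + 1)]


for u in page_urls(4):
    print(u)
```

The three links printed for pages 2 to 4 match the ones listed in this step, which is a quick way to confirm that the rule was read correctly before configuring it in the tool.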

3. Link extraction

① Create a new template (template 2) and add a link extraction to it to extract the links of all second-hand houses on each page.

② The list links have to be extracted with a script. The operation is shown in the following figure:

③ View the page source: open the page in the browser, press F12, and click the pointer button, as shown in the figure below. Use the pointer to select a second-hand house link; the corresponding source code then appears on the right. It shows that the links are under the node with class [shop_list shop_list_4].

④ Looking closer, each node named [dl] under the [shop_list shop_list_4] node corresponds to one second-hand house listing.

In each dl node, the href of the child node of the child node of the node named dd is the link to that listing.

⑤ Following this idea, the configuration script is given below. After configuring the script, click save in the upper right corner.

The text is as follows:

var foor = DOM.FindClass("shop_list shop_list_4","div",0 );//Find the div node with class "shop_list shop_list_4"
var foora= DOM.FindName("dl",foor );//Find the first node named dl under the foor node
while(foora)//Loop while there is a dl node left to process
{
	url link;//Define a url object for the extracted link
	var pro = DOM.FindName("dd",foora );	//Find the node named dd under the current dl node
	link.urlname= url.StdUrl(URL.urlname,pro.child.child.href);//Link address: the href of the child of the child of the dd node, passed through url.StdUrl together with the current page URL
	link.title =pro.child.child.title;//Link title: the title of the child of the child of the dd node
	link.tmplid = 3;//Associate the link with template 3, the template that handles the data pages
	RESULT.AddLink(link);//Output the link as a result
	foora = foora.next;//Move to the next sibling of the current node, that is, the next dl node
}
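For readers who want to try the same idea outside the tool, here is a minimal Python sketch using only the standard library. It is an analogue of the script above, not a translation of it: the class `ListingLinkParser`, the sample markup, and the sample hrefs are all hypothetical, and the sketch simply takes the first link found in a dd of each dl inside the listing container.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ListingLinkParser(HTMLParser):
    """Collect the first <a> inside a <dd> of every <dl> under div.shop_list."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.div_depth = 0   # > 0 while inside the listing container div
        self.in_dl = False
        self.in_dd = False
        self.taken = False   # True once the current <dl> has produced a link
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            classes = (attrs.get("class") or "").split()
            if self.div_depth or "shop_list" in classes:
                self.div_depth += 1
        elif self.div_depth and tag == "dl":
            self.in_dl, self.taken = True, False
        elif self.in_dl and tag == "dd":
            self.in_dd = True
        elif self.in_dd and tag == "a" and not self.taken and attrs.get("href"):
            # Resolve relative hrefs against the page URL.
            self.links.append({
                "url": urljoin(self.base_url, attrs["href"]),
                "title": attrs.get("title", ""),
            })
            self.taken = True

    def handle_endtag(self, tag):
        if tag == "div" and self.div_depth:
            self.div_depth -= 1
        elif tag == "dl":
            self.in_dl = self.in_dd = False
        elif tag == "dd":
            self.in_dd = False


# Hypothetical markup that only mimics the structure described above.
SAMPLE = """
<div class="shop_list shop_list_4">
  <dl><dt><a href="/img/1.htm">pic</a></dt>
      <dd><h4><a href="/chushou/3_1.htm" title="Flat A">Flat A</a></h4>
          <p><a href="/agent/9.htm">agent</a></p></dd></dl>
  <dl><dd><h4><a href="/chushou/3_2.htm" title="Flat B">Flat B</a></h4></dd></dl>
</div>
<dl><dd><a href="/other.htm">outside the list</a></dd></dl>
"""

parser = ListingLinkParser("https://tj.esf.fang.com/house/i32/")
parser.feed(SAMPLE)
links = parser.links
for link in links:
    print(link["url"], link["title"])
```

The sketch keeps one link per dl, skips links in dt nodes, and ignores dl nodes outside the listing container, which mirrors the scoping that the start-node argument gives the script above.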

⑥ The acquisition preview is as follows:

4. Data extraction

① After link extraction comes the data page. On the basis of the existing templates, right-click and select Add template, then right-click the new template and add a data extraction.

② Next, create the data table: select data table creation and click + in the collection data table structure to add a data table. The name can be chosen freely; here it is named fangtianxia.

③ After the data table is configured, select the data attribute configuration on the right side of the data extraction and choose the newly created "fangtianxia" table in the form. The fields of the table then appear on the left.

④ Click the script window and select the data extraction script.

⑤ Locate the required fields in the page. Open any second-hand house details page in the browser, press F12, and click the pointer button, as shown in the figure below. Use the pointer to select the required field on the page; the corresponding source code then appears on the right.

title field: as shown in the figure below, this field is under the node with class [floatl tit_details].

price field: as shown in the figure below, this field is under the node with class [trl-item price_esf sty1].

type_ field: as shown in the following figure, this field is under the child node of the [tr-line clearfix] node.

area field: as shown in the following figure, this field is under the child node of the [trl-item1 w182] node.

priceper field (unit price): similarly, it is under the child node of the [trl-item1 w132] node.

orientation field: similarly, it is under the child node of the [trl-item1 w146] node.

floor field: this field is also under a [trl-item1 w182] node, but as the figure below shows, the page source contains more than one node with that class, so it cannot be located in the same way as the fields above.

As shown in the following figure, the page source reveals that this field is in the node with class [trl-item1 w182] inside the next node of the next node of the child node of the node with class [tab-cont-right].

renovate field: as can be seen from the figure, this field is in the child of the node with class [trl-item1 w132], under the node with class [tr-line clearfix].

estate field: as can be seen from the figure, this field is all the text in the node with class [rcont] under the node with class [tr-line].

zone_ field: as can be seen from the figure, this field is all the text in the node with class [rcont] under the node with class [trl-item2 clearfix].

name_ field: as can be seen from the figure, this field is all the text in the node with class [zf_jjname].

tel field: as can be seen from the figure, this field is the value attribute of the node whose id is [AgentTel].

⑥ To sum up, the data extraction script is as follows:

Script text:

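var floor=DOM.FindClass("tab-cont-right","div");//Find the div node with class "tab-cont-right"
var floor1=floor.child.next.next.next;//Start node for the floor and renovate lookups
var floor2=DOM.FindClass("tr-line","div");//Start node for the estate lookup
var floor3=DOM.FindClass("trl-item2 clearfix","div").next;//Start node for the zone_ lookup
var floor4=DOM.FindClass("zf_chat_line","a");//Not used in the rest of this script
record re;//Define a record
re.id = MD5(ur);//Record ID as an MD5 hash; ur is not defined in this script and presumably refers to the page URL
re.title = DOM.GetTextAll(DOM.FindClass("floatl tit_details","h1"));//Title
re.price=DOM.GetTextAll(DOM.FindClass("trl-item price_esf  sty1","div",0));//Price
re.type_=DOM.GetTextAll(DOM.FindClass("tr-line clearfix","div",0).child);//House type
re.area=DOM.GetTextAll(DOM.FindClass("trl-item1 w182","div").child);//Area
re.priceper=DOM.GetTextAll(DOM.FindClass("trl-item1 w132","div").child);//Unit price
re.orientation=DOM.GetTextAll(DOM.FindClass("trl-item1 w146","div").child);//Orientation
re.floor=DOM.GetTextAll(DOM.FindClass("trl-item1 w182","div",floor1));//Floor, searched from floor1 to skip the earlier node with the same class
re.renovate=DOM.GetTextAll(DOM.FindClass("trl-item1 w132","div",floor1).child);//Decoration, searched from floor1
re.estate=DOM.GetTextAll(DOM.FindClass("rcont","div",floor2));//Community
re.zone_=DOM.GetTextAll(DOM.FindClass("rcont","div",floor3));//Region
re.name_=DOM.GetTextAll(DOM.FindClass("zf_jjname","span"));//Contact person
re.tel=DOM.FindId("AgentTel").value;//Telephone number
RESULT.AddRec(re,this.schemaid);//Output the record to the data table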
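The trick used for the floor field, searching from a narrower start node because the same class appears more than once, can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: the markup in `SAMPLE`, the values in it, the helper `first_text`, and the details-page URL are all hypothetical, and `xml.etree.ElementTree` is used here only because the sample fragment is well-formed. The record ID is computed as an MD5 hash of the page URL, which is an assumption about what the script's ID line intends.

```python
import hashlib
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics two rows which both
# contain a node with class "trl-item1 w182".
SAMPLE = """
<div class="tab-cont-right">
  <div class="tr-line clearfix">
    <div class="trl-item1 w146"><div class="tt">3 rooms</div></div>
    <div class="trl-item1 w182"><div class="tt">98 m2</div></div>
  </div>
  <div class="tr-line clearfix">
    <div class="trl-item1 w182"><div class="tt">Middle floor</div></div>
  </div>
</div>
"""

# Hypothetical details-page URL, used only to derive a record ID.
URL = "https://tj.esf.fang.com/chushou/3_1.htm"


def first_text(scope, cls):
    """Return the text of the first <div class=cls> below scope, or ''."""
    node = scope.find(f".//div[@class='{cls}']")
    return "".join(node.itertext()).strip() if node is not None else ""


root = ET.fromstring(SAMPLE.strip())
rows = root.findall("div")  # direct children: the two row blocks

# Searching from the container returns the first match in document order.
area = first_text(root, "trl-item1 w182")
# Searching from the second row only skips the earlier node with that class.
floor = first_text(rows[1], "trl-item1 w182")

record = {
    "id": hashlib.md5(URL.encode("utf-8")).hexdigest(),
    "area": area,
    "floor": floor,
}
print(record)
```

Because the ID depends only on the URL, collecting the same page twice yields the same ID, which makes duplicate records easy to spot.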

⑦ All fields are configured above, and the preview effect is as follows:

Collection steps

After the template configuration is completed and there is no problem with the collection preview, data collection can be carried out.

① First, establish the collection data table:

Select create data table, click the form of the template in the form list, and select Create in the associated data table. Enter a name for the table, here fangtianxia (note that the name cannot contain numbers or special symbols), and click OK.

After creation, check the data table.

② Select data collection, check the task name, and click start collection to officially start collection.

③ In data browsing, you can select a data table to view the collected data and export the data.

After-class review

FindClass(class name, tag type, start node): finds a node by its class name; use it when the class name is unique.

FindName(tag name, start node): finds a node by its tag name; use it when the tag is unique within the search range.

GetTextAll(node, character encoding): gets the visible text of an HTML tag node and all of its child nodes.

child: the child node of a node.

FindId(idVal): finds a tag node by the value of its ID attribute, where idVal is the ID attribute value of the tag to find.

If you run into any problem during operation, visit the official Forenose website (http://www.forenose.com) and consult technical support.

Forenose provides one-to-one technical support free of charge.

Keywords: Big Data crawler

Added by CK9 on Mon, 31 Jan 2022 05:42:31 +0200