Node: a crawler that scrapes web page pictures

Preface

Over the weekend I was idle at home, scrolling WeChat on my phone, when I decided my WeChat avatar needed a change and went online to look for a new one. Browsing the pictures, I thought: as a programmer, I could crawl these images down and turn them into a WeChat mini program. No sooner said than done; I already knew roughly how to do it, so I tidied up the process and am sharing it with you.

Catalogue

  • Install Node and download the dependencies
  • Build a server
  • Request the page we want to crawl and return JSON

Install Node

Let's start by installing Node. You can download it from the official website nodejs.org/zh-cn/ ; after installing, verify it by running:

node -v

If the installation succeeded, the version number you installed is printed.

Next, let's use Node to print hello world. Create a new file named index.js:

console.log('hello world')

Run the file:

node index.js

It will print hello world to the console.

Build a server

Create a new folder named node.

First, download the express dependency:

npm install express

Create a new file named demo.js, as shown in the figure below:

![img](https://p1-jj.byteimg.com/tos...)

Require the express we just downloaded in demo.js:

const express = require('express');
const app = express();

app.get('/index', function(req, res) {
    res.end('111')
})

var server = app.listen(8081, function() {
    var host = server.address().address
    var port = server.address().port
    console.log("Application instance, access address: http://%s:%s", host, port)
})

Run node demo.js to start a simple service, as shown in the figure:

![img](https://p1-jj.byteimg.com/tos...)

Request the page we want to crawl

Install three more dependencies:

npm install superagent
npm install superagent-charset
npm install cheerio

superagent is used to send requests. It is a lightweight, progressive ajax API with good readability and a gentle learning curve. It relies internally on Node's native request API and is well suited to the Node.js environment; you could also use the built-in http module to make requests.

superagent-charset prevents garbled text by converting the response's character encoding.
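To see why this conversion matters: the target site serves GB2312, which Node does not decode as UTF-8. Conceptually the fix is the same as what the WHATWG TextDecoder built into modern Node (full-ICU builds) does; a sketch with hand-written GB2312 bytes for the string "你好":

```javascript
// GB2312/GBK bytes for the string "你好" (hello).
const gbkBytes = Uint8Array.of(0xc4, 0xe3, 0xba, 0xc3);

// Decoding them as UTF-8 produces mojibake (replacement characters);
// decoding them with the correct 'gb2312' label recovers the text.
const wrong = new TextDecoder('utf-8').decode(gbkBytes);
const right = new TextDecoder('gb2312').decode(gbkBytes);

console.log(right); // 你好
```

superagent-charset does this conversion for us on the raw response body, which is why the crawler below calls .charset('gb2312').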

cheerio is a fast, flexible implementation of core jQuery designed specifically for the server. After installing the dependencies, import them:

var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
const cheerio = require('cheerio');

With those in place, request our target address, https://www.qqtn.com/tx/weixi... , as shown in the figure:

![img](https://p1-jj.byteimg.com/tos...)

Declare address variable:

const baseUrl = 'https://www.qqtn.com/'

With that set up, we can send the request. The complete demo.js follows:

var superagent = require('superagent');
var charset = require('superagent-charset');
charset(superagent);
var express = require('express');
var baseUrl = 'https://www.qqtn.com/'; // You can enter any web address
const cheerio = require('cheerio');
var app = express();
app.get('/index', function(req, res) {
    // Set CORS headers on the response
    res.header("Access-Control-Allow-Origin", "*");
    res.header('Access-Control-Allow-Methods', 'PUT, GET, POST, DELETE, OPTIONS');
    res.header("Access-Control-Allow-Headers", "X-Requested-With");
    res.header('Access-Control-Allow-Headers', 'Content-Type');
    // Avatar type (query parameter), defaults to 'weixin'
    var type = req.query.type;
    // Page number (query parameter), defaults to '1'
    var page = req.query.page;
    type = type || 'weixin';
    page = page || '1';
    var route = `tx/${type}tx_${page}.html`
    // This page is encoded in GB2312, so call .charset('gb2312'); most pages are UTF-8 and can use .charset('utf-8')
    superagent.get(baseUrl + route)
        .charset('gb2312')
        .end(function(err, sres) {
            var items = [];
            if (err) {
                console.log('ERR: ' + err);
                // err is an Error object; send its message so it serializes properly
                res.json({ code: 400, msg: err.message, data: items });
                return;
            }
            var $ = cheerio.load(sres.text);
            $('div.g-main-bg ul.g-gxlist-imgbox li a').each(function(idx, element) {
                var $element = $(element);
                var $subElement = $element.find('img');
                var thumbImgSrc = $subElement.attr('src');
                items.push({
                    title: $(element).attr('title'),
                    href: $element.attr('href'),
                    thumbSrc: thumbImgSrc
                });
            });
            res.json({ code: 200, msg: "", data: items });
        });
});
var server = app.listen(8081, function() {
    var host = server.address().address
    var port = server.address().port
    console.log("Application instance, access address: http://%s:%s", host, port)
})

Run node demo.js; requesting /index now returns the data we scraped.
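The query-parameter defaulting and route construction in demo.js can be isolated into a small pure helper, which makes it easy to see which URL each request hits (buildRoute is a name invented here for illustration):

```javascript
// Mirrors the parameter handling in demo.js:
// type defaults to 'weixin', page defaults to '1'.
function buildRoute(type, page) {
    type = type || 'weixin';
    page = page || '1';
    return `tx/${type}tx_${page}.html`;
}

console.log(buildRoute());          // tx/weixintx_1.html
console.log(buildRoute('nv', '3')); // tx/nvtx_3.html
```

Appending the result to baseUrl gives the full page URL the crawler fetches for a given type and page.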

That completes a simple Node crawler. I hope you will give the project a star as recognition and support. Thank you.

Keywords: Node.js crawler

Added by dsds1121 on Wed, 08 Dec 2021 22:26:36 +0200