Sharing a Wave of Go Crawlers

[TOC]


Let's first review what we covered last time about using Go to send email:


  • What email is
  • What the mail protocols are
  • How to send email using Golang
  • How to send email with plain text, HTML content, attachments, etc.
  • How to send email with CC and BCC recipients
  • How to improve the performance of sending mail

If you want to see how to send mail with Go, please check the article How to send mail using Golang

I also remember that we previously shared a simple article about crawling dynamic web data with Go: Golang+chromedp+goquery simple crawling dynamic data (Go theme month)

If you're interested, you can study how to use the chromedp framework in more detail

Today, let's look at how to use Go to crawl static web page data

What are static web pages and dynamic web pages

What is static web data?

  • A static page contains no program code, only HTML (hypertext markup language); the file suffix is usually .html, .htm, .xml, etc.
  • Another feature of static pages is that anyone can open them directly, and no matter who opens the page or when, its content stays the same: the HTML code is fixed, so the rendered result is fixed

By the way, what is a dynamic web page

  • A dynamic web page is a product of web programming technology

    In addition to HTML markup, the files of a dynamic page also contain program code that implements specific functions

    This code mainly handles interaction between the browser and the server: the server can generate page content dynamically according to different client requests, which is very flexible

In other words, although the page code of a dynamic web page does not change, the displayed content can change over time, across environments, and as the database changes
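To make the idea concrete, here is a tiny illustrative Go server (not part of the original article's code, just a sketch): it builds the HTML at request time, so what the browser sees depends on the request and on when it arrives.

package main

import (
   "fmt"
   "log"
   "net/http"
   "time"
)

func main() {
   http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
      // The HTML is generated per request instead of being read from a fixed file,
      // so the content changes with the query parameter and the current time
      fmt.Fprintf(w, "<html><body>Hello %s, it is now %s</body></html>",
         r.URL.Query().Get("name"), time.Now().Format(time.RFC3339))
   })
   log.Fatal(http.ListenAndServe(":8080", nil))
}

Opening /?name=foo at different times returns different HTML, even though the server code never changes.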

Using Go to crawl static web page data

Let's crawl some static web page data. For example, we'll crawl this website and extract the account and password information from its pages:

http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696

The steps for crawling this website are:

  • Specify the website to crawl
  • Fetch the data with an HTTP GET request
  • Convert the byte array to a string
  • Use regular expressions to match the content we want (this is important: in practice, most of the time spent crawling static pages goes into regex matching and data processing)
  • Filter the data, de-duplicate it, and so on (this step varies with individual needs and the target website)

Let's write a demo to crawl the account and password information from the site above; this is the kind of information we want from the page. We are doing this only for learning, and it should not be used for anything bad

package main

import (
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
)

const (
   // Regular expression to match the Xunlei account and password
   reAccount = `(account number|Xunlei account)(;|: )[0-9:]+(| )password:[0-9a-zA-Z]+`
)

// Get website account password
func GetAccountAndPwd(url string) {
   // Get site data
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // The data content to be read is bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert byte array to string
   str := string(dataBytes)

   // Filter XL account and password
   re := regexp.MustCompile(reAccount)

   // How many matches to return; -1 means match all
   results := re.FindAllStringSubmatch(str, -1)

   // Output results
   for _, result := range results {
      log.Println(result[0])
   }
}

func main() {
   // Simply set the log parameter
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   // Enter the website address and start crawling data
   GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696")
}

The result of running the above code is as follows:

2021/06/xx xx:05:25 main.go:46: Account No.: 357451317 password: 110120 a
2021/06/xx xx:05:25 main.go:46: Account No.: 907812219 password: 810303
2021/06/xx xx:05:25 main.go:46: Account No.: 797169897 password: zxcvbnm132
2021/06/xx xx:05:25 main.go:46: Xunlei Account No.: 792253782:1 Password: 283999
2021/06/xx xx:05:25 main.go:46: Xunlei Account No.: 147643189:2 Password: 344867
2021/06/xx xx:05:25 main.go:46: Xunlei Account No.: 147643189:1 Password: 267297

As you can see, both the "Account" entries and the "Xunlei Account" entries have been crawled. Crawling static web page content is really not difficult; the time is basically spent on regular expression matching and data processing

Mapping the crawling steps above to the code, we get the following:

  • Visit the website: http.Get(url)
  • Read the data content: ioutil.ReadAll
  • Convert the data to a string
  • Set the regex matching rule: regexp.MustCompile(reAccount)
  • Start filtering the data, optionally limiting the number of matches: re.FindAllStringSubmatch(str, -1)

Of course, real-world work is rarely this simple.

For example, the data crawled from a website may have an inconsistent format, be full of messy special characters and follow no clear pattern, or the data may even be dynamic, so a plain GET request cannot fetch it at all.

However, all of these problems can be solved: design a different scheme and a different way of processing the data for each problem. I believe anyone who runs into them will be able to work them out; when facing a problem, we must have the determination to solve it.
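As one small example of that data-processing step, the matched results can be de-duplicated before they are printed or saved. Below is a minimal sketch, assuming the matched strings (for instance each result[0] from FindAllStringSubmatch) are first collected into a slice; the helper name dedupe is my own, not from the article.

// dedupe removes duplicate strings while keeping their original order
func dedupe(items []string) []string {
   seen := make(map[string]bool, len(items))
   unique := make([]string, 0, len(items))
   for _, item := range items {
      if !seen[item] {
         seen[item] = true
         unique = append(unique, item)
      }
   }
   return unique
}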

Crawling pictures

After the example above, let's try crawling image data from a web page, for example by searching Baidu Images for Shiba Inu pictures


Let's copy the URL from the browser's address bar and use it to crawl the data:

https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC

Since there are a lot of pictures, we limit the match to just two of them

Let's look at the demo:

  • Incidentally, the logic that GETs the URL and converts the response data into a string has been extracted into a small helper function, getStr
  • GetPic does the regex matching; the number of matches can be configured, and here we set it to 2
package main

import (
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
)

const (
   // Regular expression to match the Xunlei account and password
   reAccount = `(account number|Xunlei account)(;|: )[0-9:]+(| )password:[0-9a-zA-Z]+`
   // Regular expression to match the picture
   rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

func getStr(url string) string {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // The data content to be read is bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert byte array to string
   str := string(dataBytes)
   return str
}

// Get website account password
func GetAccountAndPwd(url string, n int) {
   str := getStr(url)
   // Filter XL account and password
   re := regexp.MustCompile(reAccount)

   // How many matches to return; -1 means match all
   results := re.FindAllStringSubmatch(str, n)

   // Output results
   for _, result := range results {
      log.Println(result[0])
   }
}

// Get pictures from the website
func GetPic(url string, n int) {

   str := getStr(url)

   // Filter pictures
   re := regexp.MustCompile(rePic)

   // How many matches to return; -1 means match all
   results := re.FindAllStringSubmatch(str, n)

   // Output results
   for _, result := range results {
      log.Println(result[0])
   }
}


func main() {
   // Simply set log parameters
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   //GetAccountAndPwd("http://www.ucbug.com/jiaocheng/63149.html?_t=1582307696", -1)
   GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC",2)
}

Run the above code; the results are as follows (no de-duplication is done):

2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg
2021/06/xx xx:06:39 main.go:63: https://ss1.bdstatic.com/70cFuXSh_Q1YnxGkpoWK1HF6hhy/it/u=4246005838,1103140037&fm=26&gp=0.jpg

Sure enough, that's what we want, but the crawled results are just picture links, and merely printing them certainly can't meet the needs of a real crawler. We need to download the crawled pictures so we can actually use them

For convenience of demonstration, let's add a small file-download function to the code above and download just the first picture

package main

import (
   "fmt"
   "io/ioutil"
   "log"
   "net/http"
   "regexp"
   "strings"
   "time"
)

const (
   // Regular expression to match the picture
   rePic = `https?://[^"]+?(\.((jpg)|(png)|(jpeg)|(gif)|(bmp)))`
)

// Get web page data and convert the data into a string
func getStr(url string) string {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   // The data content to be read is bytes
   dataBytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // Convert byte array to string
   str := string(dataBytes)
   return str
}

// Get picture data
func GetPic(url string, n int) {

   str := getStr(url)

   // Filter pictures
   re := regexp.MustCompile(rePic)

   // How many matches to return; -1 means match all
   results := re.FindAllStringSubmatch(str, n)

   // Output results
   for _, result := range results {
      // Get the specific picture name
      fileName := GetFilename(result[0])
      // Download the picture
      DownloadPic(result[0], fileName)
   }
}

// Gets the name of the file
func GetFilename(url string) (filename string) {
   // Find the index of the last '='
   lastIndex := strings.LastIndex(url, "=")
   // Take the substring after it, which is the original file name
   filename = url[lastIndex+1:]

   // Prepend a timestamp to the original name to form a new name
   prefix := fmt.Sprintf("%d", time.Now().Unix())
   filename = prefix + "_" + filename

   return filename
}

func DownloadPic(url string, filename string) {
   resp, err := http.Get(url)
   if err != nil {
      log.Fatal("http.Get error : ", err)
   }
   defer resp.Body.Close()

   bytes, err := ioutil.ReadAll(resp.Body)
   if err != nil {
      log.Fatal("ioutil.ReadAll error : ", err)
   }

   // File storage path
   filename = "./" + filename

   // Write files and set file permissions
   err = ioutil.WriteFile(filename, bytes, 0666)
   if err != nil {
      log.Fatal("wirte failed !!", err)
   } else {
      log.Println("ioutil.WriteFile successfully , filename = ", filename)
   }
}

func main() {
   // Simply set log parameters
   log.SetFlags(log.Lshortfile | log.LstdFlags)
   GetPic("https://image.baidu.com/search/index?tn=baiduimage&ps=1&ct=201326592&lm=-1&cl=2&nc=1&ie=utf-8&word=%E6%9F%B4%E7%8A%AC", 1)
}

In the code above, we added two functions to help us rename the picture and download it:

  • GetFilename gets the file name and prefixes it with a timestamp
  • DownloadPic downloads the specific picture into the current directory

Run the above code and you will see the following output:

2021/06/xx xx:50:04 main.go:91: ioutil.WriteFile successfully , filename =  ./1624377003_0.jpg

A picture named 1624377003_0.jpg has been successfully downloaded into the current directory


Some readers will say: downloading pictures one at a time in a single goroutine is too slow. Can we download faster, with several goroutines downloading at the same time?

And crawl even more pictures of our little Shiba Inu

Do you still remember the Go channel and sync package we talked about before, in Sharing of GO channel and sync package? This is a good chance to practice them. The feature is fairly simple, so let me just describe the general idea; if you are interested, give it a try yourself (a rough starting-point sketch follows the list below):

  • Read the data from the website above and convert it into a string
  • Use regex matching to pull out the list of picture links
  • Put each picture link into a buffered channel, with the buffer size set to 100 for now
  • Start three goroutines that concurrently read links from the channel and download them locally; for renaming the files, refer to the code above
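Here is a rough sketch of that idea, assuming it is added to the same file as the previous demo (so rePic, getStr, GetFilename and DownloadPic are available) and that "sync" is added to the import list; the function name ConcurrentGetPic is my own, not from the original article.

// ConcurrentGetPic pushes matched picture links into a buffered channel and
// lets three goroutines download them concurrently
func ConcurrentGetPic(url string, n int) {
   str := getStr(url)

   // Match the picture links just like GetPic does
   re := regexp.MustCompile(rePic)
   results := re.FindAllStringSubmatch(str, n)

   // Buffered channel holding up to 100 picture links
   picChan := make(chan string, 100)

   // Start three goroutines that read links from the channel and download them
   var wg sync.WaitGroup
   for i := 0; i < 3; i++ {
      wg.Add(1)
      go func() {
         defer wg.Done()
         for link := range picChan {
            DownloadPic(link, GetFilename(link))
         }
      }()
   }

   // Send every matched link into the channel, then close it so the goroutines exit
   for _, result := range results {
      picChan <- result[0]
   }
   close(picChan)

   // Wait for all downloads to finish
   wg.Wait()
}

Calling ConcurrentGetPic(url, -1) from main instead of GetPic would then download all matched pictures with three workers.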

How about it, friends? If you are interested, give it a try. And if you have ideas about crawling dynamic data, feel free to share them so we can discuss and improve together

Summary

  • A brief description of static and dynamic web pages
  • Crawling simple static web page data with Go
  • Crawling pictures from web pages with Go
  • Concurrently crawling resources from web pages

Welcome to like, follow, and bookmark

Friends, your support and encouragement are what keep me sharing and improving the quality of these posts

Well, that's all for this time

Technology is open, and our mentality should be open. Embrace change, live in the sun and strive to move forward.

I'm Nezha, the little devil boy. Likes, follows, and bookmarks are welcome. See you next time~

Keywords: Go
