Simple data collection with PHP

Introduction

When it comes to data collection, Python usually comes to mind first: the code is simple, efficient and easy to implement.

So how do you do data collection in PHP? It's simple too.

Concept

What is data collection? Here is the introduction from Baidu Encyclopedia:

Data collection, also known as data acquisition, uses a device or interface to collect data from outside a system and feed it into that system. Data collection technology is widely used in many fields.

Put simply, you can think of it as taking data from other people's websites.

Required packages

1. Guzzle

This is a PHP HTTP client that makes it easy to send HTTP requests and integrate with Web services.

Installation method:

composer require guzzlehttp/guzzle:~6.0

Or add it to composer.json:

{
    "require": {
        "guzzlehttp/guzzle": "~6.0"
    }
}

2. QueryList

QueryList is a general-purpose PHP list collection class based on phpQuery. Thanks to phpQuery, QueryList has almost no learning cost: it uses CSS3 selectors, which makes collecting data in PHP as simple as selecting elements with jQuery. Some features of QueryList:

  1. Easy to learn: only one core API
  2. Easy to use: select page elements with jQuery selectors
  3. Built-in filtering to strip out useless content
  4. Supports nested collection to any depth
  5. Results are returned directly as a list shaped by the collection rules
  6. Supports extensions

We use it to extract content from HTML.

Installation method:

composer require jaeger/querylist:V3.2.1

A collection example

Let's take the LearnKu community as an example. We'll collect post information from the community and store it in a file and in a MySQL database.

1. Install dependencies

Run the following command on the command line:

composer init

Declare the dependencies in composer.json:

{
    "require": {
        "guzzlehttp/guzzle": "~6.0@dev",
        "jaeger/querylist": "V3.2.1"
    },
    "autoload": {
        "psr-4": {
            "App\\": "app/"
        }
    }
}

Install the dependencies:

composer install

2. Collection class

app/Handle/ClientHandle.php

<?php


namespace App\Handle;


use GuzzleHttp\Client;
use QL\QueryList;

class ClientHandle
{
    private $client;

    public function __construct()
    {
        $this->client = new Client(['verify' => false]);
    }

    public function queryBody($url, $rules)
    {
        $html = $this->sendRequest($url);

        $data = QueryList::Query($html, $rules)->getData(function ($item) {
            if (array_key_exists('link',$item)){
                $content = $this->sendRequest($item['link']);
                $item['post'] = QueryList::Query($content, [
                    'title' => ['div.pull-left>span', 'text'],
                    'review' => ['p>span.text-mute:eq(0)', 'text'],
                    'comment' => ['p>span.text-mute:eq(1)', 'text'],
                    'content' => ['div.content-body', 'html'],
                    'created_at' => ['p>a>span', 'title'],
                    'updated_at' => ['p>a:eq(2)', 'data-tooltip']
                ])->data[0];
            }
            return $item;
        });
        // Return the collected results
        return $data;
    }

    private function sendRequest($url)
    {
        $response = $this->client->request('GET', $url, [
            'headers' => [
                'User-Agent' => 'testing/1.0',
                // We expect an HTML page back, not JSON
                'Accept' => 'text/html',
            ],
            'timeout' => 3.14,
        ]);

        // Get the page source as a string
        return (string) $response->getBody();
    }
}

A quick walkthrough:

  1. In the __construct constructor, we instantiate a Guzzle Client to make HTTP requests.
  2. sendRequest takes a URL, makes an HTTP request and returns the HTML source of the target page.
  3. queryBody receives a URL and the collection rules. I won't go into QueryList in detail here; if you know jQuery, you'll pick it up quickly.

    public function queryBody($url, $rules)
    {
        // Request the list page and get its HTML source
        $html = $this->sendRequest($url);

        // Pass the HTML and the rules to QueryList's static Query method to extract the data
        $data = QueryList::Query($html, $rules)->getData(function ($item) {
            // The list page comes first; each item's link then leads to the article's detail page

            // Check whether a link was matched
            if (array_key_exists('link', $item)) {
                // Get the HTML source of the detail page
                $content = $this->sendRequest($item['link']);
                // Hand it to QueryList again to extract the post fields
                $item['post'] = QueryList::Query($content, [
                    'title' => ['div.pull-left>span', 'text'],
                    'review' => ['p>span.text-mute:eq(0)', 'text'],
                    'comment' => ['p>span.text-mute:eq(1)', 'text'],
                    'content' => ['div.content-body', 'html'],
                    'created_at' => ['p>a>span', 'title'],
                    'updated_at' => ['p>a:eq(2)', 'data-tooltip']
                ])->data[0];
                // The result is a list, so only the first element, data[0], is taken
            }
            return $item;
        });

        // Return the collected results
        return $data;
    }

3. PDO class

app/Handle/PdoHandle.php

We use PDO to work with the database. Here I implement a simple wrapper class:

<?php


namespace App\Handle;

class PdoHandle
{
    public $source;

    private $driver;
    private $host;
    private $dbname;
    private $username;
    private $password;

    /**
     * PdoHandle constructor.
     */
    public function __construct($driver = 'mysql', $host = 'localhost', $dbname = 'caiji', $username = 'root', $password = '')
    {
        $this->driver = $driver;
        $this->host = $host;
        $this->dbname = $dbname;
        $this->username = $username;
        $this->password = $password;
        $dsn = $this->driver . ':host=' . $this->host . ';dbname=' . $this->dbname;
        $this->source = new \PDO($dsn, $this->username, $this->password);
        $this->source->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
    }
}

I trust you can follow this, so I won't explain it further.
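One thing the class above leaves to the server default is the connection character set, which matters when the scraped pages contain non-ASCII text. A minimal sketch of a DSN builder that adds charset=utf8mb4, an option the pdo_mysql DSN accepts. The buildDsn function name is hypothetical and not part of the original class:

```php
<?php
// Hypothetical helper: builds a PDO DSN string with an explicit charset.
function buildDsn(string $driver, string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('%s:host=%s;dbname=%s;charset=%s', $driver, $host, $dbname, $charset);
}

// Same defaults as the PdoHandle constructor.
$dsn = buildDsn('mysql', 'localhost', 'caiji');
echo $dsn . "\n";
```

The resulting string could be passed to new \PDO(...) in place of the concatenated $dsn.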

4. Write to file

We write what we've collected into a file

<?php
// Remove the script execution time limit
set_time_limit(0);
// Load the Composer autoloader
require '../vendor/autoload.php';
// Rules: :lt(5) keeps only the first 5 items (index less than 5)
$rules = [
    'title' => ['span.topic-title:lt(5)', 'text'],
    'link' => ['a.topic-title-wrap:lt(5)', 'href']
];

// Collect
$url = "https://learnku.com/laravel";
$client = new \App\Handle\ClientHandle();
$data = $client->queryBody($url, $rules);
// Each item holds the list fields plus a nested 'post' array, so keep only the post details
$data = array_map(function ($item) {
    return $item['post'];
}, $data);

// Write to file
$handle = fopen('2.php','w');
$str = "<?php\n".var_export($data, true).";";
fwrite($handle,$str);
fclose($handle);

After a few seconds, a new 2.php file appears in the directory. If it contains data, the collection succeeded.
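If the exported file should be loadable again later, the export needs a leading return, because a file holding only a bare array expression gives nothing back when included. A minimal sketch of that variant, using only built-in functions. The sample data and the temporary file are made up for illustration and stand in for the collected $data and 2.php:

```php
<?php
// Hypothetical sample standing in for the collected $data.
$data = [
    ['title' => 'First post', 'review' => '10', 'comment' => '2'],
    ['title' => "It's a quote test", 'review' => '5', 'comment' => '0'],
];

// Assumption: a temporary file is used here instead of 2.php.
$file = tempnam(sys_get_temp_dir(), 'posts');

// "return" makes the file hand the array back to whoever includes it.
$str = "<?php\nreturn " . var_export($data, true) . ";\n";
file_put_contents($file, $str);

// Load the data back; var_export escapes quotes, so strings survive the round trip.
$loaded = include $file;
unlink($file);

echo count($loaded) . " posts loaded\n";
```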

5. Write to database

Write the collected content to the database

1. Create the table

First, we create a posts table with the following fields:

`title`, `review`, `comment`, `content`, `created_at`, `updated_at`

I recommend not making created_at and updated_at strict time types or required columns; otherwise the collected values would need extra processing before they can be inserted.
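If the columns are typed as DATETIME after all, the scraped values need normalizing first. A rough sketch, assuming the scraped text arrives in a known format. The toMysqlDatetime name and the input formats are hypothetical, since the article does not show what the page actually returns:

```php
<?php
// Hypothetical helper: converts a scraped timestamp to MySQL DATETIME format,
// or returns null when the text cannot be parsed with the assumed input format.
function toMysqlDatetime(string $raw, string $inputFormat = 'Y-m-d H:i:s'): ?string
{
    // The leading "!" resets fields missing from the format (e.g. seconds) to zero
    // instead of filling them with the current time.
    $date = DateTime::createFromFormat('!' . $inputFormat, trim($raw));
    if ($date === false) {
        return null;
    }
    return $date->format('Y-m-d H:i:s');
}

var_dump(toMysqlDatetime(' 2019-12-15 08:00:42 '));
var_dump(toMysqlDatetime('15/12/2019 08:00', 'd/m/Y H:i'));
var_dump(toMysqlDatetime('3 days ago'));
```

Returning null for unparseable text pairs naturally with a nullable column, so one odd value does not abort the whole insert.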

2. Insert the data

<?php
set_time_limit(0);
require '../vendor/autoload.php';
$rules = [
    'title' => ['span.topic-title', 'text'],
    'link' => ['a.topic-title-wrap', 'href']
];
$url = "https://learnku.com/laravel";

$client = new \App\Handle\ClientHandle();
$data = $client->queryBody($url, $rules);
$data = array_map(function ($item) {
    return $item['post'];
}, $data);

// Start the SQL statement
$sql = "INSERT INTO `posts`(`title`, `review`, `comment`, `content`,`created_at`,`updated_at`) VALUES";

// Drop any items that do not have all six fields, to avoid trouble when inserting
$data = array_filter($data,function($item){
    return count($item) == 6;
});
// Reset the array keys (array_values keeps the original order)
$data = array_values($data);
// Build the SQL statement
foreach ($data as $key => $item) {
    // The content contains HTML tags, so base64-encode it before storing
    $item['content'] = base64_encode($item['content']);
    $value = "'" . implode("','", array_values($item)) . "'";
    $sql .= "($value)";
    if (count($data) - 1 != $key) {
        $sql .= ",";
    }
}

// Store in the database
$db = new \App\Handle\PdoHandle();

try {
    $db->source->query($sql);
    echo 'Successful collection and repository!';
} catch (PDOException $exception) {
    echo $exception->getMessage();
}

After a few seconds, you will see "Successful collection and repository!" printed on the page, which means the data was stored.
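Concatenating values into the SQL string breaks as soon as a title contains a single quote, and it is open to SQL injection. A minimal sketch of the same insert with a prepared statement. It assumes an in-memory SQLite database (which needs the pdo_sqlite extension) in place of MySQL so that it runs on its own, and the sample row is made up:

```php
<?php
// Assumption: in-memory SQLite stands in for the MySQL database used in the article.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (title TEXT, review TEXT, comment TEXT, content TEXT, created_at TEXT, updated_at TEXT)');

// Hypothetical sample row; note the single quote in the title.
$data = [[
    'title' => "PHP's PDO",
    'review' => '12',
    'comment' => '3',
    'content' => '<p>Hello</p>',
    'created_at' => '2019-12-15 08:00:42',
    'updated_at' => '2019-12-15 08:00:42',
]];

// Named placeholders: the driver handles quoting, so no manual escaping is needed.
$stmt = $pdo->prepare(
    'INSERT INTO posts (title, review, comment, content, created_at, updated_at)
     VALUES (:title, :review, :comment, :content, :created_at, :updated_at)'
);

$pdo->beginTransaction();
foreach ($data as $item) {
    // Kept so that a reader script calling base64_decode still works.
    $item['content'] = base64_encode($item['content']);
    // The array keys must match the placeholder names exactly.
    $stmt->execute($item);
}
$pdo->commit();

$rows = $pdo->query('SELECT * FROM posts')->fetchAll(PDO::FETCH_ASSOC);
echo count($rows) . " row(s) stored\n";
```

With the article's PdoHandle class, the same prepare and execute calls would go through $db->source.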

We can also collect just the first few posts by rewriting the $rules.

For example, to take only the first five:

$rules = [
    'title' => ['span.topic-title:lt(5)', 'text'],
    'link' => ['a.topic-title-wrap:lt(5)', 'href']
];
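Back in the insert script, the comma logic compares $key with count($data) - 1, so it depends on the keys running from 0 without gaps once incomplete items are filtered out. A small sketch with made-up items showing why the keys need resetting: array_filter keeps the original keys, and array_values renumbers them without changing the order:

```php
<?php
// Hypothetical collected items; the second one is incomplete.
$data = [
    ['title' => 'A', 'review' => '1'],
    ['title' => 'B'],
    ['title' => 'C', 'review' => '3'],
];

// array_filter keeps the original keys, leaving a gap at index 1.
$filtered = array_filter($data, function ($item) {
    return count($item) == 2;
});

// array_values renumbers the keys from 0 and preserves the order.
$reindexed = array_values($filtered);

var_dump(array_keys($filtered));
var_dump(array_keys($reindexed));
```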

6. Read data

Read data using PDO

<?php
require '../vendor/autoload.php';

$db = new \App\Handle\PdoHandle();
// Query the first 10 rows
$sql = "select * from `posts` limit 0,10";

$pdoStatement = $db->source->query($sql);
$data = $pdoStatement->fetchAll(PDO::FETCH_ASSOC);

foreach ($data as &$item){
    // Decode the base64-encoded content
    $item['content'] = base64_decode($item['content']);
}
// Break the reference left over from the loop
unset($item);

var_dump($data);

End of Article

I hope you got something out of this. I also uploaded the code to GitHub, so anyone who needs it can pull it down and take a look.


Keywords: PHP SQL PDO JQuery

Added by agnalleo on Sun, 15 Dec 2019 08:00:42 +0200