Introduction
When people think of data collection, Python usually comes to mind first: the code is simple, efficient, and easy to write.
So how do you collect data with PHP? It is just as simple.
Concept
What is data collection? Here is how Baidu Encyclopedia introduces it:
Data collection, also known as data acquisition, is an interface that uses a device to collect data from outside a system and feed it into that system. Data collection technology is widely used in many fields.
Put simply, you can think of it as grabbing data from other people's websites.
Required Packages
1. Guzzle
This is a PHP HTTP client that makes it easy to send HTTP requests and integrate with Web services.
Installation method:
composer require guzzlehttp/guzzle:~6.0
Or:
Add it to composer.json:
{
    "require": {
        "guzzlehttp/guzzle": "~6.0"
    }
}
2. QueryList
QueryList is a general-purpose PHP list collection class based on phpQuery. Thanks to phpQuery, QueryList has almost no learning curve: it uses CSS3 selectors, which makes collecting data in PHP as simple as selecting elements with jQuery. Some features of QueryList:
- Easy to learn: only one core API
- Easy to use: select page elements with jQuery selectors
- Built-in filtering to strip out useless content
- Supports unlimited levels of nested collection
- Results are returned directly as a list shaped by the collection rules
- Supports extensions
We can use it to filter HTML content.
Installation method:
composer require jaeger/querylist:V3.2.1
Collection Example
Let's take the LearnKu community as an example. We'll collect post information from the community and store it in a file and in a MySQL database.
1. Install Dependencies
Run the following command on the command line:
composer init
Declare the dependencies in composer.json:
{
    "require": {
        "guzzlehttp/guzzle": "~6.0@dev",
        "jaeger/querylist": "V3.2.1"
    },
    "autoload": {
        "psr-4": {
            "App\\": "app/"
        }
    }
}
Install the dependencies:
composer install
2. Collection Class
app/Handle/ClientHandle.php
<?php

namespace App\Handle;

use GuzzleHttp\Client;
use QL\QueryList;

class ClientHandle
{
    private $client;

    public function __construct()
    {
        $this->client = new Client(['verify' => false]);
    }

    public function queryBody($url, $rules)
    {
        $html = $this->sendRequest($url);
        $data = QueryList::Query($html, $rules)->getData(function ($item) {
            if (array_key_exists('link', $item)) {
                $content = $this->sendRequest($item['link']);
                $item['post'] = QueryList::Query($content, [
                    'title' => ['div.pull-left>span', 'text'],
                    'review' => ['p>span.text-mute:eq(0)', 'text'],
                    'comment' => ['p>span.text-mute:eq(1)', 'text'],
                    'content' => ['div.content-body', 'html'],
                    'created_at' => ['p>a>span', 'title'],
                    'updated_at' => ['p>a:eq(2)', 'data-tooltip']
                ])->data[0];
            }
            return $item;
        });
        // Return the collection results
        return $data;
    }

    private function sendRequest($url)
    {
        $response = $this->client->request('GET', $url, [
            'headers' => [
                'User-Agent' => 'testing/1.0',
                'Accept' => 'application/json',
                'X-Foo' => ['Bar', 'Baz']
            ],
            'form_params' => [
                'foo' => 'bar',
                'baz' => ['hi', 'there!']
            ],
            'timeout' => 3.14,
        ]);
        $body = $response->getBody();
        // Get the page source
        $html = (string)$body;
        return $html;
    }
}
Brief analysis:
- In the __construct constructor, we instantiate a Guzzle Client, which is used to send HTTP requests.
- sendRequest takes a URL, sends an HTTP request, and returns the HTML source of the target page.
- queryBody takes a URL and the collection rules. I won't go into QueryList in depth here; if you know jQuery, you'll get the hang of it quickly.
public function queryBody($url, $rules)
{
    // Send the request and receive the HTML source
    $html = $this->sendRequest($url);
    // Pass $html and $rules to QueryList's static Query method and get the data
    $data = QueryList::Query($html, $rules)->getData(function ($item) {
        // We collect the list page first, then follow each item's link to get the post details
        // Check whether a link was matched
        if (array_key_exists('link', $item)) {
            // Get the HTML source of the detail page
            $content = $this->sendRequest($item['link']);
            // Hand it to QueryList for extraction
            $item['post'] = QueryList::Query($content, [
                'title' => ['div.pull-left>span', 'text'],
                'review' => ['p>span.text-mute:eq(0)', 'text'],
                'comment' => ['p>span.text-mute:eq(1)', 'text'],
                'content' => ['div.content-body', 'html'],
                'created_at' => ['p>a>span', 'title'],
                'updated_at' => ['p>a:eq(2)', 'data-tooltip']
            ])->data[0]; // The result is a list, so take only the first element, data[0]
        }
        return $item;
    });
    // Return the collection results
    return $data;
}
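To see the list-then-detail pattern on its own, here is a minimal sketch that swaps in PHP's built-in DOM extension for QueryList and uses canned HTML instead of real HTTP requests. The `$pages` array, the `$fetch` closure, and the `selectAll` helper are all hypothetical names made up for this illustration; they are not part of QueryList or Guzzle.

```php
<?php
// Hypothetical stand-ins for real pages; a real run would fetch these over HTTP.
$pages = [
    '/list' => '<ul>
        <li><a class="topic" href="/post/1">First post</a></li>
        <li><a class="topic" href="/post/2">Second post</a></li>
    </ul>',
    '/post/1' => '<div class="content-body"><p>Hello</p></div>',
    '/post/2' => '<div class="content-body"><p>World</p></div>',
];

// Hypothetical replacement for sendRequest(): returns canned HTML for a URL.
$fetch = function (string $url) use ($pages): string {
    return $pages[$url] ?? '';
};

// Hypothetical helper: run an XPath query and return text (or one attribute) per match.
function selectAll(string $html, string $xpath, ?string $attr = null): array
{
    if ($html === '') {
        return [];
    }
    libxml_use_internal_errors(true); // tolerate imperfect real-world markup
    $dom = new DOMDocument();
    $dom->loadHTML($html);
    $values = [];
    foreach ((new DOMXPath($dom))->query($xpath) as $node) {
        $values[] = $attr === null ? trim($node->textContent) : $node->getAttribute($attr);
    }
    return $values;
}

// First level: titles and links from the list page.
$listHtml = $fetch('/list');
$titles = selectAll($listHtml, '//a[@class="topic"]');
$links = selectAll($listHtml, '//a[@class="topic"]', 'href');

// Second level: follow each link and attach the detail under the 'post' key.
$posts = [];
foreach ($links as $i => $link) {
    $item = ['title' => $titles[$i], 'link' => $link];
    $content = selectAll($fetch($link), '//div[@class="content-body"]');
    if ($content) {
        $item['post'] = ['content' => $content[0]];
    }
    $posts[] = $item;
}

print_r($posts);
```

The shape of `$posts` mirrors what queryBody returns: each list item carries its own nested `post` array, which is why later steps flatten it with array_map.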
3. PDO Class
app/Handle/PdoHandle.php
We use PDO to work with the database. Here I implement just a simple class:
<?php

namespace App\Handle;

class PdoHandle
{
    public $source;
    private $driver;
    private $host;
    private $dbname;
    private $username;
    private $password;

    /**
     * PdoHandle constructor.
     */
    public function __construct($driver = 'mysql', $host = 'localhost', $dbname = 'caiji', $username = 'root', $password = '')
    {
        $this->driver = $driver;
        $this->host = $host;
        $this->dbname = $dbname;
        $this->username = $username;
        $this->password = $password;
        $dsn = $this->driver . ':host=' . $this->host . ';dbname=' . $this->dbname;
        $this->source = new \PDO($dsn, $this->username, $this->password);
        $this->source->setAttribute(\PDO::ATTR_ERRMODE, \PDO::ERRMODE_EXCEPTION);
    }
}
I trust you can follow this class, so I won't explain it further.
4. Write to file
Now we write the collected data to a file:
<?php
// Remove the execution time limit
set_time_limit(0);
// Load the Composer autoloader
require '../vendor/autoload.php';

// Rules: :lt(5) keeps only the first five matches (index less than 5)
$rules = [
    'title' => ['span.topic-title:lt(5)', 'text'],
    'link' => ['a.topic-title-wrap:lt(5)', 'href']
];

// Collect
$url = "https://learnku.com/laravel";
$client = new \App\Handle\ClientHandle();
$data = $client->queryBody($url, $rules);

// We requested two levels, so flatten the result into a one-level array
$data = array_map(function ($item) {
    return $item['post'];
}, $data);

// Write the file
$handle = fopen('2.php', 'w');
$str = "<?php\n" . var_export($data, true) . ";";
fwrite($handle, $str);
fclose($handle);
After a few seconds, you will see a new 2.php file in the directory. If it contains data, the collection succeeded.
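The file written above holds a bare array expression, so including it later does not hand the data back. One possible variation, sketched here under the assumption that you want to reload the data from PHP, is to prefix the export with `return`. The sample `$data` and the temporary file are hypothetical and only keep the sketch self-contained.

```php
<?php
// Hypothetical sample standing in for the collected $data.
$data = [
    ['title' => "It's a post", 'content' => '<p>Hello</p>'],
    ['title' => 'Another post', 'content' => '<p>World</p>'],
];

// A temporary file is used so the sketch does not touch the project directory.
$file = tempnam(sys_get_temp_dir(), 'posts');

// "return" makes the file evaluate to the array when it is required.
file_put_contents($file, "<?php\nreturn " . var_export($data, true) . ";\n");

// Load the array back from the file, then clean up.
$loaded = require $file;
unlink($file);

var_dump($loaded === $data);
```

Because var_export escapes quotes itself, titles containing apostrophes survive the round trip without any extra encoding.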
5. Write to database
Next, write the collected content to the database.
1. Create the table
First, we create a posts table with the following fields:
`title`, `review`, `comment`, `content`,`created_at`,`updated_at`
I don't recommend making created_at and updated_at a strict time type or required (NOT NULL); otherwise the collected values will need extra processing before they can be inserted.
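As a rough sketch of such a table: only the six column names come from this article, while every column type below is an assumption you should adjust to your data. In-memory SQLite (which needs the pdo_sqlite extension) is used here purely so the statement can be tried without a MySQL server.

```php
<?php
// Column types are assumptions; only the six column names come from the article.
// The time columns are plain nullable strings, following the advice above.
$ddl = <<<SQL
CREATE TABLE `posts` (
    `title`      VARCHAR(255),
    `review`     VARCHAR(64),
    `comment`    VARCHAR(64),
    `content`    MEDIUMTEXT,
    `created_at` VARCHAR(64),
    `updated_at` VARCHAR(64)
)
SQL;

// In-memory SQLite stands in for MySQL so the sketch runs without a server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec($ddl);

// List the created columns to confirm the table shape.
$columns = array_column(
    $pdo->query('PRAGMA table_info(posts)')->fetchAll(PDO::FETCH_ASSOC),
    'name'
);
print_r($columns);
```

Against MySQL you would run the same statement through the PdoHandle connection; the PRAGMA check is SQLite-specific and is only there to verify the sketch.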
2. Insert the data
<?php
set_time_limit(0);
require '../vendor/autoload.php';

$rules = [
    'title' => ['span.topic-title', 'text'],
    'link' => ['a.topic-title-wrap', 'href']
];
$url = "https://learnku.com/laravel";
$client = new \App\Handle\ClientHandle();
$data = $client->queryBody($url, $rules);
$data = array_map(function ($item) {
    return $item['post'];
}, $data);

// Start the SQL statement
$sql = "INSERT INTO `posts`(`title`, `review`, `comment`, `content`,`created_at`,`updated_at`) VALUES";

// Filter out rows that do not have all six fields, to avoid trouble when inserting
$data = array_filter($data, function ($item) {
    return count($item) == 6;
});

// Reset the array keys (array_values reindexes without reordering the rows)
$data = array_values($data);

// Build the VALUES part of the SQL statement
foreach ($data as $key => $item) {
    // The content contains HTML tags, so base64-encode it before storing
    $item['content'] = base64_encode($item['content']);
    $value = "'" . implode("','", array_values($item)) . "'";
    $sql .= "($value)";
    if (count($data) - 1 != $key) {
        $sql .= ",";
    }
}

// Store in the database
$db = new \App\Handle\PdoHandle();
try {
    $db->source->query($sql);
    echo 'Successful collection and storage!';
} catch (PDOException $exception) {
    echo $exception->getMessage();
}
After a few seconds, you will see "Successful collection and storage!" printed on the web page, which means it worked.
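The INSERT statement above is built by concatenating values into the SQL string, so a title that contains a single quote can break it. A possible alternative, shown only as a sketch, is a PDO prepared statement with named placeholders. The sample row is hypothetical, and in-memory SQLite (pdo_sqlite extension) replaces MySQL so the example runs without a server.

```php
<?php
// In-memory SQLite stands in for MySQL so the sketch runs without a server.
$pdo = new PDO('sqlite::memory:');
$pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$pdo->exec('CREATE TABLE posts (title TEXT, review TEXT, comment TEXT, content TEXT, created_at TEXT, updated_at TEXT)');

// Hypothetical row shaped like one collected post.
$data = [
    [
        'title' => "What's new in Laravel",
        'review' => '10',
        'comment' => '2',
        'content' => '<p class="lead">Quotes \' and " are fine</p>',
        'created_at' => '2020-01-01 10:00:00',
        'updated_at' => '2020-01-02 11:00:00',
    ],
];

// Placeholders keep the values out of the SQL text, so quotes need no escaping.
$stmt = $pdo->prepare(
    'INSERT INTO posts (title, review, comment, content, created_at, updated_at)
     VALUES (:title, :review, :comment, :content, :created_at, :updated_at)'
);

// One transaction for all rows; each row's keys must match the placeholders exactly.
$pdo->beginTransaction();
foreach ($data as $item) {
    $stmt->execute($item);
}
$pdo->commit();

$row = $pdo->query('SELECT title, content FROM posts')->fetch(PDO::FETCH_ASSOC);
var_dump($row['content'] === $data[0]['content']);
```

With this approach the HTML content can be stored as-is, so the base64 encoding step (and the matching decode when reading) becomes optional.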
We can also collect only the first few posts by rewriting $rules.
For example, to take only the first five:
$rules = [
    'title' => ['span.topic-title:lt(5)', 'text'],
    'link' => ['a.topic-title-wrap:lt(5)', 'href']
];
6. Read data
Read the data back using PDO:
<?php
require '../vendor/autoload.php';

$db = new \App\Handle\PdoHandle();

// Query
$sql = "select * from `posts` limit 0,10";
$pdoStatement = $db->source->query($sql);
$data = $pdoStatement->fetchAll(PDO::FETCH_ASSOC);

foreach ($data as &$item) {
    // Decode the base64-encoded content
    $item['content'] = base64_decode($item['content']);
}
unset($item); // break the reference left over from the foreach

var_dump($data);
End of Article
I hope you got something out of this article. I have also uploaded the code to GitHub, so anyone who needs it can pull it down and take a look.