Previously, in the article "Automation with GitHub Actions", I introduced how to use GitHub Actions to automatically submit links to Baidu. The blog framework at the time was Hexo, paired with a plugin that automatically generated the article links. This site has since been migrated from Hexo to Gridsome, and after some fiddling the migration finally succeeded, although some old filler articles were deleted along the way ~in truth, I was too lazy to make cover images for them~. In between, I also toyed with the idea of managing articles through Forestry (a blog CMS), but on second thought, a browser-based management system makes for a poor writing experience, and the source is already on GitHub anyway. Writing locally in a Markdown editor and then pushing to GitHub feels much better.
Preface
Enough digressing; back to the point. In the previous setup, Hexo's hexo-baidu-url-submit plugin handled both link generation and link pushing. I have to admire Hexo's excellent ecosystem, not just the plugins but the themes too; that is probably Hexo's biggest advantage. In contrast, Gridsome has very few themes. Although it has 172 plugins, roughly on par with Hexo, most of them are poorly localized. After all, most of its users are overseas.
On Gridsome's official website, the team describes it as SEO-friendly:
Gridsome sites load as static HTML before they hydrate into fully Vue.js-powered SPAs. This makes it possible for search engines to be able to crawl content and give better SEO ranking, and still have all the power of Vue.js.
So I thought of the sitemap. Needless to say, Google will reliably index a sitemap once it is submitted; Baidu is less certain, but it is better than nothing. Of course, I also thought of the API Baidu provides. A quick search turned up @gridsome/plugin-sitemap, so pushing via Baidu's API had to be shelved for a while. Today, bored, I thought of it again: now that I have a sitemap, maybe I can extract the links from it directly and push them. A search shows there are websites that extract the links from a sitemap, but extracting manually and then writing a script around the API is rather troublesome; better to integrate the extraction directly.
Extracting the links
A look at the sitemap XML shows that the link format is fixed:
```xml
<url>
  <loc>https://blog.jalenchuh.cn/</loc>
</url>
```
So we just need a regular expression to grab the content between `<loc>` and `</loc>`:
```
(?<=<loc>).*?(?=</loc>)
```
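As a quick sanity check, the pattern can be run against the sample entry above; a minimal sketch in Python:

```python
import re

# Sample <url> entry from the sitemap snippet above
sample = '<url> <loc>https://blog.jalenchuh.cn/</loc> </url>'

# The lookbehind/lookahead pair keeps only the text between <loc> and </loc>
print(re.findall(r'(?<=<loc>).*?(?=</loc>)', sample))
# ['https://blog.jalenchuh.cn/']
```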
Push
Both curl and POST are convenient. However, pushing with POST takes about 30 s, while curl takes only about 2 s.
The official documentation also gives a curl example:
```bash
curl -H 'Content-Type:text/plain' --data-binary @urls.txt "http://data.zz.baidu.com/urls?site=xxx&token=xxx"
```
Writing the link-generation script
Open the sitemap with urllib.request.urlopen and decode it with decode('utf-8'). In local testing, omitting the UTF-8 decoding produced inexplicable errors.
Use re.compile to build the pattern and re.findall to grab all matching strings directly; then a simple for loop takes care of the rest.
```python
import re
import urllib.request  # 'import urllib' alone does not expose urllib.request
import requests        # only needed for the optional POST push below

sitemap = 'https://blog.jalenchuh.cn/sitemap.xml'
html = urllib.request.urlopen(sitemap).read().decode('utf-8')
result = re.findall(re.compile(r'(?<=<loc>).*?(?=</loc>)'), html)

# The with block closes the file itself; no explicit close() needed
with open('urls.txt', 'w') as file:
    for data in result:
        print(data, file=file)
```
The script above uses Python to generate the txt file, which curl will push later. Of course, you can also push with POST, but it is much slower.
First, install requests and define the headers:
```python
headers = {
    'User-Agent': 'curl/7.12.1',
    'Host': 'data.zz.baidu.com',
    'Content-Type': 'text/plain',
    'Content-Length': '83'
}
```
Then push with requests.post:
```python
for data in result:
    print(data + '\n' + requests.post(
        url=url,
        data=data,
        headers=headers
    ).text + '\n')
```
The url here refers to the push endpoint containing the token. Use the placeholder BAIDU_TOKEN in place of the real token value first, then swap the real value back in with sed -i inside the Action.
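Putting the pieces together, here is a minimal sketch of what the POST-based push script might look like. The site parameter is an assumption based on the blog's domain, reading from urls.txt rather than reusing result is only for self-containment, and BAIDU_TOKEN is the placeholder that sed -i replaces later:

```python
import requests

headers = {
    'User-Agent': 'curl/7.12.1',
    'Host': 'data.zz.baidu.com',
    'Content-Type': 'text/plain',
}

# BAIDU_TOKEN is a placeholder; the Action's sed -i step substitutes the real token.
# The site parameter is an assumption based on the blog's domain.
url = 'http://data.zz.baidu.com/urls?site=blog.jalenchuh.cn&token=BAIDU_TOKEN'

# Push each extracted link and print Baidu's response for it
with open('urls.txt') as f:
    for data in f.read().splitlines():
        print(data, requests.post(url=url, data=data, headers=headers).text)
```

Note that requests computes Content-Length automatically for each request body, so the hard-coded value from the header example is omitted in this sketch.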
Writing the Action file
```yaml
name: push

on:
  schedule:
    - cron: '0 16 * * *'
  watch:
    types: [started]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@master
      - name: Set up python
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install requests
        run: pip install requests
      - name: generate
        run: python generate.py
      - name: Push
        env:
          BAIDU_TOKEN: ${{ secrets.BAIDUTOKEN }}
          SITE: ${{ secrets.SITE }}
        run: curl -H 'Content-Type:text/plain' --data-binary @urls.txt "http://data.zz.baidu.com/urls?site=${SITE}&token=${BAIDU_TOKEN}"
```
If you push with POST instead, just run the Python file directly; but before that, the BAIDU_TOKEN placeholder in the file needs to be replaced:
```yaml
- name: BAIDU env
  env:
    BAIDU_TOKEN: ${{ secrets.BAIDUTOKEN }}
  run: sed -i "s/BAIDU_TOKEN/${BAIDU_TOKEN}/" xxx.py
```
Configuration
For the specific configuration, see the README.
Conclusion
That about wraps up this article. After all, demand drives creation.
A few days ago I moved from Hexo to Gridsome and deleted some really low-effort articles, leaving only a dozen or so. Still, it is good for the blog's quality; I would rather have fewer articles than a pile of filler. Everyone defines filler differently. In my view, a very short article is not necessarily filler, but my old OI problem write-ups certainly were: paste the problem, say a few words, and dump the code. That said, a moderate amount of filler is acceptable, this article included.
License
When reprinting or quoting this article, please abide by the license agreement, credit the source, and do not use it for commercial purposes!