Previously, in the article "Automation with GitHub Actions", I introduced how to use GitHub Actions to automatically submit links to Baidu. The blog framework at the time was Hexo, paired with a plugin that automatically generated the article links. This site has since been migrated from Hexo to Gridsome, and after some fiddling the migration finally succeeded, although some old filler articles were deleted along the way ~in truth, I was too lazy to make cover images for them~. In between, I also toyed with the idea of managing articles through Forestry (a blog CMS), but on second thought, a browser-based management system makes for a poor writing experience, and the source is already on GitHub anyway. Writing locally in a Markdown editor and then pushing to GitHub feels much better.
Preface
Enough digressing; back to the point. In the previous setup, Hexo's hexo-baidu-url-submit plugin handled both link generation and link pushing. I have to admire Hexo's excellent ecosystem, not just the plugins but the themes too; that is probably Hexo's biggest advantage. In contrast, Gridsome has very few themes. Although it has 172 plugins, roughly on par with Hexo, most of them are poorly localized. After all, most of its users are overseas.
On Gridsome's official website, the team describes it as SEO-friendly:
Gridsome sites load as static HTML before they hydrate into fully Vue.js-powered SPAs. This makes it possible for search engines to be able to crawl content and give better SEO ranking, and still have all the power of Vue.js.
So I thought of the sitemap. Needless to say, Google will reliably index a sitemap once it is submitted; Baidu is less certain, but it is better than nothing. Of course, I also thought of the API Baidu provides. A quick search turned up @gridsome/plugin-sitemap, so pushing via Baidu's API had to be shelved for a while. Today, bored, I thought of it again: now that I have a sitemap, maybe I can extract the links from it directly and push them. A search shows there are websites that extract the links from a sitemap, but extracting manually and then writing a script around the API is rather troublesome; better to integrate the extraction directly.
Extracting the links
A look at the sitemap XML shows that the link format is fixed:
```xml
<url>
  <loc>https://blog.jalenchuh.cn/</loc>
</url>
```
So we just need a regular expression to grab the content between `<loc>` and `</loc>`:
```
(?<=<loc>).*?(?=</loc>)
```
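As a quick sanity check, the pattern can be run against the sample entry above; a minimal sketch in Python:

```python
import re

# Sample <url> entry from the sitemap snippet above
sample = '<url> <loc>https://blog.jalenchuh.cn/</loc> </url>'

# The lookbehind/lookahead pair keeps only the text between <loc> and </loc>
print(re.findall(r'(?<=<loc>).*?(?=</loc>)', sample))
# ['https://blog.jalenchuh.cn/']
```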
Push
Both curl and POST are convenient. However, pushing with POST takes about 30 s, while curl takes only about 2 s.
The official documentation also gives a curl example:
```bash
curl -H 'Content-Type:text/plain' --data-binary @urls.txt "http://data.zz.baidu.com/urls?site=xxx&token=xxx"
```
Writing the link-generation script
Open the sitemap with urllib.request.urlopen and decode it with decode('utf-8'). In local testing, omitting the UTF-8 decoding produced inexplicable errors.
Use re.compile to build the pattern and re.findall to grab all matching strings directly; then a simple for loop takes care of the rest.
```python
import re
import urllib.request  # 'import urllib' alone does not expose urllib.request
import requests        # only needed for the optional POST push below

sitemap = 'https://blog.jalenchuh.cn/sitemap.xml'
html = urllib.request.urlopen(sitemap).read().decode('utf-8')
result = re.findall(re.compile(r'(?<=<loc>).*?(?=</loc>)'), html)

# The with block closes the file itself; no explicit close() needed
with open('urls.txt', 'w') as file:
    for data in result:
        print(data, file=file)
```
The script above uses Python to generate the txt file, which curl will push later. Of course, you can also push with POST, but it is much slower.
First, install requests and define the headers:
```python
headers = {
    'User-Agent': 'curl/7.12.1',
    'Host': 'data.zz.baidu.com',
    'Content-Type': 'text/plain',
    'Content-Length': '83'
}
```
Then push with requests.post:
```python
for data in result:
    print(data + '\n' + requests.post(
        url=url,
        data=data,
        headers=headers
    ).text + '\n')
```
The url here refers to the push endpoint containing the token. Use the placeholder BAIDU_TOKEN in place of the real token value first, then swap the real value back in with sed -i inside the Action.
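Putting the pieces together, here is a minimal sketch of what the POST-based push script might look like. The site parameter is an assumption based on the blog's domain, reading from urls.txt rather than reusing result is only for self-containment, and BAIDU_TOKEN is the placeholder that sed -i replaces later:

```python
import requests

headers = {
    'User-Agent': 'curl/7.12.1',
    'Host': 'data.zz.baidu.com',
    'Content-Type': 'text/plain',
}

# BAIDU_TOKEN is a placeholder; the Action's sed -i step substitutes the real token.
# The site parameter is an assumption based on the blog's domain.
url = 'http://data.zz.baidu.com/urls?site=blog.jalenchuh.cn&token=BAIDU_TOKEN'

# Push each extracted link and print Baidu's response for it
with open('urls.txt') as f:
    for data in f.read().splitlines():
        print(data, requests.post(url=url, data=data, headers=headers).text)
```

Note that requests computes Content-Length automatically for each request body, so the hard-coded value from the header example is omitted in this sketch.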
Writing the Action file
```yaml
name: push

on:
  schedule:
    - cron: '0 16 * * *'
  watch:
    types: [started]

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@master
      - name: Set up python
        uses: actions/setup-python@v1
        with:
          python-version: 3.8
      - name: Install requests
        run: pip install requests
      - name: generate
        run: python generate.py
      - name: Push
        env:
          BAIDU_TOKEN: ${{ secrets.BAIDUTOKEN }}
          SITE: ${{ secrets.SITE }}
        run: curl -H 'Content-Type:text/plain' --data-binary @urls.txt "http://data.zz.baidu.com/urls?site=${SITE}&token=${BAIDU_TOKEN}"
```
If you push with POST instead, just run the Python file directly; but before that, the BAIDU_TOKEN placeholder in the file needs to be replaced:
```yaml
- name: BAIDU env
  env:
    BAIDU_TOKEN: ${{ secrets.BAIDUTOKEN }}
  run: sed -i "s/BAIDU_TOKEN/${BAIDU_TOKEN}/" xxx.py
```
Configuration
For the specific configuration, see the README.
Conclusion
That about wraps up this article. After all, demand drives creation.
A few days ago I moved from Hexo to Gridsome and deleted some really low-effort articles, leaving only a dozen or so. Still, it is good for the blog's quality; I would rather have fewer articles than a pile of filler. Everyone defines filler differently. In my view, a very short article is not necessarily filler, but my old OI problem write-ups certainly were: paste the problem, say a few words, and dump the code. That said, a moderate amount of filler is acceptable, this article included.
License
When reprinting or quoting this article, please abide by the license agreement, credit the source, and do not use it for commercial purposes!