Recently, in order to extract information from the Judicial Documents Network, I set out to learn Python. After nearly two weeks of writing code, I am writing this article to summarize the pitfalls I ran into and to collect the good resources and blog posts I came across, both for my own later review and to share with everyone. You are welcome to contribute more good material.
I. Environment Setup and Tool Preparation
1. To save study time, it is recommended to install the Anaconda integrated environment directly.
2. IDE: PyCharm, PyDev
3. Tool: Jupyter Notebook
II. Basic Python Video Courses
1. Crazy Python: Quick Start (uses Python 2.x, so you can see how it differs from Python 3.x)
2. Zero-Basics Introduction to Python (Little Turtle's video lessons)
After working through these courses, I had a feel for and a basic grasp of Python, and could move on to some more advanced tutorials.
3. Python 3 Complete (password: rghx)
III. Python Crawler Video Tutorials
1. Python Web Crawler in Action (I watched it from start to finish and learned a great deal.)
2. Python 3 Crawler Case Studies (A very good course with lots of practical content.)
IV. Useful Python Crawler Links
1. Best Practices for Python Crawlers
2. Complete Code for Practical Python Web Crawler Projects
3. Building a Python Crawler from Zero Basics
V. Using Regular Expressions, BeautifulSoup, and PhantomJS + Selenium
1. A Beginner's Introduction to Python Crawlers
2. Easy Automation: selenium-webdriver (Python)
3. Concise Notes on Python's re Regular Expression Module
4. Introduction to Python Crawlers (7): Regular Expressions
(You can follow the authors of these articles; they generally maintain Python collections where valuable articles can be found.) A minimal sketch of how these tools fit together follows below.
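To make the division of labor concrete, here is a small, self-contained sketch of my own (not taken from any of the linked articles): BeautifulSoup parses the HTML structure, a regular expression extracts a pattern from an attribute, and Selenium (shown only in comments, since it needs a browser driver installed) would take over when a page is rendered by JavaScript. The sample HTML string is made up.
import re
from bs4 import BeautifulSoup

html = '<div class="news"><a href="/doc-i123.shtml">Headline</a></div>'  # made-up sample HTML
soup = BeautifulSoup(html, 'html.parser')
link = soup.select('.news a')[0]
print(link.text)        # Headline
print(link['href'])     # /doc-i123.shtml
# A regular expression pulls the news id out of the href,
# the same pattern used in the Sina example below
m = re.search('doc-i(.+).shtml', link['href'])
print(m.group(1))       # 123
# For pages rendered by JavaScript, Selenium drives a real browser instead:
# from selenium import webdriver
# driver = webdriver.Chrome()   # PhantomJS filled this role in older setups
# driver.get('http://example.com')
# html = driver.page_source
# driver.quit()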
VI. Hands-On Practice: Crawling Sina News
I have pasted the source code here directly for reference; it follows what I learned in the Python Web Crawler in Action course.
Comment-count extraction function
import re
import json
import requests

# Sina's comment API; it returns JSON wrapped in a 'var loader_...=' jsvar assignment
commentURL = 'http://comment5.news.sina.com.cn/page/info?version=1&format=js&\
channel=gn&newsid=comos-{}&\
group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&jsvar=loader_1491395188566_53913700'

def getCommentCounts(newsurl):
    # Pull the news id out of the article URL
    m = re.search('doc-i(.+).shtml', newsurl)
    newsid = m.group(1)
    # Request the comment information for that news id
    comments = requests.get(commentURL.format(newsid))
    # Remove the jsvar prefix and parse the remaining text as JSON
    # (str.strip removes a set of characters from both ends, which happens to work here)
    jd = json.loads(comments.text.strip('var loader_1491395188566_53913700='))
    return jd['result']['count']['total']
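A quick usage sketch; the article URL below is hypothetical and only illustrates the expected doc-i...shtml pattern:
news = 'http://news.sina.com.cn/c/nd/2017-04-05/doc-ifyecezv1234567.shtml'  # hypothetical URL
print(getCommentCounts(news))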
News body and metadata extraction function
import requests
from datetime import datetime
from bs4 import BeautifulSoup

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    result['title'] = soup.select('#artibodyTitle')[0].text
    # The first child of .time-source is the publication date string
    timesource = soup.select('.time-source')[0].contents[0].strip()
    # The date on the page uses Chinese date markers (年/月/日)
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    result['source'] = soup.select('.time-source span a')[0].text
    # Join all paragraphs except the last one, which holds the editor line
    result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    # Remove the leading '责任编辑：' ("Responsible editor:") label
    result['editor'] = soup.select('.article-editor')[0].text.lstrip('责任编辑：')
    return result
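The same kind of hypothetical call works here (the URL is made up; any Sina article page with this layout would do):
detail = getNewsDetail('http://news.sina.com.cn/c/nd/2017-04-05/doc-ifyecezv1234567.shtml')  # hypothetical URL
print(detail['title'], detail['dt'], detail['source'])
Note that the strptime format string and the lstrip label must match the Chinese markers that actually appear on the page, which is why they are kept in Chinese in the code.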
VII. Reflections
After these days of study, the ideas and patterns behind Python crawlers have become quite clear: the main task is to design different crawling strategies and methods for different websites (anti-crawling measures and so on). Even so, you have to keep summarizing methods and accumulating knowledge, and I hope to apply crawlers to real life or to small applications of my own. Merely extracting information from a web page is not very meaningful by itself; try applying it to, say, bulk-downloading a site's images or files (a sketch follows below), and let the crawler work for us.
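As an illustration of that idea, here is a minimal sketch of my own (not from the course); download_images and the output directory name are made up for the example:
import os
import requests
from urllib.parse import urljoin
from bs4 import BeautifulSoup

# Hypothetical helper: save every image referenced on one page
def download_images(pageurl, outdir='images'):
    os.makedirs(outdir, exist_ok=True)
    res = requests.get(pageurl)
    soup = BeautifulSoup(res.text, 'html.parser')
    for i, img in enumerate(soup.select('img[src]')):
        src = urljoin(pageurl, img['src'])    # resolve relative image URLs
        data = requests.get(src).content
        # file naming is kept deliberately simple for the sketch
        with open(os.path.join(outdir, 'img_%d.jpg' % i), 'wb') as f:
            f.write(data)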
(P.S. I will keep updating and supplementing this content, and will also fold in suggestions from your comments.)