Python Crawler Series (1): A Collection and Summary of Early Learning Resources

Recently, in order to extract information from the Judicial Documents Network, I started learning Python and have been writing code for nearly two weeks. This article summarizes the pitfalls I ran into, along with some good resources and blog posts I came across, so that I can review and refer to them later. I am sharing it with you, and you are welcome to add other good material.

I. Environment Setup and Tool Preparation

1. To save study time, it is recommended to install an integrated distribution directly: Anaconda

2. IDE: PyCharm, PyDev

3. Tools: Jupyter Notebook

II. Basic Python Video Courses

1. Crazy Python: Quick Start (Python 2.x; you can see how it differs from Python 3.x)

2. Learning Python from Zero (video lessons by Little Turtle)

After watching these courses, you should have a feel for Python and a basic grasp of it, and can move on to more advanced tutorials.

3. Python 3 Complete (password: rghx)

III. Python Crawler Video Tutorials

1. Python Web Crawler in Action (taken as a whole, I learned quite a lot from it)

2. Python 3 Crawler Case Studies (a very good course with lots of practical content)

IV. Related Links on Python Crawlers

1. Best practices for Python crawlers

2. A Complete Collection of Practical Python Web Crawler Project Code

3. Making a Python Crawler from Scratch

4. Introduction to Python Crawlers

5. Python 3 (CSDN blog)

7. Scraping Room Information from Douyu TV

V. Regular Expressions and Using Beautiful Soup, PhantomJS + Selenium

1. Introduction to Python Crawlers for Beginners

2. Easy Automation: selenium-webdriver (Python)

3. Concise Notes on Python's Regular Expression re Module

4. Introduction to Selenium

5. Introduction to Python Crawlers (7): Regular Expressions

(It is worth following the authors of these articles. They usually maintain Python collections, where you can find more valuable articles to bookmark.)

VI. Hands-on Practice: Scraping Sina News Information

I have pasted the source code here for reference. It reflects what I learned from the Python Web Crawler in Action course.

Function for extracting the news comment count

import re
import json
import requests

# URL template of the JavaScript endpoint that returns comment data;
# {} is filled in with the news id
commentURL = ('http://comment5.news.sina.com.cn/page/info?version=1&format=js&'
              'channel=gn&newsid=comos-{}&'
              'group=&compress=0&ie=utf-8&oe=utf-8&page=1&page_size=20&'
              'jsvar=loader_1491395188566_53913700')

def getCommentCounts(newsurl):
    # Get the news id from the article URL
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    newsid = m.group(1)
    # Fetch the comment data for that news id
    comments = requests.get(commentURL.format(newsid))
    # Drop the leading "var loader_...=" and parse the rest as JSON
    # (str.strip removes a set of characters, not a prefix, so split instead)
    jd = json.loads(comments.text.split('=', 1)[1])
    return jd['result']['count']['total']
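The two string-handling steps of the comment-count function can be tried without any network access. The following is only a minimal sketch: the helper names extract_news_id and parse_js_wrapped_json, the sample URL, and the sample response body are all hypothetical and chosen for illustration, not taken from the live site.

```python
import json
import re

def extract_news_id(newsurl):
    """Return the text between 'doc-i' and '.shtml', or None if the URL does not match."""
    m = re.search(r'doc-i(.+)\.shtml', newsurl)
    return m.group(1) if m else None

def parse_js_wrapped_json(text):
    """Parse a body of the form 'var name={...}' by cutting at the first '='."""
    _, _, payload = text.partition('=')
    return json.loads(payload.strip().rstrip(';'))

# Hypothetical sample values, for illustration only
sample_url = 'http://news.sina.com.cn/c/2017-04-05/doc-ifyexample123.shtml'
sample_body = 'var loader_1491395188566_53913700={"result": {"count": {"total": 42}}}'

news_id = extract_news_id(sample_url)
total = parse_js_wrapped_json(sample_body)['result']['count']['total']
print(news_id, total)
```

Returning None for a URL that does not match avoids an AttributeError on m.group(1) when a link on the list page is not a regular article.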

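Function for extracting the news article details

import requests
from datetime import datetime
from bs4 import BeautifulSoup

def getNewsDetail(newsurl):
    result = {}
    res = requests.get(newsurl)
    res.encoding = 'utf-8'
    soup = BeautifulSoup(res.text, 'html.parser')
    # select() returns a list, so take the first match and its text
    result['title'] = soup.select('#artibodyTitle')[0].text
    timesource = soup.select('.time-source')[0].contents[0].strip()
    # The page is in Chinese, so the format must use the page's own
    # year/month/day characters
    result['dt'] = datetime.strptime(timesource, '%Y年%m月%d日%H:%M')
    result['source'] = soup.select('.time-source span a')[0].text
    # Join all paragraphs except the last one, which holds the editor line
    result['article'] = ' '.join([p.text.strip() for p in soup.select('#artibody p')[:-1]])
    # Remove the Chinese "responsible editor" label in front of the name
    editor = soup.select('.article-editor')[0].text.strip()
    result['editor'] = editor.replace('责任编辑：', '', 1)
    return result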
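Two details in this kind of extraction are easy to get wrong: the timestamp format must match the text of the page exactly, and str.lstrip removes a set of characters rather than a prefix. The sketch below illustrates both using only the standard library. It assumes the page shows timestamps such as '2017年04月05日19:33' and an editor line beginning with '责任编辑：'; the sample strings and the helper names parse_time and remove_label are hypothetical.

```python
from datetime import datetime

# Assumed page formats (hypothetical samples, not fetched from the site)
TIME_FORMAT = '%Y年%m月%d日%H:%M'
EDITOR_LABEL = '责任编辑：'

def parse_time(text):
    """Parse a timestamp written with Chinese year/month/day characters."""
    return datetime.strptime(text.strip(), TIME_FORMAT)

def remove_label(text, label=EDITOR_LABEL):
    """Remove the label only when it really is a prefix of the text."""
    text = text.strip()
    if text.startswith(label):
        return text[len(label):].strip()
    return text

dt = parse_time(' 2017年04月05日19:33 ')
editor = remove_label('责任编辑：编辑部 ')
# lstrip treats its argument as a character set, so it also eats the
# leading characters of the name when they appear in the label
naive = '责任编辑：编辑部'.lstrip(EDITOR_LABEL)
print(dt, editor, naive)
```

In this sample the prefix check keeps the full name, while lstrip leaves only its last character, which is why an explicit prefix check is the safer choice for cleaning labels.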

VII. Reflections

After these days of study, the ideas and routines behind Python crawlers are quite clear to me. The main work is designing different crawling strategies and methods for different websites (dealing with anti-crawling measures and so on). I still need to summarize the methods and build up knowledge myself, and I also hope to apply crawlers to real life or to real applications, even simple ones. Merely extracting information from a web page is not very meaningful in itself; it is better to try things like bulk-downloading a website's pictures or files, so that the crawler serves us.

(P.S. I will keep updating and supplementing this post, and will also add suggestions from your comments.)
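As a first step toward bulk-downloading pictures, a crawler needs the list of image addresses on a page. The following is a minimal sketch using only the standard library; the class ImageCollector, the function collect_image_urls, and the sample HTML and base URL are hypothetical and serve only as illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageCollector(HTMLParser):
    """Collect the absolute URL of every <img> tag that has a src attribute."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                # Resolve relative paths against the address of the page
                self.urls.append(urljoin(self.base_url, src))

def collect_image_urls(html, base_url):
    parser = ImageCollector(base_url)
    parser.feed(html)
    return parser.urls

# Hypothetical sample page, for illustration only
sample_html = ('<p><img src="/img/a.png">'
               '<img src="http://example.com/b.jpg">'
               '<img alt="no src"></p>')
urls = collect_image_urls(sample_html, 'http://example.com/news/page.html')
print(urls)
```

Each collected URL could then be fetched with requests.get and its content written to a local file, with a pause between requests so as not to overload the site.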

Keywords: Python Selenium JSON network

Added by wannalearnit on Tue, 09 Jul 2019 00:45:19 +0300