BeautifulSoup4
BeautifulSoup4 is a Python library for parsing HTML and XML documents; it makes it easy to extract data from web pages.
- Supported parsers:
- Python standard library "html.parser"
  Use: BeautifulSoup(markup, "html.parser")
  Advantages: built into Python's standard library; moderate speed; good tolerance of malformed documents
- lxml HTML parser "lxml"
  Use: BeautifulSoup(markup, "lxml")
  Advantages: very fast; good tolerance of malformed documents
Basic usage of BeautifulSoup4:

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
print(soup.head)   # Get the head tag
print(soup.p.b)    # Get the b node under the p node
```
find_all() method:
1. name parameter: query by node name
2. attrs parameter: query by node attributes
3. text parameter: match against node text
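A minimal sketch of the three query styles listed above (the sample HTML is invented for illustration; in recent bs4 versions the text parameter is also available under the name string):

```python
from bs4 import BeautifulSoup

html = '<p id="intro">hello <b>world</b></p><p class="note">bye</p>'
soup = BeautifulSoup(html, "html.parser")

print(soup.find_all("p"))                      # 1. by node name
print(soup.find_all(attrs={"class": "note"}))  # 2. by node attribute
print(soup.find_all(string="bye"))             # 3. by node text
```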
CSS selectors:
Get an attribute: p.attrs['id']
Get text: a.get_text() | a.string
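A short sketch tying these together (the sample HTML is invented): select() queries with a CSS selector, attrs fetches an attribute, and get_text() pulls the text.

```python
from bs4 import BeautifulSoup

html = '<div id="main"><p id="first">Hi <a href="/x"><strong>there</strong></a></p></div>'
soup = BeautifulSoup(html, "html.parser")

p = soup.select("#main p")[0]  # CSS selector query via select()
print(p.attrs["id"])           # get the id attribute
a = p.select_one("a")
print(a.get_text())            # get the tag's text
print(a["href"])               # attribute access also works via indexing
```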
Pyquery
The pyquery library is a Python implementation of jQuery: it lets you parse and manipulate HTML documents with jQuery-style syntax, and it is easy to use and fast.
```python
from pyquery import PyQuery as pq

doc = pq(html)  # declare a PyQuery object
# Looks for li tags inside an element with class "list" inside an element
# with id "container". The selectors describe a hierarchy, but each level
# need not be a direct child of the previous one.
print(doc('#container .list li'))

# Child elements
items = doc('.list')    # get the .list element
print(type(items))
print(items)
lis = items.find('li')  # find() locates li tags inside items; the result
                        # can call find() again to keep drilling down,
                        # peeling off layer by layer
print(type(lis))
print(lis)
```
Common methods:
find(): finds nested elements
eq(index): retrieves the element at the given index (counting from 0)
py_html(selector): a PyQuery object called directly with a CSS selector retrieves the target content
.text(): gets the tag's text
.attr('attribute name'): gets a tag attribute
Thread
```python
import threading  # import the threading module

# Threads execute in no guaranteed order.
# A thread is the smallest unit of CPU execution.
# Threads provide multitasking and are suited to I/O-bound tasks.
# Threads in the same process share resources.

# Example:
data = []
sum = 0

def run1(num, **kwargs):
    # global data
    global sum
    print(kwargs)
    lock.acquire()  # lock
    for i in range(num):
        print(i, threading.current_thread().name)
        # data.append(i)
        sum += 1
    lock.release()  # unlock

def run2(num):
    # global data
    global sum
    lock.acquire()
    for i in range(num):
        print(i, threading.current_thread().name)
        # data.append(i)
        sum += 1
    lock.release()

if __name__ == '__main__':
    print('Start executing code', threading.current_thread().name)
    # Thread lock
    lock = threading.Lock()
    # Create threads
    # target: function to execute
    # name:   sets the name of the thread
    # args:   positional arguments for the target function (tuple)
    # kwargs: keyword arguments for the target function (dict)
    # daemon: default False; the main thread's exit does not stop the sub-threads
    # daemon: True; when the main thread ends, the sub-threads end too
    thread1 = threading.Thread(
        target=run1, name='Thread 1',
        args=(10000,), kwargs={'name': 'lihua'},
        daemon=True
    )
    thread2 = threading.Thread(
        target=run2, name='Thread 2',
        args=(10000,),
        daemon=True
    )
    # Start the threads
    thread1.start()
    thread2.start()
    # join() blocks (synchronizes): the main thread waits until the
    # sub-thread's task finishes before continuing
    # thread1.join()
    # thread2.join()
    print(data, sum)
    print('Code execution completed', threading.current_thread().name)
```
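The join() calls commented out above are worth isolating: a minimal sketch showing that join() blocks the calling thread until the worker finishes, so the result is guaranteed to be ready afterwards.

```python
import threading
import time

results = []

def worker():
    time.sleep(0.1)      # simulate some work
    results.append("done")

t = threading.Thread(target=worker, daemon=True)
t.start()
t.join()                 # block here until the worker thread finishes
print(results)           # the append is guaranteed to have happened
```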
Thread pool
Using a thread pool lets a program issue requests and execute tasks faster.
```python
from concurrent.futures import ThreadPoolExecutor

# Create a pool of 10 worker threads
pool = ThreadPoolExecutor(10)

# Submit a task to the pool (e.g. a frequently repeated request task)
result = pool.submit(self.send_request, url)

# Add a callback function
result.add_done_callback(self.parse_info)

# The callback function receives the finished future
def parse_info(self, future):
    text = future.result()
    print(text)
```
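The snippet above is a fragment of a class (hence self). A self-contained sketch of the same pattern, with a hypothetical send_request standing in for a real HTTP call and invented URLs:

```python
from concurrent.futures import ThreadPoolExecutor

responses = []

def send_request(url):
    # Hypothetical stand-in for a real HTTP request
    return f"response from {url}"

def parse_info(future):
    # Callback: runs when the submitted task completes
    responses.append(future.result())

pool = ThreadPoolExecutor(max_workers=10)
for i in range(3):
    result = pool.submit(send_request, f"https://example.com/page/{i}")
    result.add_done_callback(parse_info)
pool.shutdown(wait=True)  # wait for all submitted tasks to finish
print(responses)
```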