BeautifulSoup4, PyQuery, Threads and Thread Pools


BeautifulSoup4 is a Python library for parsing HTML and XML. It makes it easy to extract data from web pages.

  • Supported parsers:
    Python standard library parser "html.parser"
    Use: BeautifulSoup(markup, "html.parser")
    Advantages: built into Python's standard library; moderate execution speed; strong document fault tolerance

    lxml HTML parser "lxml" (lxml must be installed separately)
    Use: BeautifulSoup(markup, "lxml")
    Advantages: fast; strong document fault tolerance
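
To illustrate the fault tolerance mentioned above, here is a minimal sketch that feeds deliberately incomplete markup to the built-in "html.parser". The sample markup and variable names are made up for this illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical broken markup: neither <p> nor <b> is ever closed.
broken_html = "<p>unclosed <b>bold"

# "html.parser" ships with Python, so no extra parser has to be installed.
soup = BeautifulSoup(broken_html, "html.parser")

# The parser still builds a usable tree from the incomplete input.
bold_text = soup.b.string
paragraph_text = soup.p.get_text()

print(bold_text)
print(paragraph_text)
```

Swapping "html.parser" for "lxml" should work the same way here, provided lxml is installed.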

Basic usage of BeautifulSoup4:

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')  # html is a string containing the page source
print(soup.head)  # Get the head tag
print(soup.p.b)   # Get the first b node under the first p node

find_all() method:
1. name parameter: queries by node (tag) name
2. attrs parameter: queries by node attributes
3. text parameter: matches node text (newer versions of Beautiful Soup call this parameter string)
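
The three kinds of query can be sketched as follows. The HTML snippet and variable names are assumptions made up for the example, and it uses the newer string keyword instead of text.

```python
from bs4 import BeautifulSoup

# Hypothetical sample document.
html = """
<ul id="menu">
  <li class="item">Home</li>
  <li class="item active">News</li>
  <li class="other">About</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# 1. Query by node name: every <li> tag.
by_name = soup.find_all("li")

# 2. Query by attributes: tags whose class list contains "item".
by_attrs = soup.find_all(attrs={"class": "item"})

# 3. Query by text: returns the matching text nodes, not the tags.
by_text = soup.find_all(string="About")

print(len(by_name))
print([li.get_text() for li in by_attrs])
print([str(s) for s in by_text])
```

Because the third query returns text nodes, use .parent on a result when the enclosing tag is needed.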

CSS selectors (the select() method):
Get an attribute: p.attrs['id']
Get text: a.get_text() | a.string
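
A small sketch of CSS selection with select() and select_one(), followed by the attribute and text accessors listed above. The markup is invented for the example.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup.
html = (
    '<div id="box">'
    '<p id="intro" class="lead">Hello <a href="/x"><strong>world</strong></a></p>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# select_one() returns the first match; select() returns a list of matches.
p = soup.select_one("#box p.lead")
a = soup.select("#box a")[0]

p_id = p.attrs["id"]        # attribute lookup
a_text = a.get_text()       # all text nested inside the tag
a_string = a.string         # works when the tag holds a single string
p_string = p.string         # None when the tag has several children

print(p_id, a_text, a_string, p_string)
```

The last line shows why get_text() is usually the safer choice: .string returns None as soon as a tag has more than one child.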


The pyquery library is a Python implementation of jQuery. It lets you parse and manipulate HTML documents with jQuery-style syntax, and it is easy to use and fast at parsing.

from pyquery import PyQuery
doc = PyQuery(html)  # Create the PyQuery object from an HTML string

# Equivalent, using the common pq alias
from pyquery import PyQuery as pq
doc = pq(html)
# Find li tags inside an element with class "list" inside the element with id "container".
# The spaces express nesting: each later element must sit somewhere inside the earlier one, but need not be its direct child.
print(doc('#container .list li'))

from pyquery import PyQuery as pq
doc = pq(html)
items = doc('.list')  # Get the elements with class "list"
# Use find() to get the li tags inside items. The result supports find() as well,
# so you can keep drilling down layer by layer.
lis = items.find('li')

Common methods:
find(selector): finds nested (descendant) elements
eq(index): gets the element at the given index, counting from 0
doc(selector): retrieves the target content through a CSS selector (calling the PyQuery object itself)
.text(): gets the text of the tag
.attr('attribute name'): gets a tag attribute


import threading  # Import the module
# Threads execute in no guaranteed order
# A thread is the smallest unit of execution scheduled on the CPU
# Threads provide multitasking and suit I/O-intensive tasks
# Threads in the same process share that process's resources

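data = []
total = 0  # Shared counter updated by both threads

def run1(num, **kwargs):
    # global data
    global total
    with lock:  # Acquire the lock; it is released automatically on exit
        for i in range(num):
            # data.append(i)
            total += 1

def run2(num):
    # global data
    global total
    for i in range(num):
        # data.append(i)
        total += 1  # No lock here, so this update can race with other threads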

if __name__ == '__main__':

    print('Starting code execution', threading.current_thread().name)

    # Thread lock
    lock = threading.Lock()
    # Create threads
    # target: the function to execute
    # name: sets the name of the thread
    # args: positional arguments passed to the function (tuple)
    # kwargs: keyword arguments passed to the function (dict)
    # daemon: False by default; when the main thread finishes, sub-threads keep running
    # daemon: True; when the main thread ends, the sub-thread is ended as well
    thread1 = threading.Thread(
        target=run1, name='Thread 1',
        args=(100,),  # illustrative value for num
    )

    thread2 = threading.Thread(
        target=run2, name='Thread 2',
        args=(100,),  # illustrative value for num
    )

    # Start the threads so they perform their tasks
    thread1.start()
    thread2.start()

    # join() blocks the calling thread (synchronization) until the sub-thread's task is complete,
    # then execution continues in the main thread
    thread1.join()
    thread2.join()

    print('Code execution completed', threading.current_thread().name)
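
As a compact, self-contained illustration of the lock and join() ideas, the sketch below has several threads increment one shared counter. The function name, thread count and iteration count are arbitrary choices for the example.

```python
import threading

counter = 0
lock = threading.Lock()

def add(n):
    """Increment the shared counter n times, holding the lock for each update."""
    global counter
    for _ in range(n):
        with lock:  # only one thread at a time may run this block
            counter += 1

# Four worker threads, each adding 10000 to the same counter.
threads = [
    threading.Thread(target=add, args=(10000,), name=f"worker-{i}")
    for i in range(4)
]

for t in threads:
    t.start()

# join() makes the main thread wait until every worker has finished.
for t in threads:
    t.join()

print(counter)  # 40000
```

Because every update happens while the lock is held, no increment can be lost, and the join() calls guarantee the total is complete before it is printed.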

Thread pool

A thread pool lets the program issue requests concurrently, so data is fetched faster and the program finishes sooner.

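from concurrent.futures import ThreadPoolExecutor

# Create a pool with 10 worker threads
pool = ThreadPoolExecutor(10)
# Add a task to the pool (for example, a frequently issued request task)
result = pool.submit(self.send_request, url)
# Add a callback function, called with the future once the task finishes
result.add_done_callback(self.parseinfo)

# The callback reads the task's return value from the future
def parseinfo(self, future):
    text = future.result()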
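
The same submit-plus-callback pattern can be tried without a surrounding class or a network connection. In this sketch, fetch() is a hypothetical stand-in for a real request function, and the URLs are placeholders that are never actually contacted.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    """Hypothetical stand-in for an HTTP request; returns a fake response."""
    return f"response from {url}"

results = []

def on_done(future):
    """Callback: runs when a task finishes and collects its return value."""
    results.append(future.result())

# Placeholder URLs for illustration only.
urls = [f"http://example.com/page/{i}" for i in range(5)]

# The with-block waits for all submitted tasks before it exits.
with ThreadPoolExecutor(max_workers=3) as pool:
    for url in urls:
        future = pool.submit(fetch, url)
        future.add_done_callback(on_done)

# Tasks may finish in any order, so sort before printing.
print(sorted(results))
```

Using the executor as a context manager is one way to make sure the pool is shut down; calling pool.shutdown() explicitly is an alternative.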

Keywords: Python JQuery xml Attribute

Added by adamlacombe on Fri, 04 Oct 2019 06:22:32 +0300