Python learning notes: information organization and extraction methods

1. Preface

We have already seen how to handle HTML text: make the soup, print it, and the page appears as if by magic. But how do we extract specific information from a tag or a web page? The title of this article has probably given you a guess; let's see whether it matches.

2. Three forms of information marking

2.1. What is information marking

A single piece of information is easy to understand at a glance.

But what about a whole group of items?

Xiao Ming, 2001, software engineering, Beijing.

Now it is harder to tell what each item means. So we use information marks (labels) to sort things out:

Name: Xiao Ming

Year of birth: 2001

Major: Software Engineering

Residence: Beijing

Much clearer this way, isn't it?

  • Marked information forms an organized structure and adds a dimension to the information.
  • Marked information can be used for communication, storage, or display.
  • The structure of the marks is as valuable as the information itself.
  • Marked information is easier for programs to understand and use.

These are the basic concepts of information marking. They are not hard to understand, and the example above shows how important they are.
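To see why marked information is easier for a program to use, here is a minimal Python sketch. The variable names `unmarked` and `marked` are illustrative assumptions, not part of the original notes.

```python
# Unmarked: the program only sees positions and has to "know" what each one means.
unmarked = ("Xiao Ming", 2001, "Software Engineering", "Beijing")

# Marked: every value carries a label, so the structure itself explains the data.
marked = {
    "Name": "Xiao Ming",
    "Year of birth": 2001,
    "Major": "Software Engineering",
    "Residence": "Beijing",
}

print(unmarked[1])              # position 1 happens to be the year of birth
print(marked["Year of birth"])  # the label says what the value is
```

Both lines print the same value, but only the second one documents what that value means.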

2.2. HTML information tags

  • HTML: HyperText Markup Language
  • HTML is the way information is organized on the WWW (World Wide Web)
  • Hypertext content such as sound, images and video can be embedded in the text
  • HTML organizes different types of information with predefined <>...</> tags

These four points should look familiar; they were probably covered before. In short, a web page can contain pictures, sound, video and so on.

2.3. Types of information marks

All of the above was groundwork; it is not yet what we really want to know.

At present, three forms of information marking are widely recognized: XML, JSON and YAML. Let's look at each of them.

2.3.1. XML

  • XML: eXtensible Markup Language

  • Abbreviated form for an empty element

    <img src="china.jpg" size="10" />
    
  • Comment syntax

    <!-- This is a comment, very useful -->
    
  • XML builds all information out of tags, which are used in three forms

    <name>...</name>: the tag has content

    <name />: the tag has no content

    <!-- -->: a comment

Like HTML, XML has tags, names and attributes.

Does it look similar to HTML? What one has, the other has too.

Historically, HTML came first and XML followed. XML can therefore be seen as a general-purpose form of information expression that grew out of HTML-style tag markup.
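As a rough illustration of the three XML forms, the sketch below parses a small document with Python's standard `xml.etree.ElementTree` module. The `<person>` wrapper element is an assumption added so that the fragments form one well-formed document.

```python
import xml.etree.ElementTree as ET

# Hypothetical document combining the three forms: content, empty element, comment.
xml_text = """
<person>
    <!-- This is a comment, very useful -->
    <name>Xiao Ming</name>
    <img src="china.jpg" size="10" />
</person>
""".strip()

root = ET.fromstring(xml_text)
name = root.find("name")
img = root.find("img")

print(root.tag)    # name of the root tag
print(name.text)   # content between <name> and </name>
print(img.attrib)  # attributes of the empty element, as a dict of strings
print(img.text)    # an empty element has no text content
print(len(root))   # the default parser drops the comment, leaving two children
```

Note that attribute values such as `size="10"` always come back as strings; XML itself does not distinguish numbers from text.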

2.3.2. JSON

  • JSON: JavaScript Object Notation

  • Typed key-value pairs: key: value

    The key defines the kind of information; the value describes the information itself.

  • Usage

    "full name": "Xiao Ming"
    "Year of birth": 2001
    

    Note: double quotes mark a value as a string. A number is written without quotes.

    When a key has multiple values:

    "name": ["Xiao Ming", "Xiao Gang"]
    

    Pairs can also be nested:

    "Xiao Ming": {
        "Gender": "Male",
        "Major": "software engineering"
    }
    

    Nested key-value pairs are wrapped in {}.

  • All of its forms, in summary:

    "key": "value": a single value

    "key": ["value1", "value2"]: a key with multiple values

    "key": {"subkey": "subvalue"}: nested form

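The typed nature of JSON shows up as soon as a program loads it. The following is a small sketch using Python's standard `json` module; the outer `{}` is added here because a complete JSON document has to wrap its key-value pairs in an object, and the sample data is only illustrative.

```python
import json

json_text = """
{
    "name": ["Xiao Ming", "Xiao Gang"],
    "Year of birth": 2001,
    "Xiao Ming": {"Gender": "Male", "Major": "software engineering"}
}
"""

data = json.loads(json_text)

# Each JSON form maps onto a Python type.
print(type(data["name"]).__name__)           # multiple values become a list
print(type(data["Year of birth"]).__name__)  # an unquoted number stays a number
print(type(data["Xiao Ming"]).__name__)      # a nested form becomes a dict
print(data["Xiao Ming"]["Major"])            # nested values are reached key by key
```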
2.3.3. YAML

  • YAML: YAML Ain't Markup Language

    Untyped key-value pairs: key: value

  • Usage

    full name: Xiao Ming
    

    Compared with JSON, the only difference here is that the string has no double quotes.

    Indentation expresses ownership (nesting). YAML requires spaces, not tabs, for indentation:

    Xiao Ming:
      Gender: male
      major: software engineering
    

    Use - to list parallel items:

    software engineering:
      - Xiao Ming
      - Xiao Hong
    

    | introduces a whole block of text, and # starts a comment:

    yaml: |  # A YAML introduction
      YAML (/ˈjæməl/, rhymes with camel) is a highly readable data serialization format. YAML draws on many other languages, including C, Python and Perl, and takes inspiration from XML and the data format of e-mail (RFC 2822). Clark Evans first published it in 2001; Ingy döt Net and Oren Ben-Kiki are co-designers of the language. Several programming and scripting languages currently support (parse) it.
    
  • All forms, in summary:

    key: value
    key:  # comment
      - value1
      - value2
    key:
      subkey: subvalue
    

3. Comparison of three information marking forms

3.1. Review

Let's first review the three forms of information marking.

  1. XML

    <name>...</name>
    <name />
    <!--  -->
    

    Expresses information with <> tags

  2. JSON

    "key": "value"
    "key": ["value1", "value2"]
    "key": {"subkey": "subvalue"}
    

    Expresses information with typed key-value pairs

  3. YAML

    key: value
    key:  # comment
      - value1
      - value2
    key:
      subkey: subvalue
    

    Expresses information with untyped key-value pairs

3.2. Examples

Now let's look at an example of each.

  1. XML

    <person>
        <firstName>Tian</firstName>
        <lastName>Song</lastName>
        <address>
            <streetAddr>No. 5, Zhongguancun South Street</streetAddr>
            <city>Beijing</city>
            <zipcode>100081</zipcode>
        </address>
        <prof>Computer System</prof><prof>Security</prof>
    </person>
    

    The key information is organized with tags. At a glance, most of the text is tags, and only a small part is actual information.

  2. JSON

    {
        "firstName": "Tian",
        "lastName": "Song",
        "address": {
            "streetAddr": "No. 5, Zhongguancun South Street",
            "city": "Beijing",
            "zipcode": "100081"
        },
        "prof": ["Computer System", "Security"]
    }
    

    As you can see, values are typed, and strings must be wrapped in double quotes.

  3. YAML

    firstName: Tian
    lastName: Song
    address:
      streetAddr: No. 5, Zhongguancun South Street
      city: Beijing
      zipcode: 100081
    prof:
      - Computer System
      - Security
    

    It looks simpler than the other two.

3.3. Comparison of the three information marking forms

  1. XML: the earliest general-purpose information markup language. It is highly extensible but cumbersome, and is often used for exchanging and transmitting information on the Internet.

  2. JSON: the information is typed, which suits processing by programs (such as JavaScript), and it is more concise than XML. It is often used for communication between mobile apps, the cloud and nodes. It has no comments.

  3. YAML: the information is untyped, the proportion of actual text is the highest, and readability is good. It is commonly used for the configuration files of all kinds of systems, supports comments, and is easy to read.

Comparison makes the differences obvious. If I had to choose among the three, I would certainly pick YAML: it is concise and pleasant to look at. That does not mean the other two are useless. JSON is widely used for program interfaces (so we need to know what an interface is), and XML is also very common on the Internet. But for a learner like me, YAML is really nice.
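The claim about the "proportion of text information" can be checked with a rough calculation. The sketch below is only an approximation under stated assumptions: it counts non-whitespace characters, treats the seven data values of the person record as the payload, and uses the helper name `text_share`, which is hypothetical and not from the original notes.

```python
def non_whitespace_length(text):
    """Count the characters of text, ignoring all whitespace."""
    return len("".join(text.split()))


def text_share(document, values):
    """Share of a document's non-whitespace characters that is actual data."""
    payload = sum(non_whitespace_length(value) for value in values)
    return payload / non_whitespace_length(document)


# The data values carried by all three documents below.
values = ["Tian", "Song", "No. 5, Zhongguancun South Street",
          "Beijing", "100081", "Computer System", "Security"]

xml_doc = """<person>
  <firstName>Tian</firstName><lastName>Song</lastName>
  <address>
    <streetAddr>No. 5, Zhongguancun South Street</streetAddr>
    <city>Beijing</city><zipcode>100081</zipcode>
  </address>
  <prof>Computer System</prof><prof>Security</prof>
</person>"""

json_doc = """{
  "firstName": "Tian", "lastName": "Song",
  "address": {"streetAddr": "No. 5, Zhongguancun South Street",
              "city": "Beijing", "zipcode": "100081"},
  "prof": ["Computer System", "Security"]
}"""

yaml_doc = """firstName: Tian
lastName: Song
address:
  streetAddr: No. 5, Zhongguancun South Street
  city: Beijing
  zipcode: 100081
prof:
  - Computer System
  - Security
"""

for label, doc in [("XML", xml_doc), ("JSON", json_doc), ("YAML", yaml_doc)]:
    print(f"{label}: {text_share(doc, values):.0%} of the characters are data")
```

For this record the share rises from XML to JSON to YAML, which matches the comparison above. Other data could shift the exact numbers.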

4. General methods of information extraction

Now that we have covered information marking, how do we extract the marked information?

Information extraction means pulling the content we want out of marked information. Whatever the form, marked information has two parts: the marks and the information. What we care about is the information content we want to extract.

There are three ways. Let's look at them.

  1. Method 1: fully parse the marking form of the information, then extract the key information.

    Requires a markup parser for XML, JSON or YAML, for example the tag-tree traversal of the bs4 library.

    Advantage: the information is parsed accurately.

    Disadvantage: the extraction process is cumbersome and slow.

    I haven't tried this with the other formats, but seeing the bs4 traversal reminded me of HTML, and they should work on much the same principle. You need to know and understand how the whole file is organized.

  2. Method 2: ignore the marking form and search for the key information directly.

    Requires only a text search function applied to the information.

    Advantage: the extraction process is simple and fast.

    Disadvantage: the accuracy of the result depends on the information content.

    This is like Ctrl+F in Word: it really is fast, but the accuracy is hard to guarantee.

  3. Method 3: the fusion method

    Combine parsing and searching to extract the key information.

    Requires both a markup parser (for XML, JSON, YAML) and a text search function.
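The three methods can be sketched with the standard library alone. This is an illustration under assumptions rather than the way bs4 works internally: the HTML fragment is made up for the demonstration (the example.com URL is hypothetical), and the function names are invented here.

```python
import re
from html.parser import HTMLParser

# Hypothetical fragment: two real links plus a URL that is only plain text.
html_text = (
    '<p class="course">Courses: '
    '<a href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and '
    '<a href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>. '
    'Plain text mentioning http://example.com/not-a-link</p>'
)


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.links.extend(value for key, value in attrs if key == "href")


def links_by_parsing(text):
    """Method 1: parse the markup, then read the information from the tags."""
    collector = LinkCollector()
    collector.feed(text)
    return collector.links


def links_by_search(text):
    """Method 2: ignore the markup and search the raw text for URL-like strings."""
    return re.findall(r'https?://[^\s"<]+', text)


def links_by_fusion(text, pattern):
    """Method 3: parse first, then filter the parsed values with a text search."""
    return [url for url in links_by_parsing(text) if re.search(pattern, url)]


print(links_by_parsing(html_text))               # only real links
print(links_by_search(html_text))                # also picks up the plain-text URL
print(links_by_fusion(html_text, "1001870001"))  # parsed links matching a pattern
```

The plain search is the shortest to write, but it cannot tell a real link from a URL that merely appears in the text, which is the accuracy problem described for Method 2.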

4.1. Example

Let's extract the URL links from an HTML page.

Inspecting the page shows that the URL links are in the a tags,

in each tag's href attribute.

Here is the code (demo is assumed to hold the HTML text of the demo page from the earlier notes):

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup(demo, "html.parser")
>>> for link in soup.find_all("a"):
...     print(link.get("href"))
...
http://www.icourse163.org/course/BIT-268001
http://www.icourse163.org/course/BIT-1001870001

You'll notice two new faces here, find_all() and get(). Let's leave them as open questions for now; they are explained below.

5. HTML content search methods based on the bs4 library

5.1. Review

Surely this picture is familiar by now?

5.2. The <>.find_all() method

Earlier we came across the <>.find_all() method. Let's take a proper look at it now.

This method searches the contents of the soup variable for information.

As the name suggests, it finds all matches.

<>.find_all(name, attrs, recursive, string, **kwargs)

There are five parameters in total. Let's go through them.

It returns a list that stores the search results.

  • name: the search string for tag names.

    Let's try to find the p tags:

    >>> soup.find_all("p")
    [<p class="title"><b>The demo python introduces several python courses.</b></p>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    

    As you can see, a list of results is returned.

    To find the b tags as well, pass a list of names:

    >>> soup.find_all(["p","b"])
    [<p class="title"><b>The demo python introduces several python courses.</b></p>, <b>The demo python introduces several python courses.</b>, <p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    >>> 
    

    If name is True, all tags in the soup are returned:

    >>> for tag in soup.find_all(True):
    ...     print(tag.name)
    ...
    html
    head
    title
    body
    p
    b
    p
    a
    a
    

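    What if we only want tags whose names start with b?

    For that we need the regular expression library. Let's have a look:

    >>> import re
    >>> for tag in soup.find_all(re.compile("^b")):
    ...     print(tag.name)
    ...
    body
    b
    

    Both tag names start with b. The ^ anchor matters: bs4 matches the pattern anywhere in the tag name, so a plain "b" would also match names that merely contain a b.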

  • attrs: the search string for tag attribute values; use it to search by attribute. A plain string is matched against the tag's class.

    Find the p tags whose class is course:

    >>> soup.find_all("p", "course")
    [<p class="course">Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:
    
    <a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>]
    

    The p tag whose class attribute is course is returned.

    You can also constrain a specific attribute, for example searching for the element with id="link1":

    >>> soup.find_all(id="link1")
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>]
    

    As you can see, the matching tag is returned.

    What if we search for just link?

    >>> soup.find_all(id="link")
    []
    

    The result is empty. I have made this mistake before: the attribute value must be given in full. Partial matching is still possible, though, using the regular expressions introduced above:

    >>> soup.find_all(id=re.compile("link"))
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    

    Simple enough.

  • recursive: whether to search all descendants. The default is True.

    What does that mean? By default the search covers every descendant of the current tag. To search only the direct children of the current node, set it to False.

    First, search for a tags among all descendants of soup:

    >>> soup.find_all("a")
    [<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a>, <a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>]
    

    Now search for a tags only among the direct children of soup:

    >>> soup.find_all("a",recursive=False)
    []
    

    The result is empty: none of soup's direct children is an a tag. The a tags sit deeper in the tree.

  • string: the search string for the text area between <> and </>.

    Let's search for the string Basic Python:

    >>> soup.find_all(string = "Basic Python")
    ['Basic Python']
    

    The match must be exact. To get results from just Python, we need a regular expression:

    >>> soup.find_all(string = re.compile("Python"))
    ['Python is a wonderful general-purpose programming language. You can learn Python from novice to professional by tracking the following courses:\r\n', 'Basic Python', 'Advanced Python']
    

    Every string that contains Python is returned.
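The parameters above can be tried together in one self-contained script. It is a sketch under assumptions: the `demo` string is reconstructed from the outputs shown in these notes (the title text is a placeholder and the course paragraph is shortened), and it needs the third-party bs4 package to be installed.

```python
import re
from bs4 import BeautifulSoup

# Demo page reconstructed from the outputs in these notes;
# the <title> text is a placeholder assumption.
demo = """<html><head><title>Demo page</title></head>
<body>
<p class="title"><b>The demo python introduces several python courses.</b></p>
<p class="course">Python is a wonderful general-purpose programming language.
<a class="py1" href="http://www.icourse163.org/course/BIT-268001" id="link1">Basic Python</a> and
<a class="py2" href="http://www.icourse163.org/course/BIT-1001870001" id="link2">Advanced Python</a>.</p>
</body></html>"""

soup = BeautifulSoup(demo, "html.parser")

# name=True returns every tag in document order.
all_names = [tag.name for tag in soup.find_all(True)]
# A compiled pattern filters tag names; ^ anchors it to the start of the name.
b_names = [tag.name for tag in soup.find_all(re.compile("^b"))]
# A plain string as the second argument is matched against the class attribute.
course = soup.find_all("p", "course")
# Keyword arguments constrain a named attribute; a pattern allows partial matches.
link_ids = [tag["id"] for tag in soup.find_all(id=re.compile("link"))]
# recursive=False looks only at the direct children of soup.
direct_a = soup.find_all("a", recursive=False)
# string= matches the text between tags exactly.
exact = [str(s) for s in soup.find_all(string="Basic Python")]

print(all_names)
print(b_names)
print(len(course), link_ids, list(direct_a), exact)
```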

Because the <>.find_all() method is so commonly used, it also has a shorthand form:

<tag>(...) is equivalent to <tag>.find_all(...)

soup(...) is equivalent to soup.find_all(...)

The <>.find_all() method also has seven extension methods:

Method                         Description
<>.find()                      Searches and returns only one result; same parameters as find_all().
<>.find_parents()              Searches the ancestor nodes and returns a list; same parameters as find_all().
<>.find_parent()               Returns one result from the ancestor nodes; same parameters as find_all().
<>.find_next_siblings()        Searches the following sibling nodes and returns a list; same parameters as find_all().
<>.find_next_sibling()         Returns one result from the following sibling nodes; same parameters as find_all().
<>.find_previous_siblings()    Searches the preceding sibling nodes and returns a list; same parameters as find_all().
<>.find_previous_sibling()     Returns one result from the preceding sibling nodes; same parameters as find_all().

From what we have learned and from the method names themselves, we can see that these methods differ in search scope and in what they return, so the 6 + 1 methods are easy to remember.
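Here is a short sketch of the shorthand form and the 6 + 1 extension methods. The HTML fragment is hypothetical and deliberately written without whitespace between tags, so that every sibling is a tag. The script needs the third-party bs4 package.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one <p> followed by two <a> tags inside a <div>.
html = ('<div id="box"><p id="intro">Intro</p>'
        '<a id="link1" href="#one">One</a>'
        '<a id="link2" href="#two">Two</a></div>')
soup = BeautifulSoup(html, "html.parser")

# Shorthand: calling the soup (or a tag) is the same as calling find_all().
shorthand_ids = [tag["id"] for tag in soup("a")]
full_ids = [tag["id"] for tag in soup.find_all("a")]

first = soup.find("a")        # only the first match
missing = soup.find("table")  # no match: find() returns None

parent = first.find_parent("div")                 # one ancestor
parents = first.find_parents("div")               # list of ancestors
next_one = first.find_next_sibling("a")           # one following sibling
next_all = first.find_next_siblings("a")          # list of following siblings
prev_one = first.find_previous_sibling("p")       # one preceding sibling
prev_all = next_one.find_previous_siblings("a")   # list of preceding siblings

print(shorthand_ids == full_ids)
print(first["id"], missing, parent["id"], len(parents))
print(next_one["id"], [tag["id"] for tag in next_all])
print(prev_one["id"], [tag["id"] for tag in prev_all])
```

The singular methods return a single tag or None, while the plural ones always return a list, possibly empty.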

6. Summary

This can be called extracting web content, though it was not quite what I had imagined. Now I think I understand: we use the find_all() method on the HTML text to get what we want.

You should also have some understanding of regular expressions here, since they are used in searches.

Those are my notes.

Thank you for reading. If there are mistakes in the article, corrections are welcome; I'd be honored if it helps you.

Keywords: Python

Added by Soumen on Sat, 29 Jan 2022 13:09:18 +0200