Data parsing (XML, HTML)


Features and usage scenarios of XML

Creation of an XML file

Syntax rules for XML:

Labeling rules for XML:

Other components of XML

What are document constraints

XML vs HTML

What is XML parsing

Two ways of parsing

Common parsing tools for DOM

Parsing XML files using Dom4J

JSOUP parsing

Features and usage scenarios of XML

  • XML is plain text that uses UTF-8 encoding by default; its tags can be nested
  • If you save XML content to a file, that file is an XML file
  • Usage scenarios: XML content is often sent over the network as a message, or used as a configuration file that stores system information

Creation of an XML file

An XML file must have the suffix .xml, for example hello_world.xml

Syntax rules for XML:

The file name suffix of an XML file is .xml

The first line must be the document declaration:

<?xml version="1.0" encoding="UTF-8" ?>

version: the XML version number; this attribute is required

encoding: the character encoding of this XML file
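
To see the declaration from the program's side, the sketch below reads the version and encoding back after parsing. It assumes the DOM parser bundled with the JDK (javax.xml.parsers) rather than any library covered later in this article, and the class name DeclarationDemo is made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not part of the original article.
class DeclarationDemo {

    // Parses the XML text and returns "version/encoding" as recorded from the declaration.
    static String describe(String xml) {
        try {
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(bytes));
            return doc.getXmlVersion() + "/" + doc.getXmlEncoding();
        } catch (Exception e) {
            throw new IllegalStateException("could not parse the XML text", e);
        }
    }

    public static void main(String[] args) {
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\" ?><users></users>";
        System.out.println(describe(xml));
    }
}
```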

Labeling rules for XML:

  • A tag consists of a pair of angle brackets and a legal identifier: <name></name>. There must be exactly one root tag
  • Tags must come in pairs: a start tag and an end tag
  • An empty tag does not need a pair, but it must be self-closed: <br/>
  • Tags can define attributes, separated from the tag name by a space. Attribute values must be quoted: <student id="1"></student>
  • Tags must be nested correctly

<student id="1">
    <name>Zhang San</name>
</student>
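
The tag rules can be checked mechanically: a parser rejects any document that breaks them. The following is a minimal sketch using the JDK's built-in DOM parser (an assumption, since the article itself only introduces Dom4J and jsoup later); WellFormedCheck and isWellFormed are illustrative names.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of the original article.
class WellFormedCheck {

    // Returns true if the text follows the XML syntax rules, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors and keeps the parser from printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the parser found a syntax violation
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Paired, correctly nested tags with a quoted attribute
        System.out.println(isWellFormed("<student id=\"1\"><name>Zhang San</name></student>"));
        // Incorrect nesting
        System.out.println(isWellFormed("<student id=\"1\"><name>Zhang San</student></name>"));
        // Unquoted attribute value
        System.out.println(isWellFormed("<student id=1></student>"));
        // Two root tags
        System.out.println(isWellFormed("<a/><b/>"));
    }
}
```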


Other components of XML

  • Comments can be defined in an XML file: <!-- comment content -->
  • The following special characters are written as entity references in an XML file:

&lt;    <    less than
&gt;    >    greater than
&amp;   &    ampersand
&apos;  '    single quotation mark
&quot;  "    double quotation mark

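When XML text is generated by a program, the special characters in this section have to be replaced before the text is written out. Below is a small hand-written sketch of that replacement; XmlEscape and escape are hypothetical names, and real projects would more likely rely on their XML library to do this.

```java
// Hypothetical helper class, not part of the original article.
class XmlEscape {

    // Replaces the five special characters with their entity references.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (char c : text.toCharArray()) {
            switch (c) {
                case '<':  out.append("&lt;");   break;
                case '>':  out.append("&gt;");   break;
                case '&':  out.append("&amp;");  break;
                case '\'': out.append("&apos;"); break;
                case '"':  out.append("&quot;"); break;
                default:   out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The escaped result can be placed safely inside a tag body or attribute value.
        System.out.println("<note>" + escape("1 < 2 && 3 > 2") + "</note>");
    }
}
```
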
What are document constraints

Document constraints: rules that restrict how the tags and attributes in an XML file may be written.

Classification of document constraints: DTD and schema



XML Document Constraints - Use of DTD (for understanding)

Requirement: use a DTD constraint document to constrain how an XML file is written

1. Write the DTD constraint document, with the suffix .dtd

2. Import the DTD constraint document into the XML file you need to write

3. Write the contents of the XML file as required by the constraints
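
As a rough illustration of what a DTD constraint does, the sketch below validates two documents with the JDK's validating DOM parser. Several things here are assumptions: the element names (users, user, name), the rule that id is required, and the use of an internal DTD embedded in the document instead of a separate .dtd file, which keeps the example self-contained.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical demo class, not part of the original article.
class DtdValidationDemo {

    // Assumed constraints: users holds one or more user tags,
    // each user needs an id attribute and exactly one name child.
    static final String DTD =
            "<!DOCTYPE users [\n"
            + "  <!ELEMENT users (user+)>\n"
            + "  <!ELEMENT user (name)>\n"
            + "  <!ATTLIST user id CDATA #REQUIRED>\n"
            + "  <!ELEMENT name (#PCDATA)>\n"
            + "]>\n";

    // Returns true if the body satisfies the DTD above.
    static boolean isValid(String body) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setValidating(true); // turn on DTD validation
            DocumentBuilder builder = factory.newDocumentBuilder();
            builder.setErrorHandler(new ErrorHandler() {
                @Override public void warning(SAXParseException e) { }
                @Override public void error(SAXParseException e) throws SAXException { throw e; }
                @Override public void fatalError(SAXParseException e) throws SAXException { throw e; }
            });
            builder.parse(new InputSource(new StringReader(DTD + body)));
            return true;
        } catch (SAXException e) {
            return false; // a constraint or syntax rule was violated
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<users><user id=\"1\"><name>Zhang San</name></user></users>"));
        // Missing the required id attribute
        System.out.println(isValid("<users><user><name>Zhang San</name></user></users>"));
    }
}
```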

XML Document Constraints - Use of schema (for understanding)

1. A schema can constrain specific data types, so its constraints are more powerful than a DTD's.

2. The schema is itself an XML file and is in turn subject to other constraints, so it must be written more strictly.

Requirement: use a schema constraint document to constrain how an XML file is written

1. Write the schema constraint document, with the suffix .xsd (see the code for details)

2. Import the schema constraint document into the XML file you need to write

3. Write the tags of the XML file according to the constraints
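
The point about data types can be illustrated with the JDK's javax.xml.validation API. This is only a sketch under assumptions: the schema content (a user with a string name and an integer age) is invented for the example, and the schema is handed to the validator directly from a string instead of being imported into the XML file from an .xsd file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

// Hypothetical demo class, not part of the original article.
class SchemaValidationDemo {

    // Assumed schema: a user tag with a string name and an integer age.
    static final String XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
            + "<xs:element name=\"user\">"
            + "<xs:complexType><xs:sequence>"
            + "<xs:element name=\"name\" type=\"xs:string\"/>"
            + "<xs:element name=\"age\" type=\"xs:int\"/>"
            + "</xs:sequence></xs:complexType>"
            + "</xs:element>"
            + "</xs:schema>";

    // Returns true if the XML text satisfies the schema above.
    static boolean isValid(String xml) {
        try {
            Schema schema = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI)
                    .newSchema(new StreamSource(new StringReader(XSD)));
            schema.newValidator().validate(new StreamSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false; // the document breaks a schema constraint
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isValid("<user><name>Zhang San</name><age>20</age></user>"));
        // age is not an integer, which a DTD could not express
        System.out.println(isValid("<user><name>Zhang San</name><age>twenty</age></user>"));
    }
}
```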

XML vs HTML

Both are W3C standards. XML is mainly used to store and transfer data, while HTML is mainly used to describe web pages.

  • HTML is now widely used on the web
    • Fixed, semantic tags (tag names cannot be customized)
    • Lenient syntax (for example, leaving out the head tag does no harm)

  • XML is now widely used for data configuration
    • Fully customizable tags (unlike HTML)
    • Very strict syntax

What is XML parsing

XML parsing means using a program to read the data in an XML document

Two ways of parsing

  • DOM parsing

    • Builds a DOM (document) tree in memory when parsing the XML, so the contents of the tree can be accessed and modified freely
    • Disadvantages: the more content or nesting levels the document has, the larger the tree and the higher the memory usage
    • Advantages: any node can be accessed and modified (nodes can be added or deleted)

  • SAX parsing
    • Compared with DOM parsing, it is faster and more efficient: the document is read sequentially and each tag is reported as an event while it is being read, so no tree is built
    • Disadvantages: the complete tree is never available, so nodes cannot be added or deleted (the parser reads while parsing and does not know which elements come later)
    • Advantages: faster, more efficient, less memory
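
The contrast between the two approaches can be sketched with the parsers that ship with the JDK (an assumption made to keep the example free of extra jars; the XML content and the class name DomVsSaxDemo are invented). The DOM method modifies the tree after parsing, while the SAX method only reacts to events and keeps a single counter.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.SAXParserFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class, not part of the original article.
class DomVsSaxDemo {

    static final String XML = "<users><user id=\"1\"/><user id=\"2\"/></users>";

    // DOM: the whole tree is in memory, so a node can be added after parsing.
    static int domAddUserAndCount() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(XML)));
            Element extra = doc.createElement("user");
            extra.setAttribute("id", "3");
            doc.getDocumentElement().appendChild(extra);
            return doc.getElementsByTagName("user").getLength();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // SAX: tags are reported one by one as events; only the counter is kept in memory.
    static int saxCountUsers() {
        final int[] count = {0};
        try {
            SAXParserFactory.newInstance().newSAXParser().parse(
                    new InputSource(new StringReader(XML)),
                    new DefaultHandler() {
                        @Override
                        public void startElement(String uri, String localName,
                                                 String qName, Attributes attributes) {
                            if ("user".equals(qName)) {
                                count[0]++;
                            }
                        }
                    });
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
        return count[0];
    }

    public static void main(String[] args) {
        System.out.println("DOM, after adding one user: " + domAddUserAndCount());
        System.out.println("SAX, users seen while reading: " + saxCountUsers());
    }
}
```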

Common parsing tools for DOM

Parsing XML files using Dom4J

Requirement: Use Dom4J to parse data from an XML file


  • Download the Dom4J framework from the official site (dom4j) and get the .jar file

  • Create a folder named lib in the project

  • Copy the dom4j-2.1.1.jar file into the lib folder
  • Right-click the jar file, select Build Path -> Add to Build Path
  • Import the package in your class and use it

Dom4J parses the XML - gets the Document object

SAXReader class

Constructor / method           Explanation
public SAXReader()             Creates the Dom4J parser object
Document read(String url)      Loads an XML file into a Document object

Document class

Method                         Explanation
Element getRootElement()       Gets the root element object

Common methods in Dom4J

Method                                 Explanation
List<Element> elements()               Gets all child elements of the current element
List<Element> elements(String name)    Gets all child elements with the specified name under the current element
Element element(String name)           Gets the child element with the specified name under the current element; if several children share the name, returns the first one
String getName()                       Gets the element name
String attributeValue(String name)     Gets the attribute value directly from the attribute name
String elementText(String name)        Gets the text of the child element with the specified name
String getText()                       Gets the text of the current element

Code demonstration:

Create an xml file in the project
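
The contents of the XML file are not shown in the article. Judging from the parsing code, the root tag is users, each user has an id attribute (1, 2, 3) and a name child. The sketch below writes a file with that shape; the age tag and the names other than Zhang San are invented placeholders, and CreateUsersXml is a hypothetical class name.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper class, not part of the original article.
class CreateUsersXml {

    // Assumed sample content; only the users/user/id/name structure comes from the parsing code.
    static final String SAMPLE = String.join("\n",
            "<?xml version=\"1.0\" encoding=\"UTF-8\" ?>",
            "<users>",
            "    <user id=\"1\">",
            "        <name>Zhang San</name>",
            "        <age>18</age>",
            "    </user>",
            "    <user id=\"2\">",
            "        <name>Li Si</name>",
            "        <age>20</age>",
            "    </user>",
            "    <user id=\"3\">",
            "        <name>Wang Wu</name>",
            "        <age>22</age>",
            "    </user>",
            "</users>",
            "");

    // Writes the sample content to the given file and returns its path.
    static Path write(String fileName) {
        try {
            return Files.write(Paths.get(fileName), SAMPLE.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // The parsing code reads users.xml from the project's working directory.
        System.out.println("Wrote " + write("users.xml").toAbsolutePath());
    }
}
```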

Parse Code:

import java.io.File;
import java.util.List;

import org.dom4j.Attribute;
import org.dom4j.Document;
import org.dom4j.DocumentException;
import org.dom4j.Element;
import org.dom4j.io.SAXReader;

public class Dom4JDemo {

	public static void main(String[] args) throws DocumentException {
		// Create the Dom4J parser object
		SAXReader reader = new SAXReader();
		// Pass the XML file to be parsed to the parser;
		// document represents the whole XML file
		Document document = reader.read(new File("users.xml"));
		// Get the root element of the XML file
		Element root = document.getRootElement();	// here the root element is users
		// Get all child elements of the root element
		List<Element> elements = root.elements();
		// Traverse the child elements
		for (Element element1 : elements) {
			// Print the whole element as XML
//			System.out.println(element1.asXML());
			// Get the attribute list of the current element; output: id:1  id:2  id:3
			List<Attribute> att = element1.attributes();
//			for (Attribute attbu : att) {
//				System.out.println(attbu.getName() + ":" + attbu.getValue());
//			}
			// Get the value of the specified attribute
			System.out.println("id" + ":" + element1.attributeValue("id"));
			// Get all child elements
			List<Element> ele = element1.elements();
			for (Element elet : ele) {
				System.out.println(elet.getName() + ":" + elet.getText());
			}
			// Get the specified child element
			System.out.println("name" + ":" + element1.element("name").getText());
		}
	}
}


The program prints a lot of output, so it is not shown here. Interested readers can run the code above themselves.

JSOUP parsing

Features of jsoup parsing:

  • It can parse not only XML but also HTML
  • Elements can be retrieved by tag name
  • Put the jsoup jar in the lib folder of the project
  • The rest of the installation steps are the same as for Dom4J

Parsing with jsoup:

The code:


import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class JSOUPTest {

	public static void main(String[] args) throws IOException {
		// 1. Get the Document object for the XML file
		// Note: Document here is org.jsoup.nodes.Document; do not import the one from the Dom4J package
		Document document = Jsoup.parse(new File("users.xml"), "utf-8");
		// A single element can only be fetched through its id attribute. In XML/HTML every tag may define an id attribute, but the values must be unique
//		Element elementById = document.getElementById("1");
//		// Print the whole element whose id is "1" (including its child elements)
//		System.out.println(elementById);
		// Get all elements named user; elementsByTag is essentially a list
		Elements elementsByTag = document.getElementsByTag("user");
		// Print every user element; there are two ways to do it
		// 1. Treat elementsByTag as a list and traverse it
//		for (Element element : elementsByTag) {
//			System.out.println(element);
//		}
		// 2. Print it directly
		System.out.println(elementsByTag);
	}
}


Keywords: xml html

Added by phphelpme on Wed, 29 Dec 2021 02:51:01 +0200