Crawler: the json module and the jsonpath module

JSON (JavaScript Object Notation) is a lightweight data-interchange format that is easy for people to read and write and easy for machines to parse and generate. It suits data exchange scenarios such as communication between a website's front end and back end.

JSON is comparable to XML.

The json module is part of the Python 3.x standard library and can be used directly with import json.

Official documentation: http://docs.python.org/library/json.html

Online JSON parser: http://www.json.cn/#

JSON

JSON is, in short, the object and array notation of JavaScript. These two structures, objects and arrays, can be nested to represent arbitrarily complex data.

  1. Object: in JavaScript an object is written with {} and holds key-value pairs: {key1: value1, key2: value2, ...}. In object-oriented terms, the key is an attribute of the object and the value is that attribute's value, so a value is read as object.key. A value can be a number, string, array, or object (JSON also allows true, false, and null).
  2. Array: in JavaScript an array is the content enclosed in []: ['Python', 'JavaScript', 'C++', ...]. Elements are read by index, as in most languages. An element can be a number, string, array, or object.
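
As a small illustration of the two structures, the sketch below parses a JSON text that contains an object with a nested array, then reads one value by key and one by index. The sample data and variable names are made up for this example.

```python
import json

# Hypothetical sample: an object whose "languages" value is an array
text = '{"name": "Ant", "languages": ["Python", "JavaScript", "C++"]}'

data = json.loads(text)

# A JSON object becomes a dict: read a value by key
name = data["name"]

# A JSON array becomes a list: read a value by index
first_language = data["languages"][0]

print(name, first_language)
```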

json module

The json module provides four functions, dumps, dump, loads, and load, for converting between JSON text and Python data types. dumps and loads work with strings; dump and load work with file objects.
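
A minimal sketch of how the four functions pair up is shown here. It uses io.StringIO as an in-memory stand-in for a real file, which is an assumption made only to keep the example self-contained; the sample data is hypothetical.

```python
import io
import json

data = {"city": "Beijing", "tags": ["a", "b"]}

# dumps / loads: Python object <-> str
text = json.dumps(data)
from_text = json.loads(text)

# dump / load: Python object <-> file-like object
buffer = io.StringIO()   # in-memory stand-in for an open text file
json.dump(data, buffer)
buffer.seek(0)           # rewind before reading back
from_file = json.load(buffer)

print(from_text == data, from_file == data)
```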

1.json.dumps()

Converts a Python object into a JSON string and returns a str. Python types map to JSON types as follows:

Python          JSON
dict            object
list, tuple     array
str             string
int, float      number
True            true
False           false
None            null

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import json

listStr = [1, 2, 3, 4]
tupleStr = (1, 2, 3, 4)
dictStr = {"city": "Beijing", "name": "Ant"}

print(json.dumps(listStr))
# [1, 2, 3, 4]

print(type(json.dumps(listStr)))
# <class 'str'>

print(json.dumps(tupleStr))
# [1, 2, 3, 4]

print(type(json.dumps(tupleStr)))
# <class 'str'>

# Note: json.dumps() escapes non-ASCII characters as \uXXXX by default
# Pass ensure_ascii=False to keep non-ASCII characters as-is in the returned str
print(json.dumps(dictStr, ensure_ascii=False))
# {"city": "Beijing", "name": "Ant"}

print(type(json.dumps(dictStr, ensure_ascii = False)))
# <class 'str'>
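
The sample values above are plain ASCII, so the effect of ensure_ascii is not visible in the output. The sketch below uses non-ASCII sample values, chosen here only for illustration, to show the difference, and also shows the indent and sort_keys parameters of json.dumps() for readable output.

```python
import json

# Hypothetical record with non-ASCII values
record = {"name": "蚂蚁", "city": "北京"}

# Default: non-ASCII characters are written as \uXXXX escapes
escaped = json.dumps(record)

# ensure_ascii=False keeps the characters as they are
readable = json.dumps(record, ensure_ascii=False)

# indent and sort_keys produce multi-line output with sorted keys
pretty = json.dumps(record, ensure_ascii=False, indent=2, sort_keys=True)

print(escaped)
print(readable)
print(pretty)
```

Both forms decode back to the same Python object; the choice only affects how the text looks.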

2.json.dump()

Serializes a Python built-in object to JSON and writes it to a file.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import json

listStr = [{"city": "Beijing"}, {"name": "Ant"}]
# Use a with block so the file is closed (and flushed) after writing
with open("listStr.json", "w", encoding="utf-8") as f:
    json.dump(listStr, f, ensure_ascii=False)

dictStr = {"city": "Beijing", "name": "Ant"}
with open("dictStr.json", "w", encoding="utf-8") as f:
    json.dump(dictStr, f, ensure_ascii=False)

3.json.loads()

Decodes a JSON string into a Python object. JSON types map to Python types as follows:

JSON            Python
object          dict
array           list
string          str
number (int)    int
number (real)   float
true            True
false           False
null            None

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import json

strList = '[1, 2, 3, 4]'

strDict = '{"city": "Beijing", "name": "Ant"}'

print(json.loads(strList))
# [1, 2, 3, 4]

# JSON strings are decoded to Python str (Unicode) objects
print(json.loads(strDict))
# {'city': 'Beijing', 'name': 'Ant'}
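
To make the mapping concrete, the sketch below parses a made-up JSON text and records the Python type of each value. It also shows two points worth knowing: a tuple written with dumps() comes back as a list, and text that is not valid JSON (for example, a dict printed with single quotes) raises json.JSONDecodeError.

```python
import json

# Hypothetical JSON text covering several value types
text = '{"count": 3, "ratio": 0.5, "ok": true, "missing": null, "items": [1, 2]}'
parsed = json.loads(text)

# Name of the Python type produced for each JSON value
types = {key: type(value).__name__ for key, value in parsed.items()}
print(types)

# A tuple is written as a JSON array, so it is read back as a list
round_tripped = json.loads(json.dumps((1, 2, 3)))
print(round_tripped)

# Single-quoted text is a Python repr, not JSON, and fails to parse
try:
    json.loads("{'city': 'Beijing'}")
    is_valid = True
except json.JSONDecodeError:
    is_valid = False
print(is_valid)
```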

4.json.load()

Reads JSON text from a file and converts it into Python objects.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import json

# Use a with block so each file is closed after reading
with open("listStr.json", "r", encoding="utf-8") as f:
    strList = json.load(f)
print(strList)
# [{'city': 'Beijing'}, {'name': 'Ant'}]

with open("dictStr.json", "r", encoding="utf-8") as f:
    strDict = json.load(f)
print(strDict)
# {'city': 'Beijing', 'name': 'Ant'}
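
The dump() and load() examples depend on files left in the working directory. As a self-contained variant, the sketch below writes and reads back in a temporary directory; the file name cities.json is a hypothetical choice for this example.

```python
import json
import os
import tempfile

cities = [{"city": "Beijing"}, {"name": "Ant"}]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cities.json")  # hypothetical file name

    # Write: Python object -> JSON text in a file
    with open(path, "w", encoding="utf-8") as f:
        json.dump(cities, f, ensure_ascii=False)

    # Read: JSON text in a file -> Python object
    with open(path, "r", encoding="utf-8") as f:
        loaded = json.load(f)

print(loaded == cities)
```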

JsonPath

JsonPath is a library for extracting specified information from JSON documents. Implementations are available for several languages, including JavaScript, Python, PHP, and Java.

JsonPath is to JSON what XPath is to XML.

JsonPath versus XPath syntax:

JsonPath has a clear structure, is highly readable and low in complexity, and makes matching straightforward. The following table maps its syntax to the XPath equivalents.

 
XPath   JSONPath   Description
/       $          Root node
.       @          Current node
/       . or []    Child node
..      n/a        Parent node (not supported by JsonPath)
//      ..         Recursive descent: select all matching nodes regardless of position
*       *          Wildcard: match all element nodes
@       n/a        Attribute access (not supported by JsonPath)
[]      []         Subscript operator (array indexes, selection by content, etc.)
|       [,]        Multiple selection inside a subscript
[]      ?()        Filter expression
n/a     ()         Script expression
()      n/a        Grouping (not supported by JsonPath)
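
To show what the recursive-descent operator (..) does, here is a hand-written, pure-Python sketch that collects every value stored under a given key at any depth, roughly what an expression such as $..name selects. The find_all helper and the sample document are hypothetical and written for this illustration; this is not the jsonpath library's implementation, and result order may differ from a real JsonPath engine.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth of `node`."""
    found = []
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                found.append(v)
            # Keep descending: the value may contain more matches
            found.extend(find_all(v, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical nested document
doc = {
    "groups": {
        "A": [{"id": 1, "name": "Anshan"}],
        "B": [{"id": 2, "name": "Beijing"}, {"id": 3, "name": "Baoding"}],
    }
}

names = find_all(doc, "name")
print(names)
```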

Examples:

Take the Lagou city JSON file, http://www.lagou.com/lbs/getAllCitySearchLabels.json, as an example and extract all city names.

#!/usr/bin/python3
# -*- coding:utf-8 -*-
__author__ = 'mayi'

import urllib.request
import json
import jsonpath

# Lagou city JSON file
url = 'http://www.lagou.com/lbs/getAllCitySearchLabels.json'
# User-Agent Header
header = {'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.71 Safari/537.36'}

# Build the Request from the url and headers; it carries a Chrome browser User-Agent
request = urllib.request.Request(url, headers = header)

# Send this request to the server
response = urllib.request.urlopen(request)

# Get page content: bytes
html = response.read()

# Transcoding: bytes to str
html = html.decode("utf-8")

# Converting json format strings into python objects
obj = json.loads(html)

# Starting from the root node, match every name node at any depth
city_list = jsonpath.jsonpath(obj, '$..name')

# Print the acquired name node
print(city_list)
# Print its type
print(type(city_list))

# Write to local disk file
with open("city.json", "w", encoding = "utf-8") as f:
    content = json.dumps(city_list, ensure_ascii = False)
    f.write(content)
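
One caveat, stated here as an assumption to check against the installed version: jsonpath.jsonpath() is generally reported to return False rather than an empty list when nothing matches. If so, the script above would write the text false to city.json on a failed match. The sketch below shows one way to guard against that; names_or_empty is a hypothetical helper, and the results are simulated so the example runs without network access or the jsonpath package.

```python
import json


def names_or_empty(result):
    """Keep a list result as-is; turn False or None into an empty list."""
    return result if isinstance(result, list) else []


# Simulated results; in the script above the value would come from
# jsonpath.jsonpath(obj, '$..name')
matched = ["Beijing", "Shanghai"]
unmatched = False

matched_text = json.dumps(names_or_empty(matched), ensure_ascii=False)
unmatched_text = json.dumps(names_or_empty(unmatched), ensure_ascii=False)

print(matched_text)
print(unmatched_text)
```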

Keywords: Python JSON encoding Attribute

Added by badapple on Wed, 12 Jun 2019 00:59:47 +0300