Chapter 6: Data Encoding and Processing

6.1 Reading and Writing CSV Data

Problem

You want to read and write a CSV file.

Solution

For most kinds of CSV data reading and writing, use the csv module. For example, suppose you have some stock market data in a file named stocks.csv, like this:

Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
"C",53.08,"6/11/2007","9:36am",-0.25,360900
"CAT",78.29,"6/11/2007","9:36am",-0.23,225400

The following shows you how to read these data as a sequence of tuples:

import csv
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Process row
        ...

In the above code, row will be a list. Therefore, in order to access a field, you need to use subscripts, such as row[0] to access Symbol and row[4] to access Change.

Since this subscript access usually causes confusion, you can consider using named tuples. For example:

from collections import namedtuple
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...

This allows you to use column names such as row.Symbol and row.Change instead of subscripts. Note that this only works if the column names are valid Python identifiers. If they aren't, you may need to massage the original headers first (for example, by replacing non-identifier characters with underscores).

Another alternative is to read the data as a sequence of dictionaries instead. To do that, use this code:

import csv
with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        # process row
        ...

In this version, you can access the fields of each row by column name, for example row['Symbol'] or row['Change'].

In order to write CSV data, you can still use the CSV module, but first create a writer object. For example:

headers = ['Symbol','Price','Date','Time','Change','Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
         ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
         ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),
       ]

with open('stocks.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)

If you have a dictionary sequence of data, you can do something like this:

headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},
        {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',
        'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},
        ]

with open('stocks.csv','w') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)

Discussion

You should almost always prefer the csv module over manually splitting or parsing CSV data yourself. For example, you might be inclined to write code like this:

with open('stocks.csv') as f:
    for line in f:
        row = line.split(',')
        # process row
        ...

The problem with this approach is that you still need to handle some nasty details. For example, if any field is surrounded by quotes, you have to strip the quotes yourself. In addition, if a quoted field happens to contain a comma, the split produces a row of the wrong size and the code breaks.

By default, the csv library recognizes the CSV encoding rules used by Microsoft Excel. This is probably the most common variant and will likely give you the best compatibility. However, if you consult the csv documentation, you will find that the library can be tuned to other formats (for example, by changing the separator character). To read tab-delimited data instead, use this:

# Example of reading tab-separated values
with open('stock.tsv') as f:
    f_tsv = csv.reader(f, delimiter='\t')
    for row in f_tsv:
        # Process row
        ...

If you are reading CSV data and converting it into named tuples, you need to be careful about validating the column headers. For example, a CSV file might have a header line containing characters that are not legal in identifiers, like this:

Street Address,Num-Premises,Latitude,Longitude
5412 N CLARK,10,41.980262,-87.668452

This will cause the creation of the named tuple to fail with a ValueError exception. To work around it, you may have to clean up the headers first, for instance by substituting the illegal characters with a regular expression, as shown here:

import re
with open('stock.csv') as f:
    f_csv = csv.reader(f)
    headers = [ re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv) ]
    Row = namedtuple('Row', headers)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...

Another important point to emphasize is that the csv module always returns the data as strings; it performs no type conversion of any kind. If you need such conversions, you must do them yourself. Here is an example of performing extra type conversions on CSV data:

col_types = [str, float, str, str, float, int]
with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Apply conversions to the row items
        row = tuple(convert(value) for convert, value in zip(col_types, row))
        ...

Alternatively, here is an example of converting selected fields of dictionaries:

print('Reading as dicts with type conversion')
field_types = [ ('Price', float),
                ('Change', float),
                ('Volume', int) ]

with open('stocks.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                for key, conversion in field_types)
        print(row)

In general, you should be careful with such conversions. In the real world, it's common for CSV files to have missing values, corrupted data, and other issues that would break type conversions. So, unless your data is guaranteed to be error free, you need to account for these problems (for example, by adding suitable exception handling).
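
For instance, here is a minimal sketch of the conversion loop with a guard added, assuming the same stocks.csv layout as above; rows whose fields fail to convert are simply skipped, though you might prefer to log or repair them instead:

import csv

col_types = [str, float, str, str, float, int]

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        try:
            # Conversion failures (e.g., empty or malformed fields) raise ValueError
            row = tuple(convert(value) for convert, value in zip(col_types, row))
        except ValueError:
            continue        # skip (or log) the bad row
        # Process row
        ...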

Finally, if your goal in reading CSV data is to perform data analysis and statistics, you might want to look at the pandas package. pandas includes a convenient pandas.read_csv() function that loads CSV data into a DataFrame object. From there, you can generate various summary statistics, filter the data, and perform other kinds of high-level operations. An example is given in section 6.13.
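
As a quick illustration of that approach (a sketch only, assuming pandas is installed and the same stocks.csv file), the basic usage might look like this:

import pandas as pd

# Load the CSV into a DataFrame; column types are inferred automatically
df = pd.read_csv('stocks.csv')

print(df.head())                  # first few rows
print(df['Change'].describe())    # summary statistics for one column
print(df[df['Change'] < 0])       # rows where the price dropped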

6.2 Reading and Writing JSON Data

Problem

You want to read and write data in JSON (JavaScript Object Notation) encoding format.

Solution

The json module provides a very simple way to encode and decode JSON data. The two main functions are json.dumps() and json.loads(), mirroring the interface used in other serialization libraries such as pickle. The following shows how to turn a Python data structure into JSON:

import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}

json_str = json.dumps(data)

The following shows how to convert a JSON encoded string back to a Python data structure:

data = json.loads(json_str)

If you are working with files instead of strings, you can alternatively use json.dump() and json.load() to encode and decode JSON data. For example:

# Writing JSON data
with open('data.json', 'w') as f:
    json.dump(data, f)

# Reading data back
with open('data.json', 'r') as f:
    data = json.load(f)

Discussion

The basic data types supported by JSON encoding are None, bool, int, float, and str, as well as lists, tuples, and dictionaries containing those types. For dictionaries, keys must be strings (any non-string keys in a dictionary are converted to strings when encoding). To comply with the JSON specification, you should only encode Python lists and dictionaries. Moreover, in web applications, it is standard practice for the top-level object to be a dictionary.
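
For example, non-string dictionary keys are converted to strings and tuples are encoded as JSON arrays:

>>> json.dumps({1: 'one', 2: 'two'})
'{"1": "one", "2": "two"}'
>>> json.dumps(('a', 'b', 'c'))
'["a", "b", "c"]'
>>>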

The format of JSON encoding is almost identical to Python syntax except for a few minor differences. For instance, True is mapped to true, False is mapped to false, and None is mapped to null. Here is an example that shows what the encoded strings look like:

>>> json.dumps(False)
'false'
>>> d = {'a': True,
...     'b': 'Hello',
...     'c': None}
>>> json.dumps(d)
'{"b": "Hello", "c": null, "a": true}'
>>>

If you are trying to examine data you have decoded from JSON, it can often be hard to ascertain its structure simply by printing it, especially if the data contains a deep level of nesting or a lot of fields. To assist with this, consider using the pprint() function in the pprint module instead of the normal print() function. It alphabetizes the keys and outputs a dictionary in a more readable fashion. Here is an example that shows how you would pretty-print the results of a search on Twitter:

>>> from urllib.request import urlopen
>>> import json
>>> u = urlopen('http://search.twitter.com/search.json?q=python&rpp=5')
>>> resp = json.loads(u.read().decode('utf-8'))
>>> from pprint import pprint
>>> pprint(resp)
{'completed_in': 0.074,
'max_id': 264043230692245504,
'max_id_str': '264043230692245504',
'next_page': '?page=2&max_id=264043230692245504&q=python&rpp=5',
'page': 1,
'query': 'python',
'refresh_url': '?since_id=264043230692245504&q=python',
'results': [{'created_at': 'Thu, 01 Nov 2012 16:36:26 +0000',
            'from_user': ...
            },
            {'created_at': 'Thu, 01 Nov 2012 16:36:14 +0000',
            'from_user': ...
            },
            {'created_at': 'Thu, 01 Nov 2012 16:36:13 +0000',
            'from_user': ...
            },
            {'created_at': 'Thu, 01 Nov 2012 16:36:07 +0000',
            'from_user': ...
            }
            {'created_at': 'Thu, 01 Nov 2012 16:36:04 +0000',
            'from_user': ...
            }],
'results_per_page': 5,
'since_id': 0,
'since_id_str': '0'}
>>>

Normally, JSON decoding creates dicts or lists from the supplied data. If you want to create different kinds of objects, you can pass the object_pairs_hook or object_hook argument to json.loads(). For example, here is how you would decode JSON data, preserving its order in an OrderedDict:

>>> s = '{"name": "ACME", "shares": 50, "price": 490.1}'
>>> from collections import OrderedDict
>>> data = json.loads(s, object_pairs_hook=OrderedDict)
>>> data
OrderedDict([('name', 'ACME'), ('shares', 50), ('price', 490.1)])
>>>

The following is an example of how to convert a JSON dictionary into a Python object:

>>> class JSONObject:
...     def __init__(self, d):
...         self.__dict__ = d
...
>>>
>>> data = json.loads(s, object_hook=JSONObject)
>>> data.name
'ACME'
>>> data.shares
50
>>> data.price
490.1
>>>

In the last example, the dictionary created by JSON decoding is passed as a single argument to __init__(). From there, you are free to use it as you will, such as using it directly as the instance dictionary.

There are a few options that can be useful when encoding JSON. If you would like the output to be nicely formatted, you can use the indent argument to json.dumps(). It causes the output to be pretty printed in a format similar to that of the pprint() function. For example:

>>> print(json.dumps(data))
{"price": 542.23, "name": "ACME", "shares": 100}
>>> print(json.dumps(data, indent=4))
{
    "price": 542.23,
    "name": "ACME",
    "shares": 100
}
>>>

Object instances are usually not JSON serializable. For example:

>>> class Point:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> p = Point(2, 3)
>>> json.dumps(p)
Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
    File "/usr/local/lib/python3.3/json/__init__.py", line 226, in dumps
        return _default_encoder.encode(obj)
    File "/usr/local/lib/python3.3/json/encoder.py", line 187, in encode
        chunks = self.iterencode(o, _one_shot=True)
    File "/usr/local/lib/python3.3/json/encoder.py", line 245, in iterencode
        return _iterencode(o, 0)
    File "/usr/local/lib/python3.3/json/encoder.py", line 169, in default
        raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <__main__.Point object at 0x1006f2650> is not JSON serializable
>>>

If you want to serialize an object instance, you can provide a function whose input is an instance and returns a serializable dictionary. For example:

def serialize_instance(obj):
    d = { '__classname__' : type(obj).__name__ }
    d.update(vars(obj))
    return d

If you want to get such an instance back from JSON, you could do this:

# Dictionary mapping names to known classes
classes = {
    'Point' : Point
}

def unserialize_object(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls) # Make instance without calling __init__
        for key, value in d.items():
            setattr(obj, key, value)
        return obj
    else:
        return d

Here are examples of how to use these functions:

>>> p = Point(2,3)
>>> s = json.dumps(p, default=serialize_instance)
>>> s
'{"__classname__": "Point", "y": 3, "x": 2}'
>>> a = json.loads(s, object_hook=unserialize_object)
>>> a
<__main__.Point object at 0x1017577d0>
>>> a.x
2
>>> a.y
3
>>>

The json module has a variety of other options for controlling the low-level interpretation of numbers, special values such as NaN, and so on; consult the official documentation for the full details.
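
For example, NaN is emitted by default, strict mode rejects it, and the parse_float hook lets you decode numbers as Decimal instances (the exact traceback text may vary between Python versions):

>>> json.dumps(float('nan'))                    # allowed by default
'NaN'
>>> json.dumps(float('nan'), allow_nan=False)   # strict mode
Traceback (most recent call last):
    ...
ValueError: Out of range float values are not JSON compliant
>>> from decimal import Decimal
>>> json.loads('{"price": 490.1}', parse_float=Decimal)
{'price': Decimal('490.1')}
>>>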

6.3 Parsing Simple XML Data

Problem

You want to extract data from a simple XML document.

Solution

The xml.etree.ElementTree module can be used to extract data from simple XML documents. To illustrate, suppose you want to parse the RSS feed on Planet Python. Here is the corresponding code:

from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')

    print(title)
    print(date)
    print(link)
    print()

Run the above code and the output is similar to this:

Steve Holden: Python for Data Analysis
Mon, 19 Nov 2012 02:13:51 +0000
http://holdenweb.blogspot.com/2012/11/python-for-data-analysis.html

Vasudev Ram: The Python Data model (for v2 and v3)
Sun, 18 Nov 2012 22:06:47 +0000
http://jugad2.blogspot.com/2012/11/the-python-data-model.html

Python Diary: Been playing around with Object Databases
Sun, 18 Nov 2012 20:40:29 +0000
http://www.pythondiary.com/blog/Nov.18,2012/been-...-object-databases.html

Vasudev Ram: Wakari, Scientific Python in the cloud
Sun, 18 Nov 2012 20:19:41 +0000
http://jugad2.blogspot.com/2012/11/wakari-scientific-python-in-cloud.html

Jesse Jiryu Davis: Toro: synchronization primitives for Tornado coroutines
Sun, 18 Nov 2012 20:17:49 +0000
http://feedproxy.google.com/~r/EmptysquarePython/~3/_DOZT2Kd0hQ/

Obviously, if you want to do further processing, you need to replace the print() statements with something more interesting.

Discussion

It is common in many applications to process data in XML encoded format. XML is not only widely used in data exchange on the Internet, but also a common format for storing application data (such as word processing, music library, etc.). The following discussion assumes that readers are familiar with the basics of XML.

In many cases, when XML is used to store only data, the corresponding document structure is very compact and intuitive. For example, the RSS feed in the above example is similar to the following format:

<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel>
        <title>Planet Python</title>
        <link>http://planet.python.org/</link>
        <language>en</language>
        <description>Planet Python - http://planet.python.org/</description>
        <item>
            <title>Steve Holden: Python for Data Analysis</title>
            <guid>http://holdenweb.blogspot.com/...-data-analysis.html</guid>
            <link>http://holdenweb.blogspot.com/...-data-analysis.html</link>
            <description>...</description>
            <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
        </item>
        <item>
            <title>Vasudev Ram: The Python Data model (for v2 and v3)</title>
            <guid>http://jugad2.blogspot.com/...-data-model.html</guid>
            <link>http://jugad2.blogspot.com/...-data-model.html</link>
            <description>...</description>
            <pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
        </item>
        <item>
            <title>Python Diary: Been playing around with Object Databases</title>
            <guid>http://www.pythondiary.com/...-object-databases.html</guid>
            <link>http://www.pythondiary.com/...-object-databases.html</link>
            <description>...</description>
            <pubDate>Sun, 18 Nov 2012 20:40:29 +0000</pubDate>
        </item>
        ...
    </channel>
</rss>

The xml.etree.ElementTree.parse() function parses the entire XML document into a document object. From there, you use methods such as find(), iterfind(), and findtext() to search for specific XML elements. The arguments to these functions are the names of specific tags, such as channel/item or title.

When specifying tags, you need to take the overall document structure into account. Each find operation takes place relative to a starting element, and the tag name you supply is likewise relative to that element. For example, the call doc.iterfind('channel/item') looks for all item elements under a channel element, where doc refers to the top of the document (the top-level rss element). The subsequent calls such as item.findtext() then operate relative to the found item elements.

Each element in the ElementTree module has some important attributes and methods, which are very useful in parsing. The tag attribute contains the name of the tag, the text attribute contains the internal text, and the get() method can get the attribute value. For example:

>>> doc
<xml.etree.ElementTree.ElementTree object at 0x101339510>
>>> e = doc.find('channel/title')
>>> e
<Element 'title' at 0x10135b310>
>>> e.tag
'title'
>>> e.text
'Planet Python'
>>> e.get('some_attribute')
>>>

One thing to emphasize is that xml.etree.ElementTree is not the only option for XML parsing. For more advanced applications, you should consider lxml. It uses the same programming interface as ElementTree, so the example shown above also works with lxml; you simply need to change the initial import to from lxml.etree import parse. lxml fully conforms to the XML standard and is extremely fast. It also provides support for features such as validation, XSLT, and XPath.
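
For instance, a sketch of the same RSS loop with the import swapped (assuming lxml is installed) looks like this:

from urllib.request import urlopen
from lxml.etree import parse      # drop-in replacement for the ElementTree import

u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)
for item in doc.iterfind('channel/item'):
    print(item.findtext('title'))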

6.4 Parsing Large XML Files Incrementally

Problem

You want to use as little memory as possible to extract data from a large XML document.

Solution

Any time you are faced with incremental data processing, you should immediately think of iterators and generators. Here is a very simple function that can be used to incrementally process huge XML files using very little memory:

from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    # Skip the root element
    next(doc)

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass

To test the function, you first need a large XML file to work with. You can often find such files on government and open data websites. For example, you can download Chicago's pothole database in XML format. At the time of this writing, the downloaded file contained more than 100,000 rows of data, encoded like this:

<response>
    <row>
        <row ...>
            <creation_date>2012-11-18T00:00:00</creation_date>
            <status>Completed</status>
            <completion_date>2012-11-18T00:00:00</completion_date>
            <service_request_number>12-01906549</service_request_number>
            <type_of_service_request>Pot Hole in Street</type_of_service_request>
            <current_activity>Final Outcome</current_activity>
            <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
            <street_address>4714 S TALMAN AVE</street_address>
            <zip>60632</zip>
            <x_coordinate>1159494.68618856</x_coordinate>
            <y_coordinate>1873313.83503384</y_coordinate>
            <ward>14</ward>
            <police_district>9</police_district>
            <community_area>58</community_area>
            <latitude>41.808090232127896</latitude>
            <longitude>-87.69053684711305</longitude>
            <location latitude="41.808090232127896"
            longitude="-87.69053684711305" />
        </row>
        <row ...>
            <creation_date>2012-11-18T00:00:00</creation_date>
            <status>Completed</status>
            <completion_date>2012-11-18T00:00:00</completion_date>
            <service_request_number>12-01906695</service_request_number>
            <type_of_service_request>Pot Hole in Street</type_of_service_request>
            <current_activity>Final Outcome</current_activity>
            <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
            <street_address>3510 W NORTH AVE</street_address>
            <zip>60647</zip>
            <x_coordinate>1152732.14127696</x_coordinate>
            <y_coordinate>1910409.38979075</y_coordinate>
            <ward>26</ward>
            <police_district>14</police_district>
            <community_area>23</community_area>
            <latitude>41.91002084292946</latitude>
            <longitude>-87.71435952353961</longitude>
            <location latitude="41.91002084292946"
            longitude="-87.71435952353961" />
        </row>
    </row>
</response>

Suppose you want to write a script that ranks ZIP codes by the number of pothole reports. To do it, you could write code like this:

from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()

doc = parse('potholes.xml')
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1
for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

The only problem with this script is that it first reads and parses the entire XML file into memory. On my machine, it takes about 450 MB of memory to run. Using the parse_and_remove() function instead, the program changes only slightly:

from collections import Counter

potholes_by_zip = Counter()

data = parse_and_remove('potholes.xml', 'row/row')
for pothole in data:
    potholes_by_zip[pothole.findtext('zip')] += 1
for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)

This version of the code needs only about 7 MB of memory to run, a huge saving in memory resources.

Discussion

The techniques in this section rely on two core features of the ElementTree module. First, the iterparse() function allows incremental processing of XML documents. To use it, you supply a filename along with an event list consisting of one or more of the following: start, end, start-ns, and end-ns. The iterator created by iterparse() produces tuples of the form (event, elem), where event is one of the listed events and elem is the corresponding XML element. For example:

>>> data = iterparse('potholes.xml',('start','end'))
>>> next(data)
('start', <Element 'response' at 0x100771d60>)
>>> next(data)
('start', <Element 'row' at 0x100771e68>)
>>> next(data)
('start', <Element 'row' at 0x100771fc8>)
>>> next(data)
('start', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('end', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('start', <Element 'status' at 0x1006a7f18>)
>>> next(data)
('end', <Element 'status' at 0x1006a7f18>)
>>>

start events are created when an element is first created but not yet populated with any other data (such as child elements). end events are created when an element is completed. Although not shown in the example, start-ns and end-ns events are used to handle XML namespace declarations.

In this section's example, the start and end events are used to manage stacks of elements and tags. The stacks represent the current hierarchical structure of the document as it is being parsed, and they are also used to determine whether an element matches the path passed to the parse_and_remove() function. If there is a match, the element is handed back to the caller using the yield statement.

The statement that follows the yield is the key ElementTree feature that makes the program run with so little memory:

elem_stack[-2].remove(elem)

This statement causes the element previously generated by yield to be deleted from its parent node. Assuming that there is no other place to reference this element, the element is destroyed and memory is reclaimed.

The end effect of this iterative parsing and node removal is a highly efficient incremental sweep over the document. At no point is a complete document tree constructed, yet the XML data can still be processed in a fairly straightforward manner.

The main drawback of this scheme is its performance. As a result of my own tests, the version that reads the entire document into memory runs almost twice as fast as the incremental version. But it uses 60 times more memory than the latter. Therefore, if you are more concerned about memory usage, the incremental version wins.

6.5 Converting Dictionaries to XML

Problem

You want to use a Python dictionary to store data and convert it to XML format.

Solution

Although the xml.etree.ElementTree library is commonly used for parsing, it can also be used to create XML documents. For example, consider this function:

from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem

Here is an example:

>>> s = { 'name': 'GOOG', 'shares': 100, 'price':490.1 }
>>> e = dict_to_xml('stock', s)
>>> e
<Element 'stock' at 0x1004b64c8>
>>>

The result of this conversion is an Element instance. For I/O, it is easy to convert it to a byte string using the tostring() function in xml.etree.ElementTree. For example:

>>> from xml.etree.ElementTree import tostring
>>> tostring(e)
b'<stock><price>490.1</price><shares>100</shares><name>GOOG</name></stock>'
>>>

If you want to add an attribute value to an element, you can use the set() method:

>>> e.set('_id','1234')
>>> tostring(e)
b'<stock _id="1234"><price>490.1</price><shares>100</shares><name>GOOG</name>
</stock>'
>>>

If you want to preserve the order of the elements, consider constructing an OrderedDict instead of a normal dictionary; see section 1.7.
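
For example, continuing the session above with the dict_to_xml() function already defined, an OrderedDict keeps the elements in the order in which the keys were inserted:

>>> from collections import OrderedDict
>>> s = OrderedDict([('name', 'GOOG'), ('shares', 100), ('price', 490.1)])
>>> tostring(dict_to_xml('stock', s))
b'<stock><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'
>>>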

Discussion

When creating XML, you might be tempted to simply construct strings instead. For example:

def dict_to_xml_str(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key,val))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)

The problem with manual string construction is that you can easily get into trouble. For example, what happens when a dictionary value contains special characters, like this?

>>> d = { 'name' : '<spam>' }

>>> # String creation
>>> dict_to_xml_str('item',d)
'<item><name><spam></name></item>'

>>> # Proper XML creation
>>> e = dict_to_xml('item',d)
>>> tostring(e)
b'<item><name>&lt;spam&gt;</name></item>'
>>>

Notice how in the latter example the characters < and > were replaced with &lt; and &gt;.

Just for reference, if you ever need to escape or unescape such characters manually, you can use the escape() and unescape() functions in xml.sax.saxutils. For example:

>>> from xml.sax.saxutils import escape, unescape
>>> escape('<spam>')
'&lt;spam&gt;'
>>> unescape(_)
'<spam>'
>>>

Aside from producing correct output, the other reason to prefer creating Element instances over strings is that strings do not compose as easily into a larger document. An Element instance, by contrast, can be manipulated in a variety of ways without ever worrying about parsing XML text. Essentially, you can do all of your processing on a higher-level data structure and only render it as a string at the very end.

6.6 Parsing and Modifying XML

Problem

You want to read an XML document, make some changes to it, and then write the results back to the XML document.

Solution

The xml.etree.ElementTree module makes it easy to carry out such tasks. The first step is to parse the document in the usual way. For example, suppose you have a document named pred.xml that looks like this:

<?xml version="1.0"?>
<stop>
    <id>14791</id>
    <nm>Clark &amp; Balmoral</nm>
    <sri>
        <rt>22</rt>
        <d>North Bound</d>
        <dd>North Bound</dd>
    </sri>
    <cr>22</cr>
    <pre>
        <pt>5 MIN</pt>
        <fd>Howard</fd>
        <v>1378</v>
        <rn>22</rn>
    </pre>
    <pre>
        <pt>15 MIN</pt>
        <fd>Howard</fd>
        <v>1867</v>
        <rn>22</rn>
    </pre>
</stop>

The following is an example of using ElementTree to read this document and modify it:

>>> from xml.etree.ElementTree import parse, Element
>>> doc = parse('pred.xml')
>>> root = doc.getroot()
>>> root
<Element 'stop' at 0x100770cb0>

>>> # Remove a few elements
>>> root.remove(root.find('sri'))
>>> root.remove(root.find('cr'))
>>> # Insert a new element after <nm>...</nm>
>>> root.getchildren().index(root.find('nm'))
1
>>> e = Element('spam')
>>> e.text = 'This is a test'
>>> root.insert(2, e)

>>> # Write back to a file
>>> doc.write('newpred.xml', xml_declaration=True)
>>>

The processing result is a new XML file like the following:

<?xml version='1.0' encoding='us-ascii'?>
<stop>
    <id>14791</id>
    <nm>Clark &amp; Balmoral</nm>
    <spam>This is a test</spam>
    <pre>
        <pt>5 MIN</pt>
        <fd>Howard</fd>
        <v>1378</v>
        <rn>22</rn>
    </pre>
    <pre>
        <pt>15 MIN</pt>
        <fd>Howard</fd>
        <v>1867</v>
        <rn>22</rn>
    </pre>
</stop>

Discussion

Modifying the structure of an XML document is easy, but you must remember that all modifications go through the parent element, which is treated as if it were a list. For example, if you remove an element, it is removed from its immediate parent using the parent's remove() method. If you insert or append new elements, you likewise use the insert() and append() methods of the parent. Elements can also be manipulated using indexing and slicing, such as element[i] or element[i:j].
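
As a short sketch continuing the pred.xml session above after the edits (element addresses are illustrative), children can be examined by index and slice:

>>> len(root)          # number of child elements after the modifications
5
>>> root[0]
<Element 'id' at 0x100770d10>
>>> root[1:3]
[<Element 'nm' at 0x100770d60>, <Element 'spam' at 0x100770db0>]
>>>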

If you need to create a new Element, you can use the Element class shown in the scheme in this section. We have discussed it in detail in section 6.5.

6.7 Parsing XML Documents with Namespaces

Problem

You want to parse an XML document that uses an XML namespace.

Solution

Consider the following document that uses namespaces:

<?xml version="1.0" encoding="utf-8"?>
<top>
    <author>David Beazley</author>
    <content>
        <html xmlns="http://www.w3.org/1999/xhtml">
            <head>
                <title>Hello World</title>
            </head>
            <body>
                <h1>Hello World!</h1>
            </body>
        </html>
    </content>
</top>

If you parse the document and execute a normal query, you will find that this is not so easy, because all the steps become quite cumbersome.

>>> # Some queries that work
>>> doc.findtext('author')
'David Beazley'
>>> doc.find('content')
<Element 'content' at 0x100776ec0>
>>> # A query involving a namespace (doesn't work)
>>> doc.find('content/html')
>>> # Works if fully qualified
>>> doc.find('content/{http://www.w3.org/1999/xhtml}html')
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> # Doesn't work
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')
>>> # Fully qualified
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'
... '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')
'Hello World'
>>>

You can simplify this process by wrapping the namespace processing logic into a tool class:

class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)
    def register(self, name, uri):
        self.namespaces[name] = '{'+uri+'}'
    def __call__(self, path):
        return path.format_map(self.namespaces)

Use this class in the following way:

>>> ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
>>> doc.find(ns('content/{html}html'))
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
>>>

Discussion

Parsing XML documents with namespaces can be cumbersome. The XMLNamespaces class shown above is simply meant to tidy things up a bit, letting you use shortened namespace names in queries instead of the full URIs.

Unfortunately, there is no way to get namespace information in basic ElementTree parsing. However, if you use the iterparse() function, you can get more information about the scope of namespace processing. For example:

>>> from xml.etree.ElementTree import iterparse
>>> for evt, elem in iterparse('ns2.xml', ('end', 'start-ns', 'end-ns')):
... print(evt, elem)
...
end <Element 'author' at 0x10110de10>
start-ns ('', 'http://www.w3.org/1999/xhtml')
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x1011131b0>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x1011130a8>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x101113310>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x101113260>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10110df70>
end-ns None
end <Element 'content' at 0x10110de68>
end <Element 'top' at 0x10110dd60>
>>> elem # This is the topmost element
<Element 'top' at 0x10110dd60>
>>>

Finally, if the XML you are processing makes use of namespaces in addition to other advanced features, you are really better off using the lxml library instead of ElementTree. For instance, lxml provides better support for validating documents with DTDs, more complete XPath support, and other advanced XML features. This section really just shows how to make namespace parsing a little easier.

6.8 Interacting with a Relational Database

Problem

You want to query, add or delete records in a relational database.

Solution

The standard way to represent multiline data in Python is a sequence of tuples. For example:

stocks = [
    ('GOOG', 100, 490.1),
    ('AAPL', 50, 545.75),
    ('FB', 150, 7.45),
    ('HPQ', 75, 33.2),
]

As described in PEP 249, by providing data in this form you can easily use Python's standard database API to interact with relational databases. All operations on the database are carried out through SQL queries, and each row of input or output data is represented by a tuple.

For demonstration, you can use the sqlite3 module from the Python standard library. If you are using a different database (such as MySQL, PostgreSQL, or ODBC), you will have to install a third-party module to support it. However, the corresponding programming interfaces are virtually identical, apart from minor nuances.

The first step is to connect to the database. Usually you need to execute the connect() function and provide it with some database name, host, user name, password and other necessary parameters. For example:

>>> import sqlite3
>>> db = sqlite3.connect('database.db')
>>>

In order to process the data, you need to create a cursor next. Once you have a cursor, you can execute SQL queries. For example:

>>> c = db.cursor()
>>> c.execute('create table portfolio (symbol text, shares integer, price real)')
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>

To insert multiple records into the database table, use a statement like the following:

>>> c.executemany('insert into portfolio values (?,?,?)', stocks)
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>

To execute a query, use a statement like the following:

>>> for row in db.execute('select * from portfolio'):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
('FB', 150, 7.45)
('HPQ', 75, 33.2)
>>>

If you want to perform queries that accept user-supplied input as parameters, make sure you pass the parameters using placeholders such as ?, like this:

>>> min_price = 100
>>> for row in db.execute('select * from portfolio where price >= ?',
                          (min_price,)):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
>>>

Discussion

Interacting with a database at this low level is very straightforward: you simply supply SQL statements and call the appropriate module functions to update or retrieve data. That said, there are some tricky details you will need to sort out on a case-by-case basis.

One complication is the mapping of data from the database into Python types. For dates, it is most common to use datetime instances from the datetime module, or possibly system timestamps as used in the time module. For numerical data, especially financial data involving decimals, numbers may be represented as Decimal instances from the decimal module. Unfortunately, the exact mapping rules vary by database backend, so you will have to read the associated documentation.
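
As one illustration, sqlite3 lets you register adapters and converters to round-trip Decimal values. The details below (including the prices.db filename and the made-up DECTEXT column type) are just a sketch specific to sqlite3; other database modules handle type mapping differently:

import sqlite3
from decimal import Decimal

# Store Decimal values as text and convert them back when reading
sqlite3.register_adapter(Decimal, str)
sqlite3.register_converter('DECTEXT', lambda b: Decimal(b.decode('ascii')))

db = sqlite3.connect('prices.db', detect_types=sqlite3.PARSE_DECLTYPES)
c = db.cursor()
c.execute('create table prices (symbol text, price DECTEXT)')
c.execute('insert into prices values (?, ?)', ('GOOG', Decimal('490.10')))
db.commit()

for row in db.execute('select * from prices'):
    print(row)        # ('GOOG', Decimal('490.10'))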

A more fundamental issue concerns the construction of SQL statement strings. You should never use Python string formatting operators (such as %) or the .format() method to build such strings. If the values supplied to those operators come from user input, your program is open to SQL injection attacks (see http://xkcd.com/327). The ? placeholder in a query instructs the backend database to use its own string substitution mechanism, which is much safer.

Unfortunately, placeholder styles differ between database backends. Many modules use ? or %s, while others use different notation such as :0 or :1 to refer to parameters. Again, you will have to consult the documentation for the database module you are using. The paramstyle attribute of a database module also tells you which parameter quoting style it expects.
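
For example, sqlite3 reports the 'qmark' style (? placeholders); a module reporting the 'format' style would expect %s instead:

>>> import sqlite3
>>> sqlite3.paramstyle
'qmark'
>>>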

For simple reading and writing of database data, using the database API is usually simple enough. If you need to handle more complicated things, it often makes sense to use a higher-level interface, such as an object-relational mapper (ORM). Libraries such as SQLAlchemy allow database tables to be described as Python classes and perform all of the database operations while hiding the underlying SQL.

6.9 Encoding and Decoding Hexadecimal Digits

Problem

You want to decode a hexadecimal string into a byte string or encode a byte string into a hexadecimal string.

Solution

If you simply need to decode or encode a raw string of hexadecimal digits, use the binascii module. For example:

>>> # Initial byte string
>>> s = b'hello'
>>> # Encode as hex
>>> import binascii
>>> h = binascii.b2a_hex(s)
>>> h
b'68656c6c6f'
>>> # Decode back to bytes
>>> binascii.a2b_hex(h)
b'hello'
>>>

Similar functions can also be found in the base64 module. For example:

>>> import base64
>>> h = base64.b16encode(s)
>>> h
b'68656C6C6F'
>>> base64.b16decode(h)
b'hello'
>>>

Discussion

For the most part, converting to and from hex is straightforward with the functions shown above. The main difference between the two techniques is in case handling. The base64.b16decode() and base64.b16encode() functions only operate on uppercase hexadecimal letters, whereas the functions in binascii work with either case.
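
For example (output shown for illustration), lowercase digits are fine for binascii but are rejected by b16decode() unless you pass casefold=True:

>>> import binascii, base64
>>> binascii.a2b_hex(b'68656c6c6f')              # lowercase accepted
b'hello'
>>> base64.b16decode(b'68656c6c6f')              # lowercase rejected
Traceback (most recent call last):
    ...
binascii.Error: Non-base16 digit found
>>> base64.b16decode(b'68656c6c6f', casefold=True)
b'hello'
>>>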

It's also important to note that the output produced by the encoding functions is always a byte string. If you need to force it to Unicode, you must add an extra decoding step. For example:

>>> h = base64.b16encode(s)
>>> print(h)
b'68656C6C6F'
>>> print(h.decode('ascii'))
68656C6C6F
>>>

When decoding hexadecimal numbers, the functions b16decode() and a2b_hex() can accept bytes or unicode strings. However, unicode strings must contain only ASCII encoded hexadecimal numbers.

6.10 Encoding and Decoding Base64 Data

Problem

You need to decode or encode binary data in Base64 format.

Solution

The base64 module has two functions, b64encode() and b64decode(), that do exactly what you want. For example:

>>> # Some byte data
>>> s = b'hello'
>>> import base64

>>> # Encode as Base64
>>> a = base64.b64encode(s)
>>> a
b'aGVsbG8='

>>> # Decode from Base64
>>> base64.b64decode(a)
b'hello'
>>>

Discussion

Base64 encoding is only meant to be used on byte-oriented data such as byte strings and byte arrays. Moreover, the output of the encoding process is always a byte string. If you are mixing Base64-encoded data with Unicode text, you must add an extra decoding step. For example:

>>> a = base64.b64encode(s).decode('ascii')
>>> a
'aGVsbG8='
>>>

When decoding Base64, byte string and Unicode text can be used as parameters. However, Unicode strings can only contain ASCII characters.

6.11 Reading and Writing Binary Array Data

Problem

You want to read or write data encoded as a binary array of uniform structures into Python tuples.

Solution

To work with binary data, use the struct module. Here is an example of code that writes a list of Python tuples to a binary file, encoding each tuple as a structure using struct:

from struct import Struct
def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [ (1, 2.3, 4.5),
                (6, 7.8, 9.0),
                (12, 13.4, 56.7) ]
    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)

There are many ways to read this file and return a list of tuples. First, if you plan to read the file incrementally in blocks, you can do this:

from struct import Struct

def read_records(format, f):
    record_struct = Struct(format)
    chunks = iter(lambda: f.read(record_struct.size), b'')
    return (record_struct.unpack(chunk) for chunk in chunks)

# Example
if __name__ == '__main__':
    with open('data.b','rb') as f:
        for rec in read_records('<idd', f):
            # Process rec
            ...

If you would rather read the whole file into a byte string in one go and then parse it piece by piece, you can do this:

from struct import Struct

def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack_from(data, offset)
            for offset in range(0, len(data), record_struct.size))

# Example
if __name__ == '__main__':
    with open('data.b', 'rb') as f:
        data = f.read()
    for rec in unpack_records('<idd', data):
        # Process rec
        ...

In both cases, the result is an iterable that produces the tuples originally used to create the file.

Discussion

For programs that must encode and decode binary data, it is common to use the struct module. To declare a new structure, simply create a Struct instance like this:

# Little endian 32-bit integer, two double precision floats
record_struct = Struct('<idd')

Structures are always defined using a set of structure codes such as i, d, f, and so forth (see the Python documentation for the full list). These codes correspond to specific binary data types such as 32-bit integers, 64-bit floats, 32-bit floats, and so on. The < in the first character specifies the byte ordering; in this example it means "little endian." Change the character to > for big endian, or to ! for network byte order.

The resulting Struct instance has many properties and methods to manipulate the corresponding type of structure. The size attribute contains the number of bytes of the structure, which is useful in I/O operations. The pack() and unpack() methods are used to package and unpack data. For example:

>>> from struct import Struct
>>> record_struct = Struct('<idd')
>>> record_struct.size
20
>>> record_struct.pack(1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> record_struct.unpack(_)
(1, 2.0, 3.0)
>>>

Sometimes you can see that pack() and unpack() operations are called as module level functions, like the following:

>>> import struct
>>> struct.pack('<idd', 1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> struct.unpack('<idd', _)
(1, 2.0, 3.0)
>>>

This works, but it doesn't feel as elegant as the instance method, especially when the same structure in your code appears in multiple places. By creating a Struct instance, the format code will be specified only once, and all operations will be processed centrally. This makes code maintenance easier (because you only need to change one part of the code).

The code for reading binary structures involves a number of interesting and elegant programming idioms. In the read_records() function, iter() is used to make an iterator that returns fixed-size chunks (see section 5.8). This iterator repeatedly calls a user-supplied callable (e.g., lambda: f.read(record_struct.size)) until it returns a special sentinel value (e.g., b''), at which point the iteration stops. For example:

>>> f = open('data.b', 'rb')
>>> chunks = iter(lambda: f.read(20), b'')
>>> chunks
<callable_iterator object at 0x10069e6d0>
>>> for chk in chunks:
... print(chk)
...
b'\x01\x00\x00\x00ffffff\x02@\x00\x00\x00\x00\x00\x00\x12@'
b'\x06\x00\x00\x00333333\x1f@\x00\x00\x00\x00\x00\x00"@'
b'\x0c\x00\x00\x00\xcd\xcc\xcc\xcc\xcc\xcc*@\x9a\x99\x99\x99\x99YL@'
>>>

As you can see, one reason for creating such an iterable is that it allows records to be created using a generator expression, as shown in the solution. If you didn't use this technique, the code might look like this instead:

def read_records(format, f):
    record_struct = Struct(format)
    while True:
        chk = f.read(record_struct.size)
        if chk == b'':
            break
        yield record_struct.unpack(chk)

The unpack_records() function uses a different method, unpack_from(). unpack_from() is very useful for extracting binary data from a larger binary array, because it does so without making any temporary objects or memory copies. You just give it a byte string (or any array) along with a byte offset, and it will unpack fields directly from that location.

If you used unpack() instead of unpack_from(), you would need to modify the code to construct a lot of small slices and perform offset calculations. For example:

def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack(data[offset:offset + record_struct.size])
            for offset in range(0, len(data), record_struct.size))

In addition to being more complicated to read, this version also requires a lot more work, as it performs various offset calculations, copies data, and creates small slice objects. If you are going to unpack a lot of structures from a large byte string, unpack_from() is the better approach.

One way to spruce up the record unpacking is to use namedtuple from the collections module, which lets you attach attribute names to the returned tuples. For example:

from collections import namedtuple

Record = namedtuple('Record', ['kind','x','y'])

with open('data.b', 'rb') as f:
    records = (Record(*r) for r in read_records('<idd', f))

for r in records:
    print(r.kind, r.x, r.y)

If your program needs to process large amounts of binary data, you are better off using a library such as numpy. For example, instead of reading binary data into a list of tuples, you can read it into a structured array, like this:

>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>

Finally, if you are faced with the task of reading binary data in some known file format (such as an image format, graphics file, HDF5, etc.), check whether a Python module already exists for it. There is no reason to reinvent the wheel if you don't have to.

6.12 Reading Nested and Variable-Length Binary Data

Problem

You need to read complicated binary-encoded data that contains a collection of nested and/or variable-length records. Such data might include images, video, map files, and so on.

Solution

The struct module can be used to decode and encode almost any kind of binary data structure. To illustrate the kind of data in question, suppose you have this Python data structure representing a collection of points that make up a series of polygons:

polys = [
    [ (1.0, 2.5), (3.5, 4.0), (2.5, 1.5) ],
    [ (7.0, 1.2), (5.1, 3.0), (0.5, 7.5), (0.8, 9.0) ],
    [ (3.4, 6.3), (1.2, 0.5), (4.6, 9.2) ],
]

Now suppose that the data is encoded into a binary file starting with the following header:

+------+--------+------------------------------------+
|Byte  | Type   |  Description                       |
+======+========+====================================+
|0     | int    |  File code (0x1234, little endian) |
+------+--------+------------------------------------+
|4     | double |  Minimum x (little endian)         |
+------+--------+------------------------------------+
|12    | double |  Minimum y (little endian)         |
+------+--------+------------------------------------+
|20    | double |  Maximum x (little endian)         |
+------+--------+------------------------------------+
|28    | double |  Maximum y (little endian)         |
+------+--------+------------------------------------+
|36    | int    |  Number of polygons (little endian)|
+------+--------+------------------------------------+

Immediately following the header is a series of polygon records, each encoded as follows:

+------+--------+-------------------------------------------+
|Byte  | Type   |  Description                              |
+======+========+===========================================+
|0     | int    |  Record length including length (N bytes) |
+------+--------+-------------------------------------------+
|4-N   | Points |  Pairs of (X, Y) coordinates as doubles   |
+------+--------+-------------------------------------------+

To write such a file, you can use the following Python code:

import struct
import itertools

def write_polys(filename, polys):
    # Determine bounding box
    flattened = list(itertools.chain(*polys))
    min_x = min(x for x, y in flattened)
    max_x = max(x for x, y in flattened)
    min_y = min(y for x, y in flattened)
    max_y = max(y for x, y in flattened)
    with open(filename, 'wb') as f:
        f.write(struct.pack('<iddddi', 0x1234,
                            min_x, min_y,
                            max_x, max_y,
                            len(polys)))
        for poly in polys:
            size = len(poly) * struct.calcsize('<dd')
            f.write(struct.pack('<i', size + 4))
            for pt in poly:
                f.write(struct.pack('<dd', *pt))

To read the data back, you can use the struct.unpack() function in much the same way, essentially reversing the steps performed during writing. For example:

def read_polys(filename):
    with open(filename, 'rb') as f:
        # Read the header
        header = f.read(40)
        file_code, min_x, min_y, max_x, max_y, num_polys = \
            struct.unpack('<iddddi', header)
        polys = []
        for n in range(num_polys):
            pbytes, = struct.unpack('<i', f.read(4))
            poly = []
            for m in range(pbytes // 16):
                pt = struct.unpack('<dd', f.read(16))
                poly.append(pt)
            polys.append(poly)
    return polys

Although this code works, it is cluttered with low-level reading, unpacking, and bookkeeping details. If such code were used to process a real data file, it would quickly become even messier. So there is clearly room for an alternative solution that simplifies these steps and lets the programmer focus on the more important things.

In the remainder of this section, I will build up, in stages, a more elegant scheme for interpreting binary data. The goal is to let a programmer provide a high-level specification of the file format while hiding the details of reading and unpacking the data. A word of warning: the code that follows is probably the most advanced example in this book, making heavy use of object-oriented programming and metaprogramming techniques. Be sure to read the discussion carefully and consult the related sections.

First, when reading binary data, it is common for the file to begin with headers and other data structures. Although the struct module can unpack such data into a tuple, another way to represent the information is with a class. Like this:

import struct

class StructField:
    '''
    Descriptor representing a simple structure field
    '''
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            r = struct.unpack_from(self.format, instance._buffer, self.offset)
            return r[0] if len(r) == 1 else r

class Structure:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

Here, the StructField descriptor represents each structure field. Each descriptor contains a struct-compatible format code along with a byte offset into an underlying memory buffer. In the __get__() method, the struct.unpack_from() function is used to unpack a value from the buffer without any extra slicing or copying.

The Structure class is just a base class that accepts byte data and stores it as the underlying memory buffer used by the StructField descriptors. The memoryview() used here serves a purpose that will be explained later.

Using this code, you can now define a high-level structure class that mirrors the table of header fields shown earlier. For example:

class PolyHeader(Structure):
    file_code = StructField('<i', 0)
    min_x = StructField('<d', 4)
    min_y = StructField('<d', 12)
    max_x = StructField('<d', 20)
    max_y = StructField('<d', 28)
    num_polys = StructField('<i', 36)

The following example uses this class to read the header data of the polygon data we wrote earlier:

>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader(f.read(40))
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>

This is interesting, but there are several annoyances with this approach. First, even though you get the convenience of a class-like interface, the code is rather verbose and requires the user to specify a lot of low-level detail (such as repeated uses of StructField and explicit offsets). Second, the resulting class lacks common conveniences, such as a way to compute the total size of the structure.

Any time you encounter class definitions that are this verbose, you should consider using a class decorator or a metaclass. One feature of a metaclass is that it can fill in a lot of low-level implementation details, taking that burden off the user. As an example, here is a metaclass and a slight reworking of the Structure class:

class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if format.startswith(('<','>','!','@')):
                byte_order = format[0]
                format = format[1:]
            format = byte_order + format
            setattr(self, fieldname, StructField(format, offset))
            offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

class Structure(metaclass=StructureMeta):
    def __init__(self, bytedata):
        self._buffer = bytedata

    @classmethod
    def from_file(cls, f):
        return cls(f.read(cls.struct_size))

Using this new Structure class, you can now write a structure definition like this:

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        ('d', 'min_x'),
        ('d', 'min_y'),
        ('d', 'max_x'),
        ('d', 'max_y'),
        ('i', 'num_polys')
    ]

As you can see, the specification is much less verbose. The added from_file() class method also makes it easy to read the data from a file without having to know anything about its size or layout in advance. For example:

>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>

Once you have a metaclass in play, you can make it smarter. For example, suppose you also want to support nested binary structures. Here is a slight reworking of the metaclass, along with a new supporting descriptor that makes this possible:

class NestedStruct:
    '''
    Descriptor representing a nested structure
    '''
    def __init__(self, name, struct_type, offset):
        self.name = name
        self.struct_type = struct_type
        self.offset = offset

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            data = instance._buffer[self.offset:
                            self.offset+self.struct_type.struct_size]
            result = self.struct_type(data)
            # Save resulting structure back on instance to avoid
            # further recomputation of this step
            setattr(instance, self.name, result)
            return result

class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if isinstance(format, StructureMeta):
                setattr(self, fieldname,
                        NestedStruct(fieldname, format, offset))
                offset += format.struct_size
            else:
                if format.startswith(('<','>','!','@')):
                    byte_order = format[0]
                    format = format[1:]
                format = byte_order + format
                setattr(self, fieldname, StructField(format, offset))
                offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

In this code, the NestedStruct descriptor is used to overlay another structure definition on top of a region of memory. It does this by taking a slice of the original memory buffer and using it to instantiate the given structure type. Since the underlying buffer was initialized as a memory view, this slicing does not incur any extra memory copies; the new instance simply overlays the original memory. In addition, to avoid repeated instantiation, the descriptor stores the resulting inner structure object on the instance, using the same technique described in section 8.10 (a short demonstration of this caching appears after the next example).

With this new revision, you can write as follows:

class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y')
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'), # nested struct
        (Point, 'max'), # nested struct
        ('i', 'num_polys')
    ]

Amazingly, it all works as you would expect. Here it is in action:

>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min # Nested structure
<__main__.Point object at 0x1006a48d0>
>>> phead.min.x
0.5
>>> phead.min.y
0.5
>>> phead.max.x
7.0
>>> phead.max.y
9.2
>>> phead.num_polys
3
>>>
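The caching mentioned above can be observed directly. Since NestedStruct only defines __get__(), it is a non-data descriptor, so once the Point has been saved in the instance dictionary, later lookups bypass the descriptor entirely. A small sketch, re-reading the same header:

>>> phead = PolyHeader.from_file(open('polys.bin', 'rb'))
>>> 'min' in vars(phead)      # nothing built yet for this field
False
>>> p = phead.min             # first access runs NestedStruct.__get__()
>>> 'min' in vars(phead)      # the Point object is now cached on the instance
True
>>> phead.min is p            # later accesses return the cached object
True
>>>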

So far, the framework handles fixed-size records, but what about variable-sized components? For example, the remainder of the polygon file contains sections of varying size.

One approach is to write a class that simply represents a chunk of byte data, together with a utility method for interpreting its contents in different ways. This is closely related to the code in section 6.11:

class SizedRecord:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

    @classmethod
    def from_file(cls, f, size_fmt, includes_size=True):
        sz_nbytes = struct.calcsize(size_fmt)
        sz_bytes = f.read(sz_nbytes)
        sz, = struct.unpack(size_fmt, sz_bytes)
        buf = f.read(sz - includes_size * sz_nbytes)
        return cls(buf)

    def iter_as(self, code):
        if isinstance(code, str):
            s = struct.Struct(code)
            for off in range(0, len(self._buffer), s.size):
                yield s.unpack_from(self._buffer, off)
        elif isinstance(code, StructureMeta):
            size = code.struct_size
            for off in range(0, len(self._buffer), size):
                data = self._buffer[off:off+size]
                yield code(data)

The SizedRecord.from_file() class method is a utility for reading a size-prefixed chunk of data from a file, which is a common layout in many file formats. As input, it accepts a structure format code containing the encoding of the size, which is expected to be in bytes. The optional includes_size argument specifies whether that byte count includes the size header itself. Here is how this code can be used to read the individual polygon records from the polygon file:

>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.num_polys
3
>>> polydata = [ SizedRecord.from_file(f, '<i')
...             for n in range(phead.num_polys) ]
>>> polydata
[<__main__.SizedRecord object at 0x1006a4d50>,
<__main__.SizedRecord object at 0x1006a4f50>,
<__main__.SizedRecord object at 0x10070da90>]
>>>

As shown, the contents of the SizedRecord instances have not yet been interpreted. To do that, use the iter_as() method, which accepts either a struct format code or a Structure class as input. This gives you a lot of flexibility in how the data is parsed. For example:

>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as('<dd'):
...         print(p)
...
Polygon 0
(1.0, 2.5)
(3.5, 4.0)
(2.5, 1.5)
Polygon 1
(7.0, 1.2)
(5.1, 3.0)
(0.5, 7.5)
(0.8, 9.0)
Polygon 2
(3.4, 6.3)
(1.2, 0.5)
(4.6, 9.2)
>>>

>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as(Point):
...         print(p.x, p.y)
...
Polygon 0
1.0 2.5
3.5 4.0
2.5 1.5
Polygon 1
7.0 1.2
5.1 3.0
0.5 7.5
0.8 9.0
Polygon 2
3.4 6.3
1.2 0.5
4.6 9.2
>>>
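If you would rather experiment without a polys.bin file on disk, the following sketch builds a single size-prefixed record by hand and feeds it to SizedRecord.from_file() through io.BytesIO (a standard-library in-memory file object, not part of the recipe). It also shows how the stored count relates to includes_size:

import io
import struct

# Two (x, y) points packed as doubles, preceded by a 4-byte little-endian
# size field that counts itself (hence the + 4)
points = [(1.0, 2.5), (3.5, 4.0)]
payload = b''.join(struct.pack('<dd', x, y) for x, y in points)
record = struct.pack('<i', len(payload) + 4) + payload

rec = SizedRecord.from_file(io.BytesIO(record), '<i')    # includes_size=True by default
print(list(rec.iter_as('<dd')))                  # [(1.0, 2.5), (3.5, 4.0)]
print([(p.x, p.y) for p in rec.iter_as(Point)])  # same data, as Point instances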

Putting all of the pieces together, here is an alternative formulation of the read_polys() function:

class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y')
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'),
        (Point, 'max'),
        ('i', 'num_polys')
    ]

def read_polys(filename):
    polys = []
    with open(filename, 'rb') as f:
        phead = PolyHeader.from_file(f)
        for n in range(phead.num_polys):
            rec = SizedRecord.from_file(f, '<i')
            poly = [ (p.x, p.y) for p in rec.iter_as(Point) ]
            polys.append(poly)
    return polys

discuss

This section makes use of a number of advanced programming techniques, including descriptors, lazy evaluation, metaclasses, class variables, and memory views. However, they all serve a single, very specific purpose.

A major feature of the implementation is that it is based on the idea of lazy unpacking. When a Structure instance is created, __init__() merely stores the supplied byte data (or a memory view of it) and does nothing else. In particular, no unpacking or other structure-related operations take place at that time. One motivation for this approach is that you might only be interested in a few parts of a binary record; only the fields that are actually accessed need to be unpacked, not the whole thing.
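The laziness is easy to demonstrate. In the sketch below, a PolyHeader instance is deliberately built from a truncated buffer; fields that fall inside the buffer unpack fine when accessed, while an out-of-range field only fails at the moment it is read:

import struct

partial = struct.pack('<i', 0x1234)   # only the first 4 bytes of a header
phead = PolyHeader(partial)

print(hex(phead.file_code))           # 0x1234 -- unpacked on demand
try:
    phead.num_polys                   # offset 36 lies outside the 4-byte buffer
except struct.error as e:
    print('only fails when accessed:', e)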

Lazy unpacking is implemented with the StructField descriptor class. Each attribute the user lists in _fields_ is turned into a StructField descriptor that stores the associated struct format code and byte offset into the stored buffer. The StructureMeta metaclass creates these descriptors automatically whenever a structure class is defined. The main reason for using a metaclass is to make it convenient for the user to specify the structure format with a high-level description, without having to worry about the low-level details.
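To see what the metaclass actually produced for the nested Point and PolyHeader definitions above, here is a short inspection sketch:

# Total sizes computed by StructureMeta from the _fields_ specifications
print(Point.struct_size)        # 16
print(PolyHeader.struct_size)   # 40

# The field names are class-level descriptors rather than ordinary attributes
print(type(PolyHeader.__dict__['file_code']))   # <class '__main__.StructField'>
print(type(PolyHeader.__dict__['min']))         # <class '__main__.NestedStruct'>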

One subtlety of StructureMeta is how it handles byte order. If any attribute specifies a byte order (< for little endian or > for big endian), that ordering is carried over to all subsequent fields. This avoids extra typing while still allowing the order to be switched partway through a definition, as in a more complicated structure such as the following:

class ShapeFile(Structure):
    _fields_ = [ ('>i', 'file_code'), # Big endian
        ('20s', 'unused'),
        ('i', 'file_length'),
        ('<i', 'version'), # Little endian
        ('i', 'shape_type'),
        ('d', 'min_x'),
        ('d', 'min_y'),
        ('d', 'max_x'),
        ('d', 'max_y'),
        ('d', 'min_z'),
        ('d', 'max_z'),
        ('d', 'min_m'),
        ('d', 'max_m') ]

As noted earlier, the use of memoryview() helps avoid memory copies. When structures are nested, memory views can overlay different parts of the structure definition on the same region of memory. This aspect is subtle, and it comes down to the slicing behavior of a memory view versus an ordinary byte string or byte array. If you slice a byte string or byte array, you usually get a copy of the data; a memory view slice does not copy anything, it simply overlays the existing memory, which makes it more efficient.
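As a minimal illustration of that difference (independent of this recipe), compare slicing a memory view with slicing the bytes themselves:

data = bytearray(b'ABCDEFGH')

m = memoryview(data)[2:6]   # overlays the same memory -- no copy is made
b = bytes(data)[2:6]        # an ordinary slice -- an independent copy

data[2:6] = b'wxyz'
print(m.tobytes())          # b'wxyz' -- the view reflects the change
print(b)                    # b'CDEF' -- the copy does not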

Several related sections can help extend the scheme discussed here. See section 8.13 for building a type system with descriptors. Section 8.10 has more on lazily computed attribute values and is relevant to the implementation of the NestedStruct descriptor. Section 9.19 shows an example of using a metaclass to initialize class members in a manner very similar to the StructureMeta class. Python's ctypes source code may also be of interest, since it provides comparable support for defining data structures, including nested ones.

6.13 summarizing data and performing statistics

problem

You need to crunch through a large dataset and generate summaries or other statistics.

Solution

For any data analysis problem involving statistics, time series, or other related techniques, you should consider the Pandas library.

To give you a taste of it, here is an example of using Pandas to analyze the City of Chicago rat and rodent database. At the time of this writing, the database is a CSV file with roughly 74,000 rows of data.

>>> import pandas

>>> # Read a CSV file, skipping last line
>>> rats = pandas.read_csv('rats.csv', skip_footer=1)
>>> rats
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74055 entries, 0 to 74054
Data columns:
Creation Date 74055 non-null values
Status 74055 non-null values
Completion Date 72154 non-null values
Service Request Number 74055 non-null values
Type of Service Request 74055 non-null values
Number of Premises Baited 65804 non-null values
Number of Premises with Garbage 65600 non-null values
Number of Premises with Rats 65752 non-null values
Current Activity 66041 non-null values
Most Recent Action 66023 non-null values
Street Address 74055 non-null values
ZIP Code 73584 non-null values
X Coordinate 74043 non-null values
Y Coordinate 74043 non-null values
Ward 74044 non-null values
Police District 74044 non-null values
Community Area 74044 non-null values
Latitude 74043 non-null values
Longitude 74043 non-null values
Location 74043 non-null values
dtypes: float64(11), object(9)

>>> # Investigate range of values for a certain field
>>> rats['Current Activity'].unique()
array([nan, Dispatch Crew, Request Sanitation Inspector], dtype=object)
>>> # Filter the data
>>> crew_dispatched = rats[rats['Current Activity'] == 'Dispatch Crew']
>>> len(crew_dispatched)
65676
>>>

>>> # Find 10 most rat-infested ZIP codes in Chicago
>>> crew_dispatched['ZIP Code'].value_counts()[:10]
60647 3837
60618 3530
60614 3284
60629 3251
60636 2801
60657 2465
60641 2238
60609 2206
60651 2152
60632 2071
>>>

>>> # Group by completion date
>>> dates = crew_dispatched.groupby('Completion Date')
>>> dates
<pandas.core.groupby.DataFrameGroupBy object at 0x10d0a2a10>
>>> len(dates)
472
>>>

>>> # Determine counts on each day
>>> date_counts = dates.size()
>>> date_counts[0:10]
Completion Date
01/03/2011 4
01/03/2012 125
01/04/2011 54
01/04/2012 38
01/05/2011 78
01/05/2012 100
01/06/2011 100
01/06/2012 58
01/07/2011 1
01/09/2012 12
>>>

>>> # Sort the counts
>>> date_counts.sort()
>>> date_counts[-10:]
Completion Date
10/12/2012 313
10/21/2011 314
09/20/2011 316
10/26/2011 319
02/22/2011 325
10/26/2012 333
03/17/2011 336
10/13/2011 378
10/14/2011 391
10/07/2011 457
>>>

Hmm, it seems that October 7, 2011, was a particularly busy day for the rats!

discuss

Pandas is a large library with far more features than can be covered here. But whenever you need to analyze large datasets, group data, compute statistics, or perform other similar tasks, it is well worth a look.
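One caveat: the interactive session above was recorded with a fairly old version of Pandas, and some names have since changed. With a recent release, the same analysis might look roughly like the following sketch (the rats.csv file is the same assumption as above; skipfooter and sort_values reflect the current API):

import pandas

# skip_footer was renamed skipfooter, which requires the Python parsing engine
rats = pandas.read_csv('rats.csv', skipfooter=1, engine='python')

crew_dispatched = rats[rats['Current Activity'] == 'Dispatch Crew']
print(crew_dispatched['ZIP Code'].value_counts().head(10))

date_counts = crew_dispatched.groupby('Completion Date').size()
print(date_counts.sort_values().tail(10))   # Series.sort() is now sort_values()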
