Chapter 6: Data Encoding and Processing
6.1 Reading and Writing CSV Data
Problem
You want to read and write a CSV file.
Solution
For most kinds of CSV data, you can use the csv module. For example, suppose you have some stock market data in a file named stocks.csv, like this:
Symbol,Price,Date,Time,Change,Volume
"AA",39.48,"6/11/2007","9:36am",-0.18,181800
"AIG",71.38,"6/11/2007","9:36am",-0.15,195500
"AXP",62.58,"6/11/2007","9:36am",-0.46,935000
"BA",98.31,"6/11/2007","9:36am",+0.12,104800
"C",53.08,"6/11/2007","9:36am",-0.25,360900
"CAT",78.29,"6/11/2007","9:36am",-0.23,225400
The following shows you how to read these data as a sequence of tuples:
import csv

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Process row
        ...
In the above code, row will be a list. Therefore, in order to access a field, you need to use subscripts, such as row[0] to access Symbol and row[4] to access Change.
Since this subscript access usually causes confusion, you can consider using named tuples. For example:
from collections import namedtuple

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headings = next(f_csv)
    Row = namedtuple('Row', headings)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...
This allows you to use column names such as row.Symbol and row.Change instead of indices. It should be noted that this only works when the column names are valid Python identifiers. If they are not, you may have to massage the original headers first (for example, by replacing non-identifier characters with underscores).
Another option is to read the data into a dictionary sequence. This can be done:
import csv

with open('stocks.csv') as f:
    f_csv = csv.DictReader(f)
    for row in f_csv:
        # process row
        ...
In this version, you can access the elements of each row using the column names. For example, row['Symbol'] or row['Change'].
In order to write CSV data, you can still use the CSV module, but first create a writer object. For example:
headers = ['Symbol','Price','Date','Time','Change','Volume']
rows = [('AA', 39.48, '6/11/2007', '9:36am', -0.18, 181800),
        ('AIG', 71.38, '6/11/2007', '9:36am', -0.15, 195500),
        ('AXP', 62.58, '6/11/2007', '9:36am', -0.46, 935000),
       ]

with open('stocks.csv','w') as f:
    f_csv = csv.writer(f)
    f_csv.writerow(headers)
    f_csv.writerows(rows)
If you have a dictionary sequence of data, you can do something like this:
headers = ['Symbol', 'Price', 'Date', 'Time', 'Change', 'Volume']
rows = [{'Symbol':'AA', 'Price':39.48, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.18, 'Volume':181800},
        {'Symbol':'AIG', 'Price': 71.38, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.15, 'Volume': 195500},
        {'Symbol':'AXP', 'Price': 62.58, 'Date':'6/11/2007',
         'Time':'9:36am', 'Change':-0.46, 'Volume': 935000},
       ]

with open('stocks.csv','w') as f:
    f_csv = csv.DictWriter(f, headers)
    f_csv.writeheader()
    f_csv.writerows(rows)
Discussion
You should almost always prefer the csv module over manually splitting and parsing CSV data yourself. For instance, you might be inclined to just write code like this:
with open('stocks.csv') as f:
    for line in f:
        row = line.split(',')
        # process row
        ...
One disadvantage of using this method is that you still need to deal with some difficult details. For example, if some field values are surrounded by quotation marks, you have to remove these quotation marks. In addition, if a field surrounded by quotation marks happens to contain a comma, the program will make an error because it produces a line of wrong size.
By default, the csv library recognizes the CSV dialect used by Microsoft Excel. This is probably the most common variant and will likely give you the best compatibility. However, if you consult the csv documentation, you'll find several ways to tweak it for other formats (e.g., changing the separator character). For example, if you want to read tab-delimited data instead, you can do this:
# Example of reading tab-separated values
with open('stock.tsv') as f:
    f_tsv = csv.reader(f, delimiter='\t')
    for row in f_tsv:
        # Process row
        ...
If you're reading CSV data and converting it into named tuples, you need to be careful about validating the column names. For example, a CSV file might have a header line containing characters that are not valid identifiers, like this:
Street Address,Num-Premises,Latitude,Longitude
5412 N CLARK,10,41.980262,-87.668452
This will cause the creation of the named tuple to fail with a ValueError exception. To work around it, you may have to scrub the headers first. For instance, you can substitute invalid characters using a regular expression, as shown here:
import re
from collections import namedtuple

with open('stock.csv') as f:
    f_csv = csv.reader(f)
    headers = [ re.sub('[^a-zA-Z_]', '_', h) for h in next(f_csv) ]
    Row = namedtuple('Row', headers)
    for r in f_csv:
        row = Row(*r)
        # Process row
        ...
Another important point to emphasize is that csv produces data as strings only and performs no other kind of type conversion. If such conversions matter, that is something you'll have to do yourself. Here is an example of performing extra type conversions on CSV data:
col_types = [str, float, str, str, float, int]

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        # Apply conversions to the row items
        row = tuple(convert(value) for convert, value in zip(col_types, row))
        ...
Alternatively, here is an example of converting only selected fields of dictionaries:
print('Reading as dicts with type conversion')
field_types = [ ('Price', float),
                ('Change', float),
                ('Volume', int) ]

with open('stocks.csv') as f:
    for row in csv.DictReader(f):
        row.update((key, conversion(row[key]))
                   for key, conversion in field_types)
        print(row)
In general, you'll probably want to be a bit careful with such conversions. In the real world, CSV files often have missing values, corrupted data, and other problems that would make the conversions fail. So, unless your data is guaranteed to be error free, you need to take those issues into account (you may need to add suitable error handling).
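As a sketch only, here is one way you might guard the conversions shown above against bad rows (the decision to skip bad rows is an illustrative assumption, not part of the original example):

import csv

col_types = [str, float, str, str, float, int]

with open('stocks.csv') as f:
    f_csv = csv.reader(f)
    headers = next(f_csv)
    for row in f_csv:
        try:
            row = tuple(convert(value) for convert, value in zip(col_types, row))
        except ValueError:
            # A missing or malformed field makes a conversion fail;
            # skip the row here (logging it would be another option)
            continue
        ...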
Finally, if your goal in reading CSV data is to perform data analysis and statistics, you might want to look at the pandas package. pandas includes a convenient pandas.read_csv() function that loads CSV data into a DataFrame object. From there, you can generate various kinds of summary statistics, filter the data, and perform other high-level operations. An example is given in section 6.13.
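As a quick illustration (a sketch assuming pandas is installed and the same stocks.csv file as above):

import pandas

df = pandas.read_csv('stocks.csv')
print(df.head())                  # first few rows as a DataFrame
print(df['Price'].describe())     # summary statistics for one column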
6.2 Reading and Writing JSON Data
Problem
You want to read and write data in JSON (JavaScript Object Notation) format.
Solution
The json module provides a very simple way to encode and decode JSON data. The two main functions are json.dumps() and json.loads(), and the interface is much smaller than that of other serialization libraries such as pickle. The following shows how to turn a Python data structure into JSON:
import json

data = {
    'name' : 'ACME',
    'shares' : 100,
    'price' : 542.23
}

json_str = json.dumps(data)
The following shows how to convert a JSON encoded string back to a Python data structure:
data = json.loads(json_str)
If you are working with files instead of strings, you can alternatively use json.dump() and json.load() to encode and decode JSON data. For example:
# Writing JSON data
with open('data.json', 'w') as f:
    json.dump(data, f)

# Reading data back
with open('data.json', 'r') as f:
    data = json.load(f)
Discussion
JSON encoding supports the basic types None, bool, int, float, and str, as well as lists, tuples, and dictionaries containing those types. For dictionaries, keys must be strings (any non-string keys are converted to strings when encoding). To stay compliant with the JSON specification, you should only encode Python lists and dictionaries. Moreover, in web applications, it is standard practice for the top-level object to be a dictionary.
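For instance, here is what happens to a tuple and to a non-string dictionary key during encoding:

>>> import json
>>> json.dumps((1, 2, 3))         # a tuple becomes a JSON list
'[1, 2, 3]'
>>> json.dumps({2: 'two'})        # a non-string key is converted to a string
'{"2": "two"}'
>>>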
The format of JSON encoding is almost identical to Python syntax except for a few small differences. For instance, True is mapped to true, False is mapped to false, and None is mapped to null. Here is an example that shows what encoded strings look like:
>>> json.dumps(False)
'false'
>>> d = {'a': True,
...      'b': 'Hello',
...      'c': None}
>>> json.dumps(d)
'{"b": "Hello", "c": null, "a": true}'
>>>
If you are trying to examine data you have decoded from JSON, it can often be hard to figure out its structure simply by printing it, especially if the data is deeply nested or contains a large number of fields. To help with this, consider using the pprint() function in the pprint module instead of the normal print() function. It alphabetizes the keys and produces more readable output. Here is an example of how you would pretty print the results of a search on Twitter:
>>> from urllib.request import urlopen
>>> import json
>>> u = urlopen('http://search.twitter.com/search.json?q=python&rpp=5')
>>> resp = json.loads(u.read().decode('utf-8'))
>>> from pprint import pprint
>>> pprint(resp)
{'completed_in': 0.074,
 'max_id': 264043230692245504,
 'max_id_str': '264043230692245504',
 'next_page': '?page=2&max_id=264043230692245504&q=python&rpp=5',
 'page': 1,
 'query': 'python',
 'refresh_url': '?since_id=264043230692245504&q=python',
 'results': [{'created_at': 'Thu, 01 Nov 2012 16:36:26 +0000',
              'from_user': ...
              },
             {'created_at': 'Thu, 01 Nov 2012 16:36:14 +0000',
              'from_user': ...
              },
             {'created_at': 'Thu, 01 Nov 2012 16:36:13 +0000',
              'from_user': ...
              },
             {'created_at': 'Thu, 01 Nov 2012 16:36:07 +0000',
              'from_user': ...
              },
             {'created_at': 'Thu, 01 Nov 2012 16:36:04 +0000',
              'from_user': ...
              }],
 'results_per_page': 5,
 'since_id': 0,
 'since_id_str': '0'}
>>>
Normally, JSON decoding creates dicts or lists from the supplied data. If you want to create different kinds of objects, pass the object_pairs_hook or object_hook argument to json.loads(). For example, here is how you would decode JSON data, preserving its order in an OrderedDict:
>>> s = '{"name": "ACME", "shares": 50, "price": 490.1}'
>>> from collections import OrderedDict
>>> data = json.loads(s, object_pairs_hook=OrderedDict)
>>> data
OrderedDict([('name', 'ACME'), ('shares', 50), ('price', 490.1)])
>>>
The following is an example of how to convert a JSON dictionary into a Python object:
>>> class JSONObject:
...     def __init__(self, d):
...         self.__dict__ = d
...
>>>
>>> data = json.loads(s, object_hook=JSONObject)
>>> data.name
'ACME'
>>> data.shares
50
>>> data.price
490.1
>>>
In this last example, the dictionary created by decoding the JSON data is passed as a single argument to __init__(). From there, you are free to use it as you like, such as using it directly as the instance dictionary of the object.
There are a few options that can be useful when encoding JSON. If you would like nicely formatted output, you can use the indent argument to json.dumps(). This causes the output to be pretty printed in a format similar to that of the pprint() function. For example:
>>> print(json.dumps(data))
{"price": 542.23, "name": "ACME", "shares": 100}
>>> print(json.dumps(data, indent=4))
{
    "price": 542.23,
    "name": "ACME",
    "shares": 100
}
>>>
Object instances are usually not JSON serializable. For example:
>>> class Point:
...     def __init__(self, x, y):
...         self.x = x
...         self.y = y
...
>>> p = Point(2, 3)
>>> json.dumps(p)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python3.3/json/__init__.py", line 226, in dumps
    return _default_encoder.encode(obj)
  File "/usr/local/lib/python3.3/json/encoder.py", line 187, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/usr/local/lib/python3.3/json/encoder.py", line 245, in iterencode
    return _iterencode(o, 0)
  File "/usr/local/lib/python3.3/json/encoder.py", line 169, in default
    raise TypeError(repr(o) + " is not JSON serializable")
TypeError: <__main__.Point object at 0x1006f2650> is not JSON serializable
>>>
If you want to serialize an object instance, you can provide a function whose input is an instance and returns a serializable dictionary. For example:
def serialize_instance(obj):
    d = { '__classname__' : type(obj).__name__ }
    d.update(vars(obj))
    return d
If you want to get this instance in reverse, you can do this:
# Dictionary mapping names to known classes
classes = {
    'Point' : Point
}

def unserialize_object(d):
    clsname = d.pop('__classname__', None)
    if clsname:
        cls = classes[clsname]
        obj = cls.__new__(cls)   # Make instance without calling __init__
        for key, value in d.items():
            setattr(obj, key, value)
        return obj
    else:
        return d
Here are examples of how to use these functions:
>>> p = Point(2,3)
>>> s = json.dumps(p, default=serialize_instance)
>>> s
'{"__classname__": "Point", "y": 3, "x": 2}'
>>> a = json.loads(s, object_hook=unserialize_object)
>>> a
<__main__.Point object at 0x1017577d0>
>>> a.x
2
>>> a.y
3
>>>
The json module also has many other options to control the parsing of lower-level numbers, special values such as NaN, etc. You can refer to the official documentation for more details.
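For example, two such options are sort_keys, which produces output with keys in sorted order, and parse_float, which controls how floating-point literals are decoded:

>>> import json
>>> json.dumps({'b': 2, 'a': 1}, sort_keys=True)
'{"a": 1, "b": 2}'
>>> from decimal import Decimal
>>> json.loads('{"price": 542.23}', parse_float=Decimal)
{'price': Decimal('542.23')}
>>>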
6.3 Parsing Simple XML Data
Problem
You want to extract data from a simple XML document.
Solution
The xml.etree.ElementTree module can be used to extract data from simple XML documents. For demonstration purposes, suppose you want to parse the RSS feed on Planet Python. Here is the corresponding code:
from urllib.request import urlopen
from xml.etree.ElementTree import parse

# Download the RSS feed and parse it
u = urlopen('http://planet.python.org/rss20.xml')
doc = parse(u)

# Extract and output tags of interest
for item in doc.iterfind('channel/item'):
    title = item.findtext('title')
    date = item.findtext('pubDate')
    link = item.findtext('link')

    print(title)
    print(date)
    print(link)
    print()
Run the above code and the output is similar to this:
Steve Holden: Python for Data Analysis
Mon, 19 Nov 2012 02:13:51 +0000
http://holdenweb.blogspot.com/2012/11/python-for-data-analysis.html

Vasudev Ram: The Python Data model (for v2 and v3)
Sun, 18 Nov 2012 22:06:47 +0000
http://jugad2.blogspot.com/2012/11/the-python-data-model.html

Python Diary: Been playing around with Object Databases
Sun, 18 Nov 2012 20:40:29 +0000
http://www.pythondiary.com/blog/Nov.18,2012/been-...-object-databases.html

Vasudev Ram: Wakari, Scientific Python in the cloud
Sun, 18 Nov 2012 20:19:41 +0000
http://jugad2.blogspot.com/2012/11/wakari-scientific-python-in-cloud.html

Jesse Jiryu Davis: Toro: synchronization primitives for Tornado coroutines
Sun, 18 Nov 2012 20:17:49 +0000
http://feedproxy.google.com/~r/EmptysquarePython/~3/_DOZT2Kd0hQ/
Obviously, if you want to do further processing, you need to replace the print() statements with something more interesting.
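For instance, here is a small sketch (building on the loop above) that collects the entries into a list of dictionaries for later processing instead of printing them:

items = []
for item in doc.iterfind('channel/item'):
    items.append({
        'title': item.findtext('title'),
        'date': item.findtext('pubDate'),
        'link': item.findtext('link'),
    })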
Discussion
It is common in many applications to process data in XML encoded format. XML is not only widely used in data exchange on the Internet, but also a common format for storing application data (such as word processing, music library, etc.). The following discussion assumes that readers are familiar with the basics of XML.
In many cases, when XML is used to store only data, the corresponding document structure is very compact and intuitive. For example, the RSS feed in the above example is similar to the following format:
<?xml version="1.0"?>
<rss version="2.0" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Planet Python</title>
    <link>http://planet.python.org/</link>
    <language>en</language>
    <description>Planet Python - http://planet.python.org/</description>
    <item>
      <title>Steve Holden: Python for Data Analysis</title>
      <guid>http://holdenweb.blogspot.com/...-data-analysis.html</guid>
      <link>http://holdenweb.blogspot.com/...-data-analysis.html</link>
      <description>...</description>
      <pubDate>Mon, 19 Nov 2012 02:13:51 +0000</pubDate>
    </item>
    <item>
      <title>Vasudev Ram: The Python Data model (for v2 and v3)</title>
      <guid>http://jugad2.blogspot.com/...-data-model.html</guid>
      <link>http://jugad2.blogspot.com/...-data-model.html</link>
      <description>...</description>
      <pubDate>Sun, 18 Nov 2012 22:06:47 +0000</pubDate>
    </item>
    <item>
      <title>Python Diary: Been playing around with Object Databases</title>
      <guid>http://www.pythondiary.com/...-object-databases.html</guid>
      <link>http://www.pythondiary.com/...-object-databases.html</link>
      <description>...</description>
      <pubDate>Sun, 18 Nov 2012 20:40:29 +0000</pubDate>
    </item>
    ...
  </channel>
</rss>
The xml.etree.ElementTree.parse() function parses the entire XML document into a document object. From there, you use methods such as find(), iterfind(), and findtext() to search for specific XML elements. The arguments to these functions are the names of a specific tag, such as channel/item or title.
When specifying tags, you need to take the overall document structure into account. Every search operation takes place relative to a starting element, and the tag name given to each operation is also relative to that start. For example, the call doc.iterfind('channel/item') looks for all item elements under a channel element, where doc represents the top of the document (the top-level rss element). The subsequent calls such as item.findtext() then take place relative to the found item elements.
Each element in the ElementTree module has some important attributes and methods, which are very useful in parsing. The tag attribute contains the name of the tag, the text attribute contains the internal text, and the get() method can get the attribute value. For example:
>>> doc
<xml.etree.ElementTree.ElementTree object at 0x101339510>
>>> e = doc.find('channel/title')
>>> e
<Element 'title' at 0x10135b310>
>>> e.tag
'title'
>>> e.text
'Planet Python'
>>> e.get('some_attribute')
>>>
One thing to emphasize is that xml.etree.ElementTree is not the only way to parse XML. For more advanced applications, you should consider lxml. It uses the same programming interface as ElementTree, so the example above works with lxml as well; you simply change the first import statement to from lxml.etree import parse. lxml provides full conformance to XML standards, is extremely fast, and also supports features such as validation, XSLT, and XPath.
6.4 Incremental Parsing of Large XML Files
Problem
You want to use as little memory as possible to extract data from a large XML document.
Solution
Any time you are faced with incremental data processing, you should think of iterators and generators first. Here is a simple function that can incrementally process a large XML file using a very small memory footprint:
from xml.etree.ElementTree import iterparse

def parse_and_remove(filename, path):
    path_parts = path.split('/')
    doc = iterparse(filename, ('start', 'end'))
    # Skip the root element
    next(doc)

    tag_stack = []
    elem_stack = []
    for event, elem in doc:
        if event == 'start':
            tag_stack.append(elem.tag)
            elem_stack.append(elem)
        elif event == 'end':
            if tag_stack == path_parts:
                yield elem
                elem_stack[-2].remove(elem)
            try:
                tag_stack.pop()
                elem_stack.pop()
            except IndexError:
                pass
To test this function, you first need a large XML file to work with. You can often find such files on government and open data websites. For example, you can download Chicago's pothole database in XML format. At the time of writing, the downloaded file contains more than 100,000 rows of data, encoded like this:
<response>
  <row>
    <row ...>
      <creation_date>2012-11-18T00:00:00</creation_date>
      <status>Completed</status>
      <completion_date>2012-11-18T00:00:00</completion_date>
      <service_request_number>12-01906549</service_request_number>
      <type_of_service_request>Pot Hole in Street</type_of_service_request>
      <current_activity>Final Outcome</current_activity>
      <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
      <street_address>4714 S TALMAN AVE</street_address>
      <zip>60632</zip>
      <x_coordinate>1159494.68618856</x_coordinate>
      <y_coordinate>1873313.83503384</y_coordinate>
      <ward>14</ward>
      <police_district>9</police_district>
      <community_area>58</community_area>
      <latitude>41.808090232127896</latitude>
      <longitude>-87.69053684711305</longitude>
      <location latitude="41.808090232127896" longitude="-87.69053684711305" />
    </row>
    <row ...>
      <creation_date>2012-11-18T00:00:00</creation_date>
      <status>Completed</status>
      <completion_date>2012-11-18T00:00:00</completion_date>
      <service_request_number>12-01906695</service_request_number>
      <type_of_service_request>Pot Hole in Street</type_of_service_request>
      <current_activity>Final Outcome</current_activity>
      <most_recent_action>CDOT Street Cut ... Outcome</most_recent_action>
      <street_address>3510 W NORTH AVE</street_address>
      <zip>60647</zip>
      <x_coordinate>1152732.14127696</x_coordinate>
      <y_coordinate>1910409.38979075</y_coordinate>
      <ward>26</ward>
      <police_district>14</police_district>
      <community_area>23</community_area>
      <latitude>41.91002084292946</latitude>
      <longitude>-87.71435952353961</longitude>
      <location latitude="41.91002084292946" longitude="-87.71435952353961" />
    </row>
  </row>
</response>
Suppose you want to write a script that ranks ZIP codes by the number of pothole reports. You could do something like this:
from xml.etree.ElementTree import parse
from collections import Counter

potholes_by_zip = Counter()

doc = parse('potholes.xml')
for pothole in doc.iterfind('row/row'):
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
The only problem with this script is that it will first load the entire XML file into memory and then parse it. On my machine, about 450MB of memory space is needed to run this program. If the following code is used, the program only needs to be modified a little:
from collections import Counter

potholes_by_zip = Counter()

data = parse_and_remove('potholes.xml', 'row/row')
for pothole in data:
    potholes_by_zip[pothole.findtext('zip')] += 1

for zipcode, num in potholes_by_zip.most_common():
    print(zipcode, num)
This version of the code runs with a memory footprint of only about 7 MB, a significant savings in memory use.
Discussion
The technique in this section relies on two core features of the ElementTree module. First, the iterparse() method allows incremental processing of XML documents. To use it, you supply a filename along with an event list containing one or more of the following: start, end, start-ns, and end-ns. The iterator created by iterparse() produces tuples of the form (event, elem), where event is one of the listed events and elem is the corresponding XML element. For example:
>>> data = iterparse('potholes.xml',('start','end'))
>>> next(data)
('start', <Element 'response' at 0x100771d60>)
>>> next(data)
('start', <Element 'row' at 0x100771e68>)
>>> next(data)
('start', <Element 'row' at 0x100771fc8>)
>>> next(data)
('start', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('end', <Element 'creation_date' at 0x100771f18>)
>>> next(data)
('start', <Element 'status' at 0x1006a7f18>)
>>> next(data)
('end', <Element 'status' at 0x1006a7f18>)
>>>
A start event is created when an element is first created but not yet populated with any other data (such as child elements). An end event is created when an element is completed. Although not shown in this example, start-ns and end-ns events are used to handle XML namespace declarations.
In this section's example, the start and end events are used to manage stacks of elements and tags. The stacks represent the hierarchical structure of the document as it is being parsed, and they are also used to determine whether an element matches the path given to the parse_and_remove() function. If there is a match, the element is handed back to the caller with the yield statement.
The statement immediately following the yield is the second core feature of ElementTree, and it is what keeps the program's memory footprint so small:
elem_stack[-2].remove(elem)
This statement causes the element previously generated by yield to be deleted from its parent node. Assuming that there is no other place to reference this element, the element is destroyed and memory is reclaimed.
The end effect of iteratively parsing and removing nodes is an efficient incremental sweep over the document. At no point is a complete document tree ever constructed, yet the XML data can still be processed in the straightforward way shown above.
The main drawback of this scheme is its performance. As a result of my own tests, the version that reads the entire document into memory runs almost twice as fast as the incremental version. But it uses 60 times more memory than the latter. Therefore, if you are more concerned about memory usage, the incremental version wins.
6.5 Converting Dictionaries to XML
Problem
You want to use a Python dictionary to store data and convert it to XML format.
Solution
Although the xml.etree.ElementTree library is commonly used for parsing, it can also be used to create XML documents. For example, consider the following function:
from xml.etree.ElementTree import Element

def dict_to_xml(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    elem = Element(tag)
    for key, val in d.items():
        child = Element(key)
        child.text = str(val)
        elem.append(child)
    return elem
Here is an example:
>>> s = { 'name': 'GOOG', 'shares': 100, 'price':490.1 }
>>> e = dict_to_xml('stock', s)
>>> e
<Element 'stock' at 0x1004b64c8>
>>>
The result of this conversion is an Element instance. For I/O, it is easy to convert it into a byte string using the tostring() function in xml.etree.ElementTree. For example:
>>> from xml.etree.ElementTree import tostring
>>> tostring(e)
b'<stock><price>490.1</price><shares>100</shares><name>GOOG</name></stock>'
>>>
If you want to add an attribute value to an element, you can use the set() method:
>>> e.set('_id','1234')
>>> tostring(e)
b'<stock _id="1234"><price>490.1</price><shares>100</shares><name>GOOG</name></stock>'
>>>
If the order of the elements matters, consider constructing an OrderedDict instead of a normal dictionary (see section 1.7).
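For instance, with the dict_to_xml() function defined above:

>>> from collections import OrderedDict
>>> s = OrderedDict([('name', 'GOOG'), ('shares', 100), ('price', 490.1)])
>>> tostring(dict_to_xml('stock', s))
b'<stock><name>GOOG</name><shares>100</shares><price>490.1</price></stock>'
>>>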
Discussion
When creating XML, you might be inclined to just construct strings instead. For example:
def dict_to_xml_str(tag, d):
    '''
    Turn a simple dict of key/value pairs into XML
    '''
    parts = ['<{}>'.format(tag)]
    for key, val in d.items():
        parts.append('<{0}>{1}</{0}>'.format(key,val))
    parts.append('</{}>'.format(tag))
    return ''.join(parts)
The problem is that if you manually construct it, you may encounter some trouble. For example, what happens when the dictionary value contains some special characters?
>>> d = { 'name' : '<spam>' }

>>> # String creation
>>> dict_to_xml_str('item',d)
'<item><name><spam></name></item>'

>>> # Proper XML creation
>>> e = dict_to_xml('item',d)
>>> tostring(e)
b'<item><name>&lt;spam&gt;</name></item>'
>>>
Notice how in the latter example, the characters '<' and '>' were replaced with &lt; and &gt;.
Just for reference, if you ever need to escape or unescape such characters manually, you can use the escape() and unescape() functions in xml.sax.saxutils. For example:
>>> from xml.sax.saxutils import escape, unescape
>>> escape('<spam>')
'&lt;spam&gt;'
>>> unescape(_)
'<spam>'
>>>
Aside from producing correct output, the other reason to prefer creating Element instances over strings is that strings don't compose together as easily into a larger document. Element instances, on the other hand, can be manipulated in a variety of ways without ever worrying about parsing XML text. In other words, you can do all of your processing on a high-level data structure and only output it as a string at the very end.
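Here is a small sketch of that idea, reusing the dict_to_xml() function from this section (the surrounding portfolio element is purely illustrative):

from xml.etree.ElementTree import Element, tostring

holdings = [ {'name': 'GOOG', 'shares': 100},
             {'name': 'AAPL', 'shares': 50} ]

# Build up a larger document by appending Element instances
portfolio = Element('portfolio')
for h in holdings:
    portfolio.append(dict_to_xml('stock', h))

print(tostring(portfolio))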
6.6 Parsing and Modifying XML
Problem
You want to read an XML document, make some changes to it, and then write the results back to the XML document.
Solution
The xml.etree.ElementTree module makes it easy to handle such tasks. The first step is to parse the document in the usual way. For example, suppose you have a document named pred.xml that looks like this:
<?xml version="1.0"?>
<stop>
  <id>14791</id>
  <nm>Clark &amp; Balmoral</nm>
  <sri>
    <rt>22</rt>
    <d>North Bound</d>
    <dd>North Bound</dd>
  </sri>
  <cr>22</cr>
  <pre>
    <pt>5 MIN</pt>
    <fd>Howard</fd>
    <v>1378</v>
    <rn>22</rn>
  </pre>
  <pre>
    <pt>15 MIN</pt>
    <fd>Howard</fd>
    <v>1867</v>
    <rn>22</rn>
  </pre>
</stop>
The following is an example of using ElementTree to read this document and modify it:
>>> from xml.etree.ElementTree import parse, Element
>>> doc = parse('pred.xml')
>>> root = doc.getroot()
>>> root
<Element 'stop' at 0x100770cb0>

>>> # Remove a few elements
>>> root.remove(root.find('sri'))
>>> root.remove(root.find('cr'))

>>> # Insert a new element after <nm>...</nm>
>>> root.getchildren().index(root.find('nm'))
1
>>> e = Element('spam')
>>> e.text = 'This is a test'
>>> root.insert(2, e)

>>> # Write back to a file
>>> doc.write('newpred.xml', xml_declaration=True)
>>>
The processing result is a new XML file like the following:
<?xml version='1.0' encoding='us-ascii'?>
<stop>
  <id>14791</id>
  <nm>Clark &amp; Balmoral</nm>
  <spam>This is a test</spam>
  <pre>
    <pt>5 MIN</pt>
    <fd>Howard</fd>
    <v>1378</v>
    <rn>22</rn>
  </pre>
  <pre>
    <pt>15 MIN</pt>
    <fd>Howard</fd>
    <v>1867</v>
    <rn>22</rn>
  </pre>
</stop>
Discussion
Modifying the structure of an XML document is straightforward, but you must remember that all modifications are made to the parent element, treating it as if it were a list. For example, if you remove an element, it is removed from its immediate parent by calling the parent's remove() method. If you insert or append new elements, you also use the insert() and append() methods of the parent. Elements can also be manipulated using indexing and slicing operations, such as element[i] or element[i:j].
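For instance, here is a short sketch of those indexing and slicing operations, assuming the root element from the example above:

first = root[0]        # First child element (the <id> element)
children = root[1:3]   # A list containing the second and third children
del root[-1]           # Remove the last child from its parent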
If you need to create a new Element, you can use the Element class shown in the scheme in this section. We have discussed it in detail in section 6.5.
6.7 Parsing XML Documents with Namespaces
Problem
You want to parse an XML document that uses an XML namespace.
Solution
Consider the following document that uses namespaces:
<?xml version="1.0" encoding="utf-8"?>
<top>
  <author>David Beazley</author>
  <content>
    <html xmlns="http://www.w3.org/1999/xhtml">
      <head>
        <title>Hello World</title>
      </head>
      <body>
        <h1>Hello World!</h1>
      </body>
    </html>
  </content>
</top>
If you parse this document and try to perform the usual queries, you'll find that it doesn't work so easily, because everything becomes incredibly verbose:
>>> # Some queries that work
>>> doc.findtext('author')
'David Beazley'
>>> doc.find('content')
<Element 'content' at 0x100776ec0>

>>> # A query involving a namespace (doesn't work)
>>> doc.find('content/html')

>>> # Works if fully qualified
>>> doc.find('content/{http://www.w3.org/1999/xhtml}html')
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>

>>> # Doesn't work
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/head/title')

>>> # Fully qualified
>>> doc.findtext('content/{http://www.w3.org/1999/xhtml}html/'
...     '{http://www.w3.org/1999/xhtml}head/{http://www.w3.org/1999/xhtml}title')
'Hello World'
>>>
You can simplify this process by wrapping the namespace processing logic into a tool class:
class XMLNamespaces:
    def __init__(self, **kwargs):
        self.namespaces = {}
        for name, uri in kwargs.items():
            self.register(name, uri)
    def register(self, name, uri):
        self.namespaces[name] = '{'+uri+'}'
    def __call__(self, path):
        return path.format_map(self.namespaces)
Use this class in the following way:
>>> ns = XMLNamespaces(html='http://www.w3.org/1999/xhtml')
>>> doc.find(ns('content/{html}html'))
<Element '{http://www.w3.org/1999/xhtml}html' at 0x1007767e0>
>>> doc.findtext(ns('content/{html}html/{html}head/{html}title'))
'Hello World'
>>>
Discussion
Parsing XML documents that contain namespaces can be messy. The XMLNamespaces class above is really just meant to simplify things slightly by letting you use shortened namespace names instead of fully qualified URIs.
Unfortunately, there is no way to get namespace information in basic ElementTree parsing. However, if you use the iterparse() function, you can get more information about the scope of namespace processing. For example:
>>> from xml.etree.ElementTree import iterparse
>>> for evt, elem in iterparse('ns2.xml', ('end', 'start-ns', 'end-ns')):
...     print(evt, elem)
...
end <Element 'author' at 0x10110de10>
start-ns ('', 'http://www.w3.org/1999/xhtml')
end <Element '{http://www.w3.org/1999/xhtml}title' at 0x1011131b0>
end <Element '{http://www.w3.org/1999/xhtml}head' at 0x1011130a8>
end <Element '{http://www.w3.org/1999/xhtml}h1' at 0x101113310>
end <Element '{http://www.w3.org/1999/xhtml}body' at 0x101113260>
end <Element '{http://www.w3.org/1999/xhtml}html' at 0x10110df70>
end-ns None
end <Element 'content' at 0x10110de68>
end <Element 'top' at 0x10110dd60>
>>> elem   # This is the topmost element
<Element 'top' at 0x10110dd60>
>>>
Finally, if the text you are processing makes use of namespaces in addition to other advanced XML features, you're really better off using the lxml library instead of ElementTree. For instance, lxml provides better support for validating documents against a DTD, more complete XPath support, and other advanced XML features. This section really just shows a simple fix to make parsing a little easier.
6.8 Interacting with a Relational Database
Problem
You want to query, add or delete records in a relational database.
Solution
The standard way to represent rows of data in Python is as a sequence of tuples. For example:
stocks = [
    ('GOOG', 100, 490.1),
    ('AAPL', 50, 545.75),
    ('FB', 150, 7.45),
    ('HPQ', 75, 33.2),
]
Given data in this form, it is relatively straightforward to interact with a relational database using Python's standard database API, described in PEP 249. All operations on the database are carried out through SQL queries, and each row of input or output data is represented by a tuple.
For demonstration, you can use the sqlite3 module in the Python standard library. If you use a different database (such as MySql, Postgresql or ODBC), you have to install the corresponding third-party module to provide support. However, the corresponding programming interfaces are almost the same, except for a little nuance.
The first step is to connect to the database. Usually you need to execute the connect() function and provide it with some database name, host, user name, password and other necessary parameters. For example:
>>> import sqlite3
>>> db = sqlite3.connect('database.db')
>>>
In order to process the data, you need to create a cursor next. Once you have a cursor, you can execute SQL queries. For example:
>>> c = db.cursor()
>>> c.execute('create table portfolio (symbol text, shares integer, price real)')
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>
To insert multiple records into the database table, use a statement like the following:
>>> c.executemany('insert into portfolio values (?,?,?)', stocks)
<sqlite3.Cursor object at 0x10067a730>
>>> db.commit()
>>>
To execute a query, use a statement like the following:
>>> for row in db.execute('select * from portfolio'):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
('FB', 150, 7.45)
('HPQ', 75, 33.2)
>>>
If you want to perform queries that accept parameters supplied by the user, make sure you quote the parameters using ? placeholders, like this:
>>> min_price = 100
>>> for row in db.execute('select * from portfolio where price >= ?',
...                       (min_price,)):
...     print(row)
...
('GOOG', 100, 490.1)
('AAPL', 50, 545.75)
>>>
Discussion
At a low level, interacting with a database is very simple. You simply form SQL statements and feed them to the underlying module to update or retrieve data. That said, there are still some tricky details you'll need to sort out on a case-by-case basis.
One difficulty is the direct mapping of data in the database to Python types. For date types, you can usually use the datetime instance in the datetime module, or possibly the system timestamp in the time module. For numeric types, especially financial data using decimals, it can be represented by the decimal instance in the decimal module. Unfortunately, the specific mapping rules are different for different databases. You must refer to the corresponding documents.
Another, more critical issue is the construction of SQL statement strings. You should never use Python string formatting operators (such as %) or the .format() method to create such strings. If the values supplied to those operators come from user input, your program is likely to be vulnerable to a SQL injection attack (see http://xkcd.com/327). The ? wildcard in query statements instructs the database backend to use its own string substitution mechanism, which is safer.
Unfortunately, the wildcard character varies across database backends. Many modules use ? or %s, while others use different symbols, such as :0 or :1, to refer to parameters. Again, you'll have to consult the documentation for the database module you're using. The paramstyle attribute of a database module contains information about the parameter quoting style.
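The following sketch contrasts the unsafe and safe styles and shows how to check the placeholder style of a module (it reuses the sqlite3 connection from the solution; the hostile input string is made up for illustration):

import sqlite3

symbol = "GOOG'; drop table portfolio; --"    # hostile user input

# Unsafe: the value would be pasted directly into the SQL text
# db.execute("select * from portfolio where symbol = '%s'" % symbol)

# Safe: the driver performs the substitution through the placeholder
for row in db.execute('select * from portfolio where symbol = ?', (symbol,)):
    print(row)

print(sqlite3.paramstyle)    # 'qmark' -- sqlite3 uses ? placeholders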
For simple database reads and writes, using the database API is usually simple enough. If you're doing something more complicated, it may make sense to use a higher-level interface, such as that provided by an object-relational mapper (ORM). Libraries such as SQLAlchemy allow database tables to be described as Python classes and let you perform various database operations while hiding the underlying SQL.
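As a rough sketch of what that looks like (assuming SQLAlchemy 1.4 or later is installed; the Holding class, its column names, and the database filename are illustrative only, not part of this section):

from sqlalchemy import create_engine, Column, Integer, String, Float
from sqlalchemy.orm import declarative_base, Session

Base = declarative_base()

class Holding(Base):
    __tablename__ = 'portfolio'
    id = Column(Integer, primary_key=True)
    symbol = Column(String)
    shares = Column(Integer)
    price = Column(Float)

# A separate, hypothetical database file so it doesn't clash with the example above
engine = create_engine('sqlite:///orm_example.db')
Base.metadata.create_all(engine)

with Session(engine) as session:
    session.add(Holding(symbol='GOOG', shares=100, price=490.1))
    session.commit()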
6.9 Encoding and Decoding Hexadecimal Digits
Problem
You want to decode a hexadecimal string into a byte string or encode a byte string into a hexadecimal string.
Solution
If you simply need to decode or encode a raw string of hex digits, use the binascii module. For example:
>>> # Initial byte string
>>> s = b'hello'

>>> # Encode as hex
>>> import binascii
>>> h = binascii.b2a_hex(s)
>>> h
b'68656c6c6f'

>>> # Decode back to bytes
>>> binascii.a2b_hex(h)
b'hello'
>>>
Similar functions can also be found in the base64 module. For example:
>>> import base64
>>> h = base64.b16encode(s)
>>> h
b'68656C6C6F'
>>> base64.b16decode(h)
b'hello'
>>>
Discussion
For the most part, converting to and from hex is straightforward using the functions shown. The main difference between the two techniques is in case handling. The base64.b16decode() and base64.b16encode() functions only work with uppercase hexadecimal letters, whereas the functions in binascii work with either case.
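For example (the exact error message may vary slightly between Python versions):

>>> base64.b16decode(b'68656c6c6f')          # lowercase digits are rejected
Traceback (most recent call last):
    ...
binascii.Error: Non-base16 digit found
>>> base64.b16decode(b'68656c6c6f', casefold=True)
b'hello'
>>> binascii.a2b_hex(b'68656C6C6F')          # binascii accepts either case
b'hello'
>>>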
Another thing to note is that the output produced by the encoding functions is always a byte string. If you want to force Unicode output, you need to add an extra decoding step. For example:
>>> h = base64.b16encode(s)
>>> print(h)
b'68656C6C6F'
>>> print(h.decode('ascii'))
68656C6C6F
>>>
When decoding hexadecimal numbers, the functions b16decode() and a2b_hex() can accept bytes or unicode strings. However, unicode strings must contain only ASCII encoded hexadecimal numbers.
6.10 Encoding and Decoding Base64 Data
Problem
You need to decode or encode binary data in Base64 format.
Solution
The base64 module has two functions, b64encode() and b64decode(), that do exactly what you want. For example:
>>> # Some byte data
>>> s = b'hello'
>>> import base64

>>> # Encode as Base64
>>> a = base64.b64encode(s)
>>> a
b'aGVsbG8='

>>> # Decode from Base64
>>> base64.b64decode(a)
b'hello'
>>>
Discussion
Base64 encoding is only meant to be used on byte-oriented data, such as byte strings and byte arrays. Moreover, the output of the encoding process is always a byte string. If you are mixing Base64-encoded data with Unicode text, you have to add an extra decoding step. For example:
>>> a = base64.b64encode(s).decode('ascii')
>>> a
'aGVsbG8='
>>>
When decoding Base64, byte string and Unicode text can be used as parameters. However, Unicode strings can only contain ASCII characters.
6.11 Reading and Writing Binary Arrays of Data
Problem
You want to read and write a binary array of uniformly structured records into Python tuples.
Solution
You can use the struct module to process binary data. The following is a sample code that writes a list of Python tuples to a binary file and encodes each tuple into a structure using struct.
from struct import Struct

def write_records(records, format, f):
    '''
    Write a sequence of tuples to a binary file of structures.
    '''
    record_struct = Struct(format)
    for r in records:
        f.write(record_struct.pack(*r))

# Example
if __name__ == '__main__':
    records = [ (1, 2.3, 4.5),
                (6, 7.8, 9.0),
                (12, 13.4, 56.7) ]

    with open('data.b', 'wb') as f:
        write_records(records, '<idd', f)
There are many ways to read this file and return a list of tuples. First, if you plan to read the file incrementally in blocks, you can do this:
from struct import Struct

def read_records(format, f):
    record_struct = Struct(format)
    chunks = iter(lambda: f.read(record_struct.size), b'')
    return (record_struct.unpack(chunk) for chunk in chunks)

# Example
if __name__ == '__main__':
    with open('data.b','rb') as f:
        for rec in read_records('<idd', f):
            # Process rec
            ...
If you want to read the file entirely into a byte string with a single read and convert it piece by piece, you can do this instead:
from struct import Struct

def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack_from(data, offset)
            for offset in range(0, len(data), record_struct.size))

# Example
if __name__ == '__main__':
    with open('data.b', 'rb') as f:
        data = f.read()

    for rec in unpack_records('<idd', data):
        # Process rec
        ...
In both cases, the result is an iterable that produces the tuples originally used to create the file.
Discussion
For programs that must encode and decode binary data, it is common to use the struct module. To declare a new structure, simply create a Struct instance like this:
# Little endian 32-bit integer, two double precision floats
record_struct = Struct('<idd')
Structures are defined using structure codes such as i, d, f, and so on (see the Python documentation). These codes correspond to specific binary data types, such as 32-bit integers, 64-bit floats, 32-bit floats, and so forth. The < as the first character specifies the byte order; in this example it means "little endian." Change the character to > for big endian or ! for network byte order.
The resulting Struct instance has many properties and methods to manipulate the corresponding type of structure. The size attribute contains the number of bytes of the structure, which is useful in I/O operations. The pack() and unpack() methods are used to package and unpack data. For example:
>>> from struct import Struct
>>> record_struct = Struct('<idd')
>>> record_struct.size
20
>>> record_struct.pack(1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> record_struct.unpack(_)
(1, 2.0, 3.0)
>>>
Sometimes you can see that pack() and unpack() operations are called as module level functions, like the following:
>>> import struct
>>> struct.pack('<idd', 1, 2.0, 3.0)
b'\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00@\x00\x00\x00\x00\x00\x00\x08@'
>>> struct.unpack('<idd', _)
(1, 2.0, 3.0)
>>>
This works, but it doesn't feel as elegant as the instance method, especially when the same structure in your code appears in multiple places. By creating a Struct instance, the format code will be specified only once, and all operations will be processed centrally. This makes code maintenance easier (because you only need to change one part of the code).
The code for reading binary structures involves a number of interesting, elegant programming idioms. In the read_records() function, iter() is used to create an iterator that returns fixed-size chunks (see section 5.8). This iterator repeatedly calls a user-supplied callable (such as lambda: f.read(record_struct.size)) until it returns a specified sentinel value (such as b''), at which point iteration stops. For example:
>>> f = open('data.b', 'rb')
>>> chunks = iter(lambda: f.read(20), b'')
>>> chunks
<callable_iterator object at 0x10069e6d0>
>>> for chk in chunks:
...     print(chk)
...
b'\x01\x00\x00\x00ffffff\x02@\x00\x00\x00\x00\x00\x00\x12@'
b'\x06\x00\x00\x00333333\x1f@\x00\x00\x00\x00\x00\x00"@'
b'\x0c\x00\x00\x00\xcd\xcc\xcc\xcc\xcc\xcc*@\x9a\x99\x99\x99\x99YL@'
>>>
One reason for creating such an iterable is that it allows the records to be created with a generator comprehension. If you didn't use this technique, the code might look like this:
def read_records(format, f):
    record_struct = Struct(format)
    while True:
        chk = f.read(record_struct.size)
        if chk == b'':
            break
        yield record_struct.unpack(chk)
In the unpack_records() function, a different approach using the unpack_from() method is used. unpack_from() is very useful for extracting binary data from a larger binary array, because it does so without creating any temporary objects or making memory copies. You just give it a byte string (or array) along with a byte offset, and it unpacks fields directly from that location.
If you use unpack() instead of unpack_from(), you need to modify the code to construct a large number of small slices and calculate the offset. For example:
def unpack_records(format, data):
    record_struct = Struct(format)
    return (record_struct.unpack(data[offset:offset + record_struct.size])
            for offset in range(0, len(data), record_struct.size))
In addition to the complexity of the code, this scheme has to do a lot of extra work because it performs a lot of offset calculation, copies data and constructs small sliced objects. If you are going to unpack a large number of structures from a large byte string read, unpack_from() will do better.
When unpacking, the named tuple object in the collections module may be what you want to use. It allows you to set the attribute name for the returned tuple. For example:
from collections import namedtuple

Record = namedtuple('Record', ['kind','x','y'])

with open('data.b', 'rb') as f:
    records = (Record(*r) for r in read_records('<idd', f))
    for r in records:
        print(r.kind, r.x, r.y)
If your program needs to process a large amount of binary data, you are better off using a library such as numpy. For example, instead of reading binary data into a list of tuples, you can read it into a structured array, like this:
>>> import numpy as np
>>> f = open('data.b', 'rb')
>>> records = np.fromfile(f, dtype='<i,<d,<d')
>>> records
array([(1, 2.3, 4.5), (6, 7.8, 9.0), (12, 13.4, 56.7)],
      dtype=[('f0', '<i4'), ('f1', '<f8'), ('f2', '<f8')])
>>> records[0]
(1, 2.3, 4.5)
>>> records[1]
(6, 7.8, 9.0)
>>>
Finally, if you're faced with the task of reading binary data in some known file format (e.g., image formats, shape files, HDF5, etc.), check whether a Python module already exists for it. There's no reason to reinvent the wheel if you don't have to.
6.12 Reading Nested and Variable-Length Binary Data
Problem
You need to read complex binary format data containing nested or variable length record sets. These data may include pictures, videos, electronic map files, etc.
Solution
The struct module can be used to encode / decode almost all types of binary data structures. To explain this data clearly, suppose you use the following Python data structure to represent a collection of points that make up a series of polygons:
polys = [
    [ (1.0, 2.5), (3.5, 4.0), (2.5, 1.5) ],
    [ (7.0, 1.2), (5.1, 3.0), (0.5, 7.5), (0.8, 9.0) ],
    [ (3.4, 6.3), (1.2, 0.5), (4.6, 9.2) ],
]
Now suppose that the data is encoded into a binary file starting with the following header:
+------+--------+------------------------------------+
| Byte | Type   | Description                        |
+======+========+====================================+
| 0    | int    | File code (0x1234, little endian)  |
+------+--------+------------------------------------+
| 4    | double | Minimum x (little endian)          |
+------+--------+------------------------------------+
| 12   | double | Minimum y (little endian)          |
+------+--------+------------------------------------+
| 20   | double | Maximum x (little endian)          |
+------+--------+------------------------------------+
| 28   | double | Maximum y (little endian)          |
+------+--------+------------------------------------+
| 36   | int    | Number of polygons (little endian) |
+------+--------+------------------------------------+
Following the header is a series of polygon records, each encoded as follows:
+------+--------+-------------------------------------------+
| Byte | Type   | Description                               |
+======+========+===========================================+
| 0    | int    | Record length (N bytes)                   |
+------+--------+-------------------------------------------+
| 4-N  | Points | (X, Y) coordinates as doubles             |
+------+--------+-------------------------------------------+
To write such a file, you can use the following Python code:
import struct
import itertools

def write_polys(filename, polys):
    # Determine bounding box
    flattened = list(itertools.chain(*polys))
    min_x = min(x for x, y in flattened)
    max_x = max(x for x, y in flattened)
    min_y = min(y for x, y in flattened)
    max_y = max(y for x, y in flattened)

    with open(filename, 'wb') as f:
        f.write(struct.pack('<iddddi',
                            0x1234,
                            min_x, min_y,
                            max_x, max_y,
                            len(polys)))
        for poly in polys:
            size = len(poly) * struct.calcsize('<dd')
            f.write(struct.pack('<i', size + 4))
            for pt in poly:
                f.write(struct.pack('<dd', *pt))
To read the data back, you can write very similar-looking code using the struct.unpack() function, essentially reversing the operations performed during writing. For example:
def read_polys(filename):
    with open(filename, 'rb') as f:
        # Read the header
        header = f.read(40)
        file_code, min_x, min_y, max_x, max_y, num_polys = \
            struct.unpack('<iddddi', header)

        polys = []
        for n in range(num_polys):
            pbytes, = struct.unpack('<i', f.read(4))
            poly = []
            for m in range(pbytes // 16):
                pt = struct.unpack('<dd', f.read(16))
                poly.append(pt)
            polys.append(poly)
    return polys
Although this code works, it's a rather messy mix of small reads, struct unpacking, and other details. If code like this is used to process a real data file, it quickly becomes even messier. Thus, there is an obvious case for an alternative solution that simplifies some of the steps and lets the programmer focus on more important matters.
In the remainder of this section, a more sophisticated approach to interpreting binary data will be built up in stages. The goal is to give programmers a high-level way to specify the file format and to hide the details of reading and unpacking the data. As a warning, the code that follows may be the most advanced example in this entire book, using a lot of object-oriented programming and metaprogramming techniques. Be sure to read the discussion carefully and consult the cross-referenced sections.
First, when reading binary data, it is common for the file to contain headers and other data structures at the beginning. Although the struct module can unpack this data into a tuple, another way to represent such information is through a class. Here is one way to do it:
import struct

class StructField:
    '''
    Descriptor representing a simple structure field
    '''
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset
    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            r = struct.unpack_from(self.format, instance._buffer, self.offset)
            return r[0] if len(r) == 1 else r

class Structure:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)
This code uses a descriptor to represent each structure field. Each descriptor contains a struct-compatible format code along with a byte offset into an underlying memory buffer. In the __get__() method, the struct.unpack_from() function is used to unpack a value from the buffer without having to make extra slices or copies.
The Structure class just serves as a base class that accepts some byte data and stores it as the underlying memory buffer used by the StructField descriptors. memoryview() is used here for a reason that will be explained in detail later.
Using this code, you can now define a structure as a high-level class that mirrors the table describing the expected file format. For example:
class PolyHeader(Structure):
    file_code = StructField('<i', 0)
    min_x = StructField('<d', 4)
    min_y = StructField('<d', 12)
    max_x = StructField('<d', 20)
    max_y = StructField('<d', 28)
    num_polys = StructField('<i', 36)
The following example uses this class to read the header data of the polygon data we wrote earlier:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader(f.read(40))
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>
This is interesting, but there are a number of annoyances with this approach. For one, although you get the convenience of a class interface, the code is rather verbose and requires the user to specify a lot of low-level detail (e.g., repeated uses of StructField, specification of offsets, etc.). The resulting class is also missing common conveniences, such as a way to compute the total size of the structure.
Any time you are faced with verbose class definitions like this, you should consider using a class decorator or a metaclass. One feature of a metaclass is that it can be used to fill in a lot of low-level implementation details, taking that burden off of the user. As an example, consider this metaclass and a slight reformulation of the Structure class:
class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if format.startswith(('<','>','!','@')):
                byte_order = format[0]
                format = format[1:]
            format = byte_order + format
            setattr(self, fieldname, StructField(format, offset))
            offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)

class Structure(metaclass=StructureMeta):
    def __init__(self, bytedata):
        self._buffer = bytedata

    @classmethod
    def from_file(cls, f):
        return cls(f.read(cls.struct_size))
Using the new Structure class, you can define a Structure as follows:
class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        ('d', 'min_x'),
        ('d', 'min_y'),
        ('d', 'max_x'),
        ('d', 'max_y'),
        ('i', 'num_polys')
    ]
As you can see, it's much less verbose to write. The added from_file() class method also makes it easy to read the data from a file without knowing anything about the size or structure of the data. For example:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min_x
0.5
>>> phead.min_y
0.5
>>> phead.max_x
7.0
>>> phead.max_y
9.2
>>> phead.num_polys
3
>>>
Once you introduce a metaclass, you can build more intelligence into it. For example, suppose you also want to support nested binary structures. Here is a reformulation of the metaclass, along with a new supporting descriptor, that allows it:
class NestedStruct:
    '''
    Descriptor representing a nested structure
    '''
    def __init__(self, name, struct_type, offset):
        self.name = name
        self.struct_type = struct_type
        self.offset = offset

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            data = instance._buffer[self.offset:
                                    self.offset+self.struct_type.struct_size]
            result = self.struct_type(data)
            # Save resulting structure back on instance to avoid
            # further recomputation of this step
            setattr(instance, self.name, result)
            return result

class StructureMeta(type):
    '''
    Metaclass that automatically creates StructField descriptors
    '''
    def __init__(self, clsname, bases, clsdict):
        fields = getattr(self, '_fields_', [])
        byte_order = ''
        offset = 0
        for format, fieldname in fields:
            if isinstance(format, StructureMeta):
                setattr(self, fieldname,
                        NestedStruct(fieldname, format, offset))
                offset += format.struct_size
            else:
                if format.startswith(('<','>','!','@')):
                    byte_order = format[0]
                    format = format[1:]
                format = byte_order + format
                setattr(self, fieldname, StructField(format, offset))
                offset += struct.calcsize(format)
        setattr(self, 'struct_size', offset)
In this code, the NestedStruct descriptor is used to overlay another structure definition over a region of memory. It does this by taking a slice of the original memory buffer and using it to instantiate the given structure type. Since the underlying memory buffer was initialized as a memory view, this slicing does not incur any extra memory copies; instead, it is just an overlay on the original memory. Moreover, to avoid repeated instantiations, the descriptor stores the resulting inner structure object on the instance, using the same technique described in section 8.10.
With this new revision, you can write as follows:
class Point(Structure):
    _fields_ = [
        ('<d', 'x'),
        ('d', 'y')
    ]

class PolyHeader(Structure):
    _fields_ = [
        ('<i', 'file_code'),
        (Point, 'min'),   # nested struct
        (Point, 'max'),   # nested struct
        ('i', 'num_polys')
    ]
Amazingly, it all works as expected. In practice:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.file_code == 0x1234
True
>>> phead.min             # Nested structure
<__main__.Point object at 0x1006a48d0>
>>> phead.min.x
0.5
>>> phead.min.y
0.5
>>> phead.max.x
7.0
>>> phead.max.y
9.2
>>> phead.num_polys
3
>>>
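Incidentally, the caching trick mentioned above is easy to observe. Because NestedStruct only defines __get__(), the first access to phead.min stores the resulting Point object in the instance dictionary, and every later lookup finds that cached object instead of re-running the descriptor. A minimal check, assuming the interactive session above is still open:

>>> phead.min is phead.min      # Second lookup returns the cached Point
True
>>> 'min' in vars(phead)        # The cached object lives in the instance dict
True
>>>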
At this point, we have a framework for dealing with fixed-size records. But what about components that vary in size? For example, the bulk of a polygon file consists of variable-length sections.
One approach is to write a class that simply represents a chunk of raw bytes, together with a utility method for interpreting its contents in different ways. It is closely related to the code in section 6.11:
class SizedRecord:
    def __init__(self, bytedata):
        self._buffer = memoryview(bytedata)

    @classmethod
    def from_file(cls, f, size_fmt, includes_size=True):
        # Read the size prefix first, then the payload it describes
        sz_nbytes = struct.calcsize(size_fmt)
        sz_bytes = f.read(sz_nbytes)
        sz, = struct.unpack(size_fmt, sz_bytes)
        buf = f.read(sz - includes_size * sz_nbytes)
        return cls(buf)

    def iter_as(self, code):
        if isinstance(code, str):
            # Interpret the buffer as a sequence of struct-format records
            s = struct.Struct(code)
            for off in range(0, len(self._buffer), s.size):
                yield s.unpack_from(self._buffer, off)
        elif isinstance(code, StructureMeta):
            # Interpret the buffer as a sequence of Structure instances
            size = code.struct_size
            for off in range(0, len(self._buffer), size):
                data = self._buffer[off:off+size]
                yield code(data)
The class method SizedRecord.from_file() is a utility for reading a size-prefixed chunk of data from a file, a layout common to many file formats. As input, it accepts a structure format code describing the encoding of the size, which is expected to be in bytes. The optional includes_size argument specifies whether that byte count includes the size prefix itself or not. Here is an example of using this to read the individual polygons from the polygon file:
>>> f = open('polys.bin', 'rb')
>>> phead = PolyHeader.from_file(f)
>>> phead.num_polys
3
>>> polydata = [ SizedRecord.from_file(f, '<i')
...              for n in range(phead.num_polys) ]
>>> polydata
[<__main__.SizedRecord object at 0x1006a4d50>,
 <__main__.SizedRecord object at 0x1006a4f50>,
 <__main__.SizedRecord object at 0x10070da90>]
>>>
As shown, the contents of the SizedRecord instances have not yet been interpreted. To do that, use the iter_as() method, which accepts either a structure format code or a Structure class as input. This gives you a lot of flexibility in how the data is parsed. For example:
>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as('<dd'):
...         print(p)
...
Polygon 0
(1.0, 2.5)
(3.5, 4.0)
(2.5, 1.5)
Polygon 1
(7.0, 1.2)
(5.1, 3.0)
(0.5, 7.5)
(0.8, 9.0)
Polygon 2
(3.4, 6.3)
(1.2, 0.5)
(4.6, 9.2)
>>>
>>> for n, poly in enumerate(polydata):
...     print('Polygon', n)
...     for p in poly.iter_as(Point):
...         print(p.x, p.y)
...
Polygon 0
1.0 2.5
3.5 4.0
2.5 1.5
Polygon 1
7.0 1.2
5.1 3.0
0.5 7.5
0.8 9.0
Polygon 2
3.4 6.3
1.2 0.5
4.6 9.2
>>>
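If you want to see the size-prefix convention in isolation, without the polys.bin file, the following small sketch builds a record in memory and reads it back. The payload values and the use of io.BytesIO are made up purely for illustration:

import io
import struct

# Hypothetical payload: three little-endian doubles (24 bytes)
payload = struct.pack('<ddd', 1.0, 2.5, 3.5)

# A '<i' size prefix that counts itself (4 bytes) plus the payload
record = struct.pack('<i', 4 + len(payload)) + payload

f = io.BytesIO(record)
rec = SizedRecord.from_file(f, '<i')      # includes_size=True by default
print(list(rec.iter_as('<d')))            # [(1.0,), (2.5,), (3.5,)]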
Putting all of this together, here is an alternative formulation of the read_polys() function:
class Point(Structure):
    _fields_ = [ ('<d', 'x'),
                 ('d', 'y') ]

class PolyHeader(Structure):
    _fields_ = [ ('<i', 'file_code'),
                 (Point, 'min'),
                 (Point, 'max'),
                 ('i', 'num_polys') ]

def read_polys(filename):
    polys = []
    with open(filename, 'rb') as f:
        phead = PolyHeader.from_file(f)
        for n in range(phead.num_polys):
            rec = SizedRecord.from_file(f, '<i')
            poly = [ (p.x, p.y) for p in rec.iter_as(Point) ]
            polys.append(poly)
    return polys
discuss
This section pulls together a number of advanced programming techniques, including descriptors, lazy evaluation, metaclasses, class variables, and memory views. However, they all serve one very specific purpose.
A major feature of this implementation is that it is based on the idea of lazy unpacking. When a Structure instance is created, __init__() merely makes a memory view of the supplied byte data and does nothing else. In particular, no unpacking or other structure-related operations take place at that point. One motivation for this approach is that you might only be interested in a few parts of a binary record; only the parts actually accessed need to be unpacked, not the whole file.
Lazy unpacking and packing are implemented with the StructField descriptor class. Each attribute the user lists in _fields_ is turned into a StructField descriptor that stores the associated structure format code and the offset of the value within the underlying buffer. The StructureMeta metaclass creates these descriptors automatically whenever a structure class is defined. The main reason for using a metaclass here is that it lets the user specify the structure format with a high-level description, without having to worry about the low-level details.
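For reference, the StructField descriptor that the metaclass instantiates was defined earlier in this recipe. A minimal sketch of it is reproduced here for convenience; details may differ slightly from the earlier listing:

import struct

class StructField:
    '''
    Descriptor representing a simple structure field
    '''
    def __init__(self, format, offset):
        self.format = format
        self.offset = offset

    def __get__(self, instance, cls):
        if instance is None:
            return self
        else:
            # Unpack the value lazily, straight out of the shared buffer
            r = struct.unpack_from(self.format, instance._buffer, self.offset)
            return r[0] if len(r) == 1 else r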
One subtlety of StructureMeta is that it makes the byte order sticky. That is, if any attribute specifies a byte order (< for little endian or > for big endian), that ordering is carried over to all subsequent fields. This helps avoid extra typing, while still making it possible to switch the ordering partway through a definition. For example, you might have a more complicated structure such as the following:
class ShapeFile(Structure):
    _fields_ = [ ('>i', 'file_code'),      # Big endian
                 ('20s', 'unused'),
                 ('i', 'file_length'),
                 ('<i', 'version'),        # Little endian
                 ('i', 'shape_type'),
                 ('d', 'min_x'),
                 ('d', 'min_y'),
                 ('d', 'max_x'),
                 ('d', 'max_y'),
                 ('d', 'min_z'),
                 ('d', 'max_z'),
                 ('d', 'min_m'),
                 ('d', 'max_m') ]
As noted earlier, the use of memoryview() saves us from making copies of memory. When structures are nested, memory views can overlay different parts of the structure definition onto the same region of memory. This aspect is subtle, and it comes down to the different slicing behavior of a memory view versus an ordinary byte string or byte array. If you slice a byte string or byte array, you usually get a copy of the data. A memory view slice does not copy; it simply overlays the existing memory, which makes this approach much more efficient.
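This difference is easy to verify directly. The following quick check, independent of the structure classes above, shows that a bytes slice is a copy while a memoryview slice still refers to the original buffer:

>>> buf = bytearray(b'Hello World')
>>> b = bytes(buf)[0:5]          # Slicing bytes makes an independent copy
>>> m = memoryview(buf)[0:5]     # Slicing a memoryview overlays the same memory
>>> buf[0:5] = b'HELLO'
>>> b
b'Hello'
>>> m.tobytes()
b'HELLO'
>>>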
Several related sections can help extend the scheme shown here. See section 8.13 for building a type system with descriptors. Section 8.10 has more discussion of lazily computed attribute values and is relevant to the implementation of the NestedStruct descriptor. Section 9.19 has an example of using a metaclass to initialize class members in much the same way as the StructureMeta class does. The source code for Python's ctypes module may also be of interest, as it provides comparable support for defining data structures, nested structures, and similar functionality.
6.13 summarizing data and performing statistics
problem
You need to process a large dataset and compute sums, summaries, or other statistics from it.
Solution
For any data analysis problem involving statistics, time series, or related techniques, you should look at the Pandas library.
To give you a taste of what it can do, here is an example of using Pandas to analyze the City of Chicago rat and rodent complaint database. At the time of this writing, the database is a CSV file with roughly 74,000 rows of data.
>>> import pandas

>>> # Read a CSV file, skipping last line
>>> rats = pandas.read_csv('rats.csv', skip_footer=1)
>>> rats
<class 'pandas.core.frame.DataFrame'>
Int64Index: 74055 entries, 0 to 74054
Data columns:
Creation Date                      74055 non-null values
Status                             74055 non-null values
Completion Date                    72154 non-null values
Service Request Number             74055 non-null values
Type of Service Request            74055 non-null values
Number of Premises Baited          65804 non-null values
Number of Premises with Garbage    65600 non-null values
Number of Premises with Rats       65752 non-null values
Current Activity                   66041 non-null values
Most Recent Action                 66023 non-null values
Street Address                     74055 non-null values
ZIP Code                           73584 non-null values
X Coordinate                       74043 non-null values
Y Coordinate                       74043 non-null values
Ward                               74044 non-null values
Police District                    74044 non-null values
Community Area                     74044 non-null values
Latitude                           74043 non-null values
Longitude                          74043 non-null values
Location                           74043 non-null values
dtypes: float64(11), object(9)

>>> # Investigate range of values for a certain field
>>> rats['Current Activity'].unique()
array([nan, Dispatch Crew, Request Sanitation Inspector], dtype=object)

>>> # Filter the data
>>> crew_dispatched = rats[rats['Current Activity'] == 'Dispatch Crew']
>>> len(crew_dispatched)
65676
>>>
>>> # Find 10 most rat-infested ZIP codes in Chicago
>>> crew_dispatched['ZIP Code'].value_counts()[:10]
60647    3837
60618    3530
60614    3284
60629    3251
60636    2801
60657    2465
60641    2238
60609    2206
60651    2152
60632    2071
>>>
>>> # Group by completion date
>>> dates = crew_dispatched.groupby('Completion Date')
<pandas.core.groupby.DataFrameGroupBy object at 0x10d0a2a10>
>>> len(dates)
472
>>>
>>> # Determine counts on each day
>>> date_counts = dates.size()
>>> date_counts[0:10]
Completion Date
01/03/2011      4
01/03/2012    125
01/04/2011     54
01/04/2012     38
01/05/2011     78
01/05/2012    100
01/06/2011    100
01/06/2012     58
01/07/2011      1
01/09/2012     12
>>>
>>> # Sort the counts
>>> date_counts.sort()
>>> date_counts[-10:]
Completion Date
10/12/2012    313
10/21/2011    314
09/20/2011    316
10/26/2011    319
02/22/2011    325
10/26/2012    333
03/17/2011    336
10/13/2011    378
10/14/2011    391
10/07/2011    457
>>>
Well, it seems that October 7, 2011, was an exceptionally busy day for rats.
discuss
Pandas is a very large library with far more features than can be covered here. However, whenever you need to analyze large datasets, group data, compute statistics, or perform other similar tasks, it is definitely worth a look.
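As a final, self-contained illustration, the grouping and counting idioms used above work on any DataFrame. Here is a tiny sketch with made-up data; the column names and values are invented for the example:

import pandas as pd

# Hypothetical complaint records
df = pd.DataFrame({
    'zip':  ['60647', '60618', '60647', '60614', '60647'],
    'date': ['10/07/2011', '10/07/2011', '10/13/2011', '10/07/2011', '10/13/2011'],
})

print(df['zip'].value_counts())        # Most frequent ZIP codes first
print(df.groupby('date').size())       # Number of records per day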