1. Preface
Hello, everyone, I'm Xiao Zhang~
In day-to-day Python work, reading and writing data is unavoidable. For example, when writing a crawler, you need to store the scraped data locally before processing it further; when doing visual data analysis, you need to read data from disk into memory before moving on to cleaning, visualization, and modeling.
When reading and writing data, we have many choices of storage target. The most convenient are local document types such as csv and txt: they are simple and easy to use, but the drawbacks are just as obvious: document data is easily lost, corrupted, or deleted beyond recovery, and sharing it among several people is inconvenient.
The other large class of storage targets is databases; MongoDB and MySQL are currently among the most popular persistent databases. Compared with document storage, a database's biggest advantages are that data is hard to lose and easy to retrieve: even facing a huge volume of data, a few SQL statements are enough to implement CRUD (create, read, update, delete).
This Python tutorial summarizes Python's data reading and writing methods for four types of storage target: txt, csv, MongoDB, and MySQL.
2. Python document reading and writing
For reading and writing local document data, this article covers two common document types: txt and csv. The test data comes from second-hand housing listings on Lianjia; for details of how the data was crawled, see the earlier article "Python collected 3,000 second-hand housing listings in Beijing: what did I find?"
2.1 txt reading and writing
Of the storage targets introduced in this article, txt is the simplest to read and write, so I won't go into much detail here. If you are not familiar with it, refer to the following test code:
```python
# txt data source; remember to add encoding='utf-8' to prevent garbled characters
dataSource = open('data.txt', encoding='utf-8', mode='r')

# txt write target
savePath = 'save.txt'
f = open(savePath, encoding='utf-8', mode='w+')

# read line by line, turn commas into tabs, and write out
for lineData in dataSource.readlines():
    dataList = lineData.replace('\n', '').split(',')
    print(dataList)
    f.write('\t'.join(dataList) + '\n')

f.close()
dataSource.close()
```
The whole txt read/write process is done through file streams. Note that if garbled Chinese characters appear while reading or writing, adding the parameter encoding='utf-8' effectively avoids the problem.
2.2 csv reading and writing
Compared with txt, Python users generally prefer storing data as csv or xlsx, which is more standardized, especially for numeric data; readers with little programming background can then analyze it visually in Excel directly, without Python.
There are many ways to read and write csv in Python. The examples below show two common ones: one based on Python's built-in csv module, the other based on pandas.
csv data reading
For the read test, the data is read from the save.csv file:
Method 1: read with Python's csv module
```python
import csv

# Read data using Python's built-in csv module
csvPath = 'save.csv'
with open(csvPath, newline='', encoding='utf-8') as csvfile:
    spamreader = csv.reader(csvfile, delimiter=',', quotechar='|')
    for row in spamreader:
        print(', '.join(row))
```
Method 2: read with pandas
```python
import pandas

csvPath = 'save.csv'
df = pandas.read_csv(csvPath, header=0)
print(df)
```
csv data writing
Method 1: with Python's csv module
```python
import csv

# txt data source; remember encoding='utf-8' to prevent garbled characters
dataSource = open('data.txt', encoding='utf-8', mode='r')

# Write a csv file using Python's built-in csv module
csvPath = 'save.csv'
with open(csvPath, 'w', newline='', encoding='utf-8') as csvfile:
    spamwriter = csv.writer(csvfile, delimiter=' ', quotechar='|',
                            quoting=csv.QUOTE_MINIMAL)
    for index, lineData in enumerate(dataSource.readlines()):
        dataList = lineData.replace('\n', '').split(',')
        spamwriter.writerow(dataList)
```
Method 2: with pandas
```python
import pandas

dataSource = open('data.txt', encoding='utf-8', mode='r')

# Write a csv file using pandas
csvPath = 'save.csv'
data = []
columnName = []
for index, lineData in enumerate(dataSource.readlines()):
    if index == 0:  # the first row holds the column names
        columnName = lineData.replace('\n', '').split(',')
        continue
    dataList = lineData.replace('\n', '').split(',')
    data.append(dataList)

df = pandas.DataFrame(data, columns=columnName)
# encoding='utf_8_sig' prevents garbled characters when opened in Excel
df.to_csv(csvPath, encoding='utf_8_sig', index=False)
```
For working with csv files in Python, I prefer pandas. pandas presents data as DataFrame or Series objects, and most numerical-computation and statistical-analysis methods come as ready-made functions you can call directly, which is very convenient.
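For instance, a minimal sketch of those built-in helpers, assuming the df loaded from save.csv earlier:

```python
# Quick looks at the data with pandas' ready-made functions
print(df.head(5))      # preview the first five rows
print(df.describe())   # per-column summary statistics (counts for text
                       # columns; mean/std/quartiles for numeric ones)
```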
For the basics of pandas, refer to these earlier articles:
- Pandas data processing (I) -- master several simple functions!
- Pandas data processing (II) -- conditional filtering and sorting!
- Pandas data processing (III) - group by
- General usage of Pandas data processing | apply() function
- Pandas basic usage | DataFrame to_dict() function
3. Python database reading and writing
Having covered Python's reading and writing of local documents, we now turn to how Python works with databases. I chose MongoDB and MySQL, two popular databases; before the formal introduction, make sure your environment is ready.
3.1 MongoDB data reading and writing
MongoDB is a document-oriented database in the NoSQL camp, often described as a non-relational database. Records are stored as key-value pairs (key => value), similar to JSON objects; in other words, when Python exchanges data with MongoDB, the data is passed as dictionaries.
How do you connect to MongoDB from Python? No need to worry: one of Python's biggest advantages is its huge ecosystem of third-party libraries, and pymongo gives us CRUD operations on the data in MongoDB.
MongoDB connection initialization
After installing MongoDB locally, the database listens on port 27017 at host localhost by default. Before reading or writing data, you need to initialize the connection to the database. The code is as follows:
```python
import pymongo

# Connect to the database and get a collection handle
myclient = pymongo.MongoClient("mongodb://localhost:27017")
database = myclient["database"]  # connect to the database
collection = database["table"]   # a collection is similar to a table in MySQL
```
"Database" and "table" in the code represent the database name and table name respectively. If there is no corresponding database or data table during data writing, MongoDB will automatically create it
MongoDB data writing
As mentioned in the MongoDB introduction above, its records are key-value pairs similar to JSON objects; so before inserting, the data must be converted into dictionary form (column names as keys, column values as values). The data looks like this:
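A hypothetical sample record in that form (the field values are invented for illustration; the field names follow the table schema used later in this article):

```python
# One housing record as a dictionary: column name => column value
dataDict = {
    "title": "Bright south-facing two-bedroom",
    "community": "Xihongyuan",
    "totalprice": "420",
    "unitprice": "52000",
}
```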
The Python implementation of the write:
```python
dataSource = open('data.txt', encoding='utf-8', mode='r')
readlines = dataSource.readlines()

# The first line holds the column names, which become the dictionary keys
keys_list = readlines[0].replace('\n', '').split(',')

for lineData in readlines[1:]:
    dataList = lineData.replace('\n', '').split(',')
    # Zip column names and values into a dictionary
    dataDict = dict(zip(keys_list, dataList))
    if collection.find_one(dataDict):  # skip records already in the collection
        print("Insert done......")
    else:  # insert when not present
        collection.insert_one(dataDict)
```
The result after writing is shown below; the GUI tool here is MongoDB's official Compass.
pymongo offers two ways to insert data:
- insert_one() inserts one record at a time;
- insert_many() inserts multiple records at a time; the records to insert are passed as a list (see the sketch below).
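A minimal insert_many() sketch, reusing the collection handle from above (the two records are invented for illustration):

```python
# insert_many() takes a list of dictionaries and inserts them in one call
docs = [
    {"community": "Xihongyuan", "totalprice": "420"},
    {"community": "Xihongyuan", "totalprice": "515"},
]
result = collection.insert_many(docs)
print(result.inserted_ids)  # the _id values assigned to the new documents
```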
MongoDB data reading
Reading from MongoDB can be understood as an unconditional query: find() returns a Cursor object. A cursor behaves much like an iterator, yielding database records one by one as you loop over it.
The data reading method is as follows:
```python
import pymongo

myclient = pymongo.MongoClient("mongodb://localhost:27017")
database = myclient["database"]
collection = database["table"]

for post_data in collection.find():
    print(post_data)
```
Besides unconditional queries, you can add filter criteria; a few examples follow, ending with a sketch that chains them together:
- Return only the first 10 records:
```python
for post_data in collection.find()[:10]:
    print(post_data)
```
- Return the results sorted by a field, such as the _id field (ascending here; use pymongo.DESCENDING to reverse the order):
```python
for post_data in collection.find().sort('_id', pymongo.ASCENDING):
    print(post_data)
```
- Return only the records whose community field is Xihongyuan:
```python
for post_data in collection.find({"community": "Xihongyuan"}):
    print(post_data)
```
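These filter criteria can also be chained; a minimal sketch combining the three examples above:

```python
# Ten Xihongyuan records, newest first by _id
query = {"community": "Xihongyuan"}
for post_data in collection.find(query).sort('_id', pymongo.DESCENDING).limit(10):
    print(post_data)
```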
pymongo provides many more methods for working with MongoDB; only a small part is covered here. For details, see the official pymongo documentation:
https://pymongo.readthedocs.io/en/stable/api/pymongo/collection.html
3.2 MySQL read / write
MySQL is a typical relational database. Tables can be linked through foreign keys, so that when a field is updated in one table, the associated tables are updated along with it. The business logic of adding, deleting, modifying, and querying data is expressed mainly in SQL statements.
At present there are two main libraries for operating MySQL from Python: MySQLdb and pymysql. The two work in similar ways, with SQL statements at their core; you can think of them as the interface between Python and MySQL.
This article uses pymysql as the test library. Before reading or writing with pymysql, it is recommended to confirm from the command line that the local MySQL server is reachable, e.g. by logging in with `mysql -u root -p`; an equivalent check from Python is sketched below:
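```python
# A quick connectivity check, assuming a local MySQL server and the same
# credentials used in the connection code later in this article
import pymysql

conn = pymysql.connect(host='127.0.0.1', port=3306,
                       user='root', password='123456')
with conn.cursor() as cursor:
    cursor.execute('SELECT VERSION()')
    print(cursor.fetchone())  # e.g. ('8.0.33',)
conn.close()
```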
MySQL connection initialization
When pymysql connects to MySQL, you need to supply the IP, port, user name, password, and database name (unlike MongoDB, the MySQL database must be created in advance):
```python
import pymysql

db_conn = pymysql.connect(host='127.0.0.1',
                          port=3306,
                          user='root',
                          password='123456',
                          db='db_example')
```
MySQL data insertion
When inserting data into MySQL, you can either create a new table or insert into an existing one; the test code below takes the former approach.
```python
# Create the table data_table; table names cannot be passed as query
# parameters, so they are formatted into the SQL string directly
table_name = 'data_table'
sql1 = f'DROP TABLE IF EXISTS {table_name};'
sql2 = f'''CREATE TABLE {table_name} (
    title      varchar(1000),
    community  varchar(1000),
    citydirct  varchar(1000),
    houseinfo  varchar(1000),
    dateinfo   varchar(1000),
    taglist    varchar(1000),
    totalprice varchar(1000),
    unitprice  varchar(1000)
)'''

try:
    with db_conn.cursor() as cursor:
        cursor.execute(sql1)
        cursor.execute(sql2)
    db_conn.commit()
except Exception as e:
    db_conn.rollback()
```
MySQL data insertion code
```python
dataSource = open('data.txt', encoding='utf-8', mode='r')

for lineData in dataSource.readlines()[1:]:
    dataList = lineData.replace('\n', '').split(',')
    # Splice the values into an INSERT statement
    insert_sql = ('INSERT INTO data_table '
                  '(title,community,citydirct,houseinfo,dateinfo,taglist,totalprice,unitprice) '
                  'VALUES ("' + '","'.join(dataList) + '")')
    print(insert_sql)
    # Execute the sql statement
    with db_conn.cursor() as cursor:
        cursor.execute(insert_sql)

# Commit the inserts, then close the connection
db_conn.commit()
db_conn.close()
```
The result of the insertion is shown below, viewed in Navicat's MySQL UI:
As the code above shows, when connecting to MySQL and inserting data with pymysql, the business logic lives mainly in the SQL statement; in other words, to operate MySQL from Python you need to know SQL first.
```python
insert_sql = ('INSERT INTO data_table '
              '(title,community,citydirct,houseinfo,dateinfo,taglist,totalprice,unitprice) '
              'VALUES ("' + '","'.join(dataList) + '")')
```
Also note that with pymysql, every time you execute one or more SQL statements that modify data, you must call commit(); otherwise the changes do not take effect:
```python
cursor.execute(sql2)
db_conn.commit()
```
MySQL data query
Use the SELECT * FROM data_table statement to query all the data in the table data_table:
```python
import pymysql

db_conn = pymysql.connect(host='localhost',
                          port=3306,
                          user='root',
                          password='123456',
                          db='db_example',
                          charset='utf8mb4')

findSql = 'SELECT * FROM data_table'
try:
    with db_conn.cursor() as cursor:
        cursor.execute(findSql)
        results = cursor.fetchall()
        for itemData in results:
            print(itemData)
except Exception as e:
    db_conn.rollback()
```
Besides the unconditional query, you can run conditional queries with SELECT * FROM data_table WHERE ..., where ... stands for the filter condition to be added.
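A minimal sketch of such a conditional query, reusing the db_conn handle from above (the community value is just an illustration):

```python
# Conditional query with parameter substitution: pymysql fills in %s
# and handles quoting/escaping itself
findSql = 'SELECT * FROM data_table WHERE community = %s'
with db_conn.cursor() as cursor:
    cursor.execute(findSql, ('Xihongyuan',))
    for itemData in cursor.fetchall():
        print(itemData)
```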
Beyond basic reads and writes, SQL statements also cover business functions such as updating and deleting. The annoyance is that hand-splicing SQL strings in Python makes syntax errors easy, which compares poorly with Java.
Java has well-packaged libraries where common CRUD operations just take parameters, with no need to worry about whether the SQL is valid, which is very convenient. Compared with Java, Python's interaction with MySQL still has plenty of room for improvement~
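That said, pymysql's own parameter substitution already sidesteps most splicing mistakes; a sketch, assuming the db_conn and dataList from the insert code above:

```python
# Parameterized INSERT: no hand-spliced quotes, and pymysql escapes the
# values, so odd characters in the data cannot break the statement
insert_sql = ('INSERT INTO data_table '
              '(title,community,citydirct,houseinfo,dateinfo,taglist,totalprice,unitprice) '
              'VALUES (%s,%s,%s,%s,%s,%s,%s,%s)')
with db_conn.cursor() as cursor:
    cursor.execute(insert_sql, dataList)
db_conn.commit()
```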
4. Summary
This article introduced Python's reading and writing methods for four types of storage target, split into two broad categories, documents and databases: from txt to csv, and from MongoDB to MySQL, covering most everyday business scenarios. For convenience, most people learning Python will reach first for txt, csv, and similar documents when storing data.
But once the data volume grows huge, or the data is spread over many files, the drawbacks of document storage show themselves: files are easily deleted by mistake, relationships between data get confused, files get misplaced, and so on. That is when the advantages of a database come through:
1. Huge storage capacity: as early as MySQL 3.22, a single table could hold up to 4GB; one database can contain many tables, and MySQL can host many databases.
2. Clear relationships between tables make fast retrieval easy: one database gathers multiple tables in one place for convenient lookup, and as the data volume grows, the data can be split across databases by attribute and usage scenario.
3. Even if data in a database is deleted by mistake, it can often be recovered by technical means. So it is well worth understanding the basics of how databases handle data!
Well, that's all about Python data storage. If this content helped you, a like to encourage me would be much appreciated~
Finally, thank you for reading. See you next time!