Python file reading for machine learning

1, read_sql function and to_sql function

1.1 read_sql function

  1. Data is generally read from a database, so fill in the corresponding SQL statement in the read_sql() function to fetch the data you want
  2. pd.read_sql(sql, con, index_col=None, coerce_float=True, params=None, parse_dates=None, columns=None, chunksize=None)
    1) sql: the SQL command string;
    2) con: the engine connecting to the SQL database, generally created with a module such as SQLAlchemy or PyMySQL;
    3) index_col: select a column to use as the index;
    4) coerce_float: read numeric strings directly as float;
    5) parse_dates: convert a column of date-type strings to datetime data. You can provide just the names of the columns to convert in the default date format, or provide a dictionary mapping column names to the date format to use;
#Use the pymysql module to establish a database connection and read data from the database
import pandas as pd 
import pymysql
conn = pymysql.connect(host='localhost', port=3306, database='database_name', user='root', password='123456', charset='utf8')
#host is the IP address of the database server, 3306 is the default MySQL port, database is the database name, user and password are the account and password for the connection, and charset is the character encoding of the database
#Write an SQL statement and read with the read_sql() function
sql_cmd = 'SELECT * FROM table_name'
df = pd.read_sql(sql_cmd, conn)
df.head()
#Test the role of the parse_dates parameter
sql_cmd_2 = "SELECT * FROM test_date"
df_1 = pd.read_sql(sql_cmd_2, conn)
df_1.head()
"""
output
number  date_columns
0    1      2021-11-11
1    2      2021-10-01
2    3      2021-11-10
"""

df_1.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   number        3 non-null      int64 
 1   date_columns  3 non-null      object
dtypes: int64(1), object(1)
memory usage: 176.0+ bytes
"""
#By default, date_columns is treated as string data. If we pass the parse_dates parameter, date parsing is applied to that column
df_2 = pd.read_sql(sql_cmd_2, conn, parse_dates="date_columns")
df_2.info()
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3 entries, 0 to 2
Data columns (total 2 columns):
 #   Column        Non-Null Count  Dtype         
---  ------        --------------  -----         
 0   number        3 non-null      int64         
 1   date_columns  3 non-null      datetime64[ns]
dtypes: datetime64[ns](1), int64(1)
memory usage: 176.0 bytes
"""
#It is converted to the corresponding date type. Of course, we can also use the dictionary form mentioned above (note the format must match the stored strings)
parse_dates = {"date_columns": {"format": "%Y-%m-%d"}}
df_3 = pd.read_sql(sql_cmd_2, conn, parse_dates=parse_dates)
df_3.info()
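
  • The chunksize parameter listed above is also worth a look: passing it makes read_sql return an iterator of DataFrames instead of a single frame, which helps when the result set is too large for memory. A minimal sketch, reusing the conn and sql_cmd from the snippets above:
#Read the query result two rows at a time; each iteration yields a small DataFrame
for chunk in pd.read_sql(sql_cmd, conn, chunksize=2):
    print(chunk.shape) #process each chunk here, e.g. aggregate it or write it elsewhere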

1.2 to_sql function

  • DataFrame.to_sql(name, con, schema=None, if_exists='fail', index=True, index_label=None, chunksize=None, dtype=None, method=None)
    1) name: the name of the target table in the database;
    2) con: the connection to the database; a SQLAlchemy engine is recommended;
    3) schema: the database schema to write into; if not set, the database's default schema is used;
    4) if_exists: what to do when the table already exists in the database; the options are 'fail', 'replace' and 'append'. With 'fail' (the default), a ValueError is raised when the table exists;
    5) index: how to handle the index of the DataFrame; when True, the index is also written into the table as data;
    6) index_label: when index is True, sets the column name used for the index when writing the table;
    7) chunksize: an integer, e.g. 20000, giving the number of rows written at a time. It should be set when the amount of data is large; otherwise the connection may time out and the write fail;
    8) dtype: sets the column types when writing the table (the keys should be the column names and the values should be SQLAlchemy types, or strings for the sqlite3 legacy mode). When not set, to_sql automatically uses the widest compatible type when it creates the table;
  • Among the to_sql() parameters, only name and con are required; for the optional parameters, index=False is recommended and dtype is usually better left unset.
  • to_sql() creates the table when it does not exist. If the table exists, the behavior follows the if_exists parameter: 'replace', 'append' and 'fail' correspond to replacing the table, appending to it, and failing, respectively. An example of these parameters follows the engine setup below.
#The sqlalchemy.create_engine() function can create an engine in two ways:
from sqlalchemy import create_engine
engine = create_engine('mysql+mysqldb://user:passwd@127.0.0.1/database?charset=utf8') 
engine = create_engine('mysql+mysqlconnector://user:passwd@127.0.0.1/database?charset=utf8') 
#user:passwd@127.0.0.1/database --> format: username:password@server_address/database_name
from sqlalchemy import create_engine
DB_STRING = 'mysql+mysqlconnector://user:passwd@127.0.0.1/database?charset=utf8'
engine = create_engine(DB_STRING)
data.to_sql('table_name', con=engine)
#This works without problems before MySQL 5.7, but MySQL 8 changed the password encryption method, and this approach raises an error on version 8 and above.
#When using to_sql to write to MySQL 8 and above, use mysqldb as the driver
#After importing pymysql, call pymysql.install_as_MySQLdb() so that it can stand in for MySQLdb
import pymysql
pymysql.install_as_MySQLdb()
DB_STRING = 'mysql+mysqldb://user:passwd@127.0.0.1/db_name?charset=utf8'
engine = create_engine(DB_STRING)
data.to_sql('table_name', con=engine)
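
  • The if_exists, index and chunksize parameters can be combined; a minimal sketch, assuming the engine created above and a DataFrame named data ('demo_table' is a hypothetical table name):
data.to_sql('demo_table',
            con=engine,
            if_exists='append', #append to the table if it already exists
            index=False,        #do not write the DataFrame index as a column
            chunksize=20000)    #write at most 20000 rows per batch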
#Parameter dtype usage
#DATE, CHAR, VARCHAR... You can view all SQL data types in the official SQLAlchemy documentation: ['TypeEngine', 'TypeDecorator', 'UserDefinedType', 'INT', 'CHAR', 'VARCHAR', 'NCHAR', 'NVARCHAR', 'TEXT', 'Text', 'FLOAT', 'NUMERIC', 'REAL', 'DECIMAL', 'TIMESTAMP', 'DATETIME', 'CLOB', 'BLOB', 'BINARY', 'VARBINARY', 'BOOLEAN', 'BIGINT', 'SMALLINT', 'INTEGER', 'DATE', 'TIME', 'String', 'Integer', 'SmallInteger', 'BigInteger', 'Numeric', 'Float', 'DateTime', 'Date', 'Time', 'LargeBinary', 'Binary', 'Boolean', 'Unicode', 'Concatenable', 'UnicodeText', 'PickleType', 'Interval', 'Enum', 'Indexable', 'ARRAY', 'JSON']; select the appropriate type for the database
from sqlalchemy.types import DATE,CHAR,VARCHAR 
DTYPES = {'col_1': DATE, 'col_2': CHAR(4), 'col_3': VARCHAR(10)}
df.to_sql('table_name', con=engine, dtype=DTYPES)
#The DataFrame df is written into the table; dtype specifies the data type corresponding to each column name

#Execute the SQL statement to view the written database data
engine.execute("SELECT * FROM table_name").fetchall() #works on SQLAlchemy 1.x; Engine.execute was removed in 2.0
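
  • On SQLAlchemy 2.x, a minimal equivalent sketch uses an explicit connection:
from sqlalchemy import text
with engine.connect() as connection:
    rows = connection.execute(text("SELECT * FROM table_name")).fetchall()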

2, from_dict function and to_dict function

2.1 from_dict function

a_dict = {
    'school': 'Tsinghua University',
    'geographical position': 'Beijing',
    'ranking': 1
}
df = pd.json_normalize(a_dict) #Convert data in dictionary format to DataFrame
df

df = pd.DataFrame(a_dict, index = [0])#Convert data in dictionary format to DataFrame
df

df = pd.DataFrame.from_dict(a_dict,orient='index')#Convert data in dictionary format to DataFrame
df

data = {'col_1': [1, 2, 3, 4],
        'col_2': ['A', 'B', 'C', 'D']}
df = pd.DataFrame.from_dict(data, orient='columns')#Convert data in dictionary format to DataFrame
df

df = pd.DataFrame.from_dict(data, orient='index')#Convert data in dictionary format to DataFrame
df

2.2 to_dict function

  • to_dict(orient): the values of orient include dict (default), list, split, records and index
df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, 5, 3]})
df

  • Output as dict: the value of orient is 'dict'
#Output as dict
dict_out=df.to_dict(orient='dict')
dict_out
#output
"""
{'shape': {0: 'square', 1: 'circle', 2: 'triangle'},
 'degrees': {0: 360, 1: 360, 2: 180},
 'sides': {0: 4, 1: 5, 2: 3}}
"""
#read
df=pd.DataFrame.from_dict(dict_out)
df
#result
"""
shape	degrees	sides
0	square	360	4
1	circle	360	5
2	triangle	180	3
"""
  • Output as dict: the value of orient is 'split'
#Output as dict
dict_out=df.to_dict(orient='split')
dict_out
#output
"""
{'index': [0, 1, 2],
 'columns': ['shape', 'degrees', 'sides'],
 'data': [['square', 360, 4], ['circle', 360, 5], ['triangle', 180, 3]]}
"""
#read; note that from_dict with the default orient does not rebuild the original frame from a 'split' dict (pd.DataFrame(**dict_out) would), so we store the result under a new name
df_split = pd.DataFrame.from_dict(dict_out)
df_split
#result
"""
index	columns	data
0	0	shape	[square, 360, 4]
1	1	degrees	[circle, 360, 5]
2	2	sides	[triangle, 180, 3]
"""
  • Output as dict: the value of orient is 'list'
#Output as dict
dict_out=df.to_dict(orient='list')
dict_out
#output
"""
{'shape': ['square', 'circle', 'triangle'],
 'degrees': [360, 360, 180],
 'sides': [4, 5, 3]}
"""
#read
df=pd.DataFrame.from_dict(dict_out)
df
#result
"""
    shape    degrees   sides
0	square	  360	    4
1	circle	  360	    5
2	triangle  180	    3
"""
  • Output as dict: the value of orient is 'index'
#Output as dict
dict_out=df.to_dict(orient='index')
dict_out
#output
"""
{0: {'shape': 'square', 'degrees': 360, 'sides': 4},
 1: {'shape': 'circle', 'degrees': 360, 'sides': 5},
 2: {'shape': 'triangle', 'degrees': 180, 'sides': 3}}
"""
  • Output as dict: the value of orient is 'records'
#Output as dict
dict_out=df.to_dict(orient='records')
dict_out
#output
"""
[{'shape': 'square', 'degrees': 360, 'sides': 4},
 {'shape': 'circle', 'degrees': 360, 'sides': 5},
 {'shape': 'triangle', 'degrees': 180, 'sides': 3}]
"""

3, read_json function and to_json function

3.1 read_json function

  • pandas.read_json()
    1) Parameter orient: the format of the corresponding JSON string
    2) split: the format is similar to: {index: [index], columns: [columns], data: [values]}
    3) records: format is similar to: [{column: value},..., {column: value}]
    4) index: the format is similar to: {index: {column: value}}
    5) columns: the format is similar to: {column: {index: value}}
    6) values: array

  • split format: the format is similar to: {index: [index], columns: [columns], data: [values]}

import pandas as pd
#split: the format is similar to: {index: [index], columns: [columns], data: [values]}
#For example, our JSON string looks like this
a='{"index":[1,2,3],"columns":["a","b"],"data":[[1,3],[2,8],[3,9]]}'
df = pd.read_json(a, orient='split')
df
#output
"""
    a	b
1	1	3
2	2	8
3	3	9
"""
  • records format: similar to: [{column: value},..., {column: value}]
import pandas as pd
#records: the format is similar to: [{column: value},..., {column: value}]
#For example, our JSON string looks like this
a = '[{"name":"Tom","age":"18"},{"name":"Amy","age":"20"},{"name":"John","age":"17"}]'
df_1 = pd.read_json(a, orient='records')
df_1
#output
"""
    name   age
0	Tom	   18
1	Amy	   20
2	John   17
"""
  • index format: similar to: {index: {column: value}}
import pandas as pd
#index: the format is similar to: {index: {column: value}}
#For example, our JSON string looks like this
a = '{"index_1":{"name":"John","age":20},"index_2":{"name":"Tom","age":30},"index_3":{"name":"Jason","age":50}}'
df_1 = pd.read_json(a, orient='index')
df_1
#output
"""
         name	age
index_1	 John	20
index_2	 Tom	30
index_3	 Jason	50
"""
  • columns format: the format is similar to: {column: {index: value}}
import pandas as pd
#columns: the format is similar to: {column: {index: value}}
a = '{"index_1":{"name":"John","age":20},"index_2":{"name":"Tom","age":30},"index_3":{"name":"Jason","age":50}}'
df_1 = pd.read_json(a, orient='columns')
df_1
#output
"""
	    index_1	index_2	index_3
name	John	Tom	  Jason
age	     20	    30	   50
"""
  • values format: the JSON string is just an array
import pandas as pd
#values: array
#For example, our JSON string looks like this
v='[["a",1],["b",2],["c", 3]]'
df_1 = pd.read_json(v, orient="values")
df_1
#output
"""
	0	1
0	a	1
1	b	2
2	c	3
"""

3.2 to_json function

  • To output a DataFrame object as a JSON string, use the to_json() method; its orient parameter selects the output format, and the usage is roughly the same as described above
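
  • A minimal sketch, reusing the shape DataFrame from section 2.2; each orient value mirrors the read_json formats above:
import pandas as pd
df = pd.DataFrame({'shape': ['square', 'circle', 'triangle'],
                   'degrees': [360, 360, 180],
                   'sides': [4, 5, 3]})
json_records = df.to_json(orient='records') #[{column: value}, ...] format
json_split = df.to_json(orient='split')     #{index: [...], columns: [...], data: [...]} format
#A string produced this way can be read straight back with read_json
pd.read_json(json_records, orient='records')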

4, read_html function and to_html function

4.1 read_html function

  • Sometimes we need to grab a table from a web page. Compared with using XPath or Beautiful Soup, the read_html() function encapsulated in pandas gets it much more quickly. For example, we use it below to grab part of a table from the runoob Python tutorial site
#!pip3 install lxml #The first time you use read_html, the lxml package must be installed; read_html uses it internally, so no import is needed
url = "https://www.runoob.com/python/python-exceptions.html"
dfs = pd.read_html(url, header=None, encoding='utf-8') #a list of DataFrames is returned
df=dfs[0]
df.head()
#output
"""
         Exception name	       describe
0	       NaN	           NaN
1	BaseException	    Base class for all exceptions
2	SystemExit	        Interpreter requests exit
3	KeyboardInterrupt	User interrupt execution(Usually input^C)
4	Exception	        Base class for general errors
"""

4.2 to_html function

import numpy as np
df = pd.DataFrame(np.random.randn(3, 3),columns=['A','B','C'])
df.to_html("test_1.html",index=False)
dfs = pd.read_html("test_1.html")
dfs[0]
#output
"""
         A	         B 	        C
0	-0.348165	-0.572741	-0.190966
1	0.700785	-0.848750	-1.853054
2	0.941161	-0.944569	0.749450
"""

5, read_csv function and to_csv function

5.1 read_csv function

#Parameter filepath_or_buffer: the path of the input data, which can also be a URL
#sep: the separator used when reading the csv file; the default is a comma
#index_col: after reading the file, specify a column to use as the index of the DataFrame
pd.read_csv(filepath_or_buffer='data.csv',index_col=0)
#output
"""
    num1	num2	num3	num4
0	 1	     2	     3	     4
1	 6	    12	     7	     9
2	 11	    13	    15	    18
3	 12	    10	    16	    18
"""
#Besides a single column, index_col can also specify multiple columns
pd.read_csv(filepath_or_buffer='data.csv', index_col=['num1', 'num2'])
#output
"""
                 Unnamed: 0	num3	num4
num1	num2			
 1	     2	  		0		3		4
 6	    12			1		7		9
 11	    13			2		15		18
 12	    10			3		16		18
"""
#usecols: use this parameter when the dataset has many columns and we only want some specified columns rather than all of them
#pd.read_csv('data.csv', usecols=["column name 1", "column name 2",...])
pd.read_csv('data.csv', usecols=["num1", "num2"])
#output
"""
     num1	num2	
0	 1	     2
1	 6	    12
2	 11	    13
3	 12	    10
"""
#Besides column names, usecols can also select the desired columns by index; example code below
df = pd.read_csv("data.csv", usecols=[1, 2])
df
#output
"""
     num1	num2	
0	 1	     2
1	 6	    12
2	 11	    13
3	 12	    10
"""
#The usecols parameter can receive a function and pass the column name as a parameter to the function. If the condition is satisfied, the column is selected, otherwise the column is not selected.
pd.read_csv("data.csv", usecols = lambda x:len(x)<5)
#output
"""
    num1	num2	num3	num4
0	 1	     2	     3	     4
1	 6	    12	     7	     9
2	 11	    13	    15	    18
3	 12	    10	    16	    18
"""
#prefix: when the imported data has no header, this parameter adds a prefix to the column names
pd.read_csv("data.csv", prefix="test_", header=None)
#output
"""
	test_0	test_1	test_2	test_3	test_4
0	NaN		num1	num2	num3	num4
1	0.0		1		2		3		4
2	1.0		6		12		7		9
3	2.0		11		13		15		18
4	3.0		12		10		16		18
"""
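
  • Note that the prefix parameter was removed in pandas 2.0; on newer versions, a sketch of an equivalent (assuming the same data.csv):
pd.read_csv("data.csv", header=None).add_prefix("test_") #prefix the default integer column labels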
#skiprows: filters out rows; pass the indices of the rows to skip
pd.read_csv("data.csv", skiprows=[0, 1])
#The code above filters out the first two rows and directly outputs the third and fourth rows; note that the second data row is taken as the header
#output
"""
	1	6	12	7	9
0	2	11	13	15	18
1	3	12	10	16	18
"""
#nrows: the number of rows to read at a time, which is very useful for large files; for example, a PC with 16 GB of memory cannot hold a file of hundreds of GB
pd.read_csv("data.csv", nrows=2)
#output
"""
	Unnamed:0	num1	num2	num3	num4
0	  0			1		2		3		4
1	  1			6		12		7		9
"""

5.2 to_csv function

5.2.1 save csv format

#This method is mainly used to write a DataFrame into a csv file; example code below
df = pd.read_csv("data.csv", usecols=lambda x: len(x) < 5)
df.to_csv("file_name.csv", index=False)

5.2.2 save in zip format

#We can also output to a zip file format
df = pd.read_csv("data.csv")
compression_opts = dict(method='zip',
                        archive_name='output.csv')
df.to_csv('output.zip', index=False,
          compression=compression_opts)

6, read_excel function and to_excel function

6.1 read_excel function

#!pip install xlrd #The first time you run read_excel, you need to install the xlrd package
#!pip install openpyxl #The first time you run read_excel, you also need to install the openpyxl package
df = pd.read_excel("data.xlsx", engine='openpyxl') #newer versions of the xlrd package only read .xls files, so engine='openpyxl' must be added for .xlsx
df
#output
"""
	Unnamed: 0	num1	num2	num3	num4
0		0		1		2		3		4
1		1		6		12		7		9
2		2		11		13		15		18
3		3		12		10		16		18
"""
#dtype: this parameter can set the data type of a specified column
df = pd.read_excel("data.xlsx",engine='openpyxl',dtype={'Unnamed: 0': str, 'num1': float})
print(df.info())
df
#output
"""
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4 entries, 0 to 3
Data columns (total 5 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   Unnamed: 0  4 non-null      object 
 1   num1        4 non-null      float64
 2   num2        4 non-null      int64  
 3   num3        4 non-null      int64  
 4   num4        4 non-null      int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 288.0+ bytes
None

	Unnamed: 0	num1	num2	num3	num4
0		0		1.0		2		3		4
1		1		6.0		12		7		9
2		2		11.0	13		15		18
3		3		12.0	10		16		18
"""
#sheet_name: set which sheet of the Excel file to read
df = pd.read_excel("data.xlsx", engine='openpyxl', sheet_name="Sheet2")
df
#output
"""
	Unnamed: 0	num5	num6	num7	num8
0		0		1		2		3		4
1		1		6		12		7		9
2		2		11		13		15		18
3		3		12		10		16		18
"""
#Of course, we can also read the data of multiple sheets at once; the data is then returned as a dict of DataFrames
df = pd.read_excel("data.xlsx", engine='openpyxl', sheet_name=["Sheet1", "Sheet2"])
df
#output
"""
{'Sheet1':    Unnamed: 0  num1  num2  num3  num4
 0           0     1     2     3     4
 1           1     6    12     7     9
 2           2    11    13    15    18
 3           3    12    10    16    18,
 'Sheet2':    Unnamed: 0  num5  num6  num7  num8
 0           0     1     2     3     4
 1           1     6    12     7     9
 2           2    11    13    15    18
 3           3    12    10    16    18}
"""

#For example, if we only want the data of Sheet1, we can use the get function
df.get("Sheet1")
#output
"""
	Unnamed: 0	num1	num2	num3	num4
0		0		1		2		3		4
1		1		6		12		7		9
2		2		11		13		15		18
3		3		12		10		16		18
"""

6.2 to_excel function

#Besides to_excel(), which writes DataFrame objects into Excel tables, the ExcelWriter() class plays a similar role
df1 = pd.DataFrame([['A', 'B'], ['C', 'D']],
                   index=['Row 1', 'Row 2'],
                   columns=['Col 1', 'Col 2'])
df1.to_excel("output1.xlsx")
#Specifies the name of the Sheet
df1.to_excel("output1.xlsx", sheet_name='Sheet_Name_1_1_1')
#Output multiple DataFrame datasets to different sheets in one Excel
df2 = df1.copy()
with pd.ExcelWriter('output2.xlsx') as writer:
    df1.to_excel(writer, sheet_name='Sheet_name_1_1_1')
    df2.to_excel(writer, sheet_name='Sheet_name_2_2_2')

#Add another Sheet based on the existing Sheet
df3 = df1.copy()
with pd.ExcelWriter('output2.xlsx', mode="a", engine="openpyxl") as writer:
    df3.to_excel(writer, sheet_name='Sheet_name_3_3_3')
#The mode parameter takes two values: 'w' and 'a'; 'w' means overwrite and replaces the original sheets, while 'a' means append and adds another sheet

#An Excel file can also be generated directly inside a zip archive
import zipfile
with zipfile.ZipFile("output_excel.zip", "w") as zf:
    with zf.open("output_excel.xlsx", "w") as buffer:
        with pd.ExcelWriter(buffer) as writer:
            df1.to_excel(writer)

#Data in date or datetime format can also be handled accordingly
from datetime import date, datetime
df = pd.DataFrame(
    [
        [date(2019, 1, 10), date(2021, 11, 24)],
        [datetime(2019, 1, 10, 23, 33, 4), datetime(2021, 10, 20, 13, 5, 13)],
    ],
    index=["Date", "Datetime"],
    columns=["X", "Y"])
df
#output
"""
					X						Y
Date		2019-01-10 00:00:00		2021-11-24 00:00:00
Datetime	2019-01-10 23:33:04		2021-10-20 13:05:13
"""

#Write to Excel with the specified date and datetime formats
with pd.ExcelWriter(
    "output_excel_date.xlsx",
    date_format="YYYY-MM-DD",
    datetime_format="YYYY-MM-DD HH:MM:SS"
) as writer:
    df.to_excel(writer)

7, read_table function

  • For txt files, you can use either the read_csv() method or the read_table() method; the parameters of read_table() are roughly the same as those of read_csv()
#Read a plain text file; read_table uses tab as the default separator
import pandas as pd
df = pd.read_table('sentences.txt', header=None, encoding='utf8') #without encoding='utf8', Chinese characters may come out garbled
df
#output
"""
		0
0	I love you China!
1	China is great and the motherland is safe!
2	People's happiness!
"""
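
  • If the file is delimited by something other than tabs, pass sep explicitly; a minimal sketch, assuming a hypothetical space-separated file data_space.txt:
df_space = pd.read_table('data_space.txt', sep=' ', header=None) #one column per space-separated field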

8, read_pickle function and to_pickle function

  • The pickle module in Python implements binary serialization and deserialization of Python object structures. Serialization converts the object into a binary data stream while preserving its data types. For example, if you suddenly have to step away in the middle of data processing, you can serialize the data to disk; after deserialization it comes back with the same data types it had when saved, so you do not have to start processing from scratch.
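
  • A minimal sketch of the underlying mechanism with the standard pickle module (the file name and contents are illustrative):
import pickle
state = {"step": 3, "partial_result": [1.5, 2.5]} #some intermediate processing state
with open("state.pkl", "wb") as f:
    pickle.dump(state, f) #serialize to a binary file
with open("state.pkl", "rb") as f:
    restored = pickle.load(f) #deserialize; types are preserved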

8.1 to_pickle function

from datetime import date, datetime
df = pd.DataFrame(
    [
        [date(2019, 1, 10), date(2021, 11, 24)],
        [datetime(2019, 1, 10, 23, 33, 4), datetime(2021, 10, 20, 13, 5, 13)],
    ],
    index=["Date", "Datetime"],
    columns=["X", "Y"])
df.to_pickle('test.pkl')
df
#result
"""
					X				    	Y
Date		2019-01-10 00:00:00	  2021-11-24 00:00:00
Datetime	2019-01-10 23:33:04	  2021-10-20 13:05:13
"""

8.2 read_pickle function

df2 = pd.read_pickle("test.pkl")
df2
#result
"""
					X				    	Y
Date		2019-01-10 00:00:00	  2021-11-24 00:00:00
Datetime	2019-01-10 23:33:04	  2021-10-20 13:05:13
"""

9, read_clipboard function and to_clipboard function

9.1 read_clipboard function

  • Sometimes it is inconvenient to obtain data through a file. We can copy data from anywhere on the computer and then use the read_clipboard() method in pandas to read what was copied
df=pd.read_clipboard()
df
#result
"""
					X						Y
Date		2019-01-10	00:00:00	2021-11-24	00:00:00
Datetime	2019-01-10	23:33:04	2021-10-20	13:05:13
"""

9.2 to_clipboard function

  • Where there is copy, there is paste: we can also output a DataFrame to the clipboard
df.to_clipboard()
