pandas is a Python software package. It is one of the most common foundational libraries when using Python for machine learning programming, and this article is an introduction to it.
pandas provides fast, flexible and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It aims to be a high-level building block for practical data analysis in Python.
Introduction
pandas is suitable for many different types of data, including:
- Tabular data with heterogeneously typed columns, such as SQL tables or Excel data
- Ordered and unordered (not necessarily fixed-frequency) time series data
- Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed)
- Any other form of observational/statistical data set
Since pandas is a Python package, you first need a Python environment on your machine. Please search the Internet for installation instructions if you do not have one.
For information on how to obtain pandas, please refer to the instructions on the official website: pandas Installation.
Usually, we can perform the installation through pip:
sudo pip3 install pandas
Or install pandas through conda:
conda install pandas
As of this writing (February 2018), the latest version of pandas is v0.22.0 (released December 29, 2017).
I have put the source code and test data for this article on GitHub: pandas_tutorial. Readers can access it there.
In addition, pandas is often used together with NumPy, and the source code in this article uses NumPy as well.
Readers are advised to gain some familiarity with NumPy before reading this pandas tutorial. I have written a basic tutorial on NumPy before; see here: NumPy tutorial of Python machine learning library.
Core data structure
The core of pandas is Series and DataFrame.
These two types of data structures are compared as follows:
Name | Dimensions | Description |
---|---|---|
Series | 1 | Labeled one-dimensional array of a homogeneous type |
DataFrame | 2 | Labeled, size-mutable table structure that can contain heterogeneously typed columns |
A DataFrame can be seen as a container for Series, that is, a DataFrame can contain several Series.
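To make this relationship concrete, here is a minimal sketch (the data and the column name "note" are illustrative, not from this article's repository): selecting a single column of a DataFrame yields a Series.
# sketch: each column of a DataFrame is a Series
import pandas as pd
df = pd.DataFrame({"note": ["C", "D", "E"], "No.": [1, 2, 3]})
print(type(df["note"]))   # <class 'pandas.core.series.Series'>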
Note: Before version 0.20.0, there was also a three-dimensional data structure named Panel. This is, in fact, where the library's name comes from: pan(el)-da(ta)-s. But this data structure has since been deprecated because it was seldom used.
Series
Because a Series is a one-dimensional data structure, we can create one directly from an array, like this:
# data_structure.py
import pandas as pd
import numpy as np
series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))
The output of this code is as follows:
series1:
0 1
1 2
2 3
3 4
dtype: int64
Note the following about this output:
- The last line of the output gives the type of the data in the Series; here it is int64.
- The data itself is shown in the second column; the first column is the index of the data, called the Index in pandas.
We can print out the data and index in Series separately:
# data_structure.py
print("series1.values: {}\n".format(series1.values))
print("series1.index: {}\n".format(series1.index))
The output of these two lines of code is as follows:
series1.values: [1 2 3 4]
series1.index: RangeIndex(start=0, stop=4, step=1)
If not specified (as above), the index takes the form [0, N-1]. But we can also specify an index when creating the Series, and the index does not need to be integers; it can be any type of data, such as strings. For example, here we map the seven musical notes to seven letters. The purpose of the index is to obtain the corresponding data through it, as follows:
# data_structure.py
series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))
The output of this code is as follows:
series2:
C 1
D 2
E 3
F 4
G 5
A 6
B 7
dtype: int64
E is 3
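As a side note, a Series can also be created directly from a dict, in which case the keys become the index; a minimal sketch (series3 is a hypothetical name, not part of this article's source):
# sketch: creating a Series from a dict; the keys become the index
series3 = pd.Series({"C": 1, "D": 2, "E": 3})
print("E is {}\n".format(series3["E"]))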
DataFrame
Let's take a look at creating a DataFrame. We can build a 4x4 matrix through NumPy's interface and create a DataFrame from it, like this:
# data_structure.py
df1 = pd.DataFrame(np.arange(16).reshape(4,4))
print("df1:\n{}\n".format(df1))
The output of this code is as follows:
df1:
0 1 2 3
0 0 1 2 3
1 4 5 6 7
2 8 9 10 11
3 12 13 14 15
From this output, we can see that the default index and column names are in the form of [0, N-1].
We can specify column names and indexes when creating a DataFrame, like this:
# data_structure.py
df2 = pd.DataFrame(np.arange(16).reshape(4,4),
columns=["column1", "column2", "column3", "column4"],
index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))
The output of this code is as follows:
df2:
column1 column2 column3 column4
a 0 1 2 3
b 4 5 6 7
c 8 9 10 11
d 12 13 14 15
We can also directly specify column data to create a DataFrame:
# data_structure.py
df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
"weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))
The output of this code is as follows:
df3:
note weekday
0 C Mon
1 D Tue
2 E Wed
3 F Thu
4 G Fri
5 A Sat
6 B Sun
Please note that:
- Different columns of a DataFrame can have different data types
- If you create a DataFrame from an array of Series, each Series becomes a row, not a column (a sketch after the following example shows how to get columns instead)
For example:
# data_structure.py
noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))
The output of df4 is as follows:
df4:
1 2 3 4 5 6 7
0 C D E F G A B
1 Mon Tue Wed Thu Fri Sat Sun
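If you want each Series to be a column instead, one option is to pass them in a dict, where the keys become the column names; a minimal sketch reusing noteSeries and weekdaySeries from above (df5 is a hypothetical name):
# sketch: a dict of Series makes each Series a column
df5 = pd.DataFrame({"note": noteSeries, "weekday": weekdaySeries})
print("df5:\n{}\n".format(df5))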
We can add or delete columns of a DataFrame as follows:
# data_structure.py
df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))
del df3["weekday"]
print("df3:\n{}\n".format(df3))
The output of this code is as follows:
df3:
note weekday No.
0 C Mon 1
1 D Tue 2
2 E Wed 3
3 F Thu 4
4 G Fri 5
5 A Sat 6
6 B Sun 7
df3:
note No.
0 C 1
1 D 2
2 E 3
3 F 4
4 G 5
5 A 6
6 B 7
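As an alternative to del, which modifies the DataFrame in place, the drop method returns a new DataFrame and leaves the original untouched; a minimal sketch (df6 is a hypothetical name):
# sketch: drop returns a new DataFrame instead of modifying df3
df6 = df3.drop("No.", axis=1)
print("df6:\n{}\n".format(df6))
print("df3:\n{}\n".format(df3))   # df3 still contains the "No." column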
Index Objects and Data Access
The pandas Index object holds metadata describing an axis. When a Series or DataFrame is created, the array or sequence of labels is converted into an Index. The Index objects for the columns and rows of a DataFrame can be obtained as follows:
# data_structure.py
print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))
The output of these two lines of code is as follows:
df3.columns
Index(['note', 'No.'], dtype='object')
df3.index
RangeIndex(start=0, stop=7, step=1)
Please note that:
- An Index is not a set, so it can contain duplicate data
- The values of an Index object cannot be changed, so data can be accessed through it safely (see the sketch below)
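A minimal sketch of this immutability: trying to assign to an element of an Index raises a TypeError.
# sketch: Index objects are immutable
try:
    df3.index[0] = 100
except TypeError as e:
    print("TypeError: {}".format(e))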
The DataFrame provides the following two operators to access the data:
- loc: accesses data by the labels of rows and columns
- iloc: accesses data by the integer positions of rows and columns
For example:
# data_structure.py
print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))
The first line of code accesses the elements whose row labels are 0 and 1 and whose column label is "note". The second line accesses the elements whose row positions are 0 and 1 (for df3 the row labels and row positions happen to coincide, so both are 0 and 1 here, but they have different meanings) and whose column position is 0.
The output of these two lines of code is as follows:
Note C, D is:
0 C
1 D
Name: note, dtype: object
Note C, D is:
0 C
1 D
Name: note, dtype: object
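Both operators also support scalar access when a single row and a single column are given; a minimal sketch:
# sketch: scalar access with loc (by label) and iloc (by position)
print(df3.loc[0, "note"])    # C
print(df3.iloc[0, 0])        # C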
File operations
The pandas library provides a series of read_ functions for reading files in various formats, as listed below:
- read_csv
- read_table
- read_fwf
- read_clipboard
- read_excel
- read_hdf
- read_html
- read_json
- read_msgpack
- read_pickle
- read_sas
- read_sql
- read_stata
- read_feather
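Most of these readers have a matching to_* writer on the DataFrame side. Here is a minimal round-trip sketch (the file name out.csv is an illustrative choice):
# sketch: write a DataFrame to CSV and read it back
import pandas as pd
df = pd.DataFrame({"note": ["C", "D"], "weekday": ["Mon", "Tue"]})
df.to_csv("out.csv", index=False)
print(pd.read_csv("out.csv"))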
Read Excel files
Note: To read Excel files, you need to install another library: xlrd
Installation can be accomplished by pip:
sudo pip3 install xlrd
After installation, you can view the information of this library through pip:
$ pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: sjmachin@lexicon.net
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:
Next let's look at a simple example of reading Excel:
# file_operation.py
import pandas as pd
import numpy as np
df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))
The output of this code is as follows. Note that read_excel treats the first row of the file as the column header by default, which is why "C" and "Mon" appear as column names here:
df1:
C Mon
0 D Tue
1 E Wed
2 F Thu
3 G Fri
4 A Sat
5 B Sun
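If the file has no header row, you can tell read_excel so explicitly and supply your own column names; a minimal sketch (the names "note" and "weekday" are illustrative):
# sketch: keep the first row as data by declaring there is no header
df1 = pd.read_excel("data/test.xlsx", header=None, names=["note", "weekday"])
print("df1:\n{}\n".format(df1))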
Note: The code and data files in this article can be obtained from the Github repository mentioned at the beginning of this article.
Read CSV files
Next, let's look at an example of reading a CSV file.
The first CSV file is as follows:
$ cat test1.csv
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat
The way to read it is also simple:
# file_operation.py
df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))
Let's look at a second example. The contents of this file are as follows:
$ cat test2.csv
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat
Strictly speaking, this is not a CSV file, because its data is not separated by commas. In this case, we can read the file by specifying a delimiter, like this:
# file_operation.py
df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))
In fact, read_csv supports many parameters to adjust how a file is read, as shown in the following table:
Parameter | Description |
---|---|
path | File path |
sep or delimiter | Field separator |
header | Row number to use as the column names, default 0 (first row) |
index_col | Column number or name to use as the row index of the result |
names | List of column names for the result |
skiprows | Number of rows to skip from the start of the file |
na_values | Sequence of values to replace with NA |
comment | Character marking the start of an end-of-line comment |
parse_dates | Try to parse the data into datetime; defaults to False |
keep_date_col | If columns are joined to parse a date, keep the original columns; defaults to False |
converters | Column converters |
dayfirst | When parsing ambiguous dates, treat the day as coming first; defaults to False |
date_parser | Function used to parse dates |
nrows | Number of rows to read from the file |
iterator | Return a TextParser object for reading the content piecemeal |
chunksize | Size of the chunks to read |
skip_footer | Number of rows at the end of the file to ignore |
verbose | Print various parsing information |
encoding | File encoding |
squeeze | If the parsed data contains only one column, return a Series |
thousands | Thousands separator |
See here for a detailed description of the read_csv function: pandas.read_csv
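As a small illustration of combining these parameters, the following sketch (using the test1.csv file from above; df5 is a hypothetical name) skips the first line of the file and reads at most three rows:
# sketch: combining several read_csv parameters
df5 = pd.read_csv("data/test1.csv",
                  header=None,
                  names=["note", "weekday"],
                  skiprows=1,   # skip the first line of the file
                  nrows=3)      # read at most three rows
print("df5:\n{}\n".format(df5))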
Dealing with invalid values
The real world is not perfect, and the data we read often contains some invalid values. If these invalid values are not handled properly, they will cause great trouble for the program.
There are two main ways to deal with invalid values: to ignore these invalid values directly, or to replace invalid values with valid values.
Let's first create a data structure containing invalid values, and then use the pandas.isna function to confirm which values are invalid:
# process_na.py
import pandas as pd
import numpy as np
df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
[5.0, np.nan, np.nan, 8.0],
[9.0, np.nan, np.nan, 12.0],
[13.0, np.nan, 15.0, 16.0]])
print("df:\n{}\n".format(df));
print("df:\n{}\n".format(pd.isna(df)));****
The output of this code is as follows:
df:
0 1 2 3
0 1.0 NaN 3.0 4.0
1 5.0 NaN NaN 8.0
2 9.0 NaN NaN 12.0
3 13.0 NaN 15.0 16.0
pd.isna(df):
0 1 2 3
0 False True False False
1 False True True False
2 False True True False
3 False True False False
Ignore invalid values
We can discard invalid values through the pandas.DataFrame.dropna function:
# process_na.py
print("df.dropna():\n{}\n".format(df.dropna()));
Note: By default, dropna does not change the original data structure, but returns a new data structure. If you want to change the data itself directly, you can pass the parameter inplace = True when calling this function.
Since every row of our original structure contains at least one invalid value, dropping those rows leaves nothing behind, so the output of this line of code is an empty DataFrame:
df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []
We can also choose to drop only the columns in which every value is invalid:
# process_na.py
print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')));
Note: axis=1 indicates the column axis. how can take the value 'any' or 'all'; the default is the former.
The output of this line of code is as follows:
df.dropna(axis=1, how='all'):
0 2 3
0 1.0 3.0 4.0
1 5.0 NaN 8.0
2 9.0 NaN 12.0
3 13.0 15.0 16.0
Replace invalid values
We can also replace invalid values with valid ones through the fillna function, like this:
# process_na.py
print("df.fillna(1):\n{}\n".format(df.fillna(1)));
The output of this code is as follows:
df.fillna(1):
0 1 2 3
0 1.0 1.0 3.0 4.0
1 5.0 1.0 1.0 8.0
2 9.0 1.0 1.0 12.0
3 13.0 1.0 15.0 16.0
It may not make sense to replace all invalid values with the same value, so we can specify different fill values for different columns. For ease of operation, we first rename the rows and columns with the rename method before filling in:
# process_na.py
df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
          columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
          inplace=True)
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df))
The output of this code is as follows:
df:
col1 col2 col3 col4
index1 1.0 2.0 3.0 4.0
index2 5.0 2.0 7.0 8.0
index3 9.0 2.0 7.0 12.0
index4 13.0 2.0 15.0 16.0
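Besides constant values, fillna can also propagate existing values; a minimal sketch using forward fill, which copies the last valid value downwards (s is a hypothetical Series):
# sketch: forward fill copies the last valid value downwards
s = pd.Series([1.0, np.nan, np.nan, 4.0])
print(s.fillna(method='ffill'))   # yields 1.0, 1.0, 1.0, 4.0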
Processing strings
Data processing often involves strings, so let's look at pandas' string manipulation.
The str attribute of a Series provides a set of functions for handling strings. Moreover, these functions automatically skip invalid values.
Here are some examples. In the first set of data, we deliberately set up some strings containing spaces:
# process_string.py
import pandas as pd
s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5'])
print("s1.str.rstrip():\n{}\n".format(s1.str.rstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))
In this example, we see strip-style whitespace handling and a check of whether each string consists of digits. The output of this code is as follows:
s1.str.rstrip():
0 1
1 2
2 3
3 4
4 5
dtype: object
s1.str.strip():
0 1
1 2
2 3
3 4
4 5
dtype: object
s1.str.isdigit():
0 False
1 False
2 False
3 True
4 True
dtype: bool
Here are some other examples that show how to handle string capitalization, lowercase, and string length:
# process_string.py
s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))
The output of this code is as follows:
s2.str.lower():
0 stairway to heaven
1 eruption
2 freebird
3 comfortably numb
4 all along the watchtower
dtype: object
s2.str.upper():
0 STAIRWAY TO HEAVEN
1 ERUPTION
2 FREEBIRD
3 COMFORTABLY NUMB
4 ALL ALONG THE WATCHTOWER
dtype: object
s2.str.len():
0 18
1 8
2 8
3 16
4 24
dtype: int64
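The str attribute offers many more methods, such as substring tests and replacement; a minimal sketch using s2 from above:
# sketch: substring matching and replacement
print(s2.str.contains("bird"))     # True only for 'Freebird'
print(s2.str.replace(" ", "_"))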
Concluding remarks
This article is an introduction to pandas, so we have only covered the most basic operations. For more advanced topics such as the following, please refer to the pandas documentation:
- MultiIndex/Advanced Indexing
- Merge, join, concatenate
- Computational tools