Introduction to Python Data Analysis and Practical Learning Resources

pandas is a Python package and a very common foundational library when doing machine learning programming in Python. This article is an introduction to it.

pandas provides fast, flexible, and expressive data structures designed to make working with "relational" or "labeled" data simple and intuitive. It aims to be the fundamental high-level building block for practical data analysis in Python.

Introduction

pandas is well suited to many different kinds of data, including:

  • Tabular data with heterogeneously typed columns, such as SQL tables or Excel data
  • Ordered and unordered (not necessarily fixed-frequency) time series data
  • Arbitrary matrix data with row and column labels (homogeneously or heterogeneously typed)
  • Any other form of observational/statistical data sets

Since pandas is a Python package, you first need a Python environment on your machine. Setting one up is outside the scope of this article; please refer to online resources.

For information on how to obtain pandas, please refer to the instructions on the official website: pandas Installation.

Usually, we can perform the installation through pip:

sudo pip3 install pandas

Or install pandas through conda:

conda install pandas

At the time of writing (February 2018), the latest version of pandas is v0.22.0 (released December 29, 2017).
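
Once installed, a quick way to confirm which version you have is to check it from Python (a minimal sketch, assuming the installation above succeeded):

```python
import pandas as pd

# Print the version string of the installed pandas package
print(pd.__version__)
```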

I have put the source code and test data for this article on GitHub: pandas_tutorial. Readers can access it there.

In addition, pandas is often used together with NumPy, and the source code in this article uses NumPy as well.

Readers are advised to have some familiarity with NumPy before reading this article. I have written a basic NumPy tutorial before; see here: NumPy tutorial of Python machine learning library.

Core data structure

The core of pandas is Series and DataFrame.

These two types of data structures are compared as follows:

Name       Dimensions  Description
Series     1           Homogeneously typed array with labels
DataFrame  2           Labeled table structure of variable size that can contain heterogeneously typed columns

A DataFrame can be seen as a container for Series, that is, a DataFrame can contain several Series.
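
A small sketch illustrating this relationship: selecting a single column of a DataFrame gives back a Series.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.arange(16).reshape(4, 4))

# Selecting one column of the DataFrame yields a Series
col = df[0]
print(type(col))   # the type is pandas Series
print(col.values)  # the first column: 0, 4, 8, 12
```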

Note: Before version 0.20.0, there was also a three-dimensional data structure named Panel. Incidentally, this is where the pandas name comes from: panel data. But this data structure has been deprecated because it was seldom used.

Series

Since a Series is a one-dimensional data structure, we can create one directly from an array, like this:

# data_structure.py

import pandas as pd
import numpy as np

series1 = pd.Series([1, 2, 3, 4])
print("series1:\n{}\n".format(series1))

The output of this code is as follows:

series1:
0    1
1    2
2    3
3    4
dtype: int64

Note the following about this output:

  • The data is printed in the second column; the first column is the index of the data, which is called the Index in pandas.
  • The last line of the output shows the type of the data in the Series; here the data is of type int64.

We can print out the data and index in Series separately:

# data_structure.py

print("series1.values: {}\n".format(series1.values))

print("series1.index: {}\n".format(series1.index))

The output of these two lines of code is as follows:

series1.values: [1 2 3 4]

series1.index: RangeIndex(start=0, stop=4, step=1)

If not specified (as above), the index takes the form [0, N-1]. But we can also specify an index when creating the Series. The index does not need to consist of integers; it can be any type of data, such as strings. For example, here we map the seven musical notes to seven letters. The purpose of the index is to let us retrieve the corresponding data through it, as follows:

# data_structure.py

series2 = pd.Series([1, 2, 3, 4, 5, 6, 7],
    index=["C", "D", "E", "F", "G", "A", "B"])
print("series2:\n{}\n".format(series2))
print("E is {}\n".format(series2["E"]))

The output of this code is as follows:

series2:
C    1
D    2
E    3
F    4
G    5
A    6
B    7
dtype: int64

E is 3
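
Besides arrays, a Series can also be created from a dict, in which case the keys become the index (a small sketch):

```python
import pandas as pd

# Dict keys become the index of the Series, dict values become the data
series3 = pd.Series({"C": 1, "D": 2, "E": 3})
print(series3["D"])  # 2
```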

DataFrame

Let's take a look at creating a DataFrame. We can build a 4x4 matrix through NumPy and create a DataFrame from it, like this:

# data_structure.py

df1 = pd.DataFrame(np.arange(16).reshape(4,4))
print("df1:\n{}\n".format(df1))

The output of this code is as follows:

df1:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15

From this output, we can see that the default index and column names are in the form of [0, N-1].

We can specify column names and indexes when creating a DataFrame, like this:

# data_structure.py

df2 = pd.DataFrame(np.arange(16).reshape(4,4),
    columns=["column1", "column2", "column3", "column4"],
    index=["a", "b", "c", "d"])
print("df2:\n{}\n".format(df2))

The output of this code is as follows:

df2:
   column1  column2  column3  column4
a        0        1        2        3
b        4        5        6        7
c        8        9       10       11
d       12       13       14       15

We can also directly specify column data to create a DataFrame:

# data_structure.py

df3 = pd.DataFrame({"note" : ["C", "D", "E", "F", "G", "A", "B"],
    "weekday": ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]})
print("df3:\n{}\n".format(df3))

The output of this code is as follows:

df3:
  note weekday
0    C     Mon
1    D     Tue
2    E     Wed
3    F     Thu
4    G     Fri
5    A     Sat
6    B     Sun

Please note that:

  • Different columns of the DataFrame can be of different data types
  • If you create a DataFrame with Series arrays, each Series will be a row, not a column

For example:

# data_structure.py

noteSeries = pd.Series(["C", "D", "E", "F", "G", "A", "B"],
    index=[1, 2, 3, 4, 5, 6, 7])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"],
    index=[1, 2, 3, 4, 5, 6, 7])
df4 = pd.DataFrame([noteSeries, weekdaySeries])
print("df4:\n{}\n".format(df4))

The output of df4 is as follows:

df4:
     1    2    3    4    5    6    7
0    C    D    E    F    G    A    B
1  Mon  Tue  Wed  Thu  Fri  Sat  Sun
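
If you want each Series to be a column instead, two options (a sketch, not the only way) are to transpose the result with .T, or to pass a dict of Series:

```python
import pandas as pd

noteSeries = pd.Series(["C", "D", "E"], index=[1, 2, 3])
weekdaySeries = pd.Series(["Mon", "Tue", "Wed"], index=[1, 2, 3])

# Option 1: build row-wise as before, then transpose so the Series become columns
df_t = pd.DataFrame([noteSeries, weekdaySeries]).T

# Option 2: a dict of Series maps each key to a column directly
df_dict = pd.DataFrame({"note": noteSeries, "weekday": weekdaySeries})
print(df_dict)
```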

We can add or delete columns of a DataFrame as follows:

# data_structure.py

df3["No."] = pd.Series([1, 2, 3, 4, 5, 6, 7])
print("df3:\n{}\n".format(df3))

del df3["weekday"]
print("df3:\n{}\n".format(df3))

The output of this code is as follows:

df3:
  note weekday  No.
0    C     Mon    1
1    D     Tue    2
2    E     Wed    3
3    F     Thu    4
4    G     Fri    5
5    A     Sat    6
6    B     Sun    7

df3:
  note  No.
0    C    1
1    D    2
2    E    3
3    F    4
4    G    5
5    A    6
6    B    7
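
Besides del, the drop method also removes columns; unlike del, it returns a new DataFrame by default and leaves the original untouched (a small sketch):

```python
import pandas as pd

df = pd.DataFrame({"note": ["C", "D", "E"], "No.": [1, 2, 3]})

# drop returns a new DataFrame; pass inplace=True to modify df itself instead
df2 = df.drop(columns=["No."])
print(list(df2.columns))  # ['note']
print(list(df.columns))   # ['note', 'No.'] -- the original is unchanged
```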

Index Objects and Data Access

The pandas Index object contains metadata describing an axis. When a Series or DataFrame is created, the array or sequence of labels is converted into an Index. The Index objects for the columns and rows of a DataFrame can be obtained as follows:

# data_structure.py

print("df3.columns\n{}\n".format(df3.columns))
print("df3.index\n{}\n".format(df3.index))

The output of these two lines of code is as follows:

df3.columns
Index(['note', 'No.'], dtype='object')

df3.index
RangeIndex(start=0, stop=7, step=1)

Please note that:

  • Unlike a set, an Index can contain duplicate values
  • An Index object is immutable, which makes it safe to share between data structures

The DataFrame provides the following two operators to access the data:

  • loc: access data by row and column labels
  • iloc: access data by row and column integer positions

For example:

# data_structure.py

print("Note C, D is:\n{}\n".format(df3.loc[[0, 1], "note"]))
print("Note C, D is:\n{}\n".format(df3.iloc[[0, 1], 0]))

The first line of code accesses the elements whose row labels are 0 and 1 and whose column label is "note". The second line accesses the elements whose row positions are 0 and 1 (for df3 the row labels and row positions happen to be identical, so both are 0 and 1 here, but they have different meanings) and whose column position is 0.

The output of these two lines of code is as follows:

Note C, D is:
0    C
1    D
Name: note, dtype: object

Note C, D is:
0    C
1    D
Name: note, dtype: object
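
loc and iloc also accept slices, with one difference worth noting: label-based loc slices include the end label, while position-based iloc slices exclude the end position. A small sketch:

```python
import pandas as pd

df = pd.DataFrame({"note": ["C", "D", "E", "F"],
                   "weekday": ["Mon", "Tue", "Wed", "Thu"]})

# loc slicing is label-based and INCLUDES the end label
print(df.loc[0:2, "note"])   # rows labeled 0, 1 and 2

# iloc slicing is position-based and EXCLUDES the end position
print(df.iloc[0:2, 0])       # rows at positions 0 and 1 only
```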

File operation

The pandas library provides a series of read_* functions for reading files in various formats, as follows:

  • read_csv
  • read_table
  • read_fwf
  • read_clipboard
  • read_excel
  • read_hdf
  • read_html
  • read_json
  • read_msgpack
  • read_pickle
  • read_sas
  • read_sql
  • read_stata
  • read_feather

Read Excel files

Note: To read Excel files, you need to install another library: xlrd

Installation can be accomplished by pip:

sudo pip3 install xlrd

After installation, you can view the information of this library through pip:

$  pip3 show xlrd
Name: xlrd
Version: 1.1.0
Summary: Library for developers to extract data from Microsoft Excel (tm) spreadsheet files
Home-page: http://www.python-excel.org/
Author: John Machin
Author-email: sjmachin@lexicon.net
License: BSD
Location: /Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/site-packages
Requires:

Next let's look at a simple example of reading Excel:

# file_operation.py

import pandas as pd
import numpy as np

df1 = pd.read_excel("data/test.xlsx")
print("df1:\n{}\n".format(df1))

The output, which reflects the content of this Excel file, is as follows:

df1:
   C  Mon
0  D  Tue
1  E  Wed
2  F  Thu
3  G  Fri
4  A  Sat
5  B  Sun

Note: The code and data files in this article can be obtained from the Github repository mentioned at the beginning of this article.

Read CSV files

Next, let's look at an example of reading a CSV file.

The first CSV file is as follows:

$ cat test1.csv
C,Mon
D,Tue
E,Wed
F,Thu
G,Fri
A,Sat

The way to read it is also simple:

# file_operation.py

df2 = pd.read_csv("data/test1.csv")
print("df2:\n{}\n".format(df2))

Let's look at the second example. The contents of this document are as follows:

$ cat test2.csv
C|Mon
D|Tue
E|Wed
F|Thu
G|Fri
A|Sat

Strictly speaking, this is not a CSV file, because its data is not separated by commas. In this case, we can read the file by specifying a delimiter, like this:

# file_operation.py

df3 = pd.read_csv("data/test2.csv", sep="|")
print("df3:\n{}\n".format(df3))
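
Note that these files have no header row, yet by default read_csv treats the first line as column names. A small sketch, using an in-memory buffer instead of a file for self-containedness, showing the sep, header, and names parameters together:

```python
import io
import pandas as pd

# Simulated file content in memory; the real files live under data/
data = "C|Mon\nD|Tue\nE|Wed\n"

# header=None stops the first line from being used as column names;
# names supplies our own column names instead
df = pd.read_csv(io.StringIO(data), sep="|", header=None,
                 names=["note", "weekday"])
print(df)
```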

In fact, read_csv supports many parameters that control how the file is read, as shown in the following table:

parameter         Description
path              File path
sep or delimiter  Field separator
header            Row number to use for the column names; default 0 (the first row)
index_col         Column number or name to use as the row index of the result
names             List of column names for the result
skiprows          Number of rows to skip at the start of the file
na_values         Sequence of values to interpret as NA
comment           Character marking the start of a comment; the rest of the line is ignored
parse_dates       Try to parse the data as datetime; default False
keep_date_col     If columns are joined to parse a date, keep the original columns; default False
converters        Dict of column converters
dayfirst          Treat the day as coming first when parsing ambiguous dates; default False
date_parser       Function used to parse dates
nrows             Number of rows to read from the file
iterator          Return a TextParser object for reading the file piecemeal
chunksize         Size of the chunks to read
skip_footer       Number of rows to ignore at the end of the file
verbose           Print additional information about the parsing
encoding          File encoding
squeeze           If the parsed data contains only one column, return a Series
thousands         Thousands separator

See here for a detailed description of the read_csv function: pandas.read_csv

Dealing with invalid values

The real world is not perfect, and the data we read often carries some invalid values. If these invalid values are not handled properly, it will cause great interference to the program.

There are two main ways to deal with invalid values: to ignore these invalid values directly, or to replace invalid values with valid values.

Let's first create a data structure containing invalid values. Then we use the pandas.isna function to identify which values are invalid:

# process_na.py

import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                  [5.0, np.nan, np.nan, 8.0],
                  [9.0, np.nan, np.nan, 12.0],
                  [13.0, np.nan, 15.0, 16.0]])

print("df:\n{}\n".format(df))
print("df:\n{}\n".format(pd.isna(df)))

The output of this code is as follows:

df:
      0   1     2     3
0   1.0 NaN   3.0   4.0
1   5.0 NaN   NaN   8.0
2   9.0 NaN   NaN  12.0
3  13.0 NaN  15.0  16.0

df:
       0     1      2      3
0  False  True  False  False
1  False  True   True  False
2  False  True   True  False
3  False  True  False  False
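
Since isna returns a boolean DataFrame, a common trick (a small sketch) is to chain sum onto it to count the invalid values per column:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, 3.0, 4.0],
                   [5.0, np.nan, np.nan, 8.0]])

# True counts as 1, so summing the boolean frame counts NaNs per column
print(pd.isna(df).sum())
```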

Ignore invalid values

We can discard invalid values through the pandas.DataFrame.dropna function:

# process_na.py

print("df.dropna():\n{}\n".format(df.dropna()))

Note: By default, dropna does not change the original data structure, but returns a new data structure. If you want to change the data itself directly, you can pass the parameter inplace = True when calling this function.

By default, dropna discards any row that contains an invalid value. Since every row of our original structure contains one, nothing remains, so the output of this line of code is an empty DataFrame:

df.dropna():
Empty DataFrame
Columns: [0, 1, 2, 3]
Index: []

We can also choose to discard only the columns that are invalid in their entirety:

# process_na.py

print("df.dropna(axis=1, how='all'):\n{}\n".format(df.dropna(axis=1, how='all')))

Note: axis=1 indicates the column axis. how can take the value 'any' or 'all'; the default is the former.

The output of this line of code is as follows:

df.dropna(axis=1, how='all'):
      0     2     3
0   1.0   3.0   4.0
1   5.0   NaN   8.0
2   9.0   NaN  12.0
3  13.0  15.0  16.0

Replace invalid values

We can also replace invalid values with valid ones using the fillna function, like this:

# process_na.py

print("df.fillna(1):\n{}\n".format(df.fillna(1)))

The output of this code is as follows:

df.fillna(1):
      0    1     2     3
0   1.0  1.0   3.0   4.0
1   5.0  1.0   1.0   8.0
2   9.0  1.0   1.0  12.0
3  13.0  1.0  15.0  16.0

It may not make sense to replace all invalid values with the same value, so we can specify different fill values for different columns. For ease of operation, we first rename the rows and columns with the rename method before filling:

# process_na.py

df.rename(index={0: 'index1', 1: 'index2', 2: 'index3', 3: 'index4'},
          columns={0: 'col1', 1: 'col2', 2: 'col3', 3: 'col4'},
          inplace=True)
df.fillna(value={'col2': 2}, inplace=True)
df.fillna(value={'col3': 7}, inplace=True)
print("df:\n{}\n".format(df))

The output of this code is as follows:

df:
        col1  col2  col3  col4
index1   1.0   2.0   3.0   4.0
index2   5.0   2.0   7.0   8.0
index3   9.0   2.0   7.0  12.0
index4  13.0   2.0  15.0  16.0
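
Another common strategy is forward filling, where each invalid value is replaced by the last valid value above it in the same column; the ffill method does this (a small sketch):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame([[1.0, np.nan, 3.0],
                   [np.nan, 2.0, np.nan],
                   [7.0, np.nan, 9.0]])

# Each NaN takes the last valid value above it in the same column;
# a NaN in the first row has nothing above it, so it stays NaN
print(df.ffill())
```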

Processing strings

String processing is often involved in data, so let's look at pandas for string manipulation.

The str attribute of a Series provides a set of functions for handling strings. Moreover, these functions handle invalid values automatically.

Here are some examples. In the first set of data, we deliberately set up some strings containing spaces:

# process_string.py

import pandas as pd

s1 = pd.Series([' 1', '2 ', ' 3 ', '4', '5'])
print("s1.str.lstrip():\n{}\n".format(s1.str.lstrip()))
print("s1.str.strip():\n{}\n".format(s1.str.strip()))
print("s1.str.isdigit():\n{}\n".format(s1.str.isdigit()))

This example shows stripping whitespace from the strings and checking whether each string consists of digits (note that strings still containing spaces do not count as digits). The output of this code is as follows:

s1.str.lstrip():
0     1
1    2
2    3
3     4
4     5
dtype: object

s1.str.strip():
0    1
1    2
2    3
3    4
4    5
dtype: object

s1.str.isdigit():
0    False
1    False
2    False
3     True
4     True
dtype: bool

Here are some other examples, showing how to convert strings to lowercase and uppercase and how to get string lengths:

# process_string.py

s2 = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird',
                    'Comfortably Numb', 'All Along the Watchtower'])
print("s2.str.lower():\n{}\n".format(s2.str.lower()))
print("s2.str.upper():\n{}\n".format(s2.str.upper()))
print("s2.str.len():\n{}\n".format(s2.str.len()))

The output of this code is as follows:

s2.str.lower():
0          stairway to heaven
1                    eruption
2                    freebird
3            comfortably numb
4    all along the watchtower
dtype: object

s2.str.upper():
0          STAIRWAY TO HEAVEN
1                    ERUPTION
2                    FREEBIRD
3            COMFORTABLY NUMB
4    ALL ALONG THE WATCHTOWER
dtype: object

s2.str.len():
0    18
1     8
2     8
3    16
4    24
dtype: int64
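
The str accessor also supports vectorized searching and replacement; a small sketch:

```python
import pandas as pd

s = pd.Series(['Stairway to Heaven', 'Eruption', 'Freebird'])

# contains returns a boolean Series marking which strings match
print(s.str.contains('to'))          # True, False, False

# replace performs the substitution in every string
print(s.str.replace('bird', 'fall'))
```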

Concluding remarks

This article is an introduction to pandas, so we have covered only the most basic operations. For more advanced topics, such as:

  • MultiIndex/advanced indexing
  • Merge, join, concatenate
  • Computational tools

please refer to the official pandas documentation.


Added by gin on Sat, 12 Oct 2019 20:23:32 +0300