Learning notes on the NumPy library for Python machine learning

Introduction to Numpy Library

NumPy is a powerful Python library used mainly for computations on multidimensional arrays. The name combines two words: Numerical and Python. NumPy provides a large number of functions and operations that make numerical computation easy, and it is widely used in data analysis and machine learning. It has the following characteristics:

Some NumPy operations can use several CPU cores. NumPy does not parallelize every computation automatically; multithreading mainly comes from the underlying BLAS/LAPACK libraries used by linear algebra routines such as matrix multiplication.
The core of NumPy is written in C, and many operations release the GIL (global interpreter lock) internally. Array operations therefore run in compiled code rather than in the Python interpreter loop, and they are much faster than equivalent pure Python code.
It provides a powerful N-dimensional array object, ndarray (something similar to a list).
It provides practical linear algebra, Fourier transform and random number generation functions.

Performance comparison between Numpy array and Python list:

For example, we want to square each element in a Numpy array and Python list. Then the code is as follows:
python list:

import numpy as np
import time
t1=time.time()
a=[]
for x in range(1000000):
    a.append(x**2)
t2=time.time()
print('python List time consuming:',t2-t1)

result:

python List time consuming: 0.34757137298583984

numpy array:

t3=time.time()
b=np.arange(1000000)**2
t4=time.time()
print('numpy Array time consuming:',t4-t3)

result:

numpy Array time consuming: 0.003968954086303711
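
A single time.time() measurement can be noisy. As a rough sketch, the same comparison can be repeated with the standard timeit module, keeping the best run. The array size N and the helper names square_with_list and square_with_numpy are chosen here for illustration only, and the timings depend on the machine.

```python
import timeit

import numpy as np

N = 100_000  # smaller than the 1,000,000 used above so the demo runs quickly


def square_with_list():
    # Pure Python: the interpreter handles one element at a time
    return [x ** 2 for x in range(N)]


def square_with_numpy():
    # Vectorized: the loop runs inside NumPy's compiled code
    return np.arange(N, dtype=np.int64) ** 2


# Repeat each measurement and keep the fastest run to reduce noise
list_time = min(timeit.repeat(square_with_list, number=5, repeat=3))
numpy_time = min(timeit.repeat(square_with_numpy, number=5, repeat=3))

print('python list:', list_time)
print('numpy array:', numpy_time)

# Both approaches produce the same values
same_values = square_with_list() == square_with_numpy().tolist()
print(same_values)  # True
```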

Tutorial address:

Official website: https://docs.scipy.org/doc/numpy/user/quickstart.html .
Chinese documents: https://www.numpy.org.cn/user_guide/quickstart_tutorial/index.html .

Contents:

Basic usage of Numpy array
Numpy array operation
Numpy index and slice
Deep copy and shallow copy
File operation
NAN and INF value processing
np.random module
Axis understanding
Universal functions

Basic usage of NumPy array

Numpy is a Python scientific computing library used to quickly process arrays of arbitrary dimensions.
Numpy provides an N-dimensional array type ndarray, which describes a collection of "items" of the same type.
Numpy.ndarray supports vectorization.
Numpy is written in C and releases the GIL for many operations, so its array operations are not limited by the speed of the Python interpreter.

Array in numpy:

The use of arrays in Numpy is very similar to lists in Python. The differences between them are as follows:

Multiple data types can be stored in a list. For example, a = [1, 'a'] is allowed, while arrays can only store the same data type.
Arrays can be multidimensional. When all the data in multidimensional arrays are numerical types, they are equivalent to matrices in linear algebra and can operate on each other.

Create an array (np.ndarray object):

The data type of the array in Numpy is called ndarray.

Generated from lists in Python:

import numpy as np
a1 = np.array([1,2,3,4])
print(a1)
print(type(a1))

Generate with np.arange. np.arange is used much like range in Python:

import numpy as np
a2 = np.arange(2,21,2)
print(a2)

Use np.random to generate an array of random numbers:

import numpy as np
a1 = np.random.random((2,2)) # Generates an array of random numbers with 2 rows and 2 columns; the shape is passed as a tuple
a2 = np.random.randint(0,10,size=(3,3)) # An array of 3 rows and 3 columns of random integers from 0 to 9 (10 is excluded)

Use the function to generate a special array:

import numpy as np
a1 = np.zeros((2,2)) #Generate an array of 2 rows and 2 columns with all elements of 0
a2 = np.ones((3,2)) #Generate an array of 3 rows and 2 columns with all elements being 1
a3 = np.full((2,2),8) #Generate an array of 2 rows and 2 columns with all elements of 8
a4 = np.eye(3) #Generate a 3x3 matrix with 1 on the main diagonal and 0 elsewhere

Common properties of ndarray:

ndarray.dtype:

Because an array can only store one data type, you can get the data type of its elements through dtype. Here are the common data types of ndarray.dtype:

Data type	Description	Type code
bool	Boolean type (True or False) stored in one byte	'?'
int8	Integer, 8 bits (-128 to 127)	'i1'
int16	Integer, 16 bits (-32768 to 32767)	'i2'
int32	Integer, 32 bits (-2147483648 to 2147483647)	'i4'
int64	Integer, 64 bits (-9223372036854775808 to 9223372036854775807)	'i8'
uint8	Unsigned integer, 0 to 255	'u1'
uint16	Unsigned integer, 0 to 65535	'u2'
uint32	Unsigned integer, 0 to 2**32 - 1	'u4'
uint64	Unsigned integer, 0 to 2**64 - 1	'u8'
float16	Half-precision floating point number: 16 bits (1 sign bit, 5 exponent bits, 10 mantissa bits)	'f2'
float32	Single-precision floating point number: 32 bits (1 sign bit, 8 exponent bits, 23 mantissa bits)	'f4'
float64	Double-precision floating point number: 64 bits (1 sign bit, 11 exponent bits, 52 mantissa bits)	'f8'
complex64	Complex number whose real and imaginary parts are each a 32-bit floating point number	'c8'
complex128	Complex number whose real and imaginary parts are each a 64-bit floating point number	'c16'
object_	Python object	'O'
bytes_ (string_ before NumPy 2.0)	Byte string	'S'
str_ (unicode_ before NumPy 2.0)	Unicode string	'U'

We can see that Numpy has many more numeric types than Python's built-in ones, because Numpy is designed to process massive data efficiently. For example, if you want to store tens of billions of numbers and none of them exceeds 127 (so each fits in one signed byte), you can set dtype to int8 (or uint8 for values from 0 to 255), which needs only one eighth of the memory used by int64. The type related operations are as follows:

default data type

import numpy as np
a=np.arange(10)
print(a)
print(a.dtype)
# On Windows with NumPy versions before 2.0, the default is int32
# On 64-bit macOS and Linux (and on 64-bit Windows since NumPy 2.0), the default is int64

result:

[0 1 2 3 4 5 6 7 8 9]
int32

Specify the data type for each element

b=np.array([1,2,3,4,5],dtype=np.int8)#Specify the type of each element
print(b)
print(b.dtype)

result:

[1 2 3 4 5]
int8

f=np.array(['a','b'],dtype='S')# 'S' is the type code for byte strings
print(f)
print(f.dtype)

result:

[b'a' b'b']
|S1

Storage object

class Person:
    def __init__(self,name,age):
        self.name=name
        self.age=age
d=np.array([Person('Zhang San',18),Person('Li Si',18)])
print(d)
print(d.dtype)

result:

[<__main__.Person object at 0x000001B757D7CEE0>
 <__main__.Person object at 0x000001B757D7C910>]
object

Modify the data type of each element

import numpy as np
a1 = np.array([1,2,3])
print(a1.dtype) # On Windows with NumPy before 2.0, the default is int32
# Modify dtype below
a2 = a1.astype(np.int64) # astype does not modify the array itself, but returns the modified result
print(a2.dtype)

result:

int32
int64

ndarray.size: get the number of elements of the array

import numpy as np
a1 = np.array([[1,2,3],[4,5,6]])
print(a1.size) # Prints 6, because there are 6 elements in total

ndarray.ndim: Dimension of array

a1 = np.array([1,2,3])
print(a1.ndim) # Dimension is 1
a2 = np.array([[1,2,3],[4,5,6]])
print(a2.ndim) # Dimension is 2
a3 = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[10,11,12]]])
print(a3.ndim) # Dimension is 3

ndarray.shape:

An array created by numpy has a shape attribute. It is a tuple that gives the length of each dimension. Sometimes we need to know the length of one particular dimension.

Code example 1:

a1 = np.array([1,2,3])
print(a1.shape) # Output (3,), which means a one-dimensional array with 3 elements

a2 = np.array([[1,2,3],[4,5,6]])
print(a2.shape) # Output (2,3), which means a two-dimensional array with 2 rows and 3 columns

a3 = np.array([
    [
        [1,2,3],
        [4,5,6]
    ],
    [
        [7,8,9],
        [10,11,12]
    ]
])
print(a3.shape) # Output (2,2,3), which means a three-dimensional array with two blocks, each of which has two rows and three columns

Code example 2:

Two dimensional case
>>> import numpy as np
>>> y = np.array([[1,2,3],[4,5,6]])
>>> print(y)
[[1 2 3]
 [4 5 6]]
>>> print(y.shape)
(2, 3)
>>> print(y.shape[0])
2
>>> print(y.shape[1])
3
As you can see, y is a two-dimensional array with two rows and three columns. y.shape[0] is the number of rows and y.shape[1] is the number of columns.

Three dimensional situation
>>> x  = np.array([[[1,2,3],[4,5,6]],[[7,8,9],[0,1,2]],[[3,4,5],[6,7,8]]])
>>> print(x)
[[[1 2 3]
  [4 5 6]]

 [[7 8 9]
  [0 1 2]]

 [[3 4 5]
  [6 7 8]]]
>>> print(x.shape)
(3, 2, 3)
>>> print(x.shape[0])
3
>>> print(x.shape[1])
2
>>> print(x.shape[2])
3
As you can see, x is a three-dimensional array containing three two-dimensional arrays, each with two rows and three columns. x.shape[0] is the number of two-dimensional arrays, x.shape[1] is the number of rows in each of them, and x.shape[2] is the number of columns in each of them.

Summary
As you can see, shape[0] is the length of the outermost dimension and shape[1] is the length of the next dimension inward. As the index increases, the dimensions go from outside to inside.

In addition, we can also use ndarray.reshape to change the shape of the array. The example code is as follows:

a1 = np.arange(12) # Generate a one-dimensional array with 12 elements
print(a1)

a2 = a1.reshape((3,4)) # Becomes a two-dimensional array with 3 rows and 4 columns
print(a2)

a3 = a1.reshape((2,3,2)) # Becomes a three-dimensional array with 2 blocks, each with 3 rows and 2 columns
print(a3)

a4 = a2.reshape((12,)) # Turn the two-dimensional a2 into a one-dimensional array with 12 elements; (12,) is a one-element tuple, which needs the trailing comma
print(a4)

a5 = a2.flatten() # No matter how many dimensions a2 has, the result is a one-dimensional array
print(a5)

Brief summary:

The number of arguments passed to reshape sets the number of dimensions. For example, reshape(1,12) returns a two-dimensional array with 1 row and 12 columns.
The number of elements in the shape tuple equals the number of dimensions of the array.

ndarray.itemsize:

The size of each element in the array, in bytes. For example, the following code:

a1 = np.array([1,2,3],dtype=np.int32)
print(a1.itemsize) # Prints 4: each element has 32 bits, and at 8 bits per byte that is 32 / 8 = 4 bytes
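
As a small illustration of why the element size matters, the sketch below compares the memory used by the same number of elements stored as int8 and as int64. It relies on the nbytes attribute, which equals itemsize multiplied by size; the array length of 1000 is arbitrary.

```python
import numpy as np

small = np.zeros(1000, dtype=np.int8)
large = np.zeros(1000, dtype=np.int64)

# nbytes is the total size of the data buffer: itemsize * size
print(small.itemsize, small.nbytes)  # 1 1000
print(large.itemsize, large.nbytes)  # 8 8000
```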

Numpy array operation

Calculation of arrays and numbers:

In the Python list, if you want to add a number to all the elements in the list, you can either use the map function or loop the whole list. However, the array in NumPy can be operated directly on the array. The example code is as follows:

import numpy as np
a1 = np.random.random((3,4))
print(a1)
# If you want to multiply all elements on the a1 array by 10, you can do so by
a2 = a1*10 # The operator is applied to every element
print(a2)
# You can also use round to keep only 2 decimal places for all elements
a3 = a2.round(2)

The above example is multiplication. In fact, addition, subtraction and division are similar.

Array and array calculation:

Related operations between corresponding elements:

import numpy as np
a1 = np.arange(0,24).reshape((3,8))
a2 = np.random.randint(1,10,size=(3,8))
a3 = a1 * a2 # Addition, subtraction, multiplication and division all work on the elements at corresponding positions
print(a1)
print(a2)
print(a3)

result:

[[ 0  1  2  3  4  5  6  7]
 [ 8  9 10 11 12 13 14 15]
 [16 17 18 19 20 21 22 23]]
[[5 8 7 1 3 3 8 5]
 [1 2 4 3 7 5 3 8]
 [1 7 2 4 1 3 3 4]]
[[  0   8  14   3  12  15  48  35]
 [  8  18  40  33  84  65  42 120]
 [ 16 119  36  76  20  63  66  92]]

Operations between arrays with the same number of rows and only 1 column:

import numpy as np
a1 = np.random.randint(10,20,size=(3,8)) #3 rows and 8 columns
a2 = np.random.randint(1,10,size=(3,1)) #3 rows and 1 column
a3 = a1 - a2 #The number of rows is the same, and a2 has only one column, which can operate on each other
print(a1)
print('='*30)
print(a2)
print('='*30)
print(a3)

result:

[[11 17 14 14 16 19 11 16]
 [10 14 15 10 18 11 19 10]
 [19 19 18 17 17 18 18 10]]
==============================
[[4]
 [3]
 [1]]
==============================
[[ 7 13 10 10 12 15  7 12]
 [ 7 11 12  7 15  8 16  7]
 [18 18 17 16 16 17 17  9]]

It is not difficult to see that every element in a row of a1 has the single value from the same row of a2 subtracted from it.

Operations between arrays with the same number of columns and only 1 row:

import numpy as np
a1 = np.random.randint(10,20,size=(3,8)) #3 rows and 8 columns
a2 = np.random.randint(1,10,size=(1,8))
a3 = a1 - a2
print(a1)
print('='*30)
print(a2)
print('='*30)
print(a3)

result:

[[19 10 10 10 19 11 16 11]
 [10 17 10 13 17 18 10 15]
 [14 10 14 11 16 18 15 17]]
==============================
[[2 4 1 9 6 3 3 8]]
==============================
[[17  6  9  1 13  8 13  3]
 [ 8 13  9  4 11 15  7  7]
 [12  6 13  2 10 15 12  9]]

It is not difficult to see that every row of a1 has the values of a2 subtracted from it column by column.

Broadcasting principle:

Two arrays are broadcast compatible if, comparing their shapes from the trailing (rightmost) dimension backwards, each pair of axis lengths is either equal or one of the two is 1. Broadcasting is carried out over the dimensions that are missing or have length 1. See the following case analysis:

Can an array with shape of (3,8,2) operate with an array of (8,3)?
Analysis: No. Comparing (3,8,2) and (8,3) from back to front, the last dimensions 2 and 3 are not equal and neither of them is 1, so the operation cannot be carried out.
Can an array with shape of (3,8,2) operate with an array of (8,1)?
Analysis: Yes. Although 2 and 1 in (3,8,2) and (8,1) are not equal, one of them is 1, and the next pair, 8 and 8, is equal, so the operation can be carried out.
Can an array with shape of (3,1,8) operate with an array of (8,1)?
Analysis: Yes. In (3,1,8) and (8,1), 8 and 1 are not equal and 1 and 8 are not equal, but in each pair one of the two lengths is 1, so the operation can be carried out.
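
The three cases above can be checked without building any arrays. The sketch below assumes NumPy 1.20 or later, where np.broadcast_shapes is available; the helper name can_broadcast is made up for this illustration.

```python
import numpy as np


def can_broadcast(shape_a, shape_b):
    """Return the broadcast result shape, or None if the shapes are incompatible."""
    try:
        return np.broadcast_shapes(shape_a, shape_b)
    except ValueError:
        # NumPy raises ValueError when the shapes cannot be broadcast together
        return None


print(can_broadcast((3, 8, 2), (8, 3)))  # None
print(can_broadcast((3, 8, 2), (8, 1)))  # (3, 8, 2)
print(can_broadcast((3, 1, 8), (8, 1)))  # (3, 8, 8)
```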

Code example:

a1=np.array([1,2,3,4])
a2=np.random.randint(1,10,size=(8,1))
a3=a1+a2
print(a1)
print('='*30)
print(a2)
print('='*30)
print(a3)
print('='*30)
print(a3.shape)

result:

[1 2 3 4]
==============================
[[7]
 [1]
 [7]
 [2]
 [8]
 [1]
 [6]
 [3]]
==============================
[[ 8  9 10 11]
 [ 2  3  4  5]
 [ 8  9 10 11]
 [ 3  4  5  6]
 [ 9 10 11 12]
 [ 2  3  4  5]
 [ 7  8  9 10]
 [ 4  5  6  7]]
==============================
(8, 4)

Operation of array shape:

Through some functions, it is very convenient to operate the shape of the array.

reshape and resize methods:

reshape converts the array into the specified shape and returns the converted result. The shape of the original array does not change. It is called like this:

a1 = np.random.randint(0,10,size=(3,4))
a2 = a1.reshape((2,6)) #Return the modified result without affecting the original array itself

resize converts the array into the specified shape by modifying the array itself. It returns no value. It is called like this:

a1 = np.random.randint(0,10,size=(3,4))
a1.resize((2,6)) #a1 itself has changed

flatten and ravel methods:

Both methods convert multi-dimensional arrays into one-dimensional arrays, but there are the following differences:

flatten converts the array into a one-dimensional array and returns a copy, so subsequent modifications to the return value will not affect the original array.
ravel converts the array into a one-dimensional array and returns a view (which can be understood as a reference) whenever possible, so subsequent modifications to the return value will affect the original array.

x = np.array([[1, 2], [3, 4]])
x.flatten()[1] = 100 # x[0, 1] is still 2, because flatten returned a copy
x.ravel()[1] = 100 # x[0, 1] is now 100, because ravel returned a view
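
The difference can be made visible with np.shares_memory, which reports whether two arrays use the same underlying data. This is only a sketch; the variable names are chosen for illustration. It also shows a case where ravel has to fall back to a copy, because the transposed array is not laid out contiguously in memory.

```python
import numpy as np

x = np.array([[1, 2], [3, 4]])

flat_copy = x.flatten()
flat_copy[1] = 100
after_flatten = x[0, 1]  # still 2: the copy is independent of x

flat_view = x.ravel()
flat_view[1] = 100
after_ravel = x[0, 1]  # now 100: the view writes through to x

print(np.shares_memory(x, flat_copy))  # False
print(np.shares_memory(x, flat_view))  # True

# ravel returns a view only when it can; here it must copy
transposed_flat = x.T.ravel()
print(np.shares_memory(x, transposed_flat))  # False
```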

Combination of different arrays:

If you want to combine multiple arrays, you can also use some of these functions.

vstack: stack arrays vertically. The array must have the same number of columns to stack. The example code is as follows:

a1 = np.random.randint(0,10,size=(3,5))
a2 = np.random.randint(0,10,size=(1,5))
a3 = np.vstack([a1,a2])

hstack: stack arrays horizontally. The arrays must have the same number of rows to be stacked. The example code is as follows:

a1 = np.random.randint(0,10,size=(3,2))
a2 = np.random.randint(0,10,size=(3,1))
a3 = np.hstack([a1,a2])

concatenate([],axis): joins two or more arrays in either the vertical or the horizontal direction, depending on the axis parameter. If axis=0, the arrays are stacked in the vertical direction (along the rows). If axis=1, they are stacked in the horizontal direction (along the columns). If axis=None, the arrays are flattened and combined into a one-dimensional array. Note that when stacking in the horizontal direction the numbers of rows must be the same, and when stacking in the vertical direction the numbers of columns must be the same. The example code is as follows:

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])
np.concatenate((a, b), axis=0)
# result:
array([[1, 2],
    [3, 4],
    [5, 6]])

np.concatenate((a, b.T), axis=1)
# result:
array([[1, 2, 5],
    [3, 4, 6]])

np.concatenate((a, b), axis=None)
# result:
array([1, 2, 3, 4, 5, 6])
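
As a quick check of how the three functions relate, the sketch below reuses the arrays a and b from the example and confirms that, for two-dimensional inputs, vstack matches concatenate with axis=0 and hstack matches concatenate with axis=1.

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

vertical = np.concatenate((a, b), axis=0)      # shape (3, 2)
horizontal = np.concatenate((a, b.T), axis=1)  # shape (2, 3)

print(np.array_equal(vertical, np.vstack([a, b])))      # True
print(np.array_equal(horizontal, np.hstack([a, b.T])))  # True
```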

Cutting of array:

An array can be split with hsplit, vsplit and array_split.

hsplit: split in the horizontal direction, that is, column-wise (the cuts themselves run vertically). The second argument specifies how the columns are divided. A number gives how many equal parts to split into, and a list gives the column indexes at which to split. The example code is as follows:

a1 = np.arange(16.0).reshape(4, 4)
np.hsplit(a1,2) #Split into two parts
>>> [array([[ 0.,  1.],
     [ 4.,  5.],
     [ 8.,  9.],
     [12., 13.]]), array([[ 2.,  3.],
     [ 6.,  7.],
     [10., 11.],
     [14., 15.]])]

np.hsplit(a1,[1,2]) #Split at column index 1 and at column index 2, giving three parts
>>> [array([[ 0.],
     [ 4.],
     [ 8.],
     [12.]]), array([[ 1.],
     [ 5.],
     [ 9.],
     [13.]]), array([[ 2.,  3.],
     [ 6.,  7.],
     [10., 11.],
     [14., 15.]])]

vsplit: split in the vertical direction, that is, row-wise. The second argument specifies how the rows are divided. A number gives how many equal parts to split into, and a list or tuple gives the row indexes at which to split. The example code is as follows:

np.vsplit(a1,2) #Split the rows into 2 arrays
>>> [array([[0., 1., 2., 3.],
     [4., 5., 6., 7.]]), array([[ 8.,  9., 10., 11.],
     [12., 13., 14., 15.]])]

np.vsplit(a1,(1,2)) #Split the rows at index 1 and at index 2, giving three parts
>>> [array([[0., 1., 2., 3.]]),
    array([[4., 5., 6., 7.]]),
    array([[ 8.,  9., 10., 11.],
           [12., 13., 14., 15.]])]

split/array_split(array, indices_or_sections, axis): split along a chosen axis. You need to specify whether to split by row or by column: axis=0 splits by row and axis=1 splits by column. The example code is as follows:

np.array_split(a1,2,axis=0) #Split into 2 parts by row
>>> [array([[0., 1., 2., 3.],
     [4., 5., 6., 7.]]), array([[ 8.,  9., 10., 11.],
     [12., 13., 14., 15.]])]
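
One practical difference between split and array_split is how they handle a length that does not divide evenly. The sketch below uses an arbitrary 10-element array: array_split returns parts of unequal size, while split raises ValueError.

```python
import numpy as np

data = np.arange(10)

# array_split tolerates an uneven division
parts = np.array_split(data, 3)
sizes = [len(p) for p in parts]
print(sizes)  # [4, 3, 3]

# split insists on equal parts
try:
    np.split(data, 3)
    split_failed = False
except ValueError:
    split_failed = True
print(split_failed)  # True
```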

Array (matrix) transpose and axis swap:

A two-dimensional numpy array can be treated as a matrix in linear algebra, and matrices can be transposed. ndarray has a T attribute that returns the transpose of the array. The example code is as follows:

a1 = np.arange(0,24).reshape((4,6))
a2 = a1.T
print(a2)

Another way is the transpose method. Like T, it returns a view (which can be understood as a reference for now), so modifying the returned value also changes the original array. The example code is as follows:

a1 = np.arange(0,24).reshape((4,6))
a2 = a1.transpose()

Why do we need to transpose a matrix? Some calculations require it. For example, a matrix of shape (4,6) cannot be multiplied by itself, because the inner dimensions do not match. It can be multiplied by its transpose, which has shape (6,4):

a1 = np.arange(0,24).reshape((4,6))
a2 = a1.T
print(a1)
print('='*30)
print(a2)
print('='*30)
print(a1.dot(a2)) # dot returns the matrix product of the two matrices

result:

[[ 0  1  2  3  4  5]
 [ 6  7  8  9 10 11]
 [12 13 14 15 16 17]
 [18 19 20 21 22 23]]
==============================
[[ 0  6 12 18]
 [ 1  7 13 19]
 [ 2  8 14 20]
 [ 3  9 15 21]
 [ 4 10 16 22]
 [ 5 11 17 23]]
==============================
[[  55  145  235  325]
 [ 145  451  757 1063]
 [ 235  757 1279 1801]
 [ 325 1063 1801 2539]]
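
A short sketch of three related points: the transpose shares memory with the original array, the @ operator gives the same matrix product as dot, and transpose accepts an axis order for arrays with more than two dimensions. The (2, 3, 4) array and the axis order (1, 0, 2) are arbitrary examples.

```python
import numpy as np

a1 = np.arange(24).reshape((4, 6))
a2 = a1.T

# The transpose is a view on the same data
print(np.shares_memory(a1, a2))  # True

# The @ operator is equivalent to dot for two-dimensional arrays
product = a1 @ a2
print(np.array_equal(product, a1.dot(a2)))  # True

# Axis swap on a three-dimensional array: exchange the first two axes
a3 = np.arange(24).reshape((2, 3, 4))
swapped = a3.transpose(1, 0, 2)
print(swapped.shape)  # (3, 2, 4)
```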

Numpy index and slice:

Operations on one-dimensional arrays:

import numpy as np
#One dimensional array (same usage as python list slicing)
a1=np.arange(10)#Generate a one-dimensional array from 0 to 9
print(a1)
#Index operation
print(a1[4])
#Slice operation
print(a1[4:6])

result:

[0 1 2 3 4 5 6 7 8 9]
4
[4 5]

Multidimensional array

Multidimensional arrays support both indexes and slices. A list of indexes can pick positions that are not adjacent, while a slice picks a contiguous range. When a comma is used, the part before the comma selects rows and the part after it selects columns. If only one value is given, it selects rows. Several separate indexes must be wrapped in an extra pair of brackets, while a slice needs no extra brackets.

import numpy as np
a2=np.random.randint(0,10,size=(4,6))
print(a2)

result:

[[4 9 9 0 0 9]
 [2 4 4 5 2 0]
 [9 5 4 0 8 5]
 [8 2 3 3 7 4]]

#Get the first row of the above matrix
print(a2[0])
print('='*30)
#Get the middle two rows
print(a2[1:3])
print('='*30)
#Get rows 1, 3, and 4
print(a2[[0,2,3]])
print('='*30)
#Get the second number in the third row
print(a2[2,1])
print('='*30)
#Get two non-adjacent numbers: row 2 column 3, and row 3 column 4
print(a2[[1,2],[2,3]])
print('='*30)
#Get the block at rows 2-3, columns 4-5
print(a2[1:3,3:5])
print('='*30)
#Column operations:
#Get the first column
print(a2[:,0])
print('='*30)
#Get columns 3 and 4
print(a2[:,2:4])

result:

[4 9 9 0 0 9]
==============================
[[2 4 4 5 2 0]
 [9 5 4 0 8 5]]
==============================
[[4 9 9 0 0 9]
 [9 5 4 0 8 5]
 [8 2 3 3 7 4]]
==============================
5
==============================
[4 0]
==============================
[[5 2]
 [0 8]]
==============================
[4 2 9 8]
==============================
[[9 0]
 [4 5]
 [4 0]
 [3 3]]

Summary:

Use the form array[row, column]. Both the row position and the column position accept either an index or a slice.
Common forms are as follows:
a[i] (one row), a[i, j] (one element), a[[i1, i2, i3]] (several rows by index), a[[i1, i2], [j1, j2]] (the elements at (i1, j1) and (i2, j2)), a[r1:r2, c1:c2] (a block given by a row slice and a column slice)
Note: a[[1,2],[2,3]] selects two separate elements, the one in the second row, third column and the one in the third row, fourth column. In contrast, a[1:2,2:3] selects a contiguous block; because the end of a slice is excluded, this block contains only the second row and the third column.
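
To make the contrast concrete, the sketch below applies both forms to a small array with known values. It also shows a side effect worth knowing: a slice returns a view of the original data, while indexing with lists returns a copy. The variable names are chosen for illustration.

```python
import numpy as np

a = np.arange(24).reshape((4, 6))

# List indexes are paired up: elements (1, 2) and (2, 3)
picked = a[[1, 2], [2, 3]]
print(picked)  # [ 8 15]

# Slices select a rectangular block: rows 1-2, columns 2-3
block = a[1:3, 2:4]
print(block)
# [[ 8  9]
#  [14 15]]

print(np.shares_memory(a, block))   # True: slicing gives a view
print(np.shares_memory(a, picked))  # False: list indexing gives a copy
```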

Boolean index:

Boolean operations are also vector operations, such as the following code:

a1 = np.arange(0,24).reshape((4,6))
print(a1<10) #A new array will be returned, and all the values in this array are of bool type

result:

[[ True  True  True  True  True  True]
 [ True  True  True  True False False]
 [False False False False False False]
 [False False False False False False]]

a1 = np.arange(0,24).reshape((4,6))
a2 = a1 < 10
print(a1[a2]) #In this way, the value of the position corresponding to the element that is True in a2 will be extracted in a1

result:

[0 1 2 3 4 5 6 7 8 9]

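Summary: Boolean operations can use !=, ==, >, <, >= and <=, and conditions can be combined with & (and) and | (or). Each condition must be wrapped in parentheses when conditions are combined.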

a1 = np.arange(0,24).reshape((4,6))
a2 = a1[(a1 < 5) | (a1 > 10)]
print(a2)

Substitution of values:

Using the index, you can also replace some values. Replace the value of the position that meets the condition with another value. For example, the following code:

a1 = np.arange(0,24).reshape((4,6))
a1[3] = 0 #Replace all values in the fourth row (index 3) with 0
print(a1)

You can also use a conditional index to do the replacement:

a1 = np.arange(0,24).reshape((4,6))
a1[a1 < 5] = 0 #Replace all values less than 5 with 0
print(a1)

You can also use functions to implement:

# where function:
a1 = np.arange(0,24).reshape((4,6))
a2 = np.where(a1 < 10,1,0) #Change all numbers less than 10 in a1 to 1 and the rest to 0
print(a2)
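
np.where has two further uses that are easy to overlook. The replacement values can be arrays, so the original values can be kept where the condition holds, and when called with only a condition it returns the indexes of the matching elements. A minimal sketch, with -1 as an arbitrary fill value:

```python
import numpy as np

a1 = np.arange(24).reshape((4, 6))

# Keep values below 10, replace everything else with -1
kept = np.where(a1 < 10, a1, -1)
print(kept[0])  # [0 1 2 3 4 5]
print(kept[3])  # [-1 -1 -1 -1 -1 -1]

# With a single argument, where returns one index array per dimension
rows, cols = np.where(a1 < 3)
print(rows)  # [0 0 0]
print(cols)  # [0 1 2]
```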

Deep copy and shallow copy

When manipulating arrays, their data is sometimes copied into a new array, sometimes not. This is often confusing for beginners. There are three situations:

Do not copy:

a = np.arange(12)
b = a #This will not be copied
print(b is a) #Returns True, indicating that b and a are the same

View or shallow copy:

In some cases, variables will be copied, but the memory space they point to is the same. This situation is called shallow copy, or view. For example, the following code:

a = np.arange(12)
c = a.view()
print(c is a) #Returns False, indicating that c and a are two different variables
c[0] = 100
print(a[0]) #Print 100, indicating that the change to c will affect the value above a, indicating that the memory space they point to is still the same. This is called shallow copy, or view

Deep copy:

Put a complete copy of the previous data into another memory space, which is two completely different values. The example code is as follows:

a = np.arange(12)
d = a.copy()
print(d is a) #Returns False, indicating that d and a are two different variables
d[0] = 100
print(a[0]) #Print 0, indicating that the memory space pointed to by d and a is completely different.

example:

As mentioned earlier, flatten and ravel illustrate this. ravel returns a view (when possible) and flatten returns a deep copy.
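
One way to tell the three situations apart in code is the base attribute, which points to the array that owns the data for a view and is None for an array that owns its own data. The sketch below also shows that a plain slice is a view; the variable names are illustrative.

```python
import numpy as np

a = np.arange(12)
c = a.view()   # shallow copy (view)
d = a.copy()   # deep copy
s = a[2:5]     # slicing also returns a view

print(c.base is a)     # True: c borrows a's data
print(d.base is None)  # True: d owns its own data
print(s.base is a)     # True

# Writing through the slice changes the original array
s[0] = 99
print(a[2])  # 99
```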

File operation

To manipulate CSV files:

File save:

Sometimes we have an array that needs to be saved to a file, so we can use np.savetxt. The function is described as follows:

np.savetxt(fname, array, fmt='%.18e', delimiter=' ')
* fname : File name or file handle. If the file name ends in .gz, the file is saved in compressed gzip format
* array : The array to be stored in the file
* fmt : The format in which each value is written, for example: %d %.2f %.18e
* delimiter : The string that separates columns; the default is a single space

The following are examples of use:

a = np.arange(100).reshape(5,20)
np.savetxt("a.csv",a,fmt="%d",delimiter=",")

Read file:

Sometimes our data needs to be read from a file, so we can use np.loadtxt. The function is described as follows:

np.loadtxt(fname, dtype=float, delimiter=None, skiprows=0, usecols=None, unpack=False)
* fname: File name, file handle, or generator. Files ending in .gz or .bz2 are decompressed first.
* dtype: Data type of the result, optional; the default is float.
* delimiter: The string that separates columns; the default is any whitespace.
* skiprows: Skip the first skiprows lines, for example a header row.
* usecols: Read only the specified columns, given as an index or a tuple of indexes.
* unpack: If True, the array that was read is transposed, so each column can be assigned to its own variable.
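
The parameters above can be tried together in a small round trip. This is a sketch under a few assumptions: the file is written to a temporary directory, the file name a.csv and the header text are made up, and the header and comments arguments of np.savetxt are used to write a title row without a leading '#'.

```python
import os
import tempfile

import numpy as np

a = np.arange(20).reshape(4, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "a.csv")  # hypothetical file name

    # Write a header row followed by the data as integers
    np.savetxt(path, a, fmt="%d", delimiter=",",
               header="c0,c1,c2,c3,c4", comments="")

    # skiprows=1 skips the header; values come back as float by default
    loaded = np.loadtxt(path, delimiter=",", skiprows=1)

    # usecols picks columns 1 and 3; unpack=True returns them separately
    col1, col3 = np.loadtxt(path, delimiter=",", skiprows=1,
                            usecols=(1, 3), unpack=True)

print(loaded.dtype)  # float64
print(col1)          # [ 1.  6. 11. 16.]
print(col3)          # [ 3.  8. 13. 18.]
```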

np's unique storage solution:

numpy also has its own storage format. The file names end in .npy or .npz. The following functions store and load such files.

Storage: np.save(fname, array) or np.savez(fname, array). The former writes a single array to a file with the extension .npy. The latter writes one or more arrays to an archive with the extension .npz; the archive is not compressed unless np.savez_compressed is used.
Loading: np.load(fname).
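
A minimal round trip through both formats might look like the sketch below. The file names single.npy and bundle.npz and the keyword names first and second are invented for the example; files are written to a temporary directory so nothing is left behind.

```python
import os
import tempfile

import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.linspace(0.0, 1.0, 5)

with tempfile.TemporaryDirectory() as tmp_dir:
    npy_path = os.path.join(tmp_dir, "single.npy")
    npz_path = os.path.join(tmp_dir, "bundle.npz")

    # .npy stores one array, including its dtype and shape
    np.save(npy_path, a)
    a_loaded = np.load(npy_path)

    # .npz stores several arrays, each under the keyword it was saved with
    np.savez(npz_path, first=a, second=b)
    with np.load(npz_path) as bundle:
        names = sorted(bundle.files)
        b_loaded = bundle["second"]

print(names)  # ['first', 'second']
print(np.array_equal(a, a_loaded))  # True
```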

CSV file operation:

Read csv file:

import csv

with open('stock.csv','r') as fp:
    reader = csv.reader(fp)
    titles = next(reader)# Read the header row; next advances the reader by one line
    for x in reader:
        print(x)

In this way, the data in each row can only be accessed by index (subscript). If you want to access the data by column title, you can use DictReader. The example code is as follows:

import csv

with open('stock.csv','r') as fp:
    # Create the reader object with DictReader
    # The header row is used for the keys and is not returned as data
    # reader is an iterator; each iteration returns one row as a dictionary
    reader = csv.DictReader(fp)
    for x in reader:
        value = {"name":x['secShortName'],"volume":x['turnoverVol']}
        print(value)

Write data to csv file:

To write data to a csv file, you need to create a writer object, which mainly uses two methods. One is writerow, which writes a single row. The other is writerows, which writes multiple rows. The example code is as follows:

import csv

headers = ['name','age','classroom']
values = [
    ('zhiliao',18,'111'),
    ('wena',20,'222'),
    ('bbc',21,'111')
]
with open('test.csv','w',newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(headers)
    writer.writerows(values)

You can also write data in the form of a dictionary. At this time, you need to use DictWriter. The example code is as follows:

import csv

headers = ['name','age','classroom']
values = [
    {"name":'wenn',"age":20,"classroom":'222'},
    {"name":'abc',"age":30,"classroom":'333'}
]
with open('test.csv','w',newline='') as fp:
    writer = csv.DictWriter(fp,headers)
    writer.writerow({'name':'zhiliao',"age":18,"classroom":'111'})
    writer.writerows(values)

Note: when the header needs to be written:

writer = csv.DictWriter(fp, headers)
# To write the header row, call the writeheader method
writer.writeheader()
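
Putting the pieces together, the sketch below writes a header and two rows with DictWriter and reads them back with DictReader. It uses an in-memory io.StringIO buffer instead of a file on disk, which is a choice made for this illustration only. Note that csv always returns values as strings.

```python
import csv
import io

headers = ['name', 'age', 'classroom']
values = [
    {"name": 'wenn', "age": 20, "classroom": '222'},
    {"name": 'abc', "age": 30, "classroom": '333'},
]

# An in-memory text buffer stands in for a file opened with newline=''
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=headers)
writer.writeheader()       # writes: name,age,classroom
writer.writerows(values)

# Rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
rows = list(csv.DictReader(buffer))

print(rows[0]['name'])  # wenn
print(rows[1]['age'])   # 30 (as the string '30')
```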

NAN and INF value processing

First of all, we need to know what these two terms mean:

NAN: Not A Number. It does not represent a number, but it belongs to the floating-point type, so pay attention to the array's data type before using it in calculations.
INF: Infinity. It also belongs to the floating-point type. np.inf means positive infinity and -np.inf means negative infinity. Infinity usually appears when a NumPy floating-point value is divided by 0, for example np.float64(2) / 0 (plain Python 2 / 0 raises ZeroDivisionError instead).

Some features of NAN:

NAN is not equal to NAN. For example, the condition np.nan != np.nan is True.
Any arithmetic operation between NAN and another value gives NAN.
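
These properties can be verified directly. The sketch below also shows why the data type matters: an integer array cannot hold NAN, so it has to be converted to float first. np.errstate is used only to silence the divide-by-zero warning.

```python
import numpy as np

print(np.nan != np.nan)     # True
print(np.isnan(np.nan))     # True: use isnan to test for NAN
print(np.nan + 1)           # nan

# Dividing a NumPy float by zero gives infinity (with a warning)
with np.errstate(divide='ignore'):
    pos_inf = np.float64(2) / 0
    neg_inf = np.float64(-2) / 0
print(pos_inf, neg_inf)     # inf -inf

# NAN is a float, so an integer array must be converted before storing it
ints = np.array([1, 2, 3])
try:
    ints[0] = np.nan
    int_assignment_failed = False
except ValueError:
    int_assignment_failed = True

floats = ints.astype(float)
floats[0] = np.nan
print(floats)               # [nan  2.  3.]
```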

Sometimes, especially when reading data from files, some missing values often appear. The occurrence of missing values will affect the processing of data. Therefore, we must deal with the missing values before data analysis. There are many ways to deal with it, which need to be done according to the actual situation. There are generally two processing methods: delete the missing value and fill it with other values.

Delete missing values:

Sometimes, if we want to delete the NAN in the array, we can change the idea to extract only the values that are not NAN. The example code is as follows:

# 1. Delete all NAN values. Because the shape is no longer well defined after values are removed, the result becomes a one-dimensional array
data = np.random.randint(0,10,size=(3,5)).astype(float)
data[0,1] = np.nan
data = data[~np.isnan(data)] # data now has no NAN and is a 1-dimensional array

# 2. Delete the rows that contain NAN
data = np.random.randint(0,10,size=(3,5)).astype(float)
# Set the (0,1) and (1,2) values to NAN
data[[0,1],[1,2]] = np.nan
# Find which rows contain NAN
lines = np.where(np.isnan(data))[0]
# Use np.delete to remove those rows. axis=0 means rows are deleted, and lines holds the row numbers to delete
data1 = np.delete(data,lines,axis=0)

Note:
np.delete with axis=0 removes whole rows. For aggregation functions such as sum, axis=0 instead collapses the rows and gives one result per column, so a per-row result needs axis=1.

Replace with other values:

mathematics	English
59	89
90	32
78	45
34	NAN
NAN	56
23	56

If you want the total score of each student and the average score of each course, you can substitute other values for NAN. For example, to calculate the total score, replace NAN with 0. To calculate the average score, replace NAN with the average of the other values. The example code is as follows:

scores = np.loadtxt("nan_scores.csv",skiprows=1,delimiter=",",encoding="utf-8",dtype=str)
# Replace empty cells with the text "nan" so that they become NAN when converted to float
scores = np.where(scores == "","nan",scores).astype(float)
# 1. Calculate the total score of each student, counting NAN as 0
scores1 = scores.copy()
scores1[np.isnan(scores1)] = 0
print(scores1.sum(axis=1))

# 2. Calculate the average score of each course
scores2 = scores.copy()
for x in range(scores2.shape[1]):
    score = scores2[:,x]
    non_nan_score = score[score == score]
    score[score != score] = non_nan_score.mean()
print(scores2.mean(axis=0))
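
NumPy also provides reductions that skip NAN, which can replace the manual substitution in simple cases. The sketch below types in the table from this section by hand instead of reading nan_scores.csv, and uses np.nansum and np.nanmean; the last step fills each NAN with the mean of its column.

```python
import numpy as np

# Columns: mathematics, English (values copied from the table above)
scores = np.array([
    [59, 89],
    [90, 32],
    [78, 45],
    [34, np.nan],
    [np.nan, 56],
    [23, 56],
], dtype=float)

# Total per student, treating NAN as 0
totals = np.nansum(scores, axis=1)
print(totals)  # [148. 122. 123.  34.  56.  79.]

# Mean per course, ignoring NAN
course_means = np.nanmean(scores, axis=0)
print(course_means)  # [56.8 55.6]

# Fill each NAN with the mean of its own column
filled = np.where(np.isnan(scores), course_means, scores)
```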

np.random module

np.random provides us with many functions to obtain random numbers. Let's study it here.

np.random.seed:

np.random.seed sets the integer seed from which the random number algorithm starts. If the same seed value is used, the same sequence of random numbers is generated every time. If no seed is set, NumPy picks one from an unpredictable source (operating system entropy, or the clock), so each run produces different numbers. Unless reproducible results are needed, no seed has to be set. The code is as follows:

np.random.seed(1)
print(np.random.rand()) # Prints 0.417022004702574
print(np.random.rand()) # Prints a different value: the next number in the sequence determined by the seed
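
To see that the seed fixes the whole sequence and not just one number, the seed can be reset and the draws compared. This is a small sketch. The second half assumes NumPy 1.17 or later, where np.random.default_rng is available:

```python
import numpy as np

np.random.seed(1)
first = np.random.rand(3)

np.random.seed(1)           # reset to the same seed
second = np.random.rand(3)  # the same three numbers again

print(np.array_equal(first, second))  # True

# Assumption: NumPy 1.17+, which offers the Generator interface
rng_a = np.random.default_rng(1)
rng_b = np.random.default_rng(1)
same = rng_a.random() == rng_b.random()
print(same)  # True
```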

np.random.rand:

Generates an array of values uniformly distributed in [0,1), so 1 itself is never returned. The shape is given by the arguments. With no arguments, a single random value is returned. The example code is as follows:

data1 = np.random.rand(2,3,4) # Generate an array of 2 blocks, 3 rows and 4 columns with values from 0 to 1
data2 = np.random.rand() #Generate a random number between 0 and 1

np.random.randn:

Generates values drawn from the standard normal distribution, which has mean (μ) 0 and standard deviation (σ) 1. The example code is as follows:

data = np.random.randn(2,3) #Generate an array of 2 rows and 3 columns. The values follow the standard normal distribution
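
A normal distribution with a different mean and standard deviation can be obtained by scaling and shifting the output of randn. The sketch below uses made-up target values mu and sigma:

```python
import numpy as np

np.random.seed(0)
mu, sigma = 5.0, 2.0  # hypothetical target mean and standard deviation

# Scale by sigma, then shift by mu
samples = mu + sigma * np.random.randn(100000)

print(samples.mean())  # close to 5
print(samples.std())   # close to 2
```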

np.random.randint:

Generates random integers within the specified range. The lower bound is included and the upper bound is excluded. The shape can be set through the size parameter. The example code is as follows:

data1 = np.random.randint(10,size=(3,5)) #Generate an array of 3 rows and 5 columns with values from 0 to 9
data2 = np.random.randint(1,20,size=(3,6)) #Generate an array of 3 rows and 6 columns with values from 1 to 19

np.random.choice:

Randomly samples from a list or array, or from the integers 0 to n-1 when an integer n is passed. The number of samples can be specified through parameters:

data = [4,65,6,3,5,73,23,5,6]
result1 = np.random.choice(data,size=(2,3)) #Randomly sample from data to generate an array of 2 rows and 3 columns
result2 = np.random.choice(data,3) #Randomly sample three values from data to form a one-dimensional array
result3 = np.random.choice(10,3) #Take 3 values randomly from 0 to 9
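
np.random.choice also accepts replace and p parameters. The following sketch shows sampling without repetition and weighted sampling. The list and the weights are made up for illustration:

```python
import numpy as np

np.random.seed(0)
data = [4, 65, 6, 3, 5, 73, 23]

# Sample without replacement: no position in data is picked twice
no_repeat = np.random.choice(data, size=4, replace=False)

# Weighted sampling: p needs one probability per element, summing to 1
weights = [0.7, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]  # hypothetical weights
weighted = np.random.choice(data, size=1000, p=weights)

print(no_repeat)
print((weighted == 4).mean())  # roughly 0.7
```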

np.random.shuffle:

Shuffles the elements of the original array in place. The example code is as follows:

a = np.arange(10)
np.random.shuffle(a) #The positions of the elements of a will be changed randomly
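
Because shuffle changes the array itself and returns nothing, np.random.permutation is the usual choice when the original order must be kept. A short sketch of the difference:

```python
import numpy as np

np.random.seed(0)
a = np.arange(10)

b = np.random.permutation(a)  # returns a shuffled copy
unchanged = np.array_equal(a, np.arange(10))
print(unchanged)  # True: a is still in its original order

result = np.random.shuffle(a)  # shuffles a in place
print(result)  # None
```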

For more random module documentation, please refer to Numpy's official documentation: https://docs.scipy.org/doc/numpy/reference/routines.random.html

Axis understanding

In short, the outermost brackets correspond to axis=0, and each level of brackets further inward increases the axis number by 1. What does this mean? Let's walk through it.

In a two-dimensional array, the outer bracket is axis=0 and the two inner sub-brackets are axis=1. Rule of operation: when an axis is specified, the operation is applied across the direct child elements under that axis, combining their position 0 values, their position 1 values, their position 2 values, and so on.
Now let's apply this rule to a few operations. For example, take a two-dimensional array:

x = np.array([[0,1],[2,3]])

Find the sum of x array in the case of axis=0 and axis=1:

>>> x.sum(axis=0)
array([2, 4])

We get [2,4] because summing along axis=0 adds together position 0 of every direct child element under the outermost axis, then position 1 of every child, and so on. That gives 0 + 2 and 1 + 3, so the result is [2,4].

>>> x.sum(axis=1)
array([1, 5])

Summing along axis=1 takes each element under axis 1 and adds up its contents: 0 + 1 gives 1 and 2 + 3 gives 5. Therefore, the final result is [1,5].
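
The two sums can be checked by writing them out by hand. This is only an illustrative sketch of what each axis value combines, not how NumPy computes it internally:

```python
import numpy as np

x = np.array([[0, 1], [2, 3]])

# axis=0: combine the rows position by position (one result per column)
by_axis0 = x[0] + x[1]

# axis=1: reduce inside each row (one result per row)
by_axis1 = np.array([row.sum() for row in x])

print(by_axis0, x.sum(axis=0))
print(by_axis1, x.sum(axis=1))
```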

Use np.max to find the maximum value when axis=0 and axis=1:

import numpy as np
np.random.seed(100)
x = np.random.randint(0,10,size=(3,5))
print(x)
x.max(axis=0)

result:

[[8 8 3 7 7]
 [0 4 2 5 2]
 [2 2 1 0 8]]
array([8, 8, 3, 7, 8])

Because the maximum is computed along axis=0, NumPy takes the direct child elements of the outermost axis (the rows), compares their 0th values to get the first maximum, compares their 1st values to get the second, and so on. With axis=1, it takes each direct child element and finds the maximum within that element:

x.max(axis=1)

result:

array([8, 5, 8])

Use np.delete to delete elements when axis=0 and axis=1:

x = np.array([[0,1],[2,3]])

>>> np.delete(x,0,axis=0)
array([[2, 3]])

np.delete behaves differently from the reductions above. With axis=0, it finds direct child element number 0 under the outermost bracket (the first row) and removes it, leaving only the last row.

>>> np.delete(x,0,axis=1)
array([[1],
       [3]])

Similarly, with axis=1 the first column is deleted.

About the delete function:
The delete function in numpy has three parameters:
numpy.delete(arr, obj, axis=None)
arr: the array to process
obj: the index or indices to delete
axis: optional. For a two-dimensional array it is one of None, 0, 1

axis=None: arr is first flattened row by row, then the element at index obj (counting from 0) is deleted, and a one-dimensional array is returned.

axis = 0: delete by row

axis = 1: delete by column
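
The three cases can be compared side by side. A minimal sketch on a made-up 2x3 array:

```python
import numpy as np

x = np.array([[0, 1, 2],
              [3, 4, 5]])

flat = np.delete(x, 1)               # axis=None: flatten, then drop index 1
rows = np.delete(x, 0, axis=0)       # drop row 0
cols = np.delete(x, [0, 2], axis=1)  # drop columns 0 and 2

print(flat)
print(rows)
print(cols)
print(x.shape)  # (2, 3): np.delete returns a new array, x is not modified
```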

Three dimensional array:
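
As a rough sketch of how the same bracket rule extends to three dimensions, the following uses a made-up 2x3x4 array and shows which axis disappears from the shape after each sum:

```python
import numpy as np

# Hypothetical array: 2 blocks, 3 rows, 4 columns
x = np.arange(24).reshape(2, 3, 4)

s0 = x.sum(axis=0)  # adds the 2 blocks together -> shape (3, 4)
s1 = x.sum(axis=1)  # adds the 3 rows inside each block -> shape (2, 4)
s2 = x.sum(axis=2)  # adds the 4 columns inside each row -> shape (2, 3)

print(s0.shape, s1.shape, s2.shape)
```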

Universal functions (ufunc)

Unary function:

function	description
np.abs	Absolute value
np.sqrt	Square root
np.square	Square
np.exp	Exponential (e^x)
np.log, np.log10, np.log2, np.log1p	Logarithm with base e, base 10, base 2, and natural logarithm of (1+x)
np.sign	Sign of each value: values greater than 0 become 1, values equal to 0 become 0, and values less than 0 become -1
np.ceil	Round toward positive infinity. For example, 5.1 becomes 6 and -6.3 becomes -6
np.floor	Round toward negative infinity. For example, 5.1 becomes 5 and -6.3 becomes -7
np.rint, np.round	Return the rounded value
np.modf	Split each value into its fractional and integer parts, returned as two arrays
np.isnan	Test whether each value is NaN
np.isinf	Test whether each value is inf
np.cos, np.cosh, np.sin, np.sinh, np.tan, np.tanh	Trigonometric and hyperbolic functions
np.arccos, np.arcsin, np.arctan	Inverse trigonometric functions
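
A few of the unary functions from the table, applied to the same sample values used in the descriptions:

```python
import numpy as np

a = np.array([5.1, -6.3, 0.0])

up = np.ceil(a)      # toward positive infinity: 6, -6, 0
down = np.floor(a)   # toward negative infinity: 5, -7, 0
signs = np.sign(a)   # 1, -1, 0

frac, whole = np.modf(a)  # fractional parts first, then integer parts

print(up, down, signs)
print(frac, whole)
```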

Binary function:

function	description
np.add	Addition (e.g. 1 + 1 = 2), equivalent to +
np.subtract	Subtraction (e.g. 3 - 2 = 1), equivalent to -
np.negative	Negation (e.g. -2), equivalent to a leading minus sign
np.multiply	Multiplication (e.g. 2 * 3 = 6), equivalent to *
np.divide	Division (e.g. 3 / 2 = 1.5), equivalent to /
np.floor_divide	Floor division, equivalent to //
np.mod	Remainder, equivalent to %
np.greater, np.greater_equal, np.less, np.less_equal, np.equal, np.not_equal	Function forms of >, >=, <, <=, ==, !=
np.logical_and	Function form of & (for boolean arrays)
np.logical_or	Function form of | (for boolean arrays)
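
The function forms and the operators give the same results, which the following sketch checks on two small made-up arrays:

```python
import numpy as np

a = np.array([7, 8, 9])
b = np.array([2, 3, 4])

same_add = np.array_equal(np.add(a, b), a + b)
same_floor = np.array_equal(np.floor_divide(a, b), a // b)
remainders = np.mod(a, b)

# Combine two boolean arrays element by element
both = np.logical_and(a > 7, b < 4)

print(same_add, same_floor)
print(remainders)
print(both)
```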

Aggregate function:

The NaN-safe versions ignore NaN elements, so a NaN value does not affect the result of the calculation.

Function name	NaN-safe version	Description
np.sum	np.nansum	Calculate the sum of the elements
np.prod	np.nanprod	Calculate the product of the elements
np.mean	np.nanmean	Calculate the average of the elements
np.std	np.nanstd	Calculate the standard deviation of the elements
np.var	np.nanvar	Calculate the variance of the elements
np.min	np.nanmin	Calculate the minimum value of the elements
np.max	np.nanmax	Calculate the maximum value of the elements
np.argmin	np.nanargmin	Find the index of the minimum value
np.argmax	np.nanargmax	Find the index of the maximum value
np.median	np.nanmedian	Calculate the median of the elements
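
A short sketch of the difference between a regular aggregate and its NaN-safe counterpart, on a made-up array with one NaN:

```python
import numpy as np

a = np.array([3.0, np.nan, 1.0, 8.0])

plain = np.sum(a)        # nan: a single NaN propagates into the result
safe = np.nansum(a)      # 12.0: the NaN is skipped
avg = np.nanmean(a)      # 4.0: mean of the three remaining values
where = np.nanargmax(a)  # 3: index of the largest non-NaN value

print(plain, safe, avg, where)
```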

Sums can be computed with np.sum or a.sum, and you can specify the axis to sum along. Python also has a built-in sum function, but on arrays it is much slower than np.sum, as the following test shows:

a = np.random.rand(1000000)
%timeit sum(a) #Use Python's built-in sum function to find the sum and see the time spent
%timeit np.sum(a) #Use Numpy's sum function to sum and look at the time it takes

result:

73.3 ms ± 1.47 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
899 µs ± 39.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Explanation:

Timing a single line of code: %timeit
Timing multiple lines of code: %%timeit

Both can only be used under IPython. That includes Jupyter Notebook, and the Python console in PyCharm when it runs IPython.
%timeit runs a single line of code many times and reports how long it takes.
%%timeit does the same for multiple lines of code.
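
Outside IPython, a similar comparison can be made with the timeit module from the standard library. This is a rough sketch. The array size and repeat count are arbitrary choices, and the measured times depend on the machine:

```python
import timeit
import numpy as np

a = np.random.rand(100000)

# number=10 runs each callable 10 times and returns the total seconds
t_builtin = timeit.timeit(lambda: sum(a), number=10)
t_numpy = timeit.timeit(lambda: np.sum(a), number=10)

print(f"built-in sum: {t_builtin / 10:.6f} s per call")
print(f"np.sum:       {t_numpy / 10:.6f} s per call")
```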

Boolean array functions:

Function name	describe
np.any	Verify that any element is true
np.all	Verify that all elements are true

For example, to see if all elements in the array are 0, you can use the following code:

np.all(a==0) 
# Or
(a==0).all()

For example, if we want to see whether there is a number equal to 0 in the array, we can implement it through the following code:

np.any(a==0)
# Or
(a==0).any()
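
Both functions also accept an axis, so the check can be made per row or per column. A small sketch on a made-up array:

```python
import numpy as np

a = np.array([[0, 0, 0],
              [0, 5, 0]])

all_zero = np.all(a == 0)              # False: one element is 5
any_zero = np.any(a == 0)              # True
rows_all_zero = np.all(a == 0, axis=1) # one answer per row
cols_any_nonzero = np.any(a != 0, axis=0)  # one answer per column

print(all_zero, any_zero)
print(rows_all_zero)
print(cols_any_nonzero)
```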

Sort:

np.sort: Specifies the axis to sort. The default is to sort using the last axis of the array.

a = np.random.randint(0,10,size=(3,5))
b = np.sort(a) #Sort by row. Because the last axis is 1, the innermost elements are sorted.
c = np.sort(a,axis=0) #Sort by column because axis=0 is specified
print(a)
print('='*30)
print(b)
print('='*30)
print(c)

result:

[[1 6 5 9 2]
 [1 6 8 7 2]
 [8 5 5 3 8]]
==============================
[[1 2 5 6 9]
 [1 2 6 7 8]
 [3 5 5 8 8]]
==============================
[[1 5 5 3 2]
 [1 6 5 7 2]
 [8 6 8 9 8]]

There is also ndarray.sort(). This method sorts the original array in place instead of returning a new sorted array.
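
The difference between the function and the method is easy to check on a small made-up array:

```python
import numpy as np

a = np.array([3, 1, 2])

b = np.sort(a)               # returns a sorted copy
before = a.tolist()          # a is still [3, 1, 2] at this point

result = a.sort()            # sorts a itself and returns None
print(before, b, a, result)
```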

np.argsort: returns the sorted subscript value. The example code is as follows:

np.argsort(a) #By default, the last axis is also used for sorting.

result:

array([[0, 4, 2, 1, 3],#These are indices into the original array a: in row 0, the element at index 0 is the smallest, the element at index 4 comes second, and so on
       [0, 4, 1, 3, 2],
       [3, 1, 2, 0, 4]], dtype=int64)

Descending sort: np.sort sorts in ascending order by default. To sort in descending order, either of the following approaches can be used:

# 1. Use a minus sign
-np.sort(-a)

# 2. Use argsort and take_along_axis
indexes = np.argsort(-a) #Indices that put each row in descending order
np.take_along_axis(a,indexes,axis=1) #Pick the elements of a row by row according to the indices (np.take without an axis would index the flattened array instead)
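
The descending-sort approaches can be compared on a small made-up two-dimensional array. The sketch also includes reversing an ascending sort, a third option not listed above, and assumes NumPy 1.15 or later for np.take_along_axis:

```python
import numpy as np

a = np.array([[1, 6, 5, 9, 2],
              [8, 5, 5, 3, 8]])

desc1 = -np.sort(-a)          # negate, sort ascending, negate back
desc2 = np.sort(a)[:, ::-1]   # sort ascending, then reverse each row

idx = np.argsort(-a)          # per-row indices for descending order
desc3 = np.take_along_axis(a, idx, axis=1)

print(desc1)
```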

Other functions supplement:

np.apply_along_axis: applies the specified function along an axis. The example code is as follows:

# Compute the mean of each row of a after removing its maximum and minimum values.
np.apply_along_axis(lambda x:x[(x != x.max()) & (x != x.min())].mean(),axis=1,arr=a)#Each row of the array is passed to x
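
The same computation is easier to read with a named function. In this sketch, trimmed_mean is a helper name chosen for illustration, and the array is made up:

```python
import numpy as np

a = np.array([[1, 6, 5, 9, 2],
              [1, 6, 8, 7, 2]])

def trimmed_mean(row):
    # Drop every occurrence of the row's max and min, then average the rest
    keep = (row != row.max()) & (row != row.min())
    return row[keep].mean()

# axis=1 passes each row of a to trimmed_mean in turn
result = np.apply_along_axis(trimmed_mean, axis=1, arr=a)
print(result)
```

If every value in a row is identical, nothing is left after the filter and the mean of that row is NaN.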

np.linspace: generates evenly spaced values over a specified interval. The example code is as follows:

# Generate an array of 12 evenly spaced points from 0 to 1, with both endpoints included
np.linspace(0,1,12)
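
A smaller example makes the spacing easier to see. The sketch also shows the endpoint parameter, which excludes the stop value when set to False:

```python
import numpy as np

pts = np.linspace(0, 1, 5)                        # 0, 0.25, 0.5, 0.75, 1
half_open = np.linspace(0, 1, 5, endpoint=False)  # 0, 0.2, 0.4, 0.6, 0.8

print(pts)
print(half_open)
```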

np.unique: returns the unique values in the array, sorted in ascending order.

# Return the unique values in array a, together with the number of occurrences of each unique value.
np.unique(a,return_counts=True)
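
With return_counts=True the call returns two arrays of the same length. A minimal sketch on a made-up array:

```python
import numpy as np

a = np.array([3, 1, 3, 2, 1, 3])

values, counts = np.unique(a, return_counts=True)
print(values)  # sorted unique values: 1, 2, 3
print(counts)  # how often each one occurs: 2, 1, 3
```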

For more functions, please refer to Numpy's official reference: https://docs.scipy.org/doc/numpy/reference/index.html

Keywords: Python Machine Learning Data Analysis numpy

Added by wee493 on Thu, 30 Dec 2021 14:04:04 +0200
