Reference video: Start Python data mining in 4 days
1 Numpy advantage
1.1 introduction to numpy
- Numpy (Numerical Python) is an open-source Python scientific computing library used to process arrays of any dimension quickly.
- Numpy supports common array and matrix operations. For the same numerical calculation task, using NumPy is much simpler than using plain Python.
- Numpy uses the ndarray object to handle multidimensional arrays; it is a fast and flexible container for large data sets.
1.2 introduction to ndarray
NumPy provides an N-dimensional array type, ndarray, which describes a collection of "items" of the same type.
The name ndarray breaks down as: n = any number, d = dimension, array = array; in other words, an n-dimensional array.
import numpy as np

# Store the data in an ndarray container
score = np.array([[80, 89, 86, 67, 79],
                  [78, 97, 89, 67, 81],
                  [90, 94, 78, 67, 74],
                  [91, 91, 90, 67, 69],
                  [76, 87, 75, 67, 86],
                  [70, 79, 84, 67, 84],
                  [94, 92, 93, 67, 64],
                  [86, 85, 83, 67, 80]])
In[1] : score Out[1]: array([[80, 89, 86, 67, 79], [78, 97, 89, 67, 81], [90, 94, 78, 67, 74], [91, 91, 90, 67, 69], [76, 87, 75, 67, 86], [70, 79, 84, 67, 84], [94, 92, 93, 67, 64], [86, 85, 83, 67, 80]])
In[1] : type(score) Out[1]: numpy.ndarray
1.3 comparison of operation efficiency between ndarray and Python native list
Python lists can be used to store one-dimensional arrays, and multi-dimensional arrays can be realized by nesting lists.
So why do you need to use Numpy's ndarray?
import random
import time

# Generate a large array
python_list = []
for i in range(100000000):
    python_list.append(random.random())
ndarray_list = np.array(python_list)

# Sum of the native Python list
t1 = time.time()
a = sum(python_list)
t2 = time.time()
d1 = t2 - t1  # 0.7309620380401611

# Sum of the ndarray
t3 = time.time()
b = np.sum(ndarray_list)
t4 = time.time()
d2 = t4 - t3  # 0.12980318069458008
Summary:
- From this we can see that the ndarray computation is much faster, saving time.
- The defining characteristic of machine learning is heavy computation on large amounts of data; without a fast numerical solution, Python would struggle to perform well in machine learning.
- NumPy is designed specifically for storing and operating on ndarrays, so its storage efficiency and I/O performance are far better than those of nested Python lists. The larger the array, the more pronounced NumPy's advantage.
1.4 advantages of ndarray
- Memory layout: an ndarray stores elements of one uniform type and is therefore less general-purpose; a list can hold elements of different types and is more general-purpose.
- When an ndarray stores data, the elements and their addresses are contiguous in memory, which makes batch operations on array elements faster.
- Because all elements of an ndarray have the same type, its memory can be laid out contiguously, whereas a Python list can only reach the next element through an extra address lookup. As a result, NumPy's ndarray is less flexible than a native Python list for general-purpose use, but for scientific computation it saves a large number of loop statements and the resulting code is much simpler than with native lists.
- ndarray supports parallelized (vectorized) operations.
- The core of NumPy is written in C, and the GIL (global interpreter lock) is released internally, so its array operations are not limited by the Python interpreter and run far faster than pure Python code.
2 N-dimensional array - ndarray
2.1 properties of ndarray
Property name | Description |
---|---|
ndarray.shape | Tuple of array dimension |
ndarray.ndim | Array dimension |
ndarray.size | Number of elements in array |
ndarray.dtype | Type of array element |
ndarray.itemsize | Length of an array element (bytes) |
In[1] : score.shape Out[1]: (8, 5) # 8 rows and 5 columns, tuple representation In[1] : score.ndim Out[1]: 2 # 2D In[1] : score.size Out[1]: 40 # 40 elements In[1] : score.dtype Out[1]: dtype('int64') # Default integer type In[1] : score.itemsize Out[1]: 8 # 8 bytes for one element
2.2 shape of ndarray
# First create some arrays:
a = np.array([[1, 2, 3], [4, 5, 6]])
b = np.array([1, 2, 3, 4])
c = np.array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]])
In[1] : a Out[1]: array([[1, 2, 3], [4, 5, 6]]) In[1] : a.shape Out[1]: (2, 3) # Two dimensional array
In[1] : b Out[1]: array([1, 2, 3, 4]) In[1] : b.shape Out[1]: (4,) # One dimensional array
In[1]: c Out[1]: array([[[1, 2, 3], [4, 5, 6]], [[1, 2, 3], [4, 5, 6]]]) In[1] : c.shape Out[1]: (2, 2, 3) # 3D array
Note:
- The number of dimensions can be read off from the number of outermost brackets.
- The shape of the array, ndarray.shape, is represented as a tuple.
2.3 types of ndarray
dtype is of type numpy.dtype.
Name | Description | Shorthand |
---|---|---|
np.bool | Boolean (True or False), stored in one byte | 'b' |
np.int8 | One byte, -128 to 127 | 'i' |
np.int16 | Integer, -32768 to 32767 | 'i2' |
np.int32 | Integer, -2^31 to 2^31 - 1 | 'i4' |
np.int64 | Integer, -2^63 to 2^63 - 1 | 'i8' |
np.uint8 | Unsigned integer, 0 to 255 | 'u' |
np.uint16 | Unsigned integer, 0 to 65535 | 'u2' |
np.uint32 | Unsigned integer, 0 to 2^32 - 1 | 'u4' |
np.uint64 | Unsigned integer, 0 to 2^64 - 1 | 'u8' |
np.float16 | Half-precision float: 16 bits (1 sign bit, 5 exponent bits, 10 mantissa bits) | 'f2' |
np.float32 | Single-precision float: 32 bits (1 sign bit, 8 exponent bits, 23 mantissa bits) | 'f4' |
np.float64 | Double-precision float: 64 bits (1 sign bit, 11 exponent bits, 52 mantissa bits) | 'f8' |
np.complex64 | Complex number represented by two 32-bit floats (real and imaginary parts) | 'c8' |
np.complex128 | Complex number represented by two 64-bit floats (real and imaginary parts) | 'c16' |
np.object | Python object | 'O' |
np.string_ | String | 'S' |
np.unicode_ | Unicode type | 'U' |
data = np.array([1.1, 2.2, 3.3]) In[1] : data Out[1]: array([1.1, 2.2, 3.3]) In[1] : data.dtype Out[1]: dtype('float64') # Default floating point type
When creating an array, specify the type:
In[1] : np.array([1.1, 2.2, 3.3], dtype="float32") Out[1]: array([1.1, 2.2, 3.3], dtype=float32) In[1] : np.array([1.1, 2.2, 3.3], dtype=np.float32) Out[1]: array([1.1, 2.2, 3.3], dtype=float32)
# Rarely used arr = np.array(['python','tensorflow','scikit-learn', 'numpy'], dtype =np.string_) In[1] : arr Out[1]: array([b'python', b'tensorflow', b'scikit-learn', b'numpy'], dtype='|S12')
Note: if no dtype is specified, int64 is the default integer type and float64 is the default floating-point type.
3 basic operation
Usage: ndarray.method() or np.function_name().
3.1 method of generating array
3.1.1 generate arrays of 0 and 1
- empty(shape[, dtype, order])
- empty_like(a[, dtype, order, subok])
- eye(N[, M, k, dtype, order])
- identity(n[, dtype])
- ones(shape[, dtype, order])
- ones_like(a[, dtype, order, subok])
- zeros(shape[, dtype, order])
- zeros_like(a[, dtype, order, subok])
- full(shape, fill_value[, dtype, order]) (np.eye and np.full are demonstrated in the sketch below)
- full_like(a, fill_value[, dtype, order, subok])
# Generate an array of 0 In[1] : np.zeros(shape=(3, 4), dtype="float32") Out[1]: array([[0., 0., 0., 0.], [0., 0., 0., 0.], [0., 0., 0., 0.]], dtype=float32) # Generate an array of 1 In[1] : np.ones(shape=[2, 3], dtype=np.int32) Out[1]: array([[1, 1, 1], [1, 1, 1]], dtype=int32)
Note: the ndarray.shape attribute is displayed as a tuple; when specifying a shape, either a tuple or a list can be used.
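The list above also includes np.eye, np.identity, and np.full; as referenced there, here is a minimal sketch of np.eye and np.full (both standard NumPy functions):

```python
import numpy as np

# Identity-like array: 1s on the main diagonal, 0s elsewhere
np.eye(3)
# array([[1., 0., 0.],
#        [0., 1., 0.],
#        [0., 0., 1.]])

# Array filled with a constant value
np.full(shape=(2, 3), fill_value=9, dtype=np.int32)
# array([[9, 9, 9],
#        [9, 9, 9]], dtype=int32)
```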
3.1.2 generating from an existing array
- array(object[, dtype, copy, order, subok, ndmin])
- asarray(a[, dtype, order])
- asanyarray(a[, dtype, order])
- ascontiguousarray(a[, dtype])
- asmatrix(data[, dtype])
- copy(a[, order])
The difference between np.array() and np.asarray():
a = np.array([[1, 2, 3], [4, 5, 6]]) In[1] : a Out[1]: array([[1, 2, 3], [4, 5, 6]])
- np.array(): creates a new array (a copy) from an existing array
a1 = np.array(a) In[1] : a1 Out[1]: array([[1, 2, 3], [4, 5, 6]])
- np.asarray(): behaves like a reference to the original data and does not actually create a new array
a2 = np.asarray(a) In[1] : a2 Out[1]: array([[1, 2, 3], [4, 5, 6]])
- np.copy()
a3 = np.copy(a) In[1] : a3 Out[1]: array([[1, 2, 3], [4, 5, 6]])
Summary: a1 = np.array(a), a2 = np.asarray(a), and a3 = np.copy(a) all display the same data at this point.
Now modify a value in the original array a:
a[1, 1] = 1000 In[1] : a Out[1]: array([[ 1, 2, 3], [ 4, 1000, 6]])
- a1 = np.array(a) , data unchanged, deep copy
In[1] : a1 Out[1]: array([[1, 2, 3], [4, 5, 6]])
- a2 = np.asarray(a): data changed, shallow copy
In[1] : a2 Out[1]: array([[ 1, 2, 3], [ 4, 1000, 6]])
- a3 = np.copy(a): data unchanged, deep copy
In[1] : a3 Out[1]: array([[1, 2, 3], [4, 5, 6]])
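To verify the deep-copy versus shallow-copy behaviour described above, np.shares_memory (a standard NumPy helper) reports whether two arrays share the same underlying buffer; a minimal sketch:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])
a1 = np.array(a)     # deep copy
a2 = np.asarray(a)   # reference to the same data
a3 = np.copy(a)      # deep copy

np.shares_memory(a, a1)  # False
np.shares_memory(a, a2)  # True
np.shares_memory(a, a3)  # False
```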
3.1.3 generate fixed range array
np.linspace(start, stop, num, endpoint, retstep, dtype)
Generates an evenly spaced (arithmetic) sequence.
- start: start value of the sequence
- stop: end value of the sequence
- num: number of evenly spaced samples to generate, default 50
- endpoint: whether the sequence includes the stop value, default True (i.e. the closed interval [start, stop])
- retstep: if True, also return the step size between consecutive samples
- dtype: data type of the output ndarray
In[1] : np.linspace(0, 10, 5) Out[1]: array([ 0. , 2.5, 5. , 7.5, 10. ])
numpy.arange(start, stop, step, dtype): half-open interval [start, stop); step is the spacing between values.
In[1] : np.arange(0, 10, 5) Out[1]: array([ 0, 5])
numpy.logspace(start, stop, num, endpoint, base, dtype): builds a geometric sequence, i.e. num values spaced evenly on a logarithmic scale between base**start and base**stop.
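A minimal sketch of np.logspace with its default base of 10:

```python
import numpy as np

# 3 values spaced evenly on a logarithmic scale, from 10**0 to 10**2
np.logspace(0, 2, num=3)
# array([  1.,  10., 100.])
```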
3.1.4 generate random arrays: the np.random module
1. Uniform distribution
One of the important distributions in probability and statistics. As the name implies, uniform means every value is equally likely. Uniform distributions are rare in nature; a plantation with fixed row spacing is an example of an evenly (uniformly) distributed plant community.
- np.random.rand(d0, d1, ..., dn): returns an array of the given shape with samples drawn uniformly from [0.0, 1.0)
- np.random.uniform(low=0.0, high=1.0, size=None)
  - Function: draw random samples from a uniform distribution over [low, high); note the interval is closed on the left and open on the right
  - Parameters:
    - low: lower bound of the sampling interval, float, default 0
    - high: upper bound of the sampling interval, float, default 1
    - size: number of output samples, int or tuple; for example, size=(m, n, k) outputs m*n*k samples; by default a single value is returned
  - Return value: an ndarray whose shape matches the size parameter
- np.random.randint(low, high=None, size=None, dtype='l')
  - Draws random integers from a uniform distribution, producing a single integer or an N-dimensional array of integers. Range: if high is not None, integers are drawn from [low, high); otherwise from [0, low). (A sketch of np.random.randint follows the uniform-distribution example below.)
Example:
data1 = np.random.uniform(low=-1, high=1, size=1000000) In[1] : data1 Out[1]: array([-0.49795073, -0.28524454, 0.56473937, ..., 0.6141957 , 0.4149972 , 0.89473129])
Plot a histogram to see the distribution:
import matplotlib.pyplot as plt

# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=80)
# 2. Draw the histogram
plt.hist(data1, 1000)
# 3. Display the image
plt.show()
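As referenced in the list above, a minimal sketch of np.random.randint (the printed values are only illustrative, since the output is random):

```python
import numpy as np

# A 3 x 4 array of random integers drawn from [0, 10)
np.random.randint(0, 10, size=(3, 4))
# e.g. array([[3, 7, 0, 9],
#             [5, 2, 8, 1],
#             [6, 4, 4, 0]])
```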
2. Normal distribution
Introduction:
- A normal distribution is a probability distribution of a continuous random variable with two parameters, $\mu$ and $\sigma$. The first parameter, $\mu$, is the mean of the normally distributed random variable; the second, $\sigma$, is its standard deviation, so the distribution is written $N(\mu, \sigma)$.
- Application: the probability distributions of many random quantities in daily life, production, and scientific experiments can be approximately described by a normal distribution.
- Characteristics of the normal distribution: $\mu$ determines the location of the distribution, and the standard deviation $\sigma$ determines how spread out or concentrated it is. When $\mu = 0$ and $\sigma = 1$, the normal distribution is the standard normal distribution.
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
- Standard deviation: square root of variance.
Variance:
$$s^2 = \frac{(x_1-\mu)^2 + (x_2-\mu)^2 + (x_3-\mu)^2 + \cdots + (x_n-\mu)^2}{n}$$
where $\mu$ is the mean, $n$ is the number of data points, and $s$ is the standard deviation $\sigma$:
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i-\mu)^2}$$
Significance of the standard deviation and variance: they measure the degree of dispersion of a set of data in probability theory and statistics.
Normal distribution syntax:
- np.random.randn(d0, d1, ..., dn)
  - Function: return one or more samples from the standard normal distribution
- np.random.normal(loc=0.0, scale=1.0, size=None)
  - loc: float, the mean of the distribution (the center of the curve)
  - scale: float, the standard deviation of the distribution (its width: a larger scale gives a shorter, fatter curve; a smaller scale gives a taller, thinner one)
  - size: int or tuple of ints, the output shape; the default None returns a single value
- np.random.standard_normal(size=None): returns an array of the specified shape sampled from the standard normal distribution (see the sketch after the plotting example below)
Example:
# Normal distribution data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000) In[1] : data2 Out[1]: array([1.66381498, 1.81276401, 1.58393696, ..., 1.72017482, 1.90260969, 1.69554529])
Plot the histogram:
# 1. Create the canvas
plt.figure(figsize=(20, 8), dpi=80)
# 2. Draw the histogram
plt.hist(data2, 1000)
# 3. Display the image
plt.show()
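As a quick sanity check (a sketch; exact numbers vary because the data is random), the sample mean and standard deviation of data2 should be close to loc=1.75 and scale=0.1, and np.random.standard_normal from the list above samples N(0, 1) directly:

```python
import numpy as np

data2 = np.random.normal(loc=1.75, scale=0.1, size=1000000)
data2.mean()  # approximately 1.75
data2.std()   # approximately 0.1

# Standard normal distribution N(0, 1)
np.random.standard_normal(size=(2, 3))
# e.g. array([[ 0.47, -1.21,  0.33],
#             [-0.08,  1.05, -0.64]])
```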
3.2 array operation
Case: randomly generate two weeks of trading-day rise/fall data for 8 stocks.
- Trading days in two weeks: 2 × 5 = 10 days
- The rises and falls are drawn from a normal distribution, e.g. with mean 0 and variance 1
3.2.1 creation of stock up and down data
stock_change = np.random.normal(loc=0, scale=1, size=(8, 10)) In[1] : stock_change Out[1]: array([[-0.03469926, 1.68760014, 0.05915316, 2.4473136 , -0.61776756, -0.56253866, -1.24738637, 0.48320978, 1.01227938, -1.44509723], [-1.8391253 , -1.10142576, 0.09582268, 1.01589092, -1.20262068, 0.76134643, -0.76782097, -1.11192773, 0.81609586, 0.07659056], ... [-2.93762047, 0.22199761, 0.98788788, 0.37899235, 0.28281886, -1.75837237, -0.09262863, -0.92354076, 1.11467277, 0.76034531], [-0.39473551, 0.28402164, -0.15729195, -0.59342945, -1.0311294 , -1.07651428, 0.18618331, 1.5780439 , 1.31285558, 0.10777784]])
3.2.2 index and slice of array
Get the rise/fall data of the first stock over the first three trading days:
# Two dimensional array, two dimensions In[1] : stock_change[0, :3] Out[1]: array([-0.03469926, 1.68760014, 0.05915316])
Indexing one-, two- and three-dimensional arrays:
# Three dimensional, one dimensional a1 = np.array([[[1,2,3],[4,5,6]], [[12,3,34],[5,6,7]]]) In[1] : a1 Out[1]: array([[[ 1, 2, 3], [ 4, 5, 6]], [[12, 3, 34], [ 5, 6, 7]]]) # Index, slice In[1] : a1[0, 0, 1] Out[1]: 2 # shape In[1] : a1.shape Out[1]: (2, 2, 3) # Value modification a1[1, 0, 2] = 100000 In[1] : a1 Out[1]: array([[[ 1, 2, 3], [ 4, 5, 6]], [[ 12, 3, 100000], [ 5, 6, 7]]])
3.2.3 shape modification - transpose
Requirement: convert the data above from stock rows and date columns into date rows and stock columns.
- ndarray.reshape(shape[, order]): returns an array containing the same data with a new shape.
# When converting shapes, make sure the total number of elements matches stock_change.reshape([-1, 20]) # The shape of the array becomes (4, 20); -1 means this dimension is computed automatically
In[1] : stock_change.reshape((10, 8)) # stock_change.reshape([10, 8]) Out[1]: array([[-0.03469926, 1.68760014, 0.05915316, 2.4473136 , -0.61776756, -0.56253866, -1.24738637, 0.48320978], # 8 data in a row [ 1.01227938, -1.44509723, -1.8391253 , -1.10142576, 0.09582268, 1.01589092, -1.20262068, 0.76134643], ...........]) # Original data unchanged In[1] : stock_change Out[1]: array([[-0.03469926, 1.68760014, 0.05915316, 2.4473136 , -0.61776756, -0.56253866, -1.24738637, 0.48320978, 1.01227938, -1.44509723], [-1.8391253 , -1.10142576, 0.09582268, 1.01589092, -1.20262068, 0.76134643, -0.76782097, -1.11192773, 0.81609586, 0.07659056], ............])
Summary:
- ndarray.reshape(shape) only changes the shape; it does not transpose rows and columns. The original data is read row by row, from left to right: 8 values at a time, 10 times in total, giving 10 rows and 8 columns.
- ndarray.reshape(shape) returns the new ndarray, and the original data is unchanged.
- ndarray.resize(new_shape[, refcheck]): Change shape and size of array in-place.
In[1]: stock_change.resize((10, 8)) In[1] : stock_change.shape Out[1]: (10, 8) In[1] : stock_change Out[1]: array([[-0.03469926, 1.68760014, 0.05915316, 2.4473136 , -0.61776756, -0.56253866, -1.24738637, 0.48320978] [1.01227938, -1.44509723, -1.8391253 , -1.10142576, 0.09582268, 1.01589092, -1.20262068, 0.76134643], ............])
Summary:
- ndarray.resize(shape) only changes the shape; it does not transpose rows and columns. The original data is read row by row, from left to right: 8 values at a time, 10 times in total, giving 10 rows and 8 columns.
- ndarray.resize(shape) does not return a value. The original ndarray is modified.
- Transpose of array: exchange rows and columns of array
In[1] : stock_change.T Out[1]: array([[-0.03469926, -1.8391253 , -0.74293074, 0.45914676, -0.50413407, 0.3627785 , -2.93762047, -0.39473551], [ 1.68760014, -1.10142576, -0.7836588 , -0.78330377, -1.35848099, 1.00279706, 0.22199761, 0.28402164], ............])
Summary:
- ndarray.T interchanges rows and columns of an array.
- ndarray.T returns a new ndarray. The original data has not changed.
3.2.4 type modification
ndarray.astype(type): change array type
# Floating point ndarray to integer In[1] : stock_change.astype("int32") Out[1]: array([[ 0, 1, 0, 2, 0, 0, -1, 0, 1, -1], [-1, -1, 0, 1, -1, 0, 0, -1, 0, 0], [ 0, 0, 1, 0, 1, 2, 0, 0, 0, 0], [ 0, 0, -1, 0, 0, -1, 0, -1, 0, 0], [ 0, -1, -2, -1, 0, 0, 0, 1, 1, 0], [ 0, 1, 0, -2, -2, -1, 1, -2, 1, 1], [-2, 0, 0, 0, 0, -1, 0, 0, 1, 0], [ 0, 0, 0, 0, -1, -1, 0, 1, 1, 0]], dtype=int32)
ndarray.tostring([order]) or ndarray.tobytes([order]): Construct Python bytes containing the raw data bytes in the array
Concept: serialization is the process of turning an object or structure in memory into something storable or transferable, i.e. converting the object into a byte stream (bytes, a string, a binary string) so it can be persisted in memory, a database, or a file. Its main purpose is to save the state of the object so that it can be restored later when needed.
In[1] : stock_change.tostring() Out[1]: b'\x95&\x99\xdd\x19\xc4\xa1\.......\xa2\x95x&\x19\x94\x03@\x9f?\x8c\x98P\xdbt\x01\xf5?t\xd8 -T\x97\xbb?'
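For the reverse direction (an illustrative sketch, not part of the original tutorial): the raw bytes can be turned back into an array with np.frombuffer, provided the dtype and shape are known:

```python
import numpy as np

stock_change = np.random.normal(loc=0, scale=1, size=(8, 10))
raw = stock_change.tobytes()  # serialize to raw bytes
restored = np.frombuffer(raw, dtype=stock_change.dtype).reshape(stock_change.shape)

np.array_equal(stock_change, restored)  # True
```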
3.3 array deduplication
Syntax: np.unique(ndarray)
temp = np.array([[1, 2, 3, 4],[3, 4, 5, 6]]) In[1] : temp Out[1]: array([[1, 2, 3, 4], [3, 4, 5, 6]]) # Method 1: In[1] : np.unique(temp) Out[1]: array([1, 2, 3, 4, 5, 6]) # Method 2: In[1] : temp.flatten() # Flatten to one dimension Out[1]: array([1, 2, 3, 4, 3, 4, 5, 6]) In[1] : set(temp.flatten()) # A set removes duplicates; set only works on one-dimensional data Out[1]: {1, 2, 3, 4, 5, 6}
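np.unique can also report how many times each value occurs via its return_counts parameter (a standard NumPy option); a minimal sketch:

```python
import numpy as np

temp = np.array([[1, 2, 3, 4], [3, 4, 5, 6]])
values, counts = np.unique(temp, return_counts=True)
values  # array([1, 2, 3, 4, 5, 6])
counts  # array([1, 1, 2, 2, 1, 1])
```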
4 ndarray operation
4.1 logical operation
Operate on data that meets a condition.
4.1.1 logical judgment
stock_change = np.random.normal(loc=0, scale=1, size=(5, 5)) In[1] : stock_change Out[1]: array([[ 1.46338968, -0.45576704, 0.29667843, 0.16606916, 0.46446682], [ 0.36775845, 0.24078108, 0.122042 , 1.19314047, 1.34072589], [-1.48252741, -0.69347186, 0.91122464, -0.30606473, 0.41598897], [ 0.39438905, -1.31770556, 1.7344868 , -1.52812773, -0.47703227], [-0.9822216 , -1.09482936, -0.81834523, 0.57335311, 0.97390091]]) # In logical judgment, if the increase or decrease is greater than 0.5, it is marked as True, otherwise it is False In[1] : stock_change > 0.5 Out[1]: array([[ True, False, False, False, False], [False, False, False, True, True], [False, False, True, False, False], [False, False, True, False, False], [False, False, False, True, True]]) # Judge stock_ Is change [0:2, 0:5] all up In[1] : stock_change[0:2, 0:5] > 0 Out[1]: array([[ True, False, True, True, True], [ True, True, True, True, True]])
4.1.2 Boolean index
In[1] : stock_change[stock_change > 0.5] = 1.1 In[1] : stock_change Out[1]: array([[ 1.1 , -0.45576704, 0.29667843, 0.16606916, 0.46446682], [ 0.36775845, 0.24078108, 0.122042 , 1.1 , 1.1 ], [-1.48252741, -0.69347186, 1.1 , -0.30606473, 0.41598897], [ 0.39438905, -1.31770556, 1.1 , -1.52812773, -0.47703227], [-0.9822216 , -1.09482936, -0.81834523, 1.1 , 1.1 ]])
4.2 general judgment function
np.all(boolean array): returns True only if all elements are True; returns False if any element is False.
In[1] : np.all(stock_change[0:2, 0:5] > 0) Out[1]: False
np.any(boolean array): returns True if any element is True; returns False only if all elements are False.
# Judge whether the top five stocks have risen in this period In[1] : np.any(stock_change[:5, :] > 0) Out[1]: True
4.3 np.where (ternary operator)
Using np.where(condition, value_if_true, value_if_false), more complex operations can be performed.
# If the rise and fall of the first four days is greater than 0, it is set as 1, otherwise it is 0 temp = stock_change[:4, :4] # Method 1: In[1] : temp Out[1]: array([[ 1.1 , -0.45576704, 0.29667843, 0.16606916], [ 0.36775845, 0.24078108, 0.122042 , 1.1 ], [-1.48252741, -0.69347186, 1.1 , -0.30606473], [ 0.39438905, -1.31770556, 1.1 , -1.52812773]]) In[1] : np.where(temp > 0, 1, 0) # Set the position greater than zero to 1 and the position less than zero to 0 Out[1]: array([[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 0]]) # Method 2: In[1] : temp > 0 Out[1]: array([[ True, False, True, True], [ True, True, True, True], [False, False, True, False], [ True, False, True, False]]) In[1] : np.where([[ True, False, True, True], [ True, True, True, True], [False, False, True, False], [ True, False, True, False]], 1, 0) Out[1]: array([[1, 0, 1, 1], [1, 1, 1, 1], [0, 0, 1, 0], [1, 0, 1, 0]])
Compound logic requires combining np.logical_and() and np.logical_or().
# If the rise and fall of the first four days is greater than 0.5 and less than 1, change to 1, otherwise 0 # (temp > 0.5) and (temp < 1) error reporting # Method 1: In[1] : np.logical_and(temp > 0.5, temp < 1) Out[1]: array([[False, False, False, False], [False, False, False, False], [False, False, False, False], [False, False, False, False]]) In[1] : np.where([[False, False, False, False], [False, False, False, False], [False, False, False, False], [False, False, False, False]], 1, 0) Out[1]: array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]) # Method 2: In[1] : np.where(np.logical_and(temp > 0.5, temp < 1), 1, 0) Out[1]: array([[0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]])
# If the rise and fall of the first four days is greater than 0.5 or less than - 0.5, change to 1, otherwise it is 0 # Method 1: In[1] : np.logical_or(temp > 0.5, temp < -0.5) Out[1]: array([[ True, False, False, False], [False, False, False, True], [ True, True, True, False], [False, True, True, True]]) # ditto # Method 2: In[1] : np.where(np.logical_or(temp > 0.5, temp < -0.5), 11, 3) Out[1]: array([[11, 3, 3, 3], [ 3, 3, 3, 11], [11, 11, 11, 3], [ 3, 11, 11, 11]])
4.4 statistical operation
4.4.1 statistical indicators
Use np.function_name() or ndarray.method_name(). In data mining / machine learning, statistical indicators are also a way for us to analyze problems.
- np.min(a[, axis, out, keepdims]): return the minimum of an array or the minimum along an axis
- np.max(a[, axis, out, keepdims]): return the maximum of an array or the maximum along an axis
- np.median(a[, axis, out, overwrite_input, keepdims]): compute the median along the specified axis
- np.mean(a[, axis, dtype, out, keepdims]): compute the arithmetic mean along the specified axis
- np.std(a[, axis, dtype, out, ddof, keepdims]): compute the standard deviation along the specified axis
- np.var(a[, axis, dtype, out, ddof, keepdims]): compute the variance along the specified axis
4.4.2 statistical calculation of stock rise and fall
When computing statistics, pay attention to the axis argument: different APIs interpret axis values differently, but in NumPy axis=0 computes the statistic down each column and axis=1 computes it across each row.
# The largest rise of each of the first four stocks over the first four days In[1] : temp # shape: (4, 4) Out[1]: array([[ 1.1 , -0.45576704, 0.29667843, 0.16606916], [ 0.36775845, 0.24078108, 0.122042 , 1.1 ], [-1.48252741, -0.69347186, 1.1 , -0.30606473], [ 0.39438905, -1.31770556, 1.1 , -1.52812773]]) # Maximum over all elements In[1] : temp.max() Out[1]: 1.1 # Maximum of each column In[1] : temp.max(axis=0) Out[1]: array([1.1 , 0.24078108, 1.1 , 1.1 ]) # Maximum of each row In[1] : np.max(temp, axis=-1) # -1 refers to the last dimension Out[1]: array([1.1, 1.1, 1.1, 1.1])
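To make the axis semantics concrete, a minimal sketch on a small illustrative array (not the stock data) showing how axis=0 works per column and axis=1 per row for several of the statistics listed above:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])

a.min(axis=0)       # per column -> array([1, 2, 3])
a.min(axis=1)       # per row    -> array([1, 4])
np.mean(a, axis=0)  # array([2.5, 3.5, 4.5])
np.std(a, axis=1)   # array([0.81649658, 0.81649658])
```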
Returns the location of the maximum and minimum values:
- np.argmax(temp, axis=)
- np.argmin(temp, axis=)
# For each stock, find the day with the largest rise In[1] : np.argmax(temp, axis=-1) Out[1]: array([0, 3, 2, 2], dtype=int64)
Note: if the maximum (or minimum) value occurs in multiple locations, the index of the first occurrence is returned.
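A one-line sketch of this tie-breaking behaviour:

```python
import numpy as np

np.argmax(np.array([1, 3, 3, 2]))  # 1 -- index of the first occurrence of the maximum
```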
5 operation between arrays
5.1 operation of array and number
arr = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]]) In[1] : arr / 10 Out[1]: array([[0.1, 0.2, 0.3, 0.2, 0.1, 0.4], [0.5, 0.6, 0.1, 0.2, 0.3, 0.1]]) # Comparison of python list operations a = [1, 2, 3, 4, 5] In[1] : a * 3 Out[1]: [1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3, 4, 5]
5.2 operation of array and array
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]]) arr2 = np.array([[1, 2, 3, 4], [3, 4, 5, 6]]) In[1] : arr1 # shape (2, 6) Out[1]: array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]]) In[1] : arr2 # shape (2, 4) Out[1]: array([[1, 2, 3, 4], [3, 4, 5, 6]]) # arr1 + arr2 raises an error: the shapes are incompatible
5.3 broadcast mechanism
The premise of broadcasting is that two ndarrays are combined element-wise. The purpose of the broadcasting mechanism is to make mathematical operations convenient between ndarrays (the core data structure of the NumPy library) of different shapes.
- When operating on two arrays, NumPy compares their shapes dimension by dimension from right to left. Two dimensions are compatible if either of the following holds:
  - the dimensions are equal, or
  - one of the corresponding dimensions is 1.
- The shape of the result takes the maximum size of the input arrays along each dimension.
For example, the following shapes can be broadcast together:

Image  (3d array): 256 × 256 × 3
Scale  (1d array):             3
Result (3d array): 256 × 256 × 3

A      (4d array): 9 × 1 × 7 × 1
B      (3d array):     8 × 1 × 5
Result (4d array): 9 × 8 × 7 × 5

A      (2d array): 5 × 4
B      (1d array):     1
Result (2d array): 5 × 4

A      (3d array): 15 × 3 × 5
B      (3d array): 15 × 1 × 1
Result (3d array): 15 × 3 × 5
Example:
arr1 = np.array([[1, 2, 3, 2, 1, 4], [5, 6, 1, 2, 3, 1]]) # shape (2, 6) arr2 = np.array([[1], [3]]) # shape (2, 1) In[1] : arr1 + arr2 # result shape (2, 6) Out[1]: array([[2, 3, 4, 3, 2, 5], [8, 9, 4, 5, 6, 4]]) In[1] : arr1 * arr2 Out[1]: array([[ 1, 2, 3, 2, 1, 4], [15, 18, 3, 6, 9, 3]])
5.4 matrix operation
5.4.1 what is a matrix
A matrix (English: matrix) differs from an array in that a matrix must be two-dimensional, while an array can have any number of dimensions.
In other words, a matrix is always a two-dimensional array, but a two-dimensional array is not necessarily a matrix.
There are two ways to store a matrix:
- an ndarray two-dimensional array
- the matrix data structure
  - np.mat(): converts an array (a two-dimensional ndarray) into the matrix type
# Method 1: store the matrix in an ndarray data = np.array([[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]) # Method 2: store it in a matrix object In[1] : np.mat(data) Out[1]: matrix([[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]) In[1] : type(np.mat(data)) Out[1]: numpy.matrix
5.4.2 matrix multiplication
Matrix multiplication: (M rows, N columns) × (N rows, L columns) = (M rows, L columns).
ndarray matrix-multiplication API for 2-D arrays: np.matmul(), np.dot()
1. ndarray matrix operation:
In[1] : data # (8, 2) * (2, 1) = (8, 1) Out[1]: array([[80, 86], [82, 80], [85, 78], [90, 90], [86, 82], [82, 90], [78, 80], [92, 94]]) weights = np.array([[0.3], [0.7]]) In[1] : weights Out[1]: array([[0.3], [0.7]]) # Method 1: In[1] : np.matmul(data, weights) Out[1]: array([[84.2], [80.6], [80.1], [90. ], [83.2], [87.6], [79.4], [93.4]]) # Method 2: In[1] : np.dot(data, weights) Out[1]: array([[84.2], [80.6], [80.1], [90. ], [83.2], [87.6], [79.4], [93.4]])
Note: data * weights does not satisfy the broadcasting rules (shapes (8, 2) and (2, 1) are incompatible), so element-wise multiplication fails.
2. Matrix multiplication with the matrix data structure
data_mat = np.mat(data) weights_mat = np.mat([[0.3], [0.7]]) In[1] : weights_mat Out[1]: matrix([[0.3], [0.7]]) In[1] : data_mat * weights_mat Out[1]: matrix([[84.2], [80.6], [80.1], [90. ], [83.2], [87.6], [79.4], [93.4]])
3. Matrix multiplication of ndarrays with the @ operator:
In[1] : data @ weights Out[1]: array([[84.2], [80.6], [80.1], [90. ], [83.2], [87.6], [79.4], [93.4]])
6 merging and splitting
Merge and split data.
6.1 merging
- numpy.hstack(tup): Stack arrays in sequence horizontally(column wise).
- numpy.vstack(tup): Stack arrays in sequence vertically(row wise).
- numpy.concatenate((a1, a2, ...), axis=0)
data1 = np.array([[1, 2],[3, 4],[5, 6]]) # shape (3, 2) In[1] : data1 Out[1]: array([[1, 2], [3, 4], [5, 6]]) data2 = np.array([[7, 8],[9, 10],[11, 12]]) In[1] : data2 Out[1]: array([[ 7, 8], [ 9, 10], [11, 12]]) # Column-wise concatenation In[1] : np.hstack((data1, data2)) Out[1]: array([[ 1, 2, 7, 8], [ 3, 4, 9, 10], [ 5, 6, 11, 12]]) # With axis=1, the arrays are concatenated along the column direction In[1] : np.concatenate((data1, data2), axis = 1) Out[1]: array([[ 1, 2, 7, 8], [ 3, 4, 9, 10], [ 5, 6, 11, 12]]) # Row-wise concatenation In[1] : np.vstack((data1, data2)) Out[1]: array([[ 1, 2], [ 3, 4], [ 5, 6], [ 7, 8], [ 9, 10], [11, 12]]) # With axis=0, the arrays are concatenated along the row direction In[1] : np.concatenate((data1, data2), axis = 0) Out[1]: array([[ 1, 2], [ 3, 4], [ 5, 6], [ 7, 8], [ 9, 10], [11, 12]])
6.2 splitting
- numpy.split(ary, indices_or_sections, axis=0): Split an array into multiple sub-arrays.
x = np.arange(9.0) # Split into three equal groups by number of groups In[1] : x Out[1]: array([0., 1., 2., 3., 4., 5., 6., 7., 8.]) In[1] : np.split(x,3) Out[1]: [array([0., 1., 2.]), array([3., 4., 5.]), array([6., 7., 8.])] # Split by indices into four groups: [0, 3), [3, 5), [5, 6), [6, ...) In[1] : np.split(x,[3, 5, 6]) Out[1]: [array([0., 1., 2.]), array([3., 4.]), array([5.]), array([6., 7., 8.])]
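For two-dimensional arrays, np.split also accepts an axis argument, and np.hsplit / np.vsplit (standard NumPy helpers) split along columns and rows respectively; a minimal sketch:

```python
import numpy as np

m = np.arange(12).reshape(3, 4)
# array([[ 0,  1,  2,  3],
#        [ 4,  5,  6,  7],
#        [ 8,  9, 10, 11]])

np.hsplit(m, 2)         # two column blocks
# [array([[0, 1], [4, 5], [8, 9]]),
#  array([[ 2,  3], [ 6,  7], [10, 11]])]

np.split(m, 3, axis=0)  # three rows, same as np.vsplit(m, 3)
# [array([[0, 1, 2, 3]]), array([[4, 5, 6, 7]]), array([[ 8,  9, 10, 11]])]
```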
7 I/O operations and data processing (pandas is more commonly used)
Problem: most data lives in files and needs to be loaded with tools.
NumPy itself is not well suited to reading and processing such data; this section is only a brief overview.
7.1 Numpy read
- genfromtxt(fname[, dtype, comments, delimiter, ...]): load data from a text file, with missing values handled as specified.
  - fname: file path
  - delimiter: separator
data = np.genfromtxt("test.csv", delimiter=",") In[1] : data Out[1]: array([[ nan, nan, nan, nan], [ 1. , 123. , 1.4, 23. ], [ 2. , 110. , nan, 18. ], [ 3. , nan, 2.1, 19. ]])
Disadvantage: numpy reads strings as missing values (nan)
7.2 missing value handling
Missing values: when a local file is read as float, missing entries (or None) become nan.
If missing values were simply replaced with 0 while the true mean is greater than 0, the mean after replacement would be pulled down; filling with the mean avoids this bias.
There are two ways to handle missing values:
- delete the samples (rows) that contain missing values directly
- replace / impute: e.g. compute the mean of each column and fill the missing entries with it
How to compute the median or mean of a set of data, and how to delete rows (or columns) with missing data, are covered in pandas.
Check the type of a nan value:
In[1] : type(data[2, 2]) Out[1]: numpy.float64 # float
Handling nan with NumPy directly (not recommended):
def fill_nan_by_column_mean(t):
    # Iterate over the columns
    for i in range(t.shape[1]):
        # Count the nan values in this column (nan != nan)
        nan_num = np.count_nonzero(t[:, i][t[:, i] != t[:, i]])
        if nan_num > 0:
            now_col = t[:, i]
            # Sum of the non-nan values
            now_col_not_nan = now_col[np.isnan(now_col) == False].sum()
            # Divide the sum by the number of non-nan values
            now_col_mean = now_col_not_nan / (t.shape[0] - nan_num)
            # Fill the nan positions with the column mean
            now_col[np.isnan(now_col)] = now_col_mean
            # Write the column back into t
            t[:, i] = now_col
    return t
In[1] : data Out[1]: array([[ nan, nan, nan, nan], [ 1. , 123. , 1.4, 23. ], [ 2. , 110. , nan, 18. ], [ 3. , nan, 2.1, 19. ]]) In[1] : fill_nan_by_column_mean(data) Out[1]: array([[ 2. , 116.5 , 1.75, 20. ], [ 1. , 123. , 1.4 , 23. ], [ 2. , 110. , 1.75, 18. ], [ 3. , 116.5 , 2.1 , 19. ]])
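A more concise alternative sketch using np.nanmean and np.where (standard NumPy functions; this is not the tutorial's own implementation): it fills each column's nan entries with that column's mean and reproduces the result above:

```python
import numpy as np

data = np.array([[np.nan, np.nan, np.nan, np.nan],
                 [1.0, 123.0, 1.4, 23.0],
                 [2.0, 110.0, np.nan, 18.0],
                 [3.0, np.nan, 2.1, 19.0]])

col_mean = np.nanmean(data, axis=0)                # per-column mean, ignoring nan
filled = np.where(np.isnan(data), col_mean, data)  # broadcast the means into the nan slots
# array([[  2.  , 116.5 ,   1.75,  20.  ],
#        [  1.  , 123.  ,   1.4 ,  23.  ],
#        [  2.  , 110.  ,   1.75,  18.  ],
#        [  3.  , 116.5 ,   2.1 ,  19.  ]])
```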