Big data - Python data analysis 2 (numpy module)

NumPy (Numerical Python) is an extension library for the Python language that supports large multidimensional arrays and matrix operations, together with a large library of mathematical functions for operating on arrays.

Create a matrix (using ndarray objects)

When working with the numpy module in Python, you usually use the ndarray object it provides. An ndarray object is easy to create by passing a list as the argument. For example:

import numpy as np # Import the numpy library

# Create a one-dimensional ndarray object

a = np.array([1,2,3,4,5])


# Create a two-dimensional ndarray object

a2 = np.array([[1,2,3,4,5],[6,7,8,9,10]])


# Higher-dimensional arrays are created analogously, by nesting lists more deeply
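As a minimal sketch of the higher-dimensional case, the following builds a three-dimensional array from a nested list and inspects it. The variable name a3 and the use of the ndim and dtype attributes are illustrative additions, not part of the original example.

```python
import numpy as np

# Hypothetical 3-D example: 2 blocks, each with 2 rows and 3 columns
a3 = np.array([[[1, 2, 3], [4, 5, 6]],
               [[7, 8, 9], [10, 11, 12]]])

print(a3.ndim)   # number of dimensions: 3
print(a3.shape)  # length of each dimension: (2, 2, 3)
print(a3.dtype)  # element type inferred from the list (a platform-dependent integer type)
```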

Get the number of rows and columns of a matrix (in two dimensions)

If you are used to programming in MATLAB, you usually get the number of rows and columns of a matrix before traversing it. To get the length of each dimension of an ndarray object, use its shape attribute

import numpy as np

a = np.array([[1,2,3,4,5],[6,7,8,9,10]])

print(a.shape) # Returns a tuple: (2, 5)

print(a.shape[0]) #Get the number of rows, return 2

print(a.shape[1]) #Get the number of columns, return 5
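The traversal pattern mentioned above can be sketched as follows. The names rows, cols and total are illustrative; in practice a vectorized call such as a.sum() is usually preferred over explicit loops.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

rows, cols = a.shape  # unpack the shape tuple

# Traverse the matrix element by element, MATLAB-style
total = 0
for i in range(rows):
    for j in range(cols):
        total += a[i, j]

print(rows, cols)  # 2 5
print(total)       # 55
print(a.size)      # total number of elements: 10
```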

Slicing a Matrix

Slicing by row and column

Matrices are sliced the same way as lists, using [] (square brackets)

import numpy as np

a = np.array([[1,2,3,4,5],[6,7,8,9,10]])

print(a[0:1]) # Slice the first row; returns [[1 2 3 4 5]]

print(a[1,2:5]) # Slice the second row, third through fifth columns; returns [ 8  9 10]

print(a[1,:]) # Slice the second row; returns [ 6  7  8  9 10]
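A few further slicing forms may be useful. This is a small sketch using the same matrix; the variable names col, last_two and every_other are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

col = a[:, 1]            # second column of every row
last_two = a[:, -2:]     # last two columns (negative indices count from the end)
every_other = a[0, ::2]  # first row, every second element

print(col)          # [2 7]
print(last_two)     # 2*2 matrix holding 4, 5 and 9, 10
print(every_other)  # [1 3 5]
```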

Conditional slicing

Conditional slicing (Boolean indexing) passes a Boolean expression on the matrix itself inside [] (square brackets)

import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
b = a[a>6] # Select the elements of a greater than 6; the result is a one-dimensional array
print(b) # Returns [ 7  8  9 10]

# In fact, the Boolean expression first generates a Boolean matrix, which then selects elements when passed inside [] (square brackets)
print(a>6)
"""
Return
[[False False False False False]
[False True True True True]]
"""

Conditional slicing is often used to set the elements of a matrix that meet a certain condition to a specific value. For example, the following sets all even elements of a matrix to 0

import numpy as np
a = np.array([[1,2,3,4,5],[6,7,8,9,10]])
print(a)
"""
The starting matrix is
[[ 1 2 3 4 5]
[ 6 7 8 9 10]]
"""

a[a % 2 == 0] = 0
print(a)
"""
The matrix after even zeroing is
[[1 0 3 0 5]
 [0 7 0 9 0]]
"""

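The assignment above modifies a in place. As a hedged alternative sketch, np.where builds a new array and leaves the original untouched, and several conditions can be combined with & and |. The names b and c are illustrative.

```python
import numpy as np

a = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]])

# Replace even elements with 0 in a new array, leaving `a` unchanged
b = np.where(a % 2 == 0, 0, a)

# Combine conditions with & (and) or | (or); each condition needs its own parentheses
c = a[(a > 3) & (a < 8)]

print(b)
print(c)  # [4 5 6 7]
```
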
Combination of matrices

Matrices can be combined with the hstack and vstack functions in numpy

import numpy as np

a1 = np.array([[1,2],[3,4]])
a2 = np.array([[5,6],[7,8]])
# Note: the arrays to combine are passed in together as a single list or tuple
print(np.hstack([a1,a2]))
"""
Merge horizontally and return the following results
[[1 2 5 6]
[3 4 7 8]]
"""

print(np.vstack((a1,a2)))
"""
Vertical merge, returns the following results
[[1 2]
[3 4]
[5 6]
[7 8]]
"""

Matrices can also be combined with the concatenate function

np.concatenate( (a1,a2), axis=0 ) # Equivalent to np.vstack( (a1,a2) )

np.concatenate( (a1,a2), axis=1 ) # Equivalent to np.hstack( (a1,a2) )
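The stated equivalence can be checked directly. This is a small self-contained sketch; np.array_equal is used here only as a convenient comparison helper.

```python
import numpy as np

a1 = np.array([[1, 2], [3, 4]])
a2 = np.array([[5, 6], [7, 8]])

v = np.concatenate((a1, a2), axis=0)  # stack rows
h = np.concatenate((a1, a2), axis=1)  # stack columns

same_v = np.array_equal(v, np.vstack((a1, a2)))
same_h = np.array_equal(h, np.hstack((a1, a2)))

print(same_v, same_h)    # True True
print(v.shape, h.shape)  # (4, 2) (2, 4)
```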

Create Matrix by Function

The numpy module comes with functions to create ndarray objects, which make it easy to create common or regular matrices.

arange

import numpy as np
a = np.arange(10) # The default is from 0 to 10 (excluding 10) with a step of 1
print(a) # Return [0 1 2 3 4 5 6 7 8 9]
a1 = np.arange(5,10) # From 5 to 10 (excluding 10) with a step of 1
print(a1) # Return [5 6 7 8 9]
a2 = np.arange(5,20,2) # From 5 to 20 (excluding 20), with a step of 2
print(a2) # Return [5 7 9 11 13 15 17 19]
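Since arange always returns a one-dimensional array, one common way to obtain a regular matrix from it is to chain reshape. The reshape method is not covered in the original text and is introduced here only as an illustrative assumption.

```python
import numpy as np

# 10 consecutive integers arranged as 2 rows and 5 columns
m = np.arange(1, 11).reshape(2, 5)

print(m.shape)  # (2, 5)
print(m)
```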

linspace

linspace() is similar to MATLAB's linspace: it creates a specified number of evenly spaced values, that is, an arithmetic sequence.

import numpy as np

# linspace(start, stop, num=50, endpoint=True, retstep=False, dtype=None, axis=0)
"""
start : array_like
        The starting value of the sequence.
    stop : array_like
        The end value of the sequence, unless `endpoint` is set to False.
        In that case, the sequence consists of all but the last of ``num + 1``
        evenly spaced samples, so that `stop` is excluded.  Note that the step
        size changes when `endpoint` is False.
    num : int, optional
        Number of samples to generate. Default is 50. Must be non-negative.
    endpoint : bool, optional
        If True, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    retstep : bool, optional
        If True, return (`samples`, `step`), where `step` is the spacing
        between samples.
    dtype : dtype, optional
        The type of the output array.  If `dtype` is not given, infer the data
        type from the other input arguments.
"""
a = np.linspace(0,10,5) # Generate an arithmetic sequence of 5 numbers starting at 0 and ending at 10
print(a)
# Result [ 0.   2.5  5.   7.5 10. ]

a = np.linspace(0, 10, 5, retstep=True)
# Result: a tuple (samples, step) holding the array [ 0.   2.5  5.   7.5 10. ] and the step 2.5

start: the start value
stop: the end value
num: the number of samples to generate; defaults to 50 if omitted
endpoint: defaults to True, meaning stop is included as the last sample; False excludes stop
retstep: defaults to False; if set to True, returns a tuple of the sequence and the step size
dtype: the data type of the output, inferred from the input arguments by default
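The effect of endpoint on the step size can be seen in a short sketch. The variable names closed, half_open, samples and step are illustrative.

```python
import numpy as np

closed = np.linspace(0, 10, 5)                     # stop included, step 2.5
half_open = np.linspace(0, 10, 5, endpoint=False)  # stop excluded, step 2.0
samples, step = np.linspace(0, 10, 5, endpoint=False, retstep=True)

print(closed)     # values 0, 2.5, 5, 7.5, 10
print(half_open)  # values 0, 2, 4, 6, 8
print(step)       # 2.0
```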

logspace

linspace generates arithmetic sequences, while logspace generates geometric sequences (values evenly spaced on a logarithmic scale)

import numpy as np

# logspace(start, stop, num=50, endpoint=True, base=10.0, dtype=None, axis=0):
"""
    Return numbers spaced evenly on a log scale.

    In linear space, the sequence starts at ``base ** start``
    (`base` to the power of `start`) and ends with ``base ** stop``
    (see `endpoint` below).

    .. versionchanged:: 1.16.0
        Non-scalar `start` and `stop` are now supported.

    Parameters
    ----------
    start : array_like
        ``base ** start`` is the starting value of the sequence.
    stop : array_like
        ``base ** stop`` is the final value of the sequence, unless `endpoint`
        is False.  In that case, ``num + 1`` values are spaced over the
        interval in log-space, of which all but the last (a sequence of
        length `num`) are returned.
    num : integer, optional
        Number of samples to generate.  Default is 50.
    endpoint : boolean, optional
        If true, `stop` is the last sample. Otherwise, it is not included.
        Default is True.
    base : float, optional
        The base of the log space. The step size between the elements in
        ``ln(samples) / ln(base)`` (or ``log_base(samples)``) is uniform.
        Default is 10.0.
    dtype : dtype
        The type of the output array.  If `dtype` is not given, infer the data
        type from the other input arguments.
    axis : int, optional
        The axis in the result to store the samples.  Relevant only if start
        or stop are array-like.  By default (0), the samples will be along a
        new axis inserted at the beginning. Use -1 to get an axis at the end.
"""
a = np.logspace(0,2,5)
print(a)
# Result [  1.           3.16227766  10.          31.6227766  100.        ]

start: base ** start is the starting value of the sequence
stop: base ** stop is the end value of the sequence
num: the number of elements; defaults to 50
endpoint: defaults to True, meaning stop is included as the last sample; False excludes stop
base: the base of the logarithm; defaults to 10
dtype: the data type of the output, inferred from the input arguments by default
axis: the axis of the result along which the samples are stored. Relevant only if start or stop is array-like. By default (0), the samples lie along a new axis inserted at the beginning. Use -1 to place the new axis at the end

import numpy as np

a = np.logspace([0, 0], [2, 2], 5, axis=-1)
b = np.logspace([0, 0], [2, 2], 5)
print(a)
"""
[[  1.           3.16227766  10.          31.6227766  100.        ]
 [  1.           3.16227766  10.          31.6227766  100.        ]]
"""
print(b)
"""
[[  1.           1.        ]
 [  3.16227766   3.16227766]
 [ 10.          10.        ]
 [ 31.6227766   31.6227766 ]
 [100.         100.        ]]
"""

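One way to read logspace is as base raised to the values that linspace would produce. The sketch below checks this relation and shows a non-default base; the variable names are illustrative.

```python
import numpy as np

a = np.logspace(0, 2, 5)          # base defaults to 10
b = 10.0 ** np.linspace(0, 2, 5)  # the same values built by hand
c = np.logspace(0, 4, 5, base=2)  # powers of two: 2**0 through 2**4

print(np.allclose(a, b))  # True
print(c)                  # values 1, 2, 4, 8, 16
```
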
ones,zeros,eye,empty

ones creates a matrix of all ones, zeros creates a matrix of all zeros, eye creates an identity matrix, and empty creates an uninitialized matrix (its values are whatever happens to be in memory)

import numpy as np
a_ones = np.ones((3,4)) # Create a 3*4 matrix of all ones
print(a_ones)
"""
Result
[[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]
[ 1. 1. 1. 1.]]
"""
a_zeros = np.zeros((3,4)) # Create a 3*4 matrix of all zeros
print(a_zeros)

"""
Result
[[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]
[ 0. 0. 0. 0.]]
"""
a_eye = np.eye(3) # Create a 3rd-order identity matrix
print(a_eye)
"""
Result
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
"""
a_empty = np.empty((3,4)) # Create an uninitialized 3*4 matrix
print(a_empty)
"""
Result (arbitrary values left over in memory; they differ from run to run)
[[ 1.78006111e-306 -3.13259416e-294 4.71524461e-309 1.94927842e+289]
[ 2.10230387e-309 5.42870216e+294 6.73606381e-310 3.82265219e-297]
[ 6.24242356e-309 1.07034394e-296 2.12687797e+183 6.88703165e-315]]
"""

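These constructors return floating-point matrices by default. The following sketch shows the dtype argument and two related helpers, np.full and np.zeros_like, which are not mentioned in the original text and are included here as assumptions about what a reader may need next.

```python
import numpy as np

i_ones = np.ones((2, 3), dtype=int)  # integer ones instead of the default float
sevens = np.full((2, 3), 7)          # constant matrix filled with 7
z_like = np.zeros_like(sevens)       # zeros with the same shape and dtype as sevens

buf = np.empty((2, 3))  # contents are arbitrary until assigned
buf[:] = 0.5            # fill an empty() array before reading from it

print(i_ones)
print(sevens)
print(buf)
```
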
fromstring

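The fromstring() function converts a string to an ndarray object, which is useful for obtaining the ASCII codes of a string's characters. Note that this binary mode of fromstring() has been deprecated since NumPy 1.14; the code below uses the recommended replacement, frombuffer(), on the encoded bytes

import numpy as np

a = "abcdef"
# Each ASCII character occupies 8 bits, so specify dtype as np.int8
# np.frombuffer replaces the deprecated call np.fromstring(a, dtype=np.int8)
b = np.frombuffer(a.encode("ascii"), dtype=np.int8)
print(b) # Returns [ 97  98  99 100 101 102]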

fromfunction

The fromfunction() function generates the elements of a matrix from their row and column indices. For example, create a matrix in which each element is the sum of its row index and column index

import numpy as np

def func(i,j):
    return i+j
a = np.fromfunction(func,(5,6))
# The first parameter is the specified function, and the second parameter is a list or tuple, indicating the size of the matrix
print(a)
"""
Result
[[ 0. 1. 2. 3. 4. 5.]
[ 1. 2. 3. 4. 5. 6.]
[ 2. 3. 4. 5. 6. 7.]
[ 3. 4. 5. 6. 7. 8.]
[ 4. 5. 6. 7. 8. 9.]]
"""

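It may help to know that fromfunction calls the function once with whole index arrays rather than once per element. The sketch below, which assumes the same 5*6 shape, passes dtype=int to get an integer result and rebuilds the same matrix with np.indices for comparison.

```python
import numpy as np

def func(i, j):
    # i and j are whole index arrays, not single numbers
    return i + j

# dtype=int makes the index arrays integer, so the result is integer too
a = np.fromfunction(func, (5, 6), dtype=int)

# The same matrix built explicitly from row and column index arrays
i, j = np.indices((5, 6))
b = i + j

print(np.array_equal(a, b))  # True
print(a[4, 5])               # 9
```
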
Operations of Matrix

Common matrix operators

The ndarray object in numpy overloads many operators that can be used to perform operations on the corresponding elements between matrices.

Operator	Explanation
+	Addition of corresponding matrix elements
-	Subtraction of corresponding matrix elements
*	Multiplication of corresponding matrix elements
/	Division of corresponding matrix elements (true division in Python 3, giving a float result; use // for the integer quotient)
%	Remainder after dividing corresponding matrix elements
**	Raise each element of the matrix to a power, e.g. **2 squares each element

import numpy as np

a1 = np.array([[4,5,6],[1,2,3]])
a2 = np.array([[6,5,4],[3,2,1]])
print(a1+a2) # Addition
"""
Result
[[10 10 10]
[ 4 4 4]]
"""

print(a1//a2) # Integer quotient (floor division); in Python 3, a1/a2 is true division and returns floats
"""
Result
[[0 1 1]
[0 1 3]]
"""


print(a1%a2) # Divide Remainder
"""
Result
[[4 0 2]
[1 0 0]]
"""

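The remaining operators from the table can be sketched with the same two matrices. Under Python 3, / yields floats while // yields the integer quotient; the last line shows that a scalar operand is applied to every element.

```python
import numpy as np

a1 = np.array([[4, 5, 6], [1, 2, 3]])
a2 = np.array([[6, 5, 4], [3, 2, 1]])

print(a1 - a2)   # element-wise subtraction
print(a1 * a2)   # element-wise multiplication
print(a1 ** 2)   # each element squared
print(a1 / a2)   # true division, float result
print(a1 // a2)  # floor division, integer result
print(a1 * 10)   # a scalar is applied to every element
```
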
Common Matrix Functions

Similarly, numpy defines many functions that act on each element of a matrix. In the table below, numpy is assumed to be imported with import numpy as np, and a is an ndarray object.

Matrix function	Explanation
np.sin(a)	Take the sine of each element in matrix a, sin(x)
np.cos(a)	Take the cosine of each element in matrix a, cos(x)
np.tan(a)	Take the tangent of each element in matrix a, tan(x)
np.arcsin(a)	Take the arcsine of each element in matrix a, arcsin(x)
np.arccos(a)	Take the arccosine of each element in matrix a, arccos(x)
np.arctan(a)	Take the arctangent of each element in matrix a, arctan(x)
np.exp(a)	Take the exponential of each element in matrix a, e^x
np.sqrt(a)	Take the square root of each element in matrix a, √x

import numpy as np
a = np.array([[1,2,3],[4,5,6]])
print(np.sin(a))
"""
Result
[[ 0.84147098 0.90929743 0.14112001]
[-0.7568025 -0.95892427 -0.2794155 ]]
"""

print(np.arcsin(a))
"""
Result
C:\Users\Administrator\Desktop\learn.py:6: RuntimeWarning: invalid value encountered in arcsin
print(np.arcsin(a))
[[ 1.57079633 nan nan]
[ nan nan nan]]
"""

When elements of the matrix lie outside the function's domain, numpy issues a RuntimeWarning and the corresponding results are nan (not a number)
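If the nan entries are expected, one possible approach is to silence the warning locally with np.errstate and then locate the affected positions with np.isnan. Both helpers are additions for illustration; the original text does not use them.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

# Suppress the "invalid value" warning only inside this block
with np.errstate(invalid="ignore"):
    r = np.arcsin(a)

mask = np.isnan(r)  # True where the input was outside [-1, 1]
print(mask)

# Functions whose domain covers all the inputs produce no nan
print(np.sqrt(a))
print(np.exp(a))
```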

Matrix Multiplication (Dot Product)

Matrix multiplication requires that the number of columns of the first matrix equals the number of rows of the second matrix.

The function for matrix multiplication is dot

import numpy as np
a1 = np.array([[1,2,3],[4,5,6]]) # a1 is a 2*3 matrix
a2 = np.array([[1,2],[3,4],[5,6]]) # a2 is a 3*2 matrix
print(a1.shape[1]==a2.shape[0]) # True, satisfies the matrix multiplication condition
print(a1.dot(a2))
# a1.dot(a2) is equivalent to a1*a2 in matlab
# And a1*a2 in python is equivalent to a1.*a2 in matlab
"""
Result
[[22 28]
[49 64]]
"""

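Besides the dot method, the same product can be written with the np.dot function or, in Python 3.5 and later, with the @ operator. The sketch below checks that all three agree and that swapping the operands changes the result shape.

```python
import numpy as np

a1 = np.array([[1, 2, 3], [4, 5, 6]])    # 2*3 matrix
a2 = np.array([[1, 2], [3, 4], [5, 6]])  # 3*2 matrix

p1 = a1.dot(a2)
p2 = np.dot(a1, a2)
p3 = a1 @ a2  # matrix-multiplication operator

print(np.array_equal(p1, p2) and np.array_equal(p1, p3))  # True
print((a2 @ a1).shape)  # (3, 3): the order of the operands matters
```
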
transpose of matrix

import numpy as np

a = np.array([[1,2,3],[4,5,6]])

print(a.transpose())
"""
Result
[[1 4]
[2 5]
[3 6]]
"""

A simpler way to transpose a matrix is a.T.

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print(a.T)
"""
Result
[[1 4]
[2 5]
[3 6]]
"""

Inverse of a matrix (a^-1)

To invert a matrix, import numpy.linalg and use its inv function. Inversion requires a square matrix (equal numbers of rows and columns) that is non-singular (nonzero determinant)

import numpy as np

import numpy.linalg as lg

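a = np.array([[1,2,3],[4,5,6],[7,8,9]])
# Caution: this matrix is singular (its determinant is 0), so it has no true inverse.
# Depending on the numpy build, inv() either raises LinAlgError or returns the
# meaningless, very large values shown below, which are floating-point round-off artifacts
print(lg.inv(a))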
"""
Result
[[ -4.50359963e+15 9.00719925e+15 -4.50359963e+15]
[ 9.00719925e+15 -1.80143985e+16 9.00719925e+15]
[ -4.50359963e+15 9.00719925e+15 -4.50359963e+15]]
"""

a = np.eye(3) # 3rd-order identity matrix

print(lg.inv(a)) # The inverse of the identity matrix is itself
"""
Result
[[ 1. 0. 0.]
[ 0. 1. 0.]
[ 0. 0. 1.]]
"""

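A meaningful inverse requires an invertible matrix. The sketch below uses a small 2*2 matrix chosen for illustration (it is not from the original text), verifies the result by multiplying back, and uses matrix_rank to show why the 3*3 matrix [[1,2,3],[4,5,6],[7,8,9]] cannot be inverted.

```python
import numpy as np
import numpy.linalg as lg

a = np.array([[1.0, 2.0], [3.0, 4.0]])

d = lg.det(a)      # approximately -2, nonzero, so the matrix is invertible
a_inv = lg.inv(a)

print(d)
print(a_inv)
# Multiplying a matrix by its inverse gives the identity matrix (up to round-off)
print(np.allclose(a.dot(a_inv), np.eye(2)))  # True

s = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
rank_s = lg.matrix_rank(s)
print(rank_s)  # 2, which is less than 3, so s is singular
```
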
Matrix information acquisition (such as mean)

Maximum and Minimum

The functions for the maximum and minimum elements of a matrix are max and min. They can return the maximum or minimum of the entire matrix, of each row, or of each column.

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print(a.max()) #Get the maximum value of the entire matrix Result:6
print(a.min()) #Results: 1

# The keyword argument axis selects the maximum (minimum) of each row or of each column
# axis=0: reduce along the row direction, giving the maximum (minimum) of each column
# axis=1: reduce along the column direction, giving the maximum (minimum) of each row
# For example

print(a.max(axis=0))
# The result is [4 5 6]

print(a.max(axis=1))
# The result is [3 6]

# The positions of the maximum and minimum elements are given by the argmax and argmin functions
print(a.argmax(axis=1))
# The result is [2 2]
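Without an axis argument, argmax returns an index into the flattened matrix. A possible way to turn it back into a (row, column) position is np.unravel_index, shown here as an illustrative addition.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

flat = a.argmax()                      # index into the flattened matrix
pos = np.unravel_index(flat, a.shape)  # convert to a (row, column) pair

print(flat)              # 5
print(pos)               # row 1, column 2
print(a.argmin(axis=0))  # [0 0 0]: the minimum of each column is in row 0
```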

average value

Similarly, the mean of the entire matrix, row, or column can be obtained.

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print(a.mean()) #The result is: 3.5

# Similarly, the keyword axis parameter allows you to specify which direction to get the average value
print(a.mean(axis=0)) # Results [2.5 3.5 4.5]
print(a.mean(axis=1)) # Result [2. 5.]

variance

The variance function is var(). var() is equivalent to mean(abs(x - x.mean())**2), where x is a matrix

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print(a.var()) # Results 2.916666667



print(a.var(axis=0)) # Results [2.25 2.25 2.25]
print(a.var(axis=1)) # Results [0.66666667 0.66666667]
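The stated equivalence can be verified numerically. The sketch also shows the ddof argument, which switches from dividing by N to dividing by N - ddof (the sample variance for ddof=1); this argument is not discussed in the original text.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

manual = np.mean(np.abs(x - x.mean()) ** 2)  # the formula written out
print(np.isclose(manual, x.var()))           # True

sample_var = x.var(ddof=1)  # divide by N-1 instead of N
print(sample_var)           # 3.5
```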

standard deviation

The standard deviation function is std(). std() is equivalent to sqrt(mean(abs(x - x.mean())**2)), or sqrt(x.var())

import numpy as np

a = np.array([[1,2,3],[4,5,6]])
print(a.std()) # Results 1.70782512766

print(a.std(axis=0)) # Results [1.5 1.5 1.5]
print(a.std(axis=1)) # Results [0.81649658 0.81649658]
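As a quick check of the relation between std() and var(), the following sketch compares the two for the whole matrix and along one axis. The names whole_ok and rows_ok are illustrative.

```python
import numpy as np

x = np.array([[1, 2, 3], [4, 5, 6]])

whole_ok = np.isclose(x.std(), np.sqrt(x.var()))
rows_ok = np.allclose(x.std(axis=1), np.sqrt(x.var(axis=1)))

print(whole_ok, rows_ok)  # True True
```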

median

The median is the middle value of a sequence after sorting it by size; if the sequence has an even number of elements, it is the average of the two middle values.

For example, the sequence [5,2,6,4,2] sorted by size is [2,2,4,5,6]; the middle value is 4, so the median of this sequence is 4.

The sequence [5,2,6,4,3,2] sorted by size is [2,2,3,4,5,6]; it has an even number of elements and the two middle values are 3 and 4, so the median of this sequence is 3.5.

The median function is median(), called as numpy.median(x,[axis]). axis specifies the axis along which to compute; the default, axis=None, takes the median of all elements

import numpy as np

x = np.array([[1,2,3],[4,5,6]])

print(np.median(x)) # Median of all numbers
# Result 3.5

print(np.median(x,axis=0)) # Median along the first dimension
# Result [2.5 3.5 4.5]

print(np.median(x,axis=1)) # Median along the second dimension
# Result [2. 5.]
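The two example sequences described at the start of this section can be run through np.median directly. The names odd and even are illustrative.

```python
import numpy as np

odd = np.array([5, 2, 6, 4, 2])      # five elements
even = np.array([5, 2, 6, 4, 3, 2])  # six elements

print(np.sort(odd))     # [2 2 4 5 6]
print(np.median(odd))   # 4.0
print(np.median(even))  # 3.5
```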

Summation

The function for matrix summation is sum(), which can sum along rows, along columns, or over the entire matrix

import numpy as np

a = np.array([[1,2,3],[4,5,6]])

print(a.sum()) # Sum the entire matrix
# Result 21

print(a.sum(axis=0)) # Sum along the row direction (one sum per column)
# Result [5 7 9]

print(a.sum(axis=1)) # Sum along the column direction (one sum per row)
# Result [ 6 15]

Cumulative Sum

The cumulative sum at a position is the sum of all elements up to and including that position.

For example, the cumulative sum of the sequence [1,2,3,4,5] is [1,3,6,10,15]: the first element is 1, the second element is 1+2=3, ..., and the fifth element is 1+2+3+4+5=15.

The cumulative sum function of a matrix is cumsum(), which can accumulate along rows, along columns, or over the entire matrix

import numpy as np

a = np.array([[1,2,3],[4,5,6]])

print(a.cumsum()) # Cumulative sum over the entire (flattened) matrix
# Result [ 1  3  6 10 15 21]

print(a.cumsum(axis=0)) # Cumulative Sum of Row Directions
"""
Result
[[1 2 3]
[5 7 9]]
"""

print(a.cumsum(axis=1)) # Cumulative Sum of Column Directions
"""
Result
[[ 1 3 6]
[ 4 9 15]]
"""

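The one-dimensional example from the text can be reproduced as a short sketch. It also shows two relations that follow from the definition: the last cumulative sum equals the total, and consecutive differences recover the original elements. np.diff is an illustrative addition.

```python
import numpy as np

s = np.array([1, 2, 3, 4, 5])
c = s.cumsum()

print(c)                 # values 1, 3, 6, 10, 15
print(c[-1] == s.sum())  # True: the last cumulative sum is the total
print(np.diff(c))        # [2 3 4 5]: differences recover s[1:]
```
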
Keywords: Python Big Data Machine Learning AI

Added by DarkArchon on Tue, 21 Sep 2021 00:49:46 +0300
