Machine learning - Data Science Library Day 3 - Notes

What is numpy

A basic library for scientific calculation in Python, which focuses on numerical calculation. It is also the basic library of most Python scientific calculation libraries. It is mostly used to perform numerical operations on large and multi-dimensional arrays

Axis

In numpy, it can be understood as direction, which is expressed by 0,1,2... Numbers. For a one-dimensional array, there is only one 0 axis, for a two-dimensional array (shape(2,2)), there are 0 axis and 1 axis, and for a three-dimensional array (shape (2,2,3)), there are 0,1,2 axes

With the concept of axis, our calculation will be more convenient. For example, to calculate the average value of a two-dimensional group, we must specify which direction to calculate the average value of the numbers above
Create array:


Modify the shape of the array

Inter array operation



Transpose matrix

numpy read data
CSV: comma separated value, comma separated value file
Displaying: table status
Source file: formatted text with newline and comma separated rows and columns. The data of each line represents a record
Because csv is easy to display, read and write, many places also use csv format to store and transmit small and medium-sized data. In order to facilitate teaching, we will often operate csv format files, but it is also easy to operate the data in the database

numpy read data

numpy reads and stores data

# coding=utf-8
import numpy as np
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
# t1 = np.loadtxt(us_file_path,delimiter=",",dtype="int",unpack=True)
t2 = np.loadtxt(us_file_path,delimiter=",",dtype="int")
# print(t1)
print(t2)
print("*"*100)

b = t2[2:5,1:4]
# print(b)
#Take multiple non adjacent points
#The selected result is (0, 0) (2, 1) (2, 3)
c = t2[[0,2,2],[0,1,3]]
print(c)

Operation results:

Boolean index in numpy

Ternary operators in numpy

nan and inf in numpy

nan(NAN,Nan):not a number means not a number

When we read the local file as float, nan will appear if it is missing
As an inappropriate calculation (such as infinity minus infinity)
inf(-inf,inf):infinity,inf means positive infinity, - inf means negative infinity
When will inf appear, including (- inf, + INF)
For example, if a number is divided by 0, (an error will be reported directly in python, and an inf or - inf in numpy)

Notes on nan in numpy

1. Two nan are not equal

2.np.nan!=np.nan
3. Use the above characteristics to judge the number of nan in the array

4. Judge whether a number is nan through NP IsNaN (a)

5.nan and any value calculation are nan

###Common statistical functions in numpy
Summation: t.sum(axis=None)
Mean: t.mean(a,axis=None) is greatly affected by outliers
Median: NP median(t,axis=None)
Maximum value: t.max(axis=None)
Minimum value: t.min(axis=None)
Extreme value: NP PTP (T, axis = none) is the difference between the maximum value and the minimum value
Standard deviation: t.std(axis=None)

Fill nan in numpy

# coding=utf-8
import numpy as np
# print(t1)
def fill_ndarray(t1):
    for i in range(t1.shape[1]):  #Traverse each column
        temp_col = t1[:,i]  #Current column
        nan_num = np.count_nonzero(temp_col!=temp_col)
        if nan_num !=0: #If it is not 0, it indicates that nan exists in the current column
            temp_not_nan_col = temp_col[temp_col==temp_col] #The current array column is not nan
            # Select the position that is currently nan and assign the value to the mean value that is not nan
            temp_col[np.isnan(temp_col)] = temp_not_nan_col.mean()
    return t1
if __name__ == '__main__':
    t1 = np.arange(24).reshape((4, 6)).astype("float")
    t1[1, 2:] = np.nan
    print(t1)
    t1 = fill_ndarray(t1)
    print(t1)

Operation results:

[hands on] the data of youtube1000 in Britain and the United States are combined with the previous matplotlib to draw the histogram of the number of comments

import numpy as np
from matplotlib import  pyplot as plt
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
# t1 = np.loadtxt(us_file_path,delimiter=",",dtype="int",unpack=True)
t_us = np.loadtxt(us_file_path,delimiter=",",dtype="int")
#Get the data of comments
t_us_comments = t_us[:,-1]
#Select data smaller than 5000
t_us_comments = t_us_comments[t_us_comments<=5000]
print(t_us_comments.max(),t_us_comments.min())
d = 50
bin_nums = (t_us_comments.max()-t_us_comments.min())//d
#mapping
plt.figure(figsize=(20,8),dpi=80)
plt.hist(t_us_comments,bin_nums)
plt.show()

Operation results:

[hands on] I want to know the relationship between the number of comments and the number of likes on youtube in the UK, and how to draw and change the map

import numpy as np
from matplotlib import  pyplot as plt
us_file_path = "./youtube_video_data/US_video_data_numbers.csv"
uk_file_path = "./youtube_video_data/GB_video_data_numbers.csv"
t_uk = np.loadtxt(uk_file_path,delimiter=",",dtype="int")
#Choose data that likes books smaller than 500000
t_uk = t_uk[t_uk[:,1]<=500000]
t_uk_comment = t_uk[:,-1]
t_uk_like = t_uk[:,1]
plt.figure(figsize=(20,8),dpi=80)
plt.scatter(t_uk_like,t_uk_comment)
plt.show()

Operation results:

Row column exchange of array

The horizontal or vertical splicing of arrays is very simple, but what should we pay attention to before splicing?
Vertical splicing: each column represents the same meaning!!! Otherwise, the bull's head is not the horse's mouth
If the meaning of each column is different, the columns of a certain group of numbers should be exchanged at this time to make them the same as other types

[hands on] what should we do now if we want to study and analyze the data methods of the two countries in the previous case together and retain the country information (the country source of each data)

import numpy as np
us_data = "./youtube_video_data/US_video_data_numbers.csv"
uk_data = "./youtube_video_data/GB_video_data_numbers.csv"
#Load country data
us_data = np.loadtxt(us_data,delimiter=",",dtype=int)
uk_data = np.loadtxt(uk_data,delimiter=",",dtype=int)
# Add country information
#Construct data with all 0
zeros_data = np.zeros((us_data.shape[0],1)).astype(int)
ones_data = np.ones((uk_data.shape[0],1)).astype(int)
#Add an array with all 0 and 1 columns respectively
us_data = np.hstack((us_data,zeros_data))
uk_data = np.hstack((uk_data,ones_data))
# Splice two sets of data
final_data = np.vstack((us_data,uk_data))
print(final_data)

Operation results:

numpy more easy to use methods

1. Get the position of the maximum and minimum values
np.argmax(t,axis=0)
np.argmin(t,axis=1)
2. Create an array of all zeros: NP zeros((3,4))
3. Create an array of all 1: NP ones((3,4))
4. Create a square array (square array) with diagonal 1: NP eye(3)

Keywords: Python Machine Learning AI

Added by wrequed on Tue, 15 Feb 2022 17:12:50 +0200