Python Data Analysis Basics: NumPy Learning Guide (2nd Edition), Notes 6: Chapter 3 Common Functions, Part 2 - Median, Variance, Dates and Flattening

This chapter introduces NumPy's common functions. Specifically, we take the analysis of historical stock prices as an example to show how to load data from files and how to use NumPy's basic mathematical and statistical functions. Along the way you will also learn how to read and write files, and get a first taste of functional programming and NumPy's linear algebra.

Chapter 3 common functions

3.9 statistical analysis

Stock traders are interested in predicting the closing price. Common sense says that this price should be close to some kind of average. The arithmetic mean and the weighted mean are both ways of finding the center of a distribution of values. However, they are not robust and are sensitive to outliers. For example, if our data contained a closing price of one million dollars, it would distort the calculation.
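As a quick, illustrative sketch (with made-up prices, not the data file used in this chapter), you can see how a single extreme value drags the mean far away while the median barely moves:

import numpy as np

prices = np.array([336.1, 339.32, 345.03, 344.32, 343.44])
with_outlier = np.append(prices, 1_000_000.0)  # one absurd closing price

print("mean without outlier  ", np.mean(prices))          # about 341.64
print("median without outlier", np.median(prices))        # 343.44
print("mean with outlier     ", np.mean(with_outlier))    # about 166951 -- dragged far away
print("median with outlier   ", np.median(with_outlier))  # 343.88 -- barely moves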

3.10 hands-on practice: simple statistical analysis

We could use some threshold to throw out the outliers, but there is a better approach: the median. Arrange the values of a variable in order of size to form a sequence; the number in the middle of that sequence is the median. For example, if we have the five values 1, 2, 3, 4 and 5, the median is the middle number, 3. Here are the steps for calculating the median.

  • (1) Calculate the median of the closing prices. Create a new Python script file named simplestats.py. You already know how to read data from the CSV file into an array, so you only need to copy that one line of code and make sure it reads only the closing price column, as shown below:
c=np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
  • (2) The median function will help us find the median. We call it and print the result immediately. Add the following line of code:
print("median =", np.median(c))

The output of this code is as follows:

median = 352.055
  • (3) Since this is the first time we are using the median function, let's check whether the result is correct, not because we are suspicious, of course! We could scan the whole data file and find the correct answer by hand, but that would be tedious. Instead, we will sort the price array and print the value in the middle after sorting, which simulates the algorithm for finding the median. The msort function takes care of the first step: we call it to get the sorted array and print the result.
sorted_close = np.msort(c)
print("sorted =", sorted_close)

The output of this code is as follows:

sorted = [336.1  338.61 339.32 342.62 342.88 343.44 344.32 345.03 346.5  346.67
 348.16 349.31 350.56 351.88 351.99 352.12 352.47 353.21 354.54 355.2
 355.36 355.76 356.85 358.16 358.3  359.18 359.56 359.9  360.   363.13]

Great, the code works! Now let's get the value in the middle:

N = len(c)
print("middle =", sorted_close[int((N - 1)/2)])

The output is as follows:

middle = 351.99
  • (4) Hmm, this value is different from the value returned by the median function. What's going on? On closer inspection, we find that the value returned by the median function does not even appear in our data file. That is even stranger! Before filing a bug report with the NumPy team, let's look at the documentation. The mystery is easy to solve: our simple simulation only works for arrays of odd length. For an array of even length, the median is the average of the two middle values. Therefore, enter the following code:
print("average middle =", (sorted_close[int(N /2)] + sorted_close[int((N - 1) / 2)]) / 2)

The output results are as follows:

average middle = 352.055

Success!
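To convince yourself of the even/odd rule in general, here is a tiny check with toy arrays (not the stock data):

import numpy as np

print(np.median([1, 2, 3, 4, 5]))  # odd length: the middle element -> 3.0
print(np.median([1, 2, 3, 4]))     # even length: mean of the two middle elements -> 2.5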

  • (5) Another statistic we care about is the variance. The variance reflects how much a variable fluctuates; in our example it also indicates the size of the investment risk, since a stock whose price swings too wildly is bound to cause trouble for its holders.
    In NumPy, calculating the variance takes only one line of code:
print "variance =", np.var(c)

The following results will be given:

variance = 50.126517888888884
  • (6) Since we do not quite trust NumPy's function yet, let's verify the result using the definition of variance given in the documentation. Note that this definition may differ from the one in your statistics textbook, but it is the more general one in statistics: the variance is the sum of the squared deviations of each data point from the arithmetic mean, divided by the number of data points.
    Some books tell us to divide the sum of squared deviations by the number of data points minus one instead.[1]
print("variance from definition =", np.mean((c - c.mean())**2))

The output results are as follows:

variance from definition = 50.126517888888884
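Note that np.var computes the population variance by default (dividing by N). If you want the sample variance described in the footnote (dividing by N - 1), pass ddof=1; a quick sketch, reloading the same column as above:

import numpy as np

c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)
N = len(c)

print("population variance", np.var(c))          # divides by N
print("sample variance    ", np.var(c, ddof=1))  # divides by N - 1
print("by hand            ", np.sum((c - c.mean())**2) / (N - 1))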
  • The complete code of simplestats.py is as follows:
import numpy as np

c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)

print("median =", np.median(c))
sorted_close = np.msort(c)
print("sorted =", sorted_close)

N = len(c)
print("middle =", sorted_close[int((N - 1)/2)])
print("average middle =", (sorted_close[int(N /2)] + sorted_close[int((N - 1) / 2)]) / 2)

print("variance =", np.var(c))
print("variance from definition =", np.mean((c - c.mean())**2))

3.11 stock returns

In the academic literature, analysis of closing prices is usually based on stock returns and logarithmic returns. The simple return is the rate of change between two consecutive prices, while the logarithmic return is the difference between the logarithms of two consecutive prices. We learned about logarithms in high school: the logarithm of a minus the logarithm of b equals the logarithm of a divided by b, so the log return also measures the rate of change of the price. Note that since a return is a ratio, for example US dollars divided by US dollars (or some other currency unit), it is dimensionless. In the end, what investors are most interested in is the variance or standard deviation of the returns, because it represents the size of the investment risk.
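Since log(a) - log(b) = log(a / b), the log return can be computed either way; here is a small sketch of the equivalence, reusing the closing price column from the earlier examples:

import numpy as np

c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)

logreturns_diff  = np.diff(np.log(c))      # difference of the logged prices
logreturns_ratio = np.log(c[1:] / c[:-1])  # log of the price ratios

print(np.allclose(logreturns_diff, logreturns_ratio))    # True

# For small daily moves the log return is also close to the simple return:
simple_returns = np.diff(c) / c[:-1]
print(np.max(np.abs(simple_returns - logreturns_diff)))  # a small number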

3.12 hands-on practice: analyze stock returns

Analyze the stock return according to the following steps.

  • (1) First, let's calculate the simple returns. The diff function in NumPy returns an array of the differences between adjacent array elements, which is a bit like differentiation in calculus. To calculate the returns, we also need to divide each difference by the previous day's price. Note, however, that diff returns an array with one element fewer than the closing price array. With that in mind, we use the following code:
returns = np.diff( c ) / c[ : -1]

Note that we do not divide by the last value in the closing price array. Next, calculate the standard deviation with the std function:

print("Standard deviation =", np.std(returns))

The output results are as follows:

Standard deviation = 0.012922134436826306
  • (2) Logarithmic returns are even simpler to calculate. We first use the log function to get the logarithm of each closing price, and then use the diff function for the result.
logreturns = np.diff( np.log(c) )

In general, we should check that the input array contains no zeros or negative numbers before taking logarithms; otherwise we will get a warning and invalid results (a sketch of such a check follows).
In our example, however, the stock price is always positive, so the check can be omitted.
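A minimal sketch of such a sanity check (assuming the same data.csv as above; the exact error message is up to you):

import numpy as np

c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)

# np.log returns -inf or nan (with a warning) for zero or negative inputs,
# so guard against them before taking logarithms.
if not np.all(c > 0):
    raise ValueError("closing prices must be strictly positive to take logarithms")

logreturns = np.diff(np.log(c))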

  • (3) We are probably quite interested in which trading days had a positive return. After the previous steps, all we need is the where function, which returns the indices of all array elements that satisfy the specified condition. Enter the following code:
posretindices = np.where(returns > 0)
print("Indices with positive returns", posretindices)

This outputs the indices of all the elements with positive returns:

Indices with positive returns (array([ 0,  1,  4,  5,  6,  7,  9, 10, 11, 12, 16, 17, 18, 19, 21, 22, 23,
       25, 28], dtype=int64),)
  • (4) In investment, volatility is a measure of price variation. Historical volatility is calculated from historical price data and requires the logarithmic returns; it can be quoted, for example, as annual or monthly volatility. **The annualized volatility equals the standard deviation of the log returns divided by their mean, then divided by the square root of the reciprocal of the number of trading days in a year, usually 252.** We use the std and mean functions to calculate it, with the following code:
annual_volatility = np.std(logreturns)/np.mean(logreturns)
annual_volatility = annual_volatility / np.sqrt(1./252.)
print("Yearly volatility",annual_volatility)

The annual volatility output is:

Yearly volatility 129.27478991115132
  • (5) Notice the division inside the sqrt call. In Python 2, dividing two integers performed floor division, so floating-point numbers were needed to get the correct result; in Python 3 the / operator always performs true division (see the quick check after the output below). Similar to the annual volatility, the monthly volatility is calculated as follows:
print("Monthly volatility", annual_volatility * np.sqrt(1./12.))

Monthly volatility output is:

Monthly volatility 37.318417377317765
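A quick check of the division behaviour just mentioned (Python 3 assumed, as in the print() calls above):

print(1 / 252)     # about 0.003968 -- true division in Python 3
print(1 // 252)    # 0              -- floor division
print(1. / 252.)   # about 0.003968 -- the Python 2-safe spelling used in the book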

The complete code for the stock returns example (returns.py) is as follows:

import numpy as np

c = np.loadtxt('data.csv', delimiter=',', usecols=(6,), unpack=True)

returns = np.diff( c ) / c[ : -1]
print("Standard deviation =", np.std(returns))

logreturns = np.diff( np.log(c) )
posretindices = np.where(returns > 0)
print("Indices with positive returns", posretindices)

annual_volatility = np.std(logreturns)/np.mean(logreturns)
annual_volatility = annual_volatility / np.sqrt(1./252.)
print("Yearly volatility", annual_volatility)
print("Monthly volatility", annual_volatility * np.sqrt(1./12.))

3.13 date analysis

Do you sometimes have Monday anxiety and Friday mania? Want to know whether the stock market is affected by the above phenomenon? I think it is worth studying in depth.

3.14 hands-on practice: analyze date data

First, we need to read in the closing price data. Then, divide the closing price data according to the day of the week, and calculate the average price respectively.

Finally, we will find out which day of the week has the highest average closing price and which has the lowest. Before we start, a friendly reminder: you may be tempted to use these results to buy or sell stock one day. However, the amount of data here is not enough to make reliable decisions; please consult a professional statistical analyst before acting!

Programmers don't like dates because dealing with dates is always cumbersome. NumPy is oriented to floating-point operations, so it needs to do some special processing on dates. Please try the following code by yourself, write the script file separately or use the code file attached to this book:

import numpy as np
from datetime import datetime

dates, close=np.loadtxt('data.csv', delimiter=',', usecols=(1,6), unpack=True)

After executing the above code, you will get an error prompt:

ValueError: invalid literal for float(): 28-01-2011

Process the date as follows.

  • (1) Obviously, NumPy tried to convert the dates into floating-point numbers. What we need to do is tell NumPy explicitly how to convert the dates, which requires a specific argument to the loadtxt function: converters, a dictionary that maps data columns to conversion functions.

To do this, we must write the conversion function:

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

We pass the date to the datestr2num function as a string, such as "28-01-2011". The string is first converted into a datetime object using the format "%d-%m-%Y"; note that strptime is provided by the Python standard library and is independent of NumPy. The datetime object is then converted into a date object, and finally its weekday method returns a number. As described in the documentation, this number is an integer from 0 to 6, where 0 is Monday and 6 is Sunday. The specific numbers are not important; they are only used as identifiers.

Note: the second column of the data.csv file contains date strings. loadtxt() reads the file in binary mode by default, so the converter receives each field as a byte string (bytes). It therefore has to be decoded into a str with decode('ascii') before being passed to strptime(); otherwise a TypeError: strptime() argument 1 must be str, not bytes exception is thrown.
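You can test the converter in isolation; the byte string below mimics what loadtxt hands to the converter (this standalone check is not part of the book's script):

from datetime import datetime

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

print(datestr2num(b"28-01-2011"))  # 4 -> Friday; compare with the first value of the Dates output below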

  • (2) Next, we hook up the date conversion function so that we can read in the data.
dates, close=np.loadtxt('data.csv', delimiter=',', usecols=(1,6), converters={1: datestr2num}, unpack=True)
print("Dates =", dates)

The output results are as follows:

Dates = [4. 0. 1. 2. 3. 4. 0. 1. 2. 3. 4. 0. 1. 2. 3. 4. 1. 2. 3. 4. 0. 1. 2. 3.
 4. 0. 1. 2. 3. 4.]

As you can see, there are no Saturdays and Sundays. Stock trading is closed on weekends.

  • (3) Let's create an array of five elements representing five working days of the week. Array elements will be initialized to 0.
averages = np.zeros(5)

This array will be used to store the average closing price of each working day.

  • (4) We already know that the where function returns the indices of all array elements that satisfy a given condition.
    **The take function can then extract the corresponding elements from an array using those indices.** We will use take to get the closing prices for each weekday. In the loop below, we iterate over the weekday codes 0 to 4, i.e. Monday to Friday, use where to obtain the indices for each weekday and store them in the indices array, and then use take to fetch the element values corresponding to those indices. Finally, we compute the average for each weekday and store it in the averages array. The code is as follows:
averages = np.zeros(5)

for i in range(5):
    indices = np.where(dates == i) 
    prices = np.take(close, indices)
    avg = np.mean(prices)
    print("Day", i, "prices", prices, "Average", avg)
    averages[i] = avg

The output results are as follows:

Day 0 prices [[339.32 351.88 359.18 353.21 355.36]] Average 351.7900000000001
Day 1 prices [[345.03 355.2  359.9  338.61 349.31 355.76]] Average 350.63500000000005
Day 2 prices [[344.32 358.16 363.13 342.62 352.12 352.47]] Average 352.1366666666666
Day 3 prices [[343.44 354.54 358.3  342.88 359.56 346.67]] Average 350.8983333333333
Day 4 prices [[336.1  346.5  356.85 350.56 348.16 360.   351.99]] Average 350.0228571428571
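As a side note, the same per-weekday selection can also be written with a boolean mask instead of where and take; a brief sketch, loading the same columns as above:

import numpy as np
from datetime import datetime

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

dates, close = np.loadtxt('data.csv', delimiter=',', usecols=(1, 6),
                          converters={1: datestr2num}, unpack=True)

# Boolean indexing yields a flat 1-D array per weekday, avoiding the extra
# dimension that np.take(close, indices) produced in the output above.
for i in range(5):
    prices = close[dates == i]
    print("Day", i, "average", prices.mean())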
  • (5) If you like, you can also find out which weekday's average closing price is the highest and which is the lowest. This is easy to do with the max and min functions. The code is as follows:
top = np.max(averages)
print( "Highest average", top)
print( "Top day of the week", np.argmax(averages))

bottom = np.min(averages)
print( "Lowest average", bottom)
print( "Bottom day of the week", np.argmin(averages))

The output results are as follows:

Highest average 352.1366666666666
Top day of the week 2
Lowest average 350.0228571428571
Bottom day of the week 4

What just happened?
The argmin function returns the index of the smallest element in the averages array, here 4, that is, Friday. The argmax function returns the index of the largest element in the averages array, here 2, that is, Wednesday.

The complete code of the example code is as follows:

import numpy as np
from datetime import datetime

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

dates, close=np.loadtxt('data.csv', delimiter=',', usecols=(1,6), converters={1: datestr2num}, unpack=True)
print("Dates =", dates)

averages = np.zeros(5)

for i in range(5):
    indices = np.where(dates == i) 
    prices = np.take(close, indices)
    avg = np.mean(prices)
    print("Day", i, "prices", prices, "Average", avg)
    averages[i] = avg
    
top = np.max(averages)
print( "Highest average", top)
print( "Top day of the week", np.argmax(averages))

bottom = np.min(averages)
print( "Lowest average", bottom)
print( "Bottom day of the week", np.argmin(averages))

3.15 weekly summary

In the previous "hands-on" tutorial, we used after disk data. In other words, these data are obtained by summarizing the transaction data of a whole day. If you are interested in the cotton market and have decades of data, you may want to further summarize and compress the data. Let's do it. Let's summarize Apple stock data by week.

3.16 hands-on practice: summarizing data

We will summarize the data for whole trading weeks, Monday through Friday. The period covered by the data includes one holiday: February 21, Presidents' Day. That day is a Monday and the US stock market was closed, so there is no record for it in our sample data. The first day in the data is a Friday, which is inconvenient to handle. Follow these steps to summarize the data.

  • (1) For simplicity, we only consider the data of the first three weeks, which avoids the gaps caused by holidays. You can try extending this later.
import numpy as np
from datetime import datetime

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

dates, open, high, low, close=np.loadtxt('data.csv', delimiter=',', usecols=(1, 3, 4, 5, 6), converters={1: datestr2num}, unpack=True)

close = close[:16]
dates = dates[:16]
  • (2) First, let's find the first Monday in the sample data. Recall that in Python the code for Monday is 0, which we can use as the condition for the where function. Next, we take out the first element, with index 0. However, the result returned by the where function is a tuple of index arrays, so we flatten it with the ravel function (a short sketch of this follows the output below).
# Find the first Monday
first_monday = np.ravel(np.where(dates == 0))[0]
print( "The first Monday index is", first_monday)

The output results are as follows:

The first Monday index is 1
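To see why ravel is needed here, note that where returns a tuple of index arrays, one per dimension; a quick sketch with toy weekday codes (np.flatnonzero is just an equivalent shortcut, not used in the book):

import numpy as np

dates = np.array([4., 0., 1., 2., 3., 4., 0.])   # toy weekday codes

print(np.where(dates == 0))                # a one-element tuple containing array([1, 6])
print(np.ravel(np.where(dates == 0))[0])   # 1 -- index of the first Monday
print(np.flatnonzero(dates == 0)[0])       # 1 -- equivalent shortcut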
  • (3) The next step is to find the last Friday in the sample data, similarly to how we found the first Monday. The code for Friday is 4. This time we use -1 as the index to locate the last element of the array.
# Find the last Friday
last_friday = np.ravel(np.where(dates == 4))[-1]
print( "The last Friday index is", last_friday)

The output results are as follows:

The last Friday index is 15

Next, create an array to store the index values for each day of the three weeks.

weeks_indices = np.arange(first_monday, last_friday + 1)
print( "Weeks indices initial", weeks_indices)

The output results are as follows:

Weeks indices initial [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15]
  • (4) Use the split function to divide the array into subarrays of five elements each, one per trading week:
weeks_indices = np.split(weeks_indices, 3)
print( "Weeks indices after split", weeks_indices)

The output results are as follows:

Weeks indices after split [array([1, 2, 3, 4, 5], dtype=int64), array([ 6,  7,  8,  9, 10], dtype=int64), array([11, 12, 13, 14, 15], dtype=int64)]
  • (5) In NumPy, a dimension of an array is also called an axis. Now let's get familiar with the apply_along_axis function. It calls a function that we supply, applying it to each 1-D slice along the specified axis; here, with axis 1, that means to each row, i.e. each week. At the moment our array has three rows, corresponding to the three weeks in the sample data, and the index values in each row correspond to days in the sample data. When calling apply_along_axis we provide the name of our function, summarize, the number of the axis (or dimension) to operate on (here 1), the target array, and a variable number of additional arguments for summarize.
weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
print( "Week summary", weeksummary)

(6) Now write the summarize function. For each week's data it returns a tuple containing the week's opening price, highest price, lowest price and closing price, similar to the daily end-of-day data.

def summarize(a, o, h, l, c):
    monday_open = o[a[0]]
    week_high = np.max( np.take(h, a) )
    week_low = np.min( np.take(l, a) )
    friday_close = c[a[-1]]

    return("APPL", monday_open, week_high, week_low, friday_close)

Note that we use the take function to fetch array elements by their index values, and the max and min functions make it easy to compute the highest and lowest prices of the week. The week's opening price is Monday's opening price, and the week's closing price is Friday's closing price.

The output results are as follows:

Week summary [['APPL' '335.8' '346.7' '334.3' '346.5']
 ['APPL' '347.89' '360.0' '347.64' '356.85']
 ['APPL' '356.79' '364.9' '349.52' '350.56']]
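If apply_along_axis is new to you, here is a minimal, self-contained sketch of what it does with axis=1, using toy data unrelated to the stock file:

import numpy as np

a = np.arange(12).reshape(3, 4)   # three rows of four values

# With axis=1 the supplied function receives one row (a 1-D array) at a time.
row_ranges = np.apply_along_axis(lambda row: row.max() - row.min(), 1, a)
print(row_ranges)                 # [3 3 3]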

(7) Use the savetxt function in NumPy to save the data to a file.

np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")

As shown in the code, we specify the file name, the array to be saved, the delimiter (here an ordinary comma) and the format in which to store the values.
The format string starts with a percent sign. Next comes an optional flag character: - means left-align the result, 0 means pad with zeros on the left, and + means always output the sign (+ or -). The third part is an optional width, giving the minimum number of characters to output. The fourth part is the precision specifier: a '.' followed by an integer giving the precision. Finally there is a type character, which in our example specifies the string type.
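A few illustrative format strings, written to throwaway files purely to show the effect (the file names here are arbitrary):

import numpy as np

data = np.array([[1.5, 2.25], [3.125, 4.0]])

np.savetxt("fmt_string.csv",   data, delimiter=",", fmt="%s")      # e.g. 1.5,2.25
np.savetxt("fmt_2decimal.csv", data, delimiter=",", fmt="%.2f")    # e.g. 1.50,2.25
np.savetxt("fmt_padded.csv",   data, delimiter=",", fmt="%08.3f")  # e.g. 0001.500,0002.250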

The contents of the weeksummary.csv file are as follows:

APPL,335.8,346.7,334.3,346.5
APPL,347.89,360.0,347.64,356.85
APPL,356.79,364.9,349.52,350.56

The complete code of the example code is as follows:

import numpy as np
from datetime import datetime

def datestr2num(s):
    return datetime.strptime(s.decode('ascii'), "%d-%m-%Y").date().weekday()

dates, open, high, low, close=np.loadtxt('data.csv', delimiter=',', usecols=(1, 3, 4, 5, 6), converters={1: datestr2num}, unpack=True)

close = close[:16]
dates = dates[:16]

# get first Monday
first_monday = np.ravel(np.where(dates == 0))[0]
print( "The first Monday index is", first_monday)

# get last Friday
last_friday = np.ravel(np.where(dates == 4))[-1]
print( "The last Friday index is", last_friday)

weeks_indices = np.arange(first_monday, last_friday + 1)
print( "Weeks indices initial", weeks_indices)

weeks_indices = np.split(weeks_indices, 3)
print( "Weeks indices after split", weeks_indices)

def summarize(a, o, h, l, c):
    monday_open = o[a[0]]
    week_high = np.max( np.take(h, a) )
    week_low = np.min( np.take(l, a) )
    friday_close = c[a[-1]]

    return("APPL", monday_open, week_high, week_low, friday_close)

weeksummary = np.apply_along_axis(summarize, 1, weeks_indices, open, high, low, close)
print( "Week summary", weeksummary)

np.savetxt("weeksummary.csv", weeksummary, delimiter=",", fmt="%s")
  1. Pay attention to the difference between the sample variance and the population variance. The population variance is the sum of the squared deviations divided by the number of data points, whereas the sample variance is the sum of the squared deviations divided by the number of data points minus 1; this n-1 is called the degrees of freedom. The reason for the difference is to make the sample variance an unbiased estimator. (Translator's note) ↩︎
