Hyperparameter tuning and automated time series modeling with genetic algorithms

In previous articles, we introduced the basics of genetic algorithms. This article applies genetic algorithms to machine learning models and time series data.

Hyperparameter tuning (TPOT)

Automated machine learning (AutoML) helps us find the most suitable model for a prediction task by automating the whole machine learning process. For an individual machine learning model, AutoML mostly means tuning and optimizing its hyperparameters.

Here we use a Python package called TPOT, which is built on top of scikit-learn. Although it is still under development, its functionality is enough to illustrate these concepts. The constructor below shows everything TPOT lets you configure:

from tpot import TPOTClassifier
from tpot import TPOTRegressor

model = TPOTClassifier(generations=100, population_size=100, offspring_size=None,
                       mutation_rate=0.9, crossover_rate=0.1, scoring=None, cv=5,
                       subsample=1.0, n_jobs=1, max_time_mins=None, max_eval_time_mins=5,
                       random_state=None, config_dict=None, template=None,
                       warm_start=False, memory=None, use_dask=False,
                       periodic_checkpoint_folder=None, early_stop=None,
                       verbosity=0, disable_update_check=False, log_file=None)

The code above builds a classifier with TPOT's default parameters, listed here:

generations=100,
population_size=100,
offspring_size=None,  # when None, this is set to population_size
mutation_rate=0.9,
crossover_rate=0.1,
scoring='accuracy',  # for classification
cv=5,
subsample=1.0,
n_jobs=1,
max_time_mins=None,
max_eval_time_mins=5,
random_state=None,
config_dict=None,
warm_start=False,
memory=None,
periodic_checkpoint_folder=None,
early_stop=None,
verbosity=0,
disable_update_check=False

Let's look at the hyperparameters that can be adjusted:

generations: Number of iterations the optimization process runs. The default value is 100.
population_size: Number of individuals retained in the genetic programming population every generation. The default value is 100.
offspring_size: Number of offspring produced in each genetic programming generation. The default value is 100.
mutation_rate: Mutation rate for the genetic programming algorithm, in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to apply random changes to in each generation. The default value is 0.9.
crossover_rate: Crossover rate for the genetic programming algorithm, in the range [0.0, 1.0]. This parameter tells the GP algorithm how many pipelines to "breed" in each generation. The default value is 0.1.
scoring: Function used to evaluate pipelines, such as accuracy, average_precision, roc_auc, recall, and so on. The default is accuracy.
cv: Cross-validation strategy used when evaluating pipelines. The default value is 5.
random_state: Seed for the pseudo-random number generator used by TPOT. Use this parameter to make sure TPOT runs with the same random seed and returns the same results each time.
n_jobs: Set to -1 to run on multiple CPU cores and speed up the TPOT process. The default value is 1.
periodic_checkpoint_folder: A folder path; if set, TPOT periodically saves the best pipeline found so far, so you can watch the models evolve while the training score improves.

mutation_rate + crossover_rate cannot exceed 1.0.
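For example, you can trade mutation against crossover when constructing the classifier. A minimal sketch (the parameter values are illustrative, not recommendations):

from tpot import TPOTClassifier

# Illustrative values only: more crossover, less mutation,
# with a small population to keep the search cheap
model = TPOTClassifier(
    generations=10,
    population_size=20,
    mutation_rate=0.7,
    crossover_rate=0.3,  # 0.7 + 0.3 = 1.0, the allowed maximum
    scoring='accuracy',
    cv=5,
    random_state=42,
    n_jobs=-1,
    verbosity=2,
)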

Next, we use TPOT and scikit-learn to train a model.

from pandas import read_csv
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, RepeatedStratifiedKFold
from tpot import TPOTClassifier

dataframe = read_csv('sonar.csv')

# define X and y
data = dataframe.values
X, y = data[:, :-1], data[:, -1]

# minimally prepare the dataset
X = X.astype('float32')
y = LabelEncoder().fit_transform(y.astype('str'))

# split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# define the cross-validation strategy
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=1)

Now let's run TPOTClassifier to optimize the hyperparameters:

# initialize the classifier
model = TPOTClassifier(generations=5, population_size=50, max_time_mins=2, cv=cv,
                       scoring='accuracy', verbosity=2, random_state=1, n_jobs=-1)
model.fit(X_train, y_train)

# export the best model
model.export('tpot_best_model.py')

The last line of code exports the best pipeline to a .py file, which you can use directly afterwards. Because the biggest cost in AutoML is training time, parameters such as population_size and max_time_mins are set to small values here to save time.
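Before relying on the exported file, it is worth checking how the best pipeline performs on the held-out test set; TPOT exposes the usual scikit-learn scoring interface on the fitted object:

# evaluate the best pipeline found by TPOT on the held-out test set
print(model.score(X_test, y_test))

# the winning scikit-learn pipeline itself is also available
print(model.fitted_pipeline_)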

Now let's open the file tpot_best_model.py and look at its contents:

import numpy as np
import pandas as pd
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from tpot.export_utils import set_param_recursive

# NOTE: Make sure that the outcome column is labeled 'target' in the data file
tpot_data = pd.read_csv('PATH/TO/DATA/FILE', sep='COLUMN_SEPARATOR', dtype=np.float64)
features = tpot_data.drop('target', axis=1)
training_features, testing_features, training_target, testing_target = \
            train_test_split(features, tpot_data['target'], random_state=1)

# Average CV score on the training set was: -29.099695845082277
exported_pipeline = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False, interaction_only=False),
    RidgeCV()
)

# Fix random state for all the steps in exported pipeline
set_param_recursive(exported_pipeline.steps, 'random_state', 1)

exported_pipeline.fit(training_features, training_target)
results = exported_pipeline.predict(testing_features)

If you want to make a prediction on new data, use the following code (new_data stands for your new feature matrix):

yhat = exported_pipeline.predict(new_data)

That covers hyperparameter optimization with genetic algorithms in AutoML. Genetic algorithms are inspired by Darwin's process of natural selection and are used in computer science as an optimization search technique. In short, a genetic algorithm consists of three common operations: selection, crossover, and mutation.
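To make selection, crossover, and mutation concrete, here is a minimal toy genetic algorithm (plain Python, no library code) that maximizes the number of ones in a bit string:

import random

random.seed(1)

GENOME_LEN, POP_SIZE, GENERATIONS = 20, 30, 40
MUTATION_RATE = 0.05

def fitness(genome):
    # toy objective: count the ones in the bit string
    return sum(genome)

def select(population):
    # tournament selection: the fitter of two random individuals wins
    a, b = random.sample(population, 2)
    return a if fitness(a) >= fitness(b) else b

def crossover(p1, p2):
    # single-point crossover "breeds" two parents into one child
    point = random.randint(1, GENOME_LEN - 1)
    return p1[:point] + p2[point:]

def mutate(genome):
    # flip each bit with a small probability
    return [1 - g if random.random() < MUTATION_RATE else g for g in genome]

population = [[random.randint(0, 1) for _ in range(GENOME_LEN)] for _ in range(POP_SIZE)]
for _ in range(GENERATIONS):
    population = [mutate(crossover(select(population), select(population)))
                  for _ in range(POP_SIZE)]

print(max(fitness(g) for g in population))  # approaches GENOME_LEN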

Now let's see how genetic algorithms can help us model time series.

Time series data modeling (AutoTS)

In Python, there is a package called AutoTS for processing time series data. Let's start by looking at the models it includes:

#AutoTS
from autots import AutoTS
from autots.models.model_list import model_lists
print(model_lists)

AutoTS includes many different types of time series models. It searches over these algorithms with a genetic algorithm to find the optimal model.
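If I read the autots.models.model_list module correctly, model_lists is a dictionary mapping preset names to lists of model names, so you can inspect a preset before starting a search:

from autots.models.model_list import model_lists

# preset names such as 'all', 'default', 'fast', ...
print(model_lists.keys())

# the models inside one preset
print(model_lists['superfast'])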

Now let's load the data.

#default data
from autots.datasets import load_monthly
#data was in long format
df_long = load_monthly(long=True)
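Long format means one row per (series, timestamp) observation. A quick look confirms the column names that the fit call below expects (datetime, series_id, value):

# one row per observation: datetime, series_id, value
print(df_long.head())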

Initialize AutoTS with the desired parameters:

model = AutoTS(
    forecast_length=3,
    frequency='infer',
    ensemble='simple',
    max_generations=5,
    num_validations=2,
)
model = model.fit(df_long, date_col='datetime', value_col='value', id_col='series_id')
#help(AutoTS)

This step takes a long time because it must iterate over many algorithms and generations.

print(model)
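If the attributes best_model_name and best_model_params behave as described in the AutoTS documentation, you can also inspect the winner directly:

# name and parameters of the best model found by the search
print(model.best_model_name)
print(model.best_model_params)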

We can also view the model accuracy scores:

#accuracy score
model.results()
#aggregated from cross validation
validation_results = model.results("validation")

In the list of model accuracy scores you can also see the "Ensemble" column. Its low score here disproves the theory that an ensemble always performs better; that claim is simply not true.
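You can verify this by sorting the validation results yourself. Assuming the results frame carries 'Model', 'Ensemble', and 'Score' columns (an assumption about the results layout; lower Score is better in AutoTS), the comparison looks like this:

# lower Score is better; see where the ensemble rows land
print(
    validation_results[['Model', 'Ensemble', 'Score']]
    .sort_values('Score')
    .head(10)
)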

Once you have found the best model, you can make predictions:

prediction = model.predict()
forecasts_df = prediction.forecast
upper_forecasts_df = prediction.upper_forecast
lower_forecasts_df = prediction.lower_forecast

# or as a single long-format frame (forecast plus upper/lower bounds together)
forecasts_long = prediction.long_form_results()

By default, AutoTS trains with all the models it provides. If we want to search only a subset of models and give extra weight to a particular series, we need some custom configuration.

First, read the data:

from autots import AutoTS
from autots.datasets import load_hourly

df_wide = load_hourly(long=False)

# here we care most about traffic volume; all other series get a weight of 1
weights_hourly = {'traffic_volume': 20}

Define the model list:

model_list = [
    'LastValueNaive',
    'GLS',
    'ETS',
    'AverageValueNaive',
]

model = AutoTS(
    forecast_length=49,
    frequency='infer',
    prediction_interval=0.95,
    ensemble=['simple', 'horizontal-min'],
    max_generations=5,
    num_validations=2,
    validation_method='seasonal 168',
    model_list=model_list,
    transformer_list='all',
    models_to_validate=0.2,
    drop_most_recent=1,
    n_jobs='auto',
)

Define the weights when fitting the model to the data:

model = model.fit(
    df_wide,
    weights=weights_hourly,
)

prediction = model.predict()
forecasts_df = prediction.forecast
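Following the plotting example in the AutoTS documentation, you can also visualize the forecast with its prediction interval for the series we weighted most heavily (this assumes matplotlib is installed and that model.df_wide_numeric holds the training data, as in the docs):

import matplotlib.pyplot as plt

# plot the traffic_volume forecast together with its prediction interval
prediction.plot(model.df_wide_numeric, series='traffic_volume')
plt.show()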

And that's it. The whole workflow is very simple to use.

Related resources

The following is the documentation for the two Python packages:

AutoTS: https://winedarksea.github.io...

TPOT: http://epistasislab.github.io...

Finally, if you are interested in taking part in Kaggle competitions, please write to me and I will invite you to join the Kaggle competition discussion group.
