Machine learning notes - introductory case 4 of exploratory data analysis (EDA)

1, Data set description

1. Data set

Tabular Playground Series - Feb 2022 | Kaggle

For this challenge, you will predict bacterial species based on repeated lossy measurements of DNA fragments. Fragments of length 10 are analyzed using Raman spectroscopy, which yields the histogram of bases in each fragment; for example, the fragment ATATGGCCTT becomes A2T4G2C2.

Each row of data contains a spectrum of histograms generated by repeated measurements of one sample. A row holds the output for all 286 possible histograms (A0T0G0C10 to A10T0G0C0), from which the bias spectrum (the spectrum of completely random DNA) has been subtracted.
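To make the encoding concrete, here is a minimal sketch (not part of the competition code) that reduces a decamer to its base counts and enumerates every possible composition. The helper names composition and all_compositions are made up for illustration; only the A..T..G..C.. labelling comes from the data set.

```python
from collections import Counter
from itertools import product


def composition(fragment):
    """Return the base-count label of a DNA fragment, e.g. 'A2T4G2C2'."""
    counts = Counter(fragment)  # missing bases count as 0
    return "".join(f"{base}{counts[base]}" for base in "ATGC")


def all_compositions(length=10):
    """List every (A, T, G, C) count combination that sums to `length`."""
    return [
        f"A{a}T{t}G{g}C{length - a - t - g}"
        for a, t, g in product(range(length + 1), repeat=3)
        if a + t + g <= length
    ]


labels = all_compositions()
print(composition("ATATGGCCTT"))  # A2T4G2C2
print(len(labels))                # 286
```

The count of 286 is the number of ways to split 10 positions among 4 bases, which is the binomial coefficient C(13, 3).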

The data (both training and test) also contain simulated measurement errors, at varying rates, for many of the samples, which makes the problem more challenging.

2. Preliminary observation

The target column is the target variable and takes one of 10 bacterial species: Streptococcus_pyogenes, Salmonella_enterica, Enterococcus_hirae, Escherichia_coli, Campylobacter_jejuni, Streptococcus_pneumoniae, Staphylococcus_aureus, Escherichia_fergusonii, Bacteroides_fragilis and Klebsiella_pneumoniae.

The training data set has 200000 rows and 288 columns: 286 features, the target variable target and the row_id column.

The test data set has 100000 rows and 287 columns: 286 features and the row_id column.

There were no missing values in the training and test data sets.
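These figures can be checked with a few pandas calls. The sketch below uses a tiny made-up frame in place of train.csv so that it runs on its own; the helper summarize and the toy values are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd


def summarize(df):
    """Return the shape and the total number of missing cells of a DataFrame."""
    return {"shape": df.shape, "missing": int(df.isna().sum().sum())}


# Toy stand-in for train.csv: row_id, two feature columns and the target
toy = pd.DataFrame({
    "row_id": [0, 1, 2],
    "A0T0G0C10": [-9.5e-07, 4.6e-08, -9.5e-07],
    "A10T0G0C0": [1.0e-06, -9.5e-07, 2.0e-06],
    "target": ["Escherichia_coli", "Salmonella_enterica", "Escherichia_coli"],
})

print(summarize(toy))
```

Calling summarize on the real training and test frames should reproduce the shapes listed above and a missing count of 0.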

2, Exploratory analysis

1. Data summary


2. Data type

row_id is int64, target is object, and all other columns are float64.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200000 entries, 0 to 199999
Columns: 288 entries, row_id to target
dtypes: float64(286), int64(1), object(1)
memory usage: 439.5+ MB

3. Checking for duplicates

First remove the row_id column, then count the duplicated rows:

import pandas as pd

data = pd.read_csv('data/train.csv')
del data['row_id']
# Count duplicated rows
duplicates_train = data.duplicated().sum()
print('Duplicates in train data: {0}'.format(duplicates_train))

# Remove duplicates, keeping the first occurrence of each row
data.drop_duplicates(keep='first', inplace=True)
duplicates_train = data.duplicated().sum()

print('Train data shape:', data.shape)
print('Duplicates in train data: {0}'.format(duplicates_train))

The output below shows that the training data contains 76007 duplicated rows, so later processing has to decide whether to remove them.

Duplicates in train data: 76007
Train data shape: (123993, 287)
Duplicates in train data: 0 
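If the duplicates are removed, one possible option (an assumption on my part, not something the original notebook does) is to keep how often each row occurred as a sample weight, so that no information is lost. A minimal sketch on a made-up frame:

```python
import pandas as pd

# Toy frame with one row that occurs three times
toy = pd.DataFrame({
    "f1": [0.1, 0.1, 0.2, 0.1],
    "f2": [1.0, 1.0, 3.0, 1.0],
    "target": ["a", "a", "b", "a"],
})

# Collapse identical rows and record their multiplicity in a new column
dedup = (
    toy.groupby(list(toy.columns), as_index=False)
       .size()
       .rename(columns={"size": "sample_weight"})
)
print(dedup)
```

The sample_weight column can then be passed to estimators that accept per-row weights.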

4. View data distribution

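import plotly.graph_objects as go

# Count the rows per class (target_class was not defined in the original snippet)
target_class = data['target'].value_counts().to_frame('count')

fig = go.Figure()
fig.add_trace(go.Pie(values=target_class['count'], labels=target_class.index, hole=0.6,
                     hoverinfo='label+percent'))
# The colour list was cut off in the source; only the five colours that survived are used
fig.update_traces(textfont_size=12, hoverinfo='label+percent', textinfo='label', showlegend=False,
                  marker=dict(colors=["#201E1F", "#FF4000", "#FAAA8D", "#FEEFDD", "#50B2C0"]))
fig.update_layout(title=dict(text='Target Distribution'))
fig.show()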

5. Correlation analysis

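import matplotlib.pyplot as plt
import seaborn as sns
f, ax = plt.subplots(figsize=(20, 20))
sns.heatmap(df.drop(columns='target').corr(), vmin=-1, vmax=+1, ax=ax)  # df: training DataFrame; target is non-numeric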

6. Feature distribution

from tqdm import tqdm

# NUM_FEATURES is assumed to be the list of the 286 feature column names
rows, cols = 56, 5  # 280 panels; rows = 58 would be needed to cover all 286 features
f, axs = plt.subplots(nrows=rows, ncols=cols, figsize=(20, 200))
n_feat = 0
for row in tqdm(range(rows)):
    for col in range(cols):
        try:
            sns.kdeplot(x=NUM_FEATURES[n_feat], fill=True, alpha=1, linewidth=3,
                        edgecolor="#264653", data=df, ax=axs[row, col], color='w')
            axs[row, col].patch.set_facecolor("#619b8a")
            axs[row, col].patch.set_alpha(0.8)
            axs[row, col].grid(color="#264653", alpha=1, axis="both")
        except IndexError:  # hide last empty graphs
            axs[row, col].set_visible(False)
        n_feat += 1

7. Numerical discreteness

Although the features are floating-point numbers, each of them has only about 100 unique values rather than 200000.
The last digits are always the same (from 1.00846558e-05 to 9.70846558e-05, the values always end with 0846558).

This observation strongly suggests that the values were originally integers, which were then divided by 1000000 and had a constant subtracted.

The paper referenced by Kaggle describes this process and gives the formula for the subtracted constant, which the authors call the bias. With this formula, we can convert the floating-point numbers back to the original integers:

Frontiers | Analysis of Identification Method for Bacterial Species and Antibiotic Resistance Genes Using Optical Data From DNA Oligomers | Microbiology

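from math import factorial

def bias(w, x, y, z):
    # Probability of base counts (w, x, y, z) in a random decamer
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)

def bias_of(s):
    # Parse a column name such as 'A2T4G2C2' into its four base counts
    w = int(s[1:s.index('T')])
    x = int(s[s.index('T')+1:s.index('G')])
    y = int(s[s.index('G')+1:s.index('C')])
    z = int(s[s.index('C')+1:])
    return bias(w, x, y, z)

# Assumptions: train_df and test_df hold the raw train.csv and test.csv,
# and elements is the list of the 286 feature column names
elements = [col for col in train_df.columns if col not in ('row_id', 'target')]

train_i = pd.DataFrame({col: ((train_df[col] + bias_of(col)) * 1000000).round().astype(int)
                        for col in elements})
test_i = pd.DataFrame({col: ((test_df[col] + bias_of(col)) * 1000000).round().astype(int)
                       for col in elements})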
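The bias formula is the multinomial probability of a given base composition when all four bases are equally likely, so the 286 bias values should sum to 1. The following self-contained sketch checks this and walks one cell through the round trip; the count of 37 is a made-up example value.

```python
from math import factorial


def bias(w, x, y, z):
    """Probability of base counts (w, x, y, z) in a uniformly random decamer."""
    return factorial(10) / (factorial(w) * factorial(x) * factorial(y) * factorial(z) * 4**10)


# Sum the bias over every composition whose counts add up to 10
total = sum(
    bias(w, x, y, 10 - w - x - y)
    for w in range(11)
    for x in range(11 - w)
    for y in range(11 - w - x)
)
print(round(total, 12))  # 1.0

# Round trip for one cell: 37 of 1,000,000 decamers fell into bin A2T4G2C2
b = bias(2, 4, 2, 2)
stored = 37 / 1_000_000 - b                  # value as it appears in the CSV
recovered = round((stored + b) * 1_000_000)  # back to the integer count
print(recovered)  # 37
```

Rounding before the integer conversion matters here, because the floating-point sum is only approximately an integer.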

Next, compute the greatest common divisor (GCD) of the integer counts in each row:

import numpy as np

# Row-wise GCD over all 286 integer count columns (vectorized)
train_df['gcd'] = np.gcd.reduce(train_i[elements], axis=1)
test_df['gcd'] = np.gcd.reduce(test_i[elements], axis=1)

# Alternatives that give the same result:
# train_df['gcd'] = train_i[elements].apply(np.gcd.reduce, axis=1) # slow
# test_df['gcd'] = test_i[elements].apply(np.gcd.reduce, axis=1)
# def gcd_of_all(df_i):
#     gcd = df_i[elements[0]]
#     for col in elements[1:]:
#         gcd = np.gcd(gcd, df_i[col])
#     return gcd

# Count how often each GCD value occurs in train and test
np.unique(train_df['gcd'], return_counts=True), np.unique(test_df['gcd'], return_counts=True)
((array([    1,    10,  1000, 10000]), array([49969, 50002, 50058, 49971])),
 (array([    1,    10,  1000, 10000]), array([25208, 24951, 24930, 24911])))

We see four GCD values (1, 10, 1000 and 10000) that occur with nearly equal frequency. Combining this result with the description in the paper, we can reconstruct this part of the experiment:

For each row, they extracted the bacterial DNA and cut it into decamers (DNA substrings of length 10). Then they put 1000000, 100000, 1000 or 100 decamers into their machine, which counted how many times each of the 286 types from A0T0G0C10 to A10T0G0C0 appeared. This is what they call the spectrum (in other words, a histogram with 286 bins). They normalized the spectrum by dividing all counts by the row sum and then subtracted the bias.
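The link between the GCD and the number of decamers can be illustrated with a small simulation. This is a sketch under assumptions: the "true" spectrum p is drawn at random rather than taken from a real bacterium, and the counts are rescaled to a total of 1,000,000 as in the integer reconstruction above.

```python
import numpy as np

rng = np.random.default_rng(0)
n_bins = 286
p = rng.dirichlet(np.ones(n_bins))  # made-up "true" spectrum of one bacterium

results = {}
for n_decamers in (1_000_000, 100_000, 1_000, 100):
    # Measure n_decamers fragments and count how many fall into each bin
    counts = rng.multinomial(n_decamers, p)
    # Dividing by the row sum and multiplying by 1,000,000 scales every count
    scaled = counts * (1_000_000 // n_decamers)
    # The row GCD is therefore a multiple of 1,000,000 // n_decamers
    results[n_decamers] = int(np.gcd.reduce(scaled))

print(results)
```

Every scaled count is a multiple of 1,000,000 // n_decamers, so the row GCD is at least that large, which is why the GCD reveals how many decamers went into a row.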

Each bacterium has its own characteristic spectrum, and the task of the competition is to predict the name of the bacterium from the spectrum of a sample. If the sample spectrum is built from one million decamers, it estimates the true frequencies accurately and the name is easy to predict; if the spectrum is built from only 100 decamers, we have little information and prediction is difficult (the classes overlap). We can see the influence of the number of decamers in the following four PCA diagrams:

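from sklearn.decomposition import PCA
from sklearn.preprocessing import LabelEncoder

# Assumption: target_num is the integer-encoded target (not defined in the original snippet)
train_df['target_num'] = LabelEncoder().fit_transform(train_df['target'])

for scale in np.sort(train_df['gcd'].unique()):
    # Fit the PCA on the training rows with this gcd
    pca = PCA(whiten=True, random_state=1)
    pca.fit(train_i[elements][train_df['gcd'] == scale])

    # Transform the data so that the components can be analyzed
    Xt_tr = pca.transform(train_i[elements][train_df['gcd'] == scale])
    Xt_te = pca.transform(test_i[elements][test_df['gcd'] == scale])

    # Plot a scattergram, projected to two PCA components, colored by classification target
    plt.figure()
    plt.scatter(Xt_tr[:,0], Xt_tr[:,1], c=train_df.target_num[train_df['gcd'] == scale], cmap='tab10', s=1)
    plt.title(f"{1000000 // scale} decamers ({(train_df['gcd'] == scale).sum()} samples with gcd = {scale})")
    plt.show()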


We might want to train four separate classifiers, one for each of the four GCD values. For GCD = 1, we expect high accuracy; for GCD = 10000, the accuracy will be lower.
If we train only one classifier, gcd can be used as an additional feature.
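Both options are easy to prepare with pandas. The sketch below uses a made-up frame with two count columns and a gcd column; the choice of one-hot encoding for the single-classifier case is an assumption for illustration, and no model is trained here.

```python
import pandas as pd

# Toy frame: two integer count columns plus the row-wise gcd
toy = pd.DataFrame({
    "A0T0G0C10": [0, 10, 1000, 10000, 3],
    "A10T0G0C0": [7, 20, 3000, 20000, 5],
    "gcd": [1, 10, 1000, 10000, 1],
})

# Option 1: one sub-frame per GCD value, each to be fed to its own classifier
groups = {scale: part.drop(columns="gcd") for scale, part in toy.groupby("gcd")}

# Option 2: a single frame with gcd one-hot encoded as extra feature columns
with_gcd = pd.get_dummies(toy, columns=["gcd"], prefix="gcd")

print({scale: len(part) for scale, part in groups.items()})
print(list(with_gcd.columns))
```

With only four distinct values, one-hot encoding avoids implying that a GCD of 10000 is numerically "larger" in any meaningful sense, although tree-based models would cope with the raw column as well.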

Keywords: Machine Learning Data Analysis Data Mining kaggle

Added by Gregadeath on Tue, 01 Mar 2022 06:46:03 +0200