Introduction to the principle of the naive Bayes classifier and its implementation in Python

Frequentist school and Bayesian school

When speaking of probability and statistics, we have to mention the frequentist school and the Bayesian school, two schools of thought that evolved from different understandings of what probability means.

Frequentist school

  • Core idea: the parameter to be estimated is a fixed value. Although it is unknown, it does not change as the sample changes; it is the sample data that are generated randomly. Therefore, as the number of samples tends to infinity, the observed frequency converges to the probability. The focus is mainly on the sample space and the distribution of the samples (a minimal sketch follows this list).

  • Extended application: maximum likelihood estimation (MLE)
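To make the frequentist idea concrete, here is a minimal sketch (the coin-flip data are simulated purely for illustration and are not from the original article): the maximum likelihood estimate of a Bernoulli parameter is exactly the observed frequency, and that frequency approaches the true probability as the sample grows.

import random


def bernoulli_mle(samples):
    '''The MLE of a Bernoulli parameter is the sample mean,
    i.e. the observed frequency of ones.'''
    return sum(samples) / len(samples)


random.seed(0)
true_p = 0.3  # the fixed but unknown parameter, in the frequentist view
for n in (10, 1_000, 100_000):
    flips = [1 if random.random() < true_p else 0 for _ in range(n)]
    # as n grows, the estimated frequency converges to true_p
    print(n, bernoulli_mle(flips))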

Bayesian school

  • Core idea: the parameters to be estimated are random variables, while the samples are fixed. The focus is mainly on the distribution of the parameters.

    In the Bayesian school, the parameters are random variables and are updated as sample information arrives, so the Bayesian school puts forward a fixed mode of thinking: prior distribution + sample information → posterior distribution.

  • Extended application: maximum a posteriori estimation (MAP); a worked sketch follows the Bayes formula below.

  • Bayesian formula

    Assuming that the prior probability of A is P(A), the prior probability of B is P(B), the posterior probability of A given B is P(A|B), and the posterior probability of B given A is P(B|A), then

    P(A|B)·P(B) = P(A ∩ B) = P(B|A)·P(A)

By simplification:

    P(A|B) = P(B|A)·P(A) / P(B)

Here, A represents a prediction result and B represents a set of observed data; P(A) is the prior probability, that is, the probability of A before B is observed; P(A|B) is the posterior probability, that is, the probability of A after B is observed; P(B|A) is the likelihood function; and P(B) is the model evidence.

The formula can be understood as: posterior probability = prior probability × adjustment factor. In the formula above, the posterior probability is P(A|B), the prior probability is P(A), and the adjustment factor is P(B|A)/P(B).
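To see the formula and the MAP idea in action, here is a minimal sketch with illustrative numbers (a made-up diagnostic-test setting, not from the original article): each hypothesis's posterior is its prior times the adjustment factor, and MAP simply picks the hypothesis with the largest posterior.

# P(A): prior probability of each hypothesis
prior = {'disease': 0.01, 'healthy': 0.99}
# P(B|A): likelihood of a positive test under each hypothesis
likelihood = {'disease': 0.95, 'healthy': 0.05}

# P(B): model evidence, summed over all hypotheses
evidence = sum(prior[a] * likelihood[a] for a in prior)

# P(A|B) = P(B|A) * P(A) / P(B): posterior after observing a positive test
posterior = {a: likelihood[a] * prior[a] / evidence for a in prior}
print(posterior)

# MAP estimation: pick the hypothesis with the largest posterior
print(max(posterior, key=posterior.get))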

Naive Bayes classifier

Based on Bayes' theorem, a remarkably simple ("naive") classification algorithm can be derived: the naive Bayes classifier. Its basic idea is: given the input features, calculate the probability of each possible category and predict the category with the maximum probability. From this idea it is clear that this basic form of naive Bayes is suited to discrete data. Its mathematical description is given below.

Suppose the input is x = (x1, x2, ..., xn), where each xj is a feature attribute of x, and assume that the feature attributes are conditionally independent of each other given the class. Which class yi should be predicted for x?

According to the idea of the Bayesian classifier, the prediction should be the category with the greatest posterior probability, that is, y = argmax_i P(yi|x). According to Bayes' theorem and the independence assumption, the probability of each category is:

P(yi|x) = P(yi) · P(x1|yi) · P(x2|yi) · ... · P(xn|yi) / P(x)

It can be seen that the denominator P(x) is independent of i, so the probabilities of the different categories can be compared using the numerators alone.
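As a quick worked example with invented numbers: suppose there are two classes with priors P(y1) = 0.6 and P(y2) = 0.4, and two features with P(x1|y1) · P(x2|y1) = 0.2 × 0.5 = 0.1 and P(x1|y2) · P(x2|y2) = 0.3 × 0.6 = 0.18. The numerators are then 0.6 × 0.1 = 0.06 and 0.4 × 0.18 = 0.072, so y2 is predicted even though its prior is smaller.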

Implementation of the naive Bayes classifier in Python

 
import pandas as pd


def load_data(path, sep=',', encoding='utf-8'):
    '''Read the data; the input file must have a header row.
    Returns a DataFrame.'''
    filetype = path.split('.')[-1]
    if filetype in ['csv', 'txt']:
        data = pd.read_csv(path, sep=sep, encoding=encoding)
    elif filetype == 'xlsx':
        data = pd.read_excel(path)
    else:
        raise ValueError(f'unsupported file type: {filetype}')
    return data


def cal_prob(data, col, res):
    '''Calculate the frequency with which column col takes the value res.'''
    count_all = len(data[col])
    count_res = len(data[data[col] == res])
    return count_res / count_all


def cal_prio_prob(data, label):
    '''Calculate the prior probability of each class.
    Returns {res1: prob1, res2: prob2, ...}'''
    prio_prob = {}
    for res in data[label].unique():
        prio_prob[res] = cal_prob(data, label, res)
    return prio_prob


def cal_likelihood_prob(data, label, sample):
    '''Calculate the likelihood of the sample under each class as the
    product of per-feature frequencies (the naive independence assumption).
    Returns {res1: prob1, res2: prob2, ...}'''
    likelihood_prob = {}
    for res in data[label].unique():
        data_p = data[data[label] == res]  # rows belonging to this class
        prob = 1
        for col in data:
            if col != label:
                prob = prob * cal_prob(data_p, col, sample[col])
        likelihood_prob[res] = prob
    return likelihood_prob


def bayes_classifier(path, label, sample):
    '''Compare the classes and output the most likely one.'''
    data = load_data(path)
    prio_prob = cal_prio_prob(data, label)
    likelihood_prob = cal_likelihood_prob(data, label, sample)
    max_prob = 0
    cla = None
    for c in prio_prob:
        # numerator of Bayes' formula; the evidence P(x) is the same for
        # every class, so it can be left out of the comparison
        prob = prio_prob[c] * likelihood_prob[c]
        print(f'Score for {c}: {prob}')
        if prob > max_prob:
            cla = c
            max_prob = prob

    print(f'Given {sample}, the predicted {label} is {cla}')


if __name__ == '__main__':
    path = 'weather.csv'
    label = 'PlayTennis'
    sample = {'Outlook': 'Sunny', 'Temperature': 'Cool', 'Humidity': 'High', 'Wind': 'Strong'}
    bayes_classifier(path, label, sample)
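For reference, the code above assumes that weather.csv is a comma-separated file with a header row in the style of the classic Play Tennis dataset, with one categorical column per feature plus the label column; the rows below are illustrative:

Outlook,Temperature,Humidity,Wind,PlayTennis
Sunny,Hot,High,Weak,No
Overcast,Hot,High,Weak,Yes
Rain,Mild,High,Weak,Yes

One design limitation worth noting: if a feature value never co-occurs with a class in the training data, cal_prob returns 0 and the whole product for that class collapses to 0; practical implementations usually apply Laplace (add-one) smoothing to the frequency counts to avoid this.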

Reference: Bayesian formula from shallow to deep
