1, Bayesian decision theory
1. Probability and statistics
(1) Probability: given a known model and its parameters, predict the results and characteristics (such as mean, variance, covariance, etc.) that the model produces, i.e. the data;
(2) Statistics: given existing data, infer the model and its parameters;
It can be seen that probability and statistics work in opposite directions.
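A minimal sketch of the two directions, assuming a simple Gaussian model whose parameters (mean 2.0, standard deviation 0.5) are chosen arbitrarily for illustration:

import numpy as np

rng = np.random.default_rng(0)

# Probability direction: the model and its parameters are known,
# and we generate data (and could compute its mean, variance, etc.).
true_mean, true_std = 2.0, 0.5
data = rng.normal(true_mean, true_std, size=1000)

# Statistics direction: only the data is known,
# and we estimate the model parameters back from it.
estimated_mean = data.mean()
estimated_std = data.std(ddof=1)
print(estimated_mean, estimated_std)  # close to 2.0 and 0.5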
2. Probability function and likelihood function
For the function $P\left( x|\theta \right)$ there are two inputs: $x$ denotes a specific data point and $\theta$ denotes the model parameters. There are then two situations:
(1) $\theta$ fixed, $x$ varying: the model and its parameters are given, and for different samples $x$ we ask how probable each $x$ is. This is the probability function;
(2) $x$ fixed, $\theta$ varying: the sample is given, and for different model parameters $\theta$ we ask how probable this sample point $x$ is. This is the likelihood function (a measure of plausibility).
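A small sketch of the two readings of $P\left( x|\theta \right)$, assuming a coin-flip (binomial) model in which $\theta$ is the probability of heads; the numbers are only illustrative:

from math import comb

def binom_pmf(k, n, theta):
    # P(k heads in n flips | theta): the same function, read two ways
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

# (1) Probability function: theta is fixed, the data x varies.
theta = 0.5
for k in range(11):
    print(k, binom_pmf(k, 10, theta))

# (2) Likelihood function: the observed data is fixed (say 7 heads in
# 10 flips), theta varies; we ask how plausible each theta makes it.
k_observed = 7
for theta in (0.3, 0.5, 0.7, 0.9):
    print(theta, binom_pmf(k_observed, 10, theta))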
3. Bayesian formula
$$P\left( A|B \right) =\frac{P\left( A \right) P\left( B|A \right)}{P\left( B \right)}$$
The events in the Bayesian formula have a causal relationship. If A is the cause and B is the effect, i.e. A is one of the causes of B, then:
- $P\left( A|B \right)$ is called the posterior probability, a "cause given effect" probability: after event B has occurred, how likely is it that A is what caused B. It can be understood as how much confidence a piece of evidence gives you (how sure you can be that A caused B);
- $P\left( A \right)$ is called the prior probability, i.e. the probability that can be obtained directly from previous experience and analysis. For example, for a normal coin we know a priori from experience that the probability of heads is 0.5;
- $P\left( B|A \right)$ is called the class-conditional probability, or likelihood: if A happens, how likely is B to occur next. For example, with A = 'today is May Day' and B = 'today is a holiday', $P\left( B|A \right) =1$;
- $P\left( B \right)$ is the "evidence" factor used for normalization. It aggregates all the possible causes of B: for B = 'today is a holiday', $P\left( B \right)$ covers every way today could be a holiday, such as today being a weekend or May Day. For a given event B, $P\left( B \right)$ is usually a fixed value.
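A tiny numeric sketch of the formula for the holiday example; the probabilities are assumed values, not real statistics:

# A = 'today is May Day', B = 'today is a holiday' (illustrative numbers only)
p_A = 1 / 365          # prior: probability that today is May Day
p_B_given_A = 1.0      # May Day is always a holiday
p_B = 120 / 365        # evidence: overall probability of a holiday
                       # (weekends plus public holidays, assumed value)

p_A_given_B = p_A * p_B_given_A / p_B  # posterior P(A|B)
print(p_A_given_B)     # about 0.008: a holiday alone barely points to May Day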
Explanation of the total probability formula and the Bayesian formula in probability theory and mathematical statistics:
- Understanding the Bayesian formula from two perspectives
(1) Using the total probability formula, the Bayesian formula can be rewritten as:
$$P\left( A|B \right) =\frac{P\left( A \right) P\left( B|A \right)}{P\left( B|A \right) P\left( A \right) +P\left( B|\sim A \right) P\left( \sim A \right)}$$
If we want the proposition "today is a holiday, therefore today is May Day" to be true, all we need is for the term $P\left( B|\sim A \right) P\left( \sim A \right)$ in the denominator to be 0, i.e. every possible cause of the holiday other than May Day must be ruled out. But this proposition is obviously false, because more than one factor can cause a holiday.
So we can conclude that all factors should be considered when making a judgment: if today is a holiday, it is not necessarily May Day; it is more likely to be a weekend.
(2) From the perspective of $P\left( A \right)$: suppose A and B are closely related and take the extreme case $P\left( B|A \right) =1$, i.e. if A happens then B must occur. Even then, if $P\left( A \right)$ is very small, the resulting $P\left( A|B \right)$ will also be small. For example, anyone who wins a million-dollar lottery instantly becomes a millionaire at that moment; yet in fact only a very small fraction of all millionaires got there by winning such a lottery. So even when a factor is strongly related to the result, if the probability of that factor is very low we should consider carefully whether the result was caused by something else.
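A sketch of this second perspective with the lottery example, using assumed numbers: even with $P\left( B|A \right) =1$, a tiny prior $P\left( A \right)$ keeps the posterior small.

# A = 'won the million lottery', B = 'is a millionaire' (assumed numbers)
p_A = 1e-6               # prior: winning is extremely rare
p_B_given_A = 1.0        # every winner becomes a millionaire
p_B_given_notA = 1e-3    # millionaires arise far more often from other causes
p_notA = 1 - p_A

# Expanded Bayesian formula:
# P(A|B) = P(A)P(B|A) / (P(B|A)P(A) + P(B|~A)P(~A))
p_A_given_B = (p_A * p_B_given_A) / (p_B_given_A * p_A + p_B_given_notA * p_notA)
print(p_A_given_B)       # about 0.001: very few millionaires are lottery winners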
4. Bayesian decision theory
Bayesian decision theory is a basic method for making decisions under a probabilistic framework: it selects the optimal class label based on the known relevant probabilities and the misclassification loss.
Suppose there are N possible class labels, i.e. $y=\left\{ c_1,c_2,...,c_N \right\}$, and let $\lambda _{ij}$ denote the loss incurred by misclassifying a sample $x$ whose true label is $c_j$ as $c_i$. The expected loss (conditional risk) of classifying $x$ as $c_i$ is
$$R\left( c_i|x \right) =\sum_{j=1}^N{\lambda _{ij}P\left( c_j|x \right)}$$
Our aim is to minimize the conditional risk. When
$$\lambda _{ij}=\begin{cases} 0, & \text{if } i=j\\ 1, & \text{otherwise}\\ \end{cases}$$
the conditional risk becomes
$$R\left( c|x \right) =1-P\left( c|x \right)$$
and the Bayes optimal classifier that minimizes the classification error rate is
$$h^*\left( x \right) =\underset{c\in y}{arg\max}\,P\left( c|x \right)$$
That is, for each sample $x$ we select the class label with the largest posterior probability $P\left( c|x \right)$. Methods for estimating it include maximum likelihood estimation and maximum a posteriori probability estimation.
(this part comes from the "watermelon book", Zhou Zhihua's Machine Learning)
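A minimal sketch of the conditional risk computation under the 0/1 loss above, with made-up posterior probabilities for three classes:

import numpy as np

# Assumed posteriors P(c_j | x) for three classes (illustrative values)
posterior = np.array([0.2, 0.5, 0.3])

# 0/1 loss matrix: lambda_ij = 0 if i == j, else 1
N = len(posterior)
loss = 1 - np.eye(N)

# Conditional risk R(c_i | x) = sum_j lambda_ij * P(c_j | x) = 1 - P(c_i | x)
risk = loss @ posterior
print(risk)                  # [0.8 0.5 0.7]
print(int(np.argmin(risk)))  # 1: the same class that maximizes the posterior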
5. Maximum likelihood estimation (MLE) and maximum a posteriori probability estimation (MAP)
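Briefly (standard definitions, stated here only as a sketch): MLE chooses the parameters that maximize the likelihood of the observed data, while MAP additionally weights the likelihood by a prior over the parameters:
$$\hat{\theta}_{MLE}=\underset{\theta}{arg\max}\,P\left( x|\theta \right) ,\qquad \hat{\theta}_{MAP}=\underset{\theta}{arg\max}\,P\left( \theta|x \right) =\underset{\theta}{arg\max}\,P\left( x|\theta \right) P\left( \theta \right)$$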
2, Naive Bayesian classifier
1. Introduction
For the posterior probability P(c|x), the class-conditional probability P(x|c) is a joint probability over all attributes, and the attributes can influence one another. To simplify the problem, the naive Bayesian classifier assumes that all attributes are mutually independent, i.e. each attribute has an independent influence on the classification result.
2. Algorithm representation
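Under the attribute-independence assumption, the class-conditional probability factorizes over the d attributes, which gives the usual naive Bayes decision rule (written here as a sketch in the notation above):
$$P\left( x|c \right) =\prod_{i=1}^d{P\left( x_i|c \right)},\qquad h_{nb}\left( x \right) =\underset{c\in y}{arg\max}\,P\left( c \right) \prod_{i=1}^d{P\left( x_i|c \right)}$$
For a continuous attribute, $P\left( x_i|c \right)$ is modeled with a class-wise Gaussian density estimated from the training data, which is what the code below does.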
3. Code implementation
During training, discrete attributes and continuous attributes have to be handled differently.
Continuous attributes:
import numpy as np  # Iris data: three classes, 50 samples each
import random
import math

def data_split(data):  # split the raw data into a 120-sample training set and a 30-sample test set
    data = data[1:]  # drop the header row
    train_ratio = 0.8
    train_size = int(len(data) * train_ratio)
    random.shuffle(data)
    train_set = data[:train_size]
    test_set = data[train_size:]
    return train_set, test_set

def class_separate(data_set):  # group the data by class and count each class
    separated_data = {}
    data_info = {}  # number of samples of each kind of flower
    for row in data_set:
        if row[-1] not in separated_data:  # create a new key; both dictionaries end up with three keys, one per flower species
            separated_data[row[-1]] = []
            data_info[row[-1]] = 0
        separated_data[row[-1]].append(row)  # each sample is appended to the list of its own species
        data_info[row[-1]] += 1
    if 'Species' in separated_data:  # defensive: remove the header label if it slipped through
        del separated_data['Species']
    if 'Species' in data_info:
        del data_info['Species']
    return separated_data, data_info

def prior_prob(data, data_info):  # prior probability of each class
    data_prior_prob = {}
    data_sum = len(data)
    for cla, num in data_info.items():
        data_prior_prob[cla] = num / float(data_sum)
    return data_prior_prob

def mean(data):  # sample mean
    data = [float(x) for x in data]  # strings to numbers
    return sum(data) / len(data)

def var(data):  # sample variance
    data = [float(x) for x in data]
    mean_data = mean(data)
    return sum([math.pow(x - mean_data, 2) for x in data]) / float(len(data) - 1)

def prob_dense(x, mean_value, var_value):  # the attributes are continuous, so use the Gaussian probability density
    exponent = math.exp(math.pow(float(x) - mean_value, 2) / (-2 * var_value))
    return (1 / math.sqrt(2 * math.pi * var_value)) * exponent

def attribute_info(data):  # mean and variance of each of the four attributes
    dataset = np.delete(data, -1, axis=1)  # drop the label column
    result = [(mean(att), var(att)) for att in zip(*dataset)]  # zip transposes the rows into per-attribute columns
    return result

def summarize_class(data):
    data_separated, data_info = class_separate(data)
    summarizeClass = {}
    for index, x in data_separated.items():  # 3 classes x 4 attributes = 12 (mean, variance) pairs in total
        summarizeClass[index] = attribute_info(x)
    return summarizeClass

def cla_prob(test, summarizeClass):  # summarizeClass holds the density parameters (the model parameters) of the three classes; test is one test sample
    prob = {}
    for cla, summary in summarizeClass.items():  # summary is the list of (mean, variance) pairs of the four attributes for class cla
        prob[cla] = 1
        for i in range(len(summary)):  # loop over the attributes
            att_mean, att_var = summary[i]
            x = test[i]
            p = prob_dense(x, att_mean, att_var)  # conditional probability density of this continuous attribute value
            prob[cla] *= p  # the product over all attributes gives the class-conditional probability
    return prob

def bayesian(input_data, data, data_info):  # only the numerator of the Bayesian formula matters, since the denominator is the same for every class: prior times class-conditional probability
    priorProb = prior_prob(data, data_info)
    data_summary = summarize_class(data)
    classProb = cla_prob(input_data, data_summary)
    result = {}
    for cla, prob in classProb.items():
        result[cla] = prob * priorProb[cla]
    return max(result, key=result.get)  # class with the largest (unnormalized) posterior

iris = np.array(np.loadtxt("IrisData.csv", dtype=str, delimiter=",")).tolist()  # 150 iris samples: four attributes, three classes (0, 1, 2), 50 samples per class
train_set, test_set = data_split(iris)
train_separated, train_info = class_separate(train_set)  # grouped data and the sample count of each class
correct = 0
for x in test_set:
    input_data = x[:-1]
    label = x[-1]
    result = bayesian(input_data, train_set, train_info)
    if result == label:
        correct += 1
print("precision: ", correct / len(test_set))
Training results:
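For comparison, a minimal sketch of the same task using scikit-learn's GaussianNB, assuming scikit-learn is installed and that IrisData.csv keeps the layout used above (a header row, four numeric attribute columns and a Species label column):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

raw = np.loadtxt("IrisData.csv", dtype=str, delimiter=",")
X = raw[1:, :-1].astype(float)  # drop the header row and the label column
y = raw[1:, -1]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = GaussianNB().fit(X_train, y_train)
print("precision:", model.score(X_test, y_test))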