Machine learning naive Bayes

Reference resources:
https://cuijiahua.com/blog/2017/11/ml_4_bayes_1.html
https://cuijiahua.com/blog/2017/11/ml_5_bayes_2.html
https://www.jianshu.com/p/5953923f43f0

1. Brief introduction of naive Bayes

1.1 Introduction to the naive Bayes algorithm

The naive Bayes algorithm is a simplification of Bayesian classification: it assumes that the attributes (features) are independent of each other given the target value (class).
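
In other words, for features x1, x2, ..., xn and class y, naive Bayes assumes

P(x1, x2, ..., xn | y) = P(x1 | y) * P(x2 | y) * ... * P(xn | y)

so each feature contributes its own conditional probability independently of the others.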

1.2 Bayes' theorem

Bayesian decision theory: choose the outcome with the highest probability as the final decision.

Bayes' theorem lets us deduce the probability of an event under given conditions from known conditional probabilities and marginal probabilities.

1.3 Conditional probability derivation

Let S denote the whole sample space (all possible outcomes).
Let A be an event with probability P(A), and B an event with probability P(B).
Let A' denote the opposite (complement) event of A, and A ∩ B the event that A and B both occur.
Note that A and A' are mutually exclusive and together make up S.

We can deduce the probability of event A given that B has occurred, and then rewrite A ∩ B step by step:

P(A|B) = P(A ∩ B) / P(B)

This is the conditional probability formula. Since P(A ∩ B) = P(B|A) P(A), it can also be written as:

P(A|B) = P(B|A) P(A) / P(B)

If we further expand P(B) with the total probability formula below, using only the two events A and A', the conditional probability formula becomes:

P(A|B) = P(B|A) P(A) / ( P(B|A) P(A) + P(B|A') P(A') )
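
As a quick numeric check of the formula (the counts below are made up purely for illustration, not taken from the article), P(A|B) can be computed directly from event counts:

# Made-up counts, purely for illustration of P(A|B) = P(A ∩ B) / P(B)
total = 100          # total number of observed outcomes
count_B = 40         # outcomes where B happened
count_A_and_B = 10   # outcomes where both A and B happened

P_B = count_B / total              # P(B) = 0.4
P_A_and_B = count_A_and_B / total  # P(A ∩ B) = 0.1
P_A_given_B = P_A_and_B / P_B      # P(A|B) = 0.1 / 0.4 = 0.25
print(P_A_given_B)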

1.4 Total probability formula

If the events A1, A2, ..., An form a complete event group, i.e. they are mutually exclusive and their union is the whole sample space:

A1 ∪ A2 ∪ ... ∪ An = S,   Ai ∩ Aj = ∅ for i ≠ j

and all of them have positive probability, then for any event B the following total probability formula holds:

P(B) = P(B|A1) P(A1) + P(B|A2) P(A2) + ... + P(B|An) P(An)

1.5 Bayesian inference


Rearranging the conditional probability formula gives Bayes' theorem:

P(A|B) = P(A) * P(B|A) / P(B)

P(A) is called the "prior probability": the probability we assign to event A before event B is observed.
P(A|B) is called the "posterior probability": our re-evaluation of the probability of event A after event B has been observed.
P(B|A) / P(B) is called the "likelihood" (an adjustment factor): it brings the estimated probability closer to the true probability.

Therefore, the conditional probability can be understood as the following formula:
Posterior probability = prior probability x adjustment factor
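
As a small numeric illustration of this reading (the numbers below are made up, not part of the article's example):

# Made-up numbers, purely to illustrate posterior = prior x adjustment factor
P_A = 0.3          # prior probability P(A)
P_B_given_A = 0.8  # P(B|A)
P_B = 0.6          # P(B)

adjustment = P_B_given_A / P_B  # adjustment factor P(B|A) / P(B)
P_A_given_B = P_A * adjustment  # posterior probability P(A|B) = 0.4
print(P_A_given_B)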

1.6 Laplacian smoothing


When maximum likelihood estimation is used, some possible value of a feature X(j) may never appear together with the label Ck in the training samples. The estimated conditional probability is then 0, the whole product in the classifier becomes 0, and the classification is biased. To solve this problem, Bayesian estimation is used instead:

Pλ(X(j) = a_jl | Y = Ck) = ( Σ I(x_i(j) = a_jl, y_i = Ck) + λ ) / ( Σ I(y_i = Ck) + Sj * λ )

where Sj is the number of distinct values the j-th feature can take. λ = 0 gives the maximum likelihood estimate, and λ = 1 is called Laplacian smoothing. Similarly, the Bayesian estimate of the prior probability is:

Pλ(Y = Ck) = ( Σ I(y_i = Ck) + λ ) / ( N + K * λ )

where N is the number of training samples and K is the number of classes.
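
A minimal sketch of the effect of the smoothing, with hypothetical counts (not taken from the example in section 2):

# Hypothetical counts, purely to illustrate Laplacian smoothing (lambda = 1)
count_feature_and_class = 0  # the feature value never appears together with label Ck
count_class = 8              # number of training samples with label Ck
S_j = 3                      # number of distinct values the j-th feature can take
lam = 1.0                    # lambda = 1 gives Laplacian smoothing

p_mle = count_feature_and_class / count_class                             # maximum likelihood estimate: 0.0
p_smoothed = (count_feature_and_class + lam) / (count_class + S_j * lam)  # smoothed estimate: 1/11 > 0
print(p_mle, p_smoothed)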

2. Example

2.1 Example description

Take messages posted in an online community as an example. To keep the community healthy, we want to block insulting speech, so we need to build a fast filter: if a message uses negative or insulting language, it is flagged as inappropriate content. Filtering this kind of content is a common requirement. This problem has two classes, insulting and non-insulting, represented by 1 and 0 respectively.

# Segmented entry
postingList = [
	['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
	['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
	['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
	['stop', 'posting', 'stupid', 'worthless', 'garbage'],
	['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
	['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
]
# Category label vector, 1 for insulting words, 0 for not
classVec = [0, 1, 0, 1, 0, 1]

2.2 Data processing steps

To judge whether a group of words is insulting, we first estimate, from the labelled training data, the probability of each word appearing in insulting messages and in non-insulting messages. For a new group of words we then combine the probabilities of the words it contains to obtain an overall insulting probability and a non-insulting probability, and compare the two to make the final judgement.

  1. Collect all the words in the training data and remove duplicates to build a vocabulary.
  2. Compare each phrase against the vocabulary to turn it into a 0/1 vector of the same length as the vocabulary (vectorization), as sketched below.
  3. Using the class labels, compute from the vectorized data the conditional probability of each word for the insulting class and for the non-insulting class.
  4. Use these probabilities to classify the test phrases.
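
Steps 1 and 2 can be sketched in a few lines (this is only a minimal sketch; the full versions used in this article are createVocabList and setOfWords2Vec in section 2.3):

# Minimal sketch of steps 1-2: build a vocabulary and vectorize each phrase against it
posts = [['my', 'dog', 'has', 'flea'], ['stop', 'posting', 'stupid']]
vocab = sorted(set(word for post in posts for word in post))  # step 1: de-duplicated vocabulary
vectors = [[1 if word in post else 0 for word in vocab] for post in posts]  # step 2: 0/1 word-set vectors
print(vocab)
print(vectors)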

However, if we stop there, many of the per-word conditional probabilities will be 0 (for words that never appear in one of the classes), and the remaining probabilities are very small. When they are multiplied together according to the conditional probability formula, the product becomes 0 or underflows, and the classification goes wrong.

So we improve naive Bayes with Laplacian smoothing (and work with log probabilities).

2.3 Complete code

# !/usr/bin/python
# -*- coding: utf-8 -*- 
# @Time : 2020/1/1 22:48 
# @Author : ljf
# @File : NB_test6.py
import numpy as np


def loadDataSet():
    """
    Function description: create the experiment samples
    Returns:
        postingList:    Experimental sample segmentation terms
        classVec:       Category label vector
    """
    # Segmented entry
    postingList = [
        ['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],
        ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],
        ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],
        ['stop', 'posting', 'stupid', 'worthless', 'garbage'],
        ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],
        ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']
    ]
    classVec = [0, 1, 0, 1, 0, 1]  # Category label vector, 1 for insulting words, 0 for not
    return postingList, classVec


def setOfWords2Vec(vocabList, inputSet):
    """
    Function description: vectorize inputSet according to the vocabList vocabulary; each element of the vector is 1 or 0
    Args:
        vocabList:  createVocabList List returned
        inputSet:   List of segmented entries
    Returns:
        returnVec:  Document vector,Word set model
    """
    returnVec = [0] * len(vocabList)  # Create a vector in which all elements are 0
    for word in inputSet:  # Traverse each entry
        if word in vocabList:  # Set 1 if entry exists in Glossary
            returnVec[vocabList.index(word)] = 1
        else:
            print("the word: %s is not in my Vocabulary!" % word)
    return returnVec  # Return document vector


def createVocabList(dataSet):
    """
    Function description: organize the segmented experimental sample entries into a list of non-repeated entries, that is, a glossary
    Args:
        dataSet:    Collated sample data set
    Returns:
        vocabSet:   Returns a list of entries that are not repeated, that is, a glossary
    """
    vocabSet = set([])  # Create an empty non repeating list
    for document in dataSet:
        vocabSet = vocabSet | set(document)  # Take the union with the words of this document
    return list(vocabSet)


def trainNB0(trainMatrix, trainCategory):
    """
    Function description: naive Bayes classifier training function
    Args:
        trainMatrix:    Training document matrix, i.e setOfWords2Vec Returned returnVec Constructed matrix
        trainCategory:  Training category label vector, i.e loadDataSet Returned classVec
    Returns:
        p0Vect:     Array of log conditional probabilities for the non-insulting class
        p1Vect:     Array of log conditional probabilities for the insulting class
        pAbusive:   The probability that a document is insulting
    """
    numTrainDocs = len(trainMatrix)  # Count the number of documents trained
    numWords = len(trainMatrix[0])  # Number of entries in the vocabulary (length of each document vector)
    pAbusive = sum(trainCategory) / float(numTrainDocs)  # The probability that documents belong to insults
    p0Num = np.ones(numWords)
    p1Num = np.ones(numWords)  # Laplacian smoothing: initialize every word count to 1 instead of 0
    p0Denom = 2.0
    p1Denom = 2.0  # Laplacian smoothing: initialize the denominators to 2
    for i in range(numTrainDocs):
        if trainCategory[i] == 1:  # Data required for statistics of conditional probabilities belonging to insults, i.e. P(w0|1),P(w1|1),P(w2|1)···
            p1Num += trainMatrix[i]
            p1Denom += sum(trainMatrix[i])
        else:  # Data required for statistics of conditional probabilities belonging to non insulting categories, i.e. P(w0|0),P(w1|0),P(w2|0)···
            p0Num += trainMatrix[i]
            p0Denom += sum(trainMatrix[i])
    p1Vect = np.log(p1Num / p1Denom)  # Take the log to avoid underflow when many small probabilities are multiplied
    p0Vect = np.log(p0Num / p0Denom)
    return p0Vect, p1Vect, pAbusive  # Return the log conditional probability arrays of the non-insulting and insulting classes, and the probability that a document is insulting


def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):
    """
    //Function Description: naive Bayesian classifier classification function
    Args:
        vec2Classify:   Array of terms to be classified
        p0Vec:          Conditional probability array of insulting class
        p1Vec:          Conditional probability array of non insulting class
        pClass1:        The probability that documents belong to insults
    Returns:
        0:              Non insulting
        1:              It's an insult
    """
    p1 = sum(vec2Classify * p1Vec) + np.log(pClass1)  # Sum the log probabilities of the words that are present (multiplying probabilities corresponds to adding logs, since log(A * B) = log(A) + log(B)), then add log(pClass1)
    p0 = sum(vec2Classify * p0Vec) + np.log(1.0 - pClass1)
    print('p0:', p0)
    print('p1:', p1)
    if p1 > p0:
        return 1
    else:
        return 0


def testingNB():
    """
    Function description: test the naive Bayes classifier
    Returns:
        nothing
    """
    listOPosts, listClasses = loadDataSet()  # Create an experiment sample
    myVocabList = createVocabList(listOPosts)  # Create Glossary
    trainMat = []
    for postinDoc in listOPosts:
        trainMat.append(setOfWords2Vec(myVocabList, postinDoc))  # Quantify the experimental samples
    p0V, p1V, pAb = trainNB0(np.array(trainMat), np.array(listClasses))  # Training naive Bayesian classifier

    testEntry = ['love', 'my', 'dalmation']  # Test sample 1
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # Test sample Vectorization
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, "It's an insult")  # Perform classification and print the result
    else:
        print(testEntry, 'Non-insulting')  # Perform classification and print the result

    testEntry = ['stupid', 'garbage']  # Test sample 2
    thisDoc = np.array(setOfWords2Vec(myVocabList, testEntry))  # Test sample Vectorization
    if classifyNB(thisDoc, p0V, p1V, pAb):
        print(testEntry, "It's an insult")  # Perform classification and print the result
    else:
        print(testEntry, 'Non-insulting')  # Perform classification and print the result


if __name__ == '__main__':
    testingNB()

3. Summary

  • Before training a naive Bayes classifier, text cleaning matters a great deal and still needs to be studied.
  • The text is vectorized according to the extracted classification features, and then the naive Bayes classifier is trained.
  • The number of high-frequency words removed has an impact on the results.
  • Laplacian smoothing plays a positive role in improving the classification performance of the naive Bayes classifier.