Chimerge algorithm and matlab code implementation (data discretization)

Chimerge algorithm and matlab code implementation (data discretization)

1. Algorithm analysis

(chi square test)

Is the deviation degree between the actual observation value and the theoretical inference value of the statistical sample. The deviation degree between the actual observation value and the theoretical inference value determines the chi square value. If the chi square value is large, the deviation is large, on the contrary, the deviation is small; If the square values are equal, the two values are equal.

χ 2 = ∑ i = 1 m ∑ j = 1 k ( A i j − E i j ) 2 E i j \chi^2=\sum_{i=1}^m\sum_{j=1}^k{(A_{ij}-E_{ij})^2\over E_{ij}} χ2=i=1∑m​j=1∑k​Eij​(Aij​−Eij​)2​
Suppose to compare the chi square of two attribute values in the first attribute in iris:

Attribute valueFirst classSecond classThird class
4.31600
4.9411

Data in table 16 16 16 indicates the number of instances whose attribute value of the first attribute in iris dataset is 4.3 and whose instance label is 1

Because the chi square between two adjacent intervals is compared
m m The value of m is 2
k k k is the number of categories of each dataset (the category of iris dataset is 3)
A i j A_{ij} Aij is the actual frequency. The ith attribute value (4.3 or 4.9) label is the number of times of class j
R i R_{i} Ri , is the number of occurrences of the ith attribute value / / take iris dataset as an example (16 + 0 + 0)
C j C_j Cj is the total number of the j-th category / / take iris dataset as an example (16 + 4)
N N N is the amount of all the above data / / take iris data set as an example (16 + 0 + 0 + 4 + 1 + 1)
E i j E_{ij} Eij is A i j A_{ij} Aij # expected value R i ∗ C j N {R_i*C_j\over N} NRi​∗Cj​​

Chimerge algorithm mainly relies on Chi square test: the adjacent intervals with the minimum chi square value are combined until the stopping criterion is met.

Algorithm flow:
1) Initialization: sort the instances according to the discrete attribute values [first attribute, instance label] / [second attribute, instance label] /... [i attribute, instance label] to divide i group of matrices and sort the attribute values of this i group of matrices
2) Perform interval merging from bottom to top. When the conditions are met, stop merging
(there are two kinds of conditions to meet here: the first is that the chi square value is greater than a certain threshold; the second is that the merging will be stopped when the interval reaches a certain number)
    . Calculate the chi square distribution of adjacent sections according to the above chi square formula
    . Merge the pair of intervals with the lowest chi square value

2. Matlab code

function [new_data,cut_point]=Chimerge(data,data_block);

   
   [m,n]=size(data);      
   class=length(unique(data(:,1))); 
   cut_point=zeros(n-1,data_block); 
   new_data=zeros(m,n);            
   new_data(:,1)=data(:,1);         
   att_map=cell(n,1);      
   
  
   for i=2:n  %traversal attributes 
       pair_att_lable=[data(:,i),data(:,1)];  
       pair_att_lable=sortrows(pair_att_lable,1);
       tmp_map=zeros(1,class+1);   
    
       inner=1;     
       for j=1:m   
           num=pair_att_lable(j,1);   
           loc=find(tmp_map(:,1)==num); 
           if isempty(loc)  
               tmp_map(inner,1)=num; 
               tmp_map(inner,pair_att_lable(j,2)+1)=tmp_map(inner,pair_att_lable(j,2)+1)+1;
               inner=inner+1;
           else
     
               tmp_map(loc,pair_att_lable(j,2)+1)=tmp_map(loc,pair_att_lable(j,2)+1)+1;
           end
           clear loc; %eliminate loc variable
       end
       att_map{i}=tmp_map;
       clear num pair_att_lable tmp_map inner loc
   end
   clear num pair_att_lable tmp_map inner loc

   for i=2:n 
       temp_data=att_map{i};    
       corrent_interval=length(temp_data(:,1));  
       while corrent_interval>data_block       
           chi2_mat=zeros(corrent_interval-1,1); 
           for j=1:corrent_interval-1           
               chi2=chi2test(temp_data(j,2:end),temp_data(j+1,2:end));   
               chi2_mat(j)=chi2;              
           end
           [minv,index]=min(chi2_mat);           %Find the smallest chi square value
           merge_loc_1=index;                     
           merge_loc_2=index+1;
           temp_data(merge_loc_1,2:end)=temp_data(merge_loc_1,2:end)+temp_data(merge_loc_2,2:end); %Data merging
           temp_data(merge_loc_2,:)=[];        
           corrent_interval=corrent_interval-1;   
       end
       %------------------------------------------Get split point

       cut_point(i-1,:)=temp_data(:,1);           
       %------------------------------------------ 
       
       
     
       %------------------------------------------
       for j=1:m  %Traverse all instances
           for k=1:data_block    
               if k~=data_block    
                   if data(j,i)>=temp_data(k,1) && data(j,i)<temp_data(k+1,1) %temp_data Is a one-dimensional column matrix
                       new_data(j,i)=k;
                       break;       
                   end
                   if data(j,i)<temp_data(1,1)  
                       new_data(j,i)=1;
                       break;
                   end
               else
                   if data(j,i)>=temp_data(k,1)
                       new_data(j,i)=k;
                   end
               end
           end
       end
       %-------------------------------------------
       
       
   end
   
end

%Here is u v Finding chi square distribution from two row vectors
function chi2=chi2test(u,v)
    len=length(u);   %Get the number of categories
    N=sum(u)+sum(v); %Get the total number of instances
    chi2=0;          %Initialize the passed parameters
    box=[u;v];       %Constructed as a two row matrix
    for i=1:2
        for j=1:len
            Eij=sum(box(i,:))*sum(box(:,j))/N;
            if Eij==0
                tmp_chi=0;
            else
                tmp_chi=(box(i,j)-Eij)^2/Eij;
            end
            chi2=chi2+tmp_chi;
        end
    end
    
end

Link: https://download.csdn.net/download/weixin_44912943/19102941
The chimerge M download file contains the analysis of the above code
If the code is wrong, please leave a message and point it out!
If you have any questions, please leave a message to discuss!

Keywords: Algorithm Machine Learning

Added by mendoz on Tue, 08 Feb 2022 04:11:58 +0200