Building a GMM likelihood function with PyMC3

1. Likelihood function of GMM

For
$$K=3,\qquad \mu=\begin{bmatrix}(1,1)\\(3,3)\\(5,5)\end{bmatrix},\qquad \mathrm{cov}=\begin{bmatrix}\begin{pmatrix}1 & -0.8\\ -0.8 & 1\end{pmatrix}\\ \begin{pmatrix}1 & 0.8\\ 0.8 & 1\end{pmatrix}\\ \begin{pmatrix}1 & -0.8\\ -0.8 & 1\end{pmatrix}\end{bmatrix}$$
and mixing weights $w\sim \mathrm{Dirichlet}(\alpha)$ with $\alpha=(8,4,8)$ (so the expected weights are $\alpha/\sum_k\alpha_k=(0.4,0.2,0.4)$), the likelihood function can be constructed, and samples drawn from it, with either pm.Mixture or pm.DensityDist in PyMC3.
First, construct the analytical density as a reference.

%matplotlib inline
import arviz as az
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import pymc3 as pm
import theano.tensor as tt
import seaborn as sns
from sklearn import mixture
import dill, theano
from scipy.stats import multivariate_normal as mvn

#Parameters of the target mixture (reused throughout the rest of the article)
K = 3
means = np.array([1, 1, 3, 3, 5, 5]).reshape((3, 2))
cov = np.array([1, -0.8, -0.8, 1, 1, 0.8, 0.8, 1, 1, -0.8, -0.8, 1]).reshape((3, 2, 2))
alpha = np.array([8, 4, 8])
w = np.random.dirichlet(alpha)   #one draw of the mixing weights

#Evaluate the analytical mixture density on a grid
x, y = np.mgrid[-3:9:0.1, -3:9:0.1]
pos = np.dstack((x, y))
com1 = mvn(means[0], cov[0]).pdf(pos)
com2 = mvn(means[1], cov[1]).pdf(pos)
com3 = mvn(means[2], cov[2]).pdf(pos)
likelihood_function = w[0]*com1 + w[1]*com2 + w[2]*com3

fig = plt.figure(1)
ax = fig.add_subplot(111)
ax.contourf(x, y, likelihood_function, levels=100, cmap='jet')
ax.scatter(x=means[:, 0], y=means[:, 1], color='white', edgecolor='black')
plt.title('GMM Analytical, K=3')
plt.show()

Operation results:

The goal of constructing the likelihood function is not parameter estimation but stable generation of accurate samples, so the model should be given as many fixed, accurate parameters as possible.
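
Since every parameter of the mixture is known, a reference sample can also be drawn directly with NumPy, without any MCMC. The following is a minimal sketch (not in the original workflow; the seed and the sample size of 10,000 are arbitrary choices, and it reuses K, means, cov and w defined above):

#Draw ground-truth samples directly from the known mixture (no MCMC involved)
rng = np.random.default_rng(0)
n = 10000
labels = rng.choice(K, size=n, p=w)              #pick a component for each sample
counts = np.bincount(labels, minlength=K)
ref_samples = np.vstack([rng.multivariate_normal(means[k], cov[k], size=counts[k])
                         for k in range(K)])
sns.jointplot(x=ref_samples[:, 0], y=ref_samples[:, 1], kind='hex', cmap='Blues')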

1.1 Using pm.Mixture

To lighten the burden on PyMC3's inference, the parameters (K, means, cov and the weights w defined above) are assigned outside the model as fixed NumPy values; only the mixture itself is declared inside the model.

with pm.Model() as model:
    #Three fixed 2-D Gaussian components
    comp1 = pm.MvNormal.dist(mu=means[0], cov=cov[0], shape=2)
    comp2 = pm.MvNormal.dist(mu=means[1], cov=cov[1], shape=2)
    comp3 = pm.MvNormal.dist(mu=means[2], cov=cov[2], shape=2)
    #The mixture likelihood with fixed weights w
    liklihood = pm.Mixture('liklihood', w=w, comp_dists=[comp1, comp2, comp3], shape=2)

Sample with the Slice sampler, convert the trace to a DataFrame, and view the samples with a Seaborn jointplot:

with model:
	trace = pm.sample(2000, step=pm.Slice(), tune=2000, pickle_backend='dill')
	dftrace = pm.trace_to_dataframe(trace)
sns.jointplot(data=dftrace, x='liklihood__0', y='liklihood__1', kind='hex')

Operation results:

The sampled distribution is essentially consistent with the analytical density.
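
As a rough numerical check (not in the original article; it assumes the trace, means, cov and w from above), the empirical mean and covariance of the MCMC samples can be compared with the analytical moments of the mixture:

#Empirical moments of the MCMC samples
samples = trace['liklihood']                 #shape (n_draws, 2)
#Analytical moments of the mixture:
#  mean       = sum_k w_k * mu_k
#  covariance = sum_k w_k * (Sigma_k + mu_k mu_k^T) - mean mean^T
mix_mean = (w[:, None] * means).sum(axis=0)
mix_cov = sum(w[k] * (cov[k] + np.outer(means[k], means[k])) for k in range(K)) \
          - np.outer(mix_mean, mix_mean)
print('empirical mean:', samples.mean(axis=0), 'analytical mean:', mix_mean)
print('empirical covariance:\n', np.cov(samples.T), '\nanalytical covariance:\n', mix_cov)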

1.2 Using pm.DensityDist

Construct the analytical density as a Theano expression, take its logarithm to obtain the log-likelihood, and finally pass it to pm.DensityDist as the likelihood.

#Constructing the analytical likelihood function as a Theano expression
cov_inverse = np.linalg.inv(cov)   #(3, 2, 2) precision matrices
cov_det = np.linalg.det(cov)       #(3,) determinants
z = tt.matrix('z')
z.tag.test_value = pm.floatX([[0, 0]])
def gmm2dliklihood(z):
    #z has shape (n, 2); after transposing, z[0] and z[1] are the two coordinates
    z = z.T
    #Each component is a 2-D Gaussian density: exp(-quadratic form/2) / (2*pi*sqrt(det))
    comp1 = 1/(2*np.pi*np.sqrt(cov_det[0]))*tt.exp(-1/2*(
         (z[0]-means[0,0])**2*cov_inverse[0,0,0]
        +(z[0]-means[0,0])*(z[1]-means[0,1])*cov_inverse[0,0,1]
        +(z[1]-means[0,1])*(z[0]-means[0,0])*cov_inverse[0,1,0]
        +(z[1]-means[0,1])**2*cov_inverse[0,1,1]))
    comp2 = 1/(2*np.pi*np.sqrt(cov_det[1]))*tt.exp(-1/2*(
         (z[0]-means[1,0])**2*cov_inverse[1,0,0]
        +(z[0]-means[1,0])*(z[1]-means[1,1])*cov_inverse[1,0,1]
        +(z[1]-means[1,1])*(z[0]-means[1,0])*cov_inverse[1,1,0]
        +(z[1]-means[1,1])**2*cov_inverse[1,1,1]))
    comp3 = 1/(2*np.pi*np.sqrt(cov_det[2]))*tt.exp(-1/2*(
         (z[0]-means[2,0])**2*cov_inverse[2,0,0]
        +(z[0]-means[2,0])*(z[1]-means[2,1])*cov_inverse[2,0,1]
        +(z[1]-means[2,1])*(z[0]-means[2,0])*cov_inverse[2,1,0]
        +(z[1]-means[2,1])**2*cov_inverse[2,1,1]))
    #The mixture density is the weighted sum of the components
    return w[0]*comp1 + w[1]*comp2 + w[2]*comp3

First, verify that the constructed likelihood function is correct:

#Compile the Theano expression into a callable function
gmm2dliklihood_plot = theano.function([z], gmm2dliklihood(z))
#Evaluate it on a 1000x1000 grid and plot the contours
grid = pm.floatX(np.mgrid[-2:9:1000j, -2:9:1000j])
grid2d = grid.reshape(2, -1).T
pdf = gmm2dliklihood_plot(grid2d)
plt.contourf(grid[0], grid[1], pdf.reshape(1000, 1000), cmap='inferno', levels=80)
plt.title('theano function gmm2d')
plt.show()

Operation results:

#Constructing the log-likelihood function
def gmmlogp(z):
    return tt.log(gmm2dliklihood(z))
#Pass the log-likelihood to pm.DensityDist, sample with Slice, and convert the trace to a DataFrame
with pm.Model() as model:
    gmm2dlike = pm.DensityDist("gmm2dlike", logp=gmmlogp, shape=(2,))
    trace = pm.sample(2000, step=pm.Slice(), tune=2000, pickle_backend='dill')
    dftrace = pm.trace_to_dataframe(trace)
#View through jointplot
sns.jointplot(data=dftrace, x='gmm2dlike__0', y='gmm2dlike__1', kind='hex', cmap='Reds')

Operation results:

Multiprocess sampling (4 chains in 4 jobs)
Slice: [gmm2dlike]

 100.00% [16000/16000 00:11<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 2_000 draw iterations (8_000 + 8_000 draws total) took 30 seconds.
The number of effective samples is smaller than 10% for some parameters.


The result is essentially consistent with the analytical density and with the result in 1.1, but both methods are too slow for generating large numbers of samples, so a faster approach is worth considering.
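
The "effective samples" warning in the sampler output can be quantified with ArviZ (already imported above). This is a sketch, not part of the original, and assumes the trace and model from the DensityDist run in 1.2:

#Convert the PyMC3 trace and look at effective sample size and r_hat;
#a low ess relative to the 8,000 raw draws is why such long runs are needed.
with model:
    idata = az.from_pymc3(trace)
print(az.summary(idata))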

2. Use sklearn's GMM as a fast sample generator

Using sklearn's GaussianMixture as a fast sample generator is very convenient in practice. When 100,000 or more sample points from the GMM likelihood are needed, there is no need to run a long MCMC chain with pm.sample; instead, fit sklearn's GMM to a small number of MCMC samples and then generate the large sample set almost instantly, saving a great deal of time.

from sklearn import mixture
#Fit sklearn's GMM to the MCMC samples of the 'liklihood' variable from Section 1.1
skgmm = mixture.GaussianMixture(n_components=K
                                , covariance_type='full'
                                , weights_init=w
                                , means_init=means
                               )
skgmm.fit(trace['liklihood'])
#Generate 10,000,000 samples directly from the fitted GMM and view them
sksamples, index = skgmm.sample(10000000)
sns.jointplot(x=sksamples[:,0], y=sksamples[:,1], kind='hex', cmap='Greens')

The difference between the two approaches can be checked by timing. First, generate 1,000,000 draws per chain from the same GMM likelihood by MCMC, using the pm.Mixture model:

with pm.Model() as model:
	comp1 = pm.MvNormal.dist(mu=means[0], cov=cov[0], shape=2)
	comp2 = pm.MvNormal.dist(mu=means[1], cov=cov[1], shape=2)
	comp3 = pm.MvNormal.dist(mu=means[2], cov=cov[2], shape=2)
	liklihood = pm.Mixture('liklihood', w=w, comp_dists=[comp1, comp2, comp3], shape=2)
	trace = pm.sample(1000000, step=pm.Slice(), tune=2000, pickle_backend='dill')
	dftrace = pm.trace_to_dataframe(trace)

Operation results:

Multiprocess sampling (4 chains in 4 jobs)
Slice: [liklihood]
100.00% [4008000/4008000 47:17<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 1_000_000 draw iterations (8_000 + 4_000_000 draws total) took 2856 seconds.
The number of effective samples is smaller than 10% for some parameters.

Or with the pm.DensityDist version (reusing gmm2dliklihood and gmmlogp as defined in Section 1.2):

with pm.Model() as model:
    gmm2dlike = pm.DensityDist("gmm2dlike", logp=gmmlogp, shape=(2,))
    trace = pm.sample(1000000, step=pm.Slice(), tune=2000, pickle_backend='dill')
    dftrace = pm.trace_to_dataframe(trace)

Operation results:

Multiprocess sampling (4 chains in 4 jobs)
Slice: [gmm2dlike]
100.00% [4008000/4008000 24:18<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 2_000 tune and 1_000_000 draw iterations (8_000 + 4_000_000 draws total) took 1475 seconds.
The number of effective samples is smaller than 10% for some parameters.

Both MCMC runs take a long time. Now the approach of a small MCMC run plus sklearn's GMM:

from sklearn import mixture
#A short MCMC run with the DensityDist likelihood (gmmlogp as defined in Section 1.2)
with pm.Model() as model:
    gmm2dlike = pm.DensityDist("gmm2dlike", logp=gmmlogp, shape=(2,))
    trace = pm.sample(400, step=pm.Slice(), pickle_backend='dill')
    dftrace = pm.trace_to_dataframe(trace)
Only 400 samples in chain.
Multiprocess sampling (4 chains in 4 jobs)
Slice: [gmm2dlike]

100.00% [5600/5600 00:06<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 400 draw iterations (4_000 + 1_600 draws total) took 23 seconds.
The rhat statistic is larger than 1.05 for some parameters. This indicates slight problems during sampling.
The estimated number of effective samples is smaller than 200 for some parameters.
%%time
#Fit sklearn's GMM to the small MCMC sample, then generate 1,000,000 points from it
skgmm = mixture.GaussianMixture(n_components=K
                                , covariance_type='full'
                                , weights_init=w
                                , means_init=means)
skgmm.fit(trace['gmm2dlike'])
sksamples, index = skgmm.sample(1000000)
#View the sklearn-generated samples
sns.jointplot(x=sksamples[:,0], y=sksamples[:,1], kind='hex', cmap='Reds')

Wall time: 173 ms

Operation results:
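
As a final sanity check (a sketch, not in the original), the parameters recovered by the fitted GaussianMixture from only 1,600 MCMC draws can be compared with the true mixture parameters:

#Parameters learned by the sklearn generator; compare with w, means and cov above
print('fitted weights:', np.round(skgmm.weights_, 3))
print('fitted means:\n', np.round(skgmm.means_, 2))
print('fitted covariances:\n', np.round(skgmm.covariances_, 2))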
