MLE for Poisson Distribution

MLE for Poisson Distribution

Poisson Example: In a study, researchers are interested in anayzing co-speech gestures (arm and hand movements accompanying the speech) by Katalan and Korean speakers. The data below comes from study in which participants watched a cartoon and then told a partner (a friend or a professor) about what they had seen. The data is available as a .csv file in the following Github repository.

If you place the word flat at the beginning of data url you can quickly view data as a Flat Data as given here.

Question: Assuming that the number of gestures follows a Poisson distribution with a mean parameter \(\lambda\). Find the maximum likelihood estimate of \(\lambda\) based on this sample.

#turn off warnings
import warnings
warnings.filterwarnings("ignore")
#import required libraries
import pandas as pd
import numpy as np
from scipy.special import factorial
import matplotlib.pyplot as plt

First import the data, check its shape (size), and hava a look at a few first lines. Please note that to read a .csv file from a GitHub repository, first go to the url where the data is available, click on the data set, then a preview window will open, and click on the Raw data button, copy thins url and paste it into pd.read_csv.

gesture_data = pd.read_csv("https://raw.githubusercontent.com/bodowinter/poisson_tutorial/main/dyads.csv")
#size of the data frame
gesture_data.shape
(54, 6)
#let's see the first 10 rows
gesture_data.head(n = 10) 
ID context dur language gender gestures
0 Catalan_1 friend 137 Catalan M 61
1 Catalan_1 prof 136 Catalan M 78
2 Catalan_2 friend 90 Catalan F 31
3 Catalan_2 prof 107 Catalan F 40
4 Catalan_3 friend 181 Catalan M 81
5 Catalan_3 prof 165 Catalan M 49
6 Catalan_4 friend 130 Catalan M 32
7 Catalan_4 prof 115 Catalan M 49
8 Catalan_5 friend 122 Catalan F 39
9 Catalan_5 prof 125 Catalan F 32

Get the gestures column only and change its name.

#get only the gestures column
data = gesture_data.gestures

Check the mean.

np.mean(data)
49.44444444444444
#creater a grid vector for lambda
lambd = np.arange(1, 1000, 1)
#print(lambd)

Define a function which calculates the MLE of the data under a Poisson distribution. This function requires two parameters: data and a vector of possible values for \(\lambda\) parameter. Then, the function calculates the log-likelihood of the data for each \(\lambda\) value, and plot log-likelihood values versus \(\lambda\) values. The \(\hat \lambda\) where the curve the attains a maximum can considered as the maximum likelihood estimate for \(\lambda\).

def poisson_loglik_plot(data, lambd):
    
    #fixed values
    n = np.size(data)
    sum_x = np.sum(data)
    sum_in_x_factorial = np.sum(np.log(factorial(data)))
    
    #define log_likelihood
    log_lik = -lambd*n + sum_x * np.log(lambd)-sum_in_x_factorial
    
    # plot log-lik versus lambda values
    fig, ax = plt.subplots(figsize = (6, 6))

    plt.plot(lambd, log_lik)  
    plt.axvline(x=lambd[np.argmax(log_lik)], color='r') #find the lambda where log-lik is a maximum.
    ax.text(49, -48000, '49', color='red', ha='center', va='top') #not a smart way         
    plt.xlabel(r'$\lambda$')
    plt.ylabel('Log-likelihood value')
    plt.title('Log-likelihood of gestures data under Poisson Distribution')
    plt.legend(loc='upper right')
    plt.show()
poisson_loglik_plot(data = data, lambd = lambd)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../../_images/poisson_mle_16_1.png

The \(\hat \lambda\) = 49 is the MLE for \(\lambda\).

Some optimzation routines in Python may prefer minimizing an objective function. In that case, we need to define negative log-likelihood. In such a case, the \(\hat \lambda\) where the curve the attains a minimum can considered as the maximum likelihood estimate for \(\lambda\).

def poisson_negloglik_plot(data, lambd):
    
    #fixed values
    n = np.size(data)
    sum_x = np.sum(data)
    sum_in_x_factorial = np.sum(np.log(factorial(data)))
    
    #define log_likelihood
    log_lik = -lambd*n + sum_x * np.log(lambd)-sum_in_x_factorial
    
    neg_log_lik = -log_lik
    
    # plot neg-log-lik versus lambda values
    fig, ax = plt.subplots(figsize = (6, 6))

    plt.plot(lambd, neg_log_lik)  
    plt.axvline(x=lambd[np.argmin(neg_log_lik)], color='r') #find the lambda where neg_log-lik is a minimum.
    ax.text(49, -3000, '49', color='red', ha='center', va='top') #not a smart way         
    plt.xlabel(r'$\lambda$')
    plt.ylabel('Negative log-likelihood value')
    plt.title('Log-likelihood of gestures data under Poisson Distribution')
    plt.legend(loc='upper right')
    plt.show()
poisson_negloglik_plot(data = data, lambd = lambd)
No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
../../_images/poisson_mle_20_1.png

We can also find the maximum (minimum) point where the log-likelihood (negative log-likelihood) attains through an optimization routine.

First define neg-log-lik as a function.

def poisson_negloglik(lambd):
    
    #data is not an input argument, assuming data object is already available in the session.
    #fixed values
    n = np.size(data)
    sum_x = np.sum(data)
    sum_in_x_factorial = np.sum(np.log(factorial(data)))
    
    #define log_likelihood
    log_lik = -lambd*n + sum_x * np.log(lambd)-sum_in_x_factorial
    
    neg_log_lik = -log_lik
    
    return neg_log_lik

Then use minimize() function fom Scipy.optimize with method='L-BFGS-B' to find the arg min.

from scipy.optimize import minimize
poisson_mle = minimize(poisson_negloglik, np.array([1]), method='L-BFGS-B')
poisson_mle
      fun: 443.5523349036321
 hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
      jac: array([0.])
  message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
     nfev: 34
      nit: 14
     njev: 17
   status: 0
  success: True
        x: array([49.4445145])