MLE for Poisson Distribution
Poisson Example: In a study, researchers are interested in analyzing co-speech gestures (arm and hand movements accompanying speech) produced by Catalan and Korean speakers. The data below comes from a study in which participants watched a cartoon and then told a partner (a friend or a professor) about what they had seen. The data is available as a .csv
file in the following GitHub repository.
If you place the word flat
at the beginning of the data URL, you can quickly view the data with Flat Data, as given here.
Question: Assuming that the number of gestures follows a Poisson distribution with mean parameter \(\lambda\), find the maximum likelihood estimate of \(\lambda\) based on this sample.
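Before turning to code, it helps to write down the quantity we will maximize. For an observed sample \(x_1, \dots, x_n\), the Poisson log-likelihood is
\[
\ell(\lambda) \;=\; \sum_{i=1}^{n} \log\!\left(\frac{e^{-\lambda}\,\lambda^{x_i}}{x_i!}\right) \;=\; -n\lambda \;+\; \left(\sum_{i=1}^{n} x_i\right)\log\lambda \;-\; \sum_{i=1}^{n}\log(x_i!).
\]
Setting \(\ell'(\lambda) = -n + \tfrac{1}{\lambda}\sum_{i=1}^{n} x_i = 0\) gives the closed-form solution \(\hat\lambda = \bar{x}\), the sample mean; both the grid search and the numerical optimizer below should recover approximately this value.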
#turn off warnings
import warnings
warnings.filterwarnings("ignore")
#import required libraries
import pandas as pd
import numpy as np
from scipy.special import factorial
import matplotlib.pyplot as plt
First import the data, check its shape (size), and have a look at the first few rows. Please note that to read a .csv
file from a GitHub
repository, first go to the URL where the data is available and click on the data set; a preview window will open. Click on the Raw
button, copy this URL, and paste it into pd.read_csv
.
gesture_data = pd.read_csv("https://raw.githubusercontent.com/bodowinter/poisson_tutorial/main/dyads.csv")
#size of the data frame
gesture_data.shape
(54, 6)
#let's see the first 10 rows
gesture_data.head(n = 10)
|   | ID | context | dur | language | gender | gestures |
|---|---|---|---|---|---|---|
| 0 | Catalan_1 | friend | 137 | Catalan | M | 61 |
| 1 | Catalan_1 | prof | 136 | Catalan | M | 78 |
| 2 | Catalan_2 | friend | 90 | Catalan | F | 31 |
| 3 | Catalan_2 | prof | 107 | Catalan | F | 40 |
| 4 | Catalan_3 | friend | 181 | Catalan | M | 81 |
| 5 | Catalan_3 | prof | 165 | Catalan | M | 49 |
| 6 | Catalan_4 | friend | 130 | Catalan | M | 32 |
| 7 | Catalan_4 | prof | 115 | Catalan | M | 49 |
| 8 | Catalan_5 | friend | 122 | Catalan | F | 39 |
| 9 | Catalan_5 | prof | 125 | Catalan | F | 32 |
Get only the gestures column and assign it to a new name, data.
#get only the gestures column
data = gesture_data.gestures
Check the mean.
np.mean(data)
49.44444444444444
#create a grid vector for lambda
lambd = np.arange(1, 1000, 1)
#print(lambd)
Define a function which calculates the MLE of the data under a Poisson distribution. This function requires two arguments: the data and a vector of candidate values for the \(\lambda\) parameter. The function calculates the log-likelihood of the data for each \(\lambda\) value and plots the log-likelihood values versus the \(\lambda\) values. The \(\hat \lambda\) at which the curve attains its maximum can be taken as the maximum likelihood estimate of \(\lambda\).
def poisson_loglik_plot(data, lambd):
    # fixed quantities that do not depend on lambda
    n = np.size(data)
    sum_x = np.sum(data)
    sum_ln_x_factorial = np.sum(np.log(factorial(data)))
    # log-likelihood evaluated at every lambda value on the grid
    log_lik = -lambd * n + sum_x * np.log(lambd) - sum_ln_x_factorial
    # lambda value on the grid where the log-likelihood is a maximum
    lambd_hat = lambd[np.argmax(log_lik)]
    # plot log-likelihood versus lambda values
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.plot(lambd, log_lik)
    ax.axvline(x=lambd_hat, color='r', label=fr'$\hat\lambda$ = {lambd_hat}')
    ax.set_xlabel(r'$\lambda$')
    ax.set_ylabel('Log-likelihood value')
    ax.set_title('Log-likelihood of gestures data under Poisson distribution')
    ax.legend(loc='upper right')
    plt.show()
poisson_loglik_plot(data = data, lambd = lambd)
The \(\hat \lambda = 49\) is the MLE for \(\lambda\) on this integer grid (the exact maximizer is the sample mean, \(\approx 49.44\)).
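As a quick check (a minimal sketch reusing the objects defined above; the finer grid below is only illustrative), evaluating the same log-likelihood on a denser grid around the sample mean recovers the non-integer maximizer:

#evaluate the same log-likelihood on a finer grid around the sample mean
fine_lambd = np.arange(45, 55, 0.001)
n = np.size(data)
sum_x = np.sum(data)
sum_ln_x_factorial = np.sum(np.log(factorial(data)))
fine_log_lik = -fine_lambd * n + sum_x * np.log(fine_lambd) - sum_ln_x_factorial
print(fine_lambd[np.argmax(fine_log_lik)])  #approximately 49.444, i.e. the sample mean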
Some optimization routines in Python prefer minimizing an objective function. In that case, we need to define the negative log-likelihood. The \(\hat \lambda\) at which this curve attains its minimum can be taken as the maximum likelihood estimate of \(\lambda\).
def poisson_negloglik_plot(data, lambd):
    # fixed quantities that do not depend on lambda
    n = np.size(data)
    sum_x = np.sum(data)
    sum_ln_x_factorial = np.sum(np.log(factorial(data)))
    # log-likelihood and its negative for every lambda value on the grid
    log_lik = -lambd * n + sum_x * np.log(lambd) - sum_ln_x_factorial
    neg_log_lik = -log_lik
    # lambda value on the grid where the negative log-likelihood is a minimum
    lambd_hat = lambd[np.argmin(neg_log_lik)]
    # plot negative log-likelihood versus lambda values
    fig, ax = plt.subplots(figsize=(6, 6))
    ax.plot(lambd, neg_log_lik)
    ax.axvline(x=lambd_hat, color='r', label=fr'$\hat\lambda$ = {lambd_hat}')
    ax.set_xlabel(r'$\lambda$')
    ax.set_ylabel('Negative log-likelihood value')
    ax.set_title('Negative log-likelihood of gestures data under Poisson distribution')
    ax.legend(loc='upper right')
    plt.show()
poisson_negloglik_plot(data = data, lambd = lambd)
We can also find the point at which the log-likelihood (negative log-likelihood) attains its maximum (minimum) through an optimization routine.
First, define the negative log-likelihood as a function.
def poisson_negloglik(lambd):
    # data is not an input argument; we assume the data object is already available in the session
    # fixed quantities that do not depend on lambda
    n = np.size(data)
    sum_x = np.sum(data)
    sum_ln_x_factorial = np.sum(np.log(factorial(data)))
    # negative log-likelihood
    log_lik = -lambd * n + sum_x * np.log(lambd) - sum_ln_x_factorial
    neg_log_lik = -log_lik
    return neg_log_lik
Then use the minimize()
function from scipy.optimize
with method='L-BFGS-B'
to find the arg min.
from scipy.optimize import minimize
poisson_mle = minimize(poisson_negloglik, np.array([1]), method='L-BFGS-B')
poisson_mle
fun: 443.5523349036321
hess_inv: <1x1 LbfgsInvHessProduct with dtype=float64>
jac: array([0.])
message: 'CONVERGENCE: NORM_OF_PROJECTED_GRADIENT_<=_PGTOL'
nfev: 34
nit: 14
njev: 17
status: 0
success: True
x: array([49.4445145])
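To pull the estimate out of the result object (a small follow-up sketch; comparing it with np.mean(data) is only a sanity check), read the x attribute:

lambda_hat = poisson_mle.x[0]  #MLE found by the optimizer, ~49.4445
print(lambda_hat)
print(np.mean(data))           #49.4444..., the closed-form MLE (sample mean)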