import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
import statsmodels as sm
import empiricaldist as emd
import scipy.stats as st
import utils as ut
import warnings
warnings.filterwarnings('ignore')
In daily life, “probability” and “likelihood” are used interchangeably, but they don’t refer to the same thing. I believe the most intuitive way to distinguish the two is to think about the information you have while answering a question:
- Probability questions start with a known model and unknown data. If someone asks you about the chance of seeing 6 heads in 10 coin tosses, you don’t have any observations. But you have the model: you know the process that generates “heads” and “tails” and its parameter value (usually expressed as theta, θ): 0.5.
- Likelihood questions start with known data and an unknown model. Let’s say you did 10 coin tosses and ended up with 9 heads: how likely were you to observe 9 heads if the coin was fair? In this situation, the data is known and the question refers to the underlying data-generating process (i.e., to the parameter).
We can take this approach further: let the parameter values vary and calculate the likelihood of the data under each value of the parameter(s), as in the sketch below.
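To make the distinction concrete, here is a minimal sketch of both questions using scipy (the grid of candidate θ values is just an illustrative choice):

import numpy as np
import scipy.stats as st

# Probability question: known model (theta = 0.5), unknown data.
# What is the chance of seeing exactly 6 heads in 10 fair coin tosses?
st.binom.pmf(6, n=10, p=0.5)           # ~0.205

# Likelihood question: known data (9 heads in 10 tosses), unknown parameter.
# Evaluate the same pmf at the observed data for a grid of candidate thetas:
thetas = np.linspace(0, 1, 101)
likelihoods = st.binom.pmf(9, n=10, p=thetas)
thetas[np.argmax(likelihoods)]         # 0.9 maximizes the likelihood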
possum = pd.read_csv('possum.csv', usecols=['pop', 'tail_l'])
df = possum.copy()
df.head()
|   | pop | tail_l |
|---|-----|--------|
| 0 | Vic | 36.0 |
| 1 | Vic | 36.5 |
| 2 | Vic | 39.0 |
| 3 | Vic | 38.0 |
| 4 | Vic | 36.0 |
We’ll use the possum dataset, where each row corresponds to the features of a possum from Australia or New Guinea. Let’s attempt to predict which region a possum is from as a function of its tail length. (You can get the data from here.)
I assume basic knowledge of logistic regression, which we’ll use to model the relationship between our predictor and the binary target variable.
qs_cept = np.linspace(-3, 3, 241)   # possible values for the intercept (beta_0)
qs_slope = np.linspace(-1, 1, 81)   # possible values for the slope (beta_1)
print(qs_cept[:5], '\n', qs_slope[:5])
[-3. -2.975 -2.95 -2.925 -2.9 ]
[-1. -0.975 -0.95 -0.925 -0.9 ]
df_likeli = pd.DataFrame(index=range(len(qs_cept)), columns=range(len(qs_slope)))
df_likeli.index = qs_cept
df_likeli.columns = qs_slope
df_likeli.fillna(1, inplace=True)
df_likeli.head()
|   | -1.000 | -0.975 | -0.950 | -0.925 | -0.900 | -0.875 | -0.850 | -0.825 | -0.800 | -0.775 | ... | 0.775 | 0.800 | 0.825 | 0.850 | 0.875 | 0.900 | 0.925 | 0.950 | 0.975 | 1.000 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| -3.000 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| -2.975 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| -2.950 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| -2.925 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
| -2.900 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | ... | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 | 1 |
5 rows × 81 columns
So, what I’ve done is create a grid of equally spaced values for both the intercept and the slope. I put them into a dataframe where the columns represent possible values for the slope and the index represents possible values for the intercept.
Now, I ask this question: for each pair of intercept and slope, b0 and b1, how likely am I to see the observed data?
df['pop'] = df['pop'].apply(lambda x: 1 if x == 'Vic' else 0)
# Centering the data
offset = df['tail_l'].mean().round()
df['x'] = df['tail_l'] - offset
df['y'] = df['pop']
# I refer to our predictor as x from now on (for convenience); y becomes our target variable, which is 1 if the possum is from the Victoria region and 0 otherwise.
agg_data = df.groupby('x')['y'].agg(['sum', 'count'])
agg_data.head(10).T
x | -5.0 | -3.5 | -3.0 | -2.5 | -2.0 | -1.5 | -1.0 | -0.5 | 0.0 | 0.5 |
---|---|---|---|---|---|---|---|---|---|---|
sum | 2 | 1 | 4 | 2 | 6 | 6 | 9 | 4 | 4 | 1 |
count | 2 | 1 | 5 | 2 | 9 | 7 | 13 | 12 | 6 | 4 |
ns = agg_data['count']  # number of observations with the corresponding x value
ks = agg_data['sum']    # number of successes, i.e. possums from the Victoria region
People get confused, rightfully, when they hear logistic regression described as a linear model, since all they see is an S-shaped function in the classic graph where the x-axis represents the predictor and the y-axis represents the probability. But that squiggle is the result of transforming log odds into probabilities. You may have deduced that already if you have taken a look at the equation:
\(\displaystyle \log_e\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}\)
However, we know that log odds can be converted back into probabilities, so we’ll make that transformation.
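Solving the equation above for \(p_i\) gives the S-shaped curve directly:

\(\displaystyle p_i = \frac{e^{\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}}}{1 + e^{\beta_0 + \beta_1 x_{1i} + \dots + \beta_k x_{ki}}}\)

which is exactly what the code below does step by step: exponentiate the log odds to get odds, then divide the odds by one plus the odds.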
For illustration, let’s select an intercept and a slope:
cept = 0
slope = 1
x_values = agg_data.index
log_odds = cept + slope * x_values
odds = np.exp(log_odds)
ps = odds / (odds + 1)
ps[:6]  # probabilities coming from the model
Index([0.006692850924284856, 0.029312230751356316, 0.04742587317756679,
0.07585818002124356, 0.11920292202211755, 0.18242552380635632],
dtype='float64', name='x')
How likely am I to observe \(k\) successes in \(n\) trials with probability \(p_i\), where \(p_i\) comes from the model above?
# How likely am I to observe k successes in n trials with p_i, where p_i comes from the model
likelis = st.binom.pmf(ks, ns, ps)
likelis[:6]
array([4.47942535e-05, 2.93122308e-02, 2.40951774e-05, 5.75446348e-03,
1.64675234e-04, 2.10930303e-04])
# Taking the product over the whole dataset
likelis.prod()
6.58621661515704e-55
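In formula form, the two cells above compute the likelihood of the whole grouped dataset under the chosen pair \((\beta_0, \beta_1) = (0, 1)\):

\(\displaystyle L(\beta_0, \beta_1) = \prod_{i} \binom{n_i}{k_i}\, p_i^{k_i} (1 - p_i)^{\,n_i - k_i}, \qquad p_i = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_i)}}\)

where \(k_i\) and \(n_i\) are the entries of ks and ns for each distinct centered tail length \(x_i\).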
Let’s take a look at the fit:
plt.figure(figsize=(6,4))

plt.plot(x_values+offset, ps, label='randomly selected model', color='C1')
plt.scatter(df['x']+offset, df['y'], s=30, label='data')

ut.decorate()
This was for only one pair of intercept and slope, and it seems like we can do better. We calculated the likelihood here for educational purposes, but in general, likelihoods by themselves do not mean much (I’m choosing my words carefully here). Let’s try other possible pairs for our parameters and compare the likelihoods of each, taking the pair that maximizes the likelihood of the observed data.
from scipy.special import expit

likelihood = df_likeli.copy()

for cept in likelihood.index:
    for slope in likelihood.columns:
        probs = expit(cept + slope * x_values)   # transform log odds to probabilities
        likelis = st.binom.pmf(ks, ns, probs)    # likelihood of each observation
        likelihood.loc[likelihood.index == cept, slope] = likelis.prod()  # likelihood of the whole data under this pair of parameter values
row, col = likelihood.stack().idxmax()
print(row, col)
-0.32499999999999973 -0.7
This is the pair that maximizes the likelihood.
cept = row
slope = col

log_odds = cept + slope * x_values
odds = np.exp(log_odds)
ps = odds / (odds + 1)
plt.figure(figsize=(6,4))

plt.plot(x_values+offset, ps, label='model', color='C1')
plt.scatter(df['x']+offset, df['y'], s=30, label='data')

ut.decorate()
Let’s check our parameter estimates with the help of statsmodels:
import statsmodels.formula.api as smf

formula = 'y ~ x'
results = smf.logit(formula, data=df).fit(disp=False)
results.params
Intercept -0.320498
x -0.699641
dtype: float64
We didn’t get those exact values since we used grid approximation, and the grid that I created didn’t include those exact values. Let’s note a few details here:
- As the parameter space gets larger, the computation becomes slower, so there’s a trade-off between precision and time.
- We only had two parameters to estimate, which is fine for the grid approach. However, once you have more than three, the number of possible parameter combinations gets incredibly large, which makes the grid approach infeasible. (Even our small grid already has 241 × 81 = 19,521 combinations; a third parameter at similar resolution would push that into the millions.)
- Another thing is related to the number of observations when calculating the likelihood of the whole dataset: when you take the product of the likelihoods over many observations, you may run into an issue called underflow, where the computer has trouble multiplying lots of values close to zero. Hence, the log likelihood comes to the rescue: taking logarithms moves values near zero away from zero and turns the product into a sum, which avoids the underflow problem (see the sketch after this list).
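For example, here is a minimal sketch of the log-likelihood version of the calculation above, reusing ks, ns, x_values, and the best grid pair (cept, slope); scipy exposes the log of the binomial pmf directly as binom.logpmf:

# Log-likelihood version of the whole-data calculation
probs = expit(cept + slope * x_values)

log_likelis = st.binom.logpmf(ks, ns, probs)  # log of each per-row likelihood
log_likelis.sum()                             # sum of logs instead of a product of tiny numbers

# Same quantity computed the underflow-prone way, for comparison:
np.log(st.binom.pmf(ks, ns, probs).prod())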
I hope this example made likelihoods more intuitive. There’s more to know about them: Are they probability densities? Are they probability masses? Or are they both? That’s a topic for a whole different post; things can get messy in a second. However, it’s important to understand likelihoods in the context above, since they can be used to update our beliefs about parameters, which is what we do with Bayesian methods.
As usual, have a nice weekend :)