```python
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
import empiricaldist as emd
import scipy.stats as st

pd.set_option('display.max_columns', None)
pd.set_option('display.width', 170)
pd.set_option('display.max_rows', None)
pd.set_option('display.float_format', lambda x: '%.3f' % x)
```
The following example is taken from Allen Downey’s Think Bayes. I believe it is a great one for showing how to update priors.
Downey takes a match between France and Croatia, played in the 2018 World Cup, which France won 4-2. He then aims to answer two questions:
- How confident can we be that France is the better team?
- In a rematch, what is the probability that France would win again?
I’ll only attempt to answer the first question, so that you have a reason to check Downey’s book Think Bayes.
Starting out with certain assumptions (a quick simulation check follows the list):
- There is a goal-scoring rate for every team, in goals per game (more specifically, per 90 minutes), which we will denote as lambda.
- A goal is equally likely during any minute of the game, regardless of strategy, tempo, etc. (This is also an assumption behind a huge framework in basketball analytics, regularized adjusted plus-minus; I wanted to point that out just in case you follow my basketball analytics content.)
- It isn’t possible for a team to score more than once during a single minute.
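These assumptions amount to 90 independent per-minute chances to score, i.e. a binomial count, which the Poisson distribution approximates closely when the per-minute probability is small. A minimal simulation sketch (my addition, not from the book):

```python
rng = np.random.default_rng(42)  # arbitrary seed, for reproducibility

p_goal = 1.4 / 90  # per-minute scoring probability implied by 1.4 goals per 90
sim_goals = rng.binomial(n=90, p=p_goal, size=100_000)  # goals in 100k simulated games

print(sim_goals.mean())         # close to 1.4
print((sim_goals == 4).mean())  # close to the Poisson pmf at 4, computed below (~0.039)
```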
```python
lam = 1.4               # lambda, goal scoring rate (i.e. goals per game)
dist = st.poisson(lam)  # poisson dist. with lambda = 1.4

# probability of scoring "k" goals (4 in this case)
k = 4
dist.pmf(k).round(3)    # pmf evaluated at 4
```
```
0.039
```
So, there’s a 3.9% chance of observing 4 goals under the model above.
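For reference, `dist.pmf(k)` is just evaluating the Poisson formula P(k | lambda) = lambda^k * e^(-lambda) / k!, which we can verify by hand:

```python
from math import exp, factorial

lam, k = 1.4, 4
round(lam ** k * exp(-lam) / factorial(k), 3)  # 0.039, same as dist.pmf(4)
```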
```python
lam = 1.4              # goal scoring rate
goals = np.arange(10)  # possible values for goals
probas = st.poisson(lam).pmf(goals)

probas
```
```
array([2.46596964e-01, 3.45235750e-01, 2.41665025e-01, 1.12777012e-01,
       3.94719540e-02, 1.10521471e-02, 2.57883433e-03, 5.15766866e-04,
       9.02592015e-05, 1.40403202e-05])
```
```python
fig = plt.figure(figsize=(8, 6))
g = sns.barplot(x=goals, y=probas)
g.set_xlabel('Goals')
g.set_ylabel('PMF')
```
```
Text(0, 0.5, 'PMF')
```
Let’s now move in the other direction: estimate the goal-scoring rate from the goals we observe.
Downey uses data from previous World Cups to estimate that each team scores approximately 1.4 goals per game. Hence, it is reasonable to give lambda a mean of 1.4.
The goal-scoring rate is continuous and can’t take values below 0, so a distribution that reflects these features would be ideal: the gamma distribution. Additionally, it’s easy to construct one, since (with SciPy’s default scale of 1) it takes a single shape parameter, alpha, which equals the mean, and we already have a value for that.
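A quick check (my addition) that this one-parameter gamma behaves as described, with mean equal to alpha and no density below zero:

```python
print(st.gamma(1.4).mean())   # 1.4 -- with the default scale of 1, mean == alpha
print(st.gamma(1.4).pdf(-1))  # 0.0 -- no density below zero, as required
```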
```python
alpha = 1.4                     # mean of the distribution
lams = np.linspace(0, 10, 101)  # possible values of lam between 0 and 10
ps = st.gamma(alpha).pdf(lams)  # probability densities

prior = emd.Pmf(ps, lams)
# Pmf offers a "normalize" method, which divides by the total probability
# of the data (i.e., the probability under any parameter/hypothesis)
prior.normalize()
```
```
9.889360237140306
```
```python
df_prior = pd.DataFrame(prior.ps, prior.qs).rename(columns={0: 'probas'})
```
```python
fig = plt.figure(figsize=(8, 6))
g = sns.lineplot(x=df_prior.index, y=df_prior.probas, color='orange', linestyle='--', linewidth=2)
g.set_title('Prior Distribution')
g.set_ylabel('PMF')
g.set_xlabel('Goals')
```
```
Text(0.5, 0, 'Goals')
```
```python
np.sum(df_prior.index * df_prior.probas)  # mean of the distribution
```
```
1.4140818156118378
```
Updating with new data
So, our initial belief about France’s goal-scoring rate (and, in this example, every other team’s) was 1.4 goals per 90 minutes. Then we observed France score 4 goals. Should we still think France’s goal-scoring rate is 1.4? If not, by how much should it change?
```python
lambdas = lams  # different lambdas (different goal scoring rates)
k = 4           # observed data

# for each lambda (for each parameter), how likely we are to see 4 goals
likelihood = st.poisson(lambdas).pmf(k)
likelihood[:4]
```
```
array([0.00000000e+00, 3.77015591e-06, 5.45820502e-05, 2.50026149e-04])
```
```python
df_prior['likelihood'] = likelihood
df_prior.head(4)
```

|       | probas | likelihood |
| ----- | ------ | ---------- |
| 0.000 | 0.000  | 0.000      |
| 0.100 | 0.041  | 0.000      |
| 0.200 | 0.049  | 0.000      |
| 0.300 | 0.052  | 0.000      |
```python
p_norm = emd.Pmf(df_prior['probas'] * df_prior['likelihood'], df_prior.index)
p_norm.normalize()
```
```
0.05015532557804499
```
```python
df_prior['posterior'] = p_norm
```
```python
fig = plt.figure(figsize=(8, 6))
g = sns.lineplot(x=df_prior.index, y=df_prior.posterior, color='blue', linestyle='--')
g.set_xlabel('Goal scoring rate')
g.set_ylabel('PMF')
g.set_title('France')
```
```
Text(0.5, 1.0, 'France')
```
Same steps for Croatia:
```python
df_crotia = df_prior[['probas']].copy()
df_crotia.head(3)
```

|       | probas |
| ----- | ------ |
| 0.000 | 0.000  |
| 0.100 | 0.041  |
| 0.200 | 0.049  |
```python
k = 2  # Croatia's number of goals in the match
likelihood_cro = st.poisson(lams).pmf(2)
likelihood_cro[:4]
```
```
array([0.        , 0.00452419, 0.01637462, 0.03333682])
```
```python
df_crotia['likelihood'] = likelihood_cro

p_norm = emd.Pmf(df_crotia['probas'] * df_crotia['likelihood'], df_crotia.index)
p_norm.normalize()
```
```
0.1609321178598705
```
```python
df_crotia['posterior'] = p_norm
```
```python
fig = plt.figure(figsize=(8, 6))
g = sns.lineplot(x=df_crotia.index, y=df_crotia['posterior'], color='red', linestyle='--')
g.set_xlabel('Goal scoring rate')
g.set_ylabel('PMF')
g.set_title('Croatia')
```
```
Text(0.5, 1.0, 'Croatia')
```
How confident can we be that France is the better team?
```python
print(
    'Mean of France: ', str(np.sum(df_prior.index * df_prior.posterior).round(1)), '\n'
    'Mean of Croatia: ', str(np.sum(df_crotia.index * df_crotia.posterior).round(1))
)
```
```
Mean of France:  2.7 
Mean of Croatia:  1.7
```
```python
fig = plt.figure(figsize=(8, 6))
g = sns.lineplot(x=df_crotia.index, y=df_crotia['posterior'], color='red', label='Croatia')
g = sns.lineplot(x=df_prior.index, y=df_prior['posterior'], color='blue', label='France')
g.set_xlabel('Goal scoring rate')  # the x axis is lambda, not a goal count
g.set_ylabel('PMF')
g.set_title('France vs Croatia')
```
```
Text(0.5, 1.0, 'France vs Croatia')
```
Probability of superiority is one way to express an effect size. Here’s a great visualization tool you can play with to make effect sizes more intuitive: Interpreting Effect Sizes: An Interactive Visualization
```python
# Probability of superiority: sum the joint probability of every pair of
# grid values (i, j) where France's lambda i exceeds Croatia's lambda j.
france_proba = 0

for i in df_prior.index.tolist():
    for j in df_crotia.index.tolist():
        if i > j:
            france_proba += df_prior.loc[i, 'posterior'] * df_crotia.loc[j, 'posterior']

france_proba.round(2)
```
```
0.75
```
Probability of superiority feels very intuitive: it is the probability that a randomly sampled lambda for France is higher than a randomly sampled lambda for Croatia. If there isn’t much overlap between the two distributions, we are more confident that one group (in this example, a team) is higher or lower than the other, and probability of superiority reflects that. You can use any tool (Python, R, etc.) to simulate this, and I highly suggest it if you can’t yet wrap your head around the idea. Just generate different distributions and play with them.
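Here is one way to run that simulation, a quick sketch using NumPy’s default generator: draw lambdas from each posterior grid and count how often France’s draw is higher.

```python
rng = np.random.default_rng(0)  # arbitrary seed

# sample goal-scoring rates from each posterior (grid values weighted by posterior probability)
france_lams = rng.choice(df_prior.index.to_numpy(), size=100_000, p=df_prior['posterior'].to_numpy())
crotia_lams = rng.choice(df_crotia.index.to_numpy(), size=100_000, p=df_crotia['posterior'].to_numpy())

(france_lams > crotia_lams).mean()  # close to 0.75
```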
I didn’t include it here, but there is one more reason we used the Poisson and gamma distributions; it is related to something called conjugate priors, which I first came across while going through Daniel Lakens’ Improving Your Statistical Inferences: Bayesian statistics. The gamma distribution is the conjugate prior of the Poisson likelihood, meaning the posterior is also a gamma distribution with a closed-form update, which makes things computationally more feasible than the grid approach used here.
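To sketch what that closed form looks like (my addition, not code from the book): with a Gamma(alpha, rate = 1) prior and k observed goals in one game, the posterior is Gamma(alpha + k, rate = 2), so no grid is needed:

```python
alpha, rate = 1.4, 1.0  # our prior: shape 1.4, rate 1 (i.e., scale 1)

# a Poisson observation of k goals in one game updates shape -> alpha + k, rate -> rate + 1
post_france = st.gamma(a=alpha + 4, scale=1 / (rate + 1))
post_crotia = st.gamma(a=alpha + 2, scale=1 / (rate + 1))

print(post_france.mean(), post_crotia.mean())  # 2.7 and 1.7, matching the grid results above
```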
Examples (like this one) from many books can be found on my GitHub repo, and I highly suggest going through Allen Downey’s books. Have a nice weekend.