### DIRICHLET DISTRIBUTION:

For a Bernoulli distribution the conjugate prior is the beta distribution, as derived here.

For a categorical distribution, where there is a single experiment with $$K$$ possible values.

We are given $$n$$ data points from such distribution, $$x_{i:n}=\{x_i,\dots,x_n\}$$ were each point $$x_i$$ can assume $$K=3$$ classes, $$\{1 = (100), 2= (010), 3=(001)\},$$ each with likelihood

$p(x_i|\theta)=\prod_{j=1}^k \theta_j^{\mathbb I(x_{ij}=1)}$ In other words if the probability of $$(100)$$ is $$\theta_{j=1},$$ and the probability of $$(010)$$ is $$\theta_{j=2},$$ etc., the probability that the sample $$x_i$$ is $$(010)$$ is $$\theta_2,$$ and $$\mathbb I(x_{i2}=1)=1$$ indicates that the sample $$x_i=2,$$ in which case its probability will be $$\theta_{2}.$$

For the entire sample, the likelihood will be

$p(x_{1:n}\vert \theta)=\prod_{i=1}^n \prod_{j=1}^k \theta_j^{\mathbb I(x_{ij}=1)}.$

the generalization of the beta distribution that provides a prior for this categorical distribution is the Dirichlet distribution with the pdf defined as

$\mathrm{Dir}(\mathbf \theta\vert \mathbf \alpha) := \frac{1}{B(\mathbf \alpha)}\prod_{k=1}^K\, \theta_{k=1}^{\alpha_k -1}$

defined on the probability simplex, i.e. the set of vectors such that $$0\leq \theta_k \leq 1,$$ and $$\displaystyle \sum_{k=1}^K \theta_k =1.$$

$$B(\alpha_1,\dots,\alpha_K)$$ is the generalization of the beta function to $$K$$ variables:

$B(\mathbf \alpha) := \frac{\prod_{i=1}^K \Gamma(\alpha_i)}{\Gamma(\alpha_0)}$

where $$\alpha_0:=\sum_{k=1}^K \alpha_k.$$

For instance, in the Beta prior of a coin toss, the form is $$\theta^{\alpha_1-1}(1-\theta)^{\alpha_2 -1}=\theta^{\alpha_1-1}\theta_2^{\alpha_2 -1},$$ with $$\theta_2$$ being the probability of tails, and $$\theta_1 + \theta_2 =1,$$ while in the case of a die of K = 6 sides the Dirichlet will be $$\theta_1^{\alpha_1 - 1}\dots \theta_K^{\alpha_K-1}=\theta_1^{\alpha_1 - 1}\dots (1 -\theta_1-\dots-\theta_5)^{\alpha_6 -1}.$$

In the case of a beta-Bernoulli conjugate we derive the posterior as:

\begin{align} p(\theta|\mathcal D) &\propto p(\mathcal D | \theta)\,p(\theta)\\[2ex] &\propto \underset{\text{Bern likelihood}}{\big[\theta^{N_1} (1-\theta)^{N_2} \big]}\underset{\text{Beta}}{\big[\theta^{\alpha_1 - 1}(1-\theta)^{\alpha_2-1} \big]} \end{align}

whereas in the case of a categorical distribution (dice):

\begin{align} p(\mathbf\theta|X_{1:n}) &\propto p(X_{1:n} | \mathbf\theta)\,p(\mathbf\theta)\\[2ex] &\propto \underbrace{\prod_{i=1}^n \prod_{j=1}^k \theta_j^{\mathbb I(x_{ij}=1)}}_{\text{cat'l likelihood}} \underbrace{\prod_{j=1}^K \theta_j^{\alpha_j -1}}_{\text{Dir}}\\[2ex] &=\prod_{j=1}^K\theta_j^{N_j}\prod_{j=1}^K \theta_j^{\alpha_j-1}\\[2ex] &=\prod_{j=1}^K\theta^{(N_j +\alpha_j)-1} \end{align}

Hence, the posterior is Dirichlet.