From this lecture by Nando de Freitas.
For a Bernoulli distribution the conjugate prior is the beta distribution, as derived here.
Consider now a categorical distribution, where a single experiment can take one of \(K\) possible values.
We are given \(n\) data points from such a distribution, \(x_{1:n}=\{x_1,\dots,x_n\},\) where each point \(x_i\) can assume one of \(K=3\) classes, one-hot encoded as \(\{1 = (100),\ 2 = (010),\ 3 = (001)\},\) each with likelihood
\[p(x_i|\theta)=\prod_{j=1}^K \theta_j^{\mathbb I(x_{ij}=1)}.\] In other words, if the probability of \((100)\) is \(\theta_1,\) the probability of \((010)\) is \(\theta_2,\) etc., then the indicator \(\mathbb I(x_{i2}=1)=1\) says that the sample \(x_i\) belongs to class 2, in which case its probability is \(\theta_2.\)
For the entire sample, the likelihood will be
\[p(x_{1:n}\vert \theta)=\prod_{i=1}^n \prod_{j=1}^K \theta_j^{\mathbb I(x_{ij}=1)}.\]
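As a quick numerical check (a minimal sketch in Python with NumPy, using made-up one-hot data `X` and probabilities `theta`), the product of indicators simply picks out the probability of the observed class for each sample:

```python
import numpy as np

# Hypothetical example: n = 5 one-hot encoded samples with K = 3 classes,
# matching the (100)/(010)/(001) encoding above.
X = np.array([
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 0, 1],
    [1, 0, 0],
])
theta = np.array([0.5, 0.3, 0.2])  # assumed class probabilities, summing to 1

# p(x_i | theta) = prod_j theta_j^{I(x_ij = 1)}; for one-hot rows this just
# selects the theta of the observed class.
per_sample = np.prod(theta ** X, axis=1)

# p(x_{1:n} | theta) = product over all samples
likelihood = per_sample.prod()
print(per_sample)   # [0.5 0.3 0.3 0.2 0.5]
print(likelihood)   # 0.0045
```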
The generalization of the beta distribution that provides a prior for this categorical distribution is the Dirichlet distribution, with pdf defined as
\[\mathrm{Dir}(\mathbf \theta\vert \mathbf \alpha) := \frac{1}{B(\mathbf \alpha)}\prod_{k=1}^K\, \theta_{k}^{\alpha_k -1}\]
defined on the probability simplex, i.e. the set of vectors such that \(0\leq \theta_k \leq 1\) and \(\displaystyle \sum_{k=1}^K \theta_k =1.\)
\(B(\alpha_1,\dots,\alpha_K)\) is the generalization of the beta function to \(K\) variables:
\[B(\mathbf \alpha) := \frac{\prod_{k=1}^K \Gamma(\alpha_k)}{\Gamma(\alpha_0)}\]
where \(\alpha_0:=\sum_{k=1}^K \alpha_k.\)
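A small sketch (assuming Python with NumPy/SciPy and made-up values for \(\mathbf\alpha\) and \(\mathbf\theta\)) evaluating the pdf directly from these definitions and cross-checking against `scipy.stats.dirichlet`:

```python
import numpy as np
from scipy.special import gamma
from scipy.stats import dirichlet

alpha = np.array([2.0, 3.0, 4.0])   # assumed concentration parameters
theta = np.array([0.2, 0.3, 0.5])   # a point on the probability simplex

# B(alpha) = prod_k Gamma(alpha_k) / Gamma(alpha_0), with alpha_0 = sum_k alpha_k
B = gamma(alpha).prod() / gamma(alpha.sum())

# Dir(theta | alpha) = (1 / B(alpha)) * prod_k theta_k^{alpha_k - 1}
pdf_manual = np.prod(theta ** (alpha - 1)) / B

# Cross-check against SciPy's implementation
pdf_scipy = dirichlet.pdf(theta, alpha)
print(pdf_manual, pdf_scipy)  # the two values should agree
```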
For instance, in the beta prior for a coin toss the kernel has the form \(\theta_1^{\alpha_1-1}(1-\theta_1)^{\alpha_2 -1}=\theta_1^{\alpha_1-1}\theta_2^{\alpha_2 -1},\) with \(\theta_2\) the probability of tails and \(\theta_1 + \theta_2 =1,\) while for a die with \(K = 6\) sides the Dirichlet kernel is \(\theta_1^{\alpha_1 - 1}\dots \theta_K^{\alpha_K-1}=\theta_1^{\alpha_1 - 1}\dots (1 -\theta_1-\dots-\theta_5)^{\alpha_6 -1}.\)
In the case of the beta-Bernoulli conjugate pair we derive the posterior (with \(N_1\) heads and \(N_2\) tails observed) as:
\[\begin{align} p(\theta|\mathcal D) &\propto p(\mathcal D | \theta)\,p(\theta)\\[2ex] &\propto \underset{\text{Bern likelihood}}{\big[\theta^{N_1} (1-\theta)^{N_2} \big]}\underset{\text{Beta}}{\big[\theta^{\alpha_1 - 1}(1-\theta)^{\alpha_2-1} \big]} \end{align}\]
whereas in the case of a categorical distribution (dice):
\[\begin{align} p(\mathbf\theta|X_{1:n}) &\propto p(X_{1:n} | \mathbf\theta)\,p(\mathbf\theta)\\[2ex] &\propto \underbrace{\prod_{i=1}^n \prod_{j=1}^K \theta_j^{\mathbb I(x_{ij}=1)}}_{\text{cat'l likelihood}} \underbrace{\prod_{j=1}^K \theta_j^{\alpha_j -1}}_{\text{Dir}}\\[2ex] &=\prod_{j=1}^K\theta_j^{N_j}\prod_{j=1}^K \theta_j^{\alpha_j-1}\\[2ex] &=\prod_{j=1}^K\theta_j^{(N_j +\alpha_j)-1} \end{align}\]
where \(N_j := \sum_{i=1}^n \mathbb I(x_{ij}=1)\) counts the observations in class \(j.\) Hence, the posterior is again a Dirichlet, \(\mathrm{Dir}(\mathbf\theta \vert \alpha_1 + N_1, \dots, \alpha_K + N_K).\)
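A minimal sketch of this conjugate update (assuming Python with NumPy and made-up die-roll counts): the posterior parameters are just the prior pseudo-counts plus the observed counts.

```python
import numpy as np

# Hypothetical die-roll data: counts N_j for each of the K = 6 faces
N = np.array([3, 5, 2, 4, 6, 0])
alpha_prior = np.ones(6)          # assumed uniform Dir(1, ..., 1) prior

# Conjugacy: posterior is Dir(theta | alpha_prior + N)
alpha_post = alpha_prior + N

# Posterior mean E[theta_j] = (alpha_j + N_j) / (alpha_0 + n)
posterior_mean = alpha_post / alpha_post.sum()
print(alpha_post)        # [4. 6. 3. 5. 7. 1.]
print(posterior_mean)

# Draw samples from the posterior, e.g. for Monte Carlo estimates
samples = np.random.default_rng(0).dirichlet(alpha_post, size=1000)
```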
NOTE: These are tentative notes on different topics for personal use - expect mistakes and misunderstandings.