Jonathan Weisberg

How Scientific is Scientific Polarization?

Tue, 10 Nov 2020 00:00:00 -0500

As Joe Biden cleared 270 last week, some people remarked on how different the narrative would’ve been had the votes been counted in a different order:

It's staggering to think about how differently PA would be viewed/covered right now if the EDay/mail ballots were being counted in the opposite order.
— Dave Wasserman (@Redistrict) November 5, 2020

The idea that order shouldn’t affect your final take is a classic criterion of rationality. Whatever order the evidence comes in, your final opinion should be the same if it’s the same total evidence in the end.

This post is about how O’Connor & Weatherall’s model of “scientific polarization” runs afoul of this constraint. In their model, divergent opinions arise from a shared body of evidence, despite everyone involved being rational. Just how rational is what we’re considering here.

The Model

Here’s the quick version of O’Connor & Weatherall’s model. You can find the gory details in my previous post, but we won’t need them here.

A community of medical doctors is faced with a novel treatment for some condition. Currently, patients with this condition have a .5 chance of recovering. The new treatment either increases that chance or decreases it. In actual fact it increases the chance of recovery, but our doctors don’t know that yet.

Some doctors start out more skeptical of the new treatment, others more optimistic. Those with credence > .5 try the new treatment on their patients, and share the results with the others. Everybody then updates their credence in the new treatment. The cycle of experimentation, sharing, and updating then repeats.

Crucially though, our doctors don’t fully trust one another’s results. If a doctor has a very different opinion about the new treatment than her colleague, she won’t fully trust that colleague’s data. She may even discount them entirely, if their credences differ enough.

As things develop, this medical community is apt to split. Some doctors learn the truth about the new treatment’s superiority, while others remain skeptical and even come to completely disregard the results reported by their colleagues. This won’t always happen, but it’s the likely outcome given certain assumptions. Crucial for us here: doctors must discount one another’s data entirely when their credences differ significantly, by .5 let’s say, just for concreteness.

The Problem

This way of evaluating evidence depends on the order. Here’s an extreme example to make the point vivid.

Suppose Dr. Hibbert has credence .501 in the new treatment’s benefits, and his colleagues Nick and Zoidberg are both at 1.0. Nick and Zoidberg each have a report to share with Hibbert, containing bad news about the new treatment. Nick found that it failed in all but 1 of his 10 patients, while Zoidberg found that it failed in all 10 of his. Whose report should Dr. Hibbert update on first?

If he listens to Nick first, he’ll fall below .5 and ignore Zoidberg’s report as a result. His difference of opinion with Zoidberg will be so large that Hibbert will come to discount him entirely. But if he listens to Zoidberg first, he’ll ignore Nick then, for the same reason.

So Hibbert can only really listen to one of them. And since their reports are different, he’ll end up with different credences depending on who he listens to. Zoidberg’s report is slightly more discouraging. So Hibbert will end up more skeptical of the new treatment if he listens to Zoidberg first, than if he listens to Nick first.
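Here is a minimal sketch of the Hibbert example in Python. It is not O’Connor & Weatherall’s actual update rule: it assumes a simplified all-or-nothing version in which a report is either taken at face value (ordinary conditionalization) or ignored outright once credences differ by .5 or more. The names `conditionalize` and `hear_out` are made up for illustration, and so is the value `eps = 0.001` for how much the treatment shifts the recovery chance.

```python
def conditionalize(credence, successes, trials, eps=0.001):
    """Plain Bayesian update on `successes` recoveries out of `trials` patients.

    Assumed hypotheses: the treatment's success chance is .5 + eps (good)
    or .5 - eps (bad); `credence` is the probability of "good".
    """
    p_good, p_bad = 0.5 + eps, 0.5 - eps
    like_good = p_good**successes * (1 - p_good) ** (trials - successes)
    like_bad = p_bad**successes * (1 - p_bad) ** (trials - successes)
    return credence * like_good / (credence * like_good + (1 - credence) * like_bad)


def hear_out(credence, reports, cutoff=0.5):
    """Process reports in order, ignoring sources `cutoff` or more away in credence."""
    for source_credence, successes, trials in reports:
        if abs(credence - source_credence) < cutoff:
            credence = conditionalize(credence, successes, trials)
    return credence


nick = (1.0, 1, 10)      # credence 1.0, reports 1 success in 10 patients
zoidberg = (1.0, 0, 10)  # credence 1.0, reports 0 successes in 10 patients

nick_first = hear_out(0.501, [nick, zoidberg])
zoidberg_first = hear_out(0.501, [zoidberg, nick])
print(nick_first, zoidberg_first)
```

Under these assumptions both orders leave Hibbert below .5, but at different credences, because in each case only the first report is ever used.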

Can It Be Fixed?

This problem isn’t an artifact of the particulars of O’Connor & Weatherall’s model. It’s in the nature of the project. Any polarization model of the same, broad kind must have the same bug.

Polarization happens because skeptical agents come to ignore their optimistic colleagues at some point. Otherwise, skeptics would eventually be drawn to the truth. As long as they’re still willing to give some credence to the experimental results, they’ll eventually see that those results favour optimism about the new treatment.

But even if our agents never ignored one another completely, we’d still have this problem. Suppose all three of our characters have the same credence. And Nick has one success to report where Zoidberg has one failure. Intuitively, once Hibbert hears them both out, he should end up right back where he started, no matter who he listens to first.

But if he listens to Nick first, his credence will move away from Zoidberg’s. So when he gets to Zoidberg’s report it’ll carry less weight than Nick’s did. He’ll end up more confident than he started. Whereas he’ll end up less confident if he proceeds in reverse order.
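The same point can be checked numerically. The sketch below uses the discounting formula from the earlier “Mistrust & Polarization” post, $P’(E) = 1 - \min(1, d \cdot m) \cdot P(\neg E)$, together with Jeffrey conditionalization. Everything else is an assumption made for illustration: the starting credence of .6, `eps = 0.1`, `m = 1.5`, single-patient reports, and the simplification that Nick’s and Zoidberg’s own credences stay put while Hibbert updates.

```python
def jeffrey_update(credence, source_credence, success, eps=0.1, m=1.5):
    """Update on a one-patient report, discounted by difference of opinion."""
    p_good, p_bad = 0.5 + eps, 0.5 - eps
    # Likelihood of the reported outcome E under each hypothesis.
    like_good = p_good if success else 1 - p_good
    like_bad = p_bad if success else 1 - p_bad
    p_e = credence * like_good + (1 - credence) * like_bad
    h_given_e = credence * like_good / p_e
    h_given_not_e = credence * (1 - like_good) / (1 - p_e)
    # Mistrust: the further apart the credences, the lower P'(E).
    d = abs(credence - source_credence)
    new_p_e = 1 - min(1, d * m) * (1 - p_e)
    return h_given_e * new_p_e + h_given_not_e * (1 - new_p_e)


start = 0.6  # shared starting credence (assumed)
nick_first = jeffrey_update(jeffrey_update(start, start, True), start, False)
zoidberg_first = jeffrey_update(jeffrey_update(start, start, False), start, True)
print(zoidberg_first, start, nick_first)
```

With these numbers Hibbert ends up above .6 if he hears Nick’s success first, and below .6 if he hears Zoidberg’s failure first.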

Does It Matter?

It seems like polarization can’t be fully scientific if it’s driven by mistrust based on difference of opinion. But that doesn’t make the model worthless, or even uninteresting. O’Connor & Weatherall are already clear that their agents aren’t meant to be “rational with a capital ‘R’” anyway.

Quite plausibly, real people behave something like the agents in this model a lot of the time. The model might be capturing a very real phenomenon, even if it’s an irrational one. We just have to take the “scientific” in “scientific polarization” with the right amount of salt.

Mistrust & Polarization

Mon, 09 Nov 2020 00:00:00 -0500

This is post 3 of 3 on simulated epistemic networks (code here).

The first post introduced a simple model of collective inquiry. Agents experiment with a new treatment and share their data, then update on all data as if it were their own. But what if they mistrust one another?

It’s natural to have less than full faith in those whose opinions differ from your own. They seem to have gone astray somewhere, after all. And even if not, their views may have illicitly influenced their research.

So maybe our agents won’t take the data shared by others at face value. Maybe they’ll discount it, especially when the source’s viewpoint differs greatly from their own. O’Connor & Weatherall (O&W) explore this possibility, and find that it can lead to polarization.

Polarization

Until now, our communities always reached a consensus. Now though, some agents in the community may conclude the novel treatment is superior, while others abandon it, and even ignore the results of their peers using the new treatment.

In the example animated below, agents in blue have credence >.5 so they experiment with the new treatment, sharing the results with everyone. Agents in green have credence ≤.5 but are still persuadable. They still trust the blue agents enough to update on their results—though they discount these results more the greater their difference of opinion with the agent who generated them. Finally, red agents ignore results entirely. They’re so far from all the blue agents that they don’t trust them at all.

Fig. 1. Example of polarization in the O'Connor–Weatherall model

In this simulation, we reach a point where there are no more green agents, only unpersuadable skeptics in red and highly confident believers in blue. And the blues have become so confident, they’re unlikely to ever move close enough to any of the reds to get their ear. So we’ve reached a stable state of polarization.

How often does such polarization occur? It depends on the size of the community, and on the “rate of mistrust,” $m$. Details on this parameter are below, but it’s basically the rate at which difference of opinion increases discounting. The larger $m$ is, the more a given difference in our opinions will cause you to discount data I share with you.

Here’s how these two factors affect the probability of polarization. (Note: we’re considering only complete networks here.)

Fig. 2. Probability of polarization depends on community size and rate of mistrust.

So the more agents are inclined to mistrust one another, the more likely they are to end up polarized. No surprise there. But larger communities are also more disposed to polarize. Why?

As O&W explain, the more agents there are, the more likely it is that strong skeptics will be present at the start of inquiry: agents with credence well below .5. These agents will tend to ignore the reports of the optimists experimenting with the new treatment. So they anchor a skeptical segment of the population.

The mistrust multiplier $m$ is essential for polarization to happen in this model. There’s no polarization unless $m > 1$. So let’s see the details of how $m$ works.

Jeffrey Updating

The more our agents differ in their beliefs, the less they’ll trust each other. When Dr. Nick reports evidence $E$ to Dr. Hibbert, Hibbert won’t simply conditionalize on $E$ to get his new credence $P’(H) = P(H \mathbin{\mid} E)$. Instead he’ll take a weighted average of $P(H \mathbin{\mid} E)$ and $P(H \mathbin{\mid} \neg E)$. In other words, he’ll use Jeffrey conditionalization: $$ P’(H) = P(H \mathbin{\mid} E) P’(E) + P(H \mathbin{\mid} \neg E) P’(\neg E). $$ But to apply this formula we need to know the value for $P’(E)$. We need to know how believable Hibbert finds $E$ when Nick reports it.

O&W note two factors that should affect $P’(E)$.

The more Nick’s opinion differs from Hibbert’s, the less Hibbert will trust him. So we want $P’(E)$ to decrease with the absolute difference between Hibbert’s credence in $H$ and Nick’s. Call this absolute difference $d$.
We also want $P’(E)$ to decrease with $P(\neg E)$. Nick’s report of $E$ has to work against Hibbert’s skepticism about $E$ to make $P’(E)$ high.

A natural proposal then is that $P’(E)$ should decrease with the product $d \cdot P(\neg E)$, which suggests $1 - d \cdot P(\neg E)$ as our formula. When $d = 1$ this would mean Hibbert ignores Nick’s report: $P’(E) = 1 - P(\neg E) = P(E)$. And when they are simpatico, $d = 0$, Hibbert will trust Nick fully and just conditionalize on his report, since then $P’(E) = 1$.

This is fine from a formal point of view, but it means that Hibbert will basically never ignore Nick’s testimony completely. There is zero chance of $d = 1$ ever happening in our models.

So, to explore models where agents fully discount one another’s testimony, we introduce the mistrust multiplier, $m \geq 0$. This makes our final formula: $$P’(E) = 1 - \min(1, d \cdot m) \cdot P(\neg E).$$ The $\min$ is there to prevent negative values. When $d \cdot m > 1$, we just replace it with $1$ so that $P’(E) = P(E)$. Here’s what this function looks like for one example, where $m = 1.5$ and $P(E) = .6$:

Fig. 3. Posterior of the evidence $P'(E)$ when $m = 1.5$ and $P(E) = .6$

Note the kink, the point after which agents just ignore one another’s data.
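As a quick check on the formula, here it is as a Python function. The name `posterior_of_evidence` is just a label chosen for this sketch; the formula itself is the one stated above.

```python
def posterior_of_evidence(d, m, p_e):
    """P'(E) = 1 - min(1, d*m) * P(not-E), for difference of opinion d."""
    return 1 - min(1, d * m) * (1 - p_e)


# The example from the text: m = 1.5 and P(E) = .6.
for d in (0.0, 0.2, 0.4, 0.6, 0.8, 1.0):
    print(d, posterior_of_evidence(d, m=1.5, p_e=0.6))
```

The values fall linearly from 1 at $d = 0$ and flatten at $P(E) = .6$ once $d \cdot m$ reaches 1, which is the kink.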

O&W also consider models where the line doesn’t flatten, but keeps going down. In that case agents don’t ignore one another, but rather “anti-update.” They take a report of $E$ as a reason to decrease their credence in $E$. This too results in polarization, more frequently and with greater severity, in fact.

Discussion

Polarization only happens when $m > 1$. Only then do some agents mistrust their colleagues enough to fully discount their reports. If this never happened, they would eventually be drawn to the truth (however slowly) by the data coming from their more optimistic colleagues.
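To spell out where the threshold comes from, using only the formula for $P’(E)$ above: Hibbert fully discounts Nick’s report exactly when $\min(1, d \cdot m) = 1$, that is, when $$ d \cdot m \geq 1 \iff d \geq \frac{1}{m}. $$ Credences lie in $[0, 1]$, so $d \leq 1$, and $d = 1$ essentially never occurs. So a difference $d < 1$ can trigger full discounting only if $1/m < 1$, which is to say only if $m > 1$. With $m = 1.5$, for instance, reports get ignored once $d \geq 2/3$.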

So is $m > 1$ a plausible assumption? I think it can be. People can be so unreliable that their reports aren’t believable at all. In some cases a report can even decrease the believability of the proposition reported. Some sources are known for their fabrications.

Ultimately it comes down to whether $P(E \,\vert\, R_E) > P(E)$, i.e. whether someone reporting $E$ increases the probability of $E$. Nothing in principle stops this association from being present, absent, or reversed. It’s an empirical matter of what one knows about the source of the report.

How Robust is the Zollman Effect?

Mon, 02 Nov 2020 00:00:00 -0500

This is the second in a trio of posts on simulated epistemic networks.

This post summarizes some key ideas from Rosenstock, Bruner, and O’Connor’s paper on the Zollman effect, and reproduces some of their results in Python. As always you can grab the code from GitHub.

Last time we met the Zollman effect: sharing experimental results in a scientific community can actually hurt its chances of arriving at the truth. Bad luck can generate misleading results, discouraging inquiry into superior options. By limiting the sharing of results, we can increase the chance that alternatives will be explored long enough for their superiority to emerge.

But is this effect likely to have a big impact on actual research communities? Or is it rare enough, or small enough, that we shouldn’t really worry about it?

Easy Like Sunday Morning

Last time we saw the Zollman effect can be substantial. The chance of success increased from .89 to .97 when 10 researchers went from full sharing to sharing with just two neighbours (from complete to cycle).

Fig. 1. The Zollman effect: less connected networks can have a better chance of discovering the truth

But that was assuming the novel alternative is only slightly better: .501 chance of success instead of .5, a difference of .001. We’d be less likely to get misleading results if the difference were .01, or .1. It should be easier to see the new treatment’s superiority in the data then.

So RBO (Rosenstock, Bruner, and O’Connor) rerun the simulations with different values for ϵ, the increase in probability of success afforded by the new treatment. Last time we held ϵ fixed at .001, now we’ll let it vary up to .1. We’ll only consider a complete network vs. a wheel this time, and we’ll hold the number of agents fixed at 10. The number of trials each round continues to be 1,000.

Fig. 2. The Zollman effect vanishes as the difference in efficacy between the two treatments increases

Here the Zollman effect shrinks as ϵ grows. In fact it’s only visible up to about .025 in our simulations.

More Trials, Fewer Tribulations

Something similar can happen as we increase n, the number of trials each researcher performs. Last time we held n fixed at 1,000, now let’s have it vary from 10 up to 10,000. We’ll stick to 10 agents again, although this time we’ll set ϵ to .01 instead of .001.

Fig. 3. The Zollman effect vanishes as the number of trials per iteration increases

Again the Zollman effect fades, this time as the parameter n increases.

The emerging theme is that the easier the epistemic problem is, the smaller the Zollman effect. Before, we made the problem easier by making the novel treatment more effective. Now we’re making things easier by giving our agents more data. These are both ways of making the superiority of the novel treatment easier to see. The easier it is to discern two alternatives, the less our agents need to worry about inquiry being prematurely shut down by the misfortune of misleading data.

Agent Smith

Last time we saw that the Zollman effect seemed to grow as our network grew, from 3 up to 10 agents. But RBO note that the effect reverses after a while. Let’s return to n = 1,000 trials and ϵ = .001, so that we’re dealing with a hard problem again. And let’s see what happens as the number of agents grows from 3 up to 100.

Fig. 4. The Zollman effect eventually shrinks as the number of agents increases

The effect grows from 3 agents up to around 10. But then it starts to shrink again, narrowing to a meagre .01 at 100 agents.

What’s happening here? As RBO explain, in the complete network a larger community effectively means a larger sample size at each round. Since the researchers pool their data, a community of 50 will update on the results of 25,000 trials at each round, assuming half the community has credence > 0.5. And a community of 100 people updates on the results of 50,000 trials, etc.

As the pooled sample size increases, so does the probability it will accurately reflect the novel treatment’s superiority. The chance of the community being misled drops away.
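A rough way to see the trend is to ask how likely it is that one round of pooled data shows a success rate below .5, which would make the new treatment look worse than the old one. The sketch below is my own illustration, not a calculation from RBO. It uses a normal approximation to the binomial, assumes ϵ = .001 and 1,000 trials per experimenter, and assumes half the community is experimenting, as in the text. The function name `chance_misleading` is made up.

```python
import math


def chance_misleading(total_trials, eps=0.001):
    """Normal approximation to P(successes < total_trials / 2)."""
    p = 0.5 + eps
    mean = total_trials * p
    sd = math.sqrt(total_trials * p * (1 - p))
    z = (total_trials / 2 - mean) / sd
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


for agents in (10, 50, 100):
    pooled = (agents // 2) * 1000  # half the community experiments
    print(agents, pooled, round(chance_misleading(pooled), 3))
```

The chance of a misleading round shrinks as the pool grows, though slowly, since the problem is still a hard one at ϵ = .001.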

Conclusion

RBO conclude that the Zollman effect only afflicts epistemically “hard” problems, where it’s difficult to discern the superior alternative from the data. But that doesn’t mean it’s not an important effect. Its importance just depends on how common it is for interesting problems to be “hard.”

Do such problems crop up in actual scientific research, and if so how often? It’s difficult to say. As RBO note, the model we’ve been exploring is both artificially simple and highly idealized. So it’s unclear how often real-world problems, which tend to be messier and more complex, will follow similar patterns.

On the one hand, they argue, our confidence that the Zollman effect is important should be diminished by the fact that it’s not robust against variations in the parameters. Fragile effects are less likely to come through in messy, real-world systems. On the other hand, they point to some empirical studies where Zollman-like effects seem to crop up in the real world.

So it’s not clear. Maybe determining whether Zollman-hard problems are a real thing is itself a Zollman-hard problem?

The Zollman Effect

Wed, 28 Oct 2020 00:00:00 -0500

I’m drafting a new social epistemology section for the SEP entry on formal epistemology. It’ll focus on a series of three papers that study epistemic networks using computer simulations. This post is the first in a series of three explainers, one on each paper.

In each post I’ll summarize the main ideas and replicate some key results in Python. You can grab the final code from GitHub if you want to play along and tinker.

The Idea

More information generally means a better chance at discovering the truth, at least from an individual perspective. But not as a community, Zollman finds, at least not always. Sharing all our information with one another can make us less likely to reach the correct answer to a question we’re all investigating.

Imagine there are two treatments available for some medical condition. One treatment is old, and its efficacy is well known: it has a .5 chance of success. The other treatment is new and might be slightly better or slightly worse: a .501 chance of success, or else .499.

Some doctors are wary of the new treatment, others are more optimistic. So some try it on their patients while others stick to the old ways.

As it happens the optimists are right: the new treatment is superior (chance .501 of success). So as they gather data about the new treatment and share it with the medical community, its superiority will eventually emerge as a consensus, right? At least, if all our doctors see all the evidence and weigh it fairly?

Not necessarily. It’s possible that those trying the new treatment will hit a string of bad luck. Initial studies may get a run of less-than-stellar results, which don’t accurately reflect the new treatment’s superiority. After all, it’s only slightly better than the traditional treatment. So it might not show its mettle right away. And if it doesn’t, the optimists may abandon it before it has a chance to prove itself.

One way to mitigate this danger, it turns out, is to restrict the flow of information in the medical community. Imagine one doctor gets a run of bad luck—a string of patients who don’t do so well with the new treatment, creating the misleading impression that the new treatment is inferior. If they share this result with everyone, it’s more likely the whole community will abandon the new treatment. Whereas if they only share it with a few colleagues, others will keep trying the new treatment a while longer, hopefully giving them time to discover its superiority.

The Model

We can test this story by simulation. We’ll create a network of doctors, each with their own initial credence that the new treatment is superior. Those with credence > .5 will try the new treatment, others will stick to the old. Doctors directly connected in the network will share results with their neighbours, and everyone will update on whatever results they see using Bayes’ theorem.

We’ll consider networks of different sizes, from 3 to 10 agents. And we’ll try three different network “shapes”: complete, wheel, and cycle.

Fig. 1. Three network configurations, illustrated here with 6 agents each

These shapes vary in their connectedness. The complete network is fully connected, while the cycle is the least connected. Each doctor only confers with their two neighbours in the cycle. The wheel is in between.
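For concreteness, here is one way to build the three shapes as neighbour lists in Python. This is a sketch rather than the code from the GitHub repository, and it assumes the wheel is a cycle of all but one agent, plus a hub (agent 0) who is linked to everyone. The function names are made up for illustration.

```python
def complete(n):
    """Everyone is connected to everyone else."""
    return {i: [j for j in range(n) if j != i] for i in range(n)}


def cycle(n):
    """Each agent is connected to the agents on either side."""
    return {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}


def wheel(n):
    """Agent 0 is a hub; agents 1..n-1 form a cycle and also see the hub."""
    neighbours = {0: list(range(1, n))}
    for i in range(1, n):
        left = (i - 2) % (n - 1) + 1
        right = i % (n - 1) + 1
        neighbours[i] = sorted({0, left, right})
    return neighbours


for shape in (complete, wheel, cycle):
    links = sum(len(v) for v in shape(6).values()) // 2
    print(shape.__name__, links)
```

Counting links gives a simple measure of connectedness, with the wheel falling between the other two.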

Our conjecture is that the cycle will prove most reliable. A doctor who gets a run of bad luck—a string of misleading results—will do the least damage there. Sharing their results might discourage their two neighbours from learning the truth. But the others in the network may keep investigating, and ultimately learn the truth about the new treatment’s superiority. The wheel should be more vulnerable to accidental misinformation, however, and the complete network most vulnerable.

Nitty Gritty

Initially, each doctor is assigned a random credence that the new treatment is superior, uniformly from the [0, 1] interval.

Those with credence > .5 will then try the new treatment on 1,000 patients. The number of successes will be randomly determined, according to the binomial distribution with probability of success .501.

Each doctor then shares their results with their neighbours, and updates by Bayes’ theorem on all data available to them (their own + neighbours’). Then we do another round of experimenting, sharing, and updating, followed by another, and so on until the community reaches a consensus.

Consensus can be achieved in either of two ways. Either everyone learns the truth that the new treatment is superior: credence > .99 let’s say. Alternatively, everyone might reach credence ≤ .5 in the new treatment. Then no one experiments with it further, so it’s impossible for it to make a comeback. (The .99 cutoff is kind of arbitrary, but it’s very unlikely the truth could be “unlearned” after that point.)
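The steps above can be sketched in a few lines of Python. This is a simplified stand-in for the code on GitHub, not a copy of it. The names `update` and `simulate`, the `max_rounds` safety cap, and the dict-of-neighbours network format are all assumptions of the sketch. It uses only the standard library, so the binomial draw is done by summing individual patient outcomes.

```python
import math
import random


def update(credence, successes, trials, eps):
    """Bayes' theorem for 'success chance is .5 + eps' versus '.5 - eps'."""
    # Each success multiplies the odds by (.5+eps)/(.5-eps); each failure divides.
    log_lr = (2 * successes - trials) * math.log((0.5 + eps) / (0.5 - eps))
    # The exponent is capped so math.exp cannot overflow on extreme data.
    return credence / (credence + (1 - credence) * math.exp(min(-log_lr, 700.0)))


def simulate(neighbours, eps=0.001, trials=1000, max_rounds=10_000, rng=random):
    """Return True if all agents learn the truth, False if all give up."""
    credences = {i: rng.random() for i in neighbours}
    for _ in range(max_rounds):
        if all(c > 0.99 for c in credences.values()):
            return True
        if all(c <= 0.5 for c in credences.values()):
            return False
        # Optimists try the new treatment, whose true success chance is .5 + eps.
        results = {
            i: sum(rng.random() < 0.5 + eps for _ in range(trials))
            for i, c in credences.items() if c > 0.5
        }
        new_credences = {}
        for i, c in credences.items():
            seen = [j for j in [i, *neighbours[i]] if j in results]
            pooled = sum(results[j] for j in seen)
            new_credences[i] = update(c, pooled, trials * len(seen), eps)
        credences = new_credences
    return None  # no consensus within max_rounds


random.seed(1)
complete6 = {i: [j for j in range(6) if j != i] for i in range(6)}
# Easier settings than the post's defaults, just so the demo runs quickly.
print(simulate(complete6, eps=0.05, trials=100))
```

The defaults match the setup described above (ϵ = .001, 1,000 patients per round). Running `simulate` many times on each network and averaging the results gives an estimate of how often the community finds the truth.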

Results

Here’s what happens when we run each simulation 10,000 times. Both the shape of the network and the number of agents affect how often the community finds the truth.

Fig. 2. Probability of discovering the truth depends on network configuration and number of agents.

The less connected the network, the more likely they’ll find the truth. And a bigger community is more likely to find the truth too. Why?

Bigger, less connected networks are better insulated against misleading results. Some doctors are bound to get data that don’t reflect the true character of the new treatment once in a while. And when that happens, their misleading results risk polluting the community with misinformation, discouraging others from experimenting with the new treatment. But the more people in the network, the more likely the misleading results will be swamped by accurate, representative results from others. And the fewer people see the misleading results, the fewer people will be misled.

Here’s an animated pair of simulations to illustrate the second effect. Here I set the six scientists’ starting credences to the same, even spread in both networks: .3, .4, .5, .6, .7, and .8. I also gave them the same sequence of random data. Only the connections in the networks are different, and in this case it makes all the difference. Only the cycle learns the truth. The complete network goes dark very early, abandoning the novel treatment entirely after just 26 iterations.

Fig. 3. Two networks with identical priors encounter identical evidence, but only one discovers the truth.

What saves the cycle network is the agent who starts with .8 credence (bottom left). She starts out optimistic enough to keep going after the group encounters an initial string of dismaying results. In the complete network, however, she receives so much negative evidence early on that she gives up almost right away. Her optimism is overwhelmed by the negative findings of her many neighbours. Whereas the cycle exposes her to less of this discouraging evidence, giving her time to keep experimenting with the novel treatment, ultimately winning over her neighbours.

As Rosenstock, Bruner, and O’Connor put it: sometimes less is more, when it comes to sharing the results of scientific inquiry. But how important is this effect? How often is it present, and is it big enough to worry about in actual practice? Next time we’ll follow Rosenstock, Bruner, and O’Connor further and explore these questions.

The Beta Prior and the Lambda Continuum

Tue, 17 Dec 2019 00:00:00 -0500

In an earlier post we met the $\lambda$-continuum, a generalization of Laplace’s Rule of Succession. Here is Laplace’s rule, stated in terms of flips of a coin whose bias is unknown.

The Rule of Succession: Given $k$ heads out of $n$ flips, the probability the next flip will land heads is $$\frac{k+1}{n+2}.$$

To generalize we introduce an adjustable parameter, $\lambda$. Intuitively $\lambda$ captures how cautious we are in drawing conclusions from the observed frequency.

The $\lambda$ Continuum: Given $k$ heads out of $n$ flips, the probability the next flip will land heads is $$\frac{k + \lambda / 2}{n + \lambda}.$$

When $\lambda = 2$, this just is the Rule of Succession. When $\lambda = 0$, it becomes the “Straight Rule,” which matches the observed frequency, $k/n$. The general pattern is: the larger $\lambda$, the more flips we need to see before we tend toward the observed frequency, and away from the starting default value of $1/ 2$.$\newcommand{\p}{P}\newcommand{\given}{\mid}\newcommand{\dif}{d}$
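Here is the $\lambda$-continuum as a small Python function, using exact fractions so the special cases can be checked directly. The name `lambda_rule` is just a label for this sketch.

```python
from fractions import Fraction


def lambda_rule(k, n, lam):
    """Probability of heads on the next flip, given k heads out of n flips."""
    lam = Fraction(lam)
    return (k + lam / 2) / (n + lam)


print(lambda_rule(1, 1, 2))  # Rule of Succession: (k + 1) / (n + 2)
print(lambda_rule(3, 4, 0))  # Straight Rule: k / n
for lam in (0, 2, 4, 10, 100):
    print(lam, lambda_rule(1, 1, lam))  # drifts toward 1/2 as lambda grows
```

Note that with $\lambda = 0$ and $n = 0$ the formula divides by zero: the Straight Rule has nothing to say before any flips are observed.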

So what’s so special about $\lambda = 2$? Why did Laplace and others take a special interest in the Rule of Succession? Because it derives from the Principle of Indifference. We saw that setting $\lambda = 2$ basically amounts to assuming all possible frequencies have equal prior probability. Or that all possible biases of the coin are equally likely. The Rule of Succession thus corresponds to a uniform prior.

What about other values of $\lambda$ then? What kind of prior do they correspond to? This question has an elegant and illuminating answer, which we’ll explore here.

PDF version here

A Preview

Let’s preview the result we’ll arrive at. Because, although the core idea isn’t very technical, deriving the full result does take some noodling. It will be good to have some sense of where we’re going.

Here’s a picture of the priors that correspond to various choices of $\lambda$. The $x$-axis is the bias of the coin, the $y$-axis is the probability density.

Notice how $\lambda = 2$ is a kind of inflection point. The plot goes from being concave up to concave down. When $\lambda < 2$, the prior is U-shaped. Then, as $\lambda$ grows above $2$, we approach a normal distribution centered on $1/ 2$.

So, when $\lambda < 2$, we start out pretty sure the coin is biased, though we don’t know in which direction. When $\lambda < 2$ we’re inclined to run with the observed frequency, whatever that is. If we observe a heads on the first toss, we’ll be pretty confident the next toss will land heads too. And the lower $\lambda$ is, the more confident we’ll be about that.

Whereas $\lambda > 2$ corresponds to an inclination to think the coin fair, or at least fair-ish. So it takes a while for the observed frequency to draw us away from our initial expectation of $1/ 2$. (Unless the observed frequency is itself $1/ 2$.)

That’s the intuitive picture we’re working towards. Let’s see how to get there.

Pseudo-observations

Notice that the Rule of Succession is the same as pretending we’ve already observed one heads and one tails, and then using the Straight Rule. A $3$rd toss landing heads would give us an observed frequency of $2/3$, precisely what the Rule of Succession gives when just $1$ toss has landed heads. If $k = n = 1$, then $$ \frac{k+1}{n+2} = \frac{2}{3}. $$ So, setting $\lambda = 2$ amounts to imagining we have $2$ observations already, and then using the observed frequency as the posterior probability.

Setting $\lambda = 4$ is like pretending we have $4$ observations already. If we have $2$ heads and $2$ tails so far, then a heads on the $5$th toss would make for an observed frequency of $3/5$. And this is the posterior probability the $\lambda$-continuum dictates for a single heads when $\lambda = 4$: $$ \frac{k + \lambda/2}{n + \lambda} = \frac{1 + 4/2}{1 + 4} = \frac{3}{5}. $$ In general, even values of $\lambda > 0$ amount to pretending we’ve already observed $\lambda$ flips, evenly split between heads and tails, and then using the observed frequency as the posterior probability.

This doesn’t quite answer our question, but it’s the key idea. We know that the uniform prior distribution gives rise to the posterior probabilities dictated by $\lambda = 2$. We want to know what prior distribution corresponds to other settings of $\lambda$. We see here that, for $\lambda = 4, 6, 8, \ldots$ the relevant prior is the same as the “pseudo-posterior” we would have if we updated the uniform prior on an additional $2$ “pseudo-observations”, or $4$, or $6$, etc.

So we just need to know what these pseudo-posteriors look like, and then extend the idea beyond even values of $\lambda$.

Pseudo-posteriors

Let’s write $S_n = k$ to mean that we’ve observed $k$ heads out of $n$ flips. We’ll use $p$ for the unknown, true probability of heads on each flip. Our uniform prior distribution is $f(p) = 1$ for $0 \leq p \leq 1$. We want to know what $f(p \given S_n = k)$ looks like.

In a previous post we derived a formula for this: $$ f(p \given S_n = k) = \frac{(n+1)!}{k!(n-k)!} p^k (1-p)^{n-k}. $$ This is the posterior distribution after observing $k$ heads out of $n$ flips, assuming we start with a uniform prior which corresponds to $\lambda = 2$. So, when we set $\lambda$ to a larger even number, it’s the same as starting with $f(p) = 1$ and updating on $S_{\lambda - 2} = \lambda/2 - 1$. We subtract $2$ here because $2$ pseudo-observations were already counted in forming the uniform prior $f(p) = 1$.

Thus the prior distribution $f_\lambda$ for a positive, even value of $\lambda$ is: $$ \begin{aligned} f_\lambda(p) &= f(p \given S_{\lambda - 2} = \lambda/2 - 1)\\
&= \frac{(\lambda - 1)!}{(\lambda/2 - 1)!(\lambda/2 - 1)!} p^{\lambda/2 - 1} (1-p)^{\lambda/2 - 1}. \end{aligned} $$ This prior generates the picture we started with for $\lambda \geq 2$.

As $\lambda$ increases, we move from a uniform prior towards a normal distribution centered on $p = 1/ 2$. This makes intuitive sense: the more we accrue evenly balanced observations, the more our expectations come to resemble those for a fair coin.

So, what about odd values of $\lambda$? Or non-integer values? To generalize our treatment beyond even values, we need to generalize our formula for $f_\lambda$.

The Beta Prior

Recall our formula for $f(p \given S_n = k)$: $$ \frac{(n+1)!}{k!(n-k)!} p^k (1-p)^{n-k}. $$ This is a member of a famous family of probability densities, the beta densities. To select a member from this family, we specify two parameters $a,b > 0$ in the formula: $$ \frac{1}{B(a,b)} p^{a-1} (1-p)^{b-1}. $$ Here $B(a,b)$ is the beta function, defined: $$ B(a,b) = \int_0^1 x^{a-1} (1-x)^{b-1} \dif x. $$ We showed that, when $a$ and $b$ are natural numbers, $$ B(a,b) = \frac{(a-1)!(b-1)!}{(a+b-1)!}. $$ To generalize our treatment of $f_\lambda$ beyond whole numbers, we first need to do the same for the beta function. We need $B(a,b)$ for all positive real numbers.

As it turns out, this is a matter of generalizing the notion of factorial. The generalization we need is called the gamma function, and it looks like this:

The formal definition is $$ \Gamma(x) = \int_0^\infty u^{x-1} e^{-u} \dif u. $$ The gamma function connects to the factorial function because it has the property: $$ \Gamma(x+1) = x\Gamma(x). $$ This entails, by induction, that $\Gamma(n) = (n-1)!$ for any natural number $n$.
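Python’s standard library includes the gamma function as `math.gamma`, so these properties are easy to spot-check numerically. This is only a sanity check at a handful of values, not a proof.

```python
import math

# Gamma(n) = (n - 1)! for natural numbers n.
for n in range(1, 8):
    print(n, math.gamma(n), math.factorial(n - 1))

# The recurrence Gamma(x + 1) = x * Gamma(x) also holds off the integers.
x = 2.5
print(math.gamma(x + 1), x * math.gamma(x))

# A well-known non-integer value: Gamma(1/2) is the square root of pi.
print(math.gamma(0.5), math.sqrt(math.pi))
```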

In fact we can substitute gammas for factorials in our formula for the beta function: $$ B(a,b) = \frac{\Gamma(a)\Gamma(b)}{\Gamma(a+b)}. $$ Proving this formula would require a long digression, so we’ll take it for granted here.

This lets us work with beta densities whose parameters are not whole numbers. For any $a, b > 0$, the beta density is $$ \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} p^{a-1} (1-p)^{b-1}. $$ We can now show our main result: setting $a = b = \lambda/2$ generates the $\lambda$-continuum.
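Since we're taking the gamma formula for $B(a,b)$ on trust, a quick numerical check may be reassuring. This is only a sketch: the midpoint-rule integrator and the parameter values $a = 2.5$, $b = 1.5$ are assumptions made for the example.

```python
import math

def beta_fn(a, b):
    """B(a, b) computed from the gamma-function formula."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def beta_density(p, a, b):
    """Beta density with parameters a, b > 0, evaluated at 0 < p < 1."""
    return p ** (a - 1) * (1 - p) ** (b - 1) / beta_fn(a, b)

def midpoint_integral(f, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [0, 1]."""
    h = 1.0 / steps
    return h * sum(f((i + 0.5) * h) for i in range(steps))

a, b = 2.5, 1.5  # non-integer parameters, chosen only for illustration
total = midpoint_integral(lambda p: beta_density(p, a, b))
print(total)  # should be close to 1 if the normalizing constant is right
```

If the gamma formula gave the wrong normalizing constant, the density would not integrate to (approximately) $1$.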

From Beta to Lambda

We’ll write $X_{n+1} = 1$ to mean that toss $n+1$ lands heads. We want to show $$ \p(X_{n+1} = 1 \given S_n = k) = \frac{k + \lambda/2}{n + \lambda}, $$ given two assumptions.

The tosses are independent and identically distributed with probability $p$ for heads.
The prior distribution $f_\lambda(p)$ is a beta density with $a = b = \lambda/2$.

We start by applying the Law of Total Probability: $$ \begin{aligned} P(X_{n+1} = 1 \given S_n = k) &= \int_0^1 P(X_{n+1} = 1 \given S_n = k, p) f_\lambda(p \given S_n = k) \dif p\\
&= \int_0^1 p f_\lambda(p \given S_n = k) \dif p. \end{aligned} $$ Notice, this is the expected value of $p$, according to the posterior $f_\lambda(p \given S_n = k)$. To analyze it further, we use two facts proved below.

1. The posterior $f_\lambda(p \given S_n = k)$ is itself a beta density, but with parameters $k + \lambda/2$ and $n - k + \lambda/2$.
2. The expected value of any beta density with parameters $a$ and $b$ is $a/(a+b)$.

Thus $$ \begin{aligned} P(X_{n+1} = 1 \given S_n = k) &= \int_0^1 p f_\lambda(p \given S_n = k) \dif p \\
&= \frac{k + \lambda/2}{k + \lambda/2 + n - k + \lambda/2}\\
&= \frac{k + \lambda/2}{n + \lambda}. \end{aligned} $$ This is the desired result; we just need to establish Facts 1 and 2.
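Before proving the two facts, we can sanity-check the target formula numerically. The sketch below integrates $p$ against the (unnormalized) posterior directly and compares the result with $(k + \lambda/2)/(n + \lambda)$. The function names, the midpoint rule, and the sample values $k = 3$, $n = 10$ are all assumptions of the example.

```python
def posterior_mean(k, n, lam, steps=100_000):
    """Numerically integrate p against the posterior for a beta prior
    with a = b = lam/2.

    The posterior is proportional to p^(k + lam/2 - 1) (1-p)^(n - k + lam/2 - 1);
    the normalizing constant cancels in the ratio below.
    """
    a, b = k + lam / 2, n - k + lam / 2
    h = 1.0 / steps
    num = den = 0.0
    for i in range(steps):
        p = (i + 0.5) * h  # midpoints avoid the endpoints 0 and 1
        w = p ** (a - 1) * (1 - p) ** (b - 1)
        num += p * w
        den += w
    return num / den

def lambda_rule(k, n, lam):
    """The closed-form prediction from the lambda-continuum."""
    return (k + lam / 2) / (n + lam)

for lam in (1, 2, 3.5):
    print(lam, posterior_mean(3, 10, lam), lambda_rule(3, 10, lam))
```

The two columns should agree to many decimal places, including for the odd and non-integer values of $\lambda$.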

Fact 1

Here we show that, if $f(p)$ is a beta density with parameters $a$ and $b$, then $f(p \given S_n = k)$ is a beta density with parameters $k+a$ and $n - k + b$.

Suppose $f(p)$ is a beta density with parameters $a$ and $b$: $$ f(p) = \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1}. $$ We calculate $f(p \given S_n = k)$ using Bayes’ theorem: \begin{align} f(p \given S_n = k) &= \frac{f(p) P(S_n = k \given p)}{P(S_n = k)}\\
&= \frac{p^{a-1} (1-p)^{b-1} \binom{n}{k} p^k (1-p)^{n-k}}{B(a,b) P(S_n = k)}\\
&= \frac{\binom{n}{k}}{B(a,b) \p(S_n = k)} p^{k+a-1} (1-p)^{n-k+b-1} .\tag{1} \end{align} To analyze $\p(S_n = k)$, we begin with the Law of Total Probability: $$ \begin{aligned} P(S_n = k) &= \int_0^1 P(S_n = k \given p) f(p) \dif p\\
&= \int_0^1 \binom{n}{k} p^k (1-p)^{n-k} \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1} \dif p\\
&= \frac{\binom{n}{k}}{B(a, b)} \int_0^1 p^{a+k-1} (1-p)^{b+n-k-1} \dif p\\
&= \frac{\binom{n}{k}}{B(a, b)} B(k+a, n-k+b). \end{aligned} $$ Substituting back into Equation (1), we get: $$ f(p \given S_n = k) = \frac{1}{B(k+a, n-k+b)} p^{k+a-1} (1-p)^{n-k+b-1}. $$ So $f(p \given S_n = k)$ is the beta density with parameters $k + a$ and $n - k + b$.
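One way to check the formula for $P(S_n = k)$ is to confirm that it sums to $1$ over $k = 0, \ldots, n$. Here is a small sketch; the names `prob_k_heads` and `posterior_params`, and the choice $a = b = 1/ 2$, are illustrative assumptions.

```python
import math

def beta_fn(a, b):
    """B(a, b) via the gamma-function formula."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)

def prob_k_heads(k, n, a, b):
    """P(S_n = k) under a beta prior with parameters a, b."""
    return math.comb(n, k) * beta_fn(k + a, n - k + b) / beta_fn(a, b)

def posterior_params(k, n, a, b):
    """Parameters of the posterior beta density after k heads in n tosses."""
    return (k + a, n - k + b)

n, a, b = 12, 0.5, 0.5
total = sum(prob_k_heads(k, n, a, b) for k in range(n + 1))
print(total)                         # should be close to 1
print(posterior_params(4, n, a, b))  # (4.5, 8.5)
```

With the uniform prior ($a = b = 1$) the same function gives $1/(n+1)$ for every $k$.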

Fact 2

Here we show that the expected value of a beta density with parameters $a$ and $b$ is $a/(a+b)$. The expected value formula gives: $$ \frac{1}{B(a, b)} \int_0^1 p p^{a-1} (1-p)^{b-1} \dif p\\
= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \int_0^1 p^a (1-p)^{b-1} \dif p. $$ The integrand looks like a beta density, with parameters $a+1$ and $b$. So we multiply by $1$ in a form that allows us to pair it with the corresponding normalizing constant: $$ \begin{aligned} \begin{split} \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} & \int_0^1 p^a (1-p)^{b-1} \dif p \\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + 1)\Gamma(b)} {\Gamma(a + b + 1)}\int_0^1 \frac{\Gamma(a + b + 1)}{\Gamma(a + 1)\Gamma(b)} p^a (1-p)^{b-1} \dif p\\
&= \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{\Gamma(a + 1)\Gamma(b)} {\Gamma(a + b + 1)}. \end{split} \end{aligned} $$ Finally, we use the property $\Gamma(a+1) = a \Gamma(a)$ to obtain: $$ \frac{\Gamma(a + b)}{\Gamma(a)\Gamma(b)} \frac{a\Gamma(a)\Gamma(b)} {(a+b) \Gamma(a + b)} = \frac{a} {a+b}. $$
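The last simplification can be spot-checked by evaluating the ratio of gamma functions directly. A brief sketch, with a hypothetical function name and arbitrary sample parameters:

```python
import math

def beta_mean_via_gamma(a, b):
    """Expected value written as the ratio of gamma functions from the derivation."""
    return (math.gamma(a + b) / (math.gamma(a) * math.gamma(b))
            * math.gamma(a + 1) * math.gamma(b) / math.gamma(a + b + 1))

for a, b in [(0.5, 0.5), (2, 5), (3.7, 1.2)]:
    print(a, b, beta_mean_via_gamma(a, b), a / (a + b))
```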

Picturing It

What do our priors corresponding to $\lambda < 2$ look like? Above we saw that they’re U-shaped, approaching a flat line as $\lambda$ increases. Here’s a closer look:

We can also look at odd values $\lambda \geq 2$ now, where the pattern is the same as we observed previously.

What About Zero?

What about when $\lambda = 0$? This is a permissible value on the $\lambda$-continuum, giving rise to the Straight Rule as we’ve noted. But it doesn’t correspond to any beta density. The parameters would be $a = b = \lambda/2 = 0$. Whereas we require $a, b > 0$, since the integral $$ \int_0^1 p^{-1}(1-p)^{-1} \dif p $$ diverges.

In fact no prior can agree with the Straight Rule. At least, not on the standard axioms of probability. The Straight Rule requires $\p(HH \given H) = 1$, which entails $\p(HT \given H) = 0$. By the usual definition of conditional probability then, $\p(HT) = 0$. Which means $\p(HTT \given HT)$ is undefined. Yet the Straight Rule says $\p(HTT \given HT) = 1/ 2$.

We can accommodate the Straight Rule by switching to a nonstandard axiom system, where conditional probabilities are primitive, rather than being defined as ratios of unconditional probabilities. This approach is sometimes called “Popper–Rényi” style probability.

Alternatively, we can stick with the standard, Kolmogorov system and instead permit “improper” priors: prior distributions that don’t integrate to $1$, but which deliver posteriors that do.

Taking this approach, the beta density with $a = b = 0$ is called the Haldane prior. It’s sometimes regarded as “informationless,” since its posteriors just follow the observed frequencies. But other priors, like the uniform prior, also have some claim to representing perfect ignorance. The Jeffreys prior, which is obtained by setting $a = b = 1/ 2$ (so $\lambda = 1$), is another prior with a similar claim.

That multiple priors can make this claim is a reminder of one of the great tragedies of epistemology: the problem of priors.

Acknowledgments

I’m grateful to Boris Babic for reminding me of the beta-lambda connection. For more on beta densities I recommend the videos at stat110.net.

Belief in Psyontology

Tue, 10 Dec 2019 21:59:04 -0500

Which is more fundamental, full belief or partial belief? I argue that neither is, ontologically speaking. A survey of some relevant cognitive psychology supports a dualist ontology instead. Beliefs come in two kinds, categorical and graded, with neither kind more fundamental than the other. In particular, the graded kind is no more fundamental. When we discuss belief in on/off terms, we are not speaking coarsely or informally about states that are ultimately credal.

Could've Thought Otherwise

Tue, 10 Dec 2019 21:58:08 -0500

Evidence is univocal, not equivocal. Its implications don’t depend on our beliefs or values, the evidence says what it says. But that doesn’t mean there’s no room for rational disagreement between people with the same evidence. Evaluating evidence is a lot like polling an electorate: getting an accurate reading requires a bit of luck, and even the best pollsters are bound to get slightly different results. So even though evidence is univocal, rationality’s requirements are not “unique”. Understanding this resolves several puzzles to do with uniqueness and disagreement.

Laplace's Rule of Succession

Tue, 10 Dec 2019 00:00:00 -0500

The Rule of Succession gives a simple formula for “enumerative induction”: reasoning from observed instances to unobserved ones. If you’ve observed 8 ravens and they’ve all been black, how certain should you be the next raven you see will also be black? According to the Rule of Succession, 90%. In general, the probability is $(k+1)/(n+2)$ that the next observation will be positive, given $k$ positive observations out of $n$ total.
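As a tiny illustration, the rule is a one-liner in code. The sketch below uses exact fractions; the function name is just a label for this example.

```python
from fractions import Fraction

def rule_of_succession(k, n):
    """Probability the next observation is positive, given k positives out of n."""
    if not 0 <= k <= n:
        raise ValueError("need 0 <= k <= n")
    return Fraction(k + 1, n + 2)

print(rule_of_succession(8, 8))  # the raven example: 9/10
print(rule_of_succession(0, 0))  # no observations yet: 1/2
```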

When does the Rule of Succession apply, and why is it $(k+1)/(n+2)$? Laplace first derived a special case of the rule in 1774, using certain assumptions. The same assumptions also allow us to derive the general rule, and following the derivation through answers both questions. $\newcommand{\p}{P}\newcommand{\given}{\mid}\newcommand{\dif}{d}$

PDF version here

As motivation, imagine we’re drawing randomly, with replacement, from an urn of marbles some proportion $p$ of which are black. Strictly speaking, $p$ must be a rational number in this setup. But formally, we’ll suppose $p$ can be any real number in the unit interval.

If we have no idea what $p$ is, it’s natural to start with a uniform prior over its possible values. Formally, $p$ is a random variable with a uniform density on the $[0,1]$ interval. Each draw induces another random variable, $$ X_i = \begin{cases} 1 & \text{ if the $i^\text{th}$ draw is black},\newline 0 & \text{ otherwise}. \end{cases} $$ We’ll define one last random variable $S_n$, which counts the black draws: $$ S_n = X_1 + \ldots + X_n . $$ Laplace’s assumptions are then as follows.

Each $X_i$ has the same chance $p$ of being $1$.
That chance is independent of whatever values the other $X_j$’s take.
The prior distribution over $p$ is uniform: $f(p) = 1$ for $0 \leq p \leq 1$.

Given these assumptions, the Rule of Succession follows: $$ \p(X_{n+1} = 1 \given S_n = k) = \frac{k+1}{n+2}. $$ We’ll start by deriving this result for the special case where all observations are positive, so that $k = n$.

Laplace’s Special Case

When $k = n$, the Rule of Succession says: $$ \p(X_{n+1} = 1 \given S_n = n) = \frac{n+1}{n+2}. $$ To derive this result, we start with the Law of Total Probability. \begin{align} \p(X_{n+1} = 1 \given S_n = n) &= \int_0^1 \p(X_{n+1} = 1 \given S_n = n, p) f(p \given S_n = n) \dif p\newline &= \int_0^1 \p(X_{n+1} = 1 \given p) f(p \given S_n = n) \dif p\newline &= \int_0^1 p \, f(p \given S_n = n) \dif p. \tag{1} \end{align} To finish the calculation, we need to compute $f(p \given S_n = n)$. We need to know how observing $n$ out of $n$ black marbles changes the probability density over $p$.

For this we turn to Bayes’ theorem. $$ \begin{aligned} f(p \given S_n = n) &= \frac{ f(p) \p(S_n = n \given p) }{ \p(S_n = n) }\newline &= \frac{ \p(S_n = n \given p) }{ \p(S_n = n) }\newline &= \frac{ p^n }{ \p(S_n = n) }\newline &= c p^n. \end{aligned} $$ Here $c$ is an as-yet unknown constant: the inverse of $\p(S_n = n)$, whatever that is. To find $c$, first observe by calculus that: $$ \int_0^1 c p^n \dif p = \left. \left(\frac{c p^{n+1}}{n+1}\right) \right|_0^1 = \frac{c}{n+1}. $$ Then observe that this quantity must equal $1$, since we’ve integrated $f(p \given S_n = n)$, a probability density. Thus $c = n + 1$, and hence $$ f(p \given S_n = n) = (n+1) p^n. $$ Returning now to finish our original calculation in Equation (1): $$ \begin{aligned} \p(X_{n+1} = 1 \given S_n = n) &= \int_0^1 p \, f(p \given S_n = n) \dif p\newline &= \int_0^1 p \, (n+1) p^n \dif p\newline &= (n+1) \int_0^1 p^{n+1} \dif p\newline &= (n+1) \left. \left(\frac{p^{n+2}}{n+2}\right) \right|_0^1\newline &= \frac{n+1}{n+2}. \end{aligned} $$ This is the Rule of Succession when $k = n$, as desired.

The General Case

The proof of the general case starts similarly. We first apply the Law of Total Probability to obtain $$ \p(X_{n+1} = 1 \given S_n = k) = \int_0^1 p \, f(p \given S_n = k) \dif p. \tag{2} $$ Then we use Bayes’ theorem to compute $f(p \given S_n = k)$. \begin{align} f(p \given S_n = k) &= \frac{ \p(S_n = k \given p) }{ \p(S_n = k) }\newline &= \frac{ \binom{n}{k} p^k (1-p)^{n-k} }{ \p(S_n = k) } \tag{3}. \end{align} Note that we used the formula for a binomial probability here to calculate the numerator $\p(S_n = k \given p)$.

Computing the denominator $\p(S_n = k)$ requires a different approach from the special case. We start with the Law of Total Probability: \begin{align} \p(S_n = k) &= \int_0^1 \p(S_n = k \given p) f(p) \dif p\newline &= \int_0^1 \p(S_n = k \given p) \dif p \newline &= \int_0^1 \binom{n}{k} p^k (1-p)^{n-k} \dif p \newline &= \binom{n}{k} \int_0^1 p^k (1-p)^{n-k} \dif p. \end{align} This leaves us facing an instance of a famous function, the “beta function,” which is defined: $$ B(a, b) = \int_0^1 x^a (1-x)^{b} \dif x. $$ In our case $a$ and $b$ are natural numbers, so $B(a,b)$ has an elegant formula, which we use now and prove later: $$ B(a, b) = \frac{a!b!}{(a + b + 1)!}. $$ For us, $a = k$ and $b = n-k$, so we have $$ \p(S_n = k) = \binom{n}{k} B(k, n-k) = \binom{n}{k} \frac{k!(n-k)!}{(n + 1)!}. $$ Substituting back into our calculation of $f(p \given S_n = k)$ in Equation (3): $$ \begin{aligned} f(p \given S_n = k) &= \frac{ \binom{n}{k} p^k (1-p)^{n-k} }{ \binom{n}{k} B(k, n-k) }\newline &= \frac{(n + 1)!}{k!(n-k)!} p^k (1-p)^{n-k} . \end{aligned} $$ Then we finish our original calculation from Equation (2): \begin{align} \p(X_{n+1} = 1 \given S_n = k) &= \int_0^1 p \frac{(n + 1)!}{k!(n-k)!} p^k (1-p)^{n-k} \dif p\newline &= \frac{(n + 1)!}{k!(n-k)!} \int_0^1 p^{k+1} (1-p)^{n-k} \dif p\newline &= \frac{(n + 1)!}{k!(n-k)!} B(k+1, n-k)\newline &= \frac{(n + 1)!}{k!(n-k)!} \frac{(k+1)!(n-k)!}{(k+1 + n-k + 1)!}\newline &= \frac{k+1}{n + 2}. \end{align} This is the Rule of Succession, as desired.
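We can also check the result by brute force: simulate the urn setup under Laplace's three assumptions and look at what happens after runs with exactly $k$ black draws. This is a rough Monte Carlo sketch, not part of the proof; the function name, seed, trial count, and the choice $n = 5$, $k = 3$ are all assumptions of the example. Note that the formula for $\p(S_n = k)$ above simplifies to $1/(n+1)$, which the simulation should also reproduce.

```python
import random

def simulate(n, k, trials=200_000, seed=0):
    """Estimate P(S_n = k) and P(X_{n+1} = 1 | S_n = k) by simulation."""
    rng = random.Random(seed)
    matched = next_black = 0
    for _ in range(trials):
        p = rng.random()  # uniform prior over the proportion of black marbles
        s = sum(rng.random() < p for _ in range(n))  # black draws among the first n
        if s == k:
            matched += 1
            next_black += rng.random() < p  # one further draw from the same urn
    return matched / trials, next_black / matched

freq_k, freq_next = simulate(n=5, k=3)
print(freq_k, 1 / 6)     # P(S_5 = 3) should be near 1/(n+1)
print(freq_next, 4 / 7)  # should be near (k+1)/(n+2)
```

The estimates are only approximate, but they should land close to $1/6$ and $4/7$ respectively.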

The Beta Function

Finally, let’s derive the formula we used for the beta function: $$ \int_0^1 x^a (1-x)^{b} \dif x = \frac{a!b!}{(a + b + 1)!}, $$ where $a$ and $b$ are natural numbers. We proceed in two steps: integration by parts, then a proof by induction.

Notice first that when $b = 0$ our integral simplifies and is straightforward: $$ \int_0^1 x^a \dif x = \frac{1}{a+1}. $$ So let’s assume $b > 0$ and pursue integration by parts. If we let $$ u = (1 - x)^b, \quad \dif v = x^a \dif x, $$ then $$ \dif u = -b (1 - x)^{b-1} \dif x, \quad v = \frac{x^{a+1}}{a+1}. $$ So $$ \begin{aligned} \int_0^1 x^a (1-x)^{b} \dif x &= \left. \left(\frac{ x^{a+1} (1 - x)^b }{ a+1 }\right) \right|_0^1 + \frac{b}{a+1} \int_0^1 x^{a+1} (1 - x)^{b-1} \dif x\newline &= \frac{b}{a+1} \int_0^1 x^{a+1} (1 - x)^{b-1} \dif x. \end{aligned} $$

Now we use this identity in an argument by induction. We already noted that when $b = 0$ we have $B(a, 0) = 1/(a+1)$. This satisfies the general formula $$ B(a, b) = \frac{a!b!}{(a+b+1)!}. $$ By induction on $b > 0$, we find the formula holds in general: $$ \begin{aligned} B(a, b) &= \int_0^1 x^a (1-x)^{b} \dif x\newline &= \frac{b}{a+1} \int_0^1 x^{a+1} (1 - x)^{b-1} \dif x\newline &= \frac{b}{a+1} B(a+1, b-1)\newline &= \frac{b}{a+1} \frac{(a+1)!(b-1)!}{(a + 1 + b - 1 + 1)!}\newline &= \frac{a!b!}{(a + b + 1)!}. \end{aligned} $$
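For small natural numbers, the formula can be confirmed exactly by a different route: expand $(1-x)^b$ with the binomial theorem and integrate term by term. The sketch below does this with exact fractions, using this post's convention for $B(a, b)$. The function names are illustrative.

```python
from fractions import Fraction
from math import comb, factorial

def beta_integral(a, b):
    """Exact integral of x^a (1-x)^b over [0, 1] for natural numbers a, b.

    Expands (1-x)^b with the binomial theorem and integrates term by term.
    """
    return sum(Fraction((-1) ** j * comb(b, j), a + j + 1) for j in range(b + 1))

def beta_closed_form(a, b):
    """The closed form a! b! / (a + b + 1)! derived above."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

for a in range(6):
    for b in range(6):
        assert beta_integral(a, b) == beta_closed_form(a, b)

print(beta_integral(3, 2))  # 1/60
```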

Acknowledgments

Our proof of the special case follows this excellent video by Joe Blitzstein. And our proof of the general case comes from Sheldon Ross’ classic textbook, A First Course in Probability, Exercise 30 on page 128 of the 7th edition.

Crash Course in Inductive Logic

Tue, 19 Nov 2019 00:00:00 -0500

There are four ways things can turn out with two flips of a coin: $$ HH, \quad HT, \quad TH, \quad TT.$$ If we know nothing about the coin’s tendencies, we might assign equal probability to each of these four possible outcomes: $$ Pr(HH) = Pr(HT) = Pr(TH) = Pr(TT) = 1/ 4. $$ But from another point of view, there are primarily three possibilities. If we ignore order, the possible outcomes are $0$ heads, $1$ head, or $2$ heads. So we might instead assign equal probability to these three outcomes, then divide the middle $1/ 3$ evenly between $HT$ and $TH$: $$ Pr(HH) = 1/3 \qquad Pr(HT) = Pr(TH) = 1/6 \qquad Pr(TT) = 1/ 3. $$

This two-stage approach may seem odd. But it’s actually friendlier from the point of view of inductive reasoning. On the first scheme, a heads on the first toss doesn’t increase the probability of another heads. It stays fixed at $1/ 2$: $$ \newcommand{\p}{Pr} \newcommand{\given}{\mid} \renewcommand{\neg}{\mathbin{\sim}} \renewcommand{\wedge}{\mathbin{\text{&}}} \p(HH \given H) = \frac{1/ 4}{1/ 4 + 1/ 4} = \frac{1}{2}. $$ Whereas it does increase on the second strategy, from $1/ 2$ to $2/ 3$: $$ \p(HH \given H) = \frac{1/ 3}{1/ 3 + 1/ 6} = \frac{2}{3}. $$ The two-stage approach thus learns from experience, where the single-step division is skeptical about induction.

This holds true as we increase the number of flips. If we do three tosses for example, we’ll find that $\p(HHH \given HH) = 3/ 4$ on the two-stage analysis. Whereas this probability stays stubbornly fixed at $1/ 2$ on the first approach. It won’t budge no matter how many heads we observe, so we can’t learn anything about the coin’s bias this way.
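The contrast between the two schemes is easy to reproduce by direct calculation. In this sketch the function names are hypothetical, and sequences are written as strings of `H` and `T`.

```python
from fractions import Fraction
from math import comb

def flat_prior(seq):
    """First scheme: every sequence of len(seq) flips is equally probable."""
    return Fraction(1, 2 ** len(seq))

def two_stage_prior(seq):
    """Second scheme: equal probability for each number of heads,
    split evenly among the orderings with that many heads."""
    n, k = len(seq), seq.count("H")
    return Fraction(1, (n + 1) * comb(n, k))

def next_heads(prior, seen):
    """P(seen followed by H | seen), working over len(seen) + 1 flips."""
    joint = prior(seen + "H")
    marginal = prior(seen + "H") + prior(seen + "T")
    return joint / marginal

for seen in ("H", "HH", "HHH"):
    print(seen, next_heads(flat_prior, seen), next_heads(two_stage_prior, seen))
```

The first scheme prints $1/ 2$ every time, while the two-stage scheme gives $2/ 3$, $3/ 4$, and $4/ 5$.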

This is the difference between Carnap’s famous account of induction, from his 1950 book Logical Foundations of Probability, and the account he finds in Wittgenstein’s Tractatus. ¹ Although Carnap had actually been scooped by W. E. Johnson, who worked out a similar analysis about $25$ years earlier.

This is a short explainer on some key elements of inductive logic worked out by Johnson and Carnap, and on the place of those ideas in the story of inductive logic.

PDF version here

States & Structures

Carnap calls a fine-grained specification like $TH$ a state-description. The coarser grained “$1$ head” is a structure-description. A state-description specifies which flips land heads, and which tails. While a structure-description specifies how many land heads and tails, without necessarily saying which.

It needn’t be coin flips landing heads or tails, of course. The same ideas apply to any set of objects or events, and any feature they might have or lack.

Suppose we have two objects $a$ and $b$, each of which might have some property $F$. Working for a moment as Carnap did, in first-order logic, here is an example of a structure-description: $$ (Fa \wedge \neg Fb) \vee (\neg Fa \wedge Fb). $$ But this isn’t a state-description, since it doesn’t specify which object has $F$. It only says how many objects have $F$, namely $1$. One of the disjuncts alone would be a state-description though: $$ Fa \wedge \neg Fb. $$

Carnap’s initial idea was that all structure-descriptions start out with the same probability. These probabilities are then divided equally among the state-descriptions that make up a structure-description.

For example, if we do three flips, there are four structure-descriptions: $0$ heads, $1$ head, $2$ heads, and $3$ heads. Some of these have only one state-description. For example, there’s only one way to get $0$ heads, namely $TTT$. So $$ \p(TTT) = 1/ 4. $$ But others have multiple state-descriptions. There are three ways to get $1$ head for example, so we divide $1/ 4$ between them: $$ \p(HTT) = \p(THT) = \p(TTH) = 1/ 12. $$

The effect is that more homogeneous sequences start out more probable. There’s only one way to get all heads, so the $HH$ state-description inherits the full probability of the corresponding “$2$ heads” structure-description. But a $50$-$50$ split has multiple permutations, each of which inherits only a portion of the same quantum of probability. A heterogeneous sequence of heads and tails thus starts out less probable than a homogeneous one.

That’s why the two-stage analysis is induction-friendly. It effectively builds Hume’s “uniformity of nature” assumption into the prior probabilities.

The Rule of Succession

The two-stage assignment also yields a very simple formula for induction: Laplace’s famous Rule of Succession. (Derivation in the Appendix.)

The Rule of Succession: Given $k$ heads out of $n$ observed flips, the probability of heads on a subsequent toss is $$\frac{k+1}{n+2}.$$

Laplace arrived at this rule about $150$ years earlier by somewhat different means. But there is a strong similarity.

Laplace supposed that our coin has some fixed, but unknown, chance $p$ of landing heads on each toss. Suppose we regard all possible values $0 \leq p \leq 1$ as equally likely.² If we then update our beliefs about the true value of $p$ using Bayes’ theorem, we arrive at the Rule of Succession. (Proving this is a bit involved. Maybe I’ll go over it another time.)

The two-stage way of assigning prior probabilities is essentially the same idea, just applied in a discrete setting. By treating all structure-descriptions as equiprobable, we make all possible frequencies of heads equiprobable. This is a discrete analogue of treating all possible values of $p$ as equiprobable.

The Continuum of Inductive Methods

Both Johnson and Carnap eventually realized that the two methods of assigning priors we’ve considered are just two points on a larger continuum.

The $\lambda$ Continuum: Given $k$ heads out of $n$ observed flips, the probability of heads on a subsequent toss is $$\frac{k + \lambda/2}{n + \lambda},$$ for some $\lambda$ in the range $0 \leq \lambda \leq \infty$.

What value should $\lambda$ take here? Notice we get the Rule of Succession if $\lambda = 2$. And we get inductive skepticism if we let $\lambda$ approach $\infty$. For then $k$ and $n$ fall away and the ratio converges to $1/ 2$, no matter what $k$ and $n$ are.

If we set $\lambda = 0$, we get a formula we haven’t discussed yet: $k/n$. Reichenbach called this the Straight Rule. (In modern statistical parlance it’s the “maximum likelihood estimate.”)³

The overall pattern is: the higher $\lambda$, the more “cautious” our inductive inferences will be. A larger $\lambda$ means less influence from $k$ and $n$: the probability of another heads stays closer to the initial value of $1/ 2$. In the extreme case where $\lambda = \infty$, it stays stuck at exactly $1/ 2$ forever.

A low value of $\lambda$, on the other hand, will make our inferences more ambitious. In the extreme case $\lambda = 0$, we jump immediately to the observed frequency. Our expectation about the next toss is just $k/n$, the frequency we’ve observed so far. If we’ve observed only one flip and it was heads ($k = n = 1$), we’ll be certain of heads on the second toss! ⁴
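Here is a minimal sketch of the $\lambda$-continuum as a function, including the two extreme cases. The name `lambda_rule` is an assumption of the example, and the $n = 0$ branch simply hard-codes the stipulated value $1/ 2$ for the Straight Rule.

```python
import math

def lambda_rule(k, n, lam):
    """Probability of heads on the next toss, given k heads in n flips."""
    if math.isinf(lam):
        return 0.5  # limiting case: the observations never matter
    if lam == 0 and n == 0:
        return 0.5  # stipulated value when there is no data yet
    return (k + lam / 2) / (n + lam)

# Runs of straight heads (k = n), under increasingly cautious settings of lambda.
for lam in (0, 1, 2, 10, math.inf):
    print(lam, [round(lambda_rule(n, n, lam), 3) for n in (1, 2, 5, 10)])
```

Reading down the rows, the predictions move away from $1$ and towards $1/ 2$ as $\lambda$ grows.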

We can illustrate this pattern in a plot. First let’s consider what happens if the coin keeps coming up heads, i.e. $k = n$. As $n$ increases, various settings of $\lambda$ behave as follows.

Now suppose the coin only lands heads every third time, so that $k \approx n/3$.

Notice how lower settings of $\lambda$ bounce around more here before settling into roughly $1/ 3$. Higher settings approach $1/ 3$ more steadily, but they take longer to get there.

Carnap’s Program

Johnson and Carnap went much further, and others since have gone further still. For example, we can include more than one predicate, we can use relational predicates, and much more.

But philosophers aren’t too big on this research program nowadays. Why not?

Choosing $\lambda$ is one issue. Once we see that it’s more than a binary choice, between inductive optimism and skepticism, it’s hard to see why we should plump for any particular value of $\lambda$. We could set $\lambda = 2$, or $\pi$, or $42$. By what criterion could we make this choice? No clear answer emerged from Carnap’s program.

Another issue is Goodman’s famous grue puzzle. Suppose we trade our coin flips for emeralds. We might replace the heads/tails dichotomy with green/not-green then. But we could instead replace it with grue/not-grue. The prescriptions of our inductive logic depend on our choice of predicate—on the underlying language to which we apply our chosen value of $\lambda$.

So the Johnson/Carnap system doesn’t provide us with rules for inductive reasoning, more a framework for formulating such rules. We have to decide which predicates should be projectible by choosing the underlying language. And then we have to decide how projectible they should be by choosing $\lambda$. Only then does the framework tell us what conclusions to draw from a given set of observations.

Personally, I still find the framework useful. It provides a lovely way to express informal ideas more rigorously. In it we can frame questions about induction, skepticism, and prior probabilities with lucidity.

I also like it as a source of toy models. For example, I might test when a given claim about induction holds and when it doesn’t, by playing with different incarnations of $\lambda$.

The framework’s utility is thus a lot like that of its deductive cousins. Compare Timothy Williamson’s use of modal logic to create models of Gettier cases, for example, or his model of improbable knowledge.

Even in deductive logic, we only get as much out as we put in. We have to choose our connectives in propositional logic, our accessibility relation in modal logic, etc. But a flexible system like possible-world frames still has its uses. We can use it to explore philosophical options and their interconnections.

Appendix: Deriving the Rule of Succession

To derive the Rule of Succession from the two-stage assignment of priors, we need two key formulas.

The prior probability of a particular sequence with $k$ heads out of $n$ flips.
The prior probability of the same initial sequence, followed by one more heads.

The first quantity is the probability of getting $k$ heads out of $n$ flips, regardless of order, divided by the number of ways to get $k$ heads out of $n$ flips. The number of ways to get $k$ heads out of $n$ flips is called the binomial coefficient. It’s written $\binom{n}{k}$, and there’s a nice formula for calculating it: $$ \binom{n}{k} = \frac{n!}{(n-k)!k!}. $$ Since a sequence of $n$ flips can feature anywhere from $0$ to $n$ heads, there are $n+1$ structure descriptions, each with probability $1/(n+1)$. Thus the probability of a specific state-description with $k$ heads out of $n$ flips is \begin{align} \frac{1}{(n+1)\binom{n}{k}} &= \frac{1}{(n+1) \frac{n!}{(n-k)!k!}}\\
&= \frac{(n-k)!k!}{(n+1)!}.\tag{1} \end{align}

The second probability we need is for the same initial sequence, but with an additional heads on the next toss. That’s a sequence with $k+1$ heads out of $n+1$ tosses. There are $n+2$ structure descriptions now, each with probability $1/(n+2)$. So the probability in question is \begin{align} \frac{1}{(n+2)\binom{n+1}{k+1}} &= \frac{1}{(n+2) \frac{(n+1)!}{(n-k)!(k+1)!}}\\
&= \frac{(n-k)!(k+1)!}{(n+2)!}.\tag{2} \end{align}

Now, to get the conditional probability we’re after, we take the ratio of the second probability $(2)$ over the first probability $(1)$: $$ \begin{aligned} \frac{ \frac{(n-k)!(k+1)!}{(n+2)!} }{ \frac{(n-k)!k!}{(n+1)!} } &= \frac{(n-k)!(k+1)!}{(n+2)!} \frac{(n+1)!}{(n-k)!k!} \\
&= \frac{k+1}{n+2}. \end{aligned} $$ This agrees with the rule of succession, as desired.
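The algebra can be double-checked with exact arithmetic. This sketch (the function name is hypothetical) confirms formulas $(1)$ and $(2)$ and their ratio for all $0 \leq k \leq n < 12$.

```python
from fractions import Fraction
from math import comb, factorial

def sequence_prior(k, n):
    """Two-stage prior of one particular sequence with k heads in n flips."""
    return Fraction(1, (n + 1) * comb(n, k))

for n in range(0, 12):
    for k in range(0, n + 1):
        first = sequence_prior(k, n)           # formula (1)
        second = sequence_prior(k + 1, n + 1)  # formula (2)
        assert first == Fraction(factorial(n - k) * factorial(k), factorial(n + 1))
        assert second == Fraction(factorial(n - k) * factorial(k + 1), factorial(n + 2))
        assert second / first == Fraction(k + 1, n + 2)

# Example: two heads in three flips, followed by another heads.
print(sequence_prior(3, 4) / sequence_prior(2, 3))  # 3/5
```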

So far though, we’ve only shown the rule of succession for a specific, observed sequence. We’ve shown that $\p(HTHH \given HTH) = 3/ 5$, for example. But what if we don’t know the particular sequence so far? Maybe we only know there were $2$ heads out of $3$ tosses. Shouldn’t we still be able to derive the same result?

We can, with the help of a relevant theorem of probability: if $\p(A \given B) = \p(A \given C)$, and $B$ and $C$ are mutually exclusive, then $$ \p(A \given B \vee C) = \p(A \given B) = \p(A \given C). $$ In our case $A$ specifies heads on flip $n+1$, while $B$ and $C$ each specify some sequence for flips $1$ through $n$. Although these sequences feature the same number of heads and tails, $B$ and $C$ specify different orderings. So they’re mutually exclusive.

We’ve already shown that $$ \p(A \given B) = \frac{k+1}{n+2} = \p(A \given C). $$ So we just have to verify the theorem: $$ \begin{aligned} \p(A \given B \vee C) &= \frac{\p(A \wedge (B \vee C))}{\p(B \vee C)}\\
&= \frac{\p(A \wedge B) + \p(A \wedge C)}{\p(B \vee C)}\\
&= \frac{\p(A \given B)\p(B) + \p(A \given C)\p( C)}{\p(B \vee C)}\\
&= \frac{\p(A \given B) \left( \p(B) + \p( C) \right)}{\p(B \vee C)}\\
&= \p(A \given B). \end{aligned} $$ By applying this formula repeatedly to a disjunction of state-descriptions, we get the conditional probability on the structure description of interest.
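To see the theorem at work, here is a worked instance with $n = 3$ and $k = 2$. The structure-description “$2$ heads in $3$ flips” is the disjunction of $HHT$, $HTH$, and $THH$. By formula $(1)$ each of these has prior probability $1/ 12$, and by formula $(2)$ each of them followed by heads has prior probability $1/ 20$. Writing $H_4$ for heads on the fourth flip (a shorthand introduced just for this example): $$ \p(H_4 \given HHT \vee HTH \vee THH) = \frac{3 \cdot \frac{1}{20}}{3 \cdot \frac{1}{12}} = \frac{3}{5}. $$ This matches $(k+1)/(n+2)$ with $k = 2$ and $n = 3$.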

¹ Carnap also cites Keynes and Peirce as endorsing the Wittgensteinian approach. But thanks to Jonathan Livengood I learned this is actually a misattribution: Keynes mistakenly attributes the view to Peirce, and Carnap seems to have followed Keynes’ error.
² More precisely: we regard them as having the same probability density, namely $1$.
³ For the $\lambda = 0$ case, we need probability axioms that permit conditioning on zero-probability events. For example, $\p(HH \given H) = 1$ so $\p(HT \given H) = 0$. Thus $\p(HT) = 0$, and $\p(HTH \given HT)$ is undefined on the usual, Kolmogorov axioms.
⁴ When $n = 0$ we have to stipulate that the probability is $1/ 2$, the limit as $\lambda \rightarrow 0$.

The Super-Humean Theory of Belief

Wed, 21 Aug 2019 00:00:00 -0500

The classic “Lockean” thesis about full and partial belief says full belief is rational iff strong partial belief is rational. Hannes Leitgeb’s “Humean” thesis proposes a subtler connection. $ \newcommand\p{Pr} \newcommand{\B}{\mathbf{B}} \newcommand{\given}{\mid} $

The Humean Thesis: For a rational agent whose full beliefs are given by the set $\mathbf{B}$, and whose credences by the probability function $\p$: $B \in \mathbf{B}$ iff $\p(B \given A) > t$ for all $A$ consistent with $\mathbf{B}$.

Notice that we can think of this as, instead, a coherentist theory of justification. Suppose we replace credence with “evidential” probability (think: Carnap, Williamson). Then we get a theory of justification where beliefs aren’t justified in isolation. It’s not enough for a belief to be highly probable in its own right, it has to be part of a larger body that underwrites that high probability.

Flipping things around, the coherentist theory of justification from my last wacky post doubles as an even wackier theory of full belief. The Humean view is roughly that a belief is justified iff its fellows secure its high probability. Now the “Super-Humean” view says a belief is justified to the extent its fellows secure its high centrality.

(Last time we explored one fun way of measuring centrality, drawing on coherentism for inspiration, and network theory for the math. But network theory offers many other ways of measuring centrality, which could be slotted in here to provide alternative theories of full and partial belief.)

Like Leitgeb’s Humean view, the Super-Humean view has a holistic character. Instead of evaluating full beliefs just by looking at your credences, we also have to look at what else you believe.

Another parallel: both theories have a permissive quality. Leitgeb presents examples where more than one set $\B$ fits with a given credence function $\p$, on the Humean view. And the same will be true on the Super-Humean view.¹

But there are interesting differences. We can evaluate beliefs individually on the Super-Humean account, even though our method of evaluation is holistic. True, a belief’s justification depends on what else you believe. But your beliefs don’t all stand or fall together; some can come out justified even though others come out unjustified.

Strictly speaking, some beliefs come out highly justified even though others come out hardly justified. Because, differing again from the Humean view, evaluations are graded on the Super-Humean view. Each belief is assigned a degree of justification.

One nice thing about the Super-Humean view, then, is that it allows for “non-ideal” theorizing. We can study non-ideal agents, and discern more justified beliefs from lesser ones.

“But does it handle the lottery and preface paradoxes?”, is the question we always ask about a theory of full belief. As is so often the case, the answer is “yes, but…”.

Consider a lottery of $100$ tickets with one to be selected at random as the winner. If you believe of each ticket that it will lose, we have a network of $101$ nodes: $L_1$ through $L_{100}$, plus the tautology node $\top$. How strong are the connections between these nodes? Assuming we take $L_3$ through $L_{100}$ as givens in determining the weight of the $L_2 \rightarrow L_1$ arrow, it gets weight $0$ since $$\p(L_1 \given L_2 \wedge L_3 \wedge \ldots \wedge L_{100}) = 0.$$ And likewise for all the other arrows,² except those pointing to the $\top$ node (they always get weight $1$). All the $L_i$ beliefs thus come out with rock-bottom justification compared to $\top$, i.e. you aren’t justified in believing these lottery propositions.

Contrast that with a preface case, where you believe each of $100$ claims you’ve researched, $C_1$ through $C_{100}$. These claims are positively correlated though, or at least independent.³ So $$\p(C_1 \given C_2 \wedge C_3 \wedge \ldots \wedge C_{100}) \approx 1,$$ and likewise for the other $C_i$. The belief-graph here is thus tightly connected, and the $C_i$ nodes will score high on centrality compared to $\top$. So you’re highly justified in your beliefs in the preface case.

So far so good, at least if you think—as I tend to—that lottery beliefs should come out unjustified, while preface beliefs should come out justified. What’s the “but…” then? I see two issues (at least).

First, we had to assume that all your remaining beliefs are taken as given in assessing the weight of a connection like $L_2 \rightarrow L_1$. That worked out well here. But as a general rule, it doesn’t always have great results, as Juan Comesaña noted about our treatment of the Tweety case last time.

We could go all the way to the other extreme of course, and just evaluate the $L_2 \rightarrow L_1$ connection in isolation by looking at $\p(L_1 \given L_2)$. But that seems too extreme, since it means ignoring the agent’s other beliefs altogether.
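To see how far apart the two extremes are, here is a minimal sketch that computes both weights for the $L_2 \rightarrow L_1$ arrow by brute enumeration. It assumes a fair lottery with exactly one winner among $100$ tickets; the helper names (`prob`, `loses`, `conj`, `given`) are mine, purely for illustration.

```python
from fractions import Fraction

N = 100  # tickets; exactly one wins, each with probability 1/N (assumed fair)

def prob(event):
    """Probability that the winning ticket w (1..N) satisfies event(w)."""
    return Fraction(sum(1 for w in range(1, N + 1) if event(w)), N)

def loses(i):
    """The proposition L_i: ticket i is not the winner."""
    return lambda w: w != i

def conj(*events):
    """Conjunction of propositions."""
    return lambda w: all(e(w) for e in events)

def given(target, condition):
    """Conditional probability p(target | condition)."""
    return prob(conj(target, condition)) / prob(condition)

# One extreme: all the other lottery beliefs L_2, ..., L_100 taken as givens.
all_others = conj(*(loses(i) for i in range(2, N + 1)))
weight_with_givens = given(loses(1), all_others)

# Other extreme: the connection assessed in isolation, conditioning on L_2 only.
weight_in_isolation = given(loses(1), loses(2))

print(weight_with_givens, weight_in_isolation)  # 0 and 98/99
```

The gap between $0$ and $98/99$ is the room any "in between" proposal would have to work in.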

What we want is something in between, it seems. We want the agent’s other beliefs to “get in the way” enough that they substantially weaken the connections in the lottery graph. But we don’t want them to be taken entirely for granted. Exactly how to achieve the right balance here is something I’m not sure about.

Second issue: what if you only adopt a few lottery beliefs, just $L_1$ and $L_2$ for example? Then we can’t exploit the “collective defeat” that drove our treatment of the lottery.

You might respond that this is a fine result, since isolated lottery beliefs are actually justified. It’s only when you apply the same logic to all the tickets that your justification is undercut. But I find this unsatisfying.

Maybe a student encountering the paradox for the first time is justified in believing their ticket will lose. But it should be enough to defeat that justification that they merely realize they could believe the same thing about all the other tickets, for identical reasons. Even if they don’t go ahead to form those beliefs, they should drop the one belief they had about their own ticket.

This is one way in which Leitgeb’s Humean theory seems superior to me. On the Humean view, which beliefs are rational depends on how the space of possibilities is partitioned (see Leitgeb 2014). And the partition is determined by the context—how the subject frames the situation in their mind. (At least, that’s how I understand Leitgeb here.) So just realizing the symmetry of the lottery paradox is enough to defeat justification, on the Humean view.

Example: imagine we’ll flip a coin of unknown bias $10$ times. And suppose the probabilities obey Laplace’s Rule of Succession (a.k.a. Carnap’s $\mathfrak{m}^*$ confirmation function). Then, if you believe each flip will land heads, your beliefs will all come out highly justified, i.e. highly central in your web of $10$ beliefs. But they’d have the same justification if you instead believed each flip will land tails.
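A quick sketch of the arithmetic behind this example, assuming each arrow's weight is the probability of its target given the other nine beliefs. Under $\mathfrak{m}^*$, each possible number of heads is equally likely, and that probability is split evenly among the sequences with that many heads. The function name `sequence_prob` is just for illustration.

```python
from fractions import Fraction
from math import comb

def sequence_prob(n, k):
    """Probability under m* of one particular sequence of n flips with k heads."""
    return Fraction(1, (n + 1) * comb(n, k))

n = 10

# Probability that nine specified flips land heads, whatever the tenth does:
# either all ten are heads, or the tenth is the lone tail.
nine_heads = sequence_prob(n, 10) + sequence_prob(n, 9)

# Weight of an arrow into "flip 10 lands heads", with the other nine as givens.
weight_heads = sequence_prob(n, 10) / nine_heads

# The mirror-image calculation for the all-tails believer.
nine_tails = sequence_prob(n, 0) + sequence_prob(n, 1)
weight_tails = sequence_prob(n, 0) / nine_tails

print(weight_heads, weight_tails)  # 10/11 and 10/11
```

Both come to $10/11$, which is just the Rule of Succession applied after nine like outcomes. The two belief sets are perfectly symmetric, which is the source of the permissivism.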

Permissivism aside, this might seem a pretty bad result on its own. Even if our theory fixed which way you should go, say heads instead of tails, that would be pretty weird. Shouldn’t you wait for at least a few flips before forming any such beliefs?

The problem is that we haven’t required your beliefs to be inherently probable, only that they render one another probable. The Lockean and Humean theories have such a threshold requirement built-in, but we can build it into our theory too. We can just stipulate that a full belief should be highly probable, as well as being highly central in the network of all your beliefs.
^[return]
More carefully, each arrow gets the minimum possible weight. If we use the “Google hack” from last time, this is some small positive number $\epsilon$ instead of $0$. ^[return]
Notice we’re borrowing the crux of Pollock’s (1994) classic treatment of the lottery and preface paradoxes. We’re just plugging his observation into a different formal framework. ^[return]

Coherentism Without Coherence

Wed, 14 Aug 2019 00:00:00 -0500

If you look at the little network diagram below, you’ll probably agree that $P$ is the most “central” node in some intuitive sense.

This post is about using a belief’s centrality in the web of belief to give a coherentist account of its justification. The more central a belief is, the more justified it is.

But how do we quantify “centrality”? The rough idea: the more ways there are to arrive at a proposition by following inferential pathways in the web of belief, the more central it is.

Since we’re coherentists today (for the next 10 minutes, anyway), cyclic pathways are allowed here. If we travel $P \rightarrow Q \rightarrow R \rightarrow P$, that counts as an inferential path leading to $P$. And if we go around that cycle twice, that counts as another such pathway.

You might think this just wrecks the whole idea. Every node has infinitely many such pathways leading to it, after all. By cycling around and around we can come up with literally any number of pathways ending at a given node.

But, by examining how these pathways differ in the limit, we can differentiate between more and less central nodes/beliefs. We can thus clarify a sense in which $P$ is most central, and quantify that centrality. We can even use that quantity to answer a classic objection to coherentism leveled by Klein & Warfield (1994).

As a bonus, we can do all this without ever giving an account of what makes a corpus of beliefs “coherent.” This flips the script on a lot of contemporary formal work on coherentism.¹ Because coherentism is holistic, you might think it has to evaluate the coherence of a whole corpus first, before it can assess the individual members.² But we’ll see this isn’t so. $$ \newcommand\T{\intercal} \newcommand{\A}{\mathbf{A}} \renewcommand{\v}{\mathbf{v}} $$

Counting Pathways

Our idea is to count how many paths there are leading to $P$ vs. other nodes. We start with paths of length $1$, then count paths of length $2$, then length $3$, and so on. As we count longer and longer paths, each node’s count approaches infinity.

But not their relative ratios! If, at each step, we divide the number of paths ending at $P$ by the number of all paths, this ratio converges.

To find its limit, we represent our graph numerically. A graph can be represented in a table, where each node corresponds to a row and column. The columns represent “sources” and the rows represent “targets.” We put a $1$ where the column node points to the row node, otherwise we put a $0$.

	$P$	$Q$	$R$
$P$	0	1	1
$Q$	1	0	0
$R$	0	1	0

Hiding the row and column names gives us a matrix we’ll call $\A$: $$ \A = \left[ \begin{matrix} 0 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \end{matrix} \right]. $$ Notice how each row records the length-$1$ paths leading to the corresponding node. There are two such paths to $P$, and one each to $Q$ and $R$.³

The key to counting longer paths is to take powers of $\A$. If we multiply $\A$ by itself to get $\A^2$, we get a record of the length-$2$ paths: $$ \A^2 = \A \times \A = \left[ \begin{matrix} 0 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \end{matrix} \right] \left[ \begin{matrix} 0 & 1 & 1 \\
1 & 0 & 0 \\
0 & 1 & 0 \end{matrix} \right] = \left[ \begin{matrix} 1 & 1 & 0 \\
0 & 1 & 1 \\
1 & 0 & 0 \end{matrix} \right]. $$ There are two such paths to $P$: $$ \begin{aligned} Q \rightarrow R \rightarrow P,\\
P \rightarrow Q \rightarrow P. \end{aligned} $$ Similarly for $Q$: $$ \begin{aligned} Q \rightarrow P \rightarrow Q,\\
R \rightarrow P \rightarrow Q. \end{aligned} $$ While $R$ has just one length-$2$ path: $$ P \rightarrow Q \rightarrow R. $$ If we go on to examine $\A^3$, its rows will tally the length-$3$ paths; in general, $\A^n$ tallies the paths of length-$n$.
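If you'd like to check these tallies by machine, here's a minimal sketch in plain Python. The helpers `matmul` and `matpow` are written out by hand so the block runs without any libraries; a numerical package would do the same job.

```python
def matmul(X, Y):
    """Product of two matrices given as lists of rows."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))]
            for i in range(len(X))]

def matpow(X, n):
    """X multiplied by itself n times (n >= 1)."""
    result = X
    for _ in range(n - 1):
        result = matmul(result, X)
    return result

# Rows are targets, columns are sources, in the order P, Q, R.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]

A2 = matpow(A, 2)
A3 = matpow(A, 3)

# Row sums count the paths of that length ending at P, Q, R respectively.
paths_len2 = [sum(row) for row in A2]
paths_len3 = [sum(row) for row in A3]

print(A2)          # [[1, 1, 0], [0, 1, 1], [1, 0, 0]]
print(paths_len2)  # [2, 2, 1]
print(paths_len3)  # [3, 2, 2]
```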

But we want relative ratios, not raw counts. The trick to getting these is to divide $\A$ at each step by a special number $\lambda$, known as the “leading eigenvalue” of $\A$ (details below). If we take the limit $$ \lim_{n \rightarrow \infty} \left(\frac{\A}{\lambda}\right)^n $$ we get a matrix whose columns all have a special property: $$ \left[ \begin{matrix} 0.41 & 0.55 & 0.31 \\
0.31 & 0.41 & 0.23 \\
0.23 & 0.31 & 0.18 \end{matrix} \right]. $$ They all have the same relative proportions. They’re multiples of the same “frequency vector,” a vector of positive values that sum to $1$: $$ \left[ \begin{matrix} 0.43 \\
0.32 \\
0.25 \\
\end{matrix} \right]. $$ So as we tally longer and longer paths, we find that $43\%$ of those paths lead to $P$, compared with $32\%$ for $Q$ and $25\%$ for $R$. Thus $P$ is about $1.3$ times as justified as $Q$ ($.43/.32$), and about $1.7$ times as justified as $R$ ($.43/.25$).
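Here's a sketch of one way to approximate the frequency vector numerically. Instead of dividing by $\lambda$ at each step, it uses plain power iteration: repeatedly apply $\A$ to a vector of path counts and renormalize so the entries sum to $1$. That's a swapped-in technique, not necessarily how the figures above were computed, but it converges to the same proportions for this graph. The function names and the choice of $200$ iterations are my own.

```python
def step(A, v):
    """Apply A to v: extend every tallied path by one more arrow."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def frequency_vector(A, iterations=200):
    """Share of long paths ending at each node, by power iteration."""
    v = [1.0] * len(A)
    for _ in range(iterations):
        v = step(A, v)
        total = sum(v)
        v = [x / total for x in v]
    return v

# Rows are targets, columns are sources, in the order P, Q, R.
A = [[0, 1, 1],
     [1, 0, 0],
     [0, 1, 0]]

freq = frequency_vector(A)
rounded = [round(x, 2) for x in freq]

# Since freq sums to 1, applying A once more and summing estimates lambda.
leading_eigenvalue = sum(step(A, freq))

print(rounded)  # [0.43, 0.32, 0.25]
```

The estimate of $\lambda$ that falls out is roughly $1.32$, so each extra step multiplies the number of paths by about that much.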

We want absolute degrees of justification though, not just comparative ones. So we borrow a trick from probability theory and use a tautology for scale.

We add a special node $\top$ to our graph, which every other node points to, though $\top$ doesn’t point back.

Updating our matrix $\A$ accordingly, we insert $\top$ in the first row/column: $$ \A = \left[ \begin{matrix} 0 & 1 & 1 & 1 \\
0 & 0 & 1 & 1 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \end{matrix} \right]. $$ Redoing our limit analysis gives us the vector $(1.00, 0.57, 0.43, 0.32)$. But this isn’t our final answer, because it’s actually not possible for the non-$\top$ nodes to get a value higher than $2/3$ in a graph with just $3$ non-$\top$ nodes.⁴ So we divide elementwise by $(1, 2/3, 2/3, 2/3)$ to scale things, giving us our final result: $$ \left[ \begin{matrix} 1.00 \\
0.85 \\
0.65 \\
0.49 \end{matrix} \right]. $$ The relative justifications are the same as before, e.g. $P$ is still $1.3$ times as justified as $Q$. But now we can make absolute assessments too. $R$ comes out looking pretty bad ($0.49$), as seems right, while $Q$ looks a bit better ($0.65$). Of course $P$ looks best ($0.85$), though maybe not quite good enough to be justified tout court.
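The scaled figures can be reproduced with a short sketch along the following lines. It assumes power iteration with the $\top$ entry pinned to $1$ at each step (my choice of method), followed by division by the $(k-1)/k$ ceiling for the non-$\top$ nodes.

```python
def step(A, v):
    """Apply A to v: extend every tallied path by one more arrow."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Rows are targets, columns are sources, in the order top, P, Q, R.
A_top = [[0, 1, 1, 1],
         [0, 0, 1, 1],
         [0, 1, 0, 0],
         [0, 0, 1, 0]]

v = [1.0] * 4
for _ in range(200):
    v = step(A_top, v)
    v = [x / v[0] for x in v]  # rescale so the tautology node sits at 1

k = 3                # number of non-top nodes
ceiling = (k - 1) / k  # highest value a non-top node can reach

justification = [round(v[0], 2)] + [round(x / ceiling, 2) for x in v[1:]]

print(justification)  # [1.0, 0.85, 0.65, 0.49]
```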

The Klein–Warfield Problem

Ok that’s theoretically nifty and all, but does it work on actual cases? Let’s try it out by looking at a notorious objection to coherentism. Klein & Warfield (1994) argue that coherentism flouts the laws of probability. How so?

Making sense of things often means believing more: taking on new beliefs to resolve the tensions in our existing ones. For example, if we think Tweety is a bird who can’t fly, the tension is resolved if we also believe they’re a penguin.⁵

But believing more means believing less probably. Increases in logical strength bring decreases in probability (unless the stronger content was already guaranteed with probability $1$). So increasing the coherence in one’s web of belief will generally mean decreasing its probability. How could increasing coherence increase justification, then?

Merricks (1995) points out that, even though the probability of the whole corpus goes down, the probabilities of individual beliefs go up in a way. After all, it’s more likely Tweety can’t fly if they’re a penguin, than if they’re just a bird of some unknown species.

That’s only the beginning of a satisfactory answer though. After all, we might not be justified in believing Tweety’s a penguin in the first place! Adding a new belief to support an existing belief doesn’t help if the new belief has no support itself. We need a more global assessment, which is where the present account shines.

Suppose we add $P$ = Tweety is a penguin to the network containing $B$ = Tweety is a bird and $\neg F$ = Tweety can’t fly. Will this increase the centrality/justification of $B$ and of $\neg F$? Yes, but we need to sort out the support relations to verify this.

Presumably $P$ supports $B$, and $\neg F$ too. But what about the other way around? If Tweety is a flightless bird, there’s a decent chance they’re a penguin. But it’s hardly certain; they might be an emu or kiwi instead. Come to think of it, isn’t support a matter of degree, so don’t we need finer tools than just on/off arrows?

Yes, and the refinement is easy. We accommodate degrees of support by attaching weights to our arrows. Instead of just placing a $1$ in our matrix $\A$ wherever the column-node points to the row-node, we put a number from the $[0,1]$ interval that reflects the strength of support. The same limit analysis as before still works, as it turns out. We just think of our inferential links as “leaky pipes” now, where weaker links make for leakier pipelines.

We still need concrete numbers to analyze the Tweety example. But it’s a toy example, so let’s just make up some plausible-ish numbers to get us going. Let’s suppose $1\%$ of birds are flightless, and birds are an even smaller percentage of the flightless things, say $0.1\%$. Let’s also pretend that $20\%$ of flightless birds are penguins.

Before believing Tweety is a penguin then, our web of belief looks like this:

Calculating the degrees of justification for $B$ and $\neg F$, both come out very close to $0$ as you’d expect (with $B$ closer to $0$ than $\neg F$). Now we add $P$.

Recalculating degrees of justification, we find that they increase drastically. $B$ and $\neg F$ are now justified to degree $0.85$, while $P$ is justified to degree $0.26$. (All numbers approximate.)
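For the curious, here is a sketch of a calculation that lands on these figures. The weights are an assumption on my part, since they aren't listed above: each arrow is weighted by the probability of its target given all the other beliefs in the web. On that reading, $B$ and $\neg F$ support one another with weight $1$ (given $P$), $P$ supports each of them with weight $1$, and each of them supports $P$ with weight $0.2$ (the share of flightless birds that are penguins).

```python
def step(A, v):
    """Apply the weighted matrix to v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

# Rows are targets, columns are sources, in the order top, B, not-F, P.
# These weights are assumed for illustration, as described in the lead-in.
W = [[0.0, 1.0, 1.0, 1.0],   # every belief points to the tautology
     [0.0, 0.0, 1.0, 1.0],   # arrows into B
     [0.0, 1.0, 0.0, 1.0],   # arrows into not-F
     [0.0, 0.2, 0.2, 0.0]]   # arrows into P

v = [1.0] * 4
for _ in range(200):
    v = step(W, v)
    v = [x / v[0] for x in v]  # keep the tautology node at 1

k = 3                  # number of non-top nodes
ceiling = (k - 1) / k

names = ["B", "not_F", "P"]
justification = {name: round(x / ceiling, 2) for name, x in zip(names, v[1:])}

print(justification)  # {'B': 0.85, 'not_F': 0.85, 'P': 0.26}
```

Other weightings are certainly possible, so treat this as one reconstruction rather than the official recipe.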

So our account vindicates Merricks. Not only does adding $P$ to the corpus add “local” justification for $B$ and for $\neg F$. It also improves their standing on a more global assessment.

You might be worried though: did $P$ come out too weakly justified, at just $0.26$? No: that’s either an artifact of oversimplification, or else it’s actually the appropriate outcome. Notice that $B$ and $\neg F$ don’t really support Tweety being a penguin. They’re a flightless bird, sure, but maybe they’re an emu, kiwi, or moa. We chose to believe penguin, and maybe we have our reasons. If we do, then the graph is missing background beliefs which would improve $P$’s standing once added. But otherwise, we just fell prey to stereotyping or comes-to-mind-bias, in which case it’s right that $P$ stand poorly.

Technical Background

The notion of centrality used here is a common tool in network analysis, where it’s known as “eigenvector centrality.” Because the frequency vector we arrive at in the limit is an eigenvector of the matrix $\A$. In fact it’s a special eigenvector, the only one with all-positive values.

Since we’re measuring justification on a $0$-to-$1$ scale, our account depends on there always being such an eigenvector for $\A$. In fact we need it to be unique, up to scaling (i.e. up to multiplication by a constant).

The theorem that guarantees this is actually quite old, going back to work by Oskar Perron and Georg Frobenius published around 1910. Here’s one version of it.

Perron–Frobenius Theorem. Let $\A$ be a square matrix whose entries are all positive. Then all of the following hold.

$\A$ has an eigenvalue $\lambda$ that is larger (in absolute value) than $\A$’s other eigenvalues. We call $\lambda$ the leading eigenvalue.
$\A$’s leading eigenvalue has an eigenvector $\v$ whose entries are all positive. We call $\v$ the leading eigenvector.
$\A$ has no other positive eigenvectors, save multiples of $\v$.
The powers $(\A/\lambda)^n$ as $n \rightarrow \infty$ approach a matrix whose columns are all multiples of $\v$.

Now, our matrices had some zeros, so they weren’t positive in all their entries. But it doesn’t really matter, as it turns out.

Frobenius’ contribution was to generalize this result to many cases that feature zeros. But even in cases where Frobenius’ weaker conditions aren’t satisfied, we can just borrow a trick from Google.⁶ Instead of using a $0$-to-$1$ scale, we use $\epsilon$-to-$1$ for some very small positive number $\epsilon$. Then all entries in $\A$ are guaranteed to be positive, and we just rescale our results accordingly. (Choose $\epsilon$ small enough and the difference is negligible in practice.)
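Here's a toy illustration of why the trick helps. It uses a made-up two-node web where each belief supports the other and nothing else, so paths can only bounce back and forth and the running proportions never settle. The way the floor is applied here (raising any entry below $\epsilon$ up to $\epsilon$) is just one simple reading of the $\epsilon$-to-$1$ idea, and $\epsilon = 0.01$ is chosen large so that convergence is quick.

```python
def step(A, v):
    """Apply A to v."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

def normalize(v):
    total = sum(v)
    return [x / total for x in v]

def power_iterate(A, iterations):
    """Return the normalized vector after each round of power iteration."""
    v = normalize([1.0] * len(A))
    history = []
    for _ in range(iterations):
        v = normalize(step(A, v))
        history.append(v)
    return history

def floor_weights(A, eps):
    """Raise every weight below eps up to eps (one reading of the trick)."""
    return [[max(a, eps) for a in row] for row in A]

# Hypothetical two-node web: a pure 2-cycle with unequal weights.
A = [[0.0, 0.5],
     [1.0, 0.0]]

plain = power_iterate(A, 2000)
oscillates = abs(plain[-1][0] - plain[-2][0]) > 0.1    # never settles

floored = power_iterate(floor_weights(A, 0.01), 2000)
settled = abs(floored[-1][0] - floored[-2][0]) < 1e-9  # converges

print(oscillates, settled)  # True True
```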

Acknowledgments

This post owes a lot to prior work by Elena Derksen and Selim Berker. I’d never really thought much about how coherence and justification relate prior to reading Derksen’s work. And Berker’s prompted me to take graphs more seriously as a way of formalizing coherentism. I’m also grateful to David Wallace for introducing me to the Perron–Frobenius theorem’s use as a tool in network analysis.

See Shogenji (1999) and Fitelson (2003) for some early accounts. See Section 6 of Olsson’s SEP entry for a survey and more recent references.
^[return]
In his seminal book on coherentism, Bonjour (1985) writes: “the justification of a particular empirical belief finally depends, not on other particular beliefs as the linear conception of justification would have it, but instead on the overall system and its coherence.” This doesn’t commit us to assessing overall coherence before individual justification. But that’s a natural conclusion you might come away with.
^[return]
We could count every proposition as pointing to itself. This would mean putting $1$’s down the diagonal, i.e. adding the identity matrix $\mathbf{I}$ to $\A$. This can be useful as a way to ensure the limits we’ll require exist. But we’ll solve that problem differently in the “Technical Background” section. And otherwise it doesn’t really affect our results. It increases the leading eigenvalue by $1$, but doesn’t affect the leading eigenvector.
^[return]
In general, the maximum possible centrality is $(k-1)/k$ in a graph with $k$ non-$\top$ nodes.
^[return]
Hat tip to Erik J. Olsson’s entry on coherentism in the SEP, which uses this example in place of Klein & Warfield’s slightly more involved one.
^[return]
Google’s founders used a variant of eigenvector centrality called “PageRank” in their original search engine.
^[return]

The Open Handbook of Formal Epistemology

Wed, 26 Jun 2019 00:00:00 -0500

Today The Open Handbook of Formal Epistemology is available for download. It’s an open access book, the first published by PhilPapers itself. (The editors are Richard Pettigrew and me.)

The book features 11 outstanding entries by 11 wonderful philosophers.

“Precise Credences”, by Michael G. Titelbaum
“Decision Theory”, by Johanna Thoma
“Imprecise Probabilities”, by Anna Mahtani
“Primitive Conditional Probabilities”, by Kenny Easwaran
“Infinitesimal Probabilities”, by Sylvia Wenmackers
“Comparative Probabilities”, by Jason Konek
“Belief Revision Theory”, by Hanti Lin
“Ranking Theory”, by Franz Huber
“Full & Partial Belief”, by Konstantin Genin
“Doxastic Logic”, by Michael Caie
“Conditionals”, by R. A. Briggs

We wanted to include lots more, but didn’t want to hold up publication any longer. Hopefully a second edition will cover more.

For me personally, a central aim of this project was to demonstrate a point about open access publishing and shared standards. The budget for this book was exactly $0.00, and this was only possible because we didn’t need a human typesetter.

Pretty much everyone in formal epistemology uses the same, standardized format to do their writing. And that format plugs in to a high-quality, freely available typesetting program. So all you have to do to turn a dozen contributions from different authors into a unified book is paste them into a template and click “typeset”.

Ok, it did actually take some noodling to iron out the kinks. But mainly just because of my poor planning. Having done it once now and learned the gotchas, a second go would come pretty close to the copy→paste→typeset dream.

So for me, the moral is that philosophers in general should settle on a similar standard (all academics, really). If we did, we’d have a lot more freedom from commercial publishers. We could publish open access books like this on the regular. The books would be freely and easily available to all, and authors would retain copyright.

Collective action problems plague academia, and philosophical publishing in particular. But this one’s about as close to an opportunity for a major Pareto improvement as we’re likely to get.

No Escape from Allais: Reply to Buchak

Tue, 04 Jun 2019 00:00:00 -0500

In Risk & Rationality, Buchak (2013) advertises REU theory as able to recover the modal preferences in the Allais paradox. In our (2017) however, we pointed out that REU theory only applies in the “grand world” setting, where it actually struggles with the modal Allais preferences. Buchak (2017) offers two replies. Here we enumerate a variety of technical and philosophical problems with each.

Prestige and Placement in North American Philosophy

Tue, 04 Jun 2019 00:00:00 -0500

How does prestige correlate with placement in academic philosophy? There’s good stuff on this already, like this post by Carolyn Dicey Jennings, Pablo Contreras Kallens, and Justin Vlasits.¹ This post uses the same data sources, but emphasizes different things (visualization, North American PhDs, and primarily tenure-track jobs).

TT Placement in North America

Let’s start with a simple question of broad interest. In North America, how well does the PGR rating of one’s PhD-granting program predict one’s chances of landing a tenure-track (TT) job?

Consider all the people who got a PhD from a North American philosophy program in the years 2012–14.² Focus for now on those from PhD programs ranked by the 2006 edition of the PGR.³ Now group them according to those PGR ratings, rounded to the nearest 0.5.

This gives us 7 groups of PhDs (rankings range from 2.0 to 5.0). According to the APDA’s data, the proportion of each group who ended up in TT jobs is as follows:

There’s clearly a positive connection; almost perfectly linear in fact. And the gist—very crudely speaking—is that a high prestige PhD about doubles your chances of landing a TT job over a low-prestige PhD: from ~30% to ~60%.

Note that the data are sparse at the extremes though. Consider this raw look, where each point is a PhD graduate.

A “violin plot” shows the same thing in a form that’s easier to read: the thickness of the violins indicates the density of points at each x-position.

With so few points at the ends, we shouldn’t read too much into the exact placement rates there.

Other Placement Types

What about other kinds of jobs? Let’s consider five categories, defined as follows.

Postdoc: “Fellowship/Postdoc” in the APDA database.
Permanent: any of the following in the APDA database.
- “Tenure-Track”
- “Lecturer (Permanent)”
- “Instructor (Permanent)”
- “Adjunct (Permanent)”
- “Other (Permanent)”
Tenure-Track: “Tenure-Track” in the APDA database.
PhD Program: Tenure-Track at a PhD-granting program.
PGR Ranked: Tenure-Track at a 2006 PGR-ranked program.

These are hardly perfect definitions, but they’re manageable with this data while still being pretty informative.

Note that a graduate can appear in multiple categories (Tenure-Track is a subset of Permanent, after all).

Unranked Programs

What about PhD programs not ranked in the 2006 PGR?⁴ The numbers may be iffier here. Some programs have only one graduate listed for example, a graduate who got a TT job. But there are only a few such programs, and more than 600 graduates otherwise. So the numbers may still be good approximations.

Postdoc	Permanent	TT	PhD	PGR
0.13	0.46	0.39	0.05	0.01

If you’re curious which programs stand out among the unranked, here are the top 10 by TT placement (excluding those with 5 or fewer graduates).

Program	N	Postdoc	Permanent	TT	PhD
The Catholic University of America	11	0.00	0.91	0.91	0.18
Baylor University	13	0.15	0.92	0.85	0.08
DePaul University	11	0.00	0.82	0.82	0.09
University of Tennessee	13	0.00	0.77	0.77	0.00
University of New Mexico	7	0.00	0.71	0.71	0.00
Vanderbilt University	9	0.11	0.67	0.67	0.00
University of South Florida	15	0.13	0.60	0.60	0.00
Florida State University	12	0.17	0.58	0.58	0.08
University of Oregon	12	0.17	0.67	0.58	0.00
University of Kansas	9	0.00	0.67	0.56	0.00

Note that the top 3 are at Christian universities, and as you might expect, a lot of their placement is driven by hires at Christian schools.

Here are the 10 “largest” programs, i.e. those with the most graduates listed in the APDA database.

Program	N	Postdoc	Permanent	TT	PhD	PGR
Boston College	28	0.21	0.68	0.50	0.04	0.00
The New School	26	0.23	0.31	0.27	0.08	0.00
Purdue University	22	0.14	0.32	0.27	0.00	0.00
Stony Brook University	22	0.00	0.55	0.41	0.05	0.00
Emory University	21	0.52	0.62	0.52	0.10	0.05
Southern Illinois University	20	0.05	0.30	0.30	0.00	0.00
Duquesne University	18	0.22	0.50	0.44	0.00	0.00
Fordham University	18	0.22	0.44	0.39	0.00	0.00
Villanova University	18	0.00	0.78	0.39	0.00	0.00
University of Guelph	17	0.12	0.06	0.00	0.00	0.00

Departmental TT Placement

Looking at placement rates by department raises the question: how well does a department’s PGR rating predict its TT placement rate?

There’s a clear connection, but also a lot of variation. Which are the programs that especially stand out from the trend? Suppressing the sizing for visibility, we can label those programs above/below the trendline by at least 0.2.

For complete listings of departmental placement rates, check out the APDA’s infograms here.

The code for this analysis is available here.

Also check out Figure 1 in this paper by Helen De Cruz.
^[return]
Why these years? Because that’s where the data is best. The APDA has focused its collection efforts so far on graduates from the years 2012–16, so that’s where the data is the most plentiful. But the data for 2015 and 2016 graduates probably aren’t “ripe” enough yet for our purposes; many graduates who will ultimately find TT jobs are probably still in postdocs and other temporary gigs. Thanks to Brian Weatherson for pushing me to take this into account.

Of course, the 2012–2014 data aren’t fully ripe either. But previous noodling suggests they’re probably pretty close.
^[return]
Why the 2006 edition? Partly for continuity with the APDA’s own analysis. But also because students often use PGR rankings to choose PhD programs, and the rankings available to them typically predate the year of their PhD by 6 or 7 years.
^[return]
Thanks to Amanda at the Philosophers’ Cocoon for prompting me to look at this.
^[return]

Nobody Expects the Chance Function!

Fri, 15 Feb 2019 11:52:00 -0500

Here’s a striking result that caught me off guard the other day. It came up in a Facebook thread, and judging by the discussion there it caught a few other people in this neighbourhood off guard too.

The short version: chances are “self-expecting” pretty much if and only if they’re “self-certain”. Less cryptically: the chance of a proposition equals its expected chance just in case the chance function assigns probability 1 to itself being the true chance function, modulo an exception to be discussed below.

The same result applies to any probabilities of course, whether they represent physical chances or evidential probabilities or whatever. In fact, thanks to friends on Facebook, I learned that it drives this lovely paper by Kevin Dorst.

I just happened to stumble across it while thinking about chances, because Richard Pettigrew uses the assumption that chances are self-expecting in his Phil Review paper on accuracy and the Principal Principle. But later, in his landmark book on accuracy, he switches to the requirement that they be self-certain. It turns out this isn’t a coincidence. The result we’re about to look at illuminates this shift.

The result goes back to 1997 at least, in a paper by Dov Samet. Proving the full result is a bit more involved than what I’ll present here. For simplicity, I’ll only prove a special case at the end. But along the way we’ll look at some suggestive examples that illustrate the full version.

The Chance Matrix

Imagine we have just four possible worlds, resulting from two tosses of a coin. What are the physical chances at each of the four possible worlds $HH$, $HT$, $TH$, and $TT$? $\newcommand{\mstar}{\mathfrak{m}^*} \newcommand{\C}{\mathbf{C}}$

One natural thought is to apply Laplace’s classic rule of succession: given $s$ heads out of $n$ tosses, conclude that the probability of heads on each toss is $(s+1)/(n+2)$. So at $HH$-world for example, the chance of heads was $3/ 4$ on each toss.

If we assume the tosses are independent, then $HH$-world had chance $(3/ 4)(3/ 4) = 9/16$ of being actual, according to the chance function at $HH$-world. Whereas $HT$-world had chance $(3/ 4)(1/ 4) = 3/16$ of being actual at $HH$-world. The full chance function at $HH$-world can be displayed as a column vector: $$ \left( \begin{matrix} 9/16\\
3/16\\
3/16\\
1/16 \end{matrix} \right). $$ Applying the same recipe at $HT$-world would give us a different column vector. And sticking the columns for all four worlds together, we get a $4 \times 4$ chance matrix for our space of possible worlds: $$ \mathbf{C} = \left( \begin{matrix} 9/16 & 1/ 4 & 1/ 4 & 1/16\\
3/16 & 1/ 4 & 1/ 4 & 3/16\\
3/16 & 1/ 4 & 1/ 4 & 3/16\\
1/16 & 1/ 4 & 1/ 4 & 9/16 \end{matrix} \right). $$ Each column gives the chances at a world, while a row gives the chances of a world. For example, entry $c_{14}$ gives the chance at world $4$ of world $1$ being actual. It says how likely the sequence $HH$ was if the actual unfolding of events is instead $TT$, namely $1/16$.
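The recipe is mechanical enough to script. Here's a minimal sketch that builds the matrix with exact fractions, assuming independent tosses as above; the function name `laplace_chances` is mine.

```python
from fractions import Fraction
from itertools import product

# The four worlds, in the order HH, HT, TH, TT.
worlds = ["".join(w) for w in product("HT", repeat=2)]

def laplace_chances(actual):
    """Chance of each world, according to the chances at world `actual`."""
    n = len(actual)
    s = actual.count("H")
    p = Fraction(s + 1, n + 2)  # rule of succession: chance of heads per toss
    chances = []
    for w in worlds:
        c = Fraction(1)
        for toss in w:
            c *= p if toss == "H" else 1 - p  # tosses assumed independent
        chances.append(c)
    return chances

# One column per world; entry C[i][j] is the chance at world j of world i.
columns = [laplace_chances(w) for w in worlds]
C = [[columns[j][i] for j in range(4)] for i in range(4)]

for row in C:
    print([str(x) for x in row])
# ['9/16', '1/4', '1/4', '1/16']
# ['3/16', '1/4', '1/4', '3/16']
# ['3/16', '1/4', '1/4', '3/16']
# ['1/16', '1/4', '1/4', '9/16']
```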

A different thought would be to appeal to Carnap’s notorious “logical” prior $\mstar$: $$ \mstar = \left( \begin{matrix} 1/3\\
1/6\\
1/6\\
1/3 \end{matrix} \right). $$ This assignment of probabilities ignores the actual unfolding of events in each world. It falls out of a bit of a priori reasoning instead. There are three possible outcomes: $2$ heads, $1$ head, or $0$ heads. Each is equally likely, $1/ 3$. But there are two ways to get $1$ head, so the $1/ 3$ there gets subdivided equally between $HT$ and $TH$, leaving $1/6$ for each.

Since these chances ignore the actual unfolding of events in each world, the chance matrix we get here is extremely anti-Humean. It’s just four repetitions of $\mstar$: $$ \mathbf{C} = \left( \begin{matrix} 1/3 & 1/3 & 1/3 & 1/3\\
1/6 & 1/6 & 1/6 & 1/6\\
1/6 & 1/6 & 1/6 & 1/6\\
1/3 & 1/3 & 1/3 & 1/3 \end{matrix} \right). $$ You might think that’s a pretty terrible theory of chance, and I sympathize. But what we’re about to see is that, of our two chance matrices, only the second is “self-expecting”. And its terribleness is part of the reason why.

Self Expectation

Pettigrew’s Phil Review paper assumes that chance functions are “self-expecting”. The chance of a proposition at a given world must equal its expected value, where the expectation is taken according to the chances at that world.

In terms of a chance matrix $\C$, this amounts to the requirement that $\C \C = \C$. When we multiply $\C$ by $\C$, we take dot-products of rows and columns. For example, if we were doing the calculation by hand, we’d start by multiplying the first row of $\C$ by the first column of $\C$. And this is just the weighted average of the various possible chances of the first world, where the weights are the chances at that world. In other words, it’s the expected chance of $HH$-world at $HH$-world.

In general, the dot product of row $i$ with column $j$ is the expected chance of world $i$ at world $j$. For this expected chance to equal the chance of world $i$ at world $j$, it must be that $\C \C = \C$. More succinctly, $\C^2 = \C$.

Matrices that have this property—squaring them leaves them unchanged—are called idempotent. And when our matrices are column stochastic (all values are nonnegative and each column sums to $1$), idempotence is a very… well, potent requirement.

For example, our first chance matrix based on Laplace’s rule of succession is not idempotent. Its square is not itself, but something quite different. Our second, Carnapian matrix is idempotent though. Its square is just itself. And that’s not a coincidence.
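Both claims are easy to verify with exact arithmetic. A minimal sketch, with the two matrices typed in by hand and hypothetical helper names:

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two square matrices given as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def is_idempotent(C):
    """Self-expectation: squaring the chance matrix leaves it unchanged."""
    return matmul(C, C) == C

laplace = [[F(9, 16), F(1, 4), F(1, 4), F(1, 16)],
           [F(3, 16), F(1, 4), F(1, 4), F(3, 16)],
           [F(3, 16), F(1, 4), F(1, 4), F(3, 16)],
           [F(1, 16), F(1, 4), F(1, 4), F(9, 16)]]

m_star = [F(1, 3), F(1, 6), F(1, 6), F(1, 3)]
carnap = [[m_star[i]] * 4 for i in range(4)]  # four copies of m* as columns

print(is_idempotent(laplace))  # False
print(is_idempotent(carnap))   # True

# Expected chance of HH at HH-world on the Laplacean matrix, vs. its chance.
expected_hh = matmul(laplace, laplace)[0][0]
print(expected_hh, laplace[0][0])  # 53/128 9/16
```

So at $HH$-world, the Laplacean chances expect the chance of $HH$ to be $53/128$, well short of the $9/16$ they actually assign it.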

Self Certainty

Any chance matrix whose columns are redundant will be idempotent. After all, if the chances are the same at every world, the expected chance of any world is the same no matter which world we take the expectation at. So its expected chance just is the chance it has at every world.

But redundant columns also mean that the chances are self certain. Each world’s chance assignment gives zero probability to the chances being anything other than what they are at that world. Because there are no worlds where the chances are different.

The chances can vary from world to world and still be self-expecting though. There are idempotent chance matrices where the columns are not simply redundant. For example, here’s another idempotent chance matrix: $$ \left( \begin{matrix} 1/3 & 1/3 & 0 & 0\\
2/3 & 2/3 & 0 & 0\\
0 & 0 & 1/4 & 1/4\\
0 & 0 & 3/4 & 3/4 \end{matrix} \right). $$ But notice how it’s still kind of a degenerate case. There are two disjoint regions of modal space here that regard one another as zero-chance. And within each region the chances are the same at each world. Worlds $1$ and $2$ have the same chances, and they give zero chance to worlds $3$ and $4$. And vice versa from the point of view of worlds $3$ and $4$.

In other words, self-expectation and self-certainty go hand in hand here once again.

A Sliver of Daylight

Is there any daylight at all then between self-expectation and self-certainty?

Self-certainty entails self-expectation, and the argument is pretty short. If the only worlds with positive chance according to world $j$ assign the same chances as world $j$ does, then any average of those chances will just be those same chances.
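Here's one way to spell that argument out. The notation $c_{ij}$ for the chance of world $i$ at world $j$ (the entry in row $i$, column $j$) is my own shorthand. Self-certainty says that whenever $c_{kj} > 0$, world $k$ has the same chances as world $j$, so $c_{ik} = c_{ij}$ for every $i$. Then, since each column sums to $1$: $$ (\C^2)_{ij} = \sum_k c_{ik} \, c_{kj} = \sum_{k \,:\, c_{kj} > 0} c_{ij} \, c_{kj} = c_{ij} \sum_k c_{kj} = c_{ij}. $$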

But self-expectation doesn’t quite entail self-certainty. For example, here’s an idempotent chance matrix that’s not self-certain: $$ \left( \begin{matrix} 1/3 & 1/3 & 0 & 0 & 25/94\\
2/3 & 2/3 & 0 & 0 & 25/47\\
0 & 0 & 1/4 & 1/4 & 19/376\\
0 & 0 & 3/4 & 3/4 & 57/376\\
0 & 0 & 0 & 0 & 0 \end{matrix} \right). $$ It’s kind of a lame counterexample though, because the new, fifth world we’ve introduced (the coin explodes or something idk) has zero chance at every world, even itself.
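To see the difference concretely, here's a small sketch comparing the four-world block matrix with its five-world extension. The helpers `matmul`, `column`, and `is_self_certain` are my own names. In particular, `is_self_certain` encodes one reading of self-certainty: each world gives positive chance only to worlds whose chances match its own.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def column(C, j):
    """The chance assignment at world j."""
    return [row[j] for row in C]

def is_self_certain(C):
    """True if each world gives positive chance only to worlds sharing its chances."""
    n = len(C)
    return all(column(C, i) == column(C, j)
               for j in range(n) for i in range(n) if C[i][j] > 0)

# The block-diagonal matrix: worlds 1-2 and worlds 3-4 form two closed regions
block = [[F(1, 3), F(1, 3), F(0), F(0)],
         [F(2, 3), F(2, 3), F(0), F(0)],
         [F(0), F(0), F(1, 4), F(1, 4)],
         [F(0), F(0), F(3, 4), F(3, 4)]]

# Add the fifth world: a new column on the right, and a new row of zeros below
fifth = [F(25, 94), F(25, 47), F(19, 376), F(57, 376)]
extended = [row + [x] for row, x in zip(block, fifth)] + [[F(0)] * 5]

for name, C in [("block", block), ("extended", extended)]:
    print(name, "idempotent:", matmul(C, C) == C,
          "self-certain:", is_self_certain(C))
```

Both matrices come out idempotent, but only the first is self-certain. The fifth world gives all its chance to worlds whose chances differ from its own.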

In fact this is what Samet proves: this is the only kind of counterexample possible! If probabilities are self-expecting, then they must be either self-certain or self-effacing. They must assign zero chance to the chances being otherwise, or they must assign chance one to them being otherwise.

In terms of matrices, there are only three kinds of idempotent chance matrix:

1. All columns are identical.
2. The matrix is block diagonal, with identical columns inside each block.
3. The matrix is as in (2), except for some columns $j_1, \ldots, j_n$. But the corresponding rows $j_1, \ldots, j_n$ contain only zeros.

Strictly speaking (1) is actually a special case of (2). But (1) deserves direct attention because it arises in a way that’s interesting both philosophically and mathematically.

The Connected Case

Here’s a natural thought, one that’s driven a lot of the literature on chance and Lewis’ Principal Principle. The thought: however events unfold at one world, there’s a chance they could have evolved differently. There’s even some small, non-zero chance they could have evolved quite differently.

Taking this thought a bit further, you might think there’s a region of modal space where, even though worlds $w_1$ and $w_n$ have different chances, $w_n$ is always reachable from world $w_1$. More exactly, there’s always a connecting sequence of worlds $w_1, w_2, \ldots, w_n$ where $w_i$ gives non-zero chance to $w_{i+1}$.

In terms of coin tosses, maybe it’s a law of nature that all coins land heads at a world where all one hundred flips land heads. But when there’s a mix of heads and tails, there’s at least some chance the mix could have had a few more heads, or a few more tails. So every world where the sequence isn’t perfectly uniform can be reached from every other: if not in a single, positive-chance hop, then at least by a series of hops, say by switching the outcomes of the flips one at a time.

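In terms of graphs, such a region of modal space is said to be connected. In terms of matrices, it amounts to the chance matrix for this region being regular: there must be some power $n$ such that every entry of $\C^n$ is positive. (Strictly speaking, regularity also requires that the hops not be locked into a fixed cycle. That’s guaranteed if, say, some world gives itself positive chance.)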

Now, regular matrices have the remarkable property that, as we multiply them against themselves more and more times, the result converges to a matrix $\mathbf{P}$ whose columns are all identical: $$ \lim_{n \rightarrow \infty} \C^n = \mathbf{P} = \left( \begin{matrix} p_1 & \ldots & p_1 \\
\vdots & \ldots & \vdots \\
p_k & \ldots & p_k \\
\end{matrix} \right). $$ Now recall that for $\C$ to be self-expecting, it must be idempotent, meaning $\C^2 = \C$. That means $\C^n = \C$ for any power $n \geq 1$, so the sequence of powers is constant and its limit is just $\C$ itself. So $\C = \mathbf{P}$, and $\C$ must already have redundant columns.
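Here's a toy illustration of that convergence. The two-world matrix below is purely hypothetical, chosen only because all its entries are positive (so it's regular) and its columns differ. Raising it to higher and higher powers, the columns collapse onto a common distribution. And since the columns start out different, it can't be idempotent.

```python
from fractions import Fraction as F

def matmul(A, B):
    """Multiply two square matrices given as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical column-stochastic matrix with all entries positive, hence regular
C = [[F(9, 10), F(1, 2)],
     [F(1, 10), F(1, 2)]]

# Not idempotent: its columns differ, so squaring changes it
print("idempotent:", matmul(C, C) == C)

# Keep multiplying by C; after the loop, P is the 61st power of C
P = C
for _ in range(60):
    P = matmul(P, C)

# Largest difference between the two columns of the high power
gap = max(abs(P[i][0] - P[i][1]) for i in range(2))

# Both columns approach the same distribution, (5/6, 1/6) for this matrix
print(float(P[0][0]), float(P[1][0]), float(gap))
```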

What does this mean for us? One way to think about it: there isn’t as much room for chances to vary from world to world as one might have thought. If the chances are going to be self-expecting, they must be the same at every world across a whole region of modal space despite the facts turning out quite differently at various worlds across that region.

This point is strongly reminiscent of David Lewis’ famous “Big Bad Bug” of course. And there’s tons of relevant literature, most of which I confess I never really absorbed. So I’ll close by linking to just one paper I’m finding especially helpful on this right now, Richard Pettigrew’s “What Chance-Credence Norms Should Not Be”.
