Last time we took Brier distance beyond two dimensions. We showed that it’s “proper” in any finite number of dimensions. Today we’ll show that Euclidean distance is “improper” in any finite number dimensions.

When I first sat down to write this post, I had in mind a straightforward generalization of our previous result for Euclidean distance in two dimensions. And I figured it would be easy to prove.

Not so.

My initial conjecture was false, and worse, when I asked my accuracy-guru friends for the truth, nobody seemed to know. (They did offer lots of helpful suggestions, though.)

So today we’re muddling through on our own even more than usual. Here goes.

# Background

Let’s recall where we are. We’ve been considering different ways of measuring the inaccuracy of a probability assignment given a possibility, or a “possible world”.

Let’s start today by regimenting our terminology. We’ve used these terms semi-formally for a while now. But let’s gather them here for reference, and to make them a little more precise.$ \renewcommand{\vec}[1]{\mathbf{#1}} \newcommand{\p}{\vec{p}} \newcommand{\q}{\vec{q}} \newcommand{\u}{\vec{u}} \newcommand{\EIpq}{EI_{\p}(\q)} \newcommand{\EIpp}{EI_{\p}(\p)} $

Given a number of dimensions $n$:

- A
*probability assignment*$\p = (p_1, \ldots, p_n)$ is a vector of positive real numbers that sum to $1$. - A
*possible world*is a vector $\u$ of length $n$ containing all zeros except for a single $1$. (A unit vector of length $n$, in other words.) - A
*measure of inaccuracy*$D(\p, \u)$ is a function that takes a probability assignment and a possible world and returns a real number.

We’ve been considering two measures of inaccuracy. The first is the familiar Euclidean distance between $\p$ and $\u$. For example, when $\u = (1, 0, \ldots, 0)$ we have: $$ \sqrt{(p_1 - 1)^2 + (p_2 - 0)^2 + \ldots + (p_n - 0)^2}.$$ The second way of measuring inaccuracy is less familiar, Brier distance, which is just the square of Euclidean distance: $$ (p_1 - 1)^2 + (p_2 - 0)^2 + \ldots + (p_n - 0)^2.$$

What we found in $n = 2$ dimensions is that Euclidean distance is “unstable” in a way that Brier is not. If we measure inaccuracy using Euclidean distance, a probability assignment can expect some *other* probability assignment to do better accuracy-wise, i.e. to have lower inaccuracy.

In fact, given almost any probability assignment, the way to minimize expected inaccuracy is to leap to certainty in the most likely possibility. Given $(2/3, 1/3)$, for example, the way to minimize expected inaccuracy is to move to $(1,0)$.

Because Euclidean distance is unstable in this way, it’s called an “improper” measure of inaccuracy. So, two more bits of terminology:

- Given a probability assignment $\p$ and a measure of inaccuracy $D$, the
*expected inaccuracy*of probability assignment $\q$, written $\EIpq$, is the weighted sum: $$ \EIpq = p_1 D(\q,\u_1) + \ldots + p_n D(\q,\u_n), $$ where $\u_i$ is the possible world with a $1$ at index $i$. - A measure of inaccuracy $D$ is
*improper*if there is a probability assignment $\p$ such that for some assignment $\q \neq \p$, $\EIpq < \EIpp$ when inaccuracy is measured according to $D$.

Last time we showed that Brier is *proper* in any finite number of dimensions $n$. Today our main task is to show that Euclidean distance is * improper* in any finite number of dimensions $n$.

But first, let’s get a tempting mistake out of the way.

# A Conjecture and Its Refutation

In our first post, we saw that Euclidean distance isn’t just improper in two dimensions. It’s also *extremizing*: the assignment $(2/3, 1/3)$ doesn’t just expect *some* other assignment to do better accuracy-wise. It expects the assignment $(1,0)$ to do best!

At first I thought we’d be proving a straightforward generalization of that result today:

**Conjecture 1 (False).** Let $(p_1, \ldots, p_n)$ be a probability assignment with a unique largest element $p_i$. If we measure inaccuracy by Euclidean distance, then $\EIpq$ is minimized when $\q = \u_i$.

Intuitively: expected inaccuracy is minimized by leaping to certainty in the most probable possibility. Turns out this is false in three dimensions. Here’s a

**Counterexample.** Let’s define:
$$
\begin{align}
\p &= (5/12, 4/12, 3/12),\\

\p’ &= (6/12, 4/12, 2/12),\\

\u_1 &= (1, 0, 0).
\end{align}
$$

Then we can calculate (or better, have *Mathematica* calculate):
$$
\begin{align}
\EIpp &\approx .804,\\

EI_{\p}(\p’) &\approx .800,\\

EI_{\p}(\u_1) &\approx .825.
\end{align}
$$
In this case $\EIpp < EI_{\p}(\u_1)$. So leaping to certainty doesn’t minimize expected inaccuracy (as measured by Euclidean distance).

Of course, staying put doesn’t minimize it either, since $EI_{\p}(\p’) < \EIpp$.

So what *does* minimize it in this example? I asked *Mathematica* to minimize $\EIpq$ and got… nothing for days. Eventually I gave up waiting and asked instead for a numerical approximation of the minimum. One second later I got:

$$EI_{\p}(0.575661, 0.250392, 0.173947) \approx 0.797432.$$

I have no idea what that is in more meaningful terms, I’m sorry to say. But at least we know it’s not anywhere near the extreme point $\u_1$ I conjectured at the outset. (See the **Update** at the end for a little more.)

# A Shortcut and Its Shortcomings

So I asked friends who do this kind of thing for a living how they handle the $n$-dimensional case. A couple of them suggested taking a shortcut around it!

Look, you’ve already handled the two-dimensional case. And that’s just an instance of higher dimensional cases.

Take a probability assignment like (2/3, 1/3). We can also think of it as (2/3, 1/3, 0), or as (2/3, 0, 1/3, 0), etc.

No matter how many zeros we sprinkle around in there, the same thing is going to happen as in the two-dimensional case. Leaping to certainty in the 2/3 possibility will minimize expected inaccuracy. (Because possibilities with no probability make no difference to expected value calculations.)

So no matter how many dimensions we’re working in, there will always be

someprobability assignment where leaping to certainty minimizes expected inaccuracy. It just might have lots of zeros in it.So Euclidean distance is, technically, improper in any finite number of dimensions.

At first I thought that was good enough for philosophy. Though I still wanted to know how to handle “no zeros” cases for the mathematical clarity.

Then I realized there may be a philosophical reason to be dissatisfied with this shortcut. A lot of people endorse the Regularity principle: you should never assign zero probability to any possibility. For these people, the shortcut might be a dead end.

(Of course, maybe we shouldn’t embrace Regularity if we’re working in the accuracy framework. I won’t stop for that question here.)

# A Theorem and Its Corollary

So let’s take the problem head on. We want to show that Euclidean distance is improper in $n > 2$ dimensions, even when there are “no zeros”. Two last bits of terminology:

- A probability assignment $(p_1, \ldots, p_n)$ is
*regular*if $p_i > 0$ for all $i$. - A probability assignment $(p_1, \ldots, p_n)$ is
*uniform*if $p_i = p_j$ for all $i,j$.

So, for example, the assignment $(1/3, 1/3, 1/3)$ is both regular and uniform. Whereas the assignment $(2/5, 2/5, 1/5)$ is regular, but not uniform.

What we’ll show is that assignments like $(2/5, 2/5, 1/5)$ make Euclidean distance “unstable”: they expect some other assignment to do better, accuracy-wise. (Exactly which other assignment they’ll expect to do best isn’t always easy to say.)

(Though I try to keep the math in these posts as elementary as possible, this proof will use calculus. If you know a bit about derivatives, you should be fine. Technically we’ll use multi-variable calculus. But if you’ve worked with derivatives in single-variable calculus, that should be enough for the main ideas.)

**Theorem.**
Let $\p = (p_1, \ldots, p_n)$ be a regular, non-uniform probability assignment. If accuracy is measured by Euclidean distance, then $EI_{\p}(\q)$ is not minimized when $\q = \p$.

*Proof.*
Let $\p = (p_1, \ldots, p_n)$ be a regular and non-uniform probability assignment, and measure inaccuracy using Euclidean distance. Then:
$$
\begin{align}
EI_{\p}(\q) &= p_1 \sqrt{(q_1 - 1)^2 + \ldots + (q_n - 0)^2} + \ldots + p_n \sqrt{(q_1 - 0)^2 + \ldots + (q_n - 1)^2}\\

&= p_1 \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \sqrt{q_1^2 + \ldots + (q_n - 1)^2}
\end{align}
$$

The crux of our proof will be that the derivatives of this function are non-zero at the point $\q = \p$. Since the minimum of a function is always a “critical point”, that suffices to show that $\q = \p$ is not a minimum of $\EIpq$.

To start, we calculate the partial derivative of $\EIpq$ for an arbitrary $q_i$:
$$
\begin{align}
\frac{\partial}{\partial q_i} \EIpq
&=
\frac{\partial}{\partial q_i} \left( p_1 \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \sqrt{q_1^2 + \ldots + (q_n - 1)^2} \right)\\

&=
p_1 \frac{\partial}{\partial q_i} \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \frac{\partial}{\partial q_i} \sqrt{q_1^2 + \ldots + (q_n - 1)^2}\\

&= \quad
p_i \frac{q_i - 1}{\sqrt{(q_i - 1)^2 + \sum_{j \neq i} q_j^2}} + \sum_{j \neq i} p_j \frac{q_i}{\sqrt{(q_j - 1)^2 + \sum_{k \neq j} q_k^2}}\\

&= \quad
\sum_{j \neq i} \frac{p_j q_i}{\sqrt{(q_j - 1)^2 + \sum_{k \neq j} q_k^2}} - \sum_{j \neq i} \frac{p_i q_j}{\sqrt{(q_i - 1)^2 + \sum_{j \neq i} q_j^2}}.
\end{align}
$$

Then we evaluate at $\q = \p$: $$ \begin{align} \frac{\partial}{\partial q_i} \EIpp &= \sum_{j \neq i} \frac{p_i p_j}{\sqrt{(p_j - 1)^2 + \sum_{k \neq j} p_k^2}} - \sum_{j \neq i} \frac{p_i p_j}{\sqrt{(p_i - 1)^2 + \sum_{j \neq i} p_j^2}} \end{align} $$

Now, because $\p$ is not uniform, some of its elements are larger than others. And because it is finite, there is at least one largest element. When $p_i$ is one of these largest elements, then $\partial / \partial q_i \EIpp$ is negative.

Why?

In our equation for $\partial / \partial q_i \EIpp$, each positive term has a corresponding negative term whose numerator is identical. And when $p_i$ is a largest element of $\p$, the denominator of each negative term will never be larger, but will sometimes be smaller, than the denominator of its corresponding positive term. Subtracting $1$ from $p_i$ before squaring does more to reduce the sum of squares $p_i^2 + \sum_{j \neq i} p_j^2$ than subtracting $1$ from any smaller term would. It effectively removes the/a largest square from the sum and substitutes the smallest replacement. So the negative terms are never smaller, but are sometimes larger, than their positive counterparts.

If, on the other hand, $p_i$ is the one of the smallest elements, then $\partial / \partial q_i \EIpp$ is positive. For then the reverse argument applies: the denominator of each negative term will never be smaller and will sometimes be larger than the denominator of the corresponding positive term. So the negatives terms are never larger, but are sometimes smaller, than their positive counterparts.

We have shown that the partial derivates of $\EIpq$ are non-zero at the point $\q = \p$. Thus $\p$ is not a critical point of $\EIpq$, and hence cannot be a minimum of $\EIpq$. $\Box$

**Corollary.** Euclidean distance is improper in any finite number of dimensions.

*Proof.* This is just a slight restatement of our theorem. If $\q = \p$ is not a minimum of $\EIpq$, then there is some $\q \neq \p$ such that $\EIpq < \EIpp$. $\Box$

# Conjectures Awaiting Refutations

Notice, we’ve also shown something a bit stronger. We showed that the slope of $\EIpq$ at the point $\q = \p$ is always negative in the direction of $\p$’s largest element(s), and positive in the direction of its smallest element(s). That means we can always reduce expected inaccuracy by taking some small quantity away from the/a smallest element of $\p$ and adding it to the/a largest element. In other words, we can always reduce expected inaccuracy by moving *some* way towards perfect certainty in the/a possibility that $\p$ rates most probable.

However, we *haven’t* shown that repeatedly minimizing expected inaccuracy will, eventually, lead to certainty in the/a possibility that was most probable to begin with. For one thing, we haven’t shown that moving towards certainty in this direction minimizes expected inaccuracy at each step. We’ve only shown that moving in this direction reduces it.

Still, I’m pretty sure a result along these lines holds. Tinkering in *Mathematica* strongly suggests that the following Conjectures are true in any finite number of dimensions $n$:

**Conjecture 2.** If a probability assignment gives greater than $1/ 2$ probability to some possibility, then expected inaccuracy is minimized by assigning probability 1 to that possibility. (But see the **Update** below.)

**Conjecture 3.** Given a non-uniform probability assignment, repeatedly minimizing expected inaccuracy will, within a finite number of steps, increase the probability of the/a possibility that was most probable initially beyond $1/ 2$.

If these conjectures hold, then there’s still a weak-ish sense in which Euclidean distance is “extremizing” in $n > 2$ dimensions. Given a non-uniform probability assignment, repeatedly minimizing expected inaccuracy will eventually lead to greater than $1/ 2$ probability in the/a possibility that was most probable to begin with. Then, minimizing inaccuracy will lead in a single step to certainty in that possibility.

Proving these conjectures would close much of the gap between the theorem we proved and the false conjecture I started with. If you’re interested, you can use this *Mathematica* notebook to test them.

**Update: Mar. 6, 2017.** Thanks to some excellent help from Jonathan Love, I’ve tweaked this post (and greatly simplified the previous one).

I changed the counterexample to the false Conjecture 1, which used to be $\p = (3/7, 2/7, 2/7)$ and $\p’ = (4/7, 2/7, 1/7)$. That works fine, but it’s potentially misleading.

As Jonathan kindly pointed out, the minimum point then is something quite nice. It’s obtained by moving in the $x$-dimension from $3/7$ to $\sqrt{3/7}$, and correspondingly reducing the probability in the $y$ and $z$ dimensions in equal parts.

But, in general, moving to the square root of the largest $p_i$ (when there is one) doesn’t minimize $\EIpq$. Even in the special case where all the other elements in the vector are equal, this doesn’t generally work.

Jonathan did solve that special case, though, and he found at least one interesting result connected with Conjecture 2. There appear to be cases where $p_i < 1/ 2$ for all $i$, and yet $\EIpq$ is still minimized by going directly to the extreme. For example, $\p = (.465, .2675, .2675)$.