Jonathan Weisberg
http://jonathanweisberg.org/index.xml
Recent content on Jonathan Weisberg
Where Are They Now? The Healy 2100
http://jonathanweisberg.org/post/Where%20Are%20They%20Now%20The%20Healy%202100/
Mon, 23 Apr 2018 00:00:00 -0500
<p>A <a href="https://www.timeshighereducation.com/news/how-much-research-goes-completely-uncited" target="_blank"><em>Times Higher Education</em>
piece</a>
making the rounds last week found that most published philosophy papers
are never cited. More exactly, of the studied philosophy papers
published in 2012, more than half had no citations indexed in <a href="https://clarivate.com/products/web-of-science/" target="_blank">Web of
Science</a> five years
later.</p>
<p>At Daily Nous, the <a href="http://dailynous.com/2018/04/19/philosophy-high-rate-uncited-publications/" target="_blank">discussion of that
finding</a>
turned up some interesting follow-up questions and findings. In
particular, Brian Weatherson found <a href="http://dailynous.com/2018/04/19/philosophy-high-rate-uncited-publications/#comment-141535" target="_blank">quite different
figures</a>
for papers published in <em>prestigious</em> philosophy journals. In the journals
he looked at,<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> 89% of the papers published in 2012 had at least one
citation in Web of Science five years later. And more than half had five
or more citations.</p>
<p>That’s a pretty striking difference: >50% vs. ~11%! Seems like where
you publish your paper makes a <em>big</em> difference to your chances of going
uncited.</p>
<p>Shocking, I know.</p>
<p>But this got me thinking about <a href="https://kieranhealy.org/blog/archives/2015/02/25/gender-and-citation-in-four-general-interest-philosophy-journals-1993-2013/" target="_blank">Kieran Healy’s
analysis</a>
from a few years back. He found an “uncitation” rate higher than
Weatherson’s—almost 20%—even though he was looking at just four of
philosophy’s most prominent journals: <em>Journal of Philosophy</em>, <em>Mind</em>,
<em>Noûs</em>, and <em>Philosophical Review</em>. (He found that around half of the
papers in these journals had five citations or fewer.)</p>
<p>So I wondered: what’s with the discrepancy? Do these journals not
necessarily get the most citations? Or is it that Healy was looking at
papers from 1993 to 2013, and things changed somehow over those two
decades, so that papers published in 2012 tend to get discussed more
than papers from 1993? Or is it just a symptom of when Healy collected
his data? Papers published in, say, 2011 wouldn’t have had much time to
gather citations by 2013 when Healy (apparently?) gathered his data.</p>
<p>Let’s take a look.</p>
<h1 id="the-healy-2100">The Healy 2100</h1>
<p>Since I don’t have Healy’s raw data, I went to Web of Science and
grabbed their data for all papers published over 1993–2013 in the
“Healy 4” journals. I ended up with a couple hundred more papers than
Healy looked at—not sure why. But the list of 2,100 papers he studied
is <a href="https://github.com/kjhealy/philpub" target="_blank">available on GitHub</a>. So I
focused on just those to get more of an apples-to-apples comparison.</p>
<p>Then I tried to reproduce his original findings, especially <a href="https://kieranhealy.org/files/misc/citation-histogram-freq.png" target="_blank">this
histogram</a>:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-2-1.png" alt="" /></p>
<p>I got pretty close, but I didn’t manage to reproduce his results
exactly. I found about 19.9% of the papers had no citations by the end
of 2013, compared to Healy’s ~18.5%. And I found about 58.6% with five
or fewer citations, compared with Healy’s “just over half”.</p>
<p>Still, the match is pretty close, so let’s go on to see how these papers
have aged since 2013.</p>
<h1 id="where-are-they-now">Where Are They Now?</h1>
<p>If we include citations up to the present day, only about 9.7% of these
2,100 papers have no citations, and about 39.1% have five or fewer.
Here’s the updated histogram:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-4-1.png" alt="" /></p>
<p>So it looks like the discrepancy with Weatherson’s result is (partly?)
down to the obvious thing. There hadn’t been enough time for the later
papers in Healy’s data set to accrue citations.</p>
<h1 id="cutting-out-supplements">Cutting Out Supplements</h1>
<p>A lot of these 2,100 papers are actually from the two supplements to
<em>Noûs</em>: <em>Philosophical Issues</em> and <em>Philosophical Perspectives</em>. And it
turns out they’re making a big difference.</p>
<p>Here’s what things look like when we cut supplements out:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-6-1.png" alt="" /></p>
<p>Now only 4.5% of our 1,694 papers have no citations to date, and just
31.5% have five or fewer.</p>
<h1 id="sliding-windows">Sliding Windows</h1>
<p>We’ve been looking at all citations accumulated to date for these
papers, which for older papers means 25 years’ worth of opportunity for
discussion. For more direct comparison to the <em>THE</em> analysis mentioned
at the outset, we can look at just the five-year window following each
paper’s publication.</p>
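<p>As a rough illustration of the windowing step, here is a minimal Python sketch. The toy records, the helper name <code>citations_in_window</code>, and the choice to count from the publication year through year five inclusive are all assumptions made for the example; the actual analysis used Web of Science data.</p>

```python
import numpy as np

# Hypothetical toy data: each paper's publication year, and the years in
# which it was cited. These are NOT the real Web of Science records.
pub_years = [1995, 2001, 2008, 2012]
citing_years = [
    [1997, 1999, 2004, 2010],                      # paper 0
    [2009],                                        # paper 1: first cited after the window
    [],                                            # paper 2: never cited
    [2013, 2014, 2014, 2015, 2016, 2017, 2019],    # paper 3
]

def citations_in_window(pub_year, cited_in, window=5):
    """Count citations dated from pub_year through pub_year + window (assumed bounds)."""
    return sum(pub_year <= year <= pub_year + window for year in cited_in)

counts = np.array([citations_in_window(p, c) for p, c in zip(pub_years, citing_years)])

uncited_rate = np.mean(counts == 0)        # share of papers with no citations in the window
five_or_fewer_rate = np.mean(counts <= 5)  # share with five or fewer
```

<p>Computing the same two rates separately for each publication year would give the kind of per-year series plotted further down.</p>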
<p>So, how many citations did these papers accrue just within five years of
being published? Looking at only the “core” papers again (no
supplements), 15.2% had no citations within five years of publication,
and 69.9% had five or fewer.</p>
<p>Once again that’s higher than the 11% (respectively 50%) found by
Weatherson. So we have to ask: are things changing? Are recent papers in
these journals accruing citations faster?</p>
<p>It seems so. Here are the “uncitation” rates for the years 1993–2013:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-8-1.png" alt="" /></p>
<p>And here are the “five or fewer citations” rates:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-9-1.png" alt="" /></p>
<p>My impression is that it doesn’t take as long as it used to for a paper
to move through a journal’s pipeline. Both review times and production
times may be getting shorter. Could that be why citations are piling up
faster for papers in these journals?</p>
<p>If so, we’d expect the effect to lessen when we look at a ten-year
window. We can only do that for papers published up to 2008, but let’s
go ahead:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-10-1.png" alt="" /><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-10-2.png" alt="" /></p>
<p>The trend still looks fairly significant, so maybe other factors are at work.
Is it just more papers coming out, leading to more citations? Are
citation practices changing? I’m not sure.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Weatherson looked at 391 articles published in 2012 in <em>Philosophical Review</em>, <em>Mind</em>, <em>Journal of Philosophy</em>, <em>Noûs</em>, <em>Philosophical Studies</em>, <em>Ethics</em>, <em>Philosophical Quarterly</em>, <em>Philosophy of Science</em>, and <em>Australasian Journal of Philosophy</em>.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
Building a Neural Network from Scratch: Part 2
http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%202/
Wed, 07 Mar 2018 00:00:00 -0500
<p>In this post we’ll improve our training algorithm from the <a href="http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%201/">previous post</a>. When we’re done we’ll be able to achieve 98% precision on the MNIST data set after just 9 epochs of training, which takes only about 30 seconds to run on my laptop.</p>
<p>For comparison, last time we only achieved 92% precision after 2,000 epochs of training, which took over an hour!</p>
<p>The main driver in this improvement is just switching from batch gradient descent to <em>mini</em>-batch gradient descent. But we’ll also make two other, smaller improvements: we’ll add momentum to our descent algorithm, and we’ll smarten up the initialization of our network’s weights.</p>
<p>We’ll also reorganize our code a bit while we’re at it, making things more modular.</p>
<p>But first we need to import and massage our data. These steps are the same as in the previous post:</p>
<pre><code class="language-python">from sklearn.datasets import fetch_mldata
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
# import
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
# scale
X = X / 255
# one-hot encode labels
digits = 10
examples = y.shape[0]
y = y.reshape(1, examples)
Y_new = np.eye(digits)[y.astype('int32')]
Y_new = Y_new.T.reshape(digits, examples)
# split, reshape, shuffle
m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
Y_train, Y_test = Y_new[:,:m], Y_new[:,m:]
shuffle_index = np.random.permutation(m)
X_train, Y_train = X_train[:, shuffle_index], Y_train[:, shuffle_index]
</code></pre>
<p>Then we’ll define our key functions. Only the last two are new, and they just put the steps of forward and backward propagation into their own functions. This tidies up the training code to follow, so that we can focus on the novel elements, especially mini-batch descent and momentum.</p>
<p>Notice that in the process we introduce three dictionaries: <code>params</code>, <code>cache</code>, and <code>grads</code>. These are for conveniently passing information back and forth between the forward and backward passes.</p>
<pre><code class="language-python">def sigmoid(z):
    s = 1. / (1. + np.exp(-z))
    return s

def compute_loss(Y, Y_hat):
    # multiclass cross-entropy, averaged over the examples (columns)
    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1./m) * L_sum
    return L

def feed_forward(X, params):
    cache = {}
    cache["Z1"] = np.matmul(params["W1"], X) + params["b1"]
    cache["A1"] = sigmoid(cache["Z1"])
    cache["Z2"] = np.matmul(params["W2"], cache["A1"]) + params["b2"]
    cache["A2"] = np.exp(cache["Z2"]) / np.sum(np.exp(cache["Z2"]), axis=0)
    return cache

def back_propagate(X, Y, params, cache):
    m_batch = X.shape[1]  # number of examples in this batch
    dZ2 = cache["A2"] - Y
    dW2 = (1./m_batch) * np.matmul(dZ2, cache["A1"].T)
    db2 = (1./m_batch) * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.matmul(params["W2"].T, dZ2)
    dZ1 = dA1 * sigmoid(cache["Z1"]) * (1 - sigmoid(cache["Z1"]))
    dW1 = (1./m_batch) * np.matmul(dZ1, X.T)
    db1 = (1./m_batch) * np.sum(dZ1, axis=1, keepdims=True)
    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
    return grads
</code></pre>
<p>Now for the substantive stuff.</p>
<p>To switch to mini-batch descent, we add another <code>for</code> loop inside the pass through each epoch. At each pass we randomly shuffle the training set, then iterate through it in chunks of <code>batch_size</code>, which we’ll arbitrarily set to 128. We’ll see the code for all this in a moment.</p>
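<p>Before the full training code, here is a stripped-down sketch of just the batching step. The generator name <code>iterate_minibatches</code> and the toy arrays are made up for illustration; the training loop in this post inlines the same idea rather than using a helper like this.</p>

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield shuffled (X_batch, Y_batch) pairs; examples are stored in columns."""
    m = X.shape[1]
    permutation = rng.permutation(m)
    for begin in range(0, m, batch_size):
        # Slicing past the end is safe: the last batch is simply smaller.
        idx = permutation[begin:begin + batch_size]
        yield X[:, idx], Y[:, idx]

# Toy data: 10 examples with 3 features each, and 2-row labels.
rng = np.random.default_rng(0)
X_toy = np.arange(30, dtype=float).reshape(3, 10)
Y_toy = np.vstack([np.arange(10) % 2, 1 - np.arange(10) % 2])

batch_sizes = [xb.shape[1] for xb, yb in iterate_minibatches(X_toy, Y_toy, 4, rng)]
```

<p>With 10 examples and a batch size of 4, one pass yields batches of 4, 4, and 2 examples, and every example is visited exactly once.</p>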
<p>Next, to add momentum, we keep a moving average of our gradients. So instead of updating our parameters by doing e.g.:</p>
<pre><code class="language-python">params["W1"] = params["W1"] - learning_rate * grads["dW1"]
</code></pre>
<p>we do this:</p>
<pre><code class="language-python">V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
params["W1"] = params["W1"] - learning_rate * V_dW1
</code></pre>
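<p>To see the update rule in isolation, here is a toy sketch that applies it to a single parameter. The objective $f(w) = w^2$, the starting point, and the smaller learning rate are assumptions chosen to keep the example tiny; they are not the settings used for the network.</p>

```python
beta = 0.9
learning_rate = 0.1
w = 5.0   # a single toy parameter
v = 0.0   # moving average of the gradient, playing the role of V_dW1

# Minimize f(w) = w**2, whose gradient is 2 * w.
for step in range(200):
    grad = 2.0 * w
    v = beta * v + (1.0 - beta) * grad   # blend the new gradient into the average
    w = w - learning_rate * v            # step along the averaged gradient
```

<p>Because each step follows the running average rather than the latest gradient alone, a single noisy mini-batch gradient moves the parameters less than it would under plain gradient descent.</p>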
<p>Finally, to smarten up our initialization, we shrink the variance of the weights in each layer. Following <a href="https://www.coursera.org/learn/deep-neural-network/lecture/RwqYe/weight-initialization-for-deep-networks" target="_blank">this nice video</a> by Andrew Ng (whose excellent Coursera materials I’ve been relying on heavily in these posts), we’ll set the variance for each layer to $1/n$, where $n$ is the number of inputs feeding into that layer.</p>
<p>We’ve been using the <code>np.random.randn()</code> function to get our initial weights. And this function draws from the standard normal distribution. So to adjust the variance to $1/n$, we just divide by $\sqrt{n}$. In code this means that instead of doing e.g. <code>np.random.randn(n_h, n_x)</code>, we do <code>np.random.randn(n_h, n_x) * np.sqrt(1. / n_x)</code>.</p>
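<p>A quick numerical sanity check of that scaling, as a sketch: the layer sizes match the ones used in this post, but the use of <code>np.random.default_rng</code> and the variable names are choices made just for this example.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 784, 64   # input size and hidden-layer size used in this post

W_plain = rng.standard_normal((n_h, n_x))
W_scaled = rng.standard_normal((n_h, n_x)) * np.sqrt(1.0 / n_x)

var_plain = W_plain.var()            # should be close to 1
var_scaled = W_scaled.var() * n_x    # close to 1, i.e. the variance is about 1/n_x
```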
<p>Ok that covers our three improvements. Let’s build and train!</p>
<pre><code class="language-python">np.random.seed(138)

# hyperparameters
n_x = X_train.shape[0]
n_h = 64
learning_rate = 4
beta = .9
batch_size = 128
batches = -(-m // batch_size)  # ceiling division

# initialization
params = { "W1": np.random.randn(n_h, n_x) * np.sqrt(1. / n_x),
           "b1": np.zeros((n_h, 1)) * np.sqrt(1. / n_x),
           "W2": np.random.randn(digits, n_h) * np.sqrt(1. / n_h),
           "b2": np.zeros((digits, 1)) * np.sqrt(1. / n_h) }

V_dW1 = np.zeros(params["W1"].shape)
V_db1 = np.zeros(params["b1"].shape)
V_dW2 = np.zeros(params["W2"].shape)
V_db2 = np.zeros(params["b2"].shape)

# train
for i in range(9):

    # reshuffle the training set at the start of each epoch
    permutation = np.random.permutation(X_train.shape[1])
    X_train_shuffled = X_train[:, permutation]
    Y_train_shuffled = Y_train[:, permutation]

    for j in range(batches):

        # slice out the next mini-batch (the end index is exclusive)
        begin = j * batch_size
        end = min(begin + batch_size, X_train.shape[1])
        X = X_train_shuffled[:, begin:end]
        Y = Y_train_shuffled[:, begin:end]
        m_batch = end - begin

        cache = feed_forward(X, params)
        grads = back_propagate(X, Y, params, cache)

        # momentum: update the moving averages of the gradients
        V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
        V_db1 = (beta * V_db1 + (1. - beta) * grads["db1"])
        V_dW2 = (beta * V_dW2 + (1. - beta) * grads["dW2"])
        V_db2 = (beta * V_db2 + (1. - beta) * grads["db2"])

        params["W1"] = params["W1"] - learning_rate * V_dW1
        params["b1"] = params["b1"] - learning_rate * V_db1
        params["W2"] = params["W2"] - learning_rate * V_dW2
        params["b2"] = params["b2"] - learning_rate * V_db2

    # report costs on the full training and test sets after each epoch
    cache = feed_forward(X_train, params)
    train_cost = compute_loss(Y_train, cache["A2"])
    cache = feed_forward(X_test, params)
    test_cost = compute_loss(Y_test, cache["A2"])
    print("Epoch {}: training cost = {}, test cost = {}".format(i+1, train_cost, test_cost))

print("Done.")
</code></pre>
<pre><code>Epoch 1: training cost = 0.15587093418167058, test cost = 0.16223940981168986
Epoch 2: training cost = 0.09417519634799829, test cost = 0.11032242938356147
Epoch 3: training cost = 0.07205872840102934, test cost = 0.0958078559246339
Epoch 4: training cost = 0.07008115814138867, test cost = 0.1010270024817398
Epoch 5: training cost = 0.05501929068580713, test cost = 0.09527116695490956
Epoch 6: training cost = 0.042663638371140164, test cost = 0.08268937190759178
Epoch 7: training cost = 0.03615501088752129, test cost = 0.08188431384719108
Epoch 8: training cost = 0.03610956910064329, test cost = 0.08675249924246693
Epoch 9: training cost = 0.027582647825206745, test cost = 0.08023754855128316
Done.
</code></pre>
<p>How’d we do?</p>
<pre><code class="language-python">cache = feed_forward(X_test, params)
predictions = np.argmax(cache["A2"], axis=0)
labels = np.argmax(Y_test, axis=0)
print(classification_report(predictions, labels))
</code></pre>
<pre><code>             precision    recall  f1-score   support

          0       0.99      0.98      0.98       984
          1       0.99      0.99      0.99      1136
          2       0.98      0.98      0.98      1037
          3       0.97      0.98      0.97      1004
          4       0.97      0.98      0.98       970
          5       0.97      0.96      0.97       900
          6       0.97      0.99      0.98       942
          7       0.97      0.97      0.97      1026
          8       0.97      0.96      0.97       982
          9       0.97      0.96      0.97      1019

avg / total       0.98      0.98      0.98     10000
</code></pre>
<p>And there it is: 98% precision in just 9 epochs of training.</p>
Building a Neural Network from Scratch: Part 1
http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%201/
Mon, 05 Mar 2018 00:00:00 -0500
<p>In this post we’re going to build a neural network from scratch. We’ll train it to recognize hand-written digits, using the famous MNIST data set.</p>
<p>We’ll use just basic Python with NumPy to build our network (no high-level stuff like Keras or TensorFlow). We will dip into scikit-learn, but only to get the MNIST data and to assess our model once it’s built.</p>
<p>We’ll start with the simplest possible “network”: a single node that recognizes just the digit 0. This is actually just an implementation of logistic regression, which may seem kind of silly. But it’ll help us get some key components working before things get more complicated.</p>
<p>Then we’ll extend that into a network with one hidden layer, still recognizing just 0. Then we’ll add a softmax for recognizing all the digits 0 through 9. That’ll give us a 92% accurate digit-recognizer, bringing us up to the cutting edge of 1985 technology.</p>
<p>In a followup post we’ll bring that up into the high nineties by making sundry improvements: better optimization, more hidden layers, and smarter initialization.</p>
<h1 id="1-hello-mnist">1. Hello, MNIST</h1>
<p><a href="https://en.wikipedia.org/wiki/MNIST_database" target="_blank">MNIST</a> contains 70,000 images of hand-written digits, each 28 x 28 pixels, in greyscale with pixel-values from 0 to 255. We could <a href="http://yann.lecun.com/exdb/mnist/" target="_blank">download</a> and preprocess the data ourselves. But the makers of scikit-learn already did that for us. Since it would be rude to neglect their efforts, we’ll just import it:</p>
<pre><code class="language-python">from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
</code></pre>
<p>We’ll normalize the data to keep our gradients manageable:</p>
<pre><code class="language-python">X = X / 255
</code></pre>
<p>The default MNIST labels record <code>7</code> for an image of a seven, <code>4</code> for an image of a four, etc. But we’re just building a zero-classifier for now. So we want our labels to say <code>1</code> when we have a zero, and <code>0</code> otherwise (intuitive, I know). So we’ll overwrite the labels to make that happen:</p>
<pre><code class="language-python">import numpy as np
y_new = np.zeros(y.shape)
y_new[np.where(y == 0.0)[0]] = 1
y = y_new
</code></pre>
<p>Now we can make our train/test split. The MNIST images are pre-arranged so that the first 60,000 can be used for training, and the last 10,000 for testing. We’ll also transform the data into the shape we want, with each example in a column (instead of a row):</p>
<pre><code class="language-python">m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
y_train, y_test = y[:m].reshape(1,m), y[m:].reshape(1,m_test)
</code></pre>
<p>Finally we’ll shuffle the training set for good measure:</p>
<pre><code class="language-python">np.random.seed(138)
shuffle_index = np.random.permutation(m)
X_train, y_train = X_train[:,shuffle_index], y_train[:,shuffle_index]
</code></pre>
<p>Let’s have a look at a random image and label just to make sure we didn’t throw anything out of whack:</p>
<pre><code class="language-python">%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
i = 3
plt.imshow(X_train[:,i].reshape(28,28), cmap = matplotlib.cm.binary)
plt.axis("off")
plt.show()
print(y_train[:,i])
</code></pre>
<p><img src="http://jonathanweisberg.org/img/nn_from_scratch/output_13_0.png" alt="png" /></p>
<pre><code>[1.]
</code></pre>
<p>That’s a zero, so we want the label to be <code>1</code>, which it is. Looks good, so let’s build our first network.</p>
<h1 id="2-a-single-neuron-aka-logistic-regression">2. A Single Neuron (aka Logistic Regression)</h1>
<p>We want to build a simple, feed-forward network with 784 inputs (=28 x 28), and a single sigmoid unit generating the output.</p>
<h2 id="2-1-forward-propogation">2.1 Forward Propagation</h2>
<p>The forward pass on a single example $x$ executes the following computation:
$$ \hat{y} = \sigma(w^T x + b). $$
Here $\sigma$ is the sigmoid function:
$$ \sigma(z) = \frac{1}{1 + e^{-z}}. $$
So let’s define:</p>
<pre><code class="language-python">def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s
</code></pre>
<p>We’ll vectorize by stacking examples side-by-side, so that our input matrix $X$ has an example in each column. The vectorized form of the forward pass is then:
$$ \hat{y} = \sigma(w^T X + b). $$
Note that $\hat{y}$ is now a vector, not a scalar as it was in the previous equation.</p>
<p>In our code we’ll compute this in two stages: <code>Z = np.matmul(W.T, X) + b</code> and then <code>A = sigmoid(Z)</code>. (<code>A</code> for Activation.) Breaking things up into stages like this is just for tidiness—it’ll make our forward propagation computations mirror the steps in our backward propagation computations.</p>
<h2 id="2-2-cost-function">2.2 Cost Function</h2>
<p>We’ll use cross-entropy for our cost function. The formula for a single training example is:
$$ L(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}). $$
Averaging over a training set of $m$ examples we then have:
$$ L(Y, \hat{Y}) = -\frac{1}{m} \sum_{i=1}^m \left( y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right). $$
So let’s define:</p>
<pre><code class="language-python">def compute_loss(Y, Y_hat):
    m = Y.shape[1]
    L = -(1./m) * ( np.sum( np.multiply(np.log(Y_hat),Y) ) + np.sum( np.multiply(np.log(1-Y_hat),(1-Y)) ) )
    return L
</code></pre>
<h2 id="2-3-backward-propagation">2.3 Backward Propagation</h2>
<p>For backpropagation, we’ll need to know how $L$ changes with respect to each component $w_j$ of $w$. That is, we must compute each $\partial L / \partial w_j$.</p>
<p>Focusing on a single example will make it easier to derive the formulas we need. Holding all values except $w_j$ fixed, we can think of $L$ as being computed in three steps: $w_j \rightarrow z \rightarrow \hat{y} \rightarrow L$. The formulas for these steps are:
$$
\begin{align}
z &= w^T x + b,\newline
\hat{y} &= \sigma(z),\newline
L(y, \hat{y}) &= -y \log(\hat{y}) - (1-y) \log(1-\hat{y}).
\end{align}
$$
And the chain rule tells us:
$$
\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w_j}.
$$
Looking at $\partial L / \partial \hat{y}$ first:
$$
\begin{align}
\frac{\partial L}{\partial \hat{y}} &= \frac{\partial}{\partial \hat{y}} \left( -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) \right)\newline
&= -y \frac{\partial}{\partial \hat{y}} \log(\hat{y}) - (1-y) \frac{\partial}{\partial \hat{y}} \log(1-\hat{y})\newline
&= \frac{-y}{\hat{y}} + \frac{(1-y) }{1 - \hat{y}}\newline
&= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}.
\end{align}
$$
Next we want $\partial \hat{y} / \partial z$:
$$
\begin{align}
\frac{\partial}{\partial z} \sigma(z) &= \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right)\newline
&= - \frac{1}{(1 + e^{-z})^2} \frac{\partial}{\partial z} \left( 1 + e^{-z} \right)\newline
&= \frac{e^{-z}}{(1 + e^{-z})^2}\newline
&= \frac{1}{1 + e^{-z}} \frac{e^{-z}}{1 + e^{-z}}\newline
&= \sigma(z) \frac{e^{-z}}{1 + e^{-z}}\newline
&= \sigma(z) \left( 1 - \frac{1}{1 + e^{-z}} \right)\newline
&= \sigma(z) \left( 1 - \sigma(z) \right)\newline
&= \hat{y} (1-\hat{y}).
\end{align}
$$
Lastly we tackle $\partial z / \partial w_j$:
$$
\begin{align}
\frac{\partial}{\partial w_j} (w^T x + b) &= \frac{\partial}{\partial w_j} (w_0 x_0 + \ldots + w_n x_n + b)\newline
&= x_j.
\end{align}
$$
Finally we can substitute into the chain rule to find:
$$
\begin{align}
\frac{\partial L}{\partial w_j} &= \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w_j}\newline
&= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \hat{y} (1-\hat{y}) x_j\newline
&= (\hat{y} - y) x_j.\newline
\end{align}
$$
In vectorized form with $m$ training examples this gives us:
$$
\frac{\partial L}{\partial w} = \frac{1}{m} X (\hat{y} - y)^T.
$$
What about $\partial L / \partial b$? A very similar derivation yields, for a single example:
$$
\begin{align}
\frac{\partial L}{\partial b} &= (\hat{y} - y).
\end{align}
$$
Which in vectorized form amounts to:
$$
\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}).
$$
In our code we’ll label these gradients according to their denominators, as <code>dW</code> and <code>db</code>. So for backpropagation we’ll compute <code>dW = (1/m) * np.matmul(X, (A-Y).T)</code> and <code>db = (1/m) * np.sum(A-Y, axis=1, keepdims=True)</code>.</p>
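<p>One way to gain confidence in the vectorized gradients <code>dW</code> and <code>db</code> is to compare them against finite differences on a tiny random problem. The following is a sketch under assumed toy sizes (4 inputs, 6 examples); the helper <code>loss</code> and the random data are made up for the check and are not part of the network code.</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(W, b, X, Y):
    """Average binary cross-entropy of the single-neuron model."""
    A = sigmoid(np.matmul(W.T, X) + b)
    m = Y.shape[1]
    return -(1.0 / m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))

rng = np.random.default_rng(0)
n_x, m = 4, 6                                   # toy sizes, chosen arbitrarily
X = rng.standard_normal((n_x, m))
Y = (rng.random((1, m)) > 0.5).astype(float)
W = rng.standard_normal((n_x, 1)) * 0.1
b = np.zeros((1, 1))

# Analytic gradients, using the same expressions as the training code.
A = sigmoid(np.matmul(W.T, X) + b)
dW = (1.0 / m) * np.matmul(X, (A - Y).T)
db = (1.0 / m) * np.sum(A - Y, axis=1, keepdims=True)

# Central finite differences, perturbing one weight at a time.
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(n_x):
    step = np.zeros_like(W)
    step[j, 0] = eps
    dW_numeric[j, 0] = (loss(W + step, b, X, Y) - loss(W - step, b, X, Y)) / (2 * eps)
db_numeric = (loss(W, b + eps, X, Y) - loss(W, b - eps, X, Y)) / (2 * eps)

max_dW_error = np.max(np.abs(dW - dW_numeric))
db_error = abs(db[0, 0] - db_numeric)
```

<p>If the formulas are right, both errors should be tiny compared with the gradients themselves.</p>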
<h2 id="2-4-build-train">2.4 Build & Train</h2>
<p>Ok we’re ready to build and train our network!</p>
<pre><code class="language-python">learning_rate = 1

X = X_train
Y = y_train

n_x = X.shape[0]
m = X.shape[1]

W = np.random.randn(n_x, 1) * 0.01
b = np.zeros((1, 1))

for i in range(2000):
    # forward pass
    Z = np.matmul(W.T, X) + b
    A = sigmoid(Z)

    cost = compute_loss(Y, A)

    # backward pass
    dW = (1/m) * np.matmul(X, (A-Y).T)
    db = (1/m) * np.sum(A-Y, axis=1, keepdims=True)

    # gradient descent update
    W = W - learning_rate * dW
    b = b - learning_rate * db

    if (i % 100 == 0):
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 0.6840801595436431
Epoch 100 cost: 0.041305162058342754
... *snip* ...
Final cost: 0.02514156608481825
</code></pre>
<p>We could probably eke out a bit more accuracy with some more training. But the gains have slowed considerably. So let’s just see how we did, by looking at the confusion matrix:</p>
<pre><code class="language-python">from sklearn.metrics import classification_report, confusion_matrix
Z = np.matmul(W.T, X_test) + b
A = sigmoid(Z)
predictions = (A>.5)[0,:]
labels = (y_test == 1)[0,:]
print(confusion_matrix(predictions, labels))
</code></pre>
<pre><code>[[8980 33]
[ 40 947]]
</code></pre>
<p>Hey, that’s actually pretty good! We got 947 of the zeros and missed only 33, while getting nearly all the negative cases right. In terms of f1-score that’s 0.99:</p>
<pre><code class="language-python">print(classification_report(predictions, labels))
</code></pre>
<pre><code>             precision    recall  f1-score   support

      False       1.00      1.00      1.00      9013
       True       0.97      0.96      0.96       987

avg / total       0.99      0.99      0.99     10000
</code></pre>
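<p>To make the arithmetic behind those scores explicit, here is a small sketch that recomputes them for the zero class from the confusion matrix printed above. One thing to keep in mind: scikit-learn’s <code>confusion_matrix</code> and <code>classification_report</code> take the true labels as their first argument, so with predictions passed first (as in the calls above) the rows of the matrix index predictions, and the report’s precision and recall columns trade places relative to the usual definitions used here. The variable names are just for illustration.</p>

```python
import numpy as np

# Confusion matrix printed above. With the argument order used in the post,
# rows index the model's predictions and columns index the true labels.
cm = np.array([[8980, 33],
               [40, 947]])

true_positives = cm[1, 1]    # predicted zero, really a zero
false_positives = cm[1, 0]   # predicted zero, not a zero
false_negatives = cm[0, 1]   # predicted not-zero, really a zero

precision = true_positives / (true_positives + false_positives)   # 947 / 987
recall = true_positives / (true_positives + false_negatives)      # 947 / 980
f1 = 2 * precision * recall / (precision + recall)
```

<p>The f1-score is symmetric in precision and recall, so it comes out the same either way.</p>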
<p>So, now that we’ve got a working model and optimization algorithm, let’s enrich it.</p>
<h1 id="3-one-hidden-layer">3. One Hidden Layer</h1>
<p>Let’s add a hidden layer now, with 64 units (a mostly arbitrary choice). I won’t go through the derivations of all the formulas for the forward and backward passes this time; they’re a pretty direct extension of the work we did earlier. Instead let’s just dive right in and build the model:</p>
<pre><code class="language-python">X = X_train
Y = y_train

n_x = X.shape[0]
n_h = 64
learning_rate = 1

W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h)
b2 = np.zeros((1, 1))

for i in range(2000):

    # forward pass
    Z1 = np.matmul(W1, X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2, A1) + b2
    A2 = sigmoid(Z2)

    cost = compute_loss(Y, A2)

    # backward pass
    dZ2 = A2-Y
    dW2 = (1./m) * np.matmul(dZ2, A1.T)
    db2 = (1./m) * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.matmul(W2.T, dZ2)
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = (1./m) * np.matmul(dZ1, X.T)
    db1 = (1./m) * np.sum(dZ1, axis=1, keepdims=True)

    # gradient descent update
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1

    if i % 100 == 0:
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 0.9144384083567224
Epoch 100 cost: 0.08856953026938433
... *snip* ...
Final cost: 0.024249298861903648
</code></pre>
<p>How’d we do?</p>
<pre><code class="language-python">Z1 = np.matmul(W1, X_test) + b1
A1 = sigmoid(Z1)
Z2 = np.matmul(W2, A1) + b2
A2 = sigmoid(Z2)
predictions = (A2>.5)[0,:]
labels = (y_test == 1)[0,:]
print(confusion_matrix(predictions, labels))
print(classification_report(predictions, labels))
</code></pre>
<pre><code>[[8984 36]
[ 36 944]]

             precision    recall  f1-score   support

      False       1.00      1.00      1.00      9020
       True       0.96      0.96      0.96       980

avg / total       0.99      0.99      0.99     10000
</code></pre>
<p>Hmm, not bad, but about the same as our one-neuron model did. We could do more training and add more nodes/layers. But it’ll be slow going until we improve our optimization algorithm, which we’ll do in a followup post.</p>
<p>So for now let’s turn to recognizing all ten digits.</p>
<h1 id="4-upgrading-to-multiclass">4. Upgrading to Multiclass</h1>
<p><img src="https://medias.spotern.com/spots/w1280/3477.jpg" alt="" /></p>
<h2 id="4-1-labels">4.1 Labels</h2>
<p>First we need to redo our labels. We’ll re-import everything, so that we don’t have to go back and coordinate with our earlier shuffling:</p>
<pre><code class="language-python">mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X = X / 255
</code></pre>
<p>Then we’ll one-hot encode MNIST’s labels, to get a 10 x 70,000 array.</p>
<pre><code class="language-python">digits = 10
examples = y.shape[0]
y = y.reshape(1, examples)
Y_new = np.eye(digits)[y.astype('int32')]
Y_new = Y_new.T.reshape(digits, examples)
</code></pre>
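<p>The indexing trick is easier to see on a handful of labels. This is a toy sketch with four made-up labels; it skips the intermediate reshape used above, since the labels here are already a flat array.</p>

```python
import numpy as np

digits = 10
y_toy = np.array([3., 0., 9., 3.])   # toy labels, stored as floats like MNIST's

# Row k of the identity matrix is the one-hot vector for digit k,
# so indexing np.eye(digits) with the labels picks out one row per example.
Y_rows = np.eye(digits)[y_toy.astype('int32')]   # shape (4, 10): one example per row
Y_cols = Y_rows.T                                # shape (10, 4): one example per column
```

<p>Taking <code>np.argmax(Y_cols, axis=0)</code> recovers the original labels, which is how the labels are decoded later when scoring the model.</p>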
<p>Then we re-split, re-shape, and re-shuffle our training set:</p>
<pre><code class="language-python">m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
Y_train, Y_test = Y_new[:,:m], Y_new[:,m:]
shuffle_index = np.random.permutation(m)
X_train, Y_train = X_train[:, shuffle_index], Y_train[:, shuffle_index]
</code></pre>
<p>A quick check that things are as they should be:</p>
<pre><code class="language-python">i = 12
plt.imshow(X_train[:,i].reshape(28,28), cmap = matplotlib.cm.binary)
plt.axis("off")
plt.show()
Y_train[:,i]
</code></pre>
<p><img src="http://jonathanweisberg.org/img/nn_from_scratch/output_43_0.png" alt="png" /></p>
<pre><code>array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])
</code></pre>
<p>Looks good, so let’s consider what changes we need to make to the model itself.</p>
<h2 id="4-2-forward-propagation">4.2 Forward Propagation</h2>
<p>Only the last layer of our network is changing. To add the softmax, we have to replace our lone, final node with a 10-unit layer. Its final activations are the exponentials of its $z$-values, normalized across all ten such exponentials. So instead of just computing $\sigma(z)$, we compute the activation for each unit $i$:
$$ \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}.$$
So, in our vectorized code, the last line of forward propagation will be <code>A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)</code>.</p>
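<p>One practical caveat: <code>np.exp</code> overflows for large $z$-values. A common variant, sketched here as an optional alternative rather than what this post’s code does, subtracts each column’s maximum before exponentiating. The result is mathematically the same because the shift cancels in the ratio. The function name <code>softmax</code> and the toy inputs are assumptions for the example.</p>

```python
import numpy as np

def softmax(Z):
    """Column-wise softmax; subtracting each column's max avoids overflow in np.exp."""
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    expZ = np.exp(shifted)
    return expZ / np.sum(expZ, axis=0, keepdims=True)

# Two toy columns that differ only by a constant shift of 999.
Z_toy = np.array([[1.0, 1000.0],
                  [2.0, 1001.0],
                  [3.0, 1002.0]])
A_toy = softmax(Z_toy)
```

<p>Both columns produce the same activations, and each column sums to 1, whereas the unshifted formula would overflow on the second column.</p>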
<h2 id="4-3-cost-function">4.3 Cost Function</h2>
<p>Our cost function now has to generalize to more than two classes. The general formula for $n$ classes is:
$$ L(y, \hat{y}) = -\sum_{i = 0}^{n-1} y_i \log(\hat{y}_i). $$
Averaging over $m$ training examples this becomes:
$$ L(Y, \hat{Y}) = - \frac{1}{m} \sum_{j = 1}^m \sum_{i = 0}^{n-1} y_i^{(j)} \log(\hat{y}_i^{(j)}). $$
So let’s define:</p>
<pre><code class="language-python">def compute_multiclass_loss(Y, Y_hat):
    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1/m) * L_sum
    return L
</code></pre>
<h2 id="4-4-backprop">4.4 Backprop</h2>
<p>Luckily it turns out that backprop isn’t really affected by the switch to a softmax. A softmax generalizes the sigmoid activation we’ve been using, and in such a way that the code we wrote earlier still works. We could verify this by deriving:
$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i.$$
But I won’t walk through the steps here. Let’s just go ahead and build our final network.</p>
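<p>Short of doing the derivation, the identity can at least be checked numerically. The sketch below compares $\hat{y} - y$ against finite differences of the cross-entropy for one random example; the helper names, the random $z$-values, and the choice of class 6 as the true label are all assumptions for the check.</p>

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))   # shift for numerical stability
    return e / np.sum(e)

def cross_entropy(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(10)     # z-values of the 10 output units for one example
y = np.zeros(10)
y[6] = 1.0                      # one-hot label

analytic = softmax(z) - y       # the claimed gradient: y_hat - y

# Central finite differences, perturbing one z-value at a time.
eps = 1e-6
numeric = np.zeros_like(z)
for i in range(10):
    step = np.zeros_like(z)
    step[i] = eps
    numeric[i] = (cross_entropy(z + step, y) - cross_entropy(z - step, y)) / (2 * eps)

max_error = np.max(np.abs(analytic - numeric))
```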
<h2 id="4-5-build-train">4.5 Build & Train</h2>
<pre><code class="language-python">n_x = X_train.shape[0]
n_h = 64
learning_rate = 1

W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(digits, n_h)
b2 = np.zeros((digits, 1))

X = X_train
Y = Y_train

for i in range(2000):

    # forward pass, with a softmax output layer
    Z1 = np.matmul(W1,X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2,A1) + b2
    A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)

    cost = compute_multiclass_loss(Y, A2)

    # backward pass
    dZ2 = A2-Y
    dW2 = (1./m) * np.matmul(dZ2, A1.T)
    db2 = (1./m) * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.matmul(W2.T, dZ2)
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = (1./m) * np.matmul(dZ1, X.T)
    db1 = (1./m) * np.sum(dZ1, axis=1, keepdims=True)

    # gradient descent update
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1

    if (i % 100 == 0):
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 9.243960401572568
... *snip* ...
Epoch 1900 cost: 0.24585173887243117
Final cost: 0.24072776877870128
</code></pre>
<p>Let’s see how we did:</p>
<pre><code class="language-python">Z1 = np.matmul(W1, X_test) + b1
A1 = sigmoid(Z1)
Z2 = np.matmul(W2, A1) + b2
A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)
predictions = np.argmax(A2, axis=0)
labels = np.argmax(Y_test, axis=0)
print(confusion_matrix(predictions, labels))
print(classification_report(predictions, labels))
</code></pre>
<pre><code>[[ 946 0 14 3 3 10 12 2 9 4]
[ 0 1112 3 2 1 1 2 8 3 4]
[ 3 4 937 24 10 7 8 18 8 3]
[ 4 2 17 924 1 39 4 13 26 9]
[ 0 1 10 0 905 9 11 9 10 40]
[ 12 5 2 26 3 786 15 3 24 14]
[ 8 1 19 2 9 10 902 1 9 1]
[ 2 1 13 14 3 5 1 946 9 25]
[ 5 9 16 11 5 18 3 5 868 9]
[ 0 0 1 4 42 7 0 23 8 900]]
precision recall f1-score support
0 0.97 0.94 0.95 1003
1 0.98 0.98 0.98 1136
2 0.91 0.92 0.91 1022
3 0.91 0.89 0.90 1039
4 0.92 0.91 0.92 995
5 0.88 0.88 0.88 890
6 0.94 0.94 0.94 962
7 0.92 0.93 0.92 1019
8 0.89 0.91 0.90 949
9 0.89 0.91 0.90 985
avg / total 0.92 0.92 0.92 10000
</code></pre>
<p>We’re at 92% accuracy across all digits, not bad! And it looks like we could still improve with more training.</p>
<p>But let’s work on speeding up our optimization algorithm first. We’ll pick things up there in the next post.</p>
Call for Papers: Formal Epistemology Workshop (FEW) 2018
http://jonathanweisberg.org/post/CFP%20FEW%202018/
Thu, 21 Dec 2017 10:36:00 -0500http://jonathanweisberg.org/post/CFP%20FEW%202018/<p><strong>Location:</strong> University of Toronto<br />
<strong>Dates:</strong> June 12–14, 2018<br />
<strong>Keynote Speakers:</strong> <a href="http://www.larabuchak.net/" target="_blank">Lara Buchak</a> and <a href="https://sites.google.com/site/michaeltitelbaum/" target="_blank">Mike Titelbaum</a><br />
<strong>Submission Deadline:</strong> February 12, 2018<br />
<strong>Authors Notified:</strong> March 31, 2018</p>
<p>We are pleased to invite papers in formal epistemology, broadly construed to include related areas of philosophy as well as cognate disciplines like statistics, psychology, economics, computer science, and mathematics.</p>
<p>Submissions should be:</p>
<ol>
<li>prepared for anonymous review,</li>
<li>no more than 6,000 words,</li>
<li>accompanied by an abstract of up to 300 words, and</li>
<li>in PDF format.</li>
</ol>
<p>Submission is via the <a href="https://easychair.org/conferences/?conf=few2018" target="_blank">EasyChair website</a>.</p>
<p>The final selection of the program will be made with an eye to diversity. We especially encourage submissions from PhD candidates, early career researchers, and members of groups underrepresented in academic philosophy.</p>
<p>Some funds are available to reimburse speakers’ travel expenses. The available amounts are still being determined, but we hope to cover most/all expenses for student and early career speakers. Childcare can also be arranged.</p>
<p>The <a href="http://jonathanweisberg.org/few2018" target="_blank">conference website is here</a>. The contact address is <a href="mailto:few2018toronto@gmail.com" target="_blank">few2018toronto@gmail.com</a>. The local organizers are <a href="http://www.davidjamesbar.net/" target="_blank">David James Barnett</a>, <a href="http://individual.utoronto.ca/jnagel/Home_Page.html" target="_blank">Jennifer Nagel</a>, and <a href="http://jonathanweisberg.org/" target="_blank">Jonathan Weisberg</a>.</p>
REU Redux: Allais All Over Again
http://jonathanweisberg.org/post/REU%20Redeux/
Tue, 26 Sep 2017 20:24:04 -0500http://jonathanweisberg.org/post/REU%20Redeux/
<p><em>This post is coauthored with <a href="http://johannathoma.com/" target="_blank">Johanna Thoma</a> and cross-posted at <a href="https://choiceinference.wordpress.com/" target="_blank">Choice & Inference</a>. Accompanying Mathematica code is available on <a href="https://github.com/jweisber/reu" target="_blank">GitHub</a>.</em></p>
<p>Lara Buchak’s <a href="http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199672165.001.0001/acprof-9780199672165" target="_blank"><em>Risk & Rationality</em></a> advertises REU theory as able to recover the modal preferences in the Allais paradox. In <a href="https://link.springer.com/content/pdf/10.1007%2Fs11098-017-0916-3.pdf" target="_blank">our commentary</a> we challenged this claim. We pointed out that REU theory is strictly <a href="https://johannathoma.files.wordpress.com/2015/08/decision-theory-open-handbook-edit.pdf#page=11" target="_blank">“grand-world”</a>, and in the grand-world setting it actually struggles with the Allais preferences.</p>
<p>To demonstrate, we constructed a grand-world model of the Allais problem. We replaced each small-world outcome with a normal distribution whose mean matches its utility, and whose height corresponds to its probability.</p>
<p>Take for example the Allais gamble:
$$(\$0, .01; \$1M, .89; \$5M, .1).$$
If we adopt <em>Risk & Rationality</em>’s utility assignments:
$$u(\$0) = 0, u(\$1M) = 1, u(\$5M) = 2,$$
we can depict the small-world version of this gamble:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig1.png" alt="" /></p>
<p>On our grand-world model this becomes:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig2.png" alt="" /></p>
<p>And REU theory fails to predict the usual Allais preferences on this model, provided the normal distributions used are minimally spread out.</p>
<p>If we squeeze the normal distributions tight enough, the grand-world problem collapses into the small-world problem, and REU theory can recover the Allais preferences. But, we showed, they’d have to be squeezed absurdly tight. A small standard deviation like $\sigma = .1$ lets REU theory recover the Allais preferences.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> But it also requires outlandish certainty that a windfall of $\$$1M will lead to a better life than the one you’d expect to lead without it. The probability of a life of utility at most 0, despite winning $\$$1M, would have to be smaller than $1 \times 10^{-23}$.<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> Yet the chances are massively greater than that of suffering life-ruining tragedy (illness, financial ruin… <em>Game of Thrones</em> ending happily ever after, etc.).</p>
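<p>For readers without <em>Mathematica</em>, the $1 \times 10^{-23}$ bound can be cross-checked with Python’s standard library. This is only a sketch: the helper <code>normal_cdf</code> is ours, written in terms of the complementary error function so that it stays accurate far out in the tail.</p>
<pre><code class="language-python">import math

def normal_cdf(x, mean, sd):
    # Probability that a Normal(mean, sd) variable is at most x.
    # erfc keeps precision in the far tail, where 1 - erf would round to 0.
    return 0.5 * math.erfc((mean - x) / (sd * math.sqrt(2)))

# Chance of a life of utility at most 0 despite winning $1M,
# when the utility of $1M is modelled as Normal(1, .1).
p_ruin = normal_cdf(0, mean=1, sd=0.1)
print(p_ruin)
</code></pre>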
<p>In response Buchak offers <a href="https://link.springer.com/content/pdf/10.1007%2Fs11098-017-0907-4.pdf" target="_blank">two replies</a>. The first is a technical maneuver, adjusting the model parameters. The second is more philosophical, adjusting the target interpretation of the Allais paradox instead.</p>
<h1 id="first-reply">First Reply</h1>
<p>Buchak’s first reply tweaks our model in two ways. First, the mean utility of winning $\$$5M is shifted from 2 down to 1.3. Second, all normal distributions are skewed by a factor of 5 (positive 5 for utility 0, negative otherwise). So, for example, the Allais gamble pictured above becomes:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig3.png" alt="" /></p>
<p>We’ll focus on the second tweak here, the introduction of skew. It rests on a technical error, as we’ll show momentarily. But it also wants for motivation.</p>
<h2 id="motivational-problems">Motivational Problems</h2>
<p>Why should the grand-world model be skewed? And why in this particular way? Buchak writes:</p>
<blockquote>
<p>[…] receiving $\$$1M makes the worst possibilities much less likely. Receiving $\$$1M provides security in the sense of making the probability associated with lower utility values smaller and smaller. The utility of $\$$1M is concentrated around a high mean with a long tail to the left: things likely will be great, though there is some small and diminishing chance they will be fine but not great. Similarly, the utility of $\$$0 is concentrated around a low mean with a long tail to the right: things likely will be fine but not great, though there is some small and diminishing chance they will be great. In other words, $\$$1M (and $\$$5M) is a gamble with negative skew, and $\$$0 is a gamble with positive skew […] (p. 2401)</p>
</blockquote>
<p>But this passage never actually identifies any asymmetry in the phenomena we’re modeling. True, “receiving $\$$1M makes the worst possibilities much less likely”, but it also makes the best possibilities much more likely. Likewise, “[r]eceiving $\$$1M provides security in the sense of making the probability associated with lower utility values smaller and smaller.” But $\$$1M also makes the probability associated with higher utility values larger. And so on.</p>
<p>The tendency of large winnings to control bad outcomes and promote good outcomes was already captured in the original model. A normal distribution centered on utility 1 already admits “some small and diminishing chance that [things] will be fine but not great.” It just also admits some small chance that things will be much better than great, since it’s symmetric around utility 1. To motivate the skewed model, we’d need some reason to think this symmetry should not hold. But none has been given.</p>
<h2 id="technical-difficulties">Technical Difficulties</h2>
<p>Motivation aside, there is a technical fault in the skewed model.</p>
<p>Introducing skew is supposed to make room for a reasonably large standard deviation while still recovering the Allais preferences. Buchak advertises a standard deviation<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> of $\sigma = .17$ for the skewed model, but the true value is actually $.106$—essentially the same as the $.1$ value Buchak concedes is implausibly small, and seeks to avoid by introducing skew.<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup></p>
<p>Where does the $.17$ figure come from then? It’s the <a href="https://en.wikipedia.org/wiki/Scale_parameter" target="_blank">scale parameter</a> of the skew normal distribution, often denoted $\omega$. For an ordinary normal distribution, the scale $\omega$ famously coincides with the standard deviation $\sigma$, and so we write $\sigma$ for both. But when we skew a normal distribution, we tighten it, shrinking the standard deviation:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig4.png" alt="" /></p>
<p>The distributions in this figure share the same scale parameter ($.17$) but the skewed one (yellow) is much narrower.<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">5</a></sup></p>
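<p>The gap between scale and standard deviation can also be read off the textbook closed forms for the skew normal distribution. With location $\xi$, scale $\omega$, shape $\alpha$, and $\delta = \alpha / \sqrt{1 + \alpha^2}$, the mean is $\xi + \omega \delta \sqrt{2/\pi}$ and the standard deviation is $\omega \sqrt{1 - 2\delta^2/\pi}$. Here is a minimal Python sketch of that calculation; the function name is ours, and we assume the parameters are given in that order (location, scale, shape).</p>
<pre><code class="language-python">import math

def skew_normal_mean_sd(location, scale, shape):
    # Closed-form mean and standard deviation of a skew normal distribution.
    delta = shape / math.sqrt(1 + shape ** 2)
    mean = location + scale * delta * math.sqrt(2 / math.pi)
    sd = scale * math.sqrt(1 - 2 * delta ** 2 / math.pi)
    return mean, sd

# The $1M distribution in the skewed model: location 1, scale .17, skew -5.
mean, sd = skew_normal_mean_sd(1, 0.17, -5)
print(mean, sd)

# With no skew, the scale and the standard deviation coincide.
print(skew_normal_mean_sd(1, 0.17, 0))
</code></pre>
<p>With skew $-5$ the standard deviation comes out at about $.106$, well below the scale of $.17$, and the mean drops below the location of $1$.</p>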
<p>Unfortunately, <em>Mathematica</em> uses $\sigma$ for the scale parameter even in skewed normal distributions, giving the misleading impression that it’s still the standard deviation.</p>
<p>What really matters, of course, isn’t the value of the standard deviation itself, but the probabilities that result from whatever parameters we choose. And Buchak argues that her model avoids the implausible probabilities we cited in the introduction. How can this be?</p>
<p>Buchak says that the skewed model has “more overlap in the utility that $\$$0 and $\$$1M might deliver”:</p>
<blockquote>
<p>[…] there is a 0.003 probability that the $\$$0 gamble will deliver more than 0.5 utils, and a 0.003 probability that the $\$$1M gamble will deliver less than 0.5 utils. (p. 2402)</p>
</blockquote>
<p>But this “overlap” was never the problematic quantity. The problem was, rather, that a small standard deviation like $.1$ requires you to think it less than $1 \times 10^{-23}$ likely you will end up with a life no better than $0$ utils, despite a $\$$1M windfall.</p>
<p>On Buchak’s model this probability is still absurdly small: $4 \times 10^{-9}$.<sup class="footnote-ref" id="fnref:6"><a rel="footnote" href="#fn:6">6</a></sup> This is a considerable improvement over $1 \times 10^{-23}$, but it’s still not plausible. For example, it’s almost $300,000$ times more likely that one author of this post (Jonathan Weisberg) will <a href="http://www.statcan.gc.ca/pub/84-537-x/2013005/tbl/tbl7a-eng.htm" target="_blank">die in the coming year at the ripe old age of 39</a>.</p>
<p>But worst of all, any improvement here comes at an impossible price: ludicrously low probabilities on the other side. For example, the probability that the life you’ll lead with $\$$1M will end up as good as the one you’d expect with $\$$5M is so small that <em>Mathematica</em> can’t distinguish it from zero.<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">7</a></sup> So the problem is actually worse than before, not better.</p>
<h1 id="second-reply">Second Reply</h1>
<p>Buchak’s second reply is that it wouldn’t in fact be a problem if REU theory could only recover the Allais preferences in a small-world setting. We should think of the Allais problem as a thought experiment: it asks us to abstract away from anything but the immediate rewards mentioned in the problem, and to think of the monetary rewards as stand-ins for things that are valuable for their own sakes.</p>
<p>What <em>Risk & Rationality</em> showed, according to Buchak, is that REU theory can accommodate people’s intuitions regarding such a small-world thought experiment. And this is a success, because this establishes that the theory can accommodate a certain kind of reasoning that we all engage in. Buchak moreover concedes that it may well be a mistake for agents to think of the choices they actually face in small-world terms. But she claims this is no problem for her theory:</p>
<blockquote>
<p>[I]f people ‘really’ face the simple choices, then their reasoning is correct and REU captures it. If people ‘really’ face the complex choices, then the reasoning in favor of their preferences is misapplied, and REU does not capture their preferences. Either way, the point still stands: REU-maximization rationalizes and formally reconstructs a certain kind of intuitive reasoning, as seen through REU theory’s ability to capture preferences over highly idealized gambles to which this reasoning is relevant. (p. 2403)</p>
</blockquote>
<p>But there isn’t actually an ‘if’ here. People do really face ‘complex’ choices as we tried to model them. Any reward from an isolated gamble an agent faces in her life really should itself be thought of as a gamble. This is not only true when the potential reward is something like money, which is only a means to something else. Even if the good in question is ‘ultimate’, it just adds to the larger gamble of the agent’s life she is yet to face. She might win a beautiful holiday, but she will still face 20 micromorts per day for the rest of her life (<a href="https://en.wikipedia.org/wiki/Micromort#Baseline" target="_blank">24 if she moves from Canada to England</a>). Even on our deathbeds, we are unsure about how a lot of things we care about will play out. REU theory makes this background risk relevant to the evaluation of any individual gamble.</p>
<p>So Buchak’s response really comes to this: REU theory captures a kind of intuitive reasoning that we employ in highly idealized decision contexts, but which would be misapplied in any actual decisions agents face in their lives. This raises two questions:</p>
<ol>
<li><p>Why should we care about accommodating reasoning in highly idealized decision contexts?</p>
<p>The original project of <em>Risk & Rationality</em> was to rationally accommodate the ordinary decision-maker. But now what we are rationally accommodating are at best her responses to thought experiments that are very far removed from her real life, namely thought experiments that ask her to imagine that she faces no other risk in her life. If our model is right, then REU theory still has to declare her irrational if she acts in real life as she would in the thought experiment—as presumably ordinary decision-makers do. And then we haven’t done very much to rationally accommodate her. At best, we have provided an error theory to explain her ordinary behaviour: her mistake is to treat grand-world problems like small-world problems. This is, of course, a different project than the one <em>Risk & Rationality</em> originally embarked on. As an error theory, REU theory will have to compete with other theories of choice under uncertainty that were never meant to be theories of rationality, such as prospect theory. Moreover, there is still another open question.</p></li>
<li><p>Why should agents have developed a knack for the reasoning displayed in the Allais problem if it is never actually rational to use it?</p>
<p>As a heuristic to try and approximate the behaviour of a perfectly rational system, at least in the Allais example, agents would do better to maximize expected utility—which is also easier to compute. Moreover, the burden of proof is on proponents of REU theory to show that there are any grand-world decisions commonly faced by real agents where REU theory comes to a significantly different assessment than expected utility theory. Unless they can show this, expected utility theory comes out as the better heuristic more generally. It is then quite mysterious what explains our supposed employment of REU-style reasoning. Why should irrational agents, who employ it more generally, have developed a bad heuristic? And why should rational agents, who never use it in real life, develop a tendency to employ it exclusively in highly idealized thought experiments?</p></li>
</ol>
<p>Ultimately, if Buchak’s first reply fails, and all we can rely on is her second reply, <em>Risk & Rationality</em> provides us with no reason to abandon expected utility theory as our best theory of rational choice under uncertainty in actual choice scenarios. Even if we grant that REU theory is a better theory of rational choice in hypothetical scenarios we never face, this is a much less exciting result than the one <em>Risk & Rationality</em> advertised.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Though we need a slightly more severe risk function than that used in <em>Risk & Rationality</em>: $r(p) = p^{2.05}$ instead of $r(p) = p^2$. See our original commentary for details.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:2"><p>To get this figure we evaluate the cumulative distribution function, at zero, of the normal distribution $𝒩(1,.1)$. Using <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">CDF[NormalDistribution[1, .1], 0]
7.61985 × 10^-24
</code></pre>
<a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li>
<li id="fn:3">This is the “variance” in Buchak’s terminology, but we’ll continue to use “standard deviation” here for consistency with our previous discussion and the preferred nomenclature.
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
<li id="fn:4"><p>In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">StandardDeviation[SkewNormalDistribution[1, .17, -5]]
0.105874
</code></pre>
<a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li>
<li id="fn:5">Skewing also shifts the mean, we should note.
<a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li>
<li id="fn:6"><p>In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">CDF[SkewNormalDistribution[1, .17, -5], 0]
4.04475 × 10^-9
</code></pre>
<a class="footnote-return" href="#fnref:6"><sup>[return]</sup></a></li>
<li id="fn:7"><p>Here we calculate the complement of the cumulative distribution function, at $1.3$, of the skew normal distribution with location $1$, scale $.17$, and skew $-5$. In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">1 - CDF[SkewNormalDistribution[1, .17, -5], 1.3]
0.
</code></pre>
<p>Note that <em>Mathematica</em> can estimate this value at the nearby point $1.25$, which gives us an upper bound of about $7 \times 10^{-16}$:</p>
<pre><code class="language-mathematica">1 - CDF[SkewNormalDistribution[1, .17, -5], 1.25]
6.66134 × 10^-16
</code></pre>
<p>For comparison, this probability was about $.0013$ with no skew and $\sigma = .1$:</p>
<pre><code class="language-mathematica">1 - CDF[NormalDistribution[1, .1], 1.3]
0.0013499
</code></pre>
<a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li>
</ol>
</div>
The Mosteller Hall Puzzle
http://jonathanweisberg.org/post/Teaching%20Monty%20Hall/
Wed, 14 Jun 2017 15:21:42 -0500http://jonathanweisberg.org/post/Teaching%20Monty%20Hall/<p>One of my favourite probability puzzles to teach is a close cousin of the <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem" target="_blank">Monty Hall problem</a>. Originally from a 1965 <a href="https://books.google.ca/books/about/Fifty_Challenging_Problems_in_Probabilit.html?id=QiuqPejnweEC" target="_blank">book by Frederick Mosteller</a>,<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> here’s my formulation:</p>
<blockquote>
<p>Three prisoners, A, B, and C, are condemned to die in the morning. But the king decides in the night to pardon one of them. He makes his choice at random and communicates it to the guard, who is sworn to secrecy. She can only tell the prisoners that one of them will be released at dawn.</p>
<p>Prisoner A welcomes the news, as he now has a 1/3 chance of survival. Hoping to go even further, he says to the guard, “I know you can’t tell me whether I am condemned or pardoned. But at least one other prisoner must still be condemned, so can you just name one who is?”. The guard replies (truthfully) that B is still condemned. “Ok”, says A, “then it’s either me or C who was pardoned. So my chance of survival has gone up to ½”.</p>
<p>Unfortunately for A, he is mistaken. But how?</p>
<p><strong>Update</strong>: turns out the puzzle isn’t originally due to Mosteller after all! It appears in <a href="https://www.nature.com/scientificamerican/journal/v201/n4/pdf/scientificamerican1059-174.pdf" target="_blank">a 1959 article</a> in <em>Scientific American</em>, by Martin Gardner.</p>
</blockquote>
<p>For me it’s really intuitive that A is mistaken. The way he figures things, his chance of survival will go up to ½ whoever the guard names in her response. But then A doesn’t even have to bother the guard. He can just skip ahead to the conclusion that his chance of survival is ½. And that’s absurd.</p>
<p>It’s a bit harder to say exactly <em>where</em> A goes wrong. But I’ve always taken this puzzle to be, like Monty Hall, a lesson in Carnap’s TER: the Total Evidence Requirement.</p>
<p>What A learns isn’t only that B is condemned, but also that the guard reports as much. And this report is more likely if C was pardoned than if A was. If C was pardoned, the guard had to name B, the only other prisoner still condemned. Whereas if A was pardoned, the guard could just as easily have named C instead.</p>
<p>So when the guard names B, her report fits twice as well with the hypothesis that C was pardoned, not A:</p>
<p><img src="http://jonathanweisberg.org/img/misc/mosteller_tree_diagram.png" alt="Tree diagram" /></p>
<p>Thus A’s chance of being condemned remains twice that of being pardoned.</p>
<p>If you’re like me, this reasoning will actually be less intuitive than the initial, gut feeling that A must be mistaken (because his logic would make it unnecessary to consult the guard). The argument is still instructive though, for several reasons:</p>
<ol>
<li><p>It shows how the initial, gut feeling is consistent with the probability axioms. We’ve constructed a plausible probability model that vindicates it.</p></li>
<li><p>The Total Evidence Requirement makes the difference in this model. Learning merely that B is condemned would have a different effect in this model. A’s chance of survival really would go up to ½ then.</p></li>
<li><p>These lessons can be carried over to Monty Hall. The same model yields the correct solution there, with the TER playing out in a parallel way.</p></li>
</ol>
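<p>The model is small enough to compute by hand, but here’s a sketch in Python using exact fractions, in case it helps to see the Total Evidence Requirement at work. One assumption is built in, as in the tree diagram: if A was pardoned, the guard picks between B and C with a fair coin. The dictionary and function names are just for illustration.</p>
<pre><code class="language-python">from fractions import Fraction

# The king pardons one prisoner at random.
prior = {"A": Fraction(1, 3), "B": Fraction(1, 3), "C": Fraction(1, 3)}

def posterior(likelihood):
    # Bayes' theorem over the three hypotheses about who was pardoned.
    joint = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}

# Total evidence: the guard *reports* that B is condemned.
# Assumption: if A was pardoned she names B or C with a fair coin.
guard_names_b = {"A": Fraction(1, 2), "B": Fraction(0), "C": Fraction(1)}
with_report = posterior(guard_names_b)

# Partial evidence: merely that B is condemned, however that was learned.
b_is_condemned = {"A": Fraction(1), "B": Fraction(0), "C": Fraction(1)}
without_report = posterior(b_is_condemned)

print(with_report["A"], without_report["A"])  # 1/3 1/2
</code></pre>
<p>Conditioning on the guard’s report leaves A at 1/3, while conditioning only on the fact that B is condemned gives the ½ that A mistakenly arrives at.</p>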
<p>And that last point is the real point of this post. As my colleague <a href="http://www.sergiotenenbaum.org/" target="_blank">Sergio Tenenbaum</a> pointed out in conversation, it means you can use Mosteller’s puzzle to teach Monty Hall. Because, unlike in Monty Hall, <em>the intuitive judgment is the correct one in Mosteller’s puzzle</em>. So you can use it to get students on board with the less intuitive (but entirely correct) argument we used to resolve Mosteller’s puzzle.</p>
<p>Once students have seen how important it is to set up the probability model correctly, so that the Total Evidence Requirement can do its work, they may be more comfortable using the same technique on Monty Hall.</p>
<p>There are other ways of bringing students around to the correct solution to Monty Hall, of course. You can run them through a variant with a hundred doors instead of three; you can invite them to consider what would happen in the long run in repeated games; you can ask them how things would have been different had Monty opened the other door instead.</p>
<p>These are all worthy heuristics. And I expect different ones will click for different students.</p>
<p>But for my money, there’s nothing like a simple and concrete model to help me get oriented and shake off that befuddled feeling. And, in this case, Mosteller’s puzzle helps make the model more intuitive, hence more memorable.</p>
<p><img src="https://68.media.tumblr.com/776dfc1f8b3baa0309b41c6a90ea1a13/tumblr_nd53ozBNz81qj0u7fo1_r1_400.gif" alt="Fainting Goat" /></p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">So I think it actually predates Monty Hall, though I gather this general family of puzzles goes back at least to 1889 and <a href="https://en.wikipedia.org/wiki/Bertrand%27s_box_paradox" target="_blank">Bertrand’s box paradox</a>.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
Accuracy for Dummies, Part 7: Dominance
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%207%20-%20Brier%20Dominance/
Wed, 07 Jun 2017 00:00:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%207%20-%20Brier%20Dominance/
<p>In our <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">last</a> <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 6 - Obtusity/">two</a> posts we established two key facts:</p>
<ol>
<li>The set of possible probability assignments is convex.</li>
<li>Convex sets are “obtuse”. Given a point outside a convex set, there’s a point inside that forms a right-or-obtuse angle with any third point in the set.</li>
</ol>
<p>Today we’re putting them together to get the central result of the accuracy framework, the Brier dominance theorem. We’ll show that a non-probabilistic credence assignment is always “Brier dominated” by some probabilistic one. That is, there is always a probabilistic assignment that is closer, in terms of Brier distance, to every possible truth-value assignment.</p>
<p>In fact we’ll show something a bit more general. We’ll show that there’s a probability assignment that’s closer to all the possible <em>probability</em> assignments. But truth-value assignments are probability assignments, just extreme ones. So the result we really want follows straight away as a special case.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\x}{\vec{x}}
\newcommand{\y}{\vec{y}}
\newcommand{\z}{\vec{z}}
\newcommand{\v}{\vec{v}}
\newcommand{\p}{\vec{p}}
\newcommand{\q}{\vec{q}}
\newcommand{\B}{B}
\newcommand{\R}{\mathbb{R}}
\newcommand{\EIpq}{EI_{\p}(\q)}\newcommand{\EIpp}{EI_{\p}(\p)}
$</p>
<h1 id="recap">Recap</h1>
<p>For reference, let’s collect our notation, terminology, and previous results, so that we have everything in one place.</p>
<p>We’re using $n$ for the number of possibilities under consideration. And we use bold letters like $\x$ and $\p$ to represent $n$-tuples of real numbers. So $\p = (p_1, \ldots, p_n)$ is a point in $n$-dimensional space: a member of $\R^n$.</p>
<p>We call $\p$ a <em>probability assignment</em> if its coordinates (a) are all nonnegative and (b) sum to $1$. And we write $P$ for the set of all probability assignments.</p>
<p>We call $\v$ a <em>truth-value assignment</em> if its coordinates are all zeros except for a single $1$. And we write $V$ for the set of all truth-value assignments.</p>
<p>A point $\y$ is a <em>mixture</em> of the points $\x_1, \ldots, \x_n$ if there are real numbers $\lambda_1, \ldots, \lambda_n$ such that:</p>
<ul>
<li>$\lambda_i \geq 0$ for all $i$,</li>
<li>$\lambda_1 + \ldots + \lambda_n = 1$, and</li>
<li>$\y = \lambda_1 \x_1 + \ldots + \lambda_n \x_n$.</li>
</ul>
<p>We say that a set is <em>convex</em> if it is closed under mixing, i.e. any mixture of elements in the set is also in the set.</p>
<p>The difference between two points, $\x - \y$, is defined coordinate-wise:
$$ \x - \y = (x_1 - y_1, \ldots, x_n - y_n). $$
The <em>dot product</em> of two points $\x$ and $\y$ is written $\x \cdot \y$, and is defined:
$$ \x \cdot \y = x_1 y_1 + \ldots + x_n y_n. $$
As a reminder, the dot product returns a single, real number (not another $n$-dimensional point as one might expect). And the sign of the dot product reflects the angle between $\x$ and $\y$ when viewed as vectors/arrows. In particular, $\x \cdot \y \leq 0$ corresponds to a right-or-obtuse angle.</p>
<p>Finally, $\B(\x,\y)$ is the Brier distance between $\x$ and $\y$, which can be defined:
$$
\begin{align}
\B(\x,\y) &= (\x - \y)^2\\<br />
&= (\x - \y) \cdot (\x - \y).
\end{align}
$$</p>
<p>Now let’s restate the two key theorems we’ll be relying on.</p>
<p><strong>Theorem (Convexity).</strong>
The set of probability assignments $P$ is convex.</p>
<p>We established this in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">Part 5</a> of this series. In particular, we showed that $P$ is the “convex hull” of $V$: the set of all mixtures of points in $V$.</p>
<p><strong>Lemma (Obtusity).</strong>
If $S$ is convex, $\x \not \in S$, and $\y \in S$ minimizes $\B(\y,\x)$ as a function of $\y$ on the domain $S$, then for any $\z \in S$, $(\x - \y) \cdot (\z - \y) \leq 0$.</p>
<p>The intuitive idea behind this lemma, which we proved last time in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 6 - Obtusity/">Part 6</a>, can be illustrated with a diagram:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma3.png" alt="" />
Given a point outside a convex set, we can find a point inside (the closest point) that forms a right-or-obtuse angle with all other points in the set.</p>
<p>What we’ll show next is the natural and intuitive consequence: that point $\y$ is thus closer to any point $\z$ of $S$ than $\x$ is.</p>
<h1 id="the-brier-dominance-theorem">The Brier Dominance Theorem</h1>
<p>Intuitively, we want to show that if the angle formed at point $\y$ with the points $\x$ and $\z$ is right-or-obtuse, then $\y$ must be closer to $\z$ than $\x$ is (in Brier distance).</p>
<p>Formally, a right-or-obtuse angle corresponds to a dot product less than or equal to zero: $(\x - \y) \cdot (\z - \y) \leq 0$. But if $\x = \y$, then the dot product will be zero trivially. So the precise statement of our theorem is:</p>
<p><strong>Theorem.</strong>
If $(\x - \y) \cdot (\z - \y) \leq 0$ and $\x \neq \y$, then $\B(\x,\z) > \B(\y,\z)$.</p>
<p><em>Proof.</em> To start, we establish a general identity via algebra:
$$
\begin{align}
\B(\x, \z) - \B(\x, \y) - \B(\y,\z)
&= (\x - \z)^2 - (\x - \y)^2 - (\y - \z)^2\\<br />
&= -2\y^2 - 2 \x \cdot \z + 2 \x \cdot \y + 2 \y \cdot \z\\<br />
&= -2 (\x - \y) \cdot (\z - \y).
\end{align}
$$
Now suppose $ (\x - \y) \cdot (\z - \y) \leq 0$. Then, given the negative sign on the $-2$ in the established identity,
$$ \B(\x, \z) - \B(\x, \y) - \B(\y,\z) \geq 0, $$
from which we derive
$$ \B(\x, \z) \geq \B(\x, \y) + \B(\y,\z). $$
Now, since $\x \neq \y$ by hypothesis, $\B(\x,\y) > 0$. Thus $\B(\x,\z) > \B(\y,\z)$, as desired.
<span class="floatright">$\Box$</span></p>
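<p>The identity at the start of the proof is easy to spot-check numerically. Here is a minimal Python sketch. The helper names (<code>dot</code>, <code>sub</code>, <code>brier</code>, <code>identity_gap</code>) and the sample points are chosen just for illustration, and a numerical check is no substitute for the algebra.</p>

```python
import random


def dot(u, v):
    """Dot product: multiply coordinate-wise, then add up the results."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Coordinate-wise vector subtraction u - v."""
    return tuple(a - b for a, b in zip(u, v))


def brier(u, v):
    """Brier distance B(u, v) = (u - v)^2, i.e. squared Euclidean distance."""
    d = sub(u, v)
    return dot(d, d)


def identity_gap(x, y, z):
    """Left side minus right side of the identity; should be (about) zero."""
    lhs = brier(x, z) - brier(x, y) - brier(y, z)
    rhs = -2 * dot(sub(x, y), sub(z, y))
    return lhs - rhs


# Spot-check the identity on random points in four dimensions.
random.seed(0)
x, y, z = (tuple(random.uniform(-1, 1) for _ in range(4)) for _ in range(3))
gap = identity_gap(x, y, z)

# A hand-picked case where the angle at y2 between x2 and z2 is obtuse,
# so the theorem says x2 is farther from z2 than y2 is.
x2, y2, z2 = (1, 1), (0, 0), (-1, 0)
angle_term = dot(sub(x2, y2), sub(z2, y2))
farther = brier(x2, z2) > brier(y2, z2)
```

<p>With floating-point inputs <code>gap</code> is only approximately zero, so compare it against a small tolerance rather than testing for exact equality.</p>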
<p>It follows now that if $\x$ isn’t a probability assignment, there’s a probability assignment that’s closer to every truth-value assignment than $\x$ is.</p>
<p><strong>Corollary (Brier Dominance).</strong> If $\x \not \in P$ then there is a $\p \in P$ such that $\B(\p,\v) < \B(\x, \v)$ for all $\v \in V$.</p>
<p><em>Proof.</em> Fix $\x \not \in P$, and let $\p$ be the member of $P$ that minimizes $\B(\y,\x)$ as a function of $\y$. The Convexity theorem tells us that $P$ is convex, and every $\v \in V$ is a member of $P$, so the Obtusity lemma implies $(\x - \p) \cdot (\v - \p) \leq 0$ for every $\v \in V$. And since $\x \neq \p$ (because $\x \not \in P$ while $\p \in P$), the last theorem entails $\B(\p,\v) < \B(\x, \v)$, as desired.
<span class="floatright">$\Box$</span></p>
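<p>To see the corollary in action, we can compute the dominating $\p$ for a concrete non-probabilistic $\x$. The sketch below assumes that the closest point of $P$ can be found with the standard sort-based Euclidean projection onto the probability simplex. That algorithm is not part of this series; it is swapped in here only to produce $\p$. The function names and the sample point are likewise illustrative.</p>

```python
def brier(u, v):
    """Brier distance: squared Euclidean distance between u and v."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def project_to_simplex(x):
    """Closest point to x among the probability assignments
    (non-negative coordinates that sum to 1)."""
    u = sorted(x, reverse=True)
    running_sum = 0.0
    theta = 0.0
    for j, u_j in enumerate(u, start=1):
        running_sum += u_j
        candidate = (running_sum - 1) / j
        if u_j - candidate > 0:
            theta = candidate  # keep the shift from the last index that qualifies
    # Shift every coordinate by theta and clip negatives to zero.
    return [max(x_i - theta, 0.0) for x_i in x]


def vertices(n):
    """The truth-value assignments: a 1 in one coordinate, 0 elsewhere."""
    return [tuple(1 if i == j else 0 for j in range(n)) for i in range(n)]


x = (0.6, 0.6, 0.2)  # coordinates sum to 1.4, so x is not in P
p = project_to_simplex(x)
dominates = all(brier(p, v) < brier(x, v) for v in vertices(len(x)))
```

<p>Here <code>dominates</code> records whether $\p$ is strictly closer than $\x$ to every vertex, which is what the corollary predicts.</p>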
<p>This is the core of the main result we’ve been working towards. Hooray! But, we still have one piece of unfinished business. For what if $\p$ is itself dominated??</p>
<h1 id="undominated-dominance">Undominated Dominance</h1>
<p>We’ve shown that credences which violate the probability axioms are always “accuracy dominated” by some assignment of credences that obeys those axioms. But what if those dominating, probabilistic credences are themselves dominated? <em>What if they’re dominated by non-probabilistic credences??</em></p>
<p>For all we’ve said, that’s a real possibility. And if it actually obtains, then there’s nothing especially accuracy-conducive about the laws of probability. So we had better rule this possibility out. Luckily, that’s pretty easy to do.</p>
<p>In fact, the real work here was already done back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 3/">Part 3</a> of the series. There we showed that Brier distance is a “proper” measure of inaccuracy: each probability assignment expects itself to do best with respect to accuracy, if inaccuracy is measured by Brier distance.</p>
<p>As a reminder, we wrote $\EIpq$ for the expected inaccuracy of probability assignment $\q$ according to assignment $\p$. When inaccuracy is measured in terms of Brier distance:
$$ \EIpq = p_1 \B(\q,\v_1) + p_2 \B(\q,\v_2) + \ldots + p_n \B(\q,\v_n). $$
Here $\v_i$ is the truth-value assignment with a $1$ in the $i$-th coordinate, and $0$ everywhere else. What we showed in Part 3 was:</p>
<p><strong>Theorem.</strong>
$\EIpq$ is uniquely minimized when $\q = \p$.</p>
<p>And notice, this would be impossible if there were some $\q \neq \p$ such that $\B(\q,\v_i) \leq \B(\p,\v_i)$ for all $i$. For then the weighted average $\EIpq$ would have to be no larger than $\EIpp$. And this contradicts the theorem, which says that $\EIpq > \EIpp$ for all $\q \neq \p$.</p>
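<p>The weighted-average reasoning can be illustrated with a few numbers. In this sketch the assignment <code>p</code> and the list of rivals are made up for the example, and <code>expected_inaccuracy</code> simply transcribes the formula for $\EIpq$ given above.</p>

```python
def brier(u, v):
    """Brier distance: squared Euclidean distance between u and v."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def vertices(n):
    """Truth-value assignments v_1, ..., v_n."""
    return [tuple(1 if i == j else 0 for j in range(n)) for i in range(n)]


def expected_inaccuracy(p, q):
    """p_1 * B(q, v_1) + ... + p_n * B(q, v_n)."""
    return sum(p_i * brier(q, v_i) for p_i, v_i in zip(p, vertices(len(p))))


p = (0.5, 0.3, 0.2)
rivals = [(0.6, 0.3, 0.1), (1 / 3, 1 / 3, 1 / 3), (0.5, 0.5, 0.2), (0.4, 0.2, 0.1)]

self_score = expected_inaccuracy(p, p)
rival_scores = [expected_inaccuracy(p, q) for q in rivals]

# No rival can be at least as close as p to every vertex: each one must
# lose to p at some vertex, or its weighted average could not be larger.
loses_somewhere = all(
    any(brier(q, v) > brier(p, v) for v in vertices(len(p))) for q in rivals
)
```

<p>Notice that two of the rivals are not even probability assignments. The weighted-average argument does not care whether $\q$ obeys the probability axioms.</p>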
<p>So, at long last, we have the full result we want:</p>
<p><strong>Corollary (Undominated Brier Dominance).</strong> If $\x \not \in P$ then there is a $\p \in P$ such that $\B(\p,\v) < \B(\x, \v)$ for all $\v \in V$. Moreover, there is no $\q \in P$ with $\q \neq \p$ such that $\B(\q,\v) \leq \B(\p, \v)$ for all $\v \in V$.</p>
<p>So the laws of probability really are specially conducive to accuracy, as measured using Brier distance. Only probabilistic credence assignments are undominated.</p>
<h1 id="where-to-next">Where to Next?</h1>
<p>That’s a pretty sweet result. And it raises plenty of fun and interesting questions we could look at next. Here are three:</p>
<ol>
<li><p>What about other ways of measuring inaccuracy besides Brier? Are there reasonable alternatives, and if so, do similar results apply to them?</p></li>
<li><p>What about other probabilistic principles, like Conditionalization, the Principal Principle, or the Principle of Indifference? Can we take this approach beyond the probability axioms?</p></li>
<li><p>Speaking of the probability axioms, we’ve been working with a pretty pared down conception of a “probability assignment”. Usually we assign probabilities not just to atomic possibilities, but to disjunctions/sets of possibilities: e.g. “the prize is behind either door #1 or door #2”. Can we extend this result to such “super-atomic” probability assignments?</p></li>
</ol>
<p>We’ll tackle some or all of these questions in future posts. But I haven’t yet decided which ones or in what order.</p>
<p>So for now let’s just stop and appreciate the work we’ve already done. Because not only have we proved one of the most central and interesting results of the accuracy framework. But also, in a lot of ways the hardest work is already behind us. If you’ve come this far, I think you deserve a nice pat on the back.</p>
<p><img src="http://i1145.photobucket.com/albums/o503/KimmieRocks/tumblr_liqmv89ru51qb2dn6.gif" alt="" /></p>
Journal Submission Rates by Gender: A Look at the APA/BPA Data
http://jonathanweisberg.org/post/A%20Look%20at%20the%20APA-BPA%20Data/
Tue, 06 Jun 2017 11:45:04 -0500http://jonathanweisberg.org/post/A%20Look%20at%20the%20APA-BPA%20Data/
<p><strong>Update:</strong> <em>editors at CJP and Phil Quarterly have kindly shared some important, additional information. See the edit below for details.</em></p>
<p>A <a href="https://link.springer.com/article/10.1007/s11098-017-0919-0" target="_blank">new paper</a> on the representation of women in philosophy journals prompted some debate in the philosophy blogosphere last week. The paper found women to be underrepresented across a range of prominent journals, yet overrepresented in the two journals studied where review was non-anonymous.</p>
<p>Commenters <a href="http://dailynous.com/2017/05/26/women-philosophy-journals-new-data/" target="_blank">over at Daily Nous</a> complained about the lack of base-rate data. How many of the submissions to these journals were from women? In some respects, it’s hard to know what to make of these findings without such data.</p>
<p>A few commenters linked to <a href="http://www.apaonline.org/resource/resmgr/journal_surveys_2014/apa_bpa_survey_data_2014.xlsx" target="_blank">a survey</a> conducted by the APA and BPA a while back, which supplies some numbers along these lines. I was surprised, because I’ve wondered about these numbers, but I didn’t recall seeing this data-set before. I was excited too because the data-set is huge, in a way: it covers more than 30,000 submissions at 40+ journals over a span of three years!</p>
<p>So I was keen to give it a closer look. This post walks through that process. But I should warn you up front that the result is kinda disappointing.</p>
<h1 id="initial-reservations">Initial Reservations</h1>
<p>Right away some conspicuous omissions stand out.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> A good number of the usual suspects aren’t included, like <em>Philosophical Studies</em>, <em>Analysis</em>, and <em>Australasian Journal of Philosophy</em>. So the usual worries about response rates and selection bias apply.</p>
<p>The data are also a bit haphazard and incomplete. Fewer than half of the journals that responded included gender data. And some of those numbers are suspiciously round.</p>
<p>Still, there’s hope. We have data on over ten thousand submissions even after we exclude journals that didn’t submit any gender data. As long as they paint a reasonably consistent picture, we stand to learn a lot.</p>
<h1 id="first-pass">First Pass</h1>
<p>For starters we’ll just do some minimal cleaning. We’ll exclude data from 2014, since almost no journals supplied it. And we’ll lump together the submissions from the remaining three years, 2011–13, since the gender data isn’t broken down by year.</p>
<p>We can then calculate the following cross-journal tallies for 2011–13:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Accepted submissions</th>
<th align="right">Rejected submissions</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Men</td>
<td align="right">792</td>
<td align="right">9104</td>
</tr>
<tr>
<td align="left">Women</td>
<td align="right">213</td>
<td align="right">1893</td>
</tr>
</tbody>
</table>
<p>The difference here looks notable at first: 17.5% of submitted papers came from women compared with 21.2% of accepted papers, a statistically significant difference (<em>p</em> = 0.002).</p>
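<p>For readers who want to reproduce these figures from the table, here is a minimal Python sketch. The post does not say which test produced the <em>p</em>-value, so the Yates-corrected chi-square test on the 2×2 table is an assumption, as are the variable and function names.</p>

```python
import math

# Cross-journal tallies for 2011-13, copied from the table above.
accepted = {"men": 792, "women": 213}
rejected = {"men": 9104, "women": 1893}
submitted = {g: accepted[g] + rejected[g] for g in accepted}


def share_of_women(counts):
    """Fraction of the papers in `counts` that came from women."""
    return counts["women"] / (counts["men"] + counts["women"])


pct_submitted = 100 * share_of_women(submitted)
pct_accepted = 100 * share_of_women(accepted)


def chi2_yates_2x2(a, b, c, d):
    """Chi-square statistic (with continuity correction) and p-value
    for the 2x2 table [[a, b], [c, d]]. Assumed test, see lead-in."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    stat = 0.0
    for observed, row_total, col_total in cells:
        expected = row_total * col_total / n
        stat += (abs(observed - expected) - 0.5) ** 2 / expected
    # With one degree of freedom, P(chi2 > stat) = erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


stat, p_value = chi2_yates_2x2(
    accepted["men"], rejected["men"], accepted["women"], rejected["women"]
)
```

<p>Rounded to one decimal place, the two percentages match the 17.5% and 21.2% quoted above, and on these tallies the assumed test gives a <em>p</em>-value well below 0.01.</p>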
<p>But if we plot the data by journal, the picture becomes much less clear:</p>
<p><img src="http://jonathanweisberg.org/img/apa_bpa_data_files/unnamed-chunk-3-1.png" alt="" /><!-- --></p>
<p>The dashed line<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> indicates parity: where women’s share of submissions and their share of accepted papers would be equal. At journals above the line, women make up a larger portion of published authors than they do submitting authors. At journals below the line, it’s the reverse.</p>
<p>It’s pretty striking how much variation there is between journals. For example, <em>BJPS</em> is 12 points above the parity line while <em>Phil Quarterly</em> is 9 points below it.</p>
<p>It’s also notable that it’s the largest journals which diverge the most from parity: <em>BJPS</em>, <em>EJP</em>, <em>MIND</em>, and <em>Phil Quarterly</em>. (Note: <em>Hume Studies</em> is actually the most extreme by far. But I’ve excluded it from the plot because it’s very small, and as an extreme outlier it badly skews the <em>y</em>-axis.)</p>
<p>It’s hard to see all the details in the plot, so here’s the same data in a table.</p>
<table>
<thead>
<tr>
<th align="left">Journal</th>
<th align="right">submissions</th>
<th align="right">accepted</th>
<th align="left">% submissions women</th>
<th align="left">% accepted women</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ancient Philosophy</td>
<td align="right">346</td>
<td align="right">63</td>
<td align="left">20</td>
<td align="left">24</td>
</tr>
<tr>
<td align="left">British Journal for the Philosophy of Science</td>
<td align="right">1267</td>
<td align="right">117</td>
<td align="left">15</td>
<td align="left">27</td>
</tr>
<tr>
<td align="left">Canadian Journal of Philosophy</td>
<td align="right">792</td>
<td align="right">132</td>
<td align="left">20</td>
<td align="left">21</td>
</tr>
<tr>
<td align="left">Dialectica</td>
<td align="right">826</td>
<td align="right">74</td>
<td align="left">12.05</td>
<td align="left">15.48</td>
</tr>
<tr>
<td align="left">European Journal of Philosophy</td>
<td align="right">1554</td>
<td align="right">98</td>
<td align="left">11.84</td>
<td align="left">25</td>
</tr>
<tr>
<td align="left">Hume Studies</td>
<td align="right">152</td>
<td align="right">30</td>
<td align="left">23.7</td>
<td align="left">58.1</td>
</tr>
<tr>
<td align="left">Journal of Applied Philosophy</td>
<td align="right">510</td>
<td align="right">47</td>
<td align="left">20</td>
<td align="left">20</td>
</tr>
<tr>
<td align="left">Journal of Political Philosophy</td>
<td align="right">1143</td>
<td align="right">53</td>
<td align="left">35</td>
<td align="left">30</td>
</tr>
<tr>
<td align="left">MIND</td>
<td align="right">1498</td>
<td align="right">74</td>
<td align="left">10</td>
<td align="left">5</td>
</tr>
<tr>
<td align="left">Oxford Studies in Ancient Philosophy</td>
<td align="right">290</td>
<td align="right">43</td>
<td align="left">21</td>
<td align="left">20.3</td>
</tr>
<tr>
<td align="left">Philosophy East and West</td>
<td align="right">320</td>
<td align="right">66</td>
<td align="left">20</td>
<td align="left">15</td>
</tr>
<tr>
<td align="left">Phronesis</td>
<td align="right">388</td>
<td align="right">38</td>
<td align="left">24</td>
<td align="left">25</td>
</tr>
<tr>
<td align="left">The Journal of Aesthetics and Art Criticism</td>
<td align="right">611</td>
<td align="right">93</td>
<td align="left">29</td>
<td align="left">27</td>
</tr>
<tr>
<td align="left">The Philosophical Quarterly</td>
<td align="right">2305</td>
<td align="right">77</td>
<td align="left">14</td>
<td align="left">5</td>
</tr>
</tbody>
</table>
<h1 id="rounders-removed">Rounders Removed</h1>
<p>I mentioned that some of the numbers look suspiciously round. Maybe 10% of submissions to <em>MIND</em> really were from women, compared with 5% of accepted papers. But some of these cases probably involve non-trivial rounding, maybe even eyeballing or guesstimating. So let’s see how things look without them.</p>
<p>If we omit journals where both percentages are round (integer multiples of 5), that leaves ten journals. And the gap from before is even more pronounced: 16.3% of submissions from women compared with 22.9% of accepted papers (<em>p</em> = 0.0000003).</p>
<p>But it’s still a few, high-volume journals driving the result: <em>BJPS</em> and <em>EJP</em> do a ton of business, and each has a large gap. So much so that they’re able to overcome the opposite contribution of <em>Phil Quarterly</em> (which does a mind-boggling amount of business!).</p>
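<p>The rounding filter and the per-journal gaps can be recomputed directly from the table. The sketch below hard-codes the two percentage columns; the names <code>is_round</code>, <code>gaps</code>, and <code>kept</code> are just illustrative. It does not recompute the pooled percentages, since those need the underlying counts rather than the rounded percentages.</p>

```python
# (journal, % submissions from women, % accepted papers from women)
journals = [
    ("Ancient Philosophy", 20, 24),
    ("British Journal for the Philosophy of Science", 15, 27),
    ("Canadian Journal of Philosophy", 20, 21),
    ("Dialectica", 12.05, 15.48),
    ("European Journal of Philosophy", 11.84, 25),
    ("Hume Studies", 23.7, 58.1),
    ("Journal of Applied Philosophy", 20, 20),
    ("Journal of Political Philosophy", 35, 30),
    ("MIND", 10, 5),
    ("Oxford Studies in Ancient Philosophy", 21, 20.3),
    ("Philosophy East and West", 20, 15),
    ("Phronesis", 24, 25),
    ("The Journal of Aesthetics and Art Criticism", 29, 27),
    ("The Philosophical Quarterly", 14, 5),
]


def is_round(pct):
    """True when the percentage is an integer multiple of 5."""
    return pct % 5 == 0


# Gap in percentage points: positive means above the parity line.
gaps = {name: acc - sub for name, sub, acc in journals}

# Keep a journal unless BOTH of its percentages are round.
kept = [name for name, sub, acc in journals if not (is_round(sub) and is_round(acc))]
dropped = [name for name, _, _ in journals if name not in kept]
```

<p>Four journals are dropped under this rule, which leaves the ten mentioned above.</p>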
<h1 id="editors-anonymous">Editors Anonymous</h1>
<p>Naturally I fell to wondering how these big journals differ in their editorial practices. What are they doing differently that leads to such divergent results?</p>
<p>One thing the data tell us is which journals practice fully anonymous review, with even the editors ignorant of the author’s identity. That narrows it down to just three journals: <em>CJP</em>, <em>Dialectica</em>, and <em>Phil Quarterly</em>.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> The tallies then are:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Accepted submissions</th>
<th align="right">Rejected submissions</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Men</td>
<td align="right">240</td>
<td align="right">3103</td>
</tr>
<tr>
<td align="left">Women</td>
<td align="right">43</td>
<td align="right">537</td>
</tr>
</tbody>
</table>
<p>And now the gap is gone: 14.8% of submissions from women, compared with 15.2% of accepted papers—not a statistically significant difference (<em>p</em> = 0.91). That makes it look like the gap is down to editors’ decisions being influenced by knowledge of the author’s gender (whether deliberately or unconsciously).</p>
<p>But notice again, <em>Phil Quarterly</em> is still a huge part of this story. It’s their high volume and unusually negative differential that compensates for the more modest, positive differentials at <em>CJP</em> and <em>Dialectica</em>. So I still want to know more about <em>Phil Quarterly</em>, and what might explain their unusually negative differential.</p>
<p><strong>Edit</strong>: editors at <em>CJP</em> and <em>Phil Quarterly</em> kindly wrote with the following, additional information.</p>
<p>At <em>CJP</em>, the author’s identity is withheld from the editors while they decide whether to send the paper for external review, but then their identity is revealed (presumably to avoid inviting referees who are unacceptably close to the author—e.g. those identical to the author).</p>
<p>And chairman of <em>Phil Quarterly</em>’s editorial board, Jessica Brown, writes:</p>
<blockquote>
<ol>
<li>the PQ is very aware of issues about the representation of women, unsurprisingly given that the editorial board consists of myself, Sarah Broadie and Sophie-Grace Chappell. We monitor data on submissions by women and papers accepted in the journal every year.</li>
<li>the PQ has for many years had fully anonymised processing including the point at which decisions on papers are made (i.e. accept, reject, R and R etc). So, when we make such decisions we have no idea of the identity of the author.</li>
<li><p>While in some years the data has concerned us, more recently the figures do look better which is encouraging:</p>
<ul>
<li>16-17: 25% declared female authored papers accepted; 16% submissions</li>
<li>15-16: 14% accepted; 15% submissions</li>
<li>14-15: 16% accepted; 16% submissions</li>
</ul></li>
</ol>
</blockquote>
<h1 id="a-gruesome-conclusion">A Gruesome Conclusion</h1>
<p>In the end, I don’t see a clear lesson here. Before drawing any conclusions from the aggregated, cross-journal tallies, it seems we’d need to know more about the policies and practices of the journals driving them. Otherwise we’re liable to be misled to a false generalization about a heterogeneous group.</p>
<p>Some of that policy-and-practice information is probably publicly available; I haven’t had a chance to look. And I bet a lot of it is available informally, if you just talk to the right people. So this data-set could still be informative on our base-rate question. But sadly, I don’t think I’m currently in a position to make informative use of it.</p>
<p><img src="http://i.imgur.com/ojvPBaY.jpg" alt="" /></p>
<h1 id="technical-note">Technical Note</h1>
<p>This post was written in R Markdown and the source is <a href="https://github.com/jweisber/rgo/blob/master/apa bpa data/apa_bpa_data.Rmd" target="_blank">available on GitHub</a>.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">No, I don’t mean <em>Ergo</em>! We published our first issue in 2014 while the survey covers mainly 2011–13.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:2"><strong>Edit</strong>: the parity line was solid blue originally. But that misled some people into reading it as a fitted line. For reference and posterity, <a href="http://jonathanweisberg.org/img/apa_bpa_data_files/unnamed-chunk-3-2.png">the original image is here</a>.
<a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li>
<li id="fn:3">That’s if we continue to exclude journals with very round numbers. Adding these journals back in doesn’t change the following result, though.
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
</ol>
</div>
Accuracy for Dummies, Part 6: Obtusity
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%206%20-%20Obtusity/
Wed, 24 May 2017 00:00:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%206%20-%20Obtusity/
<p><a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">Last time</a> we saw that the set of probability assignments is <em>convex</em>. Today we’re going to show that convex sets have a special sort of “obtuse” relationship with outsiders. Given a point <em>outside</em> a convex set, there is always a point <em>in</em> the set that forms a right-or-obtuse angle with it.</p>
<p>Recall our 2D diagram from the first post. The convex set of interest here is the diagonal line segment from $(0,1)$ to $(1,0)$:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" /></p>
<p>For any point outside the diagonal, like $c^* $, there is a point like $c'$ on it that forms a right angle with all other points on the diagonal. As a result, $c'$ is closer to all other points on the diagonal than $c^* $ is. In particular, $c'$ is closer to both vertices, so it’s always more accurate than $c^*$. It’s “closer to the truth”.</p>
<p>The insider point $c'$ that we used in this case is the closest point on the diagonal to $c^*$. That’s what licenses the right-triangle reasoning here. Today we’re generalizing this strategy to $n$ dimensions.</p>
<p>To do that, we need some tools for reasoning about $n$-dimensional geometry.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\x}{\vec{x}}
\newcommand{\y}{\vec{y}}
\newcommand{\z}{\vec{z}}
\newcommand{\B}{B}
$</p>
<h1 id="arithmetic-with-arrows">Arithmetic with Arrows</h1>
<p>You’re familiar with arithmetic in one dimension: adding, subtracting, and multiplying single numbers. What about points in $n$ dimensions?</p>
<p>We introduced two ideas for arithmetic with points last time. We’ll add a few more today, and also talk about what they mean geometrically.</p>
<p>Suppose you have two points $\x$ and $\y$ in $n$ dimensions:
$$
\begin{align}
\x &= (x_1, \ldots, x_n),\\<br />
\y &= (y_1, \ldots, y_n).
\end{align}
$$
Their sum $\x + \y$, as we saw last time, is defined as follows:
$$ \x + \y = (x_1 + y_1, \ldots, x_n + y_n). $$
In other words, points are added coordinate-wise.</p>
<p>This definition has a natural, geometric meaning we didn’t mention last time. Start by thinking of $\x$ and $\y$ as <em>vectors</em>, that is, as arrows pointing from the origin to the points $\x$ and $\y$. Then $\x + \y$ just amounts to placing the two arrows tip-to-tail and taking the point where the second arrow ends:
<img src="http://jonathanweisberg.org/img/accuracy/VectorAddition.png" alt="" />
(Notice that we’re continuing our usual practice of bold letters for points/vectors like $\x$ and $\y$, and italics for single numbers like $x_1$ and $y_3$.)</p>
<p>You can also multiply a vector $\x$ by a single number, $a$. The definition is once again coordinate-wise:
$$ a \x = (a x_1, \ldots, a x_n). $$
And again there’s a natural, geometric meaning. We’ve lengthened the vector $\x$ by a factor of $a$.
<img src="http://jonathanweisberg.org/img/accuracy/VectorMultiplication.png" alt="" />
Notice that if $a$ is between $0$ and $1$, then “lengthening” is actually shortening. For example, multiplying a vector by $a = 1/ 2$ makes it half as long.</p>
<p>If $a$ is negative, then multiplying by $a$ reverses the direction of the arrow. For example, multiplying the northeasterly arrow $(1,1)$ by $-1$ yields the southwesterly arrow pointing to $(-1,-1)$.</p>
<p>That means we can define subtraction in terms of addition and multiplication by negative one (just as with single numbers):
$$
\begin{align}
\x - \y &= \x + (-1 \times \y)\\<br />
&= (x_1 - y_1, \ldots, x_n - y_n).
\end{align}
$$
So vector subtraction amounts to coordinate-wise subtraction.</p>
<p>But what about multiplying two vectors? That’s actually different from what you might expect! We don’t just multiply coordinate-wise. We do that <strong>and then add up the results</strong>:
$$ \x \cdot \y = x_1 y_1 + \ldots + x_n y_n. $$
So the product of two vectors is <strong>not a vector</strong>, but a number. That number is called the <em>dot product</em>, $\x \cdot \y$.</p>
<p>Why are dot products defined this way? Why do we add up the results of coordinate-wise multiplication to get a single number? Because it yields a more useful extension of the concept of multiplication from single numbers to vectors. We’ll see part of that in a moment, in the geometric meaning of the dot product.</p>
<p>(There’s an algebraic side to the story too, having to do with the axioms that characterize the real numbers—<a href="https://en.wikipedia.org/wiki/Field_(mathematics)" target="_blank">the field axioms</a>. We won’t go into that, but it comes out in <a href="http://www.youtube.com/watch?v=63HpaUFEtXY&t=8m28s" target="_blank">this bit</a> of a beautiful lecture by Francis Su, especially around <a href="http://www.youtube.com/watch?v=63HpaUFEtXY&t=11m45s" target="_blank">the 11:45 mark</a>.)</p>
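<p>The four operations defined in this section translate almost word for word into code. Here is a minimal Python sketch using plain tuples for vectors; the function names are my own choice for the illustration.</p>

```python
def add(x, y):
    """Vector addition: add coordinate-wise."""
    return tuple(a + b for a, b in zip(x, y))


def scale(a, x):
    """Multiply the vector x by the single number a, coordinate-wise."""
    return tuple(a * c for c in x)


def sub(x, y):
    """Subtraction, defined as x + (-1 * y)."""
    return add(x, scale(-1, y))


def dot(x, y):
    """Dot product: multiply coordinate-wise, then add up the results."""
    return sum(a * b for a, b in zip(x, y))


northeast = (1, 1)
southwest = scale(-1, northeast)    # reverses the arrow
total = add((1, 2), (3, 4))         # still a vector
difference = sub((5, 7), (2, 3))    # still a vector
product = dot((1, 2), (3, 4))       # a single number, not a vector
```

<p>Notice that <code>add</code>, <code>scale</code>, and <code>sub</code> return tuples, while <code>dot</code> returns a plain number.</p>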
<h1 id="signs-and-their-significance">Signs and Their Significance</h1>
<p>In two dimensions, a right angle has a special algebraic property: the dot-product of two arrows making the angle is always zero.</p>
<p>Imagine a right triangle at the origin, with one leg going up to the point $(0,1)$ and the other leg going out to $(1,0)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorRightAngle.png" alt="" />
The dot product of those two vectors is $(1,0) \cdot (0,1) = 1 \times 0 + 0 \times 1 = 0$. One more example: consider the right angle formed by the vectors $(-3,3)$ and $(1,1)$.
<img src="http://jonathanweisberg.org/img/accuracy/VectorRightAngle2.png" alt="" />
Again, the dot product is $(-3,3) \cdot (1,1) = -3 \times 1 + 3 \times 1 = 0.$</p>
<p>Going a bit further: the dot product is always positive for acute angles, and negative for obtuse angles. Take the vectors $(5,0)$ and $(-1,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorObtuseAngle.png" alt="" />
Then we have $(5,0) \cdot (-1,1) = -5$. Whereas for $(5,0)$ and $(1,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorAcuteAngle.png" alt="" />
we find $(5,0) \cdot (1,1) = 5$.</p>
<p>So the sign of the dot-product reflects the angle formed by the vectors $\x$ and $\y$:</p>
<ul>
<li>acute angle: $\x \cdot \y > 0$,</li>
<li>right angle: $\x \cdot \y = 0$,</li>
<li>obtuse angle: $\x \cdot \y < 0$.</li>
</ul>
<p>That’s going to be key in generalizing to $n$ dimensions, where reasoning with diagrams breaks down. But first, one last bit of groundwork.</p>
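<p>The sign rule can be checked against the examples from this section. In this short Python sketch, the function name <code>classify_angle</code> is just an illustrative choice.</p>

```python
def dot(x, y):
    """Dot product: multiply coordinate-wise, then add up the results."""
    return sum(a * b for a, b in zip(x, y))


def classify_angle(x, y):
    """Classify the angle between the arrows x and y by the sign of x . y."""
    d = dot(x, y)
    if d > 0:
        return "acute"
    if d < 0:
        return "obtuse"
    return "right"


# The four pairs of vectors pictured above.
examples = {
    ((1, 0), (0, 1)): classify_angle((1, 0), (0, 1)),
    ((-3, 3), (1, 1)): classify_angle((-3, 3), (1, 1)),
    ((5, 0), (-1, 1)): classify_angle((5, 0), (-1, 1)),
    ((5, 0), (1, 1)): classify_angle((5, 0), (1, 1)),
}
```

<p>Nothing in <code>classify_angle</code> depends on the vectors being two-dimensional, which is why the same test works in $n$ dimensions.</p>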
<h1 id="algebra-with-arrows">Algebra with Arrows</h1>
<p>You can check pretty easily that vector addition and multiplication behave a lot like ordinary addition and multiplication. The usual laws of commutativity, associativity, and distribution hold:</p>
<ul>
<li>$\x + \y = \y + \x$.</li>
<li>$\x + (\y + \z) = (\x + \y) + \z$.</li>
<li>$a ( \x + \y) = a\x + a\y$.</li>
<li>$\x \cdot \y = \y \cdot \x$.</li>
<li>$\x \cdot (\y + \z) = \x \cdot \y + \x \cdot \z$.</li>
<li>$a (\x \cdot \y) = a \x \cdot \y = \x \cdot a \y$.</li>
</ul>
<p>One notable consequence, which we’ll use below, is the analogue of the familiar <a href="https://en.wikipedia.org/wiki/FOIL_method" target="_blank">“FOIL method”</a> from high school algebra:
$$
\begin{align}
(\x - \y)^2 &= (\x - \y) \cdot (\x - \y)\\<br />
&= \x^2 - 2 \x \cdot \y + \y^2.
\end{align}
$$
We’ll also make use of the fact that the Brier distance between $\x$ and $\y$ can be written $(\x - \y)^2$. Why?</p>
<p>Let’s write $\B(\x,\y)$ for the Brier distance between points $\x$ and $\y$. Recall the definition of Brier distance, which is just the square of Euclidean distance:
$$ \B(\x,\y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2. $$
Now consider that, thanks to our definition of vector subtraction:
$$ \x - \y = (x_1 - y_1, x_2 - y_2, \ldots, x_n - y_n). $$
And thanks to the definition of the dot product:
$$ (\x - \y) \cdot (\x - \y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2. $$
So $\B(\x, \y) = (\x - \y) \cdot (\x - \y)$, in other words:
$$ \B(\x, \y) = (\x - \y)^2. $$</p>
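<p>Here is a quick check, in Python, that the coordinate-wise definition of Brier distance, the dot-product form, and the FOIL expansion all agree. The function names and the sample vectors are made up for the illustration. Integer coordinates are used so the comparison is exact.</p>

```python
def dot(x, y):
    """Dot product: multiply coordinate-wise, then add up the results."""
    return sum(a * b for a, b in zip(x, y))


def sub(x, y):
    """Coordinate-wise vector subtraction x - y."""
    return tuple(a - b for a, b in zip(x, y))


def brier_by_definition(x, y):
    """(x_1 - y_1)^2 + ... + (x_n - y_n)^2."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def brier_by_dot(x, y):
    """(x - y) . (x - y)."""
    d = sub(x, y)
    return dot(d, d)


def brier_by_foil(x, y):
    """x^2 - 2 x . y + y^2, where x^2 is shorthand for x . x."""
    return dot(x, x) - 2 * dot(x, y) + dot(y, y)


x = (1, 2, 3)
y = (4, 0, -1)
results = (brier_by_definition(x, y), brier_by_dot(x, y), brier_by_foil(x, y))
```

<p>All three entries of <code>results</code> come out the same for this pair of points.</p>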
<h1 id="a-cute-lemma">A Cute Lemma</h1>
<p>Now we can prove the lemma that’s the aim of this post. For the intuitive idea, picture a convex set $S$ in the plane, like a pentagon. Then choose an arbitrary point $\x$ outside that set:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma.png" alt="" />
Now trace a straight line from $\x$ to the closest point of the convex region, $\y$:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma2.png" alt="" />
Finally, trace another straight line to any other point $\z$ of $S$:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma3.png" alt="" />
No matter what point we choose for $\z$, the angle formed will either be right or obtuse. It cannot be acute.</p>
<p><strong>Lemma.</strong> Let $S$ be a convex set of points in $\mathbb{R}^n$. Let $\x \not \in S$, and let $\y \in S$ minimize $\B(\y, \x)$ as a function of $\y$ on the domain $S$. Then for any $\z \in S$,
$$ (\x - \y) \cdot (\z - \y) \leq 0. $$</p>
<p>Let’s pause to understand what the Lemma is saying before we dive into the proof.</p>
<p>Focus on the centered inequality. It’s about the vectors $\x - \y$ and $\z - \y$. These are the arrows pointing from $\y$ to $\x$, and from $\y$ to $\z$. So in terms of our original two dimensional diagram with the triangle:
<img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" />
we’re looking at the angle between $c^*$, $c'$, and any point on the diagonal you like… which includes the ones we’re especially interested in, the vertices. What the lemma tells us is that this angle is always at least a right angle.</p>
<p>Of course, it’s exactly a right angle in this case, not an obtuse one. That’s because our convex region is just the diagonal line. But the Lemma could also be applied to the whole triangular region in the diagram. That’s a convex set too. And if we took a point inside the triangle as our third point, the angle formed would be obtuse. (This is actually important if you want to generalize the dominance theorem beyond what we’ll prove next time. But for us it’s just a mathematical extra.)</p>
<p>Now let’s prove the Lemma.</p>
<p><em>Proof.</em> Because $S$ is convex and $\y$ and $\z$ are in $S$, any mixture of $\y$ and $\z$ must also be in $S$. That is, every point $\lambda \z + (1-\lambda) \y$ is in $S$, given $0 \leq \lambda \leq 1$.</p>
<p>Notice that we can rewrite $\lambda \z + (1-\lambda) \y$ as follows:
$$ \lambda \z + (1-\lambda) \y = \y + \lambda(\z - \y). $$
We’ll use this fact momentarily.</p>
<p>Now, by hypothesis $\y$ is at least as close to $\x$ as any other point of $S$ is. So, in particular, $\y$ is at least as close to $\x$ as the mixtures of $\y$ and $\z$ are. Thus, for any given $\lambda \in [0,1]$:
$$ \B(\y,\x) \leq \B(\lambda \z + (1-\lambda) \y, \x). $$
Using algebra, we can transform the right-hand side as follows:
$$
\begin{align}
\B(\lambda \z + (1-\lambda) \y, \x) &= \B(\x, \lambda \z + (1-\lambda) \y)\\<br />
&= \B(\x, \y + \lambda(\z - \y))\\<br />
&= (\x - (\y + \lambda(\z - \y)))^2\\<br />
&= ((\x - \y) - \lambda(\z - \y))^2\\<br />
&= (\x - \y)^2 + \lambda^2(\z - \y)^2 - 2\lambda(\x - \y) \cdot (\z - \y)\\<br />
&= \B(\x,\y) + \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y).
\end{align}
$$
Combining this equation with the previous inequality, we have:
$$ \B(\y,\x) \leq \B(\x,\y) + \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y). $$
And because $\B(\y, \x) = \B(\x, \y)$, this becomes:<br />
$$ 0 \leq \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y). $$
If we then restrict our attention to $\lambda > 0$, we can divide and rearrange terms to get:
$$ (\x - \y) \cdot (\z - \y) \leq \frac{\lambda\B(\z,\y)}{2}. $$
And since this inequality holds no matter how small $\lambda$ is, it follows that
$$ (\x - \y) \cdot (\z - \y) \leq 0, $$
as desired.
<span class="floatright">$\Box$</span></p>
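<p>The Lemma can also be illustrated numerically. The sketch below uses the unit square as the convex set $S$, an assumption made purely for convenience: the closest point of a square is found by clamping each coordinate. The outside point and the random sample of insiders are likewise just for illustration.</p>

```python
import random


def dot(u, v):
    """Dot product: multiply coordinate-wise, then add up the results."""
    return sum(a * b for a, b in zip(u, v))


def sub(u, v):
    """Coordinate-wise vector subtraction u - v."""
    return tuple(a - b for a, b in zip(u, v))


def closest_point_in_unit_square(x):
    """Clamp each coordinate to [0, 1]; this minimizes Brier distance to x."""
    return tuple(min(max(c, 0.0), 1.0) for c in x)


x = (1.5, -0.25)                      # a point outside the square
y = closest_point_in_unit_square(x)   # the closest insider

# Sample many points z of the square and record the largest value of
# (x - y) . (z - y). The Lemma says it can never be positive.
random.seed(1)
samples = [(random.random(), random.random()) for _ in range(1000)]
largest = max(dot(sub(x, y), sub(z, y)) for z in samples)
```

<p>A sample of points cannot prove the Lemma, of course. It only shows that none of these particular choices of $\z$ forms an acute angle.</p>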
<h1 id="taking-stock">Taking Stock</h1>
<p>Here’s what we’ve got from this post and the last one:</p>
<ul>
<li>Last time: the set of probability functions $P$ is convex.</li>
<li>This time: given a point $\x$ outside $P$, there’s a point $\y$ inside $P$ that forms a right-or-obtuse angle with every other point $\z$ in $P$.</li>
</ul>
<p>Intuitively, it should follow that:</p>
<ul>
<li>$\y$ is closer to every $\z$ in $P$ than $\x$ is.</li>
</ul>
<p>And indeed, that’s what we’ll show in the next post!</p>
Accuracy for Dummies, Part 5: Convexity
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%205%20-%20Convexity/
Thu, 18 May 2017 10:35:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%205%20-%20Convexity/
<p>In this and the next two posts we’ll establish the central theorem of the accuracy framework. We’ll show that the laws of probability are specially suited to the pursuit of accuracy, measured in Brier distance.</p>
<p>We showed this for cases with two possible outcomes, like a coin toss, way back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">the first post of this series</a>. A simple, <a href="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png">two-dimensional diagram</a> was all we really needed for that argument. To see how the same idea extends to any number of dimensions, we need to generalize the key ingredients of that reasoning to $n$ dimensions.</p>
<p>This post supplies the first ingredient: the convexity theorem.</p>
<h1 id="convex-shapes">Convex Shapes</h1>
<p>Convex shapes are central to the accuracy framework because, in a way, the laws of probability have a convex shape. Hopefully that mystical pronouncement will make sense by the end of this post.</p>
<p>You probably know a convex shape when you see one. Circles, triangles, and octagons are convex; pentagrams and the state of Texas are not.</p>
<p>But what makes a convex shape convex? Roughly: <em>it contains all its connecting lines</em>. If you take any two points in a convex region and draw a line connecting them, the line will lie entirely inside that region.</p>
<p>But on a non-convex figure, you can find points whose connecting line leaves the figure’s boundary:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/TexasLine.png" alt="" /></p>
<p>We want to take this idea beyond two dimensions, though. And for that, we need to generalize the idea of connecting lines. We need the concept of a “mixture”.</p>
<h2 id="pointy-arithmetic">Pointy Arithmetic</h2>
<p>In two dimensions it’s pretty easy to see that if you take some percentage of one point, and a complementary percentage of another point, you get a third point on the line between them.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\p}{\vec{p}}
\newcommand{\q}{\vec{q}}
\newcommand{\r}{\vec{r}}
\newcommand{\v}{\vec{v}}
\newcommand{\R}{\mathbb{R}}
$</p>
<p>For example, if you take $1/ 2$ of $(0,0)$ and add it to $1/ 2$ of $(1,1)$, you get the point halfway between: $(1/ 2,1/ 2)$. That’s pretty intuitive geometrically:
<img src="http://jonathanweisberg.org/img/accuracy/Fig1.png" alt="" />
But we can capture the idea algebraically too:
$$
\begin{align}
1/ 2 \times (0,0) + 1/ 2 \times (1,1)
&= (0,0) + (1/ 2, 1/ 2)\\<br />
&= (1/ 2, 1/ 2).
\end{align}
$$</p>
<p>Likewise, if you add $3/10$ of $(0,0)$ to $7/10$ of $(1, 1)$, you get the point seven-tenths of the way in between, namely $(7/10, 7/10)$:
<img src="http://jonathanweisberg.org/img/accuracy/Fig2.png" alt="" />
In algebraic terms:
$$
\begin{align}
3/10 \times (0,0) + 7/10 \times (1,1)
&= (0,0) + (7/10, 7/10)\\<br />
&= (7/10, 7/10).
\end{align}
$$</p>
<p>Notice that we just introduced two rules for doing arithmetic with points. When multiplying a point $\p = (p_1, p_2)$ by a number $a$, we get:
$$ a \p = (a p_1, a p_2). $$
And when adding two points $\p = (p_1, p_2)$ and $\q = (q_1, q_2)$ together:
$$ \p + \q = (p_1 + q_1, p_2 + q_2). $$
In other words, multiplying a point by a single number works element-wise, and so does adding two points together.</p>
<p>We can generalize these ideas straightforwardly to any number of dimensions $n$. Given points $\p = (p_1, p_2, \ldots, p_n)$ and $\q = (q_1, q_2, \ldots, q_n)$, we can define:
$$ a \p = (a p_1, a p_2, \ldots, a p_n), $$
and
$$ \p + \q = (p_1 + q_1, p_2 + q_2, \ldots, p_n + q_n).$$
We’ll talk more about arithmetic with points next time. For now, these two definitions will do.</p>
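<p>The two rules translate directly into code. Here is a minimal Python sketch of them; the helper names <code>scale</code> and <code>add</code> are illustrative choices, and exact fractions are used only to keep rounding noise out of the picture:</p>

```python
from fractions import Fraction

def scale(a, p):
    """Multiply every coordinate of the point p by the number a."""
    return tuple(a * x for x in p)

def add(p, q):
    """Add two points of the same dimension, coordinate by coordinate."""
    return tuple(x + y for x, y in zip(p, q))

# The two worked examples from the text.
half = Fraction(1, 2)
midpoint = add(scale(half, (0, 0)), scale(half, (1, 1)))
seven_tenths = add(scale(Fraction(3, 10), (0, 0)), scale(Fraction(7, 10), (1, 1)))

print(midpoint)
print(seven_tenths)
```

<p>Because both helpers loop over whatever coordinates they are given, the same code works unchanged for points of any dimension $n$.</p>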
<h2 id="mixtures">Mixtures</h2>
<p>Now back to connecting lines between points. The idea is that the straight line between $\p$ and $\q$ is the set of points we get by “mixing” some portion of $\p$ with some portion of $\q$.</p>
<p>We take some number $\lambda$ between $0$ and $1$, we multiply $\p$ by $\lambda$ and $\q$ by $1 - \lambda$, and we sum the results: $\lambda \p + (1-\lambda) \q$. The set of points you can obtain this way is the straight line between $\p$ and $\q$.</p>
<p>In fact, you can mix any number of points together. Given $m$ points $\q_1, \ldots, \q_m$, we can define their <em>mixture</em> as follows. Let $\lambda_1, \ldots, \lambda_m$ be nonnegative real numbers that sum to one. That is:</p>
<ul>
<li>$\lambda_i \geq 0$ for all $i$, and</li>
<li>$\lambda_1 + \lambda_2 + \ldots + \lambda_m = 1$.</li>
</ul>
<p>Then we multiply each $\q_i$ by the corresponding $\lambda_i$ and sum up:
$$ \p = \lambda_1 \q_1 + \ldots + \lambda_m \q_m. $$
The resulting point $\p$ is a <em>mixture</em> of the $\q_i$’s.</p>
<p>Now we can define the general notion of a <em>convex set</em> of points. A convex set is one where every mixture of points in the set is also contained in the set. (A convex set is “closed under mixing”, you might say.)</p>
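<p>The definition of a mixture can be sketched in a few lines of Python. The function name <code>mixture</code> and its error messages are assumptions made for illustration. The check that the weights sum to one is exact, so the sketch assumes the weights are given as fractions rather than floats:</p>

```python
from fractions import Fraction

def mixture(weights, points):
    """Return the weighted sum of the points.

    The weights must be nonnegative and sum to one, as in the definition above.
    """
    if len(weights) != len(points):
        raise ValueError("need exactly one weight per point")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if sum(weights) != 1:
        raise ValueError("weights must sum to one")
    dim = len(points[0])
    return tuple(sum(w * pt[k] for w, pt in zip(weights, points)) for k in range(dim))

# Equal weights on the corners of the unit square give its center.
corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
center = mixture([Fraction(1, 4)] * 4, corners)

# Two points: the mixture lies on the line segment between them.
on_segment = mixture([Fraction(3, 10), Fraction(7, 10)], [(0, 0), (1, 1)])

print(center)
print(on_segment)
```

<p>Setting a weight to zero simply drops the corresponding point from the mixture.</p>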
<h1 id="convex-hulls">Convex Hulls</h1>
<p>It turns out that the set of possible probability assignments is convex.</p>
<p>More than that, it’s the convex set generated by the possible truth-value assignments, in a certain way. It’s the “convex hull” of the possible truth-value assignments.</p>
<p>What in the world is a “convex hull”?</p>
<p>Imagine some points in the plane—the corners of a square, for example. Now imagine stretching a rubber band around those points and letting it snap tight. The shape you get is the square with those points as corners. And the set of points enclosed by the rubber band is a convex set. Take any two points inside the square, or on its boundary, and draw the straight line between them. The line will not leave the square.</p>
<p>Intuitively, the convex hull of a set of points in the plane is the set enclosed by the rubber band exercise. Formally, the convex hull of a set of points is the set of points that can be obtained from them as a mixture. (And this definition works in any number of dimensions.)</p>
<p>For example, any of the points in our square example can be obtained by taking a mixture of the vertices. Take the center of the square: it’s halfway between the bottom left and top right corners. To get something to the left of that we can mix in some of the top left corner (and correspondingly less of the top right). And so on.</p>
<p>Now imagine the rubber band exercise using the possible truth-value assignments, instead of the corners of a square. In two dimensions, those are the points $(0,1)$ and $(1,0)$. And when you let the band snap tight, you get the diagonal line connecting them. As we saw way back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">our first post</a>, the points on that diagonal line are the possible probability assignments.</p>
<h1 id="peeking-ahead">Peeking Ahead</h1>
<p>We also saw that if you take any point <em>not</em> on that diagonal, the line from it to the closest point on the diagonal meets the diagonal at a right angle. That’s what lets us do some basic geometric reasoning to see that there’s a point on the line that’s closer to both vertices than the point off the line:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" /></p>
<p>That fact about closest points and right angles is what’s going to enable us to generalize the argument beyond two dimensions. If you take any point not on a convex hull, there’s a point on the convex hull (namely the closest point) which forms a right (or obtuse) angle with the other points on the hull.</p>
<p>Consider the three dimensional case. The possible truth-value assignments are $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/Three Vertices.png" alt="" />
And when you let a rubber band snap tight around them, it encloses the triangular surface connecting them:
<img src="http://jonathanweisberg.org/img/accuracy/Three Vertices with Hull.png" alt="" />
That’s the set of probability assignments for three outcomes.</p>
<p>Now take any point that’s not on that triangular surface. Drop a straight line to the closest point on the surface. Then draw another straight line from there to one of the triangle’s vertices. These two straight lines will form a right or obtuse angle. So the distance from the first, off-hull point to the vertex is greater than the distance from the second, on-hull point to the vertex.</p>
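<p>Here is a small numerical illustration of that three-dimensional picture, as a Python sketch. The off-hull point $(6/10, 5/10, 2/10)$ is an arbitrary choice for this example. It is picked so that its perpendicular projection onto the plane $x + y + z = 1$ lands inside the triangle, which makes that projection the closest point on the hull. In this easy case the angles come out exactly right; the obtuse case arises when the closest point sits on the triangle’s edge, which this sketch does not cover.</p>

```python
from fractions import Fraction as F
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def length(a):
    return math.sqrt(dot(a, a))

x = (F(6, 10), F(5, 10), F(2, 10))      # off the hull: coordinates sum to 13/10
shift = (sum(x) - 1) / 3                 # equal amount to remove from each coordinate
y = tuple(c - shift for c in x)          # perpendicular projection onto x + y + z = 1

vertices = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# A zero dot product means the two lines meet at a right angle.
angle_checks = [dot(sub(x, y), sub(v, y)) for v in vertices]
dist_from_x = [length(sub(x, v)) for v in vertices]
dist_from_y = [length(sub(y, v)) for v in vertices]

print(y)
print(angle_checks)
print(dist_from_x)
print(dist_from_y)
```

<p>Each distance from $y$ to a vertex comes out smaller than the corresponding distance from $x$, as the Pythagorean theorem leads us to expect.</p>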
<p>Essentially the same reasoning works in any number of dimensions. But to make it work, we need to do three things.</p>
<ol>
<li>Prove that the probability assignments always form a convex hull around the possible truth-value assignments.</li>
<li>Prove that, given any point outside a convex hull, the closest point on the hull forms a right angle (or an obtuse angle) with any other point on the hull.</li>
<li>Prove that the point off the hull is further from all the vertices than the closest point on the hull.</li>
</ol>
<p>This post is dedicated to the first item.</p>
<h1 id="the-convexity-theorem">The Convexity Theorem</h1>
<p>We’re going to prove that the set of possible probability assignments is the same as the convex hull of the possible truth-value assignments. First let’s get some notation in place.</p>
<h2 id="notation">Notation</h2>
<p>As usual $n$ is the number of possible outcomes under consideration. So each possible truth-value assignment is a point of $n$ coordinates, with a single $1$ and $0$ everywhere else. For example, if $n = 4$ then $(0, 0, 1, 0)$ represents the case where the third possibility obtains.</p>
<p>We’ll write $V$ for the set of all possible truth-value assignments. And we’ll write $\v_1, \ldots, \v_n$ for the elements of $V$. The first element $\v_1$ has its $1$ in the first coordinate, $\v_2$ has its $1$ in the second coordinate, etc.</p>
<p>We’ll use a superscript $^+$ for the convex hull of a set. So $V^+$ is the convex hull of $V$. It’s the set of all points that can be obtained by mixing members of $V$.</p>
<p>Recall, a mixture is a point obtained by taking nonnegative real numbers $\lambda_1, \ldots, \lambda_n$ that sum to one, and multiplying each one against the corresponding $\v_i$ and then summing up:
$$ \lambda_1 \v_1 + \lambda_2 \v_2 + \ldots + \lambda_n \v_n. $$
So $V^+$ is the set of all points that can be obtained by this method. Each choice of values $\lambda_1, \ldots, \lambda_n$ generates a member of $V^+$. (To exclude one of the $\v_i$’s from a mixture, just set $\lambda_i = 0$.)</p>
<p>Finally, we’ll use $P$ for the set of all probability assignments. Recall: a probability assignment is a point of $n$ coordinates, where each coordinate is nonnegative, and all the coordinates together add up to one. That is, $\p = (p_1,\ldots,p_n)$ is a probability assignment just in case:</p>
<ul>
<li>$p_i \geq 0$ for all $i$, and</li>
<li>$p_1 + p_2 + \ldots + p_n = 1$.</li>
</ul>
<p>The set $P$ contains just those points $\p$ satisfying these two conditions.</p>
<h2 id="statement-and-proof">Statement and Proof</h2>
<p>In the notation just established, what we’re trying to show is that $V^+ = P$.</p>
<p><strong>Theorem.</strong> $V^+ = P$. That is, the convex hull of the possible truth-value assignments just is the set of possible probability assignments.</p>
<p><em>Proof.</em> Let’s first show that $V^+ \subseteq P$.</p>
<p>Notice that a truth-value assignment is also a probability assignment. Its coordinates are always $1$ or $0$, so all coordinates are nonnegative. And since it has only a single coordinate with value $1$, its coordinates add up to $1$.</p>
<p>But we have to show that any mixture of truth-value assignments is also a probability assignment. So let $\lambda_1, \ldots, \lambda_n$ be nonnegative numbers that sum to $1$. If we multiply $\lambda_i$ against a truth-value assignment $\v_i$, we get a point with $0$ in every coordinate except the $i$-th coordinate, which has value $\lambda_i$. For example, $\lambda_3 \times (0, 0, 1, 0) = (0, 0, \lambda_3, 0)$. So the mixture that results from $\lambda_1, \ldots, \lambda_n$ is:
$$
\lambda_1 \v_1 + \lambda_2 \v_2 + \ldots + \lambda_n \v_n = (\lambda_1, \lambda_2, \ldots, \lambda_n).
$$
And this mixture has coordinates that are all nonnegative and sum to $1$, by hypothesis. In other words, it is a probability assignment.</p>
<p>So we turn to showing that $P \subseteq V^+$. In other words, we want to show that every probability assignment can be obtained as a mixture of the $\v_i$’s.</p>
<p>So take an arbitrary probability assignment $\p \in P$, where $\p = (p_1, \ldots, p_n)$. Let the $\lambda_i$’s be the probabilities that $\p$ assigns to each $i$: $\lambda_1 = p_1$, $\lambda_2 = p_2$, and so on. Then, by the same logic as in the first part of the proof:
$$ \lambda_1 \v_1 + \ldots + \lambda_n \v_n = (p_1, \ldots, p_n). $$
In other words, $\p$ is a mixture of the possible truth-value assignments, where the weights in the mixture are just the probability values assigned by $\p$. <span style="float: right;">$\Box$</span></p>
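<p>Both halves of the proof turn on one observation: mixing the $\v_i$’s with weights $\lambda_1, \ldots, \lambda_n$ returns the point $(\lambda_1, \ldots, \lambda_n)$ itself. A short Python sketch can spot-check this for $n = 4$. The function names and the sample weights are assumptions made for illustration, and a spot check is of course no substitute for the proof:</p>

```python
from fractions import Fraction

def truth_value_assignments(n):
    """The n points with a single 1 and 0 everywhere else."""
    return [tuple(1 if k == i else 0 for k in range(n)) for i in range(n)]

def mix(weights, points):
    """Weighted sum of the points, coordinate by coordinate."""
    dim = len(points[0])
    return tuple(sum(w * pt[k] for w, pt in zip(weights, points)) for k in range(dim))

def is_probability_assignment(p):
    """Nonnegative coordinates that sum to one."""
    return all(x >= 0 for x in p) and sum(p) == 1

V = truth_value_assignments(4)
lam = (Fraction(1, 2), Fraction(1, 4), Fraction(1, 4), Fraction(0))

m = mix(lam, V)
print(m == lam)                      # the mixture reproduces the weights
print(is_probability_assignment(m))  # and the weights form a probability assignment
```

<p>Read left to right, this is the $V^+ \subseteq P$ direction. Read right to left, with the weights taken from a given $\p$, it is the $P \subseteq V^+$ direction.</p>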
<h1 id="up-next">Up Next</h1>
<p>We’ve established the first of the three items listed earlier. Next time we’ll establish the second: given a point outside a convex set, there’s always a point inside that forms a right or obtuse angle with any other point of the set. Then we’ll be just a few lines of algebra from the main result: the Brier dominance theorem!</p>
Journals as Ratings Agencies
http://jonathanweisberg.org/post/Journals%20as%20Ratings%20Agencies/
Thu, 30 Mar 2017 15:27:04 -0500http://jonathanweisberg.org/post/Journals%20as%20Ratings%20Agencies/
<p>Starting in July, philosophy’s two most prestigious journals won’t reject submitted papers anymore. Instead they’ll “grade” every submission, assigning a rating on the familiar letter-grade scale (A+, A, A-, B+, B, B-, etc.).</p>
<p>They will, in effect, become ratings agencies.</p>
<p>They’ll still publish papers. Those rated A- or higher can be published in the journal, if the authors want. Or they can seek another venue, if they think they can do better.</p>
<p>I just made that up. But imagine if it were true—especially if a bunch of journals did this. How would it change philosophy’s publication game?</p>
<p>Well we’d save a lot of wasted labour, for one thing. And we’d discourage frivolous submissions, for another.</p>
<h1 id="the-bad">The Bad</h1>
<p>Under the current arrangement, the system is sagging low under the weight of premature, mediocre, even low-quality submissions. (I’d say it’s even creaking and cracking.) Editors scrounge miserably for referees, and referees frantically churn out reports and recommendations, mostly for naught.</p>
<p>In a typical case, the editor rejects the submission and the referees’ reports are filed away in a database, never to be read again. Maybe the author makes substantial revisions, but very likely they don’t—especially if the paper’s main idea is the real limiting factor. The process repeats at another journal, often at several more journals. And in the end all the philosophical public sees is: accepted at <em>International Journal of Such & Such Studies</em>.</p>
<p>Of all the people who’ve read and assessed the paper by that point, only two have their assessments directly broadcast to the public. And even then, only the “two thumbs more-or-less up” part of the signal gets out.</p>
<p>Yet five, eight, or even ten people have weighed in on the paper by then. They’ve thought about its strengths and weaknesses, and they’ve generated valuable insights and assessments that could save others time and trouble. Yet only the handling editors and the authors get the direct benefit of that labour.</p>
<p>The current system even encourages authors to waste editors’ and referees’ time. Unless they’re in a rush, authors can start at the top of the journal-prestige hierarchy and work their way down. You don’t even have to perfect your paper before starting this incredibly inefficient process. With so many journals to try, you’ll basically get unlimited kicks at the can. So you might as well let the referees do your homework for you.</p>
<p>(This doesn’t apply to all authors, obviously. Some work in areas that severely limit their can-kicking. And many <em>are</em> in a rush, to get jobs and tenure.)</p>
<h1 id="the-good">The Good</h1>
<p>But, if a paper were publicly assigned a grade every place it was submitted, authors might be more realistic in deciding where to submit. They might also wait until their paper is truly ready for public consumption before imposing on editors and referees.</p>
<p>Readers would also benefit from seeing a paper’s transcript. Not only could it inform their decision about whether to read the paper, it could aid their sense of how its contribution is received by peers and experts.</p>
<p>Referees would also have better incentives, to take on referee work and to be more diligent about it. They would know that their labour would have a greater impact, and that their assessment would have a more lasting effect.</p>
<p>Editors could even limit submissions based on their grade-history, e.g. “no submissions already graded by two other journals”, or “no submissions with an average grade less than a B”. (Ideally, different journals would have different policies here, to allow some variety.)</p>
<h1 id="the-ugly">The Ugly</h1>
<p>Of course, several high-profile journals would have to take the lead to make this kind of thing happen. And there would have to be strong norms within the discipline about publicizing grades: requiring they be listed alongside the paper on CVs and websites, for example.</p>
<p>And there would be costs.</p>
<p>Everybody has their favourite story about the groundbreaking paper that got rejected five times, but was finally published in <em>The Posh Journal of Philosophy Review</em>, and has since been cited a gajillion times. Such papers could be weighed down by having their grade-transcripts publicized. (On the plus side, we could have a new genre of great paper: the cult classic!)</p>
<p>Also, some authors have to rely on referee feedback more than others, because of their limited philosophical networks. They’d likely find their papers with longer, more checkered grade-transcripts, exacerbating an existing injustice.</p>
<p>And, in the end, the present proposal might only be a band-aid. If there really is an oversubmission problem in academic philosophy (as I suspect there is), it’s probably caused by increased pressure to publish—because jobs are scarce, and administrators demand it, for example. Turning journals into ratings agencies wouldn’t relieve that pressure, even if it would help to manage some of its bad effects.</p>
<h1 id="decision-r-r">Decision: R&R</h1>
<p>In the end, I’m undecided about this proposal. I think it has some very attractive features, but the costs give me pause (much the same as the alternatives I’m aware of, like <a href="http://davidfaraci.com/populus" target="_blank">Populus</a>). I’m only certain that we can’t keep going as we have been; it won’t end well.</p>
Accuracy for Dummies, Part 4: Euclid in the Round
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%204/
Thu, 23 Feb 2017 00:00:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%204/
<p>Last time we took Brier distance beyond two dimensions. <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 3/">We showed</a> that it’s “proper” in any finite number of dimensions. Today we’ll show that Euclidean distance is “improper” in any finite number of dimensions.</p>
<p>When I first sat down to write this post, I had in mind a straightforward generalization of <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">our previous result</a> for Euclidean distance in two dimensions. And I figured it would be easy to prove.</p>
<p>Not so.</p>
<p>My initial conjecture was false, and worse, when I asked my accuracy-guru friends for the truth, nobody seemed to know. (They did offer lots of helpful suggestions, though.)</p>
<p>So today we’re muddling through on our own even more than usual. Here goes.</p>
<h1 id="background">Background</h1>
<p>Let’s recall where we are. We’ve been considering different ways of measuring the inaccuracy of a probability assignment given a possibility, or a “possible world”.</p>
<p>Let’s start today by regimenting our terminology. We’ve used these terms semi-formally for a while now. But let’s gather them here for reference, and to make them a little more precise.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\p}{\vec{p}}
\newcommand{\q}{\vec{q}}
\newcommand{\u}{\vec{u}}
\newcommand{\EIpq}{EI_{\p}(\q)}
\newcommand{\EIpp}{EI_{\p}(\p)}
$</p>
<p>Given a number of dimensions $n$:</p>
<ul>
<li>A <em>probability assignment</em> $\p = (p_1, \ldots, p_n)$ is a vector of nonnegative real numbers that sum to $1$.</li>
<li>A <em>possible world</em> is a vector $\u$ of length $n$ containing all zeros except for a single $1$. (A <a href="https://en.wikipedia.org/wiki/Unit_vector" target="_blank">unit vector</a> of length $n$, in other words.)</li>
<li>A <em>measure of inaccuracy</em> $D(\p, \u)$ is a function that takes a probability assignment and a possible world and returns a real number.</li>
</ul>
<p>We’ve been considering two measures of inaccuracy. The first is the familiar Euclidean distance between $\p$ and $\u$. For example, when $\u = (1, 0, \ldots, 0)$ we have:
$$ \sqrt{(p_1 - 1)^2 + (p_2 - 0)^2 + \ldots + (p_n - 0)^2}.$$
The second way of measuring inaccuracy is less familiar, Brier distance, which is just the square of Euclidean distance:
$$ (p_1 - 1)^2 + (p_2 - 0)^2 + \ldots + (p_n - 0)^2.$$</p>
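<p>In code, the two measures differ by a single square root. Here is a minimal Python sketch; the function names <code>euclidean</code> and <code>brier</code> are illustrative choices:</p>

```python
import math

def brier(p, u):
    """Sum of squared differences between assignment p and possible world u."""
    return sum((p_i - u_i) ** 2 for p_i, u_i in zip(p, u))

def euclidean(p, u):
    """Ordinary straight-line distance: the square root of the Brier distance."""
    return math.sqrt(brier(p, u))

p = (2 / 3, 1 / 3)
u = (1, 0)

print(euclidean(p, u))   # sqrt(2)/3
print(brier(p, u))       # 2/9
```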
<p>What we found in $n = 2$ dimensions is that Euclidean distance is “unstable” in a way that Brier is not. If we measure inaccuracy using Euclidean distance, a probability assignment can expect some <em>other</em> probability assignment to do better accuracy-wise, i.e. to have lower inaccuracy.</p>
<p>In fact, given almost any probability assignment, the way to minimize expected inaccuracy is to leap to certainty in the most likely possibility. Given $(2/3, 1/3)$, for example, the way to minimize expected inaccuracy is to move to $(1,0)$.</p>
<p>Because Euclidean distance is unstable in this way, it’s called an “improper” measure of inaccuracy. So, two more bits of terminology:</p>
<ul>
<li>Given a probability assignment $\p$ and a measure of inaccuracy $D$, the <em>expected inaccuracy</em> of probability assignment $\q$, written $\EIpq$, is the weighted sum:
$$
\EIpq = p_1 D(\q,\u_1) + \ldots + p_n D(\q,\u_n),
$$
where $\u_i$ is the possible world with a $1$ at index $i$.</li>
<li>A measure of inaccuracy $D$ is <em>improper</em> if there is a probability assignment $\p$ such that for some assignment $\q \neq \p$, $\EIpq < \EIpp$ when inaccuracy is measured according to $D$.</li>
</ul>
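<p>Both definitions can be sketched in Python. The names below are assumptions made for illustration. The example revisits the two-dimensional assignment $(2/3, 1/3)$ and compares staying put with leaping to $(1, 0)$ under each measure:</p>

```python
import math

def brier(q, u):
    return sum((q_i - u_i) ** 2 for q_i, u_i in zip(q, u))

def euclidean(q, u):
    return math.sqrt(brier(q, u))

def worlds(n):
    """The n possible worlds: a single 1 at index i, zeros elsewhere."""
    return [tuple(1.0 if k == i else 0.0 for k in range(n)) for i in range(n)]

def expected_inaccuracy(p, q, distance):
    """Inaccuracy of q in each world, weighted by the probability p gives that world."""
    return sum(p_i * distance(q, u_i) for p_i, u_i in zip(p, worlds(len(p))))

p = (2 / 3, 1 / 3)
certain = (1.0, 0.0)

euclid_stay = expected_inaccuracy(p, p, euclidean)
euclid_leap = expected_inaccuracy(p, certain, euclidean)
brier_stay = expected_inaccuracy(p, p, brier)
brier_leap = expected_inaccuracy(p, certain, brier)

print(euclid_stay, euclid_leap)   # leaping looks better under Euclidean distance
print(brier_stay, brier_leap)     # staying put looks better under Brier distance
```

<p>So this one pair $\p$, $\q$ already witnesses the impropriety of Euclidean distance in two dimensions, while Brier distance passes the same test.</p>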
<p>Last time we showed that Brier is <em>proper</em> in any finite number of dimensions $n$. Today our main task is to show that Euclidean distance is <em><strong>im</strong>proper</em> in any finite number of dimensions $n$.</p>
<p>But first, let’s get a tempting mistake out of the way.</p>
<h1 id="a-conjecture-and-its-refutation">A Conjecture and Its Refutation</h1>
<p>In <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">our first post</a>, we saw that Euclidean distance isn’t just improper in two dimensions. It’s also <em>extremizing</em>: the assignment $(2/3, 1/3)$ doesn’t just expect <em>some</em> other assignment to do better accuracy-wise. It expects the assignment $(1,0)$ to do best!</p>
<p>At first I thought we’d be proving a straightforward generalization of that result today:</p>
<p><strong>Conjecture 1 (False).</strong> Let $(p_1, \ldots, p_n)$ be a probability assignment with a unique largest element $p_i$. If we measure inaccuracy by Euclidean distance, then $\EIpq$ is minimized when $\q = \u_i$.</p>
<p>Intuitively: expected inaccuracy is minimized by leaping to certainty in the most probable possibility. Turns out this is false in three dimensions. Here’s a</p>
<p><strong>Counterexample.</strong> Let’s define:
$$
\begin{align}
\p &= (5/12, 4/12, 3/12),\\<br />
\p’ &= (6/12, 4/12, 2/12),\\<br />
\u_1 &= (1, 0, 0).
\end{align}
$$</p>
<p>Then we can calculate (or better, <a href="https://github.com/jweisber/a4d/blob/master/Euclid%20in%20the%20Round.nb" target="_blank">have <em>Mathematica</em> calculate</a>):
$$
\begin{align}
\EIpp &\approx .804,\\<br />
EI_{\p}(\p’) &\approx .800,\\<br />
EI_{\p}(\u_1) &\approx .825.
\end{align}
$$
In this case $\EIpp < EI_{\p}(\u_1)$. So leaping to certainty doesn’t minimize expected inaccuracy (as measured by Euclidean distance).</p>
<p>Of course, staying put doesn’t minimize it either, since $EI_{\p}(\p’) < \EIpp$.</p>
<p>So what <em>does</em> minimize it in this example? I asked <em>Mathematica</em> to minimize $\EIpq$ and got… nothing for days. Eventually I gave up waiting and asked instead for <a href="https://github.com/jweisber/a4d/blob/master/Euclid%20in%20the%20Round.nb" target="_blank">a numerical approximation of the minimum</a>. One second later I got:</p>
<p>$$EI_{\p}(0.575661, 0.250392, 0.173947) \approx 0.797432.$$</p>
<p>I have no idea what that is in more meaningful terms, I’m sorry to say. But at least we know it’s not anywhere near the extreme point $\u_1$ I conjectured at the outset. (See the <strong>Update</strong> at the end for a little more.)</p>
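<p>The figures in the counterexample are easy to reproduce without <em>Mathematica</em>. The following Python sketch is an independent check, not a port of the linked notebook, and its variable names are illustrative. It recomputes the three expected inaccuracies and evaluates the function at the numerically reported minimum. It does not search for the minimum itself.</p>

```python
import math

def expected_euclidean(p, q):
    """Expected Euclidean inaccuracy of q, by the lights of p."""
    n = len(p)
    total = 0.0
    for i in range(n):
        world = tuple(1.0 if k == i else 0.0 for k in range(n))
        total += p[i] * math.dist(q, world)
    return total

p = (5 / 12, 4 / 12, 3 / 12)
p_prime = (6 / 12, 4 / 12, 2 / 12)
u_1 = (1.0, 0.0, 0.0)
reported_minimum = (0.575661, 0.250392, 0.173947)

stay = expected_euclidean(p, p)
nudge = expected_euclidean(p, p_prime)
leap = expected_euclidean(p, u_1)
at_reported = expected_euclidean(p, reported_minimum)

print(round(stay, 3), round(nudge, 3), round(leap, 3))
print(round(at_reported, 6))
```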
<h1 id="a-shortcut-and-its-shortcomings">A Shortcut and Its Shortcomings</h1>
<p>So I asked friends who do this kind of thing for a living how they handle the $n$-dimensional case. A couple of them suggested taking a shortcut around it!</p>
<blockquote>
<p>Look, you’ve already handled the two-dimensional case. And that’s just an instance of higher dimensional cases.</p>
<p>Take a probability assignment like (2/3, 1/3). We can also think of it as (2/3, 1/3, 0), or as (2/3, 0, 1/3, 0), etc.</p>
<p>No matter how many zeros we sprinkle around in there, the same thing is going to happen as in the two-dimensional case. Leaping to certainty in the 2/3 possibility will minimize expected inaccuracy. (Because possibilities with no probability make no difference to expected value calculations.)</p>
<p>So no matter how many dimensions we’re working in, there will always be <em>some</em> probability assignment where leaping to certainty minimizes expected inaccuracy. It just might have lots of zeros in it.</p>
<p>So Euclidean distance is, technically, improper in any finite number of dimensions.</p>
</blockquote>
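<p>The shortcut’s central claim can be checked by brute force. The Python sketch below pads $(2/3, 1/3)$ with a zero and searches a grid of three-dimensional probability assignments for the one with the lowest expected Euclidean inaccuracy. The grid resolution and the names are arbitrary choices for this example, and a grid search only compares the points on the grid.</p>

```python
import math

def expected_euclidean(p, q):
    """Expected Euclidean inaccuracy of q, by the lights of p."""
    n = len(p)
    return sum(
        p[i] * math.dist(q, tuple(1.0 if k == i else 0.0 for k in range(n)))
        for i in range(n)
    )

p = (2 / 3, 1 / 3, 0.0)   # the two-dimensional assignment with a zero sprinkled in

# Every assignment whose coordinates are multiples of 1/N.
N = 30
grid = [
    (a / N, b / N, (N - a - b) / N)
    for a in range(N + 1)
    for b in range(N + 1 - a)
]

best = min(grid, key=lambda q: expected_euclidean(p, q))
print(best)
```

<p>On this grid the search lands on $(1, 0, 0)$: certainty in the possibility that had probability $2/3$.</p>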
<p>At first I thought that was good enough for philosophy, though I still wanted to know how to handle “no zeros” cases for the sake of mathematical clarity.</p>
<p>Then I realized there may be a philosophical reason to be dissatisfied with this shortcut. A lot of people endorse the <a href="http://philosophy.anu.edu.au/sites/default/files/Staying%20Regular.December%2028.2012.pdf" target="_blank">Regularity principle</a>: you should never assign zero probability to any possibility. For these people, the shortcut might be a dead end.</p>
<p>(Of course, maybe we shouldn’t embrace Regularity if we’re working in the accuracy framework. I won’t stop for that question here.)</p>
<h1 id="a-theorem-and-its-corollary">A Theorem and Its Corollary</h1>
<p>So let’s take the problem head on. We want to show that Euclidean distance is improper in $n > 2$ dimensions, even when there are “no zeros”. Two last bits of terminology:</p>
<ul>
<li>A probability assignment $(p_1, \ldots, p_n)$ is <em>regular</em> if $p_i > 0$ for all $i$.</li>
<li>A probability assignment $(p_1, \ldots, p_n)$ is <em>uniform</em> if $p_i = p_j$ for all $i,j$.</li>
</ul>
<p>So, for example, the assignment $(1/3, 1/3, 1/3)$ is both regular and uniform. Whereas the assignment $(2/5, 2/5, 1/5)$ is regular, but not uniform.</p>
<p>What we’ll show is that assignments like $(2/5, 2/5, 1/5)$ make Euclidean distance “unstable”: they expect some other assignment to do better, accuracy-wise. (Exactly which other assignment they’ll expect to do best isn’t always easy to say.)</p>
<p>(Though I try to keep the math in these posts as elementary as possible, this proof will use calculus. If you know a bit about derivatives, you should be fine. Technically we’ll use multi-variable calculus. But if you’ve worked with derivatives in single-variable calculus, that should be enough for the main ideas.)</p>
<p><strong>Theorem.</strong>
Let $\p = (p_1, \ldots, p_n)$ be a regular, non-uniform probability assignment. If inaccuracy is measured by Euclidean distance, then $EI_{\p}(\q)$ is not minimized when $\q = \p$.</p>
<p><em>Proof.</em>
Let $\p = (p_1, \ldots, p_n)$ be a regular and non-uniform probability assignment, and measure inaccuracy using Euclidean distance. Then:
$$
\begin{align}
EI_{\p}(\q) &= p_1 \sqrt{(q_1 - 1)^2 + \ldots + (q_n - 0)^2} + \ldots + p_n \sqrt{(q_1 - 0)^2 + \ldots + (q_n - 1)^2}\\<br />
&= p_1 \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \sqrt{q_1^2 + \ldots + (q_n - 1)^2}
\end{align}
$$</p>
<p>The crux of our proof will be that the derivatives of this function are non-zero at the point $\q = \p$. Since the minimum of a function is always a <a href="https://en.wikipedia.org/wiki/Critical_point_(mathematics)" target="_blank">“critical point”</a>, that suffices to show that $\q = \p$ is not a minimum of $\EIpq$.</p>
<p>To start, we calculate the partial derivative of $\EIpq$ for an arbitrary $q_i$:
$$
\begin{align}
\frac{\partial}{\partial q_i} \EIpq
&=
\frac{\partial}{\partial q_i} \left( p_1 \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \sqrt{q_1^2 + \ldots + (q_n - 1)^2} \right)\\<br />
&=
p_1 \frac{\partial}{\partial q_i} \sqrt{(q_1 - 1)^2 + \ldots + q_n^2} + \ldots + p_n \frac{\partial}{\partial q_i} \sqrt{q_1^2 + \ldots + (q_n - 1)^2}\\<br />
&= \quad
p_i \frac{q_i - 1}{\sqrt{(q_i - 1)^2 + \sum_{j \neq i} q_j^2}} + \sum_{j \neq i} p_j \frac{q_i}{\sqrt{(q_j - 1)^2 + \sum_{k \neq j} q_k^2}}\\<br />
&= \quad
\sum_{j \neq i} \frac{p_j q_i}{\sqrt{(q_j - 1)^2 + \sum_{k \neq j} q_k^2}} - \sum_{j \neq i} \frac{p_i q_j}{\sqrt{(q_i - 1)^2 + \sum_{k \neq i} q_k^2}}.
\end{align}
$$</p>
<p>The last step uses the fact that $\q$ is a probability assignment: its coordinates sum to $1$, so $q_i - 1 = -\sum_{j \neq i} q_j$.</p>
<p>Then we evaluate at $\q = \p$:
$$
\begin{align}
\frac{\partial}{\partial q_i} \EIpp
&= \sum_{j \neq i} \frac{p_i p_j}{\sqrt{(p_j - 1)^2 + \sum_{k \neq j} p_k^2}} - \sum_{j \neq i} \frac{p_i p_j}{\sqrt{(p_i - 1)^2 + \sum_{k \neq i} p_k^2}}
\end{align}
$$</p>
<p>Now, because $\p$ is not uniform, some of its elements are larger than others. And because it is finite, there is at least one largest element. When $p_i$ is one of these largest elements, then $\partial / \partial q_i \EIpp$ is negative.</p>
<p>Why?</p>
<p>In our equation for $\partial / \partial q_i \EIpp$, each positive term has a corresponding negative term whose numerator is identical. And when $p_i$ is a largest element of $\p$, the denominator of each negative term will never be larger, but will sometimes be smaller, than the denominator of its corresponding positive term. Subtracting $1$ from $p_i$ before squaring does more to reduce the sum of squares $p_i^2 + \sum_{j \neq i} p_j^2$ than subtracting $1$ from any smaller term would. It effectively removes the/a largest square from the sum and substitutes the smallest replacement. So the negative terms are never smaller, but are sometimes larger, than their positive counterparts.</p>
<p>If, on the other hand, $p_i$ is one of the smallest elements, then $\partial / \partial q_i \EIpp$ is positive. For then the reverse argument applies: the denominator of each negative term will never be smaller and will sometimes be larger than the denominator of the corresponding positive term. So the negative terms are never larger, but are sometimes smaller, than their positive counterparts.</p>
<p>We have shown that the partial derivatives of $\EIpq$ are non-zero at the point $\q = \p$. Thus $\p$ is not a critical point of $\EIpq$, and hence cannot be a minimum of $\EIpq$. <span class="floatright">$\Box$</span></p>
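<p>The sign pattern in the proof can be spot-checked numerically. The Python sketch below estimates each partial derivative at $\q = \p$ by central finite differences, for the regular, non-uniform assignment $(2/5, 2/5, 1/5)$. The step size and the names are assumptions made for this example. As in the proof, each coordinate is varied on its own while the others are held fixed.</p>

```python
import math

def expected_euclidean(p, q):
    """Expected Euclidean inaccuracy of q, by the lights of p."""
    n = len(p)
    return sum(
        p[i] * math.dist(q, tuple(1.0 if k == i else 0.0 for k in range(n)))
        for i in range(n)
    )

def partial(p, q, i, h=1e-6):
    """Central-difference estimate of the partial derivative with respect to q_i."""
    up = tuple(x + h if k == i else x for k, x in enumerate(q))
    down = tuple(x - h if k == i else x for k, x in enumerate(q))
    return (expected_euclidean(p, up) - expected_euclidean(p, down)) / (2 * h)

p = (0.4, 0.4, 0.2)
gradient_at_p = [partial(p, p, i) for i in range(len(p))]
print(gradient_at_p)
```

<p>The estimates are negative for the two largest elements and positive for the smallest one, so none of them is zero.</p>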
<p><strong>Corollary.</strong> Euclidean distance is improper in any finite number of dimensions.</p>
<p><em>Proof.</em> This is just a slight restatement of our theorem. If $\q = \p$ is not a minimum of $\EIpq$, then there is some $\q \neq \p$ such that $\EIpq < \EIpp$. <span class="floatright">$\Box$</span></p>
<h1 id="conjectures-awaiting-refutations">Conjectures Awaiting Refutations</h1>
<p>Notice, we’ve also shown something a bit stronger. We showed that the slope of $\EIpq$ at the point $\q = \p$ is always negative in the direction of $\p$’s largest element(s), and positive in the direction of its smallest element(s). That means we can always reduce expected inaccuracy by taking some small quantity away from the/a smallest element of $\p$ and adding it to the/a largest element. In other words, we can always reduce expected inaccuracy by moving <em>some</em> way towards perfect certainty in the/a possibility that $\p$ rates most probable.</p>
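<p>Here is that observation as a quick Python sketch: shift a small amount $\epsilon$ from the smallest element of $(2/5, 2/5, 1/5)$ to one of its largest elements, and compare expected inaccuracies. The two values of $\epsilon$ and the names are arbitrary choices for illustration.</p>

```python
import math

def expected_euclidean(p, q):
    """Expected Euclidean inaccuracy of q, by the lights of p."""
    n = len(p)
    return sum(
        p[i] * math.dist(q, tuple(1.0 if k == i else 0.0 for k in range(n)))
        for i in range(n)
    )

def shift(p, eps, source, target):
    """Move eps of probability from index source to index target."""
    q = list(p)
    q[source] -= eps
    q[target] += eps
    return tuple(q)

p = (0.4, 0.4, 0.2)
baseline = expected_euclidean(p, p)

# Take from the smallest element (index 2), give to a largest element (index 0).
shifted = [expected_euclidean(p, shift(p, eps, 2, 0)) for eps in (0.001, 0.01)]

print(baseline)
print(shifted)
```

<p>Both shifted assignments have lower expected inaccuracy than $\p$ itself. This only illustrates the direction of improvement for small shifts; it says nothing about where the minimum lies.</p>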
<p>However, we <em>haven’t</em> shown that repeatedly minimizing expected inaccuracy will, eventually, lead to certainty in the/a possibility that was most probable to begin with. For one thing, we haven’t shown that moving towards certainty in this direction minimizes expected inaccuracy at each step. We’ve only shown that moving in this direction reduces it.</p>
<p>Still, I’m pretty sure a result along these lines holds. Tinkering in <em>Mathematica</em> strongly suggests that the following Conjectures are true in any finite number of dimensions $n$:</p>
<p><strong>Conjecture 2.</strong> If a probability assignment gives greater than $1/ 2$ probability to some possibility, then expected inaccuracy is minimized by assigning probability 1 to that possibility. (But see the <strong>Update</strong> below.)</p>
<p><strong>Conjecture 3.</strong> Given a non-uniform probability assignment, repeatedly minimizing expected inaccuracy will, within a finite number of steps, increase the probability of the/a possibility that was most probable initially beyond $1/ 2$.</p>
<p>If these conjectures hold, then there’s still a weak-ish sense in which Euclidean distance is “extremizing” in $n > 2$ dimensions. Given a non-uniform probability assignment, repeatedly minimizing expected inaccuracy will eventually lead to greater than $1/ 2$ probability in the/a possibility that was most probable to begin with. Then, minimizing inaccuracy will lead in a single step to certainty in that possibility.</p>
<p>Proving these conjectures would close much of the gap between the theorem we proved and the false conjecture I started with. If you’re interested, you can use <a href="https://github.com/jweisber/a4d/blob/master/Euclid%20in%20the%20Round.nb" target="_blank">this <em>Mathematica</em> notebook</a> to test them.</p>
<p><strong>Update: Mar. 6, 2017.</strong> Thanks to some excellent help from <a href="https://mathematics.stanford.edu/people/department-directory/name/jonathan-love/" target="_blank">Jonathan Love</a>, I’ve tweaked this post (and greatly simplified <a href="http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%203/">the previous one</a>).</p>
<p>I changed the counterexample to the false Conjecture 1, which used to be $\p = (3/7, 2/7, 2/7)$ and $\p' = (4/7, 2/7, 1/7)$. That works fine, but it’s potentially misleading.</p>
<p>As Jonathan kindly pointed out, the minimum point then is something quite nice. It’s obtained by moving in the $x$-dimension from $3/7$ to $\sqrt{3/7}$, and correspondingly reducing the probability in the $y$ and $z$ dimensions in equal parts.</p>
<p>But, in general, moving to the square root of the largest $p_i$ (when there is one) doesn’t minimize $\EIpq$. Even in the special case where all the other elements in the vector are equal, this doesn’t generally work.</p>
<p>Jonathan did solve that special case, though, and he found at least one interesting result connected with Conjecture 2. There appear to be cases where $p_i < 1/ 2$ for all $i$, and yet $\EIpq$ is still minimized by going directly to the extreme. For example, $\p = (.465, .2675, .2675)$.</p>
Editorial Gravity
http://jonathanweisberg.org/post/Editorial%20Gravity/
Wed, 22 Feb 2017 10:44:10 -0500http://jonathanweisberg.org/post/Editorial%20Gravity/
<p>We’ve all been there. One referee is positive, the other negative, and the editor decides to reject the submission.</p>
<p>I’ve heard it said editors tend to be conservative given the recommendations of their referees. And that jibes with my experience as an author.</p>
<p>So is there anything to it—is “editorial gravity” a real thing? And if it is, how strong is its pull? Is there some magic function editors use to compute their decision based on the referees’ recommendations?</p>
<p>In this post I’ll consider how things shake out at <a href="http://www.ergophiljournal.org/" target="_blank"><em>Ergo</em></a>.</p>
<h1 id="decision-rules">Decision Rules</h1>
<p><em>Ergo</em> doesn’t have any rule about what an editor’s decision should be given the referees’ recommendations. In fact, we explicitly discourage our editors from relying on any such heuristic. Instead we encourage them to rely on their judgment about the submission’s merits, informed by the substance of the referees’ reports.</p>
<p>Still, maybe there’s some natural law of journal editing waiting to be discovered here, or some unwritten rule.</p>
<p>Referees choose from four possible recommendations at <em>Ergo</em>: Reject, Major Revisions, Minor Revisions, or Accept. Let’s consider four simple rules we might use to predict an editor’s decision, given the recommendations of their referees.</p>
<ol>
<li>Max: the editor follows the recommendation of the most positive referee. (Ha!)</li>
<li>Mean: the editor “splits the difference” between the referees’ recommendations.
<ul>
<li>Accept + Major Revisions → Minor Revisions, for example.</li>
<li>When the difference is intermediate between possible decisions, we’ll stipulate that this rule “rounds down”.
<ul>
<li>Major Revisions + Minor Revisions → Major Revisions, for example.</li>
</ul></li>
</ul></li>
<li>Min: the editor follows the recommendation of the most negative referee.</li>
<li>Less-than-Min: the editor’s decision is a step more negative than either of the referees’.
<ul>
<li>Major Revisions + Minor Revisions → Reject, for example.</li>
<li>Except obviously that Reject + anything → Reject.</li>
</ul></li>
</ol>
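<p>The four rules can be sketched in a few lines of Python. This is only an illustration: the rank encoding (Reject = 1 up to Accept = 4) and the function name are assumptions of the sketch, and applying the Mean rule to more than two referees by flooring the average is my own extension of the "rounds down" stipulation.</p>

```python
RANKS = {"Reject": 1, "Major Revisions": 2, "Minor Revisions": 3, "Accept": 4}
LABELS = {rank: label for label, rank in RANKS.items()}

def predict(rule, recommendations):
    """Predict the editor's decision from the referees' recommendations."""
    ranks = [RANKS[r] for r in recommendations]
    if rule == "max":
        rank = max(ranks)
    elif rule == "mean":
        # Integer division floors the average, i.e. "rounds down".
        rank = sum(ranks) // len(ranks)
    elif rule == "min":
        rank = min(ranks)
    elif rule == "less-than-min":
        # One step more negative than the harshest referee, floored at Reject.
        rank = max(min(ranks) - 1, 1)
    else:
        raise ValueError(f"unknown rule: {rule}")
    return LABELS[rank]

print(predict("mean", ["Accept", "Major Revisions"]))
print(predict("less-than-min", ["Major Revisions", "Minor Revisions"]))
```

<p>The two calls reproduce the examples given in the list: Minor Revisions for the first, Reject for the second.</p>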
<p>Do any of these rules do a decent job of predicting editorial decisions? If so, which does best?</p>
<h1 id="a-test">A Test</h1>
<p>Let’s run the simplest test possible. We’ll go through the externally reviewed submissions in <em>Ergo</em>’s database and see how often each rule makes the correct prediction.</p>
<p><img src="http://jonathanweisberg.org/img/editorial_gravity_files/unnamed-chunk-2-1.png" alt="" /></p>
<p>Not only was Min the most accurate rule, its predictions were correct 85% of the time! (The sample size here is 233 submissions, by the way.) Apparently, editorial gravity is a real thing, at least at <em>Ergo</em>.</p>
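<p>For concreteness, here is roughly how such a tally could be computed. The records below are invented toy data, not <em>Ergo</em>’s, and the representation (a list of referee ranks paired with the editor’s decision rank, with Reject = 1 and Accept = 4) is an assumption of the sketch.</p>

```python
def accuracy(rule, records):
    """Fraction of submissions where the rule predicts the editor's decision.

    `rule` is any function from a list of referee ranks to a predicted rank.
    """
    hits = sum(1 for ranks, decision in records if rule(ranks) == decision)
    return hits / len(records)

# Invented toy records: (referee ranks, editor's decision rank).
toy_records = [([1, 3], 1), ([2, 3], 2), ([4, 3], 3), ([2, 4], 3)]

min_accuracy = accuracy(min, toy_records)  # Python's built-in min is the Min rule
max_accuracy = accuracy(max, toy_records)  # likewise for the Max rule
print(min_accuracy, max_accuracy)
```

<p>Passing the rule in as a function keeps the scoring code independent of which heuristic is being evaluated.</p>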
<p>Of course, <em>Ergo</em> might be atypical here. It’s a new journal, and online-only with no regular publication schedule. So there’s some pressure to play it safe, and no incentive to accept papers in order to fill space.</p>
<p>But let’s suppose for a moment that <em>Ergo</em> is typical as far as editorial gravity goes. That raises some questions. Here are two.</p>
<h1 id="two-questions">Two Questions</h1>
<p>First question: can we improve on the Min rule? Is there a not-too-complicated heuristic that’s even more accurate?</p>
<p>Visualizing our data might help us spot any patterns. Typically there are two referees, so we can plot most submissions on a plane according to the referees’ recommendations. Then we can colour them according to the editor’s decision. Adding a little random jitter to make all the points visible:</p>
<p><img src="http://jonathanweisberg.org/img/editorial_gravity_files/unnamed-chunk-3-1.png" alt="" /></p>
<p>To my eye this looks a lot like the pattern of concentric-corners you’d expect from the Min rule. Though not exactly, especially when the two referees strongly disagree—the top-left and bottom-right corners of the plot. Still, other than treating cases of strong disagreement as a tossup, no simple way of improving on the Min rule jumps out at me.</p>
<p>Second question: if editorial gravity is a thing, is it a good thing or a bad thing?</p>
<p>I’ll leave that as an exercise for the reader.</p>
<h1 id="technical-note">Technical Note</h1>
<p>This post was written in R Markdown and the source code is <a href="https://github.com/jweisber/rgo/blob/master/editorial gravity/editorial gravity.Rmd" target="_blank">available on GitHub</a>.</p>
Gender & Journal Referees
http://jonathanweisberg.org/post/Referee%20Gender/
Mon, 20 Feb 2017 09:34:10 -0500http://jonathanweisberg.org/post/Referee%20Gender/
<p>We looked at author gender in <a href="http://jonathanweisberg.org/post/Author Gender/">a previous post</a>; today let’s consider referees. Does their gender have any predictive value?</p>
<p>Once again our discussion only covers men and women because we don’t have the data to support a deeper analysis.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup></p>
<p>Using data from <a href="http://www.ergophiljournal.org/" target="_blank"><em>Ergo</em></a>, we’ll consider the following questions:</p>
<ol>
<li><em>Requests</em>. How are requests to referee distributed between men and women? Are men more likely to be invited, for example?</li>
<li><em>Responses</em>. Does gender inform a referee’s response to a request? Are women more likely to say ‘yes’, for example?</li>
<li><em>Response-speed</em>. Does gender inform how quickly a referee responds to an invitation (whether to agree or to decline)? Do men take longer to agree/decline an invitation, for example?</li>
<li><em>Completion-speed</em>. If a referee does agree to provide a report, does their gender inform how quickly they’ll complete that report? Do men and women tend to complete their reports in the same time-frame?</li>
<li><em>Recommendations</em>. Does gender inform how positive/negative a referee’s recommendation is? Are men and women equally likely to recommend that a submission be rejected, for example?</li>
<li><em>Influence</em>. Does a referee’s gender affect the influence of their recommendation on the editor’s decision? Are the recommendations of male referees more likely to be followed, for example?</li>
</ol>
<p>A quick overview of our data set: there are a total of 1526 referee-requests in <em>Ergo</em>’s database. But only 1394 are included in this analysis. I’ve excluded:</p>
<ol>
<li>Requests to review an invited resubmission, since these are a different sort of beast.</li>
<li>Pending requests and reports, since the data for these are incomplete.</li>
<li>A handful of cases where the referee’s gender is either unknown, or doesn’t fit the male/female classification.</li>
</ol>
<h1 id="requests">Requests</h1>
<p>How are requests distributed between men and women? 322 of our 1394 requests went to women, or 23.1% (1072 went to men, or 76.9%).</p>
<p>How does this compare to the way men and women are represented in academic philosophy in general? Different sources and different subpopulations yield a range of estimates.</p>
<p>At the low end, we saw in <a href="http://jonathanweisberg.org/post/Author Gender/">an earlier post</a> that about 15.3% of <em>Ergo</em>’s submissions come from women. The PhilPapers survey yields a range from 16.2% (<a href="https://philpapers.org/surveys/demographics.pl" target="_blank">all respondents</a>) to 18.4% (<a href="https://philpapers.org/surveys/demographics.pl?affil=Target+faculty&survey=8" target="_blank">“target” faculty</a>). And sources cited in <a href="http://www.faculty.ucr.edu/~eschwitz/SchwitzPapers/WomenInPhil-160315b.pdf" target="_blank">Schwitzgebel & Jennings</a> estimate the percentage of women faculty in various English speaking countries at 23% for Australia, 24% for the U.K., and 19–26% for the U.S.</p>
<p>So we have a range of baseline estimates from 15% to 26%. For comparison, the 95% confidence interval around our 23.1% finding is (21%, 25.4%).</p>
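<p>An interval like this can be reproduced with a Wilson score interval, sketched below. That the Wilson method (without continuity correction) is the one behind the figures in this post is an assumption on my part; it happens to agree with them to the reported precision. The function name is illustrative.</p>

```python
import math

def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.96 gives 95%)."""
    p_hat = successes / n
    denom = 1 + z ** 2 / n
    centre = (p_hat + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half_width, centre + half_width

# 322 of the 1394 requests went to women.
low, high = wilson_interval(322, 1394)
print(f"{low:.1%}, {high:.1%}")
```

<p>Unlike the simpler normal-approximation interval, the Wilson interval is not centred exactly on the observed proportion, which matters more as the proportion moves away from 50%.</p>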
<h1 id="responses">Responses</h1>
<p>Do men and women differ in their responses to these requests? Here are the raw numbers:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Agreed</th>
<th align="right">Declined / No Response / Canceled</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Female</td>
<td align="right">101</td>
<td align="right">221</td>
</tr>
<tr>
<td align="left">Male</td>
<td align="right">403</td>
<td align="right">669</td>
</tr>
</tbody>
</table>
<p>The final column calls for some explanation. I’m lumping together several scenarios here: (i) the referee responds to decline the request, (ii) the referee never responds, (iii) the editors cancel the request because it was made in error. Unfortunately, these three scenarios are hard to distinguish based on the raw data. For example, sometimes a referee declines by email rather than via our online system, and the handling editor then cancels the request instead of marking it as “Declined”.</p>
<p>With that in mind, here are the proportions graphically:</p>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-6-1.png" alt="" /></p>
<p>Men agreed more often than women: approximately 38% vs. 31%. And this difference is statistically significant.<sup class="footnote-ref" id="fnref:0"><a rel="footnote" href="#fn:0">2</a></sup></p>
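<p>For readers who want to check the arithmetic, here is a sketch of a chi-square test of independence on the 2×2 table above. It applies Yates’ continuity correction, which is an assumption about how the reported statistic was computed; the function name is illustrative. With one degree of freedom the p-value has a closed form in terms of the complementary error function, so no statistics library is needed.</p>

```python
import math

def chi_square_2x2(table, yates=True):
    """Chi-square test of independence for a 2x2 table of counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed - expected)
            if yates:
                # Continuity correction: shrink each deviation by 0.5.
                diff = max(diff - 0.5, 0.0)
            chi2 += diff ** 2 / expected
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Female, Male. Columns: Agreed, Declined / No Response / Canceled.
chi2, p_value = chi_square_2x2(((101, 221), (403, 669)))
print(round(chi2, 2), round(p_value, 2))
```

<p>Without the correction the statistic comes out somewhat larger, so the choice matters when the result sits this close to the conventional 0.05 threshold.</p>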
<p>Note that women and men accounted for about 20% and 80% of the “Agreed” responses, respectively. Whether this figure differs significantly from the gender makeup of “the general population” depends, as before, on the source and subpopulation we use for that estimate.</p>
<p>We saw that estimates of female representation ranged from roughly 15% to 26%. For comparison, the 95% confidence interval around our 20% finding is (16.8%, 23.8%).</p>
<h1 id="response-speed">Response-speed</h1>
<p>Do men and women differ in response-speed—in how quickly they respond to a referee request (whether to agree or to decline)?</p>
<p>The average response-time for women is 1.92 days, and for men it’s 1.58 days. This difference is not statistically significant.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup></p>
<p>A boxplot likewise suggests that men and women have similar interquartile ranges:</p>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-9-1.png" alt="" /><!-- --></p>
<h1 id="completion-speed">Completion-speed</h1>
<p>What about completion-speed: is there any difference in how long men and women take to complete their reports?</p>
<p>Women took 27.6 days on average, while men took 23.8 days. This difference is statistically significant.<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup></p>
<p>Does that mean men are more likely to complete their reports on time? Not necessarily. Here’s a frequency polygon showing when reports were completed:</p>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-11-1.png" alt="" /><!-- --></p>
<p>The spike at the four-week mark corresponds to the standard due date. We ask referees to submit their reports within 28 days of the initial request.</p>
<p>It looks like men had a stronger tendency to complete their reports early. But were they more likely to complete them on time?</p>
<p>One way to tackle this question is to look at how completed reports accumulate with time (the <a href="https://en.wikipedia.org/wiki/Empirical_distribution_function" target="_blank">empirical cumulative distribution</a>):</p>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-12-1.png" alt="" /><!-- --></p>
<p>As expected, the plot shows that men completed their reports early with greater frequency. But it also looks like women and men converged around the four-week mark, when reports were due.</p>
<p>Another way of approaching the question is to classify reports as either “On Time” or “Late”, according to whether they were completed before Day 29.</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">On Time</th>
<th align="right">Late</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Female</td>
<td align="right">50</td>
<td align="right">38</td>
</tr>
<tr>
<td align="left">Male</td>
<td align="right">242</td>
<td align="right">121</td>
</tr>
</tbody>
</table>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-14-1.png" alt="" /><!-- --></p>
<p>A chi-square test of independence then finds no statistically significant difference.<sup class="footnote-ref" id="fnref:6"><a rel="footnote" href="#fn:6">5</a></sup></p>
<p>Apparently men and women differed in their tendency to be early, but not necessarily in their tendency to be on time.</p>
<h1 id="recommendations">Recommendations</h1>
<p>Did male and female referees differ in their recommendations to the editors?</p>
<p><em>Ergo</em> offers referees four recommendations to choose from. The raw numbers:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Reject</th>
<th align="right">Major Revisions</th>
<th align="right">Minor Revisions</th>
<th align="right">Accept</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Female</td>
<td align="right">42</td>
<td align="right">29</td>
<td align="right">9</td>
<td align="right">8</td>
</tr>
<tr>
<td align="left">Male</td>
<td align="right">154</td>
<td align="right">103</td>
<td align="right">61</td>
<td align="right">45</td>
</tr>
</tbody>
</table>
<p>In terms of frequencies:</p>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-16-1.png" alt="" /><!-- --></p>
<p>The differences here are not statistically significant according to a chi-square test of independence.<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">6</a></sup></p>
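<p>The same kind of test extends to tables with more than two columns. The sketch below computes the Pearson chi-square statistic for the 2×4 table above, with no continuity correction, and evaluates the p-value using the closed form for three degrees of freedom. The function names are illustrative, and the sketch is not taken from the post’s R source.</p>

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

def chi2_upper_tail_df3(x):
    """P(X > x) for a chi-square variable with exactly 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)

# Rows: Female, Male. Columns: Reject, Major Revisions, Minor Revisions, Accept.
recommendations = ((42, 29, 9, 8), (154, 103, 61, 45))
chi2, df = chi_square_statistic(recommendations)
p_value = chi2_upper_tail_df3(chi2)
print(round(chi2, 1), df, round(p_value, 2))
```

<p>The closed-form tail probability only applies when the table has three degrees of freedom; other table shapes would need a general chi-square distribution function.</p>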
<h1 id="influence">Influence</h1>
<p>Does a referee’s gender affect whether the editor follows their recommendation? We can tackle this question a few different ways.</p>
<p>One way is to just tally up those cases where the editor’s decision was the same as the referee’s recommendation, and those where it was different.</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Same</th>
<th align="right">Different</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Female</td>
<td align="right">51</td>
<td align="right">37</td>
</tr>
<tr>
<td align="left">Male</td>
<td align="right">206</td>
<td align="right">157</td>
</tr>
</tbody>
</table>
<p><img src="http://jonathanweisberg.org/img/referee_gender_files/unnamed-chunk-17-1.png" alt="" /><!-- --></p>
<p>Clearly there’s no statistically significant difference between male and female referees here.<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">7</a></sup></p>
<p>A second approach would be to assign numerical ranks to referees’ recommendations and editors’ decisions: Reject = 1, Major Revisions = 2, etc. Then we can consider how far the editor’s decision is from the referee’s recommendation. For example, a decision of Accept is 3 away from a recommendation of Reject, while a decision of Major Revisions is 2 away from a recommendation of Accept.</p>
<p>By this measure, the average distance between the referee’s recommendation and the editor’s decision was 0.57 for women and 0.56 for men—clearly not a statistically significant difference.<sup class="footnote-ref" id="fnref:8"><a rel="footnote" href="#fn:8">8</a></sup></p>
<h1 id="summary">Summary</h1>
<p>Men received more requests to referee than women, as expected given the well-known gender imbalance in academic philosophy. The distribution of requests between men (76.9%) and women (23.1%) was in line with some estimates of the gender makeup of academic philosophy, though not all estimates.</p>
<p>Men were more likely to agree to a request (38% vs. 31%), a statistically significant difference. Women accounted for about 20% of the “Agreed” responses, however, consistent with most (but not all) estimates of the gender makeup of academic philosophy.</p>
<p>There was no statistically significant difference in response-speed, but there was in the speed with which reports were completed (23.8 days on average for men, 27.6 days for women). This difference appears to be due to a stronger tendency on the part of men to complete their reports early, though not necessarily a greater chance of meeting the deadline.</p>
<p>Finally, there was no statistically significant difference in the recommendations of male and female referees, or in editors’ uptake of those recommendations.</p>
<h1 id="technical-notes">Technical Notes</h1>
<p>This post was written in R Markdown and the source is <a href="https://github.com/jweisber/rgo/blob/master/referee%20gender/referee%20gender.Rmd" target="_blank">available on GitHub</a>. I’m new to both R and classical statistics, and this post is a learning exercise for me. So I encourage you to check the code and contact me with corrections.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Unlike in the previous analysis of author gender, however, here we do have a few known cases where either (i) the referee identifies as neither male nor female, or (ii) they identify as something more specific, e.g. “transgender male” rather than just “male”. But these cases are still too few for statistical analysis.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:0">$\chi^2$(1, <em>N</em> = 1394) = 3.89, <em>p</em> = 0.05.
<a class="footnote-return" href="#fnref:0"><sup>[return]</sup></a></li>
<li id="fn:3"><em>t</em>(437.43) = -1.63, <em>p</em> = 0.1
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
<li id="fn:4"><em>t</em>(144.26) = -2.46, <em>p</em> = 0.02
<a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li>
<li id="fn:6">$\chi^2$(1, <em>N</em> = 451) = 2.59, <em>p</em> = 0.11.
<a class="footnote-return" href="#fnref:6"><sup>[return]</sup></a></li>
<li id="fn:5">$\chi^2$(3, <em>N</em> = 451) = 3.6, <em>p</em> = 0.31.
<a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li>
<li id="fn:7">$\chi^2$(1, <em>N</em> = 451) = 0.01, <em>p</em> = 0.93.
<a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li>
<li id="fn:8"><em>t</em>(117.57) = 0.07, <em>p</em> = 0.95.
<a class="footnote-return" href="#fnref:8"><sup>[return]</sup></a></li>
</ol>
</div>
In Defense of Reviewer 2
http://jonathanweisberg.org/post/Reviewer%202/
Mon, 06 Feb 2017 10:36:10 -0500http://jonathanweisberg.org/post/Reviewer%202/
<p>Spare a thought for Reviewer 2, that much-maligned shade of academe. There’s even <a href="https://twitter.com/hashtag/reviewer2" target="_blank">a hashtag</a> dedicated to the joke:</p>
<p><blockquote class="twitter-tweet tw-align-center" data-lang="en"><p lang="en" dir="ltr">A rare glimpse of reviewer 2, seen here in their natural habitat <a href="https://t.co/lpT1BVhDCX">pic.twitter.com/lpT1BVhDCX</a></p>— Aidan McGlynn (@AidanMcGlynn) <a href="https://twitter.com/AidanMcGlynn/status/820647829446283264">January 15, 2017</a></blockquote>
<script async src="http://platform.twitter.com/widgets.js" charset="utf-8"></script></p>
<p>But is it just a joke? Order could easily matter here.</p>
<p>Referees invited later weren’t the editor’s first choice, after all. Maybe they’re less competent, less likely to appreciate your brilliant insights as an author. Or maybe they’re more likely to miss well-disguised flaws! Then we should expect Reviewer 2 to be the more <em>generous</em> one.</p>
<p>Come to think of it, we can order referees in other ways besides order-of-invite. We might order them according to who completes their report fastest, for example. And faster referees might be more careless, hence more dismissive. Or they might be less critical and thus more generous.</p>
<p>There’s a lot to consider. Let’s investigate, using <a href="http://www.ergophiljournal.org/" target="_blank"><em>Ergo</em></a>’s data, <a href="http://jonathanweisberg.org/tags/rgo/">as usual</a>.</p>
<h1 id="severity-generosity">Severity & Generosity</h1>
<p>Reviewer 2 is accused of a lot. It’s not just that their overall take is more severe; they also tend to miss the point. They’re irresponsible and superficial in their reading. And to the extent they do appreciate the author’s point, their objections are poorly thought out. What’s more, if they bother to demand revisions, their demands are unreasonable.</p>
<p>We can’t measure these things directly, of course. But we can estimate a referee’s generosity indirectly, using their recommendation to the editors as a proxy.</p>
<p><em>Ergo</em>’s referees choose from four possible recommendations: Reject, Major Revisions, Minor Revisions, and Accept. To estimate a referee’s generosity, we’ll assign these recommendations numerical ranks, from 1 (Reject) up through 4 (Accept).</p>
<p>The higher this number, the more generous the referee; the lower, the more severe.</p>
<h1 id="invite-order">Invite Order</h1>
<p>Is there any connection between the order in which referees are invited and their severity?</p>
<p>Usually an editor has to try a few people before they get two takers. So we can assign each potential referee an “invite rank”. The first person asked has rank 1, the second person asked has rank 2, and so on.</p>
<p>Is there a correlation between invite rank and severity?</p>
<p>Here’s a plot of invite rank (<em>x</em>-axis) and generosity (<em>y</em>-axis). (The points have non-integer heights because I’ve added some random <a href="http://r4ds.had.co.nz/data-visualisation.html#position-adjustments" target="_blank">“jitter”</a> to make them all visible. Otherwise you’d just see an uninformative grid.)</p>
<p><img src="http://jonathanweisberg.org/img/reviewer_2_files/unnamed-chunk-2-1.png" alt="" /></p>
<p>The blue curve shows the overall trend in the data.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> It’s basically flat all the way through, except at the far-right end where the data is too sparse to be informative.</p>
<p>We can also look at the classic measure of correlation known as <a href="https://en.wikipedia.org/wiki/Spearman's_rank_correlation_coefficient" target="_blank">Spearman’s rho</a>. The estimate is essentially 0 given our data ($r_s$ = 0.01).<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup></p>
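<p>Spearman’s rho is just the ordinary Pearson correlation computed on ranks, with tied values sharing the average of the ranks they occupy. Ties matter a lot here, since generosity only takes four values. The sketch below shows the calculation; the function names and the toy data are invented for illustration and are not <em>Ergo</em>’s.</p>

```python
import math

def average_ranks(values):
    """Rank values from 1 upward, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared = (pos + end) / 2 + 1  # average of the 1-based ranks pos+1 .. end+1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared
        pos = end + 1
    return ranks

def spearman_rho(xs, ys):
    """Spearman's rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / math.sqrt(var_x * var_y)

# Invented toy data: invite rank and generosity (1 = Reject ... 4 = Accept).
invite_rank = [1, 2, 3, 4, 5]
generosity = [1, 2, 3, 4, 3]
rho = spearman_rho(invite_rank, generosity)
print(rho)
```

<p>The function is undefined when either variable is constant, since the denominator is then zero; a production version would need to handle that case.</p>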
<p>Evidently, invite-rank has no discernible impact on severity.</p>
<h1 id="speed">Speed</h1>
<p>But now let’s look at the speed with which a referee completes their report:</p>
<p><img src="http://jonathanweisberg.org/img/reviewer_2_files/unnamed-chunk-4-1.png" alt="" /></p>
<p>Here an upward trend is discernible. And our estimate of Spearman’s rho agrees: $r_s$ = 0.1, a small but non-trivial correlation.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup></p>
<p>Apparently, referees who take longer tend to be more generous!</p>
<h1 id="my-take">My Take</h1>
<p>I find these results encouraging, for the most part.</p>
<p>It’s nice to know that an editor’s first choice for a referee is the same as their fifth, as far as how severe or generous they’re likely to be.</p>
<p>It’s also nice to know that the speed with which a referee completes their report doesn’t <em>hugely</em> inform their severity.</p>
<p>One might well worry that faster referees are unduly severe. But this worry is tempered by a few considerations.</p>
<p>For one thing, the effect we found is small enough that it could just be noise. It is detectable using tools like regression and significance testing, so it’s not to be dismissed out of hand. But we might also do well to heed the wisdom of <a href="https://xkcd.com/1725/" target="_blank">XKCD</a> here:</p>
<p><img src="https://imgs.xkcd.com/comics/linear_regression_2x.png" alt="" /></p>
<p>Even if the effect is real, though, it could be a good thing just as easily as a bad thing.</p>
<p>True, referees who work fast might be sloppy and dismissive. And those who take longer might feel guiltier and thus be unduly generous.</p>
<p>But maybe referees who are more on the ball are both more prompt and more apt to spot a submission’s flaws. Or (as my coeditor Franz Huber pointed out) manuscripts that should clearly be rejected might be easier to referee on average, hence faster.</p>
<p>It’s hard to know what to make of this effect, if it is an effect. Clearly, <a href="https://twitter.com/hashtag/moreresearchisneeded" target="_blank">#MoreResearchIsNeeded</a>.</p>
<h1 id="technical-notes">Technical Notes</h1>
<p>This post was written in R Markdown and the source is <a href="https://github.com/jweisber/rgo/blob/master/reviewer%202/reviewer%202.Rmd" target="_blank">available on GitHub</a>. I’m new to both R and statistics, and this post is a learning exercise for me. So I encourage you to check the code and contact me with corrections.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Specifically, the blue curve is a regression curve using the <a href="https://en.wikipedia.org/wiki/Local_regression#Definition_of_a_LOESS_model" target="_blank">LOESS</a> method of fit.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:2">A significance test of the null hypothesis $\rho_s$ = 0 yields <em>p</em> = 0.87.
<a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li>
<li id="fn:3">Testing the null hypothesis $\rho_s$ = 0 yields <em>p</em> = 0.03.
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
</ol>
</div>