Jonathan Weisberg
http://jonathanweisberg.org/index.xml
Recent content on Jonathan Weisberg
Hugo -- gohugo.io
en-us
Mon, 15 Oct 2018 00:00:00 -0500
Model Referees
http://jonathanweisberg.org/post/Model%20Referees/
Mon, 15 Oct 2018 00:00:00 -0500http://jonathanweisberg.org/post/Model%20Referees/<p>In <a href="http://jonathanweisberg.org/post/How Hard Is It to Find Referees/">the previous post</a> we saw there’s about a $35$% chance a given referee will agree to review a paper for <em>Ergo</em>. And on average it takes about $5.8$ tries to find two referees for a submission. The full empirical distribution looks like this:</p>
<p><img src="http://jonathanweisberg.org/img/model_referees_files/unnamed-chunk-2-1.png" alt="" /></p>
<p>But there’s also an a priori way of exploring an editor’s predicament here, by using a classic model: the <a href="https://en.wikipedia.org/wiki/Negative_binomial_distribution" target="_blank">negative binomial distribution</a>. So I thought I’d make a little exercise of seeing how well the model captures the empirical reality here.</p>
<p>Contacting potential referees is a bit like flipping a loaded coin: you keep flipping until you get two heads, then stop. Our question is how many flips it’ll take to get to that point.</p>
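<p>The coin-flipping picture is easy to simulate. Here is a minimal sketch, assuming each referee agrees independently with the same probability (the function names, the trial count, and the seed are my own choices for illustration):</p>

```python
import random

def invites_until_two_accept(p, rng):
    """Count invitations sent until two referees have agreed."""
    invites = 0
    accepted = 0
    while accepted < 2:
        invites += 1
        if rng.random() < p:  # this referee says yes with probability p
            accepted += 1
    return invites

def average_invites(p, trials=50_000, seed=0):
    """Estimate the mean number of invites by repeated simulation."""
    rng = random.Random(seed)
    total = sum(invites_until_two_accept(p, rng) for _ in range(trials))
    return total / trials

# 0.35 is the acceptance chance quoted at the top of the post
print(average_invites(0.35))
```

<p>With enough trials the simulated average should settle close to whatever the negative binomial model predicts for the same probability.</p>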
<p>Let $p$ be the probability of heads on each toss, and let $T$ be the number of tails you get before landing the second head. The negative binomial model says the probability of getting $t$ tails, $P(T = t)$, is:
$$ P(T = t) = \binom{t + 1}{t} (1 - p)^t p^2. $$
And the mean of this distribution is $2(1-p)/p$.</p>
<p>If the coin is fair, $p = .5$, and we should expect to get $T = 2$ tails:
$$ 2(1-p)/p = 2(1-.5)/.5 = 2. $$
An editor’s “coin” is biased against them though, at least at <em>Ergo</em>: $p = .35$. So we would expect $T = 3.7$ referees on average to decline before we get two takers:
$$ 2(1-p)/p = 2(1-.35)/.35 = 3.7. $$
In other words, we expect it to take on average $5.7$ tries to secure two referees for a submission, which very closely matches the empirical average of $5.8$!</p>
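<p>For anyone who wants to check the arithmetic, here is a small sketch of the model in Python, with $p$ the probability of heads and the number of required heads fixed at two. The function names are mine, not part of any library:</p>

```python
from math import comb

def prob_tails(t, p, r=2):
    """P(T = t): probability of exactly t tails before the r-th head."""
    return comb(t + r - 1, t) * (1 - p) ** t * p ** r

def mean_tails(p, r=2):
    """Mean number of tails before the r-th head."""
    return r * (1 - p) / p

p = 0.35
print(round(mean_tails(p), 1))      # 3.7 declines on average
print(round(mean_tails(p) + 2, 1))  # 5.7 invites on average
```

<p>Summing <code>t * prob_tails(t, p)</code> over a long enough range of <code>t</code> gives the same mean, which is a handy sanity check on the formula.</p>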
<p>How about the full distribution, how well does it match the empirical reality?</p>
<p><img src="http://jonathanweisberg.org/img/model_referees_files/unnamed-chunk-3-1.png" alt="" /></p>
<p>The model peaks a bit early, but otherwise it’s pretty accurate.</p>
<p>Of course, mileage may vary depending on the journal. For example, <em>Ergo</em> has a pretty high desk-rejection rate—about $67$%. And referees may be more willing to agree when they know a submission has already passed that hurdle.</p>
<p>So let’s conclude by looking at the model’s predictions when referees are more/less likely to agree. Here are the predictions for some plausible values of $p$. The mean $\mu$ is the corresponding number of invites required to secure two reviews, on average.</p>
<p><img src="http://jonathanweisberg.org/img/model_referees_files/unnamed-chunk-4-1.png" alt="" /></p>
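<p>To spell out where $\mu$ comes from: it’s the mean number of declines plus the two acceptances themselves, which simplifies nicely:
$$ \mu = 2 + \frac{2(1-p)}{p} = \frac{2}{p}. $$
So $p = .35$ gives $\mu \approx 5.7$, while a fair coin, $p = .5$, gives $\mu = 4$.</p>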
<p>All these models assume that referees’ responses are independent, like flips of a coin, which isn’t too realistic. But given how close the model is for <em>Ergo</em>, it might still be good enough for other journals too.</p>
How Hard Is It to Find Referees?
http://jonathanweisberg.org/post/How%20Hard%20Is%20It%20to%20Find%20Referees/
Sat, 06 Oct 2018 00:00:00 -0500http://jonathanweisberg.org/post/How%20Hard%20Is%20It%20to%20Find%20Referees/<p>Finding willing referees is one of the more tedious parts of an editor’s
job. And with all the talk about how overloaded the peer-review system
is, it’s worth pausing to examine just how hard it is to find referees.</p>
<p>Well, at <em>Ergo</em> it takes on average 5.8 tries before we find two
referees to review a submission. The following plot gives the full
picture.</p>
<p><img src="http://jonathanweisberg.org/img/finding_referees_files/unnamed-chunk-2-1.png" alt="" /></p>
<p>So most submissions take six or fewer invites, and the overwhelming
majority require fewer than 10. But 10–15 is not unheard of. And on very
rare occasions it’s taken more than 20.</p>
<p>A different perspective is the time it takes to find two willing
referees. The average time between when the first invite is sent out and
two referees have agreed is 10.8 days. Here’s the histogram:</p>
<p><img src="http://jonathanweisberg.org/img/finding_referees_files/unnamed-chunk-4-1.png" alt="" /></p>
<p>So it usually takes less than two weeks to find two takers, though on
occasion it can take more than a month. And on very rare occasions it’s
taken more than two months.</p>
<p>That’s all looking just at submissions that had two referees, mind you.
Most submissions to <em>Ergo</em> aren’t sent out to external reviewers at all,
being desk-rejected instead. But more directly relevant is that about
20% of externally reviewed submissions are rejected just on one
referee’s recommendation, because the referee submits a decisively
negative report before a second can be commissioned. And on some
occasions an editor will even commission three reports.</p>
<p>So maybe the best way to look at the whole thing is just to calculate
The Big Number: 35% of all invitations sent to referees end with a
completed report. Which, I gotta say, is actually a lot better than I
would have guessed.</p>
Waiting for the Editor: A New App
http://jonathanweisberg.org/post/Shiny%20Wait%20Times/
Wed, 27 Jun 2018 00:00:00 -0500http://jonathanweisberg.org/post/Shiny%20Wait%20Times/<p>If waiting to hear back from journals makes you as twitchy as it makes me, you might appreciate <a href="https://jweisber.shinyapps.io/WaitingForTheEditor/" target="_blank">Waiting for the Editor</a>. It’s a little app that displays wait time forecasts and data from <a href="https://blog.apaonline.org/2017/04/13/journal-surveys-assessing-the-peer-review-process/" target="_blank">the APA Journal Survey</a>.</p>
<p>It has two kinds of display, based on <a href="http://jonathanweisberg.org/post/Journal%20Surveys">my earlier post</a> about the APA survey. You can view scatterplots:</p>
<p><a href="https://jweisber.shinyapps.io/WaitingForTheEditor/" target="_blank"><img src="http://jonathanweisberg.org/img/shiny_wait_times/scatter.png" alt="" /></a></p>
<p>Or ridgeplots:</p>
<p><a href="https://jweisber.shinyapps.io/WaitingForTheEditor/" target="_blank"><img src="http://jonathanweisberg.org/img/shiny_wait_times/ridge.png" alt="" /></a></p>
<p><strong>Note that the ridgeplots can be misleading.</strong> They treat old data and new the same. So if a journal’s wait times have improved/worsened since the survey started in 2009, this won’t be reflected in the ridgeplot.</p>
<p>You can customize these displays by…</p>
<ol>
<li>selecting which journals to show.</li>
<li>including all submissions or just e.g. rejected ones (since <a href="http://jonathanweisberg.org/post/Journal%20Surveys/#acceptance-rates" target="_blank">accepted submissions are overrepresented</a>).</li>
<li>choosing what dates you want to draw the data from, since some journals have gotten better/worse since 2009.</li>
<li>choosing your own cap for the maximum review time.</li>
</ol>
Visualizing the Philosophy Journal Surveys
http://jonathanweisberg.org/post/Journal%20Surveys/
Tue, 22 May 2018 00:00:00 -0500http://jonathanweisberg.org/post/Journal%20Surveys/
<p>In 2009 Andrew Cullison set up an <a href="https://blog.apaonline.org/journal-surveys/" target="_blank">ongoing survey</a> for philosophers to report their experiences submitting papers to various journals. For me, a junior philosopher working toward tenure at the time, it was a great resource. It was the best guide I knew to my chances of getting a paper accepted at <em>Journal X</em>, or at least getting rejected quickly by <em>Journal Y</em>.</p>
<p>But I always wondered about self-selection bias. I figured disgruntled authors were more likely to use the survey to vent. So I wondered whether the data overestimated things like wait times and rejection rates.</p>
<p>This post is an attempt to better understand the survey data, especially through visualization and comparisons with other sources.</p>
<h1 id="timeline">Timeline</h1>
<p>The survey has accrued 7,425 responses as of this writing. Of these, 720 have no date recorded. Here’s the timeline for the rest:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-1-1.png" alt="" /><!-- --></p>
<p>Two things jump out right away: the spike at the beginning and the dead zone near the end. What gives?</p>
<p>I’m guessing the spike reflects records imported manually from another source at the survey’s inception. Here I’ll mostly assume these records are legitimate, and include them in our analyses. But since the dates attached to those responses are certainly wrong, I’ll exclude them when we get to temporal questions (toward the end of the post).</p>
<p>What about the 2016–17 dead zone? I tried contacting people involved with the surveys, but nobody seemed to really know for sure what happened there. This dead period is right around when the surveys were <a href="https://blog.apaonline.org/2017/04/13/journal-surveys-assessing-the-peer-review-process/" target="_blank">handed over to the APA</a>. In that process the data were moved to a different hosting service, apparently with some changes to the survey format. So maybe the records for this period were lost in translation.</p>
<p>In any case, it looks like the norm is for the survey to get around 50 to 100 responses each month.</p>
<h1 id="journals">Journals</h1>
<p>There are 155 journals covered by the survey, but most have only a handful of responses. Here are the journals with 50 or more:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-2-1.png" alt="" /><!-- --></p>
<p>How do these numbers compare to the ground truth? Do <em>Phil Studies</em> and <em>Phil Quarterly</em> really get the most submissions, for example? And do they really get 4–5 times as many as, say, <em>BJPS</em>?</p>
<p>One way to check is to compare these numbers with those reported by the journals themselves to the APA and BPA in <a href="http://www.apaonline.org/page/journalsurveys" target="_blank">this study</a> from 2011–13. <em>Phil Studies</em> isn’t included in that report unfortunately, but <em>Phil Quarterly</em> and <em>BJPS</em> are. They reported receiving 2,305 and 1,267 submissions, respectively, during 2011–13. So <em>Phil Quarterly</em> does seem to get a lot more submissions, though only about 1.8 times as many, not 4.</p>
<p>For a fuller picture let’s do the same comparison for all journals that reported their submission totals to the APA/BPA. That gives us a subset of 33 journals. If we look at the number of survey responses for these journals over the years 2011–2013, we can get a sense of how large each journal looms in the Journal Survey vs. the APA/BPA report:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-4-1.png" alt="" /><!-- --></p>
<p>There’s a pretty strong correlation evident here. But it’s also clear there’s some bias in the survey responses. Bias towards what? I’m not exactly sure. Roughly the pattern seems to be that the more submissions a journal receives, the more likely it is to be overrepresented in the survey. But it might instead be a bias towards generalist journals, or journals with fast turnaround times. This question would need a more careful analysis, I think.</p>
<h1 id="acceptance-rates">Acceptance Rates</h1>
<p>What about acceptance rates? Here are the acceptance rates for those journals with 30+ responses in the survey:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-5-1.png" alt="" /><!-- --></p>
<p>These numbers look suspiciously high to me. Most philosophy journals I know have an acceptance rate under 10%. So let’s compare with an outside source again.</p>
<p>The most comprehensive list of acceptance rates I know is <a href="http://certaindoubts.com/philosophy-journal-information-esf-rankings-citation-impact-rejection-rates/" target="_blank">this one</a> based on data from the ESF. It’s not as current as I’d like (2011), nor as complete (<em>Phil Imprint</em> isn’t included, perhaps too new at the time). It’s also not entirely accurate: it reports an acceptance rate of 8% for <em>Phil Quarterly</em> vs. 3% reported in the APA/BPA study.</p>
<p>Still, the ESF values do seem to be largely accurate for many prominent journals I’ve checked. For example, they’re within 1 or 2% of the numbers reported elsewhere by <em>Ethics</em>, <em>Mind</em>, <em>Phil Review</em>, <em>JPhil</em>, <em>Nous</em>, and <em>PPR</em>.<sup class="footnote-ref" id="fnref:Sources-the-APA"><a rel="footnote" href="#fn:Sources-the-APA">1</a></sup> So they’re useful for at least a rough validation.</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-6-1.png" alt="" /><!-- --></p>
<p>Apparently the Journal Surveys do overrepresent accepted submissions. Consistently so in fact: with the exception of <em>Phil Review</em>, <em>Analysis</em>, <em>Ancient Philosophy</em>, and <em>Phil Sci</em>, the surveys overrepresent accepted submissions for every other journal in this comparison. And in many cases accepted submissions are drastically overrepresented.</p>
<p>This surprised me, since I figured the surveys would serve as an outlet for disgruntled authors. But maybe it’s the other way around: people are more likely to use the surveys as a way to share happy news. (Draw your own conclusions about human nature.)</p>
<h1 id="seniority">Seniority</h1>
<p>So who uses the journal surveys: grad students? Faculty? The survey records five categories: Graduate Student, Non-TT Faculty, TT-but-not-T Faculty, Tenured Faculty, and Other. A few entries have no professional position recorded.</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-7-1.png" alt="" /><!-- --></p>
<p>Evidently, participation drops off with seniority. Also interesting if not too terribly surprising is that seniority affects acceptance:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-8-1.png" alt="" /><!-- --></p>
<p>Compared to grad students, tenured faculty were about 10% more likely to report their papers as having been accepted.</p>
<h1 id="gender">Gender</h1>
<p>About 79% of respondents specified their gender. Of those, 16.4% were women and 83.6% were men. How does this compare to journal-submitting philosophers in general?</p>
<p><a href="http://jonathanweisberg.org/post/Referee%20Gender/" target="_blank">Various other sources</a> put the percentage of women in academic philosophy roughly in the 15–25% range. But we’re looking for something more specific: what portion of journal submissions come from women vs. men?</p>
<p>The APA/BPA report gives the percentage of submissions from women at 14 journals. And we can use those figures to infer that 17.6% of submissions to these journals were from women, which matches the 16.4% in the Journal Surveys fairly well.</p>
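<p>One natural way to make an inference like this is to pool the journals, weighting each journal’s percentage by its number of submissions. Here is a minimal sketch of that calculation. It assumes a submission-weighted average is the right pooling method, and the figures in it are made up for illustration, not taken from the APA/BPA report:</p>

```python
def pooled_share(journals):
    """Pool per-journal shares, weighting each by its number of submissions.

    `journals` is a list of (submissions, share_from_women) pairs,
    with the share given as a fraction between 0 and 1.
    """
    total = sum(n for n, _ in journals)
    from_women = sum(n * share for n, share in journals)
    return from_women / total

# Made-up figures for illustration only; NOT from the APA/BPA report.
example = [(1000, 0.10), (500, 0.25), (500, 0.15)]
print(round(pooled_share(example), 3))  # 0.15
```

<p>Weighting matters here: a plain average of the three made-up shares would give about 0.167, because it ignores that the largest journal has the lowest share.</p>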
<p>Looking at individual journals gives a more mixed picture, however:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-10-1.png" alt="" /><!-- --></p>
<p>While the numbers are reasonably close for some of these journals, they’re significantly different for many of them. So, using the Journal Surveys to estimate the gender makeup of a journal’s submission pool probably isn’t a good idea.</p>
<p>Does gender affect acceptance? Looking at the data from all journals together, it seems not:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-11-1.png" alt="" /><!-- --></p>
<p>In fact it’s striking how stark the non-effect is here, given the quirks we’ve already noted in this data set.</p>
<p>We could break things down further, going journal by journal. But then we’d face <a href="https://imgs.xkcd.com/comics/significant.png" target="_blank">the problem of multiple comparisons</a>, and we’ve already seen that the journal-by-journal numbers on gender aren’t terribly reliable. So I won’t dig into that exercise here.</p>
<h1 id="wait-times">Wait Times</h1>
<p>For me, the surveys were always most interesting as a means to compare wait times across journals. But how reliable are these comparisons?</p>
<p>The APA/BPA report gives the average wait times at 38 journals. It also reports how many decisions were delivered within 2 months, in 2–6 months, in 7–11 months, and after 12+ months.</p>
<p>Trouble is, a lot of these numbers look dodgy. The average wait times are all whole numbers of months—except inexplicably for one journal, <em>Ratio</em>. I guess someone at the APA/BPA has a sense of humour.</p>
<p>The other wait time figures are also suspiciously round. For example, <em>APQ</em> is listed as returning 60% of its decisions within 2 months, 35% after 2–6 months, and the remaining 5% after 7–11 months. Round percentages like these are the norm. So, at best, most of these numbers are rounded estimates. At worst, they don’t always reflect an actual count, but rather the editor’s perception of their own performance.</p>
<p>On top of all that, there are differences between <a href="https://apaonline.site-ym.com/resource/resmgr/journal_surveys_2014/apa_bpa_survey_data_2014.xlsx" target="_blank">the downloadable Excel spreadsheet</a> and <a href="https://apaonline.site-ym.com/general/custom.asp?page=journalsurveys" target="_blank">the APA’s webpages</a> reporting (supposedly) the same data. For example, the spreadsheet gives an average wait time of 6 months for <em>Phil Imprint</em> (certainly wrong), while the webpage says “not available”. In fact the Excel spreadsheet flatly contradicts itself here: it says <em>Phil Imprint</em> returns 73% of its decisions within 2 months, the rest in 2–6 months.</p>
<p>I don’t know any other comprehensive list of wait times, though, so we’ll have to make do. Here I’ll restrict the comparison to journals with 30+ responses in the 2011–2013 timeframe, and exclude <em>Phil Imprint</em> because of the inconsistencies just mentioned.</p>
<p>That leaves us with 11 journals on which to compare average wait times:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-15-1.png" alt="" /><!-- --></p>
<p>The results are pretty stark. The match is close for most of these journals. In fact, if we’re forgiving about the rounding, only three journals have a discrepancy that’s clearly more than 1 month: <em>Erkenntnis</em>, <em>Mind</em>, and <em>Synthese</em>.</p>
<p>Notably, these are the three journals with the longest wait times according to survey respondents. I’d add that the reported 2 month average for <em>Mind</em> is wildly implausible by reputation. I can’t comment on the discrepancies for <em>Erkenntnis</em> and <em>Synthese</em>, though, since I know much less about their reputations for turnaround.</p>
<p>I do want to flag that <em>Mind</em> has radically improved its review times recently, as we’ll soon see. But for the present purpose—validating the Journal Survey data—we’re confined to look at 2011–13. And the survey responses align much better with <em>Mind</em>’s reputation during that time period than the 2 month average listed in the APA/BPA report.</p>
<p>In any case, since the wait time data looks to be carrying a fair amount of signal, let’s conclude our analysis with some visualizations of it.</p>
<h1 id="visualizing-wait-times">Visualizing Wait Times</h1>
<p>A journal’s average wait time doesn’t tell the whole story, of course. Two journals might have the same average wait time even though one of them is much more consistent and predictable. Or, a journal with a high desk-rejection rate might have a low average wait time, but still take a long time with its few externally reviewed submissions. So it’s helpful to see the whole picture.</p>
<p>One way to see the whole picture is with a scatterplot. This also lets us see how a journal’s wait times have changed. To make this feasible, I’ll focus on two groups of journals I expect to be of broad interest.</p>
<p>The first is a list of 18 “general” journals that were highly rated in <a href="http://leiterreports.typepad.com/blog/2015/09/the-top-20-general-philosophy-journals-2015.html" target="_blank">a pair of polls</a> at Leiter Reports.<sup class="footnote-ref" id="fnref:The-poll-results"><a rel="footnote" href="#fn:The-poll-results">2</a></sup> For the sake of visibility, I’ll cap these scatterplots at 24 months. The handful of entries with longer wait times are squashed down to 24 so they can still inform the plot.</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-16-1.png" alt="" /><!-- --></p>
<p>In addition to the improvements at <em>Mind</em> mentioned earlier, <em>Phil Review</em>, <em>PPQ</em>, <em>CJP</em>, and <em>Erkenntnis</em> all seem to be shortening their wait times. <em>APQ</em> and <em>EJP</em> on the other hand appear to be drifting upward.</p>
<p>Keeping that in mind, let’s visualize expected wait times at these journals with a ridgeplot. The plot shows a smoothed estimate of the probable wait times for each journal. Note that here I’ve truncated the timeline at 12 months, squashing all wait times longer than 12 months down to 12.</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-17-1.png" alt="" /><!-- --></p>
<p>Remember though, the ridgeplot reflects old data as much as new. Authors submitting to journals like <em>Mind</em> and <em>CJP</em>, where wait times have significantly improved recently, should definitely not just set their expectations according to this plot. Consult the scatterplot!</p>
<p>Our second group consists of 8 “specialty” journals drawn from <a href="http://leiterreports.typepad.com/blog/2013/07/top-philosophy-journals-without-regard-to-area.html" target="_blank">another poll</a> at Leiter Reports. Here I’ll cap the scale at 15 months for the sake of visibility:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-18-1.png" alt="" /><!-- --></p>
<p>And for the ridgeplot we’ll return to a cap of 12 months:</p>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/unnamed-chunk-19-1.png" alt="" /><!-- --></p>
<p>Again, remember that the ridgeplot reflects out-of-date information for some journals. Consult the scatterplot! And please direct others to do the same if you share any of this on social media.</p>
<h1 id="conclusions">Conclusions</h1>
<p><img src="http://jonathanweisberg.org/img/journal_surveys/inigo.jpg" alt="" /></p>
<ul>
<li>A journal’s prominence in the survey is a decent <em>comparative</em> guide to the quantity of submissions it receives.</li>
<li>Accepted submissions are overrepresented in the survey. Acceptance rates estimated from the survey will pretty consistently overestimate the true rate—in many cases by a lot.</li>
<li>Grad students and non-tenured faculty use the surveys a lot more than tenured faculty.</li>
<li>Acceptance rates increase with seniority.</li>
<li>Men and women seem to be represented about the same as in the population of journal-submitting philosophers more generally.</li>
<li>Gender doesn’t seem to affect acceptance rate.</li>
<li>The Survey seems to be a reasonably good guide to expected wait times, though there may be some anomalies (e.g. <em>Synthese</em> and <em>Erkenntnis</em>).</li>
<li>Some journals’ wait times have been improving significantly, such as <em>CJP</em>, <em>Erkenntnis</em>, <em>Mind</em>, <em>PPQ</em>, and <em>Phil Review</em>.</li>
</ul>
<div class="footnotes">
<hr />
<ol>
<li id="fn:Sources-the-APA">Sources: the APA/BPA study, <a href="http://dailynous.com/2015/01/20/closer-look-philosophy-journal-practices/" target="_blank">Daily Nous</a>, and the websites for <a href="https://philreview.gorgesapps.us/statistics" target="_blank"><em>Phil Review</em></a> and <a href="https://www.journals.uchicago.edu/pb-assets/docs/journals/ethics-editorial-final-2018-02-07.pdf" target="_blank"><em>Ethics</em></a>. One notable exception is <em>CJP</em>, which reported 17% to the APA/BPA but 6% on Daily Nous. The ESF gives 10%. <a class="footnote-return" href="#fnref:Sources-the-APA"><sup>[return]</sup></a></li>
<li id="fn:The-poll-results">The poll results identified 20 journals ranked “best” by respondents. So why does our list only have 18? Because 3 of those 20 aren’t covered in the survey data, and I’ve included the “runner up” journal ranked 21st. <a class="footnote-return" href="#fnref:The-poll-results"><sup>[return]</sup></a></li>
</ol>
</div>
Where Are They Now? The Healy 2100
http://jonathanweisberg.org/post/Where%20Are%20They%20Now%20The%20Healy%202100/
Mon, 23 Apr 2018 00:00:00 -0500http://jonathanweisberg.org/post/Where%20Are%20They%20Now%20The%20Healy%202100/
<p>A <a href="https://www.timeshighereducation.com/news/how-much-research-goes-completely-uncited" target="_blank"><em>Times Higher Education</em>
piece</a>
making the rounds last week found that most published philosophy papers
are never cited. More exactly, of the studied philosophy papers
published in 2012, more than half had no citations indexed in <a href="https://clarivate.com/products/web-of-science/" target="_blank">Web of
Science</a> five years
later.</p>
<p>At Daily Nous, the <a href="http://dailynous.com/2018/04/19/philosophy-high-rate-uncited-publications/" target="_blank">discussion of that
finding</a>
turned up some interesting follow-up questions and findings. In
particular, Brian Weatherson found <a href="http://dailynous.com/2018/04/19/philosophy-high-rate-uncited-publications/#comment-141535" target="_blank">quite different
figures</a>
for papers published in <em>prestigious</em> philosophy journals. In the journals
he looked at, 89% of the papers published in 2012 had at least one
citation in Web of Science five years later.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> And more than half had five
or more citations.</p>
<p>That’s a pretty striking difference: >50% vs. ~11%! Seems like where
you publish your paper makes a <em>big</em> difference to your chances of going
uncited.</p>
<p>Shocking, I know.</p>
<p>But this got me thinking about <a href="https://kieranhealy.org/blog/archives/2015/02/25/gender-and-citation-in-four-general-interest-philosophy-journals-1993-2013/" target="_blank">Kieran Healy’s
analysis</a>
from a few years back. He found an “uncitation” rate higher than
Weatherson’s 11%—almost 20%—even though he was looking at just four of
philosophy’s most prominent journals: <em>Journal of Philosophy</em>, <em>Mind</em>,
<em>Noûs</em>, and <em>Philosophical Review</em>. (He found that around half of the
papers in these journals had five citations or fewer.)</p>
<p>So I wondered: what’s with the discrepancy? Do these journals not
necessarily get the most citations? Or is it that Healy was looking at
papers from 1993 to 2013, and things changed somehow over those two
decades, so that papers published in 2012 tend to get discussed more
than papers from 1993? Or is it just a symptom of when Healy collected
his data? Papers published in, say, 2011 wouldn’t have had much time to
gather citations by 2013 when Healy (apparently?) gathered his data.</p>
<p>Let’s take a look.</p>
<h1 id="the-healy-2100">The Healy 2100</h1>
<p>Since I don’t have Healy’s raw data, I went to Web of Science and
grabbed their data for all papers published over 1993–2013 in the
“Healy 4” journals. I ended up with a couple hundred more papers than
Healy looked at—not sure why. But the list of 2,100 papers he studied
is <a href="https://github.com/kjhealy/philpub" target="_blank">available on GitHub</a>. So I
focused on just those to get more of an apples-to-apples comparison.</p>
<p>Then I tried to reproduce his original findings, especially <a href="https://kieranhealy.org/blog/archives/2015/02/25/gender-and-citation-in-four-general-interest-philosophy-journals-1993-2013/#the-matthew-effect-is-a-harsh-mistress" target="_blank">this
histogram</a>. Here’s my version:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-2-1.png" alt="" /></p>
<p>I got pretty close, but I didn’t manage to reproduce his results
exactly. I found about 19.9% of the papers had no citations by the end
of 2013, compared to Healy’s ~18.5%. And I found about 58.6% with five
or fewer citations, compared with Healy’s “just over half”.</p>
<p>Still, the match is pretty close, so let’s go on to see how these papers
have aged since 2013.</p>
<h1 id="where-are-they-now">Where Are They Now?</h1>
<p>If we include citations up to the present day, only about 9.7% of these
2,100 papers have no citations, and about 39.1% have five or fewer.
Here’s the updated histogram:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-4-1.png" alt="" /></p>
<p>So it looks like the discrepancy with Weatherson’s result is (partly?)
down to the obvious thing. There hadn’t been enough time for the later
papers in Healy’s data set to accrue citations.</p>
<h1 id="cutting-out-supplements">Cutting Out Supplements</h1>
<p>A lot of these 2,100 papers are actually from the two supplements to
<em>Noûs</em>: <em>Philosophical Issues</em> and <em>Philosophical Perspectives</em>. And it
turns out they’re making a big difference.</p>
<p>Here’s what things look like when we cut supplements out:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-6-1.png" alt="" /></p>
<p>Now only 3.8% of our 1,677 papers have no citations to date, and just
30.9% have five or fewer.</p>
<h1 id="sliding-windows">Sliding Windows</h1>
<p>We’ve been looking at all citations accumulated to date for these
papers, which for older papers means 25 years’ worth of opportunity for
discussion. For more direct comparison to the <em>THE</em> analysis mentioned
at the outset, we can look at just the five-year window following each
paper’s publication.</p>
<p>So, how many citations did these papers accrue just within five years of
being published? Looking at only the “core” papers again (no
supplements), 14.4% had no citations within five years of publication,
and 69.6% had five or fewer.</p>
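<p>The five-year window is simple to compute once each paper comes with the years of the works citing it. Here is a minimal sketch under some assumptions of my own: citations are dated by year only, the window includes the publication year and the fifth year after it, and the records below are toy data, not Web of Science exports:</p>

```python
def citations_in_window(pub_year, citing_years, window=5):
    """Count citations appearing no later than `window` years after publication."""
    return sum(1 for y in citing_years if pub_year <= y <= pub_year + window)

def window_rates(papers, window=5):
    """Return (share with zero citations, share with five or fewer) in the window.

    `papers` is a list of (pub_year, citing_years) pairs.
    """
    counts = [citations_in_window(year, cites, window) for year, cites in papers]
    uncited = sum(c == 0 for c in counts) / len(counts)
    five_or_fewer = sum(c <= 5 for c in counts) / len(counts)
    return uncited, five_or_fewer

# Toy records for illustration only; not Web of Science data.
toy = [
    (1995, []),                                    # never cited
    (2000, [2008, 2010]),                          # cited, but only after the window
    (2005, [2006, 2007, 2007, 2008, 2009, 2010]),  # six citations within the window
    (2010, [2011, 2013]),                          # two citations within the window
]
print(window_rates(toy))  # (0.5, 0.75)
```

<p>Note that the second toy paper counts as uncited under the window even though it has citations to date, which is exactly the difference between the two ways of counting.</p>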
<p>That’s still a bit higher than the 11% (respectively 50%) found by
Weatherson. So we have to ask: are things changing? Are recent papers in
these journals accruing citations faster?</p>
<p>It seems so. Here are the “uncitation” rates for the years 1993–2013:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-8-1.png" alt="" /></p>
<p>And here are the “five or fewer citations” rates:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-9-1.png" alt="" /></p>
<p><strong>Update</strong>: Eric Schliesser pointed out to me what should have been obvious here, namely the internet. When Google took over the search engine industry around the year 2000, there was a citation boom across the board. So you’d naturally expect 2012 to be a very different year than 1993 as far as uncitation goes.</p>
<p>At first I thought the data plainly vindicated the “Google effect” hypothesis, and I had some plots up here to show as much. But it turned out Web of Science had snuck a bunch of supplemental <em>Phil Issues</em>/<em>Phil Perspectives</em> papers past me unmarked!</p>
<p>With those removed, the Google effect isn’t looking so big. Here’s the long view, from 1970 to 2013:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-10-1.png" alt="" /></p>
<p>Apparently papers with zero citations after five years have been declining for a while now. There does still seem to be a significant Google effect in the “five or fewer” measure:</p>
<p><img src="http://jonathanweisberg.org/img/where_are_they_now_the_healy_2100/unnamed-chunk-11-1.png" alt="" /></p>
<p>But it still looks like there were changes before Google. Does this reflect a growing profession? A faster-moving profession? An increasing number of publications? Longer bibliographies? Just idiosyncrasies of Web of Science’s indexing? Something else? I’m not sure.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Weatherson looked at 391 articles published in 2012 in <em>Philosophical Review</em>, <em>Mind</em>, <em>Journal of Philosophy</em>, <em>Nous</em>, <em>Philosophical Studies</em>, <em>Ethics</em>, <em>Philosophical Quarterly</em>, <em>Philosophy of Science</em>, and <em>Australasian Journal of Philosophy</em>.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
Building a Neural Network from Scratch: Part 2
http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%202/
Wed, 07 Mar 2018 00:00:00 -0500http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%202/<p>In this post we’ll improve our training algorithm from the <a href="http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%201/">previous post</a>. When we’re done we’ll be able to achieve 98% precision on the MNIST data set, after just 9 epochs of training—which only takes about 30 seconds to run on my laptop.</p>
<p>For comparison, last time we only achieved 92% precision after 2,000 epochs of training, which took over an hour!</p>
<p>The main driver in this improvement is just switching from batch gradient descent to <em>mini</em>-batch gradient descent. But we’ll also make two other, smaller improvements: we’ll add momentum to our descent algorithm, and we’ll smarten up the initialization of our network’s weights.</p>
<p>We’ll also reorganize our code a bit while we’re at it, making things more modular.</p>
<p>But first we need to import and massage our data. These steps are the same as in the previous post:</p>
<pre><code class="language-python">from sklearn.datasets import fetch_mldata
from sklearn.metrics import classification_report, confusion_matrix
import numpy as np
# import
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
# scale
X = X / 255
# one-hot encode labels
digits = 10
examples = y.shape[0]
y = y.reshape(1, examples)
Y_new = np.eye(digits)[y.astype('int32')]
Y_new = Y_new.T.reshape(digits, examples)
# split, reshape, shuffle
m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
Y_train, Y_test = Y_new[:,:m], Y_new[:,m:]
shuffle_index = np.random.permutation(m)
X_train, Y_train = X_train[:, shuffle_index], Y_train[:, shuffle_index]
</code></pre>
<p>Then we’ll define our key functions. Only the last two are new, and they just put the steps of forward and backward propagation into their own functions. This tidies up the training code to follow, so that we can focus on the novel elements, especially mini-batch descent and momentum.</p>
<p>Notice that in the process we introduce three dictionaries: <code>params</code>, <code>cache</code>, and <code>grads</code>. These are for conveniently passing information back and forth between the forward and backward passes.</p>
<pre><code class="language-python">def sigmoid(z):
    s = 1. / (1. + np.exp(-z))
    return s

def compute_loss(Y, Y_hat):
    # multiclass cross-entropy, averaged over the examples (columns)
    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1./m) * L_sum
    return L

def feed_forward(X, params):
    cache = {}
    cache["Z1"] = np.matmul(params["W1"], X) + params["b1"]
    cache["A1"] = sigmoid(cache["Z1"])
    cache["Z2"] = np.matmul(params["W2"], cache["A1"]) + params["b2"]
    # softmax over the ten output units
    cache["A2"] = np.exp(cache["Z2"]) / np.sum(np.exp(cache["Z2"]), axis=0)
    return cache

def back_propagate(X, Y, params, cache):
    m_batch = X.shape[1]  # number of examples in this batch
    dZ2 = cache["A2"] - Y
    dW2 = (1./m_batch) * np.matmul(dZ2, cache["A1"].T)
    db2 = (1./m_batch) * np.sum(dZ2, axis=1, keepdims=True)
    dA1 = np.matmul(params["W2"].T, dZ2)
    dZ1 = dA1 * sigmoid(cache["Z1"]) * (1 - sigmoid(cache["Z1"]))
    dW1 = (1./m_batch) * np.matmul(dZ1, X.T)
    db1 = (1./m_batch) * np.sum(dZ1, axis=1, keepdims=True)
    grads = {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}
    return grads
</code></pre>
<p>Now for the substantive stuff.</p>
<p>To switch to mini-batch descent, we add another <code>for</code> loop inside the pass through each epoch. At each pass we randomly shuffle the training set, then iterate through it in chunks of <code>batch_size</code>, which we’ll arbitrarily set to 128. We’ll see the code for all this in a moment.</p>
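<p>As a rough sketch of the shuffling and chunking just described, here is the idea on a toy array. The helper name <code>iterate_minibatches</code> and the toy data are made up for illustration; the training code further down inlines the same logic rather than using a helper.</p>
<pre><code class="language-python">import numpy as np

def iterate_minibatches(X, Y, batch_size, rng):
    """Yield (X_batch, Y_batch) pairs that cover every column exactly once."""
    m = X.shape[1]
    permutation = rng.permutation(m)          # fresh shuffle on each call
    X_shuffled, Y_shuffled = X[:, permutation], Y[:, permutation]
    for begin in range(0, m, batch_size):
        end = min(begin + batch_size, m)      # the last batch may be smaller
        yield X_shuffled[:, begin:end], Y_shuffled[:, begin:end]

# toy data: 10 examples with 3 features each, one example per column
rng = np.random.default_rng(0)
X_toy = np.arange(30).reshape(3, 10)
Y_toy = np.arange(10).reshape(1, 10)

batch_sizes = [Xb.shape[1] for Xb, Yb in iterate_minibatches(X_toy, Y_toy, 4, rng)]
print(batch_sizes)   # → [4, 4, 2]
</code></pre>
<p>With 10 examples and batches of 4 we get three batches, which is the ceiling of 10/4. The expression <code>-(-m // batch_size)</code> used below computes that same ceiling.</p>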
<p>Next, to add momentum, we keep a moving average of our gradients. So instead of updating our parameters by doing e.g.:</p>
<pre><code class="language-python">params["W1"] = params["W1"] - learning_rate * grads["dW1"]
</code></pre>
<p>we do this:</p>
<pre><code class="language-python">V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
params["W1"] = params["W1"] - learning_rate * V_dW1
</code></pre>
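<p>To see the moving-average update in isolation, here is a minimal sketch on a one-dimensional toy problem. The function name <code>momentum_step</code>, the toy objective $f(w) = w^2$, and the learning rate of 0.1 are assumptions chosen for illustration, not values from the network.</p>
<pre><code class="language-python">import numpy as np

def momentum_step(param, velocity, grad, learning_rate=0.1, beta=0.9):
    """One update using an exponential moving average of the gradients."""
    velocity = beta * velocity + (1. - beta) * grad
    param = param - learning_rate * velocity
    return param, velocity

# toy problem: minimize f(w) = w**2, whose gradient is 2w
w, v = np.array([1.0]), np.zeros(1)
for step in range(200):
    w, v = momentum_step(w, v, 2 * w)

print(abs(w[0]))   # very close to the minimum at 0
</code></pre>
<p>The velocity starts at zero, just like the <code>V_dW1</code> arrays in the training code, so the first few updates are smaller than plain gradient descent would make them.</p>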
<p>Finally, to smarten up our initialization, we shrink the variance of the weights in each layer. Following <a href="https://www.coursera.org/learn/deep-neural-network/lecture/RwqYe/weight-initialization-for-deep-networks" target="_blank">this nice video</a> by Andrew Ng (whose excellent Coursera materials I’ve been relying on heavily in these posts), we’ll set the variance for each layer to $1/n$, where $n$ is the number of inputs feeding into that layer.</p>
<p>We’ve been using the <code>np.random.randn()</code> function to get our initial weights. And this function draws from the standard normal distribution. So to adjust the variance to $1/n$, we just divide by $\sqrt{n}$. In code this means that instead of doing e.g. <code>np.random.randn(n_h, n_x)</code>, we do <code>np.random.randn(n_h, n_x) * np.sqrt(1. / n_x)</code>.</p>
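<p>A quick empirical check of that scaling, as a sketch: the seed and the use of NumPy’s newer <code>default_rng</code> generator are my choices here, not part of the training code.</p>
<pre><code class="language-python">import numpy as np

rng = np.random.default_rng(0)
n_x, n_h = 784, 64     # layer sizes used in this post

W_raw = rng.standard_normal((n_h, n_x))
W_scaled = W_raw * np.sqrt(1. / n_x)

# the raw variance is about 1, and the scaled variance times n_x is about 1 too
print(W_raw.var(), W_scaled.var() * n_x)
</code></pre>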
<p>Ok that covers our three improvements. Let’s build and train!</p>
<pre><code class="language-python">np.random.seed(138)

# hyperparameters
n_x = X_train.shape[0]
n_h = 64
learning_rate = 4
beta = .9
batch_size = 128
batches = -(-m // batch_size)  # ceiling division

# initialization
params = { "W1": np.random.randn(n_h, n_x) * np.sqrt(1. / n_x),
           "b1": np.zeros((n_h, 1)) * np.sqrt(1. / n_x),
           "W2": np.random.randn(digits, n_h) * np.sqrt(1. / n_h),
           "b2": np.zeros((digits, 1)) * np.sqrt(1. / n_h) }

V_dW1 = np.zeros(params["W1"].shape)
V_db1 = np.zeros(params["b1"].shape)
V_dW2 = np.zeros(params["W2"].shape)
V_db2 = np.zeros(params["b2"].shape)

# train
for i in range(9):

    # reshuffle the training set at the start of each epoch
    permutation = np.random.permutation(X_train.shape[1])
    X_train_shuffled = X_train[:, permutation]
    Y_train_shuffled = Y_train[:, permutation]

    for j in range(batches):

        # slice out the next mini-batch (the slice end is exclusive)
        begin = j * batch_size
        end = min(begin + batch_size, X_train.shape[1])
        X = X_train_shuffled[:, begin:end]
        Y = Y_train_shuffled[:, begin:end]
        m_batch = end - begin

        cache = feed_forward(X, params)
        grads = back_propagate(X, Y, params, cache)

        # momentum: moving averages of the gradients
        V_dW1 = (beta * V_dW1 + (1. - beta) * grads["dW1"])
        V_db1 = (beta * V_db1 + (1. - beta) * grads["db1"])
        V_dW2 = (beta * V_dW2 + (1. - beta) * grads["dW2"])
        V_db2 = (beta * V_db2 + (1. - beta) * grads["db2"])

        params["W1"] = params["W1"] - learning_rate * V_dW1
        params["b1"] = params["b1"] - learning_rate * V_db1
        params["W2"] = params["W2"] - learning_rate * V_dW2
        params["b2"] = params["b2"] - learning_rate * V_db2

    # report costs once per epoch
    cache = feed_forward(X_train, params)
    train_cost = compute_loss(Y_train, cache["A2"])
    cache = feed_forward(X_test, params)
    test_cost = compute_loss(Y_test, cache["A2"])
    print("Epoch {}: training cost = {}, test cost = {}".format(i+1 ,train_cost, test_cost))

print("Done.")
</code></pre>
<pre><code>Epoch 1: training cost = 0.15587093418167058, test cost = 0.16223940981168986
Epoch 2: training cost = 0.09417519634799829, test cost = 0.11032242938356147
Epoch 3: training cost = 0.07205872840102934, test cost = 0.0958078559246339
Epoch 4: training cost = 0.07008115814138867, test cost = 0.1010270024817398
Epoch 5: training cost = 0.05501929068580713, test cost = 0.09527116695490956
Epoch 6: training cost = 0.042663638371140164, test cost = 0.08268937190759178
Epoch 7: training cost = 0.03615501088752129, test cost = 0.08188431384719108
Epoch 8: training cost = 0.03610956910064329, test cost = 0.08675249924246693
Epoch 9: training cost = 0.027582647825206745, test cost = 0.08023754855128316
Done.
</code></pre>
<p>How’d we do?</p>
<pre><code class="language-python">cache = feed_forward(X_test, params)
predictions = np.argmax(cache["A2"], axis=0)
labels = np.argmax(Y_test, axis=0)
print(classification_report(predictions, labels))
</code></pre>
<pre><code> precision recall f1-score support
0 0.99 0.98 0.98 984
1 0.99 0.99 0.99 1136
2 0.98 0.98 0.98 1037
3 0.97 0.98 0.97 1004
4 0.97 0.98 0.98 970
5 0.97 0.96 0.97 900
6 0.97 0.99 0.98 942
7 0.97 0.97 0.97 1026
8 0.97 0.96 0.97 982
9 0.97 0.96 0.97 1019
avg / total 0.98 0.98 0.98 10000
</code></pre>
<p>And there it is: 98% precision in just 9 epochs of training.</p>
Building a Neural Network from Scratch: Part 1
http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%201/
Mon, 05 Mar 2018 00:00:00 -0500http://jonathanweisberg.org/post/A%20Neural%20Network%20from%20Scratch%20-%20Part%201/
<p>In this post we’re going to build a neural network from scratch. We’ll train it to recognize hand-written digits, using the famous MNIST data set.</p>
<p>We’ll use just basic Python with NumPy to build our network (no high-level stuff like Keras or TensorFlow). We will dip into scikit-learn, but only to get the MNIST data and to assess our model once it’s built.</p>
<p>We’ll start with the simplest possible “network”: a single node that recognizes just the digit 0. This is actually just an implementation of logistic regression, which may seem kind of silly. But it’ll help us get some key components working before things get more complicated.</p>
<p>Then we’ll extend that into a network with one hidden layer, still recognizing just 0. Then we’ll add a softmax for recognizing all the digits 0 through 9. That’ll give us a 92% accurate digit-recognizer, bringing us up to the cutting edge of 1985 technology.</p>
<p>In a followup post we’ll bring that up into the high nineties by making sundry improvements: better optimization, more hidden layers, and smarter initialization.</p>
<h1 id="1-hello-mnist">1. Hello, MNIST</h1>
<p><a href="https://en.wikipedia.org/wiki/MNIST_database" target="_blank">MNIST</a> contains 70,000 images of hand-written digits, each 28 x 28 pixels, in greyscale with pixel-values from 0 to 255. We could <a href="http://yann.lecun.com/exdb/mnist/" target="_blank">download</a> and preprocess the data ourselves. But the makers of scikit-learn already did that for us. Since it would be rude to neglect their efforts, we’ll just import it:</p>
<pre><code class="language-python">from sklearn.datasets import fetch_mldata
mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
</code></pre>
<p>We’ll normalize the data to keep our gradients manageable:</p>
<pre><code class="language-python">X = X / 255
</code></pre>
<p>The default MNIST labels record <code>7</code> for an image of a seven, <code>4</code> for an image of a four, etc. But we’re just building a zero-classifier for now. So we want our labels to say <code>1</code> when we have a zero, and <code>0</code> otherwise (intuitive, I know). So we’ll overwrite the labels to make that happen:</p>
<pre><code class="language-python">import numpy as np
y_new = np.zeros(y.shape)
y_new[np.where(y == 0.0)[0]] = 1
y = y_new
</code></pre>
<p>Now we can make our train/test split. The MNIST images are pre-arranged so that the first 60,000 can be used for training, and the last 10,000 for testing. We’ll also transform the data into the shape we want, with each example in a column (instead of a row):</p>
<pre><code class="language-python">m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
y_train, y_test = y[:m].reshape(1,m), y[m:].reshape(1,m_test)
</code></pre>
<p>Finally we’ll shuffle the training set for good measure:</p>
<pre><code class="language-python">np.random.seed(138)
shuffle_index = np.random.permutation(m)
X_train, y_train = X_train[:,shuffle_index], y_train[:,shuffle_index]
</code></pre>
<p>Let’s have a look at a random image and label just to make sure we didn’t throw anything out of whack:</p>
<pre><code class="language-python">%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
i = 3
plt.imshow(X_train[:,i].reshape(28,28), cmap = matplotlib.cm.binary)
plt.axis("off")
plt.show()
print(y_train[:,i])
</code></pre>
<p><img src="http://jonathanweisberg.org/img/nn_from_scratch/output_13_0.png" alt="png" /></p>
<pre><code>[1.]
</code></pre>
<p>That’s a zero, so we want the label to be <code>1</code>, which it is. Looks good, so let’s build our first network.</p>
<h1 id="2-a-single-neuron-aka-logistic-regression">2. A Single Neuron (aka Logistic Regression)</h1>
<p>We want to build a simple, feed-forward network with 784 inputs (=28 x 28), and a single sigmoid unit generating the output.</p>
<h2 id="2-1-forward-propogation">2.1 Forward Propagation</h2>
<p>The forward pass on a single example $x$ executes the following computation:
$$ \hat{y} = \sigma(w^T x + b). $$
Here $\sigma$ is the sigmoid function:
$$ \sigma(z) = \frac{1}{1 + e^{-z}}. $$
So let’s define:</p>
<pre><code class="language-python">def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s
</code></pre>
<p>We’ll vectorize by stacking examples side-by-side, so that our input matrix $X$ has an example in each column. The vectorized form of the forward pass is then:
$$ \hat{y} = \sigma(w^T X + b). $$
Note that $\hat{y}$ is now a vector, not a scalar as it was in the previous equation.</p>
<p>In our code we’ll compute this in two stages: <code>Z = np.matmul(W.T, X) + b</code> and then <code>A = sigmoid(Z)</code>. (<code>A</code> for Activation.) Breaking things up into stages like this is just for tidiness—it’ll make our forward propagation computations mirror the steps in our backward propagation computations.</p>
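<p>Here is a small sketch of those two stages on made-up data, mainly to make the shapes concrete. The five random “images” and the seed are assumptions for illustration only.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
n_x, m_toy = 784, 5                      # 5 made-up examples
X_toy = rng.random((n_x, m_toy))         # one example per column
W = rng.standard_normal((n_x, 1)) * 0.01
b = np.zeros((1, 1))

Z = np.matmul(W.T, X_toy) + b            # shape (1, 5); b broadcasts across columns
A = sigmoid(Z)                           # one prediction per example
print(Z.shape, A.shape)
</code></pre>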
<h2 id="2-2-cost-function">2.2 Cost Function</h2>
<p>We’ll use cross-entropy for our cost function. The formula for a single training example is:
$$ L(y, \hat{y}) = -y \log(\hat{y}) - (1-y) \log(1-\hat{y}). $$
Averaging over a training set of $m$ examples we then have:
$$ L(Y, \hat{Y}) = -\frac{1}{m} \sum_{i=1}^m \left( y^{(i)} \log(\hat{y}^{(i)}) + (1-y^{(i)}) \log(1-\hat{y}^{(i)}) \right). $$
So let’s define:</p>
<pre><code class="language-python">def compute_loss(Y, Y_hat):
    m = Y.shape[1]
    L = -(1./m) * ( np.sum( np.multiply(np.log(Y_hat),Y) ) + np.sum( np.multiply(np.log(1-Y_hat),(1-Y)) ) )
    return L
</code></pre>
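<p>To get a feel for the numbers, here is the same cost applied to two made-up predictions for a pair of examples, one confidently right and one confidently wrong. The function is restated compactly so the snippet runs on its own.</p>
<pre><code class="language-python">import numpy as np

def compute_loss(Y, Y_hat):
    m = Y.shape[1]
    return -(1. / m) * (np.sum(Y * np.log(Y_hat)) + np.sum((1 - Y) * np.log(1 - Y_hat)))

Y_toy = np.array([[1., 0.]])                 # true labels for two examples
confident_right = np.array([[0.9, 0.1]])
confident_wrong = np.array([[0.1, 0.9]])

loss_right = compute_loss(Y_toy, confident_right)
loss_wrong = compute_loss(Y_toy, confident_wrong)
print(loss_right)   # -log(0.9), about 0.105
print(loss_wrong)   # -log(0.1), about 2.303
</code></pre>
<p>Confident mistakes are penalized much more heavily than confident correct answers are rewarded, which is what we want from a cost function.</p>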
<h2 id="2-3-backward-propagation">2.3 Backward Propagation</h2>
<p>For backpropagation, we’ll need to know how $L$ changes with respect to each component $w_j$ of $w$. That is, we must compute each $\partial L / \partial w_j$.</p>
<p>Focusing on a single example will make it easier to derive the formulas we need. Holding all values except $w_j$ fixed, we can think of $L$ as being computed in three steps: $w_j \rightarrow z \rightarrow \hat{y} \rightarrow L$. The formulas for these steps are:
$$
\begin{align}
z &= w^T x + b,\newline
\hat{y} &= \sigma(z),\newline
L(y, \hat{y}) &= -y \log(\hat{y}) - (1-y) \log(1-\hat{y}).
\end{align}
$$
And the chain rule tells us:
$$
\frac{\partial L}{\partial w_j} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w_j}.
$$
Looking at $\partial L / \partial \hat{y}$ first:
$$
\begin{align}
\frac{\partial L}{\partial \hat{y}} &= \frac{\partial}{\partial \hat{y}} \left( -y \log(\hat{y}) - (1-y) \log(1-\hat{y}) \right)\newline
&= -y \frac{\partial}{\partial \hat{y}} \log(\hat{y}) - (1-y) \frac{\partial}{\partial \hat{y}} \log(1-\hat{y})\newline
&= \frac{-y}{\hat{y}} + \frac{(1-y) }{1 - \hat{y}}\newline
&= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})}.
\end{align}
$$
Next we want $\partial \hat{y} / \partial z$:
$$
\begin{align}
\frac{\partial}{\partial z} \sigma(z) &= \frac{\partial}{\partial z} \left( \frac{1}{1 + e^{-z}} \right)\newline
&= - \frac{1}{(1 + e^{-z})^2} \frac{\partial}{\partial z} \left( 1 + e^{-z} \right)\newline
&= \frac{e^{-z}}{(1 + e^{-z})^2}\newline
&= \frac{1}{1 + e^{-z}} \frac{e^{-z}}{1 + e^{-z}}\newline
&= \sigma(z) \frac{e^{-z}}{1 + e^{-z}}\newline
&= \sigma(z) \left( 1 - \frac{1}{1 + e^{-z}} \right)\newline
&= \sigma(z) \left( 1 - \sigma(z) \right)\newline
&= \hat{y} (1-\hat{y}).
\end{align}
$$
Lastly we tackle $\partial z / \partial w_j$:
$$
\begin{align}
\frac{\partial}{\partial w_j} (w^T x + b) &= \frac{\partial}{\partial w_j} (w_0 x_0 + \ldots + w_n x_n + b)\newline
&= x_j.
\end{align}
$$
Finally we can substitute into the chain rule to find:
$$
\begin{align}
\frac{\partial L}{\partial w_j} &= \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z} \frac{\partial z}{\partial w_j}\newline
&= \frac{\hat{y} - y}{\hat{y}(1 - \hat{y})} \hat{y} (1-\hat{y}) x_j\newline
&= (\hat{y} - y) x_j.
\end{align}
$$
In vectorized form with $m$ training examples this gives us:
$$
\frac{\partial L}{\partial w} = \frac{1}{m} X (\hat{y} - y)^T.
$$
What about $\partial L / \partial b$? A very similar derivation yields, for a single example:
$$
\begin{align}
\frac{\partial L}{\partial b} &= (\hat{y} - y).
\end{align}
$$
Which in vectorized form amounts to:
$$
\frac{\partial L}{\partial b} = \frac{1}{m} \sum_{i=1}^m (\hat{y}^{(i)} - y^{(i)}).
$$
In our code we’ll label these gradients according to their denominators, as <code>dW</code> and <code>db</code>. So for backpropagation we’ll compute <code>dW = (1/m) * np.matmul(X, (A-Y).T)</code> and <code>db = (1/m) * np.sum(A-Y, axis=1, keepdims=True)</code>.</p>
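<p>One way to gain confidence in these formulas is a numerical gradient check: nudge each parameter a little, see how the cost changes, and compare with <code>dW</code> and <code>db</code>. The sketch below does that on a tiny made-up problem. The sizes, the seed, and the helper name <code>loss</code> are assumptions for illustration.</p>
<pre><code class="language-python">import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(W, b, X, Y):
    A = sigmoid(np.matmul(W.T, X) + b)
    return -np.mean(Y * np.log(A) + (1 - Y) * np.log(1 - A))

rng = np.random.default_rng(0)
n_x, m_toy = 4, 6
X_toy = rng.random((n_x, m_toy))
Y_toy = rng.integers(0, 2, size=(1, m_toy)).astype(float)
W = rng.standard_normal((n_x, 1))
b = np.zeros((1, 1))

# analytic gradients, using the vectorized formulas
A = sigmoid(np.matmul(W.T, X_toy) + b)
dW = (1 / m_toy) * np.matmul(X_toy, (A - Y_toy).T)
db = (1 / m_toy) * np.sum(A - Y_toy, axis=1, keepdims=True)

# central finite differences, one weight at a time
eps = 1e-6
dW_numeric = np.zeros_like(W)
for j in range(n_x):
    step = np.zeros_like(W)
    step[j, 0] = eps
    dW_numeric[j, 0] = (loss(W + step, b, X_toy, Y_toy)
                        - loss(W - step, b, X_toy, Y_toy)) / (2 * eps)
db_numeric = (loss(W, b + eps, X_toy, Y_toy) - loss(W, b - eps, X_toy, Y_toy)) / (2 * eps)

# both differences should be tiny
print(np.max(np.abs(dW - dW_numeric)), abs(db[0, 0] - db_numeric))
</code></pre>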
<h2 id="2-4-build-train">2.4 Build & Train</h2>
<p>Ok we’re ready to build and train our network!</p>
<pre><code class="language-python">learning_rate = 1

X = X_train
Y = y_train

n_x = X.shape[0]
m = X.shape[1]

W = np.random.randn(n_x, 1) * 0.01
b = np.zeros((1, 1))

for i in range(2000):
    # forward pass
    Z = np.matmul(W.T, X) + b
    A = sigmoid(Z)

    cost = compute_loss(Y, A)

    # backward pass
    dW = (1/m) * np.matmul(X, (A-Y).T)
    db = (1/m) * np.sum(A-Y, axis=1, keepdims=True)

    # gradient descent update
    W = W - learning_rate * dW
    b = b - learning_rate * db

    if (i % 100 == 0):
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 0.6840801595436431
Epoch 100 cost: 0.041305162058342754
... *snip* ...
Final cost: 0.02514156608481825
</code></pre>
<p>We could probably eke out a bit more accuracy with some more training. But the gains have slowed considerably. So let’s just see how we did, by looking at the confusion matrix:</p>
<pre><code class="language-python">from sklearn.metrics import classification_report, confusion_matrix
Z = np.matmul(W.T, X_test) + b
A = sigmoid(Z)
predictions = (A>.5)[0,:]
labels = (y_test == 1)[0,:]
print(confusion_matrix(predictions, labels))
</code></pre>
<pre><code>[[8980 33]
[ 40 947]]
</code></pre>
<p>Hey, that’s actually pretty good! We got 947 of the zeros and missed only 33, while getting nearly all the negative cases right. In terms of f1-score that’s 0.99:</p>
<pre><code class="language-python">print(classification_report(predictions, labels))
</code></pre>
<pre><code> precision recall f1-score support
False 1.00 1.00 1.00 9013
True 0.97 0.96 0.96 987
avg / total 0.99 0.99 0.99 10000
</code></pre>
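<p>As a sanity check, the overall accuracy and the f1-score for the zeros can be recomputed by hand from the confusion matrix printed above. This is just arithmetic on those four numbers; the variable names are mine.</p>
<pre><code class="language-python">import numpy as np

# the confusion matrix printed above
cm = np.array([[8980, 33],
               [40, 947]])

accuracy = np.trace(cm) / cm.sum()
tp = cm[1, 1]                  # zeros recognized as zeros
errors = cm[0, 1] + cm[1, 0]   # missed zeros plus false alarms
f1_zero = 2 * tp / (2 * tp + errors)

print(round(accuracy, 4), round(f1_zero, 2))   # 0.9927 0.96
</code></pre>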
<p>So, now that we’ve got a working model and optimization algorithm, let’s enrich it.</p>
<h1 id="3-one-hidden-layer">3. One Hidden Layer</h1>
<p>Let’s add a hidden layer now, with 64 units (a mostly arbitrary choice). I won’t go through the derivations of all the formulas for the forward and backward passes this time; they’re a pretty direct extension of the work we did earlier. Instead let’s just dive right in and build the model:</p>
<pre><code class="language-python">X = X_train
Y = y_train

n_x = X.shape[0]
n_h = 64
learning_rate = 1

W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(1, n_h)
b2 = np.zeros((1, 1))

for i in range(2000):

    # forward pass
    Z1 = np.matmul(W1, X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2, A1) + b2
    A2 = sigmoid(Z2)

    cost = compute_loss(Y, A2)

    # backward pass
    dZ2 = A2-Y
    dW2 = (1./m) * np.matmul(dZ2, A1.T)
    db2 = (1./m) * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.matmul(W2.T, dZ2)
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = (1./m) * np.matmul(dZ1, X.T)
    db1 = (1./m) * np.sum(dZ1, axis=1, keepdims=True)

    # gradient descent update
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1

    if i % 100 == 0:
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 0.9144384083567224
Epoch 100 cost: 0.08856953026938433
... *snip* ...
Final cost: 0.024249298861903648
</code></pre>
<p>How’d we do?</p>
<pre><code class="language-python">Z1 = np.matmul(W1, X_test) + b1
A1 = sigmoid(Z1)
Z2 = np.matmul(W2, A1) + b2
A2 = sigmoid(Z2)
predictions = (A2>.5)[0,:]
labels = (y_test == 1)[0,:]
print(confusion_matrix(predictions, labels))
print(classification_report(predictions, labels))
</code></pre>
<pre><code>[[8984 36]
[ 36 944]]
precision recall f1-score support
False 1.00 1.00 1.00 9020
True 0.96 0.96 0.96 980
avg / total 0.99 0.99 0.99 10000
</code></pre>
<p>Hmm, not bad, but about the same as our one-neuron model did. We could do more training and add more nodes/layers. But it’ll be slow going until we improve our optimization algorithm, which we’ll do in a followup post.</p>
<p>So for now let’s turn to recognizing all ten digits.</p>
<h1 id="4-upgrading-to-multiclass">4. Upgrading to Multiclass</h1>
<p><img src="https://medias.spotern.com/spots/w1280/3477.jpg" alt="" /></p>
<h2 id="4-1-labels">4.1 Labels</h2>
<p>First we need to redo our labels. We’ll re-import everything, so that we don’t have to go back and coordinate with our earlier shuffling:</p>
<pre><code class="language-python">mnist = fetch_mldata('MNIST original')
X, y = mnist["data"], mnist["target"]
X = X / 255
</code></pre>
<p>Then we’ll one-hot encode MNIST’s labels, to get a 10 x 70,000 array.</p>
<pre><code class="language-python">digits = 10
examples = y.shape[0]
y = y.reshape(1, examples)
Y_new = np.eye(digits)[y.astype('int32')]
Y_new = Y_new.T.reshape(digits, examples)
</code></pre>
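<p>The indexing trick there is a bit terse, so here is the same recipe on three made-up labels. The array <code>y_toy</code> is an assumption for illustration; everything else mirrors the code above.</p>
<pre><code class="language-python">import numpy as np

digits = 10
y_toy = np.array([3., 0., 9.]).reshape(1, 3)     # three made-up labels

# indexing the identity matrix by label picks out one row per example
Y_toy = np.eye(digits)[y_toy.astype('int32')]    # shape (1, 3, 10)
Y_toy = Y_toy.T.reshape(digits, 3)               # shape (10, 3), one example per column

print(Y_toy[:, 0])   # the column for the label 3
</code></pre>
<p>Each column has a single 1, in the row matching that example’s digit, so taking <code>np.argmax</code> down the columns recovers the original labels.</p>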
<p>Then we re-split, re-shape, and re-shuffle our training set:</p>
<pre><code class="language-python">m = 60000
m_test = X.shape[0] - m
X_train, X_test = X[:m].T, X[m:].T
Y_train, Y_test = Y_new[:,:m], Y_new[:,m:]
shuffle_index = np.random.permutation(m)
X_train, Y_train = X_train[:, shuffle_index], Y_train[:, shuffle_index]
</code></pre>
<p>A quick check that things are as they should be:</p>
<pre><code class="language-python">i = 12
plt.imshow(X_train[:,i].reshape(28,28), cmap = matplotlib.cm.binary)
plt.axis("off")
plt.show()
Y_train[:,i]
</code></pre>
<p><img src="http://jonathanweisberg.org/img/nn_from_scratch/output_43_0.png" alt="png" /></p>
<pre><code>array([0., 0., 0., 0., 0., 0., 1., 0., 0., 0.])
</code></pre>
<p>Looks good, so let’s consider what changes we need to make to the model itself.</p>
<h2 id="4-2-forward-propagation">4.2 Forward Propagation</h2>
<p>Only the last layer of our network is changing. To add the softmax, we have to replace our lone, final node with a 10-unit layer. Its final activations are the exponentials of its $z$-values, normalized across all ten such exponentials. So instead of just computing $\sigma(z)$, we compute the activation for each unit $i$:
$$ \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}.$$
So, in our vectorized code, the last line of forward propagation will be <code>A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)</code>.</p>
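<p>One caveat worth a sketch: <code>np.exp</code> overflows for large $z$-values. A common variant subtracts each column’s maximum before exponentiating, which leaves the result unchanged mathematically. To be clear, this is a swapped-in stabilization that the network in this post does not use, and the function name <code>softmax</code> and the toy inputs are mine.</p>
<pre><code class="language-python">import numpy as np

def softmax(Z):
    """Column-wise softmax; subtracting the column max avoids overflow in np.exp."""
    shifted = Z - np.max(Z, axis=0, keepdims=True)
    exp_Z = np.exp(shifted)
    return exp_Z / np.sum(exp_Z, axis=0, keepdims=True)

# two toy columns; the second would overflow without the shift
Z_toy = np.array([[1., 1000.],
                  [2., 1000.],
                  [3., 1001.]])
A_toy = softmax(Z_toy)
print(A_toy.sum(axis=0))   # each column sums to 1
</code></pre>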
<h2 id="4-3-cost-function">4.3 Cost Function</h2>
<p>Our cost function now has to generalize to more than two classes. The general formula for $n$ classes, indexed $0$ through $n-1$, is:
$$ L(y, \hat{y}) = -\sum_{i = 0}^{n-1} y_i \log(\hat{y}_i). $$
Averaging over $m$ training examples this becomes:
$$ L(Y, \hat{Y}) = - \frac{1}{m} \sum_{j = 1}^m \sum_{i = 0}^{n-1} y_i^{(j)} \log(\hat{y}_i^{(j)}). $$
So let’s define:</p>
<pre><code class="language-python">def compute_multiclass_loss(Y, Y_hat):
    L_sum = np.sum(np.multiply(Y, np.log(Y_hat)))
    m = Y.shape[1]
    L = -(1/m) * L_sum
    return L
</code></pre>
<h2 id="4-4-backprop">4.4 Backprop</h2>
<p>Luckily it turns out that backprop isn’t really affected by the switch to a softmax. A softmax generalizes the sigmoid activation we’ve been using, and in such a way that the code we wrote earlier still works. We could verify this by deriving:
$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i.$$
But I won’t walk through the steps here. Let’s just go ahead and build our final network.</p>
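<p>Short of a derivation, a numerical check on a single made-up example is reassuring. The sketch below compares $\hat{y}_i - y_i$ against finite differences of the cost. The random $z$-values, the seed, and the helper names are assumptions for illustration.</p>
<pre><code class="language-python">import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, y):
    return -np.sum(y * np.log(softmax(z)))

rng = np.random.default_rng(0)
z = rng.standard_normal(10)     # made-up z-values for the ten output units
y = np.zeros(10)
y[6] = 1.                       # one-hot label for the digit 6

analytic = softmax(z) - y       # the claimed gradient

# central finite differences, one unit at a time
eps = 1e-6
numeric = np.array([
    (loss(z + eps * np.eye(10)[i], y) - loss(z - eps * np.eye(10)[i], y)) / (2 * eps)
    for i in range(10)
])
print(np.max(np.abs(analytic - numeric)))   # should be tiny
</code></pre>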
<h2 id="4-5-build-train">4.5 Build & Train</h2>
<pre><code class="language-python">n_x = X_train.shape[0]
n_h = 64
learning_rate = 1

W1 = np.random.randn(n_h, n_x)
b1 = np.zeros((n_h, 1))
W2 = np.random.randn(digits, n_h)
b2 = np.zeros((digits, 1))

X = X_train
Y = Y_train

for i in range(2000):

    # forward pass, now ending in a softmax
    Z1 = np.matmul(W1,X) + b1
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2,A1) + b2
    A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)

    cost = compute_multiclass_loss(Y, A2)

    # backward pass
    dZ2 = A2-Y
    dW2 = (1./m) * np.matmul(dZ2, A1.T)
    db2 = (1./m) * np.sum(dZ2, axis=1, keepdims=True)

    dA1 = np.matmul(W2.T, dZ2)
    dZ1 = dA1 * sigmoid(Z1) * (1 - sigmoid(Z1))
    dW1 = (1./m) * np.matmul(dZ1, X.T)
    db1 = (1./m) * np.sum(dZ1, axis=1, keepdims=True)

    # gradient descent update
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1

    if (i % 100 == 0):
        print("Epoch", i, "cost: ", cost)

print("Final cost:", cost)
</code></pre>
<pre><code>Epoch 0 cost: 9.243960401572568
... *snip* ...
Epoch 1900 cost: 0.24585173887243117
Final cost: 0.24072776877870128
</code></pre>
<p>Let’s see how we did:</p>
<pre><code class="language-python">Z1 = np.matmul(W1, X_test) + b1
A1 = sigmoid(Z1)
Z2 = np.matmul(W2, A1) + b2
A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0)
predictions = np.argmax(A2, axis=0)
labels = np.argmax(Y_test, axis=0)
print(confusion_matrix(predictions, labels))
print(classification_report(predictions, labels))
</code></pre>
<pre><code>[[ 946 0 14 3 3 10 12 2 9 4]
[ 0 1112 3 2 1 1 2 8 3 4]
[ 3 4 937 24 10 7 8 18 8 3]
[ 4 2 17 924 1 39 4 13 26 9]
[ 0 1 10 0 905 9 11 9 10 40]
[ 12 5 2 26 3 786 15 3 24 14]
[ 8 1 19 2 9 10 902 1 9 1]
[ 2 1 13 14 3 5 1 946 9 25]
[ 5 9 16 11 5 18 3 5 868 9]
[ 0 0 1 4 42 7 0 23 8 900]]
precision recall f1-score support
0 0.97 0.94 0.95 1003
1 0.98 0.98 0.98 1136
2 0.91 0.92 0.91 1022
3 0.91 0.89 0.90 1039
4 0.92 0.91 0.92 995
5 0.88 0.88 0.88 890
6 0.94 0.94 0.94 962
7 0.92 0.93 0.92 1019
8 0.89 0.91 0.90 949
9 0.89 0.91 0.90 985
avg / total 0.92 0.92 0.92 10000
</code></pre>
<p>We’re at 92% accuracy across all digits, not bad! And it looks like we could still improve with more training.</p>
<p>But let’s work on speeding up our optimization algorithm first. We’ll pick things up there in the next post.</p>
Call for Papers: Formal Epistemology Workshop (FEW) 2018
http://jonathanweisberg.org/post/CFP%20FEW%202018/
Thu, 21 Dec 2017 10:36:00 -0500http://jonathanweisberg.org/post/CFP%20FEW%202018/<p><strong>Location:</strong> University of Toronto<br />
<strong>Dates:</strong> June 12–14, 2018<br />
<strong>Keynote Speakers:</strong> <a href="http://www.larabuchak.net/" target="_blank">Lara Buchak</a> and <a href="https://sites.google.com/site/michaeltitelbaum/" target="_blank">Mike Titelbaum</a><br />
<strong>Submission Deadline:</strong> February 12, 2018<br />
<strong>Authors Notified:</strong> March 31, 2018</p>
<p>We are pleased to invite papers in formal epistemology, broadly construed to include related areas of philosophy as well as cognate disciplines like statistics, psychology, economics, computer science, and mathematics.</p>
<p>Submissions should be:</p>
<ol>
<li>prepared for anonymous review,</li>
<li>no more than 6,000 words,</li>
<li>accompanied by an abstract of up to 300 words, and</li>
<li>in PDF format.</li>
</ol>
<p>Submission is via the <a href="https://easychair.org/conferences/?conf=few2018" target="_blank">EasyChair website</a>.</p>
<p>The final selection of the program will be made with an eye to diversity. We especially encourage submissions from PhD candidates, early career researchers, and members of groups underrepresented in academic philosophy.</p>
<p>Some funds are available to reimburse speakers’ travel expenses. The available amounts are still being determined, but we hope to cover most/all expenses for student and early career speakers. Childcare can also be arranged.</p>
<p>The <a href="http://jonathanweisberg.org/few2018" target="_blank">conference website is here</a>. The contact address is <a href="mailto:few2018toronto@gmail.com" target="_blank">few2018toronto@gmail.com</a>. The local organizers are <a href="http://www.davidjamesbar.net/" target="_blank">David James Barnett</a>, <a href="http://individual.utoronto.ca/jnagel/Home_Page.html" target="_blank">Jennifer Nagel</a>, and <a href="http://jonathanweisberg.org/" target="_blank">Jonathan Weisberg</a>.</p>
REU Redux: Allais All Over Again
http://jonathanweisberg.org/post/REU%20Redeux/
Tue, 26 Sep 2017 20:24:04 -0500http://jonathanweisberg.org/post/REU%20Redeux/
<p><em>This post is coauthored with <a href="http://johannathoma.com/" target="_blank">Johanna Thoma</a> and cross-posted at <a href="https://choiceinference.wordpress.com/" target="_blank">Choice & Inference</a>. Accompanying Mathematica code is available on <a href="https://github.com/jweisber/reu" target="_blank">GitHub</a>.</em></p>
<p>Lara Buchak’s <a href="http://www.oxfordscholarship.com/view/10.1093/acprof:oso/9780199672165.001.0001/acprof-9780199672165" target="_blank"><em>Risk & Rationality</em></a> advertises REU theory as able to recover the modal preferences in the Allais paradox. In <a href="https://link.springer.com/content/pdf/10.1007%2Fs11098-017-0916-3.pdf" target="_blank">our commentary</a> we challenged this claim. We pointed out that REU theory is strictly <a href="https://johannathoma.files.wordpress.com/2015/08/decision-theory-open-handbook-edit.pdf#page=11" target="_blank">“grand-world”</a>, and in the grand-world setting it actually struggles with the Allais preferences.</p>
<p>To demonstrate, we constructed a grand-world model of the Allais problem. We replaced each small-world outcome with a normal distribution whose mean matches its utility, and whose height corresponds to its probability.</p>
<p>Take for example the Allais gamble:
$$(\$0, .01; \$1M, .89; \$5M, .1).$$
If we adopt <em>Risk & Rationality</em>’s utility assignments:
$$u(\$0) = 0, u(\$1M) = 1, u(\$5M) = 2,$$
we can depict the small-world version of this gamble:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig1.png" alt="" /></p>
<p>On our grand-world model this becomes:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig2.png" alt="" /></p>
<p>And REU theory fails to predict the usual Allais preferences on this model, provided the normal distributions used are minimally spread out.</p>
<p>If we squeeze the normal distributions tight enough, the grand-world problem collapses into the small-world problem, and REU theory can recover the Allais preferences. But, we showed, they’d have to be squeezed absurdly tight. A small standard deviation like $\sigma = .1$ lets REU theory recover the Allais preferences.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> But it also requires outlandish certainty that a windfall of $\$$1M will lead to a better life than the one you’d expect to lead without it. The probability of a life of utility at most 0, despite winning $\$$1M, would have to be smaller than $1 \times 10^{-23}$.<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> Yet the chances are massively greater than that of suffering life-ruining tragedy (illness, financial ruin… <em>Game of Thrones</em> ending happily ever after, etc.).</p>
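<p>The size of that probability is easy to check. Assuming the utility of winning $\$$1M is normally distributed with mean 1 and $\sigma = .1$, as in the model above, the chance of utility at most 0 is the normal tail ten standard deviations below the mean. Here is a minimal Python sketch (the accompanying code for this post is in Mathematica; the helper name <code>normal_cdf</code> is made up for illustration):</p>
<pre><code class="language-python">import math

def normal_cdf(x, mean, sd):
    """P(X at most x) for a normal distribution, via the complementary error function."""
    return 0.5 * math.erfc((mean - x) / (sd * math.sqrt(2)))

p = normal_cdf(0, mean=1, sd=0.1)
print(p)   # about 7.6e-24
</code></pre>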
<p>In response Buchak offers <a href="https://link.springer.com/content/pdf/10.1007%2Fs11098-017-0907-4.pdf" target="_blank">two replies</a>. The first is a technical maneuver, adjusting the model parameters. The second is more philosophical, adjusting the target interpretation of the Allais paradox instead.</p>
<h1 id="first-reply">First Reply</h1>
<p>Buchak’s first reply tweaks our model in two ways. First, the mean utility of winning $\$$5M is shifted from 2 down to 1.3. Second, all normal distributions are skewed by a factor of 5 (positive 5 for utility 0, negative otherwise). So, for example, the Allais gamble pictured above becomes:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig3.png" alt="" /></p>
<p>We’ll focus on the second tweak here, the introduction of skew. It rests on a technical error, as we’ll show momentarily. But it also wants for motivation.</p>
<h2 id="motivational-problems">Motivational Problems</h2>
<p>Why should the grand-world model be skewed? And why in this particular way? Buchak writes:</p>
<blockquote>
<p>[…] receiving $\$$1M makes the worst possibilities much less likely. Receiving $\$$1M provides security in the sense of making the probability associated with lower utility values smaller and smaller. The utility of $\$$1M is concentrated around a high mean with a long tail to the left: things likely will be great, though there is some small and diminishing chance they will be fine but not great. Similarly, the utility of $\$$0 is concentrated around a low mean with a long tail to the right: things likely will be fine but not great, though there is some small and diminishing chance they will be great. In other words, $\$$1M (and $\$$5M) is a gamble with negative skew, and $\$$0 is a gamble with positive skew […] (p. 2401)</p>
</blockquote>
<p>But this passage never actually identifies any asymmetry in the phenomena we’re modeling. True, “receiving $\$$1M makes the worst possibilities much less likely”, but it also makes the best possibilities much more likely. Likewise, “[r]eceiving $\$$1M provides security in the sense of making the probability associated with lower utility values smaller and smaller.” But $\$$1M also makes the probability associated with higher utility values larger. And so on.</p>
<p>The tendency of large winnings to control bad outcomes and promote good outcomes was already captured in the original model. A normal distribution centered on utility 1 already admits “some small and diminishing chance that [things] will be fine but not great.” It just also admits some small chance that things will be much better than great, since it’s symmetric around utility 1. To motivate the skewed model, we’d need some reason to think this symmetry should not hold. But none has been given.</p>
<h2 id="technical-difficulties">Technical Difficulties</h2>
<p>Motivation aside, there is a technical fault in the skewed model.</p>
<p>Introducing skew is supposed to make room for a reasonably large standard deviation while still recovering the Allais preferences. Buchak advertises a standard deviation<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> of $\sigma = .17$ for the skewed model, but the true value is actually $.106$—essentially the same as the $.1$ value Buchak concedes is implausibly small, and seeks to avoid by introducing skew.<sup class="footnote-ref" id="fnref:4"><a rel="footnote" href="#fn:4">4</a></sup></p>
<p>Where does the $.17$ figure come from then? It’s the <a href="https://en.wikipedia.org/wiki/Scale_parameter" target="_blank">scale parameter</a> of the skew normal distribution, often denoted $\omega$. For an ordinary normal distribution, the scale $\omega$ famously coincides with the standard deviation $\sigma$, and so we write $\sigma$ for both. But when we skew a normal distribution, we tighten it, shrinking the standard deviation:</p>
<p><img src="http://jonathanweisberg.org/img/reu_redeux/fig4.png" alt="" /></p>
<p>The distributions in this figure share the same scale parameter ($.17$) but the skewed one (yellow) is much narrower.<sup class="footnote-ref" id="fnref:5"><a rel="footnote" href="#fn:5">5</a></sup></p>
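<p>To see how much skewing tightens the distribution, here is a small Python sketch of the standard closed-form moments of a skew normal distribution with location $\xi$, scale $\omega$, and shape $\alpha$. The function name and the choice of Python rather than <em>Mathematica</em> are ours, for illustration only; the parameter values are the ones discussed above.</p>
<pre><code class="language-python">import math

def skew_normal_moments(location, scale, shape):
    """Mean and standard deviation of a skew normal distribution.

    Uses the standard closed forms, with delta = shape / sqrt(1 + shape^2):
      mean = location + scale * delta * sqrt(2 / pi)
      sd   = scale * sqrt(1 - 2 * delta^2 / pi)
    """
    delta = shape / math.sqrt(1.0 + shape ** 2)
    mean = location + scale * delta * math.sqrt(2.0 / math.pi)
    sd = scale * math.sqrt(1.0 - 2.0 * delta ** 2 / math.pi)
    return mean, sd

# With no skew, the scale parameter is the standard deviation.
plain_mean, plain_sd = skew_normal_moments(1.0, 0.17, 0.0)

# With shape -5, the same scale yields a much smaller standard deviation.
skewed_mean, skewed_sd = skew_normal_moments(1.0, 0.17, -5.0)

print(round(plain_sd, 6), round(skewed_sd, 6))
print(round(plain_mean, 3), round(skewed_mean, 3))
</code></pre>
<p>Run as is, this should reproduce the standard deviation of roughly $.106$ reported above, and it also shows the mean sliding down from $1$ to about $.87$.</p>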
<p>Unfortunately, <em>Mathematica</em> uses $\sigma$ for the scale parameter even in skewed normal distributions, giving the misleading impression that it’s still the standard deviation.</p>
<p>What really matters, of course, isn’t the value of the standard deviation itself, but the probabilities that result from whatever parameters we choose. And Buchak argues that her model avoids the implausible probabilities we cited in the introduction. How can this be?</p>
<p>Buchak says that the skewed model has “more overlap in the utility that $\$$0 and $\$$1M might deliver”:</p>
<blockquote>
<p>[…] there is a 0.003 probability that the $\$$0 gamble will deliver more than 0.5 utils, and a 0.003 probability that the $\$$1M gamble will deliver less than 0.5 utils. (p. 2402)</p>
</blockquote>
<p>But this “overlap” was never the problematic quantity. The problem was, rather, that a small standard deviation like $.1$ requires you to think it less than $1 \times 10^{-23}$ likely you will end up with a life no better than $0$ utils, despite a $\$$1M windfall.</p>
<p>On Buchak’s model this probability is still absurdly small: $4 \times 10^{-9}$.<sup class="footnote-ref" id="fnref:6"><a rel="footnote" href="#fn:6">6</a></sup> This is a considerable improvement over $1 \times 10^{-23}$, but it’s still not plausible. For example, it’s almost $300,000$ times more likely that one author of this post (Jonathan Weisberg) will <a href="http://www.statcan.gc.ca/pub/84-537-x/2013005/tbl/tbl7a-eng.htm" target="_blank">die in the coming year at the ripe old age of 39</a>.</p>
<p>But worst of all, any improvement here comes at an impossible price: ludicrously low probabilities on the other side. For example, the probability that the life you’ll lead with $\$$1M will end up as good as the one you’d expect with $\$$5M is so small that <em>Mathematica</em> can’t distinguish it from zero.<sup class="footnote-ref" id="fnref:7"><a rel="footnote" href="#fn:7">7</a></sup> So the problem is actually worse than before, not better.</p>
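<p>As a cross-check on these tail probabilities that doesn’t depend on <em>Mathematica</em>, here is a rough Python sketch that integrates the skew normal density directly with Simpson’s rule. Integrating the upper tail itself, rather than subtracting a cumulative density from one, sidesteps the rounding that makes the result indistinguishable from zero. The function names, integration bounds, and step count are our own illustrative choices, and the bounds assume the density is negligible beyond them.</p>
<pre><code class="language-python">import math

def normal_cdf(z):
    # Standard normal CDF via erfc, which stays accurate far out in the tails.
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def skew_normal_pdf(x, location, scale, shape):
    # Density: (2 / scale) * phi(z) * Phi(shape * z), with z = (x - location) / scale.
    z = (x - location) / scale
    phi = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    return 2.0 / scale * phi * normal_cdf(shape * z)

def integrate(f, lower, upper, steps=20000):
    # Composite Simpson's rule; steps must be even.
    h = (upper - lower) / steps
    total = f(lower) + f(upper)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * f(lower + i * h)
    return total * h / 3.0

def density(x):
    # The skewed model for the utility of winning 1M: location 1, scale .17, shape -5.
    return skew_normal_pdf(x, 1.0, 0.17, -5.0)

# Chance of utility at most 0 (density below -2 assumed negligible).
p_at_most_zero = integrate(density, -2.0, 0.0)

# Chance of utility at least 1.3 (density above 2 assumed negligible).
p_at_least_1_3 = integrate(density, 1.3, 2.0)

# The same two tails for the unskewed model with sigma = .1.
normal_at_most_zero = normal_cdf((0.0 - 1.0) / 0.1)
normal_at_least_1_3 = 1.0 - normal_cdf((1.3 - 1.0) / 0.1)

print(p_at_most_zero, p_at_least_1_3)
print(normal_at_most_zero, normal_at_least_1_3)
</code></pre>
<p>We’d expect the lower tail to land near $4 \times 10^{-9}$, and the upper tail to come out as a positive number many orders of magnitude smaller than the $.0013$ of the unskewed model.</p>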
<h1 id="second-reply">Second Reply</h1>
<p>Buchak’s second reply is that it wouldn’t in fact be a problem if REU theory could only recover the Allais preferences in a small-world setting. We should think of the Allais problem as a thought experiment: it asks us to abstract away from anything but the immediate rewards mentioned in the problem, and to think of the monetary rewards as stand-ins for things that are valuable for their own sakes.</p>
<p>What <em>Risk & Rationality</em> showed, according to Buchak, is that REU theory can accommodate people’s intuitions regarding such a small-world thought experiment. And this is a success, because this establishes that the theory can accommodate a certain kind of reasoning that we all engage in. Buchak moreover concedes that it may well be a mistake for agents to think of the choices they actually face in small-world terms. But she claims this is no problem for her theory:</p>
<blockquote>
<p>[I]f people ‘really’ face the simple choices, then their reasoning is correct and REU captures it. If people ‘really’ face the complex choices, then the reasoning in favor of their preferences is misapplied, and REU does not capture their preferences. Either way, the point still stands: REU-maximization rationalizes and formally reconstructs a certain kind of intuitive reasoning, as seen through REU theory’s ability to capture preferences over highly idealized gambles to which this reasoning is relevant. (p. 2403)</p>
</blockquote>
<p>But there isn’t actually an ‘if’ here. People do really face ‘complex’ choices as we tried to model them. Any reward from an isolated gamble an agent faces in her life really should itself be thought of as a gamble. This is not only true when the potential reward is something like money, which is only a means to something else. Even if the good in question is ‘ultimate’, it just adds to the larger gamble of the agent’s life she is yet to face. She might win a beautiful holiday, but she will still face 20 micromorts per day for the rest of her life (<a href="https://en.wikipedia.org/wiki/Micromort#Baseline" target="_blank">24 if she moves from Canada to England</a>). Even on our deathbeds, we are unsure about how a lot of things we care about will play out. REU theory makes this background risk relevant to the evaluation of any individual gamble.</p>
<p>So Buchak’s response really comes to this: REU theory captures a kind of intuitive reasoning that we employ in highly idealized decision contexts, but which would be misapplied in any actual decisions agents face in their lives. This raises two questions:</p>
<ol>
<li><p>Why should we care about accommodating reasoning in highly idealized decision contexts?</p>
<p>The original project of <em>Risk & Rationality</em> was to rationally accommodate the ordinary decision-maker. But now what we are rationally accommodating are at best her responses to thought experiments that are very far removed from her real life, namely thought experiments that ask her to imagine that she faces no other risk in her life. If our model is right, then REU theory still has to declare her irrational if she acts in real life as she would in the thought experiment—as presumably ordinary decision-makers do. And then we haven’t done very much to rationally accommodate her. At best, we have provided an error theory to explain her ordinary behaviour: her mistake is to treat grand-world problems like small-world problems. This is, of course, a different project than the one <em>Risk & Rationality</em> originally embarked on. As an error theory, REU theory will have to compete with other theories of choice under uncertainty that were never meant to be theories of rationality, such as prospect theory. Moreover, there is still another open question.</p></li>
<li><p>Why should agents have developed a knack for the reasoning displayed in the Allais problem if it is never actually rational to use it?</p>
<p>As a heuristic to try and approximate the behaviour of a perfectly rational system, at least in the Allais example, agents would do better to maximize expected utility—which is also easier to compute. Moreover, the burden of proof is on proponents of REU theory to show that there are any grand-world decisions commonly faced by real agents where REU theory comes to a significantly different assessment than expected utility theory. Unless they can show this, expected utility theory comes out as the better heuristic more generally. It is then quite mysterious what explains our supposed employment of REU-style reasoning. Why should irrational agents, who employ it more generally, have developed a bad heuristic? And why should rational agents, who never use it in real life, develop a tendency to employ it exclusively in highly idealized thought experiments?</p></li>
</ol>
<p>Ultimately, if Buchak’s first reply fails, and all we can rely on is her second reply, <em>Risk & Rationality</em> provides us with no reason to abandon expected utility theory as our best theory of rational choice under uncertainty in actual choice scenarios. Even if we grant that REU theory is a better theory of rational choice in hypothetical scenarios we never face, this is a much less exciting result than the one <em>Risk & Rationality</em> advertised.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">Though we need a slightly more severe risk function than that used in <em>Risk & Rationality</em>: $r(p) = p^{2.05}$ instead of $r(p) = p^2$. See our original commentary for details.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:2"><p>To get this figure we calculate the cumulative density, at zero, of the normal distribution $𝒩(1,.1)$. Using <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">CDF[NormalDistribution[1, .1], 0]
7.61985 × 10^-24
</code></pre>
<a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li>
<li id="fn:3">This is the “variance” in Buchak’s terminology, but we’ll continue to use “standard deviation” here for consistency with our previous discussion and the preferred nomenclature.
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
<li id="fn:4"><p>In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">StandardDeviation[SkewNormalDistribution[1, .17, -5]]
0.105874
</code></pre>
<a class="footnote-return" href="#fnref:4"><sup>[return]</sup></a></li>
<li id="fn:5">Skewing also shifts the mean, we should note.
<a class="footnote-return" href="#fnref:5"><sup>[return]</sup></a></li>
<li id="fn:6"><p>In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">CDF[SkewNormalDistribution[1, .17, -5], 0]
4.04475 × 10^-9
</code></pre>
<a class="footnote-return" href="#fnref:6"><sup>[return]</sup></a></li>
<li id="fn:7"><p>Here we calculate the complement of the cumulative density, at $1.3$, of the skew normal distribution with location $1$, scale $.17$, and skew $-5$. In <em>Mathematica</em>:</p>
<pre><code class="language-mathematica">1 - CDF[SkewNormalDistribution[1, .17, -5], 1.3]
0.
</code></pre>
<p>Note that <em>Mathematica</em> can estimate this value at the nearby point $1.25$, which gives us an upper bound of about $7 \times 10^{-16}$:</p>
<pre><code class="language-mathematica">1 - CDF[SkewNormalDistribution[1, .17, -5], 1.25]
6.66134 × 10^-16
</code></pre>
<p>For comparison, this probability was about $.0013$ with no skew and $\sigma = .1$:</p>
<pre><code class="language-mathematica">1 - CDF[NormalDistribution[1, .1], 1.3]
0.0013499
</code></pre>
<a class="footnote-return" href="#fnref:7"><sup>[return]</sup></a></li>
</ol>
</div>
The Mosteller Hall Puzzle
http://jonathanweisberg.org/post/Teaching%20Monty%20Hall/
Wed, 14 Jun 2017 15:21:42 -0500http://jonathanweisberg.org/post/Teaching%20Monty%20Hall/<p>One of my favourite probability puzzles to teach is a close cousin of the <a href="https://en.wikipedia.org/wiki/Monty_Hall_problem" target="_blank">Monty Hall problem</a>. Originally from a 1965 <a href="https://books.google.ca/books/about/Fifty_Challenging_Problems_in_Probabilit.html?id=QiuqPejnweEC" target="_blank">book by Frederick Mosteller</a>,<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> here’s my formulation:</p>
<blockquote>
<p>Three prisoners, A, B, and C, are condemned to die in the morning. But the king decides in the night to pardon one of them. He makes his choice at random and communicates it to the guard, who is sworn to secrecy. She can only tell the prisoners that one of them will be released at dawn.</p>
<p>Prisoner A welcomes the news, as he now has a 1/3 chance of survival. Hoping to go even further, he says to the guard, “I know you can’t tell me whether I am condemned or pardoned. But at least one other prisoner must still be condemned, so can you just name one who is?”. The guard replies (truthfully) that B is still condemned. “Ok”, says A, “then it’s either me or C who was pardoned. So my chance of survival has gone up to ½”.</p>
<p>Unfortunately for A, he is mistaken. But how?</p>
<p><strong>Update</strong>: turns out the puzzle isn’t originally due to Mosteller after all! It appears in <a href="https://www.nature.com/scientificamerican/journal/v201/n4/pdf/scientificamerican1059-174.pdf" target="_blank">a 1959 article</a> in <em>Scientific American</em>, by Martin Gardner.</p>
</blockquote>
<p>For me it’s really intuitive that A is mistaken. The way he figures things, his chance of survival will go up to ½ whoever the guard names in her response. But then A doesn’t even have to bother the guard. He can just skip ahead to the conclusion that his chance of survival is ½. And that’s absurd.</p>
<p>It’s a bit harder to say exactly <em>where</em> A goes wrong. But I’ve always taken this puzzle to be, like Monty Hall, a lesson in Carnap’s TER: the Total Evidence Requirement.</p>
<p>What A learns isn’t only that B is condemned, but also that the guard reports as much. And this report is more likely if C was pardoned than if A was. If C was pardoned, the guard had to name B, the only other prisoner still condemned. Whereas if A was pardoned, the guard could just as easily have named C instead.</p>
<p>So when the guard names B, her report fits twice as well with the hypothesis that C was pardoned, not A:</p>
<p><img src="http://jonathanweisberg.org/img/misc/mosteller_tree_diagram.png" alt="Tree diagram" /></p>
<p>Thus A’s chance of being condemned remains twice that of being pardoned.</p>
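<p>The numbers behind the tree diagram can be checked with a few lines of Python. This is only a sketch of the model described above: the king pardons one prisoner at random, and the guard names a condemned prisoner other than A, choosing between B and C at random when A is the one pardoned. The variable names are mine.</p>
<pre><code class="language-python">from fractions import Fraction

# Prior: the king pardons A, B, or C with equal probability.
prior = {name: Fraction(1, 3) for name in "ABC"}

# Probability that the guard names B as condemned, given who was pardoned.
chance_guard_names_b = {
    "A": Fraction(1, 2),  # B and C are both condemned, so she picks one at random
    "B": Fraction(0),     # she never names the pardoned prisoner
    "C": Fraction(1),     # B is the only condemned prisoner other than A
}

# Bayes' theorem: weight each hypothesis by how well it fits the report.
joint = {name: prior[name] * chance_guard_names_b[name] for name in "ABC"}
total = sum(joint.values())
posterior = {name: joint[name] / total for name in "ABC"}

print(posterior["A"], posterior["C"])
</code></pre>
<p>The posterior for A stays at $1/3$ while C’s rises to $2/3$, matching the two-to-one ratio in the diagram.</p>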
<p>If you’re like me, this reasoning will actually be less intuitive than the initial, gut feeling that A must be mistaken (because his logic would make it unnecessary to consult the guard). The argument is still instructive though, for several reasons:</p>
<ol>
<li><p>It shows how the initial, gut feeling is consistent with the probability axioms. We’ve constructed a plausible probability model that vindicates it.</p></li>
<li><p>The Total Evidence Requirement makes the difference in this model. Learning merely that B is condemned would have a different effect in this model. A’s chance of survival really would go up to ½ then.</p></li>
<li><p>These lessons can be carried over to Monty Hall. The same model yields the correct solution there, with the TER playing out in a parallel way.</p></li>
</ol>
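<p>The contrast in the second point can be made concrete too. In the sketch below (my own illustration, with made-up names), the same joint distribution over who is pardoned and whom the guard names is conditioned first on the bare proposition that B is condemned, and then on the guard’s report that B is condemned.</p>
<pre><code class="language-python">from fractions import Fraction

third, half = Fraction(1, 3), Fraction(1, 2)

# Joint distribution over (pardoned prisoner, prisoner the guard names).
joint = {
    ("A", "B"): third * half,
    ("A", "C"): third * half,
    ("B", "C"): third,
    ("C", "B"): third,
}

def a_pardoned(outcome):
    return outcome[0] == "A"

def b_condemned(outcome):
    return outcome[0] != "B"

def guard_names_b(outcome):
    return outcome[1] == "B"

def conditional(hypothesis, evidence):
    # P(hypothesis | evidence) under the joint distribution above.
    given = sum(p for o, p in joint.items() if evidence(o))
    both = sum(p for o, p in joint.items() if evidence(o) and hypothesis(o))
    return both / given

merely_b_condemned = conditional(a_pardoned, b_condemned)
guard_reports_b = conditional(a_pardoned, guard_names_b)

print(merely_b_condemned, guard_reports_b)
</code></pre>
<p>Conditioning on the bare proposition gives A a chance of $1/2$; conditioning on the report leaves it at $1/3$.</p>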
<p>And that last point is the real point of this post. As my colleague <a href="http://www.sergiotenenbaum.org/" target="_blank">Sergio Tenenbaum</a> pointed out in conversation, it means you can use Mosteller’s puzzle to teach Monty Hall. Because, unlike in Monty Hall, <em>the intuitive judgment is the correct one in Mosteller’s puzzle</em>. So you can use it to get students on board with the less intuitive (but entirely correct) argument we used to resolve Mosteller’s puzzle.</p>
<p>Once students have seen how important it is to set up the probability model correctly, so that the Total Evidence Requirement can do its work, they may be more comfortable using the same technique on Monty Hall.</p>
<p>There are other ways of bringing students around to the correct solution to Monty Hall, of course. You can run them through a variant with a hundred doors instead of three; you can invite them to consider what would happen in the long run in repeated games; you can ask them how things would have been different had Monty opened the other door instead.</p>
<p>These are all worthy heuristics. And I expect different ones will click for different students.</p>
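<p>The long-run heuristic, for instance, is easy to try out in code. Here is a minimal Python simulation, assuming the standard rules: the prize is placed at random, Monty always opens an unchosen door with no prize behind it, and he picks at random when he has a choice. The function name, seed, and number of rounds are arbitrary choices of mine.</p>
<pre><code class="language-python">import random

def switching_win_rate(rounds, seed=0):
    """Fraction of simulated games won by a player who always switches."""
    rng = random.Random(seed)
    doors = [0, 1, 2]
    wins = 0
    for _ in range(rounds):
        prize = rng.choice(doors)
        first_pick = rng.choice(doors)
        # Monty opens a door that is neither the player's pick nor the prize.
        openable = [d for d in doors if d != first_pick and d != prize]
        opened = rng.choice(openable)
        # The player switches to the one remaining closed door.
        final_pick = [d for d in doors if d != first_pick and d != opened][0]
        wins += final_pick == prize
    return wins / rounds

rate = switching_win_rate(100000)
print(rate)
</code></pre>
<p>Over many rounds the rate should settle near $2/3$, which is just the chance that the first pick was wrong.</p>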
<p>But for my money, there’s nothing like a simple and concrete model to help me get oriented and shake off that befuddled feeling. And, in this case, Mosteller’s puzzle helps make the model more intuitive, hence more memorable.</p>
<p><img src="https://68.media.tumblr.com/776dfc1f8b3baa0309b41c6a90ea1a13/tumblr_nd53ozBNz81qj0u7fo1_r1_400.gif" alt="Fainting Goat" /></p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">So I think it actually predates Monty Hall, though I gather this general family of puzzles goes back at least to 1889 and <a href="https://en.wikipedia.org/wiki/Bertrand%27s_box_paradox" target="_blank">Bertrand’s box paradox</a>.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
</ol>
</div>
Accuracy for Dummies, Part 7: Dominance
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%207%20-%20Brier%20Dominance/
Wed, 07 Jun 2017 00:00:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%207%20-%20Brier%20Dominance/
<p>In our <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">last</a> <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 6 - Obtusity/">two</a> posts we established two key facts:</p>
<ol>
<li>The set of possible probability assignments is convex.</li>
<li>Convex sets are “obtuse”. Given a point outside a convex set, there’s a point inside that forms a right-or-obtuse angle with any third point in the set.</li>
</ol>
<p>Today we’re putting them together to get the central result of the accuracy framework, the Brier dominance theorem. We’ll show that a non-probabilistic credence assignment is always “Brier dominated” by some probabilistic one. That is, there is always a probabilistic assignment that is closer, in terms of Brier distance, to every possible truth-value assignment.</p>
<p>In fact we’ll show something a bit more general. We’ll show that there’s a probability assignment that’s closer to all the possible <em>probability</em> assignments. But truth-value assignments are probability assignments, just extreme ones. So the result we really want follows straight away as a special case.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\x}{\vec{x}}
\newcommand{\y}{\vec{y}}
\newcommand{\z}{\vec{z}}
\newcommand{\v}{\vec{v}}
\newcommand{\p}{\vec{p}}
\newcommand{\q}{\vec{q}}
\newcommand{\B}{B}
\newcommand{\R}{\mathbb{R}}
\newcommand{\EIpq}{EI_{\p}(\q)}\newcommand{\EIpp}{EI_{\p}(\p)}
$</p>
<h1 id="recap">Recap</h1>
<p>For reference, let’s collect our notation, terminology, and previous results, so that we have everything in one place.</p>
<p>We’re using $n$ for the number of possibilities under consideration. And we use bold letters like $\x$ and $\p$ to represent $n$-tuples of real numbers. So $\p = (p_1, \ldots, p_n)$ is a point in $n$-dimensional space: a member of $\R^n$.</p>
<p>We call $\p$ a <em>probability assignment</em> if its coordinates are (a) all nonnegative, and (b) they sum to $1$. And we write $P$ for the set of all probability assignments.</p>
<p>We call $\v$ a <em>truth-value assignment</em> if its coordinates are all zeros except for a single $1$. And we write $V$ for the set of all truth-value assignments.</p>
<p>A point $\y$ is a <em>mixture</em> of the points $\x_1, \ldots, \x_n$ if there are real numbers $\lambda_1, \ldots, \lambda_n$ such that:</p>
<ul>
<li>$\lambda_i \geq 0$ for all $i$,</li>
<li>$\lambda_1 + \ldots + \lambda_n = 1$, and</li>
<li>$\y = \lambda_1 \x_1 + \ldots + \lambda_n \x_n$.</li>
</ul>
<p>We say that a set is <em>convex</em> if it is closed under mixing, i.e. any mixture of elements in the set is also in the set.</p>
<p>The difference between two points, $\x - \y$, is defined coordinate-wise:
$$ \x - \y = (x_1 - y_1, \ldots, x_n - y_n). $$
The <em>dot product</em> of two points $\x$ and $\y$ is written $\x \cdot \y$, and is defined:
$$ \x \cdot \y = x_1 y_1 + \ldots + x_n y_n. $$
As a reminder, the dot product returns a single, real number (not another $n$-dimensional point as one might expect). And the sign of the dot product reflects the angle between $\x$ and $\y$ when viewed as vectors/arrows. In particular, $\x \cdot \y \leq 0$ corresponds to a right-or-obtuse angle.</p>
<p>Finally, $\B(\x,\y)$ is the Brier distance between $\x$ and $\y$, which can be defined:
$$
\begin{align}
\B(\x,\y) &= (\x - \y)^2\\<br />
&= (\x - \y) \cdot (\x - \y).
\end{align}
$$</p>
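<p>For readers who like to see definitions as code, here’s a minimal Python sketch of the difference, dot product, and Brier distance just defined. Points are plain tuples, and the function names are my own.</p>
<pre><code class="language-python">def difference(x, y):
    # Coordinate-wise difference x - y.
    return tuple(a - b for a, b in zip(x, y))

def dot(x, y):
    # Dot product: a single real number, not another point.
    return sum(a * b for a, b in zip(x, y))

def brier(x, y):
    # Brier distance: the dot product of x - y with itself.
    d = difference(x, y)
    return dot(d, d)

x = (1.0, 0.0)
y = (0.0, 1.0)
z = (-1.0, 1.0)

print(dot(x, y))    # 0.0: the arrows are at a right angle
print(dot(x, z))    # -1.0: the arrows form an obtuse angle
print(brier(x, y))  # 2.0: the squared straight-line distance
</code></pre>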
<p>Now let’s restate the two key theorems we’ll be relying on.</p>
<p><strong>Theorem (Convexity).</strong>
The set of probability assignments $P$ is convex.</p>
<p>We established this in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">Part 5</a> of this series. In particular, we showed that $P$ is the “convex hull” of $V$: the set of all mixtures of points in $V$.</p>
<p><strong>Lemma (Obtusity).</strong>
If $S$ is convex, $\x \not \in S$, and $\y \in S$ minimizes $\B(\y,\x)$ as a function of $\y$ on the domain $S$, then for any $\z \in S$, $(\x - \y) \cdot (\z - \y) \leq 0$.</p>
<p>The intuitive idea behind this lemma, which we proved last time in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 6 - Obtusity/">Part 6</a>, can be illustrated with a diagram:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma3.png" alt="" />
Given a point outside a convex set, we can find a point inside (the closest point) that forms a right-or-obtuse angle with all other points in the set.</p>
<p>What we’ll show next is the natural and intuitive consequence: that point $\y$ is thus closer to any point $\z$ of $S$ than $\x$ is.</p>
<h1 id="the-brier-dominance-theorem">The Brier Dominance Theorem</h1>
<p>Intuitively, we want to show that if the angle formed at point $\y$ with the points $\x$ and $\z$ is right-or-obtuse, then $\y$ must be closer to $\z$ than $\x$ is (in Brier distance).</p>
<p>Formally, a right-or-obtuse angle corresponds to a dot product less than or equal to zero: $(\x - \y) \cdot (\z - \y) \leq 0$. But if $\x = \y$, then the dot product will be zero trivially. So the precise statement of our theorem is:</p>
<p><strong>Theorem.</strong>
If $(\x - \y) \cdot (\z - \y) \leq 0$ and $\x \neq \y$, then $\B(\x,\z) > \B(\y,\z)$.</p>
<p><em>Proof.</em> To start, we establish a general identity via algebra:
$$
\begin{align}
\B(\x, \z) - \B(\x, \y) - \B(\y,\z)
&= (\x - \z)^2 - (\x - \y)^2 - (\y - \z)^2\\<br />
&= -2\y^2 - 2 \x \cdot \z + 2 \x \cdot \y + 2 \y \cdot \z\\<br />
&= -2 (\x - \y) \cdot (\z - \y).
\end{align}
$$
Now suppose $ (\x - \y) \cdot (\z - \y) \leq 0$. Then, given the negative sign on the $-2$ in the established identity,
$$ \B(\x, \z) - \B(\x, \y) - \B(\y,\z) \geq 0, $$
from which we derive
$$ \B(\x, \z) \geq \B(\x, \y) + \B(\y,\z). $$
Now, since $\x \neq \y$ by hypothesis, $\B(\x,\y) > 0$. Thus $\B(\x,\z) > \B(\y,\z)$, as desired.
<span class="floatright">$\Box$</span></p>
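<p>If you’d like to sanity-check the algebra, here’s a small Python sketch that verifies the identity numerically. The three points are arbitrary choices of mine, picked so that the angle at $\y$ is obtuse.</p>
<pre><code class="language-python">def dot(x, y):
    return sum(a * b for a, b in zip(x, y))

def minus(x, y):
    return tuple(a - b for a, b in zip(x, y))

def brier(x, y):
    return dot(minus(x, y), minus(x, y))

x = (2.0, 2.0)   # the outside point
y = (1.0, 1.0)   # the vertex of the angle
z = (0.0, 1.0)   # a third point

angle_term = dot(minus(x, y), minus(z, y))
left_side = brier(x, z) - brier(x, y) - brier(y, z)
right_side = -2 * angle_term

print(angle_term)                # -1.0, so the angle at y is obtuse
print(left_side, right_side)     # 2.0 2.0, as the identity requires
print(brier(x, z), brier(y, z))  # 5.0 1.0, so y is closer to z than x is
</code></pre>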
<p>It follows now that if $\x$ isn’t a probability assignment, there’s a probability assignment that’s closer to every truth-value assignment than $\x$ is.</p>
<p><strong>Corollary (Brier Dominance).</strong> If $\x \not \in P$ then there is a $\p \in P$ such that $\B(\p,\v) < \B(\x, \v)$ for all $\v \in V$.</p>
<p><em>Proof.</em> Fix $\x \not \in P$, and let $\p$ be the member of $P$ that minimizes $B(\y,\x)$ as a function of $\y$. The Convexity theorem tells us that $P$ is convex, so the Obtusity lemma implies $(\x - \p) \cdot (\v - \p) \leq 0$ for every $\v \in V$. And since $\x \neq \p$ (because $\x \not \in P$), the last theorem entails $\B(\p,\v) < \B(\x, \v)$, as desired.
<span class="floatright">$\Box$</span></p>
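<p>Here’s a concrete instance in Python, offered as a sketch rather than a general algorithm. The credences $(0.5, 0.4, 0.4)$ sum to $1.3$, so they aren’t probabilistic. For this particular point, subtracting the same amount from each coordinate lands inside $P$, and the code relies on that being the closest point of $P$; in general some coordinates would have to be clipped at zero. The names are my own.</p>
<pre><code class="language-python">def brier(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def truth_value_assignments(n):
    # All n-tuples with a single 1 and zeros elsewhere.
    return [tuple(1.0 if i == j else 0.0 for j in range(n)) for i in range(n)]

x = (0.5, 0.4, 0.4)  # sums to 1.3, so not a probability assignment

# Spread the excess evenly; valid here because no coordinate goes negative.
excess = (sum(x) - 1.0) / len(x)
p = tuple(a - excess for a in x)

# How much closer p is than x to each truth-value assignment.
gaps = [brier(x, v) - brier(p, v) for v in truth_value_assignments(len(x))]

print([round(a, 3) for a in p])
print([round(g, 3) for g in gaps])
</code></pre>
<p>Each gap comes out at about $0.03$, so the probabilistic assignment $(0.4, 0.3, 0.3)$ is closer to every truth-value assignment than the original credences are.</p>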
<p>This is the core of the main result we’ve been working towards. Hooray! But, we still have one piece of unfinished business. For what if $\p$ is itself dominated??</p>
<h1 id="undominated-dominance">Undominated Dominance</h1>
<p>We’ve shown that credences which violate the probability axioms are always “accuracy dominated” by some assignment of credences that obeys those axioms. But what if those dominating, probabilistic credences are themselves dominated? <em>What if they’re dominated by non-probabilistic credences??</em></p>
<p>For all we’ve said, that’s a real possibility. And if it actually obtains, then there’s nothing especially accuracy-conducive about the laws of probability. So we had better rule this possibility out. Luckily, that’s pretty easy to do.</p>
<p>In fact, the real work here was already done back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 3/">Part 3</a> of the series. There we showed that Brier distance is a “proper” measure of inaccuracy: each probability assignment expects itself to do best with respect to accuracy, if inaccuracy is measured by Brier distance.</p>
<p>As a reminder, we wrote $\EIpq$ for the expected inaccuracy of probability assignment $\q$ according to assignment $\p$. When inaccuracy is measured in terms of Brier distance:
$$ \EIpq = p_1 \B(\q,\v_1) + p_2 \B(\q,\v_2) + \ldots + p_n \B(\q,\v_n). $$
Here $\v_i$ is the truth-value assignment with a $1$ in the $i$-th coordinate, and $0$ everywhere else. What we showed in Part 3 was:</p>
<p><strong>Theorem.</strong>
$\EIpq$ is uniquely minimized when $\q = \p$.</p>
<p>And notice, this would be impossible if there were some $\q \neq \p$ such that $\B(\q,\v_i) \leq \B(\p,\v_i)$ for all $i$. For then the weighted average $\EIpq$ would have to be no larger than $\EIpp$. And this contradicts the theorem, which says that $\EIpq > \EIpp$ for all $\q \neq \p$.</p>
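<p>To see the theorem at work, here’s a rough Python sketch that computes expected inaccuracy for one probability assignment and a few rivals, the last of which is not itself probabilistic. The particular numbers are arbitrary choices of mine, and a handful of examples is of course no substitute for the proof in Part 3.</p>
<pre><code class="language-python">def brier(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def expected_inaccuracy(p, q):
    # EI_p(q): the p-weighted average of q's Brier distance from each v_i.
    n = len(p)
    total = 0.0
    for i in range(n):
        v_i = tuple(1.0 if j == i else 0.0 for j in range(n))
        total += p[i] * brier(q, v_i)
    return total

p = (0.2, 0.3, 0.5)
rivals = [(0.3, 0.3, 0.4), (0.0, 0.0, 1.0), (0.2, 0.2, 0.5)]

own = expected_inaccuracy(p, p)
# How much worse each rival does by p's lights, next to its Brier distance from p.
excesses = [expected_inaccuracy(p, q) - own for q in rivals]
distances = [brier(p, q) for q in rivals]

print(round(own, 3))
print([round(e, 3) for e in excesses])
print([round(d, 3) for d in distances])
</code></pre>
<p>In these examples each rival’s excess expected inaccuracy matches its Brier distance from $\p$, so it is positive whenever the rival differs from $\p$.</p>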
<p>So, at long last, we have the full result we want:</p>
<p><strong>Corollary (Undominated Brier Dominance).</strong> If $\x \not \in P$ then there is a $\p \in P$ such that $\B(\p,\v) < \B(\x, \v)$ for all $\v \in V$. Moreover, there is no $\q \in P$ with $\q \neq \p$ such that $\B(\q,\v) \leq \B(\p, \v)$ for all $\v \in V$.</p>
<p>So the laws of probability really are specially conducive to accuracy, as measured using Brier distance. Only probabilistic credence assignments are undominated.</p>
<h1 id="where-to-next">Where to Next?</h1>
<p>That’s a pretty sweet result. And it raises plenty of fun and interesting questions we could look at next. Here are three:</p>
<ol>
<li><p>What about other ways of measuring inaccuracy besides Brier? Are there reasonable alternatives, and if so, do similar results apply to them?</p></li>
<li><p>What about other probabilistic principles, like Conditionalization, the Principal Principle, or the Principle of Indifference? Can we take this approach beyond the probability axioms?</p></li>
<li><p>Speaking of the probability axioms, we’ve been working with a pretty pared-down conception of a “probability assignment”. Usually we assign probabilities not just to atomic possibilities, but to disjunctions/sets of possibilities: e.g. “the prize is behind either door #1 or door #2”. Can we extend this result to such “super-atomic” probability assignments?</p></li>
</ol>
<p>We’ll tackle some or all of these questions in future posts. But I haven’t yet decided which ones or in what order.</p>
<p>So for now let’s just stop and appreciate the work we’ve already done. Because not only have we proved one of the most central and interesting results of the accuracy framework. But also, in a lot of ways the hardest work is already behind us. If you’ve come this far, I think you deserve a nice pat on the back.</p>
<p><img src="http://i1145.photobucket.com/albums/o503/KimmieRocks/tumblr_liqmv89ru51qb2dn6.gif" alt="" /></p>
Journal Submission Rates by Gender: A Look at the APA/BPA Data
http://jonathanweisberg.org/post/A%20Look%20at%20the%20APA-BPA%20Data/
Tue, 06 Jun 2017 11:45:04 -0500http://jonathanweisberg.org/post/A%20Look%20at%20the%20APA-BPA%20Data/
<p><strong>Update:</strong> <em>editors at CJP and Phil Quarterly have kindly shared some important, additional information. See the edit below for details.</em></p>
<p>A <a href="https://link.springer.com/article/10.1007/s11098-017-0919-0" target="_blank">new paper</a> on the representation of women in philosophy journals prompted some debate in the philosophy blogosphere last week. The paper found women to be underrepresented across a range of prominent journals, yet overrepresented in the two journals studied where review was non-anonymous.</p>
<p>Commenters <a href="http://dailynous.com/2017/05/26/women-philosophy-journals-new-data/" target="_blank">over at Daily Nous</a> complained about the lack of base-rate data. How many of the submissions to these journals were from women? In some respects, it’s hard to know what to make of these findings without such data.</p>
<p>A few commenters linked to <a href="http://www.apaonline.org/resource/resmgr/journal_surveys_2014/apa_bpa_survey_data_2014.xlsx" target="_blank">a survey</a> conducted by the APA and BPA a while back, which supplies some numbers along these lines. I was surprised, because I’ve wondered about these numbers, but I didn’t recall seeing this data-set before. I was excited too because the data-set is huge, in a way: it covers more than 30,000 submissions at 40+ journals over a span of three years!</p>
<p>So I was keen to give it a closer look. This post walks through that process. But I should warn you up front that the result is kinda disappointing.</p>
<h1 id="initial-reservations">Initial Reservations</h1>
<p>Right away some conspicuous omissions stand out.<sup class="footnote-ref" id="fnref:1"><a rel="footnote" href="#fn:1">1</a></sup> A good number of the usual suspects aren’t included, like <em>Philosophical Studies</em>, <em>Analysis</em>, and <em>Australasian Journal of Philosophy</em>. So the usual worries about response rates and selection bias apply.</p>
<p>The data are also a bit haphazard and incomplete. Fewer than half of the journals that responded included gender data. And some of those numbers are suspiciously round.</p>
<p>Still, there’s hope. We have data on over ten thousand submissions even after we exclude journals that didn’t submit any gender data. As long as they paint a reasonably consistent picture, we stand to learn a lot.</p>
<h1 id="first-pass">First Pass</h1>
<p>For starters we’ll just do some minimal cleaning. We’ll exclude data from 2014, since almost no journals supplied it. And we’ll lump together the submissions from the remaining three years, 2011–13, since the gender data isn’t broken down by year.</p>
<p>We can then calculate the following cross-journal tallies for 2011–13:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Accepted submissions</th>
<th align="right">Rejected submissions</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Men</td>
<td align="right">792</td>
<td align="right">9104</td>
</tr>
<tr>
<td align="left">Women</td>
<td align="right">213</td>
<td align="right">1893</td>
</tr>
</tbody>
</table>
<p>The difference here looks notable at first: 17.5% of submitted papers came from women compared with 21.2% of accepted papers, a statistically significant difference (<em>p</em> = 0.002).</p>
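<p>For readers who want to check these figures, here is a minimal Python sketch. The post doesn’t say which test produced the <em>p</em>-value, so the sketch assumes a chi-square test of independence with Yates’ continuity correction. The function names <code>share_of_women</code> and <code>chi2_yates_p</code> are made up for illustration.</p>

```python
import math

# 2x2 tallies from the table above: (accepted, rejected)
men = (792, 9104)
women = (213, 1893)

def share_of_women(men_count, women_count):
    """Percentage of a pool of papers that came from women."""
    return 100 * women_count / (men_count + women_count)

def chi2_yates_p(a, b, c, d):
    """p-value for the 2x2 table [[a, b], [c, d]], using a chi-square test
    with Yates' continuity correction (an assumed choice of test)."""
    n = a + b + c + d
    # Shrink |ad - bc| by n/2 (Yates), but never below zero.
    diff = max(abs(a * d - b * c) - n / 2, 0)
    chi2 = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # With one degree of freedom, the chi-square tail is erfc(sqrt(chi2 / 2)).
    return math.erfc(math.sqrt(chi2 / 2))

submitted = share_of_women(sum(men), sum(women))  # women's share of all submissions
accepted = share_of_women(men[0], women[0])       # women's share of accepted papers
p_value = chi2_yates_p(*men, *women)

print(f"{submitted:.1f}% of submissions, {accepted:.1f}% of accepted, p = {p_value:.3f}")
```

<p>Under that assumption the sketch should land on the same rounded figures as the text. A different test (Fisher’s exact test, say) would give a slightly different <em>p</em>-value.</p>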
<p>But if we plot the data by journal, the picture becomes much less clear:</p>
<p><img src="http://jonathanweisberg.org/img/apa_bpa_data_files/unnamed-chunk-3-1.png" alt="" /></p>
<p>The dashed line<sup class="footnote-ref" id="fnref:2"><a rel="footnote" href="#fn:2">2</a></sup> indicates parity: where women’s share of accepted papers would equal their share of submissions. At journals above the line, women make up a larger portion of published authors than they do submitting authors. At journals below the line, it’s the reverse.</p>
<p>It’s pretty striking how much variation there is between journals. For example, <em>BJPS</em> is 12 points above the parity line while <em>Phil Quarterly</em> is 9 points below it.</p>
<p>It’s also notable that it’s the largest journals which diverge the most from parity: <em>BJPS</em>, <em>EJP</em>, <em>MIND</em>, and <em>Phil Quarterly</em>. (Note: <em>Hume Studies</em> is actually the most extreme by far. But I’ve excluded it from the plot because it’s very small, and as an extreme outlier it badly skews the <em>y</em>-axis.)</p>
<p>It’s hard to see all the details in the plot, so here’s the same data in a table.</p>
<table>
<thead>
<tr>
<th align="left">Journal</th>
<th align="right">Submissions</th>
<th align="right">Accepted</th>
<th align="left">% of submissions from women</th>
<th align="left">% of accepted papers from women</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Ancient Philosophy</td>
<td align="right">346</td>
<td align="right">63</td>
<td align="left">20</td>
<td align="left">24</td>
</tr>
<tr>
<td align="left">British Journal for the Philosophy of Science</td>
<td align="right">1267</td>
<td align="right">117</td>
<td align="left">15</td>
<td align="left">27</td>
</tr>
<tr>
<td align="left">Canadian Journal of Philosophy</td>
<td align="right">792</td>
<td align="right">132</td>
<td align="left">20</td>
<td align="left">21</td>
</tr>
<tr>
<td align="left">Dialectica</td>
<td align="right">826</td>
<td align="right">74</td>
<td align="left">12.05</td>
<td align="left">15.48</td>
</tr>
<tr>
<td align="left">European Journal for Philosophy</td>
<td align="right">1554</td>
<td align="right">98</td>
<td align="left">11.84</td>
<td align="left">25</td>
</tr>
<tr>
<td align="left">Hume Studies</td>
<td align="right">152</td>
<td align="right">30</td>
<td align="left">23.7</td>
<td align="left">58.1</td>
</tr>
<tr>
<td align="left">Journal of Applied Philosophy</td>
<td align="right">510</td>
<td align="right">47</td>
<td align="left">20</td>
<td align="left">20</td>
</tr>
<tr>
<td align="left">Journal of Political Philosophy</td>
<td align="right">1143</td>
<td align="right">53</td>
<td align="left">35</td>
<td align="left">30</td>
</tr>
<tr>
<td align="left">MIND</td>
<td align="right">1498</td>
<td align="right">74</td>
<td align="left">10</td>
<td align="left">5</td>
</tr>
<tr>
<td align="left">Oxford Studies in Ancient Philosophy</td>
<td align="right">290</td>
<td align="right">43</td>
<td align="left">21</td>
<td align="left">20.3</td>
</tr>
<tr>
<td align="left">Philosophy East and West</td>
<td align="right">320</td>
<td align="right">66</td>
<td align="left">20</td>
<td align="left">15</td>
</tr>
<tr>
<td align="left">Phronesis</td>
<td align="right">388</td>
<td align="right">38</td>
<td align="left">24</td>
<td align="left">25</td>
</tr>
<tr>
<td align="left">The Journal of Aesthetics and Art Criticism</td>
<td align="right">611</td>
<td align="right">93</td>
<td align="left">29</td>
<td align="left">27</td>
</tr>
<tr>
<td align="left">The Philosophical Quarterly</td>
<td align="right">2305</td>
<td align="right">77</td>
<td align="left">14</td>
<td align="left">5</td>
</tr>
</tbody>
</table>
<h1 id="rounders-removed">Rounders Removed</h1>
<p>I mentioned that some of the numbers look suspiciously round. Maybe 10% of submissions to <em>MIND</em> really were from women, compared with 5% of accepted papers. But some of these cases probably involve non-trivial rounding, maybe even eyeballing or guesstimating. So let’s see how things look without them.</p>
<p>If we omit journals where both percentages are round (integer multiples of 5), that leaves ten journals. And the gap from before is even more pronounced: 16.3% of submissions from women compared with 22.9% of accepted papers (<em>p</em> = 0.0000003).</p>
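<p>Here is a rough Python sketch of that filtering step, using the numbers from the table above. It assumes the pooled percentages are volume-weighted averages of the per-journal percentages, which the post doesn’t spell out. The helper <code>is_round</code> is a made-up name.</p>

```python
# (journal, submissions, accepted, % of submissions from women, % of accepted from women),
# copied from the table above.
journals = [
    ("Ancient Philosophy", 346, 63, 20, 24),
    ("British Journal for the Philosophy of Science", 1267, 117, 15, 27),
    ("Canadian Journal of Philosophy", 792, 132, 20, 21),
    ("Dialectica", 826, 74, 12.05, 15.48),
    ("European Journal for Philosophy", 1554, 98, 11.84, 25),
    ("Hume Studies", 152, 30, 23.7, 58.1),
    ("Journal of Applied Philosophy", 510, 47, 20, 20),
    ("Journal of Political Philosophy", 1143, 53, 35, 30),
    ("MIND", 1498, 74, 10, 5),
    ("Oxford Studies in Ancient Philosophy", 290, 43, 21, 20.3),
    ("Philosophy East and West", 320, 66, 20, 15),
    ("Phronesis", 388, 38, 24, 25),
    ("The Journal of Aesthetics and Art Criticism", 611, 93, 29, 27),
    ("The Philosophical Quarterly", 2305, 77, 14, 5),
]

def is_round(pct):
    """True if a percentage is an integer multiple of 5."""
    return pct % 5 == 0

# Drop journals where BOTH percentages are round.
kept = [j for j in journals if not (is_round(j[3]) and is_round(j[4]))]

# Pool the kept journals, weighting each percentage by that journal's volume.
women_submitted = sum(subs * pct_sub / 100 for _, subs, _, pct_sub, _ in kept)
women_accepted = sum(acc * pct_acc / 100 for _, _, acc, _, pct_acc in kept)
share_submitted = 100 * women_submitted / sum(j[1] for j in kept)
share_accepted = 100 * women_accepted / sum(j[2] for j in kept)

print(len(kept), round(share_submitted, 1), round(share_accepted, 1))
```

<p>Note that the filter only drops a journal when both of its percentages are round, so <em>Phil Quarterly</em> (14% and 5%) stays in.</p>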
<p>But it’s still a few, high-volume journals driving the result: <em>BJPS</em> and <em>EJP</em> do a ton of business, and each has a large gap. So much so that they’re able to overcome the opposite contribution of <em>Phil Quarterly</em> (which does a mind-boggling amount of business!).</p>
<h1 id="editors-anonymous">Editors Anonymous</h1>
<p>Naturally I fell to wondering how these big journals differ in their editorial practices. What are they doing differently that leads to such divergent results?</p>
<p>One thing the data tell us is which journals practice fully anonymous review, with even the editors ignorant of the author’s identity. That narrows it down to just three journals: <em>CJP</em>, <em>Dialectica</em>, and <em>Phil Quarterly</em>.<sup class="footnote-ref" id="fnref:3"><a rel="footnote" href="#fn:3">3</a></sup> The tallies then are:</p>
<table>
<thead>
<tr>
<th align="left"></th>
<th align="right">Accepted submissions</th>
<th align="right">Rejected submissions</th>
</tr>
</thead>
<tbody>
<tr>
<td align="left">Men</td>
<td align="right">240</td>
<td align="right">3103</td>
</tr>
<tr>
<td align="left">Women</td>
<td align="right">43</td>
<td align="right">537</td>
</tr>
</tbody>
</table>
<p>And now the gap is gone: 14.8% of submissions from women, compared with 15.2% of accepted papers—not a statistically significant difference (<em>p</em> = 0.91). That makes it look like the gap is down to editors’ decisions being influenced by knowledge of the author’s gender (whether deliberately or unconsciously).</p>
<p>But notice again, <em>Phil Quarterly</em> is still a huge part of this story. Its high volume and unusually negative differential compensate for the more modest, positive differentials at <em>CJP</em> and <em>Dialectica</em>. So I still want to know more about <em>Phil Quarterly</em>, and what might explain that differential.</p>
<p><strong>Edit</strong>: editors at <em>CJP</em> and <em>Phil Quarterly</em> kindly wrote with the following, additional information.</p>
<p>At <em>CJP</em>, the author’s identity is withheld from the editors while they decide whether to send the paper for external review, but then their identity is revealed (presumably to avoid inviting referees who are unacceptably close to the author—e.g. those identical to the author).</p>
<p>And the chairman of <em>Phil Quarterly</em>’s editorial board, Jessica Brown, writes:</p>
<blockquote>
<ol>
<li>the PQ is very aware of issues about the representation of women, unsurprisingly given that the editorial board consists of myself, Sarah Broadie and Sophie-Grace Chappell. We monitor data on submissions by women and papers accepted in the journal every year.</li>
<li>the PQ has for many years had fully anonymised processing including the point at which decisions on papers are made (i.e. accept, reject, R and R etc). So, when we make such decisions we have no idea of the identity of the author.</li>
<li><p>While in some years the data has concerned us, more recently the figures do look better which is encouraging:</p>
<ul>
<li>16-17: 25% declared female authored papers accepted; 16% submissions</li>
<li>15-16: 14% accepted; 15% submissions</li>
<li>14-15: 16% accepted; 16% submissions</li>
</ul></li>
</ol>
</blockquote>
<h1 id="a-gruesome-conclusion">A Gruesome Conclusion</h1>
<p>In the end, I don’t see a clear lesson here. Before drawing any conclusions from the aggregated, cross-journal tallies, it seems we’d need to know more about the policies and practices of the journals driving them. Otherwise we’re liable to be misled to a false generalization about a heterogeneous group.</p>
<p>Some of that policy-and-practice information is probably publicly available; I haven’t had a chance to look. And I bet a lot of it is available informally, if you just talk to the right people. So this data-set could still be informative on our base-rate question. But sadly, I don’t think I’m currently in a position to make informative use of it.</p>
<p><img src="http://i.imgur.com/ojvPBaY.jpg" alt="" /></p>
<h1 id="technical-note">Technical Note</h1>
<p>This post was written in R Markdown and the source is <a href="https://github.com/jweisber/rgo/blob/master/apa bpa data/apa_bpa_data.Rmd" target="_blank">available on GitHub</a>.</p>
<div class="footnotes">
<hr />
<ol>
<li id="fn:1">No, I don’t mean <em>Ergo</em>! We published our first issue in 2014 while the survey covers mainly 2011–13.
<a class="footnote-return" href="#fnref:1"><sup>[return]</sup></a></li>
<li id="fn:2"><strong>Edit</strong>: the parity line was solid blue originally. But that misled some people into reading it as a fitted line. For reference and posterity, <a href="http://jonathanweisberg.org/img/apa_bpa_data_files/unnamed-chunk-3-2.png">the original image is here</a>.
<a class="footnote-return" href="#fnref:2"><sup>[return]</sup></a></li>
<li id="fn:3">That’s if we continue to exclude journals with very round numbers. Adding these journals back in doesn’t change the following result, though.
<a class="footnote-return" href="#fnref:3"><sup>[return]</sup></a></li>
</ol>
</div>
Accuracy for Dummies, Part 6: Obtusity
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%206%20-%20Obtusity/
Wed, 24 May 2017 00:00:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%206%20-%20Obtusity/
<p><a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 5 - Convexity/">Last time</a> we saw that the set of probability assignments is <em>convex</em>. Today we’re going to show that convex sets have a special sort of “obtuse” relationship with outsiders. Given a point <em>outside</em> a convex set, there is always a point <em>in</em> the set that forms a right-or-obtuse angle with it.</p>
<p>Recall our 2D diagram from the first post. The convex set of interest here is the diagonal line segment from $(0,1)$ to $(1,0)$:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" /></p>
<p>For any point outside the diagonal, like $c^* $, there is a point like $c'$ on it that forms a right angle with all other points on the diagonal. As a result, $c'$ is closer to all other points on the diagonal than $c^* $ is. In particular, $c'$ is closer to both vertices, so it’s always more accurate than $c^*$. It’s “closer to the truth”.</p>
<p>The insider point $c'$ that we used in this case is the closest point on the diagonal to $c^*$. That’s what licenses the right-triangle reasoning here. Today we’re generalizing this strategy to $n$ dimensions.</p>
<p>To do that, we need some tools for reasoning about $n$-dimensional geometry.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\x}{\vec{x}}
\newcommand{\y}{\vec{y}}
\newcommand{\z}{\vec{z}}
\newcommand{\B}{B}
$</p>
<h1 id="arithmetic-with-arrows">Arithmetic with Arrows</h1>
<p>You’re familiar with arithmetic in one dimension: adding, subtracting, and multiplying single numbers. What about points in $n$ dimensions?</p>
<p>We introduced two ideas for arithmetic with points last time. We’ll add a few more today, and also talk about what they mean geometrically.</p>
<p>Suppose you have two points $\x$ and $\y$ in $n$ dimensions:
$$
\begin{align}
\x &= (x_1, \ldots, x_n),\\<br />
\y &= (y_1, \ldots, y_n).
\end{align}
$$
Their sum $\x + \y$, as we saw last time, is defined as follows:
$$ \x + \y = (x_1 + y_1, \ldots, x_n + y_n). $$
In other words, points are added coordinate-wise.</p>
<p>This definition has a natural, geometric meaning we didn’t mention last time. Start by thinking of $\x$ and $\y$ as <em>vectors</em>: arrows pointing from the origin to the points $\x$ and $\y$. Then $\x + \y$ just amounts to placing the tail of one arrow at the tip of the other and taking the point where the second arrow ends:
<img src="http://jonathanweisberg.org/img/accuracy/VectorAddition.png" alt="" />
(Notice that we’re continuing our usual practice of bold letters for points/vectors like $\x$ and $\y$, and italics for single numbers like $x_1$ and $y_3$.)</p>
<p>You can also multiply a vector $\x$ by a single number, $a$. The definition is once again coordinate-wise:
$$ a \x = (a x_1, \ldots, a x_n). $$
And again there’s a natural, geometric meaning. We’ve lengthened the vector $\x$ by a factor of $a$.
<img src="http://jonathanweisberg.org/img/accuracy/VectorMultiplication.png" alt="" />
Notice that if $a$ is between $0$ and $1$, then “lengthening” is actually shortening. For example, multiplying a vector by $a = 1/ 2$ makes it half as long.</p>
<p>If $a$ is negative, then multiplying by $a$ reverses the direction of the arrow. For example, multiplying the northeasterly arrow $(1,1)$ by $-1$ yields the southwesterly arrow pointing to $(-1,-1)$.</p>
<p>That means we can define subtraction in terms of addition and multiplication by negative one (just as with single numbers):
$$
\begin{align}
\x - \y &= \x + (-1 \times \y)\\<br />
&= (x_1 - y_1, \ldots, x_n - y_n).
\end{align}
$$
So vector subtraction amounts to coordinate-wise subtraction.</p>
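<p>The three coordinate-wise operations so far are easy to try out. Here is a minimal Python sketch, with points represented as tuples. The function names are just illustrative choices.</p>

```python
def add(x, y):
    """Coordinate-wise sum of two points of the same dimension."""
    return tuple(xi + yi for xi, yi in zip(x, y))

def scale(a, x):
    """Multiply every coordinate of x by the number a."""
    return tuple(a * xi for xi in x)

def subtract(x, y):
    """x - y, defined as x + (-1 * y)."""
    return add(x, scale(-1, y))

print(add((1, 2), (3, 1)))       # (4, 3)
print(scale(-1, (1, 1)))         # (-1, -1): the southwesterly arrow
print(subtract((4, 3), (3, 1)))  # (1, 2)
```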
<p>But what about multiplying two vectors? That’s actually different from what you might expect! We don’t just multiply coordinate-wise. We do that <strong>and then add up the results</strong>:
$$ \x \cdot \y = x_1 y_1 + \ldots + x_n y_n. $$
So the product of two vectors is <strong>not a vector</strong>, but a number. That number is called the <em>dot product</em>, $\x \cdot \y$.</p>
<p>Why are dot products defined this way? Why do we add up the results of coordinate-wise multiplication to get a single number? Because it yields a more useful extension of the concept of multiplication from single numbers to vectors. We’ll see part of that in a moment, in the geometric meaning of the dot product.</p>
<p>(There’s an algebraic side to the story too, having to do with the axioms that characterize the real numbers—<a href="https://en.wikipedia.org/wiki/Field_(mathematics)" target="_blank">the field axioms</a>. We won’t go into that, but it comes out in <a href="http://www.youtube.com/watch?v=63HpaUFEtXY&t=8m28s" target="_blank">this bit</a> of a beautiful lecture by Francis Su, especially around <a href="http://www.youtube.com/watch?v=63HpaUFEtXY&t=11m45s" target="_blank">the 11:45 mark</a>.)</p>
<h1 id="signs-and-their-significance">Signs and Their Significance</h1>
<p>In two dimensions, a right angle has a special algebraic property: the dot product of two arrows making the angle is always zero.</p>
<p>Imagine a right triangle at the origin, with one leg going up to the point $(0,1)$ and the other leg going out to $(1,0)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorRightAngle.png" alt="" />
The dot product of those two vectors is $(1,0) \cdot (0,1) = 1 \times 0 + 0 \times 1 = 0$. One more example: consider the right angle formed by the vectors $(-3,3)$ and $(1,1)$.
<img src="http://jonathanweisberg.org/img/accuracy/VectorRightAngle2.png" alt="" />
Again, the dot product is $(-3,3) \cdot (1,1) = -3 \times 1 + 3 \times 1 = 0.$</p>
<p>Going a bit further: the dot product is always positive for acute angles, and negative for obtuse angles. Take the vectors $(5,0)$ and $(-1,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorObtuseAngle.png" alt="" />
Then we have $(5,0) \cdot (-1,1) = -5$. Whereas for $(5,0)$ and $(1,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/VectorAcuteAngle.png" alt="" />
we find $(5,0) \cdot (1,1) = 5$.</p>
<p>So the sign of the dot product reflects the angle formed by the vectors $\x$ and $\y$:</p>
<ul>
<li>acute angle: $\x \cdot \y > 0$,</li>
<li>right angle: $\x \cdot \y = 0$,</li>
<li>obtuse angle: $\x \cdot \y < 0$.</li>
</ul>
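<p>As a quick check, this Python sketch computes the dot product for the four pairs of arrows pictured above and classifies each angle by sign. The helper <code>angle_type</code> is a made-up name, and it assumes neither arrow is the zero vector.</p>

```python
def dot(x, y):
    """Multiply coordinate-wise, then add up the results."""
    return sum(xi * yi for xi, yi in zip(x, y))

def angle_type(x, y):
    """Classify the angle between two nonzero arrows by the sign of their dot product."""
    d = dot(x, y)
    if d > 0:
        return "acute"
    if d < 0:
        return "obtuse"
    return "right"

# The four pairs of arrows from the examples above.
pairs = [((1, 0), (0, 1)), ((-3, 3), (1, 1)), ((5, 0), (-1, 1)), ((5, 0), (1, 1))]
for x, y in pairs:
    print(x, y, dot(x, y), angle_type(x, y))
```

<p>Nothing in the code is specific to two dimensions: the same functions work on tuples of any length, which is the point of switching from diagrams to dot products.</p>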
<p>That’s going to be key in generalizing to $n$ dimensions, where reasoning with diagrams breaks down. But first, one last bit of groundwork.</p>
<h1 id="algebra-with-arrows">Algebra with Arrows</h1>
<p>You can check pretty easily that vector addition and multiplication behave a lot like ordinary addition and multiplication. The usual laws of commutativity, associativity, and distribution hold:</p>
<ul>
<li>$\x + \y = \y + \x$.</li>
<li>$\x + (\y + \z) = (\x + \y) + \z$.</li>
<li>$a ( \x + \y) = a\x + a\y$.</li>
<li>$\x \cdot \y = \y \cdot \x$.</li>
<li>$\x \cdot (\y + \z) = \x \cdot \y + \x \cdot \z$.</li>
<li>$a (\x \cdot \y) = a \x \cdot \y = \x \cdot a \y$.</li>
</ul>
<p>One notable consequence, which we’ll use below, is the analogue of the familiar <a href="https://en.wikipedia.org/wiki/FOIL_method" target="_blank">“FOIL method”</a> from high school algebra:
$$
\begin{align}
(\x - \y)^2 &= (\x - \y) \cdot (\x - \y)\\<br />
&= \x^2 - 2 \x \cdot \y + \y^2.
\end{align}
$$
We’ll also make use of the fact that the Brier distance between $\x$ and $\y$ can be written $(\x - \y)^2$. Why?</p>
<p>Let’s write $\B(\x,\y)$ for the Brier distance between points $\x$ and $\y$. Recall the definition of Brier distance, which is just the square of Euclidean distance:
$$ \B(\x,\y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2. $$
Now consider that, thanks to our definition of vector subtraction:
$$ \x - \y = (x_1 - y_1, x_2 - y_2, \ldots, x_n - y_n). $$
And thanks to the definition of the dot product:
$$ (\x - \y) \cdot (\x - \y) = (x_1 - y_1)^2 + (x_2 - y_2)^2 + \ldots + (x_n - y_n)^2. $$
So $\B(\x, \y) = (\x - \y) \cdot (\x - \y)$, in other words:
$$ \B(\x, \y) = (\x - \y)^2. $$</p>
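<p>A numerical spot check can make the identity feel less abstract. The sketch below computes the Brier distance three ways for one arbitrarily chosen pair of points: directly from the definition, as a dot product, and via the FOIL expansion. The example points are made up.</p>

```python
def brier(x, y):
    """Brier distance: the sum of squared coordinate differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

def dot(x, y):
    """Multiply coordinate-wise, then add up the results."""
    return sum(xi * yi for xi, yi in zip(x, y))

def subtract(x, y):
    """Coordinate-wise difference x - y."""
    return tuple(xi - yi for xi, yi in zip(x, y))

x, y = (3, 1, 4), (1, 5, 9)  # arbitrary example points
diff = subtract(x, y)

direct = brier(x, y)                                  # from the definition
as_dot = dot(diff, diff)                              # (x - y) . (x - y)
foil = dot(x, x) - 2 * dot(x, y) + dot(y, y)          # x^2 - 2 x.y + y^2

print(direct, as_dot, foil)  # 45 45 45
```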
<h1 id="a-cute-lemma">A Cute Lemma</h1>
<p>Now we can prove the lemma that’s the aim of this post. For the intuitive idea, picture a convex set $S$ in the plane, like a pentagon. Then choose an arbitrary point $\x$ outside that set:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma.png" alt="" />
Now trace a straight line from $\x$ to the closest point of the convex region, $\y$:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma2.png" alt="" />
Finally, trace another straight line to any other point $\z$ of $S$:
<img src="http://jonathanweisberg.org/img/accuracy/ObtusityLemma3.png" alt="" />
No matter what point we choose for $\z$, the angle formed will either be right or obtuse. It cannot be acute.</p>
<p><strong>Lemma.</strong> Let $S$ be a convex set of points in $\mathbb{R}^n$. Let $\x \not \in S$, and let $\y \in S$ minimize $\B(\y, \x)$ as a function of $\y$ on the domain $S$. Then for any $\z \in S$,
$$ (\x - \y) \cdot (\z - \y) \leq 0. $$</p>
<p>Let’s pause to understand what the Lemma is saying before we dive into the proof.</p>
<p>Focus on the centered inequality. It’s about the vectors $\x - \y$ and $\z - \y$. These are the arrows pointing from $\y$ to $\x$, and from $\y$ to $\z$. So in terms of our original two dimensional diagram with the triangle:
<img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" />
we’re looking at the angle at $c'$ between $c^*$ and any point on the diagonal you like… which includes the ones we’re especially interested in, the vertices. What the lemma tells us is that this angle is always at least a right angle.</p>
<p>Of course, it’s exactly a right angle in this case, not an obtuse one. That’s because our convex region is just the diagonal line. But the Lemma could also be applied to the whole triangular region in the diagram. That’s a convex set too. And if we took a point inside the triangle as our third point, the angle formed would be obtuse. (This is actually important if you want to generalize the dominance theorem beyond what we’ll prove next time. But for us it’s just a mathematical extra.)</p>
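<p>Before the proof, here is a small Python sketch that tests the Lemma’s inequality on the diagonal from $(0,1)$ to $(1,0)$. It is only a spot check under stated assumptions: the helper <code>closest_on_segment</code> is made up for this special case (a line segment, not a general convex set), and the two outside points are arbitrary. Exact fractions are used so that rounding can’t muddy the signs.</p>

```python
from fractions import Fraction as F

def dot(x, y):
    """Multiply coordinate-wise, then add up the results."""
    return sum(xi * yi for xi, yi in zip(x, y))

def sub(x, y):
    """Coordinate-wise difference x - y."""
    return tuple(xi - yi for xi, yi in zip(x, y))

def closest_on_segment(x, a, b):
    """Closest point to x on the segment from a to b (a convex set)."""
    direction = sub(b, a)
    # Project x onto the line through a and b, then clamp to stay on the segment.
    t = dot(sub(x, a), direction) / dot(direction, direction)
    t = max(F(0), min(F(1), t))
    return tuple(ai + t * di for ai, di in zip(a, direction))

a, b = (F(0), F(1)), (F(1), F(0))                         # endpoints of the diagonal
samples = [(F(k, 10), 1 - F(k, 10)) for k in range(11)]   # points z on the diagonal
outsiders = [(F(4, 5), F(3, 5)), (F(2), F(0))]            # two arbitrary points off it

for x in outsiders:
    y = closest_on_segment(x, a, b)
    products = [dot(sub(x, y), sub(z, y)) for z in samples]
    print(y, max(products))  # the largest dot product is never positive
```

<p>For the first outsider the closest point falls in the middle of the diagonal and every angle is exactly right. For the second, the closest point is the vertex $(1,0)$, and the angles with the other points are obtuse.</p>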
<p>Now let’s prove the Lemma.</p>
<p><em>Proof.</em> Because $S$ is convex and $\y$ and $\z$ are in $S$, any mixture of $\y$ and $\z$ must also be in $S$. That is, every point $\lambda \z + (1-\lambda) \y$ is in $S$, given $0 \leq \lambda \leq 1$.</p>
<p>Notice that we can rewrite $\lambda \z + (1-\lambda) \y$ as follows:
$$ \lambda \z + (1-\lambda) \y = \y + \lambda(\z - \y). $$
We’ll use this fact momentarily.</p>
<p>Now, by hypothesis $\y$ is at least as close to $\x$ as any other point of $S$ is. So, in particular, $\y$ is at least as close to $\x$ as the mixtures of $\y$ and $\z$ are. Thus, for any given $\lambda \in [0,1]$:
$$ \B(\y,\x) \leq \B(\lambda \z + (1-\lambda) \y, \x). $$
Using algebra, we can transform the right-hand side as follows:
$$
\begin{align}
\B(\lambda \z + (1-\lambda) \y, \x) &= \B(\x, \lambda \z + (1-\lambda) \y)\\<br />
&= \B(\x, \y + \lambda(\z - \y))\\<br />
&= (\x - (\y + \lambda(\z - \y)))^2\\<br />
&= ((\x - \y) - \lambda(\z - \y))^2\\<br />
&= (\x - \y)^2 + \lambda^2(\z - \y)^2 - 2\lambda(\x - \y) \cdot (\z - \y)\\<br />
&= \B(\x,\y) + \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y).
\end{align}
$$
Combining this equation with the previous inequality, we have:
$$ \B(\y,\x) \leq \B(\x,\y) + \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y). $$
And because $\B(\y, \x) = \B(\x, \y)$, this becomes:<br />
$$ 0 \leq \lambda^2\B(\z,\y) - 2\lambda(\x - \y) \cdot (\z - \y). $$
If we then restrict our attention to $\lambda > 0$, we can divide and rearrange terms to get:
$$ (\x - \y) \cdot (\z - \y) \leq \frac{\lambda\B(\z,\y)}{2}. $$
And since this inequality holds no matter how small $\lambda$ is, it follows that
$$ (\x - \y) \cdot (\z - \y) \leq 0, $$
as desired.
<span class="floatright">$\Box$</span></p>
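<p>The last step of the proof goes by quickly, so here is one way to spell it out, using only quantities already in play. Suppose for contradiction that $(\x - \y) \cdot (\z - \y) = c$ for some $c > 0$. Then $\z \neq \y$, so $\B(\z,\y) > 0$, and we can choose
$$ \lambda = \min\left(1, \frac{c}{\B(\z,\y)}\right), $$
which lies in $(0,1]$ and gives
$$ \frac{\lambda\B(\z,\y)}{2} \leq \frac{c}{2} < c. $$
That contradicts $(\x - \y) \cdot (\z - \y) \leq \lambda\B(\z,\y)/2$, so the dot product can’t be positive.</p>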
<h1 id="taking-stock">Taking Stock</h1>
<p>Here’s what we’ve got from this post and the last one:</p>
<ul>
<li>Last time: the set of probability functions $P$ is convex.</li>
<li>This time: given a point $\x$ outside $P$, there’s a point $\y$ inside $P$ that forms a right-or-obtuse angle with every other point $\z$ in $P$.</li>
</ul>
<p>Intuitively, it should follow that:</p>
<ul>
<li>$\y$ is closer to every $\z$ in $P$ than $\x$ is.</li>
</ul>
<p>And indeed, that’s what we’ll show in the next post!</p>
Accuracy for Dummies, Part 5: Convexity
http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%205%20-%20Convexity/
Thu, 18 May 2017 10:35:00 -0500http://jonathanweisberg.org/post/Accuracy%20for%20Dummies%20-%20Part%205%20-%20Convexity/
<p>In this and the next two posts we’ll establish the central theorem of the accuracy framework. We’ll show that the laws of probability are specially suited to the pursuit of accuracy, measured in Brier distance.</p>
<p>We showed this for cases with two possible outcomes, like a coin toss, way back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">the first post of this series</a>. A simple, <a href="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png">two-dimensional diagram</a> was all we really needed for that argument. To see how the same idea extends to any number of dimensions, we need to generalize the key ingredients of that reasoning to $n$ dimensions.</p>
<p>This post supplies the first ingredient: the convexity theorem.</p>
<h1 id="convex-shapes">Convex Shapes</h1>
<p>Convex shapes are central to the accuracy framework because, in a way, the laws of probability have a convex shape. Hopefully that mystical pronouncement will make sense by the end of this post.</p>
<p>You probably know a convex shape when you see one. Circles, triangles, and octagons are convex; pentagrams and the state of Texas are not.</p>
<p>But what makes a convex shape convex? Roughly: <em>it contains all its connecting lines</em>. If you take any two points in a convex region and draw a line connecting them, the line will lie entirely inside that region.</p>
<p>But on a non-convex figure, you can find points whose connecting line leaves the figure’s boundary:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/TexasLine.png" alt="" /></p>
<p>We want to take this idea beyond two dimensions, though. And for that, we need to generalize the idea of connecting lines. We need the concept of a “mixture”.</p>
<h2 id="pointy-arithmetic">Pointy Arithmetic</h2>
<p>In two dimensions it’s pretty easy to see that if you take some percentage of one point, and a complementary percentage of another point, you get a third point on the line between them.$
\renewcommand{\vec}[1]{\mathbf{#1}}
\newcommand{\p}{\vec{p}}
\newcommand{\q}{\vec{q}}
\newcommand{\r}{\vec{r}}
\newcommand{\v}{\vec{v}}
\newcommand{\R}{\mathbb{R}}
$</p>
<p>For example, if you take $1/ 2$ of $(0,0)$ and add it to $1/ 2$ of $(1,1)$, you get the point halfway between: $(1/ 2,1/ 2)$. That’s pretty intuitive geometrically:
<img src="http://jonathanweisberg.org/img/accuracy/Fig1.png" alt="" />
But we can capture the idea algebraically too:
$$
\begin{align}
1/ 2 \times (0,0) + 1/ 2 \times (1,1)
&= (0,0) + (1/ 2, 1/ 2)\\<br />
&= (1/ 2, 1/ 2).
\end{align}
$$</p>
<p>Likewise, if you add $3/10$ of $(0,0)$ to $7/10$ of $(1, 1)$, you get the point seven-tenths of the way in between, namely $(7/10, 7/10)$:
<img src="http://jonathanweisberg.org/img/accuracy/Fig2.png" alt="" />
In algebraic terms:
$$
\begin{align}
3/10 \times (0,0) + 7/10 \times (1,1)
&= (0,0) + (7/10, 7/10)\\<br />
&= (7/10, 7/10).
\end{align}
$$</p>
<p>Notice that we just introduced two rules for doing arithmetic with points. When multiplying a point $\p = (p_1, p_2)$ by a number $a$, we get:
$$ a \p = (a p_1, a p_2). $$
And when adding two points $\p = (p_1, p_2)$ and $\q = (q_1, q_2)$ together:
$$ \p + \q = (p_1 + q_1, p_2 + q_2). $$
In other words, multiplying a point by a single number works element-wise, and so does adding two points together.</p>
<p>We can generalize these ideas straightforwardly to any number of dimensions $n$. Given points $\p = (p_1, p_2, \ldots, p_n)$ and $\q = (q_1, q_2, \ldots, q_n)$, we can define:
$$ a \p = (a p_1, a p_2, \ldots, a p_n), $$
and
$$ \p + \q = (p_1 + q_1, p_2 + q_2, \ldots, p_n + q_n).$$
We’ll talk more about arithmetic with points next time. For now, these two definitions will do.</p>
<h2 id="mixtures">Mixtures</h2>
<p>Now back to connecting lines between points. The idea is that the straight line between $\p$ and $\q$ is the set of points we get by “mixing” some portion of $\p$ with some portion of $\q$.</p>
<p>We take some number $\lambda$ between $0$ and $1$, we multiply $\p$ by $\lambda$ and $\q$ by $1 - \lambda$, and we sum the results: $\lambda \p + (1-\lambda) \q$. The set of points you can obtain this way is the straight line between $\p$ and $\q$.</p>
<p>In fact, you can mix any number of points together. Given $m$ points $\q_1, \ldots, \q_m$, we can define their <em>mixture</em> as follows. Let $\lambda_1, \ldots, \lambda_m$ be nonnegative real numbers that sum to one. That is:</p>
<ul>
<li>$\lambda_i \geq 0$ for all $i$, and</li>
<li>$\lambda_1 + \lambda_2 + \ldots + \lambda_m = 1$.</li>
</ul>
<p>Then we multiply each $\q_i$ by the corresponding $\lambda_i$ and sum up:
$$ \p = \lambda_1 \q_1 + \ldots + \lambda_m \q_m. $$
The resulting point $\p$ is a <em>mixture</em> of the $\q_i$’s.</p>
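<p>The recipe translates almost word for word into code. Here is a minimal Python sketch. The function name <code>mix</code> is made up, and exact fractions are used so the weights sum to exactly one.</p>

```python
from fractions import Fraction as F

def mix(points, weights):
    """Mixture of points: weights must be nonnegative and sum to one."""
    if any(w < 0 for w in weights) or sum(weights) != 1:
        raise ValueError("weights must be nonnegative and sum to one")
    dimension = len(points[0])
    # Multiply each point by its weight, then add up coordinate by coordinate.
    return tuple(sum(w * p[i] for w, p in zip(weights, points)) for i in range(dimension))

# The two-point examples from earlier: mixing (0,0) and (1,1).
halfway = mix([(0, 0), (1, 1)], [F(1, 2), F(1, 2)])
seven_tenths = mix([(0, 0), (1, 1)], [F(3, 10), F(7, 10)])

# A three-point mixture, with arbitrarily chosen points and weights.
three_way = mix([(0, 0), (1, 0), (0, 1)], [F(1, 2), F(1, 4), F(1, 4)])

print(halfway, seven_tenths, three_way)
```

<p>To leave a point out of a mixture, give it a weight of zero.</p>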
<p>Now we can define the general notion of a <em>convex set</em> of points. A convex set is one that contains every mixture of its own points. (A convex set is “closed under mixing”, you might say.)</p>
<h1 id="convex-hulls">Convex Hulls</h1>
<p>It turns out that the set of possible probability assignments is convex.</p>
<p>More than that, it’s the convex set generated by the possible truth-value assignments, in a certain way. It’s the “convex hull” of the possible truth-value assignments.</p>
<p>What in the world is a “convex hull”?</p>
<p>Imagine some points in the plane—the corners of a square, for example. Now imagine stretching a rubber band around those points and letting it snap tight. The shape you get is the square with those points as corners. And the set of points enclosed by the rubber band is a convex set. Take any two points inside the square, or on its boundary, and draw the straight line between them. The line will not leave the square.</p>
<p>Intuitively, the convex hull of a set of points in the plane is the set enclosed by the rubber band exercise. Formally, the convex hull of a set of points is the set of points that can be obtained from them as a mixture. (And this definition works in any number of dimensions.)</p>
<p>For example, any of the points in our square example can be obtained by taking a mixture of the vertices. Take the center of the square: it’s halfway between the bottom left and top right corners. To get something to the left of that we can mix in some of the top left corner (and correspondingly less of the top right). And so on.</p>
<p>Now imagine the rubber band exercise using the possible truth-value assignments, instead of the corners of a square. In two dimensions, those are the points $(0,1)$ and $(1,0)$. And when you let the band snap tight, you get the diagonal line connecting them. As we saw way back in <a href="http://jonathanweisberg.org/post/Accuracy for Dummies - Part 1/">our first post</a>, the points on that diagonal line are the possible probability assignments.</p>
<h1 id="peeking-ahead">Peeking Ahead</h1>
<p>We also saw that if you take any point <em>not</em> on that diagonal, the closest point on the diagonal forms a right angle. That’s what lets us do some basic geometric reasoning to see that there’s a point on the line that’s closer to both vertices than the point off the line:</p>
<p><img src="http://jonathanweisberg.org/img/accuracy/2D Dominance Diagram - 400px.png" alt="" /></p>
<p>That fact about closest points and right angles is what’s going to enable us to generalize the argument beyond two dimensions. If you take any point not on a convex hull, there’s a point on the convex hull (namely the closest point) which forms a right (or obtuse) angle with the other points on the hull.</p>
<p>Consider the three dimensional case. The possible truth-value assignments are $(1,0,0)$, $(0,1,0)$, and $(0,0,1)$:
<img src="http://jonathanweisberg.org/img/accuracy/Three Vertices.png" alt="" />
And when you let a rubber band snap tight around them, it encloses the triangular surface connecting them:
<img src="http://jonathanweisberg.org/img/accuracy/Three Vertices with Hull.png" alt="" />
That’s the set of probability assignments for three outcomes.</p>
<p>Now take any point that’s not on that triangular surface. Drop a straight line to the closest point on the surface. Then draw another straight line from there to one of the triangle’s vertices. These two straight lines will form a right or obtuse angle. So the distance from the first, off-hull point to the vertex is greater than the distance from the second, on-hull point to the vertex.</p>
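<p>Here is a small Python sketch of that three-dimensional reasoning for one arbitrarily chosen point. It takes a shortcut that only works for some points: it drops a perpendicular onto the plane where the coordinates sum to one, and relies on the result having no negative coordinates. Distances are measured in Brier distance (squared Euclidean distance).</p>

```python
from fractions import Fraction as F

def brier(x, y):
    """Brier distance: the sum of squared coordinate differences."""
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y))

vertices = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

# An arbitrary point off the triangular surface (its coordinates sum to 13/10, not 1).
x = (F(6, 10), F(5, 10), F(2, 10))

# Drop a perpendicular to the plane where coordinates sum to one:
# subtract the same amount from every coordinate.
excess = (sum(x) - 1) / len(x)
y = tuple(xi - excess for xi in x)

# For this x the result has no negative coordinates, so it lies on the triangle
# and is the closest point of the hull. (Other choices of x could land outside
# the triangle and would need a different method.)
assert all(yi >= 0 for yi in y) and sum(y) == 1

for v in vertices:
    print(v, brier(x, v), brier(y, v))  # the off-hull point is farther every time
```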
<p>Essentially the same reasoning works in any number of dimensions. But to make it work, we need to do three things.</p>
<ol>
<li>Prove that the probability assignments always form a convex hull around the possible truth-value assignments.</li>
<li>Prove that, given any point outside a convex hull, the closest point on the hull forms a right angle (or an obtuse angle) with it and any other point on the hull.</li>
<li>Prove that the point off the hull is farther from all the vertices than the closest point on the hull.</li>
</ol>
<p>This post is dedicated to the first item.</p>
<h1 id="the-convexity-theorem">The Convexity Theorem</h1>
<p>We’re going to prove that the set of possible probability assignments is the same as the convex hull of the possible truth-value assignments. First let’s get some notation in place.</p>
<h2 id="notation">Notation</h2>
<p>As usual $n$ is the number of possible outcomes under consideration. So each possible truth-value assignment is a point of $n$ coordinates, with a single $1$ and $0$ everywhere else. For example, if $n = 4$ then $(0, 0, 1, 0)$ represents the case where the third possibility obtains.</p>
<p>We’ll write $V$ for the set of all possible truth-value assignments. And we’ll write $\v_1, \ldots, \v_n$ for the elements of $V$. The first element $\v_1$ has its $1$ in the first coordinate, $\v_2$ has its $1$ in the second coordinate, etc.</p>
<p>We’ll use a superscript $^+$ for the convex hull of a set. So $V^+$ is the convex hull of $V$. It’s the set of all points that can be obtained by mixing members of $V$.</p>
<p>Recall, a mixture is a point obtained by taking nonnegative real numbers $\lambda_1, \ldots, \lambda_n$ that sum to one, and multiplying each one against the corresponding $\v_i$ and then summing up:
$$ \lambda_1 \v_1 + \lambda_2 \v_2 + \ldots + \lambda_n \v_n. $$
So $V^+$ is the set of all points that can be obtained by this method. Each choice of values $\lambda_1, \ldots, \lambda_n$ generates a member of $V^+$. (To exclude one of the $\v_i$’s from a mixture, just set $\lambda_i = 0$.)</p>
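<p>The mixing operation can be sketched in a few lines of code. The function name <code>mix</code> and the tolerance used to check that the weights sum to one are assumptions made for illustration.</p>

```python
def mix(weights, points, tol=1e-9):
    """Return the mixture sum_i weights[i] * points[i].

    `weights` must be nonnegative and sum to one (up to `tol`).
    """
    if len(weights) != len(points):
        raise ValueError("need exactly one weight per point")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be nonnegative")
    if abs(sum(weights) - 1.0) > tol:
        raise ValueError("weights must sum to one")
    dim = len(points[0])
    # Scale each point by its weight, then add up coordinate by coordinate.
    return tuple(sum(w * p[k] for w, p in zip(weights, points))
                 for k in range(dim))

# The truth-value assignments for n = 3.
V = [(1, 0, 0), (0, 1, 0), (0, 0, 1)]

print(mix((0.2, 0.5, 0.3), V))   # a member of the convex hull of V
print(mix((0.5, 0.5, 0.0), V))   # third vertex excluded by a zero weight
```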
<p>Finally, we’ll use $P$ for the set of all probability assignments. Recall: a probability assignment is a point of $n$ coordinates, where each coordinate is nonnegative, and all the coordinates together add up to one. That is, $\p = (p_1,\ldots,p_n)$ is a probability assignment just in case:</p>
<ul>
<li>$p_i \geq 0$ for all $i$, and</li>
<li>$p_1 + p_2 + \ldots + p_n = 1$.</li>
</ul>
<p>The set $P$ contains just those points $\p$ satisfying these two conditions.</p>
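<p>As a quick sketch, the two conditions translate directly into a membership test for $P$. The function name and the floating-point tolerance are assumptions for illustration.</p>

```python
def is_probability_assignment(point, tol=1e-9):
    """True when every coordinate is nonnegative and they sum to one."""
    nonnegative = all(x >= 0 for x in point)
    sums_to_one = abs(sum(point) - 1.0) <= tol
    return nonnegative and sums_to_one

print(is_probability_assignment((0, 0, 1, 0)))             # a truth-value assignment
print(is_probability_assignment((0.25, 0.25, 0.25, 0.25)))  # the uniform assignment
print(is_probability_assignment((0.5, 0.6)))                # coordinates sum to 1.1
print(is_probability_assignment((-0.1, 1.1)))               # has a negative coordinate
```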
<h2 id="statement-and-proof">Statement and Proof</h2>
<p>In the notation just established, what we’re trying to show is that $V^+ = P$.</p>
<p><strong>Theorem.</strong> $V^+ = P$. That is, the convex hull of the possible truth-value assignments just is the set of possible probability assignments.</p>
<p><em>Proof.</em> Let’s first show that $V^+ \subseteq P$.</p>
<p>Notice that a truth-value assignment is also a probability assignment. Its coordinates are always $1$ or $0$, so all coordinates are nonnegative. And since it has only a single coordinate with value $1$, its coordinates add up to $1$.</p>
<p>But we have to show that any mixture of truth-value assignments is also a probability assignment. So let $\lambda_1, \ldots, \lambda_n$ be nonnegative numbers that sum to $1$. If we multiply $\lambda_i$ against a truth-value assignment $\v_i$, we get a point with $0$ in every coordinate except the $i$-th coordinate, which has value $\lambda_i$. For example, $\lambda_3 \times (0, 0, 1, 0) = (0, 0, \lambda_3, 0)$. So the mixture that results from $\lambda_1, \ldots, \lambda_n$ is:
$$
\lambda_1 \v_1 + \lambda_2 \v_2 + \ldots + \lambda_n \v_n = (\lambda_1, \lambda_2, \ldots, \lambda_n).
$$
And this mixture has coordinates that are all nonnegative and sum to $1$, by hypothesis. In other words, it is a probability assignment.</p>
<p>So we turn to showing that $P \subseteq V^+$. In other words, we want to show that every probability assignment can be obtained as a mixture of the $\v_i$’s.</p>
<p>So take an arbitrary probability assignment $\p \in P$, where $\p = (p_1, \ldots, p_n)$. Let the $\lambda_i$’s be the probabilities that $\p$ assigns to each $i$: $\lambda_1 = p_1$, $\lambda_2 = p_2$, and so on. Then, by the same logic as in the first part of the proof:
$$ \lambda_1 \v_1 + \ldots + \lambda_n \v_n = (p_1, \ldots, p_n). $$
In other words, $\p$ is a mixture of the possible truth-value assignments, where the weights in the mixture are just the probability values assigned by $\p$. <span style="float: right;">$\Box$</span></p>
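<p>Both directions of the proof can be spot-checked numerically. The sketch below is an illustration under stated assumptions, not part of the proof: all function names are hypothetical, $n = 4$ is arbitrary, and the weights are drawn at random with a fixed seed.</p>

```python
import random

def truth_value_assignments(n):
    """The n points with a single 1 and 0 everywhere else."""
    return [tuple(1 if k == i else 0 for k in range(n)) for i in range(n)]

def mix(weights, points):
    """Mixture of `points`, coordinate by coordinate."""
    dim = len(points[0])
    return tuple(sum(w * p[k] for w, p in zip(weights, points))
                 for k in range(dim))

def is_probability_assignment(point, tol=1e-9):
    return all(x >= 0 for x in point) and abs(sum(point) - 1.0) <= tol

def random_weights(n, rng):
    """Nonnegative numbers normalized to sum to one."""
    raw = [rng.random() for _ in range(n)]
    total = sum(raw)
    return tuple(x / total for x in raw)

rng = random.Random(0)
n = 4
V = truth_value_assignments(n)

# First direction: a mixture of the vertices has the weights as its
# coordinates, so it is a probability assignment.
lambdas = random_weights(n, rng)
mixture = mix(lambdas, V)
print(mixture == lambdas, is_probability_assignment(mixture))

# Second direction: a probability assignment p is recovered as the
# mixture whose weights are p's own coordinates.
p = random_weights(n, rng)
print(mix(p, V) == p)
```

<p>The equality checks can be exact here because each coordinate of the mixture is a single weight plus zeros.</p>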
<h1 id="up-next">Up Next</h1>
<p>We’ve established the first of the three items listed earlier. Next time we’ll establish the second: given a point outside a convex set, there’s always a point inside that forms a right or obtuse angle with any other point of the set. Then we’ll be just a few lines of algebra from the main result: the Brier dominance theorem!</p>
Journals as Ratings Agencies
http://jonathanweisberg.org/post/Journals%20as%20Ratings%20Agencies/
Thu, 30 Mar 2017 15:27:04 -0500http://jonathanweisberg.org/post/Journals%20as%20Ratings%20Agencies/
<p>Starting in July, philosophy’s two most prestigious journals won’t reject submitted papers anymore. Instead they’ll “grade” every submission, assigning a rating on the familiar letter-grade scale (A+, A, A-, B+, B, B-, etc.).</p>
<p>They will, in effect, become ratings agencies.</p>
<p>They’ll still publish papers. Those rated A- or higher can be published in the journal, if the authors want. Or they can seek another venue, if they think they can do better.</p>
<p>I just made that up. But imagine if it were true—especially if a bunch of journals did this. How would it change philosophy’s publication game?</p>
<p>Well we’d save a lot of wasted labour, for one thing. And we’d discourage frivolous submissions, for another.</p>
<h1 id="the-bad">The Bad</h1>
<p>Under the current arrangement, the system is sagging low under the weight of premature, mediocre, even low-quality submissions. (I’d say it’s even creaking and cracking.) Editors scrounge miserably for referees, and referees frantically churn out reports and recommendations, mostly for naught.</p>
<p>In a typical case, the editor rejects the submission and the referees’ reports are filed away in a database, never to be read again. Maybe the author makes substantial revisions, but very likely they don’t—especially if the paper’s main idea is the real limiting factor. The process repeats at another journal, often at several more journals. And in the end all the philosophical public sees is: accepted at <em>International Journal of Such & Such Studies</em>.</p>
<p>Of all the people who’ve read and assessed the paper by that point, only two have their assessments directly broadcast to the public. And even then, only the “two thumbs more-or-less up” part of the signal gets out.</p>
<p>Yet five, eight, or even ten people have weighed in on the paper by then. They’ve thought about its strengths and weaknesses, and they’ve generated valuable insights and assessments that could save others time and trouble. Yet only the handling editors and the authors get the direct benefit of that labour.</p>
<p>The current system even encourages authors to waste editors’ and referees’ time. Unless they’re in a rush, authors can start at the top of the journal-prestige hierarchy and work their way down. You don’t even have to perfect your paper before starting this incredibly inefficient process. With so many journals to try, you’ll basically get unlimited kicks at the can. So you might as well let the referees do your homework for you.</p>
<p>(This doesn’t apply to all authors, obviously. Some work in areas that severely limit their can-kicking. And many <em>are</em> in a rush, to get jobs and tenure.)</p>
<h1 id="the-good">The Good</h1>
<p>But, if a paper were publicly assigned a grade every place it was submitted, authors might be more realistic in deciding where to submit. They might also wait until their paper is truly ready for public consumption before imposing on editors and referees.</p>
<p>Readers would also benefit from seeing a paper’s transcript. Not only could it inform their decision about whether to read the paper, it could aid their sense of how its contribution is received by peers and experts.</p>
<p>Referees would also have better incentives, to take on referee work and to be more diligent about it. They would know that their labour would have a greater impact, and that their assessment would have a more lasting effect.</p>
<p>Editors could even limit submissions based on their grade-history, e.g. “no submissions already graded by two other journals”, or “no submissions with an average grade less than a B”. (Ideally, different journals would have different policies here, to allow some variety.)</p>
<h1 id="the-ugly">The Ugly</h1>
<p>Of course, several high-profile journals would have to take the lead to make this kind of thing happen. And there would have to be strong norms within the discipline about publicizing grades: requiring they be listed alongside the paper on CVs and websites, for example.</p>
<p>And there would be costs.</p>
<p>Everybody has their favourite story about the groundbreaking paper that got rejected five times, but was finally published in <em>The Posh Journal of Philosophy Review</em>, and has since been cited a gajillion times. Such papers could be weighed down by having their grade-transcripts publicized. (On the plus side, we could have a new genre of great paper: the cult classic!)</p>
<p>Also, some authors have to rely on referee feedback more than others, because of their limited philosophical networks. Their papers would likely end up with longer, more checkered grade-transcripts, exacerbating an existing injustice.</p>
<p>And, in the end, the present proposal might only be a band-aid. If there really is an oversubmission problem in academic philosophy (as I suspect there is), it’s probably caused by increased pressure to publish—because jobs are scarce, and administrators demand it, for example. Turning journals into ratings agencies wouldn’t relieve that pressure, even if it would help to manage some of its bad effects.</p>
<h1 id="decision-r-r">Decision: R&R</h1>
<p>In the end, I’m undecided about this proposal. I think it has some very attractive features, but the costs give me pause (much as they do with the alternatives I’m aware of, like <a href="http://davidfaraci.com/populus" target="_blank">Populus</a>). I’m only certain that we can’t keep going as we have been; it won’t end well.</p>