How to score 50% predictions
Mar. 19th, 2018 09:39 am
I've recently been experimenting with removing stress from my daily[1] todo list by listing things I hoped to do but putting a likelihood on them, like "90% foo, bar, blah; 75% other thing" etc. I know this seems overcomplicated, but I find failing things I'd planned to do REALLY REALLY kills my motivation, so it's worth arranging things such that even if they go better or less well than expected, they fall into the broad range of "what I planned for". And it also means that I'm more pushed to put small, comparatively important things first, rather than starting with the difficult things and never getting to anything else.
I don't know if I will keep it up, but even just trying it raised several interesting questions.
Slatestarcodex sometimes posts predictions like this, usually for an upcoming year, to test where he's being honest with what he expects and where he isn't (usually about external factual things like politics, but some of himself). A question arises, how to score this? Especially the 50% ones.
You can cobble together some score which is maximised when 90% of the 90% predictions are true. I think there's some particular Bayesian probability thing that measures, given those expectations, how unlikely a particular outcome is (which equates to 'how wrong you are', which you try to minimise).
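As a rough sketch of what such a score could look like: two standard candidates are the log score (the surprisal of what actually happened, given the probability you stated) and the Brier score (squared error against a 0/1 outcome). Which one the post has in mind isn't stated, so treat both as assumptions, and the sample to-do list as made up.

```python
import math

def log_score(prob, happened):
    """Surprisal, in bits, of the actual outcome given the stated probability.

    Lower is better. A stated probability of exactly 0 or 1 that turns out
    wrong would be infinitely surprising (math.log2 raises ValueError).
    """
    p_outcome = prob if happened else 1.0 - prob
    return -math.log2(p_outcome)

def brier_score(prob, happened):
    """Squared error between the stated probability and the 0/1 outcome."""
    return (prob - (1.0 if happened else 0.0)) ** 2

# Hypothetical to-do list: (stated probability, did it get done?)
predictions = [(0.9, True), (0.9, True), (0.9, False), (0.75, True), (0.75, False)]

total_log = sum(log_score(p, done) for p, done in predictions)
mean_brier = sum(brier_score(p, done) for p, done in predictions) / len(predictions)
print(f"total surprisal: {total_log:.2f} bits, mean Brier: {mean_brier:.3f}")
```

Both are penalties, so "how wrong you are" is the thing being minimised; a missed 90% item costs far more than a missed 75% one.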
But the bit that gets confusing is, how to rate 50% predictions? For my system, I'm predicting what *will* get done, so I feel like it's clear if I'm over-estimating or under-estimating. But Scott had the problem, that it seemed arbitrary if he said "X will do Y" or "X will do not-Y", so the 50% predictions should be random even if they're comically bad, which logically makes them impossible to score. And yet, you feel that if they're really bad, you should be able to recognise that in a systematic way. Maybe they should all be stated relative to the status quo? Or something else?
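One way to see the puzzle concretely, assuming a squared-error (Brier) penalty rather than whatever Scott actually uses: a 50% prediction costs exactly the same whether it comes true or not, and restating "X at p" as "not-X at 1 - p" never changes the penalty. A raw count of hits, on the other hand, flips with the phrasing. The outcomes below are invented for illustration.

```python
def brier_score(prob, happened):
    """Squared error between the stated probability and the 0/1 outcome."""
    return (prob - (1.0 if happened else 0.0)) ** 2

def flip(prob, happened):
    """Restate 'X with probability p' as 'not-X with probability 1 - p'."""
    return 1.0 - prob, not happened

# A 50% prediction costs the same whether or not it comes true...
assert brier_score(0.5, True) == brier_score(0.5, False) == 0.25

# ...and restating any prediction the other way round leaves its penalty alone.
for prob, happened in [(0.5, True), (0.75, False), (0.9, True)]:
    assert abs(brier_score(prob, happened) - brier_score(*flip(prob, happened))) < 1e-12

# A raw hit count, by contrast, depends entirely on the arbitrary phrasing.
fifty_percenters = [True, False, False, False]  # hypothetical outcomes, as stated
hits_as_stated = sum(fifty_percenters)
hits_if_flipped = sum(not h for h in fifty_percenters)
print(hits_as_stated, hits_if_flipped)
```

So under this kind of rule an individual 50% prediction carries no information about how good it was, which matches the feeling that they're impossible to score one at a time.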
[1] Freudiano 'faily' :)
no subject
Date: 2018-03-19 10:31 am (UTC)
Note that with these things, there are two concerns: calibration and whats-the-other-one-called. With calibration, it's very much about getting 90% of your 90% predictions to be true in the first place. With the other one, it's about getting to where you can make lots of 90% predictions while staying well-calibrated. Calibration is about having the confidence you deserve; the other one is about deserving high confidence. These functions munge those two concerns together, so you can underscore by being mis-calibrated or by being just plain wrong.
Edit to add: these scores are only meaningful as part of some comparison. Sometimes it's an implicit comparison: e.g. when comparing the predictions generated by some machine learning algorithm with parameter set P against the predictions made with slightly tweaked parameters P', you can do that implicitly using gradient descent or whatever.
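A sketch of how the two concerns can be pulled apart again, under a loud assumption: the comment doesn't name the second quantity, and I'm guessing it is what forecasters usually call "resolution". With that guess, the mean Brier score splits into reliability (miscalibration) minus resolution plus an uncertainty term that depends only on the base rate. The function name and the sample forecasts are made up.

```python
from collections import defaultdict

def murphy_decomposition(predictions):
    """Split the mean Brier score into (reliability, resolution, uncertainty).

    predictions: list of (stated probability, outcome as bool).
    Mean Brier score = reliability - resolution + uncertainty, where
    predictions are grouped by their exact stated probability.
    """
    n = len(predictions)
    base_rate = sum(happened for _, happened in predictions) / n
    groups = defaultdict(list)
    for prob, happened in predictions:
        groups[prob].append(happened)

    # Reliability: how far each stated probability is from its observed hit rate.
    reliability = sum(
        len(g) * (prob - sum(g) / len(g)) ** 2 for prob, g in groups.items()
    ) / n
    # Resolution: how far each group's hit rate is from the overall base rate.
    resolution = sum(
        len(g) * (sum(g) / len(g) - base_rate) ** 2 for g in groups.values()
    ) / n
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability, resolution, uncertainty

# Two hypothetical forecasters looking at the same 20 events (14 happen).
sharp = [(0.9, True)] * 9 + [(0.9, False)] + [(0.5, True)] * 5 + [(0.5, False)] * 5
vague = [(0.7, True)] * 14 + [(0.7, False)] * 6

for name, preds in (("sharp", sharp), ("vague", vague)):
    rel, res, unc = murphy_decomposition(preds)
    print(f"{name}: reliability={rel:.3f} resolution={res:.3f} uncertainty={unc:.3f}")
```

Both forecasters here are perfectly calibrated, but only the first has earned any high-confidence predictions, and that shows up as a better (lower) overall score.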
no subject
Date: 2018-03-19 11:34 am (UTC)
I'm pretty sure somebody argued in the follow-up comments (without quite the rigour of a Proper Proof but still pretty convincingly as I recall) that there was a fundamental difficulty along the lines of, if you're trying to get your accumulated score over a great many guesses as close as possible to the centre point of 'neither systematically under- nor overestimating', there's no way a scoring system can dis-incentivise the system-gaming technique of tracking your current running total and deliberately erring on the side of whatever will move it closer to the middle.
Unfortunately there seems to have been a rare archiving cockup on Mono in that particular subdirectory, so I can't go back and dig out the argument to see if it had any holes in it, or whether it made an assumption about the type of scoring system that need not hold.
(Perhaps one could define a score with no centre point, i.e. the system awards penalty points with the same sign regardless of which way you err and doesn't track under- vs over-estimates anyway; or perhaps one could randomise the sense of each prediction, and treat some 75% predictions of X as 25% predictions of not-X, or some such. But there are probably still secondary system-gamings possible, such as skewing which kinds of event you even try to predict, going for mostly almost-sure things or mostly 50%ish things...)
Anyway, it certainly does seem to me that a necessary robustness property for any scoring system of this type is that it should reward you for not cheating, i.e. for making a good-faith effort to estimate the probability of each predicted event as accurately as you can, independently of what other events might have already come up and what your current score might be. And whether or not the not-quite-proof I mention was sound, it seems clear that such a robustness property is at the very least difficult to achieve...
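For what it's worth, here is a small numerical check of that robustness property for one particular score with no centre point, the squared-error (Brier) penalty. This is my choice of example, not the scoring system from the half-remembered argument. If an event really has probability `true_prob`, the expected penalty is lowest when you report exactly that, whatever your running total looks like. The helper names are hypothetical.

```python
def expected_brier(true_prob, reported):
    """Average Brier penalty for `reported` if the event happens with true_prob."""
    return true_prob * (reported - 1.0) ** 2 + (1.0 - true_prob) * reported ** 2

def best_report(true_prob, steps=100):
    """Grid-search the reported probability with the lowest expected penalty."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda q: expected_brier(true_prob, q))

for true_prob in (0.2, 0.5, 0.75, 0.9):
    print(true_prob, best_report(true_prob))
```

This only covers shading individual estimates. It does nothing about the secondary gaming of choosing which events to predict: sticking to almost-sure things still lowers the average penalty.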
no subject
Date: 2018-03-20 11:36 am (UTC)
Imagine Alice plans to go swimming, at random, but averaging one day in five. Eve can be secretly taking notes on Alice and making predictions, and predicting 20% each day will do fine: the proper scoring rules I've discussed create no perverse incentives for Eve. However, if Alice herself starts predicting she'll go swimming with 20% probability each day, then each day there's a perverse incentive to not go swimming, because it makes the predictions better. To a certain extent you can get around this by batching things: e.g. "61% chance of 6 or more swims in a 31-day period" or maybe even "binomial distribution, n=31, p=0.2". However, aspirational stretch goals that can't be made to go above 50% with any amount of aggregating may be an unavoidable perverse incentive trap.
Possibly you want to be specifically avoiding pushing yourself too hard, in which case the aspirational stretch goals problem isn't a problem.
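The batched figure for the swimming example can be checked directly from the binomial distribution with n=31 and p=0.2. A minimal sketch (the function name is mine):

```python
from math import comb

def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Chance of 6 or more swims in a 31-day period at 20% per day.
print(f"{prob_at_least(6, 31, 0.2):.0%}")
```

This comes out at roughly the 61% quoted, so the batched prediction sits above 50% even though each individual day is only 20%.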
no subject
Date: 2018-03-20 01:26 pm (UTC)
If you coarse-grain your probabilities (e.g. divide things into Must, Should, Could, Won't buckets and put acceptable ranges on them), then when you're reviewing you can see how many Musts are hitting, how many Shoulds, etc. and adjust as appropriate. There's complicated maths to deal with one-off probabilities, but if you put them in buckets it becomes a lot easier to deal with.
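A minimal sketch of what that review could look like. The acceptable ranges per bucket are invented numbers, and the verdict wording is mine; adjust both to taste.

```python
# Hypothetical acceptable hit-rate range for each bucket.
ACCEPTABLE = {
    "must": (0.90, 1.00),
    "should": (0.60, 0.90),
    "could": (0.20, 0.60),
    "wont": (0.00, 0.20),
}

def review(tasks):
    """tasks: list of (bucket, done). Returns bucket -> (hit rate, verdict)."""
    report = {}
    for bucket, (low, high) in ACCEPTABLE.items():
        outcomes = [done for b, done in tasks if b == bucket]
        if not outcomes:
            continue  # nothing in this bucket during the review period
        rate = sum(outcomes) / len(outcomes)
        if rate < low:
            verdict = "over-promising"
        elif rate > high:
            verdict = "under-promising"
        else:
            verdict = "fine"
        report[bucket] = (rate, verdict)
    return report

# Made-up week of tasks.
week = [("must", True), ("must", False), ("could", True), ("could", False), ("could", False)]
for bucket, (rate, verdict) in review(week).items():
    print(f"{bucket}: {rate:.0%} done, {verdict}")
```

With only a handful of tasks per bucket the hit rates are noisy, so it probably makes sense to review over a longer stretch before shifting things between buckets.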