[personal profile] jack
I've recently been experimenting with removing stress from my daily[1] todo list by listing things I hoped to do but putting a likelihood on them, like "90% foo, bar, blah; 75% other thing" etc. I know this seems overcomplicated, but I find failing at things I'd planned to do REALLY REALLY kills my motivation, so it's worth arranging things such that even if they go better or worse than expected, they still fall into the broad range of "what I planned for". And it also means that I'm more pushed to put small, comparatively important things first, rather than starting with the difficult things and never getting to anything else.

I don't know if I will keep it up, but even just trying it raised several interesting questions.

Slatestarcodex sometimes posts predictions like this, usually for an upcoming year, to test where he's being honest with himself about what he expects and where he isn't (usually about external factual things like politics, but some about himself). The question arises: how do you score this? Especially the 50% ones.

You can cobble together some score which is maximised when 90% of the 90% predictions come true. I think there's some particular Bayesian probability thing that measures, given those stated expectations, how unlikely a particular outcome is (which equates to "how wrong you are", which you try to minimise).
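For concreteness, here's a rough sketch of the kind of score I mean - the log score, which just adds up how surprised you should be by each outcome given the probability you stated. The prediction list is made up:

    import math

    # made-up (stated probability, did it happen) pairs
    predictions = [
        (0.90, True),
        (0.90, True),
        (0.90, False),   # a confident miss costs a lot
        (0.75, True),
        (0.50, False),   # a 50% call costs the same whichever way it goes
    ]

    def surprisal(p, happened):
        """Bits of surprise for one prediction, given the stated probability."""
        return -math.log2(p if happened else 1 - p)

    total = sum(surprisal(p, happened) for p, happened in predictions)
    print(f"total surprisal: {total:.2f} bits (lower = less wrong)")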

But the bit that gets confusing is: how do you rate 50% predictions? For my system, I'm predicting what *will* get done, so it's clear whether I'm over-estimating or under-estimating. But Scott had the problem that it seemed arbitrary whether he wrote "X will do Y" or "X will not do Y", so his 50% predictions should come out as coin flips even if they're comically bad, which logically makes them impossible to score. And yet you feel that if they're really bad, you should be able to recognise that in some systematic way. Maybe they should all be stated relative to the status quo? Or something else?

[1] Freudian 'faily' :)

Date: 2018-03-19 10:31 am (UTC)
From: [personal profile] ptc24
There are various functions that come under the heading of Proper Scoring Rules. A while back I derived the logarithmic scoring rule; I was surprised later to find there was another one (quadratic), and only much later did I learn of the spherical one.
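For a single yes/no prediction where you assign probability p to the thing happening, the three rules look roughly like this - a sketch in "reward" form, where higher is better and honest reporting maximises your expected score:

    import math

    def log_score(p, happened):
        q = p if happened else 1 - p
        return math.log(q)

    def quadratic_score(p, happened):          # a.k.a. the Brier score, in reward form
        q = p if happened else 1 - p
        return 2 * q - (p ** 2 + (1 - p) ** 2)

    def spherical_score(p, happened):
        q = p if happened else 1 - p
        return q / math.sqrt(p ** 2 + (1 - p) ** 2)

    for p in (0.5, 0.75, 0.9):
        print(f"p={p}: log={log_score(p, True):.3f} "
              f"quad={quadratic_score(p, True):.3f} "
              f"sph={spherical_score(p, True):.3f}")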

Note that with these things there are two concerns - calibration and whats-the-other-one-called. With "calibration" it's very much about getting 90% of your 90% predictions to be true in the first place. With the other one it's about getting to the point where you can make lots of 90% predictions while staying well-calibrated. Calibration is about having the confidence you deserve; the other one is about deserving high confidence. These functions munge those two concerns together, so you can score badly by being mis-calibrated or by being just plain wrong.
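The calibration half is easy to check directly - group predictions by stated probability and compare each group's hit rate with what was claimed (made-up data below); the other concern is what needs a proper scoring rule rather than a table like this:

    from collections import defaultdict

    # made-up (stated probability, did it happen) pairs
    predictions = [(0.9, True), (0.9, True), (0.9, False), (0.75, True),
                   (0.75, False), (0.5, True), (0.5, False)]

    buckets = defaultdict(list)
    for p, happened in predictions:
        buckets[p].append(happened)

    for p in sorted(buckets):
        outcomes = buckets[p]
        rate = sum(outcomes) / len(outcomes)
        print(f"claimed {p:.0%}: {rate:.0%} of {len(outcomes)} came true")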

Edit to add: these scores are only meaningful as part of some comparison. Sometimes it's an implicit comparison - e.g. comparing the predictions generated by some machine learning algorithm with parameter set P against the predictions made with slightly tweaked parameters P'; you can do that implicitly using gradient descent or whatever.
Edited Date: 2018-03-19 10:47 am (UTC)

Date: 2018-03-19 11:34 am (UTC)
From: [personal profile] simont
I remember having had this thought myself, and discussing it on Monochrome a few years ago.

I'm pretty sure somebody argued in the follow-up comments (without quite the rigour of a Proper Proof, but still pretty convincingly as I recall) that there was a fundamental difficulty along these lines: if you're trying to get your accumulated score over a great many guesses as close as possible to the centre point of 'neither systematically under- nor over-estimating', then no scoring system can dis-incentivise the system-gaming technique of tracking your current running total and deliberately erring on the side of whatever will move it closer to the middle.

Unfortunately there seems to have been a rare archiving cockup on Mono in that particular subdirectory, so I can't go back and dig out the argument to see if it had any holes in it, or whether it made an assumption about the type of scoring system that need not hold.

(Perhaps one could define a score with no centre point, i.e. the system awards penalty points with the same sign regardless of which way you err and doesn't track under- vs over-estimates anyway; or perhaps one could randomise the sense of each prediction, and treat some 75% predictions of X as 25% predictions of not-X, or some such. But there are probably still secondary system-gamings possible, such as skewing which kinds of event you even try to predict, going for mostly almost-sure things or mostly 50%ish things...)

Anyway, it certainly does seem to me that a necessary robustness property for any scoring system of this type is that it should reward you for not cheating, i.e. for making a good-faith effort to estimate the probability of each predicted event as accurately as you can, independently of what other events might have already come up and what your current score might be. And whether or not the not-quite-proof I mention was sound, it seems clear that such a robustness property is at the very least difficult to achieve...
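To make the gaming move concrete, here's a rough simulation under invented assumptions: every event secretly has an 80% chance of going one way, the predictor labels everything "50%", and they choose which side to predict purely by watching their running hit rate.

    import random

    random.seed(0)
    N = 2000
    hits = []

    for _ in range(N):
        likely_side_happens = random.random() < 0.8   # true probability, known to the gamer

        rate = sum(hits) / len(hits) if hits else 0.0
        predict_likely_side = rate < 0.5              # steer the running total toward 50%
        hit = likely_side_happens if predict_likely_side else not likely_side_happens
        hits.append(hit)

    print(f"hit rate on the '50%' predictions: {sum(hits) / N:.3f}")
    # Comes out almost exactly 0.5 - "perfectly calibrated" - even though every
    # prediction was stated in bad faith about an event the gamer knew was 80%
    # likely. A pure calibration check can't see this; a proper scoring rule,
    # compared against an honest forecast, would.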

Date: 2018-03-19 04:24 pm (UTC)
From: [personal profile] seekingferret
In your situation, should you be making 50/50 predictions? I feel like if you're scheduling a lot of 50/50 tasks for yourself, you're not doing a good job of planning out your day.

Date: 2018-03-20 11:36 am (UTC)
From: [personal profile] ptc24
More thoughts: if you're trying to express a policy by means of a probability distribution, I think it gets harder. At least some of the mind-wrenching Friston stuff on SSC is about probability-as-policy.

Imagine Alice plans to go swimming, at random, but averaging one day in five. Eve can be secretly taking notes on Alice and making predictions, and predicting 20% each day will do fine; the proper scoring rules I've discussed create no perverse incentives for Eve. However, if Alice herself starts predicting she'll go swimming with 20% probability each day, then each day there's a perverse incentive not to go swimming, because that makes the predictions better. To a certain extent you can get around this by batching things: e.g. "61% chance of 6 or more swims in a 31-day period" or maybe even "binomial distribution, n=31, p=0.2". However, aspirational stretch goals that can't be made to go above 50% with any amount of aggregating may be an unavoidable perverse-incentive trap.
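(Checking that batching figure, assuming the days are independent with a 20% chance of a swim each:)

    from math import comb

    n, p = 31, 0.2
    p_six_or_more = sum(comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(6, n + 1))
    print(f"P(6 or more swims in 31 days) = {p_six_or_more:.2f}")   # about 0.61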

Possibly you want to be specifically avoiding pushing yourself too hard, in which case the aspirational stretch goals problem isn't a problem.

Date: 2018-03-20 01:26 pm (UTC)
From: [personal profile] ptc24
Further further thoughts:

If you coarse-grain your probabilities - e.g. divide things into Must, Should, Could and Won't buckets, and put acceptable ranges on them - then when you're reviewing you can see how many Musts are hitting, how many Shoulds, etc., and adjust as appropriate. There's complicated maths for dealing with one-off probabilities, but if you put them in buckets they become a lot easier to deal with.
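A sketch of what that review could look like, with made-up bucket ranges and results:

    acceptable = {             # acceptable hit-rate range per bucket (invented numbers)
        "Must":   (0.95, 1.00),
        "Should": (0.75, 0.95),
        "Could":  (0.40, 0.75),
    }
    results = {                # (done, planned) over the review period - also invented
        "Must":   (19, 20),
        "Should": (13, 15),
        "Could":  (9, 10),
    }

    for bucket, (done, planned) in results.items():
        rate = done / planned
        lo, hi = acceptable[bucket]
        verdict = "ok" if lo <= rate <= hi else "adjust next time"
        print(f"{bucket}: {done}/{planned} = {rate:.0%} ({verdict})")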