jack

From:

that feels like they matched me except for one datum, but the old scoring system would only give four points. Is there a better algorithm, maybe something that (in a simple way) includes a information theory metric of distance?

Mmm. Fiddly one. You kind of want to find the set of shared cards between the two hands, and then look just within those shared cards to see whether they're all in the same order. That sorts out your example of ABCDE vs XABCD. But then the question is what you do if the shared set aren't in the right order; again, you want to score ABCDE vs EABCD by treating the E as a single out-of-place element rather than by considering absolutely everything to be undifferentiatedly "in the wrong place" as Mastermind would.

One obvious thing is to look at all the pairs of cards among that set, and score a point for each pair that's in the same order as in the other hand (or inflict a penalty point for each pair that isn't, depending on how you want to combine this score with the one for completely different cards). That has the nice feature that ABCDE vs ACBDE only scores one demerit, but ABCDE vs EABCD scores four, representing the fact that E was more significantly misplaced. On the other hand, it's a bit inconvenient that it's a scoring system with a quadratic cap – you then have to scale it in some annoying way in order to combine it with the basically linear score for elements not in the other list at all.

(This ties slightly in to a thing I was thinking about some years ago, which was how you determine the interestingness of an anagram. If you process a large computerised word list by sorting the letters of each word into alphabetical order and then sorting the whole list using those transformed versions as a key, you bring together every set of words that are mutually anagrams and can then go through and list them very easily. But how do you extract the most interesting anagrams from that list? Many of the obvious metrics for how much jumbling has taken place, such as the one I describe above, turn out to have the undesirable property in this application of giving a disproportionately high score to the operation of moving one or two letters from the start to the end, so that your list-of-interesting-anagrams is topped by lots of tedious things like REINTERPRET / INTERPRETER. Eventually Gareth came up with a neat metric which avoided this problem and which is still the best thing I've seen so far, although it seems clear that for today's application it's not what's required.)

eta Oh yes. The other thing I was thinking about was to work the scoring in terms of the longest common subsequence between the two lists, which seamlessly handles both the ABCDE/XABCD and ABCDE/EABCD cases without having to give fundamentally different treatment to replacement of an element as opposed to movement. Unfortunately, that approach gives identical scores to ABCDE/XYABC and ABCDE/DEABC, and it seems pretty clear to me that I'd want those to be scored differently!