The description of the case doesn’t make sense to me, either. But I’m not having an easy time imagining what the philology example would be like, which makes me worry that there might be something specific about it that would affect things. I presume you’re deliberately avoiding giving too many details for a reason, but if you reconstruct only the features of the case that you consider relevant, and the case doesn’t make sense to you, it may not be very revealing that it doesn’t make sense to others either; the problem may be that something you are treating as irrelevant, and not mentioning, actually matters.

I mean, if I imagine that we’re looking at words in a body of literature, and horses are word A and zebras are word B, and the argument is intended to show that word B was actually in use in that body of literature (as opposed to only appearing as a result of slips of the pen, copyist errors, etc.), then I can’t see that this statistical argument proves anything; what we’d really want is some data about how common such errors are and what forms they usually take, in order to determine how many errors would be likely to arise by chance. Comparing to a hypothetical no-error case seems, as you say, a red herring, entirely and bizarrely beside the point. But this is also so obvious that I find it hard to imagine that anything like this was the original argument. Perhaps I am being too charitable.

Yes, I’d prefer not to give Dr. Yagami’s exact words so as not to make it too easy to find him—or for him to stumble on this post. I, too, worry that I may have left something essential out—but I can’t for the life of me see what.

If I can swear you to secrecy, I’d be happy to send you a scan of the actual couple of pages from the actual book.

The main reason I posted this is that I am sometimes wrong about things. Maybe the zebra example turns out to make sense in some way I hadn’t thought of. Maybe Yagami is using some sort of standard method. Maybe there’s some failure mode I haven’t thought of. It would be really good to know this before I make an ass of myself with the review. And speaking of asses, there are some wild asses in Mongolia which got left out of my parable, but they’re kind of cute so here is a link.

The problem is that, as Protagoras brought up above, there may be issues that you are missing. It doesn’t really make sense to consider the failure mode where you missed the importance of some detail, but not the failure mode where you missed not only the importance of the detail, but the detail itself.

I’m confident by now that, conditional on my paraphrase being true to the original, Dr. Yagami’s statistics don’t make any sense. But, like you say, it’s possible that I missed something. If someone were willing to take a look at the original text with me, we could probably settle that.

The test is about whether or not the equids being real correlates with them being horses. It concludes that there is a correlation. An equid is more likely to be a horse if it’s one of the ones you made up.

I’m thinking the writer was not using the appropriate test for what he wanted to show.

I think I agree with you that this is bogus statistics.

It sounds like the underlying claim is “there are wild zebras here and this is interesting.” The question isn’t how many zebras were observed compared to the number of horses; the real question is the probability of recording a zebra when you just saw a horse with dust on it. And that, yes, is something Fisher’s exact test won’t tell you.
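To make the “dusty horse” question concrete, here is a minimal Python sketch. It assumes, purely for illustration, that every equid really is a horse and that each one is independently misrecorded as a zebra with some fixed probability; the error rates below are invented, not taken from Dr. Yagami or from any data. The counts (2995 equids, 8 recorded zebras) are the hypothetical ones used elsewhere in this thread. Under that model, the chance of 8 or more false zebras is a binomial tail, and the answer depends entirely on the assumed error rate.

```python
from math import comb


def prob_at_least(k: int, n: int, error_rate: float) -> float:
    """P(X >= k) for X ~ Binomial(n, error_rate).

    Hypothetical model: each of n equids that are really horses is
    independently recorded as a zebra with probability error_rate.
    """
    below = sum(
        comb(n, i) * error_rate**i * (1 - error_rate) ** (n - i) for i in range(k)
    )
    return max(0.0, 1.0 - below)  # clamp tiny negative rounding errors


n_equids, n_zebras = 2995, 8  # hypothetical counts from the thread
for rate in (0.0, 0.0001, 0.001, 0.003):  # invented error rates
    p = prob_at_least(n_zebras, n_equids, rate)
    print(f"error rate {rate}: P(>= {n_zebras} false zebras) = {p:.3g}")
```

With a rate of zero the probability is exactly 0; with a few errors per thousand, 8 stray zebras stop being surprising. Which rate is realistic is an empirical question about how the records were made, and a comparison against a virtual, error-free sample never supplies it.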

My sense is that many researchers are confused about statistics, and so the prior for “author is confused and the statistical machinery is being misapplied” should be high.

I agree wholeheartedly with OP’s conclusions.

Here, AFAICT, there are two uncorrelated problems.

The first is that Fisher’s test can be used to show correlation in a multivariate distribution against the null hypothesis of an uncorrelated distribution. Here we clearly do not have such a distribution, since the ‘virtual’ example is actually the prior, or the null hypothesis in frequentist parlance. But you cannot inject the prior into the data sampling distribution, pretending it is a “virtual” parameter, and hope to get meaningful results back.

The second problem is, of course, that the prior is categorical about the absence of zebras, an assumption clearly refuted by the data.

Yagami’s conclusion still stands, but for much simpler and more certain reasons than the ones in his bogus analysis: if you assume there are no zebras, and you find some zebras, then the assumption is refuted with certainty 1 (or again, in frequentist-speak, with p=0).

A null hypothesis is a statement about parameters, while the virtual sample is a statement about statistics, so it’s not quite correct to say that the virtual example is the null. And the null isn’t the same as the prior; the null hypothesis is, as the name implies, a hypothesis, while the prior is a confidence level assigned to a hypothesis. So, for instance, “I think the null has a 90% chance of being true” would be a prior.

Your last paragraph, though, is correct. The correct test would use a Poisson distribution with lambda = 0, and the probability of getting a non-zero value when lambda = 0 is 0.
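Here is a minimal Python sketch of that Poisson calculation, evaluating P(X ≥ k) for X ~ Poisson(λ). The function name is my own, and λ = 3 is an arbitrary value picked only for contrast.

```python
import math


def poisson_tail(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    if k <= 0:
        return 1.0
    if lam == 0:
        # Degenerate null: all probability mass sits at X = 0,
        # so any positive count has probability exactly 0.
        return 0.0
    cdf = sum(math.exp(-lam) * lam**i / math.factorial(i) for i in range(k))
    return max(0.0, 1.0 - cdf)


print(poisson_tail(8, 0.0))  # 8 zebras when none are expected: prints 0.0
print(poisson_tail(8, 3.0))  # for contrast: an arbitrary nonzero rate
```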

I don’t understand the point you’re arguing against, so I can’t evaluate your argument against it. What’s the conclusion being drawn, in terms of what Dr. Yagami anticipates?

To be fair, theories in philological subjects are rarely about anticipation. It’s not like you can just go and perform a few experimental tests or new observations.

Wait, what are they about then?

They are about actual anticipation only insofar as possible future discoveries of materials from the past are concerned. Otherwise, they’re about making coherent sense of what has been found so far.

Sub-sampling seems to be a useful concept to introduce here. Then you can ‘discover’ things over and over again.

I agree with Apprentice: this is not a comparison of two independently obtained samples but an observed sample vs a predicted state of affairs.

The problem with this situation is another flavour of “0 and 1 are not probabilities.” Here, treating the expected population of zebras as literally P(zebra)=0 in which case inferential statistical methods related to possible variation around observed values break down. Under the null hypothesis, the observed data come from the expected distribution, and if P(zebra)=0, the variance of this distribution is 0.

Or, put a different way, the “expected” distribution is not a distribution in the sense we usually mean, because 0 is not a probability.

“Here, treating the expected population of zebras as literally P(zebra)=0”

That phrase lacks a finite verb.

I don’t see that inferential statistical methods break down. On the contrary, they give exactly the correct answer that one would expect. The variance under the null is zero, so the z value is infinity, so the p value is zero. Whether you do a z-test, a t-test, or a Poisson test, you’re going to get p = 0, and therefore reject the null. Your trying to link this to the claim that 0 is not a probability is begging the question.
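The z-test case can be sketched in a few lines of Python. This is a standard one-proportion z-test written out by hand (the function name and the explicit handling of a zero standard error are my own choices), applied to the hypothetical 8 zebras among 2995 equids with a null proportion of 0. The standard error under the null is 0, so z comes out infinite and the two-sided p-value is exactly 0.

```python
import math


def one_proportion_z_test(successes: int, n: int, p0: float):
    """Two-sided one-proportion z-test of H0: p = p0 (normal approximation).

    Returns (z, p_value).
    """
    p_hat = successes / n
    se = math.sqrt(p0 * (1 - p0) / n)  # standard error implied by the null
    if se == 0:
        # p0 is 0 or 1: the null allows no sampling variation at all.
        z = 0.0 if p_hat == p0 else math.copysign(math.inf, p_hat - p0)
    else:
        z = (p_hat - p0) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value


print(one_proportion_z_test(8, 2995, 0.0))  # prints (inf, 0.0)
```

The normal approximation would be a poor choice for counts this small in an ordinary analysis; here it only shows the limiting behaviour described above.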

It might make sense if there were several groups of people who independently counted the equines in Mongolia. Nine of the 10 groups did not encounter any zebras, but the 10th did. What Dr. Yagami suspects is that the 10th group did not actually count equines in Mongolia but got off track and counted somewhere else. An alternative explanation would be that they did in fact count in Mongolia but, being much more eager than the other groups, found the few zebras in Mongolia that the other groups missed. So what Dr. Yagami is trying to prove is that 8 zebras are too many to be a statistical fluke that the other groups just didn’t catch, and that it is more likely that the 10th group did not in fact count in Mongolia. Although I’m currently too drunk to figure out whether that is actually supported by the narrative.

Turn the question on its head and make up a story where the math matches the observation in every circumstance. If you can (and I’m not sure I could), work backwards from there to find the breaking point. Or just remember that the map is not the territory and the finger that points to the moon is not the moon, and get on with things.

This was my attempt to make up a story where the math would match something real:

Statistically comparing two samples of equids would make some sense if Dr. Yagami had sampled 2987 horses and 8 zebras while Dr. Eru had sampled 2995 horses and 0 zebras. Then Fisher’s exact test could tell us that, with high probability, they did not sample the same population with the same methods.
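For what it’s worth, here is a minimal, dependency-free Python sketch of a two-sided Fisher’s exact test on that hypothetical table (rows: Dr. Yagami, Dr. Eru; columns: zebras, horses). It uses the common two-sided convention of summing the probabilities of all tables with the same margins that are no more likely than the observed one; scipy.stats.fisher_exact is the usual library route, and the function here is a hand-written stand-in. The function only ever sees four numbers, so the arithmetic is identical whether the second row came from a real survey or was typed in as a “virtual sample”.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Conditions on both margins, so the top-left cell follows a
    hypergeometric distribution under the null of no association.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    col2 = n - col1
    total = comb(n, row1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given the margins.
        return comb(col1, x) * comb(col2, row1 - x) / total

    p_obs = prob(a)
    support = range(max(0, row1 - col2), min(row1, col1) + 1)
    # Sum every table at most as likely as the observed one; the small
    # relative tolerance keeps exact ties from being lost to rounding.
    p = sum(q for q in map(prob, support) if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


# Rows: Dr. Yagami, Dr. Eru (hypothetical); columns: zebras, horses.
print(fisher_exact_two_sided(8, 2987, 0, 2995))
```

On these made-up counts the p-value comes out below 0.01, which is what makes the two-real-samples story coherent: the question “did these two surveys sample the same population?” is one the test is built to answer.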

But in the actual case what we have is just a “virtual sample”. I’m wondering if there are any conceivable circumstances where a virtual sample would make sense.

How about the classic example of testing whether a coin is biased? This seems to use “virtual sample” as described in the original post to reflect the hypothesised state of affairs in which the coin is fair: P(heads) = P(tails) = 0.5. This can be simulated without a coin (whatever number of samples one wishes) then compared against observed counts of heads vs tails of the coin in question.

The same applies for any other situation where there is a theoretically derived prediction about probabilities to be tested (for example, “is my multiple choice exam so hard that students are not performing above chance?” If there are four choices we can test against a hypothetical P=.25).
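Neither example strictly needs a simulated sample: when the null fixes the probability, the exact binomial distribution gives the p-value directly. A minimal Python sketch, using illustrative counts of my own (60 heads in 100 flips; 16 correct out of 40 four-option questions) rather than any data from the thread:

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_test_two_sided(k: int, n: int, p0: float) -> float:
    """Exact two-sided test of H0: p = p0.

    Sums the probabilities of all outcomes no more likely than the observed one.
    """
    p_obs = binom_pmf(k, n, p0)
    p = sum(q for q in (binom_pmf(i, n, p0) for i in range(n + 1))
            if q <= p_obs * (1 + 1e-7))
    return min(1.0, p)


def binom_test_greater(k: int, n: int, p0: float) -> float:
    """Exact one-sided test of H0: p = p0 against p > p0, i.e. P(X >= k)."""
    return min(1.0, sum(binom_pmf(i, n, p0) for i in range(k, n + 1)))


# Is the coin fair? (hypothetical: 60 heads in 100 flips)
print(binom_test_two_sided(60, 100, 0.5))
# Is the student above chance? (hypothetical: 16 of 40 right, 4 options each)
print(binom_test_greater(16, 40, 0.25))
```

The design choice is simply to use the distribution the null already specifies; scipy.stats.binomtest performs the same computation if a library is preferred.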

But there you have a probabilistically formulated null hypothesis (coin is fair, students perform at chance level). In the equids example, the null hypothesis is that the probability of sampling a zebra is 0, which is disproven by simply pointing out that you, in fact, sampled some zebras. It makes no sense to calculate a p-value.

I have no idea what Fisher’s test is supposed to do here. Show a correlation between the property of being a zebra and the property of being in the real, as opposed to the imaginary, sample? … That’s meaningless.

Agreed! Perhaps Fisher’s test was used because it can deal with small expected values in cells of contingency tables (where chi-square is flawed), but “small” must still be > 0.

Which just made me think that it would have been hilarious if Dr. Yagami had realised this and continued by saying that because of it, and in order to make the statistical test applicable, he is going to add an amount of random noise.

I don’t think that there’s any examination using a statistical test that uses a virtual sample that can’t be done as well or better with another statistical test. The whole point of Fisher’s is that you have four samples from an unknown distribution. If you pretend that there is a distribution that is unknown under the null that is in fact known under the null, you are throwing information away.
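A small Python sketch of the “throwing information away” point, under assumptions of my own: a coin shows 60 heads in 100 flips, and the fair-coin null is either tested directly with an exact binomial test or encoded as a virtual sample of heads and tails and fed to a two-sided Fisher’s exact test. Both functions are hand-written here with hypothetical names, using the usual “at most as likely as observed” two-sided rule.

```python
from math import comb

TOL = 1 + 1e-7  # relative tolerance so exact ties survive rounding


def binom_two_sided(k: int, n: int, p0: float) -> float:
    """Exact two-sided binomial test of H0: p = p0 (null fully specified)."""
    pmf = [comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(n + 1)]
    return min(1.0, sum(q for q in pmf if q <= pmf[k] * TOL))


def fisher_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for [[a, b], [c, d]] (margins fixed)."""
    row1, col1, n = a + b, a + c, a + b + c + d
    total = comb(n, row1)
    support = range(max(0, row1 - (n - col1)), min(row1, col1) + 1)
    probs = {x: comb(col1, x) * comb(n - col1, row1 - x) / total for x in support}
    return min(1.0, sum(q for q in probs.values() if q <= probs[a] * TOL))


heads, tails = 60, 40  # hypothetical observed coin
print("exact binomial vs p = 0.5:  ", binom_two_sided(heads, heads + tails, 0.5))
print("Fisher vs 50/50 virtual row:", fisher_two_sided(heads, tails, 50, 50))
print("Fisher vs 5000/5000 virtual:", fisher_two_sided(heads, tails, 5000, 5000))
```

The virtual row injects sampling noise that the null never had, so the Fisher p-value is larger than the exact one; making the virtual row bigger pushes it back toward the one-sample answer, which suggests the virtual sample was only ever a clumsy stand-in for a known distribution.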

IMAO Dr. Yagami doesn’t understand statistics and is spectacularly confused (did a zebra kick him in the head, by any chance?) about p-values and significance testing.


