Nicolaas van Dijk

Statistics - Selecting the Population

When reading research, I sometimes feel unsure about how to assess the data. I want to weigh prior knowledge alongside the new information when making decisions. For practical reasons, probabilities are defined for particular “populations”, but it is not always clear which “population” is most appropriate. If I were asked to predict the sex of the next person I saw, I would start with the US “population”. Based on census data, there is about a 50% chance that the person is female. If this is the best information available, that probability and “population” seem reasonable. But what if better information were available? What if the question were asked just before I entered my graduate school classroom, where the female:male ratio is 70:30? Then I would consider a different “population”, or “sub-population”, and the probability would be 70%.

Sometimes it is not as clear which “population” is most appropriate. Consider an example of a diagnostic test for a rare disease (1 in 1000 have the disease, or 0.1%), where the accuracy of the test is 99%. If someone receives a positive test result, does that mean there is a 99% chance they have the disease?

The accuracy of these tests is often defined as:

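  • the number of people with the disease who test positive versus the number of people with the disease who test negative (in other words, the share of people with the disease whom the test correctly flags, often called sensitivity)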

Since the disease is rare, even a positive test should be viewed with some skepticism. The accuracy figure does not account for the underlying incidence of the disease, because it is defined only over the “population” of people who have the disease. To interpret a positive result, a different “population” should be used: everyone with a positive test, not just those with a positive test AND the disease. Because healthy people vastly outnumber sick ones, even a small false-positive rate can produce more false positives than true positives. In this situation, considering the incidence of the disease (0.1%) helps make more realistic inferences about the meaning of a positive test.
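To make the effect of the 0.1% incidence concrete, here is a rough sketch of the calculation using Bayes' rule. The 0.1% incidence and the 99% detection rate come from the example above. The text does not say how often healthy people test negative, so I am assuming that figure (the specificity) is also 99%. The function and variable names are mine, chosen only for illustration.

```python
def positive_predictive_value(prevalence, sensitivity, specificity):
    """Return P(disease | positive test) using Bayes' rule."""
    # Share of the whole population that is sick AND tests positive.
    true_positives = sensitivity * prevalence
    # Share of the whole population that is healthy AND tests positive.
    false_positives = (1.0 - specificity) * (1.0 - prevalence)
    # Restrict attention to the "population" of everyone with a positive test.
    return true_positives / (true_positives + false_positives)


prevalence = 0.001   # 1 in 1000 have the disease (from the example)
sensitivity = 0.99   # 99% of people with the disease test positive (from the example)
specificity = 0.99   # ASSUMPTION: 99% of healthy people test negative

ppv = positive_predictive_value(prevalence, sensitivity, specificity)
print(f"Chance of disease given a positive test: {ppv:.1%}")
```

Under these assumptions the answer is roughly 9%, not 99%. In a group of 100,000 people, about 100 have the disease and 99 of them test positive, while about 999 of the 99,900 healthy people also test positive. Only 99 of the 1,098 positive results belong to people who are actually sick. A different assumed specificity would change the number, but the answer stays well below 99% unless false positives are extremely rare.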