Tuesday, May 12, 2020

Statistics

For a statistical analysis of a set of numbers to be valid, a few conditions must be met.
The data set must be random. If it isn't, you suffer from selection bias. For example, if all of a person's friends told them that they were going to vote for HRC, that person would have been surprised when Trump won the election. This is because their sample wasn't random.

The sample must also be large enough to be significant. When measuring a large data set or population, like a workforce, you don’t always need to collect information from every member of that population – a sample does the job just as well. The trick is to determine the right size for a sample to be accurate. Using proportion and standard deviation methods, you are able to accurately determine the right sample size you need to make your data collection statistically significant.
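As a rough sketch of the proportion method mentioned above, here is the standard textbook formula for the sample size needed to estimate a proportion, with an optional finite population correction. The function name, the 95% confidence z-score of 1.96, and the default guess of p = 0.5 are assumptions chosen for illustration, not anything specific to a particular data set.

```python
import math

def sample_size_for_proportion(margin_of_error, z=1.96, p=0.5, population=None):
    """Sample size needed to estimate a proportion (illustrative sketch).

    margin_of_error: acceptable error, e.g. 0.05 for plus or minus 5 points
    z: z-score for the confidence level (1.96 is roughly 95%)
    p: ASSUMED proportion; 0.5 is the most conservative guess
    population: if given, apply the finite population correction
    """
    # Basic formula: n0 = z^2 * p * (1 - p) / e^2
    n0 = (z ** 2) * p * (1 - p) / margin_of_error ** 2
    if population is not None:
        # Finite population correction shrinks n for small populations
        n0 = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n0)

# Hypothetical usage: a workforce of 1,000 people, 5% margin of error
print(sample_size_for_proportion(0.05))                   # no population limit
print(sample_size_for_proportion(0.05, population=1000))  # with correction
print(sample_size_for_proportion(0.05, p=0.1))            # different assumed p
```

Notice that the answer depends on the value of p you plug in. If that guess is wrong, the sample size is wrong too, and none of this helps at all if the sample isn't random in the first place.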

When studying a new, untested variable in a population, your proportion equations might need to rely on certain assumptions. However, these assumptions might be completely inaccurate. This error is then passed along to your sample size determination and then onto the rest of your statistical data analysis.

That is what is happening now to statistical analysis of Wuhan virus numbers. We have a data set that tells us how many people had the disease at the time of testing. There are a few problems with how that data was collected that make those numbers worthless.


  • We didn't test people at all before we were aware of the Wuhan virus. So we have no idea how many people had it, or what the outcome of their case was. 
  • The people who were tested at first were only tested if they had been travelling to certain countries, so people who had the disease and had not traveled recently were not part of the data set. 
  • The people who tested negative didn't have the Wuhan virus on the day they were tested. That doesn't mean that they had not already had it months before, nor does it mean that they didn't get it days or weeks later.

The state of Florida thinks that because it has tested a large percentage of the state, it can predict what will happen. The numbers we have now cannot be used for a statistical analysis because they are not random. We are also assuming that the people who tested negative didn't get it later, and didn't have it earlier. This makes it impossible to determine the CFR.
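To make the selection bias point concrete, here is a small simulation sketch. Every number in it is made up for illustration: the share of people who traveled, the infection rates in each group, and the sample size are all hypothetical assumptions, not real data. It compares the infection rate you would measure from a random sample against the rate you would measure if you only tested travelers.

```python
import random

def simulate_testing(population_size=100_000, seed=0):
    """Compare random sampling with travelers-only testing.

    All rates below are HYPOTHETICAL and chosen only for illustration.
    Returns (true_rate, random_sample_rate, travelers_only_rate).
    """
    rng = random.Random(seed)
    people = []
    for _ in range(population_size):
        traveled = rng.random() < 0.05           # assume 5% traveled recently
        infection_chance = 0.20 if traveled else 0.02  # assumed rates
        infected = rng.random() < infection_chance
        people.append((traveled, infected))

    # True infection rate across everyone (unknowable in real life)
    true_rate = sum(inf for _, inf in people) / population_size

    # Random sample of 2,000 people: everyone has an equal chance of a test
    sample = rng.sample(people, 2000)
    random_rate = sum(inf for _, inf in sample) / len(sample)

    # Biased sample: only people who traveled get tested
    travelers = [p for p in people if p[0]]
    traveler_rate = sum(inf for _, inf in travelers) / len(travelers)

    return true_rate, random_rate, traveler_rate

true_rate, random_rate, traveler_rate = simulate_testing()
print(f"true rate:           {true_rate:.3f}")
print(f"random sample rate:  {random_rate:.3f}")
print(f"travelers-only rate: {traveler_rate:.3f}")
```

Under these made-up assumptions, the random sample lands close to the true rate while the travelers-only number is far off, and every infected person who never traveled is missing from the count entirely.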

2 comments:

Angus McThag said...

Impossible to calculate the IFR.

CFR is easy. Deaths over confirmed cases. CFR is a crappy way to decide how vicious a disease is because it misses many cases which don't get confirmed.

Florida is running a CFR of 4.24% and who knows what the IFR is.

You're correct that the way we're getting our confirmed cases isn't capturing an actual cross-section of the disease. It doesn't account for those who were untested and recovered, or who were tested and became infected later.

The asymptomatic are not being tested at all, for example.

We don't even know the rate of infection to any degree of certainty.

Divemedic said...

CFR stats are trash as well. Remember that in the early days, they were not testing people who were symptomatic for COVID unless they had also traveled to a set list of countries within the past 14 days. There were dozens of people who got infected by a doctor who had COVID but was never tested because he hadn't been to one of those countries.

The CDC has screwed this up so badly that we will never know how bad or not this infection was.