How to lie with statistics — Book summary

22.02.2023 — summary, book — 8 min read

This is my summary of the key ideas of How to Lie with Statistics written by Darrell Huff.

The book has a lot of real-world examples from government, business, advertising, etc. I left out many of them because they were difficult to summarize without losing the gist.

👻 Chapter 1: The sample with the built-in bias

The problem, as with anything based on sampling, is how to read it without learning too much which is not necessarily so.

The result of a sampling study is no better than the sample it’s based on.
However, by the time the data from a sampling study has been filtered through layers of statistical manipulation and reduced to a decimal-pointed average, the result begins to take on an aura of conviction that a closer look at the sampling would deny.
To be worth much, a report based on sampling must use a representative sample, which is one from which every source of bias has been removed.
The dependability of a sample can be destroyed just as easily by invisible sources of bias as by visible ones.
The test of a random sample: Does every individual in the population have an equal chance to be in the sample?
The purely random sample is the only kind that can be examined with entire confidence through statistical theory.
However, a purely random sample is difficult and expensive to obtain. A more economical substitute is called stratified random sampling.
To get a stratified sample you divide your universe into several groups in proportion to their known prevalence.
A core problem with stratified sampling: How do you know that your information about the groups’ proportion is correct?
Any questionnaire is only a sample of the possible questions and the answer given is no more than a sample of an individual’s attitudes and experiences on each question.
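To make the stratified sampling idea concrete, here is a minimal Python sketch. The population, the group labels, and the `stratified_sample` helper are all made up for illustration; the point is that the sample can only be as good as the `assumed` proportions fed into it.

```python
import random

def stratified_sample(population, strata_key, proportions, n, seed=0):
    """Draw about n items so that each group matches its assumed share."""
    rng = random.Random(seed)
    groups = {}
    for item in population:
        groups.setdefault(strata_key(item), []).append(item)
    sample = []
    for stratum, share in proportions.items():
        k = round(n * share)  # how many items this group contributes
        sample.extend(rng.sample(groups[stratum], k))
    return sample

# Hypothetical population: 700 renters and 300 owners.
population = [("renter", i) for i in range(700)] + [("owner", i) for i in range(300)]

# The proportions are an assumption. If they are wrong, the sample is biased.
assumed = {"renter": 0.7, "owner": 0.3}

sample = stratified_sample(population, lambda person: person[0], assumed, n=100)
renters = sum(1 for kind, _ in sample if kind == "renter")
print(renters, len(sample) - renters)
```

If the real split were 50/50 and you still passed 70/30, the code would happily run and the bias would be invisible in the output.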

☯️ Chapter 2: The well-chosen average

When you are told that something is an average you still don't know very much about it unless you can find out which of the common kinds of average it is: mean, median, or mode.

The different averages come out close together when you deal with data, such as those having to do with many human characteristics, that have the grace to fall close to what is called the normal distribution.
When you see an average figure, first ask: Average of what? Who's included?
The United States Steel Corporation once said that its employees' average weekly earnings went up 107% from 1940 to 1948. So they did, but some of the punch goes out of the magnificent increase when you note that the 1940 figure includes a much larger number of partially employed people. If you work half-time one year and full-time the next, your earnings will double, but that doesn't indicate anything at all about your wage rate.
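A quick way to see how far the three averages can drift apart on skewed data is to compute them side by side. The salary figures below are invented for illustration and are not taken from the book.

```python
from statistics import mean, median, mode

# Hypothetical payroll: many low salaries and a few very high ones.
salaries = [2000] * 12 + [3000] * 6 + [5000] * 4 + [10000, 15000, 45000]

print("mean:  ", mean(salaries))    # pulled up by the few top earners
print("median:", median(salaries))  # the middle person
print("mode:  ", mode(salaries))    # the most common salary
```

All three are legitimately "the average", so whoever reports one can pick the figure that suits the story.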

🧹 Chapter 3: The little figures that are not there

Knowing nothing about a subject is frequently healthier than knowing what is not so, and a little learning may be a dangerous thing.

The importance of using a small group (to lie) is this: With a large group any difference produced by chance is likely to be a small one and unworthy of big hype.
Only when there is a substantial number of trials involved is the law of averages a useful description or prediction.
Averages should usually be augmented with a range, unless you are trying to hide something of course 😉.
It’s dangerous to mention any subject having high emotional content without hastily saying whether you are for or against it.
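The point about small groups can be checked with a rough simulation. This sketch tosses a fair coin in many repeated experiments and reports the lowest and highest share of heads seen; the helper name, the number of experiments, and the seed are arbitrary choices of mine.

```python
import random

def heads_share_range(tosses, experiments=500, seed=1):
    """Return the lowest and highest share of heads across repeated experiments."""
    rng = random.Random(seed)
    shares = []
    for _ in range(experiments):
        heads = sum(rng.random() < 0.5 for _ in range(tosses))
        shares.append(heads / tosses)
    return min(shares), max(shares)

small_lo, small_hi = heads_share_range(10)     # small group: wild swings
large_lo, large_hi = heads_share_range(1000)   # large group: stays near 50%

print("10 tosses:  ", small_lo, small_hi)
print("1000 tosses:", large_lo, large_hi)
```

With only 10 tosses, chance alone produces results far from 50%, which is exactly what makes a small sample so convenient for a headline.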

😱 Chapter 4: Much ado about practically nothing

Sometimes the big ado is made about a difference that is mathematically real and demonstrable but so tiny as to have no importance. This is in defiance of the fine old saying that a difference is a difference only if it makes a difference.

What an IQ test purports to be is a sampling of the intellect. Like any other product of the sampling method the IQ is a figure with a statistical error, which expresses the precision or reliability of that figure.
How accurately a sample can be taken to represent the whole is a measure that can be represented in figures: the probable error and the standard error.
What this comes down to is that the only way to think about IQs and many other sampling results is in ranges: "Normal" is not 100, but the range of 90-110, and there would be some point in comparing a child in this range with a child in a lower or higher range. But comparisons between figures with small differences are meaningless.
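Thinking in ranges can be sketched in a few lines. The scores and the error of 3 points below are assumptions for illustration, and checking whether two ranges overlap is only a rough heuristic, not a formal significance test.

```python
def score_range(score, error):
    """Turn a single figure into the range implied by its error."""
    return (score - error, score + error)

def ranges_overlap(a, b):
    """True if two (low, high) ranges share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

# Hypothetical scores with an assumed error of 3 points each.
child_a = score_range(98, 3)
child_b = score_range(101, 3)
child_c = score_range(120, 3)

print(ranges_overlap(child_a, child_b))  # small difference, ranges overlap
print(ranges_overlap(child_a, child_c))  # large difference, ranges do not overlap
```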

🙈 Chapter 5: The gee-whiz graph

[…] The figures are the same and so is the curve. It is the same graph. Nothing has been falsified — except the impression that it gives.

You can truncate the bottom of a line graph to add schmaltz (Hey, you saved paper too 😉).
After truncating the line graph, change the proportion between the ordinate and abscissa. This exaggerates a rise/fall.
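The effect of truncating the axis can be put into numbers. In this sketch (the values and the `apparent_ratio` helper are my own illustration), the data rises by 10%, but the drawn height depends on where the axis starts.

```python
def apparent_ratio(low, high, baseline=0.0):
    """Ratio of the drawn heights of two values when the axis starts at `baseline`."""
    return (high - baseline) / (low - baseline)

honest = apparent_ratio(20, 22)                   # axis starts at zero
truncated = apparent_ratio(20, 22, baseline=18)   # axis starts at 18

print(round(honest, 2), round(truncated, 2))
```

The same 10% rise looks like a doubling once the bottom of the chart is cut off at 18.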

🫣 Chapter 6: The one-dimensional picture

Look with suspicion at any version of a bar graph in which the bars change their widths as well as their lengths while representing a single factor or in which they picture three-dimensional objects the volumes of which are not easy to compare.
A truncated bar chart has, and deserves, the same reputation as the truncated line graph in the last chapter.

A pictograph where the height or width of each image is drawn in proportion to the data being reported can be deceiving: even though one side is in proportion, the area occupied by the image can be a lot bigger.
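The arithmetic behind the pictograph trick is simple scaling. This small sketch, with a hypothetical `pictograph_impression` helper, shows what the eye sees when an image is scaled uniformly so that its height matches the data.

```python
def pictograph_impression(data_ratio):
    """Visual ratios when an image is scaled so its height matches the data ratio."""
    return {
        "height": data_ratio,       # what the data actually says
        "area": data_ratio ** 2,    # what a flat picture shows
        "volume": data_ratio ** 3,  # what a picture of a 3-D object suggests
    }

print(pictograph_impression(2))
```

A figure that is twice as large in the data ends up looking four or even eight times as large on the page.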

🃏 Chapter 7: The semi-attached figure

You can't prove that your nostrum cures colds, but you can publish a sworn laboratory report that half an ounce of the stuff killed 31,108 germs in a test tube in 11 seconds. While you are about it, make sure that the laboratory is reputable or has an impressive name. Reproduce the report in full. Photograph a doctor-type model in white clothes and put his picture alongside.

If you can't prove what you want to prove, demonstrate something else and pretend that they are the same thing. In the daze that follows the collision of statistics with the human mind, hardly anybody will notice the difference.
The general method is to pick two things that sound the same but are not.

There are often many ways of expressing any figure. You can, for instance, express exactly the same fact by calling it:

1. A 1% return on sales
2. A 15% return on investment
3. A ten-million-dollar profit
4. An increase in profits of 40% (compared with the 1935-39 average)
5. A decrease of 60% from last year.

The method is to choose the one that sounds best for the purpose at hand and trust that few who read it will recognize how imperfectly it reflects the situation.

If you purchase an article every morning for $0.99 and sell it each afternoon for $1, you will make only 1% on total sales, but 365% on invested money during the year.
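Here is the same daily trade worked out in Python. The variable names are mine, and prices are kept in cents to avoid rounding noise. Measured strictly against the $0.99 outlay, the yearly figure comes out a little above 365%, so the number in the text is best read as a round approximation.

```python
cost_cents = 99     # paid each morning
price_cents = 100   # received each afternoon
days = 365

profit_per_day = price_cents - cost_cents

# The same cent of profit, measured against two different bases.
return_on_sales = profit_per_day / price_cents
annual_return_on_investment = profit_per_day * days / cost_cents

print(f"return on sales:      {return_on_sales:.1%}")
print(f"return on investment: {annual_return_on_investment:.1%}")
```

The investment figure is so large because the same 99 cents is reused every day, while the sales figure adds up a full dollar of revenue each time.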
[…] The evidence: "Four times more fatalities occur on the highways at 7 P.M. than at 7 A.M." Now that is approximately true, but the conclusion doesn't follow:
- More people are killed in the evening than in the morning simply because more people are on the highways then to be killed.
- You, a single driver, may be in greater danger in the evening, but there is nothing in the figures to prove it either way.
- By the same kind of nonsense that the article writer used you can show that clear weather is more dangerous than foggy weather.
- The rate is more useful here than the number of fatalities.
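The difference between a count and a rate is easy to show with made-up numbers. Both the fatality counts and the traffic volumes below are hypothetical, chosen so that the evening has four times the deaths and four times the traffic.

```python
def rate_per_million_miles(fatalities, vehicle_miles):
    """Fatalities per million vehicle-miles driven."""
    return fatalities * 1_000_000 / vehicle_miles

# Hypothetical figures: four times the deaths, but also four times the traffic.
morning = rate_per_million_miles(fatalities=10, vehicle_miles=50_000_000)
evening = rate_per_million_miles(fatalities=40, vehicle_miles=200_000_000)

print(morning, evening)
```

With these numbers the risk per mile is identical, even though the raw count is four times higher in the evening.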
Don’t forget the old before-and-after trick with several unmentioned factors introduced and made to appear what they are not.

⏰ Chapter 8: Post hoc rides again

There are two clocks which keep perfect time.
When the first clock points to the hour, the second clock strikes.
Did the first clock cause the second to strike?

Post hoc fallacy: If B follows A, then A caused B.
When there are many reasonable explanations you are hardly entitled to pick one that suits your taste and insist on it.

Correlation can be of several types:

1. Correlation by chance. Given a small sample, you are likely to find some substantial correlation between any pair of characteristics or events that you can think of.
2. The relationship between variables is real but it’s not possible to be sure which of the variables is the cause and which the effect. In some of these instances cause and effect may change places from time to time or indeed both may be cause and effect at the same time. A correlation between income and ownership of stocks might be of that kind. The more money you make, the more stock you buy, and the more stock you buy, the more income you get; it is not accurate to say simply that one has produced the other.
3. Neither of the variables has any effect at all on the other, yet there is a real correlation.
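Correlation by chance is easy to reproduce. The sketch below generates pairs of completely unrelated random variables and counts how often they still show a "substantial" correlation. The 0.5 threshold, the sample sizes, and the helper names are my own assumptions.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def share_of_strong_correlations(sample_size, trials=1000, threshold=0.5, seed=42):
    """Share of trials in which two independent variables correlate strongly."""
    rng = random.Random(seed)
    strong = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(sample_size)]
        ys = [rng.gauss(0, 1) for _ in range(sample_size)]
        if abs(pearson(xs, ys)) >= threshold:
            strong += 1
    return strong / trials

small = share_of_strong_correlations(5)     # tiny samples
large = share_of_strong_correlations(100)   # larger samples

print(small, large)
```

With five observations, a sizeable share of the trials clears the threshold purely by luck; with a hundred, almost none do.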

Watch out for a conclusion in which a correlation has been inferred to continue beyond the data with which it has been demonstrated. It is easy to show that the more it rains in an area, the greater the crop (positive correlation), but a season of very heavy rainfall may damage the crop (negative correlation).
A correlation of course shows a tendency that is not often the ideal relationship described as one-to-one: The exceptions can be numerous, but the tendency is strong and clear.
A correlation may be real and based on real cause and effect — and still be almost worthless in determining action in any single case.
An observation might be accurate and sound, but the conclusion can be wrong.

👩‍🔬 Chapter 9: How to statisticulate

"Buy your Christmas presents now and save 100%” advises an advertisement. This sounds like an offer worthy of old Santa himself, but it turns out to be merely a confusion of base.
The reduction is only 50%.
The saving is 100% of the reduced or new price, it is true, but that isn't what the offer says.
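The confusion of base can be spelled out with one function. The prices are hypothetical; the only thing that matters is that the new price is half the old one.

```python
def saving_as_share(old_price, new_price):
    """Return the saving as a share of the old price and of the new price."""
    saving = old_price - new_price
    return saving / old_price, saving / new_price

of_old, of_new = saving_as_share(old_price=100, new_price=50)

print(f"{of_old:.0%} of the old price, {of_new:.0%} of the new price")
```

Same saving, two bases: 50% sounds ordinary, 100% sounds like a gift.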

One of the trickiest ways to misrepresent statistical data is by utilizing a map. A map introduces a fine bag of variables in which facts can be concealed and relationships distorted.
A decimal (7.537) conveys an air of rigor that an integer (7) does not, whether or not the extra precision is real.
Any percentage figure based on a small number of cases is likely to be misleading. It is more informative to give the figure itself.
To offset a pay cut of 50% you must get a raise of 100%:
- Assume the salary was $1.
- After a 50% cut, the salary is now $0.5.
- After a 100% raise on the new salary, the salary is back to $1.
When a hardware jobber offers "50% and 20% off list," he doesn't mean a 70% discount. The cut is 60% since the 20% is figured on the smaller base left after taking off 50%. Percentages with different bases shouldn’t be added.
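Both the pay-cut example and the "50% and 20% off" example come down to applying each percentage to the amount left by the previous one. A minimal sketch, with an `apply_changes` helper of my own naming:

```python
def apply_changes(amount, *fractional_changes):
    """Apply percentage changes one after another, each on the new base."""
    for change in fractional_changes:
        amount *= 1 + change
    return amount

# A 50% cut followed by a 100% raise brings the salary back to where it started.
salary = apply_changes(1.0, -0.50, 1.00)

# "50% and 20% off list" on a $100 list price leaves $40, i.e. a 60% discount.
price = apply_changes(100.0, -0.50, -0.20)
discount = 1 - price / 100.0

print(round(salary, 2), round(price, 2), round(discount, 2))
```

Adding the percentages (50% + 20% = 70%) gives the wrong answer because the second percentage is taken from a smaller base.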
Utilize the confusion between percentage and percentage points:
If your profits should climb from 3% on investment one year to 6% the next, you can make it sound quite modest by calling it a rise of 3 percentage points; with equal validity, you can describe it as a 100% increase 😉.
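The two descriptions of the same change can be computed side by side. The `describe_change` helper is just an illustration.

```python
def describe_change(old_pct, new_pct):
    """Return the change in percentage points and as a relative percent increase."""
    points = new_pct - old_pct
    relative = (new_pct - old_pct) / old_pct * 100
    return points, relative

points, relative = describe_change(3, 6)

print(f"{points} percentage points, or a {relative:.0f}% increase")
```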
Percentiles are deceptive too:
When you are told how Johnny stands compared to his classmates in algebra, the figure may be a percentile. It means his rank in terms of each 100 students. For example, in a class of 300, the top 3 will be at the 99th percentile, the next 3 at the 98th, and so on. The odd thing about percentiles is that a student with a 99th-percentile rating is probably quite a bit superior to one standing at the 90th, while those at the 40th and 60th percentiles may be of almost equal achievement. This comes from the habit that so many characteristics have of clustering about their own average, forming the "normal" bell curve.
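The percentile effect can be checked against a normal curve. This sketch assumes the underlying scores are normally distributed (as the text suggests) and measures gaps in standard deviations; the `score_gap` helper is my own.

```python
from statistics import NormalDist

def score_gap(lower_percentile, upper_percentile, dist=NormalDist()):
    """Gap in underlying score (in standard deviations) between two percentiles."""
    return dist.inv_cdf(upper_percentile / 100) - dist.inv_cdf(lower_percentile / 100)

top_gap = score_gap(90, 99)      # 9 percentile steps near the top
middle_gap = score_gap(40, 60)   # 20 percentile steps around the middle

print(round(top_gap, 2), round(middle_gap, 2))
```

Under that assumption, the 9 steps from the 90th to the 99th percentile cover roughly twice as much actual score as the 20 steps from the 40th to the 60th.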

🤺 Chapter 10: How to talk back to a statistic

A report of a great increase in deaths from cancer in the last quarter-century is misleading unless you know how much of it is a product of such extraneous factors as these:
- Cancer is often listed now where "causes unknown" was formerly used.
- Autopsies are more frequent, giving surer diagnoses.
- Reporting and compiling of medical statistics are more complete.
- People more frequently reach the most susceptible ages now.
And if you are looking at total deaths rather than the death rate, don't neglect the fact that there are more people now than there used to be.
Who says so? Look for both conscious and unconscious bias and its possible manifestation.
How do they know? Watch out for biased samples and reported correlations.
What’s missing? Ask for the raw values when given percentages. Sometimes what is missing is the factor that caused a change to occur: This omission leaves the implication that some other, more desired, factor is responsible.
Did somebody change the subject? When assaying a statistic, watch out for a switch somewhere between the raw figure and the conclusion, e.g., more reported cases of a disease are not always the same thing as more cases of the disease.
Does it make sense? Many a statistic is false on its face. It gets by only because the magic of numbers brings about a suspension of common sense.