Dive 64: Lying with Data
Hey, it’s Alvin.
On August 6th, 2005, a small twin-engine ATR-72 turboprop takes off from Bari, Italy towards Djerba, Tunisia as Tuninter Flight 1153. About 50 minutes after takeoff, at around 23,000 feet over the Mediterranean Sea, the pilots get a warning on their displays…
Fuel feed—low pressure.
Shortly after, the right engine shuts down.
OK, no worries. Planes can still fly with just one engine; they just can’t fly as high. So, they descend to 17,000 feet. Once there, one of the pilots starts going through procedures to address the low-pressure warning. While the pilot is reading the procedures, the pilot flying tells him to stop…
The left engine just shut down, too.
The plane starts descending… gradually.
In desperation, the pilots try to restart both engines, but to no avail.
So, the pilots issue a MAYDAY call over their radios. They ask the air traffic controller for the nearest airport. They need to land ASAP.
The air traffic controller tells them the nearest airport is Palermo, Italy.
They’re now at 12,000 feet and 40 nautical miles (NM) from Palermo.
What is happening to their plane? They have no idea. The errors make no sense to them. But they try to restart their engines, anyway.
No matter how many times they try, the engines just won’t start.
They’re now at 4000 feet and 20 NM from Palermo.
That’s when it dawns on the pilots…
They’re not gonna make it.
They’re going to have to ditch the aircraft into the sea.
The problem is that aircraft are not designed to skim gracefully along the water's surface, like boats. And if you’ve ever belly-flopped into a pool, you know that water barely compresses; hit it fast enough and it might as well be concrete. It hurts.
It didn’t matter how gently the pilots tried to land the plane on the sea. When it hit the water, the plane tore apart.
16 of 39 people lost their lives that day.
The story of Tuninter 1153 offers many lessons, but I just want to focus on one today:
We need to know where data comes from.
Investigators found that the fuel gauge installed on the ATR-72 was not designed for the plane. Maintenance workers wanted to replace a faulty gauge and swapped one in from an ATR-42. This gauge reported fuel levels much higher than what was actually in the fuel tanks. In fact, even after the pilots ran out of fuel, the gauges still showed they had fuel on board.
When the plane departed Bari, it did not have enough fuel to make it to Djerba. Had they known they had no fuel, they wouldn’t have tried to restart their engines. Simulations showed that if the pilots just focused on gliding their aircraft at an optimal descent rate from the moment they lost both engines, they could have made it to Palermo.
The pilots were fed bad data. Had they known it was bad, they would’ve acted differently. They could’ve saved the lives of all 39 people on board.
So, why should this matter if you’re not a pilot?
I’m not a pilot either. But every day, we’re all inundated with data. We live in a data-driven world.
Not even that.
We live in a data-OBSESSED world.
It seems like every time someone wants to convince me of something, they start by saying:
“Let’s look at the data…” or
“Here’s what the data shows…” or
“But the data says…”
And then, of course, I get an obligatory graph. Line graph, bar graph, scatterplot… take your pick.
But I often sense that they never questioned where the data comes from. Even though how the data was collected tells us a lot about it, and can change what the data is actually saying.
Here’s a simple example:
The Research is the Data
StackOverflow is (or was) a popular Q&A website for software developers. The website surveys developers every year to report on the state of software development. One question in its 2020 survey asked developers how many hours they worked each week. The conclusion was that over 75% of developers worked fewer than 45 hours per week.
Can you spot a problem with this question?
The most glaring problem with the overall survey is its sampling bias: respondents are self-selected volunteers. When I conducted psychological research in university, I learned we need a random sample of participants for a survey to be anywhere near useful. Otherwise, you’re not getting an accurate representation of reality.
But the people answering this survey are going to be those who have time to spare AND will volunteer it for a survey that offers no direct, meaningful benefit. A developer working 55+ hours per week is more likely to be in a senior or more demanding role. So, they probably have less need to visit the website. And when they do, they don’t have time to linger long enough to notice there’s a survey. And even if they DO notice it, I doubt they have the time, or the inclination, to fill it out.
So, it’s extremely likely those who work 55+ hours per week are vastly underrepresented in a survey like this. Which makes sense. If you only had a couple of hours each day to yourself, would you rather spend it with your loved ones, your friends, on hobbies, relaxing with a movie/book? Or would you rather spend your precious minutes on a random survey on a website you visit on occasion?
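To make that self-selection effect concrete, here’s a toy simulation in Python. Every number in it is invented for illustration — the hours distribution and the response-probability curve are assumptions, not real survey data. The point is only the direction of the skew: if busier developers respond less often, the survey’s “share working under 45 hours” comes out higher than the truth.

```python
import random

random.seed(42)

# Hypothetical population of 100,000 developers.
# Weekly hours drawn from a rough bell curve centered at 45 (invented numbers).
population = [random.gauss(45, 10) for _ in range(100_000)]

def responds(hours):
    """Assumed self-selection: response probability falls as hours rise.

    Roughly 100% likely to respond at 30h/week, floored at 5% for 70h+.
    This curve is a pure assumption for the sake of the demo.
    """
    p = max(0.05, 1.0 - (hours - 30) / 40)
    return random.random() < p

# Only the developers who "respond" show up in the survey data.
respondents = [h for h in population if responds(h)]

def share_under_45(hours_list):
    """Fraction of the list working fewer than 45 hours per week."""
    return sum(h < 45 for h in hours_list) / len(hours_list)

print(f"True share under 45h/week:     {share_under_45(population):.0%}")
print(f"Surveyed share under 45h/week: {share_under_45(respondents):.0%}")
```

Run it and the surveyed share comes out noticeably higher than the true share, even though nobody lied: the long-hours developers simply never made it into the sample.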
What about the fact that we work more hours closer to critical release dates? Maybe you usually work around 40 hours per week, but 60 hours per week near a deadline. How does a single question capture that nuance?
Professors from my university days warned our class that a survey is one of the most unreliable ways to collect data. Mainly because it’s hard to ask the right questions to capture nuanced responses.
The StackOverflow researchers literally concluded that “globally, over 75% of developers work less than 45 hours per week.” But if we consider the broader research methodology that gave rise to the data, there’s a different conclusion:
“Over 75% of developers who had time to answer the survey said they worked fewer than 45 hours per week.” And we know this statistic probably doesn’t represent every week of a calendar year.
That’s quite a different conclusion. Because it leaves the door open to the possibility that there are plenty of unknowns the research did not address. And those unknowns could even lead to completely different conclusions.
The data presented on pretty graphs tells a story. But it’s important to consider data in the broader context of the research it came from, because that context changes the meaning of the story.
This isn’t even a new idea.
The Medium is the Message
In 1964, Canadian communication theorist Marshall McLuhan coined the phrase, “the medium is the message.”
McLuhan said that the medium holding a message affects the way we perceive the message in significant ways we often overlook.
For example, if you get an invitation card to a party with a message handwritten in calligraphy, you might get the impression it’s a formal affair. But if you get the same invitation message on social media, you might get the impression it’s casual and fun. The message is the same. The medium is different. So, the meaning of the message is different.
That’s why a person who only looks at data (the message) will see it differently from a person who also looks at the research (the medium) containing the data.
Had the pilots of Tuninter 1153 known that an incompatible fuel gauge (the medium) was installed on their plane, they would have known the fuel reading (the message) was inaccurate. They wouldn’t even have taken off.
So, the next time you’re presented with data, don’t just take it at face value. Because what the table or graph is telling you is almost never the full story. The real story may hold a different lesson entirely. Dive deeper. Or, at least, be skeptical.
Reply to firstname.lastname@example.org if you have questions or comments. I’d love to hear from you.
Thank you for reading. Question data. And I’ll see you in the next one.