Last week, I wrote about a scientific study out of Norway that tested whether men or women were better at assembling Ikea furniture. This week, we conducted an entirely unscientific replication of that experiment. The sample size? 4.

Working in teams of two, men versus women, we raced to put together a Råskog utility cart. I partnered with video editor Tom LeGro, who told me that he had extensive and up-to-date experience with Ikea furniture, being the father of a 4-year-old. Our opponents were business of health reporter Carolyn Johnson and video editor Jenny Starrs.

Carolyn builds furniture, from scratch, for fun; Jenny played a ton with Legos as a kid, a pastime that psychologists have linked to higher spatial ability. I, on the other hand, regularly confuse left and right, which makes car rides with me impossible. “Can you just tell me clockwise or counterclockwise?!” is something I have snapped at passengers.

Maybe I’m providing this information to justify, after the fact, why Tom and I lost. Badly. We made a fatal error when we mounted one of our baskets askew. It took about a minute to figure out why none of our screws were going in. We would never make up that lost time.

The point of this game was to demonstrate some principles about statistics, experimentation, and how we learn from science.

Some commenters on last week’s piece complained that the Norway study had “only” 80 participants. This is a somewhat strange thing to say, and I would ask anyone who criticizes an experiment’s sample size: How big is big enough?

It depends on what you’re trying to measure, and whether your data points are close together or scattered far apart. Experiments like these try to detect differences in averages. I like the shooting range analogy. Maybe you want to figure out if two guns are pointed at the same target or different targets. You have each gun fire a couple of shots, which land in clusters. If the guns are pointed at the same target, then the clusters should share the same center — the same average.

But there’s some randomness here. After measuring, you find that each cluster is centered on a slightly different point. Does that difference reflect chance, or is it evidence that the guns actually point at different spots?

You may need to fire more shots — you may need more data, a larger sample size — to overcome the uncertainty. If the clusters are really tightly grouped, it’s easier to detect differences. If they’re spread out, it’s harder. You also need more data if the guns are just slightly misaligned — if you’re trying to distinguish between very minute differences.
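
For readers who want to see those three factors (number of shots, spread of the clusters, and size of the misalignment) in action, here is a rough simulation sketch in Python. Everything in it is an assumption made for illustration: the function name `detection_rate`, the bell-curve scatter of the shots, the made-up gaps and spreads, and the simple z-test with a 1.96 cutoff, which treats the spread as known. It is not the method used in the Norway study.

```python
import math
import random
import statistics


def detection_rate(shots, gap, spread, trials=1000, seed=0):
    """Fraction of simulated experiments that flag the two guns as aimed differently.

    Hypothetical setup: gun A aims at 0, gun B aims at `gap`, and each shot
    scatters around its aim point with standard deviation `spread`.
    """
    rng = random.Random(seed)
    # Standard error of the difference between two averages of `shots` shots each.
    std_error = spread * math.sqrt(2.0 / shots)
    flagged = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, spread) for _ in range(shots)]
        b = [rng.gauss(gap, spread) for _ in range(shots)]
        diff = statistics.fmean(b) - statistics.fmean(a)
        # Two-sided z-test at roughly the 5 percent level (an assumed cutoff).
        if abs(diff) > 1.96 * std_error:
            flagged += 1
    return flagged / trials


if __name__ == "__main__":
    # All numbers below are invented for illustration.
    print("few shots, wide spread: ", detection_rate(shots=5, gap=1.0, spread=2.0))
    print("many shots, wide spread:", detection_rate(shots=80, gap=1.0, spread=2.0))
    print("few shots, tight spread:", detection_rate(shots=5, gap=1.0, spread=0.5))
    print("many shots, tiny gap:   ", detection_rate(shots=80, gap=0.2, spread=2.0))
```

Under these assumptions, the run with many shots should flag the misalignment far more often than the run with few shots, tightening the spread should help in the same way, and shrinking the gap should make the difference hard to spot even with plenty of shots.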

All of these concepts are wrapped up in the idea of statistical significance. The scientists in the Ikea study found differences that were statistically significant, which means gaps that large would be unlikely to turn up by chance alone if men and women were truly the same. This calculation already takes into account how spread out the data were and how many people took part.

Of course, there remains a small chance that their results were a fluke. There’s always a small chance. That’s one reason social scientists have been pushing for money to replicate more studies. There’s widespread concern that researchers might publish only their fluke results, because fluke results get fame and attention. Repeated experiments would help rule out that possibility.
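
A small simulation can make the fluke problem concrete. The sketch below is purely illustrative and rests on stated assumptions: the names `looks_significant` and `fluke_rates` are hypothetical, every pretend study compares two groups that truly do not differ, scores follow a bell curve, and "significant" means passing a simple z-test with a 1.96 cutoff. None of it comes from the Norway study.

```python
import math
import random
import statistics


def looks_significant(rng, group_size):
    """Run one pretend study in which the two groups truly do not differ."""
    a = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
    b = [rng.gauss(0.0, 1.0) for _ in range(group_size)]
    # Standard error of the difference in averages when the spread is 1.
    std_error = math.sqrt(2.0 / group_size)
    return abs(statistics.fmean(b) - statistics.fmean(a)) > 1.96 * std_error


def fluke_rates(studies=10000, group_size=20, seed=1):
    """Return two shares: how many no-effect studies look significant,
    and how many of those flukes look significant again when repeated."""
    rng = random.Random(seed)
    flukes = 0
    repeated = 0
    for _ in range(studies):
        if looks_significant(rng, group_size):
            flukes += 1
            # A fresh, independent rerun of the same no-effect study.
            if looks_significant(rng, group_size):
                repeated += 1
    first_rate = flukes / studies
    repeat_rate = repeated / flukes if flukes else 0.0
    return first_rate, repeat_rate


if __name__ == "__main__":
    first_rate, repeat_rate = fluke_rates()
    print("share of no-effect studies that look significant:", first_rate)
    print("share of those flukes that replicate:", repeat_rate)
```

Under these assumptions, roughly one in 20 no-effect studies should clear the bar on its own, and a fluke should survive a repeat attempt at about the same low rate. That is why a second team getting the same answer is reassuring.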

Replication by different research teams also reassures people that there wasn’t experimenter error. Maybe the Norway team accidentally advantaged the men in a subtle way. Maybe the gender of the person giving the test affected people. Maybe subconsciously the researchers were nicer to the men and meaner to the women. And remember that these subjects were all Norwegian 20-somethings — do we think that Norway is representative of humanity?

Follow-up studies would be able to address these and other concerns by changing the context of the experiment and seeing if the same result shows up.

That’s kind of what we did here. In our shoddy science experiment, we were interested in what factors, aside from mental rotation ability, affect a person’s ability to assemble Ikea furniture. It turns out that this kitchen cart had a lot of tiny, fiddly screws. Tom and I struggled with them, while Carolyn and Jenny seemed a lot more dexterous.

Doing the competition in teams may also have benefited the women. Studies tend to show that women make better teammates because they are more likely to communicate well. “I really like working in teams, and felt like having a partner helped a lot even on such a small task,” Jenny said in a follow-up email.

Tom, in a very politely worded message, blamed me for our failure.

“I think yesterday confirmed for me that I can’t put together IKEA furniture with anyone. I have to do it by myself, following the instructions step by step,” he said.