In July of 1846, a young doctor began his residency at the General Hospital in Vienna, Austria. Almost immediately, he was confronted with a morbid puzzle.

Ignaz Semmelweis learned that the number of maternal deaths from what was then called "childbed fever" was three times higher in the hospital's First Division maternity ward than in the Second Division. Interestingly, the first was staffed by physicians and medical students; the second was staffed largely by midwives.

The disease was horribly common in European maternity hospitals at the time, often resulting in death rates as high as 25 or 30 percent. Semmelweis suspected that it was spread by some "cadaveric matter" that surgeons and students carried on their hands from the autopsy and dissection room into the maternity ward. The midwives did not perform autopsies. In the fall of 1847, he put his theory to the test, insisting that all students and surgeons scrub their hands with chlorinated lime before entering the maternity ward.

By the following year, the death rate in the First Division had fallen more than tenfold, to a level even lower than that in the Second Division. Yet Semmelweis's superiors were not impressed; they even accused him of insubordination for his stubborn insistence on hand-washing, the practice now known as antisepsis.

In the course of introducing antisepsis to medicine, Semmelweis had also pioneered a practice that is now considered absolutely essential to research on living subjects: comparing the results of different treatments on two separate but equivalent groups.

Nowadays that is the gold standard for determining whether new drugs are effective. But it would not become the norm until 100 years after Semmelweis.

As recently as the 1930s, pill-peddlers could make wild claims about their products and sell them without prior testing. Products such as Radam's Microbe Killer (99.381 percent water) were advertised as cures for everything from measles to cancer. A remedy to soothe teething babies might contain opium. And manufacturers were not required to list the contents of their "secret formulas."

Unfortunately, it took a series of highly publicized medical disasters to change all that. In 1937, 107 Americans, mostly children, died after taking Elixir of Sulfanilamide, a medicine used to treat bacterial infections. The drug sulfanilamide itself was safe. But a chemist at the Samuel E. Massengill Company added a solvent, diethylene glycol, that made the solution poisonous. The following year, the Federal Food, Drug and Cosmetic Act was signed into law by President Franklin D. Roosevelt, giving the Food and Drug Administration expanded powers to keep dangerous drugs off the market.

Ever since, drugs approved in the United States have had to go through extensive testing, first in animals and then in humans (the human studies are called "clinical trials"), to show that they actually work and are safe. Of course, reports of deaths from new drugs still make headlines. Some are accidents. But many represent a trade-off between risk and benefit. For example, last March an FDA panel recommended that a new diabetes drug stay on the market, even though it has been linked to the deaths of 28 people. That controversial drug, Rezulin, is known to cause serious liver damage in some patients. But it is also an extremely effective treatment for people who have few other options.

How do scientists know when a drug is helping people to get better? The body has a remarkable capacity to fight off disease on its own. We all get the flu periodically and recover a few days later, whether or not we take advantage of the latest over-the-counter remedies.

But scientists want to know not only that you got better, but that you got better because of the drug. So they study two groups of patients: one taking the new treatment (the "experimental" group) and one taking the old treatment or no treatment at all (the "control" group). If more patients recover in the experimental group than in the control group, that is evidence that the new treatment is better. The FDA also looks for a few additional features that are central to the modern clinical trial: randomization, blinding and adequate sample size.

Random Assignment

In a "randomized" trial, subjects are assigned through a random process, somewhat like flipping a coin, to either the experimental or control group. This procedure helps prevent certain kinds of bias from skewing the results.

Imagine that a small pharmaceutical company has developed a new remedy for bad breath called FreshAir. Unbeknownst to the company, the product doesn't work in smokers. Suppose that when the company tests it in a clinical trial, most of the smokers happen to end up in the FreshAir group (perhaps they were on a smoking break when the control group was being filled). The control group gets nothing. Run the test in these circumstances and FreshAir won't look very effective. Yet that is only because the comparison is unfair.

Suppose instead that the investigators flip a coin for each subject to decide whether he should go into the FreshAir group or the control group. Each smoker has a 50 percent chance (heads or tails) of ending up in either group. So we expect that about half of the smokers will end up in the experimental group and half in the control group.

Researchers hope that the two groups will be as similar as possible in all respects except for the fact that only one group gets the experimental medicine.

Alternatively, researchers could simply count how many smokers are in each group and make sure they are equally balanced. In fact, they often do this for factors that they know to be important, such as smoking status, age and sex. But there are countless other things, such as diet, blood pressure or genetics, that might affect how a drug acts. It is impossible to anticipate all of these things or to know which ones are important, so researchers rely on randomization to produce balanced groups.
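
To see how the coin flip balances the groups, here is a minimal simulation sketch in Python; the 200 subjects and 80 smokers are made-up numbers, and FreshAir is the hypothetical remedy from the example above.

```python
import random

# Minimal sketch: assign 200 hypothetical subjects, 80 of whom smoke,
# to the FreshAir group or the control group by a simulated coin flip,
# then check how evenly the smokers split between the two groups.
random.seed(1)

subjects = [{"id": i, "smoker": i < 80} for i in range(200)]

freshair_group, control_group = [], []
for person in subjects:
    # The "coin flip": each subject has a 50 percent chance of either group.
    if random.random() < 0.5:
        freshair_group.append(person)
    else:
        control_group.append(person)

print("FreshAir group:", len(freshair_group), "subjects,",
      sum(p["smoker"] for p in freshair_group), "of them smokers")
print("Control group: ", len(control_group), "subjects,",
      sum(p["smoker"] for p in control_group), "of them smokers")
```

Run it with different seeds and the smokers land close to half-and-half almost every time, without anyone having counted them.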

Randomization should not be confused with random sampling. Political opinion pollsters, for example, want to question a group that is statistically representative of the entire population. If they want to know what percentage of the population believes that the president should be impeached, they identify people at random to participate in the poll. In contrast, medical researchers do not choose experimental participants at random, and subjects in clinical studies may not be representative of the entire population. Traditionally, women, minorities and children have been underrepresented in clinical research.

Better Not to Know

Suppose you are participating in a drug study at the National Institutes of Health. Your medicine comes in the form of a pink capsule; so does everyone else's. A numerical code identifies the contents, but neither you nor the clinical staff know how to decipher it. Only when the study is over will you or your doctor find out whether you got the experimental drug or a "dummy" control pill. That is, both of you are "blind" to the information. This elaborate setup helps to prevent personal biases of participants from affecting the study findings.
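
As a rough illustration (the code numbers and group sizes here are invented, not an actual NIH procedure), the bookkeeping behind such a blinded study amounts to something like this:

```python
import random

# Rough sketch of double-blind bookkeeping (the codes and group sizes are
# invented, not an actual NIH procedure): each supply of pink capsules
# carries only a numerical code; the key linking codes to contents is
# sealed away until the study ends.
random.seed(7)

n_subjects = 20
contents = ["experimental drug" if random.random() < 0.5 else "placebo"
            for _ in range(n_subjects)]
codes = [str(1000 + i) for i in range(n_subjects)]

key = dict(zip(codes, contents))   # held by a third party until unblinding
# Subjects and clinical staff see only the coded labels:
print(codes[:5])
# Only after the trial is the code broken, e.g.:
# print(key["1003"])
```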

Subjects in a drug study sometimes have strong beliefs about which treatment they think is better. A cancer patient who has tried everything but is still sick may believe the latest experimental drug is the last hope and be absolutely convinced it will work before the study even begins. Researchers are not unbiased either. They want to see a publishable positive result to justify all their work. So wishful thinking can influence how both subjects and researchers report their observations, even if they don't intend it to.

"Blinding" was first advocated in the 1930s by a group of U.S. drug researchers who were suspicious of the tests carried out by major pharmaceutical manufacturers. Drug firms have a clear financial interest in how their experiments turn out because they want to sell more products. Medical researcher Harry Gold and his colleagues at Cornell University Medical College suggested that blinding would make clinical tests more like rigorous and objective laboratory experiments.

They introduced another element as well: the placebo (Latin for "I will please"). A placebo is an inactive substance, such as a sugar pill, sometimes given to the control group in place of the experimental therapy. The placebo tests whether the experimental drug's effect is merely due to the often powerful "placebo response," in which some patients get better simply by believing they have been treated. In a recent and controversial study of a surgical treatment for Parkinson's disease, the control group was given fake surgery. Some of those patients reported substantial improvement and were sure that they had received the real treatment.

Size Matters

In clinical trials, bigger is definitely better. Even the strictest randomized, blinded, placebo-controlled trial will be seriously flawed if it includes too few subjects. There are two common mistakes you can make in interpreting experimental results: you can see a difference between two treatments when there is no real difference, or you can fail to see a difference when there really is one. Increasing the size of the study makes both less likely.

Suppose you run a test of the experimental drug Cure-All in a group of four people with the sniffles. Two get Cure-All; two get nothing. At the end of the study, the two who took Cure-All have stopped sniffling and the two controls remain ill. Mathematically speaking, 100 percent of the experimental group has recovered.

However, such patterns can easily crop up entirely by chance. Roll a die 10 times and you might well see a run of sixes that looks amazing or meaningful. But it doesn't mean anything; it's just an accident of random variation that occurred because 10 rolls isn't much of a sample. The law of large numbers says that the more times you roll the die, the closer the overall proportion of sixes will settle toward one in six, and the less a short lucky streak will distort the result.

In the trial of Cure-All, it might be by chance that the Cure-All group did better. Perhaps those two people just happened to be healthier overall. In a larger study, it is less likely that all the healthy people will end up in one group, assuming that they were randomly assigned.
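
A quick simulation sketch makes the point with the made-up Cure-All numbers, assuming a completely useless drug and a 50-50 chance of shaking off the sniffles unaided:

```python
import random

# Quick sketch: a drug with no effect at all, tested on 2 treated and 2
# untreated people, each of whom has a 50 percent chance of recovering
# on their own. How often does pure luck produce the "perfect" result
# described above, with both treated subjects recovering and neither
# control recovering?
random.seed(0)

def tiny_trial():
    treated = [random.random() < 0.5 for _ in range(2)]
    controls = [random.random() < 0.5 for _ in range(2)]
    return all(treated) and not any(controls)

runs = 100_000
perfect = sum(tiny_trial() for _ in range(runs))
print(f"useless drug looks 100 percent effective in {perfect / runs:.1%} of runs")
# About 6 percent of the time (0.5 ** 4 = 0.0625) -- hardly a rare fluke.
```

With even 20 or 30 subjects in each group, such a perfect fluke becomes vanishingly rare.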

Researchers also want to make sure they don't miss important but subtle effects. Suppose you want to know whether giving extra milk every day to 5-year-old children will make a difference in how much they grow, on average, over one year. The difference will be small and the children will vary a lot. Thus, a large sample group -- probably several hundred -- will be needed.

How many subjects are enough? There is no magic number. It depends on how big an effect you are looking for and how confident you want to be about your results. Generally, a randomized trial involves more than 100 patients in order to be able to tell the difference between a real effect and chance. (However, small studies involving 20 to 40 patients are usually carried out to test the safety of a new drug before it is given to larger groups of patients.)

Most questions that clinical researchers address today involve complex and subtle effects that require large trials. For example, a trial looking for a 5 percent difference in the death rate associated with two heart disease drugs might require thousands of participants. Large studies are also needed to find rare side effects that may occur in only one in a thousand patients.
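
For readers who want to see where such figures come from, here is a back-of-the-envelope sketch built on a textbook sample-size formula for comparing two proportions; the particular death rates are assumptions chosen for illustration, not data from any actual heart-disease trial.

```python
from math import sqrt

# Back-of-the-envelope sketch using a textbook sample-size formula for
# comparing two proportions (two-sided 5 percent significance, 80 percent
# power). The death rates below are assumptions for illustration only.
Z_ALPHA = 1.96   # two-sided 5 percent significance level
Z_BETA = 0.84    # 80 percent power

def per_group(p1, p2):
    """Approximate number of subjects needed in EACH group to tell p1 from p2."""
    p_bar = (p1 + p2) / 2
    numerator = (Z_ALPHA * sqrt(2 * p_bar * (1 - p_bar))
                 + Z_BETA * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return numerator / (p1 - p2) ** 2

print(round(per_group(0.20, 0.15)))   # roughly 900 per group for a 5-point gap
print(round(per_group(0.20, 0.19)))   # roughly 25,000 per group for a 1-point gap
```

With these assumed rates, a five-point gap already calls for nearly 2,000 participants in all; narrower gaps quickly push the total into the tens of thousands.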

Why More Drug Scares?

Even a large, randomized, controlled trial is no guarantee of safety. Rare side effects may be hard to detect in even the largest trials. A drug may act differently in particular situations, such as when it is being taken along with another drug. In 1997, a scandal erupted over the weight-loss treatment called "fen-phen," a regimen of two drugs that turned out to be harmful when used together. They had not been tested together in clinical trials.

Sometimes problems are not identified until a product is already on the market. The FDA watches for unwelcome effects even after a drug is approved. The market is, at least in this sense, the final test. The FDA also has the authority to remove a drug from the market if it poses unnecessary risk.

There is always some risk in taking any medicine. During the recent Rezulin controversy, a researcher told CNN that "if one death is too many, then yes, take Rezulin off the market. But then you must also take off . . . insulin . . . Motrin, aspirin, Tylenol and many other medications used to treat patients with cancer and HIV."

Yet there is no doubt that the modern clinical trial has greatly reduced those risks from what they were at the start of the century. And the benefits of drug treatment have grown with scientific knowledge, so that the risks are more often worth taking.

Mark Parascandola is a fellow in the Department of Clinical Bioethics at the National Institutes of Health.

Enough Is Enough?

How large is a large enough sample? One statistician calculated that a trial has to have 50 patients before there is even a 30 percent chance of finding a 50 percent difference in results.

Sometimes very large populations are indeed needed. If some kind of cancer usually strikes three people per 2,000, and you suspect that the rate is quadrupled in people exposed to substance X, you would have to study 4,000 people for the observed excess rate to have a 95 percent chance of reaching statistical significance.

The likelihood that a 30- to 39-year-old woman will suffer a heart attack while taking an oral contraceptive is about 1 in 18,000 per year. To be 95 percent sure of observing at least one such event in a one-year trial, you would have to observe nearly 54,000 women.
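
The arithmetic behind that last figure is easy to check; the only input in the short calculation below is the 1-in-18,000 annual rate quoted above.

```python
from math import ceil, log

# Checking the oral-contraceptive figure: if heart attacks occur at a rate
# of 1 in 18,000 women per year, how many women must be followed for one
# year to have a 95 percent chance of seeing at least one such event?
p = 1 / 18_000
# We need 1 - (1 - p) ** n >= 0.95, that is, (1 - p) ** n <= 0.05.
n = ceil(log(0.05) / log(1 - p))
print(n)   # about 53,900 -- "nearly 54,000," as stated above
```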

All this means that you must often ask, What's your denominator? What's the size of your population? A rate is only a figure.

After all, some researchers reportedly announced a new treatment for a disease of chickens by saying, "33.3 percent were cured, 33.3 percent died, and the other one got away."

-- excerpted from News & Numbers by Victor Cohn (Iowa State University Press, 1990)