A scorecard for AI-designed drug molecules just dropped.

This week, Microsoft researchers published the first comprehensive scorecard for AI-designed drug molecules: four tables, dozens of AI systems, hundreds of disease targets. Every row is a real lab experiment, not a computer prediction. Until now, no one had pulled all these results into one place.

What an AI "drug molecule" actually is

AI drug design is a catch-all phrase that hides three distinct jobs. The scorecard sorts them out.

Binders are small-to-medium proteins engineered to stick to a specific target on a cell surface or inside a cell. They are used as research tools, for drug delivery, and as the basis for more complex therapeutic molecules. A binder that sticks to a cancer protein, for example, can be linked to a drug payload that gets delivered only where it is needed.

Antibodies are a specific kind of binder, the same Y-shaped molecules your immune system makes naturally. They are the basis for roughly a quarter of all prescription drugs, including Keytruda (cancer), Humira (arthritis), and Dupixent (asthma). Designing an antibody from scratch used to require evolution-like processes in the lab, taking years.

Enzymes are the proteins that speed up (catalyze) chemical reactions in living things. Designed enzymes can break down plastics, generate complex molecules that are hard to make chemically, or edit DNA.

Each of these is a different problem for AI, and the scorecard shows they are at different maturity levels.

The headline numbers, translated

Binders work best. Several AI systems now design working binders against common disease targets at rates between 10% and 80%. One target, a viral protein called BHRF1 (from the Epstein-Barr virus), has been hit at 88% by a system from Google DeepMind. A Cas9-targeting binder (Cas9 is the gene-editing protein) hit 100% in a small test.

Antibodies work inconsistently. The best systems hit some antibody targets 15% to 70% of the time but fail completely on others. The variability is the story. There is no general recipe yet.

Enzymes are harder. A few systems reach high rates on specific reaction types (one group designed 57 working enzymes out of 63 attempts for a specific reaction), but the total space of reactions they can handle is narrow. Most of the reactions that enzymes carry out in biology are still out of reach for AI design.
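
For readers who want to check the arithmetic, here is a minimal Python sketch of how a hit rate like "57 out of 63" turns into a percentage, and how much uncertainty the number of attempts leaves. The function names are ours, and the Wilson score interval is our choice of method, not something the scorecard prescribes. The 5-out-of-5 case is a made-up illustration of a small test, not a figure from the paper.

```python
import math


def hit_rate(hits: int, attempts: int) -> float:
    """Fraction of designed molecules that worked in the lab."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= hits <= attempts:
        raise ValueError("hits must be between 0 and attempts")
    return hits / attempts


def wilson_interval(hits: int, attempts: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a hit rate (z = 1.96)."""
    p = hit_rate(hits, attempts)
    denom = 1 + z**2 / attempts
    center = (p + z**2 / (2 * attempts)) / denom
    half = z * math.sqrt(p * (1 - p) / attempts + z**2 / (4 * attempts**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot.
    return max(0.0, center - half), min(1.0, center + half)


# The enzyme result quoted above: 57 working designs out of 63 attempts.
low, high = wilson_interval(57, 63)
print(f"57/63: {hit_rate(57, 63):.1%} (plausible range {low:.0%} to {high:.0%})")

# Hypothetical small test (NOT from the scorecard): 5 working designs out of 5.
low, high = wilson_interval(5, 5)
print(f"5/5:   {hit_rate(5, 5):.1%} (plausible range {low:.0%} to {high:.0%})")
```

The point of the second case is that a perfect score in a handful of attempts is compatible with a much lower true rate, which is why a 100% result from a small test deserves more caution than a 90% result from dozens of attempts.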

The unbeaten target

TNF-α is a protein that sits at the center of rheumatoid arthritis, Crohn's disease, and other inflammatory diseases. Multiple existing drugs (Humira, Remicade, Enbrel) work by blocking it. That makes it one of the best-understood proteins in medicine.

That has not helped the AI. Four major AI systems produced more than 150 designed molecules between them, and not one worked. The only success on record is a single working molecule from a system called Chai-2, found in a large screen. That is it. TNF-α is the target that keeps beating the AI systems.

The likely reason: TNF-α has a small, diffuse surface that binders and antibodies have to grip. Traditional drug design tools handle it poorly too, which means the training data the AI learns from does not have many good examples. The AI is extrapolating, not interpolating. It is failing the way current AI usually fails: confidently, on a problem that looks simple but isn't.

What this is not saying

Every hit rate in the scorecard is measured against a lab-based test, not against a patient. A binder that sticks to its target is a first step. Real drugs have to cross many more bars: they have to reach the right tissue in the body, avoid triggering an immune reaction, not interact badly with other drugs the patient is taking, hold up in storage, and ultimately improve a clinical outcome. The scorecard is a map of the first step, not the finish line.

Why the scorecard itself is the news

For years, AI drug design press releases have emphasized the successes and skipped the failures. The scorecard does the opposite. It puts the failures next to the successes, with concrete numbers. That changes what a reader should expect from the next press release.

What we host

Our platform catalogs peptide drug candidates at different maturity levels: designed, computed, reproduced, synthesized, bioassayed. That ladder maps directly onto what the scorecard is measuring at the first few rungs. The next decade of AI drug design depends on more candidates climbing that ladder in public, with their failures visible alongside their successes. This paper is a snapshot of where the field is right now. Next year's version will look different.