PepForge uses HELM to generate modified peptides—and flags 799 new antimicrobial candidates

A new bioRxiv preprint argues that standard sequence models miss the chemistry of macrocycles and non-canonical monomers. HELM is the workaround.

Most “AI peptide design” models still think in strings of letters. That works fine for simple linear peptides made of the 20 standard amino acids. It breaks down fast once you step into the part of peptide drug discovery that actually pays the bills: non‑canonical monomers, lipidation, stapling, macrocyclization, and other engineered connections that turn a fragile peptide into something that survives in a body.

A new bioRxiv preprint introduces PepForge, a generative platform that treats peptides as macromolecules described in HELM (Hierarchical Editing Language for Macromolecules), not as plain sequences. The authors report a training set of 383,817 HELM peptides, followed by an antimicrobial demonstration in which 4.78 million generated peptides were narrowed to 799 structurally novel candidates by filtering on predicted potency and safety.

This is not a clinical story. It’s a tooling story—about how peptide discovery is trying to represent the chemistry it cares about. And it’s worth understanding, because the representation layer often decides what kinds of drugs get proposed in the first place.

HELM, in plain terms: why it exists

If you’ve ever looked at a peptide therapeutic label and wondered why it behaves nothing like a short protein fragment, the answer is usually modification. Drug-like peptides are often:

  • Macrocyclic (a ring closure constrains shape and can protect from proteases).
  • Stapled or crosslinked (extra connections that lock helices or turn motifs).
  • Built from non‑canonical monomers (D‑amino acids, β‑amino acids, N‑methylation, etc.).
  • Conjugated (lipids, PEG, dyes, targeting moieties).

A linear sequence like “ACDEFG…” doesn’t tell you where the ring closure sits or what the monomers actually are. HELM was designed for that gap: a formal notation that can encode monomer identity and special connections.
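To make that concrete, here is a minimal Python sketch that pulls apart a simplified HELM string. The example string (a six-residue peptide with one D-alanine and a side-chain link between positions 1 and 6) is made up for illustration, and the helper name `parse_helm` is hypothetical. Real HELM has more features than this handles (inline SMILES, groups, annotations), so a real workflow would rely on a proper HELM toolkit.

```python
import re

def parse_helm(helm: str) -> dict:
    """Split a simplified HELM string into polymers and connections."""
    # Sections are '$'-separated: polymers first, then connections.
    # Later sections (groups, annotations, version) are ignored in this sketch.
    sections = helm.split("$")
    polymers = {}
    for chunk in filter(None, sections[0].split("|")):
        name, body = chunk.rstrip("}").split("{", 1)
        # Monomers are '.'-separated; multi-character IDs sit in [brackets].
        tokens = re.findall(r"\[[^\]]+\]|[^.\[\]]+", body)
        polymers[name] = [t.strip("[]") for t in tokens]
    connections = []
    conn_section = sections[1] if len(sections) > 1 else ""
    for conn in filter(None, conn_section.split("|")):
        src, dst, bond = conn.split(",")
        left, right = bond.split("-")
        (src_pos, src_r), (dst_pos, dst_r) = left.split(":"), right.split(":")
        connections.append((src, int(src_pos), src_r, dst, int(dst_pos), dst_r))
    return {"polymers": polymers, "connections": connections}

# Illustrative string: monomer 2 is non-canonical, and positions 1 and 6 are linked.
example = "PEPTIDE1{C.[dA].D.E.F.C}$PEPTIDE1,PEPTIDE1,1:R3-6:R3$$$"
parsed = parse_helm(example)
print(parsed["polymers"])
print(parsed["connections"])
```

The point is that the ring closure and the D-amino acid are both explicit in the string, which a plain one-letter sequence cannot express.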

What PepForge claims to do

According to the preprint, PepForge uses a Layout‑Content‑Connection (LCC) cascade that breaks generation into three steps: (1) block layout, (2) monomer content, and (3) prediction of special connections. The authors write that the cascade is trained on 383,817 HELM peptides spanning 425 monomers and nine connection types.
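The three-stage split can be pictured as three functions called in order, each conditioned on the output of the previous one. The sketch below is a toy stand-in that makes random choices, not the authors' model: the function names, the tiny monomer vocabulary, the reading of a "block" as a chain segment, and the single ring-closure rule are all assumptions for illustration.

```python
import random

# Hypothetical toy vocabulary; the preprint reports 425 monomers.
MONOMERS = ["A", "C", "K", "L", "dA", "meA"]

def sample_layout(rng):
    """Stage 1: decide how many blocks there are and how long each one is."""
    return [rng.randint(4, 8) for _ in range(rng.randint(1, 2))]

def sample_content(layout, rng):
    """Stage 2: fill every block with monomers, given the layout."""
    return [[rng.choice(MONOMERS) for _ in range(n)] for n in layout]

def sample_connections(blocks, rng):
    """Stage 3: propose special connections, given layout and content."""
    conns = []
    for i, block in enumerate(blocks):
        if rng.random() < 0.5:  # toy rule: sometimes link first and last position
            conns.append((i, 1, i, len(block)))
    return conns

def generate(seed):
    """Run the three stages in order with a seeded random generator."""
    rng = random.Random(seed)
    layout = sample_layout(rng)
    blocks = sample_content(layout, rng)
    connections = sample_connections(blocks, rng)
    return {"layout": layout, "blocks": blocks, "connections": connections}

peptide = generate(seed=0)
print(peptide)
```

What matters in this picture is the ordering: connections are chosen last, once the positions they could attach to already exist.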

One practical claim is important for real workflows: the system supports masked infilling—changing a scaffold without redesigning the whole molecule. That’s how peptide chemistry often works in the lab: keep the scaffold, tweak a position, move a linkage, and see what breaks.
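Masked infilling is easy to picture with a toy example: fixed positions are copied through unchanged, and only the masked slots are resampled. The sketch below assumes a hypothetical `MASK` token and uses a random sampler in place of a trained model, so it shows the constraint rather than PepForge's actual method.

```python
import random

MASK = "?"  # hypothetical placeholder for a position the model may change
CANDIDATES = ["A", "K", "L", "dA", "meA"]  # toy vocabulary, not the preprint's

def infill(scaffold, rng):
    """Return a copy of the scaffold with only the masked positions replaced."""
    return [rng.choice(CANDIDATES) if m == MASK else m for m in scaffold]

scaffold = ["C", "A", MASK, "E", MASK, "C"]
variant = infill(scaffold, random.Random(1))
print(variant)
```

Positions 1, 2, 4, and 6 come back exactly as they went in; only positions 3 and 5 are open to change.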

The antimicrobial demo: 4.78 million designs to 799 hits

To show downstream use, the authors built an ensemble predictor of antimicrobial potency, trained on 11,026 peptides with MIC (minimum inhibitory concentration) values, and compared it against an external predictor (PeptiVerse). They report generating 4.78 million novel HELM peptides and narrowing them to 799 structurally novel antimicrobial candidates after potency and safety filtering.

The headline number is seductive. The more useful question is: what does “hit” mean here? In the preprint, “hit” is computational—high predicted potency and acceptable predicted safety under their filters. That’s a legitimate first step, but it’s not a culture plate, an animal model, or a dosing regimen.
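In code, a computational "hit" is nothing more than a record that clears a set of thresholds. The sketch below is a hedged illustration: the field names, the units, the cutoff values, and the three example records are invented, and the preprint's actual filters may be structured differently. The last lines simply put the reported counts in proportion.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    predicted_mic_ug_ml: float  # lower predicted MIC means higher predicted potency
    predicted_toxicity: float   # hypothetical 0-1 score; higher is worse
    is_novel: bool              # outcome of some novelty check against a reference set

def is_hit(c, max_mic=8.0, max_tox=0.2):
    """A candidate is a 'hit' only if it passes every filter. Thresholds are made up."""
    return c.is_novel and c.predicted_mic_ug_ml <= max_mic and c.predicted_toxicity <= max_tox

candidates = [
    Candidate("pep-1", 4.0, 0.10, True),
    Candidate("pep-2", 4.0, 0.60, True),   # fails the safety filter
    Candidate("pep-3", 64.0, 0.05, True),  # fails the potency filter
]
hits = [c.name for c in candidates if is_hit(c)]
print(hits)  # ['pep-1']

# Reported funnel: roughly 4.78 million generated, 799 retained.
generated, retained = 4_780_000, 799
print(f"{retained / generated:.4%} retained, about 1 in {round(generated / retained):,}")
```

On the reported figures, that is roughly one retained candidate for every six thousand generated, and every one of those decisions is made by a predictor rather than an assay.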

Why HELM-native generation matters for peptide therapeutics

There are two ways peptide discovery stalls. One is biology. The other is representation.

If your model can only output linear sequences over 20 letters, it will keep suggesting peptides that are easy to write down, not peptides that are easy to turn into drugs. A HELM-native generator is an attempt to reverse that bias: start from the language medicinal chemists actually use for modified peptides.

In that sense, PepForge is less like a new “peptide idea machine” and more like a bridge between machine learning and peptide chemistry’s real design space.

How to read this preprint without over-reading it

bioRxiv is a useful place to see tools early. It’s also where methods can look stronger on paper than in practice.

Questions that matter before you believe the headline

  • Data provenance: Where do the 383,817 HELM peptides come from, and how are duplicates and near-duplicates handled?
  • Novelty definition: “Structurally novel” compared to what reference set?
  • Predictor leakage: Does the MIC predictor share training families with the generator’s data?
  • Safety filtering: What assays are they approximating, and what failure modes are invisible to the filters?

Those are not “gotchas.” They’re the normal due diligence steps when a generative model outputs a large number of candidates.
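Two of those questions, near-duplicates and the novelty reference set, come down to a similarity measure and a cutoff. One simple option, offered here as an assumption rather than the preprint's method, is edit distance over monomer tokens, normalized by length. Note that this toy measure ignores connections entirely, which is exactly the kind of blind spot worth asking about.

```python
def edit_distance(a, b):
    """Levenshtein distance between two monomer lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute, or match at no cost
        prev = curr
    return prev[-1]

def novelty(candidate, reference_set):
    """Smallest normalized distance to any reference peptide (0 means exact duplicate)."""
    return min(edit_distance(candidate, ref) / max(len(candidate), len(ref))
               for ref in reference_set)

# Hypothetical reference set of two peptides, written as monomer lists.
reference = [["C", "A", "D", "E", "F", "C"], ["K", "L", "K", "L"]]
duplicate_score = novelty(["C", "A", "D", "E", "F", "C"], reference)
near_duplicate_score = novelty(["C", "dA", "D", "E", "F", "C"], reference)
print(duplicate_score, near_duplicate_score)
```

Whether a one-monomer change counts as "structurally novel" then depends entirely on where the cutoff is set, which is why the reference set and threshold deserve scrutiny.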

Where this goes next: from modified peptides to medicines

Antimicrobial peptides (AMPs) are a good demo domain because MIC data exists at scale. But the more consequential destination for HELM-based generation is likely modified peptide therapeutics—the macrocycles and conjugates designed to hit difficult targets with a drug-like profile.

That’s also the area where any model has to collide with manufacturing, stability, and regulatory realities. A peptide that looks great in silico but can’t be synthesized cleanly, characterized, or reproduced at scale doesn’t survive long.

Frequently Asked Questions

What is HELM in peptide design?

HELM (Hierarchical Editing Language for Macromolecules) is a notation system that can describe complex macromolecules—like modified peptides—by encoding non-standard monomers and special connections (such as cyclizations) that a plain amino-acid sequence can’t capture.

Does “799 antimicrobial candidates” mean 799 new drugs?

No. In the PepForge preprint, the candidates are computational hits after predictive scoring and filtering. They would still need synthesis and experimental validation, and most will fail somewhere along that pipeline.

Why would peptide clinicians care about a generative model paper?

Because model outputs influence which peptide scaffolds get attention and investment. A shift toward HELM-native generation can steer early discovery toward modified peptides that are more plausible drug candidates than simple linear sequences.
