PhD researcher at UCL's Barnes Lab. Encode: AI for Science Fellow (Pillar VC × ARIA). Previously Tempus AI, Snorkel AI, Databricks.
Genomic AI has a usability problem. We have foundation models that can read and write DNA, but almost no one outside a handful of ML labs can use them. My research combines DNA language models with reinforcement learning to design synthetic biology constructs that work on the first try — and to make those models accessible to working biologists through human-readable interfaces. The long-term goal is a closed-loop system where models propose sequences, wet-lab experiments validate them, and the results train the next generation of models — collapsing the cost and timeline of biological engineering.
I think computational biology is approaching its ChatGPT moment — the point where interface design, not just model capability, determines who gets to use these tools. I'm building toward that.
Last updated: April 2026.
A conditional Mixture-of-Experts language model for plasmid DNA.
PlasmidLM is a generative model trained on hundreds of thousands of natural plasmid sequences. It uses a conditional MoE architecture to specialize across functional sequence classes (origins, selection markers, regulatory elements) and generates novel, biologically plausible constructs from natural-language or structural prompts.
GRPO post-training for biologically realistic DNA generation.
PlasmidRL applies GRPO-based reinforcement learning to PlasmidLM, using composite reward signals from sequence alignment, motif scoring, and structural priors. The result is a model whose outputs exhibit emergent biological realism — replication origins in the right places, codon usage that matches host organisms, and regulatory architecture that holds up under expert review. Currently under review at ICML 2026.
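For readers unfamiliar with GRPO, here is a minimal sketch of the group-relative advantage step, under stated assumptions. It is not PlasmidRL's actual code. The component names mirror the three reward signals mentioned above, but the weights, the score values, and the assumption that each score lies in [0, 1] are hypothetical and chosen only for illustration. The sketch uses the population standard deviation; some implementations use the sample standard deviation instead.

```python
from statistics import mean, pstdev

# Hypothetical weights for the three reward components; not taken from PlasmidRL.
WEIGHTS = {"alignment": 0.5, "motif": 0.3, "structure": 0.2}


def composite_reward(scores: dict[str, float]) -> float:
    """Weighted sum of per-component scores (each assumed to lie in [0, 1])."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def group_relative_advantages(rewards: list[float], eps: float = 1e-8) -> list[float]:
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four made-up score sets, standing in for four sequences sampled from one prompt.
group = [
    {"alignment": 0.9, "motif": 0.8, "structure": 0.7},
    {"alignment": 0.4, "motif": 0.5, "structure": 0.6},
    {"alignment": 0.7, "motif": 0.2, "structure": 0.9},
    {"alignment": 0.1, "motif": 0.3, "structure": 0.2},
]

rewards = [composite_reward(scores) for scores in group]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.3f}  advantage={advantage:+.3f}")
```

Because advantages are computed relative to the other samples in the same group, no separate learned value model is needed to provide a baseline. Sequences that score above the group mean are reinforced and those below it are penalized.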
An end-to-end neoantigen vaccine design pipeline.
ChatNAV is an 11-module pipeline that takes patient sequencing data and outputs ranked, manufacturable mRNA vaccine candidates. It integrates variant calling, HLA typing, MHC binding prediction, structural scoring (PANDORA, AlphaFold2-Multimer), and polyepitope optimization behind a single FastAPI backend. Built to make personalized cancer vaccine design accessible to clinical research labs without bioinformatics infrastructure.
Models that design DNA, robots that build it, data that trains the next model.
In collaboration with Twig Bio, this project pairs PlasmidLM-generated constructs with high-throughput expression assays (GFP fluorescence, AlphaLISA) to create a self-improving design loop. It is currently the subject of a Google.org AI for Science Impact Challenge proposal.
Emergent Biological Realism in RL-Trained DNA Language Models
Under review at ICML 2026
GRPO post-training of PlasmidLM produces sequences with emergent structural and functional realism.
Designing Minimal E. coli Genomes Using Variational Autoencoders
Cell Systems — revision in progress
Machine Learning Scientist
Founding member of the generative AI team at one of the largest precision medicine companies in the US. Worked on LLM applications over clinical and genomic data.
Senior Machine Learning Engineer
Built synthetic data and RLHF infrastructure used by Fortune 500 customers to fine-tune and align production language models.
Founder
Founder of an AI-native lead development platform for wealth managers. Acquired by Praxis Solutions.
B.A. Data Science, minor in Bioengineering
I work with teams on production ML systems — particularly around Databricks, MLflow, and LLM evaluation. Recent engagements include LLM-as-a-judge infrastructure and agentic application evaluation. If you're working on something hard in this space, get in touch.
Outside research: basketball, padel, and a standing interest in the London and Bay Area startup scenes.
me [at] mcclainthiel [dot] com