Evaluating omics agents — the final answer isn't enough
Errors in biological analysis are quiet
After an omics agent finishes a task, it may hand back a figure, a gene list, and an explanation. The page looks complete, and the language sounds confident.
The problem is that errors in biological data analysis rarely crash the way software bugs do. The wrong normalization method, an unhandled batch effect, a sample sheet misaligned with the expression matrix, a sign convention in a database read backwards — any of these can still produce a perfectly presentable report.
What makes these errors dangerous is that they keep moving downstream. A plausible-looking list of differentially expressed genes may feed the next round of experimental design; an essentiality conclusion with the sign flipped may shift a target call; a notebook with no recorded parameters is hard for someone else to reproduce a few weeks later.
So evaluating an omics agent can't only ask "is the final answer right?" That question matters, of course, but it isn't enough. You also have to look at how it reads the data, how it picks methods, whether it leaves intermediate results behind, and exactly where it fails when it fails.
That's why we took another look at BixBench-Verified-50. The score is only the entry point; the process is the part of an evaluation that's actually useful.
The discussion here is limited to research software and benchmark results. BixBench-Verified-50 and ovagent-bench are both research tasks, not clinical-care tasks; any patient data or clinical cohort mentioned refers only to lawfully authorized, de-identified research-data analysis, and does not involve individual diagnosis, prescription, or treatment decisions.
Evaluate the benchmark before the agent
BixBench is one of the evaluations that comes closest to real bioinformatics work today. It isn't ordinary knowledge Q&A — it asks the agent to handle real data and then answer short-form analysis questions.
But bioinformatics evaluation has an inherent difficulty: many questions have no single computational path. For the same differential-analysis task, normalization, filtering thresholds, software versions, and floating-point precision can all change the final number. In a phylogenetics task, how the median is taken and how the distance matrix is handled can also change the result.
Phylo Bio makes this point well in "Evaluating AI Agents in Biology": not every failure is the agent's failure. Roughly, there are two kinds.
One is a genuine capability gap. The agent didn't understand a biological concept, or chose the wrong analysis method. That should count as the agent's failure.
The other is a problem with the benchmark itself. The question fixes no key convention, the reference answer isn't reproducible, or the scoring only accepts one surface string. In that case, a lower score doesn't necessarily mean the agent can't do the task — it means the question didn't provide enough information.
That's where BixBench-Verified-50 helps. It distills 50 reviewed questions from the original BixBench, removing some of the prompt and answer noise. It still isn't a perfect answer key, but it's better suited than the original set for observing how an agent performs on real analysis tasks.
The OmicOS evaluation starts from the same premise. We don't want to promote a single final score. What matters more is going through the failures question by question: which are real OmicOS gaps, which are unfixed prompt conventions, and which are just numerical differences from software-implementation details.
OmicOS results on the verified subset
OmicOS scores on BixBench-Verified-50:
45 / 50 = 90.0%
Placed alongside the BixBench-Verified-50 comparison table published by Phylo Bio, the result looks like this:
| Agent | BixBench-Verified-50 | Backend model |
|---|---|---|
| OmicOS (this work) | 90.0% | deepseek-v4-pro; architecture not model-bound |
| Biomni Lab | 88.7% | Claude (closed frontier model) |
| Edison Analysis | 78.0% | Claude (frontier model) |
| Claude Code (Opus 4.6) | 65.3% | Claude |
| OpenAI Agents SDK (GPT-5.2) | 61.3% | GPT-5.2 |
This result shows one thing: on this verified subset, OmicOS can already complete most short-form bioinformatics analysis tasks reliably.
But it can't show everything. A 90% number doesn't tell us which of the 5 failures are real errors and which are unclear question conventions; nor does it tell us whether the system is being carried by one strong model. To answer those, we need to keep taking it apart.
What the five failed questions tell us
Of the 5 failed questions, 1 is a clear OmicOS capability gap. The other 4 come mainly from analysis conventions the prompt didn't fix, or from differences caused by software versions and numerical boundaries.
- Real capability gap: DepMap essentiality sign convention (1 item)
- Prompt-convention issues: PhyKIT, scipy, R spline, VCF choice (4 items)
The real capability gap appears in bix-16-q1. The question asks for the gene whose expression is most negatively Spearman-correlated with essentiality. The reference answer is CDKN1A; OmicOS returned CCND1.
The cause is specific: OmicOS computed the correlation directly between CRISPRGeneEffect and expression, but in the DepMap context essentiality usually corresponds to -CRISPRGeneEffect, since a more negative gene-effect score indicates a stronger dependency. With the direction reversed, the answer drifts. This isn't a formatting issue or a tolerance issue; it's a domain convention that wasn't handled correctly.
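To make the sign issue concrete, here is a minimal sketch in plain Python. The gene names and values are toy data invented for illustration, not the benchmark inputs, and the `spearman` helper is a small stand-in written for this example rather than the code OmicOS ran. It only shows that "most negatively correlated" picks a different gene depending on whether the raw gene-effect score or its negation is used.

```python
from statistics import mean


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman rho computed as the Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


def most_negative(expression, target):
    """Return the gene with the lowest rho, plus all rho values."""
    rho = {g: spearman(expression[g], target[g]) for g in expression}
    return min(rho, key=rho.get), rho


# Toy per-cell-line values for three hypothetical genes.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [5.0, 4.0, 3.0, 2.0, 1.0],
    "GENE_C": [2.0, 1.0, 4.0, 3.0, 5.0],
}
# Gene-effect style scores: more negative means more essential.
gene_effect = {
    "GENE_A": [-0.1, -0.3, -0.5, -0.9, -1.2],
    "GENE_B": [-0.2, -0.4, -0.6, -0.8, -1.0],
    "GENE_C": [-0.5, -0.1, -0.3, -0.9, -0.7],
}
# Essentiality under the convention described above: the negated score.
essentiality = {g: [-v for v in vals] for g, vals in gene_effect.items()}

raw_pick, raw_rho = most_negative(expression, gene_effect)
ess_pick, ess_rho = most_negative(expression, essentiality)
print("using raw gene effect:", raw_pick)
print("using essentiality:  ", ess_pick)
```

Negating one variable flips the sign of every rho, so the gene at the bottom of one ranking sits at the top of the other. That is why the direction has to be fixed before the ranking is read.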
The other 4 failures are closer to prompt-convention issues.
- bix-34-q2 depends on PhyKIT's median definition for an even number of pairwise distances. OmicOS gets 2.49 by the standard median; the reference answer is closer to 2.63, the biased middle value that some PhyKIT version takes.
- bix-45-q1 sits in the extreme tail of a Mann-Whitney U p-value. Both OmicOS and the reference are far below 1e-50; the scientific conclusion is the same, and the difference comes mainly from software-implementation details.
- bix-54-q7 asks to fit a natural spline (df=4) in R. OmicOS called R and also chose the Natural Spline with the lowest AIC; the difference comes from knot placement in splines::ns() across different R versions.
- bix-61-q5 provides FASTQ, BAM, and a ready-made VCF at once. OmicOS used the ready-made VCF and got Ts/Tv = 2.56; the reference answer appears to come from re-calling variants, giving 2.68.
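The median case is easy to reproduce in miniature. The sketch below uses Python's standard `statistics` module and four toy distances chosen only so the two definitions land near the values quoted above; they are not the benchmark's distance matrix, and the code does not reproduce PhyKIT's implementation. It simply shows how an even-length input gives two defensible "medians".

```python
from statistics import median, median_high

# Toy pairwise distances (even count), invented for illustration.
distances = [2.10, 2.35, 2.63, 2.90]

# Standard definition: mean of the two middle values.
standard = median(distances)

# "Upper middle" definition: take the larger of the two middle values.
upper_middle = median_high(distances)

print(f"standard median: {standard:.2f}")
print(f"upper middle:    {upper_middle:.2f}")
```

Neither number is a bug on its own. If the prompt does not say which definition applies, a string-matching grader will mark one of them wrong.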
Listing these questions isn't meant to excuse OmicOS. On the contrary, they show that an evaluation must be able to separate different kinds of failure.
The DepMap question should become a system fix: when a field like this appears, recognize the direction convention in the domain. The PhyKIT, R-version, and VCF-entry questions remind us to leave more context in the notebook: software versions, input-file choices, key parameters, and intermediate values should all be visible. Otherwise, even when the final number is right, it's hard for someone else to take over.
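One way to picture "leaving more context in the notebook" is a small per-step record. The sketch below is hypothetical: `StepRecord`, `record_step`, the field names, the tool name, and all values are assumptions made for illustration, not the OmicOS notebook format.

```python
import hashlib
import json
import platform
from dataclasses import asdict, dataclass, field


@dataclass
class StepRecord:
    """Hypothetical record of one analysis step."""
    name: str
    inputs: dict        # input label -> SHA-256 of the bytes that were read
    parameters: dict    # key parameters and input-file choices
    software: dict      # tool name -> version string
    intermediates: dict = field(default_factory=dict)


def record_step(name, inputs, parameters, software=None, intermediates=None):
    """Build a record, hashing inputs so a later reader can verify them."""
    versions = {"python": platform.python_version()}
    versions.update(software or {})
    return StepRecord(
        name=name,
        inputs={label: hashlib.sha256(blob).hexdigest()
                for label, blob in inputs.items()},
        parameters=dict(parameters),
        software=versions,
        intermediates=dict(intermediates or {}),
    )


# Toy example: which variant source was used, and the counts behind a ratio.
step = record_step(
    name="tstv_ratio",
    inputs={"variants.vcf": b"##fileformat=VCFv4.2\n"},
    parameters={"variant_source": "provided_vcf", "recalled_from_bam": False},
    software={"example_tool": "0.0.0"},  # placeholder name and version
    intermediates={"n_transitions": 12, "n_transversions": 5},  # toy counts
)
print(json.dumps(asdict(step), indent=2, sort_keys=True))
```

With a record like this, a reviewer can see at a glance that the ratio came from the provided VCF rather than from re-called variants, which is exactly the kind of choice that separated the two answers in bix-61-q5.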
Does the result hold after swapping the backend model?
BixBench-Verified-50 gives an externally comparable result, but one question remains unanswered: does OmicOS's performance depend on one particular backend model?
That question is very practical for research teams. The same analysis pipeline may need to switch among a cloud API, an existing lab quota, a hospital intranet service, and a local open-source model. If the system only fits one model, then when the deployment environment changes, the capability may disappear with it.
For this, OmicVerse built the omics evaluation set ovagent-bench v1.1. It contains 38 tasks covering scRNA preprocessing, scRNA workflow, spatial transcriptomics, bulk RNA-seq, velocity/trajectory, 16S microbiome analysis, and foundation-model embedding.
The setup is simple: fix the same agent loop and the same set of tasks, and swap only the backend model. What we want to see is whether every model's omics-analysis ability rises once it is connected to the OmicVerse ecosystem.
The results:
| Model | Provider | Open weights | Baseline | + OmicVerse ecosystem | Gain |
|---|---|---|---|---|---|
| qwen-3B-a3b (3B params, local deployment) | Alibaba | ✓ | 44.7% | 78.9% | +34.2 pp |
| glm-5.1 | Zhipu | — | 67.1% | 87.7% | +20.6 pp |
| gpt-5.5 | OpenAI | — | 71.9% | 91.2% | +19.3 pp |
| deepseek-v4-pro | DeepSeek | ✓ | 71.1% | 89.5% | +18.4 pp |
| gemini-3.1-flash-lite | Google | — | 62.7% | 79.0% | +16.2 pp |
| deepseek-v4-flash | DeepSeek | ✓ | 73.7% | 86.8% | +13.2 pp |
| MiniMax-M2.7 | MiniMax | — | 77.2% | 79.8% | +2.6 pp |
| Panel mean | | | 66.9% | 84.7% | +17.8 pp |
Here, baseline means the same model completing the task directly without the OmicVerse ecosystem. + OmicVerse ecosystem means the model enters the OmicOS task loop, where it can use tool interfaces, function descriptions, and execution constraints curated for omics analysis.
In this set of results, the most notable thing isn't any single model's top score, but that all 7 models improve. The panel mean goes from 66.9% to 84.7%, and the local 3B Qwen from 44.7% to 78.9%.
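The panel arithmetic can be checked directly from the table. The sketch below copies the rounded percentages as printed and recomputes the means and per-model gains. Gains recomputed this way can differ from the table by 0.1 pp for some rows, presumably because the published gains were computed from unrounded scores; that explanation is an assumption, not something the table states.

```python
# (baseline %, with OmicVerse ecosystem %) as printed in the table above.
panel = {
    "qwen-3B-a3b": (44.7, 78.9),
    "glm-5.1": (67.1, 87.7),
    "gpt-5.5": (71.9, 91.2),
    "deepseek-v4-pro": (71.1, 89.5),
    "gemini-3.1-flash-lite": (62.7, 79.0),
    "deepseek-v4-flash": (73.7, 86.8),
    "MiniMax-M2.7": (77.2, 79.8),
}

baseline_mean = sum(b for b, _ in panel.values()) / len(panel)
ecosystem_mean = sum(e for _, e in panel.values()) / len(panel)
gains = {model: round(e - b, 1) for model, (b, e) in panel.items()}
n_improved = sum(1 for g in gains.values() if g > 0)

print(f"baseline mean:  {baseline_mean:.1f}%")
print(f"ecosystem mean: {ecosystem_mean:.1f}%")
print(f"mean gain:      {ecosystem_mean - baseline_mean:.1f} pp")
print(f"models improved: {n_improved}/{len(panel)}")
```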
This doesn't mean small models are now equivalent to frontier models. More precisely, on a subset of omics tasks with clear structure and well-defined toolchains, the system layer can narrow the gap between models. The model is responsible for understanding the question and making judgments; the OmicVerse ecosystem is responsible for organizing long-tail Python functions, parameter constraints, data structures, and execution records into objects that are easier to call.
This is also a key point in the OmicOS design. It shouldn't serve only one backend. Research teams may switch models based on cost, compliance, intranet deployment, and reproducibility requirements. Evaluation has to check whether the system still works after such a switch.
On BixBench-Verified-50, OmicOS + deepseek-v4-pro and OmicOS + GPT-5.5 produced the same result. We'll keep publishing more backend comparisons.
What's useful to research teams isn't a single score
Put OmicOS back into a real research workflow, and 90% isn't the most important thing.
Omics analysis usually isn't a one-shot Q&A. Researchers first propose a hypothesis, then clean the data, check batches, choose models, generate figures, and go back to adjust parameters. If any step in the middle is lost, the final result is hard to review.
What OmicOS wants to solve is exactly this: let the agent leave a trail while completing the task. The trail should record what the input data is, which functions were called, which intermediate tables were generated, which judgments came from statistical results, and which came from literature or domain conventions.
This matters for team collaboration. When a new member takes over, they shouldn't have to guess what happened from a chat log; when results enter a group meeting, supplementary material, or an internal report, there's a path to look back through. The system can't make the final judgment for the researcher, but it can reduce duplicated work and lost evidence.
Still, be careful
This result can't be extrapolated to all biomedical tasks. BixBench-Verified-50 is a curated 50-question subset with relatively well-defined task forms; real projects have more dirty data, missing metadata, and experimental-design details.
ovagent-bench also mainly measures omics tool use and analysis workflow, not clinical diagnostic ability. When patient-derived data, clinical cohorts, or drug leads are involved, the system's output can only be part of organizing research evidence — it can't directly become a clinical-care recommendation.
A more reasonable way to use it is to let OmicOS generate traceable notebooks and candidate explanations, and then have the researcher check the data sources, statistical methods, and biological meaning.
Scoring criteria
This evaluation follows the original BixBench-Verified-50 scoring logic as closely as possible. Only when an answer is mathematically or semantically equivalent but expressed differently do we allow limited tolerance.
Specifically:
- Pure numerical answers allow a small margin of error. For example, DESeq2 / pydeseq2 may produce less than a 1% count difference on boundary genes.
- Percentage and fraction forms are treated as equivalent. For example, 10.0 and 0.1 can represent the same proportion.
- For range questions, we scan all numbers in the answer, not just the first. For example, in 30/41 (0.7317), the value that actually falls in the range is 0.7317.
- Numerical rounding shouldn't be treated as a difference in conclusion. For example, 3.54% and 3.5% represent results of the same magnitude.
Categories, gene symbols, and discrete labels still require exact matching, unless the reference answer itself lists acceptable expressions.
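The rules above can be sketched as a handful of small checks. This is an illustration under stated assumptions, not the grader used for the reported score: the function names are hypothetical, the 1% relative tolerance is an assumed value for "a small margin of error", and the range bounds in the usage lines are toy numbers.

```python
import math
import re
from decimal import ROUND_HALF_UP, Decimal

NUMBER = re.compile(r"[-+]?(?:\d+\.\d+|\d+|\.\d+)(?:[eE][-+]?\d+)?")


def number_tokens(text):
    """All numeric substrings in an answer, in order of appearance."""
    return NUMBER.findall(text)


def close_enough(value, reference, rel_tol=0.01):
    """Numeric tolerance, also accepting percent/fraction rescaling.

    rel_tol=0.01 is an assumed threshold. Rescaling by 100 can produce
    false positives, so a real grader would likely restrict it to
    questions that ask for a proportion.
    """
    return any(
        math.isclose(v, reference, rel_tol=rel_tol)
        for v in (value, value * 100, value / 100)
    )


def in_range(answer, low, high):
    """Range question: accept if any number in the answer lies in [low, high]."""
    return any(low <= float(tok) <= high for tok in number_tokens(answer))


def same_after_rounding(answer_tok, reference_tok):
    """Compare two numbers at the coarser of their stated precisions."""
    a, r = Decimal(answer_tok), Decimal(reference_tok)
    exponent = max(a.as_tuple().exponent, r.as_tuple().exponent)
    quantum = Decimal(1).scaleb(exponent)
    return (a.quantize(quantum, rounding=ROUND_HALF_UP)
            == r.quantize(quantum, rounding=ROUND_HALF_UP))


def label_match(answer, reference, accepted=()):
    """Discrete labels: exact match unless the reference lists alternatives."""
    return answer.strip() in {reference, *accepted}


print(close_enough(10.0, 0.1))                 # percent vs fraction
print(in_range("30/41 (0.7317)", 0.70, 0.75))  # toy bounds
print(same_after_rounding("3.54", "3.5"))      # rounding only
print(label_match("CCND1", "CDKN1A"))          # gene symbols stay strict
```

Keeping each rule in its own function makes it possible to report which rule accepted an answer, so a tolerance-based pass is never confused with an exact match.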
Notes
[A] Agent-Readable Function Registries for the Long Tail of Scientific Python. Companion preprint. Reports ovagent-bench v1.1: 38 questions across 7 omics layers (scRNA preprocessing, scRNA workflow, spatial transcriptomics, bulk RNA-seq, velocity/trajectory, 16S microbiome, foundation-model embedding). The 7-model panel's mean Pass@1 rises from 66.9% to 84.7% (+17.8 pp); 7/7 models improve, paired sign test p ≈ 0.008; the largest absolute gain comes from the open-source 3B Qwen (+34.2 pp).
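The sign-test figure in note [A] can be checked with an exact binomial calculation. The sketch below is a plain-Python sign test, with the function name chosen for this example. With 7 of 7 models improving, the one-sided value is 1/128 ≈ 0.0078, which matches the reported p ≈ 0.008. Reading the reported value as one-sided is an inference from that match, since the note does not state the sidedness.

```python
from math import comb


def sign_test_p(n_positive, n_total, two_sided=False):
    """Exact sign test under H0: each paired difference is + or - with p = 0.5."""
    upper = sum(comb(n_total, k) for k in range(n_positive, n_total + 1)) / 2 ** n_total
    if not two_sided:
        return upper
    lower = sum(comb(n_total, k) for k in range(0, n_positive + 1)) / 2 ** n_total
    return min(1.0, 2 * min(upper, lower))


one_sided = sign_test_p(7, 7)
two_sided = sign_test_p(7, 7, two_sided=True)
print(f"one-sided p = {one_sided:.4f}")
print(f"two-sided p = {two_sided:.4f}")
```

With only seven pairs, 1/128 is the smallest one-sided p-value a sign test can produce, so the test confirms the direction of the effect but says little about its size.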